20 Processing and Analyzing Quantitative Data
Udita Mitra
- Objective
This module deals with the issues involved in handling, managing and interpreting quantitative data collected in the course of research. It also discusses the basic statistical tools with the help of which we analyse social phenomena.
- Introduction
Quantitative research can be construed as a research strategy that emphasizes quantification in the collection and analysis of data. It entails a deductive approach to the relationship between theory and research in which the accent is placed on testing theories. Quantitative research usually incorporates the practices and norms of the natural scientific model, and of positivism in particular, and it embodies a view of social reality as an external, objective reality (Bryman 2004: 19). It is also preoccupied with measurement and involves collecting large amounts of data. These data may be collected in various ways, such as surveys and field research. After collection, the data have to be processed in order to ensure their proper analysis and interpretation. According to Kothari (2004), processing technically implies editing, coding, classification and tabulation of the collected data so that they are amenable to analysis. These steps help us to search for the patterns of relationship that exist among data groups (Ibid.: 122).
- Learning Outcome
This module will help you to understand different issues involved in processing and analysing quantitative data. It will also help you to grasp the essential steps of applying various statistical measures in order to interpret data collected through social research.
- Data Processing
Data reduction or processing mainly involves the steps necessary for preparing the data for analysis. These steps include editing, categorising the open-ended questions, coding, computerization and preparation of tables (Ahuja 2007: 304). The processing of data is an essential step before analysis because it enables us to detect and correct errors made at the stage of data collection.
4.1. Editing
According to Majumdar (2005), error can creep in at any stage of social research, especially at the stage of data collection. These errors have to be kept to a minimum to avoid errors in the results of the research. Editing, or checking the completed questionnaires for errors, is a laborious exercise and needs to be done meticulously. Interviewers tend to commit mistakes: some questions are missed out, and some answers remain unrecorded or are recorded in the wrong places. The questionnaires therefore need to be checked for completeness, accuracy and uniformity (Ibid.: 310).
4.2. Coding
Coding is the process of assigning numbers or other symbols to answers so that they can be categorized into specific classes. Such classes should be appropriate to the research problem under consideration (Kothari 2004: 123). Care should be taken not to leave any response uncoded. According to Majumdar (2005: 313), a set of categories is referred to as a “coding frame” or “code book”. The code book explains how to assign numerical codes to the response categories received in the questionnaire/schedule. It also indicates the location of a variable on computer cards. Ahuja (2007: 306) provides an example to illustrate how variables can be coded. In a question regarding the religion of the respondent, the answer categories Hindu, Muslim, Sikh and Christian can be coded as 1, 2, 3 and 4 respectively. In such cases, the counting of frequencies will proceed not by Hindus, Muslims and so on, but by 1, 2 and so on. Coding can be done manually or with the help of computers.
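Such a coding frame can also be applied programmatically. Below is a minimal sketch in Python; the coding frame follows the religion example above, while the list of raw responses and the code 9 used for uncoded answers are invented purely for illustration.

```python
# Minimal sketch: applying a coding frame to raw answers.
# The coding frame follows the religion example above; the response list
# and the "9" used for uncoded/missing answers are invented for illustration.

coding_frame = {"Hindu": 1, "Muslim": 2, "Sikh": 3, "Christian": 4}

raw_responses = ["Hindu", "Sikh", "Hindu", "Christian", "Muslim", "Hindu"]

# Use .get() so that any answer outside the frame still receives a code (here 9),
# ensuring no response is left uncoded.
coded = [coding_frame.get(answer, 9) for answer in raw_responses]

print(coded)  # [1, 3, 1, 4, 2, 1]
```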
4.3. Classification
Besides editing and coding of data, classification is another important method to process data. Classification has been defined as the process of arranging data into groups and classes on the basis of some common characteristics (Kothari 2004: 123). Classification can be of two types, namely
- Classification according to attributes or common characteristics like gender, literacy etc., and
- Classification according to class intervals whereby the entire range of data is divided into a number of classes or class intervals.
4.4. Tabulation
Tabulation is the process of summarising raw data and displaying the same in compact form for further analysis (Kothari 2004: 127). Tabulating raw data is necessary because:
- It conserves space and reduces explanatory and descriptive statements to a minimum, and
- It provides a basis for various statistical computations.
Tabulation can be done manually as well as with electronic and mechanical devices like computers. When the data are not large in number, tabulation can be done by hand with the help of tally marks.
Self-Check Exercise – 1
Question 1. Tabulate the following examination grades for 80 students.
72, 49, 81, 52, 31, 38, 81, 58, 68, 73, 43, 56, 45, 54, 40, 81, 60, 52, 52, 38, 79, 83, 63, 58, 59, 71, 89, 73, 77, 60, 65, 60, 69, 88, 75, 59, 52, 75, 70, 93, 90, 62, 91, 61, 53, 83, 32, 49, 39, 57, 39, 28, 67, 74, 61, 42, 39, 76, 68, 65, 58, 49, 72, 29, 70, 56, 48, 60, 36, 79, 72, 65, 40, 49, 37, 63, 72, 58, 62, 46 (Levin and Fox 2006).
Procedures for Tabulation/Grouping of Data
The above is an array of scores which otherwise would not be very handy to use. In order to make the data meaningful and useful it must be organized and classified into frequency tables. There are certain easy steps to be followed in order to convert the raw scores into frequency tables.
- We must first find the difference between the highest and the lowest score in the series. In the above case the difference is 65 (93-28). To it we must add 1 to bring in the entire range of scores. So it becomes 66.
- Next, we would have to assume the number of class intervals that would best summarise the entire range of scores. In this case we assume the number of intervals as 10.
- Now we divide the range of scores by the number of class intervals to obtain the width (denoted as i) of the class interval. Here it would be 66 ÷ 10 = 6.6, which is rounded to a convenient whole number; a width of 6 is used in this example.
- To the lowest score in the series we add (i − 1) to get the upper limit of the first class interval. In this case it is 28 + (6 − 1) = 33, so the first class interval is 28–33.
- The next class interval begins at the integer just above the upper limit of the previous one, and the same step is repeated. In this way we obtain all the class intervals and enter the frequencies in the respective class intervals (Elifson 1997). (A short code sketch following the answer table below illustrates the same grouping.)
Answer: The complete class interval of examination grades for 80 students is the following:
| Class Interval | Frequencies |
|---|---|
| 28–33 | 4 |
| 34–39 | 7 |
| 40–45 | 5 |
| 46–51 | 6 |
| 52–57 | 9 |
| 58–63 | 16 |
| 64–69 | 7 |
| 70–75 | 12 |
| 76–81 | 6 |
| 82–87 | 3 |
| 88–93 | 5 |
| N = 80 | |
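For larger data sets the same grouping can be done with a short program. The sketch below (Python, written only as an illustration) groups the 80 grades listed in the exercise into class intervals of width 6 beginning at the lowest score, and prints the frequency of each interval.

```python
# Minimal sketch: grouping raw scores into class intervals of width i = 6,
# starting from the lowest score (28), as in the procedure described above.

scores = [72, 49, 81, 52, 31, 38, 81, 58, 68, 73, 43, 56, 45, 54, 40, 81, 60, 52, 52, 38,
          79, 83, 63, 58, 59, 71, 89, 73, 77, 60, 65, 60, 69, 88, 75, 59, 52, 75, 70, 93,
          90, 62, 91, 61, 53, 83, 32, 49, 39, 57, 39, 28, 67, 74, 61, 42, 39, 76, 68, 65,
          58, 49, 72, 29, 70, 56, 48, 60, 36, 79, 72, 65, 40, 49, 37, 63, 72, 58, 62, 46]

width = 6
lowest = min(scores)                      # 28

frequencies = {}
for score in scores:
    k = (score - lowest) // width         # which interval the score falls into
    lower = lowest + k * width
    upper = lower + width - 1
    frequencies[(lower, upper)] = frequencies.get((lower, upper), 0) + 1

for (lower, upper), f in sorted(frequencies.items()):
    print(f"{lower}-{upper}: {f}")
print("N =", sum(frequencies.values()))   # 80
```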
- Data Analysis
The term ‘data analysis’ refers to the computation of certain indices or measures along with the search for patterns of relationship that exist among the data groups. Analysis, particularly in the case of survey or experimental (quantitative) data, involves estimating the values of unknown parameters of the population and testing hypotheses for drawing inferences (Kothari 2004: 130). Quantitative data analysis typically occurs at a late stage in the research process, but this does not mean that researchers should not be considering how they will analyse their data at the beginning of the research. While designing the questionnaire or observation schedule, researchers should be fully aware of the techniques of data analysis, because the kinds of data they will collect and the size of the sample have implications for the sorts of analysis that can be applied (Bryman 2004).
- 6. Statistics in Social Research
The task of analysing quantitative data in research is done by social statistics. Social statistics has two major areas of function in research, namely descriptive and inferential statistics. Descriptive statistics is concerned with organizing the raw data obtained in the process of research; tabulation and classification of data are instances of descriptive statistics. Inferential statistics is concerned with drawing inferences or conclusions from the data collected from the sample and making generalisations about the entire population (Elifson 1997). Inferential statistics is also known as sampling statistics and is concerned with two major types of problems:
- the estimation of population parameters, and
- the testing of statistical hypothesis (Kothari 2004: 131)
Some of the most important and useful statistical measures that would be taken up for discussion in the present module are:
- measures of central tendency or statistical averages
- measures of dispersion
- chi-square test
- t-test
- measures of relationship
From the next section we are going to take up each for discussion.
Self-Check Exercise – 2
- How does descriptive statistics work?
Descriptive statistics tries to describe and summarize the mass of data obtained in the process of conducting research, with the help of some specific measures. The very first step in organizing data is to arrange the raw scores into a number of categories, known as frequency tables. The next step is to represent the data through various graphs and figures, such as bar graphs, pie charts and frequency polygons.
- What is inferential statistics?
Inferential statistics deals with the task of drawing inferences about the population by studying a sample drawn from that population. The reasons for working from a sample can be many: insufficient resources in terms of money and manpower can force a researcher to draw a sample from the population, and the time available for a piece of research may also be too short to study the entire population. Statistics can be of great help in generalizing findings. It needs to be mentioned here that errors inevitably appear in the process of sampling, but researchers may adopt various methods to minimize them. The prefix ‘social’ is attached to statistics because of its application to the interpretation of social phenomena.
6.1. Measures of Central Tendency
When the scores have been tabulated into a frequency distribution, the next task is to calculate a measure of central tendency or central position. The measure of central tendency defines a value around which items have a tendency to cluster. The importance of the Measure of Central Tendency is twofold. First, it is an “average” which represents all the scores in a distribution and gives a precise picture of the entire distribution. Second, it enables us to compare two or more groups in terms of typical performance. Three “averages” or measures of central tendency are commonly used: Arithmetic Mean, Median and Mode (Garrett 1981: 27).
iii) Mode: When a rough and quick estimate of central tendency is wanted, the mode is usually the preferred measure. The mode is that value which has the greatest frequency in the given series of scores. Like the median, the mode is a positional average and is therefore unaffected by extreme scores in the series. It is useful in all situations where we want to eliminate the effect of extreme variations (Kothari 2004: 133).
a) Calculating the Mode from Ungrouped Data: In simple ungrouped data, the mode is the single score which occurs most frequently. For instance, in the series 10, 11, 11, 12, 12, 13, 13, 13, 14, 14, the crude mode is 13 (the score that occurs most often).
b) Calculating the Mode from Grouped Data: When the data are grouped into a frequency distribution, the crude mode is taken as the midpoint of the interval which contains the highest frequency. For example, in a distribution whose highest-frequency interval is 170–174, the crude mode would be 172, the midpoint of that interval (Garrett 1981). We can also calculate the true mode from a grouped frequency distribution. The formula for the true mode in a normal or symmetrical distribution is:
Mode = 3 × Median − 2 × Mean (ibid.).
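As a small illustration, the sketch below (Python) finds the crude mode of the ungrouped series used in (a) above and then applies the formula Mode = 3 × Median − 2 × Mean; the use of the standard statistics module is simply a convenience.

```python
# Minimal sketch: crude mode of an ungrouped series, and the "true" mode
# estimated by the relation Mode = 3 * Median - 2 * Mean (for a roughly symmetrical distribution).
from statistics import mean, median, mode

scores = [10, 11, 11, 12, 12, 13, 13, 13, 14, 14]   # the series from example (a) above

crude_mode = mode(scores)                            # 13, the most frequent score
true_mode = 3 * median(scores) - 2 * mean(scores)    # 3 * 12.5 - 2 * 12.3 = 12.9

print(crude_mode, true_mode)
```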
iv) When to Use the Various Measures of Central Tendency: The situations in which the three measures are used are stated below:
a) The Mean is used when
- The scores are distributed symmetrically around a central point
- The central tendency having the greatest stability is wanted
- Other statistics like standard deviation and correlation coefficient are to be computed later.
b) The Median is used when
- The exact midpoint of the distribution is all that is wanted
- There are extreme scores which affect the mean but they do not affect the median.
c) The Mode is used when
- A rough and quick estimate of central tendency is all that is wanted
- The measure of central tendency should be the most typical value (Garrett 1981).
The choice of average depends on the researcher and the objectives of the study. Only then will the statistical computation of averages be effective and useful in the interpretation of data.
6.2. Measures of Dispersion (Range, Interquartile Range, Mean Deviation or Average Deviation and Standard Deviation)
Measures of central tendency like the mean, median and mode can only be representative of the entire series of scores; they cannot fully describe the nature of a frequency distribution. For instance, they cannot state how far a given score in a series deviates from the average, in other words, how much lower or higher a score is than the average. Therefore, in order to measure this spread of scores around the central tendency, we calculate measures of dispersion or variability. The measures discussed here are the range, the mean deviation and the standard deviation.
i) Range: Range is the simplest and the easiest measure of variability. It is usually calculated by subtracting the lowest score from the highest score in the given series of data. The value of the range depends on only two values and this is its main limitation. It ignores the remaining values in the distribution and therefore it fails to provide an accurate and stable picture of the dispersed scores.
a) Range for Ungrouped Data: In a distribution of ungrouped scores, if the scores are arranged in an array, the range is defined as the largest score minus the smallest score plus one.
Range = (highest value of an item in the series) − (lowest value of an item in the series) + 1
In a distribution that has 103 as the highest score and 30 as the lowest score, the range is computed as (103 − 30) + 1 = 74 (Leonard 1996).
b) Range for Grouped Data: In the case of grouped data, the range is the difference between the upper true limit of the highest class interval and the lower true limit of the lowest class interval. Consider, for example, a grouped distribution whose highest class interval is 64–66 and whose lowest is 31–33.
In this case the upper true limit of the highest class interval is 66.5 and the lower true limit of the lowest class interval is 30.5. Therefore, the range is 66.5 − 30.5 = 36. Here, 1 is not added because the difference is taken between the two true limits (Leonard 1996). Please note that the range does not represent the entire series of scores, as its computation requires only the two extreme values.
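Both versions of the range are easy to compute directly; the short sketch below (Python) uses the two examples just given, the ungrouped scores with extremes 103 and 30 and the grouped distribution running from 31–33 to 64–66.

```python
# Minimal sketch of the two range computations described above.

# (a) Ungrouped data: largest score minus smallest score, plus one.
highest, lowest = 103, 30
range_ungrouped = (highest - lowest) + 1              # 74

# (b) Grouped data: upper true limit of the highest interval minus
#     lower true limit of the lowest interval; 1 is not added.
upper_true_limit = 66.5                               # true upper limit of 64-66
lower_true_limit = 30.5                               # true lower limit of 31-33
range_grouped = upper_true_limit - lower_true_limit   # 36.0

print(range_ungrouped, range_grouped)
```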
ii) Mean Deviation or Average Deviation: This is the average of the differences of the values of items from some average of the series (Kothari 2004: 135). It is based on the absolute deviations of scores from the centre (Leonard 1996). Absolute values are used because the algebraic sum of deviations from the mean is always zero, which would otherwise make it impossible to compute an index of variability.
a) Average Deviation for Ungrouped Scores: For ungrouped scores, the average deviation is obtained by summing the absolute deviations of the scores from their mean and dividing by the number of scores: AD = Σ|X − X̄| / N, where X̄ is the mean and N the number of scores.
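A minimal sketch of that computation is given below (Python); the series of scores is invented purely for illustration.

```python
# Minimal sketch: average (mean) deviation for ungrouped scores,
# AD = sum of |X - mean| over all scores, divided by N.
# The scores below are invented for illustration.

scores = [4, 7, 9, 10, 15]
n = len(scores)
x_bar = sum(scores) / n                                   # mean = 9.0

average_deviation = sum(abs(x - x_bar) for x in scores) / n
print(average_deviation)                                  # 2.8
```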
iv) When to Use the Various Measures of Variability: The rules for using the measures of dispersion are as follows:
a) Range can be used when
- the scores are scanty in number or are too dispersed
- knowledge of the extreme scores or of the total spread of scores is wanted.
b) Average Deviation can be computed when
- it is desirable to weigh all deviations from the mean according to their size
- extreme deviations would influence the S.D. unduly.
c) The S.D. (standard deviation) is to be used when
- the statistic having the greatest stability is wanted
- the coefficient of correlation and other statistics are subsequently to be computed (Garrett 1981). (A short computational sketch of the S.D. follows this list.)
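Since the rules above refer to the standard deviation (S.D.), a minimal computational sketch is given below (Python); the scores are invented for illustration and the population formula, dividing by N, is assumed.

```python
# Minimal sketch: standard deviation as the square root of the mean squared
# deviation from the mean (population formula, dividing by N).
# The scores below are invented for illustration.
from math import sqrt

scores = [4, 7, 9, 10, 15]
n = len(scores)
x_bar = sum(scores) / n                                   # 9.0

variance = sum((x - x_bar) ** 2 for x in scores) / n      # 13.2
standard_deviation = sqrt(variance)

print(round(standard_deviation, 2))                       # 3.63
```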
6.3. Chi-square Test
The chi-square test is one of the most important among the several tests of significance developed by statisticians. It is symbolically written as χ² and can be used to determine whether categorical data show dependency or whether two classifications are independent. It can also be used to make comparisons between theoretical populations and actual data when categories are used. The test is, in fact, a technique by the use of which researchers can test (a) goodness of fit and (b) the significance of association between two attributes (Kothari 2008).
a) Test of Goodness of Fit: As a test of goodness of fit, chi-square enables us to see how well a theoretical distribution fits the observed data. If the calculated value of χ² is less than its table value at a certain level of significance, the fit is considered to be a good one; if the calculated value of χ² is greater than the table value, the fit is not considered to be a good one (Kothari, op. cit.).
Illustrative Problem
Given below are the data on the number of students entering the University from each school.
School 1 – 22, School 2 – 25, School 3 – 26, School 4 – 28, School 5 – 33.
Is there a difference in the quality of the schools? N = 50
For the above data the most suitable statistical technique is the chi-square goodness-of-fit test, because the data are at the nominal level and the hypothesis is tested on one variable, that is, the quality of the schools as reflected in the number of students entering the University from each school.
The steps for calculating the chi-square are shown below.
- Stating the Null and the Alternative Hypothesis: The null hypothesis assumes that there is no difference in the quality of the schools, whereas the alternative hypothesis states that there is a difference in the quality of the schools.
- Choice of a Statistical Test: As stated above, the appropriate statistical test here is the chi-square goodness-of-fit test.
- Level of Significance and Sample Size: Here the level of significance is 0.05, that is, only 5 times in 100. The sample size is 50.
- One-tailed versus Two-tailed Test: It is a two-tailed test because no direction is indicated in the alternative hypothesis; it only suggests that there is a difference in the number of students entering the University from each school.
- The Sampling Distribution: The sampling distribution is a function of the degrees of freedom, which are quantities that are free to vary. Here they are computed as (k − 1), where ‘k’ is the number of categories into which the observations are divided. There are 5 categories, so the degrees of freedom (df) = 5 − 1 = 4.
- The Region of Rejection: The point of intersection of the df and the level of significance in the χ² table gives the critical value, which is 9.488. The computed value of χ² has to be greater than the table value for the null hypothesis to be rejected. It is computed by the formula χ² = Σ (O − E)² / E, where O is the observed frequency and E the expected frequency in each category.
Since the computed value of chi-square is 147.8, which is greater than its table value of 9.488, the null hypothesis is rejected and the alternative hypothesis, that there are differences in the quality of the schools, is upheld. This is reflected in the differing numbers of students entering the University from each school.
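The computation can be checked with a few lines of code. The sketch below (Python) assumes, as the stated N = 50 suggests, an expected frequency of 10 students per school under the null hypothesis of no difference between the schools; with that assumption it reproduces the chi-square value of 147.8.

```python
# Minimal sketch: chi-square goodness-of-fit for the school example.
# The expected frequency of 10 per school is an assumption derived from
# N = 50 divided equally among the five schools (null hypothesis of no difference).

observed = [22, 25, 26, 28, 33]       # students entering the University from each school
expected = [10] * 5                   # equal expectation under the null hypothesis

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)                     # 147.8

critical_value = 9.488                # table value for df = 4 at the 0.05 level
print(chi_square > critical_value)    # True: reject the null hypothesis
```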
b) Chi-square Test of Independence: As a test of independence, the chi-square test enables us to determine whether or not two attributes are associated. If the table value of χ² is greater than its computed value, we conclude that there is no association between the attributes, that is, the null hypothesis is upheld. But if the computed value of χ² is greater than its table value, we conclude that the two attributes are associated, and that the association is not due to chance but exists in reality (Kothari 2008). For the test of association, the formula for computing chi-square remains the same as above.
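A minimal sketch of the test of independence is given below (Python). The 2 × 2 table is hypothetical and serves only to illustrate the mechanics: the expected frequency of each cell is obtained as (row total × column total) ÷ N, and the same chi-square formula as above is then applied.

```python
# Minimal sketch: chi-square test of independence on a hypothetical 2 x 2 table.
# Expected cell frequencies are (row total * column total) / N.

observed = [[30, 20],     # attribute A present: category 1, category 2 (hypothetical counts)
            [10, 40]]     # attribute A absent

row_totals = [sum(row) for row in observed]            # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]      # [40, 60]
n = sum(row_totals)                                    # 100

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)      # degrees of freedom = 1
print(round(chi_square, 2), df)                        # 16.67 1 -- compare with the table value
```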
When the value of X (time spent in prison) is 2, the value of Ŷ (the degree of prisonization) would be
Ŷ = a + bX = 0.83 + 1.11(2) = 0.83 + 2.22 = 3.05.
In this way we can calculate the value of the dependent variable from the existing regression equation and infer exactly what amount of change in X will lead to what amount of change in Y.
To conclude, regression analysis is a statistical method for formulating a mathematical model depicting the relationship among variables, which can then be used to predict the values of the dependent variable, given the values of the independent variable (Kothari 2004: 142).
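A minimal sketch of using such a regression equation for prediction is given below (Python), with the intercept a = 0.83 and slope b = 1.11 taken from the prisonization example above.

```python
# Minimal sketch: prediction from the simple regression equation Y-hat = a + b*X,
# using the coefficients quoted in the prisonization example above.

a = 0.83   # intercept
b = 1.11   # slope: expected change in Y for a one-unit change in X

def predict(x):
    """Predicted degree of prisonization (Y-hat) for a given time spent in prison (X)."""
    return a + b * x

print(round(predict(2), 2))   # 3.05
print(round(predict(5), 2))   # 6.38
```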
iii) Contingency Tables: Contingency Tables are another way of explaining and interpreting relationship between variables. In the present module, we would be concerned only with the bivariate contingency tables where the focus of discussion would be on two variables – one an independent variable or the predictor variable (symbolized by X) and the other a dependent variable (symbolized by Y). Here we would discuss a relationship between marital status (X) and employment status (Y) of women. The hypothesis is that marital status exerts an influence on the employment status of women. The study has been carried out on 200 respondents (Elifson 1997). The data have been presented in the table below:
Marital Status (X)
| Employment Status (Y) | Never Married | Married | Divorced | Widowed | Total |
|---|---|---|---|---|---|
| Employed | 21 | 60 | 11 | 6 | 98 |
| Not-Employed | 14 | 65 | 4 | 19 | 102 |
| Total | 35 | 125 | 15 | 25 | N = 200 |
Contingency tables can be interpreted by percentaging it in three ways as follows.
Percentaging Down: This is one of the most common ways of calculating percentages. Here the column marginals (35, 125, 15 and 25) are taken as the base on which the percentages are calculated. Percentaging down is also referred to as percentaging on the independent variable when it is the column variable. Percentaging down allows us to determine the effect of the independent variable by comparing the percentages across a row, that is, by comparing people in different categories of the independent variable (Elifson 1997: 172). The method is shown below:
Marital Status (X)
| Employment Status (Y) | Never Married | Married | Divorced | Widowed |
|---|---|---|---|---|
| Employed | 60% | 48% | 73.3% | 24% |
| Not-Employed | 40% | 52% | 26.7% | 76% |
| Total | 100% | 100% | 100% | 100% |
Interpreting from the above table, we say that 60% (21/35 × 100) of the never-married respondents are employed, 48% (60/125 × 100) of the married respondents are employed, 73.3% (11/15 × 100) of the divorced respondents are employed and 24% (6/25 × 100) of the widowed respondents are employed. If we interpret the table in this way, we get a logical relationship between the marital status and the employment status of women.
Percentaging Across: When percentaging across, we take the row marginals (98 and 102) as the base for calculating the percentages. Here we percentage across and compare up and down the columns. An advantage of doing this is that a profile of the employed versus the not-employed can be established in terms of their marital status (Elifson 1997: 172). This is shown in the table below:
Marital Status (X)
| Employment Status (Y) | Never Married | Married | Divorced | Widowed | Total |
|---|---|---|---|---|---|
| Employed | 21.4% | 61.2% | 11.2% | 6.1% | 99.9% |
| Not-Employed | 13.7% | 63.7% | 3.9% | 18.6% | 99.9% |
From the above table we can say that 21.4% (21/98 × 100) of the employed respondents have never married, while 13.7% (14/102 × 100) of the not-employed respondents have never married. Moreover, 61.2% of the employed respondents are married, whereas 63.7% of the not-employed respondents are married; 11.2% of the employed respondents are divorced, against 3.9% of the not-employed respondents; and 6.1% of the employed respondents are widowed, against 18.6% of the not-employed respondents. In the above table, the totals do not come to exactly 100% because of rounding (Elifson 1997).
Percentaging on the total number of cases: This is another method of interpreting bivariate contingency tables. Here the percentages are calculated on the total number of cases (N). The following table shows this:
Marital Status (X)
| Employment Status (Y) | Never Married | Married | Divorced | Widowed |
|---|---|---|---|---|
| Employed | 10.5% | 30% | 5.5% | 3% |
| Not-Employed | 7% | 32.5% | 2% | 9.5% |

The eight cell percentages together total 100% (N = 200).
From the above table we infer that 10.5% (21/200 × 100) of the respondents have never married and are employed, whereas 7% (14/200 × 100) of the respondents have never married and are not employed. Like the second method (percentaging across), this way of percentaging does not allow us to see the influence of the independent variable on the dependent one and is rarely used, though it is appropriate in certain instances (Elifson 1997: 172).
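All three ways of percentaging can be reproduced with a few lines of code. The sketch below (Python) works from the marital status by employment status table given earlier and prints, for each row, the column percentages (percentaging down), the row percentages (percentaging across) and the percentages on the total number of cases.

```python
# Minimal sketch: the three ways of percentaging the contingency table above
# (down = column percentages, across = row percentages, total = on N).

columns = ["Never Married", "Married", "Divorced", "Widowed"]
table = {"Employed":     [21, 60, 11, 6],
         "Not-Employed": [14, 65, 4, 19]}

col_totals = [sum(vals) for vals in zip(*table.values())]       # [35, 125, 15, 25]
row_totals = {row: sum(vals) for row, vals in table.items()}    # Employed: 98, Not-Employed: 102
n = sum(row_totals.values())                                    # 200

print("Categories:", columns)
for row, vals in table.items():
    down   = [round(v / c * 100, 1) for v, c in zip(vals, col_totals)]   # percentaging down
    across = [round(v / row_totals[row] * 100, 1) for v in vals]         # percentaging across
    total  = [round(v / n * 100, 1) for v in vals]                       # on the total (N)
    print(row, down, across, total)
```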
Self-Check Exercise – 3
- What is measurement?
Measurement is the assignment of numbers to objects or events according to some predetermined (or arbitrary) rules. The different levels of measurement represent different levels of numerical information contained in a set of observations.
- What are the levels of measurement that are used by the social scientists?
There are four levels of measurement namely – nominal, ordinal, interval and ratio. The characteristics of each will decide the kind of statistical application we can use.
- The nominal level does not involve highly complex measurement but rather involves rules for placing individuals or objects into categories.
- The ordinal scales possess all the characteristics of the nominal and in addition the categories represent a rank-ordered series of relationships like poorer, healthier, greater than etc.
- The interval and ratio scales are the highest levels of measurement in science and employ numbers. The numerical values associated with these scales permit the use of mathematical operations such as adding, subtracting, multiplying and dividing. The only difference between the two is that the ratio level has a true zero point, which the interval level does not have. With both these levels we can state the exact differences between categories (Elifson 1997).
- Limitations of Statistics in Sociology
Statistics plays a role in sociology, especially in applied sociology. A debate has been going on since the middle of the twentieth century between researchers committed to the use of quantitative methods and computer applications and those who believe in a qualitative approach to sociology. The latter group argues that statistics, if its importance is overemphasized, will become a substitute for sociology. They argue that it is not always appropriate to conduct research with quantitative variables that can be handled by statistical analysis. The decision to apply statistics to a piece of research depends on factors like the nature of the problem, the subjects of study and the availability of previously collected data, to name a few (Weinstein 2011). Researchers nowadays increasingly depend on mixed methods, which in general combine qualitative and quantitative techniques so that the weaknesses of each are cancelled out; triangulation is a particular application of mixed methods (Guthrie 2010). One way in which a qualitative research approach is introduced into quantitative research is through ethnostatistics, which involves the study of the construction, interpretation and display of statistics in quantitative social research. The idea of ethnostatistics can be applied in many ways, but one predominant way is to treat statistics as rhetoric; more specifically, this implies examining the language used in persuading audiences about the validity of the research (Bryman 2004: 446). To conclude, statistics is a necessary tool for effective research but can never be a substitute for sociological reasoning. It can give the data precision and make them manageable and presentable (Weinstein 2011).
- Summary
The present module has examined the processes and methods of handling quantitative data, that is, data that can be reduced to numbers. This stage comes when the researcher is through with the process of data collection. The data are first processed through the methods of editing, coding, classification and tabulation, which reduce the data to manageable proportions and make them ready for analysis. After the data are processed, different statistical methods, such as measures of central tendency, measures of dispersion, the chi-square test, the t-test, correlation coefficients, simple regression and contingency tables, are used to interpret them. The choice of statistical application depends on the nature of the research and the level of measurement of the data. But it has to be remembered that statistical analysis is only a helping tool of research; it can never be a substitute for the efforts of the researcher and the quality of the data collected. A combination of quantitative and qualitative methods of analysis is essential for the interpretation of data in social research.
- References
- Ahuja, Ram. Research Methods. Jaipur: Rawat Publications, 2007.
- Bryman, Alan. Social Research Methods. New York: Oxford University Press, 2004.
- Elifson, Kirk W., Richard P. Runyon and Audrey Haber. Fundamentals of Social Statistics. United States: McGraw-Hill, 1997.
- Garrett, Henry E. Statistics in Psychology and Education. New York: David McKay Company, Inc., 1981.
- Guthrie, Gerard. Basic Research Methods: An Entry to Social Science Research. New Delhi: Sage Publications India Private Limited, 2010.
- Kothari, C. R. Research Methodology: Methods and Techniques. New Delhi: New Age International (P) Limited, Publishers, 2008.
- Leonard, Wilbert Marcellus. Basic Social Statistics. Illinois: Stipes Publishing L.L.C., 1996.
- Levin, Jack and James Alan Fox. Elementary Statistics in Social Research. New Delhi: Dorling Kindersley (India) Pvt. Ltd., 2006.
- Majumdar, P. K. Research Methods in Social Science. New Delhi: Viva Books Private Limited, 2005.
- Morrison, Ken. Marx, Durkheim, Weber. London: Sage Publications, 1995.
- Vito, Gennaro and Edward Latessa. Statistical Applications in Criminal Justice. London: Sage Publications, 1989.
- Weinstein, Jay Alan. Applying Social Statistics. United Kingdom: Rowman and Littlefield Publishers Inc., 2011.