22 Explanation of Quantitative Data

Sudarshana Sen

  1. Introduction

 

Parametric and non-parametric are two broad classifications of statistical procedures. The discipline of statistics plays an important role in the social sciences because it is usually impossible to collect data from all individuals of interest (the population) in a given space and time. The only solution available to the social researcher is to collect data from a subset (a sample) of the individuals of interest, even though the real objective of the research is to know the “truth” about the population.

 

Quantities such as means, standard deviations and proportions are all meaningful values and are generally called “parameters” when social scientists are talking about a population. Since researchers usually cannot get data from the whole population, they cannot know the true values of the parameters for that population. Researchers can, however, calculate estimates of these quantities from their sample. When calculated from sample data, these quantities are called “statistics.”

 

When applying a statistical method, it is common to differentiate between quantitative and qualitative features and variables. Nominal and ordinal variables and data are usually considered qualitative (attributive), while interval and ratio variables are considered quantitative (Ferguson 1966, Krneta 1987). It is also common to apply non-parametric statistical methods to nominal and ordinal data, while parametric methods are used for interval and ratio data (Ferguson 1966).

 

A statistic estimates a parameter. Parametric statistical procedures rely on assumptions about the shape of the distribution in the underlying population (typically, that it is a normal distribution) and about the form or parameters (i.e., means and standard deviations) of that distribution. These tests therefore generally assume normally distributed data, an interval level of measurement, and equality of variances. Inferential procedures such as the t-test, the z-test and ANOVA are the typical parametric tools. Parametric distributions are used as arguments to higher-level functions that compute probabilities, expectations, random variates, or parameter estimates from data; distributions with undetermined parameters can be used throughout, and the parameters can later be solved for or optimized over. Parametric statistics is thus a branch of statistics which assumes that the data have come from a particular type of probability distribution and makes inferences about the parameters of that distribution (Geisser and Johnson 2006).
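
As an illustration, a parametric comparison of two group means can be carried out with an independent-samples t-test. The following is only a minimal sketch: the data are hypothetical, and the use of the SciPy library is an assumption rather than something prescribed by the procedures discussed here.

from scipy.stats import ttest_ind

# Hypothetical interval-level scores for two independent groups.
group_a = [42, 45, 39, 50, 47, 44]
group_b = [38, 41, 36, 40, 43, 37]

# Classical t-test: assumes normality and equal variances in the two populations.
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=True)
print(t_stat, p_value)  # reject the null of equal means if p_value < alpha (e.g. .05)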

 

There is hardly a simple answer to the question of whether a given association between two variables is significant, that is, whether the association between those two variables is strong, important and worth reporting. It is in this context that parametric statistics assumes significance. As the name suggests, parametric statistics are those that make certain assumptions about the parameters describing the population from which the sample is selected. Here, the term “significance” does not imply “importance” in the general sense of the expression; it refers only to the likelihood that a relationship observed in a sample could be attributed to sampling error alone (Babbie 2013: 470).

 

Non-parametric statistical procedures rely on few or no assumptions about the shape or parameters of the population distribution from which the sample was drawn. Non-parametric statistics does not assume normality in the data or any particular functional form for the distribution of the parent population from which the sample comes. It is significant to note that, given this flexibility, a larger number of social conditions are amenable to non-parametric statistical treatment. Hence, in the sections that follow, we will find various types of non-parametric treatment of behavioural and social situations.

 

  2. Learning Outcome

 

This module will help the reader understand the use of these two families of statistical methods in sociological research. It introduces the basic concepts and strategies of non-parametric and parametric methods, and discusses their scope and limitations.

 

  3. Utility of Statistics in Social Research

 

Sociologists seek the help of statistical tools to study cultural change in society, family patterns and industrial systems, to name a few. They also study statistically the relationships between income and education, and between occupational mobility and migration. Thus, statistics is of immense use in various sociological studies. In fact, research in most social science disciplines requires a large amount of quantitative data, and handling, managing and interpreting the quantitative data collected in the course of research very often necessitates statistical work. Hence, there is a strong case for arguing that sociology needs statistics.

 

Statistics and statistical methods have highly significant applications in sociology, and their functions are numerous. The methods of descriptive statistics are used to describe observed phenomena; inferential statistics is used for inductive reasoning about the unknown properties of a larger group on the basis of known sample indicators; hypothesis testing examines the effects of one, two or more factors and allows conclusions to be drawn about the research problem by accepting or rejecting an initial hypothesis; and regression and correlation analysis, in the simplest case, examines the influence of and dependence between two variables. If the relationship among a greater number of variables is examined, it is called multiple regression and correlation.

 

Yet there are many limitations to the use of statistics in social science. To begin with, statistical laws are true only on average; statistics are aggregates of facts, so a single observation is not a statistic, and statistics deals only with groups and aggregates. Second, statistical methods are best applied to quantitative data. Third, statistics cannot be applied to heterogeneous data. Fourth, if sufficient care is not exercised in collecting, analyzing and interpreting the data, statistical results might be misleading. Finally, some errors are possible in statistical decisions; inferential statistics in particular involves certain errors, and we do not always know whether an error has been committed or not.

 

 

  4. Some Important Non-parametric Tests

 

i) The Kolmogorov-Smirnov (K-S) Test

 

The Kolmogorov-Smirnov (K-S) test compares an observed (sample) distribution with a fully specified theoretical distribution. It has some well-known limitations:

  1. It only applies to continuous distributions.
  2. It tends to be more sensitive near the centre of the distribution than at the tails.
  3. Perhaps the most serious limitation is that the distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid; it typically must be determined by simulation.

 

ii) Chi-Square Test

 

An important non-parametric test often used in sociological analysis is the chi-square test. Application of the chi-square distribution and the chi-square test is important in cases with multiple qualitative variables for which it is known or assumed that the variables are interrelated. The chi-square test is quite a common test, based on summing, across the cells of a table, the squared difference between the observed and expected frequencies divided by the expected frequency.
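
In symbols, this is the familiar chi-square statistic, where O denotes an observed cell frequency and E the corresponding expected frequency:

χ² = Σ (O − E)² / E

The expected frequency for a cell is the product of its row total and column total divided by the grand total.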

 

The primary use of the chi-square test is to examine whether two variables are independent or not. What does it mean to be independent, in this sense? It means that the two factors are not related. Typically in social science research, we are interested in finding factors that are dependent upon each other—education and income, occupation and prestige, age and voting behaviour. By ruling out independence of the two variables, the chi-square can be used to assess whether two variables are, in fact, dependent or not. More generally, we say that one variable is “not correlated with” or “independent of” the other if an increase in one variable is not associated with an increase in another. If two variables are correlated, their values tend to move together, either in the same or in the opposite direction. Chi-square examines a special kind of correlation: that between two nominal variables.

 

In the following example, we’ll use a chi-square test to determine whether there is a relationship between gender and getting in trouble at school (both nominal variables). Below is the table documenting the raw scores of boys and girls and their respective behaviour issues (or lack thereof):

 

Gender and Getting in Trouble at School

            Got in Trouble    Did Not Get in Trouble    Total
Boys              46                    71               117
Girls             37                    83               120
Total             83                   154               237

 

To examine statistically whether boys got in trouble at school more often, we need to frame the question in terms of hypotheses. The null hypothesis is that the two variables are independent (i.e. no relationship or correlation) and the research hypothesis is that the two variables are related. In this case, the specific hypotheses are:

 

H0: Gender and getting in trouble at school are independent.

H1: Gender and getting in trouble at school are related.

 

For each cell we compute the expected frequency (the row total multiplied by the column total, divided by the grand total) and then apply the chi-square formula given earlier; for this table the obtained chi-square statistic is approximately 1.87. But before we can come to a conclusion, we need to find our critical statistic, which entails finding our degrees of freedom. In this case, the number of degrees of freedom is equal to the number of rows in the table minus one multiplied by the number of columns minus one, or (r-1)(c-1).

 

In our case, we have (2-1)(2-1), or one degree of freedom.

 

We also need to reference our alpha level, which we set at .05. From a table of critical chi-square values, the critical value for an alpha level of 0.05 and one degree of freedom is 3.841, which is larger than our obtained statistic of 1.87. Because the critical value is greater than our obtained statistic, we cannot reject the null hypothesis: these data do not provide evidence of a relationship between gender and getting in trouble at school.
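
The same result can be reproduced with a few lines of code. The following is a minimal sketch, assuming the SciPy library is available; the library choice and the script itself are illustrative and not part of the original example.

from scipy.stats import chi2_contingency

# Observed frequencies from the table above: rows = boys, girls; columns = got in trouble, did not.
observed = [[46, 71],
            [37, 83]]

# correction=False switches off Yates' continuity correction so that the plain
# Pearson chi-square statistic is returned for this 2x2 table.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print(round(chi2, 2))  # about 1.87, matching the hand calculation
print(dof)             # 1 degree of freedom, i.e. (2-1)*(2-1)
print(p > 0.05)        # True: we cannot reject the null hypothesis at alpha = .05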

 

iii) The Two Sample Rank-Sum Test

 

The two-sample rank-sum test is a non-parametric alternative to the two-sample t-test; it is based solely on the order in which the observations from the two samples fall.

 

The logic underlying the two-sample rank-sum test is straightforward. The data consist of two independent samples drawn from identically distributed populations. Let x1, x2, . . . , xn denote the first random sample, of size n, and let y1, y2, . . . , ym denote the second random sample, of size m. Assign the ranks 1 to n + m to the combined observations, from smallest to largest and without regard to sample membership, and let Rk denote the rank assigned to the k-th observation, for k = 1, . . . , n + m. Let Tx and Ty denote the sums of the ranks in the first and second samples, respectively, and let T = Tx. Note that Tx + Ty = (n + m)(n + m + 1)/2. The null hypothesis states that each possible arrangement of the n + m observations into two samples, with n values in the first sample and m values in the second, occurs with equal probability. The exact lower (upper) one-sided probability value of an observed value T0 of T is the proportion of all possible values of T that are less (greater) than or equal to T0.
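
In practice the test is rarely computed by hand. The sketch below is a minimal illustration with hypothetical data, assuming SciPy's Mann-Whitney U implementation, which is equivalent to the two-sample rank-sum test.

from scipy.stats import mannwhitneyu

# Two hypothetical independent samples (e.g. scores for two groups of respondents).
sample_x = [12, 15, 9, 20, 17]      # n = 5
sample_y = [8, 11, 13, 7, 10, 14]   # m = 6

# Two-sided test of the null hypothesis that both samples come from the same population.
stat, p_value = mannwhitneyu(sample_x, sample_y, alternative="two-sided")
print(stat, p_value)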

 

iv) The Kruskal-Wallis Test

 

This test was developed jointly by Kruskal and Wallis (1952) and is named after them. The Kruskal-Wallis test is a non-parametric (distribution-free) test used when the assumptions of ANOVA are not met. Both tests assess whether there are significant differences on a continuous dependent variable across a grouping independent variable (with three or more groups). ANOVA assumes that the distribution of each group is normal and that the variances of the scores are approximately equal across groups; the Kruskal-Wallis test makes neither assumption. Like all non-parametric tests, however, the Kruskal-Wallis test is not as powerful as ANOVA. The following account has been adapted from www.statisticssolutions.com.

 

Null hypothesis: the samples come from identical populations.

Alternative hypothesis: the samples come from different populations.

 

Questions like the following are answered:

 

How do test scores differ between the different grade levels in elementary school? Do marketing scores differ between the different grade levels in elementary school?

 

Procedure:

  1. Arrange the data of all samples in a single series in ascending order.
  2. Assign ranks to them in ascending order. In the case of a repeated value, or a tie, assign ranks by averaging their rank positions.
  3. Then sum the ranks (R1, R2, R3, …) separately for each group.
  4. Calculate the test statistic H using the standard formula below.

 

H = [12 / (N(N + 1))] × Σ (Ti² / ni) − 3(N + 1), where the sum runs over the k groups

 

where

 

H = Kruskal-Wallis test statistic

N = total number of observations in all samples

Ti = sum of the ranks assigned to the i-th group

ni = number of observations in the i-th group

k = number of groups

 

The Kruskal-Wallis test statistic approximately follows a chi-square distribution with k-1 degrees of freedom, provided that each ni is greater than 5. If the calculated value of the Kruskal-Wallis statistic is less than the critical chi-square value, the null hypothesis cannot be rejected. If the calculated value is greater than the critical chi-square value, we can reject the null hypothesis and conclude that at least one sample comes from a different population.
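
A minimal sketch of the test with hypothetical data, again assuming SciPy (an illustrative choice, not part of the source description):

from scipy.stats import kruskal

# Hypothetical test scores for three independent groups (each n > 5, as recommended above).
grade_3 = [65, 70, 72, 80, 68, 75]
grade_4 = [60, 62, 71, 66, 64, 69]
grade_5 = [78, 82, 85, 79, 88, 81]

# H statistic and p-value; if p < alpha (e.g. .05) we reject the null hypothesis
# that the samples come from identical populations.
h_stat, p_value = kruskal(grade_3, grade_4, grade_5)
print(h_stat, p_value)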

 

Assumptions:

  1. We assume that the samples drawn from the population are random.
  2. We also assume that the cases of each group are independent.
  3. The measurement scale should be at least ordinal.
v) Spearman’s Rank Correlation

 

The following description is adapted from www.statisticssolutions.com. The Spearman correlation coefficient is also referred to as the Spearman rank correlation or Spearman’s rho. It is typically denoted either by the Greek letter rho (ρ) or by rs. It is one of the few cases in which a Greek letter denotes a value calculated from a sample rather than a characteristic of the general population. Like all correlation coefficients, Spearman’s rho measures the strength of association between two variables. As such, the Spearman correlation coefficient is a close sibling of Pearson’s bivariate correlation coefficient, the point-biserial correlation, and canonical correlation.

 

All correlation analyses express the strength of linkage or co-occurrence between two variables in a single value between -1 and +1, called the correlation coefficient. A positive correlation coefficient indicates a positive relationship between the two variables (the larger A, the larger B), while a negative correlation coefficient expresses a negative relationship (the larger A, the smaller B). A correlation coefficient of 0 indicates that no relationship exists between the variables. However, correlations capture only linear relationships: even if the correlation coefficient is zero, a non-linear relationship might exist.

 

Compared to Pearson’s bivariate correlation coefficient, the Spearman correlation does not require continuous-level data (interval or ratio), because it uses ranks instead of assumptions about the distributions of the two variables. This allows us to analyze the association between variables measured at the ordinal level. Moreover, the Spearman correlation is a non-parametric test and does not assume that the variables approximate a multivariate normal distribution. Spearman correlation analysis can therefore be used in many cases where the assumptions of Pearson’s bivariate correlation (continuous-level variables, linearity, and multivariate normal distribution of the variables when testing for significance) are not met.

 

Typical questions the Spearman Correlation Analysis answers are as follows:

  • Sociology: Do people with a higher level of education have a stronger opinion of whether or not tax reforms are needed?
  • Medicine: Does the number of symptoms a patient has indicate a higher severity of illness?
  • Biology: Is mating choice influenced by body size in bird species A?
  • Business: Are consumers more satisfied with products that are higher ranked in quality?

 

Theoretically, the Spearman correlation is the Pearson correlation calculated on variables that have been converted to ranks. Like Pearson’s bivariate correlation, the Spearman correlation tests the null hypothesis of independence between two variables; however, this can lead to difficult interpretations. Kendall’s tau-b rank correlation improves on this by reflecting the strength of the dependence between the variables being compared.

 

Since both variables need to be of ordinal scale or ranked data, Spearman’s correlation requires converting interval or ratio scales into ranks before it can be calculated. Mathematically, the Spearman and Pearson correlations are very similar in the way they use difference measurements to calculate the strength of association: the Pearson correlation uses standard deviations, while the Spearman correlation uses differences in ranks. This, however, creates an issue for the Spearman correlation when tied ranks exist in the sample. An example is a marathon in which two silver medals but no bronze medal are awarded. A statistician is even crueller to these runners, because a rank is defined as the average position in the ascending order of values: for a statistician, that marathon result would have one first place, two places with a rank of 2.5, and the next runner in rank 4. If tied ranks occur, a more complicated formula has to be used to calculate rho, but SPSS automatically and correctly handles tied ranks.
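
A minimal sketch with hypothetical ordinal data, assuming SciPy, which (like SPSS) handles tied ranks automatically:

from scipy.stats import spearmanr

# Hypothetical ranked data: level of education and strength of opinion on tax reform.
education = [1, 2, 2, 3, 4, 5, 5, 6]   # note the tied ranks
opinion   = [2, 1, 3, 3, 5, 4, 6, 6]

rho, p_value = spearmanr(education, opinion)
print(round(rho, 3), p_value)  # rho close to +1 indicates a strong positive association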

 

  5. Why don’t we always use Non-parametric tests?

 

Although non-parametric tests have the very desirable property of making fewer assumptions about the distribution of measurements in the population from which we drew our sample, they have two main drawbacks. The first is that they generally are less statistically powerful than the analogous parametric procedure when the data truly are approximately normal. “Less powerful” means that there is a smaller probability that the procedure will tell us that two variables are associated with each other when in fact they are truly associated. If you are planning a study and trying to determine how many patients to include, a non-parametric test will require a slightly larger sample size to have the same power as the corresponding parametric test. The second drawback associated with non-parametric tests is that their results are often less easy to interpret than the results of parametric tests. Many non-parametric tests use rankings of the values in the data rather than using the actual data. Knowing that the difference in mean ranks between two groups is five does not really help our intuitive understanding of the data. On the other hand, knowing that the mean systolic blood pressure of patients taking the new drug was five mmHg lower than the mean systolic blood pressure of patients on the standard treatment is both intuitive and useful. In short, non-parametric procedures are useful in many cases and necessary in some, but they are not a perfect solution.

 

 

Self Check Exercise 2

 

Q: What is the major advantage of non-parametric statistics?

 

It makes fewer assumptions about the distribution of measurement in the population.

 

Q: What type of measurement level is used in Spearman’s correlation?

 

It uses the ordinal level of measurement in most cases.

Q: Which test is carried out when the preconditions for ANOVA are not met?

 

The Kruskal-Wallis test, since it does not require ANOVA’s assumptions of normality and equal variances.

 

  6. References

 

  • Bagdonavičius, Vilijandas, Kruopis, Julius and Nikulin, Mikhail S. Nonparametric Tests for Complete Data. New York: ISTE & John Wiley & Sons, 2011.
  • Chakravarti, I.M., Laha, R.G. and Roy, J. Handbook of Methods of Applied Statistics, Volume I. John Wiley and Sons, 1967.
  • Conover, W.J. Practical Nonparametric Statistics. New York: John Wiley, 1971.
  • Fisher, R.A. Contributions to Mathematical Statistics. New York: John Wiley, 1950.
  • Gibbons, J.D. and Chakraborti, S. Nonparametric Statistical Inference. Boca Raton, FL: CRC Press, 2009.
  • Govindarajulu, Z. Nonparametric Inference. Singapore: World Scientific, 2007.
  • Maritz, J.S. Distribution-Free Statistical Methods. New York: Chapman & Hall/CRC, 1995.
  • Rao, C.R. Linear Statistical Inference and its Applications. New York: John Wiley & Sons, 2002.
  • van der Vaart, A.W. Asymptotic Statistics. Cambridge: Cambridge University Press, 2000.
  • Website: www.statisticssolutions.com