DATA ANALYSIS
"It takes an unusual mind to analyze the obvious" (Alfred
North Whitehead)
The terms "statistical analysis" and "data analysis" can be said to mean the same thing -- the study of how we describe, combine, and make inferences based on numbers. A lot of people are scared of numbers (quantiphobia), but data analysis with statistics has got less to do with numbers, and more to do with rules for arranging them. It even lets you create some of those rules yourself, so instead of looking at it like a lot of memorization, it's best to see it as an extension of the research mentality, something researchers do anyway (i.e., play with or crunch numbers). Once you realize that YOU have complete and total power over how you want to arrange numbers, your fear of them will disappear. It helps, of course, if you know some basic algebra and arithmetic, at a level where you might be comfortable solving the following equation (answer at bottom of page):
x - 3/x - 1 = x - 4/x - 5
Without statistics, all you're doing is making educated guesses. In social science, that may seem like all that's necessary, since we're studying the obvious anyway. However, there's a difference between something socially, or meaningfully significant, and something statistically significant. Statistical significance is first of all, short and simple. You communicate as much with just one number as a paragraph of description. Some people don't like statistics because of this reductionism, but it's become the settled way researchers communicate with one another. Secondly, statistical significance is what policy and decision making is based on. Policymakers will dismiss anything nonstatistical as anecdotal evidence. Anecdotal means interesting and amusing, but hardly serious enough to be published or promulgated. Finally, just because something is statistically significant doesn't make it true. It's better than guessing, but you can lie and deceive with statistics. Since they can mislead you, there's no substitute for knowing something about the topic so that, as is the most common interpretative approach, the researcher is able to say what is both meaningful and statistically significant.
There are three (3) general areas that make up the field of statistics: descriptive statistics, relational statistics, and inferential statistics.
1. Descriptive statistics fall into one of two categories: measures of central tendency (mean, median, and mode) or measures of dispersion (standard deviation and variance). Their purpose is to explore hunches that may have come up during the course of the research process, but most people compute them to look at the normality of their numbers. Examples include descriptive analysis of sex, age, race, social class, and so forth.
2. Relational statistics fall into one of three categories: univariate, bivariate, and multivariate analysis. Univariate analysis is the study of one variable for a subpopulation, for example, age of murderers, and the analysis is often descriptive, but you'd be surprised how many advanced statistics can be computed using just one variable. Bivariate analysis is the study of a relationship between two variables, for example, murder and meanness, and the most commonly known technique here is correlation. Multivariate analysis is the study of relationships between three or more variables, for example, murder, meanness, and gun ownership, and for all techniques in this area, you simply take the word "multiple" and put it in front of the bivariate technique used, as in multiple correlation.
3. Inferential statistics, also called inductive statistics, fall into one of two categories: tests for difference of means and tests for statistical significance, the latter of which are further subdivided into parametric or nonparametric, depending upon whether the test assumes that the population follows a particular distribution, usually the normal curve (parametric), or makes no such assumption (nonparametric). The purpose of difference of means tests is to test hypotheses, and the most common techniques are called Z-tests. The most common parametric tests of significance are the F-test, t-test, ANOVA, and regression. The most common nonparametric tests of significance are chi-square, the Mann-Whitney U-test, and the Kruskal-Wallis test.
To summarize:
Descriptive statistics (mean, median, mode; standard deviation, variance)
Relational statistics (correlation, multiple correlation)
Inferential tests for difference of means (Z-tests)
Inferential parametric tests for significance (F-tests, t-tests, ANOVA, regression)
Inferential nonparametric tests for significance (chi-square, Mann-Whitney, Kruskal-Wallis)
MEASURES OF CENTRAL TENDENCY
The most commonly used measure of central tendency is the mean. To compute the mean, you add up all the numbers and divide by how many numbers there are. It is the arithmetic average, not a halfway point: a kind of center that balances high numbers with low numbers. For this reason, it's most often reported along with some simple measure of dispersion, such as the range, which is expressed as the lowest and highest number.
The median is the number that falls in the middle of a list of numbers sorted from lowest to highest. It's not the average; it's the halfway point. There are always just as many numbers above the median as below it. In cases where there is an even number of values, you average the two middle numbers. The median is best suited for data that are ordinal, or ranked. It is also useful when you have extremely low or high scores, since a few extreme scores pull the mean but not the median.
The mode is the most frequently occurring number in a list of numbers. It's the closest thing to what people mean when they say something is average or typical. The mode doesn't even have to be a number. It will be a category when the data are nominal or qualitative. The mode is useful when you have a highly skewed set of numbers, mostly low or mostly high. You can also have two modes (bimodal distribution) when one group of scores is mostly low and the other group is mostly high, with few in the middle.
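The three measures can be compared side by side with a short sketch. The scores and offense categories below are hypothetical, made up only for illustration, and the example relies on Python's standard `statistics` module.

```python
from statistics import mean, median, multimode

# Hypothetical scores with one extremely high value (25)
scores = [2, 3, 3, 4, 5, 7, 25]

score_mean = mean(scores)        # sum of the scores divided by how many there are
score_median = median(scores)    # middle value of the sorted list
score_modes = multimode(scores)  # list of the most frequent value(s)

# The mode also works on nominal (category) data; these labels are hypothetical
offenses = ["theft", "assault", "theft", "fraud"]
offense_modes = multimode(offenses)

print(score_mean, score_median, score_modes, offense_modes)
```

Here the single extreme score pulls the mean well above the median, which is why the median is often preferred when a few scores are very high or very low.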
MEASURES OF DISPERSION
In data analysis, the purpose of statistically computing a measure of dispersion is to discover the extent to which scores differ, cluster, or spread around a measure of central tendency. The most commonly used measure of dispersion is the standard deviation. You first compute the variance, which is calculated by subtracting the mean from each number, squaring each difference, adding up the squared differences, and dividing the grand total (Sum of Squares) by how many numbers there are (or by one less than that when the numbers are a sample used to estimate a population). The square root of the variance is the standard deviation.
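The steps just described can be sketched in a few lines. The scores are hypothetical, and the variable names are chosen only for this illustration.

```python
from math import sqrt

# Hypothetical raw scores
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)

mean = sum(scores) / n

# Subtract the mean from each number, square the result, and add them all up
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Dividing by n treats the scores as a whole population
variance = sum_of_squares / n
standard_deviation = sqrt(variance)

# Dividing by n - 1 is the usual choice when the scores are a sample
sample_variance = sum_of_squares / (n - 1)

print(mean, sum_of_squares, variance, standard_deviation, sample_variance)
```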
The standard deviation is important for many reasons. One reason is that, once you know the standard deviation, you can standardize by it. Standardization is the process of converting raw scores into what are called standard scores, which allow you to better compare groups of different sizes. Standardization isn't required for data analysis, but it becomes useful when you want to compare different subgroups in your sample, or between groups in different studies. A standard score is called a z-score (not to be confused with a z-test), and is calculated by subtracting the mean from each and every number and dividing by the standard deviation. Once you have converted your data into standard scores, you can then use probability tables that exist for estimating the likelihood that a certain raw score will appear in the population. This is an example of using a descriptive statistic (standard deviation) for inferential purposes.
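As a rough sketch of standardization, the example below converts hypothetical raw scores to z-scores and then uses the standard normal curve in place of a printed probability table. It assumes the scores are approximately normally distributed, which is an assumption of the illustration rather than something the data guarantee.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical raw scores
scores = [2, 4, 4, 4, 5, 5, 7, 9]

m = mean(scores)
sd = pstdev(scores)  # standard deviation computed by dividing by n

# z-score: subtract the mean from each score, then divide by the standard deviation
z_scores = [(x - m) / sd for x in scores]

# Probability of a score at least as high as the largest one,
# assuming a normal distribution (stands in for a printed z-table)
p_above_highest = 1 - NormalDist().cdf(z_scores[-1])

print(z_scores)
print(round(p_above_highest, 4))
```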
CORRELATION
The most commonly used relational statistic is correlation, and it's a measure of the strength of some relationship between two variables, not causality. Interpretation of a correlation coefficient does not even allow the slightest hint of causality. The most a researcher can say is that the variables share something in common; that is, are related in some way. The more two things have something in common, the more strongly they are related. There can also be negative relations, but the important quality of correlation coefficients is not their sign, but their absolute value. A correlation of -.58 is stronger than a correlation of .43, even though with the former, the relationship is negative. The following table lists the interpretations for various correlation coefficients:
Absolute value of r | Interpretation |
.8 to 1.0 | very strong |
.6 to .8 | strong |
.4 to .6 | moderate |
.2 to .4 | weak |
.0 to .2 | very weak |
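The most frequently used correlation coefficient in data analysis is the Pearson product moment correlation. It is symbolized by the small letter r, and is fairly easy to compute from raw scores. In its common raw-score form, where N is the number of paired scores on variables X and Y, the formula is:
$$ r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} $$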
If you square the Pearson correlation coefficient, you get the coefficient of determination, symbolized by r-squared (or by R-squared, with a large letter R, in regression). It is the proportion of variance accounted for in one variable by the other. R-squared can also be computed by using the statistical technique of regression, but in that situation, it's interpreted as the amount of variance explained for one variable by another. If you subtract a coefficient of determination from one, you get the proportion of variance left unexplained; the square root of that remainder is called the coefficient of alienation, which is sometimes seen in the literature.
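A minimal sketch of the raw-score computation of Pearson's r follows. The paired scores are hypothetical, and `pearson_r` is a helper name invented for this example, not a library function.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation from raw scores (hypothetical helper for illustration)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired scores on two variables
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2           # coefficient of determination
unexplained = 1 - r_squared  # share of variance left unexplained

print(round(r, 4), round(r_squared, 4), round(unexplained, 4))
```

With these numbers r comes out a little under .8, which the strength table would call strong, while only about 60 percent of the variance is shared.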
Z-TESTS, F-TESTS, AND T-TESTS
These refer to a variety of tests for inferential purposes. Z-tests are not to be confused with z-scores. Z-tests come in a variety of forms, the most popular being: (1) to test the significance of correlation coefficients; (2) to test for equivalence of sample proportions to population proportions, as in whether the proportion of minorities you've got in your sample matches the proportion in the population. Z-tests assume that the test statistic is approximately normally distributed, which generally means a reasonably large sample. They allow some rudimentary hypothesis testing, and let you set, though never eliminate, the risk of Type I error (rejecting a true null hypothesis) and Type II error (failing to reject a false one).
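The second use, comparing a sample proportion with a population proportion, might be sketched as below. All figures are hypothetical (a sample of 400 with 100 minority respondents, against an assumed population proportion of .30), and the normal approximation is assumed to be adequate at this sample size.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical figures for illustration only
n = 400                      # sample size
minority_in_sample = 100     # count observed in the sample
population_proportion = 0.30

sample_proportion = minority_in_sample / n

# Standard error of a proportion under the null hypothesis
standard_error = sqrt(population_proportion * (1 - population_proportion) / n)

# z: how many standard errors the sample proportion lies from the population value
z = (sample_proportion - population_proportion) / standard_error

# Two-tailed p-value from the standard normal curve
p_two_tailed = 2 * NormalDist().cdf(-abs(z))

print(round(z, 3), round(p_two_tailed, 3))
```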
F-tests are much more powerful, as they allow explanation of variance in one variable accounted for by variance in another variable. In this sense, they are very much like the coefficient of determination. One really needs a full-fledged statistics course to gain an understanding of F-tests, so suffice it to say here that you find them most commonly with regression and ANOVA techniques. F-tests require interpretation by using a table of critical values.
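T-tests are kind of like little F-tests, and similar to Z-tests. They are appropriate for smaller samples, and relatively easy to interpret since any calculated t over 2.0 in absolute value is, by rule of thumb, significant at the .05 level (the exact cutoff depends on the degrees of freedom and is larger for very small samples). T-tests can be used for one sample or two samples, and can be one-tailed or two-tailed. You use a two-tailed test if there's any possibility of bidirectionality in the relationship between your variables. In general form, the t-test divides the difference between two means (or between a sample mean and a hypothesized value) by the standard error of that difference:
$$ t = \frac{\bar{X}_1 - \bar{X}_2}{SE_{\bar{X}_1 - \bar{X}_2}} $$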
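One common version, the two-sample t-test with a pooled variance, can be sketched as follows. The two groups are hypothetical, and the pooled form assumes the groups have roughly equal variances; other versions of the t-test use a different standard error.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical scores for two independent groups
group_a = [35, 45, 55, 65]
group_b = [65, 75, 85, 95]

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2  # degrees of freedom for the pooled test

# Pooled variance: sample variances weighted by their degrees of freedom
pooled_variance = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df

# Standard error of the difference between the two means
standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))

t = (mean(group_a) - mean(group_b)) / standard_error

print(round(t, 3), df)
```

The calculated t is then compared with a critical value from a t-table at the chosen significance level and these degrees of freedom.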
ANOVA
Analysis of Variance (ANOVA) is a data analytic technique based on the idea of comparing explained variance with unexplained variance, kind of like a comparison of the coefficient of determination with the coefficient of alienation. It uses a rather unique computational formula which involves squaring almost every column of numbers. What is called the Between Sum of Squares (BSS) refers to variance in one variable explained by variance in another variable, and what is called the Within Sum of Squares (WSS) refers to variance that is not explained by variance in another variable. Each Sum of Squares (SS) is divided by its degrees of freedom (df) to give a Mean Square (MS), and an F-test is then conducted on the number obtained by dividing the Between Mean Square by the Within Mean Square. The results are presented in what's called an ANOVA source table, which looks like the following:
Source | SS | df | MS | F | p |
Total | 2800 | | | | |
Between | 1800 | 1 | 1800 | 10.80 | <.05 |
Within | 1000 | 6 | 166.67 | | |
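To show where the numbers in a source table come from, here is a sketch of a one-way ANOVA computed by hand. The two groups of scores are hypothetical; they were chosen only because they happen to reproduce the sums of squares in the table above, and they are not the data behind it.

```python
# Hypothetical scores for two groups, picked to reproduce the table's sums of squares
groups = {
    "A": [35, 45, 55, 65],
    "B": [65, 75, 85, 95],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

def group_mean(scores):
    return sum(scores) / len(scores)

# Between Sum of Squares: group means compared with the grand mean
bss = sum(len(s) * (group_mean(s) - grand_mean) ** 2 for s in groups.values())

# Within Sum of Squares: each score compared with its own group mean
wss = sum((x - group_mean(s)) ** 2 for s in groups.values() for x in s)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

ms_between = bss / df_between
ms_within = wss / df_within
f_ratio = ms_between / ms_within

print(bss, wss, bss + wss)
print(df_between, df_within, round(ms_within, 2), round(f_ratio, 2))
```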
REGRESSION
Regression is the closest thing to estimating causality in data analysis, and that's because it predicts how much the numbers "fit" a projected straight line. There are also advanced regression techniques for curvilinear estimation. The most common form of regression, however, is linear regression, which uses the least squares method to find an equation that best fits a line representing what is called the regression of y on x. The procedure is similar to finding a minimum in calculus (if you've had a math course in calculus): the quantity being minimized is the sum of the squared vertical distances between the data points and the line. Instead of finding the perfect number, however, one is interested in finding the best line, such that there is one and only one line (represented by an equation) that best represents, or fits, the data, regardless of how scattered the data points. The generic regression equation is y = a + bx, where a is the intercept and b is the slope. The slope of the line provides information about predicted directionality, and the estimated coefficients (called beta weights when standardized) for the independent variables indicate the power of the relationship. Use of a regression formula (not shown here because it's too large) produces a number called R-squared, which is a kind of conservative, yet powerful coefficient of determination. Interpretation of R-squared is somewhat controversial, but generally uses the same strength table as correlation coefficients, and at a minimum, researchers say it represents "variance explained."
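For the simplest case of one independent variable, the least squares line and R-squared can be sketched by hand as below. The data points are hypothetical, and the sketch covers simple linear regression only, not the multiple-variable case.

```python
# Hypothetical data: x is the independent variable, y the dependent variable
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares estimates for the line y = intercept + slope * x
sum_xy_dev = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sum_xx_dev = sum((a - mean_x) ** 2 for a in x)
slope = sum_xy_dev / sum_xx_dev
intercept = mean_y - slope * mean_x

# R-squared: 1 minus (unexplained variation / total variation)
predicted = [intercept + slope * a for a in x]
ss_residual = sum((b - p) ** 2 for b, p in zip(y, predicted))
ss_total = sum((b - mean_y) ** 2 for b in y)
r_squared = 1 - ss_residual / ss_total

print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```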
CHI-SQUARE
A technique designed for less than interval level data is chi-square (pronounced kye-square), and the most common forms of it are the chi-square test for contingency and the chi-square test for independence. Related measures exist, such as Cramer's V, Proportional Reduction in Error (PRE) statistics, Yule's Q, and Phi. Essentially, all chi-square type tests involve arranging the frequency distribution of the data in what is called a contingency table of rows and columns. Marginals, which are the row and column totals, are then used to compute the frequencies that would be predicted in each cell if the null hypothesis were true. These predicted scores are called expected frequencies, and they are subtracted from the observed frequencies (Observed minus Expected); each difference is squared and divided by its expected frequency, and the results are added up to give chi-square. Chi-square tests are frequently seen in the literature, and can be easily done by hand, or are run by computers automatically whenever a contingency table is asked for.
The chi-square test for contingency is interpreted as a strength of association measure, while the chi-square test for independence (which requires two samples) is a nonparametric test of significance that essentially rules out as much sampling error and chance as possible.
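The Observed-minus-Expected arithmetic can be sketched for a small contingency table. The counts are hypothetical, and the example stops at the chi-square value and its degrees of freedom, leaving the comparison with a table of critical values to the reader.

```python
# Hypothetical 2 x 2 contingency table of observed frequencies
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell under the null hypothesis:
# (row total * column total) / grand total
expected = [
    [r * c / grand_total for c in column_totals]
    for r in row_totals
]

# Chi-square: sum of (Observed - Expected) squared, divided by Expected
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, round(chi_square, 3), df)
```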
MANN-WHITNEY AND KRUSKAL-WALLIS TESTS
The Mann-Whitney U test is similar to chi-square and the t-test, and is used whenever you have ranked, ordinal level measurement. As a significance test, you need two samples, and you rank the combined scores from both groups (say, from 1 to 15), giving tied scores the average of the ranks they share. The sum of the ranks in each group is used to calculate U, which is compared to a table of critical values of U or, with larger samples, converted to z and compared to a z-table. The interpretation is usually along the lines of some significant difference being due to the variables you've selected.
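A sketch of the U calculation follows. The two groups are hypothetical, and `average_ranks` is a helper written for this illustration that gives tied scores the mean of the ranks they occupy.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # rank of the first occurrence
        count = ordered.count(v)              # how many times the value appears
        rank_of[v] = first + (count - 1) / 2  # mean of the ranks it occupies
    return [rank_of[v] for v in values]

# Hypothetical scores for two independent groups (note the tie at 12)
group_a = [12, 15, 18, 20, 22]
group_b = [8, 9, 11, 12, 14]

n_a, n_b = len(group_a), len(group_b)

# Rank all scores together, then add up the ranks belonging to group A
ranks = average_ranks(group_a + group_b)
rank_sum_a = sum(ranks[:n_a])

# U for each group; the smaller one is compared with the table value
u_a = rank_sum_a - n_a * (n_a + 1) / 2
u_b = n_a * n_b - u_a
u = min(u_a, u_b)

print(rank_sum_a, u_a, u_b, u)
```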
The Kruskal-Wallis H test is similar to ANOVA and the F-test, and also uses ordinal, multisample data. It's most commonly seen when raters are used to judge research subjects or research content. Rank calculations are compared to a chi-square table, and interpretation is usually along the lines that there are some significant differences, and grounds for accepting research hypotheses.
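The H statistic can be sketched in the same spirit. The three groups of ratings are hypothetical and contain no tied scores, so the simple ranking used here is valid; data with ties would need average ranks and a tie correction, which this sketch leaves out.

```python
# Hypothetical ratings for three groups, with no tied scores
groups = [
    [7, 9, 12],
    [10, 14, 15],
    [16, 18, 20],
]

# Rank all scores together from lowest (1) to highest (valid only without ties)
pooled = sorted(x for g in groups for x in g)
rank = {score: position + 1 for position, score in enumerate(pooled)}

n_total = len(pooled)
rank_sums = [sum(rank[x] for x in g) for g in groups]

# H = 12 / (N(N + 1)) * sum(R_j^2 / n_j) - 3(N + 1)
h = (
    12 / (n_total * (n_total + 1))
    * sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
    - 3 * (n_total + 1)
)

df = len(groups) - 1  # degrees of freedom for the chi-square comparison

print(rank_sums, round(h, 3), df)
```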
IMPORTANT TERMS FOR UNDERSTANDING STATISTICAL INTERPRETATION
1. p-value: a p-value, sometimes called an uncertainty or probability coefficient, is based on properties of the sampling distribution. It is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. It is usually expressed as p less than some decimal, as in p < .05 or p < .0006, and statistical packages such as SPSS report it for nearly every procedure you run. It is used in two ways: (1) as a criterion level (often called alpha) which you, the researcher, have decided in advance to use as the cutoff for rejecting the null hypothesis, in which case you would ordinarily say something like "setting the significance level at .05 for one-tailed or two-tailed tests means that, when the null hypothesis is true, it will be rejected in error no more than 5% of the time"; and more commonly, (2) as an expression of inference uncertainty after you have run some test statistic regarding the strength of some association or relationship between your independent and dependent variables, in which case you would say something like "the evidence suggests there is a statistically significant effect; p < .05 means that, if there were really no effect, a result this extreme would turn up less than 5% of the time."
2. mean: the mean, or average, is a word describing the average calculated over an entire population. It is therefore a parameter, and the average in a sample is both a descriptive statistic and your best estimate of the population mean. The best way to describe it is as: "a mean of 22.5 indicates that the average score in this sample and by inference the population as a whole is 22.5".
3. standard deviation: the standard deviation is a single number indicating the spread, or typical distance between a single number and the set's average. A standard deviation value of 15, for example, would be interpreted as: "a mean of 22.5 and standard deviation of 15 means that the most typical cases (about two-thirds of them, if the scores are roughly normally distributed) fall into a range from 7.5 to 37.5".
4. standard error: the standard error is the standard deviation of a sampling distribution. It is your best guess about the size of a difference between a statistic used to estimate a parameter and the parameter itself. As an expression of uncertainty, it has two advantages over the p-value: (1) something called degrees of freedom are always associated with it; and (2) it is used in the calculation of other, more advanced statistics. The number of degrees of freedom for purposes of calculating the standard error of a mean is N - 1, or the total number of cases in your sample minus one. The number of degrees of freedom for analyzing residuals in a simple regression is N - 2, or the total number of cases in your sample minus two. As you apply other, more advanced statistics, the same pattern holds with progressively more subtracted from your total sample size, one for each quantity you estimate. Degrees of freedom in this statistical sense should not be confused with the looser question of how many different things a measure might mean, but that question deserves attention too. It may help you, the researcher, to look at the way you measured your variables (both independent and dependent) and think about how the subjects in your sample might have been confused and meant something else when they responded to items measuring your concept or construct. If you used multiple measures, or several items measuring the same thing different ways, you generally won't have a problem, but if you measured "broken home" by divorce only, your measure leaves a good deal of room for error because there are plenty of other ways a home could be broken.
5. residual plot: a residual plot is a scattergram used as a diagnostic tool, which the researcher normally uses to decide on what is called the robustness of any advanced statistic to be used later. Robustness is how well an advanced statistic can tolerate violations of the assumptions for its use. Statistics such as the t-test, the F-test, regression, and even correlation have many assumptions such as random sampling, independence across groups, approximately normal distributions, equal standard deviations, and the like. Residuals are the original observations (the raw data) with any group averages (or, in regression, predicted values) subtracted, and they are often rescaled into standardized or studentized form. The residuals (usually the y-axis) are then plotted against the group averages or predicted values (usually the x-axis), and the resulting scatterplot is then visually analyzed by the researcher to find ways to improve further analysis. There are many things to look for in a scatterplot, the most important ones being: (1) a funnel-shaped pattern, which would indicate the need for a log, exponential, or some other transformation of the raw data; and (2) the presence of outliers, or seriously outlying observations, which might indicate throwing out a few cases in your sample.
6. correlation: a Pearson's correlation coefficient, or small r, represents the degree of linear association between any two variables. Unlike regression, correlation doesn't care which variable is the independent one or the dependent one; therefore, you cannot infer causality. Correlations are also dimension-free, but they require a good deal of variability or randomness in your outcome measures. A correlation coefficient always ranges from negative one (-1) to one (1). It is not a percentage of cases or of time: a negative correlation coefficient of -0.65 indicates that "high scores on one variable tend to go with low scores on the other, and the relationship is strong", while a positive correlation coefficient of 0.65 indicates that "high scores on one variable tend to go with high scores on the other, and the relationship is strong". In either case, squaring the coefficient shows that about 42% of the variance is shared. Researchers often report the names of the variables in such sentences, rather than just saying "one variable". A correlation coefficient at zero, or close to zero, indicates no linear relationship.
7. F-test: The F-statistic is a test of significance used as an analysis of variance tool and also as a regression tool. It is used to test whether all, or several, statistical coefficients that can be calculated are zero or nonzero. It does this by calculating and comparing what are called generalized sums of squares, a principle that many statistics are based on. In the case of testing a single coefficient, the F-statistic is the square of the t-statistic, and they are interpreted similarly (although a general rule of thumb is that any t-statistic greater than two is significant). F-statistics are compared to a table of numbers which contain values of expected F-distributions. The degrees of freedom in the numerator are the number of coefficients tested (the number of independent variables being regressed); the denominator is the degrees of freedom associated with your standard error. By locating these spots in a table, you see if an F-statistic is greater than what can normally be expected in an F-distribution. If it is greater than what the table indicates, you can say that your full model (all your variables) has statistical significance. A small F tells the opposite story, and is often interpreted by saying: "An F of .428 with an associated p-value of .66 is not statistically significant; if the variables selected for observation in this study had no effect at all, an F at least this large would still be expected about 66% of the time, so there is no evidence that they provide a good fit in explanatory models of the outcome being measured." Researchers normally substitute words for what the actual outcome is (crime, quality of life, etc.) instead of saying "the outcome being measured".
8. R-squared: The R-squared statistic, or the coefficient of determination, is the percentage of total response variation explained by the independent variables. Adjusted R-squared is preferable to use if you have a lot of independent variables, since R-squared never gets smaller when more variables are added. R-squared statistics are usually interpreted as percentages, such as: "an R-squared of .51 indicates that fifty-one percent of the variation in [some outcome, your dependent variable] is explained by a linear regression on [explanatory variables, your independent variables]." In a simple regression with one independent variable, R-squared is identical to the square of the sample correlation coefficient; with several independent variables, R alone represents your multiple correlation coefficient. Adjusted R-squared represents a penalty for unnecessary variables.
9. Cook's distance: Cook's distance, or Cook's D, represents overall influence, or the effect that omitting a case has on the estimated regression coefficients. A case that has a large Cook's distance is one whose deletion would noticeably change those estimates. It may have a large Cook's distance because it has a large studentized residual, a large leverage, or both. It is up to the researcher to determine what is causing such an influence, and whether or not the cases should be deleted or the whole dataset smoothed.
10. Durbin-Watson statistic: J. Durbin and G.S. Watson developed the DW statistic as a test that compares how close residuals are to their immediately preceding values with how large the residuals are overall. The test is usually done with time series data, and confirms or rejects evidence of serial correlation. The statistic ranges from 0 to 4. With regression, a DW statistic near 2 suggests there is independence of the observations; values well below 2 point to positive serial correlation, and values well above 2 to negative serial correlation. Independent residuals make it easier, though never automatic, for you, the researcher, to argue against various threats to validity like history, maturation, and decay.
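As an illustration of item 10, the Durbin-Watson statistic can be sketched directly from a series of residuals. The residuals below are hypothetical values listed in time order, and `durbin_watson` is a helper name made up for this example.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    successive = sum(
        (later - earlier) ** 2
        for earlier, later in zip(residuals, residuals[1:])
    )
    total = sum(e ** 2 for e in residuals)
    return successive / total

# Hypothetical regression residuals, in time order
residuals = [0.5, 0.3, -0.2, -0.4, 0.1, 0.6]

dw = durbin_watson(residuals)

print(round(dw, 3))
```

A value well below 2, as with these made-up residuals, would point toward positive serial correlation; a formal decision still requires the published Durbin-Watson bounds for the sample size and number of predictors.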
Answer to Equation at Top of Page: -0.25
Last updated: Nov. 25, 2011
Not an official webpage of APSU, copyright restrictions apply, see
Megalinks in Criminal Justice
O'Connor, T. (2011). "Quantitative Data Analysis," MegaLinks in Criminal Justice.
Retrieved from http://www.drtomoconnor.com/3760/3760lect07.htm accessed on Nov.
25, 2011.