MEASUREMENT, RELIABILITY, AND VALIDITY
"Those who speak most of progress measure it by quantity, not quality" (George Santayana)

    Measurement is at the core of doing research. Measurement is the assignment of numbers to things, and in almost all research, everything has to be reduced to numbers eventually. Precision and exactness in measurement are vitally important because the measures are what are actually used to test the hypotheses. A researcher needs good measures for both independent and dependent variables.

    Measurement consists of two basic processes, conceptualization and operationalization, followed by the more advanced step of determining the level of measurement, and then by still more advanced methods of assessing reliability and validity.

    Conceptualization is the process of taking a construct or concept and refining it by giving it a conceptual or theoretical definition. Ordinary dictionary definitions will not do. Instead, the researcher takes keywords in their research question or hypothesis and finds a clear and consistent definition that is agreed-upon by others in the scientific community. Sometimes, the researcher pushes the envelope by coming up with a novel conceptual definition, but such initiatives are rare and require the researcher to have intimate familiarity with the topic. More common is the process by which a researcher notes agreements and disagreements over conceptualization in the literature review, and then comes down in favor of someone else's conceptual definition. It's perfectly acceptable in science to borrow the conceptualizations and operationalizations of others. Conceptualization is often guided by the theoretical framework, perspective, or approach the researcher is committed to. For example, a researcher operating from within a Marxist framework would have quite different conceptual definitions for a hypothesis about social class and crime than a non-Marxist researcher. That's because there are strong value positions in different theoretical perspectives about how some things should be measured. Most criminal justice researchers at this point will at least decide what type of crime they're going to study.

    Operationalization is the process of taking a conceptual definition and making it more precise by linking it to one or more specific, concrete indicators or operational definitions. These are usually things with numbers in them that reflect empirical or observable reality. For example, if the type of crime one has chosen to study is theft (as representative of crime in general), creating an operational definition for it means at least choosing between petty theft and grand theft (the unlawful taking of property worth less than or more than $150). I don't want to give the impression from this example that researchers should rely upon statutory or legal definitions. Some researchers do, but most often, operational definitions are also borrowed or created anew. They're what link the world of ideas to the world of everyday reality. It's more important that ordinary people would agree on your indicators than other scientists or legislators, but again, avoid dictionary definitions. If you were to use legalistic definitions, then it's your duty to provide what is called an auxiliary theory, which is a justification for the research utility of legal hair-splitting (as in why less or more than $150 is of theoretical significance). The most important thing to remember at this point, however, is your unit of analysis. You want to make absolutely sure that everything you reduce down is defined at the same unit of analysis: societal, regional, state, communal, or individual, to name a few. You don't want to end up with a research project that has to collect political science data, sociological data, and psychological data. In most cases, you should break it all down so that each variable is operationally defined at the same level of thought, attitude, trait, or behavior, although some would call this psychological reductionism and are more comfortable with group-level units, treating psychological units only as proxy measures for more abstract, harder-to-measure terms.

LEVELS OF MEASUREMENT

    A level of measurement is the precision by which a variable is measured. For 50 years, with few detractors, science has used the Stevens (1951) typology of measurement levels. There are three things to remember about this typology: (1) anything that can be measured falls into one of the four types; (2) the higher the type, the more precision in measurement; and (3) every level up contains all the properties of the previous level. The four levels of measurement, from lowest to highest, are:

    The nominal level of measurement describes variables that are categorical in nature. The characteristics of the data you're collecting fall into distinct categories. A related distinction is between discrete variables, which can take only a limited, countable number of separate values (categories are a good example), and continuous variables, which can take any value within a range (like exact age or weight); nominal variables are always discrete. Nominal variables include demographic characteristics like sex, race, and religion.

    The ordinal level of measurement describes variables that can be ordered or ranked in some order of importance. It describes most judgments about things, such as big or little, strong or weak. Most opinion and attitude scales or indexes in the social sciences are ordinal in nature. 

    The interval level of measurement describes variables that have more or less equal intervals, or meaningful distances between their ranks. For example, if you were to ask somebody whether they were a first, second, or third generation immigrant, the assumption is that the distance, or number of years, between each generation is the same. Crime rates in criminal justice are usually treated as interval level measures, as is just about any kind of rate.

    The ratio level of measurement describes variables that have equal intervals and a fixed zero (or reference) point. It is possible to have zero income, zero education, and no involvement in crime, but rarely do we see ratio level variables in social science since it's almost impossible to have zero attitudes on things, although "not at all", "often", and "twice as often" might qualify as ratio level measurement.

    Advanced statistics require at least interval level measurement, so the researcher always strives for this level, accepting ordinal level (which is the most common) only when they have to. Variables should be conceptually and operationally defined with levels of measurement in mind since it's going to affect how well you can analyze your data later on.
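
    To make the distinction concrete, here is a minimal sketch in Python (the data and variable names are hypothetical, not from any real study) showing how the same attribute, a subject's prior arrests, could be operationalized at three different levels of measurement, and what kind of summary statistic each level supports.

        from collections import Counter

        prior_arrests_ratio = [0, 2, 5, 1, 0, 3]                               # ratio: true zero, equal intervals
        prior_arrests_ordinal = ["none", "few", "many", "few", "none", "few"]  # ordinal: ranked categories
        ever_arrested_nominal = ["no", "yes", "yes", "yes", "no", "yes"]       # nominal: labels only

        # Interval and ratio data support means and standard deviations.
        mean_arrests = sum(prior_arrests_ratio) / len(prior_arrests_ratio)

        # Nominal and ordinal data support counts and modes, not means.
        mode_arrested = Counter(ever_arrested_nominal).most_common(1)[0][0]

        print(mean_arrests, mode_arrested)

    The higher the level at which a variable is operationalized, the more statistical options are available later, which is why the ratio coding above preserves the most information.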

RELIABILITY AND VALIDITY      

    For a research study to be accurate, its findings must be reliable and valid. Reliability means that the findings would be consistently the same if the study were done over again. It sounds easy, but think of a typical exam in college; if you scored a 74 on that exam, don't you think you would score differently if you took it over again? Validity refers to the truthfulness of findings: whether you really measured what you think you measured, or more precisely, what others think you measured. Again, think of a typical multiple choice exam in college; does it really measure proficiency over the subject matter, or is it really measuring IQ, age, test-taking skill, or study habits?

    A study can be reliable but not valid, and it cannot be valid without first being reliable. You cannot assume validity no matter how reliable your measurements are. There are many different threats to validity as well as reliability, but an important early consideration is to ensure you have internal validity. This means that you are using the most appropriate research design for what you're studying (experimental, quasi-experimental, survey, qualitative, or historical), and it also means that you have screened out spurious variables and thought through the possibility of other variables creeping in and contaminating your study. Anything you do to standardize or clarify your measurement instrument to reduce user error will add to your reliability.

    It's also important early on to consider the time frame that is appropriate for what you're studying. Some social and psychological phenomena (most notably those involving behavior or action) lend themselves to a snapshot in time. If so, your research need only be carried out for a short period of time, perhaps a few weeks or a couple of months. In such a case, your time frame is referred to as cross-sectional. Sometimes, cross-sectional research is criticized as being unable to determine cause and effect, and a longer time frame is called for, one that is called longitudinal, which may add years onto carrying out your research. There are many different types of longitudinal research, such as those that involve tracking a cohort of subjects (such as schoolchildren across grade levels), or those that involve time-series (such as tracking a third world nation's economic development over four years or so). The general rule is that the more variables you've got operating in your study and the more confident you want to be about cause and effect, the more a longitudinal design is called for.

METHODS OF MEASURING RELIABILITY

There are four good methods of measuring reliability:

    The test-retest technique is to administer your test, instrument, survey, or measure to the same group of people at different points in time. Most researchers administer what is called a pretest for this purpose, troubleshooting bugs in the instrument at the same time. Reliability estimates are usually in the form of a correlation coefficient, so here, all you do is calculate the correlation coefficient between the two sets of scores from the same group and report it as your reliability coefficient.
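
    Here is a minimal sketch in Python of that calculation (the scores are hypothetical); it assumes the scipy library is available and simply correlates the two administrations for the same ten people.

        from scipy.stats import pearsonr

        time1 = [74, 81, 62, 90, 55, 78, 69, 85, 72, 66]   # scores at the first administration (pretest)
        time2 = [71, 84, 65, 88, 58, 75, 70, 83, 74, 63]   # scores for the same people at the retest

        r, p_value = pearsonr(time1, time2)
        print(f"test-retest reliability r = {r:.2f}")

    The same calculation applies to the multiple forms technique described next; the only difference is that the two sets of scores come from the two scrambled forms rather than from two administrations of the identical form.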

    The multiple forms technique has other names, such as parallel forms and disguised test-retest, but it's simply the scrambling or mixing up of questions on your survey, for example, and giving it to the same group twice. The idea is that it's a more rigorous test of reliability.

    Inter-rater reliability is most appropriate when you use assistants to do interviewing or content analysis for you. To calculate this kind of reliability, all you do is report the percentage of agreement on the same subject between your raters, or assistants.
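
    A minimal sketch in Python of that calculation (the codings are hypothetical): two assistants have coded the same ten subjects, and the proportion of matching codes is reported as the percentage of agreement.

        rater_a = ["violent", "property", "property", "violent", "drug",
                   "property", "violent", "drug", "property", "violent"]
        rater_b = ["violent", "property", "drug", "violent", "drug",
                   "property", "violent", "drug", "property", "property"]

        agreements = sum(a == b for a, b in zip(rater_a, rater_b))
        percent_agreement = 100 * agreements / len(rater_a)
        print(f"inter-rater agreement = {percent_agreement:.0f}%")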

    Split-half reliability is estimated by taking half of your test, instrument, or survey, and analyzing that half as if it were the whole thing. Then, you compare the results of this analysis with your overall analysis. There are different variations of this technique, one of the most common being Cronbach's alpha (a frequently reported reliability statistic), which correlates performance on each item with the overall score. Another technique, closer to the split-half method, is the Kuder-Richardson coefficient, or KR-20, which is used with dichotomous (right/wrong) items. Statistical packages will calculate these for you, although in graduate school, you'll have to do them by hand and understand that all such statistics derive from the classical assumption that every observed score consists of a true score plus an error score.
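
    Here is a minimal sketch in Python of both calculations (the item responses are hypothetical and the numpy library is assumed). Cronbach's alpha is computed from the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), and the split-half estimate simply correlates the two halves of the instrument.

        import numpy as np

        # Rows are respondents, columns are four Likert-type items.
        items = np.array([[4, 3, 4, 5],
                          [2, 2, 3, 2],
                          [5, 4, 4, 5],
                          [3, 3, 2, 3],
                          [1, 2, 1, 2]])

        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)       # variance of each item
        total_variance = items.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
        alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

        # Split-half: total on the first two items correlated with total on the last two.
        first_half = items[:, :2].sum(axis=1)
        second_half = items[:, 2:].sum(axis=1)
        split_half_r = np.corrcoef(first_half, second_half)[0, 1]

        print(f"Cronbach's alpha = {alpha:.2f}, split-half r = {split_half_r:.2f}")

    In practice, the split-half correlation is usually adjusted upward with the Spearman-Brown correction, since each half is only half as long as the full instrument.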

METHODS OF MEASURING VALIDITY

There are four good methods of estimating validity:

    Face validity is the least statistical estimate (validity overall is not as easily quantified as reliability), as it's simply an assertion on the researcher's part claiming that they've reasonably measured what they intended to measure. It's essentially a "take my word for it" kind of validity. Usually, a researcher asks a colleague or expert in the field to vouch for the items measuring what they were intended to measure.

    Content validity goes back to the ideas of conceptualization and operationalization. If the researcher has focused in too closely on only one type or narrow dimension of a construct or concept, then it's conceivable that other indicators were overlooked. In such a case, the study lacks content validity. Content validity is making sure you've covered all the conceptual space. There are different ways to estimate it, but one of the most common is a reliability approach where you correlate scores on one domain or dimension of a concept on your pretest with scores on that domain or dimension on the actual test. Another way is to simply look over your inter-item correlations.
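
    One way to do the latter, sketched minimally in Python below (hypothetical item responses, numpy assumed), is to print the full matrix of inter-item correlations and look for items that correlate unusually weakly with the rest, which may signal an item tapping a different dimension of the concept.

        import numpy as np

        # Rows are respondents, columns are items on the instrument.
        items = np.array([[4, 3, 4, 5],
                          [2, 2, 3, 2],
                          [5, 4, 4, 5],
                          [3, 3, 2, 3],
                          [1, 2, 1, 2]])

        inter_item_r = np.corrcoef(items, rowvar=False)   # correlation of every item with every other item
        print(np.round(inter_item_r, 2))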

    Criterion validity is using some standard or benchmark that is known to be a good indicator. A researcher might have devised a police cynicism scale, for example, and correlate scores on it with scores on an established benchmark such as Niederhoffer's cynicism scale. There are different forms of criterion validity: concurrent validity is how well something estimates actual day-by-day behavior; predictive validity is how well something estimates some future event or manifestation that hasn't happened yet. The latter type is commonly found in criminology. Suppose you are creating a scale that predicts how and when juveniles become mass murderers. To establish predictive validity, you would have to find at least one mass murderer and investigate, retrospectively, whether the predictive factors on your scale were present earlier in their life. With criterion validity, you're concerned with how well your items predict or determine your dependent variable.
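
    A minimal sketch in Python of a concurrent criterion check (all scores hypothetical, scipy assumed): scores on the newly devised scale are correlated with scores on the established benchmark for the same respondents, and the resulting coefficient is reported as evidence of criterion validity.

        from scipy.stats import pearsonr

        new_scale = [12, 18, 9, 22, 15, 11, 20, 17]         # scores on the researcher's new scale
        benchmark_scale = [30, 41, 25, 47, 35, 28, 44, 39]  # scores on an established criterion measure

        r, _ = pearsonr(new_scale, benchmark_scale)
        print(f"criterion validity coefficient r = {r:.2f}")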

    Construct validity is the extent to which your items are tapping into the underlying theory or model of behavior. It's how well the items hang together (convergent validity) or distinguish different people on certain traits or behaviors (discriminant validity). It's the most difficult validity to achieve. You have to either do years and years of research or find a group of people to test who have the exact opposite traits or behaviors from those you're interested in measuring.
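
    A minimal sketch in Python of that logic (all scores hypothetical, scipy assumed): the new scale should correlate strongly with a measure of a theoretically related trait (convergent evidence) and weakly with a measure of a trait the theory says is unrelated (discriminant evidence).

        from scipy.stats import pearsonr

        new_scale = [12, 18, 9, 22, 15, 11, 20, 17]
        related_trait = [13, 19, 11, 23, 15, 12, 22, 17]   # theoretically related measure
        unrelated_trait = [7, 7, 5, 5, 8, 7, 7, 2]         # theoretically unrelated measure

        convergent_r, _ = pearsonr(new_scale, related_trait)
        discriminant_r, _ = pearsonr(new_scale, unrelated_trait)
        print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")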

A LIST OF THREATS TO RELIABILITY AND VALIDITY

AMBIGUITY -- when correlation is taken for causation
APPREHENSION -- when people are scared to respond to your study
COMPENSATION -- when a planned contrast between groups or people breaks down
DEMORALIZATION -- when people get bored with your measurements
DIFFUSION -- when people figure out your test and start mimicking symptoms
HISTORY -- when some critical event occurs between pretest and posttest
INADEQUATE OPERATIONALIZATION -- unclear definitions
INSTRUMENTATION -- when the researcher changes the measuring device
INTERACTION -- when confounding treatments or influences are co-occurring
MATURATION -- when people change or mature over the research period
MONO-OPERATION BIAS - when using only one exemplar
MORTALITY -- when people die or drop out of the research
REGRESSION TO THE MEAN -- a tendency toward middle scores
RIVALRY -- the John Henry Effect, when groups compete to score well
SELECTION -- when volunteers are used, people self-select into the study
SETTING -- something about the setting or context contaminates the study
TREATMENT -- the Hawthorne Effect, when people are trying to gain attention

REVIEW QUESTIONS:
1. Why are multiple indicators better than just one indicator of a variable?
2. Give an example of a construct or concept that cannot be psychologically reduced, and explain why.
3. State a topic of research interest to you and what kinds of reliability and validity are most appropriate, explaining why.
4. Operationalize the "public safety" concept by listing all important theoretical and empirical dimensions.

QUIZ: (Identify the usual level of measurement for each of the following)
1. year in school
2. IQ scores
3. life expectancy
4. fatigue
5. cynicism
6. grade point average
7. hair color
8. type of neighborhood
9. temperature
10. climate

PRACTICUM:
1. Find a research study on the Internet or in a library, and locate and report what the author states for a reliability coefficient. Try to find a Cronbach's alpha or KR-20. Turn in the coefficient amount and the name, author, and URL of the study.

PRINTED RESOURCES
Blalock, H. (1979) "Measurement and Conceptualization Problems" American Sociological Review 44: 881-94.
Carmines, E. & R. Zeller (1979) Reliability and Validity Assessment. Beverly Hills, CA: Sage.
Cronbach, L. (1970) Elements of Psychological Testing. NY: Harper.
Guilford, J. & B. Fruchter (1978) Fundamental Statistics in Education and Psychology. NY: McGraw.
Hagan, F. (2000) Research Methods in Criminal Justice and Criminology. Boston: Allyn & Bacon.
Miller, D. (1991) Handbook of Research Design and Social Measurement. CA: Sage.
Neuman, L. & B. Wiegand (2000) Criminal Justice Research Methods. Boston: Allyn & Bacon.
Salkind, N. (2000) Exploring Research. Upper Saddle River, NJ: Prentice Hall.
Schuessler, K. (1985) Measuring Social Life Feelings. San Francisco: Jossey-Bass.
Slavin, R. (1984) Research Methods in Education: A Practical Guide. NJ: Prentice Hall.
Stevens, S. (Ed.) (1951) Handbook of Experimental Psychology. NY: Wiley. 
Wikipedia Entry on Statistical Reliability
