Reliability: Doing it again until you get it right

How do we interpret the results of reliability measures?

When you interpret the value of any reliability coefficient, keep two things in mind: a) a reliability coefficient should be positive, not negative, and b) it should be as large as possible within the interval from .00 to +1.00.

Now, getting back to the results of the four reliability measures presented in Table 5, we can affirm that the test-retest Pearson correlation coefficient of 0.91 is very high and indicates strong reliability over time for the test that measures preferences for different types of vocational programs. Likewise, the interrater reliability coefficient of 0.83 is high and represents a strong level of agreement between the two sets of observations/judgements of the student's supportive attitude in the social interaction with his/her client.
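To make the test-retest idea concrete, here is a minimal sketch of how a Pearson correlation between two administrations of the same test can be computed. The scores in time1 and time2 are invented for illustration and are not the data behind Table 5; the function name pearson_r is also just a label chosen for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each administration
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores of six people on the same test, given twice
time1 = [12, 15, 9, 20, 17, 11]
time2 = [13, 14, 10, 19, 18, 10]

r = pearson_r(time1, time2)
print(f"test-retest r = {r:.2f}")
```

A value close to +1.00 would be read as high stability over time, in the same way as the coefficient of 0.91 discussed above.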

In contrast, the Pearson correlation coefficient of 0.13 measuring parallel forms reliability is very low and indicates weak reliability and poor equivalence between the two forms of the IRMT test. Also, the Cronbach’s alpha value of 0.24 is not strong and demonstrates weak internal consistency, meaning that the five items of the test of attitudes towards different types of health care poorly assess the same construct. Good consistency of a scale is indicated by a Cronbach’s alpha between 0.6 and 0.8. Usually, when you want to improve the Cronbach’s alpha coefficient, you need to review the items of the scale and try to remove the item that probably does not measure the investigated concept well. Then you recompute the coefficient. You repeat the procedure until the new Cronbach’s alpha falls within the required range.
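The review-and-remove procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the item scores are invented (they are not the health care attitude data), the function names cronbach_alpha and alpha_if_item_deleted are chosen for this example, and alpha is computed with the usual formula k / (k - 1) * (1 - sum of item variances / variance of total scores).

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha; items is a list of columns, one list of scores per item."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are needed")
    # Total score of each respondent across all items
    totals = [sum(answers) for answers in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

def alpha_if_item_deleted(items):
    """Alpha recomputed with each item left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Hypothetical answers of six respondents to four items;
# the fourth item was made up so that it does not fit the other three
items = [
    [4, 5, 2, 3, 5, 1],
    [4, 4, 2, 3, 5, 2],
    [5, 5, 1, 3, 4, 1],
    [2, 5, 4, 1, 2, 4],
]

alpha_all = cronbach_alpha(items)
deleted = alpha_if_item_deleted(items)
weakest = deleted.index(max(deleted))  # item whose removal raises alpha the most

print(f"alpha with all items: {alpha_all:.2f}")
print(f"removing item {weakest + 1} gives alpha = {deleted[weakest]:.2f}")
```

In practice the decision to drop an item should also rest on its content, not only on the change in the coefficient.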

Before proceeding to an example, you should know that there are “many other types of internal consistency validity, you would not be surprised, right? This is especially true for measures of internal consistency. Not only is there coefficient alpha, but there are also split-half reliability, Spearman–Brown, Kuder–Richardson 20 and 21 (KR20 and KR21), and still others that basically do the same thing—examine the one-dimensional nature of a test—only in different ways” (Salkind, 2017: 170).

“Creating specific, reliable measures often seems to diminish the richness of meaning our general concepts have. This problem is inevitable. The best solution is to use several different measures, tapping the different aspects of a concept” (Babbie, 2013: 195).

What should you do when you cannot establish reliability?

Establishing the reliability of a test is a complex task that involves a great amount of work. Keep in mind that reliability is about how much error contributes to the observed score. Below you can find five ways that help you lower that error and, consequently, increase reliability:

Make sure that the instructions for the test takers are standardized and clear across all settings in which the measurement instrument is administered.
Especially in the case of achievement tests, the larger the sample of test takers, the more likely the sample is to be representative and reliable.
Remove unclear items from the test, because some people will respond in one way and others will respond in a different manner, independently of their knowledge or individual traits.
Especially for the achievement measures, you should assess the difficulty of the test. If a test is too difficult or too easy, it will not reflect an accurate picture of the test takers’ performance.
Minimize the effects of external events. For instance, if an important event, such as Easter, occurs near the time of testing, you might postpone the assessment.

In this section, you can learn about some of the criteria by which you can judge your relative success or failure in measuring things. When a researcher constructs and evaluates a test (an instrument with a set of items aimed at evaluating something), s/he has to take two aspects into consideration: reliability and validity. You must know that “criteria of the quality of measures include precision, accuracy, reliability, and validity. (…) Whereas reliability means getting consistent results from the same measure, validity refers to getting results that accurately reflect the concept being measured” (Babbie, 2013: 195).

Suppose you collected data by applying a test, a scale, or an instrument comprising a set of variables formulated as questions with pre-defined answers. Before starting to analyze and interpret data, you have to ensure that the gathered data represent the phenomenon that you want to study. In this regard, you have to answer two essential questions: “How do I know that the test, scale, instrument, etc. I use works every time I use it?” – that is reliability – and “How do I know that the test, scale, instrument, etc. I use measures what it is supposed to?” – that is validity. If the tools that you use to collect data are unreliable or invalid, then the results of any test or any hypothesis, and consequently the conclusions you formulate based on your research, will be inconclusive.

Reliability: Doing it again until you get it right

“Reliability is a concern every time a single observer is the source of data, because we have no certain guard against the impact of that observer’s subjectivity. We can’t tell for sure how much of what’s reported originated in the situation observed and how much in the observer” (Babbie, 2013: 190).

Reliability is about whether a test or any other instrument you use as a measurement tool can indeed measure something consistently. Let’s consider two concrete situations:

If you administer, for example, a scale about socio-emotional loneliness among the elderly before their admission to a senior club, you should ask whether administering the same scale 6 months later will yield consistent results.

In another context, let’s suppose you give a knowledge test in a class, and one student gets 88 points (a good score) while another gets only 63 points (a bad score, indicating the student has to learn more). In applying such a test, we operate with the following elements: the observed scores (what the two students actually got on the test, 88 and 63) and the true score (a 100% reflection of what the student really knows). However, we cannot directly measure the true score, because it represents a theoretical reflection of the actual amount of the trait possessed by an individual.

Starting from this second example, let’s look a little closer at the issue of true and observed scores. There is no full certainty in tests and measurement, and here is why. We established that the true score is the real value associated with the studied variable (here, the level of knowledge about certain specific topics). However, from the perspective of psychometricians, people who specialize in the construction and validation of tests and measurements, the true score has nothing to do with the reflection of reality. From their point of view:

«true score is the mean score an individual would get if s/he took a test an infinite number of times, and it represents the theoretical typical level of performance on a given test. (...) A test is reliable if it consistently produces whatever score a person would get on average, regardless of what the test is measuring. In fact, a perfectly reliable test might not produce a score that has anything to do with the construct of interest, such as “what you really know” » (Salkind, 2017: 162).
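The idea of the true score as a long-run average can be illustrated with a small simulation. It is only a sketch under assumptions that are not in the quoted text: the error is random, normally distributed, and zero on average, and the true score of 81 and the error spread of 5 points are made-up values.

```python
import random
from statistics import mean

def average_over_administrations(true_score, error_sd, n_administrations, seed=0):
    """Mean observed score when the same person takes the test many times."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = [true_score + rng.gauss(0, error_sd) for _ in range(n_administrations)]
    return mean(observed)

# Hypothetical person with a true score of 81 and random error of about 5 points
few = average_over_administrations(81, 5, n_administrations=3)
many = average_over_administrations(81, 5, n_administrations=10_000)

print(f"mean of 3 administrations:      {few:.1f}")
print(f"mean of 10,000 administrations: {many:.1f}")
```

With only a few administrations the average can sit well away from 81, while with a very large number it settles close to it, which is what the definition of the true score as a mean describes.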

Apart from the technical aspect discussed above, let’s think about why true scores and observed scores are not identical. They might be the same if the test were an absolutely perfect reflection of what is being measured. But the world is not perfect. So an observed score may be close to the true score, but the two are very rarely identical. The difference between them represents the amount of error interfering with the process of measurement.

Let’s suppose that a student obtains a score of 91 points on a statistics test, but his/her true score (which we never really know, but only theorize about) is 81. The 10-point difference between the two is the error score. It arises from the various sources of error that make an individual’s observed score deviate from the true score.

What is the source of such errors? There can be several reasons. Perhaps the room in which the test was taken was so cold that some students could hardly concentrate, and this could certainly have a negative effect on the test score. Or maybe the student did not study enough to be well prepared for the test. You can agree that both of these examples reflect testing conditions and not the qualities of the variable being measured. Thus, first of all, we should be concerned with reducing these sources of error as much as possible by ensuring appropriate test-taking conditions (for example, by using a room with an adequate temperature and motivating students to enhance their learning efforts). To the extent that you are able to reduce errors, you increase the reliability of measurement, so that the observed scores more closely match the true scores. The smaller the error, the greater the reliability.
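The link between error and reliability can be sketched with simulated scores. The sketch assumes the classical view in which each observed score is a true score plus random error, and it expresses reliability as the share of observed-score variance that comes from true scores. The score distribution, the two error sizes, and the names quiet_room and cold_room are all hypothetical.

```python
import random
from statistics import variance

def simulated_reliability(error_sd, n_people=2000, seed=42):
    """Ratio of true-score variance to observed-score variance in simulated data."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Hypothetical true scores: mean 75, standard deviation 10
    true_scores = [rng.gauss(75, 10) for _ in range(n_people)]
    # Observed score = true score + random error
    observed = [t + rng.gauss(0, error_sd) for t in true_scores]
    return variance(true_scores) / variance(observed)

quiet_room = simulated_reliability(error_sd=2)   # good testing conditions, little error
cold_room = simulated_reliability(error_sd=10)   # poor testing conditions, much error

print(f"small error: reliability is about {quiet_room:.2f}")
print(f"large error: reliability is about {cold_room:.2f}")
```

The run with the smaller error produces the coefficient closer to 1.00, which mirrors the point made above about reducing sources of error.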

Different types of reliability

In Table 1 (attached), you can find the four most important and most often used types of reliability.

Table 1