Chapter 1: Statistical Thinking: Why Is It Important?
1.2 THE MMR VACCINE-AUTISM LINK
In 1998, Andrew Wakefield, a well-respected doctor at the time, presented the results of his research in the Lancet, one of the world's leading peer-reviewed medical journals, suggesting a link between the measles, mumps, and rubella (MMR) vaccine and autism. The story received little media attention until Wakefield held a press conference, after which it became a media sensation. However, a simple review by the media of the actual journal article where Wakefield presented his findings would have ended the debate at that time. The title of the journal article presented in the Lancet was "Ileal-Lymphoid-Nodular Hyperplasia, Non-Specific Colitis, and Pervasive Developmental Disorder in Children."
First, Wakefield strongly suggests an association between the MMR vaccine and autism but does not make a causal conclusion. As we will learn, an association does not necessarily mean a causal relationship exists. In this case, it does not mean the MMR vaccine causes autism. In the journal article, he states:
Rubella virus is associated with autism and the combined measles, mumps, and rubella vaccine (rather than monovalent measles vaccine) has also been implicated. Fudenberg noted that for 15 of 20 autistic children, the first symptoms developed within a week of vaccination.
Second, the research was based on only 12 children vaccinated with MMR:
12 children (mean age 6 years [range 3–10], 11 boys) were referred to a paediatric gastroenterology unit with a history of normal development followed by loss of acquired skills, including language, together with diarrhea and abdominal pain. Children underwent gastroenterological, neurological, and developmental assessment and review of developmental records.
Wakefield found that nine of the children went on to develop autism soon after receiving the vaccination. Although his conclusions were speculative and based on only 12 children, this did not stop the media from widely reporting that Wakefield had found a causal link between the vaccine and autism.
Parents depend on the media to act as gatekeepers when it comes to these sorts of controversial claims made by researchers. Parents lead busy lives, so they need to trust that the media questions the quality of research before presenting the findings to the public. When it comes to the health and well-being of their children, a proper critique and communication of the findings is more helpful than a sensational story.
If the news media had taken the time to read in the journal article that Wakefield’s conclusions were speculative and based on only 12 children, they would (or should) have concluded that the research was not worth reporting on. Strong claims require strong evidence. A claim that the MMR vaccine is associated with autism based on just 12 children is not strong evidence.
1.3 SAMPLES AND POPULATIONS
Using statistics and statistical thinking, we analyze and interpret data to gain an understanding of the characteristics of populations. We select a sample from a well-defined population. From the sample, we calculate a sample statistic, which is an estimate of a population characteristic known as the population parameter. For example, the population could be defined as all adults in the US with high cholesterol. Researchers may want to test a drug for lowering cholesterol. They select a sample of adults with high cholesterol, give them the drug, and calculate the sample average cholesterol level (the sample statistic). The sample average cholesterol level is considered an estimate of the population average cholesterol level (the population parameter). In other words, if every adult in the population were to take the drug, the sample statistic estimates what the average cholesterol level would be for the entire population.
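The cholesterol example can be sketched as a small simulation. All numbers here (a population average around 185 mg/dL, a sample of 200 adults) are invented purely for illustration; the point is only the distinction between the population parameter and the sample statistic that estimates it:

```python
import random

random.seed(42)

# Hypothetical post-drug cholesterol levels (mg/dL) for every adult in the
# population with high cholesterol. In reality we could never measure them all.
population = [random.gauss(185, 25) for _ in range(100_000)]
population_mean = sum(population) / len(population)  # the population parameter

# Researchers can only measure a sample; its average is the sample statistic,
# which serves as an estimate of the unknown population parameter.
sample = random.sample(population, 200)
sample_mean = sum(sample) / len(sample)

print(f"population parameter: {population_mean:.1f}")
print(f"sample statistic:     {sample_mean:.1f}")
```

In practice the population list is unknowable; the simulation simply makes visible what the sample statistic is trying to estimate.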
In the Wakefield study, the population of interest was all children in the UK who had received the MMR vaccine. Wakefield wanted to estimate the percentage of these children who went on to develop autism after receiving the MMR vaccine. In order to estimate this percentage, a sample of children who received the MMR vaccine was selected. The percentage of children in this sample who developed autism was the sample statistic. It was an estimate of the true percentage of children in the population of MMR-vaccinated children who went on to develop autism (the population parameter).
Researchers are interested in looking for relationships between characteristics in the population of interest. A variable is simply a characteristic about an individual. In this study, the characteristics were MMR vaccine (yes or no) and autism (yes or no). Whether or not a child received the MMR vaccine is called the explanatory variable. It is used to try and explain the outcome or response variable, which is whether or not the child went on to develop autism. Wakefield was interested in the relationship between autism and exposure to the MMR vaccine in the population of children in the UK. Are children who receive the vaccine more likely to get autism?
Wakefield found that 75%, or nine, of the 12 MMR-vaccinated children he sampled in this study went on to develop autism. If this percentage were anywhere close to the true percentage in the population, then there would have been a noticeable increase in the incidence of autism after the vaccine was introduced. Wakefield discusses this fact near the end of the journal article:
If there is a causal link between measles, mumps, and rubella vaccine and this syndrome, a rising incidence might be anticipated after the introduction of this vaccine in the UK in 1988. Published evidence is inadequate to show whether there is a change in incidence or a link with measles, mumps, and rubella vaccine.
The fact there was no evidence of an increase in the overall percentage of children with autism since the vaccine was introduced is another reason why the news media should not have reported on this research in the way that it did. If the media had applied basic critical and statistical thinking skills to reading the journal article, they would have quickly surmised that the scientific evidence presented in the paper suggesting a link between the vaccine and autism was weak and insufficient. Instead, the media simply took what Wakefield stated in his press conference and ran with what they saw as a sensational story. The media, more than any other group, need the critical and statistical thinking skills to question the quality of the data upon which the science is based. They are the gatekeepers of truth for the general public, holding an immense power and responsibility in determining how we view the world around us. However, if we learn these necessary critical and statistical thinking skills, we can take that power into our own hands to some degree and hold the media (and researchers) to account.
A small sample size will (more often than not) result in a sample statistic that is far from the population parameter. However, as the sample size increases, we expect our sample statistic to converge toward the population parameter of interest. This should make intuitive sense. The more (quality) data upon which a sample statistic is based, the closer we expect it to be to the truth in the population.
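This convergence can be demonstrated with a short simulation. Assume, purely for illustration, a hypothetical true population proportion of 10%, and compare how far sample proportions typically land from the truth at different sample sizes, including one as small as Wakefield's 12:

```python
import random

random.seed(0)

TRUE_RATE = 0.1  # hypothetical true population proportion (illustration only)

def estimate_error(n, trials=2000):
    """Average absolute distance between the sample proportion and the truth,
    over many repeated samples of size n."""
    total = 0.0
    for _ in range(trials):
        hits = sum(random.random() < TRUE_RATE for _ in range(n))
        total += abs(hits / n - TRUE_RATE)
    return total / trials

for n in (12, 120, 1200):
    print(f"n={n:5d}: typical estimation error {estimate_error(n):.3f}")
```

The typical error shrinks as the sample size grows: with only 12 individuals, the sample proportion routinely misses the truth by a wide margin, while samples of 1,200 land close to it almost every time.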
Andrew Wakefield was interested in determining the percentage of children in the population who received the MMR vaccine who went on to develop autism. A sample size of 12 children, even if it were a properly selected representative sample of children from the population, is unlikely to give a reliable estimate of that percentage. As we learn in our next section, Wakefield’s sample was far from representative.
1.4 SELECTING A REPRESENTATIVE SAMPLE
When selecting a sample from some population, ideally we want to select a representative sample, a sample that provides us with an unbiased estimate of a population characteristic. A (simple) random sample, one in which every individual in the population has an equal chance of being selected, is expected to be a representative sample. Obtaining a proper random sample of individuals is often easier said than done.
For example, let’s say you want to determine the average height of the population of male students at your college. At midday, you decide to stand in the middle of your campus, where students are heading to and from their classes, asking male students their height as they pass by. You believe that all male students walk past (where you are standing) at some point during any given day. Therefore, you feel that your sample should be random and therefore a representative sample of all male students.
However, what if on that particular day and time the male basketball team just got back from a game and were passing by? Including these men in your sample would result in an overrepresentation of tall men in your sample. In other words, there would be a higher proportion of tall men in your sample than in the population. The sample would result in an estimated average height well above the population average. The resulting sample average height would be a biased estimate of the average height of male students in the population.
A random sample of men at your college should result in (or is expected to be) a representative sample of individual men’s heights. This representative sample of individual heights should ensure we obtain a sample average height closer to the population average height than if we were to include the basketball team in the calculation. The resulting sample average height is an unbiased estimate of the population average height of male students. With a properly selected random sample, the larger the sample size, the closer we expect the sample average height to be to the population average height.
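The contrast between the random sample and the basketball-team scenario can be sketched as a simulation. All heights below are invented for illustration: an assumed campus average near 177 cm and a hypothetical 15-man team averaging about 197 cm:

```python
import random

random.seed(1)

# Hypothetical heights in cm -- numbers invented for illustration only.
students = [random.gauss(177, 7) for _ in range(5000)]  # all male students
team = [random.gauss(197, 5) for _ in range(15)]        # the basketball team

population_mean = sum(students) / len(students)

# Random sample: every student equally likely to be chosen.
random_sample = random.sample(students, 50)

# Convenience sample: 35 passers-by plus the whole team walking past together.
convenience_sample = random.sample(students, 35) + team

print(f"population average:         {population_mean:.1f}")
print(f"random-sample average:      {sum(random_sample) / len(random_sample):.1f}")
print(f"convenience-sample average: {sum(convenience_sample) / len(convenience_sample):.1f}")
```

The random sample's average lands near the population average, while the convenience sample, with tall players overrepresented, lands well above it: a biased estimate.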
This is the power of a random sample of data as a means for pursuing the truth in populations, and it is one of the most important concepts at the heart of statistical thinking. In addition, the size of the population does not matter. It can be one hundred thousand, a million, a billion, or even a trillion. How close we expect our sample statistic (say a sample average) to be to what is known as the population parameter (say a population average) is driven by the size of a properly selected random sample.
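The claim that population size does not matter can also be checked by simulation. Here we draw the same-sized random sample from two hypothetical populations, one of 10,000 and one of 1,000,000 (values invented for illustration), and see that the typical estimation error is driven by the sample size, not the population size:

```python
import random

random.seed(7)

def sample_mean_error(pop_size, n=500, trials=200):
    """Typical error of a size-n sample average for a population of pop_size."""
    population = [random.gauss(50, 10) for _ in range(pop_size)]
    truth = sum(population) / pop_size
    total = 0.0
    for _ in range(trials):
        s = random.sample(population, n)
        total += abs(sum(s) / n - truth)
    return total / trials

# Same sample size, very different population sizes -> similar accuracy.
for N in (10_000, 1_000_000):
    print(f"population {N:>9,}: typical error {sample_mean_error(N):.3f}")
```

A sample of 500 estimates the average about equally well whether the population is ten thousand or a million, which is exactly why a properly selected random sample of modest size can speak for an enormous population.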
In the MMR-autism study, the sample size was small and far from representative. In January 2011, the investigative journalist Brian Deer published a paper in the British Medical Journal titled “How the Case Against the MMR Vaccine Was Fixed,” presenting the results of his investigation into Wakefield.
Brian Deer did an exhaustive job of investigating the truth in this case. A summary of his findings can be found on his website. He found that two years before Wakefield published his findings (and had selected the 12 children for his study), he was hired by a lawyer named Richard Barr to attack the MMR vaccine. Brian Deer also discusses how Wakefield self-selected his sample of 12 children.
The type of sample that Wakefield selected is known as a convenience sample. In his case, the sample was conveniently chosen in a way to ensure he showed evidence of an association between autism and the MMR vaccine. As we will learn, random samples are difficult to obtain, and researchers will often have to rely on convenience samples. How useful a convenience sample is in estimating population characteristics depends on how far from representative of the population the sample is. There might be factors in how the sample was selected that make it differ from the population, affecting our estimate of the population characteristic in which we are interested. Depending on how the convenience sample was selected, it may be very difficult to know the effect these factors have on the sample estimate of the population characteristic of interest. We might not even know what these factors are.
For example, on April 23, 2020, at the height of the first wave of the COVID-19 pandemic in New York City (NYC), the New York Times published an article titled "Cuomo Says 21% of Those Tested in NYC Had Virus Antibodies." As the news article points out, the headline statistic translates to 1.7 million New Yorkers having already contracted the virus. However, the official case count for NYC at the time was around 200,000 cases. So, which of these two statistics was closer to the true number of cases?
In a sample of 1,300 NYC residents, 21% were found to have coronavirus antibodies. However, the sample of residents was selected from supermarkets, making it a convenience sample. At the time, going to supermarkets felt like a high-risk activity. Lower-income New Yorkers were hit hardest by the pandemic and were more likely to have to do their own shopping. For that reason alone, it is very likely there would have been a higher proportion of NYC residents with coronavirus antibodies shopping in supermarkets than in the general population. Also, as the news article points out, the accuracy of the antibody tests used at the time was questionable, which could have also resulted in inflating the positivity rate among those tested.
Statistics calculated from convenience samples can be misleading due to the fact that the sample is not representative of the population. Determining which characteristics of the convenience sample adversely affect the accuracy of the sample statistics calculated from the data can be challenging. The official case count of approximately 200,000 cases at the time was an underestimate of the true number of cases for several reasons: lack of availability of testing, underreporting, and a high percentage of asymptomatic cases going undetected. The estimate of 1.7 million cases was an overestimate of the true case count for reasons already discussed. For what it is worth, the truth, the true number of cases at that time in NYC, was somewhere in between these two numbers.
The lesson to be learned from this example is how close a statistic is to the truth in a population depends on the quality of the data upon which the statistic is based. Only a random sample is expected to be representative of the population, resulting in a sample statistic that is an unbiased estimate of the population parameter of interest.
Statistical Thinking through Media Examples (Third Edition) by Anthony Donoghue – ©2022, 346 pages