10 Introduction to Statistics

10.1 From Probability to Statistics

Until this point we have focused on the study of probability. At its core, probability is a subject which seeks to quantify the uncertainty present in statistical experiments. In the study of probability we begin by first making assumptions about the state of the world¹ and from there we draw conclusions about what must be true about the state of uncertainty in the world. In this regard, probability answers questions of the form “if this is true about the world, what should we see?” For instance, if a fair coin is tossed $100$ times, what is the likelihood that more than half of the tosses come up heads? We have made the assumption that we have a fair coin, tossed independently, and we wish to quantify our degree of uncertainty about this scenario.
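As a minimal sketch of this forward calculation, assume the number of heads follows a binomial distribution with $100$ independent tosses and a probability of heads of $0.5$. The snippet below sums the binomial probability mass function over every outcome with more than $50$ heads. The function name `prob_more_than_half_heads` is chosen purely for illustration.

```python
from math import comb

def prob_more_than_half_heads(n_tosses: int = 100, p_heads: float = 0.5) -> float:
    """P(X > n_tosses / 2) for X ~ Binomial(n_tosses, p_heads)."""
    threshold = n_tosses // 2  # "more than half" means strictly more than half the tosses
    return sum(
        comb(n_tosses, k) * p_heads**k * (1 - p_heads) ** (n_tosses - k)
        for k in range(threshold + 1, n_tosses + 1)
    )

probability = prob_more_than_half_heads()
print(f"P(more than 50 heads in 100 fair tosses) = {probability:.4f}")
```

Under these assumptions the answer is slightly below one half, since some of the probability sits on the outcome of exactly $50$ heads.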

This is not the only way that we could frame problems related to uncertainty. For instance, what if we asked “given the results from $100$ tosses of this coin, do we believe that the coin is fair?” This inversion of the previous question starts from the observation of information from an experiment and asks questions about the underlying mechanisms that generated these data.² This type of question is addressed by the field of statistics.

Definition 10.1 (Statistics (Field of Study)) Statistics is the discipline in which data are collected, analyzed, and presented with the goal of understanding the mechanisms through which those data were generated.

In Probability, we make assumptions about the world and calculate probabilities. These probabilities describe what we should expect to see if we were to observe the processes as they were assumed to exist. In Statistics, we collect information from statistical experiments, and use these data to infer what conditions were likely to have given rise to our observations. We are back-solving for what set of assumptions is most plausible, given the observations. Uncertainty remains at the core of statistics. We will rarely be able to know for certain what assumptions gave rise to the data that we observe, but instead look to clarify and quantify the ever-present uncertainty. Probability remains central to the study of statistics. Specifically, probability is the core tool for quantifying this uncertainty. Probability statements are the language of Statistics. As a result, the study of Statistics is largely the study of how we can take the set of tools that have been developed throughout the first part of these notes, and apply them in the reverse direction.

Statistics gives us the tools that we need to make sense of the world around us. Statistics serves as a process for evaluating the quality of evidence and drawing conclusions from it. Statistics is the area of study required to draw informed conclusions from the information that we collect. Ultimately, it is Statistics which powers quantitative decision-making. Virtually every avenue of the modern world demands that we make decisions on the basis of incomplete or imperfect information, and through Statistics we can ensure that these decisions are as informed as possible.

10.2 Background and Data

It is important to formalize some terminology upon which we will rely. A key challenge in formalizing these ideas is that, for many of these central concepts, we have an intuitive or colloquial sense of the idea. Just as with Probability, a large part of our goal during the early phases of learning Statistics revolves around connecting formalized ideas to intuitive concepts that we are familiar with from other contexts.

Definition 10.2 (Data) Facts, figures, observations, or recordings in virtually any form (images, sounds, text, measurements) which are gathered and processed to form and communicate conclusions.

Data sit at the center of Statistics as the prime objects of study. We are concerned with how we can take data and draw valid conclusions. This may be by ensuring that the data are collected in a way which is suitable to draw conclusions, or by finding ways to graphically display the information within collected data, or by drawing inferences about the world using the data on hand. On their own, the data are not particularly descriptive or actionable. Instead, the data are transformed into useful information through statistical techniques. We will broadly refer to any such process as a statistical analysis.

The goal of a statistical analysis can be placed into one of four categories. These categories define the four purposes of statistics.

Descriptive statistics: Descriptive statistics focuses on organizing and summarizing information. With descriptive statistics we seek to describe the current state of the world which led to the data we have collected.
Inferential statistics: Inferential statistics provides methods for drawing conclusions, and quantifying the uncertainty surrounding these conclusions, regarding a population or process responsible for the data collected. With inferential statistics we seek to infer the underlying truth about a population or process of interest.
Predictive statistics: Predictive statistics provides methods for making predictions regarding the future behaviour of a process or population based on past observations from that population or process. With predictive statistics we seek to predict what is to come in the future.
Prescriptive statistics: Prescriptive statistics provides methods for suggesting interventions into a population or process according to its likely impact on a chosen criterion. With prescriptive statistics we seek to prescribe interventions based on what is likely to happen.

Example 10.1 (Charles and Sadie Categorized Questions) Back out for coffee after a relaxing break, Charles and Sadie turn their attention to thinking about the possible use cases for Statistics. They begin to play a game, trying to identify which of the four major categories would be most appropriate to address their various questions of interest. For each of the following, identify whether the problem is best approached through descriptive, inferential, predictive, or prescriptive statistical techniques.

Charles wonders how many people, on average, visit the coffee shop each day.
Sadie wonders if the type of music playing in the shop impacts what purchases customers make.
Charles wants to determine how many chocolate chip cookies the coffee shop should prepare for Saturday morning.
Sadie wonders what the most common drink add-in is.
Charles wants to know how much the coffee shop should sell their coffees for, if they are trying to maximize income.
Sadie wants to understand how many people have signed up for the loyalty program.
Charles wonders if there is a meaningful difference between people’s orders who sign up for the loyalty program and those who don’t.
Sadie, in turn, questions how much the loyalty program is likely to grow over the next month.
Charles wants to understand how the reward tiers can be changed to grow the loyalty program faster.

Solution

This is descriptive. This is an attempt to describe the state of the world as it exists.
This is inferential. This is an attempt to draw conclusions regarding the true state of the world.
This is predictive. This is an attempt to understand the future behaviour of uncertain quantities.
This is descriptive. This is an attempt to describe the state of the world as it exists.
This is prescriptive. This is an attempt to suggest an intervention to achieve a desired outcome.
This is descriptive. This is an attempt to describe the state of the world as it exists.
This is inferential. This is an attempt to draw conclusions regarding the true state of the world.
This is predictive. This is an attempt to understand the future behaviour of uncertain quantities.
This is prescriptive. This is an attempt to suggest an intervention to achieve a desired outcome.

Each of the various roles that statistics can play is defined in terms of populations³. We understand this at an intuitive level, and this intuition is a strong place to begin to formalize Statistics.

Definition 10.3 (Population) The collection of all individuals or items that are under consideration in a study or experiment.

In many settings, the population is a well-defined, concrete idea. We may think of all individuals who attend a particular university, all birds of a species living in a particular park, all cars of a particular model made last year at a given factory. In each of these cases we can envision taking all members of the population⁴ and placing them in one location. If we were able to do this, any questions we had about the population could be directly answered. This is not typically possible in these cases owing to practical considerations regarding the resources that would be required.⁵ In many other settings, we cannot even imagine grouping the entire population of interest together, since the population is less concretely defined.

Consider, for instance, investigating the quality of vaccines that are produced at a particular facility. This facility will continue producing vaccines indefinitely into the future, and we may wish to know about the set of these future items. Similarly, we may wish to understand the impact of a particular teaching style on children’s ability to learn math skills. In this case, we are not concerned with one particular school or one particular school board or one particular set of students. Rather, we want to know how children in general respond to this teaching intervention. In cases like these the population of interest is less concrete and more conceptual. It is not a specific well-defined group of individuals or items, and it may be infinite. Instead of being able to collect all the items of the population together, we are only able to assess any individual or item and answer “is this a member of the described population?” We refer to these as conceptual populations.

Definition 10.4 (Conceptual Population) A set of individuals, items, or observations which are hypothetical in the sense that they do not tangibly exist as a concrete group, but instead share a common feature which defines the population. The units in the conceptual population are linked because they arise under conditions which are equivalent in some way. Sometimes conceptual populations are called hypothetical populations.

The utility of a conceptual population is that it allows us to unify the framework of Statistics whether we are studying groups of people or objects that really do exist in front of us, or those which we can describe but not collect. Even something as well-defined as the population of a country, for instance, is a population which may be conceptual in many regards. There are constantly new individuals being born in the country, those who are dying, those moving to or away from it. Still, none of us are confused about what we mean by the “population of a country”. Likewise, conceptual populations in statistics are well-defined, even if they remain intangible.

Example 10.2 (Charles and Sadie Identify Populations) Charles and Sadie had such fun identifying the uses for statistics during their last conversation that today, at coffee, they decide to identify populations of interest. They open up the local paper to the science section, and begin to read the headlines. For each headline, indicate the population of interest and specify whether this is a conceptual population.

“Study Finds Link Between Coffee Consumption and Productivity in Office Workers”
“Research Shows Decline in Pollinator Populations Across Agricultural Regions”
“Poll Indicates Attitudes Toward Healthcare Reform Among Registered Voters”
“Research Reveals Impact of Air Pollution on Respiratory Health Among Children in New York City in 2023”
“Survey Explores Relationship Between Social Support and Mental Health Among LGBTQ+ Youth”
“Poll Indicates Satisfaction with Public Transportation Among Commuters in Metropolitan Areas in Canada”

Solution

The population of interest here is simply “office workers”. This is a conceptual population as we can easily tell whether someone belongs to the population (ask whether they work in an office), but we cannot easily describe the complete group of individuals.
The population of interest here is pollinators in agricultural regions. This is a conceptual population as it will be constantly shifting and changing. We are able to describe whether a particular animal is a pollinator living in an agricultural region, but we are not able to enumerate through the animals which would satisfy this.
The population of interest here is registered voters (wherever the study is run). This is not a conceptual population as a voter registry contains a list of all the people within this population. It may be the case that this will change over time, but we can concretely determine the population at this point in time.
The population is children in New York City in 2023. This is not a conceptual population as, while there is not any practical way of gathering all children living in New York City in 2023 together in one place, this is a fully describable population that (given enough resources) could be gathered together.
The population of interest here is LGBTQ+ youth. This is a conceptual population, as we are able to assess membership to the population, but not fully enumerate the members.
The population here is Canadian metropolitan-area commuters. This is a conceptual population.

Ultimately, our goal with statistics is to understand a population. However, as a general rule, we are unable to directly observe the entirety of the population. While it is typically infeasible to observe the entire population, we are often able to observe some units from the population. These units, when collected together, are referred to as a sample.

Definition 10.5 (Sample) A sample is a subset of a population which is observed, and as a result, information regarding these units is obtainable.

Thus, taken together, we are interested in a particular population. We are typically unable to observe our population in full, and instead content ourselves with the capacity to view a subset of this population, which is referred to as a sample. Generally speaking, we are interested in some numeric quantities which describe the population. Perhaps we wish to know the average height of students in a school, or the total number of calls that are made at a company over a period of time, or the maximum litter size for a breed of house cats, or the proportion of defective units produced during a manufacturing run. In each of these situations, the question of interest relates to a quantity describing the population. If we were able to view the entirety of the population, we could simply compute the value of this quantity. We refer to such quantities as parameters.

Definition 10.6 (Parameter) A parameter is a numeric quantity of interest which is defined for a population. A parameter captures the behaviour of the population. Typically, the value of parameters will be unknown and unknowable.

The fact that parameters are generally unknowable is the central tension at the heart of many statistical problems. To resolve this tension we turn our focus towards quantities which can be computed, namely those which are derived from samples that we have taken. These quantities are aptly named statistics.

Definition 10.7 (Statistics (Quantities)) A statistic is a numeric quantity of interest that is computed on a sample. Any quantity which is calculated based on observed data from a sample is a statistic.

In this regard, Statistics as a subject is the study of statistics (as quantities). For instance, if we take the average height of a group of students, or the number of calls made by a selection of employees during a period of time, or the maximum size that a particular cat breeder has for a litter, or the proportion of defective products from a random selection of items sampled from a manufacturing run, each of these is a statistic. Note that to differentiate a statistic from a parameter we are functionally differentiating between whether the quantity was computed on a sample or a population. There is no difference in the quantity itself; the difference lies in what the quantity is computed with respect to.

Thus, with these definitions we are able to concretely outline the process of statistics. We have interest in a particular population, conceptual or otherwise. Specifically, we have questions which are answered by parameters defined for this population. These parameters are unknowable since it is infeasible to observe all members of the population, and so instead we turn to taking observations for subsets of the population. These subsets are called samples, and samples are observable. Once observed, we are able to compute quantities of interest on the samples, referred to as statistics. It is our hope that somehow these statistics will be representative of the underlying parameters of interest, thus allowing us to answer the questions about the populations using information from the sample.
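The distinction between a parameter and a statistic can be illustrated with a small simulation. The sketch below assumes a made-up population of $1000$ student heights, generated purely for illustration (these are not real data). In practice the full population would not be available in this way, which is exactly why the parameter is usually unknowable.

```python
import random
from statistics import mean

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1000 simulated heights in centimetres (illustrative only).
population = [rng.gauss(170, 10) for _ in range(1000)]

# Parameter: the quantity computed on the entire population.
parameter = mean(population)

# Sample: 25 units drawn without replacement from the population.
sample = rng.sample(population, 25)

# Statistic: the same calculation, applied only to the observed sample.
statistic = mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The same function, `mean`, produces both numbers. What makes one a parameter and the other a statistic is only whether it was computed on the population or on the sample.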

Figure 10.1: An overview of the process of statistics. Visualized is the complete population, which is unable to be observed in full. There exists a parameter (or parameters) of interest that are computable from this population. We take a sample (subset) of this population, and can more accurately observe this sample, allowing us to compute the value of a statistic. This statistic is meant to correspond to the parameter of interest.

Example 10.3 (Experiments in the Coffee Shop) Charles and Sadie, fully bought into the process of statistics, decide to put their new knowledge to the test. To do so they wish to determine how the process of statistics would apply to the world around them, in the coffee shop. For each of the following scenarios indicate what the population of interest would be, whether it is conceptual or concrete, identify the parameter of interest, a possible sample of size $4$, and the relevant statistic.

To understand the daily traffic in the coffee shop, Charles counts the number of individuals who pass into the store in a particular hour.
To better understand the profitability of the store, Sadie collects the receipt totals for each of the customers arriving at the coffee shop.
To understand how the coffee shop has integrated into the community, both Charles and Sadie monitor the proportion of customers who are students at the local school, each day.

Solution

The population of interest is the customers of the coffee shop. This is a conceptual population. The parameter of interest is the average number of customers per day arriving at the coffee shop. Notably, Charles is measuring the average number per hour rather than per day, but this could be accounted for during the analysis. Samples of size $4$ would involve watching four separate hours, and counting the number of customers in the store each hour: one possibility would be $\{5, 35, 3, 22\}$. The statistic calculated is likely the average per hour (or a scaled version to convert it to the day). Based on the sample, that is $16.25$.
The population of interest is the purchases made at the coffee shop. This is a conceptual population. The parameter of interest is the average size of a purchase. Samples of size $4$ would include any four observations of totals that could be observed at the store, perhaps $\{\$1.20, \$23.18, \$14.21, \$5.87\}$. The statistic of interest would be the average size of the transaction. Based on the sample that is $\$11.115$.
Here, the population of interest is the customers of the coffee shop. This is a conceptual population. The parameter of interest is the proportion of customers who are students at the local school. A sample of size $4$ would be any four daily proportions that are observed, perhaps $\{0.2, 0.1, 0.08, 0.25\}$. The statistic of interest is the average proportion. Based on the sample, that is $0.1575$.

With this process outlined, we can revisit the roles that statistics will serve. Ultimately, our goal is to effectively use collected data⁶ to discern and communicate information. We look to do this by:

describing the collected data, conveying the information that we have gathered;
inferring conclusions about our population parameters from our sample statistics;
predicting out-of-sample observations, based on the sampled ones; or
prescribing interventions for the population to influence a parameter.

The value in each of these applications stems from the capacity that we have to connect the sample to the population. As a result, much of our statistical focus centers on finding ways to ensure that our conclusions drawn from our sample are reflective of the overall population.

To understand how this is possible, consider an experiment which seeks to determine whether a particular coin is fair. Suppose that we toss it $100$ times, and see $54$ heads. Is this a fair coin? While we cannot be entirely sure, this seems to be more-or-less in line with the number of heads that we would expect to see if the coin were fair. There is uncertainty present, but we can be more certain that this is a fair coin compared to our beliefs prior to running the experiment. Now imagine that instead of seeing $54$ heads on $100$ tosses, we had seen $94$. Immediately we should be skeptical that this coin is fair. It is perfectly possible that we see $94$ heads on $100$ tosses of a fair coin⁷, but it is not likely. It does not seem to be what we would expect to observe, and as a result, we are right to be skeptical of this.
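To put rough numbers on this intuition, the following sketch computes the probability of seeing at least $54$ heads, and at least $94$ heads, in $100$ tosses, assuming the coin is fair and the tosses are independent (so that the number of heads is binomial). The helper name `prob_at_least` is hypothetical and used only for this illustration.

```python
from math import comb

def prob_at_least(heads: int, n_tosses: int = 100) -> float:
    """P(X >= heads) for X ~ Binomial(n_tosses, 0.5), that is, for a fair coin."""
    # Each of the 2**n_tosses toss sequences is equally likely for a fair coin,
    # so we count the sequences containing at least `heads` heads.
    favourable = sum(comb(n_tosses, k) for k in range(heads, n_tosses + 1))
    return favourable / 2**n_tosses

p_54 = prob_at_least(54)
p_94 = prob_at_least(94)
print(f"P(at least 54 heads) = {p_54:.4f}")
print(f"P(at least 94 heads) = {p_94:.3e}")
```

Under the fair-coin assumption, a result of $54$ or more heads happens roughly a quarter of the time, while $94$ or more heads has a probability on the order of $10^{-21}$. This is the sense in which the second outcome is possible but not likely.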

This intuitive connection between what we can say about the population and the sample relies upon the sample being representative of the population. We would not take flips from another coin to be evidence of whether our coin is biased. We would not consider the sample to be particularly representative if instead of writing down every result, we ignored every time that more than one tail came up in a row. The intuitions we have about the relationship between our sample and our population rely on the assumption that the sample represents the population in some meaningful sense. Choosing a representative sample is, as a result, an important aspect of the statistical process.

10.3 Sampling

Sampling is an area of study within statistics which focuses on the process through which units are recruited from the population into our sample. If the units we observe are biased in some way, or are inappropriately recruited, then it is immediately clear that conclusions drawn about the sample will not be transportable to the population. Imagine, for instance, that we are interested in the height of students at a university. If our sample contains only members of the basketball and volleyball teams, this will not be reflective of the heights of most students at the school. Our sample is unrepresentative. With sampling, our goal is to select the sample in such a way that we can make guarantees about the information we learn from it, quantify our uncertainty, and be relatively confident in our conclusions.

When we deal with populations which are conceptual in nature it may make less sense to think of sampling directly. For instance, it seems strange to think of our previous example of repeatedly rolling a die in the same manner that we think of recruiting students and measuring their heights. In the case where we wish to learn about a process rather than a concrete population, we will often frame the generation of our sample not through the language of sampling but rather through the language of experimental design. The design of experiments refers to the same set of factors that are considered for sampling: how can we ensure that the units we observe will be representative of the underlying population or process. However, with experimental design it is largely the case that we are in direct control of the types of factors that may lead our process to being unrepresentative. It is up to us to ensure that the units that are being generated, through the conceptual population, are representative of the units we are interested in based on the research question.

Our focus in this class will be on investigating several of the techniques used in sampling and experimental design to ensure that the samples we use are related to the population of interest in predictable ways. It will always be possible that, even following best practices and being very careful with our implementation, we end up choosing a sample or having experimental units which are non-representative. This is uncertainty which is unavoidable in any scientific inquiry. Our goal then is to understand this uncertainty, to quantify it, and to ensure that we are able to understand precisely how big a risk this lack of representation is, and how that will change the results we can report. We begin by describing techniques which can be used to ensure that samples are good representations for the populations of interest. In the next section, we will turn to the same types of considerations for experimental design.

10.3.1 Simple Random Sampling

When considering the process of data collection via sampling, the primary decision to make is on the sampling design. The sampling design refers to the strategy that is employed to decide which members of the population will comprise the sample. Some sampling designs will not result in valid or representative samples. The most straightforward sampling design, which, if applied correctly, will produce valid samples is simple random sampling.

Definition 10.8 (Simple Random Sampling) Simple random sampling is a sampling procedure in which each possible sample of a given size is equally likely to be obtained.

Simple random sampling produces a simple random sample. Simple random sampling is far and away the most important sampling scheme. It is an effective way of drawing a representative sample itself, it is intuitive, and it forms the basis of many other, more complex sampling schemes. Generally, we can think of simple random sampling as sampling without replacement.⁸ Suppose we take our population to correspond to items in an urn, each labelled with the corresponding unit. Then a simple random sample is typically formed by selecting $n$ items from the urn without replacement. Those which are selected form the sample. If desired, it is possible to form a simple random sample with replacement, where in this setting the items would be placed back into the urn after each selection. Whether the sample is to be formed with or without replacement, the same general procedure will be followed. Each member of the population is assigned a numeric label (from $1$ through to $N$, the population size) and then software is used to select a subset of $n$ of the labels.
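The labelling procedure just described can be sketched in a few lines of Python using the standard library's `random` module. This is only one possible way to carry out the selection; the function name `simple_random_sample` and its arguments are assumptions made for this illustration.

```python
import random

def simple_random_sample(N: int, n: int, replace: bool = False, seed=None) -> list:
    """Select n of the labels 1, 2, ..., N uniformly at random."""
    rng = random.Random(seed)  # a seed makes the selection reproducible
    labels = range(1, N + 1)
    if replace:
        # With replacement: each draw is made from the full urn, so labels can repeat.
        return rng.choices(labels, k=n)
    # Without replacement: every subset of n distinct labels is equally likely.
    return rng.sample(labels, n)

print(simple_random_sample(N=960, n=20, seed=1))
```

The returned labels are then matched back to the corresponding population units, which are the ones observed.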

Example 10.4 (Rating Coffee Orders) Charles and Sadie are still attending the coffee shop, and Charles is still working through the $960$ different orders that are available (recall Example 3.5). Sadie, with a stronger grasp on statistics now, decides to try to understand the general quality of orders at the coffee shop. To do so, instead of ordering every possible meal (as Charles is doing), Sadie considers a simple random sample of possible orders.

Suppose that Sadie wants to understand the quality based on the next twenty visits to the coffee shop. Describe the procedure for forming a simple random sample.
What is the probability that Sadie’s current order will be one of the orders included in the simple random sample?

Solution

Sadie desires a simple random sample of size $n=20$ from the $N=960$ different orders. To do so, suppose that we number each of the orders from $1,2,\dots,960$ (perhaps based on the sequence in which Charles will try them). Then Sadie will randomly select, without replacement, $20$ numbers from these values. These orders will then be taken over the next twenty visits, with Sadie recording whatever information is relevant for each of them. For instance, Sadie may end up selecting the orders corresponding to:

##  [1] 834 499 529 376 910 931 549 620 623 169 419 754 296  80 276 342 918 855 265
## [20]  87

Sadie’s current order is $1$ of the possible options of the $960$. While we can work this out from first principles, we can also take this to be a hypergeometric random variable with $n=20$, $N=960$, and $M=1$. Then we want $P(X = 1)$. For this, using the probability mass function of a hypergeometric random variable, we get \[P(X = 1) = \frac{\binom{1}{1}\binom{959}{19}}{\binom{960}{20}} = \frac{1}{48}.\] Note this is $\dfrac{20}{960} = \dfrac{n}{N}$. In general, this will be true for any simple random sample.
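The hypergeometric calculation above can be checked with exact arithmetic. The sketch below evaluates the probability mass function at $X = 1$ with $M = 1$ and compares it with $n/N$. The function name `inclusion_probability` is chosen for illustration only.

```python
from fractions import Fraction
from math import comb

def inclusion_probability(N: int, n: int) -> Fraction:
    """P(X = 1) for a hypergeometric count with a single unit of interest (M = 1)."""
    M = 1
    return Fraction(comb(M, 1) * comb(N - M, n - 1), comb(N, n))

p = inclusion_probability(N=960, n=20)
print(p)                       # exact probability as a fraction
print(p == Fraction(20, 960))  # check against n / N
```

Using `Fraction` avoids any floating-point rounding, so the comparison with $n/N$ is exact.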

There is an appeal in the simplicity of simple random sampling. Moreover, it is quite clear how, as long as enough units are sampled, simple random sampling will result in a sample which is representative of the overall population. Despite these benefits, there are some drawbacks that are not easily overcome in the simple random sampling paradigm. For instance, if you imagine a situation in which your population is spread over a large geographic region, it is unlikely to be practical to form a simple random sample. Additionally, if you do not have a list of all population members, a simple random sample cannot be formed as described.⁹ Another practical concern involves sampling in this way when units have a natural ordering. Suppose that you are looking to test the impact of a new cancer therapy, and wish to form a sample of current cancer patients who will receive the experimental treatment to see if it improves over the current practice. If you form the sample through simple random sampling, it is possible that you will have only patients who are newly diagnosed or else only patients who have had their diagnosis for a long time. Neither situation is particularly effective for testing the therapy, and it becomes a large practical issue, since you likely want to ensure that you have both sets of individuals represented in the sample.

To overcome these issues with simple random sampling, alternative sampling designs have been proposed. These alternatives can lead to more convenience in the sampling, and perhaps yield more accurate results than a simple random sample can. It is important to note that these alternative designs are only more effective when the design itself is taken into account when analyzing or describing the data.

10.3.2 Systematic Random Sampling

One alternative design to simple random sampling, which is closely related, is known as systematic random sampling.

Definition 10.9 (Systematic Random Sampling) In systematic random sampling the sample is selected by choosing a random starting point from the list of members of the population, and then sampling every $k$th member until the desired sample size is reached.

Systematic sampling forms a sample that looks like a simple random sample, but it is more straightforward to implement. If you want a sample of size $50$ from a population of size $500$, then by selecting every $10$th member of the population, you will achieve the sample you desire. You want to be able to pick any individual from the population, and so you should randomly select the starting point before picking every $10$th member. Selecting every $10$th individual is more straightforward administratively than generating random numbers and sampling those indices, particularly when there is a natural ordering of the individuals. However, there are some implementation decisions which need to be made, notably: what should $k$ be, and who should be the first individual included? It is common to have the process of systematic random sampling described as follows.

Divide the population size, $N$, by the desired sample size, $n$, and round the result down to the nearest whole number. This will be $k$.
Select a number, $m$, randomly between $1$ and $k$. This will be the starting point.
Include in the sample $m$, $m+k$, $m+2k$, and so forth until the last unit of the sample.

This will generally form a usable sample. However, the procedure is not without its shortcomings, and these need to be properly accounted for.

To understand why this can lead to issues, suppose that we have $N=7$, and want to form a sample of size $n=3$. Using this procedure we get $k=2$, as the result of rounding down $\dfrac{7}{3}=2.33\dot3$. Next, we select either $1$ or $2$ as our starting point. If we select $1$ then we end up including $\{1,3,5\}$ and if we select $2$ then we get $\{2,4,6\}$. Note that we will never select item number $7$, which means that there is no chance it is represented in our sample. This is a problem. There are plenty of ways to resolve this concern, some of which lead to other issues themselves.

One small modification that can be made is to select $m$ between $1$ and $N-(n-1)k$.¹⁰ This will ensure that it is always possible to select up to the last unit. In our example with $N=7$ and $n=3$, the starting point is selected from $1,2,3$ giving in addition to the two possibilities outlined above, $\{3,5,7\}$ as a third option. This alleviates the issues of not including some members of the population in any possible sample. When this technique is used, however, it is worth noting that some elements become more likely to be included than the others.¹¹ This can be accounted for during analysis, but it needs to be completely understood to do so.
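To make the two starting rules concrete, the following is a minimal Python sketch that lists every possible systematic sample under each rule and, assuming every starting point is equally likely, computes how often each unit is included. The function names `systematic_samples` and `inclusion_probabilities` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from fractions import Fraction


def systematic_samples(N, n, extended_start=False):
    """List every possible systematic sample of size n from units 1..N.

    k is N divided by n, rounded down. The start m ranges over 1..k, or over
    1..N-(n-1)k when extended_start is True (the modification described above).
    """
    k = N // n
    last_start = N - (n - 1) * k if extended_start else k
    return [[m + i * k for i in range(n)] for m in range(1, last_start + 1)]


def inclusion_probabilities(N, samples):
    """Chance each unit is sampled, assuming every start is equally likely."""
    counts = Counter(unit for sample in samples for unit in sample)
    return {unit: Fraction(counts[unit], len(samples)) for unit in range(1, N + 1)}


basic = systematic_samples(7, 3)
extended = systematic_samples(7, 3, extended_start=True)
print(basic)     # [[1, 3, 5], [2, 4, 6]]
print(extended)  # [[1, 3, 5], [2, 4, 6], [3, 5, 7]]

# Unit 7 can never be chosen under the basic rule.
print(inclusion_probabilities(7, basic)[7])  # 0

# Under the extended rule every unit can be chosen, but not equally often.
print({u: str(p) for u, p in inclusion_probabilities(7, extended).items()})
# {1: '1/3', 2: '1/3', 3: '2/3', 4: '1/3', 5: '2/3', 6: '1/3', 7: '1/3'}
```

In this small case, units $3$ and $5$ appear in two of the three possible samples, which is the unequal inclusion noted above.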

Example 10.5 (Randomly Sampling Customer Experience) Sadie, content with the results of the simple random sampling of possible meals, decides to try to understand the overall customer satisfaction of individuals coming into the coffee shop. Charles suggests that a systematic sample may be in order, and they set out planning this.

Suppose that Charles and Sadie expect there to be $98$ customers arriving in a day, and they wish to sample $10$ of them.

Describe the process of forming a systematic sample from this population, including the specific values for the choices that are made.
What is a risk of this sampling design?

Solution

Here we have $N=98$ and want $n=10$, thus we take $k = \lfloor\dfrac{98}{10}\rfloor = 9$. We want to select the starting value between $1$ and $98 - 9\times(10-1) = 98 - 81 = 17$. Thus, there are a total of $17$ different samples we could wind up with. These are summarized as follows:

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   10   19   28   37   46   55   64   73    82
##  [2,]    2   11   20   29   38   47   56   65   74    83
##  [3,]    3   12   21   30   39   48   57   66   75    84
##  [4,]    4   13   22   31   40   49   58   67   76    85
##  [5,]    5   14   23   32   41   50   59   68   77    86
##  [6,]    6   15   24   33   42   51   60   69   78    87
##  [7,]    7   16   25   34   43   52   61   70   79    88
##  [8,]    8   17   26   35   44   53   62   71   80    89
##  [9,]    9   18   27   36   45   54   63   72   81    90
## [10,]   10   19   28   37   46   55   64   73   82    91
## [11,]   11   20   29   38   47   56   65   74   83    92
## [12,]   12   21   30   39   48   57   66   75   84    93
## [13,]   13   22   31   40   49   58   67   76   85    94
## [14,]   14   23   32   41   50   59   68   77   86    95
## [15,]   15   24   33   42   51   60   69   78   87    96
## [16,]   16   25   34   43   52   61   70   79   88    97
## [17,]   17   26   35   44   53   62   71   80   89    98

This sampling design has several risks. The first is that there is no guarantee that $N=98$ is the correct population size. As a result, the procedure could end the day with fewer than $n=10$ customers sampled, or could reach $n=10$ well before the end of the day. A second possible issue is if there are temporal patterns to customers arriving, such that groups of customers are missed with high probability. Cyclical patterns do not lend themselves well to systematic sampling, and it is plausible that customer arrivals to a coffee shop may exhibit some cyclical tendencies.

10.3.3 Cluster Sampling

While systematic sampling can be a more straightforward method for implementing a sample that looks like a simple random sample, it is generally not going to alleviate concerns with (for instance) geographic separation. In this case the issue with either of the two aforementioned sampling schemes is that they each would take substantial resources to send researchers to the area where the units to be sampled are. A remedy to this is to turn to cluster sampling.

Definition 10.10 (Cluster Sampling) In cluster sampling individuals are grouped together into clusters. The clusters are then sampled at random, according to a simple random sampling scheme. Any selected cluster is then sampled in full.

For instance, you may define clusters based on the geographic region that is occupied. This way you can ensure, for instance, that you only visit a set number of geographic regions, while still sampling enough individuals to collect useful data. A key criterion for cluster sampling to be valid is that each cluster should represent the overall population well. This can be an issue where members of a cluster are often more similar to one another than to members of other clusters. As a result, you can end up with an unrepresentative sample owing to the clustered nature of the sampling. In order to form a cluster sample, the procedure is essentially equivalent to simple random sampling. First, the population is divided into groups (clusters) which are labelled from $1$ through to the number of clusters that there are. Then, the clusters are randomly sampled, according to a simple random sample. Finally, all members of the selected clusters are included into the overall sample.

Example 10.6 (Charles and Sadie Sample their City) Charles and Sadie are reflecting upon the chocolate bars that they sold for charity in the past. They want to understand the feelings that members of their city have towards charitable giving. They feel that sampling $300$ homes is a useful number, but given that they will be going door-to-door, they want to do this in a clustered pattern. The city is made up of $947$ blocks, each with $20$ homes on it.

Describe how Charles and Sadie could use clustered sampling to form the sample that they desire.
What issues may arise using clustered sampling in this way?

Solution

Here we have $N=18940$ and we want a sample of $n=300$. In order to achieve $n=300$, we require selecting $15$ clusters, since each cluster has $20$ homes in it. Thus, Charles and Sadie should label the blocks $1$ through $947$, and then select $15$ of these at random. Once the $15$ blocks are selected, all $20$ homes on each should be visited. One possible sample would be:

##  [1] 834 499 529 376 910 931 549 620 623 169 419 754 296  80 276

A major issue with this type of clustering is that homes that are on the same block are likely to be fairly homogeneous. That is, city blocks will often be divided by socioeconomic status, education, and other demographic factors. As a result, selecting a full cluster is not likely to give a representative cross-section of the complete city, and instead, is likely to be skewed based on which clusters happen to be included. It seems very likely that with a survey investigating thoughts towards charitable giving, socioeconomic factors will play a role.
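The block selection in this solution can be sketched in a few lines of Python. This is only an illustration: the function name `cluster_sample`, the `(block, home)` labelling, and the seed are assumptions, and the blocks drawn will differ from the sample shown above.

```python
import random


def cluster_sample(num_clusters, cluster_size, target_n, seed=None):
    """Pick whole clusters by simple random sampling until target_n units are covered."""
    clusters_needed = -(-target_n // cluster_size)  # ceiling division
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(1, num_clusters + 1), clusters_needed))
    # Every home on a chosen block is included, labelled as (block, home).
    homes = [(block, home) for block in chosen for home in range(1, cluster_size + 1)]
    return chosen, homes


blocks, homes = cluster_sample(num_clusters=947, cluster_size=20, target_n=300, seed=1)
print(len(blocks), len(homes))  # 15 300
print(blocks)
```

Only the blocks are chosen at random. Once a block is selected, all $20$ of its homes enter the sample, which is why similarity within a block carries through to the sample.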

The major concern with cluster sampling is that, if the clusters are grouped together based on relevant information, the sample becomes predictably unrepresentative of the overall population. Because the clusters are often naturally formed based on a relevant factor (like geographic location), if this factor influences the topic that is being studied, the clustering design will influence the validity of the results. Still, cluster sampling, when done correctly, alleviates many of the difficulties with practically implementing a simple random sample.

Remark (Systematic Sampling as Cluster Sampling). Mathematically, systematic sampling can be seen as a particular form of cluster sampling. To see this, note that, once $k$ and $m$ are defined, the set of individuals who are included in any given sample is completely defined and grouped together. As a result, you could form these groupings of individuals in advance, and treat them as clusters. Then, instead of sampling individuals, you are sampling a cluster of individuals.

The key difference between cluster sampling and systematic sampling is that the clusters in systematic sampling are not typically naturally defined. There is not normally going to be a clear separation forming the groups in this way, and as a result, it may not be easier to run a geographically isolated systematic sample than it is to run a geographically isolated simple random sample. However, the understanding of equivalence mathematically is useful when data from these samples are to be analyzed as the tools from cluster random sampling can be put to use to validly analyze systematic data.

10.3.4 Stratified Random Sampling

The key issue with cluster sampling is that natural clusters of individuals typically exhibit self-similarity. This is an issue when your sample is formed via complete clusters, however, it can be turned into a benefit to ensure a greater reliability of the sample itself. Exploiting this self-similarity leads to a sampling technique known as stratified sampling.

Definition 10.11 (Stratified Sampling) In stratified sampling the population is divided into subpopulations known as strata. These strata should be comprised of groups of similar individuals. A simple random sample is formed within each stratum, and these simple random samples are combined to form the overall sample.

The major benefits of stratified sampling are two-fold. First, you will typically have more precision in your conclusions being drawn than from other sampling schemes. The reason being that individuals within a stratum will be similar to one another, and so the variability that will arise based on which individuals are included is smaller than in other sampling schemes. Second, you are able to split your data into the different subpopulations describing results and conclusions for each group of individuals. This allows us to make conclusions both at the population level as a whole, but also at the subpopulation level, which is often of direct interest.

To implement stratified sampling, you need to decide how many individuals will be sampled within each stratum. A simple but effective way of doing this is to use proportional allocation. With proportional allocation the strata that are larger will have more members sampled than the strata which are smaller, which is a natural decision to make. To implement stratified random sampling with proportional allocation:

1. Divide the population into strata, based on a relevant and natural dividing criterion.
2. Within each stratum, conduct a simple random sample of size $n_j$, where $n_j$ is given by $n\times\frac{N_j}{N}$, where $n$ is the desired sample size, $N_j$ is the size of the $j$th stratum, and $N$ is the size of the population as a whole. This should be rounded to the nearest whole number.
3. Form the overall sample by combining the members sampled from each of the strata.

Example 10.7 (Charles and Sadie Push for Transportation Infrastructure) Charles and Sadie have faced some push-back on their attempts to increase access to well-funded public transportation options within their city. To understand better where the resistance is coming from, they decide to conduct a survey of households in the town. There are $18940$ total homes in the city, and Charles and Sadie figure that perhaps household income levels will influence opinions on investments in public infrastructure. Charles and Sadie categorize $940$ of these households as high income, $10000$ as middle income, and the remaining $8000$ as low income. Suppose that they want a sample of size approximately $75$ from this population.

Describe how a stratified sample can be formed in this setting.
What potential drawbacks are there with using stratified sampling here?
What other factors may have been useful to segment the population into natural strata?

Solution

This population is made up of three strata, with $N_1 = 940$, $N_2 = 10000$, and $N_3 = 8000$. Using proportional allocation this will result in a sample from the first of size $n_1 = \dfrac{940}{18940}\times 75 = 3.7 \approx 4$. For the second we get $n_2 = \dfrac{10000}{18940}\times 75 = 39.6 \approx 40$. For the third we get $n_3 = \dfrac{8000}{18940}\times 75 = 31.7 \approx 32$. Taken together this gives $76$ as the sample size, which is close to desired. Then, three separate simple random samples are to be formed. The first is a sample of $4$ homes randomly selected from the numbers $\{1,\dots,940\}$ labelling the high income households, the second is $40$ homes randomly selected from the $\{1,\dots,10000\}$ middle income households, and the remaining $32$ are selected from the $\{1,\dots,8000\}$ low income households. These three groups are combined, and interviewed to form the overall sample.
The major drawback is that this is likely to result in a costly sampling scheme. Specifically, as contrasted with the cluster sampling in Example 10.6, there is no guarantee that Charles and Sadie would not be required to travel across many of the blocks in their city. It is possible that to visit the $76$ households, they have to visit $76$ different blocks, which greatly increases the cost of administering the sample as described.
It may have also been useful to stratify the population, for instance, based on vehicle ownership, based on commuter status, or based on political leanings. Each of these is likely to be relevant to subgroup analyses, while also reducing the variability within the strata themselves.
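The proportional allocation used in this solution can be checked with a short Python sketch. The names `proportional_allocation` and `stratified_sample`, the stratum labels, and the seed are hypothetical. Note also that Python's `round` sends exact halves to the nearest even number, which does not matter for the values here.

```python
import random


def proportional_allocation(strata_sizes, n):
    """Return n_j = n * N_j / N, rounded to the nearest whole number, per stratum."""
    N = sum(strata_sizes.values())
    return {name: round(n * N_j / N) for name, N_j in strata_sizes.items()}


def stratified_sample(strata_sizes, n, seed=None):
    """Draw a simple random sample of size n_j within each stratum."""
    rng = random.Random(seed)
    allocation = proportional_allocation(strata_sizes, n)
    return {
        name: sorted(rng.sample(range(1, strata_sizes[name] + 1), n_j))
        for name, n_j in allocation.items()
    }


strata = {"high income": 940, "middle income": 10000, "low income": 8000}
allocation = proportional_allocation(strata, 75)
print(allocation)                # {'high income': 4, 'middle income': 40, 'low income': 32}
print(sum(allocation.values()))  # 76

sample = stratified_sample(strata, 75, seed=1)
print(sample["high income"])     # four labels between 1 and 940
```

Because each $n_j$ is rounded separately, the total need not equal the requested $n$ exactly. Here it comes to $76$ rather than $75$.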

While stratified samples are often the most efficient samples, statistically, they do not alleviate many of the practical concerns regarding simple random samples. In fact, in some cases, these issues may even be exacerbated. If, for instance, strata are formed on the basis of geographic location, then forming a stratified sample will guarantee that it is required to visit each geographic location. The sampling scheme which is selected should be concordant with the goals of the analysis as well as the restrictions and constraints that are at play.

10.3.5 Multistage Sampling

Sometimes the nature of the population, question of interest, or constraints at play render any single sampling design ineffective to construct a useful sample. In these cases multistage sampling can be an effective way of tailoring the sampling design to the specific requirements.

Definition 10.12 (Multistage Sampling) In multistage sampling, cluster sampling is combined with one or more of the other discussed sampling techniques, into a multistage procedure to effectively and efficiently target the population of interest. Multistage sampling may combine cluster sampling with simple random sampling, systematic random sampling, further cluster sampling, or stratified sampling in different sequences and orders to achieve a custom-specified sampling scheme.

For example, household surveys are commonly run. To do so, a researcher may randomly sample cities in the area of interest. Within those cities, they may stratify based on region in the city, and within each region cluster based on the blocks. Then, a simple random sample within the blocks is taken, and those households are selected. This sampling scheme can be understood in relation to each of the component sampling schemes, and creates a flexible way of constructing a sampling scheme which meets the needs of the situation.

10.4 Experimental Design

While sampling is often a useful framing for collecting data, there are times when our goal is not to simply understand the way that a population is, but rather to understand the impact of a particular intervention. Consider, for instance, studies which look at the efficacy of medical treatments, the utility of new fertilizers on agricultural yield, or the impact of political interventions on climate change. In each of these cases we are not interested in the current state of a population, but rather how some action influences the state of a population. For this, we turn to the process of experimental design. In experimental design we seek to understand how experimental units have their response variables impacted, based on the levels of particular factors, or treatments. In plain language our goal is to understand how specific interventions influence some trait in a population. To proceed we formally define each of these concepts.

Definition 10.13 (Experimental units) Individuals or items upon which the experiment is being performed. If the experimental units are humans, we will often call them subjects or patients, in place of units.

Definition 10.14 (Response Variable) The characteristic or trait of the experimental unit that is measured or observed. The experiment’s purpose is to understand how a response variable reacts to a particular intervention.

Definition 10.15 (Factor) A variable whose effect on the response variable is of interest. The factor is the variable which is being controlled or manipulated within the experiment, and is the cause of the change in response variables.

Definition 10.16 (Levels) The possible values of a factor. We are often comparing two or more levels of the factor, determining how specific levels impact the response variable.

Definition 10.17 (Treatment) A treatment refers to the complete experimental condition. That is, the set of all levels across all factors that are assigned to an experimental unit. In one-factor experiments, a treatment is simply a level of the single factor. In multifactor experiments, a treatment is a combination of the levels of the factors.

In an experiment, experimental units are observed after having been given the treatment. The response variable is measured, and compared across the different levels of the factors (or across the different treatments) to determine how the various treatments impact the outcome. In a well-structured experiment it is possible to conclude that the treatment causes a particular impact on the response variable, so long as the analysis takes into account the limitations of the experiment that was run.

Example 10.8 (Sadie’s House Plant Growth) Sadie has become very interested in understanding the conditions under which houseplants will thrive. Through some informal experimentation Sadie believes that the frequency of watering, type of fertilizer, and amount of sunlight are all important factors in the plant growth. Sadie is predominantly interested in determining the impact on the height of the plants.

Indicate the experimental units, response variable, factor(s), possible level(s), and treatment(s) in this proposed experiment.
Indicate further examples of response variables and factors that may be relevant to Sadie’s question of interest.

Solution

In this experiment, the units are the houseplants that Sadie is watching. The response variable will be the height of the plants (measured in, for instance, centimeters). The factors under consideration are the frequency of watering, type of fertilizer, and the amount of sunlight. Some possible levels for frequency of watering may be weekly, versus biweekly, versus monthly. Some possible levels for type of fertilizers may be natural compost, synthetic compost, or no fertilizer. Some possible levels for the amount of sunlight may be direct, indirect, or no sunlight. The treatments will consist of the combinations of the different levels for the three outlined factors (for instance, weekly watering, with natural compost, and direct sunlight; weekly watering, with natural compost, and indirect sunlight; etc.). If these levels are considered, then there will be a total of $3\times 3 \times 3 = 27$ treatments under consideration.
Sadie may instead wish to measure something like leaf size, or stem width, or number of flowering plants. Each of these will produce different results, and will be useful for slightly different sets of considerations that Sadie may have. For other factors, Sadie may consider the pot size, or perhaps exposure to other stimuli.¹²
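The count of $27$ treatments in the first part of this solution can be verified by enumerating every combination of levels. The following Python sketch uses the example levels listed in the solution. The dictionary layout and variable names are assumptions made for illustration.

```python
from itertools import product

# Levels taken from the solution above, one list per factor.
factors = {
    "watering": ["weekly", "biweekly", "monthly"],
    "fertilizer": ["natural compost", "synthetic compost", "no fertilizer"],
    "sunlight": ["direct", "indirect", "no sunlight"],
}

# A treatment is one level from every factor.
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(treatments))  # 27
print(treatments[0])
# {'watering': 'weekly', 'fertilizer': 'natural compost', 'sunlight': 'direct'}
```

Adding a fourth factor, or another level to an existing factor, multiplies the number of treatments, which is worth keeping in mind when the number of available units is limited.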

10.4.1 The Principles of Experimental Design

Just as there were many different schemes for constructing a sample, there are also many different experimental designs. We will explore two common designs, but it is useful to first understand the guiding philosophy of experimentation in Statistics. There are three key principles that ensure that data collected from an experiment are useful for drawing scientific conclusions.

First is the idea of statistical control. It is not enough to know whether a particular treatment was followed by a positive response in the response variable. Instead, we require there to be some point of comparison. We call this point of comparison a statistical control. The idea is that we should always be comparing two or more treatments, even when our interest is in one particular treatment. If the treatment of interest is truly beneficial, we should be able to see this by comparison to other treatments. This way, we are able to ensure that the changes in the response variable that we see are related to our intervention, rather than to random chance. Sometimes we wish to see whether a particular treatment is effective, and there does not exist another treatment option that is a plausible candidate. In these cases, we will often take a treatment option to be “nothing at all”, which is to say no direct intervention on the given factor.¹³ In this way our control can still be used to see if our active intervention improves over doing nothing, when that is the alternative.

Beyond statistical controls, experiments rely on randomization and replication. With randomization, the idea is that the treatment that each experimental unit gets should be randomly selected without consideration of the specific unit. This way unintentional selection bias can be avoided within the groups. For instance, in a medical study, if you end up giving the experimental treatment to patients who are otherwise healthier, then you may expect that the experimental treatment will produce better results not because it was more effective, but because the patients who received it are the ones who are expected to have better outcomes irrespective of treatments. In addition to randomization, replication serves a central role in experimentation. The experiment should be conducted on a sufficient number of experimental units to ensure that random noise does not cloud conclusions. In particular, the sufficient sample size will ensure that the groups created via randomization will truly resemble each other, and the more units in the study, the better able you are to discern differences which exist between treatments. These principles can be put to work across various different experimental designs, depending on the specific experimental setup and constraints that are at play.

10.4.2 Completely Randomized Design

The key question when defining an experimental design is how do the experimental units get assigned to the various treatment options. The most obvious choice is to randomly assign each unit to one of the treatment options, ignoring any underlying factors about the units. This is considered a completely randomized design.

Definition 10.18 (Completely Randomized Design) A completely randomized design is one in which the experimental units are randomly divided into groups, one for each treatment in the experiment. The treatments are then assigned to each of the groups, randomly selected for each.

Typically, we will consider an equal assignment of numbers of experimental units to each treatment option, though this is not strictly required. In the completely randomized design, we ignore anything that we may know about the experimental units beforehand.

Example 10.9 (Sadie’s Houseplants: Completely Randomized Design) Sadie is going forth with experimentation to understand how different factors impact the height of houseplants. Sadie decides to test treatments comprised of combinations of three factors: watering frequency (low frequency high volume versus high frequency low volume), fertilizer use (store bought fertilizer, versus homemade compost, versus no fertilizer), and sun exposure (direct sun exposure, versus indirect sun exposure, versus artificial light exposure). There are a total of $72$ houseplants, and Sadie wishes to use a completely randomized design.

Describe how this experiment can proceed as outlined by Sadie.

Solution

There are a total of $2\times3\times 3 = 18$ different treatment options. This means that for each treatment option, $\dfrac{72}{18} = 4$ houseplants should be assigned to it. We can form these $18$ groups by sampling without replacement from the numbers $1$ through $72$, $4$ at a time. Each number is assigned to one of the houseplants, and this forms the groups. For instance, the $18$ groups may be:

##       [,1] [,2] [,3] [,4]
##  [1,]   66   48   47   38
##  [2,]   17   34    1   15
##  [3,]   14    8   68   71
##  [4,]   35   26   62   42
##  [5,]   37   11   61    2
##  [6,]   72    4   19   67
##  [7,]   41   64   18   25
##  [8,]   69   39   10   70
##  [9,]   50   53   30   33
## [10,]   40   29   31   24
## [11,]   16   49   32   36
## [12,]   20   44   55   51
## [13,]   22   65   59   54
## [14,]   60    6   13   45
## [15,]   23    3   52   46
## [16,]    9    5   43   57
## [17,]   58   28   63    7
## [18,]   56   27   21   12

Then, each of the groups gets assigned one of the treatments (low frequency/store bought/direct exposure; low frequency/store bought/indirect; low frequency/store bought/artificial; etc.). The treatments are given to the units, and then the heights are measured and recorded alongside the treatment that was assigned.
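One way this assignment could be carried out is sketched below in Python. This is an illustration under stated assumptions rather than the procedure used to generate the table above: the function name `completely_randomized_design`, the level labels, and the seed are hypothetical, and the groups produced will differ from those shown.

```python
import random
from itertools import product


def completely_randomized_design(units, treatments, seed=None):
    """Shuffle the units, then deal them out into equal-sized treatment groups."""
    if len(units) % len(treatments) != 0:
        raise ValueError("this sketch assumes equal group sizes")
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    per_group = len(shuffled) // len(treatments)
    order = list(treatments)
    rng.shuffle(order)  # treatments are matched to the groups at random
    return {
        treatment: sorted(shuffled[i * per_group:(i + 1) * per_group])
        for i, treatment in enumerate(order)
    }


watering = ["low frequency", "high frequency"]
fertilizer = ["store bought", "homemade compost", "no fertilizer"]
sun = ["direct", "indirect", "artificial"]
treatments = list(product(watering, fertilizer, sun))  # 18 combinations

design = completely_randomized_design(range(1, 73), treatments, seed=1)
print(len(design))  # 18
print(design[("low frequency", "store bought", "direct")])  # four plant labels
```

Nothing about the plants themselves enters the assignment. Only the labels $1$ through $72$ are used, which is what makes the design completely randomized.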

There are at least two shortcomings of completely randomized designs which we may wish to overcome. The first is that, in a completely randomized design, we may not be able to understand the impact on subpopulations of interest. Because randomization occurs without consideration of any other factor it is also not possible to directly analyze how treatment may impact the outcome variable segmented by these factors. While this is often not the primary question of interest, it will often be the case that having answers to these types of questions is desirable. Second, a completely randomized design may be less efficient at capturing the true effect of treatment, when treatment is mediated by other factors. If some groups of experimental units respond more favourably¹⁴ than others, ensuring that treatment allocation is split within these groups will lead to more precise estimates of the true treatment effect. As a result, we will often turn to more involved experimental designs to allocate treatment options.

10.4.3 Randomized Block Design

When we wished to exploit the structure of a population in sampling, making use of systematic differences, we divided the population into groups called strata on the basis of these traits. We can do the same thing with our experimental units forming blocks of experimental units. This blocking procedure gives rise to the randomized block design, an alternative to a completely randomized design.

Definition 10.19 (Randomized Block Design) A randomized block design assigns treatments randomly to all units within a block of experimental units. That is, the experimental units are separated into various blocks, and then within each block a completely randomized procedure is used.

With the use of the block design, you are able to assess not only whether there is an overall impact of treatment on the response variable, but also whether this impact of treatment is impacted¹⁵ through the blocking factor(s). This may be of scientific interest directly, and it also may help to ensure that random noise does not erode the ability to discern the true impact of treatment on the outcome. Typically, blocking factors will be natural factors which are suspected, or known, to influence the outcome, but which are of secondary interest to the experimenter.

Example 10.10 (Sadie’s House Plants: Randomized Block Design) Of Sadie’s $72$ plants, $18$ of them are species of trees, $18$ of them are species of vines or other crawlers, and the remaining $36$ are flowering plants. If Sadie decides to test treatments comprised of combinations of three factors, watering frequency (low frequency high volume versus high frequency low volume), fertilizer use (store bought fertilizer, versus homemade compost, versus no fertilizer), and sun exposure (direct sun exposure, versus indirect sun exposure, versus artificial light exposure), how can a randomized block design be used to perform this experiment?

Solution

Here, the type of plant is the natural blocking factor. There are $18$ different treatment options. This means that for each type of plant we perform a completely randomized design with $18$ treatments. To do so, we can assign each of the trees to a single treatment option (there are $18$ trees and $18$ treatments, so this is a matter of simply randomizing the order of treatments and assigning one to each), and we can assign each of the vines to a single treatment option in the same way. For the flowering plants, there are $36$ of them and $18$ treatments, so each treatment should have two different plants. If we label the flowering plants $1$ through $36$, then we can simply draw without replacement from the numbers $1$ through $36$, $2$ at a time. For instance, the $18$ groups may be:

##       [,1] [,2]
##  [1,]    2   11
##  [2,]   17    4
##  [3,]   14   10
##  [4,]   36   21
##  [5,]   15    1
##  [6,]    9   29
##  [7,]    3   12
##  [8,]   18    7
##  [9,]    8   26
## [10,]   16   35
## [11,]   20   24
## [12,]   22    5
## [13,]   25   30
## [14,]   23   19
## [15,]   31   32
## [16,]   27    6
## [17,]   33   34
## [18,]   28   13

Then, each of the groups within each of the blocks gets assigned one of the treatments (low frequency/store bought/direct exposure; low frequency/store bought/indirect; low frequency/store bought/artificial; etc.). The treatments are given to the units, and then the heights are measured and recorded alongside the block and the treatment that was assigned.
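The blocked assignment can be sketched in Python by repeating a completely randomized assignment separately within each block. The function name `randomized_block_design`, the block and level labels, and the seed are hypothetical, and plants are assumed to be labelled from $1$ within each block.

```python
import random
from itertools import product


def randomized_block_design(blocks, treatments, seed=None):
    """Within each block, shuffle the units and split them evenly across treatments."""
    rng = random.Random(seed)
    design = {}
    for block_name, units in blocks.items():
        if len(units) % len(treatments) != 0:
            raise ValueError("this sketch assumes equal group sizes within a block")
        reps = len(units) // len(treatments)
        shuffled = list(units)
        rng.shuffle(shuffled)
        # Consecutive slices of the shuffled units go to each treatment in turn.
        design[block_name] = {
            treatment: sorted(shuffled[i * reps:(i + 1) * reps])
            for i, treatment in enumerate(treatments)
        }
    return design


watering = ["low frequency", "high frequency"]
fertilizer = ["store bought", "homemade compost", "no fertilizer"]
sun = ["direct", "indirect", "artificial"]
treatments = list(product(watering, fertilizer, sun))  # 18 combinations

blocks = {
    "trees": range(1, 19),
    "vines": range(1, 19),
    "flowering": range(1, 37),
}
design = randomized_block_design(blocks, treatments, seed=1)
first = treatments[0]
print(len(design["trees"][first]), len(design["flowering"][first]))  # 1 2
```

Every treatment appears in every block, so differences between plant types cannot be mistaken for differences between treatments.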

10.5 Data Description and Organization

Whether data are collected via sampling or experimentation, it is important to ensure that statistical principles are followed so that the observed data are representative of the population of interest and useful for accomplishing the goals of the statistical analysis. There is a substantial amount of statistical work which goes into ensuring that data collection is valid. Once valid data have been collected, we must do something with them. As previously discussed, there are typically four use cases for data. In these notes, our focus is on description and inference. Before we can use the data to describe patterns or conduct inference, we must first develop a shared language around what data are. Some of this has been informally introduced throughout our discussions thus far, however, the formal definition remains important for ensuring the foundation for statistical analyses.

Definition 10.20 (Variable) A characteristic or trait that can vary from one observation to the next is called a variable. Variables are the relevant pieces of information that are recorded in our data. We may have one or more variables recorded for each individual unit in our data.

Definition 10.21 (Observation) An observation is an individual piece of data. Our data are comprised of multiple observations across the various units in our sample (or on our experimental units).

Generally speaking, we make observations of variables, and together this forms our data. We use the data to answer the questions of interest or conduct our analyses. Every variable can be categorized as either a qualitative or quantitative variable. Qualitative variables are the non-numeric variables we observe, following categories or other less structured formats. Quantitative variables are numeric.

Definition 10.22 (Qualitative Variable) Any variable which is not numerical, such as one whose values fit into categories, is referred to as a qualitative variable.

Definition 10.23 (Quantitative Variable) A quantitative variable is any variable which is described numerically.

Quantitative variables can be either discrete or continuous. We saw this distinction when working with random variables, and the distinction is the same for variables in a collected dataset. A variable is considered discrete if it can take on a countable number of values, that is, values that can be listed. A variable is considered continuous if it can take on any value from a defined range of values. Oftentimes, just as with random variables, we make the distinction based on how we wish to think about the variables, rather than based on the theoretical underlying truth.¹⁶

Definition 10.24 (Discrete Variable) A quantitative variable which can take on values from a countable set is called discrete. There are either finitely many options for the variable, or else a countably infinite number.

Definition 10.25 (Continuous Variable) A quantitative variable that can take on an uncountably infinite number of values is called continuous. Continuous variables can theoretically take on any value over a range of values, with the possibilities unable to be enumerated.
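To make these definitions concrete, the following is a minimal Python sketch built on a small made-up dataset of plants. The values, the variable names, and the helper `classify` are all hypothetical. Each dictionary holds the observations recorded on one unit, and each key is a variable. The rule used to separate discrete from continuous (whole-number values versus decimal values) is a simplifying assumption about how we choose to treat each variable, not a formal test.

```python
# Hypothetical data: each dict holds the observations recorded for one unit (a plant).
plants = [
    {"species": "fern", "leaf_count": 12, "height_cm": 30.6},
    {"species": "basil", "leaf_count": 25, "height_cm": 18.2},
    {"species": "fern", "leaf_count": 9, "height_cm": 27.4},
]

def classify(values):
    """Label a variable from its observed values (a rough heuristic only)."""
    def is_int(v):
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(is_int(v) for v in values):
        return "quantitative (discrete)"
    if all(is_number(v) for v in values):
        return "quantitative (continuous)"
    return "qualitative"

# Collect the observations of each variable across units, then label the variable.
variable_types = {
    name: classify([unit[name] for unit in plants])
    for name in plants[0]
}

for name, kind in variable_types.items():
    print(f"{name}: {kind}")
```

Here `species` is treated as qualitative, `leaf_count` as discrete, and `height_cm` as continuous, matching the way we would most naturally think about each variable.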

Example 10.11 (Charles and Sadie Categorize Variables) Charles and Sadie realize that oftentimes there are many ways of measuring qualities or traits that are of interest in a study. Upon realizing this, they begin to discuss a number of topics, considering how they may be measured.

For each of the following traits, discuss different options for variables that could represent the quantity discussed. For each, include possibilities which are qualitative and those which are quantitative, and specify whether the quantitative are discrete or continuous.

Charles suggests that there are many ways of thinking about attained education.
Sadie, still thinking of plants, realizes that there may be many ways of thinking about the size of different plants.
Based on an overheard conversation, Charles wonders how socioeconomic status may be measured.
Sadie, after enjoying a snack at the coffee shop, thinks about how we might measure the quality of food.

Solution

Attained education may be a qualitative variable if it is described via categories (for instance, high school, college diploma, undergraduate degree, etc.). It could also be made quantitative if, for instance, you measured the total number of years of formal education that someone had completed. This would be a discrete quantitative variable.
The size of different plants could be qualitative by using subjective sizes (for instance, small, medium, and large). It is possible to measure size quantitatively as well, for instance by considering the height or volume of the plant. Height and volume are both likely best considered continuous, but they may be discretized in certain datasets.
Socioeconomic status can be a qualitative variable if, for instance, it is categorized on the basis of high/medium/low. To form a quantitative variable here you may consider household income, or wealth. Depending on how income or wealth are measured it may be discrete (for instance, counting the number of thousands of dollars) or continuous (if you listed exact dollar figures). This is an example of a variable which is always discrete, technically, but may be better treated as continuous.
Food quality could be categorically rated (bad/average/good), or on a similar graded scale (S/A/B/C/D). Alternatively, the food could be subjectively graded numerically, giving for instance a star rating (out of $5$) or a rating out of $100$. Likely these ratings would be discrete, but it would be possible to come up with a continuous rating scale if that were desired.
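The plant-size discussion in the solution can be illustrated with a short Python sketch showing one trait recorded three ways: as a continuous measurement, as a discretized measurement, and as a qualitative category. The heights, the cut-off values, and the function `size_category` are assumptions made up for this illustration.

```python
# Hypothetical heights in centimetres, treated as continuous measurements.
heights_cm = [4.7, 12.3, 30.6, 18.2, 55.0]

def size_category(height_cm, small_below=10.0, large_from=30.0):
    """Map a height to a subjective size label; the cut-offs are arbitrary assumptions."""
    if height_cm < small_below:
        return "small"
    if height_cm < large_from:
        return "medium"
    return "large"

# Discrete version: heights recorded to the nearest whole centimetre.
heights_rounded = [round(h) for h in heights_cm]

# Qualitative version: heights replaced by size categories.
size_labels = [size_category(h) for h in heights_cm]

for h, r, label in zip(heights_cm, heights_rounded, size_labels):
    print(f"continuous: {h} cm | discrete: {r} cm | qualitative: {label}")
```

Each step down this list discards some information, so the choice of representation should follow from the questions the data are meant to answer.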

10.6 From Data to Insight

Whether data are collected via sampling or via experiments, the data themselves are not particularly useful for insight. If you are presented with a large dataset, it will likely not be possible to directly interpret the data, or communicate a message. Instead, we need to take the data as input and convert them to more useful products. The remainder of these notes will focus on ways of doing this within statistics.

We will focus both on how to summarize and communicate data that have been collected, and then how to begin gaining insight from these data. These are the first two roles of statistics, as introduced before: description and inference. All the roles that statistics plays build from the idea that we have been able to collect data which are somehow relevant and representative of the underlying population of interest. We investigate the data, describe what has been observed in the sample, or attempt to conduct inference not because we are interested in the data themselves, but because we hope that the data will be reflective of the population of interest. We are not primarily interested in the statistics that we calculate, but in what these statistics say about the parameters of interest.

It is important, as we begin to explore how we can use data directly, to keep in mind that the entire statistical enterprise relies on having high-quality data available. This relies on having measured the factors that we care about, in ways that are meaningful. It relies on representative samples and well-designed experiments. Without adhering to the principles discussed throughout this chapter, statistics cannot proceed in a way which addresses our goals. High-quality data are not a substitute for statistical analysis, but they are a prerequisite for it.

For instance, we assume that particular probability mass functions hold, that certain distributions are present, that random quantities are independent.
Note, the word data is a plural noun in English. That is, we say “The observed data are …” rather than “The observed data is …” Some statisticians care deeply about correcting this misconception, and forget how weird those types of sentences sound to non-statisticians. I promise it will eventually sound more familiar!
or processes
Be that individuals or objects.
It is not impossible to do so. For instance, governments often run national censuses, which are a full survey of every member of the population of a country. These are incredibly large undertakings, however, and are not feasible in many settings.
The sample.
The probability of this is $0.00000000000000000000094036353533487965306938402439390200376889694666715513449162$, which is very, very small, but not zero.
This, as we will remember from our previous chapters, makes the hypergeometric distribution deeply connected to simple random sampling.
There are ways of doing this, but they are more complex and fall beyond the scope of these notes.
Instead of selecting $m$ between $1$ and $k$.
For instance, $3$ and $5$ both show up twice, while every other element shows up only once.
For instance, some people claim that music will help with plant growth.
In medical studies this takes the form of a placebo: a treatment which looks like a medical treatment but has no active ingredients.
Or less favourably…
Or “mediated”.
As a result, times, heights, or volumes will often be considered continuous even if theoretically they are discretized.
