Statistics & Probability
Statistics:
Basic Terms
Population: every member of a certain group.
Sample: a subset of the whole population.
Discrete Data: when there are only a certain number of values possible, often countable. For example: the number of siblings, number of pages, etc.
Continuous Data: data that can take any value within a range, usually measured and approximated/rounded. For example: height, mass, time taken, etc.
Range: a basic measure of spread, found by: highest value - lowest value. This figure can easily be distorted by extreme values.
Reliability: can be compromised by missing data, a small sample size, or errors. Outlier values often arise from errors (discussed more later).
Interpolation: estimating unknown values that fall within the range of a set of known data points.
Extrapolation: estimating unknown values that fall outside the range of a set of known data points. This can be highly inaccurate and should be avoided.
Average: also known as central tendency. The three measures are listed below, with a quick code check after the list.
- Mode: the most frequent data value
- Median: the middle of an ordered list. The $\left(\frac{n+1}{2}\right)$th value.
- Mean: the sum of the values, divided by the total number of values: $$\large \bar x=\frac{\sum x_i}{n}$$
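As a quick sanity check, all three can be computed with Python's built-in statistics module (the data set here is purely hypothetical):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9]  # hypothetical data set, already ordered

print(statistics.mode(data))    # 3: the most frequent value
print(statistics.median(data))  # 5: the ((7+1)/2) = 4th value
print(statistics.mean(data))    # 37/7 ≈ 5.29
```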
Sampling Methods
There are various sampling methods you are expected to know for this course:
- Simple Random: when a sample is randomly selected (e.g. picking names from a hat)
- Convenience: a method that is most accessible for the sampler (e.g. asking the people that are in your class)
- Systematic: when you randomly select the first data point, then select the rest at regular intervals (e.g. questioning every 10th person that walks by; see the sketch after this list)
- Stratified: when the population is split into subgroups (strata) and a random sample is taken from each, in proportion to the size of that subgroup (e.g. if 60% of a school are boys, 60% of the sample are boys)
- Quota: similar to stratified in that set numbers are taken from each subgroup of the population, but the individuals within each subgroup are chosen non-randomly, simply until each quota is filled.
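A minimal Python sketch of the random element in two of these methods, assuming a hypothetical population of 100 numbered members:

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 numbered members

# Simple random sample: every member is equally likely to be chosen
simple = random.sample(population, 10)

# Systematic sample: randomly select the first data point,
# then select the rest at regular intervals of k
k = 10
start = random.randrange(1, k + 1)     # random starting member
systematic = population[start - 1::k]  # every k-th member from there

print(simple)
print(systematic)
```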
Presentation of Data
Histograms: similar to bar charts, but more suitable for continuous data, as they show the frequency of values falling within various intervals.
Frequency Tables: instead of a long list of data, it is often more efficient to state how many (the frequency) of each value there is.
Quartiles: As mentioned earlier, the range can be distorted by extreme values, so an alternative is to assess how spread out the central 50% of the data is. This is called the interquartile range. The process is explained below:
- Quartile 1 (Q1): the value that is ¼ of the way through the ordered data.
- Quartile 2 (Q2): the median of the ordered data.
- Quartile 3 (Q3): the value that is ¾ of the way through the ordered data.
- You can think of Q1 and Q3 as the "medians" of each half of the data.
- Hence, the interquartile range (IQR) can be calculated as follows (see the sketch below): $IQR=Q_3-Q_1$
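A quick check on a hypothetical data set; note that quartile conventions differ slightly between textbooks, software, and GDCs, so values may not match your calculator exactly:

```python
import statistics

data = [1, 3, 4, 5, 7, 8, 9, 12]  # hypothetical ordered data, n = 8

# statistics.quantiles cuts the data into 4 equal groups;
# the "exclusive" method uses (n+1)-style positions
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

iqr = q3 - q1  # spread of the central 50% of the data
print(q1, q2, q3, iqr)
```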
Box & Whisker Plots: Using the information above, and combining the median with quartiles and extremes, we can make a diagram that displays the spread. This is known as a box and whisker plot, where the rectangle represents the IQR, and the line in the middle is the median (or quartile 2).
Cumulative Frequency Graphs: This is a line graph showing the cumulative frequency ("a running total") of how many data points are less than a particular x value. We plot the upper boundary of each interval against the cumulative frequency.
Percentiles:
Percentiles are measures that indicate the value below which a certain percentage of observations in a sample falls. For example, if you are at the 90th percentile for a test, it means you scored better than 90% of the participants.
Outliers:
As you may know, an outlier is a data point that differs significantly from other observations in a dataset. However, this can be a very subjective definition, hence, in statistics, the following criterion is used: an outlier is a point more than $1.5\cdot IQR$ below $Q_1$ or above $Q_3$.
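Applying that rule in code, again on hypothetical data:

```python
import statistics

data = [1, 3, 4, 5, 7, 8, 9, 30]  # hypothetical data with one suspect value

q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# An outlier lies more than 1.5 * IQR below Q1 or above Q3
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)  # [30]
```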
Mean and Frequency:
$$\qquad \qquad \large \bar x = \frac{\sum f_ix_i}{\sum f_i} \qquad \small \text{Where $f_i$ is the frequency of the $i$th value, and $x_i$ is the $i$th value}$$
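Worked through on a small hypothetical frequency table:

```python
# Mean from a frequency table: sum(f_i * x_i) / sum(f_i)
values      = [1, 2, 3, 4]  # hypothetical x_i
frequencies = [5, 9, 4, 2]  # hypothetical f_i

mean = sum(f * x for f, x in zip(frequencies, values)) / sum(frequencies)
print(mean)  # (5 + 18 + 12 + 8) / 20 = 2.15
```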
Measures of Dispersion
Instead of finding where the centre of the data is, we often need to know how spread out the data is (measures of dispersion). We have already discussed some examples of this (the range and IQR); however, these are fairly basic.
Standard Deviation (σ):
This is the square root of the average squared difference between each point and the mean:
$$σ = \sqrt{\frac{\sum (x_i-μ)^2}{n}} \qquad \small \text{where $μ$ is the mean}$$
Due to the complexity, you are not expected to do this manually, and should instead use a graphing calculator:
- TI-84: STAT >> 1:Edit >> enter the data in a column >> STAT >> CALC >> 1:1-Var Stats >> ENTER
Variance (σ²):
This is simply the standard deviation squared:
$$\text{Var}(x) = σ^2 = \frac{{\sum {(x_i-μ)}^2}}{n}$$
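To verify a GDC answer, Python's statistics module has population versions of both (these divide by $n$, matching the formulas above; a GDC's $S_x$ divides by $n-1$ instead):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

# Population versions divide by n, matching the formulas above;
# statistics.stdev/variance would divide by n - 1 instead
print(statistics.pstdev(data))     # σ = 2.0
print(statistics.pvariance(data))  # σ² = 4.0
```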
Regression & Correlation
Pearson's Product-Moment Correlation Coefficient (r):
Given a set of $x$ and $y$ values, we can measure how closely their relationship follows a straight line using Pearson's product-moment correlation coefficient, $r$. It ranges from -1 to +1, so it also tells you the direction of the relationship. The correlation coefficient has a very complex formula, which you are not expected to calculate manually in this course.
- TI-84: STAT >> 1:Edit >> ENTER >> enter x's in L1, y's in L2 >> MODE >> check that 'Stat Diagnostics' is ON >> [2nd] QUIT >> STAT >> CALC >> 4:LinReg(ax+b) >> ENTER

Regression Line:
You may have worked with a line of best fit in the past, although this was most likely done by eye, with the only level of accuracy being that it should pass through the mean point $(\bar x,\bar y)$, if done properly. The regression line is a more rigorous version - a line that minimises the sum of the squares of the vertical distances between each point and the line. It should only be used with a strong correlation and never needs to be found manually.
Once you find the equation of the line (same buttons as finding $r$), you can use it to estimate values for $y$, given x-values, but NOT vice versa. There are two types of regression lines: y-on-x and x-on-y. The y-on-x line should only be used when you are given an x-value and need to estimate the y-value, while the x-on-y line should only be used when you are given a y-value and need to estimate an x-value.
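As a rough cross-check of the GDC output, Python's statistics module (3.10+) can compute both $r$ and the y-on-x line; the data below is purely hypothetical:

```python
import statistics  # correlation/linear_regression need Python 3.10+

x = [1, 2, 3, 4, 5]            # hypothetical x-values
y = [2.1, 3.9, 6.2, 8.1, 9.8]  # hypothetical y-values

r = statistics.correlation(x, y)           # Pearson's r
a, b = statistics.linear_regression(x, y)  # y-on-x line: y = a*x + b

print(r)            # close to +1: strong positive linear relationship
print(a, b)         # slope and intercept
print(a * 3.5 + b)  # estimate y for a given x-value of 3.5
```

Swapping the arguments, statistics.linear_regression(y, x), gives the x-on-y line for estimating an x-value from a given y-value.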
Probability:
Basic Terms
Trial: each time an experiment is repeated.
Outcomes: the possible results from a trial.
Sample Space: the set of all possible outcomes in an experiment/trial.
Event: set of outcomes of an experiment to which a probability is assigned.
Relative Frequency: number of times an event occurs divided by the total number of trials.
Probabilities of Outcomes
If all the outcomes in a trial are equally likely, and $A$ is your desired outcome, then the probability of $A$ occurring can be expressed as:
$$P(A)=\frac{\text {number of outcomes for $A$}}{\text{total number of outcomes}}=\frac{n(A)}{n(U)}$$
where $U$ is the set of all outcomes.
Complementary Events:
The complement of an event ($A'$) is the event of $A$ not occurring. For example, rolling an odd number and rolling an even number on a die are two complementary events. As an event occurring, or not occurring, covers all eventualities, we get:
$$P(A)+P(A')=1 \qquad \text{or} \qquad P(A')=1-P(A)$$
Overlapping of Probabilities
Venn Diagrams: these are very helpful for showing which outcomes belong in which events, even with multiple events. They use overlapping circles to show the intersection of the probability of different events.
Intersection: notated $A\cap B$, the intersection contains the outcomes that are part of both $A$ and $B$ (excludes outcomes that are part of just one).
Union: notated $A\cup B$, the union contains the outcomes that are just in $A$, just in $B$, and in both (i.e. their intersection).
Independent Events: this refers to events where one event occurring does not affect the probability of the other one occurring. For independent events: $P(A\cap B)=P(A)\cdot P(B)$
Mutually Exclusive: when there are no outcomes in the intersection (i.e. you can't have both events occur simultaneously). Hence, if $A$ and $B$ are mutually exclusive, then: $P(A\cap B)=0$
Using Venn diagrams, we can find the relationship between unions and intersections:
$$P(A\cup B)=P(A)+P(B)-P(A\cap B)$$
or for mutually exclusive events:
$$P(A\cup B)=P(A)+P(B)$$
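You can sanity-check the union formula by counting outcomes directly; here is a small sketch with Python sets for one roll of a die:

```python
U = {1, 2, 3, 4, 5, 6}  # sample space: one roll of a fair die
A = {2, 4, 6}           # event A: an even number
B = {4, 5, 6}           # event B: a number greater than 3

def p(event):
    return len(event) / len(U)

print(p(A | B))                # P(A ∪ B) = 4/6 directly
print(p(A) + p(B) - p(A & B))  # 3/6 + 3/6 - 2/6 = 4/6, same answer
```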
Conditional Probability
This refers to finding the probability that an event occurs, given that it is known that another event has definitely occurred. The symbol “│” is often used to mean “given that”. Conditional probabilities can be calculated using the following equation:
$$P(A|B)=\frac{P(A\cap B)}{P(B)}$$
which means that the probability that $A$ will occur given that $B$ has occurred is equal to the probability of the intersection of $A$ and $B$, divided by the probability of $B$ occurring.
If we look at conditional probability when the two events are also independent, then:
$$P(A|B)=\frac{P(A)\cdot P(B)}{P(B)}=P(A)$$
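A quick numeric sketch of the conditional probability formula, using counts for one card drawn from a standard 52-card deck:

```python
# One card drawn from a standard 52-card deck
n_U = 52
n_B = 12       # B: the card is a face card (J, Q, K)
n_A_and_B = 6  # A ∩ B: the card is a red face card

# P(A|B) = P(A ∩ B) / P(B) = (6/52) / (12/52) = 1/2
print((n_A_and_B / n_U) / (n_B / n_U))
```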
Random Variables
Discrete Random Variables:
This simply means that we have a random experiment where the outcomes take only discrete values, each with an assigned probability. A random variable, $X$, is said to be a discrete random variable if $X$ takes on a finite or countable number of possible values and the outcomes are random. Hence, we can understand that the sum of the probabilities always equals 1:
$$\sum P(X=x_i)=1$$
Expected Value:
This is very similar to the mean - it is what you would expect the mean to be if you repeated the experiment many times. It works much like calculating the mean from a frequency table, but as we have probabilities instead of frequencies, and the probabilities sum to 1, there is no need to divide, leaving us with the following:
$$E(X)=\sum x_i \cdot P(X=x_i)$$
One idea involving the expected value is the concept of a "fair game", which is an experiment where $E(X)=0$, and there is no gain or loss on average.
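A small sketch with a hypothetical probability table:

```python
# Hypothetical discrete random variable as a probability table
outcomes      = [-2, 0, 1, 5]
probabilities = [0.4, 0.3, 0.2, 0.1]

assert abs(sum(probabilities) - 1) < 1e-9  # probabilities must sum to 1

# E(X) = sum of x_i * P(X = x_i)
expected = sum(x * p for x, p in zip(outcomes, probabilities))
print(expected)  # -0.8 + 0 + 0.2 + 0.5 = -0.1, so not a fair game
```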
Binomial Distributions
Refers to situations where we have a fixed number $n$ of identical, independent random trials, each of which can be viewed as having 2 outcomes (which we may call success and failure), with a fixed probability of success $p$. The more basic questions may ask you to find the probability of observing $x$ successes out of $n$ trials; tougher questions may ask about the probability of a range of successes occurring. We can generalise the formula for binomial distributions as the following:
$$P(X=x)={}_n \mathrm{ C }_x \cdot (p^x)\cdot (1-p)^{n-x}$$
Using GDC: if we want to find the probability of getting exactly $x$ successes from $n$ trials, we use the binompdf function (reproduced in the sketch after the notation below):
- TI-84: [2nd] DISTR >> Scroll Down >> A: binompdf >> Enter n, p & x
- The binomcdf function works the same way, but finds the probability of at most $x$ successes from $n$ trials.
Notation: $X\sim B(n,p)$, means $X$ is a random variable that follows a binomial distribution, with $n$ trials, and $p$ as the probability of success
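Both GDC functions are easy to reproduce from the formula above; this sketch uses only Python's math.comb:

```python
from math import comb

def binom_pdf(n, p, x):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x)"""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(n, p, x):
    """P(X <= x): at most x successes"""
    return sum(binom_pdf(n, p, k) for k in range(x + 1))

# X ~ B(10, 0.3)
print(binom_pdf(10, 0.3, 4))  # P(X = 4)  ≈ 0.2001
print(binom_cdf(10, 0.3, 4))  # P(X <= 4) ≈ 0.8497
```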
Normal Distributions
Refers to a situation where we have a large set of one-variable data. It is a data set that is mostly centered around the mean, in a symmetrical manner. As it is continuous data, showing probabilities in a bar chart wouldn’t work, and instead, a smooth line is required, in the shape of a bell curve. The total sum of probability is still 1, but here, that means the area under the curve = 1.
Spread: if the mean $μ$ decides the central location, then the standard deviation $σ$ decides how spread out the data is, as follows:
- Around 68% of the data lies between $μ - σ$ and $μ + σ$
- Around 95% of the data lies between $μ - 2σ$ and $μ + 2σ$
- Around 99.7% of the data lies between $μ - 3σ$ and $μ + 3σ$
- Note: you will be expected to be able to recall these values on a paper 1 exam.
Finding Probability/Area: if you tell your GDC the $μ$ and $σ$, then you can ask what percentage of the population lies between certain values. That is the same as asking for the probability of somebody chosen at random being in that interval, or, the area under the curve between that interval. This process is called ‘normal cdf’ on your GDC.
- TI-84: [2nd] DISTR >> 2:normalcdf >> enter Lower Boundary, Upper Boundary, $μ$, $σ$
Finding Missing Boundaries: a GDC can also find an upper boundary, given $μ$, $σ$, and the area/probability below it. If you are instead given the area above a lower bound, use $1 - \text{area}$ to effectively turn it into an upper bound. On a GDC: [2nd] DISTR >> 3:invNorm.
Notation: $X\sim N(μ,σ^2)$, means $X$ is a random variable that follows a normal distribution, with $μ$ as the mean, and $σ$ as the standard deviation
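Python's statistics.NormalDist mirrors both GDC functions; the distribution below is purely a hypothetical example:

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)  # hypothetical X ~ N(100, 15²)

# "normalcdf": area between a lower and an upper boundary
print(X.cdf(115) - X.cdf(85))  # P(85 < X < 115) ≈ 0.68 (the 1σ rule)

# "invNorm": the boundary below which 90% of the area lies
print(X.inv_cdf(0.90))         # ≈ 119.2
```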
Standardisation:
All normal distributions can be viewed in the same way if we consider them as having a centre of $μ$ and measure x-values by how many standard deviations (number of $σ$) away from $μ$ they lie. Standardising a normal distribution means shifting the mean to $μ=0$ and setting $σ=1$; the new, standardised x-values, each a count of how many $σ$'s away from $μ$ a value lies, are called z-scores.
To do this, we use the formula:
$$z=\frac{x-μ}{σ}$$
where $x$ is the original x-value, and $μ$ & $σ$ are the original mean and standard deviation. This process will be most helpful for finding an unknown $μ$ or $σ$, with the help of the inverse norm.
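A sketch of that typical use, assuming a made-up problem where $σ$ is known and $μ$ is not:

```python
from statistics import NormalDist

# Made-up problem: X ~ N(mu, 5²) with P(X < 30) = 0.85; find mu.
z = NormalDist().inv_cdf(0.85)  # z-score on the standard N(0, 1) curve

# Rearranging z = (x - mu) / sigma gives mu = x - z * sigma
x, sigma = 30, 5
mu = x - z * sigma
print(mu)  # ≈ 24.8
```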