Population

A population is all subjects of interest in a particular study.

Sample

A sample is a subset of a population.

Census

A census is when you collect data for an entire population.

Sample Survey

A sample survey is when you collect data for a given sample.

Parameter

A parameter is a descriptive measure of a given population.

Statistic

A statistic is a descriptive measure of a given sample.

Variable (Statistics)

A variable is an attribute that describes something (e.g., a person, place, thing, or idea). The value of a variable can "vary" between entities. For example, a person's hair or eye colour is a variable that can take different values for different people.

Classifications

Variables can be classified into a few different categories and sub categories depending on the forms they take.

Qualitative

Qualitative variables are variables whose values belong to categories rather than measuring a magnitude.

Examples:

A person's blood type (A, B, AB, O)
A person's hair colour (Brown, Black, Blonde, etc.)
An animal's species (Dog, Cat, Bird, etc.)

Quantitative

Quantitative variables are variables whose values are numbers that represent different magnitudes.

Discrete

A quantitative variable is considered discrete if its possible values form a set of separate numbers (e.g., the natural numbers or the integers).

Examples:

The number of siblings a person has (e.g., 2, 3)
The number of people in a given area (e.g., 3,000 or 7 billion)

Continuous

A quantitative variable is considered continuous if its possible values form an interval (e.g., the real numbers between 3 and 4).

Examples:

The temperature in a room (e.g., 20.5 °C)
The weight of a person in kilograms (e.g., 20.1 kg)

Descriptive Statistics

Descriptive statistics is the practice of summarizing data, typically with graphs and numerical measures.

Bar Graph

A bar graph is a way to display a qualitative variable: each category is drawn as a single bar whose height is the frequency of that category.

Example

A campus press polled a sample of 300 undergraduates in order to study attitudes towards a proposed change in on-campus housing regulations. A summary of the results of the opinion poll is shown as follows:

Dot Plot

A dot plot is a way to display a quantitative variable in which each dot indicates one occurrence of a value in a sample.

Instructions

Draw a horizontal line and label it with the name of the variable
Mark regular values of the variable on it
For each observation, place a dot above its value on the number line

Example

The following data are the scores obtained on a midterm test on a 0-100 scale. Construct a dot plot.

10, 90, 95, 100, 65, 50, 60, 50, 90, 55, 60, 70
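The instructions above can be sketched in a few lines of Python. This is only an illustration: the function name `dot_plot_rows` is made up for this note, and the plot is rotated so that each value gets its own text row instead of a column above a horizontal axis.

```python
from collections import Counter

# Midterm scores from the example above.
scores = [10, 90, 95, 100, 65, 50, 60, 50, 90, 55, 60, 70]


def dot_plot_rows(values):
    """Return one text row per distinct value, with one '*' per occurrence."""
    counts = Counter(values)
    return [f"{value:>3} | " + "*" * counts[value] for value in sorted(counts)]


rows = dot_plot_rows(scores)
for row in rows:
    print(row)
```

Values that occur more than once (50, 60, and 90 here) show up as rows with two marks.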

Pie Chart

A pie chart is a way to display a qualitative variable as a circle in which each category is represented by a "slice". The size of each slice is proportional to the percentage of observations falling in that category.

Example

A campus press polled a sample of 300 undergraduates in order to study attitudes towards a proposed change in on-campus housing regulations. A summary of the results of the opinion poll is shown as follows:

Frequency Table

A frequency table is a listing of the possible values of a qualitative variable, together with the number of observations and/or the relative frequency for each value.

Example

A campus press polled a sample of 300 undergraduates in order to study attitudes towards a proposed change in on-campus housing regulations. A summary of the results of the opinion poll is as follows:

Response	Frequency	Proportion ^[1]	Percentage ^[2]
Support	150	0.500	50.0%
Neutral	50	0.167	16.7%
Oppose	100	0.333	33.3%
Total	300	1	100%

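1: Proportion is calculated as $\frac{\text{frequency}}{\text{total}}$, e.g., for Neutral $\frac{50}{300} \approx 0.167$.
2: Percentage is calculated as $\text{proportion} \times 100\%$, e.g., for Neutral $0.167 \times 100\% = 16.7\%$.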
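As a rough check of the table, the proportions and percentages can be recomputed from the raw frequencies. The helper name `frequency_table` is hypothetical, and rounding to three and one decimal places is chosen to match the table.

```python
# Poll frequencies from the table above.
counts = {"Support": 150, "Neutral": 50, "Oppose": 100}


def frequency_table(counts):
    """Return (label, frequency, proportion, percentage) rows."""
    total = sum(counts.values())
    return [
        (label, n, round(n / total, 3), round(100 * n / total, 1))
        for label, n in counts.items()
    ]


table = frequency_table(counts)
for label, n, proportion, percentage in table:
    print(f"{label}\t{n}\t{proportion:.3f}\t{percentage:.1f}%")
```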

Stem-and-Leaf Plots

A stem-and-leaf plot is a way to display a quantitative variable by splitting each value into a stem (the leading digit or digits) and a leaf (usually the last digit). The stems are listed in order down the left side of a table, including any intermediate stems that have no leaves, and each leaf is placed on the right side next to its stem. The leaves can then optionally be sorted.

Example

Consider the following data:

80, 85, 75, 90, 62, 50, 55, 65, 75, 82, 70, 25, 92, 57, 63, 72, 81, 95, 31, 69

The stem-and-leaf plot is as follows

Stem	Leaf
2	5
3	1
4
5	0 5 7
6	2 3 5 9
7	0 2 5 5
8	0 1 2 5
9	0 2 5
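A minimal sketch of how such a plot could be built in Python, assuming non-negative two-digit integers so that the stem is the tens digit and the leaf is the units digit. The function name `stem_and_leaf` is made up for this note.

```python
# Data from the example above.
data = [80, 85, 75, 90, 62, 50, 55, 65, 75, 82,
        70, 25, 92, 57, 63, 72, 81, 95, 31, 69]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits).

    Stems between the smallest and largest stem are kept even when empty.
    """
    stems = {stem: [] for stem in range(min(values) // 10, max(values) // 10 + 1)}
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return stems


plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Here 31 contributes the leaf 1 to stem 3, and stem 4 is listed with no leaves.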

Histogram

A histogram is similar to a bar graph but instead displays the frequency of intervals of quantitative variables.

Instructions

Split the values of the variable into a set of contiguous intervals, typically of equal width.
For each interval, count the number of observations that fall into it.
Draw a bar for each interval whose height is the count for that interval.

Properties

"Modality"

The "modality" of a histogram describes how many peaks it has. There are three main categories:

Unimodal - One clear peak
Bimodal - Two clear peaks
Multimodal - More than 2 clear peaks

Shape

The shape of a histogram describes how its "mass" is distributed. The shape can generally be described as one of the following three categories:

Symmetric - Both the left and right sides are largely mirrors of each other
Left Skewed - Most of the "mass" is on the right side, with a long, thin tail to the left
Right Skewed - Most of the "mass" is on the left side, with a long, thin tail to the right

Numerically, when a histogram is left skewed its median is typically greater than its mean, and when it is right skewed its median is typically smaller than its mean. If the median and mean are approximately equal, the histogram is roughly symmetric.

Centre

The centre of a histogram is the location where the values usually "cluster".

Measures of Centre

Mean

The mean of a distribution is its long-run average value. Formally, the mean is defined as the expected value, which is calculated using the following formulae.

Formula

Discrete Case

For a discrete random variable $X$ (e.g., the roll of a die, which can only take the integer values 1 to 6):

$$E[X] = \sum_{x} x \, P(X = x)$$

Where the sum is taken over all possible values $x$ of $X$ and $P(X = x)$ is the probability that the random variable takes the value $x$.

Continuous Case

For a continuous random variable $X$ with probability density function $f(x)$:

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$

Estimations

The mean of a distribution can be estimated when the exact probabilities of its values are unknown.

Sample Survey

Given a sample survey with $n$ being the number of observations and $x_i$ being the $i$-th observation, we can estimate the mean of the random variable as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Example

The numbers of hours spent studying for a subset of students are 4, 6, 8, 7, 5. Estimate the mean number of hours spent studying for students.

$$\bar{x} = \frac{4 + 6 + 8 + 7 + 5}{5} = \frac{30}{5} = 6$$

Properties

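Addition by a Constant

$$E[X + c] = E[X] + c$$

Multiplication by a Constant

$$E[cX] = c \, E[X]$$

Addition

$$E[X + Y] = E[X] + E[Y]$$

Multiplication

If $X$ and $Y$ are independent:

$$E[XY] = E[X] \, E[Y]$$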

Median

The median of a distribution is a measure of the centre of the distribution. Given observations sorted in ascending order, where $x_i$ is the $i$-th observation and $n$ is the total number of observations, the median can be determined using the following steps

If $n$ is odd (e.g., 1, 3, 5), the median is the $\frac{n+1}{2}$-th observation; if $n$ is even, the median is the mean of the $\frac{n}{2}$-th and $\left(\frac{n}{2} + 1\right)$-th observations.

Examples

Odd

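Given the numbers 12, 14, 15, 17, 20, 24, 24, 27, 29, find the median

There are $n = 9$ observations, so the median is the 5th observation: 20.

Even

Given the numbers 12, 14, 15, 17, 20, 24, 24, 27, 29, 30, find the median

There are $n = 10$ observations, so the median is the mean of the 5th and 6th observations: $\frac{20 + 24}{2} = 22$.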
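The odd and even cases can be sketched in Python as follows. The function name `median` is chosen for this note (the standard library's `statistics.median` behaves the same way), and the input is sorted first so that it does not need to arrive in order.

```python
def median(values):
    """Middle observation for odd n, mean of the two middle ones for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]          # the (n + 1) / 2-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([12, 14, 15, 17, 20, 24, 24, 27, 29])
even_median = median([12, 14, 15, 17, 20, 24, 24, 27, 29, 30])
print(odd_median, even_median)
```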

Spread

The spread of a histogram describes how far most points usually are from the centre of the histogram.

Outlier

An outlier is a single observation in a histogram that is visibly removed from the main "mass" of observations; in other words, it is unusually far from the centre of the histogram.

Example

Construct a histogram of the following numbers:

175, 192, 207, 212, 213, 214, 218, 225, 229, 230, 231, 235, 235, 237, 240, 240, 242, 248, 250, 253, 257, 260, 265, 265
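One way to get the bar heights is to count observations per interval. The sketch below assumes intervals of width 20 starting at 170; both numbers are choices made for this note, not part of the exercise, and the function name `histogram_counts` is hypothetical.

```python
values = [175, 192, 207, 212, 213, 214, 218, 225, 229, 230, 231, 235,
          235, 237, 240, 240, 242, 248, 250, 253, 257, 260, 265, 265]


def histogram_counts(data, start, width):
    """Count observations in half-open intervals [start + k*width, start + (k+1)*width).

    Assumes every observation is an integer greater than or equal to start.
    """
    n_bins = (max(data) - start) // width + 1
    counts = [0] * n_bins
    for x in data:
        counts[(x - start) // width] += 1
    return counts


start, width = 170, 20
counts = histogram_counts(values, start, width)
for index, count in enumerate(counts):
    lower = start + index * width
    print(f"{lower}-{lower + width}: " + "#" * count)
```

With these intervals the lowest bar holds only the observation 175, which sits apart from the bulk of the data.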

Interquartile Range

The interquartile range (IQR) is a measure of the spread of a distribution. It is calculated as the difference between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$):

$$\text{IQR} = Q_3 - Q_1$$

Range

The range of a distribution is a measure of the variability of the distribution. The range can be calculated by taking the difference between the largest and smallest observations:

$$\text{Range} = x_{\max} - x_{\min}$$

Random Variable

A random variable is a characteristic or measurement that can be determined for each outcome of some event. Random variables are usually denoted with capital letters such as $X$, $Y$, or $Z$.

Expected Value

The expected value of a random variable is the probability-weighted average of the values it can take, which is not necessarily its "most likely" value. In the discrete case it is the sum of each possible value multiplied by its probability, and the continuous case is the analogous integral over the probability density function. The expected value of a random variable $X$ is usually denoted as $E[X]$.

Probability Density Function

Given a continuous random variable, the probability density function describes the relative likelihood that the random variable takes a value near a given point. For example, given the probability density function $f(x)$ of the random variable $X$, the probability that $X$ is between $a$ and $b$ is equal to $\int_a^b f(x) \, dx$.

Rules

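For any probability density function $f(x)$ of a random variable $X$, the following must be true

$P(X = c) = 0$, for any number $c$
$\int_{-\infty}^{\infty} f(x) \, dx = 1$
$f(x) \geq 0$, for all $x$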

Rationale

Zero

$P(X = c)$ asks what the probability is that the random variable is exactly a single number $c$. Because the random variable is continuous, there are infinitely many values between any two numbers, so the chance of picking just one particular number is always 0.

Total Probability

$\int_{-\infty}^{\infty} f(x) \, dx$ asks what the chance is that the random variable takes some value at all, which is always the case, so it must equal 1.

Positive or Zero

$f(x) \geq 0$ must always be true because a negative density would give a negative probability for the random variable to fall in some interval.

Cumulative Distribution Function

Given a random variable, the cumulative distribution function describes the probability that the random variable is less than or equal to a given value. For example, given a cumulative distribution function $F(x)$ we can determine the probability that $X \leq a$ by determining the value of $F(a)$.

The cumulative distribution function can be determined by taking an integral of the variable's probability density function

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt$$

Variance

The variance of a distribution is the "average squared distance" between each individual value and the mean.

Sample

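Given $n$ as the number of observations, $x_i$ as the $i$-th observation, and $\bar{x}$ as the sample mean, the variance of a given sample is given by the following equation:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$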
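A small Python sketch of the sample variance, assuming the common $n - 1$ denominator (some texts divide by $n$ instead). The data reuse the study-hours example from the Mean section, and the function name `sample_variance` is made up for this note.

```python
from math import sqrt


def sample_variance(values):
    """Sample variance with the n - 1 denominator (an assumption of this sketch)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


hours = [4, 6, 8, 7, 5]
variance = sample_variance(hours)
std_dev = sqrt(variance)  # the standard deviation is the square root of the variance
print(variance, std_dev)
```

The mean is 6, the squared deviations sum to 10, and dividing by $n - 1 = 4$ gives a variance of 2.5.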

Random Variable

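The variance for a given random variable $X$ with mean $\mu = E[X]$ can be given by the following equation:

$$\mathrm{Var}(X) = E[(X - \mu)^2]$$

Alternatively it can be written as

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$

When the random variable is squared and then the expected value is applied, it ends up just being $E[X^2] = \int_{-\infty}^{\infty} x^2 f(x) \, dx$, where the probability density function $f(x)$ stays unchanged.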

Properties

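Addition by a Constant

$$\mathrm{Var}(X + c) = \mathrm{Var}(X)$$

Multiplication by a Constant

$$\mathrm{Var}(cX) = c^2 \, \mathrm{Var}(X)$$

Addition of Two Random Variables

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \, \mathrm{Cov}(X, Y)$$

When $X$ and $Y$ are independent the covariance term is 0, so $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.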

Standard Deviation

The standard deviation is the square root of the variance and is given by the following equation:

$$\sigma = \sqrt{\mathrm{Var}(X)}$$

Properties

Relation to Spread

The spread of a given distribution grows as its standard deviation grows.

Units

The units of the standard deviation are the same as those of the variable (for example, if $X$ has units of metres, so does the standard deviation).

Conditional Probability

Conditional probability describes how two events are related. Specifically, it gives the probability of one event given that another, related event has occurred.

For any two events $A$ and $B$, where $P(B) > 0$, the conditional probability of $A$ given that $B$ has occurred is written as $P(A \mid B)$; the symbol $\mid$ is read as "given". The conditional probability can be calculated using the following formula

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Example

For two events and , , , and . Find :

Normal Distribution

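A normal distribution is a specific type of distribution whose probability density function takes the following general form

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

Where the parameter $\mu$ represents the mean of the distribution and $\sigma^2$ represents the variance.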

Interquartile Range

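Given a normal distribution, the interquartile range can be calculated as follows, as the standard scores for the first and third quartiles are approximately $-0.674$ and $0.674$.

$$\text{IQR} = Q_3 - Q_1 \approx (\mu + 0.674\sigma) - (\mu - 0.674\sigma) \approx 1.35\sigma$$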

The 68-95-99.7 Rule

Given a normal distribution, the following statements are approximately true

Approximately $68\%$ of observations fall within $1\sigma$ of $\mu$.
Approximately $95\%$ of observations fall within $2\sigma$ of $\mu$.
Approximately $99.7\%$ of observations fall within $3\sigma$ of $\mu$.
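These percentages can be checked numerically with the standard library's `statistics.NormalDist`. The helper name `within` is made up for this note; it uses the standard normal, which is enough because the rule is stated in units of the standard deviation.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def within(k):
    """Probability that a normal observation lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
```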

Z-Score

The z-score represents the number of standard deviations a "raw score" is above or below the mean value of what is being observed or measured. For a given point $x$ its z-score can be calculated as follows

$$z = \frac{x - \mu}{\sigma}$$

Z-scores can be used to compare relative values in different observations. For example, given scores for the SAT and ACT, a student's performance on both can be compared by calculating the z-score of each score.

Standard Normal Distribution

The standard normal distribution is the normal distribution where $\mu = 0$ and $\sigma = 1$. It shows up in many different applications. It can be defined as follows:

$$\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}$$

Covariance

Covariance is a measure of the joint variability of two random variables. Given two random variables $X$ and $Y$ their covariance can be calculated as follows

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

The expected value can be exchanged with the mean of the variables if that's more appropriate.

Given two random variables $X$ and $Y$, if the covariance between them is positive it signals that large $X$'s tend to occur with large $Y$'s and conversely, if it's negative it signals that large $X$'s tend to occur with small $Y$'s. When $X$ and $Y$ are independent random variables their covariance is exactly $0$; the converse does not hold in general, since a covariance of $0$ only means the variables are uncorrelated.

Covariance of a Sample

$$s_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Bayes' Theorem

Bayes' theorem is a theorem describing how to determine the probability of a cause given its effect. For example, using Bayes' theorem, the probability that a patient has a disease given that they tested positive for that disease can be found using the probability that the test yields a positive result when the disease is present, together with the overall probabilities of having the disease and of testing positive.

Formally Bayes' theorem can be defined as follows

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

Where the probability of the event $B$ is not 0.

Example

A company has three plants. Plant 1 produces 35% of the car output, plant 2 produces 20% and plant 3 produces the remaining 45% of cars. 1% of the output of plant 1 is defective, 1.8% of the output of plant 2 is defective and 2% of the output of plant 3 is defective. The annual total production of the company is 1,000,000 cars. A car is chosen at random from the annual output and is found defective. What is the probability that it came from plant 2?
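The example can be worked through with Bayes' theorem in a few lines of Python. The dictionary and function names are made up for this note. The total production of 1,000,000 cars cancels out, so only the shares and defect rates are needed.

```python
# Share of total output and defect rate for each plant, from the example above.
share = {1: 0.35, 2: 0.20, 3: 0.45}
defect_rate = {1: 0.010, 2: 0.018, 3: 0.020}


def posterior(plant):
    """P(plant | defective) = P(defective | plant) * P(plant) / P(defective)."""
    p_defective = sum(share[k] * defect_rate[k] for k in share)
    return share[plant] * defect_rate[plant] / p_defective


p_plant_2 = posterior(2)
print(round(p_plant_2, 4))
```

The overall defect probability is $0.0035 + 0.0036 + 0.009 = 0.0161$, so the probability that the defective car came from plant 2 is $0.0036 / 0.0161 \approx 0.224$.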

Independent Events

Events are considered independent if knowing that one occurs does not change the probability that the other occurs. Formally we can say two events $A$ and $B$ are independent if the following is true:

$$P(A \cap B) = P(A) \, P(B)$$

Equivalently, when $P(B) > 0$, $P(A \mid B) = P(A)$.

Extrema of Independent Random Variables

Given a list of independent random variables with the same probability density function, we can calculate the probability density function of their maximum and minimum values given their individual probability density functions.

Maximum

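Given $n$ independent random variables $X_1, \dots, X_n$ which are identically distributed with the probability density function $f(x)$ and cumulative distribution function $F(x)$, the probability density function of their maximum value can be calculated as follows:

$$f_{\max}(x) = n \, F(x)^{n-1} f(x)$$

Derivation

The maximum is at most $x$ exactly when every variable is at most $x$, so by independence

$$F_{\max}(x) = P(X_1 \leq x, \dots, X_n \leq x) = \prod_{i=1}^{n} P(X_i \leq x) = F(x)^n$$

Via the fundamental theorem of calculus we can then use this to determine the probability density function of the maximum

$$f_{\max}(x) = \frac{d}{dx} F(x)^n = n \, F(x)^{n-1} f(x)$$

Minimum

Given $n$ independent random variables $X_1, \dots, X_n$ which are identically distributed with the probability density function $f(x)$ and cumulative distribution function $F(x)$, the probability density function of their minimum value can be calculated as follows:

$$f_{\min}(x) = n \, (1 - F(x))^{n-1} f(x)$$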

Bernoulli Random Variable

A Bernoulli random variable is a type of random variable that can only take on the values 1 or 0; the value 1 is typically called "success" and 0 is called "failure". Because these are the only two options, their probabilities are directly related:

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$

Where $p$ represents the probability of success.

Mean

$$E[X] = p$$

Variance

$$\mathrm{Var}(X) = p(1 - p)$$

Bernoulli Distribution

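A Bernoulli distribution is any distribution where the random variable is a Bernoulli random variable. This means the random variable follows the probability mass function

$$P(X = x) = p^x (1 - p)^{1 - x}$$

where $x \in \{0, 1\}$, and $p$ is the "probability of success".

Binomial Coefficient

The binomial coefficient counts the number of ways to choose $k$ items out of $n$:

$$\binom{n}{k} = \frac{n!}{k! \, (n - k)!}$$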

Binomial Distribution

A binomial distribution is the distribution of the number of "successes" in a sequence of $n$ independent trials, each asking a yes or no question (where 1 represents yes and 0 represents no). For a single trial ($n = 1$) the binomial distribution is a Bernoulli distribution.

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

Where $p$ is the "probability of success" for any one trial and $k \in \{0, 1, \dots, n\}$.
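A minimal Python sketch of the binomial probability mass function, using `math.comb` for the binomial coefficient. The function name `binomial_pmf` and the numbers passed to it are illustrative only.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# With a single trial the binomial reduces to a Bernoulli distribution.
print(binomial_pmf(1, 1, 0.3), binomial_pmf(0, 1, 0.3))

# Two successes in four fair trials: 6 / 16.
print(binomial_pmf(2, 4, 0.5))
```

Summing the function over every $k$ from 0 to $n$ gives 1, which is a quick sanity check on the formula.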

Ten Percent Condition

The ten percent condition is a rule of thumb for treating Bernoulli trials as independent. When a population is finite and sampled without replacement, the trials are not truly independent. However, when the sample size is less than 10% of the population the trials are approximately independent; the smaller the percentage of the population you sample, the closer the trials come to true independence.

Geometric Distribution

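A geometric distribution is the distribution of the number of independent Bernoulli trials needed to get the first "success". The expected value of a geometric distribution is called the "return time".

$$P(X = k) = (1 - p)^{k - 1} p, \qquad k = 1, 2, 3, \dots$$

where $p$ is the probability that any one trial is a "success".

Mean

$$E[X] = \frac{1}{p}$$

Variance

$$\mathrm{Var}(X) = \frac{1 - p}{p^2}$$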

Poisson Distribution

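A Poisson distribution is a distribution that models the probability that a given number of events occur during a given interval.

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Where $\lambda$ is the "rate of occurrences" for 1 unit time, and $k \in \{0, 1, 2, \dots\}$

Mean

$$E[X] = \lambda$$

Variance

$$\mathrm{Var}(X) = \lambda$$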
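As a numerical illustration, the Poisson probability mass function $\lambda^k e^{-\lambda} / k!$ can be summed to confirm that its mean and variance both equal the rate. The rate of 3 and the cut-off at 60 terms are arbitrary choices for this note, and the name `poisson_pmf` is hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return lam ** k * exp(-lam) / factorial(k)


lam = 3.0
support = range(60)  # the tail beyond 60 is negligible for lam = 3
mean = sum(k * poisson_pmf(k, lam) for k in support)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in support)
print(mean, variance)
```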

Poisson Process

The Poisson process describes a set of independent points occurring over time or space. The process derives its name from the fact that the number of points in any given finite region follows a Poisson distribution, i.e., events occur continuously and independently at a constant average rate.

Examples

Common real-world examples include:

Radioactive decay events
Customer arrivals at a service centre
Emails received per hour
Accidents at an intersection per day
Phone calls to a call centre

Exponential Distribution

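An exponential distribution is a distribution that models the time (or distance) between consecutive events in a Poisson process. The exponential distribution is the continuous analogue of the geometric distribution.

$$f(x) = \lambda e^{-\lambda x}$$

Where $x \geq 0$ and $\lambda$ is the "rate of occurrences".

Mean

$$E[X] = \frac{1}{\lambda}$$

Variance

$$\mathrm{Var}(X) = \frac{1}{\lambda^2}$$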

Estimator

An estimator is a rule for using data from a sample to infer the value of an unknown population parameter (for example the population mean).

Biasedness

The biasedness (bias) of an estimator refers to the difference between the expected value of the estimator and the true value of the population parameter. Better estimators have biasedness closer to zero.

The biasedness of an estimator $\hat{\theta}$ for a parameter $\theta$ can be calculated as

$$\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

Estimators where $\mathrm{Bias}(\hat{\theta}) = 0$ are said to be "unbiased estimators".

Quality

The quality of an estimator can be measured with a loss function that compares the estimator $\hat{\theta}$ of the parameter with the parameter $\theta$ itself.

Types

Point Estimator

A point estimator is a type of estimator that calculates a single value for an unknown population parameter.

Interval Estimator

An interval estimator is a type of estimator that calculates an interval of values for an unknown population parameter, meaning we calculate that the value of the population parameter is between the lower and upper bounds.

Types

Confidence Interval Estimator

Given a confidence level $1 - \alpha$ and a population parameter $\theta$, a confidence interval estimator provides an interval $[L, U]$ such that:

$$P(L \leq \theta \leq U) = 1 - \alpha$$

Calculation

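Given a population parameter $\theta$, an estimator $\hat{\theta}$ with standard error $\mathrm{SE}(\hat{\theta})$, and a confidence level $1 - \alpha$, we can calculate the confidence interval as

$$\hat{\theta} \pm c \cdot \mathrm{SE}(\hat{\theta})$$

where $c$ is the "critical value" which corresponds to the desired confidence level.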

Critical Value

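The critical value can be calculated in two ways depending on the survey used. For large sample sizes, or where the population variance is known, $c = z_{\alpha/2}$, where $z_{\alpha/2}$ is the value of $z$ such that $P(Z > z) = \alpha/2$ for a standard normal random variable $Z$. Alternatively, for small sample sizes with unknown population variance, $c = t_{\alpha/2, \, n-1}$, the corresponding value of the t-distribution with $n - 1$ degrees of freedom.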

Calculation for Mean

Known Population Standard Deviation

Given a survey of size $n$, with a mean $\bar{x}$ and a known population standard deviation $\sigma$, we can calculate the confidence interval for $\mu$ as:

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
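A sketch of the known-standard-deviation case in Python, using `statistics.NormalDist().inv_cdf` to obtain the critical value. The function name `z_confidence_interval` and the numbers in the usage lines (a mean of 50, a standard deviation of 10, a sample of 25) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population standard deviation is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=50, sigma=10, n=25)
print(round(low, 2), round(high, 2))
```

For a 95% confidence level the critical value is about 1.96, so the margin here is about $1.96 \times 10 / 5 \approx 3.92$.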

Unknown Population Standard Deviation

When the population standard deviation is unknown, we estimate it using the sample standard deviation. For a survey of size $n$ with a mean $\bar{x}$ and a sample standard deviation $s$, the confidence interval for the mean at confidence level $1 - \alpha$ is:

$$\bar{x} \pm t_{\alpha/2, \, n-1} \frac{s}{\sqrt{n}}$$

Hypothesis Testing

Hypothesis testing is a strategy of inference for determining whether a statement about the value of a population parameter should or should not be rejected. It involves stating a null hypothesis ($H_0$), which assumes no effect or difference, and an alternative hypothesis ($H_a$), which suggests there is an effect or difference.

The null and alternative hypotheses typically take on one of the three following forms for a parameter $\theta$ and a hypothesized value $\theta_0$

Lower-tailed: $H_0: \theta = \theta_0$ vs $H_a: \theta < \theta_0$
Upper-tailed: $H_0: \theta = \theta_0$ vs $H_a: \theta > \theta_0$
Two-tailed: $H_0: \theta = \theta_0$ vs $H_a: \theta \neq \theta_0$

The hypotheses should always be formulated before viewing or analyzing the data.

Test Procedures

A test procedure consists of several key components to evaluate the hypotheses.

Properties

$p$-value

The $p$-value is a measure of how "unusual" the data would be if $H_0$ was true. Specifically it is defined as the probability of observing a result as extreme or more extreme than the one obtained, assuming the null hypothesis is true. A small $p$-value provides evidence against $H_0$, suggesting the observed data is unlikely to occur under the null hypothesis.

Among the 3 forms of null and alternative hypotheses, the $p$-value is calculated as follows, where $T$ is the test statistic and $t$ is its observed value:

Lower-tailed: $P(T \leq t \mid H_0)$
Upper-tailed: $P(T \geq t \mid H_0)$
Two-tailed: $P(|T| \geq |t| \mid H_0)$, for a test statistic whose distribution under $H_0$ is symmetric about 0

Significance Level ($\alpha$)

The significance level, denoted as $\alpha$, is the predetermined threshold for the $p$-value that determines whether to reject the null hypothesis. It represents the maximum probability of committing a type I error. Common choices for $\alpha$ are 0.05, 0.01, or 0.10. If the $p$-value $\leq \alpha$, the result is considered statistically significant, and $H_0$ is rejected.

Errors

Type I Error

Type I error is a type of error in hypothesis testing in which the null hypothesis is incorrectly rejected even though it is correct. In terms of the courtroom example, in which $H_0$ is that the defendant is innocent, a type I error corresponds to convicting an innocent defendant.

$$\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$$

Where $\alpha$ is the probability of rejecting $H_0$ when it's actually true, which is the probability of committing a type I error.

Type II Error

Type II error is a type of error in hypothesis testing in which the null hypothesis is not rejected even though it is false. In terms of the courtroom example, a type II error corresponds to acquitting a criminal.

$$\beta = P(\text{Type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})$$

Where $\beta$ is the probability of failing to reject $H_0$ when it's actually false, which is the probability of committing a type II error.

Table of Error Types

	$H_0$ is true	$H_0$ is false
Not Rejected	Correct inference (true negative)	Type II error (false negative)
Rejected	Type I error (false positive)	Correct inference (true positive)

Test Statistic

The test statistic is a function of the data whose sampling distribution under the null hypothesis is known. It measures the discrepancy between the observed data and what would be expected if $H_0$ were true.

Z-Statistic

A Z-statistic is a test statistic for the population mean used when the population standard deviation is known or the sample size is large.

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Where:

$\bar{x}$ - Sample mean.
$\mu_0$ - Hypothesized population mean under the null hypothesis $H_0$.
$\sigma$ - Population standard deviation.
$n$ - Sample size.
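A sketch of a two-tailed test built on the Z-statistic $z = (\bar{x} - \mu_0) / (\sigma / \sqrt{n})$, using `statistics.NormalDist` for the standard normal distribution. The function name `z_test` and the input numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n):
    """Return the Z-statistic and its two-tailed p-value under H0: mu = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


z, p_value = z_test(sample_mean=52, mu0=50, sigma=10, n=100)
print(z, round(p_value, 4))
```

With these inputs the statistic is 2 standard errors above the hypothesized mean, giving a p-value of roughly 0.046, which would be rejected at a significance level of 0.05 but not at 0.01.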

t-Statistic

A t-statistic is a test statistic for the population mean used when the population standard deviation is unknown and the sample size is small.

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

Where:

$\bar{x}$ - Sample mean. The average value of your observations.
$\mu_0$ - Hypothesized population mean under the null hypothesis $H_0$.
$s$ - Sample standard deviation. An estimate of the population standard deviation when it is unknown.
$n$ - Sample size (number of observations).

Chi-Squared Statistic

A chi-squared statistic is a test statistic used to test hypotheses about the population variance.

$$\chi^2 = \frac{(n - 1) s^2}{\sigma_0^2}$$

Where:

$s^2$ - Sample variance. The variance calculated from the sample.
$\sigma_0^2$ - Hypothesized population variance under the null hypothesis $H_0$.
$n$ - Sample size.

Two-Sample Hypothesis Testing

Two-sample hypothesis testing is a type of hypothesis testing which, instead of determining whether the value of a population parameter should or should not be rejected, determines whether the difference between the parameters of two populations is statistically significant.

Test Statistics

Two Sample t-Statistic

A two sample t-statistic is a test statistic for the difference between two population means, used when the population standard deviations are unknown but assumed to be the same for both populations and the sample sizes are small.

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Where:

$\bar{x}_i$ - Sample mean for the sample $i$.
$\Delta_0$ - Hypothesized population mean difference under the null hypothesis $H_0$.
$s_p$ - Pooled standard deviation of the two samples.
$n_i$ - Sample size for the sample $i$.

Analysis of Variance

Analysis of Variance (ANOVA) is a statistical method used to test whether three or more population means are equal by analyzing the sample variances within and between groups. ANOVA is a form of hypothesis testing and was developed as a generalization of the two-sample t-test beyond two means.

Procedure

Hypothesis:

$H_0$: $\mu_1 = \mu_2 = \dots = \mu_k$
$H_a$: at least one of the means is different

Test Statistic

The F-statistic is used to compare the variance between groups to the variance within groups in order to determine whether the null hypothesis should be rejected.

F-Statistic

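An F-statistic is a test statistic built as a ratio of two variance estimates. In ANOVA it is the ratio of the variance between groups to the variance within groups, and it is used to determine whether the group means are significantly different.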

Pooled Variance

The pooled variance is a method for estimating the variance of several different populations when the means of each population may be different but where one may assume that the variance of each population is the same. Under the assumption of equal population variances, the pooled sample variance provides a higher precision estimate of variance than the individual sample variances. The square root of the pooled variance is called the pooled standard deviation, similarly to how the square root of the variance is the standard deviation.

Equation

Given a set of $k$ sample variances $s_i^2$, where the $i$-th sample has a size of $n_i$, the pooled variance is calculated as:

$$s_p^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{\sum_{i=1}^{k} (n_i - 1)}$$
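A short Python sketch of the pooled variance as a weighted average of the sample variances, with each variance weighted by its sample size minus one. The function name `pooled_variance` and the two samples in the usage lines are hypothetical.

```python
from math import sqrt


def pooled_variance(variances, sizes):
    """Weighted average of sample variances, each weighted by (sample size - 1)."""
    numerator = sum((n - 1) * s2 for s2, n in zip(variances, sizes))
    denominator = sum(n - 1 for n in sizes)
    return numerator / denominator


# Two hypothetical samples: variance 4.0 with 10 observations, variance 9.0 with 20.
s2_pooled = pooled_variance([4.0, 9.0], [10, 20])
s_pooled = sqrt(s2_pooled)  # pooled standard deviation
print(round(s2_pooled, 4), round(s_pooled, 4))
```

The result, $207 / 28 \approx 7.39$, lies between the two sample variances and closer to the one from the larger sample.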

Residual

A residual is the observed discrepancy between the outcome $y_i$ and a model's prediction $\hat{y}_i$, that is, $e_i = y_i - \hat{y}_i$. It quantifies how far the model's prediction is from the observed data, according to the model's assumptions.

Least Squares

The least squares method is a type of estimation method used to find the "best fit" for a data set. It works by minimizing the sum of squared residuals.

Ordinary Least Squares

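Ordinary least squares is a special case of least squares in which, given a linear relationship between an independent variable $x$ and a dependent variable $y$, the parameters of the linear regression are defined as:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The resulting estimated regression line is:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$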

Example

x	y
1	2
2	3
3	2.5
4	4
5	4.5

(Figure: the data points plotted together with the calculated best-fit line.)
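The best-fit line for the example data can be computed with a short Python sketch of simple (one-variable) ordinary least squares. The function name `ols_fit` is made up for this note, and the first column of the table is assumed to be $x$ and the second $y$.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 2.5, 4, 4.5]


def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = ols_fit(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(round(intercept, 3), round(slope, 3))
```

For this data $\bar{x} = 3$ and $\bar{y} = 3.2$, the slope is $6 / 10 = 0.6$ and the intercept is $3.2 - 0.6 \times 3 = 1.4$, so the fitted line is $\hat{y} = 1.4 + 0.6x$. The residuals of a least squares fit with an intercept sum to zero, which is a quick check on the result.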