A population is all subjects of interest in a particular study.
A sample is a subset of a population.
A census is when you collect data for an entire population.
A sample survey is when you collect data for a given sample.
A parameter is a descriptive measure of a given population.
A statistic is a descriptive measure of a given sample.
A variable is an attribute that describes something (e.g. a person, place, thing, or idea). The value of a variable can "vary" between entities. For example, a variable describing a person could be their hair or eye colour, which takes different values for different people.
Variables can be classified into a few different categories and subcategories depending on the forms they take.
Qualitative variables are variables whose values belong to a category rather than representing a magnitude.
Examples:
Quantitative variables are variables whose values are numbers that represent different magnitudes.
A quantitative variable is considered discrete if its possible values form a set of separate numbers (e.g. the natural numbers or the integers).
Examples:
A quantitative variable is considered continuous if its possible values form an interval (e.g. the reals between 3 and 4).
Examples:
Descriptive statistics is the practice of summarizing data, typically with graphs and numerical measures.
A bar graph displays a qualitative variable by drawing each category as a single bar whose height is the frequency of that category.
A campus press polled a sample of 300 undergrads in order to study the attitude towards a proposed change in on-campus housing regulations. A summary of the results of the opinion poll is shown as follows:
A dot plot displays a quantitative variable by drawing one dot for each occurrence of a value in a sample.
The following set of data is the scores obtained for a midterm test on a 0-100 scale. Construct a dot plot.
10, 90, 95, 100, 65, 50, 60, 50, 90, 55, 60, 70
A pie chart displays a qualitative variable as a circle in which each category is represented as a "slice". The size of each "slice" is proportional to the percentage of observations falling in that category.
A campus press polled a sample of 300 undergrads in order to study the attitude towards a proposed change in on-campus housing regulations. A summary of the results of the opinion poll is shown as follows:
A frequency table is a listing of the possible values of a qualitative variable, together with the number of observations and/or the relative frequency for each value.
A campus press polled a sample of 300 undergrads in order to study the attitude towards a proposed change in on-campus housing regulations. A summary of the results of the opinion poll is as follows:
| Response | Frequency | Proportion [1] | Percentage [2] |
|---|---|---|---|
| Support | 150 | 0.500 | 50.0% |
| Neutral | 50 | 0.167 | 16.7% |
| Oppose | 100 | 0.333 | 33.3% |
| Total | 300 | 1 | 100% |
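1: Proportion is calculated as $\frac{\text{frequency}}{\text{total number of observations}}$, e.g. $150 / 300 = 0.500$ for "Support".
2: Percentage is the proportion multiplied by 100.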
A stem-and-leaf plot displays a quantitative variable by splitting each value into a stem (the leading digit or digits) and a leaf (usually the last digit). The stems are listed in order down the left side of a table, including stems that have no leaves, and each value's leaf is written to the right of its stem. The leaves can optionally be sorted.
Consider the following data:
80, 85, 75, 90, 62, 50, 55, 65, 75, 82, 70, 25, 92, 57, 63, 72, 81, 95, 31, 69
The stem-and-leaf plot is as follows
| Stem | Leaf |
|---|---|
| 2 | 5 |
| 3 | 1 |
| 4 | |
| 5 | 0 5 7 |
| 6 | 2 3 5 9 |
| 7 | 0 2 5 5 |
| 8 | 0 1 2 5 |
| 9 | 0 2 5 |
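As a rough illustration, the plot above can be rebuilt programmatically. This is only a sketch that assumes two-digit values, so the stem is the tens digit and the leaf is the units digit; the helper name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lo, hi = min(leaves), max(leaves)
    # Include stems that have no leaves so gaps in the data stay visible.
    return {stem: leaves.get(stem, []) for stem in range(lo, hi + 1)}

data = [80, 85, 75, 90, 62, 50, 55, 65, 75, 82,
        70, 25, 92, 57, 63, 72, 81, 95, 31, 69]
plot = stem_and_leaf(data)
for stem, leaf in plot.items():
    print(stem, "|", " ".join(str(d) for d in leaf))
```

The value 31 lands on stem 3, and stem 4 is printed with no leaves because no observation lies between 40 and 49.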
A histogram is similar to a bar graph but instead displays the frequency of intervals of quantitative variables.
The "modality" of a histogram describes where its peaks occur and falls into 3 main categories:
The shape of a histogram describes how the "mass" of it falls, generally the shape can be described as one of the three following categories
Numerically, when a histogram is "left skewed" its median will typically be greater than its mean, and conversely when it is "right skewed" its median will typically be smaller than its mean. If the median and mean are approximately the same, the histogram is roughly symmetric.
The centre of a histogram is the location where the values usually "cluster".
The mean of a distribution is its long-run average value. Formally, the mean is defined as the expected value, which is calculated using the following formulae.
For a discrete random variable (e.g. a fair die that can only take the integer values 1 to 6):
$$E[X] = \sum_{x} x \, P(X = x)$$
where the sum is taken over all possible values $x$ of $X$.
For a continuous random variable with probability density function $f(x)$:
$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
The mean of a distribution can be estimated when the exact probabilities of its values are unknown.
Given a sample survey with $n$ observations $x_1, \dots, x_n$, the sample mean is
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The numbers of hours spent studying for a subset of students are 4, 6, 8, 7, 5. Estimate the mean number of hours spent studying for students.
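A quick way to check this example is to add the observations and divide by how many there are. The variable names below are just for illustration.

```python
hours = [4, 6, 8, 7, 5]

# Sample mean: sum of the observations divided by the sample size.
sample_mean = sum(hours) / len(hours)
print(sample_mean)  # 6.0
```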
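The median of a distribution is a method of determining the centre of the given distribution. Given a distribution where the $n$ observations are sorted in increasing order as $x_{(1)} \le \dots \le x_{(n)}$, the median is the middle value.
If $n$ is odd, the median is $x_{((n+1)/2)}$; if $n$ is even, it is the average of the two middle values, $\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$.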
Given the numbers 12, 14, 15, 17, 20, 24, 24, 27, 29, find the median
Given the numbers 12, 14, 15, 17, 20, 24, 24, 27, 29, 30, find the median
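The two median exercises can be worked through with a small helper. This is a minimal sketch of the usual rule (middle value for an odd count, average of the two middle values for an even count); the function name `median` is chosen for this example only.

```python
def median(values):
    """Middle value of the sorted data, averaging the two middle values if needed."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([12, 14, 15, 17, 20, 24, 24, 27, 29]))      # 20
print(median([12, 14, 15, 17, 20, 24, 24, 27, 29, 30]))  # 22.0
```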
The spread of a histogram describes how far most points usually are from the centre of the histogram.
An outlier is a single observation in a histogram that is visibly removed from the main "mass" of observations. In other words, it is unusually far from the centre of the histogram.
Construct a histogram of the following numbers:
175, 192, 207, 212, 213, 214, 218, 225, 229, 230, 231, 235, 235, 237, 240, 240, 242, 248, 250, 253, 257, 260, 265, 265
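One possible way to build the histogram is to count how many values fall in each interval. The bin width of 20 and the starting edge of 170 are assumptions made for this sketch; other choices give an equally valid histogram with a different look.

```python
from collections import Counter

data = [175, 192, 207, 212, 213, 214, 218, 225, 229, 230, 231, 235,
        235, 237, 240, 240, 242, 248, 250, 253, 257, 260, 265, 265]

width = 20   # assumed bin width
start = 170  # assumed left edge of the first bin

# Bin index i covers the half-open interval [start + i*width, start + (i+1)*width).
counts = Counter((x - start) // width for x in data)

for i in range(max(counts) + 1):
    lo = start + i * width
    print(f"[{lo}, {lo + width}): {'*' * counts[i]}")
```

With these bins the frequencies are 1, 2, 6, 9 and 6, so the histogram has a single peak and the value 175 sits somewhat apart from the rest.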
The interquartile range is a measure of the spread of a distribution. It is calculated as the difference between the 75th and 25th percentiles:
$$IQR = Q_3 - Q_1$$
The range of a distribution is a method of determining the variability of the given distribution. The range can be calculated by taking the difference between the largest and smallest observations:
$$\text{Range} = x_{\max} - x_{\min}$$
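Both measures of spread can be computed with the standard library. The sketch below reuses the histogram data from earlier in these notes as an assumed example; note that textbooks and software use slightly different quartile conventions, so the interquartile range may differ a little from a hand calculation.

```python
from statistics import quantiles

data = [175, 192, 207, 212, 213, 214, 218, 225, 229, 230, 231, 235,
        235, 237, 240, 240, 242, 248, 250, 253, 257, 260, 265, 265]

# Range: largest observation minus smallest observation.
data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(data_range)  # 90
print(iqr)
```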
A random variable is a characteristic or measurement that can be determined for each outcome of some random event. Random variables are usually denoted with capital letters such as $X$ or $Y$.
The expected value of a random variable is its probability-weighted average value, which is not necessarily its "most likely" value. In the discrete case it is the mean of the random variable's possible values weighted by their probabilities, and the continuous case replaces the sum with an integral. The expected value of a random variable $X$ is denoted $E[X]$.
For a discrete random variable (e.g. a fair die that can only take the integer values 1 to 6):
$$E[X] = \sum_{x} x \, P(X = x)$$
where the sum is taken over all possible values $x$ of $X$.
For a continuous random variable with probability density function $f(x)$:
$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
Given a continuous random variable, the probability density function describes the relative likelihood that the random variable takes a value near a given point. For example, given the probability density function $f(x)$ of $X$, the probability that $X$ falls between $a$ and $b$ is
$$P(a \le X \le b) = \int_a^b f(x) \, dx$$
For any probability density function $f(x)$, $f(x) \ge 0$ for all $x$ and $\int_{-\infty}^{\infty} f(x) \, dx = 1$.
Given a random variable, the cumulative distribution function describes the probability that the random variable is less than or equal to a given value. For example, given a cumulative distribution function $F(x)$ of $X$, $F(x) = P(X \le x)$.
The cumulative distribution function can be determined by taking an integral of the variable's probability density function:
$$F(x) = \int_{-\infty}^{x} f(t) \, dt$$
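The variance of a random variable is the average squared distance between its values and its mean.
Given a random variable $X$ with mean $\mu = E[X]$, the variance of $X$ is
$$\operatorname{Var}(X) = E\left[(X - \mu)^2\right]$$
Alternatively it can be written as
$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$
When the random variable is squared and then the expected value is applied, it ends up just being
$$E[X^2] = \int_{-\infty}^{\infty} x^2 f(x) \, dx$$
where the probability density function $f(x)$ stays unchanged.
The standard deviation is the square root of the variance and is given by the following equation:
$$\sigma = \sqrt{\operatorname{Var}(X)}$$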
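The spread of a given distribution grows as its standard deviation grows.
The units of the standard deviation are the same as those of the variable (for example, if the variable is measured in metres, its standard deviation is also in metres, while its variance is in square metres).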
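To make the variance and standard deviation concrete, here is a small worked example for a fair six-sided die. The die is an assumed example, and exact fractions are used so that the definition $E[(X - \mu)^2]$ and the shortcut $E[X^2] - (E[X])^2$ can be compared without rounding error.

```python
from fractions import Fraction
from math import sqrt

values = range(1, 7)   # faces of a fair die
p = Fraction(1, 6)     # each face is equally likely

mean = sum(x * p for x in values)

# Variance from the definition: expected squared deviation from the mean.
var_def = sum((x - mean) ** 2 * p for x in values)

# Variance from the shortcut formula E[X^2] - (E[X])^2.
var_alt = sum(x ** 2 * p for x in values) - mean ** 2

std = sqrt(var_def)

print(mean, var_def, var_alt)  # 7/2 35/12 35/12
print(std)
```

Both routes give the same variance, and the standard deviation (about 1.71) is in the same units as the die faces.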
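Conditional probability describes how two events are related. Specifically, it gives the probability of one event given that another, related event has occurred.
For any two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$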
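A normal distribution is a specific type of distribution taking the following general form
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where the parameter $\mu$ is the mean of the distribution and the parameter $\sigma$ is its standard deviation.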
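Given a normal distribution, the interquartile range can be calculated as follows, as the standard scores for the first and third quartiles are approximately $-0.6745$ and $0.6745$:
$$IQR = (\mu + 0.6745\sigma) - (\mu - 0.6745\sigma) \approx 1.349\sigma$$
Given a normal distribution, the following statements are approximately true:
- About 68% of observations fall within 1 standard deviation of the mean.
- About 95% of observations fall within 2 standard deviations of the mean.
- About 99.7% of observations fall within 3 standard deviations of the mean.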
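These normal-distribution facts can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library with its default parameters (mean 0, standard deviation 1), so all results are expressed in units of the standard deviation.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Standard score of the third quartile; the first quartile is its negative.
z_q3 = std_normal.inv_cdf(0.75)
iqr_in_sigmas = 2 * z_q3
print(round(z_q3, 4), round(iqr_in_sigmas, 3))  # 0.6745 1.349

# Probability of landing within k standard deviations of the mean.
for k in (1, 2, 3):
    coverage = std_normal.cdf(k) - std_normal.cdf(-k)
    print(k, round(coverage, 4))
```

The loop prints roughly 0.6827, 0.9545 and 0.9973 for one, two and three standard deviations.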
The z-score represents the number of standard deviations a "raw score" is above or below the mean value of what is being observed or measured. For a given point $x$ from a distribution with mean $\mu$ and standard deviation $\sigma$, the z-score is
$$z = \frac{x - \mu}{\sigma}$$
Z-scores can be used to compare relative values in different observations. For example, given scores for the SAT and ACT, a student's performance on both can be compared by calculating the z-score of their score on each test.
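A comparison of this kind might look as follows. All of the numbers here (the student's scores and each test's mean and standard deviation) are hypothetical values invented for illustration, not real SAT or ACT statistics.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

# Hypothetical test statistics and student scores, for illustration only.
sat_z = z_score(1300, mean=1050, std=200)
act_z = z_score(28, mean=21, std=5)

print(sat_z)  # 1.25
print(act_z)
print("Relatively stronger result:", "ACT" if act_z > sat_z else "SAT")
```

Under these made-up numbers the ACT score is about 1.4 standard deviations above its mean, so it is the relatively stronger result even though the raw scores are on different scales.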
The standard normal distribution is the normal (Gaussian) distribution where $\mu = 0$ and $\sigma = 1$.
Covariance is a measure of the joint variability of two random variables. Given two random variables $X$ and $Y$, the covariance is
$$\operatorname{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$
The expected value can be exchanged with the mean of the variables if that's more appropriate.
Given two random variables
Bayes' theorem describes how to determine the probability of a cause given its effect. For example, using Bayes' theorem, the probability that a patient has a disease given that they tested positive for that disease can be found using the probability that the test yields a positive result when the disease is present.
Formally Bayes' theorem can be defined as follows
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
where the probability of the event $B$ is nonzero.
A company has three plants. Plant 1 produces 35% of the car output, plant 2 produces 20% and plant 3 produces the remaining 45% of cars. 1% of the output of plant 1 is defective, 1.8% of the output of plant 2 is defective and 2% of the output of plant 3 is defective. The annual total production of the company is 1,000,000 cars. A car is chosen at random from the annual output and is found defective. What is the probability that it came from plant 2?
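The plant example can be worked through directly with Bayes' theorem: the probability that a car is defective comes from the law of total probability, and each plant's posterior probability is its share of that total. The dictionary names are chosen for this sketch; the annual production figure is not needed because only proportions matter.

```python
# Share of total output per plant and defect rate per plant (from the problem).
share = {1: 0.35, 2: 0.20, 3: 0.45}
defect_rate = {1: 0.010, 2: 0.018, 3: 0.020}

# Law of total probability: P(defective).
p_defective = sum(share[k] * defect_rate[k] for k in share)

# Bayes' theorem: P(plant k | defective).
posterior = {k: share[k] * defect_rate[k] / p_defective for k in share}

print(round(p_defective, 4))   # 0.0161
print(round(posterior[2], 4))  # 0.2236
```

So a randomly chosen defective car has roughly a 22.4% chance of having come from plant 2.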
Events are considered independent if knowing that one occurs does not change the probability that the other occurs. Formally we can say two events $A$ and $B$ are independent if and only if $P(A \cap B) = P(A) \, P(B)$.
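Given a list of independent random variables with the same probability density function, we can calculate the probability density functions of their maximum and minimum values from their individual probability density functions.
Given independent random variables $X_1, \dots, X_n$ with common cumulative distribution function $F(x)$ and probability density function $f(x)$, the maximum $M = \max(X_1, \dots, X_n)$ is at most $x$ exactly when every $X_i$ is at most $x$, so
$$F_M(x) = P(M \le x) = [F(x)]^n$$
Via the fundamental theorem of calculus, we can then differentiate this to determine the probability density function of the maximum:
$$f_M(x) = n \, [F(x)]^{n-1} f(x)$$
Given the same variables, the minimum $m = \min(X_1, \dots, X_n)$ exceeds $x$ exactly when every $X_i$ exceeds $x$, so
$$F_m(x) = 1 - [1 - F(x)]^n, \qquad f_m(x) = n \, [1 - F(x)]^{n-1} f(x)$$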
A Bernoulli random variable is a random variable that can only take the values 1 or 0; the value 1 is typically called "success" and 0 is called "failure". Because these are the only two options, their probabilities are directly related:
$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$
where $p$ is the probability of success.
A Bernoulli distribution is any distribution where the random variable is a Bernoulli random variable. This means the random variable follows the probability mass function
$$P(X = x) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\}$$
where $p$ is the probability of success.
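A binomial distribution is the distribution of the number of "successes" in a sequence of $n$ independent Bernoulli trials, each with the same probability of success:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \qquad k = 0, 1, \dots, n$$
where $n$ is the number of trials, $k$ is the number of successes, and $p$ is the probability of success on each trial.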
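As a small illustration of the binomial probability mass function, the sketch below evaluates it for a hypothetical experiment of 10 fair coin flips. The function name and the choice of $n = 10$, $p = 0.5$ are assumptions for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for the number of successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical: 10 fair coin flips
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(binomial_pmf(5, n, p))  # 0.24609375
print(sum(probs))             # the probabilities over k = 0..n add up to 1
```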
The ten percent condition is a rule of thumb for treating Bernoulli trials as independent. When sampling from a finite population, the trials are not truly independent. However, when the sample size is less than 10% of the population, the trials are approximately independent; the smaller the fraction of the population sampled, the closer the trials come to true independence.
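A geometric distribution is the distribution of the number of independent Bernoulli trials needed to obtain the first "success". The expected value of a geometric distribution is called the "return time".
$$P(X = k) = (1 - p)^{k - 1} p, \qquad k = 1, 2, 3, \dots$$
where $p$ is the probability of success on each trial, so the return time is $E[X] = 1/p$.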
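A Poisson distribution is a distribution that models the probability that a given number $k$ of events occurs in a fixed interval of time or space.
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots$$
where $\lambda$ is the average number of events per interval.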
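The Poisson probability mass function is easy to evaluate directly. In the sketch below the rate of 3 events per interval is a hypothetical value picked for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) when events occur at an average rate of lam per interval."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 3.0  # hypothetical average number of events per interval

for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))

# The probabilities over all k add up to 1 (the tail beyond k = 49 is negligible).
total = sum(poisson_pmf(k, lam) for k in range(50))
print(total)
```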
The Poisson process describes a set of independent points occurring over time or space. The process derives its name from the fact that the number of points in any given finite region follows a Poisson distribution, i.e., events occur continuously and independently at a constant average rate.
Common real-world examples include:
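An exponential distribution is a distribution that models the time or distance between consecutive events in a Poisson process. The exponential distribution is the continuous analogue of the geometric distribution.
$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0$$
where $\lambda$ is the average rate at which events occur, so the mean time between events is $1/\lambda$.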
An estimator is a rule for using data from a sample to infer the value of an unknown population parameter (for example the population mean).
The bias of an estimator refers to the difference between the expected value of the estimator and the true value of the population parameter. Better estimators have bias closer to zero.
The bias of an estimator $\hat{\theta}$ of a parameter $\theta$ is
$$\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$
Estimators where $\operatorname{Bias}(\hat{\theta}) = 0$ are called unbiased.
The quality of an estimator can be measured with a loss function that compares the estimate of the parameter with the parameter itself.
A point estimator is a type of estimator that calculates a single value for an unknown population parameter.
An interval estimator is a type of estimator that calculates an interval of values for an unknown population parameter, meaning we calculate that the value of the population parameter is between the lower and upper bounds.
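Given a confidence level $1 - \alpha$, a confidence interval is an interval estimator constructed so that, over repeated sampling, a proportion $1 - \alpha$ of such intervals contain the true population parameter.
Given a population parameter $\theta$ with point estimate $\hat{\theta}$, a confidence interval typically takes the form
$$\hat{\theta} \pm c \cdot SE(\hat{\theta})$$
where $c$ is the critical value for the chosen confidence level and $SE(\hat{\theta})$ is the standard error of the estimator.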
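The critical value can be calculated in two ways depending on the survey used. For large sample sizes, or where the population variance is known, it is the standard normal quantile $z_{\alpha/2}$, where $1 - \alpha$ is the confidence level.
Given a survey of size $n$ with sample mean $\bar{x}$ and known population standard deviation $\sigma$, the confidence interval for the population mean is
$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
When the population standard deviation is unknown, we estimate it using the sample standard deviation $s$. For a survey of size $n$, the critical value then comes from the $t$-distribution with $n - 1$ degrees of freedom:
$$\bar{x} \pm t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}}$$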
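A confidence interval for the mean in the known-variance (or large-sample) case can be sketched with the standard library, which provides the normal quantile through `statistics.NormalDist.inv_cdf`. The sample figures below are hypothetical, and the function name is made up for this example. The $t$-based interval is not shown because the standard library has no $t$-distribution quantile function.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval sample_mean +/- z * sigma / sqrt(n) for a known population sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical survey: n = 100, sample mean 50, known sigma = 10.
low, high = z_confidence_interval(sample_mean=50.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```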
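Hypothesis testing is a strategy of inference for determining whether a statement about the value of a population parameter should or should not be rejected. It involves stating a null hypothesis ($H_0$) and an alternative hypothesis ($H_a$), then using sample data to decide whether there is enough evidence to reject $H_0$.
The null and alternative hypotheses typically take on one of the three following forms for a parameter $\theta$ and a hypothesized value $\theta_0$:
- Two-tailed: $H_0: \theta = \theta_0$ against $H_a: \theta \ne \theta_0$
- Left-tailed: $H_0: \theta = \theta_0$ against $H_a: \theta < \theta_0$
- Right-tailed: $H_0: \theta = \theta_0$ against $H_a: \theta > \theta_0$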
The hypotheses should always be formulated before viewing or analyzing the data.
A test procedure consists of several key components to evaluate the hypotheses.
The
Among the 3 forms of null and alternative hypotheses the
The significance level, denoted as $\alpha$, is the probability of a type I error that the test is designed to allow.
A type I error is an error in hypothesis testing in which the null hypothesis is rejected even though it is true. In terms of the courtroom example, a type I error corresponds to convicting an innocent defendant.
$$\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})$$
where $\alpha$ is the significance level of the test.
A type II error is an error in hypothesis testing in which the null hypothesis is not rejected even though it is false. In terms of the courtroom example, a type II error corresponds to acquitting a criminal.
$$\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})$$
where $1 - \beta$ is called the power of the test.
| Decision about $H_0$ | $H_0$ is true | $H_0$ is false |
|---|---|---|
| Not rejected | Correct inference (true negative) | Type II error (false negative) |
| Rejected | Type I error (false positive) | Correct inference (true positive) |
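The test statistic is a function of the data whose sampling distribution under the null hypothesis is known. It measures the discrepancy between the observed data and what would be expected if $H_0$ were true.
A Z-statistic is a test statistic for the population mean used when the population standard deviation is known or the sample size is large.
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Where:
- $\bar{x}$ is the sample mean
- $\mu_0$ is the population mean under the null hypothesis
- $\sigma$ is the population standard deviation
- $n$ is the sample size
A t-statistic is a test statistic for the population mean used when the population standard deviation is unknown and the sample size is small.
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
Where:
- $s$ is the sample standard deviation
- the statistic is compared against a $t$-distribution with $n - 1$ degrees of freedom
A chi-squared statistic is a test statistic used to test hypotheses about the population variance.
$$\chi^2 = \frac{(n - 1) s^2}{\sigma_0^2}$$
Where:
- $s^2$ is the sample variance
- $\sigma_0^2$ is the population variance under the null hypothesis
- the statistic is compared against a chi-squared distribution with $n - 1$ degrees of freedom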
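The three one-sample test statistics described above can be computed with a few lines of standard-library code. This is a sketch only: the function names are made up, the study-hours sample is reused from earlier in these notes, and the hypothesized values passed in are hypothetical. Turning a statistic into a p-value would additionally need the relevant reference distribution, which is not shown here.

```python
from math import sqrt
from statistics import mean, stdev, variance

def z_statistic(sample_mean, mu0, sigma, n):
    """Z-statistic for a mean when the population standard deviation is known."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

def t_statistic(sample, mu0):
    """t-statistic for a mean, using the sample standard deviation (n - 1 divisor)."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

def chi_squared_statistic(sample, sigma0_sq):
    """Chi-squared statistic for testing a hypothesized population variance."""
    n = len(sample)
    return (n - 1) * variance(sample) / sigma0_sq

hours = [4, 6, 8, 7, 5]  # sample mean 6, sample variance 2.5

print(z_statistic(6.0, mu0=5.0, sigma=2.0, n=4))    # 1.0
print(t_statistic(hours, mu0=5.0))                  # about 1.414
print(chi_squared_statistic(hours, sigma0_sq=2.0))  # 5.0
```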
Two-sample hypothesis testing is a type of hypothesis testing which, instead of determining whether the value of a population parameter should or should not be rejected, determines whether the difference between the parameters of two populations is statistically significant.
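A two-sample t-statistic is a test statistic for the difference between two population means, used when the population standard deviations are unknown but assumed to be the same for both populations and the sample sizes are small.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Where:
- $\bar{x}_1$ and $\bar{x}_2$ are the two sample means
- $n_1$ and $n_2$ are the two sample sizes
- $s_p$ is the pooled standard deviation
- the statistic is compared against a $t$-distribution with $n_1 + n_2 - 2$ degrees of freedom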
Analysis of Variance (ANOVA) is a statistical method used to test whether three or more population means are equal by analyzing the sample variances within and between groups. ANOVA is a form of hypothesis testing and was developed as a generalization of the two-sample t-test beyond two means.
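Hypothesis:
- $H_0: \mu_1 = \mu_2 = \dots = \mu_k$
- $H_a$: at least one population mean differs from the others
The F-statistic is used to compare the variance between groups to the variance within groups in order to determine whether the null hypothesis should be rejected.
An F-statistic is a test statistic formed as the ratio of two variance estimates. In ANOVA with $k$ groups and $N$ observations in total, it is
$$F = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 / (k - 1)}{\sum_{i=1}^{k} (n_i - 1) s_i^2 / (N - k)}$$
where $n_i$, $\bar{x}_i$ and $s_i^2$ are the size, mean and variance of group $i$, and $\bar{x}$ is the overall mean.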
The pooled variance is a method for estimating the variance of several different populations when the means of each population may be different but one may assume that the variance of each population is the same. Under the assumption of equal population variances, the pooled sample variance provides a higher-precision estimate of variance than the individual sample variances. The square root of the pooled variance is called the pooled standard deviation, just as the square root of the variance is the standard deviation.
Given a set of sample variances $s_1^2, \dots, s_k^2$ from samples of sizes $n_1, \dots, n_k$, the pooled variance is
$$s_p^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{\sum_{i=1}^{k} (n_i - 1)}$$
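The pooled variance is a weighted average of the sample variances, with each sample weighted by its degrees of freedom ($n_i - 1$). A minimal sketch, with hypothetical variances and sample sizes and an illustrative function name:

```python
from math import sqrt

def pooled_variance(variances, sizes):
    """Average of the sample variances, weighted by degrees of freedom (n_i - 1)."""
    numerator = sum((n - 1) * s2 for s2, n in zip(variances, sizes))
    denominator = sum(n - 1 for n in sizes)
    return numerator / denominator

# Hypothetical samples: variances 4 and 9 from samples of size 11 and 21.
sp2 = pooled_variance([4.0, 9.0], [11, 21])
print(sp2)        # (10*4 + 20*9) / 30, about 7.33
print(sqrt(sp2))  # pooled standard deviation
```

The larger sample pulls the pooled value closer to its own variance of 9.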
A residual is the observed discrepancy between the outcome $y_i$ and the value $\hat{y}_i$ predicted by the model: $e_i = y_i - \hat{y}_i$.
The least squares method is a type of estimation method used to find the "best fit" for a data set. It works by minimizing the sum of squared residuals.
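Ordinary least squares is a special case of least squares in which, given a linear relationship between an independent variable $x$ and a dependent variable $y$, the slope and intercept are chosen to minimize the sum of squared residuals:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
The resulting estimated regression line is:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$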
| $x$ | $y$ |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 2.5 |
| 4 | 4 |
| 5 | 4.5 |
[Figure: scatter plot of the data points together with the calculated best-fit line.]
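The fit for the five data points in the table can be reproduced with the usual closed-form expressions for simple linear regression (slope equals the sum of cross-deviations divided by the sum of squared $x$-deviations). The sketch assumes the first table column is the independent variable and the second is the outcome; the function name is illustrative.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 2.5, 4, 4.5]

intercept, slope = ols_fit(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(round(intercept, 3), round(slope, 3))  # 1.4 0.6
print([round(e, 3) for e in residuals])
```

For this data the estimated line is approximately $\hat{y} = 1.4 + 0.6x$, and the residuals sum to zero up to rounding, as they always do for a least squares line with an intercept.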