Descriptive statistics are used to summarize and describe the main features of a data set. They help us understand the data’s overall structure without analyzing every individual data point.
1. Measures of Central Tendency
These measures give us an idea of the “center” of a data set.
- Mean (Average)
- The mean is the most common measure of central tendency.
- How it’s calculated: Add all the values and divide by the number of values.
- Example: Imagine you weigh 4 objects: 125g, 173g, 108g, and 211g.
Mean = $\frac{125 + 173 + 108 + 211}{4} = 154.25$ g
- Use: The mean is useful when you want an overall measure, but it can be affected by extreme values (outliers).
- Median (Middle Value)
- The median is the middle value when all data points are arranged in order.
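- Why use it: It is far less sensitive to outliers than the mean, so it represents the typical value better for skewed data.
- Example: The set {1, 3, 6, 6, 7, 12} is already in order and has an even number of values, so the median is the average of the two middle values: $\frac{6 + 6}{2} = 6$.
- If the number of values is odd, median = the middle number (e.g., for {1, 3, 6, 7, 12}, median = 6).
- If the number of values is even, median = average of the two middle numbers (e.g., for {1, 4, 5, 12}, median = $\frac{4 + 5}{2} = 4.5$).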
- Mode (Most Frequent Value)
- The mode is the value that appears most often.
- Example: In {1, 3, 6, 6, 7, 12}, mode = 6.
- Use: Mode is helpful for categorical data, like favorite colors or survey results.
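The three measures above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module and the numbers from the examples; the extra 5000 g outlier is a made-up value, added only to illustrate how differently the mean and the median react to it.

```python
import statistics

weights_g = [125, 173, 108, 211]   # the four weights from the mean example
values = [1, 3, 6, 6, 7, 12]       # the set from the median and mode examples

mean_weight = statistics.mean(weights_g)
median_value = statistics.median(values)   # even count: average of the two middle values
mode_value = statistics.mode(values)       # most frequent value

print(mean_weight)    # 154.25
print(median_value)   # 6.0
print(mode_value)     # 6

# Hypothetical outlier (not in the original data): one 5000 g object.
with_outlier = weights_g + [5000]
print(statistics.mean(with_outlier))     # 1123.4, pulled far toward the outlier
print(statistics.median(with_outlier))   # 173, still a typical weight
```

Without the outlier, the median of the four weights is 149.0, so the outlier moves the median only slightly while it multiplies the mean several times over.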
2. Variance and Standard Deviation
These measures tell us how “spread out” the data is.
- Variance ($\sigma^2$)
- Variance measures the average squared deviation from the mean.
- Steps to calculate:
- Find the mean.
- Subtract the mean from each data point.
- Square each result.
- Average these squared differences.
- Formula:
$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$,
where $x$ = individual values, $\mu$ = mean, and $n$ = number of values.
- Example: For data points 43, 26, 31, 28, 38, 24:
- Mean = 31.67
- Variance = 45.56
- Standard Deviation ($\sigma$)
- The standard deviation is the square root of the variance, showing how much values differ from the mean in the original units.
- Example: If variance = 45.56, then $\sigma = \sqrt{45.56} \approx 6.75$.
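The calculation steps listed above can be followed line by line in code. The sketch below uses the six data points from the example and divides by $n$, as the formula in this section does; the variable names are illustrative only.

```python
import math
import statistics

data = [43, 26, 31, 28, 38, 24]   # data points from the example

# Step 1: find the mean.
mu = sum(data) / len(data)

# Steps 2 and 3: subtract the mean from each point, then square each result.
squared_devs = [(x - mu) ** 2 for x in data]

# Step 4: average the squared differences (dividing by n).
variance = sum(squared_devs) / len(data)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(mu, 2), round(variance, 2), round(std_dev, 2))   # 31.67 45.56 6.75

# Cross-check against the standard library's population variance.
print(math.isclose(variance, statistics.pvariance(data)))    # True
```

Dividing by $n$ treats the six values as the whole population. When the values are a sample from a larger population, the usual convention is to divide by $n - 1$ instead, which `statistics.variance` does.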
3. Normal Distribution (Bell Curve)
Many real-world data sets approximately follow a normal distribution, which has these properties:
- Symmetrical with most values near the mean.
- The mean, median, and mode are at the center.
- Spread is determined by the standard deviation:
- About 68% (more precisely 68.27%) of values fall within ±1 standard deviation of the mean.
- 95.45% fall within ±2 standard deviations.
- 99.73% fall within ±3 standard deviations.
Example: Heights in a population might follow a normal distribution, with most people having an average height and fewer people being very short or tall.
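The percentages above can be reproduced from the cumulative distribution function of a normal distribution. This is a small sketch using `statistics.NormalDist` from the Python standard library; `share_within` is a helper name chosen here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of values lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```

The same fractions hold for any mean and standard deviation, because the distance from the mean is measured in units of the standard deviation.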
4. Correlation Coefficient ($r$)
The correlation coefficient measures the strength and direction of the linear relationship between two variables.
- Range: $r$ always lies between $-1$ and $1$.
- $r = 1$: Perfect positive relationship (as one variable increases, the other increases).
- $r = -1$: Perfect negative relationship (as one variable increases, the other decreases).
- $r = 0$: No linear relationship (the variables may still be related in a non-linear way).
- How it’s calculated:
- Divides how much the two variables vary together (their covariance) by how much each varies on its own (the product of their standard deviations).
- Example:
- The relationship between hours studied and test scores might yield $r = 0.85$, suggesting a strong positive correlation.
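One way to make the calculation concrete is the Pearson formula, which divides the sum of the products of deviations by the square root of the product of the summed squared deviations. The sketch below assumes that Pearson's $r$ is the coefficient meant here. The hours and scores are invented for illustration and are not data from the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the variables vary together.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return co_variation / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 79]

r = pearson_r(hours, scores)
print(round(r, 3))   # close to 1: a strong positive linear relationship
```

Reversing the order of one list (so that scores fall as hours rise) flips the sign of $r$ without changing its magnitude.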
5. Chi-Square Test ($\chi^2$)
The chi-square test measures how far observed counts deviate from the counts expected under a hypothesis.
- Formula:
$\chi^2 = \sum \frac{(o - e)^2}{e}$,
where $o$ = observed value and $e$ = expected value.
- Steps:
- Compare observed vs. expected values.
- Square the differences.
- Divide by the expected values.
- Sum the results.
- Example:
- Observed cancer deaths = 22, Expected = 28.3: $\chi^2 = \frac{(22 - 28.3)^2}{28.3} = \frac{39.69}{28.3} \approx 1.40$ for this single category.
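The four steps translate directly into a few lines of code. This is a minimal sketch of the statistic only (it does not look up a critical value or a $p$-value). The function name `chi_square` and the coin-flip counts are assumptions introduced for illustration.

```python
def chi_square(observed, expected):
    """Sum of (o - e)^2 / e over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Single category from the example: 22 observed vs. 28.3 expected.
single_term = chi_square([22], [28.3])
print(round(single_term, 2))   # 1.4

# Hypothetical two-category case: 100 coin flips, 60 heads and 40 tails,
# compared with the 50/50 split expected from a fair coin.
coin_stat = chi_square([60, 40], [50, 50])
print(coin_stat)               # 4.0
```

A full test sums the contribution of every category and then compares the total with the chi-square distribution for the appropriate degrees of freedom.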
6. p-Value
The $p$-value is used in hypothesis testing to determine the significance of results.
- Definition: It represents the probability of obtaining results at least as extreme as the current results, assuming the null hypothesis is true.
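- Smaller $p$-value (< 0.05, the conventional significance level): Strong evidence against the null hypothesis, so it is rejected.
- Larger $p$-value (≥ 0.05): Weak evidence against the null hypothesis, so we fail to reject it (which is not the same as showing it is true).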
- Example:
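- In a clinical trial, a $p$-value of 0.03 means that, if the treatment truly had no effect (the null hypothesis), results at least as extreme as those observed would occur only about 3% of the time. It is not the probability that the null hypothesis is true.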
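To show where such a number can come from, the sketch below computes a two-sided $p$-value for a test statistic that follows a standard normal distribution under the null hypothesis (a z-test). The text does not say which test the trial used, so the z-test, the value `z_observed = 2.17`, and the threshold name `ALPHA` are all assumptions made for illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability, assuming the null hypothesis is true, of a z statistic
    at least as far from zero as the one observed."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

ALPHA = 0.05          # conventional significance level
z_observed = 2.17     # hypothetical test statistic, not taken from the text

p = two_sided_p_value(z_observed)
if p < ALPHA:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(round(p, 3), decision)   # 0.03 reject the null hypothesis
```

The threshold is a convention chosen before looking at the data. The $p$-value itself says only how surprising the result would be if the null hypothesis were true.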