Round 6: Statistics & Probability
Complete Guide From Scratch for Fresher Data Analyst
What to expect: 20-30 minutes testing your understanding of statistical concepts and how to apply them to business problems. You'll get both direct concept questions ("Mean vs Median?") and scenario-based questions ("This data is right-skewed. Which average should we report?").
1. Descriptive Statistics: Summarizing Data
1.1 Central Tendency: "Where Is the Center?"
| Measure | What It Is | Formula | When to Use |
|---|---|---|---|
| Mean | Sum of all values ÷ count | Sum ÷ N | When the distribution is symmetric and there are no outliers |
| Median | Middle value in sorted data | Middle position | When there are outliers or the data is skewed; the safer choice |
| Mode | Most frequently occurring value | Most frequent | For categorical data (e.g., "Delhi" is the most common city) |
Critical Example (comes up in almost every interview):
Employee Salaries: ₹30K, ₹35K, ₹40K, ₹45K, ₹50K, ₹10,00,000 (CEO)
- Mean = ₹2,00,000 → misleading! The CEO's salary drags the average up
- Median = ₹42,500 (average of the two middle values, ₹40K and ₹45K) → an accurate representation of the typical salary
🧠 Memorize this: when the data has outliers, use the Median. In the interview, say: "Mean is sensitive to outliers, so I prefer Median for skewed distributions like income or house prices."
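To see the effect for yourself, here is a minimal sketch using Python's built-in `statistics` module on the salary list from the example. The variable names are just illustrative.

```python
from statistics import mean, median

# Salaries from the example above, in rupees (the last value is the CEO).
salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 1_000_000]

mean_salary = mean(salaries)      # pulled up by the single extreme value
median_salary = median(salaries)  # average of the two middle values (40K and 45K)

print(f"Mean:   {mean_salary:,.0f}")
print(f"Median: {median_salary:,.0f}")

# Dropping the CEO barely moves the median but changes the mean a lot.
without_ceo = salaries[:-1]
print(f"Mean without CEO:   {mean(without_ceo):,.0f}")
print(f"Median without CEO: {median(without_ceo):,.0f}")
```

Removing one value shifts the mean from 200,000 to 40,000, while the median only moves from 42,500 to 40,000. That is the "sensitive to outliers" point in numbers.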
1.2 Spread: "How Scattered Is the Data?"
| Measure | What It Is | Key Point |
|---|---|---|
| Range | Max - Min | Very basic, heavily affected by outliers |
| Variance (σ²) | Average of squared deviations from the mean | Measures spread, but units are squared |
| Standard Deviation (σ) | √Variance | Most important; same units as the original data |
| IQR | Q3 - Q1 (75th - 25th percentile) | Range of the middle 50% of data, best for outlier detection |
Standard Deviation: Intuitive Explanation (population SD, dividing by N):
- Class A marks: 70, 72, 68, 71, 69 → SD ≈ 1.4 (consistent: everyone scored similarly)
- Class B marks: 30, 50, 90, 70, 10 → SD ≈ 28 (highly variable: very different scores)
Low SD = consistent data. High SD = lots of variation.
Worked Problem: Calculating Variance and SD by Hand
Data: 4, 8, 6, 10, 2
Step 1: Mean = (4+8+6+10+2)/5 = 30/5 = 6
Step 2: Deviations from mean:
4-6 = -2, 8-6 = +2, 6-6 = 0, 10-6 = +4, 2-6 = -4
Step 3: Squared deviations:
4, 4, 0, 16, 16
Step 4: Variance = (4+4+0+16+16)/5 = 40/5 = 8 (population variance, dividing by N; the sample variance divides by N-1 = 4 and gives 10)
Step 5: SD = √8 ≈ 2.83
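The same five steps can be sketched in Python. This follows the hand calculation (dividing by N, i.e. the population variance) and then cross-checks it against the standard library. `variance()` is shown only to illustrate the N-1 (sample) version, which gives a different number.

```python
from statistics import mean, pstdev, pvariance, variance

data = [4, 8, 6, 10, 2]

mu = mean(data)                                # Step 1: mean
deviations = [x - mu for x in data]            # Step 2: deviations from the mean
squared_devs = [d ** 2 for d in deviations]    # Step 3: squared deviations
pop_var = sum(squared_devs) / len(data)        # Step 4: divide by N
pop_sd = pop_var ** 0.5                        # Step 5: square root

print(mu, squared_devs, pop_var, round(pop_sd, 2))

# Cross-check with the standard library (population versions divide by N).
print(pvariance(data), round(pstdev(data), 2))

# The sample variance divides by N-1 instead, so it is larger.
sample_var = variance(data)
print(sample_var)
```

If an interviewer asks which one to use: the population formula applies when the data is the whole group of interest, and the N-1 version is the usual choice when the data is a sample from a larger group.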
1.3 Percentiles & Quartiles
"90th percentile" means: 90% of observations fall below this value. If your score is at the 90th percentile, you're in the top 10%.
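As a rough illustration of that definition, the sketch below counts what share of values fall strictly below a given score. The function name `percentile_rank` and the score list are made up for this example, and other tools may use slightly different percentile conventions.

```python
def percentile_rank(values, score):
    """Percentage of values that fall strictly below `score`."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

# Hypothetical test scores for 10 candidates.
scores = [35, 42, 50, 58, 61, 67, 73, 80, 88, 95]

rank = percentile_rank(scores, 95)
print(f"A score of 95 is at the {rank:.0f}th percentile")
```

Here 9 of the 10 scores are below 95, so that candidate sits at the 90th percentile, in other words in the top 10%.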
1.4 Box Plot: Visual Summary of Distribution
Min* |------[ Q1 | Q2 (Median) | Q3 ]------| Max*     o  o  <- Outliers
The box [ ... ] spans the IQR (Q3 - Q1); the line inside it marks the median.
* Whiskers extend to the most extreme data points that still lie within Q1 - 1.5×IQR and Q3 + 1.5×IQR
Points beyond the whiskers = OUTLIERS
Worked Problem: Outlier Detection
Data (sorted): 12, 15, 18, 20, 22, 25, 28, 30, 95
Q1 = 18 (25th percentile: median of the lower half 12, 15, 18, 20, 22)
Q3 = 28 (75th percentile: median of the upper half 22, 25, 28, 30, 95)
IQR = 28 - 18 = 10
Lower fence = Q1 - 1.5 × IQR = 18 - 15 = 3
Upper fence = Q3 + 1.5 × IQR = 28 + 15 = 43
→ 95 > 43, so 95 IS an outlier
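The fence calculation can be sketched with the standard library. One caveat: quartile conventions differ between textbooks and tools, so Q1 and Q3 (and therefore the fences) can come out slightly different depending on the method. The sketch below assumes `method="inclusive"`, which interpolates between data points; the conclusion that 95 is an outlier does not depend on that choice here.

```python
from statistics import quantiles

data = [12, 15, 18, 20, 22, 25, 28, 30, 95]

# Quartiles with the "inclusive" convention (an assumption; other methods exist).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers}")
```

In an interview it is worth saying out loud which quartile convention you are using, because Excel, pandas, and hand methods do not always agree on small datasets.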
2. Distributions: The Shape of Data
2.1 Normal Distribution (Bell Curve)
The most important distribution in statistics. Many natural phenomena approximately follow it, such as heights, IQ scores, and measurement errors.
[Figure: bell curve centered on μ, with nested brackets marking μ ± 1σ, μ ± 2σ, and μ ± 3σ]
The 68-95-99.7 Rule (Empirical Rule):
| Range | % of Data | Example (Mean=50, SD=10) |
|---|---|---|
| μ ± 1σ | 68% | 40 to 60 contains 68% of observations |
| μ ± 2σ | 95% | 30 to 70 contains 95% of observations |
| μ ± 3σ | 99.7% | 20 to 80 contains virtually all observations |
Worked Problem:
Customer daily spending is normally distributed: Mean = ₹500, SD = ₹100.
Q: What % of customers spend between ₹300 and ₹700?
A: ₹300 = 500 - 2×100 = μ - 2σ
₹700 = 500 + 2×100 = μ + 2σ
By the 68-95-99.7 rule → about 95% of customers
Q: A customer spends ₹800. Is this unusual?
A: ₹800 = 500 + 3×100 = μ + 3σ
Only about 0.15% of customers spend this much or more (half of the 0.3% that falls outside μ ± 3σ) → YES, highly unusual
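The rule-of-thumb answers can be checked against the exact normal curve. This is a small sketch using `statistics.NormalDist` with the mean and SD from the problem. The exact figures are slightly different from the rounded 68-95-99.7 numbers, which is expected.

```python
from statistics import NormalDist

# Daily spending model from the worked problem: mean 500, SD 100.
spend = NormalDist(mu=500, sigma=100)

# Share of customers between 300 and 700 (mu - 2 sigma to mu + 2 sigma).
share_300_700 = spend.cdf(700) - spend.cdf(300)

# Share of customers spending 800 or more (beyond mu + 3 sigma).
share_above_800 = 1 - spend.cdf(800)

print(f"Between 300 and 700: {share_300_700:.2%}")
print(f"800 or more:         {share_above_800:.2%}")
```

The exact values are about 95.4% and about 0.13%, so the empirical rule is a close approximation, good enough for a quick answer in an interview.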
🧠 Use it like this in an interview: "If customer spending is normally distributed with mean ₹5000 and SD ₹1000, then about 95% of customers spend between ₹3000 and ₹7000."
2.2 Skewness: Data Leaning to One Side
| Direction | Tail | Mean vs Median | Examples |
|---|---|---|---|
| Right Skewed | Long tail to the RIGHT | Mean > Median | Income, house prices |
| Left Skewed | Long tail to the LEFT | Mean < Median | Exam scores (easy exam), age at death, age at retirement |
| Symmetric | No tail | Mean ≈ Median | Height, weight, IQ |
🧠 Trick: The mean always gets pulled toward the tail. Right-skewed → Mean > Median. In the interview: "Income data is right-skewed, so I'd report the Median, not the Mean."
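The "mean gets pulled toward the tail" idea can be turned into a quick check. The sketch below is only a rough heuristic, not a formal skewness statistic: it measures the gap between mean and median in units of standard deviation. The function name `skew_hint`, the 0.1 tolerance, and both datasets are assumptions made up for illustration.

```python
from statistics import mean, median, pstdev

def skew_hint(values, tolerance=0.1):
    """Rough skew hint: (mean - median) measured in standard deviations."""
    sd = pstdev(values)
    if sd == 0:
        return "roughly symmetric"  # all values identical, nothing to compare
    gap = (mean(values) - median(values)) / sd
    if gap > tolerance:
        return "right-skewed"
    if gap < -tolerance:
        return "left-skewed"
    return "roughly symmetric"

# Hypothetical monthly incomes (in thousands) with one very high earner.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 60, 150]
# Hypothetical scores on an easy exam with one very low score.
exam_scores = [35, 70, 78, 82, 85, 88, 90, 92, 95, 97]

print("Incomes:    ", skew_hint(incomes))
print("Exam scores:", skew_hint(exam_scores))
```

On real data, a histogram is still the more reliable check, because the mean-versus-median comparison can mislead on small or multi-peaked datasets.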
2.3 Other Important Distributions
| Distribution | When It Occurs | Example |
|---|---|---|
| Binomial | Fixed number of independent trials, each a success or failure with the same probability | "Out of 10 emails, how many get opened?" |
| Poisson | Count of events in a fixed interval, occurring independently at a constant average rate | "How many customer complaints per hour?" |
| Uniform | Every outcome equally likely | Rolling a fair die |
| Exponential | Waiting time between events that arrive at a constant average rate | "Time between customer arrivals at a store" |
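To make the first two rows concrete, here is a small sketch of the binomial and Poisson probability formulas using only the `math` module. The 20% open rate and the average of 2 complaints per hour are hypothetical numbers chosen for illustration; they are not from the guide.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval when the average count is lam)."""
    return lam ** k * exp(-lam) / factorial(k)

# "Out of 10 emails, how many get opened?" assuming a 20% open rate.
p_three_opens = binomial_pmf(3, n=10, p=0.2)

# "How many customer complaints per hour?" assuming an average of 2 per hour.
p_no_complaints = poisson_pmf(0, lam=2)

print(f"P(exactly 3 of 10 emails opened) = {p_three_opens:.3f}")
print(f"P(no complaints in an hour)      = {p_no_complaints:.3f}")
```

Under these assumed rates, both probabilities come out at roughly 0.20 and 0.14 respectively.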
3. Z-Score: "How Normal Is This Data Point?"
A Z-Score tells you how many standard deviations a data point is from the mean. It allows comparison across different scales.
Formula: Z = (X - μ) / σ
Worked Problem: Comparing Performance Across Subjects
- Maths: 80/100 (class avg 70, SD 10) → Z = (80-70)/10 = 1.0
- English: 75/100 (class avg 60, SD 5) → Z = (75-60)/5 = 3.0
The English performance is relatively much better despite the lower raw marks: assuming marks are roughly normally distributed, the student is 3 standard deviations above the class average (top 0.15%), compared with only 1 SD above in Maths (top 16%).
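The comparison above is just the Z-score formula applied twice. A minimal sketch, with `z_score` as an illustrative helper name:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

maths_z = z_score(80, mean=70, sd=10)
english_z = z_score(75, mean=60, sd=5)

print(f"Maths Z   = {maths_z}")
print(f"English Z = {english_z}")

# The higher Z-score marks the stronger performance relative to the class.
better_subject = "English" if english_z > maths_z else "Maths"
print(f"Relatively better subject: {better_subject}")
```

The same helper works for any metric, such as sales per store or response time per team, as long as each value is compared against the mean and SD of its own group.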
| Z-Score | Meaning | Percentile (approx., for normal data) |
|---|---|---|
| 0 | Exactly at the average | 50th |
| +1 | 1 SD above average | 84th (top 16%) |
| +2 | 2 SD above average | 97.5th (top 2.5%) |
| +3 | 3 SD above average | 99.85th (top 0.15%) |
| -1 | 1 SD below average | 16th (bottom 16%) |
| -2 | 2 SD below average | 2.5th (bottom 2.5%) |
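The percentiles in the table are the rounded values from the 68-95-99.7 rule. As a cross-check, this sketch computes them from the standard normal curve with `statistics.NormalDist`; the helper name `percentile_from_z` is an assumption for illustration, and the result only applies when the data is approximately normal.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

def percentile_from_z(z):
    """Percentage of a normal distribution that lies below the given Z-score."""
    return 100 * standard_normal.cdf(z)

for z in (0, 1, 2, 3, -1, -2):
    print(f"Z = {z:+d} -> {percentile_from_z(z):.2f}th percentile")
```

The exact values (for example about 97.7 for Z = +2 and about 99.87 for Z = +3) are very close to the table, so the rounded figures are fine to quote in an interview.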