📈 Round 6 — Statistics & Probability

A Complete Guide From Scratch for Fresher Data Analysts

What to expect: 20–30 minutes testing your understanding of statistical concepts and how to apply them to business problems. You'll get both direct concept questions ("Mean vs Median?") and scenario-based questions ("This data is right-skewed — which average should we report?").

1. Descriptive Statistics — Summarizing Data

1.1 Central Tendency — "Where Is the Center?"

Measure	What It Is	Formula	When to Use
Mean	Sum of all values ÷ count	Sum ÷ N	When distribution is symmetric and there are no outliers
Median	Middle value in sorted data	Middle position	When there are outliers or data is skewed — safer choice
Mode	Most frequently occurring value	Most frequent	For categorical data (e.g., "Delhi" is the most common city)

Critical Example — comes up in almost every interview:

Employee Salaries: ₹30K, ₹35K, ₹40K, ₹45K, ₹50K, ₹10,00,000 (CEO)

Mean = ₹2,00,000 — misleading! The CEO salary drags the average up
Median = ₹42,500 — accurate representation of the typical salary
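The two figures above can be checked with Python's standard `statistics` module. This is only a minimal sketch; the variable names (`salaries`, `avg`, `mid`) are illustrative.

```python
from statistics import mean, median

# Salaries from the example above, in rupees (the last value is the CEO)
salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 1_000_000]

avg = mean(salaries)    # pulled upward by the single very large value
mid = median(salaries)  # average of the two middle values (40K and 45K)

print("Mean:", avg)
print("Median:", mid)
```

Five of the six employees earn far less than the mean, which is why the median is the better summary of a "typical" salary here.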

🧠 Memorize this: whenever the data has outliers, use the Median. In an interview, say: "Mean is sensitive to outliers, so I prefer Median for skewed distributions like income or house prices."

1.2 Spread — "How Scattered Is the Data?"

Measure	What It Is	Key Point
Range	Max - Min	Very basic, heavily affected by outliers
Variance (σ²)	Average of squared deviations from mean	Measures spread, but units are squared
Standard Deviation (σ)	√Variance	Most important — same units as original data
IQR	Q3 - Q1 (75th - 25th percentile)	Range of the middle 50% of data, best for outlier detection

Standard Deviation — Intuitive Explanation:

Class A marks: 70, 72, 68, 71, 69 → SD ≈ 1.4 (consistent: everyone scored similarly)
Class B marks: 30, 50, 90, 70, 10 → SD ≈ 28 (highly variable: very different scores)

Low SD = consistent data. High SD = lots of variation.

Worked Problem — Calculating Variance and SD by Hand:

Data: 4, 8, 6, 10, 2

Step 1: Mean = (4+8+6+10+2)/5 = 30/5 = 6

Step 2: Deviations from mean:
  4-6 = -2,  8-6 = +2,  6-6 = 0,  10-6 = +4,  2-6 = -4

Step 3: Squared deviations:
  4, 4, 0, 16, 16

Step 4: Variance (population, dividing by N) = (4+4+0+16+16)/5 = 40/5 = 8

Step 5: SD = √8 ≈ 2.83
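The hand calculation can be cross-checked in Python. The sketch below assumes the standard `statistics` module and also shows the sample versions (dividing by N - 1), since interviewers often ask about the difference.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [4, 8, 6, 10, 2]

# Population formulas divide by N, matching the hand calculation above
pop_var = pvariance(data)  # 40 / 5
pop_sd = pstdev(data)      # square root of the population variance

# Sample formulas divide by N - 1 (used when the data is a sample of a larger group)
samp_var = variance(data)  # 40 / 4
samp_sd = stdev(data)      # square root of the sample variance

print(pop_var, round(pop_sd, 2))
print(samp_var, round(samp_sd, 2))
```

The sample values come out slightly larger (variance 10, SD about 3.16) because dividing by N - 1 corrects for estimating the mean from the same data.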

1.3 Percentiles & Quartiles

"90th percentile" means: 90% of observations fall below this value. If your score is at the 90th percentile, you're in the top 10%.
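As a rough illustration of this definition, the sketch below computes what percentage of observations fall below a given score. The helper `percentile_rank` and the ten exam scores are hypothetical, made up for this example.

```python
def percentile_rank(values, x):
    """Percentage of observations in `values` that fall strictly below `x`."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical exam scores for ten students
scores = [35, 42, 48, 55, 60, 66, 71, 78, 84, 93]

rank = percentile_rank(scores, 93)
print(rank)  # 90.0, because nine of the ten scores are below 93
```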

1.4 Box Plot — Visual Summary of Distribution

              IQR (Q3 - Q1)
           ┌──────────────┐
     ╶─────┤    ┃         ├─────╴    ●  ●   ← Outliers
           └──────────────┘
   Min*    Q1    Q2(Median) Q3    Max*
   
   * Whiskers extend to the most extreme data points within Q1 - 1.5×IQR and Q3 + 1.5×IQR
   Points beyond whiskers = OUTLIERS

Worked Problem — Outlier Detection:

Data (sorted): 12, 15, 18, 20, 22, 25, 28, 30, 95

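Median (Q2) = 22 (the 5th of 9 values)
Q1 = 18 (25th percentile: median of the lower half 12, 15, 18, 20, 22)
Q3 = 28 (75th percentile: median of the upper half 22, 25, 28, 30, 95)
IQR = 28 - 18 = 10

Lower fence = Q1 - 1.5 × IQR = 18 - 15 = 3
Upper fence = Q3 + 1.5 × IQR = 28 + 15 = 43

→ 95 > 43, so 95 IS an outlier ✅

Note: quartile conventions differ slightly between textbooks and tools (here the median is included in both halves), so the fences can shift a little. The conclusion about 95 does not change.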
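The same fence check can be sketched in Python. Quartile conventions vary, so this example assumes `statistics.quantiles` with `method="inclusive"`, which gives Q1 = 18 and Q3 = 28 for this data. Other conventions move the fences slightly, but 95 stays well beyond the upper fence either way.

```python
from statistics import quantiles

data = [12, 15, 18, 20, 22, 25, 28, 30, 95]

# method="inclusive" treats the data as the full population (an assumption here)
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```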

2. Distributions — The Shape of Data

2.1 Normal Distribution (Bell Curve)

The most important distribution in statistics. Many natural phenomena follow it approximately, such as heights, IQ scores, and measurement errors.

            ┌────────┐
          ┌─┤        ├─┐
        ┌─┤ │        │ ├─┐
      ┌─┤ │ │   μ    │ │ ├─┐
    ──┤ │ │ │   ↓    │ │ │ ├──
   ───┴─┴─┴─┴───┼────┴─┴─┴─┴───
            │←1σ→│
         │←──2σ──→│
      │←────3σ────→│

The 68-95-99.7 Rule (Empirical Rule):

Range	% of Data	Example (Mean=50, SD=10)
μ ± 1σ	68%	40 to 60 contains 68% of observations
μ ± 2σ	95%	30 to 70 contains 95% of observations
μ ± 3σ	99.7%	20 to 80 contains virtually all observations

Worked Problem:

Customer daily spending is normally distributed: Mean = ₹500, SD = ₹100.

Q: What % of customers spend between ₹300 and ₹700?
A: ₹300 = 500 - 2×100 = μ - 2σ
   ₹700 = 500 + 2×100 = μ + 2σ
   By 68-95-99.7 rule → 95% of customers ✅

Q: A customer spends ₹800. Is this unusual?
A: ₹800 = 500 + 3×100 = μ + 3σ
   Only about 0.15% of customers spend this much or more → YES, highly unusual ✅
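The rule-of-thumb answers can be compared with exact normal probabilities. The sketch below assumes spending is exactly normal and uses `statistics.NormalDist` from the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

# Spending model from the worked problem: mean 500, SD 100
spending = NormalDist(mu=500, sigma=100)

# Share of customers between 300 and 700 (mean plus or minus 2 SD)
between = spending.cdf(700) - spending.cdf(300)

# Share of customers spending 800 or more (3 SD above the mean)
above_800 = 1 - spending.cdf(800)

print(f"{between:.4f}")    # 0.9545
print(f"{above_800:.5f}")  # 0.00135
```

The exact values (95.45% and 0.135%) are close to the rounded 95% and 0.15% from the 68-95-99.7 rule, which is why the rule is good enough for quick interview answers.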

🧠 How to use this in an interview: "If customer spending is normally distributed with mean ₹5000 and SD ₹1000, then about 95% of customers spend between ₹3000 and ₹7000."

2.2 Skewness — Data Leaning to One Side

Direction	Tail	Mean vs Median	Examples
Right Skewed	Long tail to the RIGHT	Mean > Median	Income, house prices
Left Skewed	Long tail to the LEFT	Mean < Median	Exam scores (easy exam), age at retirement, age at death
Symmetric	Tails balanced on both sides	Mean ≈ Median	Height, IQ

🧠 Trick: the Mean is always pulled toward the tail. Right-skewed → Mean > Median. In an interview: "Income data is right-skewed, so I'd report Median, not Mean."

2.3 Other Important Distributions

Distribution	When It Occurs	Example
Binomial	Fixed number of independent trials, each ending in success or failure with the same probability	"Out of 10 emails, how many get opened?"
Poisson	Count of independent events in a fixed interval, at a constant average rate	"How many customer complaints per hour?"
Uniform	Every outcome equally likely	Rolling a fair die
Exponential	Waiting time between events that occur at a constant average rate	"Time between customer arrivals at a store"
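To make the first two rows concrete, here is a small sketch of the Binomial and Poisson probability formulas using only the `math` module. The open rate of 0.2 and the average of 3 complaints per hour are assumed numbers chosen for illustration, and the helper names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval when the average count is lam)."""
    return lam**k * exp(-lam) / factorial(k)

# Assumed: each of 10 emails is opened independently with probability 0.2
p_two_opens = binomial_pmf(2, n=10, p=0.2)

# Assumed: complaints arrive at an average rate of 3 per hour
p_no_complaints = poisson_pmf(0, lam=3)

print(round(p_two_opens, 4))      # 0.302
print(round(p_no_complaints, 4))  # 0.0498
```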

3. Z-Score — "How Normal Is This Data Point?"

A Z-Score tells you how many standard deviations a data point is from the mean. It allows comparison across different scales.

Formula: Z = (X - μ) / σ

Worked Problem — Comparing Performance Across Subjects:

Maths: 80/100 (class avg 70, SD 10) → Z = (80-70)/10 = 1.0
English: 75/100 (class avg 60, SD 5) → Z = (75-60)/5 = 3.0

English performance is relatively much better despite lower raw marks: the student is 3 standard deviations above the class average (top 0.15%, assuming marks are roughly normally distributed) compared with only 1 SD above in Maths (top 16%).
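The comparison can be written as a tiny helper. This is a minimal sketch; `z_score` is a hypothetical function name, not part of any library.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

maths = z_score(80, mean=70, sd=10)
english = z_score(75, mean=60, sd=5)

# The subject with the higher z-score is the stronger relative performance
better = "English" if english > maths else "Maths"
print(maths, english, better)  # 1.0 3.0 English
```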

Z-Score	Meaning	Percentile (approx., normal distribution)
0	Exactly at the average	50th
+1	1 SD above average	84th (top 16%)
+2	2 SD above average	97.5th (top 2.5%)
+3	3 SD above average	99.85th (top 0.15%)
-1	1 SD below average	16th (bottom 16%)
-2	2 SD below average	2.5th (bottom 2.5%)
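The percentiles in this table are rounded values from the 68-95-99.7 rule. A sketch using `statistics.NormalDist` (standard normal, mean 0 and SD 1) gives the exact figures, which differ slightly. For example, z = +2 is closer to the 97.7th percentile than the 97.5th.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Percentile = share of the distribution below each z-score, as a percentage
percentiles = {z: 100 * standard_normal.cdf(z) for z in (-2, -1, 0, 1, 2, 3)}

for z, pct in percentiles.items():
    print(f"z = {z:+d} -> {pct:.2f}th percentile")
```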
