Random Forest & Regression
3. Random Forest — Decision Tree's Powerful Upgrade
Random Forest = an ensemble of many decision trees whose combined vote is more accurate than any single tree.
Why it works better than a single tree:
- Each tree trains on a random subset of data (Bagging)
- Each tree considers random features at each split
- Individual errors cancel out through majority voting
Key Concepts:
| Concept | Meaning |
|---|---|
| Bagging | Bootstrap Aggregating — each tree gets a random data sample (with replacement) |
| Feature Randomness | Each split considers only √n random features (classification) or n/3 (regression) |
| Ensemble | Combining multiple models for better performance |
| OOB Score | Out-of-Bag Score — built-in validation using the ~37% of data each tree didn't train on |
🧠 Analogy: Getting an opinion from one doctor vs. a committee of 100 doctors. The committee will be more accurate — individual biases cancel each other out.
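The bagging and OOB ideas above can be sketched with scikit-learn. This is a minimal sketch on a synthetic dataset; the dataset, sizes, and parameters are illustrative, not taken from these notes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for real tabular data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Each of the 100 trees trains on a bootstrap sample of the rows;
# oob_score=True scores each tree on the ~37% of rows it never saw
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

print(round(rf.oob_score_, 3))  # free validation estimate, no separate holdout needed
```

The OOB score is handy in interviews: it gives a validation accuracy without sacrificing any training data.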
4. Linear & Logistic Regression — Basics
Linear Regression (Predict a number)
Fits a straight line through data points to predict a continuous outcome.
Formula: y = mx + b
- y = predicted value (e.g., Sales)
- x = input feature (e.g., Ad Spend)
- m = slope (how much y changes per unit change in x)
- b = intercept (predicted y when x = 0)
Worked Example:
A model predicts: Sales = 200 × (Ad_Spend_in_lakhs) + 5000
Interpretation:
- Base sales (no ads) = ₹5,000
- Each ₹1 lakh in ad spend adds ₹200 to sales
- If Ad Spend = ₹10 lakhs → Sales = 200×10 + 5000 = ₹7,000
Key Assumptions:
- Linear relationship between x and y
- No multicollinearity (features shouldn't be highly correlated with each other)
- Homoscedasticity (constant variance of errors)
- Normal distribution of residuals
R² (Coefficient of Determination):
- Measures how well the model explains variance in the data
- R² = 0.85 → Model explains 85% of the variance, 15% unexplained
- R² = 1.0 → Perfect fit; R² = 0.0 → Model explains nothing
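The worked example and R² can be reproduced with scikit-learn. The ad-spend data below is synthetic, generated to follow y = 200x + 5000 plus noise, so the fitted slope, intercept, and R² are what the notes predict:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic ad-spend (lakhs) vs. sales data following y = 200x + 5000 + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(100, 1))
y = 200 * X.ravel() + 5000 + rng.normal(0, 100, size=100)

model = LinearRegression()
model.fit(X, y)

# Slope m (near 200), intercept b (near 5000), and R² on this data
print(round(float(model.coef_[0]), 1),
      round(float(model.intercept_), 1),
      round(model.score(X, y), 3))
```

`model.score(X, y)` returns R², so a value near 1.0 here means the line explains almost all the variance in the synthetic sales figures.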
Logistic Regression (Predict a category — Yes/No)
Despite the name "Regression," this is a classification algorithm. It predicts the probability of a binary outcome using the sigmoid function.
- Output: Probability between 0 and 1
- Decision threshold: Usually 0.5 — probability > 0.5 = Yes, ≤ 0.5 = No
- Use cases: Churn prediction, spam detection, loan default
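The sigmoid-plus-threshold step can be shown in a few lines. The score z = 1.2 is a made-up example of the linear score w·x + b a trained model would produce:

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1), interpreted as a probability
    return 1 / (1 + math.exp(-z))

# Hypothetical linear score from a trained model
z = 1.2
p = sigmoid(z)
label = "Yes" if p > 0.5 else "No"
print(round(p, 3), label)  # → 0.769 Yes
```

Note that sigmoid(0) = 0.5, so the usual 0.5 probability threshold corresponds to the linear score crossing zero.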
5. Feature Engineering & Data Preparation
5.1 Train-Test Split
Why: Evaluate model on data it has NEVER seen during training.
Common Splits:
80/20 → 80% train, 20% test (most common)
70/30 → When the dataset is small, so the test set stays large enough for a reliable estimate
60/20/20 → Train/Validation/Test (for tuning hyperparameters)
CRITICAL: Never use test data during training — that's data leakage!
🧠 Be sure to say this in an interview: "Stratified split ensures class proportions are maintained. If 30% of data is churn, both train and test sets will have ~30% churn."
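The stratified-split point can be verified directly. The 30/70 churn labels below are a toy example, not real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 30% churn (1), 70% stay (0)
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 30/70 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_tr.mean(), y_te.mean())  # both ≈ 0.30
```

Without `stratify=y`, a small test set could end up with far more or far fewer churners than 30% just by chance.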
5.2 Handling Missing Values
| Strategy | When to Use | Code |
|---|---|---|
| Drop rows | Very few missing values (< 5%) | df.dropna() |
| Mean/Median imputation | Numerical columns | df['col'].fillna(df['col'].median()) |
| Mode imputation | Categorical columns | df['col'].fillna(df['col'].mode()[0]) |
| Forward/Back fill | Time series data | df['col'].ffill() |
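The median and mode strategies from the table look like this in practice. The column names and values are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (columns are illustrative)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

df["age"] = df["age"].fillna(df["age"].median())      # numeric → median (35 here)
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical → mode ("Delhi")

print(df.isna().sum().sum())  # 0 missing values remain
```

Median is preferred over mean for numeric columns because it is robust to outliers.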
5.3 Encoding Categorical Variables
| Method | When to Use | Example |
|---|---|---|
| Label Encoding | Ordinal data (has natural order) | Low=0, Medium=1, High=2 |
| One-Hot Encoding | Nominal data (no order) | City → is_Delhi, is_Mumbai, is_Bangalore |
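Both encodings from the table can be sketched with pandas. The mapping dict and city values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"size": ["Low", "High", "Medium"],
                   "city": ["Delhi", "Mumbai", "Delhi"]})

# Label encoding for ordinal data: the mapping preserves the natural order
order = {"Low": 0, "Medium": 1, "High": 2}
df["size_enc"] = df["size"].map(order)

# One-hot encoding for nominal data: one 0/1 column per city
df = pd.get_dummies(df, columns=["city"], prefix="is")
print(df.columns.tolist())  # includes is_Delhi and is_Mumbai
```

Using label encoding on nominal data like cities would invent a fake ordering (Delhi < Mumbai), which is exactly why one-hot is used there instead.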
6. Model Evaluation — How Good Is the Model?
6.1 Confusion Matrix
                    PREDICTED
                 Positive  Negative
ACTUAL Positive  [  TP   |   FN   ]
       Negative  [  FP   |   TN   ]
| Term | Meaning | Example (Churn Prediction) |
|---|---|---|
| TP (True Positive) | Predicted positive, actually positive | Predicted churn, customer did churn ✅ |
| TN (True Negative) | Predicted negative, actually negative | Predicted stay, customer did stay ✅ |
| FP (False Positive) | Predicted positive, actually negative | Predicted churn, but customer stayed ❌ (false alarm) |
| FN (False Negative) | Predicted negative, actually positive | Predicted stay, but customer churned ❌ (missed) |
6.2 Key Metrics
| Metric | Formula | What It Tells You | When It Matters |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | Overall proportion correct | Only when classes are balanced |
| Precision | TP / (TP+FP) | Of those predicted positive, how many actually were? | When false positives are costly (spam filter) |
| Recall | TP / (TP+FN) | Of all actual positives, how many did we catch? | When false negatives are costly (disease detection) |
| F1 Score | 2×(P×R)/(P+R) | Harmonic mean of Precision and Recall | When you need a balance of both |
| AUC-ROC | Area under ROC curve | Model's ability to distinguish classes | Overall model discrimination ability |
Worked Problem — Complete Confusion Matrix Analysis:
A churn model's confusion matrix on 200 test customers:
                  Predicted
                Churn    Stay
Actual Churn  [  35   |   15  ]  = 50 actual churners
Actual Stay   [  10   |  140  ]  = 150 actual stayers
Accuracy = (35+140)/200 = 87.5%
Precision = 35/(35+10) = 77.8% (of predicted churners, 78% actually churned)
Recall = 35/(35+15) = 70.0% (caught 70% of actual churners)
F1 Score = 2×(0.778×0.70)/(0.778+0.70) = 0.737
Interpretation: The model misses 30% of churners (15 FN). If each lost churner costs ₹5,000, that's ₹75,000 in missed retention opportunities → you might want to increase recall.
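The arithmetic in the worked problem can be checked with a few lines of plain Python, using the same four cell counts:

```python
# Metrics computed from the worked confusion matrix (35, 15, 10, 140)
TP, FN, FP, TN = 35, 15, 10, 140

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# → 0.875 0.778 0.7 0.737
```

Writing the formulas out like this is also a good interview move: it shows you know the definitions, not just the sklearn function names.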
The Critical Interview Scenario:
Q: "Cancer detection — Precision or Recall?" A: Recall. Missing a real cancer case (False Negative) is far worse than ordering extra tests (False Positive).
Q: "Spam filter — Precision or Recall?" A: Precision. Sending an important email to spam (False Positive) is worse than letting some spam through (False Negative).
6.3 ROC Curve & AUC
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate at various threshold settings.
- AUC = 1.0 → Perfect model
- AUC = 0.5 → Random guessing (useless)
- AUC > 0.8 → Good model
- AUC > 0.9 → Excellent model
🧠 One-liner for "What is the ROC curve?": "It shows the trade-off between catching more positives (recall) and generating false alarms at every possible threshold. AUC summarizes this trade-off into a single number."
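AUC can be computed from predicted probabilities with scikit-learn. The labels and probabilities below are a made-up example of a churn model's output:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted churn probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]

# AUC = probability that a random positive is ranked above a random negative
auc = roc_auc_score(y_true, y_prob)
print(round(auc, 3))  # → 0.875 (14 of the 16 positive/negative pairs ranked correctly)
```

Note that AUC is computed from probabilities, not hard labels, so it evaluates the model across all thresholds at once.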
6.4 The Accuracy Paradox
🧠 Don't be impressed just because you hear "98% accuracy"!
Example: 1000 transactions: 980 normal, 20 fraud. A model that ALWAYS predicts "Normal" → Accuracy = 980/1000 = 98%! But it detected zero fraud cases.
Lesson: For imbalanced data, accuracy is meaningless. Use F1 Score, Precision, Recall, and AUC instead.
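The fraud example above takes four lines to demonstrate:

```python
# 1000 transactions: 980 normal (0), 20 fraud (1); model ALWAYS predicts "Normal"
y_true = [0] * 980 + [1] * 20
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = caught / 20

print(accuracy, recall)  # → 0.98 0.0 — impressive accuracy, zero fraud caught
```

The recall of 0.0 is the number that exposes the useless model, which is exactly why recall and F1 matter on imbalanced data.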
7. Bias-Variance Tradeoff
| Concept | What It Is | Analogy |
|---|---|---|
| Bias | Error from oversimplification — model misses real patterns | Arrows cluster together but far from the bullseye |
| Variance | Error from overcomplexity — model learns noise | Arrows scattered all over |
| Sweet Spot | Neither too simple nor too complex | Arrows clustered near the bullseye |
| Model State | Bias | Variance | What's Happening |
|---|---|---|---|
| Underfitting | High | Low | Too simple — misses patterns |
| Good Fit | Low | Low | Just right |
| Overfitting | Low | High | Too complex — memorizes noise |
For Decision Trees specifically:
- Deep, unpruned tree → Low bias, High variance (overfits)
- Shallow, pruned tree → High bias, Low variance (underfits)
- Random Forest → Reduces variance while keeping low bias (best of both)
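The three bullets above can be demonstrated empirically. The dataset is synthetic with deliberate label noise (`flip_y=0.2`) so an unpruned tree has noise to memorize; sizes and parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: 20% of labels are randomly flipped
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)   # deep, unpruned
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# The unpruned tree memorizes the training set (accuracy 1.0) but that
# includes the flipped labels, so its test accuracy suffers; averaging
# 200 such trees cancels much of that variance
print(tree.score(X_tr, y_tr), round(tree.score(X_te, y_te), 3))
print(round(forest.score(X_te, y_te), 3))
```

On runs like this, the single tree's perfect training score alongside a noticeably lower test score is the textbook signature of high variance.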
8. Interview Questions (12 Questions)
Q1: "Decision Tree vs Random Forest?"
Answer: "A Decision Tree is a single model that's easy to interpret and visualize — you can show it to a client and they'll understand the logic. However, it's prone to overfitting. Random Forest c