# Decision Trees ⭐

## 2. Decision Trees: The Core Algorithm 🌳
A Decision Tree is a supervised learning algorithm that recursively splits data based on feature values to create a flowchart-like structure for classification or regression.
### 2.1 How It Works
The tree asks a series of questions, splitting data into increasingly pure groups:
```
                Is Monthly Spend > ₹5000?
               /                        \
             YES                         NO
              |                           |
   Contract > 12 months?        Support Calls > 4?
      /           \                /           \
    YES            NO            YES            NO
     |              |             |              |
 Will STAY ✅   Will CHURN ❌  Will CHURN ❌  Will STAY ✅
 (High spender, (Short contract (Unhappy      (Low spender,
 long contract)  = churn risk)   customer)     but satisfied)
```
Structure:
- Root Node: Topmost node; represents the entire dataset
- Decision Nodes: Where data splits based on a feature
- Leaf Nodes: Final output; the predicted class or value
- Branch: The outcome of a split (e.g., "Yes" or "No")
- Depth: Number of edges from root to the deepest leaf
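Conceptually, the tree above is just nested if/else checks. A minimal sketch in plain Python (the function name and example calls are illustrative; the thresholds come from the diagram):

```python
def predict_churn(monthly_spend, contract_months, support_calls):
    """Hand-coded version of the churn tree in the diagram above."""
    if monthly_spend > 5000:          # root split
        if contract_months > 12:
            return "STAY"             # high spender, long contract
        return "CHURN"                # short contract = churn risk
    if support_calls > 4:
        return "CHURN"                # unhappy customer
    return "STAY"                     # low spender, but satisfied

print(predict_churn(8000, 24, 1))     # high spender, long contract
print(predict_churn(2000, 6, 7))      # low spender, many support calls
```

Training a decision tree is essentially learning which questions (features and thresholds) to nest, and in what order.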
### 2.2 Splitting Criteria: How the Tree Decides Which Question to Ask
The tree selects the feature that produces the purest split, meaning the resulting groups are as homogeneous as possible.
#### Impurity
Impurity measures how mixed a group is. A bucket with only red balls = pure (impurity = 0). A bucket with 50% red, 50% blue = maximum impurity.
The tree's goal: each split should reduce impurity as much as possible.
#### Gini Index

Formula: Gini = 1 − Σ(pᵢ²), where pᵢ = proportion of class i
Worked Example: Complete Gini Calculation

Dataset: 20 customers deciding between Churn/Stay. Feature: "Has Complaint" (Yes/No).

Overall: 12 Stay, 8 Churn → Gini(parent) = 1 − (0.6² + 0.4²) = 0.48

Split by "Has Complaint":
- Complaint = Yes (10 customers): 3 Stay, 7 Churn
  Gini(Yes) = 1 − (0.3² + 0.7²) = 1 − (0.09 + 0.49) = 0.42
- Complaint = No (10 customers): 9 Stay, 1 Churn
  Gini(No) = 1 − (0.9² + 0.1²) = 1 − (0.81 + 0.01) = 0.18

Weighted Gini = (10/20)×0.42 + (10/20)×0.18 = 0.21 + 0.09 = 0.30
Gini Gain = 0.48 − 0.30 = 0.18 → this split reduces impurity ✅
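The arithmetic above is easy to verify with a few lines of Python (a quick checking sketch, not library code):

```python
def gini(counts):
    """Gini impurity from raw class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

g_parent = gini([12, 8])                      # 12 Stay, 8 Churn
g_yes    = gini([3, 7])                       # Complaint = Yes
g_no     = gini([9, 1])                       # Complaint = No

weighted = (10 / 20) * g_yes + (10 / 20) * g_no
gain     = g_parent - weighted

print(round(g_parent, 2))   # 0.48
print(round(weighted, 2))   # 0.30
print(round(gain, 2))       # 0.18
```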
| Gini Value | Meaning |
|---|---|
| 0 | Perfectly pure (all one class) |
| 0.5 | Maximum impurity for binary classification |
#### Entropy & Information Gain

Entropy Formula: E = −Σ pᵢ × log₂(pᵢ)
Information Gain = Entropy(parent) - Weighted Average Entropy(children)
The feature with the highest Information Gain is chosen for the split.
Worked Example:

Before split: 100 customers (50 churn, 50 stay)
Entropy = −0.5×log₂(0.5) − 0.5×log₂(0.5) = 1.0 (maximum uncertainty)

Split by "Contract Length > 12 months":
- Left child: 60 customers (10 churn, 50 stay) → Entropy ≈ 0.65
- Right child: 40 customers (40 churn, 0 stay) → Entropy = 0.0 (pure!)

Weighted Entropy = (60/100)×0.65 + (40/100)×0.0 = 0.39
Information Gain = 1.0 − 0.39 = 0.61 → high gain = great feature ✅
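As with Gini, a short verification sketch in Python confirms the numbers (the `if c > 0` guard handles the pure right child, since log₂(0) is undefined):

```python
import math

def entropy(counts):
    """Shannon entropy from raw class counts: -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_parent = entropy([50, 50])     # 50 churn, 50 stay
e_left   = entropy([10, 50])     # 10 churn, 50 stay
e_right  = entropy([40, 0])      # 40 churn, 0 stay (pure)

weighted = (60 / 100) * e_left + (40 / 100) * e_right
gain     = e_parent - weighted

print(round(e_parent, 2))   # 1.0
print(round(e_left, 2))     # 0.65
print(round(gain, 2))       # 0.61
```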
🧠 "What's the difference between Gini and Entropy?" Both measure impurity. Gini is computationally faster; Entropy is more theoretically grounded (it comes from information theory). In practice the results are very similar, and scikit-learn uses Gini by default.
### 2.3 Overfitting: The Biggest Problem

Overfitting occurs when a tree becomes too complex: it memorizes the training data (including noise) and performs poorly on new, unseen data.
Analogy: A student who memorizes past exam papers word-for-word. They score 100% on those exact questions but fail on any new question.
Signs: Very high training accuracy but poor test accuracy. Deep tree with hundreds of leaves.
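These signs are easy to reproduce. A small sketch using scikit-learn on deliberately noisy synthetic data (the dataset parameters here are illustrative): an unconstrained tree memorizes the training set, while its test score lags behind.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 mislabels ~20% of points on purpose, simulating noise
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42)

deep    = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)  # no limits
shallow = DecisionTreeClassifier(max_depth=3,
                                 random_state=42).fit(X_tr, y_tr)  # pre-pruned

# The unconstrained tree scores perfectly on training data (it memorized
# the noise) but noticeably worse on the held-out test set.
print("deep:    train", deep.score(X_tr, y_tr),
      "test", round(deep.score(X_te, y_te), 2))
print("shallow: train", round(shallow.score(X_tr, y_tr), 2),
      "test", round(shallow.score(X_te, y_te), 2))
```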
Solution: Pruning
| Type | What It Does | Parameters |
|---|---|---|
| Pre-pruning | Stops the tree from growing too deep | max_depth, min_samples_split, min_samples_leaf |
| Post-pruning | Grows the full tree, then removes unhelpful branches | ccp_alpha parameter |
```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,           # Tree won't go deeper than 5 levels
    min_samples_split=20,  # A node needs ≥ 20 samples to split
    min_samples_leaf=10,   # Each leaf must have ≥ 10 samples
    random_state=42
)
model.fit(X_train, y_train)
```
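Post-pruning from the table above can be sketched with scikit-learn's `cost_complexity_pruning_path`, which grows the full tree and returns the candidate `ccp_alpha` values; the synthetic data here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Step 1: grow the full tree and ask sklearn for its pruning schedule
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)

# Step 2: refit with a larger ccp_alpha => more aggressive pruning
full   = DecisionTreeClassifier(random_state=42, ccp_alpha=0.0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=42,
                                ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print("full tree leaves:  ", full.get_n_leaves())
print("pruned tree leaves:", pruned.get_n_leaves())
```

In practice you would pick `ccp_alpha` by cross-validation over `path.ccp_alphas`, choosing the value with the best validation score.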
🧠 Memorize the key parameters: max_depth = how deep the tree can go, min_samples_split = minimum samples needed to split a node, min_samples_leaf = minimum samples in each leaf. All of these keep the tree's growth in check.
### 2.4 Complete Scikit-learn Pipeline
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# 1. Load and prepare data
df = pd.read_csv('customer_churn.csv')
X = df[['monthly_spend', 'contract_length', 'support_calls', 'tenure']]
y = df['churned']  # 1 = Churned, 0 = Stayed

# 2. Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build and train model
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# 4. Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 5. Feature importance
for name, imp in zip(X.columns, model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```
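To inspect what a trained tree actually learned, `sklearn.tree.export_text` prints the split rules as plain text. A self-contained sketch on synthetic stand-in data (the feature names are illustrative, echoing the pipeline above; the CSV itself is not included here):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
names = ['monthly_spend', 'contract_length', 'support_calls', 'tenure']

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# One line per node: the learned if/else rules, readable without plotting
rules = export_text(model, feature_names=names)
print(rules)
```

For a graphical version, `sklearn.tree.plot_tree(model)` draws the same structure with matplotlib.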