Model Evaluation
combines hundreds of trees, each trained on random subsets of data and features. The averaging reduces variance and improves accuracy. Trade-off: Random Forest is more accurate but less interpretable."
Q2: "Explain overfitting in simple terms."
Answer: "Overfitting is when your model memorizes the training data instead of learning the underlying pattern. It's like a student who memorizes past papers — scores 100% on those same questions but fails on new ones. Solutions include pruning (limiting tree depth), cross-validation, getting more data, or using ensemble methods like Random Forest."
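The memorization-vs-generalization trade-off can be shown in a few lines. This is a minimal sketch on synthetic data (dataset and parameter values are illustrative, not from the answer): an unpruned tree fits the training set almost perfectly, while capping `max_depth` trades training accuracy for better behavior on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

deep = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)            # no depth limit
pruned = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)  # pre-pruned

print(f"deep:   train={deep.score(X_tr, y_tr):.2f}  test={deep.score(X_te, y_te):.2f}")
print(f"pruned: train={pruned.score(X_tr, y_tr):.2f}  test={pruned.score(X_te, y_te):.2f}")
```

The unpruned tree is the "student who memorized past papers": near-perfect training score, weaker on the held-out test set.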
Q3: "When would you use Logistic Regression vs Decision Tree?"
Answer: "Use Logistic Regression when the relationship is approximately linear, such as predicting loan default from credit score. It's highly interpretable because each coefficient shows a feature's direct impact. Use a Decision Tree when relationships are non-linear with complex interactions, such as churn prediction, where combinations of factors matter. Decision Trees also handle categorical variables natively."
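A toy illustration of that split (synthetic XOR-style data, chosen because it is purely non-linear; all names here are illustrative): a linear model cannot separate the classes, while a tree can.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# XOR pattern: label is 1 when exactly one coordinate is positive.
# No single linear boundary can separate the two classes.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

logreg_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
tree_acc = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)

print(f"logistic regression: {logreg_acc:.2f}")  # near chance level
print(f"decision tree:       {tree_acc:.2f}")    # near perfect
```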
Q4: "What is cross-validation?"
Answer: "K-Fold Cross-Validation divides data into k equal parts (typically 5). In each round, k-1 parts train the model and 1 part tests it. This repeats k times so every data point gets tested exactly once. The final score is the average across all k rounds. This gives a more reliable estimate of model performance than a single train-test split, especially with limited data."
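The procedure above maps directly onto sklearn's `cross_val_score`; a minimal sketch (the iris dataset is just a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cv=5: five rounds, each holding out a different fifth of the data
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("per-fold scores:", scores)
print("mean accuracy:  ", scores.mean())
```

One score per fold, and the mean is the headline number to report.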
Q5: "What are the limitations of Decision Trees?"
Answer: "Six main limitations: (1) Prone to overfitting without pruning. (2) Unstable — small data changes can produce very different trees. (3) Biased toward features with many unique values. (4) Cannot capture diagonal decision boundaries efficiently. (5) Greedy algorithm — locally optimal splits, not globally optimal. (6) Cannot extrapolate beyond training data range in regression."
Q6: "Explain Gini Index with a calculation."
Answer: "Gini measures how impure a node is. Formula: 1 - sum of squared probabilities. For a node with 70% Class A and 30% Class B: Gini = 1 - (0.7² + 0.3²) = 1 - (0.49 + 0.09) = 0.42. A pure node has Gini = 0, maximum impurity for 2 classes is 0.5. The tree picks the split that results in the lowest weighted Gini across child nodes."
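The calculation, including the weighted split score the answer mentions, is easy to verify in plain Python (function names are my own, purely for illustration):

```python
def gini(counts):
    """Gini impurity of one node: 1 - sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Size-weighted Gini across child nodes; children = per-node class counts."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(gini([7, 3]))                    # 70% / 30% node -> 0.42
print(gini([10, 0]))                   # pure node -> 0.0
print(weighted_gini([[7, 3], [1, 9]])) # candidate split's weighted impurity
```

The tree compares `weighted_gini` across candidate splits and picks the lowest.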
Q7: "How do you handle imbalanced data?"
Answer: "Five approaches: (1) Re-sampling — oversample minority class (SMOTE) or undersample majority. (2) Class weights — set class_weight='balanced' in sklearn to penalize minority misclassification more. (3) Use F1/AUC instead of accuracy. (4) Anomaly detection if minority is very rare. (5) Collect more minority class data if possible."
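Approaches (2) and (3) can be sketched together. This assumes a hypothetical 95/5 class split on synthetic data; the point is the `class_weight='balanced'` flag and scoring with F1 rather than accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalance for illustration
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# F1 on the minority class, not accuracy (accuracy would look good
# even for a model that always predicts the majority class)
print("plain F1:   ", f1_score(y_te, plain.predict(X_te)))
print("weighted F1:", f1_score(y_te, weighted.predict(X_te)))
```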
Q8: "What is feature importance?"
Answer: "Feature importance tells us which features contribute most to predictions. In Decision Trees, it's based on total impurity reduction — features used higher in the tree and reducing more impurity rank higher. For example, if 'Monthly Spend' is the root split, it's likely the most important feature. I'd use this to communicate key drivers to stakeholders."
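In sklearn this is the fitted tree's `feature_importances_` attribute; a short sketch on the iris dataset (standing in for the churn example in the answer):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Importances are normalized impurity reductions and sum to 1
for name, imp in sorted(zip(data.feature_names, tree.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```

The ranked list is exactly the "key drivers" summary you would hand to stakeholders.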
Q9: "Explain the difference between bagging and boosting."
Answer: "Both are ensemble methods. Bagging (used in Random Forest) trains each model independently on random bootstrap samples, then averages their predictions, which reduces variance. Boosting (used in XGBoost) trains models sequentially, with each new model focusing on the errors of the previous ones, which reduces bias. Bagging is parallelizable and more robust to noise; boosting often achieves higher accuracy but can overfit noisy data."
Q10: "What is data leakage and how do you prevent it?"
Answer: "Data leakage is when information from outside the training set influences the model. It artificially inflates offline performance, but the model fails in production. Common sources: (1) Including the target variable (or a proxy for it) as a feature. (2) Using future data to predict the past. (3) Fitting pre-processing steps (scaling, encoding) on the full dataset before the train-test split. Prevention: always split the data first, then fit pre-processing on the training set only and apply it to the test set."
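The split-then-preprocess rule is what sklearn's `Pipeline` enforces automatically; a leak-free sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Leak-free: the scaler learns its mean/std from the training fold only,
# and the pipeline preserves that ordering even inside cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```

The leaky version would call `StandardScaler().fit(X)` on the full dataset before the split, letting test-set statistics bleed into training.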
Q11: "Why is Random Forest more stable than a single Decision Tree?"
Answer: "Two reasons: (1) Bagging — each tree sees different data, so individual errors don't propagate. (2) Feature randomness — each split considers only a subset of features, decorrelating the trees. Together, these make the ensemble robust to noise and small data changes, whereas a single tree can change completely with minor data variations."
Q12: "How would you explain a predictive model's results to a non-technical stakeholder?"
Answer: "I'd focus on three things: (1) Business impact — 'This model can identify 70% of customers likely to churn, allowing us to retain them.' (2) Key drivers — 'The top 3 factors are contract length, monthly spend, and support calls.' (3) Actionable insight — 'Customers with contracts under 6 months and more than 3 support calls ha