
# EDA Workflow

## 3.1 Creating a DataFrame

```python
import pandas as pd

# Sample dataset used throughout this chapter
data = {
    'name': ['Rajesh', 'Priya', 'Amit', 'Sneha', 'Vikram'],
    'city': ['Delhi', 'Mumbai', 'Ahmedabad', 'Delhi', 'Bangalore'],
    'segment': ['Corporate', 'Consumer', 'Corporate', 'Home Office', 'Consumer'],
    'revenue': [15000, 8000, 12000, 5000, 20000],
    'orders': [3, 2, 4, 1, 5]
}
df = pd.DataFrame(data)  # df is the standard short name for a DataFrame
```


## 3.2 Reading Data

```python
# Read CSV file
df = pd.read_csv('sales_data.csv')

# Read Excel file
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')

# Read from SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM customers', conn)
```

## 3.3 Exploring Data (Do This First!)

🧠 The moment you get a dataset in an interview, run these 5 commands. The interviewer will be impressed that you have a systematic approach.

```python
# 1️⃣ How big is the data?
df.shape  # (rows, columns), e.g. (5, 5) means 5 rows, 5 columns

# 2️⃣ What does the data look like? (Check the first 5 rows)
df.head()  # First 5 rows
df.tail()  # Last 5 rows (verify the data loaded correctly)

# 3️⃣ What are the column types? (This is VERY important!)
df.dtypes  # Type of each column: int, float, object (string), datetime
df.info()  # Complete summary: types + how many values are NULL + memory usage
# 🧠 If a date column shows type 'object', you will need to convert it!

# 4️⃣ Summary statistics in one shot
df.describe()  # count, mean, std, min, 25%, 50% (median), 75%, max
# 🧠 One glance tells you whether there are outliers and whether the data is skewed

# 5️⃣ Check missing values (the first step of DATA CLEANING)
df.isnull().sum()  # How many NULLs in each column

# Bonus: Unique values (for categorical columns)
df['city'].nunique()       # How many unique cities (e.g. 4)
df['city'].unique()        # Which cities: ['Delhi', 'Mumbai', ...]
df['city'].value_counts()  # How many customers per city (sorted)
```
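A quick sanity check of the bonus commands on a toy city column (values chosen to mirror this chapter's sample data):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Ahmedabad', 'Delhi', 'Bangalore']})

print(df.shape)                   # (5, 1)
print(df['city'].nunique())       # 4 — Delhi repeats, so 5 values but 4 unique
print(df['city'].value_counts())  # sorted by count, Delhi (2) comes first
```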

> [!TIP]
> 🧠 Interview mantra: as soon as you get a dataset, run shape → head → info → describe → isnull().sum(). Memorize these 5 steps. The interviewer will think, "this person knows exactly what to do."

## 3.4 Selecting Data

### Selecting Columns

```python
# Single column (returns a Series)
df['name']

# Multiple columns (returns a DataFrame)
df[['name', 'city', 'revenue']]
```

### Selecting Rows

```python
# By index position — iloc (integer location)
df.iloc[0]        # First row
df.iloc[0:3]      # First 3 rows
df.iloc[0:3, 0:2] # First 3 rows, first 2 columns

# By label — loc (label location)
df.loc[0]                   # Row with index label 0
df.loc[0:2, 'name':'city']  # Rows 0-2, columns 'name' to 'city' (loc slices include BOTH ends)
```

### Filtering Rows (MOST IMPORTANT! Guaranteed interview question)

🧠 In SQL you wrote WHERE; in Pandas you put the condition inside brackets. This is everyday work.

```python
# One condition: "give me only the rows where revenue > 10000"
df[df['revenue'] > 10000]
# 🧠 Think of it as: df[CONDITION] = SQL's WHERE CONDITION

# Two conditions: & (AND), | (OR), ~ (NOT)
# ⚠️ IMPORTANT: wrap each condition in (), otherwise you get an ERROR!
df[(df['revenue'] > 10000) & (df['city'] == 'Delhi')]   # Revenue > 10K AND in Delhi
df[(df['city'] == 'Delhi') | (df['city'] == 'Mumbai')]  # Delhi OR Mumbai

# To check against multiple values, use isin() (like SQL's IN)
df[df['city'].isin(['Delhi', 'Mumbai'])]  # Delhi or Mumbai

# To search inside a string (like SQL's LIKE '%text%')
df[df['name'].str.contains('a', case=False)]  # Does the name contain 'a'?

# Filter within a range (like SQL's BETWEEN)
df[df['revenue'].between(5000, 15000)]  # Between 5000 and 15000
```
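The ⚠️ warning above is easy to verify: Python's `&` binds more tightly than `>` or `==`, so without parentheses the expression gets grouped the wrong way and pandas raises an error. A minimal sketch on a two-row toy frame:

```python
import pandas as pd

df = pd.DataFrame({'revenue': [15000, 8000], 'city': ['Delhi', 'Mumbai']})

# With parentheses: works as intended
ok = df[(df['revenue'] > 10000) & (df['city'] == 'Delhi')]
print(len(ok))  # 1

# Without parentheses: Python tries to evaluate 10000 & df['city'] first
try:
    df[df['revenue'] > 10000 & df['city'] == 'Delhi']
    raised = False
except Exception:
    raised = True
print(raised)  # True — the unparenthesized version fails
```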

## 3.5 Handling Missing Data

```python
# Detect missing values
df.isnull().sum()        # Count per column
df.isnull().sum().sum()  # Total missing values

# Strategy 1: Drop rows with ANY missing value
df_clean = df.dropna()

# Strategy 2: Drop rows where SPECIFIC columns are missing
df_clean = df.dropna(subset=['revenue', 'customer_id'])

# Strategy 3: Fill with a constant
df['phone'] = df['phone'].fillna('Unknown')

# Strategy 4: Fill numerical with median (robust to outliers)
df['revenue'] = df['revenue'].fillna(df['revenue'].median())

# Strategy 5: Fill categorical with mode (most frequent value)
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Strategy 6: Forward fill (time series: use previous value)
df['stock_price'] = df['stock_price'].ffill()  # replaces deprecated fillna(method='ffill')

# Strategy 7: Interpolation (estimate from neighbors)
df['temperature'] = df['temperature'].interpolate(method='linear')
```

### When to Use What?

| Method | When | Why |
|---|---|---|
| Drop | <5% missing, data is random (MCAR) | Simple, no imputation bias |
| Median | Numerical, has outliers | Robust to extreme values |
| Mean | Numerical, no outliers | Keeps distribution center |
| Mode | Categorical data | Most frequent = safest guess |
| Forward fill | Time series | Previous value is best guess |
| Interpolate | Continuous time series with trend | Smooth estimation |
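The Median row of the table is easy to demonstrate: one extreme value drags the mean far away from the typical range, while the median barely moves. A small sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([100, 110, 105, 120, 9000, None])  # one outlier, one missing value

print(s.mean())    # 1887.0 — dragged up by the 9000 outlier
print(s.median())  # 110.0 — barely affected

# Median imputation keeps the filled value near the typical range
filled = s.fillna(s.median())
print(filled.iloc[-1])  # 110.0
```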

## 3.6 Removing Duplicates

```python
# Check for duplicates
df.duplicated().sum()  # Count of duplicate rows

# View the duplicates
df[df.duplicated(keep=False)]  # Shows ALL copies (not just the extras)

# Remove duplicates
df = df.drop_duplicates()  # Keep first occurrence, remove rest

# Remove based on specific columns
df = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')
```
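How the `keep` variants differ is easiest to see on a tiny frame with one repeated row (toy data):

```python
import pandas as pd

df = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [500, 500, 300]})

print(df.duplicated().sum())               # 1 — only the SECOND copy of (1, 500) is flagged
print(len(df[df.duplicated(keep=False)]))  # 2 — keep=False flags BOTH copies
print(len(df.drop_duplicates()))           # 2 — one copy of each unique row survives
```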

## 3.7 GroupBy — Split, Apply, Combine

🧠 Remember SQL's GROUP BY? Pandas has the same concept! Split the data into groups → apply a function to each group (sum, mean, count) → combine the results.

```python
# Simple groupby: "give me total revenue per city"
df.groupby('city')['revenue'].sum()
# 🧠 Read it as: "group by city, take the sum of revenue"

# Multiple aggregations at once (ASKED A LOT IN INTERVIEWS)
df.groupby('segment').agg(
    total_revenue=('revenue', 'sum'),   # Total revenue
    avg_revenue=('revenue', 'mean'),    # Average revenue
    order_count=('orders', 'sum'),      # Total orders
    customer_count=('name', 'count')    # How many customers
).reset_index()  # ← reset_index() turns the groupby column back into a normal column

# Group by two columns: "revenue for each segment within each city"
df.groupby(['city', 'segment'])['revenue'].sum().reset_index()

# Multiple functions on one column at once
df.groupby('segment')['revenue'].agg(['sum', 'mean', 'min', 'max'])
```
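Using the same segment and revenue values as this chapter's sample data, the split-apply-combine result can be checked by hand:

```python
import pandas as pd

df = pd.DataFrame({
    'segment': ['Corporate', 'Consumer', 'Corporate', 'Home Office', 'Consumer'],
    'revenue': [15000, 8000, 12000, 5000, 20000],
})

totals = df.groupby('segment')['revenue'].sum()
print(totals['Consumer'])     # 28000 (8000 + 20000)
print(totals['Corporate'])    # 27000 (15000 + 12000)
print(totals['Home Office'])  # 5000
```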

## 3.8 Merging DataFrames (SQL JOINs in Pandas)

🧠 In SQL you wrote JOIN; in Pandas you write pd.merge(). The concept is EXACTLY the same, only the syntax differs.

```python
# Build two tables (think of the same tables from SQL)
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Rajesh', 'Priya', 'Amit', 'Sneha', 'Vikram']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 1],
    'amount': [5000, 3000, 8000]
})

# INNER JOIN: only matching rows (SQL: INNER JOIN ... ON)
merged = pd.merge(customers, orders, on='customer_id', how='inner')
# 🧠 on = which column to match on, how = JOIN type

# LEFT JOIN: all customers, even those who never ordered
merged = pd.merge(customers, orders, on='customer_id', how='left')
# 🧠 Customers without orders get NaN in the order columns

# If the column names DIFFER between the two tables:
merged = pd.merge(df1, df2, left_on='cust_id', right_on='customer_id')

# Merge on multiple columns (composite key)
merged = pd.merge(df1, df2, on=['customer_id', 'date'], how='inner')
```
| `how` Parameter | SQL Equivalent | Keeps |
|---|---|---|
| `'inner'` | INNER JOIN | Only matching rows |
| `'left'` | LEFT JOIN | All from left + matches from right |
| `'right'` | RIGHT JOIN | All from right + matches from left |
| `'outer'` | FULL OUTER JOIN | All from both |
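Row counts for each `how` value, using the same `customers` and `orders` frames as above (only customers 1 and 2 placed orders, and customer 1 ordered twice):

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Rajesh', 'Priya', 'Amit', 'Sneha', 'Vikram']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 1],
    'amount': [5000, 3000, 8000]
})

inner_n = len(pd.merge(customers, orders, on='customer_id', how='inner'))
left_n = len(pd.merge(customers, orders, on='customer_id', how='left'))
right_n = len(pd.merge(customers, orders, on='customer_id', how='right'))
outer_n = len(pd.merge(customers, orders, on='customer_id', how='outer'))

# inner: 3 matched order rows; left: 3 matches + customers 3, 4, 5 unmatched = 6
# right: every order has a customer = 3; outer: union of both sides = 6
print(inner_n, left_n, right_n, outer_n)  # 3 6 3 6
```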

## 3.9 Creating New Columns

```python
# Simple calculation
df['revenue_per_order'] = df['revenue'] / df['orders']

# Conditional column using np.where (like SQL CASE WHEN)
import numpy as np
df['revenue_tier'] = np.where(df['revenue'] >= 10000, 'High', 'Low')

# Multiple conditions using np.select
conditions = [
    df['revenue'] >= 15000,
    df['revenue'] >= 8000,
    df['revenue'] < 8000
]
choices = ['High', 'Medium', 'Low']
df['tier'] = np.select(conditions, choices)

# Using apply with a custom function
def categorize(revenue):
    if revenue >= 15000:
        return 'High'
    elif revenue >= 8000:
        return 'Medium'
    else:
        return 'Low'

df['tier'] = df['revenue'].apply(categorize)

# Using apply with lambda (one-liner)
df['revenue_in_thousands'] = df['revenue'].apply(lambda x: x / 1000)
```
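The np.select and apply approaches should always agree. A quick consistency check on the chapter's sample revenues (np.select tries conditions in order, first match wins):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'revenue': [15000, 8000, 12000, 5000, 20000]})

conditions = [df['revenue'] >= 15000, df['revenue'] >= 8000]
df['tier_select'] = np.select(conditions, ['High', 'Medium'], default='Low')

def categorize(revenue):
    if revenue >= 15000:
        return 'High'
    elif revenue >= 8000:
        return 'Medium'
    return 'Low'

df['tier_apply'] = df['revenue'].apply(categorize)

print(list(df['tier_select']))                        # ['High', 'Medium', 'Medium', 'Low', 'High']
print(bool((df['tier_select'] == df['tier_apply']).all()))  # True
```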

## 3.10 Sorting

```python
# Sort by one column
df.sort_values('revenue', ascending=False)  # Highest first

# Sort by multiple columns
df.sort_values(['city', 'revenue'], ascending=[True, False])

# Sort by index
df.sort_index()
```

## 3.11 Pivot Tables

Pivot tables restructure data for summarization (like Excel pivot tables).

```python
# Create a pivot table
pivot = df.pivot_table(
    values='revenue',   # What to aggregate
    index='city',       # Rows
    columns='segment',  # Columns
    aggfunc='sum',      # Aggregation function
    fill_value=0        # Replace NaN with 0
)
```

Result:

| city | Consumer | Corporate | Home Office |
|---|---|---|---|
| Ahmedabad | 0 | 12000 | 0 |
| Bangalore | 20000 | 0 | 0 |
| Delhi | 0 | 15000 | 5000 |
| Mumbai | 8000 | 0 | 0 |
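The result above can be reproduced directly from the sample data — a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Ahmedabad', 'Delhi', 'Bangalore'],
    'segment': ['Corporate', 'Consumer', 'Corporate', 'Home Office', 'Consumer'],
    'revenue': [15000, 8000, 12000, 5000, 20000],
})

pivot = df.pivot_table(values='revenue', index='city', columns='segment',
                       aggfunc='sum', fill_value=0)
print(pivot.loc['Delhi', 'Corporate'])     # 15000
print(pivot.loc['Bangalore', 'Consumer'])  # 20000
print(pivot.loc['Ahmedabad', 'Consumer'])  # 0 — no such combination, filled by fill_value
```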

## 3.12 String Operations

```python
# Pandas string methods (use the .str accessor)
df['name'].str.upper()            # RAJESH, PRIYA...
df['name'].str.lower()            # rajesh, priya...
df['name'].str.len()              # Length of each name
df['name'].str.contains('a')      # Boolean: contains 'a'?
df['name'].str.startswith('R')    # Starts with 'R'?
df['name'].str.replace('a', 'A')  # Replace 'a' with 'A'
df['name'].str.split(' ')         # Split by space
df['name'].str.strip()            # Remove whitespace
```
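`.str` methods chain just like regular string methods. A small sketch on toy names (note that `contains` is case-sensitive by default):

```python
import pandas as pd

names = pd.Series(['  Rajesh ', 'Priya', 'Amit'])

cleaned = names.str.strip()
print(cleaned.str.upper().tolist())        # ['RAJESH', 'PRIYA', 'AMIT']
print(cleaned.str.len().tolist())          # [6, 5, 4]
print(cleaned.str.contains('a').tolist())  # [True, True, False] — 'Amit' has no lowercase 'a'
```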

## 3.13 Date Operations

```python
# Convert string to datetime
df['date'] = pd.to_datetime(df['date_string'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_name'] = df['date'].dt.day_name()  # Monday, Tuesday...
df['quarter'] = df['date'].dt.quarter

# Date arithmetic
df['days_since_order'] = (pd.Timestamp.now() - df['date']).dt.days

# Filter by date
df[df['date'] >= '2023-06-01']
df[df['date'].between('2023-06-01', '2023-08-31')]
```
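A quick sketch of the datetime accessors on a toy column (dates are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'date_string': ['2023-06-15', '2023-01-03', '2023-09-20']})
df['date'] = pd.to_datetime(df['date_string'])

print(df['date'].dt.month.tolist())      # [6, 1, 9]
print(df['date'].dt.quarter.tolist())    # [2, 1, 3]
print(df['date'].dt.day_name().iloc[0])  # 'Thursday' (2023-06-15 was a Thursday)

# Date filtering accepts plain date strings
print(len(df[df['date'] >= '2023-06-01']))  # 2
```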

# CHAPTER 4: DATA VISUALIZATION

## 4.1 Matplotlib Basics

```python
import matplotlib.pyplot as plt

# Simple line chart
plt.figure(figsize=(10, 6))
plt.plot([1, 2, 3, 4, 5], [10, 20, 15, 25, 30], marker='o')
plt.title('Monthly Revenue Trend')
plt.xlabel('Month')
plt.ylabel('Revenue (₹ thousands)')
plt.grid(True, alpha=0.3)
plt.show()
```

## 4.2 Seaborn — Beautiful Charts Made Easy

```python
import seaborn as sns

# Set style
sns.set_style('whitegrid')

# 1. BAR CHART — Compare categories
plt.figure(figsize=(8, 5))
sns.barplot(data=df, x='city', y='revenue', palette='viridis')
plt.title('Revenue by City')
plt.show()

# 2. HISTOGRAM — Distribution of a variable
plt.figure(figsize=(8, 5))
sns.histplot(df['revenue'], kde=True, bins=10, color='coral')
plt.title('Distribution of Revenue')
plt.xlabel('Revenue (₹)')
plt.show()

# 3. BOX PLOT — Spread and outliers
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='segment', y='revenue')
plt.title('Revenue Distribution by Segment')
plt.show()

# 4. SCATTER PLOT — Relationship between two variables
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='orders', y='revenue', hue='segment', size='orders')
plt.title('Orders vs Revenue')
plt.show()

# 5. HEATMAP — Correlation matrix
plt.figure(figsize=(8, 6))
numeric_df = df.select_dtypes(include='number')
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

# 6. COUNT PLOT — Frequency of categories
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='segment', palette='Set2')
plt.title('Customer Count by Segment')
plt.show()
```

### Which Chart to Use?

| Chart | Use When |
|---|---|
| Bar chart | Comparing values across categories |
| Histogram | Distribution of one numeric variable |
| Box plot | Spread and outliers across groups |
| Scatter plot | Relationship between two numeric variables |
| Heatmap | Correlation matrix of numeric features |
| Count plot | Frequency of each category |


# CHAPTER 5: COMPLETE EDA WORKFLOW

Here's the step-by-step process for any data analysis task at DecisionTree:

## Full Mini-Project: E-Commerce Sales Analysis

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ═══════════════════════════════════════════════════
# STEP 1: Load and Explore
# ═══════════════════════════════════════════════════
df = pd.read_csv('ecommerce_sales.csv')
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst 5 rows:\n{df.head()}")
print(f"\nStatistical Summary:\n{df.describe()}")

# ═══════════════════════════════════════════════════
# STEP 2: Check Data Quality
# ═══════════════════════════════════════════════════
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# ═══════════════════════════════════════════════════
# STEP 3: Clean Data
# ═══════════════════════════════════════════════════
# Fill missing numerical values with median
df['revenue'] = df['revenue'].fillna(df['revenue'].median())

# Fill missing categories with mode
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Convert date column
df['order_date'] = pd.to_datetime(df['order_date'])

# ═══════════════════════════════════════════════════
# STEP 4: Feature Engineering
# ═══════════════════════════════════════════════════
df['month'] = df['order_date'].dt.month
df['quarter'] = df['order_date'].dt.quarter
df['avg_order_value'] = df['revenue'] / df['quantity']

# Classify customers by spending
df['customer_tier'] = pd.cut(
    df['revenue'],
    bins=[0, 1000, 5000, float('inf')],
    labels=['Low', 'Medium', 'High']
)

# ═══════════════════════════════════════════════════
# STEP 5: Analysis & Insights
# ═══════════════════════════════════════════════════

# Revenue by category
category_revenue = df.groupby('category')['revenue'].agg(['sum', 'mean', 'count'])
category_revenue.columns = ['Total Revenue', 'Avg Revenue', 'Order Count']
print(category_revenue.sort_values('Total Revenue', ascending=False))

# Monthly trend
monthly = df.groupby('month')['revenue'].sum()

# ═══════════════════════════════════════════════════
# STEP 6: Visualize
# ═══════════════════════════════════════════════════
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Revenue by category
category_revenue['Total Revenue'].plot(kind='bar', ax=axes[0, 0], color='steelblue')
axes[0, 0].set_title('Revenue by Category')
axes[0, 0].set_ylabel('Revenue (₹)')

# Monthly trend
monthly.plot(kind='line', ax=axes[0, 1], marker='o', color='coral')
axes[0, 1].set_title('Monthly Revenue Trend')

# Distribution of order values
sns.histplot(df['revenue'], kde=True, ax=axes[1, 0])
axes[1, 0].set_title('Revenue Distribution')

# Customer tier split
df['customer_tier'].value_counts().plot(kind='pie', ax=axes[1, 1], autopct='%1.1f%%')
axes[1, 1].set_title('Customer Tier Distribution')

plt.tight_layout()
plt.show()
```