Pandas Integration Examples¶
This notebook demonstrates how to use pranaam with pandas DataFrames for real-world data processing and analysis.
We’ll cover:
Basic DataFrame processing
Data analysis with predictions
Confidence-based filtering
Saving and exporting results
Let’s start by importing our dependencies:
[1]:
import pandas as pd
import pranaam
print(f"Pandas version: {pd.__version__}")
print(f"Pranaam version: {pranaam.__version__ if hasattr(pranaam, '__version__') else 'latest'}")
2026-01-21 19:31:49.365618: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2026-01-21 19:31:49.410119: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-21 19:31:50.797333: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
Pandas version: 2.3.3
Pranaam version: 0.0.2
Creating Sample Data¶
First, let’s create a sample employee dataset to work with:
[2]:
def create_sample_data():
    """Create sample employee data for demonstration."""
    return pd.DataFrame({
        "employee_id": [1001, 1002, 1003, 1004, 1005, 1006],
        "name": [
            "Shah Rukh Khan",
            "Priya Sharma",
            "Mohammed Ali",
            "Raj Patel",
            "Fatima Khan",
            "Amitabh Bachchan",
        ],
        "department": [
            "Engineering",
            "Marketing",
            "Finance",
            "HR",
            "Engineering",
            "Management",
        ],
        "salary": [75000, 65000, 70000, 60000, 80000, 120000],
    })
# Create our sample data
df = create_sample_data()
print("Original employee data:")
print(df)
Original employee data:
employee_id name department salary
0 1001 Shah Rukh Khan Engineering 75000
1 1002 Priya Sharma Marketing 65000
2 1003 Mohammed Ali Finance 70000
3 1004 Raj Patel HR 60000
4 1005 Fatima Khan Engineering 80000
5 1006 Amitabh Bachchan Management 120000
📊 Basic DataFrame Processing¶
Now let’s add religion predictions to our DataFrame using pranaam:
[3]:
# Get predictions for the name column
print("Getting predictions for all names...")
predictions = pranaam.pred_rel(df["name"], lang="eng")
print("\nPredictions:")
print(predictions)
Getting predictions for all names...
[01/21/26 19:31:51] INFO pranaam - Loading eng model from /home/runner/work/pranaam/pranaam/pranaam/model/eng_and_hindi_models_v2/eng_model.keras
INFO pranaam - Loading eng model with tf-keras compatibility layer
2026-01-21 19:31:51.307337: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
Predictions:
name pred_label pred_prob_muslim
0 Shah Rukh Khan muslim 71.0
1 Priya Sharma not-muslim 27.0
2 Mohammed Ali muslim 73.0
3 Raj Patel not-muslim 35.0
4 Fatima Khan muslim 73.0
5 Amitabh Bachchan not-muslim 31.0
[4]:
# Merge predictions back to original DataFrame
# Note: pranaam returns name, pred_label, pred_prob_muslim
df_with_predictions = df.merge(
    predictions[["name", "pred_label", "pred_prob_muslim"]],
    on="name",
    how="left",
)
print("Combined data with predictions:")
print(df_with_predictions)
Combined data with predictions:
employee_id name department salary pred_label \
0 1001 Shah Rukh Khan Engineering 75000 muslim
1 1002 Priya Sharma Marketing 65000 not-muslim
2 1003 Mohammed Ali Finance 70000 muslim
3 1004 Raj Patel HR 60000 not-muslim
4 1005 Fatima Khan Engineering 80000 muslim
5 1006 Amitabh Bachchan Management 120000 not-muslim
pred_prob_muslim
0 71.0
1 27.0
2 73.0
3 35.0
4 73.0
5 31.0
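A caveat on merging by name: if the same name appears more than once, merging on "name" will duplicate rows. A positional join avoids that — a minimal sketch, assuming pred_rel returns one row per input name in the original order (as it appears to above):
# Positional alternative to .merge(); assumes pred_rel preserves input order
df_positional = pd.concat(
    [
        df.reset_index(drop=True),
        predictions[["pred_label", "pred_prob_muslim"]].reset_index(drop=True),
    ],
    axis=1,
)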
📈 Data Analysis with Predictions¶
Now let’s perform some analysis using the religion predictions:
[5]:
# Basic statistics
print("Religion distribution in our dataset:")
religion_counts = df_with_predictions["pred_label"].value_counts()
print(religion_counts)
print("\nPercentage breakdown:")
print(religion_counts / len(df_with_predictions) * 100)
Religion distribution in our dataset:
pred_label
muslim 3
not-muslim 3
Name: count, dtype: int64
Percentage breakdown:
pred_label
muslim 50.0
not-muslim 50.0
Name: count, dtype: float64
[6]:
# Average salary by predicted religion
print("Salary analysis by predicted religion:")
salary_by_religion = df_with_predictions.groupby("pred_label")["salary"].agg([
    'mean', 'median', 'min', 'max', 'count'
])
print(salary_by_religion)
Salary analysis by predicted religion:
mean median min max count
pred_label
muslim 75000.000000 75000.0 70000 80000 3
not-muslim 81666.666667 65000.0 60000 120000 3
[7]:
# Department distribution by predicted religion
print("Department vs Religion cross-tabulation:")
dept_religion = pd.crosstab(
    df_with_predictions["department"],
    df_with_predictions["pred_label"],
    margins=True,
)
print(dept_religion)
Department vs Religion cross-tabulation:
pred_label muslim not-muslim All
department
Engineering 2 0 2
Finance 1 0 1
HR 0 1 1
Management 0 1 1
Marketing 0 1 1
All 3 3 6
🎯 Confidence-Based Analysis¶
Not all predictions are equally certain. Let’s analyze the confidence levels and filter based on them:
[8]:
# Add confidence score calculation
# Higher numbers mean more confident predictions
df_with_predictions['confidence'] = df_with_predictions['pred_prob_muslim'].apply(
    lambda x: max(x, 100 - x)
)
# Show confidence distribution
print("Detailed prediction analysis:")
print("=" * 70)
print(f"{'Name':<18} | {'Prediction':<10} | {'Muslim %':<8} | {'Confidence':<10}")
print("-" * 70)
for _, row in df_with_predictions.iterrows():
    print(f"{row['name']:<18} | {row['pred_label']:<10} | {row['pred_prob_muslim']:>6.1f}% | {row['confidence']:>8.1f}%")
Detailed prediction analysis:
======================================================================
Name | Prediction | Muslim % | Confidence
----------------------------------------------------------------------
Shah Rukh Khan | muslim | 71.0% | 71.0%
Priya Sharma | not-muslim | 27.0% | 73.0%
Mohammed Ali | muslim | 73.0% | 73.0%
Raj Patel | not-muslim | 35.0% | 65.0%
Fatima Khan | muslim | 73.0% | 73.0%
Amitabh Bachchan | not-muslim | 31.0% | 69.0%
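The .apply with a Python lambda works fine at this scale; on large DataFrames a vectorized version of the same max(x, 100 - x) score is faster. A sketch, assuming NumPy is available (it ships with pandas):
import numpy as np

# Vectorized equivalent of the .apply above
probs = df_with_predictions["pred_prob_muslim"]
df_with_predictions["confidence"] = np.maximum(probs, 100 - probs)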
[9]:
# Filter high-confidence predictions (>90%)
high_confidence_mask = df_with_predictions['confidence'] > 90
high_confidence_df = df_with_predictions[high_confidence_mask]
print("High-confidence predictions (confidence > 90%):")
print(f"Found {len(high_confidence_df)} out of {len(df_with_predictions)} predictions")
print("\nHigh-confidence results:")
print(high_confidence_df[['name', 'pred_label', 'pred_prob_muslim', 'confidence']])
High-confidence predictions (confidence > 90%):
Found 0 out of 6 predictions
High-confidence results:
Empty DataFrame
Columns: [name, pred_label, pred_prob_muslim, confidence]
Index: []
[10]:
# Confidence level categorization
df_with_predictions['confidence_level'] = pd.cut(
    df_with_predictions['confidence'],
    bins=[0, 70, 85, 95, 100],
    labels=['Low', 'Medium', 'High', 'Very High'],
    include_lowest=True,
)
print("Confidence level distribution:")
conf_dist = df_with_predictions['confidence_level'].value_counts().sort_index()
print(conf_dist)
print("\nPercentage:")
print(conf_dist / len(df_with_predictions) * 100)
Confidence level distribution:
confidence_level
Low 2
Medium 4
High 0
Very High 0
Name: count, dtype: int64
Percentage:
confidence_level
Low 33.333333
Medium 66.666667
High 0.000000
Very High 0.000000
Name: count, dtype: float64
💾 Saving and Exporting Results¶
Let’s save our enriched dataset to various formats:
[11]:
# Prepare final dataset with clean column names
final_df = df_with_predictions[[
    'employee_id', 'name', 'department', 'salary',
    'pred_label', 'pred_prob_muslim', 'confidence', 'confidence_level'
]].rename(columns={
    'pred_label': 'predicted_religion',
    'pred_prob_muslim': 'muslim_probability',
    'confidence': 'prediction_confidence',
})
print("Final dataset with clean column names:")
print(final_df)
print(f"\nDataset shape: {final_df.shape}")
Final dataset with clean column names:
employee_id name department salary predicted_religion \
0 1001 Shah Rukh Khan Engineering 75000 muslim
1 1002 Priya Sharma Marketing 65000 not-muslim
2 1003 Mohammed Ali Finance 70000 muslim
3 1004 Raj Patel HR 60000 not-muslim
4 1005 Fatima Khan Engineering 80000 muslim
5 1006 Amitabh Bachchan Management 120000 not-muslim
muslim_probability prediction_confidence confidence_level
0 71.0 71.0 Medium
1 27.0 73.0 Medium
2 73.0 73.0 Medium
3 35.0 65.0 Low
4 73.0 73.0 Medium
5 31.0 69.0 Low
Dataset shape: (6, 8)
[12]:
# Save to CSV (most common format)
output_file = "employee_predictions.csv"
final_df.to_csv(output_file, index=False)
print(f"✅ Results saved to {output_file}")
# Show what was saved
print("\nSaved data preview:")
saved_df = pd.read_csv(output_file)
print(saved_df.head())
print("\nFile info:")
print(f"- Rows: {len(saved_df)}")
print(f"- Columns: {list(saved_df.columns)}")
✅ Results saved to employee_predictions.csv
Saved data preview:
employee_id name department salary predicted_religion \
0 1001 Shah Rukh Khan Engineering 75000 muslim
1 1002 Priya Sharma Marketing 65000 not-muslim
2 1003 Mohammed Ali Finance 70000 muslim
3 1004 Raj Patel HR 60000 not-muslim
4 1005 Fatima Khan Engineering 80000 muslim
muslim_probability prediction_confidence confidence_level
0 71.0 71.0 Medium
1 27.0 73.0 Medium
2 73.0 73.0 Medium
3 35.0 65.0 Low
4 73.0 73.0 Medium
File info:
- Rows: 6
- Columns: ['employee_id', 'name', 'department', 'salary', 'predicted_religion', 'muslim_probability', 'prediction_confidence', 'confidence_level']
[13]:
# Clean up the demo file
import os
if os.path.exists(output_file):
    os.remove(output_file)
    print(f"🧹 Demo file {output_file} removed")
🧹 Demo file employee_predictions.csv removed
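CSV is only one option; the other pandas writers follow the same pattern. A sketch — the Parquet and Excel calls assume the optional pyarrow and openpyxl dependencies are installed:
# Parquet preserves dtypes, including the categorical confidence_level
final_df.to_parquet("employee_predictions.parquet", index=False)

# Excel, convenient for sharing with non-programmers
final_df.to_excel("employee_predictions.xlsx", index=False)

# JSON records, one object per row
final_df.to_json("employee_predictions.json", orient="records")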
🔍 Advanced Analytics Example¶
Let’s create a summary report of our analysis:
[14]:
# Create a comprehensive summary
print("📊 EMPLOYEE RELIGION PREDICTION ANALYSIS REPORT")
print("=" * 60)
# Dataset overview
total_employees = len(final_df)
print("\n📋 Dataset Overview:")
print(f" Total employees analyzed: {total_employees}")
print(f" Departments: {final_df['department'].nunique()} ({', '.join(final_df['department'].unique())})")
print(f" Salary range: ${final_df['salary'].min():,} - ${final_df['salary'].max():,}")
# Religion predictions
religion_summary = final_df['predicted_religion'].value_counts()
print("\n🔮 Religion Predictions:")
for religion, count in religion_summary.items():
    pct = count / total_employees * 100
    print(f" {religion.title()}: {count} employees ({pct:.1f}%)")
# Confidence analysis
avg_confidence = final_df['prediction_confidence'].mean()
high_conf_count = (final_df['prediction_confidence'] > 90).sum()
print("\n📈 Confidence Analysis:")
print(f" Average confidence: {avg_confidence:.1f}%")
print(f" High confidence predictions (>90%): {high_conf_count}/{total_employees} ({high_conf_count/total_employees*100:.1f}%)")
# Department insights
dept_analysis = final_df.groupby('department').agg({
    'predicted_religion': lambda x: x.value_counts().index[0],  # most common religion
    'prediction_confidence': 'mean',
    'salary': 'mean',
})
print("\n🏢 Department Analysis:")
for dept in dept_analysis.index:
    most_common = dept_analysis.loc[dept, 'predicted_religion']
    avg_conf = dept_analysis.loc[dept, 'prediction_confidence']
    avg_sal = dept_analysis.loc[dept, 'salary']
    print(f" {dept}: Mostly {most_common} (avg confidence: {avg_conf:.1f}%, avg salary: ${avg_sal:,.0f})")
print("\n✅ Analysis complete!")
📊 EMPLOYEE RELIGION PREDICTION ANALYSIS REPORT
============================================================
📋 Dataset Overview:
Total employees analyzed: 6
Departments: 5 (Engineering, Marketing, Finance, HR, Management)
Salary range: $60,000 - $120,000
🔮 Religion Predictions:
Muslim: 3 employees (50.0%)
Not-Muslim: 3 employees (50.0%)
📈 Confidence Analysis:
Average confidence: 70.7%
High confidence predictions (>90%): 0/6 (0.0%)
🏢 Department Analysis:
Engineering: Mostly muslim (avg confidence: 72.0%, avg salary: $77,500)
Finance: Mostly muslim (avg confidence: 73.0%, avg salary: $70,000)
HR: Mostly not-muslim (avg confidence: 65.0%, avg salary: $60,000)
Management: Mostly not-muslim (avg confidence: 69.0%, avg salary: $120,000)
Marketing: Mostly not-muslim (avg confidence: 73.0%, avg salary: $65,000)
✅ Analysis complete!
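For reference, the whole workflow above condenses into a few lines — a minimal sketch using the same calls demonstrated in this notebook:
# Predict, merge, score confidence, export
preds = pranaam.pred_rel(df["name"], lang="eng")
enriched = df.merge(preds, on="name", how="left")
enriched["confidence"] = enriched["pred_prob_muslim"].apply(lambda x: max(x, 100 - x))
enriched.to_csv("predictions.csv", index=False)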
Key Takeaways¶
Use pranaam.pred_rel() to get predictions for a whole name column at once
Use .merge() to combine predictions with existing data
Derive a confidence score from pred_prob_muslim to filter or bucket predictions
Export enriched results with standard pandas writers such as .to_csv()
Next Steps¶
CSV Processing: Learn to process large CSV files
Performance Benchmarks: Optimize for large datasets
Basic Usage: Review fundamental concepts
Best Practices¶
Always check confidence scores - Don’t trust all predictions equally
Use batch processing - Process multiple names at once for efficiency
Handle missing data - Check for NaN values in name columns before processing (see the sketch after this list)
Validate results - Spot-check predictions against domain knowledge
Document assumptions - Note the model’s limitations and biases in your analysis
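A minimal sketch of the missing-data check from the list above (the cleaning steps are illustrative, not part of pranaam's API):
# Drop rows with missing names and normalize to strings before predicting
clean = df.dropna(subset=["name"]).copy()
clean["name"] = clean["name"].astype(str).str.strip()
predictions = pranaam.pred_rel(clean["name"], lang="eng")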