Pandas Integration Examples¶
This notebook demonstrates how to use pranaam with pandas DataFrames for real-world data processing and analysis.
We’ll cover:
Basic DataFrame processing
Data analysis with predictions
Confidence-based filtering
Saving and exporting results
Let’s start by importing our dependencies:
[1]:
import pandas as pd
import pranaam
print(f"Pandas version: {pd.__version__}")
print(f"Pranaam version: {pranaam.__version__ if hasattr(pranaam, '__version__') else 'latest'}")
2026-01-21 19:31:49.365618: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2026-01-21 19:31:49.410119: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-21 19:31:50.797333: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
Pandas version: 2.3.3
Pranaam version: 0.0.2
Creating Sample Data¶
First, let’s create a sample employee dataset to work with:
[2]:
def create_sample_data():
    """Create sample employee data for demonstration."""
    return pd.DataFrame({
        "employee_id": [1001, 1002, 1003, 1004, 1005, 1006],
        "name": [
            "Shah Rukh Khan",
            "Priya Sharma",
            "Mohammed Ali",
            "Raj Patel",
            "Fatima Khan",
            "Amitabh Bachchan",
        ],
        "department": [
            "Engineering",
            "Marketing",
            "Finance",
            "HR",
            "Engineering",
            "Management",
        ],
        "salary": [75000, 65000, 70000, 60000, 80000, 120000],
    })
# Create our sample data
df = create_sample_data()
print("Original employee data:")
print(df)
Original employee data:
employee_id name department salary
0 1001 Shah Rukh Khan Engineering 75000
1 1002 Priya Sharma Marketing 65000
2 1003 Mohammed Ali Finance 70000
3 1004 Raj Patel HR 60000
4 1005 Fatima Khan Engineering 80000
5 1006 Amitabh Bachchan Management 120000
📊 Basic DataFrame Processing¶
Now let’s add religion predictions to our DataFrame using pranaam:
[3]:
# Get predictions for the name column
print("Getting predictions for all names...")
predictions = pranaam.pred_rel(df["name"], lang="eng")
print("\nPredictions:")
print(predictions)
Getting predictions for all names...
[01/21/26 19:31:51] INFO pranaam - Loading eng model from /home/runner/work/pranaam/pranaam/pranaam/model/eng_and_hindi_models_v2/eng_model.keras
INFO pranaam - Loading eng model with tf-keras compatibility layer
2026-01-21 19:31:51.307337: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
Predictions:
name pred_label pred_prob_muslim
0 Shah Rukh Khan muslim 71.0
1 Priya Sharma not-muslim 27.0
2 Mohammed Ali muslim 73.0
3 Raj Patel not-muslim 35.0
4 Fatima Khan muslim 73.0
5 Amitabh Bachchan not-muslim 31.0
[4]:
# Merge predictions back to original DataFrame
# Note: pranaam returns name, pred_label, pred_prob_muslim
df_with_predictions = df.merge(
    predictions[["name", "pred_label", "pred_prob_muslim"]],
    on="name",
    how="left",
)
print("Combined data with predictions:")
print(df_with_predictions)
Combined data with predictions:
employee_id name department salary pred_label \
0 1001 Shah Rukh Khan Engineering 75000 muslim
1 1002 Priya Sharma Marketing 65000 not-muslim
2 1003 Mohammed Ali Finance 70000 muslim
3 1004 Raj Patel HR 60000 not-muslim
4 1005 Fatima Khan Engineering 80000 muslim
5 1006 Amitabh Bachchan Management 120000 not-muslim
pred_prob_muslim
0 71.0
1 27.0
2 73.0
3 35.0
4 73.0
5 31.0
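A caveat on merging by name: if the same name appears more than once, merging on "name" will duplicate rows. A positional join avoids that — a minimal sketch, assuming pred_rel returns one row per input name in the original order (as it appears to above):
# Positional alternative to .merge(); assumes pred_rel preserves input order
df_positional = pd.concat(
    [
        df.reset_index(drop=True),
        predictions[["pred_label", "pred_prob_muslim"]].reset_index(drop=True),
    ],
    axis=1,
)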
📈 Data Analysis with Predictions¶
Now let’s perform some analysis using the religion predictions:
[5]:
# Basic statistics
print("Religion distribution in our dataset:")
religion_counts = df_with_predictions["pred_label"].value_counts()
print(religion_counts)
print("\nPercentage breakdown:")
print(religion_counts / len(df_with_predictions) * 100)
Religion distribution in our dataset:
pred_label
muslim 3
not-muslim 3
Name: count, dtype: int64
Percentage breakdown:
pred_label
muslim 50.0
not-muslim 50.0
Name: count, dtype: float64
[6]:
# Average salary by predicted religion
print("Salary analysis by predicted religion:")
salary_by_religion = df_with_predictions.groupby("pred_label")["salary"].agg([
    'mean', 'median', 'min', 'max', 'count'
])
print(salary_by_religion)
Salary analysis by predicted religion:
mean median min max count
pred_label
muslim 75000.000000 75000.0 70000 80000 3
not-muslim 81666.666667 65000.0 60000 120000 3
[7]:
# Department distribution by predicted religion
print("Department vs Religion cross-tabulation:")
dept_religion = pd.crosstab(
    df_with_predictions["department"],
    df_with_predictions["pred_label"],
    margins=True,
)
print(dept_religion)
Department vs Religion cross-tabulation:
pred_label muslim not-muslim All
department
Engineering 2 0 2
Finance 1 0 1
HR 0 1 1
Management 0 1 1
Marketing 0 1 1
All 3 3 6
🎯 Confidence-Based Analysis¶
Not all predictions are equally certain. Let’s analyze the confidence levels and filter based on them:
[8]:
# Add confidence score calculation
# Higher numbers mean more confident predictions
df_with_predictions['confidence'] = df_with_predictions['pred_prob_muslim'].apply(
    lambda x: max(x, 100 - x)
)
# Show confidence distribution
print("Detailed prediction analysis:")
print("=" * 70)
print(f"{'Name':<18} | {'Prediction':<10} | {'Muslim %':<8} | {'Confidence':<10}")
print("-" * 70)
for _, row in df_with_predictions.iterrows():
    print(f"{row['name']:<18} | {row['pred_label']:<10} | {row['pred_prob_muslim']:>6.1f}% | {row['confidence']:>8.1f}%")
Detailed prediction analysis:
======================================================================
Name | Prediction | Muslim % | Confidence
----------------------------------------------------------------------
Shah Rukh Khan | muslim | 71.0% | 71.0%
Priya Sharma | not-muslim | 27.0% | 73.0%
Mohammed Ali | muslim | 73.0% | 73.0%
Raj Patel | not-muslim | 35.0% | 65.0%
Fatima Khan | muslim | 73.0% | 73.0%
Amitabh Bachchan | not-muslim | 31.0% | 69.0%
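The .apply with a Python lambda works fine at this scale; on large DataFrames a vectorized version of the same max(x, 100 - x) score is faster. A sketch, assuming NumPy is available (it ships with pandas):
import numpy as np

# Vectorized equivalent of the .apply above
probs = df_with_predictions["pred_prob_muslim"]
df_with_predictions["confidence"] = np.maximum(probs, 100 - probs)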
[9]:
# Filter high-confidence predictions (>90%)
high_confidence_mask = df_with_predictions['confidence'] > 90
high_confidence_df = df_with_predictions[high_confidence_mask]
print("High-confidence predictions (confidence > 90%):")
print(f"Found {len(high_confidence_df)} out of {len(df_with_predictions)} predictions")
print("\nHigh-confidence results:")
print(high_confidence_df[['name', 'pred_label', 'pred_prob_muslim', 'confidence']])
High-confidence predictions (confidence > 90%):
Found 0 out of 6 predictions
High-confidence results:
Empty DataFrame
Columns: [name, pred_label, pred_prob_muslim, confidence]
Index: []
[10]:
# Confidence level categorization
df_with_predictions['confidence_level'] = pd.cut(
    df_with_predictions['confidence'],
    bins=[0, 70, 85, 95, 100],
    labels=['Low', 'Medium', 'High', 'Very High'],
    include_lowest=True,
)
print("Confidence level distribution:")
conf_dist = df_with_predictions['confidence_level'].value_counts().sort_index()
print(conf_dist)
print("\nPercentage:")
print(conf_dist / len(df_with_predictions) * 100)
Confidence level distribution:
confidence_level
Low 2
Medium 4
High 0
Very High 0
Name: count, dtype: int64
Percentage:
confidence_level
Low 33.333333
Medium 66.666667
High 0.000000
Very High 0.000000
Name: count, dtype: float64
💾 Saving and Exporting Results¶
Let’s save our enriched dataset to various formats:
[11]:
# Prepare final dataset with clean column names
final_df = df_with_predictions[[
    'employee_id', 'name', 'department', 'salary',
    'pred_label', 'pred_prob_muslim', 'confidence', 'confidence_level'
]].rename(columns={
    'pred_label': 'predicted_religion',
    'pred_prob_muslim': 'muslim_probability',
    'confidence': 'prediction_confidence',
})
print("Final dataset with clean column names:")
print(final_df)
print(f"\nDataset shape: {final_df.shape}")
Final dataset with clean column names:
employee_id name department salary predicted_religion \
0 1001 Shah Rukh Khan Engineering 75000 muslim
1 1002 Priya Sharma Marketing 65000 not-muslim
2 1003 Mohammed Ali Finance 70000 muslim
3 1004 Raj Patel HR 60000 not-muslim
4 1005 Fatima Khan Engineering 80000 muslim
5 1006 Amitabh Bachchan Management 120000 not-muslim
muslim_probability prediction_confidence confidence_level
0 71.0 71.0 Medium
1 27.0 73.0 Medium
2 73.0 73.0 Medium
3 35.0 65.0 Low
4 73.0 73.0 Medium
5 31.0 69.0 Low
Dataset shape: (6, 8)
[12]:
# Save to CSV (most common format)
output_file = "employee_predictions.csv"
final_df.to_csv(output_file, index=False)
print(f"✅ Results saved to {output_file}")
# Show what was saved
print("\nSaved data preview:")
saved_df = pd.read_csv(output_file)
print(saved_df.head())
print("\nFile info:")
print(f"- Rows: {len(saved_df)}")
print(f"- Columns: {list(saved_df.columns)}")
✅ Results saved to employee_predictions.csv
Saved data preview:
employee_id name department salary predicted_religion \
0 1001 Shah Rukh Khan Engineering 75000 muslim
1 1002 Priya Sharma Marketing 65000 not-muslim
2 1003 Mohammed Ali Finance 70000 muslim
3 1004 Raj Patel HR 60000 not-muslim
4 1005 Fatima Khan Engineering 80000 muslim
muslim_probability prediction_confidence confidence_level
0 71.0 71.0 Medium
1 27.0 73.0 Medium
2 73.0 73.0 Medium
3 35.0 65.0 Low
4 73.0 73.0 Medium
File info:
- Rows: 6
- Columns: ['employee_id', 'name', 'department', 'salary', 'predicted_religion', 'muslim_probability', 'prediction_confidence', 'confidence_level']
[13]:
# Clean up the demo file
import os
if os.path.exists(output_file):
    os.remove(output_file)
    print(f"🧹 Demo file {output_file} removed")
🧹 Demo file employee_predictions.csv removed
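CSV is only one option; the other pandas writers follow the same pattern. A sketch — the Parquet and Excel calls assume the optional pyarrow and openpyxl dependencies are installed:
# Parquet preserves dtypes, including the categorical confidence_level
final_df.to_parquet("employee_predictions.parquet", index=False)

# Excel, convenient for sharing with non-programmers
final_df.to_excel("employee_predictions.xlsx", index=False)

# JSON records, one object per row
final_df.to_json("employee_predictions.json", orient="records")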
🔍 Advanced Analytics Example¶
Let’s create a summary report of our analysis:
[14]:
# Create a comprehensive summary
print("📊 EMPLOYEE RELIGION PREDICTION ANALYSIS REPORT")
print("=" * 60)
# Dataset overview
total_employees = len(final_df)
print("\n📋 Dataset Overview:")
print(f" Total employees analyzed: {total_employees}")
print(f" Departments: {final_df['department'].nunique()} ({', '.join(final_df['department'].unique())})")
print(f" Salary range: ${final_df['salary'].min():,} - ${final_df['salary'].max():,}")
# Religion predictions
religion_summary = final_df['predicted_religion'].value_counts()
print("\n🔮 Religion Predictions:")
for religion, count in religion_summary.items():
    pct = count / total_employees * 100
    print(f" {religion.title()}: {count} employees ({pct:.1f}%)")
# Confidence analysis
avg_confidence = final_df['prediction_confidence'].mean()
high_conf_count = (final_df['prediction_confidence'] > 90).sum()
print("\n📈 Confidence Analysis:")
print(f" Average confidence: {avg_confidence:.1f}%")
print(f" High confidence predictions (>90%): {high_conf_count}/{total_employees} ({high_conf_count/total_employees*100:.1f}%)")
# Department insights
dept_analysis = final_df.groupby('department').agg({
    'predicted_religion': lambda x: x.value_counts().index[0],  # most common religion
    'prediction_confidence': 'mean',
    'salary': 'mean',
})
print("\n🏢 Department Analysis:")
for dept in dept_analysis.index:
    most_common = dept_analysis.loc[dept, 'predicted_religion']
    avg_conf = dept_analysis.loc[dept, 'prediction_confidence']
    avg_sal = dept_analysis.loc[dept, 'salary']
    print(f" {dept}: Mostly {most_common} (avg confidence: {avg_conf:.1f}%, avg salary: ${avg_sal:,.0f})")
print("\n✅ Analysis complete!")
📊 EMPLOYEE RELIGION PREDICTION ANALYSIS REPORT
============================================================
📋 Dataset Overview:
Total employees analyzed: 6
Departments: 5 (Engineering, Marketing, Finance, HR, Management)
Salary range: $60,000 - $120,000
🔮 Religion Predictions:
Muslim: 3 employees (50.0%)
Not-Muslim: 3 employees (50.0%)
📈 Confidence Analysis:
Average confidence: 70.7%
High confidence predictions (>90%): 0/6 (0.0%)
🏢 Department Analysis:
Engineering: Mostly muslim (avg confidence: 72.0%, avg salary: $77,500)
Finance: Mostly muslim (avg confidence: 73.0%, avg salary: $70,000)
HR: Mostly not-muslim (avg confidence: 65.0%, avg salary: $60,000)
Management: Mostly not-muslim (avg confidence: 69.0%, avg salary: $120,000)
Marketing: Mostly not-muslim (avg confidence: 73.0%, avg salary: $65,000)
✅ Analysis complete!
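For reference, the whole workflow above condenses into a few lines — a minimal sketch using the same calls demonstrated in this notebook:
# Predict, merge, score confidence, export
preds = pranaam.pred_rel(df["name"], lang="eng")
enriched = df.merge(preds, on="name", how="left")
enriched["confidence"] = enriched["pred_prob_muslim"].apply(lambda x: max(x, 100 - x))
enriched.to_csv("predictions.csv", index=False)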
Key Takeaways¶
Use pranaam.pred_rel() to get predictions for a whole name column at once
Use .merge() to combine predictions with existing data
Derive a confidence score from pred_prob_muslim to filter or bucket predictions
Export enriched results with standard pandas writers such as .to_csv()
Next Steps¶
CSV Processing: Learn to process large CSV files
Performance Benchmarks: Optimize for large datasets
Basic Usage: Review fundamental concepts
Best Practices¶
Always check confidence scores - Don’t trust all predictions equally
Use batch processing - Process multiple names at once for efficiency
Handle missing data - Check for NaN values in name columns before processing (see the sketch after this list)
Validate results - Spot-check predictions against domain knowledge
Document assumptions - Note the model’s limitations and biases in your analysis
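A minimal sketch of the missing-data check from the list above (the cleaning steps are illustrative, not part of pranaam's API):
# Drop rows with missing names and normalize to strings before predicting
clean = df.dropna(subset=["name"]).copy()
clean["name"] = clean["name"].astype(str).str.strip()
predictions = pranaam.pred_rel(clean["name"], lang="eng")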