CSV Processing with Pranaam
This notebook demonstrates how to process CSV files containing names and add religion predictions. This is useful for:

• Processing employee databases
• Analyzing customer lists
• Research datasets
• Survey responses

We'll cover:

• Creating sample CSV data
• Reading and validating CSV files
• Processing names with error handling
• Saving enriched results
• Batch processing strategies
[1]:
from pathlib import Path
import pandas as pd
import pranaam
print(f"Pandas version: {pd.__version__}")
print(f"Pranaam version: {pranaam.__version__ if hasattr(pranaam, '__version__') else 'latest'}")
2026-01-21 19:31:42.880388: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2026-01-21 19:31:42.924629: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-21 19:31:44.317969: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
Pandas version: 2.3.3
Pranaam version: 0.0.2
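TensorFlow prints the CUDA and CPU diagnostics shown above when the model backend loads. If you want quieter sessions, one common approach (this uses TensorFlow's standard TF_CPP_MIN_LOG_LEVEL environment variable, not a pranaam feature) is to raise the log threshold at the very top of a fresh session, before anything imports TensorFlow:

import os

# Hide TensorFlow INFO/WARNING messages; must be set before TensorFlow
# (and therefore before pranaam) is imported for the first time.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import pranaam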
Creating Sample CSV Data
Let's start by creating a sample CSV file to work with:
[2]:
def create_sample_csv(filename="sample_names.csv"):
    """Create a sample CSV file for testing."""
    sample_data = pd.DataFrame({
        "id": [1, 2, 3, 4, 5, 6, 7, 8],
        "full_name": [
            "Shah Rukh Khan",
            "Priya Sharma",
            "Mohammed Ali",
            "Raj Patel",
            "Fatima Khan",
            "John Smith",
            "Deepika Padukone",
            "Abdul Rahman"
        ],
        "department": [
            "Engineering", "Marketing", "Finance", "HR",
            "Sales", "IT", "Design", "Operations"
        ],
        "city": [
            "Mumbai", "Delhi", "Bangalore", "Chennai",
            "Pune", "Hyderabad", "Kolkata", "Ahmedabad"
        ],
        "salary": [75000, 65000, 70000, 60000, 80000, 85000, 72000, 68000]
    })
    sample_data.to_csv(filename, index=False)
    print(f"Created sample file: {filename}")
    return sample_data

# Create our sample data
sample_df = create_sample_csv()
print("\nSample data:")
print(sample_df)
Created sample file: sample_names.csv
Sample data:
id full_name department city salary
0 1 Shah Rukh Khan Engineering Mumbai 75000
1 2 Priya Sharma Marketing Delhi 65000
2 3 Mohammed Ali Finance Bangalore 70000
3 4 Raj Patel HR Chennai 60000
4 5 Fatima Khan Sales Pune 80000
5 6 John Smith IT Hyderabad 85000
6 7 Deepika Padukone Design Kolkata 72000
7 8 Abdul Rahman Operations Ahmedabad 68000
CSV Processing Function
Let's create a comprehensive function to process CSV files with names:
[3]:
def process_csv_with_pranaam(input_file, output_file, name_column, language="eng"):
    """Process a CSV file and add religion predictions.

    Args:
        input_file: Path to the input CSV file
        output_file: Path for the output CSV file
        name_column: Name of the column containing names
        language: Language code ('eng' or 'hin')
    """
    # Validate input file
    if not Path(input_file).exists():
        print(f"Error: Input file '{input_file}' not found")
        return False

    try:
        # Read CSV
        print(f"Reading {input_file}...")
        df = pd.read_csv(input_file)
        print(f"  Found {len(df)} rows, {len(df.columns)} columns")

        # Validate name column
        if name_column not in df.columns:
            print(f"Error: Column '{name_column}' not found in CSV")
            print(f"  Available columns: {list(df.columns)}")
            return False

        # Data quality checks
        print("\nData Quality Analysis:")
        total_rows = len(df)
        missing_names = df[name_column].isna().sum()
        empty_names = (df[name_column].str.strip() == "").sum() if not df[name_column].isna().all() else 0
        print(f"  Total rows: {total_rows}")
        print(f"  Missing names: {missing_names}")
        print(f"  Empty names: {empty_names}")

        # Clean data
        if missing_names > 0 or empty_names > 0:
            print(f"  Removing {missing_names + empty_names} invalid rows...")
            df_clean = df.dropna(subset=[name_column])
            df_clean = df_clean[df_clean[name_column].str.strip() != ""]
        else:
            df_clean = df.copy()

        valid_rows = len(df_clean)
        print(f"  Valid rows for processing: {valid_rows}")
        if valid_rows == 0:
            print("No valid names to process!")
            return False

        # Get predictions
        print(f"\nGetting predictions for {valid_rows} names (language: {language})...")
        predictions = pranaam.pred_rel(df_clean[name_column], lang=language)

        # Rename prediction columns to avoid conflicts
        predictions = predictions.rename(columns={
            "name": name_column,
            "pred_label": f"{name_column}_religion",
            "pred_prob_muslim": f"{name_column}_confidence_muslim",
        })

        # Merge predictions back
        df_with_predictions = df_clean.merge(predictions, on=name_column, how="left")

        # Add an overall confidence score: the probability of the predicted label,
        # i.e. the larger of P(muslim) and P(not-muslim)
        conf_col = f"{name_column}_confidence_muslim"
        df_with_predictions[f"{name_column}_confidence"] = df_with_predictions[conf_col].apply(
            lambda x: max(x, 100 - x)
        )

        # Save results
        print(f"Saving results to {output_file}...")
        df_with_predictions.to_csv(output_file, index=False)

        # Generate summary
        print("\nProcessing Summary:")
        print(f"  Input rows: {total_rows}")
        print(f"  Valid names processed: {valid_rows}")
        print(f"  Output rows: {len(df_with_predictions)}")

        # Religion distribution
        religion_counts = df_with_predictions[f"{name_column}_religion"].value_counts()
        print(f"  Religion predictions: {dict(religion_counts)}")

        # Confidence analysis
        high_conf_count = (df_with_predictions[f"{name_column}_confidence"] > 90).sum()
        medium_conf_count = (
            (df_with_predictions[f"{name_column}_confidence"] >= 70) &
            (df_with_predictions[f"{name_column}_confidence"] <= 90)
        ).sum()
        low_conf_count = (df_with_predictions[f"{name_column}_confidence"] < 70).sum()
        print("  Confidence distribution:")
        print(f"    High (>90%): {high_conf_count} predictions")
        print(f"    Medium (70-90%): {medium_conf_count} predictions")
        print(f"    Low (<70%): {low_conf_count} predictions")

        print(f"\nSuccessfully processed {input_file} → {output_file}")
        return True

    except Exception as e:
        print(f"Error processing file: {e}")
        return False
Processing Our Sample Data
Now let's process our sample CSV file:
[4]:
# Process the sample CSV
input_file = "sample_names.csv"
output_file = "sample_names_with_predictions.csv"
success = process_csv_with_pranaam(
    input_file=input_file,
    output_file=output_file,
    name_column="full_name",
    language="eng"
)
Reading sample_names.csv...
  Found 8 rows, 5 columns

Data Quality Analysis:
  Total rows: 8
  Missing names: 0
  Empty names: 0
  Valid rows for processing: 8

Getting predictions for 8 names (language: eng)...
[01/21/26 19:31:44] INFO pranaam - Loading eng model from /home/runner/work/pranaam/pranaam/pranaam/model/eng_and_hindi_models_v2/eng_model.keras
                    INFO pranaam - Loading eng model with tf-keras compatibility layer
2026-01-21 19:31:44.845097: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
Saving results to sample_names_with_predictions.csv...

Processing Summary:
  Input rows: 8
  Valid names processed: 8
  Output rows: 8
  Religion predictions: {'muslim': 4, 'not-muslim': 4}
  Confidence distribution:
    High (>90%): 0 predictions
    Medium (70-90%): 5 predictions
    Low (<70%): 3 predictions

Successfully processed sample_names.csv → sample_names_with_predictions.csv
Examining the Results
Let's load and examine the processed results:
[5]:
if success:
    # Load the processed results
    results_df = pd.read_csv(output_file)
    print("Processed Results:")
    print(results_df)

    print("\nNew Columns Added:")
    new_columns = [col for col in results_df.columns if 'full_name' in col and col != 'full_name']
    for col in new_columns:
        print(f"  • {col}")
Processed Results:
id full_name department city salary full_name_religion \
0 1 Shah Rukh Khan Engineering Mumbai 75000 muslim
1 2 Priya Sharma Marketing Delhi 65000 not-muslim
2 3 Mohammed Ali Finance Bangalore 70000 muslim
3 4 Raj Patel HR Chennai 60000 not-muslim
4 5 Fatima Khan Sales Pune 80000 muslim
5 6 John Smith IT Hyderabad 85000 not-muslim
6 7 Deepika Padukone Design Kolkata 72000 not-muslim
7 8 Abdul Rahman Operations Ahmedabad 68000 muslim
full_name_confidence_muslim full_name_confidence
0 71.0 71.0
1 27.0 73.0
2 73.0 73.0
3 35.0 65.0
4 73.0 73.0
5 37.0 63.0
6 32.0 68.0
7 73.0 73.0
New Columns Added:
  • full_name_religion
  • full_name_confidence_muslim
  • full_name_confidence
[6]:
# Detailed analysis of predictions
if success:
    print("Detailed Prediction Analysis:")
    print("=" * 70)
    print(f"{'Name':<20} | {'Religion':<10} | {'Muslim %':<8} | {'Confidence':<10}")
    print("-" * 70)
    for _, row in results_df.iterrows():
        name = row['full_name']
        religion = row['full_name_religion']
        muslim_prob = row['full_name_confidence_muslim']
        confidence = row['full_name_confidence']
        print(f"{name:<20} | {religion:<10} | {muslim_prob:>6.1f}% | {confidence:>8.1f}%")
Detailed Prediction Analysis:
======================================================================
Name | Religion | Muslim % | Confidence
----------------------------------------------------------------------
Shah Rukh Khan | muslim | 71.0% | 71.0%
Priya Sharma | not-muslim | 27.0% | 73.0%
Mohammed Ali | muslim | 73.0% | 73.0%
Raj Patel | not-muslim | 35.0% | 65.0%
Fatima Khan | muslim | 73.0% | 73.0%
John Smith | not-muslim | 37.0% | 63.0%
Deepika Padukone | not-muslim | 32.0% | 68.0%
Abdul Rahman | muslim | 73.0% | 73.0%
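The probabilities in this sample cluster between roughly 63% and 73%, so no prediction here is a sure thing. One practical follow-up, sketched below assuming results_df and the full_name_confidence column from the processing step are available, is to split rows by a confidence threshold and route the uncertain ones to manual review (the threshold and output filename are illustrative):

# Flag low-confidence predictions for manual review (threshold is illustrative)
REVIEW_THRESHOLD = 70.0

needs_review = results_df[results_df["full_name_confidence"] < REVIEW_THRESHOLD]
confident = results_df[results_df["full_name_confidence"] >= REVIEW_THRESHOLD]

print(f"{len(confident)} confident rows, {len(needs_review)} flagged for review")
needs_review.to_csv("predictions_for_review.csv", index=False)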
Large File Processing Strategy
For large CSV files, we need to process data in chunks to avoid memory issues:
[7]:
def process_large_csv(input_file, output_file, name_column, language="eng", chunk_size=1000):
    """Process large CSV files in chunks to manage memory usage."""
    print(f"Processing large CSV file: {input_file}")
    print(f"  Chunk size: {chunk_size} rows")

    # Get total row count first (subtract 1 for the header)
    with open(input_file) as f:
        total_rows = sum(1 for _ in f) - 1
    print(f"  Total rows: {total_rows:,}")

    processed_chunks = []
    chunk_num = 0
    try:
        # Process file in chunks
        for chunk_df in pd.read_csv(input_file, chunksize=chunk_size):
            chunk_num += 1
            print(f"\nProcessing chunk {chunk_num} ({len(chunk_df)} rows)...")

            # Clean chunk
            clean_chunk = chunk_df.dropna(subset=[name_column])
            clean_chunk = clean_chunk[clean_chunk[name_column].str.strip() != ""]
            if len(clean_chunk) == 0:
                print("  No valid names in this chunk, skipping...")
                continue

            # Get predictions for chunk
            predictions = pranaam.pred_rel(clean_chunk[name_column], lang=language)

            # Rename columns
            predictions = predictions.rename(columns={
                "name": name_column,
                "pred_label": f"{name_column}_religion",
                "pred_prob_muslim": f"{name_column}_confidence_muslim",
            })

            # Merge predictions
            chunk_with_predictions = clean_chunk.merge(predictions, on=name_column, how="left")

            # Add confidence score
            conf_col = f"{name_column}_confidence_muslim"
            chunk_with_predictions[f"{name_column}_confidence"] = chunk_with_predictions[conf_col].apply(
                lambda x: max(x, 100 - x)
            )

            processed_chunks.append(chunk_with_predictions)
            print(f"  Processed {len(chunk_with_predictions)} names")

        # Guard against an input with no valid names at all
        if not processed_chunks:
            print("No valid names found in any chunk!")
            return False

        # Combine all chunks
        print(f"\nCombining {len(processed_chunks)} chunks...")
        final_df = pd.concat(processed_chunks, ignore_index=True)

        # Save results
        print(f"Saving {len(final_df)} rows to {output_file}...")
        final_df.to_csv(output_file, index=False)

        print("\nLarge file processing completed!")
        return True

    except Exception as e:
        print(f"Error processing large file: {e}")
        return False

# Demonstrate with our sample (simulating large file processing)
print("Demonstrating large file processing strategy:")
large_file_success = process_large_csv(
    input_file="sample_names.csv",
    output_file="sample_large_processed.csv",
    name_column="full_name",
    chunk_size=3  # Small chunk size for demo
)
Demonstrating large file processing strategy:
Processing large CSV file: sample_names.csv
  Chunk size: 3 rows
  Total rows: 8

Processing chunk 1 (3 rows)...
  Processed 3 names

Processing chunk 2 (3 rows)...
  Processed 3 names

Processing chunk 3 (2 rows)...
  Processed 2 names

Combining 3 chunks...
Saving 8 rows to sample_large_processed.csv...

Large file processing completed!
Validation and Quality Checks
Let's create validation functions to ensure our processing worked correctly:
[8]:
def validate_processed_csv(original_file, processed_file, name_column):
    """Validate that the processed CSV is correct."""
    print("Validation Report:")
    print("=" * 40)

    # Load both files
    original_df = pd.read_csv(original_file)
    processed_df = pd.read_csv(processed_file)

    # Basic checks
    print(f"Original file rows: {len(original_df)}")
    print(f"Processed file rows: {len(processed_df)}")
    print(f"Rows preserved: {len(processed_df) / len(original_df) * 100:.1f}%")

    # Check for new columns
    original_cols = set(original_df.columns)
    processed_cols = set(processed_df.columns)
    new_cols = processed_cols - original_cols
    print(f"\nNew columns added: {len(new_cols)}")
    for col in sorted(new_cols):
        print(f"  • {col}")

    # Check prediction completeness
    religion_col = f"{name_column}_religion"
    if religion_col in processed_df.columns:
        null_predictions = processed_df[religion_col].isna().sum()
        print("\nPrediction completeness:")
        print(f"  Names with predictions: {len(processed_df) - null_predictions}")
        print(f"  Names without predictions: {null_predictions}")
        if null_predictions == 0:
            print("  All names have predictions")
        else:
            print(f"  {null_predictions} names missing predictions")

    # Confidence distribution
    conf_col = f"{name_column}_confidence"
    if conf_col in processed_df.columns:
        high_conf = (processed_df[conf_col] > 90).sum()
        medium_conf = ((processed_df[conf_col] >= 70) & (processed_df[conf_col] <= 90)).sum()
        low_conf = (processed_df[conf_col] < 70).sum()
        print("\nConfidence distribution:")
        print(f"  High confidence (>90%): {high_conf} ({high_conf/len(processed_df)*100:.1f}%)")
        print(f"  Medium confidence (70-90%): {medium_conf} ({medium_conf/len(processed_df)*100:.1f}%)")
        print(f"  Low confidence (<70%): {low_conf} ({low_conf/len(processed_df)*100:.1f}%)")

    print("\nValidation complete!")

# Validate our processed files
if success:
    validate_processed_csv("sample_names.csv", "sample_names_with_predictions.csv", "full_name")
Validation Report:
========================================
Original file rows: 8
Processed file rows: 8
Rows preserved: 100.0%

New columns added: 3
  • full_name_confidence
  • full_name_confidence_muslim
  • full_name_religion

Prediction completeness:
  Names with predictions: 8
  Names without predictions: 0
  All names have predictions

Confidence distribution:
  High confidence (>90%): 0 (0.0%)
  Medium confidence (70-90%): 5 (62.5%)
  Low confidence (<70%): 3 (37.5%)

Validation complete!
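For automated pipelines, the same checks work better as hard assertions that fail loudly instead of printing a report. A minimal sketch, reusing the column-naming convention from above (the helper name is hypothetical):

# Hypothetical pipeline guard: raise instead of printing, so a scheduled job fails fast
def assert_processing_ok(original_file, processed_file, name_column):
    original_df = pd.read_csv(original_file)
    processed_df = pd.read_csv(processed_file)
    religion_col = f"{name_column}_religion"
    assert len(processed_df) == len(original_df), "row count changed during processing"
    assert religion_col in processed_df.columns, f"missing column: {religion_col}"
    assert processed_df[religion_col].notna().all(), "some names lack predictions"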
Cleanup
Let's clean up the demo files:
[9]:
import os

# Clean up demo files
demo_files = [
    "sample_names.csv",
    "sample_names_with_predictions.csv",
    "sample_large_processed.csv"
]

print("Cleaning up demo files:")
for file in demo_files:
    if os.path.exists(file):
        os.remove(file)
        print(f"  Removed {file}")
    else:
        print(f"  {file} not found")

print("\nDemo cleanup complete!")
Cleaning up demo files:
  Removed sample_names.csv
  Removed sample_names_with_predictions.csv
  Removed sample_large_processed.csv

Demo cleanup complete!
Command-Line Equivalent
If you were to create a command-line script, here's what the usage would look like:
[10]:
# This shows how you might structure a command-line interface
def demonstrate_cli_usage():
    print("Command-Line Usage Examples:")
    print("=" * 50)

    examples = [
        {
            "description": "Basic CSV processing",
            "command": "python csv_processor.py data.csv results.csv --name-column 'full_name'"
        },
        {
            "description": "Process with Hindi names",
            "command": "python csv_processor.py data.csv results.csv --name-column 'employee_name' --language hin"
        },
        {
            "description": "Large file with custom chunk size",
            "command": "python csv_processor.py large_data.csv results.csv --name-column 'name' --chunk-size 5000"
        },
        {
            "description": "Create sample file for testing",
            "command": "python csv_processor.py --create-sample"
        }
    ]

    for i, example in enumerate(examples, 1):
        print(f"\n{i}. {example['description']}:")
        print(f"   {example['command']}")

    print("\nRequired Arguments:")
    print("  • input_file: Path to CSV file with names")
    print("  • output_file: Path for results CSV")
    print("  • --name-column: Column containing names")
    print("\nOptional Arguments:")
    print("  • --language: 'eng' or 'hin' (default: eng)")
    print("  • --chunk-size: Rows per chunk (default: 1000)")
    print("  • --create-sample: Generate test data")

demonstrate_cli_usage()
Command-Line Usage Examples:
==================================================

1. Basic CSV processing:
   python csv_processor.py data.csv results.csv --name-column 'full_name'

2. Process with Hindi names:
   python csv_processor.py data.csv results.csv --name-column 'employee_name' --language hin

3. Large file with custom chunk size:
   python csv_processor.py large_data.csv results.csv --name-column 'name' --chunk-size 5000

4. Create sample file for testing:
   python csv_processor.py --create-sample

Required Arguments:
  • input_file: Path to CSV file with names
  • output_file: Path for results CSV
  • --name-column: Column containing names

Optional Arguments:
  • --language: 'eng' or 'hin' (default: eng)
  • --chunk-size: Rows per chunk (default: 1000)
  • --create-sample: Generate test data
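The notebook doesn't ship a csv_processor.py; the script name and flags above are illustrative. As a sketch of how that interface could be wired up with argparse (assuming create_sample_csv and process_large_csv from this notebook are defined in, or imported into, the script):

import argparse

def main():
    parser = argparse.ArgumentParser(description="Add pranaam religion predictions to a CSV file.")
    parser.add_argument("input_file", nargs="?", help="Path to CSV file with names")
    parser.add_argument("output_file", nargs="?", help="Path for results CSV")
    parser.add_argument("--name-column", help="Column containing names")
    parser.add_argument("--language", choices=["eng", "hin"], default="eng")
    parser.add_argument("--chunk-size", type=int, default=1000)
    parser.add_argument("--create-sample", action="store_true", help="Generate test data")
    args = parser.parse_args()

    if args.create_sample:
        create_sample_csv()
        return
    if not (args.input_file and args.output_file and args.name_column):
        parser.error("input_file, output_file, and --name-column are required")
    process_large_csv(args.input_file, args.output_file, args.name_column,
                      language=args.language, chunk_size=args.chunk_size)

if __name__ == "__main__":
    main()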
Key Takeaways
Best Practices for CSV Processing
1. Validate Input Data
   • Check that the file exists and is readable
   • Verify required columns are present
   • Handle missing or empty names gracefully
2. Memory Management (see the streaming sketch after this list)
   • Use chunk processing for files larger than ~100 MB
   • Choose appropriate chunk sizes (1,000-5,000 rows)
   • Monitor memory usage during processing
3. Error Handling
   • Wrap processing in try-except blocks
   • Log errors with sufficient detail
   • Provide clear error messages to users
4. Output Quality
   • Use descriptive column names
   • Include confidence scores
   • Validate output completeness
   • Save processing metadata
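Note that process_large_csv above still holds every processed chunk in memory and concatenates them at the end, so its peak memory use grows with file size. For truly large files, a variant that appends each chunk straight to the output CSV keeps memory flat. A minimal sketch, reusing the prediction and column-naming logic from this notebook:

# Stream each processed chunk to disk instead of accumulating chunks in memory
def process_csv_streaming(input_file, output_file, name_column, language="eng", chunk_size=1000):
    first_chunk = True
    for chunk_df in pd.read_csv(input_file, chunksize=chunk_size):
        chunk_df = chunk_df.dropna(subset=[name_column])
        chunk_df = chunk_df[chunk_df[name_column].str.strip() != ""]
        if chunk_df.empty:
            continue
        predictions = pranaam.pred_rel(chunk_df[name_column], lang=language).rename(columns={
            "name": name_column,
            "pred_label": f"{name_column}_religion",
            "pred_prob_muslim": f"{name_column}_confidence_muslim",
        })
        merged = chunk_df.merge(predictions, on=name_column, how="left")
        conf_col = f"{name_column}_confidence_muslim"
        merged[f"{name_column}_confidence"] = merged[conf_col].apply(lambda x: max(x, 100 - x))
        # Write the header only for the first chunk, then append
        merged.to_csv(output_file, mode="w" if first_chunk else "a", header=first_chunk, index=False)
        first_chunk = False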
Next Steps
• Performance Benchmarks: Optimize for large-scale processing
• Pandas Integration: Advanced DataFrame operations
• Basic Usage: Review fundamental concepts