CSV Processing with Pranaam

This notebook demonstrates how to process CSV files containing names and add religion predictions. This is useful for:

  • Processing employee databases

  • Analyzing customer lists

  • Research datasets

  • Survey responses

We'll cover:

  1. Creating sample CSV data

  2. Reading and validating CSV files

  3. Processing names with error handling

  4. Saving enriched results

  5. Batch processing strategies

[1]:
from pathlib import Path

import pandas as pd

import pranaam

print(f"Pandas version: {pd.__version__}")
print(f"Pranaam version: {pranaam.__version__ if hasattr(pranaam, '__version__') else 'latest'}")
2026-01-21 19:31:42.880388: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2026-01-21 19:31:42.924629: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-21 19:31:44.317969: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
Pandas version: 2.3.3
Pranaam version: 0.0.2

๐Ÿ“ Creating Sample CSV Dataยถ

Let's start by creating a sample CSV file to work with:

[2]:
def create_sample_csv(filename="sample_names.csv"):
    """Create a sample CSV file for testing."""
    sample_data = pd.DataFrame({
        "id": [1, 2, 3, 4, 5, 6, 7, 8],
        "full_name": [
            "Shah Rukh Khan",
            "Priya Sharma",
            "Mohammed Ali",
            "Raj Patel",
            "Fatima Khan",
            "John Smith",
            "Deepika Padukone",
            "Abdul Rahman"
        ],
        "department": [
            "Engineering", "Marketing", "Finance", "HR",
            "Sales", "IT", "Design", "Operations"
        ],
        "city": [
            "Mumbai", "Delhi", "Bangalore", "Chennai",
            "Pune", "Hyderabad", "Kolkata", "Ahmedabad"
        ],
        "salary": [75000, 65000, 70000, 60000, 80000, 85000, 72000, 68000]
    })

    sample_data.to_csv(filename, index=False)
    print(f"๐Ÿ“ Created sample file: {filename}")
    return sample_data

# Create our sample data
sample_df = create_sample_csv()
print("\nSample data:")
print(sample_df)
๐Ÿ“ Created sample file: sample_names.csv

Sample data:
   id         full_name   department       city  salary
0   1    Shah Rukh Khan  Engineering     Mumbai   75000
1   2      Priya Sharma    Marketing      Delhi   65000
2   3      Mohammed Ali      Finance  Bangalore   70000
3   4         Raj Patel           HR    Chennai   60000
4   5       Fatima Khan        Sales       Pune   80000
5   6        John Smith           IT  Hyderabad   85000
6   7  Deepika Padukone       Design    Kolkata   72000
7   8      Abdul Rahman   Operations  Ahmedabad   68000

📖 CSV Processing Function

Let's create a comprehensive function to process CSV files with names:

[3]:
def process_csv_with_pranaam(input_file, output_file, name_column, language="eng"):
    """Process a CSV file and add religion predictions.

    Args:
        input_file: Path to the input CSV file
        output_file: Path to the output CSV file
        name_column: Name of the column containing names
        language: Language code ('eng' or 'hin')

    For files too large to hold in memory, see process_large_csv below.
    """

    # Validate input file
    if not Path(input_file).exists():
        print(f"โŒ Error: Input file '{input_file}' not found")
        return False

    try:
        # Read CSV
        print(f"๐Ÿ“– Reading {input_file}...")
        df = pd.read_csv(input_file)
        print(f"   Found {len(df)} rows, {len(df.columns)} columns")

        # Validate name column
        if name_column not in df.columns:
            print(f"โŒ Error: Column '{name_column}' not found in CSV")
            print(f"   Available columns: {list(df.columns)}")
            return False

        # Data quality checks
        print("\n๐Ÿ” Data Quality Analysis:")
        total_rows = len(df)
        missing_names = df[name_column].isna().sum()
        empty_names = (df[name_column].str.strip() == "").sum() if not df[name_column].isna().all() else 0

        print(f"   Total rows: {total_rows}")
        print(f"   Missing names: {missing_names}")
        print(f"   Empty names: {empty_names}")

        # Clean data
        if missing_names > 0 or empty_names > 0:
            print(f"   Removing {missing_names + empty_names} invalid rows...")
            df_clean = df.dropna(subset=[name_column])
            df_clean = df_clean[df_clean[name_column].str.strip() != ""]
        else:
            df_clean = df.copy()

        valid_rows = len(df_clean)
        print(f"   Valid rows for processing: {valid_rows}")

        if valid_rows == 0:
            print("โŒ No valid names to process!")
            return False

        # Get predictions
        print(f"\n๐Ÿ”ฎ Getting predictions for {valid_rows} names (language: {language})...")
        predictions = pranaam.pred_rel(df_clean[name_column], lang=language)

        # Rename prediction columns to avoid conflicts
        predictions = predictions.rename(columns={
            "name": name_column,
            "pred_label": f"{name_column}_religion",
            "pred_prob_muslim": f"{name_column}_confidence_muslim",
        })

        # Merge predictions back (note: joining on the name column assumes names
        # are unique in the file; duplicates would multiply rows in the merge)
        df_with_predictions = df_clean.merge(predictions, on=name_column, how="left")

        # Confidence of the predicted label: the model reports P(muslim) as a
        # percentage, so the winning label's confidence is max(p, 100 - p)
        conf_col = f"{name_column}_confidence_muslim"
        df_with_predictions[f"{name_column}_confidence"] = df_with_predictions[conf_col].apply(
            lambda x: max(x, 100 - x)
        )

        # Save results
        print(f"๐Ÿ’พ Saving results to {output_file}...")
        df_with_predictions.to_csv(output_file, index=False)

        # Generate summary
        print("\n๐Ÿ“Š Processing Summary:")
        print(f"   Input rows: {total_rows}")
        print(f"   Valid names processed: {valid_rows}")
        print(f"   Output rows: {len(df_with_predictions)}")

        # Religion distribution
        religion_counts = df_with_predictions[f"{name_column}_religion"].value_counts()
        print(f"   Religion predictions: {dict(religion_counts)}")

        # Confidence analysis
        high_conf_count = (df_with_predictions[f"{name_column}_confidence"] > 90).sum()
        medium_conf_count = (
            (df_with_predictions[f"{name_column}_confidence"] >= 70) &
            (df_with_predictions[f"{name_column}_confidence"] <= 90)
        ).sum()
        low_conf_count = (df_with_predictions[f"{name_column}_confidence"] < 70).sum()

        print("   Confidence distribution:")
        print(f"     High (>90%): {high_conf_count} predictions")
        print(f"     Medium (70-90%): {medium_conf_count} predictions")
        print(f"     Low (<70%): {low_conf_count} predictions")

        print(f"\nโœ… Successfully processed {input_file} โ†’ {output_file}")
        return True

    except Exception as e:
        print(f"โŒ Error processing file: {str(e)}")
        return False
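
The confidence transform is worth a closer look: pred_prob_muslim is the model's P(muslim) expressed as a percentage, so the confidence of whichever label was predicted is max(p, 100 - p). A quick standalone check, using values that appear in the outputs below:

def label_confidence(pred_prob_muslim):
    """Confidence of the predicted label, given P(muslim) as a percentage."""
    return max(pred_prob_muslim, 100 - pred_prob_muslim)

# e.g. P(muslim) = 27.0 -> predicted "not-muslim" with 73.0% confidence
assert label_confidence(27.0) == 73.0
assert label_confidence(71.0) == 71.0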

🔄 Processing Our Sample Data

Now let's process our sample CSV file:

[4]:
# Process the sample CSV
input_file = "sample_names.csv"
output_file = "sample_names_with_predictions.csv"

success = process_csv_with_pranaam(
    input_file=input_file,
    output_file=output_file,
    name_column="full_name",
    language="eng"
)
📖 Reading sample_names.csv...
   Found 8 rows, 5 columns

🔍 Data Quality Analysis:
   Total rows: 8
   Missing names: 0
   Empty names: 0
   Valid rows for processing: 8

🔮 Getting predictions for 8 names (language: eng)...
[01/21/26 19:31:44] INFO     pranaam - Loading eng model from
                             /home/runner/work/pranaam/pranaam/pranaam/model/eng_and_hindi_models_v2/eng_model.keras
                    INFO     pranaam - Loading eng model with tf-keras compatibility layer
2026-01-21 19:31:44.845097: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
💾 Saving results to sample_names_with_predictions.csv...

📊 Processing Summary:
   Input rows: 8
   Valid names processed: 8
   Output rows: 8
   Religion predictions: {'muslim': 4, 'not-muslim': 4}
   Confidence distribution:
     High (>90%): 0 predictions
     Medium (70-90%): 5 predictions
     Low (<70%): 3 predictions

✅ Successfully processed sample_names.csv → sample_names_with_predictions.csv

📋 Examining the Results

Let's load and examine the processed results:

[5]:
if success:
    # Load the processed results
    results_df = pd.read_csv(output_file)

    print("๐Ÿ“‹ Processed Results:")
    print(results_df)

    print("\n๐Ÿ“ˆ New Columns Added:")
    new_columns = [col for col in results_df.columns if 'full_name' in col and col != 'full_name']
    for col in new_columns:
        print(f"   โ€ข {col}")
📋 Processed Results:
   id         full_name   department       city  salary full_name_religion  \
0   1    Shah Rukh Khan  Engineering     Mumbai   75000             muslim
1   2      Priya Sharma    Marketing      Delhi   65000         not-muslim
2   3      Mohammed Ali      Finance  Bangalore   70000             muslim
3   4         Raj Patel           HR    Chennai   60000         not-muslim
4   5       Fatima Khan        Sales       Pune   80000             muslim
5   6        John Smith           IT  Hyderabad   85000         not-muslim
6   7  Deepika Padukone       Design    Kolkata   72000         not-muslim
7   8      Abdul Rahman   Operations  Ahmedabad   68000             muslim

   full_name_confidence_muslim  full_name_confidence
0                         71.0                  71.0
1                         27.0                  73.0
2                         73.0                  73.0
3                         35.0                  65.0
4                         73.0                  73.0
5                         37.0                  63.0
6                         32.0                  68.0
7                         73.0                  73.0

📈 New Columns Added:
   • full_name_religion
   • full_name_confidence_muslim
   • full_name_confidence
[6]:
# Detailed analysis of predictions
if success:
    print("๐Ÿ” Detailed Prediction Analysis:")
    print("=" * 70)
    print(f"{'Name':<20} | {'Religion':<10} | {'Muslim %':<8} | {'Confidence':<10}")
    print("-" * 70)

    for _, row in results_df.iterrows():
        name = row['full_name']
        religion = row['full_name_religion']
        muslim_prob = row['full_name_confidence_muslim']
        confidence = row['full_name_confidence']

        print(f"{name:<20} | {religion:<10} | {muslim_prob:>6.1f}% | {confidence:>8.1f}%")
๐Ÿ” Detailed Prediction Analysis:
======================================================================
Name                 | Religion   | Muslim % | Confidence
----------------------------------------------------------------------
Shah Rukh Khan       | muslim     |   71.0% |     71.0%
Priya Sharma         | not-muslim |   27.0% |     73.0%
Mohammed Ali         | muslim     |   73.0% |     73.0%
Raj Patel            | not-muslim |   35.0% |     65.0%
Fatima Khan          | muslim     |   73.0% |     73.0%
John Smith           | not-muslim |   37.0% |     63.0%
Deepika Padukone     | not-muslim |   32.0% |     68.0%
Abdul Rahman         | muslim     |   73.0% |     73.0%
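
In practice, low-confidence predictions are better treated as candidates for manual review than as ground truth. A quick way to flag them (the 70% cutoff mirrors the "Low" band above and is otherwise arbitrary):

if success:
    review_df = results_df[results_df["full_name_confidence"] < 70]
    print(f"{len(review_df)} names flagged for manual review:")
    print(review_df[["full_name", "full_name_religion", "full_name_confidence"]])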

🚀 Large File Processing Strategy

For large CSV files, we need to process data in chunks to avoid memory issues:

[7]:
def process_large_csv(input_file, output_file, name_column, language="eng", chunk_size=1000):
    """Process large CSV files in chunks to manage memory usage."""

    print(f"๐Ÿš€ Processing large CSV file: {input_file}")
    print(f"   Chunk size: {chunk_size} rows")

    # Count data rows up front; "with" ensures the file handle is closed (-1 excludes the header)
    with open(input_file) as f:
        total_rows = sum(1 for _ in f) - 1
    print(f"   Total rows: {total_rows:,}")

    processed_chunks = []
    chunk_num = 0

    try:
        # Process file in chunks
        for chunk_df in pd.read_csv(input_file, chunksize=chunk_size):
            chunk_num += 1
            print(f"\n๐Ÿ“ฆ Processing chunk {chunk_num} ({len(chunk_df)} rows)...")

            # Clean chunk
            clean_chunk = chunk_df.dropna(subset=[name_column])
            clean_chunk = clean_chunk[clean_chunk[name_column].str.strip() != ""]

            if len(clean_chunk) == 0:
                print("   โš ๏ธ No valid names in this chunk, skipping...")
                continue

            # Get predictions for chunk
            predictions = pranaam.pred_rel(clean_chunk[name_column], lang=language)

            # Rename columns
            predictions = predictions.rename(columns={
                "name": name_column,
                "pred_label": f"{name_column}_religion",
                "pred_prob_muslim": f"{name_column}_confidence_muslim",
            })

            # Merge predictions
            chunk_with_predictions = clean_chunk.merge(predictions, on=name_column, how="left")

            # Add confidence score
            conf_col = f"{name_column}_confidence_muslim"
            chunk_with_predictions[f"{name_column}_confidence"] = chunk_with_predictions[conf_col].apply(
                lambda x: max(x, 100 - x)
            )

            processed_chunks.append(chunk_with_predictions)
            print(f"   โœ… Processed {len(chunk_with_predictions)} names")

        # Combine all chunks (guard: pd.concat raises if every chunk was skipped)
        if not processed_chunks:
            print("❌ No valid names found in any chunk!")
            return False

        print(f"\n🔗 Combining {len(processed_chunks)} chunks...")
        final_df = pd.concat(processed_chunks, ignore_index=True)

        # Save results
        print(f"๐Ÿ’พ Saving {len(final_df)} rows to {output_file}...")
        final_df.to_csv(output_file, index=False)

        print("\nโœ… Large file processing completed!")
        return True

    except Exception as e:
        print(f"โŒ Error processing large file: {str(e)}")
        return False

# Demonstrate with our sample (simulating large file processing)
print("Demonstrating large file processing strategy:")
large_file_success = process_large_csv(
    input_file="sample_names.csv",
    output_file="sample_large_processed.csv",
    name_column="full_name",
    chunk_size=3  # Small chunk size for demo
)
Demonstrating large file processing strategy:
🚀 Processing large CSV file: sample_names.csv
   Chunk size: 3 rows
   Total rows: 8

📦 Processing chunk 1 (3 rows)...
   ✅ Processed 3 names

📦 Processing chunk 2 (3 rows)...
   ✅ Processed 3 names

📦 Processing chunk 3 (2 rows)...
   ✅ Processed 2 names

🔗 Combining 3 chunks...
💾 Saving 8 rows to sample_large_processed.csv...

✅ Large file processing completed!
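
One caveat: process_large_csv still accumulates every processed chunk in processed_chunks before concatenating, so peak memory grows with the whole file rather than with chunk_size. For genuinely large files you can append each enriched chunk to the output as you go. A minimal sketch of that idea (same column conventions as above, error handling omitted):

def process_large_csv_streaming(input_file, output_file, name_column, language="eng", chunk_size=1000):
    """Write each enriched chunk immediately so memory stays bounded by chunk_size."""
    first_chunk = True
    for chunk_df in pd.read_csv(input_file, chunksize=chunk_size):
        # Same cleaning as above: drop missing and empty names
        clean_chunk = chunk_df.dropna(subset=[name_column])
        clean_chunk = clean_chunk[clean_chunk[name_column].str.strip() != ""]
        if len(clean_chunk) == 0:
            continue

        predictions = pranaam.pred_rel(clean_chunk[name_column], lang=language)
        predictions = predictions.rename(columns={
            "name": name_column,
            "pred_label": f"{name_column}_religion",
            "pred_prob_muslim": f"{name_column}_confidence_muslim",
        })
        enriched = clean_chunk.merge(predictions, on=name_column, how="left")

        # Append to the output file; write the header only once
        enriched.to_csv(output_file, mode="w" if first_chunk else "a",
                        header=first_chunk, index=False)
        first_chunk = False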

📊 Validation and Quality Checks

Let's create validation functions to ensure our processing worked correctly:

[8]:
def validate_processed_csv(original_file, processed_file, name_column):
    """Validate that the processed CSV is correct."""

    print("๐Ÿ” Validation Report:")
    print("=" * 40)

    # Load both files
    original_df = pd.read_csv(original_file)
    processed_df = pd.read_csv(processed_file)

    # Basic checks
    print(f"Original file rows: {len(original_df)}")
    print(f"Processed file rows: {len(processed_df)}")
    print(f"Rows preserved: {len(processed_df) / len(original_df) * 100:.1f}%")

    # Check for new columns
    original_cols = set(original_df.columns)
    processed_cols = set(processed_df.columns)
    new_cols = processed_cols - original_cols

    print(f"\nNew columns added: {len(new_cols)}")
    for col in sorted(new_cols):
        print(f"  โ€ข {col}")

    # Check predictions completeness
    religion_col = f"{name_column}_religion"
    if religion_col in processed_df.columns:
        null_predictions = processed_df[religion_col].isna().sum()
        print("\nPrediction completeness:")
        print(f"  Names with predictions: {len(processed_df) - null_predictions}")
        print(f"  Names without predictions: {null_predictions}")

        if null_predictions == 0:
            print("  ✅ All names have predictions")
        else:
            print(f"  ⚠️ {null_predictions} names missing predictions")

    # Confidence distribution
    conf_col = f"{name_column}_confidence"
    if conf_col in processed_df.columns:
        high_conf = (processed_df[conf_col] > 90).sum()
        medium_conf = ((processed_df[conf_col] >= 70) & (processed_df[conf_col] <= 90)).sum()
        low_conf = (processed_df[conf_col] < 70).sum()

        print("\nConfidence distribution:")
        print(f"  High confidence (>90%): {high_conf} ({high_conf/len(processed_df)*100:.1f}%)")
        print(f"  Medium confidence (70-90%): {medium_conf} ({medium_conf/len(processed_df)*100:.1f}%)")
        print(f"  Low confidence (<70%): {low_conf} ({low_conf/len(processed_df)*100:.1f}%)")

    print("\nโœ… Validation complete!")

# Validate our processed files
if success:
    validate_processed_csv("sample_names.csv", "sample_names_with_predictions.csv", "full_name")
๐Ÿ” Validation Report:
========================================
Original file rows: 8
Processed file rows: 8
Rows preserved: 100.0%

New columns added: 3
  โ€ข full_name_confidence
  โ€ข full_name_confidence_muslim
  โ€ข full_name_religion

Prediction completeness:
  Names with predictions: 8
  Names without predictions: 0
  โœ… All names have predictions

Confidence distribution:
  High confidence (>90%): 0 (0.0%)
  Medium confidence (70-90%): 5 (62.5%)
  Low confidence (<70%): 3 (37.5%)

โœ… Validation complete!

🧹 Cleanup

Let's clean up the demo files:

[9]:
import os

# Clean up demo files
demo_files = [
    "sample_names.csv",
    "sample_names_with_predictions.csv",
    "sample_large_processed.csv"
]

print("๐Ÿงน Cleaning up demo files:")
for file in demo_files:
    if os.path.exists(file):
        os.remove(file)
        print(f"   โœ… Removed {file}")
    else:
        print(f"   โ„น๏ธ {file} not found")

print("\n๐ŸŽ‰ Demo cleanup complete!")
🧹 Cleaning up demo files:
   ✅ Removed sample_names.csv
   ✅ Removed sample_names_with_predictions.csv
   ✅ Removed sample_large_processed.csv

🎉 Demo cleanup complete!

๐Ÿ“ Command-Line Equivalentยถ

If you were to create a command-line script, here's what the usage would look like:

[10]:
# This shows how you might structure a command-line interface
def demonstrate_cli_usage():
    print("๐Ÿ’ป Command-Line Usage Examples:")
    print("=" * 50)

    examples = [
        {
            "description": "Basic CSV processing",
            "command": "python csv_processor.py data.csv results.csv --name-column 'full_name'"
        },
        {
            "description": "Process with Hindi names",
            "command": "python csv_processor.py data.csv results.csv --name-column 'employee_name' --language hin"
        },
        {
            "description": "Large file with custom chunk size",
            "command": "python csv_processor.py large_data.csv results.csv --name-column 'name' --chunk-size 5000"
        },
        {
            "description": "Create sample file for testing",
            "command": "python csv_processor.py --create-sample"
        }
    ]

    for i, example in enumerate(examples, 1):
        print(f"\n{i}. {example['description']}:")
        print(f"   {example['command']}")

    print("\n๐Ÿ“‹ Required Arguments:")
    print("   โ€ข input_file: Path to CSV file with names")
    print("   โ€ข output_file: Path for results CSV")
    print("   โ€ข --name-column: Column containing names")

    print("\nโš™๏ธ Optional Arguments:")
    print("   โ€ข --language: 'eng' or 'hin' (default: eng)")
    print("   โ€ข --chunk-size: Rows per chunk (default: 1000)")
    print("   โ€ข --create-sample: Generate test data")

demonstrate_cli_usage()
💻 Command-Line Usage Examples:
==================================================

1. Basic CSV processing:
   python csv_processor.py data.csv results.csv --name-column 'full_name'

2. Process with Hindi names:
   python csv_processor.py data.csv results.csv --name-column 'employee_name' --language hin

3. Large file with custom chunk size:
   python csv_processor.py large_data.csv results.csv --name-column 'name' --chunk-size 5000

4. Create sample file for testing:
   python csv_processor.py --create-sample

📋 Required Arguments:
   • input_file: Path to CSV file with names
   • output_file: Path for results CSV
   • --name-column: Column containing names

⚙️ Optional Arguments:
   • --language: 'eng' or 'hin' (default: eng)
   • --chunk-size: Rows per chunk (default: 1000)
   • --create-sample: Generate test data
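
For completeness, here is one way csv_processor.py itself might be wired up with argparse, reusing the functions defined above (the script and its flag names are hypothetical, chosen to match the examples printed here):

import argparse

def main():
    parser = argparse.ArgumentParser(description="Add pranaam religion predictions to a CSV file")
    parser.add_argument("input_file", nargs="?", help="Path to CSV file with names")
    parser.add_argument("output_file", nargs="?", help="Path for results CSV")
    parser.add_argument("--name-column", help="Column containing names")
    parser.add_argument("--language", choices=["eng", "hin"], default="eng")
    parser.add_argument("--chunk-size", type=int, default=1000)
    parser.add_argument("--create-sample", action="store_true", help="Generate test data")
    args = parser.parse_args()

    if args.create_sample:
        create_sample_csv()
    else:
        process_large_csv(
            input_file=args.input_file,
            output_file=args.output_file,
            name_column=args.name_column,
            language=args.language,
            chunk_size=args.chunk_size,
        )

if __name__ == "__main__":
    main()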

Key Takeaways

📝 CSV Processing: Pranaam integrates directly with pandas-based CSV workflows
🔍 Data Validation: Always validate input data and check for missing values
🚀 Chunk Processing: Handle large files by processing them in chunks
📊 Quality Metrics: Monitor confidence scores to assess prediction quality
🔗 Column Naming: Use consistent naming conventions for prediction columns
✅ Validation: Always validate results to ensure processing completed correctly

Best Practices for CSV Processing

  1. Validate Input Data

    • Check file exists and is readable

    • Verify required columns are present

    • Handle missing or empty names gracefully

  2. Memory Management

    • Use chunk processing for files > 100MB

    • Choose appropriate chunk sizes (1000-5000 rows)

    • Monitor memory usage during processing

  3. Error Handling

    • Wrap processing in try-except blocks

    • Log errors with sufficient detail

    • Provide clear error messages to users

  4. Output Quality

    • Use descriptive column names

    • Include confidence scores

    • Validate output completeness

    • Save processing metadata (see the sketch below)
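
As a sketch of that last point, a small JSON sidecar next to the output CSV can record how it was produced (the field names here are just one possible convention):

import json
from datetime import datetime, timezone

def save_processing_metadata(output_file, input_file, name_column, language, rows_in, rows_out):
    """Write a JSON sidecar recording how the output CSV was produced."""
    metadata = {
        "input_file": str(input_file),
        "output_file": str(output_file),
        "name_column": name_column,
        "language": language,
        "rows_in": rows_in,
        "rows_out": rows_out,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(f"{output_file}.meta.json", "w") as f:
        json.dump(metadata, f, indent=2)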

Next Steps