CSV Processing with Pranaam
This notebook demonstrates how to process CSV files containing names and add religion predictions. This is useful for:

• Processing employee databases
• Analyzing customer lists
• Research datasets
• Survey responses

We'll cover:

• Creating sample CSV data
• Reading and validating CSV files
• Processing names with error handling
• Saving enriched results
• Batch processing strategies
[1]:
from pathlib import Path
import pandas as pd
import pranaam
print(f"Pandas version: {pd.__version__}")
print(f"Pranaam version: {pranaam.__version__ if hasattr(pranaam, '__version__') else 'latest'}")
2026-01-21 19:31:42.880388: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2026-01-21 19:31:42.924629: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-21 19:31:44.317969: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
Pandas version: 2.3.3
Pranaam version: 0.0.2
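TensorFlow prints the CUDA and CPU diagnostics shown above when the model backend loads. If you want quieter sessions, one common approach (this uses TensorFlow's standard TF_CPP_MIN_LOG_LEVEL environment variable, not a pranaam feature) is to raise the log threshold at the very top of a fresh session, before anything imports TensorFlow:

import os

# Hide TensorFlow INFO/WARNING messages; must be set before TensorFlow
# (and therefore before pranaam) is imported for the first time.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import pranaam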
Creating Sample CSV Data
Let's start by creating a sample CSV file to work with:
[2]:
def create_sample_csv(filename="sample_names.csv"):
    """Create a sample CSV file for testing."""
    sample_data = pd.DataFrame({
        "id": [1, 2, 3, 4, 5, 6, 7, 8],
        "full_name": [
            "Shah Rukh Khan",
            "Priya Sharma",
            "Mohammed Ali",
            "Raj Patel",
            "Fatima Khan",
            "John Smith",
            "Deepika Padukone",
            "Abdul Rahman"
        ],
        "department": [
            "Engineering", "Marketing", "Finance", "HR",
            "Sales", "IT", "Design", "Operations"
        ],
        "city": [
            "Mumbai", "Delhi", "Bangalore", "Chennai",
            "Pune", "Hyderabad", "Kolkata", "Ahmedabad"
        ],
        "salary": [75000, 65000, 70000, 60000, 80000, 85000, 72000, 68000]
    })
    sample_data.to_csv(filename, index=False)
    print(f"Created sample file: {filename}")
    return sample_data

# Create our sample data
sample_df = create_sample_csv()
print("\nSample data:")
print(sample_df)
Created sample file: sample_names.csv
Sample data:
id full_name department city salary
0 1 Shah Rukh Khan Engineering Mumbai 75000
1 2 Priya Sharma Marketing Delhi 65000
2 3 Mohammed Ali Finance Bangalore 70000
3 4 Raj Patel HR Chennai 60000
4 5 Fatima Khan Sales Pune 80000
5 6 John Smith IT Hyderabad 85000
6 7 Deepika Padukone Design Kolkata 72000
7 8 Abdul Rahman Operations Ahmedabad 68000
CSV Processing Function
Let's create a comprehensive function to process CSV files with names:
[3]:
def process_csv_with_pranaam(input_file, output_file, name_column, language="eng"):
    """Process a CSV file and add religion predictions.

    Args:
        input_file: Path to the input CSV file
        output_file: Path for the output CSV file
        name_column: Name of the column containing names
        language: Language code ('eng' or 'hin')
    """
    # Validate input file
    if not Path(input_file).exists():
        print(f"Error: Input file '{input_file}' not found")
        return False

    try:
        # Read CSV
        print(f"Reading {input_file}...")
        df = pd.read_csv(input_file)
        print(f"  Found {len(df)} rows, {len(df.columns)} columns")

        # Validate name column
        if name_column not in df.columns:
            print(f"Error: Column '{name_column}' not found in CSV")
            print(f"  Available columns: {list(df.columns)}")
            return False

        # Data quality checks
        print("\nData Quality Analysis:")
        total_rows = len(df)
        missing_names = df[name_column].isna().sum()
        empty_names = (df[name_column].str.strip() == "").sum() if not df[name_column].isna().all() else 0
        print(f"  Total rows: {total_rows}")
        print(f"  Missing names: {missing_names}")
        print(f"  Empty names: {empty_names}")

        # Clean data
        if missing_names > 0 or empty_names > 0:
            print(f"  Removing {missing_names + empty_names} invalid rows...")
            df_clean = df.dropna(subset=[name_column])
            df_clean = df_clean[df_clean[name_column].str.strip() != ""]
        else:
            df_clean = df.copy()

        valid_rows = len(df_clean)
        print(f"  Valid rows for processing: {valid_rows}")
        if valid_rows == 0:
            print("No valid names to process!")
            return False

        # Get predictions
        print(f"\nGetting predictions for {valid_rows} names (language: {language})...")
        predictions = pranaam.pred_rel(df_clean[name_column], lang=language)

        # Rename prediction columns to avoid conflicts
        predictions = predictions.rename(columns={
            "name": name_column,
            "pred_label": f"{name_column}_religion",
            "pred_prob_muslim": f"{name_column}_confidence_muslim",
        })

        # Merge predictions back
        df_with_predictions = df_clean.merge(predictions, on=name_column, how="left")

        # Add an overall confidence score: the probability of the predicted label,
        # i.e. the larger of P(muslim) and P(not-muslim)
        conf_col = f"{name_column}_confidence_muslim"
        df_with_predictions[f"{name_column}_confidence"] = df_with_predictions[conf_col].apply(
            lambda x: max(x, 100 - x)
        )

        # Save results
        print(f"Saving results to {output_file}...")
        df_with_predictions.to_csv(output_file, index=False)

        # Generate summary
        print("\nProcessing Summary:")
        print(f"  Input rows: {total_rows}")
        print(f"  Valid names processed: {valid_rows}")
        print(f"  Output rows: {len(df_with_predictions)}")

        # Religion distribution
        religion_counts = df_with_predictions[f"{name_column}_religion"].value_counts()
        print(f"  Religion predictions: {dict(religion_counts)}")

        # Confidence analysis
        high_conf_count = (df_with_predictions[f"{name_column}_confidence"] > 90).sum()
        medium_conf_count = (
            (df_with_predictions[f"{name_column}_confidence"] >= 70) &
            (df_with_predictions[f"{name_column}_confidence"] <= 90)
        ).sum()
        low_conf_count = (df_with_predictions[f"{name_column}_confidence"] < 70).sum()
        print("  Confidence distribution:")
        print(f"    High (>90%): {high_conf_count} predictions")
        print(f"    Medium (70-90%): {medium_conf_count} predictions")
        print(f"    Low (<70%): {low_conf_count} predictions")

        print(f"\nSuccessfully processed {input_file} → {output_file}")
        return True

    except Exception as e:
        print(f"Error processing file: {e}")
        return False
Processing Our Sample Data
Now let's process our sample CSV file:
[4]:
# Process the sample CSV
input_file = "sample_names.csv"
output_file = "sample_names_with_predictions.csv"
success = process_csv_with_pranaam(
    input_file=input_file,
    output_file=output_file,
    name_column="full_name",
    language="eng"
)
Reading sample_names.csv...
  Found 8 rows, 5 columns

Data Quality Analysis:
  Total rows: 8
  Missing names: 0
  Empty names: 0
  Valid rows for processing: 8

Getting predictions for 8 names (language: eng)...
[01/21/26 19:31:44] INFO pranaam - Loading eng model from /home/runner/work/pranaam/pranaam/pranaam/model/eng_and_hindi_models_v2/eng_model.keras
                    INFO pranaam - Loading eng model with tf-keras compatibility layer
2026-01-21 19:31:44.845097: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
Saving results to sample_names_with_predictions.csv...

Processing Summary:
  Input rows: 8
  Valid names processed: 8
  Output rows: 8
  Religion predictions: {'muslim': 4, 'not-muslim': 4}
  Confidence distribution:
    High (>90%): 0 predictions
    Medium (70-90%): 5 predictions
    Low (<70%): 3 predictions

Successfully processed sample_names.csv → sample_names_with_predictions.csv
Examining the Results
Let's load and examine the processed results:
[5]:
if success:
    # Load the processed results
    results_df = pd.read_csv(output_file)
    print("Processed Results:")
    print(results_df)

    print("\nNew Columns Added:")
    new_columns = [col for col in results_df.columns if 'full_name' in col and col != 'full_name']
    for col in new_columns:
        print(f"  • {col}")
Processed Results:
id full_name department city salary full_name_religion \
0 1 Shah Rukh Khan Engineering Mumbai 75000 muslim
1 2 Priya Sharma Marketing Delhi 65000 not-muslim
2 3 Mohammed Ali Finance Bangalore 70000 muslim
3 4 Raj Patel HR Chennai 60000 not-muslim
4 5 Fatima Khan Sales Pune 80000 muslim
5 6 John Smith IT Hyderabad 85000 not-muslim
6 7 Deepika Padukone Design Kolkata 72000 not-muslim
7 8 Abdul Rahman Operations Ahmedabad 68000 muslim
full_name_confidence_muslim full_name_confidence
0 71.0 71.0
1 27.0 73.0
2 73.0 73.0
3 35.0 65.0
4 73.0 73.0
5 37.0 63.0
6 32.0 68.0
7 73.0 73.0
New Columns Added:
  • full_name_religion
  • full_name_confidence_muslim
  • full_name_confidence
[6]:
# Detailed analysis of predictions
if success:
    print("Detailed Prediction Analysis:")
    print("=" * 70)
    print(f"{'Name':<20} | {'Religion':<10} | {'Muslim %':<8} | {'Confidence':<10}")
    print("-" * 70)
    for _, row in results_df.iterrows():
        name = row['full_name']
        religion = row['full_name_religion']
        muslim_prob = row['full_name_confidence_muslim']
        confidence = row['full_name_confidence']
        print(f"{name:<20} | {religion:<10} | {muslim_prob:>6.1f}% | {confidence:>8.1f}%")
Detailed Prediction Analysis:
======================================================================
Name | Religion | Muslim % | Confidence
----------------------------------------------------------------------
Shah Rukh Khan | muslim | 71.0% | 71.0%
Priya Sharma | not-muslim | 27.0% | 73.0%
Mohammed Ali | muslim | 73.0% | 73.0%
Raj Patel | not-muslim | 35.0% | 65.0%
Fatima Khan | muslim | 73.0% | 73.0%
John Smith | not-muslim | 37.0% | 63.0%
Deepika Padukone | not-muslim | 32.0% | 68.0%
Abdul Rahman | muslim | 73.0% | 73.0%
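The probabilities in this sample cluster between roughly 63% and 73%, so no prediction here is a sure thing. One practical follow-up, sketched below assuming results_df and the full_name_confidence column from the processing step are available, is to split rows by a confidence threshold and route the uncertain ones to manual review (the threshold and output filename are illustrative):

# Flag low-confidence predictions for manual review (threshold is illustrative)
REVIEW_THRESHOLD = 70.0

needs_review = results_df[results_df["full_name_confidence"] < REVIEW_THRESHOLD]
confident = results_df[results_df["full_name_confidence"] >= REVIEW_THRESHOLD]

print(f"{len(confident)} confident rows, {len(needs_review)} flagged for review")
needs_review.to_csv("predictions_for_review.csv", index=False)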
Large File Processing Strategy
For large CSV files, we need to process data in chunks to avoid memory issues:
[7]:
def process_large_csv(input_file, output_file, name_column, language="eng", chunk_size=1000):
    """Process large CSV files in chunks to manage memory usage."""
    print(f"Processing large CSV file: {input_file}")
    print(f"  Chunk size: {chunk_size} rows")

    # Get total row count first (subtract 1 for the header)
    with open(input_file) as f:
        total_rows = sum(1 for _ in f) - 1
    print(f"  Total rows: {total_rows:,}")

    processed_chunks = []
    chunk_num = 0
    try:
        # Process file in chunks
        for chunk_df in pd.read_csv(input_file, chunksize=chunk_size):
            chunk_num += 1
            print(f"\nProcessing chunk {chunk_num} ({len(chunk_df)} rows)...")

            # Clean chunk
            clean_chunk = chunk_df.dropna(subset=[name_column])
            clean_chunk = clean_chunk[clean_chunk[name_column].str.strip() != ""]
            if len(clean_chunk) == 0:
                print("  No valid names in this chunk, skipping...")
                continue

            # Get predictions for chunk
            predictions = pranaam.pred_rel(clean_chunk[name_column], lang=language)

            # Rename columns
            predictions = predictions.rename(columns={
                "name": name_column,
                "pred_label": f"{name_column}_religion",
                "pred_prob_muslim": f"{name_column}_confidence_muslim",
            })

            # Merge predictions
            chunk_with_predictions = clean_chunk.merge(predictions, on=name_column, how="left")

            # Add confidence score
            conf_col = f"{name_column}_confidence_muslim"
            chunk_with_predictions[f"{name_column}_confidence"] = chunk_with_predictions[conf_col].apply(
                lambda x: max(x, 100 - x)
            )

            processed_chunks.append(chunk_with_predictions)
            print(f"  Processed {len(chunk_with_predictions)} names")

        # Guard against an input with no valid names at all
        if not processed_chunks:
            print("No valid names found in any chunk!")
            return False

        # Combine all chunks
        print(f"\nCombining {len(processed_chunks)} chunks...")
        final_df = pd.concat(processed_chunks, ignore_index=True)

        # Save results
        print(f"Saving {len(final_df)} rows to {output_file}...")
        final_df.to_csv(output_file, index=False)

        print("\nLarge file processing completed!")
        return True

    except Exception as e:
        print(f"Error processing large file: {e}")
        return False

# Demonstrate with our sample (simulating large file processing)
print("Demonstrating large file processing strategy:")
large_file_success = process_large_csv(
    input_file="sample_names.csv",
    output_file="sample_large_processed.csv",
    name_column="full_name",
    chunk_size=3  # Small chunk size for demo
)
Demonstrating large file processing strategy:
Processing large CSV file: sample_names.csv
  Chunk size: 3 rows
  Total rows: 8

Processing chunk 1 (3 rows)...
  Processed 3 names

Processing chunk 2 (3 rows)...
  Processed 3 names

Processing chunk 3 (2 rows)...
  Processed 2 names

Combining 3 chunks...
Saving 8 rows to sample_large_processed.csv...

Large file processing completed!
Validation and Quality Checks
Let's create validation functions to ensure our processing worked correctly:
[8]:
def validate_processed_csv(original_file, processed_file, name_column):
    """Validate that the processed CSV is correct."""
    print("Validation Report:")
    print("=" * 40)

    # Load both files
    original_df = pd.read_csv(original_file)
    processed_df = pd.read_csv(processed_file)

    # Basic checks
    print(f"Original file rows: {len(original_df)}")
    print(f"Processed file rows: {len(processed_df)}")
    print(f"Rows preserved: {len(processed_df) / len(original_df) * 100:.1f}%")

    # Check for new columns
    original_cols = set(original_df.columns)
    processed_cols = set(processed_df.columns)
    new_cols = processed_cols - original_cols
    print(f"\nNew columns added: {len(new_cols)}")
    for col in sorted(new_cols):
        print(f"  • {col}")

    # Check prediction completeness
    religion_col = f"{name_column}_religion"
    if religion_col in processed_df.columns:
        null_predictions = processed_df[religion_col].isna().sum()
        print("\nPrediction completeness:")
        print(f"  Names with predictions: {len(processed_df) - null_predictions}")
        print(f"  Names without predictions: {null_predictions}")
        if null_predictions == 0:
            print("  All names have predictions")
        else:
            print(f"  {null_predictions} names missing predictions")

    # Confidence distribution
    conf_col = f"{name_column}_confidence"
    if conf_col in processed_df.columns:
        high_conf = (processed_df[conf_col] > 90).sum()
        medium_conf = ((processed_df[conf_col] >= 70) & (processed_df[conf_col] <= 90)).sum()
        low_conf = (processed_df[conf_col] < 70).sum()
        print("\nConfidence distribution:")
        print(f"  High confidence (>90%): {high_conf} ({high_conf/len(processed_df)*100:.1f}%)")
        print(f"  Medium confidence (70-90%): {medium_conf} ({medium_conf/len(processed_df)*100:.1f}%)")
        print(f"  Low confidence (<70%): {low_conf} ({low_conf/len(processed_df)*100:.1f}%)")

    print("\nValidation complete!")

# Validate our processed files
if success:
    validate_processed_csv("sample_names.csv", "sample_names_with_predictions.csv", "full_name")
Validation Report:
========================================
Original file rows: 8
Processed file rows: 8
Rows preserved: 100.0%

New columns added: 3
  • full_name_confidence
  • full_name_confidence_muslim
  • full_name_religion

Prediction completeness:
  Names with predictions: 8
  Names without predictions: 0
  All names have predictions

Confidence distribution:
  High confidence (>90%): 0 (0.0%)
  Medium confidence (70-90%): 5 (62.5%)
  Low confidence (<70%): 3 (37.5%)

Validation complete!
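For automated pipelines, the same checks work better as hard assertions that fail loudly instead of printing a report. A minimal sketch, reusing the column-naming convention from above (the helper name is hypothetical):

# Hypothetical pipeline guard: raise instead of printing, so a scheduled job fails fast
def assert_processing_ok(original_file, processed_file, name_column):
    original_df = pd.read_csv(original_file)
    processed_df = pd.read_csv(processed_file)
    religion_col = f"{name_column}_religion"
    assert len(processed_df) == len(original_df), "row count changed during processing"
    assert religion_col in processed_df.columns, f"missing column: {religion_col}"
    assert processed_df[religion_col].notna().all(), "some names lack predictions"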
Cleanup
Let's clean up the demo files:
[9]:
import os

# Clean up demo files
demo_files = [
    "sample_names.csv",
    "sample_names_with_predictions.csv",
    "sample_large_processed.csv"
]

print("Cleaning up demo files:")
for file in demo_files:
    if os.path.exists(file):
        os.remove(file)
        print(f"  Removed {file}")
    else:
        print(f"  {file} not found")

print("\nDemo cleanup complete!")
Cleaning up demo files:
  Removed sample_names.csv
  Removed sample_names_with_predictions.csv
  Removed sample_large_processed.csv

Demo cleanup complete!
Command-Line Equivalent
If you were to create a command-line script, here's what the usage would look like:
[10]:
# This shows how you might structure a command-line interface
def demonstrate_cli_usage():
    print("Command-Line Usage Examples:")
    print("=" * 50)

    examples = [
        {
            "description": "Basic CSV processing",
            "command": "python csv_processor.py data.csv results.csv --name-column 'full_name'"
        },
        {
            "description": "Process with Hindi names",
            "command": "python csv_processor.py data.csv results.csv --name-column 'employee_name' --language hin"
        },
        {
            "description": "Large file with custom chunk size",
            "command": "python csv_processor.py large_data.csv results.csv --name-column 'name' --chunk-size 5000"
        },
        {
            "description": "Create sample file for testing",
            "command": "python csv_processor.py --create-sample"
        }
    ]

    for i, example in enumerate(examples, 1):
        print(f"\n{i}. {example['description']}:")
        print(f"   {example['command']}")

    print("\nRequired Arguments:")
    print("  • input_file: Path to CSV file with names")
    print("  • output_file: Path for results CSV")
    print("  • --name-column: Column containing names")
    print("\nOptional Arguments:")
    print("  • --language: 'eng' or 'hin' (default: eng)")
    print("  • --chunk-size: Rows per chunk (default: 1000)")
    print("  • --create-sample: Generate test data")

demonstrate_cli_usage()
Command-Line Usage Examples:
==================================================

1. Basic CSV processing:
   python csv_processor.py data.csv results.csv --name-column 'full_name'

2. Process with Hindi names:
   python csv_processor.py data.csv results.csv --name-column 'employee_name' --language hin

3. Large file with custom chunk size:
   python csv_processor.py large_data.csv results.csv --name-column 'name' --chunk-size 5000

4. Create sample file for testing:
   python csv_processor.py --create-sample

Required Arguments:
  • input_file: Path to CSV file with names
  • output_file: Path for results CSV
  • --name-column: Column containing names

Optional Arguments:
  • --language: 'eng' or 'hin' (default: eng)
  • --chunk-size: Rows per chunk (default: 1000)
  • --create-sample: Generate test data
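The notebook doesn't ship a csv_processor.py; the script name and flags above are illustrative. As a sketch of how that interface could be wired up with argparse (assuming create_sample_csv and process_large_csv from this notebook are defined in, or imported into, the script):

import argparse

def main():
    parser = argparse.ArgumentParser(description="Add pranaam religion predictions to a CSV file.")
    parser.add_argument("input_file", nargs="?", help="Path to CSV file with names")
    parser.add_argument("output_file", nargs="?", help="Path for results CSV")
    parser.add_argument("--name-column", help="Column containing names")
    parser.add_argument("--language", choices=["eng", "hin"], default="eng")
    parser.add_argument("--chunk-size", type=int, default=1000)
    parser.add_argument("--create-sample", action="store_true", help="Generate test data")
    args = parser.parse_args()

    if args.create_sample:
        create_sample_csv()
        return
    if not (args.input_file and args.output_file and args.name_column):
        parser.error("input_file, output_file, and --name-column are required")
    process_large_csv(args.input_file, args.output_file, args.name_column,
                      language=args.language, chunk_size=args.chunk_size)

if __name__ == "__main__":
    main()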
Key Takeaways
Best Practices for CSV Processing
1. Validate Input Data
   • Check that the file exists and is readable
   • Verify required columns are present
   • Handle missing or empty names gracefully
2. Memory Management (see the streaming sketch after this list)
   • Use chunk processing for files larger than ~100 MB
   • Choose appropriate chunk sizes (1,000-5,000 rows)
   • Monitor memory usage during processing
3. Error Handling
   • Wrap processing in try-except blocks
   • Log errors with sufficient detail
   • Provide clear error messages to users
4. Output Quality
   • Use descriptive column names
   • Include confidence scores
   • Validate output completeness
   • Save processing metadata
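Note that process_large_csv above still holds every processed chunk in memory and concatenates them at the end, so its peak memory use grows with file size. For truly large files, a variant that appends each chunk straight to the output CSV keeps memory flat. A minimal sketch, reusing the prediction and column-naming logic from this notebook:

# Stream each processed chunk to disk instead of accumulating chunks in memory
def process_csv_streaming(input_file, output_file, name_column, language="eng", chunk_size=1000):
    first_chunk = True
    for chunk_df in pd.read_csv(input_file, chunksize=chunk_size):
        chunk_df = chunk_df.dropna(subset=[name_column])
        chunk_df = chunk_df[chunk_df[name_column].str.strip() != ""]
        if chunk_df.empty:
            continue
        predictions = pranaam.pred_rel(chunk_df[name_column], lang=language).rename(columns={
            "name": name_column,
            "pred_label": f"{name_column}_religion",
            "pred_prob_muslim": f"{name_column}_confidence_muslim",
        })
        merged = chunk_df.merge(predictions, on=name_column, how="left")
        conf_col = f"{name_column}_confidence_muslim"
        merged[f"{name_column}_confidence"] = merged[conf_col].apply(lambda x: max(x, 100 - x))
        # Write the header only for the first chunk, then append
        merged.to_csv(output_file, mode="w" if first_chunk else "a", header=first_chunk, index=False)
        first_chunk = False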
Next Steps
• Performance Benchmarks: Optimize for large-scale processing
• Pandas Integration: Advanced DataFrame operations
• Basic Usage: Review fundamental concepts