Batch Processing and Performance¶
This notebook demonstrates efficient batch processing techniques for large datasets and provides performance optimization tips.
Setup¶
Load libraries and create a larger sample dataset for demonstration.
[1]:
import pandas as pd
import ethnicolr
import time
from pathlib import Path
# Load sample data
data_path = Path('data/input-with-header.csv')
try:
    small_df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    small_df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")
print(f"Sample data shape: {small_df.shape}")
print("\nFirst few rows:")
print(small_df.head())
# Create a larger dataset for batch processing demonstration
# Replicate the small dataset multiple times
large_df = pd.concat([small_df] * 20, ignore_index=True)
print(f"\nLarge dataset shape: {large_df.shape}")
print("Ready for batch processing demonstrations.")
Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)
First few rows:
Large dataset shape: (1240, 2)
Ready for batch processing demonstrations.
Performance Comparison¶
Let’s compare the performance of different models on our dataset.
[2]:
def time_prediction(func, df, *args, **kwargs):
    """Helper function to time predictions"""
    start_time = time.time()
    result = func(df, *args, **kwargs)
    end_time = time.time()
    return result, end_time - start_time

# Test different models
models = {
    'census_lookup': (ethnicolr.census_ln, ['last_name'], {'year': 2010}),
    'census_lstm': (ethnicolr.pred_census_ln, ['last_name'], {}),
    'wiki_lastname': (ethnicolr.pred_wiki_ln, ['last_name'], {}),
    'florida_lstm': (ethnicolr.pred_fl_reg_ln, ['last_name'], {})
}

performance_results = []
for model_name, (func, args, kwargs) in models.items():
    print(f"\nTesting {model_name}...")
    result, duration = time_prediction(func, large_df, *args, **kwargs)
    perf_data = {
        'model': model_name,
        'duration': round(duration, 2),
        'rows_per_second': round(len(large_df) / duration, 0),
        'result_rows': result.shape[0],
        'result_cols': result.shape[1]
    }
    performance_results.append(perf_data)
    print(f"Duration: {duration:.2f} seconds")
    print(f"Speed: {len(large_df) / duration:.0f} rows/second")
# Performance summary
perf_df = pd.DataFrame(performance_results).set_index('model')
print("\nPerformance Summary:")
perf_df[['duration', 'rows_per_second', 'result_rows', 'result_cols']]
2025-12-27 22:21:44,232 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,233 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:44,235 - INFO - Loading Census 2010 data from /home/runner/work/ethnicolr/ethnicolr/ethnicolr/data/census/census_2010.csv...
2025-12-27 22:21:44,394 - INFO - Loaded 162253 last names from Census 2010
2025-12-27 22:21:44,395 - INFO - Merging demographic data for 1240 records...
Testing census_lookup...
2025-12-27 22:21:44,429 - INFO - Matched 1240 of 1240 rows (100.0%)
2025-12-27 22:21:44,430 - INFO - Added columns: pct2prace, pctaian, pctapi, pctblack, pcthispanic, pctwhite
2025-12-27 22:21:44,431 - INFO - Processing 1240 names using Census 2010 LSTM model
2025-12-27 22:21:44,432 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,432 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.20 seconds
Speed: 6244 rows/second
Testing census_lstm...
2025-12-27 22:21:44,745 - INFO - Predicted 1240 of 1240 rows (100.0%)
2025-12-27 22:21:44,746 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:44,747 - INFO - Processing 1240 last names
2025-12-27 22:21:44,750 - INFO - Applying Wikipedia last name model to 1240 processable names (confidence interval: 1.0)
2025-12-27 22:21:44,751 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,751 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.32 seconds
Speed: 3926 rows/second
Testing wiki_lastname...
2025-12-27 22:21:44,989 - INFO - Successfully predicted 1240 of 1240 names (100.0%)
2025-12-27 22:21:44,990 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterAfrican,Africans, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
2025-12-27 22:21:44,990 - INFO - Predicting race/ethnicity for 1240 rows using Florida LSTM model
2025-12-27 22:21:44,992 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,992 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.24 seconds
Speed: 5083 rows/second
Testing florida_lstm...
2025-12-27 22:21:45,226 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
Duration: 0.24 seconds
Speed: 5255 rows/second
Performance Summary:
[2]:
| model | duration | rows_per_second | result_rows | result_cols |
|---|---|---|---|---|
| census_lookup | 0.20 | 6244.0 | 1240 | 8 |
| census_lstm | 0.32 | 3926.0 | 1240 | 7 |
| wiki_lastname | 0.24 | 5083.0 | 1240 | 18 |
| florida_lstm | 0.24 | 5255.0 | 1240 | 7 |
Chunked Processing¶
For very large datasets, processing in chunks can be more memory efficient.
First, let’s define our chunked processing function:
[3]:
def process_in_chunks(df, func, chunk_size=1000, *args, **kwargs):
    """Process dataframe in chunks to manage memory usage"""
    results = []
    total_chunks = (len(df) - 1) // chunk_size + 1
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i + chunk_size]
        chunk_result = func(chunk, *args, **kwargs)
        results.append(chunk_result)
        if (i // chunk_size + 1) % 5 == 0:  # Progress every 5 chunks
            print(f"Processed {i // chunk_size + 1}/{total_chunks} chunks")
    return pd.concat(results, ignore_index=True)
[4]:
# Example: Process in chunks of 250 rows
print("Processing Florida model in chunks of 250...")
start_time = time.time()
chunked_result = process_in_chunks(
    large_df,
    ethnicolr.pred_fl_reg_ln,
    250,  # chunk_size as positional argument
    'last_name'  # positional argument for the prediction function
)
chunked_duration = time.time() - start_time
print(f"\nChunked processing completed in {chunked_duration:.2f} seconds")
print(f"Result shape: {chunked_result.shape}")
chunked_result[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head()
2025-12-27 22:21:45,246 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,247 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,248 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,377 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,378 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,379 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,379 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
Processing Florida model in chunks of 250...
2025-12-27 22:21:45,508 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,508 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,509 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,510 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,648 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,649 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,650 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,651 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,777 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,778 - INFO - Predicting race/ethnicity for 240 rows using Florida LSTM model
2025-12-27 22:21:45,779 - INFO - Preserving 178 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,780 - INFO - Data filtering summary: 240 -> 240 rows (kept 100.0%)
2025-12-27 22:21:45,906 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
Processed 5/5 chunks
Chunked processing completed in 0.66 seconds
Result shape: (1240, 7)
[4]:
| | last_name | race | asian | hispanic | nh_black | nh_white |
|---|---|---|---|---|---|---|
| 0 | Smith | nh_white | 0.004512 | 0.017937 | 0.251722 | 0.725829 |
| 1 | Garcia | hispanic | 0.006059 | 0.883960 | 0.010610 | 0.099372 |
| 2 | Johnson | nh_white | 0.003667 | 0.013745 | 0.424924 | 0.557664 |
| 3 | Davis | nh_white | 0.007555 | 0.011607 | 0.379582 | 0.601256 |
| 4 | Brown | nh_white | 0.003721 | 0.008477 | 0.474747 | 0.513055 |
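Chunking should not change the predictions, only how the work is scheduled. If you want to verify that on your own pipeline, rerun the same model directly on the full dataset and compare; a minimal check sketch (it assumes deterministic inference and doubles the work, so only run it while validating):

direct_result = ethnicolr.pred_fl_reg_ln(large_df, 'last_name')

prob_cols = ['asian', 'hispanic', 'nh_black', 'nh_white']
print("Shapes match:", direct_result.shape == chunked_result.shape)
# Differences should be at (or extremely close to) zero
max_diff = abs(direct_result[prob_cols].to_numpy()
               - chunked_result[prob_cols].to_numpy()).max()
print(f"Largest probability difference: {max_diff:.2e}")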
Handling Missing or Problematic Names¶
Real-world datasets often have missing values, special characters, or other data quality issues.
[5]:
# Create a dataset with some problematic entries
problematic_df = large_df.head(50).copy()
# Add some missing values and problematic names
problematic_df.loc[5, 'last_name'] = None
problematic_df.loc[10, 'last_name'] = ''
problematic_df.loc[15, 'last_name'] = 'O\'Connor' # Apostrophe
problematic_df.loc[20, 'last_name'] = 'García' # Accented character
problematic_df.loc[25, 'last_name'] = '123' # Numeric
problematic_df.loc[30, 'first_name'] = None
print("Sample problematic entries:")
print(problematic_df.iloc[[5, 10, 15, 20, 25, 30]][['first_name', 'last_name']])
# Process with Wikipedia model (handles problematic names better)
wiki_result = ethnicolr.pred_wiki_name(problematic_df, 'last_name', 'first_name')
print("\nProcessing results for problematic names:")
problem_indices = [5, 10, 15, 20, 25, 30]
display_cols = ['first_name', 'last_name', 'race', '__name', 'processing_status']
# Some columns might not exist, so filter to available ones
available_cols = [col for col in display_cols if col in wiki_result.columns]
print(wiki_result.iloc[problem_indices][available_cols])
2025-12-27 22:21:45,921 - INFO - Processing 50 names
2025-12-27 22:21:45,926 - INFO - Applying Wikipedia name model to 50 processable names (confidence interval: 1.0)
2025-12-27 22:21:45,927 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
2025-12-27 22:21:46,084 - INFO - Successfully predicted 50 of 50 names (100.0%)
2025-12-27 22:21:46,085 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, __name, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, name_normalized_clean, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterEuropean,WestEuropean,Nordic, GreaterAfrican,Africans, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
Sample problematic entries:
first_name last_name
5 Jennifer None
10 Robert
15 Jessica O'Connor
20 Anthony García
25 Elizabeth 123
30 None Baker
Processing results for problematic names:
first_name last_name race \
5 Jennifer None GreaterEuropean,British
10 Robert GreaterEuropean,British
15 Jessica O'Connor GreaterEuropean,WestEuropean,Hispanic
20 Anthony García GreaterEuropean,WestEuropean,Italian
25 Elizabeth 123 GreaterEuropean,British
30 None Baker GreaterEuropean,British
__name processing_status
5 Jennifer processed
10 Robert processed
15 O'Connor Jessica processed
20 García Anthony processed
25 123 Elizabeth processed
30 Baker processed
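Note that even clearly invalid inputs above (the missing last name, the empty string, '123') still receive a prediction, so it is usually better to screen such rows out before interpreting results. A minimal pre-filter sketch; the exact rules are an illustrative assumption, not part of ethnicolr:

# Keep only rows whose last name is present, non-empty, and contains at least one letter
name_ok = (
    problematic_df['last_name'].notna()
    & problematic_df['last_name'].astype(str).str.strip().ne('')
    & problematic_df['last_name'].astype(str).str.contains('[A-Za-z]', regex=True, na=False)
)
print(f"Keeping {name_ok.sum()} of {len(problematic_df)} rows with usable last names")

usable_pred = ethnicolr.pred_wiki_name(problematic_df[name_ok].copy(), 'last_name', 'first_name')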
Data Quality Analysis¶
Analyze the quality and coverage of predictions across your dataset.
[6]:
# Get predictions for quality analysis
census_pred = ethnicolr.pred_census_ln(large_df, 'last_name')
wiki_pred = ethnicolr.pred_wiki_ln(large_df, 'last_name')
# Calculate prediction confidence (use correct column names for each model)
# Census model columns: api, black, hispanic, white
census_pred['max_confidence'] = census_pred[['api', 'black', 'hispanic', 'white']].max(axis=1)
# Wikipedia model: find numeric probability columns only
numeric_cols = []
for col in wiki_pred.columns:
    if col not in ['race', '__name', 'last_name', 'processing_status']:
        try:
            # Check if column is numeric
            pd.to_numeric(wiki_pred[col], errors='raise')
            numeric_cols.append(col)
        except (ValueError, TypeError):
            continue

if numeric_cols:
    wiki_pred['max_confidence'] = wiki_pred[numeric_cols].max(axis=1)
else:
    wiki_pred['max_confidence'] = 0.5  # Default value if no numeric columns found
# Confidence distribution
print("Census Model Confidence Distribution:")
print(f"High confidence (>0.8): {(census_pred['max_confidence'] > 0.8).sum()} ({(census_pred['max_confidence'] > 0.8).mean()*100:.1f}%)")
print(f"Medium confidence (0.5-0.8): {((census_pred['max_confidence'] > 0.5) & (census_pred['max_confidence'] <= 0.8)).sum()} ({((census_pred['max_confidence'] > 0.5) & (census_pred['max_confidence'] <= 0.8)).mean()*100:.1f}%)")
print(f"Low confidence (<0.5): {(census_pred['max_confidence'] <= 0.5).sum()} ({(census_pred['max_confidence'] <= 0.5).mean()*100:.1f}%)")
print("\nWikipedia Model Confidence Distribution:")
print(f"Found {len(numeric_cols)} numeric probability columns")
print(f"High confidence (>0.8): {(wiki_pred['max_confidence'] > 0.8).sum()} ({(wiki_pred['max_confidence'] > 0.8).mean()*100:.1f}%)")
print(f"Medium confidence (0.5-0.8): {((wiki_pred['max_confidence'] > 0.5) & (wiki_pred['max_confidence'] <= 0.8)).sum()} ({((wiki_pred['max_confidence'] > 0.5) & (wiki_pred['max_confidence'] <= 0.8)).mean()*100:.1f}%)")
print(f"Low confidence (<0.5): {(wiki_pred['max_confidence'] <= 0.5).sum()} ({(wiki_pred['max_confidence'] <= 0.5).mean()*100:.1f}%)")
2025-12-27 22:21:46,095 - INFO - Processing 1240 names using Census 2010 LSTM model
2025-12-27 22:21:46,096 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:46,097 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:46,302 - INFO - Predicted 1240 of 1240 rows (100.0%)
2025-12-27 22:21:46,302 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:46,303 - INFO - Processing 1240 last names
2025-12-27 22:21:46,306 - INFO - Applying Wikipedia last name model to 1240 processable names (confidence interval: 1.0)
2025-12-27 22:21:46,307 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:46,308 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:46,512 - INFO - Successfully predicted 1240 of 1240 names (100.0%)
2025-12-27 22:21:46,513 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterAfrican,Africans, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
Census Model Confidence Distribution:
High confidence (>0.8): 280 (22.6%)
Medium confidence (0.5-0.8): 900 (72.6%)
Low confidence (<0.5): 60 (4.8%)
Wikipedia Model Confidence Distribution:
Found 13 numeric probability columns
High confidence (>0.8): 480 (38.7%)
Medium confidence (0.5-0.8): 640 (51.6%)
Low confidence (<0.5): 120 (9.7%)
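Once confidence scores are attached, the low-confidence rows are natural candidates for manual review. A minimal sketch that splits the census predictions computed above on an arbitrary threshold and writes the uncertain rows out for inspection (the threshold and file name are placeholder choices):

REVIEW_THRESHOLD = 0.6  # arbitrary cut-off; tune for your use case

needs_review = census_pred[census_pred['max_confidence'] < REVIEW_THRESHOLD]
confident = census_pred[census_pred['max_confidence'] >= REVIEW_THRESHOLD]
print(f"Confident: {len(confident)} rows; flagged for review: {len(needs_review)} rows")

# Export the uncertain rows for a human to check
needs_review[['last_name', 'race', 'max_confidence']].to_csv('needs_review.csv', index=False)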
Batch Processing Best Practices¶
Performance Tips:¶
Choose the right model: Census lookup is fastest; the ML models are slower but more accurate (a two-pass sketch follows this list)
Use chunking: For datasets >10,000 rows, process in chunks to manage memory
Clean data first: Remove/handle missing values before processing
Monitor confidence: Low confidence predictions may need manual review
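One way to apply the first tip is a two-pass approach: run the fast census lookup first and send only the names it cannot match to the slower LSTM. The sketch below assumes that names absent from the census table come back with NaN in the pct* columns of census_ln output; verify that on a small sample before relying on it:

# Two-pass sketch: cheap lookup first, LSTM fallback only for unmatched names
looked_up = ethnicolr.census_ln(large_df, 'last_name', year=2010)

# Assumption: unmatched names have NaN in the pct* columns
unmatched = looked_up['pctwhite'].isna()
print(f"Lookup matched {(~unmatched).sum()} of {len(looked_up)} rows")

if unmatched.any():
    # Fall back to the LSTM only for last names the lookup could not resolve
    missing_names = looked_up.loc[unmatched, 'last_name'].unique()
    fallback_df = large_df[large_df['last_name'].isin(missing_names)].copy()
    lstm_fallback = ethnicolr.pred_census_ln(fallback_df, 'last_name')
    print(f"LSTM fallback covered {len(lstm_fallback)} rows")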
Memory Management:¶
Process in chunks of 500-2000 rows for large datasets (a file-level streaming sketch follows this list)
Use only the columns you need in your input DataFrame
Clear intermediate results when not needed
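For inputs that do not fit comfortably in memory, the same chunking idea can be pushed down to the file level: read the CSV in chunks, predict, and append each chunk's output to disk so only one chunk is ever held in memory. A minimal sketch; the file names are placeholders:

# Stream a large CSV through the model chunk by chunk
reader = pd.read_csv('big_input.csv', usecols=['last_name'], chunksize=1000)

for i, chunk in enumerate(reader):
    chunk_pred = ethnicolr.pred_fl_reg_ln(chunk, 'last_name')
    # Write the header only once, then append subsequent chunks
    chunk_pred.to_csv('predictions.csv',
                      mode='w' if i == 0 else 'a',
                      header=(i == 0),
                      index=False)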
Error Handling:¶
Check for missing values in name columns
Handle special characters and accents (a validation sketch follows this list)
Validate results and flag low-confidence predictions
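A small validation pass before prediction covers the first two points. The helper below is a hypothetical example, not part of ethnicolr: it flags missing or empty names and, optionally, strips accents with the standard-library unicodedata module (whether stripping accents helps depends on the model, so treat it as an option rather than a requirement):

import unicodedata

def validate_names(df, name_col, strip_accents=True):
    """Hypothetical helper: flag missing/empty names and optionally normalize accents."""
    out = df.copy()
    out[name_col] = out[name_col].fillna('').astype(str).str.strip()
    out['name_missing'] = out[name_col] == ''
    if strip_accents:
        # e.g. 'García' -> 'Garcia'
        out[name_col] = out[name_col].map(
            lambda s: unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
        )
    return out

checked = validate_names(problematic_df, 'last_name')
print(f"Rows with missing/empty last names: {checked['name_missing'].sum()}")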
[7]:
# Example production-ready batch processing function
def robust_batch_predict(df, name_col, model='census', chunk_size=1000, min_confidence=0.5):
    """
    Robust batch prediction with error handling and quality filtering
    """
    # Data validation
    if name_col not in df.columns:
        raise ValueError(f"Column '{name_col}' not found in DataFrame")

    # Clean data
    clean_df = df.copy()
    clean_df[name_col] = clean_df[name_col].fillna('').astype(str)

    # Choose prediction function and define probability columns
    if model == 'census':
        pred_func = ethnicolr.pred_census_ln
        prob_cols = ['api', 'black', 'hispanic', 'white']
    elif model == 'wiki':
        pred_func = ethnicolr.pred_wiki_ln
        prob_cols = None  # Will be determined dynamically after prediction
    elif model == 'florida':
        pred_func = ethnicolr.pred_fl_reg_ln
        prob_cols = ['asian', 'hispanic', 'nh_black', 'nh_white']
    else:
        raise ValueError(f"Unknown model: {model}")

    # Process in chunks
    result = process_in_chunks(clean_df, pred_func, chunk_size, name_col)

    # Calculate confidence based on model type
    if model == 'wiki':
        # For Wikipedia model, find numeric probability columns dynamically
        prob_cols = []
        for col in result.columns:
            if col not in ['race', '__name', name_col, 'processing_status']:
                try:
                    pd.to_numeric(result[col], errors='raise')
                    prob_cols.append(col)
                except (ValueError, TypeError):
                    continue

    if prob_cols:
        result['max_confidence'] = result[prob_cols].max(axis=1)
        result['high_confidence'] = result['max_confidence'] >= min_confidence
    else:
        result['max_confidence'] = 0.5  # Default confidence
        result['high_confidence'] = False

    return result
# Example usage
print("Running robust batch prediction...")
robust_result = robust_batch_predict(
    large_df.head(100),
    'last_name',
    model='census',
    chunk_size=50,
    min_confidence=0.6
)
print(f"\nProcessed {len(robust_result)} names")
print(f"High confidence predictions: {robust_result['high_confidence'].sum()} ({robust_result['high_confidence'].mean()*100:.1f}%)")
print("\nSample results:")
robust_result[['last_name', 'race', 'max_confidence', 'high_confidence']].head(10)
2025-12-27 22:21:46,528 - INFO - Processing 50 names using Census 2010 LSTM model
2025-12-27 22:21:46,529 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
2025-12-27 22:21:46,633 - INFO - Predicted 50 of 50 rows (100.0%)
2025-12-27 22:21:46,634 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:46,635 - INFO - Processing 50 names using Census 2010 LSTM model
2025-12-27 22:21:46,636 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
Running robust batch prediction...
2025-12-27 22:21:46,743 - INFO - Predicted 50 of 50 rows (100.0%)
2025-12-27 22:21:46,744 - INFO - Added columns: hispanic, race, api, black, white
Processed 100 names
High confidence predictions: 81 (81.0%)
Sample results:
[7]:
| | last_name | race | max_confidence | high_confidence |
|---|---|---|---|---|
| 0 | Smith | white | 0.747435 | True |
| 1 | Garcia | hispanic | 0.942291 | True |
| 2 | Johnson | white | 0.608818 | True |
| 3 | Davis | white | 0.618956 | True |
| 4 | Brown | white | 0.571941 | False |
| 5 | Wilson | white | 0.700449 | True |
| 6 | Martinez | hispanic | 0.942765 | True |
| 7 | Anderson | white | 0.775244 | True |
| 8 | Taylor | white | 0.684406 | True |
| 9 | Rodriguez | hispanic | 0.942559 | True |