Batch Processing and Performance

This notebook demonstrates efficient batch processing techniques for large datasets and provides performance optimization tips.

Setup

Load libraries and create a larger sample dataset for demonstration.

[1]:
import pandas as pd
import ethnicolr
import time
from pathlib import Path

# Load sample data
data_path = Path('data/input-with-header.csv')

try:
    small_df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    small_df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")

print(f"Sample data shape: {small_df.shape}")
print("\nFirst few rows:")
print(small_df.head())

# Create a larger dataset for batch processing demonstration
# Replicate the small dataset multiple times
large_df = pd.concat([small_df] * 20, ignore_index=True)
print(f"\nLarge dataset shape: {large_df.shape}")
print("Ready for batch processing demonstrations.")
Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)

First few rows:

Large dataset shape: (1240, 2)
Ready for batch processing demonstrations.

Performance Comparison

Let’s compare the performance of different models on our dataset.

[2]:
def time_prediction(func, df, *args, **kwargs):
    """Helper function to time predictions"""
    start_time = time.time()
    result = func(df, *args, **kwargs)
    end_time = time.time()
    return result, end_time - start_time

# Test different models
models = {
    'census_lookup': (ethnicolr.census_ln, ['last_name'], {'year': 2010}),
    'census_lstm': (ethnicolr.pred_census_ln, ['last_name'], {}),
    'wiki_lastname': (ethnicolr.pred_wiki_ln, ['last_name'], {}),
    'florida_lstm': (ethnicolr.pred_fl_reg_ln, ['last_name'], {})
}

performance_results = []

for model_name, (func, args, kwargs) in models.items():
    print(f"\nTesting {model_name}...")
    result, duration = time_prediction(func, large_df, *args, **kwargs)

    perf_data = {
        'model': model_name,
        'duration': round(duration, 2),
        'rows_per_second': round(len(large_df) / duration, 0),
        'result_rows': result.shape[0],
        'result_cols': result.shape[1]
    }
    performance_results.append(perf_data)

    print(f"Duration: {duration:.2f} seconds")
    print(f"Speed: {len(large_df) / duration:.0f} rows/second")

# Performance summary
perf_df = pd.DataFrame(performance_results).set_index('model')
print("\nPerformance Summary:")
perf_df[['duration', 'rows_per_second', 'result_rows', 'result_cols']]
2025-12-27 22:21:44,232 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,233 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:44,235 - INFO - Loading Census 2010 data from /home/runner/work/ethnicolr/ethnicolr/ethnicolr/data/census/census_2010.csv...
2025-12-27 22:21:44,394 - INFO - Loaded 162253 last names from Census 2010
2025-12-27 22:21:44,395 - INFO - Merging demographic data for 1240 records...

Testing census_lookup...
2025-12-27 22:21:44,429 - INFO - Matched 1240 of 1240 rows (100.0%)
2025-12-27 22:21:44,430 - INFO - Added columns: pct2prace, pctaian, pctapi, pctblack, pcthispanic, pctwhite
2025-12-27 22:21:44,431 - INFO - Processing 1240 names using Census 2010 LSTM model
2025-12-27 22:21:44,432 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,432 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.20 seconds
Speed: 6244 rows/second

Testing census_lstm...
2025-12-27 22:21:44,745 - INFO - Predicted 1240 of 1240 rows (100.0%)
2025-12-27 22:21:44,746 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:44,747 - INFO - Processing 1240 last names
2025-12-27 22:21:44,750 - INFO - Applying Wikipedia last name model to 1240 processable names (confidence interval: 1.0)
2025-12-27 22:21:44,751 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,751 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.32 seconds
Speed: 3926 rows/second

Testing wiki_lastname...
2025-12-27 22:21:44,989 - INFO - Successfully predicted 1240 of 1240 names (100.0%)
2025-12-27 22:21:44,990 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterAfrican,Africans, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
2025-12-27 22:21:44,990 - INFO - Predicting race/ethnicity for 1240 rows using Florida LSTM model
2025-12-27 22:21:44,992 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,992 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.24 seconds
Speed: 5083 rows/second

Testing florida_lstm...
2025-12-27 22:21:45,226 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
Duration: 0.24 seconds
Speed: 5255 rows/second

Performance Summary:
[2]:
               duration  rows_per_second  result_rows  result_cols
model
census_lookup      0.20           6244.0         1240            8
census_lstm        0.32           3926.0         1240            7
wiki_lastname      0.24           5083.0         1240           18
florida_lstm       0.24           5255.0         1240            7

Chunked Processing

For very large datasets, processing in chunks can be more memory efficient.

First, let’s define our chunked processing function:

[3]:
def process_in_chunks(df, func, chunk_size=1000, *args, **kwargs):
    """Process dataframe in chunks to manage memory usage"""
    results = []
    total_chunks = (len(df) - 1) // chunk_size + 1

    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i + chunk_size]
        chunk_result = func(chunk, *args, **kwargs)
        results.append(chunk_result)

        if (i // chunk_size + 1) % 5 == 0:  # Progress every 5 chunks
            print(f"Processed {i // chunk_size + 1}/{total_chunks} chunks")

    return pd.concat(results, ignore_index=True)
[4]:
# Example: Process in chunks of 250 rows
print("Processing Florida model in chunks of 250...")
start_time = time.time()
chunked_result = process_in_chunks(
    large_df,
    ethnicolr.pred_fl_reg_ln,
    250,  # chunk_size as positional argument
    'last_name'  # positional argument for the prediction function
)
chunked_duration = time.time() - start_time

print(f"\nChunked processing completed in {chunked_duration:.2f} seconds")
print(f"Result shape: {chunked_result.shape}")
chunked_result[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head()
2025-12-27 22:21:45,246 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,247 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,248 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,377 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,378 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,379 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,379 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
Processing Florida model in chunks of 250...
2025-12-27 22:21:45,508 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,508 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,509 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,510 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,648 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,649 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,650 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,651 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,777 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,778 - INFO - Predicting race/ethnicity for 240 rows using Florida LSTM model
2025-12-27 22:21:45,779 - INFO - Preserving 178 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,780 - INFO - Data filtering summary: 240 -> 240 rows (kept 100.0%)
2025-12-27 22:21:45,906 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
Processed 5/5 chunks

Chunked processing completed in 0.66 seconds
Result shape: (1240, 7)
[4]:
  last_name      race     asian  hispanic  nh_black  nh_white
0     Smith  nh_white  0.004512  0.017937  0.251722  0.725829
1    Garcia  hispanic  0.006059  0.883960  0.010610  0.099372
2   Johnson  nh_white  0.003667  0.013745  0.424924  0.557664
3     Davis  nh_white  0.007555  0.011607  0.379582  0.601256
4     Brown  nh_white  0.003721  0.008477  0.474747  0.513055

Handling Missing or Problematic Names

Real-world datasets often have missing values, special characters, or other data quality issues.

[5]:
# Create a dataset with some problematic entries
problematic_df = large_df.head(50).copy()

# Add some missing values and problematic names
problematic_df.loc[5, 'last_name'] = None
problematic_df.loc[10, 'last_name'] = ''
problematic_df.loc[15, 'last_name'] = 'O\'Connor'  # Apostrophe
problematic_df.loc[20, 'last_name'] = 'García'     # Accented character
problematic_df.loc[25, 'last_name'] = '123'        # Numeric
problematic_df.loc[30, 'first_name'] = None

print("Sample problematic entries:")
print(problematic_df.iloc[[5, 10, 15, 20, 25, 30]][['first_name', 'last_name']])

# Process with Wikipedia model (handles problematic names better)
wiki_result = ethnicolr.pred_wiki_name(problematic_df, 'last_name', 'first_name')

print("\nProcessing results for problematic names:")
problem_indices = [5, 10, 15, 20, 25, 30]
display_cols = ['first_name', 'last_name', 'race', '__name', 'processing_status']
# Some columns might not exist, so filter to available ones
available_cols = [col for col in display_cols if col in wiki_result.columns]
print(wiki_result.iloc[problem_indices][available_cols])
2025-12-27 22:21:45,921 - INFO - Processing 50 names
2025-12-27 22:21:45,926 - INFO - Applying Wikipedia name model to 50 processable names (confidence interval: 1.0)
2025-12-27 22:21:45,927 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
2025-12-27 22:21:46,084 - INFO - Successfully predicted 50 of 50 names (100.0%)
2025-12-27 22:21:46,085 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, __name, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, name_normalized_clean, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterEuropean,WestEuropean,Nordic, GreaterAfrican,Africans, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
Sample problematic entries:
   first_name last_name
5    Jennifer      None
10     Robert
15    Jessica  O'Connor
20    Anthony    García
25  Elizabeth       123
30       None     Baker

Processing results for problematic names:
   first_name last_name                                   race  \
5    Jennifer      None                GreaterEuropean,British
10     Robert                          GreaterEuropean,British
15    Jessica  O'Connor  GreaterEuropean,WestEuropean,Hispanic
20    Anthony    García   GreaterEuropean,WestEuropean,Italian
25  Elizabeth       123                GreaterEuropean,British
30       None     Baker                GreaterEuropean,British

              __name processing_status
5           Jennifer         processed
10            Robert         processed
15  O'Connor Jessica         processed
20    García Anthony         processed
25     123 Elizabeth         processed
30             Baker         processed

Data Quality Analysis

Analyze the quality and coverage of predictions across your dataset.

[6]:
# Get predictions for quality analysis
census_pred = ethnicolr.pred_census_ln(large_df, 'last_name')
wiki_pred = ethnicolr.pred_wiki_ln(large_df, 'last_name')

# Calculate prediction confidence (use correct column names for each model)
# Census model columns: api, black, hispanic, white
census_pred['max_confidence'] = census_pred[['api', 'black', 'hispanic', 'white']].max(axis=1)

# Wikipedia model: find numeric probability columns only
numeric_cols = []
for col in wiki_pred.columns:
    if col not in ['race', '__name', 'last_name', 'processing_status']:
        try:
            # Check if column is numeric
            pd.to_numeric(wiki_pred[col], errors='raise')
            numeric_cols.append(col)
        except (ValueError, TypeError):
            continue

if len(numeric_cols) > 0:
    wiki_pred['max_confidence'] = wiki_pred[numeric_cols].max(axis=1)
else:
    wiki_pred['max_confidence'] = 0.5  # Default value if no numeric columns found

# Confidence distribution
print("Census Model Confidence Distribution:")
print(f"High confidence (>0.8): {(census_pred['max_confidence'] > 0.8).sum()} ({(census_pred['max_confidence'] > 0.8).mean()*100:.1f}%)")
print(f"Medium confidence (0.5-0.8): {((census_pred['max_confidence'] > 0.5) & (census_pred['max_confidence'] <= 0.8)).sum()} ({((census_pred['max_confidence'] > 0.5) & (census_pred['max_confidence'] <= 0.8)).mean()*100:.1f}%)")
print(f"Low confidence (<0.5): {(census_pred['max_confidence'] <= 0.5).sum()} ({(census_pred['max_confidence'] <= 0.5).mean()*100:.1f}%)")

print("\nWikipedia Model Confidence Distribution:")
print(f"Found {len(numeric_cols)} numeric probability columns")
print(f"High confidence (>0.8): {(wiki_pred['max_confidence'] > 0.8).sum()} ({(wiki_pred['max_confidence'] > 0.8).mean()*100:.1f}%)")
print(f"Medium confidence (0.5-0.8): {((wiki_pred['max_confidence'] > 0.5) & (wiki_pred['max_confidence'] <= 0.8)).sum()} ({((wiki_pred['max_confidence'] > 0.5) & (wiki_pred['max_confidence'] <= 0.8)).mean()*100:.1f}%)")
print(f"Low confidence (<0.5): {(wiki_pred['max_confidence'] <= 0.5).sum()} ({(wiki_pred['max_confidence'] <= 0.5).mean()*100:.1f}%)")
2025-12-27 22:21:46,095 - INFO - Processing 1240 names using Census 2010 LSTM model
2025-12-27 22:21:46,096 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:46,097 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:46,302 - INFO - Predicted 1240 of 1240 rows (100.0%)
2025-12-27 22:21:46,302 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:46,303 - INFO - Processing 1240 last names
2025-12-27 22:21:46,306 - INFO - Applying Wikipedia last name model to 1240 processable names (confidence interval: 1.0)
2025-12-27 22:21:46,307 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:46,308 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:46,512 - INFO - Successfully predicted 1240 of 1240 names (100.0%)
2025-12-27 22:21:46,513 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterAfrican,Africans, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
Census Model Confidence Distribution:
High confidence (>0.8): 280 (22.6%)
Medium confidence (0.5-0.8): 900 (72.6%)
Low confidence (<0.5): 60 (4.8%)

Wikipedia Model Confidence Distribution:
Found 13 numeric probability columns
High confidence (>0.8): 480 (38.7%)
Medium confidence (0.5-0.8): 640 (51.6%)
Low confidence (<0.5): 120 (9.7%)

Batch Processing Best Practices

Performance Tips:

  1. Choose the right model: Census lookup is fastest; the ML models are slower but more accurate

  2. Use chunking: For datasets >10,000 rows, process in chunks to manage memory

  3. Clean data first: Remove/handle missing values before processing

  4. Monitor confidence: Low confidence predictions may need manual review

Memory Management:

  • Process in chunks of 500-2000 rows for large datasets

  • Use only the columns you need in your input DataFrame

  • Clear intermediate results when they are no longer needed (see the sketch after this list)
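
A minimal sketch of these memory tips, combining the chunking idea from process_in_chunks above with column pruning and explicit cleanup. The memory_lean_predict name, the column pruning, and the gc.collect() call are illustrative choices, not ethnicolr features:

import gc

import pandas as pd
import ethnicolr

def memory_lean_predict(df, name_col='last_name', chunk_size=1000):
    """Chunked prediction that keeps only the needed column and frees
    intermediate results as soon as they are combined."""
    # Keep only the column the model needs to shrink the working set
    slim = df[[name_col]].copy()

    results = []
    for start in range(0, len(slim), chunk_size):
        chunk = slim.iloc[start:start + chunk_size]
        results.append(ethnicolr.pred_fl_reg_ln(chunk, name_col))

    combined = pd.concat(results, ignore_index=True)

    # Drop the per-chunk frames and trigger garbage collection
    del results
    gc.collect()
    return combined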

Error Handling:

  • Check for missing values in name columns

  • Handle special characters and accents (see the sketch after this list)

  • Validate results and flag low-confidence predictions
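
A minimal sketch of the first two checks, using only pandas and the Python standard library; the clean_names helper and the ASCII accent-folding step are illustrative choices (ethnicolr does not require them), and the confidence flagging in the next cell covers the third point:

import unicodedata

import pandas as pd

def clean_names(df, name_col='last_name'):
    """Drop rows with missing or blank names and fold accents to ASCII."""
    out = df.copy()

    # Check for missing values: normalize NaN/blank names, then drop them
    out[name_col] = out[name_col].fillna('').astype(str).str.strip()
    n_blank = (out[name_col] == '').sum()
    out = out[out[name_col] != '']
    print(f"Dropped {n_blank} rows with missing {name_col}")

    # Handle accents, e.g. 'García' -> 'Garcia'; whether this helps
    # depends on the model, so treat it as an optional step
    out[name_col] = out[name_col].apply(
        lambda s: unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
    )
    return out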

[7]:
# Example production-ready batch processing function
def robust_batch_predict(df, name_col, model='census', chunk_size=1000, min_confidence=0.5):
    """
    Robust batch prediction with error handling and quality filtering
    """
    # Data validation
    if name_col not in df.columns:
        raise ValueError(f"Column '{name_col}' not found in DataFrame")

    # Clean data
    clean_df = df.copy()
    clean_df[name_col] = clean_df[name_col].fillna('').astype(str)

    # Choose prediction function and define probability columns
    if model == 'census':
        pred_func = ethnicolr.pred_census_ln
        prob_cols = ['api', 'black', 'hispanic', 'white']
    elif model == 'wiki':
        pred_func = ethnicolr.pred_wiki_ln
        prob_cols = None  # Will be determined dynamically after prediction
    elif model == 'florida':
        pred_func = ethnicolr.pred_fl_reg_ln
        prob_cols = ['asian', 'hispanic', 'nh_black', 'nh_white']
    else:
        raise ValueError(f"Unknown model: {model}")

    # Process in chunks
    result = process_in_chunks(clean_df, pred_func, chunk_size, name_col)

    # Calculate confidence based on model type
    if model == 'wiki':
        # For Wikipedia model, find numeric probability columns dynamically
        prob_cols = []
        for col in result.columns:
            if col not in ['race', '__name', name_col, 'processing_status']:
                try:
                    pd.to_numeric(result[col], errors='raise')
                    prob_cols.append(col)
                except (ValueError, TypeError):
                    continue

    if prob_cols and len(prob_cols) > 0:
        result['max_confidence'] = result[prob_cols].max(axis=1)
        result['high_confidence'] = result['max_confidence'] >= min_confidence
    else:
        result['max_confidence'] = 0.5  # Default confidence
        result['high_confidence'] = False

    return result

# Example usage
print("Running robust batch prediction...")
robust_result = robust_batch_predict(
    large_df.head(100),
    'last_name',
    model='census',
    chunk_size=50,
    min_confidence=0.6
)

print(f"\nProcessed {len(robust_result)} names")
print(f"High confidence predictions: {robust_result['high_confidence'].sum()} ({robust_result['high_confidence'].mean()*100:.1f}%)")
print("\nSample results:")
robust_result[['last_name', 'race', 'max_confidence', 'high_confidence']].head(10)
2025-12-27 22:21:46,528 - INFO - Processing 50 names using Census 2010 LSTM model
2025-12-27 22:21:46,529 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
2025-12-27 22:21:46,633 - INFO - Predicted 50 of 50 rows (100.0%)
2025-12-27 22:21:46,634 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:46,635 - INFO - Processing 50 names using Census 2010 LSTM model
2025-12-27 22:21:46,636 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
Running robust batch prediction...
2025-12-27 22:21:46,743 - INFO - Predicted 50 of 50 rows (100.0%)
2025-12-27 22:21:46,744 - INFO - Added columns: hispanic, race, api, black, white

Processed 100 names
High confidence predictions: 81 (81.0%)

Sample results:
[7]:
   last_name      race  max_confidence  high_confidence
0      Smith     white        0.747435             True
1     Garcia  hispanic        0.942291             True
2    Johnson     white        0.608818             True
3      Davis     white        0.618956             True
4      Brown     white        0.571941            False
5     Wilson     white        0.700449             True
6   Martinez  hispanic        0.942765             True
7   Anderson     white        0.775244             True
8     Taylor     white        0.684406             True
9  Rodriguez  hispanic        0.942559             True