Batch Processing and Performance

This notebook demonstrates efficient batch processing techniques for large datasets and provides performance optimization tips.

Setup

Load libraries and create a larger sample dataset for demonstration.

[1]:
import pandas as pd
import ethnicolr
import time
from pathlib import Path

# Load sample data
data_path = Path('data/input-with-header.csv')

try:
    small_df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    small_df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")

print(f"Sample data shape: {small_df.shape}")
print("\nFirst few rows:")
print(small_df.head())

# Create a larger dataset for batch processing demonstration
# Replicate the small dataset multiple times
large_df = pd.concat([small_df] * 20, ignore_index=True)
print(f"\nLarge dataset shape: {large_df.shape}")
print("Ready for batch processing demonstrations.")
Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)

First few rows:

Large dataset shape: (1240, 2)
Ready for batch processing demonstrations.

Performance Comparison

Let’s compare the performance of different models on our dataset.

[2]:
def time_prediction(func, df, *args, **kwargs):
    """Helper function to time predictions"""
    start_time = time.time()
    result = func(df, *args, **kwargs)
    end_time = time.time()
    return result, end_time - start_time

# Test different models
models = {
    'census_lookup': (ethnicolr.census_ln, ['last_name'], {'year': 2010}),
    'census_lstm': (ethnicolr.pred_census_ln, ['last_name'], {}),
    'wiki_lastname': (ethnicolr.pred_wiki_ln, ['last_name'], {}),
    'florida_lstm': (ethnicolr.pred_fl_reg_ln, ['last_name'], {})
}

performance_results = []

for model_name, (func, args, kwargs) in models.items():
    print(f"\nTesting {model_name}...")
    result, duration = time_prediction(func, large_df, *args, **kwargs)

    perf_data = {
        'model': model_name,
        'duration': round(duration, 2),
        'rows_per_second': round(len(large_df) / duration, 0),
        'result_rows': result.shape[0],
        'result_cols': result.shape[1]
    }
    performance_results.append(perf_data)

    print(f"Duration: {duration:.2f} seconds")
    print(f"Speed: {len(large_df) / duration:.0f} rows/second")

# Performance summary
perf_df = pd.DataFrame(performance_results).set_index('model')
print("\nPerformance Summary:")
perf_df[['duration', 'rows_per_second', 'result_rows', 'result_cols']]
2025-12-27 22:21:44,232 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,233 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:44,235 - INFO - Loading Census 2010 data from /home/runner/work/ethnicolr/ethnicolr/ethnicolr/data/census/census_2010.csv...
2025-12-27 22:21:44,394 - INFO - Loaded 162253 last names from Census 2010
2025-12-27 22:21:44,395 - INFO - Merging demographic data for 1240 records...

Testing census_lookup...
2025-12-27 22:21:44,429 - INFO - Matched 1240 of 1240 rows (100.0%)
2025-12-27 22:21:44,430 - INFO - Added columns: pct2prace, pctaian, pctapi, pctblack, pcthispanic, pctwhite
2025-12-27 22:21:44,431 - INFO - Processing 1240 names using Census 2010 LSTM model
2025-12-27 22:21:44,432 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,432 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.20 seconds
Speed: 6244 rows/second

Testing census_lstm...
2025-12-27 22:21:44,745 - INFO - Predicted 1240 of 1240 rows (100.0%)
2025-12-27 22:21:44,746 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:44,747 - INFO - Processing 1240 last names
2025-12-27 22:21:44,750 - INFO - Applying Wikipedia last name model to 1240 processable names (confidence interval: 1.0)
2025-12-27 22:21:44,751 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,751 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.32 seconds
Speed: 3926 rows/second

Testing wiki_lastname...
2025-12-27 22:21:44,989 - INFO - Successfully predicted 1240 of 1240 names (100.0%)
2025-12-27 22:21:44,990 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterAfrican,Africans, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
2025-12-27 22:21:44,990 - INFO - Predicting race/ethnicity for 1240 rows using Florida LSTM model
2025-12-27 22:21:44,992 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:44,992 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
Duration: 0.24 seconds
Speed: 5083 rows/second

Testing florida_lstm...
2025-12-27 22:21:45,226 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
Duration: 0.24 seconds
Speed: 5255 rows/second

Performance Summary:
[2]:
               duration  rows_per_second  result_rows  result_cols
model
census_lookup      0.20           6244.0         1240            8
census_lstm        0.32           3926.0         1240            7
wiki_lastname      0.24           5083.0         1240           18
florida_lstm       0.24           5255.0         1240            7

Chunked Processing

For very large datasets, processing in chunks can be more memory efficient.

First, let’s define our chunked processing function:

[3]:
def process_in_chunks(df, func, chunk_size=1000, *args, **kwargs):
    """Process dataframe in chunks to manage memory usage"""
    results = []
    total_chunks = (len(df) - 1) // chunk_size + 1

    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i + chunk_size]
        chunk_result = func(chunk, *args, **kwargs)
        results.append(chunk_result)

        if (i // chunk_size + 1) % 5 == 0:  # Progress every 5 chunks
            print(f"Processed {i // chunk_size + 1}/{total_chunks} chunks")

    return pd.concat(results, ignore_index=True)
[4]:
# Example: Process in chunks of 250 rows
print("Processing Florida model in chunks of 250...")
start_time = time.time()
chunked_result = process_in_chunks(
    large_df,
    ethnicolr.pred_fl_reg_ln,
    250,  # chunk_size as positional argument
    'last_name'  # positional argument for the prediction function
)
chunked_duration = time.time() - start_time

print(f"\nChunked processing completed in {chunked_duration:.2f} seconds")
print(f"Result shape: {chunked_result.shape}")
chunked_result[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head()
2025-12-27 22:21:45,246 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,247 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,248 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,377 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,378 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,379 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,379 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
Processing Florida model in chunks of 250...
2025-12-27 22:21:45,508 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,508 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,509 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,510 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,648 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,649 - INFO - Predicting race/ethnicity for 250 rows using Florida LSTM model
2025-12-27 22:21:45,650 - INFO - Preserving 188 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,651 - INFO - Data filtering summary: 250 -> 250 rows (kept 100.0%)
2025-12-27 22:21:45,777 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
2025-12-27 22:21:45,778 - INFO - Predicting race/ethnicity for 240 rows using Florida LSTM model
2025-12-27 22:21:45,779 - INFO - Preserving 178 duplicate rows based on column 'last_name'
2025-12-27 22:21:45,780 - INFO - Data filtering summary: 240 -> 240 rows (kept 100.0%)
2025-12-27 22:21:45,906 - INFO - Prediction complete. Added columns: hispanic, race, asian, nh_black, nh_white
Processed 5/5 chunks

Chunked processing completed in 0.66 seconds
Result shape: (1240, 7)
[4]:
  last_name      race     asian  hispanic  nh_black  nh_white
0     Smith  nh_white  0.004512  0.017937  0.251722  0.725829
1    Garcia  hispanic  0.006059  0.883960  0.010610  0.099372
2   Johnson  nh_white  0.003667  0.013745  0.424924  0.557664
3     Davis  nh_white  0.007555  0.011607  0.379582  0.601256
4     Brown  nh_white  0.003721  0.008477  0.474747  0.513055

Handling Missing or Problematic Names

Real-world datasets often have missing values, special characters, or other data quality issues.

[5]:
# Create a dataset with some problematic entries
problematic_df = large_df.head(50).copy()

# Add some missing values and problematic names
problematic_df.loc[5, 'last_name'] = None
problematic_df.loc[10, 'last_name'] = ''
problematic_df.loc[15, 'last_name'] = 'O\'Connor'  # Apostrophe
problematic_df.loc[20, 'last_name'] = 'García'     # Accented character
problematic_df.loc[25, 'last_name'] = '123'        # Numeric
problematic_df.loc[30, 'first_name'] = None

print("Sample problematic entries:")
print(problematic_df.iloc[[5, 10, 15, 20, 25, 30]][['first_name', 'last_name']])

# Process with Wikipedia model (handles problematic names better)
wiki_result = ethnicolr.pred_wiki_name(problematic_df, 'last_name', 'first_name')

print("\nProcessing results for problematic names:")
problem_indices = [5, 10, 15, 20, 25, 30]
display_cols = ['first_name', 'last_name', 'race', '__name', 'processing_status']
# Some columns might not exist, so filter to available ones
available_cols = [col for col in display_cols if col in wiki_result.columns]
print(wiki_result.iloc[problem_indices][available_cols])
2025-12-27 22:21:45,921 - INFO - Processing 50 names
2025-12-27 22:21:45,926 - INFO - Applying Wikipedia name model to 50 processable names (confidence interval: 1.0)
2025-12-27 22:21:45,927 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
2025-12-27 22:21:46,084 - INFO - Successfully predicted 50 of 50 names (100.0%)
2025-12-27 22:21:46,085 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, __name, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, name_normalized_clean, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterEuropean,WestEuropean,Nordic, GreaterAfrican,Africans, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
Sample problematic entries:
   first_name last_name
5    Jennifer      None
10     Robert
15    Jessica  O'Connor
20    Anthony    García
25  Elizabeth       123
30       None     Baker

Processing results for problematic names:
   first_name last_name                                   race  \
5    Jennifer      None                GreaterEuropean,British
10     Robert                          GreaterEuropean,British
15    Jessica  O'Connor  GreaterEuropean,WestEuropean,Hispanic
20    Anthony    García   GreaterEuropean,WestEuropean,Italian
25  Elizabeth       123                GreaterEuropean,British
30       None     Baker                GreaterEuropean,British

              __name processing_status
5           Jennifer         processed
10            Robert         processed
15  O'Connor Jessica         processed
20    García Anthony         processed
25     123 Elizabeth         processed
30             Baker         processed

Data Quality Analysis

Analyze the quality and coverage of predictions across your dataset.

[6]:
# Get predictions for quality analysis
census_pred = ethnicolr.pred_census_ln(large_df, 'last_name')
wiki_pred = ethnicolr.pred_wiki_ln(large_df, 'last_name')

# Calculate prediction confidence (use correct column names for each model)
# Census model columns: api, black, hispanic, white
census_pred['max_confidence'] = census_pred[['api', 'black', 'hispanic', 'white']].max(axis=1)

# Wikipedia model: find numeric probability columns only
numeric_cols = []
for col in wiki_pred.columns:
    if col not in ['race', '__name', 'last_name', 'processing_status']:
        try:
            # Check if column is numeric
            pd.to_numeric(wiki_pred[col], errors='raise')
            numeric_cols.append(col)
        except (ValueError, TypeError):
            continue

if len(numeric_cols) > 0:
    wiki_pred['max_confidence'] = wiki_pred[numeric_cols].max(axis=1)
else:
    wiki_pred['max_confidence'] = 0.5  # Default value if no numeric columns found

# Confidence distribution
print("Census Model Confidence Distribution:")
print(f"High confidence (>0.8): {(census_pred['max_confidence'] > 0.8).sum()} ({(census_pred['max_confidence'] > 0.8).mean()*100:.1f}%)")
print(f"Medium confidence (0.5-0.8): {((census_pred['max_confidence'] > 0.5) & (census_pred['max_confidence'] <= 0.8)).sum()} ({((census_pred['max_confidence'] > 0.5) & (census_pred['max_confidence'] <= 0.8)).mean()*100:.1f}%)")
print(f"Low confidence (<0.5): {(census_pred['max_confidence'] <= 0.5).sum()} ({(census_pred['max_confidence'] <= 0.5).mean()*100:.1f}%)")

print("\nWikipedia Model Confidence Distribution:")
print(f"Found {len(numeric_cols)} numeric probability columns")
print(f"High confidence (>0.8): {(wiki_pred['max_confidence'] > 0.8).sum()} ({(wiki_pred['max_confidence'] > 0.8).mean()*100:.1f}%)")
print(f"Medium confidence (0.5-0.8): {((wiki_pred['max_confidence'] > 0.5) & (wiki_pred['max_confidence'] <= 0.8)).sum()} ({((wiki_pred['max_confidence'] > 0.5) & (wiki_pred['max_confidence'] <= 0.8)).mean()*100:.1f}%)")
print(f"Low confidence (<0.5): {(wiki_pred['max_confidence'] <= 0.5).sum()} ({(wiki_pred['max_confidence'] <= 0.5).mean()*100:.1f}%)")
2025-12-27 22:21:46,095 - INFO - Processing 1240 names using Census 2010 LSTM model
2025-12-27 22:21:46,096 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:46,097 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:46,302 - INFO - Predicted 1240 of 1240 rows (100.0%)
2025-12-27 22:21:46,302 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:46,303 - INFO - Processing 1240 last names
2025-12-27 22:21:46,306 - INFO - Applying Wikipedia last name model to 1240 processable names (confidence interval: 1.0)
2025-12-27 22:21:46,307 - INFO - Preserving 1178 duplicate rows based on column 'last_name'
2025-12-27 22:21:46,308 - INFO - Data filtering summary: 1240 -> 1240 rows (kept 100.0%)
2025-12-27 22:21:46,512 - INFO - Successfully predicted 1240 of 1240 names (100.0%)
2025-12-27 22:21:46,513 - INFO - Added columns: Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, processing_status, race, GreaterEuropean,EastEuropean, GreaterEuropean,WestEuropean,Germanic, Asian,GreaterEastAsian,EastAsian, name_normalized, GreaterAfrican,Africans, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,British, GreaterEuropean,WestEuropean,French, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterAfrican,Muslim, GreaterEuropean,Jewish
Census Model Confidence Distribution:
High confidence (>0.8): 280 (22.6%)
Medium confidence (0.5-0.8): 900 (72.6%)
Low confidence (<0.5): 60 (4.8%)

Wikipedia Model Confidence Distribution:
Found 13 numeric probability columns
High confidence (>0.8): 480 (38.7%)
Medium confidence (0.5-0.8): 640 (51.6%)
Low confidence (<0.5): 120 (9.7%)

Batch Processing Best Practices

Performance Tips:

  1. Choose the right model: Census lookup is fastest; the ML models are slower but more accurate

  2. Use chunking: For datasets >10,000 rows, process in chunks to manage memory

  3. Clean data first: Remove/handle missing values before processing

  4. Monitor confidence: Low confidence predictions may need manual review

Memory Management:

  • Process in chunks of 500-2000 rows for large datasets

  • Use only the columns you need in your input DataFrame

  • Clear intermediate results when they are no longer needed (see the sketch after this list)
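
A minimal sketch of these memory tips, combining the chunking idea from process_in_chunks above with column pruning and explicit cleanup. The memory_lean_predict name, the column pruning, and the gc.collect() call are illustrative choices, not ethnicolr features:

import gc

import pandas as pd
import ethnicolr

def memory_lean_predict(df, name_col='last_name', chunk_size=1000):
    """Chunked prediction that keeps only the needed column and frees
    intermediate results as soon as they are combined."""
    # Keep only the column the model needs to shrink the working set
    slim = df[[name_col]].copy()

    results = []
    for start in range(0, len(slim), chunk_size):
        chunk = slim.iloc[start:start + chunk_size]
        results.append(ethnicolr.pred_fl_reg_ln(chunk, name_col))

    combined = pd.concat(results, ignore_index=True)

    # Drop the per-chunk frames and trigger garbage collection
    del results
    gc.collect()
    return combined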

Error Handling:

  • Check for missing values in name columns

  • Handle special characters and accents (see the sketch after this list)

  • Validate results and flag low-confidence predictions
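
A minimal sketch of the first two checks, using only pandas and the Python standard library; the clean_names helper and the ASCII accent-folding step are illustrative choices (ethnicolr does not require them), and the confidence flagging in the next cell covers the third point:

import unicodedata

import pandas as pd

def clean_names(df, name_col='last_name'):
    """Drop rows with missing or blank names and fold accents to ASCII."""
    out = df.copy()

    # Check for missing values: normalize NaN/blank names, then drop them
    out[name_col] = out[name_col].fillna('').astype(str).str.strip()
    n_blank = (out[name_col] == '').sum()
    out = out[out[name_col] != '']
    print(f"Dropped {n_blank} rows with missing {name_col}")

    # Handle accents, e.g. 'García' -> 'Garcia'; whether this helps
    # depends on the model, so treat it as an optional step
    out[name_col] = out[name_col].apply(
        lambda s: unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
    )
    return out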

[7]:
# Example production-ready batch processing function
def robust_batch_predict(df, name_col, model='census', chunk_size=1000, min_confidence=0.5):
    """
    Robust batch prediction with error handling and quality filtering
    """
    # Data validation
    if name_col not in df.columns:
        raise ValueError(f"Column '{name_col}' not found in DataFrame")

    # Clean data
    clean_df = df.copy()
    clean_df[name_col] = clean_df[name_col].fillna('').astype(str)

    # Choose prediction function and define probability columns
    if model == 'census':
        pred_func = ethnicolr.pred_census_ln
        prob_cols = ['api', 'black', 'hispanic', 'white']
    elif model == 'wiki':
        pred_func = ethnicolr.pred_wiki_ln
        prob_cols = None  # Will be determined dynamically after prediction
    elif model == 'florida':
        pred_func = ethnicolr.pred_fl_reg_ln
        prob_cols = ['asian', 'hispanic', 'nh_black', 'nh_white']
    else:
        raise ValueError(f"Unknown model: {model}")

    # Process in chunks
    result = process_in_chunks(clean_df, pred_func, chunk_size, name_col)

    # Calculate confidence based on model type
    if model == 'wiki':
        # For Wikipedia model, find numeric probability columns dynamically
        prob_cols = []
        for col in result.columns:
            if col not in ['race', '__name', name_col, 'processing_status']:
                try:
                    pd.to_numeric(result[col], errors='raise')
                    prob_cols.append(col)
                except (ValueError, TypeError):
                    continue

    if prob_cols and len(prob_cols) > 0:
        result['max_confidence'] = result[prob_cols].max(axis=1)
        result['high_confidence'] = result['max_confidence'] >= min_confidence
    else:
        result['max_confidence'] = 0.5  # Default confidence
        result['high_confidence'] = False

    return result

# Example usage
print("Running robust batch prediction...")
robust_result = robust_batch_predict(
    large_df.head(100),
    'last_name',
    model='census',
    chunk_size=50,
    min_confidence=0.6
)

print(f"\nProcessed {len(robust_result)} names")
print(f"High confidence predictions: {robust_result['high_confidence'].sum()} ({robust_result['high_confidence'].mean()*100:.1f}%)")
print("\nSample results:")
robust_result[['last_name', 'race', 'max_confidence', 'high_confidence']].head(10)
2025-12-27 22:21:46,528 - INFO - Processing 50 names using Census 2010 LSTM model
2025-12-27 22:21:46,529 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
2025-12-27 22:21:46,633 - INFO - Predicted 50 of 50 rows (100.0%)
2025-12-27 22:21:46,634 - INFO - Added columns: hispanic, race, api, black, white
2025-12-27 22:21:46,635 - INFO - Processing 50 names using Census 2010 LSTM model
2025-12-27 22:21:46,636 - INFO - Data filtering summary: 50 -> 50 rows (kept 100.0%)
Running robust batch prediction...
2025-12-27 22:21:46,743 - INFO - Predicted 50 of 50 rows (100.0%)
2025-12-27 22:21:46,744 - INFO - Added columns: hispanic, race, api, black, white

Processed 100 names
High confidence predictions: 81 (81.0%)

Sample results:
[7]:
   last_name      race  max_confidence  high_confidence
0      Smith     white        0.747435             True
1     Garcia  hispanic        0.942291             True
2    Johnson     white        0.608818             True
3      Davis     white        0.618956             True
4      Brown     white        0.571941            False
5     Wilson     white        0.700449             True
6   Martinez  hispanic        0.942765             True
7   Anderson     white        0.775244             True
8     Taylor     white        0.684406             True
9  Rodriguez  hispanic        0.942559             True