Advanced Analytics and Performance

This notebook explores advanced features of ethnicolr2, including batch processing, performance optimization, and statistical analysis of prediction confidence.

[1]:
import pandas as pd
import numpy as np
import time

# Import ethnicolr2 functions
from ethnicolr2 import (
    pred_fl_last_name,
    pred_fl_full_name,
    pred_census_last_name
)

print("Advanced analytics setup complete")
Advanced analytics setup complete

Performance Benchmarking

Let’s test performance with a larger synthetic dataset:

[2]:
# Create a larger synthetic dataset
np.random.seed(42)

# Common surnames from different ethnic backgrounds
surnames = [
    'Smith', 'Johnson', 'Williams', 'Brown', 'Jones',  # Common American
    'Garcia', 'Rodriguez', 'Martinez', 'Hernandez', 'Lopez',  # Hispanic
    'Zhang', 'Wang', 'Li', 'Liu', 'Chen',  # Chinese
    'Kim', 'Lee', 'Park', 'Choi', 'Jung',  # Korean
    'Patel', 'Shah', 'Singh', 'Kumar', 'Sharma'  # South Asian
]

# Generate synthetic dataset
n_samples = 1000
large_df = pd.DataFrame({
    'id': range(n_samples),
    'last_name': np.random.choice(surnames, n_samples)
})

print(f"Created dataset with {len(large_df)} rows")
print(f"Unique surnames: {large_df['last_name'].nunique()}")
display(large_df.head())
Created dataset with 1000 rows
Unique surnames: 25
   id  last_name
0   0  Rodriguez
1   1       Jung
2   2       Chen
3   3      Zhang
4   4   Martinez
[3]:
# Benchmark different models
models = {
    'Census Last Name': lambda df: pred_census_last_name(df, lname_col='last_name'),
    'Florida Last Name': lambda df: pred_fl_last_name(df, lname_col='last_name')
}

performance_results = []

for model_name, model_func in models.items():
    print(f"Benchmarking {model_name}...")

    start_time = time.time()
    result_df = model_func(large_df.copy())
    end_time = time.time()

    execution_time = end_time - start_time
    rows_per_second = len(large_df) / execution_time

    performance_results.append({
        'Model': model_name,
        'Rows': len(large_df),
        'Time (s)': round(execution_time, 3),
        'Rows/sec': round(rows_per_second, 1)
    })

    print(f"  Time: {execution_time:.3f}s ({rows_per_second:.1f} rows/sec)")

perf_df = pd.DataFrame(performance_results)
print("\nPerformance Summary:")
display(perf_df)
Benchmarking Census Last Name...
/home/runner/work/ethnicolr2/ethnicolr2/.venv/lib/python3.11/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator CountVectorizer from version 1.2.2 when using version 1.5.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
100%|██████████| 1/1 [00:00<00:00, 103.61it/s]
  Time: 1.295s (771.9 rows/sec)
Benchmarking Florida Last Name...
100%|██████████| 1/1 [00:00<00:00, 96.85it/s]
  Time: 0.026s (38952.6 rows/sec)

Performance Summary:

               Model  Rows  Time (s)  Rows/sec
0   Census Last Name  1000     1.295     771.9
1  Florida Last Name  1000     0.026   38952.6
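
For datasets much larger than this benchmark, a common optimization is to score in fixed-size chunks so that memory stays bounded and progress is observable. Below is a minimal sketch reusing the pred_fl_last_name call from above; the chunk size is an arbitrary example, not a library recommendation:

[ ]:
import pandas as pd

from ethnicolr2 import pred_fl_last_name

def predict_in_chunks(df, chunk_size=10_000):
    """Score a DataFrame in fixed-size chunks and concatenate the results."""
    pieces = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].copy()
        pieces.append(pred_fl_last_name(chunk, lname_col='last_name'))
    return pd.concat(pieces, ignore_index=True)

# Example: score the 1000-row benchmark frame in two chunks
# scored = predict_in_chunks(large_df, chunk_size=500)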

Statistical Analysis

Analyze prediction confidence and uncertainty:

[4]:
# Analyze prediction confidence
results = pred_fl_last_name(large_df.copy(), lname_col='last_name')

# Extract max probabilities (confidence)
max_probs = []
for _, row in results.iterrows():
    probs = row['probs']
    max_prob = max(probs.values())
    max_probs.append(max_prob)

results['confidence'] = max_probs

print("Confidence Statistics:")
print(f"Mean confidence: {np.mean(max_probs):.3f}")
print(f"Median confidence: {np.median(max_probs):.3f}")
print(f"Min confidence: {np.min(max_probs):.3f}")
print(f"Max confidence: {np.max(max_probs):.3f}")

# Confidence by prediction category
confidence_by_pred = results.groupby('preds')['confidence'].agg(['mean', 'count'])
confidence_by_pred.columns = ['avg_confidence', 'count']
confidence_by_pred = confidence_by_pred.round(3)

print("\nConfidence by Prediction Category:")
display(confidence_by_pred)
100%|██████████| 1/1 [00:00<00:00, 108.64it/s]
Confidence Statistics:
Mean confidence: 0.727
Median confidence: 0.771
Min confidence: 0.373
Max confidence: 0.903

Confidence by Prediction Category:

          avg_confidence  count
preds
asian              0.741    311
hispanic           0.800    176
nh_black           0.843     50
nh_white           0.677    463
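
The confidence column computed above supports a simple triage rule: route rows whose top-class probability falls below a threshold to manual review. A minimal sketch, assuming the input last_name column is carried through to the output frame; the 0.6 cutoff is an arbitrary example, not a calibrated threshold:

[ ]:
# Flag predictions whose top-class probability is below an example threshold
REVIEW_THRESHOLD = 0.6  # arbitrary example cutoff, not a calibrated value

uncertain = results[results['confidence'] < REVIEW_THRESHOLD]
print(f"{len(uncertain)} of {len(results)} predictions fall below {REVIEW_THRESHOLD}")
display(uncertain[['last_name', 'preds', 'confidence']].head())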

Summary

This advanced analytics notebook demonstrated:

  1. Performance benchmarking across different models

  2. Statistical analysis of prediction confidence

  3. Large-scale processing patterns

Key Insights

  • ethnicolr2 scored 1,000 rows in well under two seconds with both models in this run

  • Probability distributions carry uncertainty information beyond the top-class label (see the entropy sketch below)

  • Throughput varies widely by model: here the Florida last-name model reached roughly 39,000 rows/sec versus roughly 770 rows/sec for the Census model (a figure that likely includes one-time model loading)

  • Confidence scores help flag uncertain predictions for further review
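
As noted above, the full probability vector carries more uncertainty information than the top-class probability alone. Below is a minimal sketch of per-row Shannon entropy, assuming each entry in the probs column is a dict of class probabilities, as in the confidence analysis above:

[ ]:
import numpy as np

def row_entropy(probs):
    """Shannon entropy (in nats) of a dict of class probabilities."""
    p = np.array(list(probs.values()), dtype=float)
    p = p[p > 0]  # drop zero entries to avoid log(0)
    return float(-(p * np.log(p)).sum())

results['entropy'] = results['probs'].apply(row_entropy)
print(f"Mean entropy: {results['entropy'].mean():.3f}")
# Higher entropy = probability mass spread across categories = less certain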