Advanced Analytics and Performance¶
This notebook explores advanced features of ethnicolr2 including batch processing, performance optimization, and statistical analysis.
[1]:
import pandas as pd
import numpy as np
import time
# Import ethnicolr2 functions
from ethnicolr2 import (
    pred_fl_last_name,
    pred_fl_full_name,
    pred_census_last_name,
)
print("Advanced analytics setup complete")
Advanced analytics setup complete
Performance Benchmarking¶
Let’s test performance with a larger synthetic dataset:
[2]:
# Create a larger synthetic dataset
np.random.seed(42)
# Common surnames from different ethnic backgrounds
surnames = [
    'Smith', 'Johnson', 'Williams', 'Brown', 'Jones',         # Common American
    'Garcia', 'Rodriguez', 'Martinez', 'Hernandez', 'Lopez',  # Hispanic
    'Zhang', 'Wang', 'Li', 'Liu', 'Chen',                     # Chinese
    'Kim', 'Lee', 'Park', 'Choi', 'Jung',                     # Korean
    'Patel', 'Shah', 'Singh', 'Kumar', 'Sharma'               # South Asian
]
# Generate synthetic dataset
n_samples = 1000
large_df = pd.DataFrame({
    'id': range(n_samples),
    'last_name': np.random.choice(surnames, n_samples)
})
print(f"Created dataset with {len(large_df)} rows")
print(f"Unique surnames: {large_df['last_name'].nunique()}")
display(large_df.head())
Created dataset with 1000 rows
Unique surnames: 25
|   | id | last_name |
|---|----|-----------|
| 0 | 0  | Rodriguez |
| 1 | 1  | Jung      |
| 2 | 2  | Chen      |
| 3 | 3  | Zhang     |
| 4 | 4  | Martinez  |
[3]:
# Benchmark different models
models = {
    'Census Last Name': lambda df: pred_census_last_name(df, lname_col='last_name'),
    'Florida Last Name': lambda df: pred_fl_last_name(df, lname_col='last_name')
}
performance_results = []
for model_name, model_func in models.items():
    print(f"Benchmarking {model_name}...")

    start_time = time.time()
    result_df = model_func(large_df.copy())
    end_time = time.time()

    execution_time = end_time - start_time
    rows_per_second = len(large_df) / execution_time

    performance_results.append({
        'Model': model_name,
        'Rows': len(large_df),
        'Time (s)': round(execution_time, 3),
        'Rows/sec': round(rows_per_second, 1)
    })
    print(f" Time: {execution_time:.3f}s ({rows_per_second:.1f} rows/sec)")
perf_df = pd.DataFrame(performance_results)
print("\nPerformance Summary:")
display(perf_df)
Benchmarking Census Last Name...
/home/runner/work/ethnicolr2/ethnicolr2/.venv/lib/python3.11/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator CountVectorizer from version 1.2.2 when using version 1.5.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
100%|██████████| 1/1 [00:00<00:00, 103.61it/s]
Time: 1.295s (771.9 rows/sec)
Benchmarking Florida Last Name...
100%|██████████| 1/1 [00:00<00:00, 96.85it/s]
Time: 0.026s (38952.6 rows/sec)
Performance Summary:
|   | Model             | Rows | Time (s) | Rows/sec |
|---|-------------------|------|----------|----------|
| 0 | Census Last Name  | 1000 | 1.295    | 771.9    |
| 1 | Florida Last Name | 1000 | 0.026    | 38952.6  |
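For inputs well beyond this 1,000-row benchmark, chunked batch processing keeps peak memory bounded and provides natural checkpoints. Below is a minimal sketch, not part of the ethnicolr2 API: `predict_in_chunks` and the 10,000-row chunk size are hypothetical choices, and the only ethnicolr2 call it relies on is the same `pred_fl_last_name` signature used above.

# Hypothetical chunked-processing sketch: slice the frame, predict per
# chunk, and reassemble. The chunk size is illustrative, not a tuned value.
import pandas as pd
from ethnicolr2 import pred_fl_last_name

def predict_in_chunks(df: pd.DataFrame, chunk_size: int = 10_000) -> pd.DataFrame:
    """Run pred_fl_last_name over df in fixed-size chunks and concatenate."""
    results = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].copy()
        results.append(pred_fl_last_name(chunk, lname_col='last_name'))
    return pd.concat(results, ignore_index=True)

Chunking also makes it straightforward to persist intermediate results or distribute chunks across workers.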
Statistical Analysis¶
Analyze prediction confidence and uncertainty:
[4]:
# Analyze prediction confidence
results = pred_fl_last_name(large_df.copy(), lname_col='last_name')
# Extract max probabilities (confidence)
max_probs = [max(p.values()) for p in results['probs']]
results['confidence'] = max_probs
print("Confidence Statistics:")
print(f"Mean confidence: {np.mean(max_probs):.3f}")
print(f"Median confidence: {np.median(max_probs):.3f}")
print(f"Min confidence: {np.min(max_probs):.3f}")
print(f"Max confidence: {np.max(max_probs):.3f}")
# Confidence by prediction category
confidence_by_pred = results.groupby('preds')['confidence'].agg(['mean', 'count'])
confidence_by_pred.columns = ['avg_confidence', 'count']
confidence_by_pred = confidence_by_pred.round(3)
print("\nConfidence by Prediction Category:")
display(confidence_by_pred)
100%|██████████| 1/1 [00:00<00:00, 108.64it/s]
Confidence Statistics:
Mean confidence: 0.727
Median confidence: 0.771
Min confidence: 0.373
Max confidence: 0.903
Confidence by Prediction Category:
| preds    | avg_confidence | count |
|----------|----------------|-------|
| asian    | 0.741          | 311   |
| hispanic | 0.800          | 176   |
| nh_black | 0.843          | 50    |
| nh_white | 0.677          | 463   |
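The maximum probability is only one way to quantify uncertainty. Shannon entropy over the full probability vector distinguishes a 0.7 prediction with the remainder spread thinly across classes from a sharp 0.7/0.3 split. A minimal sketch, assuming (as in the cell above) that `results['probs']` holds a per-row dict of class probabilities:

# Shannon entropy of each probability vector (higher = more uncertain).
import numpy as np

def prediction_entropy(probs: dict) -> float:
    p = np.array(list(probs.values()), dtype=float)
    p = p[p > 0]  # drop zeros to avoid log(0)
    return float(-(p * np.log2(p)).sum())

results['entropy'] = results['probs'].apply(prediction_entropy)
print(results[['last_name', 'preds', 'confidence', 'entropy']].head())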
Summary¶
This advanced analytics notebook demonstrated:
- Performance benchmarking across different models
- Statistical analysis of prediction confidence
- Large-scale processing patterns
Key Insights¶
- ethnicolr2 handled this 1,000-row dataset in well under two seconds per model
- Per-class probability distributions provide valuable uncertainty information
- Models differ markedly in throughput: here the Florida last-name model ran roughly 50x faster than the Census last-name model
- Confidence scores help identify uncertain predictions (see the thresholding sketch below)
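As a practical follow-up to the last insight, a confidence threshold can route uncertain rows to manual review. A sketch with an arbitrary cutoff of 0.5; tune it to your error tolerance:

# Flag predictions below an illustrative confidence cutoff for review.
THRESHOLD = 0.5  # hypothetical value, not a recommended default
uncertain = results[results['confidence'] < THRESHOLD]
print(f"{len(uncertain)} of {len(results)} predictions fall below {THRESHOLD}")
print(uncertain[['last_name', 'preds', 'confidence']].head())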