Advanced Prediction Models
This notebook demonstrates ethnicity prediction with the Wikipedia and Florida voter registration models in ethnicolr, including last-name-only prediction, cross-model comparison, confidence analysis, and the detailed Wikipedia ethnic categories.
Setup
Load the required libraries and sample data.
[1]:
import pandas as pd
import ethnicolr
from pathlib import Path
# Load sample data
data_path = Path('data/input-with-header.csv')
try:
    df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")
print(f"Sample data shape: {df.shape}")
print("\nFirst few rows:")
df.head()
Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)
First few rows:
[1]:
| | first_name | last_name |
|---|---|---|
| 0 | John | Smith |
| 1 | Maria | Garcia |
| 2 | David | Johnson |
| 3 | Sarah | Davis |
| 4 | Michael | Brown |
Wikipedia-based Predictions
Wikipedia models provide more granular ethnic categories and work well with both first and last names.
[2]:
# Predict using Wikipedia model with full names
wiki_result = ethnicolr.pred_wiki_name(df, 'last_name', 'first_name')
print(f"Wikipedia prediction result shape: {wiki_result.shape}")
print("\nColumns added:")
wiki_cols = [col for col in wiki_result.columns if col not in df.columns]
print(wiki_cols)
# Show detailed predictions
wiki_result[['first_name', 'last_name', 'race', '__name']].head(10)
2025-12-27 22:21:33,965 - INFO - Processing 62 names
2025-12-27 22:21:33,971 - INFO - Applying Wikipedia name model to 62 processable names (confidence interval: 1.0)
2025-12-27 22:21:33,971 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:34,211 - INFO - Successfully predicted 62 of 62 names (100.0%)
2025-12-27 22:21:34,212 - INFO - Added columns: name_normalized, GreaterEuropean,WestEuropean,Hispanic, GreaterEuropean,British, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,Jewish, GreaterAfrican,Muslim, GreaterEuropean,WestEuropean,Italian, GreaterAfrican,Africans, processing_status, GreaterEuropean,EastEuropean, Asian,IndianSubContinent, __name, Asian,GreaterEastAsian,Japanese, Asian,GreaterEastAsian,EastAsian, name_normalized_clean, GreaterEuropean,WestEuropean,French, race, GreaterEuropean,WestEuropean,Germanic
Wikipedia prediction result shape: (62, 20)
Columns added:
['__name', 'name_normalized', 'name_normalized_clean', 'processing_status', 'Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic', 'race']
[2]:
| | first_name | last_name | race | __name |
|---|---|---|---|---|
| 0 | John | Smith | GreaterEuropean,British | Smith John |
| 1 | Maria | Garcia | GreaterEuropean,WestEuropean,Italian | Garcia Maria |
| 2 | David | Johnson | GreaterEuropean,British | Johnson David |
| 3 | Sarah | Davis | GreaterEuropean,British | Davis Sarah |
| 4 | Michael | Brown | GreaterEuropean,British | Brown Michael |
| 5 | Jennifer | Wilson | GreaterEuropean,British | Wilson Jennifer |
| 6 | Carlos | Martinez | GreaterEuropean,WestEuropean,Hispanic | Martinez Carlos |
| 7 | Lisa | Anderson | GreaterEuropean,British | Anderson Lisa |
| 8 | James | Taylor | GreaterEuropean,British | Taylor James |
| 9 | Anna | Rodriguez | GreaterEuropean,WestEuropean,Hispanic | Rodriguez Anna |
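The `race` labels returned by the Wikipedia model are comma-separated paths, from a broad group down to a specific one. A minimal sketch of splitting them into levels follows; the labels are copied from the output above, while the column names `broad_group`, `fine_group`, and `depth` are illustrative and not part of ethnicolr.

```python
import pandas as pd

# Labels copied from the Wikipedia prediction output; new column names are illustrative.
labels = pd.DataFrame({
    "race": [
        "GreaterEuropean,British",
        "GreaterEuropean,WestEuropean,Italian",
        "GreaterEuropean,WestEuropean,Hispanic",
    ]
})

# Split each label into its hierarchy levels.
parts = labels["race"].str.split(",")
labels["broad_group"] = parts.str[0]   # first (coarsest) level
labels["fine_group"] = parts.str[-1]   # last (most specific) level
labels["depth"] = parts.str.len()      # number of levels in the label

print(labels[["broad_group", "fine_group", "depth"]])
```

Grouping on `broad_group` can be useful when the fine-grained labels are too sparse for the analysis at hand.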
Florida Voter Registration Models
The Florida models are trained on Florida voter registration data. The standard model predicts four categories (asian, hispanic, nh_black, nh_white); the five-category variant adds other.
[3]:
# Standard 4-category Florida model
fl_result = ethnicolr.pred_fl_reg_name(df, 'last_name', 'first_name')
print("Florida 4-category predictions:")
fl_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head()
print("\nRace distribution (Florida model):")
print(fl_result['race'].value_counts())
2025-12-27 22:21:34,222 - INFO - Processing 62 full names
2025-12-27 22:21:34,227 - INFO - Applying Florida voter name model to 62 processable names (confidence interval: 1.0)
2025-12-27 22:21:34,227 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:34,393 - INFO - Successfully predicted 62 of 62 names (100.0%)
2025-12-27 22:21:34,393 - INFO - Added columns: nh_white, name_normalized, asian, __name, name_normalized_clean, nh_black, processing_status, race, hispanic
Florida 4-category predictions:
Race distribution (Florida model):
race
nh_white 58
hispanic 3
nh_black 1
Name: count, dtype: int64
[4]:
# 5-category Florida model (includes 'other' category)
fl5_result = ethnicolr.pred_fl_reg_name_five_cat(df, 'last_name', 'first_name')
print("Florida 5-category predictions:")
fl5_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white', 'other']].head()
print("\nRace distribution (Florida 5-category):")
print(fl5_result['race'].value_counts())
2025-12-27 22:21:34,401 - INFO - Generating full names from columns: last_name, first_name
2025-12-27 22:21:34,402 - INFO - Using Florida 5-category model for year 2022
2025-12-27 22:21:34,403 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
Florida 5-category predictions:
Race distribution (Florida 5-category):
race
nh_white 34
nh_black 18
hispanic 9
other 1
Name: count, dtype: int64
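The two Florida models give noticeably different distributions for the same 62 names. One way to see the shift is to place the two sets of counts side by side and convert them to shares. The sketch below re-enters the counts printed above by hand; in a live session they would presumably come from `fl_result['race'].value_counts()` and `fl5_result['race'].value_counts()`.

```python
import pandas as pd

# Counts transcribed from the two value_counts() outputs above.
counts = pd.DataFrame({
    "florida_4cat": pd.Series({"nh_white": 58, "hispanic": 3, "nh_black": 1}),
    "florida_5cat": pd.Series({"nh_white": 34, "nh_black": 18, "hispanic": 9, "other": 1}),
}).fillna(0).astype(int)  # a category missing from one model counts as 0

# Convert counts to per-model shares and compute the change between models.
shares = counts / counts.sum()
shares["shift"] = shares["florida_5cat"] - shares["florida_4cat"]

print(shares.round(3))
```

A large shift for one category, as with nh_black here, is a signal to inspect the affected names before relying on either model.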
Last Name Only Predictions
When only last names are available, the last-name models can still return a predicted category, although they have less information to work with than the full-name models.
[5]:
# Wikipedia last name model
wiki_ln = ethnicolr.pred_wiki_ln(df, 'last_name')
print("Wikipedia last name predictions:")
wiki_ln[['last_name', 'race']].head(10)
# Florida last name model
fl_ln = ethnicolr.pred_fl_reg_ln(df, 'last_name')
print("\nFlorida last name predictions:")
fl_ln[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head(10)
2025-12-27 22:21:34,558 - INFO - Processing 62 last names
2025-12-27 22:21:34,560 - INFO - Applying Wikipedia last name model to 62 processable names (confidence interval: 1.0)
2025-12-27 22:21:34,561 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:34,704 - INFO - Successfully predicted 62 of 62 names (100.0%)
2025-12-27 22:21:34,705 - INFO - Added columns: name_normalized, GreaterEuropean,WestEuropean,Hispanic, GreaterEuropean,British, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,Jewish, GreaterAfrican,Muslim, GreaterEuropean,WestEuropean,Italian, GreaterAfrican,Africans, processing_status, GreaterEuropean,EastEuropean, Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, Asian,GreaterEastAsian,EastAsian, GreaterEuropean,WestEuropean,French, race, GreaterEuropean,WestEuropean,Germanic
2025-12-27 22:21:34,706 - INFO - Predicting race/ethnicity for 62 rows using Florida LSTM model
2025-12-27 22:21:34,707 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
Wikipedia last name predictions:
2025-12-27 22:21:34,847 - INFO - Prediction complete. Added columns: nh_white, asian, nh_black, race, hispanic
Florida last name predictions:
[5]:
| | last_name | race | asian | hispanic | nh_black | nh_white |
|---|---|---|---|---|---|---|
| 0 | Smith | nh_white | 0.004512 | 0.017937 | 0.251722 | 0.725829 |
| 1 | Garcia | hispanic | 0.006059 | 0.883960 | 0.010610 | 0.099372 |
| 2 | Johnson | nh_white | 0.003667 | 0.013745 | 0.424924 | 0.557664 |
| 3 | Davis | nh_white | 0.007555 | 0.011607 | 0.379582 | 0.601256 |
| 4 | Brown | nh_white | 0.003721 | 0.008477 | 0.474747 | 0.513055 |
| 5 | Wilson | nh_white | 0.004638 | 0.016631 | 0.333033 | 0.645697 |
| 6 | Martinez | hispanic | 0.003296 | 0.888409 | 0.011035 | 0.097260 |
| 7 | Anderson | nh_white | 0.009505 | 0.013844 | 0.239017 | 0.737635 |
| 8 | Taylor | nh_white | 0.005646 | 0.015479 | 0.271970 | 0.706904 |
| 9 | Rodriguez | hispanic | 0.003506 | 0.895370 | 0.008677 | 0.092447 |
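The probability columns show that some last names are close calls. Brown, for instance, is labelled nh_white with nh_black only a few points behind. A small sketch of measuring that gap as the margin between the two largest probabilities follows. The rows are copied from the table above, and `margin` is an illustrative column name rather than an ethnicolr output.

```python
import numpy as np
import pandas as pd

prob_cols = ["asian", "hispanic", "nh_black", "nh_white"]

# Rows copied from the Florida last name output above.
fl_ln_sample = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson", "Brown"],
    "asian": [0.004512, 0.006059, 0.003667, 0.003721],
    "hispanic": [0.017937, 0.883960, 0.013745, 0.008477],
    "nh_black": [0.251722, 0.010610, 0.424924, 0.474747],
    "nh_white": [0.725829, 0.099372, 0.557664, 0.513055],
})

# Sort each row's probabilities ascending, then subtract runner-up from winner.
sorted_probs = np.sort(fl_ln_sample[prob_cols].to_numpy(), axis=1)
fl_ln_sample["margin"] = sorted_probs[:, -1] - sorted_probs[:, -2]

# Smallest margins first: these are the least decisive predictions.
print(fl_ln_sample.sort_values("margin")[["last_name", "margin"]])
```

A small margin means the predicted label could flip with a slightly different model, so such rows deserve more caution than the label alone suggests.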
Model Comparison
Let’s compare predictions across different models for the same names.
[6]:
# Create comparison dataframe
comparison = pd.DataFrame({
    'name': df['first_name'] + ' ' + df['last_name'],
    'census': ethnicolr.pred_census_ln(df, 'last_name')['race'],
    'wiki_fullname': wiki_result['race'],
    'wiki_lastname': wiki_ln['race'],
    'florida_4cat': fl_result['race'],
    'florida_5cat': fl5_result['race']
})
print("Model comparison (first 15 names):")
comparison.head(15)
2025-12-27 22:21:34,859 - INFO - Processing 62 names using Census 2010 LSTM model
2025-12-27 22:21:34,860 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:35,003 - INFO - Predicted 62 of 62 rows (100.0%)
2025-12-27 22:21:35,003 - INFO - Added columns: black, white, api, race, hispanic
Model comparison (first 15 names):
[6]:
| | name | census | wiki_fullname | wiki_lastname | florida_4cat | florida_5cat |
|---|---|---|---|---|---|---|
| 0 | John Smith | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_white |
| 1 | Maria Garcia | hispanic | GreaterEuropean,WestEuropean,Italian | GreaterEuropean,WestEuropean,Hispanic | hispanic | hispanic |
| 2 | David Johnson | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_white |
| 3 | Sarah Davis | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_white |
| 4 | Michael Brown | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_black |
| 5 | Jennifer Wilson | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_white |
| 6 | Carlos Martinez | hispanic | GreaterEuropean,WestEuropean,Hispanic | GreaterEuropean,WestEuropean,Hispanic | hispanic | hispanic |
| 7 | Lisa Anderson | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_white |
| 8 | James Taylor | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_white |
| 9 | Anna Rodriguez | hispanic | GreaterEuropean,WestEuropean,Hispanic | GreaterEuropean,WestEuropean,Hispanic | nh_white | hispanic |
| 10 | Robert Thomas | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_white |
| 11 | Ashley Jackson | black | GreaterEuropean,British | GreaterEuropean,British | nh_black | nh_black |
| 12 | Kevin White | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_black |
| 13 | Michelle Harris | white | GreaterEuropean,British | GreaterEuropean,British | nh_white | nh_black |
| 14 | Daniel Martin | white | GreaterEuropean,WestEuropean,Hispanic | GreaterEuropean,British | nh_white | nh_white |
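The models use different label vocabularies (for example, white versus nh_white), so measuring agreement requires mapping them to a common scheme first. The sketch below does this for the Census and Florida 4-category columns using four rows from the table above. The mapping `CENSUS_TO_FLORIDA` is an assumption made for illustration, not something ethnicolr defines, and treating the categories as equivalent glosses over real definitional differences.

```python
import pandas as pd

# Rows copied from the model comparison table above.
comparison_sample = pd.DataFrame({
    "name": ["John Smith", "Maria Garcia", "Anna Rodriguez", "Ashley Jackson"],
    "census": ["white", "hispanic", "hispanic", "black"],
    "florida_4cat": ["nh_white", "hispanic", "nh_white", "nh_black"],
})

# Assumed correspondence between the two label sets (illustrative only).
CENSUS_TO_FLORIDA = {
    "white": "nh_white",
    "black": "nh_black",
    "hispanic": "hispanic",
    "api": "asian",
}

comparison_sample["census_harmonized"] = comparison_sample["census"].map(CENSUS_TO_FLORIDA)
comparison_sample["agree"] = (
    comparison_sample["census_harmonized"] == comparison_sample["florida_4cat"]
)

agreement_rate = comparison_sample["agree"].mean()
disagreements = comparison_sample.loc[~comparison_sample["agree"], "name"].tolist()

print(f"Agreement rate: {agreement_rate:.2f}")
print(f"Disagreements: {disagreements}")
```

Names on which the models disagree are natural candidates for manual review.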
Confidence Analysis
Let’s examine the confidence scores to understand prediction certainty.
[7]:
# Calculate max probability (confidence) for each prediction
fl_result['max_prob'] = fl_result[['asian', 'hispanic', 'nh_black', 'nh_white']].max(axis=1)
# Show high vs low confidence predictions
high_conf = fl_result[fl_result['max_prob'] > 0.8]
low_conf = fl_result[fl_result['max_prob'] < 0.5]
print(f"High confidence predictions (>80%): {len(high_conf)} names")
print("Examples:")
print(high_conf[['first_name', 'last_name', 'race', 'max_prob']].head())
print(f"\nLow confidence predictions (<50%): {len(low_conf)} names")
print("Examples:")
print(low_conf[['first_name', 'last_name', 'race', 'max_prob']].head())
High confidence predictions (>80%): 37 names
Examples:
first_name last_name race max_prob
0 John Smith nh_white 0.931000
1 Maria Garcia hispanic 0.829117
4 Michael Brown nh_white 0.851771
5 Jennifer Wilson nh_white 0.852773
6 Carlos Martinez hispanic 0.908851
Low confidence predictions (<50%): 2 names
Examples:
first_name last_name race max_prob
35 Kayla Perez nh_white 0.483368
53 Vanessa Bailey nh_white 0.413277
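One possible use of `max_prob` is to abstain on low-confidence rows instead of reporting a label the model barely prefers. The sketch below applies a cut-off to four of the rows printed above. The threshold of 0.5 and the `race_or_uncertain` column are hypothetical choices for illustration and should be tuned on your own data.

```python
import pandas as pd

# max_prob values copied from the confidence output above.
preds = pd.DataFrame({
    "name": ["John Smith", "Maria Garcia", "Kayla Perez", "Vanessa Bailey"],
    "race": ["nh_white", "hispanic", "nh_white", "nh_white"],
    "max_prob": [0.931000, 0.829117, 0.483368, 0.413277],
})

THRESHOLD = 0.5  # hypothetical cut-off, not an ethnicolr default

# Keep the predicted label only when the model is confident enough.
preds["race_or_uncertain"] = preds["race"].where(
    preds["max_prob"] >= THRESHOLD, "uncertain"
)

# Share of rows that still receive a label after abstaining.
coverage = (preds["race_or_uncertain"] != "uncertain").mean()

print(preds[["name", "race_or_uncertain", "max_prob"]])
print(f"Coverage at threshold {THRESHOLD}: {coverage:.0%}")
```

Raising the threshold trades coverage for reliability, so it is worth reporting both numbers together.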
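Detailed Ethnic Categories (Wikipedia)
The Wikipedia model predicts 13 hierarchical categories (for example, GreaterEuropean,WestEuropean,Hispanic), which is far more granular than the four or five categories of the Florida models.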
[8]:
# Show detailed ethnic categories from Wikipedia model
print("Detailed ethnic categories from Wikipedia model:")
ethnic_dist = wiki_result['race'].value_counts()
print(ethnic_dist)
# Show examples of detailed categories
print("\nExamples by ethnic category:")
for category in ethnic_dist.head(5).index:
    examples = wiki_result[wiki_result['race'] == category]['__name'].head(3).tolist()
    print(f"{category}: {', '.join(examples)}")
Detailed ethnic categories from Wikipedia model:
race
GreaterEuropean,British 52
GreaterEuropean,WestEuropean,Hispanic 3
GreaterEuropean,WestEuropean,French 3
GreaterEuropean,WestEuropean,Italian 2
GreaterEuropean,Jewish 2
Name: count, dtype: int64
Examples by ethnic category:
GreaterEuropean,British: Smith John, Johnson David, Davis Sarah
GreaterEuropean,WestEuropean,Hispanic: Martinez Carlos, Rodriguez Anna, Martin Daniel
GreaterEuropean,WestEuropean,French: Adams Rachel, Carter Lauren, Sanchez Christina
GreaterEuropean,WestEuropean,Italian: Garcia Maria, Peterson Andrea
GreaterEuropean,Jewish: Phillips Zachary, Parker Aaron
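Because the category names encode a hierarchy, the fine-grained probabilities can be rolled up to the top-level groups by summing columns that share a first component. The sketch below uses column names taken from the Wikipedia model output, but the probability values are made up for illustration; they are not real model output.

```python
import pandas as pd

# Column names come from the Wikipedia model output; the values are hypothetical.
probs = pd.DataFrame(
    {
        "Asian,GreaterEastAsian,EastAsian": [0.01, 0.02],
        "Asian,IndianSubContinent": [0.01, 0.03],
        "GreaterAfrican,Muslim": [0.02, 0.05],
        "GreaterEuropean,British": [0.80, 0.10],
        "GreaterEuropean,WestEuropean,Hispanic": [0.10, 0.70],
        "GreaterEuropean,WestEuropean,Italian": [0.06, 0.10],
    },
    index=["Smith John", "Martinez Carlos"],
)

# The top-level group is the part of each column name before the first comma.
top_level = probs.columns.str.split(",").str[0].to_numpy()

# Transpose so categories are rows, sum within each group, transpose back.
broad = probs.T.groupby(top_level).sum().T

print(broad)
```

The rolled-up probabilities are often more stable than the fine-grained ones when a name sits between two closely related subcategories.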
Model Selection Guidelines
Choose the right model for your use case:
- Census lookup: best for aggregate statistics and population-level analysis.
- Census LSTM: a good baseline for individual predictions, with four broad categories (white, black, api, hispanic).
- Wikipedia models: best for detailed ethnic categories; they work well with diverse international names.
- Florida models: good for US-focused applications; trained on actual voter registration data.
- 5-category models: include an ‘other’ category for better coverage of mixed or unknown ethnicities.
Always consider the confidence scores and validate results on your specific dataset.
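Validating on your own data can be as simple as comparing predictions with known labels for a small sample. The sketch below assumes a hypothetical labelled sample; the `actual` column stands in for ground-truth labels you would supply yourself, and none of the values are real results.

```python
import pandas as pd

# Hypothetical labelled sample: 'actual' is ground truth, 'predicted' is model output.
validation = pd.DataFrame({
    "actual":    ["nh_white", "nh_white", "hispanic", "hispanic", "nh_black", "nh_black"],
    "predicted": ["nh_white", "nh_white", "hispanic", "nh_white", "nh_black", "nh_white"],
})

# Overall accuracy: share of rows where the prediction matches the label.
accuracy = (validation["actual"] == validation["predicted"]).mean()

# Confusion matrix: rows are actual labels, columns are predicted labels.
confusion = pd.crosstab(validation["actual"], validation["predicted"])

# Per-category recall: share of each actual group that was predicted correctly.
recall = (
    validation.assign(correct=validation["actual"] == validation["predicted"])
    .groupby("actual")["correct"]
    .mean()
)

print(f"Accuracy: {accuracy:.2f}")
print(confusion)
print(recall)
```

Per-category recall matters because overall accuracy can look acceptable while one group is misclassified far more often than the others.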