Advanced Prediction Models

This notebook demonstrates name-based ethnicity prediction with ethnicolr's Wikipedia and Florida voter registration models. It covers detailed ethnic categories, last-name-only prediction, a cross-model comparison, and per-prediction confidence scores.

Setup

Load the required libraries and sample data.

[1]:

import pandas as pd
import ethnicolr
from pathlib import Path

# Load sample data
data_path = Path('data/input-with-header.csv')

try:
    df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")

print(f"Sample data shape: {df.shape}")
print("\nFirst few rows:")
df.head()

Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)

First few rows:

[1]:

	first_name	last_name
0	John	Smith
1	Maria	Garcia
2	David	Johnson
3	Sarah	Davis
4	Michael	Brown

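Wikipedia-based Predictions

The Wikipedia models predict 13 fine-grained categories, written as comma-separated paths such as GreaterEuropean,British or Asian,IndianSubContinent. The full-name variant, pred_wiki_name, uses both the last name and the first name.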

[2]:

# Predict using Wikipedia model with full names
wiki_result = ethnicolr.pred_wiki_name(df, 'last_name', 'first_name')
print(f"Wikipedia prediction result shape: {wiki_result.shape}")
print("\nColumns added:")
wiki_cols = [col for col in wiki_result.columns if col not in df.columns]
print(wiki_cols)

# Show detailed predictions
wiki_result[['first_name', 'last_name', 'race', '__name']].head(10)

2026-07-11 20:54:48,329 - INFO - Processing 62 names
2026-07-11 20:54:48,334 - INFO - Applying Wikipedia name model to 62 processable names (confidence interval: 1.0)
2026-07-11 20:54:48,335 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2026-07-11 20:54:48,570 - INFO - Successfully predicted 62 of 62 names (100.0%)
2026-07-11 20:54:48,571 - INFO - Added columns: GreaterEuropean,British, processing_status, race, Asian,GreaterEastAsian,EastAsian, GreaterAfrican,Africans, GreaterEuropean,WestEuropean,French, name_normalized, GreaterAfrican,Muslim, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterEuropean,Jewish, GreaterEuropean,EastEuropean, __name, Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, GreaterEuropean,WestEuropean,Germanic, name_normalized_clean, GreaterEuropean,WestEuropean,Nordic

Wikipedia prediction result shape: (62, 20)

Columns added:
['__name', 'name_normalized', 'name_normalized_clean', 'processing_status', 'Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic', 'race']

[2]:

	first_name	last_name	race	__name
0	John	Smith	GreaterEuropean,British	Smith John
1	Maria	Garcia	GreaterEuropean,WestEuropean,Italian	Garcia Maria
2	David	Johnson	GreaterEuropean,British	Johnson David
3	Sarah	Davis	GreaterEuropean,British	Davis Sarah
4	Michael	Brown	GreaterEuropean,British	Brown Michael
5	Jennifer	Wilson	GreaterEuropean,British	Wilson Jennifer
6	Carlos	Martinez	GreaterEuropean,WestEuropean,Hispanic	Martinez Carlos
7	Lisa	Anderson	GreaterEuropean,British	Anderson Lisa
8	James	Taylor	GreaterEuropean,British	Taylor James
9	Anna	Rodriguez	GreaterEuropean,WestEuropean,Hispanic	Rodriguez Anna

Florida Voter Registration Models

The Florida models are trained on Florida voter registration data. The standard model predicts four categories (asian, hispanic, nh_black, nh_white); the 5-category variant adds other.

[3]:

# Standard 4-category Florida model
fl_result = ethnicolr.pred_fl_reg_name(df, 'last_name', 'first_name')
print("Florida 4-category predictions:")
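# Note: a notebook renders only the last expression of a cell, so this mid-cell
# .head() displays nothing; wrap it in print() or display() to show the table.
fl_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head()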

print("\nRace distribution (Florida model):")
print(fl_result['race'].value_counts())

2026-07-11 20:54:48,582 - INFO - Processing 62 full names
2026-07-11 20:54:48,586 - INFO - Applying Florida voter name model to 62 processable names (confidence interval: 1.0)
2026-07-11 20:54:48,587 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2026-07-11 20:54:48,752 - INFO - Successfully predicted 62 of 62 names (100.0%)
2026-07-11 20:54:48,752 - INFO - Added columns: name_normalized_clean, __name, processing_status, race, hispanic, nh_black, nh_white, asian, name_normalized

Florida 4-category predictions:

Race distribution (Florida model):
race
nh_white    58
hispanic     3
nh_black     1
Name: count, dtype: int64

[4]:

# 5-category Florida model (includes 'other' category)
fl5_result = ethnicolr.pred_fl_reg_name_five_cat(df, 'last_name', 'first_name')
print("Florida 5-category predictions:")
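# Note: a notebook renders only the last expression of a cell, so this mid-cell
# .head() displays nothing; wrap it in print() or display() to show the table.
fl5_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white', 'other']].head()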

print("\nRace distribution (Florida 5-category):")
print(fl5_result['race'].value_counts())

2026-07-11 20:54:48,759 - INFO - Generating full names from columns: last_name, first_name
2026-07-11 20:54:48,761 - INFO - Using Florida 5-category model for year 2022
2026-07-11 20:54:48,762 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)

Florida 5-category predictions:

Race distribution (Florida 5-category):
race
nh_white    34
nh_black    18
hispanic     9
other        1
Name: count, dtype: int64
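
To compare the two distributions above on a common scale, the counts can be converted to shares. The following is a minimal sketch that re-enters the printed counts by hand; the names counts_4cat, counts_5cat, and shares are illustrative and not part of ethnicolr.

```python
import pandas as pd

# Counts copied by hand from the two value_counts() outputs above.
counts_4cat = pd.Series({"nh_white": 58, "hispanic": 3, "nh_black": 1})
counts_5cat = pd.Series({"nh_white": 34, "nh_black": 18, "hispanic": 9, "other": 1})

# Align the two models on category; a category a model never predicted becomes 0.
counts = (
    pd.DataFrame({"florida_4cat": counts_4cat, "florida_5cat": counts_5cat})
    .fillna(0)
    .astype(int)
)

# Convert counts to each model's share of the 62 names.
shares = (counts / counts.sum()).round(3)
print(shares)
```

In the live notebook, fl_result['race'].value_counts(normalize=True) should yield the same shares directly. On this sample, the nh_white share drops from about 0.94 under the 4-category model to about 0.55 under the 5-category model, so the two models are not interchangeable.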

Last Name Only Predictions

When only last names are available, the last-name-only variants of the Wikipedia and Florida models (pred_wiki_ln and pred_fl_reg_ln) can be used instead.

[5]:

# Wikipedia last name model
wiki_ln = ethnicolr.pred_wiki_ln(df, 'last_name')
print("Wikipedia last name predictions:")
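# Note: a notebook renders only the last expression of a cell, so this mid-cell
# .head(10) displays nothing; wrap it in print() or display() to show the table.
wiki_ln[['last_name', 'race']].head(10)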

# Florida last name model
fl_ln = ethnicolr.pred_fl_reg_ln(df, 'last_name')
print("\nFlorida last name predictions:")
fl_ln[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head(10)

2026-07-11 20:54:48,917 - INFO - Processing 62 last names
2026-07-11 20:54:48,919 - INFO - Applying Wikipedia last name model to 62 processable names (confidence interval: 1.0)
2026-07-11 20:54:48,920 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2026-07-11 20:54:49,061 - INFO - Successfully predicted 62 of 62 names (100.0%)
2026-07-11 20:54:49,061 - INFO - Added columns: processing_status, GreaterEuropean,British, race, Asian,GreaterEastAsian,EastAsian, GreaterAfrican,Africans, GreaterEuropean,WestEuropean,French, GreaterAfrican,Muslim, GreaterEuropean,WestEuropean,Italian, GreaterEuropean,WestEuropean,Hispanic, GreaterEuropean,Jewish, GreaterEuropean,EastEuropean, Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, GreaterEuropean,WestEuropean,Germanic, name_normalized, GreaterEuropean,WestEuropean,Nordic
2026-07-11 20:54:49,062 - INFO - Predicting race/ethnicity for 62 rows using Florida LSTM model
2026-07-11 20:54:49,063 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)

Wikipedia last name predictions:

2026-07-11 20:54:49,206 - INFO - Prediction complete. Added columns: race, hispanic, nh_white, nh_black, asian


Florida last name predictions:

[5]:

	last_name	race	asian	hispanic	nh_black	nh_white
0	Smith	nh_white	0.004512	0.017937	0.251722	0.725829
1	Garcia	hispanic	0.006059	0.883960	0.010610	0.099372
2	Johnson	nh_white	0.003667	0.013745	0.424924	0.557664
3	Davis	nh_white	0.007555	0.011607	0.379582	0.601256
4	Brown	nh_white	0.003721	0.008477	0.474747	0.513055
5	Wilson	nh_white	0.004638	0.016631	0.333033	0.645697
6	Martinez	hispanic	0.003296	0.888409	0.011035	0.097260
7	Anderson	nh_white	0.009505	0.013844	0.239017	0.737635
8	Taylor	nh_white	0.005646	0.015479	0.271970	0.706904
9	Rodriguez	hispanic	0.003506	0.895370	0.008677	0.092447
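
In the rows above, the race label matches the column with the largest probability. The sketch below re-derives that label with idxmax and adds the margin between the two largest probabilities as a rough ambiguity signal. It re-enters four rows of the table by hand; top_label and margin are illustrative column names, and reading race as the argmax is an observation from this output, not a documented guarantee.

```python
import numpy as np
import pandas as pd

prob_cols = ["asian", "hispanic", "nh_black", "nh_white"]

# Four rows copied by hand from the Florida last-name table above.
fl_ln_sample = pd.DataFrame(
    {
        "last_name": ["Smith", "Garcia", "Johnson", "Brown"],
        "asian": [0.004512, 0.006059, 0.003667, 0.003721],
        "hispanic": [0.017937, 0.883960, 0.013745, 0.008477],
        "nh_black": [0.251722, 0.010610, 0.424924, 0.474747],
        "nh_white": [0.725829, 0.099372, 0.557664, 0.513055],
    }
)

probs = fl_ln_sample[prob_cols]

# Label with the highest probability in each row.
fl_ln_sample["top_label"] = probs.idxmax(axis=1)

# Sort each row ascending, then subtract the runner-up from the winner.
sorted_probs = np.sort(probs.to_numpy(), axis=1)
fl_ln_sample["margin"] = sorted_probs[:, -1] - sorted_probs[:, -2]

print(fl_ln_sample[["last_name", "top_label", "margin"]])
```

Brown has a margin of about 0.04 between nh_white and nh_black, so its label is far less certain than Garcia's, where the margin is about 0.78.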

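Model Comparison

Let’s compare predictions across different models for the same names. The models use different label vocabularies (for example, the Census model's white versus the Florida models' nh_white), so labels have to be harmonized before agreement can be measured.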

[6]:

# Create comparison dataframe
comparison = pd.DataFrame({
    'name': df['first_name'] + ' ' + df['last_name'],
    'census': ethnicolr.pred_census_ln(df, 'last_name')['race'],
    'wiki_fullname': wiki_result['race'],
    'wiki_lastname': wiki_ln['race'],
    'florida_4cat': fl_result['race'],
    'florida_5cat': fl5_result['race']
})

print("Model comparison (first 15 names):")
comparison.head(15)

2026-07-11 20:54:49,219 - INFO - Loading Census 2010 PyTorch model on cpu...
2026-07-11 20:54:49,226 - INFO - Predicting 62 names using Census 2010 PyTorch model
2026-07-11 20:54:49,233 - INFO - Predicted 62 of 62 rows

Model comparison (first 15 names):

[6]:

	name	census	wiki_fullname	wiki_lastname	florida_4cat	florida_5cat
0	John Smith	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_white
1	Maria Garcia	hispanic	GreaterEuropean,WestEuropean,Italian	GreaterEuropean,WestEuropean,Hispanic	hispanic	hispanic
2	David Johnson	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_white
3	Sarah Davis	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_white
4	Michael Brown	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_black
5	Jennifer Wilson	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_white
6	Carlos Martinez	hispanic	GreaterEuropean,WestEuropean,Hispanic	GreaterEuropean,WestEuropean,Hispanic	hispanic	hispanic
7	Lisa Anderson	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_white
8	James Taylor	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_white
9	Anna Rodriguez	hispanic	GreaterEuropean,WestEuropean,Hispanic	GreaterEuropean,WestEuropean,Hispanic	nh_white	hispanic
10	Robert Thomas	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_white
11	Ashley Jackson	black	GreaterEuropean,British	GreaterEuropean,British	nh_black	nh_black
12	Kevin White	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_black
13	Michelle Harris	white	GreaterEuropean,British	GreaterEuropean,British	nh_white	nh_black
14	Daniel Martin	white	GreaterEuropean,WestEuropean,Hispanic	GreaterEuropean,British	nh_white	nh_white
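
Reading agreement off the table by eye is error-prone, so a pairwise agreement rate can help. The sketch below re-enters the census, florida_4cat, and florida_5cat columns of the 15 rows above and maps the Florida labels onto the Census vocabulary. The mapping to_common and the helper agreement are assumptions made for illustration; the Wikipedia columns are left out because their categories have no one-to-one counterpart.

```python
import pandas as pd

# Three columns copied by hand from the 15-row comparison table above.
comparison_sample = pd.DataFrame(
    {
        "census": [
            "white", "hispanic", "white", "white", "white",
            "white", "hispanic", "white", "white", "hispanic",
            "white", "black", "white", "white", "white",
        ],
        "florida_4cat": [
            "nh_white", "hispanic", "nh_white", "nh_white", "nh_white",
            "nh_white", "hispanic", "nh_white", "nh_white", "nh_white",
            "nh_white", "nh_black", "nh_white", "nh_white", "nh_white",
        ],
        "florida_5cat": [
            "nh_white", "hispanic", "nh_white", "nh_white", "nh_black",
            "nh_white", "hispanic", "nh_white", "nh_white", "hispanic",
            "nh_white", "nh_black", "nh_black", "nh_black", "nh_white",
        ],
    }
)

# Assumed mapping from Florida labels to the Census vocabulary.
to_common = {"nh_white": "white", "nh_black": "black"}
harmonized = comparison_sample.apply(
    lambda col: col.map(lambda label: to_common.get(label, label))
)


def agreement(a: str, b: str) -> float:
    """Fraction of rows on which models a and b give the same harmonized label."""
    return float((harmonized[a] == harmonized[b]).mean())


pairs = [
    ("census", "florida_4cat"),
    ("census", "florida_5cat"),
    ("florida_4cat", "florida_5cat"),
]
rates = {pair: round(agreement(*pair), 3) for pair in pairs}
for pair, rate in rates.items():
    print(pair, rate)
```

On these 15 rows the two Florida models agree with each other least often (11 of 15), which is consistent with the different race distributions they produced earlier.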

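Confidence Analysis

The highest class probability in each row serves as a simple confidence score for that prediction. Let’s use it to separate high-confidence from low-confidence predictions.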

[7]:

# Calculate max probability (confidence) for each prediction
fl_result['max_prob'] = fl_result[['asian', 'hispanic', 'nh_black', 'nh_white']].max(axis=1)

# Show high vs low confidence predictions
high_conf = fl_result[fl_result['max_prob'] > 0.8]
low_conf = fl_result[fl_result['max_prob'] < 0.5]

print(f"High confidence predictions (>80%): {len(high_conf)} names")
print("Examples:")
print(high_conf[['first_name', 'last_name', 'race', 'max_prob']].head())

print(f"\nLow confidence predictions (<50%): {len(low_conf)} names")
print("Examples:")
print(low_conf[['first_name', 'last_name', 'race', 'max_prob']].head())

High confidence predictions (>80%): 37 names
Examples:
  first_name last_name      race  max_prob
0       John     Smith  nh_white  0.931000
1      Maria    Garcia  hispanic  0.829117
4    Michael     Brown  nh_white  0.851771
5   Jennifer    Wilson  nh_white  0.852773
6     Carlos  Martinez  hispanic  0.908851

Low confidence predictions (<50%): 2 names
Examples:
   first_name last_name      race  max_prob
35      Kayla     Perez  nh_white  0.483368
53    Vanessa    Bailey  nh_white  0.413277
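
The two filters above leave out the middle band: with 37 names above 0.8 and 2 below 0.5, the remaining 23 of the 62 fall in between. The sketch below assigns every row to a band using the same cut-offs, applied to the seven max_prob values printed above. confidence_band is a hypothetical helper, not an ethnicolr function.

```python
import pandas as pd


def confidence_band(p: float) -> str:
    """Same cut-offs as the cell above: > 0.8 is high, < 0.5 is low, else medium."""
    if p > 0.8:
        return "high"
    if p < 0.5:
        return "low"
    return "medium"


# The seven example rows printed above, re-entered by hand.
sample = pd.DataFrame(
    {
        "last_name": ["Smith", "Garcia", "Brown", "Wilson", "Martinez", "Perez", "Bailey"],
        "max_prob": [0.931000, 0.829117, 0.851771, 0.852773, 0.908851, 0.483368, 0.413277],
    }
)

sample["band"] = sample["max_prob"].map(confidence_band)
print(sample["band"].value_counts())
```

Applied to the full result, fl_result['max_prob'].map(confidence_band).value_counts() would report all three bands in one pass.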

Detailed Ethnic Categories (Wikipedia)

The race column of the Wikipedia result holds one fine-grained category per name. Counting its values shows how the sample is distributed across those categories.

[8]:

# Show detailed ethnic categories from Wikipedia model
print("Detailed ethnic categories from Wikipedia model:")
ethnic_dist = wiki_result['race'].value_counts()
print(ethnic_dist)

# Show examples of detailed categories
print("\nExamples by ethnic category:")
for category in ethnic_dist.head(5).index:
    examples = wiki_result[wiki_result['race'] == category]['__name'].head(3).tolist()
    print(f"{category}: {', '.join(examples)}")

Detailed ethnic categories from Wikipedia model:
race
GreaterEuropean,British                  52
GreaterEuropean,WestEuropean,Hispanic     3
GreaterEuropean,WestEuropean,French       3
GreaterEuropean,WestEuropean,Italian      2
GreaterEuropean,Jewish                    2
Name: count, dtype: int64

Examples by ethnic category:
GreaterEuropean,British: Smith John, Johnson David, Davis Sarah
GreaterEuropean,WestEuropean,Hispanic: Martinez Carlos, Rodriguez Anna, Martin Daniel
GreaterEuropean,WestEuropean,French: Adams Rachel, Carter Lauren, Sanchez Christina
GreaterEuropean,WestEuropean,Italian: Garcia Maria, Peterson Andrea
GreaterEuropean,Jewish: Phillips Zachary, Parker Aaron
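
Because each label is a comma-separated path, the counts above can be rolled up to the top-level group or reduced to the most specific level. The following is a minimal sketch on the printed counts, re-entered by hand; the variable names are illustrative.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above.
wiki_counts = pd.Series(
    {
        "GreaterEuropean,British": 52,
        "GreaterEuropean,WestEuropean,Hispanic": 3,
        "GreaterEuropean,WestEuropean,French": 3,
        "GreaterEuropean,WestEuropean,Italian": 2,
        "GreaterEuropean,Jewish": 2,
    }
)

# Split each label path into its levels.
labels = wiki_counts.index.to_series()
levels = labels.str.split(",")

top_level = levels.str[0]   # broadest group, e.g. GreaterEuropean
leaf = levels.str[-1]       # most specific level, e.g. British

# Sum the counts within each top-level group.
by_group = wiki_counts.groupby(top_level).sum()
print(by_group)
print(leaf.tolist())
```

On this sample every one of the 62 predictions falls under GreaterEuropean, so the differences between names show up only at the finer levels.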

Model Selection Guidelines

Choose the right model for your use case:

Census lookup: Best for aggregate statistics, population-level analysis
Census LSTM: Good baseline for individual predictions, 4 broad categories
Wikipedia models: Best for detailed ethnic categories, works well with diverse international names
Florida models: Good for US-focused applications, trained on actual voter data
5-category models: Include ‘other’ for better coverage of mixed/unknown ethnicities

Always consider the confidence scores and validate results on your specific dataset.
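
One way to validate on your own data is to hand-label a small sample and compare it with the predicted race column. The sketch below uses made-up labels purely for illustration; the validation frame and its true_race and pred_race columns are hypothetical and do not come from the notebook's data.

```python
import pandas as pd

# Hypothetical hand-labelled sample: true_race is the known label,
# pred_race stands in for a model's race column.
validation = pd.DataFrame(
    {
        "true_race": ["nh_white", "hispanic", "nh_black", "nh_white", "hispanic", "nh_black"],
        "pred_race": ["nh_white", "hispanic", "nh_white", "nh_white", "nh_white", "nh_black"],
    }
)

correct = validation["true_race"] == validation["pred_race"]

# Overall share of correct predictions.
accuracy = float(correct.mean())

# Share of each true class that the model recovered (per-class recall).
recall = correct.groupby(validation["true_race"]).mean()

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(validation["true_race"], validation["pred_race"])

print(f"accuracy: {accuracy:.3f}")
print(recall)
print(confusion)
```

Per-class recall matters here because overall accuracy can look acceptable while a smaller group is misclassified much more often than the majority group.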