Advanced Prediction Models

This notebook demonstrates ethnicity prediction with the Wikipedia and Florida voter registration models, including per-prediction confidence scores and detailed ethnic categories.

Setup

Load the required libraries and sample data.

[1]:
import pandas as pd
import ethnicolr
from pathlib import Path

# Load sample data
data_path = Path('data/input-with-header.csv')

try:
    df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")

print(f"Sample data shape: {df.shape}")
print("\nFirst few rows:")
df.head()
Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)

First few rows:
[1]:
  first_name last_name
0       John     Smith
1      Maria    Garcia
2      David   Johnson
3      Sarah     Davis
4    Michael     Brown
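
Before running a model it can help to confirm that the name columns exist and hold usable strings. The following is a minimal sketch using pandas alone; `prepare_names` is a hypothetical helper written for this illustration and is not part of ethnicolr.

```python
import pandas as pd


def prepare_names(df, columns=("first_name", "last_name")):
    """Return a copy of df with trimmed name columns and blank rows dropped.

    Hypothetical helper for illustration; not part of ethnicolr.
    """
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")

    cleaned = df.copy()
    for col in columns:
        # Treat missing values as empty strings, then trim whitespace
        cleaned[col] = cleaned[col].fillna("").astype(str).str.strip()

    # Keep only rows where every name column is non-empty
    has_all_names = (cleaned[list(columns)] != "").all(axis=1)
    return cleaned[has_all_names].reset_index(drop=True)


# Made-up rows that exercise the trimming and filtering
raw = pd.DataFrame({
    "first_name": [" John", "Maria", None],
    "last_name": ["Smith ", "", "Brown"],
})
clean = prepare_names(raw)
print(clean)
```

Only rows with both names present survive, so the row count passed to a model may be smaller than the input.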

Wikipedia-based Predictions

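The Wikipedia models predict 13 hierarchical ethnic categories (for example, GreaterEuropean,WestEuropean,Hispanic), far more granular than the four or five broad classes of the other models, and can use the full name rather than the last name alone.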

[2]:
# Predict using Wikipedia model with full names
wiki_result = ethnicolr.pred_wiki_name(df, 'last_name', 'first_name')
print(f"Wikipedia prediction result shape: {wiki_result.shape}")
print("\nColumns added:")
wiki_cols = [col for col in wiki_result.columns if col not in df.columns]
print(wiki_cols)

# Show detailed predictions
wiki_result[['first_name', 'last_name', 'race', '__name']].head(10)
2025-12-27 22:21:33,965 - INFO - Processing 62 names
2025-12-27 22:21:33,971 - INFO - Applying Wikipedia name model to 62 processable names (confidence interval: 1.0)
2025-12-27 22:21:33,971 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:34,211 - INFO - Successfully predicted 62 of 62 names (100.0%)
2025-12-27 22:21:34,212 - INFO - Added columns: name_normalized, GreaterEuropean,WestEuropean,Hispanic, GreaterEuropean,British, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,Jewish, GreaterAfrican,Muslim, GreaterEuropean,WestEuropean,Italian, GreaterAfrican,Africans, processing_status, GreaterEuropean,EastEuropean, Asian,IndianSubContinent, __name, Asian,GreaterEastAsian,Japanese, Asian,GreaterEastAsian,EastAsian, name_normalized_clean, GreaterEuropean,WestEuropean,French, race, GreaterEuropean,WestEuropean,Germanic
Wikipedia prediction result shape: (62, 20)

Columns added:
['__name', 'name_normalized', 'name_normalized_clean', 'processing_status', 'Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic', 'race']
[2]:
  first_name last_name                                   race           __name
0       John     Smith                GreaterEuropean,British       Smith John
1      Maria    Garcia   GreaterEuropean,WestEuropean,Italian     Garcia Maria
2      David   Johnson                GreaterEuropean,British    Johnson David
3      Sarah     Davis                GreaterEuropean,British      Davis Sarah
4    Michael     Brown                GreaterEuropean,British    Brown Michael
5   Jennifer    Wilson                GreaterEuropean,British  Wilson Jennifer
6     Carlos  Martinez  GreaterEuropean,WestEuropean,Hispanic  Martinez Carlos
7       Lisa  Anderson                GreaterEuropean,British    Anderson Lisa
8      James    Taylor                GreaterEuropean,British     Taylor James
9       Anna Rodriguez  GreaterEuropean,WestEuropean,Hispanic   Rodriguez Anna
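
The added columns mix bookkeeping fields with one column per ethnic category. A small sketch of separating the two follows, assuming that every category column name contains a comma; that holds for the column list printed above but is not a documented guarantee.

```python
# Columns added by the Wikipedia model, copied from the output above
added_cols = [
    "__name", "name_normalized", "name_normalized_clean", "processing_status",
    "Asian,GreaterEastAsian,EastAsian", "Asian,GreaterEastAsian,Japanese",
    "Asian,IndianSubContinent", "GreaterAfrican,Africans", "GreaterAfrican,Muslim",
    "GreaterEuropean,British", "GreaterEuropean,EastEuropean", "GreaterEuropean,Jewish",
    "GreaterEuropean,WestEuropean,French", "GreaterEuropean,WestEuropean,Germanic",
    "GreaterEuropean,WestEuropean,Hispanic", "GreaterEuropean,WestEuropean,Italian",
    "GreaterEuropean,WestEuropean,Nordic", "race",
]

# Assumption: category columns are exactly those whose name contains a comma
category_cols = [col for col in added_cols if "," in col]
metadata_cols = [col for col in added_cols if "," not in col]

print(f"{len(category_cols)} category columns")
print(f"Metadata columns: {metadata_cols}")
```

With the real result, `wiki_result[category_cols]` would then select only the per-category scores, which is the same kind of selection the Confidence Analysis section performs for the Florida model.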

Florida Voter Registration Models

The Florida models are trained on Florida voter registration data and offer both 4-category (asian, hispanic, nh_black, nh_white) and 5-category (adds other) predictions.

[3]:
# Standard 4-category Florida model
fl_result = ethnicolr.pred_fl_reg_name(df, 'last_name', 'first_name')
print("Florida 4-category predictions:")
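# Note: a notebook renders only the last expression of a cell, so this table
# is not shown below; wrap it in print() or display() to see it.
fl_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head()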

print("\nRace distribution (Florida model):")
print(fl_result['race'].value_counts())
2025-12-27 22:21:34,222 - INFO - Processing 62 full names
2025-12-27 22:21:34,227 - INFO - Applying Florida voter name model to 62 processable names (confidence interval: 1.0)
2025-12-27 22:21:34,227 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:34,393 - INFO - Successfully predicted 62 of 62 names (100.0%)
2025-12-27 22:21:34,393 - INFO - Added columns: nh_white, name_normalized, asian, __name, name_normalized_clean, nh_black, processing_status, race, hispanic
Florida 4-category predictions:

Race distribution (Florida model):
race
nh_white    58
hispanic     3
nh_black     1
Name: count, dtype: int64
[4]:
# 5-category Florida model (includes 'other' category)
fl5_result = ethnicolr.pred_fl_reg_name_five_cat(df, 'last_name', 'first_name')
print("Florida 5-category predictions:")
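# Note: not rendered below because it is not the cell's last expression;
# wrap it in print() or display() to see it.
fl5_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white', 'other']].head()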

print("\nRace distribution (Florida 5-category):")
print(fl5_result['race'].value_counts())
2025-12-27 22:21:34,401 - INFO - Generating full names from columns: last_name, first_name
2025-12-27 22:21:34,402 - INFO - Using Florida 5-category model for year 2022
2025-12-27 22:21:34,403 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
Florida 5-category predictions:

Race distribution (Florida 5-category):
race
nh_white    34
nh_black    18
hispanic     9
other        1
Name: count, dtype: int64
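
The two Florida models label the same 62 names quite differently. The sketch below lines up the two distributions printed above and computes the per-class change; the counts are copied from those outputs, and the column names `four_cat`, `five_cat`, and `change` are chosen here for illustration.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above
four_cat = pd.Series({"nh_white": 58, "hispanic": 3, "nh_black": 1})
five_cat = pd.Series({"nh_white": 34, "nh_black": 18, "hispanic": 9, "other": 1})

# Align on the union of class labels; a class absent from a model counts as 0
side_by_side = (
    pd.DataFrame({"four_cat": four_cat, "five_cat": five_cat})
    .fillna(0)
    .astype(int)
)
side_by_side["change"] = side_by_side["five_cat"] - side_by_side["four_cat"]
print(side_by_side)
```

A shift this large between two models on the same input is a reason to inspect individual predictions rather than rely on either distribution alone.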

Last Name Only Predictions

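When only last names are available, the last-name models still return a prediction for every row, though with less signal. For example, the Florida last-name model gives Smith a probability of 0.73 for nh_white, compared with 0.93 for John Smith under the full-name model.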

[5]:
# Wikipedia last name model
wiki_ln = ethnicolr.pred_wiki_ln(df, 'last_name')
print("Wikipedia last name predictions:")
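# Note: not rendered below because it is not the cell's last expression;
# wrap it in print() or display() to see it.
wiki_ln[['last_name', 'race']].head(10)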

# Florida last name model
fl_ln = ethnicolr.pred_fl_reg_ln(df, 'last_name')
print("\nFlorida last name predictions:")
fl_ln[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head(10)
2025-12-27 22:21:34,558 - INFO - Processing 62 last names
2025-12-27 22:21:34,560 - INFO - Applying Wikipedia last name model to 62 processable names (confidence interval: 1.0)
2025-12-27 22:21:34,561 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:34,704 - INFO - Successfully predicted 62 of 62 names (100.0%)
2025-12-27 22:21:34,705 - INFO - Added columns: name_normalized, GreaterEuropean,WestEuropean,Hispanic, GreaterEuropean,British, GreaterEuropean,WestEuropean,Nordic, GreaterEuropean,Jewish, GreaterAfrican,Muslim, GreaterEuropean,WestEuropean,Italian, GreaterAfrican,Africans, processing_status, GreaterEuropean,EastEuropean, Asian,IndianSubContinent, Asian,GreaterEastAsian,Japanese, Asian,GreaterEastAsian,EastAsian, GreaterEuropean,WestEuropean,French, race, GreaterEuropean,WestEuropean,Germanic
2025-12-27 22:21:34,706 - INFO - Predicting race/ethnicity for 62 rows using Florida LSTM model
2025-12-27 22:21:34,707 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
Wikipedia last name predictions:
2025-12-27 22:21:34,847 - INFO - Prediction complete. Added columns: nh_white, asian, nh_black, race, hispanic

Florida last name predictions:
[5]:
  last_name      race     asian  hispanic  nh_black  nh_white
0     Smith  nh_white  0.004512  0.017937  0.251722  0.725829
1    Garcia  hispanic  0.006059  0.883960  0.010610  0.099372
2   Johnson  nh_white  0.003667  0.013745  0.424924  0.557664
3     Davis  nh_white  0.007555  0.011607  0.379582  0.601256
4     Brown  nh_white  0.003721  0.008477  0.474747  0.513055
5    Wilson  nh_white  0.004638  0.016631  0.333033  0.645697
6  Martinez  hispanic  0.003296  0.888409  0.011035  0.097260
7  Anderson  nh_white  0.009505  0.013844  0.239017  0.737635
8    Taylor  nh_white  0.005646  0.015479  0.271970  0.706904
9 Rodriguez  hispanic  0.003506  0.895370  0.008677  0.092447
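
A predicted label can hide a near tie between two classes. One way to surface this, sketched below, is the margin between the largest and second-largest probability in each row. The three rows are copied from the table above; `top_two_margin` is a hypothetical helper, not an ethnicolr function.

```python
import pandas as pd

# Probabilities for three last names, copied from the output above
probs = pd.DataFrame(
    {
        "asian": [0.004512, 0.006059, 0.003721],
        "hispanic": [0.017937, 0.883960, 0.008477],
        "nh_black": [0.251722, 0.010610, 0.474747],
        "nh_white": [0.725829, 0.099372, 0.513055],
    },
    index=["Smith", "Garcia", "Brown"],
)


def top_two_margin(row):
    """Difference between the largest and second-largest value in a row."""
    top, runner_up = row.nlargest(2)
    return top - runner_up


margins = probs.apply(top_two_margin, axis=1)
print(margins.sort_values())
```

Brown is labeled nh_white, yet its margin over nh_black is under 0.04, so that label carries much less certainty than Garcia's.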

Model Comparison

Let’s compare predictions across different models for the same names.

[6]:
# Create comparison dataframe
comparison = pd.DataFrame({
    'name': df['first_name'] + ' ' + df['last_name'],
    'census': ethnicolr.pred_census_ln(df, 'last_name')['race'],
    'wiki_fullname': wiki_result['race'],
    'wiki_lastname': wiki_ln['race'],
    'florida_4cat': fl_result['race'],
    'florida_5cat': fl5_result['race']
})

print("Model comparison (first 15 names):")
comparison.head(15)
2025-12-27 22:21:34,859 - INFO - Processing 62 names using Census 2010 LSTM model
2025-12-27 22:21:34,860 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:35,003 - INFO - Predicted 62 of 62 rows (100.0%)
2025-12-27 22:21:35,003 - INFO - Added columns: black, white, api, race, hispanic
Model comparison (first 15 names):
[6]:
               name    census                          wiki_fullname                          wiki_lastname  florida_4cat  florida_5cat
 0       John Smith     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_white
 1     Maria Garcia  hispanic   GreaterEuropean,WestEuropean,Italian  GreaterEuropean,WestEuropean,Hispanic      hispanic      hispanic
 2    David Johnson     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_white
 3      Sarah Davis     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_white
 4    Michael Brown     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_black
 5  Jennifer Wilson     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_white
 6  Carlos Martinez  hispanic  GreaterEuropean,WestEuropean,Hispanic  GreaterEuropean,WestEuropean,Hispanic      hispanic      hispanic
 7    Lisa Anderson     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_white
 8     James Taylor     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_white
 9   Anna Rodriguez  hispanic  GreaterEuropean,WestEuropean,Hispanic  GreaterEuropean,WestEuropean,Hispanic      nh_white      hispanic
10    Robert Thomas     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_white
11   Ashley Jackson     black                GreaterEuropean,British                GreaterEuropean,British      nh_black      nh_black
12      Kevin White     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_black
13  Michelle Harris     white                GreaterEuropean,British                GreaterEuropean,British      nh_white      nh_black
14    Daniel Martin     white  GreaterEuropean,WestEuropean,Hispanic                GreaterEuropean,British      nh_white      nh_white
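
To find the names on which the models disagree, the Census and Florida labels first need a shared vocabulary. The sketch below uses five rows copied from the table above. The label mapping (`nh_white` to `white`, `nh_black` to `black`, `asian` to `api`) is an assumption made for illustration; the two schemes may not be strictly equivalent, and the Wikipedia labels are left out because they use a different taxonomy.

```python
import pandas as pd

# Rows copied from the comparison table above
comparison = pd.DataFrame(
    {
        "census": ["white", "hispanic", "white", "hispanic", "black"],
        "florida_4cat": ["nh_white", "hispanic", "nh_white", "nh_white", "nh_black"],
        "florida_5cat": ["nh_white", "hispanic", "nh_black", "hispanic", "nh_black"],
    },
    index=["John Smith", "Maria Garcia", "Michael Brown", "Anna Rodriguez", "Ashley Jackson"],
)

# Assumed mapping from Florida labels to Census labels (illustrative only)
to_census = {"nh_white": "white", "nh_black": "black", "asian": "api"}
harmonized = comparison.replace(to_census)

# A row agrees when all three models give the same harmonized label
all_agree = harmonized.nunique(axis=1) == 1

print("Names with disagreement:")
print(harmonized[~all_agree])
print(f"Agreement rate: {all_agree.mean():.0%}")
```

Rows flagged here are natural candidates for manual review or for a more conservative "uncertain" label.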

Confidence Analysis

Let’s examine the confidence scores to understand prediction certainty.

[7]:
# Calculate max probability (confidence) for each prediction
fl_result['max_prob'] = fl_result[['asian', 'hispanic', 'nh_black', 'nh_white']].max(axis=1)

# Show high vs low confidence predictions
high_conf = fl_result[fl_result['max_prob'] > 0.8]
low_conf = fl_result[fl_result['max_prob'] < 0.5]

print(f"High confidence predictions (>80%): {len(high_conf)} names")
print("Examples:")
print(high_conf[['first_name', 'last_name', 'race', 'max_prob']].head())

print(f"\nLow confidence predictions (<50%): {len(low_conf)} names")
print("Examples:")
print(low_conf[['first_name', 'last_name', 'race', 'max_prob']].head())
High confidence predictions (>80%): 37 names
Examples:
  first_name last_name      race  max_prob
0       John     Smith  nh_white  0.931000
1      Maria    Garcia  hispanic  0.829117
4    Michael     Brown  nh_white  0.851771
5   Jennifer    Wilson  nh_white  0.852773
6     Carlos  Martinez  hispanic  0.908851

Low confidence predictions (<50%): 2 names
Examples:
   first_name last_name      race  max_prob
35      Kayla     Perez  nh_white  0.483368
53    Vanessa    Bailey  nh_white  0.413277
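
One common use of such a score is to withhold a label when the model is unsure. The sketch below applies that idea to four rows copied from the output above. The 0.5 cutoff mirrors the low-confidence threshold used in the cell; the `uncertain` label and the `race_or_uncertain` column name are choices made for this illustration.

```python
import pandas as pd

# Rows copied from the high- and low-confidence examples above
preds = pd.DataFrame({
    "name": ["John Smith", "Maria Garcia", "Kayla Perez", "Vanessa Bailey"],
    "race": ["nh_white", "hispanic", "nh_white", "nh_white"],
    "max_prob": [0.931000, 0.829117, 0.483368, 0.413277],
})

THRESHOLD = 0.5  # same cutoff as the low-confidence filter above

# Keep the predicted label only where the top probability clears the cutoff
preds["race_or_uncertain"] = preds["race"].where(
    preds["max_prob"] >= THRESHOLD, "uncertain"
)
print(preds)
```

Raising the threshold trades coverage for reliability, so the right value depends on how costly a wrong label is in the application.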

Detailed Ethnic Categories (Wikipedia)

The Wikipedia model provides much more granular ethnic predictions.

[8]:
# Show detailed ethnic categories from Wikipedia model
print("Detailed ethnic categories from Wikipedia model:")
ethnic_dist = wiki_result['race'].value_counts()
print(ethnic_dist)

# Show examples of detailed categories
print("\nExamples by ethnic category:")
for category in ethnic_dist.head(5).index:
    examples = wiki_result[wiki_result['race'] == category]['__name'].head(3).tolist()
    print(f"{category}: {', '.join(examples)}")
Detailed ethnic categories from Wikipedia model:
race
GreaterEuropean,British                  52
GreaterEuropean,WestEuropean,Hispanic     3
GreaterEuropean,WestEuropean,French       3
GreaterEuropean,WestEuropean,Italian      2
GreaterEuropean,Jewish                    2
Name: count, dtype: int64

Examples by ethnic category:
GreaterEuropean,British: Smith John, Johnson David, Davis Sarah
GreaterEuropean,WestEuropean,Hispanic: Martinez Carlos, Rodriguez Anna, Martin Daniel
GreaterEuropean,WestEuropean,French: Adams Rachel, Carter Lauren, Sanchez Christina
GreaterEuropean,WestEuropean,Italian: Garcia Maria, Peterson Andrea
GreaterEuropean,Jewish: Phillips Zachary, Parker Aaron
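
Because the labels are hierarchical, the counts can also be rolled up to a coarser level. The sketch below groups the distribution printed above by the second component of each label; the `second_level` column name is introduced here for illustration.

```python
import pandas as pd

# Distribution copied from the value_counts() output above
dist = pd.DataFrame({
    "race": [
        "GreaterEuropean,British",
        "GreaterEuropean,WestEuropean,Hispanic",
        "GreaterEuropean,WestEuropean,French",
        "GreaterEuropean,WestEuropean,Italian",
        "GreaterEuropean,Jewish",
    ],
    "count": [52, 3, 3, 2, 2],
})

# Second component of 'Top,Second[,Leaf]', e.g. 'British' or 'WestEuropean'
dist["second_level"] = dist["race"].str.split(",").str[1]

rolled_up = (
    dist.groupby("second_level")["count"].sum().sort_values(ascending=False)
)
print(rolled_up)
```

Rolling up can be useful when the leaf categories are too sparse to analyze individually, as with the two- and three-name groups here.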

Model Selection Guidelines

Choose the right model for your use case:

  • Census lookup: Best for aggregate statistics, population-level analysis

  • Census LSTM: Good baseline for individual predictions, 4 broad categories (white, black, api, hispanic)

  • Wikipedia models: Best for detailed ethnic categories, works well with diverse international names

  • Florida models: Good for US-focused applications, trained on actual voter data

  • 5-category models: Include ‘other’ for better coverage of mixed/unknown ethnicities

Always consider the confidence scores and validate results on your specific dataset.
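
For aggregate statistics, one way to take the confidence scores into account is to sum the class probabilities instead of counting predicted labels. The sketch below does both for the ten Florida last-name rows shown earlier. It assumes the model outputs can be read as reasonably calibrated probabilities, which should be checked on your own data.

```python
import pandas as pd

prob_cols = ["asian", "hispanic", "nh_black", "nh_white"]

# Florida last-name probabilities copied from the output shown earlier
probs = pd.DataFrame(
    [
        [0.004512, 0.017937, 0.251722, 0.725829],
        [0.006059, 0.883960, 0.010610, 0.099372],
        [0.003667, 0.013745, 0.424924, 0.557664],
        [0.007555, 0.011607, 0.379582, 0.601256],
        [0.003721, 0.008477, 0.474747, 0.513055],
        [0.004638, 0.016631, 0.333033, 0.645697],
        [0.003296, 0.888409, 0.011035, 0.097260],
        [0.009505, 0.013844, 0.239017, 0.737635],
        [0.005646, 0.015479, 0.271970, 0.706904],
        [0.003506, 0.895370, 0.008677, 0.092447],
    ],
    columns=prob_cols,
    index=["Smith", "Garcia", "Johnson", "Davis", "Brown",
           "Wilson", "Martinez", "Anderson", "Taylor", "Rodriguez"],
)

# Hard counts: one vote per name for its most likely class
argmax_counts = probs.idxmax(axis=1).value_counts()

# Soft counts: each name contributes its probability to every class
expected_counts = probs.sum()

print("Counts of predicted labels:")
print(argmax_counts)
print("\nExpected counts from summed probabilities:")
print(expected_counts.round(2))
```

The hard counts assign no names to nh_black, while the summed probabilities put its expected count at roughly 2.4 of the 10 names, which illustrates how label counts can understate classes that are rarely the single most likely option.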