# Advanced Prediction Models

This notebook demonstrates advanced ethnicity prediction using the Wikipedia and Florida voter registration models, including confidence scores and detailed ethnic categories.

## Setup

Load the required libraries and sample data.

In [None]:
import pandas as pd
import ethnicolr
from pathlib import Path

# Load sample data
data_path = Path('data/input-with-header.csv')

try:
    df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")

print(f"Sample data shape: {df.shape}")
print("\nFirst few rows:")
df.head()
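
Name-based models are sensitive to missing or padded values, so it can help to tidy the name columns before prediction. The following is a minimal plain-Python sketch of that idea; `clean_name` is a hypothetical helper introduced here for illustration and is not part of ethnicolr, and the rows are made-up examples.

```python
def clean_name(value):
    """Return a trimmed name with collapsed whitespace, or None if unusable."""
    # None and NaN (the only value not equal to itself) count as missing.
    if value is None or value != value:
        return None
    text = " ".join(str(value).split())
    return text or None


# Made-up rows, including padded, empty, and missing values.
rows = [
    {"first_name": " John ", "last_name": "Smith"},
    {"first_name": "Maria", "last_name": ""},
    {"first_name": None, "last_name": "Brown"},
]

# Keep only rows where both name fields survive cleaning.
cleaned = []
for row in rows:
    first = clean_name(row["first_name"])
    last = clean_name(row["last_name"])
    if first is not None and last is not None:
        cleaned.append({"first_name": first, "last_name": last})

print(cleaned)
```

With a DataFrame, the same helper could be applied with `df['last_name'].map(clean_name)` followed by `dropna` on the name columns.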

## Wikipedia-based Predictions

The Wikipedia models predict finer-grained ethnic categories than the Census and Florida models. The full-name variant used here takes both the first and last name as input.

In [None]:
# Predict using Wikipedia model with full names
wiki_result = ethnicolr.pred_wiki_name(df, 'last_name', 'first_name')
print(f"Wikipedia prediction result shape: {wiki_result.shape}")
print("\nColumns added:")
wiki_cols = [col for col in wiki_result.columns if col not in df.columns]
print(wiki_cols)

# Show detailed predictions
wiki_result[['first_name', 'last_name', 'race', '__name']].head(10)

## Florida Voter Registration Models

Florida models are trained on actual voter registration data and can provide both 4-category and 5-category predictions.

In [None]:
# Standard 4-category Florida model
fl_result = ethnicolr.pred_fl_reg_name(df, 'last_name', 'first_name')
print("Florida 4-category predictions:")
# Wrap in print(): a bare expression is only displayed when it is the last line of a cell
print(fl_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head())

print("\nRace distribution (Florida model):")
print(fl_result['race'].value_counts())

In [None]:
# 5-category Florida model (includes 'other' category)
fl5_result = ethnicolr.pred_fl_reg_name_five_cat(df, 'last_name', 'first_name')
print("Florida 5-category predictions:")
# Wrap in print(): a bare expression is only displayed when it is the last line of a cell
print(fl5_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white', 'other']].head())

print("\nRace distribution (Florida 5-category):")
print(fl5_result['race'].value_counts())

## Last Name Only Predictions

When only last names are available, the last-name-only models still return a prediction for every row, though without the extra signal that a first name provides.

In [None]:
# Wikipedia last name model
wiki_ln = ethnicolr.pred_wiki_ln(df, 'last_name')
print("Wikipedia last name predictions:")
# Wrap in print(): a bare expression is only displayed when it is the last line of a cell
print(wiki_ln[['last_name', 'race']].head(10))

# Florida last name model 
fl_ln = ethnicolr.pred_fl_reg_ln(df, 'last_name')
print("\nFlorida last name predictions:")
fl_ln[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head(10)

## Model Comparison

Let's compare predictions across different models for the same names.

In [None]:
# Create comparison dataframe
comparison = pd.DataFrame({
    'name': df['first_name'] + ' ' + df['last_name'],
    'census': ethnicolr.pred_census_ln(df, 'last_name')['race'],
    'wiki_fullname': wiki_result['race'],
    'wiki_lastname': wiki_ln['race'],
    'florida_4cat': fl_result['race'],
    'florida_5cat': fl5_result['race']
})

print("Model comparison (first 15 names):")
comparison.head(15)
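
A side-by-side table is easier to read with a single agreement number next to it. The sketch below computes the share of names on which two models give the same label. It is only meaningful for models that share a label set, so it uses the `florida_4cat` and `florida_5cat` column names from the comparison above. The label values are hypothetical, and `agreement_rate` is an illustrative helper, not an ethnicolr function.

```python
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of positions at which two equal-length label sequences match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for five names from two models with overlapping label sets.
predictions = {
    "florida_4cat": ["nh_white", "hispanic", "nh_white", "nh_white", "nh_black"],
    "florida_5cat": ["nh_white", "hispanic", "other", "nh_white", "nh_black"],
}

# Compare every pair of models in the dictionary.
for left, right in combinations(predictions, 2):
    rate = agreement_rate(predictions[left], predictions[right])
    print(f"{left} vs {right}: {rate:.0%} agreement")
```

On the real data, the two sequences could come from `comparison['florida_4cat'].tolist()` and `comparison['florida_5cat'].tolist()`.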

## Confidence Analysis

Let's examine the confidence scores to understand prediction certainty.

In [None]:
# Calculate max probability (confidence) for each prediction
fl_result['max_prob'] = fl_result[['asian', 'hispanic', 'nh_black', 'nh_white']].max(axis=1)

# Show high vs low confidence predictions
high_conf = fl_result[fl_result['max_prob'] > 0.8]
low_conf = fl_result[fl_result['max_prob'] < 0.5]

print(f"High confidence predictions (>80%): {len(high_conf)} names")
print("Examples:")
print(high_conf[['first_name', 'last_name', 'race', 'max_prob']].head())

print(f"\nLow confidence predictions (<50%): {len(low_conf)} names")
print("Examples:")
print(low_conf[['first_name', 'last_name', 'race', 'max_prob']].head())
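
The maximum probability is one view of certainty, but it says nothing about how close the runner-up was. As a hedged illustration, the sketch below adds two further measures: the margin between the top two probabilities and the Shannon entropy of the distribution. The probabilities are hypothetical and `confidence_summary` is an illustrative helper, not part of ethnicolr.

```python
import math


def confidence_summary(probs):
    """Summarise a dict of class probabilities for a single prediction."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_p), (_, second_p) = ranked[0], ranked[1]
    # Shannon entropy in bits: 0 means fully certain, log2(n) means uniform.
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return {
        "label": top_label,
        "max_prob": top_p,
        "margin": top_p - second_p,
        "entropy_bits": entropy,
    }


# Hypothetical probabilities for two names, keyed by the Florida column names.
confident = {"asian": 0.02, "hispanic": 0.03, "nh_black": 0.05, "nh_white": 0.90}
ambiguous = {"asian": 0.05, "hispanic": 0.10, "nh_black": 0.40, "nh_white": 0.45}

for name, probs in [("confident", confident), ("ambiguous", ambiguous)]:
    summary = confidence_summary(probs)
    print(name, summary["label"],
          f"margin={summary['margin']:.2f}",
          f"entropy={summary['entropy_bits']:.2f} bits")
```

A small margin flags names where the top label narrowly beat another category, even when the maximum probability alone looks acceptable.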

## Detailed Ethnic Categories (Wikipedia)

The Wikipedia model predicts more granular ethnic categories than the four or five broad categories of the Florida models.

In [None]:
# Show detailed ethnic categories from Wikipedia model
print("Detailed ethnic categories from Wikipedia model:")
ethnic_dist = wiki_result['race'].value_counts()
print(ethnic_dist)

# Show examples of detailed categories
print("\nExamples by ethnic category:")
for category in ethnic_dist.head(5).index:
    examples = wiki_result.loc[wiki_result['race'] == category, '__name'].head(3).tolist()
    print(f"{category}: {', '.join(examples)}")
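
Granular labels can be rolled up into broader groups when a coarser view is enough. The sketch below assumes that the Wikipedia labels are comma-separated paths running from broad to specific (for example `GreaterEuropean,British`). That format is an assumption to check against your own output. `broad_group` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def broad_group(label):
    """Return the first component of a comma-separated category label."""
    return label.split(",")[0].strip()


# Made-up labels; the comma-separated path format is an assumption
# about the model output, not something confirmed in this notebook.
labels = [
    "GreaterEuropean,British",
    "GreaterEuropean,WestEuropean,Hispanic",
    "Asian,GreaterEastAsian,EastAsian",
    "GreaterEuropean,British",
]

# Count how many labels fall under each top-level group.
group_counts = Counter(broad_group(label) for label in labels)
print(group_counts.most_common())
```

If the assumption holds, the same roll-up on the real predictions would be `wiki_result['race'].map(broad_group).value_counts()`.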

## Model Selection Guidelines

Choose the right model for your use case:

- **Census lookup**: Best for aggregate statistics, population-level analysis
- **Census LSTM**: Good baseline for individual predictions, 4 broad categories
- **Wikipedia models**: Best for detailed ethnic categories, works well with diverse international names
- **Florida models**: Good for US-focused applications, trained on actual voter data
- **5-category models**: Include 'other' for better coverage of mixed/unknown ethnicities

Always consider the confidence scores and validate results on your specific dataset.
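
One way to act on that advice, sketched here under stated assumptions, is to hand-label a small sample and measure accuracy at several confidence thresholds. The records below are hypothetical. `accuracy_at_threshold` is an illustrative helper rather than an ethnicolr function, and `max_prob` follows the column computed in the confidence analysis.

```python
def accuracy_at_threshold(records, threshold):
    """Return (coverage, accuracy) for predictions at or above the threshold."""
    kept = [r for r in records if r["max_prob"] >= threshold]
    coverage = len(kept) / len(records) if records else 0.0
    if not kept:
        # No predictions pass the threshold, so accuracy is undefined.
        return coverage, None
    correct = sum(r["predicted"] == r["actual"] for r in kept)
    return coverage, correct / len(kept)


# Hypothetical hand-labelled validation records.
records = [
    {"predicted": "nh_white", "actual": "nh_white", "max_prob": 0.92},
    {"predicted": "hispanic", "actual": "hispanic", "max_prob": 0.88},
    {"predicted": "nh_white", "actual": "nh_black", "max_prob": 0.55},
    {"predicted": "nh_black", "actual": "nh_black", "max_prob": 0.61},
    {"predicted": "asian", "actual": "nh_white", "max_prob": 0.41},
]

for threshold in (0.0, 0.5, 0.8):
    coverage, accuracy = accuracy_at_threshold(records, threshold)
    accuracy_text = "n/a" if accuracy is None else f"{accuracy:.0%}"
    print(f"threshold={threshold:.1f} coverage={coverage:.0%} accuracy={accuracy_text}")
```

Raising the threshold trades coverage for accuracy, so the right cut-off depends on how many names the analysis can afford to leave unclassified.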