Basic Usage Guide

This notebook demonstrates the fundamentals of ethnicolr2 with step-by-step examples for each of its three main prediction models.

Overview

ethnicolr2 provides three main prediction models:

Census Last Name: Predicts race/ethnicity from last names using US Census data
Florida Last Name: Predicts race/ethnicity from last names using Florida voter registration data
Florida Full Name: Predicts race/ethnicity from both first and last names using Florida voter registration data; generally the most accurate of the three

Let’s start by importing the necessary libraries and loading sample data.

[1]:

import pandas as pd
from pathlib import Path
import sys

# Import ethnicolr2 prediction functions
from ethnicolr2 import (
    census_ln,
    pred_fl_last_name,
    pred_fl_full_name,
    pred_census_last_name
)

print(f"Python version: {sys.version}")
print(f"pandas version: {pd.__version__}")
print("ethnicolr2 imported successfully!")

Python version: 3.11.15 (main, Mar  3 2026, 14:59:53) [Clang 21.1.4 ]
pandas version: 2.3.3
ethnicolr2 imported successfully!

Sample Data

Let’s create a sample dataset with diverse names to demonstrate the prediction capabilities:

[2]:

# Create sample data with diverse names
sample_data = {
    'first_name': ['John', 'Maria', 'Wei', 'Aisha', 'David', 'Priya', 'Carlos', 'Sarah'],
    'last_name': ['Smith', 'Rodriguez', 'Zhang', 'Johnson', 'Williams', 'Patel', 'Garcia', 'Kim'],
    'full_name': ['John Smith', 'Maria Rodriguez', 'Wei Zhang', 'Aisha Johnson',
                  'David Williams', 'Priya Patel', 'Carlos Garcia', 'Sarah Kim']
}

df = pd.DataFrame(sample_data)
print("Sample dataset:")
display(df)

Sample dataset:

	first_name	last_name	full_name
0	John	Smith	John Smith
1	Maria	Rodriguez	Maria Rodriguez
2	Wei	Zhang	Wei Zhang
3	Aisha	Johnson	Aisha Johnson
4	David	Williams	David Williams
5	Priya	Patel	Priya Patel
6	Carlos	Garcia	Carlos Garcia
7	Sarah	Kim	Sarah Kim

1. Census Last Name Predictions

The census model uses US Census data for predictions based on last names only:

[3]:

# Census last name predictions
census_results = pred_census_last_name(df.copy(), lname_col='last_name')

print("Census Last Name Model Results:")
print("=" * 50)

# Display the predicted category for each name
display_cols = ['first_name', 'last_name', 'preds']
display(census_results[display_cols])

# Show probability distribution for first few rows
print("\nProbability distributions (first 3 rows):")
for _, row in census_results.head(3).iterrows():
    print(f"{row['last_name']}: {row['probs']}")

/home/runner/work/ethnicolr2/ethnicolr2/.venv/lib/python3.11/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator CountVectorizer from version 1.2.2 when using version 1.5.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
100%|██████████| 1/1 [00:00<00:00, 127.52it/s]

Census Last Name Model Results:
==================================================

	first_name	last_name	preds
0	John	Smith	nh_white
1	Maria	Rodriguez	hispanic
2	Wei	Zhang	asian
3	Aisha	Johnson	nh_white
4	David	Williams	nh_black
5	Priya	Patel	nh_white
6	Carlos	Garcia	hispanic
7	Sarah	Kim	asian


Probability distributions (first 3 rows):
Smith: {'nh_white': np.float32(0.9581536), 'nh_black': np.float32(0.03215752), 'hispanic': np.float32(0.00064442545), 'asian': np.float32(0.007015482), 'other': np.float32(0.0020290206)}
Rodriguez: {'nh_white': np.float32(0.001115346), 'nh_black': np.float32(0.00011618709), 'hispanic': np.float32(0.9918202), 'asian': np.float32(0.0068077734), 'other': np.float32(0.0001404588)}
Zhang: {'nh_white': np.float32(0.11944395), 'nh_black': np.float32(0.012060731), 'hispanic': np.float32(0.0039124084), 'asian': np.float32(0.8631792), 'other': np.float32(0.0014036907)}
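
Each `probs` value is a dictionary mapping category labels to probabilities, so the top label and its margin over the runner-up can be read off directly. The following is a minimal sketch rather than part of ethnicolr2: `summarize_probs` is a hypothetical helper, and the numbers are rounded copies of the Zhang row printed above.

```python
def summarize_probs(probs):
    """Return (top_label, top_prob, margin over the runner-up) for one probs dict."""
    # Sort labels from most to least probable; float() also accepts np.float32 values.
    ranked = sorted(probs.items(), key=lambda kv: float(kv[1]), reverse=True)
    (top_label, top_p), (_, second_p) = ranked[0], ranked[1]
    return top_label, float(top_p), float(top_p) - float(second_p)


# Rounded copy of the census probabilities for "Zhang" (illustrative input only).
zhang = {
    "nh_white": 0.1194,
    "nh_black": 0.0121,
    "hispanic": 0.0039,
    "asian": 0.8632,
    "other": 0.0014,
}

label, p, margin = summarize_probs(zhang)
print(f"top={label} prob={p:.4f} margin={margin:.4f}")
```

A small margin suggests the model is torn between two categories, which may be a reason to treat that prediction with more caution.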

2. Florida Last Name Predictions

The Florida last name model is trained on Florida voter registration data and, like the census model, uses last names only:

[4]:

# Florida last name predictions
fl_ln_results = pred_fl_last_name(df.copy(), lname_col='last_name')

print("Florida Last Name Model Results:")
print("=" * 50)

# Display results
display(fl_ln_results[display_cols])

# Show probability distribution for first few rows
print("\nProbability distributions (first 3 rows):")
for _, row in fl_ln_results.head(3).iterrows():
    print(f"{row['last_name']}: {row['probs']}")

100%|██████████| 1/1 [00:00<00:00, 93.70it/s]

Florida Last Name Model Results:
==================================================

	first_name	last_name	preds
0	John	Smith	nh_white
1	Maria	Rodriguez	hispanic
2	Wei	Zhang	asian
3	Aisha	Johnson	nh_white
4	David	Williams	nh_black
5	Priya	Patel	nh_white
6	Carlos	Garcia	hispanic
7	Sarah	Kim	nh_white


Probability distributions (first 3 rows):
Smith: {'asian': np.float32(0.0033283841), 'hispanic': np.float32(0.012039133), 'nh_black': np.float32(0.20626089), 'nh_white': np.float32(0.7733226), 'other': np.float32(0.0050489954)}
Rodriguez: {'asian': np.float32(0.019128762), 'hispanic': np.float32(0.85660416), 'nh_black': np.float32(0.011903799), 'nh_white': np.float32(0.102173075), 'other': np.float32(0.010190237)}
Zhang: {'asian': np.float32(0.77986765), 'hispanic': np.float32(0.016638936), 'nh_black': np.float32(0.02302508), 'nh_white': np.float32(0.10416626), 'other': np.float32(0.07630212)}
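
The Florida output lists the categories in a different key order than the census output, so it seems safer to look probabilities up by label than by position. A minimal sketch, assuming the five labels seen in the printed output; `LABELS` and `probs_to_row` are hypothetical names, not part of ethnicolr2, and the values are rounded copies of the two Smith rows.

```python
# Fixed label order (an assumption based on the labels printed in the outputs).
LABELS = ("asian", "hispanic", "nh_black", "nh_white", "other")


def probs_to_row(probs, labels=LABELS):
    """Return the probabilities as a list in a fixed label order."""
    return [float(probs[label]) for label in labels]


# Rounded copies of the "Smith" rows; note the different key order.
census_smith = {"nh_white": 0.9582, "nh_black": 0.0322, "hispanic": 0.0006,
                "asian": 0.0070, "other": 0.0020}
florida_smith = {"asian": 0.0033, "hispanic": 0.0120, "nh_black": 0.2063,
                 "nh_white": 0.7733, "other": 0.0050}

census_row = probs_to_row(census_smith)
florida_row = probs_to_row(florida_smith)

# Print the two models side by side, one label per line.
for label, c, f in zip(LABELS, census_row, florida_row):
    print(f"{label:<9} census={c:.4f}  florida_ln={f:.4f}")
```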

3. Florida Full Name Predictions

The full name model uses both first and last names and is generally the most accurate of the three:

[5]:

# Florida full name predictions using first and last name columns
fl_full_results = pred_fl_full_name(df.copy(),
                                   fname_col='first_name',
                                   lname_col='last_name')

print("Florida Full Name Model Results:")
print("=" * 50)

# Display results
display(fl_full_results[display_cols])

# Show probability distribution for first few rows
print("\nProbability distributions (first 3 rows):")
for _, row in fl_full_results.head(3).iterrows():
    print(f"{row['first_name']} {row['last_name']}: {row['probs']}")

100%|██████████| 1/1 [00:00<00:00, 102.31it/s]

Florida Full Name Model Results:
==================================================

	first_name	last_name	preds
0	John	Smith	nh_white
1	Maria	Rodriguez	hispanic
2	Wei	Zhang	asian
3	Aisha	Johnson	nh_black
4	David	Williams	nh_white
5	Priya	Patel	asian
6	Carlos	Garcia	hispanic
7	Sarah	Kim	asian


Probability distributions (first 3 rows):
John Smith: {'asian': np.float32(0.00022009955), 'hispanic': np.float32(0.00048264896), 'nh_black': np.float32(0.0025561512), 'nh_white': np.float32(0.99649566), 'other': np.float32(0.00024542172)}
Maria Rodriguez: {'asian': np.float32(0.0016184724), 'hispanic': np.float32(0.9841166), 'nh_black': np.float32(0.00077053445), 'nh_white': np.float32(0.011156033), 'other': np.float32(0.002338387)}
Wei Zhang: {'asian': np.float32(0.9515218), 'hispanic': np.float32(0.00089299306), 'nh_black': np.float32(0.00050908193), 'nh_white': np.float32(0.008469809), 'other': np.float32(0.0386062)}
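
To see where the three models disagree, their predicted labels can be lined up name by name. This is a minimal sketch in plain Python: the labels are copied from the three result tables above, and the dictionary keys (`census_ln`, `florida_ln`, `florida_full`) are illustrative names chosen here, not ethnicolr2 identifiers.

```python
names = ["John Smith", "Maria Rodriguez", "Wei Zhang", "Aisha Johnson",
         "David Williams", "Priya Patel", "Carlos Garcia", "Sarah Kim"]

# Predicted labels copied from the three result tables, in row order.
predictions = {
    "census_ln": ["nh_white", "hispanic", "asian", "nh_white",
                  "nh_black", "nh_white", "hispanic", "asian"],
    "florida_ln": ["nh_white", "hispanic", "asian", "nh_white",
                   "nh_black", "nh_white", "hispanic", "nh_white"],
    "florida_full": ["nh_white", "hispanic", "asian", "nh_black",
                     "nh_white", "asian", "hispanic", "asian"],
}

# Collect the names for which at least two models give different labels.
disagreements = []
for i, name in enumerate(names):
    labels = {model: preds[i] for model, preds in predictions.items()}
    if len(set(labels.values())) > 1:
        disagreements.append((name, labels))

for name, labels in disagreements:
    print(name, labels)
```

With the real result DataFrames, the same check could be made by comparing their `preds` columns row by row.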

Summary

This notebook demonstrated:

Three prediction models with different data sources and accuracy levels
Easy Python API for batch predictions on DataFrames
Probability distributions for uncertainty quantification
Model comparison to understand prediction differences

Key Takeaways

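The Florida Full Name model generally provides the most accurate predictions
Probability distributions are valuable for understanding prediction confidence
Different models may disagree, especially for ambiguous names: for example, the Florida Last Name model labels Kim as nh_white, while the other two models label it asian
The sample data is built inline, so the examples run without any external data files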