Basic Usage Guide

This notebook walks through the fundamentals of ethnicolr2 with step-by-step examples for each of its three main prediction models.

Overview

ethnicolr2 provides three main prediction models:

  • Census Last Name: Predicts a race/ethnicity category from last names using US Census data

  • Florida Last Name: Predicts a race/ethnicity category from last names using Florida voter registration data

  • Florida Full Name: Predicts from both first and last names (Florida voter registration data); generally the most accurate of the three

Let’s start by importing the necessary libraries and loading sample data.

[1]:
import pandas as pd
import sys

# Import ethnicolr2 prediction functions
# (census_ln is imported but not used in this notebook)
from ethnicolr2 import (
    census_ln,
    pred_fl_last_name,
    pred_fl_full_name,
    pred_census_last_name,
)

print(f"Python version: {sys.version}")
print(f"pandas version: {pd.__version__}")
print("ethnicolr2 imported successfully!")
Python version: 3.11.14 (main, Dec  9 2025, 19:02:23) [Clang 21.1.4 ]
pandas version: 2.3.3
ethnicolr2 imported successfully!

Sample Data

Let’s create a sample dataset with diverse names to demonstrate the prediction capabilities:

[2]:
# Create sample data with diverse names
sample_data = {
    'first_name': ['John', 'Maria', 'Wei', 'Aisha', 'David', 'Priya', 'Carlos', 'Sarah'],
    'last_name': ['Smith', 'Rodriguez', 'Zhang', 'Johnson', 'Williams', 'Patel', 'Garcia', 'Kim'],
    'full_name': ['John Smith', 'Maria Rodriguez', 'Wei Zhang', 'Aisha Johnson',
                  'David Williams', 'Priya Patel', 'Carlos Garcia', 'Sarah Kim']
}

df = pd.DataFrame(sample_data)
print("Sample dataset:")
display(df)
Sample dataset:
   first_name  last_name        full_name
0        John      Smith       John Smith
1       Maria  Rodriguez  Maria Rodriguez
2         Wei      Zhang        Wei Zhang
3       Aisha    Johnson    Aisha Johnson
4       David   Williams   David Williams
5       Priya      Patel      Priya Patel
6      Carlos     Garcia    Carlos Garcia
7       Sarah        Kim        Sarah Kim

1. Census Last Name Predictions

The census model uses US Census data for predictions based on last names only:

[3]:
# Census last name predictions
census_results = pred_census_last_name(df.copy(), lname_col='last_name')

print("Census Last Name Model Results:")
print("=" * 50)

# Display the predicted category for each name
display_cols = ['first_name', 'last_name', 'preds']
display(census_results[display_cols])

# Show the full probability distribution for the first three rows
print("\nProbability distributions (first 3 rows):")
for _, row in census_results.head(3).iterrows():
    print(f"{row['last_name']}: {row['probs']}")
/home/runner/work/ethnicolr2/ethnicolr2/.venv/lib/python3.11/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator CountVectorizer from version 1.2.2 when using version 1.5.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
100%|██████████| 1/1 [00:00<00:00, 171.85it/s]
Census Last Name Model Results:
==================================================

   first_name  last_name     preds
0        John      Smith  nh_white
1       Maria  Rodriguez  hispanic
2         Wei      Zhang     asian
3       Aisha    Johnson  nh_white
4       David   Williams  nh_black
5       Priya      Patel  nh_white
6      Carlos     Garcia  hispanic
7       Sarah        Kim     asian

Probability distributions (first 3 rows):
Smith: {'nh_white': np.float32(0.9581536), 'nh_black': np.float32(0.032157544), 'hispanic': np.float32(0.00064442604), 'asian': np.float32(0.007015485), 'other': np.float32(0.0020290213)}
Rodriguez: {'nh_white': np.float32(0.001115346), 'nh_black': np.float32(0.0001161872), 'hispanic': np.float32(0.9918202), 'asian': np.float32(0.0068077766), 'other': np.float32(0.0001404588)}
Zhang: {'nh_white': np.float32(0.11944395), 'nh_black': np.float32(0.012060731), 'hispanic': np.float32(0.0039124084), 'asian': np.float32(0.8631792), 'other': np.float32(0.0014036907)}
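Each entry in the `probs` column is a dictionary mapping a category to its probability. The following is a minimal sketch, assuming that dictionary shape, of how the top category and its probability might be pulled out. `top_prediction` is a hypothetical helper written for this example, not part of ethnicolr2, and the values are copied from the output above as plain floats.

```python
# Hypothetical helper (not part of ethnicolr2): return the most likely
# category and its probability from one `probs` dictionary.
def top_prediction(probs):
    label = max(probs, key=probs.get)
    return label, float(probs[label])


# Values copied from the Census output above for "Zhang", as plain floats.
zhang_probs = {
    "nh_white": 0.11944395,
    "nh_black": 0.012060731,
    "hispanic": 0.0039124084,
    "asian": 0.8631792,
    "other": 0.0014036907,
}

label, confidence = top_prediction(zhang_probs)
print(f"{label}: {confidence:.3f}")  # asian: 0.863
```

On a results DataFrame, the same helper could presumably be applied row by row, for example with `census_results['probs'].apply(top_prediction)`.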

2. Florida Last Name Predictions

The Florida last name model is trained on Florida voter registration data and, like the Census model, uses last names only:

[4]:
# Florida last name predictions
fl_ln_results = pred_fl_last_name(df.copy(), lname_col='last_name')

print("Florida Last Name Model Results:")
print("=" * 50)

# Display results
display(fl_ln_results[display_cols])

# Show the full probability distribution for the first three rows
print("\nProbability distributions (first 3 rows):")
for _, row in fl_ln_results.head(3).iterrows():
    print(f"{row['last_name']}: {row['probs']}")
Florida Last Name Model Results:
==================================================

   first_name  last_name     preds
0        John      Smith  nh_white
1       Maria  Rodriguez  hispanic
2         Wei      Zhang     asian
3       Aisha    Johnson  nh_white
4       David   Williams  nh_black
5       Priya      Patel  nh_white
6      Carlos     Garcia  hispanic
7       Sarah        Kim  nh_white

Probability distributions (first 3 rows):
Smith: {'asian': np.float32(0.0033283862), 'hispanic': np.float32(0.012039134), 'nh_black': np.float32(0.20626086), 'nh_white': np.float32(0.77332264), 'other': np.float32(0.005048998)}
Rodriguez: {'asian': np.float32(0.019128766), 'hispanic': np.float32(0.85660416), 'nh_black': np.float32(0.011903805), 'nh_white': np.float32(0.1021731), 'other': np.float32(0.010190243)}
Zhang: {'asian': np.float32(0.77986765), 'hispanic': np.float32(0.016638936), 'nh_black': np.float32(0.023025088), 'nh_white': np.float32(0.10416626), 'other': np.float32(0.07630213)}
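In the printed output, the Census dictionaries list `nh_white` first while the Florida dictionaries are in alphabetical order, so code that reads probabilities by position would mix up categories. A minimal sketch of putting both into one fixed order by looking values up by key follows. `CATEGORIES` and `probs_to_row` are names invented for this example, and the values are copied from the "Smith" rows above as plain floats.

```python
# Fixed category order chosen for this example (an assumption, not an
# ethnicolr2 constant).
CATEGORIES = ["asian", "hispanic", "nh_black", "nh_white", "other"]


def probs_to_row(probs, categories=CATEGORIES):
    """Return the probabilities as a list in a fixed category order."""
    return [float(probs[c]) for c in categories]


# "Smith" probabilities copied from the two last-name outputs above.
census_smith = {
    "nh_white": 0.9581536,
    "nh_black": 0.032157544,
    "hispanic": 0.00064442604,
    "asian": 0.007015485,
    "other": 0.0020290213,
}
florida_smith = {
    "asian": 0.0033283862,
    "hispanic": 0.012039134,
    "nh_black": 0.20626086,
    "nh_white": 0.77332264,
    "other": 0.005048998,
}

census_row = probs_to_row(census_smith)
florida_row = probs_to_row(florida_smith)

# Print the two models side by side, one category per line.
for cat, c, f in zip(CATEGORIES, census_row, florida_row):
    print(f"{cat:<9} census={c:.3f}  florida={f:.3f}")
```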

3. Florida Full Name Predictions

The Florida full name model uses both first and last names and is generally the most accurate of the three:

[5]:
# Florida full name predictions using first and last name columns
fl_full_results = pred_fl_full_name(df.copy(),
                                   fname_col='first_name',
                                   lname_col='last_name')

print("Florida Full Name Model Results:")
print("=" * 50)

# Display results
display(fl_full_results[display_cols])

# Show the full probability distribution for the first three rows
print("\nProbability distributions (first 3 rows):")
for _, row in fl_full_results.head(3).iterrows():
    print(f"{row['first_name']} {row['last_name']}: {row['probs']}")
Florida Full Name Model Results:
==================================================

   first_name  last_name     preds
0        John      Smith  nh_white
1       Maria  Rodriguez  hispanic
2         Wei      Zhang     asian
3       Aisha    Johnson  nh_black
4       David   Williams  nh_white
5       Priya      Patel     asian
6      Carlos     Garcia  hispanic
7       Sarah        Kim     asian

Probability distributions (first 3 rows):
John Smith: {'asian': np.float32(0.00022009955), 'hispanic': np.float32(0.00048264896), 'nh_black': np.float32(0.0025561498), 'nh_white': np.float32(0.99649566), 'other': np.float32(0.00024542198)}
Maria Rodriguez: {'asian': np.float32(0.0016184732), 'hispanic': np.float32(0.9841166), 'nh_black': np.float32(0.00077053445), 'nh_white': np.float32(0.011156033), 'other': np.float32(0.002338387)}
Wei Zhang: {'asian': np.float32(0.9515218), 'hispanic': np.float32(0.0008929939), 'nh_black': np.float32(0.00050908246), 'nh_white': np.float32(0.008469812), 'other': np.float32(0.03860621)}
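With all three models run on the same names, it can help to list where they disagree. The sketch below does this in plain Python using the `preds` values copied from the three result tables above. The dictionary keys are labels chosen for this example, not ethnicolr2 identifiers.

```python
names = [
    "John Smith", "Maria Rodriguez", "Wei Zhang", "Aisha Johnson",
    "David Williams", "Priya Patel", "Carlos Garcia", "Sarah Kim",
]

# Predicted categories copied from the three result tables above.
predictions = {
    "census_last_name": ["nh_white", "hispanic", "asian", "nh_white",
                         "nh_black", "nh_white", "hispanic", "asian"],
    "florida_last_name": ["nh_white", "hispanic", "asian", "nh_white",
                          "nh_black", "nh_white", "hispanic", "nh_white"],
    "florida_full_name": ["nh_white", "hispanic", "asian", "nh_black",
                          "nh_white", "asian", "hispanic", "asian"],
}

# Keep only the names for which at least two models give different labels.
disagreements = {}
for i, name in enumerate(names):
    labels = {model: preds[i] for model, preds in predictions.items()}
    if len(set(labels.values())) > 1:
        disagreements[name] = labels

for name, labels in disagreements.items():
    print(name, labels)
```

On this sample the models disagree on four of the eight names: Aisha Johnson, David Williams, Priya Patel, and Sarah Kim.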

Summary

This notebook demonstrated:

  1. Three prediction models with different data sources and accuracy levels

  2. Easy Python API for batch predictions on DataFrames

  3. Probability distributions for uncertainty quantification

  4. Running all three models on the same names to see where their predictions differ

Key Takeaways

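  • The Florida Full Name model generally provides the most accurate predictions

  • Probability distributions are valuable for understanding prediction confidence

  • Different models may disagree, especially for ambiguous names; for example, Sarah Kim is predicted asian by the Census and Florida Full Name models but nh_white by the Florida Last Name model

  • The examples run on a small in-memory DataFrame, so no external data files are needed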
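One way to act on the probability distributions is to flag low-confidence predictions for review. The sketch below shows two simple measures: a threshold on the top probability and the Shannon entropy of the distribution. Both helpers and the 0.8 threshold are assumptions made for this example, not ethnicolr2 features, and the values are copied from the "Zhang" rows of the Florida outputs above as plain floats.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; lower means a more concentrated distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def is_confident(probs, threshold=0.8):
    """True if the top category's probability meets the (arbitrary) threshold."""
    return max(probs.values()) >= threshold


# "Zhang" from the Florida last name model.
fl_last_zhang = {
    "asian": 0.77986765,
    "hispanic": 0.016638936,
    "nh_black": 0.023025088,
    "nh_white": 0.10416626,
    "other": 0.07630213,
}
# "Wei Zhang" from the Florida full name model.
fl_full_zhang = {
    "asian": 0.9515218,
    "hispanic": 0.0008929939,
    "nh_black": 0.00050908246,
    "nh_white": 0.008469812,
    "other": 0.03860621,
}

for label, probs in [("last name", fl_last_zhang), ("full name", fl_full_zhang)]:
    print(f"{label}: confident={is_confident(probs)}, entropy={entropy(probs):.2f} bits")
```

With this threshold the last-name prediction is flagged as uncertain and the full-name prediction is not. A suitable cutoff depends on the application.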