Basic Usage Guide
This notebook demonstrates the fundamentals of ethnicolr2 with step-by-step examples using all available prediction models.
Overview
ethnicolr2 provides three main prediction models:
- Census Last Name: Predicts from last names using US Census data
- Florida Last Name: Predicts from last names using Florida voter registration data
- Florida Full Name: Predicts using both first and last names for highest accuracy
Let’s start by importing the necessary libraries and loading sample data.
[1]:
import pandas as pd
from pathlib import Path
import sys
# Import ethnicolr2 prediction functions
from ethnicolr2 import (
    census_ln,
    pred_fl_last_name,
    pred_fl_full_name,
    pred_census_last_name,
)
print(f"Python version: {sys.version}")
print(f"pandas version: {pd.__version__}")
print("ethnicolr2 imported successfully!")
Python version: 3.11.14 (main, Dec 9 2025, 19:02:23) [Clang 21.1.4 ]
pandas version: 2.3.3
ethnicolr2 imported successfully!
Sample Data
Let’s create a sample dataset with diverse names to demonstrate the prediction capabilities:
[2]:
# Create sample data with diverse names
sample_data = {
    'first_name': ['John', 'Maria', 'Wei', 'Aisha', 'David', 'Priya', 'Carlos', 'Sarah'],
    'last_name': ['Smith', 'Rodriguez', 'Zhang', 'Johnson', 'Williams', 'Patel', 'Garcia', 'Kim'],
    'full_name': ['John Smith', 'Maria Rodriguez', 'Wei Zhang', 'Aisha Johnson',
                  'David Williams', 'Priya Patel', 'Carlos Garcia', 'Sarah Kim'],
}
df = pd.DataFrame(sample_data)
print("Sample dataset:")
display(df)
Sample dataset:
| | first_name | last_name | full_name |
|---|---|---|---|
| 0 | John | Smith | John Smith |
| 1 | Maria | Rodriguez | Maria Rodriguez |
| 2 | Wei | Zhang | Wei Zhang |
| 3 | Aisha | Johnson | Aisha Johnson |
| 4 | David | Williams | David Williams |
| 5 | Priya | Patel | Priya Patel |
| 6 | Carlos | Garcia | Carlos Garcia |
| 7 | Sarah | Kim | Sarah Kim |
1. Census Last Name Predictions
The census model uses US Census data for predictions based on last names only:
[3]:
# Census last name predictions
census_results = pred_census_last_name(df.copy(), lname_col='last_name')
print("Census Last Name Model Results:")
print("=" * 50)
# Display the predicted category for each name
display_cols = ['first_name', 'last_name', 'preds']
display(census_results[display_cols])
# Show probability distribution for first few rows
print("\nProbability distributions (first 3 rows):")
for _, row in census_results.head(3).iterrows():
    print(f"{row['last_name']}: {row['probs']}")
/home/runner/work/ethnicolr2/ethnicolr2/.venv/lib/python3.11/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator CountVectorizer from version 1.2.2 when using version 1.5.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
100%|██████████| 1/1 [00:00<00:00, 171.85it/s]
Census Last Name Model Results:
==================================================
| | first_name | last_name | preds |
|---|---|---|---|
| 0 | John | Smith | nh_white |
| 1 | Maria | Rodriguez | hispanic |
| 2 | Wei | Zhang | asian |
| 3 | Aisha | Johnson | nh_white |
| 4 | David | Williams | nh_black |
| 5 | Priya | Patel | nh_white |
| 6 | Carlos | Garcia | hispanic |
| 7 | Sarah | Kim | asian |
Probability distributions (first 3 rows):
Smith: {'nh_white': np.float32(0.9581536), 'nh_black': np.float32(0.032157544), 'hispanic': np.float32(0.00064442604), 'asian': np.float32(0.007015485), 'other': np.float32(0.0020290213)}
Rodriguez: {'nh_white': np.float32(0.001115346), 'nh_black': np.float32(0.0001161872), 'hispanic': np.float32(0.9918202), 'asian': np.float32(0.0068077766), 'other': np.float32(0.0001404588)}
Zhang: {'nh_white': np.float32(0.11944395), 'nh_black': np.float32(0.012060731), 'hispanic': np.float32(0.0039124084), 'asian': np.float32(0.8631792), 'other': np.float32(0.0014036907)}
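Each `probs` value is a dictionary mapping category to probability, so the predicted label and its probability can be read off directly. The following is a minimal sketch using only the standard library. The probabilities are copied from the Zhang row above, with the `np.float32` wrappers dropped. The helper name `top_class` is illustrative and not part of ethnicolr2.

```python
# Probabilities for "Zhang" copied from the Census output above
# (np.float32 wrappers dropped so the sketch needs no NumPy).
zhang_probs = {
    "nh_white": 0.11944395,
    "nh_black": 0.012060731,
    "hispanic": 0.0039124084,
    "asian": 0.8631792,
    "other": 0.0014036907,
}


def top_class(probs):
    """Return (label, probability) for the most likely category.

    Hypothetical helper for illustration; not an ethnicolr2 function.
    """
    label = max(probs, key=probs.get)
    return label, float(probs[label])


label, confidence = top_class(zhang_probs)
print(f"{label}: {confidence:.3f}")  # asian: 0.863
```

Assuming the `probs` column holds one such dictionary per row, as the printed output suggests, the same helper could be applied row by row, for example with `census_results['probs'].apply(top_class)`.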
2. Florida Last Name Predictions
The Florida last name model is trained on Florida voter registration data and, like the Census model, uses last names only:
[4]:
# Florida last name predictions
fl_ln_results = pred_fl_last_name(df.copy(), lname_col='last_name')
print("Florida Last Name Model Results:")
print("=" * 50)
# Display results
display(fl_ln_results[display_cols])
# Show probability distribution for first few rows
print("\nProbability distributions (first 3 rows):")
for _, row in fl_ln_results.head(3).iterrows():
    print(f"{row['last_name']}: {row['probs']}")
100%|██████████| 1/1 [00:00<00:00, 218.11it/s]
Florida Last Name Model Results:
==================================================
| | first_name | last_name | preds |
|---|---|---|---|
| 0 | John | Smith | nh_white |
| 1 | Maria | Rodriguez | hispanic |
| 2 | Wei | Zhang | asian |
| 3 | Aisha | Johnson | nh_white |
| 4 | David | Williams | nh_black |
| 5 | Priya | Patel | nh_white |
| 6 | Carlos | Garcia | hispanic |
| 7 | Sarah | Kim | nh_white |
Probability distributions (first 3 rows):
Smith: {'asian': np.float32(0.0033283862), 'hispanic': np.float32(0.012039134), 'nh_black': np.float32(0.20626086), 'nh_white': np.float32(0.77332264), 'other': np.float32(0.005048998)}
Rodriguez: {'asian': np.float32(0.019128766), 'hispanic': np.float32(0.85660416), 'nh_black': np.float32(0.011903805), 'nh_white': np.float32(0.1021731), 'other': np.float32(0.010190243)}
Zhang: {'asian': np.float32(0.77986765), 'hispanic': np.float32(0.016638936), 'nh_black': np.float32(0.023025088), 'nh_white': np.float32(0.10416626), 'other': np.float32(0.07630213)}
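The two last name models agree on the label for Smith but assign it quite different probabilities. One simple way to compare how decisive each model is uses the gap between the top two probabilities. The sketch below uses the Smith rows from the Census and Florida outputs above. The helper `top_two_margin` is an illustrative name, and the margin is one possible confidence measure rather than something ethnicolr2 provides.

```python
# Probabilities for "Smith" copied from the Census and Florida last name outputs above.
smith_census = {
    "nh_white": 0.9581536,
    "nh_black": 0.032157544,
    "hispanic": 0.00064442604,
    "asian": 0.007015485,
    "other": 0.0020290213,
}
smith_florida = {
    "asian": 0.0033283862,
    "hispanic": 0.012039134,
    "nh_black": 0.20626086,
    "nh_white": 0.77332264,
    "other": 0.005048998,
}


def top_two_margin(probs):
    """Gap between the highest and second-highest probability.

    Hypothetical helper for illustration; a small margin means a close call.
    """
    first, second = sorted(probs.values(), reverse=True)[:2]
    return first - second


census_margin = top_two_margin(smith_census)
florida_margin = top_two_margin(smith_florida)
print(f"Census margin:  {census_margin:.3f}")   # 0.926
print(f"Florida margin: {florida_margin:.3f}")  # 0.567
```

For this name the Florida model places noticeably more probability on `nh_black`, so its margin is smaller even though both models predict `nh_white`.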
3. Florida Full Name Predictions
The full name model uses both first and last names and is generally the most accurate of the three:
[5]:
# Florida full name predictions using first and last name columns
fl_full_results = pred_fl_full_name(
    df.copy(),
    fname_col='first_name',
    lname_col='last_name',
)
print("Florida Full Name Model Results:")
print("=" * 50)
# Display results
display(fl_full_results[display_cols])
# Show probability distribution for first few rows
print("\nProbability distributions (first 3 rows):")
for _, row in fl_full_results.head(3).iterrows():
    print(f"{row['first_name']} {row['last_name']}: {row['probs']}")
100%|██████████| 1/1 [00:00<00:00, 159.24it/s]
Florida Full Name Model Results:
==================================================
| | first_name | last_name | preds |
|---|---|---|---|
| 0 | John | Smith | nh_white |
| 1 | Maria | Rodriguez | hispanic |
| 2 | Wei | Zhang | asian |
| 3 | Aisha | Johnson | nh_black |
| 4 | David | Williams | nh_white |
| 5 | Priya | Patel | asian |
| 6 | Carlos | Garcia | hispanic |
| 7 | Sarah | Kim | asian |
Probability distributions (first 3 rows):
John Smith: {'asian': np.float32(0.00022009955), 'hispanic': np.float32(0.00048264896), 'nh_black': np.float32(0.0025561498), 'nh_white': np.float32(0.99649566), 'other': np.float32(0.00024542198)}
Maria Rodriguez: {'asian': np.float32(0.0016184732), 'hispanic': np.float32(0.9841166), 'nh_black': np.float32(0.00077053445), 'nh_white': np.float32(0.011156033), 'other': np.float32(0.002338387)}
Wei Zhang: {'asian': np.float32(0.9515218), 'hispanic': np.float32(0.0008929939), 'nh_black': np.float32(0.00050908246), 'nh_white': np.float32(0.008469812), 'other': np.float32(0.03860621)}
Summary
This notebook demonstrated:
- Three prediction models with different data sources and accuracy levels
- A simple Python API for batch predictions on DataFrames
- Per-category probability distributions for gauging uncertainty
- Running the same names through each model to see where predictions differ
Key Takeaways
- The Florida Full Name model generally provides the most accurate predictions
- Probability distributions are valuable for understanding prediction confidence
- Different models may disagree, especially for ambiguous names: both last name models label Patel as nh_white, while the full name model predicts asian for Priya Patel
- The sample data is built in memory, so the examples need no external data files
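To see where the models disagree on the sample names, the predicted labels from the three result tables can be lined up per name. This is a minimal sketch using only the standard library. The labels are copied from the outputs above, while the model keys (`census_last`, `florida_last`, `florida_full`) and the helper `find_disagreements` are illustrative names, not part of ethnicolr2.

```python
# Predicted labels copied from the three result tables above, in row order.
names = ["John Smith", "Maria Rodriguez", "Wei Zhang", "Aisha Johnson",
         "David Williams", "Priya Patel", "Carlos Garcia", "Sarah Kim"]
preds = {
    "census_last": ["nh_white", "hispanic", "asian", "nh_white",
                    "nh_black", "nh_white", "hispanic", "asian"],
    "florida_last": ["nh_white", "hispanic", "asian", "nh_white",
                     "nh_black", "nh_white", "hispanic", "nh_white"],
    "florida_full": ["nh_white", "hispanic", "asian", "nh_black",
                     "nh_white", "asian", "hispanic", "asian"],
}


def find_disagreements(names, preds):
    """Return {name: {model: label}} for rows where the models do not all agree.

    Hypothetical helper for illustration; not an ethnicolr2 function.
    """
    disagreements = {}
    for i, name in enumerate(names):
        labels = {model: model_labels[i] for model, model_labels in preds.items()}
        if len(set(labels.values())) > 1:
            disagreements[name] = labels
    return disagreements


disagreements = find_disagreements(names, preds)
for name, labels in disagreements.items():
    print(name, labels)
```

On this sample, four of the eight names receive different labels from at least one model: Aisha Johnson, David Williams, Priya Patel, and Sarah Kim. Rows like these are natural candidates for checking the underlying probabilities before relying on a single label.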