Basic Usage Examples

This notebook demonstrates basic usage of the ethnicolr package for predicting race and ethnicity from names.

Setup

First, let’s import the necessary libraries and load some sample data.

[1]:

import pandas as pd
import ethnicolr
from pathlib import Path

# Load sample data
data_path = Path('data/input-with-header.csv')

try:
    df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")

print(f"Sample data shape: {df.shape}")
print("\nFirst few rows:")
df.head()

Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)

First few rows:

[1]:

	first_name	last_name
0	John	Smith
1	Maria	Garcia
2	David	Johnson
3	Sarah	Davis
4	Michael	Brown

Census Data Lookup

The simplest approach is to look up, for each last name, the percentage of people in each race/ethnicity category according to US Census data.

[2]:

# Census lookup by last name (2010 data)
result_census = ethnicolr.census_ln(df, 'last_name', year=2010)
print(f"Result shape: {result_census.shape}")
print("\nColumns added:")
census_cols = [col for col in result_census.columns if col not in df.columns]
print(census_cols)

# Show first few results
result_census[['last_name', 'first_name'] + census_cols].head()

2025-12-27 22:21:39,521 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:39,523 - INFO - Loading Census 2010 data from /home/runner/work/ethnicolr/ethnicolr/ethnicolr/data/census/census_2010.csv...
2025-12-27 22:21:39,679 - INFO - Loaded 162253 last names from Census 2010
2025-12-27 22:21:39,680 - INFO - Merging demographic data for 62 records...
2025-12-27 22:21:39,715 - INFO - Matched 62 of 62 rows (100.0%)
2025-12-27 22:21:39,716 - INFO - Added columns: pct2prace, pctaian, pctapi, pctblack, pcthispanic, pctwhite

Result shape: (62, 8)

Columns added:
['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']

[2]:

	last_name	first_name	pctwhite	pctblack	pctapi	pctaian	pct2prace	pcthispanic
0	Smith	John	70.9	23.11	0.5	0.89	2.19	2.4
1	Garcia	Maria	5.38	0.45	1.41	0.47	0.26	92.03
2	Johnson	David	58.97	34.63	0.54	0.94	2.56	2.36
3	Davis	Sarah	62.2	31.6	0.49	0.82	2.45	2.44
4	Brown	Michael	57.95	35.6	0.51	0.87	2.55	2.52
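
The census columns are percentages on a 0-100 scale, one per category. As a minimal sketch, the most common category for each surname can be read off by taking the largest column. The frame below is rebuilt by hand from the five rows shown above so that the snippet runs without ethnicolr. The `pd.to_numeric` step is a defensive assumption in case some cells are not numeric, and `top_census_category` is a column name chosen here for illustration, not something the package adds.

```python
import pandas as pd

# Values copied from the census lookup output above (percent, 0-100 scale).
census_rows = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson", "Davis", "Brown"],
    "pctwhite": [70.9, 5.38, 58.97, 62.2, 57.95],
    "pctblack": [23.11, 0.45, 34.63, 31.6, 35.6],
    "pctapi": [0.5, 1.41, 0.54, 0.49, 0.51],
    "pctaian": [0.89, 0.47, 0.94, 0.82, 0.87],
    "pct2prace": [2.19, 0.26, 2.56, 2.45, 2.55],
    "pcthispanic": [2.4, 92.03, 2.36, 2.44, 2.52],
})

pct_cols = [c for c in census_rows.columns if c.startswith("pct")]

# Defensive assumption: coerce anything non-numeric to NaN before comparing.
pct = census_rows[pct_cols].apply(pd.to_numeric, errors="coerce")

# Name of the column with the largest share per row, with the "pct" prefix dropped.
census_rows["top_census_category"] = pct.idxmax(axis=1).str[len("pct"):]

print(census_rows[["last_name", "top_census_category"]])
```

With the real lookup result, the same two lines could be applied to `result_census` in place of the hand-built frame.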

Simple Census-based Predictions

For more sophisticated predictions, we can use the census-based LSTM model.

[3]:

# Predict using census LSTM model
result_pred = ethnicolr.pred_census_ln(df, 'last_name')
print(f"Result shape: {result_pred.shape}")
print("\nPrediction columns:")
pred_cols = [col for col in result_pred.columns if col not in df.columns]
print(pred_cols)

# Show predictions with per-category probabilities (the census model outputs api, black, hispanic, white)
result_pred[['last_name', 'first_name', 'race', 'api', 'black', 'hispanic', 'white']].head(10)

2025-12-27 22:21:39,727 - INFO - Processing 62 names using Census 2010 LSTM model
2025-12-27 22:21:39,728 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:39,935 - INFO - Predicted 62 of 62 rows (100.0%)
2025-12-27 22:21:39,936 - INFO - Added columns: white, black, api, race, hispanic

Result shape: (62, 7)

Prediction columns:
['api', 'black', 'hispanic', 'white', 'race']

[3]:

	last_name	first_name	race	api	black	hispanic	white
0	Smith	John	white	0.008322	0.215059	0.029183	0.747435
1	Garcia	Maria	hispanic	0.010198	0.002871	0.942291	0.044640
2	Johnson	David	white	0.005786	0.364275	0.021122	0.608818
3	Davis	Sarah	white	0.003984	0.356766	0.020293	0.618956
4	Brown	Michael	white	0.003043	0.398471	0.026545	0.571941
5	Wilson	Jennifer	white	0.008207	0.273209	0.018134	0.700449
6	Martinez	Carlos	hispanic	0.005853	0.003826	0.942765	0.047556
7	Anderson	Lisa	white	0.011476	0.188307	0.024973	0.775244
8	Taylor	James	white	0.007325	0.279690	0.028578	0.684406
9	Rodriguez	Anna	hispanic	0.007562	0.006216	0.942559	0.043662
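
In the rows shown, the `race` label is the category with the highest probability, and several of those top probabilities are not far above 0.5. A minimal sketch of flagging such uncertain predictions follows. The frame is rebuilt from the first five output rows, and the 0.6 cutoff and the `top_prob` and `low_confidence` column names are illustrative assumptions, not part of ethnicolr.

```python
import pandas as pd

# Values copied from the first five rows of the prediction output above.
preds = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson", "Davis", "Brown"],
    "race": ["white", "hispanic", "white", "white", "white"],
    "api": [0.008322, 0.010198, 0.005786, 0.003984, 0.003043],
    "black": [0.215059, 0.002871, 0.364275, 0.356766, 0.398471],
    "hispanic": [0.029183, 0.942291, 0.021122, 0.020293, 0.026545],
    "white": [0.747435, 0.044640, 0.608818, 0.618956, 0.571941],
})
prob_cols = ["api", "black", "hispanic", "white"]

# Sanity check: in these rows the label is the highest-probability column.
assert (preds[prob_cols].idxmax(axis=1) == preds["race"]).all()

# Hypothetical cutoff; choose a value that suits the analysis at hand.
THRESHOLD = 0.6
preds["top_prob"] = preds[prob_cols].max(axis=1)
preds["low_confidence"] = preds["top_prob"] < THRESHOLD

print(preds[["last_name", "race", "top_prob", "low_confidence"]])
```

Here only Brown falls below the cutoff, since its top probability is 0.571941.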

Summary Statistics

Let’s look at the distribution of predicted races in our sample.

[4]:

# Distribution of predicted races
race_dist = result_pred['race'].value_counts()
print("Race distribution:")
for race, count in race_dist.items():
    percentage = (count / len(result_pred)) * 100
    print(f"{race}: {count} ({percentage:.1f}%)")

# Show some examples by race
print("\nExample predictions by race:")
for race in race_dist.index[:3]:
    examples = result_pred[result_pred['race'] == race]['last_name'].head(3).tolist()
    print(f"{race}: {', '.join(examples)}")

Race distribution:
white: 51 (82.3%)
hispanic: 9 (14.5%)
black: 1 (1.6%)
api: 1 (1.6%)

Example predictions by race:
white: Smith, Johnson, Davis
hispanic: Garcia, Martinez, Rodriguez
black: Jackson
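
The loop above computes each percentage by hand. As an alternative sketch, `value_counts(normalize=True)` returns the shares directly. The label series here is reconstructed from the counts printed above so that the snippet runs on its own; with the real data, `result_pred['race']` would take its place.

```python
import pandas as pd

# Labels rebuilt from the printed counts: 51 white, 9 hispanic, 1 black, 1 api.
race = pd.Series(["white"] * 51 + ["hispanic"] * 9 + ["black"] + ["api"], name="race")

# Counts and percentages side by side; the two Series align on the category index.
summary = pd.DataFrame({
    "count": race.value_counts(),
    "percent": race.value_counts(normalize=True).mul(100).round(1),
})

print(summary)
```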

Comparing Methods

Let’s compare the census lookup vs. ML prediction for a few names.

[5]:

# Create comparison dataframe
# Both results come from the same input DataFrame and kept all 62 rows, so we
# can align them by index rather than merging on potentially different
# last_name values. If either function had dropped rows, check the indexes first.

print(f"Original data shape: {df.shape}")
print(f"Census result shape: {result_census.shape}")
print(f"Prediction result shape: {result_pred.shape}")

# Check if we have the expected columns
print("\nCensus columns with percentages:", [col for col in result_census.columns if 'pct' in col])
print("Prediction columns with probabilities:", [col for col in result_pred.columns if col in ['api', 'black', 'hispanic', 'white']])

# Create aligned comparison using index
comparison = pd.DataFrame({
    'last_name': df['last_name'],  # Use original names for reference
    'census_white': result_census['pctwhite'],
    'census_black': result_census['pctblack'],
    'census_api': result_census['pctapi'],
    'census_hispanic': result_census['pcthispanic'],
    'pred_race': result_pred['race'],
    'pred_white': result_pred['white'],
    'pred_black': result_pred['black'],
    'pred_api': result_pred['api'],
    'pred_hispanic': result_pred['hispanic']
})

print(f"\nComparison created with {len(comparison)} names")
print("Comparison of Census vs ML predictions (first 10 names):")
comparison[['last_name', 'census_white', 'census_black', 'pred_race', 'pred_white', 'pred_black']].head(10)

Original data shape: (62, 2)
Census result shape: (62, 8)
Prediction result shape: (62, 7)

Census columns with percentages: ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']
Prediction columns with probabilities: ['api', 'black', 'hispanic', 'white']

Comparison created with 62 names
Comparison of Census vs ML predictions (first 10 names):

[5]:

	last_name	census_white	census_black	pred_race	pred_white	pred_black
0	Smith	70.9	23.11	white	0.747435	0.215059
1	Garcia	5.38	0.45	hispanic	0.044640	0.002871
2	Johnson	58.97	34.63	white	0.608818	0.364275
3	Davis	62.2	31.6	white	0.618956	0.356766
4	Brown	57.95	35.6	white	0.571941	0.398471
5	Wilson	67.36	25.99	white	0.700449	0.273209
6	Martinez	5.28	0.49	hispanic	0.047556	0.003826
7	Anderson	75.17	18.93	white	0.775244	0.188307
8	Taylor	65.38	28.42	white	0.684406	0.279690
9	Rodriguez	4.75	0.54	hispanic	0.043662	0.006216
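
The two sets of columns in this table are not on the same scale: the census values are percentages (0-100) across six categories, while the model outputs probabilities (0-1) across four. One way to compare them more directly, sketched below under that reading of the outputs, is to rescale the census shares over the four categories the model also predicts. The values come from the first three rows of the earlier outputs, and `census_share` and `abs_diff_white` are names introduced here for illustration.

```python
import pandas as pd

# Census percentages and model probabilities copied from the outputs above.
rows = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson"],
    "pctwhite": [70.9, 5.38, 58.97],
    "pctblack": [23.11, 0.45, 34.63],
    "pctapi": [0.5, 1.41, 0.54],
    "pcthispanic": [2.4, 92.03, 2.36],
    "white": [0.747435, 0.044640, 0.608818],
}).set_index("last_name")

shared_cols = ["pctwhite", "pctblack", "pctapi", "pcthispanic"]

# Rescale each row so the four shared categories sum to 1
# (this drops the pctaian and pct2prace mass, which the model does not output).
census_share = rows[shared_cols].div(rows[shared_cols].sum(axis=1), axis=0)

# Absolute gap between the model's probability and the rescaled census share.
abs_diff_white = (rows["white"] - census_share["pctwhite"]).abs()

print(abs_diff_white.round(4))
```

For these three surnames the gap stays under two percentage points, which suggests the model tracks the census shares closely for common names.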

Key Differences

Census lookup: Returns, for each surname, the percentage (0-100) of people in each race/ethnicity category, taken directly from census surname data
ML prediction: Uses an LSTM model trained on census data to return a probability (0-1) for each of four categories, plus the most likely category in the race column
Use cases: Census lookup for aggregate analysis, ML predictions for individual classification
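
To illustrate the aggregate use case, the sketch below estimates the expected number of people per category by summing the census shares across names, and contrasts that with simply counting each name's top category. It reuses the five census rows shown earlier; `expected` and `hard_counts` are names chosen here, and treating each row as one person is an assumption made for the example.

```python
import pandas as pd

# Census percentages (0-100) copied from the lookup output earlier in the notebook.
census_rows = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson", "Davis", "Brown"],
    "pctwhite": [70.9, 5.38, 58.97, 62.2, 57.95],
    "pctblack": [23.11, 0.45, 34.63, 31.6, 35.6],
    "pctapi": [0.5, 1.41, 0.54, 0.49, 0.51],
    "pctaian": [0.89, 0.47, 0.94, 0.82, 0.87],
    "pct2prace": [2.19, 0.26, 2.56, 2.45, 2.55],
    "pcthispanic": [2.4, 92.03, 2.36, 2.44, 2.52],
})
pct_cols = [c for c in census_rows.columns if c.startswith("pct")]

# Soft estimate: each person contributes fractionally to every category.
expected = census_rows[pct_cols].div(100).sum()

# Hard estimate: each person counts once, toward their top category only.
hard_counts = census_rows[pct_cols].idxmax(axis=1).value_counts()

print(expected.round(2))
print(hard_counts)
```

The hard count assigns four of the five people to the white category, while the summed shares give roughly 2.55 for white and 1.25 for black, so counting top labels can understate smaller groups in aggregate estimates.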