Basic Usage Examples

This notebook demonstrates basic usage of the ethnicolr package for predicting race and ethnicity from names.

Setup

First, let’s import the necessary libraries and load some sample data.

[1]:
import pandas as pd
import ethnicolr
from pathlib import Path

# Load sample data
data_path = Path('data/input-with-header.csv')

try:
    df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")

print(f"Sample data shape: {df.shape}")
print("\nFirst few rows:")
df.head()
Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)

First few rows:
[1]:
  first_name last_name
0       John     Smith
1      Maria    Garcia
2      David   Johnson
3      Sarah     Davis
4    Michael     Brown

Census Data Lookup

The simplest approach is to look up demographic probabilities by last name using US Census data.

[2]:
# Census lookup by last name (2010 data)
result_census = ethnicolr.census_ln(df, 'last_name', year=2010)
print(f"Result shape: {result_census.shape}")
print("\nColumns added:")
census_cols = [col for col in result_census.columns if col not in df.columns]
print(census_cols)

# Show first few results
result_census[['last_name', 'first_name'] + census_cols].head()
2025-12-27 22:21:39,521 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:39,523 - INFO - Loading Census 2010 data from /home/runner/work/ethnicolr/ethnicolr/ethnicolr/data/census/census_2010.csv...
2025-12-27 22:21:39,679 - INFO - Loaded 162253 last names from Census 2010
2025-12-27 22:21:39,680 - INFO - Merging demographic data for 62 records...
2025-12-27 22:21:39,715 - INFO - Matched 62 of 62 rows (100.0%)
2025-12-27 22:21:39,716 - INFO - Added columns: pct2prace, pctaian, pctapi, pctblack, pcthispanic, pctwhite
Result shape: (62, 8)

Columns added:
['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']
[2]:
  last_name first_name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
0     Smith       John     70.9    23.11    0.5    0.89      2.19         2.4
1    Garcia      Maria     5.38     0.45   1.41    0.47      0.26       92.03
2   Johnson      David    58.97    34.63   0.54    0.94      2.56        2.36
3     Davis      Sarah     62.2     31.6   0.49    0.82      2.45        2.44
4     Brown    Michael    57.95     35.6   0.51    0.87      2.55        2.52
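
The lookup returns one percentage column per category, so a common follow-up is to pick the largest one per row. The sketch below does this with plain pandas on three rows copied from the output above. The `top_category` and `top_pct` column names are made up for illustration, and the `pd.to_numeric` step is a precaution that assumes the percentage columns may arrive as strings.

```python
import pandas as pd

# Three rows copied from the census lookup output above (percentages, 0-100).
census_sample = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson"],
    "pctwhite": [70.9, 5.38, 58.97],
    "pctblack": [23.11, 0.45, 34.63],
    "pctapi": [0.5, 1.41, 0.54],
    "pctaian": [0.89, 0.47, 0.94],
    "pct2prace": [2.19, 0.26, 2.56],
    "pcthispanic": [2.4, 92.03, 2.36],
})

pct_cols = [col for col in census_sample.columns if col.startswith("pct")]

# Assumption: the columns may be stored as strings, so coerce them first.
pct = census_sample[pct_cols].apply(pd.to_numeric, errors="coerce")

# Hypothetical helper columns: the largest category per row and its share.
census_sample["top_category"] = pct.idxmax(axis=1).str.replace("pct", "", regex=False)
census_sample["top_pct"] = pct.max(axis=1)

print(census_sample[["last_name", "top_category", "top_pct"]])
```

The same lines should work on `result_census` directly, provided every row has at least one non-missing percentage.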

Simple Census-based Predictions

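For model-based predictions, we can use the LSTM model trained on census data. Alongside a probability for each category, it returns the single most likely category in a race column.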

[3]:
# Predict using census LSTM model
result_pred = ethnicolr.pred_census_ln(df, 'last_name')
print(f"Result shape: {result_pred.shape}")
print("\nPrediction columns:")
pred_cols = [col for col in result_pred.columns if col not in df.columns]
print(pred_cols)

# Show predictions with per-category probabilities (the census model covers api, black, hispanic, white)
result_pred[['last_name', 'first_name', 'race', 'api', 'black', 'hispanic', 'white']].head(10)
2025-12-27 22:21:39,727 - INFO - Processing 62 names using Census 2010 LSTM model
2025-12-27 22:21:39,728 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:39,935 - INFO - Predicted 62 of 62 rows (100.0%)
2025-12-27 22:21:39,936 - INFO - Added columns: white, black, api, race, hispanic
Result shape: (62, 7)

Prediction columns:
['api', 'black', 'hispanic', 'white', 'race']
[3]:
  last_name first_name     race      api    black hispanic    white
0     Smith       John    white 0.008322 0.215059 0.029183 0.747435
1    Garcia      Maria hispanic 0.010198 0.002871 0.942291 0.044640
2   Johnson      David    white 0.005786 0.364275 0.021122 0.608818
3     Davis      Sarah    white 0.003984 0.356766 0.020293 0.618956
4     Brown    Michael    white 0.003043 0.398471 0.026545 0.571941
5    Wilson   Jennifer    white 0.008207 0.273209 0.018134 0.700449
6  Martinez     Carlos hispanic 0.005853 0.003826 0.942765 0.047556
7  Anderson       Lisa    white 0.011476 0.188307 0.024973 0.775244
8    Taylor      James    white 0.007325 0.279690 0.028578 0.684406
9 Rodriguez       Anna hispanic 0.007562 0.006216 0.942559 0.043662
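
The race column only reports the top category, which hides how close some predictions are (Brown is about 0.57 white and 0.40 black). A minimal sketch of flagging low-confidence rows follows, using three rows copied from the output above. The 0.6 cutoff and the `top_prob` and `confident` column names are arbitrary choices for illustration, not ethnicolr settings.

```python
import pandas as pd

# Three rows copied from the prediction output above (probabilities, 0-1).
pred_sample = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Brown"],
    "race": ["white", "hispanic", "white"],
    "api": [0.008322, 0.010198, 0.003043],
    "black": [0.215059, 0.002871, 0.398471],
    "hispanic": [0.029183, 0.942291, 0.026545],
    "white": [0.747435, 0.044640, 0.571941],
})

prob_cols = ["api", "black", "hispanic", "white"]
THRESHOLD = 0.6  # hypothetical cutoff; choose one that suits your analysis

# Probability of the winning category, and whether it clears the cutoff.
pred_sample["top_prob"] = pred_sample[prob_cols].max(axis=1)
pred_sample["confident"] = pred_sample["top_prob"] >= THRESHOLD

print(pred_sample[["last_name", "race", "top_prob", "confident"]])
```

Rows that fall below the cutoff can then be reviewed or excluded instead of being treated as firm classifications.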

Summary Statistics

Let’s look at the distribution of predicted races in our sample.

[4]:
# Distribution of predicted races
race_dist = result_pred['race'].value_counts()
print("Race distribution:")
for race, count in race_dist.items():
    percentage = (count / len(result_pred)) * 100
    print(f"{race}: {count} ({percentage:.1f}%)")

# Show example surnames for the three most common predicted races
print("\nExample predictions by race:")
for race in race_dist.index[:3]:
    examples = result_pred[result_pred['race'] == race]['last_name'].head(3).tolist()
    print(f"{race}: {', '.join(examples)}")
Race distribution:
white: 51 (82.3%)
hispanic: 9 (14.5%)
black: 1 (1.6%)
api: 1 (1.6%)

Example predictions by race:
white: Smith, Johnson, Davis
hispanic: Garcia, Martinez, Rodriguez
black: Jackson

Comparing Methods

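Let’s compare the census lookup with the ML prediction for a few names. Note that the census columns are percentages (0-100), while the model outputs are probabilities (0-1).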

[5]:
# Create comparison dataframe
# Both functions process the same input DataFrame, so we align rows by index
# instead of merging on last_name (this assumes both results keep the input index)

print(f"Original data shape: {df.shape}")
print(f"Census result shape: {result_census.shape}")
print(f"Prediction result shape: {result_pred.shape}")

# Check if we have the expected columns
print("\nCensus columns with percentages:", [col for col in result_census.columns if 'pct' in col])
print("Prediction columns with probabilities:", [col for col in result_pred.columns if col in ['api', 'black', 'hispanic', 'white']])

# Create aligned comparison using index
comparison = pd.DataFrame({
    'last_name': df['last_name'],  # Use original names for reference
    'census_white': result_census['pctwhite'],
    'census_black': result_census['pctblack'],
    'census_api': result_census['pctapi'],
    'census_hispanic': result_census['pcthispanic'],
    'pred_race': result_pred['race'],
    'pred_white': result_pred['white'],
    'pred_black': result_pred['black'],
    'pred_api': result_pred['api'],
    'pred_hispanic': result_pred['hispanic']
})

print(f"\nComparison created with {len(comparison)} names")
print("Comparison of Census vs ML predictions (first 10 names):")
comparison[['last_name', 'census_white', 'census_black', 'pred_race', 'pred_white', 'pred_black']].head(10)
Original data shape: (62, 2)
Census result shape: (62, 8)
Prediction result shape: (62, 7)

Census columns with percentages: ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']
Prediction columns with probabilities: ['api', 'black', 'hispanic', 'white']

Comparison created with 62 names
Comparison of Census vs ML predictions (first 10 names):
[5]:
  last_name census_white census_black pred_race pred_white pred_black
0     Smith         70.9        23.11     white   0.747435   0.215059
1    Garcia         5.38         0.45  hispanic   0.044640   0.002871
2   Johnson        58.97        34.63     white   0.608818   0.364275
3     Davis         62.2         31.6     white   0.618956   0.356766
4     Brown        57.95         35.6     white   0.571941   0.398471
5    Wilson        67.36        25.99     white   0.700449   0.273209
6  Martinez         5.28         0.49  hispanic   0.047556   0.003826
7  Anderson        75.17        18.93     white   0.775244   0.188307
8    Taylor        65.38        28.42     white   0.684406   0.279690
9 Rodriguez         4.75         0.54  hispanic   0.043662   0.006216
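
Because the two sources use different scales, a direct comparison is easier after dividing the census percentages by 100. The sketch below does that for three rows copied from the table above and reports the gap between the model probability and the census share. The `diff_white` and `diff_black` column names are hypothetical.

```python
import pandas as pd

# Three rows copied from the comparison output above.
comparison_sample = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson"],
    "census_white": [70.9, 5.38, 58.97],
    "census_black": [23.11, 0.45, 34.63],
    "pred_white": [0.747435, 0.044640, 0.608818],
    "pred_black": [0.215059, 0.002871, 0.364275],
})

for group in ["white", "black"]:
    # Put the census percentage on the same 0-1 scale as the model output.
    # Assumption: the census columns may be strings, hence the coercion.
    census_share = pd.to_numeric(
        comparison_sample[f"census_{group}"], errors="coerce"
    ) / 100
    # Positive values mean the model assigns more weight than the census table.
    comparison_sample[f"diff_{group}"] = (
        comparison_sample[f"pred_{group}"] - census_share
    ).round(4)

print(comparison_sample[["last_name", "diff_white", "diff_black"]])
```

One caveat on reading the gaps: the census table spreads its 100% over six categories while the model spreads its probability over four, so small positive differences are expected even when the two agree.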

Key Differences

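  • Census lookup: Returns, for each surname, the percentage of people with that surname in each race/ethnicity category, taken directly from the census surname table (six columns, including pctaian and pct2prace)

  • ML prediction: Uses an LSTM model trained on census data to return a probability for each of four categories (api, black, hispanic, white), plus the most likely category in the race column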

  • Use cases: Census lookup for aggregate analysis, ML predictions for individual classification
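
To illustrate why probabilities suit aggregate analysis better than hard labels, the sketch below compares two ways of estimating group sizes from the first five predictions shown earlier: counting the race labels, and summing the per-category probabilities. It uses the model output for convenience, and the same idea applies to census percentages divided by 100. The `hard_counts` and `expected_counts` names are illustrative only.

```python
import pandas as pd

# First five rows copied from the prediction output earlier in this notebook.
pred_sample = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson", "Davis", "Brown"],
    "race": ["white", "hispanic", "white", "white", "white"],
    "api": [0.008322, 0.010198, 0.005786, 0.003984, 0.003043],
    "black": [0.215059, 0.002871, 0.364275, 0.356766, 0.398471],
    "hispanic": [0.029183, 0.942291, 0.021122, 0.020293, 0.026545],
    "white": [0.747435, 0.044640, 0.608818, 0.618956, 0.571941],
})

prob_cols = ["api", "black", "hispanic", "white"]

# Hard labels: every name counts fully toward its top category.
hard_counts = pred_sample["race"].value_counts().reindex(prob_cols, fill_value=0)

# Soft counts: every name contributes its probability to each category.
expected_counts = pred_sample[prob_cols].sum()

print(pd.DataFrame({"hard": hard_counts, "expected": expected_counts.round(2)}))
```

With hard labels, none of these five names counts as black, while the summed probabilities give an expected count of about 1.34, so aggregate estimates built from labels alone can understate smaller groups.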