Basic Usage Examples
This notebook demonstrates basic usage of the ethnicolr package for predicting race and ethnicity from names.
Setup
First, let’s import the necessary libraries and load some sample data.
[1]:
import pandas as pd
import ethnicolr
from pathlib import Path
# Load sample data
data_path = Path('data/input-with-header.csv')
try:
    df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Create sample data if file not found
    df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")
print(f"Sample data shape: {df.shape}")
print("\nFirst few rows:")
df.head()
Loaded data from: data/input-with-header.csv
Sample data shape: (62, 2)
First few rows:
[1]:
| | first_name | last_name |
|---|---|---|
| 0 | John | Smith |
| 1 | Maria | Garcia |
| 2 | David | Johnson |
| 3 | Sarah | Davis |
| 4 | Michael | Brown |
Census Data Lookup
The simplest approach is to look up the demographic percentages associated with each last name in US Census data.
[2]:
# Census lookup by last name (2010 data)
result_census = ethnicolr.census_ln(df, 'last_name', year=2010)
print(f"Result shape: {result_census.shape}")
print("\nColumns added:")
census_cols = [col for col in result_census.columns if col not in df.columns]
print(census_cols)
# Show first few results
result_census[['last_name', 'first_name'] + census_cols].head()
2025-12-27 22:21:39,521 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:39,523 - INFO - Loading Census 2010 data from /home/runner/work/ethnicolr/ethnicolr/ethnicolr/data/census/census_2010.csv...
2025-12-27 22:21:39,679 - INFO - Loaded 162253 last names from Census 2010
2025-12-27 22:21:39,680 - INFO - Merging demographic data for 62 records...
2025-12-27 22:21:39,715 - INFO - Matched 62 of 62 rows (100.0%)
2025-12-27 22:21:39,716 - INFO - Added columns: pct2prace, pctaian, pctapi, pctblack, pcthispanic, pctwhite
Result shape: (62, 8)
Columns added:
['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']
[2]:
| | last_name | first_name | pctwhite | pctblack | pctapi | pctaian | pct2prace | pcthispanic |
|---|---|---|---|---|---|---|---|---|
| 0 | Smith | John | 70.9 | 23.11 | 0.5 | 0.89 | 2.19 | 2.4 |
| 1 | Garcia | Maria | 5.38 | 0.45 | 1.41 | 0.47 | 0.26 | 92.03 |
| 2 | Johnson | David | 58.97 | 34.63 | 0.54 | 0.94 | 2.56 | 2.36 |
| 3 | Davis | Sarah | 62.2 | 31.6 | 0.49 | 0.82 | 2.45 | 2.44 |
| 4 | Brown | Michael | 57.95 | 35.6 | 0.51 | 0.87 | 2.55 | 2.52 |
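The lookup returns percentages rather than a single label. If a label is needed, one option is to take the category with the largest share. The following is a minimal sketch using plain pandas on values copied from the first three rows above; the column names `census_top` and `pct_total` are hypothetical, and the numeric coercion assumes some cells in a real lookup may not parse as floats.

```python
import pandas as pd

# Values copied from the first three rows of the census lookup output above.
result_census = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson"],
    "pctwhite": [70.9, 5.38, 58.97],
    "pctblack": [23.11, 0.45, 34.63],
    "pctapi": [0.5, 1.41, 0.54],
    "pctaian": [0.89, 0.47, 0.94],
    "pct2prace": [2.19, 0.26, 2.56],
    "pcthispanic": [2.4, 92.03, 2.36],
})

pct_cols = ["pctwhite", "pctblack", "pctapi", "pctaian", "pct2prace", "pcthispanic"]

# Coerce to numeric so that any non-numeric cell becomes NaN
# (assumption: real lookup results may contain such cells).
pcts = result_census[pct_cols].apply(pd.to_numeric, errors="coerce")

# Hypothetical column: the category with the largest share for each surname.
result_census["census_top"] = pcts.idxmax(axis=1).str.replace("pct", "", regex=False)

# Sanity check: the six percentages should sum to roughly 100 per surname.
result_census["pct_total"] = pcts.sum(axis=1).round(2)

print(result_census[["last_name", "census_top", "pct_total"]])
```

Taking the largest share discards most of the information in the row, so for aggregate work it is usually better to keep the full set of percentages.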
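Simple Census-based Predictions
To go beyond a direct lookup, we can use the census-based LSTM model, which returns a predicted category in the race column along with a probability for each of four categories.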
[3]:
# Predict using census LSTM model
result_pred = ethnicolr.pred_census_ln(df, 'last_name')
print(f"Result shape: {result_pred.shape}")
print("\nPrediction columns:")
pred_cols = [col for col in result_pred.columns if col not in df.columns]
print(pred_cols)
# Show predictions with per-category probabilities (census model uses api, black, hispanic, white)
result_pred[['last_name', 'first_name', 'race', 'api', 'black', 'hispanic', 'white']].head(10)
2025-12-27 22:21:39,727 - INFO - Processing 62 names using Census 2010 LSTM model
2025-12-27 22:21:39,728 - INFO - Data filtering summary: 62 -> 62 rows (kept 100.0%)
2025-12-27 22:21:39,935 - INFO - Predicted 62 of 62 rows (100.0%)
2025-12-27 22:21:39,936 - INFO - Added columns: white, black, api, race, hispanic
Result shape: (62, 7)
Prediction columns:
['api', 'black', 'hispanic', 'white', 'race']
[3]:
| | last_name | first_name | race | api | black | hispanic | white |
|---|---|---|---|---|---|---|---|
| 0 | Smith | John | white | 0.008322 | 0.215059 | 0.029183 | 0.747435 |
| 1 | Garcia | Maria | hispanic | 0.010198 | 0.002871 | 0.942291 | 0.044640 |
| 2 | Johnson | David | white | 0.005786 | 0.364275 | 0.021122 | 0.608818 |
| 3 | Davis | Sarah | white | 0.003984 | 0.356766 | 0.020293 | 0.618956 |
| 4 | Brown | Michael | white | 0.003043 | 0.398471 | 0.026545 | 0.571941 |
| 5 | Wilson | Jennifer | white | 0.008207 | 0.273209 | 0.018134 | 0.700449 |
| 6 | Martinez | Carlos | hispanic | 0.005853 | 0.003826 | 0.942765 | 0.047556 |
| 7 | Anderson | Lisa | white | 0.011476 | 0.188307 | 0.024973 | 0.775244 |
| 8 | Taylor | James | white | 0.007325 | 0.279690 | 0.028578 | 0.684406 |
| 9 | Rodriguez | Anna | hispanic | 0.007562 | 0.006216 | 0.942559 | 0.043662 |
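Several of the predictions above are far from certain: Brown is labeled white with a probability of about 0.57. One way to handle this is to keep a label only when its probability clears a cutoff. The sketch below uses plain pandas on rows copied from the output above; the `THRESHOLD` value and the `top_prob` and `race_confident` column names are illustrative assumptions, not part of ethnicolr.

```python
import pandas as pd

# Rows copied from the prediction output above.
result_pred = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Brown"],
    "race": ["white", "hispanic", "white"],
    "api": [0.008322, 0.010198, 0.003043],
    "black": [0.215059, 0.002871, 0.398471],
    "hispanic": [0.029183, 0.942291, 0.026545],
    "white": [0.747435, 0.044640, 0.571941],
})

prob_cols = ["api", "black", "hispanic", "white"]
THRESHOLD = 0.7  # hypothetical cutoff; choose one that suits your application

# Highest predicted probability in each row.
result_pred["top_prob"] = result_pred[prob_cols].max(axis=1)

# Keep the predicted label only when its probability reaches the cutoff;
# otherwise mark the row as uncertain.
result_pred["race_confident"] = result_pred["race"].where(
    result_pred["top_prob"] >= THRESHOLD, other="uncertain"
)

print(result_pred[["last_name", "race", "top_prob", "race_confident"]])
```

With this cutoff, Smith and Garcia keep their labels and Brown is marked uncertain.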
Summary Statistics
Let’s look at the distribution of predicted races in our sample.
[4]:
# Distribution of predicted races
race_dist = result_pred['race'].value_counts()
print("Race distribution:")
for race, count in race_dist.items():
    percentage = (count / len(result_pred)) * 100
    print(f"{race}: {count} ({percentage:.1f}%)")
# Show some examples by race
print("\nExample predictions by race:")
for race in race_dist.index[:3]:
    examples = result_pred[result_pred['race'] == race]['last_name'].head(3).tolist()
    print(f"{race}: {', '.join(examples)}")
Race distribution:
white: 51 (82.3%)
hispanic: 9 (14.5%)
black: 1 (1.6%)
api: 1 (1.6%)
Example predictions by race:
white: Smith, Johnson, Davis
hispanic: Garcia, Martinez, Rodriguez
black: Jackson
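Counting the most likely label for each row can understate categories that are rarely the top prediction but often carry substantial probability. A possible alternative for aggregate estimates is to sum the per-category probabilities, which gives an expected count per category. The sketch below uses plain pandas on the first five rows of the prediction output; the variable names `hard_counts` and `expected_counts` are illustrative.

```python
import pandas as pd

# First five rows of the prediction output shown earlier.
result_pred = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson", "Davis", "Brown"],
    "race": ["white", "hispanic", "white", "white", "white"],
    "api": [0.008322, 0.010198, 0.005786, 0.003984, 0.003043],
    "black": [0.215059, 0.002871, 0.364275, 0.356766, 0.398471],
    "hispanic": [0.029183, 0.942291, 0.021122, 0.020293, 0.026545],
    "white": [0.747435, 0.044640, 0.608818, 0.618956, 0.571941],
})

prob_cols = ["api", "black", "hispanic", "white"]

# Hard counts: one vote per row for the most likely category.
hard_counts = result_pred["race"].value_counts()

# Expected counts: sum of probabilities, so each row spreads its
# single vote across the four categories.
expected_counts = result_pred[prob_cols].sum()

print("Hard counts:")
print(hard_counts)
print("\nExpected counts:")
print(expected_counts.round(2))
```

In these five rows no name is labeled black, yet the summed probabilities for that category exceed one expected person, which illustrates how the two summaries can diverge.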
Comparing Methods
Let’s compare the census lookup vs. ML prediction for a few names.
[5]:
# Create comparison dataframe
# Since both functions process the same input DataFrame, we can align by index
# rather than merging on potentially different last_name values
print(f"Original data shape: {df.shape}")
print(f"Census result shape: {result_census.shape}")
print(f"Prediction result shape: {result_pred.shape}")
# Check if we have the expected columns
print("\nCensus columns with percentages:", [col for col in result_census.columns if 'pct' in col])
print("Prediction columns with probabilities:", [col for col in result_pred.columns if col in ['api', 'black', 'hispanic', 'white']])
# Create aligned comparison using index
comparison = pd.DataFrame({
    'last_name': df['last_name'],  # Use original names for reference
    'census_white': result_census['pctwhite'],
    'census_black': result_census['pctblack'],
    'census_api': result_census['pctapi'],
    'census_hispanic': result_census['pcthispanic'],
    'pred_race': result_pred['race'],
    'pred_white': result_pred['white'],
    'pred_black': result_pred['black'],
    'pred_api': result_pred['api'],
    'pred_hispanic': result_pred['hispanic']
})
print(f"\nComparison created with {len(comparison)} names")
print("Comparison of Census vs ML predictions (first 10 names):")
comparison[['last_name', 'census_white', 'census_black', 'pred_race', 'pred_white', 'pred_black']].head(10)
Original data shape: (62, 2)
Census result shape: (62, 8)
Prediction result shape: (62, 7)
Census columns with percentages: ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']
Prediction columns with probabilities: ['api', 'black', 'hispanic', 'white']
Comparison created with 62 names
Comparison of Census vs ML predictions (first 10 names):
[5]:
| | last_name | census_white | census_black | pred_race | pred_white | pred_black |
|---|---|---|---|---|---|---|
| 0 | Smith | 70.9 | 23.11 | white | 0.747435 | 0.215059 |
| 1 | Garcia | 5.38 | 0.45 | hispanic | 0.044640 | 0.002871 |
| 2 | Johnson | 58.97 | 34.63 | white | 0.608818 | 0.364275 |
| 3 | Davis | 62.2 | 31.6 | white | 0.618956 | 0.356766 |
| 4 | Brown | 57.95 | 35.6 | white | 0.571941 | 0.398471 |
| 5 | Wilson | 67.36 | 25.99 | white | 0.700449 | 0.273209 |
| 6 | Martinez | 5.28 | 0.49 | hispanic | 0.047556 | 0.003826 |
| 7 | Anderson | 75.17 | 18.93 | white | 0.775244 | 0.188307 |
| 8 | Taylor | 65.38 | 28.42 | white | 0.684406 | 0.279690 |
| 9 | Rodriguez | 4.75 | 0.54 | hispanic | 0.043662 | 0.006216 |
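The two sets of columns are on different scales: the census values are percentages (0 to 100) and the model outputs are probabilities (0 to 1). A rough way to compare them is to put both on the same scale and take the difference. The sketch below does this for the white category using three rows from the table above; `census_white_prop` and `white_diff` are hypothetical column names. Because the census lookup spreads its total over six categories and the model over four, the two numbers are not strictly comparable, so treat the difference as indicative only.

```python
import pandas as pd

# Three rows copied from the comparison table above.
comparison = pd.DataFrame({
    "last_name": ["Smith", "Garcia", "Johnson"],
    "census_white": [70.9, 5.38, 58.97],
    "pred_white": [0.747435, 0.044640, 0.608818],
})

# Convert the census percentage to a proportion on the 0-1 scale.
comparison["census_white_prop"] = comparison["census_white"] / 100

# Positive values mean the model assigns more weight to "white"
# than the census share does.
comparison["white_diff"] = comparison["pred_white"] - comparison["census_white_prop"]

print(comparison.round(4))
```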
Key Differences
Census lookup: Returns, for each surname, the percentage (0 to 100) of people with that surname in each race/ethnicity category, taken directly from census data
ML prediction: Uses an LSTM model trained on census data to return a probability (0 to 1) for each of four categories, plus the most likely category in the race column
Use cases: Census lookup for aggregate analysis, ML predictions for individual classification
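One practical point when choosing between the two: a lookup can only return values for surnames that appear in the reference table. The sketch below illustrates this with a toy table and plain pandas rather than ethnicolr itself. It assumes the lookup behaves like a left join, which the "Matched 62 of 62 rows" log line suggests but does not confirm, and "Zzyzx" is a made-up surname chosen to be absent from the toy table.

```python
import pandas as pd

# Toy lookup table standing in for the census file
# (values copied from the lookup output earlier in this notebook).
lookup = pd.DataFrame({
    "last_name": ["Smith", "Garcia"],
    "pctwhite": [70.9, 5.38],
    "pcthispanic": [2.4, 92.03],
})

# "Zzyzx" is a made-up surname that does not appear in the toy table.
names = pd.DataFrame({"last_name": ["Smith", "Garcia", "Zzyzx"]})

# A left join keeps every input row and leaves NaN where no match exists.
merged = names.merge(lookup, on="last_name", how="left")
unmatched = merged[merged["pctwhite"].isna()]

print(f"Matched {len(merged) - len(unmatched)} of {len(merged)} rows")
print("Unmatched surnames:", unmatched["last_name"].tolist())
```

Checking the share of unmatched rows before analysis helps decide whether a lookup alone is sufficient for a given dataset.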