Census Data Models
The census data models in ethnicolr2 provide predictions based on US Census surname statistics and LSTM models trained on census data.
Available Census Functions
census_ln() - Census Statistics Lookup
Direct lookup of census surname statistics without machine learning prediction:
from ethnicolr2 import census_ln
import pandas as pd
# Create sample data
df = pd.DataFrame({'surname': ['Smith', 'Zhang', 'Rodriguez', 'Johnson']})
# Get 2010 census statistics
census_2010 = census_ln(df, 'surname', year=2010)
print(census_2010)
Output columns:
pctwhite: Percentage Non-Hispanic White
pctblack: Percentage Non-Hispanic Black
pctapi: Percentage Asian and Pacific Islander
pctaian: Percentage American Indian and Alaska Native
pct2prace: Percentage Two or More Races
pcthispanic: Percentage Hispanic
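The percentage columns can be reduced to a single "largest group" per surname. The sketch below uses a hand-made stand-in for the census_ln() output (the surnames and values are placeholders, not census figures) and assumes the columns may arrive as strings, since the Census Bureau surname files mark suppressed cells with "(S)". The names top_group and top_share are illustrative.

```python
import pandas as pd

# Hand-made stand-in for census_ln() output; values are placeholders.
census_like = pd.DataFrame({
    "surname": ["Alpha", "Beta", "Gamma"],
    "pctwhite": ["70.5", "5.0", "(S)"],
    "pctblack": ["20.0", "1.5", "10.0"],
    "pctapi": ["1.0", "90.0", "(S)"],
    "pctaian": ["0.5", "(S)", "2.0"],
    "pct2prace": ["3.0", "2.0", "3.0"],
    "pcthispanic": ["5.0", "1.5", "85.0"],
})

pct_cols = ["pctwhite", "pctblack", "pctapi", "pctaian", "pct2prace", "pcthispanic"]

# Coerce to floats; anything non-numeric (such as "(S)") becomes NaN.
numeric = census_like[pct_cols].apply(pd.to_numeric, errors="coerce")

# Column holding the largest share for each surname (NaN cells are skipped).
census_like["top_group"] = numeric.idxmax(axis=1)
census_like["top_share"] = numeric.max(axis=1)
```

If the columns are already numeric, the coercion step is a no-op and can stay in place as a guard.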
pred_census_last_name() - LSTM Predictions
Machine learning predictions using LSTM models trained on census data:
from ethnicolr2 import pred_census_last_name
# LSTM-based predictions
ml_predictions = pred_census_last_name(df, 'surname', year=2010)
print(ml_predictions)
Output columns:
race: Predicted category (highest probability)
asian: Probability Asian
black: Probability Non-Hispanic Black
hispanic: Probability Hispanic
white: Probability Non-Hispanic White
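A quick sanity check on these columns might look like the sketch below. It assumes the four probabilities form a distribution (sum to roughly 1) and that race names the column with the highest probability, as the column descriptions suggest. The frame is a hand-made stand-in with placeholder values, not real model output.

```python
import pandas as pd

# Hand-made stand-in for pred_census_last_name() output; values are placeholders.
pred_like = pd.DataFrame({
    "surname": ["Alpha", "Beta"],
    "race": ["white", "asian"],
    "asian": [0.05, 0.80],
    "black": [0.15, 0.05],
    "hispanic": [0.10, 0.05],
    "white": [0.70, 0.10],
})

prob_cols = ["asian", "black", "hispanic", "white"]

# Each row's probabilities should sum to roughly 1.
row_sums = pred_like[prob_cols].sum(axis=1)
sums_ok = bool(((row_sums - 1.0).abs() < 1e-6).all())

# The label should match the column with the highest probability.
labels_ok = bool((pred_like[prob_cols].idxmax(axis=1) == pred_like["race"]).all())
```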
Census Years
Both 2000 and 2010 census data are available:
# Compare across census years
census_2000 = census_ln(df, 'surname', year=2000)
census_2010 = census_ln(df, 'surname', year=2010)
# ML predictions for different years
pred_2000 = pred_census_last_name(df, 'surname', year=2000)
pred_2010 = pred_census_last_name(df, 'surname', year=2010)
Practical Examples
Academic Research
import pandas as pd
from ethnicolr2 import census_ln, pred_census_last_name
# Load research dataset
authors_df = pd.read_csv('academic_authors.csv')
# Columns: ['author_name', 'last_name', 'institution', 'field']
# Get census statistics
census_stats = census_ln(authors_df, 'last_name', year=2010)
# Add ML predictions
ml_predictions = pred_census_last_name(authors_df, 'last_name', year=2010)
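# Merge results. Keep one lookup row per surname so that duplicate
# surnames in authors_df do not multiply rows in the merge.
census_cols = ['last_name', 'pctwhite', 'pctblack', 'pctapi', 'pcthispanic']
research_results = pd.merge(
    authors_df,
    census_stats[census_cols].drop_duplicates(subset='last_name'),
    on='last_name',
    how='left'
)
pred_cols = ['last_name', 'race', 'asian', 'black', 'hispanic', 'white']
research_results = pd.merge(
    research_results,
    ml_predictions[pred_cols].drop_duplicates(subset='last_name'),
    on='last_name',
    how='left'
)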
print(research_results.groupby(['field', 'race']).size())
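Counting hard labels discards the uncertainty in each prediction. One alternative, sketched here on a hand-made table with placeholder values, is to average the probability columns within each group. The column names follow the example above; merged_like, hard_counts, and soft_shares are illustrative names.

```python
import pandas as pd

# Illustrative merged table; values are placeholders, not model output.
merged_like = pd.DataFrame({
    "field": ["physics", "physics", "biology"],
    "race": ["white", "white", "asian"],
    "asian": [0.10, 0.30, 0.60],
    "black": [0.10, 0.10, 0.10],
    "hispanic": [0.20, 0.10, 0.10],
    "white": [0.60, 0.50, 0.20],
})

prob_cols = ["asian", "black", "hispanic", "white"]

# Hard counts: every author contributes one unit to the predicted label.
hard_counts = merged_like.groupby(["field", "race"]).size()

# Soft shares: average the probabilities within each field instead.
soft_shares = merged_like.groupby("field")[prob_cols].mean()
```

In this toy table, physics is counted as entirely "white" by hard labels, while the soft shares still credit the other categories with their average probability.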
Historical Analysis
# Compare demographic trends over time
names = ['Kim', 'Patel', 'Martinez', 'Johnson']
df = pd.DataFrame({'last_name': names})
# Get both census years
results_2000 = census_ln(df, 'last_name', year=2000)
results_2010 = census_ln(df, 'last_name', year=2010)
# Compare changes
for name in names:
    row_2000 = results_2000[results_2000['last_name'] == name].iloc[0]
    row_2010 = results_2010[results_2010['last_name'] == name].iloc[0]
    print(f"\n{name}:")
    print(f" Hispanic 2000: {row_2000['pcthispanic']:.1f}%")
    print(f" Hispanic 2010: {row_2010['pcthispanic']:.1f}%")
    # The difference between two percentages is in percentage points
    print(f" Change: {row_2010['pcthispanic'] - row_2000['pcthispanic']:+.1f} pp")
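For more than a handful of names, the same comparison can be done without a loop by merging the two years on the surname. This is a sketch on hand-made frames with placeholder values. It assumes pcthispanic is numeric and that each surname appears once per year, which validate="one_to_one" enforces.

```python
import pandas as pd

# Illustrative stand-ins for census_ln() output in two years.
y2000 = pd.DataFrame({"last_name": ["Alpha", "Beta"], "pcthispanic": [10.0, 80.0]})
y2010 = pd.DataFrame({"last_name": ["Alpha", "Beta"], "pcthispanic": [12.5, 78.0]})

# One row per surname, with year-specific column suffixes.
both = y2000.merge(
    y2010,
    on="last_name",
    suffixes=("_2000", "_2010"),
    validate="one_to_one",  # raises if a surname is duplicated in either year
)

# Change in percentage points between the two censuses.
both["hispanic_change_pp"] = both["pcthispanic_2010"] - both["pcthispanic_2000"]
```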
Large Dataset Processing
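def process_census_batch(df_chunk, year=2010):
    """Process one chunk: census lookup plus LSTM prediction."""
    # Get both census stats and ML predictions
    census_stats = census_ln(df_chunk, 'last_name', year=year)
    ml_predictions = pred_census_last_name(df_chunk, 'last_name', year=year)
    # Keep one row per surname and only the added columns, so that
    # duplicate surnames in the chunk do not multiply rows in the merge
    pct_cols = ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']
    census_lookup = census_stats[['last_name'] + pct_cols].drop_duplicates(subset='last_name')
    race_lookup = ml_predictions[['last_name', 'race']].drop_duplicates(subset='last_name')
    # Merge results
    result = pd.merge(df_chunk, census_lookup, on='last_name', how='left')
    result = pd.merge(result, race_lookup, on='last_name', how='left')
    return result

# Process large CSV in chunks
chunk_size = 10000
results = []
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    chunk_result = process_census_batch(chunk)
    results.append(chunk_result)
    print(f"Processed {len(chunk)} records")
final_result = pd.concat(results, ignore_index=True)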
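Large files usually repeat surnames, so one possible optimization is to run the model once per distinct name and map the labels back to every row. The sketch below illustrates the idea with fake_predict, a hypothetical stand-in that returns a dummy label so the example runs without the library. predict_unique is likewise an illustrative helper, not part of ethnicolr2.

```python
import pandas as pd

def fake_predict(unique_df, name_col):
    """Hypothetical stand-in for pred_census_last_name(); returns a dummy label."""
    out = unique_df.copy()
    out["race"] = out[name_col].str.len().map(lambda n: "even" if n % 2 == 0 else "odd")
    return out

def predict_unique(chunk, name_col, predict_fn):
    """Run predict_fn once per distinct name, then map labels back to every row."""
    unique_names = chunk[[name_col]].drop_duplicates().reset_index(drop=True)
    predicted = predict_fn(unique_names, name_col)
    lookup = predicted.set_index(name_col)["race"]
    result = chunk.copy()
    result["race"] = result[name_col].map(lookup)
    return result

chunk = pd.DataFrame({"last_name": ["Alpha", "Beta", "Alpha", "Alpha"]})
labeled = predict_unique(chunk, "last_name", fake_predict)
```

Here the predictor sees two names instead of four rows. Whether this saves time in practice depends on how repetitive the data is.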
Data Quality Considerations
Coverage
Not all surnames appear in census data:
# Check which names have census data
census_result = census_ln(df, 'last_name', year=2010)
# Names not in census will have NaN values
missing_census = census_result[census_result['pctwhite'].isna()]
print(f"Names missing from census: {len(missing_census)}")
# ML predictions work for all names (including those not in census)
ml_result = pred_census_last_name(df, 'last_name', year=2010)
print(f"ML predictions available: {len(ml_result)}")
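Coverage can also be tracked as a per-row flag, which makes it easy to report a coverage rate or list the unmatched names. This is a sketch on a hand-made frame with placeholder values. It assumes, as the comment above states, that a missing pctwhite value means the surname was not found. The in_census column is an illustrative name.

```python
import pandas as pd

# Hand-made stand-in for census_ln() output; values are placeholders.
census_like = pd.DataFrame({
    "last_name": ["Alpha", "Beta", "Gamma", "Delta"],
    "pctwhite": [70.0, None, 12.0, None],
})

# True where the lookup found the surname.
census_like["in_census"] = census_like["pctwhite"].notna()

# Share of rows with census data, and the names without it.
coverage = census_like["in_census"].mean()
missing_names = census_like.loc[~census_like["in_census"], "last_name"].tolist()
```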
Confidence Assessment
# Assess prediction confidence
ml_predictions = pred_census_last_name(df, 'last_name', year=2010)
# Calculate max probability (confidence indicator)
ml_predictions['confidence'] = ml_predictions[['asian', 'black', 'hispanic', 'white']].max(axis=1)
# High confidence predictions (>80%)
high_confidence = ml_predictions[ml_predictions['confidence'] > 0.8]
print(f"High confidence predictions: {len(high_confidence)} / {len(ml_predictions)}")
# Review uncertain predictions
uncertain = ml_predictions[ml_predictions['confidence'] < 0.5]
print("\nUncertain predictions:")
print(uncertain[['last_name', 'race', 'confidence']])
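The maximum probability alone can hide close calls between two categories. A complementary indicator is the margin between the two largest probabilities, sketched below on a hand-made frame with placeholder values. The 0.2 cutoff and the margin and needs_review names are arbitrary choices for illustration.

```python
import pandas as pd

# Hand-made stand-in for prediction output; values are placeholders.
pred_like = pd.DataFrame({
    "last_name": ["Alpha", "Beta"],
    "asian": [0.05, 0.40],
    "black": [0.05, 0.05],
    "hispanic": [0.10, 0.45],
    "white": [0.80, 0.10],
})
prob_cols = ["asian", "black", "hispanic", "white"]

def top_two_margin(row):
    """Difference between the largest and second-largest probability."""
    ordered = sorted(row, reverse=True)
    return ordered[0] - ordered[1]

pred_like["margin"] = pred_like[prob_cols].apply(top_two_margin, axis=1)

# Flag rows where the top two categories are close (threshold is arbitrary).
pred_like["needs_review"] = pred_like["margin"] < 0.2
```

The second row has a maximum probability of only 0.45, and its margin of about 0.05 shows that the prediction is nearly a tie between two categories.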
Census vs Florida Models
When to use census models vs Florida models:
Use Census Models When:
Academic research requiring census validation
Historical analysis (2000 vs 2010 comparison)
Need to match published census statistics
Working with surnames only
Analyzing general US population
Use Florida Models When:
Need highest accuracy
Have both first and last names available
Working with modern datasets
Building practical applications rather than academic studies
# Compare census vs Florida predictions
from ethnicolr2 import pred_census_last_name, pred_fl_last_name
census_pred = pred_census_last_name(df, 'last_name')
florida_pred = pred_fl_last_name(df, 'last_name')
# Compare predictions
comparison = pd.DataFrame({
    'name': df['last_name'],
    'census_race': census_pred['race'],
    'florida_race': florida_pred['race'],
    'census_conf': census_pred[['asian', 'black', 'hispanic', 'white']].max(axis=1),
    'florida_conf': florida_pred[['asian', 'hispanic', 'nh_black', 'nh_white']].max(axis=1)
})
# Check agreement. The Florida columns use an nh_ prefix, so map those
# labels onto the census ones first (this assumes the race labels mirror
# the probability column names).
florida_race = comparison['florida_race'].replace({'nh_black': 'black', 'nh_white': 'white'})
agreement = (comparison['census_race'] == florida_race).mean()
print(f"Census vs Florida agreement: {agreement:.2%}")
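A single agreement rate does not show where the two models differ. A cross-tabulation of the two label columns does, as in the sketch below. The frame is hand-made with placeholder labels, and the label_map assumes the Florida labels carry the same nh_ prefix as the Florida probability columns.

```python
import pandas as pd

# Hand-made stand-in for the comparison table; labels are placeholders.
comparison_like = pd.DataFrame({
    "census_race": ["white", "black", "asian", "white"],
    "florida_race": ["nh_white", "nh_white", "asian", "nh_white"],
})

# Map Florida labels onto the census label set before comparing (assumed mapping).
label_map = {"nh_white": "white", "nh_black": "black"}
comparison_like["florida_norm"] = comparison_like["florida_race"].replace(label_map)

# Overall agreement, then a table of census label (rows) vs Florida label (columns).
agreement = (comparison_like["census_race"] == comparison_like["florida_norm"]).mean()
confusion = pd.crosstab(comparison_like["census_race"], comparison_like["florida_norm"])
```

Off-diagonal cells in confusion show which pairs of categories the models disagree on.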
Command Line Usage
# Census statistics lookup
census_ln input.csv -l last_name -o census_output.csv -y 2010
# Census LSTM predictions
pred_census_last_name input.csv -l last_name -o ml_output.csv -y 2010
Next Steps
Florida Models: Explore higher-accuracy Florida models
Examples and Use Cases: See more practical examples
Models API Reference: Technical API documentation