# Basic Usage Examples

This notebook demonstrates basic usage of the ethnicolr package for predicting race and ethnicity from names.

## Setup

First, let's import the necessary libraries and load some sample data.

In [None]:
import pandas as pd
import ethnicolr
from pathlib import Path

# Load sample data
data_path = Path('data/input-with-header.csv')

try:
    df = pd.read_csv(data_path)
    print(f"Loaded data from: {data_path}")
except FileNotFoundError:
    # Fall back to generated sample data if the file is not found
    df = pd.DataFrame({
        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],
        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']
    })
    print("Using generated sample data")

print(f"Sample data shape: {df.shape}")
print("\nFirst few rows:")
df.head()

## Census Data Lookup

The simplest approach is to look up demographic probabilities by last name using US Census data.

In [None]:
# Census lookup by last name (2010 data)
result_census = ethnicolr.census_ln(df, 'last_name', year=2010)
print(f"Result shape: {result_census.shape}")
print("\nColumns added:")
census_cols = [col for col in result_census.columns if col not in df.columns]
print(census_cols)

# Show first few results
result_census[['last_name', 'first_name'] + census_cols].head()
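
Conceptually, a surname lookup of this kind behaves like a left join between the input rows and a table keyed by last name. The sketch below illustrates that idea with plain pandas. It is not ethnicolr's actual implementation: the names `toy_lookup`, `people`, and `_key`, the normalization step, and every percentage in the table are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical lookup table. The percentages are invented for illustration
# and are NOT real census figures.
toy_lookup = pd.DataFrame({
    "name": ["SMITH", "GARCIA"],
    "pctwhite": [70.0, 5.0],
    "pcthispanic": [2.0, 92.0],
})

people = pd.DataFrame({"last_name": ["Smith", "garcia ", "Zzyzx"]})

# Normalize surnames so they match the lookup key (assumed normalization:
# strip surrounding whitespace and uppercase).
people["_key"] = people["last_name"].str.strip().str.upper()

# A left join keeps every input row; surnames missing from the table get NaN.
looked_up = people.merge(toy_lookup, how="left", left_on="_key", right_on="name")
looked_up = looked_up.drop(columns=["_key", "name"])

print(looked_up)
```

The point of the left join is that rows are never dropped: a surname that is absent from the table still appears in the result, with missing values in the added columns.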

## Simple Census-based Predictions

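For model-based predictions, we can use the LSTM model trained on census data. Instead of looking a surname up in a table, it predicts a probability for each category and reports the most likely one in a `race` column.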

In [None]:
# Predict using census LSTM model
result_pred = ethnicolr.pred_census_ln(df, 'last_name')
print(f"Result shape: {result_pred.shape}")
print("\nPrediction columns:")
pred_cols = [col for col in result_pred.columns if col not in df.columns]
print(pred_cols)

# Show the predicted label alongside the per-category probabilities
# (the census model uses four categories: api, black, hispanic, white)
result_pred[['last_name', 'first_name', 'race', 'api', 'black', 'hispanic', 'white']].head(10)

## Summary Statistics

Let's look at the distribution of predicted races in our sample.

In [None]:
# Distribution of predicted races
race_dist = result_pred['race'].value_counts()
print("Race distribution:")
for race, count in race_dist.items():
    percentage = (count / len(result_pred)) * 100
    print(f"{race}: {count} ({percentage:.1f}%)")

# Show some examples by race
print("\nExample predictions by race:")
for race in race_dist.index[:3]:
    examples = result_pred[result_pred['race'] == race]['last_name'].head(3).tolist()
    print(f"{race}: {', '.join(examples)}")

## Comparing Methods

Let's compare the census lookup vs. ML prediction for a few names.

In [None]:
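# Create comparison dataframe
# Both functions were given the same input DataFrame, so we align the results
# by index instead of merging on last_name. This assumes each result kept one
# row per input row with the same index as df; the shapes printed below are a
# quick sanity check of that assumption.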

print(f"Original data shape: {df.shape}")
print(f"Census result shape: {result_census.shape}")
print(f"Prediction result shape: {result_pred.shape}")

# Check if we have the expected columns
print("\nCensus columns with percentages:", [col for col in result_census.columns if 'pct' in col])
print("Prediction columns with probabilities:", [col for col in result_pred.columns if col in ['api', 'black', 'hispanic', 'white']])

# Create aligned comparison using index
comparison = pd.DataFrame({
    'last_name': df['last_name'],  # Use original names for reference
    'census_white': result_census['pctwhite'],
    'census_black': result_census['pctblack'],
    'census_api': result_census['pctapi'],
    'census_hispanic': result_census['pcthispanic'],
    'pred_race': result_pred['race'],
    'pred_white': result_pred['white'],
    'pred_black': result_pred['black'],
    'pred_api': result_pred['api'],
    'pred_hispanic': result_pred['hispanic']
})

print(f"\nComparison created with {len(comparison)} names")
print("Comparison of Census vs ML predictions (first 10 names):")
comparison[['last_name', 'census_white', 'census_black', 'pred_race', 'pred_white', 'pred_black']].head(10)
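
To turn a side-by-side table like this into a single agreement figure, one option is to derive the top census category per row and compare it with the predicted label. The sketch below runs on a small stand-in frame (`toy`) whose values are entirely made up, and it assumes the census percentage columns may arrive as strings or contain a non-numeric placeholder, so they are coerced to numbers first. The helper columns `census_top` and `agree` are hypothetical names.

```python
import pandas as pd

# Hypothetical values standing in for the comparison frame; not real output.
toy = pd.DataFrame({
    "last_name": ["A", "B", "C"],
    "census_white": ["80.5", "10.0", "(S)"],  # "(S)" is a made-up placeholder
    "census_black": ["10.0", "5.0", "20.0"],
    "census_api": ["2.0", "3.0", "1.0"],
    "census_hispanic": ["5.0", "80.0", "4.0"],
    "pred_race": ["white", "hispanic", "white"],
})

census_cols = ["census_white", "census_black", "census_api", "census_hispanic"]

# Coerce to numeric; anything unparseable becomes NaN and is skipped by idxmax.
numeric = toy[census_cols].apply(pd.to_numeric, errors="coerce")

# Column name of the largest percentage per row, with the prefix removed.
toy["census_top"] = numeric.idxmax(axis=1).str.replace("census_", "", regex=False)

# True where the lookup's top category matches the model's predicted label.
toy["agree"] = toy["census_top"] == toy["pred_race"]

print(toy[["last_name", "census_top", "pred_race", "agree"]])
print(f"Agreement rate: {toy['agree'].mean():.2f}")
```

Only the four categories shared by both methods are compared here. A row whose census percentages are all missing would need separate handling before calling `idxmax`.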

## Key Differences

- **Census lookup**: Returns, for each surname found in the census data, the percentage of people with that surname in each race/ethnicity category (the `pct*` columns)
- **ML prediction**: Uses an LSTM model trained on census data to estimate a probability for each category and reports the most likely category in the `race` column
- **Use cases**: Census lookup for aggregate analysis, ML predictions for individual classification
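
The distinction between aggregate analysis and individual classification can be made concrete with a small, made-up example. Assigning each row its most likely category and counting the labels can give a different picture from summing the probabilities into expected counts. The frame `probs` and its values below are hypothetical, chosen only to show the effect.

```python
import pandas as pd

# Hypothetical per-name probabilities; not real model or census output.
probs = pd.DataFrame({
    "white": [0.6, 0.6, 0.6, 0.3],
    "black": [0.4, 0.4, 0.4, 0.7],
})

# Individual classification: pick the most likely category per row, then count.
hard_counts = probs.idxmax(axis=1).value_counts()

# Aggregate estimate: sum the probabilities to get expected counts per category.
expected_counts = probs.sum()

print("Counts from hard labels:")
print(hard_counts)
print("\nExpected counts from probabilities:")
print(expected_counts)
```

In this toy case the hard labels split 3 to 1, while the expected counts are close to even (2.1 versus 1.9), which is why probability-weighted totals are often the safer choice for group-level estimates.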