Models API Reference
This page documents all model-related functions and classes in ethnicolr2.
Prediction Functions
Census Models
- ethnicolr2.census_ln(df, lname_col, year=2000)
Appends columns from Census data to the input DataFrame based on the last name.
Removes extra whitespace, then checks whether the name is in the Census data. If it is, outputs the data from that row.
- Parameters:
- Returns:
- Pandas DataFrame with additional columns ‘pctwhite’, ‘pctblack’, ‘pctapi’, ‘pctaian’, ‘pct2prace’, ‘pcthispanic’
- Return type:
DataFrame
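The lookup described above can be sketched in plain Python, without pandas. This is a minimal illustration and not the library's implementation: the `TOY_CENSUS` table, its placeholder percentages, the helper names, and the upper-casing step are all assumptions made for the example.

```python
# Toy stand-in for the Census surname table. The percentages are
# placeholders for illustration only, not real Census figures.
TOY_CENSUS = {
    "EXAMPLE": {"pctwhite": 50.0, "pctblack": 20.0, "pctapi": 10.0,
                "pctaian": 5.0, "pct2prace": 5.0, "pcthispanic": 10.0},
}
CENSUS_COLS = ["pctwhite", "pctblack", "pctapi",
               "pctaian", "pct2prace", "pcthispanic"]


def normalize_last_name(name):
    """Collapse repeated whitespace, trim the ends, and upper-case the name."""
    # Upper-casing is an assumption about how the lookup key is stored.
    return " ".join(str(name).split()).upper()


def lookup_census(rows, lname_col, table=TOY_CENSUS):
    """Return copies of rows with Census columns appended (None if unmatched)."""
    out = []
    for row in rows:
        match = table.get(normalize_last_name(row[lname_col]), {})
        enriched = dict(row)  # leave the caller's row untouched
        for col in CENSUS_COLS:
            enriched[col] = match.get(col)
        out.append(enriched)
    return out


rows = [{"last_name": "  example "}, {"last_name": "Unlisted"}]
result = lookup_census(rows, "last_name")
```

In this sketch, names that are not in the table keep their row and receive `None` in every appended column.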
- ethnicolr2.pred_census_last_name(df, lname_col)
Predict the race/ethnicity by the last name using the Census data model.
- Parameters:
df (DataFrame) – Pandas DataFrame containing the first and last name columns.
lname_col (str) – Column name for the last name.
- Returns:
- Pandas DataFrame with additional columns:
race: the predicted result
Additional columns for the probability of each class.
- Return type:
DataFrame
Florida Models
- ethnicolr2.pred_fl_last_name(df, lname_col)
Predict the race/ethnicity by the last name using the Florida voter registration data model.
- Parameters:
- Returns:
DataFrame with original data plus:
‘preds’: Predicted race/ethnicity category
‘probs’: Dictionary of probabilities for each category
- Return type:
DataFrame
- Raises:
ValueError – If lname_col doesn’t exist or DataFrame is invalid
RuntimeError – If model prediction fails
- ethnicolr2.pred_fl_full_name(df, full_name_col=None, lname_col=None, fname_col=None)
Predict the race/ethnicity by the full name using the Florida voter registration data model.
- Parameters:
- Returns:
- Pandas DataFrame with additional columns:
race: the prediction result
Additional columns for the probability of each class.
- Return type:
DataFrame
Neural Network Classes
LSTM Model
- class ethnicolr2.models.LSTM(input_size, hidden_size, output_size, num_layers=1)
Bases: Module
LSTM model for ethnicity prediction from character sequences.
- Parameters:
- __init__(input_size, hidden_size, output_size, num_layers=1)
Initialize internal Module state, shared by both nn.Module and ScriptModule.
Base Model Class
- class ethnicolr2.ethnicolr_class.EthnicolrModelClass
Bases: object
- static lineToTensor(line, all_letters, max_name, oob)
Convert a name string to a tensor of character indices.
- Parameters:
- Returns:
Tensor of character indices with shape (max_name,)
- Return type:
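A character-index encoding of this kind can be sketched as follows. This is a hedged illustration that returns a plain list instead of a tensor. The function name `line_to_indices`, the use of a string for `all_letters`, the truncation of long names, and the right-padding with the `oob` index are assumptions, not confirmed behavior of `lineToTensor`.

```python
def line_to_indices(line, all_letters, max_name, oob):
    """Map each character to its position in all_letters, padded to max_name."""
    indices = []
    for ch in line[:max_name]:          # assumption: long names are truncated
        pos = all_letters.find(ch)      # -1 when the character is unknown
        indices.append(pos if pos != -1 else oob)
    # Assumption: short names are right-padded with the oob index.
    indices.extend([oob] * (max_name - len(indices)))
    return indices


letters = "abcdefghijklmnopqrstuvwxyz"  # hypothetical vocabulary
oob = len(letters)                      # hypothetical out-of-bounds index
encoded = line_to_indices("li-u", letters, max_name=6, oob=oob)
print(encoded)  # [11, 8, 26, 20, 26, 26]
```

The result always has length `max_name`, which matches the documented output shape `(max_name,)`.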
- classmethod predict(df, vocab_fn, model_fn)
Generate race/ethnicity predictions for names in DataFrame.
- Parameters:
- Returns:
DataFrame with original data plus ‘preds’ and ‘probs’ columns
- Raises:
FileNotFoundError – If model or vocabulary files don’t exist
ValueError – If DataFrame is empty or malformed
RuntimeError – If model loading or prediction fails
- Return type:
- static test_and_norm_df(df, col)
Validates and normalizes DataFrame for prediction.
- Parameters:
- Returns:
Cleaned DataFrame with duplicates and NaN values removed
- Raises:
ValueError – If column doesn’t exist or contains no valid data
- Return type:
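The validation and cleanup contract above can be illustrated without pandas, using a list of dicts as a stand-in for the DataFrame. This is only a sketch: the helper names are hypothetical, and treating blank strings as missing is an assumption that the documentation does not state.

```python
import math


def is_missing(value):
    """Treat None, NaN, and blank strings as missing (blank is an assumption)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def validate_and_dedupe(rows, col):
    """Drop rows with a missing `col`, then drop repeats of the same value."""
    if not rows or any(col not in row for row in rows):
        raise ValueError(f"column {col!r} is missing from the input")
    seen, cleaned = set(), []
    for row in rows:
        value = row[col]
        if is_missing(value) or value in seen:
            continue
        seen.add(value)
        cleaned.append(row)
    if not cleaned:
        raise ValueError(f"column {col!r} contains no valid data")
    return cleaned


rows = [{"last_name": "Smith"}, {"last_name": None}, {"last_name": "Smith"},
        {"last_name": float("nan")}, {"last_name": "Zhang"}]
cleaned = validate_and_dedupe(rows, "last_name")
print([r["last_name"] for r in cleaned])  # ['Smith', 'Zhang']
```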
Model Implementation Notes
The prediction functions above are implemented using internal model classes that handle:
- Model Loading: Automatic loading of pre-trained PyTorch models
- Text Processing: Character-level encoding and sequence preparation
- Batch Inference: Efficient processing of multiple names
- Result Formatting: Converting raw model outputs to readable predictions
For most use cases, use the high-level prediction functions rather than the internal model classes directly.
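As an illustration of the result-formatting step, the sketch below turns one row of raw model output into a predicted label and a probability dictionary, mirroring the ‘preds’ and ‘probs’ columns. It assumes the model emits per-class log-probabilities; the `CATEGORIES` list and the function name are hypothetical.

```python
import math

# Hypothetical label set; the real categories depend on the model.
CATEGORIES = ["asian", "hispanic", "nh_black", "nh_white", "other"]


def format_prediction(log_probs, categories=CATEGORIES):
    """Convert per-class log-probabilities into (label, probabilities)."""
    probs = [math.exp(lp) for lp in log_probs]
    total = sum(probs)                  # renormalize against rounding drift
    probs = [p / total for p in probs]
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], dict(zip(categories, probs))


log_probs = [math.log(p) for p in (0.1, 0.1, 0.1, 0.6, 0.1)]
pred, probs = format_prediction(log_probs)
print(pred)  # nh_white
```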
Examples
Loading Models Manually
import pandas as pd

from ethnicolr2.pred_fl_ln_lstm import LastNameLstmModel

# Load the Florida last name model
model = LastNameLstmModel()

# Predict on a small DataFrame of last names
df = pd.DataFrame({'last_name': ['Smith', 'Zhang']})
result = model.predict(df, vocab_fn=model.VOCAB_FN, model_fn=model.MODEL_FN)
print(result)
Custom Model Parameters
import torch

from ethnicolr2.models import LSTM

# Create a custom LSTM; keyword names follow the documented constructor
# signature LSTM(input_size, hidden_size, output_size, num_layers=1)
model = LSTM(
    input_size=128,   # Character vocabulary size
    hidden_size=256,  # LSTM hidden units
    output_size=5,    # Number of output classes
    num_layers=2,     # Number of LSTM layers
)

# Example forward pass on a batch of random character indices
batch_size, sequence_length = 32, 20
input_tensor = torch.randint(0, 128, (batch_size, sequence_length))
output = model(input_tensor)
print(f"Output shape: {output.shape}")  # [32, 5]
Model Configuration
The prediction models use different configuration parameters:
| Model | Max Length | Character Set | Training Data |
|---|---|---|---|
| Census Last Name | 15 chars | ASCII + common punctuation | US Census 2000/2010 |
| Florida Last Name | 30 chars | Extended character set | FL voter registration |
| Florida Full Name | 47 chars | Extended character set | FL voter registration |
All models use 2-layer LSTM networks with 256 hidden units and batch processing for efficiency.
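Batch processing can be pictured as slicing the input names into fixed-size groups before they are encoded and passed to the model. The generator below is a generic sketch of that idea; the name `iter_batches` and the batch size are hypothetical and not taken from the library.

```python
def iter_batches(names, batch_size):
    """Yield consecutive slices of names, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]


names = ["Smith", "Zhang", "Garcia", "Nguyen", "Khan"]
batches = list(iter_batches(names, batch_size=2))
print(batches)  # [['Smith', 'Zhang'], ['Garcia', 'Nguyen'], ['Khan']]
```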