Models API Reference
This page documents all model-related functions and classes in ethnicolr2.
Prediction Functions
Census Models
- ethnicolr2.census_ln(df, lname_col, year=2000)
Append Census demographic data by last name.
- ethnicolr2.pred_census_last_name(df, lname_col)
Predict race/ethnicity by last name using Census LSTM model.
Florida Models
- ethnicolr2.pred_fl_last_name(df, lname_col)
Predict race/ethnicity by last name using Florida voter registration model.
- ethnicolr2.pred_fl_full_name(df, full_name_col=None, lname_col=None, fname_col=None)
Predict race/ethnicity by full name using Florida voter registration model.
- Parameters:
- Returns:
DataFrame with predictions and probabilities
- Return type:
pd.DataFrame
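The last-name prediction functions above share one calling pattern: pass a DataFrame and the name of the column that holds the surnames. The following is a minimal usage sketch, assuming `ethnicolr2` and `pandas` are installed and that the functions are exposed on the top-level package as the signatures above indicate. The wrapper `predict_last_names`, the mapping `LAST_NAME_PREDICTORS`, and the column name `last_name` are illustrative names, not part of the library.

```python
# Documented last-name predictors, keyed by a short label.
# The labels "florida" and "census" are illustrative, not library constants.
LAST_NAME_PREDICTORS = {
    "florida": "pred_fl_last_name",
    "census": "pred_census_last_name",
}


def predict_last_names(names, model="florida", lname_col="last_name"):
    """Run one of the documented last-name predictors on a list of surnames."""
    if model not in LAST_NAME_PREDICTORS:
        raise ValueError(f"unknown model: {model!r}")

    # Deferred imports: the packages are only needed once a prediction is requested.
    import pandas as pd
    import ethnicolr2

    df = pd.DataFrame({lname_col: list(names)})
    predictor = getattr(ethnicolr2, LAST_NAME_PREDICTORS[model])
    # Both functions take (df, lname_col), per the signatures above.
    return predictor(df, lname_col)
```

Calling `predict_last_names(["Smith", "Zhang"], model="census")` would then return the input rows with the model's predictions appended.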
Neural Network Classes
LSTM Model
- class ethnicolr2.models.LSTM(input_size, hidden_size, output_size, num_layers=1)
Bases: Module
LSTM model for ethnicity prediction from character sequences.
- Parameters:
- __init__(input_size, hidden_size, output_size, num_layers=1)
Initialize internal Module state, shared by both nn.Module and ScriptModule.
Base Model Class
- class ethnicolr2.ethnicolr_class.EthnicolrModelClass
Bases: object
- static lineToTensor(line, all_letters, max_name, oob)
Convert a name string to a tensor of character indices.
- Parameters:
- Returns:
Tensor of character indices with shape (max_name,)
- Return type:
torch.Tensor
- classmethod predict(df, vocab_fn, model_fn)
Generate race/ethnicity predictions for names in DataFrame.
- Parameters:
- Returns:
DataFrame with original data plus ‘preds’ and ‘probs’ columns
- Raises:
FileNotFoundError – If model or vocabulary files don’t exist
ValueError – If DataFrame is empty or malformed
RuntimeError – If model loading or prediction fails
- Return type:
pd.DataFrame
- static test_and_norm_df(df, col)
Validates and normalizes DataFrame for prediction.
- Parameters:
- Returns:
Cleaned DataFrame with duplicates and NaN values removed
- Raises:
ValueError – If column doesn’t exist or contains no valid data
- Return type:
pd.DataFrame
Model Implementation Notes
The prediction functions above are implemented using internal model classes that handle:
- Model Loading: Automatic loading of pre-trained PyTorch models
- Text Processing: Character-level encoding and sequence preparation
- Batch Inference: Efficient processing of multiple names
- Result Formatting: Converting raw model outputs to readable predictions
For most use cases, use the high-level prediction functions rather than the internal model classes directly.
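To make the text-processing step concrete, here is a plain-Python sketch of character-level encoding in the spirit of `lineToTensor(line, all_letters, max_name, oob)`. It is not the library's implementation: it returns a list rather than a tensor, and it assumes that names longer than `max_name` are truncated and that both unknown characters and unused positions are filled with the `oob` index. The function name `line_to_indices` and the lowercase vocabulary are hypothetical.

```python
def line_to_indices(line, all_letters, max_name, oob):
    """Encode a name as a fixed-length list of character indices."""
    # Map each vocabulary character to its position.
    lookup = {ch: i for i, ch in enumerate(all_letters)}
    # Assumption: positions past the end of the name hold the oob index.
    indices = [oob] * max_name
    # Assumption: names longer than max_name are truncated.
    for pos, ch in enumerate(line[:max_name]):
        indices[pos] = lookup.get(ch, oob)
    return indices


vocab = "abcdefghijklmnopqrstuvwxyz"  # illustrative vocabulary
oob = len(vocab)                      # index reserved for out-of-vocabulary characters
encoded = line_to_indices("zhang", vocab, 8, oob)
print(encoded)  # [25, 7, 0, 13, 6, 26, 26, 26]
```

A fixed output length lets names of different lengths be stacked into one batch, which matches the `(max_name,)` shape documented for `lineToTensor`.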
Examples
Loading Models Manually
```python
import pandas as pd
from ethnicolr2.pred_fl_ln_lstm import LastNameLstmModel

# Load Florida last name model
model = LastNameLstmModel()

# Make predictions
df = pd.DataFrame({'last_name': ['Smith', 'Zhang']})
result = model.predict(df, vocab_fn=model.VOCAB_FN, model_fn=model.MODEL_FN)
print(result)
```
Custom Model Parameters
```python
import torch
from ethnicolr2.models import LSTM

# Create a custom LSTM using the constructor signature documented above:
# LSTM(input_size, hidden_size, output_size, num_layers=1)
model = LSTM(
    input_size=128,   # Character vocabulary size (assumed meaning of input_size)
    hidden_size=256,  # LSTM hidden units
    output_size=5,    # Number of output classes
    num_layers=2,     # Number of LSTM layers
)

# Example forward pass on a batch of random character indices
batch_size, sequence_length = 32, 20
input_tensor = torch.randint(0, 128, (batch_size, sequence_length))
output = model(input_tensor)
print(f"Output shape: {output.shape}")  # [32, 5]
```
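The forward pass above yields one row of class scores per name. The following sketch shows how such a row could be turned into a readable prediction, assuming the scores are either raw logits or log-probabilities (a softmax recovers the same probabilities in both cases). The labels in `LABELS` and the helper `format_prediction` are hypothetical; the real label set is defined by each trained model.

```python
import math

# Hypothetical class labels, for illustration only.
LABELS = ["class_a", "class_b", "class_c"]


def format_prediction(scores, labels):
    """Return (best_label, {label: probability}) for one row of class scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


pred, probs = format_prediction([2.0, 0.5, -1.0], LABELS)
print(pred, round(probs[pred], 3))
```

The returned pair mirrors the 'preds' and 'probs' columns that `predict` is documented to add.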
Model Configuration
The prediction models use different configuration parameters:
| Model | Max Length | Character Set | Training Data |
|---|---|---|---|
| Census Last Name | 15 chars | ASCII + common punctuation | US Census 2000/2010 |
| Florida Last Name | 30 chars | Extended character set | FL voter registration |
| Florida Full Name | 47 chars | Extended character set | FL voter registration |
All models use 2-layer LSTM networks with 256 hidden units and batch processing for efficiency.
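Batch processing, as mentioned above, amounts to splitting the input names into fixed-size chunks and running the model once per chunk. A minimal sketch of that chunking step follows; the helper `iter_batches` and the batch size of 2 are illustrative assumptions, since the library's actual batch size and batching code are not documented on this page.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


names = ["smith", "zhang", "garcia", "nguyen", "patel"]
batches = list(iter_batches(names, 2))
print(batches)  # [['smith', 'zhang'], ['garcia', 'nguyen'], ['patel']]
```

The last batch may be shorter than the others, so downstream code should not assume a fixed batch length.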