Models API Reference¶

This page documents all model-related functions and classes in ethnicolr2.

Prediction Functions¶

Census Models¶

ethnicolr2.census_ln(df, lname_col, year=2000)

Appends columns from Census data to the input DataFrame based on the last name.

Strips extra whitespace from the last name, then checks whether the name appears in the Census data. If it does, the values from the matching Census row are appended.

Parameters:

df (DataFrame) – Pandas DataFrame containing the last name column.
lname_col (str) – Column name for the last name.
year (int) – Year of the Census data to use: 2000 or 2010 (default: 2000).

Returns:

Pandas DataFrame with additional columns ‘pctwhite’, ‘pctblack’, ‘pctapi’, ‘pctaian’, ‘pct2prace’, ‘pcthispanic’

Return type:

DataFrame
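The lookup described above can be pictured as a dictionary join keyed on the cleaned last name. The following is a dependency-free sketch, not the library's implementation: plain dicts stand in for the DataFrame, the helper `append_census_columns` is hypothetical, the normalization (collapse whitespace, uppercase) is an assumption, and the percentages are placeholder values rather than real Census figures.

```python
# Illustrative sketch only: mimics the documented behavior of census_ln
# with plain Python. Helper names and numbers are hypothetical.

CENSUS_COLUMNS = ["pctwhite", "pctblack", "pctapi", "pctaian", "pct2prace", "pcthispanic"]

# Placeholder lookup table keyed by normalized last name (NOT real Census values).
PLACEHOLDER_TABLE = {
    "EXAMPLE": {"pctwhite": 50.0, "pctblack": 20.0, "pctapi": 10.0,
                "pctaian": 5.0, "pct2prace": 5.0, "pcthispanic": 10.0},
}


def normalize(name):
    """Collapse runs of whitespace and uppercase the name (assumed normalization)."""
    return " ".join(name.split()).upper()


def append_census_columns(rows, lname_col, table=PLACEHOLDER_TABLE):
    """Return copies of rows with the Census columns added (None when the name is absent)."""
    out = []
    for row in rows:
        match = table.get(normalize(row[lname_col]), {})
        enriched = dict(row)  # leave the caller's row untouched
        for col in CENSUS_COLUMNS:
            enriched[col] = match.get(col)
        out.append(enriched)
    return out


rows = [{"last_name": "  example "}, {"last_name": "unlisted"}]
result = append_census_columns(rows, "last_name")
```

In this sketch a name that is absent from the table keeps its row and receives empty values; whether the library behaves the same way for unmatched names is not stated above.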

ethnicolr2.pred_census_last_name(df, lname_col)

Predict the race/ethnicity by the last name using the Census data model.

Parameters:

df (DataFrame) – Pandas DataFrame containing the last name column.
lname_col (str) – Column name for the last name.

Returns:

Pandas DataFrame with additional columns:

race: the predicted result
Additional columns for the probability of each class.

Return type:

DataFrame

Florida Models¶

ethnicolr2.pred_fl_last_name(df, lname_col)

Predict the race/ethnicity by the last name using the Florida voter registration data model.

Parameters:

df (DataFrame) – Pandas DataFrame containing the last name column
lname_col (str) – Column name for the last name

Returns:

DataFrame with original data plus:

‘preds’: Predicted race/ethnicity category
‘probs’: Dictionary of probabilities for each category

Return type:

DataFrame

Raises:

ValueError – If lname_col doesn’t exist or DataFrame is invalid
RuntimeError – If model prediction fails
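To make the relationship between ‘preds’, ‘probs’, and the documented ValueError concrete, here is a minimal sketch in plain Python. It is not the library's code: `predict_rows` and `fake_model` are hypothetical, the category labels `class_a` and `class_b` are placeholders, and it assumes that ‘preds’ is the category with the highest probability in ‘probs’.

```python
# Illustrative sketch only: plain dicts stand in for DataFrame rows.

def fake_model(name):
    """Stand-in scorer returning made-up probabilities, not real model output."""
    if name.lower().startswith("a"):
        return {"class_a": 0.7, "class_b": 0.3}
    return {"class_a": 0.2, "class_b": 0.8}


def predict_rows(rows, lname_col, score_fn=fake_model):
    """Add 'preds' and 'probs' to each row, validating the input first."""
    if not rows:
        raise ValueError("input is empty")
    if any(lname_col not in row for row in rows):
        raise ValueError(f"column {lname_col!r} not found")
    out = []
    for row in rows:
        probs = score_fn(row[lname_col])
        # Assumption: the predicted label is the most probable category.
        out.append({**row, "preds": max(probs, key=probs.get), "probs": probs})
    return out


predictions = predict_rows([{"last_name": "Adams"}, {"last_name": "Baker"}], "last_name")
```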

ethnicolr2.pred_fl_full_name(df, full_name_col=None, lname_col=None, fname_col=None)

Predict the race/ethnicity by the full name using the Florida voter registration data model.

Parameters:

df (DataFrame) – Pandas DataFrame containing the first and last name columns.
full_name_col (str) – Column name for the full name.
lname_col (str) – Column name for the last name.
fname_col (str or int) – Column name for the first name.

Returns:

Pandas DataFrame with additional columns:

race: the prediction result
Additional columns for the probability of each of the classes.

Return type:

DataFrame

Neural Network Classes¶

LSTM Model¶

class ethnicolr2.models.LSTM(input_size, hidden_size, output_size, num_layers=1)[source]

Bases: Module

LSTM model for ethnicity prediction from character sequences.

Parameters:

input_size (int) – Size of vocabulary (number of unique characters)
hidden_size (int) – Size of hidden state in LSTM
output_size (int) – Number of output categories
num_layers (int) – Number of LSTM layers

__init__(input_size, hidden_size, output_size, num_layers=1)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input)[source]

Forward pass through the LSTM model.

Parameters:

input (Tensor) – Tensor of character indices with shape (batch_size, seq_len)

Returns:

Log-probabilities (log-softmax output) for each category with shape (batch_size, output_size)

Return type:

Tensor
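Because forward returns log-softmax values rather than probabilities, callers who want probabilities need to exponentiate the output. The sketch below shows the arithmetic for a single row in plain Python; the helper names are hypothetical and the input scores are arbitrary illustration values.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def to_probabilities(log_probs):
    """Exponentiate log-probabilities to recover probabilities."""
    return [math.exp(lp) for lp in log_probs]


log_probs = log_softmax([2.0, 1.0, 0.1])   # arbitrary example scores
probs = to_probabilities(log_probs)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top category
```

With a PyTorch tensor the same conversion is `output.exp()`, and `output.argmax(dim=1)` gives the index of the most likely category per row.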

Base Model Class¶

class ethnicolr2.ethnicolr_class.EthnicolrModelClass[source]

Bases: object

static lineToTensor(line, all_letters, max_name, oob)[source]

Convert a name string to a tensor of character indices.

Parameters:

line (str) – Input name string
all_letters (str) – String containing all valid characters
max_name (int) – Maximum name length (longer names are truncated)
oob (int) – Out-of-bounds index for unknown characters

Returns:

Tensor of character indices with shape (max_name,)

Return type:

Tensor
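The encoding that lineToTensor describes can be sketched with a plain list in place of a tensor. This is an illustration under stated assumptions, not the library's code: `line_to_indices` is a hypothetical helper, and padding short names with the `oob` index is an assumption (the documentation above only states that long names are truncated and that the result has length `max_name`).

```python
def line_to_indices(line, all_letters, max_name, oob):
    """Map each character to its position in all_letters, with a fixed output length."""
    indices = []
    for ch in line[:max_name]:            # truncate names longer than max_name
        pos = all_letters.find(ch)
        indices.append(pos if pos >= 0 else oob)  # unknown characters map to oob
    # Assumption: pad short names with the oob index up to max_name.
    indices += [oob] * (max_name - len(indices))
    return indices


all_letters = "abcdefghijklmnopqrstuvwxyz"
oob = len(all_letters)  # one index past the known characters
encoded = line_to_indices("zhang!", all_letters, max_name=8, oob=oob)
# encoded == [25, 7, 0, 13, 6, 26, 26, 26]
```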

model: Module | None = None

model_year: int | None = None

classmethod predict(df, vocab_fn, model_fn)[source]

Generate race/ethnicity predictions for names in DataFrame.

Parameters:

df (DataFrame) – DataFrame containing name data with ‘__name’ column
vocab_fn (str) – Path to vocabulary file (.joblib)
model_fn (str) – Path to trained model file (.pt)

Returns:

DataFrame with original data plus ‘preds’ and ‘probs’ columns

Raises:

FileNotFoundError – If model or vocabulary files don’t exist
ValueError – If DataFrame is empty or malformed
RuntimeError – If model loading or prediction fails

Return type:

DataFrame

race: list[str] | None = None

static test_and_norm_df(df, col)[source]

Validates and normalizes DataFrame for prediction.

Parameters:

df (DataFrame) – Input DataFrame
col (str) – Column name to validate and process

Returns:

Cleaned DataFrame with duplicates and NaN values removed

Raises:

ValueError – If column doesn’t exist or contains no valid data

Return type:

DataFrame
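A rough picture of the validation and cleaning that test_and_norm_df describes, using a list of dicts instead of a DataFrame. The helper names are hypothetical, and two details are assumptions: duplicates are judged on the name column only, and the first occurrence is kept.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def validate_and_normalize(rows, col):
    """Drop rows with missing or duplicate values in col; raise ValueError on bad input."""
    if not rows or any(col not in row for row in rows):
        raise ValueError(f"column {col!r} does not exist")
    seen = set()
    cleaned = []
    for row in rows:
        value = row[col]
        if is_missing(value) or value in seen:
            continue
        seen.add(value)
        cleaned.append(dict(row))
    if not cleaned:
        raise ValueError(f"column {col!r} contains no valid data")
    return cleaned


rows = [{"name": "Smith"}, {"name": None}, {"name": "Smith"},
        {"name": float("nan")}, {"name": "Zhang"}]
cleaned = validate_and_normalize(rows, "name")
```

With pandas, the equivalent cleaning is typically done with `DataFrame.dropna(subset=[col])` followed by `DataFrame.drop_duplicates(subset=[col])`.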

vocab: list[str] | None = None

Model Implementation Notes¶

The prediction functions above are implemented using internal model classes that handle:

Model Loading: Automatic loading of pre-trained PyTorch models
Text Processing: Character-level encoding and sequence preparation
Batch Inference: Efficient processing of multiple names
Result Formatting: Converting raw model outputs to readable predictions

For most use cases, use the high-level prediction functions rather than the internal model classes directly.
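As an illustration of the batch inference step listed above, the sketch below splits a list of names into fixed-size chunks so that each chunk could be encoded and scored together. The `batched` helper and the batch size are hypothetical; the batch size actually used by the library is not stated here.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


names = ["smith", "zhang", "garcia", "lee", "brown"]
batches = list(batched(names, batch_size=2))
# batches == [["smith", "zhang"], ["garcia", "lee"], ["brown"]]
```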

Examples¶

Loading Models Manually¶

import pandas as pd

from ethnicolr2.pred_fl_ln_lstm import LastNameLstmModel

# Load Florida last name model
model = LastNameLstmModel()

# Make predictions.
# predict() is documented above as expecting a '__name' column,
# so the last names are stored under that column here.
df = pd.DataFrame({'__name': ['Smith', 'Zhang']})
result = model.predict(df, vocab_fn=model.VOCAB_FN, model_fn=model.MODEL_FN)
print(result)

Custom Model Parameters¶

import torch

from ethnicolr2.models import LSTM

# Create custom LSTM. Keyword names follow the documented signature:
# LSTM(input_size, hidden_size, output_size, num_layers=1)
model = LSTM(
    input_size=128,      # Character vocabulary size
    hidden_size=256,     # LSTM hidden units
    output_size=5,       # Number of output classes
    num_layers=2,        # Number of LSTM layers
)

# Example forward pass
batch_size, sequence_length = 32, 20
input_tensor = torch.randint(0, 128, (batch_size, sequence_length))
output = model(input_tensor)
print(f"Output shape: {output.shape}")  # documented shape: (batch_size, output_size)

Model Configuration¶

The prediction models use different configuration parameters:

Model	Max Length	Character Set	Training Data
Census Last Name	15 chars	ASCII + common punctuation	US Census 2000/2010
Florida Last Name	30 chars	Extended character set	FL voter registration
Florida Full Name	47 chars	Extended character set	FL voter registration

All models use 2-layer LSTM networks with 256 hidden units and batch processing for efficiency.
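The table above can be captured as a small configuration mapping, which makes the per-model maximum length easy to apply before encoding. This is a sketch only: the `ModelConfig` class, the dictionary keys, and the `truncate_for` helper are hypothetical names, and the values are copied from the table and the sentence above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    max_length: int        # maximum name length in characters
    training_data: str     # source of the training data
    num_layers: int = 2    # LSTM layers, per the note above
    hidden_size: int = 256 # hidden units, per the note above


# Hypothetical keys; values taken from the configuration table.
CONFIGS = {
    "census_last_name": ModelConfig(15, "US Census 2000/2010"),
    "florida_last_name": ModelConfig(30, "FL voter registration"),
    "florida_full_name": ModelConfig(47, "FL voter registration"),
}


def truncate_for(model_key, name):
    """Cut a name down to the maximum length configured for the given model."""
    return name[:CONFIGS[model_key].max_length]


short = truncate_for("census_last_name", "x" * 40)
```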