Models API Reference

This page documents all model-related functions and classes in ethnicolr2.

Prediction Functions

Census Models

ethnicolr2.census_ln(df, lname_col, year=2000)

Append Census demographic data by last name.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the last name column

  • lname_col (str) – Column name for the last name

  • year (int) – Census data year (2000 or 2010); defaults to 2000

Returns:

DataFrame with original data plus Census demographic columns

Return type:

pd.DataFrame

ethnicolr2.pred_census_last_name(df, lname_col)

Predict race/ethnicity by last name using Census LSTM model.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the last name column

  • lname_col (str) – Column name for the last name

Returns:

DataFrame with predictions and probabilities

Return type:

pd.DataFrame

Florida Models

ethnicolr2.pred_fl_last_name(df, lname_col)

Predict race/ethnicity by last name using Florida voter registration model.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the last name column

  • lname_col (str) – Column name for the last name

Returns:

DataFrame with predictions and probabilities

Return type:

pd.DataFrame

ethnicolr2.pred_fl_full_name(df, full_name_col=None, lname_col=None, fname_col=None)

Predict race/ethnicity by full name using Florida voter registration model.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing name columns

  • full_name_col (str | None) – Column name for full name (optional)

  • lname_col (str | None) – Column name for last name (optional)

  • fname_col (str | None) – Column name for first name (optional)

Returns:

DataFrame with predictions and probabilities

Return type:

pd.DataFrame
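
A minimal usage sketch for the Florida prediction functions above, assuming ethnicolr2 and pandas are installed. The sample names and column names are illustrative, and passing lname_col together with fname_col is assumed to be a supported combination for pred_fl_full_name. The imports sit inside the function so that defining it does not load any model.

```python
def demo_predictions():
    """Call the documented last-name and full-name predictors on a toy frame."""
    # Imported lazily: both packages are assumed to be installed.
    import pandas as pd
    from ethnicolr2 import pred_fl_full_name, pred_fl_last_name

    # Illustrative data; the column names are our own choice.
    df = pd.DataFrame({"first": ["John", "Wei"], "last": ["Smith", "Zhang"]})

    # Last-name-only prediction.
    by_last = pred_fl_last_name(df, lname_col="last")

    # Full-name prediction from separate first and last name columns
    # (assumed to be an accepted alternative to full_name_col).
    by_full = pred_fl_full_name(df, lname_col="last", fname_col="first")
    return by_last, by_full
```

Calling demo_predictions() returns two DataFrames; inspect their columns to see the prediction and probability fields each function adds.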

Neural Network Classes

LSTM Model

class ethnicolr2.models.LSTM(input_size, hidden_size, output_size, num_layers=1)

Bases: Module

LSTM model for ethnicity prediction from character sequences.

Parameters:
  • input_size (int) – Size of vocabulary (number of unique characters)

  • hidden_size (int) – Size of hidden state in LSTM

  • output_size (int) – Number of output categories

  • num_layers (int) – Number of LSTM layers

__init__(input_size, hidden_size, output_size, num_layers=1)

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input)

Forward pass through the LSTM model.

Parameters:

input (Tensor) – Tensor of character indices with shape (batch_size, seq_len)

Returns:

Log-probabilities (log-softmax output) for each category with shape (batch_size, output_size)

Return type:

Tensor
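
Because forward returns log-softmax values, exponentiating a row recovers ordinary probabilities that sum to 1. The sketch below shows that conversion on plain Python lists rather than tensors; the helper name and the category labels are hypothetical and are not part of ethnicolr2.

```python
import math


def log_probs_to_prediction(log_probs, labels):
    """Turn one row of log-softmax output into (best label, probabilities)."""
    # exp() undoes the log, giving probabilities that sum to 1.
    probs = [math.exp(lp) for lp in log_probs]
    # Index of the largest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical category names and one row of log-probabilities.
labels = ["a", "b", "c"]
row = [math.log(0.2), math.log(0.7), math.log(0.1)]
label, probs = log_probs_to_prediction(row, labels)
```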

Base Model Class

class ethnicolr2.ethnicolr_class.EthnicolrModelClass

Bases: object

static lineToTensor(line, all_letters, max_name, oob)

Convert a name string to a tensor of character indices.

Parameters:
  • line (str) – Input name string

  • all_letters (str) – String containing all valid characters

  • max_name (int) – Maximum name length (longer names are truncated)

  • oob (int) – Out-of-bounds index for unknown characters

Returns:

Tensor of character indices with shape (max_name,)

Return type:

Tensor
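
The encoding described above can be sketched without PyTorch. This is a plain-list analogue under stated assumptions, not the library's implementation: the function name name_to_indices is hypothetical, and padding short names with the oob index is an assumption (the reference only states the output shape and that longer names are truncated).

```python
def name_to_indices(line, all_letters, max_name, oob):
    """Plain-list analogue of lineToTensor: one index per character position."""
    # Assumption: positions past the end of the name are filled with `oob`.
    indices = [oob] * max_name
    for pos, char in enumerate(line[:max_name]):  # longer names are truncated
        found = all_letters.find(char)
        # Characters outside `all_letters` map to the out-of-bounds index.
        indices[pos] = found if found >= 0 else oob
    return indices


alphabet = "abcdefghijklmnopqrstuvwxyz"
oob = len(alphabet)  # hypothetical choice: first index past the alphabet
encoded = name_to_indices("li-u", alphabet, max_name=6, oob=oob)
```

Here the hyphen is not in the alphabet, so it receives the oob index, as do the two padding positions.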

model: Module | None = None

model_year: int | None = None

classmethod predict(df, vocab_fn, model_fn)

Generate race/ethnicity predictions for names in DataFrame.

Parameters:
  • df (DataFrame) – DataFrame containing name data with ‘__name’ column

  • vocab_fn (str | Path | PathLike[str]) – Path to vocabulary file (.joblib)

  • model_fn (str | Path | PathLike[str]) – Path to trained model file (.pt)

Returns:

DataFrame with original data plus ‘preds’ and ‘probs’ columns

Return type:

DataFrame

race: list[str] | None = None

static test_and_norm_df(df, col)

Validate and normalize a DataFrame for prediction.

Parameters:
  • df (DataFrame) – Input DataFrame

  • col (str) – Column name to validate and process

Returns:

Cleaned DataFrame with duplicates and NaN values removed

Raises:

ValueError – If column doesn’t exist or contains no valid data

Return type:

DataFrame
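
The validation contract above (reject a missing column, drop missing and duplicate values, reject an empty result) can be illustrated with a pandas-free analogue. Everything here is a sketch under assumptions: clean_rows is a hypothetical helper operating on a list of dicts, None stands in for NaN, and duplicates are judged on the named column only.

```python
def clean_rows(rows, col):
    """Drop rows with a missing or duplicate value in `col`; raise if none remain."""
    if any(col not in row for row in rows):
        raise ValueError(f"column {col!r} is missing")

    seen, cleaned = set(), []
    for row in rows:
        value = row[col]
        # None stands in for NaN; repeated values are treated as duplicates.
        if value is None or value in seen:
            continue
        seen.add(value)
        cleaned.append(row)

    if not cleaned:
        raise ValueError(f"column {col!r} contains no valid data")
    return cleaned


rows = [{"last": "Smith"}, {"last": None}, {"last": "Smith"}, {"last": "Zhang"}]
cleaned = clean_rows(rows, "last")
```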

vocab: list[str] | None = None

Model Implementation Notes

The prediction functions above are implemented using internal model classes that handle:

  • Model Loading: Automatic loading of pre-trained PyTorch models

  • Text Processing: Character-level encoding and sequence preparation

  • Batch Inference: Efficient processing of multiple names

  • Result Formatting: Converting raw model outputs to readable predictions

For most use cases, use the high-level prediction functions rather than the internal model classes directly.

Examples

Loading Models Manually

import pandas as pd

from ethnicolr2.pred_fl_ln_lstm import LastNameLstmModel

# Load Florida last name model
model = LastNameLstmModel()

# predict() expects the names in a '__name' column (see the predict reference above)
df = pd.DataFrame({'last_name': ['Smith', 'Zhang']})
df['__name'] = df['last_name']

# Make predictions
result = model.predict(df, vocab_fn=model.VOCAB_FN, model_fn=model.MODEL_FN)
print(result)

Custom Model Parameters

import torch

from ethnicolr2.models import LSTM

# Create a custom LSTM; keyword names follow the documented constructor signature
model = LSTM(
    input_size=128,      # Character vocabulary size
    hidden_size=256,     # LSTM hidden units
    output_size=5,       # Number of output categories
    num_layers=2,        # Number of LSTM layers
)

# Example forward pass
batch_size, sequence_length = 32, 20
input_tensor = torch.randint(0, 128, (batch_size, sequence_length))
output = model(input_tensor)
print(f"Output shape: {output.shape}")  # expected (32, 5) per the forward() reference

Model Configuration

The prediction models use different configuration parameters:

Model               Max Length   Character Set                Training Data
Census Last Name    15 chars     ASCII + common punctuation   US Census 2000/2010
Florida Last Name   30 chars     Extended character set       FL voter registration
Florida Full Name   47 chars     Extended character set       FL voter registration

All models use 2-layer LSTM networks with 256 hidden units and batch processing for efficiency.
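
The per-model limits in the table can be captured in a small lookup. The sketch below is illustrative only: ModelConfig, CONFIGS, the dictionary keys, and truncate_name are hypothetical names, and it assumes the "Max Length" column corresponds to the truncation length applied to input names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    max_length: int      # maximum name length in characters
    training_data: str   # data source listed in the table


# Keys are hypothetical labels; values are copied from the table above.
CONFIGS = {
    "census_last_name": ModelConfig(15, "US Census 2000/2010"),
    "florida_last_name": ModelConfig(30, "FL voter registration"),
    "florida_full_name": ModelConfig(47, "FL voter registration"),
}


def truncate_name(name, model_key):
    """Cut a name to the maximum length listed for the given model."""
    return name[: CONFIGS[model_key].max_length]


truncated = truncate_name("Wolfeschlegelsteinhausen", "census_last_name")
```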