Models API Reference

This page documents all model-related functions and classes in ethnicolr2.

Prediction Functions

Census Models

ethnicolr2.census_ln(df, lname_col, year=2000)

Appends columns from Census data to the input DataFrame based on the last name.

Removes extra whitespace from the name, then checks whether the name is in the Census data. If it is, the data from the matching row is appended to the output.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the last name column.

  • lname_col (str) – Column name for the last name.

  • year (int) – The year of Census data to be used. (2000 or 2010) (default is 2000)

Returns:

Pandas DataFrame with additional columns ‘pctwhite’, ‘pctblack’, ‘pctapi’, ‘pctaian’, ‘pct2prace’, ‘pcthispanic’

Return type:

DataFrame
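
The lookup described above (trim whitespace, check whether the name is in the Census table, copy that row's columns) can be sketched in plain Python. This is a minimal illustration, not the library's implementation: TOY_CENSUS, lookup_last_name, and the numbers in the table are made-up placeholders rather than real Census figures, only three of the documented columns are shown, and both the case-insensitive match and the None values for unmatched names are assumptions.

```python
# Hypothetical stand-in for the Census table; the values are placeholders,
# not real Census percentages.
TOY_CENSUS = {
    "smith": {"pctwhite": 70.0, "pctblack": 20.0, "pcthispanic": 5.0},
}
COLUMNS = ("pctwhite", "pctblack", "pcthispanic")


def lookup_last_name(row, lname_col):
    """Return a copy of `row` with the Census columns appended."""
    # Collapse repeated whitespace and trim the ends before matching.
    key = " ".join(str(row[lname_col]).split()).lower()
    match = TOY_CENSUS.get(key)
    out = dict(row)
    for col in COLUMNS:
        # Assumption: names missing from the table get None in every column.
        out[col] = match[col] if match else None
    return out


rows = [{"last_name": "  Smith "}, {"last_name": "Qwertyuiop"}]
result = [lookup_last_name(r, "last_name") for r in rows]
```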

ethnicolr2.pred_census_last_name(df, lname_col)

Predict the race/ethnicity by the last name using the Census data model.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the last name column.

  • lname_col (str) – Column name for the last name.

Returns:

Pandas DataFrame with additional columns:
  • race – the predicted race/ethnicity category

  • Additional columns with the probability of each class.

Return type:

DataFrame

Florida Models

ethnicolr2.pred_fl_last_name(df, lname_col)

Predict the race/ethnicity by the last name using the Florida voter registration data model.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the last name column

  • lname_col (str) – Column name for the last name

Returns:

DataFrame with original data plus:
  • ‘preds’: Predicted race/ethnicity category

  • ‘probs’: Dictionary of probabilities for each category

Return type:

DataFrame

Raises:
  • ValueError – If lname_col doesn’t exist or DataFrame is invalid

  • RuntimeError – If model prediction fails
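
Because ‘probs’ is described as a dictionary of per-category probabilities, downstream code can rank the categories for each row. The sketch below works on one such dictionary; top_categories is a hypothetical helper, and the category labels and values are placeholders rather than output from the model.

```python
def top_categories(probs, k=2):
    """Return the k most probable (category, probability) pairs."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical probabilities for a single row (labels are placeholders).
example_probs = {"cat_a": 0.6, "cat_b": 0.3, "cat_c": 0.1}
best = top_categories(example_probs)
```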

ethnicolr2.pred_fl_full_name(df, full_name_col=None, lname_col=None, fname_col=None)

Predict the race/ethnicity by the full name using the Florida voter registration data model.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the first and last name columns.

  • full_name_col (str) – Column name for the full name.

  • lname_col (str) – Column name for the last name.

  • fname_col (str or int) – Column name for the first name.

Returns:

Pandas DataFrame with additional columns:
  • race – the predicted race/ethnicity category

  • Additional columns with the probability of each class.

Return type:

DataFrame

Neural Network Classes

LSTM Model

class ethnicolr2.models.LSTM(input_size, hidden_size, output_size, num_layers=1)[source]

Bases: Module

LSTM model for ethnicity prediction from character sequences.

Parameters:
  • input_size (int) – Size of vocabulary (number of unique characters)

  • hidden_size (int) – Size of hidden state in LSTM

  • output_size (int) – Number of output categories

  • num_layers (int) – Number of LSTM layers

__init__(input_size, hidden_size, output_size, num_layers=1)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input)[source]

Forward pass through the LSTM model.

Parameters:

input (Tensor) – Tensor of character indices with shape (batch_size, seq_len)

Returns:

Log-softmax probabilities for each category with shape (batch_size, output_size)

Return type:

Tensor
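
Since forward returns log-softmax values rather than probabilities, exponentiating them recovers a distribution that sums to 1, and the largest entry gives the predicted category index. The following is a minimal plain-Python sketch of that conversion for a single row; log_softmax and the raw scores are illustrative, and real code would operate on tensors instead of lists.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


raw_scores = [2.0, 0.5, -1.0]          # illustrative values for one name
log_probs = log_softmax(raw_scores)    # same form as the documented output
probs = [math.exp(lp) for lp in log_probs]
predicted_index = max(range(len(probs)), key=probs.__getitem__)
```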

Base Model Class

class ethnicolr2.ethnicolr_class.EthnicolrModelClass[source]

Bases: object

static lineToTensor(line, all_letters, max_name, oob)[source]

Convert a name string to a tensor of character indices.

Parameters:
  • line (str) – Input name string

  • all_letters (str) – String containing all valid characters

  • max_name (int) – Maximum name length (longer names are truncated)

  • oob (int) – Out-of-bounds index for unknown characters

Returns:

Tensor of character indices with shape (max_name,)

Return type:

Tensor
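
The encoding that lineToTensor describes can be sketched without PyTorch as a function returning a fixed-length list of indices. This is an illustration under stated assumptions, not the library's code: line_to_indices is a hypothetical name, and filling the positions after a short name with the oob index is an assumption, since the padding value is not documented here.

```python
def line_to_indices(line, all_letters, max_name, oob):
    """Encode `line` as a list of exactly `max_name` character indices."""
    # Assumption: positions past the end of a short name are filled with `oob`.
    indices = [oob] * max_name
    for position, char in enumerate(line[:max_name]):  # truncate long names
        found = all_letters.find(char)
        # str.find returns -1 for characters outside the vocabulary.
        indices[position] = found if found != -1 else oob
    return indices


letters = "abcdefghijklmnopqrstuvwxyz"
encoded = line_to_indices("li#", letters, max_name=5, oob=len(letters))
# encoded == [11, 8, 26, 26, 26]
```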

model: Module | None = None

model_year: int | None = None

classmethod predict(df, vocab_fn, model_fn)[source]

Generate race/ethnicity predictions for names in DataFrame.

Parameters:
  • df (DataFrame) – DataFrame containing name data with ‘__name’ column

  • vocab_fn (str) – Path to vocabulary file (.joblib)

  • model_fn (str) – Path to trained model file (.pt)

Returns:

DataFrame with original data plus ‘preds’ and ‘probs’ columns

Return type:

DataFrame

race: list[str] | None = None

static test_and_norm_df(df, col)[source]

Validates and normalizes DataFrame for prediction.

Parameters:
  • df (DataFrame) – Input DataFrame

  • col (str) – Column name to validate and process

Returns:

Cleaned DataFrame with duplicates and NaN values removed

Raises:

ValueError – If column doesn’t exist or contains no valid data

Return type:

DataFrame
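
The validation and cleaning steps described for test_and_norm_df can be sketched on a list of dictionaries standing in for DataFrame rows. validate_and_normalize is a hypothetical name, and the exact checks and error messages are assumptions based only on the description above.

```python
import math


def validate_and_normalize(rows, col):
    """Drop rows whose `col` is missing or NaN and remove duplicate values."""
    if not rows:
        raise ValueError("no rows to process")
    if any(col not in row for row in rows):
        raise ValueError(f"column {col!r} not found")
    seen, cleaned = set(), []
    for row in rows:
        value = row[col]
        is_nan = value is None or (isinstance(value, float) and math.isnan(value))
        if is_nan or value in seen:
            continue  # skip missing values and repeats
        seen.add(value)
        cleaned.append(row)
    if not cleaned:
        raise ValueError(f"column {col!r} contains no valid data")
    return cleaned


rows = [
    {"last_name": "Smith"},
    {"last_name": None},
    {"last_name": "Smith"},
    {"last_name": "Zhang"},
]
cleaned = validate_and_normalize(rows, "last_name")
```

With a real DataFrame, the equivalent steps would typically be expressed with pandas' dropna and drop_duplicates on the chosen column.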

vocab: list[str] | None = None

Model Implementation Notes

The prediction functions above are implemented using internal model classes that handle:

  • Model Loading: Automatic loading of pre-trained PyTorch models

  • Text Processing: Character-level encoding and sequence preparation

  • Batch Inference: Efficient processing of multiple names

  • Result Formatting: Converting raw model outputs to readable predictions

For most use cases, use the high-level prediction functions rather than the internal model classes directly.
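
As a rough picture of the batch inference step listed above, names can be split into fixed-size groups before being encoded and passed to a model. The helper below is a generic sketch; iter_batches and the batch size are illustrative and do not reflect the library's internal settings.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


names = ["Smith", "Zhang", "Garcia", "Nguyen", "Patel"]
batches = list(iter_batches(names, batch_size=2))
```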

Examples

Loading Models Manually

import pandas as pd

from ethnicolr2.pred_fl_ln_lstm import LastNameLstmModel

# Load Florida last name model
model = LastNameLstmModel()

# predict() is documented above as expecting the names in a '__name' column
df = pd.DataFrame({'__name': ['Smith', 'Zhang']})

# Make predictions; the result adds 'preds' and 'probs' columns
result = model.predict(df, vocab_fn=model.VOCAB_FN, model_fn=model.MODEL_FN)
print(result)

Custom Model Parameters

import torch

from ethnicolr2.models import LSTM

# Create a custom LSTM using the constructor arguments documented above
model = LSTM(
    input_size=128,      # Character vocabulary size
    hidden_size=256,     # LSTM hidden units
    output_size=5,       # Number of output categories
    num_layers=2,        # Number of LSTM layers
)

# Example forward pass on a batch of random character indices
batch_size, sequence_length = 32, 20
input_tensor = torch.randint(0, 128, (batch_size, sequence_length))
output = model(input_tensor)
print(f"Output shape: {output.shape}")  # (batch_size, output_size) = (32, 5)

Model Configuration

The prediction models use different configuration parameters:

Model              Max Length  Character Set               Training Data
-----------------  ----------  --------------------------  ---------------------
Census Last Name   15 chars    ASCII + common punctuation  US Census 2000/2010
Florida Last Name  30 chars    Extended character set      FL voter registration
Florida Full Name  47 chars    Extended character set      FL voter registration

All models use 2-layer LSTM networks with 256 hidden units and batch processing for efficiency.