Usage

Parsernaam provides both a Python API and a command-line interface for parsing names using machine learning models.

Python API

Basic Usage

The main entry point is the ParseNames class with its static parse() method:

import pandas as pd
from parsernaam.parse import ParseNames

# Create DataFrame with names to parse
df = pd.DataFrame({'name': [
    'Jan', 
    'Nicholas Turner', 
    'Petersen', 
    'Nichols Richard', 
    'Piet',
    'John Smith', 
    'Janssen', 
    'Kim Yeon'
]})

# Parse names using ML models
results = ParseNames.parse(df)
print(results)

Output Format

The parse() method returns a DataFrame with an additional parsed_name column containing dictionaries with:

name: The original name
type: Classification result ('first', 'last', 'first_last', 'last_first')
prob: Confidence probability (0.0 to 1.0)

Example output:

	name	parsed_name
0	Jan	{'name': 'Jan', 'type': 'first', 'prob': 0.677}
1	Nicholas Turner	{'name': 'Nicholas Turner', 'type': 'first_last', 'prob': 0.999}
2	Petersen	{'name': 'Petersen', 'type': 'last', 'prob': 0.534}
3	Nichols Richard	{'name': 'Nichols Richard', 'type': 'last_first', 'prob': 0.999}

Working with Results

You can extract specific information from the parsed results:

# Work on the DataFrame returned by parse(), which carries the parsed_name column
# Get just the classification types
results['name_type'] = results['parsed_name'].apply(lambda x: x['type'])

# Get confidence scores
results['confidence'] = results['parsed_name'].apply(lambda x: x['prob'])

# Filter high-confidence results
high_confidence = results[results['confidence'] > 0.9]

Custom DataFrame Column

If your DataFrame stores names under a different column, rename that column to name (or copy it into a new name column) before parsing:

# DataFrame with 'full_name' column instead of 'name'
df = pd.DataFrame({'full_name': ['John Smith', 'Jane Doe']})

# Rename column or create a 'name' column
df_renamed = df.rename(columns={'full_name': 'name'})
results = ParseNames.parse(df_renamed)
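
The rename-then-parse pattern above can be wrapped so that the caller's column name is restored afterwards. This is only a sketch: parse_with_column is a hypothetical helper, not part of parsernaam, and it takes the parsing function as an argument so the example can run with a stand-in parser instead of the real models.

```python
import pandas as pd

def parse_with_column(df, column, parse_fn):
    """Hypothetical helper: parse names stored in `column`.

    `parse_fn` is assumed to behave like ParseNames.parse: it takes a
    DataFrame with a 'name' column and returns one with 'parsed_name' added.
    """
    if column != "name" and "name" in df.columns:
        raise ValueError("DataFrame already has a different 'name' column")
    renamed = df.rename(columns={column: "name"})
    parsed = parse_fn(renamed)
    # Restore the caller's original column name.
    return parsed.rename(columns={"name": column})

# Stand-in parser for illustration only; 'unknown' is a placeholder,
# not a label that parsernaam produces.
def fake_parse(frame):
    out = frame.copy()
    out["parsed_name"] = [
        {"name": n, "type": "unknown", "prob": 0.0} for n in out["name"]
    ]
    return out

df = pd.DataFrame({"full_name": ["John Smith", "Jane Doe"]})
results = parse_with_column(df, "full_name", fake_parse)
print(list(results.columns))
```

With parsernaam installed, ParseNames.parse would be passed in place of fake_parse.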

Batch Processing

For large datasets, parsernaam efficiently processes names in batches:

# Large dataset
import numpy as np
large_df = pd.DataFrame({
    'name': np.random.choice(['John Smith', 'Jane Doe', 'Kim Yeon'], 10000)
})

# Process efficiently with automatic batching
results = ParseNames.parse(large_df)

Command Line Interface

Basic Usage

Parse names from a CSV file:

parse_names input.csv -o output.csv -n name_column

Arguments

input: Input CSV file containing names to parse (required)
-o, --output: Output CSV file (default: parsernaam_output.csv)
-n, --names-col: Column name containing the names to parse (required)

Examples

# Parse names from 'name' column in input.csv
parse_names data.csv -o results.csv -n name

# Parse names from 'full_name' column
parse_names employees.csv -o parsed_employees.csv -n full_name

# Use default output filename
parse_names names.csv -n participant_name

Input File Format

The input CSV file should contain at least one column with names:

id,name,age
1,John Smith,35
2,Kim Yeon,28
3,Nicholas Turner,42

Output File Format

The output CSV includes all original columns plus a parsed_name column:

id,name,age,parsed_name
1,John Smith,35,"{'name': 'John Smith', 'type': 'first_last', 'prob': 0.997}"
2,Kim Yeon,28,"{'name': 'Kim Yeon', 'type': 'last_first', 'prob': 0.999}"
3,Nicholas Turner,42,"{'name': 'Nicholas Turner', 'type': 'first_last', 'prob': 0.999}"
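
When the output file is read back, the parsed_name column arrives as a string. Judging by the sample above, that string is a Python dict literal with single quotes, so a JSON parser would reject it; ast.literal_eval from the standard library can convert it instead. A small sketch, reading two of the sample rows from an in-memory string rather than a real file:

```python
import ast
import csv
import io

# Header and first two rows of the sample output above,
# held in memory instead of a file on disk.
sample = '''id,name,age,parsed_name
1,John Smith,35,"{'name': 'John Smith', 'type': 'first_last', 'prob': 0.997}"
2,Kim Yeon,28,"{'name': 'Kim Yeon', 'type': 'last_first', 'prob': 0.999}"
'''

rows = []
for row in csv.DictReader(io.StringIO(sample)):
    # literal_eval only accepts Python literals, so no code is executed.
    row["parsed_name"] = ast.literal_eval(row["parsed_name"])
    rows.append(row)

for row in rows:
    print(row["name"], row["parsed_name"]["type"], row["parsed_name"]["prob"])
```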

Web Interface

Gradio Demo

Try parsernaam interactively using the Gradio web interface:

Parsernaam on Hugging Face

Running Locally

You can also run the Gradio interface locally:

python gradio_app.py

This will start a local web server where you can test name parsing interactively.

Model Behavior

Single Names

For single names, the model classifies them as either:

first: Likely a first name (e.g., “John”, “Sarah”, “Raj”)
last: Likely a last name (e.g., “Smith”, “Patel”, “Johnson”)

Multi-word Names

For names with multiple words, the model determines the order:

first_last: First name followed by last name (e.g., “John Smith”)
last_first: Last name followed by first name (e.g., “Smith John”)
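
Since the type label says which part comes first, it can be used to split a parsed entry into first and last name. The helper below is a hypothetical sketch, not part of parsernaam. It assumes the parts are separated by whitespace, and for names with more than two words it simply treats one end token as the first name, which is a simplification.

```python
def split_name(parsed):
    """Hypothetical helper: return (first, last) from a parsed_name dict."""
    tokens = parsed["name"].split()
    kind = parsed["type"]
    if kind == "first":
        return parsed["name"], None
    if kind == "last":
        return None, parsed["name"]
    if kind == "first_last":
        # Assumption: the first token is the first name, the rest the last name.
        return tokens[0], " ".join(tokens[1:]) or None
    if kind == "last_first":
        # Assumption: the final token is the first name, the rest the last name.
        return tokens[-1], " ".join(tokens[:-1]) or None
    raise ValueError(f"unexpected type: {kind!r}")

print(split_name({"name": "John Smith", "type": "first_last", "prob": 0.997}))
print(split_name({"name": "Kim Yeon", "type": "last_first", "prob": 0.999}))
print(split_name({"name": "Jan", "type": "first", "prob": 0.677}))
```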

Confidence Scores

All predictions include confidence scores:

0.9+: Very high confidence
0.7-0.9: High confidence
0.5-0.7: Moderate confidence
<0.5: Low confidence (consider manual review)
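
These bands are easy to apply in code. A minimal sketch follows; confidence_band is a hypothetical helper, and placing a score of exactly 0.9 or 0.7 in the higher band is an assumption, since the list above does not say which side the boundaries fall on.

```python
def confidence_band(prob):
    """Hypothetical helper: map a probability to the bands listed above."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0.0 and 1.0")
    if prob >= 0.9:
        return "very high"
    if prob >= 0.7:  # assumption: boundary values fall into the higher band
        return "high"
    if prob >= 0.5:
        return "moderate"
    return "low"

# Probabilities taken from the example output earlier on this page.
for prob in (0.677, 0.999, 0.534):
    print(prob, confidence_band(prob))
```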

Cultural Considerations

The model is particularly effective for:

Indian names: Trained on Indian voter registration data
Western names: Trained on US voter registration data
Mixed cultural contexts: Handles various naming conventions

For Indian names in particular, the model relies on the assumption that family members typically share a last name.

Error Handling

Parsernaam gracefully handles various edge cases:

Empty names: Returns appropriate default values
Special characters: Processes names with Unicode characters
Very long names: Truncates to model’s sequence length limit
Missing data: Handles NaN values in DataFrames
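
If you would rather keep empty or missing rows out of the model altogether, they can be set aside with pandas before parsing. This is an optional pre-filtering sketch, not something parsernaam requires; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["John Smith", None, "   ", "José García", float("nan")]}
)

# Strip surrounding whitespace; missing values stay missing.
cleaned = df["name"].str.strip()

# Keep rows that still have a non-empty name after stripping.
mask = cleaned.notna() & (cleaned != "")
to_parse = df.loc[mask].assign(name=cleaned[mask])
skipped = df.loc[~mask]

print(len(to_parse), "rows to parse;", len(skipped), "rows skipped")
```

The to_parse frame keeps the original index, so results can be matched back to the source rows later.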

Performance Tips

Batch Processing: Process multiple names at once for better performance
GPU Acceleration: Ensure CUDA is available for faster inference
Model Caching: Models are automatically cached after first load
Memory Management: For very large datasets, consider processing in chunks
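
The chunking tip can be sketched as follows. parse_in_chunks is a hypothetical helper, not part of parsernaam, and the chunk size of 1000 is an arbitrary illustration rather than a tuned value. The parser is passed in as an argument so the example runs with a stand-in instead of loading the models.

```python
import pandas as pd

def parse_in_chunks(df, parse_fn, chunk_size=1000):
    """Hypothetical helper: apply parse_fn to df chunk by chunk and recombine."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if df.empty:
        return df.copy()  # nothing to parse
    parts = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        parts.append(parse_fn(chunk))
    # Assumes parse_fn keeps each chunk's index, so row order is preserved.
    return pd.concat(parts)

calls = []  # records the size of every chunk handed to the parser

# Stand-in parser for illustration only; 'unknown' is a placeholder label.
def fake_parse(frame):
    calls.append(len(frame))
    out = frame.copy()
    out["parsed_name"] = [
        {"name": n, "type": "unknown", "prob": 0.0} for n in out["name"]
    ]
    return out

large_df = pd.DataFrame({"name": ["John Smith", "Jane Doe", "Kim Yeon"] * 1000})
results = parse_in_chunks(large_df, fake_parse, chunk_size=1000)
print(len(results), "rows parsed in", len(calls), "chunks")
```

With parsernaam installed, ParseNames.parse would take the place of fake_parse.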

Troubleshooting

Common Issues

Low Confidence Scores: Some names may be ambiguous. Consider:

Checking if the name follows expected patterns
Using additional context clues if available
Manual review for confidence < 0.7

Unexpected Classifications: The model may occasionally misclassify:

Very unusual or rare names
Names from cultures not well-represented in training data
Names with unconventional spellings

Performance Issues: For slow processing:

Ensure you’re using batch processing (passing a whole DataFrame rather than one name at a time)
Check if CUDA is available and properly configured
Consider processing large datasets in smaller chunks