Usage¶
Parsernaam provides both a Python API and a command-line interface for parsing names using machine learning models.
Python API¶
Basic Usage¶
The main entry point is the ParseNames class with its static parse() method:
import pandas as pd
from parsernaam.parse import ParseNames
# Create DataFrame with names to parse
df = pd.DataFrame({'name': [
    'Jan',
    'Nicholas Turner',
    'Petersen',
    'Nichols Richard',
    'Piet',
    'John Smith',
    'Janssen',
    'Kim Yeon'
]})
# Parse names using ML models
results = ParseNames.parse(df)
print(results)
Output Format¶
The parse() method returns a DataFrame with an additional parsed_name column containing dictionaries with:
name: The original name
type: Classification result ('first', 'last', 'first_last', 'last_first')
prob: Confidence probability (0.0 to 1.0)
Example output:
|   | name | parsed_name |
|---|---|---|
| 0 | Jan | {'name': 'Jan', 'type': 'first', 'prob': 0.677} |
| 1 | Nicholas Turner | {'name': 'Nicholas Turner', 'type': 'first_last', 'prob': 0.999} |
| 2 | Petersen | {'name': 'Petersen', 'type': 'last', 'prob': 0.534} |
| 3 | Nichols Richard | {'name': 'Nichols Richard', 'type': 'last_first', 'prob': 0.999} |
Working with Results¶
You can extract specific information from the parsed results:
# Work on the DataFrame returned by parse(), which holds the parsed_name column
# Get just the classification types
results['name_type'] = results['parsed_name'].apply(lambda x: x['type'])
# Get confidence scores
results['confidence'] = results['parsed_name'].apply(lambda x: x['prob'])
# Filter high-confidence results
high_confidence = results[results['confidence'] > 0.9]
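Because type records the order of the components, it can also be used to pull the first and last names apart. The following is a minimal sketch, not part of parsernaam: split_parsed_name is a hypothetical helper, and it assumes that for the two-part types the first whitespace-separated token is one component and the remainder is the other.

```python
from typing import Optional, Tuple


def split_parsed_name(parsed: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (first, last) from a parsed_name dictionary.

    Hypothetical helper, not part of parsernaam. Assumes the first
    whitespace-separated token is one component and the rest is the other.
    """
    name = parsed["name"].strip()
    kind = parsed["type"]
    if kind == "first":
        return name, None
    if kind == "last":
        return None, name
    # Split once on the first space; tail is None if there is no second part.
    head, _, tail = name.partition(" ")
    tail = tail.strip() or None
    if kind == "first_last":
        return head, tail
    if kind == "last_first":
        return tail, head
    raise ValueError(f"unexpected type: {kind!r}")


examples = [
    {"name": "Nicholas Turner", "type": "first_last", "prob": 0.999},
    {"name": "Nichols Richard", "type": "last_first", "prob": 0.999},
    {"name": "Jan", "type": "first", "prob": 0.677},
]
for parsed in examples:
    print(split_parsed_name(parsed))
```

On a DataFrame, the same helper could be applied with results['parsed_name'].apply(split_parsed_name).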
Custom DataFrame Column¶
If your DataFrame stores names under a different column name, rename that column to name (or copy it into a new name column) before parsing:
# DataFrame with 'full_name' column instead of 'name'
df = pd.DataFrame({'full_name': ['John Smith', 'Jane Doe']})
# Rename column or create a 'name' column
df_renamed = df.rename(columns={'full_name': 'name'})
results = ParseNames.parse(df_renamed)
Batch Processing¶
For large datasets, parsernaam efficiently processes names in batches:
# Large dataset
import numpy as np
large_df = pd.DataFrame({
    'name': np.random.choice(['John Smith', 'Jane Doe', 'Kim Yeon'], 10000)
})
# Process efficiently with automatic batching
results = ParseNames.parse(large_df)
Command Line Interface¶
Basic Usage¶
Parse names from a CSV file:
parse_names input.csv -o output.csv -n name_column
Arguments¶
input: Input CSV file containing names to parse (required)
-o, --output: Output CSV file (default: parsernaam_output.csv)
-n, --names-col: Column name containing the names to parse (required)
Examples¶
# Parse names from 'name' column in input.csv
parse_names data.csv -o results.csv -n name
# Parse names from 'full_name' column
parse_names employees.csv -o parsed_employees.csv -n full_name
# Use default output filename
parse_names names.csv -n participant_name
Input File Format¶
The input CSV file should contain at least one column with names:
id,name,age
1,John Smith,35
2,Kim Yeon,28
3,Nicholas Turner,42
Output File Format¶
The output CSV includes all original columns plus a parsed_name column:
id,name,age,parsed_name
1,John Smith,35,"{'name': 'John Smith', 'type': 'first_last', 'prob': 0.997}"
2,Kim Yeon,28,"{'name': 'Kim Yeon', 'type': 'last_first', 'prob': 0.999}"
3,Nicholas Turner,42,"{'name': 'Nicholas Turner', 'type': 'first_last', 'prob': 0.999}"
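In the sample above, the parsed_name column is stored as the text of a Python dictionary with single-quoted keys, so it needs converting back before its fields can be used. A minimal sketch using only the standard library, assuming the file matches the sample shown here (the inline csv_text stands in for a real output file):

```python
import ast
import csv
import io

# Sample text copied from the output format above; in practice this would be
# the contents of the CSV file written by the command-line tool.
csv_text = '''id,name,age,parsed_name
1,John Smith,35,"{'name': 'John Smith', 'type': 'first_last', 'prob': 0.997}"
2,Kim Yeon,28,"{'name': 'Kim Yeon', 'type': 'last_first', 'prob': 0.999}"
'''

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # literal_eval safely evaluates the Python-literal dict; json.loads would
    # reject the single-quoted keys.
    row["parsed_name"] = ast.literal_eval(row["parsed_name"])
    rows.append(row)

print(rows[0]["parsed_name"]["type"])
print(rows[1]["parsed_name"]["prob"])
```

ast.literal_eval only accepts Python literals, so it is a safer choice than eval for text read from a file.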
Web Interface¶
Gradio Demo¶
Try parsernaam interactively using the Gradio web interface:
Running Locally¶
You can also run the Gradio interface locally:
python gradio_app.py
This will start a local web server where you can test name parsing interactively.
Model Behavior¶
Single Names¶
For single names, the model classifies them as either:
first: Likely a first name (e.g., “John”, “Sarah”, “Raj”)
last: Likely a last name (e.g., “Smith”, “Patel”, “Johnson”)
Multi-word Names¶
For names with multiple words, the model determines the order:
first_last: First name followed by last name (e.g., “John Smith”)
last_first: Last name followed by first name (e.g., “Smith John”)
Confidence Scores¶
All predictions include confidence scores:
0.9+: Very high confidence
0.7-0.9: High confidence
0.5-0.7: Moderate confidence
<0.5: Low confidence (consider manual review)
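The bands above can be turned into labels for filtering or reporting. This is an illustrative sketch rather than library code: confidence_label is a hypothetical helper, and placing each boundary value (0.9, 0.7, 0.5) in the higher band is an assumption, since the ranges as listed do not say which side the boundaries fall on.

```python
def confidence_label(prob: float) -> str:
    """Map a probability to one of the confidence bands listed above.

    Hypothetical helper; boundary values are assigned to the higher band.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0.0 and 1.0")
    if prob >= 0.9:
        return "very high"
    if prob >= 0.7:
        return "high"
    if prob >= 0.5:
        return "moderate"
    return "low"


for prob in (0.999, 0.75, 0.677, 0.3):
    print(prob, confidence_label(prob))
```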
Cultural Considerations¶
The model is particularly effective for:
Indian names: Trained on Indian voter registration data
Western names: Trained on US voter registration data
Mixed cultural contexts: Handles various naming conventions
For Indian names specifically, the model assumes that last names are typically shared among family members, providing culturally-aware parsing.
Error Handling¶
Parsernaam gracefully handles various edge cases:
Empty names: Returns appropriate default values
Special characters: Processes names with Unicode characters
Very long names: Truncates to model’s sequence length limit
Missing data: Handles NaN values in DataFrames
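If you would rather exclude empty or missing entries than receive default values for them, the names can be filtered before parsing. A minimal sketch under that assumption; clean_names is a hypothetical helper and not part of parsernaam:

```python
import math
from typing import Iterable, List


def clean_names(values: Iterable[object]) -> List[str]:
    """Drop None, NaN, and blank entries; collapse runs of whitespace.

    Hypothetical pre-filter, not part of parsernaam.
    """
    cleaned = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        # split() with no arguments also strips leading/trailing whitespace.
        text = " ".join(str(value).split())
        if text:
            cleaned.append(text)
    return cleaned


print(clean_names(["  John  Smith ", "", None, float("nan"), "Kim Yeon"]))
```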
Performance Tips¶
Batch Processing: Process multiple names at once for better performance
GPU Acceleration: Ensure CUDA is available for faster inference
Model Caching: Models are automatically cached after first load
Memory Management: For very large datasets, consider processing in chunks
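Chunked processing, as suggested in the last tip, can be sketched with a small generator. iter_chunks is a hypothetical helper shown here on a plain list, and the chunk size of 5 is arbitrary:

```python
from typing import Iterator, Sequence


def iter_chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items.

    Hypothetical helper, not part of parsernaam.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


names = ["John Smith", "Jane Doe", "Kim Yeon"] * 4  # 12 names in total
chunk_sizes = [len(chunk) for chunk in iter_chunks(names, 5)]
print(chunk_sizes)  # [5, 5, 2]
```

For a DataFrame, the same loop could slice rows with df.iloc[start:start + size], pass each piece to ParseNames.parse, and combine the pieces with pd.concat; a suitable chunk size depends on available memory.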
Troubleshooting¶
Common Issues¶
Low Confidence Scores: Some names may be ambiguous. Consider:
Checking if the name follows expected patterns
Using additional context clues if available
Manual review for confidence < 0.7
Unexpected Classifications: The model may occasionally misclassify:
Very unusual or rare names
Names from cultures not well-represented in training data
Names with unconventional spellings
Performance Issues: For slow processing:
Ensure you’re using batch processing (passing a whole DataFrame rather than parsing one name at a time)
Check if CUDA is available and properly configured
Consider processing large datasets in smaller chunks