Naampy Usage Examples

This notebook demonstrates how to use the naampy package for predicting gender from Indian names using electoral roll data and machine learning models.

Installation

First, ensure you have naampy installed:

pip install naampy

Basic Setup and Imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from naampy import in_rolls_fn_gender, predict_fn_gender, InRollsFnData

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

Sample Data

Let’s create a sample dataset with Indian names to demonstrate the functionality:

# Create sample data with common Indian names
sample_names = [
    'Priya', 'Rahul', 'Anjali', 'Vikram', 'Deepika', 'Arjun',
    'Kavita', 'Rajesh', 'Sunita', 'Amit', 'Meera', 'Rohan',
    'Neha', 'Karan', 'Pooja', 'Sanjay', 'Ritu', 'Ashok',
    'Geeta', 'Manish', 'Seema', 'Suresh', 'Anita', 'Naveen'
]

# Create DataFrame
df = pd.DataFrame({
    'id': range(1, len(sample_names) + 1),
    'first_name': sample_names,
    'age': [25, 30, 28, 35, 22, 29, 31, 40, 26, 33, 27, 24, 23, 32, 29, 38, 25, 42, 30, 36, 28, 39, 34, 27]
})

print("Sample dataset:")
print(df.head(10))
print(f"\nTotal names: {len(df)}")

Electoral Roll Gender Prediction

The primary method uses Indian Electoral Roll statistics to predict gender. These statistics come from voter registration records (electoral rolls) covering 31 Indian states and union territories.

# Predict gender using electoral roll data
result_df = in_rolls_fn_gender(df, 'first_name')

print("Results with electoral roll data:")
print(result_df[['first_name', 'prop_female', 'prop_male', 'n_female', 'n_male']].head(10))

Understanding the Results

prop_female: Proportion of females with this name (0.0 to 1.0)
prop_male: Proportion of males with this name (0.0 to 1.0)
n_female: Absolute count of females with this name in electoral data
n_male: Absolute count of males with this name in electoral data
pred_gender: ML prediction for names not found in electoral data
pred_prob: Confidence score for ML predictions
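
To make the relationship between the count and proportion fields concrete, here is a minimal sketch. It assumes each proportion is that gender's count divided by the combined count for the name. The helper `proportions` is hypothetical and not part of naampy, and the counts are made up for illustration.

```python
def proportions(n_female, n_male):
    """Return (prop_female, prop_male) for one name, or (None, None) if unseen.

    Assumption: each proportion is that gender's count divided by the
    combined count. This helper is illustrative and not part of naampy.
    """
    total = n_female + n_male
    if total == 0:
        return None, None
    return n_female / total, n_male / total


# Made-up counts, for illustration only
prop_female, prop_male = proportions(n_female=980, n_male=20)
print(f"prop_female={prop_female:.2f}, prop_male={prop_male:.2f}")
```

Under this assumption the two proportions always sum to 1, so either one is enough to recover the other.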

Visualizing Gender Predictions

# Create visualizations of the results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Gender proportion distribution
axes[0, 0].hist(result_df['prop_female'], bins=20, alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Distribution of Female Proportion')
axes[0, 0].set_xlabel('Proportion Female')
axes[0, 0].set_ylabel('Frequency')

# Plot 2: Names by predicted gender
# Drop names without electoral data so that missing values are not counted as 'Male'
gender_counts = result_df['prop_female'].dropna().gt(0.5).map({True: 'Female', False: 'Male'}).value_counts()
axes[0, 1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%', startangle=90)
axes[0, 1].set_title('Gender Distribution of Sample Names')

# Plot 3: Confidence levels (names missing from the electoral data are assigned 0)
confidence = (result_df['prop_female'] - 0.5).abs().mul(2).fillna(0)
axes[1, 0].hist(confidence, bins=20, alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Prediction Confidence Distribution')
axes[1, 0].set_xlabel('Confidence Level (0=uncertain, 1=certain)')
axes[1, 0].set_ylabel('Frequency')

# Plot 4: Sample count in electoral data
total_counts = result_df['n_female'] + result_df['n_male']
valid_counts = total_counts[total_counts > 0]
if len(valid_counts) > 0:
    axes[1, 1].hist(valid_counts, bins=15, alpha=0.7, edgecolor='black')
    axes[1, 1].set_title('Sample Sizes in Electoral Data')
    axes[1, 1].set_xlabel('Total Count (Female + Male)')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_xscale('log')
else:
    axes[1, 1].text(0.5, 0.5, 'No electoral data\navailable for\nthese names', 
                   horizontalalignment='center', verticalalignment='center', 
                   transform=axes[1, 1].transAxes, fontsize=12)
    axes[1, 1].set_title('Sample Sizes in Electoral Data')

plt.tight_layout()
plt.show()
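
The confidence measure in Plot 3 rescales the distance of `prop_female` from 0.5 onto a 0 to 1 range. The sketch below applies the same formula to single values and adds an "ambiguous" band for names close to an even split. The function `label_with_confidence` and the `min_confidence` cut-off of 0.8 are hypothetical choices for illustration, not naampy defaults.

```python
import math


def label_with_confidence(prop_female, min_confidence=0.8):
    """Map a female proportion to (label, confidence).

    confidence = |prop_female - 0.5| * 2, the same rescaling used in Plot 3.
    min_confidence is an arbitrary illustrative cut-off, not a naampy default.
    """
    if prop_female is None or math.isnan(prop_female):
        return "unknown", 0.0
    confidence = abs(prop_female - 0.5) * 2
    if confidence < min_confidence:
        return "ambiguous", confidence
    return ("female" if prop_female > 0.5 else "male"), confidence


# Made-up proportions, for illustration only
for p in (0.98, 0.55, 0.03, float("nan")):
    print(p, label_with_confidence(p))
```

Keeping an explicit "ambiguous" label avoids forcing a binary decision on names that the data does not clearly separate. The right cut-off depends on how costly a wrong label is in the downstream analysis.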

Machine Learning Model Predictions

For names not found in the electoral data, naampy uses a neural network model trained on character patterns:

# Test the ML model directly with some uncommon names
uncommon_names = ['Aadhya', 'Vivaan', 'Kiara', 'Aryan', 'Diya', 'Ishaan', 'Zara', 'Reyansh']

ml_predictions = predict_fn_gender(uncommon_names)
print("ML Model Predictions for Uncommon Names:")
print(ml_predictions)
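
The fallback logic described above (electoral data first, model prediction second) can be sketched for a single result row as follows. This is a simplified illustration, not naampy's implementation: `resolve_gender` is a hypothetical helper, rows are plain dicts, missing values are assumed to be `None`, and `pred_gender` is assumed to hold a label such as 'female' or 'male'.

```python
def resolve_gender(row):
    """Return (label, source) for one result row given as a plain dict.

    Assumptions: a missing value is represented as None, and pred_gender
    holds a label such as 'female' or 'male'. Not part of naampy.
    """
    prop_female = row.get("prop_female")
    if prop_female is not None:
        # Prefer the electoral roll statistics when the name was found
        return ("female" if prop_female > 0.5 else "male"), "electoral_roll"
    pred = row.get("pred_gender")
    if pred is not None:
        # Otherwise fall back to the model's prediction
        return pred, "ml_model"
    return None, "none"


# Made-up rows with placeholder names, for illustration only
rows = [
    {"first_name": "name_a", "prop_female": 0.99, "pred_gender": None},
    {"first_name": "name_b", "prop_female": None, "pred_gender": "male"},
    {"first_name": "name_c", "prop_female": None, "pred_gender": None},
]
for row in rows:
    print(row["first_name"], resolve_gender(row))
```

Recording the source next to the label makes it easy to report later how many labels came from observed counts and how many from the model.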

Dataset Comparison

Naampy provides different datasets with varying coverage and accuracy trade-offs:

# Compare different datasets
datasets = ['v1', 'v2', 'v2_1k']
test_names = ['Priya', 'Rahul', 'Anjali']  # Use a small subset for comparison
test_df = pd.DataFrame({'first_name': test_names})

print("Dataset Comparison for Selected Names:")
print("=" * 50)

for dataset in datasets:
    try:
        result = in_rolls_fn_gender(test_df, 'first_name', dataset=dataset)
        print(f"\n{dataset.upper()} Dataset:")
        for _, row in result.iterrows():
            if pd.notna(row['prop_female']):
                print(f"  {row['first_name']}: {row['prop_female']:.3f} female, {row['prop_male']:.3f} male (n={row['n_female'] + row['n_male']:.0f})")
            else:
                print(f"  {row['first_name']}: Not in dataset (ML: {row.get('pred_gender', 'N/A')})")
    except Exception as e:
        print(f"\n{dataset.upper()} Dataset: Error loading - {str(e)}")

State and Year Filtering

You can filter the electoral data by specific states or birth years for more targeted predictions:

# Check available states
available_states = InRollsFnData.list_states()
print(f"Available states ({len(available_states)}):")
print(sorted(available_states)[:10])  # Show first 10 states
print("... and", max(len(available_states) - 10, 0), "more")

# Compare predictions for different states
test_states = ['kerala', 'punjab', 'maharashtra']  # Different linguistic regions
test_name_df = pd.DataFrame({'first_name': ['Priya', 'Simran', 'Aarti']})

print("State-wise Comparison:")
print("=" * 40)

# All states combined
all_states_result = in_rolls_fn_gender(test_name_df, 'first_name')
print("\nAll States Combined:")
for _, row in all_states_result.iterrows():
    if pd.notna(row['prop_female']):
        print(f"  {row['first_name']}: {row['prop_female']:.3f} female")

# Individual states
for state in test_states:
    try:
        state_result = in_rolls_fn_gender(test_name_df, 'first_name', state=state)
        print(f"\n{state.title()} only:")
        for _, row in state_result.iterrows():
            if pd.notna(row['prop_female']):
                print(f"  {row['first_name']}: {row['prop_female']:.3f} female")
            else:
                print(f"  {row['first_name']}: Not found in {state}")
    except Exception as e:
        print(f"\n{state.title()}: Error - {str(e)}")

Performance Analysis

Let’s analyze the coverage and performance of different approaches:

# Analyze coverage
# Guard against pred_gender being absent (for example, if every name was matched)
in_rolls = result_df['prop_female'].notna()
if 'pred_gender' in result_df.columns:
    has_ml = result_df['pred_gender'].notna()
else:
    has_ml = pd.Series(False, index=result_df.index)

coverage_stats = {
    'total_names': len(result_df),
    'found_in_electoral': int(in_rolls.sum()),
    'ml_predictions': int(has_ml.sum()),
    'no_prediction': int((~in_rolls & ~has_ml).sum())
}

print("Coverage Analysis:")
print("=" * 30)
for key, value in coverage_stats.items():
    percentage = (value / coverage_stats['total_names']) * 100
    print(f"{key.replace('_', ' ').title()}: {value} ({percentage:.1f}%)")

# Visualize coverage
fig, ax = plt.subplots(1, 1, figsize=(10, 6))

categories = ['Electoral Roll\nData', 'ML Model\nPrediction', 'No Prediction']
values = [coverage_stats['found_in_electoral'], 
          coverage_stats['ml_predictions'], 
          coverage_stats['no_prediction']]
colors = ['#2E8B57', '#4169E1', '#DC143C']

bars = ax.bar(categories, values, color=colors, alpha=0.8, edgecolor='black')
ax.set_title('Prediction Coverage Analysis', fontsize=14, fontweight='bold')
ax.set_ylabel('Number of Names')

# Add value labels on bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.1,
            f'{value}\n({value/coverage_stats["total_names"]*100:.1f}%)',
            ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

Conclusion

This notebook demonstrated the key features of naampy:

Electoral Roll Predictions: High accuracy for names found in Indian electoral data
ML Fallback: Neural network predictions for uncommon names
Multiple Datasets: Different options balancing coverage vs. accuracy
Geographic Filtering: State-specific predictions for regional analysis
Comprehensive Output: Proportions, counts, and confidence scores

The package provides a robust solution for gender prediction from Indian names, suitable for demographic analysis, data preprocessing, and research applications.

Next Steps

Explore the API documentation for detailed function references
Check out additional examples in the User Guide
Report issues or contribute on GitHub