Naampy Usage Examples

This notebook demonstrates how to use the naampy package for predicting gender from Indian names using electoral roll data and machine learning models.

Installation

First, ensure you have naampy installed:

pip install naampy

Basic Setup and Imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from naampy import in_rolls_fn_gender, predict_fn_gender, InRollsFnData

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

Sample Data

Let’s create a sample dataset with Indian names to demonstrate the functionality:

# Create sample data with common Indian names
sample_names = [
    'Priya', 'Rahul', 'Anjali', 'Vikram', 'Deepika', 'Arjun',
    'Kavita', 'Rajesh', 'Sunita', 'Amit', 'Meera', 'Rohan',
    'Neha', 'Karan', 'Pooja', 'Sanjay', 'Ritu', 'Ashok',
    'Geeta', 'Manish', 'Seema', 'Suresh', 'Anita', 'Naveen'
]

# Create DataFrame
df = pd.DataFrame({
    'id': range(1, len(sample_names) + 1),
    'first_name': sample_names,
    'age': [25, 30, 28, 35, 22, 29, 31, 40, 26, 33, 27, 24, 23, 32, 29, 38, 25, 42, 30, 36, 28, 39, 34, 27]
})

print("Sample dataset:")
print(df.head(10))
print(f"\nTotal names: {len(df)}")

Electoral Roll Gender Prediction

The primary method uses Indian Electoral Roll statistics to predict gender. These statistics come from voter registration lists (electoral rolls), not from votes cast, and cover 31 Indian states and union territories.

# Predict gender using electoral roll data
result_df = in_rolls_fn_gender(df, 'first_name')

print("Results with electoral roll data:")
print(result_df[['first_name', 'prop_female', 'prop_male', 'n_female', 'n_male']].head(10))

Understanding the Results

  • prop_female: Proportion of females with this name (0.0 to 1.0)

  • prop_male: Proportion of males with this name (0.0 to 1.0)

  • n_female: Absolute count of females with this name in electoral data

  • n_male: Absolute count of males with this name in electoral data

  • pred_gender: ML prediction for names not found in electoral data

  • pred_prob: Confidence score for ML predictions
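
As a minimal sketch of how these columns might be turned into a single label, the snippet below uses a small hand-made DataFrame (the values are illustrative, not real naampy output) and assigns a gender only when prop_female is clearly away from 0.5. The helper name label_from_proportion and the 0.9 threshold are assumptions made for this example; neither is part of naampy.

```python
import numpy as np
import pandas as pd


def label_from_proportion(prop_female: pd.Series, threshold: float = 0.9) -> pd.Series:
    """Map a female-proportion column to 'female', 'male', 'ambiguous', or 'unknown'."""
    conditions = [
        prop_female.isna(),             # name missing from the electoral data
        prop_female >= threshold,       # overwhelmingly female
        prop_female <= 1 - threshold,   # overwhelmingly male
    ]
    choices = ["unknown", "female", "male"]
    labels = np.select(conditions, choices, default="ambiguous")
    return pd.Series(labels, index=prop_female.index, name="label")


# Hypothetical values standing in for the output of in_rolls_fn_gender
toy = pd.DataFrame({
    "first_name": ["A", "B", "C", "D"],
    "prop_female": [0.98, 0.03, 0.55, np.nan],
})
toy["label"] = label_from_proportion(toy["prop_female"])
print(toy)
```

Keeping "ambiguous" and "unknown" as separate labels avoids silently forcing unisex or unmatched names into one gender; the threshold can be loosened or tightened depending on how much error the downstream analysis tolerates.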

Visualizing Gender Predictions

# Create visualizations of the results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Gender proportion distribution
axes[0, 0].hist(result_df['prop_female'], bins=20, alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Distribution of Female Proportion')
axes[0, 0].set_xlabel('Proportion Female')
axes[0, 0].set_ylabel('Frequency')

# Plot 2: Names by predicted gender
# Names missing from the electoral data have NaN proportions; drop them so they are not counted as 'Male'
gender_counts = result_df['prop_female'].dropna().gt(0.5).map({True: 'Female', False: 'Male'}).value_counts()
axes[0, 1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%', startangle=90)
axes[0, 1].set_title('Gender Distribution of Sample Names')

# Plot 3: Confidence levels (for names with clear gender indication)
confidence = result_df['prop_female'].apply(lambda x: abs(x - 0.5) * 2 if pd.notna(x) else 0)
axes[1, 0].hist(confidence, bins=20, alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Prediction Confidence Distribution')
axes[1, 0].set_xlabel('Confidence Level (0=uncertain, 1=certain)')
axes[1, 0].set_ylabel('Frequency')

# Plot 4: Sample count in electoral data
total_counts = result_df['n_female'] + result_df['n_male']
valid_counts = total_counts[total_counts > 0]
if len(valid_counts) > 0:
    axes[1, 1].hist(valid_counts, bins=15, alpha=0.7, edgecolor='black')
    axes[1, 1].set_title('Sample Sizes in Electoral Data')
    axes[1, 1].set_xlabel('Total Count (Female + Male)')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_xscale('log')
else:
    axes[1, 1].text(0.5, 0.5, 'No electoral data\navailable for\nthese names', 
                   horizontalalignment='center', verticalalignment='center', 
                   transform=axes[1, 1].transAxes, fontsize=12)
    axes[1, 1].set_title('Sample Sizes in Electoral Data')

plt.tight_layout()
plt.show()
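
The sample-size histogram matters because a proportion computed from a handful of records is far less reliable than one computed from thousands. As a rough sketch of how that uncertainty could be quantified from n_female and n_male, the function below computes a Wilson score interval for a binomial proportion. The function name wilson_interval and the example counts are illustrative assumptions; naampy does not provide this calculation.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Same observed proportion (0.9 female), very different sample sizes
small = wilson_interval(9, 10)          # e.g. n_female=9, n_male=1
large = wilson_interval(9000, 10000)    # e.g. n_female=9000, n_male=1000
print(f"n=10:    {small[0]:.3f} to {small[1]:.3f}")
print(f"n=10000: {large[0]:.3f} to {large[1]:.3f}")
```

The interval for the small sample is much wider, which is one reason to treat proportions for rare names with caution.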

Machine Learning Model Predictions

For names not found in the electoral data, naampy uses a neural network model trained on character patterns:

# Test the ML model directly with some uncommon names
uncommon_names = ['Aadhya', 'Vivaan', 'Kiara', 'Aryan', 'Diya', 'Ishaan', 'Zara', 'Reyansh']

ml_predictions = predict_fn_gender(uncommon_names)
print("ML Model Predictions for Uncommon Names:")
print(ml_predictions)

Dataset Comparison

Naampy provides different datasets with varying coverage and accuracy trade-offs:

# Compare different datasets
datasets = ['v1', 'v2', 'v2_1k']
test_names = ['Priya', 'Rahul', 'Anjali']  # Use a small subset for comparison
test_df = pd.DataFrame({'first_name': test_names})

print("Dataset Comparison for Selected Names:")
print("=" * 50)

for dataset in datasets:
    try:
        result = in_rolls_fn_gender(test_df, 'first_name', dataset=dataset)
        print(f"\n{dataset.upper()} Dataset:")
        for _, row in result.iterrows():
            if pd.notna(row['prop_female']):
                print(f"  {row['first_name']}: {row['prop_female']:.3f} female, {row['prop_male']:.3f} male (n={row['n_female'] + row['n_male']:.0f})")
            else:
                print(f"  {row['first_name']}: Not in dataset (ML: {row.get('pred_gender', 'N/A')})")
    except Exception as e:
        print(f"\n{dataset.upper()} Dataset: Error loading - {str(e)}")

State and Year Filtering

You can filter the electoral data by specific states or birth years for more targeted predictions:

# Check available states
available_states = InRollsFnData.list_states()
print(f"Available states ({len(available_states)}):")
print(sorted(available_states)[:10])  # Show first 10 states
remaining = max(len(available_states) - 10, 0)  # Avoid a negative count for short lists
if remaining:
    print(f"... and {remaining} more")

# Compare predictions for different states
test_states = ['kerala', 'punjab', 'maharashtra']  # Different linguistic regions
test_name_df = pd.DataFrame({'first_name': ['Priya', 'Simran', 'Aarti']})

print("State-wise Comparison:")
print("=" * 40)

# All states combined
all_states_result = in_rolls_fn_gender(test_name_df, 'first_name')
print("\nAll States Combined:")
for _, row in all_states_result.iterrows():
    if pd.notna(row['prop_female']):
        print(f"  {row['first_name']}: {row['prop_female']:.3f} female")

# Individual states
for state in test_states:
    try:
        state_result = in_rolls_fn_gender(test_name_df, 'first_name', state=state)
        print(f"\n{state.title()} only:")
        for _, row in state_result.iterrows():
            if pd.notna(row['prop_female']):
                print(f"  {row['first_name']}: {row['prop_female']:.3f} female")
            else:
                print(f"  {row['first_name']}: Not found in {state}")
    except Exception as e:
        print(f"\n{state.title()}: Error - {str(e)}")

Performance Analysis

Let’s analyze the coverage and performance of different approaches:

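# Analyze coverage
# pred_gender may be absent if every name was matched in the electoral data,
# so treat a missing column as "no ML predictions" instead of raising a KeyError
has_roll = result_df['prop_female'].notna()
if 'pred_gender' in result_df.columns:
    has_ml = result_df['pred_gender'].notna()
else:
    has_ml = pd.Series(False, index=result_df.index)

coverage_stats = {
    'total_names': len(result_df),
    'found_in_electoral': int(has_roll.sum()),
    'ml_predictions': int(has_ml.sum()),
    'no_prediction': int((~has_roll & ~has_ml).sum())
}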

print("Coverage Analysis:")
print("=" * 30)
for key, value in coverage_stats.items():
    percentage = (value / coverage_stats['total_names']) * 100
    print(f"{key.replace('_', ' ').title()}: {value} ({percentage:.1f}%)")

# Visualize coverage
fig, ax = plt.subplots(1, 1, figsize=(10, 6))

categories = ['Electoral Roll\nData', 'ML Model\nPrediction', 'No Prediction']
values = [coverage_stats['found_in_electoral'], 
          coverage_stats['ml_predictions'], 
          coverage_stats['no_prediction']]
colors = ['#2E8B57', '#4169E1', '#DC143C']

bars = ax.bar(categories, values, color=colors, alpha=0.8, edgecolor='black')
ax.set_title('Prediction Coverage Analysis', fontsize=14, fontweight='bold')
ax.set_ylabel('Number of Names')

# Add value labels on bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.1,
            f'{value}\n({value/coverage_stats["total_names"]*100:.1f}%)',
            ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()
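
The coverage categories above suggest a simple way to collapse the two prediction sources into one column: prefer the electoral roll proportion when it exists and fall back to the ML prediction otherwise. The sketch below does this on a hand-made DataFrame. The helper name combine_predictions, the 0.5 cut-off, and the assumption that pred_gender holds the strings 'female' and 'male' are all illustrative; check the actual output of your naampy version before relying on them.

```python
import numpy as np
import pandas as pd


def combine_predictions(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'gender' and 'source' columns, preferring electoral roll data over ML output."""
    out = df.copy()
    has_roll = out["prop_female"].notna()

    # pred_gender may be missing entirely; treat that as "no ML predictions"
    if "pred_gender" in out.columns:
        pred = out["pred_gender"]
    else:
        pred = pd.Series(np.nan, index=out.index, dtype=object)
    has_ml = pred.notna() & ~has_roll

    # Label from the electoral roll proportion (only meaningful where has_roll is True)
    roll_label = pd.Series(
        np.where(out["prop_female"] > 0.5, "female", "male"), index=out.index
    )
    # Keep the roll label where available, otherwise take the ML prediction (or NaN)
    out["gender"] = roll_label.where(has_roll, pred)
    out["source"] = np.select(
        [has_roll, has_ml], ["electoral_roll", "ml_model"], default="none"
    )
    return out


# Hypothetical values standing in for the output of in_rolls_fn_gender
toy = pd.DataFrame({
    "first_name": ["A", "B", "C", "D"],
    "prop_female": [0.9, 0.2, np.nan, np.nan],
    "pred_gender": [np.nan, np.nan, "female", np.nan],
})
combined = combine_predictions(toy)
print(combined[["first_name", "gender", "source"]])
```

Recording the source alongside the label makes it easy to report results separately for matched and model-predicted names.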

Conclusion

This notebook demonstrated the key features of naampy:

  1. Electoral Roll Predictions: Gender proportions and counts taken directly from Indian electoral roll data

  2. ML Fallback: Neural network predictions for uncommon names

  3. Multiple Datasets: Different options balancing coverage vs. accuracy

  4. Geographic Filtering: State-specific predictions for regional analysis

  5. Comprehensive Output: Proportions, counts, and confidence scores

The package combines lookups against electoral roll statistics with a model-based fallback for unmatched names, which makes it suitable for demographic analysis, data preprocessing, and research applications.

Next Steps