{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Naampy Usage Examples\n", "\n", "This notebook demonstrates how to use the naampy package for predicting gender from Indian names using electoral roll data and machine learning models.\n", "\n", "## Installation\n", "\n", "First, ensure you have naampy installed:\n", "\n", "```bash\n", "pip install naampy\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Setup and Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from naampy import in_rolls_fn_gender, predict_fn_gender, InRollsFnData\n", "\n", "# Set up plotting style\n", "plt.style.use('default')\n", "sns.set_palette(\"husl\")\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Data\n", "\n", "Let's create a sample dataset with Indian names to demonstrate the functionality:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create sample data with common Indian names\n", "sample_names = [\n", " 'Priya', 'Rahul', 'Anjali', 'Vikram', 'Deepika', 'Arjun',\n", " 'Kavita', 'Rajesh', 'Sunita', 'Amit', 'Meera', 'Rohan',\n", " 'Neha', 'Karan', 'Pooja', 'Sanjay', 'Ritu', 'Ashok',\n", " 'Geeta', 'Manish', 'Seema', 'Suresh', 'Anita', 'Naveen'\n", "]\n", "\n", "# Create DataFrame\n", "df = pd.DataFrame({\n", " 'id': range(1, len(sample_names) + 1),\n", " 'first_name': sample_names,\n", " 'age': [25, 30, 28, 35, 22, 29, 31, 40, 26, 33, 27, 24, 23, 32, 29, 38, 25, 42, 30, 36, 28, 39, 34, 27]\n", "})\n", "\n", "print(\"Sample dataset:\")\n", "print(df.head(10))\n", "print(f\"\\nTotal names: {len(df)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Electoral Roll Gender Prediction\n", "\n", "The primary method uses Indian Electoral Roll statistics to predict gender. 
These statistics are compiled from voter registration lists (electoral rolls) covering 31 Indian states and union territories." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict gender using electoral roll data\n", "result_df = in_rolls_fn_gender(df, 'first_name')\n", "\n", "print(\"Results with electoral roll data:\")\n", "print(result_df[['first_name', 'prop_female', 'prop_male', 'n_female', 'n_male']].head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Understanding the Results\n", "\n", "- `prop_female`: Proportion of females with this name (0.0 to 1.0)\n", "- `prop_male`: Proportion of males with this name (0.0 to 1.0)\n", "- `n_female`: Absolute count of females with this name in electoral data\n", "- `n_male`: Absolute count of males with this name in electoral data\n", "- `pred_gender`: ML prediction for names not found in electoral data\n", "- `pred_prob`: Confidence score for ML predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing Gender Predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create visualizations of the results\n", "fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n", "\n", "# Plot 1: Gender proportion distribution (NaN rows, i.e. names missing from the rolls, are dropped so hist() does not fail)\n", "axes[0, 0].hist(result_df['prop_female'].dropna(), bins=20, alpha=0.7, edgecolor='black')\n", "axes[0, 0].set_title('Distribution of Female Proportion')\n", "axes[0, 0].set_xlabel('Proportion Female')\n", "axes[0, 0].set_ylabel('Frequency')\n", "\n", "# Plot 2: Names by majority gender in the rolls (names without electoral data are excluded rather than counted as male)\n", "gender_counts = result_df['prop_female'].dropna().gt(0.5).map({True: 'Female', False: 'Male'}).value_counts()\n", "axes[0, 1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%', startangle=90)\n", "axes[0, 1].set_title('Gender Distribution of Sample Names')\n", "\n", "# Plot 3: Confidence levels (for names with clear gender indication)\n", "confidence = 
result_df['prop_female'].apply(lambda x: abs(x - 0.5) * 2 if pd.notna(x) else 0)\n", "axes[1, 0].hist(confidence, bins=20, alpha=0.7, edgecolor='black')\n", "axes[1, 0].set_title('Prediction Confidence Distribution')\n", "axes[1, 0].set_xlabel('Confidence Level (0=uncertain, 1=certain)')\n", "axes[1, 0].set_ylabel('Frequency')\n", "\n", "# Plot 4: Sample count in electoral data\n", "total_counts = result_df['n_female'] + result_df['n_male']\n", "valid_counts = total_counts[total_counts > 0]\n", "if len(valid_counts) > 0:\n", " axes[1, 1].hist(valid_counts, bins=15, alpha=0.7, edgecolor='black')\n", " axes[1, 1].set_title('Sample Sizes in Electoral Data')\n", " axes[1, 1].set_xlabel('Total Count (Female + Male)')\n", " axes[1, 1].set_ylabel('Frequency')\n", " axes[1, 1].set_xscale('log')\n", "else:\n", " axes[1, 1].text(0.5, 0.5, 'No electoral data\\navailable for\\nthese names', \n", " horizontalalignment='center', verticalalignment='center', \n", " transform=axes[1, 1].transAxes, fontsize=12)\n", " axes[1, 1].set_title('Sample Sizes in Electoral Data')\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning Model Predictions\n", "\n", "For names not found in the electoral data, naampy uses a neural network model trained on character patterns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test the ML model directly with some uncommon names\n", "uncommon_names = ['Aadhya', 'Vivaan', 'Kiara', 'Aryan', 'Diya', 'Ishaan', 'Zara', 'Reyansh']\n", "\n", "ml_predictions = predict_fn_gender(uncommon_names)\n", "print(\"ML Model Predictions for Uncommon Names:\")\n", "print(ml_predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset Comparison\n", "\n", "Naampy provides different datasets with varying coverage and accuracy trade-offs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], 
"source": [ "# Compare different datasets\n", "datasets = ['v1', 'v2', 'v2_1k']\n", "test_names = ['Priya', 'Rahul', 'Anjali'] # Use a small subset for comparison\n", "test_df = pd.DataFrame({'first_name': test_names})\n", "\n", "print(\"Dataset Comparison for Selected Names:\")\n", "print(\"=\" * 50)\n", "\n", "for dataset in datasets:\n", " try:\n", " result = in_rolls_fn_gender(test_df, 'first_name', dataset=dataset)\n", " print(f\"\\n{dataset.upper()} Dataset:\")\n", " for _, row in result.iterrows():\n", " if pd.notna(row['prop_female']):\n", " print(f\" {row['first_name']}: {row['prop_female']:.3f} female, {row['prop_male']:.3f} male (n={row['n_female'] + row['n_male']:.0f})\")\n", " else:\n", " print(f\" {row['first_name']}: Not in dataset (ML: {row.get('pred_gender', 'N/A')})\")\n", " except Exception as e:\n", " print(f\"\\n{dataset.upper()} Dataset: Error loading - {str(e)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## State and Year Filtering\n", "\n", "You can filter the electoral data by specific states or birth years for more targeted predictions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check available states\n", "available_states = InRollsFnData.list_states()\n", "print(f\"Available states ({len(available_states)}):\")\n", "print(sorted(available_states)[:10]) # Show first 10 states\n", "print(\"... 
and\", max(len(available_states) - 10, 0), \"more\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare predictions for different states\n", "test_states = ['kerala', 'punjab', 'maharashtra'] # Different linguistic regions\n", "test_name_df = pd.DataFrame({'first_name': ['Priya', 'Simran', 'Aarti']})\n", "\n", "print(\"State-wise Comparison:\")\n", "print(\"=\" * 40)\n", "\n", "# All states combined\n", "all_states_result = in_rolls_fn_gender(test_name_df, 'first_name')\n", "print(\"\\nAll States Combined:\")\n", "for _, row in all_states_result.iterrows():\n", "    if pd.notna(row['prop_female']):\n", "        print(f\" {row['first_name']}: {row['prop_female']:.3f} female\")\n", "\n", "# Individual states\n", "for state in test_states:\n", "    try:\n", "        state_result = in_rolls_fn_gender(test_name_df, 'first_name', state=state)\n", "        print(f\"\\n{state.title()} only:\")\n", "        for _, row in state_result.iterrows():\n", "            if pd.notna(row['prop_female']):\n", "                print(f\" {row['first_name']}: {row['prop_female']:.3f} female\")\n", "            else:\n", "                print(f\" {row['first_name']}: Not found in {state}\")\n", "    except Exception as e:\n", "        print(f\"\\n{state.title()}: Error - {str(e)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance Analysis\n", "\n", "Let's analyze how many of the sample names are covered by the electoral roll data and by the ML model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Analyze coverage\n", "# Assumption: the ML columns may be absent when every name is found in the rolls, so guard the lookup\n", "in_rolls = result_df['prop_female'].notna()\n", "has_ml = result_df['pred_gender'].notna() if 'pred_gender' in result_df.columns else pd.Series(False, index=result_df.index)\n", "coverage_stats = {\n", " 'total_names': len(result_df),\n", " 'found_in_electoral': int(in_rolls.sum()),\n", " 'ml_predictions': int(has_ml.sum()),\n", " 'no_prediction': int((~in_rolls & ~has_ml).sum())\n", "}\n", "\n", "print(\"Coverage Analysis:\")\n", "print(\"=\" * 30)\n", "for key, value in coverage_stats.items():\n", " percentage = (value / 
coverage_stats['total_names']) * 100\n", " print(f\"{key.replace('_', ' ').title()}: {value} ({percentage:.1f}%)\")\n", "\n", "# Visualize coverage\n", "fig, ax = plt.subplots(1, 1, figsize=(10, 6))\n", "\n", "categories = ['Electoral Roll\\nData', 'ML Model\\nPrediction', 'No Prediction']\n", "values = [coverage_stats['found_in_electoral'], \n", " coverage_stats['ml_predictions'], \n", " coverage_stats['no_prediction']]\n", "colors = ['#2E8B57', '#4169E1', '#DC143C']\n", "\n", "bars = ax.bar(categories, values, color=colors, alpha=0.8, edgecolor='black')\n", "ax.set_title('Prediction Coverage Analysis', fontsize=14, fontweight='bold')\n", "ax.set_ylabel('Number of Names')\n", "\n", "# Add value labels on bars\n", "for bar, value in zip(bars, values):\n", " height = bar.get_height()\n", " ax.text(bar.get_x() + bar.get_width()/2., height + 0.1,\n", " f'{value}\\n({value/coverage_stats[\"total_names\"]*100:.1f}%)',\n", " ha='center', va='bottom', fontweight='bold')\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This notebook demonstrated the key features of naampy:\n", "\n", "1. **Electoral Roll Predictions**: Gender proportions taken directly from Indian electoral roll counts for names found in the data\n", "2. **ML Fallback**: Neural network predictions for uncommon names\n", "3. **Multiple Datasets**: Different options balancing coverage vs. accuracy\n", "4. **Geographic Filtering**: State-specific predictions for regional analysis\n", "5. 
**Comprehensive Output**: Proportions, counts, and confidence scores\n", "\n", "Together, these features make naampy suitable for demographic analysis, data preprocessing, and research applications that need gender estimates from Indian first names.\n", "\n", "### Next Steps\n", "\n", "- Explore the [API documentation](../api_reference.html) for detailed function references\n", "- Check out additional examples in the [User Guide](../user_guide.html)\n", "- Report issues or contribute on [GitHub](https://github.com/appeler/naampy)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 4 }