{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naampy Usage Examples\n",
    "\n",
    "This notebook demonstrates how to use the naampy package for predicting gender from Indian names using electoral roll data and machine learning models.\n",
    "\n",
    "## Installation\n",
    "\n",
    "First, ensure you have naampy installed:\n",
    "\n",
    "```bash\n",
    "pip install naampy\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Setup and Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from naampy import in_rolls_fn_gender, predict_fn_gender, InRollsFnData\n",
    "\n",
    "# Set up plotting style\n",
    "plt.style.use('default')\n",
    "sns.set_palette(\"husl\")\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sample Data\n",
    "\n",
    "Let's create a sample dataset with Indian names to demonstrate the functionality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create sample data with common Indian names\n",
    "sample_names = [\n",
    "    'Priya', 'Rahul', 'Anjali', 'Vikram', 'Deepika', 'Arjun',\n",
    "    'Kavita', 'Rajesh', 'Sunita', 'Amit', 'Meera', 'Rohan',\n",
    "    'Neha', 'Karan', 'Pooja', 'Sanjay', 'Ritu', 'Ashok',\n",
    "    'Geeta', 'Manish', 'Seema', 'Suresh', 'Anita', 'Naveen'\n",
    "]\n",
    "\n",
    "# Create DataFrame\n",
    "df = pd.DataFrame({\n",
    "    'id': range(1, len(sample_names) + 1),\n",
    "    'first_name': sample_names,\n",
    "    'age': [25, 30, 28, 35, 22, 29, 31, 40, 26, 33, 27, 24, 23, 32, 29, 38, 25, 42, 30, 36, 28, 39, 34, 27]\n",
    "})\n",
    "\n",
    "print(\"Sample dataset:\")\n",
    "print(df.head(10))\n",
    "print(f\"\\nTotal names: {len(df)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Electoral Roll Gender Prediction\n",
    "\n",
    "The primary method uses Indian Electoral Roll statistics to predict gender. The statistics are compiled from voter registration records (electoral rolls) covering 31 Indian states and union territories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predict gender using electoral roll data\n",
    "result_df = in_rolls_fn_gender(df, 'first_name')\n",
    "\n",
    "print(\"Results with electoral roll data:\")\n",
    "print(result_df[['first_name', 'prop_female', 'prop_male', 'n_female', 'n_male']].head(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Understanding the Results\n",
    "\n",
    "- `prop_female`: Proportion of females with this name (0.0 to 1.0)\n",
    "- `prop_male`: Proportion of males with this name (0.0 to 1.0)\n",
    "- `n_female`: Absolute count of females with this name in electoral data\n",
    "- `n_male`: Absolute count of males with this name in electoral data\n",
    "- `pred_gender`: ML prediction for names not found in electoral data\n",
    "- `pred_prob`: Confidence score for ML predictions"
   ]
  },
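  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The columns described above can be collapsed into a single label. The cell below is a minimal sketch on a small hand-made frame with made-up values, not real naampy output. The helper `label_from_proportion` and the 0.9 cutoff are hypothetical choices for illustration; pick a cutoff that suits your own tolerance for ambiguous names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy frame with made-up values that mimics the column layout described above\n",
    "toy = pd.DataFrame({\n",
    "    'first_name': ['A', 'B', 'C', 'D'],\n",
    "    'prop_female': [0.98, 0.03, 0.55, None],\n",
    "})\n",
    "\n",
    "# Hypothetical helper: label a name only when one gender clearly dominates\n",
    "def label_from_proportion(prop_female, cutoff=0.9):\n",
    "    if pd.isna(prop_female):\n",
    "        return 'unknown'    # name not found in the electoral data\n",
    "    if prop_female >= cutoff:\n",
    "        return 'female'\n",
    "    if prop_female <= 1 - cutoff:\n",
    "        return 'male'\n",
    "    return 'ambiguous'      # neither gender reaches the cutoff\n",
    "\n",
    "toy['label'] = toy['prop_female'].apply(label_from_proportion)\n",
    "print(toy)"
   ]
  },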
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing Gender Predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create visualizations of the results\n",
    "fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n",
    "\n",
    "# Plot 1: Gender proportion distribution\n",
    "axes[0, 0].hist(result_df['prop_female'], bins=20, alpha=0.7, edgecolor='black')\n",
    "axes[0, 0].set_title('Distribution of Female Proportion')\n",
    "axes[0, 0].set_xlabel('Proportion Female')\n",
    "axes[0, 0].set_ylabel('Frequency')\n",
    "\n",
    "# Plot 2: Names by majority gender (names missing from the electoral data are excluded, not counted as male)\n",
    "gender_counts = result_df['prop_female'].dropna().gt(0.5).map({True: 'Female', False: 'Male'}).value_counts()\n",
    "axes[0, 1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%', startangle=90)\n",
    "axes[0, 1].set_title('Gender Distribution of Sample Names')\n",
    "\n",
    "# Plot 3: Confidence levels, measured as the distance of prop_female from 0.5 (names without data are excluded)\n",
    "confidence = (result_df['prop_female'].dropna() - 0.5).abs() * 2\n",
    "axes[1, 0].hist(confidence, bins=20, alpha=0.7, edgecolor='black')\n",
    "axes[1, 0].set_title('Prediction Confidence Distribution')\n",
    "axes[1, 0].set_xlabel('Confidence Level (0=uncertain, 1=certain)')\n",
    "axes[1, 0].set_ylabel('Frequency')\n",
    "\n",
    "# Plot 4: Sample count in electoral data\n",
    "total_counts = result_df['n_female'] + result_df['n_male']\n",
    "valid_counts = total_counts[total_counts > 0]\n",
    "if len(valid_counts) > 0:\n",
    "    axes[1, 1].hist(valid_counts, bins=15, alpha=0.7, edgecolor='black')\n",
    "    axes[1, 1].set_title('Sample Sizes in Electoral Data')\n",
    "    axes[1, 1].set_xlabel('Total Count (Female + Male)')\n",
    "    axes[1, 1].set_ylabel('Frequency')\n",
    "    axes[1, 1].set_xscale('log')\n",
    "else:\n",
    "    axes[1, 1].text(0.5, 0.5, 'No electoral data\\navailable for\\nthese names', \n",
    "                   horizontalalignment='center', verticalalignment='center', \n",
    "                   transform=axes[1, 1].transAxes, fontsize=12)\n",
    "    axes[1, 1].set_title('Sample Sizes in Electoral Data')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Machine Learning Model Predictions\n",
    "\n",
    "For names not found in the electoral data, naampy uses a neural network model trained on character patterns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test the ML model directly with some uncommon names\n",
    "uncommon_names = ['Aadhya', 'Vivaan', 'Kiara', 'Aryan', 'Diya', 'Ishaan', 'Zara', 'Reyansh']\n",
    "\n",
    "ml_predictions = predict_fn_gender(uncommon_names)\n",
    "print(\"ML Model Predictions for Uncommon Names:\")\n",
    "print(ml_predictions)"
   ]
  },
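  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common follow-up is to merge the two sources into one column, preferring the electoral roll proportion and falling back to the ML label. The cell below is a minimal sketch on a hand-made frame with made-up values. The 0.5 threshold, the `gender` and `source` column names, and the lowercase `'female'`/`'male'` label strings are assumptions for illustration, so check them against the actual output before reusing this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy frame with made-up values; real output may use different label strings\n",
    "toy = pd.DataFrame({\n",
    "    'first_name': ['A', 'B', 'C'],\n",
    "    'prop_female': [0.97, np.nan, np.nan],\n",
    "    'pred_gender': [np.nan, 'male', np.nan],\n",
    "})\n",
    "\n",
    "# Label from the electoral proportion where available (missing rows stay NaN)\n",
    "has_rolls = toy['prop_female'].notna()\n",
    "from_rolls = toy['prop_female'].gt(0.5).map({True: 'female', False: 'male'}).where(has_rolls)\n",
    "\n",
    "# Prefer the electoral label, then the ML label, then mark the name as unknown\n",
    "toy['gender'] = from_rolls.fillna(toy['pred_gender']).fillna('unknown')\n",
    "toy['source'] = np.select(\n",
    "    [has_rolls, toy['pred_gender'].notna()],\n",
    "    ['electoral_roll', 'ml_model'],\n",
    "    default='none',\n",
    ")\n",
    "print(toy[['first_name', 'gender', 'source']])"
   ]
  },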
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset Comparison\n",
    "\n",
    "Naampy provides different datasets with varying coverage and accuracy trade-offs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare different datasets\n",
    "datasets = ['v1', 'v2', 'v2_1k']\n",
    "test_names = ['Priya', 'Rahul', 'Anjali']  # Use a small subset for comparison\n",
    "test_df = pd.DataFrame({'first_name': test_names})\n",
    "\n",
    "print(\"Dataset Comparison for Selected Names:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "for dataset in datasets:\n",
    "    try:\n",
    "        result = in_rolls_fn_gender(test_df, 'first_name', dataset=dataset)\n",
    "        print(f\"\\n{dataset.upper()} Dataset:\")\n",
    "        for _, row in result.iterrows():\n",
    "            if pd.notna(row['prop_female']):\n",
    "                print(f\"  {row['first_name']}: {row['prop_female']:.3f} female, {row['prop_male']:.3f} male (n={row['n_female'] + row['n_male']:.0f})\")\n",
    "            else:\n",
    "                print(f\"  {row['first_name']}: Not in dataset (ML: {row.get('pred_gender', 'N/A')})\")\n",
    "    except Exception as e:\n",
    "        print(f\"\\n{dataset.upper()} Dataset: Error loading - {str(e)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## State and Year Filtering\n",
    "\n",
    "You can restrict the electoral data to a specific state or birth year for more targeted predictions. The cells below demonstrate state filtering:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check available states\n",
    "available_states = InRollsFnData.list_states()\n",
    "print(f\"Available states ({len(available_states)}):\")\n",
    "print(sorted(available_states)[:10])  # Show first 10 states\n",
    "print(\"... and\", len(available_states) - 10, \"more\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare predictions for different states\n",
    "test_states = ['kerala', 'punjab', 'maharashtra']  # Different linguistic regions\n",
    "test_name_df = pd.DataFrame({'first_name': ['Priya', 'Simran', 'Aarti']})\n",
    "\n",
    "print(\"State-wise Comparison:\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "# All states combined\n",
    "all_states_result = in_rolls_fn_gender(test_name_df, 'first_name')\n",
    "print(\"\\nAll States Combined:\")\n",
    "for _, row in all_states_result.iterrows():\n",
    "    if pd.notna(row['prop_female']):\n",
    "        print(f\"  {row['first_name']}: {row['prop_female']:.3f} female\")\n",
    "\n",
    "# Individual states\n",
    "for state in test_states:\n",
    "    try:\n",
    "        state_result = in_rolls_fn_gender(test_name_df, 'first_name', state=state)\n",
    "        print(f\"\\n{state.title()} only:\")\n",
    "        for _, row in state_result.iterrows():\n",
    "            if pd.notna(row['prop_female']):\n",
    "                print(f\"  {row['first_name']}: {row['prop_female']:.3f} female\")\n",
    "            else:\n",
    "                print(f\"  {row['first_name']}: Not found in {state}\")\n",
    "    except Exception as e:\n",
    "        print(f\"\\n{state.title()}: Error - {str(e)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performance Analysis\n",
    "\n",
    "Let's analyze how many of the sample names each approach covers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze coverage\n",
    "# Fall back to an all-NaN column in case pred_gender is absent from the result\n",
    "pred_gender = result_df.get('pred_gender', pd.Series(index=result_df.index, dtype=object))\n",
    "coverage_stats = {\n",
    "    'total_names': len(result_df),\n",
    "    'found_in_electoral': int(result_df['prop_female'].notna().sum()),\n",
    "    'ml_predictions': int(pred_gender.notna().sum()),\n",
    "    'no_prediction': int((result_df['prop_female'].isna() & pred_gender.isna()).sum())\n",
    "}\n",
    "\n",
    "print(\"Coverage Analysis:\")\n",
    "print(\"=\" * 30)\n",
    "for key, value in coverage_stats.items():\n",
    "    percentage = (value / coverage_stats['total_names']) * 100\n",
    "    print(f\"{key.replace('_', ' ').title()}: {value} ({percentage:.1f}%)\")\n",
    "\n",
    "# Visualize coverage\n",
    "fig, ax = plt.subplots(1, 1, figsize=(10, 6))\n",
    "\n",
    "categories = ['Electoral Roll\\nData', 'ML Model\\nPrediction', 'No Prediction']\n",
    "values = [coverage_stats['found_in_electoral'], \n",
    "          coverage_stats['ml_predictions'], \n",
    "          coverage_stats['no_prediction']]\n",
    "colors = ['#2E8B57', '#4169E1', '#DC143C']\n",
    "\n",
    "bars = ax.bar(categories, values, color=colors, alpha=0.8, edgecolor='black')\n",
    "ax.set_title('Prediction Coverage Analysis', fontsize=14, fontweight='bold')\n",
    "ax.set_ylabel('Number of Names')\n",
    "\n",
    "# Add value labels on bars\n",
    "for bar, value in zip(bars, values):\n",
    "    height = bar.get_height()\n",
    "    ax.text(bar.get_x() + bar.get_width()/2., height + 0.1,\n",
    "            f'{value}\\n({value/coverage_stats[\"total_names\"]*100:.1f}%)',\n",
    "            ha='center', va='bottom', fontweight='bold')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "This notebook demonstrated the key features of naampy:\n",
    "\n",
    "1. **Electoral Roll Predictions**: Gender proportions and counts for names found in Indian electoral data\n",
    "2. **ML Fallback**: Neural network predictions for uncommon names\n",
    "3. **Multiple Datasets**: Different options balancing coverage vs. accuracy\n",
    "4. **Geographic Filtering**: State-specific predictions for regional analysis\n",
    "5. **Comprehensive Output**: Proportions, counts, and confidence scores\n",
    "\n",
    "Together, these features support gender prediction from Indian names for demographic analysis, data preprocessing, and research applications.\n",
    "\n",
    "### Next Steps\n",
    "\n",
    "- Explore the [API documentation](../api_reference.html) for detailed function references\n",
    "- Check out additional examples in the [User Guide](../user_guide.html)\n",
    "- Report issues or contribute on [GitHub](https://github.com/appeler/naampy)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}