{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Usage Examples\n",
    "\n",
    "This notebook demonstrates the key functions in the `instate` package for predicting Indian states and languages from last names.\n",
    "\n",
    "## Overview\n",
    "\n",
    "The `instate` package provides two main approaches:\n",
    "\n",
    "1. **Electoral Rolls Lookups** - Fast frequency-based lookups from Indian electoral rolls data (2017)\n",
    "2. **Neural Network Predictions** - Trained models that predict states and languages, including for names not found in the electoral rolls\n",
    "\n",
    "Let's explore each approach with practical examples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "First, let's import the necessary modules and set up our examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import instate\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Sample Indian last names for demonstration\n",
    "sample_names = ['sood', 'dhingra', 'kumar', 'patel', 'singh', 'sharma', 'reddy', 'iyer']\n",
    "\n",
    "print(f\"instate version: {instate.__version__}\")\n",
    "print(f\"Sample names: {sample_names}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Electoral Rolls Lookups\n",
    "\n",
    "The electoral rolls approach provides frequency-based lookups for names found in the 2017 Indian electoral rolls dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get State Distribution\n",
    "\n",
    "The `get_state_distribution` function returns the probability distribution P(state|lastname) based on electoral rolls data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Get state distributions for our sample names\nstate_dist = instate.get_state_distribution(sample_names)\n\nprint(\"State distributions from electoral rolls:\")\nprint(\"=\"*50)\n\n# Get state columns (exclude the name column)\nname_col = state_dist.columns[0] \nstate_columns = [col for col in state_dist.columns if col != name_col]\n\nfor i, row in state_dist.iterrows():\n    name = row[name_col]\n    print(f\"\\n{name.upper()}:\")\n    \n    # Get non-zero state probabilities for this name\n    state_probs = []\n    for state_col in state_columns:\n        prob = row[state_col]\n        if pd.notna(prob) and prob > 0:\n            state_probs.append((state_col, prob))\n    \n    if state_probs:\n        # Show top 3 states for each name\n        sorted_states = sorted(state_probs, key=lambda x: x[1], reverse=True)[:3]\n        for state, prob in sorted_states:\n            # Clean up state name for display\n            display_state = state.replace('_', ' ').title()\n            print(f\"  {display_state}: {prob:.3f}\")\n    else:\n        print(\"  Not found in electoral rolls\")"
  },
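  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make P(state|lastname) concrete, here is a minimal sketch of how such a distribution could be derived from raw counts. The counts, name labels, and state labels below are invented for illustration; this is not `instate`'s internal code and not real electoral rolls data. The sketch also uses `Series.nlargest` as a compact alternative to sorting the probabilities by hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy counts, invented for illustration only (not real electoral rolls figures)\n",
    "toy_counts = pd.DataFrame(\n",
    "    {\n",
    "        'last_name': ['alpha', 'beta'],\n",
    "        'state_a': [30, 0],\n",
    "        'state_b': [10, 5],\n",
    "        'state_c': [60, 15],\n",
    "    }\n",
    ").set_index('last_name')\n",
    "\n",
    "# Divide each row by its total so every row sums to 1: P(state | last_name)\n",
    "toy_probs = toy_counts.div(toy_counts.sum(axis=1), axis=0)\n",
    "\n",
    "# Top 2 states for one name, without building and sorting a list of tuples\n",
    "print(toy_probs.loc['alpha'].nlargest(2))"
   ]
  },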
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize State Distribution\n",
    "\n",
    "Let's plot the ten most likely states for one name, `kumar`, which typically appears in many states."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Pick a name with good state distribution for visualization\nname_to_plot = 'kumar'  # This is typically found in multiple states\n\n# Find the row for this name in our results\nname_row = state_dist[state_dist.iloc[:, 0] == name_to_plot]\n\nif not name_row.empty:\n    row = name_row.iloc[0]\n    \n    # Get non-zero state probabilities for visualization\n    state_probs = []\n    name_col = state_dist.columns[0] \n    state_columns = [col for col in state_dist.columns if col != name_col]\n    \n    for state_col in state_columns:\n        prob = row[state_col]\n        if pd.notna(prob) and prob > 0:\n            state_probs.append((state_col, prob))\n    \n    if state_probs:\n        # Get top 10 states for plotting\n        sorted_states = sorted(state_probs, key=lambda x: x[1], reverse=True)[:10]\n        states, probabilities = zip(*sorted_states)\n        \n        # Clean up state names for display\n        display_states = [state.replace('_', ' ').title() for state in states]\n        \n        # Create bar plot\n        plt.figure(figsize=(12, 6))\n        bars = plt.bar(range(len(display_states)), probabilities)\n        plt.xlabel('States')\n        plt.ylabel('Probability')\n        plt.title(f'State Distribution for \"{name_to_plot}\" (Electoral Rolls Data)')\n        plt.xticks(range(len(display_states)), display_states, rotation=45, ha='right')\n        \n        # Color bars by probability\n        if probabilities:\n            max_prob = max(probabilities)\n            for i, bar in enumerate(bars):\n                bar.set_color(plt.cm.viridis(probabilities[i] / max_prob))\n        \n        plt.tight_layout()\n        plt.grid(axis='y', alpha=0.3)\n        plt.show()\n    else:\n        print(f\"'{name_to_plot}' has no state distribution data\")\nelse:\n    print(f\"'{name_to_plot}' not found in results\")"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get State Languages\n",
    "\n",
    "The `get_state_languages` function maps states to their official languages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Get languages for some specific states\nstates_to_check = ['Maharashtra', 'Punjab', 'Tamil Nadu', 'West Bengal', 'Gujarat']\n\nprint(\"State to Languages Mapping:\")\nprint(\"=\"*40)\n\n# Pass the list of states to get_state_languages\nstate_languages = instate.get_state_languages(states_to_check)\n\n# Display the results\nfor i, row in state_languages.iterrows():\n    state = row.iloc[0]  # First column is the state\n    if len(row) > 1 and 'official_languages' in state_languages.columns:\n        languages = row['official_languages']\n        if pd.notna(languages):\n            print(f\"{state}: {languages}\")\n        else:\n            print(f\"{state}: No language data available\")\n    else:\n        print(f\"{state}: No language data available\")"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List Available States\n",
    "\n",
    "See all states available in the electoral rolls dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get all available states\n",
    "available_states = instate.list_available_states()\n",
    "\n",
    "print(f\"Total states available: {len(available_states)}\")\n",
    "print(\"\\nAvailable states:\")\n",
    "print(\"=\"*50)\n",
    "\n",
    "# Print the states as a numbered, alphabetically sorted list\n",
    "for i, state in enumerate(sorted(available_states), 1):\n",
    "    print(f\"{i:2d}. {state}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Neural Network Predictions\n",
    "\n",
    "For names that are missing from the electoral rolls, or when you want a model-based prediction instead of a frequency lookup, the package provides trained models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict States\n",
    "\n",
    "The `predict_state` function uses GRU neural networks to predict likely states."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predict states for our sample names\n",
    "try:\n",
    "    state_predictions = instate.predict_state(sample_names, top_k=3)\n",
    "\n",
    "    print(\"Neural Network State Predictions:\")\n",
    "    print(\"=\"*50)\n",
    "\n",
    "    for i, row in state_predictions.iterrows():\n",
    "        name = row.iloc[0]  # First column is the name\n",
    "        predictions = row['predicted_states']\n",
    "        print(f\"\\n{name.upper()}:\")\n",
    "        for j, state in enumerate(predictions, 1):\n",
    "            print(f\"  {j}. {state}\")\n",
    "except Exception as e:\n",
    "    print(f\"State prediction error: {e}\")\n",
    "    print(\"Note: Neural network models may require additional setup or trained weights.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict Languages\n",
    "\n",
    "The `predict_language` function predicts likely languages using LSTM or KNN models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predict languages with the LSTM model, falling back to KNN if it is unavailable\n",
    "print(\"Language Prediction Examples:\")\n",
    "print(\"=\"*50)\n",
    "\n",
    "# Try LSTM model first\n",
    "try:\n",
    "    print(\"\\nTrying LSTM model...\")\n",
    "    language_predictions_lstm = instate.predict_language(sample_names, model='lstm', top_k=3)\n",
    "    \n",
    "    print(\"\\nNeural Network Language Predictions (LSTM):\")\n",
    "    print(\"-\" * 50)\n",
    "    \n",
    "    for i, row in language_predictions_lstm.iterrows():\n",
    "        name = row.iloc[0]  # First column is the name\n",
    "        pred_langs = row['predicted_languages']\n",
    "        print(f\"\\n{name.upper()}:\")\n",
    "        if isinstance(pred_langs, list):\n",
    "            for j, lang in enumerate(pred_langs[:3], 1):\n",
    "                print(f\"  {j}. {lang}\")\n",
    "        else:\n",
    "            print(f\"  1. {pred_langs}\")\n",
    "            \n",
    "except Exception as e:\n",
    "    print(f\"LSTM model not available: {e}\")\n",
    "    print(\"\\nTrying KNN model...\")\n",
    "    \n",
    "    try:\n",
    "        language_predictions_knn = instate.predict_language(sample_names, model='knn', top_k=3)\n",
    "        \n",
    "        print(\"\\nLanguage Predictions (KNN):\")\n",
    "        print(\"-\" * 50)\n",
    "        \n",
    "        for i, row in language_predictions_knn.iterrows():\n",
    "            name = row.iloc[0]  # First column is the name\n",
    "            pred_langs = row['predicted_languages']\n",
    "            print(f\"\\n{name.upper()}:\")\n",
    "            if isinstance(pred_langs, list):\n",
    "                for j, lang in enumerate(pred_langs[:3], 1):\n",
    "                    print(f\"  {j}. {lang}\")\n",
    "            else:\n",
    "                print(f\"  1. {pred_langs}\")\n",
    "    except Exception as e2:\n",
    "        print(f\"KNN model also not available: {e2}\")\n",
    "        print(\"Note: Language prediction requires trained models to be available.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparative Analysis\n",
    "\n",
    "Let's compare the top state from the electoral rolls with the top neural network prediction for each sample name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "print(\"Comparison: Electoral Rolls vs Neural Network Predictions\")\nprint(\"=\"*65)\n\n# Get name and state columns for easier access\nname_col = state_dist.columns[0]\nstate_columns = [col for col in state_dist.columns if col != name_col]\n\nfor name in sample_names:\n    print(f\"\\n{name.upper()}:\")\n    \n    # Electoral rolls top state\n    name_row = state_dist[state_dist[name_col] == name]\n    if not name_row.empty:\n        row = name_row.iloc[0]\n        \n        # Find the state with highest probability\n        max_prob = 0\n        top_state = None\n        for state_col in state_columns:\n            prob = row[state_col]\n            if pd.notna(prob) and prob > max_prob:\n                max_prob = prob\n                top_state = state_col.replace('_', ' ').title()\n        \n        if top_state:\n            print(f\"  Electoral Rolls Top State: {top_state} ({max_prob:.3f})\")\n        else:\n            print(\"  Electoral Rolls: Not found\")\n    else:\n        print(\"  Electoral Rolls: Not found\")\n    \n    # Neural network top state (if available)\n    try:\n        # Find the row for this name in the predictions\n        name_rows = state_predictions[state_predictions.iloc[:, 0] == name]\n        if not name_rows.empty:\n            nn_top_state = name_rows.iloc[0]['predicted_states'][0]\n            print(f\"  Neural Network Top State:  {nn_top_state}\")\n        else:\n            print(f\"  Neural Network: No prediction for {name}\")\n    except Exception:\n        # Also covers NameError when the state prediction cell above failed\n        print(\"  Neural Network: Predictions not available\")"
  },
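  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two approaches can also be combined: use the frequency lookup when a name is known and fall back to a model otherwise. The cell below is a minimal sketch of that pattern under stated assumptions. `toy_lookup`, `toy_model_predict`, and `top_state` are hypothetical stand-ins defined here for illustration, not part of the `instate` API, and the names and states are invented."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy lookup table, invented for illustration (not real data)\n",
    "toy_lookup = {'alpha': 'State A', 'beta': 'State B'}\n",
    "\n",
    "\n",
    "def toy_model_predict(name):\n",
    "    # Hypothetical stand-in for a trained model; always returns the same label\n",
    "    return 'State C'\n",
    "\n",
    "\n",
    "def top_state(name):\n",
    "    # Prefer the frequency lookup; fall back to the model for unseen names\n",
    "    key = name.strip().lower()\n",
    "    if key in toy_lookup:\n",
    "        return toy_lookup[key], 'lookup'\n",
    "    return toy_model_predict(key), 'model'\n",
    "\n",
    "\n",
    "# 'Alpha' is in the toy table, 'gamma' is not\n",
    "for name in ['Alpha', 'gamma']:\n",
    "    print(name, top_state(name))"
   ]
  },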
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "This notebook demonstrated the key features of the `instate` package:\n",
    "\n",
    "1. **Electoral Rolls Functions**:\n",
    "   - `get_state_distribution()`: Get probability distributions from electoral data\n",
    "   - `get_state_languages()`: Map states to official languages\n",
    "   - `list_available_states()`: See all available states in the dataset\n",
    "\n",
    "2. **Neural Network Functions**:\n",
    "   - `predict_state()`: GRU-based state prediction\n",
    "   - `predict_language()`: LSTM/KNN-based language prediction\n",
    "\n",
    "The package is useful for:\n",
    "- Demographic analysis\n",
    "- Geographic distribution studies\n",
    "- Language inference from names\n",
    "- Cultural and linguistic research\n",
    "\n",
    "For more information, visit the [GitHub repository](https://github.com/appeler/instate)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}