{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic Usage Examples\n", "\n", "This notebook demonstrates the key functions in the `instate` package for predicting Indian states and languages from last names.\n", "\n", "## Overview\n", "\n", "The `instate` package provides two main approaches:\n", "\n", "1. **Electoral Rolls Lookups** - Fast frequency-based lookups from Indian electoral rolls data (2017)\n", "2. **Neural Network Predictions** - Machine learning models for enhanced predictions\n", "\n", "Let's explore each approach with practical examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "First, let's import the necessary modules and set up our examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import instate\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Sample Indian last names for demonstration\n", "sample_names = ['sood', 'dhingra', 'kumar', 'patel', 'singh', 'sharma', 'reddy', 'iyer']\n", "\n", "print(f\"instate version: {instate.__version__}\")\n", "print(f\"Sample names: {sample_names}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Electoral Rolls Lookups\n", "\n", "The electoral rolls approach provides frequency-based lookups for names found in the 2017 Indian electoral rolls dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get State Distribution\n", "\n", "The `get_state_distribution` function returns the probability distribution P(state|lastname) based on electoral rolls data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Get state distributions for our sample names\nstate_dist = instate.get_state_distribution(sample_names)\n\nprint(\"State distributions from electoral rolls:\")\nprint(\"=\"*50)\n\n# Get state columns (exclude the name column)\nname_col = state_dist.columns[0]\nstate_columns = [col for col in state_dist.columns if col != name_col]\n\nfor i, row in state_dist.iterrows():\n    name = row[name_col]\n    print(f\"\\n{name.upper()}:\")\n\n    # Get non-zero state probabilities for this name\n    state_probs = []\n    for state_col in state_columns:\n        prob = row[state_col]\n        if pd.notna(prob) and prob > 0:\n            state_probs.append((state_col, prob))\n\n    if state_probs:\n        # Show top 3 states for each name\n        sorted_states = sorted(state_probs, key=lambda x: x[1], reverse=True)[:3]\n        for state, prob in sorted_states:\n            # Clean up state name for display\n            display_state = state.replace('_', ' ').title()\n            print(f\"  {display_state}: {prob:.3f}\")\n    else:\n        print(\"  Not found in electoral rolls\")" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize State Distribution\n", "\n", "Let's plot the top states for a single name. We use `kumar`, which typically appears in many states." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Pick a name with good state distribution for visualization\nname_to_plot = 'kumar'  # This is typically found in multiple states\n\n# Find the row for this name in our results\nname_row = state_dist[state_dist.iloc[:, 0] == name_to_plot]\n\nif not name_row.empty:\n    row = name_row.iloc[0]\n\n    # Get non-zero state probabilities for visualization\n    state_probs = []\n    name_col = state_dist.columns[0]\n    state_columns = [col for col in state_dist.columns if col != name_col]\n\n    for state_col in state_columns:\n        prob = row[state_col]\n        if pd.notna(prob) and prob > 0:\n            state_probs.append((state_col, prob))\n\n    if state_probs:\n        # Get top 10 states for plotting\n        sorted_states = sorted(state_probs, key=lambda x: x[1], reverse=True)[:10]\n        states, probabilities = zip(*sorted_states)\n\n        # Clean up state names for display\n        display_states = [state.replace('_', ' ').title() for state in states]\n\n        # Create bar plot\n        plt.figure(figsize=(12, 6))\n        bars = plt.bar(range(len(display_states)), probabilities)\n        plt.xlabel('States')\n        plt.ylabel('Probability')\n        plt.title(f'State Distribution for \"{name_to_plot}\" (Electoral Rolls Data)')\n        plt.xticks(range(len(display_states)), display_states, rotation=45, ha='right')\n\n        # Color bars by probability\n        if probabilities:\n            max_prob = max(probabilities)\n            for i, bar in enumerate(bars):\n                bar.set_color(plt.cm.viridis(probabilities[i] / max_prob))\n\n        plt.tight_layout()\n        plt.grid(axis='y', alpha=0.3)\n        plt.show()\n    else:\n        print(f\"'{name_to_plot}' has no state distribution data\")\nelse:\n    print(f\"'{name_to_plot}' not found in results\")" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get State Languages\n", "\n", "The `get_state_languages` function maps states to their official languages." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Get languages for some specific states\nstates_to_check = ['Maharashtra', 'Punjab', 'Tamil Nadu', 'West Bengal', 'Gujarat']\n\nprint(\"State to Languages Mapping:\")\nprint(\"=\"*40)\n\n# Pass the list of states to get_state_languages\nstate_languages = instate.get_state_languages(states_to_check)\n\n# Display the results\nfor i, row in state_languages.iterrows():\n    state = row.iloc[0]  # First column is the state\n    if len(row) > 1 and 'official_languages' in state_languages.columns:\n        languages = row['official_languages']\n        if pd.notna(languages):\n            print(f\"{state}: {languages}\")\n        else:\n            print(f\"{state}: No language data available\")\n    else:\n        print(f\"{state}: No language data available\")" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List Available States\n", "\n", "See all states available in the electoral rolls dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get all available states\n", "available_states = instate.list_available_states()\n", "\n", "print(f\"Total states available: {len(available_states)}\")\n", "print(\"\\nAvailable states:\")\n", "print(\"=\"*50)\n", "\n", "# Print states as a numbered, alphabetically sorted list\n", "for i, state in enumerate(sorted(available_states), 1):\n", "    print(f\"{i:2d}. {state}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neural Network Predictions\n", "\n", "For names not in electoral rolls or for enhanced predictions, the package provides neural network models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict States\n", "\n", "The `predict_state` function uses GRU neural networks to predict likely states." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict states for our sample names\n", "try:\n", "    state_predictions = instate.predict_state(sample_names, top_k=3)\n", "\n", "    print(\"Neural Network State Predictions:\")\n", "    print(\"=\"*50)\n", "\n", "    for i, row in state_predictions.iterrows():\n", "        name = row.iloc[0]  # First column is the name\n", "        predictions = row['predicted_states']\n", "        print(f\"\\n{name.upper()}:\")\n", "        for j, state in enumerate(predictions, 1):\n", "            print(f\"  {j}. {state}\")\n", "except Exception as e:\n", "    print(f\"State prediction error: {e}\")\n", "    print(\"Note: Neural network models may require additional setup or trained weights.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict Languages\n", "\n", "The `predict_language` function predicts likely languages using LSTM or KNN models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict languages using different models\n", "print(\"Language Prediction Examples:\")\n", "print(\"=\"*50)\n", "\n", "# Try the LSTM model first; fall back to KNN if it fails\n", "try:\n", "    print(\"\\nTrying LSTM model...\")\n", "    language_predictions_lstm = instate.predict_language(sample_names, model='lstm', top_k=3)\n", "\n", "    print(\"\\nNeural Network Language Predictions (LSTM):\")\n", "    print(\"-\" * 50)\n", "\n", "    for i, row in language_predictions_lstm.iterrows():\n", "        name = row.iloc[0]  # First column is the name\n", "        pred_langs = row['predicted_languages']\n", "        print(f\"\\n{name.upper()}:\")\n", "        if isinstance(pred_langs, list):\n", "            for j, lang in enumerate(pred_langs[:3], 1):\n", "                print(f\"  {j}. {lang}\")\n", "        else:\n", "            print(f\"  1. {pred_langs}\")\n", "\n", "except Exception as e:\n", "    print(f\"LSTM model not available: {e}\")\n", "    print(\"\\nTrying KNN model...\")\n", "\n", "    try:\n", "        language_predictions_knn = instate.predict_language(sample_names, model='knn', top_k=3)\n", "\n", "        # KNN is a nearest-neighbor model, not a neural network\n", "        print(\"\\nLanguage Predictions (KNN):\")\n", "        print(\"-\" * 50)\n", "\n", "        for i, row in language_predictions_knn.iterrows():\n", "            name = row.iloc[0]  # First column is the name\n", "            pred_langs = row['predicted_languages']\n", "            print(f\"\\n{name.upper()}:\")\n", "            if isinstance(pred_langs, list):\n", "                for j, lang in enumerate(pred_langs[:3], 1):\n", "                    print(f\"  {j}. {lang}\")\n", "            else:\n", "                print(f\"  1. {pred_langs}\")\n", "    except Exception as e2:\n", "        print(f\"KNN model also not available: {e2}\")\n", "        print(\"Note: Language prediction requires trained models to be available.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparative Analysis\n", "\n", "Let's compare the top state from the electoral rolls with the top neural network prediction for each sample name." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "print(\"Comparison: Electoral Rolls vs Neural Network Predictions\")\nprint(\"=\"*65)\n\n# Get name and state columns for easier access\nname_col = state_dist.columns[0]\nstate_columns = [col for col in state_dist.columns if col != name_col]\n\nfor name in sample_names:\n    print(f\"\\n{name.upper()}:\")\n\n    # Electoral rolls top state\n    name_row = state_dist[state_dist[name_col] == name]\n    if not name_row.empty:\n        row = name_row.iloc[0]\n\n        # Find the state with highest probability\n        max_prob = 0\n        top_state = None\n        for state_col in state_columns:\n            prob = row[state_col]\n            if pd.notna(prob) and prob > max_prob:\n                max_prob = prob\n                top_state = state_col.replace('_', ' ').title()\n\n        if top_state:\n            print(f\"  Electoral Rolls Top State: {top_state} ({max_prob:.3f})\")\n        else:\n            print(\"  Electoral Rolls: Not found\")\n    else:\n        print(\"  Electoral Rolls: Not found\")\n\n    # Neural network top state (if available)\n    try:\n        # Find the row for this name in the predictions\n        name_rows = state_predictions[state_predictions.iloc[:, 0] == name]\n        if not name_rows.empty:\n            nn_top_state = name_rows.iloc[0]['predicted_states'][0]\n            print(f\"  Neural Network Top State: {nn_top_state}\")\n        else:\n            print(f\"  Neural Network: No prediction for {name}\")\n    except Exception:\n        # Also covers the NameError raised when state_predictions was never created\n        print(\"  Neural Network: Predictions not available\")" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This notebook demonstrated the key features of the `instate` package:\n", "\n", "1. **Electoral Rolls Functions**:\n", "   - `get_state_distribution()`: Get probability distributions from electoral data\n", "   - `get_state_languages()`: Map states to official languages\n", "   - `list_available_states()`: See all available states in the dataset\n", "\n", "2. **Neural Network Functions**:\n", "   - `predict_state()`: GRU-based state prediction\n", "   - `predict_language()`: LSTM/KNN-based language prediction\n", "\n", "The package is useful for:\n", "- Demographic analysis\n", "- Geographic distribution studies\n", "- Language inference from names\n", "- Cultural and linguistic research\n", "\n", "For more information, visit the [GitHub repository](https://github.com/appeler/instate)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 4 }