Basic Usage Examples

This notebook demonstrates the key functions in the instate package for predicting Indian states and languages from last names.

Overview

The instate package provides two main approaches:

  1. Electoral Rolls Lookups - Fast frequency-based lookups from Indian electoral rolls data (2017)

  2. Neural Network Predictions - Machine learning models for enhanced predictions

Let’s explore each approach with practical examples.

Setup

First, let’s import the necessary modules and set up our examples.

[1]:
import instate
import pandas as pd
import matplotlib.pyplot as plt

# Sample Indian last names for demonstration
sample_names = ['sood', 'dhingra', 'kumar', 'patel', 'singh', 'sharma', 'reddy', 'iyer']

print(f"instate version: {instate.__version__}")
print(f"Sample names: {sample_names}")
instate version: 1.1.0
Sample names: ['sood', 'dhingra', 'kumar', 'patel', 'singh', 'sharma', 'reddy', 'iyer']

Electoral Rolls Lookups

The electoral rolls approach provides frequency-based lookups for names found in the 2017 Indian electoral rolls dataset.
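
To make "frequency-based" concrete, here is a minimal sketch of how a conditional distribution P(state|lastname) can be derived from raw counts. The records and the `state_distribution` helper are invented for illustration; they are not taken from the electoral rolls or from the package's internals.

```python
from collections import Counter

# Hypothetical (last name, state) records; not real electoral roll data.
records = [
    ("sood", "Punjab"), ("sood", "Punjab"), ("sood", "Delhi"),
    ("reddy", "Andhra Pradesh"), ("reddy", "Andhra Pradesh"),
    ("reddy", "Telangana"), ("reddy", "Andhra Pradesh"),
]

def state_distribution(name, records):
    """Return P(state | last name) as a dict, or {} if the name is unseen."""
    counts = Counter(state for last, state in records if last == name)
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()} if total else {}

sood = state_distribution("sood", records)
reddy = state_distribution("reddy", records)
print(sood)
print(reddy)
```

Each distribution sums to 1 over the states in which the name occurs, and a name with no records yields an empty result.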

Get State Distribution

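The get_state_distribution function returns the probability distribution P(state|lastname) based on electoral rolls data. The returned frame also has a total_n column, which is not a state; the loop below treats every non-name column as a state, so it shows up as "Total N" at the top of each listing.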

[2]:
# Get state distributions for our sample names
state_dist = instate.get_state_distribution(sample_names)

print("State distributions from electoral rolls:")
print("="*50)

# Get state columns (exclude the name column)
name_col = state_dist.columns[0]
state_columns = [col for col in state_dist.columns if col != name_col]

for i, row in state_dist.iterrows():
    name = row[name_col]
    print(f"\n{name.upper()}:")

    # Get non-zero state probabilities for this name
    state_probs = []
    for state_col in state_columns:
        prob = row[state_col]
        if pd.notna(prob) and prob > 0:
            state_probs.append((state_col, prob))

    if state_probs:
        # Show top 3 states for each name
        sorted_states = sorted(state_probs, key=lambda x: x[1], reverse=True)[:3]
        for state, prob in sorted_states:
            # Clean up state name for display
            display_state = state.replace('_', ' ').title()
            print(f"  {display_state}: {prob:.3f}")
    else:
        print("  Not found in electoral rolls")
Copying electoral rolls data from package...
State distributions from electoral rolls:
==================================================

SOOD:
  Total N: 1.000
  Punjab: 0.483
  Delhi: 0.245

DHINGRA:
  Total N: 1.000
  Delhi: 0.996
  Andaman And Nicobar Islands: 0.002

KUMAR:
  Total N: 1.000
  Delhi: 0.529
  Kerala: 0.266

PATEL:
  Total N: 1.000
  Uttar Pradesh: 0.357
  Madhya Pradesh: 0.308

SINGH:
  Total N: 1.000
  Delhi: 0.782
  Manipur: 0.180

SHARMA:
  Total N: 1.000
  Delhi: 0.772
  Sikkim: 0.099

REDDY:
  Total N: 1.000
  Andhra Pradesh: 0.994
  Delhi: 0.003

IYER:
  Total N: 1.000
  Delhi: 0.468
  Goa: 0.167
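
The nested loops above can also be written with pandas reshaping. The sketch below runs on a small made-up frame rather than the real lookup result. It assumes the name column is called `name` and that a `total_n` column should be dropped before ranking; adjust both to the frame you actually get back.

```python
import pandas as pd

# Made-up probabilities for illustration; not package output.
toy = pd.DataFrame({
    "name": ["aaa", "bbb"],
    "state_x": [0.6, 0.0],
    "state_y": [0.3, 0.9],
    "state_z": [0.1, 0.1],
    "total_n": [1.0, 1.0],  # bookkeeping column, not a state
})

def top_states(df, name_col="name", k=2, drop=("total_n",)):
    """Return the k most probable states per name in long format."""
    long = (
        df.drop(columns=[c for c in drop if c in df.columns])
          .melt(id_vars=name_col, var_name="state", value_name="prob")
    )
    long = long[long["prob"] > 0]  # NaN and zero both fail this test
    return (
        long.sort_values([name_col, "prob"], ascending=[True, False])
            .groupby(name_col, sort=False)
            .head(k)
            .reset_index(drop=True)
    )

result = top_states(toy)
print(result)
```

The long format (one row per name and state) is also convenient for plotting or for joining against other state-level tables.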

Visualize State Distribution

Let’s plot the state distribution for one name, kumar, which appears in many states.

[3]:
# Pick a name with good state distribution for visualization
name_to_plot = 'kumar'  # This is typically found in multiple states

# Find the row for this name in our results
name_row = state_dist[state_dist.iloc[:, 0] == name_to_plot]

if not name_row.empty:
    row = name_row.iloc[0]

    # Get non-zero state probabilities for visualization
    # (total_n is a bookkeeping column, not a state, so leave it out)
    state_probs = []
    name_col = state_dist.columns[0]
    state_columns = [col for col in state_dist.columns
                     if col != name_col and col.lower() != 'total_n']

    for state_col in state_columns:
        prob = row[state_col]
        if pd.notna(prob) and prob > 0:
            state_probs.append((state_col, prob))

    if state_probs:
        # Get top 10 states for plotting
        sorted_states = sorted(state_probs, key=lambda x: x[1], reverse=True)[:10]
        states, probabilities = zip(*sorted_states)

        # Clean up state names for display
        display_states = [state.replace('_', ' ').title() for state in states]

        # Create bar plot
        plt.figure(figsize=(12, 6))
        bars = plt.bar(range(len(display_states)), probabilities)
        plt.xlabel('States')
        plt.ylabel('Probability')
        plt.title(f'State Distribution for "{name_to_plot}" (Electoral Rolls Data)')
        plt.xticks(range(len(display_states)), display_states, rotation=45, ha='right')

        # Color bars by probability
        if probabilities:
            max_prob = max(probabilities)
            for i, bar in enumerate(bars):
                bar.set_color(plt.cm.viridis(probabilities[i] / max_prob))

        plt.tight_layout()
        plt.grid(axis='y', alpha=0.3)
        plt.show()
    else:
        print(f"'{name_to_plot}' has no state distribution data")
else:
    print(f"'{name_to_plot}' not found in results")
[Figure: bar chart, State Distribution for "kumar" (Electoral Rolls Data); x-axis: States, y-axis: Probability]

Get State Languages

The get_state_languages function maps states to their official languages.

[4]:
# Get languages for some specific states
states_to_check = ['Maharashtra', 'Punjab', 'Tamil Nadu', 'West Bengal', 'Gujarat']

print("State to Languages Mapping:")
print("="*40)

# Pass the list of states to get_state_languages
state_languages = instate.get_state_languages(states_to_check)

# Display the results
for i, row in state_languages.iterrows():
    state = row.iloc[0]  # First column is the state
    if len(row) > 1 and 'official_languages' in state_languages.columns:
        languages = row['official_languages']
        if pd.notna(languages):
            print(f"{state}: {languages}")
        else:
            print(f"{state}: No language data available")
    else:
        print(f"{state}: No language data available")
State to Languages Mapping:
========================================
Maharashtra: Marathi
Punjab: Punjabi
Tamil Nadu: Tamil
West Bengal: Bengali, English
Gujarat: Gujarati
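
Some states list more than one official language, as the West Bengal row shows. Here is a minimal sketch, assuming the official_languages value is a comma-separated string as in the output above, of splitting those values into lists and inverting the mapping to find the states for a given language. The dictionary is transcribed from that output; the helper names are hypothetical.

```python
# State -> languages, transcribed from the output above.
state_languages_raw = {
    "Maharashtra": "Marathi",
    "Punjab": "Punjabi",
    "Tamil Nadu": "Tamil",
    "West Bengal": "Bengali, English",
    "Gujarat": "Gujarati",
}

def split_languages(value):
    """Split a comma-separated language string into a clean list."""
    return [part.strip() for part in value.split(",") if part.strip()]

def states_by_language(mapping):
    """Invert state -> languages into language -> sorted list of states."""
    inverted = {}
    for state, value in mapping.items():
        for language in split_languages(value):
            inverted.setdefault(language, []).append(state)
    return {language: sorted(states) for language, states in inverted.items()}

by_language = states_by_language(state_languages_raw)
print(by_language["Bengali"])
```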

List Available States

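See all states available in the electoral rolls dataset. The returned list also includes total_n, which is a column name rather than a state, so the count of 32 below covers 31 states and union territories.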

[5]:
# Get all available states
available_states = instate.list_available_states()

print(f"Total states available: {len(available_states)}")
print("\nAvailable states:")
print("="*50)

# Print a numbered, alphabetically sorted list
for i, state in enumerate(sorted(available_states), 1):
    print(f"{i:2d}. {state}")
Total states available: 32

Available states:
==================================================
 1. Andaman and Nicobar Islands
 2. Andhra Pradesh
 3. Arunachal Pradesh
 4. Assam
 5. Bihar
 6. Chandigarh
 7. Dadra and Nagar Haveli
 8. Daman and Diu
 9. Delhi
10. Goa
11. Gujarat
12. Haryana
13. Jammu and Kashmir and Ladakh
14. Jharkhand
15. Karnataka
16. Kerala
17. Madhya Pradesh
18. Maharashtra
19. Manipur
20. Meghalaya
21. Mizoram
22. Nagaland
23. Odisha
24. Puducherry
25. Punjab
26. Rajasthan
27. Sikkim
28. Telangana
29. Tripura
30. Uttar Pradesh
31. Uttarakhand
32. total_n
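
To count only real states, entries such as total_n can be filtered out first. The sketch below uses a shortened stand-in list rather than the real return value, and the `NON_STATE_COLUMNS` set is an assumption based on the output above.

```python
# Shortened stand-in for the list returned above.
available = ["Andhra Pradesh", "Delhi", "Punjab", "total_n"]

# Assumed set of non-state entries, based on the output above.
NON_STATE_COLUMNS = {"total_n"}

states_only = [s for s in available if s not in NON_STATE_COLUMNS]
print(f"{len(states_only)} states: {states_only}")
```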

Neural Network Predictions

For names not in electoral rolls or for enhanced predictions, the package provides neural network models.
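
One way to combine the two approaches is to try the lookup first and call a model only for names it does not cover. The sketch below shows that control flow with stand-ins: `toy_lookup` and `toy_model` are hypothetical placeholders, not the package's functions or data.

```python
# Hypothetical stand-ins for a lookup table and a model.
toy_lookup = {"sood": "Punjab", "reddy": "Andhra Pradesh"}

def toy_model(name):
    """Placeholder predictor; a real model call would go here."""
    return "Delhi"

def state_for(name, lookup, model):
    """Return (state, source), preferring the lookup over the model."""
    key = name.strip().lower()
    if key in lookup:
        return lookup[key], "lookup"
    return model(key), "model"

for n in ["Sood", "unknownname"]:
    print(n, state_for(n, toy_lookup, toy_model))
```

Returning the source alongside the state makes it easy to tell later which answers came from observed frequencies and which from a model.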

Predict States

The predict_state function uses GRU neural networks to predict likely states.

[6]:
# Predict states for our sample names
try:
    state_predictions = instate.predict_state(sample_names, top_k=3)

    print("Neural Network State Predictions:")
    print("="*50)

    for i, row in state_predictions.iterrows():
        name = row.iloc[0]  # First column is the name
        predictions = row['predicted_states']
        print(f"\n{name.upper()}:")
        for j, state in enumerate(predictions, 1):
            print(f"  {j}. {state}")
except Exception as e:
    print(f"State prediction error: {e}")
    print("Note: Neural network models may require additional setup or trained weights.")
Downloading GRU model...
Downloading: 100%|█████████▉| 50816.0/50835.2 [00:01<00:00, 38175.25KB/s]
Neural Network State Predictions:
==================================================

SOOD:
  1. Meghalaya
  2. Chandigarh
  3. Punjab

DHINGRA:
  1. Daman and Diu
  2. Andaman and Nicobar Islands
  3. Puducherry

KUMAR:
  1. Jammu and Kashmir and Ladakh
  2. Punjab
  3. Sikkim

PATEL:
  1. Chandigarh
  2. Sikkim
  3. Uttarakhand

SINGH:
  1. Jammu and Kashmir and Ladakh
  2. Daman and Diu
  3. Meghalaya

SHARMA:
  1. Meghalaya
  2. Sikkim
  3. Andaman and Nicobar Islands

REDDY:
  1. Puducherry
  2. Telangana
  3. Meghalaya

IYER:
  1. Puducherry
  2. Delhi
  3. Telangana
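
The top_k argument controls how many states come back per name. As a rough illustration of what top-k selection means, the sketch below converts made-up scores into probabilities with a softmax and keeps the k largest. The scores, labels, and helpers are invented; this is not the package's GRU or its post-processing.

```python
import heapq
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, scores, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(scores)
    return heapq.nlargest(k, zip(labels, probs), key=lambda pair: pair[1])

# Made-up class scores for four placeholder labels.
labels = ["state_a", "state_b", "state_c", "state_d"]
scores = [2.0, 0.5, 3.0, -1.0]
best = top_k(labels, scores, k=3)
print([label for label, _ in best])
```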

Predict Languages

The predict_language function predicts likely languages using LSTM or KNN models.

[7]:
# Predict languages using different models
print("Language Prediction Examples:")
print("="*50)

# Try LSTM model first
try:
    print("\nTrying LSTM model...")
    language_predictions_lstm = instate.predict_language(sample_names, model='lstm', top_k=3)

    print("\nNeural Network Language Predictions (LSTM):")
    print("-" * 50)

    for i, row in language_predictions_lstm.iterrows():
        name = row.iloc[0]  # First column is the name
        pred_langs = row['predicted_languages']
        print(f"\n{name.upper()}:")
        if isinstance(pred_langs, list):
            for j, lang in enumerate(pred_langs[:3], 1):
                print(f"  {j}. {lang}")
        else:
            print(f"  1. {pred_langs}")

except Exception as e:
    print(f"LSTM model not available: {e}")
    print("\nTrying KNN model...")

    try:
        language_predictions_knn = instate.predict_language(sample_names, model='knn', top_k=3)

        print("\nLanguage Predictions (KNN):")
        print("-" * 50)

        for i, row in language_predictions_knn.iterrows():
            name = row.iloc[0]  # First column is the name
            pred_langs = row['predicted_languages']
            print(f"\n{name.upper()}:")
            if isinstance(pred_langs, list):
                for j, lang in enumerate(pred_langs[:3], 1):
                    print(f"  {j}. {lang}")
            else:
                print(f"  1. {pred_langs}")
    except Exception as e2:
        print(f"KNN model also not available: {e2}")
        print("Note: Language prediction requires trained models to be available.")
Language Prediction Examples:
==================================================

Trying LSTM model...

Neural Network Language Predictions (LSTM):
--------------------------------------------------

SOOD:
  1. hindi
  2. punjabi
  3. urdu

DHINGRA:
  1. hindi
  2. maithili
  3. urdu

KUMAR:
  1. telugu
  2. urdu
  3. chenchu

PATEL:
  1. hindi
  2. urdu
  3. sindhi

SINGH:
  1. hindi
  2. urdu
  3. kannada

SHARMA:
  1. telugu
  2. urdu
  3. kannada

REDDY:
  1. malayalam
  2. urdu
  3. chenchu

IYER:
  1. telugu
  2. urdu
  3. chenchu
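
The LSTM and KNN branches above repeat the same printing logic. One way to avoid the duplication is a helper that tries model names in order and returns the first result that succeeds. In this sketch `predict` is any callable that accepts `(names, model=...)`; `fake_predict` is a hypothetical stand-in so the example runs without the package.

```python
def first_working_model(predict, names, models=("lstm", "knn"), **kwargs):
    """Try each model in order; return (model, result) for the first success."""
    errors = {}
    for model in models:
        try:
            return model, predict(names, model=model, **kwargs)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors[model] = exc
    raise RuntimeError(f"No model available: {errors}")

def fake_predict(names, model, top_k=3):
    """Stand-in predictor: pretends the LSTM weights are missing."""
    if model == "lstm":
        raise FileNotFoundError("lstm weights not found")
    return {name: ["lang_a", "lang_b", "lang_c"][:top_k] for name in names}

used, predictions = first_working_model(fake_predict, ["sood"], top_k=2)
print(used, predictions)
```

With the package installed, instate.predict_language could be passed as `predict`, since the cell above calls it with the same `model` and `top_k` keywords.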

Comparative Analysis

Let’s compare the electoral rolls data with neural network predictions for names found in both systems.

[8]:
print("Comparison: Electoral Rolls vs Neural Network Predictions")
print("="*65)

# Get name and state columns for easier access
# (total_n is a bookkeeping column, not a state, so leave it out)
name_col = state_dist.columns[0]
state_columns = [col for col in state_dist.columns
                 if col != name_col and col.lower() != 'total_n']

for name in sample_names:
    print(f"\n{name.upper()}:")

    # Electoral rolls top state
    name_row = state_dist[state_dist[name_col] == name]
    if not name_row.empty:
        row = name_row.iloc[0]

        # Find the state with highest probability
        max_prob = 0
        top_state = None
        for state_col in state_columns:
            prob = row[state_col]
            if pd.notna(prob) and prob > max_prob:
                max_prob = prob
                top_state = state_col.replace('_', ' ').title()

        if top_state:
            print(f"  Electoral Rolls Top State: {top_state} ({max_prob:.3f})")
        else:
            print("  Electoral Rolls: Not found")
    else:
        print("  Electoral Rolls: Not found")

    # Neural network top state (if available)
    try:
        # Find the row for this name in the predictions
        name_rows = state_predictions[state_predictions.iloc[:, 0] == name]
        if not name_rows.empty:
            nn_top_state = name_rows.iloc[0]['predicted_states'][0]
            print(f"  Neural Network Top State:  {nn_top_state}")
        else:
            print(f"  Neural Network: No prediction for {name}")
    except Exception:  # also covers NameError if state_predictions was never created
        print("  Neural Network: Predictions not available")
Comparison: Electoral Rolls vs Neural Network Predictions
=================================================================

SOOD:
  Electoral Rolls Top State: Punjab (0.483)
  Neural Network Top State:  Meghalaya

DHINGRA:
  Electoral Rolls Top State: Delhi (0.996)
  Neural Network Top State:  Daman and Diu

KUMAR:
  Electoral Rolls Top State: Delhi (0.529)
  Neural Network Top State:  Jammu and Kashmir and Ladakh

PATEL:
  Electoral Rolls Top State: Uttar Pradesh (0.357)
  Neural Network Top State:  Chandigarh

SINGH:
  Electoral Rolls Top State: Delhi (0.782)
  Neural Network Top State:  Jammu and Kashmir and Ladakh

SHARMA:
  Electoral Rolls Top State: Delhi (0.772)
  Neural Network Top State:  Meghalaya

REDDY:
  Electoral Rolls Top State: Andhra Pradesh (0.994)
  Neural Network Top State:  Puducherry

IYER:
  Electoral Rolls Top State: Delhi (0.468)
  Neural Network Top State:  Puducherry
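
A quick way to summarize this comparison is an agreement rate. The sketch below uses values transcribed from the outputs earlier in this notebook: the most probable state per name from the electoral rolls listing (ignoring the total_n column) and the three neural network predictions per name. The `agreement` helper is hypothetical.

```python
# Most probable state per name from the electoral rolls output above
# (the total_n column is ignored).
rolls_top = {
    "sood": "Punjab", "dhingra": "Delhi", "kumar": "Delhi",
    "patel": "Uttar Pradesh", "singh": "Delhi", "sharma": "Delhi",
    "reddy": "Andhra Pradesh", "iyer": "Delhi",
}

# Top-3 neural network predictions, transcribed from the output above.
nn_top3 = {
    "sood": ["Meghalaya", "Chandigarh", "Punjab"],
    "dhingra": ["Daman and Diu", "Andaman and Nicobar Islands", "Puducherry"],
    "kumar": ["Jammu and Kashmir and Ladakh", "Punjab", "Sikkim"],
    "patel": ["Chandigarh", "Sikkim", "Uttarakhand"],
    "singh": ["Jammu and Kashmir and Ladakh", "Daman and Diu", "Meghalaya"],
    "sharma": ["Meghalaya", "Sikkim", "Andaman and Nicobar Islands"],
    "reddy": ["Puducherry", "Telangana", "Meghalaya"],
    "iyer": ["Puducherry", "Delhi", "Telangana"],
}

def agreement(rolls, preds, k):
    """Share of names whose rolls top state is among the first k predictions."""
    hits = sum(rolls[name] in preds[name][:k] for name in rolls)
    return hits / len(rolls)

print(f"top-1 agreement: {agreement(rolls_top, nn_top3, 1):.2f}")
print(f"top-3 agreement: {agreement(rolls_top, nn_top3, 3):.2f}")
```

On these eight names the two methods never agree on the top state, and the rolls' top state appears among the model's three predictions for two names (sood and iyer).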

Summary

This notebook demonstrated the key features of the instate package:

  1. Electoral Rolls Functions:

    • get_state_distribution(): Get probability distributions from electoral data

    • get_state_languages(): Map states to official languages

    • list_available_states(): See all available states in the dataset

  2. Neural Network Functions:

    • predict_state(): GRU-based state prediction

    • predict_language(): LSTM/KNN-based language prediction

The package is useful for:

  • Demographic analysis

  • Geographic distribution studies

  • Language inference from names

  • Cultural and linguistic research

For more information, visit the GitHub repository.