Basic Usage Examples
This notebook demonstrates the key functions in the instate package for predicting Indian states and languages from last names.
Overview
The instate package provides two main approaches:
Electoral Rolls Lookups - Fast frequency-based lookups from Indian electoral rolls data (2017)
Neural Network Predictions - Machine learning models for enhanced predictions
Let’s explore each approach with practical examples.
Setup
First, let’s import the necessary modules and set up our examples.
[1]:
import instate
import pandas as pd
import matplotlib.pyplot as plt
# Sample Indian last names for demonstration
sample_names = ['sood', 'dhingra', 'kumar', 'patel', 'singh', 'sharma', 'reddy', 'iyer']
print(f"instate version: {instate.__version__}")
print(f"Sample names: {sample_names}")
instate version: 1.1.0
Sample names: ['sood', 'dhingra', 'kumar', 'patel', 'singh', 'sharma', 'reddy', 'iyer']
Electoral Rolls Lookups
The electoral rolls approach provides frequency-based lookups for names found in the 2017 Indian electoral rolls dataset.
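Get State Distribution
The get_state_distribution function returns the probability distribution P(state|lastname) based on electoral rolls data. The returned table also carries a total_n column, which is not a state; because the loop below treats every column other than the name as a state, it shows up as "Total N" at the top of each list and should be ignored when reading the output.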
[2]:
# Get state distributions for our sample names
state_dist = instate.get_state_distribution(sample_names)
print("State distributions from electoral rolls:")
print("="*50)
# Get state columns (exclude the name column)
name_col = state_dist.columns[0]
state_columns = [col for col in state_dist.columns if col != name_col]
for i, row in state_dist.iterrows():
    name = row[name_col]
    print(f"\n{name.upper()}:")
    # Get non-zero state probabilities for this name
    state_probs = []
    for state_col in state_columns:
        prob = row[state_col]
        if pd.notna(prob) and prob > 0:
            state_probs.append((state_col, prob))
    if state_probs:
        # Show top 3 states for each name
        sorted_states = sorted(state_probs, key=lambda x: x[1], reverse=True)[:3]
        for state, prob in sorted_states:
            # Clean up state name for display
            display_state = state.replace('_', ' ').title()
            print(f" {display_state}: {prob:.3f}")
    else:
        print(" Not found in electoral rolls")
Copying electoral rolls data from package...
State distributions from electoral rolls:
==================================================
SOOD:
Total N: 1.000
Punjab: 0.483
Delhi: 0.245
DHINGRA:
Total N: 1.000
Delhi: 0.996
Andaman And Nicobar Islands: 0.002
KUMAR:
Total N: 1.000
Delhi: 0.529
Kerala: 0.266
PATEL:
Total N: 1.000
Uttar Pradesh: 0.357
Madhya Pradesh: 0.308
SINGH:
Total N: 1.000
Delhi: 0.782
Manipur: 0.180
SHARMA:
Total N: 1.000
Delhi: 0.772
Sikkim: 0.099
REDDY:
Total N: 1.000
Andhra Pradesh: 0.994
Delhi: 0.003
IYER:
Total N: 1.000
Delhi: 0.468
Goa: 0.167
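To make P(state|lastname) concrete: a frequency-based lookup amounts to dividing a name's per-state counts by that name's total count. The sketch below uses invented counts and a hypothetical helper, state_distribution; it illustrates the idea only and is not the package's implementation.

```python
def state_distribution(counts):
    """Turn per-state counts for one last name into P(state | lastname)."""
    total = sum(counts.values())
    if total == 0:
        return {}  # name never observed: no distribution to report
    return {state: n / total for state, n in counts.items()}


# Hypothetical counts, invented purely for illustration
toy_counts = {"punjab": 50, "delhi": 30, "haryana": 20}
dist = state_distribution(toy_counts)
print(dist)  # {'punjab': 0.5, 'delhi': 0.3, 'haryana': 0.2}
```

Each row of probabilities sums to 1, which is why a separate total column should not be ranked alongside the states.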
Visualize State Distribution
Let’s create a visualization for one of the names with the richest state distribution.
[3]:
# Pick a name with good state distribution for visualization
name_to_plot = 'kumar' # This is typically found in multiple states
# Find the row for this name in our results
name_row = state_dist[state_dist.iloc[:, 0] == name_to_plot]
if not name_row.empty:
    row = name_row.iloc[0]
    # Get non-zero state probabilities for visualization
    state_probs = []
    name_col = state_dist.columns[0]
    # Skip the name column and total_n, which is not a state
    # (assumes the column is named 'total_n', as listed by list_available_states())
    state_columns = [col for col in state_dist.columns if col not in (name_col, 'total_n')]
    for state_col in state_columns:
        prob = row[state_col]
        if pd.notna(prob) and prob > 0:
            state_probs.append((state_col, prob))
    if state_probs:
        # Get top 10 states for plotting
        sorted_states = sorted(state_probs, key=lambda x: x[1], reverse=True)[:10]
        states, probabilities = zip(*sorted_states)
        # Clean up state names for display
        display_states = [state.replace('_', ' ').title() for state in states]
        # Create bar plot
        plt.figure(figsize=(12, 6))
        bars = plt.bar(range(len(display_states)), probabilities)
        plt.xlabel('States')
        plt.ylabel('Probability')
        plt.title(f'State Distribution for "{name_to_plot}" (Electoral Rolls Data)')
        plt.xticks(range(len(display_states)), display_states, rotation=45, ha='right')
        # Color bars by probability, relative to the largest bar
        max_prob = max(probabilities)
        for i, bar in enumerate(bars):
            bar.set_color(plt.cm.viridis(probabilities[i] / max_prob))
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
    else:
        print(f"'{name_to_plot}' has no state distribution data")
else:
    print(f"'{name_to_plot}' not found in results")
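Both cells above repeat the same filter-and-sort step to find the most likely states. A small helper can hold that logic in one place. The sketch below is hypothetical (top_states is not part of instate) and works on a plain dict so that it runs without the package; the probabilities for 'sood' are the ones printed earlier, and the None entry is an invented placeholder for a missing value.

```python
import heapq
import math


def top_states(row, k=3, skip=("total_n",)):
    """Return the k highest-probability (state, prob) pairs from a mapping."""
    candidates = [
        (state, prob)
        for state, prob in row.items()
        if state not in skip          # drop non-state columns such as total_n
        and prob is not None          # drop missing values
        and not math.isnan(prob)
        and prob > 0                  # drop states where the name never occurs
    ]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


sood = {"total_n": 1.0, "punjab": 0.483, "delhi": 0.245, "missing_state": None}
print(top_states(sood, k=2))  # [('punjab', 0.483), ('delhi', 0.245)]
```

To use it on a DataFrame row, the name column would have to be dropped first, since the helper expects numeric values only.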
Get State Languages
The get_state_languages function maps states to their official languages.
[4]:
# Get languages for some specific states
states_to_check = ['Maharashtra', 'Punjab', 'Tamil Nadu', 'West Bengal', 'Gujarat']
print("State to Languages Mapping:")
print("="*40)
# Pass the list of states to get_state_languages
state_languages = instate.get_state_languages(states_to_check)
# Display the results
for i, row in state_languages.iterrows():
    state = row.iloc[0]  # First column is the state
    if len(row) > 1 and 'official_languages' in state_languages.columns:
        languages = row['official_languages']
        if pd.notna(languages):
            print(f"{state}: {languages}")
        else:
            print(f"{state}: No language data available")
    else:
        print(f"{state}: No language data available")
State to Languages Mapping:
========================================
Maharashtra: Marathi
Punjab: Punjabi
Tamil Nadu: Tamil
West Bengal: Bengali, English
Gujarat: Gujarati
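One way to connect the two lookups is to push a name's state probabilities through the state-to-language mapping to get a rough language weighting. The sketch below is an illustration under stated assumptions, not something the package provides: language_weights is a hypothetical helper, the state probabilities are invented, and splitting a state's weight equally among its official languages is a simplification chosen for this example. The language lists match the output above.

```python
from collections import defaultdict


def language_weights(state_probs, state_languages):
    """Spread each state's probability equally over its official languages."""
    weights = defaultdict(float)
    for state, prob in state_probs.items():
        languages = state_languages.get(state, [])
        for lang in languages:
            weights[lang] += prob / len(languages)
    return dict(weights)


# Official languages as printed above
state_languages = {"Punjab": ["Punjabi"], "West Bengal": ["Bengali", "English"]}
# Hypothetical state distribution for some name, invented for illustration
state_probs = {"Punjab": 0.6, "West Bengal": 0.4}

weights = language_weights(state_probs, state_languages)
for lang, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{lang}: {weight:.2f}")
```

States missing from the mapping contribute nothing, so the weights can sum to less than 1 when coverage is incomplete.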
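List Available States
See all states available in the electoral rolls dataset. The returned list also contains a total_n entry, which is not a state, so the 32 items below cover 31 states and union territories.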
[5]:
# Get all available states
available_states = instate.list_available_states()
print(f"Total states available: {len(available_states)}")
print("\nAvailable states:")
print("="*50)
# Print a numbered, alphabetically sorted list
for i, state in enumerate(sorted(available_states), 1):
    print(f"{i:2d}. {state}")
Total states available: 32
Available states:
==================================================
1. Andaman and Nicobar Islands
2. Andhra Pradesh
3. Arunachal Pradesh
4. Assam
5. Bihar
6. Chandigarh
7. Dadra and Nagar Haveli
8. Daman and Diu
9. Delhi
10. Goa
11. Gujarat
12. Haryana
13. Jammu and Kashmir and Ladakh
14. Jharkhand
15. Karnataka
16. Kerala
17. Madhya Pradesh
18. Maharashtra
19. Manipur
20. Meghalaya
21. Mizoram
22. Nagaland
23. Odisha
24. Puducherry
25. Punjab
26. Rajasthan
27. Sikkim
28. Telangana
29. Tripura
30. Uttar Pradesh
31. Uttarakhand
32. total_n
Neural Network Predictions
For names not in electoral rolls or for enhanced predictions, the package provides neural network models.
Predict States
The predict_state function uses GRU neural networks to predict likely states.
[6]:
# Predict states for our sample names
try:
    state_predictions = instate.predict_state(sample_names, top_k=3)
    print("Neural Network State Predictions:")
    print("="*50)
    for i, row in state_predictions.iterrows():
        name = row.iloc[0]  # First column is the name
        predictions = row['predicted_states']
        print(f"\n{name.upper()}:")
        for j, state in enumerate(predictions, 1):
            print(f" {j}. {state}")
except Exception as e:
    print(f"State prediction error: {e}")
    print("Note: Neural network models may require additional setup or trained weights.")
Downloading GRU model...
Downloading: 100%|█████████▉| 50816.0/50835.2 [00:01<00:00, 38175.25KB/s]
Neural Network State Predictions:
==================================================
SOOD:
1. Meghalaya
2. Chandigarh
3. Punjab
DHINGRA:
1. Daman and Diu
2. Andaman and Nicobar Islands
3. Puducherry
KUMAR:
1. Jammu and Kashmir and Ladakh
2. Punjab
3. Sikkim
PATEL:
1. Chandigarh
2. Sikkim
3. Uttarakhand
SINGH:
1. Jammu and Kashmir and Ladakh
2. Daman and Diu
3. Meghalaya
SHARMA:
1. Meghalaya
2. Sikkim
3. Andaman and Nicobar Islands
REDDY:
1. Puducherry
2. Telangana
3. Meghalaya
IYER:
1. Puducherry
2. Delhi
3. Telangana
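The top_k argument limits each name to its k highest-ranked states. As a rough picture of that step, a classifier produces one score per state, a softmax turns the scores into probabilities, and the k largest are kept. The sketch below shows only this generic ranking step; the scores, the label list, and the top_k_labels helper are invented for illustration and do not reflect the package's internals.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(labels, scores, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical model scores for one name
labels = ["Punjab", "Delhi", "Kerala", "Goa"]
scores = [2.0, 1.0, 0.5, -1.0]
top = top_k_labels(labels, scores, k=3)
print([label for label, _ in top])  # ['Punjab', 'Delhi', 'Kerala']
```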
Predict Languages
The predict_language function predicts likely languages using LSTM or KNN models.
[7]:
# Predict languages using different models
print("Language Prediction Examples:")
print("="*50)
# Try LSTM model first
try:
    print("\nTrying LSTM model...")
    language_predictions_lstm = instate.predict_language(sample_names, model='lstm', top_k=3)
    print("\nNeural Network Language Predictions (LSTM):")
    print("-" * 50)
    for i, row in language_predictions_lstm.iterrows():
        name = row.iloc[0]  # First column is the name
        pred_langs = row['predicted_languages']
        print(f"\n{name.upper()}:")
        if isinstance(pred_langs, list):
            for j, lang in enumerate(pred_langs[:3], 1):
                print(f" {j}. {lang}")
        else:
            print(f" 1. {pred_langs}")
except Exception as e:
    print(f"LSTM model not available: {e}")
    # Fall back to the KNN model only if the LSTM path failed
    print("\nTrying KNN model...")
    try:
        language_predictions_knn = instate.predict_language(sample_names, model='knn', top_k=3)
        print("\nLanguage Predictions (KNN):")
        print("-" * 50)
        for i, row in language_predictions_knn.iterrows():
            name = row.iloc[0]  # First column is the name
            pred_langs = row['predicted_languages']
            print(f"\n{name.upper()}:")
            if isinstance(pred_langs, list):
                for j, lang in enumerate(pred_langs[:3], 1):
                    print(f" {j}. {lang}")
            else:
                print(f" 1. {pred_langs}")
    except Exception as e2:
        print(f"KNN model also not available: {e2}")
        print("Note: Language prediction requires trained models to be available.")
Language Prediction Examples:
==================================================
Trying LSTM model...
Neural Network Language Predictions (LSTM):
--------------------------------------------------
SOOD:
1. hindi
2. punjabi
3. urdu
DHINGRA:
1. hindi
2. maithili
3. urdu
KUMAR:
1. telugu
2. urdu
3. chenchu
PATEL:
1. hindi
2. urdu
3. sindhi
SINGH:
1. hindi
2. urdu
3. kannada
SHARMA:
1. telugu
2. urdu
3. kannada
REDDY:
1. malayalam
2. urdu
3. chenchu
IYER:
1. telugu
2. urdu
3. chenchu
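The nested try/except in the cell above can be flattened into a loop that tries each model name in turn. The sketch below is one possible arrangement, not part of the package: predict_with_fallback is a hypothetical helper, and fake_predict stands in for instate.predict_language so the example runs without the package or its model weights.

```python
def predict_with_fallback(predict, names, models=("lstm", "knn"), **kwargs):
    """Call predict(names, model=..., **kwargs) with each model until one works."""
    errors = {}
    for model in models:
        try:
            return model, predict(names, model=model, **kwargs)
        except Exception as exc:  # broad on purpose: any failure moves to the next model
            errors[model] = exc
    raise RuntimeError(f"all models failed: {errors}")


def fake_predict(names, model, top_k=3):
    """Stand-in predictor: the 'lstm' model fails, the 'knn' model succeeds."""
    if model == "lstm":
        raise FileNotFoundError("weights missing")
    return {name: ["hindi"][:top_k] for name in names}


used, result = predict_with_fallback(fake_predict, ["sood"], top_k=1)
print(used, result)  # knn {'sood': ['hindi']}
```

With the real package, the first argument would be instate.predict_language, which the cell above calls with the same model and top_k keywords.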
Comparative Analysis
Let’s compare the electoral rolls data with neural network predictions for names found in both systems.
[8]:
print("Comparison: Electoral Rolls vs Neural Network Predictions")
print("="*65)
# Get name and state columns for easier access
name_col = state_dist.columns[0]
# Skip the name column and total_n, which is not a state
state_columns = [col for col in state_dist.columns if col not in (name_col, 'total_n')]
for name in sample_names:
    print(f"\n{name.upper()}:")
    # Electoral rolls top state
    name_row = state_dist[state_dist[name_col] == name]
    if not name_row.empty:
        row = name_row.iloc[0]
        # Find the state with highest probability
        max_prob = 0
        top_state = None
        for state_col in state_columns:
            prob = row[state_col]
            if pd.notna(prob) and prob > max_prob:
                max_prob = prob
                top_state = state_col.replace('_', ' ').title()
        if top_state:
            print(f" Electoral Rolls Top State: {top_state} ({max_prob:.3f})")
        else:
            print(" Electoral Rolls: Not found")
    else:
        print(" Electoral Rolls: Not found")
    # Neural network top state (if available)
    try:
        # Find the row for this name in the predictions
        name_rows = state_predictions[state_predictions.iloc[:, 0] == name]
        if not name_rows.empty:
            nn_top_state = name_rows.iloc[0]['predicted_states'][0]
            print(f" Neural Network Top State: {nn_top_state}")
        else:
            print(f" Neural Network: No prediction for {name}")
    except Exception:  # also covers a NameError if state_predictions was never created
        print(" Neural Network: Predictions not available")
Comparison: Electoral Rolls vs Neural Network Predictions
=================================================================
SOOD:
Electoral Rolls Top State: Punjab (0.483)
Neural Network Top State: Meghalaya
DHINGRA:
Electoral Rolls Top State: Delhi (0.996)
Neural Network Top State: Daman and Diu
KUMAR:
Electoral Rolls Top State: Delhi (0.529)
Neural Network Top State: Jammu and Kashmir and Ladakh
PATEL:
Electoral Rolls Top State: Uttar Pradesh (0.357)
Neural Network Top State: Chandigarh
SINGH:
Electoral Rolls Top State: Delhi (0.782)
Neural Network Top State: Jammu and Kashmir and Ladakh
SHARMA:
Electoral Rolls Top State: Delhi (0.772)
Neural Network Top State: Meghalaya
REDDY:
Electoral Rolls Top State: Andhra Pradesh (0.994)
Neural Network Top State: Puducherry
IYER:
Electoral Rolls Top State: Delhi (0.468)
Neural Network Top State: Puducherry
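A per-name listing is hard to summarize at a glance, so it can help to count how often the two approaches agree. The sketch below checks whether the top electoral rolls state for each name appears anywhere in the neural network's top-3 list. The agreement helper is hypothetical; the data is copied from the outputs earlier in this notebook, with lookup_top holding the highest-probability state per name once the total_n entry is ignored.

```python
# Highest-probability state per name from the electoral rolls output (total_n ignored)
lookup_top = {
    "sood": "Punjab", "dhingra": "Delhi", "kumar": "Delhi", "patel": "Uttar Pradesh",
    "singh": "Delhi", "sharma": "Delhi", "reddy": "Andhra Pradesh", "iyer": "Delhi",
}
# Top-3 neural network predictions per name, as printed by predict_state above
nn_top3 = {
    "sood": ["Meghalaya", "Chandigarh", "Punjab"],
    "dhingra": ["Daman and Diu", "Andaman and Nicobar Islands", "Puducherry"],
    "kumar": ["Jammu and Kashmir and Ladakh", "Punjab", "Sikkim"],
    "patel": ["Chandigarh", "Sikkim", "Uttarakhand"],
    "singh": ["Jammu and Kashmir and Ladakh", "Daman and Diu", "Meghalaya"],
    "sharma": ["Meghalaya", "Sikkim", "Andaman and Nicobar Islands"],
    "reddy": ["Puducherry", "Telangana", "Meghalaya"],
    "iyer": ["Puducherry", "Delhi", "Telangana"],
}


def agreement(lookup, predictions):
    """Return the names whose lookup state is among the predictions, and the hit rate."""
    hits = [name for name, state in lookup.items() if state in predictions.get(name, [])]
    return hits, len(hits) / len(lookup)


hits, rate = agreement(lookup_top, nn_top3)
print(hits, rate)  # ['sood', 'iyer'] 0.25
```

On these eight names the two approaches overlap for only two, which is worth keeping in mind when choosing between them.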
Summary
This notebook demonstrated the key features of the instate package:
Electoral Rolls Functions:
get_state_distribution(): Get probability distributions from electoral data
get_state_languages(): Map states to official languages
list_available_states(): See all available states in the dataset
Neural Network Functions:
predict_state(): GRU-based state prediction
predict_language(): LSTM/KNN-based language prediction
The package is useful for:
Demographic analysis
Geographic distribution studies
Language inference from names
Cultural and linguistic research
For more information, visit the GitHub repository.