API Reference

Main Functions

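The instate package provides a small function-based API for looking up and predicting states and languages from names: two lookup functions and a listing helper backed by electoral rolls data, plus two neural network predictors.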

Electoral Rolls Lookups

instate.get_state_distribution(names: DataFrame | list[str], name_column: str | None = None) -> DataFrame

Get P(state|lastname) from 2017 Indian electoral rolls.

This returns the empirical distribution of a lastname across Indian states based on the electoral rolls data. This is the Bayes optimal estimate given the observed frequencies.
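
As a rough illustration of what an empirical P(state|lastname) looks like, the sketch below computes it from a handful of made-up (lastname, state) pairs using only the standard library. The records and the helper name state_distribution are hypothetical and are not part of instate; the package draws its proportions from the electoral rolls data instead.

```python
from collections import Counter, defaultdict

# Hypothetical toy records of (lastname, state); not real electoral roll data.
records = [
    ("sood", "punjab"),
    ("sood", "punjab"),
    ("sood", "delhi"),
    ("sood", "himachal pradesh"),
    ("gowda", "karnataka"),
]


def state_distribution(records):
    """Return {lastname: {state: proportion}} from (lastname, state) pairs."""
    counts = defaultdict(Counter)
    for lastname, state in records:
        counts[lastname][state] += 1

    dist = {}
    for lastname, state_counts in counts.items():
        total = sum(state_counts.values())
        # Each proportion is count(lastname, state) / count(lastname).
        dist[lastname] = {s: n / total for s, n in state_counts.items()}
    return dist


dist = state_distribution(records)
print(dist["sood"]["punjab"])  # 0.5 (2 of the 4 "sood" records)
```

The proportions for each lastname sum to 1, which matches the 0-1 values described under Returns.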

Parameters:
  • names – DataFrame containing names or list of name strings. Names are automatically cleaned (lowercase, stripped).

  • name_column – If names is a DataFrame, the column containing names. If None and the DataFrame has a ‘name’ or ‘lastname’ column, that column is used.
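
The two parameter behaviours above (name cleaning and default column selection) can be sketched in plain Python. Both helpers are hypothetical; in particular, preferring ‘name’ over ‘lastname’ when both exist and raising ValueError when neither exists are assumptions, since the reference does not say what the package does in those cases.

```python
def clean_name(name):
    """Lowercase a name and strip surrounding whitespace."""
    return name.strip().lower()


def resolve_name_column(columns, name_column=None):
    """Pick the column that holds names, preferring an explicit choice."""
    if name_column is not None:
        if name_column not in columns:
            raise ValueError(f"column {name_column!r} not found")
        return name_column
    # Assumed order of preference: 'name' first, then 'lastname'.
    for candidate in ("name", "lastname"):
        if candidate in columns:
            return candidate
    raise ValueError("no 'name' or 'lastname' column; pass name_column")


print(clean_name("  Dhingra "))                   # dhingra
print(resolve_name_column(["id", "lastname"]))    # lastname
```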

Returns:

DataFrame with original data plus 31 state probability columns. State columns are named by state (e.g., ‘delhi’, ‘punjab’). Values are proportions (0-1) representing P(state|lastname).

Examples

>>> import pandas as pd
>>> from instate import get_state_distribution
>>> names = ["dhingra", "sood", "gowda"]
>>> result = get_state_distribution(names)
>>> result[["name", "delhi", "punjab", "karnataka"]]
>>> df = pd.DataFrame({"lastname": ["dhingra", "sood"]})
>>> result = get_state_distribution(df, "lastname")
>>> result.columns[:5].tolist()

instate.get_state_languages(states: DataFrame | list[str], state_column: str | None = None) -> DataFrame

Map Indian states to their official languages.

Based on census data, returns the official language(s) for each state.
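
Conceptually this is a table lookup keyed by state. The sketch below shows the idea with a two-entry illustrative table. The table contents, the normalisation of state names to lowercase, the use of None for unknown states, and the helper name state_languages are all assumptions rather than the package's actual data or behaviour.

```python
# Illustrative two-entry table; the package's own table and spellings may differ.
OFFICIAL_LANGUAGES = {
    "punjab": ["punjabi"],
    "karnataka": ["kannada"],
}


def state_languages(states):
    """Return one {'state', 'official_languages'} record per input state."""
    rows = []
    for state in states:
        key = state.strip().lower()  # assumed normalisation
        rows.append(
            {"state": state, "official_languages": OFFICIAL_LANGUAGES.get(key)}
        )
    return rows


rows = state_languages(["Punjab", "Goa"])
print(rows[0])  # {'state': 'Punjab', 'official_languages': ['punjabi']}
```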

Parameters:
  • states – DataFrame containing states or list of state names.

  • state_column – If states is a DataFrame, the column containing state names.

Returns:

DataFrame with state and official_languages columns. If the input is a DataFrame, an official_languages column is added to it.

Examples

>>> import pandas as pd
>>> from instate import get_state_languages
>>> states = ["Delhi", "Punjab", "Karnataka"]
>>> result = get_state_languages(states)
>>> result[["state", "official_languages"]]
>>> df = pd.DataFrame({"state_name": ["Delhi", "Punjab"]})
>>> result = get_state_languages(df, "state_name")

instate.list_available_states() -> list[str]

List all states available in the electoral rolls dataset.

Returns:

List of state names available in the data.

Examples

>>> from instate import list_available_states
>>> states = list_available_states()
>>> len(states)
31
>>> "Delhi" in states
True

Neural Network Predictions

instate.predict_state(names: DataFrame | list[str], name_column: str | None = None, top_k: int = 3, model: str = 'gru') -> DataFrame

Predict most likely Indian states for given names using neural network.

Uses a trained GRU model to predict which Indian states a person with the given lastname is most likely to be from. This is useful for names not found in the electoral rolls data.

Parameters:
  • names – DataFrame containing names or list of name strings. Names are automatically cleaned (lowercase, stripped).

  • name_column – If names is a DataFrame, the column containing names.

  • top_k – Number of top states to return (default: 3).

  • model – Model to use for prediction. Currently only “gru” is supported.

Returns:

DataFrame with name and predicted_states columns. predicted_states contains a list of top_k state names.
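
To make the top_k behaviour concrete, the sketch below selects the k highest-scoring labels from a dictionary of scores. The scores are made up and the helper top_k_labels is hypothetical; how the GRU model produces and ranks its scores internally is not documented here.

```python
import heapq


def top_k_labels(scores, k=3):
    """Return the k labels with the highest scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


# Made-up scores for illustration only; not output from the package's model.
scores = {"Delhi": 0.41, "Punjab": 0.33, "Haryana": 0.12, "Kerala": 0.02}
print(top_k_labels(scores, k=2))  # ['Delhi', 'Punjab']
```

Each entry of predicted_states would then be a list of length top_k, as in the examples that follow.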

Examples

>>> import pandas as pd
>>> from instate import predict_state
>>> names = ["dhingra", "sood", "gowda"]
>>> result = predict_state(names, top_k=3)
>>> result["predicted_states"][0]
['Delhi', 'Punjab', 'Haryana']
>>> df = pd.DataFrame({"lastname": ["sharma", "patel"]})
>>> result = predict_state(df, "lastname", top_k=2)
>>> len(result["predicted_states"][0])
2

instate.predict_language(names: DataFrame | list[str], name_column: str | None = None, top_k: int = 3, model: str = 'lstm') -> DataFrame

Predict most likely languages for given names.

Two methods are available:

  • “lstm” – Neural network prediction using a trained LSTM model.

  • “knn” – K-nearest-neighbor lookup in the language database.

Parameters:
  • names – DataFrame containing names or list of name strings.

  • name_column – If names is a DataFrame, the column containing names.

  • top_k – Number of top languages to return (default: 3). Note: KNN method returns only the single best match.

  • model – Prediction method - “lstm” (neural) or “knn” (lookup).

Returns:

DataFrame with name and predicted_languages columns. For LSTM, predicted_languages contains a list of top_k languages. For KNN, it contains the single best language as a string.
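
The single-result behaviour of the “knn” method can be pictured as a nearest-neighbor lookup by string similarity. The sketch below uses difflib.SequenceMatcher from the standard library as the similarity measure. That choice, the three-entry table, and the helper nearest_language are assumptions for illustration; the distance metric and database used by the package are not described in this reference.

```python
from difflib import SequenceMatcher

# Hypothetical name-to-language table; not the package's language database.
NAME_LANGUAGE = {
    "sood": "hindi",
    "chintalapati": "telugu",
    "gowda": "kannada",
}


def nearest_language(name, table=NAME_LANGUAGE):
    """Return the language of the known name most similar to `name`."""
    key = name.strip().lower()
    best = max(
        table,
        key=lambda known: SequenceMatcher(None, key, known).ratio(),
    )
    return table[best]  # a single string, not a list


print(nearest_language("Sood "))   # hindi (exact match after cleaning)
print(nearest_language("gowdaa"))  # kannada (closest known name is "gowda")
```

Because only the closest entry is returned, a top_k argument has no effect in a lookup of this kind, which is consistent with the note on top_k above.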

Examples

>>> import pandas as pd
>>> from instate import predict_language
>>> names = ["sood", "chintalapati"]
>>> result = predict_language(names, model="lstm")
>>> result["predicted_languages"][0]
['hindi', 'punjabi', 'urdu']
>>> result_knn = predict_language(names, model="knn")
>>> result_knn["predicted_languages"][0]
'hindi'
>>> df = pd.DataFrame({"name": ["patel", "sharma"]})
>>> result = predict_language(df, "name", model="lstm", top_k=2)
>>> len(result["predicted_languages"][0])
2

Module Reference

electoral module

Electoral rolls based name-to-state lookup.

Functions for looking up state distributions from 2017 Indian electoral rolls data.

instate.electoral.get_state_distribution(names: DataFrame | list[str], name_column: str | None = None) -> DataFrame

Get P(state|lastname) from 2017 Indian electoral rolls.

This returns the empirical distribution of a lastname across Indian states based on the electoral rolls data. This is the Bayes optimal estimate given the observed frequencies.

Parameters:
  • names – DataFrame containing names or list of name strings. Names are automatically cleaned (lowercase, stripped).

  • name_column – If names is a DataFrame, the column containing names. If None and the DataFrame has a ‘name’ or ‘lastname’ column, that column is used.

Returns:

DataFrame with original data plus 31 state probability columns. State columns are named by state (e.g., ‘delhi’, ‘punjab’). Values are proportions (0-1) representing P(state|lastname).

Examples

>>> import pandas as pd
>>> from instate.electoral import get_state_distribution
>>> names = ["dhingra", "sood", "gowda"]
>>> result = get_state_distribution(names)
>>> result[["name", "delhi", "punjab", "karnataka"]]
>>> df = pd.DataFrame({"lastname": ["dhingra", "sood"]})
>>> result = get_state_distribution(df, "lastname")
>>> result.columns[:5].tolist()

instate.electoral.get_state_languages(states: DataFrame | list[str], state_column: str | None = None) -> DataFrame

Map Indian states to their official languages.

Based on census data, returns the official language(s) for each state.

Parameters:
  • states – DataFrame containing states or list of state names.

  • state_column – If states is a DataFrame, the column containing state names.

Returns:

DataFrame with state and official_languages columns. If the input is a DataFrame, an official_languages column is added to it.

Examples

>>> import pandas as pd
>>> from instate.electoral import get_state_languages
>>> states = ["Delhi", "Punjab", "Karnataka"]
>>> result = get_state_languages(states)
>>> result[["state", "official_languages"]]
>>> df = pd.DataFrame({"state_name": ["Delhi", "Punjab"]})
>>> result = get_state_languages(df, "state_name")

instate.electoral.list_available_states() -> list[str]

List all states available in the electoral rolls dataset.

Returns:

List of state names available in the data.

Examples

>>> from instate.electoral import list_available_states
>>> states = list_available_states()
>>> len(states)
31
>>> "Delhi" in states
True

predict module

Neural network predictions for names not in electoral rolls.

Functions for predicting states and languages using trained models.
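
Since the predictors are aimed at names missing from the electoral rolls, one natural pattern is to try the lookup first and fall back to prediction. The sketch below shows that control flow with stand-in callables. The helper lookup_or_predict, the toy data, and the convention that a missing name yields None are hypothetical; how get_state_distribution actually reports a name it does not contain is not stated in this reference.

```python
def lookup_or_predict(names, lookup, predict):
    """Use lookup(name) when it has a result, otherwise predict(name)."""
    results = {}
    for name in names:
        found = lookup(name)
        if found is not None:
            results[name] = ("electoral", found)
        else:
            results[name] = ("predicted", predict(name))
    return results


# Hypothetical stand-ins for the lookup table and the model.
known = {"sood": {"punjab": 0.5, "delhi": 0.25}}
out = lookup_or_predict(
    ["sood", "zzyzx"],
    lookup=known.get,              # returns None for unknown names
    predict=lambda name: ["Delhi"],
)
print(out["zzyzx"])  # ('predicted', ['Delhi'])
```

Tagging each result with its source keeps observed proportions distinguishable from model predictions downstream.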

instate.predict.predict_language(names: DataFrame | list[str], name_column: str | None = None, top_k: int = 3, model: str = 'lstm') -> DataFrame

Predict most likely languages for given names.

Two methods are available:

  • “lstm” – Neural network prediction using a trained LSTM model.

  • “knn” – K-nearest-neighbor lookup in the language database.

Parameters:
  • names – DataFrame containing names or list of name strings.

  • name_column – If names is a DataFrame, the column containing names.

  • top_k – Number of top languages to return (default: 3). Note: KNN method returns only the single best match.

  • model – Prediction method - “lstm” (neural) or “knn” (lookup).

Returns:

DataFrame with name and predicted_languages columns. For LSTM, predicted_languages contains a list of top_k languages. For KNN, it contains the single best language as a string.

Examples

>>> import pandas as pd
>>> from instate.predict import predict_language
>>> names = ["sood", "chintalapati"]
>>> result = predict_language(names, model="lstm")
>>> result["predicted_languages"][0]
['hindi', 'punjabi', 'urdu']
>>> result_knn = predict_language(names, model="knn")
>>> result_knn["predicted_languages"][0]
'hindi'
>>> df = pd.DataFrame({"name": ["patel", "sharma"]})
>>> result = predict_language(df, "name", model="lstm", top_k=2)
>>> len(result["predicted_languages"][0])
2

instate.predict.predict_state(names: DataFrame | list[str], name_column: str | None = None, top_k: int = 3, model: str = 'gru') -> DataFrame

Predict most likely Indian states for given names using neural network.

Uses a trained GRU model to predict which Indian states a person with the given lastname is most likely to be from. This is useful for names not found in the electoral rolls data.

Parameters:
  • names – DataFrame containing names or list of name strings. Names are automatically cleaned (lowercase, stripped).

  • name_column – If names is a DataFrame, the column containing names.

  • top_k – Number of top states to return (default: 3).

  • model – Model to use for prediction. Currently only “gru” is supported.

Returns:

DataFrame with name and predicted_states columns. predicted_states contains a list of top_k state names.

Examples

>>> import pandas as pd
>>> from instate.predict import predict_state
>>> names = ["dhingra", "sood", "gowda"]
>>> result = predict_state(names, top_k=3)
>>> result["predicted_states"][0]
['Delhi', 'Punjab', 'Haryana']
>>> df = pd.DataFrame({"lastname": ["sharma", "patel"]})
>>> result = predict_state(df, "lastname", top_k=2)
>>> len(result["predicted_states"][0])
2