API Reference
Main Functions
The instate package provides a clean 4-function API for predicting states and languages from names.
Electoral Rolls Lookups
- instate.get_state_distribution(names: DataFrame | list[str], name_column: str | None = None) -> DataFrame
Get P(state|lastname) from 2017 Indian electoral rolls.
Returns the empirical distribution of a lastname across Indian states, that is, the share of that lastname's electoral roll entries recorded in each state. Given only the observed frequencies, this is the Bayes-optimal estimate.
- Parameters:
names – DataFrame containing names or list of name strings. Names are automatically cleaned (lowercase, stripped).
name_column – If names is a DataFrame, the column containing names. If None and the DataFrame has a ‘name’ or ‘lastname’ column, that column is used.
- Returns:
DataFrame with original data plus 31 state probability columns. State columns are named by state (e.g., ‘delhi’, ‘punjab’). Values are proportions (0-1) representing P(state|lastname).
Examples
>>> import pandas as pd
>>> from instate import get_state_distribution
>>> names = ["dhingra", "sood", "gowda"]
>>> result = get_state_distribution(names)
>>> result[["name", "delhi", "punjab", "karnataka"]]

>>> df = pd.DataFrame({"lastname": ["dhingra", "sood"]})
>>> result = get_state_distribution(df, "lastname")
>>> result.columns[:5].tolist()
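The empirical distribution that get_state_distribution describes can be illustrated with a small standalone sketch. It is not the package's implementation: the helper names clean and empirical_state_distribution and the toy records are hypothetical, invented only to show how cleaned names and per-state counts turn into proportions that sum to 1.

```python
from collections import Counter, defaultdict


def clean(name: str) -> str:
    """Lowercase and strip a name, mirroring the cleaning the docs describe."""
    return name.strip().lower()


def empirical_state_distribution(records):
    """Return {lastname: {state: proportion}} from (lastname, state) pairs."""
    counts = defaultdict(Counter)
    for lastname, state in records:
        counts[clean(lastname)][state] += 1
    return {
        name: {state: n / sum(by_state.values()) for state, n in by_state.items()}
        for name, by_state in counts.items()
    }


# Toy records invented for illustration only (not electoral roll data).
toy_records = [
    (" Sood ", "punjab"),
    ("sood", "punjab"),
    ("SOOD", "delhi"),
    ("sood", "himachal pradesh"),
    ("gowda", "karnataka"),
]

dist = empirical_state_distribution(toy_records)
print(dist["sood"]["punjab"])  # 0.5 (2 of the 4 toy "sood" records)
```

Each inner dictionary plays the role of one row of state probability columns: values lie between 0 and 1 and sum to 1 for a given lastname.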
- instate.get_state_languages(states: DataFrame | list[str], state_column: str | None = None) -> DataFrame
Map Indian states to their official languages.
Based on census data, returns the official language(s) for each state.
- Parameters:
states – DataFrame containing states or list of state names.
state_column – If states is a DataFrame, the column containing state names.
- Returns:
DataFrame with state and official_languages columns. If the input was a DataFrame, an official_languages column is added to it.
Examples
>>> import pandas as pd
>>> from instate import get_state_languages
>>> states = ["Delhi", "Punjab", "Karnataka"]
>>> result = get_state_languages(states)
>>> result[["state", "official_languages"]]

>>> df = pd.DataFrame({"state_name": ["Delhi", "Punjab"]})
>>> result = get_state_languages(df, "state_name")
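A state-to-language mapping of this kind amounts to a table lookup. The sketch below is a hypothetical stand-in, not the package's code: the two-entry table, the lookup_languages helper, the case-insensitive matching, and the choice of a list for official_languages are all assumptions made for illustration.

```python
# Toy lookup table for illustration; the package ships its own mapping.
TOY_OFFICIAL_LANGUAGES = {
    "punjab": ["punjabi"],
    "karnataka": ["kannada"],
}


def lookup_languages(states):
    """Return one row per input state, with None when the state is unknown."""
    rows = []
    for state in states:
        key = state.strip().lower()  # assumed normalization, not documented
        rows.append(
            {"state": state, "official_languages": TOY_OFFICIAL_LANGUAGES.get(key)}
        )
    return rows


rows = lookup_languages(["Punjab", "Karnataka", "Atlantis"])
for row in rows:
    print(row)
```

Returning None for an unrecognized state keeps the output aligned row by row with the input, which matters when the result is attached back to a DataFrame.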
Neural Network Predictions
- instate.predict_state(names: DataFrame | list[str], name_column: str | None = None, top_k: int = 3, model: str = 'gru') -> DataFrame
Predict most likely Indian states for given names using neural network.
Uses a trained GRU model to predict which Indian states a person with the given lastname is most likely to be from. This is useful for names not found in the electoral rolls data.
- Parameters:
names – DataFrame containing names or list of name strings. Names are automatically cleaned (lowercase, stripped).
name_column – If names is a DataFrame, the column containing names.
top_k – Number of top states to return (default: 3).
model – Model to use for prediction. Currently only “gru” is supported.
- Returns:
DataFrame with name and predicted_states columns. predicted_states contains a list of top_k state names.
Examples
>>> import pandas as pd
>>> from instate import predict_state
>>> names = ["dhingra", "sood", "gowda"]
>>> result = predict_state(names, top_k=3)
>>> result["predicted_states"][0]
['Delhi', 'Punjab', 'Haryana']

>>> df = pd.DataFrame({"lastname": ["sharma", "patel"]})
>>> result = predict_state(df, "lastname", top_k=2)
>>> len(result["predicted_states"][0])
2
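The top_k parameter can be read as "keep the k highest-scoring labels". A minimal sketch of that selection step follows, assuming the model yields one score per state; the top_k_labels helper and the toy scores are hypothetical and say nothing about how the GRU model itself is run.

```python
def top_k_labels(scores: dict[str, float], k: int = 3) -> list[str]:
    """Return the k labels with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


# Made-up scores for illustration only.
toy_scores = {"Delhi": 0.41, "Punjab": 0.33, "Haryana": 0.12, "Karnataka": 0.02}

print(top_k_labels(toy_scores, k=2))  # ['Delhi', 'Punjab']
```

Asking for more labels than there are scores simply returns all of them in ranked order.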
- instate.predict_language(names: DataFrame | list[str], name_column: str | None = None, top_k: int = 3, model: str = 'lstm') -> DataFrame
Predict most likely languages for given names.
Two methods are available:
- “lstm”: Neural network prediction using a trained LSTM model
- “knn”: K-nearest neighbor lookup in a language database
- Parameters:
names – DataFrame containing names or list of name strings.
name_column – If names is a DataFrame, the column containing names.
top_k – Number of top languages to return (default: 3). Note: KNN method returns only the single best match.
model – Prediction method: “lstm” (neural) or “knn” (lookup).
- Returns:
DataFrame with name and predicted_languages columns. For LSTM, predicted_languages contains a list of top_k languages. For KNN, predicted_languages contains the single best language.
Examples
>>> import pandas as pd
>>> from instate import predict_language
>>> names = ["sood", "chintalapati"]
>>> result = predict_language(names, model="lstm")
>>> result["predicted_languages"][0]
['hindi', 'punjabi', 'urdu']
>>> result_knn = predict_language(names, model="knn")
>>> result_knn["predicted_languages"][0]
'hindi'

>>> df = pd.DataFrame({"name": ["patel", "sharma"]})
>>> result = predict_language(df, "name", model="lstm", top_k=2)
>>> len(result["predicted_languages"][0])
2
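Because the LSTM method returns a list per name and the KNN method returns a single string, downstream code may want one uniform shape. The helper below is a hypothetical convenience, not part of instate; it only wraps a lone string in a list.

```python
def as_language_list(value):
    """Wrap a single language string in a list; copy any other iterable to a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


print(as_language_list("hindi"))               # ['hindi']
print(as_language_list(["hindi", "punjabi"]))  # ['hindi', 'punjabi']
```

With a real result, something like result["predicted_languages"].map(as_language_list) would be one way to apply it, assuming the column holds the values described above.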
Module Reference
electoral module
Electoral rolls based name-to-state lookup.
Functions for looking up state distributions from 2017 Indian electoral rolls data.
- instate.electoral.get_state_distribution(names: DataFrame | list[str], name_column: str | None = None) -> DataFrame
Get P(state|lastname) from 2017 Indian electoral rolls.
Returns the empirical distribution of a lastname across Indian states, that is, the share of that lastname's electoral roll entries recorded in each state. Given only the observed frequencies, this is the Bayes-optimal estimate.
- Parameters:
names – DataFrame containing names or list of name strings. Names are automatically cleaned (lowercase, stripped).
name_column – If names is a DataFrame, the column containing names. If None and the DataFrame has a ‘name’ or ‘lastname’ column, that column is used.
- Returns:
DataFrame with original data plus 31 state probability columns. State columns are named by state (e.g., ‘delhi’, ‘punjab’). Values are proportions (0-1) representing P(state|lastname).
Examples
>>> import pandas as pd
>>> from instate.electoral import get_state_distribution
>>> names = ["dhingra", "sood", "gowda"]
>>> result = get_state_distribution(names)
>>> result[["name", "delhi", "punjab", "karnataka"]]

>>> df = pd.DataFrame({"lastname": ["dhingra", "sood"]})
>>> result = get_state_distribution(df, "lastname")
>>> result.columns[:5].tolist()
- instate.electoral.get_state_languages(states: DataFrame | list[str], state_column: str | None = None) -> DataFrame
Map Indian states to their official languages.
Based on census data, returns the official language(s) for each state.
- Parameters:
states – DataFrame containing states or list of state names.
state_column – If states is a DataFrame, the column containing state names.
- Returns:
DataFrame with state and official_languages columns. If the input was a DataFrame, an official_languages column is added to it.
Examples
>>> import pandas as pd
>>> from instate.electoral import get_state_languages
>>> states = ["Delhi", "Punjab", "Karnataka"]
>>> result = get_state_languages(states)
>>> result[["state", "official_languages"]]

>>> df = pd.DataFrame({"state_name": ["Delhi", "Punjab"]})
>>> result = get_state_languages(df, "state_name")
predict module
Neural network predictions for names not in electoral rolls.
Functions for predicting states and languages using trained models.
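Since the predict module targets names not found in the electoral rolls, a common pattern is to try the lookup first and fall back to a model. The sketch below shows only that control flow with stand-ins: states_for, toy_table, and toy_model are hypothetical, and this reference does not state how get_state_distribution reports names missing from the rolls, so a plain dictionary is assumed here.

```python
def states_for(name, lookup_table, fallback_predict):
    """Use the lookup table when the cleaned name is present, else the fallback."""
    key = name.strip().lower()
    if key in lookup_table:
        return {"name": key, "source": "electoral", "states": lookup_table[key]}
    return {"name": key, "source": "model", "states": fallback_predict(key)}


# Stand-ins invented for illustration; neither reflects real data or a real model.
toy_table = {"sood": ["punjab", "delhi"]}


def toy_model(name):
    return ["unknown"]


print(states_for(" Sood ", toy_table, toy_model)["source"])  # electoral
print(states_for("zzz", toy_table, toy_model)["source"])     # model
```

Recording the source alongside the answer makes it easy to tell observed frequencies apart from model guesses later on.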
- instate.predict.predict_language(names: DataFrame | list[str], name_column: str | None = None, top_k: int = 3, model: str = 'lstm') -> DataFrame
Predict most likely languages for given names.
Two methods are available:
- “lstm”: Neural network prediction using a trained LSTM model
- “knn”: K-nearest neighbor lookup in a language database
- Parameters:
names – DataFrame containing names or list of name strings.
name_column – If names is a DataFrame, the column containing names.
top_k – Number of top languages to return (default: 3). Note: KNN method returns only the single best match.
model – Prediction method: “lstm” (neural) or “knn” (lookup).
- Returns:
DataFrame with name and predicted_languages columns. For LSTM, predicted_languages contains a list of top_k languages. For KNN, predicted_languages contains the single best language.
Examples
>>> import pandas as pd
>>> from instate.predict import predict_language
>>> names = ["sood", "chintalapati"]
>>> result = predict_language(names, model="lstm")
>>> result["predicted_languages"][0]
['hindi', 'punjabi', 'urdu']
>>> result_knn = predict_language(names, model="knn")
>>> result_knn["predicted_languages"][0]
'hindi'

>>> df = pd.DataFrame({"name": ["patel", "sharma"]})
>>> result = predict_language(df, "name", model="lstm", top_k=2)
>>> len(result["predicted_languages"][0])
2
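To give a feel for what a nearest-neighbor lookup that returns a single best match might look like, here is a rough sketch using the standard library's difflib. This reference does not say which distance or database the “knn” method uses, so the string-similarity measure, the nearest_language helper, and the three-entry toy table are all assumptions for illustration.

```python
import difflib

# Toy name-to-language table invented for illustration only.
TOY_NAME_LANGUAGE = {
    "sood": "hindi",
    "gowda": "kannada",
    "chintalapati": "telugu",
}


def nearest_language(name, table=TOY_NAME_LANGUAGE):
    """Return the language of the most similar known name, or None if the table is empty."""
    key = name.strip().lower()
    # cutoff=0.0 means the closest entry is always returned, however dissimilar.
    matches = difflib.get_close_matches(key, list(table), n=1, cutoff=0.0)
    return table[matches[0]] if matches else None


print(nearest_language("gouda"))  # kannada (closest toy entry is "gowda")
```

A single nearest neighbor yields one label, which is consistent with the note that the KNN method ignores top_k and returns only the best match.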
- instate.predict.predict_state(names: DataFrame | list[str], name_column: str | None = None, top_k: int = 3, model: str = 'gru') -> DataFrame
Predict most likely Indian states for given names using neural network.
Uses a trained GRU model to predict which Indian states a person with the given lastname is most likely to be from. This is useful for names not found in the electoral rolls data.
- Parameters:
names – DataFrame containing names or list of name strings. Names are automatically cleaned (lowercase, stripped).
name_column – If names is a DataFrame, the column containing names.
top_k – Number of top states to return (default: 3).
model – Model to use for prediction. Currently only “gru” is supported.
- Returns:
DataFrame with name and predicted_states columns. predicted_states contains a list of top_k state names.
Examples
>>> import pandas as pd
>>> from instate.predict import predict_state
>>> names = ["dhingra", "sood", "gowda"]
>>> result = predict_state(names, top_k=3)
>>> result["predicted_states"][0]
['Delhi', 'Punjab', 'Haryana']

>>> df = pd.DataFrame({"lastname": ["sharma", "patel"]})
>>> result = predict_state(df, "lastname", top_k=2)
>>> len(result["predicted_states"][0])
2