API Reference¶
This page contains the complete API documentation for naampy.
Main Functions¶
These are the two primary functions you’ll use with naampy:
- in_rolls_fn_gender(df, namecol, state=None, year=None, dataset='v2_1k')¶
Predict gender from Indian first names using Electoral Roll statistics.
This function enriches the input DataFrame with gender statistics from the Indian Electoral Rolls database. For names not found in the database, it automatically falls back to machine learning predictions (except with the v2_native dataset).
- Parameters:
df (pd.DataFrame) – Input DataFrame containing the first name column.
namecol (str) – Name of the column containing first names to analyze.
state (str, optional) – Specific Indian state to use for analysis. Available states: andaman, andhra, arunachal, assam, bihar, chandigarh, dadra, daman, delhi, goa, gujarat, haryana, himachal, jharkhand, jk, karnataka, kerala, maharashtra, manipur, meghalaya, mizoram, mp, nagaland, odisha, puducherry, punjab, rajasthan, sikkim, tripura, up, uttarakhand. Defaults to None (all states).
year (int, optional) – Specific birth year to filter data by. Defaults to None (all years).
dataset (str, optional) – Dataset version to use. Options:
- ‘v1’: 12 states dataset
- ‘v2’: Full 30 states dataset
- ‘v2_1k’: 1000+ occurrences dataset (default, good balance)
- ‘v2_native’: Native language dataset (no ML fallback)
- ‘v2_en’: English transliteration dataset
- Returns:
- Enhanced DataFrame with additional columns:
n_female (float): Count of females with this name
n_male (float): Count of males with this name
n_third_gender (float): Count of third gender individuals
prop_female (float): Proportion female (0.0 to 1.0)
prop_male (float): Proportion male (0.0 to 1.0)
prop_third_gender (float): Proportion third gender (0.0 to 1.0)
pred_gender (str): ML prediction for names not in database
pred_prob (float): ML prediction confidence score
- Return type:
pd.DataFrame
Note
Names are automatically cleaned (stripped and lowercased)
For names not in electoral data, ML predictions are added
Data is cached after first download for faster subsequent use
Third gender category reflects Indian electoral roll classifications
Example
>>> import pandas as pd
>>> from naampy import in_rolls_fn_gender
>>> df = pd.DataFrame({'name': ['Priya', 'Rahul', 'Anjali']})
>>> result = in_rolls_fn_gender(df, 'name')
>>> print(result[['name', 'prop_female', 'prop_male']].head())
     name  prop_female  prop_male
0   priya        0.994      0.006
1   rahul        0.008      0.992
2  anjali        0.989      0.011
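The count and proportion columns listed above are closely related. As a minimal sketch, assuming each proportion is the corresponding count divided by the sum of the three counts (an assumption about how the columns are derived, not something taken from the naampy source), the hypothetical helper `proportions` reproduces that arithmetic in plain Python:

```python
def proportions(n_female, n_male, n_third_gender=0.0):
    """Return (prop_female, prop_male, prop_third_gender) from raw counts.

    Assumption: each proportion is count / total. This is an illustration,
    not naampy's own code.
    """
    total = n_female + n_male + n_third_gender
    if total == 0:
        # No observations for this name, so the proportions are undefined.
        return (None, None, None)
    return (n_female / total, n_male / total, n_third_gender / total)


# Illustrative counts only, not real electoral roll figures.
prop_f, prop_m, prop_t = proportions(n_female=994.0, n_male=6.0)
print(round(prop_f, 3), round(prop_m, 3), round(prop_t, 3))
```

Under this assumption the three proportions sum to 1.0 for any name with at least one record.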
- predict_fn_gender(first_names)¶
Predict gender using a neural network model based on character patterns in names.
This method uses a character-level neural network trained on Indian names to predict gender when names are not found in the electoral roll database. The model learns patterns in character sequences to make predictions.
- Parameters:
first_names (list[str]) – List of first names to predict gender for. Names are automatically converted to lowercase.
- Returns:
- DataFrame containing:
name (str): Input first name (lowercased)
pred_gender (str): Predicted gender (‘male’ or ‘female’)
pred_prob (float): Confidence score for the prediction (0.0 to 1.0)
- Return type:
pd.DataFrame
Note
Names are classified as ‘female’ if predicted probability > 0.5
Names are classified as ‘male’ if predicted probability ≤ 0.5
The model handles character sequences up to 24 characters
Model accuracy: RMSE of 0.22 on test data
Example
>>> names = ['Priya', 'Rahul', 'Unknown_Name']
>>> result = InRollsFnData.predict_fn_gender(names)
>>> print(result)
           name pred_gender  pred_prob
0         priya      female      0.945
1         rahul        male      0.876
2  unknown_name      female      0.623
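The classification rule in the note above can be sketched without the model itself. The snippet assumes the network emits a single probability of 'female' and that pred_prob is the probability of whichever class was chosen, which the example output (rahul, male, 0.876) suggests but this page does not state. The function names and the truncation behaviour for long names are hypothetical:

```python
MAX_NAME_LENGTH = 24  # sequence length limit quoted in the note above


def label_from_probability(p_female):
    """Map a model output in [0, 1] to (pred_gender, pred_prob).

    Assumption: p_female is the probability of 'female', and pred_prob is
    the probability of the class that was chosen.
    """
    if p_female > 0.5:
        return ("female", p_female)
    return ("male", 1.0 - p_female)


def prepare_name(name):
    """Lowercase a name and keep at most MAX_NAME_LENGTH characters.

    Assumption: longer names are truncated; the real model may treat
    them differently.
    """
    return name.lower()[:MAX_NAME_LENGTH]


print(label_from_probability(0.945))
print(label_from_probability(0.5))
print(prepare_name("Priya"))
```

Note that a probability of exactly 0.5 falls on the 'male' side of the rule, as stated in the note.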
Core Classes¶
- class InRollsFnData[source]¶
Bases: object
Main class for handling Indian Electoral Roll data and gender prediction.
This class provides methods to predict gender based on Indian first names using two approaches:
1. Statistical data from Indian Electoral Rolls (31 states and union territories)
2. Machine learning model for names not found in the electoral data
The class maintains cached data and models for efficient repeated predictions.
- static load_naampy_data(dataset)[source]¶
Download and cache the naampy dataset from Harvard Dataverse.
This method downloads the specified dataset version if not already cached locally. Subsequent calls will use the cached version for faster performance.
- Parameters:
dataset (str) – Version of the dataset to load. Options are:
- ‘v1’: 12 states dataset
- ‘v2’: Full 30 states dataset
- ‘v2_1k’: 30 states with 1000+ name occurrences (default)
- ‘v2_native’: Native language dataset (16 states)
- ‘v2_en’: English transliteration of v2_native
- Returns:
Local file path to the downloaded/cached dataset
- Return type:
str
- Raises:
Exception – If the dataset download fails
Example
>>> path = InRollsFnData.load_naampy_data('v2_1k')
>>> print(f"Data cached at: {path}")
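The download-once, reuse-afterwards behaviour described above follows a common caching pattern. The sketch below illustrates that pattern only; it is not naampy's implementation. `cached_path`, `fake_fetch`, and the file name `v2_1k.csv.gz` are hypothetical, and the real download step is replaced by a stand-in that writes a placeholder file:

```python
import os
import tempfile


def cached_path(cache_dir, filename, fetch):
    """Return the local path for filename, calling fetch(path) only if missing."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        fetch(path)  # first call: write the file to disk
    return path


calls = []


def fake_fetch(path):
    # Stand-in for a real download; writes a tiny placeholder file.
    calls.append(path)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("placeholder")


with tempfile.TemporaryDirectory() as tmp:
    first = cached_path(tmp, "v2_1k.csv.gz", fake_fetch)
    second = cached_path(tmp, "v2_1k.csv.gz", fake_fetch)
    # Same path both times, and the fetch ran only once.
    print(first == second, len(calls))
```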
- classmethod predict_fn_gender(first_names)[source]¶
Predict gender using a neural network model based on character patterns in names.
This method uses a character-level neural network trained on Indian names to predict gender when names are not found in the electoral roll database. The model learns patterns in character sequences to make predictions.
- Parameters:
first_names (list[str]) – List of first names to predict gender for. Names are automatically converted to lowercase.
- Returns:
- DataFrame containing:
name (str): Input first name (lowercased)
pred_gender (str): Predicted gender (‘male’ or ‘female’)
pred_prob (float): Confidence score for the prediction (0.0 to 1.0)
- Return type:
pd.DataFrame
Note
Names are classified as ‘female’ if predicted probability > 0.5
Names are classified as ‘male’ if predicted probability ≤ 0.5
The model handles character sequences up to 24 characters
Model accuracy: RMSE of 0.22 on test data
Example
>>> names = ['Priya', 'Rahul', 'Unknown_Name']
>>> result = InRollsFnData.predict_fn_gender(names)
>>> print(result)
           name pred_gender  pred_prob
0         priya      female      0.945
1         rahul        male      0.876
2  unknown_name      female      0.623
- classmethod in_rolls_fn_gender(df, namecol, state=None, year=None, dataset='v2_1k')[source]¶
Predict gender from Indian first names using Electoral Roll statistics.
This function enriches the input DataFrame with gender statistics from the Indian Electoral Rolls database. For names not found in the database, it automatically falls back to machine learning predictions (except with the v2_native dataset).
- Parameters:
df (pd.DataFrame) – Input DataFrame containing the first name column.
namecol (str) – Name of the column containing first names to analyze.
state (str, optional) – Specific Indian state to use for analysis. Available states: andaman, andhra, arunachal, assam, bihar, chandigarh, dadra, daman, delhi, goa, gujarat, haryana, himachal, jharkhand, jk, karnataka, kerala, maharashtra, manipur, meghalaya, mizoram, mp, nagaland, odisha, puducherry, punjab, rajasthan, sikkim, tripura, up, uttarakhand. Defaults to None (all states).
year (int, optional) – Specific birth year to filter data by. Defaults to None (all years).
dataset (str, optional) – Dataset version to use. Options:
- ‘v1’: 12 states dataset
- ‘v2’: Full 30 states dataset
- ‘v2_1k’: 1000+ occurrences dataset (default, good balance)
- ‘v2_native’: Native language dataset (no ML fallback)
- ‘v2_en’: English transliteration dataset
- Returns:
- Enhanced DataFrame with additional columns:
n_female (float): Count of females with this name
n_male (float): Count of males with this name
n_third_gender (float): Count of third gender individuals
prop_female (float): Proportion female (0.0 to 1.0)
prop_male (float): Proportion male (0.0 to 1.0)
prop_third_gender (float): Proportion third gender (0.0 to 1.0)
pred_gender (str): ML prediction for names not in database
pred_prob (float): ML prediction confidence score
- Return type:
pd.DataFrame
Note
Names are automatically cleaned (stripped and lowercased)
For names not in electoral data, ML predictions are added
Data is cached after first download for faster subsequent use
Third gender category reflects Indian electoral roll classifications
Example
>>> import pandas as pd
>>> from naampy import in_rolls_fn_gender
>>> df = pd.DataFrame({'name': ['Priya', 'Rahul', 'Anjali']})
>>> result = in_rolls_fn_gender(df, 'name')
>>> print(result[['name', 'prop_female', 'prop_male']].head())
     name  prop_female  prop_male
0   priya        0.994      0.006
1   rahul        0.008      0.992
2  anjali        0.989      0.011
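The lookup-then-fallback flow described for this method can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: `lookup_with_fallback`, the toy statistics, and the stub predictor are all invented here, and the real method works on DataFrames rather than lists of dicts.

```python
def lookup_with_fallback(names, roll_stats, predict):
    """Return one record per name, preferring electoral roll statistics.

    roll_stats: dict mapping a cleaned name to prop_female (toy stand-in).
    predict: callable returning (pred_gender, pred_prob) for a cleaned name.
    """
    records = []
    for raw in names:
        name = raw.strip().lower()  # cleaning step described in the note
        if name in roll_stats:
            # Name found in the electoral data: no ML prediction needed.
            records.append({"name": name, "prop_female": roll_stats[name],
                            "pred_gender": None, "pred_prob": None})
        else:
            # Name missing from the electoral data: fall back to the model.
            gender, prob = predict(name)
            records.append({"name": name, "prop_female": None,
                            "pred_gender": gender, "pred_prob": prob})
    return records


def stub_predict(name):
    # Stub standing in for the neural network; always returns the same value.
    return ("female", 0.623)


# Toy data invented for illustration.
toy_stats = {"priya": 0.994, "rahul": 0.008}

rows = lookup_with_fallback([" Priya ", "Unknown_Name"], toy_stats, stub_predict)
for row in rows:
    print(row)
```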
- static list_states(dataset='v2_1k')[source]¶
Get list of available states in the specified dataset.
This method returns all unique states/union territories available in the chosen dataset version for filtering and analysis.
- Parameters:
dataset (str, optional) – Dataset version to query. Defaults to ‘v2_1k’. See load_naampy_data() for available dataset options.
- Returns:
Array of state names available in the dataset.
- Return type:
np.ndarray
Example
>>> states = InRollsFnData.list_states('v2_1k')
>>> print(f"Available states: {', '.join(states[:5])}...")
Available states: andaman, andhra, arunachal, assam, bihar...
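A list of valid state keys is useful for checking a state argument before running a lookup. The sketch below hard-codes the keys listed in the parameter documentation; in real use the list could instead come from list_states(). `normalize_state` is a hypothetical helper, and how naampy itself reacts to an unknown state is not documented on this page.

```python
# State keys copied from the parameter documentation above.
KNOWN_STATES = (
    "andaman", "andhra", "arunachal", "assam", "bihar", "chandigarh",
    "dadra", "daman", "delhi", "goa", "gujarat", "haryana", "himachal",
    "jharkhand", "jk", "karnataka", "kerala", "maharashtra", "manipur",
    "meghalaya", "mizoram", "mp", "nagaland", "odisha", "puducherry",
    "punjab", "rajasthan", "sikkim", "tripura", "up", "uttarakhand",
)


def normalize_state(state, known_states=KNOWN_STATES):
    """Return a cleaned state key, or None to mean 'all states'.

    Raises ValueError for a key that is not in known_states.
    """
    if state is None:
        return None
    key = state.strip().lower()
    if key not in known_states:
        raise ValueError(f"Unknown state {state!r}")
    return key


print(normalize_state(" Kerala "))
print(normalize_state(None))
```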
Utility Functions¶
Module Constants¶
Available Datasets¶
- IN_ROLLS_DATA = Dictionary mapping dataset versions to Harvard Dataverse URLs¶
Harvard Dataverse URLs for Indian Electoral Roll datasets.
Contains download URLs for different versions of the naampy gender prediction datasets hosted on Harvard Dataverse. Each version contains electoral roll statistics from different numbers of Indian states and territories.
- Dataset versions:
v1: 12 states dataset (legacy)
v2: Full 30 states dataset
v2_1k: 30 states with 1000+ name occurrences (recommended default)
v2_native: Native language scripts dataset (16 states, no ML fallback)
v2_en: English transliteration of v2_native
Output Columns¶
- IN_ROLLS_COLS = List of columns added by in_rolls_fn_gender()¶
The electoral roll functions add these columns to your DataFrame:
n_male, n_female, n_third_gender: Count statistics
prop_male, prop_female, prop_third_gender: Proportion statistics
Command Line Interface¶
The package includes a command-line interface:
in_rolls_fn_gender input.csv -f first_name -o output.csv
- main(argv=sys.argv[1:])[source]¶
Command-line interface for naampy gender prediction.
This function provides a command-line interface to process CSV files and add gender predictions based on first names using Indian Electoral Roll data.
- Parameters:
argv (list[str], optional) – Command line arguments. Defaults to sys.argv[1:].
- Returns:
Exit code (0 for success, -1 for error)
- Return type:
int
Example
$ in_rolls_fn_gender input.csv -f first_name -o output.csv
$ in_rolls_fn_gender input.csv -f name -s kerala -y 1990
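To make the argument structure of the commands above concrete, here is a rough argparse sketch that mirrors only the short flags appearing in those examples (-f, -o, -s, -y). The destination names, defaults, and help text are hypothetical; the real CLI may define long options or further flags that this page does not show.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in the examples above."""
    parser = argparse.ArgumentParser(prog="in_rolls_fn_gender")
    parser.add_argument("input", help="path to the input CSV file")
    # Destination names and defaults below are hypothetical.
    parser.add_argument("-f", dest="first_name_col", default=None,
                        help="column holding first names")
    parser.add_argument("-o", dest="output", default=None,
                        help="path for the output CSV file")
    parser.add_argument("-s", dest="state", default=None,
                        help="state key, for example kerala")
    parser.add_argument("-y", dest="year", type=int, default=None,
                        help="birth year filter")
    return parser


args = build_parser().parse_args(
    ["input.csv", "-f", "name", "-s", "kerala", "-y", "1990"]
)
print(args.input, args.first_name_col, args.state, args.year)
```

Passing an explicit list to parse_args, as main(argv) does with its argv parameter, makes the entry point easy to exercise from tests without touching sys.argv.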
For usage examples, see the User Guide.