API Reference

This page contains the complete API documentation for naampy.

Main Functions

These are the two primary functions you’ll use with naampy:

in_rolls_fn_gender(df, namecol, state=None, year=None, dataset='v2_1k')

Predict gender from Indian first names using Electoral Roll statistics.

This function enriches the input DataFrame with gender statistics from the Indian Electoral Rolls database. For names not found in the database, it falls back to machine learning predictions, except when the v2_native dataset is used.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing the first name column.

  • namecol (str) – Name of the column containing first names to analyze.

  • state (str, optional) – Specific Indian state to use for analysis. Available states: andaman, andhra, arunachal, assam, bihar, chandigarh, dadra, daman, delhi, goa, gujarat, haryana, himachal, jharkhand, jk, karnataka, kerala, maharashtra, manipur, meghalaya, mizoram, mp, nagaland, odisha, puducherry, punjab, rajasthan, sikkim, tripura, up, uttarakhand. Defaults to None (all states).

  • year (int, optional) – Specific birth year to filter data by. Defaults to None (all years).

  • dataset (str, optional) – Dataset version to use. Options:

      - ‘v1’: 12 states dataset
      - ‘v2’: Full 30 states dataset
      - ‘v2_1k’: 1000+ occurrences dataset (default, good balance)
      - ‘v2_native’: Native language dataset (no ML fallback)
      - ‘v2_en’: English transliteration dataset

Returns:

Enhanced DataFrame with additional columns:
  • n_female (float): Count of females with this name

  • n_male (float): Count of males with this name

  • n_third_gender (float): Count of third gender individuals

  • prop_female (float): Proportion female (0.0 to 1.0)

  • prop_male (float): Proportion male (0.0 to 1.0)

  • prop_third_gender (float): Proportion third gender (0.0 to 1.0)

  • pred_gender (str): ML prediction for names not in database

  • pred_prob (float): ML prediction confidence score

Return type:

pd.DataFrame

Note

  • Names are automatically cleaned (stripped and lowercased)

  • For names not in electoral data, ML predictions are added

  • Data is cached after first download for faster subsequent use

  • Third gender category reflects Indian electoral roll classifications

Example

>>> import pandas as pd
>>> from naampy import in_rolls_fn_gender
>>> df = pd.DataFrame({'name': ['Priya', 'Rahul', 'Anjali']})
>>> result = in_rolls_fn_gender(df, 'name')
>>> print(result[['name', 'prop_female', 'prop_male']].head())
     name  prop_female  prop_male
0   priya       0.994      0.006
1   rahul       0.008      0.992
2  anjali       0.989      0.011
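
To make the relationship between the count and proportion columns concrete, here is a minimal pure-Python sketch. It assumes each proportion is the count divided by the total across the three categories, and that cleaning amounts to stripping and lowercasing as the note describes. The helpers clean_name and proportions are hypothetical and are not part of naampy; the counts are illustrative, not real electoral roll figures.

```python
def clean_name(name: str) -> str:
    """Strip surrounding whitespace and lowercase the name."""
    return name.strip().lower()


def proportions(n_female: float, n_male: float, n_third_gender: float) -> dict:
    """Turn raw counts into proportions (assumed definition: count / total)."""
    total = n_female + n_male + n_third_gender
    if total == 0:
        # No observations: proportions are undefined, so report None.
        return {"prop_female": None, "prop_male": None, "prop_third_gender": None}
    return {
        "prop_female": n_female / total,
        "prop_male": n_male / total,
        "prop_third_gender": n_third_gender / total,
    }


# Illustrative counts only; these are not real electoral roll figures.
print(clean_name("  Priya "))
print(proportions(994, 6, 0))
```

Under this assumption the three proportions always sum to 1.0 for any name with at least one observation.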

predict_fn_gender(first_names)

Predict gender using a neural network model based on character patterns in names.

This method uses a character-level neural network trained on Indian names to predict gender when names are not found in the electoral roll database. The model learns patterns in character sequences to make predictions.

Parameters:

first_names (list[str]) – List of first names to predict gender for. Names are automatically converted to lowercase.

Returns:

DataFrame containing:
  • name (str): Input first name (lowercased)

  • pred_gender (str): Predicted gender (‘male’ or ‘female’)

  • pred_prob (float): Confidence score for the prediction (0.0 to 1.0)

Return type:

pd.DataFrame

Note

  • Names are classified as ‘female’ if predicted probability > 0.5

  • Names are classified as ‘male’ if predicted probability ≤ 0.5

  • The model handles character sequences up to 24 characters

  • Model accuracy: RMSE of 0.22 on test data

Example

>>> names = ['Priya', 'Rahul', 'Unknown_Name']
>>> from naampy import predict_fn_gender
>>> result = predict_fn_gender(names)
>>> print(result)
           name pred_gender  pred_prob
0         priya      female      0.945
1         rahul        male      0.876
2  unknown_name      female      0.623
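
The 0.5 threshold described in the note can be sketched as follows. This is an illustration only: it assumes the model emits the probability that a name is female and that the reported confidence is the probability of the chosen label. The helper label_from_probability is hypothetical and is not part of naampy.

```python
from typing import Tuple


def label_from_probability(p_female: float) -> Tuple[str, float]:
    """Apply the 0.5 threshold: 'female' above it, 'male' at or below it.

    Assumption: the confidence returned is the probability of the chosen label.
    """
    if p_female > 0.5:
        return "female", p_female
    return "male", 1.0 - p_female


for p in (0.945, 0.5, 0.124):
    print(p, label_from_probability(p))
```

Note that a probability of exactly 0.5 falls on the ‘male’ side, matching the "≤ 0.5" rule above.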

Core Classes

class InRollsFnData

Bases: object

Main class for handling Indian Electoral Roll data and gender prediction.

This class provides methods to predict gender based on Indian first names using two approaches:

  1. Statistical data from Indian Electoral Rolls (31 states and union territories)

  2. Machine learning model for names not found in the electoral data

The class maintains cached data and models for efficient repeated predictions.

static load_naampy_data(dataset)

Download and cache the naampy dataset from Harvard Dataverse.

This method downloads the specified dataset version if not already cached locally. Subsequent calls will use the cached version for faster performance.

Parameters:

dataset (str) – Version of the dataset to load. Options are:

  - ‘v1’: 12 states dataset
  - ‘v2’: Full 30 states dataset
  - ‘v2_1k’: 30 states with 1000+ name occurrences (default)
  - ‘v2_native’: Native language dataset (16 states)
  - ‘v2_en’: English transliteration of v2_native

Returns:

Local file path to the downloaded/cached dataset

Return type:

str

Raises:

Exception – If the dataset download fails

Example

>>> path = InRollsFnData.load_naampy_data('v2_1k')
>>> print(f"Data cached at: {path}")
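
The download-once-then-reuse behaviour described above follows a common caching pattern, sketched here with the standard library only. This is not naampy's implementation: cached_path, fake_fetch, and the file name v2_1k.csv are hypothetical, and the "download" is replaced by a stand-in that writes a small local file so the sketch runs without network access.

```python
import os
import tempfile
from typing import Callable


def cached_path(cache_dir: str, filename: str, fetch: Callable[[str], None]) -> str:
    """Return the local path for `filename`, calling `fetch(path)` only if it is missing."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        fetch(path)  # hypothetical downloader; expected to write the file at `path`
    return path


calls = []


def fake_fetch(path: str) -> None:
    """Stand-in for a real download: records the call and writes a small file."""
    calls.append(path)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("name,n_female,n_male\n")


with tempfile.TemporaryDirectory() as tmp:
    first = cached_path(tmp, "v2_1k.csv", fake_fetch)
    second = cached_path(tmp, "v2_1k.csv", fake_fetch)
    print(first == second, len(calls))  # True 1
```

The second call finds the file already on disk, so the fetch function runs only once.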

classmethod predict_fn_gender(first_names)

Predict gender using a neural network model based on character patterns in names.

This method uses a character-level neural network trained on Indian names to predict gender when names are not found in the electoral roll database. The model learns patterns in character sequences to make predictions.

Parameters:

first_names (list[str]) – List of first names to predict gender for. Names are automatically converted to lowercase.

Returns:

DataFrame containing:
  • name (str): Input first name (lowercased)

  • pred_gender (str): Predicted gender (‘male’ or ‘female’)

  • pred_prob (float): Confidence score for the prediction (0.0 to 1.0)

Return type:

pd.DataFrame

Note

  • Names are classified as ‘female’ if predicted probability > 0.5

  • Names are classified as ‘male’ if predicted probability ≤ 0.5

  • The model handles character sequences up to 24 characters

  • Model accuracy: RMSE of 0.22 on test data

Example

>>> names = ['Priya', 'Rahul', 'Unknown_Name']
>>> result = InRollsFnData.predict_fn_gender(names)
>>> print(result)
           name pred_gender  pred_prob
0         priya      female      0.945
1         rahul        male      0.876
2  unknown_name      female      0.623

classmethod in_rolls_fn_gender(df, namecol, state=None, year=None, dataset='v2_1k')

Predict gender from Indian first names using Electoral Roll statistics.

This method enriches the input DataFrame with gender statistics from the Indian Electoral Rolls database. For names not found in the database, it falls back to machine learning predictions, except when the v2_native dataset is used.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing the first name column.

  • namecol (str) – Name of the column containing first names to analyze.

  • state (str, optional) – Specific Indian state to use for analysis. Available states: andaman, andhra, arunachal, assam, bihar, chandigarh, dadra, daman, delhi, goa, gujarat, haryana, himachal, jharkhand, jk, karnataka, kerala, maharashtra, manipur, meghalaya, mizoram, mp, nagaland, odisha, puducherry, punjab, rajasthan, sikkim, tripura, up, uttarakhand. Defaults to None (all states).

  • year (int, optional) – Specific birth year to filter data by. Defaults to None (all years).

  • dataset (str, optional) – Dataset version to use. Options:

      - ‘v1’: 12 states dataset
      - ‘v2’: Full 30 states dataset
      - ‘v2_1k’: 1000+ occurrences dataset (default, good balance)
      - ‘v2_native’: Native language dataset (no ML fallback)
      - ‘v2_en’: English transliteration dataset

Returns:

Enhanced DataFrame with additional columns:
  • n_female (float): Count of females with this name

  • n_male (float): Count of males with this name

  • n_third_gender (float): Count of third gender individuals

  • prop_female (float): Proportion female (0.0 to 1.0)

  • prop_male (float): Proportion male (0.0 to 1.0)

  • prop_third_gender (float): Proportion third gender (0.0 to 1.0)

  • pred_gender (str): ML prediction for names not in database

  • pred_prob (float): ML prediction confidence score

Return type:

pd.DataFrame

Note

  • Names are automatically cleaned (stripped and lowercased)

  • For names not in electoral data, ML predictions are added

  • Data is cached after first download for faster subsequent use

  • Third gender category reflects Indian electoral roll classifications

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Priya', 'Rahul', 'Anjali']})
>>> result = InRollsFnData.in_rolls_fn_gender(df, 'name')
>>> print(result[['name', 'prop_female', 'prop_male']].head())
     name  prop_female  prop_male
0   priya       0.994      0.006
1   rahul       0.008      0.992
2  anjali       0.989      0.011

static list_states(dataset='v2_1k')

Get list of available states in the specified dataset.

This method returns all unique states/union territories available in the chosen dataset version for filtering and analysis.

Parameters:

dataset (str, optional) – Dataset version to query. Defaults to ‘v2_1k’. See load_naampy_data() for available dataset options.

Returns:

Array of state names available in the dataset.

Return type:

np.ndarray

Example

>>> states = InRollsFnData.list_states('v2_1k')
>>> print(f"Available states: {', '.join(states[:5])}...")
Available states: andaman, andhra, arunachal, assam, bihar...

Utility Functions

find_ngrams(vocab, text, n)

Return the vocabulary indices of the n-grams found in a text.

Generates the n-grams of the given text, looks each one up in the vocabulary list, and returns the list of indices found.

Parameters:
  • vocab (list) – Vocabulary list.

  • text (str) – Input text.

  • n (int) – Length of the n-grams to generate.

Returns:

List of indices of the n-grams in the vocabulary list.

Return type:

list
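
The lookup described above can be sketched as follows. This is a minimal illustration, not the library's code: find_ngrams_sketch is a hypothetical name, the n-grams are taken over characters, and mapping n-grams that are missing from the vocabulary to index 0 is an assumption.

```python
from typing import List


def find_ngrams_sketch(vocab: List[str], text: str, n: int) -> List[int]:
    """Return the vocabulary index of each character n-gram in `text`.

    Assumption: n-grams missing from `vocab` map to index 0.
    """
    lookup = {gram: idx for idx, gram in enumerate(vocab)}
    indices = []
    for start in range(len(text) - n + 1):
        gram = text[start:start + n]
        indices.append(lookup.get(gram, 0))
    return indices


vocab = ["<unk>", "pr", "ri", "iy", "ya"]
print(find_ngrams_sketch(vocab, "priya", 2))  # [1, 2, 3, 4]
print(find_ngrams_sketch(vocab, "prab", 2))   # [1, 0, 0]
```

Building a dictionary once keeps each lookup constant-time, which matters when the vocabulary is large.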

get_app_file_path(app_name, filename)

Return type:

str

download_file(url, target)

Return type:

bool

Module Constants

Available Datasets

IN_ROLLS_DATA = Dictionary mapping dataset versions to Harvard Dataverse URLs

Harvard Dataverse URLs for Indian Electoral Roll datasets.

Contains download URLs for different versions of the naampy gender prediction datasets hosted on Harvard Dataverse. Each version contains electoral roll statistics from different numbers of Indian states and territories.

The following dataset versions are available:

  • v1: 12 states dataset (legacy)

  • v2: Full 30 states dataset

  • v2_1k: 30 states with 1000+ name occurrences (recommended default)

  • v2_native: Native language dataset (16 states, no ML fallback)

  • v2_en: English transliteration of v2_native

Output Columns

IN_ROLLS_COLS = List of columns added by in_rolls_fn_gender()

The electoral roll functions add these columns to your DataFrame:

  • n_male, n_female, n_third_gender: Count statistics

  • prop_male, prop_female, prop_third_gender: Proportion statistics

Command Line Interface

The package includes a command-line interface:

in_rolls_fn_gender input.csv -f first_name -o output.csv

main(argv=sys.argv[1:])

Command-line interface for naampy gender prediction.

This function provides a command-line interface to process CSV files and add gender predictions based on first names using Indian Electoral Roll data.

Parameters:

argv (list[str], optional) – Command line arguments. Defaults to sys.argv[1:].

Returns:

Exit code (0 for success, -1 for error)

Return type:

int

Example

$ in_rolls_fn_gender input.csv -f first_name -o output.csv
$ in_rolls_fn_gender input.csv -f name -s kerala -y 1990
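
As an illustration of how the flags in these commands could be parsed, here is a small argparse sketch. Only the short flags (-f, -o, -s, -y) and the positional input file come from the examples above; the dest names, help strings, and defaults are assumptions, and this is not the package's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the short flags shown in the example commands."""
    parser = argparse.ArgumentParser(prog="in_rolls_fn_gender")
    parser.add_argument("input", help="input CSV file")
    # The dest names and defaults below are hypothetical; only the short
    # flags themselves are taken from the example commands.
    parser.add_argument("-f", dest="first_name_col", default=None, help="first-name column")
    parser.add_argument("-o", dest="output", default=None, help="output CSV file")
    parser.add_argument("-s", dest="state", default=None, help="state filter")
    parser.add_argument("-y", dest="year", type=int, default=None, help="birth year filter")
    return parser


args = build_parser().parse_args(["input.csv", "-f", "name", "-s", "kerala", "-y", "1990"])
print(args.input, args.first_name_col, args.state, args.year, args.output)
```

Passing an explicit argument list to parse_args, as main(argv) does, makes the parser easy to exercise without touching sys.argv.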

For usage examples, see the User Guide.