API Reference¶

This page contains the complete API documentation for naampy.

Main Functions¶

These are the two primary functions you’ll use with naampy:

in_rolls_fn_gender(df, namecol, state=None, year=None, dataset='v2_1k')¶

Predict gender from Indian first names using Electoral Roll statistics.

This function enriches the input DataFrame with gender statistics from the Indian Electoral Rolls database. For names not found in the database, it automatically falls back to machine learning predictions (except when the v2_native dataset is used).

Parameters:

df (pd.DataFrame) – Input DataFrame containing the first name column.
namecol (str) – Name of the column containing first names to analyze.
state (str, optional) – Specific Indian state to use for analysis. Available states: andaman, andhra, arunachal, assam, bihar, chandigarh, dadra, daman, delhi, goa, gujarat, haryana, himachal, jharkhand, jk, karnataka, kerala, maharashtra, manipur, meghalaya, mizoram, mp, nagaland, odisha, puducherry, punjab, rajasthan, sikkim, tripura, up, uttarakhand. Defaults to None (all states).
year (int, optional) – Specific birth year to filter data by. Defaults to None (all years).
dataset (str, optional) – Dataset version to use. Options:
    - ‘v1’: 12 states dataset
    - ‘v2’: Full 30 states dataset
    - ‘v2_1k’: 1000+ occurrences dataset (default, good balance)
    - ‘v2_native’: Native language dataset (no ML fallback)
    - ‘v2_en’: English transliteration dataset

Returns:

Enhanced DataFrame with additional columns:

n_female (float): Count of females with this name
n_male (float): Count of males with this name
n_third_gender (float): Count of third gender individuals
prop_female (float): Proportion female (0.0 to 1.0)
prop_male (float): Proportion male (0.0 to 1.0)
prop_third_gender (float): Proportion third gender (0.0 to 1.0)
pred_gender (str): ML prediction for names not in database
pred_prob (float): ML prediction confidence score

Return type:

pd.DataFrame

Note

Names are automatically cleaned (stripped and lowercased)
For names not in electoral data, ML predictions are added
Data is cached after first download for faster subsequent use
Third gender category reflects Indian electoral roll classifications

Example

>>> import pandas as pd
>>> from naampy import in_rolls_fn_gender
>>> df = pd.DataFrame({'name': ['Priya', 'Rahul', 'Anjali']})
>>> result = in_rolls_fn_gender(df, 'name')
>>> print(result[['name', 'prop_female', 'prop_male']].head())
     name  prop_female  prop_male
0   priya       0.994      0.006
1   rahul       0.008      0.992
2  anjali       0.989      0.011
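
The lookup-then-fallback behaviour described above can be pictured with a small, self-contained sketch. It does not call naampy: `ROLL_STATS`, `fake_model`, and `lookup_with_fallback` are hypothetical names, the counts are invented for illustration, and the toy rule in `fake_model` only stands in for the real neural network.

```python
# Hypothetical electoral-roll counts, invented for illustration only.
ROLL_STATS = {
    "priya": {"n_female": 990, "n_male": 10},
    "rahul": {"n_female": 8, "n_male": 992},
}


def fake_model(name):
    """Stand-in for the ML model: returns (pred_gender, pred_prob)."""
    # Toy rule, not the real network: names ending in 'a' or 'i' -> female.
    return ("female", 0.6) if name.endswith(("a", "i")) else ("male", 0.6)


def lookup_with_fallback(raw_name, use_ml_fallback=True):
    """Clean the name, try the roll statistics, then fall back to the model."""
    name = raw_name.strip().lower()  # the cleaning step the Note describes
    row = {"name": name}
    stats = ROLL_STATS.get(name)
    if stats is not None:
        row.update(stats)  # name found: attach the count statistics
    elif use_ml_fallback:
        row["pred_gender"], row["pred_prob"] = fake_model(name)
    return row


for raw in ["  Priya ", "Zoya"]:
    print(lookup_with_fallback(raw))
```

Passing `use_ml_fallback=False` mimics the documented behaviour of the v2_native dataset, where no ML prediction is added for unmatched names.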

predict_fn_gender(first_names)¶

Predict gender using a neural network model based on character patterns in names.

This method uses a character-level neural network trained on Indian names to predict gender when names are not found in the electoral roll database. The model learns patterns in character sequences to make predictions.

Parameters:

first_names (list[str]) – List of first names to predict gender for. Names are automatically converted to lowercase.

Returns:

DataFrame containing:

name (str): Input first name (lowercased)
pred_gender (str): Predicted gender (‘male’ or ‘female’)
pred_prob (float): Confidence score for the prediction (0.0 to 1.0)

Return type:

pd.DataFrame

Note

Names are classified as ‘female’ if predicted probability > 0.5
Names are classified as ‘male’ if predicted probability ≤ 0.5
The model handles character sequences up to 24 characters
Model accuracy: RMSE of 0.22 on test data

Example

>>> from naampy import predict_fn_gender
>>> names = ['Priya', 'Rahul', 'Unknown_Name']
>>> result = predict_fn_gender(names)
>>> print(result)
           name pred_gender  pred_prob
0         priya      female      0.945
1         rahul        male      0.876
2  unknown_name      female      0.623
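
The 0.5 threshold rule from the Note can be sketched as follows. This is an illustration rather than naampy's own code: `label_from_probability` is a hypothetical helper, and reporting the complementary probability as the confidence for 'male' is an assumption chosen to be consistent with the example output above.

```python
def label_from_probability(p_female, threshold=0.5):
    """Map a predicted female-probability to (pred_gender, pred_prob)."""
    if p_female > threshold:
        return "female", p_female
    # At or below the threshold the name is labelled male. Reporting
    # 1 - p_female as the confidence is an assumption of this sketch.
    return "male", 1.0 - p_female


print(label_from_probability(0.945))  # ('female', 0.945)
print(label_from_probability(0.5))    # ('male', 0.5)
```

Note that a probability of exactly 0.5 falls on the 'male' side, matching the "≤ 0.5" wording.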

Core Classes¶

class InRollsFnData[source]¶

Bases: object

Main class for handling Indian Electoral Roll data and gender prediction.

This class provides methods to predict gender based on Indian first names using two approaches:

1. Statistical data from Indian Electoral Rolls (31 states and union territories)
2. Machine learning model for names not found in the electoral data

The class maintains cached data and models for efficient repeated predictions.

static load_naampy_data(dataset)[source]¶

Download and cache the naampy dataset from Harvard Dataverse.

This method downloads the specified dataset version if not already cached locally. Subsequent calls will use the cached version for faster performance.

Parameters:

dataset (str) – Version of the dataset to load. Options are:
    - ‘v1’: 12 states dataset
    - ‘v2’: Full 30 states dataset
    - ‘v2_1k’: 30 states with 1000+ name occurrences (default)
    - ‘v2_native’: Native language dataset (16 states)
    - ‘v2_en’: English transliteration of v2_native

Returns:

Local file path to the downloaded/cached dataset

Return type:

str

Raises:

Exception – If the dataset download fails

Example

>>> path = InRollsFnData.load_naampy_data('v2_1k')
>>> print(f"Data cached at: {path}")
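
The download-once, reuse-afterwards pattern described for this method can be sketched with the standard library alone. Everything here is hypothetical: `cached_path` and `fake_downloader` are illustrative names, the URL is a placeholder, and naampy's actual cache location and download logic may differ.

```python
import os
import tempfile


def cached_path(cache_dir, filename, url, downloader):
    """Return the local path for `filename`, downloading only if it is missing."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, filename)
    if not os.path.exists(target):
        # `downloader(url, target)` is expected to return True on success.
        if not downloader(url, target):
            raise RuntimeError(f"download failed: {url}")
    return target


calls = []  # records each simulated download


def fake_downloader(url, target):
    """Stand-in for a real HTTP download; writes a placeholder file."""
    calls.append(url)
    with open(target, "w", encoding="utf-8") as fh:
        fh.write("placeholder")
    return True


cache_dir = tempfile.mkdtemp()
url = "https://example.invalid/data"  # placeholder, not a real dataset URL
first = cached_path(cache_dir, "data.csv.gz", url, fake_downloader)
second = cached_path(cache_dir, "data.csv.gz", url, fake_downloader)
print(first == second, len(calls))  # True 1
```

The second call finds the file already on disk, so the downloader runs only once.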

classmethod predict_fn_gender(first_names)[source]¶

Predict gender using a neural network model based on character patterns in names.

Parameters:

first_names (list[str]) – List of first names to predict gender for. Names are automatically converted to lowercase.

Returns:

DataFrame containing:

name (str): Input first name (lowercased)
pred_gender (str): Predicted gender (‘male’ or ‘female’)
pred_prob (float): Confidence score for the prediction (0.0 to 1.0)

Return type:

pd.DataFrame

Note

Names are classified as ‘female’ if predicted probability > 0.5
Names are classified as ‘male’ if predicted probability ≤ 0.5
The model handles character sequences up to 24 characters
Model accuracy: RMSE of 0.22 on test data

Example

>>> names = ['Priya', 'Rahul', 'Unknown_Name']
>>> result = InRollsFnData.predict_fn_gender(names)
>>> print(result)
           name pred_gender  pred_prob
0         priya      female      0.945
1         rahul        male      0.876
2  unknown_name      female      0.623

classmethod in_rolls_fn_gender(df, namecol, state=None, year=None, dataset='v2_1k')[source]¶

Predict gender from Indian first names using Electoral Roll statistics.

Parameters:

df (pd.DataFrame) – Input DataFrame containing the first name column.
namecol (str) – Name of the column containing first names to analyze.
state (str, optional) – Specific Indian state to use for analysis. Available states: andaman, andhra, arunachal, assam, bihar, chandigarh, dadra, daman, delhi, goa, gujarat, haryana, himachal, jharkhand, jk, karnataka, kerala, maharashtra, manipur, meghalaya, mizoram, mp, nagaland, odisha, puducherry, punjab, rajasthan, sikkim, tripura, up, uttarakhand. Defaults to None (all states).
year (int, optional) – Specific birth year to filter data by. Defaults to None (all years).
dataset (str, optional) – Dataset version to use. Options:
    - ‘v1’: 12 states dataset
    - ‘v2’: Full 30 states dataset
    - ‘v2_1k’: 1000+ occurrences dataset (default, good balance)
    - ‘v2_native’: Native language dataset (no ML fallback)
    - ‘v2_en’: English transliteration dataset

Returns:

Enhanced DataFrame with additional columns:

n_female (float): Count of females with this name
n_male (float): Count of males with this name
n_third_gender (float): Count of third gender individuals
prop_female (float): Proportion female (0.0 to 1.0)
prop_male (float): Proportion male (0.0 to 1.0)
prop_third_gender (float): Proportion third gender (0.0 to 1.0)
pred_gender (str): ML prediction for names not in database
pred_prob (float): ML prediction confidence score

Return type:

pd.DataFrame

Note

Names are automatically cleaned (stripped and lowercased)
For names not in electoral data, ML predictions are added
Data is cached after first download for faster subsequent use
Third gender category reflects Indian electoral roll classifications

Example

>>> import pandas as pd
>>> from naampy import in_rolls_fn_gender
>>> df = pd.DataFrame({'name': ['Priya', 'Rahul', 'Anjali']})
>>> result = in_rolls_fn_gender(df, 'name')
>>> print(result[['name', 'prop_female', 'prop_male']].head())
     name  prop_female  prop_male
0   priya       0.994      0.006
1   rahul       0.008      0.992
2  anjali       0.989      0.011

static list_states(dataset='v2_1k')[source]¶

Get list of available states in the specified dataset.

This method returns all unique states/union territories available in the chosen dataset version for filtering and analysis.

Parameters:

dataset (str, optional) – Dataset version to query. Defaults to ‘v2_1k’. See load_naampy_data() for available dataset options.

Returns:

Array of state names available in the dataset.

Return type:

np.ndarray

Example

>>> states = InRollsFnData.list_states('v2_1k')
>>> print(f"Available states: {', '.join(states[:5])}...")
Available states: andaman, andhra, arunachal, assam, bihar...
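
One use for the state list is checking a `state` argument before running a prediction. The sketch below hard-codes the identifiers from the `state` parameter description so that it runs without naampy; in practice the result of `list_states()` could supply the list instead. `normalize_state` is a hypothetical helper, and whether naampy performs a similar check internally is not stated on this page.

```python
# State identifiers copied from the `state` parameter description.
KNOWN_STATES = [
    "andaman", "andhra", "arunachal", "assam", "bihar", "chandigarh",
    "dadra", "daman", "delhi", "goa", "gujarat", "haryana", "himachal",
    "jharkhand", "jk", "karnataka", "kerala", "maharashtra", "manipur",
    "meghalaya", "mizoram", "mp", "nagaland", "odisha", "puducherry",
    "punjab", "rajasthan", "sikkim", "tripura", "up", "uttarakhand",
]


def normalize_state(state):
    """Return a cleaned state identifier, or None to mean 'all states'."""
    if state is None:
        return None
    cleaned = state.strip().lower()
    if cleaned not in KNOWN_STATES:
        raise ValueError(f"unknown state: {state!r}")
    return cleaned


print(normalize_state(" Kerala "))  # kerala
```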

Utility Functions¶

find_ngrams(vocab, text, n)[source]¶

Return the vocabulary indices of the n-grams found in a text.

Generates the n-grams of the given text, looks each one up in the vocabulary list, and returns the list of their indices.

Parameters:

vocab (list) – Vocabulary list.
text (str) – Input text.
n (int) – Length of each n-gram.

Returns:

List of indices of the text’s n-grams in the vocabulary list.

Return type:

list
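
A minimal sketch of this kind of n-gram lookup is shown below. It is not the library's source: `find_ngrams_sketch` is a hypothetical name, it works on character n-grams, and mapping unknown n-grams to index 0 is an assumption, since the handling of missing n-grams is not documented here.

```python
def find_ngrams_sketch(vocab, text, n):
    """Return the vocabulary index of each character n-gram in `text`."""
    if not isinstance(text, str):
        return []
    index_of = {gram: i for i, gram in enumerate(vocab)}
    indices = []
    for start in range(len(text) - n + 1):
        gram = text[start:start + n]
        # Assumption of this sketch: unknown n-grams map to index 0.
        indices.append(index_of.get(gram, 0))
    return indices


vocab = ["<unk>", "pr", "ri", "iy", "ya"]
print(find_ngrams_sketch(vocab, "priya", 2))  # [1, 2, 3, 4]
print(find_ngrams_sketch(vocab, "riz", 2))    # [2, 0]
```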

get_app_file_path(app_name, filename)[source]¶

Return type:

str

download_file(url, target)[source]¶

Return type:

bool

Module Constants¶

Available Datasets¶

IN_ROLLS_DATA = Dictionary mapping dataset versions to Harvard Dataverse URLs¶

Harvard Dataverse URLs for Indian Electoral Roll datasets.

Contains download URLs for different versions of the naampy gender prediction datasets hosted on Harvard Dataverse. Each version contains electoral roll statistics from different numbers of Indian states and territories.

The following dataset versions are available:

v1: 12 states dataset (legacy)
v2: Full 30 states dataset
v2_1k: 30 states with 1000+ name occurrences (recommended default)
v2_native: Native language dataset (16 states, no ML fallback)
v2_en: English transliteration of v2_native

Output Columns¶

IN_ROLLS_COLS = List of columns added by in_rolls_fn_gender()¶

The electoral roll functions add these columns to your DataFrame:

n_male, n_female, n_third_gender: Count statistics
prop_male, prop_female, prop_third_gender: Proportion statistics
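
The relationship between the count and proportion columns can be illustrated with a short sketch. It assumes each proportion is the corresponding count divided by the sum of the three counts; this page does not spell out the formula, so treat it as an assumption. `proportions` is a hypothetical helper and the numbers are made up.

```python
def proportions(n_female, n_male, n_third_gender=0.0):
    """Derive proportion columns from counts (assumed: count / total)."""
    total = n_female + n_male + n_third_gender
    if total == 0:
        # No observations: proportions are undefined.
        nan = float("nan")
        return {"prop_female": nan, "prop_male": nan, "prop_third_gender": nan}
    return {
        "prop_female": n_female / total,
        "prop_male": n_male / total,
        "prop_third_gender": n_third_gender / total,
    }


print(proportions(3, 1))
# {'prop_female': 0.75, 'prop_male': 0.25, 'prop_third_gender': 0.0}
```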

Command Line Interface¶

The package includes a command-line interface:

in_rolls_fn_gender input.csv -f first_name -o output.csv

main(argv=sys.argv[1:])[source]¶

Command-line interface for naampy gender prediction.

This function provides a command-line interface to process CSV files and add gender predictions based on first names using Indian Electoral Roll data.

Parameters:

argv (list[str], optional) – Command line arguments. Defaults to sys.argv[1:].

Returns:

Exit code (0 for success, -1 for error)

Return type:

int

Example

$ in_rolls_fn_gender input.csv -f first_name -o output.csv
$ in_rolls_fn_gender input.csv -f name -s kerala -y 1990
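
The flags in these commands could be parsed with `argparse` along the following lines. This is a sketch inferred from the examples only: the meanings of `-f`, `-o`, `-s`, and `-y` are read off the commands above, while the `dest` names, help texts, and defaults are assumptions and may not match the real parser.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags seen in the example commands."""
    parser = argparse.ArgumentParser(prog="in_rolls_fn_gender")
    parser.add_argument("input", help="path to the input CSV file")
    parser.add_argument("-f", dest="first_name", required=True,
                        help="name of the column holding first names")
    parser.add_argument("-o", dest="output", default=None,
                        help="path for the output CSV (default is assumed)")
    parser.add_argument("-s", dest="state", default=None,
                        help="state to restrict the statistics to")
    parser.add_argument("-y", dest="year", type=int, default=None,
                        help="birth year to restrict the statistics to")
    return parser


args = build_parser().parse_args(
    ["input.csv", "-f", "name", "-s", "kerala", "-y", "1990"]
)
print(args.first_name, args.state, args.year)  # name kerala 1990
```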

For usage examples, see the User Guide.