Outkast Documentation¶

Outkast is a Python library for inferring caste from Indian names using SECC 2011 data.

Installation¶

Install outkast using pip:

pip install outkast

Requirements¶

Python 3.11 or higher
pandas
numpy

Quick Start¶

Here’s a simple example of how to use outkast:

import pandas as pd
from outkast import secc_caste

# Create a DataFrame with names
df = pd.DataFrame({'name': ['Patel', 'Sharma', 'Singh', 'Kumar']})

# Add caste predictions
result = secc_caste(df, 'name')
print(result)

The function adds columns with caste counts and proportions:

n_sc: Number of Scheduled Caste individuals
n_st: Number of Scheduled Tribe individuals
n_other: Number of Other individuals
prop_sc: Proportion of Scheduled Caste
prop_st: Proportion of Scheduled Tribe
prop_other: Proportion of Other
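
To make the relationship between the count and proportion columns concrete, here is a small worked sketch. The counts are hypothetical, and it assumes that each proportion is the group's count divided by the sum of the three counts, which this page does not state explicitly.

```python
# Hypothetical counts for one last name; real values come from the SECC data.
row = {"n_sc": 120, "n_st": 30, "n_other": 850}

# Assumption: each proportion is that group's count over the row total.
total = row["n_sc"] + row["n_st"] + row["n_other"]
props = {
    "prop_sc": row["n_sc"] / total,
    "prop_st": row["n_st"] / total,
    "prop_other": row["n_other"] / total,
}

# The group with the largest share for this name.
most_common = max(props, key=props.get)
print(props, most_common)
```

Under that assumption the three proportions in a row sum to 1, so they can be read as shares of the people recorded under that last name.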

API Reference¶

Main Functions¶

outkast.secc_caste(df: DataFrame, namecol: str | int, state: str | None = None, year: int | None = None) → DataFrame¶

Appends additional columns from SECC data to the input DataFrame based on the last name.

Strips extra whitespace from the name, then checks whether the name is in the SECC data. If it is, outputs the data from the matching row.

Parameters:

df (DataFrame) – Pandas DataFrame containing the last name column.
namecol (str or int) – Column’s name or location of the name in DataFrame.
state (str) – The state name of SECC data to be used. (default is None for all states)
year (int) – The year of SECC data to be used. (default is None for all years)

Returns:

Pandas DataFrame with additional columns ‘n_sc’, ‘n_st’, ‘n_other’, ‘prop_sc’, ‘prop_st’, ‘prop_other’, matched by last name.

Return type:

DataFrame
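
The lookup described above (clean the name, check whether it is in the data, return the matching row) can be sketched in plain Python. This is an illustration only, not the library's implementation: the table contents and the helper name lookup_last_name are hypothetical, a dict stands in for the SECC data, and returning None for an unmatched name is an assumption, since this page does not say what secc_caste outputs in that case. Case handling is also left out because it is not specified here.

```python
# Hypothetical stand-in for the SECC table, keyed by last name.
SECC_ROWS = {
    "Patel": {"n_sc": 10, "n_st": 5, "n_other": 85},
}


def lookup_last_name(name, table=SECC_ROWS):
    """Return the table row for a name, or None when the name is absent."""
    # Collapse runs of whitespace and trim both ends of the name.
    cleaned = " ".join(name.split())
    # dict.get returns None when the cleaned name is not a key.
    return table.get(cleaned)


print(lookup_last_name("  Patel "))
print(lookup_last_name("Unknown"))
```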

Core Classes¶

class outkast.secc_caste_ln.SeccCasteLnData¶

Bases: object

__init__() → None¶

static list_states() → list[str]¶

classmethod secc_caste(df: DataFrame, namecol: str | int, state: str | None = None, year: int | None = None) → DataFrame¶

Appends additional columns from SECC data to the input DataFrame based on the last name.

Strips extra whitespace from the name, then checks whether the name is in the SECC data. If it is, outputs the data from the matching row.

Parameters:

df (DataFrame) – Pandas DataFrame containing the last name column.
namecol (str or int) – Column’s name or location of the name in DataFrame.
state (str) – The state name of SECC data to be used. (default is None for all states)
year (int) – The year of SECC data to be used. (default is None for all years)

Returns:

Pandas DataFrame with additional columns ‘n_sc’, ‘n_st’, ‘n_other’, ‘prop_sc’, ‘prop_st’, ‘prop_other’, matched by last name.

Return type:

DataFrame

Utility Functions¶

outkast.utils.column_exists(df: DataFrame, col: str | int) → bool¶

Checks whether the column exists in the DataFrame.

Parameters:

df – Pandas DataFrame.
col – Column name or location.

Returns:

True if the column exists, False otherwise.

outkast.utils.find_ngrams(vocab: Sequence[str], text: str, n: int) → list[int]¶

Finds the n-grams of a text in the vocabulary list and returns their indices.

Generates the n-grams of the given text, looks each one up in the vocabulary list, and returns the list of indices found.

Parameters:

vocab – Vocabulary list.
text – Input text.
n – Size of each n-gram.

Returns:

List of the indices of the n-grams in the vocabulary list, with -1 for any n-gram not found in the vocabulary.
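
The behaviour described for find_ngrams can be illustrated with a short stand-alone sketch. It is not the library's code: the name find_ngrams_sketch is hypothetical, and it assumes that the n-grams are overlapping character n-grams and that the index of the first matching vocabulary entry is returned.

```python
from collections.abc import Sequence


def find_ngrams_sketch(vocab: Sequence[str], text: str, n: int) -> list[int]:
    """Return the vocabulary index of each character n-gram, or -1 if absent."""
    # Slide a window of width n over the text to get overlapping n-grams.
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    # Map each vocabulary entry to the position of its first occurrence.
    positions: dict[str, int] = {}
    for idx, gram in enumerate(vocab):
        positions.setdefault(gram, idx)
    # Look up every n-gram, falling back to -1 when it is not in the vocabulary.
    return [positions.get(gram, -1) for gram in ngrams]


print(find_ngrams_sketch(["ab", "bc", "cd"], "abcx", 2))  # [0, 1, -1]
```

Here the text "abcx" yields the bigrams "ab", "bc" and "cx"; the first two are in the vocabulary at positions 0 and 1, and "cx" is not, so it maps to -1.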

outkast.utils.fixup_columns(cols: Sequence[str | int]) → list[str]¶

Replaces columns given by index location with names that carry a col prefix.

Parameters:

cols – List of original columns.

Returns:

List of column names.
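
A minimal sketch of the renaming described for fixup_columns follows. The name fixup_columns_sketch is hypothetical, and the exact output format (the prefix col followed directly by the integer, such as col0) is an assumption, since this page only says the name gets a col prefix.

```python
def fixup_columns_sketch(cols):
    """Turn integer column locations into 'col<N>' names; keep strings as-is."""
    # Assumption: an integer location N becomes the string "colN".
    return [f"col{c}" if isinstance(c, int) else c for c in cols]


print(fixup_columns_sketch([0, "name", 2]))  # ['col0', 'name', 'col2']
```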

Command Line Interface¶

Outkast also provides a command-line interface:

secc_caste input.csv -l name_column -o output.csv

Options:

-l, --last-name: Name or index of the column containing last names (required)
-s, --state: Filter by specific state (optional)
-y, --year: Filter by birth year (optional)
-o, --output: Output file name (default: secc-caste-output.csv)
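
As a rough illustration of preparing a run from Python, the sketch below writes a small input file and assembles the command line using only the options listed above. The file contents are hypothetical, and it assumes the input CSV has a header row naming the last-name column. The command is only printed, not executed.

```python
import csv
import shlex
import tempfile
from pathlib import Path

# Work in a temporary directory so nothing is left in the current folder.
workdir = Path(tempfile.mkdtemp())
input_path = workdir / "input.csv"

# Assumption: the CLI reads a CSV whose header row names the last-name column.
with input_path.open("w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["name_column"])
    writer.writerows([["Patel"], ["Sharma"]])

# Build the argument list from the documented options (-l and -o).
args = [
    "secc_caste",
    str(input_path),
    "-l", "name_column",
    "-o", str(workdir / "output.csv"),
]

# shlex.join quotes each argument safely for a POSIX shell.
command = shlex.join(args)
print(command)
```

The printed command can be pasted into a shell, or the args list can be passed to subprocess.run once outkast is installed.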

Data Source¶

The predictions are based on the Socio Economic and Caste Census (SECC) 2011 data, which provides comprehensive demographic information about Indian households.

Contributing¶

Contributions are welcome! Please feel free to submit a Pull Request.

License¶

This project is licensed under the MIT License.

Changelog¶

Version 1.0.0¶

Breaking Change: Dropped support for Python < 3.11
Modernized codebase with type hints and modern Python features
Replaced deprecated pkg_resources with importlib.resources
Updated to modern packaging with pyproject.toml
Added comprehensive type annotations
Used f-strings throughout
Added match/case statements for cleaner conditional logic

Version 0.2.1¶

Previous stable release with Python 3.5+ support
