Outkast Documentation
Outkast is a Python library for inferring caste from Indian names using SECC 2011 data.
Installation
Install outkast using pip:
pip install outkast
Requirements
Python 3.11 or higher
pandas
numpy
Quick Start
Here’s a simple example of how to use outkast:
import pandas as pd
from outkast import secc_caste
# Create a DataFrame with names
df = pd.DataFrame({'name': ['Patel', 'Sharma', 'Singh', 'Kumar']})
# Add caste predictions
result = secc_caste(df, 'name')
print(result)
The function adds columns with caste counts and proportions:
n_sc: Number of Scheduled Caste individuals
n_st: Number of Scheduled Tribe individuals
n_other: Number of Other individuals
prop_sc: Proportion of Scheduled Caste
prop_st: Proportion of Scheduled Tribe
prop_other: Proportion of Other
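To make the relationship between the count and proportion columns concrete, here is a minimal sketch. It assumes each proportion is the category count divided by the sum of the three counts; that is an assumption about how the columns relate, not something this page confirms. The helper name `proportions_from_counts` is hypothetical and the counts are made-up illustrative numbers, not SECC values.

```python
def proportions_from_counts(n_sc: int, n_st: int, n_other: int) -> dict:
    """Return prop_sc, prop_st and prop_other, ASSUMING prop = count / total."""
    total = n_sc + n_st + n_other
    if total == 0:
        # No individuals recorded: avoid dividing by zero.
        return {"prop_sc": 0.0, "prop_st": 0.0, "prop_other": 0.0}
    return {
        "prop_sc": n_sc / total,
        "prop_st": n_st / total,
        "prop_other": n_other / total,
    }


# Made-up counts for illustration only.
example = proportions_from_counts(n_sc=20, n_st=5, n_other=75)
print(example)
```

Under this assumption the three proportions for a name always sum to 1, so they can be read as shares of the individuals recorded with that last name.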
API Reference
- outkast.secc_caste(df: DataFrame, namecol: str | int, state: str | None = None, year: int | None = None) -> DataFrame
Appends columns from the SECC data to the input DataFrame, matched on last name.
Extra whitespace is removed from each name, and the name is then checked against the SECC data. If the name is found, the values from the matching row are returned.
- Parameters:
df (DataFrame) – Pandas DataFrame containing the last name column.
namecol (str or int) – Name or position of the last name column in the DataFrame.
state (str) – Name of the state whose SECC data should be used. (default is None for all states)
year (int) – Year of the SECC data to be used. (default is None for all years)
- Returns:
Pandas DataFrame with the additional columns ‘n_sc’, ‘n_st’, ‘n_other’, ‘prop_sc’, ‘prop_st’, ‘prop_other’ by last name.
- Return type:
DataFrame
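The lookup behaviour described above (clean up whitespace, check whether the name is present, return that row) can be sketched in plain Python. This is not the library's implementation: `TOY_SECC`, `normalize_name` and `lookup_name` are hypothetical names, the table values are made up, and exact-match lookup after whitespace cleanup is an assumption.

```python
from __future__ import annotations

import re

# Hypothetical toy table standing in for the SECC data; the numbers are made up.
TOY_SECC = {
    "Patel": {"n_sc": 10, "n_st": 5, "n_other": 85},
}


def normalize_name(name: str) -> str:
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", name).strip()


def lookup_name(name: str, table: dict[str, dict[str, int]]) -> dict[str, int] | None:
    """Return the row for a cleaned name, or None when the name is absent."""
    return table.get(normalize_name(name))


print(lookup_name("  Patel  ", TOY_SECC))  # row found after whitespace cleanup
print(lookup_name("Unknown", TOY_SECC))    # name not in the toy table
```

How the library represents names that are not found is not stated on this page; returning `None` here is only a convenient choice for the sketch.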
Core Classes
- class outkast.secc_caste_ln.SeccCasteLnData
Bases: object
- classmethod secc_caste(df: DataFrame, namecol: str | int, state: str | None = None, year: int | None = None) -> DataFrame
Appends columns from the SECC data to the input DataFrame, matched on last name.
Extra whitespace is removed from each name, and the name is then checked against the SECC data. If the name is found, the values from the matching row are returned.
- Parameters:
df (DataFrame) – Pandas DataFrame containing the last name column.
namecol (str or int) – Name or position of the last name column in the DataFrame.
state (str) – Name of the state whose SECC data should be used. (default is None for all states)
year (int) – Year of the SECC data to be used. (default is None for all years)
- Returns:
Pandas DataFrame with the additional columns ‘n_sc’, ‘n_st’, ‘n_other’, ‘prop_sc’, ‘prop_st’, ‘prop_other’ by last name.
- Return type:
DataFrame
Utility Functions
- outkast.utils.column_exists(df: DataFrame, col: str | int) -> bool
Check whether a column exists in the DataFrame.
- Parameters:
df – Pandas DataFrame.
col – Column name.
- Returns:
True if the column exists, False otherwise.
- outkast.utils.find_ngrams(vocab: list[str], text: str, n: int) -> list[int]
Return the vocabulary indices of the n-grams found in a text.
Generates the n-grams of the given text, looks each one up in the vocabulary list, and returns the indices of those that were found.
- Parameters:
vocab – Vocabulary list.
text – Input text.
n – Size of the n-grams.
- Returns:
List of indices of the n-grams in the vocabulary list.
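The n-gram lookup described for find_ngrams can be illustrated with a small standalone sketch. It is not the library's code: the name `find_ngrams_sketch` is hypothetical, it assumes character-level n-grams, and it assumes n-grams missing from the vocabulary are simply skipped. The library may treat either point differently.

```python
def find_ngrams_sketch(vocab: list, text: str, n: int) -> list:
    """Return vocabulary indices of the character n-grams of `text`.

    ASSUMPTIONS: n-grams are taken over characters, and any n-gram that is
    not in `vocab` is skipped rather than mapped to a placeholder index.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    indices = []
    # Slide a window of width n across the text, left to right.
    for start in range(len(text) - n + 1):
        gram = text[start:start + n]
        if gram in vocab:
            indices.append(vocab.index(gram))
    return indices


vocab = ["pa", "at", "te", "el"]
print(find_ngrams_sketch(vocab, "patel", 2))
```

Here the bigrams of "patel" are "pa", "at", "te" and "el", each of which sits in the vocabulary at the position in which it was listed.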
Command Line Interface
Outkast also provides a command-line interface:
secc_caste input.csv -l name_column -o output.csv
Options:
-l, --last-name: Name or index of the column containing last names (required)
-s, --state: Filter by specific state (optional)
-y, --year: Filter by birth year (optional)
-o, --output: Output file name (default: secc-caste-output.csv)
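For readers who want to see how the options above fit together, the following sketch declares an equivalent interface with Python's standard argparse module. It is an illustration only, not the parser that outkast ships: the function name `build_parser`, the positional argument name `input`, and parsing the year as an integer are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented secc_caste options."""
    parser = argparse.ArgumentParser(prog="secc_caste")
    # Positional input file; the argument name "input" is an assumption.
    parser.add_argument("input", help="Input CSV file")
    parser.add_argument("-l", "--last-name", required=True,
                        help="Name or index of the column containing last names")
    parser.add_argument("-s", "--state", default=None,
                        help="Filter by specific state")
    # Parsing the year as an int is an assumption based on the API signature.
    parser.add_argument("-y", "--year", type=int, default=None,
                        help="Filter by birth year")
    parser.add_argument("-o", "--output", default="secc-caste-output.csv",
                        help="Output file name")
    return parser


# Parse the same arguments as the example command, minus the -o option.
args = build_parser().parse_args(["input.csv", "-l", "name_column"])
print(args.last_name, args.state, args.year, args.output)
```

When -o is omitted, the output name falls back to the documented default, secc-caste-output.csv.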
Data Source
The predictions are based on the Socio Economic and Caste Census (SECC) 2011 data, which provides comprehensive demographic information about Indian households.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License.
Changelog
Version 1.0.0
Breaking Change: Dropped support for Python < 3.11
Modernized codebase with type hints and modern Python features
Replaced deprecated pkg_resources with importlib.resources
Updated to modern packaging with pyproject.toml
Added comprehensive type annotations
Used f-strings throughout
Added match/case statements for cleaner conditional logic
Version 0.2.1
Previous stable release with Python 3.5+ support