Outkast Documentation

Outkast is a Python library for inferring caste from Indian names using SECC 2011 data.

Installation

Install outkast using pip:

pip install outkast

Requirements

  • Python 3.11 or higher

  • pandas

  • numpy

Quick Start

Here’s a simple example of how to use outkast:

import pandas as pd
from outkast import secc_caste

# Create a DataFrame with names
df = pd.DataFrame({'name': ['Patel', 'Sharma', 'Singh', 'Kumar']})

# Add caste predictions
result = secc_caste(df, 'name')
print(result)

The function adds the following count and proportion columns:

  • n_sc: Number of Scheduled Caste individuals

  • n_st: Number of Scheduled Tribe individuals

  • n_other: Number of Other individuals

  • prop_sc: Proportion of Scheduled Caste

  • prop_st: Proportion of Scheduled Tribe

  • prop_other: Proportion of Other
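The list above does not spell out how the proportion columns relate to the count columns. A minimal sketch, assuming each prop_* value is the matching n_* count divided by the sum of the three counts. The helper name proportions is hypothetical and the counts are made up for illustration; they are not SECC figures:

```python
def proportions(n_sc: int, n_st: int, n_other: int) -> dict[str, float]:
    """Turn three counts into shares of their total (assumed definition)."""
    total = n_sc + n_st + n_other
    if total == 0:
        # No individuals recorded, so the shares are undefined.
        nan = float("nan")
        return {"prop_sc": nan, "prop_st": nan, "prop_other": nan}
    return {
        "prop_sc": n_sc / total,
        "prop_st": n_st / total,
        "prop_other": n_other / total,
    }


# Made-up counts, not real SECC values.
shares = proportions(n_sc=20, n_st=5, n_other=75)
print(shares)
```

Under this assumption the three proportions for a name sum to 1 whenever at least one individual is recorded.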

API Reference

Main Functions

outkast.secc_caste(df: DataFrame, namecol: str | int, state: str | None = None, year: int | None = None) -> DataFrame

Appends columns from the SECC data to the input DataFrame, matched on the last name.

Removes extra whitespace from each name and checks whether the name is in the SECC data. If it is, the output carries the data from the matching row.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the last name column.

  • namecol (str or int) – Name or integer location of the last-name column in the DataFrame.

  • state (str or None) – State whose SECC data should be used. Defaults to None, which uses all states.

  • year (int or None) – Year of the SECC data to be used. Defaults to None, which uses all years.

Returns:

Pandas DataFrame with the additional columns ‘n_sc’, ‘n_st’, ‘n_other’, ‘prop_sc’, ‘prop_st’, and ‘prop_other’, looked up by last name.

Return type:

DataFrame
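The lookup described above (clean the name, check whether it is in the data, copy the matching row) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: TOY_TABLE, normalize, and append_counts are hypothetical names, the table entries are invented placeholders, and filling unmatched names with None is an assumption, since the text does not say what happens to names that are absent.

```python
# Hypothetical stand-in for the SECC lookup table, keyed by last name.
# Names and counts are invented placeholders, not SECC data.
TOY_TABLE = {
    "alpha": {"n_sc": 10, "n_st": 0, "n_other": 30},
    "beta": {"n_sc": 2, "n_st": 8, "n_other": 10},
}
COLUMNS = ("n_sc", "n_st", "n_other")


def normalize(name: str) -> str:
    """Trim both ends and collapse runs of whitespace to one space."""
    return " ".join(name.split())


def append_counts(rows: list[dict], namecol: str, table: dict) -> list[dict]:
    """Return copies of rows with the count columns appended."""
    out = []
    for row in rows:
        match = table.get(normalize(row[namecol]), {})
        # Assumption: names missing from the table get None in every column.
        out.append({**row, **{col: match.get(col) for col in COLUMNS}})
    return out


rows = [{"name": "  alpha "}, {"name": "gamma"}]
result = append_counts(rows, "name", TOY_TABLE)
for row in result:
    print(row)
```

The original name value is left untouched in the output; only the lookup key is cleaned.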

Core Classes

class outkast.secc_caste_ln.SeccCasteLnData

Bases: object

static list_states() -> list[str]

classmethod secc_caste(df: DataFrame, namecol: str | int, state: str | None = None, year: int | None = None) -> DataFrame

Appends columns from the SECC data to the input DataFrame, matched on the last name.

Removes extra whitespace from each name and checks whether the name is in the SECC data. If it is, the output carries the data from the matching row.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the last name column.

  • namecol (str or int) – Name or integer location of the last-name column in the DataFrame.

  • state (str or None) – State whose SECC data should be used. Defaults to None, which uses all states.

  • year (int or None) – Year of the SECC data to be used. Defaults to None, which uses all years.

Returns:

Pandas DataFrame with the additional columns ‘n_sc’, ‘n_st’, ‘n_other’, ‘prop_sc’, ‘prop_st’, and ‘prop_other’, looked up by last name.

Return type:

DataFrame

Utility Functions

outkast.utils.column_exists(df: DataFrame, col: str | int) -> bool

Check whether the column exists in the DataFrame.

Parameters:
  • df – Pandas DataFrame.

  • col – Column name.

Returns:

True if the column exists, False otherwise.

outkast.utils.find_ngrams(vocab: list[str], text: str, n: int) -> list[int]

Find the n-grams of a text in the vocabulary list and return their indices.

Generates the n-grams of the given text, looks each one up in the vocabulary list, and returns the list of indices that were found.

Parameters:
  • vocab – Vocabulary list.

  • text – Input text.

  • n – Size of the n-grams.

Returns:

List of indices of the n-grams in the vocabulary list.
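The n-gram lookup can be illustrated with a small stand-alone function. This is a sketch, not the library's code: it assumes character-level n-grams and assumes that n-grams missing from the vocabulary are skipped, which the library may handle differently. The name find_ngrams_sketch and the sample vocabulary are hypothetical.

```python
def find_ngrams_sketch(vocab: list[str], text: str, n: int) -> list[int]:
    """Return vocabulary indices of the character n-grams of text."""
    # Slide a window of width n over the text to build the n-grams.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    # Record the first position of each vocabulary entry for fast lookup.
    position: dict[str, int] = {}
    for idx, entry in enumerate(vocab):
        position.setdefault(entry, idx)
    # Assumption: n-grams that are not in the vocabulary are skipped.
    return [position[g] for g in grams if g in position]


vocab = ["el", "pa", "xx", "te"]
indices = find_ngrams_sketch(vocab, "patel", 2)
print(indices)
```

Here the bigrams of "patel" are "pa", "at", "te", and "el"; "at" is absent from the sample vocabulary, so three indices come back.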

outkast.utils.fixup_columns(cols: list[Any]) -> list[str]

Replace columns given as index locations with names that carry a col prefix.

Parameters:

cols – List of original columns

Returns:

List of column names
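A rough sketch of the renaming described above, assuming that an integer location n becomes the string "col" followed by n and that existing string names pass through unchanged. The exact naming format is an assumption, and fixup_columns_sketch is a hypothetical name:

```python
from typing import Any


def fixup_columns_sketch(cols: list[Any]) -> list[str]:
    """Give integer column locations a 'col' prefix; keep other names."""
    # Assumption: location 0 is renamed to "col0", 1 to "col1", and so on.
    return [f"col{c}" if isinstance(c, int) else str(c) for c in cols]


fixed = fixup_columns_sketch([0, "name", 2])
print(fixed)
```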

Command Line Interface

Outkast also provides a command-line interface:

secc_caste input.csv -l name_column -o output.csv

Options:

  • -l, --last-name: Name or index of the column containing last names (required)

  • -s, --state: Filter by specific state (optional)

  • -y, --year: Filter by birth year (optional)

  • -o, --output: Output file name (default: secc-caste-output.csv)

Data Source

The predictions are based on the Socio Economic and Caste Census (SECC) 2011 data, which provides comprehensive demographic information about Indian households.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.

Changelog

Version 1.0.0

  • Breaking Change: Dropped support for Python < 3.11

  • Modernized codebase with type hints and modern Python features

  • Replaced deprecated pkg_resources with importlib.resources

  • Updated to modern packaging with pyproject.toml

  • Added comprehensive type annotations

  • Used f-strings throughout

  • Added match/case statements for cleaner conditional logic

Version 0.2.1

  • Previous stable release with Python 3.5+ support
