{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Usage Examples\n",
    "\n",
    "This notebook demonstrates basic usage of the ethnicolr package for predicting race and ethnicity from names."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "First, import the necessary libraries and load some sample data. If the sample CSV is not found, the cell falls back to a small generated DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "import pandas as pd\nimport ethnicolr\nfrom pathlib import Path\n\n# Load sample data\ndata_path = Path('data/input-with-header.csv')\n\ntry:\n    df = pd.read_csv(data_path)\n    print(f\"Loaded data from: {data_path}\")\nexcept FileNotFoundError:\n    # Create sample data if file not found\n    df = pd.DataFrame({\n        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],\n        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']\n    })\n    print(\"Using generated sample data\")\n\nprint(f\"Sample data shape: {df.shape}\")\nprint(\"\\nFirst few rows:\")\ndf.head()"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Census Data Lookup\n",
    "\n",
    "The simplest approach is a direct lookup: for each last name, `census_ln` appends the percentage of people with that surname in each race/ethnicity category, taken from US Census surname data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Census lookup by last name (2010 data)\n",
    "result_census = ethnicolr.census_ln(df, 'last_name', year=2010)\n",
    "print(f\"Result shape: {result_census.shape}\")\n",
    "print(\"\\nColumns added:\")\n",
    "census_cols = [col for col in result_census.columns if col not in df.columns]\n",
    "print(census_cols)\n",
    "\n",
    "# Show first few results\n",
    "result_census[['last_name', 'first_name'] + census_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Census-based LSTM Predictions\n",
    "\n",
    "For more sophisticated predictions, we can use the LSTM model trained on census data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Predict using census LSTM model\nresult_pred = ethnicolr.pred_census_ln(df, 'last_name')\nprint(f\"Result shape: {result_pred.shape}\")\nprint(\"\\nPrediction columns:\")\npred_cols = [col for col in result_pred.columns if col not in df.columns]\nprint(pred_cols)\n\n# Show the predicted category ('race') with the per-category probabilities\n# (the census model uses api, black, hispanic, white)\nresult_pred[['last_name', 'first_name', 'race', 'api', 'black', 'hispanic', 'white']].head(10)"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary Statistics\n",
    "\n",
    "Let's look at the distribution of predicted races in our sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of predicted races\n",
    "race_dist = result_pred['race'].value_counts()\n",
    "print(\"Race distribution:\")\n",
    "for race, count in race_dist.items():\n",
    "    percentage = (count / len(result_pred)) * 100\n",
    "    print(f\"{race}: {count} ({percentage:.1f}%)\")\n",
    "\n",
    "# Show some examples by race\n",
    "print(\"\\nExample predictions by race:\")\n",
    "for race in race_dist.index[:3]:\n",
    "    examples = result_pred[result_pred['race'] == race]['last_name'].head(3).tolist()\n",
    "    print(f\"{race}: {', '.join(examples)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparing Methods\n",
    "\n",
    "Let's compare the census lookup with the LSTM prediction, side by side, for the first few names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Build a side-by-side comparison DataFrame\n# Both results come from the same input DataFrame, so we align rows by index\n# (this assumes each result keeps the input's index) rather than merging on last_name\n\nprint(f\"Original data shape: {df.shape}\")\nprint(f\"Census result shape: {result_census.shape}\")\nprint(f\"Prediction result shape: {result_pred.shape}\")\n\n# Check if we have the expected columns\nprint(\"\\nCensus columns with percentages:\", [col for col in result_census.columns if 'pct' in col])\nprint(\"Prediction columns with probabilities:\", [col for col in result_pred.columns if col in ['api', 'black', 'hispanic', 'white']])\n\n# Create aligned comparison using index\ncomparison = pd.DataFrame({\n    'last_name': df['last_name'],  # Use original names for reference\n    'census_white': result_census['pctwhite'],\n    'census_black': result_census['pctblack'],\n    'census_api': result_census['pctapi'],\n    'census_hispanic': result_census['pcthispanic'],\n    'pred_race': result_pred['race'],\n    'pred_white': result_pred['white'],\n    'pred_black': result_pred['black'],\n    'pred_api': result_pred['api'],\n    'pred_hispanic': result_pred['hispanic']\n})\n\nprint(f\"\\nComparison created with {len(comparison)} names\")\nprint(\"Comparison of Census vs ML predictions (first 10 names):\")\ncomparison[['last_name', 'census_white', 'census_black', 'pred_race', 'pred_white', 'pred_black']].head(10)"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Differences\n",
    "\n",
    "- **Census lookup**: Returns, for each surname found in the census file, the percentage of people with that surname in each race/ethnicity category; surnames absent from the file get no values\n",
    "- **ML prediction**: Uses an LSTM trained on census data to output a probability for each category along with the most likely category (`race`)\n",
    "- **Use cases**: Census lookup for aggregate analysis, ML predictions for individual classification"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}