{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advanced Prediction Models\n",
    "\n",
    "This notebook demonstrates advanced ethnicity prediction from names using the Wikipedia and Florida voter registration models, including per-prediction confidence scores and the Wikipedia model's detailed ethnic categories."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Load the required libraries and sample data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "import pandas as pd\nimport ethnicolr\nfrom pathlib import Path\n\n# Load sample data\ndata_path = Path('data/input-with-header.csv')\n\ntry:\n    df = pd.read_csv(data_path)\n    print(f\"Loaded data from: {data_path}\")\nexcept FileNotFoundError:\n    # Create sample data if file not found\n    df = pd.DataFrame({\n        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],\n        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']\n    })\n    print(\"Using generated sample data\")\n\nprint(f\"Sample data shape: {df.shape}\")\nprint(\"\\nFirst few rows:\")\ndf.head()"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wikipedia-based Predictions\n",
    "\n",
    "The Wikipedia models predict finer-grained ethnic categories than the broad groups used by the census and Florida models. `pred_wiki_name` takes both the last-name and first-name columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predict using Wikipedia model with full names\n",
    "wiki_result = ethnicolr.pred_wiki_name(df, 'last_name', 'first_name')\n",
    "print(f\"Wikipedia prediction result shape: {wiki_result.shape}\")\n",
    "print(\"\\nColumns added:\")\n",
    "wiki_cols = [col for col in wiki_result.columns if col not in df.columns]\n",
    "print(wiki_cols)\n",
    "\n",
    "# Show detailed predictions\n",
    "wiki_result[['first_name', 'last_name', 'race', '__name']].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Florida Voter Registration Models\n",
    "\n",
    "The Florida models are trained on Florida voter registration data and come in a 4-category and a 5-category variant; the latter adds an `other` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Standard 4-category Florida model\nfl_result = ethnicolr.pred_fl_reg_name(df, 'last_name', 'first_name')\nprint(\"Florida 4-category predictions:\")\n# print() is required here: Jupyter only auto-displays a cell's final expression\nprint(fl_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head())\n\nprint(\"\\nRace distribution (Florida model):\")\nprint(fl_result['race'].value_counts())"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# 5-category Florida model (includes 'other' category)\nfl5_result = ethnicolr.pred_fl_reg_name_five_cat(df, 'last_name', 'first_name')\nprint(\"Florida 5-category predictions:\")\n# print() is required here: Jupyter only auto-displays a cell's final expression\nprint(fl5_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white', 'other']].head())\n\nprint(\"\\nRace distribution (Florida 5-category):\")\nprint(fl5_result['race'].value_counts())"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Last Name Only Predictions\n",
    "\n",
    "When only last names are available, the last-name-only models can still produce predictions, though they draw on less information than the full-name models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Wikipedia last name model\nwiki_ln = ethnicolr.pred_wiki_ln(df, 'last_name')\nprint(\"Wikipedia last name predictions:\")\n# print() is required here: Jupyter only auto-displays a cell's final expression\nprint(wiki_ln[['last_name', 'race']].head(10))\n\n# Florida last name model\nfl_ln = ethnicolr.pred_fl_reg_ln(df, 'last_name')\nprint(\"\\nFlorida last name predictions:\")\nfl_ln[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head(10)"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Comparison\n",
    "\n",
    "Let's compare predictions across different models for the same names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
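    "# Create comparison dataframe. pandas aligns the columns on the index, so this\n",
    "# assumes every prediction call above returned rows that keep df's original index.\n",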
    "comparison = pd.DataFrame({\n",
    "    'name': df['first_name'] + ' ' + df['last_name'],\n",
    "    'census': ethnicolr.pred_census_ln(df, 'last_name')['race'],\n",
    "    'wiki_fullname': wiki_result['race'],\n",
    "    'wiki_lastname': wiki_ln['race'],\n",
    "    'florida_4cat': fl_result['race'],\n",
    "    'florida_5cat': fl5_result['race']\n",
    "})\n",
    "\n",
    "print(\"Model comparison (first 15 names):\")\n",
    "comparison.head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confidence Analysis\n",
    "\n",
    "Let's examine how certain each prediction is. Here the confidence score of a prediction is the highest class probability the model assigns to that name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Calculate max probability (confidence) for each prediction\nfl_result['max_prob'] = fl_result[['asian', 'hispanic', 'nh_black', 'nh_white']].max(axis=1)\n\n# Show high vs low confidence predictions\nhigh_conf = fl_result[fl_result['max_prob'] > 0.8]\nlow_conf = fl_result[fl_result['max_prob'] < 0.5]\n\nprint(f\"High confidence predictions (>80%): {len(high_conf)} names\")\nprint(\"Examples:\")\nprint(high_conf[['first_name', 'last_name', 'race', 'max_prob']].head())\n\nprint(f\"\\nLow confidence predictions (<50%): {len(low_conf)} names\")\nprint(\"Examples:\")\nprint(low_conf[['first_name', 'last_name', 'race', 'max_prob']].head())"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Detailed Ethnic Categories (Wikipedia)\n",
    "\n",
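    "The Wikipedia model's `race` column holds finer-grained labels than the four or five groups of the Florida models. The next cell counts each label and lists a few example names per label."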
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show detailed ethnic categories from Wikipedia model\n",
    "print(\"Detailed ethnic categories from Wikipedia model:\")\n",
    "ethnic_dist = wiki_result['race'].value_counts()\n",
    "print(ethnic_dist)\n",
    "\n",
    "# Show examples of detailed categories\n",
    "print(\"\\nExamples by ethnic category:\")\n",
    "for category in ethnic_dist.head(5).index:\n",
    "    examples = wiki_result[wiki_result['race'] == category]['__name'].head(3).tolist()\n",
    "    print(f\"{category}: {', '.join(examples)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Selection Guidelines\n",
    "\n",
    "Choose the right model for your use case:\n",
    "\n",
    "- **Census lookup**: Best for aggregate statistics, population-level analysis\n",
    "- **Census LSTM**: Good baseline for individual predictions, 4 broad categories\n",
    "- **Wikipedia models**: Best for detailed ethnic categories, works well with diverse international names\n",
    "- **Florida models**: Good for US-focused applications, trained on actual voter data\n",
    "- **5-category models**: Include 'other' for better coverage of mixed/unknown ethnicities\n",
    "\n",
    "Always consider the confidence scores and validate results on your specific dataset."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}