{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced Prediction Models\n", "\n", "This notebook demonstrates advanced ethnicity prediction using Wikipedia and Florida voter registration models, including prediction confidence scores and detailed ethnic categories." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "Load the required libraries and sample data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "import pandas as pd\nimport ethnicolr\nfrom pathlib import Path\n\n# Load sample data\ndata_path = Path('data/input-with-header.csv')\n\ntry:\n    df = pd.read_csv(data_path)\n    print(f\"Loaded data from: {data_path}\")\nexcept FileNotFoundError:\n    # Fall back to a small generated sample if the file is not found\n    df = pd.DataFrame({\n        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],\n        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']\n    })\n    print(\"Using generated sample data\")\n\nprint(f\"Sample data shape: {df.shape}\")\nprint(\"\\nFirst few rows:\")\ndf.head()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wikipedia-based Predictions\n", "\n", "Wikipedia models provide more granular ethnic categories and work well with both first and last names."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict using Wikipedia model with full names\n", "wiki_result = ethnicolr.pred_wiki_name(df, 'last_name', 'first_name')\n", "print(f\"Wikipedia prediction result shape: {wiki_result.shape}\")\n", "print(\"\\nColumns added:\")\n", "wiki_cols = [col for col in wiki_result.columns if col not in df.columns]\n", "print(wiki_cols)\n", "\n", "# Show detailed predictions\n", "wiki_result[['first_name', 'last_name', 'race', '__name']].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Florida Voter Registration Models\n", "\n", "Florida models are trained on actual voter registration data and can provide both 4-category and 5-category predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Standard 4-category Florida model\nfl_result = ethnicolr.pred_fl_reg_name(df, 'last_name', 'first_name')\nprint(\"Florida 4-category predictions:\")\n# print() is required here: Jupyter only auto-displays a cell's last expression\nprint(fl_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head())\n\nprint(\"\\nRace distribution (Florida model):\")\nprint(fl_result['race'].value_counts())" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# 5-category Florida model (includes 'other' category)\nfl5_result = ethnicolr.pred_fl_reg_name_five_cat(df, 'last_name', 'first_name')\nprint(\"Florida 5-category predictions:\")\n# print() is required here: Jupyter only auto-displays a cell's last expression\nprint(fl5_result[['first_name', 'last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white', 'other']].head())\n\nprint(\"\\nRace distribution (Florida 5-category):\")\nprint(fl5_result['race'].value_counts())" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Last Name Only Predictions\n", "\n", "When only last names are available, the last-name-only variants of the Wikipedia and Florida models can be used instead."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Wikipedia last name model\nwiki_ln = ethnicolr.pred_wiki_ln(df, 'last_name')\nprint(\"Wikipedia last name predictions:\")\n# print() is required here: Jupyter only auto-displays a cell's last expression\nprint(wiki_ln[['last_name', 'race']].head(10))\n\n# Florida last name model\nfl_ln = ethnicolr.pred_fl_reg_ln(df, 'last_name')\nprint(\"\\nFlorida last name predictions:\")\nfl_ln[['last_name', 'race', 'asian', 'hispanic', 'nh_black', 'nh_white']].head(10)" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Comparison\n", "\n", "Let's compare predictions across different models for the same names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create comparison dataframe\n", "# pandas aligns these Series on their index, so this assumes each model's output keeps df's index\n", "comparison = pd.DataFrame({\n", "    'name': df['first_name'] + ' ' + df['last_name'],\n", "    'census': ethnicolr.pred_census_ln(df, 'last_name')['race'],\n", "    'wiki_fullname': wiki_result['race'],\n", "    'wiki_lastname': wiki_ln['race'],\n", "    'florida_4cat': fl_result['race'],\n", "    'florida_5cat': fl5_result['race']\n", "})\n", "\n", "print(\"Model comparison (first 15 names):\")\n", "comparison.head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confidence Analysis\n", "\n", "Let's examine the confidence scores to understand prediction certainty."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Calculate max probability (confidence) for each prediction\nfl_result['max_prob'] = fl_result[['asian', 'hispanic', 'nh_black', 'nh_white']].max(axis=1)\n\n# Show high vs low confidence predictions\nhigh_conf = fl_result[fl_result['max_prob'] > 0.8]\nlow_conf = fl_result[fl_result['max_prob'] < 0.5]\n\nprint(f\"High confidence predictions (>80%): {len(high_conf)} names\")\nprint(\"Examples:\")\nprint(high_conf[['first_name', 'last_name', 'race', 'max_prob']].head())\n\nprint(f\"\\nLow confidence predictions (<50%): {len(low_conf)} names\")\nprint(\"Examples:\")\nprint(low_conf[['first_name', 'last_name', 'race', 'max_prob']].head())" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Detailed Ethnic Categories (Wikipedia)\n", "\n", "The Wikipedia model provides finer-grained ethnic categories than the Census and Florida models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show detailed ethnic categories from Wikipedia model\n", "print(\"Detailed ethnic categories from Wikipedia model:\")\n", "ethnic_dist = wiki_result['race'].value_counts()\n", "print(ethnic_dist)\n", "\n", "# Show up to three example names for each of the five most common categories\n", "print(\"\\nExamples by ethnic category:\")\n", "for category in ethnic_dist.head(5).index:\n", "    examples = wiki_result[wiki_result['race'] == category]['__name'].head(3).tolist()\n", "    print(f\"{category}: {', '.join(examples)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Selection Guidelines\n", "\n", "Choose the right model for your use case:\n", "\n", "- **Census lookup**: Best for aggregate statistics, population-level analysis\n", "- **Census LSTM**: Good baseline for individual predictions, 4 broad categories\n", "- **Wikipedia models**: Best for detailed ethnic categories, works well with diverse international names\n", "- **Florida models**: Good for US-focused applications, trained
on actual voter data\n", "- **5-category models**: Include 'other' for better coverage of mixed/unknown ethnicities\n", "\n", "Always consider the confidence scores and validate results on your specific dataset." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 4 }