{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic Usage Examples\n", "\n", "This notebook demonstrates basic usage of the ethnicolr package for predicting race and ethnicity from names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "First, let's import the necessary libraries and load some sample data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "import pandas as pd\nimport ethnicolr\nfrom pathlib import Path\n\n# Load sample data\ndata_path = Path('data/input-with-header.csv')\n\ntry:\n    df = pd.read_csv(data_path)\n    print(f\"Loaded data from: {data_path}\")\nexcept FileNotFoundError:\n    # Create sample data if file not found\n    df = pd.DataFrame({\n        'first_name': ['John', 'Maria', 'David', 'Sarah', 'Michael'],\n        'last_name': ['Smith', 'Garcia', 'Johnson', 'Davis', 'Brown']\n    })\n    print(\"Using generated sample data\")\n\nprint(f\"Sample data shape: {df.shape}\")\nprint(\"\\nFirst few rows:\")\ndf.head()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Census Data Lookup\n", "\n", "The simplest approach is to look up demographic probabilities by last name using US Census data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Census lookup by last name (2010 data)\n", "result_census = ethnicolr.census_ln(df, 'last_name', year=2010)\n", "print(f\"Result shape: {result_census.shape}\")\n", "print(\"\\nColumns added:\")\n", "census_cols = [col for col in result_census.columns if col not in df.columns]\n", "print(census_cols)\n", "\n", "# Show first few results\n", "result_census[['last_name', 'first_name'] + census_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Census-based LSTM Predictions\n", "\n", "For more sophisticated predictions, we can use the census-based LSTM model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Predict using census LSTM model\nresult_pred = ethnicolr.pred_census_ln(df, 'last_name')\nprint(f\"Result shape: {result_pred.shape}\")\nprint(\"\\nPrediction columns:\")\npred_cols = [col for col in result_pred.columns if col not in df.columns]\nprint(pred_cols)\n\n# Show predictions with per-category probabilities (census model uses api, black, hispanic, white)\nresult_pred[['last_name', 'first_name', 'race', 'api', 'black', 'hispanic', 'white']].head(10)" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary Statistics\n", "\n", "Let's look at the distribution of predicted races in our sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distribution of predicted races\n", "race_dist = result_pred['race'].value_counts()\n", "print(\"Race distribution:\")\n", "for race, count in race_dist.items():\n", "    percentage = (count / len(result_pred)) * 100\n", "    print(f\"{race}: {count} ({percentage:.1f}%)\")\n", "\n", "# Show some examples by race\n", "print(\"\\nExample predictions by race:\")\n", "for race in race_dist.index[:3]:\n", "    examples = result_pred[result_pred['race'] == race]['last_name'].head(3).tolist()\n", "    print(f\"{race}: {', '.join(examples)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing Methods\n", "\n", "Let's compare the census lookup vs. ML prediction for a few names." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Create comparison dataframe\n# Both functions process the same input DataFrame, so we align by index\n# rather than merging on potentially different last_name values.\n# This assumes neither call dropped or reordered rows; compare the shapes below.\n\nprint(f\"Original data shape: {df.shape}\")\nprint(f\"Census result shape: {result_census.shape}\")\nprint(f\"Prediction result shape: {result_pred.shape}\")\n\n# Check if we have the expected columns\nprint(\"\\nCensus columns with percentages:\", [col for col in result_census.columns if 'pct' in col])\nprint(\"Prediction columns with probabilities:\", [col for col in result_pred.columns if col in ['api', 'black', 'hispanic', 'white']])\n\n# Create aligned comparison using index\ncomparison = pd.DataFrame({\n    'last_name': df['last_name'],  # Use original names for reference\n    'census_white': result_census['pctwhite'],\n    'census_black': result_census['pctblack'],\n    'census_api': result_census['pctapi'],\n    'census_hispanic': result_census['pcthispanic'],\n    'pred_race': result_pred['race'],\n    'pred_white': result_pred['white'],\n    'pred_black': result_pred['black'],\n    'pred_api': result_pred['api'],\n    'pred_hispanic': result_pred['hispanic']\n})\n\nprint(f\"\\nComparison created with {len(comparison)} names\")\nprint(\"Comparison of Census vs ML predictions (first 10 names):\")\ncomparison[['last_name', 'census_white', 'census_black', 'pred_race', 'pred_white', 'pred_black']].head(10)" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Differences\n", "\n", "- **Census lookup**: Returns population-level probabilities for each race/ethnicity based on surname frequency in census data\n", "- **ML prediction**: Uses neural networks trained on census data to predict the most likely race/ethnicity category\n", "- **Use cases**: Census lookup for aggregate analysis, ML predictions for individual classification" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": 
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 4 }