Command Line Interface

ethnicolr2 provides command-line tools for batch processing of CSV files, making it easy to integrate into data processing pipelines.

Available Commands

census_ln - Census Statistics Lookup

Get census surname statistics without machine learning:

census_ln input.csv -l last_name -o output.csv -y 2010

Options:

  • -l, --last: Column name or index containing last names

  • -o, --output: Output CSV filename

  • -y, --year: Census year (2000 or 2010, default: 2000)

pred_census_last_name - Census LSTM Predictions

Machine learning predictions using census-trained models:

pred_census_last_name input.csv -l surname -o predictions.csv -y 2010

pred_fl_last_name - Florida Last Name Model

High-accuracy predictions using Florida voter data:

pred_fl_last_name input.csv -l last_name -o fl_predictions.csv

pred_fl_full_name - Florida Full Name Model

Highest accuracy using both first and last names:

pred_fl_full_name input.csv -l last_name -f first_name -o full_predictions.csv

Options:

  • -l, --last: Last name column

  • -f, --first: First name column

  • -o, --output: Output filename
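
The same models can also be called from Python rather than the shell. A minimal sketch, assuming the package exposes functions named after the CLI commands with keyword arguments mirroring the -l/-f flags (check help(ethnicolr2) for the exact signatures):

import pandas as pd
import ethnicolr2

# Function and argument names below are assumptions based on the CLI flags;
# verify with help(ethnicolr2.pred_fl_full_name) before relying on them.
df = pd.read_csv("input.csv")
out = ethnicolr2.pred_fl_full_name(df, lname_col="last_name", fname_col="first_name")
out.to_csv("full_predictions.csv", index=False)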

Input File Formats

With Headers

first_name,last_name,employee_id
John,Smith,12345
Maria,Rodriguez,12346
Wei,Zhang,12347

# Use column names
pred_fl_full_name employees.csv -l last_name -f first_name -o results.csv

Without Headers

John,Smith,12345
Maria,Rodriguez,12346
Wei,Zhang,12347

# Use column indices (0-based)
pred_fl_full_name employees.csv -l 1 -f 0 -o results.csv

Practical Examples

Process Employee Database

# Input: employees.csv
# Columns: emp_id,first_name,last_name,department,salary

# Get demographic predictions for HR analysis
pred_fl_full_name employees.csv \
  --last last_name \
  --first first_name \
  --output employee_demographics.csv

# Results include original data + predictions
head employee_demographics.csv

Academic Research Dataset

# Input: research_authors.csv
# Columns: paper_id,author_surname,institution,field

# Use census model for academic validation
pred_census_last_name research_authors.csv \
  --last author_surname \
  --output author_demographics.csv \
  --year 2010

Customer Analysis Pipeline

#!/bin/bash
# Pipeline for customer demographic analysis

# Step 1: Extract customer names
cut -d',' -f2,3 customers_full.csv > customer_names.csv

# Step 2: Get demographic predictions
pred_fl_last_name customer_names.csv \
  -l 1 \
  -o customer_demographics.csv

# Step 3: Merge back with original data
python merge_results.py customers_full.csv customer_demographics.csv
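
merge_results.py is not shipped with ethnicolr2; a minimal sketch of what it might contain, assuming the prediction output preserves the input row order so a positional join is safe:

#!/usr/bin/env python
"""Sketch of merge_results.py: attach predictions back onto the full customer file."""
import sys

import pandas as pd

full_path, demo_path = sys.argv[1], sys.argv[2]
full = pd.read_csv(full_path)
demo = pd.read_csv(demo_path)

# Keep only the columns the prediction step added, to avoid duplicating the name fields
new_cols = [c for c in demo.columns if c not in full.columns]

# Positional join: assumes row order is unchanged between the two files
merged = pd.concat([full.reset_index(drop=True), demo[new_cols].reset_index(drop=True)], axis=1)
merged.to_csv("customers_with_demographics.csv", index=False)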

Batch Processing Multiple Files

#!/bin/bash
# Process multiple CSV files

for file in data/*.csv; do
    echo "Processing $file..."
    pred_fl_last_name "$file" \\
      -l last_name \\
      -o "results/$(basename "$file" .csv)_demographics.csv"
done

Performance Tips

Large Files

For files larger than 100MB, consider splitting:

# Split large file into chunks (only the first chunk keeps the header row,
# so the loop below addresses the last-name column by 0-based index)
split -l 50000 large_dataset.csv chunk_

# Process chunks in parallel
for chunk in chunk_*; do
    pred_fl_last_name "$chunk" -l 1 -o "results_$chunk.csv" &
done
wait

# Combine results (strip any repeated header rows before analysis)
cat results_chunk_*.csv > final_results.csv

Memory Usage

Monitor memory usage for very large datasets:

# Check available memory
free -h

# Run with memory monitoring
/usr/bin/time -v pred_fl_last_name large_file.csv -l last_name -o output.csv

Error Handling

Common Issues

FileNotFoundError:

# Check file exists
ls -la input.csv

# Use absolute path if needed
pred_fl_last_name /full/path/to/input.csv -l last_name -o output.csv

Column Not Found:

# Check column names
head -1 input.csv

# Use correct column name or index
pred_fl_last_name input.csv -l "Last Name" -o output.csv  # with spaces
pred_fl_last_name input.csv -l 2 -o output.csv           # by index

Permission Errors:

# Check write permissions
ls -la output_directory/

# Use different output location
pred_fl_last_name input.csv -l last_name -o ~/Desktop/output.csv

Validation

Verify results before using them:

# Check output file structure
head output.csv
wc -l input.csv output.csv  # Should have same number of lines

# Quick statistics: cut takes field numbers, not names, so replace 4 with the
# position of the predicted-race column in your output
cut -d',' -f4 output.csv | sort | uniq -c
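
For a check that does not depend on column positions, the same validation can be done in pandas (file names here are illustrative):

import pandas as pd

inp = pd.read_csv("input.csv")
out = pd.read_csv("output.csv")

# Every input row should get exactly one prediction row
assert len(inp) == len(out), f"{len(inp)} input rows vs {len(out)} output rows"

# Summarize whatever columns the prediction step added
added = [c for c in out.columns if c not in inp.columns]
print(out[added].describe(include="all"))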

Integration with Data Pipelines

Apache Airflow

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

dag = DAG('demographic_analysis',
          start_date=datetime(2023, 1, 1),
          schedule_interval='@daily')

predict_demographics = BashOperator(
    task_id='predict_demographics',
    bash_command='''
    pred_fl_last_name /data/daily_customers.csv \\
      -l last_name \\
      -o /data/demographics_{{ ds }}.csv
    ''',
    dag=dag
)

Make/Unix Pipelines

# Makefile for demographic analysis

demographics.csv: raw_data.csv
	pred_fl_last_name $< -l last_name -o $@

analysis.html: demographics.csv
	python generate_report.py $< > $@

clean:
	rm -f demographics.csv analysis.html

Docker Integration

FROM python:3.11-slim

RUN pip install ethnicolr2

WORKDIR /app
COPY process.sh /app/

ENTRYPOINT ["./process.sh"]

# Run in Docker (process.sh is expected to pass its arguments through to the CLI, e.g. with `exec "$@"`)
docker run -v $(pwd)/data:/app/data demographic-processor \
  pred_fl_last_name /app/data/input.csv -l last_name -o /app/data/output.csv

Output Customization

Selecting Columns

Most CLI tools output all original columns plus predictions. To select specific columns:

# Process, then select columns by field number (cut does not accept column names)
pred_fl_last_name input.csv -l last_name -o full_output.csv
cut -d',' -f1,2,5,6,7 full_output.csv > selected_output.csv  # adjust field numbers to your layout
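
To select columns by name rather than position, pandas is more convenient (the names below are illustrative; inspect the output header for the real ones):

import pandas as pd

full = pd.read_csv("full_output.csv")
print(full.columns.tolist())  # check the actual column names first

# Replace with the columns you actually want to keep
keep = ["first_name", "last_name", "race"]
full[keep].to_csv("selected_output.csv", index=False)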

Formatting

# Convert to TSV (simple tr works only if no field contains a quoted comma)
pred_fl_last_name input.csv -l last_name -o temp.csv
tr ',' '\t' < temp.csv > output.tsv

# Replace the header row if needed (make sure the fields match your actual columns)
echo -e "name\trace\tconfidence" > final.tsv
tail -n +2 output.tsv >> final.tsv
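
If pandas is available, the same conversion is a one-liner that also handles quoted fields correctly (file names are illustrative):

import pandas as pd

# Re-write the prediction output as TSV, keeping the header row intact
pd.read_csv("temp.csv").to_csv("output.tsv", sep="\t", index=False)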

Getting Help

# Get help for any command
census_ln --help
pred_fl_last_name --help
pred_fl_full_name --help

# Check version
python -c "import ethnicolr2; print(ethnicolr2.__version__)"

Next Steps