Command Line Interface

ethnicolr2 provides command-line tools for batch processing of CSV files, making it easy to integrate into data processing pipelines.

Available Commands

census_ln - Census Statistics Lookup

Get census surname statistics without machine learning:

census_ln input.csv -l last_name -o output.csv -y 2010

Options:

  • -l, --last: Column name or index containing last names

  • -o, --output: Output CSV filename

  • -y, --year: Census year (2000 or 2010, default: 2000)

pred_census_last_name - Census LSTM Predictions

Machine learning predictions using census-trained models:

pred_census_last_name input.csv -l surname -o predictions.csv -y 2010

pred_fl_last_name - Florida Last Name Model

High-accuracy predictions using Florida voter data:

pred_fl_last_name input.csv -l last_name -o fl_predictions.csv

pred_fl_full_name - Florida Full Name Model

Highest accuracy using both first and last names:

pred_fl_full_name input.csv -l last_name -f first_name -o full_predictions.csv

Options:

  • -l, --last: Last name column

  • -f, --first: First name column

  • -o, --output: Output filename
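
The same models can also be called from Python rather than the shell. A minimal sketch, assuming the package exposes functions named after the CLI commands with keyword arguments mirroring the -l/-f flags (check help(ethnicolr2) for the exact signatures):

import pandas as pd
import ethnicolr2

# Function and argument names below are assumptions based on the CLI flags;
# verify with help(ethnicolr2.pred_fl_full_name) before relying on them.
df = pd.read_csv("input.csv")
out = ethnicolr2.pred_fl_full_name(df, lname_col="last_name", fname_col="first_name")
out.to_csv("full_predictions.csv", index=False)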

Input File Formats

With Headers

first_name,last_name,employee_id
John,Smith,12345
Maria,Rodriguez,12346
Wei,Zhang,12347

# Use column names
pred_fl_full_name employees.csv -l last_name -f first_name -o results.csv

Without Headers

John,Smith,12345
Maria,Rodriguez,12346
Wei,Zhang,12347

# Use column indices (0-based)
pred_fl_full_name employees.csv -l 1 -f 0 -o results.csv

Practical Examples

Process Employee Database

# Input: employees.csv
# Columns: emp_id,first_name,last_name,department,salary

# Get demographic predictions for HR analysis
pred_fl_full_name employees.csv \
  --last last_name \
  --first first_name \
  --output employee_demographics.csv

# Results include original data + predictions
head employee_demographics.csv

Academic Research Dataset

# Input: research_authors.csv
# Columns: paper_id,author_surname,institution,field

# Use census model for academic validation
pred_census_last_name research_authors.csv \
  --last author_surname \
  --output author_demographics.csv \
  --year 2010

Customer Analysis Pipeline

#!/bin/bash
# Pipeline for customer demographic analysis

# Step 1: Extract customer names
cut -d',' -f2,3 customers_full.csv > customer_names.csv

# Step 2: Get demographic predictions
pred_fl_last_name customer_names.csv \
  -l 1 \
  -o customer_demographics.csv

# Step 3: Merge back with original data
python merge_results.py customers_full.csv customer_demographics.csv
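
merge_results.py is not shipped with ethnicolr2; a minimal sketch of what it might contain, assuming the prediction output preserves the input row order so a positional join is safe:

#!/usr/bin/env python
"""Sketch of merge_results.py: attach predictions back onto the full customer file."""
import sys

import pandas as pd

full_path, demo_path = sys.argv[1], sys.argv[2]
full = pd.read_csv(full_path)
demo = pd.read_csv(demo_path)

# Keep only the columns the prediction step added, to avoid duplicating the name fields
new_cols = [c for c in demo.columns if c not in full.columns]

# Positional join: assumes row order is unchanged between the two files
merged = pd.concat([full.reset_index(drop=True), demo[new_cols].reset_index(drop=True)], axis=1)
merged.to_csv("customers_with_demographics.csv", index=False)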

Batch Processing Multiple Files

#!/bin/bash
# Process multiple CSV files

for file in data/*.csv; do
    echo "Processing $file..."
    pred_fl_last_name "$file" \\
      -l last_name \\
      -o "results/$(basename "$file" .csv)_demographics.csv"
done

Performance Tips

Large Files

For files larger than 100MB, consider splitting:

# Split large file into chunks (only the first chunk keeps the header row,
# so the loop below addresses the last-name column by 0-based index)
split -l 50000 large_dataset.csv chunk_

# Process chunks in parallel
for chunk in chunk_*; do
    pred_fl_last_name "$chunk" -l 1 -o "results_$chunk.csv" &
done
wait

# Combine results (strip any repeated header rows before analysis)
cat results_chunk_*.csv > final_results.csv

Memory Usage

Monitor memory usage for very large datasets:

# Check available memory
free -h

# Run with memory monitoring
/usr/bin/time -v pred_fl_last_name large_file.csv -l last_name -o output.csv

Error Handling

Common Issues

FileNotFoundError:

# Check file exists
ls -la input.csv

# Use absolute path if needed
pred_fl_last_name /full/path/to/input.csv -l last_name -o output.csv

Column Not Found:

# Check column names
head -1 input.csv

# Use correct column name or index
pred_fl_last_name input.csv -l "Last Name" -o output.csv  # with spaces
pred_fl_last_name input.csv -l 2 -o output.csv           # by index

Permission Errors:

# Check write permissions
ls -la output_directory/

# Use different output location
pred_fl_last_name input.csv -l last_name -o ~/Desktop/output.csv

Validation

Verify results before using them:

# Check output file structure
head output.csv
wc -l input.csv output.csv  # Should have same number of lines

# Quick statistics: cut takes field numbers, not names, so replace 4 with the
# position of the predicted-race column in your output
cut -d',' -f4 output.csv | sort | uniq -c
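
For a check that does not depend on column positions, the same validation can be done in pandas (file names here are illustrative):

import pandas as pd

inp = pd.read_csv("input.csv")
out = pd.read_csv("output.csv")

# Every input row should get exactly one prediction row
assert len(inp) == len(out), f"{len(inp)} input rows vs {len(out)} output rows"

# Summarize whatever columns the prediction step added
added = [c for c in out.columns if c not in inp.columns]
print(out[added].describe(include="all"))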

Integration with Data Pipelines

Apache Airflow

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

dag = DAG('demographic_analysis',
          start_date=datetime(2023, 1, 1),
          schedule_interval='@daily')

predict_demographics = BashOperator(
    task_id='predict_demographics',
    bash_command='''
    pred_fl_last_name /data/daily_customers.csv \\
      -l last_name \\
      -o /data/demographics_{{ ds }}.csv
    ''',
    dag=dag
)

Make/Unix Pipelines

# Makefile for demographic analysis

demographics.csv: raw_data.csv
	pred_fl_last_name $< -l last_name -o $@

analysis.html: demographics.csv
	python generate_report.py $< > $@

clean:
	rm -f demographics.csv analysis.html

Docker Integration

FROM python:3.11-slim

RUN pip install ethnicolr2

WORKDIR /app
COPY process.sh /app/

ENTRYPOINT ["./process.sh"]

# Run in Docker (process.sh is expected to pass its arguments through to the CLI, e.g. with `exec "$@"`)
docker run -v $(pwd)/data:/app/data demographic-processor \
  pred_fl_last_name /app/data/input.csv -l last_name -o /app/data/output.csv

Output Customization

Selecting Columns

Most CLI tools output all original columns plus predictions. To select specific columns:

# Process, then select columns by field number (cut does not accept column names)
pred_fl_last_name input.csv -l last_name -o full_output.csv
cut -d',' -f1,2,5,6,7 full_output.csv > selected_output.csv  # adjust field numbers to your layout
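
To select columns by name rather than position, pandas is more convenient (the names below are illustrative; inspect the output header for the real ones):

import pandas as pd

full = pd.read_csv("full_output.csv")
print(full.columns.tolist())  # check the actual column names first

# Replace with the columns you actually want to keep
keep = ["first_name", "last_name", "race"]
full[keep].to_csv("selected_output.csv", index=False)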

Formatting

# Convert to TSV (simple tr works only if no field contains a quoted comma)
pred_fl_last_name input.csv -l last_name -o temp.csv
tr ',' '\t' < temp.csv > output.tsv

# Replace the header row if needed (make sure the fields match your actual columns)
echo -e "name\trace\tconfidence" > final.tsv
tail -n +2 output.tsv >> final.tsv
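
If pandas is available, the same conversion is a one-liner that also handles quoted fields correctly (file names are illustrative):

import pandas as pd

# Re-write the prediction output as TSV, keeping the header row intact
pd.read_csv("temp.csv").to_csv("output.tsv", sep="\t", index=False)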

Getting Help

# Get help for any command
census_ln --help
pred_fl_last_name --help
pred_fl_full_name --help

# Check version
python -c "import ethnicolr2; print(ethnicolr2.__version__)"

Next Steps