Assignment 3

End-to-End HuggingFace Model Training & Docker Deployment

Overview

Objective

Fine-tune DistilBERT on Goodreads book reviews for genre classification. Convert a notebook into production scripts, containerize with Docker, and publish to HuggingFace Hub.

Key Tasks

  • Notebook → modular Python scripts
  • Docker training & eval images
  • HuggingFace Trainer API
  • Push model to HuggingFace Hub
  • Local vs. Hub evaluation comparison

Technologies

HuggingFace DistilBERT PyTorch Docker CUDA

Model Selection

Why DistilBERT (distilbert-base-cased)?

Criterion | Justification
Efficiency | 40% fewer parameters than BERT-base (66M vs. 110M)
Speed | ~60% faster inference, well suited to Docker containers
Performance | Retains ~97% of BERT's performance on the GLUE benchmark
Cased Variant | Preserves capitalization, which is meaningful in book reviews (titles, character names)
Task Fit | Classification head for multi-class genre prediction via AutoModelForSequenceClassification
Ecosystem | Trainer API, fast tokenizer, one-command push to the Hub
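A minimal sketch of loading distilbert-base-cased with an 8-way classification head. The label mapping mirrors the genres reported below; the exact variable names and mapping order used in the project are assumptions.

```python
# Sketch: distilbert-base-cased with an 8-way genre classification head.
# The genre order here is an assumption, not the project's exact mapping.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

genres = [
    "Children", "Comics & Graphic", "Fantasy & Paranormal",
    "History & Biography", "Mystery, Thriller & Crime",
    "Poetry", "Romance", "Young Adult",
]
id2label = {i: g for i, g in enumerate(genres)}
label2id = {g: i for i, g in enumerate(genres)}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased",
    num_labels=len(genres),   # adds a fresh, randomly initialized head
    id2label=id2label,
    label2id=label2id,
)
```

Passing `id2label`/`label2id` makes the Hub model card and inference outputs show genre names instead of `LABEL_0`…`LABEL_7`.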

Training & Results

  • 59.9% training accuracy (on the same test split)
  • 60.0% weighted F1 (training evaluation)
  • 8 genres (book review classes)
  • 6,400 train samples (800 per genre)

Training Configuration

Parameter | Value
Base Model | distilbert-base-cased
Dataset | UCSD Goodreads Book Reviews
Epochs | 3
Learning Rate | 5e-5 (AdamW)
Batch Size (Train / Eval) | 10 / 16
Warmup Steps | 100
Weight Decay | 0.01
Max Sequence Length | 512 tokens

Training Evaluation Results

Metric | Value
Loss | 1.2379
Accuracy | 59.88%
Precision | 60.44%
Recall | 59.88%
F1 (weighted) | 60.01%
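These four metrics are typically produced by a compute_metrics callback passed to the Trainer. The sketch below uses scikit-learn with weighted averaging, matching the reported numbers; the exact implementation in src/train.py is an assumption.

```python
# Sketch of a Trainer-style compute_metrics callback producing
# accuracy plus weighted precision/recall/F1.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)          # predicted class per example
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With weighted averaging, recall equals accuracy on a report like this one, which is why the two rows match at 59.88%.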

Evaluation Comparison: Local vs. HuggingFace Hub

Post-training evaluations re-sample fresh test data from the same source URLs, so each run sees a different random set of reviews; this causes metric variance relative to the training-time evaluation.

Metric | Training Eval | Local Model | Hub Model
Accuracy | 59.88% | 36.75% | 36.56%
Precision | 60.44% | 42.50% | 45.28%
Recall | 59.88% | 36.75% | 36.56%
F1 (weighted) | 60.01% | 37.71% | 38.38%

Per-Genre Classification Report (Local Model)

Genre | Precision | Recall | F1-Score | Support
Children | 0.38 | 0.51 | 0.44 | 200
Comics & Graphic | 0.58 | 0.39 | 0.46 | 200
Fantasy & Paranormal | 0.28 | 0.20 | 0.23 | 200
History & Biography | 0.48 | 0.33 | 0.39 | 200
Mystery, Thriller & Crime | 0.47 | 0.30 | 0.37 | 200
Poetry | 0.66 | 0.47 | 0.55 | 200
Romance | 0.19 | 0.48 | 0.27 | 200
Young Adult | 0.35 | 0.26 | 0.30 | 200
Accuracy | | | 0.37 | 1600
Weighted Avg | 0.43 | 0.37 | 0.38 | 1600

Per-Genre Classification Report (Hub Model)

Genre | Precision | Recall | F1-Score | Support
Children | 0.41 | 0.51 | 0.45 | 200
Comics & Graphic | 0.59 | 0.35 | 0.44 | 200
Fantasy & Paranormal | 0.34 | 0.20 | 0.26 | 200
History & Biography | 0.46 | 0.30 | 0.36 | 200
Mystery, Thriller & Crime | 0.52 | 0.33 | 0.40 | 200
Poetry | 0.74 | 0.48 | 0.58 | 200
Romance | 0.16 | 0.47 | 0.24 | 200
Young Adult | 0.39 | 0.28 | 0.33 | 200
Accuracy | | | 0.37 | 1600
Weighted Avg | 0.45 | 0.37 | 0.38 | 1600
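Per-genre tables like the two above come straight out of scikit-learn's classification_report; here is a self-contained sketch, with random toy labels standing in for the real model predictions.

```python
# Sketch: generating a per-genre report with sklearn's classification_report.
# y_true / y_pred are toy random stand-ins, not the project's predictions.
import numpy as np
from sklearn.metrics import classification_report

genres = [
    "Children", "Comics & Graphic", "Fantasy & Paranormal",
    "History & Biography", "Mystery, Thriller & Crime",
    "Poetry", "Romance", "Young Adult",
]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 8, size=1600)   # toy ground-truth labels
y_pred = rng.integers(0, 8, size=1600)   # toy predictions

report = classification_report(
    y_true, y_pred,
    labels=list(range(8)),
    target_names=genres,
    zero_division=0,
)
print(report)
```

Passing `target_names` is what puts human-readable genre names in the left column instead of integer label ids.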

Analysis

  • Local ≈ Hub: Metrics are nearly identical (~1% variance), confirming correct upload/download from HuggingFace Hub. Differences are due to different random test samples.
  • Training vs. Post-training eval: Training eval (59.88%) used the same test split; post-training evals (36-37%) used freshly sampled reviews — different data, different difficulty.
  • Best genres: Poetry (0.55-0.58 F1) and Comics & Graphic (0.44-0.46 F1) — distinctive review language.
  • Hardest genres: Fantasy & Paranormal (0.23-0.26 F1) and Romance (0.24-0.27 F1) — overlapping vocabulary with other genres.

Docker Image Build Instructions

🏋️ Training Image

Dockerfile: Dockerfile

# Build
docker build -t mlops-train .

# Run (GPU)
docker run --gpus all mlops-train

# Run + push to Hub
docker run --gpus all \
  -e HF_TOKEN=your_token \
  mlops-train \
  python src/train.py \
    --push_to_hub \
    --hf_username NisargUpadhyay

🔍 Evaluation Image

Dockerfile: Dockerfile.eval

# Build
docker build -f Dockerfile.eval \
  -t mlops-eval .

# Run (auto-evaluates from HF Hub)
docker run mlops-eval

# Check logs
docker run -d --name eval mlops-eval
docker logs -f eval

Docker Commands Reference

Command | Purpose
docker build -t <name> . | Build image from Dockerfile
docker build -f <file> -t <name> . | Build from a specific Dockerfile
docker run <image> | Run container (foreground)
docker run -d --name <name> <image> | Run container (background)
docker logs -f <container> | Follow logs in real time
docker ps -a | List all containers, including stopped ones

Challenges & Solutions

# | Challenge | Solution
1 | Large data downloads (~1 GB gzip per genre) | Streamed with requests.get(stream=True); limited to 10K reviews per genre, then sampled 2K
2 | Colab magic commands in notebook (!pip, %matplotlib) | Removed all magic commands; restructured into 4 clean modules
3 | Random test-data variance across runs | Documented as expected behavior; results are consistent on same-split evaluations
4 | HuggingFace model download throttling | Used HF's built-in caching and retry logic
5 | WandB API key requirement in Trainer | Disabled with WANDB_DISABLED=true and report_to=[]
6 | Genre overlap (Fantasy vs. Young Adult) | Real-world limitation; reviews share similar language across genres
7 | Docker eval image without GPU | Model (263 MB) is small enough for CPU inference