Assignment 3

End-to-End HuggingFace Model Training & Docker Deployment

Overview

Objective

Fine-tune DistilBERT on Goodreads book reviews for genre classification. Convert a notebook into production scripts, containerize with Docker, and publish to HuggingFace Hub.

Key Tasks

  • Notebook → modular Python scripts
  • Docker training & eval images
  • HuggingFace Trainer API
  • Push model to HuggingFace Hub
  • Local vs. Hub evaluation comparison

Technologies

HuggingFace DistilBERT PyTorch Docker CUDA

Model Selection

Why DistilBERT (distilbert-base-cased)?

Criterion | Justification
Efficiency | 40% fewer parameters than BERT-base (66M vs. 110M)
Speed | ~60% faster inference, well suited to Docker containers
Performance | Retains ~97% of BERT's performance on the GLUE benchmark
Cased Variant | Preserves capitalization, which is meaningful in book reviews (titles, character names)
Task Fit | Classification head for multi-class genre prediction via AutoModelForSequenceClassification
Ecosystem | Trainer API, fast tokenizer, one-command push to the Hub
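A minimal sketch of loading distilbert-base-cased with an 8-way classification head. The label mapping mirrors the genres reported below; the exact variable names and mapping order used in the project are assumptions.

```python
# Sketch: distilbert-base-cased with an 8-way genre classification head.
# The genre order here is an assumption, not the project's exact mapping.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

genres = [
    "Children", "Comics & Graphic", "Fantasy & Paranormal",
    "History & Biography", "Mystery, Thriller & Crime",
    "Poetry", "Romance", "Young Adult",
]
id2label = {i: g for i, g in enumerate(genres)}
label2id = {g: i for i, g in enumerate(genres)}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased",
    num_labels=len(genres),   # adds a fresh, randomly initialized head
    id2label=id2label,
    label2id=label2id,
)
```

Passing `id2label`/`label2id` makes the Hub model card and inference outputs show genre names instead of `LABEL_0`…`LABEL_7`.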

Training & Results

  • 59.9% training accuracy (on the same test split)
  • 60.0% weighted F1 (training evaluation)
  • 8 genres (book review classes)
  • 6,400 train samples (800 per genre)

Training Configuration

Parameter | Value
Base Model | distilbert-base-cased
Dataset | UCSD Goodreads Book Reviews
Epochs | 3
Learning Rate | 5e-5 (AdamW)
Batch Size (Train / Eval) | 10 / 16
Warmup Steps | 100
Weight Decay | 0.01
Max Sequence Length | 512 tokens

Training Evaluation Results

Metric | Value
Loss | 1.2379
Accuracy | 59.88%
Precision | 60.44%
Recall | 59.88%
F1 (weighted) | 60.01%
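These four metrics are typically produced by a compute_metrics callback passed to the Trainer. The sketch below uses scikit-learn with weighted averaging, matching the reported numbers; the exact implementation in src/train.py is an assumption.

```python
# Sketch of a Trainer-style compute_metrics callback producing
# accuracy plus weighted precision/recall/F1.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)          # predicted class per example
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With weighted averaging, recall equals accuracy on a report like this one, which is why the two rows match at 59.88%.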

Evaluation Comparison: Local vs. HuggingFace Hub

Post-training evaluations re-sample fresh test data from the same source URLs, so each run sees a different random set of reviews; this causes metric variance relative to the training-time evaluation.

Metric | Training Eval | Local Model | Hub Model
Accuracy | 59.88% | 36.75% | 36.56%
Precision | 60.44% | 42.50% | 45.28%
Recall | 59.88% | 36.75% | 36.56%
F1 (weighted) | 60.01% | 37.71% | 38.38%

Per-Genre Classification Report (Local Model)

Genre | Precision | Recall | F1-Score | Support
Children | 0.38 | 0.51 | 0.44 | 200
Comics & Graphic | 0.58 | 0.39 | 0.46 | 200
Fantasy & Paranormal | 0.28 | 0.20 | 0.23 | 200
History & Biography | 0.48 | 0.33 | 0.39 | 200
Mystery, Thriller & Crime | 0.47 | 0.30 | 0.37 | 200
Poetry | 0.66 | 0.47 | 0.55 | 200
Romance | 0.19 | 0.48 | 0.27 | 200
Young Adult | 0.35 | 0.26 | 0.30 | 200
Accuracy | | | 0.37 | 1600
Weighted Avg | 0.43 | 0.37 | 0.38 | 1600

Per-Genre Classification Report (Hub Model)

Genre | Precision | Recall | F1-Score | Support
Children | 0.41 | 0.51 | 0.45 | 200
Comics & Graphic | 0.59 | 0.35 | 0.44 | 200
Fantasy & Paranormal | 0.34 | 0.20 | 0.26 | 200
History & Biography | 0.46 | 0.30 | 0.36 | 200
Mystery, Thriller & Crime | 0.52 | 0.33 | 0.40 | 200
Poetry | 0.74 | 0.48 | 0.58 | 200
Romance | 0.16 | 0.47 | 0.24 | 200
Young Adult | 0.39 | 0.28 | 0.33 | 200
Accuracy | | | 0.37 | 1600
Weighted Avg | 0.45 | 0.37 | 0.38 | 1600
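Per-genre tables like the two above come straight out of scikit-learn's classification_report; here is a self-contained sketch, with random toy labels standing in for the real model predictions.

```python
# Sketch: generating a per-genre report with sklearn's classification_report.
# y_true / y_pred are toy random stand-ins, not the project's predictions.
import numpy as np
from sklearn.metrics import classification_report

genres = [
    "Children", "Comics & Graphic", "Fantasy & Paranormal",
    "History & Biography", "Mystery, Thriller & Crime",
    "Poetry", "Romance", "Young Adult",
]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 8, size=1600)   # toy ground-truth labels
y_pred = rng.integers(0, 8, size=1600)   # toy predictions

report = classification_report(
    y_true, y_pred,
    labels=list(range(8)),
    target_names=genres,
    zero_division=0,
)
print(report)
```

Passing `target_names` is what puts human-readable genre names in the left column instead of integer label ids.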

Analysis

  • Local ≈ Hub: Metrics are nearly identical (~1% variance), confirming correct upload/download from HuggingFace Hub. Differences are due to different random test samples.
  • Training vs. Post-training eval: Training eval (59.88%) used the same test split; post-training evals (36-37%) used freshly sampled reviews — different data, different difficulty.
  • Best genres: Poetry (0.55-0.58 F1) and Comics & Graphic (0.44-0.46 F1) — distinctive review language.
  • Hardest genres: Fantasy & Paranormal (0.23-0.26 F1) and Romance (0.24-0.27 F1) — overlapping vocabulary with other genres.

Docker Image Build Instructions

🏋️ Training Image

Dockerfile: Dockerfile

# Build
docker build -t mlops-train .

# Run (GPU)
docker run --gpus all mlops-train

# Run + push to Hub
docker run --gpus all \
  -e HF_TOKEN=your_token \
  mlops-train \
  python src/train.py \
    --push_to_hub \
    --hf_username NisargUpadhyay

🔍 Evaluation Image

Dockerfile: Dockerfile.eval

# Build
docker build -f Dockerfile.eval \
  -t mlops-eval .

# Run (auto-evaluates from HF Hub)
docker run mlops-eval

# Check logs
docker run -d --name eval mlops-eval
docker logs -f eval

Docker Commands Reference

Command | Purpose
docker build -t <name> . | Build image from Dockerfile
docker build -f <file> -t <name> . | Build from a specific Dockerfile
docker run <image> | Run container (foreground)
docker run -d --name <name> <image> | Run container (background)
docker logs -f <container> | Follow logs in real time
docker ps -a | List all containers, including stopped ones

Challenges & Solutions

# | Challenge | Solution
1 | Large data downloads (~1 GB gzip per genre) | Streamed with requests.get(stream=True); limited to 10K reviews per genre, then sampled 2K
2 | Colab magic commands in notebook (!pip, %matplotlib) | Removed all magic commands; restructured into 4 clean modules
3 | Random test-data variance across runs | Documented as expected behavior; results are consistent on same-split evaluations
4 | HuggingFace model download throttling | Used HF's built-in caching and retry logic
5 | WandB API key requirement in Trainer | Disabled with WANDB_DISABLED=true and report_to=[]
6 | Genre overlap (Fantasy vs. Young Adult) | Real-world limitation; reviews share similar language across genres
7 | Docker eval image without GPU | Model (263 MB) is small enough for CPU inference