Overview
Objective
Fine-tune DistilBERT on Goodreads book reviews for genre classification. Convert a notebook into production scripts, containerize with Docker, and publish to HuggingFace Hub.
Key Tasks
- Notebook → modular Python scripts
- Docker training & eval images
- HuggingFace Trainer API
- Push model to HuggingFace Hub
- Local vs. Hub evaluation comparison
Model Selection
Why DistilBERT (distilbert-base-cased)?
| Criterion | Justification |
|---|---|
| Efficiency | 40% fewer parameters than BERT-base (66M vs 110M) |
| Speed | ~60% faster inference — ideal for Docker containers |
| Performance | Retains 97% of BERT's capability (GLUE benchmark) |
| Cased Variant | Preserves capitalization — meaningful in book reviews |
| Task Fit | Built-in classification head for multi-class genre prediction |
| Ecosystem | Trainer API, tokenizer, one-command push to Hub |
Training & Results
Training Configuration
| Parameter | Value |
|---|---|
| Base Model | distilbert-base-cased |
| Dataset | UCSD Goodreads Book Reviews |
| Epochs | 3 |
| Learning Rate | 5e-5 (AdamW) |
| Batch Size (Train / Eval) | 10 / 16 |
| Warmup Steps | 100 |
| Weight Decay | 0.01 |
| Max Sequence Length | 512 tokens |
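The configuration above maps onto the HuggingFace Trainer API roughly as follows. This is a sketch: the `output_dir` value is an assumption, not the project's actual path, and the 512-token limit is enforced at tokenization time rather than in the training arguments.

```python
# Hyperparameters from the table above, shaped as keyword arguments
# for transformers.TrainingArguments. "results" as the output
# directory is an assumption.
training_kwargs = dict(
    output_dir="results",
    num_train_epochs=3,
    learning_rate=5e-5,              # AdamW is the Trainer default optimizer
    per_device_train_batch_size=10,
    per_device_eval_batch_size=16,
    warmup_steps=100,
    weight_decay=0.01,
    report_to=[],                    # keeps WandB out of the loop (see Challenges, #5)
)

# The 512-token max sequence length is applied by the tokenizer,
# e.g. tokenizer(texts, truncation=True, max_length=512), not here.
# args = transformers.TrainingArguments(**training_kwargs)
```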
Training Evaluation Results
| Metric | Value |
|---|---|
| Loss | 1.2379 |
| Accuracy | 59.88% |
| Precision | 60.44% |
| Recall | 59.88% |
| F1 (weighted) | 60.01% |
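These four metrics are the standard weighted scores; a minimal metric callback in the shape the Trainer API expects would look like the sketch below (the function name and the use of scikit-learn are assumptions about the project's implementation):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Metric callback in the shape HuggingFace Trainer expects:
    a (logits, labels) pair in, a dict of floats out."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passed as `compute_metrics=compute_metrics` to `Trainer(...)`, this yields exactly the four rows reported above. Note that weighted recall always equals accuracy, which is why those two rows match.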
Evaluation Comparison: Local vs. HuggingFace Hub
Post-training evaluations draw a fresh random sample of test reviews from the same source URLs, so each run scores different reviews; this sampling explains the metric variance relative to the training-time evaluation.
| Metric | Training Eval | Local Model | Hub Model |
|---|---|---|---|
| Accuracy | 59.88% | 36.75% | 36.56% |
| Precision | 60.44% | 42.50% | 45.28% |
| Recall | 59.88% | 36.75% | 36.56% |
| F1 (weighted) | 60.01% | 37.71% | 38.38% |
Per-Genre Classification Report (Local Model)
| Genre | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Children | 0.38 | 0.51 | 0.44 | 200 |
| Comics & Graphic | 0.58 | 0.39 | 0.46 | 200 |
| Fantasy & Paranormal | 0.28 | 0.20 | 0.23 | 200 |
| History & Biography | 0.48 | 0.33 | 0.39 | 200 |
| Mystery, Thriller & Crime | 0.47 | 0.30 | 0.37 | 200 |
| Poetry | 0.66 | 0.47 | 0.55 | 200 |
| Romance | 0.19 | 0.48 | 0.27 | 200 |
| Young Adult | 0.35 | 0.26 | 0.30 | 200 |
| Accuracy | | | 0.37 | 1600 |
| Weighted Avg | 0.43 | 0.37 | 0.38 | 1600 |
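A quick sanity check on the table above: because every genre has equal support (200 reviews), the weighted average reduces to a plain mean of the per-genre scores. The per-genre values are rounded to two decimals, so small discrepancies (e.g. the reported 0.43 weighted precision) are rounding artifacts.

```python
# Per-genre F1 scores copied from the local-model report above;
# each genre has support 200, so the weighted average is the mean.
f1_by_genre = {
    "Children": 0.44,
    "Comics & Graphic": 0.46,
    "Fantasy & Paranormal": 0.23,
    "History & Biography": 0.39,
    "Mystery, Thriller & Crime": 0.37,
    "Poetry": 0.55,
    "Romance": 0.27,
    "Young Adult": 0.30,
}
weighted_f1 = sum(f1_by_genre.values()) / len(f1_by_genre)
print(round(weighted_f1, 2))  # 0.38, matching the Weighted Avg row
```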
Per-Genre Classification Report (Hub Model)
| Genre | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Children | 0.41 | 0.51 | 0.45 | 200 |
| Comics & Graphic | 0.59 | 0.35 | 0.44 | 200 |
| Fantasy & Paranormal | 0.34 | 0.20 | 0.26 | 200 |
| History & Biography | 0.46 | 0.30 | 0.36 | 200 |
| Mystery, Thriller & Crime | 0.52 | 0.33 | 0.40 | 200 |
| Poetry | 0.74 | 0.48 | 0.58 | 200 |
| Romance | 0.16 | 0.47 | 0.24 | 200 |
| Young Adult | 0.39 | 0.28 | 0.33 | 200 |
| Accuracy | | | 0.37 | 1600 |
| Weighted Avg | 0.45 | 0.37 | 0.38 | 1600 |
Analysis
- Local ≈ Hub: Metrics are nearly identical (~1% variance), confirming correct upload/download from HuggingFace Hub. Differences are due to different random test samples.
- Training vs. post-training eval: The training eval (59.88%) scored the held-out split fixed during training; the post-training evals (36-37%) scored freshly sampled reviews, so they faced different data of different difficulty.
- Best genres: Poetry (0.55-0.58 F1) and Comics & Graphic (0.44-0.46 F1) — distinctive review language.
- Hardest genres: Fantasy & Paranormal (0.23-0.26 F1) and Romance (0.24-0.27 F1) — overlapping vocabulary with other genres.
Docker Image Build Instructions
🏋️ Training Image
Dockerfile: `Dockerfile`

```bash
# Build
docker build -t mlops-train .

# Run (GPU)
docker run --gpus all mlops-train

# Run + push to Hub
docker run --gpus all \
  -e HF_TOKEN=your_token \
  mlops-train \
  python src/train.py \
    --push_to_hub \
    --hf_username NisargUpadhyay
```
🔍 Evaluation Image
Dockerfile: `Dockerfile.eval`

```bash
# Build
docker build -f Dockerfile.eval -t mlops-eval .

# Run (auto-evaluates from HF Hub)
docker run mlops-eval

# Check logs
docker run -d --name eval mlops-eval
docker logs -f eval
```
Docker Commands Reference
| Command | Purpose |
|---|---|
| `docker build -t <name> .` | Build image from Dockerfile |
| `docker build -f <file> -t <name> .` | Build from a specific Dockerfile |
| `docker run <image>` | Run container (foreground) |
| `docker run -d --name <n> <img>` | Run container (background) |
| `docker logs -f <container>` | Follow logs in real-time |
| `docker ps -a` | List all containers |
Challenges & Solutions
| # | Challenge | Solution |
|---|---|---|
| 1 | Large data downloads (~1GB gzip per genre) | Streamed with `requests.get(stream=True)`, limited to 10K reviews, sampled 2K |
| 2 | Colab magic commands in notebook (`!pip`, `%matplotlib`) | Removed all magic commands; restructured into 4 clean modules |
| 3 | Random test data variance across runs | Documented as expected behavior; consistent on same-split evaluations |
| 4 | HuggingFace model download throttling | Used HF's built-in caching and retry logic |
| 5 | WandB API key requirement in Trainer | Disabled with `WANDB_DISABLED=true` and `report_to=[]` |
| 6 | Genre overlap (Fantasy vs. Young Adult) | Real-world limitation; reviews share similar language across genres |
| 7 | Docker eval image without GPU | Model (263MB) is small enough for CPU inference |
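The streaming approach from challenge #1 can be sketched as follows. `load_reviews` is an illustrative name, and the JSON-lines layout is an assumption about the Goodreads dumps; the real script would pair this with `requests.get(url, stream=True)`.

```python
import gzip
import json

def load_reviews(stream, limit=10_000):
    """Read at most `limit` JSON-lines records from a gzipped stream.

    `stream` is any file-like object of gzipped bytes, e.g. the raw
    body of a requests.get(url, stream=True) response. Decompressing
    and parsing lazily keeps memory flat even for ~1GB archives.
    """
    reviews = []
    with gzip.open(stream, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            reviews.append(json.loads(line))
            if len(reviews) >= limit:
                break
    return reviews
```

Downstream, the pipeline would then take `random.sample(reviews, 2_000)` per genre, matching the 10K-cap / 2K-sample figures in the table above.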
Links
🤗 HuggingFace Model
Fine-tuned DistilBERT for Goodreads genre classification — publicly accessible
View on HuggingFace →

📂 Source Code
All scripts (train/eval/data/utils), Dockerfiles, requirements, and evaluation results
View on GitHub →

📓 Original Notebook
Instructor-provided Colab notebook for fine-tuning classification
Open in Colab →