--- language: "code" license: "mit" tags: - dockerfile - hadolint - binary-classification - codebert model-index: - name: Binary Dockerfile Classifier results: [] --- # ๐Ÿงฑ Dockerfile Quality Classifier โ€“ Binary Model This model predicts whether a given Dockerfile is: - โœ… **GOOD** โ€“ clean and adheres to best practices (no top rule violations) - โŒ **BAD** โ€“ violates at least one important rule (from Hadolint) It is the first step in a full ML-based Dockerfile linter. --- ## ๐Ÿง  Model Overview - **Architecture:** Fine-tuned `microsoft/codebert-base` - **Task:** Binary classification (`good` vs `bad`) - **Input:** Full Dockerfile content as plain text - **Output:** `[prob_good, prob_bad]` โ€” softmax scores - **Max input length:** 512 tokens --- ## ๐Ÿ“š Training Details - **Data source:** Real-world and synthetic Dockerfiles - **Labels:** Based on [Hadolint](https://github.com/hadolint/hadolint) top 30 rules - **Bad examples:** At least one rule violated - **Good examples:** Fully clean files - **Dataset balance:** 15000 BAD / 1500 GOOD (clean) --- ## ๐Ÿงช Evaluation Results Evaluation on a held-out test set of 1,650 Dockerfiles: | Class | Precision | Recall | F1-score | Support | |-------|-----------|--------|----------|---------| | good | 0.96 | 0.91 | 0.93 | 150 | | bad | 0.99 | 1.00 | 0.99 | 1500 | | **Accuracy** | | | **0.99** | 1650 | --- ## ๐Ÿš€ Quick Start ### ๐Ÿงช Step 1 โ€” Create test script Save this as `test_binary_predict.py`: ```python import sys from transformers import AutoTokenizer, AutoModelForSequenceClassification import torch from pathlib import Path path = Path(sys.argv[1]) text = path.read_text(encoding="utf-8") tokenizer = AutoTokenizer.from_pretrained("LeeSek/binary-dockerfile-model") model = AutoModelForSequenceClassification.from_pretrained("LeeSek/binary-dockerfile-model") model.eval() inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=512) with torch.no_grad(): logits = model(**inputs).logits probs = torch.nn.functional.softmax(logits, dim=1).squeeze() label = "GOOD" if torch.argmax(probs).item() == 0 else "BAD" print(f"Prediction: {label} โ€” Probabilities: good={probs[0]:.3f}, bad={probs[1]:.3f}") ``` --- ### ๐Ÿ“„ Step 2 โ€” Create good and bad Dockerfile Good: ```docker FROM node:18 WORKDIR /app COPY . . RUN npm install CMD ["node", "index.js"] ``` Bad: ```docker FROM ubuntu:latest RUN apt-get install python3 ADD . /app WORKDIR /app RUN pip install flask CMD python3 app.py ``` --- ### โ–ถ๏ธ Step 3 โ€” Run the prediction ```bash python test_binary_predict.py Dockerfile ``` Expected output: ``` Prediction: GOOD โ€” Probabilities: good=0.998, bad=0.002 ``` --- ## ๐Ÿ—‚ Extras The full training and evaluation pipeline โ€” including data preparation, training, validation, prediction โ€” is available in the **`scripts/`** folder. > ๐Ÿ’ฌ **Note:** Scripts are written with **Polish comments and variable names** for clarity during local development. Logic is fully portable. --- ## ๐Ÿ“˜ License MIT --- ## ๐Ÿ™Œ Credits - Model powered by [Hugging Face Transformers](https://huggingface.co/transformers) - Tokenizer: CodeBERT - Rule definitions: [Hadolint](https://github.com/hadolint/hadolint)