ML Results — Image Pipeline

Preliminary Results — Model Leaderboard

Best Accuracy

36.3%

LightGBM Pipeline

Best QWK

0.2616

LightGBM Pipeline

Models Evaluated

7

ResNet · LightGBM · XGBoost · CatBoost · DT

Val Set Size

2 931

20% stratified split

Rank	Model	Type	Val Accuracy	QWK	Notes

Validation accuracy and QWK for each model. LightGBM leads both metrics.

Multi-axis comparison scaled 0–1. Accuracy (×100), QWK (×10 for visibility).

Key Findings

LightGBM leads all 7 models on both accuracy (36.3%) and QWK (0.2616); CatBoost is 2nd by accuracy (35.69%) but 3rd by QWK, while TwoStage ResNet + GradNorm is 2nd by QWK (0.243).
XGBoost achieves the best Stage-1 Breed1 accuracy (63.66%) but its Stage-2 QWK (0.2108) is weaker — likely because its class probability distributions are less calibrated as Stage-2 inputs.
Decision Tree (27.94%, QWK 0.123) performs on par with Baseline ResNet (28.32%, QWK 0.114), suggesting that a single decision tree is too weak to exploit the 860-dim feature space.
All models struggle with Speed 0 (same-day adoptions) — a highly imbalanced class with only 83 val samples.

Confusion Matrix — AdoptionSpeed

Raw prediction counts per (true class, predicted class).

Row-normalised recall matrix — each row sums to 1.0. Diagonal = per-class recall.
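
The relationship between the two matrices can be sketched in a few lines of plain Python. This is a minimal illustration, not the report's plotting code: the function names `confusion_matrix` and `row_normalise` are hypothetical, and the labels are invented toy values rather than the model's predictions.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Raw counts: rows are true classes, columns are predicted classes."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts


def row_normalise(counts):
    """Divide each row by its total so the diagonal gives per-class recall."""
    normalised = []
    for row in counts:
        total = sum(row)
        # A class with no true samples gets an all-zero row (avoids dividing by zero).
        normalised.append([c / total if total else 0.0 for c in row])
    return normalised


# Toy labels, invented for illustration only (not the report's predictions).
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [4, 2, 1, 2, 2, 2, 4, 3, 4, 4]

raw = confusion_matrix(y_true, y_pred)
recall = row_normalise(raw)
per_class_recall = [recall[k][k] for k in range(5)]
```

In this toy case class 0 is never predicted correctly, so its diagonal entry is 0, which is the same pattern a zero-recall class shows in the normalised matrix.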

Pattern Analysis

Speed 4 (No adoption) has the highest recall across all models. It is a large and distinct class (777 val samples, second only to Speed 2 with 846).
Speed 0 (Same day) is systematically misclassified as Speed 2 or Speed 4 by ResNet models. LightGBM handles it marginally better.
ResNet models collapse most predictions into Speed 2 and Speed 4, missing Speed 3 almost entirely.
LightGBM distributes predictions more evenly across classes, explaining the higher QWK.

Stage 1 — Feature Prediction Accuracy

Stage 1 trains each model to predict tabular attributes from the image alone. High accuracy means the visual signal contains enough information to infer the attribute.

Feature	# Classes	ResNet (No Norm)	ResNet (GradNorm)	LightGBM	XGBoost	CatBoost	DT	Best

Per-feature validation accuracy across all Stage-1 configurations. Type and Health exceed 90% in every model; Breed1 is hardest for LightGBM/CatBoost but XGBoost (63.7%) matches CNN accuracy.

Insights

Type (Dog/Cat) and Health are visually obvious — all models exceed 92%, with XGBoost reaching 97.2%.
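Breed1: XGBoost (63.7%) and ResNet GradNorm (61.0%) are the top two; LightGBM (31.2%) is weakest. One possible explanation is that it lacks the fine-grained visual feature learning the CNN backbone gains during multi-task training, although XGBoost's score shows a boosted-tree model can still match the CNN on this head.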
Gender is the weakest non-breed feature across all models (43–52%); Decision Tree is worst (43.8%), confirming gender is poorly discriminated by visual appearance alone.
XGBoost leads on 5 of 10 features, LightGBM leads on 4, suggesting boosting with deeper per-tree splits better exploits the 512-dim embedding space.

Ablation Analysis — Feature Group Importance

Combinatorial ablation on the LightGBM Stage-2 classifier (298 subsets, max combo size 3). Each group is zeroed out; ΔQWK measures the impact relative to the full baseline (QWK = 0.2616).
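
The count of 298 is what you get from 12 feature groups taken 1, 2 or 3 at a time (12 + 66 + 220). The sketch below enumerates those subsets and shows one way a group could be zeroed. The group list (image embedding, the 10 Stage-1 heads, PhotoAmt) is an assumption inferred from the Stage-2 feature vector, and `ablation_subsets`, `zero_out` and `delta_qwk` are hypothetical helper names, not the report's code.

```python
from itertools import combinations

# Assumed grouping: image embedding + 10 Stage-1 heads + PhotoAmt = 12 groups.
GROUPS = [
    "ImageEmbedding", "Type", "FurLength", "MaturitySize", "Breed1", "Health",
    "Vaccinated", "Dewormed", "Sterilized", "Gender", "Color1", "PhotoAmt",
]


def ablation_subsets(groups, max_size=3):
    """All non-empty subsets of `groups` with at most `max_size` members."""
    subsets = []
    for size in range(1, max_size + 1):
        subsets.extend(combinations(groups, size))
    return subsets


def zero_out(row, slices, ablated):
    """Return a copy of one feature row with the ablated groups set to 0."""
    out = list(row)
    for name in ablated:
        start, stop = slices[name]
        out[start:stop] = [0.0] * (stop - start)
    return out


def delta_qwk(ablated_qwk, baseline_qwk=0.2616):
    """Negative values mean the removed groups were helping the classifier."""
    return ablated_qwk - baseline_qwk


subsets = ablation_subsets(GROUPS)

# Toy 4-dim row with two made-up groups, just to show the zeroing step.
toy_slices = {"A": (0, 2), "B": (2, 4)}
toy_ablated = zero_out([1.0, 2.0, 3.0, 4.0], toy_slices, ("A",))
```

Each subset would then be scored by re-evaluating the Stage-2 classifier on the zeroed features and passing the resulting QWK to `delta_qwk`.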

Single-Feature QWK Impact

ΔQWK when each feature group is individually zeroed. Negative = harmful to remove; positive = redundant.

All 298 ablation combinations plotted by combo size. Each point is one subset; colour encodes ΔQWK magnitude.

Most Harmful Combinations (Worst 5)

Features Ablated	n	Accuracy	QWK	ΔACC	ΔQWK

Most Redundant Combinations (Best 5 — removing helps or barely hurts)

Features Ablated	n	Accuracy	QWK	ΔACC	ΔQWK

Ablation Insights

Image embeddings are by far the most critical feature — removing them alone drops QWK by 0.26 (to near-random, ≈ −0.001).
Sterilized is the most informative tabular head (ΔQWK = −0.074 when removed), more than PhotoAmt (−0.027).
Color1 is the most redundant feature — removing it actually slightly improves QWK (+0.025), suggesting it adds noise rather than signal.
Gender + Color1 together improve QWK by +0.033 when removed, making them a clean candidate for pruning in the next experiment.

Grad-CAM — What the Model Sees

Gradient-weighted Class Activation Maps (Grad-CAM) back-propagate through the last ResNet block (backbone.layer4[-1]) to highlight the image regions that drove the AdoptionSpeed prediction. Warmer colours (red) indicate higher influence; cooler colours (blue/green) are less attended. Overlays are from the TwoStage ResNet + GradNorm Stage-2 head on 16 validation samples.
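
The core Grad-CAM computation is small enough to sketch without a deep-learning framework. The function below assumes the activations of the target layer and the gradients of the class score with respect to them have already been captured, and takes them as nested lists shaped [channels][height][width]. The name `grad_cam` and the toy numbers are illustrative assumptions, not the report's implementation.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat-map from one layer's activations and gradients.

    Both arguments are nested lists shaped [channels][height][width].
    """
    n_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight = gradient averaged over all spatial positions.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(n_channels)
    ]

    # Weighted sum of activation maps, then ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be rendered as a colour overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy 2-channel, 2x2 example with invented numbers.
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

In a PyTorch pipeline the two inputs would typically be captured with forward and backward hooks on the chosen layer (here, backbone.layer4[-1]), and the resulting low-resolution map would be upsampled to 224 × 224 before being overlaid on the image.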

3 Correct · 13 Incorrect · 16 Total Samples

Observations

Correct predictions tend to focus on the pet's body and face — semantically relevant regions for assessing condition and appeal.
Speed 1 → Speed 2 errors are common (5 of 13 misses): the model over-predicts slower adoption, possibly due to class imbalance during Stage-2 training.
Speed 0 (Same day) is predicted as Speed 4 — the model sees no distinguishing visual cue for urgent adoption, which aligns with Speed 0 having zero recall in the confusion matrix.
Several incorrect predictions still attend to the pet rather than background, suggesting the error originates in the decision boundary rather than feature extraction.

Conclusion — Best Model by Category

Best Overall

LightGBM Pipeline

Accuracy 36.3% · QWK 0.2616

Best ResNet

TwoStage ResNet + GradNorm

Accuracy 35.4% · QWK 0.243

Best No-Norm ResNet

TwoStage ResNet (No Norm)

Accuracy 34.66% · QWK 0.2148

Best Breed Prediction (Stage 1)

XGBoost

Breed1 Accuracy 63.66% vs TwoStage ResNet ~60–61% · LightGBM 31%

Most Critical Feature

Image Embeddings

Removing them drops QWK by 0.262, to near-random

Recommended Prune

Color1 + Gender heads

Removing improves QWK by +0.033

Supplementary — Experiment Setup

Dataset

Property	Value
Source	PetFinder.my Adoption Prediction (Kaggle)
Target	AdoptionSpeed (5 ordinal classes: 0 = same day → 4 = no adoption)
Image size	224 × 224 px (ResNet standard)
Train / Val split	80 / 20 stratified by AdoptionSpeed · Seed 42
Val set size	2 931 – 2 932 samples
Class imbalance (val set)	Speed 0: 83 · Speed 1: 603 · Speed 2: 846 · Speed 3: 622 · Speed 4: 777
Image filter	Only pets with a primary photo (`{PetID}-1.jpg`) are included
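
As a quick sanity check on the class counts above, the snippet below derives the class shares and the accuracy of a trivial predictor that always outputs the largest class. It uses only the counts listed in the table; the variable names are illustrative.

```python
# Validation-set counts per AdoptionSpeed class, copied from the table above.
VAL_COUNTS = {0: 83, 1: 603, 2: 846, 3: 622, 4: 777}

total = sum(VAL_COUNTS.values())
shares = {speed: n / total for speed, n in VAL_COUNTS.items()}

# A predictor that always outputs the most frequent class.
majority_class = max(VAL_COUNTS, key=VAL_COUNTS.get)
majority_accuracy = VAL_COUNTS[majority_class] / total

for speed, share in shares.items():
    print(f"Speed {speed}: {share:.1%}")
print(f"Always-predict-Speed-{majority_class} accuracy: {majority_accuracy:.2%}")
```

The counts sum to 2 931, matching the reported validation size. Always predicting Speed 2 would score roughly 28.9% accuracy, which is a useful reference point when reading the leaderboard: the Baseline ResNet (28.32%) and Decision Tree (27.94%) figures sit close to it.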

Model Architectures

Baseline ResNet

ResNet18 backbone (ImageNet pretrained)
Single head: Linear(512, 5) → AdoptionSpeed
No multi-task pre-training
Cross-entropy loss

TwoStage ResNet

ResNet18 backbone (shared)
Stage 1: 10 attribute heads (Type → Color1) — trained jointly with GradNorm
Stage 2: Linear(859, 5) — backbone frozen; input = 512-dim features ∥ 347-dim Stage-1 logits
No-Norm variant uses equal loss weights

LightGBM Pipeline

ResNet18 feature extractor (frozen)
Stage 1: 10 independent LGBMClassifiers (one per tabular attribute)
Stage 2: LGBMClassifier on 860-dim vector (512 img ∥ 347 logits ∥ 1 PhotoAmt)
Device: GPU (OpenCL) when available

Stage-2 Feature Vector (LightGBM)

Component	Source	Dims	Cumulative
Image embedding	ResNet18 backbone (avgpool output)	512	512
Type logits	Stage-1 LightGBM	2	514
FurLength logits	Stage-1 LightGBM	4	518
MaturitySize logits	Stage-1 LightGBM	5	523
Breed1 logits	Stage-1 LightGBM	308	831
Health / Vaccinated / Dewormed / Sterilized / Gender logits	Stage-1 LightGBM (×5 heads, 4 classes each)	20	851
Color1 logits	Stage-1 LightGBM	8	859
PhotoAmt	Raw tabular column	1	860
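
The cumulative column can be reproduced by accumulating the widths in order, which also gives the slice each group occupies in the 860-dim vector. This is a sketch under one assumption: the five 4-class heads are laid out in the order the table lists them. `build_slices` is a hypothetical helper name.

```python
# (component name, width) in the order given by the table above.
# Assumption: the five 4-class heads appear in the order the table names them.
COMPONENTS = [
    ("ImageEmbedding", 512),
    ("Type", 2),
    ("FurLength", 4),
    ("MaturitySize", 5),
    ("Breed1", 308),
    ("Health", 4),
    ("Vaccinated", 4),
    ("Dewormed", 4),
    ("Sterilized", 4),
    ("Gender", 4),
    ("Color1", 8),
    ("PhotoAmt", 1),
]


def build_slices(components):
    """Map each component to its (start, stop) column range; also return the total width."""
    slices = {}
    offset = 0
    for name, width in components:
        slices[name] = (offset, offset + width)
        offset += width
    return slices, offset


SLICES, TOTAL_DIMS = build_slices(COMPONENTS)

# Width of the Stage-1 logit block alone (everything except image and PhotoAmt).
LOGIT_DIMS = sum(w for name, w in COMPONENTS if name not in ("ImageEmbedding", "PhotoAmt"))
```

The logit block comes to 347 dims, which matches the 512 + 347 = 859 input of the ResNet Stage-2 head; adding PhotoAmt gives the 860 dims used by LightGBM.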

Training Hyperparameters

ResNet (both variants)

Epochs	10
Batch size	32
Optimizer	Adam · lr = 3×10⁻⁴
GradNorm α	1.5
GradNorm lr	0.01
Stage-2 backbone	frozen
Distributed	DDP
Random seed	42
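
The report gives only α = 1.5 and a GradNorm learning rate of 0.01. Assuming the run follows the standard GradNorm formulation (an assumption; the loss itself is not shown in the source), the task weights w_i for the T = 10 Stage-1 heads would be updated to minimise L_grad below, where W denotes the shared weights at which gradient norms are measured and L_i is the loss of head i:

```latex
\begin{aligned}
G_W^{(i)}(t) &= \left\lVert \nabla_W \, w_i(t)\, L_i(t) \right\rVert_2,
\qquad
\bar{G}_W(t) = \frac{1}{T}\sum_{i=1}^{T} G_W^{(i)}(t), \\
\tilde{L}_i(t) &= \frac{L_i(t)}{L_i(0)},
\qquad
r_i(t) = \frac{\tilde{L}_i(t)}{\frac{1}{T}\sum_{j=1}^{T}\tilde{L}_j(t)}, \\
L_{\text{grad}}(t) &= \sum_{i=1}^{T} \left| \, G_W^{(i)}(t) - \bar{G}_W(t)\,\bigl[r_i(t)\bigr]^{\alpha} \right|
\end{aligned}
```

Under this reading, a larger α pushes harder towards equal training rates across heads, and the 0.01 learning rate would apply to the weights w_i rather than to the network parameters.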

LightGBM

Stage-1 estimators	10 (one per attribute)
Stage-2 estimator	LGBMClassifier
ResNet batch size	64
Device	GPU (OpenCL) / CPU
Objective	multiclass
Random seed	42

Evaluation Metrics

Metric	Formula	Why used
Accuracy	correct / total	Baseline correctness; easy to interpret
QWK (Quadratic Weighted Kappa)	Cohen's κ, quadratic weights	Official Kaggle metric; penalises large ordinal errors more than small ones
Confusion Matrix	raw + row-normalised	Per-class recall and systematic bias analysis
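
For readers unfamiliar with QWK, the following is a minimal pure-Python sketch of Cohen's kappa with quadratic weights for 5 ordinal classes. It is written for illustration and is not the evaluation code used in the report; the function name `quadratic_weighted_kappa` and the toy labels are assumptions.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels."""
    n = len(y_true)

    # Observed agreement table: rows = true class, columns = predicted class.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Penalty grows with the squared distance between the two classes.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Count expected if true and predicted labels were independent.
            expected = true_hist[i] * pred_hist[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Convention chosen here: all labels and predictions fall in one class.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy labels, invented for illustration.
truth = [0, 1, 2, 3, 4]
perfect = quadratic_weighted_kappa(truth, [0, 1, 2, 3, 4])
constant = quadratic_weighted_kappa(truth, [2, 2, 2, 2, 2])
near_miss = quadratic_weighted_kappa(truth, [1, 1, 2, 3, 4])  # one error, off by 1
far_miss = quadratic_weighted_kappa(truth, [4, 1, 2, 3, 4])   # one error, off by 4
```

The two single-error cases have identical accuracy (4 of 5 correct), yet the off-by-4 error is penalised far more heavily, which is the property that makes QWK suited to an ordinal target such as AdoptionSpeed. A constant predictor scores 0 regardless of its accuracy.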
