News Category text classification report comparing scalable ML baselines, reduced-feature pipelines, and transformer fine-tuning on held-out test performance.
Headline and short description are combined into one document
Word TF-IDF, character TF-IDF, and hashing-based sparse features
Scalable linear classifiers and ensemble baselines for sparse text features
Accuracy, macro-F1, weighted-F1, precision, and recall on the test split
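The gap between macro-F1 and weighted-F1 listed above is easiest to see on an imbalanced label set. The following is a minimal standard-library sketch, not the report's evaluation code; the function names and the two category labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    support = Counter(y_true)
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true examples per class."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(s for _, s in scores.values())
    return sum(f1 * s for f1, s in scores.values()) / total


# Hypothetical imbalanced example: the minority class is never predicted.
y_true = ["POLITICS"] * 8 + ["TRAVEL"] * 2
y_pred = ["POLITICS"] * 10
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

In this toy case accuracy is 0.8 and weighted-F1 stays fairly high, while macro-F1 drops sharply because the missed minority class contributes an F1 of zero with full weight.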
Bag-of-Words and TF-IDF representations with unigram and bigram terms
Chi-square feature selection and TruncatedSVD projections
Linear LR/SVC/SGD models, with MLP evaluated on dense SVD features
Candidate pipelines are ranked by macro-F1 on the held-out test set
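To illustrate the chi-square feature selection step named above, here is a small sketch that scores terms with the textbook 2x2 chi-square statistic (term presence against membership in one class) and keeps the top k. It is an assumption-laden illustration: the report does not say which implementation it uses, library versions of this test may compute the statistic differently, and the helper names and toy documents are hypothetical.

```python
def chi2_term_score(docs, labels, term, target):
    """2x2 chi-square statistic for term presence vs. membership in `target`.

    docs is a list of token sets; labels is the parallel list of class labels.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, target class
        elif has_term:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def top_k_terms(docs, labels, target, k):
    """Rank the vocabulary by chi-square score and keep the k best terms."""
    vocab = sorted({t for tokens in docs for t in tokens})
    ranked = sorted(
        vocab,
        key=lambda t: chi2_term_score(docs, labels, t, target),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of unigram tokens.
docs = [
    {"senate", "vote"},
    {"senate", "bill"},
    {"beach", "trip"},
    {"beach", "vote"},
]
labels = ["POLITICS", "POLITICS", "TRAVEL", "TRAVEL"]
print(top_k_terms(docs, labels, target="POLITICS", k=2))
```

Terms that appear equally often inside and outside the target class, such as "vote" here, score zero and are dropped first, which is the behaviour a reduced-feature pipeline relies on.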
Combined headline and description, truncated or padded to 128 tokens
Checkpoint-specific WordPiece tokenization for BERT-family encoders
BERT and DistilBERT encoders are fine-tuned end to end
CLS, mean, or pooler-style representation with dropout and a linear head
The best validation checkpoint is evaluated once on the final test split