Text EDA Report

Exploratory analysis of HuffPost news articles in the News Category dataset, covering class balance, missing values, text lengths, timeline patterns, and dominant terms.

EDA Results

Dataset Overview

Sample Rows

Data Types & Missing Values

Category Distribution

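Top news categories ranked by number of articles. POLITICS alone accounts for 16.99% of all articles, and the largest class is 35.11x the size of the smallest, so the long tail of small classes is a key issue for downstream classification.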
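
The class shares and the largest-to-smallest ratio can be computed along the following lines. This is a minimal sketch using only the standard library; the function name `class_balance` and the toy labels are illustrative assumptions, not the code behind this report.

```python
from collections import Counter


def class_balance(labels):
    """Return (share_by_class, imbalance_ratio) for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Percentage share of each class, largest class first.
    shares = {label: 100.0 * n / total for label, n in counts.most_common()}
    # Ratio of the largest class size to the smallest class size.
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio


# Toy labels for illustration only; these are not dataset values.
toy_labels = ["POLITICS"] * 6 + ["WELLNESS"] * 3 + ["ARTS"] * 1
shares, ratio = class_balance(toy_labels)
print(shares)
print(ratio)
```

A high ratio is usually a cue to consider stratified splits, class weights, or per-class metrics rather than plain accuracy.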

Key Insights

Text Lengths

Headlines are short and compact, averaging 9.6 words.

Combining headlines with short descriptions roughly triples the available text, to an average of 29.27 words, which gives NLP models considerably more context.

Character-length histograms complement word counts and show how compact or verbose the raw text strings actually are.

Character distributions are useful when estimating storage, truncation risk, and input limits for downstream models.
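
Word and character statistics of this kind can be sketched as follows. The field names `headline` and `short_description` are assumed to match the dataset's columns, whitespace splitting stands in for whatever tokenizer the report used, and `length_stats` is a hypothetical helper.

```python
from statistics import mean


def length_stats(rows):
    """Average word and character counts for headline and combined text."""
    headline_words, combined_words, combined_chars = [], [], []
    for row in rows:
        # Treat missing fields as empty strings (an assumption of this sketch).
        headline = row.get("headline") or ""
        description = row.get("short_description") or ""
        combined = f"{headline} {description}".strip()
        headline_words.append(len(headline.split()))
        combined_words.append(len(combined.split()))
        combined_chars.append(len(combined))
    return {
        "headline_words_mean": mean(headline_words),
        "combined_words_mean": mean(combined_words),
        "combined_chars_mean": mean(combined_chars),
    }


# Toy rows for illustration only.
toy_rows = [
    {"headline": "Senate passes budget bill", "short_description": "The vote was close."},
    {"headline": "Five easy pasta recipes", "short_description": ""},
]
print(length_stats(toy_rows))
```

The per-row character counts are the values to compare against a model's input limit when estimating truncation risk.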

Text Length by Category

Average combined text length for the largest categories. This highlights which news sections tend to use longer summaries.
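
A per-category average like this can be approximated with a simple group-by. The sketch below assumes rows with `category`, `headline`, and `short_description` keys; `mean_words_by_category` and its `top_k` parameter are illustrative names.

```python
from collections import defaultdict


def mean_words_by_category(rows, top_k=3):
    """Mean combined word count for the top_k categories by article count."""
    word_counts = defaultdict(list)
    for row in rows:
        text = f"{row.get('headline') or ''} {row.get('short_description') or ''}"
        word_counts[row["category"]].append(len(text.split()))
    # Keep only the largest categories, ranked by number of articles.
    largest = sorted(word_counts, key=lambda c: len(word_counts[c]), reverse=True)[:top_k]
    return {c: sum(word_counts[c]) / len(word_counts[c]) for c in largest}


# Toy rows for illustration only.
toy_rows = [
    {"category": "POLITICS", "headline": "a b c", "short_description": "d e"},
    {"category": "POLITICS", "headline": "a b", "short_description": ""},
    {"category": "TRAVEL", "headline": "x y z", "short_description": "w"},
]
print(mean_words_by_category(toy_rows, top_k=1))
```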

Timeline Distribution

Article volume by year. Volume peaks in 2013 with 34,583 articles and falls to 1,398 in 2022, so the dataset is temporally skewed toward earlier years.
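
Yearly counts can be derived from the `date` field roughly as shown here. The sketch assumes dates are stored as ISO `YYYY-MM-DD` strings, which is an assumption about the file format; `articles_per_year` is a hypothetical helper.

```python
from collections import Counter
from datetime import date


def articles_per_year(rows):
    """Count articles per calendar year, returned in chronological order."""
    # Assumes each row["date"] is an ISO string such as "2013-05-17".
    years = Counter(date.fromisoformat(row["date"]).year for row in rows)
    return dict(sorted(years.items()))


# Toy rows for illustration only.
toy_rows = [{"date": "2013-05-17"}, {"date": "2022-01-03"}, {"date": "2013-11-30"}]
print(articles_per_year(toy_rows))
```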

Key Insights

Data Quality

Missing values are concentrated in `short_description` (9.41% of rows) and `authors` (17.86% of rows), while the category, link, and date fields are complete.
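
A missing-value table can be built with a check like the one below. It assumes that empty or whitespace-only strings count as missing, which may or may not match how this report defined missingness; `missing_rates` is a hypothetical helper.

```python
def missing_rates(rows, fields):
    """Percentage of rows where each field is absent, None, or blank."""
    total = len(rows)
    rates = {}
    for field in fields:
        # Assumption: blank strings are treated as missing values.
        missing = sum(1 for row in rows if not (row.get(field) or "").strip())
        rates[field] = round(100.0 * missing / total, 2)
    return rates


# Toy rows for illustration only.
toy_rows = [
    {"category": "POLITICS", "short_description": "text", "authors": "A. Writer"},
    {"category": "TRAVEL", "short_description": "", "authors": ""},
    {"category": "FOOD", "short_description": "text", "authors": "  "},
    {"category": "ARTS", "short_description": "text", "authors": "B. Writer"},
]
print(missing_rates(toy_rows, ["category", "short_description", "authors"]))
```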

Missing Value Table

Terms and Authors

Frequent content words reveal a strong concentration of politics, media, and lifestyle language.

Most Frequent Authors

Stopwords Analysis

This view keeps stopwords instead of removing them, so you can inspect how much of the corpus is made of high-frequency function words.
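
The share of stopword tokens can be estimated as in this sketch. The tiny `STOPWORDS` set and the regex tokenizer are placeholders chosen for illustration; a real analysis would likely use a fuller list from an NLP library.

```python
import re

# Deliberately small, hypothetical stopword list for illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}


def stopword_share(texts, stopwords=STOPWORDS):
    """Fraction of all tokens in `texts` that are stopwords."""
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in stopwords)
    return hits / len(tokens)


# Toy texts for illustration only.
toy_texts = ["The state of the union", "A guide to travel in Peru"]
print(stopword_share(toy_texts))
```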

Vocabulary Richness

Bigrams

Bigram frequencies reveal common phrase patterns that simple unigram counts cannot capture.
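
Bigram counting can be sketched in a few lines. The tokenizer here is a simple lowercase regex, an assumption that may differ from the one used for this report, and `top_bigrams` is a hypothetical helper.

```python
import re
from collections import Counter


def top_bigrams(texts, n=3):
    """Return the n most frequent adjacent token pairs across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        # Pair each token with its successor; pairs never cross text boundaries.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Toy texts for illustration only.
toy_texts = ["White House responds", "White House press briefing", "House press corps"]
print(top_bigrams(toy_texts, n=2))
```

Counting within each text separately avoids spurious bigrams that would span the end of one headline and the start of the next.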

TF-IDF Keywords by Class

TF-IDF highlights words that are especially characteristic of each category rather than merely frequent overall.
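
One way to illustrate the idea is to treat each category as a single document and score terms with a plain textbook TF-IDF. The sketch below is written under that assumption and is not necessarily how this report computed its keywords; library implementations often use smoothed variants. The `text` key and `tfidf_by_class` name are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tfidf_by_class(rows, top_n=2):
    """Top terms per category, treating each category as one document."""
    class_counts = defaultdict(Counter)
    for row in rows:
        tokens = re.findall(r"[a-z']+", row["text"].lower())
        class_counts[row["category"]].update(tokens)
    n_classes = len(class_counts)
    # Document frequency: in how many categories does each term occur?
    df = Counter(term for counts in class_counts.values() for term in counts)
    result = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())
        # Unsmoothed TF-IDF: terms present in every category score zero.
        scores = {t: (c / total) * math.log(n_classes / df[t]) for t, c in counts.items()}
        result[label] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return result


# Toy rows for illustration only.
toy_rows = [
    {"category": "POLITICS", "text": "senate vote on new bill"},
    {"category": "POLITICS", "text": "new senate hearing"},
    {"category": "TRAVEL", "text": "new flight deals"},
    {"category": "TRAVEL", "text": "best beach flight"},
]
print(tfidf_by_class(toy_rows, top_n=1))
```

In the toy data, "new" appears in both categories and therefore scores zero, which is the behavior that separates characteristic terms from merely frequent ones.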

Key Insights

Overall Summary

  • POLITICS is the dominant class with 16.99% of all articles, while the largest-to-smallest class ratio reaches 35.11x, so label imbalance is a major modeling concern.
  • The text is concise overall: headlines average 9.6 words, and the combined headline-plus-description field averages 29.27 words, which is suitable for lightweight text classification pipelines.
  • The dataset is temporally skewed: article volume peaks in 2013 with 34,583 records, then drops to 1,398 in 2022, so random splitting can hide time drift.
  • Data completeness issues affect the description and author fields rather than the labels: 9.41% of rows lack short descriptions and 17.86% lack author names, while category, link, and date remain complete.
  • The vocabulary is broad, with 98,314 unique tokens overall, but topic focus is still repetitive: `trump` is the most frequent content word at 16,506 occurrences.
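
Given the temporal skew noted in the summary, a chronological split is one alternative to random splitting. The sketch below assumes ISO `YYYY-MM-DD` strings in a `date` field; `chronological_split` and `test_fraction` are illustrative names, not part of the report's pipeline.

```python
from datetime import date


def chronological_split(rows, test_fraction=0.2):
    """Split rows so that the most recent articles form the test set."""
    # Assumes each row["date"] is an ISO string such as "2013-05-17".
    ordered = sorted(rows, key=lambda row: date.fromisoformat(row["date"]))
    n_test = int(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Toy rows for illustration only.
toy_rows = [
    {"date": "2018-03-01"},
    {"date": "2013-06-15"},
    {"date": "2022-02-10"},
    {"date": "2015-09-30"},
]
train, test = chronological_split(toy_rows, test_fraction=0.25)
print(len(train), len(test))
```

Evaluating on the most recent slice gives a rough view of how a model handles drift that a random split would average away.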