Data Science Pipeline: From EDA to Machine Learning

This landing page presents the comprehensive project for P4AI-DS (CO3135). It spans three core data modalities - Tabular, Text, and Image - guiding each dataset through rigorous Exploratory Data Analysis (EDA) and robust Machine Learning modeling.

About This Project

The objective of this project is to build an end-to-end Data Science pipeline across tabular, text, and image modalities. The project is divided into two assignments. Assignment 1, Exploratory Data Analysis (EDA), examines and visualizes the inherent patterns in the data. Assignment 2, Machine Learning, trains, evaluates, and benchmarks predictive algorithms.

For each modality, the EDA phase covers schema inspection, missing value handling, and visual insight generation. The Machine Learning phase builds on these findings to choose suitable preprocessing techniques, benchmark multiple models, and report feature importance, so that what each model relies on can be checked against the insights gained during EDA.

Reports

Tabular Analysis

Job Salary Prediction Report

Comprehensive exploratory data analysis decoding the underlying drivers of compensation in the tech labor market.

  • Dataset schema, missing values, and feature validation
  • Multivariate interactions between Education and Company Size
  • Category frequency tracking and Class Imbalance mapping
  • Cramér's V categorical redundancy and Pearson correlation
  • Actionable insights for Machine Learning engineering
EDA Results
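As an illustration of the categorical redundancy check listed above, the following is a minimal sketch of Cramér's V written with the Python standard library only. It is the plain, uncorrected form of the statistic; the report itself may use a bias-corrected variant or a library implementation. The function name `cramers_v` and the toy `education` and `company_size` values are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length sequences of labels."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    row_totals = Counter(xs)          # marginal counts of the first variable
    col_totals = Counter(ys)          # marginal counts of the second variable
    observed = Counter(zip(xs, ys))   # joint counts (contingency table cells)
    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        # Undefined with fewer than two levels; 0.0 is a convention chosen here.
        return 0.0
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical toy columns, for illustration only.
education = ["BSc", "MSc", "BSc", "MSc"]
company_size = ["small", "large", "small", "large"]
redundant = cramers_v(education, company_size)

unrelated = cramers_v(["BSc", "BSc", "MSc", "MSc"],
                      ["small", "large", "small", "large"])
```

A value close to 1 suggests that two categorical features carry nearly the same information, which makes one of them a candidate for removal before modeling. A value close to 0 suggests little association.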

Text Analysis

News Category Text Report

Exploratory data analysis for the News Category dataset, focused on class balance, text lengths, missing values, yearly distribution, and keyword patterns.

  • Dataset overview, schema, and sample records
  • Category imbalance and smallest classes
  • Headline and combined text length distributions
  • Timeline drift across publication years
  • Missing values, duplicates, terms, and top authors
EDA Results
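The class balance and headline length checks listed above can be sketched roughly as follows, using only the Python standard library. The record layout (a list of dicts with `category` and `headline` keys), the function name `summarize`, and the sample records are assumptions made for this illustration; they are not the report's actual code or data.

```python
from collections import Counter
from statistics import median


def summarize(records):
    """Summarize class balance and headline length for a list of records.

    Each record is assumed to be a dict with 'category' and 'headline' keys.
    """
    counts = Counter(r["category"] for r in records)
    # Headline length measured in whitespace-separated words.
    lengths = [len(r["headline"].split()) for r in records]
    return {
        # Classes ordered from most to least frequent.
        "class_counts": counts.most_common(),
        # Ratio of the largest class to the smallest class.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
        "median_headline_words": median(lengths),
    }


# Hypothetical sample records, for illustration only.
records = [
    {"category": "POLITICS", "headline": "Senate passes budget bill"},
    {"category": "POLITICS", "headline": "Election results delayed again"},
    {"category": "POLITICS", "headline": "New policy announced"},
    {"category": "TRAVEL", "headline": "Five quiet beaches to visit this summer"},
]
summary = summarize(records)
```

A large imbalance ratio is a signal to consider stratified splits or class weighting in the modeling phase, and the smallest classes appear at the end of `class_counts`.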

Image Analysis

PetFinder EDA Report

Complete EDA covering tabular demographics and image-level analysis: adoption speed distributions, feature correlations, quality metrics, breed gallery, and dimensionality reduction.

  • Dataset overview, adoption speed, and class balance
  • Feature distributions, correlation heatmap, health patterns
  • Image quality, photo count, and color analysis
  • Interactive breed gallery with Dog/Cat tabs
  • t-SNE, PCA, and cross-modality insights
EDA Results

Datasets

The following public datasets were used in this project: