Job Salary Prediction — Machine Learning Dashboard

Interactive visualization tracking predictive performance and dominant feature signals extracted from Tree-based models.

EDA Results

Train/Test Split Configuration

Training Set (80%): 200,000
Testing Set (20%): 50,000
Total Records: 250,000
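
For reference, a minimal sketch of this 80/20 split, assuming a feature frame X and a salary target y (the random_state value is illustrative, not taken from this report):

from sklearn.model_selection import train_test_split

# 250,000 rows -> 200,000 train / 50,000 test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)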

EDA Findings → ML Actions Tracker

Nominal columns (Job Title, Location) have very high cardinality

Target Encoding: Mapped categorical strings to the mean salary of each category inside the cross-validated Pipeline, avoiding the memory cost of massive sparse one-hot matrices.

Education & Company Size possess inherent structural hierarchy

Ordinal Encodings: Hard-coded mapping dictionaries converting string labels to strictly sequenced integers (0 to 4) to force Tree estimators to acknowledge their physical magnitude (Startup < Enterprise).

Features are effectively independent of one another (Cramér's V ≈ 0)

Feature Preservation: Retained all 8 predictor features intact. Skipped PCA compression and recursive feature elimination, which would have stripped usable variance from the modeling space.

Salaries follow a long-tailed distribution with multiplicative ("exponential multiplier") effects

Algorithm Selection: Chose non-linear gradient-boosted tree regressors (XGBoost/LightGBM) over rigid linear models, capturing compounding effects for top earners without transforming the target.

Zero missing values and balanced categories (~20k records per category)

Raw Architecture: Skipped imputers and SMOTE-style resampling entirely, keeping the baseline data raw and untouched.

Unified Preprocessing Architecture Schema

# Unified Universal Pipeline
numeric (experience_years, skills_count)
→ Passthrough
# Bypassed Imputer (0 missing values).
# Bypassed StandardScaler to establish a raw baseline for tree comparison.
categorical_ordinal (education_level, company_size)
→ Manual Dictionary Mapper
# Converts string labels into strictly ordered integer steps [0, 1, 2, 3, 4].
categorical_nominal (job_title, industry, location, remote_work)
→ TargetEncoder
# Fitted on X_train only to compute the mean target (Salary); avoids sparse matrices.
# Estimator Endpoints
→ Baseline Regression: LinearRegression()
→ Ensemble Frameworks: RandomForestRegressor() | XGBRegressor() | LGBMRegressor()
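
The schema above translates into a compact scikit-learn build. A minimal sketch, assuming the column names shown, scikit-learn >= 1.3 (for sklearn.preprocessing.TargetEncoder), and illustrative category labels for the ordinal columns:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, TargetEncoder
from xgboost import XGBRegressor

# Category orders implement the manual dictionary mapping; the labels are assumptions.
education_order = ["High School", "Associate", "Bachelor", "Master", "PhD"]
size_order = ["Startup", "Small", "Medium", "Large", "Enterprise"]

preprocess = ColumnTransformer([
    # Numeric columns pass through untouched: no imputer, no scaler.
    ("numeric", "passthrough", ["experience_years", "skills_count"]),
    # Ordered integers 0..4 preserve the hierarchy for tree splits.
    ("ordinal", OrdinalEncoder(categories=[education_order, size_order]),
     ["education_level", "company_size"]),
    # Mean-salary encoding, fitted on the training fold only.
    ("nominal", TargetEncoder(),
     ["job_title", "industry", "location", "remote_work"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", XGBRegressor()),
])
# model.fit(X_train, y_train); model.predict(X_test)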

Data Preprocessing Pipeline

Ordinal Encoding Mappings

Hierarchical categorical columns are mapped to consecutive integer steps so that the encoded values preserve their rank order.
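
In code this reduces to a pair of mapping dictionaries. A sketch assuming the raw records live in a pandas DataFrame df; the exact label strings are assumptions, since they are not listed here:

# Illustrative mapping dictionaries; the label strings are assumptions.
education_map = {"High School": 0, "Associate": 1, "Bachelor": 2,
                 "Master": 3, "PhD": 4}
size_map = {"Startup": 0, "Small": 1, "Medium": 2, "Large": 3, "Enterprise": 4}

df["education_level"] = df["education_level"].map(education_map)
df["company_size"] = df["company_size"].map(size_map)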

Target Encoding: Job Titles

Instead of inefficient sparse one-hot (OHE) matrices, high-cardinality nominal categories are mapped directly to the mean salary observed for each category in the training split.
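
The encoding itself is just a per-category mean computed on the training split. A minimal pandas sketch, with unseen test categories falling back to the global training mean:

# Mean-salary encoding for job_title, fitted on the training split only.
title_means = y_train.groupby(X_train["job_title"]).mean()
X_train["job_title_enc"] = X_train["job_title"].map(title_means)
# Categories unseen in training fall back to the global training mean.
X_test["job_title_enc"] = X_test["job_title"].map(title_means).fillna(y_train.mean())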

Algorithm Performance Benchmark

Insights (Proving the "Exponential Multiplier")

Standard Linear Regression serves as a solid baseline but underestimates the long tail of top earners. XGBoost and LightGBM capture the non-linear, multiplicative interactions (such as an advanced degree combined with an enterprise-scale employer), delivering a significantly higher R² and a much lower Root Mean Square Error (RMSE).
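
A benchmark loop in this style, reusing the preprocess transformer sketched in the schema section (default hyperparameters shown here, not the tuned values behind the reported scores):

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

estimators = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(),
    "XGBoost": XGBRegressor(),
    "LightGBM": LGBMRegressor(),
}
for name, est in estimators.items():
    pipe = Pipeline([("preprocess", preprocess), ("regressor", est)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: R2={r2_score(y_test, pred):.3f}  "
          f"RMSE={rmse:,.0f}  MAE={mean_absolute_error(y_test, pred):,.0f}")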

Feature Importance Mapping

Insights

The feature importance layout visually mirrors the findings from our EDA. Location, Experience, and Company Size consistently emerge as the primary drivers of final salary, while surface-level features such as short-term certifications barely influence the tree splits.
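
Importances can be read straight off the fitted pipeline. A sketch assuming the model object from the schema section has already been fitted:

import pandas as pd

booster = model.named_steps["regressor"]
feature_names = model.named_steps["preprocess"].get_feature_names_out()
importances = pd.Series(booster.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))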

Final ML Evaluation Report

Champion Algorithm

XGBoost

The eXtreme Gradient Boosting regressor captures the compounding "multiplier effects" of hierarchical groupings without the memory overhead of sparse one-hot matrices.

Validation R²: 97.6%
Test MAE: $4.6k

Architectural Retrospective

These results, measured on a hold-out test split of 50,000 records, validate our foundational EDA strategy. The combination of domain-informed preprocessing and gradient boosting produced a highly accurate Pipeline.

Precision over PCA: By choosing structure-aware target encodings over lossy dimensionality reduction, we kept all eight original features, and their variance, in the modeling space.
Efficiency over sparsity: Avoiding memory-heavy one-hot sparse matrices kept the scikit-learn Pipeline compact and scalable.