Interactive visualization tracking predictive performance and dominant feature signals extracted from Tree-based models.
Target Encoding: Mapped categorical strings to their mean Salary (a continuous value) inside the cross-validated Pipeline, avoiding the memory cost of high-cardinality One-Hot sparse matrices.
Ordinal Encoding: Hard-coded mapping dictionaries convert string labels to ordered integers (0 to 4) so Tree estimators respect each category's inherent rank (e.g. Startup < Enterprise).
Feature Preservation: Retained all 8 predictor features; skipped PCA compression and recursive feature elimination to avoid discarding useful variance from the modeling space.
Algorithm Selection: Preferred non-linear gradient-boosted Tree Regressors (XGBoost/LightGBM) over linear models; they capture interactions and outlier structure without transforming the target.
Raw Architecture: Omitted imputation steps and SMOTE-style resampling so baseline metrics reflect the unaltered data.
Hierarchical categorical columns are mapped to ordered integer steps so their relative magnitude is explicit to the model.
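The ordinal mapping can be sketched as a hand-written dictionary applied with pandas; the column name and category labels below are illustrative.

```python
import pandas as pd

# Hypothetical hierarchy: company size ordered from smallest to largest.
size_order = {"Startup": 0, "Small": 1, "Medium": 2, "Large": 3, "Enterprise": 4}

df = pd.DataFrame({"company_size": ["Startup", "Enterprise", "Medium"]})
# .map replaces each label with its integer rank, preserving the ordering.
df["company_size_encoded"] = df["company_size"].map(size_order)
print(df["company_size_encoded"].tolist())  # [0, 4, 2]
```

Because the integers carry the real ordering, a tree split such as `company_size_encoded <= 2` cleanly separates small from large employers.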
Instead of sparse One-Hot matrices, nominal categories are mapped directly to their mean Salary computed on the training split.
Standard Linear Regression serves as a baseline but underperforms when estimating the long tail of top earners. XGBoost and LightGBM capture non-linear interactions (such as holding an Advanced Degree at an Enterprise-size company), yielding a higher R² and a lower Root Mean Squared Error (RMSE).
The Feature Importance plot mirrors the findings from our EDA: Location, Experience, and Company Size consistently emerge as the primary drivers of salary, while minor features such as short-term certifications barely influence the Tree splits.
This predictive margin, measured on a hold-out test split of 50,000 records, supports our EDA-driven strategy: domain-driven preprocessing combined with gradient boosting produces an accurate Pipeline.
scikit-learn Pipeline architecture.