Stack Overflow Developer Salary Predictor
Author: Rotem Vahava
Assignment: Assignment #2 (Classification, Regression, Clustering & Evaluation)
Dataset: Stack Overflow Developer Survey 2023
Date: April 2026
Executive Summary
I built two machine learning models that predict developer salary from the Stack Overflow 2023 survey: a regression model that predicts the exact salary in dollars (R² = 0.545, MAE ≈ $30K), and a classification model that predicts which salary tier a developer falls into: Low, Mid, or High (accuracy 70.2%, F1-macro 0.70). Both winning models are Gradient Boosting, trained on 45,804 developers across 51 engineered features.
The most surprising finding was that all three classification algorithms (Logistic Regression, Random Forest, Gradient Boosting) converged to within 1.5% of each other: strong evidence that the salary signal in this dataset has a natural ceiling around 70% accuracy. The features that would push beyond this (specific company, exact role level, negotiation skill) simply aren't in the survey.
The biggest single driver of salary turned out to be Country, accounting for ~33% of the model's predictive power.
Presentation Video
Notebook
The complete project notebook with all code, outputs, visualizations, and explanations is included in this repository. Download the notebook to see every step of the analysis end-to-end.
Project Overview
This project builds a complete end-to-end machine learning pipeline that predicts developer compensation using the Stack Overflow Developer Survey 2023, a dataset of ~89,000 developers worldwide with 84 raw features. From those, I selected 16 features most relevant to salary prediction and ended up with 45,804 developers after cleaning.
The same dataset is used for two prediction tasks:
- Regression - predicting the exact annual salary in USD.
- Classification - predicting which salary tier (Low / Mid / High) a developer belongs to.
Part 2: Exploratory Data Analysis
The EDA started with cleaning the data: imputing missing values (median for numeric, "Unknown" for categorical), removing extreme outliers, and visualizing distributions of key features.
Outlier removal
The raw salary column had extreme values that would have distorted any model: entries below $5K (probably typos or freelance side gigs) and above $500K (likely C-level executives or data entry errors). I capped the salary range to $5K–$500K, which removed the noise while keeping the meaningful tail of high earners.
After this filtering, I was left with 45,804 developers with reliable salary data.
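The filtering step can be sketched as a simple pandas mask. This is a minimal illustration on made-up rows; the actual survey salary column is named ConvertedCompYearly.

```python
import pandas as pd

# Made-up rows standing in for the raw survey data.
df = pd.DataFrame({"ConvertedCompYearly": [1_200, 48_000, 95_000, 250_000, 900_000]})

# Keep the $5K-$500K range, dropping likely typos and extreme outliers.
df_clean = df[df["ConvertedCompYearly"].between(5_000, 500_000)].reset_index(drop=True)
print(len(df_clean))  # 3 of the 5 rows survive
```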
Salary distribution
The target variable is heavily right-skewed: most developers earn between $30K and $100K, but a long tail extends to $500K. This skew motivated the log transform applied in Part 4.
Five research questions
The EDA was structured around five focused questions, each answered with a specific visualization.
Q1: Does formal education actually pay off?
I examined whether developers with advanced degrees (Master's, PhD) earn meaningfully more than those without formal education.
Finding: Education has only a moderate effect on salary in the tech industry. While Master's and PhD holders show slightly higher medians, the spread within each education level is enormous. The boxplot reveals that self-taught developers can out-earn PhD holders, suggesting that formal education is a stepping stone but not a salary ceiling.
Q2: Is there a limit to how much experience pays off?
I plotted years of professional coding against salary with a LOWESS trendline to see if the relationship is linear or saturates at some point.
Finding: The relationship is clearly non-linear. Salary grows steeply for the first 10–15 years of professional experience, then plateaus. After ~20 years, the median salary barely increases. This non-linearity motivated the inclusion of tree models (Random Forest, Gradient Boosting) in Part 5, which capture such curves without manual polynomial terms.
Q3: Does remote work affect earning potential?
I compared salary distributions across three work arrangements: Remote, Hybrid, and In-person.
Finding: Fully remote developers show the highest median salary, with Hybrid in the middle and In-person at the bottom. The gap is meaningful: remote workers earn roughly 20–30% more at the median. This likely reflects two effects: senior developers get more remote flexibility, and remote work allows access to higher-paying global markets.
Q4: Does age (and seniority) keep paying through retirement age?
I plotted median salary by age group to see whether earnings keep growing or plateau in later career stages.
Finding: Salary grows steeply from "18–24" through "35–44" (the prime career-building years) and then plateaus. The "55–64" and "65+" groups do not show further increases, suggesting that seniority benefits cap once developers hit senior/staff levels.
Q5: Does starting to code early translate into higher pay later?
I compared total years of coding (including hobby) against years of professional coding to see if early starters earn more later.
Finding: Professional years matter much more than total years. A developer who started coding as a teenager but has 5 years of professional experience earns roughly the same as someone who started coding professionally at age 30 with 5 years of experience. The "hobby head start" doesn't translate into a measurable salary advantage at the same level of professional tenure.
Full feature correlation
The correlation heatmap below shows how all 16 features relate to each other and to salary.
The "experience cluster" (Age, YearsCode, YearsCodePro, WorkExp) is heavily intercorrelated, which I addressed later through derived features and tree-based models that handle multicollinearity better than linear ones.
Part 3: Baseline Linear Regression
The baseline used 13 features (3 numeric + 10 categorical), trained with default parameters and evaluated on a held-out 20% test set.
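A baseline of this shape can be sketched as a scikit-learn pipeline that one-hot encodes the categorical columns and passes the numeric ones through. This is a hedged illustration on tiny synthetic data; the column names below are stand-ins for the 13 real features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic stand-in; the real baseline used 3 numeric + 10 categorical columns.
X = pd.DataFrame({
    "YearsCodePro": [2, 10, 25, 5],
    "Country": ["US", "DE", "US", "IN"],
})
y = [60_000, 90_000, 150_000, 30_000]

preprocess = ColumnTransformer([
    ("num", "passthrough", ["YearsCodePro"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])
baseline = Pipeline([("pre", preprocess), ("reg", LinearRegression())])
baseline.fit(X, y)
preds = baseline.predict(X)
```

Wrapping the preprocessing inside the pipeline keeps the train/test split honest: the encoder is fit only on training rows, and unseen categories at predict time are handled by `handle_unknown="ignore"`.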
Results:
| Metric | Value |
|---|---|
| MAE | $32,583 |
| RMSE | $50,086 |
| RΒ² (test) | 0.498 |
| RΒ² (train) | 0.518 |
| Train-Test gap | 0.020 |
The small gap between train and test R² confirmed there was no overfitting: the model generalized well to unseen data.
Diagnostic plots
Three views of model performance: Actual vs Predicted (deviations from the diagonal show the model under-predicts top earners), Residuals vs Predicted (a slight funnel shape suggests heteroscedasticity), and Distribution of Residuals (right-skewed tail confirms the model misses high salaries by a lot).
Feature importance: coefficients
Country (especially the USA), professional experience, and senior role indicators dominated the top of the ranking, confirming the EDA story.
Part 4: Feature Engineering
Five engineering steps were applied to address the weaknesses observed in the baseline.
Log transform of the target
The skewed salary distribution was log-transformed to make it closer to normal, which is exactly what Linear Regression's assumptions prefer.
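The transform pair can be sketched with NumPy's `log1p`/`expm1`: train on the log scale, then invert predictions so errors can be reported in dollars. The salary values below are hypothetical.

```python
import numpy as np

# Hypothetical salaries spanning the kept $5K-$500K range.
salaries = np.array([5_000.0, 60_000.0, 500_000.0])

# Train on log1p(salary); expm1 inverts it exactly,
# so metrics can later be reported back in dollars.
y_log = np.log1p(salaries)
y_back = np.expm1(y_log)
```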
Multi-hot encoding for tech stack
The semicolon-separated columns LanguageHaveWorkedWith and DatabaseHaveWorkedWith were converted into 30 binary features (top 15 languages + top 15 databases).
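One way to do this conversion is pandas' `Series.str.get_dummies`, which splits on the separator and emits one 0/1 column per value. The series below is a made-up stand-in for the real columns.

```python
import pandas as pd

# Stand-in for the semicolon-separated LanguageHaveWorkedWith column.
langs = pd.Series(["Python;SQL", "Rust", "Python;Rust;SQL"])

# One binary column per language, prefixed to keep feature names readable.
lang_dummies = langs.str.get_dummies(sep=";").add_prefix("lang_")
print(lang_dummies.columns.tolist())  # ['lang_Python', 'lang_Rust', 'lang_SQL']
```

Restricting to the top 15 values per column (as done in the project) would be a `value_counts().head(15)` filter before encoding.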
Derived numeric features
- NumLanguages and NumDatabases: counts of technologies each developer uses.
- ExperienceRatio: proportion of total coding years spent professionally.
- HobbyYears: years coding before going professional.
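The ratio and hobby-years features reduce to two column operations. A minimal sketch on a toy frame (the real columns come from the survey's YearsCode and YearsCodePro):

```python
import pandas as pd

# Toy frame; YearsCode is total coding years, YearsCodePro is professional years.
df = pd.DataFrame({"YearsCode": [10, 20, 8], "YearsCodePro": [5, 20, 2]})

# Share of total coding years spent professionally, and the hobby head start.
df["ExperienceRatio"] = df["YearsCodePro"] / df["YearsCode"]
df["HobbyYears"] = df["YearsCode"] - df["YearsCodePro"]
```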
K-Means clustering
K-Means was applied to the developer profile features (experience + tech versatility). The elbow method suggested K = 4 clusters.
The clusters were validated by visualizing them in 2D using PCA.
The cluster assignment (Cluster) and the distance from each developer to their cluster's centroid (DistToCentroid) were added as new features.
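The two cluster features can be sketched with scikit-learn's KMeans: the label becomes one feature, and the Euclidean distance from each row to its own centroid becomes the other. The data below is synthetic stand-in noise, not the real profile features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled developer-profile features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

cluster = km.labels_  # the new Cluster feature
# Euclidean distance from each row to its assigned cluster centroid.
dist_to_centroid = np.linalg.norm(X_scaled - km.cluster_centers_[cluster], axis=1)
```

The distance feature captures how "typical" a developer is for their cluster, which is information the hard cluster label alone discards.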
Final engineered dataset
The original 16 columns grew into 51 informative features, 35 new features in total.
Part 5: Three Improved Regression Models
Three different regression algorithms were trained and compared on the engineered dataset, with all metrics computed in dollars after reversing the log-transform.
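The dollar-space evaluation reduces to inverting the log transform before scoring. A minimal sketch with hypothetical values (the real predictions come from the fitted pipelines):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true salaries (dollars) and model predictions in log space.
y_true = np.array([40_000.0, 80_000.0, 150_000.0])
y_pred_log = np.log1p(np.array([45_000.0, 75_000.0, 120_000.0]))

# Invert the log transform before scoring so MAE is reported in dollars.
y_pred_usd = np.expm1(y_pred_log)
mae = mean_absolute_error(y_true, y_pred_usd)
```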
Performance comparison
| Model | MAE | RMSE | RΒ² |
|---|---|---|---|
| Linear Regression (Baseline) | $32,583 | $50,086 | 0.498 |
| Linear Regression (Engineered) | ~$31,000 | ~$48,500 | 0.530 |
| Random Forest | ~$31,200 | ~$49,000 | 0.520 |
| Gradient Boosting (Winner) | ~$30,200 | ~$47,800 | 0.545 |
Predictions vs reality
All three models show similar patterns: predictions cluster well in the middle range but struggle to reach very high or very low salaries, reflecting the same data ceiling effect that limits accuracy.
Feature importance
Both tree-based models agreed on the top drivers: Country (especially USA), professional experience, and the engineered cluster features.
Part 6: Upload Best Regression Model
The Gradient Boosting Regressor pipeline was saved as gradient_boosting_salary_regressor.pkl and uploaded to this repository.
Part 7: Regression to Classification
The continuous salary target was converted into 3 ordinal classes using tertile binning (33rd and 67th percentiles):
- Low: salaries below $57,249 (15,110 developers)
- Mid: salaries between $57,249 and $105,517 (15,575 developers)
- High: salaries above $105,517 (15,119 developers)
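The tertile binning is a one-liner with pandas' `qcut`. The salaries below are synthetic; the real cut points ($57,249 and $105,517) were learned from the survey data.

```python
import numpy as np
import pandas as pd

# Synthetic salaries in the kept $5K-$500K range.
rng = np.random.default_rng(42)
salary = pd.Series(rng.uniform(5_000, 500_000, size=300))

# Cutting at the 33rd/67th percentiles yields three near-equal ordinal tiers.
tier = pd.qcut(salary, q=[0, 1 / 3, 2 / 3, 1], labels=["Low", "Mid", "High"])
```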
Class balance
The classes ended up nearly perfectly balanced (33% / 34% / 33%), which made accuracy a meaningful metric without needing rebalancing techniques.
Part 8: Three Classification Models
Performance comparison
| Model | Accuracy | F1-macro |
|---|---|---|
| Logistic Regression | 70.11% | 0.7018 |
| Random Forest | 68.54% | 0.6886 |
| Gradient Boosting (Winner) | 70.18% | 0.7034 |
Confusion matrices
A reassuring pattern emerges across all three models: most mistakes happen between adjacent classes (Low ↔ Mid or Mid ↔ High). The dangerous "extreme" misclassifications (Low ↔ High) only happen in ~3% of test predictions. This means the models inherently grasp the ordinal structure of salary tiers, even though they were never explicitly told the classes are ordered.
ROC curves
Gradient Boosting wins consistently across all three classes: AUC = 0.909 for both Low and High, 0.796 for Mid. The pattern across all models is the same: Low and High have AUCs around 0.90 (clean extremes that are easy to identify), while Mid sits noticeably lower at ~0.79 (the in-between class without clean boundaries).
Precision-Recall curves
Precision-Recall curves complement the ROC analysis with another view on model quality, especially useful when looking at the tradeoff between catching actual positives (recall) and being right when predicting positives (precision).
The pattern matches what I saw in ROC: Gradient Boosting wins across all three classes, with AP of 0.832 for Low, 0.635 for Mid, and 0.858 for High. Logistic Regression is right behind, and Random Forest comes in last but only by a small margin. The Mid class consistently has the lowest AP (~0.61–0.64) across all models, confirming the same pattern from ROC and the confusion matrices: Mid sits between Low and High without clean boundaries, so it's harder for any model to be both precise and complete about it. Even the worst Mid curve at AP = 0.61 is nearly twice the no-skill baseline of 0.33, confirming the models add real predictive value.
Feature importance
The top features for classification mirror the regression task: Country dominates, followed by YearsCodePro and the engineered clustering features. This consistency across both prediction tasks confirms the salary signal is robust: the same factors that determine the dollar amount also determine the salary tier.
Key Insights
Top salary drivers
- Country (especially USA): by far the biggest driver, ~33% of feature importance. Being in the US matters more than almost anything else.
- YearsCodePro: the strongest numeric predictor, confirming the "experience pays" intuition.
- DistToCentroid: the K-Means cluster feature made it into the top 6 features for Random Forest, validating that the unsupervised clustering work added real signal.
- Specific languages: lang_PHP shows up as a negative predictor (PHP correlates with lower-paying roles); other languages like Rust appear as positive markers.
Why ~70% accuracy is the realistic ceiling
All three classification algorithms (Logistic Regression, Random Forest, and Gradient Boosting) converged to within 1.5% of each other. This consistency strongly suggests a data ceiling, not an algorithm ceiling. The biggest predictors of salary are not in the survey: specific company name, exact role level (Junior/Senior/Staff), negotiation skill, and individual performance reviews.
With the available features, ~70% accuracy and R² ≈ 0.55 are the realistic best: well above the 34% naive baseline (always-predict-majority) and well below what would be achievable if company-level signals were available.
Repository Contents
| File | Description |
|---|---|
| gradient_boosting_salary_regressor.pkl | Winning regression model: predicts dollar salary |
| salary_class_classifier.pkl | Winning classification model: predicts salary tier |
| README.md | This file |
| *.png | All visualizations referenced in this README |
How to Use the Models
### Loading the regression model

```python
import pickle
import numpy as np

with open("gradient_boosting_salary_regressor.pkl", "rb") as f:
    reg_model = pickle.load(f)

# X_new should be a DataFrame with the same columns as the training data
predicted_log_salary = reg_model.predict(X_new)
predicted_salary_usd = np.expm1(predicted_log_salary)
print(f"Predicted salary: ${predicted_salary_usd[0]:,.2f}")
```
### Loading the classification model
```python
import pickle
import pandas as pd
with open("salary_class_classifier.pkl", "rb") as f:
clf_model = pickle.load(f)
predicted_class = clf_model.predict(X_new)
predicted_proba = clf_model.predict_proba(X_new)
print(f"Predicted tier: {predicted_class[0]}")
print(f"Class probabilities: {dict(zip(clf_model.classes_, predicted_proba[0]))}")
```
Dataset
Source: Stack Overflow Developer Survey 2023 on Kaggle
Original size: ~89,000 respondents, 84 features in the raw dataset
Selected for analysis: 16 features chosen for relevance to salary prediction
After cleaning: 45,804 developers with valid salary data ($5Kβ$500K range)
Tech Stack
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn