Stack Overflow Developer Salary Predictor
Author: Rotem Vahava
Assignment: Assignment #2 (Classification, Regression, Clustering & Evaluation)
Dataset: Stack Overflow Developer Survey 2023
Date: April 2026
Executive Summary
I built two machine learning models that predict developer salary from the Stack Overflow 2023 survey: a regression model that predicts the exact salary in dollars (R² = 0.545, MAE ≈ $30K), and a classification model that predicts which salary tier a developer falls into: Low, Mid, or High (accuracy 70.2%, F1-macro 0.70). Both winning models are Gradient Boosting, trained on 45,804 developers across 51 engineered features.
The most surprising finding was that all three classification algorithms (Logistic Regression, Random Forest, Gradient Boosting) converged to within 1.5% of each other: strong evidence that the salary signal in this dataset has a natural ceiling around 70% accuracy. The features that would push beyond this (specific company, exact role level, negotiation skill) simply aren't in the survey.
The biggest single driver of salary turned out to be Country, accounting for ~33% of the model's predictive power.
Presentation Video
Notebook
The complete project notebook with all code, outputs, visualizations, and explanations is included in this repository. Download the notebook to see every step of the analysis end-to-end.
Project Overview
This project builds a complete end-to-end machine learning pipeline that predicts developer compensation using the Stack Overflow Developer Survey 2023, a dataset of ~89,000 developers worldwide with 84 raw features. From those, I selected 16 features most relevant to salary prediction and ended up with 45,804 developers after cleaning.
The same dataset is used for two prediction tasks:
- Regression - predicting the exact annual salary in USD.
- Classification - predicting which salary tier (Low / Mid / High) a developer belongs to.
Part 2: Exploratory Data Analysis
The EDA started with cleaning the data: imputing missing values (median for numeric, "Unknown" for categorical), removing extreme outliers, and visualizing distributions of key features.
Outlier removal
The raw salary column had extreme values that would have distorted any model: entries below $5K (probably typos or freelance side gigs) and above $500K (likely C-level executives or data entry errors). I capped the salary range to $5K–$500K, which removed the noise while keeping the meaningful tail of high earners.
After this filtering, I was left with 45,804 developers with reliable salary data.
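The filtering step can be sketched as a simple pandas mask. This is a minimal illustration on made-up rows; the actual survey salary column is named ConvertedCompYearly.

```python
import pandas as pd

# Made-up rows standing in for the raw survey data.
df = pd.DataFrame({"ConvertedCompYearly": [1_200, 48_000, 95_000, 250_000, 900_000]})

# Keep the $5K-$500K range, dropping likely typos and extreme outliers.
df_clean = df[df["ConvertedCompYearly"].between(5_000, 500_000)].reset_index(drop=True)
print(len(df_clean))  # 3 of the 5 rows survive
```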
Salary distribution
The target variable is heavily right-skewed: most developers earn between $30K and $100K, but a long tail extends to $500K. This skew motivated the log transform applied in Part 4.
Five research questions
The EDA was structured around five focused questions, each answered with a specific visualization.
Q1: Does formal education actually pay off?
I examined whether developers with advanced degrees (Master's, PhD) earn meaningfully more than those without formal education.
Finding: Education has only a moderate effect on salary in the tech industry. While Master's and PhD holders show slightly higher medians, the spread within each education level is enormous. The boxplot reveals that self-taught developers can out-earn PhD holders, suggesting that formal education is a stepping stone but not a salary ceiling.
Q2: Is there a limit to how much experience pays off?
I plotted years of professional coding against salary with a LOWESS trendline to see if the relationship is linear or saturates at some point.
Finding: The relationship is clearly non-linear. Salary grows steeply for the first 10–15 years of professional experience, then plateaus. After ~20 years, the median salary barely increases. This non-linearity motivated the inclusion of tree models (Random Forest, Gradient Boosting) in Part 5, which capture such curves without manual polynomial terms.
Q3: Does remote work affect earning potential?
I compared salary distributions across three work arrangements: Remote, Hybrid, and In-person.
Finding: Fully remote developers show the highest median salary, with Hybrid in the middle and In-person at the bottom. The gap is meaningful: remote workers earn roughly 20–30% more at the median. This likely reflects two effects: senior developers get more remote flexibility, and remote work allows access to higher-paying global markets.
Q4: Does age (and seniority) keep paying through retirement age?
I plotted median salary by age group to see whether earnings keep growing or plateau in later career stages.
Finding: Salary grows steeply from "18–24" through "35–44" (the prime career-building years) and then plateaus. The "55–64" and "65+" groups do not show further increases, suggesting that seniority benefits cap once developers hit senior/staff levels.
Q5: Does starting to code early translate into higher pay later?
I compared total years of coding (including hobby) against years of professional coding to see if early starters earn more later.
Finding: Professional years matter much more than total years. A developer who started coding as a teenager but has 5 years of professional experience earns roughly the same as someone who started coding professionally at age 30 with 5 years of experience. The "hobby head start" doesn't translate into a measurable salary advantage at the same level of professional tenure.
Full feature correlation
The correlation heatmap below shows how all 16 features relate to each other and to salary.
The "experience cluster" (Age, YearsCode, YearsCodePro, WorkExp) is heavily intercorrelated, which I addressed later through derived features and tree-based models that handle multicollinearity better than linear ones.
Part 3: Baseline Linear Regression
The baseline used 13 features (3 numeric + 10 categorical), trained with default parameters and evaluated on a held-out 20% test set.
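A baseline of this shape can be sketched as a scikit-learn pipeline that one-hot encodes the categorical columns and passes the numeric ones through. This is a hedged illustration on tiny synthetic data; the column names below are stand-ins for the 13 real features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic stand-in; the real baseline used 3 numeric + 10 categorical columns.
X = pd.DataFrame({
    "YearsCodePro": [2, 10, 25, 5],
    "Country": ["US", "DE", "US", "IN"],
})
y = [60_000, 90_000, 150_000, 30_000]

preprocess = ColumnTransformer([
    ("num", "passthrough", ["YearsCodePro"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])
baseline = Pipeline([("pre", preprocess), ("reg", LinearRegression())])
baseline.fit(X, y)
preds = baseline.predict(X)
```

Wrapping the preprocessing inside the pipeline keeps the train/test split honest: the encoder is fit only on training rows, and unseen categories at predict time are handled by `handle_unknown="ignore"`.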
Results:
| Metric | Value |
|---|---|
| MAE | $32,583 |
| RMSE | $50,086 |
| RΒ² (test) | 0.498 |
| RΒ² (train) | 0.518 |
| Train-Test gap | 0.020 |
The small gap between train and test R² confirmed there was no overfitting: the model generalized well to unseen data.
Diagnostic plots
Three views of model performance: Actual vs Predicted (deviations from the diagonal show the model under-predicts top earners), Residuals vs Predicted (a slight funnel shape suggests heteroscedasticity), and Distribution of Residuals (right-skewed tail confirms the model misses high salaries by a lot).
Feature importance: coefficients
Country (especially the USA), professional experience, and senior role indicators dominated the top of the ranking, confirming the EDA story.
Part 4: Feature Engineering
Five engineering steps were applied to address the weaknesses observed in the baseline.
Log transform of the target
The skewed salary distribution was log-transformed to make it closer to normal, which is exactly what Linear Regression's assumptions prefer.
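The transform pair can be sketched with NumPy's `log1p`/`expm1`: train on the log scale, then invert predictions so errors can be reported in dollars. The salary values below are hypothetical.

```python
import numpy as np

# Hypothetical salaries spanning the kept $5K-$500K range.
salaries = np.array([5_000.0, 60_000.0, 500_000.0])

# Train on log1p(salary); expm1 inverts it exactly,
# so metrics can later be reported back in dollars.
y_log = np.log1p(salaries)
y_back = np.expm1(y_log)
```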
Multi-hot encoding for tech stack
The semicolon-separated columns LanguageHaveWorkedWith and DatabaseHaveWorkedWith were converted into 30 binary features (top 15 languages + top 15 databases).
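One way to do this conversion is pandas' `Series.str.get_dummies`, which splits on the separator and emits one 0/1 column per value. The series below is a made-up stand-in for the real columns.

```python
import pandas as pd

# Stand-in for the semicolon-separated LanguageHaveWorkedWith column.
langs = pd.Series(["Python;SQL", "Rust", "Python;Rust;SQL"])

# One binary column per language, prefixed to keep feature names readable.
lang_dummies = langs.str.get_dummies(sep=";").add_prefix("lang_")
print(lang_dummies.columns.tolist())  # ['lang_Python', 'lang_Rust', 'lang_SQL']
```

Restricting to the top 15 values per column (as done in the project) would be a `value_counts().head(15)` filter before encoding.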
Derived numeric features
- NumLanguages and NumDatabases: counts of technologies each developer uses.
- ExperienceRatio: proportion of total coding years spent professionally.
- HobbyYears: years coding before going professional.
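The ratio and hobby-years features reduce to two column operations. A minimal sketch on a toy frame (the real columns come from the survey's YearsCode and YearsCodePro):

```python
import pandas as pd

# Toy frame; YearsCode is total coding years, YearsCodePro is professional years.
df = pd.DataFrame({"YearsCode": [10, 20, 8], "YearsCodePro": [5, 20, 2]})

# Share of total coding years spent professionally, and the hobby head start.
df["ExperienceRatio"] = df["YearsCodePro"] / df["YearsCode"]
df["HobbyYears"] = df["YearsCode"] - df["YearsCodePro"]
```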
K-Means clustering
K-Means was applied to the developer profile features (experience + tech versatility). The elbow method suggested K = 4 clusters.
The clusters were validated by visualizing them in 2D using PCA.
The cluster assignment (Cluster) and the distance from each developer to their cluster's centroid (DistToCentroid) were added as new features.
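The two cluster features can be sketched with scikit-learn's KMeans: the label becomes one feature, and the Euclidean distance from each row to its own centroid becomes the other. The data below is synthetic stand-in noise, not the real profile features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled developer-profile features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

cluster = km.labels_  # the new Cluster feature
# Euclidean distance from each row to its assigned cluster centroid.
dist_to_centroid = np.linalg.norm(X_scaled - km.cluster_centers_[cluster], axis=1)
```

The distance feature captures how "typical" a developer is for their cluster, which is information the hard cluster label alone discards.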
Final engineered dataset
The original 16 columns grew into 51 informative features, 35 new features in total.
Part 5: Three Improved Regression Models
Three different regression algorithms were trained and compared on the engineered dataset, with all metrics computed in dollars after reversing the log-transform.
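The dollar-space evaluation reduces to inverting the log transform before scoring. A minimal sketch with hypothetical values (the real predictions come from the fitted pipelines):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true salaries (dollars) and model predictions in log space.
y_true = np.array([40_000.0, 80_000.0, 150_000.0])
y_pred_log = np.log1p(np.array([45_000.0, 75_000.0, 120_000.0]))

# Invert the log transform before scoring so MAE is reported in dollars.
y_pred_usd = np.expm1(y_pred_log)
mae = mean_absolute_error(y_true, y_pred_usd)
```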
Performance comparison
| Model | MAE | RMSE | RΒ² |
|---|---|---|---|
| Linear Regression (Baseline) | $32,583 | $50,086 | 0.498 |
| Linear Regression (Engineered) | ~$31,000 | ~$48,500 | 0.530 |
| Random Forest | ~$31,200 | ~$49,000 | 0.520 |
| Gradient Boosting (Winner) | ~$30,200 | ~$47,800 | 0.545 |
Predictions vs reality
All three models show similar patterns: predictions cluster well in the middle range but struggle to reach very high or very low salaries, reflecting the same data ceiling effect that limits accuracy.
Feature importance
Both tree-based models agreed on the top drivers: Country (especially USA), professional experience, and the engineered cluster features.
Part 6: Upload Best Regression Model
The Gradient Boosting Regressor pipeline was saved as gradient_boosting_salary_regressor.pkl and uploaded to this repository.
Part 7: Regression to Classification
The continuous salary target was converted into 3 ordinal classes using tertile binning (33rd and 67th percentiles):
- Low: salaries below $57,249 (15,110 developers)
- Mid: salaries between $57,249 and $105,517 (15,575 developers)
- High: salaries above $105,517 (15,119 developers)
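The tertile binning is a one-liner with pandas' `qcut`. The salaries below are synthetic; the real cut points ($57,249 and $105,517) were learned from the survey data.

```python
import numpy as np
import pandas as pd

# Synthetic salaries in the kept $5K-$500K range.
rng = np.random.default_rng(42)
salary = pd.Series(rng.uniform(5_000, 500_000, size=300))

# Cutting at the 33rd/67th percentiles yields three near-equal ordinal tiers.
tier = pd.qcut(salary, q=[0, 1 / 3, 2 / 3, 1], labels=["Low", "Mid", "High"])
```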
Class balance
The classes ended up nearly perfectly balanced (33% / 34% / 33%), which made accuracy a meaningful metric without needing rebalancing techniques.
Part 8: Three Classification Models
Performance comparison
| Model | Accuracy | F1-macro |
|---|---|---|
| Logistic Regression | 70.11% | 0.7018 |
| Random Forest | 68.54% | 0.6886 |
| Gradient Boosting (Winner) | 70.18% | 0.7034 |
Confusion matrices
A reassuring pattern emerges across all three models: most mistakes happen between adjacent classes (Low ↔ Mid or Mid ↔ High). The dangerous "extreme" misclassifications (Low ↔ High) only happen in ~3% of test predictions. This means the models inherently grasp the ordinal structure of salary tiers, even though they were never explicitly told the classes are ordered.
ROC curves
Gradient Boosting wins consistently across all three classes: AUC = 0.909 for both Low and High, 0.796 for Mid. The pattern across all models is the same: Low and High have AUCs around 0.90 (clean extremes that are easy to identify), while Mid sits noticeably lower at ~0.79 (the in-between class without clean boundaries).
Precision-Recall curves
Precision-Recall curves complement the ROC analysis with another view on model quality, especially useful when looking at the tradeoff between catching actual positives (recall) and being right when predicting positives (precision).
The pattern matches what I saw in ROC: Gradient Boosting wins across all three classes, with AP of 0.832 for Low, 0.635 for Mid, and 0.858 for High. Logistic Regression is right behind, and Random Forest comes in last but only by a small margin. The Mid class consistently has the lowest AP (~0.61–0.64) across all models, confirming the same pattern from ROC and the confusion matrices: Mid sits between Low and High without clean boundaries, so it's harder for any model to be both precise and complete about it. Even the worst Mid curve at AP = 0.61 is nearly twice the no-skill baseline of 0.33, confirming the models add real predictive value.
Feature importance
The top features for classification mirror the regression task: Country dominates, followed by YearsCodePro and the engineered clustering features. This consistency across both prediction tasks confirms the salary signal is robust: the same factors that determine the dollar amount also determine the salary tier.
Key Insights
Top salary drivers
- Country (especially USA): by far the biggest driver, ~33% of feature importance. Being in the US matters more than almost anything else.
- YearsCodePro: the strongest numeric predictor, confirming the "experience pays" intuition.
- DistToCentroid: the K-Means cluster feature made it into the top 6 features for Random Forest, validating that the unsupervised clustering work added real signal.
- Specific languages: lang_PHP shows up as a negative predictor (PHP correlates with lower-paying roles); other languages like Rust appear as positive markers.
Why ~70% accuracy is the realistic ceiling
All three classification algorithms (Logistic Regression, Random Forest, and Gradient Boosting) converged to within 1.5% of each other. This consistency strongly suggests a data ceiling, not an algorithm ceiling. The biggest predictors of salary are not in the survey: specific company name, exact role level (Junior/Senior/Staff), negotiation skill, and individual performance reviews.
With the available features, ~70% accuracy and R² ≈ 0.55 are the realistic best: well above the 34% naive baseline (always-predict-majority) and well below what would be achievable if company-level signals were available.
Repository Contents
| File | Description |
|---|---|
| gradient_boosting_salary_regressor.pkl | Winning regression model: predicts dollar salary |
| salary_class_classifier.pkl | Winning classification model: predicts salary tier |
| README.md | This file |
| *.png | All visualizations referenced in this README |
How to Use the Models
### Loading the regression model

```python
import pickle
import numpy as np

with open("gradient_boosting_salary_regressor.pkl", "rb") as f:
    reg_model = pickle.load(f)

# X_new should be a DataFrame with the same columns as the training data
predicted_log_salary = reg_model.predict(X_new)
predicted_salary_usd = np.expm1(predicted_log_salary)
print(f"Predicted salary: ${predicted_salary_usd[0]:,.2f}")
```
### Loading the classification model
```python
import pickle
import pandas as pd
with open("salary_class_classifier.pkl", "rb") as f:
clf_model = pickle.load(f)
predicted_class = clf_model.predict(X_new)
predicted_proba = clf_model.predict_proba(X_new)
print(f"Predicted tier: {predicted_class[0]}")
print(f"Class probabilities: {dict(zip(clf_model.classes_, predicted_proba[0]))}")
```
Dataset
Source: Stack Overflow Developer Survey 2023 on Kaggle
Original size: ~89,000 respondents, 84 features in the raw dataset
Selected for analysis: 16 features chosen for relevance to salary prediction
After cleaning: 45,804 developers with valid salary data ($5Kβ$500K range)
Tech Stack
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn