Used-Car Price Modeling for Three Rivers Auto

December 7th, 2024 by Michelle Star

This is an abridged version of my Machine Learning final project, in which I compare several machine learning models for predicting used-car prices and identify the factors that most strongly drive price.

Project brief

Three Rivers Auto asked for two things:
1) understand which features most drive used-car price, and
2) ship a model that predicts price for 1,000 unseen listings.

Data: 1,809 training rows with log(price) and car features; 1,000 test rows without price.

TL;DR

Best model: Gradient Boosted Trees (GBM), with RMSE 0.2805 and R² 0.8686 on log(price).
Most important signals: mileage ↓, model year ↑, horsepower ↑ then plateaus, and brand premiums (Porsche, Lexus, Toyota).
Color, transmission type, and accident flags had limited incremental value.
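
For readers who want the two headline metrics spelled out, here is a minimal sketch in Python (the project code itself is in R, so this is illustrative only). The toy numbers are made up. The last lines assume the target uses natural logs, which this post does not state; under that assumption a log-scale RMSE of 0.2805 corresponds to a typical multiplicative error of about exp(0.2805) ≈ 1.32.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """Share of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Toy log-price values, invented for illustration (not project data).
log_actual = [9.0, 10.0, 11.0]
log_pred = [9.5, 10.0, 10.5]
print(rmse(log_actual, log_pred), r_squared(log_actual, log_pred))

# Assumption: natural log. A log-scale RMSE then maps to a
# multiplicative error factor on the original price scale.
typical_factor = math.exp(0.2805)
print(round(typical_factor, 2))
```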

Dataset and checks

No missing values. Mileage and horsepower span wide ranges (up to 405k and 760, respectively).
Correlation snapshot:

[Figure: Correlation heatmap]

Price is log-scaled. Distribution below:

[Figure: Log price distribution]

Exploratory patterns

Horsepower
[Figures: HP distribution; HP vs price]

Model year and mileage
[Figures: Year vs price; Mileage vs price]

Brand and color
[Figures: Median price by brand; Color premiums]

Takeaways: newer cars and lower mileage command higher prices; brand effects persist after controlling for the other features; extreme HP has diminishing returns.

Modeling approach

The 1,809 labeled rows were split 80/20 into training and holdout sets, with log(price) as the target. Models compared:

Linear, Ridge, Lasso
Random Forest (RF)
Generalized Additive Model (GAM)
Gradient Boosted Trees (GBM)
Classical feature selection and dimension reduction: Best Subset, Stepwise, Principal Components Regression (PCR), Partial Least Squares (PLS)
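
The 80/20 split on the labeled rows can be sketched as follows. This is a Python illustration, not the project's R code; the function name, the seed value, and the rounding rule are my assumptions. Only the 1,809 row count and the 80/20 ratio come from the project.

```python
import random

def train_holdout_split(n_rows, holdout_fraction=0.2, seed=42):
    """Return (train_indices, holdout_indices) from a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_holdout = round(n_rows * holdout_fraction)
    holdout = sorted(indices[:n_holdout])
    train = sorted(indices[n_holdout:])
    return train, holdout

train_idx, holdout_idx = train_holdout_split(1809)
print(len(train_idx), len(holdout_idx))  # prints: 1447 362
```

Every model is then fit on the training indices and scored on the holdout indices, so the comparison uses the same rows throughout.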

Cross-validation examples:
[Figures: PCR CV; PLS CV]
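
As a rough picture of what those cross-validation curves measure, here is a generic k-fold sketch in Python (illustrative only; the project code is in R). The one-feature line fit is a hypothetical stand-in for PCR or PLS, and the fold count, seed, and data are assumptions. In practice the same loop would be rerun for each candidate number of components and the lowest CV error chosen.

```python
import math
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Shuffle row indices and deal them into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_rmse(xs, ys, fit, predict, k=5, seed=0):
    """Pooled out-of-fold RMSE for a fit/predict pair."""
    squared_error = 0.0
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_error += sum((ys[i] - predict(model, xs[i])) ** 2 for i in fold)
    return math.sqrt(squared_error / len(xs))

# Hypothetical stand-in model: one-feature least squares.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def predict_line(model, x):
    slope, intercept = model
    return intercept + slope * x

# Made-up, noise-free data, so the CV error should be essentially zero.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
score = cv_rmse(xs, ys, fit_line, predict_line)
print(score)
```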

Results

[Figure: Model comparison]

Best model: GBM
[Figures: GBM actual vs predicted; GBM feature importance; GBM residuals]

Benchmarks

Linear, Ridge, and Lasso: solid baselines, but they miss nonlinear effects.
RF and GAM: competitive with GBM, with interpretable feature effects.

What drives price (business view)

Mileage dominates: price drops steeply and monotonically from ~50k to 100k, and the decline tapers beyond 200k.
Model year: prices step up for ~2015+ inventory.
Powertrain: horsepower helps up to the mid-high range, with diminishing returns afterward.
Brand equity: Porsche, Lexus, and Toyota retain value; Dodge and Chrysler lag.
Low-leverage features: color, transmission label, and accident flag add little once the core factors are included.
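
One common way to read effects like these off a fitted model is a partial dependence curve: fix one feature at a grid value for every row, average the predictions, and repeat across the grid. Whether the project produced its effect plots exactly this way is not stated here, so treat the Python below as a sketch. The predictor, feature names, and rows are all hypothetical.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows stay untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical stand-in for a fitted model that predicts log(price).
def toy_predict(row):
    return 10.0 - 0.005 * row["mileage_k"] + 0.02 * (row["year"] - 2000)

# Made-up rows, not project data.
rows = [
    {"mileage_k": 30, "year": 2018},
    {"mileage_k": 120, "year": 2010},
]
curve = partial_dependence(toy_predict, rows, "mileage_k", [50, 100, 200])
print(curve)
```

With a real GBM in place of `toy_predict`, the shape of this curve is what shows where a drop is steep and where it tapers.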

Limitations and next steps

Tail errors: more variance on very cheap cars; rare categories (e.g., colors) are sparse.
Add features if available: condition grades, owners, service history, trim, options, market signals.
Monitor drift; retrain with newer sales cycles.

Appendix: more figures

[Figure: Linear model residuals]
[Figure: Linear model standardized importance]
[Figure: Ridge actual vs predicted]
[Figure: Lasso actual vs predicted]
[Figure: GBM CV curve]

Files produced

Predictions CSV: columns id and price (on the log scale) for the 1,000 test cars.
Code: single R file with all steps and seeds.
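
For anyone consuming the predictions file, here is a small Python sketch of its layout and of converting log-prices back to prices. It is illustrative only: the ids and values are made up, the four-decimal formatting is my choice, and the back-transform assumes natural logs, which this post does not state.

```python
import csv
import io
import math

def write_predictions(rows, handle):
    """Write (id, log_price) pairs as a two-column CSV with a header."""
    writer = csv.writer(handle)
    writer.writerow(["id", "price"])
    for car_id, log_price in rows:
        writer.writerow([car_id, f"{log_price:.4f}"])

# Made-up predictions for illustration (not project output).
predictions = [(1, 9.9035), (2, 10.5966)]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_predictions(predictions, buffer)
print(buffer.getvalue())

# Assumption: natural log, so exp() recovers the price scale.
prices = [round(math.exp(log_price)) for _, log_price in predictions]
print(prices)
```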