Retail Returns Intelligence

01 — The problem

Most retailers catch excessive returners after the loss.

A small fraction of customers drive a disproportionate share of returns through wardrobing, bracketing, serial returns, and policy abuse. The signal is in the transaction stream — but it usually gets read reactively, once the refunds have already cleared.

$743B

merchandise returned in the U.S. in 2023 — 14.5% of all retail sales.

$101B

lost to return fraud and abuse — roughly $13.70 of every $100 returned.

22.8%

of transactions in the dataset arrive with missing CustomerIDs: real, messy operational data.

Source for the U.S. return and fraud figures: National Retail Federation & Appriss Retail, 2023 Consumer Returns report.

02 — Live demo

Score a transaction in real time.

This calls a live FastAPI backend running the trained models — real inference, not a mockup. Pick a real customer or enter your own transaction to get a return probability, risk tier, customer segment, anomaly flag, and the top SHAP factors driving the prediction.

Transaction input

Prediction

Submit a transaction to see the model's scored output.

Customer profile lookup

Pull the full behavioral profile for any CustomerID.

03 — Results that hold up

Evaluated like it ships, not like a demo.

Temporal train/test split, point-in-time-safe features, threshold selection, explainability, and rolling-window backtesting. Every figure below is generated by the notebooks in the repo.

Figure: precision-recall curve and F1-vs-threshold for the LightGBM return classifier.

Classifier · LightGBM

**PR-AUC 0.852, edging XGBoost (0.849).** The decision threshold is tuned to 0.94, which gives F1 = 0.789. Precision-recall is the right lens on a 1.8% base rate.
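Threshold tuning of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the project's code: the function name `best_f1_threshold`, the toy labels and scores, and the candidate grid are all assumptions made for the example.

```python
def best_f1_threshold(y_true, y_score, thresholds):
    """Return the (threshold, f1) pair with the highest F1 on the given data."""
    best = (None, -1.0)
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, y_score) if s < t and y == 1)
        if tp == 0:
            continue  # precision and recall are both zero or undefined here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:  # strict comparison keeps the lowest tied threshold
            best = (t, f1)
    return best


# Hypothetical validation labels (1 = returned) and model scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.40, 0.55, 0.60, 0.80, 0.95]
thresholds = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, f1 = best_f1_threshold(y_true, y_score, thresholds)
```

On heavily imbalanced data the chosen cut-off can sit far from 0.5, which is consistent with a tuned threshold as high as 0.94. The threshold should be picked on validation data only, never on the final test window.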

Figure: SHAP global feature importance for the LightGBM classifier.

Explainability · SHAP

**Every score is attributable.** Quantity, category return rate, and price drive risk, and each prediction ships with its top factors.

Figure: Isolation Forest score distribution and return rate by anomaly group.

Anomaly · Isolation Forest

**294 excessive returners flagged without labels.** Flagged customers show a visibly higher return-rate distribution, validating the unsupervised signal.

Figure: PCA projection of KMeans customer segments.

Segmentation · KMeans (k=4)

**Four behavioral segments for differentiated policy.** Premium Loyal, Healthy Browser, At-Risk, and Returner, separated on RFM + return features.

Figure: power curve and control-vs-treatment return rate for the A/B policy simulation.

Experimentation · A/B simulation

**Powered to 80%, significant at p<0.05.** A 14-day return window on the Returner segment cuts the return-value ratio, with the spend guardrail intact.
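For readers who want to see the significance check spelled out, here is a minimal sketch of a pooled two-proportion z-test applied to a return rate. The function name and the control/treatment counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the simulation's actual numbers.

```python
import math


def two_proportion_z_test(x_control, n_control, x_treat, n_treat):
    """Two-sided z-test for a difference in proportions, using pooled variance."""
    p_control = x_control / n_control
    p_treat = x_treat / n_treat
    p_pool = (x_control + x_treat) / (n_control + n_treat)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treat))
    z = (p_treat - p_control) / se
    # Two-sided p-value from the standard normal tail: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical arms: 300 of 2000 control orders returned vs 240 of 2000 treated.
z, p_value = two_proportion_z_test(
    x_control=300, n_control=2000, x_treat=240, n_treat=2000
)
```

A negative `z` means the treatment arm returns less often than control. A guardrail metric such as spend would be checked separately so that a lower return rate is not bought with lost revenue.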

Figure: rolling-window backtest showing Brier score, precision at decile, and actual vs predicted return rate over time.

Backtesting · 18 rolling windows

**Stable across 18 months, not a single lucky split.** Precision@decile holds near 0.18 and predicted return rate tracks actuals window over window.
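The three backtest metrics are simple to compute per window. The sketch below assumes each window is a pair of lists (actual 0/1 outcomes and predicted probabilities); the function names and the toy windows are illustrative, not taken from the repo.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def precision_at_top_decile(y_true, y_prob):
    """Share of actual returns among the 10% highest-scored transactions."""
    k = max(1, len(y_true) // 10)
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


def backtest(windows):
    """Score each (y_true, y_prob) window independently, in time order."""
    return [
        {
            "brier": brier_score(y_true, y_prob),
            "precision_at_decile": precision_at_top_decile(y_true, y_prob),
            "actual_rate": sum(y_true) / len(y_true),
            "predicted_rate": sum(y_prob) / len(y_prob),
        }
        for y_true, y_prob in windows
    ]


# Two hypothetical windows of 20 scored transactions each.
scores = [i / 20 for i in range(20)]
window_a = ([0] * 18 + [0, 1], scores)
window_b = ([0] * 18 + [1, 1], scores)
results = backtest([window_a, window_b])
```

Comparing `actual_rate` with `predicted_rate` window by window is a rough calibration check, while precision at the top decile shows whether the highest scores keep concentrating real returns over time.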

04 — Four models, one system

Return intelligence isn't one problem.

Is this transaction risky right now? Is this customer a systematic returner? How should policy differ across the base? And how do we win the sale back? Each question gets its own model.

1 · Supervised · LightGBM

Return-likelihood classifier

Per-transaction P(return) at checkout. Trained on a strict temporal split (2009–H1 2011 → H2 2011) with point-in-time-safe history features. SHAP makes every score explainable.
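A strictly chronological split can be sketched as follows. The cutoff of 1 July 2011 is an assumption about where H1 2011 ends and H2 2011 begins, and the row layout (`invoice_date` key) is hypothetical.

```python
from datetime import date

# Assumed boundary between the training period and the H2 2011 test period.
CUTOFF = date(2011, 7, 1)


def temporal_split(rows, cutoff=CUTOFF):
    """Put every row dated before the cutoff in train, everything else in test."""
    train = [r for r in rows if r["invoice_date"] < cutoff]
    test = [r for r in rows if r["invoice_date"] >= cutoff]
    return train, test


# Hypothetical transactions, deliberately out of order.
rows = [
    {"invoice_id": "i3", "invoice_date": date(2011, 9, 12)},
    {"invoice_id": "i1", "invoice_date": date(2010, 3, 4)},
    {"invoice_id": "i2", "invoice_date": date(2011, 6, 30)},
    {"invoice_id": "i4", "invoice_date": date(2011, 7, 1)},
]
train, test = temporal_split(rows)
```

Unlike a random split, no training row is dated later than any test row, so the evaluation mimics scoring transactions the model has never seen in time.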

2 · Unsupervised · Isolation Forest

Excessive-returner detection

Flags systematic returners from customer-level behavior with no labels required. Validated against a top-decile return-value-ratio heuristic.
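One way to picture the validation step is to compare the model's flags with the customers in the top decile of return-value ratio. The sketch below is an assumption about how such a check could look: the helper names, the customer IDs, and the flagged set are all hypothetical.

```python
def return_value_ratio(purchased_value, returned_value):
    """Returned value as a share of purchased value (0.0 when nothing was bought)."""
    return returned_value / purchased_value if purchased_value > 0 else 0.0


def top_decile_customers(ratios):
    """Customer IDs in the top 10% by ratio; `ratios` maps customer ID to ratio."""
    k = max(1, len(ratios) // 10)
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return set(ranked[:k])


def flag_agreement(flagged, heuristic):
    """Share of model-flagged customers that the heuristic also picks out."""
    return len(flagged & heuristic) / len(flagged) if flagged else 0.0


# Hypothetical data: 20 customers with ratios 0.00, 0.01, ..., 0.19.
ratios = {f"c{i}": return_value_ratio(100.0, float(i)) for i in range(20)}
heuristic = top_decile_customers(ratios)  # the two highest-ratio customers
flagged = {"c19", "c5"}  # stand-in for the anomaly detector's output
agreement = flag_agreement(flagged, heuristic)
```

High agreement suggests the unsupervised flags line up with an intuitive business rule; disagreements are worth inspecting by hand, since the detector may be reacting to behavior the heuristic cannot see.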

3 · Clustering · KMeans

Customer segmentation

k=4 on RFM + return features → Premium Loyal · Healthy Browser · At-Risk · Returner. Turns one blunt return policy into four targeted ones.
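The RFM inputs to the clustering step can be derived directly from line-level transactions. This is a minimal sketch under assumed field names (`customer_id`, `invoice_id`, `invoice_date`, `amount`); the project's actual feature code may differ.

```python
from datetime import date


def rfm_features(transactions, as_of):
    """Recency (days since last invoice), frequency (distinct invoices), monetary (spend)."""
    per_customer = {}
    for t in transactions:
        c = per_customer.setdefault(
            t["customer_id"],
            {"last": t["invoice_date"], "invoices": set(), "spend": 0.0},
        )
        c["last"] = max(c["last"], t["invoice_date"])
        c["invoices"].add(t["invoice_id"])
        c["spend"] += t["amount"]
    return {
        cid: {
            "recency_days": (as_of - c["last"]).days,
            "frequency": len(c["invoices"]),
            "monetary": c["spend"],
        }
        for cid, c in per_customer.items()
    }


# Hypothetical line items: invoice i1 has two lines.
transactions = [
    {"customer_id": "c1", "invoice_id": "i1", "invoice_date": date(2011, 1, 5), "amount": 20.0},
    {"customer_id": "c1", "invoice_id": "i1", "invoice_date": date(2011, 1, 5), "amount": 5.0},
    {"customer_id": "c1", "invoice_id": "i2", "invoice_date": date(2011, 3, 1), "amount": 10.0},
    {"customer_id": "c2", "invoice_id": "i3", "invoice_date": date(2011, 2, 1), "amount": 7.5},
]
features = rfm_features(transactions, as_of=date(2011, 3, 31))
```

Before clustering, these columns would normally be scaled and combined with the return features, because KMeans relies on Euclidean distance and raw spend would otherwise dominate.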

4 · Hybrid · embeddings + ALS

Substitute recommender

Content embeddings (sentence-transformers) blended with implicit ALS. At the return moment, surfaces alternatives to retain revenue instead of refunding it.
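The blending idea can be illustrated with a weighted sum of a content similarity and a collaborative score. Everything specific here is an assumption: the weight `alpha`, the toy 2-dimensional vectors standing in for sentence embeddings, and ALS scores that are presumed already normalized to the 0 to 1 range.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_substitutes(returned_vec, candidates, als_scores, alpha=0.5, top_k=3):
    """Rank candidates by alpha * content similarity + (1 - alpha) * ALS score."""
    scored = []
    for item_id, vec in candidates.items():
        content = cosine(returned_vec, vec)
        collaborative = als_scores.get(item_id, 0.0)  # unseen items get 0.0
        scored.append((item_id, alpha * content + (1 - alpha) * collaborative))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings and normalized ALS scores.
returned_vec = [1.0, 0.0]
candidates = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
als_scores = {"A": 0.1, "B": 0.9, "C": 0.6}
ranked = rank_substitutes(returned_vec, candidates, als_scores)
```

Item C wins here because it is reasonably similar in content and also favored by the collaborative signal; in practice the returned item itself would be excluded from the candidate set, and `alpha` would be tuned offline.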

Temporal leakage is the trap here. Lifetime return rate, return velocity, and average days-to-return must be computed using only transactions that precede the one being scored. All features are point-in-time safe, and the split is strictly chronological, so no future information leaks backward.
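A point-in-time-safe history feature can be sketched by walking transactions in time order and reading the running counts before updating them. The field names and the `prior_return_rate` feature name are assumptions for illustration.

```python
def add_prior_return_rate(transactions):
    """Attach each customer's return rate over strictly earlier transactions."""
    counts = {}  # customer_id -> (prior transactions, prior returns)
    out = []
    for t in sorted(transactions, key=lambda r: r["invoice_date"]):
        n, returns = counts.get(t["customer_id"], (0, 0))
        # Read the history first: the current row must not see its own label.
        rate = returns / n if n else None
        out.append({**t, "prior_return_rate": rate})
        counts[t["customer_id"]] = (n + 1, returns + int(t["is_return"]))
    return out


# Hypothetical history for one customer; dates are simple day numbers.
transactions = [
    {"customer_id": "a", "invoice_date": 3, "is_return": False},
    {"customer_id": "a", "invoice_date": 1, "is_return": False},
    {"customer_id": "a", "invoice_date": 2, "is_return": True},
]
enriched = add_prior_return_rate(transactions)
```

The first transaction has no history, so its feature is `None` rather than a rate computed from the future. A stricter version would also account for when a return became known, since a return label typically arrives some days after the purchase it refers to.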

05 — Built like production

MLOps, not notebook-ops.

The modeling is wrapped in the scaffolding a real deployment needs: orchestration, experiment tracking, tests, a serving API, and a pipeline that's proven to scale.

Orchestration

Prefect 2.x runs the weekly ingest → feature → train → score flow with retries and observability.

Experiment tracking

MLflow logs params, metrics, and SHAP artifacts across all four model runs — reproducible comparisons, not one-off cells.

Scales to warehouse volume

The same feature logic runs in PySpark on Databricks with a medallion architecture: Bronze → Silver → Gold.

Serving

FastAPI on Render with typed Pydantic schemas and a /health check. The demo above hits it live.

Tested

A pytest suite covers the API contract, score schema, customer profile, substitutes, and feature-matrix shape.

SQL foundations

Feature aggregations prototyped in DuckDB — CTEs, window functions, and RFM rollups before they hit the pipeline.

User input → FastAPI → Feature lookup → LightGBM · IF · KMeans · Recommender → SHAP → JSON

06 — Tech stack

The toolkit.

Modeling

LightGBM · XGBoost · Isolation Forest · KMeans · sentence-transformers · implicit ALS

Data & scale

Pandas · NumPy · DuckDB · PySpark · Databricks

Interpretability & stats

SHAP · scipy.stats · statsmodels

MLOps & serving

MLflow · Prefect · FastAPI · pytest · Render · Vercel

Known limitations — honest disclosures

Return labels are UCI Online Retail II cancellations (C-prefixed invoices); the data doesn't separate refunds from store credit or partial-line cancellations.
The live API serves precomputed customer features rather than recomputing history per request — production would join against a feature store.
Some transaction-time features fall back to dataset means at inference; the classifier still leans on customer-level signal.
The /substitutes recommender is trained and evaluated offline in v1; live wiring is a v2 item.
The A/B test is a simulation against held-out history with a two-proportion z-test and spend guardrail — not a live randomized arm.
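The mean-fallback limitation listed above can be pictured with a small helper that fills missing transaction-time features from stored dataset means and records which ones were filled. The feature names and mean values are hypothetical.

```python
def fill_with_fallback(features, dataset_means):
    """Use the request's value when present, otherwise the stored dataset mean."""
    filled = {}
    used_fallback = []
    for name, mean in dataset_means.items():
        value = features.get(name)
        if value is None:
            filled[name] = mean
            used_fallback.append(name)
        else:
            filled[name] = value
    return filled, used_fallback


# Hypothetical training-set means and an incomplete scoring request.
dataset_means = {"quantity": 9.5, "unit_price": 3.2, "category_return_rate": 0.02}
request_features = {"quantity": 12, "unit_price": None}

filled, used_fallback = fill_with_fallback(request_features, dataset_means)
```

Returning the list of imputed features alongside the values makes the fallback visible to the caller, which helps when interpreting a score that leaned mostly on customer-level signal.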

Return risk, scored and explained before the refund.