ML SYSTEM · 1.07M REAL RETAIL TRANSACTIONS

Return risk, scored and explained
before the refund.

U.S. retailers lost $101B to return fraud and abuse in 2023. Retail Returns Intelligence scores every transaction's return likelihood in real time, flags excessive returners without labels, segments customers for differentiated policy, and recommends substitutes that turn refunds into retained revenue.

01 — The problem

Most retailers catch excessive returners after the loss.

A small fraction of customers drive a disproportionate share of returns through wardrobing, bracketing, serial returns, and policy abuse. The signal is in the transaction stream — but it usually gets read reactively, once the refunds have already cleared.

$743B
merchandise returned in the U.S. in 2023 — 14.5% of all retail sales.
$101B
lost to return fraud and abuse — roughly $13.70 of every $100 returned.
22.8%
of transactions arrive with missing CustomerIDs — real, messy operational data.

Source: National Retail Federation & Appriss Retail, 2023 Consumer Returns report.

02 — Live demo

Score a transaction in real time.

This calls a live FastAPI backend running the trained models — real inference, not a mockup. Pick a real customer or enter your own transaction to get a return probability, risk tier, customer segment, anomaly flag, and the top SHAP factors driving the prediction.

Transaction input

Try a real customer:

Prediction

Submit a transaction to see the model's scored output.

Customer profile lookup

Pull the full behavioral profile for any CustomerID.
03 — Results that hold up

Evaluated like it ships, not like a demo.

Temporal train/test split, point-in-time-safe features, threshold selection, explainability, and rolling-window backtesting. Every figure below is generated by the notebooks in the repo.

Precision-recall curve and F1-vs-threshold for the LightGBM return classifier
Classifier · LightGBM: PR-AUC 0.852, edging XGBoost (0.849). Decision threshold tuned to 0.94, where F1 = 0.789. Precision-recall is the right lens on a 1.8% base rate.
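The threshold tuning in this caption can be pictured as a sweep over candidate cutoffs, keeping the one with the highest F1. What follows is a minimal sketch in plain Python, not the repo's code: the function name `best_f1_threshold` and the toy labels and scores are assumptions made up for illustration.

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 when 'score >= threshold' means positive."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        if tp == 0:
            continue  # precision and F1 are undefined or zero here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy data (hypothetical): 3 returns among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.90, 0.95]
threshold, f1 = best_f1_threshold(y_true, scores)
print(threshold, round(f1, 3))
```

On a heavily imbalanced label the F1-optimal cutoff can sit far from 0.5, which is consistent with a tuned threshold as high as the one reported above.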
SHAP global feature importance for the LightGBM classifier
Explainability · SHAP Every score is attributable. Quantity, category return rate, and price drive risk — each prediction ships with its top factors.
Isolation Forest score distribution and return rate by anomaly group
Anomaly · Isolation Forest 294 excessive returners flagged — no labels. Flagged customers show a visibly higher return-rate distribution, validating the unsupervised signal.
PCA projection of KMeans customer segments
Segmentation · KMeans (k=4) Four behavioral segments for differentiated policy. Premium Loyal, Healthy Browser, At-Risk, and Returner — separated on RFM + return features.
Power curve and control-vs-treatment return rate for the A/B policy simulation
Experimentation · A/B simulation Powered to 80%, significant at p<0.05. A 14-day return window on the Returner segment cuts return-value ratio, with a spend guardrail intact.
Rolling-window backtest: Brier score, precision at decile, and actual vs predicted return rate over time
Backtesting · 18 rolling windows Stable across 18 months, not a single lucky split. Precision@decile holds near 0.18 and predicted return rate tracks actuals window over window.
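The three backtest metrics named in this caption are simple to compute per window. The sketch below shows one window only, in plain Python; the function names and the toy window are illustrative assumptions, and a rolling backtest would repeat the same calculation for each of the windows.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def precision_at_top_decile(y_true, probs):
    """Share of actual returns among the 10% of rows with the highest scores."""
    n_top = max(1, len(probs) // 10)
    ranked = sorted(zip(probs, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:n_top]) / n_top


# Toy window (hypothetical): 20 scored transactions, 2 actual returns.
probs = [i / 20 for i in range(20)]
y_true = [0] * 20
y_true[19] = 1
y_true[15] = 1

window_metrics = {
    "brier": brier_score(y_true, probs),
    "precision_at_decile": precision_at_top_decile(y_true, probs),
    "actual_return_rate": sum(y_true) / len(y_true),
    "predicted_return_rate": sum(probs) / len(probs),
}
print(window_metrics)
```

Comparing `actual_return_rate` with `predicted_return_rate` per window is a coarse calibration check; the Brier score penalizes both miscalibration and poor ranking.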
04 — Four models, one system

Return intelligence isn't one problem.

Is this transaction risky right now? Is this customer a systematic returner? How should policy differ across the base? And how do we win the sale back? Each question gets its own model.

1. Supervised · LightGBM

Return-likelihood classifier

Per-transaction P(return) at checkout. Trained on a strict temporal split (2009–H1 2011 → H2 2011) with point-in-time-safe history features. SHAP makes every score explainable.

2. Unsupervised · Isolation Forest

Excessive-returner detection

Flags systematic returners from customer-level behavior with no labels required. Validated against a top-decile return-value-ratio heuristic.
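One way to read "validated against a top-decile heuristic" is as an overlap check between the model's flags and the customers in the top 10% by return-value ratio. The sketch below assumes the anomaly flags already exist as a set of customer IDs (it does not run an Isolation Forest); the function names, customer IDs, and ratios are hypothetical.

```python
def top_decile_by_ratio(return_value_ratio):
    """Customer IDs in the top 10% by return-value ratio (at least one customer)."""
    n_top = max(1, len(return_value_ratio) // 10)
    ranked = sorted(return_value_ratio, key=return_value_ratio.get, reverse=True)
    return set(ranked[:n_top])


def overlap_report(flagged, heuristic):
    """How much the unsupervised flags and the heuristic group agree."""
    both = flagged & heuristic
    return {
        "share_of_flags_in_heuristic": len(both) / len(flagged) if flagged else 0.0,
        "share_of_heuristic_flagged": len(both) / len(heuristic) if heuristic else 0.0,
    }


# Toy data (hypothetical): returned value / purchased value per customer.
ratios = {f"c{i}": i / 100 for i in range(20)}
heuristic_group = top_decile_by_ratio(ratios)

# Assumed output of the anomaly model: IDs it flagged as excessive returners.
flagged_by_model = {"c19", "c5"}
report = overlap_report(flagged_by_model, heuristic_group)
print(sorted(heuristic_group), report)
```

High overlap supports the unsupervised signal; flags outside the heuristic group are the interesting cases to inspect by hand, since they are customers a simple ratio rule would miss.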

3. Clustering · KMeans

Customer segmentation

k=4 on RFM + return features → Premium Loyal · Healthy Browser · At-Risk · Returner. Turns one blunt return policy into four targeted ones.

4. Hybrid · embeddings + ALS

Substitute recommender

Content embeddings (sentence-transformers) blended with implicit ALS. At the return moment, surfaces alternatives to retain revenue instead of refunding it.
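The page does not say how the two signals are blended. One common approach, shown here purely as an assumption, is a weighted sum of a normalized content similarity and a normalized collaborative score. The vectors below stand in for sentence-transformers embeddings and the `cf_scores` dictionary stands in for ALS output; all names, the weight `alpha`, and the toy items are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def min_max(scores):
    """Rescale a dict of scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def blend_substitutes(returned_vec, item_vecs, cf_scores, alpha=0.5, top_k=2):
    """Rank candidates by alpha * content similarity + (1 - alpha) * collaborative score."""
    content = min_max({item: cosine(returned_vec, vec) for item, vec in item_vecs.items()})
    collab = min_max(cf_scores)
    blended = {
        item: alpha * content[item] + (1 - alpha) * collab.get(item, 0.0)
        for item in item_vecs
    }
    return sorted(blended, key=blended.get, reverse=True)[:top_k]


# Toy data (hypothetical): 2-d stand-ins for text embeddings and ALS scores.
returned_item = [1.0, 0.0]
candidates = {"mug_red": [0.9, 0.1], "mug_blue": [0.8, 0.3], "lamp": [0.0, 1.0]}
cf_scores = {"mug_red": 0.2, "mug_blue": 0.9, "lamp": 0.5}
substitutes = blend_substitutes(returned_item, candidates, cf_scores)
print(substitutes)
```

In this toy case the blue mug outranks the red one even though the red mug is the closer content match, because the collaborative score pulls it up. Setting `alpha` closer to 1 would favor look-alike items instead.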

Temporal leakage is the trap here. Lifetime return rate, return velocity, and average days-to-return must be computed using only transactions before the one being scored. All features are point-in-time safe, and the split is strictly chronological, so no future information leaks into training.
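A point-in-time-safe history feature can be sketched as a running tally that is read before the current transaction is added to it, followed by a date-based split. This is an illustration in plain Python, not the repo's feature code: the `Txn` class and function names are hypothetical, and the cutoff of 1 July 2011 is inferred from the "H1 2011 → H2 2011" split described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Txn:
    customer_id: int
    day: date
    returned: bool  # outcome label


def prior_return_rate(transactions):
    """Pair each transaction with the customer's return rate over earlier transactions only.

    Simplification (assumption): an earlier transaction's outcome is treated as
    known by the time of the next one. A stricter version would key on the
    return date rather than the purchase date.
    """
    seen, returned = {}, {}
    features = []
    for t in sorted(transactions, key=lambda txn: txn.day):
        n = seen.get(t.customer_id, 0)
        r = returned.get(t.customer_id, 0)
        features.append((t, r / n if n else 0.0))  # read history BEFORE updating it
        seen[t.customer_id] = n + 1
        returned[t.customer_id] = r + int(t.returned)
    return features


def temporal_split(features, cutoff):
    """Strictly chronological split: train before the cutoff, test on or after it."""
    train = [row for row in features if row[0].day < cutoff]
    test = [row for row in features if row[0].day >= cutoff]
    return train, test


# Toy history (hypothetical) for a single customer.
history = [
    Txn(1, date(2011, 1, 10), True),
    Txn(1, date(2011, 3, 1), False),
    Txn(1, date(2011, 8, 1), True),
]
feats = prior_return_rate(history)
train, test = temporal_split(feats, cutoff=date(2011, 7, 1))
print([rate for _, rate in feats], len(train), len(test))
```

The first transaction gets a rate of 0.0 because there is no history yet, and the third sees only the two earlier outcomes. Computing the rate over the customer's full history instead would hand the model the label it is trying to predict.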

05 — Built like production

MLOps, not notebook-ops.

The modeling is wrapped in the scaffolding a real deployment needs: orchestration, experiment tracking, tests, a serving API, and a pipeline that's proven to scale.

Orchestration

Prefect 2.x runs the weekly ingest → feature → train → score flow with retries and observability.

Experiment tracking

MLflow logs params, metrics, and SHAP artifacts across all four model runs — reproducible comparisons, not one-off cells.

Scales to warehouse volume

The same feature logic runs in PySpark on Databricks with a medallion architecture: Bronze → Silver → Gold.

Serving

FastAPI on Render with typed Pydantic schemas and a /health check. The demo above hits it live.

Tested

A pytest suite covers the API contract, score schema, customer profile, substitutes, and feature-matrix shape.

SQL foundations

Feature aggregations prototyped in DuckDB — CTEs, window functions, and RFM rollups before they hit the pipeline.
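An RFM rollup built from a CTE plus a window function might look like the sketch below. It swaps in Python's built-in `sqlite3` for DuckDB so that it runs without extra installs (window functions need SQLite 3.25 or newer); the table layout, column names, snapshot date, and rows are all hypothetical. Cancellations are identified by the C-prefixed invoice convention noted in the limitations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    customer_id INTEGER, invoice TEXT, invoice_date TEXT, amount REAL
);
INSERT INTO transactions VALUES
    (1, 'A1',  '2011-01-05',  50.0),
    (1, 'A2',  '2011-03-10',  30.0),
    (1, 'CA3', '2011-03-15', -30.0),
    (2, 'B1',  '2011-02-01', 200.0);
""")

# CTE: per-customer recency, frequency, monetary over purchases only.
# Window function: rank customers by spend.
RFM_SQL = """
WITH per_customer AS (
    SELECT
        customer_id,
        CAST(julianday('2011-04-01')
             - julianday(MAX(CASE WHEN invoice NOT LIKE 'C%' THEN invoice_date END))
             AS INTEGER) AS recency_days,
        COUNT(DISTINCT CASE WHEN invoice NOT LIKE 'C%' THEN invoice END) AS frequency,
        SUM(CASE WHEN amount > 0 THEN amount ELSE 0 END) AS monetary
    FROM transactions
    GROUP BY customer_id
)
SELECT customer_id, recency_days, frequency, monetary,
       RANK() OVER (ORDER BY monetary DESC) AS monetary_rank
FROM per_customer
ORDER BY customer_id
"""
rows = conn.execute(RFM_SQL).fetchall()
for row in rows:
    print(row)
```

The query text is standard enough that it should carry over to DuckDB with little change, though date functions differ between engines (`julianday` is SQLite-specific).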

User input → FastAPI → Feature lookup → LightGBM · IF · KMeans · Recommender → SHAP → JSON
06 — Tech stack

The toolkit.

Modeling

LightGBM · XGBoost · Isolation Forest · KMeans · sentence-transformers · implicit ALS

Data & scale

Pandas · NumPy · DuckDB · PySpark · Databricks

Interpretability & stats

SHAP · scipy.stats · statsmodels

MLOps & serving

MLflow · Prefect · FastAPI · pytest · Render · Vercel
Known limitations — honest disclosures
  • Return labels are UCI Online Retail II cancellations (C-prefixed invoices); the data doesn't separate refunds from store credit or partial-line cancellations.
  • The live API serves precomputed customer features rather than recomputing history per request — production would join against a feature store.
  • Some transaction-time features fall back to dataset means at inference; the classifier still leans on customer-level signal.
  • The /substitutes recommender is trained and evaluated offline in v1; live wiring is a v2 item.
  • The A/B test is a simulation against held-out history with a two-proportion z-test and spend guardrail — not a live randomized arm.
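The two-proportion z-test named in the last limitation can be sketched with the standard library alone. The counts below are made-up illustrative numbers, not the project's simulation results, and the function name is hypothetical. The test compares the return rate of a control group against a treatment group using a pooled standard error.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x_control, n_control, x_treat, n_treat):
    """Two-sided z-test for a difference between two proportions (pooled variance)."""
    p_control = x_control / n_control
    p_treat = x_treat / n_treat
    pooled = (x_control + x_treat) / (n_control + n_treat)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treat))
    z = (p_control - p_treat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: returns out of transactions in each arm.
z, p_value = two_proportion_z_test(x_control=120, n_control=1000, x_treat=90, n_treat=1000)
print(round(z, 3), round(p_value, 4))
```

A spend guardrail would be a second comparison on the same arms (for example, mean spend per customer), checked so that a lower return rate is not bought with lower revenue.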