Model trust warning

Target leakage risk in CSV AutoML

Target leakage happens when a column reveals the answer too directly or uses information that would not be available when a real prediction is made.

Leakage can make model results look unusually strong while hiding a model that would perform poorly on real predictions. MLdeck helps surface review signals, but users still need to inspect the data context.


A simple example

Imagine you are predicting whether a customer will churn, and the CSV includes a column named cancellation_reason. That column may only be filled in after the customer has already churned, so a model can appear accurate because the answer is already present in the inputs.

Similar problems can happen with final_invoice_amount, resolved_status, refund_date, delivery_completed_at, or any field created after the event being predicted.
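One rough way to spot a column like cancellation_reason is to compare how often it is filled in for each outcome. The sketch below is an illustration only, not MLdeck's own check: the function names, the sample CSV, and the 0.9 gap threshold are all assumptions made for the example.

```python
import csv
import io


def _fill_rate(group, col):
    """Share of rows in `group` where `col` holds a non-blank value."""
    if not group:
        return 0.0
    return sum(1 for r in group if (r[col] or "").strip()) / len(group)


def fill_rate_by_outcome(rows, target, positive):
    """Return {column: (fill rate for positive rows, fill rate for the rest)}."""
    pos = [r for r in rows if r[target] == positive]
    neg = [r for r in rows if r[target] != positive]
    return {
        col: (_fill_rate(pos, col), _fill_rate(neg, col))
        for col in rows[0]
        if col != target
    }


# Hypothetical sample data, invented for this example.
SAMPLE = """customer_id,plan,cancellation_reason,churned
1,basic,too expensive,yes
2,pro,,no
3,basic,,no
4,pro,switched provider,yes
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rates = fill_rate_by_outcome(rows, target="churned", positive="yes")

# Assumed threshold: a fill-rate gap of 0.9 or more deserves a closer look.
suspects = [col for col, (p, n) in rates.items() if abs(p - n) >= 0.9]
print(suspects)
```

A large gap is only a prompt to check when the column is written in the source system. Some fields are legitimately sparse for one group.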

Signals that deserve review

Answer-like names

Column names that closely match the target or final outcome may need review.
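As a rough illustration of this signal, column names can be compared with the target name by string similarity and shared word tokens. This is a sketch under assumptions: answer_like_columns, the example column names, and the 0.5 threshold are hypothetical and do not describe how MLdeck scores names.

```python
import re
from difflib import SequenceMatcher


def name_tokens(name):
    """Split a column name into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def answer_like_columns(columns, target, threshold=0.5):
    """Return columns whose name resembles the target name.

    A column is flagged when its similarity ratio to the target reaches
    `threshold` (an assumed cutoff) or when it shares a whole token with it.
    """
    target_tokens = name_tokens(target)
    flagged = []
    for col in columns:
        if col == target:
            continue
        ratio = SequenceMatcher(None, col.lower(), target.lower()).ratio()
        shares_token = bool(name_tokens(col) & target_tokens)
        if ratio >= threshold or shares_token:
            flagged.append(col)
    return flagged


# Hypothetical column names, invented for this example.
columns = ["customer_id", "plan", "churn_flag", "is_churned_final",
           "monthly_spend", "churned"]
flagged = answer_like_columns(columns, target="churned")
print(flagged)
```

Name matching produces false positives and misses renamed fields, so a flagged name is a reason to look at the column, not a verdict.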

Post-outcome fields

Fields created after the prediction moment can make exploratory metrics misleading.
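When the CSV happens to carry timestamps, one way to test for post-outcome fields is to compare each date column against the moment the prediction would be made. The sketch assumes a hypothetical prediction_time column and ISO-formatted dates; neither is something the source data is guaranteed to have.

```python
from datetime import datetime


def columns_after_cutoff(rows, cutoff_col, candidate_cols):
    """Return {column: share of non-empty values dated after the cutoff}."""
    shares = {}
    for col in candidate_cols:
        later = total = 0
        for row in rows:
            if not row.get(col):
                continue  # blank values carry no timing information
            total += 1
            value = datetime.fromisoformat(row[col])
            cutoff = datetime.fromisoformat(row[cutoff_col])
            if value > cutoff:
                later += 1
        shares[col] = later / total if total else 0.0
    return shares


# Hypothetical rows; "prediction_time" is an assumed column name.
rows = [
    {"prediction_time": "2024-03-01", "signup_date": "2023-11-10", "refund_date": "2024-03-09"},
    {"prediction_time": "2024-03-01", "signup_date": "2024-01-22", "refund_date": ""},
    {"prediction_time": "2024-03-01", "signup_date": "2024-02-14", "refund_date": "2024-04-02"},
]

shares = columns_after_cutoff(rows, "prediction_time", ["signup_date", "refund_date"])
print(shares)
```

A column whose values mostly fall after the cutoff could not have been known at prediction time, so it is a candidate for exclusion.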

Suspiciously strong metrics

Very high scores can be real, but they should prompt a review for both leakage and duplicate rows.
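Duplicate rows matter here because the same record can land on both sides of a train/validation split and inflate the score. A minimal sketch of a duplicate count follows; the duplicate_rows helper and the sample rows are hypothetical, and the ignore parameter is an assumption for skipping row identifiers.

```python
from collections import Counter


def duplicate_rows(rows, ignore=()):
    """Return {row signature: count} for rows that appear more than once.

    Columns listed in `ignore` (for example a row ID) are left out of the
    comparison so that otherwise identical records still match.
    """
    signatures = [
        tuple((k, v) for k, v in sorted(row.items()) if k not in ignore)
        for row in rows
    ]
    counts = Counter(signatures)
    return {sig: n for sig, n in counts.items() if n > 1}


# Hypothetical rows: records 1 and 3 differ only by customer_id.
rows = [
    {"customer_id": "1", "plan": "basic", "monthly_spend": "20", "churned": "no"},
    {"customer_id": "2", "plan": "pro", "monthly_spend": "55", "churned": "yes"},
    {"customer_id": "3", "plan": "basic", "monthly_spend": "20", "churned": "no"},
]

dupes = duplicate_rows(rows, ignore=("customer_id",))
print(len(dupes), "duplicated signature(s)")
```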

ID or status fields

Some IDs and statuses are harmless; others may encode an outcome. Context matters.
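One way to tell a harmless ID from a status that encodes the outcome is to check whether each value of the column always maps to a single target value, and how many distinct values the column has. The sketch below uses a hypothetical outcome_purity helper and invented rows.

```python
from collections import defaultdict


def outcome_purity(rows, column, target):
    """Return (purity, distinct_values) for `column`.

    Purity is the share of rows whose column value is only ever seen
    with one target value.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[column]].add(row[target])
    pure = sum(1 for row in rows if len(seen[row[column]]) == 1)
    return pure / len(rows), len(seen)


# Hypothetical rows, invented for this example.
rows = [
    {"customer_id": "1", "plan": "basic", "resolved_status": "closed", "churned": "yes"},
    {"customer_id": "2", "plan": "pro", "resolved_status": "open", "churned": "no"},
    {"customer_id": "3", "plan": "basic", "resolved_status": "open", "churned": "no"},
    {"customer_id": "4", "plan": "pro", "resolved_status": "closed", "churned": "yes"},
]

for col in ("customer_id", "plan", "resolved_status"):
    print(col, outcome_purity(rows, col, "churned"))
```

In this toy data, customer_id is pure only because every value is unique, which says nothing about the outcome. resolved_status is pure with just two distinct values, which is the pattern that deserves review. Reading both numbers together is what makes the context visible.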

How this connects to MLdeck AutoML

MLdeck AutoML trains and compares models in the browser. Data Quality helps users review a CSV before relying on model results. The two workflows support each other, but neither removes the need for human review.

If leakage risk appears, train a second version with the suspicious column excluded and compare the results, document the decision, and use stronger validation, such as a held-out or time-based split, before relying on the result.
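The comparison can be sketched outside any tool with a toy model. The lookup-table classifier below is a deliberately simple stand-in for a real model, not what MLdeck AutoML trains; all names and rows are hypothetical. It scores the same held-out rows twice, once with the suspicious column and once without.

```python
from collections import Counter, defaultdict


def fit_lookup(rows, features, target):
    """Memorize the majority label for each combination of feature values."""
    by_key = defaultdict(Counter)
    for row in rows:
        by_key[tuple(row[f] for f in features)][row[target]] += 1
    table = {key: c.most_common(1)[0][0] for key, c in by_key.items()}
    default = Counter(row[target] for row in rows).most_common(1)[0][0]
    return table, default


def holdout_accuracy(train, test, features, target):
    """Fit on `train`, then return accuracy on the unseen `test` rows."""
    table, default = fit_lookup(train, features, target)
    hits = sum(
        table.get(tuple(row[f] for f in features), default) == row[target]
        for row in test
    )
    return hits / len(test)


def make(plan, status, churned):
    return {"plan": plan, "resolved_status": status, "churned": churned}


# Hypothetical data: resolved_status mirrors the outcome exactly.
train = [
    make("basic", "closed", "yes"), make("basic", "open", "no"),
    make("pro", "closed", "yes"), make("pro", "open", "no"),
    make("basic", "open", "no"), make("pro", "open", "no"),
]
test = [
    make("basic", "closed", "yes"), make("pro", "open", "no"),
    make("pro", "closed", "yes"), make("basic", "open", "no"),
]

with_column = holdout_accuracy(train, test, ["plan", "resolved_status"], "churned")
without_column = holdout_accuracy(train, test, ["plan"], "churned")
print(f"with resolved_status: {with_column:.2f}, without: {without_column:.2f}")
```

A sharp drop after exclusion suggests the original score depended on that one column. Whether that is leakage or a legitimately strong feature still depends on when the column becomes available.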

Review before trusting metrics

Leakage review is especially important before sharing a report, exporting artifacts, or making decisions from a leaderboard. A warning is not proof of leakage; it is a reason to inspect the data source carefully.
