Example workflow

Data quality checks for CSV machine learning

Q: What data-quality checks matter before AutoML?

Missing values, high-cardinality IDs, target leakage, constant columns, duplicates, class imbalance, target mismatch, and too few rows all matter.

Q: What is target leakage?

Target leakage happens when a feature contains information that would not be available at prediction time or directly reveals the answer.

Q: Why can high accuracy be a warning sign?

Very high accuracy can indicate leakage, duplicates, class imbalance, or a target-derived feature.

Q: Should I export a model if warnings appear?

Yes, for testing. Treat the warnings as context for validation and artifact review, and document why any remaining warnings are acceptable before relying on the export.

This example explains how Data Quality review supports AutoML interpretation before training and export: missing values, high-cardinality IDs, target leakage, suspicious columns, class imbalance, validation limits, and export readiness.


Why data quality matters more than model choice

A powerful model cannot rescue a broken dataset. If the target is wrong, a column leaks the answer, rows are duplicated, or classes are badly imbalanced, the leaderboard may look impressive while the workflow remains unreliable. CSV machine learning often fails because the table was not designed for prediction.

Data-quality checks create a pause before interpretation. They ask whether the target is suitable, whether features are known at prediction time, whether missingness is meaningful, and whether validation evidence reflects the future use case. MLdeck warnings are decision support, not automatic proof.

Common CSV problems before machine learning

Common problems include missing values, high-cardinality IDs, constant columns, near-duplicate columns, too few rows, class imbalance, suspiciously high performance, target-feature mismatch, and row-ordering effects (for example, a file sorted by date or by outcome). A customer table might include IDs and emails. A sales table might include future revenue. A music table might include duplicate tracks. A delivery table might include final delivery timestamps while predicting delivery time.

These issues do not always mean the project should stop. They mean the user should investigate before trusting metrics. Sometimes the fix is simple: exclude an ID, clean labels, remove post-outcome fields, or choose a better target.
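Several of the table-level checks above are simple enough to sketch in plain Python. The function below is an illustration only, not MLdeck's implementation: the name `profile_csv`, the `id_ratio` and `min_rows` thresholds, and the sample table are all assumptions chosen for the example.

```python
import csv
import io
from collections import Counter


def profile_csv(text, id_ratio=0.95, min_rows=50):
    """Return (column, warning) pairs for a CSV supplied as a string.

    Hypothetical helper: thresholds are illustrative, not recommendations.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    rows = [tuple(row) for row in reader if row]  # skip blank lines
    warnings = []

    if len(rows) < min_rows:
        warnings.append(("<table>", f"only {len(rows)} rows"))

    # Count exact duplicate rows (every copy after the first).
    duplicates = sum(n - 1 for n in Counter(rows).values())
    if duplicates:
        warnings.append(("<table>", f"{duplicates} duplicate rows"))

    for i, column in enumerate(header):
        values = [row[i] if i < len(row) else "" for row in rows]
        present = [v for v in values if v.strip()]
        missing = len(values) - len(present)
        if missing:
            warnings.append((column, f"{missing} missing values"))
        unique = len(set(present))
        if unique == 1:
            warnings.append((column, "constant column"))
        elif len(present) > 1 and unique / len(present) >= id_ratio:
            # Continuous numeric columns can trip this too; it is a prompt to look.
            warnings.append((column, "ID-like: nearly every value is unique"))
    return warnings


# Made-up sample table for illustration.
SAMPLE = """customer_id,plan,country,churned
c1,basic,US,0
c2,pro,US,1
c3,basic,US,0
c3,basic,US,0
c4,,US,1
"""

for column, message in profile_csv(SAMPLE, id_ratio=0.8):
    print(f"{column}: {message}")
```

Each warning here is a reason to inspect a column, in the same spirit as the text above: the ID-like check, for instance, cannot tell a customer key from a legitimately continuous measurement.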

Missing values, categorical columns, and outliers

Missing values can be random, systematic, or meaningful. A missing payment method might indicate a data import issue. A missing cancellation reason may be expected for active customers. Categorical columns can contain rare values, spelling variations, or too many unique labels. Numeric columns can contain impossible values, unit mixups, or extreme outliers.

MLdeck can profile and preprocess these columns while keeping the review visible. Automated preprocessing is useful for exploration, and domain review helps users interpret what the missingness and categories mean.
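As a rough sketch of what such a column review can look like, the helpers below check whether missingness depends on the target, list rare category labels after light normalization, and flag numeric outliers with the common Tukey (IQR) fence. The function names, thresholds, and sample values are assumptions for illustration, not part of MLdeck.

```python
import statistics
from collections import Counter


def missing_rate_by_target(rows, column, target):
    """Share of rows with an empty `column`, per target value."""
    total, missing = Counter(), Counter()
    for row in rows:
        total[row[target]] += 1
        if (row.get(column) or "").strip() == "":
            missing[row[target]] += 1
    return {label: missing[label] / total[label] for label in total}


def rare_categories(values, min_count=2):
    """Labels (trimmed, lower-cased) seen fewer than `min_count` times."""
    counts = Counter(v.strip().lower() for v in values if v.strip())
    return sorted(label for label, n in counts.items() if n < min_count)


def iqr_outliers(numbers, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(numbers, n=4)
    spread = q3 - q1
    return [x for x in numbers if x < q1 - k * spread or x > q3 + k * spread]


# Made-up rows: cancellation reason is empty for every active customer.
rows = [
    {"status": "active", "cancel_reason": ""},
    {"status": "active", "cancel_reason": ""},
    {"status": "cancelled", "cancel_reason": "price"},
    {"status": "cancelled", "cancel_reason": "moved"},
]
print(missing_rate_by_target(rows, "cancel_reason", "status"))
print(rare_categories(["Card", "card ", "PayPal", "crad"]))
print(iqr_outliers([10, 12, 11, 13, 12, 11, 10, 500]))
```

A missing rate of 1.0 for one class and 0.0 for the other is the "meaningful missingness" case: the gap itself carries information about the outcome, so it needs domain review rather than silent imputation.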

Target leakage and suspicious columns

Target leakage happens when a feature reveals the answer or uses information from after the prediction moment. In churn data, cancellation reason leaks churn. In regression, final invoice amount may leak price. In classification, a label-derived status code can make the task trivial. Leakage often appears as unusually strong performance or a single feature dominating the model.

Suspicious columns should be inspected, not blindly kept. Ask whether the feature would be available when making the prediction. If not, exclude it. If the answer is uncertain, run a version with and without the column and compare behavior.
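One inexpensive way to spot a single dominating feature is to ask how well each column predicts the target on its own, using held-out rows. The sketch below is a simple "one-rule" check, swapped in here as an illustration; it is not how MLdeck detects leakage, and the function names, the 0.98 threshold, and the churn rows are assumptions.

```python
from collections import Counter, defaultdict


def one_rule_accuracy(train, test, feature, target):
    """Holdout accuracy when predicting `target` from `feature` alone.

    For each feature value, predict the most common target seen in training;
    unseen values fall back to the overall majority class.
    """
    fallback = Counter(r[target] for r in train).most_common(1)[0][0]
    by_value = defaultdict(Counter)
    for r in train:
        by_value[r[feature]][r[target]] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    hits = sum(rule.get(r[feature], fallback) == r[target] for r in test)
    return hits / len(test)


def suspicious_features(train, test, features, target, threshold=0.98):
    """Features that almost determine the target by themselves."""
    return [f for f in features
            if one_rule_accuracy(train, test, f, target) >= threshold]


def row(plan, reason, churned):
    return {"plan": plan, "cancel_reason": reason, "churned": churned}


# Made-up churn data: cancel_reason is only filled in after a customer churns.
train = [
    row("basic", "", "0"), row("pro", "", "0"), row("basic", "price", "1"),
    row("basic", "moved", "1"), row("basic", "", "0"), row("pro", "price", "1"),
    row("basic", "", "0"), row("pro", "", "0"),
]
test = [
    row("basic", "", "0"), row("pro", "price", "1"),
    row("basic", "moved", "1"), row("pro", "", "0"),
]

print(suspicious_features(train, test, ["plan", "cancel_reason"], "churned"))
```

Evaluating on held-out rows matters: on the training rows alone, any unique ID would look like a perfect predictor. A flagged column still needs the question from the text: would this value be known at prediction time?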

Baseline comparison as a sanity check

Baselines are simple reference models. In classification, the majority-class baseline predicts the most common class. In regression, the mean baseline predicts the average target. A model that cannot meaningfully beat a simple baseline is probably not useful. A model that beats the baseline by a huge margin should be reviewed for leakage or duplicates.

Baseline comparison keeps expectations grounded. It helps users distinguish real signal from a dataset where the easy answer already performs well. MLdeck's leaderboard should be read alongside baseline evidence and warnings.
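The two baselines described above take only a few lines to compute. In this sketch the function names and the `min_gain` and `max_acc` cut-offs are assumptions picked for illustration; sensible values depend on the dataset and the use case.

```python
import statistics
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


def mean_baseline_mae(train_targets, test_targets):
    """Mean absolute error of always predicting the training mean."""
    mean = statistics.fmean(train_targets)
    return statistics.fmean([abs(y - mean) for y in test_targets])


def review_against_baseline(model_acc, baseline_acc, min_gain=0.02, max_acc=0.99):
    """Turn the comparison into a prompt for review (illustrative cut-offs)."""
    if model_acc - baseline_acc < min_gain:
        return "no meaningful gain over baseline"
    if model_acc >= max_acc:
        return "suspiciously high: check leakage and duplicates"
    return "plausible: keep validating"


baseline = majority_class_accuracy(["0"] * 9 + ["1"], ["0", "0", "0", "1"])
print(baseline)
print(mean_baseline_mae([10, 20, 30], [20, 40]))
print(review_against_baseline(0.85, baseline))
```

Reading a leaderboard score next to the baseline number answers both questions in the text: whether the model adds anything, and whether the gain is too large to take at face value.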

Exploratory evidence and stronger validation

Exploratory evidence helps decide what to try next. Stronger validation evidence may include holdout sets, time-aware splits, group-aware splits, external test data, parity validation for exports, and domain review. MLdeck provides browser-local AutoML with data-quality warnings, leakage-risk warnings, and validation evidence to support that review.
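Time-aware and group-aware splits differ from a random split in one respect each: the first keeps the test rows later than the training rows, the second keeps all rows of a group (such as one customer) on the same side. The sketch below shows both in plain Python; the function names, the `test_fraction` parameter, and the hash-based group assignment are assumptions for illustration.

```python
import zlib


def time_aware_split(rows, time_key, test_fraction=0.25):
    """Train on the earliest rows, test on the latest.

    Assumes `time_key` values sort chronologically (for example ISO dates).
    """
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]


def group_aware_split(rows, group_key, test_fraction=0.25):
    """Keep every row of a group on the same side of the split."""
    def in_test(group):
        # Stable hash, so a given group always lands on the same side.
        return zlib.crc32(str(group).encode("utf-8")) % 100 < test_fraction * 100

    train = [r for r in rows if not in_test(r[group_key])]
    test = [r for r in rows if in_test(r[group_key])]
    return train, test


# Made-up orders table.
orders = [
    {"date": "2024-03-01", "customer": "a"},
    {"date": "2024-01-15", "customer": "b"},
    {"date": "2024-02-10", "customer": "a"},
    {"date": "2024-04-20", "customer": "c"},
]
train, test = time_aware_split(orders, "date")
print([r["date"] for r in train], [r["date"] for r in test])
```

With only a handful of groups, the hash-based split can leave one side empty, so it is worth checking the resulting sizes before reading any metric from it.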

How MLdeck reports warnings

MLdeck surfaces warnings in the workflow and may include data-quality evidence in reports or export metadata when available. Warnings can cover schema ambiguity, missingness, target issues, validation gaps, and suspicious columns. The goal is to slow the user down in the right places. A warning is not a verdict; it is a prompt to inspect the data.

For example, a warning about high cardinality may point to a customer ID, product ID, or transaction key that should not be treated like an ordinary category. A warning about class imbalance may mean accuracy is not enough. A warning about target ambiguity may mean the selected column is numeric but actually represents categories. These checks help users decide whether to revise the dataset before training again.
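Two of these warnings can be made concrete with a short sketch: why accuracy is not enough under class imbalance, and how a numeric target can turn out to be a set of category codes. The function names and the `max_levels` threshold are assumptions for illustration and do not describe MLdeck's internal rules.

```python
from collections import Counter


def minority_share(labels):
    """Fraction of rows that belong to the rarest class."""
    counts = Counter(labels)
    return min(counts.values()) / len(labels)


def per_class_recall(y_true, y_pred):
    """Recall for each class; exposes what overall accuracy can hide."""
    total, hit = Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        hit[truth] += int(truth == guess)
    return {label: hit[label] / total[label] for label in total}


def numeric_target_looks_categorical(values, max_levels=10):
    """True when every value is a whole number and there are few distinct ones."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        return False  # not numeric at all
    return all(n.is_integer() for n in numbers) and len(set(numbers)) <= max_levels


# A model that always predicts the majority class on a 95/5 split.
y_true = ["0"] * 95 + ["1"] * 5
y_pred = ["0"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, minority_share(y_true), per_class_recall(y_true, y_pred))
print(numeric_target_looks_categorical(["1", "2", "3", "1", "2"]))
```

In the example the model never finds the minority class, yet its accuracy is 0.95, which is why an imbalance warning is a cue to look at per-class metrics before trusting the leaderboard.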

What users should review before exporting

Before exporting, review the target, feature list, excluded columns, warnings, leaderboard, baseline comparison, and validation scope. Confirm that identifiers and leakage columns are removed. Check whether exported artifacts include schema and preprocessing metadata. If warnings remain, document why they are acceptable or pause the workflow.

Exporting too early can spread a weak data assumption into another environment. A PDF report, ONNX artifact, Docker package, or Python file should carry enough context for a reviewer to understand what was trained, what was excluded, and which warnings still need attention.
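One way to carry that context is a small manifest stored next to the exported artifact. The structure below is purely hypothetical: the `ExportManifest` class, its field names, and the sample values are assumptions for illustration and do not describe MLdeck's actual export metadata.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExportManifest:
    """Hypothetical review record to keep alongside an exported model."""
    target: str
    features: list
    excluded_columns: dict  # column name -> reason it was excluded
    warnings: list = field(default_factory=list)  # dicts: message, resolution

    def open_warnings(self):
        """Messages of warnings that have no documented resolution yet."""
        return [w["message"] for w in self.warnings if not w.get("resolution")]

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Made-up example values.
manifest = ExportManifest(
    target="churned",
    features=["plan", "tenure_months"],
    excluded_columns={
        "customer_id": "identifier",
        "cancel_reason": "known only after the outcome",
    },
    warnings=[
        {"message": "class imbalance", "resolution": "reviewed per-class recall"},
        {"message": "only random validation split", "resolution": ""},
    ],
)

print(manifest.open_warnings())
print(manifest.to_json())
```

A non-empty `open_warnings()` list corresponds to the choice described above: either document why the warning is acceptable or pause before exporting.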

Data quality FAQ

What data-quality checks matter before AutoML?

Review missing values, IDs, leakage, constant columns, duplicates, class imbalance, target suitability, and row count.

What is target leakage?

It is a feature that reveals the answer or uses information not available at prediction time.

Why can high accuracy be a warning sign?

It may indicate leakage, duplicate rows, class imbalance, or target-derived features.

How do MLdeck warnings help?

Warnings support review by making missingness, leakage risk, class balance, and schema concerns visible in the browser workflow.

Should I export a model if warnings appear?

Yes, for testing. Treat the warnings as context for validation and artifact review, and document why any remaining warnings are acceptable before relying on the export.

Related examples and guides

Use these practical examples to see data-quality review in context.

CSV Data Quality Checker
Data Quality for Machine Learning
AutoML validation evidence
Missing values in CSV files
CSV schema validation
Target leakage risk
Data Quality vs AutoML
Customer churn prediction from CSV
Local AutoML for CSV files
Browser-local AutoML