Example workflow

Data quality checks for CSV machine learning

Model choice rarely matters more than data quality. This example explains how MLdeck data-quality warnings can support review before you interpret AutoML results. The focus is on missing values, high-cardinality IDs, target leakage, suspicious columns, class imbalance, validation limits, and export readiness in browser-local CSV workflows.

Why data quality matters more than model choice

A powerful model cannot rescue a broken dataset. If the target is wrong, a column leaks the answer, rows are duplicated, or classes are badly imbalanced, the leaderboard may look impressive while the workflow remains unreliable. CSV machine learning often fails because the table was not designed for prediction.

Data-quality checks create a pause before interpretation. They ask whether the target is suitable, whether features are known at prediction time, whether missingness is meaningful, and whether validation evidence reflects the future use case. MLdeck warnings are decision support, not automatic proof.

Common CSV problems before machine learning

Common problems include missing values, high-cardinality IDs, constant columns, near-duplicate columns, too few rows, class imbalance, suspiciously high performance, target-feature mismatch, and row-order effects. A customer table might include IDs and emails. A sales table might include future revenue. A music table might include duplicate tracks. A delivery table might include final delivery timestamps even though the target is delivery time.

These issues do not always mean the project should stop. They mean the user should investigate before trusting metrics. Sometimes the fix is simple: exclude an ID, clean labels, remove post-outcome fields, or choose a better target.
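As a rough illustration of the checks described above, the following sketch flags constant columns, ID-like columns, and exact duplicate rows using only the Python standard library. It is not MLdeck's implementation; the function names, the 0.9 threshold, and the sample table are assumptions made for this example.

```python
import csv
import io


def profile_columns(rows, id_ratio=0.9):
    """Flag constant and ID-like columns in a list of dict rows.

    id_ratio is an illustrative threshold, not an MLdeck setting: a column
    whose distinct non-empty values cover at least this share of rows is
    flagged as ID-like.
    """
    if not rows:
        return {}
    flags = {}
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] != ""]
        distinct = len(set(values))
        if distinct <= 1:
            flags[name] = "constant"
        elif distinct / len(rows) >= id_ratio:
            flags[name] = "id-like"
    return flags


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates


# Tiny made-up table, used only to exercise the checks.
SAMPLE = """customer_id,country,plan,churned
c1,US,basic,0
c2,US,pro,1
c3,US,basic,0
c4,US,pro,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flags = profile_columns(rows)
print(flags)  # {'customer_id': 'id-like', 'country': 'constant'}
print(count_duplicate_rows(rows + [rows[0]]))  # 1
```

A flag here is a prompt to look, not a verdict. A continuous numeric measurement with unique values would also be flagged as ID-like by this heuristic, so the column's meaning still has to be checked by a person.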

Missing values, categorical columns, and outliers

Missing values can be random, systematic, or meaningful. A missing payment method might indicate a data import issue. A missing cancellation reason may be expected for active customers. Categorical columns can contain rare values, spelling variations, or too many unique labels. Numeric columns can contain impossible values, unit mixups, or extreme outliers.

MLdeck can profile and preprocess these columns, but users should still review what the missingness and categories mean. Automated preprocessing is useful for exploration; domain review is still needed before relying on outputs.
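One way to tell meaningful missingness from an import problem is to compare missing rates across groups. The sketch below does this for the cancellation-reason example; the column names, the set of missing-value markers, and the sample rows are assumptions for illustration, not MLdeck behavior.

```python
from collections import defaultdict

# Assumed markers for "missing"; adjust to whatever the CSV actually uses.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def is_missing(value):
    """Treat None and common placeholder strings as missing."""
    return value is None or value.strip().lower() in MISSING_MARKERS


def missing_rate_by_group(rows, column, group_column):
    """Return the share of missing values in `column` for each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_column]
        totals[group] += 1
        if is_missing(row[column]):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Made-up rows: active customers are expected to have no cancellation reason.
rows = [
    {"status": "active", "cancellation_reason": ""},
    {"status": "active", "cancellation_reason": "N/A"},
    {"status": "cancelled", "cancellation_reason": "price"},
    {"status": "cancelled", "cancellation_reason": ""},
]

rates = missing_rate_by_group(rows, "cancellation_reason", "status")
print(rates)  # {'active': 1.0, 'cancelled': 0.5}
```

In this toy table, the fully missing reason for active customers is expected, while the 50% rate among cancelled customers is the part worth investigating.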

Target leakage and suspicious columns

Target leakage happens when a feature reveals the answer or uses information from after the prediction moment. In churn data, cancellation reason leaks churn. In regression, final invoice amount may leak price. In classification, a label-derived status code can make the task trivial. Leakage often appears as unusually strong performance or a single feature dominating the model.

Suspicious columns should be inspected, not blindly kept. Ask whether the feature would be available when making the prediction. If not, exclude it. If the answer is uncertain, run a version with and without the column and compare behavior.
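A simple heuristic for spotting a column that nearly determines the target is to ask how well a lookup on that single column predicts the label. The sketch below is one such check under stated assumptions: the function names and both thresholds are hypothetical, and the score is computed in-sample, so it only points at columns to inspect.

```python
from collections import Counter, defaultdict


def single_feature_purity(rows, feature, target):
    """Share of rows whose target equals the most common target
    among rows with the same feature value (an in-sample lookup)."""
    targets_by_value = defaultdict(Counter)
    for row in rows:
        targets_by_value[row[feature]][row[target]] += 1
    correct = sum(c.most_common(1)[0][1] for c in targets_by_value.values())
    return correct / len(rows)


def suspicious_features(rows, target, threshold=0.98, max_distinct_ratio=0.5):
    """List features that almost perfectly determine the target.

    Both thresholds are illustrative. Columns with many distinct values
    are skipped because ID-like columns trivially score 1.0 in-sample.
    """
    flagged = []
    for name in rows[0]:
        if name == target:
            continue
        distinct = len({row[name] for row in rows})
        if distinct / len(rows) > max_distinct_ratio:
            continue
        if single_feature_purity(rows, name, target) >= threshold:
            flagged.append(name)
    return flagged


# Made-up churn rows: cancellation_reason is only filled in after churn.
rows = [
    {"plan": "basic", "cancellation_reason": "", "churned": "0"},
    {"plan": "pro", "cancellation_reason": "", "churned": "0"},
    {"plan": "basic", "cancellation_reason": "price", "churned": "1"},
    {"plan": "pro", "cancellation_reason": "moved", "churned": "1"},
    {"plan": "basic", "cancellation_reason": "", "churned": "0"},
    {"plan": "pro", "cancellation_reason": "price", "churned": "1"},
    {"plan": "basic", "cancellation_reason": "", "churned": "0"},
    {"plan": "pro", "cancellation_reason": "", "churned": "0"},
]

print(suspicious_features(rows, "churned"))  # ['cancellation_reason']
```

A flagged column is not automatically leakage. The deciding question is still whether the value would be known at the moment the prediction is made.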

Baseline comparison as a sanity check

Baselines are simple reference models. In classification, the majority-class baseline predicts the most common class. In regression, the mean baseline predicts the average target. A model that cannot meaningfully beat a simple baseline is probably not useful. A model that beats the baseline by a huge margin should be reviewed for leakage or duplicates.

Baseline comparison keeps expectations grounded. It helps users distinguish real signal from a dataset where the easy answer already performs well. MLdeck's leaderboard should be read alongside baseline evidence and warnings.
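The two baselines described above can be computed in a few lines. This is a minimal sketch with made-up labels and targets; the function names are chosen for this example and do not refer to any MLdeck API.

```python
from collections import Counter
from statistics import mean


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for label in test_labels if label == majority) / len(test_labels)


def mean_baseline_mae(train_targets, test_targets):
    """Mean absolute error of always predicting the training mean."""
    prediction = mean(train_targets)
    return mean(abs(value - prediction) for value in test_targets)


# Classification: 80% of training rows are "stay".
train_labels = ["stay"] * 8 + ["churn"] * 2
test_labels = ["stay"] * 4 + ["churn"]
print(majority_class_accuracy(train_labels, test_labels))  # 0.8

# Regression: the training mean is 20.
print(mean_baseline_mae([10, 20, 30], [20, 40]))  # 10
```

In the classification example, a model reporting 82% accuracy has added very little over the 80% that the easy answer already achieves.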

Exploratory evidence vs strict validation

Exploratory evidence helps decide what to try next. Strict validation is stronger evidence for important decisions. Strict validation may include holdout sets, time-aware splits, group-aware splits, external test data, parity validation for exports, and domain review. MLdeck is an MVP in early beta, so strict validation should be used before relying on results for important decisions.
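To make time-aware and group-aware splits concrete, here is a minimal standard-library sketch. The function names, column names, and the 20% holdout share are assumptions for this example; the time split also assumes timestamps are ISO-formatted strings that sort chronologically.

```python
import hashlib


def time_aware_split(rows, time_column, holdout_share=0.2):
    """Hold out the most recent rows so the model is tested on the future."""
    ordered = sorted(rows, key=lambda row: row[time_column])
    n_holdout = max(1, round(len(ordered) * holdout_share))
    return ordered[:-n_holdout], ordered[-n_holdout:]


def group_holdout_split(rows, group_column, holdout_share=0.2):
    """Send every row of a group to the same side of the split.

    Groups are assigned by hashing the group key, so the assignment is
    deterministic. The holdout size is only approximate and can even be
    empty when there are few groups.
    """
    train, holdout = [], []
    for row in rows:
        digest = hashlib.sha256(str(row[group_column]).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (holdout if bucket < holdout_share else train).append(row)
    return train, holdout


# Made-up rows: five customers, ten consecutive days.
rows = [
    {"customer": f"c{i % 5}", "day": f"2024-01-{i + 1:02d}"} for i in range(10)
]

past, future = time_aware_split(rows, "day")
print([row["day"] for row in future])  # ['2024-01-09', '2024-01-10']

train, holdout = group_holdout_split(rows, "customer")
train_groups = {row["customer"] for row in train}
holdout_groups = {row["customer"] for row in holdout}
print(train_groups.isdisjoint(holdout_groups))  # True
```

The group split matters when the same customer, device, or account appears in many rows. A random row-level split would let the model see a customer in training and then be scored on that same customer.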

How MLdeck reports warnings

MLdeck surfaces warnings in the workflow and may include data-quality evidence in reports or export metadata when available. Warnings can cover schema ambiguity, missingness, target issues, validation gaps, and suspicious columns. The goal is to slow the user down in the right places. A warning is not a verdict; it is a prompt to inspect the data.

For example, a warning about high cardinality may point to a customer ID, product ID, or transaction key that should not be treated like an ordinary category. A warning about class imbalance may mean accuracy is not enough. A warning about target ambiguity may mean the selected column is numeric but actually represents categories. These checks help users decide whether to revise the dataset before training again.
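The point about accuracy under class imbalance can be shown with a small worked example. The sketch below compares plain accuracy with balanced accuracy (the mean of per-class recall) for a model that always predicts the common class; the labels are made up and the helper functions are written for this illustration only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# 95 ordinary rows and 5 rare ones; the "model" never predicts the rare class.
y_true = ["ok"] * 95 + ["fraud"] * 5
y_pred = ["ok"] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The 95% accuracy looks strong, yet the model finds none of the rare cases. Balanced accuracy of 0.5 makes that visible.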

What users should review before exporting

Before exporting, review the target, feature list, excluded columns, warnings, leaderboard, baseline comparison, and validation scope. Confirm that identifiers and leakage columns are removed. Check whether exported artifacts include schema and preprocessing metadata. If warnings remain, document why they are acceptable or pause the workflow.

Exporting too early can spread a weak data assumption into another environment. A PDF report, ONNX artifact, Docker package, or Python file should carry enough context for a reviewer to understand what was trained, what was excluded, and which warnings still need attention.
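One way to carry that context alongside an artifact is a small review manifest. The sketch below is hypothetical: the field names and structure are assumptions for illustration and do not describe MLdeck's actual export metadata. It records the target, features, exclusions, and warnings, and lists any warning that has no documented justification.

```python
import json


def unresolved_warnings(warnings):
    """Return the codes of warnings that have no written justification."""
    return [w["code"] for w in warnings if not w.get("justification", "").strip()]


def build_review_manifest(target, features, excluded, warnings):
    """Assemble a reviewer-facing summary (hypothetical format)."""
    open_codes = unresolved_warnings(warnings)
    return {
        "target": target,
        "features": sorted(features),
        "excluded_columns": sorted(excluded),
        "warnings": warnings,
        "unresolved_warnings": open_codes,
        "ready_for_review": not open_codes,
    }


# Made-up example values.
manifest = build_review_manifest(
    target="churned",
    features=["plan", "tenure_months"],
    excluded=["customer_id", "cancellation_reason"],
    warnings=[
        {"code": "class_imbalance", "justification": "Reported balanced accuracy."},
        {"code": "few_rows", "justification": ""},
    ],
)

print(json.dumps(manifest, indent=2))
print(manifest["unresolved_warnings"])  # ['few_rows']
```

A manifest like this does not make an export safe. It only makes the remaining open questions visible to whoever receives the artifact.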

Data quality FAQ

What data-quality checks matter before AutoML?

Review missing values, IDs, leakage, constant columns, duplicates, class imbalance, target suitability, and row count.

What is target leakage?

It occurs when a feature reveals the answer or uses information that is not available at prediction time.

Why can high accuracy be a warning sign?

It may indicate leakage, duplicate rows, class imbalance, or target-derived features.

Does MLdeck automatically fix all data problems?

No. Warnings support review, but users remain responsible for interpretation.

Should I export a model if warnings appear?

You can export for testing, but investigate warnings before relying on artifacts.
