Data Quality
"Data cleaning and repairing account for about 60% of the work of data scientists."
Christian Kaestner
Required reading:
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
Recommended reading:
Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F. and Grafberger, A., 2018. Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp.1781-1794.
Nick Hynes, D. Sculley, Michael Terry. "The Data Linter: Lightweight Automated Sanity Checking for ML Data Sets." NIPS Workshop on ML Systems (2017)
Administrativa: Midterm
Midterm in 1 week
During lecture, 80 min, here
Answer questions related to a given scenario
All lecture content, reading, recitations in scope -- focus on topics you had opportunity to practice
No electronics; you may bring 6 pages of notes on paper (handwritten or typed, both sides)
Old midterms online, see course webpage
Administrativa: Homework I3
Open ended: Try a tool and write a blog post about it
Any tool related to building ML-enabled systems
Except: No pure ML frameworks
ML pipelines, data engineering, operations, ...
Open source, academic, or commercial; local or cloud
Also look for competitors of tools of interest
Claim tool in Spreadsheet, first come first serve
One-week assignment (even though it is due in 3 weeks)
Past tools: Algorithmia , Amazon Elastic MapReduce , Apache Flink , Azure ML , Dask , Databricks , DataRobot , Google Cloud AutoML , IBM Watson Studio , LaunchDarkly , Metaflow , Pycaret , Split.io , TensorBoard , Weights and Biases , Amazon Sagemaker , Apache Airflow , Apache Flume , Apache Hadoop , Apache Spark , Auto-Surprise , BentoML , CML , Cortex , DVC , Grafana , Great Expectations , Holoclean , Kedro , Kubeflow , Kubernetes , Luigi , MLflow , ModelDB , Neo4j , pydqc , Snorkel , TensorFlow Lite , TPOT
17-745 students: Research project instead, contact us now
Learning Goals
Distinguish precision and accuracy; understand the tradeoff between better models and more data
Use schema languages to enforce data schemas
Design and implement automated quality assurance steps that check data schema conformance and distributions
Devise infrastructure for detecting data drift and schema violations
Consider data quality as part of a system; design an organization that values data quality
Data cleaning and repairing account for about 60% of the work of data scientists.
Case Study: Inventory Management
Data Comes from Many Sources
Manually entered
Actions from IT systems
Logging information, traces of user interactions
Sensor data
Crowdsourced
Many Data Sources
Diagram: data for the Inventory ML component flows in from many sources -- SalesTrends, AdNetworks, VendorSales, ProductData, Marketing, Expired/Lost/Theft, PastSales -- sources of different reliability and quality
Inventory Database
Product Database: ID, Name, Weight, Description, Size, Vendor, ...
Stock: ProductID, Location, Quantity, ...
Sales history: UserID, ProductId, DateTime, Quantity, Price, ...
Raw Data is an Oxymoron
What makes good quality data?
Accuracy
The data was recorded correctly.
Completeness
All relevant data was recorded.
Uniqueness
Each entry is recorded only once.
Consistency
The data agrees with itself.
Timeliness
The data is kept up to date.
Data is noisy
Unreliable sensors or data entry
Wrong results and computations, crashes
Duplicate data, near-duplicate data
Out of order data
Data format invalid
Examples in inventory system?
Data changes
System objective changes over time
Software components are upgraded or replaced
Prediction models change
Quality of supplied data changes
User behavior changes
Assumptions about the environment no longer hold
Examples in inventory system?
Users may deliberately change data
Users react to model output
Users try to game/deceive the model
Examples in inventory system?
Accuracy vs Precision
Accuracy: Reported values (on average) represent real value
Precision: Repeated measurements yield the same result
Accurate, but imprecise: averaging over multiple measurements reduces the noise (see sketch below)
Inaccurate, but precise: systematic measurement problem; averaging does not help, results are misleading
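A minimal sketch (with made-up sensor readings) illustrating why averaging helps with imprecision but not with inaccuracy:

import numpy as np

rng = np.random.default_rng(0)
true_weight = 50.0  # assumed true value of the measured quantity

# accurate but imprecise: unbiased but noisy -> the mean approaches the true value
noisy = rng.normal(loc=true_weight, scale=5.0, size=1000)
# inaccurate but precise: little noise, but a systematic offset -> the mean stays off
biased = rng.normal(loc=true_weight + 3.0, scale=0.1, size=1000)

print(noisy.mean())   # close to 50
print(biased.mean())  # close to 53; averaging does not remove the bias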
Data Quality and Machine Learning
More data -> better models (up to a point, diminishing returns)
Noisy data (imprecise) -> less confident models, more data needed
Some ML techniques are more robust to noise than others (more on robustness in a later lecture)
Inaccurate data -> misleading models, biased models
Need the "right" data
Invest in data quality, not just quantity
Poor Data Quality has Consequences
(often delayed consequences)
Example: Systematic bias in labeling
Poor data quality leads to poor models
Not detectable in offline evaluation
Problem in production -- now difficult to correct
Delayed Fixes increase Repair Cost
Data Cascades
Detection almost always delayed! Expensive rework.
Difficult to detect in offline evaluation.
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
Data Schema
Ensuring basic consistency about shape and types
Dirty Data: Example
Problems with the data?
Schema Problems
Illegal attribute values: bdate=30.13.70
Violated attribute dependencies: age=22, bdate=12.02.70
Uniqueness violation: (name=”John Smith”, SSN=”123456”), (name=”Peter Miller”, SSN=”123456”)
Referential integrity violation: emp=(name=”John Smith”, deptno=127), if department 127 is not defined
Data Schema
Define expected format of data
expected fields and their types
expected ranges for values
constraints among values (within and across sources)
Data can be automatically checked against schema
Protects against change; explicit interface between components
Schema in Relational Databases
CREATE TABLE employees (
    emp_no      INT         NOT NULL,
    birth_date  DATE        NOT NULL,
    name        VARCHAR(30) NOT NULL,
    PRIMARY KEY (emp_no));
CREATE TABLE departments (
    dept_no     CHAR(4)     NOT NULL,
    dept_name   VARCHAR(40) NOT NULL,
    PRIMARY KEY (dept_no), UNIQUE KEY (dept_name));
CREATE TABLE dept_manager (
    dept_no     CHAR(4)     NOT NULL,
    emp_no      INT         NOT NULL,
    FOREIGN KEY (emp_no)  REFERENCES employees (emp_no),
    FOREIGN KEY (dept_no) REFERENCES departments (dept_no),
    PRIMARY KEY (emp_no, dept_no));
Which Problems are Schema Problems?
What Happens When New Data Violates the Schema?
Schema-Less Data Exchange
CSV files
Key-value stores (JSON, XML, NoSQL databases)
Message brokers
REST API calls
R/Pandas Dataframes
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
10|53|M|lawyer|90703
11|39|F|other|30329
12|28|F|other|06405
13|47|M|educator|29206
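Without a schema, such checks have to be written by hand. A minimal sketch for the pipe-separated user data above (file name, column names, and value ranges are assumptions for illustration):

import pandas as pd

cols = ["user_id", "age", "gender", "occupation", "zip"]
df = pd.read_csv("users.dat", sep="|", names=cols, dtype=str)

assert df["user_id"].str.isdigit().all(), "non-numeric user id"
assert df["age"].astype(int).between(0, 120).all(), "age out of plausible range"
assert df["gender"].isin(["M", "F"]).all(), "unexpected gender code"
assert df["zip"].str.match(r"^\d{5}$").all(), "malformed zip code"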
Schema Library: Apache Avro
{ "type" : "record" ,
"namespace" : "com.example" ,
"name" : "Customer" ,
"fields" : [{
"name" : "first_name" ,
"type" : "string" ,
"doc" : "First Name of Customer"
},
{
"name" : "age" ,
"type" : "int" ,
"doc" : "Age at the time of registration"
}
]
}
Schema Library: Apache Avro
Schema specification in JSON format
Serialization and deserialization with automated checking
Native support in Kafka
Benefits
Serialization in space efficient format
APIs for most languages (ORM-like)
Versioning constraints on schemas
Drawbacks
Reading/writing overhead
Binary data format, extra tools needed for reading
Requires external schema and maintenance
Learning overhead
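A minimal sketch of schema-checked serialization with the fastavro library (one of several Avro bindings; file name and record values are made up):

from fastavro import parse_schema, writer, reader

schema = parse_schema({
    "type": "record", "namespace": "com.example", "name": "Customer",
    "fields": [{"name": "first_name", "type": "string"},
               {"name": "age", "type": "int"}],
})

with open("customers.avro", "wb") as out:
    writer(out, schema, [{"first_name": "Ada", "age": 36}])
    # writer(out, schema, [{"first_name": "Bob"}])  # missing field -> rejected with an error

with open("customers.avro", "rb") as f:
    for record in reader(f):
        print(record)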
Examples
Avro
XML Schema
Protobuf
Thrift
Parquet
ORC
Discussion: Data Schema Constraints for Inventory System?
Product Database: ID, Name, Weight, Description, Size, Vendor, ...
Stock: ProductID, Location, Quantity, ...
Sales history: UserID, ProductId, DateTime, Quantity, Price, ...
Summary: Schema
Basic structure and type definition of data
Well supported in databases and many tools
Very low bar
Instance-Level Problems
Inconsistencies, wrong values
Dirty Data: Example
Problems with the data beyond schema problems?
Instance-Level Problems
Missing values: phone=9999-999999
Misspellings: city=Pittsburg
Misfielded values: city=USA
Duplicate records: name=John Smith, name=J. Smith
Wrong reference: emp=(name=”John Smith”, deptno=127), if department 127 is defined but wrong
Can we detect these?
Discussion: Instance-Level Problems in Scenario?
Data Cleaning Overview
Data analysis / Error detection
Usually focused on specific kind of problems, e.g., duplication, typos, missing values, distribution shift
Detection in input data vs detection in later stages (more context)
Error repair
Repair data vs repair rules, one at a time or holistic
Data transformation or mapping
Automated vs human guided
Error Detection Examples
Illegal values: min, max, variance, deviations, cardinality
Misspelling: sorting + manual inspection, dictionary lookup
Missing values: null values, default values
Duplication: sorting, edit distance, normalization
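A minimal sketch of such checks with pandas, on a hypothetical customer table (file and column names are assumptions):

import pandas as pd

df = pd.read_csv("customers.csv")   # assumed columns: name, city, age, phone

# illegal values: min/max/variance at a glance, plus explicit range checks
print(df["age"].describe())
print(df[(df["age"] < 0) | (df["age"] > 120)])

# missing values: nulls and suspicious placeholder defaults
print(df.isna().sum())
print((df["phone"] == "9999-999999").sum())

# duplication: exact duplicates and near-duplicates after normalization
print(df.duplicated().sum())
normalized = df["name"].str.lower().str.replace(".", "", regex=False).str.strip()
print(normalized.duplicated().sum())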
Error Detection: Example
Q. Can we (automatically) detect errors? Which errors are problem-dependent?
Data Quality Rules
Invariants on data that must hold
Typically about relationships of multiple attributes or data sources, e.g.:
ZIP code and city name should correspond
User ID should refer to existing user
SSN should be unique
For two people in the same state, the person with the lower income should not have the higher tax rate
Classic integrity constraints in databases or conditional constraints
Rules can be used to reject data or repair it
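A minimal sketch of checking such invariants with pandas (file and column names are assumptions for illustration):

import pandas as pd

users = pd.read_csv("users.csv")        # UserID, City, Zip, ...
sales = pd.read_csv("sales.csv")        # UserID, ProductId, Quantity, ...
zip_city = pd.read_csv("zip_city.csv")  # reference data: Zip, City

# referential integrity: every sale must refer to an existing user
dangling = sales[~sales["UserID"].isin(users["UserID"])]

# uniqueness: one record per user id
duplicates = users[users["UserID"].duplicated(keep=False)]

# correspondence rule: zip code and city must agree with the reference table
merged = users.merge(zip_city, on="Zip", how="left", suffixes=("", "_ref"))
mismatched = merged[merged["City"] != merged["City_ref"]]

print(len(dangling), len(duplicates), len(mismatched))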
Machine Learning for Detecting Inconsistencies
Example: HoloClean
User provides rules as integrity constraints (e.g., "two entries with the same name can't have different city")
Detect violations of the rules in the data; also detect statistical outliers
Automatically generate repair candidates (with probabilities)
Discovery of Data Quality Rules
Rules directly taken from external databases
Given clean data,
several algorithms that find functional relationships (X ⇒ Y ) among columns
algorithms that find conditional relationships (if Z then X ⇒ Y )
algorithms that find denial constraints (X and Y cannot cooccur in a row)
Given mostly clean data (probabilistic view),
algorithms to find likely rules (e.g., association rule mining)
outlier and anomaly detection
Given labeled dirty data or user feedback,
supervised and active learning to learn and revise rules
supervised learning to learn repairs (e.g., spell checking)
Further reading: Ilyas, Ihab F., and Xu Chu. Data cleaning . Morgan & Claypool, 2019.
Association rule mining
Sale 1: Bread, Milk
Sale 2: Bread, Diaper, Beer, Eggs
Sale 3: Milk, Diaper, Beer, Coke
Sale 4: Bread, Milk, Diaper, Beer
Sale 5: Bread, Milk, Diaper, Coke
Rules
{Diaper, Beer} -> Milk (40% support, 66% confidence)
Milk -> {Diaper, Beer} (40% support, 50% confidence)
{Diaper, Beer} -> Bread (40% support, 66% confidence)
(also a useful tool for exploratory data analysis)
Further readings: Standard algorithms and many variations, see Wikipedia
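A minimal sketch with the mlxtend library that reproduces the support and confidence numbers above:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

sales = [["Bread", "Milk"],
         ["Bread", "Diaper", "Beer", "Eggs"],
         ["Milk", "Diaper", "Beer", "Coke"],
         ["Bread", "Milk", "Diaper", "Beer"],
         ["Bread", "Milk", "Diaper", "Coke"]]

# one-hot encode the transactions, then mine frequent itemsets and rules
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(sales).transform(sales), columns=te.columns_)
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])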
Discussion: Data Quality Rules in Inventory System
Drift & Model Decay
Concept drift (or concept shift)
properties to predict change over time (e.g., what is credit card fraud)
over time: different expected outputs for same inputs
model has not learned the relevant concepts
Data drift (or covariate shift or population drift)
characteristics of input data changes (e.g., customers with face masks)
input data differs from training data
over time: predictions less confident, further from training data
Upstream data changes
external changes in data pipeline (e.g., format changes in weather service, new worker performing manual entry)
model interprets input data incorrectly
over time: abrupt changes due to faulty inputs
How do we fix these drifts?
Concept drift: retrain with new training data or relabeled old training data
Data drift: retrain with more recent data
Upstream data changes: fix the pipeline, retrain entirely
On Terminology
Concept and data drift are separate concepts
In practice and literature not always clearly distinguished
Colloquially, "drift" encompasses all forms of model degradation and environment change
Define the term for your target audience
Breakout: Drift in the Inventory System
What kind of drift might be expected?
As a group, in Slack #lecture,
write a plausible example of each:
Concept Drift:
Data Drift:
Upstream data changes:
Watch for Degradation in Prediction Accuracy
Indicators of Concept Drift
How to detect concept drift in production?
Indicators of Concept Drift
Model degradations observed with telemetry
Telemetry indicates different outputs over time for similar inputs
Relabeling training data changes labels
Interpretable ML models indicate rules that no longer fit
(many papers on this topic, typically on statistical detection)
Dealing with Drift
Regularly retrain model on recent data
Use evaluation in production to detect decaying model performance
Involve humans when increasing inconsistencies detected
Monitoring thresholds, automation
Monitoring, monitoring, monitoring!
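A minimal sketch of accuracy monitoring over telemetry (log format, column names, and threshold are assumptions):

import pandas as pd

telemetry = pd.read_csv("predictions_log.csv", parse_dates=["timestamp"])
telemetry["correct"] = telemetry["predicted"] == telemetry["actual"]

# weekly accuracy from logged predictions and (possibly delayed) ground truth
weekly_acc = telemetry.set_index("timestamp").resample("W")["correct"].mean()

THRESHOLD = 0.85  # set per system requirements
for week, acc in weekly_acc.items():
    if acc < THRESHOLD:
        print(f"ALERT: accuracy {acc:.2f} in week of {week.date()} below {THRESHOLD}")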
Structural drift
Data schema changes, sometimes by infrastructure changes
e.g., 4124784115 -> 412-478-4115
Semantic drift
Meaning of data changes, same schema
e.g., Netflix switches from 5-star to +/- ratings, but still encodes them as 1 and 5
Distribution changes
e.g., credit card fraud patterns change to evade detection
e.g., marketing affects sales of certain items
Detecting Data Drift
Compare distributions over time (e.g., t-test)
Detect both sudden jumps and gradual changes
Distributions can be manually specified or learned (see invariant detection)
Data Distribution Analysis
Plot distributions of features (histograms, density plots, kernel density estimation)
Identify which features drift
Define a distance function between inputs and identify the distance to the closest training data (e.g., Wasserstein and energy distance; see also kNN)
Formal models for data drift contribution, etc., exist
Anomaly detection and "out of distribution" detection
Observe distribution of output labels
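A minimal sketch of per-feature distribution comparison between training data and recent production inputs (file names and thresholds are assumptions):

import pandas as pd
from scipy.stats import ks_2samp, wasserstein_distance

train = pd.read_csv("train_features.csv")
recent = pd.read_csv("recent_features.csv")  # e.g., last week of production inputs

for col in train.select_dtypes("number").columns:
    a, b = train[col].dropna(), recent[col].dropna()
    result = ks_2samp(a, b)                 # two-sample Kolmogorov-Smirnov test
    distance = wasserstein_distance(a, b)   # magnitude of the distribution shift
    if result.pvalue < 0.01:                # significance threshold is a judgment call
        print(f"possible drift in {col}: KS={result.statistic:.3f}, W={distance:.3f}")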
Microsoft Azure Data Drift Dashboard
Breakout: Drift in the Inventory System
What kind of monitoring for drift in Inventory scenario?
Data Quality is a System-Wide Concern
Everyone wants to do the model work, not the data work
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
Data flows across components
Data Quality is a System-Wide Concern
Data flows across components
e.g., from the user interface into a database, to a crowd-sourced labeling team, and into the ML pipeline
Documentation at the interfaces is important
Humans interacting with the system
Entering data
Labeling data
Observed with sensors/telemetry
Incentives, power structures, recognition
Organizational practices
Value, attention, and resources given to data quality
Data Quality Documentation
Teams rarely document expectations of data quantity or quality
Data quality tests are rare, but some teams adopt defensive monitoring
Local tests about assumed structure and distribution of data
Identify drift early and reach out to producing teams
Several ideas for documenting distributions, including Datasheets and Dataset Nutrition Label
Mostly focused on static datasets, describing origin, consideration, labeling procedure, and distributions
Example
Nahar, Nadia, Shurui Zhou, Grace Lewis, and Christian Kästner. “Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process.” In Proceedings of the 44th International Conference on Software Engineering (ICSE), May 2022.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. "Datasheets for datasets." Communications of the ACM 64, no. 12 (2021): 86-92.
Common Data Cascades
Interacting with physical world brittleness
Idealized data, ignoring realities and change of real-world data
Static data, one time learning mindset, no planning for evolution
Inadequate domain expertise
Not understanding the data and its context
Involving experts only late, for troubleshooting
Conflicting reward systems
Missing incentives for data quality
Not recognizing the importance of data quality; dismissed as a technicality
Missing data literacy with partners
Poor (cross-organizational) documentation
Conflicts at team/organization boundary
Undetected drift
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
Discussion: Possible Data Cascades in Inventory Scenario?
Interacting with physical world brittleness
Inadequate domain expertise
Conflicting reward systems
Poor (cross-organizational) documentation
Ethics and Politics of Data
Raw data is an oxymoron
Incentives for Data Quality? Valuing Data Work?
Quality Assurance for the Data Processing Pipelines
Error Handling and Testing in Pipeline
Avoid silent failures!
Write testable data acquisition and feature extraction code
Test this code (unit test, positive and negative tests)
Test retry mechanism for acquisition + error reporting
Test correct detection and handling of invalid input
Catch and report errors in feature extraction
Test correct detection of data drift
Test correct triggering of monitoring system
Detect stale data, stale models
More in a later lecture.
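A minimal pytest sketch for one such check, testing that a hypothetical feature-extraction function reports invalid inputs rather than failing silently:

import pytest

def extract_features(record: dict) -> dict:
    # hypothetical feature extraction for a sales record
    if record.get("quantity") is None or record["quantity"] < 0:
        raise ValueError(f"invalid quantity: {record.get('quantity')}")
    return {"quantity": float(record["quantity"])}

def test_valid_record():
    assert extract_features({"quantity": 3})["quantity"] == 3.0

def test_negative_quantity_rejected():
    with pytest.raises(ValueError):         # negative test: bad data raises an error,
        extract_features({"quantity": -1})  # it is not silently turned into a feature

def test_missing_quantity_rejected():
    with pytest.raises(ValueError):
        extract_features({})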
Excursion: Static Analysis and Code Linters
Automate routine inspection tasks
if (user.jobTitle = "manager") {      // assignment instead of comparison
    ...
}

function fn() {
    x = 1;
    return x;
    x = 3;                            // dead code after return
}

PrintWriter log = null;
if (anyLogging) log = new PrintWriter(...);
if (detailedLogging) log.println("Log started");   // possible null dereference
Static Analysis
Analyzes the structure/possible executions of the code, without running it
Different levels of sophistication
Simple heuristic and code patterns (linters)
Sound reasoning about all possible program executions
Tradeoff between false positives and false negatives
Often supporting annotations needed (e.g., @Nullable)
Tools widely available, open source and commercial
A Linter for Data?
Data Linter at Google
Miscoding
Number, date, time as string
Enum as real
Tokenizable string (long strings, all unique)
Zip code as number
Outliers and scaling
Unnormalized feature (varies widely)
Tailed distributions
Uncommon sign
Packaging
Duplicate rows
Empty/missing data
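A minimal sketch of a few of these lint checks written directly in pandas (the actual Data Linter operates on TensorFlow training data; file name and thresholds here are assumptions):

import pandas as pd

df = pd.read_csv("training_data.csv")
warnings = []

for col in df.columns:
    s = df[col]
    # miscoding: numeric values stored as strings
    if s.dtype == object and pd.to_numeric(s, errors="coerce").notna().all():
        warnings.append(f"{col}: numbers stored as strings")
    # miscoding: zip-code-like column treated as a number
    if "zip" in col.lower() and pd.api.types.is_numeric_dtype(s):
        warnings.append(f"{col}: zip code stored as number")
    # outliers/scaling: widely varying, unnormalized feature
    if pd.api.types.is_numeric_dtype(s) and s.std() > 1000:
        warnings.append(f"{col}: varies widely, consider normalizing")

# packaging: duplicate rows
if df.duplicated().any():
    warnings.append("dataset contains duplicate rows")

print("\n".join(warnings))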
Summary
Data and data quality are essential
Data from many sources, often inaccurate, imprecise, inconsistent, incomplete, ... -- many different forms of data quality problems
Many mechanisms for enforcing consistency and cleaning
Data schema ensures format consistency
Data quality rules ensure invariants across data points
Concept and data drift are key challenges -- monitor
Data quality is a system-level concern
Data quality at the interface between components
Documentation and monitoring often poor
Involves organizational structures, incentives, ethics, ...
Quality assurance for the data processing pipelines