(1)
Looking back at the semester
(413 slides in 40 min)
(2)
Discussion of future of ML in Production
(3)
Feedback for future semesters
As a group, think about challenges that the team will likely focus on when turning their research into a product:
Post answer to #lecture on Slack and tag all group members
By Steven Geringer, via Ryan Orban. Bridging the Gap Between Data Science & Engineering: Building High-Performance Teams. 2016
Broad-range generalist + Deep expertise
Figure: Jason Yip. Why T-shaped people?. 2018
17-445/17-645/17-745, Fall 2022, 12 units
Monday/Wednesdays 1:25-2:45pm
Recitation Fridays 10:10-11:00am / 1:25-2:45pm
"Coding warmup assignment"
Out now, due Sep 7
Enhance simple web application with ML-based feature: Automated image captioning
Open-ended coding assignment, change existing code, learn new APIs
Building Intelligent Systems by Geoff Hulten
https://www.buildingintelligentsystems.com/
Most chapters assigned at some point in the semester
Supplemented with research articles, blog posts, videos, podcasts, ...
Electronic version in the library
Specification grading, based on adult learning theory
Giving you choices in what to work on or how to prioritize your work
We are making every effort to be clear about expectations (specifications), will clarify if you have questions
Assignments broken down into expectations with point values, each graded pass/fail
Opportunities to resubmit work until last day of class
/**
 * Return the text spoken within the audio file
 * ????
 */
String transcribe(File audioFile);
We routinely build:
ML intensifies our challenges
Focus: building models from given data, evaluating accuracy
Focus: experimenting, deploying, scaling training and serving, model monitoring and updating
Interaction of ML and non-ML components, system requirements, user interactions, safety, collaboration, delivering products
Based on the excellent paper: Passi, S., & Sengers, P. (2020). Making data science systems work. Big Data & Society, 7(2).
Wagstaff, Kiri. "Machine learning that matters." In Proceedings of the 29th International Conference on Machine Learning (2012).
In the model
Outside the model (e.g., "guardrails")
(Image CC BY-SA 4.0, C J Cowie)
Automate: Take action on user's behalf
Prompt: Ask the user if an action should be taken
Organize/Annotate/Augment: Add information to a display
Hybrids of these
Fall detection for elderly people:
Safe browsing: Blocking malicious web pages
Discuss in group and post to #lecture, tagging all members:
(1) How do we present the intelligence to the user?
(2) Justify in terms of system goals, forcefulness, frequency, value of correct and cost of wrong predictions
Newer better models released (better model architectures, more training data, ...)
Goals and scope change (more domains, handling dialects, ...)
The world changes (new products, names, slang, ...)
Online experimentation
Design for telemetry
In enterprise ML teams:
Shifting to pipeline-centric workflow challenging
O'Leary, Katie, and Makoto Uchida. "Common problems with Creating Machine Learning Pipelines from Existing Code." Proc. Third Conference on Machine Learning and Systems (MLSys) (2020).
Golden rule: Try to do what you agreed to do by the time you agreed to. If you cannot, seek help and communicate clearly and early.
First Part: Measuring Prediction Accuracy
Second Part: What is Correctness Anyway?
Third Part: Learning from Software Testing
Later: Testing in Production
| | Actually Grade 5 Cancer | Actually Grade 3 Cancer | Actually Benign |
|---|---|---|---|
| Model predicts Grade 5 Cancer | 10 | 6 | 2 |
| Model predicts Grade 3 Cancer | 3 | 24 | 10 |
| Model predicts Benign | 5 | 22 | 82 |
$\textit{accuracy} = \frac{\textit{correct predictions}}{\textit{all predictions}}$

Example's accuracy = $\frac{10+24+82}{10+6+2+3+24+10+5+22+82} = .707$
def accuracy(model, xs, ys):
    count = len(xs)
    count_correct = 0
    for i in range(count):
        predicted = model(xs[i])
        if predicted == ys[i]:
            count_correct += 1
    return count_correct / count
(CC BY-SA 4.0 by Walber)
Turning numeric prediction into classification with threshold ("operating point")
more in a later lecture
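A minimal sketch of an operating point, assuming a model that outputs scores in [0, 1]; all scores, labels, and thresholds below are illustrative:

```python
# Minimal sketch: turn numeric scores into classifications with a threshold
# ("operating point"); raising the threshold trades recall for precision.
def classify(scores, threshold=0.5):
    return [s >= threshold for s in scores]

def precision_recall(predicted, actual):
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    return tp / (tp + fp), tp / (tp + fn)

scores = [0.1, 0.4, 0.35, 0.8, 0.9]        # illustrative model scores
actual = [False, False, True, True, True]  # illustrative ground truth
print(precision_recall(classify(scores, 0.3), actual))  # (0.75, 1.0)
print(precision_recall(classify(scores, 0.6), actual))  # (1.0, 0.667)
```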
Often neither training nor test data representative of production data
Figure from: Geirhos, Robert, et al. "Shortcut learning in deep neural networks." Nature Machine Intelligence 2, no. 11 (2020): 665-673.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

wordsVectorizer = CountVectorizer().fit(text)
wordsVector = wordsVectorizer.transform(text)
invTransformer = TfidfTransformer().fit(wordsVector)
invFreqOfWords = invTransformer.transform(wordsVector)
X = pd.DataFrame(invFreqOfWords.toarray())
train, test, spamLabelTrain, spamLabelTest = train_test_split(X, y, test_size=0.5)
predictAndReport(train=train, test=test)
Example: Kaggle competition on detecting distracted drivers
Relation of datapoints may not be in the data (e.g., driver)
specifications, bugs, fit
Given a specification, do the outputs match the specified behavior for the given inputs?
/**
* compute deductions based on provided adjusted
* gross income and expenses in customer data.
*
* see tax code 26 U.S. Code A.1.B, PART VI
*/
float computeDeductions(float agi, Expenses expenses);
Each mismatch is considered a bug and should be fixed.†
Use ML precisely because no specifications (too complex, rules unknown)
// detects cancer in an image
boolean hasCancer(Image scan);
@Test
void testPatient1() {
  assertEquals(false, hasCancer(loadImage("patient1.jpg")));
}
@Test
void testPatient2() {
  assertEquals(false, hasCancer(loadImage("patient2.jpg")));
}
All models are approximations. Assumptions, whether implied or clearly stated, are never exactly true. All models are wrong, but some models are useful. So the question you need to ask is not "Is the model true?" (it never is) but "Is the model good enough for this particular application?" -- George Box
(Daniel Miessler, CC SA 2.0)
(Slicing, Capabilities, Invariants, Simulation, ...)
Opportunistic/exploratory testing: Add some unit tests, without much planning
Specification-based testing ("black box"): Derive test cases from specifications
Structural testing ("white box"): Derive test cases to cover implementation paths
Test execution usually automated, but can be manual too; automated generation from specifications or code possible
"Call mom" "What's the weather tomorrow?" "Add asafetida to my shopping list"
Input divided by movie age. Notice low accuracy, but also low support (i.e., little validation data), for old movies.
Input divided by genre, rating, and length. Accuracy differs, but also amount of test data used ("support") differs, highlighting low confidence areas.
Source: Barash, Guy, et al. "Bridging the gap between ML solutions and their business requirements using feature interactions." In Proc. FSE, 2019.
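A minimal sketch of slicing validation results, assuming a hypothetical results table with a `genre` column and a boolean `correct` column:

```python
import pandas as pd

# Minimal sketch (hypothetical data): accuracy and support per slice;
# slices with low support deserve extra caution.
results = pd.DataFrame({
    "genre":   ["action", "action", "drama", "drama", "documentary"],
    "correct": [True,     False,    True,    True,    True],
})
by_slice = results.groupby("genre")["correct"].agg(accuracy="mean", support="count")
print(by_slice)
```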
Ré, Christopher, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. "Overton: A Data System for Monitoring and Improving Machine-Learned Products." arXiv preprint arXiv:1909.05372 (2019).
Further reading: Christian Kaestner. Rediscovering Unit Testing: Testing Capabilities of ML Models. Toward Data Science, 2021.
From: Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList." In Proceedings ACL, p. 4902–4912. (2020).
Idea 1: Domain-specific generators
Testing negation in sentiment analysis with template:
I {NEGATION} {POS_VERB} the {THING}.
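A minimal sketch of expanding such a template into test inputs; the word lists and the `sentiment_model` call are hypothetical:

```python
from itertools import product

# Minimal sketch: expand the template into concrete test sentences;
# every generated sentence should be classified as negative sentiment.
NEGATION = ["didn't", "never"]
POS_VERB = ["love", "enjoy"]
THING    = ["food", "service", "movie"]

for neg, verb, thing in product(NEGATION, POS_VERB, THING):
    sentence = f"I {neg} {verb} the {thing}."
    # assert sentiment_model(sentence) == "negative"   # hypothetical model call
    print(sentence)
```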
Testing texture vs shape priority with artificially generated images:
Figure from Geirhos, Robert, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” In Proc. International Conference on Learning Representations (ICLR), (2019).
Idea 3: Crowd-sourcing test creation
Testing sarcasm in sentiment analysis: Ask humans to minimally change text to flip sentiment with sarcasm
Testing background in object detection: Ask humans to take pictures of specific objects with unusual backgrounds
Figure from: Kaushik, Divyansh, Eduard Hovy, and Zachary C. Lipton. “Learning the difference that makes a difference with counterfactually-augmented data.” In Proc. International Conference on Learning Representations (ICLR), (2020).
(if it wasn't for that darn oracle problem)
How do we know the expected output of a test?
assertEquals(??, factorPrime(15485863));
Identifying invariants requires domain knowledge of the problem!
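For example, a minimal sketch of an invariant test that sidesteps the oracle problem; the `sentiment_model` function and the name list are hypothetical, and we only assert that the prediction does not change, not what it should be:

```python
# Minimal sketch: invariant (metamorphic) test -- swapping one neutral name
# for another must not change the prediction, whatever that prediction is.
NAMES = ["Alice", "Bob", "Priya", "Wei"]

def test_name_invariance(sentiment_model,
                         template="{name} delivered the package on time."):
    baseline = sentiment_model(template.format(name=NAMES[0]))
    for name in NAMES[1:]:
        assert sentiment_model(template.format(name=name)) == baseline
```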
Further readings: Zhang, Mengshi, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems." In Proc. ASE. 2018.
Big problems: Many inputs, massive scale
Open-ended problems: No single "final" solution; incremental improvements and growth over time
Time-changing problems: Adapting to constant changes, learning with users
Intrinsically hard problems: Unclear rules, heuristics perform poorly
Examples?
see Hulten, Chapter 2
AI: Higher accuracy predictions at much lower cost
May use new, cheaper predictions for traditional tasks (examples?)
May now use predictions for new kinds of problems (examples?)
May now use more predictions than before
(Analogies: Reduced cost of light; internet reduced cost of search)
Ideally, these goals should be aligned with each other
What are different types of goals behind automating admissions decisions to a Master's program?
As a group post answer to #lecture, tagging all group members using template:
Organizational goals: ...
Leading indicators: ...
System goals: ...
User goals: ...
Model goals: ...
But: Not every measure is precise, not every measure is cost effective
Douglas Hubbard, “How to Measure Anything: finding the value of intangibles in business" 2014
Often higher-level measures are composed from lower level measures
Clear trace from specific low-level measurements to high-level metric
For design strategy, see Goal-Question-Metric approach
Three ingredients:
Q. What went wrong? What is the root cause of the failure?
Q. What went wrong? What is the root cause of the failure?
REQ: The vehicle must be prevented from veering off the lane.
SPEC: Lane detector accurately identifies lane markings in the input image; the controller generates correct steering commands
Discuss with your neighbor to come up with 2-3 assumptions
CC BY-SA 3.0 Anynobody
As a group, post answer to #lecture and tag group members:
Requirement: ...
Assumptions: ...
Specification: ...
What can go wrong: ...
Examples?
See also 🗎 Jackson, Michael. "The world and the machine." In Proceedings of the International Conference on Software Engineering. IEEE, 1995.
See examples and details http://gendermag.org/foundations.php
Dashcam system
No model is ever "correct"
Some mistakes are unavoidable
Anticipate the eventual mistake
ML model = unreliable component
Recall: Thermal fuse in smart toaster
CC BY-SA 4.0 by Chabe01
Independent mechanism to detect problems (in the real world)
Example: Gyrosensor to detect a train taking a turn too fast
What design strategies would you consider to mitigate ML mistakes:
Consider: Human in the loop, Undoable actions, Guardrails, Mistake detection and recovery (monitoring, doer-checker, fail-over, redundancy), Containment and isolation
As a group, post one design idea for each scenario to #lecture and tag all group members.
Likely? Toby Ord predicts existential risk from GAI at 10% within 100 years: Toby Ord, "The Precipice: Existential Risk and the Future of Humanity", 2020
What can possibly go wrong in my system, and what are potential impacts on system requirements?
Risk = Likelihood * Impact
A number of methods:
identify hazards and component fault scenarios through guided inspection of requirements
Required reading: Hulten, Geoff. "Building Intelligent Systems: A Guide to Machine Learning Engineering." (2018), Chapters 17 and 18
Recommended reading: Siebert, Julien, Lisa Joeckel, Jens Heidrich, Koji Nakamichi, Kyoko Ohashi, Isao Namba, Rieko Yamamoto, and Mikio Aoyama. “Towards Guidelines for Assessing Qualities of Machine Learning Systems.” In International Conference on the Quality of Information and Communications Technology, pp. 17–31. Springer, Cham, 2020.
Architectural decisions affect entire systems, not only individual modules
Abstract, different abstractions for different scenarios
Reason about quality attributes early
Make architectural decisions explicit
Question: Did the original architect make poor decisions?
Identify components and their responsibilities
Establishes interfaces and team boundaries
Decomposition enables scaling teams
Each team works on a component
Need to coordinate on interfaces, but implementations remain hidden
Interface descriptions are crucial
Interfaces rarely fully specified in practice, source of conflicts
Separating concerns, understanding interdependencies
Facilitating experimentation, updates with confidence
Separating training and inference and closing the loop
Learn, serve, and observe at scale or with resource limits
Scenario: Component for detecting credit card frauds, as a service for banks
From: Habibullah, Khan Mohammad, Gregory Gay, and Jennifer Horkoff. "Non-Functional Requirements for Machine Learning: An Exploration of System Scope and Interest." arXiv preprint arXiv:2203.11063 (2022).
$f_{W_h,b_h,W_o,b_o}(X) = \phi(W_o \cdot \phi(W_h \cdot X + b_h) + b_o)$
(matrix multiplications interleaved with step function)
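A minimal numpy sketch of this formula with a step function as ϕ; the weights and shapes below are arbitrary illustration values:

```python
import numpy as np

def phi(z):
    return (z > 0).astype(float)           # step activation

def f(X, W_h, b_h, W_o, b_o):
    return phi(W_o @ phi(W_h @ X + b_h) + b_o)

X = np.array([0.5, -1.2, 3.0])             # one input vector (3 features)
W_h, b_h = np.ones((4, 3)), np.zeros(4)    # hidden layer: 3 -> 4
W_o, b_o = np.ones((2, 4)), np.zeros(2)    # output layer: 4 -> 2
print(f(X, W_h, b_h, W_o, b_o))
```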
Consumption | CO2 (lbs) |
---|---|
Air travel, 1 passenger, NY↔SF | 1984 |
Human life, avg, 1 year | 11,023 |
American life, avg, 1 year | 36,156 |
Car, avg incl. fuel, 1 lifetime | 126,000 |
Training one model (GPU) | CO2 (lbs) |
---|---|
NLP pipeline (parsing, SRL) | 39 |
w/ tuning & experimentation | 78,468 |
Transformer (big) | 192 |
w/ neural architecture search | 626,155 |
Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and Policy Considerations for Deep Learning in NLP." In Proc. ACL, pp. 3645-3650. 2019.
Constraints define the space of attributes for valid design solutions
"We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
Amatriain & Basilico. Netflix Recommendations: Beyond the 5 stars, Netflix Technology Blog (2012)
Consider two scenarios:
As a group, post to #lecture, tagging all group members:
- Qualities of interests: ??
- Constraints: ??
- ML algorithm(s) to use: ??
Model inference component as a service
from flask import Flask, request, jsonify
app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = '/tmp/uploads'
detector_model = … # load model…

# inference API that returns JSON with classes
# found in an image
@app.route('/get_objects', methods=['POST'])
def pred():
    uploaded_img = request.files["images"]
    converted_img = … # feature encoding of uploaded img
    result = detector_model(converted_img)
    return jsonify({"response":
        result['detection_class_entities']})
Offline use?
Deployment at scale?
Hardware needs and operating cost?
Frequent updates?
Integration of the model into a system?
Meeting system requirements?
Every system is different!
Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proc. FSE, 2020.
Cloud? Phone? Glasses?
What qualities are relevant for the decision?
As a group, post in #lecture, tagging group members:
Avoid training–serving skew
Based on: Yokoyama, Haruki. "Machine learning system architectural pattern for improving operational stability." In Int'l Conf. Software Architecture Companion, pp. 267-274. IEEE, 2019.
{
"mid": string,
"languageCode": string,
"name": string,
"score": number,
"boundingPoly": {
object (BoundingPoly)
}
}
From Google’s public object detection API.
See also: 🗎 Washizaki, Hironori, Hiromu Uchida, Foutse Khomh, and Yann-Gaël Guéhéneuc. "Machine Learning Architecture and Design Patterns." Draft, 2019; 🗎 Sculley, et al. "Hidden technical debt in machine learning systems." In NeurIPS, 2015.
Discuss how to collect telemetry, the metric to monitor, and how to operationalize
Scenarios:
As a group post to #lecture and tag team members:
- Quality metric:
- Data to collect:
- Operationalization:
Image source: Joel Thomas and Clemens Mewald. Productionizing Machine Learning: From Deployment to Drift Detection. Databricks Blog, 2019
Bernardi, Lucas, et al. "150 successful machine learning models: 6 lessons learned at Booking.com." In Proc. Int'l Conf. Knowledge Discovery & Data Mining, 2019.
From: Kohavi, Ron, Diane Tang, and Ya Xu. "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing." 2020.
if (features.enabled(userId, "one_click_checkout")) {
// new one click checkout function
} else {
// old checkout functionality
}
one_click_checkout shown only to a fraction of users (e.g., 10%), but always the same users; or 50% of beta-users and 90% of developers and 0.1% of all users

def isEnabled(user): Boolean = (hash(user.id) % 100) < 10
Release new version to small percentage of population (like A/B testing)
Automatically roll back if quality measures degrade
Automatically and incrementally increase deployment to 100% otherwise
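A minimal sketch of canary-release logic; the `deploy`, `rollback`, and `quality_degraded` functions are hypothetical stand-ins for the actual deployment and telemetry infrastructure:

```python
import time

# Minimal sketch: release to a small fraction of users, roll back if quality
# measures degrade, otherwise incrementally increase the rollout.
def canary_release(deploy, rollback, quality_degraded,
                   steps=(1, 5, 20, 50, 100), observe_seconds=3600):
    for percent in steps:
        deploy(percent)                # route `percent`% of traffic to the new version
        time.sleep(observe_seconds)    # collect telemetry for a while
        if quality_degraded():         # compare canary metrics against baseline
            rollback()
            return False
    return True                        # fully rolled out
```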
Danger of "silent" mistakes in many phases
Examples?
df['Join_year'] = df.Joined.dropna().map(
lambda x: x.split(',')[1].split(' ')[1])
df.loc[idx_nan_age,'Age'].loc[idx_nan_age] =
df['Title'].loc[idx_nan_age].map(map_means)
df["Weight"].astype(str).astype(int)
from datetime import datetime

def is_valid_row(row):
    try:
        datetime.strptime(row['date'], '%b %d %Y')
        return True
    except ValueError:
        return False

@test
def test_dates(self):
    self.assertTrue(is_valid_row(...))
    self.assertTrue(is_valid_row(...))
    self.assertFalse(is_valid_row(...))
Test larger units of behavior
Often based on use cases or user stories -- customer perspective
@Test void testCleaningWithFeatureEng() {
DataFrame d = loadTestData();
DataFrame cd = clean(d);
DataFrame f = feature3.encode(cd);
assert(noMissingValues(f.getColumn("m")));
assert(max(f.getColumn("m"))<=1.0);
}
Automatic detection of problematic patterns based on code structure
if (user.jobTitle = "manager") {
...
}
function fn() {
x = 1;
return x;
x = 3;
}
#lecture, tagging group members, suggest what tests to implement
QA responsibilities in both roles
CC BY-SA 4.0 Khtan66
Integrate ML artifacts into software release process, unify process (i.e., DevOps extension)
Automated data and model validation (continuous deployment)
Continuous deployment for ML models: from experimenting in notebooks to quick feedback in production
Versioning of models and datasets (more later)
Monitoring in production (discussed earlier)
Further reading: MLOps principles
Linux Foundation AI Initiative
Data cleaning and repairing account for about 60% of the work of data scientists.
Own experience?
Quote: Gil Press. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes Magazine, 2016.
sources of different reliability and quality
Recommended Reading: Gitelman, Lisa, Virginia Jackson, Daniel Rosenberg, Travis D. Williams, Kevin R. Brine, Mary Poovey, Matthew Stanley et al. "Data bite man: The work of sustaining a long-term study." In "Raw Data" Is an Oxymoron, (2013), MIT Press: 147-166.
Accuracy: Reported values (on average) represent real value
Precision: Repeated measurements yield the same result
Accurate, but imprecise: Average over multiple measurements
Inaccurate, but precise: ?
(CC-BY-4.0 by Arbeck)
Detection almost always delayed! Expensive rework. Difficult to detect in offline evaluation.
Sambasivan, N., et al. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proc. CHI (pp. 1-15).
CREATE TABLE employees (
emp_no INT NOT NULL,
birth_date DATE NOT NULL,
name VARCHAR(30) NOT NULL,
PRIMARY KEY (emp_no));
CREATE TABLE departments (
dept_no CHAR(4) NOT NULL,
dept_name VARCHAR(40) NOT NULL,
PRIMARY KEY (dept_no), UNIQUE KEY (dept_name));
CREATE TABLE dept_manager (
dept_no CHAR(4) NOT NULL,
emp_no INT NOT NULL,
FOREIGN KEY (emp_no) REFERENCES employees (emp_no),
FOREIGN KEY (dept_no) REFERENCES departments (dept_no),
PRIMARY KEY (emp_no,dept_no));
Image source: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean - Weakly Supervised Data Repairing.” Blog, 2017.
Concept drift (or concept shift)
Data drift (or covariate shift, distribution shift, or population drift)
Upstream data changes
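A minimal sketch of detecting data drift for a single numeric feature, comparing training-time and recent production distributions with a two-sample KS test; the column names and threshold are illustrative:

```python
from scipy.stats import ks_2samp

# Minimal sketch: a small p-value suggests the production distribution of a
# feature no longer matches the training distribution (data drift).
def drifted(train_values, prod_values, alpha=0.01):
    _, p_value = ks_2samp(train_values, prod_values)
    return p_value < alpha

# if drifted(train_df["age"], last_week_df["age"]): alert / consider retraining
```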
How do we fix these drifts?
What kind of drift might be expected?
As a group, tagging members, write plausible examples in #lecture:
- Concept Drift:
- Data Drift:
- Upstream data changes:
Image source and further readings: Detect data drift (preview) on models deployed to Azure Kubernetes Service (AKS)
"Everyone wants to do the model work, not the data work"
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
Teams rarely document expectations of data quantity or quality
Data quality tests are rare, but some teams adopt defensive monitoring
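A minimal sketch of such defensive monitoring, checking an incoming data batch against documented expectations before it flows downstream; all column names, bounds, and value sets are hypothetical:

```python
import pandas as pd

KNOWN_COUNTRIES = {"US", "CA", "DE"}   # hypothetical documented expectation

# Minimal sketch: fail loudly if an incoming batch violates expectations,
# instead of silently degrading models downstream.
def check_batch(df: pd.DataFrame):
    assert len(df) > 1000, "unexpectedly small batch"
    assert df["user_id"].notna().all(), "missing user ids"
    assert df["age"].between(0, 120).all(), "age out of range"
    assert df["country"].isin(KNOWN_COUNTRIES).all(), "unknown country code"
```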
Several ideas for documenting distributions, including Datasheets and Dataset Nutrition Label
🗎 Gebru, Timnit, et al. "Datasheets for datasets." Communications of the ACM 64, no. 12 (2021).
🗎 Nahar, Nadia, et al. “Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process.” In Proc. ICSE, 2022.
Efficient Algorithms
Faster Machines
More Machines
Photos:
photo_id | user_id | path | upload_date | size | camera_id | camera_setting |
---|---|---|---|---|---|---|
133422131 | 54351 | /st/u211/1U6uFl47Fy.jpg | 2021-12-03T09:18:32.124Z | 5.7 | 663 | ƒ/1.8; 1/120; 4.44mm; ISO271 |
133422132 | 13221 | /st/u11b/MFxlL1FY8V.jpg | 2021-12-03T09:18:32.129Z | 3.1 | 1844 | ƒ/2, 1/15, 3.64mm, ISO1250 |
133422133 | 54351 | /st/x81/ITzhcSmv9s.jpg | 2021-12-03T09:18:32.131Z | 4.8 | 663 | ƒ/1.8; 1/120; 4.44mm; ISO48 |
Users:
user_id | account_name | photos_total | last_login |
---|---|---|---|
54351 | ckaestne | 5124 | 2021-12-08T12:27:48.497Z |
13221 | eva.burk | 3 | 2021-12-21T01:51:54.713Z |
Cameras:
camera_id | manufacturer | print_name |
---|---|---|
663 | Google Pixel 5 | |
1844 | Motorola | Motorola MotoG3 |
select p.photo_id, p.path, u.photos_total
from photos p, users u
where u.user_id=p.user_id and u.account_name = "ckaestne"
Divide data:
Tradeoffs?
Figure based on Christopher Meiklejohn. Dynamic Reduction: Optimizing Service-level Fault Injection Testing With Service Encapsulation. Blog Post 2021
addPhoto(id=133422131, user=54351, path="/st/u211/1U6uFl47Fy.jpg", date="2021-12-03T09:18:32.124Z")
updatePhotoData(id=133422131, user=54351, title="Sunset")
replacePhoto(id=133422131, user=54351, path="/st/x594/vipxBMFlLF.jpg", operation="/filter/palma")
deletePhoto(id=133422131, user=54351)
Trend to store all events in raw form (no consistent schema)
May be useful later
Data storage is comparably cheap
Bet: Yet unknown future value of data is greater than storage costs
As a group, discuss and post in #lecture, tagging group members:
Martínez-Plumed et al. "CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories." IEEE Transactions on Knowledge and Data Engineering (2019).
taming chaos, understand req., plan before coding, remember testing
incremental prototypes, starting with most risky components
working with customers, constant replanning
(Image CC BY-SA 4.0, Lakeworks)
Source: Martin Fowler 2009, https://martinfowler.com/bliki/TechnicalDebtQuadrant.html
As a group in #lecture, tagging members: Post two plausible examples of technical debt in a housing price prediction system:
Sculley, David, et al. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems. 2015.
(Intro to Ethics and Fairness)
In 2015, Shkreli received widespread criticism [...] obtained the manufacturing license for the antiparasitic drug Daraprim and raised its price from USD 13.5 to 750 per pill [...] referred to by the media as "the most hated man in America" and "Pharma Bro". -- Wikipedia
"I could have raised it higher and made more profits for our shareholders. Which is my primary duty." -- Martin Shkreli
What is the (real) organizational objective of the company?
Are these companies intentionally trying to cause harm? If not, what are the root causes of the problem?
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
(Not everybody contributed equally during baking, not everybody is equally hungry)
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Buolamwini & Gebru, ACM FAT* (2018).
Discrimination in Online Ad Delivery, Latanya Sweeney, SSRN (2013).
Data reflects past biases, not intended outcomes
Should the algorithm reflect the reality?
Bias in dataset labels assigned (directly or indirectly) by humans
Example: Hiring decision dataset -- labels assigned by (possibly biased) experts or derived from past (possibly biased) hiring decisions
Bias in how and what data is collected
Crime prediction: Where to analyze crime? What is considered crime? Actually a random/representative sample?
Recall: Raw data is an oxymoron
Features correlate with protected attribute, remain after removal
"Big Data processes codify the past. They do not invent the future. Doing that requires moral imagination, and that’s something only humans can provide. " -- Cathy O'Neil in Weapons of Math Destruction
Scenario: Evaluate applications & identify students who are likely to succeed
Features: GPA, GRE/SAT, gender, race, undergrad institute, alumni connections, household income, hometown, transcript, etc.
As a group, post to #lecture, tagging members:
Source: Federal Reserve’s Survey of Consumer Finances
Key idea: Compare outcomes across two groups
Outcomes matter, not accuracy!
Key idea: Focus on accuracy (not outcomes) across two groups
Accuracy matters, not outcomes!
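A minimal sketch contrasting the two notions on a hypothetical results table with `group`, `actual`, and `predicted_positive` columns:

```python
import pandas as pd

# Minimal sketch: group fairness compares positive-outcome rates across
# groups; equalized-odds-style checks compare error rates instead.
def positive_rate_gap(df):
    rates = df.groupby("group")["predicted_positive"].mean()
    return rates.max() - rates.min()        # outcome disparity

def false_positive_rate_gap(df):
    negatives = df[df["actual"] == 0]
    fpr = negatives.groupby("group")["predicted_positive"].mean()
    return fpr.max() - fpr.min()            # error-rate (accuracy) disparity
```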
In groups, post to #lecture, tagging members:
Research on what most people perceive as fair/just (psychology)
When rewards depend on inputs and participants can choose contributions: Most people find it fair to split rewards proportional to inputs
Most people agree that for a decision to be fair, personal characteristics that do not influence the reward, such as sex or age, should not be considered when dividing the rewards.
🕮 Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane. Big Data and Social Science: Data Science Methods and Tools for Research and Practice. Chapter 11, 2nd ed, 2020
Strong legal precedents
Very limited scope of affirmative action
Most forms of group fairness likely illegal
In practice: Anti-classification
In all pipeline stages:
Fairness-aware Machine Learning, Bennett et al., WSDM Tutorial (2019).
Equality or equity? Equalized odds? ...
Cannot satisfy all. People have conflicting preferences...
Treating everybody equally in a meritocracy will reinforce existing inequalities whereas uplifting disadvantaged communities can be seen as giving unfair advantages to people who contributed less, making it harder to succeed in the advantaged group merely due to group status.
We should stop training radiologists now. It’s just completely obvious that within five years, deep learning is going to do better than radiologists. -- Geoffrey Hinton, 2016
Within organizations usually little institutional support for fairness work, few activists
Fairness issues often raised by communities affected, after harm occurred
Affected groups may need to organize to affect change
Do we place the cost of unfair systems on those already marginalized and disadvantaged?
Assume most universities want to automate admissions decisions.
As a group in #lecture, tagging group members:
What good or bad societal implications can you anticipate, beyond a single product? Should we do something about it?
"Doctor/nurse applying blood pressure monitor" -> "Healthcare worker applying blood pressure monitor"
How to fix?
TV subtitles: Humans check transcripts, especially with heavy dialects
Carefully review data collection procedures, sampling biases, what data is collected, how trustworthy labels are, etc.
Can address most sources of bias: tainted labels, skewed samples, limited features, sample size disparity, proxies:
-> Requirements engineering, system engineering
-> World vs machine, data quality, data cascades
What to do?
Buy-in from management is crucial
Show that fairness work is taken seriously through action (funding, hiring, audits, policies), not just lofty mission statements
Reported success strategies:
Recall: Model cards
Mitchell, Margaret, et al. "Model cards for model reporting." In Proc. FAccT, 220-229. 2019.
Excerpt from a “Data Card” for Google’s Open Images Extended dataset (full data card)
Image: Gong, Yuan, and Christian Poellabauer. "An overview of vulnerabilities of voice controlled systems." arXiv preprint arXiv:1803.09156 (2018).
Goyal, Raman, Gabriel Ferreira, Christian Kästner, and James Herbsleb. "Identifying unusual commits on GitHub." Journal of Software: Evolution and Process 30, no. 1 (2018): e1893.
IF age between 18–20 and sex is male THEN
predict arrest
ELSE IF age between 21–23 and 2–3 prior offenses THEN
predict arrest
ELSE IF more than three priors THEN
predict arrest
ELSE
predict no arrest
Rudin, Cynthia. "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1, no. 5 (2019): 206-215.
Image source (CC BY-NC-ND 4.0): Christin, Angèle. (2017). Algorithms in practice: Comparing web journalism and criminal justice. Big Data & Society. 4.
Debugging is the most common use in practice (Bhatt et al. "Explainable machine learning in deployment." In Proc. FAccT. 2020.)
Levels of explanations:
$f(x) = \alpha + \beta_1 x_1 + ... + \beta_n x_n$
Truthful explanations, easy to understand for humans
Easy to derive contrastive explanation and feature importance
Requires feature selection/regularization to restrict the model to a few important features (e.g., Lasso); possibly also restricting possible parameter values
Can measure how well g fits f with common model quality measures, typically R2
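A minimal sketch of fitting a global surrogate g for a black-box model f and measuring fidelity with R²; the black-box `f` and feature matrix `X` are assumed to exist, and a linear model stands in as the interpretable surrogate:

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Minimal sketch: train an interpretable model g on the *predictions* of the
# black-box f, then report how faithfully g mimics f.
def global_surrogate(f, X):
    y_blackbox = f(X)                   # labels come from the model, not the data
    g = Ridge().fit(X, y_blackbox)
    fidelity = r2_score(y_blackbox, g.predict(X))
    return g, fidelity                  # inspect g.coef_ as the explanation
```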
Advantages? Disadvantages?
Source: Christoph Molnar. "Interpretable Machine Learning." 2019
Source: Christoph Molnar. "Interpretable Machine Learning." 2019
Derive key influence factors or decisions from model parameters
Derive contrastive counterfactuals from models
Examples: Predict arrest for 18 year old male with 1 prior:
IF age between 18–20 and sex is male THEN predict arrest
ELSE IF age between 21–23 and 2–3 prior offenses THEN predict arrest
ELSE IF more than three priors THEN predict arrest
ELSE predict no arrest
Which features were most influential for a specific prediction?
Source: https://github.com/marcotcr/lime
Source: https://github.com/marcotcr/lime
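A minimal sketch of the underlying idea for tabular data (not the lime library's API): perturb the instance, weight samples by proximity, fit a sparse local linear model, and read feature influence off its coefficients; the black-box `f` and the perturbation scale are assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def explain_instance(f, x, num_samples=500, scale=0.1):
    rng = np.random.default_rng(0)
    X_pert = x + rng.normal(0, scale, size=(num_samples, len(x)))  # perturb input
    y = f(X_pert)                                                  # black-box predictions
    weights = np.exp(-np.linalg.norm(X_pert - x, axis=1) ** 2)     # proximity weights
    local_model = Lasso(alpha=0.01).fit(X_pert, y, sample_weight=weights)
    return local_model.coef_                                       # per-feature influence
```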
Source: Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Anchors: High-precision model-agnostic explanations." In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
Often long or multiple explanations
Your loan application has been declined. If your savings account ...
Your loan application has been declined. If your lived in ...
Report all or select "best" (e.g. shortest, most actionable, likely values)
(Rashomon effect)
Source: Christoph Molnar. "Interpretable Machine Learning." 2019
Data debugging: What data most influenced the training?
Source: Christoph Molnar. "Interpretable Machine Learning." 2019
In groups, discuss which explainability approaches may help and why. Tagging group members, write to #lecture.
Algorithm bad at recognizing some signs in some conditions:
Graduate appl. system seems to rank applicants from HBCUs low:
Left Image: CC BY-SA 4.0, Adrian Rosebrock
Tell the user when a lack of data might mean they’ll need to use their own judgment. Don’t be afraid to admit when a lack of data could affect the quality of the AI recommendations.
Source: People + AI Guidebook, Google
Users are less likely to question the model when explanations provided
Danger of overtrust and intentional manipulation
Stumpf, Simone, Adrian Bussone, and Dympna O’sullivan. "Explanations considered harmful? user interactions with machine learning systems." In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 2016.
(a) Rationale, (b) Stating the prediction, (c) Numerical internal values
Observation: Both experts and non-experts overtrust numerical explanations, even when inscrutable.
Ehsan, Upol, Samir Passi, Q. Vera Liao, Larry Chan, I. Lee, Michael Muller, and Mark O. Riedl. "The who in explainable AI: how AI background shapes perceptions of AI explanations." arXiv preprint arXiv:2107.13509 (2021).
Hypotheses:
Rudin, Cynthia. "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1.5 (2019): 206-215. (Preprint)
Eslami, Motahhare, et al. I always assumed that I wasn't really that close to [her]: Reasoning about Invisible Algorithms in News Feeds. In Proc. CHI, 2015.
Does providing an explanation allow customers to 'hack' the system?
"Just a bug, things happen, nothing we could have done"
Results from the 2018 StackOverflow Survey
Assume you are receiving complaints that a child gets many recommendations about R-rated movies
In a group, discuss how you could address this in your own system and post to #lecture, tagging team members:
K.G Orphanides. Children's YouTube is still churning out blood, suicide and cannibalism. Wired UK, 2018; Kristie Bertucci. 16 NSFW Movies Streaming on Netflix. Gadget Reviews, 2020
Scott Chacon and Ben Straub. Pro Git. 2014
dvc add images
dvc run -d images -o model.p cnn.py
dvc remote add myrepo s3://mybucket
dvc push
from verta import Client
client = Client("http://localhost:3000")
proj = client.set_project("My first ModelDB project")
expt = client.set_experiment("Default Experiment")
# log the first run
run = client.set_experiment_run("First Run")
run.log_hyperparameters({"regularization" : 0.5})
run.log_dataset_version("training_and_testing_data", dataset_version)
model1 = # ... model training code goes here
run.log_metric('accuracy', accuracy(model1, validationData))
run.log_model(model1)
# log the second run
run = client.set_experiment_run("Second Run")
run.log_hyperparameters({"regularization" : 0.8})
run.log_dataset_version("training_and_testing_data", dataset_version)
model2 = # ... model training code goes here
run.log_metric('accuracy', accuracy(model2, validationData))
run.log_model(model2)
Key goal: If a customer complains about an interaction, can we reproduce the prediction with the right model? Can we debug the model's pipeline and data? Can we reproduce the model?
<date>,<model>,<model version>,<feature inputs>,<output>
<date>,<model>,<model version>,<feature inputs>,<output>
<date>,<model>,<model version>,<feature inputs>,<output>
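A minimal sketch of such prediction logging; the field names and logger setup are illustrative:

```python
import json
import logging
from datetime import datetime, timezone

prediction_log = logging.getLogger("predictions")

# Minimal sketch: log every prediction with the model version and inputs so
# that individual interactions can be reproduced and debugged later.
def log_prediction(model_name, model_version, features, output):
    prediction_log.info(json.dumps({
        "date": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "model_version": model_version,
        "features": features,
        "output": output,
    }))
```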
Attack at inference time
Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition, Sharif et al. (2016).
From Goodfellow et al (2018). Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7), 56-66.
Inject mislabeled training data to damage model quality
Attacker must have some access to the public or private training set
Example: Anti-virus (AV) scanner: AV company (allegedly) poisoned competitor's model by submitting fake viruses
Insert training data with seemingly correct labels
More targeted than availability attack, cause specific misclassification
Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks, Shafahi et al. (2018)
Singel. Google Catches Bing Copying; Microsoft Says 'So What?'. Wired 2011.
Given a model output (e.g., name of a person), infer the corresponding, potentially sensitive input (facial image of the person)
Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures, M. Fredrikson et al. in CCS (2015).
Recall: Dashcam system from I2/I3
As a group, tagging members, post in #lecture:
A systematic approach to identifying threats (i.e., attacker actions)
Minimize the impact of a compromised component
Monitoring & detection
Andrew Pole, who heads a 60-person team at Target that studies customer behavior, boasted at a conference in 2010 about a proprietary program that could identify women - based on their purchases and demographic profile - who were pregnant.
Lipka. "What Target knows about you". Reuters, 2014
Who has access?
Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "Concrete problems in AI safety." arXiv preprint arXiv:1606.06565 (2016).
Does the model goal align with the system goal? Does the system goal align with the user's goals?
Test model and system quality in production
(see requirements engineering and architecture lectures)
Image: David Silver. Adversarial Traffic Signs. Blog post, 2017
Scenario: Medical use of transcription service, dictate diagnoses and prescriptions
As a group, tagging members, post to #lecture:
- What safety concerns can you anticipate?
- What notion of robustness are you concerned about (i.e., what distance function)?
- How could you use robustness to improve the product (i.e., when/how to check robustness)?
Reliability = absence of defects, mean time between failure
Safety = prevents accidents, harms
Can build safe systems from unreliable components (e.g. redundancy, safeguards)
System may be unsafe despite reliable components (e.g. stronger gas tank causes more severe damage in incident)
Accuracy and robustness are about reliability!
Anticipate problems (hazard analysis, FTA, FMEA, HAZOP, ...)
Anticipate the existence of unanticipated problems
Plan for mistakes, design mitigations
Improve reliability (accuracy, robustness)
Two main strategies:
Most standards require both
(Process and Team Reflections)
Talk: Ryan Orban. Bridging the Gap Between Data Science & Engineering: Building High-Performance Teams. 2016
n(n − 1) / 2 communication links within a team (e.g., 45 links in a 10-person team)
Structural congruence, Geographical congruence, Task congruence, IRC communication congruence
In groups, tagging team members, discuss and post in #lecture:
(1)
Looking back at the semester
(413 slides in 40 min)
(2)
Discussion of future of ML in Production
(3)
Feedback for future semesters
(closing remarks)
see also Andrej Karpathy. Software 2.0. Blog, 2017
Ryohei Fujimaki. AutoML 2.0: Is The Data Scientist Obsolete? Forbes, 2020
However, AutoML does not spell the end of data scientists, as it doesn’t “AutoSelect” a business problem to solve, it doesn’t AutoSelect indicative data, it doesn’t AutoAlign stakeholders, it doesn’t provide AutoEthics in the face of potential bias, it doesn’t provide AutoIntegration with the rest of your product, and it doesn’t provide AutoMarketing after the fact. -- Frederik Bussler
Frederik Bussler. Will AutoML Be the End of Data Scientists?, Blog 2020
(better tools don't replace the knowledge to use them)
This is an education problem, more than a research problem.
Interdisciplinary teams, mutual awareness and understanding
Software engineers and data scientists will each play an essential role
Joint responsibilities, joint processes, joint tools, joint vocabulary