Required reading:
Building Intelligent Systems: A Guide to Machine Learning Engineering, G. Hulten (2018), Chapter 25: Adversaries and Abuse.
The Top 10 Risks of Machine Learning Security, G. McGraw et al., IEEE Computer (2020).
Learning Goals
Explain key concerns in security (in general and with regard to ML models)
Analyze a system with regard to attacker goals, attack surface, attacker capabilities
Describe common attacks against ML models, including poisoning and evasion attacks
Understand design opportunities to address security threats at the system level
Identify security requirements with threat modeling
Apply key design principles for secure system design
Security
Elements of Security
Security requirements (policies)
What does it mean for my system to be secure?
Threat model
What are the attacker's goal, capability, and incentive?
Attack surface
Which parts of the system are exposed to the attacker?
Protection mechanisms
How do we prevent the attacker from compromising a security requirement?
Security Requirements
"CIA triad" of information security
Confidentiality: Sensitive data must be accessed by authorized users only
Integrity: Sensitive data must be modifiable by authorized users only
Availability: Critical services must be available when needed by clients
Example: College Admission System
Confidentiality, integrity, or availability?
Applications to the program can only be viewed by staff and faculty in the department.
The application site should be able to handle requests on the day of the application deadline.
Application decisions are recorded only by the faculty and staff.
The acceptance notices can only be sent out by the program director.
Other Security Requirements
Authentication (no spoofing): Users are who they say they are
Non-repudiation: Every change can be traced to who was responsible for it
Authorization (no escalation of privilege): Only users with the right permissions can access a resource/perform an action
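To make these requirements concrete, here is a minimal sketch for the admission example, assuming hypothetical role names, function names, and an in-memory audit log: authorization checks enforce who may record decisions, and the log provides non-repudiation by recording who changed what and when.

```python
# Minimal sketch of authorization + non-repudiation for the admission example.
# Role names, functions, and the in-memory audit log are illustrative assumptions.
from datetime import datetime, timezone

ROLE_PERMISSIONS = {
    "faculty": {"view_application", "record_decision"},
    "staff": {"view_application", "record_decision"},
    "program_director": {"view_application", "record_decision", "send_acceptance"},
    "applicant": {"submit_application"},
}

audit_log = []  # non-repudiation: who did what, and when

def authorize(user, role, action):
    """Authorization: only users with the right permissions may act."""
    if action not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"{user} ({role}) may not perform {action}")

def record_decision(user, role, application_id, decision):
    authorize(user, role, "record_decision")   # integrity requirement
    audit_log.append({                         # every change is traceable
        "user": user, "action": "record_decision",
        "application_id": application_id, "decision": decision,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return decision

record_decision("prof_smith", "faculty", "A-1023", "accepted")        # allowed
# record_decision("applicant42", "applicant", "A-1023", "accepted")   # raises PermissionError
```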
Threat Modeling
Why Threat Model?
What is Threat Modeling?
Threat model: A profile of an attacker
Goal: What is the attacker trying to achieve?
Capability:
Knowledge: What does the attacker know?
Actions: What can the attacker do?
Resources: How much effort can the attacker spend?
Incentive: Why does the attacker want to do this?
Attacker Goal
What is the attacker trying to achieve?
Undermine one or more security requirements
Example: College admission
Access other applicants' information without authorization
Modify application status to “accepted”
Cause website shutdown to sabotage other applicants
Attacker Capability
What are the attacker’s actions?
Depends on system boundary & its exposed interfaces
Use an architecture diagram to identify attack surface & actions
Example: College admission
Physical: Break into building & access server
Cyber: Send malicious HTTP requests (e.g., for SQL injection), mount a DoS attack
Social: Send phishing e-mail, bribe an insider for access
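The SQL injection action above is easiest to see in code. Below is a hypothetical sketch (the table, columns, and inputs are made up) of a query built by string concatenation, next to the parameterized version that removes this part of the attack surface.

```python
# Sketch of the SQL injection attack surface (hypothetical schema, sqlite3 for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applications (id TEXT, applicant TEXT, status TEXT)")
conn.execute("INSERT INTO applications VALUES ('A-1', 'alice', 'pending')")

attacker_input = "x' OR '1'='1"   # malicious value sent in an HTTP request

# Vulnerable: the attacker-controlled string is pasted into the query,
# so the OR '1'='1' clause returns every row (confidentiality violation).
vulnerable = f"SELECT * FROM applications WHERE applicant = '{attacker_input}'"
print(conn.execute(vulnerable).fetchall())                 # leaks all applications

# Safer: a parameterized query treats the input as data, not SQL.
safe = "SELECT * FROM applications WHERE applicant = ?"
print(conn.execute(safe, (attacker_input,)).fetchall())    # returns nothing
```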
STRIDE Threat Modeling
A systematic approach to identifying threats & attacker actions
For each component, enumerate & identify potential threats
e.g., Admission Server & DoS: Applicant may flood it with requests
Tool available (Microsoft Threat Modeling Tool)
Limitations:
May end up with a long list of threats, not all of them relevant
False sense of security: STRIDE does not imply completeness!
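Even a lightweight, manual pass is useful: cross every component on the architecture diagram with the six STRIDE categories and record which threats apply. The sketch below does this for the admission example; the component names and threat notes are illustrative assumptions, not output of the Microsoft tool.

```python
# Illustrative STRIDE checklist for the admission example
# (component list and threat notes are assumptions, not tool output).
STRIDE = ["Spoofing", "Tampering", "Repudiation",
          "Information disclosure", "Denial of service", "Elevation of privilege"]

components = ["Application web form", "Admission server", "Applicant database"]

# For each (component, category) pair, ask: can an attacker do this here?
threats = {
    ("Admission server", "Denial of service"): "Applicant floods the server with requests",
    ("Applicant database", "Information disclosure"): "Unauthenticated read of applications",
    ("Application web form", "Spoofing"): "Attacker submits on behalf of another applicant",
}

for component in components:
    print(component)
    for category in STRIDE:
        note = threats.get((component, category), "-- review manually --")
        print(f"  {category}: {note}")
```

The long, partly irrelevant list this produces is exactly the limitation noted above; the value lies in the systematic pass over every pair, not in the list itself.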
Open Web Application Security Project
OWASP: Community-driven source of knowledge & tools for web security
Threat Modeling for ML
ML Attacker Goal
Confidentiality attacks: Exposure of sensitive data
Infer a sensitive label for a data point (e.g., hospital record)
Integrity attacks: Unauthorized modification of data
Induce a model to misclassify data points from one class to another
e.g., Spam filter: Classify a spam message as non-spam
Availability attacks: Disruption to critical services
Reduce the accuracy of a model
Induce a model to misclassify many data points
Attacker Capability
Knowledge: Does the attacker have access to the model?
Training data? Learning algorithm used? Parameters?
Attacker actions:
Training time: Poisoning attacks
Inference time: Evasion attacks, model inversion attacks
Understanding Machine Learning, Bhogavalli (2019)
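To make the inference-time evasion attack concrete, here is a minimal fast gradient sign method (FGSM)-style sketch against a logistic regression model; the synthetic data and the epsilon value are assumptions chosen only so the example runs quickly.

```python
# Minimal FGSM-style evasion sketch against logistic regression
# (synthetic data and epsilon are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

w, b = model.coef_[0], model.intercept_[0]

def fgsm(x, label, eps):
    """Perturb x in the direction that increases the loss for its true label."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # predicted P(y = 1)
    grad = (p - label) * w                   # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad)

x, label = X[0], y[0]
x_adv = fgsm(x, label, eps=0.5)              # small per-feature change, often flips the prediction

print("original prediction:   ", model.predict([x])[0], "(true label:", label, ")")
print("adversarial prediction:", model.predict([x_adv])[0])
```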
Poisoning Attacks: Availability
Availability: Inject mislabeled training data to damage model quality
3% poisoning => 11% decrease in accuracy (Steinhardt, 2017)
Attacker must have some access to the training set
e.g., models trained on public data sets (e.g., ImageNet)
Example: Anti-virus (AV) scanner
Online platform for submission of potentially malicious code
Some AV company (allegedly) poisoned a competitor's model
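Below is a label-flipping sketch of an availability poisoning attack on synthetic data; the dataset, model, and 10% poisoning rate are illustrative assumptions (the 3% => 11% figure above is from Steinhardt et al., not from this toy example).

```python
# Label-flipping (availability) poisoning sketch on synthetic data.
# Dataset, model, and poisoning rate are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

clean_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)

# Attacker with partial access to the training set flips 10% of the labels at random
# (choosing which labels to flip more cleverly is even more damaging).
rng = np.random.default_rng(0)
poison_idx = rng.choice(len(y_train), size=int(0.10 * len(y_train)), replace=False)
y_poisoned = y_train.copy()
y_poisoned[poison_idx] = 1 - y_poisoned[poison_idx]

poisoned_acc = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned).score(X_test, y_test)
print(f"clean accuracy:    {clean_acc:.3f}")
print(f"poisoned accuracy: {poisoned_acc:.3f}")
```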
Poisoning Attacks: Integrity
Insert training data with seemingly correct labels
More targeted than availability attacks
Cause misclassification from one specific class to another
Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks, Shafahi et al. (2018)
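Below is a much-simplified targeted poisoning sketch. Unlike the clean-label, feature-collision attack of Shafahi et al., it uses attacker-chosen (dirty) labels, but it shows the targeted effect: a handful of injected points flip the prediction for one chosen victim example while overall accuracy barely moves. Dataset, model, and all parameters are assumptions.

```python
# Simplified targeted (dirty-label) poisoning sketch -- NOT the clean-label
# feature-collision attack of Shafahi et al.; data and model are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

target = X_test[0]                        # the one victim the attacker cares about
target_label = model.predict([target])[0]
wanted_label = 1 - target_label           # class the attacker wants instead

# Inject 5 near-duplicates of the target, labeled with the attacker's desired class.
rng = np.random.default_rng(0)
poison_X = target + 0.01 * rng.standard_normal((5, X.shape[1]))
poison_y = np.full(5, wanted_label)

X_poisoned = np.vstack([X_train, poison_X])
y_poisoned = np.concatenate([y_train, poison_y])
poisoned_model = KNeighborsClassifier(n_neighbors=5).fit(X_poisoned, y_poisoned)

print("target prediction before:", target_label,
      "after:", poisoned_model.predict([target])[0])
print("overall accuracy before:", round(model.score(X_test, y_test), 3),
      "after:", round(poisoned_model.score(X_test, y_test), 3))
```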
Example: Home Assistant Robot
Dialogue system to interact with family members
Use perception & speech to identify the person
Log & upload interactions; re-train & update models for all robots
Threat modeling to identify security requirements & attacker capabilities
ML-specific attacks on training data, telemetry, or the model
Poisoning attack on training data to influence predictions
Evasion attacks to shape input data to achieve intended predictions (adversarial learning)
Model inversion attacks for privacy violations
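Model inversion itself is hard to show in a few lines, so the sketch below demonstrates a related privacy attack, confidence-based membership inference: an overfit model is more confident on its training points than on unseen points, which leaks whether a given record was in the training set. Dataset, model, and threshold are assumptions.

```python
# Confidence-based membership inference sketch (a privacy attack related to
# model inversion); dataset, model, and threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=3)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=3)

# Deliberately flexible model that tends to memorize its training data.
model = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_train, y_train)

conf_members = model.predict_proba(X_train).max(axis=1)    # confidence on training points
conf_nonmembers = model.predict_proba(X_out).max(axis=1)   # confidence on unseen points

# Attacker guesses "was in the training set" whenever confidence exceeds a threshold.
threshold = 0.95
print(f"flagged as member: {(conf_members > threshold).mean():.2f} of training points, "
      f"{(conf_nonmembers > threshold).mean():.2f} of unseen points")
```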
Security design at the system level
Principle of least privilege
Isolation & compartmentalization
AI can be used for defense (e.g. anomaly detection)
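As an example of the last point, anomaly detection over incoming requests or telemetry can flag traffic that looks nothing like normal use before it reaches the model. The sketch below uses scikit-learn's IsolationForest on made-up telemetry features; the feature meanings and contamination rate are assumptions.

```python
# Anomaly detection as a system-level defense: flag unusual inputs/telemetry
# before acting on them. Features and thresholds are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Normal telemetry: e.g., request rate and payload size per client (made-up features).
normal = rng.normal(loc=[10.0, 2.0], scale=[2.0, 0.5], size=(1000, 2))

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

incoming = np.array([
    [11.0, 2.1],     # looks like normal traffic
    [500.0, 2.0],    # request flood (possible DoS)
    [10.0, 80.0],    # oversized payloads (possible probing or poisoned telemetry)
])
print(detector.predict(incoming))   # 1 = looks normal, -1 = flagged as anomalous
```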
Key takeaway: Adopt a security mindset! Assume all components may be vulnerable in one way or another. Design your system to explicitly reduce the impact of potential attacks.