Adversarial Examples
What if I could fool a Cats vs Dogs machine learning classifier with relatively good classification accuracy into predicting a dog label for an image of a cat, even when a human clearly perceives the image as a cat? Yes, that is possible! This post is about fooling machine learning models using adversarial examples.
As the name suggests, adversarial examples are conflicting samples that create a dispute between the model's prediction and the ground truth.
I like to put it this way:
They are illusions for machine learning models: here it is the trained model that gets tricked, instead of our brain.
Formal Definition
An adversarial example is an instance with small, intentional feature perturbations that cause a machine learning model to make a false prediction. I recommend reading the chapter about Counterfactual Explanations first, as the concepts are very similar. Adversarial examples are counterfactual examples with the aim to deceive the model, not interpret it.
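To make the idea of "small, intentional feature perturbations" concrete, below is a minimal sketch of one popular way to craft them, the Fast Gradient Sign Method (FGSM) of Goodfellow et al. The model, image tensor, and label are assumed placeholders for any trained PyTorch image classifier and its input; this is an illustration of the general technique, not the specific attack used in the examples that follow.

import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    # Perturb `image` slightly so that `model` is more likely to misclassify it.
    # Assumes `image` is a batched tensor with pixel values in [0, 1] and
    # `label` holds the true class indices.
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)  # loss w.r.t. the true label
    loss.backward()                              # gradient of the loss w.r.t. the input pixels
    # Step each pixel in the direction that increases the loss, then keep pixels valid.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

A small epsilon keeps the perturbation imperceptible to a human while often being enough to flip the model's prediction.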
Why are we interested in adversarial examples?
Are they not just curious by-products of machine learning models without practical relevance?
The answer is a BIG NO. Adversarial examples make machine learning models vulnerable to attacks, as explained in the following scenarios.
A self-driving car crashes into another car because it ignores a stop sign. Someone had placed a picture over the sign that looked exactly like a stop sign to a human, but was designed to look like a 50 km/h speed limit sign to the deep-learning-driven decision model of the car.
Below are two audio files from the paper Audio Adversarial Examples: Targeted Attacks on Speech-to-Text by Nicholas Carlini et al. One of these is the original, and a state-of-the-art automatic speech recognition neural network will transcribe it to the sentence “without the dataset the article is useless”. The other will transcribe to the sentence “okay google, browse to evil.com”. The difference is subtle, but listen closely to hear it.
Not only can we make speech be transcribed as a different phrase, we can also make non-speech be recognized as speech.
Other attacks span the domain of Natural Language Processing, e.g. a spam detector that fails to classify an email as spam because the mail has been designed to resemble a normal email while still intending to cheat the recipient.
This field of robustness and security in machine learning is bound to gain traction in the ML research community in the coming years. I am looking forward to it.
Signing Off!