Large neural networks performing image classification are notoriously vulnerable to adversarial attacks that cause a classifier to change its answer due to a minor change in the input pattern. Two images that are indistinguishable to a human viewer may be classified differently, with one classification correct and the other classification totally unrelated to either image.
However, for at least one form of adversarial attack, the problem is specific neither to image classification nor to large neural networks. In this form of adversarial attack on an image classifier (called an incremental differential attack), the attacker starts with a known classifier and with an image that is classified correctly. A published classifier may be used, or the attacker may develop a neural network classifier of its own, intended to approximate an unpublished classifier. The attacker's classifier may also approximate a classifier that is not a neural network. Using the known classifier, the attacker performs gradient descent much like the gradient descent computation used in training neural networks, but with a different objective function. The objective is to lower the activation of the output node that corresponds to the correct answer and/or to increase the score of one or more selected output nodes that correspond to wrong answers. Unlike in training, the back propagation is not used to update the connection weights in the network. Instead, the back propagation is extended to the input variables, and the gradient descent updates the values of the input variables. That is, the back propagation is used to change the image itself by small incremental amounts, as in the sketch below. Selecting a subset of wrong answers allows the attacker to make a separate attack for any selected subset of the set of wrong answers. Thus, with this method, there are exponentially many different adversarial attacks possible on any single image.
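The following is a minimal sketch of such an incremental differential attack, assuming the known classifier is a PyTorch model that maps an image tensor to output activations; the function name, step size, and iteration count are illustrative choices rather than part of the description above.

    import torch

    def incremental_differential_attack(model, image, correct_label, wrong_labels,
                                         step_size=1e-3, num_steps=100):
        # The classifier's weights are left untouched; only the input image is modified.
        x = image.clone().detach().requires_grad_(True)
        for _ in range(num_steps):
            logits = model(x.unsqueeze(0)).squeeze(0)   # forward pass through the known classifier
            # Objective: lower the activation of the correct output node and
            # raise the activations of the selected wrong-answer output nodes.
            objective = logits[correct_label] - logits[wrong_labels].sum()
            objective.backward()                        # back propagation extended to the input
            with torch.no_grad():
                x -= step_size * x.grad                 # gradient descent on the input variables
                x.clamp_(0.0, 1.0)                      # keep pixel values in a valid range
            x.grad.zero_()
        return x.detach()

Choosing a different subset for wrong_labels yields a different attack on the same image, which is what makes exponentially many distinct attacks possible.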
An incremental differential attack produces an error once the change is large enough that the output activation for a wrong answer exceeds the output activation of the correct answer. The attack succeeds against any smoothly differentiable function defined on a high-dimensional space, not just against deep neural networks. Let ε be a small number such that a change in an input variable by ε is not noticeable. For example, in a digital image, the value of ε might be one-half the size of a quantization level. One strategy for an attacker is to change each input variable by ε times the sign of the derivative of the objective with respect to that input variable, as in the sketch below. This strategy creates a new image that is indistinguishable from the original. To first order, it changes the value of the objective by ε times the sum, over all input variables, of the magnitudes of the partial derivatives of the objective. The new image will be misclassified if this sum of magnitudes exceeds 1.0.
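A minimal sketch of this one-step strategy, again assuming a PyTorch classifier; epsilon plays the role of ε above, the objective is taken to be the correct-class activation, and the helper name is illustrative.

    import torch

    def sign_step_attack(model, image, correct_label, epsilon):
        # Change each input variable by epsilon times the sign of the derivative
        # of the objective with respect to that variable.
        x = image.clone().detach().requires_grad_(True)
        logits = model(x.unsqueeze(0)).squeeze(0)
        objective = logits[correct_label]       # activation of the correct output node
        objective.backward()                    # derivatives with respect to the input variables
        with torch.no_grad():
            # Stepping against the sign of each derivative lowers the objective by
            # approximately epsilon times the sum of the magnitudes of the derivatives.
            x_adv = (x - epsilon * x.grad.sign()).clamp(0.0, 1.0)
        return x_adv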
Even very intelligent people sometimes make stupid mistakes. An aspect of intelligence is the ability to recognize a stupid mistake. Even the best machine learning systems may make a mistake that would cause a person to say: "That's stupid! Nobody would make that mistake." More worrisome are the facts that:
The architecture of a network and certain choices of activation functions, for example as in (1), (2), and (3) above, may reduce the sensitivity to adversarial attacks and the incidence of stupid mistakes. The use of judgment nodes (4) helps to detect errors, including stupid mistakes. With imitation learning (5), a system developer or a human + AI learning management system may start with a conventional network and then develop a separate system with comparable or better performance while incorporating these architectural features to give the imitation system sensibility.
by James K Baker and Bradley J Baker