One sign of intelligence is to "know what you know and to know what you do not yet know." In his defense speech at his trial, Socrates said, "The only thing I know is that I don't know anything." According to the Oracle at Delphi, Socrates was "the wisest of the Greeks." Socrates claimed that he was wiser only in knowing the limitations of his knowledge. Most machine learning systems lack Socratic wisdom. They lack introspection, and when they estimate the probability that an answer they give is correct, they generally overestimate it.
In this discussion, judgment and, hopefully, Socratic wisdom are represented by a machine learning system (or subsystem) that attempts to assess whether or not the output of another machine learning system or subsystem is correct. Although more general implementations are possible, this introduction considers the following special case:
The node whose output is to be judged is called the subject node; the node making the judgment is called the judgment node; and the node that integrates the output of the judgment node into the network is called the integration node.
Important Note: The node being judged does not need to be an output node! In a classifier network, an implicit node-specific objective can generally be inferred for any inner-layer node.
The judgment node attempts to estimate, for each datum, the probability that the output of the subject node is correct. Because it estimates a probability, the judgment node has an activation in the range [0, 1]. An extra computation is needed to translate the activation of the judgment node into a "corrected value" of the activation of the subject node, so that it represents a score for the output category associated with the subject node.
The integration node performs this extra computation. The integration node receives two incoming connections, one from the subject node and one from the judgment node. The weights of these connections do not need to be trained. They are pre-determined by the role played by the integration node in computing the corrected value: when the judgment node's activation is 1 (the subject node's output is judged correct), the corrected value should equal the activation of the subject node; when the judgment node's activation is 0 (the output is judged incorrect), the corrected value should equal one minus the activation of the subject node.
The subject node is assumed either to have a sigmoid activation function or to be a member of a set of nodes to which a softmax operation is applied. In either case, the output of the subject node will be in the range [0, 1]. At the extreme values, the conditions above become: the corrected value should be 1 when the subject and judgment activations are both 1 or both 0, and should be 0 when one of them is 1 and the other is 0.
These conditions imply that the weight for the connection from the subject node to the integration node should be -1 and the weight for the connection from the judgment node to the integration node should be +1 (or vice versa): with those weights, the two extreme cases requiring a corrected value of 1 both produce a pre-activation of 0, while the two cases requiring a corrected value of 0 produce pre-activations of -1 and +1. Any function f with f(0) = 1 and f(-1) = f(1) = 0 may therefore be chosen as the activation function for the integration node, for example f(x) = 1 - |x| or f(x) = 1 - x^2.
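As a concrete illustration of the computation just described, here is a minimal Python sketch (the function names are illustrative, not from the original text) using the triangular choice f(x) = 1 - |x| and checking the four extreme-value conditions:

```python
def integration_activation(x):
    # Any f with f(0) = 1 and f(-1) = f(1) = 0 satisfies the extreme-value
    # conditions; the triangular function 1 - |x| is one such choice.
    return 1.0 - abs(x)

def corrected_value(subject_out, judgment_out):
    # Fixed, untrained weights: +1 on the judgment connection and -1 on the
    # subject connection (the symmetric activation makes "vice versa" equivalent).
    pre_activation = judgment_out - subject_out
    return integration_activation(pre_activation)

# The four extreme-value conditions:
assert corrected_value(1.0, 1.0) == 1.0  # output 1 judged correct   -> score 1
assert corrected_value(0.0, 0.0) == 1.0  # output 0 judged incorrect -> score 1
assert corrected_value(1.0, 0.0) == 0.0  # output 1 judged incorrect -> score 0
assert corrected_value(0.0, 1.0) == 0.0  # output 0 judged correct   -> score 0
```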
In the recommended implementation, the subject node continues to receive back propagation from the error cost function of its output relative to the target output for each datum. It is further recommended that, in spite of the connection from the subject node to the integration node, there be no back propagation from the integration node to the subject node during the computation of the partial derivatives of the objective. Back propagation through this connection may cause the subject node to drastically change its behavior. This change in behavior is like an extreme case of the Observer Effect in social science research: observing a social system changes its behavior.
In training a neural network, this change in behavior is undesirable. If, as recommended, the subject node continues to receive back propagation from its error cost function, the subject node retains its interpretability. In addition, both the judgment node and the integration node have easily interpreted roles and objectives. However, if the subject node receives back propagation from the integration node, its behavior will change and its interpretability will be degraded. With that change in the behavior of the subject node, the judgment node and the integration node would also become more difficult to interpret.
Optionally, there may be a direct connection from the subject node to the judgment node. Such a direct connection may accelerate the training and may even reduce the error rate. If this direct connection is present, it is also recommended that there be no back propagation along it.
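In an automatic-differentiation framework, the recommended blocking of back propagation can be realized by detaching the subject node's output. The following PyTorch sketch is one possible realization under the assumptions above (the class name, layer sizes, and feature inputs are hypothetical, not from the original text):

```python
import torch
import torch.nn as nn

class JudgmentAndIntegration(nn.Module):
    """Sketch: a judgment node and an integration node added to a trained network."""

    def __init__(self, feature_dim):
        super().__init__()
        # The judgment node sees features from the existing network plus,
        # optionally, the subject node's own output (the direct connection).
        self.judgment = nn.Linear(feature_dim + 1, 1)

    def forward(self, features, subject_out):
        # detach() blocks back propagation into the subject node, both along
        # the optional direct connection and through the integration node.
        subject_blocked = subject_out.detach()
        judgment_out = torch.sigmoid(
            self.judgment(torch.cat([features, subject_blocked], dim=-1)))
        # Integration node: fixed weights +1 and -1, activation 1 - |x|.
        corrected = 1.0 - torch.abs(judgment_out - subject_blocked)
        return judgment_out, corrected
```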
Because of the violation of the chain rule in the back propagation computation, the computed weight updates no longer constitute an estimate of the gradient of the error cost function of the network that includes the connections from the subject node to the integration node and, optionally, to the judgment node. Thus, the iterative training procedure is no longer an instance of optimization by stochastic gradient descent. However, the optimization properties of stochastic gradient descent assume an optimization problem in which the architecture of the network being optimized is fixed. The procedure of adding a judgment node and an integration node to an existing, trained network is an instance of a different paradigm: Incremental Development. Incremental Development has different optimization criteria.
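Under the recommended regime, a training step might look like the following sketch (continuing the PyTorch sketch above; illustrative, not the authors' code): the subject node keeps its original error cost against the true target, while the judgment node is trained against a correctness target derived from the subject node's own prediction. The 0.5 threshold assumes a binary, sigmoid-output subject node.

```python
bce = nn.BCELoss()

def training_step(subject_out, judgment_out, target):
    # Subject node: unchanged objective against the true target.
    subject_loss = bce(subject_out, target)
    # Judgment node: its target is 1 on data where the subject node's
    # prediction was correct and 0 where it was incorrect.
    was_correct = ((subject_out.detach() > 0.5).float() == target).float()
    judgment_loss = bce(judgment_out, was_correct)
    # Because subject_out was detached inside the module above, only the
    # subject_loss term sends gradients back to the subject node.
    return subject_loss + judgment_loss
```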
An alternative procedure would be simply to add two nodes to the original network, with the same connections as the judgment node and integration node, respectively, and then train the new network using stochastic gradient descent. It is to be expected that this alternative will work as well as stochastic gradient descent on deep neural networks generally does. However, in this procedure the subject node and the new nodes will not necessarily be easy to interpret. In addition, the gradient descent training of the new network may take more computation, because in the recommended procedure the weights to the integration node need no training and the task of the judgment node may be much simpler when the residual number of errors made by the subject node is relatively small.
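For contrast, the alternative procedure corresponds to a sketch like the one below (again continuing the PyTorch sketch above, with hypothetical names): the same two added nodes, but with trainable weights and no blocked gradients, so ordinary end-to-end stochastic gradient descent applies.

```python
class GenericAddedNodes(nn.Module):
    """Sketch of the alternative: same connections, but everything trainable."""

    def __init__(self, feature_dim):
        super().__init__()
        self.node_a = nn.Linear(feature_dim + 1, 1)  # plays the judgment role
        self.node_b = nn.Linear(2, 1)                # plays the integration role

    def forward(self, features, subject_out):
        # No detach() and no fixed weights: gradients flow back into the
        # subject node, so its learned role (and interpretability) may drift.
        a = torch.sigmoid(self.node_a(torch.cat([features, subject_out], dim=-1)))
        b = torch.sigmoid(self.node_b(torch.cat([a, subject_out], dim=-1)))
        return b
```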
Judgment Node Exercise for Students: Build and train a neural network. Then experiment with various implementations of a judgment node. Compare the performance and training time against a conventional network with two added nodes.
by James K Baker and Bradley J Baker