The "knowledge" of a node P is its activation value as a function of the vector-valued input datum variable d: actP(d). For example, if node P has high activation values for one set of data D1 and low activation values for a second set D2, then node P has the knowledge to discriminate D1 from D2. If the sets D1 and D2 can be understood and explained by a human, then the knowledge of node P is said to be "interpretable." Any output node of a classifier that has been successfully trained may be interpreted as discriminating its target category from other data. However, in many large neural networks, it may be very difficult for humans to interpret the activation patterns of most inner layer nodes.
The direct purpose of node-to-node knowledge sharing is to create and propagate interpretability of inner nodes in a deep neural network with the assistance of a cooperative human + AI learning supervisor. Node-to-node knowledge sharing is implemented by means of a special link between a knowledge-providing node P and a knowledge-receiving node R. This knowledge sharing link is unusual because it participates in the back propagation computation of the estimated gradient even though it is not associated with a connection in the network.
Important Note: A knowledge sharing link propagates an interpretation from node P to node R.
A knowledge sharing link from a knowledge-providing node P to a knowledge-receiving node R imposes a local objective on node R, adding a cost whenever the activations of P and R fail to satisfy a specified relationship. In the spirit of human-assisted training, the selection of ordered pairs of nodes to link, the specification of the desired relationship for each pair, and the values of any associated hyperparameters may be determined by a cooperative human + AI learning supervisor system. In the cooperative learning supervisor system, the humans and the AI system have complementary roles. For example, a human is best at verifying that an interpretation is understandable and confirming individual instances of an interpretation. Using knowledge sharing links, the AI system can propagate an interpretation to a multitude of additional nodes in a large network or across a large number of networks.
An example relationship in a knowledge sharing link is that the activations of node P and node R for an input datum d should be equal: actR(d) = actP(d).
Two example cost functions for the "is-equal-to" relationship are the squared difference, cost(d) = (actR(d) - actP(d))^2, and the absolute difference, cost(d) = |actR(d) - actP(d)|.
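For concreteness, the following is a minimal sketch of how such a cost might be computed and folded into training, written in Python with PyTorch (the framework, the weight lambda_ks, and the names act_p, act_r, and task_loss are illustrative assumptions, not part of the method as specified). Because the activation of node R is part of the automatic-differentiation graph, the added cost contributes to the back-propagated gradient at node R even though the knowledge sharing link is not a network connection.

import torch

def equality_sharing_cost(act_p: torch.Tensor, act_r: torch.Tensor) -> torch.Tensor:
    # Squared-difference cost for the "is-equal-to" relationship,
    # averaged over a batch of input data.
    # act_p: activations of the knowledge-providing node P for the batch.
    # act_r: activations of the knowledge-receiving node R for the same batch.
    return ((act_r - act_p) ** 2).mean()

# Folding the knowledge sharing cost into the usual training loss.
# Detaching act_p makes the penalty pull R toward P without disturbing P;
# whether to detach is a design choice for the learning supervisor.
# lambda_ks = 0.1
# total_loss = task_loss + lambda_ks * equality_sharing_cost(act_p.detach(), act_r)
# total_loss.backward()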
Important Note: In human-assisted training, knowledge sharing is controlled by a cooperative human + AI learning supervisor system.
The human + AI learning supervisor may select one or more knowledge receiving nodes to have their tentative interpretations confirmed by a human. The fact that the interpretation of a node can be expressed in words understandable to humans increases the likelihood that the interpretation will also apply to new, unseen data.
The equality relationship is an important tool in transferring knowledge from one network to another. As will be discussed elsewhere, it is also very useful in improving generalization and reducing overfitting in very large networks and ensembles, and in other special applications. However, in transferring knowledge from an output node to an inner-layer node, a relationship other than equality may be useful. More generally, a node in a lower layer may often represent a larger set of data than a node in a higher layer. For example, in a speech recognition system a node R that discriminates vowels from non-vowels will have high activation values for a larger set of data than a node P that specifically identifies the vowel "EE".
If it is desired that node R learn an interpretation that responds to a superset of the data to which node P responds, then the relationship may be expressed with an inequality, actR(d) > actP(d), with a cost function such as cost(d) = Max(0, actP(d) - actR(d)).
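A corresponding sketch of this hinge-style cost, in the same illustrative PyTorch setting as above (the function name and tensor arguments are assumptions made for the example):

import torch

def superset_sharing_cost(act_p: torch.Tensor, act_r: torch.Tensor) -> torch.Tensor:
    # Cost for the relationship actR(d) > actP(d):
    # cost(d) = max(0, actP(d) - actR(d)), averaged over the batch.
    # The cost is zero whenever R's activation already exceeds P's.
    return torch.clamp(act_p - act_r, min=0).mean()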
Another useful relationship is for the activation of the knowledge-receiving node R to be different from the activation of the knowledge-providing node P. This relationship may help to create diversity.
Any of these relationships may be imposed on a selected subset of the data.
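As a brief sketch of how such a restriction might be implemented (continuing the illustrative PyTorch setting; the mask convention is an assumption), the per-datum cost can simply be multiplied by an indicator that is 1 for data in the selected subset and 0 elsewhere:

import torch

def masked_sharing_cost(cost_per_datum: torch.Tensor, subset_mask: torch.Tensor) -> torch.Tensor:
    # cost_per_datum: knowledge sharing cost evaluated separately for each datum in a batch.
    # subset_mask: 1.0 for data on which the relationship is imposed, 0.0 otherwise.
    # The division averages only over the selected data (guarding against an empty subset).
    return (cost_per_datum * subset_mask).sum() / subset_mask.sum().clamp(min=1.0)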
Although the details will not be discussed in this introduction, a knowledge sharing relationship may be imposed between a plurality of k knowledge-providing nodes P1, P2, ..., Pk and a knowledge-receiving node R. As one example, a node R that discriminates vowels from non-vowels should have an activation that is greater than any of the activations of a set of nodes P1, P2, ..., Pk that identify individual vowels. This desired relationship may be represented with a set of pairwise relationships, but may also be represented by a single composite relationship: actR(d) > MAXv(actv(d)), where the maximum is over all nodes v that detect individual vowels.
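A sketch of this composite relationship as a single cost, again in the illustrative PyTorch setting (the tensor shapes are assumptions): the activations of the k knowledge-providing nodes are reduced with a maximum before the hinge cost is applied.

import torch

def composite_superset_cost(provider_acts: torch.Tensor, act_r: torch.Tensor) -> torch.Tensor:
    # provider_acts: shape (batch, k), activations of P1, ..., Pk (e.g., individual-vowel detectors).
    # act_r: shape (batch,), activation of the receiving node R (e.g., a vowel/non-vowel discriminator).
    max_provider, _ = provider_acts.max(dim=1)  # MAXv(actv(d)) for each datum
    # cost(d) = max(0, MAXv(actv(d)) - actR(d)); zero when R already dominates every provider.
    return torch.clamp(max_provider - act_r, min=0).mean()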
Increasing the number of connections in a network increases the capacity of the network to represent complex functions. Challenging classification problems often require very large networks. Neural networks are universal approximators: for any smooth function, a neural network may be constructed that approximates that function to any specified degree of accuracy. For example, it is always possible to construct a neural network that exactly matches the training data, unless there are two or more identical input data vectors with differing target categories. However, a major problem is caused by having a neural network match the training data too well.
As long as there are still errors, it is always possible to add nodes and connections to a neural network so that it can match the training data better. However, the purpose of building and training a neural network classifier is to be able to classify new data that is not included in the training set. Adding too many nodes and arcs to a network will enable it to match accidental patterns in the particular selection of training data, which causes errors on new data.
In contrast, adding a node-to-node knowledge sharing link adds regularization. It decreases the number of degrees of freedom, which tends to reduce the tendency to overfit the training data. Furthermore, it is highly customizable and can be targeted at a specific node. Thus, with regard to overfitting, adding knowledge sharing links has the opposite effect from adding network connections: it decreases overfitting rather than increasing it. This property enables many strategies for finding the optimal trade-off between fitting the training data well and generalizing to new, unseen data, especially when the knowledge sharing propagates human-confirmed interpretations.
Important Note: Knowledge sharing may be used as an alternative to early stopping.
by James K Baker and Bradley J Baker