Knowledge sharing is a powerful tool for increasing the interpretability of neural networks as well as for highly customizable regularization. Multiple examples of its use for regularization will be discussed in other sections and in other parts of this website. This section discusses its use for improving interpretability in imitation learning.
Knowledge sharing is implemented with a virtual link from a knowledge providing node to a knowledge receiving node. The virtual link imposes a local regularization on the knowledge receiving node. Generally, the regularization enforces a specified relationship between the activation value of the knowledge providing node and the activation value of the knowledge receiving node, with the regularization enforced on a specified set of data.
A knowledge sharing link is not a connection. It is only used during training. It is not part of the finished trained network, except perhaps if there is adaptive training or other on-going continual learning. Any two nodes may be linked. They may be in the same layer. The two nodes may be in separate networks. Within a single network, typically every link goes from a knowledge providing node to a knowledge receiving node that is in the same or a lower numbered layer, though the reverse paradigm may also be used. With some restrictions, there may be a mixture of links that go from higher numbered layers to lower numbered layers and links that go the other way. Although it is preferable to avoid creating a cycle of knowledge sharing links, it is not prohibited. Between two nodes in the same layer, there may be a link in each direction. Between a node in one network and a node in another network, there is no restriction on the relative position of the nodes and there may be links in both directions between a pair of nodes.
When used to improve interpretability, a knowledge sharing link may be created from a node with a known interpretation to one or more other nodes. In any neural network classifier trained by supervised training, the interpretation of each output node is its target category, so any output node is interpretable. Knowledge sharing may be used to create additional interpretable nodes. A human or a human + AI learning management system may confirm interpretations of inner nodes that occur naturally during training or that are created deliberately by knowledge sharing.
A knowledge sharing link imposes a regularization on the knowledge receiving node to satisfy a specified relationship on a specified set of data. For propagating an interpretation, the most common target relationship is for the activations of the two nodes to be equal, to the extent that the regularized training of the knowledge receiving node can achieve that goal. The name "knowledge sharing" comes from the fact that an interpretable knowledge providing node shares its knowledge of the desired activation level for each selected datum, consistent with the interpretation that the knowledge providing node already knows and that the knowledge receiving node is trying to learn. The equality relationship is not forced; it is only encouraged by the regularization cost function, with an intensity that is controlled by a hyperparameter that may be dynamically controlled by a human + AI learning management system. In some cases, an inequality relationship or an is-not-equal relationship may be specified.
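The regularization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not specify the form of the cost function, so a squared difference is assumed here, and the function name, the `relation` labels, and the `strength` hyperparameter name are hypothetical. The is-not-equal relationship is left out because the text gives no hint about how it would be penalized.

```python
from typing import Sequence


def knowledge_sharing_penalty(
    provider: Sequence[float],
    receiver: Sequence[float],
    selected: Sequence[bool],
    strength: float,
    relation: str = "equal",
) -> float:
    """Penalty contributed by one knowledge sharing link.

    provider / receiver: activations of the two nodes, one value per datum.
    selected: True where the link applies (the "specified set of data").
    strength: hyperparameter controlling the intensity of the regularization.
    relation (hypothetical labels):
        "equal"    penalize any difference between the two activations
        "at_least" penalize only when the receiver is below the provider
        "at_most"  penalize only when the receiver is above the provider
    """
    total, count = 0.0, 0
    for p, r, use in zip(provider, receiver, selected):
        if not use:
            continue  # the link is inactive on this datum
        diff = r - p
        if relation == "at_least":
            diff = min(diff, 0.0)
        elif relation == "at_most":
            diff = max(diff, 0.0)
        elif relation != "equal":
            raise ValueError(f"unknown relation: {relation}")
        total += diff * diff  # assumed squared-difference cost
        count += 1
    # Mean over the selected data, scaled by the intensity hyperparameter.
    return strength * total / count if count else 0.0


penalty = knowledge_sharing_penalty(
    provider=[1.0, 0.0, 0.5],
    receiver=[0.5, 0.0, 0.9],
    selected=[True, True, False],
    strength=2.0,
)
```

The penalty would be added to the training objective of the network that contains the knowledge receiving node. Because the provider's activation is treated as a constant target, the link only influences the receiving node, and lowering `strength` toward zero makes the relationship progressively less binding.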
In standard imitation learning, the network to be imitated is fixed and only the imitating network is being trained, so any knowledge sharing link between the two networks goes from the reference network being imitated to the network being trained to imitate it.
Training a very deep neural network from scratch often requires an inordinate amount of computation. Transfer learning is one technique for reducing the amount of computation. In transfer learning, the new network is initialized to be the same as a network that has already been trained on a similar task. Thus, transfer learning has some characteristics in common with imitation learning, but there are significant differences. In comparison to imitation learning with knowledge sharing, transfer learning has three major limitations:
In contrast, an imitating network may have an arbitrary architecture. It may be either larger or smaller than the reference network. The knowledge sharing regularization continues throughout the training process. Knowledge sharing can not only maintain any interpretability present in the reference network, it can also help propagate interpretations to additional nodes. In addition, by providing additional, on-going transfer of knowledge from the reference network, knowledge sharing may extend the benefit of computation reduction beyond the initialization.
In many cases, the only interpretable nodes in the reference network may be the output nodes. Even in this case, knowledge sharing can be very useful not only in increasing interpretability but also in facilitating the training of very deep networks. One example is to have the new network be a copy of the reference network with additional nodes added in many layers throughout the network. In this example, the weights for connections from the nodes copied from the reference network may be initialized by transfer from the reference network, and the weights for out-going connections from new nodes may be initialized to zero. Thus the new network initially matches the performance of the reference network. However, each new node may have a knowledge sharing link from an interpretable node in the reference network. In particular, these knowledge sharing links will cause new nodes in inner layers to imitate, as well as they can, the output nodes in the reference network. These output imitators in inner layers of the new network will reduce some of the problems encountered in training a very deep neural network. In particular, the regularization from the knowledge sharing links will prevent the gradient from vanishing unless the imitation is perfect. Of course, having an inner layer node perfectly imitate an output node is not a problem. It essentially reduces the task of training the new network to the training of a less deep network.
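The initialization in this example can be illustrated with a toy network. The sketch below is an assumption-laden miniature, not the author's implementation: it uses a single hidden layer of tanh nodes and one linear output node, and the names `forward`, `widen`, and `strength` are hypothetical. It shows that zero out-going weights leave the output unchanged, and that a squared-difference link penalty (also an assumption) still supplies a nonzero gradient to the new node.

```python
import math


def forward(x, w_in, w_out):
    """One hidden layer of tanh nodes followed by a single linear output node."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    output = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, output


def widen(w_in, w_out, new_rows):
    """Copy the reference weights and append new hidden nodes.

    new_rows holds the incoming weights of the added nodes. Their out-going
    weights start at zero, so the widened network initially computes the
    same output as the reference network.
    """
    wide_in = [list(row) for row in w_in] + [list(row) for row in new_rows]
    wide_out = list(w_out) + [0.0] * len(new_rows)
    return wide_in, wide_out


# Reference network with made-up weights, and a widened copy with one new node.
ref_w_in = [[0.5, -1.0], [1.5, 0.25]]
ref_w_out = [1.0, -2.0]
new_w_in, new_w_out = widen(ref_w_in, ref_w_out, new_rows=[[0.3, 0.7]])

x = [0.2, -0.4]
_, ref_output = forward(x, ref_w_in, ref_w_out)
new_hidden, new_output = forward(x, new_w_in, new_w_out)

# Ordinary backpropagation reaches the new node's activation through its
# out-going weight, which is zero at initialization.
backprop_factor = new_w_out[-1]

# Gradient of an assumed squared-difference link penalty with respect to the
# new node's activation, with the reference output node as knowledge provider.
strength = 0.1
link_grad = 2.0 * strength * (new_hidden[-1] - ref_output)
```

In this toy case the new node gets no training signal from the ordinary objective at the first step, because its out-going weight is zero, while the link penalty gives it a signal that only disappears when its activation equals that of the reference output node.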
Interpretability of inner layer nodes may be further enhanced by having additional networks called "companion networks". A companion network may be a simpler, smaller, more easily interpreted network. For example, in image recognition, there may be a catalog of trained networks that recognize a wide variety of objects, including objects that may be parts of objects in the images to be recognized. A set of is-a-part-of relationships is called a mereology. With a catalog of networks trained to recognize elements of mereologies, the main network may have a set of companion networks that recognize some of the elements of some of the output targets for the main network. In this case, there may be a knowledge sharing link from an interpreted node in one of the companion networks. The interpreted node may, for example, be an output node of the companion network. The knowledge sharing link to an inner node in the main network will regularize that inner node to help it learn the interpretation. Training such a node to have such an interpretation not only helps make the main network easier to interpret; training an inner node to detect a known element of the mereology of the target category may also facilitate the training of the main network and may even improve its performance.
Since knowledge sharing links may be imposed selectively for a specified set of data, a knowledge sharing link from a companion network may be restricted to only apply on data for which the companion network detects an instance of its target object with a specified degree of confidence. Thus, an inner node in the main network may be the knowledge receiving node from a plurality of companion networks. The links from multiple companion networks will tend to be active simultaneously only when the input matches both targets. Thus, the inner node in the main network is regularized to detect elements from multiple mereologies and even to recognize when the input matches more than one.
While a deep network is being trained and many of the connection weights are still very different from their final values, the derivative of the objective with respect to a connection weight in a layer that is far from both the input and the output provides very little information about what the final value of that connection weight will need to be. This problem may be reduced by having connections that skip layers, giving middle layers a shorter chain of connections to the input or the output. Adding such connections increases the number of learned parameters, which means that a greater degree of regularization will be required. Unfortunately, ordinary regularization is often inadequate even for networks of moderate depth, as evidenced by the fact that early stopping of the iterative training is often necessary. The need for early stopping is an indication that convergence to the optimum of the objective, which includes the regularization, is not actually desirable. Adding extra connections doesn't help. It just makes matters worse.
Node-to-node knowledge sharing links are a better way to give guidance to middle layer nodes and their connections. A vertical knowledge sharing link that skips many layers can share knowledge from a node near the output to a middle layer node. Unlike a layer-skipping connection, a vertical knowledge sharing link increases the amount of regularization and does not increase the number of learned parameters.
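The selective application of links from several companion networks might look like the following sketch. It assumes that each companion network's output activation doubles as its confidence, that a link is active on a datum when that confidence reaches a threshold, and that the cost is a squared difference; the function name and all of the numbers are hypothetical.

```python
def gated_link_penalty(receiver, companions, threshold, strength):
    """Total penalty on one receiving node from confidence-gated links.

    receiver: activation of the inner node in the main network, per datum.
    companions: one list of output activations per companion network,
        aligned with receiver (assumed to act as confidence scores).
    threshold: minimum confidence for a companion's link to be active.
    Returns the penalty and, per datum, how many links were active.
    """
    total = 0.0
    active_counts = []
    for i, r in enumerate(receiver):
        # Only companions that are confident on this datum share knowledge.
        active = [c[i] for c in companions if c[i] >= threshold]
        active_counts.append(len(active))
        total += sum((r - a) ** 2 for a in active)
    return strength * total, active_counts


receiver = [0.9, 0.2, 0.8]
companions = [
    [0.95, 0.10, 0.90],  # first companion network (made-up activations)
    [0.30, 0.05, 0.85],  # second companion network (made-up activations)
]
penalty, active_counts = gated_link_penalty(
    receiver, companions, threshold=0.8, strength=1.0
)
```

In this made-up data, only the first companion is confident on the first datum, neither is confident on the second, and both are confident on the third, so the third datum is the only one on which both links regularize the receiving node at once.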
Node-to-node knowledge sharing links are also very flexible and highly customizable. As one example, consider the case of a middle layer node N1 for which there is a high value of the magnitude of the evaluation of a potential connection from that middle layer node N1 to a node N2 in a much higher layer. If new connections were being added, that new connection would have a high priority. Rather than making that connection, a knowledge sharing link could be created from the higher layer node to a node N3 to which the middle layer node N1 is already connected. If the weights in lower and higher layers are still being changed, node N3 will be able to learn to imitate N2 as well as it can while also continuing to learn from the regular backpropagation. The potential number of knowledge sharing links grows like the square of the number of nodes, so the larger the network, the greater their potential influence.
by James K Baker