Let J be the loss function for the objective of a differentiable network, such as a layered deep neural network. Let $d_t$, for $t = 1, \dots, T$, be a set of training data. The derivatives of J with respect to the connection weights may be computed for each training datum $d_t$:
$$\frac{\partial J(d_t)}{\partial w_{j,k}} = a(d_t)_j \, \frac{\partial J(d_t)}{\partial x(d_t)_k} \tag{1}$$
where $a(d_t)_j$ is the activation of node j for datum $d_t$, $\partial J(d_t)/\partial x(d_t)_k$ is the derivative of J with respect to the input to node k for datum $d_t$, and $w_{j,k}$ is the weight of the connection from node j to node k. This expression is very familiar. It is the core computation in deep learning. In training a neural network by stochastic gradient descent, expression (1) is summed over all $d_t$ in each mini-batch and multiplied by the learning rate to compute the update for each connection weight in the network. The update of the bias of each node may also use this expression, if node j is taken to be a special node that always has activation 1.0.
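As a minimal sketch of this accumulation, the following plain-Python fragment sums expression (1) over a mini-batch and scales the result by the learning rate. The function names, the batch layout, and the numbers are illustrative assumptions; the derivatives with respect to the node inputs are taken as given rather than computed by back propagation.

```python
def weight_gradient(activations, input_derivs):
    """Expression (1) for one datum: grad[j][k] = a_j * dJ/dx_k."""
    return [[a_j * d_k for d_k in input_derivs] for a_j in activations]


def minibatch_update(batch, learning_rate):
    """Sum expression (1) over a mini-batch and scale by the learning rate.

    `batch` is a list of (activations, input_derivs) pairs, one per datum.
    The result is the step to subtract from each weight w[j][k].
    """
    n_j = len(batch[0][0])
    n_k = len(batch[0][1])
    total = [[0.0] * n_k for _ in range(n_j)]
    for activations, input_derivs in batch:
        grad = weight_gradient(activations, input_derivs)
        for j in range(n_j):
            for k in range(n_k):
                total[j][k] += grad[j][k]
    return [[learning_rate * g for g in row] for row in total]


# Hypothetical mini-batch of two data. The last source node is the special
# bias node whose activation is always 1.0; there is one destination node.
batch = [
    ([0.5, -1.0, 1.0], [0.2]),
    ([1.5, 2.0, 1.0], [-0.4]),
]
step = minibatch_update(batch, learning_rate=0.1)
```

The last row of `step` is the bias update, obtained from the same expression with the constant activation 1.0.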
In training a feedforward, layered neural network by iterative stochastic gradient descent based on mini-batches, expression (1) is computed for each connection <j, k>, for each datum $d_t$, for each epoch of training. The evaluation and accumulation of expression (1) is the inner loop of the computation.
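However, expression (1) can be used for much more than the weight update in stochastic gradient descent training of neural network connections. For example, having no connection from node j to node k is equivalent to having a connection with weight zero, so expression (1) can be evaluated even for a pair of nodes with no connection between them, as if the pair were joined by a connection whose weight is frozen at zero.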
Freezing weights, even at non-zero values, reduces the number of degrees of freedom, which helps to prevent overtraining or overfitting. If expression (1) has a large magnitude for a frozen weight, thawing that weight can help the training process escape from the vicinity of a stationary point. The system may even be able to escape from a global minimum of the network with the weight frozen, because the system with the weight thawed may have greater representation capacity. For this purpose, it is useful to keep a relatively large fraction of the weights in a network frozen, which may require active management. The decisions to freeze or thaw connection weights may be managed by a human + AI learning management system. A learning management system can monitor a large number of connections and also monitor the progress of the training process. Thus, the learning management system can make dynamic decisions that steer the training process around stationary points and also help control the number of degrees of freedom, so that the degree of overfitting at convergence is reduced. This strategy may enable larger networks to be trained successfully. It may also reduce the need for early stopping.
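One way to picture the freeze and thaw decision is the sketch below. It assumes that expression (1), summed over the data, is already available per connection in a dictionary. The threshold rule and all names are hypothetical stand-ins for whatever policy the learning management system actually applies.

```python
def masked_update(weights, grads, frozen, learning_rate):
    """Gradient step that leaves every frozen weight unchanged."""
    return {
        conn: w if conn in frozen else w - learning_rate * grads[conn]
        for conn, w in weights.items()
    }


def thaw_candidates(grads, frozen, threshold):
    """Frozen connections whose |expression (1)| exceeds the threshold,
    ordered from largest magnitude to smallest."""
    hits = [c for c in frozen if abs(grads[c]) > threshold]
    return sorted(hits, key=lambda c: abs(grads[c]), reverse=True)


# Hypothetical state: connections are (source, destination) pairs.
weights = {("a", "c"): 0.3, ("b", "c"): -0.7, ("a", "d"): 0.0}
grads = {("a", "c"): 0.05, ("b", "c"): -0.9, ("a", "d"): 0.4}
frozen = {("b", "c"), ("a", "d")}

# Assumed policy: thaw any frozen weight with gradient magnitude above 0.5.
to_thaw = thaw_candidates(grads, frozen, threshold=0.5)
frozen_after = frozen - set(to_thaw)
new_weights = masked_update(weights, grads, frozen_after, learning_rate=0.1)
```

Here only the connection from b to c is thawed and updated; the connection from a to d stays frozen at zero even though its gradient is non-zero.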
Any new connection can safely be added to the network with its weight initialized to zero, so the expanded network initially matches the performance of the original network. Choosing a new connection for which expression (1) has a large magnitude means that the loss is initially steep in the direction of the new weight, so subsequent gradient descent training can improve on the prior performance more rapidly.
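The selection of a new connection can be sketched on a toy network. The example below assumes a four-node network with a tanh hidden node, a linear output node, and a squared-error loss; these choices, the node names, and the data are illustrative assumptions only. It evaluates expression (1) for three connections that do not yet exist, picks the one with the largest magnitude, and adds it with weight zero.

```python
import math

ORDER = ["i0", "i1", "h", "out"]   # nodes in topological order
INPUTS = {"i0", "i1"}


def forward(weights, datum):
    """Activation of every node: tanh hidden node, linear output node."""
    a = {}
    for node in ORDER:
        if node in INPUTS:
            a[node] = datum[node]
        else:
            x = sum(w * a[j] for (j, k), w in weights.items() if k == node)
            a[node] = x if node == "out" else math.tanh(x)
    return a


def input_derivs(weights, a, target):
    """dJ/dx_k for each non-input node, with J = 0.5 * (a_out - target)**2."""
    d = {"out": a["out"] - target}
    for node in reversed(ORDER[:-1]):
        if node in INPUTS:
            continue
        back = sum(w * d[k] for (j, k), w in weights.items() if j == node)
        d[node] = (1.0 - a[node] ** 2) * back
    return d


def score_candidates(weights, data, candidates):
    """Expression (1), summed over the data, for connections not yet present."""
    totals = {c: 0.0 for c in candidates}
    for datum, target in data:
        a = forward(weights, datum)
        d = input_derivs(weights, a, target)
        for j, k in candidates:
            totals[(j, k)] += a[j] * d[k]
    return totals


# Input i1 is initially unconnected, yet the targets depend on it.
weights = {("i0", "h"): 1.0, ("h", "out"): 0.5}
data = [({"i0": 1.0, "i1": 1.0}, 2.0), ({"i0": 1.0, "i1": -1.0}, -2.0)]
candidates = [("i0", "out"), ("i1", "out"), ("i1", "h")]

scores = score_candidates(weights, data, candidates)
best = max(scores, key=lambda c: abs(scores[c]))
expanded = dict(weights)
expanded[best] = 0.0   # zero weight: outputs are unchanged at first
```

With these numbers the direct connection from i1 to the output scores highest, and `forward(expanded, ...)` returns the same outputs as `forward(weights, ...)` until training moves the new weight away from zero.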
Using expression (1) to find useful connections between a network N1 and a network N2 allows the two networks to be connected and gradually integrated. The two networks N1 and N2 do not need to have been trained on the same data. N1 and N2 may even have different objectives. The combined network may be trained to meet either objective as its sole objective or to meet both objectives. The task of trying to meet both objectives will have a regularizing effect, which will help reduce overfitting. At the same time, because the new cross-network connections start with weight zero, the initial performance of the combined network will match the individual performance of each network on its assigned task.
The two networks may share input variables. On the other hand, each network may have been trained on a subset of the input variables and the subsets may be different. The subsets may be disjoint or they may overlap. The set of input variables for the combined network will be the union of the sets of input variables for the two networks.
In the merged network, the outputs of the two networks may be kept separate, with the vector of output variables being the concatenation of the output vectors of the two networks being merged. This method may be used if the two networks have distinct outputs with separate objectives.
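The bookkeeping for this kind of merge can be sketched as follows. Each network is represented here by a hypothetical record holding the names of its input variables and a stand-in linear function; real trained networks would take the place of `make_linear`, and the variable names are invented for illustration.

```python
def make_linear(weight_rows):
    """Stand-in for a trained network: one linear output per weight row."""
    return lambda xs: [sum(w * x for w, x in zip(row, xs)) for row in weight_rows]


# Hypothetical networks trained on overlapping subsets of the input variables.
net1 = {"inputs": ["age", "height"], "fn": make_linear([[1.0, 2.0]])}
net2 = {"inputs": ["height", "weight"],
        "fn": make_linear([[0.5, -1.0], [0.0, 3.0]])}

# The combined network reads the union of the two sets of input variables.
merged_inputs = sorted(set(net1["inputs"]) | set(net2["inputs"]))


def merged_forward(datum):
    """Run each network on its own input subset and concatenate the outputs."""
    out1 = net1["fn"]([datum[name] for name in net1["inputs"]])
    out2 = net2["fn"]([datum[name] for name in net2["inputs"]])
    return out1 + out2


datum = {"age": 1.0, "height": 2.0, "weight": 3.0}
merged_output = merged_forward(datum)
```

Before any cross-network connection has a non-zero weight, each segment of `merged_output` is exactly what the corresponding network would have produced on its own.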
If the two networks are classifiers, then the set of output categories for the combined network may be the union of the two sets of output categories. The two sets of output categories may be the same, may be different but overlap, or may be disjoint. In the combined network, there would be one output node for each category in the union of the two sets of categories. For each category that is not in the overlap set, new connections may be added from the opposite network with weights initialized to zero.
For each category in the overlap set, there is an issue of how to combine the activations coming from the two networks. One implementation is to treat one of the two networks as the primary network and to initialize the weights of connections from the other network to zero. This implementation or a variation of it may be used if one network, say N2, is used to add a single new category or a small number of new categories to the other network.
Other implementations may treat the two networks more symmetrically. For example, if the networks use a softmax operation to compute the output, the pre-softmax weighted sum may initially simply use the connection weights from the respective networks. If the overlap set of categories is much smaller than the set of non-overlap categories, this simple combining method may effectively double the scores of the categories in the overlap set relative to the other categories, if the scores from the two networks are highly correlated. In such a case, the scores of the categories in the overlap set may initially be multiplied by 0.5 or by some other value specified by a hyperparameter. As the weights of the new connections get trained to be non-zero, this hyperparameter may be increased toward 1.0 and eventually be dropped.
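The symmetric combination can be sketched as below, assuming each network exposes its pre-softmax scores as a dictionary keyed by category. The function names, the category labels, and the parameter name `overlap_scale` are hypothetical; `overlap_scale` plays the role of the hyperparameter described above.

```python
import math


def combine_scores(scores1, scores2, overlap_scale=0.5):
    """Pre-softmax scores over the union of the two category sets.

    Overlap categories receive the scaled sum of both networks' scores.
    Other categories keep the score from the network that owns them, since
    the new cross-network connections start at weight zero.
    """
    combined = {}
    for c in sorted(set(scores1) | set(scores2)):
        if c in scores1 and c in scores2:
            combined[c] = overlap_scale * (scores1[c] + scores2[c])
        else:
            combined[c] = scores1[c] if c in scores1 else scores2[c]
    return combined


def softmax(scores):
    """Numerically stable softmax over a dictionary of scores."""
    m = max(scores.values())
    exps = {c: math.exp(v - m) for c, v in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


# Hypothetical pre-softmax scores; "dog" is the only overlap category.
n1_scores = {"cat": 2.0, "dog": 1.0}
n2_scores = {"dog": 1.0, "fox": 0.5}

combined = combine_scores(n1_scores, n2_scores, overlap_scale=0.5)
unscaled = combine_scores(n1_scores, n2_scores, overlap_scale=1.0)
probs = softmax(combined)
```

With the scale at 0.5 the overlap category keeps a score of 1.0; with the scale at 1.0 it would be doubled to 2.0, which is the effect the hyperparameter is meant to damp early in training.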
As connections are added to merge two networks, each connection that is made between the two networks will restrict which other cross-network connections can be added later. Consider the case in which N1 is a network with many layers and N2 is a network with fewer layers. Multiple strategies are available to the developer or the learning management system for deciding where to make the connections and which connections to make first.
One strategy is to systematically make connections such that most of the connections go from one network to the other, say from N1 to N2. In one extreme version of this strategy, the connections from N2 back to N1 go only to the output layer or to layers near the output of N1. Then any node in any other layer of N1 may be connected to any node in any layer of N2. In another extreme version, the connections from N2 to N1 come only from a small number of layers near the input of N2. Then any node in any other layer of N1 may be connected to any node in any of the other layers of N2. Another strategy is to interleave the layers of N1 and N2.
In each of these strategies, only a small fraction of the potential connections need to be made. Within each strategy, the connections may be made in a priority order with the priority based in part on the magnitude of expression (1).
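The interaction between priority order and the restriction on later connections can be sketched as a greedy loop. The sketch assumes that the restriction is that the merged network must remain free of directed cycles, and that the scores are values of expression (1) already summed over the data. The node names, scores, and `budget` parameter are hypothetical.

```python
def reaches(edges, start, goal):
    """True if `goal` can be reached from `start` along directed edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(k for (j, k) in edges if j == node)
    return False


def add_by_priority(edges, scored_candidates, budget):
    """Add up to `budget` connections, largest |score| first, skipping any
    connection that would close a directed cycle."""
    edges = set(edges)
    added = []
    ranked = sorted(scored_candidates.items(),
                    key=lambda item: abs(item[1]), reverse=True)
    for (j, k), _score in ranked:
        if len(added) == budget:
            break
        if reaches(edges, k, j):   # j -> k would create a cycle
            continue
        edges.add((j, k))
        added.append((j, k))
    return added


# N1 is the chain p1 -> p2 -> p3 and N2 is the chain q1 -> q2.
existing = {("p1", "p2"), ("p2", "p3"), ("q1", "q2")}
scored = {
    ("p1", "q1"): 0.9,
    ("q2", "p1"): -0.8,   # blocked once p1 -> q1 has been added
    ("q2", "p3"): 0.3,
    ("p2", "q2"): 0.1,
}
added = add_by_priority(existing, scored, budget=2)
```

The second-ranked candidate is skipped because the first addition already created a path from p1 to q2, which illustrates how an early connection rules out later ones.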
A new network may be built to solve a specific problem that is not well solved by the original network. For example, the new network may fix an error or close call, with expression (1) summed only over the problem data. It may be pretrained to learn a new set of data or even a new set of categories. Optimized for a specific, narrow task, the new network may have an architecture designed to be easy to train and perhaps may be designed to be simple and robust. In particular, the new network may represent a model that facilitates one-shot learning.
Once trained on its specific problem, the new network may be merged with the original network.
by James K Baker and Bradley J Baker