Adding a new connection with a connection weight of zero does not change the activation of any node in the network and therefore does not degrade performance on the objective. It causes no new errors; it does not increase the error loss function; and it does not increase the penalty from any regularization (unless the regularization directly penalizes the number of connections, including those with weight zero). Continued gradient descent training will then improve on the previous performance even if the previous network was at a global optimum for its architecture, unless the previous network is optimal even among all networks that include the extra connection.
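As a minimal numerical sketch of this property (using numpy; the two-layer network, the layer sizes, and the placement of the skip connection are illustrative assumptions, not taken from the text), adding a connection with weight zero leaves every activation, and therefore the objective, unchanged:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                      # input activations
    W1 = rng.normal(size=(4, 3))                # input -> hidden weights
    W2 = rng.normal(size=(2, 4))                # hidden -> output weights

    h = np.tanh(W1 @ x)                         # hidden activations
    y_before = np.tanh(W2 @ h)                  # output of the unmodified network

    # Add a skip connection from each input directly to the outputs,
    # initialized with weight zero: W_skip contributes nothing to the sums.
    W_skip = np.zeros((2, 3))
    y_after = np.tanh(W2 @ h + W_skip @ x)

    assert np.allclose(y_before, y_after)       # activations are identical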
The selection of which new connection or connections to make may be guided by computing the partial derivative of the objective J with respect to the new connection. The derivative of J with respect to the connection weight may be computed by the standard formula for each training datum d_t.
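The expression intended here, which the next paragraph refers to as "the expression above," appears to be the ordinary backpropagation chain rule; a sketch in LaTeX notation (the symbols \mathrm{act}_A and \mathrm{net}_B, denoting the activation of node A and the weighted input sum of node B on datum d_t, are notational assumptions):

    \frac{\partial J(d_t)}{\partial w_{A \to B}} \;=\; \frac{\partial J(d_t)}{\partial \mathrm{net}_B} \, \mathrm{act}_A(d_t)

The overall estimate of the derivative is the sum of this quantity over the chosen set of data. Because the new weight is zero, the forward activations are unchanged, yet this derivative is in general nonzero.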
Note that a new connection may be made between two nodes in the same layer, causing the layer to be split into two layers. When connecting a pair of nodes in the same layer, the new connection may be made in either direction. The direction may be chosen to maximize the magnitude of the expression above.
As discussed below, a similar principle may be applied when adding a node to a network or adding an entire new layer. Adding one or more nodes at once increases the amount of incremental improvement and may reduce the amount of computation required to find the absolute optimum. It also eliminates the dilemma, which in practice essentially never arises, that the partial derivative of every possible new connection may be zero.
The properties in the previous paragraphs mean that it is always possible to improve performance until an absolute optimum has been reached or until the network is completely connected, with every node forming a separate layer. However, even if incrementally adding connections leads to a completely connected set of nodes, further incremental improvement may be made by adding extra nodes until a network has been found that optimizes performance over all networks. Without regularization, this would mean that the network memorizes the training data: there would be no errors on the training data, except for any datum for which there are examples of multiple categories with exactly the same value for every input variable. However, such a network with optimum performance on the training data would not generalize well to new data. The regularization must therefore be chosen so that optimizing performance on the regularized objective also optimizes performance on new data. The ability to incrementally grow a network to optimize performance on the regularized objective separates the problem of selecting the network architecture from the problem of optimizing the regularization.
As has been discussed above, when a connection is added to a neural network and initialized with a connection weight of zero, the modified network computes exactly the same feed-forward activations as the unmodified network. Under subsequent gradient descent training, the modified network optimizes its objective over a superset of the representation capacity of the unmodified network. The objective may include regularization as well as the error objective on the network's output. The modified network is initialized to match the performance of the unmodified network, and with proper regularization, further training should improve performance even on data not in the training set.
A node may be added at any location in a neural network without degrading performance if the connection weights are initialized to zero for all connections that go from the new node to other nodes in the network.
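A small sketch (numpy; the layer sizes, the location of the new node, and the random initialization of its incoming weights are illustrative assumptions) of adding one hidden node without changing the network's output; the incoming weights of the new node may be arbitrary, but every outgoing weight starts at zero:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=3)
    W1 = rng.normal(size=(4, 3))                  # input -> hidden
    W2 = rng.normal(size=(2, 4))                  # hidden -> output

    y_before = np.tanh(W2 @ np.tanh(W1 @ x))

    # Add a fifth hidden node: its incoming weights may be anything,
    # but every outgoing connection is initialized to zero.
    w_in_new = rng.normal(size=(1, 3))            # incoming weights (arbitrary)
    W1_grown = np.vstack([W1, w_in_new])          # hidden layer now has 5 nodes
    W2_grown = np.hstack([W2, np.zeros((2, 1))])  # outgoing weights are zero

    y_after = np.tanh(W2_grown @ np.tanh(W1_grown @ x))
    assert np.allclose(y_before, y_after)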
A potential new connection for a feed-forward network from a node A to a node B may be evaluated by computing the partial derivative of the network objective, including regularization, with respect to a phantom connection from A to B with weight zero. This evaluation indicates how much immediate improvement to the objective may be achieved by adding the connection and then doing an update. The partial derivative may be estimated by summing over any set of labeled data. It may be estimated over a minibatch, the whole epoch, or anything in between. For example, the derivative estimate may be summed over just the next few minibatches, which may then be reprocessed to compute updates that include the new connections. Note that, even near a stationary point, even at a global minimum, the partial derivative with respect to a potential connection that is not currently in the network may be, and generally will be, non-zero. For some purposes, the derivatives may even be estimated over data not included in the training set, because this estimate is only used in the selection of a potential connection, not in the training of the weights in the expanded network. The selection of the set of data over which to sum the estimate of the derivative with respect to the phantom connection may be determined by a human + AI learning management system.
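A sketch of this evaluation under simplifying assumptions (numpy; a two-layer network with linear outputs, a squared-error objective without regularization, and a candidate phantom connection from input node i to output node j; all names and sizes are illustrative): the derivative with respect to the phantom weight is the activation of the source node times the backpropagated error at the destination node, summed over a minibatch, and a finite-difference check confirms that it is generally nonzero even though the weight itself is zero.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(8, 3))                 # a minibatch of 8 data
    T = rng.normal(size=(8, 2))                 # targets
    W1 = rng.normal(size=(4, 3))
    W2 = rng.normal(size=(2, 4))

    def forward(x, w_phantom=0.0, i=0, j=0):
        # w_phantom is the weight of a candidate connection from input i to output j.
        y = W2 @ np.tanh(W1 @ x)
        y[j] += w_phantom * x[i]
        return y

    # Partial derivative of J = 0.5 * sum ||y - t||^2 with respect to the
    # phantom weight, summed over the minibatch:
    #   dJ/dw = sum over data of (y_j - t_j) * x_i   (delta at B times activation of A)
    i, j = 0, 1
    grad = sum((forward(x)[j] - t[j]) * x[i] for x, t in zip(X, T))

    # Finite-difference check that the estimate is correct (and nonzero).
    eps = 1e-6
    J = lambda w: sum(0.5 * np.sum((forward(x, w, i, j) - t) ** 2) for x, t in zip(X, T))
    assert abs(grad - (J(eps) - J(-eps)) / (2 * eps)) < 1e-4
    print("phantom-connection gradient:", grad)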
The learning management system may also control how connections are added and which ones. In some incremental learning strategies, only one or a small number of new connections are added at a time, which allows later additions to be chosen taking into account the changes from continued training of the network with the previous additions. In other learning strategies, for example when moving the training process away from a stationary point, a large number of new connections may be added at once, so that the combined magnitude of their derivatives is sufficient to make significant progress in the iterative training.
In some training strategies, connections may be pruned as well as added. In such a strategy, the human + AI learning management system is effectively exploring the space of network architectures with incremental changes. For feed-forward networks, a new connection is allowed only if its addition does not create a cycle in the directed graph of connections. It is convenient, therefore, to characterize each network in terms of the strict partial order < defined by the transitive closure of its connections. A new connection from node A to node B may be added if and only if B < A does not hold in this strict partial order; that is, the new connection creates a cycle only if there is already a directed path from B to A. In the exploration of possible network architectures, the human + AI learning management system should take account of which additions or prunings change the strict partial order and which do not. For example, a new connection that changes the strict partial order may preclude certain other connections, but a new connection that does not change the strict partial order will not block any other connection that is currently allowed.
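A sketch (Python; the dictionary-of-successors representation and the node names are assumptions for illustration) of the admissibility test and of the distinction between connections that do and do not change the strict partial order:

    from collections import defaultdict

    def reaches(successors, start, goal):
        # True if there is a directed path from start to goal.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(successors[node])
        return False

    def may_add_connection(successors, a, b):
        # Adding a -> b creates a cycle only if b already precedes a
        # in the strict partial order (b < a), i.e. b reaches a.
        return not reaches(successors, b, a)

    def changes_partial_order(successors, a, b):
        # If a already precedes b (a < b), the new edge adds nothing to the
        # transitive closure; otherwise a and b were incomparable and the
        # order changes, which may block other candidate connections.
        return not reaches(successors, a, b)

    edges = defaultdict(list, {"in1": ["h1"], "h1": ["out"], "in2": ["h1"]})
    assert may_add_connection(edges, "in2", "out")       # allowed
    assert not may_add_connection(edges, "out", "in1")   # would create a cycle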
A node may be added to a network without degrading its performance simply by initializing all of its outgoing connections to have weights of zero. Adding a node adds a learned parameter for its node bias and one more for each incoming or outgoing connection, so adding a node may increase the chance of overfitting the training data. Therefore, the addition of nodes should generally be done as part of a systematic strategy that monitors and controls the trade-off between bias and variance, such as the recommended methods of Incremental Development.
One form of incremental development builds a deeper network by adding one layer at a time to an existing base network. One way to guarantee no degradation in performance on the objective is to insert the new layer between two existing layers while leaving the existing connections in place. In specifying the two layers between which the new layer is inserted, the input variables and the set of output nodes are each considered a layer. The base network may be as simple as a single layer, or it may already be a fully trained, state-of-the-art deep neural network. Because each addition is guaranteed not to degrade performance, an arbitrarily deep network may be built without the problem of degraded performance on training data, which limited the depth of some architectures in the past.
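A sketch (numpy; the sizes, names, and choice of insertion point are illustrative assumptions) of inserting a new layer between an existing hidden layer and the output layer while leaving the existing direct connections in place; because the new layer's outgoing weights start at zero, the output is unchanged:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=3)
    W1 = rng.normal(size=(4, 3))                 # input -> hidden
    W2 = rng.normal(size=(2, 4))                 # hidden -> output (left in place)

    h = np.tanh(W1 @ x)
    y_before = np.tanh(W2 @ h)

    # Insert a new 5-node layer between the hidden layer and the output layer.
    # Its incoming weights may be arbitrary; its outgoing weights start at zero.
    W_new_in = rng.normal(size=(5, 4))           # hidden -> new layer
    W_new_out = np.zeros((2, 5))                 # new layer -> output, all zero

    g = np.tanh(W_new_in @ h)                    # new layer's activations
    y_after = np.tanh(W2 @ h + W_new_out @ g)

    assert np.allclose(y_before, y_after)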
Overfitting, or variance of performance on independent development data, may still be a problem as the number of learned parameters increases with the depth of the network. However, knowledge sharing provides a powerful means of reducing this problem as well as making the network easier to interpret.
A network may also be incrementally grown by adding one or more nodes at various locations in the network. One technique for adding a node is to make two copies of an existing node with the same connections as the existing node. The copies may have incoming connection weights identical to those of the original node and outgoing connection weights that are each one-half of the original. Another implementation is to have one of the pair be identical to the original and to have the other initially have zero weights on its outgoing connections. The two copies are then trained on different data, at least for an initial break-in period, so that they train to have different weights. The difference in training data may be permanent, or it may be temporary. In addition, especially if the difference in data is temporary, a knowledge-sharing link with an "is-not-equal-to" relation may be made in either direction between the two nodes, or in both directions.
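A sketch (numpy; the sizes and the choice of which node to copy are illustrative assumptions) of the first variant, in which both copies keep the original incoming weights and each receives one-half of the original outgoing weights, so the network's output is unchanged:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(size=3)
    W1 = rng.normal(size=(4, 3))                 # input -> hidden
    W2 = rng.normal(size=(2, 4))                 # hidden -> output

    y_before = np.tanh(W2 @ np.tanh(W1 @ x))

    # Split hidden node k into two copies: both copies keep the original
    # incoming weights; each copy gets one-half of the original outgoing weights.
    k = 2
    W1_split = np.vstack([W1, W1[k:k+1, :]])          # duplicate the incoming row
    W2_split = np.hstack([W2, 0.5 * W2[:, k:k+1]])    # new copy: half outgoing
    W2_split[:, k] *= 0.5                             # original copy: half outgoing

    y_after = np.tanh(W2_split @ np.tanh(W1_split @ x))
    assert np.allclose(y_before, y_after)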
In lifelong or continual learning, a network may encounter a datum or a set of data that represents a new category, or a datum or data that represent an existing category but are very different from previous examples of that category. In these cases, the network may be grown by adding a subnetwork that is a detector for the new data. For example, the subnetwork may model the new data as produced by a simple parametric probability distribution, such as a multivariate Gaussian with a diagonal covariance matrix. If only a single new datum is available, the means of the probability model may be set to the values of the datum, and the variances may be initially set to values determined by the system designer or by a human + AI learning management system. The addition to the network may be made without degradation by initially setting the weights of the outgoing connections to zero.
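A sketch (Python; the class name, the log-likelihood activation, and the initial variance value are illustrative assumptions) of a detector subnetwork for a single new datum, modeled as a diagonal-covariance Gaussian whose means are set to the datum and whose outgoing connection weights start at zero:

    import numpy as np

    class GaussianDetectorNode:
        # Detector for new data, modeled as a diagonal-covariance Gaussian.
        def __init__(self, datum, initial_variance=1.0):
            self.mean = np.asarray(datum, dtype=float)             # means set to the datum
            self.var = np.full_like(self.mean, initial_variance)   # designer-chosen variances

        def activation(self, x):
            # Log-likelihood of x under the diagonal Gaussian model.
            return -0.5 * np.sum(np.log(2.0 * np.pi * self.var)
                                 + (x - self.mean) ** 2 / self.var)

    new_datum = np.array([0.3, -1.2, 0.8])
    detector = GaussianDetectorNode(new_datum, initial_variance=0.5)

    # Outgoing connections from the detector into the rest of the network start
    # at zero, so adding the detector cannot degrade the existing performance.
    w_out = np.zeros(2)
    print("activation on the new datum:", detector.activation(new_datum))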
by James K Baker