Joint optimization is a tool that should be in every neural network ensemble-building toolkit. It converts any ensemble into a single network that performs better on the ensemble's objective. The ensemble may have any number of members, as few as two, with no upper limit. The ensemble may initially be untrained, partially trained, or trained to convergence.
The joint optimization procedure comprises adding a neural network (called the "combining network") above the ensemble members to create a single network in which each ensemble member is a subnetwork. The combining network is initialized to be equivalent to whatever voting rule or fixed combining rule was used for the ensemble. The full, combined network is then trained by stochastic gradient descent.
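A minimal PyTorch sketch of this construction is given below, assuming the ensemble members are classifiers that each output a vector of per-class scores; the names CombinedNetwork, members, and combiner are illustrative, not taken from the description above.

```python
# A minimal sketch of the joint-optimization construction: former ensemble
# members become subnetworks under a trainable combining network.
import torch
import torch.nn as nn

class CombinedNetwork(nn.Module):
    """Wraps former ensemble members under a trainable combining network."""
    def __init__(self, members, combiner):
        super().__init__()
        self.members = nn.ModuleList(members)  # each member is a subnetwork
        self.combiner = combiner               # initialized to the old rule

    def forward(self, x):
        # Each former ensemble member produces its own per-class score vector.
        scores = [m(x) for m in self.members]          # list of (batch, n_classes)
        # The combining network maps the stacked member scores to one output.
        return self.combiner(torch.stack(scores, dim=-1))  # (batch, n_classes, n_members)
```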
For example, if the ensemble were combined by an arithmetic average, the combining network would have a linear node for each output category. Each linear node would initially sum the corresponding ensemble members' scores, with each connection weight equal to one divided by the number of ensemble members. There may be additional connections initialized with a weight of zero and additional nodes with all connections initialized to zero. There may be additional layers initialized to be the identity function. Training would then continue from whatever level of training has already been done. Even if the ensemble has already been trained to convergence, the gradient of the objective with respect to the weights of the connections and the node biases in the combining network would be zero only in a rare degenerate case.
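The sketch below shows one way the mean-initialized combining layer could be built in PyTorch: a linear layer over the member dimension whose connection weights all start at 1/N and whose bias (an additional parameter) starts at zero. The class name MeanCombiner is illustrative, and the extra zero-initialized nodes and identity layers mentioned above are omitted for brevity.

```python
# A sketch of a combining layer that, before any further training,
# reproduces the arithmetic-mean ensemble exactly.
import torch
import torch.nn as nn

class MeanCombiner(nn.Module):
    """Linear combiner over member scores, initialized to 1/N averaging."""
    def __init__(self, n_members: int):
        super().__init__()
        self.linear = nn.Linear(n_members, 1)
        with torch.no_grad():
            self.linear.weight.fill_(1.0 / n_members)  # each connection weighted 1/N
            self.linear.bias.zero_()                   # additional parameter starts at zero

    def forward(self, stacked_scores):
        # stacked_scores: (batch, n_classes, n_members) -> (batch, n_classes)
        return self.linear(stacked_scores).squeeze(-1)
```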
An important improvement is that the training of the full combined network would include back-propagating the combined objective to each subnetwork that was a former ensemble member. In other words, the former ensemble members would be trained to cooperate in a fashion that optimizes their shared objective.
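A sketch of one such joint training step is shown below, assuming the CombinedNetwork from the earlier sketch, a standard cross-entropy objective, and an optimizer built over all parameters (members included); the function name joint_training_step is illustrative.

```python
# A sketch of a joint training step: the combined objective is back-propagated
# through the combining network and into every former ensemble member.
import torch
import torch.nn as nn

def joint_training_step(combined_net, optimizer, x, y):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(combined_net(x), y)
    loss.backward()   # gradients reach the combiner AND every member subnetwork
    optimizer.step()
    return loss.item()

# The optimizer covers all parameters, so the former members are trained too:
# optimizer = torch.optim.SGD(combined_net.parameters(), lr=1e-3)
```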
The combined training would improve the performance of the combined network over the uncombined ensemble except in the rare case in which the full combined network is already at a stationary point. Furthermore, a small random perturbation would enable improvement unless the stationary point is a local minimum. The training of the combined network comprises training each pair, and indeed each subset, of the ensemble members to cooperate optimally, so it would be very rare for the initial combined network to be at a minimum.
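A sketch of such a perturbation is given below, applied here only to the combining network's parameters; the noise scale 1e-4 is an arbitrary illustrative choice.

```python
# A sketch of a small random perturbation used to escape a stationary point
# that is not a local minimum.
import torch

def perturb(module, scale: float = 1e-4):
    with torch.no_grad():
        for p in module.parameters():
            p.add_(scale * torch.randn_like(p))  # small Gaussian nudge
```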
The same procedure may also be used with the combining network initialized to be equivalent to, or to approximate, any other fixed combining rule. For example, a geometric mean combining rule may be implemented using one layer of nodes with logarithmic activation functions, followed by a summation layer and a final layer of exponential activation functions. A plurality voting rule may be approximated by a softmax layer for each ensemble member, followed by scaling each member's largest score to 1.0, and then using a combining network initialized to be the arithmetic mean.
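Below is a sketch of a combining network initialized to the geometric mean, following the log / sum / exp structure described above; the clamp away from zero is a numerical-safety assumption and not part of the description, and the class name is illustrative.

```python
# A sketch of a geometric-mean combining network: logarithmic activations,
# a 1/N summation layer, then an exponential output layer.
import torch
import torch.nn as nn

class GeometricMeanCombiner(nn.Module):
    def __init__(self, n_members: int):
        super().__init__()
        self.linear = nn.Linear(n_members, 1)
        with torch.no_grad():
            self.linear.weight.fill_(1.0 / n_members)  # average of the logs
            self.linear.bias.zero_()

    def forward(self, stacked_scores):
        # stacked_scores: (batch, n_classes, n_members), assumed positive
        logs = torch.log(stacked_scores.clamp_min(1e-12))  # logarithmic layer
        return torch.exp(self.linear(logs).squeeze(-1))    # summation, then exp
```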
The combining network may also use several different combining rules, with a separate subnetwork for each. The combining network may be initialized to implement any one combining rule, with the subnetworks that implement the other combining rules initially connected with zero weights. Thus the combined network may also be trained to learn the optimum combining rule for the assigned task. With extra layers initialized to be the identity, the combining network could have the full universal representation power of a deep neural network. The combining network may then learn combining operations that would be impossible to implement as a normal combining rule.
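One illustrative way to realize this is sketched below: several candidate rule subnetworks held in parallel, with a mixing vector initialized so that exactly one rule is active and the others are connected with zero weight. The gating scheme and the name MultiRuleCombiner are assumptions for illustration, not the only possible arrangement.

```python
# A sketch of a combining network holding several candidate combining rules
# as parallel subnetworks, initialized to pass through exactly one of them.
import torch
import torch.nn as nn

class MultiRuleCombiner(nn.Module):
    def __init__(self, rules, initial_rule: int = 0):
        super().__init__()
        self.rules = nn.ModuleList(rules)   # e.g. arithmetic mean, geometric mean
        mix = torch.zeros(len(rules))
        mix[initial_rule] = 1.0             # start as exactly one rule
        self.mix = nn.Parameter(mix)        # other rules connected with zero weight

    def forward(self, stacked_scores):
        # Each rule maps (batch, n_classes, n_members) -> (batch, n_classes).
        outputs = torch.stack([r(stacked_scores) for r in self.rules], dim=-1)
        return outputs @ self.mix           # learned blend of the rules
```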
by James K Baker