Suggestions for Easier Ways to Build Very Deep Neural Networks

This page is a continuation of the discussion on the Experimenters' Page, but it also stands on its own. Unlike the suggestions on the Experimenters' Page, the methods on this page are useful for deep learning research in general, not just for trying out novel ideas that have never been tested before. Some of the methods on this page may help you successfully build a neural network with more layers than you ever have before.

Suggestions for ways to build very deep neural networks:

Stacking

Stacking rearranges an ensemble into a single deep neural network with equal performance and then adds additional connections to improve performance further. Stacking may also be applied to former ensemble members that have been combined into a single network by Joint Optimization. The ensemble members are arranged in a stack with their existing connections intact. That is, the former first-layer nodes of each inner network in the stack are still connected to the input, and the former output nodes of each inner network are still connected to the output or to the joint optimization network. In other words, at this stage, the stack of networks still makes exactly the same computations as before being stacked.

Next, new cross-network connections are added with their connection weights initialized to zero. The feed-forward computation, and thus the recognition performance, is still identical. However, backpropagation now includes the new connections. The new cross-network connections may be made from any network in the stack to any network higher in the stack. The priority of which node pairs to connect may be determined by the technique of evaluating potential new connections.
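
As a rough illustration, the sketch below stacks two ensemble members and adds one zero-initialized cross-network connection from the lower member's hidden layer to the upper member's hidden layer. The framework (PyTorch), the two small MLP members, the layer sizes, and the simple output averaging are all illustrative assumptions, not part of the method itself; the point is only that the stacked network initially reproduces the ensemble's output exactly.

```python
import torch
import torch.nn as nn

class StackedPair(nn.Module):
    """Two ensemble members arranged as a stack.

    Each former member keeps its original connections to the input and to
    the output (here the two outputs are simply averaged, as in the
    ensemble).  A new cross-network connection from the hidden layer of the
    lower member to the hidden layer of the upper member is initialized to
    zero, so the stacked network initially computes exactly what the
    ensemble computed.
    """
    def __init__(self, member_a, member_b, hidden_dim):
        super().__init__()
        self.member_a = member_a   # lower network in the stack
        self.member_b = member_b   # higher network in the stack
        # New cross-network connection, zero-initialized: it contributes
        # nothing to the forward computation until training changes it.
        self.cross = nn.Linear(hidden_dim, hidden_dim, bias=False)
        nn.init.zeros_(self.cross.weight)

    def forward(self, x):
        # Lower member: its original computation, unchanged.
        h_a = self.member_a[1](self.member_a[0](x))   # Linear -> ReLU
        out_a = self.member_a[2](h_a)
        # Upper member: original computation plus the (initially zero)
        # contribution arriving over the cross-network connection.
        h_b = self.member_b[1](self.member_b[0](x) + self.cross(h_a))
        out_b = self.member_b[2](h_b)
        # Combine the former members' outputs as the ensemble did.
        return 0.5 * (out_a + out_b)

# Hypothetical ensemble members: two small MLPs trained on the same task.
in_dim, hidden, out_dim = 16, 32, 4
member_a = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
member_b = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

stacked = StackedPair(member_a, member_b, hidden)
x = torch.randn(8, in_dim)
# With the cross connection at zero, the stacked network reproduces the
# ensemble's average output exactly.
print(torch.allclose(stacked(x), 0.5 * (member_a(x) + member_b(x))))  # True
```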

It is feasible and practical to build and train very large ensembles. Training a very deep network from scratch is much more difficult. Transfer learning may help, but it requires a network that is as large as the new network and that has already been trained on a similar task. Stacking the members of an ensemble produces a deep network that is already trained on the exact same task as the ensemble. In addition, the ability to add cross-network connections lets the new deep network represent computations that could not be done by the ensemble.

Doubling

Doubling is a process that iteratively builds a deep network by doubling the depth of the network with each step of the iteration. Doubling uses the concept of adding a layer to a network and initializing the new layer so that the computation of the new network is initially identical to that of the original network. That is, the new layer is initialized to compute the identity function. In doubling, rather than adding one layer at a time, a new layer is added between every pair of adjacent layers. As with building a network layer-by-layer, subsequent training of the new, double-depth network will improve its performance.
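
A minimal sketch of one doubling step is shown below. It assumes, purely for illustration, that the original network is a PyTorch nn.Sequential of linear and activation layers and that a new layer is spliced in after each linear layer (except the last); the inserted layers are plain linear layers initialized to the identity, so the doubled network initially computes exactly the same function.

```python
import copy
import torch
import torch.nn as nn

def double_depth(model: nn.Sequential) -> nn.Sequential:
    """Return a deeper copy of `model` with a new linear layer, initialized
    to compute the identity function, spliced in between adjacent layers.

    Because the new layers compute the identity, the doubled network
    initially produces exactly the same outputs as the original network;
    subsequent training can then improve on it.
    """
    layers = []
    last = len(model) - 1
    for i, layer in enumerate(model):
        layers.append(copy.deepcopy(layer))
        if isinstance(layer, nn.Linear) and i < last:
            width = layer.out_features
            new_layer = nn.Linear(width, width)
            nn.init.eye_(new_layer.weight)   # identity weight matrix
            nn.init.zeros_(new_layer.bias)   # zero bias
            layers.append(new_layer)
    return nn.Sequential(*layers)

# Hypothetical original network.
original = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
doubled = double_depth(original)

x = torch.randn(8, 16)
print(torch.allclose(original(x), doubled(x)))  # True: identical computation
```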

Knowledge sharing from nodes in the former network to the corresponding nodes in the new doubled network will help guide the training process, so training the doubled network should be much easier than training such a deep network from scratch.

General Merging and Stacking

Merging is the process of combining a set of two or more networks into a single network that is initially equivalent to the original set of networks. Additional cross-network connections may then be added with no degradation in performance. Further training will improve the performance. Stacking an ensemble is an instance of merging. However, any set of networks may be merged, and the new network may be arranged in any desired architecture. The set of networks does not need to be an ensemble. For example, it may be a main network and a set of companion networks. The networks do not need to be arranged as a stack. They may be arranged in an order that facilitates the representation of a needed computation or that improves interpretability. Multiple copies may be made of a network in the set, so that the network may be placed at multiple locations in the new stack or other architecture.
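
The sketch below illustrates the merging step at the level of a single layer, under the assumption (illustrative, not from the source) that two linear layers from different networks are placed side by side. The merged layer is block-diagonal, and its off-diagonal blocks, which are the new cross-network connections, start at zero, so the merged layer is initially equivalent to the two layers run separately.

```python
import torch
import torch.nn as nn

def merge_linear(a: nn.Linear, b: nn.Linear) -> nn.Linear:
    """Merge two linear layers into one block-diagonal linear layer.

    The merged layer takes the concatenated inputs of `a` and `b` and
    produces their concatenated outputs.  The off-diagonal blocks (the
    new cross-network connections) are initialized to zero, so the
    merged layer initially computes exactly what the two separate layers
    computed; training can later grow the zero blocks.
    """
    merged = nn.Linear(a.in_features + b.in_features,
                       a.out_features + b.out_features)
    with torch.no_grad():
        merged.weight.zero_()
        merged.weight[:a.out_features, :a.in_features] = a.weight
        merged.weight[a.out_features:, a.in_features:] = b.weight
        merged.bias[:a.out_features] = a.bias
        merged.bias[a.out_features:] = b.bias
    return merged

# Hypothetical layers taken from two different networks being merged.
a, b = nn.Linear(8, 16), nn.Linear(8, 16)
merged = merge_linear(a, b)

x = torch.randn(4, 8)
separate = torch.cat([a(x), b(x)], dim=1)
together = merged(torch.cat([x, x], dim=1))
print(torch.allclose(separate, together))  # True
```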

Connecting Nodes within a Layer

To build a deeper network from a single network without adding any nodes, you may simply add connections between nodes within a layer. For example, you could place all the nodes in a layer in whatever order you choose, then make a connection from each node in the layer to the next node in the order. The last node may be unconnected or may be connected to the first node in the next layer or to any node in any higher layer.

This simple scheme will not be computationally efficient on GPUs with tensor cores, but you may adapt the general concept to connect blocks of nodes to fit your application and the available hardware. As with the other methods, the new connections initially have connection weights of zero, and the new network initially computes the same values as the original network. In the new network, every node will effectively be a separate layer, so the new network will have much greater representation capacity than the original network. As in the other methods, the new network is already initialized to be equivalent to the original network, so there is no need to have a pretrained network for transfer learning.
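
A rough sketch of the chain-within-a-layer idea follows. The framework (PyTorch), the ReLU activation, and the layer sizes are illustrative assumptions. The within-layer weights start at zero, so the wrapped layer initially computes exactly what it did before, while each node now effectively occupies its own depth level; the explicit Python loop also reflects why the naive scheme is not efficient on tensor-core hardware.

```python
import torch
import torch.nn as nn

class ChainedLayer(nn.Module):
    """A layer whose nodes are connected in a chain within the layer.

    Node i receives, in addition to its usual input, a connection from
    node i-1 of the same layer.  The within-layer weights start at zero,
    so the module initially computes exactly what the wrapped linear
    layer (followed by ReLU) computed; once trained, each node
    effectively sits in its own depth level.
    """
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        # One new zero-initialized weight per node (except the first).
        self.chain = nn.Parameter(torch.zeros(linear.out_features - 1))

    def forward(self, x):
        z = self.linear(x)                    # ordinary pre-activations
        acts = [torch.relu(z[:, 0])]          # first node in the order: unchanged
        for i in range(1, z.shape[1]):
            # Node i sees its usual input plus the previous node's output.
            acts.append(torch.relu(z[:, i] + self.chain[i - 1] * acts[-1]))
        return torch.stack(acts, dim=1)

# Hypothetical original layer with a ReLU activation.
layer = nn.Linear(16, 32)
chained = ChainedLayer(layer)
x = torch.randn(4, 16)
print(torch.allclose(chained(x), torch.relu(layer(x))))  # True initially
```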

A more conservative strategy is to incrementally add new connections within layers. The choice of which connections to add first may be guided by the evaluation of potential connections, as before. Knowledge sharing from the original network or from any earlier network in the incremental growth may help guide the training of the more highly connected networks.

Knowledge Sharing for Improving Interpretability

A Brief Introduction to Knowledge Sharing

The actions of inner nodes of most deep neural networks are notoriously difficult to interpret. Generally, the deeper the network, the more difficult it becomes to interpret inner layer nodes. Imitation learning and node-to-node knowledge sharing help make the inner nodes of a network more interpretable, regardless of the depth of the network.

A very large, deep neural network also may need a substantial amount of regularization to help its nodes learn activation patterns that will generalize to new data. Knowledge sharing will also help provide better regularization.

Knowledge sharing is a powerful tool for increasing the interpretability of neural networks as well as a powerful tool for highly customizable regularization. Knowledge sharing is implemented with a virtual link from a knowledge-providing node to a knowledge-receiving node. The virtual link imposes a local regularization on the knowledge-receiving node. Generally, the regularization enforces a specified relationship between the activation value of the knowledge-providing node and the activation value of the knowledge-receiving node, with the regularization enforced on a specified set of data.
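
The sketch below shows one plausible form such a virtual link could take as an extra loss term in PyTorch. The squared-difference relationship, the penalty strength, and the decision to stop gradients at the knowledge-providing node are illustrative assumptions; the specified relationship and the specified set of data can be chosen however the application requires.

```python
import torch

def knowledge_sharing_penalty(providing_acts, receiving_acts, mask=None, strength=0.1):
    """Local regularization term for one virtual knowledge-sharing link.

    Penalizes the squared difference between the activation of the
    knowledge-providing node and that of the knowledge-receiving node,
    but only on the specified subset of the data (`mask`).  Gradients are
    stopped at the providing node, so the penalty acts locally on the
    receiving node.
    """
    diff = (receiving_acts - providing_acts.detach()) ** 2
    if mask is not None:
        diff = diff * mask        # enforce the relationship only on selected examples
    return strength * diff.mean()

# Hypothetical activations of the two linked nodes on a batch of 64 examples.
providing = torch.randn(64)
receiving = torch.randn(64, requires_grad=True)
selected = (torch.rand(64) > 0.5).float()   # the specified set of data

task_loss = torch.tensor(0.0)               # stands in for the usual training loss
total_loss = task_loss + knowledge_sharing_penalty(providing, receiving, selected)
total_loss.backward()                       # gradient reaches only the receiving node
print(receiving.grad.shape)                 # torch.Size([64])
```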

The following discussions of knowledge sharing, imitation learning, and related topics show ways to use knowledge sharing to improve the development and training of very deep neural networks:

  1. Imitation Learning
  2. Knowledge Sharing for Imitation Learning
  3. Combining Imitation with Knowledge Sharing

Multiple examples of the use of knowledge sharing for regularization are discussed in other parts of this website.

by James K Baker and Bradley J Baker

© D5AI LLC, 2020

The text in this work is licensed under a Creative Commons Attribution 4.0 International License.
Some of the ideas presented here are covered by issued or pending patents. No license to such patents is created or implied by publication or by reference herein.