The amount of computation required to train a deep network from scratch tends to grow like a high power, or even an exponential, of the depth, so the total computation across a sequence of trials tends to be dominated by the computation for the largest network. One reason to select a substantially larger network for each trial is that there is little improvement from a network of one size to one that is only slightly larger; that is also the reason to stop once a satisfactory solution is reached.
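As a rough illustration of why the largest trial dominates, assume (purely for the sake of this sketch; no growth rate is specified in the note above) that the cost of training a network of depth $d$ from scratch grows geometrically, $\mathrm{cost}(d) \approx c\,r^{d}$ with $r > 1$. Then the total cost of from-scratch trials up to depth $D$ is

$$
\sum_{d=1}^{D} c\,r^{d} \;=\; c\,\frac{r^{D+1} - r}{r - 1} \;\le\; \frac{r}{r-1}\,\mathrm{cost}(D),
$$

i.e., within a constant factor of the cost of the final, largest trial alone.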
Alternative: Incremental growth and incremental training

Advantages:
- Incremental growth increases the size of the network more gradually.
- Incremental training initializes the new network to performance equivalent to the old network and uses transfer learning, imitation learning, and node-to-node knowledge sharing to speed up training of the new network (see the sketch after this list).
- The amount of computation required to train each new deep network tends to be bounded by a constant or a low order of the depth.
- The total amount of computation may be of a lower order than just training the largest network from scratch.
- The smaller networks tend to be easier to interpret.
- Imitation training and knowledge sharing tend to preserve interpretability in the larger networks.
- Nodes in added layers may learn interpretations as well, especially with vertical and sideways knowledge sharing.
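A minimal sketch of the incremental-growth and incremental-training idea, written in PyTorch and not taken from the original note: a small fully connected network is grown by one hidden layer initialized to the identity map (so the grown network starts at performance equivalent to the old one), and is then trained with a KL-divergence imitation term toward the old network's outputs as a simple stand-in for imitation learning. The function name `grow_by_identity_layer`, the toy data, and the loss weight 0.5 are illustrative assumptions, not the authors' method.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


def grow_by_identity_layer(old_net: nn.Sequential, width: int) -> nn.Sequential:
    """Insert a new hidden layer that initially computes the identity map.

    The new Linear layer has identity weights and zero bias, and its input is
    the output of a ReLU (hence non-negative), so relu(I x + 0) = x and the
    grown network starts out computing exactly the same function as old_net.
    """
    new_layer = nn.Linear(width, width)
    with torch.no_grad():
        new_layer.weight.copy_(torch.eye(width))
        new_layer.bias.zero_()
    layers = list(old_net.children())
    # Insert the identity layer (plus its ReLU) just before the output layer.
    return nn.Sequential(*layers[:-1], new_layer, nn.ReLU(), layers[-1])


# The "old" network stands in for an already trained smaller model.
old_net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
new_net = grow_by_identity_layer(copy.deepcopy(old_net), width=16)

# Toy data in place of a real training set.
x = torch.randn(32, 8)
y = torch.randint(0, 4, (32,))

optimizer = torch.optim.Adam(new_net.parameters(), lr=1e-3)
for step in range(100):
    optimizer.zero_grad()
    logits = new_net(x)
    with torch.no_grad():
        teacher_logits = old_net(x)          # old network acts as the teacher
    task_loss = F.cross_entropy(logits, y)
    # Imitation (distillation) term: keep the grown network's output
    # distribution close to the old network's while it learns the task.
    imitation_loss = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss = task_loss + 0.5 * imitation_loss  # 0.5 is an illustrative weight
    loss.backward()
    optimizer.step()
```

In this sketch the grown network starts from the old network's behavior rather than from random initialization, which is what makes each growth step cost something closer to fine-tuning than to training the larger network from scratch.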
by James K. Baker and Bradley J. Baker