The amount of computation required to train a deep network from scratch tends to grow like a high power, or even an exponential, of the depth, so the total computation across a sequence of trials tends to be dominated by the computation for the largest network. One reason to select a substantially larger network for each trial is that there is little improvement from a network of one size to one that is only slightly larger; that is also the reason to stop once a satisfactory solution is reached.
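As a rough illustration of why the largest trial dominates, assume (purely for the sake of this sketch; no growth rate is specified in the note above) that the cost of training a network of depth $d$ from scratch grows geometrically, $\mathrm{cost}(d) \approx c\,r^{d}$ with $r > 1$. Then the total cost of from-scratch trials up to depth $D$ is

$$
\sum_{d=1}^{D} c\,r^{d} \;=\; c\,\frac{r^{D+1} - r}{r - 1} \;\le\; \frac{r}{r-1}\,\mathrm{cost}(D),
$$

i.e., within a constant factor of the cost of the final, largest trial alone.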
Alternative: Incremental growth and incremental training

Advantages:
- Incremental growth increases the size of the network more gradually.
- Incremental training initializes the new network to performance equivalent to the old network and uses transfer learning, imitation learning, and node-to-node knowledge sharing to speed up training of the new network (see the sketch after this list).
- The amount of computation required to train each new deep network tends to be bounded by a constant or a low order of the depth.
- The total amount of computation may be of a lower order than just training the largest network from scratch.
- The smaller networks tend to be easier to interpret.
- Imitation training and knowledge sharing tend to preserve interpretability in the larger networks.
- Nodes in added layers may learn interpretations as well, especially with vertical and sideways knowledge sharing.
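A minimal sketch of the incremental-growth and incremental-training idea, written in PyTorch and not taken from the original note: a small fully connected network is grown by one hidden layer initialized to the identity map (so the grown network starts at performance equivalent to the old one), and is then trained with a KL-divergence imitation term toward the old network's outputs as a simple stand-in for imitation learning. The function name `grow_by_identity_layer`, the toy data, and the loss weight 0.5 are illustrative assumptions, not the authors' method.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


def grow_by_identity_layer(old_net: nn.Sequential, width: int) -> nn.Sequential:
    """Insert a new hidden layer that initially computes the identity map.

    The new Linear layer has identity weights and zero bias, and its input is
    the output of a ReLU (hence non-negative), so relu(I x + 0) = x and the
    grown network starts out computing exactly the same function as old_net.
    """
    new_layer = nn.Linear(width, width)
    with torch.no_grad():
        new_layer.weight.copy_(torch.eye(width))
        new_layer.bias.zero_()
    layers = list(old_net.children())
    # Insert the identity layer (plus its ReLU) just before the output layer.
    return nn.Sequential(*layers[:-1], new_layer, nn.ReLU(), layers[-1])


# The "old" network stands in for an already trained smaller model.
old_net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
new_net = grow_by_identity_layer(copy.deepcopy(old_net), width=16)

# Toy data in place of a real training set.
x = torch.randn(32, 8)
y = torch.randint(0, 4, (32,))

optimizer = torch.optim.Adam(new_net.parameters(), lr=1e-3)
for step in range(100):
    optimizer.zero_grad()
    logits = new_net(x)
    with torch.no_grad():
        teacher_logits = old_net(x)          # old network acts as the teacher
    task_loss = F.cross_entropy(logits, y)
    # Imitation (distillation) term: keep the grown network's output
    # distribution close to the old network's while it learns the task.
    imitation_loss = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss = task_loss + 0.5 * imitation_loss  # 0.5 is an illustrative weight
    loss.backward()
    optimizer.step()
```

In this sketch the grown network starts from the old network's behavior rather than from random initialization, which is what makes each growth step cost something closer to fine-tuning than to training the larger network from scratch.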
by James K. Baker and Bradley J. Baker