Ordinary Early Stopping is "Stupid"

Early stopping, that is, deliberately stopping iterative gradient descent before convergence, is a recommended "best practice" for training neural networks. The larger the network, the greater the need for early stopping. However, there are reasons to question the routine practice of early stopping.

Early stopping of iterative gradient descent training of a classifier is typically implemented by repeatedly measuring the performance of the classifier on the training data and comparing it with the performance of the classifier on a designated set of data that is disjoint from the training set. This disjoint set of data is called the development set. Early stopping consists of halting the iterative gradient descent when the performance on the development set, relative to the performance on the training set, degrades by more than some preset criterion.
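
For reference, the following is a minimal sketch of that conventional criterion in Python. The helpers train_one_epoch and evaluate are hypothetical stand-ins for whatever training framework is in use, and the threshold value is an arbitrary placeholder; only the stopping logic itself is the point.

```python
def train_with_early_stopping(train_one_epoch, evaluate,
                              max_epochs=100, gap_threshold=0.05):
    """Conventional early stopping.

    train_one_epoch() runs one pass of iterative gradient descent;
    evaluate(split) returns a loss on "train" or "dev" data. Both are
    caller-supplied stand-ins; gap_threshold is an arbitrary criterion.
    """
    for epoch in range(max_epochs):
        train_one_epoch()
        # Stop when dev performance, relative to training performance,
        # degrades by more than the preset criterion.
        if evaluate("dev") - evaluate("train") > gap_threshold:
            return epoch  # stopped early
    return max_epochs  # ran to the epoch limit
```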

With the above implementation of early stopping, both of the following statements are true:

  1. Early stopping is "stupid".
  2. About the only thing more stupid than early stopping is not stopping.

Here, "stupid" is judged relative to the alternatives, and "not stopping" means ignoring the criterion and continuing iterative gradient descent until convergence.

How is it possible that both statements are true? You may think, "Either the training is stopped early or it is not, so they can't both be stupid." That reasoning is an instance of the Fallacy of the Excluded Middle. There are alternatives to either stopping completely or continuing to convergence without changing anything:

  1. Local early stopping
  2. Decreasing the number of learned parameters (or increasing regularization)
  3. Knowledge sharing

Each of these techniques requires the application of intelligence during the training process and may be enhanced by human-assisted training.

Local Early Stopping

Local early stopping requires deciding separately for each node whether to stop back propagation from that node. Normal early stopping is equivalent to a highly constrained special case in which back propagation must stop at the same time for all nodes or for none at all. In a large network, node-specific early stopping may require monitoring the performance of millions of nodes, each with an individual local objective.

For some of the nodes, the local objective may be interpretable, and that interpretability may help guide the local early stopping. By definition, determining interpretability requires human judgment. However, it is impractical for an individual, or even a team of humans, to make a separate decision for each node. Thus, node-specific early stopping is best controlled by a cooperative team including humans and an AI system. Furthermore, during training, not only does the performance of each node vary, but even the local objective itself varies as connection weights change throughout the network. Monitoring all the nodes and their interactions and making the stopping decisions clearly needs the help of an AI system.
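
As one purely illustrative reading of node-specific stopping, the PyTorch sketch below treats each output unit of a linear layer as a node and zeroes the backward signal of any node that has been stopped, so that the node neither updates its incoming weights nor propagates error to earlier layers. The monitoring policy that decides when to call stop_node is assumed and not shown here.

```python
import torch
import torch.nn as nn

class LocallyStoppableLinear(nn.Linear):
    """A Linear layer whose output units ("nodes") can be stopped one by one."""

    def __init__(self, in_features, out_features):
        super().__init__(in_features, out_features)
        # 1.0 = node still training; 0.0 = back propagation stopped at it.
        self.register_buffer("active", torch.ones(out_features))

    def forward(self, x):
        y = super().forward(x)
        if y.requires_grad:
            # Zero the backward signal of stopped nodes; forward is unchanged.
            y.register_hook(lambda grad: grad * self.active)
        return y

    def stop_node(self, index):
        self.active[index] = 0.0  # local early stopping for one node
```

Masking the gradient of the layer's output, rather than freezing parameters, matches the text's phrasing of stopping back propagation *from* a node: it halts both the node's own updates and the error signal it would send backward.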

Decreasing the Number of Learned Parameters

Decreasing the number of learned parameters or increasing the amount of regularization is certainly capable of preventing overfitting, but either way there is a trade-off between bias and variance. If there were a known method for optimizing this trade-off better than early stopping does, early stopping would not be a recommended best practice. Adding additional hyperparameters does not essentially change the situation. It is possible, however, that a cooperative human + AI learning management system with enough controls on the regularization and/or enough other hyperparameters may be able to make early stopping unnecessary. Indeed, knowledge sharing, discussed below, is an instance of that.
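
For concreteness, the regularization knob mentioned above can be as simple as an L2 penalty added to the task loss, as in the sketch below; the single hyperparameter lam governs the bias-variance trade-off, and its value here is an arbitrary placeholder that would itself need tuning.

```python
import torch

def regularized_loss(task_loss, model, lam=1e-4):
    # lam trades bias (large lam) against variance (small lam); like early
    # stopping's criterion, it is one more hyperparameter to be tuned.
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return task_loss + lam * l2
```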

Knowledge Sharing

Under the control of a cooperative human + AI learning management system, knowledge sharing is capable of implementing a superset of any of the above techniques. Node-to-node knowledge sharing can be used to implement node-specific early stopping, either with gradual slowing or with stopping at once. As a form of regularization, data-conditional node-to-node knowledge sharing is extremely flexible, customizable, and minutely targetable. A knowledge sharing link can be focused on an arbitrarily small set of data, or on as few as a single node. The strength of each link, relative to the magnitude of the derivative of the global cost function, may vary from zero to an unlimited magnitude.
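
The authors do not specify an implementation, but one plausible reading of a single data-conditional, node-to-node knowledge sharing link is a penalty that pulls a receiving node's activation toward a sending node's on a selected subset of the data. Everything in the sketch below (the names sender_act, receiver_act, condition, and strength, and the squared-difference form of the penalty) is an illustrative assumption, not the authors' method.

```python
import torch

def knowledge_sharing_penalty(sender_act, receiver_act, condition, strength):
    """One hypothetical node-to-node knowledge sharing link.

    sender_act, receiver_act: activations of the linked nodes, shape (batch,).
    condition: boolean mask selecting the data the link applies to,
               possibly as small as a single example.
    strength: link weight; 0 disables the link, and large values dominate
              the gradient of the global cost function.
    """
    # Detach the sender so the link only adjusts the receiving node.
    diff = (receiver_act - sender_act.detach()) ** 2
    return strength * (diff * condition.float()).sum()
```

Detaching the sender makes the link directional, so the gradient of the penalty adjusts only the receiving node, which is what lets such a link target a single node.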

In an example scenario, the number of knowledge sharing links and their strengths can initially be limited to no more than is required for the desired amount of interpretability to be produced by the knowledge sharing. As the training progresses, each node can be monitored as it would be for node-specific early stopping. However, rather than implementing early stopping, the monitoring would be used to gradually increase the number and/or strength of the knowledge sharing links. The monitoring during this gradual increase in regularization could be used to adjust the amount of regularization so that the stopping criterion is not met before convergence of the iterative gradient descent training.
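
A hypothetical control loop for that scenario might look like the following, where train_one_epoch and evaluate are the same illustrative stand-ins as before and all control constants are arbitrary: instead of stopping when the dev/train gap exceeds the threshold, the loop raises the knowledge sharing strength so that the criterion is never met before convergence.

```python
def train_with_adaptive_sharing(train_one_epoch, evaluate,
                                max_epochs=100, gap_threshold=0.05):
    """train_one_epoch(strength) runs one pass with knowledge sharing links
    applied at the given strength; evaluate(split) returns a loss. Both are
    caller-supplied stand-ins; all constants are arbitrary placeholders."""
    strength = 0.0
    for _ in range(max_epochs):
        train_one_epoch(strength)
        gap = evaluate("dev") - evaluate("train")
        if gap > gap_threshold:
            # Instead of stopping, tighten the regularization so the
            # stopping criterion is not met before convergence.
            strength = max(strength * 1.5, 1e-3)
    return strength
```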


by James K Baker and Bradley J Baker

© D5AI LLC, 2020

The text in this work is licensed under a Creative Commons Attribution 4.0 International License.
Some of the ideas presented here are covered by issued or pending patents. No license to such patents is created or implied by publication or by reference herein.