Early stopping, that is, halting the iterative training process before it reaches convergence, is a recommended best practice in training neural networks. From another point of view, however, early stopping doesn't make sense. If it is better to stop an iterative process that is converging to a specified objective, then there must be something wrong with the objective. Early stopping, then, is at best a remedy that attempts to hide the underlying problem, not to fix it.
The criterion for early stopping is that performance on an independent test set is getting worse as the iterative process proceeds toward convergence. Performance on an independent test set gets worse when training causes the machine learning system to fit elements of the training data that are not representative of the true underlying distribution, a phenomenon referred to as overfitting. The well-known best-practice remedy for overfitting is regularization. Regularization comprises applying one or more constraints or additional objectives to the learned parameters. Some forms of regularization also accomplish additional purposes, such as enabling pruning of some of the learned parameters. An example of pruning enabled by regularization is weight decay in a neural network. Weight decay is a regularization with an objective that is optimized when a weight is zero. In a neural network, weight decay may drive some of the connection weights to a magnitude close to zero, allowing the connections associated with those weights to be removed. Removing a connection saves computation and memory as well as reducing the number of learned parameters.
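To make the weight-decay example concrete, the following is a minimal sketch, assuming PyTorch, of weight decay expressed as an explicit L2 penalty added to the training loss, followed by pruning of connections whose weights have decayed toward zero. The model, the regularization strength, and the pruning threshold are illustrative assumptions, not prescribed values.

    # Minimal sketch (PyTorch assumed): weight decay as an explicit L2 penalty,
    # followed by pruning connections whose weights have decayed toward zero.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    decay_lambda = 1e-4        # regularization strength (illustrative value)
    prune_threshold = 1e-3     # hypothetical magnitude below which a weight is removed

    def training_step(x, y):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        # Weight decay: an additional objective that is optimized when a weight is zero.
        l2_penalty = sum(w.pow(2).sum() for w in model.parameters())
        (loss + decay_lambda * l2_penalty).backward()
        optimizer.step()

    def prune_small_weights():
        # Connections whose weights have decayed to (near) zero can be removed,
        # saving computation and memory.
        with torch.no_grad():
            for w in model.parameters():
                w.mul_((w.abs() >= prune_threshold).float())

In practice the same effect can be obtained by passing a weight_decay argument to the optimizer; the explicit penalty is written out here only to show that weight decay is an additional objective imposed on the learned parameters.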
However, the fact that early stopping is needed and is a recommended best practice is direct evidence that current methods of regularization are not adequate to avoid the criterion that invokes early stopping. Regularization in statistics is well known and has been used very broadly since long before neural networks. Most current methods of regularization are controlled by only a handful of hyperparameters. These few hyperparameters are inadequate to match the regularization needs of a neural network with millions of learned parameters.
A cooperative human + AI learning supervisor system, on the other hand, is capable of managing the complexity of customizing regularization to each of millions of nodes or connections. A cooperative human + AI learning supervisor can also manage the complexity of customizing regularization to each of millions of items of training data with any of several techniques. For example, it is possible to define a local objective for an inner node in a neural network. The learning supervisor system can monitor a criterion for early stopping for each node. Thus, the learning supervisor system can decide to apply local early stopping on a node-by-node basis only when and where it is needed. In local early stopping, the learning supervisor system prevents back propagation from a particular node while normal back propagation proceeds elsewhere, as sketched in the example below. Such selectivity is needed if some parts of a network are close to overfitting while other parts are still in an earlier stage of training. This situation prevails, for example, if a network is built incrementally with continuing training.
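The following is a minimal sketch, assuming PyTorch, of one way local early stopping could be realized: the learning supervisor flags individual nodes, and a gradient hook zeroes the gradient flowing back from flagged nodes while backpropagation proceeds normally through the rest of the layer. The layer class, the per-node flag buffer, and the flagging logic are illustrative assumptions; the text does not specify a particular implementation.

    # Minimal sketch (PyTorch assumed): per-node local early stopping by masking
    # the gradient that flows back from nodes the learning supervisor has flagged
    # as "stopped", while back propagation proceeds normally elsewhere.
    import torch
    import torch.nn as nn

    class LocallyStoppableLayer(nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)
            # One flag per node (output unit); True means "stop back propagation
            # from this node".
            self.register_buffer("stopped", torch.zeros(out_features, dtype=torch.bool))

        def forward(self, x):
            out = torch.relu(self.linear(x))
            if out.requires_grad:
                mask = (~self.stopped).float()
                # Zero the gradient flowing back from stopped nodes only.
                out.register_hook(lambda grad: grad * mask)
            return out

    # The supervisor (hypothetical monitoring logic) would flag a node when its
    # local early-stopping criterion fires, e.g. a worsening local objective.
    layer = LocallyStoppableLayer(16, 8)
    layer.stopped[3] = True   # stop back propagation from node 3 only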
Another technique allows customization to an individual node, to an individual datum, or to any combination of the two. This highly flexible technique is node-to-node knowledge sharing. The primary purpose of node-to-node knowledge sharing is to propagate interpretable knowledge learned by one node to other nodes. However, the mechanism of node-to-node knowledge sharing imposes a local objective on each knowledge-receiving node, which enables a node-specific regularization on the knowledge receiving. The application of node-to-node knowledge sharing may be conditional on the datum being in a specified set, which enables the regularization to be customized to an individual datum. The humans in the cooperative learning supervisor system provide the insight and the confirmation of interpretability, which drives the process. The AI system in the cooperative learning supervisor handles the complexity of managing the large number of nodes and the large amount of data.
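As one illustration of how such a local, datum-conditional objective could be expressed, the following sketch, assuming PyTorch, adds a node-to-node knowledge-sharing penalty that pulls a knowledge-receiving node's activation toward a knowledge-providing node's activation, only on data in a specified set and with a node-specific strength set by the supervisor. All names here (receiver_act, source_act, shared_set_mask, share_strength) are hypothetical and chosen only for the example.

    # Minimal sketch (PyTorch assumed): node-to-node knowledge sharing as a local,
    # datum-conditional regularization term added to the training loss.
    import torch

    def knowledge_sharing_loss(receiver_act, source_act, shared_set_mask, share_strength):
        # receiver_act:    (batch,) activations of the knowledge-receiving node
        # source_act:      (batch,) activations of the knowledge-providing node
        # shared_set_mask: (batch,) bool, True where the datum is in the specified set
        # share_strength:  scalar, node-specific strength set by the learning supervisor
        #
        # Local objective: pull the receiving node's activation toward the source
        # node's (detached) activation, only on data in the specified set.
        diff = (receiver_act - source_act.detach()).pow(2)
        penalty = (diff * shared_set_mask.float()).mean()
        return share_strength * penalty

    # Illustrative use inside a training step (task_loss computed elsewhere):
    # total_loss = task_loss + knowledge_sharing_loss(recv, src, mask, 0.1)

Because the penalty is gated by a per-datum mask and scaled by a per-node strength, the supervisor can tune the regularization independently for each receiving node and for each item of training data.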
by James K Baker