Incremental Development and Growth is a proposed new paradigm for training large, deep neural networks and other complex machine learning systems. It is "incremental" in the sense that the development proceeds in many steps, with the size of the network increased by a moderate amount in each step. This incremental approach to developing a large network is in contrast to the current main paradigms. It encompasses several distinct but overlapping aspects, which are described below.
There are two main paradigms for training a new neural network design for a challenging task:
These two paradigms have been very successful. Generally, one or both of them have been used as the paradigm in research that has successfully trained new, larger networks that broke previous records. However, several problems still remain:
A new paradigm is proposed that has several distinct, but related aspects: Incremental development and testing, incremental growth, incremental training, and use of an intelligent learning management system. In keeping with the theme of this website, it is recommended that the learning management system be a system based on cooperation between a human team and one or more AI systems. Each of the other aspects will be described briefly below. Note that these terms do not have established well-known definitions. In understanding discussions on this website, interpret these terms as described here. Do not assume that they have the same meaning as the same or similar terms used elsewhere. On this page, only a brief description is given for each of these concepts. More information is available on the respective linked pages.
The phrase "Incremental Development," by itself might be an all-encompassing term referring to many aspects and techniques for the incremental training of a machine learning system, including the other aspects discussed here. The additional phrase "and Testing" is meant to convey a narrower focus, not to add an additional aspect.
In Incremental Development and Testing, the distinguishing characteristic is that, during the incremental development process, there are many rounds of testing the performance on development data that has been set aside from the training data. This development testing is used by the human developers and/or the AI in a Learning Management System to make decisions about what to do in later rounds of the incremental development. Because repeated testing may cause the system to adapt to properties of the development data, in Incremental Development and Testing, best practices dictate that disjoint sets of development data be used for successive rounds of development. If there is a limited amount of development data and too many incremental steps, development testing on fresh development data may be limited to once every several steps. There are also techniques that are incremental in the sense used here but that make a large relative increase in the network size, such as Doubling, which doubles the size of the network in each step. Doubling is incremental in terms of techniques such as imitating the previous network and being guided by node-to-node knowledge sharing links from that previous network. Each new layer is inserted between two existing layers. With this interleaving, the depth of the new network may be anywhere from N + 1 to 2N layers, where N is the depth of the existing network, as sketched in the example below.
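To make the interleaving concrete, here is a minimal NumPy sketch of doubling the depth of a plain ReLU stack. It assumes one particular way of preserving the existing function, namely initializing each inserted layer as an identity map with a zero bias; the description above does not prescribe a specific initialization, so treat this as an illustration rather than the method itself.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, layers):
    """Apply a stack of (weight, bias) layers with ReLU activations."""
    for W, b in layers:
        x = relu(x @ W + b)
    return x

def interleave_identity_layers(layers):
    """Insert an identity-initialized layer after each existing layer.

    Because ReLU outputs are non-negative, an identity weight matrix with a
    zero bias leaves every activation unchanged, so the doubled-depth network
    computes exactly the same function as the original one.  The new layers
    are then free to learn during subsequent training.
    """
    grown = []
    for W, b in layers:
        grown.append((W, b))
        width = W.shape[1]
        grown.append((np.eye(width), np.zeros(width)))
    return grown

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 16)), np.zeros(16)),
          (rng.normal(size=(16, 4)), np.zeros(4))]
x = rng.normal(size=(5, 8))
grown = interleave_identity_layers(layers)
assert np.allclose(forward(x, layers), forward(x, grown))  # performance preserved
```

Inserting an identity layer after only some of the existing layers, rather than after all of them, yields any depth between N + 1 and 2N.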
As in manual development, in incremental development the testing is used to make design decisions, to detect problems of bias or variance, and to tune hyperparameters. However, in HAT-AI, best practices use a cooperative human + AI learning management system. This learning management system is capable of handling millions of hyperparameters. A node-to-node knowledge sharing link and any associated hyperparameters, for example, may be customized to an individual knowledge-receiving node. Each node-to-node knowledge sharing link may also be data-specific, that is, it may apply only to a selected subset of the data. When a substantial number of customized hyperparameters have been tuned using a set of development data, it is recommended that that data no longer be used for further development testing. This former development data may instead be added to the training set, turning it into training data for future rounds of development. The selection of data for a data-specific knowledge sharing link should be done automatically by a subnetwork or companion network that is trained only on training data.
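As a rough illustration of what per-link, data-specific hyperparameters might look like, the following sketch computes a hypothetical node-to-node knowledge-sharing penalty in which each link carries its own strength hyperparameter and a mask selecting the subset of data it applies to. The function name, the squared-difference form of the penalty, and the dictionary-of-activations interface are assumptions made for this example.

```python
import numpy as np

def knowledge_sharing_penalty(old_acts, new_acts, links):
    """Hypothetical node-to-node knowledge-sharing penalty.

    old_acts, new_acts: dicts mapping node name -> activation values over a
    batch of examples.  links: list of (old_node, new_node, weight, mask),
    where `weight` is a per-link hyperparameter and `mask` is a boolean
    array selecting the subset of data the link applies to.
    """
    penalty = 0.0
    for old_node, new_node, weight, mask in links:
        diff = new_acts[new_node][mask] - old_acts[old_node][mask]
        penalty += weight * np.mean(diff ** 2) if mask.any() else 0.0
    return penalty

rng = np.random.default_rng(1)
old_acts = {"h3": rng.normal(size=20)}
new_acts = {"g7": rng.normal(size=20)}
mask = rng.normal(size=20) > 0          # data-specific: only some examples
links = [("h3", "g7", 0.1, mask)]       # 0.1 is a per-link hyperparameter
print(knowledge_sharing_penalty(old_acts, new_acts, links))
```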
In Incremental Growth, new connections and nodes are added to an existing network and initialized in a way that maintains the performance of the existing network (see Incremental Training). The new network will have greater representation capacity than the old network. In addition, for almost every new connection, the partial derivative of the objective, including regularization as well as the error loss function, will have a non-zero magnitude. The magnitude for a potential new connection may be computed in advance, and the connections to add may be chosen on that basis.
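The following NumPy sketch illustrates the last two sentences for a toy linear model with a mean-squared-error objective: the gradient is evaluated for every connection that does not yet exist, the candidate with the largest magnitude is added, and it is initialized to zero so the network's outputs, and hence its performance, are unchanged at the moment of growth. The model, objective, and selection rule are simplifications chosen only for illustration.

```python
import numpy as np

# A toy linear model y_hat = x @ W with a sparse connection pattern.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Y = rng.normal(size=(100, 3))
W = np.zeros((6, 3))
active = np.zeros_like(W, dtype=bool)   # which connections currently exist
active[:2, :] = True                    # a small existing network
W[active] = rng.normal(size=active.sum())

# Gradient (up to a constant factor) of the mean squared error with respect
# to every weight, including weights of connections not yet added.
residual = X @ W - Y
grad = X.T @ residual / len(X)

# Score each candidate connection by gradient magnitude and add the best one
# with an initial weight of zero, so the network's output (and hence its
# performance) is unchanged at the moment of growth.
candidates = ~active
scores = np.where(candidates, np.abs(grad), -np.inf)
i, j = np.unravel_index(np.argmax(scores), scores.shape)
active[i, j] = True
W[i, j] = 0.0   # already zero; made explicit: growth does not change outputs
print(f"added connection {i} -> {j} with |dL/dw| = {abs(grad[i, j]):.3f}")
```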
The attribute "incremental" is relative. In one technique, the amount of knowledge transferred from the existing network is so great that the new network may be double the size of the existing network, or more. In general, the training of the new network is greatly facilitated by using imitation learning and knowledge sharing. In imitation learning, the transfer of knowledge from the smaller network to the larger network continues thoughout the training, not just at initialization as in Transfer Learning. In addition, every node in the smaller network may be mapped to a corresponding node in the larger network, so there is an abundance of ordered pairs to use for knowledge sharing links from the smaller network to the larger network. These knowledge sharing links help retain any interpretations that are present in the smaller network.
Incremental Training is the natural complement to Incremental Growth. Although Incremental Training has other aspects, its essence is that the training of the larger network starts from a set of learned parameters for which the performance of the new network is equivalent to that of the smaller network; gradient descent training then resumes and immediately begins to improve the performance. Thus, the training process for the new, larger network starts from a configuration that is already well trained on the same task. The amount of additional training may thereby be much less than when starting from scratch.
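One simple way to obtain such an equivalent starting configuration when widening a hidden layer is to give the new units random incoming weights but zero outgoing weights; the NumPy sketch below verifies that the grown network computes the same function as the original. This is only one of several possible schemes (unit duplication is another) and is an assumption made for the example, not a scheme prescribed by the text above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def widen_hidden_layer(W1, b1, W2, n_new, rng):
    """Add n_new hidden units without changing the network's function.

    New units get random incoming weights but zero outgoing weights, so the
    output of the two-layer network relu(x @ W1 + b1) @ W2 is unchanged;
    gradient descent can then immediately start making use of the new units.
    """
    W1_new = np.hstack([W1, rng.normal(scale=0.1, size=(W1.shape[0], n_new))])
    b1_new = np.concatenate([b1, np.zeros(n_new)])
    W2_new = np.vstack([W2, np.zeros((n_new, W2.shape[1]))])
    return W1_new, b1_new, W2_new

rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(8, 16)), np.zeros(16), rng.normal(size=(16, 3))
x = rng.normal(size=(4, 8))
W1g, b1g, W2g = widen_hidden_layer(W1, b1, W2, n_new=8, rng=rng)
assert np.allclose(relu(x @ W1 + b1) @ W2, relu(x @ W1g + b1g) @ W2g)
```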
Incremental training may also be used in situations other than repeated Incremental Growth. In training a neural network with many layers, there are often many saddle points in the objective function being minimized. The stochastic gradient descent training process is often very slow when the system approaches one of these saddle points. The geometry near a saddle point of a high-dimensional function is often like a curving canyon with high, steep sides approaching the saddle point and then a second curving canyon with high, steep sides leaving it. The gradient descent process must first slowly approach the saddle point along the first canyon. It must get close enough to the saddle point to sense the direction of the second canyon. Because the gradient is zero at the saddle point, this approach, turn, and recede process may require many updates of the iterative training.
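A toy two-dimensional example makes this geometry concrete (the function is chosen only for illustration):

```latex
f(x, y) = x^2 - y^2, \qquad
\nabla f(0, 0) = (0, 0), \qquad
H(0, 0) = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}
```

The gradient vanishes at the origin while the Hessian has one positive and one negative eigenvalue, so gradient descent creeps toward the origin along the x-direction and only picks up speed again once it senses the descending y-direction.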
This slow learning process may occur in training any neural network, whether or not incremental growth is being used. However, adding structure followed by Incremental Training can help stochastic gradient descent escape from any stationary point, even from a local or global minimum, not just a saddle point. The structure to be added may be chosen to have new elements for which the magnitude of the partial derivative of the objective is as large as possible. A single new connection with a large derivative is enough to help the iterative training process escape from a stationary point, as illustrated below.
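Here is a minimal numeric example of that escape: a one-parameter model has converged (its gradient is zero), yet a single new connection, initialized to zero so the output is unchanged, has a large derivative and lets one gradient step reduce the loss immediately. The data, model, and learning rate are arbitrary choices made for the illustration.

```python
import numpy as np

# A one-parameter model y_hat = w1 * x1 that has converged: the gradient of
# its loss is (numerically) zero, yet the loss is far from minimal because a
# relevant input x2 has no connection at all.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y = 0.5 * x1 + 2.0 * x2                       # the data actually depend on x2
loss = lambda pred: np.mean((pred - y) ** 2)

w1 = np.dot(x1, y) / np.dot(x1, x1)           # least-squares optimum for x1 alone
grad_w1 = np.mean(2 * (w1 * x1 - y) * x1)     # ~0: training has stalled

# Grow the model: connect x2 with an initial weight of zero (output unchanged
# at the moment of growth) and take one gradient step.  The new connection has
# a large derivative, so the loss drops immediately.
w2 = 0.0
grad_w2 = np.mean(2 * (w1 * x1 + w2 * x2 - y) * x2)
w2 -= 0.1 * grad_w2
print(f"grad_w1 = {grad_w1:.2e}, grad_w2 = {grad_w2:.2f}")
print(f"loss before growth: {loss(w1 * x1):.3f}  after one step: {loss(w1 * x1 + w2 * x2):.3f}")
```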
In the case of Incremental Growth, the selection of the right new structure to add, followed by Incremental Training, may allow the new network to be trained much more quickly than starting from scratch. If the Incremental Growth is done when the smaller network is near a stationary point, the training of the larger network may even be faster than continued training of the smaller network. Note that, at convergence, the training of any neural network will be approaching a stationary point, so the opportunity for Incremental Growth followed by Incremental Training of the new network will always occur.
The core of human-assisted training for artificial intelligence is to have a cooperative team of humans and machines manage the training process. The embodiment of this cooperative management of the training is a human + AI learning management system. This learning management system itself needs to be trained.
In some cases, the Learning Management System may need to explore the space of network architectures and/or the space of hyperparameters. One of the advantages of a human + AI learning management system is that the hyperparameters may be customized, so the number of hyperparameters may be as large as the number of learned parameters. For example, knowledge sharing links tend to decrease rather than increase the number of degrees of freedom, so the number of node-to-node knowledge sharing links may be greater than the number of connections in the network. Each of these knowledge sharing links may have one or more hyperparameters that may be dynamically adjusted by the Learning Management System during the training process.
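What "dynamically adjusted" might look like is sketched below as a simple multiplicative rule: links whose measured contribution on development data is positive are strengthened, and the rest are weakened. The rule, the function name, and the attribution of a development-set gain to an individual link are all hypothetical; a real Learning Management System could use a much more sophisticated policy.

```python
def adjust_link_weights(links, dev_gain, up=1.1, down=0.5):
    """Hypothetical dynamic adjustment of per-link hyperparameters.

    `links` maps link id -> strength; `dev_gain` maps link id -> measured
    improvement on development data attributed to that link (positive means
    the link is helping).  Links that help are strengthened; others are
    weakened.
    """
    return {k: w * (up if dev_gain.get(k, 0.0) > 0 else down)
            for k, w in links.items()}

links = {("h3", "g7"): 0.1, ("h5", "g9"): 0.2}
dev_gain = {("h3", "g7"): 0.01, ("h5", "g9"): -0.02}
print(adjust_link_weights(links, dev_gain))
```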
Training the learning management system to navigate this search may be done by Reinforcement Learning, a well-known machine learning technique that has demonstrated many dramatic successes, often with deep neural networks performing some of the subtasks within the reinforcement learning process.
As an instance of the maxim, "you should be willing to eat your own cooking," it is recommended that the learning management system be a human-assisted AI system and that human-assisted training be used in its training. However, applying human assistance during reinforcement learning is a new area of research. For example, in experiments with AlphaGo Zero, which was trained by reinforcement learning, it was discovered that the system performed better when it was trained from scratch than when expert games were included in its training data. A learning system designed as a cooperation between humans and machines may be better able to capture and benefit from human knowledge and intuition.
The ability to add structure without degrading performance is essential to the processes of incremental growth and training described here. This technique seems obvious in hindsight but has been largely neglected in the development of deep learning. For example, even after it was discovered in the 1980s that gradient descent training using back propagation could successfully train neural networks with more than one layer, progress in training networks with more than one hidden layer was disappointingly slow, with convolutional neural networks as the main exception. However, the success with convolutional neural networks did not generalize to general-purpose network architectures.
In the twenty-first century, after deep learning had been successful for several years using algorithms that didn't exist in the 1980s, it was discovered that the algorithms of the 1980s could also train deeper neural networks successfully on problems with large quantities of data, but only with an amount of computing power that wasn't readily available in the 1980s. Incremental growth and incremental training could have been used to reduce the amount of computation, perhaps to the amount available in the 1980s.
In addition, with techniques such as one-shot learning, incremental machine learning does not necessarily require a large amount of data. Furthermore, incremental growth and training used with one-shot learning tend to limit the amount of computation to no worse than proportional to the number of errors, which is small for problems with only a small amount of data. In other words, incremental machine learning with neural networks, based on incrementally adding structure without degrading performance, would have been computationally feasible in the 1980s with the amount of computing power then available.
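As a sketch of why the work can scale with the number of errors rather than with the dataset size, the following hypothetical one-shot corrector stores one template per observed error and lets a sufficiently close input override the base classifier. The class, its radius threshold, and the nearest-template rule are illustrative assumptions, not a description of a specific one-shot learning method from the text above.

```python
import numpy as np

class TemplateCorrector:
    """Hypothetical one-shot error correction by adding template nodes.

    Each observed training error adds one stored example (a "template"); at
    prediction time an input close enough to a stored template overrides the
    base classifier.  The added work grows with the number of errors, not
    with the size of the dataset.
    """
    def __init__(self, base_predict, radius=0.5):
        self.base_predict = base_predict
        self.radius = radius
        self.templates = []          # list of (example, correct_label)

    def correct(self, x, label):     # called once per observed error
        self.templates.append((np.asarray(x), label))

    def predict(self, x):
        x = np.asarray(x)
        for t, label in self.templates:
            if np.linalg.norm(x - t) < self.radius:
                return label
        return self.base_predict(x)

base = lambda x: 0                          # a stand-in base classifier
clf = TemplateCorrector(base, radius=0.5)
clf.correct([1.0, 2.0], label=3)            # one-shot fix for one observed error
print(clf.predict([1.05, 2.0]), clf.predict([5.0, 5.0]))   # -> 3, 0
```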
Using incremental development, incremental growth, and incremental training, given a sufficient amount of training and development data, it is possible to continue growing a network until it reaches optimum performance. With one-shot learning, any targeted error may be corrected, down to the minimum Bayes error. One-shot learning and incremental growth by adding structure also enable a classifier to be quickly grown into a network that learns a new category. With knowledge sharing and other regularization, it is possible to grow and train a network to be the best system that can be found, as measured by performance on the available development data.
For experimenters: Note that an imitation learning task has a potentially unlimited amount of data. If you have a novel idea, first try it out on an imitation learning task, where you can control the amount of data in order to separate issues caused by insufficient data from other issues. Go to the Experimenters' Page for more suggestions.
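For instance, a fixed teacher network can label freshly drawn random inputs on demand, so the amount of imitation-learning data is limited only by how many batches you choose to generate. The helper below is a hypothetical sketch of such a data source.

```python
import numpy as np

def make_imitation_batch(teacher, batch_size, input_dim, rng):
    """Generate a fresh batch for an imitation-learning task.

    Inputs are drawn at random and labeled by a fixed teacher network, so the
    amount of training data is limited only by how many batches you draw.
    """
    x = rng.normal(size=(batch_size, input_dim))
    return x, teacher(x)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
teacher = lambda x: np.maximum(x @ W, 0.0)     # any fixed teacher network will do
x, y = make_imitation_batch(teacher, batch_size=32, input_dim=8, rng=rng)
```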
by James K Baker and Bradley J Baker