Network Architecture Search: AutoML and others

Automatic network architecture search is what the media advertises as AI that creates AI. Although it sounds very cool, and sometimes a little scary as a concept, automatic architecture search has its limitations. That’s why it hasn’t become mainstream yet.
I personally believe that we will eventually get to a point where we don’t have to design neural networks by hand, but we are not there yet. One big challenge with automated architecture search is that there are infinitely many ways to design a model, so you still have to define a search space manually to narrow down the search. Many network architecture search methods therefore consist of two parts: a search space and an algorithm to navigate within that search space.
For example, a search space may consist of possible cell configurations in a 10-layer architecture, and a very naive search algorithm would be a random search. More sophisticated methods use controller models to decide what direction to explore next. Some common themes in architecture search include:
- using reinforcement learning or evolutionary algorithms to propose networks;
- proposing a large parent model and then searching within its sub-networks;
- or defining a macro architecture and optimizing pieces of it.
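To make the search-space idea concrete, here is a minimal random-search sketch in Python. The operations, widths, and the evaluate function are all invented for illustration; in a real search, evaluate would train the candidate model and return its validation accuracy.

```python
import random

# Toy search space: for each of 10 layers, choose an operation and a width.
OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]
WIDTHS = [16, 32, 64]

def sample_architecture(num_layers=10):
    """Draw one architecture uniformly at random from the search space."""
    return [(random.choice(OPS), random.choice(WIDTHS)) for _ in range(num_layers)]

def evaluate(arch):
    """Stand-in for 'train the model and measure validation accuracy'.
    Here it is a deterministic toy score so the example is runnable."""
    return sum(w for _, w in arch) / (640 * len(arch))

def random_search(num_trials=100, seed=0):
    """The naive baseline: sample architectures at random, keep the best."""
    random.seed(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture()
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

best_arch, best_score = random_search()
print(len(best_arch), round(best_score, 3))
```

Even this naive baseline makes the two ingredients visible: a space of possible models and a rule for walking through it.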
In a very broad sense, what an architecture search algorithm tries to do is to eliminate the knobs in a system by tuning them automatically.
However, the current approaches in architecture search are still not hyperparameter-free. They introduce new hyperparameters while eliminating some.
For example, the models that generate architectures still need to be hand-designed. Furthermore, hyperparameters like the learning rate, regularization strength, and the type of optimizer are almost always left outside the scope of the search.
The choice of the search space also needs careful tuning in many cases. Another problem with automated model design is how to define and measure the success of a model. Many network architecture search papers try to optimize the validation error on the CIFAR-10 dataset.
Does a lower error on the validation set really always mean a better model? Especially when you try so many different model configurations, a lower error could be due to pure chance. Some of the papers evaluate how well their automatically discovered architectures transfer to other datasets. Still, it seems that many methods might be leaking information from the validation set by ‘optimizing’ the network architecture to minimize validation error.
This problem is not specifically an automated architecture search problem though. You can have the same problem with hand-designed model architectures as well when you do too many experiments. Alright, let’s take a look at some of the highly-cited network architecture search papers.
The first paper, titled Neural Architecture Search with Reinforcement Learning, is one of Google Brain’s AutoML papers. The proposed system uses a recurrent neural network-based controller that proposes network architectures. The proposed models are trained from scratch until convergence. If a trained model performs well, the controller gets rewarded; if it doesn’t, the controller gets penalized. The controller is updated through this reinforcement learning mechanism.
As this process repeats, the controller learns to minimize the expected validation error and starts proposing better and better architectures in the next iterations. As you can guess, this entire process is very computationally expensive. The authors used 800 GPUs for 28 days to perform this search. Let’s move on to the next paper, which is titled Learning Transferable Architectures for Scalable Image Recognition, also known as the NASNet paper.
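The controller-and-reward loop can be sketched with a tiny REINFORCE example. Everything here is a toy stand-in: the controller is a table of logits rather than an RNN, and the reward function simply pretends that one operation is best everywhere, whereas the real system would train each proposed model and use its validation accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, num_ops = 4, 3
logits = np.zeros((num_layers, num_ops))  # toy "controller" parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_arch():
    """The controller proposes an architecture: one op index per layer."""
    return [rng.choice(num_ops, p=softmax(logits[i])) for i in range(num_layers)]

def reward(arch):
    # Toy stand-in for validation accuracy: pretend op 2 is best at every layer.
    return sum(1.0 for op in arch if op == 2) / len(arch)

baseline, lr = 0.0, 0.5
for step in range(500):
    arch = sample_arch()
    r = reward(arch)
    baseline = 0.9 * baseline + 0.1 * r          # moving-average baseline
    advantage = r - baseline
    for i, op in enumerate(arch):                # REINFORCE update
        grad = -softmax(logits[i])               # d log p(op) / d logits
        grad[op] += 1.0
        logits[i] += lr * advantage * grad

probs = np.array([softmax(logits[i]) for i in range(num_layers)])
print(probs[:, 2])  # probability of the "good" op at each layer
```

Over the iterations, the controller shifts probability mass toward choices that earn higher rewards, which is exactly the dynamic the paper relies on, just at a vastly larger scale.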
This is a follow-up paper from Google Brain. It has a smaller search space as compared to the original paper. The authors narrowed down the search space by hand-designing the overall macro architecture and searching for optimal blocks. Specifically, they search for reduction cells that downscale their inputs and normal cells that return a feature map of the same dimension as their input. Once these cell architectures are learned, they can be transferred to other models as well.
For example, cells learned from CIFAR-10 can be used in an ImageNet model by increasing the number of reduction cells since ImageNet has a higher resolution. Despite the smaller search space, their approach needed 500 higher-end GPUs for 4 days. Another follow-up work, named Progressive Neural Architecture Search, required 5x fewer network evaluations as compared to NASNet, but even that is not so cheap.
Another paper from Google Brain, called AmoebaNet, uses an evolutionary algorithm instead of reinforcement learning to search for architectures. An evolutionary algorithm simply mutates existing architectures and kills the ones that don’t perform well. Over time, better and better model architectures evolve. AmoebaNet uses the same search space as NASNet. Again, it’s very computationally expensive and uses a lot of Tensor Processing Units.
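A bare-bones version of that mutate-and-discard loop might look like this. The fitness function is a made-up stand-in for validation accuracy, and the "kill the oldest" rule mimics the aging-style evolution used in this line of work.

```python
import random

OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]

def fitness(arch):
    # Stand-in for validation accuracy; arbitrarily favors conv3x3 for the demo.
    return arch.count("conv3x3") / len(arch)

def mutate(arch):
    """Copy the parent and change one randomly chosen position."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return child

random.seed(0)
population = [[random.choice(OPS) for _ in range(8)] for _ in range(20)]
for _ in range(200):
    parent = max(random.sample(population, 5), key=fitness)  # tournament selection
    population.append(mutate(parent))
    population.pop(0)  # discard the oldest individual (aging evolution)

best = max(population, key=fitness)
print(fitness(best))
```

In the real system, each fitness evaluation means training a full network, which is where the enormous compute bill comes from.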
Now let’s take a look at some of the more efficient solutions. The next paper, titled Efficient Neural Architecture Search via Parameter Sharing, is essentially an approximation of NASNet. It produces almost as good results as NASNet on a single GPU in less than a day. The speed-up comes from sharing parameters between the proposed networks instead of trashing them every single time and training from scratch.
The authors do this by sampling sub-networks from a large parent network, where all sub-networks are forced to share weights. Once they find the best-performing model, they retrain that final architecture from scratch.
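The parameter-sharing trick can be illustrated with a toy weight table. The "training" here is a fake update that just increments a number; the point is only that every sub-network touching the same (layer, op) pair reads and writes the same shared weights.

```python
import random

# Shared weight table: one weight vector per (layer, op) in the parent network.
num_layers, ops = 3, ["op_a", "op_b"]
shared = {(l, op): [0.0] for l in range(num_layers) for op in ops}

def sample_subnetwork():
    """A sub-network is just a choice of one op per layer."""
    return [random.choice(ops) for _ in range(num_layers)]

def train_step(sub):
    # Only the weights along the sampled path are updated; every other
    # sub-network that uses these (layer, op) pairs benefits from the update.
    for l, op in enumerate(sub):
        shared[(l, op)][0] += 0.1

random.seed(0)
for _ in range(50):
    train_step(sample_subnetwork())

# Two different sub-networks that pick the same op at layer 0 see the
# exact same weights there -- that is the parameter sharing.
print(shared[(0, "op_a")], shared[(0, "op_b")])
```

Because no sub-network is ever trained from scratch, thousands of candidates can be evaluated for roughly the cost of training the parent once.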
The next paper is titled Efficient Architecture Search by Network Transformation. Like many others, this approach also uses a reinforcement-learning-based controller to explore a search space. However, the authors use function-preserving transformations in their search to transfer as much as possible from previously trained models.
They modify architectures in a way that uses a different parametrization to represent the same underlying function.
For example, a new layer can be added to a block in a way that doesn’t change the output. The additional layer would increase the representational power of the network and create more room for exploration without training the rest of the network from scratch. This idea originated from another paper titled Network Morphism.
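Here is a minimal numpy sketch of such a function-preserving transformation, using purely linear layers to keep things simple (real network morphisms also handle nonlinearities): inserting an identity-initialized layer makes the network deeper without changing its output.

```python
import numpy as np

def forward(x, layers):
    """Apply a stack of linear layers to input x."""
    for W in layers:
        x = W @ x
    return x

# A tiny two-layer linear network with random weights.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)), rng.normal(size=(4, 4))]
x = rng.normal(size=4)
before = forward(x, layers)

# Function-preserving deepening: insert a new layer initialized to the
# identity. The network is deeper, but computes exactly the same function,
# so training can continue without starting from scratch.
layers.insert(1, np.eye(4))
after = forward(x, layers)

print(np.allclose(before, after))  # the new layer changed nothing (yet)
```

The new identity layer is then free to drift away from the identity during further training, adding capacity without discarding what was already learned.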
In a way, this is somewhat similar to a greedy layer-by-layer search approach. Instead of modifying and training only the last layer, network morphism allows for modifying any layer without breaking the following layers. It’s time for some LEMONADE, which stands for Lamarckian Evolutionary algorithm for Multi-Objective Neural Architecture Design. LEMONADE employs an evolutionary algorithm for architecture search and uses network morphisms to modify networks.
Among the networks that were sampled using morphisms, the good ones are kept, the others are discarded, and the process repeats. Unlike prior work, LEMONADE tries to optimize architectures for multiple objectives at the same time, such as predictive performance and the number of parameters. The reference to Lamarckian evolution in the title implies that acquired traits are passed on to the next generation.
In the context of architecture search, that’s just a fancy way of saying the learned weights are reused in subsequent iterations. LEMONADE’s performance is comparable to other methods. In terms of computational needs, it’s much more efficient than NASNet but still expensive compared to more lightweight methods.
The next one on our list is Auto-Keras. This paper also makes use of network morphisms, like the previous ones. The authors explore the search space by morphing network architectures, guided by a Bayesian optimization algorithm.
Their framework is available as an open-source project at autokeras.com. The next paper, titled Convolutional Neural Fabrics, proposes what the authors call a “convolutional neural fabric.” This fabric consists of a grid that connects nodes at different layers, scales, and channels, with a sparse local connectivity pattern. A convolutional neural fabric is essentially a macro architecture with many parallel paths on which the architecture search is performed.
Theoretically, an infinitely large fabric would be a universal architecture that can implement any neural network architecture. The next paper is titled DARTS: Differentiable Architecture Search.
Most of the papers we have covered so far used either evolutionary algorithms or reinforcement learning to search for architectures. This paper takes a different approach and formulates architecture search as a differentiable, continuous optimization problem. To make the search space continuous, the authors relax the categorical choice of a particular operation into a softmax over all possible operations.
Once the search is over, the final architecture is discretized by choosing the ops with the highest softmax values. Their approach is reminiscent of sub-network search and fabric-based search algorithms since all possible operations are present during training.
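The continuous relaxation can be sketched in a few lines. The candidate ops here are toy one-dimensional functions, and the architecture parameters alpha are fixed by hand, whereas DARTS learns them with gradient descent.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Candidate operations on an edge of the cell (toy versions).
ops = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "zero":     lambda x: 0 * x,
}

# Architecture parameters: one logit per candidate op. In DARTS these are
# learned by gradient descent; here they are fixed for illustration.
alpha = np.array([2.0, 0.5, -1.0])

def mixed_op(x):
    """Continuous relaxation: the edge computes a softmax-weighted sum
    of all candidate ops instead of committing to one."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops.values()))

x = np.ones(3)
print(mixed_op(x))

# Discretization after search: keep the op with the largest weight.
best = list(ops)[int(np.argmax(alpha))]
print(best)  # "identity" has the largest logit here
```

Because the mixture is differentiable in alpha, the architecture choice itself can be trained with the same backpropagation machinery as the weights.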
Their search space is similar to NASNet’s in that they search for cell architectures. DARTS is among the most efficient neural architecture search algorithms, requiring only 4 GPU-days for CIFAR-10 classification. The next paper, titled Neural Architecture Optimization, also adopts a differentiable approach. Their method searches for network architectures in a continuous embedding space.
Their framework consists of an encoder, a decoder, and a performance predictor. The encoder embeds topologies into a continuous network-embedding space. The performance predictor takes a network embedding as input and predicts its performance. The decoder maps a network embedding back into a discrete topology representation. These three components are jointly trained. In essence, the model tries to learn which topologies work and which don’t by optimizing a performance predictor in the embedding space.
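A heavily simplified sketch of searching in an embedding space: a hand-made code table plays the encoder, a linear function plays the performance predictor, and a nearest-neighbor lookup plays the decoder. In the actual paper, all three are learned networks trained jointly.

```python
import numpy as np

# Toy continuous embedding space: each known architecture has a 2-D code.
codes = {"arch_a": np.array([0.0, 0.0]),
         "arch_b": np.array([1.0, 0.0]),
         "arch_c": np.array([1.0, 1.0])}

# Linear performance predictor f(z) = w . z (a learned network in the paper).
w = np.array([0.5, 1.0])

def predict(z):
    return float(w @ z)

def decode(z):
    """Stand-in decoder: map an embedding back to the nearest known arch."""
    return min(codes, key=lambda a: np.linalg.norm(codes[a] - z))

# Search: gradient ascent on predicted performance in the embedding space.
z = codes["arch_a"].copy()
for _ in range(10):
    z += 0.1 * w  # gradient of the linear predictor is just w

print(decode(z), round(predict(z), 2))
```

The key point is that improving an architecture becomes a continuous optimization step on its embedding, followed by decoding back to a discrete topology.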
Today’s last paper is titled SMASH: One-Shot Model Architecture Search through HyperNetworks. In this paper, the authors train a hypernetwork to predict the weights of a candidate network in one shot instead of training all those candidate networks from scratch. They first train a HyperNet to predict the weights of an arbitrary given network.
Then, they randomly generate many architectures and rather than training those models, they use the HyperNet to obtain the weights.
Finally, they pick the architecture that performs the best with those weights and train it from scratch. Clearly, HyperNet-generated weights are not as good as the ones trained from scratch, but the authors claim that they reflect the relative performance of the architectures well.
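The one-shot idea can be sketched as follows. The "HyperNet" here is just a fixed random linear map and the scoring rule is invented; in SMASH, the hypernetwork is trained so that the generated weights rank architectures roughly the way full training would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "HyperNet": a fixed linear map from an architecture encoding to a
# weight vector. In SMASH this is itself a trained network.
H = rng.normal(size=(8, 4))

def hypernet_weights(arch_encoding):
    """Generate weights for an architecture in one shot, without training it."""
    return H @ arch_encoding

def proxy_score(arch_encoding):
    """Evaluate the architecture under HyperNet-generated weights -- a
    stand-in for measuring validation accuracy with those weights."""
    w = hypernet_weights(arch_encoding)
    return -np.sum(w ** 2)  # toy scoring rule

# Randomly generate candidate architectures, score them with generated
# weights, and keep the best one for real from-scratch training.
candidates = [rng.integers(0, 2, size=4).astype(float) for _ in range(32)]
best = max(candidates, key=proxy_score)
print(best)
```

Only the single winning architecture is then trained for real, which is what makes the whole search so cheap.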
Alright, that’s all for today. I hope you liked it and see you next time.