Recently, there has been a heated debate in the field of deep learning. It all started with a Simply Stats blog post by Jeff Leek titled "Don't use deep learning your data isn't that big." In it, Leek argues that when the sample size is small (which is common in bioinformatics), linear models with few parameters outperform deep networks, even deep networks with only a handful of layers and hidden units. To make his point, he uses an image recognition example based on the MNIST database: classifying 0s versus 1s using only 80 samples, where a simple linear predictor (logistic regression) turns out to be more accurate than a deep neural network.

Andrew Beam, a postdoctoral fellow in biomedical informatics at Harvard Medical School, wrote a rebuttal titled "You can probably use deep learning even if your data isn't that big." Beam pointed out that even with a small dataset, a properly trained deep network can beat a simple linear model. The debate is intensifying as more and more bioinformatics researchers adopt deep learning for a wide variety of problems. Is the hype real? Or are linear models sufficient for all of our needs? The conclusion, as always, is that it depends. In this article, the author walks through some machine learning use cases where deep learning is not a sensible choice, and clears up some common misconceptions about it. The author believes it is these misconceptions that lead to deep learning being used ineffectively, a trap that beginners are especially prone to fall into.

Breaking deep learning preconceptions

First, let's look at some preconceptions that outsiders to the field are prone to, which are really only half-truths. There are two main ones; one of them is more technical, and I will explain it in detail.

Deep learning also works well on small sample sets

Deep learning caught fire in the context of big data (the first Google Brain project fed deep neural networks a huge number of YouTube videos), and ever since, most deep learning content has been about complex algorithms processing large amounts of data.

However, this pairing of big data and deep learning has somehow been misinterpreted to mean the opposite: that deep learning cannot be applied to small samples. If you only have a few samples, feeding them into a neural network with a high parameter-to-sample ratio does seem like a sure road to overfitting. However, considering only the sample size and dimensionality of a given problem, supervised or unsupervised, amounts to modeling the data in a vacuum, without any context. In reality, you likely have data sources related to the problem, or strong prior knowledge that a domain expert can provide, or data that can be structured in a very specific way (for example, encoded as graphs or images). In all of these cases, deep learning has a chance to be a useful option: for example, you can encode a useful representation of a large, related dataset and apply that representation to your problem. A classic example of this is common in natural language processing, where you can learn word embeddings from a large corpus, such as Wikipedia, and then use them in a supervised task on a smaller, narrower corpus. In the extreme case, you can jointly learn a feature representation with a set of neural networks, which is an effective way to reuse that representation on small sample sets. This approach is called "one-shot learning" and has been applied successfully in high-dimensional data fields including computer vision and drug discovery.
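To make the "reuse a representation learned on a big corpus" idea concrete, here is a minimal sketch: pre-trained word vectors are averaged to featurize a tiny labeled dataset, and a simple classifier is fit on top. The `pretrained` dictionary of random vectors is a stand-in; in practice you would load Word2Vec or GloVe vectors trained on something like Wikipedia.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = ["cheap", "great", "broken", "love", "terrible", "works"]
pretrained = {w: rng.normal(size=50) for w in vocab}   # placeholder for real pre-trained embeddings

def featurize(text, dim=50):
    """Represent a document as the mean of its known word vectors."""
    vecs = [pretrained[w] for w in text.lower().split() if w in pretrained]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# The tiny labeled set stands in for the "smaller, narrower corpus" of the supervised task.
docs = ["love it works great", "terrible and broken", "great cheap", "broken terrible"]
labels = [1, 0, 1, 0]

X = np.stack([featurize(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([featurize("works great")]))
```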


One-shot learning networks applied to drug discovery, from Altae-Tran et al., ACS Cent. Sci. 2017

Deep learning is not the answer to everything

The second preconception I hear most often is the hype. Many people who have not yet entered the field expect deep neural networks to deliver the same fabulous performance gains they have produced elsewhere. Others, inspired by deep learning's impressive results on images, music, and language (three data types close to human experience), have dived into the field headfirst, eager to train the latest GAN architecture. In many ways, of course, the hype is real. Deep learning has become an important part of machine learning and an important tool in the data modeler's toolbox. Its popularity has driven the development of essential frameworks such as TensorFlow and PyTorch, which are useful even outside of deep learning. Its underdog-to-superstar story has inspired researchers to revisit previously obscure methods, such as evolutionary algorithms and reinforcement learning. But under no circumstances should deep learning be considered a panacea. Aside from the fact that there is no free lunch, deep learning models are subtle and require careful, sometimes time-consuming, hyperparameter search, tuning, and testing (more on this later). In addition, in many cases using deep learning simply doesn't make sense from a practical point of view, and simpler models get better results.

Deep learning is more than .fit()

There is another aspect of deep learning models that I often see overlooked when people come to them from other areas of machine learning. Most deep learning tutorials and introductory materials describe models as layers of nodes connected in a hierarchy, with the first layer receiving the inputs and the last producing the outputs, and tell you that the network can be trained with some form of stochastic gradient descent (SGD). Some materials briefly mention how stochastic gradient descent works and what backpropagation is, but the bulk of the attention goes to the rich variety of network types (convolutional neural networks, recurrent neural networks, and so on). The optimization methods themselves receive little attention, which is unfortunate, because deep learning works in large part because of these particular optimization methods (see Ferenc Huszar's blog and the papers cited there). Knowing how to optimize the parameters, and how to partition the data so it is used efficiently and the network converges in a reasonable amount of time, is crucial.

Exactly why stochastic gradient descent matters so much is still poorly understood, but clues are appearing here and there. I tend to view the method as part of Bayesian inference. Essentially, whenever you perform some form of numerical optimization, you are performing approximate Bayesian inference under particular assumptions and priors. There is in fact a whole field of research, called probabilistic numerics, that starts from this point of view. Stochastic gradient descent is no different: recent results show that the procedure is actually a Markov chain whose stationary distribution, under certain assumptions, can be viewed as a kind of variational approximation to the posterior. So when you stop SGD and take the final parameters, you are essentially sampling from this approximate distribution.

I find this idea illuminating because it gives the optimizer's parameters (here, the learning rate) more meaning. For example, as you increase the learning rate of stochastic gradient descent, the Markov chain becomes unstable until it finds wide local minima that sample a large area, thereby increasing the variance of the procedure. If, on the other hand, you decrease the learning rate, the Markov chain slowly settles into narrower minima until it converges in a tight region, thereby increasing the bias toward a particular region. Another parameter, the batch size, also controls what kind of region the algorithm converges to: small batches lead to wider regions and large batches to narrower ones.
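A toy illustration of this picture (my own sketch, not the author's analysis): running SGD with mini-batch noise on a simple least-squares problem and looking at how widely the iterates spread around the minimum. The learning rates and batch sizes below are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=1000)   # true weight = 3.0

def sgd_trace(lr, batch_size, steps=2000):
    """Run SGD on 1-D least squares and return the post-burn-in iterates."""
    w = 0.0
    trace = []
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=batch_size)          # random mini-batch
        grad = 2 * np.mean((X[idx, 0] * w - y[idx]) * X[idx, 0])  # noisy gradient estimate
        w -= lr * grad
        trace.append(w)
    return np.array(trace[steps // 2:])                          # discard burn-in

# Larger learning rates / smaller batches leave the chain bouncing in a wider region.
for lr, bs in [(0.02, 64), (0.15, 64), (0.15, 8)]:
    tr = sgd_trace(lr, bs)
    print(f"lr={lr:<5} batch={bs:<3} mean w={tr.mean():.3f}  std of iterates={tr.std():.3f}")
```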


Stochastic gradient descent converges to wide or narrow minima depending on the learning rate and batch size

All this complexity means that the optimizer of a deep network matters: it is a core part of the model, just as important as the layer architecture. This is uncommon in many other machine learning models. Linear models (even regularized ones, like the LASSO) and support vector machines (SVMs) are convex optimization problems without much nuance and with a single optimal solution. That is why researchers coming from other fields and from tools such as scikit-learn get confused when they cannot find an API that simply exposes a .fit() method (although some tools, such as skflow, try to squeeze simple networks into a .fit() signature, which I think is a bit misleading, because the whole point of deep learning is its flexibility).
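The contrast can be made concrete with a small sketch (assuming scikit-learn and PyTorch are available; the data is random and the hyperparameters are purely illustrative). For the convex model, the optimizer hides behind .fit(); for the deep net, the optimizer, learning rate, batching, and stopping rule are all part of your code.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

X = np.random.randn(200, 10).astype("float32")
y = (X[:, 0] > 0).astype("int64")

# Convex model: one optimum, the optimization is hidden behind .fit()
logreg = LogisticRegression().fit(X, y)

# Deep net: you explicitly pick the optimizer, learning rate, and training loop
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
Xt, yt = torch.from_numpy(X), torch.from_numpy(y)
for epoch in range(50):          # the training loop itself is your code
    opt.zero_grad()
    loss = loss_fn(model(Xt), yt)
    loss.backward()
    opt.step()
```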

When is deep learning not necessary

Under what circumstances is deep learning not ideal? In my opinion, deep learning is more of a hindrance than a help in the following cases.

Low budget or low investment

Deep networks are very flexible models, with a multitude of architectures and node types, optimizers, and regularization strategies. Depending on the application, your model might have convolutional layers (how wide? with or without pooling?) or a recurrent structure (with or without gating units?); it might be really deep (Hourglass, Siamese, or one of the many other architectures?) or just have a few hidden layers (how many units?); it might use rectified linear units or other activation functions; it might or might not use dropout (in which layers? at what rate?); and the weights should probably be regularized (L1, L2, or something more exotic?). This is only a partial list; there are many other types of nodes, connections, and even loss functions to try out. Tuning that many hyperparameters and exploring all those architectures can be very time consuming, when even training a single instance of a large network is expensive. Google's recent claim that its AutoML approach can automatically find the best architecture is impressive, but it required more than 800 GPUs running around the clock for weeks, which is out of reach for almost everyone else. The point is that training deep networks carries a large cost in both computation and debugging time. That cost does not make sense for many everyday prediction problems, where the return on investment of tuning a deep network, even a small one, may simply be too low. Even when the budget and commitment are there, there is no reason not to try alternative methods first, if only as a baseline. You may be pleasantly surprised to find that a linear SVM is all you really need.
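A rough sense of the scale of that search space, against how cheap a linear baseline is to benchmark (the grid values below are made up for illustration, and the dataset is synthetic):

```python
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# A modest hyperparameter grid for a deep net, before batch size, schedules, etc.
depth      = [2, 3, 5, 8]
width      = [64, 128, 256]
activation = ["relu", "tanh"]
dropout    = [0.0, 0.2, 0.5]
optimizer  = ["sgd", "adam"]
lr         = [1e-1, 1e-2, 1e-3]
l2         = [0.0, 1e-4, 1e-2]
grid = list(product(depth, width, activation, dropout, optimizer, lr, l2))
print(f"{len(grid)} deep-net configurations to consider")

# Meanwhile, the linear baseline is a few seconds of compute:
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
print("linear SVM CV accuracy:", cross_val_score(LinearSVC(dual=False), X, y, cv=5).mean())
```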

Explaining and communicating model parameters or feature importance to a general audience

Deep networks are notorious black boxes: highly predictive but poorly interpretable. Even though many recent tools, such as saliency maps and activation differences, work well in some domains, they do not transfer to all applications. Mainly, these tools work well when you want to make sure the network is not fooling you by memorizing the dataset or fixating on particular spurious features, but it remains difficult to translate per-feature importance into an explanation of the network's overall decision. In this arena nothing really beats linear models, because the learned coefficients have a direct relationship with the response. This matters especially when the explanations must be communicated to a general audience who will make decisions based on them. For example, doctors need to combine many different kinds of data to arrive at a diagnosis. The simpler and more direct the relationship between a variable and the outcome, the better a physician can use it, rather than underestimating or overestimating its actual value. Furthermore, there are cases where the accuracy of the model (particularly of a deep network) matters less than its interpretability. A policy maker, for instance, may want to know the effect of some demographic variables on mortality, and may be more interested in a direct approximation of that relationship than in predictive accuracy. In both kinds of cases, deep learning is at a disadvantage compared with simpler, more transparent methods.
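A small sketch of what "the coefficients relate directly to the response" means in practice: a logistic regression's coefficients can be read off per feature (here as odds ratios), the kind of statement that is hard to extract faithfully from a deep net. The feature names and data below are synthetic placeholders, not a real clinical dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

feature_names = ["age", "blood_pressure", "cholesterol", "bmi", "glucose"]  # illustrative labels only
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)

clf = LogisticRegression().fit(X, y)
for name, coef in zip(feature_names, clf.coef_[0]):
    # Each coefficient maps one feature to the log-odds of the outcome.
    print(f"{name:>15}: coefficient {coef:+.2f}  (odds ratio ~ {np.exp(coef):.2f} per unit)")
```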

Establishing causal mechanisms

The extreme case of model interpretability is when we are trying to build a mechanistic model, one that actually captures the phenomenon behind the data. Good examples include trying to guess whether two molecules (drugs, proteins, nucleic acids, and so on) interact in a particular cellular environment, or hypothesizing that a particular marketing strategy has a real effect on sales. In this area, according to expert opinion, nothing really beats old-fashioned Bayesian methods; they are the best way we have to represent and reason about cause and effect. Vicarious has some nice recent research showing why this more principled approach outperforms deep learning on video game tasks.
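As a toy sketch of the Bayesian style of reasoning the paragraph alludes to, consider estimating whether a marketing campaign changed the purchase rate, assuming the campaign was assigned at random; the counts below are made up, and this is only the simplest conjugate model, not a full causal analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up observations: visitors and purchases with and without the campaign.
n_ctrl, k_ctrl = 1000, 52      # control: 52 purchases out of 1000 visitors
n_camp, k_camp = 1000, 69      # campaign: 69 purchases out of 1000 visitors

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each purchase rate.
post_ctrl = rng.beta(1 + k_ctrl, 1 + n_ctrl - k_ctrl, size=100_000)
post_camp = rng.beta(1 + k_camp, 1 + n_camp - k_camp, size=100_000)

lift = post_camp - post_ctrl
print(f"posterior mean lift: {lift.mean():.4f}")
print(f"P(campaign increased the purchase rate) = {(lift > 0).mean():.3f}")
```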

Learning from "unstructured" features

This one may be debatable. One area where I have found deep learning to excel is in finding useful representations of data for a particular task. A good example is the word embeddings mentioned above. Natural language has a rich and complex structure that can be approximated by "context-aware" networks: each word is represented by a vector that encodes the contexts in which it tends to appear. Word embeddings learned on a large corpus and then used in an NLP task can sometimes boost performance on a specific task over a different corpus. However, if the corpus in question is completely unstructured, embeddings may not offer any benefit. Say, for example, that you are classifying objects by looking at unstructured lists of keywords. Since the keywords are not used within any particular structure (such as a sentence), word embeddings are unlikely to help much. In this case the data really is a bag of words, and that representation is probably sufficient for the task. One could counter that word embeddings are not that costly if you use pre-trained ones, and that they may capture keyword similarity better. Still, I would rather start with the bag-of-words representation and see whether it already gives good predictions; see the sketch below. After all, each dimension of a bag of words is far easier to interpret than the corresponding embedding dimension.
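A minimal version of that "start with bag-of-words" baseline: each dimension of the representation is literally a keyword count, so it stays readable. The keyword lists and labels are invented for illustration, and a recent scikit-learn (1.0 or later) is assumed for get_feature_names_out.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "usb cable charger black",        # electronics
    "battery charger phone usb",      # electronics
    "cotton shirt blue sleeve",       # clothing
    "wool shirt sweater blue",        # clothing
]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # one interpretable column per keyword
clf = LogisticRegression().fit(X, labels)

print(vec.get_feature_names_out())          # the columns are just the keywords
print(clf.predict(vec.transform(["usb battery cable"])))
```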

Deep learning is the future

Deep learning is hot, well funded, and advancing at a dizzying pace. By the time you finish reading a paper presented at a conference, there are probably two or three newer iterations that already supersede it. This is a serious caveat to all the points I have made above: deep learning may well become very useful in these scenarios in the near future. Tools for interpreting deep learning models of images and discrete sequences are getting better. Recently released software such as Edward marries Bayesian modeling with deep network frameworks, making it possible to quantify the uncertainty of neural network parameters and to do straightforward Bayesian inference via probabilistic programming and automated variational inference. In the longer term, there may be simplified modeling libraries that expose the salient properties of deep networks and thereby shrink the parameter space that needs to be explored. So keep your arXiv reading up to date; this post may be out of date in a month or two.


Edward combines deep learning and Bayesian models by marrying probabilistic programming with TensorFlow. From Tran et al., ICLR 2017

When Not to Use Deep Learning


Thanks to Xue Mingdeng for correcting this article.
