How Do ML Algorithms Manage Large Amounts of Data?

Aug 9, 2018

Some algorithms learn well from small data, while others are preferable for large data. This can be understood rigorously through statistical learning theory. Intuitively, an algorithm that chooses from a large or complex collection of models needs a larger data set to converge to a model that generalizes well to new data. There is therefore a trade-off between how complex a model one wants to be able to learn and how much data, and consequently how much compute, one can provide.

More practically, Naive Bayes is a simple model often used for small data, but Logistic Regression and the Perceptron are also reasonable choices, as is any other model with relatively few parameters. On the other side of the spectrum are deep neural networks, which in practice often have millions of parameters. One of the most common complaints about deep learning is precisely its data hunger.
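To make this concrete, here is a minimal sketch of fitting two of the "small data" models mentioned above on a tiny synthetic dataset. The use of scikit-learn and the dataset sizes are illustrative assumptions, not part of the original discussion.

```python
# Minimal sketch: fitting two "small data" models with scikit-learn
# (the library choice and dataset sizes are assumptions for illustration).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# A small synthetic dataset: 1,000 samples, 20 features.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (GaussianNB(), LogisticRegression(max_iter=1_000)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```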

We already mentioned compute resources, but let us go deeper. Say we have a complex problem that requires a complex model and therefore a larger data set. This increases the need for compute power in several ways: more data to pass through, more parameters to update, and a more complicated gradient to compute (assuming we are using a gradient-based method). And that is not all. Given the huge amount of data and parameters, neither is likely to fit in the memory of a single machine, so we turn to distributed machine learning. The goal is to get the same result as if the model were trained on a single utopian supercomputer, but because of the approximations involved, the results will differ to some extent.
It would be an understatement to say that a distributed deep learning model is quite different from a naive Bayes model on a single computer. The main points are:

  • Splitting data across machines is called data-parallelism. Each node computes gradient updates on its own shard of the data, and those updates have to be sent across the network (or some other channel) and aggregated so that every node effectively benefits from all of the data spread across the nodes; see the sketch after this list.
  • Splitting parameters across machines is called model-parallelism. This is even harder, but it is required if the model is so big that its parameters do not fit into the memory of a single machine.
  • There is a lot of clever research and many papers on this topic, for example compression of updates, synchronous versus asynchronous updates, and centralized versus decentralized algorithms. Baidu recently brought a very efficient communication scheme called ring all-reduce to deep learning.
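To illustrate the data-parallel idea from the list above, here is a minimal single-machine simulation: the data is split into shards, each "worker" computes a gradient on its shard, and the gradients are averaged before the shared parameters are updated. The NumPy setup and the linear-regression objective are assumptions made purely for illustration; a real system would use a distributed framework and a network all-reduce for the averaging step.

```python
# Single-machine simulation of data-parallel SGD: shard the data, compute a
# gradient per "worker", average the gradients, update the shared parameters.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))               # full dataset
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

n_workers = 4
X_shards = np.array_split(X, n_workers)        # "data-parallelism": split rows
y_shards = np.array_split(y, n_workers)

w = np.zeros(5)                                # shared model parameters
lr = 0.1
for step in range(100):
    # Each worker computes the mean-squared-error gradient on its own shard.
    grads = [
        2 * Xs.T @ (Xs @ w - ys) / len(ys)
        for Xs, ys in zip(X_shards, y_shards)
    ]
    # The "all-reduce" step: average gradients across workers, then update.
    w -= lr * np.mean(grads, axis=0)

print("recovered weights:", np.round(w, 2))
```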

Apart from this, we should also discuss the computer's memory, which is usually the most limiting constraint. A modern PC typically has something like 16 GB of RAM, so it can hold datasets of up to a few GB in memory, which means millions, or for very low-dimensional data even billions, of data points. For many machine learning tasks this is more than enough.
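A quick back-of-the-envelope calculation shows why a few GB is roughly the ceiling for an in-memory dataset on a 16 GB machine; the row and feature counts below are illustrative assumptions.

```python
# Back-of-the-envelope memory estimate for an in-memory dataset, assuming
# 64-bit floats (8 bytes per value); the exact sizes are illustrative.
n_rows = 10_000_000       # ten million data points
n_features = 100
bytes_per_value = 8       # float64

total_gb = n_rows * n_features * bytes_per_value / 1024**3
print(f"~{total_gb:.1f} GB")  # ~7.5 GB, already close to the limit of a 16 GB machine
```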

To scale beyond that, you can use techniques like stochastic gradient descent and update the machine learning model in mini-batches: load a small portion of the dataset into memory, update the model, then load the next portion, and so on. The constraint is then no longer main memory but disk storage, and computers typically have hard disks of a few hundred GB up to a couple of TB.
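One way to implement this load-a-portion, update, repeat loop is out-of-core learning with scikit-learn's partial_fit, sketched below. The library choice, the CSV file name, the "label" column, and the chunk size are all assumptions made for the sake of the example.

```python
# Out-of-core learning: read the dataset in chunks and update the model on
# each mini-batch with partial_fit. File name and chunk size are placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")        # logistic regression trained via SGD
classes = np.array([0, 1])                    # must be supplied on the first call

for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X = chunk.drop(columns="label").to_numpy()
    y = chunk["label"].to_numpy()
    model.partial_fit(X, y, classes=classes)  # update on this mini-batch only
```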

You could go even further by streaming data over the network and processing it in mini-batches, in which case there is essentially no limit to the amount of data you can handle. However, processing speed takes over as the main constraint long before you get that far. You can speed things up considerably by moving computation from the CPU to a faster GPU, but when we are talking about TBs of data, training a machine learning model is extremely time-consuming even on a state-of-the-art GPU.
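For completeness, here is a hedged sketch of moving the computation from the CPU to a GPU using PyTorch; the framework and the toy model are assumptions, and the code simply falls back to the CPU if no GPU is available.

```python
# Sketch of moving computation from CPU to GPU with PyTorch (the framework
# choice and the toy model are assumptions; other GPU frameworks work similarly).
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# A single mini-batch; in practice these would stream from disk or the network.
x = torch.randn(256, 100, device=device)
y = torch.randn(256, 1, device=device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```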
Finally, yes, one could keep training the same Logistic Regression model on ever larger data, but as the data grows the model learns less and less from each additional example, so one might as well stop training before reaching the big-data regime.

Source: HOB