I assume you’re talking about reducing the batch size in a mini-batch stochastic gradient descent algorithm and comparing that to larger batch sizes, which require fewer iterations.

Andrew Ng provides a good discussion of this, and some visuals, in his online Coursera class on machine learning and neural networks. So the rest of this post is mostly a regurgitation of his teachings from that class.

Let’s take the two extremes. On one side, each gradient descent step uses the entire dataset: you compute the gradients for every sample. In this case you know exactly the best direction towards a local minimum, so you don’t waste time going the wrong way. In terms of the number of gradient descent steps, you’ll get there in the fewest.
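As a concrete illustration, here is a minimal full-batch gradient descent sketch on a synthetic least-squares problem; the data, learning rate, and step count are my own illustrative assumptions, not anything from the class.

```python
import numpy as np

# Minimal full-batch gradient descent sketch on a synthetic least-squares problem.
# The data, learning rate, and iteration count are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # 1000 samples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr = 0.1
for step in range(200):
    grad = X.T @ (X @ w - y) / len(y)   # gradient averaged over the ENTIRE dataset
    w -= lr * grad                      # every step points in the best (local) direction
```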

Of course, computing the gradient over the entire dataset is expensive, so now we go to the other extreme: a batch size of just 1 sample. In this case the gradient of that sample may take you in completely the wrong direction. But hey, the cost of computing that one gradient was trivial. As you take steps with respect to just one sample at a time you "wander" around a bit, but on average you head towards a local minimum just as reasonable as the one full-batch gradient descent would find.
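For contrast, here is a minimal single-sample SGD sketch on the same kind of synthetic problem; again, the data, learning rate, and epoch count are illustrative choices on my part.

```python
import numpy as np

# Minimal single-sample SGD sketch; data, learning rate, and epoch count are assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr = 0.01
for epoch in range(5):
    for i in rng.permutation(len(y)):    # visit samples in a random order
        xi, yi = X[i], y[i]
        grad = xi * (xi @ w - yi)        # gradient from just ONE sample: cheap but noisy
        w -= lr * grad                   # each step may "wander", but on average it descends
```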

This might be a moment to point out that I have seen some literature suggesting that the bouncing around of 1-sample stochastic gradient descent might help you escape a local minimum that full-batch mode would get stuck in, but that’s debatable. Some other good answers here address this question more directly than I have.

In terms of computational cost, while the single-sample stochastic GD process takes many, many more iterations, you "typically" end up getting there at lower total cost than full-batch mode. This is how Andrew Ng puts it.

Now let’s find the middle ground you asked about. Modern BLAS libraries make vectorized math quite efficient, so computing 10 or 100 samples at once, presuming you’ve vectorized your code properly, is barely more work than computing 1 sample (you gain memory-access efficiencies as well as the computational tricks built into most efficient math libraries). And averaging over a batch of 10, 100, or 1,000 samples produces a gradient that is a much more reasonable approximation of the true, full-batch gradient. So our steps are now more accurate, meaning we need fewer of them to converge, at a cost that is only marginally higher than single-sample GD.
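Here is a rough sketch of what that middle ground looks like in vectorized NumPy; the batch size of 100 and the learning rate are illustrative assumptions. The key point is that the whole mini-batch gradient is a couple of matrix products, so the underlying BLAS routines do the heavy lifting.

```python
import numpy as np

# Minimal mini-batch SGD sketch; batch size, learning rate, and data are assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 100
for epoch in range(20):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)  # averaged over the batch: less noisy than 1 sample
        w -= lr * grad
```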

Optimizing the exact size of the mini-batch is generally left to trial and error. Run some tests on a sample of the dataset, with batch sizes ranging from, say, tens to a few thousand, see which converges fastest, and go with that. Batch sizes in those ranges are quite common across the literature. And if your data truly is IID, the central limit theorem would also suggest that a mini-batch in those ranges gives a reasonable approximation of the full gradient.
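One way to run that trial-and-error search is sketched below: time how long each candidate batch size takes to drive the training loss under some threshold on a subsample of the data, then keep the fastest. The candidate sizes, tolerance, and learning rate here are all hypothetical.

```python
import time
import numpy as np

# Hypothetical batch-size search: the candidate sizes, tolerance, and learning rate are assumptions.
def time_to_converge(X, y, batch_size, lr=0.05, tol=0.02, max_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t0 = time.perf_counter()
    for epoch in range(max_epochs):
        order = rng.permutation(len(y))
        for s in range(0, len(y), batch_size):
            idx = order[s:s + batch_size]
            w -= lr * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        if np.mean((X @ w - y) ** 2) < tol:   # stop once training loss is "good enough"
            break
    return time.perf_counter() - t0

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=5000)
for bs in (16, 64, 256, 1024):
    print(bs, round(time_to_converge(X, y, bs), 3))
```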

Deciding exactly when to stop iterating is typically done by monitoring generalization error on a held-out validation set (data you never trained on) and choosing the point at which the validation error is lowest. Training for too many iterations will eventually lead to overfitting, at which point your error on the validation set will start to climb. When you see this happening, back up and stop at the optimal point.
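A minimal early-stopping sketch of that procedure, assuming a train/validation split already exists; the patience value and other hyperparameters are illustrative assumptions.

```python
import numpy as np

# Early-stopping sketch: keep the weights from the epoch with the lowest validation error.
# The patience value and hyperparameters are illustrative assumptions.
def train_with_early_stopping(X_tr, y_tr, X_val, y_val, lr=0.05, batch_size=100,
                              max_epochs=500, patience=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X_tr.shape[1])
    best_w, best_err, since_best = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        order = rng.permutation(len(y_tr))
        for s in range(0, len(y_tr), batch_size):
            idx = order[s:s + batch_size]
            w -= lr * X_tr[idx].T @ (X_tr[idx] @ w - y_tr[idx]) / len(idx)
        val_err = np.mean((X_val @ w - y_val) ** 2)   # error on data never trained on
        if val_err < best_err:
            best_w, best_err, since_best = w.copy(), val_err, 0
        else:
            since_best += 1
            if since_best >= patience:                # validation error keeps climbing: back up
                break
    return best_w, best_err
```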

Reference: https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network
