Recently, some members of our team have experimented with variants of SGD (Stochastic Gradient Descent) and SDCA (Stochastic Dual Coordinate Ascent) on large and sparse datasets (such as the KDD Cup 2010 dataset [*]). Note that we've focused only on linear models (e.g. logistic regression and SVM), which lead to convex optimization problems.
What we found in the process is very interesting from an engineering standpoint, yet not something we have seen covered in academic papers. Namely: in SGD and SDCA, each weight update is typically so fast [**] that cache efficiency can become the limiting factor. Put another way: we observed that if the data is randomly shuffled in memory, the algorithm suddenly becomes ~3x slower (see the table below).
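To make the effect concrete, here is a minimal sketch (not our actual benchmark code) of the kind of experiment involved: sparse logistic-regression SGD over a synthetic CSR-style dataset, visited either in contiguous storage order or through a random permutation of example indices. The data layout, loss, learning rate, and sizes below are illustrative assumptions, not the setup behind the numbers in the table.

```cpp
// Sketch: sparse SGD epoch timed with sequential vs. shuffled example access.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

struct CsrData {                       // all examples packed into flat arrays
    std::vector<int>    col;           // feature indices
    std::vector<float>  val;           // feature values
    std::vector<size_t> row_ptr;       // example i spans [row_ptr[i], row_ptr[i+1])
    std::vector<float>  label;         // +1 / -1
};

static CsrData make_synthetic(size_t n_examples, size_t n_features,
                              size_t nnz_per_example, std::mt19937& rng) {
    CsrData d;
    std::uniform_int_distribution<int>    fdist(0, static_cast<int>(n_features) - 1);
    std::uniform_real_distribution<float> vdist(0.0f, 1.0f);
    d.row_ptr.push_back(0);
    for (size_t i = 0; i < n_examples; ++i) {
        for (size_t k = 0; k < nnz_per_example; ++k) {
            d.col.push_back(fdist(rng));
            d.val.push_back(vdist(rng));
        }
        d.row_ptr.push_back(d.col.size());
        d.label.push_back((i & 1) ? 1.0f : -1.0f);   // arbitrary synthetic labels
    }
    return d;
}

// One logistic-regression SGD epoch, visiting examples in the given order.
static double sgd_epoch(const CsrData& d, std::vector<float>& w,
                        const std::vector<size_t>& order, float lr) {
    double loss = 0.0;
    for (size_t idx : order) {
        float margin = 0.0f;
        for (size_t j = d.row_ptr[idx]; j < d.row_ptr[idx + 1]; ++j)
            margin += w[d.col[j]] * d.val[j];          // sparse dot product
        const float y = d.label[idx];
        const float sigma = 1.0f / (1.0f + std::exp(-y * margin));
        loss += -std::log(sigma);                      // log-loss on this example
        const float g = lr * y * (1.0f - sigma);       // gradient scale
        for (size_t j = d.row_ptr[idx]; j < d.row_ptr[idx + 1]; ++j)
            w[d.col[j]] += g * d.val[j];               // update only touched weights
    }
    return loss;
}

int main() {
    std::mt19937 rng(42);
    const size_t n = 2'000'000, dims = 1'000'000, nnz = 30;   // illustrative sizes
    CsrData data = make_synthetic(n, dims, nnz, rng);

    std::vector<size_t> sequential(n);
    std::iota(sequential.begin(), sequential.end(), 0);
    std::vector<size_t> shuffled = sequential;
    std::shuffle(shuffled.begin(), shuffled.end(), rng);

    for (const auto* order : {&sequential, &shuffled}) {
        std::vector<float> w(dims, 0.0f);
        auto t0 = std::chrono::steady_clock::now();
        double loss = sgd_epoch(data, w, *order, 0.1f);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s order: %.2f s (loss %.3e)\n",
                    order == &sequential ? "storage" : "shuffled",
                    std::chrono::duration<double>(t1 - t0).count(), loss);
    }
    return 0;
}
```

The point of the sketch is that the arithmetic per update is tiny (a sparse dot product and a sparse AXPY), so once the examples no longer stream through memory in allocation order, cache misses on the example data can dominate the epoch time. The exact slowdown will of course depend on the dataset and hardware.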