
Tuesday, October 6, 2015

How Mac/*nix users can stay productive on Windows PCs


Or why MacBook Pro users should convert to the sexy new Surface Book ...


Motivation for this post

Microsoft announced a beautiful, hybrid, yet powerful device today [1]. I was stunned by its design. Then I was curious to see what the Hacker News community had to say. The reaction was overwhelmingly positive. The major concern is perhaps this one:
Could be a MBP replacement for developers. The only thing is those of us running on OS X, how is Windows 10? I love my command line and Linux like commands and tools. - Homebrew - Bash scripts - Docker (Windows 10 currently not supported) - Vagrant
I just feel the tooling for MS isn't in the direction I am. I still have a Windows 7 desktop and it's just not the same.
... and unfortunately it's a valid concern. In this post, I'll try to address it.

For backend developers

In my experience, many backend developers who carry MBPs don't really program for the Mac, or even on the Mac. They mostly ssh into their dev server and write code to be run on Linux. Even if they write code locally [2], they still need to build or run it on the server. Either way, their terminal is mostly used to ssh into the dev server.

If you're that person, here's an easy solution to keep you happy on Windows. Install a nice terminal (e.g. ConEmu) [3] and Cygwin (with OpenSSH added) [3.5], and you can have a nice ssh experience on your Windows device. Here's a demo screenshot of my console.

ssh from PowerShell (as a shell environment), which is managed by ConEmu (as a terminal)
Full Linux-server experience, even with tmux and zsh.

If installing or using Cygwin sounds cumbersome (it shouldn't be), here's a *one-liner* in PowerShell (initial setup only) to satisfy your Linux taste.

iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1')) # install Chocolatey
choco install -y cygwin cyg-get # install apps
cyg-get openssh # install OpenSSH via Cygwin
(Well, I broke the script into 3 lines for explanation, but you get it, don't you?)
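
Once that finishes, day-to-day use is as simple as the sketch below (the Cygwin path is Chocolatey's default install location and the host name is made up; adjust both to your setup).

$env:Path += ";C:\tools\cygwin\bin" # make Cygwin's ssh (and friends) visible to PowerShell
ssh me@dev-server # then it's the usual routine: ssh in, attach tmux, keep coding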

You may ask: what is Chocolatey? Aha, it's Windows's answer to Mac OS's Homebrew (or Ubuntu's apt-get) [4]. So, you've already got the Homebrew concern covered [5].
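
If you're used to Homebrew, the habit translates almost one-to-one; here's a rough mapping (package names are just examples):

choco install -y git # roughly: brew install git
choco upgrade -y git # roughly: brew upgrade git
choco search python # roughly: brew search python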

For web-frontend developers

I'm not a frontend guy, but I suspect those folks may prefer Visual Studio on Windows over any other alternative. VS is really a beast! If you happen to use the command line a lot (in which case you're more likely a full-stack engineer), you can use PowerShell, and if you prefer some of the more mature Unix utilities (as I do), you can always call those commands from Cygwin's bin, as in the sketch below.
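
A quick illustration (assuming Cygwin's bin directory is on your PATH, as in the setup above) of how freely the two styles mix:

Get-ChildItem -Recurse -Filter *.js | Select-String "TODO" # the PowerShell way
grep -rn --include=*.js "TODO" . # or just call Cygwin's grep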

For iOS developers or iDevices fans

Well, stick with your MBP :). Windows is certainly inferior in this area.

Conclusions

I'm not suggesting you should switch to Windows just because you can also do your stuff in Windows. If you prefer a bare-metal Unix environment, there's no reason to switch to a Windows device (but then, Mac OS may not be your choice either). However, if Unix tooling on Windows is your major barrier to switching to a Windows PC, it shouldn't be anymore.

I emphasized the word PC because I don't see it as a big deal which OS you use on a laptop/desktop. Note that the same is not true on the server side. Personally, I prefer working with Linux servers over Windows Server (and I haven't seen Mac OS used for servers). But that's another point for another post.


Side notes
[1] Like everybody else, I had not known about this device until the announcement.
[2] Many people (including me) prefer nice IDEs over vim or Emacs. The files can be synced automatically between the local machine and the server.
[3] Just as you may prefer iTerm2 over the default terminal on your Mac.
[3.5] Many prefer MSYS2 over Cygwin. I still prefer Cygwin but that's another point for another post.
[4] One of the coolest features of Windows 10, in my opinion, is OneGet. It's a superset of Chocolatey, NuGet, etc., and it's a built-in command within PowerShell. Unfortunately, it's still in beta (which is probably why MS hasn't made much noise about it yet), so I still use Chocolatey.
[5] The OP also mentioned Docker. Docker is a server-side technology, not something that you want to host on your Mac client. So just ssh to your remote machine and docker however you want.
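For instance (the host name below is made up):

ssh me@docker-host # a hypothetical Linux box running the Docker daemon
docker run -it ubuntu bash # the containers live on the server, not on your laptop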

Friday, May 1, 2015

Scaling Up Stochastic Dual Coordinate Ascent

That's our new paper, to appear at KDD 2015. Here's the abstract.
Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is mostly implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.
There are two main ideas in this paper:
  1. A semi-asynchronous parallel SDCA algorithm that guarantees strong (linear) convergence and scales almost linearly with respect to the number of cores on large and sparse datasets.
  2. A binary data loader that can serve random examples out-of-memory, off a compressed data file on disk. This allows us to train on very large datasets, with minimal memory usage, while achieving a fast convergence rate (thanks to the pseudo-shuffling). For smaller datasets, we even showed that this *out-of-memory* training approach can be more efficient than standard in-memory training approaches [*].
Note that the second idea is not restricted to SDCA or even linear learning. In fact, we originally implemented this binary data loader for training large neural networks. However, it couples nicely with SDCA as the real strength of SDCA is on very large sparse datasets, for which the need for out-of-memory training arises.
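
For readers new to SDCA, here is the textbook formulation (following Shalev-Shwartz and Zhang; a sketch for context, not necessarily the exact notation of our paper). Given examples $x_1, \dots, x_n$, convex losses $\phi_i$, and regularization $\lambda$, SDCA maximizes the dual

$$D(\alpha) = \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) \;-\; \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i x_i\Big\|^2$$

of the primal objective $P(w) = \frac{1}{n}\sum_i \phi_i(w^\top x_i) + \frac{\lambda}{2}\|w\|^2$, while maintaining the primal iterate $w(\alpha) = \frac{1}{\lambda n}\sum_i \alpha_i x_i$. Each coordinate update of some $\alpha_i$ must also update $w$; when many threads update coordinates asynchronously, the maintained $w$ can drift away from $w(\alpha)$, which is roughly the primal-dual synchronization issue the abstract alludes to.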

See the full paper for more details :).

Side notes
[*] Cache efficiency is the key, as I mentioned in a previous blog post.

Tuesday, March 10, 2015

Metrics revisited


Machine learning researchers and practitioners often evaluate with one metric on the test set and optimize a different one on the train set. Consider the traditional binary classification problem, for instance. We typically use AUC on the test set to measure the goodness of an algorithm while using another loss function, e.g. logistic loss or hinge loss, on the train set.

Why is that? The common explanation is that AUC is not easily trainable. Computing AUC requires batch training, as there's no such thing as per-example AUC. Even in batch training, we just don't use it as a loss function [1].
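
To make that concrete: AUC is the probability that a randomly drawn positive example is scored above a randomly drawn negative one,

$$\mathrm{AUC}(s) \;=\; \Pr\big(s(x^{+}) > s(x^{-})\big) \;\approx\; \frac{1}{|P|\,|N|}\sum_{i \in P}\sum_{j \in N}\mathbf{1}\big[s(x_i) > s(x_j)\big],$$

so it's defined over pairs of examples rather than individual ones, and the indicator is not differentiable in the scores.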

I want to ask a deeper question. Why is AUC a good metric in the first place? It's not the metric that business people care about. Why don't we use the true business loss, which can be factored into the loss due to false positives and the loss due to false negatives, for testing a machine learning algorithm, and even for training it?
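
As a concrete sketch, with illustrative per-error costs $c_{FP}$ and $c_{FN}$ (numbers the business side would supply), such a loss on a labeled test set is simply

$$L \;=\; c_{FP}\cdot(\#\text{false positives}) \;+\; c_{FN}\cdot(\#\text{false negatives}),$$

which, unlike AUC, depends on the classification threshold.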

The major reason AUC is favored as a proxy for business loss is that it is independent of the classification threshold. Why are we scared of the threshold? Why do we need to set a threshold in order to use a classification model? Isn't it against the spirit of machine learning that humans have to set the threshold manually?

So I'd like to propose that we shouldn't consider the threshold a parameter to tune. Instead, make it another parameter to learn. Here are a few challenges in doing so:
  • If we are using a linear model, adding a threshold parameter will make the model nonlinear. In fact, we will no longer have a linear model.
  • The threshold parameter needs to be between 0 and 1. We can remove this constraint by applying a logistic function to an unconstrained threshold variable (see the sketch after this list).
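
Here's a minimal sketch of that idea (the notation is mine, purely for illustration). Let $p(x) \in [0,1]$ be the model's score and $t \in \mathbb{R}$ an unconstrained threshold variable; predict

$$\hat{y}(x) \;=\; \mathbf{1}\big[p(x) \ge \sigma(t)\big], \qquad \sigma(t) = \frac{1}{1 + e^{-t}} \in (0, 1),$$

and during training replace the indicator with a smooth surrogate, e.g. $\sigma\big(k\,(p(x) - \sigma(t))\big)$ for some steepness $k$, so that $t$ can be learned by gradient descent along with the other model parameters.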
ML research has always been challenging. Adding another layer of complexity shouldn't be an issue. Not modeling the business problem directly is more of an issue to me.

Side notes
[1] Computing AUC is costly and computing the AUC function gradient is even costlier.