
Sunday, June 26, 2016

Rich Image Captioning In The Wild

In this CVPR 2016 Deep Vision Workshop paper, we present an image caption system that addresses new challenges of automatically describing images in the wild. The challenges include high caption quality with respect to human judgments, out-of-domain data handling, and the low latency required in many applications. Built on top of a state-of-the-art framework, we developed a deep vision model that detects a broad range of visual concepts, an entity recognition model that identifies celebrities and landmarks, and a confidence model for the caption output. Experimental results show that our caption engine significantly outperforms previous state-of-the-art systems on both an in-domain dataset (i.e. MS COCO) and out-of-domain datasets.

I'll present this work at the CVPR Deep Vision workshop. Hope to see many friends at CVPR this year.

Tuesday, October 6, 2015

How Macnix users can stay productive on Windows PCs


Or why MacBook Pro users should convert to the sexy new Surface Book ...


Motivation for this post. 

Microsoft announced a beautiful, hybrid, yet powerful device today [1]. I was stunned by its design. Then I was curious to see what the Hacker News community had to say. The reaction is overwhelmingly positive. The major concern is perhaps this one:
Could be a MBP replacement for developers. The only thing is those of us running on OS X, how is Windows 10? I love my command line and Linux like commands and tools. - Homebrew - Bash scripts - Docker (Windows 10 currently not supported) - Vagrant
I just feel the tooling for MS isn't in the direction I am. I still have a Windows 7 desktop and it's just not the same.
... and unfortunately it's a valid concern. In this post, I'll try to clear up that concern.

For backend developers

From my experience, many backend developers who carry MBPs don't really program for the Mac, or even on the Mac. These people mostly ssh into their dev server and write code to be run on Linux. Even if they write code locally [2], they still need to build or run that code on the server. So their terminal is still mostly used to ssh into the dev server.

If you're that person, here's an easy solution to keep you happy on Windows. Install a good terminal (e.g. ConEmu) [3] and Cygwin (with OpenSSH added) [3.5], and you'll get a pleasant ssh experience on your Windows device. Here's a demo screenshot of my console.

ssh from PowerShell (as a shell environment), which is managed by ConEmu (as a terminal)
Full Linux-server experience, even with tmux and zsh.

If the idea of installing or using Cygwin sounds cumbersome, which it shouldn't be, here's a *one-liner* in PowerShell (initial setup only) for your Linux taste.

curl https://chocolatey.org/install.ps1 | iex # install Chocolatey
choco install -y cygwin cyg-get # install apps
cyg-get openssh # install OpenSSH via Cygwin
(Well, I broke the script into 3 lines for explanation, but you get it, don't you?)

You may ask: what is Chocolatey? Aha, it's Windows' answer to Mac OS's Homebrew (or Ubuntu's apt-get) [4]. So the Homebrew concern is already covered [5].

For web-frontend developers

I'm not a frontend guy, but I suspect those folks may prefer Visual Studio on Windows over any other alternative. VS is really a beast! If you happen to use the command line a lot (in which case you're more likely a full-stack engineer), you can use PowerShell, and if you prefer some of the more mature Unix utilities (as I do), you can always call those commands from Cygwin's bin directory.

For iOS developers or iDevices fans

Well, stick with your MBP :). Windows is certainly inferior in this regard.

Conclusions

I don't suggest that you should switch to Windows just because you can also do stuff on Windows. If you prefer a bare-metal Unix environment, there's no reason to switch to a Windows device (but then, Mac OS may not be your choice either). However, if Unix tooling on Windows is your major barrier to switching to a Windows PC, it shouldn't be anymore.

I emphasized the word PC because I don't see any big deal regarding which OS to use on a laptop/desktop. Note that the same is not true on the server side. Personally, I prefer working with Linux servers over Windows Server (and I haven't seen Mac OS used for servers). But that's another point for another post.


Side notes
[1] Like everybody else, I had not known about this device until it was announced.
[2] Many people (including me) prefer nice IDEs over vim or Emacs. The files can be synced automatically between the local machine and the server.
[3] Just as you may prefer iTerm2 over the default terminal on your Mac.
[3.5] Many prefer MSYS2 over Cygwin. I still prefer Cygwin but that's another point for another post.
[4] One of the coolest features of Windows 10, in my opinion, is OneGet. It's a superset of Chocolatey, NuGet, etc., and it's a built-in command within PowerShell. Unfortunately, it's still in beta (that's probably why MS hasn't made much noise about it yet), so I still use Chocolatey.
[5] The OP also mentioned Docker. Docker is a server-side technology, not something you want to host on your Mac client. So just ssh into your remote machine and use Docker however you want.

Friday, May 1, 2015

Scaling Up Stochastic Dual Coordinate Ascent

That's our new paper, to appear at KDD 2015. Here's the abstract.
Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is mostly implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.
There are two main ideas in this paper:
  1. A semi-asynchronous parallel SDCA algorithm that guarantees strong (linear) convergence and scales almost linearly with respect to the number of cores on large and sparse datasets.
  2. A binary data loader that can serve random examples out of memory, off a compressed data file on disk. This allows us to train on very large datasets, with minimal memory usage, while achieving a fast convergence rate (due to the pseudo shuffling). For smaller datasets, we showed that this *out-of-memory* training approach can be even more efficient than standard in-memory training approaches [*].
Note that the second idea is not restricted to SDCA or even to linear learning. In fact, we originally implemented this binary data loader for training large neural networks. However, it couples nicely with SDCA because the real strength of SDCA is on very large sparse datasets, which is exactly where the need for out-of-memory training arises.
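
To fix ideas, here is a minimal sketch of the plain, single-threaded SDCA loop for an L2-regularized linear SVM (hinge loss) that the paper builds on. The function name and hyper-parameters are illustrative, and the paper's actual contributions (the asynchronous multi-core updates and the out-of-memory loader) are not shown here.

import numpy as np

def sdca_svm(X, y, lam=0.01, epochs=10, seed=0):
    """Plain sequential SDCA for an L2-regularized linear SVM (hinge loss).
    X: (n, d) feature matrix; y: labels in {-1, +1}; lam: regularization strength."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)               # dual variables, one per example
    w = np.zeros(d)                   # primal weights; invariant: w = X.T @ alpha / (lam * n)
    sq_norms = (X ** 2).sum(axis=1)
    for _ in range(epochs):
        for i in rng.permutation(n):  # random coordinate order
            if sq_norms[i] == 0.0:
                continue
            # closed-form maximization of the dual objective w.r.t. alpha_i
            margin = 1.0 - y[i] * X[i].dot(w)
            delta = y[i] * max(0.0, min(1.0, lam * n * margin / sq_norms[i]
                                             + alpha[i] * y[i])) - alpha[i]
            alpha[i] += delta
            w += (delta / (lam * n)) * X[i]   # keep primal and dual in sync
    return w

In the paper, this inner update runs asynchronously on multiple threads (with the primal-dual synchronization handled explicitly), and the in-memory permutation is replaced by multi-threaded deserialization of block-compressed data on disk, which provides the pseudo shuffling mentioned above.
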

See the full paper for more details :).

Side notes
[*] Cache efficiency is the key, as I mentioned in a previous blog post.

Tuesday, March 10, 2015

Metrics revisited


Machine learning researchers and practitioners often use one metric on the test set while optimizing a different metric when training on the train set. Consider the traditional binary classification problem, for instance. We typically use AUC on the test set for measuring the goodness of an algorithm, while using another loss function, e.g. logistic loss or hinge loss, on the train set.

Why is that? The common explanation is that AUC is not easily trainable. Computing AUC requires batch training, as there is no such concept as per-example AUC. Even in batch training, we just don't use it as a loss function [1].
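
To make the "no per-example AUC" point concrete, here is a small sketch of my own (not from any library) that computes AUC via its rank-statistic (Mann-Whitney) formulation; note that the entire score set is needed before anything can be computed, and ties are ignored for brevity.

import numpy as np

def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.
    It is a statistic of the whole ranked score set, not a sum of per-example losses.
    scores: real-valued model outputs; labels: 1 = positive, 0 = negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)   # 1-based ranks
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
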

I want to ask a deeper question. Why is AUC a good metric in the first place? It's not the metric that business people care about. Why don't we use the true business loss, which can be factored into loss due to false positives and loss due to false negatives, for testing a machine learning algorithm, or even for training it?

The major reason that AUC is favored as a proxy for business loss is that it is independent of the classification threshold. Why are we scared of the threshold? Why do we need to set a threshold in order to use a classifier model? Isn't it anti-machine-learning that humans have to manually set the threshold?

So I'd like to propose that we shouldn't consider the threshold a parameter to tune. Instead, make it another parameter to learn. Here are a few challenges when doing so:
  • If we are using a linear model, this additional threshold parameter will make the model nonlinear. In fact, we will no longer have a linear model.
  • The threshold parameter needs to be between 0 and 1. We can turn this into an unconstrained problem by applying a logistic function to a raw threshold variable, as sketched below.
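
Here is a minimal sketch of that proposal, with some assumptions of my own: a linear scorer and the threshold are learned jointly against a cost-weighted surrogate of the business loss; the threshold is kept in (0, 1) through a logistic reparameterization, the hard 0/1 decision is smoothed by a sigmoid with steepness k so that gradients exist, and c_fp / c_fn are placeholders for the real business costs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_with_learned_threshold(X, y, c_fp=1.0, c_fn=5.0, lr=0.1, steps=2000, k=10.0):
    """Illustrative sketch only: jointly learn a linear scorer (w, b) and a
    decision threshold against a cost-weighted surrogate of the business loss.
    - t = sigmoid(tau) keeps the learned threshold inside (0, 1).
    - The hard decision 1[p > t] is replaced by the smooth surrogate
      sigmoid(k * (p - t)); the steepness k is an assumption, not from the post.
    X: (n, d) features; y: labels in {0, 1}."""
    n, d = X.shape
    w, b, tau = np.zeros(d), 0.0, 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)            # model scores in (0, 1)
        t = sigmoid(tau)                  # learned threshold in (0, 1)
        s = sigmoid(k * (p - t))          # soft "predicted positive"
        g = c_fp * (1 - y) - c_fn * y     # d(loss)/ds per example
        common = g * k * s * (1 - s)      # chain rule through the surrogate
        grad_w = X.T @ (common * p * (1 - p)) / n
        grad_b = np.mean(common * p * (1 - p))
        grad_tau = np.mean(-common) * t * (1 - t)
        w -= lr * grad_w
        b -= lr * grad_b
        tau -= lr * grad_tau
    return w, b, sigmoid(tau)             # scorer plus the learned threshold

The smoothing trick and the particular cost values are my assumptions; the point is only that nothing stops us from treating the threshold as just another learned parameter.
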
ML research has always been challenging. Adding another layer of complexity shouldn't be an issue. Not modeling the business problem directly is more of an issue to me.

Side notes
[1] Computing AUC is costly and computing the AUC function gradient is even costlier.

Saturday, December 20, 2014

Is it really due to Neural Networks?

Deep Learning is hot. It has been achieving state-of-the-art results on many hard machine learning problems. So it's natural that many people study and scrutinize it. There have been a couple of papers in the "intriguing properties of neural networks" series. Here's a recent one, the so-called "Deep Neural Networks are easily fooled" paper, which was actively discussed on Hacker News. Isn't it interesting?

The common theme among these papers is that DNNs are unstable, in the sense that
  • Changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion as a library)
  • Or a DNN gives high-confidence (99%) predictions for images that are unrecognizable to humans
I myself have experienced this phenomenon too, even before these papers came out. However, what surprised me is that many people are too quick to claim that these properties are due to neural networks themselves. Training a neural network, or any system, requires tuning many hyper-parameters. Just because you don't use the right set of hyper-parameters doesn't mean that the method fails.

Anyway, my experience suggests that the aforementioned instability is likely due to the choice of output function, in particular Softmax. In training a multi-class neural network, the most common output function is Softmax [1]. The less common approach is to just use a Sigmoid for each output node [2].

In practice, they perform about the same in terms of test error. Softmax may perform better in some cases and vice versa, but I haven't seen much difference. To illustrate, I trained a neural net on the MNIST data using both Sigmoid and Softmax output functions. MNIST is a 10-class problem. The network architecture and other hyper-parameters were the same. As you can see, using Softmax doesn't really give any advantage.
[Figure: MNIST test results, Sigmoid vs. Softmax output functions]
However, if you train a predictor using Softmax, it tends to be too confident, because the raw outputs are normalized in an exponential way. Testing on test data doesn't expose this problem, because in the test data every example clearly belongs to one of the trained classes. In practice, however, that may not be the case: there are many examples where the predictor should be able to say "I'm not confident about my prediction" [3]. Here's an illustration: I applied the MNIST models trained above to a weird example, which doesn't look like any digit. The model trained using Sigmoid outputs that the example doesn't belong to any of the trained digit classes. The model trained using Softmax is very confident that it's a 9.
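
A toy numerical illustration of that effect (made-up scores, not the actual MNIST models above): pass the same low raw output scores through a Softmax layer and through independent Sigmoids.

import numpy as np

# Ten raw output scores (logits), all low, as they might be for an input
# that looks like none of the ten digit classes. The values are made up.
logits = np.full(10, -6.0)
logits[9] = -2.0                              # class "9" is the least unlikely

softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()                      # forced to sum to 1
sigmoids = 1.0 / (1.0 + np.exp(-logits))      # each class scored independently

print(softmax.max())    # ~0.86: Softmax still "confidently" picks class 9
print(sigmoids.max())   # ~0.12: no per-class Sigmoid score is confident

Because Softmax only compares classes against one another, it cannot express "none of the above", whereas independent Sigmoids can.
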



Side notes
[1] It's commonly said that Softmax is a generalization of the logistic function (i.e. Sigmoid). This is not true. For example, 2-class Softmax Regression isn't the same as Logistic Regression. In general, they behave very differently. The only common properties are that they are smooth and that the class values they produce sum up to 1.
[2] Some people say that for multiclass classification, Softmax gives a probabilistic interpretation while per-class Sigmoid does not. I strongly disagree. Both have probabilistic interpretations, depending on how you view them. (And you can always proportionally normalize the Sigmoid outputs if you care about them summing up to 1 to represent a probability distribution.)
[3] Predicting with uncertainty is very important in many applications. This is in fact the theme of Michael Jordan's keynote at ICML 2014.

Wednesday, December 17, 2014

NIPS 2014 Highlights



If you didn't go to NIPS 2014, read this great summary by John Platt at Microsoft Research.