Wednesday, September 29, 2021

The Human Regression Ensemble

I sometimes worry that people credit machine learning with magical powers. Friends from other fields often show me little datasets. Maybe they measured the concentration of a protein in some cell line for the last few days and they want to know what the concentration will be tomorrow.

Day        Concentration
Monday     1.32
Tuesday    1.51
Wednesday  1.82
Thursday   2.27
Friday     2.51
Saturday   ???

Sure, you can use a fancy algorithm for that if you want. But I usually recommend just staring hard at the data, using your intuition, and making a guess. My friends usually respond with horror—you can’t just throw out predictions like that, that’s illegal! They want to use a rigorous method with guarantees.

Now, it’s true that we have methods with guarantees, but those guarantees are often a bit of a mirage. For example, you can compute a linear regression and get a confidence interval for the regression coefficients. That’s fine, but you’re assuming that (1) the true relationship is linear, (2) all data points are independent, (3) the noise is Gaussian, and (4) the magnitude of the noise is constant. These assumptions are extremely difficult to verify, even for experts.
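
To make that concrete, here is a minimal sketch of such a regression on the toy concentration table above, using statsmodels (my choice for the example; the post doesn’t specify any particular library). The coefficient confidence intervals and the Saturday prediction interval are only trustworthy if assumptions (1)-(4) actually hold.

    import numpy as np
    import statsmodels.api as sm

    # Toy data from the table above: days coded 0..4 (Monday..Friday).
    day = np.arange(5)
    conc = np.array([1.32, 1.51, 1.82, 2.27, 2.51])

    X = sm.add_constant(day)            # columns: intercept, day
    fit = sm.OLS(conc, X).fit()
    print(fit.conf_int(alpha=0.05))     # 95% CIs for intercept and slope

    # Extrapolate to Saturday (day 5); summary_frame includes a
    # 95% prediction interval for the new observation.
    X_sat = np.array([[1.0, 5.0]])      # intercept column added by hand
    print(fit.get_prediction(X_sat).summary_frame(alpha=0.05))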

All predictions need assumptions. The advantage of the "look at your data and make a guess" method is that you can’t fool yourself about this fact.

But is it really true that humans can do as well as algorithms for simple tasks? Let’s test this.

What I did

1. I took four common datasets to define simple one-dimensional prediction problems. For each of those problems, I split the data into a training set and a test set.

2. I took the training points and plotted them to a .pdf file as black dots, with four red dots for registration (a rough sketch of this step follows the list). Here’s what this looks like for the Boston dataset:

In each .pdf file there were 25 identical copies of the training data like this.

3. I transferred that .pdf file to my tablet. On the tablet, I hand-drew 25 curves that I felt were all plausible fits of the data.

4. I transferred the labeled .pdf back to my computer, and wrote some simple image-processing code to read in all of the drawn lines and average them (also sketched after this list). I then used this average to make predictions for the test data.

5. As a comparison, I made predictions for the test data using six standard regression methods: Ridge, local regression (LOWESS), Gaussian processes (GPR), random forests (RF), neural networks (MLP) and K-nearest neighbors (K-NN). More details about all these methods are below.

6. I computed two error measures: the root mean squared error (RMSE) and the mean absolute error (MAE).
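
As a rough sketch of step 2 (the exact plotting code isn’t given in the post, so treat this as illustrative; x and y stand for arrays of the two chosen variables): each page holds one copy of the training data as black dots, plus four red registration dots marking the data range so the drawn curves can later be mapped back to data coordinates.

    import matplotlib.pyplot as plt
    from matplotlib.backends.backend_pdf import PdfPages

    x0, x1, y0, y1 = x.min(), x.max(), y.min(), y.max()
    with PdfPages("training_copies.pdf") as pdf:
        for _ in range(25):                      # 25 identical copies
            fig, ax = plt.subplots()
            ax.plot(x, y, "k.")                  # training data in black
            ax.plot([x0, x0, x1, x1],
                    [y0, y1, y0, y1], "r.")      # registration dots in red
            ax.axis("off")
            pdf.savefig(fig)
            plt.close(fig)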
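
And here is a rough sketch of steps 4-6, assuming the hand-drawn curves have already been extracted from the labeled .pdf as (x, y) arrays in data coordinates (the registration dots make that conversion possible; the extraction itself is omitted here):

    import numpy as np

    def ensemble_predict(curves, x_test, grid_size=500):
        """Average hand-drawn curves and evaluate the average at x_test.

        `curves` is a list of (x, y) coordinate arrays, one per drawn curve,
        already converted from pixel to data coordinates.
        """
        lo = min(cx.min() for cx, _ in curves)
        hi = max(cx.max() for cx, _ in curves)
        grid = np.linspace(lo, hi, grid_size)
        rows = []
        for cx, cy in curves:
            order = np.argsort(cx)               # np.interp needs sorted x
            rows.append(np.interp(grid, cx[order], cy[order]))
        # The averaged curve is the "human ensemble" predictor.
        mean_curve = np.mean(rows, axis=0)
        return np.interp(x_test, grid, mean_curve)

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    def mae(y_true, y_pred):
        return np.mean(np.abs(y_true - y_pred))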

To make sure the results were fair, I committed myself to drawing the curves for each dataset just once and never touching them again, even if I did something that seemed stupid in retrospect—which, as you’ll see below, I did.

On the other hand, I had to do some tinkering with all the machine learning methods to get reasonable results, e.g. messing with how the neural networks were optimized, or what hyper-parameters to consider. This could create some bias, but if it does, it’s in favor of the machine learning methods and against me.

Results

For the Boston dataset, I used the crime variable for the x-axis and the house value variable for the y-axis. Here are all the lines I drew, plotted on top of each other:

And here are the results compared to the machine learning algorithms:

Here are the results for the diabetes dataset. I used age for the x-axis and disease progression for the y-axis. (I don’t think I did a great job drawing curves for this one.)

Here are the results for the Iris dataset, using sepal length for the x-axis and petal width for the y-axis.

And finally, here are the results for the wine dataset, using malic acid for the x-axis and alcohol for the y-axis.

I tend to think I under-reacted a bit to the spike of data with x around 0.2 and large y values. I thought at the time that it didn’t make sense to have a non-monotonic relationship between malic acid and alcohol. However, in retrospect it could easily be real, e.g. because of a cluster of one type of wine. Anyway, I left my predictions as I first drew them.

Summary of results

Here’s a summary of the RMSE for all datasets.

Method                  Boston  Diabetes  Iris  Wine
Ridge                   .178    .227      .189  .211
LOWESS                  .178    .229      .182  .212
Gaussian Process        .177    .226      .184  .204
Random Forests          .192    .226      .192  .200
Multi-Layer Perceptron  .177    .225      .185  .211
K-NN                    .178    .232      .186  .202
justin                  .178    .230      .181  .204

And here’s a summary of the MAE.

Method                  Boston  Diabetes  Iris  Wine
Ridge                   .133    .191      .150  .180
LOWESS                  .134    .194      .136  .180
Gaussian Process        .131    .190      .139  .170
Random Forests          .136    .190      .139  .162
Multi-Layer Perceptron  .131    .190      .139  .179
K-NN                    .129    .196      .137  .165
justin                  .121    .194      .138  .171

Honestly, I’m a little surprised how well I did here—I expected that I’d do OK but that some algorithm (probably LOWESS, still inexplicably not in base scikit-learn) would win in most cases.

I’ve been doing machine learning for years, but I’ve never run a "human regression ensemble" before. With practice, I’m sure I’d get better at drawing these lines, but I’m not going to get any better at applying machine learning methods.

I didn’t do anything particularly clever in setting up these machine learning methods, but it wasn’t entirely trivial (see below). A random person in the world is probably more likely than I was to make a mistake when running a machine learning method, but would probably be just as good at drawing curves. In that sense, drawing curves is an extremely robust way to predict.

What’s the point of this? It’s just that machine learning isn’t magic. For simple problems, it doesn’t fundamentally give you anything better than what you can get from common sense alone.

Machine learning is still useful, of course. For one thing, it can be automated. (Drawing many curves is kind of tedious…) Also, of course, with much larger datasets, machine learning will—I assume—beat any manual predictions. The point is just that in those cases it’s an elaboration on common sense, not some magical pixie dust.

Details on the regression methods

Here are the machine learning algorithms I used:

  1. Ridge: Linear regression with squared l2-norm regularization.
  2. LOWESS: Locally-weighted regression.
  3. GPR: Gaussian-process regression with an RBF kernel.
  4. RF: Random forests.
  5. MLP: A multi-layer perceptron with a single hidden layer.
  6. K-NN: K-nearest neighbors.

For all the methods other than Gaussian processes, I used 5-fold cross-validation to tune the key hyper-parameter (a minimal sketch of this tuning follows the list). The options I used were:

  1. Ridge: Regularization penalty of λ = .001, .01, .1, 1, or 10.
  2. LOWESS: Bandwidth of σ = .001, .01, .1, 1, or 10.
  3. Random forests: Minimum samples in each leaf of n = 1, 2, …, 19.
  4. Multi-layer perceptrons: 1, 5, 10, 20, 50, or 100 hidden units, with α = .01 regularization. In all cases, optimization used (non-stochastic) L-BFGS with 50,000 iterations and tanh nonlinearities.
  5. K-nearest neighbors: K = 1, 2, …, 19 neighbors.
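
As a minimal sketch of that tuning (scikit-learn calls the Ridge penalty alpha rather than λ; X_train, y_train, and X_test below are placeholders for the single-feature data, since the exact code isn’t given in the post; LOWESS is omitted because, as noted above, it isn’t in scikit-learn):

    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import Ridge
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neural_network import MLPRegressor

    # 5-fold cross-validation over the single key hyper-parameter of each method.
    models = {
        "Ridge": GridSearchCV(Ridge(), {"alpha": [.001, .01, .1, 1, 10]}, cv=5),
        "RF": GridSearchCV(RandomForestRegressor(),
                           {"min_samples_leaf": list(range(1, 20))}, cv=5),
        "MLP": GridSearchCV(MLPRegressor(alpha=.01, solver="lbfgs",
                                         activation="tanh", max_iter=50000),
                            {"hidden_layer_sizes": [(1,), (5,), (10,),
                                                    (20,), (50,), (100,)]}, cv=5),
        "K-NN": GridSearchCV(KNeighborsRegressor(),
                             {"n_neighbors": list(range(1, 20))}, cv=5),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)      # X_train has shape (n, 1)
        y_hat = model.predict(X_test)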

For Gaussian processes, I did not use cross-validation, but rather scikit-learn’s built-in hyper-parameter optimization. In particular, I used the magical incantation kernel = ConstantKernel(1.0, (.1, 10)) + ConstantKernel(1.0, (.1, 10)) * RBF(10, (.1, 100)) + WhiteKernel(5, (.5, 50)), which I understand means the system optimizes the kernel parameters to maximize the marginal likelihood.
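
Spelled out (again just a sketch, with X_train, y_train, and X_test as placeholders), that incantation plugs into scikit-learn like this:

    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

    # The kernel from the text; fit() then maximizes the marginal likelihood
    # over the kernel hyper-parameters within the given bounds.
    kernel = (ConstantKernel(1.0, (.1, 10))
              + ConstantKernel(1.0, (.1, 10)) * RBF(10, (.1, 100))
              + WhiteKernel(5, (.5, 50)))
    gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)
    y_hat = gpr.predict(X_test)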
