Hacker News | quantombone's comments

A VR system that lets you walk around your house will have to detect the tables/chairs/walls in your living space and place some kind of digital content in their place so you don’t bump into stuff. VR without perception won’t work. That kind of #VR is essentially #MixedReality


Using PyTorch (and the broader space of machine learning algorithms under the “deep learning” category) really makes me feel like a wizard. But the downside to being Python-dependent is that putting PyTorch stuff into products is not easy. I hope PyTorch 1.0 will change that.

Note: This article was not written with Machine Learning in mind, and I will have to re-read the article to better articulate my thoughts on “Machine Learning Wizardry” and juxtapose my own ideas with those of the article author.

Kudos to author: The article’s main metaphor is excellent because it got my creative juices flowing (i.e., brain working at 110% for a few brief moments).


You can always use ONNX to convert a PyTorch-trained model to other formats. https://onnx.ai/


SVMs have convex objective functions, but when people use SVMs, they are using some kind of features with an SVM on top. The success of the approach comes from both good features and waiting long enough for the optimizer to converge.

With DNNs, people learn everything (features + decision boundary), and this problem is not convex. Surprisingly DNNs work quite well in practice even though we were taught to be afraid of non-convex problems in grad school around 2005.

If, back in the early 2000s, we had stopped worrying about theoretical issues and explored approaches like ConvNets more, we might have had the deep learning revolution 10 years earlier.


The hinge loss and the primal form of the SVM objective are really easy to understand, yet every ML 101 class would jump into the dual formulation, talk about kernels, RKHS, and all the fancy stuff.

Once you realize that a linear SVM isn’t very different from logistic regression, it starts to all make sense (at least it did for me).

Key insight of the hinge-loss: once something is classified correctly beyond the margin, it incurs a loss of zero.

Now, something fun to think about. Draw the hinge loss. Now draw the ReLU (which is found all over the place in CNNs). Now think about L1-regularization (which was used to induce sparsity in compressed sensing). They are more similar in form than you would think.
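A quick NumPy sketch (variable names are mine) makes the resemblance concrete: the hinge loss is literally a ReLU applied to (1 - margin), and the L1 penalty is a sum of two ReLUs:

```python
import numpy as np

m = np.linspace(-2.0, 3.0, 101)   # margins y * f(x)
w = np.linspace(-2.0, 2.0, 101)   # weight values

def relu(x):
    return np.maximum(0.0, x)

hinge = relu(1.0 - m)             # hinge(m) = max(0, 1 - m)
l1 = relu(w) + relu(-w)           # |w| as a sum of two ReLUs

assert np.allclose(l1, np.abs(w))
assert hinge[m >= 1].max() == 0.0  # correct-beyond-margin examples incur zero loss
```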


It's funny how closely everything is connected. It turns out you can even derive the hinge loss using a mixture of normal distributions, so it is also closely connected to OLS.

Some people have had good luck with hinge or multi-hinge loss for neural networks instead of the almost universal log loss, since of course the hinge loss can be used in things other than linear models. It doesn't care how you get the y output.


Wasn't the reason everyone cared about SVMs specifically the nonlinear kernel stuff that would push performance up a smidge and produce new bests on existing benchmark datasets?

The QP-dual formulation never seemed like something that could scale, and linear SVMs never seemed all that much better than just lasso/elasticnet regression. (hmmmmm :) )


The nonlinear kernel SVM works at least as well in primal, using just the representer theorem. Since it is unconstrained, all you need to do is create a kernel matrix and solve your system with your favorite convex optimizer like Newton's method (which also can work in lower dimensions).

Second-order methods like Newton's method converge better to the exact solution than SGD, although they usually don't reach a "pretty good" solution as fast. Coordinate descent methods in the dual also get very close to the exact solution, but Newton's method and friends are usually faster. With (quasi-)Newton methods in the primal, everything just comes down to solving linear systems, which is a much more well-studied problem.
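Here is a small sketch of the primal idea, using squared loss instead of the hinge so the "Newton step" collapses to a single linear solve (the data and parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])        # XOR-like labels: not linearly separable

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Representer theorem: f(x) = sum_i alpha_i k(x_i, x). With squared loss the
# regularized primal problem is quadratic in alpha, so one exact step
# (a single linear solve) gives the solution.
K = rbf_kernel(X, X)
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

train_acc = (np.sign(K @ alpha) == y).mean()
assert train_acc > 0.9
```

With the hinge (or squared hinge) loss the structure is the same; you just iterate a few (quasi-)Newton steps instead of solving once.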

I've even experimented successfully with kernelized models with millions of examples in low dimensions using the Fast Gauss Transform. That's impossible in the dual.

You can also generate low-rank kernel approximations [0] using the Nystroem method or the Fastfood transform that can then be used in a linear SVM. For example, if I have a problem with n=10^6, I can make a low-rank approximation of the kernel matrix (say d=1000) and feed that into a fast SGD optimizer.

This often works really well, and is usually pretty close to the exact kernel solution if the problem is of lower intrinsic dimensionality, which is usually true if the dual SVM is sparse in its basis vectors. This largely negates the sparsity advantage of the primal SVM. If a kernel approximation isn't good, then the dual SVM wouldn't be meaningfully sparse anyway, so there is still no advantage of dual. Best just solve the kernelized system in the primal, and use a second-order optimizer if needed.
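A minimal sketch of that recipe with scikit-learn's Nystroem transformer (the dataset and hyperparameters here are arbitrary choices of mine):

```python
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Rank-300 approximation of the RBF kernel feature map, then a linear SVM.
clf = make_pipeline(
    Nystroem(gamma=0.001, n_components=300, random_state=0),
    LinearSVC(dual=False),
)
clf.fit(Xtr, ytr)
acc = clf.score(Xte, yte)
assert acc > 0.9
```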

[0] https://scikit-learn.org/stable/modules/kernel_approximation...


Back in around 2008, SVMs were all the rage in computer vision. We would use hand-designed visual features and then a linear SVM on top. That was how object detectors were built (remember DPM?).

Funny how SVMs are just max-margin loss functions and we just took for granted that you needed domain expertise to craft features like HOG/SIFT by hand.

By 2018, we use ConvNets to learn BOTH the features and the classifier. In fact, it’s hard to separate where the features end and the classifier begins (in a modern CNN).
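For flavor, here is a toy version of that 2008-era pipeline: a crude, hand-written HOG-like feature (per-cell, magnitude-weighted orientation histograms) with a linear SVM on top. The cell layout and bin count are my simplifications, nowhere near real HOG:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
imgs = X.reshape(-1, 8, 8)

def hog_like(img, bins=9):
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi          # unsigned gradient orientation
    feats = []
    for i in (0, 4):                          # four 4x4 cells
        for j in (0, 4):
            h, _ = np.histogram(ang[i:i+4, j:j+4], bins=bins,
                                range=(0, np.pi),
                                weights=mag[i:i+4, j:j+4])
            feats.append(h)
    f = np.concatenate(feats)
    return f / (np.linalg.norm(f) + 1e-8)     # rough block normalization

F = np.array([hog_like(im) for im in imgs])
Xtr, Xte, ytr, yte = train_test_split(F, y, random_state=0)
acc = LinearSVC(dual=False).fit(Xtr, ytr).score(Xte, yte)
assert acc > 0.4   # well above the 10% chance level from 36 hand-crafted numbers
```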


So the pitch is that you don't have to do feature engineering... but then instead it seems people do network structure engineering with featurish things like convolutions.

The performance is still better in most cases but I often have to wonder, are people just doing feature engineering once removed and is the better performance just the result of having WAY more parameters in the model?


I guess one upshot to the SVM approach is that there's math for quantifying how well a given model will generalize, subject to some assumptions.

Is there anything like that in the ANN world?


In short, no. Not for practically large models used in common tasks like image classification or speech to text.


are people just doing feature engineering once removed and is the better performance just the result of having WAY more parameters in the model?

Not really, or sort of, depending on how you think.

A deep neural network does work - at least to some extent - because of the large number of parameters. However, it is practical because it can be trained in a reasonable amount of time.

Things like ResNets are useful because they allow us to train deeper networks.

You can create an SVM with the same number of parameters[1], and in theory it could be as accurate (this is basically the no free lunch theorem[2]). But you won't be able to train it to the same accuracy.

[1] Of course there are practical concerns about what you do for features, since hand-created features just aren't as good as neural network ones. One thing people do now is use the lower layers of a deep neural network as a feature extractor and then put an SVM on top of them as the classifier. This works quite well, and is reasonably fast to train.
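A sketch of that trick using scikit-learn only (layer sizes and hyperparameters are arbitrary): train a small network end to end, then reuse its ReLU hidden layer as the feature extractor for a linear SVM.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Train a small network end to end...
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(Xtr, ytr)

# ...then treat its (ReLU) hidden layer as a learned feature extractor.
def features(A):
    return np.maximum(0.0, A @ mlp.coefs_[0] + mlp.intercepts_[0])

svm = LinearSVC(dual=False).fit(features(Xtr), ytr)
acc = svm.score(features(Xte), yte)
assert acc > 0.85
```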

[2] https://en.wikipedia.org/wiki/No_free_lunch_theorem


But you won't be able to train it to the same accuracy.

I'm not sure I agree with this bit in theory. A neural network is a stack of layers, and that stack can be seen as a set of learned basis functions. And basis functions are what kernels represent. Trivially, you could then "copy" the weights that an ANN would learn into a kernel and obtain the same accuracy.

The reason this doesn't work in practice is, in SVMs, you tend not to learn kernels from scratch but use (possibly a combination of) standard parameterized kernels - [1]. The learning step in the SVM adapts this standard kernel to your dataset as much as the parameters allow, but this would be sub-optimal compared to learning a kernel (or the corresponding basis functions) from scratch that's built just for your data. With a well trained ANN the latter is what you get.
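The "copy the weights into a kernel" point can be checked mechanically (sizes here are arbitrary): take a trained network's hidden activations phi(x); a linear SVM on phi is then the same model as a kernel SVM with the induced kernel k(x, x') = phi(x)·phi(x'):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=400,
                    random_state=0).fit(X, y)

# Learned basis functions: the network's ReLU hidden layer.
phi = np.maximum(0.0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

lin = SVC(kernel="linear").fit(phi, y)               # linear SVM on the features
pre = SVC(kernel="precomputed").fit(phi @ phi.T, y)  # SVM with the induced kernel

agreement = (lin.predict(phi) == pre.predict(phi @ phi.T)).mean()
assert agreement > 0.98   # same model up to numerical noise
```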

[1] there has been a fair amount of work on learning kernels too, but its not as mainstream as using standard kernels.


Trivially, you could then "copy" the weights that a ANN would learn into a kernel and obtain the same accuracy.

Sure.

I'm not sure I agree with this bit in theory.

No one really agrees with it in theory - I'm not aware of a good theoretical explanation as to why some deep networks are easier to train. And yet there is a growing body of real, generalized practical hints which work pretty reliably.

This is pretty exciting! There are undiscovered ideas here. But it is unsatisfactory from a theoretical standpoint at the moment.


If you use the right sort of kernel for an SVM it becomes a neural network with automatic architecture derivation.

See slide 7:

http://www.cs.rpi.edu/~magdon/courses/LFD-Slides/SlidesLect2...
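The claim is easy to verify with scikit-learn (the dataset choice is mine): a trained RBF SVM's decision function is exactly a 2-layer network whose hidden units are kernel evaluations at the support vectors, with the dual coefficients as output weights. The "architecture" (hidden width = number of support vectors) falls out of the optimization:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
mask = y < 2                  # binary task, so there is one decision function
X, y = X[mask], y[mask]

svm = SVC(kernel="rbf", gamma=0.001).fit(X, y)

# "Hidden layer": one unit per support vector, activation k(x, sv).
H = rbf_kernel(X, svm.support_vectors_, gamma=0.001)
# "Output layer": weighted sum of hidden activations plus a bias.
f = H @ svm.dual_coef_.ravel() + svm.intercept_

assert np.allclose(f, svm.decision_function(X), atol=1e-6)
```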


Significantly, it becomes a simple, 2-layer neural network. The advances in neural networks over the past decade have largely relied on "deep" architectures with many layers. Very deep networks effectively learn the features from the data, rather than learn a decision surface over a set of hand-crafted features, as in learning with SVMs or shallow neural networks.


I thought it had been proven that a two-layer neural network has the same power as a deep one (obviously with a much greater width). It's just that deep neural networks are a lot more practical to train. So I'm not sure how important that distinction is.


This is something of an academic factoid that has nothing to do with the practice of training and using neural networks, or with the merits of deep networks that I was describing above.

Shallow feed-forward networks are "universal function approximators" [0] when the number of hidden neurons is finite but unbounded. Of course, the width of that layer grows exponentially in the depth of the deep network that you might wish to approximate [1].

The statement that "[i]t's just that deep neural networks are a lot more practical to train" (emphasis mine) sounds somewhat reductive; it's not only that depth is a nice trick or hack for training speed, but that depth makes the success of deep networks in the past decade at all possible. We live in a world with bounded computing resources and bounded training data. You cannot subsume all deep networks into shallow networks, and shallow networks into SVMs in the real world. So I am pretty sure of how important that distinction is.

And what's more, depth extracts a hierarchy of interpretable features at multiple scales[2], and a decision surface embedded within that feature space, rather than a brittle decision surface in an extremely high dimensional space with little semantic meaning. One of these approaches generalizes better than the other to unseen data.

[0] https://en.wikipedia.org/wiki/Universal_approximation_theore...
[1] https://pdfs.semanticscholar.org/f594/f693903e1507c33670b896...
[2] https://distill.pub/2017/feature-visualization/


An important addition to this is priors: deep networks let us express the prior that hierarchical representations, i.e. composing multiple layers of abstraction, make sense (see e.g. conv nets).


If an SVM kernel can replicate a 2 layer NN, why couldn't there be a kernel for a X layer NN, and then autoderive the architecture just like SVMs can autoderive the correct number of neurons? Then there'd also be a more robust theoretical understanding of what's happening.


See my other point: there might be; in fact, there definitely is for any working NN, but as of now (2019, happy new year) we probably can't find it.


An infinitely sized 2 layer NN is universal in the same way a Turing machine is universal — sure you can write any program; God help you if you try.


If I remember my Goodfellow correctly (and quickly checking Wikipedia, I did: https://en.wikipedia.org/wiki/Universal_approximation_theore... ), there is a nuance here which is almost always missed: you can represent any function with a sufficiently wide 2-layer neural network, but that says nothing about being able to tune the network until you find a correct setting (i.e. learnability).

This is important. Flippantly said, discarding learnability and speed of convergence, you can get the power of any neural network by the following algorithm:

1. Randomly generate a sufficiently wide bit pattern.
2. Interpret it as a program and run it on the test set.
3. Discard results until the desired accuracy is reached.


Fwiw the "deep learning" advances in NLP have typically still been from shallow networks, almost always less than 10 layers and usually more like 2.


Could you say a little more about this?

I ask because when we're training humans to understand things, there are a variety of benefits to separating feature-understanding from the classifiers. In particular, you get gains in flexibility, extendability, and debuggability.

I get why people are happy to take the ConvNet gains and run with them for now. But have you seen any interesting work to get the benefits of separation in the new paradigm? (Or, alternately, is there a reason why those concerns are outmoded?)


That's actually closer to how deep learning started. Initially, deep learning mostly consisted of unsupervised (task independent) features with a linear classifier on top. We had to fit an unsupervised model (e.g. autoencoder) layer by layer before using the feature layers in a supervised task.

This was because we didn't understand how to train a deep model end-to-end until later. When we learned how to make that end-to-end training work it tended to perform better because the learned features were task specific.

You can still learn general features in a bunch of ways, in addition to the older method using autoencoders. For one example, multiple supervised heads with auxiliary losses can learn more general features.


To be fair there is a lot of domain knowledge embedded in the use of a convolutional architecture. There is a fascinating paper where the authors don't even train the weights of the convolutional layers and are still able to achieve good performance.

https://arxiv.org/pdf/1606.04801v2.pdf


It’s also hard to separate the design of the neural architecture from the definition of the feature extractor.


I feel like we pushed features into the architecture and called it a day.

Otherwise, why we would we need a gazillion architectures for different problems (or even the same exact problem)?


You still may want to replace the softmax layer with a support vector machine for classification sometimes.


Here is a zooming example. I definitely noticed that it makes people's eyes look evil. Maybe it's hallucinating animal eyes on top of human eyes...

http://imgur.com/kThM9vf


That's a bit freaky


ConvNets have gotten popular because of their strong empirical results. All the recent work on visualizing CNNs suggests that the community working on Deep Learning still has a lot to learn about their own algorithms.

But high-level notions like a Jaguar is a cat-like animal aren't necessary to perform well on an N-way classification task like ImageNet.

What's more important to note is everybody knows there's plenty wrong with a pure appearance-based approach like CNNs. Every few years a new approach pops up that is based on ontologies, an approach inspired by Plato, etc, but these systems require a lot of time and effort. More importantly, they don't perform as well on large-scale benchmarks. In the publish-or-perish world, you can jump on the CNN bandwagon or start reading Aristotle's metaphysics and never earn your PhD.


I doubt that reviewers for NIPS would reject a paper with a novel approach because it didn't perform at a best-in-class level, provided it offered a way forward.

If it doesn't work at all, or isn't a new idea, that's different.


Is there an official list somewhere which assigns animals to the kinds of people you will find in a startup? Like "shark" for the VC, "owl" for the PhD, "grizzly bear" for the bearded Linux hacker, etc.?

I remember a political satire from TV in the post-communist days in Poland (1991?) -- Lech Walesa was a Lion. I'd love to see a cartoon/puppets show about startups and/or software dev with animals. The honey badger would def be in there. :-)


I use a pen when I read. I underline important bits and actually write down the key points in the margin. Writing these things down helps memory and I don't really care about leaving the pages in the same state that I found them. Also I don't always have a notebook with me, and I do this to my own books only. I just don't care about "ruining" a book -- I want the knowledge!

Of course you might not want to borrow some of my books when I leave them in this state -- they look like study aids from a med school study-a-thon. But whatever trick YOU find that helps you pound knowledge into your brain is worth it.


Yes to @eof's response. He does know me well, and should be able to clarify in my absence.

There is going to be an AI engine that we use in the future in a similar way that we use Linux today. Unless Linus has been busy ML-ing, that "future Linux-ish AI engine" won't be called Linux.

