Man, I don't get it. Was I the only one put off by the video game portion of the story? It seemed so out of place, enough for me to go "rubbish" and toss the series aside.
Why someone like Henri Poincare or Polya was not included in this list is beyond me. Both were interested in figuring out how mathematicians did what they did, and both had brilliant insights.
This makes me wonder: is deep learning as a field an empirical science purely because everyone is afraid of the math? It has the richness of modern-day physics, but for some reason most of the practitioners seem to want to keep thinking of it as the wild west.
No, there are many very mathematically inclined deep learning researchers. It's an empirical science because the mathematical tools we possess are not sufficient to describe the phenomena we observe and make predictions under one unified theory. Being an empirical science does not mean that the field is a "wild west". Deep learning models can be subjected to repeatable, controlled experiments, from which you can improve your understanding of what will happen in most cases. Good practitioners know this.
>It's an empirical science because the mathematical tools we possess are not sufficient to describe the phenomena we observe and make predictions under one unified theory.
To me, deep learning is actually itself a [long-awaited] tool (with well-established, and simple at that, math underneath: gradient-based optimization, vector-space representation and compression) for making real progress toward mathematical foundations of the empirical science of cognition.
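To be concrete about how simple that underlying math is, here is a minimal sketch of gradient-based optimization on a least-squares problem; everything here is illustrative, nothing is from a specific library's API beyond numpy itself:

```python
import numpy as np

# Gradient descent on a mean-squared-error loss: the "simple math underneath".
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # inputs
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # noisy targets

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)     # gradient of the MSE loss
    w -= lr * grad                            # the entire "learning" step

print(w)  # converges close to w_true
```

Deep nets replace the linear model with a stacked nonlinear one and compute the gradient by backpropagation, but the optimization loop is essentially this.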
In the '90s there was work showing, for example, that Gabor filters in the first layer of the biological visual cortex are optimal for the kind of feature-based image recognition we perform. And, as it happens, the convolution kernels in the first layers of deep visual NNs also converge to Gabor-like filters. I see [signs of] similar convergence in the other layers (and all those semantically meaningful vector operations in the embedding space of LLMs are also very telling). Proving optimality or anything similar is much harder there, yet to me those "repeatable controlled experiments" (i.e. stable convergence) provide a strong indication that it will hold: something drives that convergence, and when there is such a drive in a dynamical system, you naturally end up asymptotically "attracted" near something either fixed or periodic. That would be a (or even "the") mathematical foundation for an understanding of cognition. (Divergence from real biological cognition, i.e. the emergence of a completely different yet comparable type of cognition, would also be a great result, if not an even greater one.)
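You can eyeball the Gabor claim yourself by dumping the first-layer convolution kernels of a pretrained network. A quick sketch, assuming torchvision and matplotlib are installed (the `AlexNet_Weights` loading API is from recent torchvision versions):

```python
import matplotlib.pyplot as plt
from torchvision.models import alexnet, AlexNet_Weights

model = alexnet(weights=AlexNet_Weights.DEFAULT)
kernels = model.features[0].weight.detach()      # first conv layer: (64, 3, 11, 11)
kernels = (kernels - kernels.min()) / (kernels.max() - kernels.min())  # rescale to [0, 1]

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, k in zip(axes.flat, kernels):
    ax.imshow(k.permute(1, 2, 0).numpy())        # reorder to (H, W, RGB) for display
    ax.axis("off")
plt.show()  # many kernels look like oriented, Gabor-like edge patches
```

The same exercise on other trained vision nets tends to show a similar gallery of oriented edge and color-blob detectors, which is exactly the stable-convergence point above.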
A little bit of A and B. You can do a lot with very little math beyond linear algebra, calculus, and undergraduate probability, and that knowledge is mainly there to provide intuition and to formalize the problem you're solving a bit. You can also churn out results (including very impressive ones) without doing any math.
A result of the above is that people are empirically demonstrating new problems and solving them very quickly, much more quickly than people can come up with theoretical results explaining why the solutions work. The theory is harder to come by for a few reasons, one being that many of the successful examples of deep learning don't fit nicely into the older frameworks, e.g. from statistics and optimal control, that would otherwise explain them.
There's no such thing as a "head transplant". The person receiving the transplant is the one whose head it is. The transplant they receive is the rest of the body. Therefore, "full body transplant" is a more accurate term.
Respectfully, I don't think you or anyone else alive right now actually knows how this would go down. Given the number of interactions between the organs, central nervous system, endocrine system, and brain, I'm not sure anybody involved would remain "them".
> state-space models make transformer based models obsolete
We will see pretty soon whether they work at large scale. I hope they will, but they might not. There are models that outperform more advanced ones at smaller scales, and I haven't heard how Mamba performs at GPT scale.
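For anyone unfamiliar with what's being compared here, the appeal of state-space models is a linear recurrence that processes a sequence in O(n) time with constant memory, versus attention's O(n²). A toy, heavily simplified skeleton (real Mamba uses input-dependent "selective" parameters and a hardware-aware parallel scan; none of this is Mamba's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 4, 2
A = 0.9 * np.eye(d_state)              # state transition (kept stable)
B = rng.normal(size=(d_state, d_in))   # input projection
C = rng.normal(size=(1, d_state))      # readout

x = rng.normal(size=(10, d_in))        # a length-10 input sequence
h = np.zeros(d_state)
ys = []
for x_t in x:                          # one pass, constant memory
    h = A @ h + B @ x_t                # h_t = A h_{t-1} + B x_t
    ys.append(C @ h)                   # y_t = C h_t
print(np.array(ys).shape)              # (10, 1)
```

The open question the parent raises is whether this fixed-size hidden state can match attention's ability to recall arbitrary earlier tokens once you push to GPT-scale training.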
> The attention mechanism allows us to aggregate data from many (key, value) pairs. So far our discussion was quite abstract, simply describing a way to pool data. We have not explained yet where those mysterious queries, keys, and values might arise from. Some intuition might help here: for instance, in a regression setting, the query might correspond to the location where the regression should be carried out. The keys are the locations where past data was observed and the values are the (regression) values themselves
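The quoted regression intuition is essentially Nadaraya-Watson kernel regression, and it fits in a few lines. A minimal sketch (the function name and the Gaussian similarity score are my choices for illustration, not the quoted text's notation):

```python
import numpy as np

# Keys = locations where data was observed, values = the observed y's,
# query = the location where we want the regression carried out.
keys = np.linspace(0, 5, 50)
values = np.sin(keys) + 0.1 * np.random.default_rng(0).normal(size=50)

def attend(query, keys, values, width=0.5):
    scores = -((query - keys) ** 2) / (2 * width**2)  # similarity of query to each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax -> attention weights
    return weights @ values                           # weighted average of the values

print(attend(2.5, keys, values))  # roughly sin(2.5) ~ 0.6
```

Transformer attention replaces the fixed Gaussian similarity with learned projections and dot products, but the "weight the values by how well each key matches the query" structure is the same.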
Would this kind of understanding aid in extracting 'skills' from an LLM by identifying the relevant topology and isolating that part? Then we could have a toolbox of skills to assemble into what we needed?