It sounds like jongjong was probably using surrogate gradients. You keep the step activation in the forward pass but replace it with a smooth approximation in the backward pass.
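For anyone unfamiliar, the trick can be sketched in a few lines (a hand-rolled NumPy illustration, not from any particular spiking-NN library; the sigmoid surrogate and the `beta` sharpness parameter are just common illustrative choices):

```python
import numpy as np

def spike_forward(x):
    # forward pass: hard threshold (Heaviside step), the true binary activation
    return (x > 0).astype(float)

def spike_backward(x, beta=4.0):
    # backward pass: pretend the activation was a steep sigmoid and use its
    # derivative, since the step's true gradient is zero almost everywhere
    s = 1.0 / (1.0 + np.exp(-beta * x))
    return beta * s * (1.0 - s)
```

In a framework you'd register the second function as a custom gradient (e.g. via `jax.custom_vjp` or `torch.autograd.Function`) so the forward pass stays binary while backprop still gets a usable signal.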
I can't remember the name of the algorithm we used. It wasn't doing gradient descent, but it was a similar principle: basically, adjust each weight up or down by a small fixed step according to its contribution to the error. It was much simpler than calculating gradients, but it still gave pretty good results for single-character recognition.
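That description sounds like it could be the classic perceptron rule (or something Widrow-Hoff-like); a guess at what it may have looked like, as a pure-Python sketch:

```python
def train_step(weights, x, target, lr=0.1):
    # perceptron-style update: no gradients, just nudge each weight by a
    # fixed step, signed by the error and scaled by that weight's input
    # (its "contribution" to the output)
    y = 1.0 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0.0
    error = target - y  # -1.0, 0.0, or +1.0
    return [w + lr * error * xi for w, xi in zip(weights, x)]
```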
Yeah, I think surrogate gradients are usually used to train spiking neural nets where the binary nature is considered an end in itself, for reasons of biological plausibility or something. Not for any performance benefits. It's not an area I really know that much about though.
There are performance benefits when they're implemented in hardware. The brain is a mixed-signal system whose massively parallel, tiny, analog components keep it ultra-fast at ultra-low energy.
Analog NNs, including spiking ones, share some of those properties. Several chips, like TrueNorth, are designed to take advantage of that on the biologically inspired side. Others, like Mythic AI's, are accelerating conventional ML workloads.
Very cool project. If you haven't already, for JAX and PyTorch support take a look at the Python Array API Standard, https://data-apis.org/array-api/latest/, and see https://data-apis.org/array-api-compat/ for how to use it. If you have or can write everything in terms of the subset of NumPy supported in the Array API Standard, you can get support for alternative array libraries almost for free.
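As a sketch of the pattern (with a plain-NumPy fallback in case `array-api-compat` isn't installed; in real library code you'd depend on its `array_namespace` directly):

```python
import numpy as np

try:
    from array_api_compat import array_namespace
except ImportError:
    # fallback sketch so this snippet runs with plain NumPy
    def array_namespace(*arrays):
        return np

def log_sum_exp(x):
    # written against the Array API subset: take the namespace from the
    # input array instead of importing numpy directly, so the same code
    # works on NumPy, JAX, PyTorch, CuPy, ... arrays
    xp = array_namespace(x)
    m = xp.max(x)
    return m + xp.log(xp.sum(xp.exp(x - m)))
```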
That looks like a valuable resource, thank you! I already mostly stuck to a subset supported by both NumPy and JAX (because those are the array libraries I'm familiar with). I hope the others are not too far off...
If I had to give a loose definition of topology, I would say that it is actually about studying spaces which have some notion of what is close and far, even if no metric exists. The core idea of neighborhoods in point-set topology captures the idea of points being near another point, and allows defining things like continuity and sequence convergence, which require a notion of closeness. From Wikipedia [0], for example:
The terms 'nearby', 'arbitrarily small', and 'far apart' can all be made precise by using the concept of open sets. If we change the definition of 'open set', we change what continuous functions, compact sets, and connected sets are. Each choice of definition for 'open set' is called a topology. A set with a topology is called a topological space.
Metric spaces are an important class of topological spaces where a real, non-negative distance, also called a metric, can be defined on pairs of points in the set. Having a metric simplifies many proofs, and many of the most common topological spaces are metric spaces.
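For reference, the standard definition being alluded to (textbook material, not specific to the article):

```latex
\text{A topology on a set } X \text{ is a collection } \tau \subseteq \mathcal{P}(X) \text{ such that:}
\quad \emptyset, X \in \tau; \qquad
\bigcup_{i \in I} U_i \in \tau \ \text{ for any } \{U_i\}_{i \in I} \subseteq \tau; \qquad
U \cap V \in \tau \ \text{ for any } U, V \in \tau.
```

A function $f : X \to Y$ is then continuous exactly when preimages of open sets are open, and a metric $d$ recovers this picture by generating open sets from the open balls $B(x, r) = \{\, y : d(x, y) < r \,\}$.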
That's not to say that topology is necessarily the best lens for understanding neural networks, and the article's author has shown up in the comments to state he's moved on in his thinking. I'm just trying to clear up a misconception.
I think the truly surprising thing is just how well floating point numbers work in many practical applications despite how different they are from the real numbers. One could call it the "unreasonable effectiveness of floating point mathematics".
There are many situations where you have something you want to compute to within a small number of units in the last place (ulps) that seems to require going to extended or arbitrary precision, but very often there are clever methods that get you there in ordinary working precision. Maybe not that surprising, but it's something I find interesting.
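A tiny example of the kind of trick I mean, using the standard library's `math.log1p` (the constant is just for illustration): computing log(1 + x) for small x.

```python
import math

x = 1e-10
naive = math.log(1 + x)   # 1 + x rounds away most of x's digits first
clever = math.log1p(x)    # computes log(1 + x) accurately, no extra precision needed

# here the naive version carries a relative error around 1e-7,
# while log1p is accurate to within a few ulps
```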
Wittgenstein himself states that the Tractatus is nonsense in its closing pages.
My propositions are elucidatory in this way: he who understands
me finally recognizes them as senseless, when he has climbed out
through them, on them, over them. (He must so to speak throw
away the ladder, after he has climbed up on it.)
I think you may agree with the Wittgenstein of the Tractatus more than you realize. My understanding is that his main goal at that time was to show that many of the classic problems of metaphysics which plagued philosophers for centuries or more are literally just nonsense. He didn't write the Tractatus to convince regular people though, but to convince academic philosophers of his time. He earned his fame by being somewhat successful. Rather than making a logical argument for his point, I understand his aim as stimulating his audience to think things out for themselves by offering them carefully crafted nonsense that gave a fresh perspective.
I think you just have no use for the Tractatus because you're not preoccupied with metaphysical questions.
The Fortran implementation is just a reference implementation. The goal of reference BLAS [0] is to provide relatively simple and easy to understand implementations which demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler which doesn't yet exist could generate code which rivals hand (or automatically) tuned optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.
Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.
Fair enough, this is not meant to be some endorsement of the standard Fortran BLAS implementations over the optimized versions cited above. Only that the mainstream compilers cited above appear capable of applying these optimizations to the standard BLAS Fortran without any additional effort.
I am basing these comments on quick inspection of the assembly output. Timings would be equally interesting to compare at each stage, but I'm only willing to go so far for a Hacker News comment. So all I will say is perhaps let's keep an open mind about the capability of simple Fortran code.
Check out The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. Chapter 5 walks through how to write an optimized GEMM. It involves clever use of block multiplication, choosing block sizes for optimal cache behavior on specific chips. Modern compilers just aren't able to do such things on their own. I've spent a little time debugging things in scipy.linalg by swapping out OpenBLAS for reference BLAS and have found the slowdown from using reference BLAS is typically at least an order of magnitude.
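The core idea from that chapter can be sketched in a few lines (an illustrative NumPy version; real BLAS kernels do this in registers and SIMD with chip-specific block sizes, so this only shows the blocking structure):

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    # tile the multiplication so each partial product's working set fits
    # in cache; optimized BLAS picks bs per cache level and per chip
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C
```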
You are right. I just tested this out, and going from reference BLAS to OpenBLAS took me from 6 GFLOP/s to 150 GFLOP/s. I can only imagine what BLIS and MKL would give. I apologize for my ignorance. Apparently my faith in the compilers was wildly misplaced.
No, you can still trust compilers: 1) The hand-tuned BLAS routines are essentially a different algorithm with hard-coded information. 2) The default OpenBLAS build uses OpenMP parallelism, so much of the speedup likely comes from multithreading. Set the OMP_NUM_THREADS environment variable to 1 before running your benchmarks. You will still see a significant performance difference due to a few factors, such as the extra hard-coded information in the OpenBLAS implementation.
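Note the variable has to be set before the BLAS library is loaded; a minimal benchmark sketch (matrix size and the GFLOP/s formula are just for illustration):

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before NumPy/BLAS is imported

import time
import numpy as np

n = 512
A = np.random.rand(n, n)
B = np.random.rand(n, n)

t0 = time.perf_counter()
A @ B
elapsed = time.perf_counter() - t0
gflops = 2 * n**3 / elapsed / 1e9  # a matmul does ~2 n^3 floating-point ops
```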
I ran with OMP_NUM_THREADS=1, but your point is well taken.
As for the original post, I felt a bit embarrassed about my original comments, but I think the compilers actually did fairly well with what they were given, which I think is what you are saying in your first point.
I’ve found code generation tools a boon to my Emacs-ing because I can use them to help write Emacs lisp code for any customization I think I might need in the moment without breaking the flow of concentration towards my primary task.
I've developed a workflow that's working pretty well for me. I treat the LLM as a junior developer that I'm pair programming with and mentoring. I explain to it what I plan to work on, run ideas by it, show it code snippets I'm working on and ask it to explain what I'm doing, and whether it sees any bugs, flaws, or limitations. When I ask it to generate code, I read carefully and try to correct its mistakes. Sometimes it has good advice and helps me figure out something that's better than what I would have done on my own. What I end up with is like a living lab notebook that documents the thought processes I go through as I develop something. Like you, for individual tasks, a lot of times I could do it faster if I just wrote the damn code myself, and sometimes I fall back to that. In the longer term I feel like this pair programming approach gives me a higher average velocity. Like others are sharing, it also lowers the activation energy needed for me to get started on something, and has generally been a pretty fun way to work.