For the best experience on desktop, install the Chrome extension to track your reading on news.ycombinator.com
Hacker Newsnew | past | comments | ask | show | jobs | submit | history | thesz's commentsregister

  > doesn't that mean we're approaching optimality?
No.

Transformers are Markov chains [1]. Somewhere around this fascinating site [2] I read that stateful models have an advantage. Author provided an example, a state machine with two states A and B, where at state A transitions are to state A (output 0) and to state B (output 1) with equal probability and at state B the transition is always to state A and output is always 1.

For this state machine just one bit of memory can make an optimal prediction that ones always go in pairs, whereas Markov chain will approximate this prediction and never reach optimality.

  [1] https://arxiv.org/abs/2410.02724
  [2] https://bactra.org/

Markov chains are themselves a kind of state machine, namely a probabilistic deterministic finite automaton (PDFA), albeit where state is solely governed by the N most recent symbols. (Deterministic means that given a sequence, we can always infer the associated state transitions unambiguously). I believe the example in the reference you provide represents the more general case of PDFA, which is not representable as a finite order Markov processs.

I think that might have been the example given here: https://bactra.org/notebooks/nn-attention-and-transformers.h...

Huh? Any process on a computer by itself is also a Markov chain.

If you include all the information the LLM uses to produce the next token as part of the state, then of course the LLM is a Markov chain.

So would be any other process for sampling continuations of a text, with finite memory.


My comment in the previous discussion of that paper: https://news.ycombinator.com/item?id=48014197

Authors used LTL (linear temporal logic) to express, basically, non-reduced non-ordered binary decision diagrams. Or just binary decision diagrams, BDDs.

BDDs are almost guaranteed to have exponential size because they do not employ reduction (sharing of common expressions). Reduced BDDs are more succinct and reduced ordered BDDs are even more succinct.

Also, transformers in the paper are constructed, not trained. Training any model to express some truth table is very hard. They also did not perform comparison with, say, Kolmogorov-Arnold representation, which is also universal approximator.

So this paper is not as deep as one may think it is.


Thank you for the explanation, makes it a lot more digestible

Just nit picking a bit:

> Training any model to express some truth table is very hard

What kind of models are you including here? Truth tables can be modeled in regular code very easily and reliably. And I’m sure there are many deterministic models that could do the same. Are you talking about LLMs in particular or a certain category/type of models?


  > What kind of models are you including here?
Logistic regression, non-linear logistic regression, simple neural networks, etc. Not the code as in "bunch of ifs" or decision trees.

The truth tables? Predict zeroth bit in enwik9 from, say, bits of 64 previous bytes. Or multiply two 16 bit numbers and predict bit 16 of the multiplication result.


What?

Reduced Ordered BDDs are likely not as succinct as an arbitrary BDD.

The famous Hidden Weight Bit problem can be more succinctly expressed in arbitrary BDDs (changing the order and revisiting nodes), but are provably EXPSPACE in ROBDDs.

-------

We study ROBDDs because they uniquely reduce to a canonical form. All functions have exactly one (!!!!) ROBDD.

BDDs in general however are really arbitrary, as arbitrary as any codebase can basically get. That makes BDDs in general to difficult to study or do math upon.

Results based on the studies of arbitrary BDDs do NOT apply to the simpler, easier to understand world of ROBDDs. And vice versa.


I can now elaborate on my comment above a bit more...

Ingo Wegener's "Branching Programs and Binary Decision Diagrams" copyright year 2000, is considered the introductory work on all BDD related matters. On the front cover are two BDDs: BOTH of these implement the hidden weighted bit function (HWB-4) introduced in the first chapter, and is basically the main example used throughout the whole book.

Please locate a copy of the front cover ASAP, before continuing with my post.

-----------

Now that you've found the book cover, look at it carefully and understand the HWB-4 function. (Just traverse it by hand, dotted lines represent "x1 == 0", while solid lines represent "x1 == 1").

Both BDDs fail to be ROBDDs: the left diagram explores the variables in this order: x1, x2, x3, x4, and then (depending on path), x1/x2/x3. That is, the variables x1/x2/x3 are explored _TWICE_ (breaking the rule for ROBDDs).

The example on the right also fails to be a ROBDD. The variable orderings are seemingly random: one branch covers x4, x1, x2, x3. The other branch is x4, x3, x1, x2. This fails the "ordered" property as the different branches cover different orderings.

BOTH are BDDs however. Both clearly implement the HWB-4 function efficiently (less than EXP-space. Indeed, its a very small graph for the front cover of a book). If these were actually ROBDDs, they'd be ridiculously large and unable to fit.

-------

ROBDDs are again, studied due to the unique property (and proof), that for any given function, there can only be one ROBDD ever generated (!!!!). For the verification problem of circuits, this is extremely useful. Different circuits might implement the same function, despite one circuit being smaller or using less power (or other useful properties).

To "prove" that the circuits output the same function, you simply create an ROBDD. Because functions ALWAYS result with the same ROBDD, you can ALWAYS use this process to prove circuit equivalence.

ROBDDs aren't succinct. No: their usefulness is the opposite. Exactly one ROBDD to ever find for any binary function that has ever existed or will ever exist.


While it is true on the schema level, it is not true at the query level.

The most frequently used kind of join is a general relation.

[1] https://github.com/agirish/tpcds/blob/master/query1.sql

Query 1 from TPC-DS creates a multi-column relation by using GROUP BY. Which relation is then partially constrained by different columns from different tables.


Imperative languages such as C/C++ specify "microtransactions" - an ordering over memory accesses (including (de)allocations) within some statement or group of statements.

Compilers are free to rearrange these accesses if the final result is same as if executed by these ordered microtransactions.

Consider loop fusion, loop splitting and/or loop skewing.


For all practical purposes, it very much is, consider random number generation.

While agree with the main statement, I don’t understand the point about RNG. You don’t need Turing-completeness nor imperativeness for RNG.

Most probably you do not want two of your SELECT statements to have the same seed during the execution.

This necessitates some state, most probably, persisted one. Hence, imperativeness.


Some of them, with notable exception of perfect squares of primes, can be expressed by products of different combinations of factors.

E.g., 6^2 = (223)3 = 2(233).

One of ~22 (ln(2^32)) perfect squares will be a square of perfect prime. Most won't.


Decoding this so it's readable: 6^2 = (2*2*3)*3 = 2*(2*3*3).

(You can escape * with \: \*)


  > gradient descent isn't good at combinatorial optimisation.
If you convolve your problem with sufficiently wide Gaussian, you can use gradient descent. The approach is called Natural Evolution Strategies [1].

[1] https://en.wikipedia.org/wiki/Natural_evolution_strategy#Nat...

It requires O(N^4) evaluations to compute Fisher Information Matrix for N-dimensional parameterization of the problem in original formulation. But there are closed form solutions and more economical representations of covariance matrix (LoRA, hehe).


The difference between training and inference is 1) one have to keep intermediate results for backward pass in training and 2) computation for training double because of the backward pass.

Training is also done over batches, which increase memory requirements by several orders of magnitude. This is why training needs costly compute.

One of the ways out of this unfortunate situation is to use something like Stochastic Average Gradient Descent [1]. Examples there are mostly concerned with regularized logistic regression, which makes problem more or less convex. Neural networks are inherently non-convex. Still, maybe some ideas from there can be utilized in the context of neural networks, like use of estimated Lipshitz constant to derive curvature and appropriate learning step.

  [1] https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.pdf


So one way to think about it is roughly,

Training is inference + backwards pass (~2x inference cost) + activations (vram overhead) + optimizer (vram overhead) + gradients (vram overhead).


Multiply "inference + backwards pass (~2x inference cost) + activations (vram overhead)" by batch size (thousands) to get to the actual RAM and compute cost. Optimizer like ADAM adds only two or three model-sized overhead.

And last, but not least, you need only one hidden layer kept in RAM for inference, but you need all of them (61 for Deepseek models) kept in RAM for computing gradient for one sample.


Microbatch size is a hyperparameter, it can be set to 1 and work just as effectively. With gradient accumulation it's equivalent even. Large batch sizes are used to increase parallelism, and sometimes to reduce variance in the loss signal (at the cost of increased bias).

Batch size is frequently limited by compute bottlenecks well before memory.


And of course you do all of this for every object in your training set, which is going to be larger than the total number of uses for any individual user.


Does it matter what is the difference in size of needed inputs for inference vs. training?


That is an estimate of the relative cost of one training step, but you have to multiply it by the number of training steps, an unknown quantity.


It's all got much more complex than that in recent years. Training now involves large amounts of inference for RL rollouts and similar. You can't disentangle them computationally like that. "Inference" is just the word used to mean serving customer traffic now, and "training" means creating the model you serve.

LLMs are Markov Chains [1]. "Emergent abilities" of LLMs can be explained by decrease of perplexity in text prediction [2].

  [1] https://arxiv.org/abs/2410.02724
  [2] https://arxiv.org/abs/2304.15004


  > central concept or software engineering is architecture patterns.
Both RUP and PSP/TSP do stand on the ground of defect prevention. All sorts of defects, from incorrect sets of requirements to memory corruption.

Architecture patterns can be of help in that regard and they also can be very error-prone, as right now I am in the process of removing a bug introduced through misunderstanding of one rather old singleton.


> Both RUP and PSP/TSP do stand on the ground of defect prevention. All sorts of defects, from incorrect sets of requirements to memory corruption.

Was this comment generated by an hallucinating agent? It reads like poorly pieced together word soup.


English is not my native language.

PSP/TSP (Personal and Team Software Processes) and RUP (Rational's Uniform Process, to a lesser extent) are very valuable approaches to software engineering. Please, read about them, they are very interesting.


> English is not my native language.

That's not it. It's the random word soup. PSP/TSP has absolutely nothing to do with the discussion.


PSP/TSP have everything to do with the discussion.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search:

HN For You