Hacker News | frannyg's comments

Thanks for the detailed explanations. And the rambling as well!

Pretty much every Yes and No applies. I had to work out some of the gaps I was trying to close myself, so thanks for taking the time to interpret my question.


I have no freaking idea what you said in the second paragraph but I love it and it will linger in the back of my head until I understand enough to look it up.

[nodding repeatedly with a serious face and lot of resolve]


Nice. Thank you for the addition of slower memory layers.

So MoE models are a bit like thinking tools running concurrently, right(?), sieving through training data along paths that are contextually the same but differ in specificity and sensitivity.

If the agents/experts/architectures (the code) don't have the minimum required amount of memory and processing power, they might even miss entire groups of tokens that are, or might be, relevant within the given context (the prompt) and the predicted/requested context. So more processing power and/or time is relevant only relative to the size of what gets queried at inference time (tokens and weights).
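For what it's worth, my current mental model of MoE routing is a bit less "concurrent thinking tools sieving training data" and a bit more "a learned gate picks a few small sub-networks per token, and only those run". Here's a toy sketch of top-k gating (all names, shapes and parameters are made up for illustration, not any real model's implementation):

```python
import math
import random

random.seed(0)
n_experts, top_k, d = 8, 2, 4  # hypothetical sizes

# Hypothetical parameters: a gate matrix plus one tiny linear "expert" each.
gate = [[random.gauss(0, 1) for _ in range(n_experts)] for _ in range(d)]
experts = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
           for _ in range(n_experts)]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def moe_layer(x):
    # Gate: one score per expert for this token.
    scores = [sum(gate[i][e] * x[i] for i in range(d)) for e in range(n_experts)]
    chosen = sorted(range(n_experts), key=lambda e: scores[e])[-top_k:]
    # Softmax over the chosen experts only.
    z = [math.exp(scores[e]) for e in chosen]
    w = [v / sum(z) for v in z]
    # Only the chosen experts do any compute; the others are skipped entirely.
    out = [0.0] * d
    for wi, e in zip(w, chosen):
        ye = matvec(experts[e], x)
        out = [o + wi * yi for o, yi in zip(out, ye)]
    return out, chosen

x = [random.gauss(0, 1) for _ in range(d)]
out, chosen = moe_layer(x)
print(len(out), len(chosen))  # 4 2
```

So the "concurrency" is per-token expert selection inside one forward pass; the unchosen experts cost nothing, which is exactly why MoE gives more parameters without proportionally more compute.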

Now here's where I find myself exactly within the realm that I was in when I phrased my question: analysing the result of a request and evaluating different sets of tokens, which, I now understand, makes much more sense within the subject of code generation than with the recitation of facts or bits of narratives.

Generated code has functions (things to do with other things). Functions can be implemented more or less efficiently, and even the least efficient code can work "more than well and fast enough". There is no value in looping through versions of fact and fiction when the answer fits the expectation. And if it doesn't fit, users can have an actual conversation. That's where I get another part of my answer: more processing power only becomes relevant in relation to the number of concurrent requests and the parts of the model that are queried at inference time.

No single request will ever query so much data at the same time that memory and compute become a bottleneck.

It definitely can become a bottleneck when a long/large/broad (but specific) request gets processed by MoEs simultaneously, or when versions of the results of engineering tasks are being evaluated. But that is simply not within the task or design of current LLMs; it is instead added on top (as a wrapper, for example, which I still fail to find a non-replaceable use case for, while remaining certain that I will find one once I get into LLMs and AIs).

Again, thanks!


Nope, this definitely fills a few gaps, thanks. I'm still too lazy to think about this whole O(n) time thing, even though I'm constantly wondering whether "more" or better results could be achieved by throwing CPUs at stuff, hahaha. I rarely think in terms of time in general, just depth, breadth and clarity.


This blew my mind a little, as it feels unintuitive: you wouldn't just forget what you based your previous reply on, at least not after some practice with your mind and memory (which I need to catch up on, I must add).

It also feels like it multiplies the required processing power, but I have no clue yet how one could use the previously generated tokens and the weights themselves to improve, elaborate on, or widen the range of predicted potential results.


> the same model will give the same result

Is it wrong to think of this as misleading? Don't the results for exactly the same request differ because there are multiple output strings with the same computed weights?

Or do you include "multiple ways to phrase the same" in "same results" and I'm being a noob?


There is some intentional randomness in how the tokens are selected, and some unintentional randomness from letting certain optimizations cause small side effects. But in that sentence I didn't really intend to talk about the result being identical; rather, the result doesn't get any better just because more compute is available, since by default that extra potential simply wouldn't be used for anything other than a speedup.
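The "intentional randomness" part can be made concrete with a toy sampler: the model produces scores (logits) over the vocabulary, and the decoder either picks the top token deterministically (greedy / temperature 0) or samples from the distribution (temperature > 0). The logits below are made up purely for illustration:

```python
import math
import random

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

def sample(logits, temperature, rng):
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, fully deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature > 0: softmax the scaled logits and sample from them.
    z = [math.exp(l / temperature) for l in logits]
    p = [v / sum(z) for v in z]
    return rng.choices(range(len(logits)), weights=p)[0]

rng = random.Random(42)
greedy = [sample(logits, 0.0, rng) for _ in range(5)]
temp1 = [sample(logits, 1.0, rng) for _ in range(5)]
print(greedy)  # [0, 0, 0, 0, 0] -- identical every run
print(temp1)   # varies unless the random seed is fixed
```

That's why "the same model gives the same result" holds exactly under greedy decoding (modulo floating-point/optimization side effects), while sampled decoding gives multiple plausible phrasings of the same answer.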


So "compute" includes just having more data ... data that can also be "ignored"/"skipped" for whatever reasons (e.g. weights). OK.


Ok, thanks. My misconception kind of kept me from seeing the possibility of a (theoretical) assert statement, which is kind of what is meant by

> if the [resulting] dataset "fits" the model architecture properly,

right?

I have too many questions. It seems unreasonable to ask away and I should instead read the studies and some books.


> there is no looping going on internally

My thoughts after this sentence filled a huge gap I was wondering about, thanks.


Yeah, I totally forgot that training time and request time (aaah, inference time! Now I get it.) are completely different points in time, because the LLM has no access to the training data anymore.

