> Testing and mocking is a huge challenge when developing LLM-driven systems that aren't deterministic. Even relatively simple flows are extremely hard to reproduce.
This is by far the most frustrating part of building with LLMs. Is there any good solution out there for any framework?
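One partial workaround people reach for is record/replay: capture real model responses once, then replay them in tests so the flow becomes deterministic. Below is a minimal sketch of that idea; the `ReplayingLLM` class and its `complete` method are hypothetical names, not any specific framework's API.

```python
import hashlib

# Hypothetical record/replay harness: on a recording run, real LLM responses
# are stored keyed by a hash of the prompt; later runs replay the stored
# response, so the test is deterministic. `live_call` stands in for any
# real client call and is an assumption, not a specific library's API.
class ReplayingLLM:
    def __init__(self, cassette: dict, live_call=None):
        self.cassette = cassette      # prompt-hash -> recorded response
        self.live_call = live_call    # real client, only used when recording

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cassette:
            return self.cassette[key]          # replay: fully deterministic
        if self.live_call is None:
            raise KeyError(f"no recorded response for prompt: {prompt!r}")
        response = self.live_call(prompt)      # record once
        self.cassette[key] = response
        return response

# Recording pass with a fake "live" model, then a replay-only pass.
cassette = {}
recorder = ReplayingLLM(cassette, live_call=lambda p: "4")
assert recorder.complete("What is 2 + 2?") == "4"

replayer = ReplayingLLM(cassette)              # no live client needed
assert replayer.complete("What is 2 + 2?") == "4"
```

This only pins down individual calls, of course; it doesn't make the underlying model behavior any less probabilistic, which is the point the next comment makes.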
This, in my opinion, is the fundamental problem: because LLMs run on a computer, people assume correctness in surprising ways. In reality, when you mix a deterministic system with a probabilistic system, you always get a probabilistic system.
Conceptually, LLM-as-a-judge doesn't feel like it should work; it's like asking a student to grade their own homework. It's very unintuitive to me that it actually seems to work pretty well.
LLM as a judge isn't telling you if something is right or wrong, it's telling you if a given generation is normal or an aberration, according solely to the data the model was trained on. This is one reason LLMs prefer their answers to answers from other LLMs.
"LLMs as a judge" is more about addressing the failure mode of auto-regressive (one-token at a time) generation letting an LLM lead itself astray due to its previous choices, rather than telling you any general truth.
Finally, I'd note that in every maths challenge I ever completed as a student, you were strongly advised to go back and check your own work at the end if you had time left over, and this usually led to me catching things I'd missed the first time.
One of the most effective things to do in coding (and many other things) is to always do a self-review: before you turn over a changeset for review, review it as if you were reviewing someone else's code.
It sounds superfluous, but it works very well as it saves time and frustration by allowing you to quickly fix the stuff that wouldn't pass review anyway.
The LLM-as-a-judge concept works in a similar way. Instead of "give a good answer", the task and perspective become "does this make sense?", which is very different.
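That reframing can be sketched as a generate-then-judge loop, where the judge is the same model asked the narrower question. Everything below is a toy illustration: `model` is a stand-in for any chat-completion call, and the prompts and retry count are arbitrary assumptions.

```python
# Sketch of the generate-then-judge pattern: one call answers the question,
# a second call only checks "does this make sense?" and replies YES or NO.
def generate(model, question: str) -> str:
    return model(f"Answer the question:\n{question}")

def judge(model, question: str, answer: str) -> bool:
    verdict = model(
        "Does the following answer make sense for the question? "
        "Reply YES or NO.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    return verdict.strip().upper().startswith("YES")

def answer_with_check(model, question: str, retries: int = 2) -> str:
    for _ in range(retries + 1):
        candidate = generate(model, question)
        if judge(model, question, candidate):
            return candidate
    return candidate  # give up and return the last attempt

# Toy stand-in model so the sketch runs without an API.
def toy_model(prompt: str) -> str:
    return "YES" if prompt.startswith("Does") else "Paris"

assert answer_with_check(toy_model, "Capital of France?") == "Paris"
```

The design point is that the judge call never has to produce a good answer itself; it only classifies one that already exists, which is the easier perspective the comment describes.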
No, it's like a teacher's assistant using an answer sheet provided by the professor to grade exams. A common practice at the university I went to, and accurate enough in most cases.
The problem of matching an answer against a reference answer is a lot simpler than generating one, especially for an LLM, which has language transformation as one of its core competencies.
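The answer-sheet version of the judge can be sketched like this: the judge sees the reference answer in its prompt, so its job collapses from open-ended generation to comparison. The `grade` function and its prompt wording are assumptions for illustration, not any framework's API.

```python
# Sketch of answer-sheet grading: the judge is given the reference answer
# and only has to decide whether the student answer matches it.
def grade(model, question: str, reference: str, candidate: str) -> bool:
    verdict = model(
        "You are grading with an answer key.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {candidate}\n"
        "Is the student answer equivalent to the reference? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

# Toy stand-in grader so the sketch runs without an API: it says YES
# only when the reference and student lines in the prompt are identical.
def toy_grader(prompt: str) -> str:
    fields = dict(l.split(": ", 1) for l in prompt.splitlines() if ": " in l)
    return "YES" if fields["Reference answer"] == fields["Student answer"] else "NO"

assert grade(toy_grader, "2+2?", "4", "4")
assert not grade(toy_grader, "2+2?", "4", "5")
```

A real model would also accept paraphrases of the reference, which is where the "language transformation" competency comes in.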
It's a student writing the judge's evaluation questions for himself, acting as the judge, judging the judge (himself), and later evaluating himself using his own judgement (as judge).
1 student all the way down.
If it thinks eating concrete makes you stronger, it's going to think that and give green light end to end.
The attention is probably just latching on to strong statistical patterns. Obvious errors create sharp spikes in attention weights and drown out subtler signals that can actually matter more.