> Testing and mocking is a huge challenge when developing LLM-driven systems that aren't deterministic. Even relatively simple flows are extremely hard to reproduce.
This is by far the most frustrating part of building with LLMs. Is there any good solution out there for any framework?
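One partial workaround people reach for is record/replay: capture real model responses once, then replay them in tests so the flow becomes deterministic. Below is a minimal sketch of that idea; the `ReplayingLLM` class and its `complete` method are hypothetical names, not any specific framework's API.

```python
import hashlib

# Hypothetical record/replay harness: on a recording run, real LLM responses
# are stored keyed by a hash of the prompt; later runs replay the stored
# response, so the test is deterministic. `live_call` stands in for any
# real client call and is an assumption, not a specific library's API.
class ReplayingLLM:
    def __init__(self, cassette: dict, live_call=None):
        self.cassette = cassette      # prompt-hash -> recorded response
        self.live_call = live_call    # real client, only used when recording

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cassette:
            return self.cassette[key]          # replay: fully deterministic
        if self.live_call is None:
            raise KeyError(f"no recorded response for prompt: {prompt!r}")
        response = self.live_call(prompt)      # record once
        self.cassette[key] = response
        return response

# Recording pass with a fake "live" model, then a replay-only pass.
cassette = {}
recorder = ReplayingLLM(cassette, live_call=lambda p: "4")
assert recorder.complete("What is 2 + 2?") == "4"

replayer = ReplayingLLM(cassette)              # no live client needed
assert replayer.complete("What is 2 + 2?") == "4"
```

This only pins down individual calls, of course; it doesn't make the underlying model behavior any less probabilistic, which is the point the next comment makes.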
This, in my opinion, is the fundamental problem: because LLMs run on a computer, people assume correctness in surprising ways. In reality, when you mix a deterministic system with a probabilistic system, you always get a probabilistic system.
Conceptually, LLM-as-a-judge doesn't feel like it should work; it's like asking a student to grade their own homework. It's very unintuitive to me that it actually seems to work pretty well.
LLM as a judge isn't telling you if something is right or wrong, it's telling you if a given generation is normal or an aberration, according solely to the data the model was trained on. This is one reason LLMs prefer their answers to answers from other LLMs.
"LLMs as a judge" is more about addressing the failure mode of auto-regressive (one-token at a time) generation letting an LLM lead itself astray due to its previous choices, rather than telling you any general truth.
Finally, I'd note that in every maths challenge I ever completed as a student, you were strongly advised to go back and check your own work at the end if you had time left over, and this usually led to me catching things I'd missed the first time.
One of the most effective things to do in coding (and many other things) is to always do a self-review: before you turn over a changeset for review, review it as if you were reviewing someone else's code.
It sounds superfluous, but it works very well as it saves time and frustration by allowing you to quickly fix the stuff that wouldn't pass review anyway.
The LLM-as-a-judge concept works in a similar way. Instead of "give a good answer", the task and perspective become "does this make sense?", which is very different.
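That reframing can be sketched as a generate-then-judge loop, where the judge is the same model asked the narrower question. Everything below is a toy illustration: `model` is a stand-in for any chat-completion call, and the prompts and retry count are arbitrary assumptions.

```python
# Sketch of the generate-then-judge pattern: one call answers the question,
# a second call only checks "does this make sense?" and replies YES or NO.
def generate(model, question: str) -> str:
    return model(f"Answer the question:\n{question}")

def judge(model, question: str, answer: str) -> bool:
    verdict = model(
        "Does the following answer make sense for the question? "
        "Reply YES or NO.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    return verdict.strip().upper().startswith("YES")

def answer_with_check(model, question: str, retries: int = 2) -> str:
    for _ in range(retries + 1):
        candidate = generate(model, question)
        if judge(model, question, candidate):
            return candidate
    return candidate  # give up and return the last attempt

# Toy stand-in model so the sketch runs without an API.
def toy_model(prompt: str) -> str:
    return "YES" if prompt.startswith("Does") else "Paris"

assert answer_with_check(toy_model, "Capital of France?") == "Paris"
```

The design point is that the judge call never has to produce a good answer itself; it only classifies one that already exists, which is the easier perspective the comment describes.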
No, it's like a teacher's assistant using an answer sheet provided by the professor to grade exams. A common practice at the university I went to, and accurate enough in most cases.
The problem of matching an answer against a reference answer is a lot simpler than generating one, especially for an LLM, which has language transformation as one of its core competencies.
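The answer-sheet version of the judge can be sketched like this: the judge sees the reference answer in its prompt, so its job collapses from open-ended generation to comparison. The `grade` function and its prompt wording are assumptions for illustration, not any framework's API.

```python
# Sketch of answer-sheet grading: the judge is given the reference answer
# and only has to decide whether the student answer matches it.
def grade(model, question: str, reference: str, candidate: str) -> bool:
    verdict = model(
        "You are grading with an answer key.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {candidate}\n"
        "Is the student answer equivalent to the reference? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

# Toy stand-in grader so the sketch runs without an API: it says YES
# only when the reference and student lines in the prompt are identical.
def toy_grader(prompt: str) -> str:
    fields = dict(l.split(": ", 1) for l in prompt.splitlines() if ": " in l)
    return "YES" if fields["Reference answer"] == fields["Student answer"] else "NO"

assert grade(toy_grader, "2+2?", "4", "4")
assert not grade(toy_grader, "2+2?", "4", "5")
```

A real model would also accept paraphrases of the reference, which is where the "language transformation" competency comes in.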
It's a student writing the judge's evaluation questions for himself, acting as the judge, judging the judge (himself), and later evaluating himself using his own judgement (as judge).
1 student all the way down.
If it thinks eating concrete makes you stronger, it's going to think that and give green light end to end.
The attention is probably just latching on to strong statistical patterns. Obvious errors create sharp spikes in attention weights and drown out subtler signals that can actually matter more.