What is the accuracy of this method vs manual entry?

When I message my claw "Mark that I had 825 calories for lunch today", it has marked down 825 correctly 100% of the time so far.

It shows me way fewer ads than all the popular fitness apps and loads way quicker since it doesn't have to load like 10MB of ads for me to enter one number, so it seems like a good improvement.

I do not think it's an improvement over an Excel sheet, but as the average OpenClaw user, I would rather pay Anthropic $10/day in API credits than create a Google Sheets document.


I do something similar with Claude Code. I say, "I ate a single serving of that Toasted Beef Ravioli that Aldi sells." Claude searches the web, finds it, gets the nutrition info, then uses gspread to add it to the daily food log tab of my spreadsheet.

So much less hassle, and much lower activation energy, than MyFitnessPal.
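
The gspread step is only a few lines. A rough sketch of what it ends up running (the credentials path, sheet and tab names, and the calorie figure are all made up):

    import gspread
    from datetime import date

    # Authenticate with a Google service account
    # ("credentials.json" is a placeholder path)
    gc = gspread.service_account(filename="credentials.json")

    # Sheet and tab names here are hypothetical
    ws = gc.open("Food Log").worksheet("Daily Log")

    # Append one meal entry: date, food, calories (values are illustrative)
    ws.append_row([date.today().isoformat(),
                   "Toasted Beef Ravioli (1 serving)", 320])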


And, no need for OpenClaw either

But you need to know that the meal was 825 calories in the first place, which is what these calorie tracker apps calculate for you from the ingredient amounts.

They do this already, but the problem is that it takes me more time to verify whether what they're saying is correct than to just use a search engine. All the LLMs constantly make stuff up & have extremely low precision & recall of information.


disagree - i actually think all the problems the author lays out about Deep Research apply just as well to GPT-4o / o3-mini-whatever. These things are just absolutely terrible at precision & recall of information.


I think Deep Research shows that these things can be very good at precision and recall of information if you give them access to the right tools... but that's not enough, because of source quality. A model that has great precision and recall but uses flawed reports from Statista and Statcounter is still going to give you bad information.


Deep Research doesn't give the numbers that are in StatCounter and Statista. It's choosing the wrong sources, but it's also failing to represent them accurately.


Wow, that's really surprising. My experience with much simpler RAG workflows is that once you stick a number in the context the LLMs can reliably parrot that number back out again later on.

Presumably Deep Research has a bunch of weird multi-LLM-agent things going on; maybe there's something about their architecture that makes it more likely for mistakes like that to creep in?
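
What I mean, as a minimal sketch with the OpenAI Python client (the model name and the snippet's numbers are placeholders, not real data):

    from openai import OpenAI

    client = OpenAI()

    # A retrieved snippet with a concrete number stuck into the context
    # (the figure is made up for the demo)
    snippet = "StatCounter puts mobile at 62.15% of web traffic for May."

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer only from this source:\n" + snippet},
            {"role": "user",
             "content": "What mobile share does the source give?"},
        ],
    )

    # In my experience the model reliably parrots back 62.15% here
    print(response.choices[0].message.content)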


Have a look at the previous essay. I couldn't get ChatGPT 4o to give me a number in a PDF correctly even when I gave it the PDF, the page number, and the row and column.

https://www.ben-evans.com/benedictevans/2025/1/the-problem-w...


I have a hunch that's a problem unique to the way ChatGPT web edition handles PDFs.

Claude gets that question right: https://claude.ai/share/7bafaeab-5c40-434f-b849-bc51ed03e85c

ChatGPT treats a PDF upload as a data extraction problem, where it first pulls out all of the embedded textual content from the PDF and feeds that into the model.

This fails for PDFs that contain images of scanned documents, since ChatGPT isn't tapping its vision abilities to extract that information.
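
You can reproduce that failure mode with any text-extraction library. A rough sketch with pypdf, assuming a hypothetical scanned PDF with no embedded text layer:

    from pypdf import PdfReader

    reader = PdfReader("scanned_report.pdf")  # hypothetical scanned PDF
    for page in reader.pages:
        # A scanned page has no embedded text layer, so extract_text()
        # comes back empty (or near-empty): exactly what a
        # text-extraction-only pipeline would feed the model.
        print(repr(page.extract_text()))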

Claude (and Gemini) both apply their vision capabilities to PDF content, so they can "see" the data.
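
With the Anthropic API, for example, the PDF goes in as a document block, so the model can apply vision to the rendered pages. A rough sketch (the model name and file path are placeholders):

    import base64
    import anthropic

    client = anthropic.Anthropic()

    with open("scanned_report.pdf", "rb") as f:  # same hypothetical PDF
        pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_b64}},
                {"type": "text",
                 "text": "What is the number in the row and column I described?"},
            ],
        }],
    )
    print(message.content[0].text)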

I talked about this problem here: https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide....

So my hunch is that ChatGPT couldn't extract useful information from the PDF you provided and instead fell back on whatever was in its training data, effectively hallucinating a response and pretending it came from the document.

That's a huge failure on OpenAI's part, but it's not illustrative of models being unable to interpret documents: it's illustrative of OpenAI's ChatGPT PDF feature being unable to extract non-textual image content (and then hallucinating on top of that inability).


Interesting, thanks. I think the higher-level problem is that (1) I have no way of knowing about this failure mode when using the product, and (2) I don't really know if I can rely on Claude to get this right every single time either, or what else it would fail at instead.


Yeah, completely understand that. I talked about this problem on stage as an illustration of how infuriatingly difficult these tools are to use because of the vast number of weird undocumented edge cases like this.

This is an unfortunate example though because it undermines one of the few ways in which I've grown to genuinely trust these models: I'm confident that if the model is top tier it will reliably answer questions about information I've directly fed into the context.

[... unless it's GPT-4o and the content was scanned images bundled in a PDF!]

It's also why I really care that I can control the context and see what's in it - systems that hide the context from me (most RAG systems, search assistants etc) leave me unable to confidently tell what's been fed in, which makes them even harder for me to trust.


I don't think this is a good take. Discovery & science are inherently meaningful even if the applications are not immediately felt. Nuclear magnetic resonance (NMR) was discovered in 1938, but there was no obvious applicability of it to everyday life. In 1973, 35 years later, Paul Lauterbur used it to produce the first MRI images.


super bad take. by this logic nobody should be excited about any technology whatsoever.


Many technologies enable some new cool things and in the process automate some stuff.

Here we are talking about paying extra pretty much solely to not have to deal with a human.


Waymos mean fewer families torn apart by car crashes.


tbh i could not tell the difference between this & generic technostep made by people. Sounds just as crappy to me.


I don't think they're saying it's up to tech companies to decide what has value, more that the development of new technology itself ends up deciding for the rest of the world how things are valued.

It's been this way for thousands of years, since the invention of the wheel. New inventions change how things are valued by making it easier for people to do more work in less time.


This sounds compelling, but where I always get stuck is trusting what the LLM / agent spits back out. Every time I've tried it for one of the use cases you mentioned and then actually dug into the sources it may or may not cite, it's almost always highly imprecise, missing really important details, or straight up lying or hallucinating.

how do you get around this issue?

Granted, on (3) you can just verify yourself by running the code, so trust/accuracy isn't as much of an issue there, but it's still annoying when things don't work.


Frame your question in human terms. LLM -> employee, hallucination -> false belief, etc. Same hiring problems. Same solutions.

You have a problem. The candidate must reliably solve it. What are their skills, general aptitudes, and observed reliability for this problem? Set them up to succeed, but move on if you don't trust them to meet the role's responsibilities. We are all flawed, and that's the nature of uncertainty when working with others.

Past that, there’s little situational advice that one can give about a general intelligence. If you want specific advice, give your specific attempt at a solution!


Are you saying it would have been a good thing for your wife's parents not to reproduce? Where would that leave her and you? (Noted that the childhood you described sounds awful; I agree.)


While my wife and I love each other, yes, it would have been better had they not had children -- for her sake, and this is her own feeling. Her trauma from childhood and young adulthood continues to affect her deeply and daily even now, decades later, in manifold ways, from complicated health issues to self-efficacy beliefs to frequent nightmares and constant fear about the future. When your own parent refuses to give you food, faith that everything will work out in the end can be hard to cultivate.

Personally, selfishly, the thought of my life without her is depressing, absolutely. But I can love her and yet -- or more precisely, "and so," because it's out of empathy that I feel this way -- I can understand and support her desire never to have existed.


That's a philosophical question, but I would say probably better off. If that hadn't taken place, a lot of abuse around the world wouldn't have either.


this is not a strong counterargument. Costs will come down. When there is insanely high demand for a product (like there is here) and the thing makes people more productive, costs always come down due to pure incentives to make it cheaper. This happened with electricity, the car, air travel, solar power, etc.

