Hacker News | suninsight's comments

It only seems effective until you start using it for actual work. The biggest issue: context. All tool use creates context. Large code bases come with large context right off the bat. LLMs seem to work until they are hit with sizeable context. Anything above 10k tokens and the quality seems to deteriorate.

The other issue is that LLMs can go off on a tangent. As context builds up, they forget what their objective was. One wrong turn, and down the rabbit hole they go, never to recover.

I know because we started solving these problems a year back. And we aren't done yet. But we have covered a lot of distance.

[Plug]: Try it out at https://nonbios.ai:

- Agentic memory → long-horizon coding

- Full Linux box → real runtime, not just toy demos

- Transparent → see & control every command

- Free beta — no invite needed. Works with throwaway email (mailinator etc.)


> One wrong turn, and down the rabbit hole they go, never to recover.

I think this is probably at the heart of the best argument against these things as viable tools.

Once you have sufficiently described the problem such that the LLM won't go the wrong way, you've likely already solved most of it yourself.

Tool use with error feedback sounds autonomous but you'll quickly find that the error handling layer is a thin proxy for the human operator's intentions.


Yes, but we don't believe this is a 'fundamental' problem. We have learned to guide their actions a lot better, and they go down the rabbit hole a lot less now than when we started out.


True, but on the other hand, there are a bunch of tasks that are just very typing intensive and not really complex.

Especially in GUI development, building forms, charts, etc.

I could imagine that LLMs are a great help here.


Some of the thinking models might recover... with an extra 4k tokens used up in <thinking>. And even if they were stable at long contexts, the speed drops massively. You just can't win with this architecture lol.


That matches what we have found. <thinking> models do a lot better, but with huge speed drops. For now, we have chosen accuracy over speed. But the speed drop is like 3-4x - so we might move to an architecture where we 'think' only sporadically.

Everything happening in the LLM space is so close to how humans think naturally.


Looks interesting. How do you manage context ?


Managing context is what takes the most effort. We use a bunch of strategies to reduce it, including, but not limited to:

1. A custom MCP server to work on the Linux command line. This wasn't really an 'MCP' server, because we started working on it before MCP was a thing, but that's the easiest way to explain it now. The MCP server is optimised to reduce context.

2. Guardrails to reduce context. Think about it as prompt alterations giving the LLM subtle hints to work with less context. The hints could be at a behavioural level and a task level.

3. Continuously pruning the built-up context to make the agent 'forget'. Forgetting what is not important is what we believe is a foundational capability.

This is kind of inspired by the science which says humans use sleep to 'forget' unimportant memories, and that this is critical to keeping the brain healthy. This translates directly to LLMs - making them forget is critical to keeping them focused on the larger task and their actions aligned.
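The pruning idea in point 3 could look something like this minimal sketch (hypothetical, not the actual NonBioS implementation): keep the system prompt and the most recent turns, drop bulky old tool outputs first, then evict the oldest remaining turns until the history fits a token budget.

```python
# Hypothetical sketch of agent context pruning ("forgetting").
# Message format and token estimator are assumptions for illustration.
def estimate_tokens(msg):
    # Crude heuristic: roughly 4 characters per token.
    return len(msg["content"]) // 4

def prune_context(messages, budget=10_000, keep_recent=6):
    """Keep system prompt + recent turns; shed old tool output, then old turns."""
    head, tail = messages[:1], messages[1:]        # assume messages[0] is the system prompt
    recent, older = tail[-keep_recent:], tail[:-keep_recent]
    # Pass 1: discard bulky tool outputs from older history first.
    older = [m for m in older if m.get("role") != "tool"]
    # Pass 2: evict oldest remaining turns until we fit the budget.
    total = sum(estimate_tokens(m) for m in head + older + recent)
    while older and total > budget:
        total -= estimate_tokens(older.pop(0))
    return head + older + recent
```

A real system would summarise evicted turns rather than drop them outright, but the shape is the same: forgetting is a deliberate, budgeted operation.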


https://nonbios.ai - [Disclosure: I am working on this.]

- We are in public beta and free for now.

- Fully Agentic. Controllable and Transparent. Agent does all the work, but keeps you in the loop. You can take back control anytime and guide it.

- Not an IDE, so we don't compete with VSCode forks. The interface is just a chatbox.

- More like Replit - but full-stack focused. You can build backend services.

- Videos are up at youtube.com/@nonbios


I also did not get it, but now I get it a bit, I think.

Look at it this way. You have to get some work done - maybe book a flight ticket. So you go to two sites - first you go to flight fare comparison, then you book the ticket on the airline website. And you have to do it in code.

There are two ways you can do it.

First way:
1. Understand the API of the flight comparison portal.
2. Understand the API of the airline website.
3. Write code which combines both these APIs and does the task.

Second way:
1. Message a coder friend who knows the API of the flight comparison portal and ask him to write code to get the cheapest flight.
2. Message another coder friend who knows the API of the airline portal and ask him to book a flight.

Both ways are possible, but which one do you think is less work? Which one is 'cognitively' easier? Which one can you do while driving a car with one hand?

It should be clear that the second way is easier. Not only is the second way easier, but if the task requires multiple providers and a lot of context, it might be the only way possible.

The first way is analogous to LLMs making raw API calls. The second way is analogous to LLMs using MCP servers. MCP servers reduce the cognitive cost of a task for the LLM - which dramatically increases their power.
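The analogy can be made concrete with a toy sketch (all names and data here are made up for illustration, not a real MCP implementation): the "second way" hides each provider behind a high-level tool, so the caller only has to name a task and its arguments.

```python
# Toy illustration of the "second way": high-level tools hide provider APIs.
# Fare data, tool names, and signatures are invented for this example.
def find_cheapest_flight(origin, dest):
    # Stand-in for a tool/MCP server wrapping the fare-comparison portal.
    fares = {("SFO", "JFK"): [("AA", 320), ("UA", 290), ("DL", 310)]}
    return min(fares[(origin, dest)], key=lambda f: f[1])

def book_flight(airline, origin, dest):
    # Stand-in for a tool/MCP server wrapping the airline's booking API.
    return {"status": "booked", "airline": airline, "route": (origin, dest)}

TOOLS = {"find_cheapest_flight": find_cheapest_flight,
         "book_flight": book_flight}

def run_tool_call(name, **kwargs):
    # All the LLM has to emit: a tool name plus arguments.
    return TOOLS[name](**kwargs)

airline, price = run_tool_call("find_cheapest_flight", origin="SFO", dest="JFK")
booking = run_tool_call("book_flight", airline=airline, origin="SFO", dest="JFK")
```

The "first way" would require the caller to understand and stitch together both provider APIs itself; here it only needs the two tool names.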


Very cool product !

We, at NonBioS.ai [AI Software Dev], built something like this from scratch for Linux VMs, and it was a heavy lift. We could have used you guys if we had known about it. But I can see this being immediately useful in a ton of places.


Thank you, really appreciate that! Curious - did you end up using QEMU for your Linux VMs? And are you running your system locally or in the cloud?

We’re currently focused on macOS but planning to support Linux soon, so I’d love to hear more about your use case. Feel free to reach out at founders@trycua.com - always great to learn from others building in this space.


No, we don't use QEMU - never heard of it till now. We built our own software from scratch - using Ubuntu - for AI. We are completely on the cloud. Every user gets a full Ubuntu cloud VM for their NonBioS AI engineer to work on.

We covered this a fair bit on our blogs: - https://www.nonbios.ai/post/why-nonbios-chose-cloud-vms-for-... - https://www.nonbios.ai/post/private-linux-vms-for-every-nonb...


I guess the "lucky 10000" effect is extremely strong today!

This is like an OS developer who has never heard of Linux.


They just provision cloud VMs. That's the level of tech understanding you need to build an "AI startup" nowadays. "Never heard of QEMU, we just use Ubuntu" doesn't really strike confidence.


Key questions:

1. The key data point seems to be Figure 6a, which compares performance on BABILong and claims Titans reaches ~62%, compared to ~42% for GPT-4o-mini at 100k sequence length.

However, GPT-4o and Claude are missing from this comparison - maybe because they perform better?

2. There is no example provided of the Neural Memory Module in action. This is the first question I would ask of this paper.


The biggest model that they have used has only 760M parameters, and it outperforms models 1 order of magnitude larger.


Gah dmn


This paper was written by a very small team at Google. It strikes me as similar in that regard to the original transformers paper. If this technique scales well, Google is no doubt already exploiting it for their next generation models -- and I think there are signs that Gemini 2.0 models already exploit this.


> Doctors will often recommend exercise, but I find that these days even moderately strenuous exercise like riding a bicycle destroys my sleep quality for several days. There's something about it that appears to be too physiologically stressing, even though ten years ago I was a happy as a regular gymgoer.

I had something similar to this. I think I was able to fix it. The theory is that your sleep is still poor, even though you sleep through the night. This causes high cortisol levels during the day and a higher resting heart rate. This is elevated further after moderate exercise and takes a long time to get back to normal because your sleep isn't adequate. If your heart rate doesn't come down enough, your sleep quality gets destroyed.

The solution, for me and I am guessing for you, is this: stop the cycling. First fix sleep. Track it using a Wellue O2 Ring. If the scores are not good, then reconfigure CPAP - use the sleepapnea subreddit for inputs. Once sleep is sorted as per the O2 Ring, it might take a few months for you to recover. After that you can restart moderate exercise and things should be fine.


Yeah, I also suspected a vicious cycle of stress/cortisol causing poor sleep, which leads to more cortisol and poor recovery.

It did get better when I stopped cycling, as much as I loved it. I'm now walking instead and feeling much better. I intend to increase volume over time and once my VO2Max is back to my baseline then I may introduce cycling with an eye on going easy and eating enough before/during/after exercise.

Thanks for the advice, it is good to hear that it worked for other people.


A bunch of them: Langsmith, Lunary, Phoenix Arize, Portkey, Datadog and Helicone.

We also picked Langfuse - more details here: https://www.nonbios.ai/post/the-nonbios-llm-observability-pi...


Thanks, this post was insightful. I laughed at the reason why you rejected Arize Phoenix - I had similar thoughts while going through their site! =)

> "Another notable feature of Langfuse is the use of a model as a judge ... this is not enabled in the free version/self-hosted version"

I think you can add LLM-as-judge to the self-hosted version of Langfuse by defining your own evaluation pipeline: https://langfuse.com/docs/scores/external-evaluation-pipelin...


Thanks for the pointer!

We are actually toying with building out a prompt evaluation platform and were considering extending Langfuse. Maybe we'll just use this instead.


Thanks for sharing your blogpost. We had a similar journey. I installed and tried both Langfuse and Phoenix and ended up choosing Langfuse due to some versioning conflicts on the python dependency. I’m curious if your thoughts change after V3? I also liked that it only depended on Postgres but the scalable version requires other dependencies.

The thing I liked about Phoenix is that it uses OpenTelemetry. In the end we’re building our Agents SDK in a way that the observability platform can be swapped (https://github.com/zetaalphavector/platform/tree/master/agen...) and the abstraction is OpenTelemetry-inspired.
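A swappable, OpenTelemetry-inspired seam like the one described might look roughly like this (a hypothetical sketch, not the linked SDK's actual code): agent code depends only on a small tracer interface, and each backend - Langfuse, Phoenix, an OTel exporter - is an implementation behind it.

```python
# Hypothetical sketch of a swappable observability seam.
# Interface names are invented; real OTel/Langfuse APIs differ.
from contextlib import contextmanager
from typing import Protocol

class Tracer(Protocol):
    def start_span(self, name: str, **attrs): ...
    def end_span(self, span) -> None: ...

class InMemoryTracer:
    """Test double; a real backend would export to Langfuse, Phoenix, etc."""
    def __init__(self):
        self.finished = []
    def start_span(self, name, **attrs):
        return {"name": name, "attrs": attrs}
    def end_span(self, span):
        self.finished.append(span)

@contextmanager
def traced(tracer, name, **attrs):
    # Agent code wraps LLM/tool calls in this; the backend is pluggable.
    span = tracer.start_span(name, **attrs)
    try:
        yield span
    finally:
        tracer.end_span(span)
```

Swapping platforms then means providing a different `Tracer` implementation, without touching the agent code itself.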


As you mentioned, this was a significant trade-off. We faced two choices:

(1) Stick with a single Docker container and Postgres. This option is simple to self-host, operate, and iterate on, but it suffers from poor performance at scale, especially for analytical queries that become crucial as the project grows. Additionally, as more features emerged, we needed a queue and benefited from caching and asynchronous processing, which required splitting into a second container and adding Redis. These features would have been blocked with this setup.

(2) Switch to a scalable setup with a robust infrastructure that enables us to develop features that interest the majority of our community. We have chosen this path and prioritized templates and Helm charts to simplify self-hosting. Please let us know if you have any questions or feedback as we transition to v3. We aim to make this process as easy as possible.

Regarding OTel, we are considering adding a collector to Langfuse as the OTel semantics are currently developing well. The needs of the Langfuse community are evolving rapidly, and starting with our own instrumentation has allowed us to move quickly while the semantic conventions were not developed. We are tracking this here and would greatly appreciate your feedback, upvotes, or any comments you have on this thread: https://github.com/orgs/langfuse/discussions/2509


So we are still on v2.7 - it works pretty well for us. Haven't tried v3 yet, and not looking to upgrade. I think the next big feature set we are looking for is a prompt evaluation system.

But we are coming around to the view that it is a big enough problem to deserve dedicated SaaS, rather than piggybacking on observability SaaS. At NonBioS, we have very complex requirements - so we might just end up building it from the ground up.


"Langsmith appeared popular, but we had encountered challenges with Langchain from the same company, finding it overly complex for previous NonBioS tooling. We rewrote our systems to remove dependencies on Langchain and chose not to proceed with Langsmith as it seemed strongly coupled with Langchain."

I've never really used Langchain, but I set up Langsmith with my own project quite quickly. It's very similar to setting up Langfuse, activated with a wrapper around the OpenAI library. (Though I haven't looked into the metadata and tracing yet.)

Functionally the two seem very similar. I'm looking at both and am having a hard time figuring out differences.


We launched Laminar a couple of months ago: https://www.lmnr.ai. Extremely fast, great DX, and written in Rust. Definitely worth a look.


Congrats on the Launch!


apologies for hijacking your launch (congrats btw!)


thanks Marc :)


I did a similar test and tried to pull up certain categories of individuals I am interested in, with their names and linkedin profile links. ChatGPT hallucinated the names and the links.

I simply cannot move to a search where there is random hallucination, because having to check each and every result for hallucination defeats the purpose of search itself.


You don't have to deal with it. It's only a question of time before better AI comes out and starts refactoring this into better-looking code. These are just growing pains, and one of those problems which should go away on their own.

The bigger problem is what to do with the humans who are fixing this sloppy code right now.


The biggest issue I have with mosh is that you can't scroll up through the history. It is kind of a deal breaker for me.


Hence most people throw in tmux/screen on the other end, possibly automatically:

https://github.com/blinksh/blink/discussions/1526
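For example, one common way to do it automatically (session name "main" is arbitrary):

```shell
# Attach to the tmux session "main" on connect, creating it if needed
# (tmux's -A flag); tmux then provides the scrollback mosh lacks.
mosh user@host -- tmux new-session -A -s main
```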

