
Love it. Are there any benchmarks on the latency from file change, to kernel event received, to event dispatched? I would expect it to be really quick.

Isn't it just a matter of training a model on your voice recordings and using that to generate audio clips from text?


Interesting, would you mind sharing your architectural setup? How does your index communicate with your agent server, and what is the main agent framework/engine used?

Sounds like a cool concept to speak into your watch/wearable which automatically saves or performs tasks on the fly.

What is the general execution time from:

Prompt received -> final task executed?


So basically there's a /chat endpoint that goes to the LLM (a Pi agent), which has access to specific tools (web search, SQL execution, cron) but no filesystem access. The only thing it can do is exfiltrate data it can see (pretty big, but you can't really avoid that, and it doesn't have access to anything on the host system). There's a Signal bridge that runs in another container to connect to Signal, a Telegram webhook, and the other big component is a coding agent plus a tool container. The coding agent can write files to a directory that's also mounted in the tool container, and the tool container can run the tools. That way you separate the coder from everything else, and nothing has access to any of your keys.

You can't really avoid the coder exfiltrating your tool secrets, but at least it's separated. I also want to add a secondary container of "trusted" tools that the main LLM can call but no other LLM can change.

This way you're assured that, for example, the agent can't contact anyone you don't want it to contact, or that it can read your emails but not send/delete them, things like that. It makes it very easy to enforce ACLs for things you don't want LLM-coded, but also enables LLM coding of less-trusted programs.
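The container split described above could be sketched roughly like this docker-compose fragment; the service names, images, and mounts here are all hypothetical, just to illustrate which container holds which secrets:

```yaml
# Hypothetical sketch of the container layout described above;
# names, images, and mounts are made up for illustration.
services:
  agent:            # main LLM agent: serves /chat, no filesystem access
    image: my-agent
    environment:
      - LLM_API_KEY # only this container sees the model key
  signal-bridge:    # connects to Signal, relays messages to the agent
    image: my-signal-bridge
  coder:            # coding agent: writes tool code, holds no tool secrets
    image: my-coder
    volumes:
      - shared-tools:/workspace
  tools:            # runs the coder's tools; the only place tool secrets live
    image: my-tools
    volumes:
      - shared-tools:/tools:ro
volumes:
  shared-tools:
```

The key property is that the coder and the tool runner only share a volume, never credentials.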


And now it can even make private (and public!) dynamic websites that have access to data from your database, while exposing only the data you want exposed.

I'm really liking it, I created a page to show my favorite restaurants per city, for example:

https://stavrobot.home.stavros.io/pages/restaurants

That's dynamic, loading from the database, and updating live when the assistant creates new entries.

This page was created just by telling the assistant "make me a page to show my favorite restaurants, with their ratings, grouped by city".
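The bot's actual implementation isn't shown here, but a dynamic page like that boils down to querying the database and rendering grouped HTML. A minimal stdlib-only sketch (the schema and names are made up):

```python
import sqlite3
from collections import defaultdict
from html import escape

def render_restaurants_page(conn):
    """Render a simple HTML page of restaurants grouped by city."""
    rows = conn.execute(
        "SELECT city, name, rating FROM restaurants ORDER BY city, rating DESC"
    ).fetchall()
    by_city = defaultdict(list)
    for city, name, rating in rows:
        by_city[city].append(f"<li>{escape(name)} ({rating}/5)</li>")
    sections = [
        f"<h2>{escape(city)}</h2><ul>{''.join(items)}</ul>"
        for city, items in by_city.items()
    ]
    return f"<html><body><h1>Favorite restaurants</h1>{''.join(sections)}</body></html>"

# Toy data in an in-memory database; the real bot reads live entries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (city TEXT, name TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [("Athens", "Taverna A", 5), ("Athens", "Grill B", 4), ("London", "Curry C", 5)],
)
html = render_restaurants_page(conn)
```

Because the page re-queries on each render, new assistant-created entries show up on the next load with no extra work.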


Fascinating, thanks for responding. If I may ask, what is your monthly (or any other interval) token usage? And are you finding Pi to be a performance bottleneck anywhere?


You mean with the bot, or with developing the bot? The bot's token usage is fairly small, but it's only a few days old, so I don't know yet. Pi hasn't been a bottleneck that I can see; my bot is much faster than OpenClaw was when I tried it.


Was thinking about the bot. I starred your repo and will be checking it out later!


Thanks, let me know if you have feedback! I've made it fairly easy to set up, I think; you don't even need a separate server since it's all sandboxed, so you can try it on your PC.


I am not affiliated with them at all, but https://getencube.com tries to remove a lot of friction for just these cases and reduce costs. I met their founder a couple of months back; a great engineer and a really cool product.


Is it the models, though? With every release (multimodal, etc.) it's just a well-crafted layer of business logic between the user and the LLM. Sometimes I feel like we lose track of what the LLM does and what the API in front of it does.


It's 100% the models. Terminal-Bench is a good indication of this. There the agents get "just a terminal tool", and yet they can still solve lots and lots of tasks. Last year you needed lots of glue, and two years ago you needed monstrosities like LangChain that worked maybe once in a blue moon, if you didn't look at them funny.

Check out the exercise from the SWE-agent people, who released a mini agent that's "terminal in a loop" and that got close to the engineered agents this year.

https://github.com/SWE-agent/mini-swe-agent
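"Terminal in a loop" really is about that simple. A minimal sketch, with a scripted stand-in for the model (a real agent would call an LLM API for the next command instead):

```python
import subprocess

def fake_model(transcript):
    """Stand-in for an LLM: returns the next shell command, or None when
    done. A real agent would send the transcript to a model API here."""
    if "hello" not in transcript:
        return "echo hello"
    return None  # the model decides the task is finished

def terminal_in_a_loop(model, max_steps=10):
    """Run model-chosen shell commands, feeding each command's output back."""
    transcript = ""
    for _ in range(max_steps):
        command = model(transcript)
        if command is None:
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=30
        )
        transcript += f"$ {command}\n{result.stdout}{result.stderr}"
    return transcript

log = terminal_in_a_loop(fake_model)
```

Everything beyond this loop (sandboxing, step limits, prompt design) is engineering around the model, not intelligence in the scaffold.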


It's definitely a mix; we have been co-developing better models and frameworks/systems to improve the outputs. Now we have llms.txt, MCP servers, structured outputs, better context management systems, and augmented retrieval through file indexing, search, and documentation indexing.

But these raw models (which I test through direct API calls) are much better. The biggest change with regard to price came through mixture of experts, which allowed keeping quality very similar while dropping compute 10x. (This is what allowed DeepSeek V3 to have similar quality to GPT-4o at such a lower price.)

This same tech has most likely been applied to these new models, and now we have 1T-100T? parameter models at the same cost as 4o through mixture of experts. (This is what I'd guess, at least.)
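The compute saving is just arithmetic over active vs. total parameters. A back-of-the-envelope sketch using DeepSeek-V3's published figures (roughly 671B total, 37B activated per token); the 2-FLOPs-per-parameter rule of thumb is an approximation for a forward pass:

```python
# Back-of-the-envelope: why MoE cuts inference compute.
# DeepSeek-V3 reports ~671B total parameters but only ~37B
# activated per token (a router picks a few experts per layer).
total_params = 671e9
active_params = 37e9

# Per-token FLOPs scale roughly with *active* parameters
# (~2 FLOPs per parameter per token), not total parameters.
dense_flops_per_token = 2 * total_params
moe_flops_per_token = 2 * active_params

compute_saving = dense_flops_per_token / moe_flops_per_token
print(f"~{compute_saving:.0f}x less compute per token than an equal-size dense model")
```

So a model can keep the capacity of hundreds of billions of parameters while paying inference costs closer to a ~37B dense model.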


It's the models.

"A well crafted layer of business logic" just doesn't exist. The amount of "business logic" involved in frontier LLMs is surprisingly low, and mostly comes down to prompting and how tools like search or memory are implemented.

Things like RAG never quite took off in frontier labs, and the agentic scaffolding they use is quite barebones. They bet on improving the model's own capabilities instead, and they're winning on that bet.


So how would you explain how an output of tokens can call a function, or even generate an image, since that requires a whole different kind of compute? There's still a layer between the model and those capabilities that acts as a parser to enable them.

Maybe “business” is a bad term for it, but the actual output of the model still needs to be interpreted.

Maybe I am way out of line here, since this is not my field, and I am doing my best to understand these layers. But in your terms, are you perhaps speaking of the model as an application?


The logic of all of those things is really, really simple.

An LLM emits a "tool call" token, then it emits the actual tool call as normal text, and then it ends the token stream. The scaffolding sees that a "tool call" token was emitted, parses the call text, runs the tool accordingly, flings the tool output back into the LLM as text, and resumes inference.

It's very simple. You can write basic tool call scaffolding for an LLM in, like, 200 lines. But, of course, you need to train the LLM itself to actually use tools well. Which is the hard part. The AI is what does all the heavy lifting.

Image generation, at the low end, is just another tool call that's prompted by the LLM with text. At the high end, it's a type of multimodal output - the LLM itself is trained to be able to emit non-text tokens that are then converted into image or audio data. In this system, it's AI doing the heavy lifting once again.
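The loop described above can be sketched in a few dozen lines. Here the model is a scripted stand-in and the tool is a toy (a real scaffold swaps in an LLM API call and real tools; the sentinel string and JSON call format are assumptions, since real models use special token ids):

```python
import json

TOOL_CALL = "<tool_call>"  # sentinel; real models emit special token ids

def search_tool(query):
    """Toy tool; a real scaffold would hit a search API here."""
    return f"results for {query!r}"

TOOLS = {"search": search_tool}

def fake_model(messages):
    """Stand-in for the LLM. First turn: emit a tool call. After the
    scaffold appends the tool output, answer in plain text."""
    if any(m["role"] == "tool" for m in messages):
        return f"Answer based on: {messages[-1]['content']}"
    return TOOL_CALL + json.dumps({"name": "search", "args": {"query": "weather"}})

def run(model, user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        output = model(messages)
        if not output.startswith(TOOL_CALL):
            return output  # plain text: we're done
        # Parse the call, run the tool, fling the result back in as text.
        call = json.loads(output[len(TOOL_CALL):])
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})

answer = run(fake_model, "what's the weather?")
```

Note that all of the "deciding when to call a tool and what to do with the result" lives in the model; the scaffold just parses and shuttles text.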


This is great! I've been diving deep into local models that can run on this kind of hardware. I've been building this exact same thing, but for complete recordings of meetings and such, because why not? I can even run a low-end model with ollama to refine and summarize the transcription, and combine it with smaller embedding models for modern semantic search. It has surprised me how well this works, and how fast it actually is locally.
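The semantic-search half is conceptually just "embed everything, rank by cosine similarity". A stdlib-only sketch with a deliberately toy embedding (hashed bag-of-words) standing in for a real local embedding model such as one served by ollama:

```python
import math
import zlib

def embed(text, dim=256):
    """Toy embedding: hashed bag-of-words, L2-normalised. A real setup
    would call a local embedding model (e.g. served by ollama) instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, docs):
    """Return the document most similar to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

docs = [
    "meeting notes: budget review with the finance team",
    "meeting notes: hiring plan for the platform team",
]
best = search("budget review finance", docs)
```

Swapping the toy `embed` for a real embedding model gives actual semantic matching (synonyms, paraphrases) rather than word overlap, but the ranking machinery stays identical.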

Hopefully we will see even more locally run AI models in the future with a complete package.


I’ve used H3 for a game. Since coordinates map to a unique hex, I can ensure that the cell grid aligns and lands in the same place in the world for everyone, so players can compete on the same cells.


Yes, I totally agree. But portfolios served a great purpose: exposing what you can do and what you know to the world and to recruiters, which would then lead to an opportunity to talk about the projects.

We might be going back to a more old-school approach where talking directly and presenting yourself carries more value again. It has always been of higher value, but now it will be somewhat more forced, I believe.

Another route would be that portfolios become more blog-based, talking about different solutions and problems for each project, as you are saying.


I agree. That would also require engineers to become more invested in core domain problems, which would then lead to more specialised skills (deeper, not broader). My guess is that not everyone actually likes this, but for now most of the current state points in that direction.


Whatever makes you feel great.

I am kinda in the same boat (but I do not write articles), spending most of my free time either learning or developing.

Frankly, I love it. It makes me happy, so why change it? If I feel burnt out, I usually switch to something else for a short time, but I can mostly switch between reading/coding/watching tech influencers.

So if you do not feel unhealthy (exercise can always help to take a natural break anyway), keep on learning and developing :)

