Hacker News | martin_'s comments

I built something similar[0] a while back, Stardrift looks 100x better - nice work!

[0] unitedstarlinktracker.com


This is the first link posted in /r/unitedairlines any time someone mentions "Starlink". One use case better covered by https://unitedstarlinktracker.com/ is the up-front log showing a quick swath of airports that might receive or depart Starlink-equipped planes. I can Ctrl-F "RDU" and immediately know my chances of checking this out (not much).

Would it be hard to produce a pie chart showing the top 10 airports with the most Starlink planes arriving/departing?


Oops sorry just saw this - done! https://unitedstarlinktracker.com/#airports


hey! I saw this and liked it a lot! It’s impressive how you pull in all the routes per tail - we considered doing it but were worried it would be too expensive. Definitely opens up cool options though.


how do you run a 1T-param model at low cost?


32B active parameters with a single shared expert.


This doesn’t change the VRAM usage, only the compute requirements.


It does not have to be VRAM, it could be system RAM, or weights streamed from SSD storage. Reportedly, the latter method achieves around 1 token per second on computers with 64 GB of system RAM.

R1 (and K2) is MoE, whereas Llama 3 is a dense model family. MoE actually makes these models practical to run on cheaper hardware. DeepSeek R1 is more comfortable for me than Llama 3 70B for exactly that reason - if it spills out of the GPU, you take a large performance hit.

If you have to spill into CPU inference, you'd much rather multiply a different 32B subset of the weights for each token than the same 70B (or more) every time, simply because the computation takes so long.
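The reasoning above can be sketched with back-of-the-envelope arithmetic: if memory bandwidth is the bottleneck during decoding, every generated token has to stream all active weights through the CPU once. The bandwidth figure and quantization level below are illustrative assumptions, not measurements.

```python
# Rough decode-speed estimate for CPU offload, assuming memory
# bandwidth is the bottleneck. Bandwidth and bit-width figures
# are hypothetical, for illustration only.
def tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Each generated token must read every active weight once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 70B at 4-bit (0.5 bytes/param) on ~50 GB/s dual-channel DDR4:
dense = tokens_per_sec(70, 0.5, 50)   # ~1.4 tok/s
# MoE with 32B active params at 4-bit on the same machine:
moe = tokens_per_sec(32, 0.5, 50)     # ~3.1 tok/s
print(f"dense 70B: {dense:.1f} tok/s, MoE 32B active: {moe:.1f} tok/s")
```

Same machine, same total memory traffic per parameter, but the MoE only touches 32B of its weights per token, hence the ~2x speedup over the dense 70B.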


The number of people who will be using it at 1 token/sec because there's no better option, and who have 64 GB of RAM, is vanishingly small.

IMHO it sets the local LLM community back when we lean on extreme quantization & streaming weights from disk to say something is possible*, because when people try it out, it turns out it's an awful experience.

* the implication being, anything is possible in that scenario


Good. Vanishingly small is still more than zero. Over time, running such models will become easier too, as people slowly upgrade to better hardware. It's not like there aren't options for the compute-constrained either. There are lots of Chinese models in the 3-32B range, and Gemma 3 is particularly good too.

I will also point out that having three API-based providers deploying an impractically large open-weights model beats the pants off having just one. Back in the day, this was called second-sourcing, IIRC. With proprietary models, you're at the mercy of one corporation and its Kafkaesque ToS enforcement.


You said "Good." then wrote a nice stirring bit about how having a bad experience with a 1T model will force people to try 4B/32B models.

That seems separate from the post it was replying to, about 1T param models.

If it is intended to be a reply, it hand waves about how having a bad experience with it will teach them to buy more expensive hardware.

Is that "Good."?

The post points out that if people are taught they need an expensive computer to get 1 token/second, much less try it and find out it's a horrible experience (let's talk about prefill), it will turn them off against local LLMs unnecessarily.

Is that "Good."?


Had you posted this comment in the early '90s about Linux instead of local models, it would have made about the same amount of sense, and it will age just as poorly.

I'll remain here, happily using my 2-point-something-tokens/second model.


But local aka desktop Linux is still an awful experience for most people. I use Arch btw


I'd rather use Arch over a genuine VT100 than touch Windows 11, so the analogy remains valid - at least you have a choice at all, even if you are in a niche of a niche.


An agentic loop can run all night long. It's just a different way to work: prepare your prompt queue, set it up, check the results in the morning, adjust. A 'local vibe' in 10 h instead of 10 min is still better than 10 days of manual side coding.


Right on! Especially if its coding abilities are better than Claude 4 Opus. I spent thousands on my PC in anticipation of this rather than to play fancy video games.

Now, where's that spare SSD...


You can probably run this on CPU if you have a 4090D for prompt processing, since 1TB of DDR4 only comes out to around $600.

For GPU inference at scale, I think token-level batching is used.


Typically a combination of expert-level parallelism and tensor-level parallelism is used.

The big MLP tensors would be split across GPUs in a cluster. The MoE experts would then be spread across the GPUs, with tokens routed to whichever experts are active (likely more than one if the batch size is > 1).
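The routing step described above can be sketched as a toy dispatcher: each token's gating scores pick its top-k experts, and tokens are grouped per expert so each expert (potentially living on a different GPU) does one batched matmul. The expert count, top-k value, and random stand-in for the gating network are all made-up illustrative choices.

```python
# Toy sketch of MoE token routing with batching. Random scores
# stand in for a real gating network; shapes are illustrative.
import random

NUM_EXPERTS, TOP_K = 8, 2

def route(batch):
    """Return {expert_id: [token_ids]} for a batch of token ids."""
    assignments = {e: [] for e in range(NUM_EXPERTS)}
    for tok in batch:
        scores = [random.random() for _ in range(NUM_EXPERTS)]
        top = sorted(range(NUM_EXPERTS), key=lambda e: -scores[e])[:TOP_K]
        for e in top:
            assignments[e].append(tok)
    return assignments

groups = route(list(range(16)))
# With batch size > 1, most experts end up with some tokens to
# process, which is why several experts are active per step.
print({e: len(toks) for e, toks in groups.items()})
```

In a real deployment the per-expert groups would each be an all-to-all shuffle to the GPU holding that expert, but the grouping logic is the same idea.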


With 32B active parameters it would be ridiculously slow at generation.


DDR3 workstation here - R1 generates at 1 token per second. In practice, this means that for complex queries, the speed of replying is closer to an email response than a chat message, but this is acceptable to me for confidential queries or queries where I need the model to be steerable. I can always hit the R1 API from a provider instead, if I want to.

Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.
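The scaling estimate above follows from the same bandwidth-bound assumption: decode speed is roughly inversely proportional to active parameter count, all else equal. A quick sanity check of the arithmetic:

```python
# Back-of-the-envelope: with memory bandwidth as the bottleneck,
# decode speed scales inversely with active parameters per token.
r1_active, k2_active = 37e9, 32e9  # active params per token
r1_speed = 1.0                     # observed tokens/sec for R1 on this box
k2_speed = r1_speed * r1_active / k2_active
print(f"estimated K2 speed: {k2_speed:.2f} tok/s")  # ~1.16
```

This ignores differences in quantization format and attention cost between the two models, so it's only a first-order estimate.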


That's pretty good. Are you running the real 600B+ parameter R1, or a distill, though?


The full thing, 671B. It loses some intelligence at 1.5 bit quantisation, but it's acceptable. I could actually go for around 3 bits if I max out my RAM, but I haven't done that yet.


I've seen people say the models get more erratic at higher (lower?) quantization levels. What's your experience been?


If you mean clearly, noticeably erratic or incoherent behaviour, then that hasn't been my experience for >=4-bit inference of 32B models, or in my R1 setup. I think the others might have been referring to this happening with smaller models (sub-24B), which suffer much more after being quantised below 4 or 5 bits.

My R1 most likely isn't as smart as the output coming from an int8 or FP16 API, but that's just a given. It still holds up pretty well for what I did try.


nice ship! I wrote a blog post on how to observe trends over time for a team via OTel, but I prefer your method for individual development!

https://ma.rtin.so/posts/monitoring-claude-code-with-datadog...


I like this solution. I had tinkered with the OTel integration hoping to get un-redacted prompts and responses, but had no luck. Did you get deeper into what data was useful?


You can get the un-redacted prompts via the OTel event logger[0], but unfortunately it won't give you responses. You could open up a GitHub issue to request that addition!

Disclaimer - I work at Anthropic but not on Claude Code, the team is responsive via GH issues though!

[0] https://docs.anthropic.com/en/docs/claude-code/monitoring-us...


You may be interested in https://github.com/ryoppippi/ccusage


huh - really? no - definitely not. haven't heard anyone else report that either. What browser?


Amazing performance! Do you anticipate making the model available for commercial use or are you primarily focused on releasing agents built upon it?


Wow brutal roasts

“You've spent so much time reverse engineering other people's APIs that you forgot to build something people would want to reverse engineer.”


This is neat - I built something in a similar vein recently but less productized, good work!

I have a habit of moving around (thus changing doctors), and the lack of a consolidated history has been a recurring pain. I recently had a full physical while in Thailand, and there was some potential concern around my labs & some imaging --- of course, without history it was a "check again in 6 months!", which prompted me to capitalize on Gemini's PDF-parsing abilities...

I still have work to do, but it's amazing what you can do in just a few hours now: https://health.martinamps.com


This is really cool! Would love an open source version of it.


I have a rudimentary pipeline that takes in a ton of data and converts it to JSON & Markdown, and then I used Claude and o1 pro to generate the dashboard. That is to say, there are manual hops. How would you want it packaged / what would be useful?


I've observed that, since LLMs inherently want to autocomplete, they're more inclined to keep complicating a solution than to rewrite it because it was directionally bad. The most effective way I've found to combat this is to restart the session and prompt it to produce an efficient/optimal solution to the concrete problem... then give it the problematic code and ask it to refactor accordingly.


I've observed this with ChatGPT. It seems to be trained to minimize changes to code earlier in the conversation history. This is helpful in many cases since it's easier to track what it's changed. The downside is that it tends to never overhaul the approach when necessary.


I for one am grateful for that fact ;)


To play devil's advocate: if that person had decided to go with $0 instead, wouldn't there be equally bad headlines/interpretations, along the lines of "Instead of allocating the formulaic $1 we are entitled to, in line with all other changes over X years, they squandered it on Y"?


I think many people would see no increase and assume there was some special mechanism needed to enact increases which hadn't happened in that particular year. Whereas a $1 increase clearly says "someone evaluated this and adjusted it up only $1". The analogy of a 10 cent tip vs. not tipping is a good one; the person who doesn't tip for a full meal is being a cheap asshole, but the person who leaves 10 cents is being a mean-spirited cheap asshole.

