Quality indie software in a niche that Ikea isn't addressing can make a decent income, unlike a lemonade stand.
And unlike at (this hypothetical) Ikea, you wouldn't have to maintain the impression of 20x AI-augmented output to avoid being fired. You could still use AI as much as you want; you just wouldn't have to keep proving you're not underusing it.
Evals let us agree on the baseline, the measurement, etc, and check whether simple things others do perform just as well. For the same reason, instead of 'works on my box' and 'my coding style', use one of the many community evals rather than making up your own benchmark.
That heads off many of the unfalsifiable discussions & claims happening here and moves everyone forward.
A Rust version of that compiler (which the project runs on) ran at 480k claims/sec and deterministically resolved 83% of conflicts across 1 million concurrent agents (also a 393,275x input-to-output compression reduction at 1M agents, though different topics can make the compression vary).
Natively, Claude (and other LLMs) resolve conflicting claims at about a 51% rate (based on internal research).
The built-in Byzantine fault tolerance (again, in the compiler) is also pretty remarkable: it can find the right answer even if 93% of the agents/data are malicious (with only 7% of agents/data telling us the correct information).
Basically, the idea is that if you want to build autonomous systems at scale, you need to resolve disagreement at scale, and this project does a pretty nice job of that.
My question was about claims like "5x productivity boost in merged PRs (lots of open PR & merge rate goes down, but net positive)". E.g., does this change anything on SWE-bench or any other standard coding eval?
The ecosystem is 8 tools plus a Claude Code plugin; the unlock was composing those tools (I don't regularly use all 9). The 5x claim came from /insights (Claude Code).
Not for everyone, but it radically changed how I build. Senior engineer, 10+ years
Now it's trivial to run multiple projects in parallel across Claude sessions (this wasn't really manageable before using wheat)
Genuinely don't remember the last time I opened a file locally
It sounds like the answer is "No: there is no repeatable eval of the core AI coding productivity claim, definitely not on one of the many community AI coding benchmarks used for understanding & comparison, and there won't be."
Maybe there's a fundamental miscommunication here about what evals are?
Evals apply not just to LLMs but to skills, prompts, tools, and most things that change the behavior of compound AI systems, and especially to productivity claims like the ones being put forth in this thread.
The features in the post relate directly to heavily researched areas of agents that are regularly benchmarked and evaluated. They're not obscure; e.g., another recent HN frontpage item benchmarked on research and planning.
Your question makes sense, it's just not in the current scope.
We're still benchmarking the compiler at scale, and the LLM tools were created as functional prototypes to showcase a single example of the compiler's use cases.
Since much of the unlock here is finding different applications for the compiler itself, we simply don't have the bandwidth to do much benchmarking on these projects on top of maintaining the repos themselves.
All the code is open source, and there's nothing stopping anyone from running their own benchmarks if they're curious.
Speaking of embeddable, we just announced Cypher syntax for GFQL, making it the first OSS CPU/GPU Cypher query engine you can use on dataframes.
It's typically used with scale-out DBs like Databricks & Splunk for analytical apps: security/fraud/event/social data analysis pipelines, ML+AI embedding & enrichment pipelines, etc. We originally built it for the compute-tier gap here, to help Graphistry users who are making embeddable interactive GPU graph viz apps and dashboards and don't want to add an external graph DB phase to their interactive analytics flows.
We took a multilayer approach to the GPU & vectorization acceleration, including a more parallelism-friendly core algorithm. This makes fancy features pay-as-you-go instead of dragging everything down, as in most of the columnar engines that are appearing. Our vectorized core already conforms to over half of the openCypher TCK, and we're working to add the trickier bits at different layers now that the flow is established.
The core GFQL engine has been in production for a year or two now with a lot of analyst teams around the world (NATO, banks, US gov, ...) because it's part of Graphistry. The open-source Cypher support is us starting to make it easy for others to use directly as well, including LLMs :)
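If you want a concrete feel before the Cypher layer: here's a minimal GFQL sketch over plain pandas dataframes using PyGraphistry's chain() API (the dataframes and filters are invented for illustration, and the new Cypher string entry point may differ from this, so check the repo for the exact syntax):

    # Minimal GFQL-on-dataframes sketch; data & filters are made up
    import pandas as pd
    import graphistry
    from graphistry import n, e_forward

    edges = pd.DataFrame({
        'src':  ['a', 'a', 'b', 'c'],
        'dst':  ['b', 'c', 'c', 'd'],
        'kind': ['pays', 'pays', 'refers', 'pays'],
    })
    nodes = pd.DataFrame({'id': ['a', 'b', 'c', 'd']})

    g = graphistry.nodes(nodes, 'id').edges(edges, 'src', 'dst')

    # One-hop traversal: start at node 'a', follow outgoing 'pays' edges
    hits = g.chain([n({'id': 'a'}), e_forward({'kind': 'pays'}), n()])
    print(hits._nodes)  # matched nodes, as a dataframe
    print(hits._edges)  # matched edges, as a dataframe

The Cypher announcement is sugar over the same engine, so the same traversal should be expressible as a MATCH string.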
Specifically, these big companies revenue-share with app companies, who in turn increase monetization by selling your private information, especially via free apps. In exchange for Apple et al's super-high app store rake, they claim to run security vetting programs and ToS that vet who they do business with, and they tell users & courts that things are safe, even when they know they're not.
It's not rocket science for phone OSes to figure out who these companies are and, since iOS/Android users already get tracked by Apple/Google/etc, to triangulate which apps are participating.
I'm game for throwing rocks at Apple and Google, but I don't get this one.
> consumer apps embed ad SDKs → those SDKs feed location signals into RTB ad exchanges → surveillance-oriented firms sit in the RTB pipeline and harvest bid request data even without winning auctions
Would you ban ad-supported apps? Assuming the comment you're responding to is realistic, I'm not sure how the OS is to blame.
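For what it's worth, the mechanics of the quoted pipeline are pretty concrete: in OpenRTB-style exchanges, every bidder receives the bid request whether or not it wins. A rough sketch of the payload shape (field names per the OpenRTB 2.x spec; values invented):

    # Rough OpenRTB 2.x-style bid request (invented values); every bidder
    # on the exchange receives this payload, win or lose
    bid_request = {
        "id": "req-123",
        "app": {"bundle": "com.example.flashlight"},  # which app the user is in
        "device": {
            "ifa": "38400000-8cf0-11bd-b23e-10b96e40000d",  # ad ID links requests over time
            "geo": {"lat": 37.7749, "lon": -122.4194, "type": 1},  # type 1 = GPS-derived
        },
        "imp": [{"id": "1", "banner": {"w": 320, "h": 50}}],
    }
    # A surveillance "bidder" only has to log bid_request to build a
    # location history; it never needs to win the auction.

So the harvesting part is real; the open question is whether policing it is the OS vendor's job.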
Neither big player has refined-enough permissions. These set users up to give away more data than they think.
Maybe one clear example is an app needing a permission once for setup, which then remains persistent.
An easy demonstration is just looking at what Graphene has done. It's open source, and you want to say Google can't protect their users better? Certainly Graphene has some advanced features, but not everything can be dismissed so easily. Besides, just throw advanced features behind a hidden menu (which they already have!). There's no reason you can't make most users happy while also catering to power users (they'll always complain, but that's their job).
I'm not sure that's how corporate blame works. The CEO signed off on the CIO's proposal to streamline data analytics logs via WeTotallyWontSiphonOffYourDataAndSellIt Incorporated for user-improvement purposes, which happens to be owned by the CFO's brother-in-law. How were the CIO and CEO to know that a third party was selling off the data, and how was that third party to know that the sale of the data to another party who then onsold the data to the FBI would be illegal?
> How were the CIO and CEO to know that a third party was selling off the data, and how was that third party to know that the sale of the data to another party who then onsold the data to the FBI would be illegal?
Ask yourself the same question about personal health data and the answer reveals itself: the CEO and CIO know (or should know) that the vendor needs to be HIPAA-compliant or it's their necks (the CEO's and CIO's), so they look for a vendor who advertises as being HIPAA-compliant.
Pass legislation to the same effect for all PII and the CEO and CIO will then make requirements of the vendor. If the vendor lies, they get fired because the company hiring them is culpable. The vendor may also be subject to civil and/or criminal penalties. It seems simple, other than the fact that we have a federal legislature with no apparent interest in solving this problem, alongside a populace which either doesn't notice or doesn't care about that.
To answer the question more pithily: communication.
In regulated industries, like finance and taxation, regulators deliberately assign responsibility to individuals, so misconduct doesn’t get lost inside the company or within its corporate stakeholder network. That removes a lot of friction once you want to hold someone liable.
I read the parent comment as an implicit proposal to establish similar structures in tech.
If I were simultaneously also the owner of the ad platform, I'd fix it & knock out the bad players, or get ready to be sued for a decade+ of knowing malpractice.
And if I were a US citizen watching the companies involved get sued for being monopolies and abusing their position, and then seeing them cry "security" in court while knowingly doing this for a decade+, I'd feel frustrated with successive left + right US administrations & voters.
If Google & Apple & friends refused to take a rake and opened distribution, then I'd agree: net neutrality etc, not their problem.
But they own so much, so deep into the pipeline, and justify their fees to courts with "security"... and then don't do investigations. They employ some of the best security analysts in the world and have $10-30B/yr in revenue tied to app store fees alone, so they very much could take a big bite out of this if they wanted to.
> They employ some of the best security analysts in the world and have $10-30B/yr in revenue
I'll never not be impressed by how many people will defend trillion-dollar organizations and say that things are too expensive. Especially when open source projects (including forks!) implement such features.
I'm completely with you: they could do these things if they wanted to. They have the money. They have the manpower. It's just a matter of priority. And let's be honest, they're spending larger amounts on slop than on actual fixes or even on making their products better (for the user).
“Priorities” is far too soft a term in this context. These are anti-priorities: not just things they choose not to work on, but things they’ll spend big money to prevent, up to and including bribing, uh I mean lobbying, lawmakers.
Ultimately, the fact that ad SDKs have such wide access to location information is a choice by the platforms. I've long wanted meaningful process isolation between an app and its ad SDKs, but right now there are oodles of them that just squat on location data whenever the app requests it.
Apple supposedly does this with the privacy report cards.
However, I'd be shocked if a cursory audit comparing SDKs embedded in apps and disclosed data sales showed they were effectively enforcing anything at all.
Do people really still think advertising has a legitimate function?
Really, these days it's 95% psychological manipulation to get people to buy inferior-quality stuff they don't need, and 5% people actually finding what they're looking for.
Don't forget, most advertising can work fine in a "pull" mode: I need something, so I go out and look for it. These days that's something like Google (not ideal, because results are also manipulated by the highest bidder), or dedicated forums or a subreddit for real people's experiences. In the old days it would have been the yellow pages or asking a friend.
It's everyone. Especially Google, but all the big tech companies play in the same pool. Amazon, Google, Apple, Meta, etc make money selling ads, which ultimately enables the tools that result in data harvesting from everyone across the internet. I wrote a little data investigation [1] (mostly finished) that showcases how every major news organization across the globe that I scanned had some level of data collection integrated. This is just one industry, but it's an important one (it connects back to the incentives these media organizations have, which is to make money by selling ads at any cost). The EFF also released a piece on how the bidding process to buy ads is itself a massive privacy nightmare [2].
Yeah, but unlike Facebook, they weren't just caught making videos of people having sex and then paying people to watch the videos.
Also, unlike Facebook, they weren't just caught running a dark-money lobbyist network with the goal of forcing more collection of minors' private information.
Facebook is evil for many different reasons, but for a government looking to spy on its own citizens, Cloudflare is a much more attractive target. That said, I have no doubt that they're collecting copious amounts of data from both companies, whether by sale or by force.
The interesting part is that Google & Apple, as part of explaining to courts why their large app store fees are legit and not proof of monopoly positions, hid behind the security argument that they need to be the clearinghouse for what software runs on the devices. Except... they've knowingly punted on this one for 10+ years.
I would 100% agree that losing privacy through any utility-level carrier (credit cards, phone, OS provider, etc) should be disallowed by default, with any opt-ins getting a clear transparency mode and an easy opt-out. At least two areas where the US can learn from the EU on digital policy are digital marketplaces and consumer privacy protection, and this topic sits at the intersection of both.
Once my code exists and passes tests, I generally move on to having it iteratively hunt for bugs, security issues, and DRY code-reduction opportunities until it stops finding worthwhile ones.
This doesn't always work as well as I'd like, but it largely does enough. Conversely, doing this as I go has been a waste of time.
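The loop itself is nothing fancy. A rough sketch of its shape, assuming Claude Code's headless -p mode (the prompt wording and the NO_FINDINGS stop convention are mine, not a built-in):

    # Iterative AI bug-hunting sketch; assumes Claude Code's `claude -p`
    # headless mode, with a made-up NO_FINDINGS sentinel as the stop signal
    import subprocess

    PROMPT = (
        "Hunt for one worthwhile bug, security issue, or DRY reduction in "
        "this repo, fix it, and keep the tests passing. If nothing "
        "worthwhile remains, reply exactly NO_FINDINGS."
    )

    for _ in range(10):  # cap rounds so the loop can't run forever
        out = subprocess.run(
            ["claude", "-p", PROMPT], capture_output=True, text=True
        ).stdout
        if "NO_FINDINGS" in out:
            break  # diminishing returns reached: stop hunting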
The phenomenon you're describing is why Cobol programmers still exist and, simultaneously, why it's increasingly irrelevant to most programmers.
The killer feature is the ecosystem: easily and reliably reusing other libraries and tools that work out-of-the-box with other Python code written in the last few years. There are individually neato features motivating the effort involved in upgrading a widely-used language & engine as well, but that kind of thinking unfortunately misses the forest for the trees.
It's a bit surprising to me, in the age of AI coding, that this is still a problem. Most features seem friendly to bootstrapping with automation (ex: f-strings that support ' not just "), and it'd be interesting if any don't fall in that camp. The main discussion still seems to be framed by the 2024 comments, before Claude Code etc became widespread: https://github.com/orgs/pypy/discussions/5145
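(If the parenthetical refers to the Python 3.12 f-string grammar rework, that's PEP 701: expressions inside an f-string may now reuse the f-string's own quote character. My guess at the feature meant, e.g.:)

    # PEP 701 (Python 3.12): f-string expressions may reuse the same quotes
    name = "world"
    print(f"hello {"dear " + name}")  # valid on 3.12+, SyntaxError on 3.11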
The alternative is when you run a script that you last used a few years ago and now need again for some reason (very common in research), and you can end up spending way too much time making it work with your now-upgraded stack.
Sure, you could have pinned dependencies, but that's a lot of overhead for a random script...
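That overhead has shrunk a lot, for what it's worth: PEP 723 inline script metadata lets a one-off script pin its own dependencies in a comment header, and runners like uv rebuild a matching environment on demand. E.g.:

    # analysis.py - deps pinned inline per PEP 723; `uv run analysis.py`
    # recreates a matching environment, even years later
    # /// script
    # requires-python = ">=3.11"
    # dependencies = [
    #     "pandas==2.2.0",
    #     "matplotlib==3.8.2",
    # ]
    # ///
    import pandas as pd

    print(pd.__version__)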
We can play that game: items like GIL-free interpreters and memory views are pretty relevant to folks on the more demanding side of scientific computing. But my point is that this is a head-in-the-sand game when the community vastly outweighs any individual feature. My experience with the scientific computing community is that the non-PyPy portion of it is much bigger.
I'm not a PyPy maintainer, so my only horse in this race is believing CPython folks benefit from seeing the PyPy community prove Things Can Be Better. Part of that means I'd rather PyPy live on by avoiding unforced errors.
I liked that they did this work + its sister paper, but disliked how it was positioned, basically opposite of the truth.
The good: it shows that on one kind of benchmark, some flavors of agentically-generated docs don't help on that task. So naively generating these, for one kind of task, doesn't work. Thank you, useful to know!
The bad: some people assume this means these files don't work in general, or that automation can't generate useful ones.
The truth: instruction files help measurably, and just a bit of engineering lets you guarantee high scores for the typical cases. As soon as you have an objective function, you can flip it into an eval and set an AI coder to editing these files until they work.
Ex: we recently released https://github.com/graphistry/graphistry-skills for more easily using Graphistry via AI coding, and by having our authoring AI loop a bit with our evals, we jumped the scores from a 30-50% success rate to 90%+. As we encounter more scenarios (and mine them from our chats etc), it's pretty straightforward to flip them into evals and ask Claude/Codex to loop until those work well too.
We do these kinds of eval-driven AI coding loops all the time, and IMO how to engineer them should be the message, not that they don't work on average. Deeper example near the middle/end of the talk here: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...
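For anyone wanting the shape of that loop: each scenario becomes a tiny scored check, and the AI coder edits the instruction file until the score clears a bar. A rough sketch (the scenario format, fake_agent stub, and threshold are invented, not our actual harness):

    # Sketch of an eval for instruction files; scenarios & stub are invented
    scenarios = [
        {"task": "plot the graph", "must_contain": "g.plot()"},
        {"task": "color nodes by degree", "must_contain": "encode_point_color"},
    ]

    def score(run_agent) -> float:
        """Fraction of scenarios where the agent's output hits the marker."""
        hits = sum(s["must_contain"] in run_agent(s["task"]) for s in scenarios)
        return hits / len(scenarios)

    def fake_agent(task: str) -> str:
        return "g.plot()"  # stand-in; the real harness calls the AI coder

    print(score(fake_agent))  # 0.5 here; loop the AI on the instruction
                              # file and re-score until it clears, say, 0.9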
* Specification extraction. We have security.md and policy.md, often per module: threat model, mechanisms, etc. This is collaborative and gets checked in, for ourselves and for the AI. Policy is often tricky & malleable product/business/UX decision stuff, while security covers the technical layers that are more independent of that or of the broader threat model. (Rough sketch of the security.md shape after this list.)
* Bug mining. It's driven by the above and iterative: we keep running it to surface findings, adversarially analyze them, and prioritize them, repeating until diminishing returns wrt priority levels. It often leads to policy & security spec refinements. We use this pattern not just for security but for general bugs and other iterative quality & performance improvement flows - it's just a simple skill file with tweaks like parallel subagents to make it fast and reliable.
This lets the AI drive itself more easily, and in ways you explicitly care about vs noise.
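For reference, the rough shape of one of those per-module files (illustrative skeleton, not our literal template):

    # security.md (per module) - illustrative skeleton
    ## Threat model
    - Who can reach this module, with what privileges
    - Assets at stake (tokens, PII, the query engine)
    ## Mechanisms
    - Input validation & parse boundaries
    - Where authz checks are enforced
    - Logging/audit hooks
    ## Out of scope
    - Risks delegated to policy.md (product/business/UX decisions)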
In our evals for answering cybersecurity incident investigation questions, and even autonomously doing the full investigation, gpt-5.2-codex with low reasoning was the clear winner over non-codex or higher reasoning: 2x+ faster, higher completion rates, etc.
It was generally smarter than pre-5.2, so strategically better; codex likewise wrote better database queries than non-codex, and since it needs to iteratively hunt down the answer, it didn't run out the clock by drowning in reasoning.
And yes, sometimes it's nice to support a local lemonade stand. For my family's income, though, I know which segment I'd feel more confident working for.