Hacker News | mattcollins's comments

On the other hand, AI coding tools make it relatively easy to set and apply policies that can help with this sort of thing.

I like to have something like the following in AGENTS.md:

## Guiding Principles

- Optimise for long-term maintainability
- KISS
- YAGNI


Not sure if you're kidding or not, but to write great maintainable code, you need a lot of understanding that an LLM just doesn't have: history, business context, company culture, etc. Also, I doubt that its training data contains many good examples of great maintainable code to pull from.


Neither do most humans writing such code. I have seen LLMs generate better code than 90% of the coders I have seen in the last 20 years.


Admitting you've spent two decades on a career stuck working in the kind of sweatshops that hire people who can't actually code isn't much of a flex, and certainly doesn't lend a whole lot of credence to your argument.


Not the GP, but when I take strolls through some open source project hosted on GitHub, usually I am not impressed either: unnecessary OOPism, way-too-long procedures that could instead be pure functions, badly named variables, and way, way, waaay too many willy-nilly added dependencies. If that is what the LLMs mostly learn from, I am not surprised at all. But then again, this stuff was also written by humans.

I remember one especially bad case in a very popular project (in its niche) that was basically a one-man show: a procedure of 300+ lines doing all kinds of stuff and changing the global state of the service it implements. But that code was, or is, relied upon by tech giants and other businesses, and no one improves it. They are happy paying that one guy probably not very much money.


Please let's stop with the "but some humans also suck at this so it's ok if LLMs also suck at it" argument. It doesn't add anything to the discussion.


Awesome! However, the corporation is the one excited about using AI, making the coder the one at risk of getting fired for writing the exact same lousy (for the sake of the argument) code.

Or worse: for not relying as much as possible on the AI, which apparently can write code just as bad, but faster!

A subtle detail: you speak of coders, not software engineers. A SWE's value is not their code-churning speed.


This says more about you and the people you work with. I find engineers who have been at a company for a while are invaluable when it comes to this information; it's not just knowing the how, but the when and why, that's critical as well.

Acting like people can't be good at their jobs is frankly dehumanizing and says a lot about how you view your fellow devs.


If only more engineers admitted that something they wrote is not good code, but a product of its time, then I think we would have more realistic expectations.

It's OK to say that something you made is shit. It is OK to say that you were not given time to do xyz.

One way to recognize that something was made to fit, at least, is when you see it in use without much change for some 3 or 4 years, and, as the person maintaining it, you rarely ever need to touch it, because you built it in a way that is simple enough not to have tons of bugs yet flexible enough to cover current and anticipated use-cases.


He isn't kidding. I have a directive to write the shortest, least complicated, readable business code, and it makes a huge difference.


Sometimes, as in bilsbi's top-level comment, the solution is to use a free tool/library/product that already exists. The solution is not always to write new code, but the agent will happily write it anyway.

Maybe that's "the manager's job", but that's just passing the buck and getting a worse solution. Every level of management should be looking for the best solution.


"Be sure to remember software is a sociotechnical system and don't fall prey to the Mechanistic myth"


This is exactly right. I maintain an AGENTS.md for my own AI assistant with similar principles - "禁止只记录不行动" (no recording without action) and strict rules about when to escalate vs. when to solve autonomously.

The key insight is that the AGENTS.md becomes a kind of "engineering culture in a file". When you onboard a human engineer, you hope they absorb the team's values over time. With AI, you can encode those values upfront.

The challenge is that principles need to be specific enough to be actionable. "Write simple code" is too vague. "Avoid single-use wrapper functions" (from the sibling comment) is better - it's enforceable.


I wrote something similar in a Claude Code instructions.md: "minimize cyclomatic complexity". What happened next? It generated an 8-line wrapper function called only once from a different file. So, I told it to inline that logic in the caller. The result? One. Line. Of. Code.

So, I asked it to modify its instructions.md file to not repeat that mistake. The result was the new line "Avoid single-use wrapper functions; inline logic at the call site unless reused"
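That added rule is easy to picture. A tiny sketch of the before/after (hypothetical names, not from the actual session):

```python
# Hypothetical sketch of the anti-pattern the rule targets; names invented.

# Before: a single-use wrapper that adds a name but no behaviour.
def _config_path(name):
    return f"/etc/myapp/{name}.conf"

def start_before(name):
    return _config_path(name)  # sole call site

# After: the logic inlined at the call site, one line instead of a wrapper.
def start_after(name):
    return f"/etc/myapp/{name}.conf"
```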

instructions.md is the new intern.


It reminds me a lot of people who take Code Complete too seriously. "Common sense" is not an objective or universal statement unfortunately - plus, speaking for myself, what I consider "common sense" can change on the daily, which is why I can't be trusted adding features to my own codebase long term <_<.




Maybe a better way to handle "minimize cyclomatic complexity" would be to set an agent in a loop of code metrics, refactor, test, and repeat.
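A minimal sketch of such a loop, with the metric, the refactoring step, and the test run stubbed out as hypothetical callables standing in for real tooling:

```python
# Hedged sketch: measure, refactor, and run_tests are hypothetical hooks
# standing in for a real complexity metric, an agent call, and a test runner.
def refine(code, measure, refactor, run_tests, target, max_rounds=5):
    for _ in range(max_rounds):
        if measure(code) <= target:
            break  # metric is good enough; stop spending tokens
        candidate = refactor(code)
        if run_tests(candidate):  # only keep refactors that stay green
            code = candidate
    return code
```

The `max_rounds` cap is the token-budget guard: the loop stops even if the metric never reaches the target.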


Good idea. I'm still a bit shy around token-budget spend.


I think this is, at the moment, the practical limitation to using AI for everything. It's also what the coding agents themselves optimize for to some degree, or the slider they can play with for price vs. quality; the "thinking" models are the exact same, just burning more tokens.


I'm waiting for the next Mac Studio to come out to experiment with the "AI for everything" approach. Most likely, the open source distilled models will be lower quality, so another "price vs. quality" tradeoff. Still, it will be fun to code like I'm at a foundation lab.


This seems like a perfect use case for a local model. But I've found in practice that the system requirements for agents are much higher than for models that can handle simple refactoring tasks. Once tool use context is factored in, there is very little room for models that perform decently.


What I hope to do with refactoring is to distill namespace and common patterns into a DSL. I am very curious about what tradeoffs you found.


Whatever agent I tried would include thousands of tokens of tool-use instructions. That would use up most of the available context unless running very low-spec models. I've concluded it's best to use the big 3 for most tasks and Qwen on RunPod for more private data.


FWIW, I ran a test comparing LLM accuracy with TOON versus JSON, CSV and a variety of other formats when using them to represent tabular data: https://www.improvingagents.com/blog/is-toon-good-for-table-...

I've only looked at one model (gpt-4.1-nano) so far. I'm hoping to run similar tests on some other models but it gets challenging to discern statistically significant differences with better models as their accuracy tends to be a lot better across the board.


Results from some further tests here: https://www.improvingagents.com/blog/toon-benchmarks


This is a follow-up to previous work looking at which format of TABULAR data LLMs understand best: https://www.improvingagents.com/blog/best-input-data-format-...

(There was some good discussion on Hacker News around that here: https://news.ycombinator.com/item?id=45458455)

We often want to feed NON-TABULAR data to LLMs, though, such as typical API responses or config files.

This new work looks at how the format of such nested / hierarchical data affects how well LLMs can answer questions about it; specifically how several models get on with JSON, YAML, XML and Markdown.



Author here.

This has made me chuckle several times - thanks!


I'm the person who ran the test.

To hopefully clarify a bit...

I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.


Can you expand on how you did this?


I did a small test with just a couple of formats and something like 100 records, saw that the accuracy was higher than I wanted, then increased the number of records until the accuracy was down to 50%-ish (e.g. 100 -> 200 -> 500 -> 1000, though I forget the precise numbers.)
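The procedure described could be sketched as a simple doubling loop, with `run_eval` standing in (hypothetically) for a full benchmark run at a given record count:

```python
# Sketch of that calibration; run_eval is a hypothetical function returning
# measured accuracy for a given number of input records.
def calibrate(run_eval, start=100, target=0.5, cap=10_000):
    n = start
    while n <= cap:
        if run_eval(n) <= target:
            return n  # accuracy has fallen into the target region
        n *= 2  # roughly the 100 -> 200 -> 500 -> 1000 progression
    return cap
```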


I'm the person who ran the test.

To explain the 60% a bit more...

With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.

For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.


Thanks for your work on this! It's a very legit domain of problem for LLMs to optimize for. I've produced a comprehensive eval based on your post and run it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...

As you can see, it's near 100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing basic prompt adherence ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70B Instruct, Mistral Nemo, Gemma 3 12B IT, GPT-4o/4.1 Mini, etc.

Based on these limited tests, here are the format leaderboards, FWIW:

    CSV: 84.25%
    Markdown Table: 82.65%
    YAML: 81.85%
    JSON Lines (jsonl): 79.85%
    Markdown key-value: 79.83%
    Pipe-delimited: 79.45%
    Natural language summary: 78.65%
    JSON: 77.73%
    HTML table: 75.80%
    XML: 73.80%

So, the biggest takeaway really is: use the best model you can reasonably afford, and then format will matter less. The cheapest models with 100% coverage are Gemini 2.5 Flash and DeepSeek Chat V3.1.

And if you have no control over model, then use CSV or Markdown Table.
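A crude way to see why CSV tends to win on size: field names appear once in CSV but on every row in JSON. Character counts below are only a rough stand-in for tokens (actual tokenizer counts will differ), and the sample rows are invented:

```python
import csv
import io
import json

# Invented sample records; character counts as a crude proxy for tokens.
rows = [{"id": i, "name": f"item{i}", "price": i * 1.5} for i in range(3)]

def as_csv(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def as_jsonl(records):
    return "\n".join(json.dumps(r) for r in records)

def as_json(records):
    return json.dumps(records, indent=2)

# CSV names each field once; JSON repeats field names on every row.
sizes = {fn.__name__: len(fn(rows)) for fn in (as_csv, as_jsonl, as_json)}
```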


Wouldn't it be more useful to measure the number of rows the model can process while still hitting 100% accuracy?


> As you increase the size of the input data, the accuracy gradually decreases.

Interesting.

Regarding your section "Limitations and Areas for Further Study":

What I'd be curious about for future work would be:

    - changing the order of the data in each table type
    - changing the order of the questions

I'm curious to know whether what it fails on is the same, whether it changes depending on location, and whether it's a bias.

Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x)? Does it tend towards x or y on certain types of questions?
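The order-shuffling experiment could be set up with seeded permutations so that every run is reproducible; a hypothetical sketch:

```python
import random

# Hypothetical sketch of the order-permutation check: yield the same
# records in shuffled order, seeded so each permutation is reproducible.
# Per-question accuracy can then be compared across permutations.
def permutations_of(records, n_trials=3, seed=0):
    rng = random.Random(seed)
    for _ in range(n_trials):
        shuffled = list(records)
        rng.shuffle(shuffled)
        yield shuffled
```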

Good idea


LLMs have documented position biases, with skew towards first and last. This is strongest in messages due to system prompt + current question training data, but it's present in list data in general.


Exactly. But in the papers I've seen, the tests are usually done with multiple-choice answers:

    Where do you eat?
    A) floor
    B) table
    C) dirt

In this case, the questions asked have a definite answer, so the bias would be in the order of the input data. It's different enough that it triggered my curiosity.



Thank you for including the tokens needed for each test.

It looks to me like the most concise way of representing each of these tables was CSV, then a standard Markdown table; the token counts appear to be 1/2 or 1/3 of the other options. For experiments not in mice (GPT-4.1-nano), but in larger models, or with larger context aside from the data table itself, my guess is that preserving context might be of higher value than the higher LLM-legibility of Markdown-KV.


I'm the person who ran the test.

The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.

[I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]


"This feature is available to all customers, meaning anyone can enable this today from the Cloudflare dashboard."

https://blog.cloudflare.com/control-content-use-for-ai-train...


I wondered about this, too.

Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.

