Not sure if you're kidding or not, but to write great maintainable code, you need a lot of understanding that an LLM just doesn't have: history, business context, company culture, etc. Also, I doubt its training data has many good examples of great maintainable code to pull from.
Admitting you've spent two decades on a career stuck working in the kind of sweatshops that hire people who can't actually code isn't much of a flex, and certainly doesn't lend a whole lot of credence to your argument.
Not the GP, but when I take strolls through some open source project hosted on GitHub, I'm usually not impressed either. Unnecessary OOPism, overly long procedures that could instead be pure functions, badly named variables, and way, way, waaay too many willy-nilly added dependencies. If that is what the LLMs mostly learn from, I am not surprised at all. But then again, this stuff was also written by humans. I remember one especially bad case in a very popular project (in its niche) that was basically a one-man show: a procedure of 300+ lines doing all kinds of stuff and mutating the global state of the service it implements. But that code was, or is, relied upon by tech giants and other businesses, and no one improves it. They are happy paying that one guy, probably not all that much money.
Awesome! However, corporate is excited about using AI, making the coder the one at risk of getting fired for writing the exact same lousy (for the sake of argument) code.
Or worse: for not relying as much as possible on the AI, which apparently can write code just as bad, but faster!
A subtle detail: you speak of coders, not software engineers. A SWE's value is not their code-churning speed.
This says more about you and the people you work with. I find engineers who have been at the company for a while are invaluable when it comes to this information; it's not just knowing the how but the when and why that's critical as well.
Acting like people can't be good at their job is frankly dehumanizing and says a lot about your mindset with how you view other fellow devs.
If only more engineers admitted that something they wrote is not good code but a product of its time, I think we would have more realistic expectations.
It's OK to say that something you made is shit. It is OK to say that you were not given time to do xyz.
One way to recognize that something was at least a good fit: it stays in use without much change for some three or four years, and while you are the person maintaining it, you rarely ever need to touch it, because you built it simply enough not to have tons of bugs yet flexibly enough to cover current and anticipated use cases.
Sometimes, as in bilsbi's top-level comment, the solution is to use a free tool/library/product that already exists. The solution is not always to write new code, but the agent will happily write it anyway.
Maybe that's "the manager's job", but that's just passing the buck and getting a worse solution. Every level of management should be looking for the best solution.
This is exactly right. I maintain an AGENTS.md for my own AI assistant with similar principles, such as "no recording without action", and strict rules about when to escalate vs. when to solve autonomously.
The key insight is that the AGENTS.md becomes a kind of "engineering culture in a file". When you onboard a human engineer, you hope they absorb the team's values over time. With AI, you can encode those values upfront.
The challenge is that principles need to be specific enough to be actionable. "Write simple code" is too vague. "Avoid single-use wrapper functions" (from the sibling comment) is better - it's enforceable.
I wrote something similar in a Claude Code instructions.md: "minimize cyclomatic complexity". What happened next? It generated an 8-line wrapper function called only once, from a different file. So I told it to inline that logic in the caller. The result? One. Line. Of. Code.
So I asked it to modify its instructions.md file so it wouldn't repeat that mistake. The result was the new line: "Avoid single-use wrapper functions; inline logic at the call site unless reused".
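To make the rule concrete, here's a toy sketch (not the actual generated code, and the function names are made up) of the before/after that the new instruction is meant to prevent:

```python
# Before: a single-use wrapper, called exactly once (the pattern to avoid)
def _format_user_label(user):
    name = user.get("name", "unknown")
    role = user.get("role", "guest")
    return f"{name} ({role})"

def render_header_before(user):
    return "Welcome, " + _format_user_label(user)

# After: the same logic inlined at its sole call site, per the new rule
def render_header_after(user):
    name = user.get("name", "unknown")
    role = user.get("role", "guest")
    return f"Welcome, {name} ({role})"
```

Once a second caller appears, extracting the helper back out is justified; the rule targets speculative indirection, not genuine reuse.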
It reminds me a lot of people who take Code Complete too seriously. "Common sense" is not an objective or universal statement unfortunately - plus, speaking for myself, what I consider "common sense" can change on the daily, which is why I can't be trusted adding features to my own codebase long term <_<.
> I wrote something similar in a Claude Code instructions.md: "minimize cyclomatic complexity" What happened next? It generated an 8 line wrapper function called only once from a different file. So, I told it to inline that logic in the caller. The result? One. Line. Of. Code.

> So, I asked it to modify its instructions.md file to not repeat that mistake. The result was the new line "Avoid single-use wrapper functions; inline logic at the call site unless reused"
I think this is, at the moment, the practical limitation of using AI for everything (and what the coding agents themselves also optimize for to some degree; it's the slider they can play with for price vs. quality, the "thinking" models being the exact same models, just burning more tokens).
I'm waiting for the next Mac Studio to come out to experiment with the "AI for everything" approach. Most likely, the open source distilled models will be lower quality, so another price-vs-quality tradeoff. Still, it will be fun to code like I'm at a foundation lab.
This seems like a perfect use case for a local model. But I've found in practice that the system requirements for agents are much higher than for models that can handle simple refactoring tasks. Once tool use context is factored in, there is very little room for models that perform decently.
Whatever agent I tried would include thousands of tokens of tool-use instructions. That would use up most of the available context unless running very low-spec models. I've concluded it's best to use the big three for most tasks and Qwen on RunPod for more private data.
I've only looked at one model (gpt-4.1-nano) so far. I'm hoping to run similar tests on some other models but it gets challenging to discern statistically significant differences with better models as their accuracy tends to be a lot better across the board.
We often want to feed NON-TABULAR data to LLMs, though, such as typical API responses or config files.
This new work looks at how the format of such nested/hierarchical data affects how well LLMs can answer questions about it; specifically, how several models get on with JSON, YAML, XML and Markdown.
I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.
I did a small test with just a couple of formats and something like 100 records, saw that the accuracy was higher than I wanted, then increased the number of records until the accuracy was down to 50%-ish (e.g. 100 -> 200 -> 500 -> 1000, though I forget the precise numbers).
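The sweep described above can be sketched as a simple doubling loop; `measure_accuracy` here is a stand-in for running the actual eval at a given record count (the decay function below is fake, just for illustration):

```python
def calibrate_record_count(measure_accuracy, start=100, target=0.5):
    """Double the input size until measured accuracy drops to ~target."""
    n = start
    acc = measure_accuracy(n)
    while acc > target:
        n *= 2
        acc = measure_accuracy(n)
    return n, acc

# Toy stand-in: accuracy decays linearly with input size.
# A real run would query the LLM over n records and score the answers.
fake_eval = lambda n: max(0.0, 1.0 - n / 2000)
```

In practice you would refine between the last two sizes rather than stop at the first overshoot, since doubling can land well below the target.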
With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.
For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.
Thanks for your work on this! It's a very legitimate problem domain for LLMs to optimize for. I've produced a comprehensive eval based on your post and run it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...
As you can see, it's near 100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing basic prompt adherence ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini, etc.
Based on these limited tests, here is the leaderboard on formats, FWIW:
So, the biggest takeaway really is: use the best model you can reasonably afford; then format will matter less. The cheapest models with 100% coverage are Gemini 2.5 Flash and DeepSeek Chat V3.1.
And if you have no control over the model, then use CSV or a Markdown table.
> As you increase the size of the input data, the accuracy gradually decreases.
Interesting.
Regarding your section "Limitations and Areas for Further Study", what I'd be curious about for future work would be:
- changing the order of the data on each table type
- changing the order of the questions
I'm curious to know whether it fails on the same items, whether that changes depending on their location, and whether it's a bias.
Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x?). Does it tend towards x or y on types of questions?
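One way to probe those questions, assuming you have per-question results from an eval run, is to bucket accuracy by question position and look for dips (the results here are fabricated purely for illustration):

```python
from collections import defaultdict

# Fake per-question results: (question_position, was_correct) pairs.
# A real run would record these while scoring the LLM's answers.
results = [(i % 10, (i * 7 + i // 10) % 3 != 0) for i in range(200)]

buckets = defaultdict(list)
for pos, ok in results:
    buckets[pos].append(ok)

accuracy_by_position = {pos: sum(v) / len(v) for pos, v in sorted(buckets.items())}
# A consistent dip at particular positions (often the middle, given known
# "lost in the middle" effects) would suggest position bias rather than
# difficulty with a specific question or value.
```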
LLMs have documented position biases, with skew towards first and last. This is strongest in messages due to system prompt + current question training data, but it's present in list data in general.
Exactly. But in the papers I've seen, the tests are usually done with multiple-choice answers.
Where do you eat?
A) floor
B) table
C) dirt
In this case, the questions asked have a definite answer, so the bias would then be in the order of the input data. It's different enough that it triggered my curiosity.
Thank you for including the tokens needed for each test.
It looks to me like the most concise ways of representing each of these tables were CSV and then a standard Markdown table; the token counts appear to be 1/2 or 1/3 of the other options. For experiments not in mice (GPT-4.1-nano) but in larger models, or with larger context aside from the data table itself, my guess is that preserving context might be higher value than the higher LLM legibility of Markdown-KV.
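You can get a rough feel for the size gap yourself; the sketch below uses character counts as a crude stand-in for tokens (real token counts depend on the tokenizer), and the key-value rendering is just one plausible interpretation of a "Markdown-KV" format, not necessarily the one used in the post:

```python
import csv
import io
import json

rows = [{"id": i, "name": f"user{i}", "score": i * 37 % 100} for i in range(1, 101)]

def as_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def as_markdown_kv(rows):
    # One "key: value" line per field, records separated by blank lines
    blocks = ["\n".join(f"{k}: {v}" for k, v in r.items()) for r in rows]
    return "\n\n".join(blocks)

sizes = {
    "csv": len(as_csv(rows)),
    "json": len(json.dumps(rows, indent=2)),
    "markdown-kv": len(as_markdown_kv(rows)),
}
# CSV states each field name once in the header; JSON and KV repeat
# the keys for every record, which is where most of the bloat comes from.
```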
The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.
[I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]
I like to have something like the following in AGENTS.md:
## Guiding Principles

- Optimise for long-term maintainability
- KISS
- YAGNI