While obviously super-impressive, it is clearly not maintainable without an AI agent. Its spinel_codegen.rb is 21k lines of code with up to 15 levels of nesting in some methods.
Compiler code was never pretty, but even by those standards, I feel this is code that is very, very hard for humans to maintain.
Compiler code can be pretty if you have the time to maintain it. Compilers are some of the most modular applications you can build with hard boundaries between subsystems and clear handoffs at each level.
The problem is that people often do not have the time to refactor once they have gotten the thing to work. And the mess keeps growing.
Management problem more than anything else, I feel.
Compilers should not have so much churn. You decide on a set of language features, stick to it and implement. After that, it should only be bugfixes for the foreseeable future till someone can make a solid case for that shiny new feature.
spinel_codegen.rb is an eldritch horror. I always get spaghetti code like this when using Claude, and I've been wondering if I'm doing something wrong. Now I see an application that looks genuinely interesting (not trivial slop) written by someone I consider to be a top notch programmer, and the code quality is still pretty garbage in some places.
For example, infer_comparison_type() [1]. This is far from the worst offender - it's not that hard to read - but what's striking here is that there is a better implementation that's so simple and obvious, and Claude still fails to get there. Why not replace this with
require "set"

COMPARISON_TYPES = Set.new(["<", ">", "<=", ">=", "==", "!=", "!"])

def infer_comparison_type(mname)
  if COMPARISON_TYPES.include?(mname)
    "bool"
  else
    ""
  end
  # Or even better, strip the else case
  # (which would return nil for anything not in the set)
end
This would be shorter, faster, more readable, and more easily maintainable, but Claude always defaults to an if-return, if-return, if-return pattern. (Even if-else seems to be somewhat alien to Claude.) My own Claude codebases are full of that if-return crap, and now I know I'm not alone.
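For contrast, the if-return shape described above looks something like this (a hypothetical reconstruction for illustration, not the actual spinel_codegen.rb code), side by side with the table-driven version:

```ruby
require "set"

# Hypothetical reconstruction of the if-return style Claude tends
# to emit; the real spinel_codegen.rb method differs in detail.
def infer_comparison_type_verbose(mname)
  return "bool" if mname == "<"
  return "bool" if mname == ">"
  return "bool" if mname == "<="
  return "bool" if mname == ">="
  return "bool" if mname == "=="
  return "bool" if mname == "!="
  return "bool" if mname == "!"
  ""
end

# The table-driven version collapses all of that into one lookup.
COMPARISON_TYPES = Set.new(["<", ">", "<=", ">=", "==", "!=", "!"])

def infer_comparison_type(mname)
  COMPARISON_TYPES.include?(mname) ? "bool" : ""
end
```

The two behave identically; the difference is that adding a new operator to the second version is a one-token change to a data table rather than a new branch.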
Other files have much better code quality though. For example, most of the lib directory, which seems to correspond to the ext directory in the mainline Ruby repo. The API is clearly inspired by MRI ruby, even though the implementation differs substantially. I would guess that Matz prompted Claude to mirror parts of the original API and this had a bit of a regularizing effect on the output.
The solution to this with Claude is to use a small agent harness and include refactoring steps once tests are written and pass. For some things you will need to include rules on the coding style it should prefer. This is especially true for Ruby or other languages it has seen less training data for than, e.g., Python.
It's true that it's shorter, but I suspect the if-return, if-return pattern compiles down to much faster code. Separately, this code was originally written in C and then ported. There are reasonable explanations for why Matz has the code written this way besides the typical AI slop.
I'm skeptical of that reasoning because the original C wasn't too clean or performant either. For example, emit.c from an earlier commit [1].
It writes a separate call to emit_raw for each line, even though there are many successive calls to emit_raw before it runs into any branching or other dynamic logic. What if you change this
emit_raw(ctx, "#include <stdio.h>\n");
emit_raw(ctx, "#include <stdlib.h>\n");
emit_raw(ctx, "#include <string.h>\n");
emit_raw(ctx, "#include <math.h>\n");
// And on for dozens more lines
to this
emit_raw(ctx,
"#include <stdio.h>\n"
"#include <stdlib.h>\n"
"#include <string.h>\n"
"#include <math.h>\n"
// And on for dozens more lines
);
That would leave you with code that is just as readable, but only calls the emit function once, leading to a smaller and faster binary. Again, this is a trivial change to the code, but Claude struggles to get there.
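Incidentally, the same batching trick carries over to the Ruby port, since Ruby also merges adjacent string literals at parse time (the emit_raw helper here is a stand-in for illustration, not the real codegen API):

```ruby
# Stand-in emitter; the actual spinel_codegen.rb interface may differ.
def emit_raw(ctx, str)
  ctx << str
end

ctx = +""  # unfrozen string buffer

# Adjacent string literals joined with a trailing backslash are
# concatenated at parse time, so this is a single emit_raw call:
emit_raw(ctx,
  "#include <stdio.h>\n" \
  "#include <stdlib.h>\n" \
  "#include <string.h>\n" \
  "#include <math.h>\n")
```

So the "one call instead of dozens" rewrite is just as available in the Ruby version as in the C one.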
Obviously it doesn't matter much now whether it's maintainable by hand or not. If code is passing tests and benchmarks, I am happy.
But I am not sure that huge files are easy for the AI to work with. I try to restrict the files to 300 lines. My thinking is that if it's easy for a human to understand the code, it will be easy for coding agents, too.
Ok, let me call you out more explicitly. It is clear that most of the code is not written by you. Commit history shows that first a large feature appears out of the blue, then you have a followup series of commits removing "useless" comments (left by LLM). Quite a few useless comments are still there.
Also, your Rust implementation is 100% broken, which some of the comments you deleted point out.
Fil-C aborts your program if it detects unsafe memory operations. You very much can write code that is not memory safe; it will just crash. It also has a significant runtime cost.
Rust tries to prevent you from writing memory-unsafe code. But it has official ways of overcoming these barriers (the "unsafe" keyword, which tells the compiler "trust me bro, I know what I'm doing") and some soundness holes. But because safety is proven statically by the compiler, it is mostly zero-cost. ("Mostly" because there are some things the compiler can't prove, and people resort to "unsafe" + runtime checks.)
Two orthogonal approaches to safety. You could have Fil-C style runtime checks in Rust, in principle.
> Once you add static linking to the toolchain (in all of its forms) things get really fucking slow.
Could you expand on that, please? Every time you run a dynamically linked program, it is linked at runtime (unless it explicitly avoids linking unnecessary stuff by dlopening things lazily, which pretty much never happens). If it is fine to link on every program launch, linking at build time should not be a problem at all.
If you want to have link time optimization, that's another story. But you absolutely don't have to do that if you care about build speed.
Reading your comment, it sounds like the opposite would be true, because so much linking would need to be done at runtime. But that perception misses that when an executable is said to be dynamically linked, most symbols were still statically linked. It is only the few public exported symbols that are dynamically linked, because they are deemed to be a reasonably separate concern that should be handled by someone else's codebase.
I think lazy linking is the default even if you don't use dlopen, i.e. every symbol gets linked upon first use. Of course that has the drawback that the program can crash due to missing/incompatible libraries in the middle of its work.
A lot of vendors use non-lazy binding for security reasons, and some platforms don't support anything other than RTLD_NOW (e.g., Android).
Anyway, while what you said is theoretically half-true, a fairly large number of libraries are not designed/encapsulated well. This means almost all of their symbols are exported dynamically, so the idea that there are only a "few public exported symbols" is unfortunately false.
However, something almost no one ever mentions is that ELF was actually designed to allow dynamic libraries to be fairly performant. It isn't something I would recommend, as it breaks many assumptions on Unices (and you still don't get the benefits of LTO), but you can achieve code generation almost equivalent to static linking by using something like "-fno-semantic-interposition -Wl,-Bsymbolic,-z,now". MaskRay has a good explanation of it:
https://maskray.me/blog/2021-05-16-elf-interposition-and-bsy...
That is exactly what you want in order to evaluate the technology. Not making a buggy commit into software used by nobody and reviewed by an intern, but actually having it reviewed by domain professionals in a real-world, very well-tested project. That way they can make an informed decision on where it lacks in capabilities and what needs to be fixed before they try it again.
I doubt that anyone expected to merge any of these PRs. The question is: can the machine solve minor (but non-trivial) issues listed on GitHub in an efficient way with minimal guidance? The current answer is no.
Also, _if_ anything was to be merged, dotnet is dogfooded extensively at Microsoft, so bugs in it are much more likely to be noticed and fixed before you get a stable release on your plate.
> Not making a buggy commit into software used by nobody and reviewed by an intern.
If it can't even make a decent commit into software nobody uses, how can it ever do it for something even more complex?
And no, you don't need to review it with an intern...
> can the machine solve minor (but non-trivial) issues listed on github in an efficient way with minimal guidance
I'm sorry but the only way this is even a question is if you never used AI in the real world.
Anyone with a modicum of common sense would tell you immediately: it cannot.
You can't even keep it "sane" in a small conversation, let alone using tons of context to accomplish non-trivial tasks.
This is Stephen Toub, who is the lead of many important .NET projects. I don't think he is worried about losing job anytime soon.
I think we should not read too much into it. He is honestly exploring how much this tool can help him resolve trivial issues. Maybe he was asked to do so by one of his bosses, but he is unlikely to fear the tool replacing him in the near future.
They don’t have any problem firing experienced devs for no reason. Including on the .NET team (most of the .NET Android dev team was laid off recently).
I can definitely believe that companies will start (or have already started) using "Enthusiasm about AI" as justification for a hire/promote/reprimand/fire decision. Adherence to the Church Of AI has become this weird purity test throughout the software industry!
I love the fact that they seem to be asking it to do simple things because "AI can do the simple boring things for us so we can focus on the important problems", and then it floods them with so much meaningless mumbo jumbo that they could probably have done the simple thing in a fraction of the time it takes to keep correcting it continuously.
It is called experimentation. That is how people evaluate new technology. By trying to do small things with it first. And if it doesn't work well - retrying later, once bigger issues are fixed.
Hot take: CPython is not an important project for Microsoft, and it is not led by them. The Faster CPython project had questionable achievements on top of that.
Half of Microsoft (especially server-side) still runs on dotnet. And there are no real contributors outside of Microsoft. So it is a vital project.
They also laid off one of the veteran TypeScript developers. TypeScript is definitely an important project for Microsoft, and a lot of code there is written in it.
Anyone not showing open AI enthusiasm at that level will absolutely be fired. Anyone speaking for MS will have to be openly enthusiastic or silent on the topic by now.
> A 15,000-line proof is going to have a mistake somewhere.
If this proof is formal, then it is not going to. That is why writing formal proofs is such a PITA: you actually have to specify every little detail, or it doesn't work at all.
> Verifying that those 15,000 lines do what they do doesn't give me much more confidence than thorough unit testing would.
It actually does. Programs written in statically typed languages (with reasonably strong type systems) empirically have fewer errors than the ones written in dynamically typed languages. Formal verification as done by F* is like static typing on (a lot of) steroids. And HACL has unit tests, among other things.
I agree that the absence of tests isn't great, and it is very common with many C-based projects. But the rest of your comment reads like "ooh, it's C, disgusting!". I hope I'm wrong.
Thank you. These two are well-known, as are plenty of others. But I wanted to see an answer from the author of the comment to which I replied. Apart from tests (of which both sqlite and curl have plenty, and that is obviously good), I don't see any reasonable difference in sqlite or curl code in the aspects mentioned in their comment (namely, style and ownership). I'd like to see what they think reasonable C code looks like.
It is also a statically linked Linux distribution. But its core idea is reproducible nix-style builds (including installing as many different versions/build configurations of any package as you like), with less PL fluff: no fancy functional language, just some ugly jinja2/shell-style build descriptions, which in practice work amazingly well, because the underlying package/dependency model is very solid - https://stal-ix.github.io/IX.html
It is very opinionated (just see this - https://stal-ix.github.io/STALIX.html), and a bit rough, but I was able to run it in VMs successfully. It would be amazing if it stabilizes one day.
Did you notice that you just divided kids in Loudoun and Baltimore into two groups, giving them as examples of different environments? You do not object to the premise, only to the granularity of defining environment geographically.
> You do not object to the premise, only to the granularity of defining environment geographically.
Correct. I just picked those two because of the stark differences between two well-known areas close to each other. But it can go down to even the neighborhood, or even a street in said neighborhood.
Sorry if my rambling seems confusing. I'm not against the idea that environment affects children. I'm against broad brush stroke categorization about how different countries behave.
Or even one individual on different days. It should be all chaos and noise, and yet it's not, because these "general" numbers get translated into a realistic "it's more/less likely", not "it's guaranteed".
You're arguing against comparisons you don't like, or feel make you look worse than others. In other words you want to get to arbitrarily define the brush width presumably based on where you feel you sit in the comparison.
> I'm against broad brush stroke categorization about how different countries behave.
Ok - pick any conservative country (say India or Indonesia). Now tell me that the chances of an average Indonesian woman wearing a bikini to a beach (pretty normal in most Western countries) are the same as for an average French woman?
Or, for a less gender-charged example, the chances of an average Saudi eating pork vs an average American.
>Ok - pick any conservative country (say India or Indonesia). Now tell me that the chances of an average Indonesian woman wearing a bikini to a beach (pretty normal in most Western countries) are the same as for an average French woman?
The strongest predictor for both the French and the Indonesian is almost certainly going to be the individual's physique, and the second is probably going to be the country and prevailing culture in which the beach is located (i.e. what everyone else is wearing).
This kind of illustrates the point you're trying to disagree with. You can't just look at some sort of demographic based average and shoot from the hip and expect to hit anything.
> The strongest predictor for both the French and the Indonesian is almost certainly going to be the individual's physique
I take it that you have either never been to a beach or the one you have been to is only open to athletes and supermodels.
> the second is probably going to be the country and prevailing culture in which the beach is located (i.e. what everyone else is wearing)
So you haven't had the chance of seeing Indonesian women wearing full headgear and clothes covering their bodies having fun at a beach far away from Indonesia? Not joking, they were having a genuinely good time - from direct experience.
The world is much bigger and has far greater variety of people, customs and norms than you can imagine.
>I take it that you have either never been to a beach or the one you have been to is only open to athletes and supermodels.
Have you been to the beach in the last 10 years? All manner of one-piece swimsuits are arguably the default style for women.
>So you haven't had the chance of seeing Indonesian woman wearing full headgear and clothes covering their body having fun at a beach far away from Indonesia? Not joking, they were having a genuinely good time - from direct experience.
My mistake, I mixed up Indonesia and the Philippines in my mind. No surprise Muslim women will not be wearing bikinis. But Westerners will also be far more modest in a setting where that is the prevailing default, so....
>The world is much bigger and has far greater variety of people, customs and norms than you can imagine.
If looking down one's nose like that is what it takes to be cultured, I'm glad I'm not.
This is so wrong that I don't even know where to start countering it. The average Indian woman will not ever wear a bikini at all; most wouldn't even wear one in a women-only swimming pool, let alone at a mixed beach.