That's all great, but sadly impractical.
I looked at one of the first statements:
> GenDB is an LLM-powered agentic system that decomposes the complex end-to-end query processing and optimization task into a sequence of smaller and well-defined steps, where each step is handled by a dedicated LLM agent.
And knowing typical LLM latency, it's outside the realm of OLTP and probably even OLAP. You can't wait tens of seconds to minutes until an LLM generates some optimal code for you that you then compile and execute.
Considering it's just a single PhD student doing this work, I don't believe such a task can be realistically accomplished, even as a PoC / research project.
Why not? Even without LLMs it is technically feasible to build a custom database engine that performs much better than general-purpose database kernels. And we see this happening all the time, with time series, BLOBs, documents, OLTP, OLAP, logging, etc.
The catch, obviously, is that the development is way too expensive and takes a lot of technical capability that isn't really all that common. The novelty this paper presents is that these two barriers might be coming to an end - we can use LLMs and agents to build custom database engines for ourselves™ and our™ specific workloads, very quickly and for a tiny fraction of the development price.
If you look into the results, you will see that they are able to execute 5 TPC-H queries in ~200ms (total). The dataset is rather small (10GB), but nonetheless you wouldn't be able to run 5 queries in such a small amount of time if you had to analyze the workload, generate the code, build indices, start the agents/engine, and retrieve the results. I didn't read the whole paper, but this is why I think your understanding is wrong.
If they count only query execution time, not everything else, it would make sense though. It could also be practical if your system runs just a few predefined and heavily optimized queries.
The idea with parallel compilation is interesting. Worth considering, in some cases. The only problem with it is the same as with too much parallelization in general - you can exhaust your CPU resources much faster. But with some sort of smart scheduling it should work. I'll think about it, thanks!
Interesting... AsmJit is pretty fast for compilation, but about 3x slower than sljit. The only way I can see to make it fast enough, in theory (i.e. without slowing down point-lookup queries and such), would be to fuse planning with code generation - essentially a single-pass plan builder + compiler. Not sure if Umbra tries to do that, and AsmJit is not the best choice for it anyway, but with sljit it could be on par with an interpreter even for the fastest queries, I believe. Pretty hard (likely impossible) to implement though; planning is inherently a non-linear process...
Because pg_jitter uses AsmJit's Compiler, which also allocates registers. That's much more work than using hardcoded physical registers, as in the SLJIT case. There is always a cost to such comfort.
I think AsmJit's strength is the completeness of its backends, as you can emit nice SIMD code with it (like AVX-512). But the performance could of course be better - making it 2x faster should be possible.
There are other issues with that auto-allocation. I tested all 3 backends on very large queries (hundreds of KBs of generated code per query). The performance of all of them (including LLVM, but excluding sljit) was abysmal - the compiler overhead was in seconds to tens(!) of seconds. They have some non-linear components in their optimization algorithms. sljit, meanwhile, scaled linearly and was almost as fast as for smaller queries. So yes, auto-allocation gives higher run-time performance, but the cost of that performance grows non-linearly with code size and complexity, while you can still get good performance with manual allocation. I also don't believe you can make AsmJit 2x faster without sacrificing that auto-allocation algorithm.
AsmJit has only one place where a lot of time is spent - bin-packing. It's the least optimized part, with quadratic complexity (at the moment), which starts to show when you have something like hundreds of thousands of virtual registers. There is even a benchmark in AsmJit called `asmjit_bench_regalloc`, which shows that a single function with 16MB of machine code, 65k labels, and 200k virtual registers takes 2.2 seconds to generate (and 40ms of that is just the time to call `emit()`).
If this part is optimized, or switched to some other implementation when there are tens of thousands of virtual registers, you would get orders of magnitude faster compilation.
But realistically, which query requires tens of megabytes of machine code? These are pathological cases. For example, we are talking about 25ms for a single function with 1MB of machine code, and sub-ms times when you generate tens of KB of machine code.
So from my perspective the ability to generate SIMD code that the CPU executes fast in inner loops is much more valuable than anything else. Any CPU-bound workload just deserves this. The question is how CPU-bound the workload is. I would imagine databases like Postgres would be more memory-bound if you are processing huge rows and accessing only a very tiny part of each row - that's why columnar databases are so popular, though of course they have different problems.
I worked on one project that tried to deal with this by hashing each column into one of 16 buckets, to keep related columns closer to each other, so the query engine only needs to load the buckets actually used by the query. But we are talking about gigabytes of raw throughput per core in this case.
I have a test with a 200KB query that takes AsmJit 7 seconds to compile (and that's not too bad: both LLVM and MIR take ~20s), while sljit does it in 50ms. 200KB is a pathological case, but it's not unheard of in the area I'm working in. It's realistic, although rare.
Over the last 10-15 years most OLTP workloads have become CPU-bound, because the active datasets of most real databases fully fit in memory. There are exceptions, of course.
That's interesting - 200KB should not be a big deal for it - maybe it uses something that I usually don't, like many function calls, an insane number of branches, etc... I would be interested in that case, but I'm not sure whether I would be able to blindly improve AsmJit without a comprehensive test.
Definitely good to know though. When it comes to low-latency compilation my personal goal is to make it even faster when generating small functions.
SLJIT is a bit smarter than just using hardcoded registers. It's multi-platform anyway, so it uses registers when they are available on the target platform, and falls back to memory when they aren't. That's why performance can differ between Windows and Linux on x64, for example - a different number of available registers.
Indeed, but this also means that you would get drastically different performance on platforms that have more physical registers vs platforms that have fewer. For example, x86_64 only has 16 GP registers, while AArch64 has 32 - if you use 25 registers without any analysis and just spill 10 of them to the stack, the difference could be huge.
But... I consider SLJIT to be for a different use-case than AsmJit. It's more portable, but its scope is much more limited.
It's definitely different, and for Postgres specifically, they may complement each other. SLJIT can be used for low-latency queries where codegen time is more important than optimizations, and also for other platforms like s390x / PPC / SPARC, etc. AsmJit can be used for SIMD optimizations on x86_64 and ARM64. MIR is kind of in the middle - it does auto-allocation of registers and doesn't support SIMD, but it's also multi-platform. The only thing that doesn't fit well here is LLVM :). It has some advantages in some edge cases, but... it really needs a separate provider; the current one is bad. I'll probably create another LLVM backend for pg_jitter in the future to utilize it properly...
It's not useful for sub-millisecond queries like point lookups, or other simple ones that process only a few records. The sljit option starts to pay off when you process (not necessarily return) hundreds of records. The more, the better. I'm still thinking about a caching option that would allow lifting this requirement somewhat - for cached plans. For non-cached ones it will stay.
The emphasis on compilation time there is because the JIT provider that comes with Postgres (LLVM-based) is broken in that particular area. But you're right, JITed code can be cached if some conditions are met (it has to be position-independent, for one). Not all JIT providers do that, but many do. Caching is on the table, but if your JIT compilation takes microseconds, caching could be rather a burden in many cases. Still useful for some cases.
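To see what "broken in that area" means in practice, you can force the stock LLVM provider on and look at the JIT timings reported by EXPLAIN. A minimal sketch (the table name is made up, pick any of yours):

```sql
-- Force the built-in LLVM JIT provider to kick in even for cheap queries.
SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;
-- The "JIT" block at the bottom of the output (Generation / Inlining /
-- Optimization / Emission timings) is the compilation overhead in question.
EXPLAIN (ANALYZE) SELECT sum(amount) FROM payments;
```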
Most databases in practice are sub-terabyte, and even sub-100GB, and their active dataset is almost fully cached. For most databases I've worked with, the cache hit rate is above 95%, and for almost all of them it's above 90%. In that situation, most queries are CPU-bound. It's completely different from typical OLAP in this sense.
Postgres caches query plans too. The problem is you can only cache what you can share, and if your planner works well, you can share very little: there can be a lot of unique plans even for the same query.
No, it cannot cache query plans between processes (connections), and the only way it can cache within the same process/connection is by the client manually preparing the statement. That's how the big boys did it 30 years ago, not anymore.
It was common guidance back in the day to use stored procedures for all application access code because they were cached in MSSQL (which PG doesn't even do). Then around 2000 it started caching based on statement text, and that became much less important.
You would only use prepared statements if doing a bunch of inserts in a loop or something, and they have a very small benefit nowadays, mostly because you're not sending the same text over the network over and over and hashing it to look up the plan.
I didn't say it can cache between processes. The problem is not caching between processes; it's that caching itself is not very useful, because the planner creates different plans for different input parameters of the same query in the general case. So you can reliably cache plans only for the same sets of parameters. Or you can cache generic plans, which Postgres already does as well (and sharing that cache wouldn't solve much of the problem either).
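A trivial illustration (table and values are made up): with a skewed column, the best plan flips depending on the parameter, so a single cached plan can't serve both.

```sql
-- Hypothetical skewed table: 'shipped' matches almost every row, 'lost' a handful.
CREATE TABLE orders (id bigint, status text);
CREATE INDEX ON orders (status);
-- ...load data, then ANALYZE so the planner sees the skew...
-- The planner picks a seq scan for the common value:
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
-- ...and an index scan for the rare one. A cached generic plan would have to
-- commit to one of these without knowing the actual value.
EXPLAIN SELECT * FROM orders WHERE status = 'lost';
```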
Other databases cache plans and have for years because it's very useful: many (most?) apps run the same statements over and over with differing parameters, and it's a big win. They do this without the client having to figure out the statement-matching logic that the various PG ORMs and connection poolers try to implement.
They also do things like auto-parameterization if the statement doesn't have parameters, and parameter sniffing to make multiple different plans based on different values where it makes sense.
PG is extremely primitive compared to these other systems in this area, and it has to be, since it doesn't cache anything unless specifically instructed to, and then only for a single connection.
You make some unsubstantiated claims here. I assure you that it isn't as simple as you claim. And what Postgres does here is (mostly) the right thing; you can't do much better. You simply can't decide what plan you need to use based on the query and its parameters alone, unless you already cached that plan for those parameters (and even in that case you need to watch out for possible dramatic changes in statistics). Prepared statements != cached execution plans.
Ah yes, so Microsoft and Oracle do these things for no good reason. You are the one making unsubstantiated claims, such as "you can't do much better" and "You simply can't decide what plan you need to use based on the query and its parameters alone" - which is mostly what those systems do (along with statistics). If you bothered to read what I linked, you could see exactly how they are doing it.
I never said it was simple; in fact, I said how primitive PG is compared to the "big boys", because they put huge effort into making their systems fast back in the TPS wars of the early 2000s on much slower hardware.
There are reasons for that; it's useful in a very narrow set of situations.
Postgres cached plans exist for the same reason.
If you're claiming Oracle and MSSQL do _much_ better in this area - that's what I call unsubstantiated. From what you write further, it's pretty clear you don't have a lot of understanding of what happens under the hood. And no, prepared statements are not what you read about in Wikipedia. Not in all databases anyway. Go read it somewhere else.
PostgreSQL doesn't cache plans unless the client explicitly sends commands to do so. Applications cannot take advantage of this unless they keep connections open and reuse them in a pool, and they must manage this themselves. The plan has to be rebuilt for every separate connection/process rather than served from a single cached plan, increasing server memory costs, which are plan cache size x number of connections.
It has no "reason" to cache plans; the client must do this for its own "reasons".
>If you're claiming Oracle and MSSQL do _much_ better in this area - that's what I call unsubstantiated.
You are making all sorts of claims with nary a link to back them up. Are you suggesting PG does better than MSSQL, Oracle, and DB2 in planning while being constrained to replan every single statement? The PG planner is specifically kept simple so that it is fast at its job rather than thorough, or it would adversely affect execution time more than it already does; this is well documented and always a concern when new features are proposed for it.
>From what you write further it's pretty clear you don't have a lot of understanding what happens under the hood.
Sticks and stones. Is that all you have? How about something substantial.
> And no, prepared statements are not what you read in Wikipedia. Not in all databases anyway.
Ok, Mr. Unsubstantiated, are we talking about PG or not? What does one use prepared statements for in PG, hmmm - you know, the thing you call the PG plan cache? How about something besides your claim that prepared statements are not in fact plan caches? Are you talking about completely different DB systems? How about you substantiate that?
Read carefully about "plan_cache_mode" and how it works (and its default settings).
Sorry, that's my last message in this thread, and I'm still here just for educational purposes, because what you're talking about is in fact a common misconception.
If you read it carefully, you'll see that generic plans do not require any "explicit commands": Postgres executes a query 5 times with custom plans, then tries a generic one, and if it works (is not much worse than the average of the 5 custom plans), the generic plan is cached. You can turn this off, though, and I'd recommend turning it off in most cases, because it's a pretty bad heuristic. Nevertheless, for some (pretty narrow set of) cases it's useful.
So, Mr Big Boy, now we can get to what a prepared statement in Postgres is. Prepared statements are cached in a session, but if a statement runs in custom mode, the cached entry won't contain a plan. When Postgres executes a prepared statement in custom mode, it just skips parsing - that's it. The query is still planned, because custom plans rely on the input parameters. If we run it in generic mode, then the plan is cached.
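To make this concrete, here's a minimal sketch you can run in psql (table and column names are made up; the default plan_cache_mode = 'auto' is assumed):

```sql
PREPARE q(int) AS SELECT count(*) FROM orders WHERE customer_id = $1;
-- The first five executions use custom plans: the statement is replanned
-- each time with the actual parameter value (only parsing is skipped).
EXECUTE q(1); EXECUTE q(2); EXECUTE q(3); EXECUTE q(4); EXECUTE q(5);
-- From the sixth execution on, Postgres may switch to the cached generic plan
-- (if its estimated cost isn't worse than the average of the custom plans).
-- If it did switch, the plan below shows "$1" instead of the literal 6.
EXPLAIN EXECUTE q(6);
```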
I think you should read carefully: this only applies to prepared statements within the same session, which is exactly what I have been saying. There is no global cache, and if you reset the session, it's gone.
This setting controls whether prepared statements even use a cached plan at all. Other databases can do this with hints, and they can skip parsing by using stored procedures, which are basically globally named prepared statements that the client can call without preparing a temporary one. They can also use prepared statements, but again this is typically a waste of time, because parsing just enough to match an existing plan is fast (soft vs. hard parse in Oracle speak). They have many more options, with more powerful caching abilities that all clients can share across sessions.
The only time PG "automatically" caches a plan is when it implicitly prepares the statement within PL/pgSQL, like doing an insert loop inside a function, and it's still only for the current session. This is just part of the planning process in other databases, which cache everything all the time, globally.
You don't seem to understand that most other commercial "big-boy" RDBMSs cache plans across sessions, that nothing has to be done for plans to be reused between completely different connections with differing parameters, and that they can still have specialized versions based on those parameter values instead of a single generic plan.
At least now you admit prepared statements are in fact a plan cache, contradicting your other statements, and you seem to make a gotcha out of an option to disable that cache.
You can see various discussions on pgsql-hackers; here is one where the submitter confirms everything I have said, attempted to add the automatic part (but not the much harder sharing-between-sessions part), and was shot down. I don't believe much has changed in PG around plan caching since this post, and it even has a guy who worked on DB2 talking about how they did it: https://www.postgresql.org/message-id/flat/8e76d8fc-8b8c-14b...
Sure, but that's not the main issue. If you add a global cache, it will have only a marginal value. There are Postgres extensions / forks with a global cache, and they are not wildly more efficient. The main issue you still do not understand is that for different parameters you _need_ different plans, and caching doesn't help with that. It can help with parsing, sure. Parsing is very fast though, relative to planning. And you keep conflating "prepared" statements with plan caching. Ok.
>If you add a global cache, it will have only a marginal value
Please substantiate this. Again, all other major commercial RDBMSs do this and have invested a lot of effort and money into these systems; they would not do something that has marginal value.
Again, I went through the era of needing to manually prepare queries in client code, when it was the only choice, as it is now in PG. It was not a marginal improvement when automatic global caching became available; it was objectively measurable via industry-standard benchmarks.
You can also find other posts complaining about prepared-statement cache memory usage, especially when libraries and poolers auto-prepare: the cache is repeated for every connection, so 100 connections equals 100x the cache size. Another advantage of a shared cache; this is obvious.
I will leave you with a quote from Bruce Momjian, you know, one of the founding members of the PG dev team, in the thread I linked that you didn't seem to read, just like the other links I gave you:
"I think everyone agrees on the Desirability of the feature, but the
Design is the tricky part."
>The main issue you still do not understand is for different parameters you _need_ different plans, and caching doesn't help with that.
You still don't seem to be grasping what other, more advanced systems do here, and again don't seem to be reading any of the existing literature I am giving you. These systems will make different plans if they detect it's necessary; they have MULTIPLE cached plans for the same statement, and you can examine their caches and see stats on their usage.
These systems also have hints that let you disable caching, force a single generic plan, or tell it how to evaluate specific parameters (for unknown values, specific hard-coded values, etc.) if you want to override their default behavior, which uses statistics and heuristics to determine which plan to use.
>And you keep conflating "prepared" statements with plan caching.
Again, we are talking about PG, and the only way PG caches a plan is using prepared statements; in PG, prepared statements and plan caching are the same thing. There is no other choice.
From your own link trying to gotcha me on PG plan caching config, first sentence of plan_cache_mode: "Prepared statements (either explicitly prepared or implicitly generated, for example by PL/pgSQL) can be executed using custom or generic plans."
The only other things a prepared statement does are skip parsing, which is another part of caching, and reduce network traffic from client to server. These things can be done with stored procedures in systems that have global caches shared across all connections, and those systems still support the now rare situation of using a prepared statement - it's almost vestigial nowadays.
Here is Microsoft's guidance on prepared statements in MSSQL nowadays:
"In SQL Server, the prepare/execute model has no significant performance advantage over direct execution, because of the way SQL Server reuses execution plans. SQL Server has efficient algorithms for matching current Transact-SQL statements with execution plans that are generated for prior executions of the same Transact-SQL statement. If an application executes a Transact-SQL statement with parameter markers multiple times, SQL Server will reuse the execution plan from the first execution for the second and subsequent executions (unless the plan ages from the plan cache)."
If you think I'm trying to "gotcha" you, you're mistaken. I'm past the point where I would care about that. It was simply an (apparently failed) education opportunity. Be well.
>So, Mr Big Boy, now we can get to what a prepared statement in Postgres is.
Yeah, not a gotcha at all, Mr. Teacher. I think you should stop posting low-effort responses and examine your own missed opportunities for education here.
Let's get this straight: prepared statements should not be conflated with caching, yet the only way to cache a plan and avoid a full parse is to use a prepared statement, and that is by far the biggest reason to use one and why many poolers and libraries try to prepare statements.
Do you realize how ridiculous this is? Here are PG's own docs on the purpose of preparing:
"Prepared statements potentially have the largest performance advantage when a single session is being used to execute a large number of similar statements. The performance difference will be particularly significant if the statements are complex to plan or rewrite"
"Although the main point of a prepared statement is to avoid repeated parse analysis and planning of the statement, PostgreSQL will force re-analysis and re-planning of the statement before using it whenever database objects used in the statement have undergone definitional (DDL) changes or their planner statistics have been updated since the previous use of the prepared statement."
The MAIN POINT of preparing is what I am conflating with it, yes...
If PG cached plans automatically and globally, then settings like constraint_exclusion and enable_partition_pruning would not need to exist, or would at least be on by default, because the added overhead of those optimizations during planning would be meaningless.
Seriously, this whole thread is Brandolini's law in action. You obviously can't articulate how PG is better off without a global plan cache, and you act like I don't know how PG works? Get real, buddy.
Are you going to post another couple sentences with no content or are you done here?
You can't get a plan cache without a prepared statement, but you can get a prepared statement without a plan cache. It's not the same thing, and in most cases in Postgres prepared statements _do_not_ give you plan caching, because they are executed with custom plans. "Custom plan" is a misnomer: having a "custom plan" means the query is replanned on each execution. It's a common misconception - even a sizeable portion of the articles you can find on the internet miss this. But if you have good reading comprehension, you can read, and possibly understand, this:
> A prepared statement can be executed with either a generic plan or a custom plan. A generic plan is the same across all executions, while a custom plan is generated for a specific execution using the parameter values given in that call.
>You can't get a plan cache without a prepared statement, but you can get a prepared statement without a plan cache.
What is the purpose of a prepared statement without a plan cache? I thought parsing was a non-issue? All that's left is a little savings in network traffic.
I will, for a second time, quote the PG documentation (which you linked, btw) on what the MAIN POINT of a prepared statement is according to the maintainers; I am not sure why I have to repeat this again:
"Although the main point of a prepared statement is to avoid repeated parse analysis and planning of the statement, PostgreSQL will force re-analysis and re-planning of the statement before using it whenever database objects used in the statement have undergone definitional (DDL) changes or their planner statistics have been updated since the previous use of the prepared statement.”
I am not sure what point you are trying to make other than worming your way out of your previous statements. Prepared statements are in fact plan caches, and that is their MAIN purpose according to PG's own documentation. You haven't given any other purpose for their existence; I gave the other two, one of which you dismissed, and the third is not even listed in the PG docs and is also minor.
> It's not the same thing, and in most cases in Postgres prepared statements _do_not_ give you plan caching, because they are created for custom plans.
The default setting is auto, which will cache the plan if the generic plan's cost is similar to a custom one's, based on the 5-run heuristic. This is going to happen most of the time for repeated simple statements, which make up the bulk of application queries, and it is why other databases do this all the time, globally, without anyone calling prepare. It is a large saving; I'm not sure why you think this would not occur regularly, and if you have any data to back this up, I am sure everyone would like to see it - it would upset conventional thinking in other major commercial RDBMSs, with their hard-won gains over many years.
>You're also mixing up parsing and planning for some reason.
No, I am not. You are obviously not comprehending what I said and cannot read the documentation I quoted, which I had to repeat a second time here. I am not sure why you think I am mixing them up; I was only trying to be gracious and include the other benefit of a prepared statement, one of the two that are left if it doesn't cache the plan: it avoids parsing, which yes has a smaller impact, and the third even less.
Also, not everyone shares PG terminology: Oracle refers to what you call parsing as a soft parse (syntax check, semantic check), and to parsing plus planning as a hard parse (rewrite and optimization, row source generation). You obviously have little experience outside of PG and seem to have a myopic view of what is possible in RDBMS systems and how these terms are used.
>Query parsing costs like 1/100 of planning, it's not nothing, but pretty close to it.
Again, what is the point of a prepared statement if skipping parsing is meaningless and planning is not THE MAIN POINT?
>Even though you're just a rude nobody, it still may be useful for others, who may read this stupid conversation…
Further ad hominem, and you call me rude - who are you to say this? How about you step off your high horse and learn something, Mr. Superior Somebody. I was trying to debate in good faith and you insult me with zero substance; yeah, this is a stupid conversation...
> then tries a generic one, if it worked (not much worse than an average of 5 custom plans), the plan is cached
Seems like it's not great at detecting this in all cases[1]. That said, I do note that was reproduced on PG16, perhaps they've made improvements since, given the documentation explicitly mentions what you said.
That's exactly what I said above - just turn this thing off. The reason is that even if your generic plan is better than the 5 custom plans before it, that doesn't guarantee much. With a probability high enough to cause trouble, it's just a coincidence, and generic plans in general tend to be very bad (because they use hardcoded constants instead of statistics for planning).
This behavior is often a source of random latency spikes, when your queries suddenly start misbehaving, and then suddenly stop doing it. If you don't have auto_explain on, it will look like mysterious glitches in production.
The few cases where they are useful are very simple ones, like single-table selects by index. Those are already fast, and with generic plans you can cut planning time completely. Which is kinda... not much. There are more complicated cases where they are useful, involving Postgres forks like AWS Aurora, which has a query plan management subsystem that allows storing plans directly. Then you can cut planning time for those too. But that's a completely different story.
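A minimal sketch of that narrow case (names are made up): a unique-index lookup where the plan can't really vary, so skipping planning is all a generic plan buys you.

```sql
SET plan_cache_mode = 'force_generic_plan';
PREPARE get_user(bigint) AS SELECT * FROM users WHERE id = $1;
-- The plan is an index scan regardless of the parameter value, so caching the
-- generic plan just shaves off the (already small) planning time.
EXPLAIN (ANALYZE) EXECUTE get_user(42);
```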