For many production setups, taking a database snapshot involves transferring significant amounts of data over the network. The standard way to do this efficiently is to process data in batches. Batching reduces per-request overhead and helps maximize throughput, but it also introduces an important tuning problem: choosing the right batch size.
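To make the idea concrete, here is a minimal Go sketch of what batched snapshot reads could look like, using keyset pagination with pgx. The `events` table, its columns, and the connection string are placeholders for illustration; this is not pgstream's actual snapshot code.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

// readInBatches reads a table in fixed-size batches using keyset pagination,
// so each round trip transfers at most batchSize rows over the network.
func readInBatches(ctx context.Context, conn *pgx.Conn, batchSize int) error {
	lastID := 0
	for {
		rows, err := conn.Query(ctx,
			"SELECT id, payload FROM events WHERE id > $1 ORDER BY id LIMIT $2",
			lastID, batchSize)
		if err != nil {
			return err
		}

		count := 0
		for rows.Next() {
			var id int
			var payload string
			if err := rows.Scan(&id, &payload); err != nil {
				rows.Close()
				return err
			}
			lastID = id
			count++
			_ = payload // hand the row off to the snapshot consumer here
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			return err
		}
		if count < batchSize {
			return nil // final (partial) batch reached
		}
	}
}

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://localhost:5432/sourcedb")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	if err := readInBatches(ctx, conn, 1000); err != nil {
		log.Fatal(err)
	}
	fmt.Println("snapshot read complete")
}
```

With this shape, the batch size caps how much data each round trip moves, which is exactly the knob whose ideal value depends on the network between the two databases.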
A batch size that works well in a low-latency environment can become a bottleneck when snapshots run across regions or under less predictable network conditions. A static batch size configuration assumes a stable network, which rarely reflects reality.
In this blog post, we describe how we used automatic batch size tuning to optimize data throughput for Postgres snapshots in our open source tool pgstream: the constraints we worked under, and how we validated that the approach actually improves performance in production-like environments.
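Before diving into the details, the sketch below shows one possible shape for such a tuner: grow the batch size while observed throughput keeps improving, and back off when a batch slows down. The type, the doubling and halving factors, and the bounds are illustrative assumptions for this introduction, not pgstream's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// batchTuner adjusts the batch size based on the throughput (rows per second)
// observed for the previous batch. The doubling/halving factors and bounds are
// illustrative assumptions, not pgstream's actual parameters.
type batchTuner struct {
	size           int
	minSize        int
	maxSize        int
	lastThroughput float64
}

// next records how the last batch performed and returns the size to use for
// the next one: grow while throughput keeps improving, back off when it drops.
func (t *batchTuner) next(rowsSent int, elapsed time.Duration) int {
	throughput := float64(rowsSent) / elapsed.Seconds()
	if throughput >= t.lastThroughput {
		t.size = min(t.size*2, t.maxSize) // network kept up: try a bigger batch
	} else {
		t.size = max(t.size/2, t.minSize) // throughput dropped: back off
	}
	t.lastThroughput = throughput
	return t.size
}

func main() {
	tuner := &batchTuner{size: 1000, minSize: 100, maxSize: 100000}

	// Simulated feedback: in a real snapshot, rowsSent and elapsed would come
	// from timing the actual transfer of each batch.
	fmt.Println(tuner.next(1000, 200*time.Millisecond)) // fast batch -> grows to 2000
	fmt.Println(tuner.next(2000, 600*time.Millisecond)) // slower -> back to 1000
}
```

A feedback loop of this kind converges toward the largest batch the current network conditions can sustain, rather than relying on a fixed value picked for one environment.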