Great question - there's no Debezium under the hood. Artie has its own Reader and Transfer layers, built from scratch.
TOAST columns: Artie has automatic detection built in. If a TOAST column hasn't changed, its value won't appear in the WAL. Artie detects this and skips the update for that column in the destination. This works without needing to set REPLICA IDENTITY FULL on your tables.
Schema drift: Artie never requires a schema registry. For relational sources like Postgres, Artie reads the source schema directly and syncs new columns immediately. For DDL changes, Artie uses lazy schema evaluation. On the next DML event for the table, it compares source vs. destination schema and applies any outstanding changes before writing the row.
You and er, Madison, right? This article is riddled with generalisations, oversimplifications and many comments that are vague to the point of being almost misleading - or certainly of limited value given the lack of context. This is the level of technical article I increasingly encounter out there these days... In a way I suppose it's good in that we all have to fall back on reputable sources again rather than be tempted to hit a search engine where we will this sort of traffic-generating, borderline chatgpt, cut'n'paste salad, complete with occasional chirpy "WAL is a gamechanger!" sorts of comments. And a 'deep-dive'? Sheesh. Anyone looking for a reasonably deep dive - and the first encouraging sign you might notice is that it's >1500 words - could do a lot worse than go here...
For Postgres, we have our own custom snapshotter that is capable of doing parallel snapshots against your read replica and not incur WAL growth. More details here: https://news.ycombinator.com/item?id=36855338
For MySQL and MongoDB, we rely on Debezium to perform the initial snapshots.
1/ Column doesn't exist in the destination, we'll create it based on our typing inference from the data type (important: not the data value).
2/ Certain tools will handle automatic column data type conversion if a change like this was detected at the source. We do not do this. We will simply hard fail and cause head-of-line blocking reasons being: this is anti-pattern and should be rare, in which case - it's okay to cause an err and require manual intervention for this breaking change.
3/ If the column has been dropped from the source, you as the end user can decide whether this column should be also dropped in the destination, or not. The default is not to drop it.
^ We hear more customers explicitly don't want columns to be dropped because it could cause downstream errors, such as other views / tables not compiling due to referencing a non-existent column.
We haven't heard much from folks that don't even want columns to be added. If there is a need, we can definitely add that as a config option to provide maximum configurability.
> Finally, the biggest issue with CDC always ends up being the seed loads, recoveries and the incremental snapshot strategies.
Yep totally. On the recovery bit, this is exactly why we are leveraging Kafka. If there are any particular issues, we simply don't commit the offset and cause head-of-line blocking.
On the incremental snapshot and recoveries bit, we primarily leverage Debezium's DDD-3 high watermark strategy [1] for MySQL and MongoDB. Postgres has a different issue in that replication slots can grow really fast, esp on AWS! [2]. We ended up writing our own custom snapshotter for Postgres that is Debezium compatible to onboard customers that have a massive dataset and cannot afford to have a read lock on their WAL.
The custom snapshotter sounds interesting, potentially a good selling point. On recovery end, in my designs I have also found it useful to have synthetic events so in breakage of CDC I can stitch logs together and not just start from scratch and lose history. I can see you are in the depths of it, more than I’ve been for a while. Wish you the best.
> Postgres has a different issue in that replication slots can grow really fast, esp on AWS! [2]. We ended up writing our own custom snapshotter for Postgres that is Debezium compatible to onboard customers that have a massive dataset and cannot afford to have a read lock on their WAL.
Does Debezium's DDD-3 watermark (DBLog) implementation for Postgres not process the WAL quickly enough? We don't use it ourselves either, but architecturally it appears it would reasonably bound how long the WAL can remain un-read?
Agreed that many production DBs people care about have pretty severe limitations here! Managed Supabase is another good example.
On a single unbounded (CPU + mem) Debezium running on a VM extracting Postgres, I was able to clock in about 7-10m/hr. You could increase the # of tasks, but then it'll hinder your DB perf. Also, this is on your primary DB.
We found it far more efficient and less risky to do CDC streaming and snapshotting w/o read lock in parallel to two different topics. Once snapshot is done and drained, we then move to drain the CDC topic.
> Something McKinley doesn't address is that it's quite advantageous if the values in your data warehouse don't change intra-day because this lets business users reach consensus. Whereas if Bob runs a report and gets $X, and Alice runs the same report 5 minutes later and gets $Y, that creates confusion (much more than you would expect). I recall a particular system I built that refreshed every 6 hours (limited by upstream), that eventually Marketing asked me to dial back to every 24 hours because they couldn't stand things changing in the middle of the day.
If they want to see a consistent view of the report, you could bound this.
1/ SELECT * FROM FOO WHERE DATE_TRUNC('day', updated_at) < DATE_TRUNC('day', DATEADD(day, -1, CURRENT_DATE()));
If your dataset doesn't contain kv, you can turn on include `artie_updated_at` which will provide an additional column with the updated_at field to support incremental ingestion.
2/ If you had stateful data, you could also explore creating a Snowflake task and leveraging the time travel f(x) to create a "snapshot" if your workload depended on it.
3/ Also, if you _did_ want this to be more lagged, you can actually increase the flushIntervalSeconds [1] to 6h, 24h, whichever time interval you fancy. You as the customer should have maximum flexibility when it comes to when to flush to DWH.
4/ You can also choose to refresh the analytical report on Looker / Mode to be daily. [2]
> Now of course I see you're targeting more real-time use cases like fraud detection. That's great! But why you would run a fraud detection process out of your data warehouse, which likely doesn't even have a production-grade uptime SLA? Run it out of your production database, that's what it's for!
You can certainly do this in production db (that was our original hypothesis as well!), however, after talking to more companies...it has become more obvious to us that folks that are running fraud algos actually want to join this across various data sets. Further, by using a DWH - it provides a nice visualization layer on top.
Of course, you could go with something even more bespoke by utilizing real-time DBs such as Materialize / Rockset / RisingWave. Just comes with trade offs such as increase in architectural complexity.
There are also plenty of additional use cases this can unlock given that DWH is a platform, any post-DWH application can benefit from less lag, such as reverse ETLs.
I think you missed the parent's point - your USP is real-time replication. So everything you're proposing makes it not real time. Your USP is now worthless (in that context) and you're competitors are numerous.
Huh? The parent's point was your underlying dataset is always in real-time. There's no issue querying a data warehouse when all you're doing is looking for a simple transactional report.
I agree with the parent point. I also don't think DWH is the primary usecase for your platform.
I have seen architectures where databases are siloed within departments and data has to be replicated across department physical databases in the same network or different, mostly in banks, insurances and old school industries. In this scenario, a daily batch would run that would replicate and populate the tables and kick start business processes. A platform like this would make sense. Another usecase, i can think of is reverse ETL, but there are many tools custom made for that.
As for fraud analysis, there are many vendor tools that does exactly that, asking people to visualize and implement a full blown usecase is hard.
I might be naive I don't see the USP between artie and Airbyte, hevodata, fivetran, stitch etc. and others from a distance.
TOAST columns: Artie has automatic detection built in. If a TOAST column hasn't changed, its value won't appear in the WAL. Artie detects this and skips the update for that column in the destination. This works without needing to set REPLICA IDENTITY FULL on your tables.
Schema drift: Artie never requires a schema registry. For relational sources like Postgres, Artie reads the source schema directly and syncs new columns immediately. For DDL changes, Artie uses lazy schema evaluation. On the next DML event for the table, it compares source vs. destination schema and applies any outstanding changes before writing the row.
Let me know if you have any other questions!
reply