Hacker News | ryanbooz's comments

The team at Tembo keeps working quickly to enable many different use cases with Postgres and extensions. Pretty fun example of how capable Postgres is for various tasks.


Thank you! We built PGMQ because we needed a queue to sit in between our managed service's control-plane and data-plane. We made it an extension so that it became a feature of the database, which meant we could use it with any programming language. IMO, this is a benefit of Postgres extensions that often gets overlooked...any language with a Postgres driver can use Postgres extensions.
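As a rough sketch of what that looks like in practice (function names follow PGMQ's documented SQL API, but exact signatures vary by version, so treat this as illustrative):

```sql
-- Hedged sketch of PGMQ usage from plain SQL.
SELECT pgmq.create('tasks');                         -- create a queue
SELECT pgmq.send('tasks', '{"job": "send_email"}');  -- enqueue a JSON message
SELECT * FROM pgmq.read('tasks', 30, 1);             -- read 1 message with a 30s visibility timeout
SELECT pgmq.archive('tasks', 1);                     -- archive (or delete) by message id after processing
```

Because the whole interface is SQL, any language with a Postgres driver gets a queue client for free.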


(NB - post author)

In a sense, it is. Continuous aggregates only have to materialize the most recent bucket of time, not the entire materialized view as you have to in PostgreSQL. That's honestly hard to demonstrate and quantify in a blog post like this because it's something that you notice over time. If you have to refresh the PG materialized view every hour (to get the recent hour of data) and it takes 2 minutes - a year from now it's probably going to take 3-4 minutes (maybe more)... and a lot of valuable CPU/IO to boot.

With continuous aggregates, TimescaleDB is only materializing the last hour - and updating anything that's changed in previous buckets within the refresh window.
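To make that concrete, here is a minimal continuous aggregate with a refresh policy that only touches the recent window (table and column names are hypothetical):

```sql
-- Hypothetical hypertable: metrics(ts timestamptz, device_id int, value float)
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,
       device_id,
       avg(value) AS avg_value
FROM metrics
GROUP BY bucket, device_id;

-- Refresh only the last few hours on a schedule,
-- rather than rebuilding the entire view
SELECT add_continuous_aggregate_policy('metrics_hourly',
  start_offset      => INTERVAL '3 hours',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
```

Contrast that with `REFRESH MATERIALIZED VIEW`, which re-runs the query over the whole underlying table every time.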


Appreciate the reply, thank you.


(NB - blog author/Timescale employee)

One thing we're improving as we move forward in documentation and other areas is explaining why doing joins (and things like window functions) is difficult in continuous aggregates and not the current focus. Honestly, it's part of the reason most databases haven't tackled this problem before.

Once you add in joins or things that might refer to data outside of the refresh window (LAG values for example), things get really complicated. For instance, if you join to a dimension table and a piece of metadata changes, does that change now need to be updated and reflected back in all of this historical aggregate data that's outside of the automatic refresh policy? Same with a window function - if data within a window hasn't changed but data that *might* be hit because of the window function reference does change, continuous aggregates would have to know about that for each query and track those changes too.

I'm not saying it's impossible or that it won't be solved someday, but the functionality with continuous aggregates that keeps the aggregate data updated automatically (without losing any history) *and* being able to perform fast joins on the finalized data is a very useful step that's not available anywhere else within the Postgres ecosystem.

RE: CAGG on top of a CAGG - you're certainly not the only person to request this[1] and we understand that. Part of this is because of what I discussed above (tracking changes across multiple tables), although having finalized data might make this more possible in the future.

That said (!!!), the cool thing is that we already *have* begun to solve this problem with hyperfunction aggregates and 2-step aggregation, something I showed in the blog post. So, if your dataset can benefit from one of the hyperfunction aggregates that we currently provide, there are lots of cool things you can do with it, including rollups into bigger buckets without creating a second continuous aggregate! If you haven't checked them out, please do! [2][3]

[1]: https://github.com/timescale/timescaledb/issues/1400 [2]: https://www.timescale.com/blog/introducing-hyperfunctions-ne... [3]: https://www.timescale.com/blog/how-postgresql-aggregation-wo...
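As an illustration of the 2-step pattern mentioned above, the percentile hyperfunctions store a partial aggregate in the continuous aggregate and roll it up into bigger buckets at query time (table and column names here are hypothetical):

```sql
-- Step 1: store the partial aggregate per hour in a continuous aggregate
CREATE MATERIALIZED VIEW resp_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,
       percentile_agg(response_ms) AS pct_agg
FROM requests
GROUP BY bucket;

-- Step 2: roll hourly partials up into daily buckets at query time,
-- no second continuous aggregate needed
SELECT time_bucket('1 day', bucket) AS day,
       approx_percentile(0.95, rollup(pct_agg)) AS p95_ms
FROM resp_hourly
GROUP BY day
ORDER BY day;
```

The partial aggregate (`pct_agg`) is what makes the rollup possible without re-scanning the raw data.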


(blog author)

Thanks for the feedback! Out of curiosity, if the data you're trying to analyze doesn't have time as one of the critical components, what kind of data is it?

Always helpful to learn a bit more.


Is time series the right answer for anything with a time dimension, or is it mostly for things where time is THE critical dimension? For example, business intelligence applications care about time, but they also care about a whole bunch of other stuff as well (I think with at least as much importance)--is timeseries the right answer for this use case?


Anytime you’re interested in seeing how things change over time, that’s time series. It’s a very big category of use cases.


Sure, but analytics is sometimes change over time, and other times change over some other dimension. Presumably if time is just one dimension among many, then timeseries is probably not the right fit in general?


As with anything else, you can approach specific problems in many different ways.


timeseries is usually specific to use cases where your data represents some signal over time, like temperature readings, stock prices, etc.

so you need 2 components: a timestamp and a signal reading. In that case all the specific timeseries analytics apply: sliding/tumbling windows, averages per window, smoothing, autocorrelation, and all the other techniques from Digital Signal Processing/timeseries analytics.
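A classic example of the windowed techniques mentioned here is smoothing a signal with a moving average, which in Postgres is just a window function (table and column names are hypothetical):

```sql
-- Hypothetical table: readings(ts timestamptz, temp float).
-- A 5-point centered moving average, the classic smoothing step.
SELECT ts,
       temp,
       avg(temp) OVER (ORDER BY ts
                       ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS temp_smoothed
FROM readings
ORDER BY ts;
```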

Your regular monthly Sales data of ACME Corp by product category and storeId - this is not timeseries, just general BI


(NB - post author)

Great definition! Having worked for years on both energy and IoT applications, the argument here is that your "monthly sales data" is likely being aggregated from your time-series data (sales transactions over time). If you store the transaction data in a database like TimescaleDB, then continuous aggregates provide the straightforward method for keeping that aggregated, monthly sales data up-to-date. :-D!


That's very zen, but ultimately it doesn't answer my question.


Well, I could be more opinionated, but even in very specific situations, reasonable people disagree about the best way to model data, and I don't really know a lot about your specific problem-space or situation.

My personal preference is to think of almost any changing measurement or event stream as a time series. See also the reply to a sibling comment.


time is usually in the table, but not always in an analytics query.

I'm building https://luabase.com/. A good example would be summing transactions by the ethereum contract address.


Totally agree. Time is a primary component, but it might not always be the primary query parameter... at least once the data is aggregated.

In the example you gave, I'd assume that you wouldn't run a query over billions of transactions to do a sum. (obviously indexes would be part of reducing this number at query time). I would think you'd probably want to aggregate the sum per hour/day of all addresses and then decide at query-time if you need to sum all transactions for all time or within a specific range. Whenever you need to constrain the query based on time, you're still using the data like time-series, even if the final result doesn't have a date on it. And whenever you're doing the same aggregate queries over and over, that's where Continuous Aggregates can help!

For example, using the (transaction??) timestamp to efficiently store the data in time-based partitions (TimescaleDB chunks) unlocks all kinds of other functionality. You can create continuous aggregates to keep that historical aggregate data up-to-date (even if you need to eventually drop or archive some of the raw transaction data). With 2.7, you can create indexes on the views in ways you couldn't before which speeds up queries even more. Chunks can be compressed (often 93%+!!) and make historical queries faster while saving you money.

So in that sense, time is the component that helps unlock features - when time is an essential component of the raw data, but the query-time analytics don't have to specifically be about time. PostgreSQL and TimescaleDB work together to efficiently use indexes and features like partition pruning to provide the performance you need.

BTW, I'm not sure if you saw the post and tutorial we just released last week showing how to analyze transactions on the Bitcoin Blockchain or not. [1][2] Similar use-case and not all tied to time-based queries only. There are also other companies currently indexing other blockchains (Solana for instance) that have had really great success with TimescaleDB (and it gets even better with TimescaleDB 2.7!)

Thanks!

[1]: https://www.timescale.com/blog/analyzing-the-bitcoin-blockch...

[2]: https://docs.timescale.com/timescaledb/latest/tutorials/anal...


We see those types of queries commonly in TimescaleDB. And, for example, both compression and "horizontal" scale-out offer ways to optimize for these types of analytical queries.

More concretely, we see a lot of web3/crypto use cases that make a wallet ID, NFT name, or ticker a top-level consideration.

E.g., use your contract address as the segmentby field for compression.
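That configuration is a one-time `ALTER TABLE` (the hypertable name here is hypothetical):

```sql
-- Segment compressed data by contract address so per-address
-- scans over compressed chunks stay fast
ALTER TABLE transactions SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'contract_address',
  timescaledb.compress_orderby   = 'block_ts DESC'
);

-- Compress chunks older than a week automatically
SELECT add_compression_policy('transactions', INTERVAL '7 days');
```

With `segmentby`, rows for a given address are stored together, so queries filtering on that column can skip most of the compressed data.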


Timescale DevRel here!

Great observations around how popular and useful it is for systems to have some level of Postgres wire compatibility! Thanks for doing the writeup!

One thing that's not totally clear in the comparison table is that most of these implementations are just that: databases providing an abstraction through the Postgres wire protocol over a different database architecture, parser, and storage solution.

TimescaleDB, however, is the only one listed that is built directly on Postgres as an extension. There is no abstraction layer over some other implementation of the query parser or storage layer.

When it comes to application maintenance and developer expectations at query time, not all of these are apples-to-apples comparisons, as some of the other comments have noted.


While the table doesn't say that much about intent (other than what you might infer from the SQL Compatibility, ACID and Category columns), I had hoped it was clear from the rest of the article that I was saying there is no relationship between the wire protocol and being PostgreSQL compatible in general.

> It does NOT encompass the actual query language itself, let alone database semantics.

> This doesn't mean that any PostgreSQL or MySQL query will work with them since, as mentioned previously, query language and database semantics are independent of the wire protocol.

> To reiterate, the wire protocol doesn't specify anything about the query language.

> Just because a database implements the PostgreSQL wire protocol does not mean it intends to be a drop-in replacement for PostgreSQL.

> So just looking at parser support is not enough to talk about PostgreSQL query or semantic compatibility. But it is a starting point.

But I guess it wasn't clear.


CrateDB DevRel here :)

> databases providing an abstraction through the Postgres wire protocol

I would not call it an abstraction, if one has a full parser, analyzer, planner and execution engine. It is just a common language ;)


Thanks for your excellent contribution to this discussion. As the post author I wholly agree with your approach: if a solution hits the sweet spot for you in the context of your requirements that's the one you choose. Thank you for considering TimescaleDB alongside ClickHouse in what was obviously a well thought through assessment of these two excellent technologies.


Heh - somehow missed that I had already responded to this one, my apologies. (and no immediate way to edit after the fact).


You are correct. The current multi-node deployment (if you need it for your workload) does have one "managing" access node. All access nodes (ANs) and data nodes (DNs) can be replicated and configured with tooling such as Patroni for HA.

We have a few users that have larger multi-node clusters set up this way (one at 40+ nodes so far) and are happy with the offering overall. Obviously YMMV depending on requirements/IT support.


(post author)

Thanks for the great, thoughtful feedback. We (Timescale) couldn't agree more that there is a lot to love about ClickHouse, especially where it truly excels.

Information like this is helpful for others currently in the "choose the right tool" part of the job and to the developers of the product. I can't imagine how different all of our offerings will look in a few more years! :-)


Looks like you edited this with some more detail, so I'll answer higher.

Compression in TimescaleDB used to mean all compressed data was immutable and the table schema couldn't be altered. Since TimescaleDB 2.1 and 2.3, that has changed in a few ways:

- Schemas can now have columns added or renamed
- Compressed chunks can now have rows inserted into them (the chunk is left partially compressed; the background job will fully compress it when it runs)
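Concretely, both of these now work against a hypertable with compressed chunks (table name hypothetical):

```sql
-- Adding a column is allowed even with compressed chunks present
ALTER TABLE metrics ADD COLUMN region text;

-- Inserting into a time range covered by a compressed chunk now
-- succeeds; the chunk becomes partially compressed and the
-- background job recompresses it later
INSERT INTO metrics (ts, device_id, value)
VALUES (now() - INTERVAL '30 days', 42, 98.6);
```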

Row deletion is not possible yet, but I've personally been having some internal conversations around ways to do exactly as you're suggesting in the near-term: deleting rows based on a "segmentby" column. I have some testing to do... but my focus has been taken up by a certain 33-minute-long blog post.

Feel free to join our Slack and DM me if you want to talk about it further.

slack.timescale.com


In its current form, ClickHouse is not optimized for typical JOIN-type queries, a point we make in the post. You would have to re-write your statement to get better performance. The other main point is that all data is "immutable", so if your reference data needs to be updated, it would still need to go through some kind of asynchronous transform process to ensure you're getting the correct values at query time.

TimescaleDB is PostgreSQL, so it can easily handle this kind of join aggregate like you would expect. If "m.measurement" was compressed, historical queries with a time predicate would likely be faster than in the uncompressed state.
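For reference, a sketch of the kind of join aggregate being discussed (the `m.measurement` alias mirrors the reference above; the rest of the schema is hypothetical):

```sql
-- Join time-series measurements against a mutable dimension table,
-- the pattern that runs as plain SQL in TimescaleDB
SELECT time_bucket('1 day', m.ts) AS day,
       s.location,
       avg(m.measurement) AS avg_measurement
FROM measurements m
JOIN sensors s ON s.id = m.sensor_id
WHERE m.ts >= now() - INTERVAL '90 days'
GROUP BY day, s.location;
```

If `sensors.location` changes, the next query simply reflects it; no asynchronous transform of the historical data is needed.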

