Hacker News | nicholashandel's comments

What kinds of things are you seeing people build? Curious about the use cases in prod!


Overall, we improve any AI answer. We've been integrated into AI search experiences (the most common/obvious use case) and content generation products (e.g. https://capitol.ai/), but we're excited to see what else people come up with!


That's right! That's a potential option for an integration.

The other options are:

- Transform's JDBC interface can be used to connect to tools that have SQL interfaces like Mode, Hex, Deepnote, etc.: https://docs.transform.co/docs/api/sql/sql-overview

- Materializations can be exported as constructed data marts to tools like Tableau / Looker that take in constructed data sources: https://docs.transform.co/docs/metricflow/reference/material...


Totally agree with this ^


Yes! MetricFlow is core to everything we do so there will be tons of active development from our team.


I think those points are on the mark. I added some notes to the similar question that asked for a comparison of dbt, cube and MetricFlow!

I'll just say joins are hard, but if solved elegantly they allow you to do some really interesting things. As an example, in MetricFlow you can traverse the data graph in your queries. If you `pip install metricflow` and try the tutorial (`mf_tutorial`):

You can see what I mean by running queries that traverse multiple joins to reach other dimensions:

1. Ask for transactions by day by customer `mf query --metrics transactions --dimensions ds,customer --order ds`

2. Ask for transactions by day by customer country `mf query --metrics transactions --dimensions ds,customer__country --order ds`

3. Ask for transactions by day by customer region where we traverse through a country to region mapping `mf query --metrics transactions --dimensions ds,customer__country__region --order ds`
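To make the multi-hop traversal above concrete, here's a toy sketch (not MetricFlow's actual implementation, and with a made-up schema) of how a double-underscore dimension like `customer__country__region` can be resolved into a chain of join steps:

```python
# Toy illustration of resolving a multi-hop dimension such as
# "customer__country__region" into join steps. The join graph and
# column names here are hypothetical, not MetricFlow internals.

# entity -> {path segment: table it joins to, or None for a plain column}
JOIN_GRAPH = {
    "transactions": {"customer": "customers"},
    "customers": {"country": "countries"},
    "countries": {"region": None},  # terminal column, no further join
}

def resolve_dimension(base_table: str, dimension: str) -> list[str]:
    """Walk the double-underscore path and record each join hop."""
    hops = []
    table = base_table
    for part in dimension.split("__"):
        target = JOIN_GRAPH[table].get(part)
        if target is None:  # reached a plain column on the current table
            hops.append(f"SELECT {table}.{part}")
        else:               # join through to the next table in the path
            hops.append(f"JOIN {target} ON {table}.{part}_id = {target}.id")
            table = target
    return hops

print(resolve_dimension("transactions", "customer__country__region"))
```

Each extra `__` segment just adds one more hop through the graph, which is why query 3 above can reach `region` without you writing any join SQL.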


I think it’s probably best to talk about this comparison in three areas:

Semantics - The MetricFlow spec allows you to construct a much broader range of metrics, with far less expressed or duplicated logic, than dbt or Cube.

Performance - MetricFlow generates queries that rival the optimizations of a skilled data engineer and builds pre-aggregated tables similar to Cube, while dbt builds a static query from a Jinja macro.

Interfaces - Cube has some great interfaces for frontend developers, dbt just generates SQL at this point, and MetricFlow has a Python SDK and a CLI. The hosted version, Transform, comes with SQL and GraphQL interfaces, but that is beyond the scope of the OSS project.


If you’re interested, the longer version:

Semantics

MetricFlow requires less configuration than these other frameworks. We accomplish this by choosing abstractions that allow us to handle more on our side at query time through the DataFlow Plan builder. Working with the SQL constructions as a dataflow enables extensions such as non-DW data sources, or using other languages (Python) for some transformations.
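A rough intuition for why a dataflow-plan representation helps, as a toy sketch (these classes are illustrative only, not MetricFlow's): the query is a tree of nodes that can be inspected and rewritten before it is rendered to SQL, rather than a finished string.

```python
# Toy dataflow plan: nodes compose, and SQL is only rendered at the
# end. Class and table names are hypothetical, not MetricFlow's.

class ReadSource:
    """Leaf node: scan a source table."""
    def __init__(self, table: str):
        self.table = table
    def render(self) -> str:
        return f"SELECT * FROM {self.table}"

class Aggregate:
    """Aggregate a metric over dimensions, on top of a parent node."""
    def __init__(self, parent, metric: str, dims: list[str]):
        self.parent, self.metric, self.dims = parent, metric, dims
    def render(self) -> str:
        dims = ", ".join(self.dims)
        return (f"SELECT {dims}, SUM({self.metric}) AS {self.metric} "
                f"FROM ({self.parent.render()}) src GROUP BY {dims}")

plan = Aggregate(ReadSource("fct_transactions"), "transactions", ["ds"])
print(plan.render())
```

Because an optimizer can walk and rewrite the node tree (pruning scans, merging aggregations) before `render()` is ever called, this is a much easier surface to optimize than a templated query string.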

The dbt spec is relatively new and requires a few extremely unDRY expressions. The most obvious is the lack of support for joins, which means you simply won't be able to answer most questions unless you build huge tables. There are a few other issues with the abstractions. For example, dimensions are defined multiple times across metrics. A few folks posted more about these challenges in their GitHub issues, but they're sticking to their spec. I'm skeptical it will work at any scale.

The Cube concept is similar to Explores in Looker. They're limiting because you end up with a bunch of representations of small domains within the warehouse, and the moment you hit the edge of that domain you need to add a new Cube/Explore. This is not DRY and it's frustrating. There is also no first-class object for metrics, which means you're limited to relatively simple metric types.

Performance

MetricFlow has the flexibility of the DataFlow Plan Builder and builds quite efficient queries. The Materialization feature lets you programmatically build roll-up tables in the data warehouse, which can then be used as a low-latency serving layer.
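As a minimal sketch of what "programmatically building roll-up tables" means (the function and table names here are made up for illustration, not MetricFlow's API): given a metric, an aggregation, and a set of dimensions, emit a `CREATE TABLE ... AS SELECT` that pre-aggregates the source.

```python
# Hedged sketch of a materialization: turn a compact spec into a
# roll-up table statement. All names are illustrative.

def build_rollup_sql(name: str, metric: str, agg: str,
                     dimensions: list[str], source: str) -> str:
    """Emit CREATE TABLE AS SELECT for a pre-aggregated roll-up."""
    dims = ", ".join(dimensions)
    return (
        f"CREATE TABLE {name} AS\n"
        f"SELECT {dims}, {agg}({metric}) AS {metric}\n"
        f"FROM {source}\n"
        f"GROUP BY {dims}"
    )

sql = build_rollup_sql(
    name="txn_daily_by_country",
    metric="transactions",
    agg="SUM",
    dimensions=["ds", "customer__country"],
    source="fct_transactions",
)
print(sql)
```

Dashboards can then hit the small `txn_daily_by_country` table instead of re-scanning the raw fact table on every query, which is where the low-latency serving comes from.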

dbt is a Jinja macro and generates a static query per metric requested: https://github.com/dbt-labs/dbt_metrics/blob/main/macros/get.... This macro will be quite hard to optimize for more complicated metric types. We struggled a ton with this before refactoring our framework to allow the manipulation and optimization of these DataFlow Plans.

Cube is pretty slick on caching: they have some awesome pre-aggregation and caching features, though I know less about their query optimizations. I think this comes from their background in serving frontend interfaces.

Interfaces

MetricFlow supports a Python SDK and our CLI today. Transform has a few more interfaces (SQL over JDBC, GraphQL, React) that sit outside the scope of this OSS project.

dbt only builds a query in the dbt context today. TBD what the dbt server does, but I imagine it will expose a JDBC interface for paying customers.

Cube seems more focused on building custom data applications but has recently pivoted to the analytics front. I haven’t seen those interfaces in action but I’m curious to learn more there.


Traditionally, querying business and product metrics for data analysis has required lots of ad-hoc SQL queries. Often that logic is encoded in dashboards or ETL pipelines, and other times it is copied directly. A semantic layer acts as a single source of truth that encapsulates all of that logic. Having a single layer responsible for querying these datasets enables powerful workflows:

- It enables self-serve analytics experiences because it creates objects that business people can interact with in pivot-table-like forms. Hundreds of lines of SQL are distilled into simple queries.

- The logic is DRYer and easier to govern than all the repetitive SQL rollups required to answer business questions.

- Analysts become more productive: they write the same stuff less and answer more questions faster.

The idea of limiting duplicated logic is very well understood in the software engineering community and desired in the analytics community, but we're still in the early days. In practice, this is really hard in SQL, and the tools we have are too limited.

More specifically, here's why I get excited about MetricFlow:

- We basically built a generalized SQL constructor. It will be able to build performant and legible SQL for complicated requests (things that data engineers describe in hundreds of lines) through simple, consistent query interfaces.

- The way we encapsulate logic requires far fewer lines of YAML/code than most other frameworks, and we can do much more with those lines. LookML and previous versions we worked on at Airbnb became quite unruly because of choices in their abstractions.

- The metric abstraction is flexible and allows us to calculate complicated metrics with only a few lines of YAML. That means we can define metrics like conversion metrics, which might require joining two data sources, deduplicating, filtering to a conversion window, etc., in a single way with a few parameters that reference existing objects.
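To show the shape of that idea, here's a toy sketch (the spec fields and expansion steps are hypothetical, not MetricFlow's actual schema) of a conversion metric declared with a few parameters and then expanded into the query steps it implies:

```python
# Illustrative only: a compact conversion-metric spec and the logical
# steps a framework would expand it into. Field names are made up.

conversion_metric = {
    "type": "conversion",
    "base_event": "visits",
    "conversion_event": "purchases",
    "entity": "user_id",
    "window": "7 days",
}

def expand_conversion(spec: dict) -> list[str]:
    """Expand the compact spec into the work it saves you from writing."""
    return [
        f"join {spec['base_event']} to {spec['conversion_event']} on {spec['entity']}",
        f"deduplicate {spec['conversion_event']} per {spec['entity']}",
        f"filter conversions to within {spec['window']} of the base event",
        "aggregate converted / total as the conversion rate",
    ]

for step in expand_conversion(conversion_metric):
    print(step)
```

The point is the ratio: five declarative parameters stand in for the join, dedupe, window filter, and aggregation you'd otherwise hand-write per metric.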


Well said!

We need more of this problem space exposed to engineers and not just for “analysts”.

I’ll share a couple other articles from a company that does a nice job explaining the technical problems in what is traditionally “business analytics”.

The space is OLAP and you may have scoffed at the idea of “OLAP cubes”, but man were they useful. In the way that excel powers a ton of business processes, cubes powered a lot of analytics. Underlying tech is cool but they are showing their age: https://www.holistics.io/blog/the-rise-and-fall-of-the-olap-...

Another write-up of this idea of a semantic layer above raw SQL statements: https://www.holistics.io/blog/holistics-data-modeling-explai...

So this “semantic layer” leverages the latest tech to deliver the same business insights faster, better, and more flexibly. I.e. once you define this semantic layer over your data (i.e. how all your SQL tables are connected), the semantic engine knows how to query up and down your data model, writing the SQL queries for you, on the fly. You can ask and answer new questions without writing new queries. And with modern columnar query engines (e.g. BigQuery, Spark, Presto, etc.), perf is usually pretty good.
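The "engine writes the SQL for you" part boils down to graph search over how the tables connect. A toy sketch, with a made-up three-table schema: find a join path between any two tables with a breadth-first search, then emit the JOIN clauses.

```python
from collections import deque

# Toy semantic-engine idea: given how tables are connected (foreign
# keys), find the joins linking any two tables. Schema is hypothetical.

EDGES = {  # table -> {neighbor table: join condition}
    "orders": {"customers": "orders.customer_id = customers.id"},
    "customers": {"orders": "orders.customer_id = customers.id",
                  "regions": "customers.region_id = regions.id"},
    "regions": {"customers": "customers.region_id = regions.id"},
}

def join_path(start: str, goal: str) -> list[str]:
    """Return the JOIN clauses linking start to goal, or [] if none."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, clauses = queue.popleft()
        if table == goal:
            return clauses
        for nxt, cond in EDGES.get(table, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, clauses + [f"JOIN {nxt} ON {cond}"]))
    return []

print(join_path("orders", "regions"))
```

A new question ("orders by region") just triggers a new path search over the already-declared connections, which is why no new hand-written query is needed.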


And for completeness, here’s another company that I also used at $previous_job that provides a “semantic model” offering. This write up also helps describe where it fits in.

(This one has just enough content vs marketing for me not to feel embarrassed posting here on HN for people who want to find out more. And IMO the BI landscape is littered with pablum from an engineers POV, often obscuring the nature of the technical problems to be solved in the space - which are very cool.)

https://www.atscale.com/blog/what-is-a-universal-semantic-la...

