Well, one of the differences is that we only require PostgreSQL as an external dependency, in contrast with DataHub (MySQL, Kafka, Elasticsearch). Please correct me if I'm wrong about this list of DataHub's external dependencies.
The ODD Specification is a standard for collecting such metadata, ETL included. We gather metadata for lineage at the entity level now, but we plan to expand this to column-level lineage by late 2022 or early 2023. The specification keeps the system open, and it's really easy to write your own integration once you see in what format metadata needs to be ingested into the Platform.
Also, thank you for sharing the links with us! I'm thrilled to take a look at how BMW solved the problem of gathering lineage from Spark; that's something we are improving in our product right now.
While Pachyderm (a great product, by the way) helps teams automate transformation tasks, ODD is more of a discovery/observability/monitoring solution for your pipelines. Basically, if Pachyderm helps you build a pipeline, ODD helps you monitor all of your pipelines in the context of your whole data infrastructure.
> How is the lineage generated or manually maintained
All lineage in the platform is generated and is not manually maintained by users in the UI. We leverage the ODD Specification (https://github.com/opendatadiscovery/opendatadiscovery-speci...), and all ODD Collectors (agents that scrape metadata from your data sources) send payloads to the ODD Platform in the specification's format. The ODD Specification introduces something called ODDRNs (OpenDataDiscovery Resource Names). These are basically string identifiers of specific data entities. All ODD Collectors generate the same identifier for the same entity, which allows us to automatically build a lineage graph in the ODD Platform.
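To make the idea concrete, here's a minimal sketch of how deterministic identifiers let payloads from independent collectors be joined into one lineage graph. The ODDRN strings and payload shapes below are made up for illustration; the real grammar lives in the specification repo.

```python
# Sketch: deterministic ODDRN-style identifiers turn independent payloads
# into lineage edges by plain string matching. Formats are illustrative.

def postgres_table_oddrn(host: str, database: str, schema: str, table: str) -> str:
    # Any collector that sees this table derives the exact same string.
    return f"//postgresql/host/{host}/databases/{database}/schemas/{schema}/tables/{table}"

# Payload from a hypothetical PostgreSQL collector:
postgres_payload = [
    {"oddrn": postgres_table_oddrn("db1", "shop", "public", "orders"), "type": "table"},
]

# Payload from a hypothetical Airflow collector, referencing the same
# table by the same ODDRN in its job's inputs:
airflow_payload = [
    {
        "oddrn": "//airflow/host/af1/dags/daily_agg/tasks/aggregate",
        "type": "job",
        "inputs": [postgres_table_oddrn("db1", "shop", "public", "orders")],
        "outputs": [postgres_table_oddrn("db1", "shop", "public", "orders_agg")],
    },
]

# The platform can build lineage edges purely by matching identical strings,
# without any manual wiring in the UI:
edges = []
for entity in airflow_payload:
    for src in entity.get("inputs", []):
        edges.append((src, entity["oddrn"]))
    for dst in entity.get("outputs", []):
        edges.append((entity["oddrn"], dst))

print(edges)  # two edges: table -> job, job -> aggregated table
```

Because the identifiers are derived from the entity's location rather than assigned by hand, two collectors that have never heard of each other still agree on them.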
Not letting users manually change lineage in the UI is kind of our solution to one of the lineage problems. This way users can be sure that the lineage is correct, up to date, and that no one has messed with it, at least in the UI.
Of course, if there's a documented API endpoint, there's a way to change the lineage by sending a request on your own (e.g. via curl or a custom script), but I wouldn't call that manual. This approach allows companies and users to write their own integrations, making the system open.
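As a rough sketch of what such a custom integration might look like: the snippet below builds an ingestion payload and shows the kind of curl call one might send. The endpoint path and field names are illustrative assumptions, not the actual ODD Platform API; check the spec and the platform's API docs for the real ones.

```python
import json

# Hypothetical ingestion payload: a job that transforms one table into
# another, described by ODDRN-style identifiers. All paths and field
# names here are illustrative, not the real ODD API contract.
payload = {
    "data_source_oddrn": "//custom/host/etl1",
    "items": [
        {
            "oddrn": "//custom/host/etl1/jobs/clean_users",
            "name": "clean_users",
            "type": "JOB",
            "inputs": ["//postgresql/host/db1/databases/shop/schemas/public/tables/users"],
            "outputs": ["//postgresql/host/db1/databases/shop/schemas/public/tables/users_clean"],
        }
    ],
}

body = json.dumps(payload)
print(body)

# One could then send it with something like (endpoint path is a guess):
#   curl -X POST http://odd-platform:8080/ingestion/entities \
#        -H "Content-Type: application/json" -d @payload.json
```

The point is that the "API" path to changing lineage is scriptable and repeatable, which is quite different from a person dragging edges around in a UI.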
We have a lot of repositories on GitHub; please feel free to pick any issue from the list. Do not hesitate to ask us anything in GitHub issue threads or in our Slack community. I'll provide links for your convenience.
Thank you for your kind words and the constructive feedback! We appreciate it.
Let me cover some of your reactions from my perspective as a Data Engineer. Please feel free to add your opinion on those.
> Shorten data discovery phase. In my experience, analysts and data scientists are always very familiar with what relevant data exists, or else they can find the right people to acquire what data they need. Often, kick-off meetings for new projects cover with stakeholders which data is useful.
You're right, but from my experience that's not always the case. Sometimes finding the key person/team responsible for a dataset can be challenging. You mentioned the kick-off meeting, and I agree, but it's not always a silver bullet. Data becomes outdated/deprecated all the time, and we are trying to solve the problem of notifying everyone who may be affected as soon and as easily as possible.
> Know the sources of your dashboards and ad hoc reports. All dashboards I am aware of surface this sort of information
Again, you are right. All dashboard services and BI tools can show you which data source they are getting their data from. But from my experience it's sometimes useful to look at the origin of the data a dashboard uses, and this is where end-to-end lineage comes in handy. Also, I consider it useful to have the metadata of all my dashboards from all of my company's BI tools in one place.
> Deprecate outdated objects responsibly by assessing and mitigating the risks. This is a good idea, however, it is challenging
Couldn't agree more. We are working to improve not only our way of solving this problem, but the solution itself, where it makes sense. We are basically trying to find the right approach and offer it to everyone else. I know it's ambitious and a bold claim, but I hope we are getting there.
Overall, thank you for your input!
@germanosin, would you like to add something I may have missed?
I see your point. As you mentioned, the product is rather young and we continue to develop it. I agree that documentation is one of the fundamental parts. Thank you for your input; we will find a way to make the documentation more straightforward and useful for cases such as the ones you've described.
Perhaps you would be interested in a call with us, where we can answer all your questions (including integration with your infrastructure), provide help configuring the platform if needed, etc.?
It's not confusing, it's well written; I just want more. But I get that it's a community project and it takes time to do all this stuff for free. I'm just ~not so~ patiently waiting for docs aimed more at the person doing the setup vs. the conceptual side.
Take adding a test from pandas/Great Expectations: the software looks great when it's set up and the tests are already added, but to add one myself I just have to imagine how it works. Great Expectations makes sense, probably set up via an API hook, but I'm mostly using pandas, so how do I add one? Is it going to be more work to reuse existing tests? If so, how much? Really, I'm also trying to estimate how long setting up everything from my pipeline inside the product would take. Since I want a junior doing a lot of this as well, is this going to be weirdly hard for a math uni grad? You know?
In the online demo I didn't see the ability to do that on that type of account, which idc about; it's just part of gauging how long it would take to go from zero to running to useful for the company.
Thanks for the help. I'm going to keep an eye out for the product in the future for sure. It's just at the point where it's still more work for me to figure out whether I want to do the work to use it.