Thanks @rasmusei! If you are a data scientist, you might also be interested in how to use OpenRefine alongside Jupyter. Our community has some documentation about that on our Wiki here: https://github.com/OpenRefine/OpenRefine/wiki/Jupyter
The SIMILE library is used in OpenRefine for certain facets, such as the timeline facet, clustering, etc. Parallax originated with David as a way to show how time-series data visualizations could be enhanced. David was one of the original designers of OpenRefine, and I worked closely with him and Stefano in testing it.
YES! We want users to be able to process much larger datasets too! We have started experimenting with Apache Spark on the backend, in the hope that it will let users work with much larger data. This work is being funded by CZI, and you can read the grant proposal here: http://openrefine.org/blog/2019/11/14/czi-eoss.html
A long time back, I built an extension that lets you take your OpenRefine mappings and run them on a Hadoop cluster. The idea is that you run all your transformations on a small dataset on your local machine; once you are satisfied with the mappings, you deploy the same mappings to the Hadoop cluster. I have tested this on large datasets, and it works. Let me know how I can help.
You can check my github: https://github.com/rmalla1/OpenRefine-HD
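For readers unfamiliar with the workflow: OpenRefine can export a project's operation history as JSON (Undo/Redo → Extract), and a replay worker then applies that same history to the full dataset on the cluster. The sketch below is purely illustrative and is not the extension's actual code: the `apply_ops` helper, the `TRANSFORMS` table, and the idea that a GREL expression maps to a plain Python callable are all assumptions made to keep the example self-contained.

```python
import json

# Illustrative replay of an exported OpenRefine operation history.
# Real operations carry engine configs, facets, and full GREL expressions;
# this toy version handles only "core/text-transform" ops whose expression
# happens to appear in the lookup table below (an assumption for the sketch).
TRANSFORMS = {
    "value.toUppercase()": str.upper,
    "value.trim()": str.strip,
}

def apply_ops(rows, ops_json):
    """Apply an exported operation list to rows (a list of dicts).

    On a Hadoop/Spark cluster the same function would run inside each
    mapper/partition, so mappings validated locally on a small sample
    carry over to the full dataset unchanged.
    """
    ops = json.loads(ops_json)
    for op in ops:
        if op.get("op") == "core/text-transform":
            col = op["columnName"]
            fn = TRANSFORMS[op["expression"]]
            for row in rows:
                row[col] = fn(row[col])
    return rows

if __name__ == "__main__":
    history = json.dumps([
        {"op": "core/text-transform", "columnName": "name",
         "expression": "value.trim()"},
        {"op": "core/text-transform", "columnName": "name",
         "expression": "value.toUppercase()"},
    ])
    sample = [{"name": "  alice "}, {"name": "bob"}]
    print(apply_ops(sample, history))  # [{'name': 'ALICE'}, {'name': 'BOB'}]
```

The key design point is that the operation history, not the data, is what travels from the local machine to the cluster, which is why testing on a small sample first is safe.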
That author did not have Spark tuned well for the use case, which is a common issue with Spark. Since OpenRefine is commonly used with strings, we plan to optimize many areas for that, such as a few of those mentioned here: https://databricks.com/glossary/spark-tuning But in general, there are always tradeoffs when trying to provide immediate feedback for interactions. Since OpenRefine has many interactive features, some will need to support batching and advise the user in the interface that things will take longer ("do you want to send this to a batch job?"). Some of the tradeoffs, and the ways we plan to address them, are covered in our general OpenRefine-on-Spark issue here: https://github.com/OpenRefine/OpenRefine/issues/1433
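To give a flavor of the tuning involved: these are standard Spark properties that typically matter for string-heavy workloads, shown as an illustrative `spark-defaults.conf` fragment. The specific values are examples, not anything OpenRefine ships or recommends.

```properties
# Kryo is far more compact than Java serialization for string-heavy records
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  128m
# Fraction of heap shared by execution and cached (string) data
spark.memory.fraction            0.6
# Match shuffle parallelism to the cluster instead of the default of 200
spark.sql.shuffle.partitions     64
```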