Thanks @rasmusei! If you are a data scientist, you might also be interested in how to use OpenRefine alongside Jupyter. Our community has some documentation about that on our Wiki here: https://github.com/OpenRefine/OpenRefine/wiki/Jupyter
The SIMILE library is used in OpenRefine for certain facets, such as the timeline facet, clustering, etc. Parallax originated with David as a way to show how time-series data visualizations could be enhanced. David was one of the original designers of OpenRefine, and I worked closely with him and Stefano in testing it.
YES! We want users to be able to process much larger datasets too! We have started experimenting with Apache Spark on the backend, in the hope that it will let users work with much larger data. This work is being funded by CZI, and you can read the grant proposal here: http://openrefine.org/blog/2019/11/14/czi-eoss.html
A long time back, I built an extension that lets you take your OpenRefine mappings and run them on a Hadoop cluster. The idea is that you run all your transformations on a small dataset on your local machine; once you are satisfied with the mappings, you deploy the same mappings to the Hadoop cluster. I have tested this on large datasets, and it works. Let me know how I can help.
You can check my github: https://github.com/rmalla1/OpenRefine-HD
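For readers unfamiliar with the workflow: OpenRefine can export a project's operation history as JSON (Undo/Redo → Extract), and a replay worker then applies that same history to the full dataset on the cluster. The sketch below is purely illustrative and is not the extension's actual code: the `apply_ops` helper, the `TRANSFORMS` table, and the idea that a GREL expression maps to a plain Python callable are all assumptions made to keep the example self-contained.

```python
import json

# Illustrative replay of an exported OpenRefine operation history.
# Real operations carry engine configs, facets, and full GREL expressions;
# this toy version handles only "core/text-transform" ops whose expression
# happens to appear in the lookup table below (an assumption for the sketch).
TRANSFORMS = {
    "value.toUppercase()": str.upper,
    "value.trim()": str.strip,
}

def apply_ops(rows, ops_json):
    """Apply an exported operation list to rows (a list of dicts).

    On a Hadoop/Spark cluster the same function would run inside each
    mapper/partition, so mappings validated locally on a small sample
    carry over to the full dataset unchanged.
    """
    ops = json.loads(ops_json)
    for op in ops:
        if op.get("op") == "core/text-transform":
            col = op["columnName"]
            fn = TRANSFORMS[op["expression"]]
            for row in rows:
                row[col] = fn(row[col])
    return rows

if __name__ == "__main__":
    history = json.dumps([
        {"op": "core/text-transform", "columnName": "name",
         "expression": "value.trim()"},
        {"op": "core/text-transform", "columnName": "name",
         "expression": "value.toUppercase()"},
    ])
    sample = [{"name": "  alice "}, {"name": "bob"}]
    print(apply_ops(sample, history))  # [{'name': 'ALICE'}, {'name': 'BOB'}]
```

The key design point is that the operation history, not the data, is what travels from the local machine to the cluster, which is why testing on a small sample first is safe.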
That author did not have Spark tuned well for the use case, which is a common issue with Spark. Since OpenRefine is commonly used with strings, we plan to optimize many areas for that, such as a few of those mentioned here: https://databricks.com/glossary/spark-tuning But in general, there are always tradeoffs when trying to provide immediate feedback for interactions. Since OpenRefine has many interactive features, some will need to support batching and advise the user in the interface that things will take longer ("do you want to send this to a batch job?"). Some of the tradeoffs, and the ways we plan to address them, are covered in our general OpenRefine-on-Spark issue here: https://github.com/OpenRefine/OpenRefine/issues/1433
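To give a flavor of the tuning involved: these are standard Spark properties that typically matter for string-heavy workloads, shown as an illustrative `spark-defaults.conf` fragment. The specific values are examples, not anything OpenRefine ships or recommends.

```properties
# Kryo is far more compact than Java serialization for string-heavy records
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  128m
# Fraction of heap shared by execution and cached (string) data
spark.memory.fraction            0.6
# Match shuffle parallelism to the cluster instead of the default of 200
spark.sql.shuffle.partitions     64
```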