I just published an S3 storage provider for Elasticlunr. You can now store your indexes in an S3 bucket, in addition to the Disk storage provider included in the base project.
The storage API is flexible, so writing to any storage provider (Google Cloud Storage, a database and so on) shouldn't be a problem. It's just a matter of grabbing the right provider or implementing one yourself.
Yes. This library already includes stemming and TF-IDF. Everything provided by the JS version is included, with improvements made where applicable.
You might want to consider mapping an index to an ETS table-based data structure instead of an immutable object managed by a GenServer. That would give you a way to share it between processes without having to awkwardly copy a potentially huge data structure all over the place.
This makes sense, and I think you've taken the correct route. I look forward to trying this in one of my projects and comparing it to my current Postgres-only search strategy. For my use case losing the index between restarts isn't a deal breaker, so hopefully I'll have some useful feedback.
I don't understand how this works. Is data read from ETS somehow shared more efficiently than data shared via a regular message? (which iirc is always copied)
It's still copied, but if you are using an ETS table you're likely only copying a small subset of the data per query instead of schlepping the whole index every time.
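To make the ETS point concrete, here is a minimal sketch of one row per term in a named table. The `PostingsTable` module, the table name and the row layout are hypothetical and are not how Elasticlunr stores its index; only the `:ets` calls are standard.

```elixir
# Hypothetical module for illustration; not part of Elasticlunr.
defmodule PostingsTable do
  @table :postings_sketch

  # Create a named, public table optimised for concurrent reads.
  def new do
    :ets.new(@table, [:set, :public, :named_table, read_concurrency: true])
  end

  # Store the ids of the documents that contain `term`, one row per term.
  def put(term, doc_ids) when is_binary(term) and is_list(doc_ids) do
    :ets.insert(@table, {term, doc_ids})
  end

  # Only the matching row is copied into the caller's heap.
  def lookup(term) do
    case :ets.lookup(@table, term) do
      [{^term, doc_ids}] -> doc_ids
      [] -> []
    end
  end
end

PostingsTable.new()
PostingsTable.put("elixir", [1, 4])
PostingsTable.put("search", [2, 4, 7])

# Any process can read the table directly, with no GenServer round trip.
task = Task.async(fn -> PostingsTable.lookup("search") end)
IO.inspect(Task.await(task))
```

Keep in mind the table is deleted when the process that created it exits, so in a real app a long-lived process would need to own it.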
Nice work! For certain storage media, e.g. S3, it might be useful to have some sort of delta-based updates where you can enqueue deltas that accrue over time. It might also be interesting to solicit volunteers to help implement distributed in-memory or disk persistence.
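A rough sketch of what I mean by deltas, assuming a snapshot that is just a map of document id to document and deltas that are small tagged tuples. The `DeltaLog` module and the tuple shapes are made up for illustration and are not part of the Elasticlunr storage API.

```elixir
# Hypothetical module for illustration; not part of Elasticlunr.
defmodule DeltaLog do
  # Apply a single queued change to the snapshot.
  def apply_delta(snapshot, {:put, id, doc}), do: Map.put(snapshot, id, doc)
  def apply_delta(snapshot, {:delete, id}), do: Map.delete(snapshot, id)

  # Fold the deltas, oldest first, into a fresh snapshot.
  def compact(snapshot, deltas) do
    Enum.reduce(deltas, snapshot, &apply_delta(&2, &1))
  end
end

snapshot = %{1 => "first doc"}
deltas = [{:put, 2, "second doc"}, {:delete, 1}, {:put, 3, "third doc"}]

IO.inspect(DeltaLog.compact(snapshot, deltas))
# prints %{2 => "second doc", 3 => "third doc"}
```

The idea would be to upload small delta objects as they accrue and only rewrite the full snapshot during an occasional compaction.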
https://github.com/heywhy/ex_elasticlunr_s3