heywhy's comments

heywhy · on Jan 11, 2022

I just published an S3 storage provider for Elasticlunr. You can now store your indexes in an S3 bucket, alongside the Disk storage provider included in the base project.

The storage API is flexible, so writing to any storage provider (Google Cloud Storage, DB and so on) shouldn't be a problem. It's just a matter of grabbing the right provider or implementing one yourself.

https://github.com/heywhy/ex_elasticlunr_s3

heywhy · on Jan 9, 2022

Yes. This library already includes stemming and TF-IDF. Everything provided by the JS version is included in this library, and improvements are made where applicable.

heywhy · on Jan 9, 2022

Yes, it is a port of that library with some improvements.

dnautics · on Jan 9, 2022

You might want to consider mapping an index to an ETS table-based data structure instead of an immutable object managed by a GenServer. It will give you a way to share it between processes without having to awkwardly copy a potentially huge data structure all over the place.

heywhy · on Jan 9, 2022

I do have thoughts about performance too but I was following the "get it working then make improvements" route :). Thank you for the suggestions.

kuzee · on Jan 9, 2022

This makes sense, and I think you've taken the correct route. I look forward to trying this in one of my projects and comparing it to my current Postgres-only search strategy. For my use case, losing the index between restarts isn't a deal breaker, so hopefully I'll have some useful feedback.

heywhy · on Jan 9, 2022

That's great. I will be looking forward to this.

dnautics · on Jan 9, 2022

Love it. You're doing exactly the right thing.

skrebbel · on Jan 9, 2022

I don't understand how this works. Is data read from ETS somehow shared more efficiently than data shared via a regular message? (which iirc is always copied)

dnautics · on Jan 9, 2022

It's still copied but if you are using an ets table you're likely only copying a small subset of the data per query instead of schlepping the whole index every time.

eproxus · on Jan 9, 2022

It’s still copied, but a process can quickly become a bottleneck in parallel code (every request to a process is sequential).

An ETS table can be concurrently read (and tweaked even further for that use case if desired).

heywhy · on Jan 9, 2022

Like eproxus mentioned, it's still being shared through normal process messaging, but improvements will be made regarding this.

linkdd · on Jan 9, 2022

I'd even suggest offering mnesia as an option for disc copies.

dnautics · on Jan 9, 2022

mnesia had consistency issues that could crop up and were very difficult to debug. Have these been fixed?

heywhy · on Jan 9, 2022

Hello, I'm the author of the library. You should use the IndexManager (https://github.com/heywhy/ex_elasticlunr/blob/master/lib/ela...) to store your index after making changes to it, but note that the indexes will be lost on application shutdown.

I'm currently working on a configurable storage mechanism so that you can use the storage provider of your choice. See https://github.com/heywhy/ex_elasticlunr/pull/9

dnautics · on Jan 9, 2022

Nice work! For certain storage media, e.g. S3, it might be useful to have some sort of delta-based updates where you can enqueue deltas that accrue over time. It might also be interesting to solicit volunteers to help implement distributed in-memory or disk persistence.

heywhy · on Jan 9, 2022

Thank you for the suggestions. I have the same direction in mind for the library. I don't mind if you recommend volunteers.

And don't forget to share the project with friends and colleagues who might be interested in contributing.

dnautics · on Jan 9, 2022

Post it on elixirforum (elixirforum.com). I did a quick search and couldn't find it there.
