Hacker News: eventreduce's comments

The biggest use case for EventReduce is realtime applications. Most technologies for these, like Firebase, AWS AppSync etc., work on non-relational data. If you want to use EventReduce with relational queries, you have to make them non-relational first, for example by using materialized views. If you do not want to do that, you should not use this algorithm with its current feature set.


My guess would be that if you're at a scale where you're thinking about these sorts of things, you are also at a scale where you're running on multiple machines. How does EventReduce share writes across the cluster?


EventReduce is an algorithm, not a database wrapper. It does not care about your writes or whether your database layer is a cluster, and so it does not affect them.


Sorry I wasn't clear in my original post.

I'm thinking about the application layer. If you have an application that writes data to a table, it's typical to run multiple instances of that application to support scale and reliability requirements.

If I send a write to one instance, how does it communicate and synchronise that write with the other application instances?

I ask because this can be a tricky thing to do, especially when consensus is required: consensus algorithms such as Raft/Paxos require a number of network round trips, which introduces latency, and in some cases accounts for much of the latency in the database examples given.


EventReduce is a simple algorithm. It does not care about or affect how you handle the propagation of writes, or how you handle your events, transactions or conflicts.

See it as a simple function that can do oldResults + event = newResults, as shown in the big image at the top of the readme.
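To make that idea concrete, here is a minimal sketch of such a function. The names (`applyEvent`, `event.op`, `matchesQuery`) are illustrative assumptions, not EventReduce's real API:

```javascript
// Hypothetical sketch of the oldResults + event -> newResults idea.
// Names here (applyEvent, op, matchesQuery) are illustrative only.
function applyEvent(oldResults, event) {
  switch (event.op) {
    case 'INSERT':
      // only add the document if it matches the query
      return event.matchesQuery ? oldResults.concat(event.doc) : oldResults;
    case 'DELETE':
      return oldResults.filter(d => d.id !== event.doc.id);
    case 'UPDATE':
      return oldResults.map(d => (d.id === event.doc.id ? event.doc : d));
    default:
      return oldResults;
  }
}

const oldResults = [{ id: 1, name: 'alice' }];
const newResults = applyEvent(oldResults, {
  op: 'INSERT',
  matchesQuery: true,
  doc: { id: 2, name: 'bob' }
});
// newResults now holds both documents; oldResults is untouched
```

The point is that this is a pure function over the previous results and one event; no database access happens inside it.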


This means then that if you run multiple application servers, which most do, that you’ll need to implement a data distribution mechanism of some sort.

I must admit, with limitations like this I’m struggling to figure out the use cases for this.

Edit: so I guess this is easier using the change subscriptions you mention in other comments. That does mean many subscribers, but hopefully that’s minimal load. This has the trade-off that it’s now eventually consistent, but I suppose that’s not a problem for many high read applications.

I’m still feeling like this could be solved in a simpler way with just simple data structures and a pub/sub mechanism. Now that I think of it, we do a similar thing with Redis for one service, and a custom Python server/pipeline in another, but we’ve never felt the need for this sort of thing.

Do you have more details about specific applications/use cases, and why this is better than alternatives?


I think the best example of why this is useful is described by David Glasser in his talk about the oplog driver used in Meteor.js: https://www.youtube.com/watch?v=_dzX_LEbZyI


Thank you for the clarification.


This is a great question. EventReduce is only the algorithm that calculates your new results. The parsing of SQL is not done by it; you have to bring that yourself. This works by providing some information about the query, like sort fields and query matchers. This is described well in the JavaScript implementation [1]. Providing these functions is easy for NoSQL queries because they are better composable. For SQL queries you have to do some work before you can use EventReduce.

Also see the limitations of EventReduce in the readme.

[1] https://github.com/pubkey/event-reduce/tree/master/javascrip...
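Roughly, the query metadata looks something like the following sketch. The field names (`queryMatcher`, `sortComparator` etc.) are assumptions for illustration; check the linked repo for the exact interface:

```javascript
// Illustrative shape of the query metadata an EventReduce-style
// library needs. Field names are assumptions, not the exact API.
const queryParams = {
  primaryKey: 'id',
  skip: 0,
  limit: 10,
  // predicate: does a single document match the query?
  queryMatcher: doc => doc.age > 18,
  // deterministic comparator: defines the result order
  sortComparator: (a, b) => (a.age - b.age) || (a.id < b.id ? -1 : 1)
};

// For SQL you would have to derive these functions from the parsed
// statement yourself, e.g. WHERE age > 18 ORDER BY age, id LIMIT 10.
const docs = [
  { id: 'a', age: 30 },
  { id: 'b', age: 12 },
  { id: 'c', age: 25 }
];
const results = docs
  .filter(queryParams.queryMatcher)
  .sort(queryParams.sortComparator)
  .slice(queryParams.skip, queryParams.skip + queryParams.limit);
// results: [{ id: 'c', age: 25 }, { id: 'a', age: 30 }]
```

For NoSQL query languages these pieces fall out of the query object almost directly, which is why they compose more easily than parsed SQL.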


There is a big difference between a change stream and a realtime query. For example, MongoDB's cursor stream is a good way to observe the events that happen to a specific collection or to documents that match some criteria. But if you want the realtime results of a query that has sorting, skip, limit etc., then it is really hard to wrap the change stream into this. In fact, this is exactly what EventReduce could do for you.

For more information about the difference I recommend the video "Real-Time Databases Explained: Why Meteor, RethinkDB, Parse & Firebase Don't Scale" https://www.youtube.com/watch?v=HiQgQ88AdYo&t=1703s
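To make the difficulty concrete, here is a rough sketch (hypothetical names, not EventReduce's API) of the naive approach: every event forces a full requery, because the raw event alone does not tell you the new result set of a sorted, limited query.

```javascript
// Naive "realtime query" on top of a change stream: after every
// event the whole query is re-run, because e.g. a delete may pull a
// previously-excluded row into the LIMIT window. This full requery
// is the expensive step an incremental algorithm tries to avoid.
function naiveRealtimeQuery(allDocs, events, query) {
  let results = [];
  for (const event of events) {
    if (event.op === 'INSERT') allDocs = allDocs.concat(event.doc);
    if (event.op === 'DELETE') allDocs = allDocs.filter(d => d.id !== event.doc.id);
    results = allDocs
      .filter(query.matcher)
      .sort(query.comparator)
      .slice(0, query.limit);
  }
  return results;
}

const query = {
  matcher: d => d.score > 0,
  comparator: (a, b) => b.score - a.score, // highest score first
  limit: 2
};
const results = naiveRealtimeQuery(
  [{ id: 1, score: 5 }, { id: 2, score: 3 }, { id: 3, score: 1 }],
  [{ op: 'DELETE', doc: { id: 1 } }],
  query
);
// deleting the top row promoted id 3 into the limit-2 window
```

Note that the delete of id 1 changed the results for a document that was never in them before; that is the kind of case a plain change-stream subscription cannot answer without requerying.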


>> There is a big difference between a change stream and a realtime query. For example, MongoDB's cursor stream is a good way to observe the events that happen to a specific collection or to documents that match some criteria. But if you want the realtime results of a query that has sorting, skip, limit etc., then it is really hard to wrap the change stream into this.

Have you actually tried it from the mongo shell or any MongoDB client driver?

In the official documentation link which I shared, it is clearly mentioned that it supports the aggregation pipeline. Any operator compatible with the aggregation pipeline framework, including "$sort" and "$skip", can be used. You can also use the JOIN-like operators "$lookup" or "$graphLookup".

See this link for info https://docs.mongodb.com/manual/core/aggregation-pipeline-op...


Yes, I have used it. I actually know it really well. I also did performance comparisons with MongoDB and MongoDB's change streams and cursors. What I posted here is just an algorithm. You could now compare it to MongoDB (a product) and say it is a "more flexible solution", but I do not see the point in directly comparing them simply based on the documentation of both.


>> Yes, I have used it. I actually know it really well. I also did performance comparisons with MongoDB and MongoDB's change streams and cursors.

Can you share a link to the code, and to the data in the database you are querying against, to back up your claim?


No, and I also do not want to "claim" anything. Feel free to do your own tests.


There is a big difference between a database with an event stream and a 'realtime query' that can be created with EventReduce.


What is that difference?


I recommend the video "Real-Time Databases Explained: Why Meteor, RethinkDB, Parse & Firebase Don't Scale" https://www.youtube.com/watch?v=HiQgQ88AdYo


That does not answer what differentiates your solution. I work on streaming systems. I am aware of the spectrum of online, latency-aware data processing. But from what I can tell, in your solution the changes are coming from the database itself. Since, as I understand it, the database is still the source of all data, I don’t see why your solution is any faster than continuous queries in a database.


Yes, this is correct. The performance benefit comes from doing all of this on the CPU instead of using disk I/O. Also, the internal binary decision diagram of EventReduce is optimized to run less logic than a query would. This makes it even faster than running the query again against an in-memory database.


And the main cost of this (questionable, IMO) benefit is losing consistency: you miss any change to the DB that does not come from the calling app. You haven't mentioned this cost anywhere.


The writes are not tunneled through this algorithm somehow. You still use the database like you normally would. So the consistency is not affected.

Also this is an open source project, not something I want to sell you. Feel free to make a PR/issue with any open topics that are not mentioned in the docs.


>The writes are not tunneled somehow through this algorithm

Then I fail to understand how it works. How does EventReduce become aware of these "write events"?

>this is an open source project, not something I want to sell you

You made it open source so others can use it, right? They had better be making an informed decision about whether your solution suits their needs.


You have to provide the events yourself. See EventReduce as a simple function that can do oldResults + event = newResults.

And yes, you should always test open source stuff before you use it. There is no warranty; use it at your own risk.


OK, so you don't "tunnel writes through" EventReduce, you "tee" them to EventReduce.

Anyway, to maintain consistency, you have to limit yourself to one process of your app. No sharding, load balancing etc. This is a significant limitation, and it's not obvious. I encourage you to mention it in the README.md.


I encourage you to read the readme and check out the demo. EventReduce is not something that magically drills into your database and affects the consistency of your write accesses.

It is a simple algorithm that is implemented as a function with two inputs and one output.


> How does EventReduce become aware of these "write events"?

Some DBs expose an event stream, for example, PG:

https://www.postgresql.org/docs/current/logicaldecoding-expl...


> For the different implementations in common browser databases, we can observe an up to 12 times faster displaying of new query results after a write occurred.

Is this intended to be an optimisation on top of localStorage and so on? If so, at least you don't have to worry about multiple writers.


localStorage is not a database. Check out the demo page.


I do not think they have much in common. Lambda is used for stream processing of large amounts of data. EventReduce is used for optimizing the latency of many (repeated) queries.


No, see the FAQ in the readme:

Materialized views solve a similar problem, but in a different way with different trade-offs. When you have many users, all subscribing to different queries, you cannot create that many views, because they are all recalculated on each write access to the database. EventReduce however has better scalability, because it does not affect write performance, and the calculation is done when the fresh query results are requested, not beforehand.


Others have pointed this out already: this is wholly dependent on the RDBMS used, and Oracle offers incremental refresh on demand. I have to admit, though, that I mistakenly thought that MSSQL would as well...


1. No, I do not have a paper. I thought a lot about publishing a paper first but then decided against it, because I think that good code, tests and demos are more valuable.

2. EventReduce is mostly useful for realtime applications. I myself use it in a NoSQL database (RxDB). There you stream data and events, and a single document write is the most atomic 'transaction' you can do. If you need transactional serial writes and reads that depend on each other, you would not use EventReduce for that.

3. EventReduce is just the algorithm that merges oldResults + event. It assumes that you feed in the events in the correct order. Mostly it is meant to be used with databases that provide a change stream where you can be sure that the order is correct.

4. Sort order matters because EventReduce promises to always return the same results as a fresh query against the database would have returned. When the sort order is not deterministic, the rows returned from a query depend on how the data is stored in the database. That order cannot be predicted by EventReduce, which means it would then return a wrong result set.

PS: BDDs are awesome :)
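Point 4 is easy to demonstrate: two rows that compare equal under the query's sort can legitimately come back in either order, unless you add a tiebreaker such as the primary key. A small sketch:

```javascript
// Why sort order must be deterministic (sketch).
// Two docs with equal age: without a tiebreaker the database is
// free to return them in either order, so an incremental algorithm
// cannot predict which one a fresh query would put first.
const docs = [
  { id: 'b', age: 20 },
  { id: 'a', age: 20 }
];

const ambiguous = [...docs].sort((x, y) => x.age - y.age);
// relative order of 'a' and 'b' depends on the sort implementation
// and on how the rows happen to be stored

// adding the primary key as a tiebreaker makes the order predictable
const deterministic = [...docs].sort(
  (x, y) => (x.age - y.age) || x.id.localeCompare(y.id)
);
// deterministic: always [{ id: 'a', ... }, { id: 'b', ... }]
```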


It would probably be a good idea to write a paper at some point; it's simply easier to read a document explaining the algorithm with some pseudocode than to dig through an actual codebase with all the messy language details in between the parts that actually matter.


I understand that reading the plain source code is more painful than reading a paper.

There are many different trade-offs between a paper and the current repository with source code. For me, the biggest argument was that EventReduce is a performance optimization. So to be sure that it really works and is faster, you always need an implementation, since you cannot predict the performance from a paper.

Because I did not have time for both, I only created the repository with the implementation. Maybe a paper will be published afterwards.


The value of a paper is the peer review by experts.


So if I understand you, it's for append-only data: probably no updates, definitely no deletions? I still don't get how you don't need logical clocks to pick out the delta(s), but thanks for your prompt answer.

Edit: your example uses replaceExisting(), so that supports an update of some kind.


No, it is explicitly not for append-only data. It works with inserts, updates and deletes. I think I have trouble understanding what exactly you mean by the need for a logical clock. The algorithm is fed the old query results plus one event, and then returns the new query results. Since there is only one event at each point in time, it does not have to order or maintain them.
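Updates are the interesting case, because an update can move a document into or out of the result set. A sketch of the four cases (illustrative names, not the real API):

```javascript
// Sketch: applying an UPDATE event to cached query results.
// Four cases, depending on whether the doc was in the results
// before and whether it matches the query after the update.
function applyUpdate(oldResults, updatedDoc, matchesQuery) {
  const wasInResults = oldResults.some(d => d.id === updatedDoc.id);
  if (wasInResults && matchesQuery)          // update in place
    return oldResults.map(d => (d.id === updatedDoc.id ? updatedDoc : d));
  if (wasInResults && !matchesQuery)         // doc left the result set
    return oldResults.filter(d => d.id !== updatedDoc.id);
  if (!wasInResults && matchesQuery)         // doc entered the result set
    return oldResults.concat(updatedDoc);
  return oldResults;                         // never matched, still does not
}

// query: age > 18 -- an update drops the doc below the threshold
const matcher = d => d.age > 18;
const updated = { id: 1, age: 15 };
const newResults = applyUpdate([{ id: 1, age: 20 }], updated, matcher(updated));
// newResults: [] -- the document left the result set
```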


I had the same question as #2. Basically, it has to be the front end to any event that reads/writes the data, in strict order of occurrence?


Not exactly. To use EventReduce you must have a change stream of all writes to your data, in the correct order. You can get that by wrapping a frontend around your database.

But it is easier to use a database that already provides a change stream, like CouchDB, Postgres, MongoDB and so on.


The BDDs you are using: are they zero-suppressed decision diagrams, or was it not necessary to do these kinds of optimizations?


Yes, the BDD is minimized with the two rules (reduction and elimination). Also, the ordering of the boolean functions is optimized via plain brute force.

There was no good JavaScript implementation of BDDs, so I had to create my own: https://github.com/pubkey/binary-decision-diagram


Cool, I've checked your code and it's not zero-suppressed BDDs, although ZDDs might not have been a performance gain in your case. (Zero-suppressed BDDs are very good at representing sets of permutations/combinations etc., but as far as I understand you have all the possible permutations encoded in the BDD, not a subset of all possible permutations.)


Thanks for pointing that out. I had never heard of materialize.io, but I will dive into it. At first glance it looks like Materialize is more of a full product, while EventReduce is just something that you use on top of your existing solutions.


I do not think so. If you check the example schema, it is very hard to understand. What does ':lat' mean? Is it a string or a number? What does '!b' mean? Can it only be false?


If you read the whole example:

lat = custom validator

b = shorthand for boolean (as is s for string)

! = optional

So for me, after reading the whole example, it's very easy to understand and digest.


Yes, reading the docs explains what these keywords mean. But by just looking at the schema it is impossible to understand what it is. And if you check the definition of clean code, which is something like "intuitively understandable", then it becomes clear that this is not clean.


Ah I see where you’re coming from. Good point.


The “!” for optional is the only thing that confused me. Why not “?” for optional?


Updated to ? in 1.0.3


Nice, that makes more sense!

