Hacker News | ankoh's comments

This has to be ruled out first: https://github.com/whatwg/fs/issues/7#issuecomment-116176851...

...but then the OPFS will be quite a decent fit. We (DuckDB-Wasm) are also looking closely at OPFS.

IMHO the requirement here is not even to get to full ACID.

With OPFS, we would get something close to IndexedDB on steroids, bypassing the JS heap limits through out-of-core operators.

After all, we are still running in a browser.

So I see the value of Wasm-based databases to be a front-facing accelerator, not a substitute for robust storage solutions.


We're still on a journey to explore which APIs work best with JavaScript, but the differences between Wasm and Node are deliberate. DuckDB-Wasm has an isolated Wasm heap and runs in a separate web worker. That's why we serialize everything as Arrow IPC buffers and pass the ArrayBuffer through the worker's message API as a transferable. On Node.js, we can interact with DuckDB much more directly and don't want to pay the price of the IPC stream every time. The truth is, we tried quite a few different APIs, and this one turned out to be a good trade-off between efficiency and convenience.
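The transfer itself looks roughly like this (a hedged sketch, not the actual DuckDB-Wasm code; it uses Node's MessageChannel, but a browser Worker's postMessage has the same transferable semantics):

```typescript
import { MessageChannel } from "node:worker_threads";

const { port1, port2 } = new MessageChannel();

// Pretend this is an Arrow IPC stream serialized inside the worker.
const ipcBuffer = new ArrayBuffer(1024);

port2.on("message", (buf: ArrayBuffer) => {
  console.log(`received ${buf.byteLength} bytes`);
  port1.close();
});

// Listing the buffer in the transfer list detaches it from the sender
// instead of structured-cloning it, so no copy is made.
port1.postMessage(ipcBuffer, [ipcBuffer]);

// The sender's view is detached immediately after the transfer.
console.log(`sender side after transfer: ${ipcBuffer.byteLength} bytes`);
```

The receiver gets the full 1024 bytes, while the sender's buffer is left with a byteLength of 0.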

But I'm happy to discuss this further if you have any suggestions.


The author outlines many problems that you'll run into when implementing a persistent storage backend using the current browser APIs.

We faced many of them ourselves but paused further work on an IndexedDB backend due to the lack of synchronous IndexedDB APIs (e.g., see the warning here: https://developer.mozilla.org/en-US/docs/Web/API/IDBDatabase...). The author bypasses this issue using SharedArrayBuffers, which would lock DuckDB-Wasm to cross-origin-isolated sites. (See the "Multithreading" section in our blog post.)

We might be able to lift this limitation in the future, but this has far-reaching implications for the query execution of DuckDB itself.

To the best of my knowledge, there's just no way to do synchronous persistence efficiently right now that won't lock you to a single browser or to cross-origin isolation. But this will be part of our ongoing research.
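The general SharedArrayBuffer pattern looks roughly like this (a simplified sketch of the technique, not our actual code): the synchronous caller blocks on Atomics.wait while an async worker services the request and wakes it up.

```typescript
// One int32 flag in shared memory: 0 = pending, 1 = done.
const shared = new SharedArrayBuffer(4);
const flag = new Int32Array(shared);

// Worker side (async): finish the I/O request, then wake the waiter.
function completeRequest(): void {
  Atomics.store(flag, 0, 1);
  Atomics.notify(flag, 0);
}

// Synchronous side: block until the worker signals completion.
// Atomics.wait is forbidden on the browser main thread, which is one
// reason this pattern pushes you toward workers + cross-origin isolation.
function waitForRequest(): void {
  while (Atomics.load(flag, 0) === 0) {
    Atomics.wait(flag, 0, 0, 10 /* ms timeout */);
  }
}

// Single-threaded demo: signal first, so the wait returns immediately.
completeRequest();
waitForRequest();
```

In a real setup the two functions would of course run on different threads, both holding a view over the same SharedArrayBuffer.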


It depends.

Querying CSV files is particularly painful over the network since we still have to read the entire file even for a simple scan.

With Parquet, you would at least only have to read the columns of the group-by keys and aggregate arguments.

Try it out and share your experiences with us!


Yes, we do! DuckDB-Wasm can read files using HTTP range requests, very similar to sql.js-httpvfs from phiresky.

The blog post contains a few examples of how this can be used, for example, to partially query Parquet files over the network.

E.g. just visit shell.duckdb.org and enter:

select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parqu...' limit 10;
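Under the hood, these partial reads boil down to plain HTTP Range headers. A tiny sketch (the helper is made up for illustration, not part of our API):

```typescript
// Build the HTTP Range header for the bytes [offset, offset + length - 1],
// as it would be sent in a request like
//   fetch(url, { headers: { Range: rangeHeader(offset, length) } })
// e.g. to fetch only the Parquet footer, then only the needed column chunks.
function rangeHeader(offset: number, length: number): string {
  return `bytes=${offset}-${offset + length - 1}`;
}

console.log(rangeHeader(0, 1024)); // → bytes=0-1023
```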


NIIIICE! Data twitter was pretty excited about that cool SQLite trick - now you can turn it up a notch!


Is data twitter == #datatwitter, like Econ Twitter is #econtwitter?

If so, I have another cool community to follow!


It would be really cool to load DuckDB files too. sql.js-httpvfs seems convenient because it works on everything in the database, so you don't have to create indexes or set up keys and constraints in the client.


I agree! DuckDB-Wasm can already open DuckDB database files in the browser the very same way.


That's really neat! Can you control the cache too?


DuckDB-Wasm uses a traditional buffer manager and evicts pages using a combination of FIFO + LRU (to distinguish sequential scans from hot pages like the Parquet metadata).
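For illustration, the general idea can be sketched as a simplified 2Q-style cache (this is not our actual buffer manager, just the technique): new pages enter a FIFO queue, so a long sequential scan evicts its own pages, while a page referenced a second time is promoted to an LRU queue, so hot pages like the Parquet metadata survive the scan.

```typescript
class TwoQueueCache<K, V> {
  private fifo = new Map<K, V>(); // insertion order = FIFO order
  private lru = new Map<K, V>();  // insertion order = recency order

  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    if (this.fifo.has(key)) {
      // Second reference: promote from FIFO to LRU.
      const value = this.fifo.get(key)!;
      this.fifo.delete(key);
      this.lru.set(key, value);
      return value;
    }
    if (this.lru.has(key)) {
      // Refresh recency by re-inserting at the back.
      const value = this.lru.get(key)!;
      this.lru.delete(key);
      this.lru.set(key, value);
      return value;
    }
    return undefined;
  }

  put(key: K, value: V): void {
    if (this.get(key) !== undefined) return; // already cached (and promoted)
    if (this.fifo.size + this.lru.size >= this.capacity) {
      // Evict scan pages (FIFO) first; fall back to the coldest LRU entry.
      const victim = this.fifo.size > 0 ? this.fifo : this.lru;
      victim.delete(victim.keys().next().value!);
    }
    this.fifo.set(key, value);
  }
}

const cache = new TwoQueueCache<string, number>(2);
cache.put("meta", 1); // e.g. a Parquet metadata page
cache.put("p1", 2);   // a scan page
cache.get("meta");    // second touch: "meta" is promoted to the LRU queue
cache.put("p2", 3);   // evicts "p1" from the FIFO queue; "meta" survives
```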


DuckDB-Wasm is targeting the browser, so it's not directly competing with Pandas (that's the job of native DuckDB).

It's targeting use cases where you want to push analytical computation away from servers into the client (browser).

Let me sketch two examples:

A) You have your data sitting in S3, and the user-specific portion is of a browser-manageable size.

(E.g., this paper from Tableau Research actually states dataset sizes that should fall into that category: https://research.tableau.com/sites/default/files/get_real.pd...)

In that scenario, you could eliminate your central server if your clients are smart enough.

B) You are talking about larger dataset sizes (GB) and want to explore them ad-hoc in your browser.

Uploading them is unrealistic and installing additional software is no longer ad-hoc enough.


The WebAssembly module is 1.6 to 1.8 MB brotli-compressed, depending on the Wasm feature set. We're currently investigating ways to reduce this to around 1 MB. We also use streaming instantiation, which means the WebAssembly module is compiled while it is still downloading. But still, it will hurt a bit more than a 40 KB library.
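For context, instantiating a Wasm module is a one-liner: in the browser, WebAssembly.instantiateStreaming(fetch(url)) compiles the bytes as they arrive. A self-contained sketch (using a tiny hand-written module instead of a fetch, so it runs anywhere):

```typescript
// A minimal Wasm module exporting add(a: i32, b: i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, // code: local.get 0,
  0x6a, 0x0b,                                           //   local.get 1, i32.add
]);

// In a browser you would prefer:
//   const { instance } = await WebAssembly.instantiateStreaming(fetch(url));
const module = new WebAssembly.Module(bytes);
const instance = new WebAssembly.Instance(module);
const add = instance.exports.add as (a: number, b: number) => number;

console.log(add(2, 3)); // → 5
```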

Regarding multi-tab usage: not today. The available filesystem APIs make it difficult to implement this right now. We're looking into ways to make DuckDB-Wasm persistent, but we can only read in this release.

