>Erlich: You say that like it's a bad thing. Richard, if you're not an asshole, it creates this kind of asshole vacuum and that void is filled by other assholes, like Jared.
(Note: I don't necessarily entirely agree with all of it.)
Unless you have several hundred million documents, just write a simple encoder that serializes the embedding vectors to a flat binary file.
Writing code from scratch to process and search 200k unstructured documents -- parsing, cleaning, chunking, OpenAI embedding API, serialization code, linear search with cosine similarity, and the actual time to debug, test and run all this -- took me less than 3 hours in Go. The flat binary representation of all vectors is under 500 MB. I even went ahead and made it mmap-friendly for the fun of it, even though I could read it all into memory.
Even the dumb linear search I wrote takes just 20-30ms per query on my MacBook for the 200k documents. The search results are fantastic.
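A brute-force search of this kind fits in a few dozen lines. The sketch below is an assumption about the general shape, not the actual code: `linearSearch`, `cosine`, and `hit` are hypothetical names, the vectors are assumed to live in one flat float32 slice, and it runs on a single goroutine for clarity.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// hit pairs a document index with its similarity to the query.
type hit struct {
	Index int
	Score float64
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if either vector has zero length.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// linearSearch scores every stored vector against the query and returns
// the k best matches, highest similarity first.
func linearSearch(flat []float32, dim int, query []float32, k int) []hit {
	n := len(flat) / dim
	hits := make([]hit, n)
	for i := 0; i < n; i++ {
		hits[i] = hit{Index: i, Score: cosine(flat[i*dim:(i+1)*dim], query)}
	}
	sort.Slice(hits, func(a, b int) bool { return hits[a].Score > hits[b].Score })
	if k > n {
		k = n
	}
	return hits[:k]
}

func main() {
	// Three 2-dimensional "documents": (1,0), (0,1), (1,1).
	flat := []float32{1, 0, 0, 1, 1, 1}
	for _, h := range linearSearch(flat, 2, []float32{1, 0.1}, 2) {
		fmt.Printf("doc %d score %.3f\n", h.Index, h.Score)
	}
}
```

Splitting the index range across goroutines and merging the per-worker top-k lists is the obvious next step if one core is not fast enough.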
I didn't bother cleaning it so it's just a code dump, but it's fairly straightforward. Not included are a Python script to parse and clean the raw documents into JSON files (used in `summarize` to output results), code to read these files and get the embeddings from OpenAI for use in `newEmbeddingJSON`, and a bunch of random parallelization shell scripts that I didn't save.
To use it, I call `newDBFromJSON` from a directory of JSON embedding vectors and serialize the binary representation. This takes a few minutes, mostly because parsing JSON is slow, but I only needed to do this once. When I need to search for the top 10 documents most similar to document X, I call `search` with the embedding vector for that doc. Alternatively, if I need to do semantic search with natural language, I'll call the OpenAI API to get the embedding vector for the query and call `search` with that vector. It's pretty fast thanks to Go concurrency maxing out my CPU. It's super accurate with the search results thanks to OpenAI's embeddings.
It's nowhere close to production-ready (it's littered with panics), but it was good enough for me.
Hope this helps!
Edit: oh and don't use float64 (OpenAI's vectors are float16)
You could apply the same logic to anyone ringing the alarm bells about climate change (or alarm bells in general). Just because snake oil is usually unfalsifiable doesn't mean everything _currently_ unfalsified is snake oil.
To be fair, a large amount of the doomsaying around climate change has been proven to be hyperbole and wrong. The data around climate change is very real, but the most vocal people about it have been saying stuff like cities being underwater and ice caps melting and extreme weather like never before seen, and all in the timescale of 10-20 years. And this is going back to the 60s. We're now 60 years in the future and things are basically the same.
Climate change doesn't impact every part of the world equally. If the temperature rises a couple of degrees in the US Midwest, it's probably not going to be catastrophic. If it rises a couple of degrees in Bangladesh, it's going to make the place unlivable.
Personally, living in a hot tropical country, I've experienced weather patterns becoming weirder and weirder, and generally trending towards way too hot. Last year, the summer was so long and dry and hot that I really felt that I can't physically live here any longer. And that's when I had the luxury of AC - something the vast majority of my country can't afford.
So yeah, it's easy to dismiss climate change if you live in a cold climate. It's much more real in warm, dry countries.
I was just listing the goals I gave it in temporal order, but I’ll include a weak task for the giggles in the future when talking about the POC. Good suggestion!
1. Global temperature is rising. Any water that used to be above 4°C (or the equivalent for salt water) now takes more volume. But what about water that's below 4°C? Wouldn't that water compress? What portion of the ocean is colder than 4°C? I'd imagine all of the Arctic zone and near Antarctica is, which is an awful lot of water to offset the tropical water. An even more nuanced perspective would be to look at the actual temperature distribution and use it to weight the dV/dT of water at different temperatures.
2. Global temperature is rising. Ice is melting into water. This is new mass entering the ocean. Why can't the sea bed, which is under more pressure due to extra mass, expand? The sea bed isn't a steel utensil, it's sand and rocks. And it's constantly shifting. And its composition is different in different regions.
Has anyone done a detailed computer simulation of the whole earth's geology under rising temperatures? There might be feedback loops that might amplify or negate some effects so it's quite important to account for all variables. And obviously there's a lot of variables to account for.
(It's sometimes scary how little we know about our own planet. I'm not talking about the things in my comment because I'm sure it's just my ignorance.)
> 1. Global temperature is rising. Any water that used to be above 4°C (or the equivalent for salt water) now takes more volume. But what about water that's below 4°C? Wouldn't that water compress?
It might marginally, but the amount of water compressed will be a lot smaller than the amount of water "uncompressed", both due to the larger range of temperatures above 4°C (below is only a bit, and then ice) and due to the fact that the densest water is at the bottom, meaning that it's easier for everything else to heat up.
> 2. Global temperature is rising. Ice is melting into water. This is new mass entering the ocean. Why can't the sea bed, which is under more pressure due to extra mass, expand?
Expand where? It can't just rise since there is gravity, and the pressure above increases with more water. It can't go down because there is already other stuff there.
In regard to that last point, I hadn't thought of it, but there is something called post-glacial rebound. It would make sense that the increased weight on the seabed would deform it, and potentially even cause bulging of land without ocean on top, potentially negating the sea level rise effect to varying degrees worldwide.
No, water is most dense at 4°C. If you take water at 2°C and increase its temperature by a degree, it will _compress_, not expand. But if you take water at 10°C and heat it by a degree it will expand. My question is what percentage of the expansion is offset by the compression.
(Note that the 4°C number is only for pure water.)
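The sign flip around 4°C can be checked with a few lines of Go. This is a rough sketch for pure water at atmospheric pressure only: the density formula is a commonly quoted empirical fit whose coefficients are an assumption of this example, not something from the thread, and `density` and `volumeChange` are hypothetical helper names. It says nothing about salt water or about how ocean temperatures are actually distributed.

```go
package main

import "fmt"

// density returns an approximate density of pure water in kg/m^3 at t
// degrees Celsius. The coefficients come from an empirical fit and are an
// assumption of this sketch; the maximum sits near 3.99 °C.
func density(t float64) float64 {
	d := t - 3.9863
	return 1000 * (1 - (t+288.9414)/(508929.2*(t+68.12963))*d*d)
}

// volumeChange returns the relative volume change of a fixed mass of
// water warmed from t to t+dT. Negative means the water contracts.
func volumeChange(t, dT float64) float64 {
	return density(t)/density(t+dT) - 1
}

func main() {
	for _, t := range []float64{0, 2, 4, 10, 20} {
		fmt.Printf("%4.0f °C warmed by 1 °C: relative volume change %+.2e\n",
			t, volumeChange(t, 1))
	}
}
```

With this fit, warming 2°C water by one degree shrinks it slightly, while warming 10°C water by one degree expands it by a larger amount. Answering the actual question would mean weighting these per-degree changes by how much ocean water sits at each temperature and salinity.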
> water still weights the same
Weight has nothing to do with my first point. It's the increase in volume that spills into land.
UTC is non-unique because of leap seconds, so TAI + lat + long + altitude is actually required.
I work on software for astronomy, and that quadruplet is what’s used to describe an observing location. You can actually get a little in trouble because of changes in the shape of the earth over time, so latitude and longitude and altitude need to be treated as time-dependent values, which matters once you are accounting for relativistic effects.
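On the leap-second point, a small Go sketch shows the ambiguity as it appears in ordinary software. It relies only on the standard `time` package, which, like POSIX time, has no representation for second 60; `leapAndNext` is a hypothetical helper name. This illustrates the usual Unix-style encoding of UTC, not the UTC standard itself.

```go
package main

import (
	"fmt"
	"time"
)

// leapAndNext returns the Unix timestamps Go assigns to the leap second
// 2016-12-31T23:59:60Z and to the following second, 2017-01-01T00:00:00Z.
func leapAndNext() (leap, next int64) {
	// time.Date normalizes out-of-range fields, so second 60 rolls over
	// into the next minute instead of being kept as a distinct instant.
	l := time.Date(2016, time.December, 31, 23, 59, 60, 0, time.UTC)
	n := time.Date(2017, time.January, 1, 0, 0, 0, 0, time.UTC)
	return l.Unix(), n.Unix()
}

func main() {
	leap, next := leapAndNext()
	fmt.Println(leap, next, leap == next) // 1483228800 1483228800 true
}
```

Two physically distinct seconds end up with the same timestamp, which is the reason a continuous scale such as TAI is preferred when the exact instant matters.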
That sort of assumes there's one time zone that's being used per spacetime coordinate, which isn't guaranteed. You can get political situations where de facto and de jure time diverge, or where different authorities nominally in charge of the time in a place disagree.
Lebanon seems to have experienced exactly this recently:
It does unambiguously give you a spacetime coordinate (useful) but it doesn't unambiguously tell you what local time you should use for an occurrence, and the answer would really depend on who was asking.
> Storing (UTC, latitude, longitude, altitude) is the holy grail, I guess.
It’s not.
If I set up a meeting next year in NYC at 10 and the legislature decides to change the timezone’s offset, the meeting remains at 10 NY time on that date, it does not shift in NY time. It’s the UTC which shifts.
And UTC alone is sufficient for past events, as they are fixed instants in the time-stream. Unless you’re at a stage where you need to take relativistic effects into account, in which case you also need to record the reference frame.