what bothers me is not that this issue will certainly disappear now that it has been identified, but that we have yet to identify the category of these "stupid" bugs ...
We already know exactly what causes these bugs. They are not a fundamental problem of LLMs, they are a problem of tokenizers. The actual model simply doesn't get to see the same text that you see. It can only infer this stuff from related info it was trained on. It's as if someone asked you how many 1s there are in the binary representation of this text. You'd also need to convert it first to think it through, or use some external tool, even though your computer never saw anything else.
> It's as if someone asked you how many 1s there are in the binary representation of this text.
I'm actually kinda pleased with how close I guessed! I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964.
Then I ran your message through a program to get the actual number, and turns out it has 1800 exactly.
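For the curious, the actual count is a one-liner. A minimal sketch, assuming "binary representation" means the UTF-8 bytes of the text:

```python
# Count set bits across the UTF-8 bytes of a message.
def popcount(text: str) -> int:
    return sum(bin(b).count("1") for b in text.encode("utf-8"))

# "A" is 0x41 = 0b1000001, so it contributes 2 set bits.
print(popcount("A"))  # 2
```

1800 bits over 491 characters works out to about 3.67 set bits per character, so the 4-per-character estimate was a reasonable heuristic.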
>I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964
And that's exactly the kind of reasoning an LLM does when you ask it about characters in a word. It doesn't come from the word, it comes from other heuristics it picked up during training.
Okay, genuinely not an expert on the latest with LLMs, but isn't tokenization an inherent part of LLM construction? Kind of like support vectors in SVMs, or nodes in neural networks? Once we remove tokenization from the equation, aren't we no longer talking about LLMs?
It's not a side effect of tokenization per se, but of the tokenizers people use in actual practice. If somebody really wanted an LLM that can flawlessly count letters in words, they could train one with a naive tokenizer (like just ascii characters). But the resulting model would be very bad (for its size) at language or reasoning tasks.
Basically it's an engineering tradeoff. There is more demand for LLMs that can solve open math problems, but can't count the Rs in strawberry, than there is for models that can count letters but are bad at everything else.
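The tradeoff is easy to see with a toy comparison. The subword split below is made up for illustration, not a real BPE vocabulary:

```python
# A character-level "tokenizer" sees every letter, while a BPE-style
# tokenizer sees opaque chunks.
word = "strawberry"

char_tokens = list(word)          # ['s','t','r','a','w','b','e','r','r','y']
bpe_tokens = ["straw", "berry"]   # hypothetical subword split

# Counting letters is trivial at character level...
print(char_tokens.count("r"))     # 3

# ...but "r" never appears as its own token in the subword view, so a
# model working on token IDs must infer the count from what it
# memorized about the chunks during training.
print(bpe_tokens.count("r"))      # 0
```

The cost of the character-level view is far longer sequences for the same text, which is why practical tokenizers don't work that way.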
181.78.46.78.in-addr.arpa domain name pointer min2max.run.
The domain's authoritative nameserver (Infomaniak) points vivianvoss.net at 78.46.78.181 — a Hetzner box in Germany with rDNS min2max.run. That server redirects HTTP to SafeBrowse.io and responds to TLS handshakes with garbage. Not a local issue, not a DNS hijack — the A record itself is wrong.
And the logs show it is going to the same address:
* Established connection to vivianvoss.net (78.46.78.181 port 443) from 172.16.245.55 port 36208
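For anyone unfamiliar with the `in-addr.arpa` name in that `host` output: the reverse-DNS zone name is just the IPv4 octets reversed, which Python's standard library can compute directly:

```python
import ipaddress

# The rDNS pointer name for an IPv4 address is its octets reversed
# under the in-addr.arpa zone.
addr = ipaddress.ip_address("78.46.78.181")
print(addr.reverse_pointer)  # 181.78.46.78.in-addr.arpa
```

That matches the PTR lookup quoted above, confirming the log's 78.46.78.181 and the rDNS record refer to the same host.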
Any chance you're a Comcast Xfinity customer? Searching for safebrowse.io shows that Xfinity "Advanced Security" does this whole redirect to safebrowse.io.
--
Unrelated, but the site also returns an AAAA record for an IPv6 address that does not work. So they've misconfigured their server in that regard as well.
This looks very awesome. Can someone tell me why there is no chatter about this? Is there something else out there that blows this out of the water in terms of ease of use and access to sample many LLMs?
Ollama provides a web server with API that just works out of the box, which is great when you want to integrate multiple applications (potentially distributed on smaller edge devices) with LLMs that run on a single beefy machine.
In my home I have a large gaming rig that sometimes runs Ollama+Open WebUI, then I also have a bunch of other services running on a smaller server and a Raspberry Pi which reach out to Ollama for their LLM inference needs.
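That kind of setup only needs the standard library on the client side. A minimal sketch of calling Ollama's `/api/generate` endpoint from another machine, assuming the default port 11434; the host name and model below are placeholders:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks Ollama for a single JSON response
    # instead of a stream of partial chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(host: str, model: str, prompt: str) -> str:
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}:11434/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# e.g. generate("gaming-rig.local", "llama2", "Why is the sky blue?")
```

Because it's plain HTTP, the same call works from a Raspberry Pi or any other small device on the network.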
Are you talking about the Hugging Face Python libraries, the Hugging Face hosted inference APIs, the Hugging Face web interfaces, the Hugging Face iPhone app, Hugging Face Spaces (hosted Docker environments with GPU access) or something else?
But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), and it doesn't support features like CPU offloading.
I think the biggest selling point of Ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you can get a significant performance boost.
Does ollama/llama.cpp provide low bit operations (avx or cuda kernels) to speed up inference? Or just model compression with inference still done in fp16?
My understanding is that modern quantization algorithms are typically implemented in PyTorch.
The only thing I know (from using it) is that with quantization I can fit models like Llama 2 13B in my 24 GB of VRAM when I use q8 (16 GB) instead of fp16 (26 GB). This means I get nearly the full quality of Llama 2 13B's output while still being able to use only my GPU, without the need for very slow inference on CPU+RAM alone.
And the models are quantized before inference, so I'd only download 16GB for the llama2 13b q8 instead of the full 26GB, which means it's not done on the fly.
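The sizes above follow from simple bytes-per-weight arithmetic. A rough sketch (real GGUF quants add per-block overhead, and the ~16 GB figure above also includes context and runtime buffers, all ignored here):

```python
# Back-of-the-envelope weight sizes for a 13B-parameter model.
# fp16 stores 2 bytes per weight; q8 roughly 1 byte per weight.
params = 13e9

fp16_gb = params * 2 / 1e9
q8_gb = params * 1 / 1e9

print(fp16_gb)  # 26.0
print(q8_gb)    # 13.0
```

The same arithmetic explains why q4 variants roughly halve the footprint again, at a larger quality cost.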
As an aside, even gpt4 level quality does not feel satisfactory to me lately. I can’t imagine willingly using models as dumb as llama2-13b. What do you do with it?
Yeah, I agree. Every time a new model releases I download the highest quantization (or fp16) that fits into my VRAM, test it out with a few prompts, and then realize that downloadable models are still not as good as the closed ones (except speed-wise).
I don't know why I still do it, but every time I read so many comments about how good model X is and how it outperforms everything else, I want to see it for myself.
Ollama is really well organized: it relies on llama.cpp, but the UX and organization it provides make it legit. We recently made a one-click wizard to run Open WebUI and Ollama together, self-hosted locally but remotely accessible [1]
LM Studio is a lot more user friendly, probably the easiest UI to use out there. No terminal nonsense, no manual to read. Just double click and chat. It even explains to you what the model names mean (e.g. the difference between Q4_1, Q4_K, Q4_K_M... for whatever reason all the other tools assume you know what that means).
Why do you think there is no chatter about this? There have been hundreds of posts about ollama on HN. This is a point release of an already well known project.
I use a mix of llama.cpp directly via my own Python bindings and via llama-cpp-python for function calling and full control over parameters and loading, but otherwise Ollama is just great for ease of use. There's really no reason not to use it if you just want to load GGUF models and don't have any intricate requirements.
> ... if a statement can be proved, it also has a zero-knowledge proof.
Mind blown.
>Feeding the pseudorandom bits (instead of the random ones) into a probabilistic algorithm will result in an efficient deterministic one for the same problem.
This is nuts. AI is a probabilistic computation ... so what they're saying, if I'm reading this right, is that we can reduce the complexity of our current models by orders of magnitude.
If I'm living in noobspace someone please pull me out.
I don't know exactly what it's saying but it definitely isn't that. AI already uses pseudorandom numbers and is deterministic. (Except some weird AI accelerator chips that use analogue computation to improve efficiency.)
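The point about AI already being deterministic comes down to seeded pseudorandomness, which is easy to demonstrate:

```python
import random

# A seeded PRNG is fully deterministic: the same seed reproduces the
# same "random" draws. This is why sampling from an LLM with a fixed
# seed (and fixed hardware/kernels) gives repeatable output.
random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)
second = [random.random() for _ in range(3)]

print(first == second)  # True
```

In practice LLM outputs can still vary run to run, but that comes from unseeded sampling or non-deterministic GPU kernels, not from any true randomness in the model.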
> AI is a probabilistic computation ... so what they're saying - if i'm reading this right - is that we can reduce the complexity of our current models by orders of magnitude.
Unfortunately, no. First, the result applies to decision problems, not search problems. Second, the resulting deterministic algorithm is much less efficient than the randomized algorithm, although it still belongs to the same complexity class (under some assumptions).
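For context, one classical formulation of this hardness-vs-randomness tradeoff (a hedged paraphrase of the Impagliazzo–Wigderson line of results, which may differ in detail from the specific paper being discussed):

```latex
\text{If some language in } E = \mathrm{DTIME}\!\left(2^{O(n)}\right)
\text{ requires Boolean circuits of size } 2^{\Omega(n)},
\text{ then } \mathrm{BPP} = \mathrm{P}.
```

The derandomized algorithm is still polynomial time, but with a large polynomial blow-up over the randomized one, which is why "same complexity class" does not translate into a practical speedup.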
Nature says eat or be eaten (or die). Now that we are on top of the food chain, it's useful, for mental health and societal reasons, not to want to rip everyone's throat out, even though every other species (including us) continues to do so. You tend to steer and veer towards what you look at. The good news is we get to choose what we look at... the bad news is we (statistically) choose wrong. Race car drivers don't look at the wall when they drive, because when they do, they tend to hit it. Stop looking at the wall, guys ...
Rather than having selections for multiple languages (for each task), it seems like language detection or a selection/setup screen would be best, with a fallback to English or whatever your default is. Maybe use online translation services?
edit: Oh it seems you do have a language drop down, but there are still multiple languages appearing in quests... this just means more quests I guess eh :)
This is exactly what I was looking for, but it seems very incomplete. Does anyone else know of a resource like this that isn't the first hit on Google?