Here's a summary of what's happened the past couple of years and what tools are out there.
After ChatGPT was released, there was a lot of hype in the space, but open source was far behind. IIRC the best open foundation LLM at the time was GPT-2, which was two generations behind.
A while later Meta released LLaMA[1], a well-trained base foundation model, which brought an explosion to open source. It was soon implemented in the Hugging Face Transformers library[2] and the weights spread across the Hugging Face Hub for anyone to use.
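For anyone curious what that looks like in practice, here's a minimal sketch of loading a LLaMA-family model through Transformers. The model id is a placeholder; substitute whichever weights repo you actually have access to on the Hub.

    # Minimal sketch: loading a LLaMA-family model via Hugging Face Transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "huggyllama/llama-7b"  # placeholder weights repo
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))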
At first, it was difficult to run locally. Few developers had the hardware, or the money for it, to run it: it required too much RAM, and IIRC Meta's original implementation didn't support running on the CPU. But developers soon came up with methods to make it smaller via quantization. The biggest project for this was Llama.cpp[3], which is probably still the biggest open source project for running LLMs locally today. Hugging Face Transformers also added quantization support through bitsandbytes[4].
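To give a sense of how low the barrier eventually got, here's roughly what 4-bit loading through bitsandbytes looks like in Transformers (the model id is again a placeholder):

    # Sketch: loading a model in 4-bit via bitsandbytes to cut memory use.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        "huggyllama/llama-7b",            # placeholder model id
        quantization_config=quant_config,
        device_map="auto",                # spread layers across available devices
    )

A 7b model that needs roughly 28GB in fp32 fits in roughly 4-5GB this way, which is what made consumer hardware viable.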
Over the next months there was rapid development in open source. Quantization techniques improved, so LLaMA could run with less and less RAM, at greater and greater accuracy, on more and more systems. Finetuning tools appeared, and hundreds of LLaMA finetunes followed, trained on instruction-following, RLHF, and chat datasets, which increased quality even further. During this time Stanford's Alpaca, LMSYS's Vicuna, Microsoft's WizardLM, 01.AI's Yi, Mistral, and a few others made their way onto the open LLM scene with some very good models and finetunes.
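Many of those finetunes were made with parameter-efficient methods like LoRA, since full finetuning was out of reach for most people. As a hedged sketch (hyperparameters illustrative, model id a placeholder), the PEFT library makes it about this simple:

    # Sketch: attaching LoRA adapters with the PEFT library for finetuning.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder
    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank updates
        lora_alpha=16,                        # scaling factor
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        # only adapter weights are trainable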
A new inference engine (software for running LLMs, like Llama.cpp or Transformers) called vLLM[5] came out, capable of serving LLMs far more efficiently than was previously possible in open source, chiefly through its paged KV-cache memory management ("PagedAttention") and continuous batching. Soon it would even get good AMD support, making it possible for those with AMD GPUs to run open LLMs locally and with relative efficiency.
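For reference, vLLM's offline API is about this simple (model id is a placeholder):

    # Sketch: batch inference with vLLM's offline API.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model id
    params = SamplingParams(temperature=0.8, max_tokens=64)
    outputs = llm.generate(["Explain paged attention in one sentence."], params)
    print(outputs[0].outputs[0].text)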
Then Meta released Llama 2[6]. Llama 2 was by far the best open LLM of its time. It shipped with RLHF instruction finetunes for chat, and with human evaluation data that put its open LLM leadership beyond doubt. Existing tools like Llama.cpp and Hugging Face Transformers quickly added support, and users had access to the best LLM open source had to offer.
At this point, despite all the advancements, it was still difficult to run LLMs. Llama.cpp and Transformers were great engines, but the setup process was difficult and time-consuming. You had to find the best LLM, quantize it in the best way for your computer (or figure out how to identify and download a pre-quantized one from Hugging Face), set up whatever engine you wanted, figure out how to use your quantized LLM with that engine, fix any mistakes you made along the way, and finally figure out how to prompt your specific LLM in a chat-like format.
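That last step, prompt formatting, was a common source of silent quality loss, since every finetune expected its own template (Alpaca's "### Instruction:" headers, Llama 2's [INST] tags, and so on). Transformers later grew a helper that renders the right format for you; a sketch, with a placeholder model id:

    # Sketch: letting the tokenizer render the model-specific chat format.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # placeholder
    messages = [{"role": "user", "content": "What is quantization?"}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(prompt)  # e.g. "<s>[INST] What is quantization? [/INST]"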
However, tools started coming out to make this process significantly easier. The first one I remember was GPT4All[7], a wrapper around Llama.cpp that was easy to install, let you pick the LLM you wanted (with pre-quantized options in a built-in download manager), and provided a chat UI that made LLMs easy to use. This significantly lowered the barrier to entry for anyone interested in using LLMs.
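Beyond the chat UI, GPT4All also ships Python bindings. A sketch (the model filename is illustrative and depends on what's in the current catalog):

    # Sketch: GPT4All's Python bindings; downloads a pre-quantized model on first use.
    from gpt4all import GPT4All  # pip install gpt4all

    model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # illustrative filename
    with model.chat_session():
        print(model.generate("Name three uses of local LLMs.", max_tokens=128))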
The second project I remember was Ollama[8]. Also a wrapper around Llama.cpp, Ollama offered most of what GPT4All did, but in an even simpler way. Today I believe Ollama is bigger than GPT4All, although it's missing some of GPT4All's higher-level features.
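Ollama's simplicity extends to its local REST API. This sketch assumes the server is running ("ollama serve") and the model has been pulled ("ollama pull llama3"):

    # Sketch: querying a local Ollama server over its HTTP API.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    )
    print(resp.json()["response"])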
Another important tool from this period is Exllama[9], an inference engine focused on modern consumer Nvidia GPUs, with advanced quantization support built on GPTQ. It is probably the best engine for squeezing performance out of consumer Nvidia GPUs.
Months later, Nvidia came out with its own inference engine, TensorRT-LLM[10]. It can run most LLMs and does so with extreme efficiency; it is the most efficient open source inference engine that exists for Nvidia GPUs. However, it also has the most difficult setup process of any inference engine, and it is built primarily for production use cases and Nvidia's datacenter AI GPUs, so don't expect it to work on your personal computer.
With the rumors of GPT-4 being a Mixture of Experts (MoE) LLM, research breakthroughs in MoE, and some small MoE LLMs coming out, interest in MoE was at an all-time high. Mistral, which had already proven itself with the very impressive Mistral 7b, capitalized on this interest by releasing Mixtral 8x7b[11], the best accuracy-for-its-size LLM the local LLM community had seen to date. MoE support was eventually added to all the inference engines, and it became a very popular mid-to-large sized LLM.
Cohere released its own LLM as well, Command R+[12], built specifically for RAG-related tasks with a context length of 128k. It's quite large and doesn't stand out on most benchmarks, but it has some interesting RAG features no other LLM has.
More recently, Llama 3[13] was released, and, like previous Llama releases, it blew every other open LLM out of the water. The smallest version (Llama 3 8b) has the best accuracy for its size of any open LLM, and the largest version released so far (Llama 3 70b) beats every other open LLM on almost every metric.
Less than a month ago, Google released Gemma 2[14]. The largest version performs very well under human evaluation despite being less than half the size of Llama 3 70b, though it performs only decently on automated benchmarks.
If you're looking for a tool to get started running LLMs locally, I'd go with either Ollama or GPT4All. They make the process about as painless as possible. I believe GPT4All has more built-in features, like using your local documents for RAG, but you can also pair Ollama with something like Open WebUI[15] to get the same functionality.
If you want to get into the weeds a bit and extract more performance out of your machine, I'd go with Llama.cpp, Exllama, or vLLM, depending on your system. If you have a normal consumer Nvidia GPU, I'd go with Exllama. If you have an AMD GPU that supports ROCm 5.7 or 6.0, I'd go with vLLM. For anything else, including running on your CPU or an M-series Mac, I'd go with Llama.cpp. TensorRT-LLM only makes sense if you have a datacenter-class Nvidia GPU like the A100, V100, A10, or H100.
[1] https://ai.meta.com/blog/large-language-model-llama-meta-ai/
[2] https://github.com/huggingface/transformers
[3] https://github.com/ggerganov/llama.cpp
[4] https://github.com/bitsandbytes-foundation/bitsandbytes
[5] https://github.com/vllm-project/vllm
[6] https://ai.meta.com/blog/llama-2/
[7] https://www.nomic.ai/gpt4all
[8] http://ollama.ai/
[9] https://github.com/turboderp/exllamav2
[10] https://github.com/NVIDIA/TensorRT-LLM
[11] https://mistral.ai/news/mixtral-of-experts/
[12] https://cohere.com/blog/command-r-plus-microsoft-azure
[13] https://ai.meta.com/blog/meta-llama-3/
[14] https://blog.google/technology/developers/google-gemma-2/
[15] https://github.com/open-webui/open-webui