Hacker News | danielhanchen's comments

Thinking / reasoning + multimodal + tool calling.

We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well!

Guide for those interested: https://unsloth.ai/docs/models/gemma-4

Also note: use temperature = 1.0, top_p = 0.95, top_k = 64, and the EOS is "<turn|>". "<|channel>thought\n" is also used for the thinking trace!


Daniel, your work is changing the world. More power to you.

I set up a pipeline for inference with OCR, full-text search, embedding, and summarization of land records dating back to the 1800s. All powered by the GGUFs you generate and llama.cpp. People are so excited that they can now search the records in multiple languages that a 1-minute wait to process a document seems like nothing. Thank you!


Oh appreciate it!

Oh nice! That sounds fantastic! I hope Gemma-4 will make it even better! The small ones 2B and 4B are shockingly good haha!


Just switched from 3.1 Flash Lite to Gemma-4 31B on the AI Studio API, since there's a generous 1,500/day limit on non-billed projects. It's doing fantastically.

Hey, I'm really interested in your pipeline techniques. I've got some PDFs I need processed, but processing them in the cloud with the big providers requires redaction.

Wondering if a local model or a self hosted one would work just as well.


I run llama.cpp with Qwen3-VL-8B-Instruct-Q4_K_S.gguf and mmproj-F16.gguf for OCR and translation. I also run llama.cpp with Qwen3-Embedding-0.6B-GGUF for embeddings. Drupal 11 with ai_provider_ollama and a custom provider, ai_provider_llama (heavily derived from ai_provider_ollama), with PostgreSQL and pgvector.

People on site scan the documents and upload them for archival. The directory monitor looks for new files in the archive directories, and once a new file is available, it is uploaded to Drupal. Once new content is created in Drupal, Drupal triggers the translation and embedding process through llama.cpp. Qwen3-VL-8B is also used for chat and RAG. The client is familiar with Drupal and CMSes in general and wanted to stay in a similar environment. If you are starting new, I would recommend looking at docling.


Are you linking any of the processes using the Drupal AI module suite?

Yes, they are all linked using Drupal's AI modules. I have an OpenCV application that removes the old paper look, enhances the contrast and fixes the orientation of the images before they hit llama.cpp for OCR and translation.

Disclaimer: I'm an AI novice relative to many here. FWIW, last weekend I spent a couple of hours setting up self-hosted n8n with ollama and gemma3:4b [EDIT: not Qwen-3.5], using PDF content extraction for my PoC. 100% local workflow, no runtime dependency on cloud providers. I doubt it'd scale very well (MacBook Air M4, measly 16GB RAM), but it works as intended.

For those who wish to do OCR on photos, like receipts, or PDFs or anything really, Paperless-NGX works amazingly well and runs on a potato.

How do you extract the content? OCR? Pdf to text then feed into qwen?

I tried something similar where I needed a bunch of tables extracted from a PDF over about 40 pages. It was crazy slow on my MacBook and inaccurate.


If you have a basic ARM MacBook, GLM-OCR is the best single model I have found for OCR with good table extraction/formatting. It's a compact 0.9b parameter model, so it'll run on systems with only 8 GB of RAM.

https://github.com/zai-org/GLM-OCR

Use mlx-vlm for inference:

https://github.com/zai-org/GLM-OCR/blob/main/examples/mlx-de...

Then you can run a single command to process your PDF:

  glmocr parse example.pdf

  Loading images: example.pdf
  Found 1 file(s)
  Starting Pipeline...
  Pipeline started!
  GLM-OCR initialized in self-hosted mode
  Using Pipeline (enable_layout=true)...

  === Parsing: example.pdf (1/1) ===
My test document contains scanned pages from a law textbook. It's two columns of text with a lot of footnotes. It took 60 seconds to process 5 pages on a MBP with M4 Max chip.

After it's done, you'll have a directory output/example/ that contains .md and .json files. The .md file will contain a markdown rendition of the complete document. The .json file will contain individual labeled regions from the document along with their transcriptions. If you get all the JSON objects with

  "label": "table"
from the JSON file, you can get an HTML-formatted table from each "content" section of these objects.
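Pulling those table regions out of the .json file is a few lines of stdlib Python. This is a sketch assuming the output shape described above (a list of objects with "label" and "content" keys); check your actual output file, since the real schema may nest these differently:

```python
import json

def extract_tables(json_text):
    """Collect the HTML 'content' of every region labeled 'table'
    from a GLM-OCR-style .json output."""
    regions = json.loads(json_text)
    return [r["content"] for r in regions if r.get("label") == "table"]
```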

It might still be inaccurate -- I don't know how challenging your original tables are -- but it shouldn't be terribly slow. The tables it produced for me were good.

I have also built more complex work flows that use a mixture of OCR-specialized models and general purpose VLM models like Qwen 3.5, along with software to coordinate and reconcile operations, but GLM-OCR by itself is the best first thing to try locally.


Thanks! Just tried it on a 40 page pdf. Seems to work for single images but the large pdf gives me connection timeouts

I also get connection timeouts on larger documents, but it automatically retries and completes. All the pages are processed when I'm done. However, I'm using the Python client SDK for larger documents rather than the basic glmocr command line tool. I'm not sure if that makes a difference.

Yeah, looks like the CLI retries as well. I was able to get it working using a higher timeout.

Cool! For GLM-OCR, do you use "Option 2: Self-host with vLLM / SGLang" and in that case, am I correct that there is no internet connection involved and hence connection timeouts would be avoided entirely?

When you self-host, there's still a client/server relationship between your self-hosted inference server and the client that manages the processing of individual pages. You can get timeouts depending on the configured timeouts, the speed of your inference server, and the complexity of the pages you're processing. But you can let the client retry and/or raise the initial timeout limit if you keep running into timeouts.
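The retry-and-raise-the-timeout pattern described here is generic enough to sketch. This is an illustration, not the GLM-OCR client's actual code; `fn` stands in for whatever issues the per-page request:

```python
def call_with_retry(fn, attempts=3, base_timeout=30.0, backoff=2.0):
    """Call fn(timeout=...) and retry on TimeoutError, multiplying the
    timeout each attempt -- useful when complex pages occasionally
    exceed a fixed per-request limit."""
    timeout = base_timeout
    for attempt in range(attempts):
        try:
            return fn(timeout=timeout)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            timeout *= backoff
```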

That said, this is already a small and fast model when hosted via MLX on macOS. If you run the inference server with a recent NVIDIA GPU and vLLM on Linux, it should be significantly faster. The big advantage with vLLM for OCR models is its continuous batching capability. Using other OCR models that I couldn't self-host on macOS, like DeepSeek 2 OCR or Chandra 2, vLLM gave dramatic throughput improvements on big documents via continuous batching when I processed 8-10 pages at a time. This is with a single 4090 GPU.


1. Correction: I'd planned to use Qwen-3.5 but ended up using gemma3:4b.

2. The n8n workflow passes a given binary pdf to gemma, which (based on a detailed prompt) analyzes it and produces JSON output.

See https://github.com/LinkedInLearning/build-with-ai-running-lo... if you want more details. :)


Python pdftools to convert to images and tesseract to OCR them to text files. Fast, free, and it runs on CPU.
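That CPU-only pipeline can be orchestrated by building the CLI invocations and running them with subprocess. The tool names here are assumptions (`pdftoppm` from poppler-utils plus the `tesseract` binary must be installed, and pdftoppm zero-pads page numbers for documents of 10+ pages), so treat this as a sketch:

```python
def ocr_commands(pdf_path, out_dir, pages, lang="eng"):
    """Build the argument lists for a render-then-OCR pipeline:
    pdftoppm renders each page to PNG, then tesseract OCRs each
    image to a .txt file. Run each list with subprocess.run()."""
    cmds = [["pdftoppm", "-png", pdf_path, f"{out_dir}/page"]]
    for i in range(1, pages + 1):
        img = f"{out_dir}/page-{i}.png"
        cmds.append(["tesseract", img, f"{out_dir}/page-{i}", "-l", lang])
    return cmds
```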

Seconded, would also love to hear your story if you would be willing

I'm very active in family history and this kind of project is massively helpful, thank you

This is a very interesting project. If it's publicly available, would you mind sharing it? I would love to understand how it works.

Ps: found your other comments, thanks.


> your work is changing the world

I realize this may have been hyperbole, but it sure isn't changing the world.


For relatively small values of changing or world, it sure is.

In the world of local models, Unsloth is one of the most significant projects there is.


I'm trying to disable "thinking", but it doesn't seem to work (in llama.cpp). The usual `--reasoning-budget 0` doesn't seem to change it, nor `--chat-template-kwargs '{"enable_thinking":false}'` (both with `--jinja`). Am I missing something?

EDIT: Ok, looks like there's yet another new flag for that in llama.cpp, and this one seems to work in this case: `--reasoning off`.

FWIW, I'm doing some initial tries of unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, and for writing some Nix, I'm VERY impressed - seems significantly better than qwen3.5-35b-a3b for me for now. Example commandline on a Macbook Air M4 32gb RAM:

  llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL  --temp 1.0 --top-p 0.95 --top-k 64 -fa on --no-mmproj --reasoning-budget 0 -c 32768 --jinja --reasoning off
(at release b8638, compiled with Nix)

Oh very cool! Will check the `--reasoning off` flag as well!

Yep the models are really good!


Daniel, I know you might hear this a lot but I really appreciate a lot of what you have been doing at Unsloth and the way you handle your communication, whether within hackernews/reddit.

I am not sure if someone might have asked this already to you, but I have a question (out of curiosity) as to which open source model you find best and also, which AI training team (Qwen/Gemini/Kimi/GLM) has cooperated the most with the Unsloth team and is friendly to work with from such perspective?


Thanks a lot for the support :)

Tbh Gemma-4 haha - it's sooooo good!!!

For teams - Google haha, definitely hands down, then Qwen, Meta (through PyTorch and Llama), and Mistral - tbh all labs are great!


Now you've gotten me a bit excited for Gemma-4. Definitely gonna see if I can run the Unsloth quants of this on my Mac Air. Thanks for responding to my comment :-)

Thanks! Have a super good day!!

llama.cpp (b8642) auto-fits ~200k context on this 24GB RX 7900 XTX & it shows a solid 100+ tok/s ("S_TG t/s") on the first 32k of it, nice!

    ./llama-batched-bench -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    -npp 1000,2000,4000,8000,16000,32000,64000,96000,128000 -ntg 128 -npl 1 -c 0
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    0.416 |  2404.87 |    1.064 |   120.29 |    1.480 |   762.20 |
    |  2000 |    128 |    1 |   2128 |    0.755 |  2649.86 |    1.075 |   119.04 |    1.830 |  1162.83 |
    |  4000 |    128 |    1 |   4128 |    1.501 |  2665.72 |    1.093 |   117.08 |    2.594 |  1591.49 |
    |  8000 |    128 |    1 |   8128 |    3.142 |  2545.85 |    1.114 |   114.87 |    4.257 |  1909.47 |
    | 16000 |    128 |    1 |  16128 |    6.908 |  2316.00 |    1.189 |   107.65 |    8.097 |  1991.73 |
    | 32000 |    128 |    1 |  32128 |   16.382 |  1953.31 |    1.278 |   100.12 |   17.661 |  1819.16 |
    | 64000 |    128 |    1 |  64128 |   43.427 |  1473.74 |    1.453 |    88.12 |   44.879 |  1428.89 |
    | 96000 |    128 |    1 |  96128 |   82.227 |  1167.50 |    1.623 |    78.86 |   83.850 |  1146.42 |
    |128000 |    128 |    1 | 128128 |  133.237 |   960.69 |    1.797 |    71.25 |  135.034 |   948.86 |

~50 tok/s on M1 Max 64GB

Oh nice that's pretty good!

FYI, the screenshot for the "Search and download Gemma 4" step in your guide is for Qwen3.5, and when I searched for gemma-4 in Unsloth Studio it only showed Gemma 3 models.

We're still updating it haha! Sorry! It's been quite complex to support new models without breaking old ones

Speaking of which, do you think Step 3.5 Flash is going to happen or should I stop holding my breath?

Oh quants - haha I can re-investigate it - just totally forgot about them

  and the EOS is "<turn|>". "<|channel>thought\n" is also used for the thinking trace!
Can someone explain this to me? Why is this faux-XML important here?

That’s how the model is trained to signal the end of its generation and to indicate its thinking.

These are likely individual tokens. They are super common.
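Since they are single tokens in the vocabulary, a client just stops decoding when the EOS token id appears. A generic sketch (not llama.cpp's actual loop; `next_token` stands in for a real model's sampler):

```python
def decode_until_eos(next_token, eos_id, max_tokens=256):
    """Keep sampling via next_token() until the end-of-turn token id
    appears or the budget runs out; return the collected token ids."""
    out = []
    for _ in range(max_tokens):
        tok = next_token()
        if tok == eos_id:
            break
        out.append(tok)
    return out
```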

Huge fan of the Unsloth quants! Having reasoning and tool calling this accessible locally is a massive leap forward.

The main hurdle I've found with local tool calling is managing the execution boundaries safely. I’ve started plugging these local models into PAIO to handle that. Since it acts as a hardened execution layer with strict BYOK sovereignty, it lets you actually utilize Gemma-4's tool calling capabilities without the low-level anxiety of a hallucination accidentally wiping your drive. It’s the perfect secure gateway for these advanced local models.


Hi! Do you ever make quants of the base models? I'm interested in experimenting with them in non-chat contexts.

Yes, they are listed on Hugging Face. The instruction-tuned models have 'it' in their name.

https://huggingface.co/collections/unsloth/gemma-4

Edit: Sorry, I'm not sure if this is a quant, but it says 'finetuned' from the Google Gemma 4 parent snapshot. It's the same size as the UD 8-bit quant though.


Only the 'it' models seem to have quants. I was really hoping to try a base model.

Basic quantization is easy if you have enough RAM (not VRAM) to load the weights.
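The core idea really is simple. A minimal sketch of symmetric absmax quantization on one block of weights (real GGUF formats add blocking, zero points, and bit-packing, so this is illustrative only):

```python
def quantize_block(values, bits=4):
    """Map floats to signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1]
    plus one shared per-block scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate floats; error per value is at most scale/2."""
    return [x * scale for x in q]
```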

Thank you for your work.

You have an answer on your page regarding "Should I pick 26B-A4B or 31B?", but can you please clarify: assuming 24GB VRAM, should I pick a full-precision smaller model or a 4-bit larger model?


Try 26B first. 31B seems to have a very heavy KV cache (maybe bugged in llama.cpp at the moment; 16K takes up 4.9GB).

edit: the 31B cache is not bugged; there's a static SWA cost of 3.6GB, so IQ4_XS at 15.2GB seems like a reasonable pairing, but even then it's barely enough for 64K on 24GB VRAM. Maybe 8-bit KV quantization is fine now that https://github.com/ggml-org/llama.cpp/pull/21038 got merged, so 100K+ is possible.
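For anyone doing this kind of budgeting themselves, the usual back-of-envelope KV-cache formula is keys plus values, one entry per layer, per KV head, per position. The configs below are hypothetical placeholders, not Gemma-4's actual architecture:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_elt=2):
    """Rough KV-cache size: 2 (K and V) * layers * KV heads * head dim
    * positions * element size (2 bytes for FP16, 1 for an 8-bit cache)."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elt

# Illustrative only: halving element size halves the cache, which is
# why an 8-bit KV cache roughly doubles the context that fits.
fp16_cache = kv_cache_bytes(48, 8, 128, 16384, bytes_per_elt=2)
q8_cache = kv_cache_bytes(48, 8, 128, 16384, bytes_per_elt=1)
```

Note that models with sliding-window attention (the "static SWA cost" above) cap the cache for some layers, so the real number can be well below this upper bound.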

> I should pick a full precision smaller model or 4 bit larger model?

4 bit larger model. You have to use quant either way -- even if by full precision you mean 8 bit, it's gonna be 26GB + overhead + chat context.

Try UD-Q4_K_XL.


Yes UD-Q4_K_XL works well! :)

what is the main difference between "normal" quants and the UD ones?

They explain it here:

https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

For the best-quality reply, I used the Gemma-4 31B UD-Q8_K_XL quant with Unsloth Studio to summarize the URL with web search. It produced 4.9 tok/s (including web search) on a MacBook Pro M1 Max with 64GB.

Here's an excerpt in its own words:

Unsloth Dynamic 2.0 Quantization

Dynamic 2.0 is not just a "bit-reduction" but an intelligent, per-layer optimization strategy.

- Selective Layer Quantization: Instead of making every layer 4-bit, Dynamic 2.0 analyzes every single layer and selectively adjusts the quantization type. Some critical layers may be kept at higher precision, while less critical layers are compressed more.

- Model-Specific Tailoring: The quantization scheme is custom-built for each model. For example, the layers selected for quantization in Gemma 3 are completely different from those in Llama 4.

- High-Quality Calibration: They use a hand-curated calibration dataset of >1.5M tokens specifically designed to enhance conversational chat performance, rather than just optimizing for Wikipedia-style text.

- Architecture Agnostic: While previous versions were mostly effective for MoE (Mixture of Experts) models, Dynamic 2.0 works for all architectures (both MoE and non-MoE).


Thank you!

I presume 26B is somewhat faster since it's only 4B activated - 31B is quite a large dense model, so more accurate!


This is one of the more confusing aspects of experimenting with local models as a noob. Given my GPU, which model should I use, which quantization of that model should I pick (unsloth tends to offer over a dozen!) and what context size should I use? Overestimate any of these, and the model just won't load and you have to trial-and-error your way to finding a good combination. The red/yellow/green indicators on huggingface.co are kind of nice, but you only know for sure when you try to load the model and allocate context.

Definitely Unsloth Studio can help - we recommend specific quants (like Gemma-4) and also auto calculate the context length etc!

Will have to try it out. I always thought that was more for fine-tuning and less for inference.

Oh yes, sadly we partially miscommunicated haha - there's both, plus synthetic data generation + exporting!

Noob question: why would I use this version over the original model?

1/3 the RAM & CPU consumed for ~99% of the performance

Hey, I tried to use Unsloth to run Gemma 4 locally but got stuck during the setup on Windows 11.

At some point it asked me to create a password, and right after that it threw an error. Here’s a screenshot: https://imgur.com/a/sCMmqht

This happened after running the PowerShell setup, where it installed several things like NVIDIA components, VS Code, and Python. At the end, PowerShell told me to open a http://localhost URL in my browser, and that's where I was prompted to set the password before it failed.

Also, I noticed that an Unsloth icon was added to my desktop, but when I click it, nothing happens.

For context, I’m not a developer and I had never used PowerShell before. Some of the steps were a bit intimidating and I wasn’t fully sure what I was approving when clicking through.

The overall experience felt a bit rough for my level. It would be great if this could be packaged as a simple .exe or a standalone app instead of going through terminal and browser steps.

Are there any plans to make something like that?


Apologies we just fixed it!! If you try again from source ie

irm https://unsloth.ai/install.ps1 | iex

it should work hopefully. If not - please at us on Discord and we'll help you!

The Network error is a bummer - we'll check.

And yes we're working on a .exe!!


It worked! https://imgur.com/a/SOfiRhv

Thanks, will check it out tomorrow.

Hope the unsloth-setup.exe > Windows App is coming soon! I think it will expand accessibility and user base.


Oh nice! Glad it worked! Yes!! We're working on the app!

Temperature 1.0 used to be bad for sampling. 0.7 was the better choice, and the difference in results was noticeable. You may want to experiment with this.

You might be right, but Google's recommendation was temp 1 etc. primarily because all their benchmarks were run with these numbers, so it gives better reproducibility for downstream tasks.

Fair, though putting a note in the readme about temperature 0.7 couldn't hurt.

I wonder why they do benchmarks with 1 instead of 0.7... that's strange. 0.7 or 0.8 at most gives noticeably better samples.


Reproducibility. They're benchmarks.

Reproducibility is a matter of using the same input seeds, which jax can do. 0.7 vs 1.0 would make no difference for that.

Without seeds, 0.7 would be less random than 1.0, so it'd be (slightly) more reproducible.
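To make the 0.7-vs-1.0 discussion concrete, temperature just scales the logits before the softmax; T < 1 puts more probability mass on the top token, T = 1 leaves the trained distribution untouched. A stdlib sketch:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T, then softmax. Lower T sharpens the
    distribution (less random), T = 1 reproduces the model as trained."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```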


I haven't tried a local model in a while. I can only fit E4B in VRAM (8GB), but it's good enough that I can see it replacing Claude.ai for some things.

Thanks for this, I gave this guide to my Claude and he oneshot the unsloth and gemma4 set up on the old macbook he runs on. It's way faster than I expected, haven't tried out local models for a few generations but will be very nice when they become useful

Thanks! Oh nice! Ye local models are advancing much faster than I expected!

Thank you and your brother for all the amazing work, it's really inspiring to others <3

Thank you and appreciate it!

How does Gemma 4 26B A4B compare with Qwen3.5 35B A3B for same quants(4)

This comment deserves its own HN post. Thanks!

Awesome!! Thank you SO much for this.

Appreciate it!

Wow! Thank you very much!

Thanks!

neat, time to update my spam filter model hehe

Haha! Ye the model is really good

Apologies on the delay - we fixed it! Please re-try with:

  curl -LsSf https://astral.sh/uv/install.sh | sh
  uv venv unsloth_studio --python 3.13
  source unsloth_studio/bin/activate
  uv pip install unsloth --torch-backend=auto
  unsloth studio setup
  unsloth studio -H 0.0.0.0 -p 8888


Thank you for the follow up! Big fan of your models here, thanks for everything you are doing!

Works fine on MacOS now (chat only).

On Ubuntu 24.04 with two GPUs (3090+3070), it appears that llama.cpp sometimes uses the CPU and not the GPU, judging from the tok/s and CPU load for identical models run with Unsloth Studio vs. plain llama.cpp (bleeding edge).


If you can try again - we just updated the process, sorry! We did a new PyPI release:

  curl -LsSf https://astral.sh/uv/install.sh | sh
  uv venv unsloth_studio --python 3.13
  source unsloth_studio/bin/activate
  uv pip install unsloth==2026.3.7 --torch-backend=auto
  unsloth studio setup
  unsloth studio -H 0.0.0.0 -p 8888


Daniel, it works now. Thanks for the hard work!

https://gist.github.com/ontouchstart/532312fcba59aec3ce7f6aa...


Thank you Daniel.

Here is the error message on my machine:

https://gist.github.com/ontouchstart/86ca3cbd8b6b61fa0aeec75...

It seems we might need more instructions on how to set up python (via uv) in vanilla MacOS.


I have been updating the gist above and will stop now.

  ../scipy/meson.build:274:9: ERROR: Dependency lookup for OpenBLAS with method 'pkgconfig' failed: Pkg-config for machine host machine not found. Giving up.

Too much work.


Glad it was helpful!


Oh will check and fix - thanks


Hey! Our primary objective for now is to provide the open source community with cool and useful tooling - we found closed source to be much more popular because of better tooling!

We have much, much more in the pipeline!!


Thanks! How do you earn or keep yourself afloat? I really like what you guys are doing. And similar orgs. I am personally doing the same, full-time. But I am worried when I will run out of personal savings.


I've been wondering this since they started it, mostly as a concern they stay afloat. Since Daniel does the work of ten, it seems like their value:cost ratio is world-class at the very least.

With the Studio release, it seems like they could be on the path to just bootstrapping a unicorn or a 10x-corn or whatever that's called, which is super interesting. Anyway, his refusal to go into details reassures me; sounds like things are fine, and they're shipping. Vaya con dios.


Daniel is a very impressive guy. Well within the realm of “fund the people not the idea” that YC seems to do. Got a few bucks from them and probably earning from collaborations etc. Odds of them not figuring out a business model seem slim.

https://www.ycombinator.com/companies/unsloth-ai


From comments elsewhere in this thread, it sounds like Unsloth could also be getting some decent consulting revenue from larger companies.


The opportunity here is HUUUUGGGEEEE!!!

Companies have no idea what they are doing. They know they need it, they know they want it, engineers want it, and they don't have it in their ecosystem, so this is a perfect opportunity to come in with a professional services play. We've got you on inference, training, running your models, all that - just focus on your business. Pair that with Hugging Face's storage and it's a win/win.


Investments are not income


You didn't answer the parent question.


They don't owe anyone an answer.


But if they want to attract users, like they seem to do, then answering would go long way.


that doesn't sound reassuring?


Thanks! We do have normal AMD support for Unsloth but yes the UI doesn't support it just yet! Will keep you posted!


What does "normal AMD support" mean here? I was completely unable to get it working on my Ryzen AI 9700 XT. I had to munge the versions in the requirements to get libraries compatible with recent enough ROCm, and it didn't go well at all. My last attempt was a couple weeks before studio was announced.


Hey will check ASAP and fix - sorry about that


Actually the opposite haha- more than 50% of our audience comes from large organizations eg Meta, NASA, the UN, Walmart, Spotify, AWS, Google, and the list goes on!


You would be surprised! Nearly every Fortune 500 company has utilized either our RL fine-tuning package or used our quants and models - the UI was primarily a culmination of pain points folks had when doing either training or inference!

We're complementary to LM Studio - they have a great tool as well!


I don’t know why this is being downvoted. Danielhanchen is legit, and unsloth was early to the fine-tuning on a budget party.


Haha no worries at all :)

