
Vaskivo · 2026-05-18T20:59:14

How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s.

Here's my setup:

  llama-server \
  --port 9999 \
  --model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --ctx-size 128000 \
  --threads 12 \
  --flash-attn on \
  --device CUDA0 \
  --jinja \
  --gpu-layers 52 \
  --mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 \
  --spec-type draft-mtp --spec-draft-n-max 2

(I'm not filling out 100% of the VRAM, as I have other stuff I need it for.)

nyrikki · 2026-05-18T21:34:25

(Note: UPDATED config)

Yeah, if any of it is running on the CPU, it may slow down quickly.

This may be a bit large and overcomplicated. On this host I am running it on an AMD Ryzen 7 5700G, so that I can use the APU's integrated graphics and dedicate the 3090 to the model.

    podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 \
    --ctx-size 131072 \
    --no-mmproj-offload \
    --no-context-shift \
    --kv-unified \
    --spec-type draft-mtp \
    --spec-draft-n-max 6 \
    --spec-draft-p-min 0.75 \
    -fa on --jinja --no-mmap \
    --cache-ram -1 \
    --no-warmup -np 1 \
    -n 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --min-p 0.00 \
    --top-k 20 \
    --top-p 0.95 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.05 \
    --fit off \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --prio 3 \
    --poll 100 \
    --port 8080 \
    --host 0.0.0.0

I am just building the container with:

     podman build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .

And here are the logs from a 'make me a flappy bird program in python' web UI prompt:

     prompt eval time =     105.86 ms /    19 tokens (    5.57 ms per token,   179.47 tokens per second)
       eval time =  100549.41 ms /  4608 tokens (   21.82 ms per token,    45.83 tokens per second)
      total time =  100655.28 ms /  4627 tokens
     draft acceptance rate = 0.47215 ( 3408 accepted /  7218 generated)

I am down to ~25.54 t/s with a 95% full context.
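For anyone comparing their own runs against these numbers, here is a minimal Python sketch that re-derives the throughput and draft acceptance rate from the log lines above. The function name `parse_log` and the regular expressions are illustrative, not part of llama.cpp. The "target passes" figure rests on an assumption about how speculative decoding counts tokens (each verification pass yields its accepted draft tokens plus one token from the main model), which the log itself does not confirm.

```python
import re

# Log lines copied from the llama-server output quoted above.
LOG = """\
prompt eval time =     105.86 ms /    19 tokens (    5.57 ms per token,   179.47 tokens per second)
       eval time =  100549.41 ms /  4608 tokens (   21.82 ms per token,    45.83 tokens per second)
draft acceptance rate = 0.47215 ( 3408 accepted /  7218 generated)
"""


def parse_log(text):
    """Recompute generation speed and draft acceptance from the raw counters."""
    # Match the generation line only; the "prompt eval time" line does not
    # start with optional whitespace followed directly by "eval time".
    eval_m = re.search(r"^\s*eval time =\s*([\d.]+) ms /\s*(\d+) tokens", text, re.M)
    draft_m = re.search(r"\(\s*(\d+) accepted /\s*(\d+) generated\)", text)
    if eval_m is None or draft_m is None:
        raise ValueError("expected log lines not found")

    eval_ms, eval_tokens = float(eval_m.group(1)), int(eval_m.group(2))
    accepted, drafted = int(draft_m.group(1)), int(draft_m.group(2))

    # ASSUMPTION: every generated token that was not an accepted draft token
    # came from one verification pass of the main model.
    target_passes = eval_tokens - accepted

    return {
        "tokens_per_second": eval_tokens / (eval_ms / 1000.0),
        "acceptance_rate": accepted / drafted,
        "target_passes": target_passes,
        "tokens_per_pass": eval_tokens / target_passes,
    }


stats = parse_log(LOG)
for name, value in stats.items():
    print(f"{name}: {value:.5g}")
```

Under that assumption, the 4608 generated tokens would have needed about 1200 passes of the main model, or roughly 3.84 tokens per pass, which is where the speedup over plain decoding would come from.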

nyrikki · 2026-05-18T22:01:06

That config looked too complicated. Getting rid of --prio 3 and --poll 100, setting --spec-draft-n-max to the now-recommended value, etc. kicked it up to 61 t/s.

I think those extra flags were only there because of some earlier crashes.

    podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 \
    --ctx-size 128000 \
    --no-mmproj-offload \
    --no-context-shift \
    --kv-unified \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    --spec-draft-p-min 0.75 \
    -fa on --jinja --no-mmap \
    --cache-ram -1 \
    --no-warmup -np 1 \
    -n 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --min-p 0.00 \
    --top-k 20 \
    --top-p 0.95 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.05 \
    --fit off \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --port 8080 \
    --host 0.0.0.0

Vaskivo · 2026-05-19T08:25:09

Yeah, having even a little bit on the CPU tanks the t/s...

But thanks. I've learned a few more configurations to tinker with.
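A rough way to see why a small CPU share hurts so much is a back-of-the-envelope Python sketch. It assumes token generation is purely memory-bandwidth bound, which is a simplification. The function `relative_slowdown` is hypothetical, and the CPU bandwidth figure is an assumed round number, not a measurement from either machine in this thread.

```python
def relative_slowdown(cpu_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """Per-token time relative to a fully GPU-resident model.

    ASSUMPTION: decoding is memory-bandwidth bound, so reading a slice of
    the weights takes time proportional to its size divided by the
    bandwidth of the memory it lives in. Compute, PCIe transfers, and the
    KV cache are ignored.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    # The GPU share costs its own fraction; the CPU share is scaled up by
    # how much slower system memory is than VRAM.
    return (1.0 - cpu_fraction) + cpu_fraction * (gpu_bw_gbs / cpu_bw_gbs)


GPU_BW = 936.0  # RTX 3090 spec-sheet memory bandwidth in GB/s
CPU_BW = 50.0   # ASSUMED dual-channel desktop RAM bandwidth in GB/s

for frac in (0.0, 0.05, 0.10, 0.20):
    slowdown = relative_slowdown(frac, GPU_BW, CPU_BW)
    print(f"{frac:.0%} of weights on CPU: about {slowdown:.2f}x slower per token")
```

With these assumed numbers, leaving 10% of the weights in system RAM already costs roughly 2.8x per token, because the slow slice dominates the total read time.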

Vaskivo · on March 21, 2025

There's an episode where the father is trying to teach Bluey to play chess. It is implied by the mother that he's only doing that because "all smart kids play chess, therefore she should learn chess."

The teaching fails, not because she doesn't understand, but because she and Bingo end up making up a fantasy around the game (to the point that sacrificing a piece is shocking and abhorrent).

It ends up with the mother intervening, beating the dad at the game and saying "Work on the heads later, for now, just hearts."

So to you, I say: Bill Nye, yes. Bluey, yes too. Each in their own time... If the kids feel like it.

Vaskivo · on Sept 14, 2024

"Alternate history" is a common literary genre.

People will always explore the "what ifs."

Vaskivo · on July 29, 2023

I like small phones. I don't want to have something larger than 5.5 inches in my pocket.

Five years ago I bought a Sony Xperia XZ1 Compact, due to its form factor. I've been looking for a new phone for about a year, but all of them are too big or underpowered for my liking.

To me, the alternative seems to be foldable phones.

I'll probably buy a Motorola razr 40 in the coming weeks.

Also, an iPhone is not an option.

silisili · on July 29, 2023

The Zenfone 9 should be a hair smaller than the XZ1. Have you checked it out?

Vaskivo · on July 29, 2023

I have the XZ1 _Compact_: https://www.gsmarena.com/sony_xperia_xz1_compact-8610.php

It is 129 mm tall and has a 4.6-inch display.

Still, the Zenfone is about 15 mm taller. I'll check it out. Thanks.

Vaskivo · on Nov 22, 2021

About a year ago I quit my job. I was unhappy with some of the employer's decisions and just left. I had some savings and a side project I wanted to make.

Spent the following seven months making an Android game. I wasn't expecting to make any money from it, but really wanted to build and launch MY GAME :)

Following that, I spent two to three months interviewing... And I'm now employed at a foreign, fully remote early-stage startup. So everything turned out just as I wanted it to.

I have no children nor any loans, so all my finances were very flexible and manageable.

Vaskivo · on May 20, 2021

The music comes from this guy: https://wingless-seraph.net/en/index.html

I'm in no way affiliated with him. Found his music on itch.io and fell in love with it. I was surprised by its quality.

Vaskivo · on May 20, 2021

I really enjoyed Godot. Once I "got" some of the concepts, I became very fast at creating elements and changing them. And I used GDScript.

When I built the first prototype[0] I also used the opportunity to try out Godot. I like it a lot. Its main concept (scene <==> node) felt really elegant.

Some other engines might have better features. But One Way Dungeon is a 2D game with no real-time gameplay or physics or networking... So ease of use is more important than high-end capabilities.

[0] you can read the whole story here: https://news.ycombinator.com/item?id=27208333

Vaskivo · on May 20, 2021

Thank you for your feedback.

Could you please email me with the details of the crash?

You can see my email in my profile. Or inside the app. Top left button in the title menu.

Vaskivo · on May 20, 2021

Thank you for your feedback.

Vaskivo · on May 20, 2021

Thank you for your feedback.

I'm currently trying to figure out a way to make the "post battle" smoother.

Inside the dungeon, I'm planning on adding "events", to mix it up a bit from the battles. But I want it to stay linear and simple. The game is, in fact, a "boss rush game" with a "dungeon crawler" look.
