Hacker News | adefa's comments

Released uncensored versions of all four Gemma 4 models. bf16 + GGUF for each.

Collection: https://huggingface.co/collections/TrevorJS/gemma-4-uncensor...

Code: https://github.com/TrevorS/gemma-4-abliteration

Results

Refusal rates from 686 prompts across 4 datasets (JailbreakBench, tulu-harmbench, NousResearch, mlabonne). Manually audited — most flagged refusals were actually the model complying while attaching a disclaimer.

  E2B (2.3B): 98% → 0.4%, KL Div 0.346
  E4B (4.5B): 99% → 0.7%, KL Div 0.068
  26B MoE:    98% → 0.7%, KL Div 0.090
  31B:       100% → 3.2%, KL Div 0.124
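For context, the KL Div column tracks how far the abliterated model's next-token distribution drifts from the original's (lower is closer). A minimal sketch of one way to compute it from paired logits — illustrative, not necessarily the exact evaluation used here:

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    # KL(P || Q) between two next-token distributions, computed from
    # raw logits via a numerically stable softmax.
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Averaging this over many token positions gives a single drift score per model.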
26B MoE

Standard abliteration only touches the dense layers, which gets the MoE from 98% → 29%. The remaining refusals live in the expert weights. Expert-Granular Abliteration (EGA, concept from OBLITERATUS [1]) with norm-preserving biprojection [2], applied to each of the 128 expert slices per layer, gets it to 3%.

[1] https://github.com/elder-plinius/OBLITERATUS

[2] https://huggingface.co/blog/grimjim/abliteration-biprojectio...

How it was built

Set up an automated research loop -- an AI agent reads the current results and idea backlog, picks the next experiment, runs it on the GPU, records results, and repeats. It ran 22 experiments across the 4 models, discovered the false-positive problem in standard refusal markers, built the cross-dataset evaluation, and implemented the MoE expert abliteration when dense-only wasn't enough.
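The loop can be sketched roughly as follows (names and structure are illustrative, not the repo's actual code):

```python
def research_loop(backlog, run_experiment, max_experiments=22):
    # Toy version of the agent loop: read results so far, pick the next
    # idea, run it, record the outcome, repeat.
    history = []
    while backlog and len(history) < max_experiments:
        idea = backlog.pop(0)                 # agent picks the next experiment
        result = run_experiment(idea, history)  # run it on the GPU
        history.append({"idea": idea, "result": result})
    return history
```

In the real loop, `run_experiment` would launch the abliteration run and evaluation, and the agent would also append new ideas to the backlog based on what it sees in `history`.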

Full experiment history and code in the repo.

Downloads

Each model has bf16 safetensors + GGUF (Q4_K_M, Q8_0):

  E2B bf16: https://huggingface.co/TrevorJS/gemma-4-E2B-it-uncensored
  E2B GGUF: https://huggingface.co/TrevorJS/gemma-4-E2B-it-uncensored-GGUF
  E4B bf16: https://huggingface.co/TrevorJS/gemma-4-E4B-it-uncensored
  E4B GGUF: https://huggingface.co/TrevorJS/gemma-4-E4B-it-uncensored-GGUF
  26B bf16: https://huggingface.co/TrevorJS/gemma-4-26B-A4B-it-uncensored
  26B GGUF: https://huggingface.co/TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF
  31B bf16: https://huggingface.co/TrevorJS/gemma-4-31B-it-uncensored
  31B GGUF: https://huggingface.co/TrevorJS/gemma-4-31B-it-uncensored-GGUF
Quick start:

  llama-server -hf TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF -c 8192

What about the sampling parameters? You can't just run llama-server with no CLI arguments (other than a uselessly small context size) and expect useful results.

True :)
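A fuller invocation might look like the following; the sampling values are illustrative placeholders, so check the model card for the recommended settings:

```shell
# Illustrative only -- verify the sampling settings against the model card.
llama-server -hf TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF \
  -c 32768 --temp 1.0 --top-k 64 --top-p 0.95
```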

After some performance improvements, it is realtime on my DGX Spark with an RTF of 0.416 -- now getting ~19.5 tokens per second. Check it out, see if it's better for you.
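For anyone unfamiliar with the metric: RTF is processing time divided by audio duration, so anything under 1.0 is faster than real time. The specific durations below are illustrative:

```python
def real_time_factor(processing_seconds, audio_seconds):
    # RTF < 1.0 means the audio is transcribed faster than it plays back.
    return processing_seconds / audio_seconds

# Illustrative: 60 s of audio processed in ~25 s gives an RTF near 0.416.
rtf = real_time_factor(processing_seconds=24.96, audio_seconds=60.0)
```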


I'm curious to see if you are able to run the model now from the CLI?


The cubecl-wgpu changes were only needed to reduce the number of kernel workgroups; otherwise I was getting errors in WASM.


This should be fixed now. There were a number of bugs that kept the model from working correctly in different environments. Please let me know if you test again. :)


Cool! Thanks for the response, I'll give it a shot again sometime


Please try again. The model weights are unchanged, but the inference code is improved.


This should be fixed.


Hello everyone, thanks for the interest. I merged a number of significant performance improvements that increase speed and accuracy across CUDA, Metal, and WASM as well as improve stability.

Here are the latest benchmarks running on DGX Spark:

https://github.com/TrevorS/voxtral-mini-realtime-rs#benchmar...


Hello, I pushed up and merged a PR that greatly improves performance on CUDA, Metal, and in WASM.

Depending on your hardware, the model is definitely real time (able to transcribe audio faster than the length of the audio).


Benchmarks using DGX Spark on vLLM 0.15.1.dev0+gf17644344

  FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8

  Sequential (single request)

    Prompt     Gen     Prompt Processing    Token Gen
    Tokens     Tokens  (tokens/sec)         (tokens/sec)
    ------     ------  -----------------    -----------
       521        49            3,157            44.2
     1,033        83            3,917            43.7
     2,057        77            3,937            43.6
     4,105        77            4,453            43.2
     8,201        77            4,710            42.2

  Parallel (concurrent requests)

    pp4096+tg128 (4K context, 128 gen):

     n    t/s
    --    ----
     1    28.5
     2    39.0
     4    50.4
     8    57.5
    16    61.4
    32    62.0

    pp8192+tg128 (8K context, 128 gen):

     n    t/s
    --    ----
     1    21.6
     2    27.1
     4    31.9
     8    32.7
    16    33.7
    32    31.7
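Reading the parallel numbers above as aggregate throughput (an assumption about how the harness reports t/s), the scaling flattens quickly; a quick way to see it:

```python
# Aggregate t/s from the pp4096+tg128 run above.
throughput = {1: 28.5, 2: 39.0, 4: 50.4, 8: 57.5, 16: 61.4, 32: 62.0}

# Speedup over a single request, and per-request rate at each concurrency.
speedup = {n: t / throughput[1] for n, t in throughput.items()}
per_request = {n: t / n for n, t in throughput.items()}
```

Aggregate throughput barely moves past 16 concurrent requests, which is the memory-bandwidth ceiling showing up.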


I tried the FP8 in vLLM on my Spark and although it fit in memory, I started swapping once I actually tried to run any queries, and, yeah, couldn't use a context larger than 8k.

I figured out later this is because vLLM apparently de-quantizes to BF16 at runtime, so it's pointless to run the FP8?

I get about 30-35 tok/second using llama.cpp and a 4-bit quant. And a 200+k context, using only 50GB of RAM.
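The memory gap is roughly what straight weight-size arithmetic predicts. Treating the model as ~80B parameters (an assumption, not a confirmed figure) and Q4_K_M as averaging ~4.5 bits per weight:

```python
def weight_gb(params_billion, bits_per_weight):
    # Weight storage only; KV cache and activations come on top of this.
    return params_billion * bits_per_weight / 8

fp8_gb = weight_gb(80, 8.0)   # 80.0 GB of weights at 8 bits
q4_gb = weight_gb(80, 4.5)    # 45.0 GB at ~4.5 bits (typical Q4_K_M average)
```

De-quantizing FP8 to BF16 at runtime would roughly double the working set again, consistent with the swapping described above.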


Running llama.cpp rather than vLLM, it's happy enough to run the FP8 variant with 200k+ context using about 90GB vram


yeah, what did you get for tok/sec there though? Memory bandwidth is the limitation with these devices. With 4 bit I didn't get over 35-39 tok/sec, and averaged more like 30 when doing actual tool use with opencode. I can't imagine fp8 being faster.

