Results

Refusal rates from 686 prompts across 4 datasets (JailbreakBench, tulu-harmbench, NousResearch, mlabonne). Manually audited: most flagged refusals are actually the model complying, with a disclaimer attached.
E2B (2.3B): 98% → 0.4%, KL Div 0.346
E4B (4.5B): 99% → 0.7%, KL Div 0.068
26B MoE: 98% → 0.7%, KL Div 0.090
31B: 100% → 3.2%, KL Div 0.124
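The false-positive problem above can be sketched as a tiny classifier: marker-based refusal detection with a compliance override. The marker and hint strings here are illustrative guesses, not the repo's actual heuristics.

```python
# Sketch: a completion can open with a refusal-sounding disclaimer and
# still comply, so naive marker matching over-counts refusals.
REFUSAL_MARKERS = ["i can't", "i cannot", "i won't"]
COMPLIANCE_HINTS = ["here's", "step 1", "first,", "sure,"]

def is_refusal(completion: str) -> bool:
    text = completion.lower()
    flagged = any(m in text for m in REFUSAL_MARKERS)
    # Override: a disclaimer followed by actual content counts as
    # compliance, not refusal.
    if flagged and any(h in text for h in COMPLIANCE_HINTS):
        return False
    return flagged

def refusal_rate(completions: list[str]) -> float:
    return sum(is_refusal(c) for c in completions) / len(completions)
```

In practice this is why the audited numbers come out far lower than what raw refusal markers report.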
26B MoE
Standard abliteration only touches the dense layers, which gets the MoE from 98% -> 29%. The remaining refusals live in the expert weights. Used Expert-Granular Abliteration (EGA, a concept from OBLITERATUS [1]) with norm-preserving biprojection [2], applied to each of the 128 expert slices per layer. That gets it to 3%.
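A rough sketch of what per-expert abliteration looks like, assuming the standard directional-ablation setup (a unit refusal direction in the residual stream, projected out of each expert's down-projection slice), with a naive column-norm restoration standing in for the biprojection step of [2]. The repo's actual implementation may differ.

```python
import numpy as np

def abliterate_expert_slice(W: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """W: one expert's down-projection slice, shape (d_model, d_ff).
    refusal_dir: refusal direction in the residual stream, shape (d_model,)."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    # Project the refusal direction out of the output (residual-stream)
    # side, so this expert can no longer write along r.
    W_abl = W - np.outer(r, r @ W)
    # Norm preservation: rescale each column back to its original L2 norm
    # so the expert's activation magnitudes stay unchanged.
    orig = np.linalg.norm(W, axis=0, keepdims=True)
    new = np.maximum(np.linalg.norm(W_abl, axis=0, keepdims=True), 1e-8)
    return W_abl * (orig / new)
```

Note the column rescaling cannot reintroduce the refusal component: `r @ W_abl` is already zero, and scaling columns only scales that (zero) row vector elementwise. The same function would be applied to all 128 expert slices in each MoE layer.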
How it was built

Set up an automated research loop -- an AI agent reads the current results and idea backlog, picks the next experiment, runs it on the GPU, records results, and repeats. It ran 22 experiments across the 4 models, discovered the false-positive problem in standard refusal markers, built the cross-dataset evaluation, and implemented the MoE expert abliteration when dense-only wasn't enough.
Full experiment history and code in the repo.
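The loop described above is, structurally, just read state, pick, run, record, repeat. A minimal skeleton, with a hypothetical state-file layout and function names (see the repo for the real harness):

```python
import json
from pathlib import Path

def research_loop(run_experiment, state_path: str, max_runs: int = 22) -> dict:
    """Drain the idea backlog, one experiment per iteration."""
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {"backlog": [], "results": []}
    for _ in range(max_runs):
        if not state["backlog"]:
            break                                # idea backlog exhausted
        experiment = state["backlog"].pop(0)     # agent picks the next idea
        result = run_experiment(experiment)      # e.g. an abliteration run on the GPU
        state["results"].append({"experiment": experiment, "result": result})
        path.write_text(json.dumps(state, indent=2))  # record before the next pick
    return state
```

Persisting state after every run is what lets the agent resume and reason over the full experiment history.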
Downloads
Each model has bf16 safetensors + GGUF (Q4_K_M, Q8_0):
What about the sampling parameters? You can't just run llama-server with no CLI arguments (other than a uselessly-small context size) and expect useful results.
After some performance improvements, it runs in realtime on my DGX Spark with an RTF of 0.416 -- now getting ~19.5 tokens per second. Check it out and see if it's better for you.
This should be fixed now. There were a number of bugs that kept the model from working correctly in different environments. Please let me know if you test again. :)
Hello everyone, thanks for the interest. I merged a number of significant performance improvements that increase speed and accuracy across CUDA, Metal, and WASM as well as improve stability.
Here are the latest benchmarks running on DGX Spark:
I tried the FP8 in vLLM on my Spark and although it fit in memory, I started swapping once I actually tried to run any queries, and, yeah, could not have a context larger than 8k.
I figured out later that this is because vLLM apparently de-quantizes to BF16 at runtime, so it's pointless to run the FP8?
I get about 30-35 tok/second using llama.cpp and a 4-bit quant. And a 200+k context, using only 50GB of RAM.
yeah, what did you get for tok/sec there though? Memory bandwidth is the limitation with these devices. With 4 bit I didn't get over 35-39 tok/sec, and averaged more like 30 when doing actual tool use with opencode. I can't imagine fp8 being faster.
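The bandwidth ceiling mentioned above is easy to estimate: each decoded token has to stream the active weights through memory once, so tok/s is roughly bounded by bandwidth divided by active-weight bytes. The numbers below are assumptions (DGX Spark's ~273 GB/s spec, and a guess at the 4-bit active-weight footprint), not measurements.

```python
def decode_tok_per_s_ceiling(mem_bw_gb_s: float, active_weights_gb: float) -> float:
    # Per decoded token, the active weights are read from memory once,
    # so memory bandwidth sets an upper bound on decode speed.
    return mem_bw_gb_s / active_weights_gb

# e.g. ~273 GB/s over an assumed ~7.5 GB of active 4-bit weights
ceiling = decode_tok_per_s_ceiling(273, 7.5)  # ~36 tok/s
```

That ceiling lines up with the 35-39 tok/s peak reported above, with real-world tool use landing below it.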
Collection: https://huggingface.co/collections/TrevorJS/gemma-4-uncensor...
Code: https://github.com/TrevorS/gemma-4-abliteration
[1] https://github.com/elder-plinius/OBLITERATUS
[2] https://huggingface.co/blog/grimjim/abliteration-biprojectio...
Quick start: