
brucehoult · 2026-04-23T01:33:12 1776907992

> rlwimi / rlwinm

Definitely a nice and pretty much pioneering feature on PowerPC in 1994 (and I guess RS/6000 before that, but I never used one).

Today's Arm64 BFM does both those jobs in one, minus the ability to create a split mask via rotating, but plus adding a choice of sign or zero extension to extracted fields (including extracted to the same place they already were, for pure sign/zero extension). As a result it's got about 100 aliases.

It would be nice to have these in RISC-V but they seriously violate the quite strict "Stanford Standard RISC" 2R1W principle that keeps the RISC-V integer pipeline simple (smaller, faster, cheaper).

When working in the "B" extension working group I suggested adopting the M88000 bitfield instructions which follow the 2R1W principle. Someone had an objection to encoding both field width and offset into a single constant (or `Rs2`), though I think it's well worth it. M88k as a 32 bit ISA used 5 bits for each, but 6 bits for each for RV64 fits RISC-V's 12 bit immediates perfectly.

- ext / extu: Extract signed or unsigned bit field from a register. You specify offset (starting bit position) and width. The extracted field is right-justified (shifted to the low bits) in the destination, with sign-extension or zero-extension.

- mak: Make (insert) a bit field. Takes a value, shifts it left by the offset, and inserts it into the destination while clearing the target field first (or combining in specific ways).

- set: Set (force to 1) a contiguous bit field in a register.

- clr: Clear (force to 0) a contiguous bit field in a register.

All take `Rd`, `Rs1` and a field size:offset as either a literal or as `Rs2`.
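
As a rough illustration of the semantics just listed, here is a small Python model. It assumes a 64-bit register and a `width` between 1 and 64. The helper names (`field_mask`, `set_field`, `insert`) are made up for this sketch. `mak` is modelled in its mask-and-shift form (all other bits zero), with the read-modify-write of the destination split out as `insert`.

```python
XLEN = 64                    # assumed register width (RV64)
ALL_ONES = (1 << XLEN) - 1

def field_mask(width, offset):
    """`width` one-bits starting at bit `offset`, truncated to XLEN bits."""
    return (((1 << width) - 1) << offset) & ALL_ONES

def extu(rs1, width, offset):
    """Extract a field, right-justified and zero-extended."""
    return (rs1 >> offset) & ((1 << width) - 1)

def ext(rs1, width, offset):
    """Extract a field, right-justified and sign-extended to XLEN bits."""
    value = extu(rs1, width, offset)
    sign = 1 << (width - 1)
    return ((value ^ sign) - sign) & ALL_ONES

def mak(rs1, width, offset):
    """Low `width` bits of rs1 moved up to `offset`; every other bit is 0."""
    return (rs1 << offset) & field_mask(width, offset)

def set_field(rs1, width, offset):
    """Force the field to all ones."""
    return rs1 | field_mask(width, offset)

def clr(rs1, width, offset):
    """Force the field to all zeros."""
    return rs1 & ~field_mask(width, offset) & ALL_ONES

def insert(rd, rs1, width, offset):
    """Field insert built from clr + mak + or (rd is both source and result)."""
    return clr(rd, width, offset) | mak(rs1, width, offset)
```

With these, moving a 10-bit field from offset 21 to offset 1 is `insert(dst, extu(src, 10, 21), 10, 1)`.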

Unfortunately, the R-type `mak` violates 2R1W because the `Rd` is also a source, which complicates OoO implementations by making them 3R1W. RISC-V could use an alternative formulation in which `mak` (or some other name) masks off the source field and shifts it into place, and then the insert is completed using `clr` and `or`.

On the other hand the forms with 12 bit literals are expensive in encoding space, but even including just the `Rs2` versions would be great, especially as often several instructions in a row can use the same field specification, which fits `addi Rd,zero,imm12` (aka `li`) perfectly.

On the gripping hand, while the immediate version of `mak` violates RISC-V convention by making the `Rd` also a source, any real pipeline is going to have fields for all of `Rd`, `Rs1`, `Rs2`, and `imm32` so only the decoder is affected.

Also, `ext` / `extu` are not needed as a pair of C-extension shifts do the same job with the same code size, and can be decoded into a single µop on a higher end CPU if desired.

As an example: take a 10 bit field at offset 21 and insert into a destination at offset 1 (this is part of decoding RISC-V J/JAL instructions).

PowerPC:

    rlwimi  r4, r3, 11, 1, 10

Arm64:

    ubfx   x2, x0, #21, #10      # extract bits[30:21] → low 10 bits of x2 (unsigned)
    bfi    x1, x2, #1, #10       # insert those 10 bits into x1 starting at bit 1

Alternatively, using `ubfm` / `bfm` directly without aliases (exactly the same instructions, just trickier to get right):

    ubfm   x2, x0, #21, #30
    bfm    x1, x2, #64-1, #9
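
The alias-to-raw-operand mapping is mechanical but easy to get wrong by hand. A minimal Python sketch of the two rules used here, for 64-bit registers, as I understand them from the Arm architecture manual (the function names are made up for illustration):

```python
def ubfx_to_ubfm(lsb, width, regsize=64):
    """UBFX Xd, Xn, #lsb, #width  ->  UBFM Xd, Xn, #immr, #imms."""
    assert 0 <= lsb < regsize and 1 <= width <= regsize - lsb
    immr = lsb                  # rotate right by lsb brings the field to bit 0
    imms = lsb + width - 1      # index of the field's top bit in the source
    return immr, imms

def bfi_to_bfm(lsb, width, regsize=64):
    """BFI Xd, Xn, #lsb, #width  ->  BFM Xd, Xn, #immr, #imms."""
    assert 0 <= lsb < regsize and 1 <= width <= regsize - lsb
    immr = (-lsb) % regsize     # rotate right by (regsize - lsb) == shift left by lsb
    imms = width - 1            # top bit of the low-order source field
    return immr, imms

print(ubfx_to_ubfm(21, 10))     # (21, 30)
print(bfi_to_bfm(1, 10))        # (63, 9)
```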

M88k:

    extu   r3, r1, 21, 10        # extract 10-bit field starting at bit 21 → low bits of r3
    mak    r2, r3, 1, 10         # make/insert the field at bit 1 in destination

RISC-V:

    srli   x12, x10, 21          # shift field down to low bits
    andi   x12, x12, 0x3FF       # mask to 10 bits
    slli   x12, x12, 1           # position at bit 1 (for imm[10:1])
    li     x13, ~0x7FE           # mask to clear bits [10:1] only
    and    x11, x11, x13
    or     x11, x11, x12         # insert the field

RISC-V with some M88k inspiration:

    extui  r3, r1, 21, 10        # extract 10-bit field starting at bit 21 → low bits of r3
    maki   r4, r3, 1, 10         # modified mak: masks + shifts field to bits [10:1] (others 0)
    clri   r2, 1, 10             # clear the target field in destination
    or     r2, r2, r4            # insert the prepared field

Alternatively

    li     t0, (1<<6) | 10       # specification for insertion bit field
    srli   a3, a1, 21            # shift 10-bit field starting at bit 21 → low bits of a3
    mak    a4, a3, t0            # modified mak: masks + shifts field to bits [10:1] (others 0)
    clr    a2, t0                # clear the target field in destination
    or     a2, a2, a4            # insert the prepared field

Alternatively:

    srli   a3, a1, 21
    maki   a2, a3, (1<<6) | 10   # decoder expands to `maki a2, a2, a3, (1<<6) | 10`

Again, this last formulation of `maki` violates RISC-V instruction format convention in making `a2` both src and dst, BUT if the decoder handles that then the expanded form does NOT cause any issues with the pipeline implementation.
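
To sanity-check that the two-instruction `srli` + `maki` form computes the same thing as the six-instruction base RISC-V sequence, here is a small Python model. It is only a sketch: `maki` is the hypothetical insert-form instruction from this comment, and the field spec is assumed to be laid out as `(offset << 6) | width`, matching the `(1<<6) | 10` used in the examples.

```python
MASK64 = (1 << 64) - 1

def base_rv64(x11, x10):
    """The six-instruction RV64I sequence, one statement per instruction."""
    x12 = x10 >> 21                 # srli x12, x10, 21
    x12 = x12 & 0x3FF               # andi x12, x12, 0x3FF
    x12 = (x12 << 1) & MASK64       # slli x12, x12, 1
    x13 = ~0x7FE & MASK64           # li   x13, ~0x7FE
    x11 = x11 & x13                 # and  x11, x11, x13
    return x11 | x12                # or   x11, x11, x12

def maki(rd, rs1, spec):
    """Hypothetical insert-form maki; spec assumed to be (offset << 6) | width."""
    offset, width = spec >> 6, spec & 0x3F
    mask = (((1 << width) - 1) << offset) & MASK64
    return (rd & ~mask & MASK64) | ((rs1 << offset) & mask)

def with_maki(a2, a1):
    """The proposed two-instruction sequence."""
    a3 = a1 >> 21                   # srli a3, a1, 21
    return maki(a2, a3, (1 << 6) | 10)

# Spot check on a few destination/source pairs.
for dst, src in [(0, 0x7FE00000), (MASK64, 0), (0x1234, 0xDEADBEEFCAFEF00D)]:
    assert base_rv64(dst, src) == with_maki(dst, src)
```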

camel-cdr · 2026-04-23T08:59:24 1776934764

bitfield insert/extract was also looked at by the scalar efficiency SIG: https://lists.riscv.org/g/sig-scalar-efficiency/topic/115060...

IIRC it didn't go anywhere, because it wasn't worth the encoding space.

But a rlwimi sounds like a good candidate for >32b encoding.

brucehoult · 2026-04-24T00:27:03 1776990423

Both the PowerPC and Arm64 instructions do grab a lot of encoding space.

rlwimi uses 26 bits of opcode space (i.e. 2^26 = 64M code points). In a RISC-V context you can drop the Rc (set status flags) bit, but for RV64 you need to expand the shift/start/end fields from 5 to 6 bits, so you end up needing 28 bits of encoding space, 18 for the field spec and 5 each for Rs1 and Rd/Rs2.

A RISC-V major opcode, such as OP-IMM (which this effectively is, but with an R/W Rd/Rs2), has only 25 bits of encoding space (2^25 code points) for all instructions in total!

PPC64's rldimi expands shift and size to 6 bits each but drops the ability to take the source field from an arbitrary position but only from the LSBs, and so uses 23 encoding bits. i.e. exactly my proposed RISC-V instruction (except for the set flags bit, so 22 bits).

Arm64's BFM/SBFM effectively uses 24 bits to provide both 32 bit and 64 bit operations — there are 25 bits but `sf` and `N` must be the same, potentially allowing the other half of the code points (plus the ones for 32 bit with the MSBs of `immr` and `imms` set) to be used for something else in future. Note that BFM leaves all other bits in the dst unchanged, while SBFM both sign-extends into the higher bits of dst AND zeros the lower bits of DST.

So BFM/SBFM *could* be fit into RISC-V, taking up half of a major opcode, of which there aren't many left. That is a pretty huge amount — the enormous V extension takes 1 1/2 major opcodes, for far more functionality. It would free up various immediate shifts and sign/zero extension instructions, but those don't take much encoding space, no more than 16 bits each.

As nice as they are, it's hard to avoid a conclusion that both (32 bit) PowerPC and Arm64 spend too much opcode space on these.

I think PPC64's `rldimi`, M88K's `mak` (extended to 64 bits) and my last RISC-V suggestion, which are all effectively the same thing, hit the right tradeoff, not using excessive encoding space but allowing a 2-instruction sequence for that bit field move:

    srli   a3, a1, 21
    maki   a2, a3, (1<<6) | 10   # decoder expands to `maki a2, a2, a3, (1<<6) | 10`

That's 22 bits of opcode space, the same as any one of `addi`, `andi`, `ori`, `xori`, `slti`, `sltiu` (OP-IMM) or `addiw` (OP-IMM-32).

The original RV64GC has 5/8 funct3 encodings in OP-IMM-32 unused, which `maki` (or call it `bfi` or whatever) could have used one of. It has a combined `Rd`/`Rs2` field which is unusual in full size 4-byte RISC-V instructions, but not unprecedented: the V extension does that for multiply-add instructions.

I don't immediately see any ratified or currently-proposed extension using this space.
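
The encoding-space arithmetic in this comment can be tallied with a few lines of Python. This is just bookkeeping: the field names are labels chosen for the sketch, and the `maki` layout is the hypothetical one proposed above.

```python
def bits(**fields):
    """Total number of operand bits an instruction consumes."""
    return sum(fields.values())

rlwimi_ppc32 = bits(rs=5, ra=5, sh=5, mb=5, me=5, rc=1)   # 26
rlwimi_rv64  = bits(rs1=5, rd_rs2=5, sh=6, mb=6, me=6)    # 28
rldimi_ppc64 = bits(rs=5, ra=5, sh=6, mb=6, rc=1)         # 23
maki_rv64    = bits(rs1=5, rd_rs2=5, offset=6, width=6)   # 22 (hypothetical)
addi_rv      = bits(rd=5, rs1=5, imm=12)                  # 22
major_opcode = 32 - 7                                     # 25: 32-bit instruction minus 7-bit opcode

for name, n in [("rlwimi (PPC32)", rlwimi_ppc32), ("rlwimi for RV64", rlwimi_rv64),
                ("rldimi (PPC64)", rldimi_ppc64), ("maki (RV64)", maki_rv64),
                ("addi", addi_rv)]:
    share = 2 ** (n - major_opcode)       # fraction of one RISC-V major opcode
    print(f"{name:16} {n} bits = {share:g} major opcodes")
```

On this count `maki` takes 1/8 of a major opcode (one `funct3` slot), while a 64-bit `rlwimi` would need 8 whole major opcodes.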

gblargg · 2026-04-24T01:21:38 1776993698

What would justify using this significant space for them these days? Video encoding/decoding in software seems like the most likely candidate, since there's a lot of bitfield packing and high data volume.

(Thanks for your elaboration on various architectures. It's an interesting glimpse into what goes on in allocating opcode space on fixed-length instruction machines.)

brucehoult · 2026-04-24T02:49:05 1776998945

My example is applicable to compiler / assembler / JIT / emulator.

The performance of conventional compilers and assemblers is not important to anyone but developers, but everyone uses JavaScript / WebAsm all the time. And QEMU can be important too (e.g. in docker for non-native ISAs, using binfmt_misc).

I guess I should point out in the proposed RISC-V example, it's 6 bytes of code as the initial shift can be a 2-byte "C" extension instruction. So that's slightly smaller code than everything except 32 bit PowerPC, which is another important aspect. Arm64 and M88k use 8 bytes of code.

Oh! I just realised standard RISC-V can be improved in this case (but not by so much in the general case).

    srli   x12, x10, 20          # shift field down to correct position
    andi   x12, x12, 0x7FE       # mask to 10 bits
    andi   x11, x11, ~0x7FE      # clear space in the destination
    or     x11, x11, x12         # insert the field

That's just 12 bytes of code.
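
The reason this works here but not in general is that both the field mask and its complement happen to fit a 12-bit signed immediate. A quick Python sketch of that check, plus a model of the four-instruction sequence (the helper names are made up for illustration):

```python
MASK64 = (1 << 64) - 1

def fits_imm12(value):
    """True if `value` is encodable as a 12-bit signed I-type immediate."""
    return -2048 <= value <= 2047

def mask_immediates(width, offset):
    """The two andi immediates needed: the field mask and its complement."""
    keep = ((1 << width) - 1) << offset   # e.g. 0x7FE for width 10, offset 1
    return keep, ~keep                    # Python's ~ gives the signed value, e.g. -2047

def improved_sequence(x11, x10):
    x12 = x10 >> 20                       # srli x12, x10, 20
    x12 = x12 & 0x7FE                     # andi x12, x12, 0x7FE
    x11 = x11 & (~0x7FE & MASK64)         # andi x11, x11, ~0x7FE
    return x11 | x12                      # or   x11, x11, x12

print(all(fits_imm12(v) for v in mask_immediates(10, 1)))   # True
print(all(fits_imm12(v) for v in mask_immediates(10, 2)))   # False
```

Moving the destination field up by just one bit (offset 2) already pushes the mask to `0xFFC`, which no longer fits, so the mask has to be built in a register.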

In the more general case you need a `lui` or `lui;addi` pair to load the mask into a register, and then register to register ops, for 14 bytes total.

Note that x86_64 needs four instructions and 14 bytes of code, so no better than RISC-V.

brucehoult · 2026-04-20T03:14:53 1776654893

Compared to Delhi? Ok. But I've had a soaking uncomfortable shirt every time I've been to Vegas, while in Phoenix it evaporates quickly.

brucehoult · 2026-04-20T03:08:40 1776654520

I was also on BIX, then NLZ, same name as here. Even made it into the "Best of BIX" in the back of BYTE a couple of times.

Living in New Zealand, it wasn't easy to meet people — or for that matter to access BIX! I was fortunate that from mid 1986 my employer paid for access via X.25 [1] for several years until telnet was possible from Actrix BBS.

jdow took me to LASFS once in 1989 and I think I saw JP from a distance. But in 2004 I spontaneously caught a flight to LA for the historic SpaceShip One 100km high flight. jpistritto picked me up at the airport and we drove to Mojave. Parking at the XCor hangar david42 and his wife Rita pulled up next to us in an RX7. There was a party in the hangar that evening, I got to talk with JP and LN and many others, at one point helped Doug Jones (can't remember if he was on bix) make LN2 icecream. A lot of us slept in the hangar. In the morning I helped shadow cook bacon&eggs for everyone, before we all went out to watch the flight.

Also at other times got to meetups in Phoenix, New York (a lot of C++ crowd there), New Haven (people came down from Boston), Seattle.

Good times.

[1] NZ$13.20 per kilosegment (ISTR even more at first!) .. up to 64k bytes if you filled the packets, but possibly as little as 1000 bytes if there was only 1 byte per packet e.g. sitting there and hitting return: so I always filed all new messages to scratchpad and then did either SHOW or else download via X/Y/Z modem.

brucehoult · 2026-04-20T02:47:22 1776653242

Interesting (but understandable pre-silicon) to see a couple of errors about the 6502 in that e.g. SBC needs SEC before it not CLC. The code examples could be improved too e.g. the 6502 memory copy has no need to use both index registers and increment them in lockstep with the same values. And better still, since you're copying fewer than 256 bytes, initialize one index register to COUNT-1 and copy from last to first.

On the other hand the 6800 code is buggy too. It's incrementing only one byte of the FROM and TO pointers — and the MSB at that on a bigendian machine — with no provision for crossing a page boundary, when the normal thing is to

    LDX FROM
    LDA 0,X
    INX
    STX FROM
    LDX TO
    STA 0,X
    INX
    STX TO

Still, as they say, much messier than 6502's...

    LDA FROM,X
    STA TO,X
    INX

... even if the 6502 needs an outer loop to copy more than 256 bytes, at least the inner loop is fast.
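
For what it's worth, the last-to-first copy with a single index register can be modelled in a few lines of Python. This is only a sketch of the loop structure, assuming fewer than 256 bytes and leaving the exact exit test (which branch instruction you use) abstract. The function name and the `bytearray` memory are illustrative, not from the article.

```python
def copy_last_to_first(mem, src, dst, count):
    """Model of a single-index 6502 copy loop over a bytearray `mem`."""
    assert 1 <= count <= 255
    x = count - 1                   # LDX #COUNT-1
    while True:
        a = mem[src + x]            # LDA FROM,X
        mem[dst + x] = a            # STA TO,X
        x = (x - 1) & 0xFF          # DEX (8-bit wraparound)
        if x == 0xFF:               # stop once X has wrapped past zero
            break
    return mem
```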

Also no mention is made of `(ZP),Y` addressing mode which takes 6502 to another level entirely.

brucehoult · 2026-04-20T01:52:42 1776649962

It became just another MS-DOS rag. In the early days it covered EVERYTHING, all ISAs, all programming languages (very famous Lisp and Forth and Smalltalk issues, for example).

See: https://news.ycombinator.com/item?id=47829410

rigonkulous · 2026-04-21T17:18:35 1776791915

Yes, it's fascinating that it went where the market was driven - by the market's hooks in its own advertising pages - and in that capacity, BYTE became a driving force for the early computing revolution not just (but also because of) the readership, but also their advertisers - inasmuch as that revolution could be defined as "wide adoption of new and emerging technologies to form a standard" - BYTE started as a user manual and ended its existence as a catalog of things with user manuals.

Probably, if one thinks about it, one of the more eloquent data structures in human existence, BYTE.

brucehoult · 2026-04-20T01:48:47 1776649727

> CP/M did have a sort of revival in that it became common in low-end machines like the C-128

Amstrad were good late 80s CP/M machines. We got both those and C128 in New Zealand.

> the one RISC/CISC CPU thing that really mattered!

Not only indirect addressing, but also multiple memory operands in the same instruction — more than one VM page, really, though a single unaligned operand crossing a page boundary is also bad. Many machines trap on that case to this day and let software emulate it.

Not being able to easily tell how long an instruction is (and thus where the next one starts) is also bad, but can be overcome at some cost in the front end, and the back end is unaffected. Unlike x86 and VAX the 68k does actually tell you everything you need in the first 16 bits, but yes the complex addressing of the 020/030 were what killed it.

brucehoult · 2026-04-20T01:38:46 1776649126

Yeah, I've got a complete collection from October 1978 to December 1991 (by which time they had become just another x86 PC rag). I bought a fair few individual copies myself from 78 or 79 until the late 80s, but the bulk of my collection I got for free from an elderly engineer in I think the late 2000s.

Here's a tweet I made packing them up when I was moving overseas in April 2015:

https://x.com/BruceHoult/status/586675607087419394/photo/1

I also have a 1984 Encyclopædia Britannica, all 30 volumes.

Will anyone want them when I can't house them?

bombcar · 2026-04-20T12:42:05 1776688925

Nobody wants those things - but some may need them.

What do I mean? If you posted them for free you might get a taker, but probably not.

But it's possible you might identify the right child at the right time who could appreciate them.

1991 Byte might be too old, but a 1984 Britannica has something even an offline copy of Wikipedia doesn't.

brucehoult · 2026-04-18T01:37:50 1776476270

Definitely not true. I've been using Boehm GC with my C/C++ programs for decades — since the 90s, at least.

dataflow · 2026-04-18T03:13:02 1776481982

Does this also hold true when you look at codebases that others also worked on, rather than just you?

brucehoult · 2026-04-18T01:32:12 1776475932

> built by one single dad

Not some random dad, but a GC expert and former leader of the JavaScript VM team at Apple.

brucehoult · 2026-04-17T02:13:48 1776392028

I cut down the December 2019 RISC-V ISA manual to just the things needed to get started with RV32I, to be even less intimidating.

I left out the end of the RV32I chapter with fence, ecall/ebreak, and hints. But included the later page (which many people miss) with the exact binary encodings, and also the chapter with the register API names and standard pseudo-instructions.

It's 18 pages in total.

I hope it's useful to someone else.
