Definitely a nice and pretty much pioneering feature on PowerPC in 1994 (and I guess RS/6000 before that, but I never used one).
Today's Arm64 BFM does both those jobs in one, minus the ability to create a split mask via rotating, but plus adding a choice of sign or zero extension to extracted fields (including extracted to the same place they already were, for pure sign/zero extension). As a result it's got about 100 aliases.
It would be nice to have these in RISC-V but they seriously violate the quite strict "Stanford Standard RISC" 2R1W principle that keeps the RISC-V integer pipeline simple (smaller, faster, cheaper).
When working in the "B" extension working group I suggested adopting the M88000 bitfield instructions which follow the 2R1W principle. Someone had an objection to encoding both field width and offset into a single constant (or `Rs2`), though I think it's well worth it. M88k as a 32 bit ISA used 5 bits for each, but 6 bits for each for RV64 fits RISC-V's 12 bit immediates perfectly.
- ext / extu: Extract signed or unsigned bit field from a register. You specify offset (starting bit position) and width. The extracted field is right-justified (shifted to the low bits) in the destination, with sign-extension or zero-extension.
- mak: Make (insert) a bit field. Takes a value, shifts it left by the offset, and inserts it into the destination while clearing the target field first (or combining in specific ways).
- set: Set (force to 1) a contiguous bit field in a register.
- clr: Clear (force to 0) a contiguous bit field in a register.
All take `Rd`, `Rs1` and a field size:offset as either a literal or as `Rs2`.
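To make the proposed semantics concrete, here is a minimal Python sketch of the field operations for a 64 bit register. It is an illustration only: the function names, the packing of the spec as width in the upper 6 bits and offset in the lower 6 bits, and the convention that an encoded width of 0 means 64 are all assumptions, and `mak` is modelled in the non-inserting form (mask and shift, all other bits zero). It also assumes width + offset does not exceed 64.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def spec(width, offset):
    """Pack a field spec as width:offset, 6 bits each (assumed layout)."""
    return ((width & 63) << 6) | (offset & 63)

def unpack(s):
    """Return (width, offset); an encoded width of 0 is taken to mean 64."""
    width = ((s >> 6) & 63) or XLEN
    offset = s & 63
    return width, offset

def field_mask(s):
    width, offset = unpack(s)
    return (((1 << width) - 1) << offset) & MASK

def extu(rs1, s):
    """Extract the field, right-justified and zero-extended."""
    width, offset = unpack(s)
    return (rs1 >> offset) & ((1 << width) - 1)

def ext(rs1, s):
    """Extract the field, right-justified and sign-extended to XLEN bits."""
    width, _ = unpack(s)
    sign = 1 << (width - 1)
    return ((extu(rs1, s) ^ sign) - sign) & MASK

def mak(rs1, s):
    """Mask the low `width` bits of rs1 and shift them up to `offset`."""
    width, offset = unpack(s)
    return ((rs1 & ((1 << width) - 1)) << offset) & MASK

def set_(rs1, s):
    """Force the field to all ones."""
    return rs1 | field_mask(s)

def clr(rs1, s):
    """Force the field to all zeros."""
    return rs1 & ~field_mask(s) & MASK

def insert(dst, src, s):
    """Field insert built from clr, mak and or, each reading at most two registers."""
    return clr(dst, s) | mak(src, s)
```

With this model, `insert(dst, extu(src, spec(10, 21)), spec(10, 1))` moves a 10 bit field from offset 21 to offset 1, using one packed spec per step.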
Unfortunately, the R-type `mak` violates 2R1W because the `Rd` is also a source, which complicates OoO implementations, making them 3R1W. RISC-V could use an alternative formulation in which `mak` (or some other name) masks off the source field and shifts it into place, and then the insert is completed using `clr` and `or`.
On the other hand the forms with 12 bit literals are expensive in encoding space, but even including just the `Rs2` versions would be great, especially as often several instructions in a row can use the same field specification, which fits `addi Rd,zero,imm12` (aka `li`) perfectly.
On the gripping hand, while the immediate version of `mak` violates RISC-V convention by making the `Rd` also a source, any real pipeline is going to have fields for all of `Rd`, `Rs1`, `Rs2`, and `imm32` so only the decoder is affected.
Also, `ext` / `extu` are not needed as a pair of C-extension shifts do the same job with the same code size, and can be decoded into a single µop on a higher end CPU if desired.
As an example: take a 10 bit field at offset 21 and insert into a destination at offset 1 (this is part of decoding RISC-V J/JAL instructions).
PowerPC:
rlwimi r4, r3, 12, 21, 30 # rotate left 12 (= right 20), insert under mask of IBM bits 21..30 (LSB-numbered bits [10:1])
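The `rlwimi` operands are easy to get wrong because PowerPC numbers bits from the MSB (bit 0) down. A small Python model of the instruction, written here as a sketch with made-up helper names, can be used to check them. For a field moving from LSB-numbered bits [30:21] to bits [10:1], the model expects a left rotate of 12 and a mask running from IBM bit 21 to IBM bit 30.

```python
M32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32 bit value left by n bits."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & M32

def ppc_mask(mb, me):
    """Ones from IBM bit mb through IBM bit me (bit 0 = MSB), wrapping if mb > me."""
    from_mb_down = M32 >> mb               # bit mb and everything less significant
    from_me_up = (M32 << (31 - me)) & M32  # bit me and everything more significant
    return (from_mb_down & from_me_up) if mb <= me else (from_mb_down | from_me_up)

def rlwimi(ra, rs, sh, mb, me):
    """Rotate Left Word Immediate then Mask Insert: ra is both source and destination."""
    m = ppc_mask(mb, me)
    return (rotl32(rs, sh) & m) | (ra & ~m & M32)

# Move the 10 bit field at bits [30:21] of r3 into bits [10:1] of r4.
r3 = (0x2AB << 21) | 0x1FFFFF
r4 = rlwimi(0xFFFFFFFF, r3, 12, 21, 30)
print(hex(r4))
```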
Arm64:
ubfx x2, x0, #21, #10 # extract bits[30:21] → low 10 bits of x2 (unsigned)
bfi x1, x2, #1, #10 # insert those 10 bits into x1 starting at bit 1
Alternatively, using `bfm` directly without aliases (exactly the same instructions, just trickier to get right)
ubfm x2, x0, #21, #30 # immr = lsb, imms = lsb + width - 1
bfm x1, x2, #64-1, #9 # immr = (64 - lsb) mod 64 = 63, imms = width - 1
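The alias arithmetic is the part that is tricky to get right, so here is a short Python sketch of it, together with a simplified model of the 64 bit bitfield move. The function names are made up for illustration, and the model covers only the insert (BFM) and zero-extending (UBFM) behaviours, not sign extension.

```python
M64 = (1 << 64) - 1

def ubfx_to_ubfm(lsb, width):
    """ubfx Xd, Xn, #lsb, #width  ->  ubfm Xd, Xn, #immr, #imms"""
    return lsb, lsb + width - 1

def bfi_to_bfm(lsb, width, regsize=64):
    """bfi Xd, Xn, #lsb, #width  ->  bfm Xd, Xn, #immr, #imms"""
    return (-lsb) % regsize, width - 1

def bitfield_move(dst, src, immr, imms, insert=True, regsize=64):
    """Simplified BFM (insert=True) or UBFM (insert=False)."""
    if imms >= immr:
        # Extract: field of imms-immr+1 bits at bit immr goes to the low bits.
        width, pos = imms - immr + 1, 0
        field = (src >> immr) & ((1 << width) - 1)
    else:
        # Insert: low imms+1 bits of src go to bit position regsize-immr.
        width, pos = imms + 1, regsize - immr
        field = src & ((1 << width) - 1)
    mask = ((1 << width) - 1) << pos
    base = dst if insert else 0   # UBFM zeros every bit outside the field
    return ((base & ~mask) | (field << pos)) & M64

x0 = (0x2AB << 21) | 0x1FFFFF
x2 = bitfield_move(0, x0, *ubfx_to_ubfm(21, 10), insert=False)
x1 = bitfield_move(M64, x2, *bfi_to_bfm(1, 10))
print(ubfx_to_ubfm(21, 10), bfi_to_bfm(1, 10), hex(x1))
```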
M88k:
extu r3, r1, 21, 10 # extract 10-bit field starting at bit 21 → low bits of r3
mak r2, r3, 1, 10 # make/insert the field at bit 1 in destination
RISC-V:
srli x12, x10, 21 # shift field down to low bits
andi x12, x12, 0x3FF # mask to 10 bits
slli x12, x12, 1 # position at bit 1 (for imm[10:1])
li x13, ~0x7FE # mask to clear bits [10:1] only
and x11, x11, x13
or x11, x11, x12 # insert the field
RISC-V with some M88k inspiration:
extui r3, r1, 21, 10 # extract 10-bit field starting at bit 21 → low bits of r3
maki r4, r3, 1, 10 # modified mak: masks + shifts field to bits [10:1] (others 0)
clri r2, r2, 1, 10 # clear the target field in destination
or r2, r2, r4 # insert the prepared field
Alternatively
li t0, (10<<6) | 1 # specification for insertion bit field, as size:offset = 10:1
srli a3, a1, 21 # shift 10-bit field starting at bit 21 → low bits of a3
mak a4, a3, t0 # modified mak: masks + shifts field to bits [10:1] (others 0)
clr a2, a2, t0 # clear the target field in destination
or a2, a2, a4 # insert the prepared field
Again, this last formulation of `maki` violates RISC-V instruction format convention in making `a2` both src and dst, BUT if the decoder handles that then the expanded form does NOT cause any issues with the pipeline implementation.
Both the PowerPC and Arm64 instructions do grab a lot of encoding space.
rlwimi uses 26 bits of opcode space (i.e. 2^26 = 64M code points). In a RISC-V context you can drop the Rc (set status flags) bit, but for RV64 you need to expand the shift/start/end fields from 5 to 6 bits, so you end up needing 28 bits of encoding space, 18 for the field spec and 5 each for Rs1 and Rd/Rs2.
A RISC-V major opcode, such as OP-IMM (which this effectively is, but with a R/W Rd/Rs2), only has 25 bits of encoding space (2^25 code points) for all instructions in total!
PPC64's rldimi expands shift and size to 6 bits each but drops the ability to take the source field from an arbitrary position but only from the LSBs, and so uses 23 encoding bits. i.e. exactly my proposed RISC-V instruction (except for the set flags bit, so 22 bits).
Arm64's BFM/SBFM effectively uses 24 bits to provide both 32 bit and 64 bit operations — there are 25 bits but `sf` and `N` must be the same, potentially allowing the other half of the code points (plus the ones for 32 bit with the MSBs of `immr` and `imms` set) to be used for something else in future. Note that BFM leaves all other bits in the dst unchanged, while SBFM both sign-extends into the higher bits of dst AND zeros the lower bits of DST.
So BFM/SBFM *could* be fit into RISC-V, taking up half of a major opcode, of which there aren't many left. That is a pretty huge amount — the enormous V extension takes 1 1/2 major opcodes, for far more functionality. It would free up various immediate shifts and sign/zero extension instructions, but those don't take much encoding space, no more than 16 bits each.
As nice as they are, it's hard to avoid a conclusion that both (32 bit) PowerPC and Arm64 spend too much opcode space on these.
I think PPC64's `rldimi` and M88K's `mak` (extended to 64 bits) and my last RISC-V suggestion — which are all effectively the same thing — hit the right tradeoff, not using excessive encoding space but allowing a 2-instruction sequence for that bit field move):
That's 22 bits of opcode space, the same as any one of `addi`, `andi`, `ori`, `xori`, `slti`, `sltiu` (OP-IMM) or `addiw` (OP-IMM-32).
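The bit counts quoted in this comparison can be tallied with a few lines of Python. The field widths below are the ones given in the text above; the `maki` entry is the hypothetical instruction being proposed, not an existing encoding, and the helper names are made up for this sketch.

```python
# Variable bits (everything except the fixed opcode bits) for each encoding.
ENCODINGS = {
    "rlwimi (PowerPC, 32 bit)": {"RS": 5, "RA": 5, "SH": 5, "MB": 5, "ME": 5, "Rc": 1},
    "rlwimi widened for RV64": {"Rs1": 5, "Rd/Rs2": 5, "shift": 6, "start": 6, "end": 6},
    "rldimi (PPC64)": {"RS": 5, "RA": 5, "sh": 6, "mb": 6, "Rc": 1},
    "maki (hypothetical RV64)": {"Rs1": 5, "Rd/Rs2": 5, "width": 6, "offset": 6},
    "addi (RISC-V I-type)": {"rd": 5, "rs1": 5, "imm": 12},
}

# A 32 bit RISC-V instruction minus its 7 bit opcode field.
MAJOR_OPCODE_BITS = 25

def free_bits(name):
    """Number of encoding bits the instruction consumes."""
    return sum(ENCODINGS[name].values())

def major_opcodes(name):
    """Encoding space consumed, measured in RISC-V major opcodes."""
    return 2 ** (free_bits(name) - MAJOR_OPCODE_BITS)

for name in ENCODINGS:
    print(f"{name:26s} {free_bits(name):2d} bits = {major_opcodes(name):g} major opcodes")
```

On these numbers a widened `rlwimi` would need eight whole major opcodes, while the 22 bit form takes one eighth of one, the same as a single funct3 slot in OP-IMM.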
The original RV64GC has 5/8 funct3 encodings in OP-IMM-32 unused, which `maki` (or call it `bfi` or whatever) could have used one of. It has a combined `Rd`/`Rs2` field which is unusual in full size 4-byte RISC-V instructions, but not unprecedented: the V extension does that for multiply-add instructions.
I don't immediately see any ratified or currently-proposed extension using this space.
What would justify using this significant space for them these days? Video encoding/decoding in software seems like the most likely candidate, since there's a lot of bitfield packing and high data volume.
(Thanks for your elaboration on various architectures. It's an interesting glimpse into what goes on in allocating opcode space on fixed-length instruction machines.)
My example is applicable to compiler / assembler / JIT / emulator.
The performance of conventional compilers and assemblers is not important to anyone but developers, but everyone uses JavaScript / WebAsm all the time. And QEMU can be important too (e.g. in docker for non-native ISAs, using binfmt_misc).
I guess I should point out that in the proposed RISC-V example, it's 6 bytes of code as the initial shift can be a 2-byte "C" extension instruction. So that's slightly smaller code than everything except 32 bit PowerPC, which is another important aspect. Arm64 and M88k use 8 bytes of code.
Oh! I just realised standard RISC-V can be improved in this case (but not by so much in the general case).
srli x12, x10, 20 # shift field down to correct position
andi x12, x12, 0x7FE # mask to 10 bits
andi x11, x11, ~0x7FE # clear space in the destination
or x11, x11, x12 # insert the field
That's just 12 bytes of code.
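A quick way to convince yourself that the shorter sequence is equivalent is to model both in Python and compare them on random inputs. This is only a sketch with made-up function names. It also checks that both masks fit in a signed 12 bit immediate, which is what makes the trick work in this case.

```python
import random

M64 = (1 << 64) - 1

def insert_six(x10, x11):
    """The original six-instruction sequence."""
    x12 = x10 >> 21           # srli x12, x10, 21
    x12 &= 0x3FF              # andi x12, x12, 0x3FF
    x12 = (x12 << 1) & M64    # slli x12, x12, 1
    x13 = ~0x7FE & M64        # li   x13, ~0x7FE
    x11 &= x13                # and  x11, x11, x13
    return x11 | x12          # or   x11, x11, x12

def insert_four(x10, x11):
    """The shorter sequence: shift straight to the target position, then mask."""
    x12 = x10 >> 20           # srli x12, x10, 20
    x12 &= 0x7FE              # andi x12, x12, 0x7FE
    x11 &= ~0x7FE & M64       # andi x11, x11, ~0x7FE
    return x11 | x12          # or   x11, x11, x12

def fits_imm12(value):
    """True if value is representable as a signed 12 bit immediate."""
    return -2048 <= value <= 2047

rng = random.Random(0)
samples = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(1000)]
all_equal = all(insert_six(a, b) == insert_four(a, b) for a, b in samples)
print(all_equal, fits_imm12(0x7FE), fits_imm12(~0x7FE))
```

A field that reaches bit 11 or above would fail the `fits_imm12` check, which is why the general case needs the mask loaded into a register first.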
In the more general case you need a `lui` or `lui;addi` pair to load the mask into a register, and then register to register ops, for 14 bytes total.
Note that x86_64 needs four instructions and 14 bytes of code, so no better than RISC-V.
I was also on BIX, then NLZ, same name as here. Even made it into the "Best of BIX" in the back of BYTE a couple of times.
Living in New Zealand, it wasn't easy to meet people — or for that matter to access BIX! I was fortunate that from mid 1986 my employer paid for access via X.25 [1] for several years until telnet was possible from Actrix BBS.
jdow took me to LASFS once in 1989 and I think I saw JP from a distance. But in 2004 I spontaneously caught a flight to LA for the historic SpaceShipOne 100km high flight. jpistritto picked me up at the airport and we drove to Mojave. As we parked at the XCOR hangar, david42 and his wife Rita pulled up next to us in an RX7. There was a party in the hangar that evening, where I got to talk with JP and LN and many others, and at one point helped Doug Jones (can't remember if he was on BIX) make LN2 ice cream. A lot of us slept in the hangar. In the morning I helped shadow cook bacon & eggs for everyone, before we all went out to watch the flight.
Also at other times got to meetups in Phoenix, New York (a lot of C++ crowd there), New Haven (people came down from Boston), Seattle.
Good times.
[1] NZ$13.20 per kilosegment (ISTR even more at first!) .. up to 64k bytes if you filled the packets, but possibly as little as 1000 bytes if there was only 1 byte per packet e.g. sitting there and hitting return: so I always filed all new messages to scratchpad and then did either SHOW or else download via X/Y/Z modem.
Interesting (but understandable pre-silicon) to see a couple of errors about the 6502 in that e.g. SBC needs SEC before it not CLC. The code examples could be improved too e.g. the 6502 memory copy has no need to use both index registers and increment them in lockstep with the same values. And better still, since you're copying fewer than 256 bytes, initialize one index register to COUNT-1 and copy from last to first.
On the other hand the 6800 code is buggy too. It's incrementing only one byte of the FROM and TO pointers — and the MSB at that on a bigendian machine — with no provision for crossing a page boundary, when the normal thing is to
LDX FROM
LDA 0,X
INX
STX FROM
LDX TO
STA 0,X
INX
STX TO
Still, as they say, much messier than 6502's...
LDA FROM,X
STA TO,X
INX
... even if the 6502 needs an outer loop to copy more than 256 bytes, at least the inner loop is fast.
Also no mention is made of `(ZP),Y` addressing mode which takes 6502 to another level entirely.
It became just another MS-DOS rag. In the early days it covered EVERYTHING, all ISAs, all programming languages (very famous Lisp and Forth and Smalltalk issues, for example).
Yes, it's fascinating that it went where the market was driven - by the market's hooks in its own advertising pages - and in that capacity, BYTE became a driving force for the early computing revolution, not just through its readership but also through its advertisers - inasmuch as that revolution could be defined as "wide adoption of new and emerging technologies to form a standard" - BYTE started as a user manual and ended its existence as a catalog of things with user manuals.
Probably, if one thinks about it, one of the more eloquent data structures in human existence, BYTE.
> CP/M did have a sort of revival in that it became common in low-end machines like the C-128
Amstrad were good late 80s CP/M machines. We got both those and C128 in New Zealand.
> the one RISC/CISC CPU thing that really mattered!
Not only indirect addressing, but also multiple memory operands in the same instruction — more than one VM page, really, though a single unaligned operand crossing a page boundary is also bad. Many machines trap on that case to this day and let software emulate it.
Not being able to easily tell how long an instruction is (and thus where the next one starts) is also bad, but can be overcome at some cost in the front end, and the back end is unaffected. Unlike x86 and VAX the 68k does actually tell you everything you need in the first 16 bits, but yes, the complex addressing of the 020/030 was what killed it.
Yeah, I've got a complete collection from October 1978 to December 1991 (by which time they had become just another x86 PC rag). I bought a fair few individual copies myself from 78 or 79 until the late 80s, but the bulk of my collection I got for free from an elderly engineer in I think the late 2000s.
Here's a tweet I made packing them up when I was moving overseas in April 2015:
I cut down the December 2019 RISC-V ISA manual to just the things needed to get started with RV32I, to be even less intimidating.
I left out the end of the RV32I chapter with fence, ecall/ebreak, and hints. But included the later page (which many people miss) with the exact binary encodings, and also the chapter with the register API names and standard pseudo-instructions.