Shortly after the X86-S proposal, Intel presented its biggest novelty since the days of AVX-512, and arguably an even bigger one: APX, its new (true) architecture (ISA), available exclusively in 64-bit mode (it is not, and in any case could not have been, supported in 32-bit mode). This series of articles sets out to analyse its innovations and critical points, and to provide food for thought.
To be precise, and going by how it has been introduced, one cannot formally speak of a new architecture, since these are additions to x64 made through a new ISA extension. The extension therefore builds on the 64-bit mode introduced by AMD in 2003, to which the following features are added:
- 16 new general-purpose registers;
- a new format for several instructions that uses three operands instead of the canonical two (or two operands for instructions that normally take only one);
- new conditional instructions;
- a jump instruction with an absolute address (64-bit, of course).
We have long been accustomed to new instructions being added in all sorts of flavours and in the most disparate domains, always transparently alongside the existing ones. Making new registers available, however, is an extremely significant and impactful change, to the point that I think one can actually speak of a new architecture.
This is because, although code using APX works and can be safely mixed with any x64 code, providing multiple code paths in the same executable (as, for example, the Intel compiler makes possible) to support multiple scenarios, depending on the processor on which the code runs, would increase its size, possibly significantly.
As long as only a few critical parts need dedicated code paths to run at their best on specific processors (more precisely, on specific microarchitectures), the choice pays off and does not inflate the binaries excessively. The innovations of APX, however, are so pervasive, touching almost every aspect of the code, that it is difficult to imagine continuing with this approach (although it remains feasible both in theory and in practice).
My expectation is, therefore, that we will see executables generated specifically for APX, without any code path for plain x64 code (which will be relegated to equivalent separate binaries), so as not only to maximise performance but also to contain overall code size (on disk as well). This is also why I prefer to think of APX as a new architecture rather than an extension like those introduced so far.
The REX2 prefix for accessing new registers
The new registers cannot be addressed with the existing instruction encodings, so some changes were made to the ISA for this purpose, similar to what AMD did for x64 with the introduction of the REX prefix:
REX byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | W | R | X | B |
APX follows the same route by introducing the new REX2 prefix:
REX2 (2-byte REX) byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|
0 (0xD5) | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
1 | M0 | R4 | X4 | B4 | W | R3 | X3 | B3 |
This time the focus is on the additional bits needed to access the new ‘bank’ of registers. As can be seen immediately, R3, X3 and B3 are the equivalents of R, X and B: they specify the fourth bit of the register (given in the opcode itself or in the memory operand), of the index and of the base respectively (the latter two only if there is a memory operand), while R4, X4 and B4 do the same for the fifth bit (5 bits = 32 possible values = 32 addressable registers).
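As a rough illustration of the bit layout in the table above, the following Python sketch splits the second REX2 byte into its fields and combines one extension pair with a 3-bit register field. The names `Rex2`, `decode_rex2` and `full_register` are made up for this example, and the 3-bit field is assumed to come from the rest of the instruction encoding, as it does on x64.

```python
from typing import NamedTuple

REX2_PREFIX = 0xD5  # first byte of the two-byte REX2 prefix (see the table above)


class Rex2(NamedTuple):
    """Hypothetical container for the eight bits of the second REX2 byte."""
    m0: int
    r4: int
    x4: int
    b4: int
    w: int
    r3: int
    x3: int
    b3: int


def decode_rex2(payload: int) -> Rex2:
    """Split the second REX2 byte into one-bit fields, bit 7 first."""
    bits = [(payload >> shift) & 1 for shift in range(7, -1, -1)]
    return Rex2(*bits)


def full_register(bit4: int, bit3: int, low3: int) -> int:
    """Combine an extension pair (e.g. R4, R3) with a 3-bit field into a 5-bit register number."""
    return (bit4 << 4) | (bit3 << 3) | (low3 & 0b111)


fields = decode_rex2(0b0100_0100)  # R4 = 1 and R3 = 1, every other bit 0
print(fields)
print(full_register(fields.r4, fields.r3, 0b010))  # 16 + 8 + 2 = 26
```

With both extension bits set, the register number falls in the upper half of the 32-register range, which is exactly the part that the classic REX prefix cannot reach.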
There is, however, an additional bit, M0, which specifies which ‘opcode map’ to use: either 0 or 1. This calls for a short digression on how the opcodes of x86 and x64 instructions are defined/mapped. In these processors, opcodes are defined by sequences of bytes (other architectures may use words of 2 or more bytes as the ‘base unit’).
Leaving prefixes aside for the moment and simplifying a great deal, an 8086 instruction (the 8086 being the first representative of this ISA) uses the first byte to define up to 256 opcodes, which correspond (very) roughly to 256 instructions. The 8086 did not use all 256 possibilities but only a subset, leaving some encodings free; over time these were taken up by new instructions (or prefixes) as they were gradually added. The set of instructions belonging to these 256 encodings specified by the first byte is called ‘map 0’.
At a certain point this map ran out of free encodings, and in order to add further instructions Intel decided to recycle an instruction that on the 8086 was completely useless (actually even dangerous): POP CS, whose opcode is 0F in hexadecimal. It was reused to create a new opcode map (map 1). This opened up the possibility of defining another 256 instructions (those that have 0F as their first byte and whose second byte specifies the new opcode), the only penalty being that these instructions are one byte longer (because the first byte is always 0F).
As can easily be guessed, map 1 was eventually exhausted too, so Intel used a couple of its encodings (0F 38 and 0F 3A) to create another couple of maps (2 and 3). These instructions are therefore even longer (because their first two bytes hold one of those two values).
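The escape-byte scheme described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `opcode_map` is invented here, the input is assumed to start directly at the opcode (no legacy, REX or other prefixes), and no check is made that the encoding is actually defined.

```python
def opcode_map(code: bytes) -> tuple[int, int]:
    """Return (map number, opcode byte) for a prefix-free instruction encoding."""
    if code[0] != 0x0F:
        return 0, code[0]      # map 0: the first byte is the opcode
    if code[1] == 0x38:
        return 2, code[2]      # map 2: escape sequence 0F 38
    if code[1] == 0x3A:
        return 3, code[2]      # map 3: escape sequence 0F 3A
    return 1, code[1]          # map 1: escape byte 0F


for sample in (b"\x90", b"\x0f\xa2", b"\x0f\x38\x00", b"\x0f\x3a\x0f"):
    number, opcode = opcode_map(sample)
    print(f"{sample.hex(' ')}: map {number}, opcode {opcode:02X}")
```

The sample byte strings are just convenient inputs; the point is that each additional map costs one more byte before the opcode is reached.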
The new bit M0 therefore selects only between the first two maps and cannot reach the other two. This is not a big deal, because most of the ‘general-purpose’ instructions (also called legacy in Intel’s document containing the complete APX specification) are found in the first two maps.
The other two can in any case be selected via the VEX prefixes (VEX2 and VEX3, introduced by Intel with the AVX extensions):
VEX3 (3-byte VEX) byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|
0 (0xC4) | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | R̅ | X̅ | B̅ | m4 | m3 | m2 | m1 | m0 |
2 | W | v̅3 | v̅2 | v̅1 | v̅0 | L | p1 | p0 |

VEX2 (2-byte VEX) byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|
0 (0xC5) | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
1 | R̅ | v̅3 | v̅2 | v̅1 | v̅0 | L | p1 | p0 |
or with the EVEX prefix (introduced by Intel with the AVX-512 extensions):
Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
---|---|---|---|---|---|---|---|---|---|
Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
Byte 1 (P0) | R̅ | X̅ | B̅ | R̅’ | 0 | 0 | m1 | m0 | P[7:0] |
Byte 2 (P1) | W | v̅3 | v̅2 | v̅1 | v̅0 | 1 | p1 | p0 | P[15:8] |
Byte 3 (P2) | z | L’ | L | b | V̅’ | a2 | a1 | a0 | P[23:16] |
VEX2 allows only the second map (0F, which is implicit) to be selected, VEX3 allows up to 32 maps to be selected via bits m4..m0, and finally EVEX allows all four maps to be selected via m1..m0.
VEX2 is a two-byte prefix, VEX3 is three bytes, and EVEX is a full four bytes. So it is clear that instruction length will increase, even considerably, depending on the prefix used.
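To make the comparison concrete, here is a small Python sketch that reports, for each of the three prefixes, its length and the opcode map it selects, following the field positions in the tables above. The function name `prefix_info` is invented for this example, and it assumes 64-bit mode, where the bytes C4, C5 and 62 are always treated as prefixes; the EVEX branch follows the original format with only m1..m0.

```python
def prefix_info(code: bytes) -> tuple[str, int, int]:
    """Return (prefix name, prefix length in bytes, selected opcode map)."""
    first = code[0]
    if first == 0xC5:
        # VEX2 has no map field: map 1 (0F) is implicit.
        return "VEX2", 2, 1
    if first == 0xC4:
        # VEX3: m4..m0 are the low five bits of the second byte.
        return "VEX3", 3, code[1] & 0b1_1111
    if first == 0x62:
        # EVEX (original format): m1..m0 are the low two bits of P0.
        return "EVEX", 4, code[1] & 0b11
    raise ValueError("not a VEX or EVEX prefix")


# Hand-built byte strings, used only to exercise the three branches.
for sample in (b"\xc5\xf8\x77", b"\xc4\xe2\x79\x18\xc0", b"\x62\xf1\x7c\x48\x58\xc0"):
    print(sample.hex(" "), "->", prefix_info(sample))
```

The trade-off is visible in the return values: the shortest prefix offers no choice of map, while the longer ones spend extra bytes to gain that flexibility.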
Changes to the EVEX prefix
So far we have seen that the REX2 prefix only makes it possible to select the 16 new general-purpose registers, but Intel went much further with APX, extending the operation of some instructions (not only general-purpose ones, but also some AVX ones) and giving them capabilities that previously required several instructions to emulate. To do this, it extended the EVEX prefix by exploiting some previously unused bits. Originally its format was as shown in the table above; now, to support the new general-purpose registers, it has become:
Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
---|---|---|---|---|---|---|---|---|---|
Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
Byte 1 (P0) | R̅3 | X̅3 | B̅3 | R̅4 | B4 | m2 | m1 | m0 | P[7:0] |
Byte 2 (P1) | W | v̅3 | v̅2 | v̅1 | v̅0 | X̅4 | p1 | p0 | P[15:8] |
Byte 3 (P2) | z | L’ | L | b | v̅4 | a2 | a1 | a0 | P[23:16] |
So Intel simply added R4, X4 and B4 (the first two in negated form, for technical reasons which I won’t repeat here) to be able to set the fifth bit of the respective register.
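The negated fields are easy to get wrong, so a short Python sketch may help. It recovers the true register extension bits from the three payload bytes, inverting exactly the fields that carry an overbar in the table above and leaving B4 as it is. The function name `evex_register_bits` and the dictionary keys are invented for this example, which follows the table as printed rather than any official decoder.

```python
def evex_register_bits(p0: int, p1: int, p2: int) -> dict[str, int]:
    """Recover the true (non-inverted) register extension bits from P0, P1 and P2."""

    def bit(byte: int, pos: int) -> int:
        return (byte >> pos) & 1

    return {
        "R3": bit(p0, 7) ^ 1,   # stored inverted
        "X3": bit(p0, 6) ^ 1,   # stored inverted
        "B3": bit(p0, 5) ^ 1,   # stored inverted
        "R4": bit(p0, 4) ^ 1,   # stored inverted
        "B4": bit(p0, 3),       # stored as is
        "X4": bit(p1, 2) ^ 1,   # stored inverted
        "V4": bit(p2, 3) ^ 1,   # stored inverted
        "map": p0 & 0b111,      # m2..m0
    }


# Inverted fields all set to 1 and B4 = 0: no register extension is requested.
print(evex_register_bits(0b1111_0100, 0b0111_1100, 0b0000_1000))
# R3 and R4 stored as 0 (so both are set) and B4 = 1.
print(evex_register_bits(0b0110_1100, 0b0111_1100, 0b0000_1000))
```

The practical consequence is that an encoding with the inverted fields set to 1 refers to the low registers, which keeps older EVEX encodings meaningful.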
A number of VEX instructions have also been updated, but as the VEX2 and VEX3 prefixes leave no room for additional bits, Intel chose to use EVEX in these cases as well to extend their operation:
Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
---|---|---|---|---|---|---|---|---|---|
Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
Byte 1 (P0) | R̅3 | X̅3 | B̅3 | R̅4 | B4 | m2 | m1 | m0 | P[7:0] |
Byte 2 (P1) | W | v̅3 | v̅2 | v̅1 | v̅0 | X̅4 | p1 | p0 | P[15:8] |
Byte 3 (P2) | 0 | 0 | L | 0 | v̅4 | NF | 0 | 0 | P[23:16] |
The only substantial difference from the VEX prefixes (apart from access to the new registers) is the new NF bit, which makes it possible to suppress flag generation for certain instructions that originally set the flags according to the result of the operation. I will talk more about this in a separate section of the next article.
For the general-purpose instructions, however, Intel has made more substantial changes:
Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
---|---|---|---|---|---|---|---|---|---|
Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
Byte 1 (P0) | R̅3 | X̅3 | B̅3 | R̅4 | B4 | 1 | 0 | 0 | P[7:0] |
Byte 2 (P1) | W | v̅3 | v̅2 | v̅1 | v̅0 | X̅4 | p1 | p0 | P[15:8] |
Byte 3 (P2) | 0 | 0 | 0 | ND | v̅4 | NF | 0 | 0 | P[23:16] |
I’ve already talked about NF, but next to it ND was added, which turns a binary instruction (a source operand plus a destination operand that also acts as a second source) into a ternary one, or a unary instruction (whose single operand is both source and destination) into a binary one, according to precise rules which will also be explained in a section of the next article.
It should be noted, however, that in this case the bits m2..m0 (m2 has now been added to EVEX to bring the selectable maps from 4 to 8) take a very precise value: 100 (in binary). EVEX uses this field to specify which of the eight possible maps applies to the opcode byte following the prefix. We have seen that x64 has four maps (from 0 to 3) that define the corresponding instruction sets, but 100 corresponds to a new map: map 4!
The reason is quickly explained: not all general-purpose instructions are extended with these new mechanisms (NF and ND), only some. To avoid complicating things further, Intel collected all the ‘promotable’ instructions and placed them in this new map. So the new wonders are reserved for instructions in map 4, while all other instructions can only access the new registers (via REX2) and cannot exploit anything else.
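The role of map 4 as a gatekeeper for ND and NF can be sketched in Python as follows. The function name `promoted_legacy_flags` is invented for this example; the bit positions are read off the table for general-purpose instructions above (ND in bit 4 and NF in bit 2 of P2, map in the low three bits of P0), and no other validity checks are attempted.

```python
from typing import Optional


def promoted_legacy_flags(p0: int, p2: int) -> Optional[tuple[int, int]]:
    """Return (ND, NF) for an EVEX-encoded general-purpose instruction, or None outside map 4."""
    if p0 & 0b111 != 0b100:
        return None           # m2..m0 is not 100: not one of the promoted instructions
    nd = (p2 >> 4) & 1        # new data destination: extra operand present
    nf = (p2 >> 2) & 1        # no flags: suppress flag updates
    return nd, nf


print(promoted_legacy_flags(0b1111_0100, 0b0001_1100))  # map 4 with ND = 1 and NF = 1
print(promoted_legacy_flags(0b1111_0001, 0b0001_1100))  # map 1: None
```

Returning `None` for other maps mirrors the point made above: outside map 4 these two bits simply do not have this meaning.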
Finally, Intel changed this last format once more in order to introduce further new conditional instructions:
Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
---|---|---|---|---|---|---|---|---|---|
Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
Byte 1 (P0) | R̅3 | X̅3 | B̅3 | R̅4 | B4 | 1 | 0 | 0 | P[7:0] |
Byte 2 (P1) | W | OF | SF | ZF | CF | X̅4 | p1 | p0 | P[15:8] |
Byte 3 (P2) | 0 | 0 | 0 | ND=1 | SC3 | SC2 | SC1 | SC0 | P[23:16] |
The OF, SF, ZF and CF fields are the very same bits found in the flags register, while SC3..SC0 are four bits that hold the condition code (with some modifications) normally used in conditional jumps to indicate when to jump. A separate section in the next article will deal with these new conditional instructions.
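As a final illustration of the table above, this Python sketch pulls the four flag values and the condition code out of P1 and P2 for this last format. The function name `conditional_fields` is invented here, and the sketch deliberately stops at extracting the raw 4-bit code, since its exact interpretation is left to the next article.

```python
def conditional_fields(p1: int, p2: int) -> dict[str, int]:
    """Extract OF, SF, ZF, CF (P1 bits 6..3) and SC3..SC0 (P2 bits 3..0)."""
    return {
        "OF": (p1 >> 6) & 1,
        "SF": (p1 >> 5) & 1,
        "ZF": (p1 >> 4) & 1,
        "CF": (p1 >> 3) & 1,
        "SC": p2 & 0b1111,    # raw condition code, not interpreted here
    }


# P1 = 0 1 0 1 0 1 0 0 (OF = 1, ZF = 1), P2 = 0 0 0 1 0 1 0 1 (ND = 1, SC = 0101)
print(conditional_fields(0b0101_0100, 0b0001_0101))
```

Note that these four flag bits occupy the positions that hold v̅3..v̅0 in the other formats, so the same payload bits are read differently depending on the instruction.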
The next article will, as already mentioned, deal in more detail with APX’s innovations and the impact they have in real/common scenarios.