With the implementation costs covered in the previous article, the observations and criticisms come to an end. This article sets out possible improvements that could be made to APX before the final commercialisation of the first processors that will implement it (assuming it is not already too late!).
Various modifications
One modification I would suggest is to treat conditional instructions in the same way as other processors do, allowing their effects to be totally ignored if the specific condition is not met. This also makes the implementation in the execution pipeline simpler (only the commit or retire of the instruction is performed).
Currently, on the other hand, if the condition is not met, CFCMOVcc resets the destination when it is a register (while leaving it unchanged otherwise). The original CMOVcc also has the flaw of generating exceptions if the memory location it references cannot be accessed, even when the condition is false; fortunately, APX provides a variant (CFCMOVcc) that suppresses exceptions in such cases.
All these individual differences, with behaviour that varies from one instruction to another, benefit neither the decoder that has to decode them nor the backend that has to execute them. The same applies when only some instructions are allowed to suppress flags generation while others are not. The result is greater implementation complexity, also at the expense of compilers (which must take into account and handle all these special cases).
Modifications to REX2 (to add NF)
So the next concrete, as well as extremely simple, change would be to make the NF (No Flags) bit available to all instructions ‘promoted’ by this new extension, instead of just a few.
In reality, all the improvements proposed in this article involve the complete removal of the concept of ‘promotion’ (which currently applies only to certain instructions and led to the creation of map 4 using the EVEX prefix, as we have already seen in the first article), since the idea is to allow all general-purpose instructions to take advantage of the new features introduced with APX.
In order to achieve this (while at the same time giving code density a helping hand), a trivial modification to the REX2 prefix is required, which currently has the following structure:
REX2 (2-byte REX)

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0xD5) | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1
1 | M0 | R4 | X4 | B4 | W | R3 | X3 | B3
Which, by adding the NF bit to signal the possible suppression of flags generation, becomes:
New REX2 (2-byte REX)

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0xD4, 0xD5) | 1 | 1 | 0 | 1 | 0 | 1 | 0 | M0
1 | NF | R4 | X4 | B4 | W | R3 | X3 | B3
Now we use not only the opcode (D5 in hexadecimal) of the old AAD instruction (suppressed by x64 in 64-bit mode), but also that of AAM (D4), both of which allow us to set NF (in the most significant bit of the second byte) without any penalty other than that of using REX2, which, however, occupies only two bytes (as opposed to EVEX, where four bytes would be needed!).
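As a rough illustration of the layout in the table above, the two bytes of the modified REX2 could be assembled as follows. This is only a sketch: the function name encode_new_rex2 and its parameter names are hypothetical, and the bit positions are taken directly from the proposed table.

```python
def encode_new_rex2(m0, nf, r4, x4, b4, w, r3, x3, b3):
    """Return the two bytes of the proposed REX2 prefix (hypothetical helper)."""
    fields = (m0, nf, r4, x4, b4, w, r3, x3, b3)
    if any(bit not in (0, 1) for bit in fields):
        raise ValueError("every field must be 0 or 1")
    # Byte 0: the fixed pattern 1101010 followed by M0,
    # i.e. 0xD4 (old AAM opcode, map 0) or 0xD5 (old AAD opcode, map 1).
    byte0 = 0xD4 | m0
    # Byte 1: NF R4 X4 B4 W R3 X3 B3, from bit 7 down to bit 0.
    byte1 = ((nf << 7) | (r4 << 6) | (x4 << 5) | (b4 << 4)
             | (w << 3) | (r3 << 2) | (x3 << 1) | b3)
    return bytes([byte0, byte1])
```

For instance, a map 0 instruction with NF and W set and no register extension bits would start with the bytes D4 88.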
The reason why NF has taken the place that M0 occupied in the original REX2 will become clearer later with the other prefixes, but I anticipate that it serves to keep exactly the same format for the second byte everywhere. The way the map is selected, on the other hand, differs depending on the prefix (but this is the only variation).
New prefix REX3 (to add condition)
In the same vein and as previously suggested, a condition could be applied to all general-purpose instructions, giving them the possibility of being totally ignored if the condition is not fulfilled, without any side effects (as explained at length above).
This modification is extremely important precisely because it addresses the concern Intel raised in the APX presentation, which states that processor pipelines are becoming longer (and wider) as time goes by, and thus more susceptible to performance losses when the prediction of conditional jumps fails.
The solution I propose, for this purpose, is to introduce a new prefix, REX3, very similar to REX2, but with the addition of a byte in which it is possible to specify the condition that must be fulfilled for the instruction to be executed. The format of the new prefix is as follows:
REX3 (3-byte REX)

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0x1F) | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1
1 | NF | R4 | X4 | B4 | W | R3 | X3 | B3
2 | 0 | 0 | 0 | M0 | SC3 | SC2 | SC1 | SC0
where, as we have already seen in the first article setting out the format of all the prefixes added or modified by APX, SC3..SC0 are four bits representing the code (modified, excluding the test for the parity bit P) of the condition that is used in conditional jumps, while NF is the No Flags bit we have already seen above with the new REX2 prefix.
The three bits set to 0 in the third byte, which precede M0, leave room for further maps to be added (although using them all for this purpose would give 16 maps, which would be too many) and/or for enabling other features in future extensions.
As can be seen, this new prefix (for which I have used opcode 1F, which corresponds to the old legacy POP DS instruction) is quite simple, flexible, and easier to implement than EVEX. It also has the not inconsiderable advantage of occupying one byte less than the latter, thus mitigating the impact on code density.
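The REX3 layout can be sketched in the same way. The helper name encode_rex3 and its parameters are hypothetical; the 4-bit condition code is treated as an opaque value, because the modified condition encoding is not spelled out in this article.

```python
def encode_rex3(nf, r4, x4, b4, w, r3, x3, b3, m0, sc):
    """Return the three bytes of the proposed REX3 prefix (hypothetical helper)."""
    bits = (nf, r4, x4, b4, w, r3, x3, b3, m0)
    if any(bit not in (0, 1) for bit in bits):
        raise ValueError("single-bit fields must be 0 or 1")
    if not 0 <= sc <= 0xF:
        raise ValueError("sc must fit in 4 bits")
    # Byte 0: fixed opcode 0x1F (the old POP DS).
    byte0 = 0x1F
    # Byte 1: same layout as the second byte of the proposed REX2.
    byte1 = ((nf << 7) | (r4 << 6) | (x4 << 5) | (b4 << 4)
             | (w << 3) | (r3 << 2) | (x3 << 1) | b3)
    # Byte 2: three reserved zero bits, then M0, then SC3..SC0.
    byte2 = (m0 << 4) | sc
    return bytes([byte0, byte1, byte2])
```

Keeping byte 1 identical to the second byte of REX2 means a decoder could share the same field extraction logic for both prefixes, which is the point made earlier about NF replacing M0.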
Taking advantage of REX3, it is also possible to (re)implement the new CCMP and CTEST instructions by exploiting opcodes 70-7F (map 0: the classic conditional jump instructions with an 8-bit offset for the jump) for the former and 80-8F (map 1: the less well-known conditional jumps with a 16- or 32-bit offset) for the latter. The 4 least significant bits of the opcode will be used to specify the values of the OF, SF, ZF and CF fields, to be copied to the respective flags in the event that the condition in REX3 is not met.
In this case the format of the instruction for CCMP becomes as follows:
REX3 (3-byte REX) for CCMP

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0x1F) | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1
1 | NF | R4 | X4 | B4 | W | R3 | X3 | B3
2 | 0 | 0 | 0 | 0 | SC3 | SC2 | SC1 | SC0
3 | 0 | 1 | 1 | 1 | OF | SF | ZF | CF
While for CTEST:
REX3 (3-byte REX) for CTEST

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0x1F) | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1
1 | NF | R4 | X4 | B4 | W | R3 | X3 | B3
2 | 0 | 0 | 0 | 1 | SC3 | SC2 | SC1 | SC0
3 | 1 | 0 | 0 | 0 | OF | SF | ZF | CF
The choice of reusing the opcodes of the conditional jump instructions is certainly the best one, because using the new REX3 to make conditional those instructions that are already conditional in themselves would not make any sense. So we might as well reuse them, using the 4 bits of the condition to store the values of OF, SF, ZF and CF instead.
This is a very simple implementation, as can be seen: in the presence of the new REX3 prefix it requires a couple of trivial comparisons to check for the special case of these two new instructions. It also has the advantage of occupying one byte less than the current solution using EVEX, thus improving code density.
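The "couple of trivial comparisons" could look roughly like the following sketch, which inspects the map bit and the opcode that follow a REX3 prefix. The function name classify_rex3_opcode is hypothetical, and the OF/SF/ZF/CF bit order follows the two tables above.

```python
def classify_rex3_opcode(m0, opcode):
    """Detect the proposed CCMP/CTEST special cases after a REX3 prefix.

    Returns (name, flags) for CCMP or CTEST, or None for any other opcode,
    which would be decoded as an ordinary conditional instruction.
    """
    high = opcode & 0xF0
    if m0 == 0 and high == 0x70:
        name = "CCMP"    # map 0, opcodes 70-7F (short conditional jumps)
    elif m0 == 1 and high == 0x80:
        name = "CTEST"   # map 1, opcodes 80-8F (near conditional jumps)
    else:
        return None
    low = opcode & 0x0F
    # Flag values to copy when the REX3 condition is not met.
    flags = {
        "OF": (low >> 3) & 1,
        "SF": (low >> 2) & 1,
        "ZF": (low >> 1) & 1,
        "CF": low & 1,
    }
    return name, flags
```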
Changes to VEX3 (for new registers)
In this regard, code density could also be trivially improved for instructions (AVX, AVX-2) that make use of the VEX3 prefix, should it become necessary to access the 16 general-purpose registers that APX has added, without having to resort to the longer (by one byte) and more complicated EVEX. VEX3 currently has the following format:
VEX3 (3-byte VEX)

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0xC4) | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0
1 | R̅ | X̅ | B̅ | m4 | m3 | m2 | m1 | m0
2 | W | v̅3 | v̅2 | v̅1 | v̅0 | L | p1 | p0
whereas with my proposal it would become:
New VEX3 (3-byte VEX)

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0xC4) | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0
1 | R̅3 | X̅3 | B̅3 | R4 | X4 | B4 | m1 | m0
2 | W | v̅3 | v̅2 | v̅1 | v̅0 | L | p1 | p0
This reuses bits m4..m2 to add the 3 bits needed to specify the new registers. It would reduce the selectable opcode maps from 32 to just 4, but this would not be a big problem, for a couple of reasons.
The first is that there are currently only four maps for all instructions (and there is still room to add more), so none would be missing. The second is that the current trend is to use AVX-512 to extend the SIMD instruction set, which always makes use of the EVEX prefix (which supports up to 8 maps, so there is plenty of room to add another thousand instructions).
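The modified VEX3 can be sketched as below. The helper name encode_new_vex3 and its parameters are hypothetical. It assumes, following the overline notation in the table, that R3, X3, B3 and the vvvv field are stored inverted (as in today's VEX3), while the new R4, X4 and B4 bits are stored as they are.

```python
def encode_new_vex3(r3, x3, b3, r4, x4, b4, mm, w, vvvv, l, pp):
    """Return the three bytes of the proposed VEX3 prefix (hypothetical helper)."""
    if any(bit not in (0, 1) for bit in (r3, x3, b3, r4, x4, b4, w, l)):
        raise ValueError("single-bit fields must be 0 or 1")
    if not 0 <= mm <= 3 or not 0 <= pp <= 3 or not 0 <= vvvv <= 0xF:
        raise ValueError("mm and pp take 2 bits, vvvv takes 4 bits")
    # Byte 1: inverted R3/X3/B3, then R4/X4/B4 (assumed non-inverted),
    # then the 2-bit map selector that replaces the old 5-bit one.
    byte1 = (((r3 ^ 1) << 7) | ((x3 ^ 1) << 6) | ((b3 ^ 1) << 5)
             | (r4 << 4) | (x4 << 3) | (b4 << 2) | mm)
    # Byte 2: unchanged from the current VEX3; vvvv is stored inverted.
    byte2 = (w << 7) | ((~vvvv & 0xF) << 3) | (l << 2) | pp
    return bytes([0xC4, byte1, byte2])
```

Under these assumptions, an instruction that uses none of the new registers keeps the encoding it has today: map 1 with no extension bits still yields C4 E1 78, because the reused m4..m2 bits were already zero.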
New prefixes REXM0 and REXM1 to eliminate EVEX
With a similar approach, but copying what has already been done with the REX3 prefix that I proposed just above, one could avoid using EVEX altogether in order to ‘promote’ instructions from binary to ternary, and from unary to binary, which EVEX makes possible thanks to the new ND bit (which, set to 1, enables this new functionality) and the v̅4..v̅0 field that allows one to specify the register to be used to store the result of the operation.
In this case, it would be a matter of reusing some opcodes that x64 has freed (by removing some legacy x86 instructions) to add the following two prefixes:
REXM0 (3-byte REX with NDD, for map 0)

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0x06, 0x16) | 0 | 0 | 0 | NDD4 | 0 | 1 | 1 | 0
1 | NF | R4 | X4 | B4 | W | R3 | X3 | B3
2 | NDD3 | NDD2 | NDD1 | NDD0 | SC3 | SC2 | SC1 | SC0

REXM1 (3-byte REX with NDD, for map 1)

Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
---|---|---|---|---|---|---|---|---
0 (0x0E, 0x1E) | 0 | 0 | 0 | NDD4 | 1 | 1 | 1 | 0
1 | NF | R4 | X4 | B4 | W | R3 | X3 | B3
2 | NDD3 | NDD2 | NDD1 | NDD0 | SC3 | SC2 | SC1 | SC0
As can be seen, the two new prefixes REXM0 and REXM1 (using opcodes 06, 16, 0E and 1E, corresponding to the old PUSH ES, PUSH SS, PUSH CS and PUSH DS instructions) are very similar to REX3, but with some slight differences.
Firstly, it is possible to specify the destination register (NDD) via the new NDD4..NDD0 bits (without having to set the ND bit, which is implicit). Then, the M0 bit has disappeared to make way for NDD0, as map 0 or map 1 is now selected by using the appropriate prefix (REXM0 for map 0 and REXM1 for map 1). Similarly, if needed, other prefixes could be added to support new maps (there are still enough legacy instruction opcodes that are free in x64).
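To make the split of the NDD field across the first and third bytes concrete, here is a minimal sketch. The helper name encode_rexm and its parameters are hypothetical, and it assumes the NDD register number is stored non-inverted, since the tables show it without an overline.

```python
def encode_rexm(map_id, ndd, nf, r4, x4, b4, w, r3, x3, b3, sc):
    """Return the three bytes of the proposed REXM0/REXM1 prefix (hypothetical helper)."""
    if map_id not in (0, 1):
        raise ValueError("only map 0 (REXM0) and map 1 (REXM1) are defined")
    if any(bit not in (0, 1) for bit in (nf, r4, x4, b4, w, r3, x3, b3)):
        raise ValueError("single-bit fields must be 0 or 1")
    if not 0 <= ndd <= 31 or not 0 <= sc <= 0xF:
        raise ValueError("ndd takes 5 bits, sc takes 4 bits")
    # Byte 0: 0x06/0x16 for map 0, 0x0E/0x1E for map 1; bit 4 carries NDD4.
    base = 0x06 if map_id == 0 else 0x0E
    byte0 = base | (((ndd >> 4) & 1) << 4)
    # Byte 1: same layout as in the proposed REX2 and REX3.
    byte1 = ((nf << 7) | (r4 << 6) | (x4 << 5) | (b4 << 4)
             | (w << 3) | (r3 << 2) | (x3 << 1) | b3)
    # Byte 2: NDD3..NDD0 in the high nibble, SC3..SC0 in the low nibble.
    byte2 = ((ndd & 0xF) << 4) | sc
    return bytes([byte0, byte1, byte2])
```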
It should be emphasised that these two prefixes do not need to implement the new CCMP and CTEST instructions as well, since the new destination register is not used in this case (there is no result to store: they are just flags-altering instructions). Their implementation using only REX3 is therefore sufficient, as explained above.
These two new prefixes are shorter (by one byte) than EVEX, thus limiting the damage to code density caused by using such long prefixes, but they also have the added advantage of making conditional any general-purpose instruction that has been extended to ternary or binary.
For example:
; Add 1234567890 to the 64-bit value from memory and save it to RAX if the zero flag (Z) is set.
ADD.Z RAX,[RBX + RCX * 8 + 1234],1234567890
Its operation, as well as its potential, should be self-explanatory; it is worth stressing that the instruction would not generate any exception if the condition were not met and the memory location were inaccessible.
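The intended behaviour of this example can be modelled with a small sketch. Everything here is an assumption made for illustration: the function name conditional_add, the use of a dictionary as memory, and a KeyError standing in for a memory fault. The update of the arithmetic flags is left out for brevity.

```python
MASK64 = (1 << 64) - 1

def conditional_add(regs, memory, flags, dest, address, imm, cond_flag="ZF"):
    """Model a predicated three-operand ADD as proposed (hypothetical semantics).

    Returns True if the instruction executed, False if it was ignored.
    """
    if not flags[cond_flag]:
        # Condition false: the instruction has no effect at all.
        # Memory is never read, so an inaccessible address cannot fault.
        return False
    value = memory[address]  # a KeyError here stands in for a memory fault
    regs[dest] = (value + imm) & MASK64
    return True
```

With the zero flag clear, calling the function with an unmapped address leaves RAX untouched and raises nothing; with the zero flag set and a mapped address, RAX receives the sum.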
Furthermore, and to close, REXM0 and REXM1 are also much simpler to implement (the mechanism is similar to REX2 and REX3, which in turn are similar to REX) than the enormous complication of the new EVEX prefix.
Changes to EVEX (for new registers)
Having become completely useless for the ‘promotion’ of general-purpose instructions, EVEX now only requires the trivial addition of the 3 bits to address the new APX registers, as already proposed for VEX3. So its new format will be this:
Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits
---|---|---|---|---|---|---|---|---|---
Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
Byte 1 (P0) | R̅3 | X̅3 | B̅3 | R̅4 | B4 | m2 | m1 | m0 | P[7:0]
Byte 2 (P1) | W | v̅3 | v̅2 | v̅1 | v̅0 | X̅4 | p1 | p0 | P[15:8]
Byte 3 (P2) | z | L’ | L | b | v̅4 | a2 | a1 | a0 | P[23:16]
and would continue to function exactly as now: exclusively for AVX-512 instructions.
Summary of the proposed changes
Coming to a close, I think it is appropriate to recapitulate the benefits of the proposed changes to APX:
- simplified implementation (and, consequently, fewer transistors and lower power consumption);
- less impact on code density (25% to 50% less space occupied by the new prefixes, compared to the use of EVEX, for both general-purpose and AVX/VEX3 instructions), which in turn translates into lower consumption (less pressure on caches and, in general, on the entire memory hierarchy);
- all general-purpose instructions that modify flags can suppress their generation (the use of NF becomes orthogonal);
- all general-purpose instructions become conditional (with simplification of both the compilers and the execution pipeline, which now only has to decide whether or not to commit the instruction).
The advantages of these solutions should be obvious: the same amount of new functionality is made available, but with the not inconsiderable possibility of conditionally executing all general-purpose instructions (a new feature, therefore, in addition to what APX offers).
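The 25% to 50% figure in the summary follows directly from the prefix lengths given in this article, as this small calculation shows. The dictionary and function names are made up for the example.

```python
# Prefix lengths in bytes, as stated in this article.
PREFIX_BYTES = {
    "REX2": 2,
    "REX3": 3,
    "REXM0": 3,
    "REXM1": 3,
    "VEX3": 3,
    "EVEX": 4,
}

def saving_vs_evex(prefix):
    """Percentage of prefix bytes saved by using `prefix` instead of EVEX."""
    evex = PREFIX_BYTES["EVEX"]
    return 100 * (evex - PREFIX_BYTES[prefix]) / evex

for name in ("REX2", "REX3", "REXM0", "REXM1", "VEX3"):
    print(f"{name}: {saving_vs_evex(name):.0f}% shorter than EVEX")
```

The saving applies to the prefix alone; the effect on whole instructions is smaller, since opcode, ModR/M, displacement and immediate bytes are unchanged.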
Finally, it should be noted that the new prefixes are designed so that all the innovations can be used incrementally. The new REX2 offers, as a basic feature, access to the new registers and suppression of flags generation (NF). On top of that, REX3 adds the possibility of specifying the condition for instruction execution. REXM0 and REXM1 add, on top of that, the new destination register (NDD). All in a simple and ‘compiler-friendly’ manner.
The next article will be the last and will report the conclusions regarding APX.