After the first introductory article on APX, which also set out the format of the new REX2 prefix and, above all, the modifications to the EVEX prefix, let’s analyse what the latter offers beyond access to the 16 new general-purpose registers (the only innovation made available by REX2). EVEX also allows certain existing instructions to be extended and improved (not all of them: only a certain number have the privilege of being ‘promoted’ for the purpose) by enabling appropriate functionalities.
The base, therefore, remains that of the promoted instruction, but its behaviour is altered by setting certain bits (NF and ND) that EVEX now makes available, which can also be used simultaneously (enabling both functionalities).
NF: No Flags
I think it is very rare for a CISC processor to lack a flags register or, at any rate, a few bits reserved somewhere for storing the flags arising from the result of the last executed operation (usually arithmetic or logic), and obviously x64 is no different.
This architectural choice stems from the need to get more ‘useful work’ out of executed instructions, saving the separate instructions that would otherwise be needed to check the result explicitly (a result that may well be used immediately to decide what to do next).
However, it is also true that these checks are not always necessary: most of the time the flags generated by an instruction are never used, so the processor has only ‘wasted time’ (used up resources unnecessarily) in producing them. Subsequent instructions executed in sequence simply discard the old flags and replace them with new ones.
Sometimes, however, the flags produced by an instruction are needed later but, in the meantime, other instructions are executed that destroy (replace) them.
In such cases, the only option is to store them somewhere and retrieve them just when they are needed. This entails executing additional save and restore instructions and finding a place to hold the flags temporarily, all of which impacts performance and the space (or registers) used.
Precisely in order to deal with the latter scenario, Intel has introduced, with APX, the NF functionality, which makes it possible to avoid generating flags for instructions that would normally do so. It is therefore sufficient to ‘extend’ (using the EVEX prefix) the instructions executed after the one whose flags we want to preserve, setting this bit for all of them, so that the flags generated before their execution survive.
These may seem like rare scenarios, but they are not at all, especially since instructions are often ‘reordered’ so that as many of them as possible are executed simultaneously. One only has to think, for instance, of the very common loops, whose exit condition (or, vice versa, the condition for repeating the loop) must be checked (test) in order to decide which of the two paths to take (jump).
It is quite common for other instructions to be inserted between the instruction producing the condition to be checked and the corresponding jump instruction (which checks the condition) in order to ‘delay’ the dependency between these two instructions (test and then jump) and try to maximise the number of instructions executed. It is immediately clear that if the inserted instructions altered the flags in turn, we would have a big problem (downstream).
A practical example (from one of my implementations of the very famous daxpy function of the BLAS library, used in HPC – High Performance Computing applications):
xor eax,eax ; Sets offset of first vector.
sub rcx,8 ; Checks if we have a full vector to process.
js .check_tail ; No, we have none.
The SUB instruction is used to check whether there are enough elements left to fill a vector (an AVX-512 register) and, if not, to jump (with the JS instruction) to the routine that processes the remaining elements (which do not fill an entire vector, of course).
JS is executed immediately after SUB, so it must wait until the latter instruction has been processed before it can evaluate (thanks to the flags updated by SUB) whether or not to jump. In the meantime, the processor pipeline is blocked (there is a stall), and this is particularly true if we are dealing with an in-order rather than an out-of-order processor.
We could make better use of the processor’s resources if we could instead place the XOR instruction between SUB and JS, so that some other calculation is performed in the meantime and the pipeline is kept busy:
sub rcx,8 ; Checks if we have a full vector to process.
xor eax,eax ; Sets offset of first vector.
js .check_tail ; No, we have none.
The problem is that XOR alters the flags in turn, so those generated by SUB would be destroyed and JS would decide whether or not to jump based on the wrong flags.
A problem which, with APX, is brilliantly solved by preventing these additional instructions from altering the flags, thanks to NF (it is enough to set this bit in the XOR).
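The effect of NF on the reordered SUB / XOR / JS sequence can be sketched with a toy model. This is only an illustration in Python, not real APX code: the `ToyCpu` class, its methods and the `nf` parameter are hypothetical names introduced here, every register is treated as 64 bits wide for simplicity, and only the sign and zero flags are modelled.

```python
from dataclasses import dataclass, field

MASK64 = (1 << 64) - 1  # assumption: the toy model treats every register as 64-bit


@dataclass
class ToyCpu:
    """Hypothetical, heavily simplified CPU: a register file plus two flags."""
    regs: dict = field(default_factory=dict)
    sf: bool = False  # sign flag, the one JS tests
    zf: bool = False  # zero flag

    def _update_flags(self, result):
        self.sf = bool(result >> 63)
        self.zf = result == 0

    def sub(self, dst, imm, nf=False):
        result = (self.regs.get(dst, 0) - imm) & MASK64
        self.regs[dst] = result
        if not nf:  # with NF set, the old flags are left untouched
            self._update_flags(result)

    def xor(self, dst, src, nf=False):
        result = self.regs.get(dst, 0) ^ self.regs.get(src, 0)
        self.regs[dst] = result
        if not nf:
            self._update_flags(result)


def js_taken_after_reorder(count, nf):
    """Run SUB rcx,8 ; XOR rax,rax ; then report whether JS would jump."""
    cpu = ToyCpu(regs={"rcx": count, "rax": 123})
    cpu.sub("rcx", 8)
    cpu.xor("rax", "rax", nf=nf)
    return cpu.sf


# Only 5 elements left: SUB produces a negative result, so JS should jump.
print(js_taken_after_reorder(5, nf=False))  # False: XOR overwrote SUB's flags
print(js_taken_after_reorder(5, nf=True))   # True: SUB's flags survived
```

In the model, the plain XOR clears the sign flag and the jump to the tail routine is missed, while the NF variant leaves the flags produced by SUB in place.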
The price to be paid, however, is very high if the only intention is to prevent the flags from being modified, since using EVEX for instructions with NF lengthens the instruction by 3 or 4 bytes (depending on whether the instruction is in map 1 or map 0), with a negative impact on code density (a topic that will be better addressed in a separate article).
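The ‘3 or 4 bytes’ figure can be worked out with a little arithmetic. The sketch below assumes that the EVEX prefix is 4 bytes long, that a map 1 instruction loses its 0F escape byte because the map is encoded inside the prefix, and that the legacy encoding carries no REX prefix (which EVEX would otherwise replace). The function name and the example encodings are illustrative choices, not taken from the Intel document.

```python
EVEX_PREFIX_BYTES = 4  # 0x62 followed by three payload bytes


def evex_length(legacy_length, legacy_map):
    """Length of an instruction once it is re-encoded with an EVEX prefix.

    Assumption: no REX prefix in the legacy form. A map 1 instruction drops
    its one-byte 0F escape, because the map is encoded inside EVEX.
    """
    escape_bytes = {0: 0, 1: 1}[legacy_map]
    return legacy_length - escape_bytes + EVEX_PREFIX_BYTES


# xor eax,eax is 31 C0 (map 0, 2 bytes).
xor_legacy = 2
print(evex_length(xor_legacy, 0) - xor_legacy)    # 4 extra bytes

# imul eax,ebx is 0F AF C3 (map 1, 3 bytes).
imul_legacy = 3
print(evex_length(imul_legacy, 1) - imul_legacy)  # 3 extra bytes
```

Under these assumptions, a two-byte instruction such as the XOR of the example triples in size just to leave the flags alone.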
For these reasons, one cannot think of using NF freely: it must be used judiciously, only where it is really needed, making sure that instructions are executed without it as much as possible.
NDD: New Data Destination
With NDD, we can say that we are facing a memorable change for this architecture, which makes x64 with APX not only a more modern ISA, but also far more competitive than other architectures. I can safely say, without any doubt, that this is the most important feature APX has introduced. And it is also the one that, in my humble opinion, contributes the most to improving performance.
In fact, by setting the appropriate ND bit, instructions (not all of them, but only those that have been ‘promoted’ by APX) can (finally!) make use of an additional register (defined by the v̅4..v̅0 field of EVEX) into which the result of the operation will go.
As anticipated, if an instruction is binary (makes use of two arguments to perform its calculations), then with ND it becomes ternary. The two arguments present in normal x64 instructions therefore both become data sources, while the additional register made available by APX (via the EVEX prefix) holds the result of the operation (whereas with normal binary instructions, one of the two arguments acts as both source and destination).
Similarly, a unary instruction (with only one argument acting as both source and destination) is transformed into a binary one thanks to ND, with the only (x64) argument acting solely as source.
The examples given in the Intel document explain this better than a thousand words:
| Existing x86 form | Existing x86 semantics | NDD extension | NDD semantics |
| INC r/m | r/m := r/m + 1 | INC ndd, r/m | ndd := r/m + 1 |
| SUB r/m, imm | r/m := r/m - imm | SUB ndd, r/m, imm | ndd(v) := r/m - imm |
| SUB r/m, reg | r/m := r/m - reg | SUB ndd, r/m, reg | ndd(v) := r/m - reg |
| SUB reg, r/m | reg := reg - r/m | SUB ndd, reg, r/m | ndd(v) := reg - r/m |
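The semantics in the table can be illustrated with a small model. This is a hedged sketch in Python, not real APX code: `sub_legacy`, `sub_ndd`, `mov` and the `state` dictionary (which stands in for registers or memory locations alike) are hypothetical names introduced for the illustration.

```python
MASK64 = (1 << 64) - 1


def mov(state, dst, src):
    """dst := src"""
    state[dst] = state[src]


def sub_legacy(state, dst, src):
    """Legacy two-operand form: dst := dst - src (dst is overwritten)."""
    state[dst] = (state[dst] - state[src]) & MASK64


def sub_ndd(state, ndd, src1, src2):
    """NDD three-operand form: ndd := src1 - src2 (both sources untouched)."""
    state[ndd] = (state[src1] - state[src2]) & MASK64


# Goal: rcx := rax - rbx, keeping rax and rbx intact.

# Without NDD, a copy is needed first: two instructions.
legacy = {"rax": 10, "rbx": 3, "rcx": 0}
mov(legacy, "rcx", "rax")
sub_legacy(legacy, "rcx", "rbx")

# With NDD, a single instruction is enough.
ndd = {"rax": 10, "rbx": 3, "rcx": 0}
sub_ndd(ndd, "rcx", "rax", "rbx")

print(legacy == ndd)  # True: same final state, one instruction fewer
```

The saved instruction is the copy that legacy code needs whenever the first source must survive the operation.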
The flexibility of Intel’s solution is far superior to the equivalent ternary and binary instructions provided by other architectures, for two reasons:
- it allows one of the sources to be a memory operand (whereas other ISAs generally only use registers. RISCs, in fact, are usually ‘load/store’ processors: memory access is carried out exclusively with these two types of instruction);
- for ternary instructions, the memory operand can be either of the two sources (instead of only the first or only the second, i.e. in a fixed pattern).
Needless to say, emulating this behaviour on other processors requires executing more instructions. APX is therefore more efficient in these cases, with obvious performance benefits.
One important difference from normal instructions, however, must be emphasised. When ‘promoted’ instructions using ND operate on 8 or 16 bits, the bits of the destination register not directly affected (those above the low 8 or 16) are cleared, similar to what happens with all 32-bit operations on x64 (this behaviour is called clear in jargon), whereas normally they retain their value: in these cases, the processor ‘merges’ the new 8/16 bits with the other bits not involved in the operation (this is called merge in jargon).
This decision was made in order to avoid stalls in the processor pipeline caused by accesses to a register of which a previous instruction has changed only a part (e.g. one instruction changes only the low 8 bits of a register, but the next one reads all 64 bits), with the related (negative) performance consequences.
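The difference between the two behaviours can be shown on a 64-bit value. The following Python sketch is only an illustration: the helper names `write_merge` and `write_clear` are hypothetical and model just the final register contents, not any real encoding.

```python
MASK64 = (1 << 64) - 1


def write_merge(old, value, bits):
    """Legacy 8/16-bit write: the upper bits of the register are preserved."""
    low_mask = (1 << bits) - 1
    return (old & ~low_mask & MASK64) | (value & low_mask)


def write_clear(value, bits):
    """NDD 8/16-bit write: the upper bits of the destination are zeroed."""
    return value & ((1 << bits) - 1)


old = 0x1122334455667788

print(hex(write_merge(old, 0xAB, 8)))     # 0x11223344556677ab
print(hex(write_clear(0xAB, 8)))          # 0xab
print(hex(write_merge(old, 0xBEEF, 16)))  # 0x112233445566beef
print(hex(write_clear(0xBEEF, 16)))       # 0xbeef
```

Note that `write_clear` does not need the old register value at all, which is exactly why the dependency on the previous contents disappears.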
The only exceptions to this new behaviour are certain IMUL (signed multiplication) and SETcc (set a byte to 1 if the condition specified by cc is met, or to 0 otherwise) instructions, which do not support NDD (they make no use of the additional destination register). These behave as usual (performing the merge) if the ND bit is 0, while the new clear mechanism is applied if ND is 1.
That is all for the moment. In the next article, we will cover the new instructions that have been introduced by APX.