With the analysis of the pros and cons of APX's features closed, we now move on to a series of analyses and reflections, starting with the familiar code density, which is an extremely important factor.
This is because the space occupied in memory by instructions has implications for the entire memory hierarchy and, therefore, directly affects performance. The subject is complex (and has been on the academic and industrial agendas for a very long time), and it only takes a trivial search to realise how much material has been written about it, but I will quote below a summary from the thesis of one of the RISC-V designers to show how important this aspect is:
Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction
cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction
cache size.
The highlighted parts (especially the last one) should be quite telling, and although they relate only to RISC-V, similar results can be found in all architectures, as the concept and issues are general. They are not, however, directly relevant to this series of articles, so I will only report on my observations regarding APX.
Intel claims that, by enabling APX, code density is ‘similar’ to x64 (which itself is not that brilliant!), based on preliminary results from compiling the aforementioned SPEC2017 test suite. If this were really confirmed, it would certainly be quite a coup (it would mean that all the innovations introduced have compensated for the considerable increase in instruction size).
At the moment, however, I harbour some doubts about this, not least because Intel has not committed itself to anything more precise: it has not claimed that density is equal, slightly worse or slightly better, but only ‘similar’. This climate of uncertainty therefore deserves at least some consideration, by bringing in some numbers until the ‘official’ ones arrive.
PUSH2 and POP2 (using EVEX)
Let us start immediately with an element that is certainly known to worsen code density: the new PUSH2 and POP2 instructions. Their encoding requires the use of the EVEX prefix, so four bytes are needed for the prefix alone, plus one for the opcode and finally another for ModR/M, which would normally be needed to reference memory (in reality, only the configuration specifying a register, and not a memory location, is used). Total: six bytes (minimum).
For comparison, a PUSH or POP instruction requires one or two bytes (depending on whether an x86 register or a new x64 register is used). So a pair of these requires two to four bytes, in any case much less than six. The deterioration that occurs with the new instructions is, in this case, very obvious (and, as already mentioned, these instructions are used a lot).
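To put some numbers side by side, here is a minimal sketch of the two approaches; the assembler syntax and the operand order of PUSH2/POP2 are purely illustrative (toolchain support is still settling), while the byte counts follow the encoding rules just described.

    ; x64 today: saving and restoring two registers
    push rbx            ; 1 byte
    push rbp            ; 1 byte
    ; ...
    pop  rbp            ; 1 byte
    pop  rbx            ; 1 byte

    ; the same with two of the R8-R15 registers (REX prefix needed)
    push r12            ; 2 bytes
    push r13            ; 2 bytes

    ; APX: one instruction per pair, but EVEX-encoded
    push2 rbx, rbp      ; 6 bytes (4-byte EVEX + opcode + ModR/M)
    ; ...
    pop2  rbp, rbx      ; 6 bytes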
The new registers (using REX2)
Using at least one of the new registers requires the REX2 prefix, which alone takes up two bytes and must be present every time such a register is referenced in an instruction. Conversely, emulating their role with x64 in some way would require a varying number of bytes, depending on the scenario.
For example, if a register were needed temporarily for certain operations, its current value would first have to be stored somewhere, and the stack is the most convenient and suitable place. One would then do a PUSH to save it and a POP to restore it once the operations have been completed. We know that the cost in this case varies from one to two bytes per instruction, so in total two to four bytes would be needed: at best we would break even with the use of REX2, but in the worst case we would double the cost.
But the advantage of x64 is that the operations performed between the PUSH and the POP would not require REX2, but at most REX (to reference the new registers of x64), which occupies only one byte; so executing two or more instructions would absorb the cost of the PUSH and POP, and at some point we would come out ahead in terms of space used. Whereas, as already mentioned, all instructions using the new APX registers would always require REX2, constantly paying two bytes each time.
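A sketch of this trade-off, assuming Intel syntax and an assembler that already understands the new R16-R31 names; the byte counts follow the figures above.

    ; x64: free RBX as a temporary, bracketing the block with PUSH/POP
    push rbx            ; 1 byte
    mov  rbx, rdi       ; 3 bytes (REX.W + opcode + ModR/M)
    add  rbx, rsi       ; 3 bytes
    sub  rbx, rdx       ; 3 bytes
    pop  rbx            ; 1 byte
    ; total: 11 bytes, and the PUSH/POP cost is paid only once

    ; APX: use a genuinely new register, paying the 2-byte REX2 prefix every time
    mov  r16, rdi       ; 4 bytes (REX2 + opcode + ModR/M)
    add  r16, rsi       ; 4 bytes
    sub  r16, rdx       ; 4 bytes
    ; total: 12 bytes, and every further instruction costs one byte more than its RBX counterpart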
Another scenario to emulate the operation of the new APX registers would be to use the stack as a sort of bank of additional registers, directly referencing precise locations (e.g. [SP+16] to emulate the new R16 register, [SP+24] for R17, [SP+32] for R18, and so on).
In this case the costs would be different, depending on the use. For example, copying from/to the stack requires the use of a MOV instruction, which takes up three bytes (opcode + ModR/M + 8-bit offset) if only x86 registers are used, and four bytes (the REX prefix is required) if at least one x64 register is used.
Whereas an equivalent MOV using one of the new registers made available by APX will always need REX2, but not the 8-bit offset, so it will always require four bytes. In this case the solution with the stack (x64) would be more advantageous (one byte could be saved, while in the worst case it would occupy the same space).
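A sketch of the copy case. I have (hypothetically) placed the slot standing in for R16 at a frame-pointer-relative address, which matches the byte counts above; an RSP-relative address such as [SP+16] would also need a SIB byte, adding one more to the x64 side.

    ; x64 emulation: a stack slot stands in for R16
    mov  eax, [rbp-16]  ; 3 bytes (opcode + ModR/M + disp8)
    mov  rax, [rbp-16]  ; 4 bytes (REX.W added)

    ; APX: the real extra register, REX2-prefixed but with no displacement
    mov  eax, r16d      ; 4 bytes (REX2 + opcode + ModR/M)
    mov  rax, r16       ; 4 bytes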
It would be a different matter if the new register were to be used in instructions operating only on registers. In this case the x64 emulation, going through ModR/M, would always have to specify the 8-bit offset to reach the stack slot, while the APX equivalent would be forced to use REX2. The advantage would clearly lie with the stack solution, because a byte would always be saved.
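Keeping the same hypothetical layout ([rbp-16] standing in for R16), the register-only case looks like this:

    ; x64 emulation: the operand hidden in the stack slot costs a disp8
    add  eax, [rbp-16]  ; 3 bytes (opcode + ModR/M + disp8)

    ; APX: a pure register-to-register operation, but REX2-prefixed
    add  eax, r16d      ; 4 bytes (REX2 + opcode + ModR/M)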
If, on the other hand, the new register were to be used in instructions that also referenced a memory location, then the stack solution would be less efficient and take up much more space, as it would be necessary to find a free register (whose current value would first have to be saved) in order to load the value from the stack, then perform the required operation, and finally restore the register that had been borrowed.
It would then take 2-4 bytes to ‘free’ the required register and 3-4 bytes to copy the value from the stack into it, whereas using the new register with APX requires only two extra bytes due to REX2. Furthermore, if the final value also had to be stored back on the stack, then another instruction (3-4 bytes) would be needed for this purpose. The price to pay in such cases would be very steep!
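A sketch of this worst case, again with hypothetical names ([rbp-16] as the slot emulating R16, buffer as the memory operand referenced by the instruction):

    ; x64 emulation of "add r16, [buffer]", with [rbp-16] standing in for R16
    push rax              ; 1 byte: borrow a register
    mov  rax, [rbp-16]    ; 4 bytes: load the emulated R16
    add  rax, [buffer]    ; the actual operation
    mov  [rbp-16], rax    ; 4 bytes: store the result back into the slot
    pop  rax              ; 1 byte: restore the borrowed register

    ; APX: the same operation directly, paying only the 2-byte REX2 prefix
    add  r16, [buffer]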
A hybrid solution between the two (and the preferable one) would be to PUSH the register to be used onto the stack, and then make free use of it, keeping track of where its old value now sits within the stack. Eventually, when no longer needed, a POP would restore the contents of the temporarily borrowed register. This is a technique that I used in the first half of the 1990s, when I tried my hand at writing an 80186 emulator for Amiga systems equipped with a 68020 processor (or higher), and it has the virtue of combining the two previous scenarios, taking the best of each (minimising the cost of preserving and restoring the previous value on/from the stack).
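A sketch of the hybrid approach, with hypothetical register choices: RBX is freed with a single PUSH, serves as the ‘extra’ register, and its saved value remains reachable on the stack for as long as it is needed.

    push rbx              ; 1 byte: RBX's old value now lives at [rsp]
    mov  rbx, rdi         ; use RBX freely, with plain REX encodings
    add  rbx, rsi
    add  rbx, [rsp]       ; the preserved original value is still accessible
    ; ...
    pop  rbx              ; 1 byte: restore RBX when done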
As can be seen, the advantage lies with either x64 or APX depending on the specific scenario, that is, on how and how much the new registers are used.
NF (No Flags. Using EVEX)
Turning to the NF (No Flags) functionality, which is used to suppress the generation of flags: it requires the EVEX prefix, which means that four bytes must always be added to the length of the instruction, except when the instruction resides in map 1 (i.e. with a 0F prefix). In that case the 0F prefix is already incorporated into EVEX, so the additional bytes are reduced to three.
The increase in instruction length would therefore be decidedly substantial, especially if we consider that, to emulate this behaviour with x64, it would be enough to save the flags on the stack with the PUSHF instruction and restore them at the end of the operations (that is, when their value needs to be checked or used) with the POPF instruction. Total cost: just two bytes.
If we also consider that the block of instructions to be executed could change the flags several times and, therefore, that NF would have to be used on each of them, the cost for APX would increase even more (whereas for x64 it always remains fixed at two bytes).
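A sketch of the two approaches, assuming that flags computed earlier must survive a block that would otherwise clobber them; the {nf} notation for the flag-suppressing APX forms follows the convention adopted by recent GNU assemblers and is shown only for illustration.

    ; x64: preserve the live flags around the block
    pushfq                ; 1 byte
    add  rax, rbx         ; ordinary encodings, flags freely overwritten
    sub  rcx, rdx
    popfq                 ; 1 byte: the earlier flags are back

    ; APX: every instruction in the block must individually suppress its flags,
    ; paying the 3-4 extra EVEX bytes each time
    {nf} add rax, rbx
    {nf} sub rcx, rdx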
The advantage of the current x64 solution (PUSHF + POPF) over APX is therefore considerable in terms of code density.
NDD (New Data Destination. Using EVEX)
The last feature to be considered, and one which has a major influence on code density, is NDD which, as we have already seen, allows binary instructions to become ternary and unary instructions to become binary, giving the possibility of using a register as the destination (with the current two operands of x64 both acting, at that point, as data sources).
Such an instruction always requires the EVEX prefix, so the same considerations apply as before for NF: four extra bytes are needed, except for instructions in map 1 (which require three), to which the opcode byte and the ModR/M byte must then be added. So in total (at least) six bytes would always be needed.
To emulate this with x64 would always require an extra instruction: a MOV that would copy the value of the first source into the (new) destination register. As we have seen, this requires 2-3 bytes. Then you would need the instruction that performs the actual operation, which would in turn require 2-3 bytes. So a total of 4-6 bytes would be needed.
To simplify things a bit, I have left out of these calculations any offsets and/or immediate values, as they are invariant between the two solutions (they occupy exactly the same bytes).
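A sketch of the comparison, assuming Intel syntax with the destination written first in the APX ternary (NDD) form; the byte counts match the figures above.

    ; APX (NDD): ternary form, with a destination separate from the two sources
    add  r10, rax, rbx    ; 6 bytes (4-byte EVEX + opcode + ModR/M)

    ; x64 emulation: copy the first source, then operate on the copy
    mov  r10, rax         ; 3 bytes (REX.W + opcode + ModR/M)
    add  r10, rbx         ; 3 bytes -> 6 bytes in total

    ; with 32-bit legacy registers the emulation is even cheaper
    mov  edx, eax         ; 2 bytes
    add  edx, ebx         ; 2 bytes -> 4 bytes in total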
What remains is the real difference between APX and x64 when it comes to executing a ternary or binary instruction (emulated, in the case of the latter), and you can see how x64 would be more efficient or, at most, would cost the same.
Summing up
Trying to come to some conclusions: if we take the individual new functionalities of APX, my opinion is that their use will, on average, have a decidedly negative impact on code density, which will decrease compared to x64 (which, as already mentioned, is certainly not in good shape in this respect compared to other architectures, including the x86 from which it derives).
On the other hand, when it is possible to combine/use more than one of these functionalities in the same instruction, there would be advantages (the savings would accumulate). But I do not believe that such scenarios are, in any case, common enough to heavily influence code density; in my opinion, they are not frequent enough to bring APX, overall, even close to x64.
We will see concretely in the future, when the first processors with APX and the corresponding binaries are released: what I have provided for the moment are personal evaluations, dictated only by my experience in this area and by my analysis of the costs of using the new features of this extension.
The next article will deal with the subject of APX implementation costs.