Having discussed the innovative features of APX, let us turn to the new instructions that have been added by this extension.
Calling convention (routines)
Doubling the number of general-purpose registers means that the new ones also have to be saved to and restored from the stack around calls to routines (whether functions or methods), according to the calling convention adopted by the specific platform (which is part of the so-called ABI).
Intel has proposed defining the new registers as volatile, i.e. they are freely usable by the routine that has been called (callee, in jargon). It will, therefore, be up to the calling routine (the caller) to store their values before invoking the routine, and then restore them immediately afterwards (this convention is called caller-saved).
There are pros and cons to every such choice. In this case, since saving and restoring these new registers is entirely the responsibility of the caller, code density will suffer quite a bit: these operations have to be repeated at every call site where the caller needs the values to survive the call (so if there are 100 places in the program that call the routine, the save and restore operations for the new registers in use will appear 100 times).
If, on the other hand, the opposite convention (callee-saved) had been adopted, code density would have benefited considerably (because there would have been only one point in the program where these operations were performed: at the beginning and end of the called routine), but the performance of the routine would have suffered (because the new registers would have had to be saved before they could be used and, vice versa, they would have had to be restored before returning any results or, in any case, returning control to the caller).
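To put rough numbers on this trade-off, here is a minimal Python sketch. The function name and the cost model are my own simplification, not anything defined by an ABI: it assumes one save and one restore instruction per register, and that the caller really has to preserve every register in use at every call site, which is the worst case described above.

```python
def save_restore_instructions(call_sites, registers_used, convention):
    """Count the static save/restore instructions present in the program.

    Simplified model: one save plus one restore per preserved register.
    """
    per_location = 2 * registers_used
    if convention == "caller-saved":
        # The sequence is repeated around every call site.
        return call_sites * per_location
    if convention == "callee-saved":
        # The sequence appears once, in the prologue/epilogue of the routine.
        return per_location
    raise ValueError(f"unknown convention: {convention}")


caller_cost = save_restore_instructions(100, 4, "caller-saved")
callee_cost = save_restore_instructions(100, 4, "callee-saved")
```

With 100 call sites and 4 of the new registers in play, the model gives 800 instructions for caller-saved against 8 for callee-saved. The price of the latter is that those 8 instructions are executed on every call, whether or not the caller cared about the values.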
It is not possible to establish a priori which convention is best, since this depends strictly on the type of code to be executed. But an ABI needs to set a convention anyway, because it must be valid for, and used by, all the applications that will run on the system, so a choice had to be made.
In my opinion, perhaps it would have been better to choose a middle way: a hybrid solution in which the first eight new registers could have been used freely by the caller (and, therefore, saved and restored by the callee, should it need to use them in turn), while the other eight would have been freely available to the callee (and, therefore, the caller would have had to preserve their values).
This is because a routine rarely uses all the registers at its disposal, so often some of the registers would have been used without any need for either the caller or the callee to preserve their values, with obvious advantages on both sides (including the infamous code density).
New instructions
Coming back to the new instructions, dealing with 32 registers means potentially having to execute several PUSH and POP instructions every time you fall into one of the above situations. These situations should also be quite frequent: if 16 new registers have been added, it is precisely because you want to use them, and often too (though not always all of them)! Otherwise, there would have been no point in making all these changes.
This sounds rather strange to me, since I still remember very well how AMD claimed, when introducing x86-64 (AKA x64), to have evaluated extending x86 to 32 registers instead of 16, but to have given up because the advantages did not prove to be significant (unlike the switch from 8 to 16 registers, where the differences were quite tangible, as we have seen for ourselves) and did not justify the greater implementation complexity of such a solution.
In any case, and going back to the topic, Intel thought of mitigating the situation a bit by adding a couple of new instructions, PUSH2 and POP2, which, as can be guessed from their mnemonics, push or pop two registers at a time to or from the stack, instead of just one (as is the case with the normal PUSH and POP). This can roughly halve the number of instructions that would normally be required, with obvious performance advantages (one instruction executed each time, instead of two).
An example, taken from an old version of FFMPEG (for x64):
PUSH RBP
PUSH R15
PUSH R14
PUSH R13
PUSH R12
PUSH RDI
PUSH RSI
PUSH RBX
SUB RSP, 0x68
LEA RBP, [RSP+0x80]
MOV ESI, [RIP+0x20f79f2]
TEST ESI, ESI
JZ 0x140d55203
LEA RSP, [RBP-0x18]
POP RBX
POP RSI
POP RDI
POP R12
POP R13
POP R14
POP R15
POP RBP
RET
easily shows how the number of PUSH and POP instructions could be halved by using the new PUSH2 and POP2.
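As a rough illustration of the halving, the following Python sketch pairs up a run of single-register operations. The helper `pair_stack_ops` is hypothetical and purely textual: it ignores operand order and any alignment or encoding constraints that the real PUSH2 and POP2 may impose, and the output is not meant to be valid assembler syntax. It is applied to the eight POP instructions of the epilogue above.

```python
def pair_stack_ops(mnemonic, registers):
    """Rewrite a run of single-register PUSH/POP as two-register forms.

    Illustrative only: real PUSH2/POP2 constraints are not modelled.
    """
    paired = []
    # Walk the list two registers at a time.
    for i in range(0, len(registers) - 1, 2):
        paired.append(f"{mnemonic}2 {registers[i]}, {registers[i + 1]}")
    # An odd register left over still needs the classic instruction.
    if len(registers) % 2:
        paired.append(f"{mnemonic} {registers[-1]}")
    return paired


epilogue = ["RBX", "RSI", "RDI", "R12", "R13", "R14", "R15", "RBP"]
pop2_sequence = pair_stack_ops("POP", epilogue)
```

Here the eight POPs collapse into four two-register operations; with an odd number of registers, one classic PUSH or POP would remain.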
Also on the subject, although not a new instruction as such, is the introduction of a so-called ‘hint’ for the PUSH and POP instructions (exclusively those operating on registers and using the classic, and most widespread, encoding), which tells the processor that these instructions (executed in the appropriate sequence) are ‘balanced’. In this case, the processor would not have to go through memory to save and restore their values, but could keep them internally, so as to improve the performance of these two operations (and without stressing the memory hierarchy).
Finally, another new instruction that was added is JMPABS, which, as the name suggests, allows jumping to a 64-bit absolute address. Evidently Intel has encountered a fair number of cases in which this is necessary (after all, in 64-bit mode the classic direct CALL and JMP instructions can only reach targets within ±2 GB) and has decided to make up for it, even though I personally have not encountered occasions in which such an operation was necessary.
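To make the ±2 GB limit concrete, here is a small Python check of whether a target can be reached by a direct jump or call with a signed 32-bit displacement, which is measured from the address of the following instruction. The function name and the sample addresses are made up for illustration.

```python
def reachable_with_rel32(next_instruction_address, target_address):
    """True if the target fits in a signed 32-bit relative displacement."""
    displacement = target_address - next_instruction_address
    return -(1 << 31) <= displacement <= (1 << 31) - 1


# Hypothetical addresses: one nearby target, one far beyond 2 GB.
near = reachable_with_rel32(0x140001000, 0x140D55203)
far = reachable_with_rel32(0x140001000, 0x7FF800000000)
```

In this model `near` is reachable and `far` is not; the second case is where an absolute 64-bit jump (or an indirect jump through a register or memory) becomes necessary.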
New conditional instructions
Other new instructions introduced by APX are the so-called conditional instructions, for which the format of the EVEX prefix changes according to the last table shown in the first article (which introduces the OF, SF, ZF and CF fields, plus SC3..SC0) and which, of course, check whether a certain condition (specified in SC3..SC0) is true in order to decide how to proceed (depending on the particular type of instruction).
In fact, the only two (new, of course) instructions that use this special format of EVEX are CCMPscc and CTESTscc, which differ only in the type of check that is made (as with the CMP and TEST instructions, respectively) when the condition in SC3..SC0 is true.
Their operating logic can be briefly summarised as follows: if the condition in SC3..SC0 is satisfied, then the processor flags are updated by comparing the two operands, just as with CMP and TEST. If, on the other hand, it is not, then no comparison is made, and the OF, SF, ZF and CF flags are set by copying their values from the equivalent fields found in EVEX; in addition, the AF flag is always reset to zero in this case.
It should be pointed out that not all conditions normally available on x86/x64 can be used: the conditions that check the parity flag (P) are not. Instead, their two encodings have been reused, respectively, to force the evaluation (thus always performing the check on the operands) or to skip it (avoiding the check and thus always copying the OF, SF, ZF and CF fields to their respective flags).
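A compact way to picture this is the sketch below, which evaluates a 4-bit condition code against a set of flags. It assumes the standard x86 condition-code numbering (O, NO, B, NB, Z, NZ, BE, NBE, S, NS, P, NP, L, NL, LE, NLE for codes 0 to 15) and that the P and NP slots become "always true" and "always false" in that order, as described above; the function name is hypothetical.

```python
def scc_holds(code, flags):
    """Evaluate a 4-bit source condition code against OF/SF/ZF/CF.

    Assumption: codes 0b1010 and 0b1011 (P/NP on classic x86) act as
    always-true and always-false respectively.
    """
    of, sf, zf, cf = flags["OF"], flags["SF"], flags["ZF"], flags["CF"]
    base_conditions = [
        of,                  # O
        cf,                  # B
        zf,                  # Z
        cf or zf,            # BE
        sf,                  # S
        True,                # former parity slot: constant
        sf != of,            # L
        zf or (sf != of),    # LE
    ]
    value = base_conditions[code >> 1]
    # The lowest bit of the code negates the base condition.
    return value != bool(code & 1)


equal_flags = {"OF": False, "SF": False, "ZF": True, "CF": False}
always = scc_holds(0b1010, equal_flags)
never = scc_holds(0b1011, equal_flags)
```

Treating the former parity slot as a constant makes the two special encodings fall out of the usual "low bit negates" rule, which is presumably why those two codes were the natural ones to reuse.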
An important thing to underline is that these instructions can always generate an exception if one of the operands is in memory and is not accessible (or, in general, causes any kind of fault). This happens regardless, even if the condition in SC3..SC0 is not satisfied and reading the operand from memory is, therefore, completely useless. In this respect, the behaviour is identical to that of another conditional instruction that has been around since the days of the Pentium Pro: the famous CMOVcc.
The latter is, incidentally, also the basis of the four further new conditional instructions that APX makes available. The first is CMOVcc itself, which is extended using the NDD and, therefore, gains a destination register to store the result of the operation (the second source is copied if the condition cc is met, otherwise the first source is copied).
The other three instructions are called CFCMOVcc, because they all have the same thing in common: they raise no exception if the operand in memory is not accessible and the condition is false (of course, the exception is still raised if the condition is true). The first of these is, therefore, identical to the CMOVcc above, but with exceptions suppressed when the condition is not met. To which I would say: finally! In fact, this was (and is) my expectation for a conditional instruction: there should be no side effects if the condition is not fulfilled!
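The difference between the two three-operand forms can be modelled in a few lines of Python. Everything here is a sketch under my own assumptions: memory is a plain dictionary, an unmapped address raises a made-up `Fault` exception, and the function names are invented for illustration.

```python
class Fault(Exception):
    """Stands in for a page fault or any other memory exception."""


def load(memory, address):
    if address not in memory:
        raise Fault(hex(address))
    return memory[address]


def cmov_ndd(condition, first_source, memory, address):
    """CMOVcc-style: the memory operand is read even if it is not needed."""
    second_source = load(memory, address)
    return second_source if condition else first_source


def cfcmov_ndd(condition, first_source, memory, address):
    """CFCMOVcc-style: no memory access, hence no fault, when the condition is false."""
    if not condition:
        return first_source
    return load(memory, address)


memory = {0x1000: 42}
safe = cfcmov_ndd(False, 7, memory, 0xDEAD)  # unmapped address, but never touched
```

In this model `cmov_ndd(False, 7, memory, 0xDEAD)` raises `Fault` even though the loaded value would be discarded, while the conditionally faulting variant simply returns the first source.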
The other two CFCMOVcc forms do not use the NDD and, therefore, have only two operands: the first always acts as both first source and destination. The difference between the two is that the operands are reversed: for the first, the first operand is a register and the second can be in memory (or in a register), while for the second it is the exact opposite (the first operand can be in memory and the second is always a register).
The peculiarity of these four new instructions is that they do not use the special EVEX format at all (which, as I have already mentioned, is used exclusively by the new CCMP and CTEST); instead, the condition to be checked is encoded directly in the opcode (as in the original instruction from which they derive).
SETcc: improved / new (operating beyond bytes)
Finally, the operation of the SETcc instruction (which I had mentioned in the previous article) has been extended (to different sizes rather than only bytes): by exploiting the ND bit, it is possible to apply the clear logic (instead of merge, which is the default) when the operand receiving the result is a register rather than a memory location (for which nothing changes). This is very useful, because it avoids having to add another instruction next to SETcc to clear the rest of the register (which typically happens in real code, where the entire register is often used and not just the lowest 8 bits).
An example, also taken from FFMPEG (x64):
XOR EAX, EAX
CMP WORD [RCX+0x18], 0x20b
SETZ AL
where it can be seen that the EAX register is zeroed with the XOR instruction, and only then does the SETZ instruction set the least significant byte (the AL register) to 1 if the memory location read by the CMP contained the value 0x20b (otherwise AL remains at 0).
A similar recurring pattern often found is also the following:
CMP [RDI], EAX
SETZ AL
MOVZX EAX, AL
where, in this case, the comparison is carried out first to update the flags, then the SETZ instruction sets the least significant byte (again AL) according to those flags, and immediately afterwards all the other bytes of EAX are cleared with the MOVZX instruction.
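The two behaviours can be compared with a tiny Python model of a 64-bit register. The function names are my own, and the clear variant assumes, as described above, that every bit of the destination register above the result is zeroed.

```python
MASK64 = (1 << 64) - 1


def setcc_merge(register, condition):
    """Classic SETcc: only the lowest byte changes, the rest is preserved."""
    return (register & MASK64 & ~0xFF) | int(bool(condition))


def setcc_clear(register, condition):
    """SETcc with clear logic: the old register contents do not matter."""
    return int(bool(condition))


stale = 0x1122334455667788
merged = setcc_merge(stale, True)     # upper bytes still hold stale data
zero_first = setcc_merge(0, True)     # the XOR-before pattern
zero_after = merged & 0xFF            # the MOVZX-after pattern
cleared = setcc_clear(stale, True)    # one instruction, same result
```

Both classic patterns end up with the same value as the single clearing variant, which is exactly the extra instruction that the new form saves.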
That is all for the moment. The next article will focus on analysing the advantages and flaws of APX.