Having discussed code density in the previous article, we now turn to the cost that all the changes needed to implement APX may entail. Specifically, Intel claims that neither the silicon area of a core nor its power consumption is affected ‘significantly’.
Don’t look at the core, look at the decoder!
We know, however, that the cores of modern processors are very large (especially those of Intel processors), so while it may be true that the overall impact is not very noticeable in itself, it is also true that it should not be spread over the whole area of the core: what should be considered is the area actually affected, i.e. mainly the instruction-decoding block (which I will call the decoder, for simplicity’s sake).
It must also be said that, owing to its very dated design (inherited from the 8086. Which, however, was made in other times and with quite different requirements: contextualisation is always important!), to which more and more modifications were gradually added (I would prefer the term ‘patches’, which is certainly more accurate), the decoder was in the past the single element occupying the most area (an estimated 40% in the Pentium Pro!), with several million transistors devoted to this purpose alone.
Over time, other elements (the caches, in particular) took over in terms of area occupied/transistors employed, while decoders have substantially settled down and will presumably continue to use similar transistor budgets (in the order of several million).
The problem, however, is that the decoder is also one of the most active elements of a core and among those drawing the most power, contributing considerably to the thermal budget of the chip, so even a small increase in its complexity could have an appreciable effect.
This is why I stated earlier that the increase in silicon area that APX entails should not be assessed against the whole chip (or core), but against the decoder area alone. A change that is ‘non-significant’ for the whole chip could turn out to be very significant if only the decoder is taken into account, with all the implications that this entails (power consumption, in particular).
The impact of APX in detail
The implementation of APX, in fact, is not trivial at the decoder level, because there is quite a bit of logic to be turned into transistors. While the introduction of the new REX2 prefix is relatively simple (it is very similar to REX, with the bonus that it does not have to take into account any other prefix after it, since the opcode map to be used is already incorporated in it), the modifications to the EVEX prefix are decidedly more complicated.
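To see why REX2 is the easy case, here is a minimal C sketch of its decoding. The payload bit layout shown (M0 selecting the opcode map, the new R4/X4/B4 extension bits, W, and the classic R/X/B bits) follows my reading of the APX specification; names and types are mine, and the code is illustrative rather than real decoder logic:

```c
#include <stdbool.h>
#include <stdint.h>

/* REX2 is two bytes: the 0xD5 prefix byte plus one payload byte.
 * Bit positions below follow my reading of the APX specification;
 * treat the layout as illustrative, not normative.                */
typedef struct {
    bool m0;          /* opcode map already selected: 0 = map 0, 1 = map 1 */
    bool r4, x4, b4;  /* new (fourth) register-extension bits              */
    bool w;           /* operand-size bit, as in classic REX.W             */
    bool r3, x3, b3;  /* classic REX.R / REX.X / REX.B                     */
} rex2_t;

static rex2_t decode_rex2(uint8_t payload)
{
    return (rex2_t){
        .m0 = payload >> 7 & 1,
        .r4 = payload >> 6 & 1, .x4 = payload >> 5 & 1, .b4 = payload >> 4 & 1,
        .w  = payload >> 3 & 1,
        .r3 = payload >> 2 & 1, .x3 = payload >> 1 & 1, .b3 = payload & 1,
    };
}
```

Note how the opcode map comes straight out of the payload: no further escape byte or prefix lookup is needed, which is precisely what keeps this path cheap.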
For convenience, here is the structure of the original EVEX prefix (as introduced with AVX-512):
| Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
|---|---|---|---|---|---|---|---|---|---|
| Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
| Byte 1 (P0) | R̅ | X̅ | B̅ | R̅’ | 0 | 0 | m1 | m0 | P[7:0] |
| Byte 2 (P1) | W | v̅3 | v̅2 | v̅1 | v̅0 | 1 | p1 | p0 | P[15:8] |
| Byte 3 (P2) | z | L’ | L | b | V̅’ | a2 | a1 | a0 | P[23:16] |
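Translating the table into code makes the payload explicit. A minimal C sketch (field names and types are mine; the inverted bits, marked with an overbar in the table, are extracted as-is):

```c
#include <stdint.h>

/* Field extraction for the original (AVX-512) EVEX payload, exactly
 * as laid out in the table above. A sketch, not production decoder
 * logic; inverted bits are stored without re-inverting them.        */
typedef struct {
    /* P0 (byte 1) */
    uint8_t nR, nX, nB, nRp;  /* inverted R, X, B, R'      */
    uint8_t mm;               /* m1:m0 -- opcode map       */
    /* P1 (byte 2) */
    uint8_t w;
    uint8_t nV;               /* inverted v3:v0            */
    uint8_t pp;               /* p1:p0 -- legacy prefix    */
    /* P2 (byte 3) */
    uint8_t z, Lp, L, b, nVp; /* z, L', L, b, V'           */
    uint8_t aaa;              /* a2:a0 -- opmask register  */
} evex512_t;

static evex512_t decode_evex512(uint8_t p0, uint8_t p1, uint8_t p2)
{
    return (evex512_t){
        .nR  = p0 >> 7 & 1,  .nX  = p0 >> 6 & 1,
        .nB  = p0 >> 5 & 1,  .nRp = p0 >> 4 & 1,
        .mm  = p0 & 0x03,
        .w   = p1 >> 7 & 1,  .nV  = p1 >> 3 & 0x0F,
        .pp  = p1 & 0x03,
        .z   = p2 >> 7 & 1,  .Lp  = p2 >> 6 & 1,
        .L   = p2 >> 5 & 1,  .b   = p2 >> 4 & 1,
        .nVp = p2 >> 3 & 1,  .aaa = p2 & 0x07,
    };
}
```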
In fact, this prefix has now taken on a dual form as well as function (as illustrated in the first article), since it behaves quite differently depending on whether the ‘promoted’ instruction is one of the two new conditionals (CCMP and CTEST) or any other.
Below is the structure that EVEX assumes exclusively for these two instructions:
| Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
|---|---|---|---|---|---|---|---|---|---|
| Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
| Byte 1 (P0) | R̅3 | X̅3 | B̅3 | R̅4 | B4 | 1 | 0 | 0 | P[7:0] |
| Byte 2 (P1) | W | OF | SF | ZF | CF | X̅4 | p1 | p0 | P[15:8] |
| Byte 3 (P2) | 0 | 0 | 0 | ND=1 | SC3 | SC2 | SC1 | SC0 | P[23:16] |
So the decoder, having ‘realised’ that it is dealing with promoted instructions (because EVEX has specified the use of map 4; more details can be found in the first article), must check whether or not it is dealing with one of the two instructions mentioned, and take completely different routes to decide how to use the bits the prefix carries (which serve to ‘expand’ the behaviour of the instructions) and how to generate the micro-op to finally feed to the backend.
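Here is a C sketch of that fork, based directly on the two payload layouts shown in this section (the CCMP/CTEST table above and the general one below). The opcode test is a placeholder: the exact encodings are defined by the APX specification, so I have not invented concrete values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder predicate: the real CCMP/CTEST encodings are defined
 * by the APX spec; any concrete values here would be invented.     */
static bool is_ccmp_or_ctest(uint8_t opcode) { (void)opcode; return false; }

static void decode_map4(uint8_t opcode, uint8_t p1, uint8_t p2)
{
    if (is_ccmp_or_ctest(opcode)) {
        /* Conditional route: P1 carries the source flags OF/SF/ZF/CF
         * and P2 the 4-bit condition code SC3:SC0 (table above).     */
        uint8_t flags = p1 >> 3 & 0x0F;   /* OF, SF, ZF, CF */
        uint8_t scc   = p2 & 0x0F;        /* SC3:SC0        */
        /* ... generate a conditional micro-op ... */
        (void)flags; (void)scc;
    } else {
        /* General route: the same payload bits now mean v3:v0
         * (inverted), ND, v4 (inverted) and NF (table below), and
         * the opcode may also need remapping.                       */
        uint8_t nv  = p1 >> 3 & 0x0F;     /* v3:v0, inverted */
        uint8_t nd  = p2 >> 4 & 1;        /* ND              */
        uint8_t nv4 = p2 >> 3 & 1;        /* v4, inverted    */
        uint8_t nf  = p2 >> 2 & 1;        /* NF              */
        /* ... remap the opcode, then generate an ordinary micro-op ... */
        (void)nv; (void)nd; (void)nv4; (void)nf;
    }
}
```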
The biggest complication is with the other instructions (not the two new conditional ones), because the decoder has to take into account whether or not they can be extended from binary/unary to ternary/binary form and, in particular, because internally remapping the instruction opcodes is a fairly onerous operation in terms of resources used.
The structure of EVEX, in the latter case, is as follows:
| Byte | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Payload bits |
|---|---|---|---|---|---|---|---|---|---|
| Byte 0 (62h) | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | |
| Byte 1 (P0) | R̅3 | X̅3 | B̅3 | R̅4 | B4 | 1 | 0 | 0 | P[7:0] |
| Byte 2 (P1) | W | v̅3 | v̅2 | v̅1 | v̅0 | X̅4 | p1 | p0 | P[15:8] |
| Byte 3 (P2) | 0 | 0 | 0 | ND | v̅4 | NF | 0 | 0 | P[23:16] |
This remapping is necessary because map 4 contains instructions from both map 0 and map 1, which obviously cannot coexist (map 4 provides 256 configurations, while maps 0 and 1 each require 256, for a total of 512). To be precise, the instructions from map 0 do not need any remapping, because they already use their original opcode (as already explained in the first article).
The problems arise with the instructions from map 1, for which the free (unused) configurations of map 0 have been used. The remapping operation therefore involves, first of all, recognising which of the 256 configurations of map 4 belong to the original map 1 and then, only in that case, converting them to the original opcode, as the sketch below illustrates.
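In code, this recognition-plus-conversion step can be pictured as a 256-entry lookup keyed by the map-4 opcode. The table contents here are hypothetical (the real slot assignments are defined by the APX specification), and in silicon this would of course be a ROM/PLA structure rather than an array, which is exactly why it costs transistors and switching activity:

```c
#include <stdbool.h>
#include <stdint.h>

/* One entry per map-4 slot. Contents are hypothetical: the real
 * assignments of map-1 instructions to free map-0 slots are defined
 * by the APX specification.                                         */
typedef struct {
    bool    from_map1;   /* does this map-4 slot hold a map-1 instruction? */
    uint8_t orig_opcode; /* its original map-1 opcode, if so               */
} map4_entry_t;

static const map4_entry_t map4_remap[256]; /* to be filled per the spec */

/* Returns the original opcode, tagging its original map in bit 8. */
static uint16_t remap_map4(uint8_t opcode)
{
    const map4_entry_t *e = &map4_remap[opcode];
    if (e->from_map1)
        return 0x100u | e->orig_opcode; /* map-1 instruction: convert   */
    return opcode;                      /* map-0 opcode: passes through */
}
```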
Ultimately, it is not just a matter of considering the number of transistors used (not many overall, compared to all those already present in the frontend), but the fact that these are very active elements. The decoder is well known to be one of the most power-hungry parts of a core, contributing significantly to consumption, and these far-from-simple modifications can only increase it.
Every cloud has a silver lining
Looking exclusively at the overall consumption of a core (or, more generally, of a processor with several of them) would not, however, be the correct way to assess the impact of this new extension on the consumption recorded during code execution (and thus on its actual behaviour).
Another extremely important element that must be taken into account is, in fact, performance: how quickly a given task is completed. The efficiency of a processor simply cannot be measured without combining power consumption and time spent.
The concept is trivial, yet when talking about processors I have often seen a tendency to conflate efficiency with consumption (frequently peak consumption, at that), leaving performance out entirely. Which is clearly (but evidently not so obviously, unfortunately) wrong.
This is because one processor may draw more power than another, but if it takes less time to perform the tasks it is given, it can end up consuming less energy once the entire processing run is taken into account.
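To put (hypothetical) numbers on it: energy is power multiplied by time, E = P × t. A core drawing 25 W that completes a task in 7 s consumes 25 × 7 = 175 J, while a core drawing only 20 W but needing 10 s consumes 20 × 10 = 200 J. Despite the higher instantaneous draw, the first is the more efficient of the two.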
Specifically, Intel stated that around 10 per cent fewer instructions are executed with APX (according to preliminary results of internal tests carried out with the celebrated SPEC2017 suite). Unfortunately, no further information was provided (it would have been useful to understand which parts of the code were affected: for example, whether inside or outside loops and, in general, in the most critical sections).
In any case, the company also stated that consumption did not increase significantly (although no measurements were provided), so it is perfectly legitimate to expect that the increase due to the greater complexity of the APX implementation was balanced by the execution of fewer instructions (because each one does ‘more useful work’) and, therefore, ultimately by a shorter execution time that led the processor to consume less overall (than one might have expected).
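A quick back-of-the-envelope check (my numbers, not Intel’s): with 10 per cent fewer instructions executed, total energy stays flat as long as the average energy per instruction rises by no more than about 11 per cent, since 0.90 × 1.11 ≈ 1.0; any smaller increase in per-instruction cost translates into a net saving.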
Which is hardly surprising: if code density is a ‘Holy Grail’ of processor architectures, the same can be said of single-core/single-thread performance. Both, as we have seen, have profound implications and have therefore been the subject of research and development for a very long time, with processor manufacturers constantly striving to improve them.
For the sake of completeness, it must be said that response time is another sought-after characteristic of an architecture (of a microarchitecture, more precisely), partially linked to the two just mentioned. If a design requires certain tasks to be executed within certain time limits and a processor fails to meet those constraints, it can be as efficient as it likes: it has failed (and is therefore not fit for purpose!).
In the next article, I will propose improvements to APX which would both considerably simplify its implementation and reduce the length of the instructions that make use of this extension, thereby significantly improving code density.