APX: Intel's new architecture - 1 - Introduction

Shortly after the X86-S proposal, Intel presented its biggest novelty since the days of AVX-512, and arguably a bigger one: APX, in effect a new architecture (ISA). It is available exclusively in 64-bit mode; it is not supported in 32-bit mode, nor could it have been. This series of articles sets out to analyse its innovations and weak points and to offer some food for thought.

To be precise, and going by how it was introduced, one cannot formally speak of a new architecture: APX is an extension of the x64 ISA. It takes the 64-bit mode introduced by AMD in 2003 as its basis and adds the following features:

16 new general-purpose registers;
a new format for several instructions, using three operands instead of the canonical two, or two operands for instructions that previously had only one;
new conditional instructions;
a jump instruction to an absolute address (64-bit, of course).

We have long been accustomed to new instructions being added in all sorts of flavours and in the most disparate domains, always transparently alongside the existing ones. Making new registers available, however, is such a significant and far-reaching change that I think one can actually speak of a new architecture.

This is because, although code using APX can be safely mixed with any x64 code, providing multiple code paths in the same executable to cover multiple scenarios (depending on the processor on which the code runs, as the Intel compiler makes possible, for example) would increase its size, possibly by a significant amount.

As long as only a few critical parts need dedicated code paths to run at their best on specific processors (more precisely, on specific microarchitectures), this choice pays off and does not inflate the binaries excessively. The innovations of APX, however, are so pervasive in almost every aspect of the code that it is hard to imagine continuing with this approach, even though it remains feasible both in theory and in practice.

My expectation is, therefore, that we will see executables generated specifically for APX, without any code path for plain x64 code (which will be relegated to equivalent separate binaries), so as not only to maximise performance but also to contain the overall code size (on disk as well). This is also why I prefer to think of APX as a new architecture rather than an extension like those introduced so far.

The `REX2` prefix for accessing new registers

The new registers cannot be addressed with the existing encodings, so the ISA was changed for this purpose, much as AMD did for x64 when it introduced the REX prefix:

Byte	Bit
REX
	7	6	5	4	3	2	1	0
0	0	1	0	0	W	R	X	B
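
To make the layout above concrete, here is a minimal sketch of how a decoder might pull the four payload bits out of a REX byte and use one of them to extend a 3-bit register field. The function names `decode_rex` and `extend_register` are hypothetical, chosen only for this illustration.

```python
from typing import Optional


def decode_rex(byte: int) -> Optional[dict]:
    """Return the W, R, X, B bits of a REX prefix, or None if `byte` is not one."""
    # REX prefixes have the fixed high nibble 0100, i.e. the range 0x40..0x4F.
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,
        "R": (byte >> 2) & 1,
        "X": (byte >> 1) & 1,
        "B": byte & 1,
    }


def extend_register(extension_bit: int, low3: int) -> int:
    """Combine a REX extension bit with a 3-bit register field (result: 0..15)."""
    return (extension_bit << 3) | (low3 & 0b111)
```

For example, `decode_rex(0x4D)` reports W, R and B set and X clear, and `extend_register(1, 0)` yields register number 8.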

The most visible change is the new REX2 prefix:

Byte	Bit
REX2 (2-byte REX)
	7	6	5	4	3	2	1	0
0 (0xD5)	1	1	0	1	0	1	0	1
1	M₀	R₄	X₄	B₄	W	R₃	X₃	B₃

This time the focus is on the additional bits needed to access the new ‘bank’ of registers. As can be seen immediately, R₃, X₃ and B₃ are the equivalents of R, X and B: they supply the fourth bit of the register (specified in the opcode itself or in the operand in memory), of the index and of the base respectively (the last two only if there is an operand in memory). R₄, X₄ and B₄ do the same thing for the fifth bit (5 bits = 32 possible values = 32 addressable registers).
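
As a rough sketch of the description above, the second REX2 byte can be split into its eight fields and two of them combined with a 3-bit register field to form a 5-bit register number. The names `decode_rex2` and `register_number` are hypothetical; the bit positions are taken from the table above.

```python
REX2_ESCAPE = 0xD5  # first byte of the REX2 prefix, as shown in the table


def decode_rex2(prefix: bytes) -> dict:
    """Split the payload byte of a 2-byte REX2 prefix into its fields."""
    if len(prefix) < 2 or prefix[0] != REX2_ESCAPE:
        raise ValueError("not a REX2 prefix")
    payload = prefix[1]
    return {
        "M0": (payload >> 7) & 1,  # opcode map selector (0 or 1)
        "R4": (payload >> 6) & 1,
        "X4": (payload >> 5) & 1,
        "B4": (payload >> 4) & 1,
        "W": (payload >> 3) & 1,
        "R3": (payload >> 2) & 1,
        "X3": (payload >> 1) & 1,
        "B3": payload & 1,
    }


def register_number(bit4: int, bit3: int, low3: int) -> int:
    """Build a 5-bit register number (0..31) from two prefix bits and a 3-bit field."""
    return (bit4 << 4) | (bit3 << 3) | (low3 & 0b111)
```

With B₄ set, B₃ clear and a 3-bit field of 0, `register_number(1, 0, 0)` gives 16, the first of the new registers.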

There is, however, an additional bit, M₀, which specifies which ‘opcode map’ to use: either 0 or 1. Here we need a short digression on how the opcodes of x86 and x64 instructions are defined and mapped. In these processors, opcodes are defined by sequences of bytes (other architectures may use words of 2 or more bytes as the ‘base unit’).

Leaving prefixes aside for the moment and simplifying a lot, an 8086 instruction (the 8086 being the first representative of this ISA) uses its first byte to define up to 256 opcodes, which correspond (very) roughly to 256 instructions. The 8086 did not use all 256 possibilities, only a subset, leaving some configurations free; over time these were taken by new instructions (or prefixes) that were gradually added. The set of instructions belonging to these 256 configurations specified by the first byte is called ‘map 0’.

At a certain point this map ran out of free slots. To add further instructions, Intel chose to recycle an instruction that was completely useless on the 8086 (actually even dangerous): POP CS, whose opcode is 0F in hexadecimal. It was reused to create a new opcode map (map 1), opening up another 256 possible instructions (those that have 0F as their first byte and whose second byte specifies the new opcode). The only penalty is that these instructions are one byte longer, because the first byte is always 0F.

As can be easily guessed, map 1 was also exhausted by adding instructions, so Intel used a couple of configurations (0F 38 and 0F 3A) to create another couple of maps (2 and 3). These instructions, therefore, are even longer (due to the first two bytes having one of those two values).
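
The escape-byte scheme just described can be sketched as a small classifier. This is only an illustration under a simplifying assumption: the input contains no prefixes at all, so the first byte is either an opcode or the 0F escape. The function name `legacy_opcode_map` is hypothetical.

```python
def legacy_opcode_map(code: bytes) -> tuple:
    """Return (map number, opcode byte) for a prefix-free legacy encoding."""
    if code[0] != 0x0F:
        return 0, code[0]      # map 0: the first byte is the opcode itself
    if code[1] == 0x38:
        return 2, code[2]      # map 2: escape sequence 0F 38
    if code[1] == 0x3A:
        return 3, code[2]      # map 3: escape sequence 0F 3A
    return 1, code[1]          # map 1: escape byte 0F
```

For instance, `legacy_opcode_map(bytes([0x0F, 0x38, 0xF0]))` returns `(2, 0xF0)`: two bytes are spent before the opcode byte is even reached.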

The new M₀ bit therefore selects only one of the first two maps, not the other two. This is not a big deal, because most of the ‘general-purpose’ instructions (also called legacy in Intel’s document containing the complete APX specification) live in the first two maps.

The other two can in any case be selected via the VEX prefixes (VEX2 and VEX3 introduced by Intel with the AVX extensions):

Byte	Bit
VEX3 (3-byte VEX)
	7	6	5	4	3	2	1	0
0 (0xC4)	1	1	0	0	0	1	0	0
1	R̅	X̅	B̅	m₄	m₃	m₂	m₁	m₀
2	W	v̅₃	v̅₂	v̅₁	v̅₀	L	p₁	p₀
VEX2 (2-byte VEX)
	7	6	5	4	3	2	1	0
0 (0xC5)	1	1	0	0	0	1	0	1
1	R̅	v̅₃	v̅₂	v̅₁	v̅₀	L	p₁	p₀

or with the EVEX prefix (introduced by Intel with the AVX-512 extensions):

	7	6	5	4	3	2	1	0
Byte 0 (62h)	0	1	1	0	0	0	1	0
Byte 1 (P0)	R̅	X̅	B̅	R̅’	0	0	m₁	m₀	P[7:0]
Byte 2 (P1)	W	v̅₃	v̅₂	v̅₁	v̅₀	1	p₁	p₀	P[15:8]
Byte 3 (P2)	z	L’	L	b	V̅’	a₂	a₁	a₀	P[23:16]

VEX2 can select only the second map (0F, which is implicit), VEX3 can select up to 32 maps via bits m₄..m₀, and EVEX can select all four maps via m₁..m₀.

VEX2 is a prefix of two bytes. VEX3 is three bytes. EVEX, on the other hand, is a full four bytes. So it is clear that the instruction length will increase, even considerably, depending on the prefix used.
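
A minimal sketch of the map selection and prefix lengths summarised above, using the field positions from the VEX and original EVEX tables. It assumes 64-bit mode, where the bytes C4, C5 and 62 are always these prefixes, and the function name `prefix_map_and_length` is hypothetical.

```python
def prefix_map_and_length(code: bytes) -> tuple:
    """Return (selected opcode map, prefix length in bytes) for VEX2, VEX3 or EVEX."""
    first = code[0]
    if first == 0xC5:
        return 1, 2                 # VEX2: map 1 (0F) is implicit
    if first == 0xC4:
        return code[1] & 0x1F, 3    # VEX3: m4..m0 are the low five bits of byte 1
    if first == 0x62:
        return code[1] & 0x03, 4    # original EVEX: m1..m0 are the low two bits of P0
    raise ValueError("not a VEX or EVEX prefix")
```

The second element of the result shows the cost directly: two, three or four bytes are consumed before the opcode byte.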

Changes to the `EVEX` prefix

So far we have seen that the REX2 prefix only makes it possible to select the 16 new general-purpose registers. Intel went much further with APX, though, extending the operation of some instructions (some general-purpose ones, but also some AVX ones) and giving them capabilities that previously took several instructions to emulate. To do this, it extended the EVEX prefix by exploiting some bits that were previously unused. Originally its format was as shown in the table above; now, to support the new general-purpose registers, it has become:

	7	6	5	4	3	2	1	0
Byte 0 (62h)	0	1	1	0	0	0	1	0
Byte 1 (P0)	R̅₃	X̅₃	B̅₃	R̅₄	B₄	m₂	m₁	m₀	P[7:0]
Byte 2 (P1)	W	v̅₃	v̅₂	v̅₁	v̅₀	X̅₄	p₁	p₀	P[15:8]
Byte 3 (P2)	z	L’	L	b	v̅₄	a₂	a₁	a₀	P[23:16]

So Intel simply added R₄, X₄ and B₄ (the first two in negated form, for technical reasons which I won’t repeat here) so that the fifth bit of the respective register can be set.
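
To illustrate the negated fields, here is a rough sketch that extracts the register-extension bits from the extended EVEX layout shown in the table above, undoing the inversion where the table marks a field with an overbar. The function name `decode_apx_evex` and the returned key names are hypothetical.

```python
def decode_apx_evex(prefix: bytes) -> dict:
    """Extract register-extension fields from a 4-byte extended EVEX prefix."""
    if len(prefix) != 4 or prefix[0] != 0x62:
        raise ValueError("not a 4-byte EVEX prefix")
    p0, p1, p2 = prefix[1], prefix[2], prefix[3]

    def inverted(byte: int, position: int) -> int:
        # Fields drawn with an overbar are stored negated, so flip them back.
        return ((byte >> position) & 1) ^ 1

    vvvv = ~(p1 >> 3) & 0xF          # v3..v0, stored negated
    v4 = inverted(p2, 3)             # fifth bit of the v register, stored negated
    return {
        "R3": inverted(p0, 7),
        "X3": inverted(p0, 6),
        "B3": inverted(p0, 5),
        "R4": inverted(p0, 4),
        "B4": (p0 >> 3) & 1,         # B4 is the one extension bit stored as-is
        "map": p0 & 0b111,           # m2..m0
        "W": (p1 >> 7) & 1,
        "V": (v4 << 4) | vvvv,       # 5-bit register number from v4..v0
        "X4": inverted(p1, 2),
    }
```

A prefix whose negated fields are all 1 therefore decodes to extension bits of 0, which is why encodings that do not touch the new registers keep those bits set.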

A number of VEX instructions have also been updated, but since the VEX2 and VEX3 prefixes have no room for additional bits, Intel again relied on EVEX to extend them in these cases:

	7	6	5	4	3	2	1	0
Byte 0 (62h)	0	1	1	0	0	0	1	0
Byte 1 (P0)	R̅₃	X̅₃	B̅₃	R̅₄	B₄	m₂	m₁	m₀	P[7:0]
Byte 2 (P1)	W	v̅₃	v̅₂	v̅₁	v̅₀	X̅₄	p₁	p₀	P[15:8]
Byte 3 (P2)	0	0	L	0	v̅₄	NF	0	0	P[23:16]

The only substantial difference (apart from being able to access the new registers) from the VEX prefixes is provided by the new NF bit, which makes it possible to suppress the generation of flags for certain instructions that originally generated them according to the result of the operation. I will talk more about this in a separate section in the next article.

For the general-purpose instructions, however, Intel made more substantial changes:

	7	6	5	4	3	2	1	0
Byte 0 (62h)	0	1	1	0	0	0	1	0
Byte 1 (P0)	R̅₃	X̅₃	B̅₃	R̅₄	B₄	1	0	0	P[7:0]
Byte 2 (P1)	W	v̅₃	v̅₂	v̅₁	v̅₀	X̅₄	p₁	p₀	P[15:8]
Byte 3 (P2)	0	0	0	ND	v̅₄	NF	0	0	P[23:16]

I’ve already talked about NF. Next to it sits ND, which turns a binary instruction (a source operand plus a destination operand that also acts as second source) into a ternary one, or a unary instruction (whose single operand is both source and destination) into a binary one, according to precise rules that will also be explained in a section of the next article.

It should be noted, however, that in this case the bits m₂..m₀ (m₂ has now been added to EVEX to bring the selectable maps from 4 to 8) have taken on a very precise value: 100 (in binary). EVEX uses this field to be able to specify which map, among the eight possible ones, to use for the opcode byte following this prefix. We have seen that on x64 there are four maps (from 0 to 3) that define the corresponding instruction sets, but 100 equates to a new map: map 4!

The reason is quickly explained: not all general-purpose instructions are extended with these new mechanisms (NF and ND), only some of them. To avoid complicating things further, Intel collected all ‘promotable’ instructions into this new map. So the new wonders are reserved for instructions in map 4, while all other instructions can only access the new registers (via REX2) and nothing more.
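
As a small illustration of the map 4 format, the following sketch checks that the map field is 100 and then reads the ND and NF bits from the last prefix byte, at the positions shown in the table above. The function name `decode_map4_controls` is hypothetical.

```python
def decode_map4_controls(prefix: bytes) -> dict:
    """Read ND and NF from an extended EVEX prefix that selects map 4."""
    if len(prefix) != 4 or prefix[0] != 0x62:
        raise ValueError("not a 4-byte EVEX prefix")
    p0, p2 = prefix[1], prefix[3]
    if p0 & 0b111 != 0b100:
        raise ValueError("prefix does not select map 4")
    return {
        "ND": (p2 >> 4) & 1,  # new data destination: adds an operand
        "NF": (p2 >> 2) & 1,  # no flags: suppresses flag generation
    }
```

For example, a final byte of 0x18 reports ND set and NF clear, while 0x0C reports the opposite.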

Finally, Intel changed this last format once more in order to introduce further new conditional instructions:

	7	6	5	4	3	2	1	0
Byte 0 (62h)	0	1	1	0	0	0	1	0
Byte 1 (P0)	R̅₃	X̅₃	B̅₃	R̅₄	B₄	1	0	0	P[7:0]
Byte 2 (P1)	W	OF	SF	ZF	CF	X̅₄	p₁	p₀	P[15:8]
Byte 3 (P2)	0	0	0	ND=1	SC₃	SC₂	SC₁	SC₀	P[23:16]

The OF, SF, ZF and CF fields are the very same bits found in the flags register, while SC₃..SC₀ are four bits that hold the condition code (with some modifications) normally used by conditional jumps to indicate when to jump. A separate section in the next article will deal with these particular new conditional instructions.
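
A brief sketch of how the fields of this last format could be separated, using the bit positions from the table above. It only extracts the raw values and does not interpret the condition code; the function name `decode_conditional_fields` is hypothetical.

```python
def decode_conditional_fields(prefix: bytes) -> dict:
    """Extract the flag values and condition code from the conditional EVEX format."""
    if len(prefix) != 4 or prefix[0] != 0x62:
        raise ValueError("not a 4-byte EVEX prefix")
    p1, p2 = prefix[2], prefix[3]
    return {
        "OF": (p1 >> 6) & 1,
        "SF": (p1 >> 5) & 1,
        "ZF": (p1 >> 4) & 1,
        "CF": (p1 >> 3) & 1,
        "SC": p2 & 0xF,       # SC3..SC0 as a 4-bit value
    }
```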

The next article will, as already mentioned, deal in more detail with APX’s innovations and the impact they have in real/common scenarios.
