Arm MTE on the Ampere-1a

August 6, 2025 - 3 mins read

Ampere’s newest CPU cores, the AmpereOne series, are the first server CPUs supporting Arm’s Memory Tagging Extension. Previously, this feature was only available on a few mobile CPUs, such as the Pixel 8’s Tensor G3 (featuring one Cortex-X3, four Cortex-A715, and four Cortex-A510 cores).

Only the Ampere-1a (CPU part 0xac4) and Ampere-1b (CPU part 0xac5) support MTE. The Ampere-1 (CPU part 0xac3) does not seem to support MTE according to LLVM’s feature flags.
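To check which of these cores a given machine has, one option is to read the `CPU part` field that Linux exposes in `/proc/cpuinfo`. The following is a minimal sketch, not a definitive tool: the function name `classify_cores` and the lookup table are made up for illustration, the part numbers are the ones listed above, and the implementer code `0xc0` for Ampere is an assumption worth double-checking against your own machine.

```python
# Assumption: 0xc0 is the implementer code Ampere cores report.
AMPERE_IMPLEMENTER = 0xC0

# Part numbers and MTE support as listed in this post.
CORES_BY_PART = {
    0xAC3: ("Ampere-1", False),
    0xAC4: ("Ampere-1a", True),
    0xAC5: ("Ampere-1b", True),
}


def classify_cores(cpuinfo_text):
    """Return a set of (core name, supports MTE) found in cpuinfo text."""
    found = set()
    implementer = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "CPU implementer":
            implementer = int(value, 16)
        elif key == "CPU part" and implementer == AMPERE_IMPLEMENTER:
            part = int(value, 16)
            if part in CORES_BY_PART:
                found.add(CORES_BY_PART[part])
    return found
```

On an arm64 Linux machine, this could be called as `classify_cores(open("/proc/cpuinfo").read())`.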

We had previously worked with MTE on our Cage project, where we used a Pixel 8 as our benchmarking device. This proved to be quite tricky, as the mobile cores on the Pixel tended to overheat. Even though we cooled the phone with a thermoelectric cooler on the back, results were still quite flaky.

Enabling MTE

When the Ampere-1a became available, some server vendors could not provide MTE functionality out of the box, even though the CPU supports it. After confirming MTE support with Supermicro, we bought a server from them. When the machine finally arrived, MTE still did not work out of the box. On our Supermicro board (Supermicro ARS-211M-NR), we had to take the following steps to enable MTE:

Boot into the built-in UEFI shell
Run the following command:

nvparam -s NVPDBOOT -i 56 -w 0x1

Reboot the machine
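After the reboot, a quick way to see whether the kernel now reports MTE is to look for the `mte` flag in the `Features` lines of `/proc/cpuinfo`. This is a sketch under some assumptions: it presumes a Linux kernel built with MTE support, and the helper name `mte_reported` is hypothetical. It only checks what the kernel advertises, not that tagged memory actually works end to end.

```python
def mte_reported(cpuinfo_text):
    """Return True if every `Features` line in cpuinfo text lists `mte`."""
    saw_features = False
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() != "Features":
            continue
        saw_features = True
        # Flags are whitespace-separated; match the exact token `mte`.
        if "mte" not in value.split():
            return False
    return saw_features
```

On the server itself, this could be run as `mte_reported(open("/proc/cpuinfo").read())`. If it returns `False` after the steps above, the firmware setting most likely did not take effect.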

Performance Benchmarks

We re-ran some of the benchmarks from the Cage paper, most notably the main performance results as well as the instruction throughput/latency tests.

Instruction Throughput/Latencies

Throughput (Tp in the table) is given in instructions per cycle (IPC) and states how many instructions can be retired per cycle when there are no data dependencies between them. Latency (Lat in the table), on the other hand, is given in cycles per instruction and states how many cycles pass before an instruction's result is available to a dependent instruction.
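To make the difference between the two metrics concrete, here is a small back-of-the-envelope sketch. It is a simplified model, not how the measurements were taken: it assumes a run of independent instructions is limited only by throughput and a dependency chain only by latency. The function names are made up for illustration; the `subp` figures for the Ampere-1a are copied from the table in this section.

```python
def cycles_if_independent(count, throughput_ipc):
    """Estimated cycles to retire `count` instructions with no data dependencies."""
    return count / throughput_ipc


def cycles_if_chained(count, latency_cycles):
    """Estimated cycles for `count` instructions that each consume the previous result."""
    return count * latency_cycles


# Ampere-1a figures for `subp`: Tp 3.82, Lat 1.00
independent = cycles_if_independent(1000, 3.82)  # roughly 262 cycles
chained = cycles_if_chained(1000, 1.00)          # 1000 cycles
print(f"independent: {independent:.0f} cycles, chained: {chained:.0f} cycles")
```

Under this model, 1000 independent `subp` instructions take about a quarter of the time that a chain of 1000 dependent ones does, which is why both numbers are reported.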

Inst	Cortex-X3 Tp	Cortex-X3 Lat	Cortex-A715 Tp	Cortex-A715 Lat	Cortex-A510 Tp	Cortex-A510 Lat	Ampere-1a Tp	Ampere-1a Lat
MTE
`irg`	1.34	1.99	1.00	2.00	0.50	3.00	0.29	3.47
`addg`	2.01	1.99	3.81	1.00	2.22	2.00	2.00	2.01
`subg`	2.01	1.99	3.81	1.00	2.22	2.00	2.00	1.00
`subp`	3.49	0.99	3.81	1.00	2.50	2.00	3.82	1.00
`subps`	2.88	0.99	3.80	1.00	2.50	2.00	1.91	1.00
`stg`	1.00	–	1.81	–	1.00	–	1.00	–
`st2g`	1.00	–	1.84	–	0.46	–	0.50	–
`stzg`	1.00	–	1.84	–	0.98	–	1.00	–
`st2zg`	0.34	–	1.79	–	0.45	–	0.50	–
`stgp`	1.00	–	1.69	–	0.98	–	1.00	–
`ldg`	2.92	–	1.91	–	0.93	–	1.96	–
PAC
`pacdza`	1.01	4.97	1.51	5.00	0.20	4.99	0.50	6.98
`pacda`	1.01	4.97	1.42	5.00	0.20	5.00	0.50	7.02
`autdza`	1.01	4.97	1.51	5.00	0.20	7.99	0.45	6.98
`autda`	1.01	4.97	1.43	5.00	0.20	7.99	0.50	6.98
`xpacd`	1.01	1.99	1.56	2.00	0.20	4.99	0.50	6.98

The numbers for the Cortex cores are taken from the Cage paper. To reproduce them, see Appendix A.

Overall, it seems MTE performs quite well on the Ampere chips. While running the benchmarks for the Cage paper, I would have loved to have access to a machine with proper cooling, more cores and RAM, and no need to run Android. I’m curious to see if other vendors follow suit and also implement MTE in their chips.