FTMOPA (widening, 2-way, FP8 to FP16)

8-bit floating-point sparse sum of two outer products, accumulating

This instruction generates a sum of 8-bit floating-point outer products by multiplying the 2-in-4 selected elements of the dense sub-matrices in the two first source vectors by the corresponding elements of the compressed sparse sub-matrix in the second source vector, and accumulates the results into the corresponding elements of a 16-bit element ZA tile.

The sum of outer products is generated by multiplying the 2-in-4 selected 8-bit floating-point values from the overlapping 16-bit containers of the two SVLH×2 sub-matrices in the first source vectors by the two 8-bit floating-point values from the corresponding 16-bit container of the 2×SVLH sub-matrix in the second source vector. The two elements selected from the overlapping 16-bit containers of the first source vectors correspond to 2-in-4 elements of rows of the two SVLH×2 sub-matrices. Each 16-bit container of the second source vector holds 2 elements of a column of a compressed 2×SVLH sub-matrix.

The 2-in-4 8-bit floating-point values from the overlapping 16-bit containers of the first source vectors are selected by a 4-bit control field in the indexed segment of the control vector register. If the control bit corresponding to an element in the first source vectors is 0, the element is discarded and does not contribute to the sum of products result. If more than two bits of the 4-bit control field corresponding to the 16-bit containers of the first source vectors are 1, only the elements corresponding to the two least significant of those bits are selected.
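The selection rule can be sketched in Python as follows. This is an illustrative model only, not part of the architecture: the function name `select_two_of_four` is hypothetical, and the candidate ordering (both elements of the first vector's container, then both elements of the second vector's container) is taken from the control bit index `2*r + e` in the Operation pseudocode.

```python
def select_two_of_four(ctrl_bits, candidates):
    """Pick at most two of four candidate elements using a 4-bit control.

    ctrl_bits:  integer 0..15; bit p set means candidates[p] is eligible.
    candidates: four values ordered as control bits 0..3, that is
                [Zn1 elem 0, Zn1 elem 1, Zn2 elem 0, Zn2 elem 1].
    """
    selected = []
    for position in range(4):
        # Stop taking elements once two have been selected, so only the
        # two least significant set bits have any effect.
        if len(selected) < 2 and (ctrl_bits >> position) & 1:
            selected.append(candidates[position])
    # Slots that were not filled stay zero and add nothing to the sum.
    while len(selected) < 2:
        selected.append(0)
    return selected
```

For example, a control value of `0b1110` has three bits set, so only the candidates at positions 1 and 2 are used and position 3 is ignored.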

The instruction widens the selected 8-bit floating-point elements of the sub-matrices held in the first source vectors to half-precision values and multiplies them by the corresponding 8-bit floating-point elements of the sub-matrix in the second source vector, which are also widened to half-precision values. The resulting SVLH×SVLH half-precision sum of outer products is scaled by 2^-UInt(FPMR.LSCALE[3:0]) before being destructively added to the half-precision destination tile. This is equivalent to performing a downscaled 2-way dot product and accumulate to each of the destination tile elements.
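The per-element arithmetic can be illustrated with a short Python sketch. The function name `scaled_dot_accumulate` is hypothetical, and the sketch uses Python floats, so it does not model half-precision rounding, FPCR controls, or the special-value handling of the architectural FP8DotAddFP function.

```python
import math

def scaled_dot_accumulate(acc, row_pair, col_pair, lscale):
    """Return acc + (2-way dot product) * 2**-lscale.

    acc:      current destination tile element
    row_pair: the two selected, already-widened first-source elements
    col_pair: the two already-widened second-source elements
    lscale:   value of FPMR.LSCALE; only bits [3:0] are used here
    """
    shift = lscale & 0xF
    dot = row_pair[0] * col_pair[0] + row_pair[1] * col_pair[1]
    # ldexp(x, -shift) computes x * 2**-shift exactly for these inputs.
    return acc + math.ldexp(dot, -shift)
```

With `acc = 1.0`, a row pair of `(2.0, 3.0)`, a column pair of `(4.0, 5.0)` and `lscale = 2`, the dot product 23.0 is scaled to 5.75 and the function returns 6.75.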

The 8-bit floating-point encoding format for the elements of the first source vector and the second source vector is selected by FPMR.F8S1 and FPMR.F8S2 respectively.
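As an illustration of what widening an 8-bit floating-point element involves, the following Python sketch decodes a single byte. The text above does not name the available formats; the sketch assumes the two OFP8 layouts commonly called E5M2 and E4M3, and it does not model how the FPMR.F8S1 and FPMR.F8S2 field values map to them. The function name `decode_fp8` and the format strings are hypothetical.

```python
import math

def decode_fp8(byte, fmt):
    """Decode one FP8 byte to a Python float.

    fmt is "E5M2" or "E4M3" (assumed OFP8 layouts, see the lead-in).
    """
    if fmt not in ("E5M2", "E4M3"):
        raise ValueError("unknown FP8 format: " + fmt)
    exp_bits, man_bits = (5, 2) if fmt == "E5M2" else (4, 3)
    bias = (1 << (exp_bits - 1)) - 1
    max_exp = (1 << exp_bits) - 1
    max_man = (1 << man_bits) - 1
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & max_exp
    man = byte & max_man
    if fmt == "E5M2" and exp == max_exp:
        # E5M2 follows the IEEE pattern: infinities and NaNs.
        return sign * math.inf if man == 0 else math.nan
    if fmt == "E4M3" and exp == max_exp and man == max_man:
        # E4M3 has no infinities; only the all-ones pattern is NaN.
        return math.nan
    if exp == 0:
        # Zero and subnormal values have no implicit leading one.
        return sign * math.ldexp(man, 1 - bias - man_bits)
    return sign * math.ldexp((1 << man_bits) | man, exp - bias - man_bits)
```

Every finite value in either assumed format is exactly representable in half precision, which is why the widening step itself loses no information in this model.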

This instruction is unpredicated.

SME2
(FEAT_SME_TMOP && FEAT_SME_F8F16)

Bits [31:21]  10000000011
Bits [20:16]  Zm
Bits [15:13]  000
Bit  [12]     K
Bits [11:10]  Zk
Bits [9:6]    Zn
Bits [5:4]    i2
Bits [3:1]    100
Bit  [0]      ZAda

Encoding

FTMOPA <ZAda>.H, { <Zn1>.B-<Zn2>.B }, <Zm>.B, <Zk>[<index>]

Decode for this encoding

if !IsFeatureImplemented(FEAT_SME_TMOP) || !IsFeatureImplemented(FEAT_SME_F8F16) then
    EndOfDecode(Decode_UNDEF);
constant integer n = UInt(Zn:'0');
constant integer m = UInt(Zm);
constant integer k = UInt('1':K:'1':Zk);
constant integer index = UInt(i2);
constant integer da = UInt(ZAda);
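The register-number arithmetic in the decode can be sketched in Python as below. The helper names `decode_operands` and `format_ftmopa` are hypothetical; the arguments are the raw field values, and the feature check that leads to Decode_UNDEF is not modelled.

```python
def decode_operands(Zn, Zm, K, Zk, i2, ZAda):
    """Map raw encoding fields to operand numbers, mirroring the decode."""
    return {
        "n": Zn << 1,                    # UInt(Zn:'0'): always even
        "m": Zm,                         # UInt(Zm)
        "k": 0b10100 | (K << 3) | Zk,    # UInt('1':K:'1':Zk)
        "index": i2,                     # control segment index, 0 to 3
        "da": ZAda,                      # ZA tile number, 0 or 1
    }

def format_ftmopa(ops):
    """Render decoded operands in the assembler syntax of this encoding."""
    return "FTMOPA ZA{da}.H, {{ Z{n}.B-Z{n2}.B }}, Z{m}.B, Z{k}[{index}]".format(
        n2=ops["n"] + 1, **ops
    )
```

Because bits 4 and 2 of the control register number are fixed at 1, `k` can only take the values 20 to 23 (K = 0) or 28 to 31 (K = 1), which matches the register range given for `<Zk>`.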

Assembler Symbols

<ZAda>

Is the name of the ZA tile ZA0-ZA1, encoded in the "ZAda" field.

<Zn1>

Is the name of the first scalable vector register of the first source multi-vector group, encoded as "Zn" times 2.

<Zn2>

Is the name of the second scalable vector register of the first source multi-vector group, encoded as "Zn" times 2 plus 1.

<Zm>

Is the name of the second source scalable vector register, encoded in the "Zm" field.

<Zk>

Is the name of the control vector register Z20-Z23 or Z28-Z31, encoded in the "K:Zk" fields.

<index>

Is the control segment index, in the range 0 to 3, encoded in the "i2" field.

Operation

CheckFPMREnabled();
CheckStreamingSVEAndZAEnabled();
constant integer VL = CurrentVL;
constant integer dim = VL DIV 16;
constant integer csize = VL DIV 4;
constant bits(VL) op2 = Z[m, VL];
constant bits(VL) op3 = Z[k, VL];
constant bits(csize) ctrl = Elem[op3, index, csize];
constant bits(dim*dim*16) op4 = ZAtile[da, 16, dim*dim*16];
bits(dim*dim*16) result;
for row = 0 to dim-1
    for col = 0 to dim-1
        integer i = 0;
        bits(16) rowop = Zeros(16);
        bits(16) colop = Zeros(16);
        for r = 0 to 1
            constant bits(VL) op1 = Z[n+r, VL];
            for e = 0 to 1
                if i < 2 && Elem[ctrl, 4*col + 2*r + e, 1] == '1' then
                    Elem[rowop, i, 8] = Elem[op1, 2*row + e, 8];
                    i = i + 1;
        for j = 0 to 1
            Elem[colop, j, 8] = Elem[op2, 2*col + j, 8];
        constant bits(16) sum = Elem[op4, row*dim+col, 16];
        Elem[result, row*dim+col, 16] = FP8DotAddFP(sum, rowop, colop, FPCR, FPMR);
ZAtile[da, 16, dim*dim*16] = result;
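The loop structure of the operation can be mirrored in a small Python model for experimentation. This is a sketch under stated assumptions, not the architectural definition: the name `ftmopa_model` is hypothetical, all inputs are lists of already-decoded numbers rather than bit vectors, `dim` is passed in directly instead of being derived from the vector length, and the default `dot_add` is a plain unscaled, unrounded stand-in for FP8DotAddFP.

```python
def ftmopa_model(tile, zn1, zn2, zm, ctrl_bits, dim, dot_add=None):
    """Model of the FTMOPA loop nest on decoded values.

    tile:      dim*dim accumulators in row-major order
    zn1, zn2:  2*dim elements each (the two first source vectors)
    zm:        2*dim elements (the second source vector)
    ctrl_bits: 4*dim control bits (0 or 1) from the indexed segment
    dot_add:   optional callable (acc, rowop, colop) -> new acc
    """
    if dot_add is None:
        # Stand-in for FP8DotAddFP: no scaling, no FP16 rounding.
        def dot_add(acc, rowop, colop):
            return acc + rowop[0] * colop[0] + rowop[1] * colop[1]

    sources = (zn1, zn2)
    result = list(tile)
    for row in range(dim):
        for col in range(dim):
            rowop = [0, 0]   # unselected slots stay zero
            i = 0
            for r in range(2):
                for e in range(2):
                    # Take at most two elements, lowest control bit first.
                    if i < 2 and ctrl_bits[4 * col + 2 * r + e]:
                        rowop[i] = sources[r][2 * row + e]
                        i += 1
            colop = [zm[2 * col], zm[2 * col + 1]]
            result[row * dim + col] = dot_add(tile[row * dim + col], rowop, colop)
    return result
```

Note that the control bits are indexed by `col`, so every row of the tile uses the same selection pattern for a given column. Passing a different `dot_add` allows a scaled or rounded accumulate to be substituted without changing the loop nest.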


Internal version only: aarchmrs v2024-12_rel, pseudocode v2024-12_rel ; Build timestamp: 2024-12-15T22:18

Copyright © 2010-2024 Arm Limited or its affiliates. All rights reserved. This document is Non-Confidential.