LoongArch Reference Manual

According to the context of the software runtime, the non-privileged instruction set of the basic part of LoongArch includes basic integer instructions and basic floating-point instructions. This chapter will describe the integer instruction part. The basic integer instruction part is the most basic part of the non-privileged instruction subset.

2.1. Programming Model of Basic Integer Instruction

The basic integer instruction programming model described in this section only involves the operating environment of the application software, which is always related to some privileged resources. Therefore, the concept of privileged resources will be introduced where necessary to ensure the completeness of the description. Although the content of privileged resources is covered here, it will not be expanded in detail. Readers who need a comprehensive and in-depth understanding can refer to the relevant chapters in the manual according to the prompts in the text.

2.1.1. Data Types

There are 5 data types operated on by basic integer instructions, namely: bit (b), Byte (B, length 8b), Halfword (H, length 16b), Word (W, length 32b) and Doubleword (D, length 64b). In LA32, there are no integer instructions that operate on doublewords. The byte, halfword, word and doubleword data types all use two’s complement encoding.

2.1.2. Registers

The registers involved in basic integer instructions include the General-purpose Registers (GR) and the Program Counter (PC), as shown in the figure.

Figure 2. GR and PC

2.1.2.1. General-purpose Registers

There are 32 General-purpose Registers (GR), denoted r0-r31, and the value of register r0 is always 0. The length of a GR is denoted GRLEN: 32 bits in LA32 and 64 bits in LA64. Basic integer instructions and general-purpose registers are orthogonal. That is, from an architectural point of view, any register operand of these instructions can use any of the 32 GRs. The only exception is that the implicit destination register of the BL instruction must be r1. In the standard LoongArch Application Binary Interface (ABI), r1 is used to store the return address of a function call.

2.1.2.2. PC

There is only one PC, which records the address of the current instruction. The PC register cannot be modified directly by instructions; it can only be modified indirectly by branch instructions, exception traps and exception return instructions. However, the PC register can be read directly as the source operand of some non-branch instructions. The length of the PC is always the same as the length of a GR.

2.1.3. Running Privilege Levels

LoongArch defines 4 running Privilege LeVels (PLV), namely PLV0-PLV3. The specific privilege level of an application is determined by the system software at runtime, and the application cannot accurately perceive it. In LoongArch, applications usually run at PLV3. For more information about privilege levels, see Privilege Levels.

2.1.3.1. Privileged Resources Accessible by Applications

Generally speaking, privileged resources cannot be directly accessed by applications running at a non-privileged level. However, when RPCNTL1/RPCNTL2/RPCNTL3 in CSR.MISC is set, the CSRRD instruction can be executed at PLV1/PLV2/PLV3 to read the performance monitor counters. For more information about performance monitor counters, see Control and Status Registers Related to Performance Monitoring.

2.1.3.2. Disabling of Some Non-privileged Functions

Some non-privileged functions that are enabled by default after power-on reset can be disabled by the system software during execution. Setting the DRDTL1/DRDTL2/DRDTL3 bits in CSR.MISC to 1 prohibits the execution of RDTIME instructions at PLV1/PLV2/PLV3; executing them at a prohibited level triggers the Instruction Privilege error Exception (IPE).

2.1.4. Exceptions and interrupts

Exceptions and interrupts interrupt the currently executing program and switch the control flow to the entry of the exception/interrupt handler. Exceptions are caused by abnormal conditions that occur during the execution of an instruction, and interrupts are caused by external events (such as an interrupt signal input). This manual strictly distinguishes between “generating an exception/interrupt” and “triggering an exception/interrupt”. The former does not necessarily change the control flow, while the latter always redirects the current control flow to an entry point of the exception/interrupt handler.

The handling specifications for exceptions and interrupts belong to the privileged resource handling part of the architecture. Here is a brief introduction to the exceptions that the application can perceive.

SYStem call exception (SYS): executing the SYSCALL instruction triggers the system call exception immediately.
BReaKpoint exception (BRK): executing the BREAK instruction triggers the breakpoint exception immediately.
Instruction Non-defined Exception (INE): if the executed instruction code is not defined in the architecture, or the architecture specification defines the instruction as not existing in the current context, the instruction non-defined exception is triggered immediately.
Instruction Privilege error Exception (IPE): apart from the special circumstances listed in Running Privilege Levels, executing a privileged instruction in application software always triggers the instruction privilege error exception immediately.
ADdress error Exception (ADE): when a functional error in the program makes the address of an instruction fetch or a memory access instruction illegal (for example, the instruction fetch address is not aligned on a 4-byte boundary, or the privileged address space is accessed), the ADdress error Exception for Fetching instructions (ADEF) or the ADdress error Exception for Memory access instructions (ADEM) is triggered.
Floating-Point error Exception (FPE): when a floating-point instruction encounters a data exception that requires special handling, the basic floating-point error exception can be generated or triggered. See Floating-Point Move Instructions for more information.

2.1.5. Memory Address Space

Only the virtual address space visible to the application is involved here. The translation of virtual memory addresses to physical memory addresses is determined by the runtime environment. These contents relate to the relevant specifications of privileged resources in the architecture and will be introduced in the second half of this manual. The memory address space on LoongArch is a continuous linear address space, which is addressed in bytes.

In LA32, the memory address space that applications can access is: 0 to 2^31 - 1.

In LA64, the memory address space that applications can access is: 0 to 2^(VALEN-1) - 1. Generally VALEN is in the range [40, 48]. An application can determine the specific value of VALEN by executing the CPUCFG instruction to read the VALEN field of configuration word 0x1.

When the virtual address of the instruction fetch or memory access instruction in the application exceeds the above range, ADEF or ADEM will be triggered.
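
The two ranges can be sketched in a few lines of Python. The function names are illustrative only and are not defined by the manual. A real application would read VALEN through CPUCFG instead of passing it in by hand.

```python
def la32_user_range():
    """Application-visible address range in LA32: 0 to 2^31 - 1."""
    return 0, (1 << 31) - 1


def la64_user_range(valen):
    """Application-visible address range in LA64: 0 to 2^(VALEN-1) - 1."""
    return 0, (1 << (valen - 1)) - 1


def in_user_range(addr, valen):
    """True if addr lies inside the LA64 range; outside it, ADEF/ADEM applies."""
    low, high = la64_user_range(valen)
    return low <= addr <= high


print(hex(la64_user_range(48)[1]))  # 0x7fffffffffff
```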

2.1.6. Endian

LoongArch bit designations are always little-endian.

2.1.7. Memory Access Types

LoongArch supports three types of memory access: Coherent Cached (CC), Strongly-ordered UnCached (SUC) and Weakly-ordered UnCached (WUC). The memory access type used for a location is associated with its virtual address and is determined by the Memory Access Type (MAT) field. The relationship between the memory access type and the MAT field is: 0 - SUC, 1 - CC, 2 - WUC, and 3 - reserved. The memory access type setting process is transparent to the application.

When the coherent cached (CC) access type is used, the accessed object can be either the final memory object or a cache. This type of access is usually used for faster access.

When SUC or WUC access is used, only the final memory object can be accessed, and it is accessed directly. The difference between the two is as follows. SUC accesses satisfy sequential consistency: all accesses are executed strictly in program order, and the next memory access operation cannot start until the current one has fully completed. WUC read accesses, by contrast, may be executed speculatively, and WUC write data can be merged inside the processor core into a larger unit (such as a Cache line) and then written out in a burst. During merging, later writes can overwrite data written earlier.

In LoongArch, SUC memory access instructions must not have side effects; that is, such instructions cannot be executed speculatively. Software can use this property to access I/O devices in the system through SUC memory access instructions. However, LoongArch allows SUC instruction fetch operations to have side effects. This means that an instruction fetch whose access type is SUC is allowed to execute even if it results from branch prediction. To prevent the out-of-core memory accesses generated by such speculative execution from erroneously entering an illegal physical address space, the risky accesses must be filtered out. This filtering is done on the chip.

The WUC type of access is usually used to accelerate the access to UC memory data, such as video memory data.

2.1.7.1. Cache Coherency Maintenance of Instruction Cache

The Cache coherency between the instruction Cache of a processor core and the Caches in other processor cores or in a Cache Coherent I/O Master must be maintained by hardware.

The Cache coherency maintenance between the instruction Cache and the data Cache within the processor core can be implemented as hardware maintenance. This means that for the self-modifying code, the software does not need to use the CACOP instruction to maintain the Cache coherency between the instruction Cache and the data Cache within the same core. However, due to the pipeline structure and speculative instruction fetching behavior, the software still needs to use the IBAR instruction to ensure that the instruction fetching must be able to see the execution effect of the store instruction.

2.1.8. Unaligned Memory Access

The fetch addresses of all instruction fetches must be aligned on 4-byte boundaries, otherwise the ADEF will be triggered.

Except for atomic memory access instructions, integer bound check memory access instructions and floating-point bound check memory access instructions, load/store memory access instructions can be implemented to allow unaligned memory access addresses. However, in an implementation that allows unaligned addresses, system-mode software can configure the ALCL0-ALCL3 control bits in CSR.MISC so that address alignment checks are also performed for these load/store instructions at the privilege levels PLV0-PLV3. For memory access instructions that require address alignment checks, if the accessed address is not naturally aligned, an Address aLignment fault Exception (ALE) will be triggered.
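
As a rough sketch of what the alignment rules mean for an address, the helpers below treat "naturally aligned" as "the address is a multiple of the access width in bytes". The function names are hypothetical and only model the address test; they do not model the CSR.MISC configuration.

```python
def fetch_is_aligned(pc):
    """Instruction fetch addresses must lie on a 4-byte boundary (else ADEF)."""
    return pc % 4 == 0


def is_naturally_aligned(addr, size):
    """size is the access width in bytes (1, 2, 4 or 8)."""
    if size not in (1, 2, 4, 8):
        raise ValueError("unsupported access width")
    return addr % size == 0


# A word access at 0x1002 is not naturally aligned; when alignment
# checking applies, this is the case that would raise ALE.
print(is_naturally_aligned(0x1002, 4))  # False
```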

2.1.9. Overview of Memory Consistency

The memory consistency model of the LoongArch uses the Weak Consistency (WC) model. This section only gives a brief description of the weak consistency model adopted by the architecture.

In the weak consistency model, synchronization operations need to be distinguished from ordinary memory accesses. The programmer must use the synchronization operations defined by the architecture to protect accesses to writable shared units, so that accesses to such units by multiple processor cores are mutually exclusive. The following restrictions are imposed on the order of memory access events:

The execution of synchronization operations satisfies the sequential consistency condition. That is, synchronization operations are executed in all processor cores strictly in the order in which they appear in the program, and the next synchronization operation cannot start until the current synchronization operation has fully completed.
Before any ordinary memory access operation is allowed to execute, all synchronization operations that precede it in the same processor core must have completed.
Before any synchronization operation is allowed to execute, all ordinary memory access operations that precede it in the same processor core must have completed.

The instructions that can generate synchronization operations in LoongArch include DBAR, IBAR, AM atomic memory access instructions with the DBAR function, and LL-SC instruction pairs.

2.2. Overview of Basic Integer Instructions

This section will describe the functions of application-level basic integer instructions in LA64. For LA32, it only needs to implement a subset of them. The instruction list contained in this subset is shown in the table. Because the length of GR in LA32 is only 32 bits, the sign extension operation in “sign extend the 32-bit result into the general register rd” in the subsequent instruction description is not required.

Table 2. Application-level basic integer instructions in LA32

Arithmetic operation instructions	ADD.W, SUB.W, ADDI.W, ALSL.W, LU12I.W, SLT, SLTU, SLTI, SLTUI, PCADDI, PCADDU12I, PCALAU12I, AND, OR, NOR, XOR, ANDN, ORN, ANDI, ORI, XORI, MUL.W, MULH.W, MULH.WU, DIV.W, MOD.W, DIV.WU, MOD.WU
Bit-shift instructions	SLL.W, SRL.W, SRA.W, ROTR.W, SLLI.W, SRLI.W, SRAI.W, ROTRI.W
Bit-manipulation instructions	EXT.W.B, EXT.W.H, CLO.W, CLZ.W, CTO.W, CTZ.W, BYTEPICK.W, REVB.2H, BITREV.4B, BITREV.W, BSTRINS.W, BSTRPICK.W, MASKEQZ, MASKNEZ
Branch instructions	BEQ, BNE, BLT, BGE, BLTU, BGEU, BEQZ, BNEZ, B, BL, JIRL
Memory access instructions	LD.B, LD.H, LD.W, LD.BU, LD.HU, ST.B, ST.H, ST.W, PRELD
Atomic memory access instructions	LL.W, SC.W
Barrier instructions	DBAR, IBAR
Other instructions	SYSCALL, BREAK, RDTIMEL.W, RDTIMEH.W, CPUCFG

In addition, for instructions whose operand length is the GR length, the operation length is 32 bits in LA32 and 64 bits in LA64. Unless there are special circumstances, this is not restated in the individual instruction descriptions.

2.2.1. Arithmetic Operation Instructions

2.2.1.1. ADD.{W/D}, SUB.{W/D}

Instruction formats:

add.w rd, rj, rk
add.d rd, rj, rk
sub.w rd, rj, rk
sub.d rd, rj, rk

The ADD.W instruction adds bits [31:0] of general register rj to bits [31:0] of general register rk, sign-extends bits [31:0] of the result, and writes it into general register rd.

ADD.W:
    tmp = GR[rj][31:0] + GR[rk][31:0]
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The SUB.W instruction subtracts bits [31:0] of general register rk from bits [31:0] of general register rj, sign-extends bits [31:0] of the result, and writes it into general register rd.

SUB.W:
    tmp = GR[rj][31:0] - GR[rk][31:0]
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The ADD.D instruction adds bits [63:0] of general register rj to bits [63:0] of general register rk and writes the result into general register rd.

ADD.D:
    tmp = GR[rj][63:0] + GR[rk][63:0]
    GR[rd] = tmp[63:0]

The SUB.D instruction subtracts bits [63:0] of general register rk from bits [63:0] of general register rj and writes the result into general register rd.

SUB.D:
    tmp = GR[rj][63:0] - GR[rk][63:0]
    GR[rd] = tmp[63:0]

When the above instructions are executed, no special handling will be done on overflow.
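
The wrap-around and sign-extension behaviour can be checked with a small Python model of an LA64 machine (GRLEN = 64). This is an illustrative sketch: register values are modelled as unsigned Python integers, and the helper names are not part of the architecture.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits, width=64):
    """Sign-extend the low `bits` of value to `width` bits (unsigned result)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << width) - 1)


def add_w(rj, rk):
    return sign_extend((rj & MASK32) + (rk & MASK32), 32)


def sub_w(rj, rk):
    return sign_extend((rj & MASK32) - (rk & MASK32), 32)


def add_d(rj, rk):
    return (rj + rk) & MASK64


def sub_d(rj, rk):
    return (rj - rk) & MASK64


# 0x7FFFFFFF + 1 overflows the signed 32-bit range; the result simply
# wraps and is sign-extended, with no exception.
print(hex(add_w(0x7FFFFFFF, 1)))  # 0xffffffff80000000
```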

2.2.1.2. ADDI.{W/D}, ADDU16I.D

Instruction formats:

addi.w rd, rj, si12
addi.d rd, rj, si12
addu16i.d rd, rj, si16

The ADDI.W instruction adds bits [31:0] of general register rj to the 12-bit immediate si12 sign-extended to 32 bits, sign-extends bits [31:0] of the result, and writes it into general register rd.

ADDI.W:
    tmp = GR[rj][31:0] + SignExtend(si12, 32)
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The ADDI.D instruction adds bits [63:0] of general register rj to the 12-bit immediate si12 sign-extended to 64 bits and writes the result into general register rd.

ADDI.D:
    tmp = GR[rj][63:0] + SignExtend(si12, 64)
    GR[rd] = tmp[63:0]

The ADDU16I.D instruction logically shifts the 16-bit immediate si16 left by 16 bits, sign-extends the result, adds it to bits [63:0] of general register rj, and writes the sum into general register rd. The ADDU16I.D instruction is used in conjunction with the LDPTR.W/D and STPTR.W/D instructions to accelerate GOT-based accesses in position-independent code.

ADDU16I.D:
    tmp = GR[rj][63:0] + SignExtend({si16, 16'b0}, 64)
    GR[rd] = tmp[63:0]

When the above instructions are executed, no special handling will be done on overflow.
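
A minimal Python model of the three immediate additions on LA64 is sketched below. Immediates are passed as raw encoded fields (for example 0xFFF for the 12-bit value -1), and all helper names are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits, width=64):
    """Sign-extend the low `bits` of value to `width` bits (unsigned result)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << width) - 1)


def addi_w(rj, si12):
    tmp = (rj & MASK32) + sign_extend(si12, 12, 32)
    return sign_extend(tmp, 32)


def addi_d(rj, si12):
    return (rj + sign_extend(si12, 12)) & MASK64


def addu16i_d(rj, si16):
    # {si16, 16'b0} is a 32-bit quantity that is then sign-extended to 64 bits.
    return (rj + sign_extend((si16 & 0xFFFF) << 16, 32)) & MASK64


print(addi_d(10, 0xFFF))               # 9, because 0xFFF encodes -1
print(hex(addu16i_d(0x1000, 0xFFFF)))  # 0xffffffffffff1000
```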

2.2.1.3. ALSL.{W[U]/D}

Instruction formats:

alsl.w rd, rj, rk, sa2
alsl.d rd, rj, rk, sa2
alsl.wu rd, rj, rk, sa2

The ALSL.W instruction logically shifts bits [31:0] of general register rj left by (sa2 + 1) bits, adds bits [31:0] of general register rk, sign-extends bits [31:0] of the result, and writes it into general register rd.

ALSL.W:
    tmp = (GR[rj][31:0] << (sa2+1)) + GR[rk][31:0]
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The ALSL.WU instruction logically shifts bits [31:0] of general register rj left by (sa2 + 1) bits, adds bits [31:0] of general register rk, zero-extends bits [31:0] of the result, and writes it into general register rd.

ALSL.WU:
    tmp = (GR[rj][31:0] << (sa2+1)) + GR[rk][31:0]
    GR[rd] = ZeroExtend(tmp[31:0], GRLEN)

The ALSL.D instruction logically shifts bits [63:0] of general register rj left by (sa2 + 1) bits, adds bits [63:0] of general register rk, and writes the result into general register rd.

ALSL.D:
    tmp = (GR[rj][63:0] << (sa2+1)) + GR[rk][63:0]
    GR[rd] = tmp[63:0]

When the above instructions are executed, no special handling will be done on overflow.

Tip	When writing assembly, you need to fill in the immediate field with the real shift value, i.e. (sa2+1), not the value in the immediate field of the instruction code.
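
The relation between the assembly operand and the encoded sa2 field, and a typical use of ALSL for scaled indexing, can be sketched as follows. The helper `encode_sa2` and the other names are hypothetical; the model assumes LA64.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits, width=64):
    """Sign-extend the low `bits` of value to `width` bits (unsigned result)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << width) - 1)


def encode_sa2(shift):
    """Map the shift written in assembly (1..4) to the 2-bit sa2 field (0..3)."""
    if not 1 <= shift <= 4:
        raise ValueError("ALSL shift amount must be in 1..4")
    return shift - 1


def alsl_w(rj, rk, sa2):
    tmp = ((rj & MASK32) << (sa2 + 1)) + (rk & MASK32)
    return sign_extend(tmp, 32)


def alsl_wu(rj, rk, sa2):
    tmp = ((rj & MASK32) << (sa2 + 1)) + (rk & MASK32)
    return tmp & MASK32


def alsl_d(rj, rk, sa2):
    return (((rj & MASK64) << (sa2 + 1)) + rk) & MASK64


# Address of element 5 in an array of 8-byte items at base 0x1000,
# i.e. what "alsl.d rd, index, base, 3" computes.
print(hex(alsl_d(5, 0x1000, encode_sa2(3))))  # 0x1028
```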

2.2.1.4. LU12I.W, LU32I.D, LU52I.D

Instruction formats:

lu12i.w rd, si20
lu32i.d rd, si20
lu52i.d rd, rj, si12

The LU12I.W instruction appends 12 zero bits below the lowest bit of the 20-bit immediate si20, sign-extends the result, and writes it into general register rd.

LU12I.W:
    GR[rd] = SignExtend({si20, 12'b0}, GRLEN)

The LU32I.D instruction sign-extends the 20-bit immediate si20 to 32 bits, appends bits [31:0] of general register rd below its lowest bit, and writes the result into general register rd.

LU32I.D:
    GR[rd] = {SignExtend(si20, 32), GR[rd][31:0]}

The LU52I.D instruction appends bits [51:0] of general register rj below the lowest bit of the 12-bit immediate si12 and writes the result into general register rd.

LU52I.D:
    GR[rd] = {si12, GR[rj][51:0]}

When the above instructions are executed, no special handling will be done on overflow.
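
One common use of these instructions is building an arbitrary 64-bit constant in four steps, with ORI (described in the ANDI, ORI, XORI section) supplying the low 12 bits. The Python model below is a sketch of that idea under the pseudocode above; `load_imm64` and the other names are illustrative, not a prescribed sequence.

```python
MASK32 = 0xFFFFFFFF
MASK52 = (1 << 52) - 1


def sign_extend(value, bits, width=64):
    """Sign-extend the low `bits` of value to `width` bits (unsigned result)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << width) - 1)


def lu12i_w(si20):
    return sign_extend((si20 & 0xFFFFF) << 12, 32)


def ori(rj, ui12):
    return rj | (ui12 & 0xFFF)


def lu32i_d(rd, si20):
    return (sign_extend(si20, 20, 32) << 32) | (rd & MASK32)


def lu52i_d(rj, si12):
    return ((si12 & 0xFFF) << 52) | (rj & MASK52)


def load_imm64(value):
    """Compose a 64-bit constant from its four bit fields."""
    rd = lu12i_w((value >> 12) & 0xFFFFF)      # bits [31:12]
    rd = ori(rd, value & 0xFFF)                # bits [11:0]
    rd = lu32i_d(rd, (value >> 32) & 0xFFFFF)  # bits [51:32]
    rd = lu52i_d(rd, (value >> 52) & 0xFFF)    # bits [63:52]
    return rd


print(hex(load_imm64(0x123456789ABCDEF0)))  # 0x123456789abcdef0
```

Note that LU32I.D overwrites bits [63:32], so the sign extension produced by LU12I.W does not leak into the final value.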

2.2.1.5. SLT[U]

Instruction formats:

slt rd, rj, rk
sltu rd, rj, rk

The SLT instruction compares the data in general register rj with the data in general register rk as signed integers. If the former is smaller than the latter, general register rd is set to 1; otherwise it is set to 0.

SLT:
    GR[rd] = (signed(GR[rj]) < signed(GR[rk])) ? 1 : 0

The SLTU instruction compares the data in general register rj with the data in general register rk as unsigned integers. If the former is smaller than the latter, general register rd is set to 1; otherwise it is set to 0.

SLTU:
    GR[rd] = (unsigned(GR[rj]) < unsigned(GR[rk])) ? 1 : 0

The data length compared by SLT and SLTU is consistent with the length of the general register of the executing machine.

2.2.1.6. SLT[U]I

Instruction formats:

slti rd, rj, si12
sltui rd, rj, si12

The SLTI instruction compares the data in general register rj with the sign-extended 12-bit immediate si12 as signed integers. If the former is smaller than the latter, general register rd is set to 1; otherwise it is set to 0.

SLTI:
    tmp = SignExtend(si12, GRLEN)
    GR[rd] = (signed(GR[rj]) < signed(tmp)) ? 1 : 0

The SLTUI instruction compares the data in general register rj with the sign-extended 12-bit immediate si12 as unsigned integers. If the former is smaller than the latter, general register rd is set to 1; otherwise it is set to 0.

SLTUI:
    tmp = SignExtend(si12, GRLEN)
    GR[rd] = (unsigned(GR[rj]) < unsigned(tmp)) ? 1 : 0

The data length compared by SLTI and SLTUI is consistent with the length of the general register of the executing machine. Note that for SLTUI instructions, immediate data is still sign extended.
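
The effect of sign-extending the immediate before an unsigned comparison is easy to overlook, so a short Python model for LA64 may help. The helper names are illustrative, and immediates are passed as raw 12-bit fields.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits, width=64):
    """Sign-extend the low `bits` of value to `width` bits (unsigned result)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << width) - 1)


def to_signed(value, bits=64):
    """Reinterpret an unsigned `bits`-wide value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def slti(rj, si12):
    tmp = sign_extend(si12, 12)
    return 1 if to_signed(rj) < to_signed(tmp) else 0


def sltui(rj, si12):
    tmp = sign_extend(si12, 12)  # still sign-extended, then compared unsigned
    return 1 if (rj & MASK64) < tmp else 0


# si12 = 0xFFF encodes -1. As a signed value 5 is not below -1, but as an
# unsigned value 5 is below 0xFFFFFFFFFFFFFFFF.
print(slti(5, 0xFFF), sltui(5, 0xFFF))  # 0 1
```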

2.2.1.7. PCADDI, PCADDU12I, PCADDU18I, PCALAU12I

Instruction formats:

pcaddi rd, si20
pcaddu12i rd, si20
pcaddu18i rd, si20
pcalau12i rd, si20

The PCADDI instruction appends 2 zero bits below the lowest bit of the 20-bit immediate si20, sign-extends the result, adds it to the PC of the instruction, and writes the sum into general register rd.

PCADDI:
    GR[rd] = PC + SignExtend({si20, 2'b0}, GRLEN)

The PCADDU12I instruction appends 12 zero bits below the lowest bit of the 20-bit immediate si20, sign-extends the result, adds it to the PC of the instruction, and writes the sum into general register rd.

PCADDU12I:
    GR[rd] = PC + SignExtend({si20, 12'b0}, GRLEN)

The PCADDU18I instruction appends 18 zero bits below the lowest bit of the 20-bit immediate si20, sign-extends the result, adds it to the PC of the instruction, and writes the sum into general register rd.

PCADDU18I:
    GR[rd] = PC + SignExtend({si20, 18'b0}, GRLEN)

The PCALAU12I instruction appends 12 zero bits below the lowest bit of the 20-bit immediate si20, sign-extends the result, and adds it to the PC of the instruction; the lowest 12 bits of the sum are then cleared and the result is written into general register rd.

PCALAU12I:
    tmp = PC + SignExtend({si20, 12'b0}, GRLEN)
    GR[rd] = {tmp[GRLEN-1:12], 12'b0}

The data length of the above instruction operation is consistent with the length of the general register of the executed machine.
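
The four PC-relative computations differ only in the scale of the immediate and in whether the low 12 bits are cleared. The Python model below, for LA64 with raw 20-bit immediate fields, is a sketch with illustrative names; the PC value used is an arbitrary example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits, width=64):
    """Sign-extend the low `bits` of value to `width` bits (unsigned result)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << width) - 1)


def pc_plus(pc, si20, zeros):
    """PC + SignExtend({si20, zeros'b0}, 64)."""
    offset = sign_extend((si20 & 0xFFFFF) << zeros, 20 + zeros)
    return (pc + offset) & MASK64


def pcaddi(pc, si20):
    return pc_plus(pc, si20, 2)


def pcaddu12i(pc, si20):
    return pc_plus(pc, si20, 12)


def pcaddu18i(pc, si20):
    return pc_plus(pc, si20, 18)


def pcalau12i(pc, si20):
    return pc_plus(pc, si20, 12) & ~0xFFF & MASK64


pc = 0x120001234
print(hex(pcaddu12i(pc, 1)))  # 0x120002234
print(hex(pcalau12i(pc, 1)))  # 0x120002000
```

Because PCALAU12I yields a 4 KiB-aligned value, the low 12 bits of a target address can be supplied separately, for example by the si12 field of a following ADDI.D or load/store instruction.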

2.2.1.8. AND, OR, NOR, XOR, ANDN, ORN

Instruction formats:

and rd, rj, rk
or rd, rj, rk
nor rd, rj, rk
xor rd, rj, rk
andn rd, rj, rk
orn rd, rj, rk

The AND instruction performs a bitwise AND of the data in general register rj and the data in general register rk, and writes the result into general register rd.

AND:
    GR[rd] = GR[rj] & GR[rk]

The OR instruction performs a bitwise OR of the data in general register rj and the data in general register rk, and writes the result into general register rd.

OR:
    GR[rd] = GR[rj] | GR[rk]

The NOR instruction performs a bitwise NOR of the data in general register rj and the data in general register rk, and writes the result into general register rd.

NOR:
    GR[rd] = ~(GR[rj] | GR[rk])

The XOR instruction performs a bitwise XOR of the data in general register rj and the data in general register rk, and writes the result into general register rd.

XOR:
    GR[rd] = GR[rj] ^ GR[rk]

The ANDN instruction inverts the data in general register rk bit by bit, performs a bitwise AND of the inverted value and the data in general register rj, and writes the result into general register rd.

ANDN:
    GR[rd] = GR[rj] & (~GR[rk])

The ORN instruction inverts the data in general register rk bit by bit, performs a bitwise OR of the inverted value and the data in general register rj, and writes the result into general register rd.

ORN:
    GR[rd] = GR[rj] | (~GR[rk])

The data length of the above instruction operation is consistent with the length of the general register of the executed machine.
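
For the three less familiar operations, the following Python sketch models 64-bit registers as unsigned integers. Python integers are unbounded, so each result is masked back to 64 bits; the function names are illustrative.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def nor(rj, rk):
    return ~(rj | rk) & MASK64


def andn(rj, rk):
    """rj AND (NOT rk): clears in rj every bit that is set in rk."""
    return rj & ~rk & MASK64


def orn(rj, rk):
    """rj OR (NOT rk)."""
    return (rj | ~rk) & MASK64


print(bin(andn(0b1111, 0b0101)))  # 0b1010
print(hex(nor(0, 0)))             # 0xffffffffffffffff
```

One consequence of r0 always reading as 0 is that a bitwise NOT of a register can be obtained as NOR with r0.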

2.2.1.9. ANDI, ORI, XORI

Instruction formats:

andi rd, rj, ui12
ori rd, rj, ui12
xori rd, rj, ui12

The ANDI instruction performs a bitwise AND of the data in general register rj and the zero-extended 12-bit immediate ui12, and writes the result into general register rd.

ANDI:
    GR[rd] = GR[rj] & ZeroExtend(ui12, GRLEN)

The ORI instruction performs a bitwise OR of the data in general register rj and the zero-extended 12-bit immediate ui12, and writes the result into general register rd.

ORI:
    GR[rd] = GR[rj] | ZeroExtend(ui12, GRLEN)

The XORI instruction performs a bitwise XOR of the data in general register rj and the zero-extended 12-bit immediate ui12, and writes the result into general register rd.

XORI:
    GR[rd] = GR[rj] ^ ZeroExtend(ui12, GRLEN)

The data length of the above instruction operation is consistent with the length of the general register of the executed machine.

2.2.1.10. NOP

The NOP instruction is an alias for the instruction andi r0, r0, 0. Its only function is to occupy a 4-byte instruction slot and increase the PC by 4; apart from that, it does not change any other software-visible processor state.

2.2.1.11. MUL.{W/D}, MULH.{W[U]/D[U]}

Instruction formats:

mul.w rd, rj, rk
mulh.w rd, rj, rk
mulh.wu rd, rj, rk
mul.d rd, rj, rk
mulh.d rd, rj, rk
mulh.du rd, rj, rk

The MUL.W instruction multiplies bits [31:0] of general register rj by bits [31:0] of general register rk, sign-extends bits [31:0] of the product, and writes it into general register rd.

MUL.W:
    product = signed(GR[rj][31:0]) * signed(GR[rk][31:0])
    GR[rd] = SignExtend(product[31:0], GRLEN)

The MULH.W instruction multiplies bits [31:0] of general register rj by bits [31:0] of general register rk as signed numbers, sign-extends bits [63:32] of the product, and writes it into general register rd.

MULH.W:
    product = signed(GR[rj][31:0]) * signed(GR[rk][31:0])
    GR[rd] = SignExtend(product[63:32], GRLEN)

The MULH.WU instruction multiplies bits [31:0] of general register rj by bits [31:0] of general register rk as unsigned numbers, sign-extends bits [63:32] of the product, and writes it into general register rd.

MULH.WU:
    product = unsigned(GR[rj][31:0]) * unsigned(GR[rk][31:0])
    GR[rd] = SignExtend(product[63:32], GRLEN)

The MUL.D instruction multiplies bits [63:0] of general register rj by bits [63:0] of general register rk and writes bits [63:0] of the product into general register rd.

MUL.D:
    product = signed(GR[rj][63:0]) * signed(GR[rk][63:0])
    GR[rd] = product[63:0]

The MULH.D instruction multiplies bits [63:0] of general register rj by bits [63:0] of general register rk as signed numbers and writes bits [127:64] of the product into general register rd.

MULH.D:
    product = signed(GR[rj][63:0]) * signed(GR[rk][63:0])
    GR[rd] = product[127:64]

The MULH.DU instruction multiplies bits [63:0] of general register rj by bits [63:0] of general register rk as unsigned numbers and writes bits [127:64] of the product into general register rd.

MULH.DU:
    product = unsigned(GR[rj][63:0]) * unsigned(GR[rk][63:0])
    GR[rd] = product[127:64]
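
The difference between the signed and unsigned high-half multiplies shows up as soon as an operand has its top bit set. The Python model below (LA64, illustrative helper names) relies on Python's arbitrary-precision integers, whose right shift on negative values behaves like an arithmetic shift.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits, width=64):
    """Sign-extend the low `bits` of value to `width` bits (unsigned result)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << width) - 1)


def to_signed(value, bits):
    """Reinterpret the low `bits` of value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mul_w(rj, rk):
    return sign_extend(to_signed(rj, 32) * to_signed(rk, 32), 32)


def mulh_w(rj, rk):
    product = to_signed(rj, 32) * to_signed(rk, 32)
    return sign_extend(product >> 32, 32)


def mulh_wu(rj, rk):
    product = (rj & MASK32) * (rk & MASK32)
    return sign_extend(product >> 32, 32)


def mulh_d(rj, rk):
    return ((to_signed(rj, 64) * to_signed(rk, 64)) >> 64) & MASK64


def mulh_du(rj, rk):
    return (((rj & MASK64) * (rk & MASK64)) >> 64) & MASK64


# 0xFFFFFFFF is -1 when signed and 4294967295 when unsigned.
print(hex(mulh_w(0xFFFFFFFF, 2)))   # 0xffffffffffffffff
print(hex(mulh_wu(0xFFFFFFFF, 2)))  # 0x1
```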

2.2.1.12. MULW.D.W[U]

Instruction formats:

mulw.d.w rd, rj, rk
mulw.d.wu rd, rj, rk

The MULW.D.W instruction multiplies bits [31:0] of general register rj by bits [31:0] of general register rk as signed numbers and writes the 64-bit product into general register rd.

MULW.D.W:
    product = signed(GR[rj][31:0]) * signed(GR[rk][31:0])
    GR[rd] = product[63:0]

The MULW.D.WU instruction multiplies bits [31:0] of general register rj by bits [31:0] of general register rk as unsigned numbers and writes the 64-bit product into general register rd.

MULW.D.WU:
    product = unsigned(GR[rj][31:0]) * unsigned(GR[rk][31:0])
    GR[rd] = product[63:0]

2.2.1.13. DIV.{W[U]/D[U]}, MOD.{W[U]/D[U]}

Instruction formats:

div.w rd, rj, rk
mod.w rd, rj, rk
div.wu rd, rj, rk
mod.wu rd, rj, rk
div.d rd, rj, rk
mod.d rd, rj, rk
div.du rd, rj, rk
mod.du rd, rj, rk

The DIV.W and DIV.WU instructions divide bits [31:0] of general register rj by bits [31:0] of general register rk, sign-extend the resulting quotient, and write it into general register rd.

DIV.W:
    quotient = signed(GR[rj][31:0]) / signed(GR[rk][31:0])
    GR[rd] = SignExtend(quotient[31:0], GRLEN)
DIV.WU:
    quotient = unsigned(GR[rj][31:0]) / unsigned(GR[rk][31:0])
    GR[rd] = SignExtend(quotient[31:0], GRLEN)

The MOD.W and MOD.WU instructions divide bits [31:0] of general register rj by bits [31:0] of general register rk, sign-extend the resulting remainder, and write it into general register rd.

MOD.W:
    remainder = signed(GR[rj][31:0]) % signed(GR[rk][31:0])
    GR[rd] = SignExtend(remainder[31:0], GRLEN)
MOD.WU:
    remainder = unsigned(GR[rj][31:0]) % unsigned(GR[rk][31:0])
    GR[rd] = SignExtend(remainder[31:0], GRLEN)

The DIV.D and DIV.DU instructions divide bits [63:0] of general register rj by bits [63:0] of general register rk and write the resulting quotient into general register rd.

DIV.D:
    GR[rd] = signed(GR[rj][63:0]) / signed(GR[rk][63:0])
DIV.DU:
    GR[rd] = unsigned(GR[rj][63:0]) / unsigned(GR[rk][63:0])

The MOD.D and MOD.DU instructions divide bits [63:0] of general register rj by bits [63:0] of general register rk and write the resulting remainder into general register rd.

MOD.D:
    GR[rd] = signed(GR[rj][63:0]) % signed(GR[rk][63:0])
MOD.DU:
    GR[rd] = unsigned(GR[rj][63:0]) % unsigned(GR[rk][63:0])

When DIV.W, MOD.W, DIV.D and MOD.D perform division operations, the source operands are all regarded as signed numbers. When DIV.WU, MOD.WU, DIV.DU and MOD.DU perform division operations, the source operands are all regarded as unsigned numbers.

For each quotient/remainder instruction pair (DIV.W/MOD.W, DIV.WU/MOD.WU, DIV.D/MOD.D, DIV.DU/MOD.DU), the results satisfy the following: the sign of the remainder is the same as the sign of the dividend, and the absolute value of the remainder is less than the absolute value of the divisor.

When the divisor is 0, the result can be any value, but no exception will be triggered.
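
The remainder rule above (same sign as the dividend) corresponds to a quotient truncated toward zero, which differs from Python's floor-based `//` and `%` for negative operands. The sketch below models the 32-bit forms on LA64 with illustrative helper names. It assumes a non-zero divisor; the divisor-zero case, whose result is unspecified, is deliberately left out.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits, width=64):
    """Sign-extend the low `bits` of value to `width` bits (unsigned result)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << width) - 1)


def to_signed(value, bits):
    """Reinterpret the low `bits` of value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def trunc_div(a, b):
    """Integer division rounding toward zero (b must be non-zero)."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q


def div_w(rj, rk):
    return sign_extend(trunc_div(to_signed(rj, 32), to_signed(rk, 32)), 32)


def mod_w(rj, rk):
    a, b = to_signed(rj, 32), to_signed(rk, 32)
    return sign_extend(a - b * trunc_div(a, b), 32)


def div_wu(rj, rk):
    return sign_extend((rj & MASK32) // (rk & MASK32), 32)


def mod_wu(rj, rk):
    return sign_extend((rj & MASK32) % (rk & MASK32), 32)


minus7 = -7 & MASK32
print(to_signed(div_w(minus7, 2), 64), to_signed(mod_w(minus7, 2), 64))  # -3 -1
```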

2.2.2. Bit-shift Instructions

2.2.2.1. SLL.W, SRL.W, SRA.W, ROTR.W

Instruction formats:

sll.w rd, rj, rk
srl.w rd, rj, rk
sra.w rd, rj, rk
rotr.w rd, rj, rk

The SLL.W instruction logically shifts bits [31:0] of general register rj to the left, and writes the sign-extended shift result into general register rd.

SLL.W:
    tmp = SLL(GR[rj][31:0], GR[rk][4:0])
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The SRL.W instruction logically shifts bits [31:0] of general register rj to the right, and writes the sign-extended shift result into general register rd.

SRL.W:
    tmp = SRL(GR[rj][31:0], GR[rk][4:0])
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The SRA.W instruction arithmetically shifts bits [31:0] of general register rj to the right, and writes the sign-extended shift result into general register rd.

SRA.W:
    tmp = SRA(GR[rj][31:0], GR[rk][4:0])
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The ROTR.W instruction rotates bits [31:0] of general register rj to the right, and writes the sign-extended result into general register rd.

ROTR.W:
    tmp = ROTR(GR[rj][31:0], GR[rk][4:0])
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The shift amount of the above shift instructions is bits [4:0] of general register rk, treated as an unsigned number.
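
As a rough illustration of the four 32-bit shifts and the final sign extension, the following Python sketch models them on LA64 (GRLEN = 64 is an assumption, and the function names are hypothetical). Only the low 5 bits of rk are used as the shift amount.

```python
GRLEN = 64  # assumption: LA64 register width
MASK32 = 0xFFFFFFFF


def sign_extend32(value, grlen=GRLEN):
    """Sign-extend the low 32 bits of value to a grlen-bit register image."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32
    return value & ((1 << grlen) - 1)


def sll_w(rj, rk):
    sa = rk & 0x1F                       # shift amount: GR[rk][4:0]
    return sign_extend32((rj & MASK32) << sa)


def srl_w(rj, rk):
    sa = rk & 0x1F
    return sign_extend32((rj & MASK32) >> sa)


def sra_w(rj, rk):
    sa = rk & 0x1F
    word = rj & MASK32
    if word & 0x80000000:
        word -= 1 << 32                  # treat the word as signed
    return sign_extend32(word >> sa)     # Python's >> on a negative int is arithmetic


def rotr_w(rj, rk):
    sa = rk & 0x1F
    word = rj & MASK32
    return sign_extend32((word >> sa) | (word << (32 - sa)))
```

Note that a 32-bit result with bit 31 set, such as sll_w(1, 31), appears in the 64-bit register with its upper 32 bits all set to 1.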

2.2.2.2. SLLI.W, SRLI.W, SRAI.W, ROTRI.W

Instruction formats:

slli.w rd, rj, ui5
srli.w rd, rj, ui5
srai.w rd, rj, ui5
rotri.w rd, rj, ui5

The SLLI.W instruction logically shifts bits [31:0] of general register rj to the left, and writes the sign-extended shift result into general register rd.

SLLI.W:
    tmp = SLL(GR[rj][31:0], ui5)
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The SRLI.W instruction logically shifts bits [31:0] of general register rj to the right, and writes the sign-extended shift result into general register rd.

SRLI.W:
    tmp = SRL(GR[rj][31:0], ui5)
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The SRAI.W instruction arithmetically shifts bits [31:0] of general register rj to the right, and writes the sign-extended shift result into general register rd.

SRAI.W:
    tmp = SRA(GR[rj][31:0], ui5)
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The ROTRI.W instruction rotates bits [31:0] of general register rj to the right, and writes the sign-extended result into general register rd.

ROTRI.W:
    tmp = ROTR(GR[rj][31:0], ui5)
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The shift amount of the above shift instructions is the 5-bit unsigned immediate ui5 in the instruction code.

2.2.2.3. SLL.D, SRL.D, SRA.D, ROTR.D

Instruction formats:

sll.d rd, rj, rk
srl.d rd, rj, rk
sra.d rd, rj, rk
rotr.d rd, rj, rk

The SLL.D instruction logically shifts bits [63:0] of general register rj to the left, and writes the shift result into general register rd.

SLL.D: GR[rd] = SLL(GR[rj][63:0], GR[rk][5:0])

The SRL.D instruction logically shifts bits [63:0] of general register rj to the right, and writes the shift result into general register rd.

SRL.D: GR[rd] = SRL(GR[rj][63:0], GR[rk][5:0])

The SRA.D instruction arithmetically shifts bits [63:0] of general register rj to the right, and writes the shift result into general register rd.

SRA.D: GR[rd] = SRA(GR[rj][63:0], GR[rk][5:0])

The ROTR.D instruction rotates bits [63:0] of general register rj to the right, and writes the result into general register rd.

ROTR.D: GR[rd] = ROTR(GR[rj][63:0], GR[rk][5:0])

The shift amount of the above shift instructions is bits [5:0] of general register rk, treated as an unsigned number.

2.2.2.4. SLLI.D, SRLI.D, SRAI.D, ROTRI.D

Instruction formats:

slli.d rd, rj, ui6
srli.d rd, rj, ui6
srai.d rd, rj, ui6
rotri.d rd, rj, ui6

The SLLI.D instruction logically shifts bits [63:0] of general register rj to the left, and writes the shift result into general register rd.

SLLI.D: GR[rd] = SLL(GR[rj][63:0], ui6)

The SRLI.D instruction logically shifts bits [63:0] of general register rj to the right, and writes the shift result into general register rd.

SRLI.D: GR[rd] = SRL(GR[rj][63:0], ui6)

The SRAI.D instruction arithmetically shifts bits [63:0] of general register rj to the right, and writes the shift result into general register rd.

SRAI.D: GR[rd] = SRA(GR[rj][63:0], ui6)

The ROTRI.D instruction rotates bits [63:0] of general register rj to the right, and writes the result into general register rd.

ROTRI.D: GR[rd] = ROTR(GR[rj][63:0], ui6)

The shift amount of the above shift instructions is the 6-bit unsigned immediate ui6 in the instruction code.

2.2.3. Bit-manipulation Instructions

2.2.3.1. EXT.W{B/H}

Instruction formats:

ext.w.b rd, rj
ext.w.h rd, rj

The EXT.W.B instruction sign-extends bits [7:0] of general register rj and writes the result into general register rd.

EXT.W.B: GR[rd] = SignExtend(GR[rj][7:0], GRLEN)

The EXT.W.H instruction sign-extends bits [15:0] of general register rj and writes the result into general register rd.

EXT.W.H: GR[rd] = SignExtend(GR[rj][15:0], GRLEN)

2.2.3.2. CL{O/Z}.{W/D}, CT{O/Z}.{W/D}

Instruction formats:

clo.w rd, rj
clo.d rd, rj
clz.w rd, rj
clz.d rd, rj
cto.w rd, rj
cto.d rd, rj
ctz.w rd, rj
ctz.d rd, rj

The CLO.W instruction counts the consecutive 1 bits in bits [31:0] of general register rj, starting from bit 31 and moving toward bit 0, and writes the count into general register rd.

CLO.W: GR[rd] = CLO(GR[rj][31:0])

The CLZ.W instruction counts the consecutive 0 bits in bits [31:0] of general register rj, starting from bit 31 and moving toward bit 0, and writes the count into general register rd.

CLZ.W: GR[rd] = CLZ(GR[rj][31:0])

The CTO.W instruction counts the consecutive 1 bits in bits [31:0] of general register rj, starting from bit 0 and moving toward bit 31, and writes the count into general register rd.

CTO.W: GR[rd] = CTO(GR[rj][31:0])

The CTZ.W instruction counts the consecutive 0 bits in bits [31:0] of general register rj, starting from bit 0 and moving toward bit 31, and writes the count into general register rd.

CTZ.W: GR[rd] = CTZ(GR[rj][31:0])

The CLO.D instruction counts the consecutive 1 bits in bits [63:0] of general register rj, starting from bit 63 and moving toward bit 0, and writes the count into general register rd.

CLO.D: GR[rd] = CLO(GR[rj][63:0])

The CLZ.D instruction counts the consecutive 0 bits in bits [63:0] of general register rj, starting from bit 63 and moving toward bit 0, and writes the count into general register rd.

CLZ.D: GR[rd] = CLZ(GR[rj][63:0])

The CTO.D instruction counts the consecutive 1 bits in bits [63:0] of general register rj, starting from bit 0 and moving toward bit 63, and writes the count into general register rd.

CTO.D: GR[rd] = CTO(GR[rj][63:0])

The CTZ.D instruction counts the consecutive 0 bits in bits [63:0] of general register rj, starting from bit 0 and moving toward bit 63, and writes the count into general register rd.

CTZ.D: GR[rd] = CTZ(GR[rj][63:0])
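
The counting operations can be modeled compactly in Python. This is a sketch only: the function names and the width parameter (32 for the .W forms, 64 for the .D forms) are illustrative choices, not the manual's definitions of CLO/CLZ/CTO/CTZ.

```python
def clz(value, width):
    """Count consecutive 0 bits starting at bit width-1 and moving down."""
    value &= (1 << width) - 1
    return width - value.bit_length()


def clo(value, width):
    """Count consecutive 1 bits starting at bit width-1 and moving down."""
    return clz(~value, width)            # leading ones of x are leading zeros of ~x


def ctz(value, width):
    """Count consecutive 0 bits starting at bit 0 and moving up."""
    value &= (1 << width) - 1
    if value == 0:
        return width
    return (value & -value).bit_length() - 1   # isolate the lowest set bit


def cto(value, width):
    """Count consecutive 1 bits starting at bit 0 and moving up."""
    return ctz(~value, width)
```

When every bit matches (for example clz of 0), the count equals the operand width.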

2.2.3.3. BYTEPICK.{W/D}

Instruction formats:

bytepick.w rd, rj, rk, sa2
bytepick.d rd, rj, rk, sa3

The BYTEPICK.W instruction concatenates bits [31:0] of general register rj after bits [31:0] of general register rk, extracts 4 consecutive bytes starting at byte number sa2 counted from the leftmost (most significant) byte, which is numbered 0, and writes the sign-extended 32-bit string into general register rd.

BYTEPICK.W:
    tmp = {GR[rk][8*(4-sa2)-1:0], GR[rj][31:8*(4-sa2)]}
    GR[rd] = SignExtend(tmp[31:0], GRLEN)

The BYTEPICK.D instruction concatenates bits [63:0] of general register rj after bits [63:0] of general register rk, extracts 8 consecutive bytes starting at byte number sa3 counted from the leftmost (most significant) byte, which is numbered 0, and writes the 64-bit string into general register rd.

BYTEPICK.D: GR[rd] = {GR[rk][8*(8-sa3)-1:0], GR[rj][63:8*(8-sa3)]}
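
One way to picture BYTEPICK is as a window sliding over the concatenation {rk, rj}. The Python sketch below follows that reading, assuming GRLEN = 64; the function names are hypothetical, and the window position is derived from the description above rather than taken from any reference implementation.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def bytepick_w(rj, rk, sa2):
    """Pick 4 bytes from {rk[31:0], rj[31:0]}, skipping sa2 bytes from the left."""
    combined = ((rk & MASK32) << 32) | (rj & MASK32)
    tmp = (combined >> (8 * (4 - sa2))) & MASK32
    if tmp & 0x80000000:
        tmp |= 0xFFFFFFFF00000000        # sign-extend the 32-bit result to 64 bits
    return tmp


def bytepick_d(rj, rk, sa3):
    """Pick 8 bytes from {rk[63:0], rj[63:0]}, skipping sa3 bytes from the left."""
    combined = ((rk & MASK64) << 64) | (rj & MASK64)
    return (combined >> (8 * (8 - sa3))) & MASK64
```

With sa2 = 0 the result is simply the low word of rk; each increment of sa2 drops one byte of rk from the top and brings in one byte from the top of rj.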

2.2.3.4. REVB.{2H/4H/2W/D}

Instruction formats:

revb.2h rd, rj
revb.4h rd, rj
revb.2w rd, rj
revb.d rd, rj

The REVB.2H instruction reverses the order of the 2 bytes in bits [15:0] of general register rj to form bits [15:0] of the intermediate result, reverses the order of the 2 bytes in bits [31:16] of general register rj to form bits [31:16] of the intermediate result, and writes the sign-extended 32-bit intermediate result into general register rd.

REVB.2H:
    tmp0 = {GR[rj][ 7: 0], GR[rj][15: 8]}
    tmp1 = {GR[rj][23:16], GR[rj][31:24]}
    GR[rd] = SignExtend({tmp1, tmp0}, GRLEN)

The REVB.4H instruction writes the 2 bytes in bits [15:0] of general register rj in reverse order into bits [15:0] of general register rd, the 2 bytes in bits [31:16] of general register rj in reverse order into bits [31:16] of general register rd, the 2 bytes in bits [47:32] of general register rj in reverse order into bits [47:32] of general register rd, and the 2 bytes in bits [63:48] of general register rj in reverse order into bits [63:48] of general register rd.

REVB.4H:
    tmp0 = {GR[rj][ 7: 0], GR[rj][15: 8]}
    tmp1 = {GR[rj][23:16], GR[rj][31:24]}
    tmp2 = {GR[rj][39:32], GR[rj][47:40]}
    tmp3 = {GR[rj][55:48], GR[rj][63:56]}
    GR[rd] = {tmp3, tmp2, tmp1, tmp0}

The REVB.2W instruction writes the 4 bytes in bits [31:0] of general register rj in reverse order into bits [31:0] of general register rd, and the 4 bytes in bits [63:32] of general register rj in reverse order into bits [63:32] of general register rd.

REVB.2W:
    tmp0 = {GR[rj][ 7: 0], GR[rj][15: 8], GR[rj][23:16], GR[rj][31:24]}
    tmp1 = {GR[rj][39:32], GR[rj][47:40], GR[rj][55:48], GR[rj][63:56]}
    GR[rd] = {tmp1, tmp0}

The REVB.D instruction writes the 8 bytes in bits [63:0] of general register rj into general register rd in reverse order.

REVB.D:
    GR[rd] = {GR[rj][ 7: 0], GR[rj][15: 8], GR[rj][23:16], GR[rj][31:24],
              GR[rj][39:32], GR[rj][47:40], GR[rj][55:48], GR[rj][63:56]}
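
The four REVB variants differ only in the size of the group within which bytes are reversed. A minimal Python sketch of that idea follows, assuming GRLEN = 64; the helper reverse_bytes_in_groups and the wrapper names are illustrative, not taken from the manual.

```python
def reverse_bytes_in_groups(value, total_bytes, group_bytes):
    """Reverse byte order inside each group_bytes-sized group of value."""
    value &= (1 << (8 * total_bytes)) - 1
    raw = value.to_bytes(total_bytes, "little")
    out = b"".join(
        raw[i:i + group_bytes][::-1] for i in range(0, total_bytes, group_bytes)
    )
    return int.from_bytes(out, "little")


def revb_2h(rj):
    tmp = reverse_bytes_in_groups(rj, 4, 2)
    if tmp & 0x80000000:
        tmp |= 0xFFFFFFFF00000000        # sign-extend the 32-bit result to 64 bits
    return tmp


def revb_4h(rj):
    return reverse_bytes_in_groups(rj, 8, 2)


def revb_2w(rj):
    return reverse_bytes_in_groups(rj, 8, 4)


def revb_d(rj):
    return reverse_bytes_in_groups(rj, 8, 8)
```

For instance, revb_2w(0x1122334455667788) swaps the bytes of each 32-bit word independently and produces 0x4433221188776655.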

2.2.3.5. REVH.{2W/D}

Instruction formats:

revh.2w rd, rj
revh.d rd, rj

The REVH.2W instruction writes the 2 halfwords in bits [31:0] of general register rj in reverse order into bits [31:0] of general register rd, and the 2 halfwords in bits [63:32] of general register rj in reverse order into bits [63:32] of general register rd.

REVH.2W:
    tmp0 = {GR[rj][15: 0], GR[rj][31:16]}
    tmp1 = {GR[rj][47:32], GR[rj][63:48]}
    GR[rd] = {tmp1, tmp0}

The REVH.D instruction writes the 4 halfwords in bits [63:0] of general register rj into general register rd in reverse order.

REVH.D: GR[rd] = {GR[rj][15:0], GR[rj][31:16], GR[rj][47:32], GR[rj][63:48]}

2.2.3.6. BITREV.{4B/8B}

Instruction formats:

bitrev.4b rd, rj
bitrev.8b rd, rj

The BITREV.4B instruction reverses the bit order within each of the bytes [7:0], [15:8], [23:16] and [31:24] of general register rj, and writes the sign-extended 32-bit intermediate result into general register rd.

BITREV.4B:
    bstr32[31:24] = BITREV(GR[rj][31:24])
    bstr32[23:16] = BITREV(GR[rj][23:16])
    bstr32[15: 8] = BITREV(GR[rj][15: 8])
    bstr32[ 7: 0] = BITREV(GR[rj][ 7: 0])
    GR[rd] = SignExtend(bstr32, GRLEN)

The BITREV.8B instruction reverses the bit order within each of the bytes [7:0], [15:8], [23:16], [31:24], [39:32], [47:40], [55:48] and [63:56] of general register rj, and writes the 64-bit result into general register rd.

BITREV.8B:
    GR[rd][63:56] = BITREV(GR[rj][63:56])
    GR[rd][55:48] = BITREV(GR[rj][55:48])
    GR[rd][47:40] = BITREV(GR[rj][47:40])
    GR[rd][39:32] = BITREV(GR[rj][39:32])
    GR[rd][31:24] = BITREV(GR[rj][31:24])
    GR[rd][23:16] = BITREV(GR[rj][23:16])
    GR[rd][15: 8] = BITREV(GR[rj][15: 8])
    GR[rd][ 7: 0] = BITREV(GR[rj][ 7: 0])

2.2.3.7. BITREV.{W/D}

Instruction formats:

bitrev.w rd, rj
bitrev.d rd, rj

The BITREV.W instruction reverses the bit order of bits [31:0] of general register rj, and writes the sign-extended 32-bit intermediate result into general register rd.

BITREV.W:
    bstr32[31:0] = BITREV(GR[rj][31:0])
    GR[rd] = SignExtend(bstr32, GRLEN)

The BITREV.D instruction reverses the bit order of bits [63:0] of general register rj, and writes the 64-bit result into general register rd.

BITREV.D: GR[rd] = BITREV(GR[rj][63:0])
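
The per-byte and whole-operand bit reversals can be sketched in Python as follows. GRLEN = 64 is assumed and the function names are hypothetical; bitrev here plays the role of the BITREV() helper used in the pseudocode.

```python
def bitrev(value, width):
    """Reverse the order of the low `width` bits of value."""
    value &= (1 << width) - 1
    return int(format(value, f"0{width}b")[::-1], 2)


def sign_extend32(value):
    """Sign-extend a 32-bit value to a 64-bit register image."""
    value &= 0xFFFFFFFF
    return value | 0xFFFFFFFF00000000 if value & 0x80000000 else value


def bitrev_bytes(value, nbytes):
    """Reverse the bits inside each of the low nbytes bytes, byte by byte."""
    out = 0
    for i in range(nbytes):
        out |= bitrev((value >> (8 * i)) & 0xFF, 8) << (8 * i)
    return out


def bitrev_4b(rj):
    return sign_extend32(bitrev_bytes(rj, 4))


def bitrev_8b(rj):
    return bitrev_bytes(rj, 8)


def bitrev_w(rj):
    return sign_extend32(bitrev(rj, 32))


def bitrev_d(rj):
    return bitrev(rj, 64)
```

BITREV.4B and BITREV.8B keep every byte in place and only mirror the bits inside it, whereas BITREV.W and BITREV.D mirror the whole operand.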

2.2.3.8. BSTRINS.{W/D}

Instruction formats:

bstrins.w rd, rj, msbw, lsbw
bstrins.d rd, rj, msbd, lsbd

The BSTRINS.W instruction replaces bits [msbw:lsbw] of the low 32 bits of general register rd with bits [msbw-lsbw:0] of general register rj, and writes the sign-extended 32-bit result into general register rd.

BSTRINS.W:
    bstr32[31:msbw+1] = GR[rd][31:msbw+1]
    bstr32[msbw:lsbw] = GR[rj][msbw-lsbw:0]
    bstr32[lsbw-1:0] = GR[rd][lsbw-1:0]
    GR[rd] = SignExtend(bstr32[31:0], GRLEN)

The BSTRINS.D instruction replaces bits [msbd:lsbd] of general register rd with bits [msbd-lsbd:0] of general register rj, and leaves the remaining bits of general register rd unchanged.

BSTRINS.D:
    GR[rd][63:msbd+1] = GR[rd][63:msbd+1]
    GR[rd][msbd:lsbd] = GR[rj][msbd-lsbd:0]
    GR[rd][lsbd-1:0] = GR[rd][lsbd-1:0]

2.2.3.9. BSTRPICK.{W/D}

Instruction formats:

bstrpick.w rd, rj, msbw, lsbw
bstrpick.d rd, rj, msbd, lsbd

The BSTRPICK.W instruction extracts bits [msbw:lsbw] of general register rj and zero-extends them to 32 bits; the resulting 32-bit intermediate result is sign-extended and written into general register rd.

BSTRPICK.W:
    bstr32[31:0] = ZeroExtend(GR[rj][msbw:lsbw], 32)
    GR[rd] = SignExtend(bstr32[31:0], GRLEN)

The BSTRPICK.D instruction extracts bits [msbd:lsbd] of general register rj, zero-extends them to 64 bits, and writes the result into general register rd.

BSTRPICK.D: GR[rd] = ZeroExtend(GR[rj][msbd:lsbd], 64)
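
Bit-string insert and pick reduce to ordinary mask-and-shift operations. The Python sketch below illustrates both, assuming GRLEN = 64 and msb >= lsb; the function names are hypothetical.

```python
def sign_extend32(value):
    """Sign-extend a 32-bit value to a 64-bit register image."""
    value &= 0xFFFFFFFF
    return value | 0xFFFFFFFF00000000 if value & 0x80000000 else value


def bstrins_d(rd, rj, msbd, lsbd):
    """Replace rd[msbd:lsbd] with rj[msbd-lsbd:0]; other bits of rd are kept."""
    width = msbd - lsbd + 1
    field_mask = ((1 << width) - 1) << lsbd
    rd &= 0xFFFFFFFFFFFFFFFF
    return (rd & ~field_mask) | ((rj << lsbd) & field_mask)


def bstrins_w(rd, rj, msbw, lsbw):
    """Same insertion on the low 32 bits of rd, then sign-extend the word."""
    return sign_extend32(bstrins_d(rd & 0xFFFFFFFF, rj, msbw, lsbw))


def bstrpick_d(rj, msbd, lsbd):
    """Extract rj[msbd:lsbd], zero-extended."""
    width = msbd - lsbd + 1
    return (rj >> lsbd) & ((1 << width) - 1)


def bstrpick_w(rj, msbw, lsbw):
    """Extract rj[msbw:lsbw], zero-extend to 32 bits, then sign-extend the word."""
    return sign_extend32(bstrpick_d(rj & 0xFFFFFFFF, msbw, lsbw))
```

One consequence of the final sign extension is that bstrpick_w with msbw = 31 and lsbw = 0 does not zero the upper 32 bits when bit 31 of the source is set.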

2.2.3.10. MASKEQZ, MASKNEZ

Instruction formats:

maskeqz rd, rj, rk
masknez rd, rj, rk

The MASKEQZ and MASKNEZ instructions perform conditional assignment operations. When MASKEQZ is executed, if the value of general register rk is equal to 0, general register rd is set to 0; otherwise it is set to the value of general register rj.

MASKEQZ: GR[rd] = (GR[rk] == 0) ? 0 : GR[rj]

When MASKNEZ is executed, if the value of general register rk is not equal to 0, general register rd is set to 0; otherwise it is set to the value of general register rj.

MASKNEZ: GR[rd] = (GR[rk] != 0) ? 0 : GR[rj]
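
One common use of this pair, not spelled out in this section, is a branch-free select: mask each candidate value with the same condition and OR the two results. The Python sketch below illustrates the idea; the select helper is a hypothetical name for a three-instruction sequence (maskeqz, masknez, or).

```python
def maskeqz(rj, rk):
    """GR[rd] = 0 if GR[rk] == 0 else GR[rj]."""
    return 0 if rk == 0 else rj


def masknez(rj, rk):
    """GR[rd] = 0 if GR[rk] != 0 else GR[rj]."""
    return 0 if rk != 0 else rj


def select(cond, a, b):
    """Return a when cond is nonzero, otherwise b, without branching."""
    # Exactly one of the two masked values survives, so OR merges them.
    return maskeqz(a, cond) | masknez(b, cond)
```

Because exactly one of the two masked values is nonzero-capable for any given condition, the OR never mixes bits from both inputs.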

2.2.4. Branch Instructions

2.2.4.1. BEQ, BNE, BLT[U], BGE[U]

Instruction formats:

beq rj, rd, offs16
bne rj, rd, offs16
blt rj, rd, offs16
bge rj, rd, offs16
bltu rj, rd, offs16
bgeu rj, rd, offs16

The BEQ instruction compares the values of general register rj and general register rd. If the two are equal, it jumps to the target address; otherwise it does not jump.

BEQ:
    if GR[rj] == GR[rd]:
        PC = PC + SignExtend({offs16, 2'b0}, GRLEN)

The BNE instruction compares the values of general register rj and general register rd. If the two are not equal, it jumps to the target address; otherwise it does not jump.

BNE:
    if GR[rj] != GR[rd]:
        PC = PC + SignExtend({offs16, 2'b0}, GRLEN)

The BLT instruction compares the values of general register rj and general register rd as signed numbers. If the former is less than the latter, it jumps to the target address; otherwise it does not jump.

BLT:
    if signed(GR[rj]) < signed(GR[rd]):
        PC = PC + SignExtend({offs16, 2'b0}, GRLEN)

The BGE instruction compares the values of general register rj and general register rd as signed numbers. If the former is greater than or equal to the latter, it jumps to the target address; otherwise it does not jump.

BGE:
    if signed(GR[rj]) >= signed(GR[rd]):
        PC = PC + SignExtend({offs16, 2'b0}, GRLEN)

The BLTU instruction compares the values of general register rj and general register rd as unsigned numbers. If the former is less than the latter, it jumps to the target address; otherwise it does not jump.

BLTU:
    if unsigned(GR[rj]) < unsigned(GR[rd]):
        PC = PC + SignExtend({offs16, 2'b0}, GRLEN)

The BGEU instruction compares the values of general register rj and general register rd as unsigned numbers. If the former is greater than or equal to the latter, it jumps to the target address; otherwise it does not jump.

BGEU:
    if unsigned(GR[rj]) >= unsigned(GR[rd]):
        PC = PC + SignExtend({offs16, 2'b0}, GRLEN)

The jump target address of the above six branch instructions is calculated by logically shifting the 16-bit immediate offs16 in the instruction code left by 2 bits, sign-extending the result, and adding the resulting offset to the PC of the branch instruction.

Tip	When writing assembly, you need to fill in the immediate field with the real offset value in bytes, i.e. (offs16<<2).
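
The relationship between the encoded field and the byte offset written in assembly can be sketched in Python as follows. The names branch_target and encode_offset are hypothetical, and GRLEN = 64 is assumed; the bits parameter is 16 here and would be 21 or 26 for the wider branch forms.

```python
def branch_target(pc, offs, bits=16, grlen=64):
    """Compute PC + SignExtend({offs, 2'b0}, GRLEN) for a taken branch."""
    total = bits + 2
    byte_off = (offs << 2) & ((1 << total) - 1)   # append two zero bits
    if byte_off >> (total - 1):
        byte_off -= 1 << total                    # sign-extend the shifted value
    return (pc + byte_off) & ((1 << grlen) - 1)


def encode_offset(byte_offset, bits=16):
    """Convert a byte offset, as written in assembly, to the encoded field."""
    if byte_offset % 4:
        raise ValueError("branch offsets must be multiples of 4 bytes")
    field = byte_offset >> 2
    if not -(1 << (bits - 1)) <= field < (1 << (bits - 1)):
        raise ValueError("offset does not fit in the immediate field")
    return field & ((1 << bits) - 1)
```

For example, a branch back to the previous instruction is written with a byte offset of -4, which encode_offset turns into the 16-bit field 0xFFFF.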

2.2.4.2. BEQZ, BNEZ

Instruction formats:

beqz rj, offs21
bnez rj, offs21

The BEQZ instruction tests the value of general register rj. If it is equal to 0, it jumps to the target address; otherwise it does not jump.

BEQZ:
    if GR[rj] == 0:
        PC = PC + SignExtend({offs21, 2'b0}, GRLEN)

The BNEZ instruction tests the value of general register rj. If it is not equal to 0, it jumps to the target address; otherwise it does not jump.

BNEZ:
    if GR[rj] != 0:
        PC = PC + SignExtend({offs21, 2'b0}, GRLEN)

The jump target address of the above two branch instructions is calculated by logically shifting the 21-bit immediate offs21 in the instruction code left by 2 bits, sign-extending the result, and adding the resulting offset to the PC of the branch instruction.

Tip	When writing assembly, you need to fill in the immediate field with the real offset value in bytes, i.e. (offs21<<2).

2.2.4.3. B

Instruction formats:

b offs26

The B instruction jumps to the target address unconditionally. The jump target address is calculated by logically shifting the 26-bit immediate offs26 in the instruction code left by 2 bits, sign-extending the result, and adding the resulting offset to the PC of the branch instruction.

B: PC = PC + SignExtend({offs26, 2'b0}, GRLEN)

Tip	When writing assembly, you need to fill in the immediate field with the real offset value in bytes, i.e. (offs26<<2).

2.2.4.4. BL

Instruction formats:

bl offs26

The BL instruction jumps to the target address unconditionally, and writes the PC value of the instruction plus 4 into the No.1 general register r1.

The jump target address of the instruction is calculated by logically shifting the 26-bit immediate offs26 in the instruction code left by 2 bits, sign-extending the result, and adding the resulting offset to the PC of the branch instruction.

BL:
    GR[1] = PC + 4
    PC = PC + SignExtend({offs26, 2'b0}, GRLEN)

In the LoongArch ABI, the No.1 general register r1 serves as the return address register ra.

Tip	When writing assembly, you need to fill in the immediate field with the real offset value in bytes, i.e. (offs26<<2).

2.2.4.5. JIRL

Instruction formats:

jirl rd, rj, offs16

The JIRL instruction jumps to the target address unconditionally, and writes the PC value of the instruction plus 4 into general register rd.

The jump target address of the instruction is calculated by logically shifting the 16-bit immediate offs16 in the instruction code left by 2 bits, sign-extending the result, and adding the resulting offset to the value in general register rj.

JIRL:
    GR[rd] = PC + 4
    PC = GR[rj] + SignExtend({offs16, 2'b0}, GRLEN)

When rd is equal to 0, JIRL functions as an ordinary non-call indirect jump instruction.

JIRL with rd equal to 0, rj equal to 1 and offs16 equal to 0 is commonly used as the indirect jump that returns from a function call.

Tip	When writing assembly, you need to fill in the immediate field with the real offset value in bytes, i.e. (offs16<<2).
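
The two uses of JIRL described above (indirect call and function return) can be illustrated with a small Python model. This is a sketch under assumptions: GRLEN = 64, a 32-entry list stands in for the register file, and writes to r0 are discarded because r0 always reads as 0. The target is computed before rd is written in this model so that rd and rj may name the same register.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def jirl(regs, pc, rd, rj, offs16):
    """Model JIRL; updates regs in place and returns the new PC."""
    byte_off = (offs16 & 0xFFFF) << 2            # {offs16, 2'b0}
    if byte_off & (1 << 17):
        byte_off -= 1 << 18                      # sign-extend the 18-bit value
    target = (regs[rj] + byte_off) & MASK64      # read rj before writing rd
    if rd != 0:                                  # r0 is hard-wired to 0
        regs[rd] = (pc + 4) & MASK64
    return target


# Hypothetical usage: call through r5, then return through r1 (ra).
regs = [0] * 32
regs[5] = 0x3000
pc_after_call = jirl(regs, 0x1000, rd=1, rj=5, offs16=1)
pc_after_return = jirl(regs, pc_after_call, rd=0, rj=1, offs16=0)
```

In this usage the call lands at 0x3004 and leaves 0x1004 in r1, and the return form jumps back to 0x1004 without modifying any register.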

2.2.5. Common Memory Access Instructions

2.2.5.1. LD.{B[U]/H[U]/W[U]/D}, ST.{B/H/W/D}

Instruction formats:

ld.b rd, rj, si12
ld.h rd, rj, si12
ld.w rd, rj, si12
ld.d rd, rj, si12
ld.bu rd, rj, si12
ld.hu rd, rj, si12
ld.wu rd, rj, si12
st.b rd, rj, si12
st.h rd, rj, si12
st.w rd, rj, si12
st.d rd, rj, si12

LD.{B/H/W/D} loads a byte/halfword/word/doubleword from memory, sign-extends it, and writes it into general register rd.

LD.B:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    byte = MemoryLoad(paddr, BYTE)
    GR[rd] = SignExtend(byte, GRLEN)
LD.H:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    halfword = MemoryLoad(paddr, HALFWORD)
    GR[rd] = SignExtend(halfword, GRLEN)
LD.W:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    word = MemoryLoad(paddr, WORD)
    GR[rd] = SignExtend(word, GRLEN)
LD.D:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    GR[rd] = MemoryLoad(paddr, DOUBLEWORD)

LD.{BU/HU/WU} loads a byte/halfword/word from memory, zero-extends it, and writes it into general register rd.

LD.BU:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    byte = MemoryLoad(paddr, BYTE)
    GR[rd] = ZeroExtend(byte, GRLEN)
LD.HU:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    halfword = MemoryLoad(paddr, HALFWORD)
    GR[rd] = ZeroExtend(halfword, GRLEN)
LD.WU:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    word = MemoryLoad(paddr, WORD)
    GR[rd] = ZeroExtend(word, GRLEN)

ST.{B/H/W/D} writes bits [7:0]/[15:0]/[31:0]/[63:0] of general register rd into memory.

ST.B:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][7:0], paddr, BYTE)
ST.H:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][15:0], paddr, HALFWORD)
ST.W:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][31:0], paddr, WORD)
ST.D:
    vaddr = GR[rj] + SignExtend(si12, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][63:0], paddr, DOUBLEWORD)

The memory access address of the above instructions is calculated by adding the value in general register rj and the sign-extended 12-bit immediate si12.

For LD.{H[U]/W[U]/D} and ST.{B/H/W/D} instructions, regardless of the hardware implementation and environment configuration, a non-aligned exception will not be triggered as long as the memory access address is naturally aligned. When the memory access address is not naturally aligned, if the hardware implementation supports non-aligned memory access and the current computing environment is configured to allow non-aligned memory access, the non-aligned exception will not be triggered; otherwise a non-aligned exception will be triggered.
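
The address calculation and the difference between the sign-extending and zero-extending loads can be sketched in Python. This model makes several simplifying assumptions that are not part of the description above: GRLEN = 64, a flat bytearray stands in for memory, little-endian byte order is used, and address compliance checks, address translation and alignment exceptions are omitted. All function names are hypothetical.

```python
GRLEN = 64  # assumption: LA64 register width
MASK = (1 << GRLEN) - 1


def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a GRLEN-bit image."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & MASK


def effective_address(rj, si12):
    """vaddr = GR[rj] + SignExtend(si12, GRLEN), wrapping at GRLEN bits."""
    return (rj + sign_extend(si12, 12)) & MASK


def load(mem, rj, si12, size, signed):
    """Model LD.{B/H/W/D} (signed=True) or LD.{BU/HU/WU} (signed=False)."""
    vaddr = effective_address(rj, si12)
    raw = bytes(mem[vaddr:vaddr + size])
    return int.from_bytes(raw, "little", signed=signed) & MASK


def store(mem, rd, rj, si12, size):
    """Model ST.{B/H/W/D}: write the low `size` bytes of rd to memory."""
    vaddr = effective_address(rj, si12)
    low = rd & ((1 << (8 * size)) - 1)
    mem[vaddr:vaddr + size] = low.to_bytes(size, "little")


# Hypothetical usage: store a word, then read it back both ways.
mem = bytearray(16)
store(mem, 0xDEADBEEF, 0, 8, 4)
as_signed = load(mem, 0, 8, 4, signed=True)
as_unsigned = load(mem, 0, 8, 4, signed=False)
```

Here the same stored word reads back as 0xFFFFFFFFDEADBEEF through the sign-extending form and as 0xDEADBEEF through the zero-extending form.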

2.2.5.2. LDX.{B[U]/H[U]/W[U]/D}, STX.{B/H/W/D}

Instruction formats:

ldx.b rd, rj, rk
ldx.h rd, rj, rk
ldx.w rd, rj, rk
ldx.d rd, rj, rk
ldx.bu rd, rj, rk
ldx.hu rd, rj, rk
ldx.wu rd, rj, rk
stx.b rd, rj, rk
stx.h rd, rj, rk
stx.w rd, rj, rk
stx.d rd, rj, rk

LDX.{B/H/W/D} loads a byte/halfword/word/doubleword from memory, sign-extends it, and writes it into general register rd.

LDX.B:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    byte = MemoryLoad(paddr, BYTE)
    GR[rd] = SignExtend(byte, GRLEN)
LDX.H:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    halfword = MemoryLoad(paddr, HALFWORD)
    GR[rd] = SignExtend(halfword, GRLEN)
LDX.W:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    word = MemoryLoad(paddr, WORD)
    GR[rd] = SignExtend(word, GRLEN)
LDX.D:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    GR[rd] = MemoryLoad(paddr, DOUBLEWORD)

LDX.{BU/HU/WU} loads a byte/halfword/word from memory, zero-extends it, and writes it into general register rd.

LDX.BU:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    byte = MemoryLoad(paddr, BYTE)
    GR[rd] = ZeroExtend(byte, GRLEN)
LDX.HU:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    halfword = MemoryLoad(paddr, HALFWORD)
    GR[rd] = ZeroExtend(halfword, GRLEN)
LDX.WU:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    word = MemoryLoad(paddr, WORD)
    GR[rd] = ZeroExtend(word, GRLEN)

STX.{B/H/W/D} writes bits [7:0]/[15:0]/[31:0]/[63:0] of general register rd into memory.

STX.B:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][7:0], paddr, BYTE)
STX.H:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][15:0], paddr, HALFWORD)
STX.W:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][31:0], paddr, WORD)
STX.D:
    vaddr = GR[rj] + GR[rk]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][63:0], paddr, DOUBLEWORD)

The memory access address of the above instructions is calculated by adding the value in general register rj and the value in general register rk.

For LDX.{H[U]/W[U]/D} and STX.{B/H/W/D} instructions, regardless of the hardware implementation and environment configuration, a non-aligned exception will not be triggered as long as the memory access address is naturally aligned. When the memory access address is not naturally aligned, if the hardware implementation supports non-aligned memory access and the current computing environment is configured to allow non-aligned memory access, the non-aligned exception will not be triggered; otherwise a non-aligned exception will be triggered.

2.2.5.3. LDPTR.{W/D}, STPTR.{W/D}

Instruction formats:

ldptr.w rd, rj, si14
ldptr.d rd, rj, si14
stptr.w rd, rj, si14
stptr.d rd, rj, si14

LDPTR.{W/D} loads a word/doubleword from memory, sign-extends it, and writes it into general register rd.

LDPTR.W:
    vaddr = GR[rj] + SignExtend({si14, 2'b0}, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    word = MemoryLoad(paddr, WORD)
    GR[rd] = SignExtend(word, GRLEN)
LDPTR.D:
    vaddr = GR[rj] + SignExtend({si14, 2'b0}, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    GR[rd] = MemoryLoad(paddr, DOUBLEWORD)

STPTR.{W/D} writes bits [31:0]/[63:0] of general register rd into memory.

STPTR.W:
    vaddr = GR[rj] + SignExtend({si14, 2'b0}, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][31:0], paddr, WORD)
STPTR.D:
    vaddr = GR[rj] + SignExtend({si14, 2'b0}, GRLEN)
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    MemoryStore(GR[rd][63:0], paddr, DOUBLEWORD)

The memory access address of the above instructions is calculated by logically shifting the 14-bit immediate si14 left by 2 bits, sign-extending the result, and adding it to the value in general register rj.

Tip	When writing assembly, you need to fill in the immediate field with the real offset value in bytes, i.e. (si14<<2).

For LDPTR.{W/D} and STPTR.{W/D} instructions, regardless of the hardware implementation and environment configuration, a non-aligned exception will not be triggered as long as the memory access address is naturally aligned. When the memory access address is not naturally aligned, if the hardware implementation supports non-aligned memory access and the current computing environment is configured to allow non-aligned memory access, the non-aligned exception will not be triggered; otherwise a non-aligned exception will be triggered.

LDPTR.{W/D} and STPTR.{W/D} instructions are used together with the ADDU16I.D instruction to accelerate GOT-based accesses in position-independent code.

2.2.5.4. PRELD

Instruction formats:

preld hint, rj, si12

The PRELD instruction reads a cache line of data from memory into the Cache in advance. The access address is the sum of the value in general register rj and the sign-extended 12-bit immediate si12.

The hint in the PRELD instruction tells the processor the type of the prefetch and the level of Cache into which the fetched data is to be filled. hint has 32 optional values (0 to 31): 0 represents load to the level 1 Cache, and 8 represents store to the level 1 Cache. The remaining hint values are not defined, and the processor executes them as NOP instructions.

If the Cache attribute of the access address of the PRELD instruction is not cached, then the instruction cannot generate a memory access action and is treated as a NOP instruction. The PRELD instruction will not trigger any exceptions related to MMU or address.

2.2.5.5. PRELDX

Instruction formats:

The PRELDX instruction continuously prefetches data from memory into the Cache according to the configuration parameters, and the continuously prefetched data is a block (block) of length block_size starting from the specified base address (base) with a number of (block_num) spacing stride. The base address is the sum of the [63:0] bits in the general register rj and the sign extension [15:0] bits in the general register rk. The [I16] bits in general register rk are the address sequence ascending and descending flag bits, with 0 indicating address ascending and 1 indicating address descending. The value of bits [25:20] in general register rk is block_size, the basic unit of block_size is 16 bytes, so the maximum length of a single block is 1KB. The value of bits [39:32] in general register rk is block_num-1, so a single instruction can prefetch up to 256 blocks. The value of bits [59:44] in the block general register rk is treated as a signed number and defines the stride between adjacent blocks, the basic unit of stride is 1 byte. The value of bits [39:32] in rk is block.num-1, so a single instruction can prefetch up to 256 blocks. The value of bits [59:44] in general register rk is regarded as a signed number, which defines the corresponding The basic unit of stride and stride between adjacent blocks is 1 byte.

The hint in the PRELDX instruction indicates the type of prefetch and the level of Cache into which the fetched data is to be filled. hint has 32 selectable values from 0 to 31. Currently, hint=0 is defined as load prefetch to level 1 data Cache, hint=2 is defined as load prefetch to level 3 Cache, and hint=8 is defined as store prefetch to level 1 data Cache. The meaning of the remaining hint values is not yet defined, and the processor executes them as NOP instructions.

If the Cache attribute of the access address of the PRELDX instruction is not cached, then the instruction cannot generate a memory access action and is treated as a NOP instruction.

The PRELDX instruction does not trigger any exceptions related to MMU or address.

2.2.6. Bound Check Memory Access Instructions

2.2.6.1. LD{GT/LE}.{B/H/W/D}, ST{GT/LE}.{B/H/W/D}

Instruction formats:

ldgt.b rd, rj, rk
ldgt.h rd, rj, rk
ldgt.w rd, rj, rk
ldgt.d rd, rj, rk
ldle.b rd, rj, rk
ldle.h rd, rj, rk
ldle.w rd, rj, rk
ldle.d rd, rj, rk
stgt.b rd, rj, rk
stgt.h rd, rj, rk
stgt.w rd, rj, rk
stgt.d rd, rj, rk
stle.b rd, rj, rk
stle.h rd, rj, rk
stle.w rd, rj, rk
stle.d rd, rj, rk

LDGT/LDLE.B/H/W/D fetches a byte/halfword/word/doubleword from memory, sign-extends it, and writes it to the general register rd.

STGT/STLE.B/H/W/D writes the [7:0]/[15:0]/[31:0]/[63:0] bits of data from the general register rd to memory.

The access addresses of the above instructions come directly from the values in the general register rj. The access addresses of the above instructions are required to be naturally aligned, otherwise a non-alignment exception will be triggered.

The LDGT.B/H/W/D and STGT.B/H/W/D instructions check whether the value in general register rj is greater than the value in general register rk; if the condition is not satisfied, the access operation is terminated and the bound check exception is triggered. The LDLE.B/H/W/D and STLE.B/H/W/D instructions check whether the value in general register rj is less than or equal to the value in general register rk; if the condition is not satisfied, the access operation is terminated and the bound check exception is triggered.

LDGT.B:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] > GR[rk]:
        byte = MemoryLoad(paddr, BYTE)
        GR[rd] = SignExtend(byte, GRLEN)
    else:
        RaiseException(BCE) # Bound Check Exception

LDGT.H:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] > GR[rk]:
        halfword = MemoryLoad(paddr, HALFWORD)
        GR[rd] = SignExtend(halfword, GRLEN)
    else:
        RaiseException(BCE) # Bound Check Exception

LDGT.W:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] > GR[rk]:
        word = MemoryLoad(paddr, WORD)
        GR[rd] = SignExtend(word, GRLEN)
    else:
        RaiseException(BCE) # Bound Check Exception

LDGT.D:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] > GR[rk]:
        GR[rd] = MemoryLoad(paddr, DOUBLEWORD)
    else:
        RaiseException(BCE) # Bound Check Exception

LDLE.B:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] <= GR[rk]:
        byte = MemoryLoad(paddr, BYTE)
        GR[rd] = SignExtend(byte, GRLEN)
    else:
        RaiseException(BCE) # Bound Check Exception

LDLE.H:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] <= GR[rk]:
        halfword = MemoryLoad(paddr, HALFWORD)
        GR[rd] = SignExtend(halfword, GRLEN)
    else:
        RaiseException(BCE) # Bound Check Exception

LDLE.W:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] <= GR[rk]:
        word = MemoryLoad(paddr, WORD)
        GR[rd] = SignExtend(word, GRLEN)
    else:
        RaiseException(BCE) # Bound Check Exception

LDLE.D:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] <= GR[rk]:
        GR[rd] = MemoryLoad(paddr, DOUBLEWORD)
    else:
        RaiseException(BCE) # Bound Check Exception

STGT.B:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] > GR[rk]:
        MemoryStore(GR[rd][7:0], paddr, BYTE)
    else:
        RaiseException(BCE) # Bound Check Exception

STGT.H:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] > GR[rk]:
        MemoryStore(GR[rd][15:0], paddr, HALFWORD)
    else:
        RaiseException(BCE) # Bound Check Exception

STGT.W:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] > GR[rk]:
        MemoryStore(GR[rd][31:0], paddr, WORD)
    else:
        RaiseException(BCE) # Bound Check Exception

STGT.D:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] > GR[rk]:
        MemoryStore(GR[rd][63:0], paddr, DOUBLEWORD)
    else:
        RaiseException(BCE) # Bound Check Exception

STLE.B:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] <= GR[rk]:
        MemoryStore(GR[rd][7:0], paddr, BYTE)
    else:
        RaiseException(BCE) # Bound Check Exception

STLE.H:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] <= GR[rk]:
        MemoryStore(GR[rd][15:0], paddr, HALFWORD)
    else:
        RaiseException(BCE) # Bound Check Exception

STLE.W:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] <= GR[rk]:
        MemoryStore(GR[rd][31:0], paddr, WORD)
    else:
        RaiseException(BCE) # Bound Check Exception

STLE.D:
    vaddr = GR[rj]
    AddressComplianceCheck(vaddr)
    paddr = AddressTranslation(vaddr)
    if GR[rj] <= GR[rk]:
        MemoryStore(GR[rd][63:0], paddr, DOUBLEWORD)
    else:
        RaiseException(BCE) # Bound Check Exception
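As a rough illustration of the load variants, the following sketch models LDGT/LDLE on a flat byte array. Several things here are assumptions of the sketch rather than statements of the manual: memory is indexed directly by rj with no address translation, data is read little-endian, register values are modeled as non-negative Python integers (this section does not state whether the comparison is signed or unsigned), and the names `ld_checked` and `BoundCheckException` are hypothetical.

```python
class BoundCheckException(Exception):
    """Stands in for the bound check exception (BCE)."""


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def ld_checked(memory, rj, rk, size, kind):
    """Model LDGT (kind='gt') or LDLE (kind='le') for an access of `size` bytes."""
    if rj % size != 0:
        # The manual raises a non-alignment exception; ValueError is a stand-in.
        raise ValueError("address not naturally aligned")
    in_bounds = rj > rk if kind == "gt" else rj <= rk
    if not in_bounds:
        raise BoundCheckException(f"rj={rj:#x} fails '{kind}' check against rk={rk:#x}")
    raw = int.from_bytes(memory[rj:rj + size], "little")  # little-endian assumed
    return sign_extend(raw, size * 8)


mem = bytes([0x80, 0x7F, 0x34, 0x12])
byte_value = ld_checked(mem, 0, 3, 1, "le")   # address 0 <= bound 3
half_value = ld_checked(mem, 2, 1, 2, "gt")   # address 2 > bound 1
```

A typical use would keep the last valid address of a buffer in rk and use the LE variant, so that an out-of-range pointer faults instead of reading past the buffer.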

2.2.7. Atomic Memory Access Instructions

2.2.7.1. AM{SWAP/ADD/AND/OR/XOR/MAX/MIN}[_DB].{W/D}, AM{MAX/MIN}[_DB].{WU/DU}

Instruction formats:

amswap.w rd, rk, rj
amswap_db.w rd, rk, rj
amswap.d rd, rk, rj
amswap_db.d rd, rk, rj
amadd.w rd, rk, rj
amadd_db.w rd, rk, rj
amadd.d rd, rk, rj
amadd_db.d rd, rk, rj
amand.w rd, rk, rj
amand_db.w rd, rk, rj
amand.d rd, rk, rj
amand_db.d rd, rk, rj
amor.w rd, rk, rj
amor_db.w rd, rk, rj
amor.d rd, rk, rj
amor_db.d rd, rk, rj
amxor.w rd, rk, rj
amxor_db.w rd, rk, rj
amxor.d rd, rk, rj
amxor_db.d rd, rk, rj
ammax.w rd, rk, rj
ammax_db.w rd, rk, rj
ammax.d rd, rk, rj
ammax_db.d rd, rk, rj
ammin.w rd, rk, rj
ammin_db.w rd, rk, rj
ammin.d rd, rk, rj
ammin_db.d rd, rk, rj
ammax.wu rd, rk, rj
ammax_db.wu rd, rk, rj
ammax.du rd, rk, rj
ammax_db.du rd, rk, rj
ammin.wu rd, rk, rj
ammin_db.wu rd, rk, rj
ammin.du rd, rk, rj
ammin_db.du rd, rk, rj

The AM* atomic access instructions atomically perform a “read-modify-write” sequence of operations on a memory cell. Specifically, the instruction retrieves the old value at the specified address in memory and writes it to the general register rd, performs a simple operation on the old value and the value in the general register rk, and then writes the result back to the specified address in memory. The entire “read-modify-write” process is atomic: between the return of the data for the read and the global visibility of the write, the processor executing the instruction neither performs any other memory write operation nor triggers any exception, and no write by another processor core or Cache coherent module to the Cache line containing the accessed object becomes globally visible.

The access address of an AM* atomic access instruction is the value of the general register rj. The access address of an AM* atomic access instruction always requires natural alignment, and failure to meet this condition will trigger a non-alignment exception.

Atomic access instructions ending in .W and .WU read memory, write memory, and perform their intermediate operation on 32-bit data, while atomic access instructions ending in .D and .DU do so on 64-bit data. Whether the instruction ends in .W or .WU, the word retrieved from memory is sign-extended and written to the general register rd.

For the AMSWAP[_DB].{W/D} instructions, the new value written to memory is the value in general register rk. For the AMADD[_DB].{W/D} instructions, the new value is the sum of the old value in memory and the value in general register rk. For the AMAND[_DB].{W/D} instructions, the new value is the bitwise AND of the old value in memory and the value in general register rk. For the AMOR[_DB].{W/D} instructions, the new value is the bitwise OR of the old value in memory and the value in general register rk. For the AMXOR[_DB].{W/D} instructions, the new value is the bitwise XOR of the old value in memory and the value in general register rk. For the AMMAX[_DB].{W/D} instructions, the new value is the maximum obtained by comparing the old value in memory with the value in general register rk as signed numbers. For the AMMIN[_DB].{W/D} instructions, the new value is the minimum obtained by comparing the two as signed numbers. For the AMMAX[_DB].{WU/DU} instructions, the new value is the maximum obtained by comparing the old value in memory with the value in general register rk as unsigned numbers. For the AMMIN[_DB].{WU/DU} instructions, the new value is the minimum obtained by comparing the two as unsigned numbers.
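The per-instruction results can be summarized in a small single-threaded model. It is a sketch only: it says nothing about atomicity or barriers, it assumes GRLEN is 64, and the function name `am_op` and its string operation codes are hypothetical.

```python
GRLEN = 64  # assumed register width (LA64)


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def am_op(op, old, rk, bits, unsigned=False):
    """Return (new_memory_value, rd_value) for one AM* operation.

    old and rk are raw bit patterns; bits is 32 (.W/.WU) or 64 (.D/.DU).
    """
    mask = (1 << bits) - 1
    old &= mask
    src = rk & mask
    if op == "swap":
        new = src
    elif op == "add":
        new = (old + src) & mask
    elif op == "and":
        new = old & src
    elif op == "or":
        new = old | src
    elif op == "xor":
        new = old ^ src
    elif op in ("max", "min"):
        # Compare as unsigned bit patterns or as signed numbers.
        key = (lambda v: v) if unsigned else (lambda v: sign_extend(v, bits))
        new = (max if op == "max" else min)(old, src, key=key)
    else:
        raise ValueError(f"unknown operation: {op}")
    # rd receives the old memory value, sign-extended to GRLEN bits.
    rd = sign_extend(old, bits) & ((1 << GRLEN) - 1)
    return new, rd


signed_max = am_op("max", 0xFFFFFFFF, 1, 32)                    # like AMMAX.W
unsigned_max = am_op("max", 0xFFFFFFFF, 1, 32, unsigned=True)   # like AMMAX.WU
```

The two calls differ only in how 0xFFFFFFFF is interpreted: as -1 in the signed case and as the largest 32-bit value in the unsigned case.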

The AM*_DB.W[U]/D[U] instructions not only complete the above atomic operation sequence but also implement the data barrier function. That is, all memory access operations preceding the atomic access instruction in the same processor core must complete before the atomic access instruction is allowed to execute, and all memory access operations following it in the same processor core are allowed to execute only after the atomic access instruction has completed.

If rd and rj have the same register number in an AM* atomic memory access instruction, executing it triggers an Instruction Non-defined Exception.

If rd and rk have the same register number in an AM* atomic memory access instruction, the execution result is uncertain. Software should avoid this situation.

2.2.7.2. AM{SWAP/ADD}[_DB].{B/H}

Instruction formats:

amswap.b rd, rk, rj
amswap_db.b rd, rk, rj
amswap.h rd, rk, rj
amswap_db.h rd, rk, rj
amadd.b rd, rk, rj
amadd_db.b rd, rk, rj
amadd.h rd, rk, rj
amadd_db.h rd, rk, rj

AM{SWAP/ADD}[_DB].{B/H}, like AM{SWAP/ADD}[_DB].{W/D}, are atomic access instructions that atomically complete a "read-modify-write" sequence of operations on a memory cell. The main difference is that the data accessed is a byte/halfword rather than a word/doubleword.

AM{SWAP/ADD}[_DB].{B/H} retrieves the old byte/halfword value at the specified address in memory, sign-extends it, and writes it to the general register rd. At the same time, the old value in memory is swapped with, or added to, the byte/halfword value in bits [7:0]/[15:0] of the general register rk, and the byte/halfword result is written back to the specified address in memory. The entire "read-modify-write" process is atomic: between the return of the data for the read and the global visibility of the write, the processor executing the instruction neither performs any other memory write operation nor triggers any exception, and no write by another processor core or Cache coherence module to the Cache line containing the accessed object becomes globally visible.

The access address of an AM{SWAP/ADD}[_DB].{B/H} atomic access instruction is the value of general-purpose register rj.

The access address of an AM{SWAP/ADD}[_DB].H atomic access instruction must always be naturally aligned; a non-alignment exception is triggered if this condition is not met.

In addition to the above atomic sequence of operations, the AM{SWAP/ADD}_DB.{B/H} instructions also implement the data barrier function. That is, all memory access operations preceding the atomic access instruction in the same processor core must have completed before the atomic access instruction is allowed to execute, and all memory access operations following it in the same processor core are allowed to execute only after the atomic access instruction has completed.

If rd and rj have the same register number in an AM{SWAP/ADD}[_DB].{B/H} instruction, executing it triggers an Instruction Non-defined Exception.

If rd and rk have the same register number in an AM{SWAP/ADD}[_DB].{B/H} instruction, the execution result is uncertain. Software should avoid this situation.

2.2.7.3. AMCAS[_DB].{B/H/W/D}

Instruction formats:

amcas.b rd, rk, rj
amcas_db.b rd, rk, rj
amcas.h rd, rk, rj
amcas_db.h rd, rk, rj
amcas.w rd, rk, rj
amcas_db.w rd, rk, rj
amcas.d rd, rk, rj
amcas_db.d rd, rk, rj

The AMCAS[_DB].{B/H/W/D} instructions perform a byte/halfword/word/doubleword Compare-and-Swap operation on a specified address in memory. The byte/halfword/word/doubleword value retrieved from memory (the old memory value) is compared with the value in bits [7:0]/[15:0]/[31:0]/[63:0] of the general-purpose register rd (the expected value), and the value in bits [7:0]/[15:0]/[31:0]/[63:0] of the general-purpose register rk (the new value) is written to the same location in memory only when the two are equal. Regardless of the outcome of the comparison, the old memory value is sign-extended and written to the general-purpose register rd.
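A minimal single-threaded sketch of the compare-and-swap data flow follows. It models only which values move where, not atomicity; memory is a plain dictionary keyed by address, GRLEN is assumed to be 64, and the name `amcas` is used for illustration only.

```python
GRLEN = 64  # assumed register width (LA64)


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def amcas(memory, addr, rd, rk, bits):
    """Model AMCAS: return the value that ends up in rd.

    rd carries the expected value on entry, rk carries the new value.
    """
    mask = (1 << bits) - 1
    old = memory[addr] & mask
    if old == (rd & mask):
        memory[addr] = rk & mask          # write only when old == expected
    # rd always receives the old memory value, sign-extended.
    return sign_extend(old, bits) & ((1 << GRLEN) - 1)


memory = {0x1000: 0x2A}
rd_after_hit = amcas(memory, 0x1000, rd=0x2A, rk=0x55, bits=32)   # swap happens
rd_after_miss = amcas(memory, 0x1000, rd=0x2A, rk=0x77, bits=32)  # no swap
```

Software can tell whether the swap happened by comparing the value returned in rd with the expected value it supplied.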

In the above process, if a write occurs because the old memory value equals the expected value, the entire "read-modify-write" process is atomic: between the return of the data for the read and the global visibility of the write, the processor executing the instruction neither performs any other memory write operation nor triggers any exception, and no write by another processor core or Cache coherence module to the Cache line containing the accessed object becomes globally visible.

The access address of an AMCAS[_DB].{B/H/W/D} instruction is the value of general-purpose register rj. For AMCAS[_DB].{H/W/D}, the access address must always be naturally aligned; if this condition is not met, a non-alignment exception is triggered.

In addition to the above atomic sequence of operations, the AMCAS_DB.{B/H/W/D} instructions also implement the data barrier function. That is, all memory access operations preceding the atomic access instruction in the same processor core must have completed before the atomic access instruction is allowed to execute, and all memory access operations following it in the same processor core are allowed to execute only after the atomic access instruction has completed.

2.2.7.4. LL.{W/D}, SC.{W/D}

Instruction formats:

ll.w rd, rj, si14
ll.d rd, rj, si14
sc.w rd, rj, si14
sc.d rd, rj, si14

The two pairs of instructions, LL.W and SC.W, LL.D and SC.D, are used to implement an atomic “read-modify-write” sequence of memory access operations. The LL.{W/D} instruction retrieves a word/doubleword from the specified address in memory, sign-extends it, and writes it to the general register rd. The paired SC.{W/D} instruction operates on data of the same length and uses the same memory access address. Atomicity of the sequence is maintained as follows: when LL.{W/D} is executed, the access address is recorded and a flag is set (LLbit is set to 1); when the SC.{W/D} instruction is executed, the LLbit is checked, and the write actually occurs only if the LLbit is 1. When software needs to complete an atomic “read-modify-write” sequence successfully, it must construct a loop that repeats the LL-SC instruction pair until the SC succeeds. To make this loop possible, the SC.{W/D} instruction writes a flag indicating whether it succeeded (simply, the LLbit value seen when the SC instruction executed) into the general register rd.

During the execution of the paired LLSC, the following events will clear the LLbit to 0:

The ERTN instruction is executed and the KL0 bit in CSR.LLBCTL is not equal to 1 when executed;
Other processor cores or Cache Coherent I/O masters perform a store operation on the Cache line where the address corresponding to the LLbit is located.

If the memory access attribute of the LLSC instruction to the access address is not Cached, then the execution result is uncertain.
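The retry loop that software builds around an LL-SC pair can be illustrated with a toy model of the LLbit. Everything here is a simplification made for illustration: there is no real concurrency, interference is tracked per address rather than per Cache line, clearing the LLbit after every SC is an assumption of the sketch, and the class and function names are hypothetical.

```python
class LLSCCore:
    """Toy model of one core's LLbit state over a shared dict-based memory."""

    def __init__(self, memory):
        self.memory = memory
        self.llbit = 0
        self.ll_addr = None

    def ll(self, addr):
        """LL: record the address, set LLbit, return the loaded value."""
        self.llbit = 1
        self.ll_addr = addr
        return self.memory[addr]

    def sc(self, addr, value):
        """SC: store only if LLbit is 1; return the LLbit value that was seen."""
        success = self.llbit
        if success:
            self.memory[addr] = value
        self.llbit = 0  # assumption of this sketch, not stated in the text
        return success

    def remote_store(self, addr, value):
        """A store from another core; clears LLbit if it hits the tracked address."""
        self.memory[addr] = value
        if addr == self.ll_addr:
            self.llbit = 0


def atomic_add(core, addr, delta, interfere=None):
    """Repeat LL/SC until SC succeeds; return (old_value, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        old = core.ll(addr)
        if interfere is not None and attempts == 1:
            interfere()                     # simulate a competing store once
        if core.sc(addr, old + delta):
            return old, attempts


memory = {0x100: 5}
core = LLSCCore(memory)
quiet = atomic_add(core, 0x100, 1)
contended = atomic_add(core, 0x100, 1,
                       interfere=lambda: core.remote_store(0x100, 10))
```

In the contended call the first SC fails because the simulated remote store cleared the LLbit, so the loop reloads the fresh value and succeeds on the second attempt.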

2.2.7.5. SC.Q

Instruction formats:

The SC.Q instruction is similar to the SC.D instruction and is used in conjunction with the LL.D instruction to implement an atomic "read-modify-write" access sequence for 128-bit data.

SC.Q writes the 128-bit data {GR[rk][63:0], GR[rd][63:0]}, formed by concatenating the general-purpose registers rk and rd, to memory; its access address is the value of the general-purpose register rj. The SC.Q instruction checks the LLbit when it executes and performs the write only if the LLbit is 1. The SC.Q instruction writes a flag indicating success or failure (which can also be understood as the LLbit value seen when the SC.Q instruction executes) into the general register rd.

The access address of the SC.Q instruction must always be 16-byte aligned; if this condition is not met, a non-alignment exception is triggered.

If the memory access attribute of the access address of the SC.Q instruction is not Coherent Cached (CC), the result of the execution is indeterminate.

2.2.7.6. LL.ACQ.{W/D}, SC.REL.{W/D}

Instruction formats:

ll.acq.w rd, rj
ll.acq.d rd, rj
sc.rel.w rd, rj
sc.rel.d rd, rj

LL.ACQ.{W/D} is an LL.{W/D} instruction with read-acquire semantics: all subsequent memory access operations can start executing (with globally visible effect) only after LL.ACQ.{W/D} has completed (globally visible). SC.REL.{W/D} is an SC.{W/D} instruction with write-release semantics: SC.REL.{W/D} can start executing (with globally visible effect) only after all preceding memory access operations have completed (globally visible).

The LL.ACQ.{W/D} instruction fetches a word/doubleword from the specified address in memory, sign-extends it, and writes it to the general-purpose register rd; at the same time it records the access address and sets a flag (LLbit is set to 1). The SC.REL.{W/D} instruction conditionally writes the word/doubleword value in bits [31:0]/[63:0] of the general-purpose register rd to the specified address in memory. Whether the write takes place depends on the LLbit: the write actually occurs only when the LLbit is 1. Regardless of whether it writes to memory, the SC.REL instruction writes a flag indicating the success or failure of its execution (simply, the LLbit value seen when the SC.REL instruction executed) into the general-purpose register rd.

During paired LL-SC execution, the following events clear the LLbit to zero:

An ERTN instruction is executed and the KL0 bit in CSR.LLBCTL is not equal to 1 at the time of execution;
Another processor core or Cache Coherent master completes a store operation on the Cache line corresponding to the address of the LLbit.

The LL.ACQ and SC.REL instructions always require the access address to be naturally aligned; if this condition is not met, a non-alignment exception is triggered.

If the memory access attribute of the access address of an LL.ACQ or SC.REL instruction is not Coherent Cached (CC), the result of the execution is indeterminate.

2.2.8. Barrier Instructions

2.2.8.1. DBAR

Instruction formats:

The DBAR instruction is used to complete the barrier function between load/store memory access operations. The immediate hint it carries is used to indicate the synchronization object and synchronization degree of the barrier.

A hint value of 0 is mandatory by default, and it indicates a fully functional synchronization barrier. Only after all previous load/store access operations are completely executed, the DBAR 0 instruction can be executed; and only after the execution of DBAR 0 is completed, all subsequent load/store access operations can be executed.

If there is no special function implementation, all other hint values must be executed according to hint=0.

2.2.8.2. IBAR

Instruction formats:

The IBAR instruction is used to complete the synchronization between the store operation and the instruction fetch operation within a single processor core. The immediate hint it carries is used to indicate the synchronization object and synchronization degree of the barrier.

A hint value of 0 is mandatory by default. It can ensure that the instruction fetch after the IBAR 0 instruction must be able to observe the execution effect of all store operations before the IBAR 0 instruction.

2.2.9. CRC Check Instructions

2.2.9.1. CRC[C].W.{B/H/W/D}.W

Instruction formats:

crc.w.b.w rd, rj, rk
crc.w.h.w rd, rj, rk
crc.w.w.w rd, rj, rk
crc.w.d.w rd, rj, rk
crcc.w.b.w rd, rj, rk
crcc.w.h.w rd, rj, rk
crcc.w.w.w rd, rj, rk
crcc.w.d.w rd, rj, rk

CRC[C].W.{B/H/W/D}.W is used to calculate the CRC-32 checksum. The instruction takes the 32-bit accumulated CRC checksum stored in the general register rk and the message in bits [7:0]/[15:0]/[31:0]/[63:0] of the general register rj, computes a new 32-bit CRC checksum according to the CRC-32 checksum generation algorithm, sign-extends it, and writes it to the general register rd. The difference is that CRC.W.{B/H/W/D}.W uses the IEEE 802.3 polynomial (polynomial value 0xEDB88320), while CRCC.W.{B/H/W/D}.W uses the Castagnoli polynomial (polynomial value 0x82F63B78). The CRC instructions defined in this manual support only the “LSB first” (little endian) standard: the lowest bit of the data is transmitted first and is mapped to the coefficient of the most significant term of the message polynomial.

CRC.W.B.W:
    chksum = CRC32(GR[rk][31:0], GR[rj][7:0], 8, 0xEDB88320)
    GR[rd] = SignExtend(chksum, GRLEN)

CRC.W.H.W:
    chksum = CRC32(GR[rk][31:0], GR[rj][15:0], 16, 0xEDB88320)
    GR[rd] = SignExtend(chksum, GRLEN)

CRC.W.W.W:
    chksum = CRC32(GR[rk][31:0], GR[rj][31:0], 32, 0xEDB88320)
    GR[rd] = SignExtend(chksum, GRLEN)

CRC.W.D.W:
    chksum = CRC32(GR[rk][31:0], GR[rj][63:0], 64, 0xEDB88320)
    GR[rd] = SignExtend(chksum, GRLEN)

CRCC.W.B.W:
    chksum = CRC32(GR[rk][31:0], GR[rj][7:0], 8, 0x82F63B78)
    GR[rd] = SignExtend(chksum, GRLEN)

CRCC.W.H.W:
    chksum = CRC32(GR[rk][31:0], GR[rj][15:0], 16, 0x82F63B78)
    GR[rd] = SignExtend(chksum, GRLEN)

CRCC.W.W.W:
    chksum = CRC32(GR[rk][31:0], GR[rj][31:0], 32, 0x82F63B78)
    GR[rd] = SignExtend(chksum, GRLEN)

CRCC.W.D.W:
    chksum = CRC32(GR[rk][31:0], GR[rj][63:0], 64, 0x82F63B78)
    GR[rd] = SignExtend(chksum, GRLEN)
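The manual does not spell out the body of CRC32(). One plausible reading, sketched below, is the conventional bit-serial LSB-first update with the reflected polynomial, with no initial or final inversion inside the instruction; this is an assumption. Under that assumption, software supplies the usual 0xFFFFFFFF initial value and final inversion itself, and the result can be cross-checked against Python's zlib.crc32. The function names are illustrative, and the sign extension into rd is not modeled.

```python
import zlib

IEEE_8023 = 0xEDB88320    # polynomial value used by CRC.W.*
CASTAGNOLI = 0x82F63B78   # polynomial value used by CRCC.W.*


def crc32_update(chksum, message, nbits, poly):
    """Fold nbits of message, lowest bit first, into a 32-bit running checksum."""
    crc = chksum & 0xFFFFFFFF
    for i in range(nbits):
        bit = (message >> i) & 1
        if (crc ^ bit) & 1:
            crc = (crc >> 1) ^ poly
        else:
            crc >>= 1
    return crc


def crc32_bytes(data, poly):
    """Checksum a byte string one byte at a time, as a CRC.W.B.W loop might."""
    crc = 0xFFFFFFFF                      # conventional initial value (software side)
    for byte in data:
        crc = crc32_update(crc, byte, 8, poly)
    return crc ^ 0xFFFFFFFF               # conventional final inversion


ieee_check = crc32_bytes(b"123456789", IEEE_8023)
castagnoli_check = crc32_bytes(b"123456789", CASTAGNOLI)
matches_zlib = ieee_check == zlib.crc32(b"123456789")
```

Processing a word or doubleword in one step gives the same result as the equivalent sequence of byte steps on little-endian data, which is why the wider variants can be used to speed up the inner loop.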

2.2.10. Other Miscellaneous Instructions

2.2.10.1. SYSCALL

Instruction formats:

Executing the SYSCALL instruction will immediately and unconditionally trigger the system call exception.

The information carried in the code field in the instruction code can be used as a parameter passed by the exception handling routine.

2.2.10.2. BREAK

Instruction formats:

Executing the BREAK instruction will immediately and unconditionally trigger the breakpoint exception.

The information carried in the code field in the instruction code can be used as a parameter passed by the exception handling routine.

2.2.10.3. ASRT{LE/GT}.D

Instruction formats:

asrtle.d rj, rk
asrtgt.d rj, rk

The value in general register rj and general register rk are compared as signed numbers. If the comparison conditions are not met, an exception for address bound checking is triggered. For the ASRTLE.D instruction, if the value in the general register rj is greater than the value in the general register rk, an exception is triggered; for the ASRTGT.D instruction, if the value in the general register rj is less than or equal to the value in the general register rk, an exception is triggered.
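The signed comparison can be sketched as follows. Register values are modeled as 64-bit patterns, and the exception class and function names are hypothetical stand-ins.

```python
class BoundCheckException(Exception):
    """Stands in for the address bound check exception."""


def to_signed64(value):
    """Interpret a 64-bit pattern as a two's complement signed number."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


def asrtle_d(rj, rk):
    """ASRTLE.D: raise unless rj <= rk (signed)."""
    if to_signed64(rj) > to_signed64(rk):
        raise BoundCheckException("rj > rk")


def asrtgt_d(rj, rk):
    """ASRTGT.D: raise unless rj > rk (signed)."""
    if to_signed64(rj) <= to_signed64(rk):
        raise BoundCheckException("rj <= rk")


# 0xFFFFFFFFFFFFFFFF is -1 when read as signed, so it is <= 0.
asrtle_d(0xFFFFFFFFFFFFFFFF, 0)
```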

2.2.10.4. RDTIME{L/H}.W, RDTIME.D

Instruction formats:

rdtimel.w rd, rj
rdtimeh.w rd, rj
rdtime.d rd, rj

The LoongArch instruction system defines a constant frequency timer, whose main body is a 64-bit counter called StableCounter. StableCounter is set to 0 after reset and then increments by 1 every counting clock cycle. When the count reaches all 1s, it automatically wraps around to 0 and continues to increment. Each timer also has a software-configurable, globally unique number called the Counter ID. The defining characteristic of the constant frequency timer is that its timing frequency remains unchanged after reset, no matter how the clock frequency of the processor core changes.

The RDTIME{L/H}.W and RDTIME.D instructions are used to read constant frequency timer information: the StableCounter value is written into the general register rd, and the Counter ID is written into the general register rj. The three instructions differ in which part of the StableCounter they read. RDTIMEL.W reads bits [31:0] of the counter, RDTIMEH.W reads bits [63:32] of the counter, and RDTIME.D reads the entire 64-bit counter value. On a 64-bit processor, the 32-bit value read by the RDTIME{L/H}.W instructions is sign-extended and written to the general register rd. The RDTIME{L/H}.W instructions are defined so that the 64-bit counter can also be accessed on a 32-bit processor.

2.2.10.5. CPUCFG

Instruction formats:

The CPUCFG instruction is used by software to identify dynamically, at run time, which LoongArch features are implemented in the running processor. The implementation status of these features is recorded in a series of configuration information words. Each execution of the CPUCFG instruction reads one configuration information word.

When using the CPUCFG instruction, the source operand register rj holds the number of the configuration information word to be accessed, and the configuration information word that is read is written into the general register rd after the instruction executes. In LA64, each configuration information word is 32 bits and is sign-extended before being written into the result register.

The configuration information word contains a series of configuration bits (fields), recorded in the form CPUCFG.<configuration word number>.<configuration information mnemonic name>[bit subscript]. A single-bit configuration bit is marked as bitXX, which means bit XX of the configuration word; a multi-bit configuration field is marked as bitXX:YY, which means the (XX-YY+1) consecutive bits from bit XX to bit YY of the configuration word. For example, bit 0 in configuration word No.1 is used to indicate whether LA32 is implemented. This configuration information is recorded as CPUCFG.1.LA32[bit0], where 1 indicates that the number of the configuration information word is 1, LA32 is the mnemonic name of the configuration information field, and bit0 means that the LA32 field is located at bit 0 of the configuration word. The PALEN field, which gives the number of supported physical address bits and occupies bits 11 to 4 of configuration word No.1, is recorded as CPUCFG.1.PALEN[bit11:4].
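The bitXX and bitXX:YY notation maps directly onto a shift and a mask. The helper below is an illustrative sketch with a hypothetical name, and the sample word is a made-up value rather than the output of any real processor.

```python
def cpucfg_field(word, high, low=None):
    """Extract bit[high] or bit[high:low] from a 32-bit configuration word."""
    if low is None:
        low = high                      # single-bit field, e.g. [bit0]
    width = high - low + 1              # (XX - YY + 1) consecutive bits
    return (word >> low) & ((1 << width) - 1)


# Hypothetical contents of configuration word No.1.
sample_word1 = 0x0002F2F6

palen_field = cpucfg_field(sample_word1, 11, 4)   # CPUCFG.1.PALEN[bit11:4]
bit2 = cpucfg_field(sample_word1, 2)              # a single-bit field
```

Since Table 3 defines PALEN as the number of supported physical address bits minus 1, the sample value would correspond to `palen_field + 1` physical address bits.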

The configuration information accessible by the CPUCFG instruction in LoongArch is listed in the table. A CPUCFG access to an undefined configuration word reads back all 0s. An undefined field in a defined configuration word may read back any value when CPUCFG is executed, and software should not interpret it in any way.

Table 3. The configuration information accessible by the CPUCFG instruction

Word number	Bit number	Annotation	Implication

0x0	31:0	PRID	Processor Identity
0x1	1:0	ARCH	2’b00 indicates the implementation of simplified LA32; 2’b01 indicates the implementation of LA32; 2’b10 indicates the implementation of LA64; 2’b11 is reserved.
	2	PGMMU	1 indicates that the MMU supports page mapping mode
	3	IOCSR	1 indicates support for the IOCSR instruction
	11:4	PALEN	Number of supported physical address bits (PALEN) minus 1
	19:12	VALEN	Number of supported virtual address bits (VALEN) minus 1
	20	UAL	1 indicates support for non-aligned memory access
	21	RI	1 indicates support for page attribute of “Read Inhibit”
	22	EP	1 indicates support for page attribute of “Execution Protection”
	23	RPLV	1 indicates support for page attributes of RPLV
	24	HP	1 indicates support for page attributes of huge page
	25	CRC	1 indicates support for the CRC instructions
	26	MSG_INT	1 indicates that the external interrupt uses the message interrupt mode, otherwise it is the level interrupt line mode
0x2	0	FP	1 indicates support for basic floating-point instructions
	1	FP_SP	1 indicates support for single-precision floating-point numbers
	2	FP_DP	1 indicates support for double-precision floating-point numbers
	5:3	FP_ver	The version number of the floating-point arithmetic standard. 1 is the initial version number, indicating that it is compatible with the IEEE 754-2008 standard
	6	LSX	1 indicates support for 128-bit vector extension
	7	LASX	1 indicates support for 256-bit vector expansion
	8	COMPLEX	1 indicates support for complex vector operation instructions
	9	CRYPTO	1 indicates support for encryption and decryption vector instructions
	10	LVZ	1 indicates support for virtualization expansion
	13:11	LVZ_ver	The version number of the virtualization hardware acceleration specification. 1 is the initial version number
	14	LLFTP	1 indicates support for constant frequency counter and timer
	17:15	LLFTP_ver	Constant frequency counter and timer version number. 1 is the initial version
	18	LBT_X86	1 indicates support for X86 binary translation extension
	19	LBT_ARM	1 indicates support for ARM binary translation extension
	20	LBT_MIPS	1 indicates support for MIPS binary translation extension
	21	LSPW	1 indicates support for the software page table walking instruction
	22	LAM	1 indicates support for the AM* atomic memory access instructions
	24	HPTW	1 indicates support for the Page Table Walker
	25	FRECIPE	1 indicates support for FRECIPE.{S/D} and FRSQRTE.{S/D}. If the 128-bit vector extension is also supported, VFRECIPE.{S/D} and VFRSQRTE.{S/D} are supported. If the 256-bit vector extension is also supported, XVFRECIPE.{S/D} and XVFRSQRTE.{S/D} are supported.
	26	DIV32	1 indicates that DIV.W[U] and MOD.W[U] instructions on 64-bit machines compute only the low 32-bit data of the input register
	27	LAM_BH	1 indicates support for AM{SWAP/ADD}[_DB].{B/H}.
	28	LAMCAS	1 indicates support for AMCAS[_DB].{B/H/W/D}.
	29	LLACQ_SCREL	1 indicates support for LL.ACQ.{W/D} and SC.REL.{W/D}.
	30	SCQ	1 indicates support for SC.Q.
0x3	0	CCDMA	1 indicates support for hardware Cache coherent DMA
	1	SFB	1 indicates support for Store Fill Buffer (SFB)
	2	UCACC	1 indicates support for ucacc win
	3	LLEXC	1 indicates support for the LL instruction fetching an exclusive block
	4	SCDLY	1 indicates support for a random delay after SC
	5	LLDBAR	1 indicates support for LL with an automatic DBAR function
	6	ITLBTHMC	1 indicates that the hardware maintains the consistency between ITLB and TLB
	7	ICHMC	1 indicates that the hardware maintains the data consistency between ICache and DCache in one processor core
	10:8	SPW_LVL	The maximum number of directory levels supported by the page walk instruction
	11	SPW_HP_HF	1 indicates that the page walk instruction fills the TLB in half when it encounters a large page
	12	RVA	1 indicates that the software configuration can be used to shorten the virtual address range
	16:13	RVAMAX-1	The maximum configurable number of bits by which the virtual address can be shortened, minus 1
	17	DBAR_hints	1 indicates that the non-zero hint values of DBAR are implemented according to the meanings recommended in the manual.
	23	LD_SEQ_SA	1 indicates that the hardware is enabled to guarantee sequential execution of load operations at the same address.
0x4	31:0	CC_FREQ	Crystal frequency corresponding to the clock used by the constant frequency counter and timer
0x5	15:0	CC_MUL	Multiplication factor corresponding to the clock used by the constant frequency counter and timer
0x5	31:16	CC_DIV	Division coefficient corresponding to the clock used by the constant frequency counter and timer
0x6	0	PMP	1 indicates support for the performance counter
	3:1	PMVER	Version number of the architecturally defined events in the performance monitor. 1 is the initial version
	7:4	PMNUM	Number of performance monitors minus 1
	13:8	PMBITS	Number of bits of a performance monitor minus 1
	14	UPM	1 indicates support for reading performance counter in user mode
0x10	0	L1 IU_Present	1 indicates that there is a first-level instruction Cache or a first-level unified Cache
	1	L1 IU Unify	1 indicates that the Cache shown by L1 IU_Present is the unified Cache
	2	L1 D Present	1 indicates there is a first-level data Cache
	3	L2 IU Present	1 indicates there is a second-level instruction Cache or a second-level unified Cache
	4	L2 IU Unify	1 indicates that the Cache shown by L2 IU_Present is the unified Cache
	5	L2 IU Private	1 indicates that the Cache shown by L2 IU_Present is private to each core
	6	L2 IU Inclusive	1 indicates that the Cache shown by L2 IU_Present has an inclusive relationship to the lower levels (L1)
	7	L2 D Present	1 indicates there is a second-level data Cache
	8	L2 D Private	1 indicates that the second-level data Cache is private to each core
	9	L2 D Inclusive	1 indicates that the second-level data Cache has an inclusive relationship to the lower level (L1)
	10	L3 IU Present	1 indicates there is a third-level instruction Cache or a third-level unified Cache
	11	L3 IU Unify	1 indicates that the Cache shown by L3 IU_Present is the unified Cache
	12	L3 IU Private	1 indicates that the Cache shown by L3 IU_Present is private to each core
	13	L3 IU Inclusive	1 indicates that the Cache shown by L3 IU_Present has an inclusive relationship to the lower levels (L1 and L2)
	14	L3 D Present	1 indicates there is a third-level data Cache
	15	L3 D Private	1 indicates that the third-level data Cache is private to each core
	16	L3 D Inclusive	1 indicates that the third-level data Cache has an inclusive relationship to the lower levels (L1 and L2)
0x11	15:0	Way-1	Number of ways minus 1 (Cache corresponding to L1 IU_Present in configuration word 0x10)
	23:16	Index-log2	log2(number of Cache lines per way) (Cache corresponding to L1 IU_Present in configuration word 0x10)
	30:24	Linesize-log2	log2(Cache line size in bytes) (Cache corresponding to L1 IU_Present in configuration word 0x10)
0x12	15:0	Way-1	Number of ways minus 1 (Cache corresponding to L1 D Present in configuration word 0x10)
	23:16	Index-log2	log2(number of Cache lines per way) (Cache corresponding to L1 D Present in configuration word 0x10)
	30:24	Linesize-log2	log2(Cache line size in bytes) (Cache corresponding to L1 D Present in configuration word 0x10)
0x13	15:0	Way-1	Number of ways minus 1 (Cache corresponding to L2 IU Present in configuration word 0x10)
	23:16	Index-log2	log2(number of Cache lines per way) (Cache corresponding to L2 IU Present in configuration word 0x10)
	30:24	Linesize-log2	log2(Cache line size in bytes) (Cache corresponding to L2 IU Present in configuration word 0x10)
0x14	15:0	Way-1	Number of ways minus 1 (Cache corresponding to L3 IU Present in configuration word 0x10)
	23:16	Index-log2	log2(number of Cache lines per way) (Cache corresponding to L3 IU Present in configuration word 0x10)
	30:24	Linesize-log2	log2(Cache line size in bytes) (Cache corresponding to L3 IU Present in configuration word 0x10)
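
The three fields of configuration words 0x11 to 0x14 are enough to recover the geometry of a Cache. The following is a minimal sketch, not part of the manual, that decodes one such word after it has already been read. The names `CacheGeometry` and `decode_cache_geometry` and the sample value are hypothetical, and the total-size calculation assumes the usual ways x lines-per-way x line-size organization.

```python
from typing import NamedTuple


class CacheGeometry(NamedTuple):
    """Decoded view of one cache configuration word (hypothetical helper type)."""
    ways: int
    sets_per_way: int
    line_bytes: int

    @property
    def total_bytes(self) -> int:
        # Assumes capacity = ways * lines per way * bytes per line.
        return self.ways * self.sets_per_way * self.line_bytes


def decode_cache_geometry(word: int) -> CacheGeometry:
    """Split a configuration word (0x11..0x14) into its three fields."""
    ways = (word & 0xFFFF) + 1            # bits 15:0 hold Way-1
    index_log2 = (word >> 16) & 0xFF      # bits 23:16 hold Index-log2
    linesize_log2 = (word >> 24) & 0x7F   # bits 30:24 hold Linesize-log2
    return CacheGeometry(ways, 1 << index_log2, 1 << linesize_log2)


# Made-up sample: Way-1 = 3, Index-log2 = 8, Linesize-log2 = 6.
sample_word = 0x06080003
geometry = decode_cache_geometry(sample_word)
print(geometry, geometry.total_bytes)
```

A word should only be decoded when the matching Present bit in configuration word 0x10 is 1; the sketch leaves that check to the caller.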

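
Configuration words 0x4 and 0x5 together describe the clock of the constant frequency counter and timer. The sketch below, not part of the manual, splits word 0x5 into CC_MUL and CC_DIV and combines them with CC_FREQ. The formula CC_FREQ * CC_MUL / CC_DIV is an assumption inferred from the field descriptions (a crystal frequency, a multiplication factor, and a division coefficient); the function name and the sample values are hypothetical.

```python
from typing import Optional


def constant_counter_hz(word4: int, word5: int) -> Optional[int]:
    """Estimate the counter frequency in Hz from configuration words 0x4 and 0x5.

    Assumes frequency = CC_FREQ * CC_MUL / CC_DIV. Returns None when any
    field is zero, since the result would be meaningless.
    """
    cc_freq = word4 & 0xFFFFFFFF       # word 0x4, bits 31:0
    cc_mul = word5 & 0xFFFF            # word 0x5, bits 15:0
    cc_div = (word5 >> 16) & 0xFFFF    # word 0x5, bits 31:16
    if cc_freq == 0 or cc_mul == 0 or cc_div == 0:
        return None
    return cc_freq * cc_mul // cc_div


# Made-up sample: a 25 MHz crystal, multiplied by 4 and divided by 1.
print(constant_counter_hz(25_000_000, (1 << 16) | 4))
```

Integer division is used so the result stays an exact whole number of hertz; a caller that needs sub-hertz precision could return a fraction instead.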