Overview of the NEON coprocessor
The NEON coprocessor in the Cortex-A series processors performs SIMD operations(*1) and can efficiently handle multimedia processing (video encoding/decoding, image processing, audio processing, etc.) and computations on large amounts of data. The data types it can handle are shown below(*2).
Data type | 8 bits | 16 bits | 32 bits | 64 bits |
---|---|---|---|---|
Unsigned integer | U8 | U16 | U32 | U64 |
Signed integer | S8 | S16 | S32 | S64 |
Unspecified integer | I8 | I16 | I32 | I64 |
Floating point | — | F16 | F32 | — |
Polynomial | P8 | P16 | — | — |
(*1) A SIMD (single instruction, multiple data) operation is a form of parallel processing in which a single instruction operates on multiple data items at once (also known as a packed operation or vector operation).
(*2) Note that some data types cannot be used with certain instructions.
Example of operation using the Arm register
With an Arm register, one instruction performs one operation.
【Example of operation using Arm register】
Example of operation using the NEON register
A NEON register performs multiple operations with one instruction, at the specified data size. For example, when a 64-bit-wide register is selected and 16-bit-wide operations are performed, the operations are carried out on bits 0 to 15, 16 to 31, 32 to 47, and 48 to 63 respectively; each of these units is called a lane.
【Example of an operation using the NEON register】
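As a rough C illustration of the lane concept (a minimal sketch using the NEON built-in functions covered later; the function name and values are only examples), adding two 64-bit vectors of 16-bit data performs four additions, one per lane, in a single operation:

```c
#include <arm_neon.h>

/* Four 16-bit lanes in a 64-bit (d) register are added with one operation. */
int16x4_t add_four_lanes(void)
{
    int16_t a[4] = { 1, 2, 3, 4 };
    int16_t b[4] = { 10, 20, 30, 40 };

    int16x4_t va = vld1_s16(a);   /* load four 16-bit values, one per lane */
    int16x4_t vb = vld1_s16(b);

    return vadd_s16(va, vb);      /* { 11, 22, 33, 44 }: per-lane addition */
}
```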
NEON register set
The NEON registers can be used either as 32 registers of 64 bits (d0 to d31) or as 16 registers of 128 bits (q0 to q15). The two views overlap: changing the contents of the d0 register also changes the lower 64 bits of the q0 register.
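The same d/q pairing is visible from C through the NEON intrinsic types. The following is a minimal sketch (function and variable names are only illustrative) in which two 64-bit float32x2_t values are combined into one 128-bit float32x4_t and the lower half is read back:

```c
#include <arm_neon.h>

/* A q register is built from two d registers; the intrinsics expose this pairing. */
void d_q_view(void)
{
    float32x2_t lo = vdup_n_f32(1.0f);       /* 2 x float, a 64-bit (d) value   */
    float32x2_t hi = vdup_n_f32(2.0f);
    float32x4_t q  = vcombine_f32(lo, hi);   /* 128-bit (q) value from two d's  */
    float32x2_t lo_again = vget_low_f32(q);  /* lower 64 bits of the q value    */
    (void)lo_again;                          /* suppress unused-variable warning */
}
```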
NEON coprocessor initialization process
Because the NEON coprocessor is disabled at reset, access permission and enabling of the NEON coprocessor must be configured during the initialization process; executing a NEON instruction while the NEON coprocessor is disabled raises an "undefined instruction exception".
【Example of a NEON coprocessor initialization program】
```
;====================================================================
; CP10/CP11 access permissions
;====================================================================
MRC   p15, 0, r0, c1, c0, 2    ; Read the Coprocessor Access Control Register (CPACR)
ORR   r0, r0, #(0xF << 20)     ; Enable full access to CP10/CP11
MCR   p15, 0, r0, c1, c0, 2    ; Write the Coprocessor Access Control Register (CPACR)
ISB
;====================================================================
; Enable the VFP/NEON unit
;====================================================================
MOV   r0, #0x40000000
VMSR  FPEXC, r0                ; Set the EN bit in the floating-point exception register (FPEXC)
```
NEON Vector Instruction Set
In the NEON vector instruction set, the assembly mnemonics start with V. In addition to the operation to perform, the instruction specifies the size and data type of the result, and the register width and data type determine the number of lanes processed by the operation. For example, with 8-bit data in a q register (128 bits), there are 16 lanes, so 16 operations are performed simultaneously.
V{<mod>}<op>{<shape>}{<cond>}{.<dt>} <dest>, <src1>, <src2>
Item | Description |
---|---|
Modifier <mod> | Q: Performs a saturating operation. (e.g., VQADD) |
 | H: Halves the result. (e.g., VHADD) |
 | D: Doubles the result. (e.g., VQDMULL) |
 | R: Rounds the result. (e.g., VRHADD) |
Operation <op> | The operation to perform (e.g., ADD, SUB, MUL). |
Shape <shape> | L: The result is twice the bit width of the operands. |
 | W: The result and the first operand are twice the bit width of the second operand. |
 | N: The result is half the bit width of the operands. |
Condition <cond> | Conditional execution (used in the Thumb-2 IT block). |
Data type <dt> | Specifies the data type. Unsigned integer: U8, U16, U32, U64; Signed integer: S8, S16, S32, S64; Unspecified integer: I8, I16, I32, I64; Floating point: F16, F32; Polynomial: P8, P16 |
<dest> | Destination operand |
<src1> | Source operand 1 |
<src2> | Source operand 2 |
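As a concrete illustration of the lane count described above (a minimal sketch using the NEON built-in functions; the function name is only an example), 8-bit data in a 128-bit q register gives 16 lanes, so a single add processes 16 values at once:

```c
#include <arm_neon.h>

/* 8-bit data in a 128-bit q register: 16 lanes, 16 additions per instruction. */
uint8x16_t add_sixteen_lanes(const uint8_t *a, const uint8_t *b)
{
    uint8x16_t va = vld1q_u8(a);   /* load 16 unsigned 8-bit values */
    uint8x16_t vb = vld1q_u8(b);

    return vaddq_u8(va, vb);       /* per-lane 8-bit addition */
}
```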
Data transfer instruction from Arm register to NEON register
Data is transferred from Arm registers to a NEON register: the contents of the Arm r0 register are transferred to the lower 32 bits of the d0 register, and the contents of the Arm r1 register are transferred to the upper 32 bits of the d0 register.
VMOV d0,r0,r1 ; d0 = r1 (upper 32 bits) : r0 (lower 32 bits)
Copy the contents of the Arm register to the NEON register
The value of an Arm register is copied (duplicated) into every 32-bit lane of a NEON register.
VDUP.32 q0,r0 ; Copy the r0 register into each 32-bit lane of the q0 register
Add Instructions
The 16-bit data x 4 add instruction performs an addition in each 16-bit lane.
VADD.I16 d2,d1,d0 ; d2 = d1 + d0 (addition is performed separately in each 16-bit lane)
Load Instructions
To read from memory into NEON registers, set the read start address in an Arm register. Starting from the address (0x80000000) held in the r0 register, data is read into the d0, d1, and d2 registers in 32-bit units. Appending "!" after [r0] updates the r0 register (the read address) after the transfer.
VLD1.32 {d0,d1,d2},[r0] ; Read from the address indicated by the r0 register
Store Instructions
To write NEON registers to memory, set the write start address in an Arm register. The values of the d0, d1, and d2 registers are written in 32-bit units to the address (0x80000000) held in the r0 register. Appending "!" after [r0] updates the r0 register (the write address) after the transfer.
VST1.32 {d0,d1,d2},[r0] ; Write to the address indicated by the r0 register
NEON coprocessor programming
When using the NEON coprocessor, you can choose from three methods: modifying an existing program so that the compiler generates NEON vector instructions (auto-vectorization), using the NEON built-in functions (intrinsics), or writing the NEON vector instructions in assembly language. We recommend auto-vectorization or the NEON built-in functions, because assembly programming requires keeping the Arm processor pipeline in mind and is difficult to code.
Autovectorization of the Arm compiler
The Arm compiler generates NEON vector instructions when the appropriate compiler options are set and the C/C++ source code is modified accordingly. Setting the "--diag_warning=optimizations" option prints diagnostic messages about the optimizations performed.
No | Item | Setting |
---|---|---|
1 | Optimization level | Set "-O2 -Otime" or "-O3 -Otime". |
2 | NEON vector instruction generation | Set "--vectorize". |
3 | Target processor | Specify an Arm processor with a NEON coprocessor, for example "--cpu=Cortex-A9". |
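Putting the three settings together, a compile command might look like the following (the source file name float_add.c is only an example; --diag_warning=optimizations is added to obtain the optimization diagnostics mentioned above):

```
armcc --cpu=Cortex-A9 -O2 -Otime --vectorize --diag_warning=optimizations -c float_add.c
```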
The following source code changes ("Changes in the source code") make loops easier for the compiler to vectorize.

No | Change point |
---|---|
1 | Make it a simple loop with few lines. |
2 | Don't break out of the loop with a break statement. |
3 | Make the loop count a power of two (2^n). |
4 | The loop count must be a fixed value known at compile time. |
5 | It is recommended that the functions called in the loop be inlined. |
6 | Use array indexing [] instead of dereferencing pointers. |
7 | Use the __restrict keyword (*3) to avoid memory space overlap. |
(*3) The __restrict keyword tells the compiler that the memory regions referenced by the qualified pointers (or array-type function parameters) do not overlap.
Automatic Vectorization Example
We will modify the source code to allow automatic vectorization of the float array addition process.
- Set the compiler options to "-O2 -Otime --vectorize --cpu=Cortex-A9".
- Modify the program by referring to the "Changes in the source code" table above.

【Source code before change】

```c
void float_add(float *fres, float *fdata1, float *fdata2)
{
    unsigned long i;

    for (i = 0; i < 128; i++) {
        *fres = (*fdata1) + (*fdata2);
        fres++;
        fdata1++;
        fdata2++;
    }
}
```
【Modified source code】

The following changes enable vector operations (a sketch of the resulting code is shown below):

- Add the __restrict keyword to the argument pointer types.
- Change from pointer access to index access.
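The modified listing itself is not reproduced here, but based on the two changes above it might look roughly like this (a sketch, keeping the original function and variable names):

```c
/* Sketch of the modified source: __restrict on the argument pointer types
   and index access instead of pointer access.                            */
void float_add(float * __restrict fres,
               float * __restrict fdata1,
               float * __restrict fdata2)
{
    unsigned long i;

    for (i = 0; i < 128; i++) {
        fres[i] = fdata1[i] + fdata2[i];   /* index access, no overlapping buffers */
    }
}
```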
Vectorization with built-in functions

If automatic vectorization is not possible, the code can be vectorized with the NEON built-in functions (intrinsics). Here, a program that computes the sum of two float arrays is vectorized with the NEON built-in functions.
【Source code before change】
This program cannot be automatically vectorized: because the res variable accumulates the total across loop iterations, the compiler apparently cannot parallelize the operation. So we vectorize it by hand using the NEON built-in functions.
```c
float calc(float * __restrict fdata1, float * __restrict fdata2)
{
    float res = 0.0;
    unsigned long i;

    for (i = 0; i < 128; i++) {
        res += fdata1[i] + fdata2[i];
    }
    return res;
}
```
【Modified source code (*4)】

Perform the vector operations with the NEON built-in functions (a sketch of the resulting code is shown below):

- Include the arm_neon.h header file.
- Define float32x4_t variables (corresponding to the 128-bit-wide q registers of the NEON coprocessor) to hold the single-precision floating-point (float) data.
- Change the loop count to 128 / 4 = 32, since each operation now processes four values at a time.
- Use the vld1q_f32() built-in function to load four float values from each array into a q register.
- Use the vaddq_f32() built-in function to add four float values at a time.
- Use the vgetq_lane_f32() built-in function to read each lane and compute the total.
- Update the pointers.
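The modified listing itself is not reproduced here, but based on the steps above it might look roughly like the following sketch (variable names are ours, and the vdupq_n_f32() call used to clear the accumulator is an assumption not mentioned in the original list):

```c
#include <arm_neon.h>

float calc(float * __restrict fdata1, float * __restrict fdata2)
{
    float32x4_t vdata1, vdata2, vsum;      /* q registers: 4 x float each      */
    unsigned long i;
    float res;

    vsum = vdupq_n_f32(0.0f);              /* clear the four accumulator lanes */

    for (i = 0; i < 128 / 4; i++) {        /* 32 iterations, 4 floats per pass */
        vdata1 = vld1q_f32(fdata1);        /* load 4 floats from each array    */
        vdata2 = vld1q_f32(fdata2);
        vsum   = vaddq_f32(vsum, vaddq_f32(vdata1, vdata2));  /* per-lane adds */
        fdata1 += 4;                       /* update the pointers              */
        fdata2 += 4;
    }

    /* combine the four per-lane partial sums into the final total */
    res = vgetq_lane_f32(vsum, 0) + vgetq_lane_f32(vsum, 1)
        + vgetq_lane_f32(vsum, 2) + vgetq_lane_f32(vsum, 3);

    return res;
}
```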
(*4) Because the order of the calculations differs from the source code before modification, the result may not be exactly the same due to floating-point rounding.
Summary
Because the Cortex-A series includes a NEON coprocessor, processing can be sped up by turning computations into SIMD operations. We therefore tried tuning an open-source audio codec program using auto-vectorization and the NEON built-in functions. As a result, we learned that improving execution speed is not easy (on the contrary, we ran into performance degradation and other trouble). Here are some points we found that help make the NEON coprocessor more effective.
- It is important to use data structures that let the NEON coprocessor read and write large amounts of data efficiently (this requires an understanding of how the VLD/VST instructions work).
- When many variables are defined and the NEON registers run out, code using the NEON built-in functions saves and restores NEON registers on the stack, which slows down execution.
- Automatic vectorization does not happen unless the source code is modified appropriately; in many cases, existing source code was simply not written with SIMD operations in mind.
- The PMU (see Part 16) should be used to measure the execution time and understand the change in execution time relative to the modifications.
The NEON coprocessor can streamline your arithmetic operations if you use it well, so why not give it a try?
Reference Material
To learn how to use the NEON coprocessor, please refer to the following three manuals:
Cortex-A9 NEON Media Processing Engine Technical Reference Manual (Revision r4p1)
NEON Programmer's Guide (Version 1.0)
Arm NEON Intrinsics Reference (Document number: IHI 0073A)