Overview of the NEON coprocessor
The NEON coprocessor in the Cortex-A series processors performs SIMD operations(*1) and can efficiently handle multimedia processing (video encoding/decoding, image processing, audio processing, etc.) and computations on large amounts of data. The data types it can handle are shown below(*2).
Data type | 8 bits | 16 bits | 32 bits | 64 bits |
---|---|---|---|---|
Unsigned integer | U8 | U16 | U32 | U64 |
Signed integer | S8 | S16 | S32 | S64 |
Unspecified integer | I8 | I16 | I32 | I64 |
Floating point | — | F16 | F32 | — |
Polynomial | P8 | P16 | — | — |
(*1) A SIMD (single instruction, multiple data) operation is a form of parallel processing in which a single instruction operates on multiple data items at once (also known as a packed operation or vector operation).
(*2) Note that some data types cannot be used with certain instructions.
Example of operation using the Arm register
With an Arm register, one instruction performs one operation.
【Example of operation using Arm register】
Example of operation using the NEON register
A NEON register performs multiple operations with one instruction, at the specified data size. For example, when a 64-bit-wide register is selected and 16-bit-wide operations are performed, the operations are carried out on bits 0 to 15, 16 to 31, 32 to 47, and 48 to 63 respectively; each of these units is called a lane.
【Example of an operation using the NEON register】
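As a rough C illustration of the lane concept (a minimal sketch using the NEON built-in functions covered later; the function name and values are only examples), adding two 64-bit vectors of 16-bit data performs four additions, one per lane, in a single operation:

```c
#include <arm_neon.h>

/* Four 16-bit lanes in a 64-bit (d) register are added with one operation. */
int16x4_t add_four_lanes(void)
{
    int16_t a[4] = { 1, 2, 3, 4 };
    int16_t b[4] = { 10, 20, 30, 40 };

    int16x4_t va = vld1_s16(a);   /* load four 16-bit values, one per lane */
    int16x4_t vb = vld1_s16(b);

    return vadd_s16(va, vb);      /* { 11, 22, 33, 44 }: per-lane addition */
}
```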
NEON register set
The NEON registers can be used either as 32 registers of 64 bits (d0 to d31) or as 16 registers of 128 bits (q0 to q15). The two views overlap: changing the contents of the d0 register also changes the lower 64 bits of the q0 register.
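The same d/q pairing is visible from C through the NEON intrinsic types. The following is a minimal sketch (function and variable names are only illustrative) in which two 64-bit float32x2_t values are combined into one 128-bit float32x4_t and the lower half is read back:

```c
#include <arm_neon.h>

/* A q register is built from two d registers; the intrinsics expose this pairing. */
void d_q_view(void)
{
    float32x2_t lo = vdup_n_f32(1.0f);       /* 2 x float, a 64-bit (d) value   */
    float32x2_t hi = vdup_n_f32(2.0f);
    float32x4_t q  = vcombine_f32(lo, hi);   /* 128-bit (q) value from two d's  */
    float32x2_t lo_again = vget_low_f32(q);  /* lower 64 bits of the q value    */
    (void)lo_again;                          /* suppress unused-variable warning */
}
```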
NEON coprocessor initialization process
Because the NEON coprocessor is disabled at reset, access permission and enabling of the NEON coprocessor must be configured during the initialization process; executing a NEON instruction while the NEON coprocessor is disabled raises an "undefined instruction exception".
【Example of a NEON coprocessor initialization program】
```
;====================================================================
; CP10/CP11 access permissions
;====================================================================
MRC   p15, 0, r0, c1, c0, 2    ; Read the Coprocessor Access Control Register (CPACR)
ORR   r0, r0, #(0xF << 20)     ; Enable full access to CP10/CP11
MCR   p15, 0, r0, c1, c0, 2    ; Write the Coprocessor Access Control Register (CPACR)
ISB
;====================================================================
; Enable the VFP/NEON unit
;====================================================================
MOV   r0, #0x40000000
VMSR  FPEXC, r0                ; Set the EN bit in the floating-point exception register (FPEXC)
```
NEON Vector Instruction Set
In the NEON vector instruction set, the assembly mnemonics start with V. In addition to the operation to perform, the instruction specifies the size and data type of the result, and the register width and data type determine the number of lanes processed by the operation. For example, with 8-bit data in a q register (128 bits), there are 16 lanes, so 16 operations are performed simultaneously.
V{<mod>}<op>{<shape>}{<cond>}{.<dt>} <dest>, <src1>, <src2>
Item | Description |
---|---|
Modifier <mod> | Q: Performs a saturating operation. (e.g., VQADD) |
 | H: Halves the result. (e.g., VHADD) |
 | D: Doubles the result. (e.g., VQDMULL) |
 | R: Rounds the result. (e.g., VRHADD) |
Operation <op> | The operation to perform (e.g., ADD, SUB, MUL). |
Shape <shape> | L: The result is twice the bit width of the operands. |
 | W: The result and the first operand are twice the bit width of the second operand. |
 | N: The result is half the bit width of the operands. |
Condition <cond> | Conditional execution (used in the Thumb-2 IT block). |
Data type <dt> | Specifies the data type. Unsigned integer: U8, U16, U32, U64; Signed integer: S8, S16, S32, S64; Unspecified integer: I8, I16, I32, I64; Floating point: F16, F32; Polynomial: P8, P16 |
<dest> | Destination operand |
<src1> | Source operand 1 |
<src2> | Source operand 2 |
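As a concrete illustration of the lane count described above (a minimal sketch using the NEON built-in functions; the function name is only an example), 8-bit data in a 128-bit q register gives 16 lanes, so a single add processes 16 values at once:

```c
#include <arm_neon.h>

/* 8-bit data in a 128-bit q register: 16 lanes, 16 additions per instruction. */
uint8x16_t add_sixteen_lanes(const uint8_t *a, const uint8_t *b)
{
    uint8x16_t va = vld1q_u8(a);   /* load 16 unsigned 8-bit values */
    uint8x16_t vb = vld1q_u8(b);

    return vaddq_u8(va, vb);       /* per-lane 8-bit addition */
}
```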
Data transfer instruction from Arm register to NEON register
Data is transferred from Arm registers to a NEON register: the contents of the Arm r0 register are transferred to the lower 32 bits of the d0 register, and the contents of the Arm r1 register are transferred to the upper 32 bits of the d0 register.
VMOV d0,r0,r1 ; d0 = r1 (upper 32 bits) : r0 (lower 32 bits)
Copy the contents of the Arm register to the NEON register
The value of an Arm register is copied (duplicated) into every 32-bit lane of a NEON register.
VDUP.32 q0,r0 ; Copy the r0 register into each 32-bit lane of the q0 register
Add Instructions
The 16-bit data x 4 add instruction performs an addition in each 16-bit lane.
VADD.I16 d2,d1,d0 ; d2 = d1 + d0 (addition is performed separately in each 16-bit lane)
Load Instructions
To read from memory into NEON registers, set the read start address in an Arm register. Starting from the address (0x80000000) held in the r0 register, data is read into the d0, d1, and d2 registers in 32-bit units. Appending "!" after [r0] updates the r0 register (the read address) after the transfer.
VLD1.32 {d0,d1,d2},[r0] ; Read from the address indicated by the r0 register
Store Instructions
To write NEON registers to memory, set the write start address in an Arm register. The values of the d0, d1, and d2 registers are written in 32-bit units to the address (0x80000000) held in the r0 register. Appending "!" after [r0] updates the r0 register (the write address) after the transfer.
VST1.32 {d0,d1,d2},[r0] ; Write to the address indicated by the r0 register
NEON coprocessor programming
When using the NEON coprocessor, you can choose from three methods: modifying an existing program so that the compiler generates NEON vector instructions (auto-vectorization), using the NEON built-in functions (intrinsics), or writing the NEON vector instructions in assembly language. We recommend auto-vectorization or the NEON built-in functions, because assembly programming requires keeping the Arm processor pipeline in mind and is difficult to code.
Autovectorization of the Arm compiler
The Arm compiler generates NEON vector instructions when the appropriate compiler options are set and the C/C++ source code is modified accordingly. Setting the "--diag_warning=optimizations" option prints diagnostic messages about the optimizations performed.
No | Item | Setting |
---|---|---|
1 | Optimization level | Set "-O2 -Otime" or "-O3 -Otime". |
2 | NEON vector instruction generation | Set "--vectorize". |
3 | Target processor | Specify an Arm processor with a NEON coprocessor, for example "--cpu=Cortex-A9". |
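Putting the three settings together, a compile command might look like the following (the source file name float_add.c is only an example; --diag_warning=optimizations is added to obtain the optimization diagnostics mentioned above):

```
armcc --cpu=Cortex-A9 -O2 -Otime --vectorize --diag_warning=optimizations -c float_add.c
```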
The following source code changes ("Changes in the source code") make loops easier for the compiler to vectorize.

No | Change point |
---|---|
1 | Make it a simple loop with few lines. |
2 | Don't break out of the loop with a break statement. |
3 | Make the loop count a power of two (2^n). |
4 | The loop count must be a fixed value known at compile time. |
5 | It is recommended that the functions called in the loop be inlined. |
6 | Use array indexing [] instead of dereferencing pointers. |
7 | Use the __restrict keyword (*3) to avoid memory space overlap. |
(*3) The __restrict keyword tells the compiler that the memory regions referenced by the qualified pointers (or array-type function parameters) do not overlap.
Automatic Vectorization Example
We will modify the source code to allow automatic vectorization of the float array addition process.
- Set the compiler options to "-O2 -Otime --vectorize --cpu=Cortex-A9".
- Modify the program by referring to the "Changes in the source code" table above.

【Source code before change】

```c
void float_add(float *fres, float *fdata1, float *fdata2)
{
    unsigned long i;

    for (i = 0; i < 128; i++) {
        *fres = (*fdata1) + (*fdata2);
        fres++;
        fdata1++;
        fdata2++;
    }
}
```
【Modified source code】

The following changes enable vector operations (a sketch of the resulting code is shown below):

- Add the __restrict keyword to the argument pointer types.
- Change from pointer access to index access.
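The modified listing itself is not reproduced here, but based on the two changes above it might look roughly like this (a sketch, keeping the original function and variable names):

```c
/* Sketch of the modified source: __restrict on the argument pointer types
   and index access instead of pointer access.                            */
void float_add(float * __restrict fres,
               float * __restrict fdata1,
               float * __restrict fdata2)
{
    unsigned long i;

    for (i = 0; i < 128; i++) {
        fres[i] = fdata1[i] + fdata2[i];   /* index access, no overlapping buffers */
    }
}
```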
Vectorization with built-in functions

If automatic vectorization is not possible, the code can be vectorized with the NEON built-in functions (intrinsics). Here, a program that computes the sum of two float arrays is vectorized with the NEON built-in functions.
【Source code before change】
This program cannot be automatically vectorized: because the res variable accumulates the total across loop iterations, the compiler apparently cannot parallelize the operation. So we vectorize it by hand using the NEON built-in functions.
```c
float calc(float * __restrict fdata1, float * __restrict fdata2)
{
    float res = 0.0;
    unsigned long i;

    for (i = 0; i < 128; i++) {
        res += fdata1[i] + fdata2[i];
    }
    return res;
}
```
【Modified source code (*4)】

Perform the vector operations with the NEON built-in functions (a sketch of the resulting code is shown below):

- Include the arm_neon.h header file.
- Define float32x4_t variables (corresponding to the 128-bit-wide q registers of the NEON coprocessor) to hold the single-precision floating-point (float) data.
- Change the loop count to 128 / 4 = 32, since each operation now processes four values at a time.
- Use the vld1q_f32() built-in function to load four float values from each array into a q register.
- Use the vaddq_f32() built-in function to add four float values at a time.
- Use the vgetq_lane_f32() built-in function to read each lane and compute the total.
- Update the pointers.
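The modified listing itself is not reproduced here, but based on the steps above it might look roughly like the following sketch (variable names are ours, and the vdupq_n_f32() call used to clear the accumulator is an assumption not mentioned in the original list):

```c
#include <arm_neon.h>

float calc(float * __restrict fdata1, float * __restrict fdata2)
{
    float32x4_t vdata1, vdata2, vsum;      /* q registers: 4 x float each      */
    unsigned long i;
    float res;

    vsum = vdupq_n_f32(0.0f);              /* clear the four accumulator lanes */

    for (i = 0; i < 128 / 4; i++) {        /* 32 iterations, 4 floats per pass */
        vdata1 = vld1q_f32(fdata1);        /* load 4 floats from each array    */
        vdata2 = vld1q_f32(fdata2);
        vsum   = vaddq_f32(vsum, vaddq_f32(vdata1, vdata2));  /* per-lane adds */
        fdata1 += 4;                       /* update the pointers              */
        fdata2 += 4;
    }

    /* combine the four per-lane partial sums into the final total */
    res = vgetq_lane_f32(vsum, 0) + vgetq_lane_f32(vsum, 1)
        + vgetq_lane_f32(vsum, 2) + vgetq_lane_f32(vsum, 3);

    return res;
}
```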
(*4) Because the order of the calculations differs from the source code before modification, the result may not be exactly the same due to floating-point rounding.
Summary
Because the Cortex-A series includes a NEON coprocessor, processing can be sped up by turning computations into SIMD operations. We therefore tried tuning an open-source audio codec program using auto-vectorization and the NEON built-in functions. As a result, we learned that improving execution speed is not easy (on the contrary, we ran into performance degradation and other trouble). Here are some points we found that help make the NEON coprocessor more effective.
- It is important to use data structures that let the NEON coprocessor read and write large amounts of data efficiently (this requires an understanding of how the VLD/VST instructions work).
- When many variables are defined and the NEON registers run out, code using the NEON built-in functions saves and restores NEON registers on the stack, which slows down execution.
- Automatic vectorization does not happen unless the source code is modified appropriately; in many cases, existing source code was simply not written with SIMD operations in mind.
- The PMU (see Part 16) should be used to measure the execution time and understand the change in execution time relative to the modifications.
The NEON coprocessor can streamline your arithmetic operations if you use it well, so why not give it a try?
Reference Material
To learn how to use the NEON coprocessor, please refer to the following three manuals:
Cortex-A9 NEON Media Processing Engine Technical Reference Manual (Revision r4p1)
NEON Programmer's Guide (Version 1.0)
Arm NEON Intrinsics Reference (Document number: IHI 0073A)