Overview of the Cortex-A9 processor
The Cortex-A9 processor implements the Armv7-A architecture and can be configured with one to four cores.
An optional VFPv3-D16 floating-point unit or NEON media processing engine, together with the TrustZone security extensions, can be included to increase system performance and reliability.
Program flow trace (PTM: Program Trace Macrocell) and performance monitoring units on the Cortex-A9 can be used to verify program behavior.
The Cortex-A9 has the following features:
- Variable-length, multi-issue pipeline
- Register renaming
- Speculative data prefetching
- Branch prediction and return stack
- Reduced power consumption through fast loop mode
- Increased security through TrustZone extensions
- On-chip L1 data and instruction caches
Overview of the Cortex-A9 MPCore
The Cortex-A9 MPCore can support multi-core configurations of up to four processors, either asymmetric multiprocessing (AMP) or symmetric multiprocessing (SMP).
The Cortex-A9 MPCore has the following features:
Snoop Control Unit
Manages coherency of the L1 data caches between processors during SMP operation, using a modified version of the MESI protocol (*1).
Generic Interrupt Controller
Routes and prioritizes inter-processor communication and system interrupts.
Private Timers and Watchdog Timers
Each processor has a private timer and a watchdog timer.
Global Timers
There is one 64-bit count-up timer. All CPUs in the cluster can reference it.
(*1) What is the MESI protocol? A cache coherency protocol that keeps memory and caches synchronized in multiprocessor systems and is widely used with write-back caches. The name comes from its four cache line states: Modified, Exclusive, Shared, and Invalid.
Features of the Cortex-A9
This section describes the unique features of the Cortex-A9 processor.
Register renaming function
Programs run faster when they access memory as little as possible and perform their processing between registers. However, not all processing can be done between registers.
When developing a program, register usage across function calls is specified by the AAPCS (Procedure Call Standard for the Arm Architecture), so not all registers are freely available. Increasing the number of registers in a processor architecture means changing the instruction set architecture, which cannot be done easily.
In the “MOV{S}<c>.W <Rd>, #<const>” instruction, for example, the “Rd” field encodes the register number in 4 bits. Because all the other bits of the encoding are already used, the “Rd” field cannot easily be widened.
As a solution to this problem, the Cortex-A9 has a “register renaming” function.
Architecture Registers and Physical Registers
The Cortex-A9 processor has two classes of registers.
- Architectural registers used by software (r0-r15, cpsr).
- Physical registers (p0-p55) and virtual flags (flg0-flg7) that are physically implemented in the processor. The virtual flags (flg0-flg7) hold copies of the cpsr N, Z, C, V, GE, and Q bits.
Coding and debugging use the architectural registers; the physical registers cannot be used directly by software.
Register renaming operation example
This section describes how register renaming works when an STR instruction (*2) is followed immediately by an LDR instruction (*3). At the architectural level, both instructions use the same register (r0) for the store data and the load data, even though they access different addresses, so the LDR instruction cannot be executed until the STR instruction has completed.
The Cortex-A9 processor automatically maps the architectural registers to physical registers: for the STR instruction r0=p0 and r1=p1, and for the LDR instruction r0=p2 and r2=p3. Because the two instructions no longer share a register, the STR and LDR instructions can be executed simultaneously.
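The mapping described above can be modeled in a few lines of C. This is only a minimal sketch of the idea, not a description of the actual Cortex-A9 hardware: the `rename_table` type and the `rename_src`/`rename_dst` helpers are hypothetical names, and the free-register handling is deliberately simplified to a running counter.

```c
#include <stdint.h>

#define NUM_ARCH_REGS 16   /* r0-r15, as seen by software */
#define NUM_PHYS_REGS 56   /* p0-p55, as listed in the text */

/* Hypothetical rename state: which physical register currently holds
 * each architectural register, plus a very simple allocator. */
typedef struct {
    int map[NUM_ARCH_REGS];  /* -1 means "not mapped yet" */
    int next_free;           /* next physical register to hand out */
} rename_table;

void rename_init(rename_table *t) {
    for (int i = 0; i < NUM_ARCH_REGS; i++) {
        t->map[i] = -1;
    }
    t->next_free = 0;
}

/* A source operand reuses the current mapping so that it reads the
 * latest value; it is mapped on first use. */
int rename_src(rename_table *t, int arch) {
    if (t->map[arch] < 0) {
        /* Simplification: real hardware tracks a free list instead. */
        t->map[arch] = t->next_free++ % NUM_PHYS_REGS;
    }
    return t->map[arch];
}

/* A destination operand always gets a fresh physical register, which
 * removes the false dependency on earlier users of the same name. */
int rename_dst(rename_table *t, int arch) {
    t->map[arch] = t->next_free++ % NUM_PHYS_REGS;
    return t->map[arch];
}
```

For “STR r0,[r1]” both operands are sources, so calling `rename_src` for r0 and r1 yields p0 and p1. For “LDR r0,[r2]” r0 is a destination and r2 a source, so `rename_dst` for r0 followed by `rename_src` for r2 yields p2 and p3, matching the assignment in the text.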
Example of virtual flag operation
The result of a CMP instruction (*4) updates the NZCV flags in the cpsr register.
Considering only the architectural registers, two CMP instructions that use independent registers could be executed at the same time. However, because there is only one cpsr register, the two instructions cannot be executed simultaneously. Assigning a separate virtual flag to each CMP instruction (for example, flg0 and flg7) makes it possible to execute the two instructions simultaneously.
(*2) The “STR r0,[r1]” instruction writes word data (32 bits) from the r0 register to the address held in the r1 register.
(*3) The “LDR r0,[r2]” instruction reads word data (32 bits) from the address held in the r2 register into the r0 register.
(*4) “CMP r0, r2” computes r0 - r2 and updates the NZCV flags of cpsr.
Fast loop mode operation
When a loop of less than 64 bytes is executed, fast loop mode is automatically activated in the instruction prefetch stage. It reduces power consumption by stopping the prefetch unit from accessing the cache.
Branch predictions and return stacks
The Cortex-A9 has a branch prediction feature in the instruction-side L1 memory system. Branch prediction consists of the following three functions:
- The branch destination address cache holds the destination addresses of branch instructions. It has 512 entries, organized as 2 ways × 256 entries.
- An eight-entry return stack predicts return addresses. When a specific instruction is fetched (BL and BLX instructions are recognized), the value of the link register is pushed onto the return stack.
- The global history buffer tracks 4096 branch states, each in one of four states (Strongly not taken, Weakly not taken, Weakly taken, and Strongly taken).
Name | Meaning |
---|---|
Strongly not taken | Predicted not to branch (high confidence) |
Weakly not taken | Predicted not to branch (low confidence) |
Weakly taken | Predicted to branch (low confidence) |
Strongly taken | Predicted to branch (high confidence) |
Branch prediction is disabled at reset, so it must be enabled during initialization.
Branch prediction is enabled by setting the Z bit (bit 11) of the SCTLR (System Control Register) in CP15 (coprocessor 15).
What is branch prediction?
Branch prediction predicts whether or not a branch instruction in a program will be taken, which keeps the instruction pipeline working efficiently. Let’s consider the case of an iterative process (e.g. a for/while loop in C).
In a for statement that loops 100 times, the loop counter is compared and a conditional branch instruction is executed on every iteration (the CMP and BLT instructions in [Sample Program Example] implement the loop count of the for statement).
If the branch is always predicted as taken, 99 predictions succeed and one prediction fails. However, with nested loops or loops with a small iteration count, the prediction success rate decreases, so the branch destination address cache and the global history buffer are used to keep the instruction pipeline effective.
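The 100-iteration case can be reproduced with a small simulation. The sketch below uses the common textbook 2-bit saturating counter over the four states named in the table; the source does not state the exact transition rule used by the Cortex-A9, so treat the update function as an assumption, and the names `branch_state`, `predict_taken`, `update_state`, and `run_loop` as hypothetical.

```c
#include <stdbool.h>

/* The four prediction states, encoded as a 2-bit counter. */
typedef enum {
    STRONGLY_NOT_TAKEN = 0,
    WEAKLY_NOT_TAKEN   = 1,
    WEAKLY_TAKEN       = 2,
    STRONGLY_TAKEN     = 3
} branch_state;

/* Predict "taken" in the two upper states. */
bool predict_taken(branch_state s) {
    return s >= WEAKLY_TAKEN;
}

/* Move one step toward the observed outcome, saturating at both ends. */
branch_state update_state(branch_state s, bool taken) {
    if (taken) {
        return s < STRONGLY_TAKEN ? (branch_state)(s + 1) : s;
    }
    return s > STRONGLY_NOT_TAKEN ? (branch_state)(s - 1) : s;
}

/* Simulate the loop-closing branch of a loop with the given iteration
 * count: taken on every iteration except the last one.
 * Returns the number of mispredictions. */
int run_loop(branch_state *s, int iterations) {
    int mispredicted = 0;
    for (int i = 0; i < iterations; i++) {
        bool taken = (i < iterations - 1);
        if (predict_taken(*s) != taken) {
            mispredicted++;
        }
        *s = update_state(*s, taken);
    }
    return mispredicted;
}
```

Starting from `WEAKLY_TAKEN`, `run_loop(&s, 100)` reports one misprediction, the 99-to-1 split described above. The single miss at loop exit only moves the state from Strongly taken to Weakly taken, so running the same loop again still costs just one misprediction; this is the reason for keeping two bits of history rather than one.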
Overview of the Cortex-A9 processor control method
This section describes the basics of how to configure and use the cache and the performance monitoring unit.
What is a co-processor?
Processor features are configured through coprocessor registers.
The coprocessor is not connected to the bus interface unit, so it is configured with dedicated instructions (MRC to read, MCR to write). Depending on the processor mode, access may be restricted. Please refer to the “Technical Reference Manual (*5)” of your processor.
- Sixteen coprocessors are defined.
- CP15 (coprocessor 15) provides system control functions.
- CP11 (coprocessor 11) supports double-precision floating-point operations.
- CP10 (coprocessor 10) supports single-precision floating-point operations as well as control and configuration of both the VFP and Advanced SIMD architecture extensions.
(*5) See the English “Cortex-A9 Technical Reference Manual Revision: r4p1” or the Japanese version, “Cortex-A9 Technical Reference Manual Revision: r2p2”.
Name | Acronym | R/W | Instruction |
---|---|---|---|
System Control Register | SCTLR | R | MRC p15, 0, <Rd>, c1, c0, 0 |
System Control Register | SCTLR | W | MCR p15, 0, <Rd>, c1, c0, 0 |
Auxiliary Control Register | ACTLR | R | MRC p15, 0, <Rd>, c1, c0, 1 |
Auxiliary Control Register | ACTLR | W | MCR p15, 0, <Rd>, c1, c0, 1 |
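As an illustration of how these instructions are typically used from C, the sketch below wraps the SCTLR read and write in two helpers and uses them for a read-modify-write that sets the Z bit. The bit positions are those of the Armv7-A SCTLR; the helper names and the `CP15_ON_TARGET` macro are hypothetical. The inline assembly (GCC/Clang syntax) is only compiled when that macro is defined on a 32-bit Arm target running in a privileged mode; otherwise a plain variable stands in for the register so the logic can be tried on a host PC.

```c
#include <stdint.h>

/* SCTLR bit positions (Armv7-A). */
#define SCTLR_M (1u << 0)    /* MMU enable */
#define SCTLR_C (1u << 2)    /* data cache enable */
#define SCTLR_Z (1u << 11)   /* branch prediction enable */
#define SCTLR_I (1u << 12)   /* instruction cache enable */

#if defined(__arm__) && defined(CP15_ON_TARGET)
/* Real access: requires a privileged processor mode. */
static inline uint32_t sctlr_read(void) {
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(value));
    return value;
}
static inline void sctlr_write(uint32_t value) {
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0\n\tisb"
                     : : "r"(value) : "memory");
}
#else
/* Host stand-in (assumption for illustration only): a variable
 * plays the role of the SCTLR register. */
static uint32_t fake_sctlr;
static inline uint32_t sctlr_read(void) { return fake_sctlr; }
static inline void sctlr_write(uint32_t value) { fake_sctlr = value; }
#endif

/* Read-modify-write so that the other control bits are preserved. */
void enable_branch_prediction(void) {
    sctlr_write(sctlr_read() | SCTLR_Z);
}
```

The read-modify-write pattern matters because SCTLR also carries the MMU and cache enable bits; writing a constant would silently clear them.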
How do I use the cache?
To use the caches, you need to understand memory types and how cache attributes are configured through the MMU (Memory Management Unit).
Overview of Memory Management Unit Operation
Address translation from virtual to physical addresses uses the MMU and a translation lookaside buffer (TLB). The table walk unit performs the translation automatically by referring to the translation table, and the result is cached in the TLB.
You can set the memory type (Normal, Device, Strongly-ordered), cache attribute, and buffer attribute in the translation table. The translation table consists of a level 1 translation table and a level 2 translation table. The level 1 translation table lets you choose between 1M-byte and 16M-byte regions, while the level 2 translation table lets you select finer pages of 64K bytes or 4K bytes.
Memory Type
The Arm architecture defines three types of memory. Normal memory is the cacheable and bufferable memory type. How memory types are assigned depends on the processor series:
- On the Cortex-A series, they are configured in the Memory Management Unit.
- On the Cortex-R series, they are configured in the memory protection unit.
- On the Cortex-M series, the memory map is fixed.
Memory type | Overview |
---|---|
Normal memory | Specify the memory type of application code and data areas as Normal. Instructions can be fetched only from areas whose memory type is Normal. Caching and buffering are possible. |
Device memory | This is the memory type for peripherals. Access order is guaranteed relative to other device accesses. Not cacheable; buffering is possible. |
Strongly-ordered memory | Accesses are performed strictly in program order. Caching and buffering are prohibited. |
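To make the translation table settings more concrete, the sketch below builds a level 1 entry for a 1M-byte region with each of the three memory types. It assumes the Armv7-A short-descriptor format with TEX = 0b000, TEX remap disabled, domain 0, and full read/write access permissions; the function and type names are hypothetical, and a real system would choose the domain, access permission, and shareability bits to suit its needs.

```c
#include <stdint.h>

typedef enum {
    MEM_STRONGLY_ORDERED,
    MEM_DEVICE,
    MEM_NORMAL_WB      /* Normal memory, write-back cacheable */
} mem_type;

#define L1_TYPE_SECTION 0x2u         /* bits[1:0] = 0b10: 1M-byte section */
#define L1_B            (1u << 2)    /* bufferable attribute */
#define L1_C            (1u << 3)    /* cacheable attribute */
#define L1_AP_FULL      (3u << 10)   /* AP[1:0] = 0b11: read/write access */

/* The level 1 table has one entry per 1M byte of virtual address
 * space, so the index is simply VA[31:20] (4096 entries in total). */
uint32_t l1_index(uint32_t virt_addr) {
    return virt_addr >> 20;
}

/* Build a section entry mapping a 1M-byte region at phys_addr. */
uint32_t l1_section_entry(uint32_t phys_addr, mem_type type) {
    uint32_t entry = (phys_addr & 0xFFF00000u) | L1_TYPE_SECTION | L1_AP_FULL;

    switch (type) {
    case MEM_NORMAL_WB:        /* C=1, B=1: cacheable and bufferable */
        entry |= L1_C | L1_B;
        break;
    case MEM_DEVICE:           /* C=0, B=1: not cacheable, bufferable */
        entry |= L1_B;
        break;
    case MEM_STRONGLY_ORDERED: /* C=0, B=0: neither */
        break;
    }
    return entry;
}
```

With these assumptions, `l1_section_entry(0x40000000, MEM_NORMAL_WB)` evaluates to 0x40000C0E, and the entry would be stored at index `l1_index(virt_addr)` of the level 1 table.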
Configuration of the L1 cache
The instruction and data caches are not invalidated on reset. Software must invalidate them and then enable them.
Cache lockdown (*6) is not implemented in the Cortex-A9. The data cache is a non-blocking cache (*7) and supports four independent read/write operations.
Instruction cache
Item | Settings | Remarks |
---|---|---|
Size | 16K bytes / 32K bytes / 64K bytes | Selected at implementation |
Line Size | 8 words | - |
Configuration | Virtual index, physical tag | - |
Parity | Optional | - |
Data cache
Item | Settings | Remarks |
---|---|---|
Size | 16K bytes / 32K bytes / 64K bytes | Selected at implementation |
Line Size | 8 words | - |
Configuration | Physical index, physical tag | - |
Parity | Optional | - |
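The relationship between cache size, line size, and the index and tag fields of an address can be worked out with a little arithmetic. The sketch below takes the 8-word (32-byte) line size from the tables and assumes a 4-way set-associative organization, which is not stated in this article and should be checked against the Technical Reference Manual; the function names are hypothetical.

```c
#include <stdint.h>

#define LINE_BYTES 32u   /* 8 words x 4 bytes, from the tables */
#define NUM_WAYS   4u    /* assumption: 4-way set associative */

/* Number of sets = cache size / (line size x number of ways). */
uint32_t cache_sets(uint32_t size_bytes) {
    return size_bytes / (LINE_BYTES * NUM_WAYS);
}

/* Byte offset within the cache line. */
uint32_t line_offset(uint32_t addr) {
    return addr % LINE_BYTES;
}

/* Set selected by the address (the "index"). */
uint32_t set_index(uint32_t addr, uint32_t size_bytes) {
    return (addr / LINE_BYTES) % cache_sets(size_bytes);
}

/* Remaining upper bits, compared against the stored tag. */
uint32_t tag_of(uint32_t addr, uint32_t size_bytes) {
    return addr / (LINE_BYTES * cache_sets(size_bytes));
}
```

Under these assumptions a 32K-byte cache has 256 sets, so address 0x12345678 splits into offset 0x18, set index 0xB3, and tag 0x91A2. In the instruction cache the index is taken from the virtual address and the tag from the physical address, while the data cache uses the physical address for both.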
(*6) Cache Lockdown: Allows you to load critical code and data into the cache so that the cache lines holding them are not reallocated later. This ensures that subsequent accesses to the code and data are always cache hits.
(*7) Non-blocking cache: With a blocking cache, when a cache miss occurs the cache cannot be accessed while data is being transferred from main memory to the cache, and instruction execution stops. In contrast, a non-blocking cache can continue to serve accesses from the CPU while cache misses are outstanding.
Data Prefetching Function
The Cortex-A9 implements an automatic prefetch mechanism that monitors cache misses occurring in the processor. The unit can monitor and prefetch two independent data streams. To use it, enable “L1 prefetch enable” in the CP15 ACTLR (Auxiliary Control Register).
PMU (Performance Monitoring Unit)
The PMU can obtain information on processor operation without interfering with the execution of the processor core.
The Cortex-A9 is equipped with a cycle counter and six event counters, each of which can be assigned a selectable event. Each counter is 32 bits wide and can generate an interrupt when it overflows.
Cycle Counter
Counts the number of execution cycles.
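Because the counters are 32 bits wide, measurement code has to cope with wraparound. The sketch below shows the usual arithmetic for this; it is plain C that does not touch the PMU registers, the function names are hypothetical, and the 800 MHz clock used in the note is only an example value.

```c
#include <stdint.h>

/* Elapsed counts between two samples of a 32-bit up-counter.
 * Unsigned subtraction wraps modulo 2^32, so the result is correct
 * as long as the counter overflowed at most once between samples. */
uint32_t counter_delta(uint32_t start, uint32_t end) {
    return end - start;
}

/* Time in seconds until a 32-bit counter that increments once per
 * clock cycle overflows. */
double seconds_to_overflow(double clock_hz) {
    return 4294967296.0 / clock_hz;   /* 2^32 counts */
}
```

At an example clock of 800 MHz, `seconds_to_overflow(800e6)` is about 5.4 seconds, so longer measurements need either the overflow interrupt or periodic sampling combined with `counter_delta`.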
Event Counter
Each event counter counts the occurrences of a selected event. Measurable events include cache misses, TLB misses, branch prediction performance, pipeline stalls, and memory accesses. Events common to Armv7-A/R cores are described in the “Arm Architecture Reference Manual Armv7-A and Armv7-R Edition”. For processor-specific events, see the Technical Reference Manual.
Event Number | Description | Accuracy |
---|---|---|
0x50 | Coherent linefill miss. Counts the number of coherent linefill requests issued by the Cortex-A9 processor that missed in all other Cortex-A9 processors. This means that the request was sent to external memory. | Precise |
0x51 | Coherent linefill hit. Counts the number of coherent linefill requests issued by the Cortex-A9 processor that hit in another Cortex-A9 processor. This means that the linefill data was fetched directly from the corresponding Cortex-A9 cache. | Precise |
Summary
Unlike microcontrollers commonly used in embedded development, the Cortex-A9 processor includes features such as cache, MMU, and branch prediction. As a result, it is important to understand each of these features in order to develop for optimal performance.