Overview of Cortex-A9

Overview of the Cortex-A9 processor
Overview of the Cortex-A9 MPCore
Features of the Cortex-A9
Overview of the Cortex-A9 processor control method
Summary

Overview of the Cortex-A9 processor

The Cortex-A9 processor implements the Armv7-A architecture and can be configured as a multi-core system with one to four cores.

An optional VFPv3-D16 floating-point unit or NEON media processing engine and the TrustZone security extensions can be included to increase system performance and reliability.
The program flow trace interface (PTM: Program Trace Macrocell) and the performance monitoring unit of the Cortex-A9 can be used to verify program behavior.

The Cortex-A9 has the following features:

Variable-length, multiple-issue pipeline
Register renaming
Speculative data prefetching
Branch prediction and return stack
Reduced power consumption through fast loop mode
Increased security through TrustZone extensions
On-chip L1 data and instruction caches

Overview of the Cortex-A9 MPCore

The Cortex-A9 MPCore supports multi-core configurations of up to four processors, operating in either asymmetric multiprocessing (AMP) or symmetric multiprocessing (SMP) mode.

The Cortex-A9 MPCore has the following features:

Snoop Control Unit

Manages coherency of the L1 data caches between processors during SMP operation, using a modified version of the MESI protocol (*1).

Generic Interrupt Controller

Routes and prioritizes inter-processor communication and system interrupts.

Private Timers and Watchdog Timers

Each processor has a private timer and a watchdog timer.

Global Timer

There is a single 64-bit count-up timer that all CPUs in the cluster can access.

(*1) MESI protocol: a cache coherency protocol that keeps memory and caches consistent in multiprocessor systems and is widely used with write-back caches. The name comes from its four cache line states: Modified, Exclusive, Shared, and Invalid.
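As a rough illustration of the footnote above, the textbook MESI transitions for a single cache line can be sketched as a small state function. This is the generic protocol only; the Cortex-A9 SCU uses a modified variant whose details are not modeled here, and the type and function names (`mesi_state`, `mesi_event`, `mesi_next`) are illustrative assumptions.

```c
#include <assert.h>

/* Textbook MESI states for one cache line. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* Events seen by one cache for that line (illustrative names). */
typedef enum {
    LOCAL_READ,   /* this core reads the line         */
    LOCAL_WRITE,  /* this core writes the line        */
    SNOOP_READ,   /* another core reads the same line */
    SNOOP_WRITE   /* another core writes the same line */
} mesi_event;

/* Returns the next state. other_has_copy only matters on a read miss. */
mesi_state mesi_next(mesi_state s, mesi_event e, int other_has_copy)
{
    assert(s <= MODIFIED);
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)
            return other_has_copy ? SHARED : EXCLUSIVE;
        return s;                  /* a read hit keeps the state */
    case LOCAL_WRITE:
        return MODIFIED;           /* other copies get invalidated */
    case SNOOP_READ:
        /* a Modified line is written back before it becomes Shared */
        return (s == INVALID) ? INVALID : SHARED;
    case SNOOP_WRITE:
        return INVALID;
    }
    return s;
}
```

For example, a read miss with no other holder moves the line from Invalid to Exclusive, and a later write by the same core moves it to Modified without any further bus traffic.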

Features of the Cortex-A9

This section describes the unique features of the Cortex-A9 processor.

Register renaming function

Programs run faster when they access memory as little as possible and operate on data held in registers. However, not all processing can be done in registers alone.

When developing a program, function calls follow the AAPCS (Procedure Call Standard for the Arm Architecture), so not all registers are freely available. Increasing the number of registers in a processor architecture means changing the instruction set architecture, which cannot be done easily.
In the “MOV{S}<c>.W <Rd>, #<const>” instruction, for example, the “Rd” field specifies the register number in 4 bits. Because all the other bits of the encoding are already in use, the “Rd” field cannot easily be widened.
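To make the 4-bit limit concrete, here is a minimal sketch that extracts the Rd field from the second halfword of this Thumb-2 encoding. The bit layout in the comment reflects my reading of the Armv7-A Architecture Reference Manual and should be checked against it; the function name `mov_w_rd` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Second halfword of "MOV{S}<c>.W <Rd>, #<const>" (assumed layout):
   bit 15 = 0, bits 14:12 = imm3, bits 11:8 = Rd, bits 7:0 = imm8. */
unsigned mov_w_rd(uint16_t second_halfword)
{
    unsigned rd = (second_halfword >> 8) & 0xFu;  /* 4-bit register field */
    assert(rd <= 15u);                            /* only r0-r15 can be named */
    return rd;
}
```

With only four bits, the field can name sixteen registers at most, which is why more registers cannot be exposed to software without a new encoding.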

As a solution to this problem, the Cortex-A9 has a “register renaming” function.

Architecture Registers and Physical Registers

The Cortex-A9 processor has two classes of registers.

Architectural registers used by software (r0-r15, cpsr).
Physical registers (p0-p55) and virtual flag registers (flg0-flg7) that are physically implemented in the processor. The virtual flag registers (flg0-flg7) hold copies of the cpsr NZCV flags and the GE and Q bits.

Coding and debugging use the architectural registers; the physical registers cannot be accessed directly.

Register renaming operation example

This section describes how register renaming works when an STR (*2) instruction and an LDR (*3) instruction are executed in succession. With architectural registers alone, the store data and the load data for different addresses share the same register (r0), so the LDR instruction cannot be executed until the STR instruction has completed.

The Cortex-A9 processor automatically assigns physical registers, renaming r0 to p0 and r1 to p1 for the STR instruction, and r0 to p2 and r2 to p3 for the LDR instruction. Because the two instructions no longer share a register, they can be executed simultaneously.
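The renaming described above can be modeled with a small rename table. This is a simplified sketch, not the actual Cortex-A9 hardware: physical registers are handed out in order and never recycled, and a source register receives a mapping on first use. All names (`rename_table`, `rename_src`, `rename_dst`) are hypothetical.

```c
#include <assert.h>

#define NUM_ARCH 16   /* r0-r15 */
#define NUM_PHYS 56   /* p0-p55, as listed in the article */

typedef struct {
    int map[NUM_ARCH]; /* architectural -> physical, -1 = not yet mapped */
    int next_free;     /* next unused physical register (no recycling)   */
} rename_table;

void rename_init(rename_table *t)
{
    for (int i = 0; i < NUM_ARCH; i++)
        t->map[i] = -1;
    t->next_free = 0;
}

static int alloc_phys(rename_table *t)
{
    assert(t->next_free < NUM_PHYS);
    return t->next_free++;
}

/* A source operand reads the current mapping (allocated on first use). */
int rename_src(rename_table *t, int arch)
{
    assert(arch >= 0 && arch < NUM_ARCH);
    if (t->map[arch] < 0)
        t->map[arch] = alloc_phys(t);
    return t->map[arch];
}

/* A destination operand always gets a fresh physical register, so a
   later write to r0 no longer conflicts with an earlier read of r0. */
int rename_dst(rename_table *t, int arch)
{
    assert(arch >= 0 && arch < NUM_ARCH);
    t->map[arch] = alloc_phys(t);
    return t->map[arch];
}
```

Calling `rename_src` for r0 and r1 (the STR operands), then `rename_dst` for r0 and `rename_src` for r2 (the LDR operands), yields p0, p1, p2, and p3 in that order, which matches the assignment in the example.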

Example of virtual flag operation

The CMP instruction (*4) updates the NZCV flags in the cpsr register with its result.

Considering only the architectural registers, two CMP instructions that use different registers are independent of each other. However, because there is only one cpsr register, the two instructions cannot be executed simultaneously. Assigning a different virtual flag register (for example, flg0 and flg7) to each CMP instruction makes it possible to execute the two instructions simultaneously.

(*2) The “STR r0,[r1]” instruction writes word data (32 bits) from the r0 register to the address held in the r1 register.
(*3) The “LDR r0,[r2]” instruction reads word data (32 bits) from the address held in the r2 register into the r0 register.
(*4) “CMP r0, r2” computes r0 - r2 and updates the NZCV flags of the cpsr.

Fast loop mode operation

When the processor executes a loop smaller than 64 bytes, fast loop mode is activated automatically in the instruction prefetch stage. Power consumption is reduced because the prefetch unit stops accessing the instruction cache.

Branch predictions and return stacks

The Cortex-A9 has a branch prediction feature in the instruction-side L1 memory system. Branch prediction consists of the following three mechanisms:

The branch target address cache caches the target addresses of branch instructions. It holds 512 entries, organized as 2 ways × 256 entries.
An eight-entry return stack predicts return addresses as specific instructions are fetched (BL and both the immediate and register forms of BLX are recognized). The value of the link register is pushed onto the return stack.
The global history buffer holds 4096 branch-state entries, each in one of four states (Strongly not taken, Weakly not taken, Weakly taken, and Strongly taken).

Name	Meaning
Strongly not taken	Predicted not taken (high confidence)
Weakly not taken	Predicted not taken (low confidence)
Weakly taken	Predicted taken (low confidence)
Strongly taken	Predicted taken (high confidence)
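The four states in the table are commonly implemented as a 2-bit saturating counter. The article does not describe how the Cortex-A9 updates its entries, so the update rule below is the usual textbook scheme and should be read as an assumption; the names `branch_state`, `predict_taken`, and `update_state` are illustrative.

```c
#include <assert.h>

typedef enum {
    STRONGLY_NOT_TAKEN = 0,
    WEAKLY_NOT_TAKEN   = 1,
    WEAKLY_TAKEN       = 2,
    STRONGLY_TAKEN     = 3
} branch_state;

/* Predict "taken" for the two upper states. */
int predict_taken(branch_state s)
{
    return s >= WEAKLY_TAKEN;
}

/* Move one step toward the observed outcome, saturating at both ends. */
branch_state update_state(branch_state s, int taken)
{
    assert(s <= STRONGLY_TAKEN);
    if (taken)
        return (s == STRONGLY_TAKEN) ? STRONGLY_TAKEN : (branch_state)(s + 1);
    return (s == STRONGLY_NOT_TAKEN) ? STRONGLY_NOT_TAKEN : (branch_state)(s - 1);
}
```

With this scheme, a single mispredicted loop exit moves a Strongly taken entry only to Weakly taken, so the next execution of the loop is still predicted correctly.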

Branch prediction is disabled at reset, so it must be enabled during initialization.
It is enabled by setting the Z bit of the SCTLR (System Control Register) in CP15 (coprocessor 15).

What is branch prediction?

Branch prediction is a function that predicts whether a branch instruction in a program will be taken, so that the instruction pipeline stays full. Consider an iterative process (for example, a for/while loop in C).

In a for statement that loops 100 times, the loop counter is compared and a conditional branch instruction is executed on every iteration (the CMP and BLT instructions in [Sample Program Example] implement the loop count of the for statement).

If the branch is always predicted as taken, 99 predictions succeed and one fails. However, with nested loops or loops that run only a few times, the success rate of such a simple prediction drops, so the branch target address cache and the global history buffer are used to keep the instruction pipeline effective.
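The arithmetic above can be checked with a short simulation. It is a minimal sketch that assumes a fixed "always taken" guess for the backward branch at the bottom of the loop; the function name `correct_predictions` is hypothetical.

```c
#include <assert.h>

/* Counts correct predictions for the backward branch of a loop that
   runs `iterations` times, using a fixed "always taken" guess. */
int correct_predictions(int iterations)
{
    assert(iterations >= 1);
    int correct = 0;
    for (int i = 1; i <= iterations; i++) {
        int taken = (i < iterations); /* branch back except after the last pass */
        if (taken)                    /* the guess is always "taken" */
            correct++;
    }
    return correct;
}
```

For 100 iterations the function returns 99, so one prediction in a hundred fails. For a loop of two iterations it returns 1, which means half of the predictions fail; this is the drop in success rate for short loops mentioned above.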

Overview of the Cortex-A9 processor control method

This section describes the basics of configuring and using the cache and the performance monitoring unit.

What is a coprocessor?

Processor features are configured through coprocessor registers.

The coprocessor is not connected to the bus interface unit. It is configured with dedicated instructions (MRC/MCR). Depending on the processor mode, access may be restricted. Refer to the “Technical Reference Manual (*5)” of your processor for details.

There are 16 coprocessors defined.
CP15 (coprocessor 15) provides system control functions.
CP11 (coprocessor 11) supports double-precision floating-point operations.
CP10 (coprocessor 10) supports single-precision floating-point operations and the control and configuration of both the VFP and Advanced SIMD architecture extensions.

(*5) See the English “Cortex-A9 Technical Reference Manual Revision: r4p1” or the Japanese version, “Cortex-A9 Technical Reference Manual Revision: r2p2”.

[Typical CP15 (coprocessor 15) registers and instructions]
Name	Acronym	R/W	Instruction
System Control Register	SCTLR	R	MRC p15, 0, <Rd>, c1, c0, 0
System Control Register	SCTLR	W	MCR p15, 0, <Rd>, c1, c0, 0
Auxiliary Control Register	ACTLR	R	MRC p15, 0, <Rd>, c1, c0, 1
Auxiliary Control Register	ACTLR	W	MCR p15, 0, <Rd>, c1, c0, 1
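The SCTLR instructions in the table are typically wrapped in small accessor functions and used in a read-modify-write sequence. The sketch below does this for the branch prediction enable (Z) bit, which is bit 11 of the SCTLR in Armv7-A. The inline assembly uses GCC/Clang syntax and is compiled only when the hypothetical macro `BARE_METAL_PRIVILEGED` is defined for a 32-bit Arm target; otherwise a host-side stand-in variable is used so that the logic can be tried off-target. The function names are assumptions, not part of any Arm library.

```c
#include <assert.h>
#include <stdint.h>

#define SCTLR_Z (1u << 11)  /* branch prediction enable */

/* BARE_METAL_PRIVILEGED is a hypothetical macro: define it only when
   building for a 32-bit Arm target that runs in a privileged mode. */
#if defined(__arm__) && defined(BARE_METAL_PRIVILEGED)
static inline uint32_t read_sctlr(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
    return v;
}
static inline void write_sctlr(uint32_t v)
{
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0\n\tisb" : : "r"(v) : "memory");
}
#else
/* Host fallback: a plain variable stands in for the register. */
static uint32_t fake_sctlr;
static inline uint32_t read_sctlr(void) { return fake_sctlr; }
static inline void write_sctlr(uint32_t v) { fake_sctlr = v; }
#endif

/* Read-modify-write: set Z and leave every other bit unchanged. */
void enable_branch_prediction(void)
{
    uint32_t v = read_sctlr();
    write_sctlr(v | SCTLR_Z);
    assert((read_sctlr() & SCTLR_Z) != 0);
}
```

Reading the register first matters because the SCTLR also holds the MMU and cache enable bits, which a plain write of a constant would overwrite.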

How do I use the cache?

To use the cache, you need to understand memory types and the cache settings configured through the MMU (Memory Management Unit).

Overview of Memory Management Unit Operation

Address translation from virtual to physical addresses is performed by the MMU using a translation lookaside buffer (TLB). On a TLB miss, the table walk unit automatically looks up the translation table, and the result is cached in the TLB.

You can set the memory type (Normal, Device, Strongly-ordered), cache attributes, and buffer attributes in the translation table. The translation table consists of a level 1 translation table and a level 2 translation table. The level 1 table maps memory in units of 1 MB (sections) or 16 MB (supersections), while the level 2 table allows finer units of 64 KB (large pages) and 4 KB (small pages).
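As a hedged illustration of the two table levels, the sketch below splits a 32-bit virtual address into table indexes and offsets for a 1 MB section and a 4 KB small page. It assumes the Armv7-A short-descriptor format with the whole address space translated through a single level 1 table; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index into the level 1 table: one entry per 1 MB of virtual space. */
uint32_t l1_index(uint32_t va) { return va >> 20; }          /* 0..4095 */

/* Index into a level 2 table of 4 KB small pages: 256 entries per table. */
uint32_t l2_index(uint32_t va) { return (va >> 12) & 0xFFu; }

/* Byte offset inside a 1 MB section or a 4 KB small page. */
uint32_t section_offset(uint32_t va) { return va & 0x000FFFFFu; }
uint32_t page_offset(uint32_t va)    { return va & 0x00000FFFu; }

/* Physical address for a VA that is mapped by a 1 MB section. */
uint32_t section_pa(uint32_t section_base, uint32_t va)
{
    assert((section_base & 0x000FFFFFu) == 0);  /* base must be 1 MB aligned */
    return section_base | section_offset(va);
}
```

For the virtual address 0x12345678, the level 1 index is 0x123, the level 2 index is 0x45, and the offset within the 4 KB page is 0x678.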

Memory Type

The Arm architecture defines three memory types. Normal memory is the cacheable and bufferable memory type. How memory types are assigned depends on the processor series:

On the Cortex-A series, they are configured in the Memory Management Unit.
On the Cortex-R series, they are configured in the Memory Protection Unit.
The Cortex-M series has a fixed memory map.

【List of Memory Types】
Memory type	Overview
Normal memory	Used for application code and data areas. Instructions can be fetched only from areas whose memory type is Normal. Caching and buffering are possible.
Device memory	A memory type for peripherals. Access order is guaranteed relative to other device accesses. Not cacheable; buffering is possible.
Strongly-ordered memory	A memory type intended for multiprocessor systems. Caching and buffering are prohibited.
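The memory types in the table are selected through attribute bits in the translation table entries. The sketch below builds a level 1 section entry for each type, assuming the Armv7-A short-descriptor format with TEX remap disabled and TEX left at 000. Access permission and domain fields are left at zero to keep the example short, so a real entry needs more than this. The names `mem_type` and `make_section` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Short-descriptor level 1 section entry fields (TEX remap disabled). */
#define SECTION_TYPE 0x2u        /* bits 1:0 = 0b10 marks a section */
#define SECTION_B    (1u << 2)   /* bufferable */
#define SECTION_C    (1u << 3)   /* cacheable  */

typedef enum { MEM_STRONGLY_ORDERED, MEM_DEVICE, MEM_NORMAL_WB } mem_type;

/* Builds a section descriptor for a 1 MB aligned physical base. */
uint32_t make_section(uint32_t phys_base, mem_type type)
{
    assert((phys_base & 0x000FFFFFu) == 0);   /* must be 1 MB aligned */
    uint32_t attr = 0;
    switch (type) {
    case MEM_STRONGLY_ORDERED: attr = 0;                     break; /* C=0 B=0 */
    case MEM_DEVICE:           attr = SECTION_B;             break; /* C=0 B=1 */
    case MEM_NORMAL_WB:        attr = SECTION_C | SECTION_B; break; /* C=1 B=1 */
    }
    return phys_base | attr | SECTION_TYPE;
}
```

In this encoding the only difference between a peripheral region and cached RAM is the C bit, which is why a wrong table entry can silently make device registers cacheable.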

Configuration of the L1 cache

The instruction and data caches are not invalidated at reset. They must be invalidated manually before being enabled.

Cache lockdown (*6) is not implemented in the Cortex-A9. The data cache is a non-blocking cache (*7) and supports four independent read/write operations.

【Instruction cache configuration】
Item	Settings	Remarks
Size	16 KB / 32 KB / 64 KB	Selected at implementation
Line Size	8 words	-
Configuration	Virtually indexed, physically tagged	-
Parity	Optional	-

【Data cache configuration】
Item	Settings	Remarks
Size	16 KB / 32 KB / 64 KB	Selected at implementation
Line Size	8 words	-
Configuration	Physically indexed, physically tagged	-
Parity	Optional	-
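The sizes and line length in the two tables determine how many sets each cache has. The sketch below computes this as size / (line size × ways). The number of ways is not given in the article; the Cortex-A9 L1 caches are, to my knowledge, 4-way set associative, but treat the value passed for `ways` as an assumption to confirm in the Technical Reference Manual. The function name `cache_sets` is hypothetical.

```c
#include <assert.h>

#define LINE_BYTES 32u   /* 8 words x 4 bytes per line */

/* Number of sets = cache size / (line size x number of ways). */
unsigned cache_sets(unsigned size_bytes, unsigned ways)
{
    assert(ways > 0 && size_bytes % (LINE_BYTES * ways) == 0);
    return size_bytes / (LINE_BYTES * ways);
}
```

A 32 KB cache with 4 ways has 256 sets under these assumptions, so address bits 12:5 would select the set and bits 4:0 the byte within the line.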

(*6) Cache lockdown: Allows you to load critical code and data into the cache so that the cache lines holding them are not reallocated later. This ensures that subsequent accesses to that code and data are always cache hits.
(*7) Non-blocking cache: With a blocking cache, when a cache miss occurs, the cache cannot be accessed while data is being transferred from main memory to the cache, and instruction execution stalls. A non-blocking cache, in contrast, continues to serve accesses from the CPU while the cache miss is outstanding.

Data Prefetching Function

The Cortex-A9 implements an automatic data prefetch mechanism that monitors cache misses occurring in the processor. The unit can monitor and prefetch two independent data streams. To enable it, set the “L1 prefetch enable” bit in the CP15 ACTLR (Auxiliary Control Register).
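Enabling the prefetcher comes down to setting one bit in the value written back to the ACTLR. The sketch below only computes the new register value. The bit position (bit 2) comes from my reading of the Cortex-A9 Technical Reference Manual and should be confirmed for your core revision; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit position of "L1 prefetch enable"; verify in the TRM. */
#define ACTLR_L1_PREFETCH_EN (1u << 2)

/* Returns the ACTLR value with the L1 prefetch bit set or cleared,
   leaving all other bits as they were. */
uint32_t actlr_with_l1_prefetch(uint32_t actlr, int enable)
{
    uint32_t v = enable ? (actlr | ACTLR_L1_PREFETCH_EN)
                        : (actlr & ~ACTLR_L1_PREFETCH_EN);
    assert((v & ~ACTLR_L1_PREFETCH_EN) == (actlr & ~ACTLR_L1_PREFETCH_EN));
    return v;
}
```

On the target, the input would come from the ACTLR read instruction (MRC p15, 0, <Rd>, c1, c0, 1) and the result would be written back with the matching MCR instruction.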

PMU (Performance Monitoring Unit)

The PMU collects information about processor operation without interfering with the execution of the processor core.

The Cortex-A9 provides one cycle counter and six event counters, each of which can be assigned a selectable event. Each counter is 32 bits wide and can generate an interrupt when it overflows.

Cycle Counter

Counts the number of execution cycles.
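Because the counter is 32 bits wide, measurement code has to cope with wraparound. The sketch below shows two common approaches: unsigned subtraction of two readings, and extension to 64 bits with an overflow count. The overflow count is assumed to be maintained by an overflow interrupt handler that is not shown, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two readings of a 32-bit up-counter.
   Unsigned subtraction wraps modulo 2^32, so one overflow between the
   readings is handled; more than one cannot be detected here. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Extends a 32-bit reading to 64 bits using an overflow count that an
   overflow interrupt handler is assumed to maintain. */
uint64_t extended_cycles(uint32_t overflow_count, uint32_t counter)
{
    uint64_t v = ((uint64_t)overflow_count << 32) | counter;
    assert((uint32_t)v == counter);
    return v;
}
```

For example, a start reading of 0xFFFFFFF0 and an end reading of 0x00000010 give 0x20 elapsed cycles even though the counter wrapped in between.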

Event Counter

Each event counter counts the occurrences of a selected event. Measurable items include cache misses, TLB misses, branch prediction performance, pipeline stalls, and memory accesses. The events common to Armv7-A/R cores are listed in the “Arm Architecture Reference Manual Armv7-A and Armv7-R Edition”. For processor-specific events, see the Technical Reference Manual.

【Typical event settings for Cortex-A9】
Event Number	Event	Description	Accuracy
0x50	Coherent linefill miss	Counts the number of coherent linefill requests performed by the Cortex-A9 processor that miss in all other Cortex-A9 processors. This means that the request was sent to external memory.	Accurate
0x51	Coherent linefill hit	Counts the number of coherent linefill requests performed by the Cortex-A9 processor that hit in another Cortex-A9 processor. This means that the linefill data was fetched directly from the corresponding Cortex-A9 cache.	Accurate

Summary

Unlike microcontrollers commonly used in embedded development, the Cortex-A9 processor includes features such as cache, MMU, and branch prediction. As a result, it is important to understand each of these features in order to develop for optimal performance.

Posted by

APS