The end of the year is approaching, but I’d like to have one last delta before taking some time off. PowerPC vs. Arm seems like an appropriate stand-off. In this rendition, however, I will compare the e200z0 core and the Cortex-M4 core, which are MCU implementations of their respective ISAs. For the sake of simplicity, each time the word PowerPC is used, I am referring to the e200z0 core; similarly, Arm will stand as shorthand for the Cortex-M4.
Getting the obvious out of the way
PowerPC is sold both as silicon (i.e. MCUs) and as synthesizable IP blocks; Arm only sells IP, but there are a number of companies that sell microcontrollers built around said IP. At the end of the day, the two cores cannot be compared in terms of technology node because their implementation depends on a third party. I will say, however, that PowerPCs are typically used in automotive and industrial applications, which tend to use more robust technology nodes than the consumer applications where Arm is typically found. I suspect, but cannot confirm, that one of the reasons for this is that the Arm core is relatively big (physically), and really benefits from a smaller node. Therefore, it is not strange to find PowerPC devices qualified for -40 to 125 °C ranges, in LQFP packages; Arm devices are normally only qualified for 0 to 85 °C ranges and come in smaller BGA packages.
Architecture
Similarly colored boxes show the equivalent blocks for each architecture. It should be immediately obvious that the Cortex-M4 to the left has a significant number of blocks without equivalent on the e200z0 architecture.
And this is what I’d like to talk about. Power.org has done an excellent job of defining a powerful core, one that is flexible and capable of being hooked up to an almost infinite number of peripherals. And then it stops. Standard peripherals, such as an interrupt controller or a debug trace unit, are not defined in the standard, which means each vendor is free to implement them as they wish. Arm, on the other hand, tightly integrates these “standard” peripherals into the core. Arm wins in this situation because tighter integration of debug peripherals means compatibility with standard tools; tighter integration of the interrupt controller means quicker interrupts (but let’s not get ahead of ourselves). This approach also helps vendors integrating the IP, as they do not have to worry about handling these elements (which are more than likely far away from their target application, or from where they want to add value).
The direct effect of one approach vs. the other is quickly visible when it comes to interrupts. Arm’s Cortex-M4 has a fixed latency of 12 cycles (assuming zero-wait-state memory) from the time the interrupt is flagged to the time the core executes the first instruction of the handler, and the hardware automatically stacks the caller-saved context (R0–R3, R12, LR, PC and xPSR). The e200z0, on the other hand, requires an external controller to flag the request to the core as an external interrupt. Next, code has to be written to ensure that the context registers are correctly saved. Finally, it is also code that jumps to the handler of the pending interrupt and services it. Latency is therefore not guaranteed, and will vary from implementation to implementation.
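The software half of the e200z0 flow can be sketched in C as follows. This is only an illustration that runs on a host: `pending_mask` stands in for the status register of whatever external interrupt controller the vendor provides, and `NUM_SOURCES`, `register_isr`, `dispatch_external_interrupt` and `tick_isr` are hypothetical names, not part of any vendor API. The context save and restore would live in an assembly prologue and epilogue around the call, which is not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SOURCES 8u                 /* hypothetical number of interrupt sources */

typedef void (*isr_t)(void);

static isr_t isr_table[NUM_SOURCES];   /* software vector table */
static volatile uint32_t pending_mask; /* stand-in for the controller's status register */

/* Install a handler for one interrupt source. */
static void register_isr(unsigned source, isr_t handler)
{
    if (source < NUM_SOURCES) {
        isr_table[source] = handler;
    }
}

/* Called from the core's single external-interrupt entry point, after the
 * (assembly) prologue has saved the context registers.
 * Returns the source that was serviced, or -1 if nothing was pending. */
static int dispatch_external_interrupt(void)
{
    for (unsigned src = 0; src < NUM_SOURCES; ++src) {
        if (pending_mask & (1u << src)) {
            pending_mask &= ~(1u << src);  /* acknowledge the request */
            if (isr_table[src] != NULL) {
                isr_table[src]();          /* jump to the handler */
            }
            return (int)src;
        }
    }
    return -1;
}

/* Example handler used to exercise the dispatcher. */
static int tick_count;
static void tick_isr(void)
{
    ++tick_count;
}
```

Every step in `dispatch_external_interrupt` costs cycles that the Cortex-M4 spends in hardware instead, which is why the e200z0 latency depends on how this code is written.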
But that is not to say that the e200z0 is inferior. Let’s take a look at Table 1:
Feature | Cortex-M4 | e200z0 |
Execution | In-order | In-order |
Memory Management/Protection Unit | Y | N |
Instruction Cache | N | N |
Signal processing extension | Y | N |
Pipeline | 3-stage | 4-stage |
Branch unit processor | Not explicit | Y |
Multiply cycles | 1 | 1 |
Integer divide cycles | 2 – 12 cycles | 5 – 34 cycles |
Endianness | Little | Big |
Architecture | Harvard | Harvard |
Interrupt controller | Internal | External |
Jump-to-ISR latency | 12 cycles | Code dependent; several cycles |
Relocatable ISR table | Yes | Yes |
Debug interfaces | JTAG, SWD | JTAG |
Number of core registers | 13 + SP, LR, PC (16 total) | 32 + SP, CR, LR, CTR |
Instruction set supported | Thumb 16-bit instructions | VLE 16-bit instructions |
Table 1.
In fact, when you look at the generalities, the e200z0 and the Cortex-M4 are very similar: Harvard architecture, 32-bit RISC machines with no out-of-order execution and 1-cycle execution times for most instructions. Yes, the Cortex-M4 is about twice as fast as the e200z0 when it comes to division, but the fact that the latter has twice as many core registers means that it can save load/store cycles.
Which brings us to the instruction set architecture.
ISA
In a similar effort, both Arm and Power.org have created extensions to their original ISA with the goal of reformatting instructions into 16-bit words to help with code density. Both communities have later released devices that are only compatible with these extensions, removing all support for the original ISA. This is the case for both the e200z0 and the Cortex-M4 with Variable Length Encoding, and Thumb ISAs, respectively.
Comparing and contrasting both ISAs probably deserves a blog entry by itself, but the gist of it is that both instruction sets have similar encodings. Perhaps worthy of a special mention is Thumb’s shifted-register operand, which allows a core register to be shifted as part of another operation, within the execution cycle of that operation.
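To make the shifted operand concrete, here are two small C functions whose bodies are natural candidates for it. The function names are made up for this example, and the instruction forms in the comments are what a Thumb-2 compiler can emit, not a guarantee: the actual output depends on the toolchain and optimization level.

```c
#include <stdint.h>

/* base + index * 4.
 * On Thumb-2 this can map to a single instruction of the form
 *   ADD Rd, Rn, Rm, LSL #2
 * where the shift is folded into the add. */
static uint32_t add_scaled(uint32_t base, uint32_t index)
{
    return base + (index << 2);
}

/* Indexing an array of 32-bit elements.
 * On Thumb-2 this can map to a load of the form
 *   LDR Rt, [Rn, Rm, LSL #2]
 * where the index is scaled as part of the address calculation. */
static int32_t load_element(const int32_t *table, uint32_t i)
{
    return table[i];
}
```

An ISA without this form needs a separate shift instruction (and a scratch register) before the add or the load.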
Truth be told, both ISAs are so complex that it will be up to the compiler to fully exploit their advantages. Take, for example, the Cortex-M4 DSP extension which adds a DSP-like unit capable of 1-cycle Multiply-and-accumulate operations, among others. When writing code, a simple line such as
y = (m * x + b);
will typically compile to a standard sequence of loads, multiplies, adds, and stores. In order to use the DSP extension, an abstraction layer needs to be downloaded, and function-like calls need to be made (these are really macros or intrinsics that map onto the extension’s instructions).
Which means that code is no longer portable to, say, a PowerPC architecture.
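One common way to limit the damage is to hide the intrinsic behind a small wrapper with a portable fallback. The sketch below assumes the abstraction layer is CMSIS-Core and uses its `__SMLAD` intrinsic (a dual signed 16-bit multiply-accumulate); `HAVE_CMSIS_DSP_INTRINSICS`, `dual_mac` and `dual_mac_portable` are hypothetical names chosen for this example. The fallback does not model the overflow behavior of the real instruction.

```c
#include <stdint.h>

/* Portable reference for a dual 16-bit multiply-accumulate:
 *   acc + lo(x) * lo(y) + hi(x) * hi(y)
 * where lo/hi are the signed 16-bit halves of each 32-bit word.
 * Accumulator overflow is not handled here. */
static int32_t dual_mac_portable(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t xl = (int16_t)(x & 0xFFFFu);
    int16_t xh = (int16_t)(x >> 16);
    int16_t yl = (int16_t)(y & 0xFFFFu);
    int16_t yh = (int16_t)(y >> 16);
    return acc + (int32_t)xl * yl + (int32_t)xh * yh;
}

/* Single entry point for application code. */
static int32_t dual_mac(uint32_t x, uint32_t y, int32_t acc)
{
#if defined(HAVE_CMSIS_DSP_INTRINSICS)
    /* Hypothetical project macro: defined only when building for a
     * Cortex-M4 with the vendor's CMSIS device header already included. */
    return (int32_t)__SMLAD(x, y, (uint32_t)acc);
#else
    /* Any other target, including a PowerPC or a host PC. */
    return dual_mac_portable(x, y, acc);
#endif
}
```

Application code calls `dual_mac` everywhere, so only this one file knows about the extension; the price is that the fallback has to be written and kept in step by hand.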
Toolchain support
This category is tough. Both organizations have done an excellent job of standardizing their architectures, and a plethora of compilers and standard tools is available for both. Since both are also JTAG-compliant, this means that almost anything can be used to develop for them:
- gcc
- CodeWarrior
- Green Hills
- IAR Embedded Workbench (Arm only)
I’d say there’s a tie here. Although there may be specialized tools in each case, debugging is not necessarily harder on one platform than on the other.
Conclusion
If both architectures were to hit the market for the first time today, with the same IP-based distribution model, it would be really hard to predict which would win. The Cortex-M4 is tightly integrated with an interrupt controller and debugging support, while the e200z0 allows vendors a greater amount of customization. The Cortex-M4 allows bit-shifting as part of a register load or store, but the e200z0 doesn’t need to perform loads and stores as often because it has more core registers. The Cortex-M4 is somewhat faster at integer division. Toolchain support is excellent for both architectures. Without mapping these characteristics onto specific products, it’s hard to have a winner!