Demand for performance is intuitive, particularly when developing an embedded application. The challenge most design teams face is knowing how much performance is enough, something that depends intrinsically on two things: the functionality required and how it is implemented. While the implementation is predominantly in software, part of the value of microcontrollers is that their peripherals do much of the heavy lifting in terms of interfaces. However, manipulating the information those interfaces provide puts increasing pressure on the core itself.
Greater performance comes at a cost; but as microcontrollers tend to be manufactured on well-established process nodes, that cost is more often measured in terms of system-level power, rather than component-level dollars and cents.
Choosing the optimal microcontroller for a given application is arguably becoming more difficult as the breadth of choice and depth of functionality continue to expand in line with the demands of end-users.
The evolution of MCU applications
Leading semiconductor vendors have undergone a transition over the last decade or so towards offering comprehensive portfolios of MCUs based on 32-bit processing cores. Although ARM®’s Cortex™-M family is ubiquitous in this space, it is not the only option; several proprietary architectures still exist, perhaps one of the most prominent being the SuperH family from Renesas. It is interesting to note that while Renesas and other companies have also adopted the ARM architecture to produce complementary 32-bit families, they still offer 8- and 16-bit MCUs, for which demand is anticipated to continue for several more years.
Part of that demand comes from specific market segments; many device manufacturers, including those mentioned, have identified motor control as a significant horizontal segment that spans the industrial control and automotive markets. As a result, many of the MCUs still sporting a proprietary core are focused on these markets, while those with the more generic ARM architecture target areas with different fundamental requirements, including the Internet of Things and M2M.
Throughout this transition, those same vendors have faced skepticism about the real-world need for more expensive and complex 32-bit alternatives to the well-established 8- and 16-bit devices. While these criticisms have some foundation, the cost argument has largely been eroded over time by economies of scale and process maturation. The complexity argument, arguably, still stands.
However, that complexity is invariably justified by the rise in complexity of end-applications, and this is predominantly due to communications. Embedded devices have evolved in line with the emergence of greater levels of connectivity, to the point that today it is unusual for any electronic device not to communicate in some way with the outside world. High-level communications, whether wired or wireless, typically involve a standardized protocol, to provide interoperability, and a physical interface. Depending on the topology of the communications interface, the protocol may be relatively complex and therefore require greater processing power to execute.
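To make this concrete, the sketch below shows the kind of per-byte work that even a very simple, hypothetical framing scheme imposes on the core: every received byte has to be read and folded into a checksum before the payload can be trusted. The frame format and function name are invented for illustration; real standardized protocols layer buffering, state handling and error recovery on top of work like this, which is where the processing demand comes from.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame: payload bytes followed by a single checksum byte,
 * where the checksum is the two's complement of the 8-bit sum of the
 * payload. Purely illustrative, not the framing of any real standard. */
bool frame_is_valid(const uint8_t *frame, size_t len)
{
    if (frame == NULL || len < 2U) {
        return false;                    /* need at least one payload byte plus checksum */
    }

    uint8_t sum = 0U;
    for (size_t i = 0U; i < len; i++) {
        sum = (uint8_t)(sum + frame[i]); /* every received byte costs core cycles */
    }

    return sum == 0U;                    /* payload plus checksum should sum to zero */
}
```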
Greater connectivity is not the only evolution embedded devices have seen over the last decade or so, of course. With connectivity comes the potential to play a greater role within a larger system, encouraging developers to squeeze more functionality into ever-smaller form factors, power envelopes and financial budgets. With communications consuming a relatively large proportion of the available processing power, demand has grown for more capable devices, and semiconductor vendors have answered by adopting larger and more complex processor cores.
Measuring performance
Evaluating how much performance is needed for a ‘standard’ function like running a protocol stack is, unfortunately, highly context-dependent: it depends on the core, its sub-system and the compiler. Relatively modest 8-bit devices may be capable of executing the code, but cores whose pipelines accept wider instructions, allow multiple instructions to execute at the same time, or need fewer clock cycles per instruction can execute it significantly faster.
‘Raw’ performance can be assessed by factoring those features out, applying algorithms that make less use of them. This is the goal of industry benchmarks such as CoreMark from EEMBC. The objective of this simple C codebase is to provide a realistic mixture of read/write, integer and control operations that rely more on the fundamental elements of a core and cannot easily be optimized away by the compiler.
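The fragment below is not the EEMBC source, which is published separately, but a rough illustration of the kinds of operation such a benchmark mixes: pointer-chasing through memory, integer arithmetic and data-dependent branching that a compiler cannot simply fold away. The structure and function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Invented for illustration: a blend of memory reads/writes, integer work
 * and branch-heavy control flow, in the spirit of a core-level benchmark.
 * This is not CoreMark itself. */
struct node {
    struct node *next;
    uint16_t     value;
};

uint16_t mixed_workload(struct node *head)
{
    uint16_t acc = 0xFFFFu;

    for (struct node *n = head; n != NULL; n = n->next) {
        acc ^= n->value;                    /* integer work on data loaded from memory */
        for (int bit = 0; bit < 16; bit++) {
            if (acc & 1u) {                 /* the branch depends on the data, so the */
                acc = (acc >> 1) ^ 0xA001u; /* result cannot be precomputed by the    */
            } else {                        /* compiler                               */
                acc >>= 1;
            }
        }
        n->value = acc;                     /* write back to memory */
    }
    return acc;
}
```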
The benchmark gives a single figure of merit, but the result can also be normalized to clock speed (CoreMark/MHz), providing another level of comparison. Filtering the published results to a relatively narrow window (2.51 to 3.00 CoreMark/MHz) shows a range of 32-bit devices, including the RX600 family from Renesas; Microchip’s PIC32 range; STM32 from STMicroelectronics; the Kinetis K range from Freescale; Atmel’s SAM3 and SAM4 families; and Texas Instruments’ OMAP35x.
Figure 1: The RX600 32-bit MCU family from Renesas implements a CMOS camera interface.
EEMBC suggests the CoreMark benchmark is more accurate than others because it is standard across all devices and focuses strongly on the core architecture. For more information about CoreMark, see the article ‘Microcontroller Performance Analysis Techniques’.
The architectures of 8- and 16-bit devices will always struggle to compete with their 32-bit cousins at a core level, fundamentally because of their commensurately lower throughput: arithmetic performed on large numbers will always take longer in an ALU that is only 8 bits wide. However, many manufacturers have gone to great lengths to extend the lifetime of these devices, largely because they can still meet their customers’ requirements without unnecessary complexity or performance.
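The point about ALU width can be illustrated with a single line of C. The same 32-bit addition typically compiles to one add instruction on a 32-bit core, whereas an 8-bit core has to assemble the result a byte at a time; exact instruction counts vary by core and compiler.

```c
#include <stdint.h>

/* The same 32-bit addition in C; its cost depends on the ALU width.
 * On a 32-bit core such as a Cortex-M this typically compiles to a single
 * add instruction, whereas an 8-bit core must build the result with a
 * chain of add/add-with-carry instructions and extra register moves to
 * shuttle four bytes through an 8-bit datapath. */
uint32_t add32(uint32_t a, uint32_t b)
{
    return a + b;
}
```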
Figure 2: The STM32-L family uses a Cortex-M3 to target ultra-low-power applications.
Extending the lifetime of these devices is often achieved through innovative extensions to the core architecture while preserving the instruction set. An example comes from Silicon Labs, with its 8051-based devices such as the C8051F12x. SiLabs has implemented a ‘hard-wired’ 8051 instruction set architecture that increases performance while retaining object code compatibility with the original micro-coded version. A two-stage pipeline increases throughput while preserving the 8-bit program memory width, enabling most instructions to execute in one or two clock cycles. This, says SiLabs, delivers twenty to twenty-five times the performance of the original 8051 core, with peak throughput of up to 100 MIPS.
Similarly, the AT89LP family from Atmel is an 8051-compatible family that the company claims can deliver up to twelve times the performance of ‘vanilla’ 8051 devices. Atmel’s proprietary 8-bit architecture, AVR, which powers devices including the ATmega128, offers true single-clock-cycle execution and delivers 1 MIPS/MHz. According to the EEMBC results, the ATmega1281 running at 2 MHz delivers a CoreMark/MHz figure of 0.18, which rises to 0.44 when the compiler is configured to optimize for code size. This is comparable to some 32-bit devices and illustrates how performance is still application dependent.
Moving up the bus width
Since performance is so heavily dependent on bus architectures, preserving code compatibility while migrating along a performance curve can be challenging at a low level. Re-implementing an instruction set to deliver greater performance, as demonstrated by Silicon Labs and Atmel, can overcome this challenge, but as the demand for performance continues to accelerate, it becomes less likely that legacy instruction sets will be able to meet it.
Investing in a proprietary architecture offers much greater scope for performance scaling while retaining software compatibility; an example of this is the MSP430 family from Texas Instruments, based on a 16-bit RISC core. This family spans nine distinct series, ranging from the low voltage and value series to the FRAM and RF SoC series. This breadth offers various performance points, delivered through the level of integration, peripheral set and clock frequency each device offers.
While low-level code compatibility is largely architecture dependent, high-level compatibility is simpler to achieve, primarily because of the pervasive nature of C and the wide availability of C compilers. The complexity of modern embedded software means that today most MCUs are programmed at a high level in C, rather than in low-level assembly language.
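As an illustration, the hypothetical helper below uses the fixed-width types from stdint.h rather than the native ‘int’, which is 16 bits wide on typical 8- and 16-bit MCUs but 32 bits wide on 32-bit cores. Written this way, the same source behaves identically on an AVR, an MSP430 or a Cortex-M device; the function and its 12-bit ADC assumption are invented for this example.

```c
#include <stdint.h>

/* Hypothetical helper, invented for this example. Fixed-width types keep
 * the arithmetic identical across architectures, so only device-specific
 * peripheral code needs reworking when an application migrates. */
uint16_t scale_adc_reading(uint16_t raw, uint16_t full_scale_mv)
{
    /* Spell out the 32-bit intermediate rather than relying on how wide
     * 'int' happens to be on any one target. */
    uint32_t product = (uint32_t)raw * full_scale_mv;
    return (uint16_t)(product / 4095u);   /* assumes a 12-bit ADC, for illustration */
}
```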
Figure 3: The MSP430 Value Line targets low cost while offering a migration path to higher performance.
While architecture-specific dependencies will remain, this high-level compatibility gives developers much greater freedom to target different members of the same family, or even completely different families. It is, of course, another reason why ARM’s Cortex-M family has been adopted so successfully by a large number of device manufacturers, aided by the software development environments and libraries available from ARM and the manufacturers themselves; the growing ecosystem benefits both vendors and developers.
Figure 4: Atmel’s SAM4L family combines the higher performance of the Cortex-M4 with Atmel’s pico-power technology.
Meeting future demands
Continued demand for greater performance means device manufacturers are now integrating even greater functionality into a single device. Although still emerging, this is taking the shape of multicore MCUs, both homogeneous and heterogeneous, in which multiple identical cores or multiple dissimilar cores, respectively, are integrated in a single device.
While this is not uncommon in application-specific SoCs targeting the mobile phone sector, it is less common in general-purpose devices. However, many device manufacturers believe it will become more common, driven by the Internet of Things (IoT).
The IoT will also promote greater deployment of ‘smart’ sensors, each of which will likely contain a high-performance, 32-bit embedded MCU. These, in turn, will communicate (most likely wirelessly) with a sensor hub, which will feature an even higher-performance processor capable of interfacing to and controlling a sensor network.
As this trend continues, demand is also expected to drive performance further, eventually to a point where MCUs with 64-bit cores become the norm.
Conclusion
Performance is subjective; many applications still require only simple control and limited connectivity, with no ‘hard real-time’ requirements. However, as the world becomes more connected, the requirements imposed by more sophisticated communication protocols will continue to raise the performance bar. Today, some 8-bit and many 16-bit devices are capable of meeting that demand, while 32-bit devices offer significant headroom.
Today, that headroom may be a luxury many applications cannot afford; cost-optimization will still point towards simpler devices. While communications protocols are not likely to ‘suddenly’ become much more compute-intensive, the end-applications are on an inexorably upward curve in terms of complexity, which will rapidly consume any processing headroom currently available.
All of this points towards the wider adoption of 32-bit architectures and, beyond that, microcontrollers that are even more complex.
References:
- ‘Microcontroller Performance Analysis Techniques’