High-Performance Application Processors Are Pushing the Multimedia Limits

Fundamentally, an embedded system comprises three elements: some form of ‘intelligence’ (such as a microprocessor), some storage (which today is normally Flash-based), and the necessary hardware interfaces to the outside world. It is a formula that works across most vertical markets. However, the growing volume of data in all its forms, and the inherent value of that data, mean this tried-and-tested format is evolving.

In order to access the value of that data, engineers need to pack more functionality and performance into smaller packages running at lower power. This challenges manufacturers to create new solutions and increase the level of integration, achieved using the latest fabrication nodes.

Simply throwing more transistors at a problem is not always the solution; it is the careful application of those transistors that really counts. As the complexity in today’s systems is often ‘hidden’ in the software, it follows that the most crucial element in the embedded system formula is now the processor, but just how much performance is ‘enough’?

By looking at what is really pushing up the performance curve and how manufacturers are meeting the demand to get more from less, engineers can better appreciate the solutions now available to meet their needs.

History of horsepower

Just like an internal combustion engine, a processor is never 100% efficient, and its performance cannot be measured simply in terms of clock speed. It could be argued that as processor architectures have grown in bus width, predominantly to accommodate larger and more complex instruction sets as well as larger addressable memories, the complexity of the instruction cycle has also increased, so that the incremental benefit has steadily reduced. Manufacturers strive to maximize architectural efficiency using a number of innovative techniques.
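As a rough illustration of why clock speed alone is a poor yardstick, the sketch below applies the textbook relationship between instruction count, average cycles per instruction, and clock rate. Both parts and all of their figures are hypothetical, chosen only to show the effect; they do not describe any real device.

```python
def execution_time_s(instruction_count, cycles_per_instruction, clock_hz):
    """Time to run a workload: instructions x cycles per instruction / clock rate."""
    return instruction_count * cycles_per_instruction / clock_hz


WORKLOAD = 1_000_000_000  # hypothetical workload of one billion instructions

# Hypothetical part A: higher clock, but more cycles per instruction on average.
time_a = execution_time_s(WORKLOAD, cycles_per_instruction=2.0, clock_hz=1_000_000_000)

# Hypothetical part B: lower clock, but a more efficient architecture.
time_b = execution_time_s(WORKLOAD, cycles_per_instruction=1.25, clock_hz=800_000_000)

print(f"Part A (1 GHz):   {time_a:.4f} s")
print(f"Part B (800 MHz): {time_b:.4f} s")
```

Under these assumed figures, the part with the slower clock finishes the workload first, which is the point the engine analogy makes.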

It is for this reason that simpler 8-bit architectures, such as the 8051, continue to survive despite the onslaught of low-cost 32-bit alternatives that can, undeniably, offer greater functionality for largely similar power and cost budgets.

However, there is now a range of specific applications that demand higher performance, and it is this focus that has enabled the industry to collaborate on standardized benchmarks that can measure the level of performance achieved under ‘real world’ conditions.

Thanks to these benchmarks, manufacturers have been able to optimize their solutions, creating system architectures that couple industry-leading processor cores (such as ARM®’s Cortex™-A8 and -A9) with their own proprietary hardware-acceleration technology, class-leading non-volatile memory, and application-specific peripherals. The result is systems-on-chip (SoCs) that deliver the benefits of an application-specific solution with the cost benefits of a standard part.

This puts even more emphasis on empirical results, as the part with the highest baseline performance, as measured in millions of instructions per second, or MIPS, may not deliver the best results. Markus Levy, EEMBC president, commented, “As embedded microprocessors have evolved from relatively simple devices to the complex SoCs we see today, the whole concept of benchmarking has changed as well, going beyond speed measurements while performing abstract computing functions to include energy consumption and aspects of performance that are much closer to the user experience.”

The need for more speed

Modern SoCs aimed mainly at multimedia applications, such as the Sitara and DaVinci families from Texas Instruments and the i.MX devices from Freescale, feature ARM Cortex-A8 and -A9 processor cores and can run at speeds in the region of 1 GHz, much higher than microcontrollers, which run at clock speeds in the low hundreds of MHz.

Even with these high-performance devices, however, it is important to realize that the performance that can be achieved from a given device is not based on the processor core alone, but is rather a function of the surrounding processor sub-system. Invariably, modern embedded systems that target high-end applications will employ such an SoC: one that has architectural extensions, hardware-acceleration blocks or optimized peripherals, all of which (largely) remain under the control of the processor core. This level of complexity makes it impractical to develop all of the software and control kernels for a given application from scratch, so designers defer instead to an operating system. It also follows that optimizing the performance of a device involves selecting and configuring the most appropriate operating system for the application, as well as careful development of the application software.

Often the application will call for real-time response, and so a real-time operating system (RTOS) will be chosen; at other times a real-time response is less important. There is now a range of operating systems that can provide a good foundation for most applications targeted by a specific device, often provided by the manufacturer as well as by third parties.

Configuring an operating system is now a critical part of embedded design. Operating systems are typically designed to be both portable and modular, allowing engineers to select features and create an optimized build for their application. This clearly has implications for the application software, as it may only be able to access specific hardware features through the operating system. This system-level design approach makes it difficult to estimate performance demands in the early stages of design, but it can deliver the optimal solution.

Benchmarks can assist engineers during system-level design, but increasingly, manufacturers are providing the tools and knowledge base necessary to help engineers make early-stage decisions. Many IDEs (integrated development environments) now feature power estimators, allowing engineers to configure specific aspects at the system level and get early visibility of the parts that could meet the application’s performance requirements.

Architectural changes

Thanks to the dominance and incredibly strong ecosystem of the architecture, SoC manufacturers have now largely standardized on ARM’s Cortex family of processor cores; for high-end applications the cores of choice are often the ARM Cortex-A8 and -A9.

The cores in the Cortex family are denoted by their prefix; Cortex-M cores are intended to raise the level of performance available in microcontrollers, while the Cortex-R series focuses on delivering real-time performance. The Cortex-A family is described as suitable for application processors, and it could be argued that this class of device has undergone the most focused optimization.

Unlike the majority of general-purpose devices, application processors and SoCs are intended for applications that feature a significant level of multimedia processing, and it is here that hardware acceleration can play a significant role in maximizing performance while minimizing power. Processing video data has unique requirements, which translate well to hardware acceleration. Coupling this with the ability to integrate one, two, or even four processing cores in a single device can result in SoCs that can handle the most demanding media processing, as found in an increasing range of applications.
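The effect of hardware acceleration on a video workload can be sketched with a simple per-frame timing model. This is a deliberately simplified estimate, not a description of any specific device: it assumes the offloaded work and the remaining CPU work do not overlap, and every number below is hypothetical.

```python
def frame_time_ms(cpu_ms_per_frame, offload_fraction, accel_speedup):
    """Estimate per-frame time when part of the work moves to an accelerator.

    The offloaded share runs `accel_speedup` times faster than it would on
    the CPU; the remainder still runs on the processor core. The two parts
    are assumed to run one after the other (no overlap).
    """
    on_cpu = cpu_ms_per_frame * (1.0 - offload_fraction)
    on_accel = cpu_ms_per_frame * offload_fraction / accel_speedup
    return on_cpu + on_accel


CPU_ONLY_MS = 50.0  # hypothetical: one frame takes 50 ms on the core alone

# Hypothetical: 80% of the frame work suits an accelerator that is 10x faster.
accelerated_ms = frame_time_ms(CPU_ONLY_MS, offload_fraction=0.8, accel_speedup=10.0)

print(f"CPU only:    {CPU_ONLY_MS:.1f} ms per frame")
print(f"Accelerated: {accelerated_ms:.1f} ms per frame")
```

In this model the work left on the core quickly dominates the frame time, which is one reason the accelerator and the processor sub-system have to be considered together.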

Software demands

Developing an embedded system that makes full use of the hardware resources available requires careful planning, and it is advisable to know what the performance needs are before development starts, as opposed to trying to build a system that delivers the highest possible throughput.

For emerging applications like video analytics, one of the key applications for this new class of high-performance SoC, this can be difficult to quantify, but manufacturers are striving to deliver board-support packages and software libraries that give developers a firm basis from which to start developing.

The engineering community at large is also a good resource for software; open-source software is a trend that is influencing the entire industry in general and the emerging market for computer vision in particular.

The OpenCV project (CV standing for Computer Vision), which can be found at www.opencv.org, has more than 2,500 algorithms available under a BSD license, for functions such as detecting and recognizing faces, objects, and human actions in videos, as well as a growing range of 3D and augmented reality functions.

To really take full advantage of the hardware features, however, generic algorithms need to be ‘tuned’; an area that members of the Embedded Vision Alliance (www.embedded-vision.com) have already started working on. The Alliance, whose goal is to promote the use of embedded vision technology, was founded in 2011 by BDTI, Xilinx, and IMS Research, who were later joined by Analog Devices, Apical, Avnet Electronics Marketing, CEVA, CogniVue, Freescale, MathWorks, National Instruments, NVIDIA, Texas Instruments, Tokyo Electron Device, XIMEA, and XMOS.

Solutions

Two of the members of the Embedded Vision Alliance, Freescale and Texas Instruments, already offer highly integrated SoCs suitable for this application space, in the form of the i.MX range from Freescale, and the Sitara and DaVinci families from TI.

Freescale’s i.MX range is described as Multimedia Applications Processors, and the very popular i.MX6 family is available in single-, dual- and quad-core variants, as shown in Figure 1.


Figure 1: Freescale’s i.MX family of Multimedia Applications Processors provides scalable access to a range of video-centric hardware extensions, closely coupled to 1, 2 or 4 ARM Cortex-A9 processors.

Alongside the ARM Cortex-A9 processor core(s) sit the multimedia and image processing hardware accelerators, which provide the extra horsepower needed to make this a truly capable applications processor. These include the 2D and 3D Graphics Processing Units, the Image Processing Unit, and the Video Processing Unit.

An important feature of the i.MX6 family is its pin and software compatibility, which underpins its scalability. The dual- and quad-core parts run at up to 1.2 GHz with 1 Mbyte of L2 cache, and there is also a dual-lite option that runs at 1 GHz with 512 kbyte of cache, as does the single-core Solo family. As well as core speed and cache, the variants offer different peripheral mixes, as shown in Figure 2.


Figure 2: The i.MX6 family offers pin- and software-compatible scalability throughout most of the range, across single-, dual- and quad-core options.
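Because the variants are pin- and software-compatible, choosing between them can be treated as a simple lookup against the application's requirements. The sketch below is one hypothetical way to express that; the core counts, clock speeds and cache sizes are those quoted above, while the variant labels, the `Variant` type and the `smallest_fit` helper are illustrative names only, and the differing peripheral mixes are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    cores: int
    max_clock_ghz: float
    l2_cache_kb: int


# Figures as quoted in the text; listed from smallest to largest.
VARIANTS = [
    Variant("Solo", 1, 1.0, 512),
    Variant("DualLite", 2, 1.0, 512),
    Variant("Dual", 2, 1.2, 1024),
    Variant("Quad", 4, 1.2, 1024),
]


def smallest_fit(min_cores, min_clock_ghz=0.0, min_l2_kb=0):
    """Return the first (smallest) variant meeting every requirement, or None."""
    for variant in VARIANTS:
        if (variant.cores >= min_cores
                and variant.max_clock_ghz >= min_clock_ghz
                and variant.l2_cache_kb >= min_l2_kb):
            return variant
    return None


# Example: two cores are needed, and the design must reach 1.2 GHz.
choice = smallest_fit(min_cores=2, min_clock_ghz=1.2)
print(choice.name if choice else "no variant fits")
```

In practice the peripheral mix and power budget would also drive the choice, so a lookup like this is only a first filter.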

Heterogeneous multicore devices, those that integrate different processor cores, have become commonplace in portable media devices, as they offer the best hardware platform for ever more demanding software. One example comes from Texas Instruments, in the form of its DaVinci DM37x family of ARM Cortex-A8-based video SoCs. The ARM core in the DM37x devices runs at up to 1 GHz, and the devices also feature a TMS320C64x DSP core, along with a POWERVR SGX Graphics Accelerator (as shown in Figure 3).


Figure 3: The DM37x DaVinci family from Texas Instruments integrates an ARM Cortex-A8 alongside TI’s own TMS320C64x DSP.

The graphics accelerator (DM3730 only) provides a universal scalable shader engine supporting OpenGL ES 1.1 and OpenVG 1.0, while the VLIW (Very Long Instruction Word) DSP core features eight highly independent functional units including six ALUs.

Both Freescale and TI also offer ARM Cortex-A8-based multimedia application processors, in the form of the i.MX5 family (Freescale) and the Sitara range (TI). As shown in Figure 4, the latter also integrates the POWERVR SGX Graphics Accelerator subsystem, along with a Programmable Real-Time Unit and Industrial Communication Subsystem, which operate independently of the ARM core.


Figure 4: TI’s Sitara processors, based on the ARM Cortex-A8, also feature a graphics accelerator subsystem.

Conclusion

High-performance processor cores have never been so accessible in such a wide range of low-power devices. Their availability is directly influencing developments in multimedia devices and the ‘user experience’ in ever more portable end products.

However, it is important to appreciate that the performance needed for this class of application is equally influenced by the hardware acceleration units integrated alongside the core(s), coupled with optimized software; and it is the successful mix of technologies, to create devices like those featured in this article, that is really driving innovation.

References:

  1. EEMBC
  2. Embedded Vision Alliance
  3. OpenCV
Published: July 13, 2019 | Category: Reference Designs