|
Complex matrix algebra is important to a wide variety of applications, and telecommunications is one of the most prominent. Matrix calculations are used in communications standards such as 3GPP-LTE, WiMAX, and many others. For example, the MIMO (Multiple-Input Multiple-Output) algorithm in the LTE receiver is based on a 4x4 complex matrix inversion.
In this article we present an implementation of 4x4 complex matrix inversion on the recently announced StarCore SC3850 DSP core. We use the cofactor method and optimize our code to take advantage of parallelism in the SC3850 architecture, resulting in a highly efficient implementation. We discuss the implementation in detail, including code structure and optimizations. The matrix inversion output is verified against a floating-point MATLAB model.
1.1. SC3850 architecture
The SC3850 is the newest member of Freescale's StarCore family. It is used in the MSC8156, a six-core DSP targeting wireless broadband equipment [4]. The SC3850 has four independent arithmetic-logic units (ALUs), each of which contains two 16-bit multipliers. Together, the four ALUs can complete eight 16-bit multiply-accumulates (MACs) per cycle, or up to 8 billion MACs per second (8 GMACs) at 1 GHz.
The dual-multiplier ALUs are new in the SC3850. The previous-generation SC3400 offered only one multiplier per ALU. The new hardware is supported by new dual-multiply instructions, including new complex 16x16 and complex 32x16 multiply instructions. Complex 16x16 multiplication is performed using the L_mpyre and L_mpyim instructions, which compute the real and imaginary portions of the product, respectively. All inputs and outputs come from 40-bit registers. Each source operand is assumed to contain a packed complex number, where the high portion holds the real part (signed, fractional 16 bits) and the low portion holds the imaginary part (signed, fractional 16 bits). The output of the operation is stored as a 40-bit value.

Table 1.
Using the four ALUs, two complex 16x16 multiplications can be performed per cycle. The following figure illustrates the complex multiply instructions.

Figure 1. SC3850 complex multiplication. L_mpyre finds the real portion of the product, and L_mpyim finds the imaginary portion.
The following code demonstrates the use of the new SC3850 complex multiply instructions and contrasts it with the equivalent SC3400 code. Note that the SC3850 requires only two instructions, as opposed to four instructions on the SC3400.

Figure 2. SC3850 complex multiply code. The SC3850 can perform two complex 16x16 multiplies per cycle.

Figure 3. SC3400 complex multiply code. The SC3400 can perform one complex 16x16 multiply per cycle.
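As a portable illustration of the arithmetic behind the complex 16x16 multiply, the following C sketch models the packed operand format described above (real part in the high 16 bits, imaginary part in the low 16 bits). It is not StarCore code: the names pack_cplx, model_mpyre, and model_mpyim are hypothetical, a 64-bit integer stands in for the 40-bit register, and the rounding and saturation behavior of the real instructions is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack a Q15 complex number, real part in the
   high 16 bits and imaginary part in the low 16 bits. */
static uint32_t pack_cplx(int16_t re, int16_t im) {
    return ((uint32_t)(uint16_t)re << 16) | (uint16_t)im;
}

/* Reinterpret a 16-bit pattern (0..65535) as a signed 16-bit value. */
static int16_t s16_from_bits(uint32_t v) {
    return (int16_t)(v >= 0x8000u ? (int32_t)v - 0x10000 : (int32_t)v);
}

static int16_t cplx_re(uint32_t p) { return s16_from_bits(p >> 16); }
static int16_t cplx_im(uint32_t p) { return s16_from_bits(p & 0xFFFFu); }

/* Model of the real part of a*b: ar*br - ai*bi.
   Q15 * Q15 gives Q30; doubling yields a Q31 fractional result.
   The result can exceed 32 bits, hence the wider container. */
static int64_t model_mpyre(uint32_t a, uint32_t b) {
    int64_t rr = (int64_t)cplx_re(a) * cplx_re(b);
    int64_t ii = (int64_t)cplx_im(a) * cplx_im(b);
    return (rr - ii) * 2;
}

/* Model of the imaginary part of a*b: ar*bi + ai*br, as Q31. */
static int64_t model_mpyim(uint32_t a, uint32_t b) {
    int64_t ri = (int64_t)cplx_re(a) * cplx_im(b);
    int64_t ir = (int64_t)cplx_im(a) * cplx_re(b);
    return (ri + ir) * 2;
}

/* Worked example: (0.5 + 0.25j) * (0.25 - 0.5j) = 0.25 - 0.1875j. */
void complex16_example(void) {
    uint32_t a = pack_cplx(16384, 8192);    /* 0.5  + 0.25j in Q15 */
    uint32_t b = pack_cplx(8192, -16384);   /* 0.25 - 0.5j  in Q15 */
    assert(model_mpyre(a, b) == 536870912);   /*  0.25   in Q31 */
    assert(model_mpyim(a, b) == -402653184);  /* -0.1875 in Q31 */
}
```

Each complex product needs four real 16-bit multiplies, two for the real part and two for the imaginary part, which is consistent with one dual-multiplier ALU handling each part.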
Complex 32x16 multiplication (i.e., mixed-precision multiplication) is somewhat more involved. As with 16x16 multiplication, one operand is a 16-bit fractional complex number in the packed complex format. The other operand is a 32-bit fractional complex number held in two registers: one register holds the 32-bit real portion and the other holds the 32-bit imaginary portion. The result is placed in two registers in 32-bit precision.
The L_dmpy, L_dmac, and L_dmsu instructions perform this complex multiplication. The throughput is 1 MAC/cycle.

Figure 4. SC3850 32x16 complex multiply code. In this example, b is the 32-bit input.
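The operand layout for the mixed-precision case can be sketched in portable C as follows. This is a model of the arithmetic only, under stated assumptions: the names cplx_wide_t and model_cmul_32x16 are hypothetical, 64-bit fields stand in for the two result registers, results are truncated toward zero, and the way the work is split across L_dmpy, L_dmac, and L_dmsu is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two-register result. 64-bit fields
   are used so that the model itself cannot overflow. */
typedef struct {
    int64_t re;
    int64_t im;
} cplx_wide_t;

/* Reinterpret a 16-bit pattern (0..65535) as a signed 16-bit value. */
static int16_t s16_from_bits(uint32_t v) {
    return (int16_t)(v >= 0x8000u ? (int32_t)v - 0x10000 : (int32_t)v);
}

/* Model of a 32x16 complex multiply.
   a_packed: Q15 complex, real in the high 16 bits, imaginary in the low.
   b_re, b_im: Q31 real and imaginary parts, one per register.
   Returns the product with each part scaled back to Q31. */
static cplx_wide_t model_cmul_32x16(uint32_t a_packed,
                                    int32_t b_re, int32_t b_im) {
    int64_t ar = s16_from_bits(a_packed >> 16);
    int64_t ai = s16_from_bits(a_packed & 0xFFFFu);

    /* Q15 * Q31 gives Q46; sums of two such terms fit easily in 64 bits. */
    int64_t re_q46 = ar * b_re - ai * b_im;
    int64_t im_q46 = ar * b_im + ai * b_re;

    /* Divide by 2^15 to return to Q31 (truncation toward zero). */
    cplx_wide_t y = { re_q46 / 32768, im_q46 / 32768 };
    return y;
}

/* Worked example: (0.5 + 0.25j) * (0.25 - 0.5j) = 0.25 - 0.1875j. */
void complex32x16_example(void) {
    uint32_t a = ((uint32_t)16384 << 16) | 8192u;   /* 0.5 + 0.25j, Q15 */
    int32_t b_re = 536870912;                       /*  0.25 in Q31 */
    int32_t b_im = -1073741824;                     /* -0.5  in Q31 */
    cplx_wide_t y = model_cmul_32x16(a, b_re, b_im);
    assert(y.re == 536870912);    /*  0.25   in Q31 */
    assert(y.im == -402653184);   /* -0.1875 in Q31 */
}
```

Keeping the 32-bit operand in separate real and imaginary words means each output part is a sum or difference of two 32x16 products, which matches the multiply, multiply-accumulate, and multiply-subtract pattern suggested by the instruction names.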
|