
Comparing Loihi with a SpiNNaker 2 prototype on low-latency keyword spotting and adaptive robotic control


Published 15 July 2021 © 2021 The Author(s). Published by IOP Publishing Ltd
Citation: Yexin Yan et al 2021 Neuromorph. Comput. Eng. 1 014002. DOI: 10.1088/2634-4386/abf150


Abstract

We implemented two neural network based benchmark tasks on a prototype chip of the second-generation SpiNNaker (SpiNNaker 2) neuromorphic system: keyword spotting and adaptive robotic control. Keyword spotting is commonly used in smart speakers to listen for wake words, and adaptive control is used in robotic applications to adapt to unknown dynamics in an online fashion. We highlight the benefit of the multiply-accumulate (MAC) array in the SpiNNaker 2 prototype, which is ordinarily used in rate-based machine learning networks, when employed in a neuromorphic, spiking context. In addition, the same benchmark tasks have been implemented on the Loihi neuromorphic chip, giving a side-by-side comparison regarding power consumption and computation time. While Loihi shows better efficiency when low dimensional vector-matrix multiplications are involved, the MAC array gives the SpiNNaker 2 prototype better efficiency when high dimensional vector-matrix multiplications are involved.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

With the substantial progress of artificial intelligence (AI) in recent years, neural network based algorithms are increasingly being deployed in embedded AI applications. Smart speakers which continuously listen for keywords like 'Alexa' and robotic applications which employ neural network based adaptive control algorithms are examples from industry and research. To improve the efficiency regarding power consumption and computation time, various hardware architectures have been proposed.

The neural networks employed in these AI applications are most commonly deep neural networks (DNNs). A substantial amount of computation in DNNs is caused by the multiply-accumulate (MAC) operations. For efficient computation of DNNs, many machine learning hardware architectures include an MAC unit to facilitate the MAC operations in DNNs [1].

While DNNs are currently widely adopted for applications, spiking neural networks (SNNs) which more closely mimic the behavior of biological neural networks are increasingly gaining attention as this type of network has the potential of high efficiency, especially in combination with neuromorphic hardware [2]. One prominent example is the Loihi neuromorphic chip [3] which has been shown to be efficient in various neural network based benchmark tasks like keyword spotting [4] and adaptive control [5]. Another neuromorphic architecture is represented by the second generation of the SpiNNaker system (SpiNNaker 2) [6].

While Loihi has dedicated circuits for synapses and neurons, which increases the efficiency of the implemented models, and a programmable learning engine for more flexibility regarding learning rules, SpiNNaker 2 uses general purpose processors (Arm cores) connected with numerical accelerators. While the processors increase the flexibility of the synapse and neuron models and learning rules, the accelerators increase the efficiency of certain computations such as exponential functions and random number generation, which are often required in neuromorphic applications. Besides the neuromorphic accelerators, SpiNNaker 2 also contains MAC arrays for efficient matrix operations and is thus able to merge SNN and DNN operation.

Both neuromorphic hardware platforms have proven efficient in a number of applications, for example a neuromorphic olfactory circuit [7] and online few-shot learning [8] on Loihi, and reward-based structural plasticity [9] on SpiNNaker 2. However, a direct comparison of both neuromorphic hardware platforms on the same benchmarks has been missing.

In this work, we implement the keyword spotting and adaptive control benchmark tasks on the second SpiNNaker 2 prototype [10]. We compare the computation time and active energy consumption of the benchmark tasks with Loihi, and highlight the benefit of the MAC array. Specifically, for keyword spotting, the original DNN version is implemented on the SpiNNaker 2 prototype using the MAC array, while the SNN version is implemented on Loihi, which only supports SNNs; here the SpiNNaker 2 prototype shows better efficiency regarding computation time and energy consumption. For adaptive control, an SNN is implemented on both platforms: Loihi shows better efficiency when low dimensional vector-matrix multiplications are involved, and the SpiNNaker 2 prototype shows better efficiency when high dimensional vector-matrix multiplications are involved.

In section 2 we give an overview of the prototype chip, with emphasis on the MAC array. Section 3 describes the two benchmarks implemented in this work. Section 4 presents the software implementation. The experimental results are presented in section 5.

2. The SpiNNaker 2 prototype chip

2.1. System overview

SpiNNaker [11] is a digital neuromorphic hardware system based on low-power Arm processors originally built for real-time simulation of SNNs. In the second generation of SpiNNaker (SpiNNaker 2), which is currently being developed in the Human Brain Project [12], several improvements are being made. The SpiNNaker 2 architecture is based on processing elements (PEs) which contain an Arm Cortex-M4F core, 128 kbytes of local SRAM, hardware accelerators for exponential functions [13, 14], true and pseudo random number generation [9, 15] and MAC accelerators. Additionally, the PEs include advanced dynamic voltage and frequency scaling (DVFS) features [16, 17]. The PEs are arranged in quad-processing elements (QPEs) containing four PEs and a network-on-chip (NoC) router for packet based on-chip communication. The QPEs can be placed in an array scheme without any additional flat top level routing to form the SpiNNaker 2 many core SoC.

SpiNNaker 2 will be implemented in GLOBALFOUNDRIES 22FDX technology [18]. This FDSOI technology allows the application of adaptive body biasing (ABB) for low-power operation at ultra-low supply voltages in both forward [19] and reverse bias schemes [20]. For maximum energy efficiency and reasonable clock frequencies, a nominal supply voltage of 0.50 V is chosen and ABB in a forward bias scheme is applied. The ABB aware implementation methodology from [21] has been used. This allows achieving >200 MHz clock frequency at the 0.50 V nominal supply voltage at the first DVFS performance level PL1 and >400 MHz from a 0.60 V supply at the second DVFS performance level PL2.

The second SpiNNaker 2 prototype chip has been implemented and manufactured in 22FDX [10]. It contains 2 QPEs with 8 PEs in total to allow the execution of neuromorphic applications. Figure 1 shows the simplified block diagram of the testchip PE array. The chip photo is shown in figure 2. The testchip includes peripheral components for host communication, a prototype of the SpiNNaker router for chip-to-chip spike communication and some shared on-chip SRAM.


Figure 1. Simplified schematic of the second SpiNNaker 2 prototype with 2 QPEs. Each QPE contains 4 PEs. Each PE contains an MAC array, an Arm core and a local SRAM. The NoC router is responsible for the communication.


Figure 2. Chip photo of the SpiNNaker 2 prototype in 22FDX technology.


2.2. MAC array

The MAC array has 64 MAC units in a 4 × 16 layout. Figure 3 illustrates the MAC array. The data of operand A and operand B are arrays of 8 bit integer values. In each clock cycle, 16 values from the array of operand A and 4 values from the array of operand B are fed into the MAC array. Every MAC unit in the same column is fed with the same value from operand A, and every MAC unit in the same row is fed with the same value from operand B. The software running on the Arm core is responsible for arranging the data in the SRAM and notifying the MAC array of the address and length of the data to be processed. After the data is processed, the results are written back to predefined addresses in memory. The result of each MAC unit is 29 bits wide.


Figure 3. Schematic of the MAC array. Each square in the 4 × 16 block represents one MAC unit. The squares around the block represent the data to be executed. In each clock cycle, 4 values from operand B and 16 values from operand A are fed into the MAC array simultaneously, as indicated by the arrows.


When computing a matrix multiplication, a general purpose processor like the Arm core needs to: (1) fetch the operand A and operand B into the registers, (2) do the MAC, (3) write the result back, (4) check the condition of the loop, (5) compute the addresses of the data in the next iteration. While the MAC array essentially does the same, it is more efficient due to the single instruction multiple data (SIMD) operation. In particular, the efficiency is made possible by:

  • (a)  
    64 MAC operations can be done in one clock cycle in parallel.
  • (b)  
    16 × 8 bits of data of operand A and 4 × 8 bits of data of operand B can be fetched in one clock cycle in parallel.
  • (c)  
    Control logic and data transfer run in parallel to MAC operations, hiding the overhead of data transfer for the next iteration.
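As an illustration of this dataflow, the following NumPy sketch models the 4 × 16 MAC array as one rank-1 update per clock cycle: each cycle consumes 16 int8 values of operand A and 4 int8 values of operand B and accumulates their outer product into a 4 × 16 tile of 32-bit results (int32 comfortably covers the 29-bit accumulator). This models only the arithmetic, not the actual hardware control logic.

```python
import numpy as np

def mac_array_tile(A_rows, B_cols):
    """Model one 4x16 output tile of the MAC array.

    Each loop iteration stands for one clock cycle: 16 int8 values of
    operand A and 4 int8 values of operand B enter the array, and every
    MAC unit accumulates one product (a rank-1 update of the 4x16
    accumulator). Results are held in int32, covering the 29-bit width.
    """
    acc = np.zeros((4, 16), dtype=np.int32)
    for a_row, b_col in zip(A_rows, B_cols):  # one cycle per row pair
        acc += np.outer(b_col.astype(np.int32), a_row.astype(np.int32))
    return acc

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(390, 16), dtype=np.int8)  # operand A
B = rng.integers(-128, 128, size=(390, 4), dtype=np.int8)   # operand B
tile = mac_array_tile(A, B)
```

After D cycles the tile equals an ordinary int32 matrix product, which serves as the correctness check for this model.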

3. Benchmark models

In this section, we briefly review the two benchmark models implemented in this work: keyword spotting and adaptive control.

3.1. Keyword spotting

Keyword spotting is a speech processing problem which deals with identifying keywords in utterances. A practical use case is the identification of wake words for virtual assistants (e.g. 'Alexa'). In this work, the keyword spotting network we implement on the SpiNNaker 2 prototype is the same as in [4]: it consists of 1 input layer with 390 input values, 2 dense layers each with 256 neurons and 1 output layer with 29 output values (figure 4). As in [4], no training is involved and only inference is considered. The 390 dimensional input to the network is the Mel-frequency cepstral coefficient (MFCC) features of an audio waveform in each time step. The 29 dimensional output of the network corresponds to the alphabetical characters, with additional special characters, e.g. for silence. One 'inference' with this network involves passing 10 time steps of the MFCC features into the network. The outputs are then postprocessed to form a result for the inference. The difference from the implementation on Loihi is that on the SpiNNaker 2 prototype we implement the network as a normal DNN with ReLU activations, whereas on Loihi the SNN version was implemented, since Loihi only supports SNNs.
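A minimal sketch of one inference with this architecture (390 → 256 → 256 → 29, ReLU in the hidden layers, 10 MFCC time steps per inference). The weights are random placeholders rather than the trained weights of [4], and the rescaling used in the requantization as well as the actual post-processing are simplified stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
# Layer sizes from the text: 390 MFCC inputs, two dense layers of 256
# ReLU neurons, 29 outputs. Random placeholder weights, not trained.
W1 = rng.integers(-128, 128, size=(390, 256)).astype(np.int32)
W2 = rng.integers(-128, 128, size=(256, 256)).astype(np.int32)
W3 = rng.standard_normal((256, 29))

def relu_q(x, shift=16):
    """ReLU plus crude requantization back to the int8 range
    (a stand-in for the real activation scaling, which is not
    specified in the text)."""
    return np.clip(x >> shift, 0, 127).astype(np.int8)

def infer(mfcc_frames):
    """One inference: 10 time steps of 390-dim MFCC features."""
    step_outputs = []
    for x in mfcc_frames:
        h1 = relu_q(x.astype(np.int32) @ W1)           # dense + ReLU
        h2 = relu_q(h1.astype(np.int32) @ W2)          # dense + ReLU
        step_outputs.append(h2.astype(np.float64) @ W3)  # output on host
    return np.stack(step_outputs)  # post-processing combines the steps

frames = rng.integers(-128, 128, size=(10, 390), dtype=np.int8)
out = infer(frames)
```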


Figure 4. Keyword spotting network architecture.


3.2. Adaptive control

For our second benchmark task, we use the adaptive control algorithm proposed as a benchmark in [22] and further investigated in [5] (figure 5). This benchmark consists of a single-hidden-layer neural network, where the input is the sensory state of the system to be controlled (such as a robot arm) and the output is the extra force that should be applied to compensate for the intrinsic dynamics and forces on the arm (gravity, friction, etc). The only non-linearities are in the hidden layer (i.e. there is no non-linear operation directly on the input or output). The input weights are fixed and randomly chosen, and the output weights ωij are initialized to zero and then adjusted using a variant of the delta rule [23] (equation (1)), where α is a learning rate, ai is the current level of activity of the ith neuron, and Ej is an error signal

Δωij = α ai Ej    (1)

Crucially, if we use the output of a PD-controller to be this error signal Ej , and if we take the output of this network and add it to the control signal produced by a PD-controller, then the resulting system will act as a stable adaptive controller [24]. This is a variant of the adaptive control algorithm developed by Jean-Jacques Slotine [25]. One way to think of this is that the neural network is acting somewhat like the I term in a PID-controller, but since the I value is being produced by the neural network, it can be different for different parts of the sensory space. It can thus learn to, for example, apply extra positive torque when a robot arm is leaning far to one side, and extra negative torque when the arm is leaning far to the other side.

When used with spiking neurons, we also apply a low-pass filter to the ai term, producing a continuous value representative of the recent spiking activity of the neuron.
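A sketch of this learning loop with made-up sizes and rates: spikes are low-pass filtered into a continuous activity ai, and the zero-initialized output weights are adjusted by the delta rule Δωij = α ai Ej at each time step (the sign convention here assumes the error is defined so that the update reduces it):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d_out = 200, 2                    # hidden neurons, output dimensions
alpha, tau, dt = 1e-4, 0.02, 1e-3    # made-up learning rate and filter
w_out = np.zeros((N, d_out))         # output weights start at zero
a_filt = np.zeros(N)                 # low-pass filtered activity a_i

def control_step(spikes, error):
    """One 1 ms step: filter the spikes, apply the delta rule and
    return the network's contribution to the control signal."""
    global a_filt
    a_filt += (dt / tau) * (spikes - a_filt)     # first-order low-pass
    w_out[:] += alpha * np.outer(a_filt, error)  # delta rule update
    return a_filt @ w_out

for _ in range(100):
    spikes = (rng.random(N) < 0.13).astype(float)  # ~130 Hz at 1 ms steps
    u = control_step(spikes, error=np.array([0.5, -0.1]))
```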

While this benchmark was originally proposed for its simplicity and applicability across a wide range of neuromorphic hardware and controlled devices, there is one further important reason for us to choose this benchmark. The core network that it requires has a single hidden layer non-linearity, and the inputs and outputs are generally of much lower dimensionality than the number of neurons in the hidden layer. This is exactly the sort of network that forms the core component of the neural engineering framework (NEF) [26]. The NEF has been used to create large-scale biologically-based neural models [27] by chaining these smaller networks together. By sending the output from one of these networks to the inputs of another network, we are effectively factoring the weight matrix between the hidden layers of the two networks. This has been shown to be a highly efficient method for implementing neural models on the original SpiNNaker 1 hardware [28], and we expect the same to be the case on SpiNNaker 2.
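The weight-matrix factoring mentioned above can be made concrete with a small NumPy sketch (random stand-in decoders and encoders, not taken from any actual model): routing the low-dimensional decoded output of one population into the encoders of the next is mathematically identical to a full N × N connection, while storing and transmitting far fewer numbers.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 1000, 3                        # neurons per population, value dims
D1 = rng.standard_normal((N, d)) / N  # decoders of population 1 (made up)
E2 = rng.standard_normal((d, N))      # encoders of population 2 (made up)
a1 = rng.random(N)                    # filtered activities of population 1

full = a1 @ (D1 @ E2)       # explicit N x N weight matrix: 10^6 weights
factored = (a1 @ D1) @ E2   # decode then encode: only 2 * N * d weights
assert np.allclose(full, factored)
```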


Figure 5. Adaptive control network architecture.


4. Implementation of the benchmarks on the SpiNNaker 2 prototype

We implemented the keyword spotting and adaptive control benchmarks on the SpiNNaker 2 prototype, with the MAC array and the Arm core responsible for different computational tasks. Since the same benchmarks have also been implemented on Loihi [4, 5], this allows a side-by-side comparison between the two neuromorphic platforms.

4.1. Keyword spotting

The keyword spotting network consists of 2 computational steps: vector-matrix multiplication, which is done with the MAC array, and the ReLU update, which is done with the Arm core. Because of memory constraints (see section 5.1.1), layer 1 is split across 2 PEs. The weights in this network are the same as in [4]. The input to the network is a 390 dimensional vector of 8 bit integers. The ReLU activations of each layer are also 8 bit integers. The ReLU activations of layer 2 are directly sent back to the host PC, where the vector-matrix multiplication for the output layer with 29 dimensions is performed, as in [4]. Figure 6 shows the implementation of the keyword spotting network on the SpiNNaker 2 prototype.


Figure 6. Implementation of keyword spotting network on the SpiNNaker 2 prototype.


4.2. Adaptive control

The implementation of adaptive control on the SpiNNaker 2 prototype is based on [28, 29]. There are mainly 4 computational steps: input processing, neuron update, output processing and weight update.

In input processing, the inputs to the network are multiplied with the input weight matrix to produce the input current for each neuron in the hidden layer. The weights are quantized to 8 bit integers with stochastic rounding. For reference, the vector-matrix multiplication is also implemented with only the Arm core, without the MAC array.

The rest of the computation is implemented on the Arm core, which allows event based processing.

In neuron update, the neuron dynamics are updated according to the input current. The leaky-integrate-and-fire (LIF) neuron model is used in the hidden layer to allow for event based processing of the spikes in the following steps.

In output processing, the outputs of the neurons are multiplied with the output weight matrix. In the case of non-spiking neuron models like ReLU, this process is a vector-matrix multiplication. In the case of spiking neuron models, a connection is only activated when there is a spike, so this output processing step corresponds to adding the weights associated with the neuron which has spiked to the output of the network.

In weight update, the output weight matrix is updated according to the neuron activity and error signal. In order to do weight update in an event based manner, the low pass filter mentioned in section 3.2 has been removed, similar to [29]. Because of the short time constant of the low pass filter in this application, this modification does not affect the performance. Since the learning rate is normally very small, floating point data type is chosen for the weights in the output weight matrix.
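The event based output processing and weight update described above can be sketched as follows (made-up sizes, weights and error signal; with the low pass filter dropped, the delta rule acts with an activity of 1 for exactly the neurons that spiked in this step):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d_out = 64, 2
alpha = 1e-3
w_out = rng.standard_normal((N, d_out))        # placeholder learned weights
error = np.array([0.5, -0.2])                  # placeholder error signal
spiked = np.flatnonzero(rng.random(N) < 0.13)  # neurons spiking this step
w0 = w_out.copy()                              # kept only for the check below

# Output processing: only the weight rows of spiking neurons are summed,
# so the cost scales with the spike count rather than with N * d_out.
y = w_out[spiked].sum(axis=0)

# Weight update: only the rows of spiking neurons are touched.
w_out[spiked] += alpha * error
```

The event based result matches the dense formulation with a 0/1 activity vector, which is what makes the shortcut valid.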

In this work, we focus on the adaptive control network implemented on a single PE. The implementation is done with scalability in mind. In the case that the size of a neuron population exceeds the memory limit of a PE, it can be split into many PEs [28]. In this work, the PE additionally simulates the PD controller. The overhead is negligible.

The computational steps and the hardware component used for each step are summarized in figure 7. The PD controller is not shown since its computation is relatively simple.


Figure 7. Main computational steps and hardware component for each step in adaptive control.


5. Results

In this section we show the results of both benchmarks running on the SpiNNaker 2 prototype chip. In particular, we show results regarding the memory footprint, computation time and energy measurement when the PE is running at 0.5 V and 250 MHz. The results of computation time and energy measurement are compared with Loihi. In addition, for adaptive control, the SpiNNaker 2 prototype chip is connected to a robotic arm to demonstrate real time control. Since we implemented the same models on the SpiNNaker 2 prototype as on Loihi, the differences between the two platforms in terms of classification accuracy (for keyword spotting) and mean squared error between actual and desired trajectories (for adaptive control) are negligibly small, so they will not be discussed further in this section. Since for the benchmarks in this work there is not much data movement between the PEs, the throughput of the NoC is not a bottleneck.

5.1. Keyword spotting

5.1.1. Memory footprint

For the keyword spotting benchmark, the required SRAM memory mainly consists of 2 parts: weight memory and neuron input memory.

The weight memory is the memory for storing the weights and biases, which are quantized as 8-bit integers. The required memory in bytes is

Mweight = (D + 1)N    (2)

where D is the number of input dimensions, N is the number of neurons.

The neuron input memory is the memory for storing the results from the MAC array after the vector-matrix multiplication is complete. Each input is a 32 bit integer. The required memory in bytes is

Minput = 4N    (3)

Since the ReLU unit, unlike the LIF neuron model, does not need to hold its output value between inferences, no neuron memory is needed.

The total memory for a neural network on a PE is

Mtotal = Mweight + Minput    (4)

Based on equations (2)–(4) for the memory footprint, the first hidden layer of the keyword spotting network would require approximately 100 kbytes of memory. For each PE, in total 128 kbytes of SRAM is available, which is used for the program code as well as the program data. In this work, it is assumed that each PE has 90 kbytes of SRAM available for the data of the neural network. The first hidden layer is therefore split across two PEs.
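Assuming the memory model of equations (2)–(4) amounts to (D + 1)·N bytes for the 8 bit weights and biases plus 4·N bytes for the 32 bit neuron inputs (our reading of the text, consistent with the stated ~100 kbytes), the split can be checked numerically:

```python
def kws_layer_memory(D, N):
    """Bytes needed by one dense layer on a PE, assuming
    (D + 1) * N bytes of int8 weights and biases plus
    4 * N bytes of int32 MAC results."""
    return (D + 1) * N + 4 * N

budget = 90 * 1024                   # SRAM assumed free for network data
layer1 = kws_layer_memory(390, 256)  # full first hidden layer: ~98.8 kB
half = kws_layer_memory(390, 128)    # after splitting across two PEs
print(layer1 / 1024, half / 1024)
```

The full layer exceeds the 90 kbyte budget while each half fits, matching the split shown in figure 6.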

5.1.2. Computation time and comparison with Loihi

In the keyword spotting benchmark, the computation times for the vector-matrix multiplication (Tmm) and the ReLU update (Trelu) are measured. After the measurement, polynomial models are fitted by minimizing the mean-squared error. Here we have adopted simple linear models to capture the behavior of the system depending on the model parameters. We found these models accurate enough for the parameter range considered in this work. Since the parameters are limited by hardware constraints such as SRAM memory anyway, adopting more complicated models does not seem necessary for our benchmarks. The number of clock cycles for the vector-matrix multiplication with the MAC array is found to be

Tmm = 0.13ND + 24.0D + 5.38N + 74.0    (5)

where N is the number of neurons and D is the number of input dimensions. The time for the vector-matrix multiplication itself is mostly reflected in 0.13ND. Before the vector-matrix multiplication starts, the inputs to the network need to be prepared for the MAC array; this pre-processing step is mostly reflected in 24.0D. After the vector-matrix multiplication, a post-processing step is necessary for the resulting neuron input current; its computation time depends on both D and N, which is reflected in 24.0D and 5.38N. For each of the computational steps, there is a constant overhead, which is reflected in the constant 74.0.

The computation time for ReLU update with Arm core is found to be

Equation (6)

The total time is

Ttotal = Tmm + Trelu    (7)

Based on equations (5)–(7) for computation time, with the keyword spotting network split into 3 PEs (figure 6), the computation of one time step consumes less than 21k clock cycles. With a safety margin of 4k clock cycles, one time step would take less than 25k clock cycles. When the PE is running at 250 MHz, this means the duration of one time step can be reduced to 0.1 ms. Since 10 time steps are combined into 1 time window to form one inference, a time step duration of 0.1 ms corresponds to 1000 inferences per second. In [4], 296 inferences per second have been reported for Loihi. One reason for the reduced speed of Loihi might be that the inputs to the neural network come from an FPGA, which could cause some latency, while the SpiNNaker 2 prototype uses inputs generated by one of the PEs of the same chip.
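Plugging the layer sizes of figure 6 into the fitted model of equation (5) illustrates the cycle budget (this counts only the MAC-array part; the ReLU time of equation (6) is not reproduced here, so these are lower bounds per PE):

```python
def t_mm(N, D):
    """Clock cycles for the MAC-array vector-matrix multiply,
    using the fitted model of equation (5)."""
    return 0.13 * N * D + 24.0 * D + 5.38 * N + 74.0

# Layer 1 is halved over two PEs (128 neurons, 390 inputs each);
# layer 2 runs on one PE (256 neurons, 256 inputs).
print(round(t_mm(128, 390)))   # cycles for one half of layer 1
print(round(t_mm(256, 256)))   # cycles for layer 2

# A 25k-cycle step at 250 MHz lasts 0.1 ms; with 10 steps per
# inference this gives the stated 1000 inferences per second.
print(250e6 / 25e3 / 10)   # 1000.0
```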

5.1.3. Energy measurement and comparison with Loihi

Both QPEs are used for the measurement. In each QPE, 3 PEs are switched on to simulate a keyword spotting network. The measured result is then divided by 2 to obtain the energy per network. The energy is measured incrementally, similar to previous measurements on SpiNNaker 1 [30] and on the first SpiNNaker 2 prototype [17]. When measuring the idle energy, the PLL is started and the software is running on the Arm cores. In each time step, after the timer tick interrupt wakes up the Arm core from the sleep mode, the Arm core only handles the interrupt itself, with no neural processing involved, and then it goes back to sleep mode. The result we present in this section is the active energy which is obtained by subtracting the idle energy from the total energy. The resulting active energy per inference is 7.1 μJ.

The keyword spotting network is implemented as a normal DNN on the SpiNNaker 2 prototype. The MAC array is used for the computation of the connection matrix, and the Arm core is used for the computation of the ReLU activation function. Since Loihi only supports SNNs, the spiking version of the keyword spotting network is implemented on Loihi. This could be the reason that the SpiNNaker 2 prototype consumes less energy per inference in the keyword spotting benchmark (table 1). Note that in [4], the reported energy per inference on Loihi was 270 μJ, including a 70 mW overhead presumably caused by the x86 processor on Loihi. In this work the overhead has been removed, which results in 37 μJ per inference.

Table 1. Comparison of the SpiNNaker 2 prototype (SpiNN) and Loihi for the keyword spotting task.

Hardware    Inferences/s    Energy/inference (μJ)
SpiNN       1000            7.1
Loihi       296             37

5.2. Adaptive control

5.2.1. Memory footprint

For an adaptive control network simulated on a PE, the required SRAM memory mainly consists of 4 parts: input weight matrix and bias memory, output weight matrix memory, neuron input current memory and neuron memory.

The input weight matrix and bias memory is the memory for storing the input weight matrix and bias, which are quantized as 8-bit integers. The required memory in bytes is

Mweight,in = (Din + 1)N    (8)

where Din is the number of input dimensions, N is the number of neurons.

The output weight matrix memory is the memory for storing the output weight matrix, which are 16 bit floating point numbers. The required memory in bytes is

Mweight,out = 2NDout    (9)

where Dout is the number of output dimensions.

The neuron input current memory is the memory for storing the results from the MAC array after the input processing is complete. Each input current is a 32 bit integer. The required memory in bytes is

Mcurrent = 4N    (10)

The neuron memory is the memory to hold the LIF neuron parameters like the membrane potential and refractory time. Each of them has 32 bits. The required memory in bytes is

Mneuron = 8N    (11)

The total memory for a neural network on a PE is

Mtotal = Mweight,in + Mweight,out + Mcurrent + Mneuron    (12)

Since it is assumed that each PE has 90 kbytes of SRAM memory available for the data of the neural network, the maximum number of output dimensions given the number of input dimensions and number of neurons in a neural network can be derived with equations (8)–(12). The result is shown in figure 8.
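Assuming the memory model of equations (8)–(12) amounts to (Din + 1)·N bytes of int8 input weights and biases, 2·N·Dout bytes of 16 bit output weights, 4·N bytes of int32 currents and 8·N bytes of LIF state (our reading of the text), solving the 90 kbyte budget for Dout reproduces the kind of bound plotted in figure 8:

```python
def max_output_dims(d_in, n, budget=90 * 1024):
    """Largest output dimension that fits the assumed per-PE memory
    model: (d_in + 1)*n + 2*n*d_out + 4*n + 8*n <= budget bytes."""
    fixed = (d_in + 1) * n + 4 * n + 8 * n
    return max(0, (budget - fixed) // (2 * n))

for d_in in (1, 10, 100):
    print(d_in, max_output_dims(d_in, 1024))
```

As expected, the bound shrinks as either the input dimension or the neuron count grows, and large networks with high input dimensions do not fit at all.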


Figure 8. Left: maximum number of output dimensions for each input dimension and number of neurons for a neural network simulated on a PE. Right: speedup of input processing time with the MAC array. Numbers in the legend indicate the number of neurons.


5.2.2. Computation time and comparison with Loihi

For adaptive control, the computation times for input processing (Ti_mlacc/Ti_no_mlacc), neuron update (Tn), output processing (To) and weight update (Tw) are measured. After the measurement, polynomial models can be fitted by minimizing the mean-squared error. For input processing with MAC array, the number of clock cycles is

Equation (13)

where N is the number of neurons and Din is the number of input dimensions. Equation (13) is very similar to equation (5), because the main computation is in both cases done by the MAC array. The difference is caused by the different data types. In keyword spotting, the inputs are assumed to be 8 bit integers, but in adaptive control, each input is assumed to be floating point. This is necessary because, as mentioned in section 3.2, the same implementation can in general be used as a building block for an NEF implementation on SpiNNaker 2 to construct large-scale cognitive models, so the input data type needs to match the output data type. Since the output weights are floating point, and their values change dynamically due to learning, an extra range check and an extra data type conversion are performed for each input value. This is reflected in 35.79Din and the constant 131.21.

The number of clock cycles without MAC array is

Equation (14)

The main benefit of the MAC array is reflected in the reduction of the 7.07NDin term in equation (14) to 0.13NDin in equation (13), which is made possible by the SIMD operation of the MAC array. The speedup is higher for higher dimensions. Figure 8 shows the speedup of the computation time for input processing with the MAC array compared to without it.
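For large N and Din the speedup is dominated by the ratio of the NDin coefficients, since the lower-order terms become negligible; a quick check of that asymptote:

```python
# Coefficient of the N * D_in term: 7.07 cycles per MAC on the Arm
# core alone (equation (14)) vs 0.13 with the MAC array (equation
# (13)). Lower-order terms are ignored, so this is only the
# asymptotic speedup for large networks and high input dimensions.
speedup = 7.07 / 0.13
print(round(speedup, 1))
```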

Unlike in keyword spotting, where the ReLU neuron model is used, in adaptive control, the LIF neuron model is used, which is the same as in Loihi. The neuron update time in terms of number of clock cycles is

Equation (15)

where P is the firing probability. The minus sign in −26.9NP is because during the refractory period, the computation needed is reduced. Since this is event based, it depends on P.

The output processing time is

Equation (16)

where Dout is the number of output dimensions.

The weight update time is

Equation (17)

The total time is

Ttotal = Ti + Tn + To + Tw    (18)

Since output processing and weight update are event based, the firing rate of 130 Hz (corresponding to a firing probability P of 0.13), which is used for comparing the SpiNNaker 2 prototype with Loihi, reduces the computation time by 87% compared to a non-event-based implementation.

Typically, the SpiNNaker system runs in real time with 1 ms time step. When the PE is running at 250 MHz, the available number of clock cycles for each time step is 250 000, which is the computational constraint. According to equation (18), for the range of the parameters shown in figure 8, the computation can be done within 1 ms. So the maximum implementable size of a network on a single PE in this benchmark is constrained by memory rather than computation.

For the adaptive control benchmark task with different numbers of input dimensions, output dimensions and neurons, the duration of a time step on the SpiNNaker 2 prototype and on Loihi is compared in figure 9, with the mean population firing rate kept at around 130 Hz on both platforms. Here the duration of a time step for the SpiNNaker 2 prototype refers to the time for the PE to complete the computation of a time step. From the comparison it is clear that for small numbers of input dimensions, Loihi is faster than the SpiNNaker 2 prototype, and for large numbers of input dimensions, the SpiNNaker 2 prototype is faster than Loihi. The maximum ratio of the duration of a time step between the two platforms is summarized in table 2.


Figure 9. Duration of a time step of SpiNNaker 2 prototype (with strips) and Loihi (without strips) for different number of neurons per core, different input and output dimensions for the adaptive control benchmark. No measurement result for the SpiNNaker 2 prototype is shown where the implementation is limited by memory.


Table 2. Maximum ratio of duration of a time step between the SpiNNaker 2 prototype (SpiNN) and Loihi for the adaptive control task.

Input dimensions                           1          100
Output dimensions                          1          1
Number of neurons                          1024       512
Duration of a time step (SpiNN : Loihi)    1 : 0.37   0.49 : 1

Because of the MAC array, the computation time of the SpiNNaker 2 prototype increases less rapidly with the number of input dimensions, so that the SpiNNaker 2 prototype could catch up with Loihi in terms of computation time for higher input dimensions.

5.2.3. Energy measurement and comparison with Loihi

The energy consumption of the SpiNNaker 2 prototype and Loihi is measured with the same parameters as in the computation time comparison. The result is shown in figure 10. Similar to section 5.1.3, only the active energy is shown. For small numbers of input dimensions, Loihi is more energy efficient than the SpiNNaker 2 prototype, and for large numbers of input dimensions, the SpiNNaker 2 prototype is more energy efficient than Loihi. The maximum ratio of active energy consumption between the two platforms is summarized in table 3.


Figure 10. Active energy of SpiNNaker 2 prototype (with strips) and Loihi (without strips) for different number of neurons per core, different input and output dimensions for the adaptive control benchmark. No measurement result for the SpiNNaker 2 prototype is shown where the implementation is limited by memory.


Table 3. Maximum ratio of active energy consumption between the SpiNNaker 2 prototype (SpiNN) and Loihi for the adaptive control task.

Input dimensions                     1          100
Output dimensions                    1          1
Number of neurons                    1024       512
Active energy (SpiNN : Loihi)        1 : 0.81   0.36 : 1

As in the computation time comparison, the benefit of the MAC array shows especially for high input dimensions, where the MAC array is used more extensively. This is made clearer by the energy breakdown in figure 11: the input processing energy increases with the number of input dimensions (for a fixed number of neurons and output dimensions), the neuron update energy increases with the number of neurons (for fixed input and output dimensions), and the output processing and weight update energy increases with the number of output dimensions (for a fixed number of input dimensions and neurons).


Figure 11. Breakdown of the energy consumption per core per time step of the SpiNNaker 2 prototype into four components: input processing, neuron update, output processing and weight update.

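The structure of this four-way breakdown can be sketched with a toy cost model. The per-operation energy coefficients below are placeholders chosen for illustration, not measured values from either chip:

```python
def energy_per_step(d_in, n_neurons, d_out,
                    e_in=1.0, e_neuron=2.0, e_out=1.5, e_learn=1.5):
    """Toy per-core, per-time-step energy model with made-up coefficients.
    Each component scales with the dimension named in the text."""
    return {
        "input processing":  e_in * d_in * n_neurons,    # grows with input dims
        "neuron update":     e_neuron * n_neurons,       # grows with neuron count
        "output processing": e_out * d_out * n_neurons,  # grows with output dims
        "weight update":     e_learn * d_out * n_neurons # learning scales with outputs
    }
```

The point of the sketch is only the scaling: raising one of the three dimensions increases its associated component(s) while leaving the others untouched, which is the pattern visible in figure 11.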

5.2.4. Robotic demo

The SpiNNaker 2 prototype running the adaptive control benchmark is connected to a robotic arm built with the Lego Mindstorms EV3 robot kit. The setup is based on [22]. The input to the neural network is the position and velocity of the motor, and the output of the neural network is a motor control signal that is combined with the PD controller output, as described in section 3.2.

In this demo we consider two situations: the normal case and the simulated aging case (figure 12, upper part). In the simulated aging case an extra weight is added to the robotic arm to resemble aging effects or an unknown disturbance. In each case the performance of the adaptive controller is compared with that of a standard PID controller. In the normal case both controllers perform equally well, but in the simulated aging case the PID controller cannot adapt to the new situation, while the adaptive controller learns from the error feedback and adapts its parameters to improve performance (figure 12, middle part). The difference between the two controllers is made clearer by the root mean squared error (RMSE) (figure 12, lower part).


Figure 12. Robotic demo. Upper part: in the normal case (left), no extra weight is attached to the robotic arm; in the simulated aging case (right), an extra weight is attached to resemble the aging effect. Middle part: performance of the PID controller and the adaptive controller in both cases; the y-axis shows the normalized angle position of the motor. In the normal case both controllers perform well, but in the simulated aging case the PID controller cannot adapt to the new situation, while the adaptive controller improves its performance through adaptation. Lower part: RMSE of both controllers. In each trial the robotic arm attempts to reach either the upper or the lower position; the error is measured as the difference between the actual and the target position once the arm has finished transitioning between the two positions. The mean RMSE and the standard deviation are shown for 10 runs of 120 trials each. The extra weight is added to the arm during the 60th trial. The curves are smoothed with a moving average with a window size of 4.

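The per-trial error metric and the smoothing used for figure 12 can be sketched as follows. This is a minimal Python illustration of the RMSE and the window-4 moving average, not the code used for the demo:

```python
import math

def rmse(errors):
    """Root mean squared error over the position errors of one trial."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def moving_average(values, window=4):
    """Smooth a per-trial RMSE curve with a trailing window,
    shrinking the window at the start of the sequence."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Toy usage: a step in the raw per-trial error (e.g. when the extra
# weight is attached) is spread over `window` trials after smoothing.
print(moving_average([1, 1, 1, 5, 5, 5], window=4))
```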

6. Discussion

In this section, we consider the suitability of other neuromorphic platforms for implementing the benchmarks in this work. Since the comparison between the SpiNNaker 2 prototype and Loihi has already been extensively discussed in previous sections, we leave the summary of this comparison to the conclusion section.

6.1. Comparison with SpiNNaker 1

We assume that the benchmarks in this work could also be implemented on SpiNNaker 1. However, since SpiNNaker 1 has no MAC array, the vector-matrix multiplication would be much slower and therefore consume much more energy than on the SpiNNaker 2 prototype. Figure 8 indicates the speedup in terms of the number of clock cycles for the vector-matrix multiplication on the SpiNNaker 2 prototype compared to what it would be on SpiNNaker 1. Differences in fabrication technology, supply voltage etc. further increase the gap between the SpiNNaker 2 prototype and SpiNNaker 1.

6.2. Comparison with other neuromorphic platforms

To ease the discussion, we group neuromorphic platforms into three categories:

  (a) neuromorphic platforms with static synapses, such as TrueNorth [31], NeuroGrid [32], Braindrop [33], HiAER-IFAT [34], DYNAPs [35], Tianjic [36], NeuroSoC [37] and DeepSouth [38];
  (b) neuromorphic platforms with configurable (but not programmable) plasticity, such as ROLLS [39], ODIN [40] and TITAN [41];
  (c) neuromorphic platforms with programmable plasticity, such as (besides SpiNNaker 1/2 and Loihi) the BrainScaleS 1/2 systems [42, 43].

We assume that all three groups of neuromorphic platforms should be able to implement the keyword spotting benchmark in this work. However, DNNs cannot be directly implemented on these platforms since they only support SNNs (except Tianjic, which also supports DNNs); a solution similar to the SNN version implemented on Loihi would be an option.

For adaptive control, since learning is involved, we assume the neuromorphic platforms in group (a) cannot support this benchmark. It would still be possible to have an external host PC reprogram the synaptic weights, but that would not be suitable for embedded applications.

Although the learning rule in adaptive control is relatively simple, it involves multiplying an external error signal with the activity of the presynaptic neuron in every time step, which is quite different from the learning rules normally supported in the neuromorphic community, such as spike-timing dependent plasticity (STDP) [44] or spike-driven synaptic plasticity (SDSP) [45]. We therefore assume that the neuromorphic platforms in group (b) could not implement the adaptive control benchmark.
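A minimal sketch of such an error-driven rule (illustrative variable names and learning rate; not the implementation used on either chip): in every time step each weight is decremented by the product of the external error signal and the corresponding presynaptic activity, i.e. an outer product of the two vectors:

```python
def weight_update(weights, error, activity, lr=1e-4):
    """In-place delta-style update: w[j][i] -= lr * error[j] * activity[i].

    weights  : list of rows, one row per output dimension
    error    : external error signal, one entry per output dimension
    activity : presynaptic neuron activities for this time step
    """
    for j, e in enumerate(error):
        for i, a in enumerate(activity):
            weights[j][i] -= lr * e * a
    return weights
```

Unlike STDP or SDSP, nothing here is gated by spike timing: the update happens every time step and depends on a signal (the error) that is external to the pre/post neuron pair, which is why it needs programmable rather than merely configurable plasticity.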

The BrainScaleS 2 system in group (c) comes with programmable plasticity, but since the neural network runs in accelerated time, it is unclear whether the neural activity of each time step can be used for the weight update. It is also unclear how to interface robotic applications, which require real-time responses, with a neural network running in accelerated time.

7. Conclusion

The PE of the SpiNNaker 2 prototype consists of a general purpose processor plus highly efficient accelerators, while Loihi employs dedicated circuits for neuron and synapse models plus a flexible learning engine. In this work, we compare the two platforms on the same applications, namely keyword spotting and adaptive control.

For keyword spotting, thanks to the MAC array used for vector-matrix multiplication and the Arm core used for the ReLU activation, the DNN version of the keyword spotting network can be directly implemented on the SpiNNaker 2 prototype, whereas on Loihi the SNN version is implemented for the same task. The result is faster inference and higher energy efficiency on the SpiNNaker 2 prototype.

For adaptive control, both the SpiNNaker 2 prototype and Loihi are efficient in specific parameter regions. The SpiNNaker 2 prototype is more efficient than Loihi regarding both computation time and active energy when the number of input dimensions is high, because there the vector-matrix multiplication dominates and the MAC array is used most extensively. Conversely, the SpiNNaker 2 prototype is less efficient than Loihi when the number of input dimensions is low, because there the vector-matrix multiplication is simpler and the Arm core dominates.

Through the comparison of the SpiNNaker 2 prototype and Loihi on these two benchmarks, we aim to bring more insight into the SpiNNaker 2 system and highlight the benefit of the MAC array in neuromorphic applications. Since both SpiNNaker 2 and Loihi have very wide application fields, the two benchmarks in this work are by no means a comprehensive comparison of the two platforms; comparisons on other benchmarks are beyond the scope of this work and left for future work.

AI and neuroscience have inspired each other for decades. DNNs have proven successful in a number of application areas, while SNNs remain a more biologically plausible model with potential for efficient computation. The machine learning community has developed dedicated hardware platforms for DNNs, and the neuromorphic community has developed various hardware platforms for SNNs. By mixing DNNs and SNNs on the algorithm and hardware levels, we attempt to increase the exchange between these disciplines and contribute to both communities.

Acknowledgements

The research leading to these results has received funding from the European Union (EU) Seventh Framework Programme (FP7) under Grant Agreement No. 604102, the EU's Horizon 2020 research and innovation programme under Grant Agreements Nos. 720270 and 785907 (Human Brain Project, HBP), Intel Corporation, the Canada Research Chairs Program, Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 261453, and the National Research Council Canada (NRCC) at the University of Waterloo. The authors thank Arm and Racyics GmbH for IP. For X Choo and C Eliasmith their Loihi results described here were supported in part by funding from Intel Corporation.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).
