
Purposes and Goals

The group met for the first session on the evening of 5 May to go over purposes and goals.

Alexander Rast suggested the concept of 4 independent classes of benchmark:

1) Benchmarks of raw hardware performance/speed
2) Benchmarks of resource utilisation (e.g. power)
3) Benchmarks of application performance (e.g. recognition accuracy)
4) Benchmarks of real-world fidelity (e.g. spike matching)

In general this was thought to be a reasonable summary, with the possible addition of benchmarks of scalability, although scalability could also be included implicitly under resource utilisation. It was also felt that the fourth class of benchmark would be of relatively limited importance.

Michael Pfeiffer went on to emphasise the purpose of having neuromorphic-specific benchmarks: many modern benchmarks do not take operating constraints into account (e.g. power, area or space budget) and are often characterised by a single-minded focus on one figure of merit. Such benchmarks tend to reward "brute-force" approaches using conventional processing that simply throw more hardware at the problem. An example comes from the field of classification, where many benchmarks consider only ultimate classification accuracy at the expense of all other goals, even if achieving it means running a supercomputer for a month (with the associated power usage). He also felt that the design goals of neuromorphic systems - adaptability, scalability, power efficiency - are not being well captured. Potentially there are problems that neuromorphic systems can solve that conventional systems are in fact unable to solve (although he did not enumerate what these might be); could these be captured in a set of benchmarks? Overall, he indicated that the goal was to develop benchmarks that could put neuromorphic systems in a favourable light compared to conventional systems; current benchmarks tend to emphasise the strengths of conventional platforms, or at least provide no measure on which neuromorphic systems can be expected to do substantially better.

Ryad Benosman also commented on the need for good benchmarking datasets, including input from AER-generating devices such as DVS retinas. He suggested that in the future there might be a common repository of datasets that go beyond the classic ones (e.g. MNIST) and can be represented in a spike format. It was agreed by all that one particular weakness of current datasets is a tendency to represent static data rather than dynamic information, which particularly favours large systems that can spend a long time on off-line methods.

In subsequent discussion it was felt that perhaps the most important goal of neuromorphic benchmarks should be to highlight the strengths of neuromorphic chips in areas where conventional High Performance Computing cannot be expected to do well. While HPC systems can probably always perform competitively on absolute measures of speed or recognition accuracy, they may not do so well on constrained problems where power or time are relevant, and the benchmarks should thus reflect such constrained situations, in which neuromorphic systems operate within their design envelope.

Practical Benchmarks

The group met for the second time on the evening of 7 May to discuss practical benchmarks.

Alexander Rast suggested the concept of a tradeoff space: measuring how one measure of performance or resource utilisation varies with the others, and quoting the complete (or at least a partial) tradeoff space for neuromorphic chips within their operating regime, to indicate true performance. While this was felt to be potentially useful, and certainly to capture the essential aspect of neuromorphic benchmarking that distinguishes it from existing benchmarks, others expressed concern that such a presentation, by not reducing things to a single figure, has the potential to be misleading or confusing, and could be quoted too narrowly so as to give a distorted view of actual performance. Such benchmarks would also be difficult to implement in practice, particularly with respect to the relative units of the different measures.
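
As a rough illustration of what quoting a tradeoff space could look like in practice (this sketch was not part of the discussion; the operating-point labels and figures below are invented purely for illustration), one might record each operating point of a chip along several axes - power, latency, accuracy - and report the set of points not dominated on any axis, rather than a single figure of merit:

    # Minimal sketch: report a (partial) tradeoff space instead of one number.
    # All device settings and values are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class OperatingPoint:
        label: str         # e.g. a clock/voltage setting or network configuration
        power_mw: float    # resource utilisation: average power in milliwatts
        latency_ms: float  # raw performance: time to process one input
        accuracy: float    # application performance: classification accuracy

    def dominates(a, b):
        """True if a is at least as good as b on every axis and strictly better on one."""
        no_worse = (a.power_mw <= b.power_mw and a.latency_ms <= b.latency_ms
                    and a.accuracy >= b.accuracy)
        better = (a.power_mw < b.power_mw or a.latency_ms < b.latency_ms
                  or a.accuracy > b.accuracy)
        return no_worse and better

    def pareto_frontier(points):
        """Keep only the operating points not dominated by any other point."""
        return [p for p in points if not any(dominates(q, p) for q in points)]

    space = [
        OperatingPoint("low-power", power_mw=45.0,  latency_ms=12.0, accuracy=0.91),
        OperatingPoint("balanced",  power_mw=120.0, latency_ms=5.0,  accuracy=0.94),
        OperatingPoint("max-speed", power_mw=300.0, latency_ms=2.0,  accuracy=0.94),
        OperatingPoint("wasteful",  power_mw=310.0, latency_ms=6.0,  accuracy=0.90),
    ]
    for p in pareto_frontier(space):
        print(p)

Quoting the non-dominated set preserves the constraint-dependent nature of the comparison: which point counts as "best" depends on the power or time budget of the application rather than on a single headline number.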

A discussion followed on benchmark suites vs. standalone benchmarks. It was emphasised again that one factor contributing to the success of existing benchmarks was the reduction of the problem to a single, easily quotable figure, and the value and acceptance enjoyed by standard benchmark suites (e.g. SPEC) were felt to be a desirable end. In the discussion the group generally favoured benchmark suites, but with specific tests chosen to emphasise the strengths of neuromorphic computing, on which conventional computing platforms and techniques would probably perform badly. Generally speaking it was thought that such tasks would probably come from the embedded and real-time domains, although no specific benchmarks were suggested.

Adrian Whatley brought up the problem of the mismatch between the types of data output by spiking, event-based input devices like DVS retinas and existing tests which, in the video domain, are based on frame-based output; such tests would give irrelevant results when the input is event-based. An extensive discussion followed. The main question to be answered was: are video/graphical datasets based on frame-based imaging techniques suitable for neuromorphic benchmarking, or does event-based input constitute a genuinely separate class of data? The group quickly concluded that a representation-independent input format could not be devised. Eventually it was decided that perhaps the best solution for benchmarking would be to provide datasets such as video data in both event-based and frame-based form where available. The same concept could be extended to other input devices, such as audio files/cochlea spike trains, etc. Such an approach would make it possible to make fair comparisons against conventional systems without unduly penalising neuromorphic systems for using a different input representation.
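
As an illustration of how one recording might be offered in both representations (a sketch only; the event tuple layout, 128x128 resolution and 10 ms bin width below are assumptions, not an agreed format), event-based data from a DVS-style sensor can be accumulated into fixed-duration frames to produce the frame-based variant of a dataset:

    # Minimal sketch, assuming DVS-style events given as (x, y, timestamp_us, polarity).
    # Resolution and bin width are arbitrary illustrative choices.
    import numpy as np

    def events_to_frames(events, width=128, height=128, bin_us=10_000):
        """Accumulate polarity events into signed frames of fixed duration."""
        events = sorted(events, key=lambda e: e[2])      # order by timestamp
        if not events:
            return np.zeros((0, height, width), dtype=np.int16)
        t0 = events[0][2]
        n_bins = (events[-1][2] - t0) // bin_us + 1
        frames = np.zeros((n_bins, height, width), dtype=np.int16)
        for x, y, t, polarity in events:
            b = (t - t0) // bin_us
            frames[b, y, x] += 1 if polarity else -1     # ON events add, OFF subtract
        return frames

    # Example: three synthetic events collapsed into frame-based form.
    example = [(10, 20, 0, 1), (10, 21, 3_000, 0), (64, 64, 12_000, 1)]
    print(events_to_frames(example).shape)               # (2, 128, 128): two 10 ms frames

Deriving both forms from the same source recording would let conventional, frame-based pipelines and event-based neuromorphic systems be scored on identical underlying data.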
