Apple is heavily committed to asymmetric multiprocessing (AMP) in its own chips, and in future Macs, iPhones and iPads. With four ‘Firestorm’ performance and four ‘Icestorm’ efficiency cores in its M1 SoC, several researchers have been working to establish the differences between them in terms of structural units, behaviour and performance. For example, Dougall Johnson has meticulously documented them, with measurements for each instruction. Others, including Maynard Handley, have been building a detailed picture of the many techniques which these cores use to achieve their performance.
What currently seems harder to establish is the difference in overall performance across more typical code. In real-world use, what are the penalties for processes running on Icestorm rather than Firestorm cores? Here I report one initial comparison, of performance when calculating floating-point dot products, a task which you might not consider a good fit for the Icestorm.
Central to this is my previous observation that different Quality-of-Service (QoS) settings for processes determine which cores they are run on. OperationQueue processes given a QoS of 17 or higher are invariably run by macOS 11 and 12 on Firestorm cores (and can load Icestorms too), while those with a QoS of 9 are invariably run only on Icestorm cores. Those might change in the face of extreme loading of either core pool, but when there are few other active processes it appears consistent.
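As a minimal sketch of how a QoS can be attached to an OperationQueue, the following assumes the standard Foundation QualityOfService cases, of which .background corresponds to a QoS of 9 and .utility to 17 on macOS. The helper makeQueue and the queue names are hypothetical, chosen only for illustration, and this isn't the code used in these tests.

```swift
import Foundation

// Hypothetical helper: builds a queue whose operations all run at one QoS.
func makeQueue(qos: QualityOfService) -> OperationQueue {
    let queue = OperationQueue()
    queue.qualityOfService = qos
    return queue
}

// .background is QoS 9 on macOS, .utility is QoS 17.
let icestormQueue = makeQueue(qos: .background)
let firestormQueue = makeQueue(qos: .utility)

icestormQueue.addOperation {
    // Work placed here would be expected to run only on Icestorm cores.
}
firestormQueue.addOperation {
    // Work placed here would be expected to run on Firestorm cores.
}

icestormQueue.waitUntilAllOperationsAreFinished()
firestormQueue.waitUntilAllOperationsAreFinished()
```

Which cores are actually used remains the decision of macOS, so the QoS is best treated as a strong hint rather than a guarantee.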
Rather than use a test harness such as that developed by Dougall Johnson, these tests were performed in regular macOS running with Full Security enabled on a stock system without any third-party kernel or system extensions. Execution times were measured using Mach ticks, and converted to seconds. The number of processes allowed in the OperationQueue was constrained to 4, to try to limit core use to a single pool.
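One way to time work in Mach ticks and convert the result to seconds is sketched below. It assumes the Darwin calls mach_absolute_time() and mach_timebase_info(), so the measuring part only compiles on Apple platforms. The function names seconds(fromTicks:numer:denom:) and secondsTaken(by:) are hypothetical, and the empty operations stand in for the real test loops.

```swift
import Foundation

// Pure conversion: ticks * numer / denom gives nanoseconds.
func seconds(fromTicks ticks: UInt64, numer: UInt32, denom: UInt32) -> Double {
    return Double(ticks) * Double(numer) / Double(denom) / 1_000_000_000
}

#if canImport(Darwin)
// Hypothetical helper: times a block in Mach ticks, returning seconds.
func secondsTaken(by work: () -> Void) -> Double {
    var timebase = mach_timebase_info_data_t()
    mach_timebase_info(&timebase)   // fetch the ticks-to-nanoseconds ratio
    let start = mach_absolute_time()
    work()
    let ticks = mach_absolute_time() - start
    return seconds(fromTicks: ticks, numer: timebase.numer, denom: timebase.denom)
}

let queue = OperationQueue()
queue.maxConcurrentOperationCount = 4   // at most four operations at once

let elapsed = secondsTaken {
    for _ in 0..<4 {
        queue.addOperation { /* the test loop would go here */ }
    }
    queue.waitUntilAllOperationsAreFinished()
}
print("Elapsed: \(elapsed) s")
#endif
```

Keeping the conversion separate from the measurement makes it easy to check the arithmetic without needing the hardware counter.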
Four different methods were used to calculate dot products on Swift Float (32-bit floating-point, C float) numbers:
- a tight loop of assembly language using mixed SIMD instructions on 4-wide arrays of single-precision floating-point numbers;
- the Apple simd (a relative of the Accelerate libraries) call simd_dot() on two simd_float4 arrays, using Swift;
- simple Swift using nested for loops;
- a more ‘idiomatic’ Swift nested loop using map and reduce.
Code for each is given in the Appendix below.
Does setting QoS control which cores are used?
Core load was observed using Activity Monitor. In every run, tests performed with a QoS of 9 only loaded the Icestorm cores, and those with higher QoS only the Firestorm cores. The screenshot below shows a series (from the left) in which four alternating QoS settings were used. At no time did any test appear to pass any load to the other pool of cores.
Performance
Times taken were measured on a range of iterations, and appeared most consistent and comparable for 10^8 iterations of the dot product calculation. On Firestorm cores, this was fastest using the simd (Accelerate) library, which took 0.0938 seconds, then for the assembly language (0.142 s) and simple Swift (0.451 s). ‘Idiomatic’ Swift took much longer, at 15.7 seconds. That is consistent with my previous results from tests which didn’t control or observe which cores they were run on.
On the Icestorm cores, assembly language was fastest (0.271 seconds), then simd (Accelerate) (0.309 s), simple Swift (1.27 s), and ‘idiomatic’ Swift (86.3 s).
Expressed as a percentage of the time taken on Firestorm cores, the times taken on Icestorm cores were:
- 190% running assembly language
- 330% running simd (Accelerate) library functions
- 280% running simple Swift
- 550% running ‘idiomatic’ Swift
where 100% would be the same time as the Firestorm core, and 200% would be twice that time.
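The arithmetic behind those percentages can be checked with a short sketch, using the times reported above. The helper icestormPercent is a hypothetical name, and the figures in the list are these ratios rounded to two significant figures.

```swift
// Times in seconds, copied from the results above.
let timings: [(method: String, firestorm: Double, icestorm: Double)] = [
    ("assembly language", 0.142, 0.271),
    ("simd library", 0.0938, 0.309),
    ("simple Swift", 0.451, 1.27),
    ("'idiomatic' Swift", 15.7, 86.3),
]

// Hypothetical helper: Icestorm time as a percentage of Firestorm time,
// so 100 means the same time and 200 means twice as long.
func icestormPercent(firestorm: Double, icestorm: Double) -> Double {
    return 100.0 * icestorm / firestorm
}

for t in timings {
    let percent = icestormPercent(firestorm: t.firestorm, icestorm: t.icestorm)
    print("\(t.method): \(Int(percent.rounded()))%")
}
```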
My previous comparison between compression performed by AppleArchive using all eight cores and only Icestorm cores showed the latter was far slower (717%). These results show that, at their best, Icestorm cores can run SIMD vector arithmetic at slightly better than half the ‘speed’ of the Firestorm cores. Although I suspect that Apple’s simd library isn’t optimised for the Icestorm, it achieved a third of the ‘speed’ of a Firestorm when run on Icestorm.
Maynard Handley previously commented that Icestorm cores use about 10% of the power (net 25% of energy) of Firestorm cores. For SIMD vector arithmetic, at least, they perform extremely well for their economy. In the M1, multiprocessing isn’t always as asymmetric as you might expect.
Appendix: Code used in the iterative loop
In each case, the first section of code calculates the dot product itself, following which the values in one of the arrays are incremented ready for the next run through the loop.
Assembly language:
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
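For those unfamiliar with FADDP, the following Swift sketch models what the first three instructions do, lane by lane. It is only an illustration of the arithmetic, not the code that was timed: the function faddp and the sample values are hypothetical, and it assumes the vector form of FADDP, which concatenates its two source registers and sums adjacent pairs.

```swift
// Models the 4-lane vector form of FADDP: the two inputs are concatenated
// (first operand in the low lanes) and adjacent pairs are added together.
func faddp(_ n: [Float], _ m: [Float]) -> [Float] {
    return [n[0] + n[1], n[2] + n[3], m[0] + m[1], m[2] + m[3]]
}

let v2: [Float] = [1, 2, 3, 4]   // illustrative values for vA
let v3: [Float] = [5, 6, 7, 8]   // illustrative values for vB

let v1 = zip(v2, v3).map { $0 * $1 }   // FMUL: lane-wise products
let pairs = faddp(v1, v1)              // first FADDP: sums of adjacent pairs
let v0 = faddp(pairs, pairs)           // second FADDP: every lane holds the total

print(v0[0])   // the dot product
```

After the second pairwise addition each lane of V0 contains the complete dot product, so any lane can be read as the result.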
simd (Accelerate) library:
tempA = simd_dot(vA, vB)
vA = vA + vC
Simple Swift:
tempA = 0.0
for i in 0...3 {
    tempA += vA[i] * vB[i]
}
for i in 0...3 {
    vA[i] = vA[i] + vC[i]
}
‘Idiomatic’ Swift:
tempA = zip(vA, vB).map(*).reduce(0, +)
for (index, value) in vA.enumerated() {
    vA[index] = value + vC[index]
}
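To confirm that the two pure Swift variants compute the same result, they can be wrapped in functions and run on sample data, as in this minimal sketch. The function names dotSimple and dotIdiomatic and the values in the arrays are hypothetical, and the closure form of map is used here in place of map(*) purely to keep type inference simple.

```swift
// Simple Swift: an explicit loop over the four elements.
func dotSimple(_ a: [Float], _ b: [Float]) -> Float {
    var sum: Float = 0.0
    for i in 0...3 {
        sum += a[i] * b[i]
    }
    return sum
}

// 'Idiomatic' Swift: pair the elements, multiply, then sum.
func dotIdiomatic(_ a: [Float], _ b: [Float]) -> Float {
    return zip(a, b).map { $0 * $1 }.reduce(0, +)
}

var vA: [Float] = [1, 2, 3, 4]   // illustrative values
let vB: [Float] = [5, 6, 7, 8]
let vC: [Float] = [1, 1, 1, 1]

let first = dotSimple(vA, vB)

// Increment one array ready for the next run through the loop.
for i in 0...3 {
    vA[i] = vA[i] + vC[i]
}

let second = dotIdiomatic(vA, vB)
print(first, second)
```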