• Anandtech's Apple M1 SoC Deep Dive

    From Jolly Roger@jollyroger@pobox.com to comp.sys.mac.system on Wednesday, November 18, 2020 20:54:10
    From Newsgroup: comp.sys.mac.system

    Today, Apple has unveiled their brand-new MacBook line-up. This isn't an ordinary release – if anything, the move that Apple is making today is something that hasn't happened in 15 years: The start of a CPU
    architecture transition across their whole consumer Mac line-up.

    Thanks to the company's vertical integration across hardware and
    software, this is a monumental change that nobody but Apple can so
    swiftly usher in. The last time Apple ventured into such an undertaking
    in 2006, the company had ditched IBM's PowerPC ISA and processors in
    favor of Intel x86 designs. Today, Intel is being ditched in favor of
    the company's own in-house processors and CPU microarchitectures, built
    upon the Arm ISA.

    The new processor is called the Apple M1, the company's first SoC
    designed with Macs in mind. With four large performance cores, four
    efficiency cores, and an 8-core GPU, it features 16 billion
    transistors on a 5nm process node. Apple is starting a new SoC naming
    scheme for this new family of processors, but at least on paper it looks
    a lot like an A14X.

    Today's event contained a ton of new official announcements, but was
    also (in typical Apple fashion) lacking in detail. Today, we're going to
    be dissecting the new Apple M1 news, as well as doing a
    microarchitectural deep dive based on the already-released Apple A14
    SoC.

    *The Apple M1 SoC: An A14X for Macs*

    The new Apple M1 is really the start of a new major journey for Apple.
    During Apple's presentation the company didn't really divulge much in
    the way of details for the design; however, there was one slide that
    told us a lot about the chip's packaging and architecture:

    <https://images.anandtech.com/doci/16226/2020-11-10%2019_08_48_575px.jpg>

    This packaging style, with DRAM embedded within the organic packaging,
    isn't new for Apple; they've been using it since the A12. However, it's
    something that's only used sparingly. When it comes to higher-end chips,
    Apple likes to use this kind of packaging instead of your usual
    smartphone POP (package on package) because these chips are designed
    with higher TDPs in mind. So keeping the DRAM off to the side of the
    compute die rather than on top of it helps to ensure that these chips
    can still be efficiently cooled.

    What this also means is that we're almost certainly looking at a 128-bit
    DRAM bus on the new chip, much like that of previous generation A-X
    chips.

    On the very same slide, Apple also seems to have used an actual die shot
    of the new M1 chip. It perfectly matches Apple's described
    characteristics of the chip, and it looks like a real photograph
    of the die. Cue what's probably the quickest die annotation I've ever
    made:

    <https://images.anandtech.com/doci/16226/M1_575px.png>

    We can see the M1's four Firestorm high-performance CPU cores on the
    left side. Notice the large amount of cache – the 12MB cache was one of
    the surprise reveals of the event, as the A14 still only featured 8MB of
    L2 cache. The new cache here looks to be portioned into 3 larger blocks,
    which makes sense given Apple's transition from 8MB to 12MB for this new
    configuration; it is, after all, now being used by 4 cores instead of 2.

    Meanwhile the 4 Icestorm efficiency cores are found near the center of
    the SoC, above which we find the SoC's system level cache, which is
    shared across all IP blocks.

    Finally, the 8-core GPU takes up a significant amount of die space and
    is found in the upper part of this die shot.

    What's most interesting about the M1 here is how it compares to other
    CPU designs by Intel and AMD. All the aforementioned blocks still only
    cover part of the whole die, with a significant amount of auxiliary
    IP. Apple made mention that the M1 is a true SoC, including the
    functionality of what previously was several discrete chips inside of
    Mac laptops, such as I/O controllers and Apple's SSD and security
    controllers.

    <https://images.anandtech.com/doci/16226/2020-11-10%2019_10_00_575px.jpg>

    The new CPU core is what Apple claims to be the world's fastest. This is
    going to be a centre-point of today's article as we dive deeper into the
    microarchitecture of the Firestorm cores, as well as look at the
    performance figures of the very similar Apple A14 SoC.

    With its additional cache, we expect the Firestorm cores used in the M1
    to be even faster than what we're going to be dissecting today with the
    A14, so Apple's claim of having the fastest CPU core in the world seems extremely plausible.

    The whole SoC features a massive 16 billion transistors, which is 35%
    more than the A14 inside of the newest iPhones. If Apple was able to
    keep the transistor density between the two chips similar, we should
    expect a die size of around 120mm². This would be considerably smaller
    than past generations of Intel chips inside of Apple's MacBooks.
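
    As a quick sanity check of that estimate: assuming the A14's publicly
    reported figures of roughly 11.8 billion transistors on an ~88mm² die
    (figures we're assuming from public reporting, not numbers Apple gave
    alongside the M1), scaling area linearly with transistor count lands
    almost exactly on that number. A minimal sketch of the arithmetic, in C:

        #include <stdio.h>

        int main(void) {
            /* Assumed A14 figures from public reporting: ~11.8B transistors
               on a die of roughly 88 mm^2. */
            double a14_transistors = 11.8e9;
            double a14_die_mm2     = 88.0;
            double m1_transistors  = 16e9;   /* Apple's stated M1 figure */

            /* Equal density => area scales linearly with transistor count. */
            double m1_die_mm2 = a14_die_mm2 * (m1_transistors / a14_transistors);
            printf("Estimated M1 die size: ~%.0f mm^2\n", m1_die_mm2); /* ~119 */
            return 0;
        }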

    *Road To Arm: Second Verse, Same As The First*

    Section by Ryan Smith

    The fact that Apple can even pull off a major architectural transition
    so seamlessly is a small miracle, and one that Apple has quite a bit of
    experience in accomplishing. After all, this is not Apple's first time
    switching CPU architectures for their Mac computers.

    The long-time PowerPC company came to a crossroads around the middle of
    the 2000s when the Apple-IBM-Motorola (AIM) alliance, responsible for
    PowerPC development, increasingly struggled with further chip
    development. IBM's PowerPC 970 (G5) chip put up respectable performance
    numbers in desktops, but its power consumption was significant. This
    left the chip non-viable for use in the growing laptop segment, where
    Apple was still using Motorola's PowerPC 7400 series (G4) chips, which
    did have better power consumption, but not the performance needed to
    rival what Intel would eventually achieve with its Core series of
    processors.

    And thus, Apple played a card that they held in reserve: Project
    Marklar. Leveraging the flexibility of Mac OS X and its underlying
    Darwin kernel, which like other Unixes is designed to be portable, Apple
    had been maintaining an x86 version of Mac OS X. Though largely
    considered to initially have been an exercise in good coding practices –
    making sure Apple was writing OS code that wasn't unnecessarily bound to
    PowerPC and its big-endian memory model – Marklar became Apple's exit
    strategy from a stagnating PowerPC ecosystem. The company would switch
    to x86 processors – specifically, Intel's x86 processors – upending its software ecosystem, but also opening the door to much better performance
    and new customer opportunities.

    The switch to x86 was by all metrics a big win for Apple. Intel's
    processors delivered better performance-per-watt than the PowerPC
    processors that Apple left behind, and especially once Intel launched
    the Core 2 (Conroe) series of processors in late 2006, Intel firmly
    established itself as the dominant force for PC processors. This
    ultimately set up Apple's trajectory over the coming years, allowing them
    to become a laptop-focused company with proto-ultrabooks (MacBook Air)
    and their incredibly popular MacBook Pros. Similarly, x86 brought with
    it Windows compatibility, introducing the ability to directly boot
    Windows, or alternatively run it in a very low overhead virtual machine.

    The cost of this transition, however, came on the software side of
    matters. Developers would need to start using Apple's newest toolchains
    to produce universal binaries that could work on PPC and x86 Macs – and
    not all of Apple's previous APIs would make the jump to x86. Developers
    of course made the jump, but it was a transition without a true
    precedent.

    Bridging the gap, at least for a bit, was Rosetta, Apple's PowerPC
    translation layer for x86. Rosetta would allow most PPC Mac OS X
    applications to run on the x86 Macs, and though performance was a bit hit-and-miss (PPC on x86 isn't the easiest thing), the higher
    performance of the Intel CPUs helped to carry things for most
    non-intensive applications. Ultimately Rosetta was a band-aid for Apple,
    and one Apple ripped off relatively quickly; Apple had already dropped
    Rosetta by the time of Mac OS X 10.7 (Lion) in 2011. So even with
    Rosetta, Apple made it clear to developers that they expected them to
    update their applications for x86 if they wanted to keep selling them
    and to keep users happy.

    Ultimately, the PowerPC to x86 transition set the tone for the modern,
    agile Apple. Since then, Apple has created a whole development
    philosophy around going fast and changing things as they see fit, with
    only limited regard to backwards compatibility. This has given users and developers few options but to enjoy the ride and keep up with Apple's development trends. But it has also given Apple the ability to introduce
    new technologies early, and if necessary, break old applications so that
    new features aren't held back by backwards compatibility woes.

    All of this has happened before, and it will all happen again starting
    next week, when Apple launches their first Apple M1-based Macs.
    Universal binaries are back, Rosetta is back, and Apple's push to
    developers to get their applications up and running on Arm is in full
    force. The PPC to x86 transition created the template for Apple for an
    ISA change, and following that successful transition, they are going to
    do it all over again over the next few years as Apple becomes their own
    chip supplier.

    *A Microarchitectural Deep Dive & Benchmarks*

    On the following pages we'll be investigating the A14's Firestorm cores,
    which will also be used in the M1, and doing some extensive
    benchmarking on the iPhone chip, setting the stage as the minimum of
    what to expect from the M1:

    *Apple's Humongous CPU Microarchitecture*

    So how does Apple plan to compete with AMD and Intel in this market?
    Readers who have been following Apple's silicon endeavors over the last
    few years will certainly not be surprised to see the performance that
    Apple proclaimed during the event.

    The secret sauce lies in Apple's in-house CPU microarchitecture. Apple's
    long journey into custom CPU microarchitectures started off with the
    release of the Apple A6 back in 2012 in the iPhone 5. Even back then,
    with their first-generation “Swift” design, the company had posted some
    impressive performance figures compared to the mobile competition.

    The real shocker that made waves through the industry was, however,
    Apple's subsequent release of the Cyclone CPU microarchitecture in
    2013's Apple A7 SoC and iPhone 5S. Apple's early adoption of the 64-bit
    Armv8 ISA shocked everybody: not only was the company the first in the
    industry to implement the new instruction set architecture, but they
    beat even Arm's own CPU teams by more than a year, as the Cortex-A57
    (Arm's own 64-bit microarchitecture design) would not see the light of
    day until late 2014.

    Apple famously called their “Cyclone” design a “desktop-class
    architecture”, which in hindsight probably should have been an obvious
    pointer to where the company was heading. Over subsequent generations,
    Apple has evolved their custom CPU microarchitecture at an astounding
    rate,
    posting massive performance gains with each generation, which we've
    covered extensively over the years:

    AnandTech A-Series Coverage and Testing
    ---------------------------------------------------------------
    Year   Apple A#   Review / Coverage
    ---------------------------------------------------------------
    2012   A6         The iPhone 5 Review
    2013   A7         The iPhone 5s Review
    2014   A8         The iPhone 6 Review
    2015   A9         The Apple iPhone 6s and iPhone 6s Plus Review
    2016   A10        The iPhone 7 and iPhone 7 Plus Review
    2017   A11        -
    2018   A12        The iPhone XS & XS Max Review
    2019   A13        The Apple iPhone 11, 11 Pro & 11 Pro Max Review
    2020   A14        You're reading it
    ---------------------------------------------------------------

    This year's A14 chip includes the 8th generation of Apple's 64-bit
    microarchitecture family, which started off with the A7 and the
    Cyclone design. Over the years, Apple's design cadence seems to have
    settled into major microarchitecture updates every other generation
    starting with the A7, with the A9, A11, and A13 all showcasing major
    increases in their design complexity and microarchitectural width and
    depth.

    Apple's CPUs still pretty much remain a black box design given that the
    company doesn't disclose any details, and the only publicly available
    resources on the matter date back to LLVM patches in the A7 Cyclone era,
    which very much aren't relevant anymore to today's designs. While we
    don't have official means and information as to how Apple's CPUs
    work, that doesn't mean we cannot figure out certain aspects of the
    design. Through our own in-house tests as well as third-party
    microbenchmarks (special credit is due to @Veedrac's
    microarchitecturometer test suite), we can unveil some of the
    details of Apple's designs. The following disclosures are estimated
    based on testing the behavior of the latest Apple A14 SoC inside of the
    iPhone 12 Pro:

    *Apple's Firestorm CPU Core: Even Bigger & Wider*

    Apple's latest generation big core CPU design inside of the A14 is
    codenamed “Firestorm”, following up last year's “Lightning” microarchitecture inside of the Apple A13. The new Firestorm core and
    its years-long pedigree of continued generational improvements lie at
    the heart of today's discussion, and are the key part as to how Apple is
    making the large jump away from Intel x86 designs to their own in-house
    SoCs.

    <https://images.anandtech.com/doci/16226/Firestorm.png>

    The above diagram is an estimated feature layout of Apple's latest big
    core design – what's represented here is my best-effort attempt at
    identifying the new design's capabilities, but it certainly is not an
    exhaustive drill-down into everything that Apple's design has to offer –
    so naturally some inaccuracies might be present.

    What really distinguishes Apple's Firestorm CPU core from other designs
    in the industry is the sheer width of the microarchitecture. Featuring
    an 8-wide decode block, Apple's Firestorm is by far the widest
    commercialized design in the industry. IBM's upcoming P10 core in the
    POWER10 is the only other official design that's expected to come to
    market with such a wide decoder, following Samsung's cancellation of
    their own M6 core, which had also been described as having a similarly
    wide design.

    Other contemporary designs, such as AMD's Zen (1 through 3) and Intel's
    current µarchs, still only feature 4-wide decoder designs (Intel's is
    1+4), seemingly limited from going wider at this point in time by the
    x86 ISA's inherent variable instruction length, which makes designing
    decoders that can deal with this aspect of the architecture more
    difficult than for the Arm ISA's fixed-length instructions. On the Arm
    side of things, Samsung's designs had been 6-wide from the M3 onwards,
    whilst Arm's own Cortex cores have been steadily going wider with each
    generation – currently 4-wide in available silicon, and expected to
    increase to a 5-wide design in the upcoming Cortex-X1 cores.
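
    To make the decode-width argument concrete, here's a toy illustration in
    C (nobody's actual decoder logic, just the dependency structure) of why
    finding instruction boundaries is trivially parallel on a fixed-length
    ISA but inherently serial on a variable-length one:

        #include <stdio.h>

        int main(void) {
            /* Toy encoding: pretend these are the byte-lengths of 8
               consecutive variable-length instructions (x86 instructions
               range from 1 to 15 bytes). */
            int varlen[8] = {1, 3, 2, 7, 5, 1, 4, 2};

            /* Fixed-length ISA (AArch64): instruction i starts at i*4.
               Every decode slot can be computed independently, in parallel. */
            for (int i = 0; i < 8; i++)
                printf("fixed  insn %d starts at byte %d\n", i, i * 4);

            /* Variable-length ISA: the start of instruction i+1 is only
               known after instruction i's length has been decoded, creating
               a serial chain through the fetch window. */
            int off = 0;
            for (int i = 0; i < 8; i++) {
                printf("varlen insn %d starts at byte %d\n", i, off);
                off += varlen[i];
            }
            return 0;
        }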

    Apple's microarchitecture being 8-wide actually isn't new to the A14.
    Going back to the A13, it seems I had made a mistake in my original
    tests, as I had deemed it a 7-wide machine. Re-testing it recently, I
    confirmed that it was in that generation that Apple upgraded from the
    7-wide decode which had been present in the A11 and A12.

    <https://images.anandtech.com/doci/16226/A14-firestorm-ROB_575px.png>

    One aspect of recent Apple designs which we were never really able to
    answer concretely is how deep their out-of-order execution capabilities
    are. The last official resource we had on the matter was a 192-entry
    figure for the ROB (Re-order Buffer) inside of the 2013 Cyclone design.
    Thanks again to Veedrac's implementation of a test that appears to
    expose this part of the µarch, we can seemingly confirm that Firestorm's
    ROB is around 630 instructions deep, an upgrade from last year's A13
    Lightning core, which measures in at 560 instructions. It's not clear
    whether this is actually a traditional ROB as in other architectures,
    but the test at least exposes microarchitectural limitations which are
    tied to the ROB, and it produces correct figures on other designs in the
    industry. An out-of-order window is the number of instructions that a
    core can have “parked”, waiting for execution in, well, out-of-order
    sequence, whilst the core is trying to fetch and execute the
    dependencies of each instruction.
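
    For those curious about the shape of such a test, below is a simplified
    C sketch of the general technique (our own illustration, not Veedrac's
    actual code; GCC/Clang inline-asm extensions assumed). Two independent
    cache-missing loads are separated by a block of filler instructions: as
    long as both loads fit inside the out-of-order window, their misses
    overlap; once the filler exceeds the ROB capacity, the second miss
    serialises behind the first and the time per iteration jumps. The real
    test sweeps the filler count to locate the jump; this sketch shows just
    two points:

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define CHAIN_LEN (1 << 22)  /* 4M pointers = 32MB, far past the LLC */

        /* Blocks of independent NOPs: each occupies a ROB entry until retire. */
        #define NOP8  asm volatile("nop\n nop\n nop\n nop\n nop\n nop\n nop\n nop")
        #define NOP64 do { NOP8; NOP8; NOP8; NOP8; NOP8; NOP8; NOP8; NOP8; } while (0)
        #define NOP512 do { NOP64; NOP64; NOP64; NOP64; \
                            NOP64; NOP64; NOP64; NOP64; } while (0)

        /* Build a randomly shuffled cyclic pointer chain so that every
           dereference is a cache miss with an unpredictable address. */
        static void **make_chain(size_t n) {
            void **buf = malloc(n * sizeof(void *));
            size_t *perm = malloc(n * sizeof(size_t));
            for (size_t i = 0; i < n; i++) perm[i] = i;
            for (size_t i = n - 1; i > 0; i--) {           /* Fisher-Yates */
                size_t j = (size_t)rand() % (i + 1);
                size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
            }
            for (size_t i = 0; i < n; i++)
                buf[perm[i]] = &buf[perm[(i + 1) % n]];
            free(perm);
            return buf;
        }

        static double ns_per_iter(void **a, void **b, int filler, long iters) {
            struct timespec t0, t1;
            void **pa = a, **pb = b;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < iters; i++) {
                pa = (void **)*pa;           /* miss #1 */
                if (filler) NOP512;          /* ~512 ROB-occupying fillers */
                pb = (void **)*pb;           /* miss #2 */
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            if (pa == pb) putchar(' ');      /* keep the loads alive */
            return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                    (t1.tv_nsec - t0.tv_nsec)) / iters;
        }

        int main(void) {
            void **a = make_chain(CHAIN_LEN), **b = make_chain(CHAIN_LEN);
            long iters = 1 << 20;
            /* If both loads plus ~512 fillers fit in the ROB, the misses
               overlap and the filler adds little time; a ~350-entry machine
               already serialises at this filler size, while a ~560-630
               entry machine does not. */
            printf("no filler : %.1f ns/iter\n", ns_per_iter(a, b, 0, iters));
            printf("512 nops  : %.1f ns/iter\n", ns_per_iter(a, b, 1, iters));
            free(a); free(b);
            return 0;
        }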

    A roughly 630-entry ROB is an immensely huge out-of-order window for
    Apple's new core, as it vastly outclasses any other design in the
    industry. Intel's Sunny Cove and Willow Cove cores are the second
    deepest OOO designs out there with a 352-entry ROB structure, while
    AMD's newest Zen3 core makes do with 256 entries, and recent Arm designs
    such as the Cortex-X1 feature a 224-entry structure.

    Exactly how and why Apple is able to achieve such a grossly
    disproportionate design compared to all other designers in the industry
    isn't exactly clear, but it appears to be a key characteristic of
    Apple's design philosophy and its method of achieving high ILP
    (instruction-level parallelism).

    *Many, Many Execution Units*

    Having high ILP also means that these instructions need to be executed
    in parallel by the machine, and here we also see Apple's back-end
    execution engines feature extremely wide capabilities. On the Integer
    side, whose in-flight instructions and renaming physical register file
    capacity we estimate at around 354 entries, we find at least 7 execution
    ports for actual arithmetic operations. These include 4 simple ALUs
    capable of ADD instructions, 2 complex units which also feature MUL
    (multiply) capabilities, and what appears to be a dedicated integer
    division unit. The core is able to handle 2 branches per cycle, which I
    think is enabled by one or two dedicated branch forwarding ports,
    though I wasn't able to 100% confirm the layout of the design here.

    The Firestorm core here doesn't appear to have major changes on the
    Integer side of the design, as the only noteworthy change was an
    apparent slight increase (yes, an increase) in the latency of that
    integer division unit.

    On the floating point and vector execution side of things, the new
    Firestorm cores are actually more impressive, as they see a 33% increase
    in capabilities, enabled by Apple's addition of a fourth execution
    pipeline. The FP rename registers here seem to land at 384 entries,
    which is again comparatively massive. The four 128-bit NEON pipelines
    thus on paper match the current throughput capabilities of desktop cores
    from AMD and Intel, albeit with smaller vectors. Floating-point
    operation throughput here is 1:1 with the pipeline count, meaning
    Firestorm can do 4 FADDs and 4 FMULs per cycle, with respectively 3 and
    4 cycles of latency. That's quadruple the per-cycle throughput of Intel
    CPUs and previous AMD CPUs, and still double that of the recent Zen3 –
    of course, still running at a lower frequency. This might be one reason
    why Apple does so well in browser benchmarks (JavaScript numbers are
    floating-point doubles).
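
    As a methodological aside, FADD latency and throughput can be teased
    apart with two simple loops: one long dependent chain (bounded by
    latency) versus many independent chains (bounded by issue width). A
    hedged C sketch of the idea – compile with -O2 but without -ffast-math,
    so the adds cannot be reassociated:

        #include <stdio.h>
        #include <time.h>

        #define ITERS 100000000L

        static double elapsed_ns(struct timespec t0, struct timespec t1) {
            return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        }

        int main(void) {
            struct timespec t0, t1;

            /* Latency: a single dependent chain; every FADD waits for the
               previous one, so the loop runs at one add per FADD-latency. */
            double a = 1.0;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < ITERS; i++) a += 1e-9;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double lat = elapsed_ns(t0, t1) / ITERS;

            /* Throughput: 12 independent chains. With 4 FP pipes and a
               3-cycle FADD, 4 x 3 = 12 chains suffice to keep every pipe
               busy every cycle. */
            double c[12] = {0};
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < ITERS; i++)
                for (int k = 0; k < 12; k++) c[k] += 1e-9;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double thr = elapsed_ns(t0, t1) / ((double)ITERS * 12);

            double guard = a;                 /* keep results observable */
            for (int k = 0; k < 12; k++) guard += c[k];
            printf("dependent chain : %.3f ns per FADD (~latency)\n", lat);
            printf("12 indep chains : %.3f ns per FADD (~1/throughput)\n", thr);
            printf("(guard: %g)\n", guard);
            return 0;
        }

    At 3GHz, a 3-cycle FADD should print roughly 1ns for the dependent
    chain, and a 4-pipe machine roughly 0.08ns per add for the independent
    chains.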

    Vector abilities of the 4 pipelines seem to be identical, with the only
    instructions seeing lower throughput being FP divisions, reciprocals,
    and square-root operations, which only have a throughput of 1, on one of
    the four pipes.

    <https://images.anandtech.com/doci/16226/A14_firestorm-LDSTQ_575px.png>

    On the load-store front, we're seeing what appears to be four execution
    ports: One load/store, one dedicated store and two dedicated load units.
    The core can do at most 3 loads per cycle and 2 stores per cycle, but a
    maximum of only 2 loads and 2 stores concurrently.

    What's interesting here is again the depth to which Apple can handle
    outstanding memory transactions. We're measuring up to around 148-154
    outstanding loads and around 106 outstanding stores, which should be the
    equivalent figures of the load queues and store queues of the memory
    subsystem. To nobody's surprise, this is also again deeper than any
    other microarchitecture on the market. Interesting comparisons are AMD's
    Zen3 at 44/64 loads & stores, and Intel's Sunny Cove at 128/72. The
    Intel design here isn't far off from Apple, and the throughput of these
    latest microarchitectures is actually relatively matched – it will be
    interesting to see where Apple goes once they deploy the design to
    non-mobile memory subsystems and DRAM.

    One large improvement on the part of the Firestorm cores this generation
    has been on the side of the TLBs. The L1 TLB has been doubled from 128
    pages to 256 pages, and the L2 TLB goes up from 2048 pages to 3072
    pages. On today's iPhones this is an absolutely overkill change, as the
    page size is 16KB, which means that the L2 TLB covers 48MB, well
    beyond the cache capacity of even the A14. With Apple moving the
    microarchitecture onto Mac systems, having compatibility with 4KB pages
    and making sure the design still offers enough performance there is
    likely a key part of why Apple chose to make such a large upgrade this
    generation.
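
    The coverage arithmetic is simple enough to show directly – a quick
    sketch contrasting the 16KB iOS page size against the 4KB pages Apple
    would need to support for x86-era Mac software:

        #include <stdio.h>

        int main(void) {
            long entries = 3072;   /* A14 L2 TLB entries */

            /* 16KB pages (iOS): 3072 * 16KB = 48MB of address coverage. */
            printf("16KB pages: %ld MB covered\n", entries * 16 / 1024);

            /* 4KB pages: the same TLB covers only a quarter as much, one
               plausible motivation for this generation's larger TLBs. */
            printf(" 4KB pages: %ld MB covered\n", entries * 4 / 1024);
            return 0;
        }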

    <https://images.anandtech.com/doci/16226/lat-A14_575px.png>

    On the cache hierarchy side of things, we've known for a long time that
    Apple's designs are monstrous, and the A14 Firestorm cores continue this
    trend. Last year we had speculated that the A13 had a 128KB L1
    instruction cache, similar to the 128KB L1 data cache which we can test
    for; however, following Darwin kernel source dumps, Apple has confirmed
    that it's actually a massive 192KB instruction cache. That's absolutely
    enormous – 3x larger than competing Arm designs, and 6x larger than
    current x86 designs – which yet again might explain why Apple does
    extremely well in very high instruction pressure workloads, such as the
    popular JavaScript benchmarks.

    The huge caches also appear to be extremely fast – the L1D comes in at
    a 3-cycle load-use latency. We don't know if this is clever load-load
    cascading such as described on Samsung's cores, but in any case, it's
    very impressive for such a large structure. AMD has a 32KB, 4-cycle
    cache, whilst Intel's latest Sunny Cove saw a regression to 5 cycles
    when they grew the size to 48KB. Food for thought on the advantages and
    disadvantages of slow- versus fast-frequency designs.

    On the L2 side of things, Apple has been employing an 8MB structure
    that's shared between their two big cores. This is an extremely unusual
    cache hierarchy and contrasts to everybody else's use of an intermediary
    sized private L2 combined with a larger slower L3. Apple here disregards
    the norms, and chooses a large and fast L2. Oddly enough, this
    generation the A14 saw the L2 of the big cores make a regression in
    terms of access latency, going back from 14 cycles to 16 cycles,
    reverting the improvements that had been made with the A13. We don't
    know for sure why this happened; I do see higher parallel access
    bandwidth into the cache for scalar workloads, however peak bandwidth
    still seems to be the same as the previous generation. Another
    hypothesis is that because Apple shares the L2 amongst cores, this
    might be an indicator of changes for Apple Silicon SoCs with more than
    just two cores connected to a single cache, much like the A12X
    generation.

    Apple has employed a large LLC on their SoCs for many generations
    now. On the A14 this appears to again be a 16MB cache that serves
    all the IP blocks on the SoC – most useful, of course, for the CPU and
    GPU. Comparatively speaking, this cache hierarchy isn't nearly as fast
    as the
    actual CPU-cluster L3s of other designs out there, and in recent years
    we've seen more mobile SoC vendors employ such LLC in front of the
    memory controllers for the sake of power efficiency. What Apple would do
    in a larger laptop or desktop chip remains unclear, but I do think we'd
    see similar designs there.

    We've covered more specific aspects of Apple's designs, such as their
    MLP (memory level parallelism) capabilities, and the A14 doesn't seem to
    change in that regard. One other change I've noted from the A13 is that
    the new design now also makes use of Arm's more relaxed memory model,
    in that the design is able to automatically optimise streaming stores
    into non-temporal stores, mimicking the change that had been introduced
    in the Cortex-A76 and the Exynos-M4. In theory, x86 designs wouldn't be
    able to achieve a similar optimisation – at the very least, it would be
    very interesting to see if one attempted to do so.
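
    To illustrate what kind of code this affects: the pattern in question
    is a plain store loop that overwrites memory sequentially, cache line
    after cache line, without reading it back. A sketch of such a loop in C
    (the non-temporal conversion happens in hardware; nothing in the source
    changes):

        #include <stdlib.h>

        /* A plain streaming-store loop: sequential stores that fully
           overwrite cache line after cache line, never reading the data
           back. Arm's relaxed memory model lets a core that detects this
           pattern treat the stores as non-temporal – skipping cache
           allocation and avoiding cache pollution – with no source change.
           On x86, the programmer has to opt in explicitly (e.g. via the
           SSE2 _mm_stream_ps intrinsic) to get non-temporal behaviour. */
        static void fill(float *dst, size_t n, float v) {
            for (size_t i = 0; i < n; i++)
                dst[i] = v;   /* candidate for non-temporal conversion */
        }

        int main(void) {
            size_t n = 1 << 24;                     /* 64MB of floats */
            float *dst = malloc(n * sizeof(float));
            fill(dst, n, 1.0f);
            free(dst);
            return 0;
        }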

    Maximum Frequency vs Loaded Threads
    Per-Core Maximum MHz
    --------------------------------------------------------
    Apple A14          1      2      3      4      5      6
    --------------------------------------------------------
    Performance 1   2998   2890   2890   2890   2890   2890
    Performance 2          2890   2890   2890   2890   2890
    Efficiency 1                  1823   1823   1823   1823
    Efficiency 2                         1823   1823   1823
    Efficiency 3                                1823   1823
    Efficiency 4                                       1823
    --------------------------------------------------------

    Of course, the old argument about having a very wide architecture is
    that you cannot clock as high as something which is narrower. This is
    somewhat true; however, I wouldn't come to any conclusion as to the
    capabilities of Apple's design in a higher-power device. On the A14
    inside of the new iPhones, the new Firestorm cores are able to reach
    3GHz clock speeds, clocking down to 2.89GHz when two cores are active
    at any time.

    We'll be investigating power in more detail in just a bit, but I
    currently see Apple being limited by the thermal envelope of the actual
    phones rather than it being some intrinsic clock ceiling of the
    microarchitecture. The new Firestorm cores are now clocking in at
    roughly the same speed as any other mobile CPU microarchitecture from
    Arm, even though it's a significantly wider design – so the argument
    about having to clock slower because of the more complex design also
    doesn't seem to apply in this instance. It will be very interesting to
    see what Apple could do not only in a higher thermal envelope device
    such as a laptop, but also in a wall-powered device such as a Mac
    desktop.

    *Dominating Mobile Performance*

    Before we dig deeper into the x86 vs Apple Silicon debate, it would be
    useful to look in more detail at how the A14 Firestorm cores have
    improved upon the A13 Lightning cores, as well as to detail the power
    and power efficiency improvements of the new chip's 5nm process node.

    The process node is actually quite the wildcard in the comparisons here
    as the A14 is the first 5nm chipset on the market, closely followed by
    Huawei's Kirin 9000 in the Mate 40 series. We happen to have both
    devices and chips in house for testing, and by contrasting the Kirin
    9000 (Cortex-A77 at 3.13GHz on N5) vs the Snapdragon 865+ (Cortex-A77 at
    3.09GHz on N7P) we can somewhat deduce how much of an impact the process
    node has in terms of power and efficiency, translating those
    improvements to the A13 vs A14 comparison.

    <https://images.anandtech.com/doci/16226/specint_big_575px.png>

    Starting off with SPECint2006, we don't see anything very unusual about
    the A14 scores, save the great improvement in 456.hmmer. Actually, this
    wasn't due to a microarchitectural jump, but rather due to new
    optimisations on the part of the new LLVM version in Xcode 12. It seems
    here that the compiler has employed a similar loop optimisation as found
    on GCC8 onwards. The A13's score had actually also improved, from 47.79
    to 64.87, but I hadn't run new numbers on the whole suite yet.

    For the rest of the workloads, the A14 generally looks like a relatively
    linear progression from the A13, accounting for the clock frequency
    increase from 2.66GHz to 3GHz. The overall IPC gain for the suite looks
    to be around 5%, which is a bit less than Apple's prior generations,
    though with a larger than usual clock speed increase.

    Power consumption for the new chip is actually in line with, and
    sometimes even better than, the A13's, which means that workload energy
    efficiency this generation has seen a noticeable improvement even at
    the peak performance point.

    Performance against the contemporary Android and Cortex-powered SoCs
    looks to be quite lopsided in favour of Apple. The thing that stands out
    the most are the memory-intensive, sparse-memory workloads such as
    429.mcf and 471.omnetpp, where the Apple design features well over twice
    the performance, even though all the chips are running similar
    mobile-grade LPDDR4X/LPDDR5 memory. In our microarchitectural
    investigations we've seen signs of “memory magic” on Apple's designs,
    where we believe they may be using some sort of pointer-chase
    prefetching mechanism.

    <https://images.anandtech.com/doci/16226/specfp_big_575px.png>

    In SPECfp, the increases of the A14 over the A13 are a little higher
    than the linear clock frequency increase, as we're measuring an overall
    10-11% IPC uplift here. This isn't too surprising given the additional
    fourth FP/SIMD pipeline of the design, whereas the integer side of the
    core has remained relatively unchanged compared to the A13.

    <https://images.anandtech.com/doci/16226/spec2006_A14_575px.png>

    In the overall mobile comparison, we can see that the new A14 has made
    robust progress in terms of increasing performance over the A13.
    Compared to the competition, Apple is well ahead of the pack – we'll
    have to wait for next year's Cortex-X1 devices to see the gap narrow
    again.

    What's also very important to note here is that Apple has achieved this
    all whilst remaining flat, or even lowering the power consumption of the
    new chip, notably reducing energy consumption for the same workloads.

    Looking at the Kirin 9000 vs the Snapdragon 865+, we're seeing a 10%
    reduction in power at relatively similar performance. Both chips use the
    same CPU IP, only differing in their process node and implementations.
    It seems Apple's A14 here has been able to achieve better figures than
    just the process node improvement, which is expected given that it's a
    new microarchitecture design as well.

    One further note is the data of the A14's small efficiency cores. This
    generation we saw a large microarchitectural boost on the part of these
    new cores, which now see 35% better performance versus last year's
    A13 efficiency cores – all while further reducing energy consumption. I
    don't know how the small cores will come into play on Apple's “Apple
    Silicon” Mac designs, but they're certainly still very performant and
    extremely efficient compared to other contemporary Arm designs.

    Lastly, there's the x86 vs Apple performance comparison. Usually for
    iPhone reviews I comment on this in this section of the article, but
    given today's context and the goals Apple has made for Apple Silicon,
    let's investigate that in a whole dedicated section…

    *From Mobile to Mac: What to Expect?*

    To date, our performance comparisons for Apple's chipsets have always
    been in the context of iPhone reviews, with the juxtaposition to x86
    designs being a rather small footnote within the context of the
    articles. Today's Apple Silicon launch event completely changes the
    narrative of what we portray in terms of performance, setting aside the
    typical apples vs oranges comparisons people usually argue over.

    We currently do not have Apple Silicon devices and likely won't get our
    hands on them for another few weeks, but we do have the A14, and expect
    the new Mac chips to be strongly based on the microarchitecture we're
    seeing employed in the iPhone designs. Of course, we're still comparing
    a phone chip versus a high-end laptop and even a high-end desktop chip,
    but given the performance numbers, that's also exactly the point we're
    trying to make here, setting the stage as the bare minimum of what Apple
    could achieve with their new Apple Silicon Mac chips.

    <https://images.anandtech.com/graphs/graph16226/111158.png>

    The performance numbers of the A14 on this chart are relatively
    mind-boggling. If I were to release this data with the label of the A14
    hidden, one would guess that the data-points came from some other x86
    SKU from either AMD or Intel. The fact that the A14 currently competes
    with the very best top-performance designs that the x86 vendors have on
    the market today is just an astonishing feat.

    Looking into the detailed scores, what again amazes me is the fact that
    the A14 not only keeps up, but actually beats both these competitors in memory-latency sensitive workloads such as 429.mcf and 471.omnetpp, even
    though they either have the same memory (i7-1185G7 with LPDDR4X-4266),
    or desktop-grade memory (5950X with DDR4-3200).

    Again, disregard the 456.hmmer score advantage of the A14 – that's
    largely due to compiler discrepancies; subtract 33% for a more apt
    comparison figure.

    <https://images.anandtech.com/graphs/graph16226/111159.png>

    Even in SPECfp, which is even more dominated by memory-heavy workloads,
    the A14 not only keeps up, but generally beats the Intel CPU design more
    often than not. AMD also wouldn't be looking good if not for the
    recently released Zen3 design.

    <https://images.anandtech.com/graphs/graph16226/111168.png>

    In the overall SPEC2006 chart, the A14 performs absolutely
    fantastically, taking the lead in absolute performance and falling
    short only of AMD's recent Ryzen 5000 series.

    The fact that Apple is able to achieve this at a total device power
    consumption of 5W – including the SoC, DRAM, and regulators – versus the
    21+W (1185G7) and 49W (5950X) package power figures, which exclude DRAM
    and regulation, is absolutely mind-blowing.

    <https://images.anandtech.com/graphs/graph16226/119329.png>

    There's been a lot of criticism about more common benchmark suites such
    as GeekBench, but frankly I've found these concerns or arguments to be
    quite unfounded. The only factual difference between workloads in SPEC
    and workloads in GB5 is that the latter has fewer outlier tests which
    are memory-heavy, meaning it's more of a CPU benchmark, whereas SPEC
    leans more towards CPU+DRAM.

    The fact that Apple does well in both workloads is evidence that they
    have an extremely well-balanced microarchitecture, and that Apple
    Silicon will be able to scale up to “desktop workloads” in terms of performance without much issue.

    *Where the Performance Trajectory Finally Intersects*

    During the release of the A7, people were pretty dismissive of the fact
    that Apple had called their microarchitecture a desktop-class design.
    People were also very dismissive when we said the A11 and A12 were
    reaching near-desktop-level performance figures a few years back. But
    today marks an important moment in time for the industry, as Apple's A14
    is now clearly able to showcase performance that's beyond the best that
    Intel can offer. It's a performance trajectory that has been steadily
    executing and progressing for years:

    <https://images.anandtech.com/doci/16226/perf-trajectory_575px.png>

    Whilst in the past 5 years Intel has managed to increase their best single-thread performance by about 28%, Apple has managed to improve
    their designs by 198%, or 2.98x (let's call it 3x) the performance of
    the Apple A9 of late 2015.
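
    Annualised, those two trajectories are even starker. A back-of-the-
    envelope calculation of the implied compound yearly gains over that
    5-year window:

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            double years = 5.0;
            double apple_total = 2.98;   /* A9 (2015) -> A14 (2020) */
            double intel_total = 1.28;   /* best single-thread, same window */

            /* Compound annual growth rate: total^(1/years) - 1. */
            printf("Apple: ~%.0f%% per year\n",
                   (pow(apple_total, 1.0 / years) - 1) * 100);  /* ~24% */
            printf("Intel: ~%.0f%% per year\n",
                   (pow(intel_total, 1.0 / years) - 1) * 100);  /* ~5%  */
            return 0;   /* link with -lm */
        }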

    Apple's performance trajectory and unquestioned execution over these
    years is what has made Apple Silicon a reality today. Anybody looking at
    the absurdness of that graph will realise that there simply was no other
    choice but for Apple to ditch Intel and x86 in favour of their own
    in-house microarchitecture – staying the course would have meant
    stagnation and worse consumer products.

    Today's announcements only covered Apple's laptop-class Apple Silicon,
    and whilst we don't know at the time of writing the details of what
    Apple will be presenting next, Apple's enormous power efficiency
    advantage means that the new chip will be able to offer either vastly
    increased battery life, and/or vastly increased performance, compared
    to the current Intel MacBook line-up.

    Apple has claimed that they will completely transition their whole
    consumer line-up to Apple Silicon within two years, which is an
    indicator that we'll be seeing a high-TDP many-core design to power a
    future Mac Pro. If the company is able to continue on their current
    performance trajectory, it will look extremely impressive.

    *Apple Shooting for the Stars: x86 Incumbents Beware*

    The previous pages were written ahead of Apple officially announcing the
    new M1 chip. We already saw the A14 performing outstandingly and
    outperforming the best that Intel has to offer. The new M1 should
    perform notably above that.

    We come back to a few of Apple's slides during the presentation as to
    what to expect in terms of performance and efficiency. In particular,
    the performance/power curves are the most detail that Apple is sharing
    at this moment in time:

    <https://images.anandtech.com/doci/16226/2020-11-10%2019_11_10_575px.jpg>

    In this graphic, Apple showcases the new M1 chip featuring a CPU power
    consumption peak of around 18W. The competing PC laptop chip here peaks
    in the 35-40W range, so these are certainly not single-threaded
    performance figures, but rather whole-chip multi-threaded performance.
    We don't know if this is comparing the M1 to an AMD Renoir chip or an
    Intel ICL or TGL chip, but in both cases the same general verdict
    applies:

    Apple's use of a significantly more advanced microarchitecture that
    offers significantly higher IPC, enabling high performance at low core
    clocks, allows for significant power efficiency gains versus the
    incumbent x86 players. The graphic shows that at peak-to-peak, the M1
    offers around a 40% performance uplift compared to the existing
    competitive offering, all whilst doing it at 40% of the power
    consumption.

    Apple's comparison of arbitrary performance points deserves criticism;
    however, the 10W measurement point where Apple claims 2.5x the
    performance does make some sense, as this is the nominal TDP of the
    chips used in the Intel-based MacBook Air. Again, it's thanks to the
    power efficiency characteristics that Apple has been able to achieve in
    the mobile space that the M1 is promised to showcase such large gains –
    it certainly matches our A14 data.

    *Don't forget about the GPU*

    Today we mostly covered the CPU side of things, as that's where the
    unprecedented industry shift is happening. However, we shouldn't forget
    about the GPU, as the new M1 represents Apple's first introduction
    of their custom GPU designs into the Mac space.

    <https://images.anandtech.com/doci/16226/2020-11-10%2019_12_54_575px.jpg>

    Apple's performance and power efficiency claims here are really lacking
    context, as we have no idea what their comparison point is. I won't try
    to theorise here, as there are just too many variables at play, and we
    don't know enough details.

    What we do know is that in the mobile space, Apple is absolutely leading
    the pack in terms of performance and power efficiency. The last time we
    tested the A12Z, the design was more than able to compete with and beat
    integrated graphics designs. But since then we've seen more significant
    jumps from both AMD and Intel.

    *Performance Leadership?*

    Apple claims the M1 to be the fastest CPU in the world. Given our data
    on the A14 – which beats all of Intel's designs and falls just short of
    AMD's newest Zen3 chips – a higher-clocked Firestorm above 3GHz, a 50%
    larger L2 cache, and an unleashed TDP mean we can certainly believe that
    Apple and the M1 will be able to achieve that claim.

    This moment has been brewing for years now, and the new Apple Silicon is
    both shocking and very much expected. In the coming weeks we'll be
    trying to get our hands on the new hardware and verify Apple's claims.

    Intel has stagnated itself out of the market, and has lost a major
    customer today. AMD has shown lots of progress lately; however, it'll be
    incredibly hard to catch up to Apple's power efficiency. If Apple's
    performance trajectory continues at this pace, the x86 performance crown
    might never be regained.

    Reference: <https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive>

    --
    E-mail sent to this address may be devoured by my ravenous SPAM filter.
    I often ignore posts from Google. Use a real news client instead.

    JR
    --- Synchronet 3.18b-Win32 NewsLink 1.113