• Anandtech's Apple M1 SoC Deep Dive

    From Jolly Roger@jollyroger@pobox.com to comp.sys.mac.system on Wednesday, November 18, 2020 20:54:10
    From Newsgroup: comp.sys.mac.system

    Today, Apple has unveiled their brand-new MacBook line-up. This isn't an ordinary release – if anything, the move that Apple is making today is something that hasn't happened in 15 years: The start of a CPU
    architecture transition across their whole consumer Mac line-up.

    Thanks to the company's vertical integration across hardware and
    software, this is a monumental change that nobody but Apple can so
    swiftly usher in. The last time Apple ventured into such an undertaking
    in 2006, the company had ditched IBM's PowerPC ISA and processors in
    favor of Intel x86 designs. Today, Intel is being ditched in favor of
    the company's own in-house processors and CPU microarchitectures, built
    upon the Arm ISA.

    The new processor is called the Apple M1, the company's first SoC
    designed with Macs in mind. With four large performance cores, four
    efficiency cores, and an 8-core GPU, it features 16 billion
    transistors on a 5nm process node. Apple is starting a new SoC naming
    scheme for this new family of processors, but at least on paper it looks
    a lot like an A14X.

    Today's event contained a ton of new official announcements, but was
    also (in typical Apple fashion) lacking in detail. Today, we're going to
    be dissecting the new Apple M1 news, as well as doing a
    microarchitectural deep dive based on the already-released Apple A14
    SoC.

    *The Apple M1 SoC: An A14X for Macs*

    The new Apple M1 is really the start of a new major journey for Apple.
    During Apple's presentation the company didn't really divulge much in
    the way of details for the design; however, there was one slide that
    told us a lot about the chip's packaging and architecture:

    <https://images.anandtech.com/doci/16226/2020-11-10%2019_08_48_575px.jpg>

    This packaging style, with DRAM embedded within the organic packaging,
    isn't new for Apple; they've been using it since the A12. However, it's
    something that's only used sparingly. When it comes to higher-end chips,
    Apple likes to use this kind of packaging instead of your usual
    smartphone POP (package on package) because these chips are designed
    with higher TDPs in mind. So keeping the DRAM off to the side of the
    compute die rather than on top of it helps to ensure that these chips
    can still be efficiently cooled.

    What this also means is that we're almost certainly looking at a 128-bit
    DRAM bus on the new chip, much like that of previous generation A-X
    chips.

    On the very same slide, Apple also seems to have used an actual die shot
    of the new M1 chip. It perfectly matches Apple's described
    characteristics of the chip, and it looks like a real photograph
    of the die. Cue what's probably the quickest die annotation I've ever
    made:

    <https://images.anandtech.com/doci/16226/M1_575px.png>

    We can see the M1's four Firestorm high-performance CPU cores on the
    left side. Notice the large amount of cache – the 12MB cache was one of
    the surprise reveals of the event, as the A14 still only featured 8MB of
    L2 cache. The new cache here looks to be portioned into 3 larger blocks,
    which makes sense given Apple's transition from 8MB to 12MB for this new
    configuration; it is, after all, now being used by 4 cores instead of 2.

    Meanwhile the 4 Icestorm efficiency cores are found near the center of
    the SoC, above which we find the SoC's system level cache, which is
    shared across all IP blocks.

    Finally, the 8-core GPU takes up a significant amount of die space and
    is found in the upper part of this die shot.

    What's most interesting about the M1 here is how it compares to other
    CPU designs by Intel and AMD. All the aforementioned blocks still only
    cover part of the whole die, with a significant amount of auxiliary
    IP. Apple made mention that the M1 is a true SoC, including the
    functionality of what previously was several discrete chips inside of
    Mac laptops, such as I/O controllers and Apple's SSD and security
    controllers.

    <https://images.anandtech.com/doci/16226/2020-11-10%2019_10_00_575px.jpg>

    The new CPU core is what Apple claims to be the world's fastest. This is
    going to be a centre-point of today's article as we dive deeper into the
    microarchitecture of the Firestorm cores, as well as look at the
    performance figures of the very similar Apple A14 SoC.

    With its additional cache, we expect the Firestorm cores used in the M1
    to be even faster than what we're going to be dissecting today with the
    A14, so Apple's claim of having the fastest CPU core in the world seems extremely plausible.

    The whole SoC features a massive 16 billion transistors, which is 35%
    more than the A14 inside of the newest iPhones. If Apple was able to
    keep the transistor density between the two chips similar, we should
    expect a die size of around 120mm². This would be considerably smaller
    than past generations of Intel chips inside of Apple's MacBooks.
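
    As a quick sanity check of that estimate: assuming the A14's publicly
    reported figures of roughly 11.8 billion transistors on an ~88mm² die
    (figures we're assuming from public reporting, not numbers Apple gave
    alongside the M1), scaling area linearly with transistor count lands
    almost exactly on that number. A minimal sketch of the arithmetic, in C:

        #include <stdio.h>

        int main(void) {
            /* Assumed A14 figures from public reporting: ~11.8B transistors
               on a die of roughly 88 mm^2. */
            double a14_transistors = 11.8e9;
            double a14_die_mm2     = 88.0;
            double m1_transistors  = 16e9;   /* Apple's stated M1 figure */

            /* Equal density => area scales linearly with transistor count. */
            double m1_die_mm2 = a14_die_mm2 * (m1_transistors / a14_transistors);
            printf("Estimated M1 die size: ~%.0f mm^2\n", m1_die_mm2); /* ~119 */
            return 0;
        }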

    *Road To Arm: Second Verse, Same As The First*

    Section by Ryan Smith

    The fact that Apple can even pull off a major architectural transition
    so seamlessly is a small miracle, and one that Apple has quite a bit of
    experience in accomplishing. After all, this is not Apple's first time
    switching CPU architectures for their Mac computers.

    The long-time PowerPC company came to a crossroads around the middle of
    the 2000s when the Apple-IBM-Motorola (AIM) alliance, responsible for
    PowerPC development, increasingly struggled with further chip
    development. IBM's PowerPC 970 (G5) chip put up respectable performance
    numbers in desktops, but its power consumption was significant. This
    left the chip non-viable for use in the growing laptop segment, where
    Apple was still using Motorola's PowerPC 7400 series (G4) chips, which
    did have better power consumption, but not the performance needed to
    rival what Intel would eventually achieve with its Core series of
    processors.

    And thus, Apple played a card that they held in reserve: Project
    Marklar. Leveraging the flexibility of Mac OS X and its underlying
    Darwin kernel, which like other Unixes is designed to be portable, Apple
    had been maintaining an x86 version of Mac OS X. Though largely
    considered to initially have been an exercise in good coding practices –
    making sure Apple was writing OS code that wasn't unnecessarily bound to
    PowerPC and its big-endian memory model – Marklar became Apple's exit
    strategy from a stagnating PowerPC ecosystem. The company would switch
    to x86 processors – specifically, Intel's x86 processors – upending its software ecosystem, but also opening the door to much better performance
    and new customer opportunities.

    The switch to x86 was by all metrics a big win for Apple. Intel's
    processors delivered better performance-per-watt than the PowerPC
    processors that Apple left behind, and especially once Intel launched
    the Core 2 (Conroe) series of processors in late 2006, Intel firmly
    established itself as the dominant force for PC processors. This
    ultimately set up Apple's trajectory over the coming years, allowing them
    to become a laptop-focused company with proto-ultrabooks (MacBook Air)
    and their incredibly popular MacBook Pros. Similarly, x86 brought with
    it Windows compatibility, introducing the ability to directly boot
    Windows, or alternatively run it in a very low overhead virtual machine.

    The cost of this transition, however, came on the software side of
    matters. Developers would need to start using Apple's newest toolchains
    to produce universal binaries that could work on PPC and x86 Macs – and
    not all of Apple's previous APIs would make the jump to x86. Developers
    of course made the jump, but it was a transition without a true
    precedent.

    Bridging the gap, at least for a bit, was Rosetta, Apple's PowerPC
    translation layer for x86. Rosetta would allow most PPC Mac OS X
    applications to run on the x86 Macs, and though performance was a bit hit-and-miss (PPC on x86 isn't the easiest thing), the higher
    performance of the Intel CPUs helped to carry things for most
    non-intensive applications. Ultimately Rosetta was a band-aid for Apple,
    and one Apple ripped off relatively quickly; Apple had already dropped
    Rosetta by the time of Mac OS X 10.7 (Lion) in 2011. So even with
    Rosetta, Apple made it clear to developers that they expected them to
    update their applications for x86 if they wanted to keep selling them
    and to keep users happy.

    Ultimately, the PowerPC to x86 transition set the tone for the modern,
    agile Apple. Since then, Apple has created a whole development
    philosophy around going fast and changing things as they see fit, with
    only limited regard to backwards compatibility. This has given users and developers few options but to enjoy the ride and keep up with Apple's development trends. But it has also given Apple the ability to introduce
    new technologies early, and if necessary, break old applications so that
    new features aren't held back by backwards compatibility woes.

    All of this has happened before, and it will all happen again starting
    next week, when Apple launches their first Apple M1-based Macs.
    Universal binaries are back, Rosetta is back, and Apple's push to
    developers to get their applications up and running on Arm is in full
    force. The PPC to x86 transition created the template for Apple for an
    ISA change, and following that successful transition, they are going to
    do it all over again over the next few years as Apple becomes their own
    chip supplier.

    *A Microarchitectural Deep Dive & Benchmarks*

    On the following pages we'll be investigating the A14's Firestorm cores,
    which will also be used in the M1, and doing some extensive
    benchmarking on the iPhone chip, setting the stage as the minimum of
    what to expect from the M1:

    *Apple's Humongous CPU Microarchitecture*

    So how does Apple plan to compete with AMD and Intel in this market?
    Readers who have been following Apple's silicon endeavors over the last
    few years will certainly not be surprised to see the performance that
    Apple proclaimed during the event.

    The secret sauce lies in Apple's in-house CPU microarchitecture. Apple's
    long journey into custom CPU microarchitectures started off with the
    release of the Apple A6 back in 2012 in the iPhone 5. Even back then,
    with their first-generation “Swift” design, the company had posted some
    impressive performance figures compared to the mobile competition.

    The real shocker that made waves through the industry was, however,
    Apple's subsequent release of the Cyclone CPU microarchitecture in
    2013's Apple A7 SoC and iPhone 5S. Apple's early adoption of the 64-bit
    Armv8 ISA shocked everybody: not only was the company the first in the
    industry to implement the new instruction set architecture, but they
    beat even Arm's own CPU teams by more than a year, as the Cortex-A57
    (Arm's own 64-bit microarchitecture design) would not see the light of
    day until late 2014.

    Apple famously called their “Cyclone” design a “desktop-class
    architecture”, which in hindsight probably should have been an obvious
    pointer to where the company was heading. Over subsequent generations,
    Apple has evolved their custom CPU microarchitecture at an astounding
    rate,
    posting massive performance gains with each generation, which we've
    covered extensively over the years:

    AnandTech A-Series Coverage and Testing
    ---------------------------------------------------------------
    Year   Apple A#   Review / Coverage
    ---------------------------------------------------------------
    2012   A6         The iPhone 5 Review
    2013   A7         The iPhone 5s Review
    2014   A8         The iPhone 6 Review
    2015   A9         The Apple iPhone 6s and iPhone 6s Plus Review
    2016   A10        The iPhone 7 and iPhone 7 Plus Review
    2017   A11        -
    2018   A12        The iPhone XS & XS Max Review
    2019   A13        The Apple iPhone 11, 11 Pro & 11 Pro Max Review
    2020   A14        You're reading it
    ---------------------------------------------------------------

    This year's A14 chip includes the 8th generation of Apple's 64-bit
    microarchitecture family, which started off with the A7 and the
    Cyclone design. Over the years, Apple's design cadence seems to have
    settled into major microarchitecture updates every other generation
    starting with the A7, with the A9, A11, and A13 all showcasing major
    increases in their design complexity and microarchitectural width and
    depth.

    Apple's CPUs still pretty much remain a black box design given that the
    company doesn't disclose any details, and the only publicly available
    resources on the matter date back to LLVM patches in the A7 Cyclone era,
    which very much aren't relevant anymore to today's designs. While we
    don't have official means and information as to how Apple's CPUs
    work, that doesn't mean we cannot figure out certain aspects of the
    design. Through our own in-house tests as well as third-party
    microbenchmarks (special credit is due to @Veedrac's
    microarchitecturometer test suite), we can unveil some of the
    details of Apple's designs. The following disclosures are estimated
    based on testing the behavior of the latest Apple A14 SoC inside of the
    iPhone 12 Pro:

    *Apple's Firestorm CPU Core: Even Bigger & Wider*

    Apple's latest generation big core CPU design inside of the A14 is
    codenamed “Firestorm”, following up last year's “Lightning” microarchitecture inside of the Apple A13. The new Firestorm core and
    its years-long pedigree of continued generational improvements lie at
    the heart of today's discussion, and are the key part as to how Apple is
    making the large jump away from Intel x86 designs to their own in-house
    SoCs.

    <https://images.anandtech.com/doci/16226/Firestorm.png>

    The above diagram is an estimated feature layout of Apple's latest big
    core design – what's represented here is my best-effort attempt at
    identifying the new design's capabilities, but it certainly is not an
    exhaustive drill-down into everything that Apple's design has to offer –
    so naturally some inaccuracies might be present.

    What really distinguishes Apple's Firestorm CPU core from other designs
    in the industry is the sheer width of the microarchitecture. Featuring
    an 8-wide decode block, Apple's Firestorm is by far the widest
    commercialized design in the industry. IBM's upcoming P10 core in the
    POWER10 is the only other official design that's expected to come to
    market with such a wide decoder, following Samsung's cancellation of
    their own M6 core, which had also been described as having a similarly
    wide design.

    Other contemporary designs, such as AMD's Zen (1 through 3) and Intel's
    current µarchs, still only feature 4-wide decoder designs (Intel's is
    1+4), seemingly limited from going wider at this point in time by the
    x86 ISA's inherent variable instruction length, which makes designing
    decoders that can deal with this aspect of the architecture more
    difficult than for the Arm ISA's fixed-length instructions. On the Arm
    side of things, Samsung's designs had been 6-wide from the M3 onwards,
    whilst Arm's own Cortex cores have been steadily going wider with each
    generation – currently 4-wide in available silicon, and expected to
    increase to a 5-wide design in the upcoming Cortex-X1 cores.
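
    To make the decode-width argument concrete, here's a toy illustration in
    C (nobody's actual decoder logic, just the dependency structure) of why
    finding instruction boundaries is trivially parallel on a fixed-length
    ISA but inherently serial on a variable-length one:

        #include <stdio.h>

        int main(void) {
            /* Toy encoding: pretend these are the byte-lengths of 8
               consecutive variable-length instructions (x86 instructions
               range from 1 to 15 bytes). */
            int varlen[8] = {1, 3, 2, 7, 5, 1, 4, 2};

            /* Fixed-length ISA (AArch64): instruction i starts at i*4.
               Every decode slot can be computed independently, in parallel. */
            for (int i = 0; i < 8; i++)
                printf("fixed  insn %d starts at byte %d\n", i, i * 4);

            /* Variable-length ISA: the start of instruction i+1 is only
               known after instruction i's length has been decoded, creating
               a serial chain through the fetch window. */
            int off = 0;
            for (int i = 0; i < 8; i++) {
                printf("varlen insn %d starts at byte %d\n", i, off);
                off += varlen[i];
            }
            return 0;
        }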

    Apple's microarchitecture being 8-wide actually isn't new to the A14.
    Going back to the A13, it seems I had made a mistake in my original
    tests, as I had deemed it a 7-wide machine. Re-testing it recently, I
    confirmed that it was in that generation that Apple upgraded from the
    7-wide decode which had been present in the A11 and A12.

    <https://images.anandtech.com/doci/16226/A14-firestorm-ROB_575px.png>

    One aspect of recent Apple designs which we were never really able to
    answer concretely is how deep their out-of-order execution capabilities
    are. The last official resource we had on the matter was a 192-entry
    figure for the ROB (Re-order Buffer) inside of the 2013 Cyclone design.
    Thanks again to Veedrac's implementation of a test that appears to
    expose this part of the µarch, we can seemingly confirm that Firestorm's
    ROB is around 630 instructions deep, an upgrade from last year's A13
    Lightning core, which measures in at 560 instructions. It's not clear
    whether this is actually a traditional ROB as in other architectures,
    but the test at least exposes microarchitectural limitations which are
    tied to the ROB, and it produces correct figures on other designs in the
    industry. An out-of-order window is the number of instructions that a
    core can have “parked”, waiting for execution in, well, out-of-order
    sequence, whilst the core is trying to fetch and execute the
    dependencies of each instruction.
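
    For those curious about the shape of such a test, below is a simplified
    C sketch of the general technique (our own illustration, not Veedrac's
    actual code; GCC/Clang inline-asm extensions assumed). Two independent
    cache-missing loads are separated by a block of filler instructions: as
    long as both loads fit inside the out-of-order window, their misses
    overlap; once the filler exceeds the ROB capacity, the second miss
    serialises behind the first and the time per iteration jumps. The real
    test sweeps the filler count to locate the jump; this sketch shows just
    two points:

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define CHAIN_LEN (1 << 22)  /* 4M pointers = 32MB, far past the LLC */

        /* Blocks of independent NOPs: each occupies a ROB entry until retire. */
        #define NOP8  asm volatile("nop\n nop\n nop\n nop\n nop\n nop\n nop\n nop")
        #define NOP64 do { NOP8; NOP8; NOP8; NOP8; NOP8; NOP8; NOP8; NOP8; } while (0)
        #define NOP512 do { NOP64; NOP64; NOP64; NOP64; \
                            NOP64; NOP64; NOP64; NOP64; } while (0)

        /* Build a randomly shuffled cyclic pointer chain so that every
           dereference is a cache miss with an unpredictable address. */
        static void **make_chain(size_t n) {
            void **buf = malloc(n * sizeof(void *));
            size_t *perm = malloc(n * sizeof(size_t));
            for (size_t i = 0; i < n; i++) perm[i] = i;
            for (size_t i = n - 1; i > 0; i--) {           /* Fisher-Yates */
                size_t j = (size_t)rand() % (i + 1);
                size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
            }
            for (size_t i = 0; i < n; i++)
                buf[perm[i]] = &buf[perm[(i + 1) % n]];
            free(perm);
            return buf;
        }

        static double ns_per_iter(void **a, void **b, int filler, long iters) {
            struct timespec t0, t1;
            void **pa = a, **pb = b;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < iters; i++) {
                pa = (void **)*pa;           /* miss #1 */
                if (filler) NOP512;          /* ~512 ROB-occupying fillers */
                pb = (void **)*pb;           /* miss #2 */
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            if (pa == pb) putchar(' ');      /* keep the loads alive */
            return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                    (t1.tv_nsec - t0.tv_nsec)) / iters;
        }

        int main(void) {
            void **a = make_chain(CHAIN_LEN), **b = make_chain(CHAIN_LEN);
            long iters = 1 << 20;
            /* If both loads plus ~512 fillers fit in the ROB, the misses
               overlap and the filler adds little time; a ~350-entry machine
               already serialises at this filler size, while a ~560-630
               entry machine does not. */
            printf("no filler : %.1f ns/iter\n", ns_per_iter(a, b, 0, iters));
            printf("512 nops  : %.1f ns/iter\n", ns_per_iter(a, b, 1, iters));
            free(a); free(b);
            return 0;
        }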

    A roughly 630-entry ROB is an immensely huge out-of-order window for
    Apple's new core, as it vastly outclasses any other design in the
    industry. Intel's Sunny Cove and Willow Cove cores are the second
    deepest OOO designs out there with a 352-entry ROB structure, while
    AMD's newest Zen3 core makes do with 256 entries, and recent Arm designs
    such as the Cortex-X1 feature a 224-entry structure.

    Exactly how and why Apple is able to achieve such a grossly
    disproportionate design compared to all other designers in the industry
    isn't exactly clear, but it appears to be a key characteristic of
    Apple's design philosophy and its method of achieving high ILP
    (instruction-level parallelism).

    *Many, Many Execution Units*

    Having high ILP also means that these instructions need to be executed
    in parallel by the machine, and here we also see Apple's back-end
    execution engines feature extremely wide capabilities. On the Integer
    side, whose in-flight instructions and renaming physical register file
    capacity we estimate at around 354 entries, we find at least 7 execution
    ports for actual arithmetic operations. These include 4 simple ALUs
    capable of ADD instructions, 2 complex units which also feature MUL
    (multiply) capabilities, and what appears to be a dedicated integer
    division unit. The core is able to handle 2 branches per cycle, which I
    think is enabled by one or two dedicated branch forwarding ports,
    though I wasn't able to 100% confirm the layout of the design here.

    The Firestorm core here doesn't appear to have major changes on the
    Integer side of the design, as the only noteworthy change was an
    apparent slight increase (yes, an increase) in the latency of that
    integer division unit.

    On the floating point and vector execution side of things, the new
    Firestorm cores are actually more impressive, as they see a 33% increase
    in capabilities, enabled by Apple's addition of a fourth execution
    pipeline. The FP rename registers here seem to land at 384 entries,
    which is again comparatively massive. The four 128-bit NEON pipelines
    thus on paper match the current throughput capabilities of desktop cores
    from AMD and Intel, albeit with smaller vectors. Floating-point
    operation throughput here is 1:1 with the pipeline count, meaning
    Firestorm can do 4 FADDs and 4 FMULs per cycle, with respectively 3 and
    4 cycles of latency. That's quadruple the per-cycle throughput of Intel
    CPUs and previous AMD CPUs, and still double that of the recent Zen3 –
    of course, still running at a lower frequency. This might be one reason
    why Apple does so well in browser benchmarks (JavaScript numbers are
    floating-point doubles).
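
    As a methodological aside, FADD latency and throughput can be teased
    apart with two simple loops: one long dependent chain (bounded by
    latency) versus many independent chains (bounded by issue width). A
    hedged C sketch of the idea – compile with -O2 but without -ffast-math,
    so the adds cannot be reassociated:

        #include <stdio.h>
        #include <time.h>

        #define ITERS 100000000L

        static double elapsed_ns(struct timespec t0, struct timespec t1) {
            return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        }

        int main(void) {
            struct timespec t0, t1;

            /* Latency: a single dependent chain; every FADD waits for the
               previous one, so the loop runs at one add per FADD-latency. */
            double a = 1.0;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < ITERS; i++) a += 1e-9;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double lat = elapsed_ns(t0, t1) / ITERS;

            /* Throughput: 12 independent chains. With 4 FP pipes and a
               3-cycle FADD, 4 x 3 = 12 chains suffice to keep every pipe
               busy every cycle. */
            double c[12] = {0};
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < ITERS; i++)
                for (int k = 0; k < 12; k++) c[k] += 1e-9;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double thr = elapsed_ns(t0, t1) / ((double)ITERS * 12);

            double guard = a;                 /* keep results observable */
            for (int k = 0; k < 12; k++) guard += c[k];
            printf("dependent chain : %.3f ns per FADD (~latency)\n", lat);
            printf("12 indep chains : %.3f ns per FADD (~1/throughput)\n", thr);
            printf("(guard: %g)\n", guard);
            return 0;
        }

    At 3GHz, a 3-cycle FADD should print roughly 1ns for the dependent
    chain, and a 4-pipe machine roughly 0.08ns per add for the independent
    chains.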

    Vector abilities of the 4 pipelines seem to be identical, with the only
    instructions seeing lower throughput being FP divisions, reciprocals,
    and square-root operations, which only have a throughput of 1, on one of
    the four pipes.

    <https://images.anandtech.com/doci/16226/A14_firestorm-LDSTQ_575px.png>

    On the load-store front, we're seeing what appears to be four execution
    ports: One load/store, one dedicated store and two dedicated load units.
    The core can do at most 3 loads per cycle and 2 stores per cycle, but a
    maximum of only 2 loads and 2 stores concurrently.

    What's interesting here is again the depth to which Apple can handle
    outstanding memory transactions. We're measuring up to around 148-154
    outstanding loads and around 106 outstanding stores, which should be the
    equivalent figures of the load queues and store queues of the memory
    subsystem. To nobody's surprise, this is also again deeper than any
    other microarchitecture on the market. Interesting comparisons are AMD's
    Zen3 at 44/64 loads & stores, and Intel's Sunny Cove at 128/72. The
    Intel design here isn't far off from Apple, and the throughput of these
    latest microarchitectures is actually relatively matched – it will be
    interesting to see where Apple goes once they deploy the design to
    non-mobile memory subsystems and DRAM.

    One large improvement on the part of the Firestorm cores this generation
    has been on the side of the TLBs. The L1 TLB has been doubled from 128
    pages to 256 pages, and the L2 TLB goes up from 2048 pages to 3072
    pages. On today's iPhones this is an absolutely overkill change, as the
    page size is 16KB, which means that the L2 TLB covers 48MB, well
    beyond the cache capacity of even the A14. With Apple moving the
    microarchitecture onto Mac systems, having compatibility with 4KB pages
    and making sure the design still offers enough performance there is
    likely a key part of why Apple chose to make such a large upgrade this
    generation.
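
    The coverage arithmetic is simple enough to show directly – a quick
    sketch contrasting the 16KB iOS page size against the 4KB pages Apple
    would need to support for x86-era Mac software:

        #include <stdio.h>

        int main(void) {
            long entries = 3072;   /* A14 L2 TLB entries */

            /* 16KB pages (iOS): 3072 * 16KB = 48MB of address coverage. */
            printf("16KB pages: %ld MB covered\n", entries * 16 / 1024);

            /* 4KB pages: the same TLB covers only a quarter as much, one
               plausible motivation for this generation's larger TLBs. */
            printf(" 4KB pages: %ld MB covered\n", entries * 4 / 1024);
            return 0;
        }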

    <https://images.anandtech.com/doci/16226/lat-A14_575px.png>

    On the cache hierarchy side of things, we've known for a long time that
    Apple's designs are monstrous, and the A14 Firestorm cores continue this
    trend. Last year we had speculated that the A13 had a 128KB L1
    instruction cache, similar to the 128KB L1 data cache which we can test
    for; however, following Darwin kernel source dumps, Apple has confirmed
    that it's actually a massive 192KB instruction cache. That's absolutely
    enormous – 3x larger than competing Arm designs, and 6x larger than
    current x86 designs – which yet again might explain why Apple does
    extremely well in very high instruction pressure workloads, such as the
    popular JavaScript benchmarks.

    The huge caches also appear to be extremely fast – the L1D comes in at
    a 3-cycle load-use latency. We don't know if this is clever load-load
    cascading such as described on Samsung's cores, but in any case, it's
    very impressive for such a large structure. AMD has a 32KB, 4-cycle
    cache, whilst Intel's latest Sunny Cove saw a regression to 5 cycles
    when they grew the size to 48KB. Food for thought on the advantages and
    disadvantages of slow- versus fast-frequency designs.

    On the L2 side of things, Apple has been employing an 8MB structure
    that's shared between their two big cores. This is an extremely unusual
    cache hierarchy and contrasts to everybody else's use of an intermediary
    sized private L2 combined with a larger slower L3. Apple here disregards
    the norms, and chooses a large and fast L2. Oddly enough, this
    generation the A14 saw the L2 of the big cores make a regression in
    terms of access latency, going back from 14 cycles to 16 cycles,
    reverting the improvements that had been made with the A13. We don't
    know for sure why this happened; I do see higher parallel access
    bandwidth into the cache for scalar workloads, however peak bandwidth
    still seems to be the same as the previous generation. Another
    hypothesis is that because Apple shares the L2 amongst cores, this
    might be an indicator of changes for Apple Silicon SoCs with more than
    just two cores connected to a single cache, much like the A12X
    generation.

    Apple has employed a large LLC on their SoCs for many generations
    now. On the A14 this appears to again be a 16MB cache that serves
    all the IP blocks on the SoC – most useful, of course, for the CPU and
    GPU. Comparatively speaking, this cache hierarchy isn't nearly as fast
    as the
    actual CPU-cluster L3s of other designs out there, and in recent years
    we've seen more mobile SoC vendors employ such LLC in front of the
    memory controllers for the sake of power efficiency. What Apple would do
    in a larger laptop or desktop chip remains unclear, but I do think we'd
    see similar designs there.

    We've covered more specific aspects of Apple's designs, such as their
    MLP (memory level parallelism) capabilities, and the A14 doesn't seem to
    change in that regard. One other change I've noted from the A13 is that
    the new design now also makes use of Arm's more relaxed memory model,
    in that the design is able to automatically optimise streaming stores
    into non-temporal stores, mimicking the change that had been introduced
    in the Cortex-A76 and the Exynos-M4. In theory, x86 designs wouldn't be
    able to achieve a similar optimisation – at the very least, it would be
    very interesting to see if one attempted to do so.
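
    To illustrate what kind of code this affects: the pattern in question
    is a plain store loop that overwrites memory sequentially, cache line
    after cache line, without reading it back. A sketch of such a loop in C
    (the non-temporal conversion happens in hardware; nothing in the source
    changes):

        #include <stdlib.h>

        /* A plain streaming-store loop: sequential stores that fully
           overwrite cache line after cache line, never reading the data
           back. Arm's relaxed memory model lets a core that detects this
           pattern treat the stores as non-temporal – skipping cache
           allocation and avoiding cache pollution – with no source change.
           On x86, the programmer has to opt in explicitly (e.g. via the
           SSE2 _mm_stream_ps intrinsic) to get non-temporal behaviour. */
        static void fill(float *dst, size_t n, float v) {
            for (size_t i = 0; i < n; i++)
                dst[i] = v;   /* candidate for non-temporal conversion */
        }

        int main(void) {
            size_t n = 1 << 24;                     /* 64MB of floats */
            float *dst = malloc(n * sizeof(float));
            fill(dst, n, 1.0f);
            free(dst);
            return 0;
        }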

    Maximum Frequency vs Loaded Threads
    Per-Core Maximum MHz
    --------------------------------------------------------
    Apple A14          1      2      3      4      5      6
    --------------------------------------------------------
    Performance 1   2998   2890   2890   2890   2890   2890
    Performance 2          2890   2890   2890   2890   2890
    Efficiency 1                  1823   1823   1823   1823
    Efficiency 2                         1823   1823   1823
    Efficiency 3                                1823   1823
    Efficiency 4                                       1823
    --------------------------------------------------------

    Of course, the old argument about having a very wide architecture is
    that you cannot clock as high as something which is narrower. This is
    somewhat true; however, I wouldn't come to any conclusion as to the
    capabilities of Apple's design in a higher-power device. On the A14
    inside of the new iPhones, the new Firestorm cores are able to reach
    3GHz clock speeds, clocking down to 2.89GHz when two cores are active
    at any time.

    We'll be investigating power in more detail in just a bit, but I
    currently see Apple being limited by the thermal envelope of the actual
    phones rather than it being some intrinsic clock ceiling of the
    microarchitecture. The new Firestorm cores are now clocking in at
    roughly the same speed as any other mobile CPU microarchitecture from
    Arm, even though it's a significantly wider design – so the argument
    about having to clock slower because of the more complex design also
    doesn't seem to apply in this instance. It will be very interesting to
    see what Apple could do not only in a higher thermal envelope device
    such as a laptop, but also in a wall-powered device such as a Mac
    desktop.

    *Dominating Mobile Performance*

    Before we dig deeper into the x86 vs Apple Silicon debate, it would be
    useful to look in more detail at how the A14 Firestorm cores have
    improved upon the A13 Lightning cores, as well as to detail the power
    and power efficiency improvements of the new chip's 5nm process node.

    The process node is actually quite the wildcard in the comparisons here
    as the A14 is the first 5nm chipset on the market, closely followed by
    Huawei's Kirin 9000 in the Mate 40 series. We happen to have both
    devices and chips in house for testing, and by contrasting the Kirin
    9000 (Cortex-A77 at 3.13GHz on N5) vs the Snapdragon 865+ (Cortex-A77 at
    3.09GHz on N7P) we can somewhat deduce how much of an impact the process
    node has in terms of power and efficiency, translating those
    improvements to the A13 vs A14 comparison.

    <https://images.anandtech.com/doci/16226/specint_big_575px.png>

    Starting off with SPECint2006, we don't see anything very unusual about
    the A14 scores, save the great improvement in 456.hmmer. Actually, this
    wasn't due to a microarchitectural jump, but rather due to new
    optimisations on the part of the new LLVM version in Xcode 12. It seems
    here that the compiler has employed a similar loop optimisation as found
    on GCC8 onwards. The A13's score had actually also improved, from 47.79
    to 64.87, but I hadn't run new numbers on the whole suite yet.

    For the rest of the workloads, the A14 generally looks like a relatively
    linear progression from the A13, accounting for the clock frequency
    increase from 2.66GHz to 3GHz. The overall IPC gain for the suite looks
    to be around 5%, which is a bit less than Apple's prior generations,
    though with a larger than usual clock speed increase.

    Power consumption for the new chip is actually in line with, and
    sometimes even better than, the A13's, which means that workload energy
    efficiency this generation has seen a noticeable improvement even at
    the peak performance point.

    Performance against the contemporary Android and Cortex-powered SoCs
    looks to be quite lopsided in favour of Apple. The thing that stands out
    the most are the memory-intensive, sparse-memory workloads such as
    429.mcf and 471.omnetpp, where the Apple design features well over twice
    the performance, even though all the chips are running similar
    mobile-grade LPDDR4X/LPDDR5 memory. In our microarchitectural
    investigations we've seen signs of “memory magic” on Apple's designs,
    where we believe they may be using some sort of pointer-chase
    prefetching mechanism.

    <https://images.anandtech.com/doci/16226/specfp_big_575px.png>

    In SPECfp, the increases of the A14 over the A13 are a little higher
    than the linear clock frequency increase, as we're measuring an overall
    10-11% IPC uplift here. This isn't too surprising given the additional
    fourth FP/SIMD pipeline of the design, whereas the integer side of the
    core has remained relatively unchanged compared to the A13.

    <https://images.anandtech.com/doci/16226/spec2006_A14_575px.png>

    In the overall mobile comparison, we can see that the new A14 has made
    robust progress in terms of increasing performance over the A13.
    Compared to the competition, Apple is well ahead of the pack – we'll
    have to wait for next year's Cortex-X1 devices to see the gap narrow
    again.

    What's also very important to note here is that Apple has achieved this
    all whilst remaining flat, or even lowering the power consumption of the
    new chip, notably reducing energy consumption for the same workloads.

    Looking at the Kirin 9000 vs the Snapdragon 865+, we're seeing a 10%
    reduction in power at relatively similar performance. Both chips use the
    same CPU IP, only differing in their process node and implementations.
    It seems Apple's A14 here has been able to achieve better figures than
    just the process node improvement, which is expected given that it's a
    new microarchitecture design as well.

    One further note is the data of the A14's small efficiency cores. This
    generation we saw a large microarchitectural boost on the part of these
    new cores, which now see 35% better performance versus last year's
    A13 efficiency cores – all while further reducing energy consumption. I
    don't know how the small cores will come into play on Apple's “Apple
    Silicon” Mac designs, but they're certainly still very performant and
    extremely efficient compared to other contemporary Arm designs.

    Lastly, there's the x86 vs Apple performance comparison. Usually for
    iPhone reviews I comment on this in this section of the article, but
    given today's context and the goals Apple has made for Apple Silicon,
    let's investigate that in a whole dedicated section…

    *From Mobile to Mac: What to Expect?*

    To date, our performance comparisons for Apple's chipsets have always
    been in the context of iPhone reviews, with the juxtaposition to x86
    designs being a rather small footnote within the context of the
    articles. Today's Apple Silicon launch event completely changes the
    narrative of what we portray in terms of performance, setting aside the
    typical apples vs oranges comparisons people usually argue over.

    We currently do not have Apple Silicon devices and likely won't get our
    hands on them for another few weeks, but we do have the A14, and expect
    the new Mac chips to be strongly based on the microarchitecture we're
    seeing employed in the iPhone designs. Of course, we're still comparing
    a phone chip versus a high-end laptop and even a high-end desktop chip,
    but given the performance numbers, that's also exactly the point we're
    trying to make here, setting the stage as the bare minimum of what Apple
    could achieve with their new Apple Silicon Mac chips.

    <https://images.anandtech.com/graphs/graph16226/111158.png>

    The performance numbers of the A14 on this chart are relatively
    mind-boggling. If I were to release this data with the label of the A14
    hidden, one would guess that the data-points came from some other x86
    SKU from either AMD or Intel. The fact that the A14 currently competes
    with the very best top-performance designs that the x86 vendors have on
    the market today is just an astonishing feat.

    Looking into the detailed scores, what again amazes me is the fact that
    the A14 not only keeps up, but actually beats both these competitors in memory-latency sensitive workloads such as 429.mcf and 471.omnetpp, even
    though they either have the same memory (i7-1185G7 with LPDDR4X-4266),
    or desktop-grade memory (5950X with DDR4-3200).

    Again, disregard the 456.hmmer score advantage of the A14 – that's
    largely due to compiler discrepancies; subtract 33% for a more apt
    comparison figure.

    <https://images.anandtech.com/graphs/graph16226/111159.png>

    Even in SPECfp, which is even more dominated by memory-heavy workloads,
    the A14 not only keeps up, but generally beats the Intel CPU design more
    often than not. AMD also wouldn't be looking good if not for the
    recently released Zen3 design.

    <https://images.anandtech.com/graphs/graph16226/111168.png>

    In the overall SPEC2006 chart, the A14 performs absolutely
    fantastically, taking the lead in absolute performance and falling
    short only of AMD's recent Ryzen 5000 series.

    The fact that Apple is able to achieve this at a total device power
    consumption of 5W – including the SoC, DRAM, and regulators – versus the
    21+W (1185G7) and 49W (5950X) package power figures, which exclude DRAM
    and regulation, is absolutely mind-blowing.

    <https://images.anandtech.com/graphs/graph16226/119329.png>

    There's been a lot of criticism about more common benchmark suites such
    as GeekBench, but frankly I've found these concerns or arguments to be
    quite unfounded. The only factual difference between workloads in SPEC
    and workloads in GB5 is that the latter has fewer outlier tests which
    are memory-heavy, meaning it's more of a CPU benchmark, whereas SPEC
    leans more towards CPU+DRAM.

    The fact that Apple does well in both workloads is evidence that they
    have an extremely well-balanced microarchitecture, and that Apple
    Silicon will be able to scale up to “desktop workloads” in terms of performance without much issue.

    *Where the Performance Trajectory Finally Intersects*

    During the release of the A7, people were pretty dismissive of the fact
    that Apple had called their microarchitecture a desktop-class design.
    People were also very dismissive when we said the A11 and A12 were
    reaching near-desktop-level performance figures a few years back. But
    today marks an important moment in time for the industry, as Apple's A14
    is now clearly able to showcase performance that's beyond the best that
    Intel can offer. It's a performance trajectory that has been steadily
    executing and progressing for years:

    <https://images.anandtech.com/doci/16226/perf-trajectory_575px.png>

    Whilst in the past 5 years Intel has managed to increase their best single-thread performance by about 28%, Apple has managed to improve
    their designs by 198%, or 2.98x (let's call it 3x) the performance of
    the Apple A9 of late 2015.
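
    Annualised, those two trajectories are even starker. A back-of-the-
    envelope calculation of the implied compound yearly gains over that
    5-year window:

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            double years = 5.0;
            double apple_total = 2.98;   /* A9 (2015) -> A14 (2020) */
            double intel_total = 1.28;   /* best single-thread, same window */

            /* Compound annual growth rate: total^(1/years) - 1. */
            printf("Apple: ~%.0f%% per year\n",
                   (pow(apple_total, 1.0 / years) - 1) * 100);  /* ~24% */
            printf("Intel: ~%.0f%% per year\n",
                   (pow(intel_total, 1.0 / years) - 1) * 100);  /* ~5%  */
            return 0;   /* link with -lm */
        }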

    Apple's performance trajectory and unquestioned execution over these
    years is what has made Apple Silicon a reality today. Anybody looking at
    the absurdness of that graph will realise that there simply was no other
    choice but for Apple to ditch Intel and x86 in favour of their own
    in-house microarchitecture – staying the course would have meant
    stagnation and worse consumer products.

    Today's announcements only covered Apple's laptop-class Apple Silicon,
    and whilst we don't know at the time of writing the details of what
    Apple will be presenting next, Apple's enormous power efficiency
    advantage means that the new chip will be able to offer either vastly
    increased battery life, and/or vastly increased performance, compared
    to the current Intel MacBook line-up.

    Apple has claimed that they will completely transition their whole
    consumer line-up to Apple Silicon within two years, which is an
    indicator that we'll be seeing a high-TDP many-core design to power a
    future Mac Pro. If the company is able to continue on their current
    performance trajectory, it will look extremely impressive.

    *Apple Shooting for the Stars: x86 Incumbents Beware*

    The previous pages were written ahead of Apple officially announcing the
    new M1 chip. We already saw the A14 performing outstandingly and
    outperforming the best that Intel has to offer. The new M1 should
    perform notably above that.

    We come back to a few of Apple's slides during the presentation as to
    what to expect in terms of performance and efficiency. In particular,
    the performance/power curves are the most detail that Apple is sharing
    at this moment in time:

    <https://images.anandtech.com/doci/16226/2020-11-10%2019_11_10_575px.jpg>

    In this graphic, Apple showcases the new M1 chip featuring a CPU power
    consumption peak of around 18W. The competing PC laptop chip here peaks
    in the 35-40W range, so these are certainly not single-threaded
    performance figures, but rather whole-chip multi-threaded performance.
    We don't know if this is comparing the M1 to an AMD Renoir chip or an
    Intel ICL or TGL chip, but in both cases the same general verdict
    applies:

    Apple's use of a significantly more advanced microarchitecture that
    offers significantly higher IPC, enabling high performance at low core
    clocks, allows for significant power efficiency gains versus the
    incumbent x86 players. The graphic shows that at peak-to-peak, the M1
    offers around a 40% performance uplift compared to the existing
    competitive offering, all whilst doing it at 40% of the power
    consumption.

    Apple's comparison of arbitrary performance points deserves criticism;
    however, the 10W measurement point where Apple claims 2.5x the
    performance does make some sense, as this is the nominal TDP of the
    chips used in the Intel-based MacBook Air. Again, it's thanks to the
    power efficiency characteristics that Apple has been able to achieve in
    the mobile space that the M1 is promised to showcase such large gains –
    it certainly matches our A14 data.

    *Don't forget about the GPU*

    Today we mostly covered the CPU side of things, as that's where the
    unprecedented industry shift is happening. However, we shouldn't forget
    about the GPU, as the new M1 represents Apple's first introduction
    of their custom GPU designs into the Mac space.

    <https://images.anandtech.com/doci/16226/2020-11-10%2019_12_54_575px.jpg>

    Apple's performance and power efficiency claims here are really lacking
    context, as we have no idea what their comparison point is. I won't try
    to theorise here, as there are just too many variables at play, and we
    don't know enough details.

    What we do know is that in the mobile space, Apple is absolutely leading
    the pack in terms of performance and power efficiency. The last time we
    tested the A12Z, the design was more than able to compete with and beat
    integrated graphics designs. But since then we've seen more significant
    jumps from both AMD and Intel.

    *Performance Leadership?*

    Apple claims the M1 to be the fastest CPU in the world. Given our data
    on the A14 – which beats all of Intel's designs and falls just short of
    AMD's newest Zen3 chips – a higher-clocked Firestorm above 3GHz, a 50%
    larger L2 cache, and an unleashed TDP mean we can certainly believe that
    Apple and the M1 will be able to achieve that claim.

    This moment has been brewing for years now, and the new Apple Silicon is
    both shocking and very much expected. In the coming weeks we'll be
    trying to get our hands on the new hardware and verify Apple's claims.

    Intel has stagnated itself out of the market, and has lost a major
    customer today. AMD has shown lots of progress lately; however, it'll be
    incredibly hard to catch up to Apple's power efficiency. If Apple's
    performance trajectory continues at this pace, the x86 performance crown
    might never be regained.

    Reference: <https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive>

    --
    E-mail sent to this address may be devoured by my ravenous SPAM filter.
    I often ignore posts from Google. Use a real news client instead.

    JR
    --- Synchronet 3.18b-Win32 NewsLink 1.113