Subject: Re: [PATCH v4 05/11] arm64: csum: Disable KASAN for do_csum()
To: David Laight, Will Deacon, Mark Rutland
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
    kernel-team@android.com, Michael Ellerman, Peter Zijlstra,
    Linus Torvalds, Segher Boessenkool, Christian Borntraeger,
    Luc Van Oostenryck, Arnd Bergmann, Peter Oberparleiter,
    Masahiro Yamada, Nick Desaulniers
From: Robin Murphy
Date: Fri, 24 Apr 2020 12:00:52 +0100

On 2020-04-24 10:41 am, David Laight wrote:
> From: Robin Murphy
>> Sent: 22 April 2020 12:02
> ..
>> Sure - I have a nagging feeling that it could still do better WRT
>> pipelining the loads anyway, so I'm happy to come back and reconsider
>> the local codegen later. It certainly doesn't deserve to stand in the
>> way of cross-arch rework.
>
> How fast does that loop actually run?

I've not characterised it in detail, but faster than any of the other
attempts so far ;)

> To my mind it seems to do a lot of operations on each 64bit value.
> I'd have thought that a loop based on:
> 	sum64 = *ptr;
> 	sum64_high = *ptr++ >> 32;
> and then fixing up the result would be faster.
>
> The x86-64 code is also bad!
> On intel cpu prior to haswell a simple:
> 	sum_64 += *ptr32++;
> is faster than the current code.
> (Although you can do a lot better even on ivy bridge.)

The aim here is to minimise load bandwidth - most Arm cores can slurp
16 bytes from L1 in a single load as quickly as any smaller amount, so
nibbling away in little 32-bit chunks would result in up to 4x more
load cycles.
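For reference, a rough C sketch of the kind of wide-load, __uint128_t
accumulation being discussed - illustrative only: the names and overall
structure here are made up for this example, it is not the actual
arch/arm64/lib/csum.c source, and alignment and tail handling are
omitted:

/*
 * Illustrative only: fold one 64-bit value into a running
 * end-around-carry sum.  The __uint128_t addition lets the compiler
 * keep the overflow in the carry flag, so "+ (tmp >> 64)" should
 * compile down to a single adc rather than any real 128-bit maths.
 */
static unsigned long long accumulate(unsigned long long sum,
				     unsigned long long data)
{
	__uint128_t tmp = (__uint128_t)sum + data;

	return tmp + (tmp >> 64);
}

/*
 * Hypothetical inner loop: consume 64 bytes per iteration so that the
 * compiler can merge adjacent 64-bit loads into 16-byte ldp pairs,
 * then fold the partial sums pairwise into the running total.
 */
static unsigned long long wide_csum(const unsigned long long *ptr,
				    int len, unsigned long long sum)
{
	while (len >= 64) {
		unsigned long long a = accumulate(ptr[0], ptr[1]);
		unsigned long long b = accumulate(ptr[2], ptr[3]);
		unsigned long long c = accumulate(ptr[4], ptr[5]);
		unsigned long long d = accumulate(ptr[6], ptr[7]);

		a = accumulate(a, b);
		c = accumulate(c, d);
		sum = accumulate(sum, accumulate(a, c));

		ptr += 8;
		len -= 64;
	}
	return sum;
}

The whole point of the __uint128_t is that each end-around carry turns
into an adds/adc pair, which is exactly what shows up in the
disassembly below.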
Yes, the C code looks ridiculous, but the other trick is that most of
those operations don't actually exist. Since a __uint128_t is really
backed by any two 64-bit GPRs - or if you're careful, one 64-bit GPR
and the carry flag - all those shifts and rotations are in fact
resolved by register allocation, so what we end up with is a very neat
loop of essentially just loads and 64-bit accumulation:

   ...
 138:	a94030c3 	ldp	x3, x12, [x6]
 13c:	a9412cc8 	ldp	x8, x11, [x6, #16]
 140:	a94228c4 	ldp	x4, x10, [x6, #32]
 144:	a94324c7 	ldp	x7, x9, [x6, #48]
 148:	ab03018d 	adds	x13, x12, x3
 14c:	510100a5 	sub	w5, w5, #0x40
 150:	9a0c0063 	adc	x3, x3, x12
 154:	ab08016c 	adds	x12, x11, x8
 158:	9a0b0108 	adc	x8, x8, x11
 15c:	ab04014b 	adds	x11, x10, x4
 160:	9a0a0084 	adc	x4, x4, x10
 164:	ab07012a 	adds	x10, x9, x7
 168:	9a0900e7 	adc	x7, x7, x9
 16c:	ab080069 	adds	x9, x3, x8
 170:	9a080063 	adc	x3, x3, x8
 174:	ab070088 	adds	x8, x4, x7
 178:	9a070084 	adc	x4, x4, x7
 17c:	910100c6 	add	x6, x6, #0x40
 180:	ab040067 	adds	x7, x3, x4
 184:	9a040063 	adc	x3, x3, x4
 188:	ab010064 	adds	x4, x3, x1
 18c:	9a030023 	adc	x3, x1, x3
 190:	710100bf 	cmp	w5, #0x40
 194:	aa0303e1 	mov	x1, x3
 198:	54fffd0c 	b.gt	138
   ...

Instruction-wise, that's about as good as it can get short of
maintaining multiple accumulators and moving the pairwise folding out
of the loop. The main thing that I think is still left on the table is
that the load-to-use distances are pretty short and there's clearly
scope to spread out and amortise the load cycles better, which stands
to benefit both big and little cores.

Robin.
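Purely as an illustration of that multiple-accumulator idea - again a
hypothetical sketch rather than anything from the patch series - the
loop could keep four independent running sums and fold them only once
afterwards, reusing the accumulate() helper sketched earlier:

/*
 * Hypothetical multi-accumulator variant (illustrative only): four
 * independent end-around-carry sums, folded together once after the
 * loop.  Ones'-complement addition is order-independent, so the
 * folded checksum comes out the same.
 */
static unsigned long long wide_csum_multi(const unsigned long long *ptr,
					  int len, unsigned long long sum)
{
	unsigned long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;

	while (len >= 64) {
		s0 = accumulate(s0, ptr[0]);
		s1 = accumulate(s1, ptr[1]);
		s2 = accumulate(s2, ptr[2]);
		s3 = accumulate(s3, ptr[3]);
		s0 = accumulate(s0, ptr[4]);
		s1 = accumulate(s1, ptr[5]);
		s2 = accumulate(s2, ptr[6]);
		s3 = accumulate(s3, ptr[7]);
		ptr += 8;
		len -= 64;
	}

	/* Fold the partial sums into the running total, once. */
	return accumulate(sum, accumulate(accumulate(s0, s1),
					  accumulate(s2, s3)));
}

The trade-off is a longer epilogue and more live registers in exchange
for taking the pairwise folds out of the loop body, which also gives
the compiler more freedom to hoist the loads further ahead of their
uses.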