From: Noah Goldstein
Date: Fri, 26 Nov 2021 14:33:31 -0600
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
To: Eric Dumazet
Cc: tglx@linutronix.de, mingo@redhat.com, Borislav Petkov, dave.hansen@linux.intel.com, X86 ML, hpa@zytor.com, peterz@infradead.org, alexanderduyck@fb.com, open list
References: <20211125193852.3617-1-goldstein.w.n@gmail.com>

On Fri, Nov 26, 2021 at 2:07 PM Eric Dumazet wrote:
>
> On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein wrote:
> >
> > Bright :) but it will need a BMI support check.
>
> Yes, probably not worth the pain.

Making a V2 for my patch with your optimization for the loop case. Do you
think 1 or 2 accum for the 32 byte case?
>
> >
> > I actually get better performance in hyperthread benchmarks with 2 accum:
> >
> > Used:
> >
> >         u64 res;
> >         temp64 = (__force uint64_t)sum;
> >         asm("movq 0*8(%[src]),%[res]\n\t"
> >             "addq 1*8(%[src]),%[res]\n\t"
> >             "adcq 2*8(%[src]),%[res]\n\t"
> >             "adcq $0, %[res]\n"
> >             "addq 3*8(%[src]),%[temp64]\n\t"
> >             "adcq 4*8(%[src]),%[temp64]\n\t"
> >             "adcq %[res], %[temp64]\n\t"
> >             "mov  %k[temp64],%k[res]\n\t"
> >             "rorx $32,%[temp64],%[temp64]\n\t"
> >             "adcl %k[temp64],%k[res]\n\t"
> >             "adcl $0,%k[res]"
> >             : [temp64] "+r"(temp64), [res] "=&r"(res)
> >             : [src] "r"(buff)
> >             : "memory");
> >         return (__force __wsum)res;
> >
> > w/ hyperthread:
> > size, 2acc lat, 1acc lat, 2acc tput, 1acc tput
> >   40,    6.511,    7.863,     6.177,     6.157
> >
> > w/o hyperthread:
> > size, 2acc lat, 1acc lat, 2acc tput, 1acc tput
> >   40,    5.577,    6.764,     3.150,     3.210
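
For readers comparing the two variants, here is a rough portable C sketch of what the 1 accum and 2 accum forms of the 40 byte case compute. It is not the code from the patch: the helper names (add64_carry, fold64, csum40_1acc, csum40_2acc) are hypothetical, plain uint32_t stands in for __wsum, and each carry is folded back in immediately rather than chained through adcq. It only shows that splitting the five quadwords across two accumulators leaves the folded result unchanged.

```c
#include <stdint.h>
#include <string.h>

/* One's-complement add: a 64-bit add with the carry-out wrapped back in.
 * This is a C stand-in for folding the carry with adcq, not the asm itself. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < a);	/* s < a exactly when the add overflowed */
}

/* Fold a 64-bit partial sum to 32 bits: low half + high half + carry. */
static uint32_t fold64(uint64_t v)
{
	uint32_t lo = (uint32_t)v;
	uint32_t s = lo + (uint32_t)(v >> 32);

	return s + (s < lo);
}

/* Hypothetical 1 accum variant: one dependency chain over all five quadwords. */
static uint32_t csum40_1acc(const void *buff, uint32_t sum)
{
	uint64_t w[5], acc = sum;
	int i;

	memcpy(w, buff, sizeof(w));	/* sidesteps alignment and aliasing issues */
	for (i = 0; i < 5; i++)
		acc = add64_carry(acc, w[i]);
	return fold64(acc);
}

/* Hypothetical 2 accum variant: two independent chains, merged before the fold. */
static uint32_t csum40_2acc(const void *buff, uint32_t sum)
{
	uint64_t w[5], acc0, acc1 = sum;

	memcpy(w, buff, sizeof(w));
	acc0 = add64_carry(add64_carry(w[0], w[1]), w[2]);	/* quadwords 0..2 */
	acc1 = add64_carry(add64_carry(acc1, w[3]), w[4]);	/* sum + quadwords 3..4 */
	return fold64(add64_carry(acc0, acc1));
}
```

Calling both helpers on the same 40 byte buffer and initial sum should return the same value. The sketch says nothing about speed: the latency and throughput numbers quoted above come from the asm version, where the two adc chains can overlap in the pipeline.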