Date: Thu, 22 Mar 2018 10:33:43 +0100
From: Ingo Molnar
To: Linus Torvalds
Cc: Thomas Gleixner, David Laight, Rahul Lakkireddy, x86@kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org, mingo@redhat.com,
	hpa@zytor.com, davem@davemloft.net, akpm@linux-foundation.org,
	ganeshgr@chelsio.com, nirranjan@chelsio.com, indranil@chelsio.com,
	Andy Lutomirski, Peter Zijlstra, Fenghua Yu, Eric Biggers
Subject: Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
Message-ID: <20180322093343.aatl3prhheha4dlm@gmail.com>
References: <7f0ddb3678814c7bab180714437795e0@AcuMS.aculab.com>
	<7f8d811e79284a78a763f4852984eb3f@AcuMS.aculab.com>
	<20180320082651.jmxvvii2xvmpyr2s@gmail.com>
	<20180321074634.dzpyjz3ia46snodh@gmail.com>
User-Agent: NeoMutt/20170609 (1.8.3)
X-Mailing-List: linux-kernel@vger.kernel.org

* Linus Torvalds wrote:

> And the real worry is things like AVX-512 etc, which is exactly when
> things like "save and restore one ymm register" will quite likely
> clear the upper bits of the zmm register.
Yeah, I think the only valid save/restore pattern is to 100% correctly
enumerate the width of the vector registers and use full-width instructions.

Using partial registers, even though it's possible in some cases, is probably
a bad idea - not just because most instructions auto-zero the upper portion
to reduce false dependencies, but also because 'mixed' use of partial and
full register access is known to result in penalties on a wide range of
Intel CPUs, at least according to the Agner PDFs. On AMD CPUs there's no
penalty.

So what I think could be done at best is to define a full register
save/restore API, which falls back to XSAVE*/XRSTOR* if we don't have the
routines for the native vector register width. (I.e. if an old kernel is
used on a very new CPU.)

Note that the actual AVX code could still use partial width; it's the
save/restore primitives that have to handle full-width registers.

> And yes, we can have some statically patched code that takes that into
> account, and saves the whole zmm register when AVX512 is on, but the whole
> *point* of the dynamic XSAVES thing is actually that Intel wants to be
> able enable new user-space features without having to wait for OS support.
> Literally. That's why and how it was designed.

This aspect wouldn't be hurt AFAICS: to me it appears that because glibc
uses vector instructions in its memset(), the AVX bits get used early on and
to the maximum, so the XINUSE for them is set for every task. The
optionality of other XSAVE based features like MPX wouldn't be hurt if the
kernel only uses vector registers.

> And saving a couple of zmm registers is actually pretty hard. They're big.
> Do you want to allocate 128 bytes of stack space, preferably 64-byte
> aligned, for a save area? No. So now it needs to be some kind of
> per-thread (or maybe per-CPU, if we're willing to continue to not preempt)
> special save area too.
Hm, that's indeed a nasty complication:

 - While a single 128-byte slot might work, in practice at least two vector
   registers are needed to have enough parallelism and hide latencies.

 - &current->thread.fpu.state.xsave is available almost all the time: with
   our current 'direct' FPU context switching code the only time there's
   live data in &current->thread.fpu is when the task is not running.

   But it's not IRQ-safe. We could probably allow irq save/restore sections
   to use it, as local_irq_save()/restore() is still *much* faster than a
   1-1.5K FPU context save/restore pattern. But I was hoping for a less
   restrictive model ... :-/

To have a better model and avoid the local_irq_save()/restore() we could
perhaps change the IRQ model to have a per-IRQ 'current' value (we have
separate IRQ stacks already), but that's quite a bit of work to transform
all code that operates on the interrupted task (scheduler and timer code).

But it's work that would be useful for other reasons as well. With such a
separation in place &current->thread.fpu.state.xsave would become a generic,
natural vector register save area.

> And even then, it doesn't solve the real worry of "maybe there will be odd
> interactions with future extensions that we don't even know of".

Yes, that's true, but I think we could avoid these dangers by using CPU
model based enumeration. The cost would be that vector ops would only be
available on new CPU models after an explicit opt-in. In many cases it would
be a single new constant in an existing switch() statement, easily
backported as well.

> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?

Ok, so that's not what I'd use it for. I'd use it to:

 - Speed up existing AVX (crypto, RAID) routines for smaller buffer sizes.
   Right now the XSAVE*+XRSTOR* cost is significant:

      x86/fpu: Cost of: XSAVE  insn: 104 cycles
      x86/fpu: Cost of: XRSTOR insn:  80 cycles

   ... and that's with just 128-bit AVX and a ~0.8K XSAVE area.
   The Agner PDF lists Skylake XSAVE+XRSTOR costs at 107+122 cycles, plus
   there's probably a significant amount of L1 cache churn caused by
   XSAVE/XRSTOR. Most of the relevant vector instructions, on the other
   hand, have a single cycle cost.

 - To use vector ops in bulk, well-aligned memcpy(), which in many workloads
   is a fair chunk of all memcpy() activity. A usage profile on a typical
   system:

      galatea:~> cat /proc/sched_debug | grep hist | grep -E '[[:digit:]]{4,}$' | grep '0\]'
      hist[0x0000]: 1514272
      hist[0x0010]: 1905248
      hist[0x0020]:   99471
      hist[0x0030]:  343309
      hist[0x0040]:  177874
      hist[0x0080]:  190052
      hist[0x00a0]:    5258
      hist[0x00b0]:    2387
      hist[0x00c0]:    6975
      hist[0x00d0]:    5872
      hist[0x0100]:    3229
      hist[0x0140]:    4813
      hist[0x0160]:    9323
      hist[0x0200]:   12540
      hist[0x0230]:   37488
      hist[0x1000]:   17136
      hist[0x1d80]:  225199

   The first column is the length of the area copied, the second column is
   the usage count.

   To do this I wouldn't complicate the main memcpy()/memset() interfaces in
   any way to branch off to vector ops; I'd isolate specific memcpy()s and
   memset()s (such as page table copying and page clearing) and use the
   simpler vector register based primitives there.

   For example we have clear_page(), which is used by GFP_ZERO and other
   places, and is implemented on modern x86 CPUs as:

      ENTRY(clear_page_erms)
              movl $4096,%ecx
              xorl %eax,%eax
              rep stosb
              ret

   While for such buffer sizes the enhanced-REP string instructions are
   supposed to be slightly faster than 128-bit AVX ops, for such exact
   page-granular ops I'm pretty sure 256-bit (and 512-bit) vector ops are
   faster.

 - For page-granular memset/memcpy it would also be interesting to
   investigate whether non-temporal, cache-preserving vector ops for such
   known-large bulk ops, such as VMOVNTDQA, are beneficial in certain
   circumstances.

   On x86 there's only a single non-temporal instruction involving GP
   registers: MOVNTI, and for stores only.
   The vector instruction space is a lot richer in that regard, allowing
   non-temporal loads as well, which utilize fill buffers to move chunks of
   memory into vector registers.

   Random example: in do_cow_fault() we use copy_user_highpage() to copy the
   page, which uses copy_user_page() -> copy_page(), which uses:

      ENTRY(copy_page)
              ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
              movl $4096/8, %ecx
              rep movsq
              ret

   But in this COW copy case it's pretty obvious that we shouldn't keep the
   _source_ page in cache. So we could use non-temporal loads, which appear
   to make a difference on more recent uarchs even on write-back memory
   ranges:

      https://stackoverflow.com/questions/40096894/do-current-x86-architectures-support-non-temporal-loads-from-normal-memory

   See the final graph in that entry, where the 'NT load' variant results in
   the best execution time in the 4K case - and this is a limited benchmark
   that doesn't measure the lower cache eviction pressure of NT loads.

   ( The store part is probably better done into the cache, not just due to
     the SFENCE cost (which is relatively low at 40 cycles), but because
     it's probably beneficial to prime the cache with a freshly COW-ed page:
     it's going to get used in the near future once we return from the
     fault. )

   etc.

 - But more broadly, if we open up vector ops for smaller buffer sizes as
   well, then I think other kernel code would start using them too:

    - I think the BPF JIT, whose byte code machine language is used by an
      increasing number of kernel subsystems, could benefit from having
      vector ops. It would possibly allow the handling of floating point
      types.

    - We could consider implementing vector ops based copy-to-user and
      copy-from-user primitives as well, for cases where we know that the
      dominant usage pattern is for larger, well-aligned chunks of memory.

    - Maybe we could introduce a floating point library (which falls back to
      a C implementation) and simplify scheduler math.
      We go to ridiculous lengths to maintain precision across a wide range
      of parameters, essentially implementing 128-bit fixed-point math. Even
      32-bit floating point math would possibly be better than that, even if
      it was done via APIs.

etc.: I think the large vector processor available in modern x86 CPUs could
be utilized by the kernel as well for various purposes. But I think that's
only worth doing if vector registers and their save areas are easily
accessible and the accesses are fundamentally IRQ-safe.

> Yes, yes, I can find an Intel white-paper that talks about setting WC and
> then using xmm and ymm instructions to write a single 64-byte burst over
> PCIe, and I assume that is where the code and idea came from. But I don't
> actually see any reason why a burst of 8 regular quad-word bytes wouldn't
> cause a 64-byte burst write too.

Yeah, I'm not too convinced about the wide readq/writeq usecase either; I
just used the opportunity to outline these (very vague) plans about
utilizing vector instructions more broadly within the kernel.

> So as far as I can tell, there are basically *zero* upsides, and a lot of
> potential downsides.

I agree about the potential downsides, and I think most of them can be
addressed adequately - and I think my list of upsides above is potentially
significant, especially once we have lightweight APIs to utilize individual
vector registers without having to do a full save/restore of the ~1K vector
register context.

Thanks,

	Ingo