Message-ID: <20220404104820.713066297@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: x86@kernel.org, Andrew Cooper, "Edgecombe, Rick P"
Subject: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
References: <20220404103741.809025935@linutronix.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Date: Mon, 4 Apr 2022 14:11:27 +0200 (CEST)
X-Mailing-List: linux-kernel@vger.kernel.org

XSAVEC/S store the FPU state in compacted format, which avoids holes in
the memory image. The kernel uses this feature in a very naive way and
just avoids holes which come from unsupported features, like PT. That's
a marginal saving of 128 bytes vs. the uncompacted format on a SKL-X.

The first 576 bytes are fixed: 512 bytes of legacy state (FP/SSE) plus
the 64 byte XSAVE header.
On a SKL-X machine the other components are stored at the following
offsets:

  xstate_offset[2]:  576, xstate_sizes[2]:  256
  xstate_offset[3]:  832, xstate_sizes[3]:   64
  xstate_offset[4]:  896, xstate_sizes[4]:   64
  xstate_offset[5]:  960, xstate_sizes[5]:   64
  xstate_offset[6]: 1024, xstate_sizes[6]:  512
  xstate_offset[7]: 1536, xstate_sizes[7]: 1024
  xstate_offset[9]: 2560, xstate_sizes[9]:    8

XSAVEC/S use the init optimization, which does not write the data of a
component when the component is in init state. That state is recorded in
the XSTATE_BV bitmap of the XSTATE header. The kernel requests to save
all enabled components, which results in a suboptimal write/read pattern
when the set of active components is sparse.

A typical scenario is an active set of 0x202 (PKRU + SSE) out of the
full supported set of 0x2FF. That means XSAVEC/S writes and XRSTOR[S]
reads:

  - SSE in the legacy area (0-511)
  - Part of the XSTATE header (512-575)
  - PKRU at offset 2560

which is suboptimal. Prefetch works better when the access is linear.
But what's worse is that PKRU can be located in a different page, which
obviously affects the dTLB.

XSAVEC/S allow the memory footprint to be reduced further when the
active feature set is sparse and the CPU supports XGETBV1. XGETBV1
reads the state of the XSTATE components as a bitmap. This bitmap can
be fed into XSAVEC/S to request only the storage of the active
components, which changes the layout of the state buffer to:

  - SSE in the legacy area (0-511)
  - Part of the XSTATE header (512-575)
  - PKRU at offset 576

This optimization does not gain much for e.g. a kernel build, but for
context switch heavy applications it is very visible.
Perf stats from hackbench:

Before:

     242,618.89 msec task-clock            # 102.928 CPUs utilized    ( +- 0.20% )
      1,038,988      context-switches      #   0.004 M/sec            ( +- 0.54% )
        460,081      cpu-migrations        #   0.002 M/sec            ( +- 0.56% )
         10,813      page-faults           #   0.045 K/sec            ( +- 0.62% )
 506,912,353,968      cycles               #   2.089 GHz              ( +- 0.20% )
 167,267,811,210      instructions         #   0.33 insn per cycle    ( +- 0.04% )
  34,481,978,727      branches             # 142.124 M/sec            ( +- 0.04% )
     305,975,304      branch-misses        #   0.89% of all branches  ( +- 0.09% )

        2.35717 +- 0.00607 seconds time elapsed  ( +- 0.26% )

 506,064,738,921      cycles                                          ( +- 0.43% )
   3,334,160,871      L1-dcache-load-misses                           ( +- 0.77% )
     135,271,979      dTLB-load-misses                                ( +- 2.12% )
      18,169,634      dTLB-store-misses                               ( +- 1.78% )

        2.3323 +- 0.0117 seconds time elapsed  ( +- 0.50% )

After:

     222,252.90 msec task-clock            # 103.800 CPUs utilized    ( +- 0.51% )
      1,004,665      context-switches      #   0.005 M/sec            ( +- 0.42% )
        459,123      cpu-migrations        #   0.002 M/sec            ( +- 0.33% )
         10,677      page-faults           #   0.048 K/sec            ( +- 0.79% )
 464,356,465,870      cycles               #   2.089 GHz              ( +- 0.51% )
 166,615,501,152      instructions         #   0.36 insn per cycle    ( +- 0.05% )
  34,355,848,663      branches             # 154.580 M/sec            ( +- 0.05% )
     300,049,704      branch-misses        #   0.87% of all branches  ( +- 0.14% )

        2.1412 +- 0.0117 seconds time elapsed  ( +- 0.55% )

 473,864,807,936      cycles                                          ( +- 0.64% )
   3,198,078,809      L1-dcache-load-misses                           ( +- 0.24% )
      27,798,721      dTLB-load-misses                                ( +- 2.33% )
       4,981,069      dTLB-store-misses                               ( +- 1.80% )

        2.1733 +- 0.0132 seconds time elapsed  ( +- 0.61% )

The most significant change is in the dTLB misses. The effect depends
on the application scenario, the kernel configuration and the placement
of the task_struct allocation, so it might not be noticeable at all.

As the XGETBV1 optimization does not introduce measurable overhead,
it's worth using it when supported by the hardware. Enable it when
available with a static key and mask out the non-active states in the
requested bitmap for XSAVEC/S.
Signed-off-by: Thomas Gleixner
---
 arch/x86/kernel/fpu/xstate.c |   10 ++++++++--
 arch/x86/kernel/fpu/xstate.h |   16 +++++++++++++---
 2 files changed, 21 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -86,6 +86,8 @@ static unsigned int xstate_flags[XFEATUR
 #define XSTATE_FLAG_SUPERVISOR		BIT(0)
 #define XSTATE_FLAG_ALIGNED64		BIT(1)
 
+DEFINE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
+
 /*
  * Return whether the system supports a given xfeature.
  *
@@ -1481,7 +1483,7 @@ void xfd_validate_state(struct fpstate *
 }
 #endif /* CONFIG_X86_DEBUG_FPU */
 
-static int __init xfd_update_static_branch(void)
+static int __init fpu_update_static_branches(void)
 {
 	/*
 	 * If init_fpstate.xfd has bits set then dynamic features are
@@ -1489,9 +1491,13 @@
 	 */
 	if (init_fpstate.xfd)
 		static_branch_enable(&__fpu_state_size_dynamic);
+
+	if (cpu_feature_enabled(X86_FEATURE_XGETBV1) &&
+	    cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
+		static_branch_enable(&__xsave_use_xgetbv1);
 	return 0;
 }
-arch_initcall(xfd_update_static_branch)
+arch_initcall(fpu_update_static_branches)
 
 void fpstate_free(struct fpu *fpu)
 {
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -10,7 +10,12 @@
 DECLARE_PER_CPU(u64, xfd_state);
 #endif
 
-static inline bool xsave_use_xgetbv1(void) { return false; }
+DECLARE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
+
+static __always_inline __pure bool xsave_use_xgetbv1(void)
+{
+	return static_branch_likely(&__xsave_use_xgetbv1);
+}
 
 static inline void __xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
 {
@@ -185,13 +190,18 @@ static inline int __xfd_enable_feature(u
 static inline void os_xsave(struct fpstate *fpstate)
 {
 	u64 mask = fpstate->xfeatures;
-	u32 lmask = mask;
-	u32 hmask = mask >> 32;
+	u32 lmask, hmask;
 	int err;
 
 	WARN_ON_FPU(!alternatives_patched);
 	xfd_validate_state(fpstate, mask, false);
 
+	if (xsave_use_xgetbv1())
+		mask &= xgetbv(1);
+
+	lmask = mask;
+	hmask = mask >> 32;
+
 	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
 
 	/* We should never fault when copying to a kernel buffer: */