From: Yi Sun <yi.sun@intel.com>
To: dave.hansen@intel.com, tglx@linutronix.de, linux-kernel@vger.kernel.org,
	x86@kernel.org
Cc: sohil.mehta@intel.com, ak@linux.intel.com, ilpo.jarvinen@linux.intel.com,
	heng.su@intel.com, tony.luck@intel.com, dave.hansen@linux.intel.com,
	Yi Sun <yi.sun@intel.com>
Subject: [RESEND PATCH v5 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
Date: Tue, 25 Oct 2022 20:17:48 +0800
Message-Id: <20221025121749.584519-2-yi.sun@intel.com>
In-Reply-To: <20221025121749.584519-1-yi.sun@intel.com>
References: <20221025121749.584519-1-yi.sun@intel.com>

Add two trace points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
The latency reported by the new trace points shows when XSAVE/XRSTOR
become more or less expensive, and the RFBM (requested-feature bitmap)
and XINUSE values they dump help figure out why.

Calculate the latency of the XSAVE and XRSTOR instructions within a
single trace event each. Another option considered was two separate
trace events marking the start and the finish of XSAVE/XRSTOR, with the
latency computed from the two trace points in user space, but the trace
function itself added significant overhead. In internal testing, the
single-trace-point approach implemented here avoided most of that
overhead.

Make use of trace_clock() to calculate the latency. It is based on
cpu_clock(), with a precision of at most ~1 jiffy between CPUs.

CONFIG_X86_DEBUG_FPU and CONFIG_TRACEPOINTS are both required, and the
compiler gets rid of all the extra cruft when either of the two configs
is disabled. When both are enabled, xsave/xrstor_tracing_enabled()
reduces to a static check of whether tracing is enabled, so the fast
path only gains two additional static checks.

Since trace points can be enabled dynamically, a trace event can become
enabled concurrently with the tracepoint_enabled(trace_event) check.
There is therefore a small chance of a single noisy result,
'trace_clock() - (-1)', at the moment the x86_fpu_latency_* trace
points are enabled. Leave that noise in rather than adding extra
conditions around the trace_x86_fpu_latency_* calls: a one-off outlier
is not worth the cost, and it is easy to filter out in the consuming
script or other user-space tool.

The trace log looks like the following:
x86_fpu_latency_xsave: x86/fpu: latency:100 RFBM:0x202e7 XINUSE:0x202
x86_fpu_latency_xrstor: x86/fpu: latency:99 RFBM:0x202e7 XINUSE:0x202

Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yi Sun <yi.sun@intel.com>
---
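Not part of the patch: a minimal sketch of the kind of consuming filter
mentioned in the commit message. It assumes the trace lines are piped in
on stdin in the format shown above; the threshold constant is an
arbitrary illustration.

/*
 * Hypothetical user-space filter for the x86_fpu_latency_* trace output.
 * It drops the one-off noisy sample that can appear when a trace point
 * is enabled concurrently with the tracepoint_enabled() check, where the
 * reported latency is roughly trace_clock() - (-1) and therefore huge.
 */
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

/* Arbitrary cut-off: real XSAVE/XRSTOR latencies are far below 1 second. */
#define LATENCY_NOISE_THRESHOLD 1000000000ULL

int main(void)
{
	char line[512];
	uint64_t latency;
	char *p;

	while (fgets(line, sizeof(line), stdin)) {
		p = strstr(line, "latency:");
		if (!p || sscanf(p, "latency:%" SCNu64, &latency) != 1)
			continue;
		if (latency > LATENCY_NOISE_THRESHOLD)
			continue;	/* skip the one-off enable-time noise */
		fputs(line, stdout);
	}
	return 0;
}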
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 4645a6334063..5f7cb633df09 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -89,6 +89,41 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
 	TP_ARGS(fpu)
 );
 
+DECLARE_EVENT_CLASS(x86_fpu_latency,
+	TP_PROTO(struct fpstate *fpstate, u64 latency),
+	TP_ARGS(fpstate, latency),
+
+	TP_STRUCT__entry(
+		__field(struct fpstate *, fpstate)
+		__field(u64, latency)
+		__field(u64, rfbm)
+		__field(u64, xinuse)
+	),
+
+	TP_fast_assign(
+		__entry->fpstate = fpstate;
+		__entry->latency = latency;
+		__entry->rfbm = fpstate->xfeatures;
+		__entry->xinuse = fpstate->regs.xsave.header.xfeatures;
+	),
+
+	TP_printk("x86/fpu: latency:%lld RFBM:0x%llx XINUSE:0x%llx",
+		  __entry->latency,
+		  __entry->rfbm,
+		  __entry->xinuse
+	)
+);
+
+DEFINE_EVENT(x86_fpu_latency, x86_fpu_latency_xsave,
+	     TP_PROTO(struct fpstate *fpstate, u64 latency),
+	     TP_ARGS(fpstate, latency)
+);
+
+DEFINE_EVENT(x86_fpu_latency, x86_fpu_latency_xrstor,
+	     TP_PROTO(struct fpstate *fpstate, u64 latency),
+	     TP_ARGS(fpstate, latency)
+);
+
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH asm/trace/
 #undef TRACE_INCLUDE_FILE
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 5ad47031383b..9b5ef6bdb60a 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -5,6 +5,9 @@
 #include <asm/cpufeature.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/xcr.h>
+#include <linux/trace_clock.h>
+
+#include <asm/trace/fpu.h>
 
 #ifdef CONFIG_X86_64
 DECLARE_PER_CPU(u64, xfd_state);
@@ -68,6 +71,20 @@ static inline u64 xfeatures_mask_independent(void)
 	return XFEATURE_MASK_INDEPENDENT;
 }
 
+static inline bool xsave_tracing_enabled(void)
+{
+	if (!IS_ENABLED(CONFIG_X86_DEBUG_FPU))
+		return false;
+	return tracepoint_enabled(x86_fpu_latency_xsave);
+}
+
+static inline bool xrstor_tracing_enabled(void)
+{
+	if (!IS_ENABLED(CONFIG_X86_DEBUG_FPU))
+		return false;
+	return tracepoint_enabled(x86_fpu_latency_xrstor);
+}
+
 /* XSAVE/XRSTOR wrapper functions */
 
 #ifdef CONFIG_X86_64
@@ -113,7 +130,7 @@ static inline u64 xfeatures_mask_independent(void)
  * original instruction which gets replaced. We need to use it here as the
  * address of the instruction where we might get an exception at.
  */
-#define XSTATE_XSAVE(st, lmask, hmask, err)				\
+#define __XSTATE_XSAVE(st, lmask, hmask, err)				\
 	asm volatile(ALTERNATIVE_3(XSAVE,				\
 				   XSAVEOPT, X86_FEATURE_XSAVEOPT,	\
 				   XSAVEC, X86_FEATURE_XSAVEC,		\
@@ -126,11 +143,22 @@ static inline u64 xfeatures_mask_independent(void)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#define XSTATE_XSAVE(fps, lmask, hmask, err)				\
+	do {								\
+		struct fpstate *f = fps;				\
+		u64 tc = -1;						\
+		if (xsave_tracing_enabled())				\
+			tc = trace_clock();				\
+		__XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err);	\
+		if (xsave_tracing_enabled())				\
+			trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
+	} while (0)
+
 /*
  * Use XRSTORS to restore context if it is enabled. XRSTORS supports compact
  * XSAVE area format.
  */
-#define XSTATE_XRESTORE(st, lmask, hmask)				\
+#define __XSTATE_XRESTORE(st, lmask, hmask)				\
 	asm volatile(ALTERNATIVE(XRSTOR,				\
 				 XRSTORS, X86_FEATURE_XSAVES)		\
 				 "\n"					\
@@ -140,6 +168,17 @@ static inline u64 xfeatures_mask_independent(void)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#define XSTATE_XRESTORE(fps, lmask, hmask)				\
+	do {								\
+		struct fpstate *f = fps;				\
+		u64 tc = -1;						\
+		if (xrstor_tracing_enabled())				\
+			tc = trace_clock();				\
+		__XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask);	\
+		if (xrstor_tracing_enabled())				\
+			trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
+	} while (0)
+
 #if defined(CONFIG_X86_64) && defined(CONFIG_X86_DEBUG_FPU)
 extern void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor);
 #else
@@ -184,7 +223,7 @@ static inline void os_xsave(struct fpstate *fpstate)
 	WARN_ON_FPU(!alternatives_patched);
 	xfd_validate_state(fpstate, mask, false);
 
-	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
+	XSTATE_XSAVE(fpstate, lmask, hmask, err);
 
 	/* We should never fault when copying to a kernel buffer: */
 	WARN_ON_FPU(err);
@@ -201,7 +240,7 @@ static inline void os_xrstor(struct fpstate *fpstate, u64 mask)
 	u32 hmask = mask >> 32;
 
 	xfd_validate_state(fpstate, mask, true);
-	XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+	XSTATE_XRESTORE(fpstate, lmask, hmask);
 }
 
 /* Restore of supervisor state. Does not require XFD */
@@ -211,7 +250,7 @@ static inline void os_xrstor_supervisor(struct fpstate *fpstate)
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 
-	XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+	XSTATE_XRESTORE(fpstate, lmask, hmask);
 }
 
 /*
-- 
2.34.1
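Not part of the patch: a small stand-alone sketch showing how the RFBM and
XINUSE masks from the example trace log above can be decoded, assuming the
architectural XSAVE state-component bit numbering; the helper name and the
output format are illustrative only.

/* Illustrative decoder for the RFBM/XINUSE masks printed by the trace points. */
#include <stdio.h>
#include <stdint.h>

static const char * const xfeature_names[] = {
	[0]  = "x87",
	[1]  = "SSE",
	[2]  = "AVX (YMM)",
	[5]  = "AVX-512 opmask",
	[6]  = "AVX-512 ZMM_Hi256",
	[7]  = "AVX-512 Hi16_ZMM",
	[9]  = "PKRU",
	[17] = "AMX XTILECFG",
	[18] = "AMX XTILEDATA",
};

static void decode_mask(const char *label, uint64_t mask)
{
	unsigned int bit;

	printf("%s=0x%llx:", label, (unsigned long long)mask);
	for (bit = 0; bit < 64; bit++) {
		if (!(mask & (1ULL << bit)))
			continue;
		if (bit < sizeof(xfeature_names) / sizeof(xfeature_names[0]) &&
		    xfeature_names[bit])
			printf(" %s", xfeature_names[bit]);
		else
			printf(" bit%u", bit);
	}
	printf("\n");
}

int main(void)
{
	/* Values taken from the example trace log in the commit message. */
	decode_mask("RFBM", 0x202e7);	/* components requested to be saved/restored */
	decode_mask("XINUSE", 0x202);	/* components actually in a non-init state */
	return 0;
}

For the example values, RFBM requests x87, SSE, AVX, the AVX-512 components,
PKRU and AMX XTILECFG, while XINUSE shows only SSE and PKRU in a non-init
state; that is typically the cheaper case, since the init optimization avoids
writing init-state components.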