Date: Tue, 8 Mar 2022 13:50:25 +0100
From: Peter Zijlstra
To: Wen Yang
Cc: Wen Yang, Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
    Thomas Gleixner, Stephane Eranian, mark rutland, jiri olsa,
    namhyung kim, borislav petkov, x86@kernel.org, "h. peter anvin",
    linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RESEND PATCH 2/2] perf/x86: improve the event scheduling to avoid unnecessary pmu_stop/start
References: <20220304110351.47731-1-simon.wy@alibaba-inc.com>
    <20220304110351.47731-2-simon.wy@alibaba-inc.com>
    <0c119da1-053b-a2d6-1579-8fb09dbe8e63@linux.alibaba.com>
In-Reply-To: <0c119da1-053b-a2d6-1579-8fb09dbe8e63@linux.alibaba.com>

On Sun, Mar 06, 2022 at 10:36:38PM +0800, Wen Yang wrote:

> The perf event generated by the above script A (cycles:D), and the counter
> it used changes from #1 to #3.
> We use perf event in pinned mode, and then
> continuously read its value for a long time, but its PMU counter changes

Yes, so what?

>, so the counter value will also jump.

I fail to see how the counter value will jump when we reprogram the
thing. When we stop we update the value, then reprogram on another
counter and continue. So where does it go sideways?

>
> 0xffff88b72db85800:
> The perf event generated by the above script A (instructions:D), which has
> always occupied #fixed_instruction.
>
> 0xffff88bf46c34000, 0xffff88bf46c35000, 0xffff88bf46c30000:
> Theses perf events are generated by the above script B.
>
>
> > >
> > > so it will cause unnecessary pmu_stop/start and also cause abnormal cpi.
> >
> > How?!?
> >
>
> We may refer to the x86_pmu_enable function:
> step1: save events moving to new counters
> step2: reprogram moved events into new counters
>
> especially:
>
> static inline int match_prev_assignment(struct hw_perf_event *hwc,
>                                         struct cpu_hw_events *cpuc,
>                                         int i)
> {
>         return hwc->idx == cpuc->assign[i] &&
>                 hwc->last_cpu == smp_processor_id() &&
>                 hwc->last_tag == cpuc->tags[i];
> }

I'm not seeing an explanation for how a counter value is not preserved.

> > > Cloud servers usually continuously monitor the cpi data of some important
> > > services. This issue affects performance and misleads monitoring.
> > >
> > > The current event scheduling algorithm is more than 10 years old:
> > > commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
> >
> > irrelevant
> >
>
> commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
>
> This commit is the basis of the perf event scheduling algorithm we currently
> use.

Well yes. But how is the age of it relevant?

> The reason why the counter above changed from #1 to #3 can be found from it:
> The algorithm takes into account the list of counter constraints
> for each event.
> It assigns events to counters from the most
> constrained, i.e., works on only one counter, to the least
> constrained, i.e., works on any counter.
>
> the nmi watchdog permanently consumes one fp (*cycles*).
> therefore, when the above shell script obtains *cycles:D*
> again, it has to use a GP, and its weight is 5.
> but other events (like *cache-misses*) have a weight of 4,
> so the counter used by *cycles:D* will often be taken away.

So what?

I mean, it is known the algorithm isn't optimal, but at least it's
bounded. There are event sets that will fail to schedule but could, but
I don't think you're talking about that.

An event migrating to a different counter is not a problem. This is
expected and normal. Code *must* be able to deal with it.

> In addition, we also found that this problem may affect NMI watchdog in the
> production cluster.
> The NMI watchdog also uses a fixed counter in fixed mode. Usually, it is The
> first element of the event_list array, so it usually takes precedence and
> can get a fixed counter.
> But if the administrator closes the watchdog first and then enables it, it
> may be at the end of the event_list array, so its expected fixed counter may
> be occupied by other perf event, and it can only use one GP. In this way,
> there is a similar issue here: the PMU counter used by the NMI watchdog may
> be disabled/enabled frequently and unnecessarily.

Again, I'm not seeing a problem. If you create more events than we have
hardware counters we'll rotate the list and things will get scheduled in
all sorts of order. This works.

> Any advice or guidance on this would be appreciated.

I'm still not sure what your actual problem is; I suspect you're using
perf wrong.

Are you using rdpmc and not respecting the scheme described in
include/uapi/linux/perf_event.h:perf_event_mmap_page ?

Note that if you're using pinned counters you can simplify that scheme
by ignoring all the timekeeping nonsense.
In that case it does become significantly simpler/faster.

But you cannot use rdpmc without using the mmap page's self-monitoring
data.
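
For reference, a minimal sketch of what the simplified read loop for a
pinned counter could look like, assuming the seqcount scheme documented
at struct perf_event_mmap_page with the time_enabled/time_running part
left out. Everything named here (struct mmap_page_view,
read_pinned_count(), fake_pmc(), fake_hw[]) is a hypothetical stand-in
for illustration and not kernel API. Real code would mmap() the event
fd, read the lock, index, offset and pmc_width fields of the actual
struct perf_event_mmap_page, check cap_user_rdpmc, and execute the
rdpmc instruction where this sketch calls fake_pmc().

```c
#include <stdint.h>

/* Hypothetical, trimmed-down view of the perf_event_mmap_page fields
 * that a pinned-counter read needs. Not the real uapi layout. */
struct mmap_page_view {
    volatile uint32_t lock;      /* sequence count, changes around kernel updates */
    volatile uint32_t index;     /* hardware counter index + 1, 0 if not readable */
    volatile int64_t  offset;    /* value to add to the raw counter reading */
    volatile uint16_t pmc_width; /* width of the hardware counter in bits */
};

/* Reader callback; on x86 this would wrap the rdpmc instruction. */
typedef uint64_t (*pmc_read_fn)(uint32_t counter);

/* Compiler barrier, assuming GCC or Clang. */
#define barrier() __asm__ volatile("" ::: "memory")

/* Fake hardware counters so the sketch runs without PMU access. */
static uint64_t fake_hw[8];

static uint64_t fake_pmc(uint32_t counter)
{
    return fake_hw[counter & 7];
}

static int64_t read_pinned_count(const struct mmap_page_view *pc,
                                 pmc_read_fn read_pmc)
{
    uint32_t seq, index;
    uint16_t width;
    int64_t count;
    uint64_t raw;

    /* Retry until index, offset and the raw reading all belong to the
     * same kernel update. This is what makes a move from one counter
     * to another invisible to the reader. */
    do {
        seq = pc->lock;
        barrier();

        index = pc->index;
        count = pc->offset;
        width = pc->pmc_width;
        raw = index ? read_pmc(index - 1) : 0;

        barrier();
    } while (pc->lock != seq);

    if (index) {
        if (width > 0 && width < 64) {
            /* Sign-extend the width-bit reading. Relies on arithmetic
             * right shift of signed values, as GCC and Clang provide. */
            int shift = 64 - width;
            count += (int64_t)(raw << shift) >> shift;
        } else {
            count += (int64_t)raw;
        }
    }
    return count;
}
```

With made-up numbers: if the page says index = 2, offset = 1000 and
counter #1 reads 50, the result is 1050. If the event is then moved to
counter #3 and the page is republished with index = 4 and
offset = 1050, a raw reading of 7 gives 1057. The reader never needs to
know which counter it was on, as long as it goes through the page on
every read.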