Received: by 2002:a25:d7c1:0:0:0:0:0 with SMTP id o184csp1298666ybg; Fri, 18 Oct 2019 15:26:51 -0700 (PDT) X-Google-Smtp-Source: APXvYqzFviXhQW3F8+xbGnUWaNlNr5iAizdGUrxcobDyAcy0npvgun6dp2udVprp3uWq0UJOs1ft X-Received: by 2002:a17:906:5c0a:: with SMTP id e10mr11000570ejq.285.1571437611492; Fri, 18 Oct 2019 15:26:51 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571437611; cv=none; d=google.com; s=arc-20160816; b=oCTUvfCoUBLaU6ZmejODKdghfTayh6mLmWIHSy6qNLyyzIe22I/BPuCwrjVDXv60Pe bbqJL4shmvAoGLMvqwbDQDjR/a4hRJzvot/DXKDYjFnlIZbiOtMto4jeYi9GHzXxc90l SypJS444fAJLYC56duDEl8CPPu/SkBdCYtOx1gX8dPhhrlHfYgGogIQcewTPQlj/Amwj Dpd6Ysqkm3iPUGoT/HYOMbYFzDtMOVOacbFaLAuJVj13HhJT12uPalmswEPsBKVse+51 axndKShOsQD5ARanOy9wq30D12yc1ptoWkH6vJ3FOvT8L3pMDbZbTocGq2rDtKbdo65N Tr0A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:from:subject:mime-version :message-id:date:dkim-signature; bh=j+5h8tQfMwxRzPWa0YsZn6CHewLhXavyhIU1GNL2cCg=; b=byqMr0HLhudRqlheZzcFFa2+FqaDp9o2fo1ggVJfY41O2Bzk+oLVA/nAaH428IMvYQ JB2mS4lxbOo8dNShG61dvOQ77F6WwsznWVVqAl5+KgP67mMlvEbRcdAzB3BZG7QgBBHY 9/Lpb3eZxIjlo8nXs1j6LTX8dN6UWQgcXeRw/llZw0Qkd6aZEw4n6U0E2PkX+R7dHrDD AqhjW5WPpryBS1ZWbbNrsOHyMwoh8ubyRexIeX620X2LlF97xVDJ+lB4PC31p+WJmT34 GKyMqstM23OcBIy3UbPrPJusPJLjI7A8WndkpR6tN44PcIpl5Qa7AZA4rXW+aKmuAp4b uXEg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=wWvUFiMD; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id o46si4790222edc.124.2019.10.18.15.26.28; Fri, 18 Oct 2019 15:26:51 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=wWvUFiMD; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2439095AbfJRA3V (ORCPT + 99 others); Thu, 17 Oct 2019 20:29:21 -0400 Received: from mail-pg1-f202.google.com ([209.85.215.202]:39698 "EHLO mail-pg1-f202.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2406927AbfJRA2l (ORCPT ); Thu, 17 Oct 2019 20:28:41 -0400 Received: by mail-pg1-f202.google.com with SMTP id m20so3021626pgv.6 for ; Thu, 17 Oct 2019 17:28:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:message-id:mime-version:subject:from:to:cc; bh=j+5h8tQfMwxRzPWa0YsZn6CHewLhXavyhIU1GNL2cCg=; b=wWvUFiMDL+FpAEBvOeSh92Ok6NrnP0bMly0DvcbnIEwAXgGMsFpdxjlNZIkHJTYYBa RRx649P7sqiPze6DlRwJOUwwuFezLRxr6AueTJGVgGsRuHZqUR62pVBaDH5dw3oVDwJF cqz06TfsdUQqKYWYzqCu1qGrSGJTcxPYv4FciDQIqqO3PIW4m2UzT+g4IwApY3LkDK0H UzSRCzIapr1rLvg4twVJwpwLGQFu0MzEemzo/jZzvekhF3dWx3g2K2T5lXemO6XXF183 DSve4IVI9m4PDbMg8OfAlO424FgA0MXzf8aoQb3/3dDlHQPqUPbovQfJUZyE+fNef8pE aqeA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:message-id:mime-version:subject:from:to:cc; bh=j+5h8tQfMwxRzPWa0YsZn6CHewLhXavyhIU1GNL2cCg=; b=S3e+uEg95ajtwW76FKX+2LqAPGDrEBxh3YkwKr2gpOirVzvD0FZa7LECl1i/XEnR+3 GirzjtfiWjEoW9yUu8fVFZT3u4XYXfnikrmveSdATOtpJam83LqzFCetEVyA9+TcoLqU 8RvKQKBf8OeIgcoQHr3MHucLadm2DTdpy50+1MldQ3FIPvBpB+PsbYEFjFrQvKjaSs8f 26RMtYtxSs99WlPL5EhkIELW2Eyb9YtHJMbrXCMKKH/ASSRgbPLVZg8oNeIY+3cBvqa5 YzyqWalEu6vjH7u6iw3C8Gu+Fayqyt1WQyNXxhBnvc3WrJasc87b4kp1miACKuu4SZyn O5BA== X-Gm-Message-State: APjAAAWBPWiSmSUI4T+Mx3GBR/w8a+XAO72kObBHHdo0NXD5Gy0IH8y+ 1mg9UmYn+IGVIvA8RIDGPQ7eBevZW/FOUPhC8lysEJt2e3l88330Z5YvF7BaKJ0rqXheLfwnWgL EfJZsPg2eIFhK3eMWGQeuUa+ujAx9+t+UPHAlu4toXjfWi49EmG9lE/bULVGyZtZiXQqh5Ika X-Received: by 2002:a63:131b:: with SMTP id i27mr6927200pgl.209.1571358518957; Thu, 17 Oct 2019 17:28:38 -0700 (PDT) Date: Thu, 17 Oct 2019 17:27:46 -0700 Message-Id: <20191018002746.149200-1-eranian@google.com> Mime-Version: 1.0 X-Mailer: git-send-email 2.23.0.866.gb869b98d4c-goog Subject: [PATCH] perf/core: fix multiplexing event scheduling issue From: Stephane Eranian To: linux-kernel@vger.kernel.org Cc: peterz@infradead.org, mingo@elte.hu, acme@redhat.com, jolsa@redhat.com, kan.liang@intel.com, songliubraving@fb.com, irogers@google.com Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch complements the following commit: 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()") The fix from Song addresses the consequences of the problem but not the cause. This patch fixes the causes and can sit on top of Song's patch. This patch fixes a scheduling problem in the core functions of perf_events. Under certain conditions, some events would not be scheduled even though many counters would be available. This is related to multiplexing and is architecture agnostic and PMU agnostic (i.e., core or uncore). This problem can easily be reproduced when you have two perf stat sessions. The first session does not cause multiplexing, let's say it is measuring 1 event, E1. While it is measuring, a second session starts and causes multiplexing. Let's say it adds 6 events, B1-B6. Now, 7 events compete and are multiplexed. When the second session terminates, all 6 (B1-B6) events are removed. Normally, you'd expect the E1 event to continue to run with no multiplexing. However, the problem is that depending on the state Of E1 when B1-B6 are removed, it may never be scheduled again. If E1 was inactive at the time of removal, despite the multiplexing hrtimer still firing, it will not find any active events and will not try to reschedule. This is what Song's patch fixes. It forces the multiplexing code to consider non-active events. However, the cause is not addressed. The kernel should not rely on the multiplexing hrtimer to unblock inactive events. That timer can have abitrary duration in the milliseconds. Until the timer fires, counters are available, but no measurable events are using them. We do not want to introduce blind spots of arbitrary durations. This patch addresses the cause of the problem, by checking that, when an event is disabled or removed and the context was multiplexing events, inactive events gets immediately a chance to be scheduled by calling ctx_resched(). The rescheduling is done on event of equal or lower priority types. With that in place, as soon as a counter is freed, schedulable inactive events may run, thereby eliminating a blind spot. This can be illustrated as follows using Skylake uncore CHA here: $ perf stat --no-merge -a -I 1000 -C 28 -e uncore_cha_0/event=0x0/ 42.007856531 2,000,291,322 uncore_cha_0/event=0x0/ 43.008062166 2,000,399,526 uncore_cha_0/event=0x0/ 44.008293244 2,000,473,720 uncore_cha_0/event=0x0/ 45.008501847 2,000,423,420 uncore_cha_0/event=0x0/ 46.008706558 2,000,411,132 uncore_cha_0/event=0x0/ 47.008928543 2,000,441,660 uncore_cha_0/event=0x0/ Adding second sessiont with 4 events for 4s $ perf stat -a -I 1000 -C 28 --no-merge -e uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/ sleep 4 48.009114643 1,983,129,830 uncore_cha_0/event=0x0/ (99.71%) 49.009279616 1,976,067,751 uncore_cha_0/event=0x0/ (99.30%) 50.009428660 1,974,448,006 uncore_cha_0/event=0x0/ (98.92%) 51.009524309 1,973,083,237 uncore_cha_0/event=0x0/ (98.55%) 52.009673467 1,972,097,678 uncore_cha_0/event=0x0/ (97.11%) End of 4s, second session is removed. But the first event does not schedule and never will unless new events force multiplexing again. 53.009815999 uncore_cha_0/event=0x0/ (95.28%) 54.009961809 uncore_cha_0/event=0x0/ (93.52%) 55.010110972 uncore_cha_0/event=0x0/ (91.82%) 56.010217579 uncore_cha_0/event=0x0/ (90.18%) Signed-off-by: Stephane Eranian --- kernel/events/core.c | 67 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+) diff --git a/kernel/events/core.c b/kernel/events/core.c index 9ec0b0bfddbd..578587246ffb 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2140,6 +2140,10 @@ group_sched_out(struct perf_event *group_event, #define DETACH_GROUP 0x01UL +static void ctx_resched(struct perf_cpu_context *cpuctx, + struct perf_event_context *task_ctx, + enum event_type_t event_type); + /* * Cross CPU call to remove a performance event * @@ -2153,6 +2157,7 @@ __perf_remove_from_context(struct perf_event *event, void *info) { unsigned long flags = (unsigned long)info; + int was_necessary = ctx->rotate_necessary; if (ctx->is_active & EVENT_TIME) { update_context_time(ctx); @@ -2171,6 +2176,37 @@ __perf_remove_from_context(struct perf_event *event, cpuctx->task_ctx = NULL; } } + + /* + * sanity check that event_sched_out() does not and will not + * change the state of ctx->rotate_necessary + */ + WARN_ON(was_necessary != event->ctx->rotate_necessary); + /* + * if we remove an event AND we were multiplexing then, that means + * we had more events than we have counters for, and thus, at least, + * one event was in INACTIVE state. Now, that we removed an event, + * we need to resched to give a chance to all events to get scheduled, + * otherwise some may get stuck. + * + * By the time this function is called the event is usually in the OFF + * state. + * Note that this is not a duplicate of the same code in _perf_event_disable() + * because the call path are different. Some events may be simply disabled + * others removed. There is a way to get removed and not be disabled first. + */ + if (ctx->rotate_necessary && ctx->nr_events) { + int type = get_event_type(event); + /* + * In case we removed a pinned event, then we need to + * resched for both pinned and flexible events. The + * opposite is not true. A pinned event can never be + * inactive due to multiplexing. + */ + if (type & EVENT_PINNED) + type |= EVENT_FLEXIBLE; + ctx_resched(cpuctx, cpuctx->task_ctx, type); + } } /* @@ -2218,6 +2254,8 @@ static void __perf_event_disable(struct perf_event *event, struct perf_event_context *ctx, void *info) { + int was_necessary = ctx->rotate_necessary; + if (event->state < PERF_EVENT_STATE_INACTIVE) return; @@ -2232,6 +2270,35 @@ static void __perf_event_disable(struct perf_event *event, event_sched_out(event, cpuctx, ctx); perf_event_set_state(event, PERF_EVENT_STATE_OFF); + /* + * sanity check that event_sched_out() does not and will not + * change the state of ctx->rotate_necessary + */ + WARN_ON_ONCE(was_necessary != event->ctx->rotate_necessary); + + /* + * if we disable an event AND we were multiplexing then, that means + * we had more events than we have counters for, and thus, at least, + * one event was in INACTIVE state. Now, that we disabled an event, + * we need to resched to give a chance to all events to be scheduled, + * otherwise some may get stuck. + * + * Note that this is not a duplicate of the same code in + * __perf_remove_from_context() + * because events can be disabled without being removed. + */ + if (ctx->rotate_necessary && ctx->nr_events) { + int type = get_event_type(event); + /* + * In case we removed a pinned event, then we need to + * resched for both pinned and flexible events. The + * opposite is not true. A pinned event can never be + * inactive due to multiplexing. + */ + if (type & EVENT_PINNED) + type |= EVENT_FLEXIBLE; + ctx_resched(cpuctx, cpuctx->task_ctx, type); + } } /* -- 2.23.0.866.gb869b98d4c-goog