Date: Thu, 22 Oct 2020 20:29:42 -0700
Message-Id: <20201023032944.399861-1-joshdon@google.com>
Subject: [PATCH 1/3] sched: better handling for busy polling loops
From: Josh Don
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    "David S. Miller", Jakub Kicinski
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Paolo Bonzini, Eric Dumazet, linux-kernel@vger.kernel.org,
    netdev@vger.kernel.org, kvm@vger.kernel.org, Josh Don, Xi Wang
X-Mailing-List: linux-kernel@vger.kernel.org

Busy polling loops in the kernel such as network socket poll and kvm
halt polling have performance problems related to process scheduler load
accounting.

Both of the busy polling examples are opportunistic - they relinquish
the cpu if another thread is ready to run. This design, however, doesn't
extend to multiprocessor load balancing very well. The scheduler still
sees the busy polling cpu as 100% busy and will be less likely to put
another thread on that cpu.
In other words, if all cores are 100% utilized and some of them are
running real workloads and some others are running busy polling loops,
newly woken up threads will not prefer the busy polling cpus. System
wide throughput and latency may suffer.

This change allows the scheduler to detect busy polling cpus in order to
allow them to be more frequently considered for wake up balancing.

This change also disables preemption for the duration of the busy
polling loop. This is important, as it ensures that if a polling thread
decides to end its poll to relinquish cpu to another thread, the polling
thread will actually exit the busy loop and potentially block. When it
later becomes runnable, it will have the opportunity to find an idle cpu
via wakeup cpu selection.

Suggested-by: Xi Wang
Signed-off-by: Josh Don
Signed-off-by: Xi Wang
---
 include/linux/sched.h |  5 +++
 kernel/sched/core.c   | 94 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   | 25 ++++++++----
 kernel/sched/sched.h  |  2 +
 4 files changed, 119 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..80ef477e5a87 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1651,6 +1651,7 @@ extern int can_nice(const struct task_struct *p, const int nice);
 extern int task_curr(const struct task_struct *p);
 extern int idle_cpu(int cpu);
 extern int available_idle_cpu(int cpu);
+extern int polling_cpu(int cpu);
 extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *);
 extern void sched_set_fifo(struct task_struct *p);
@@ -2048,4 +2049,8 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+extern void prepare_to_busy_poll(void);
+extern int continue_busy_poll(void);
+extern void end_busy_poll(bool allow_resched);
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d95dc3f4644..2783191d0bd4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5107,6 +5107,24 @@ int available_idle_cpu(int cpu)
 	return 1;
 }
 
+/**
+ * polling_cpu - is a given CPU currently running a thread in a busy polling
+ * loop that could be preempted if a new thread were to be scheduled?
+ * @cpu: the CPU in question.
+ *
+ * Return: 1 if the CPU is currently polling. 0 otherwise.
+ */
+int polling_cpu(int cpu)
+{
+#ifdef CONFIG_SMP
+	struct rq *rq = cpu_rq(cpu);
+
+	return unlikely(rq->busy_polling);
+#else
+	return 0;
+#endif
+}
+
 /**
  * idle_task - return the idle task for a given CPU.
  * @cpu: the processor in question.
@@ -7191,6 +7209,7 @@ void __init sched_init(void)
 
 		rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
+		rq->busy_polling = 0;
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
@@ -7417,6 +7436,81 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
 
 #endif
 
+/*
+ * Calling this function before entering a preemptible busy polling loop will
+ * help the scheduler make better load balancing decisions. Wake up balance
+ * will treat the polling cpu as idle.
+ *
+ * Preemption is disabled inside this function and re-enabled in
+ * end_busy_poll(), thus the polling loop must periodically check
+ * continue_busy_poll().
+ *
+ * REQUIRES: prepare_to_busy_poll(), continue_busy_poll(), and end_busy_poll()
+ * must be used together.
+ */
+void prepare_to_busy_poll(void)
+{
+	struct rq __maybe_unused *rq = this_rq();
+	unsigned long __maybe_unused flags;
+
+	/* Preemption will be reenabled by end_busy_poll() */
+	preempt_disable();
+
+#ifdef CONFIG_SMP
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	/* preemption disabled; only one thread can poll at a time */
+	WARN_ON_ONCE(rq->busy_polling);
+	rq->busy_polling++;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+#endif
+}
+EXPORT_SYMBOL(prepare_to_busy_poll);
+
+int continue_busy_poll(void)
+{
+	if (!single_task_running())
+		return 0;
+
+	/* Important that we check this, since preemption is disabled */
+	if (need_resched())
+		return 0;
+
+	return 1;
+}
+EXPORT_SYMBOL(continue_busy_poll);
+
+/*
+ * Restore any state modified by prepare_to_busy_poll(), including re-enabling
+ * preemption.
+ *
+ * @allow_resched: If true, this potentially calls schedule() as part of
+ * enabling preemption. A busy poll loop can use false in order to have an
+ * opportunity to block before rescheduling.
+ */
+void end_busy_poll(bool allow_resched)
+{
+#ifdef CONFIG_SMP
+	struct rq *rq = this_rq();
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	BUG_ON(!rq->busy_polling); /* not paired with prepare() */
+	rq->busy_polling--;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+#endif
+
+	/*
+	 * preemption needs to be kept disabled between prepare_to_busy_poll()
+	 * and end_busy_poll().
+	 */
+	BUG_ON(preemptible());
+	if (allow_resched)
+		preempt_enable();
+	else
+		preempt_enable_no_resched();
+}
+EXPORT_SYMBOL(end_busy_poll);
+
 #ifdef CONFIG_CGROUP_SCHED
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a68a0536add..58e525c74cc6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5460,6 +5460,11 @@ static int sched_idle_cpu(int cpu)
 {
 	return sched_idle_rq(cpu_rq(cpu));
 }
+
+static int sched_idle_or_polling_cpu(int cpu)
+{
+	return sched_idle_cpu(cpu) || polling_cpu(cpu);
+}
 #endif
 
 /*
@@ -5880,6 +5885,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 	u64 latest_idle_timestamp = 0;
 	int least_loaded_cpu = this_cpu;
 	int shallowest_idle_cpu = -1;
+	int found_polling = 0;
 	int i;
 
 	/* Check if we have any choice: */
@@ -5914,10 +5920,14 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 				shallowest_idle_cpu = i;
 			}
 		} else if (shallowest_idle_cpu == -1) {
+			int polling = polling_cpu(i);
+
 			load = cpu_load(cpu_rq(i));
-			if (load < min_load) {
+			if ((polling == found_polling && load < min_load) ||
+			    (polling && !found_polling)) {
 				min_load = load;
 				least_loaded_cpu = i;
+				found_polling = polling;
 			}
 		}
 	}
@@ -6085,7 +6095,7 @@ static int select_idle_smt(struct task_struct *p, int target)
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 			continue;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+		if (available_idle_cpu(cpu) || sched_idle_or_polling_cpu(cpu))
 			return cpu;
 	}
 
@@ -6149,7 +6159,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+		if (available_idle_cpu(cpu) || sched_idle_or_polling_cpu(cpu))
 			break;
 	}
 
@@ -6179,7 +6189,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	for_each_cpu_wrap(cpu, cpus, target) {
 		unsigned long cpu_cap = capacity_of(cpu);
 
-		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
+		if (!available_idle_cpu(cpu) && !sched_idle_or_polling_cpu(cpu))
 			continue;
 		if (task_fits_capacity(p, cpu_cap))
 			return cpu;
@@ -6223,14 +6233,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	}
 
 symmetric:
-	if (available_idle_cpu(target) || sched_idle_cpu(target))
+	if (available_idle_cpu(target) || sched_idle_or_polling_cpu(target))
 		return target;
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
 	if (prev != target && cpus_share_cache(prev, target) &&
-	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
+	    (available_idle_cpu(prev) || sched_idle_or_polling_cpu(prev)))
 		return prev;
 
 	/*
@@ -6252,7 +6262,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+	    (available_idle_cpu(recent_used_cpu) ||
+	     sched_idle_or_polling_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr)) {
 		/*
 		 * Replace recent_used_cpu with prev as it is a potential
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6b0975..45de468d0ffb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1003,6 +1003,8 @@ struct rq {
 
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
+
+	unsigned int		busy_polling;
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-- 
2.29.0.rc1.297.gfa9743e501-goog
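
Illustration only, not part of the patch: the candidate rule added to the
non-idle branch of find_idlest_group_cpu() can be hard to read from the
diff alone, so here is a minimal userspace model of it. The names
struct cpu_sample and pick_least_loaded() are hypothetical and exist only
for this sketch; they stand in for cpu_load(cpu_rq(i)) and polling_cpu(i).
The sketch assumes the polling flag is normalized to 0 or 1.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU state the patch reads through
 * cpu_load(cpu_rq(i)) and polling_cpu(i). Not a kernel structure. */
struct cpu_sample {
	unsigned long load;
	int polling;	/* nonzero if the CPU is in a busy polling loop */
};

/*
 * Userspace model of the patched non-idle branch:
 *  - a polling CPU always replaces a non-polling candidate;
 *  - within the same class (both polling or both not), lower load wins;
 *  - a non-polling CPU never replaces a polling candidate.
 * Returns the chosen index, or fallback if the array is empty
 * (fallback plays the role of this_cpu in the kernel function).
 */
static int pick_least_loaded(const struct cpu_sample *cpus, size_t n,
			     int fallback)
{
	unsigned long min_load = ULONG_MAX;
	int least_loaded_cpu = fallback;
	int found_polling = 0;
	size_t i;

	assert(cpus != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Normalize to 0/1 so the equality test below is meaningful. */
		int polling = cpus[i].polling != 0;
		unsigned long load = cpus[i].load;

		if ((polling == found_polling && load < min_load) ||
		    (polling && !found_polling)) {
			min_load = load;
			least_loaded_cpu = (int)i;
			found_polling = polling;
		}
	}

	return least_loaded_cpu;
}
```

For samples {load 10, not polling}, {load 50, polling}, {load 5, not
polling}, the model returns index 1: once a polling cpu has been found,
non-polling cpus no longer compete on load, even if their load is lower.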