Date: Thu, 11 Oct 2018 03:39:53 -0700
From: tip-bot for Phil Auld
Cc: torvalds@linux-foundation.org, mingo@kernel.org, linux-kernel@vger.kernel.org, hpa@zytor.com, peterz@infradead.org, tglx@linutronix.de, pauld@redhat.com
Reply-To: tglx@linutronix.de, pauld@redhat.com, hpa@zytor.com, peterz@infradead.org, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, mingo@kernel.org
In-Reply-To: <20181008143639.GA4019@pauld.bos.csb>
References: <20181008143639.GA4019@pauld.bos.csb>
To: linux-tip-commits@vger.kernel.org
Subject: [tip:sched/urgent] sched/fair: Fix throttle_list starvation with low CFS quota
Git-Commit-ID: 
 8b48300108248e950cde0bdc5708039fc3836623
X-Mailer: tip-git-log-daemon
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8
Content-Disposition: inline
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

Commit-ID:  8b48300108248e950cde0bdc5708039fc3836623
Gitweb:     https://git.kernel.org/tip/8b48300108248e950cde0bdc5708039fc3836623
Author:     Phil Auld
AuthorDate: Mon, 8 Oct 2018 10:36:40 -0400
Committer:  Ingo Molnar
CommitDate: Thu, 11 Oct 2018 11:18:32 +0200

sched/fair: Fix throttle_list starvation with low CFS quota

With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c7048 to put throttled entries at the head of the list, later entries
on the list will starve. Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.

Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list. The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.

This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system.
In some cases you can simply look at the throttled list and
see the later entries are not changing:

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
      1     ffff90b56cb2d200  -976050
      2     ffff90b56cb2cc00  -484925
      3     ffff90b56cb2bc00  -658814
      4     ffff90b56cb2ba00  -275365
      5     ffff90b166a45600  -135138
      6     ffff90b56cb2da00  -282505
      7     ffff90b56cb2e000  -148065
      8     ffff90b56cb2fa00  -872591
      9     ffff90b56cb2c000  -84687
     10     ffff90b56cb2f000  -87237
     11     ffff90b166a40a00  -164582

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
      1     ffff90b56cb2d200  -994147
      2     ffff90b56cb2cc00  -306051
      3     ffff90b56cb2bc00  -961321
      4     ffff90b56cb2ba00  -24490
      5     ffff90b166a45600  -135138
      6     ffff90b56cb2da00  -282505
      7     ffff90b56cb2e000  -148065
      8     ffff90b56cb2fa00  -872591
      9     ffff90b56cb2c000  -84687
     10     ffff90b56cb2f000  -87237
     11     ffff90b166a40a00  -164582

Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:

    crash> task ffff8eb765994500 sched_info
    PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
      sched_info = {
        pcount = 8,
        run_delay = 697094208,
        last_arrival = 240260125039,
        last_queued = 240260327513
      },

    crash> task ffff8eb765994500 sched_info
    PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
      sched_info = {
        pcount = 8,
        run_delay = 697094208,
        last_arrival = 240260125039,
        last_queued = 240260327513
      },

Signed-off-by: Phil Auld
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: stable@vger.kernel.org
Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c  | 22 +++++++++++++++++++---
 kernel/sched/sched.h |  2 ++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7fc4a371bdd2..f88e00705b55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4476,9 +4476,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	/*
 	 * Add to the _head_ of the list, so that an already-started
-	 * distribute_cfs_runtime will not see us
+	 * distribute_cfs_runtime will not see us. If disribute_cfs_runtime is
+	 * not running add to the tail so that later runqueues don't get starved.
 	 */
-	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	if (cfs_b->distribute_running)
+		list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	else
+		list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 
 	/*
 	 * If we're the first throttled task, make sure the bandwidth
@@ -4622,14 +4626,16 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	 * in us over-using our runtime if it is all used during this loop, but
 	 * only by limited amounts in that extreme case.
 	 */
-	while (throttled && cfs_b->runtime > 0) {
+	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
 		runtime = cfs_b->runtime;
+		cfs_b->distribute_running = 1;
 		raw_spin_unlock(&cfs_b->lock);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
 		runtime = distribute_cfs_runtime(cfs_b, runtime,
 						 runtime_expires);
 		raw_spin_lock(&cfs_b->lock);
 
+		cfs_b->distribute_running = 0;
 		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 
 		cfs_b->runtime -= min(runtime, cfs_b->runtime);
@@ -4740,6 +4746,11 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->distribute_running) {
+		raw_spin_unlock(&cfs_b->lock);
+		return;
+	}
+
 	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
 		raw_spin_unlock(&cfs_b->lock);
 		return;
@@ -4749,6 +4760,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 		runtime = cfs_b->runtime;
 
 	expires = cfs_b->runtime_expires;
+	if (runtime)
+		cfs_b->distribute_running = 1;
+
 	raw_spin_unlock(&cfs_b->lock);
 
 	if (!runtime)
@@ -4759,6 +4773,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	raw_spin_lock(&cfs_b->lock);
 	if (expires == cfs_b->runtime_expires)
 		cfs_b->runtime -= min(runtime, cfs_b->runtime);
+	cfs_b->distribute_running = 0;
 	raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -4867,6 +4882,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
+	cfs_b->distribute_running = 0;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 455fa330de04..9683f458aec7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -346,6 +346,8 @@ struct cfs_bandwidth {
 	int			nr_periods;
 	int			nr_throttled;
 	u64			throttled_time;
+
+	bool			distribute_running;
 #endif
 };
 
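
The starvation pattern described in the changelog can be illustrated with a
small userspace model. This is only a sketch, not kernel code: struct model,
simulate() and the other helper names are hypothetical, the throttled list is
a plain array, and each period is assumed to have enough runtime for a fixed
number of runqueues, all of which are throttled again only after distribution
has finished.

```c
#include <assert.h>
#include <string.h>

#define NR_RQ 11	/* illustrative: same count as the listing above */

/* Hypothetical userspace model of cfs_b->throttled_cfs_rq. */
struct model {
	int list[NR_RQ];	/* throttled list, index 0 is the head */
	int len;
	unsigned int served[NR_RQ];	/* times each runqueue got runtime */
};

/* Old behaviour: a re-throttled runqueue goes back on the head. */
static void add_head(struct model *m, int id)
{
	memmove(&m->list[1], &m->list[0], m->len * sizeof(m->list[0]));
	m->list[0] = id;
	m->len++;
}

/* Fixed behaviour when no distribution is running: go to the tail. */
static void add_tail(struct model *m, int id)
{
	m->list[m->len++] = id;
}

static int pop_head(struct model *m)
{
	int id = m->list[0];

	m->len--;
	memmove(&m->list[0], &m->list[1], m->len * sizeof(m->list[0]));
	return id;
}

/*
 * Run 'periods' rounds. In each round the distributor can only afford to
 * unthrottle 'per_period' runqueues, taken from the head. They run, use up
 * their runtime and are throttled again (assumption: after distribution
 * has ended), at the tail if 'use_tail' is set, else at the head.
 */
static void simulate(struct model *m, int periods, int per_period, int use_tail)
{
	int woken[NR_RQ];

	assert(per_period > 0 && per_period <= NR_RQ);
	memset(m, 0, sizeof(*m));
	for (int i = 0; i < NR_RQ; i++)
		add_tail(m, i);

	for (int p = 0; p < periods; p++) {
		for (int n = 0; n < per_period; n++) {
			woken[n] = pop_head(m);
			m->served[woken[n]]++;
		}
		for (int n = 0; n < per_period; n++) {
			if (use_tail)
				add_tail(m, woken[n]);
			else
				add_head(m, woken[n]);
		}
	}
}

/* Number of runqueues that never received any runtime. */
static int nr_starved(const struct model *m)
{
	int n = 0;

	for (int i = 0; i < NR_RQ; i++)
		n += (m->served[i] == 0);
	return n;
}
```

In this model, with 11 runqueues and runtime for 4 per period,
simulate(&m, 11, 4, 0) leaves 7 entries that are never served, while
simulate(&m, 11, 4, 1) serves every entry 4 times. The real code still adds
to the head while distribute_running is set, which this model does not cover.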