From: bsegall@google.com
To: Ingo Molnar
Cc: Phil Auld, Ben Segall, Joel Fernandes, Steve Muckle, Paul Turner, Vincent Guittot, Morten Rasmussen, Peter Zijlstra, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [Patch] sched/fair: Avoid throttle_list starvation with low cfs quota
References: <20181008143639.GA4019@pauld.bos.csb> <20181009083244.GA51643@gmail.com>
Date: Wed, 10 Oct 2018 10:49:25 -0700
In-Reply-To: <20181009083244.GA51643@gmail.com> (Ingo Molnar's message of "Tue, 9 Oct 2018 10:32:44 +0200")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
X-Mailing-List: linux-kernel@vger.kernel.org

Ingo Molnar writes:

> I've Cc:-ed a handful of gents who worked on CFS bandwidth details to widen the discussion.
> Patch quoted below.
>
> Looks like a real bug that needs to be fixed - and at first sight the quota of 1000 looks very
> low - could we improve the arithmetics perhaps?
>
> A low quota of 1000 is used because there's many VMs or containers provisioned on the system
> that is triggering the bug, right?
>
> Thanks,
>
> Ingo
>
> * Phil Auld wrote:
>
>> From: "Phil Auld"
>>
>> sched/fair: Avoid throttle_list starvation with low cfs quota
>>
>> With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
>> distribute_cfs_runtime may not empty the throttled_list before it runs
>> out of runtime to distribute. In that case, due to the change from
>> c06f04c7048 to put throttled entries at the head of the list, later entries
>> on the list will starve. Essentially, the same X processes will get pulled
>> off the list, given CPU time and then, when expired, get put back on the
>> head of the list where distribute_cfs_runtime will give runtime to the same
>> set of processes leaving the rest.
>>
>> Fix the issue by setting a bit in struct cfs_bandwidth when
>> distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
>> decide to put the throttled entry on the tail or the head of the list. The
>> bit is set/cleared by the callers of distribute_cfs_runtime while they hold
>> cfs_bandwidth->lock.
>>
>> Signed-off-by: Phil Auld
>> Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
>> Cc: Peter Zijlstra
>> Cc: Ingo Molnar
>> Cc: stable@vger.kernel.org

Reviewed-by: Ben Segall

In theory this does mean the unfairness could still happen if distribute is still
running, but while a tiny quota makes it more likely, the fact that
we're not getting through much of the list makes it not really a worry.
If you wanted to be even more careful there could be some generation
counter or something, but it doesn't seem necessary.

>> ---
>>
>> This is easy to reproduce with a handful of cpu consumers. I use crash on
>> the live system.
>> In some cases you can simply look at the throttled list and
>> see the later entries are not changing:
>>
>> crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
>>   1  ffff90b56cb2d200  -976050
>>   2  ffff90b56cb2cc00  -484925
>>   3  ffff90b56cb2bc00  -658814
>>   4  ffff90b56cb2ba00  -275365
>>   5  ffff90b166a45600  -135138
>>   6  ffff90b56cb2da00  -282505
>>   7  ffff90b56cb2e000  -148065
>>   8  ffff90b56cb2fa00  -872591
>>   9  ffff90b56cb2c000  -84687
>>  10  ffff90b56cb2f000  -87237
>>  11  ffff90b166a40a00  -164582
>> crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
>>   1  ffff90b56cb2d200  -994147
>>   2  ffff90b56cb2cc00  -306051
>>   3  ffff90b56cb2bc00  -961321
>>   4  ffff90b56cb2ba00  -24490
>>   5  ffff90b166a45600  -135138
>>   6  ffff90b56cb2da00  -282505
>>   7  ffff90b56cb2e000  -148065
>>   8  ffff90b56cb2fa00  -872591
>>   9  ffff90b56cb2c000  -84687
>>  10  ffff90b56cb2f000  -87237
>>  11  ffff90b166a40a00  -164582
>>
>> Sometimes it is easier to see by finding a process getting starved and looking
>> at the sched_info:
>>
>> crash> task ffff8eb765994500 sched_info
>> PID: 7800  TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
>>   sched_info = {
>>     pcount = 8,
>>     run_delay = 697094208,
>>     last_arrival = 240260125039,
>>     last_queued = 240260327513
>>   },
>> crash> task ffff8eb765994500 sched_info
>> PID: 7800  TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
>>   sched_info = {
>>     pcount = 8,
>>     run_delay = 697094208,
>>     last_arrival = 240260125039,
>>     last_queued = 240260327513
>>   },
>>
>>
>>  fair.c  | 22 +++++++++++++++++++---
>>  sched.h |  2 ++
>>  2 files changed, 21 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7fc4a371bdd2..f88e00705b55 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4476,9 +4476,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>>
>>  	/*
>>  	 * Add to the _head_ of the list, so that an already-started
>> -	 * distribute_cfs_runtime will not see us
>> +	 * distribute_cfs_runtime will not see us. If disribute_cfs_runtime is
>> +	 * not running add to the tail so that later runqueues don't get starved.
>>  	 */
>> -	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> +	if (cfs_b->distribute_running)
>> +		list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> +	else
>> +		list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>>
>>  	/*
>>  	 * If we're the first throttled task, make sure the bandwidth
>> @@ -4622,14 +4626,16 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>>  	 * in us over-using our runtime if it is all used during this loop, but
>>  	 * only by limited amounts in that extreme case.
>>  	 */
>> -	while (throttled && cfs_b->runtime > 0) {
>> +	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
>>  		runtime = cfs_b->runtime;
>> +		cfs_b->distribute_running = 1;
>>  		raw_spin_unlock(&cfs_b->lock);
>>  		/* we can't nest cfs_b->lock while distributing bandwidth */
>>  		runtime = distribute_cfs_runtime(cfs_b, runtime,
>>  						 runtime_expires);
>>  		raw_spin_lock(&cfs_b->lock);
>>
>> +		cfs_b->distribute_running = 0;
>>  		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>>
>>  		cfs_b->runtime -= min(runtime, cfs_b->runtime);
>> @@ -4740,6 +4746,11 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>>
>>  	/* confirm we're still not at a refresh boundary */
>>  	raw_spin_lock(&cfs_b->lock);
>> +	if (cfs_b->distribute_running) {
>> +		raw_spin_unlock(&cfs_b->lock);
>> +		return;
>> +	}
>> +
>>  	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
>>  		raw_spin_unlock(&cfs_b->lock);
>>  		return;
>> @@ -4749,6 +4760,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>>  		runtime = cfs_b->runtime;
>>
>>  	expires = cfs_b->runtime_expires;
>> +	if (runtime)
>> +		cfs_b->distribute_running = 1;
>> +
>>  	raw_spin_unlock(&cfs_b->lock);
>>
>>  	if (!runtime)
>> @@ -4759,6 +4773,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>>  	raw_spin_lock(&cfs_b->lock);
>>  	if (expires == cfs_b->runtime_expires)
>>  		cfs_b->runtime -= min(runtime, cfs_b->runtime);
>> +	cfs_b->distribute_running = 0;
>>  	raw_spin_unlock(&cfs_b->lock);
>>  }
>>
>> @@ -4867,6 +4882,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>  	cfs_b->period_timer.function = sched_cfs_period_timer;
>>  	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>>  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
>> +	cfs_b->distribute_running = 0;
>>  }
>>
>>  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 455fa330de04..9683f458aec7 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -346,6 +346,8 @@ struct cfs_bandwidth {
>>  	int			nr_periods;
>>  	int			nr_throttled;
>>  	u64			throttled_time;
>> +
>> +	bool			distribute_running;
>>  #endif
>>  };
>>
>>
>>
>> --
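
For anyone following along, the head-versus-tail effect described in the changelog can be illustrated with a rough userspace sketch. This is not kernel code and makes several simplifying assumptions: the throttled list is a small array rather than an RCU list, each period has just enough quota to unthrottle a fixed number of entries from the head, every unthrottled entry burns its runtime and is throttled again only after distribution has finished (so the distribute_running == 1 case is not modelled at all), and the names toy_list, toy_simulate and per_period are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_RQ 11	/* number of throttled runqueues, as in the crash listing */

/* Toy stand-in for cfs_b->throttled_cfs_rq: an array of runqueue ids. */
struct toy_list {
	int ids[NR_RQ];
	int len;
};

static void toy_add_head(struct toy_list *l, int id)
{
	assert(l->len < NR_RQ);
	memmove(&l->ids[1], &l->ids[0], (size_t)l->len * sizeof(l->ids[0]));
	l->ids[0] = id;
	l->len++;
}

static void toy_add_tail(struct toy_list *l, int id)
{
	assert(l->len < NR_RQ);
	l->ids[l->len++] = id;
}

static int toy_pop_head(struct toy_list *l)
{
	int id;

	assert(l->len > 0);
	id = l->ids[0];
	l->len--;
	memmove(&l->ids[0], &l->ids[1], (size_t)l->len * sizeof(l->ids[0]));
	return id;
}

/*
 * Run `periods` refill periods. Each period unthrottles at most
 * `per_period` entries from the head of the list. Those entries then
 * use up their runtime and are throttled again after distribution has
 * ended, going back on the tail (tail_when_idle) or on the head.
 * Returns how many distinct runqueues ever got to run.
 */
static int toy_simulate(bool tail_when_idle, int per_period, int periods)
{
	struct toy_list list = { .len = 0 };
	bool served[NR_RQ] = { false };
	int woken[NR_RQ];
	int distinct = 0;

	for (int i = 0; i < NR_RQ; i++)
		toy_add_tail(&list, i);

	for (int p = 0; p < periods; p++) {
		int n = 0;

		/* "distribute": hand out runtime from the head until it is gone */
		while (n < per_period && list.len > 0)
			woken[n++] = toy_pop_head(&list);

		/* distribution is over; the woken entries throttle again */
		for (int i = 0; i < n; i++) {
			if (!served[woken[i]]) {
				served[woken[i]] = true;
				distinct++;
			}
			if (tail_when_idle)
				toy_add_tail(&list, woken[i]);
			else
				toy_add_head(&list, woken[i]);
		}
	}
	return distinct;
}
```

Under these assumptions, toy_simulate(false, 4, 100) reports 4 distinct runqueues, which mirrors the listing where only the first four entries change, while toy_simulate(true, 4, 100) reports all 11. The sketch says nothing about the remaining window where a runqueue throttles while distribution is still in flight.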