Date: Tue, 9 Oct 2018 08:44:41 -0400
From: Phil Auld
To: Ingo Molnar
Cc: Ben Segall, Joel Fernandes, Steve Muckle, Paul Turner, Vincent Guittot,
    Morten Rasmussen, Peter Zijlstra, linux-kernel@vger.kernel.org,
    stable@vger.kernel.org
Subject: Re: [Patch] sched/fair: Avoid throttle_list starvation with low cfs quota
Message-ID: <20181009124441.GF4019@pauld.bos.csb>
References: <20181008143639.GA4019@pauld.bos.csb> <20181009083244.GA51643@gmail.com>
In-Reply-To: <20181009083244.GA51643@gmail.com>

Hi,

On Tue, Oct 09, 2018 at 10:32:44AM +0200, Ingo Molnar wrote:
>
> I've Cc:-ed a handful of gents who worked on CFS bandwidth details to
> widen the discussion. Patch quoted below.
>
> Looks like a real bug that needs to be fixed - and at first sight the
> quota of 1000 looks very low - could we improve the arithmetics perhaps?
>
> A low quota of 1000 is used because there's many VMs or containers
> provisioned on the system that is triggering the bug, right?
>

Thanks for taking a look. The quota is extremely low, but I believe that's
a different issue: the kernel allows this setting and should handle it
better than it currently does. The proposed patch fixes it so that all the
tasks make progress (even if not much progress) rather than having some
starve at the back of the list.

Cheers,
Phil

> Thanks,
>
> 	Ingo
>
> * Phil Auld wrote:
>
> > From: "Phil Auld"
> >
> > sched/fair: Avoid throttle_list starvation with low cfs quota
> >
> > With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
> > distribute_cfs_runtime may not empty the throttled_list before it runs
> > out of runtime to distribute. In that case, due to the change from
> > c06f04c70489 to put throttled entries at the head of the list, later
> > entries on the list will starve. Essentially, the same X processes will
> > get pulled off the list, given CPU time and then, when expired, get put
> > back on the head of the list where distribute_cfs_runtime will give
> > runtime to the same set of processes, leaving the rest.
> >
> > Fix the issue by setting a bit in struct cfs_bandwidth when
> > distribute_cfs_runtime is running, so that the code in throttle_cfs_rq
> > can decide to put the throttled entry on the tail or the head of the
> > list. The bit is set/cleared by the callers of distribute_cfs_runtime
> > while they hold cfs_bandwidth->lock.
> >
> > Signed-off-by: Phil Auld
> > Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
> > Cc: Peter Zijlstra
> > Cc: Ingo Molnar
> > Cc: stable@vger.kernel.org
> > ---
> >
> > This is easy to reproduce with a handful of cpu consumers. I use crash
> > on the live system. In some cases you can simply look at the throttled
> > list and see the later entries are not changing:
> >
> > crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
> >   1 ffff90b56cb2d200 -976050
> >   2 ffff90b56cb2cc00 -484925
> >   3 ffff90b56cb2bc00 -658814
> >   4 ffff90b56cb2ba00 -275365
> >   5 ffff90b166a45600 -135138
> >   6 ffff90b56cb2da00 -282505
> >   7 ffff90b56cb2e000 -148065
> >   8 ffff90b56cb2fa00 -872591
> >   9 ffff90b56cb2c000 -84687
> >  10 ffff90b56cb2f000 -87237
> >  11 ffff90b166a40a00 -164582
> >
> > crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
> >   1 ffff90b56cb2d200 -994147
> >   2 ffff90b56cb2cc00 -306051
> >   3 ffff90b56cb2bc00 -961321
> >   4 ffff90b56cb2ba00 -24490
> >   5 ffff90b166a45600 -135138
> >   6 ffff90b56cb2da00 -282505
> >   7 ffff90b56cb2e000 -148065
> >   8 ffff90b56cb2fa00 -872591
> >   9 ffff90b56cb2c000 -84687
> >  10 ffff90b56cb2f000 -87237
> >  11 ffff90b166a40a00 -164582
> >
> > Sometimes it is easier to see by finding a process getting starved and
> > looking at the sched_info:
> >
> > crash> task ffff8eb765994500 sched_info
> > PID: 7800  TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
> >   sched_info = {
> >     pcount = 8,
> >     run_delay = 697094208,
> >     last_arrival = 240260125039,
> >     last_queued = 240260327513
> >   },
> > crash> task ffff8eb765994500 sched_info
> > PID: 7800  TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
> >   sched_info = {
> >     pcount = 8,
> >     run_delay = 697094208,
> >     last_arrival = 240260125039,
> >     last_queued = 240260327513
> >   },
> >
> >
> >  fair.c  | 22 +++++++++++++++++++---
> >  sched.h |  2 ++
> >  2 files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7fc4a371bdd2..f88e00705b55 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4476,9 +4476,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >
> >  	/*
> >  	 * Add to the _head_ of the list, so that an already-started
> > -	 * distribute_cfs_runtime will not see us
> > +	 * distribute_cfs_runtime will not see us. If distribute_cfs_runtime is
> > +	 * not running add to the tail so that later runqueues don't get starved.
> >  	 */
> > -	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> > +	if (cfs_b->distribute_running)
> > +		list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> > +	else
> > +		list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> >
> >  	/*
> >  	 * If we're the first throttled task, make sure the bandwidth
> > @@ -4622,14 +4626,16 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> >  	 * in us over-using our runtime if it is all used during this loop, but
> >  	 * only by limited amounts in that extreme case.
> >  	 */
> > -	while (throttled && cfs_b->runtime > 0) {
> > +	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
> >  		runtime = cfs_b->runtime;
> > +		cfs_b->distribute_running = 1;
> >  		raw_spin_unlock(&cfs_b->lock);
> >  		/* we can't nest cfs_b->lock while distributing bandwidth */
> >  		runtime = distribute_cfs_runtime(cfs_b, runtime,
> >  						 runtime_expires);
> >  		raw_spin_lock(&cfs_b->lock);
> >
> > +		cfs_b->distribute_running = 0;
> >  		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> >
> >  		cfs_b->runtime -= min(runtime, cfs_b->runtime);
> > @@ -4740,6 +4746,11 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >
> >  	/* confirm we're still not at a refresh boundary */
> >  	raw_spin_lock(&cfs_b->lock);
> > +	if (cfs_b->distribute_running) {
> > +		raw_spin_unlock(&cfs_b->lock);
> > +		return;
> > +	}
> > +
> >  	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
> >  		raw_spin_unlock(&cfs_b->lock);
> >  		return;
> > @@ -4749,6 +4760,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >  		runtime = cfs_b->runtime;
> >
> >  	expires = cfs_b->runtime_expires;
> > +	if (runtime)
> > +		cfs_b->distribute_running = 1;
> > +
> >  	raw_spin_unlock(&cfs_b->lock);
> >
> >  	if (!runtime)
> > @@ -4759,6 +4773,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >  	raw_spin_lock(&cfs_b->lock);
> >  	if (expires == cfs_b->runtime_expires)
> >  		cfs_b->runtime -= min(runtime, cfs_b->runtime);
> > +	cfs_b->distribute_running = 0;
> >  	raw_spin_unlock(&cfs_b->lock);
> >  }
> >
> > @@ -4867,6 +4882,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >  	cfs_b->period_timer.function = sched_cfs_period_timer;
> >  	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> >  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
> > +	cfs_b->distribute_running = 0;
> >  }
> >
> >  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 455fa330de04..9683f458aec7 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -346,6 +346,8 @@ struct cfs_bandwidth {
> >  	int			nr_periods;
> >  	int			nr_throttled;
> >  	u64			throttled_time;
> > +
> > +	bool			distribute_running;
> >  #endif
> >  };
> >

--
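For reference, the reproduction the changelog describes ("a handful of cpu consumers" in a group at the minimum quota) might be set up along these lines. This is a sketch only: it assumes a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu and must run as root; the group name "lowquota" and the count of 8 consumers are invented for the example.

```shell
# Sketch: pin a few busy loops under the minimum cfs quota (cgroup v1).
mkdir /sys/fs/cgroup/cpu/lowquota
echo 100000 > /sys/fs/cgroup/cpu/lowquota/cpu.cfs_period_us  # default period
echo 1000   > /sys/fs/cgroup/cpu/lowquota/cpu.cfs_quota_us   # minimum quota
for i in $(seq 1 8); do
    ( while :; do :; done ) &                      # one cpu consumer
    echo $! > /sys/fs/cgroup/cpu/lowquota/tasks    # move it into the group
done
# Without the fix, inspecting per-task sched_info (e.g. with crash, as in
# the changelog) shows some of these tasks never accumulating runtime.
```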