Date: Tue, 9 Oct 2018 10:32:44 +0200
From: Ingo Molnar
To: Phil Auld, Ben Segall, Joel Fernandes, Steve Muckle, Paul Turner,
    Vincent Guittot, Morten Rasmussen
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [Patch] sched/fair: Avoid throttle_list starvation with low cfs quota
Message-ID: <20181009083244.GA51643@gmail.com>
References: <20181008143639.GA4019@pauld.bos.csb>
In-Reply-To: <20181008143639.GA4019@pauld.bos.csb>
User-Agent: Mutt/1.9.4 (2018-02-28)

I've Cc:-ed a handful of gents who worked on CFS bandwidth details to
widen the discussion. Patch quoted below.

Looks like a real bug that needs to be fixed - and at first sight the
quota of 1000 looks very low - could we improve the arithmetic perhaps?
(With the default period of 100000 usecs, a quota of 1000 usecs allows
only 1% of a single CPU.)

A low quota of 1000 is used because there are many VMs or containers
provisioned on the system that is triggering the bug, right?

Thanks,

	Ingo

* Phil Auld wrote:

> From: "Phil Auld"
>
> sched/fair: Avoid throttle_list starvation with low cfs quota
>
> With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
> distribute_cfs_runtime may not empty the throttled_list before it runs
> out of runtime to distribute. In that case, due to the change from
> commit c06f04c70489, which puts throttled entries at the head of the
> list, later entries on the list will starve. Essentially, the same X
> processes will get pulled off the list, given CPU time and then, when
> expired, get put back on the head of the list, where
> distribute_cfs_runtime will give runtime to the same set of processes,
> leaving the rest to starve.
>
> Fix the issue by setting a bit in struct cfs_bandwidth when
> distribute_cfs_runtime is running, so that the code in throttle_cfs_rq
> can decide whether to put the throttled entry at the tail or the head
> of the list. The bit is set/cleared by the callers of
> distribute_cfs_runtime while they hold cfs_bandwidth->lock.
>
> Signed-off-by: Phil Auld
> Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
> Cc: Peter Zijlstra
> Cc: Ingo Molnar
> Cc: stable@vger.kernel.org
> ---
>
> This is easy to reproduce with a handful of cpu consumers. I use crash
> on the live system.
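> For example, a setup along these lines exhibits it (a minimal sketch;
> the cgroup-v1 paths, group name, hog count and exact values are
> illustrative):
>
>   # cap the group at the minimum quota: 1000us out of a 100000us period
>   mkdir /sys/fs/cgroup/cpu/cfs-starve
>   echo 1000   > /sys/fs/cgroup/cpu/cfs-starve/cpu.cfs_quota_us
>   echo 100000 > /sys/fs/cgroup/cpu/cfs-starve/cpu.cfs_period_us
>
>   # start a handful of pure cpu hogs and move them into the group
>   for i in $(seq 1 16); do
>       (while :; do :; done) &
>       echo $! > /sys/fs/cgroup/cpu/cfs-starve/tasks
>   done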
> In some cases you can simply look at the throttled list and see the
> later entries are not changing:
>
> crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
> 1 ffff90b56cb2d200 -976050
> 2 ffff90b56cb2cc00 -484925
> 3 ffff90b56cb2bc00 -658814
> 4 ffff90b56cb2ba00 -275365
> 5 ffff90b166a45600 -135138
> 6 ffff90b56cb2da00 -282505
> 7 ffff90b56cb2e000 -148065
> 8 ffff90b56cb2fa00 -872591
> 9 ffff90b56cb2c000 -84687
> 10 ffff90b56cb2f000 -87237
> 11 ffff90b166a40a00 -164582
> crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
> 1 ffff90b56cb2d200 -994147
> 2 ffff90b56cb2cc00 -306051
> 3 ffff90b56cb2bc00 -961321
> 4 ffff90b56cb2ba00 -24490
> 5 ffff90b166a45600 -135138
> 6 ffff90b56cb2da00 -282505
> 7 ffff90b56cb2e000 -148065
> 8 ffff90b56cb2fa00 -872591
> 9 ffff90b56cb2c000 -84687
> 10 ffff90b56cb2f000 -87237
> 11 ffff90b166a40a00 -164582
>
> (Entries 5-11 are identical across the two samples: those runqueues
> never received any runtime.)
>
> Sometimes it is easier to see it by finding a process getting starved
> and looking at its sched_info (sampled twice here; the numbers are not
> advancing):
>
> crash> task ffff8eb765994500 sched_info
> PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
>   sched_info = {
>     pcount = 8,
>     run_delay = 697094208,
>     last_arrival = 240260125039,
>     last_queued = 240260327513
>   },
> crash> task ffff8eb765994500 sched_info
> PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
>   sched_info = {
>     pcount = 8,
>     run_delay = 697094208,
>     last_arrival = 240260125039,
>     last_queued = 240260327513
>   },
>
>  fair.c  | 22 +++++++++++++++++++---
>  sched.h |  2 ++
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7fc4a371bdd2..f88e00705b55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4476,9 +4476,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>
>  	/*
>  	 * Add to the _head_ of the list, so that an already-started
> -	 * distribute_cfs_runtime will not see us
> +	 * distribute_cfs_runtime will not see us. If distribute_cfs_runtime is
> +	 * not running, add to the tail so that later runqueues don't get starved.
>  	 */
> -	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> +	if (cfs_b->distribute_running)
> +		list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> +	else
> +		list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>
>  	/*
>  	 * If we're the first throttled task, make sure the bandwidth
> @@ -4622,14 +4626,16 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  	 * in us over-using our runtime if it is all used during this loop, but
>  	 * only by limited amounts in that extreme case.
>  	 */
> -	while (throttled && cfs_b->runtime > 0) {
> +	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
>  		runtime = cfs_b->runtime;
> +		cfs_b->distribute_running = 1;
>  		raw_spin_unlock(&cfs_b->lock);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
>  		runtime = distribute_cfs_runtime(cfs_b, runtime,
>  						 runtime_expires);
>  		raw_spin_lock(&cfs_b->lock);
>
> +		cfs_b->distribute_running = 0;
>  		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>
>  		cfs_b->runtime -= min(runtime, cfs_b->runtime);
> @@ -4740,6 +4746,11 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>
>  	/* confirm we're still not at a refresh boundary */
>  	raw_spin_lock(&cfs_b->lock);
> +	if (cfs_b->distribute_running) {
> +		raw_spin_unlock(&cfs_b->lock);
> +		return;
> +	}
> +
>  	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
>  		raw_spin_unlock(&cfs_b->lock);
>  		return;
> @@ -4749,6 +4760,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  		runtime = cfs_b->runtime;
>
>  	expires = cfs_b->runtime_expires;
> +	if (runtime)
> +		cfs_b->distribute_running = 1;
> +
>  	raw_spin_unlock(&cfs_b->lock);
>
>  	if (!runtime)
> @@ -4759,6 +4773,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	raw_spin_lock(&cfs_b->lock);
>  	if (expires == cfs_b->runtime_expires)
>  		cfs_b->runtime -= min(runtime, cfs_b->runtime);
> +	cfs_b->distribute_running = 0;
>  	raw_spin_unlock(&cfs_b->lock);
>  }
>
> @@ -4867,6 +4882,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  	cfs_b->period_timer.function = sched_cfs_period_timer;
>  	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
> +	cfs_b->distribute_running = 0;
>  }
>
>  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 455fa330de04..9683f458aec7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -346,6 +346,8 @@ struct cfs_bandwidth {
>  	int	nr_periods;
>  	int	nr_throttled;
>  	u64	throttled_time;
> +
> +	bool	distribute_running;
>  #endif
>  };
>
>
> --