Date: Fri, 15 Mar 2019 09:51:25 -0400
From: Phil Auld
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Ben Segall, Ingo Molnar
Subject: Re: [PATCH] sched/fair: Limit sched_cfs_period_timer loop to avoid hard lockup
Message-ID: <20190315135124.GC27131@pauld.bos.csb>
References: <20190313150826.16862-1-pauld@redhat.com>
 <20190315101150.GV5996@hirez.programming.kicks-ass.net>
 <20190315103357.GC6521@hirez.programming.kicks-ass.net>
In-Reply-To: <20190315103357.GC6521@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21
On Fri, Mar 15, 2019 at 11:33:57AM +0100, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 11:11:50AM +0100, Peter Zijlstra wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ea74d43924b2..b71557be6b42 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >  	return HRTIMER_NORESTART;
> >  }
> >  
> > +extern const u64 max_cfs_quota_period;
> > +
> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  {
> >  	struct cfs_bandwidth *cfs_b =
> > @@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  	unsigned long flags;
> >  	int overrun;
> >  	int idle = 0;
> > +	int count = 0;
> >  
> >  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> >  	for (;;) {
> > @@ -4899,6 +4902,28 @@
> >  		if (!overrun)
> >  			break;
> >  
> > +		if (++count > 3) {
> > +			u64 new, old = ktime_to_ns(cfs_b->period);
> > +
> > +			new = (old * 147) / 128; /* ~115% */
> > +			new = min(new, max_cfs_quota_period);
> 
> Also, we can still engineer things to come unstuck; if we explicitly
> configure period at 1e9 and then set a really small quota and then
> create this insane amount of cgroups you have..
> 
> this code has no room to manoeuvre left.
> 
> Do we want to do anything about that? Or leave it as is, don't do that
> then?
> 

If the period is 1s it would be hard to make this loop fire repeatedly. I
don't think it's that dependent on the quota, other than getting some rqs
throttled.
The small quota would also mean fewer of them would get unthrottled per
distribute call. You'd probably need _significantly_ more cgroups than my
insane 2500 to hit it. Right now it settles out with a new period of
~12-15ms. So ~200,000 cgroups?

Ben and I talked a little about this in another thread. I think hitting this
is enough of an edge case that this approach will make the problem go away.
The only alternative we came up with to reduce the time taken in unthrottle
involved a fair bit of complexity added to the everyday code paths. And it
might not help if the children all had their own quota/period settings
active.

Thoughts?

Cheers,
Phil

> > +
> > +			cfs_b->period = ns_to_ktime(new);
> > +
> > +			/* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > +			cfs_b->quota *= new;
> > +			cfs_b->quota /= old;
> > +
> > +			pr_warn_ratelimited(
> > +	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > +				smp_processor_id(),
> > +				new/NSEC_PER_USEC,
> > +				cfs_b->quota/NSEC_PER_USEC);
> > +
> > +			/* reset count so we don't come right back in here */
> > +			count = 0;
> > +		}
> > +
> >  		idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
> >  	}
> >  	if (idle)
-- 