Date: Sat, 9 Mar 2019 15:33:21 -0500
From: Phil Auld
To: bsegall@google.com
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer
Message-ID: <20190309203320.GA24464@lorien.usersys.redhat.com>
References: <20190301145209.GA9304@pauld.bos.csb>
 <20190304190510.GB5366@lorien.usersys.redhat.com>
 <20190305200554.GA8786@pauld.bos.csb>
 <20190306162313.GB8786@pauld.bos.csb>
User-Agent: Mutt/1.11.3 (2019-02-01)

On Wed, Mar 06, 2019 at 11:25:02AM -0800 bsegall@google.com wrote:
> Phil Auld writes:
> 
> > On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
> >> Phil Auld writes:
> >> 
> >> > Interestingly, if I limit the number of child cgroups to the number of
> >> > them I'm actually putting processes into (16 down from 2500) the problem
> >> > does not reproduce.
> >> 
> >> That is indeed interesting, and definitely not something we'd want to
> >> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
> >> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
> >> I was originally thinking of)
> >> 
> > 
> > The locking may be a red herring.
> > 
> > The setup is root->throttled->a where a is 1-2500. There are 4 threads in
> > each of the first 16 a groups. The parent, throttled, is where the
> > cfs_period/quota_us are set.
> > 
> > I wonder if the problem is the walk_tg_tree_from() call in unthrottle_cfs_rq().
> > 
> > The distribute_cfs_runtime looks to be O(n * m) where n is the number of
> > throttled cfs_rqs and m is the number of child cgroups. But I'm not
> > completely clear on how the hierarchical cgroups play together here.
> > 
> > I'll pull on this thread some.
> > 
> > Thanks for your input.
> > 
> > 
> > Cheers,
> > Phil
> 
> Yeah, that isn't under the cfs_b lock, but is still part of distribute
> (and under rq lock, which might also matter). I was thinking too much
> about just the cfs_b regions. I'm not sure there's any good general
> optimization there.
> 

It's really an edge case, but the watchdog NMI is pretty painful.

> I suppose cfs_rqs (tgs/cfs_bs?) could have "nearest
> ancestor with a quota" pointer and ones with quota could have
> "descendants with quota" list, parallel to the children/parent lists of
> tgs. Then throttle/unthrottle would only have to visit these lists, and
> child cgroups/cfs_rqs without their own quotas would just check
> cfs_rq->nearest_quota_cfs_rq->throttle_count. throttled_clock_task_time
> can also probably be tracked there.

That seems like it would add a lot of complexity for this edge case. Maybe
it would be acceptable to use the safety valve like my first example, or
something like the below, which will keep tuning the period up until it no
longer overruns. The downside of this one is that it does change the user's
settings, but that could be preferable to an NMI crash.
Cheers,
Phil

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 310d0637fe4b..78f9e28adc7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4859,16 +4859,42 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
         return HRTIMER_NORESTART;
 }
 
+extern const u64 max_cfs_quota_period;
+s64 cfs_quota_period_autotune_thresh = 100 * NSEC_PER_MSEC;
+int cfs_quota_period_autotune_shift = 4; /* 100 / 16 = 6.25% */
+
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
         struct cfs_bandwidth *cfs_b =
                 container_of(timer, struct cfs_bandwidth, period_timer);
+        s64 nsprev, nsnow, new_period;
+        ktime_t now;
         int overrun;
         int idle = 0;
 
         raw_spin_lock(&cfs_b->lock);
+        nsprev = ktime_to_ns(hrtimer_cb_get_time(timer));
         for (;;) {
-                overrun = hrtimer_forward_now(timer, cfs_b->period);
+                /*
+                 * Note this reverts the change to use hrtimer_forward_now, which avoids calling hrtimer_cb_get_time
+                 * for a value we already have
+                 */
+                now = hrtimer_cb_get_time(timer);
+                nsnow = ktime_to_ns(now);
+                if (nsnow - nsprev >= cfs_quota_period_autotune_thresh) {
+                        new_period = ktime_to_ns(cfs_b->period);
+                        new_period += new_period >> cfs_quota_period_autotune_shift;
+                        if (new_period <= max_cfs_quota_period) {
+                                cfs_b->period = ns_to_ktime(new_period);
+                                cfs_b->quota += cfs_b->quota >> cfs_quota_period_autotune_shift;
+                                pr_warn_ratelimited(
+                                        "cfs_period_timer [cpu%d] : Running too long, scaling up (new period %lld, new quota = %lld)\n",
+                                        smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
+                        }
+                        nsprev = nsnow;
+                }
+
+                overrun = hrtimer_forward(timer, now, cfs_b->period);
                 if (!overrun)
                         break;
--
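
[Editor's note: the following is not part of the original mail. It is a small
user-space sketch of the arithmetic in the patch above, assuming the in-kernel
defaults of a 100ms starting period and a 1s max_cfs_quota_period; the patch
also scales the quota by the same factor, so the effective CPU limit stays the
same while only the granularity changes.]

/* Hypothetical standalone program, not kernel code: mirrors the >>4
 * (6.25%) period autotune of the patch above with default values. */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL
#define NSEC_PER_SEC  1000000000ULL

int main(void)
{
        uint64_t period = 100 * NSEC_PER_MSEC;     /* default cpu.cfs_period_us = 100000 */
        const uint64_t max_period = NSEC_PER_SEC;  /* max_cfs_quota_period */
        const int shift = 4;                       /* +6.25% per accepted adjustment */
        int steps = 0;

        /* Same acceptance test as the patch: only grow while the new
         * period stays at or below the cap. */
        while (period + (period >> shift) <= max_period) {
                period += period >> shift;
                steps++;
        }

        /* With these defaults this reports roughly 37 adjustments and a
         * final period a little above 940 ms. */
        printf("%d adjustments before hitting the cap, final period %.1f ms\n",
               steps, period / (double)NSEC_PER_MSEC);
        return 0;
}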
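
[Editor's note: also not part of the original mail. The quoted suggestion above
about a "nearest ancestor with a quota" pointer plus a "descendants with quota"
list can be pictured with the toy user-space model below. Every name in it
(qgroup, nearest_quota, quota_children, and so on) is invented for
illustration; the real task_group/cfs_rq structures have no such fields, and
this is only a sketch of the idea, not a proposed implementation.]

#include <stdio.h>

#define MAX_QUOTA_CHILDREN 16

struct qgroup {
        const char    *name;
        int            throttle_count;
        struct qgroup *nearest_quota;   /* closest ancestor (or self) that sets a quota */
        struct qgroup *quota_children[MAX_QUOTA_CHILDREN]; /* quota-setting descendants only */
        int            nr_quota_children;
};

/* Groups without their own quota are never visited on throttle/unthrottle;
 * they read the throttled state through the ancestor pointer instead. */
static int qgroup_throttled(const struct qgroup *g)
{
        return g->nearest_quota->throttle_count > 0;
}

/* Throttle/unthrottle walks only quota-setting groups, so a parent with a
 * quota and 2500 quota-less children is a one-node walk, not 2501. */
static void qgroup_walk(struct qgroup *g, int delta)
{
        g->throttle_count += delta;
        for (int i = 0; i < g->nr_quota_children; i++)
                qgroup_walk(g->quota_children[i], delta);
}

int main(void)
{
        struct qgroup throttled = { .name = "throttled", .nr_quota_children = 0 };
        struct qgroup child     = { .name = "a1", .nearest_quota = &throttled };

        throttled.nearest_quota = &throttled;   /* a quota-setting group points at itself */

        qgroup_walk(&throttled, +1);            /* "throttle": visits one node */
        printf("%s throttled? %d\n", child.name, qgroup_throttled(&child));

        qgroup_walk(&throttled, -1);            /* "unthrottle" */
        printf("%s throttled? %d\n", child.name, qgroup_throttled(&child));
        return 0;
}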