From: Xuewei Zhang
Date: Mon, 7 Oct 2019 16:29:01 -0700
Subject: Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
To: Phil Auld
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Anton Blanchard, Linus Torvalds, Thomas Gleixner,
    linux-kernel@vger.kernel.org, stable@vger.kernel.org, trivial@kernel.org
In-Reply-To: <20191007151425.GD22412@pauld.bos.csb>
References: <20191004001243.140897-1-xueweiz@google.com> <20191007151425.GD22412@pauld.bos.csb>

On Mon, Oct 7, 2019 at 8:14 AM Phil Auld wrote:
>
> On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > quota/period ratio is used to ensure a child task group won't get more
> > bandwidth than the parent task group, and is calculated as:
> >
> >   normalized_cfs_quota() = [(quota_us << 20) / period_us]
> >
> > If the quota/period ratio was changed during this scaling due to
> > precision loss, it will cause inconsistency between parent and child
> > task groups. See below example:
> >
> > A userspace container manager (kubelet) does three operations:
> >  1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> >  2) Create a few children cgroups.
> >  3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> >
> > These operations are expected to succeed. However, if the scaling of
> > 147/128 happens before step 3), quota and period of the parent cgroup
> > will be changed:
> >
> >   new_quota:  1148437ns,  1148us
> >   new_period: 11484375ns, 11484us
> >
> > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > which will be larger than the parent cgroup ratio (104821), and will
> > fail.
> >
> > Scaling them by a factor of 2 will fix the problem.
> >
> > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> > Signed-off-by: Xuewei Zhang
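
To make the precision loss above concrete, here is a small userspace sketch
(illustrative only, not kernel code; the helper name "normalized" is mine)
that mirrors the normalized_cfs_quota() math and reproduces the numbers from
the changelog, with the parent ratio dropping from 104857 to 104821 after the
old 147/128 scaling:

/* Userspace sketch: reproduce the parent/child ratio mismatch. */
#include <stdio.h>
#include <stdint.h>

/* same integer math as normalized_cfs_quota() */
static uint64_t normalized(uint64_t quota_us, uint64_t period_us)
{
	return (quota_us << 20) / period_us;
}

int main(void)
{
	uint64_t quota = 1000000, period = 10000000;	/* 1,000us and 10,000us, in ns */

	/* parent and child both start at 104857 */
	printf("before scaling: %llu\n",
	       (unsigned long long)normalized(quota / 1000, period / 1000));

	/* the old ~115% scaling done by sched_cfs_period_timer() */
	uint64_t new_period = period * 147 / 128;		/* 11484375 ns */
	uint64_t new_quota = quota * new_period / period;	/* 1148437 ns */

	/* parent drops to 104821, so a child asking for 1,000us/10,000us (104857) is rejected */
	printf("after 147/128:  %llu\n",
	       (unsigned long long)normalized(new_quota / 1000, new_period / 1000));
	return 0;
}

Doubling quota and period instead cancels exactly in this math and leaves the
ratio at 104857, which is what the patch quoted at the end of this mail does.
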
>
> I managed to get it to trigger the second case. It took 50,000 children
> (20x my initial tests).
>
> [ 1367.850630] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 4340, cfs_quota_us = 250000)
> [ 1370.390832] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 8680, cfs_quota_us = 500000)
> [ 1372.914689] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 17360, cfs_quota_us = 1000000)
> [ 1375.447431] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 34720, cfs_quota_us = 2000000)
> [ 1377.982785] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 69440, cfs_quota_us = 4000000)
> [ 1380.481702] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 138880, cfs_quota_us = 8000000)
> [ 1382.894692] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 277760, cfs_quota_us = 16000000)
> [ 1385.264872] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 555520, cfs_quota_us = 32000000)
> [ 1393.965140] cfs_period_timer[cpu11]: period too short, but cannot scale up without losing precision (cfs_period_us = 555520, cfs_quota_us = 32000000)
>
> I suspect going higher could cause the original lockup, but that'd be the
> case with the old code as well. And this also gets us out of it faster.
>
>
> Tested-by: Phil Auld

Thanks a lot for the review and experiment+test, Phil! Really appreciate it.

To other scheduler maintainers: Could someone help review and approve the
patch?
I'm happy to fix any defect in it :)

Best regards,
Xuewei

>
> Cheers,
> Phil
>
> > ---
> >  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
> >  1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 83ab35e2374f..b3d3d0a231cd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  		if (++count > 3) {
> >  			u64 new, old = ktime_to_ns(cfs_b->period);
> >  
> > -			new = (old * 147) / 128; /* ~115% */
> > -			new = min(new, max_cfs_quota_period);
> > -
> > -			cfs_b->period = ns_to_ktime(new);
> > -
> > -			/* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > -			cfs_b->quota *= new;
> > -			cfs_b->quota = div64_u64(cfs_b->quota, old);
> > -
> > -			pr_warn_ratelimited(
> > -	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > -				smp_processor_id(),
> > -				div_u64(new, NSEC_PER_USEC),
> > -				div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +			/*
> > +			 * Grow period by a factor of 2 to avoid losing precision.
> > +			 * Precision loss in the quota/period ratio can cause __cfs_schedulable
> > +			 * to fail.
> > +			 */
> > +			new = old * 2;
> > +			if (new < max_cfs_quota_period) {
> > +				cfs_b->period = ns_to_ktime(new);
> > +				cfs_b->quota *= 2;
> > +
> > +				pr_warn_ratelimited(
> > +	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > +					smp_processor_id(),
> > +					div_u64(new, NSEC_PER_USEC),
> > +					div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +			} else {
> > +				pr_warn_ratelimited(
> > +	"cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > +					smp_processor_id(),
> > +					div_u64(old, NSEC_PER_USEC),
> > +					div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +			}
> >  
> >  			/* reset count so we don't come right back in here */
> >  			count = 0;
> > --
> > 2.23.0.581.g78d2f28ef7-goog
> >
>
> --
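
P.S. As a sanity check on the factor-of-2 approach, here is one more small
userspace sketch (illustrative only, not kernel code; the starting values of
2170us/125000us are inferred from the first line of Phil's log) that walks
the same doubling sequence as the log above. The quota/period ratio stays
constant at every step, and the doubling stops at 555520us because one more
doubling would exceed the 1s max_cfs_quota_period cap:

/* Userspace sketch: the doubling sequence from Phil's log. */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_USEC	1000ULL
#define NSEC_PER_SEC	1000000000ULL

/* same integer math as normalized_cfs_quota() */
static uint64_t normalized(uint64_t quota_us, uint64_t period_us)
{
	return (quota_us << 20) / period_us;
}

int main(void)
{
	/* values before the first "scaling up" message, inferred from the log */
	uint64_t period = 2170 * NSEC_PER_USEC;
	uint64_t quota = 125000 * NSEC_PER_USEC;

	/* mirrors the new "if (new < max_cfs_quota_period)" check, with the 1s cap */
	while (period * 2 < NSEC_PER_SEC) {
		period *= 2;
		quota *= 2;
		printf("period_us = %llu, quota_us = %llu, ratio = %llu\n",
		       (unsigned long long)(period / NSEC_PER_USEC),
		       (unsigned long long)(quota / NSEC_PER_USEC),
		       (unsigned long long)normalized(quota / NSEC_PER_USEC,
						      period / NSEC_PER_USEC));
	}
	/* stops at period_us = 555520; doubling again (1111040us) would exceed 1s */
	return 0;
}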