Date: Sun, 30 Dec 2018 13:31:15 +0100
From: Ingo Molnar
To: Vincent Guittot
Cc: Xie XiuQi, Ingo Molnar, Peter Zijlstra, xiezhipeng1@huawei.com,
    huawei.libin@huawei.com, linux-kernel, Linus Torvalds, Tejun Heo,
    Peter Zijlstra
Subject: [PATCH] sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
Message-ID: <20181230123115.GB17231@gmail.com>
References: <1545879866-27809-1-git-send-email-xiexiuqi@huawei.com> <20181230120435.GA17231@gmail.com>
In-Reply-To: <20181230120435.GA17231@gmail.com>
User-Agent: Mutt/1.9.4 (2018-02-28)
X-Mailing-List: linux-kernel@vger.kernel.org

* Ingo Molnar wrote:

> * Vincent Guittot wrote:
>
> > > Reported-by: Zhipeng Xie
> > > Cc: Bin Li
> > > Cc: [4.10+]
> > > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> >
> > If it only happens in update_blocked_averages(), the del leaf has been
> > added by:
> > a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)
>
> So I think until we are confident in the
> proposed fixes, how about applying Linus's patch that reverts
> a9e7f6544b9c and simplifies the list manipulation?
>
> That way we can re-introduce the O(nr_cgroups) optimization without
> pressure.
>
> I'll prepare a commit for sched/urgent that does this, please holler if
> any of you disagrees!

I've applied the patch below to tip:sched/urgent and I'll push it out if
all goes well in testing:

  1e2adc76e619: ("sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")

I've preemptively added the Tested-by tags of the gents who found and
analyzed this bug:

  Tested-by: Zhipeng Xie
  Tested-by: Sargun Dhillon

... in the assumption that you'll do the testing of Linus's fix to make
sure it's all good!

[ Will probably update the commit with acks and any other feedback
  before sending it to Linus tomorrow-ish. We don't want to end 2018
  with a known scheduler bug in the upstream tree! ;-) ]

Thanks,

	Ingo

===========================>
From 1e2adc76e61924cdfd9dc50c728044d0fbbade27 Mon Sep 17 00:00:00 2001
From: Linus Torvalds
Date: Thu, 27 Dec 2018 13:46:17 -0800
Subject: [PATCH] sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c

Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
scheduler under high loads, starting at around the v4.18 time frame,
and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
manipulation.
Do a (manual) revert of:

  a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

It turns out that the list_del_leaf_cfs_rq() introduced by this commit
has a surprising property that was not considered in followup commits
such as:

  9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

As Vincent Guittot explains:

 "I think that there is a bigger problem with commit a9e7f6544b9c and
  cfs_rq throttling:

  Let's take the example of the following topology TG2 --> TG1 --> root:

   1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
      cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch
      in one path because it has never been used and can't be throttled
      so tmp_alone_branch will point to leaf_cfs_rq_list at the end.

   2) Then TG1 is throttled.

   3) And we add TG3 as a new child of TG1.

   4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before
      TG1 cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list.

  With commit a9e7f6544b9c, we can del a cfs_rq from
  rq->leaf_cfs_rq_list. So if the load of TG1 cfs_rq becomes NULL before
  step 2) above, TG1 cfs_rq is removed from the list.

  Then at step 4), TG3 cfs_rq is added at the beginning of
  rq->leaf_cfs_rq_list but tmp_alone_branch still points to TG3 cfs_rq
  because its throttled parent can't be enqueued when the lock is
  released. tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list
  whereas it should.

  So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
  points on another TG cfs_rq, the next TG cfs_rq that will be added
  will be linked outside rq->leaf_cfs_rq_list - which is bad.

  In addition, we can break the ordering of the cfs_rq in
  rq->leaf_cfs_rq_list but this ordering is used to update and propagate
  the update from leaf down to root."
Instead of trying to work through all these cases and trying to
reproduce the very high loads that produced the lockup to begin with,
simplify the code temporarily by reverting a9e7f6544b9c - which change
was clearly not thought through completely.

This (hopefully) gives us a kernel that doesn't lock up so people can
continue to enjoy their holidays without worrying about regressions. ;-)

[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

Analyzed-by: Xie XiuQi
Analyzed-by: Vincent Guittot
Reported-by: Zhipeng Xie
Reported-by: Sargun Dhillon
Reported-by: Xie XiuQi
Tested-by: Zhipeng Xie
Tested-by: Sargun Dhillon
Signed-off-by: Linus Torvalds
Cc: # v4.13+
Cc: Bin Li
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 43 +++++++++----------------------------------
 1 file changed, 9 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1907506318a..6483834f1278 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -352,10 +352,9 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 	}
 }
 
-/* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
-	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
-				 leaf_cfs_rq_list)
+/* Iterate through all leaf cfs_rq's on a runqueue: */
+#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
+	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ?
  */
 static inline struct cfs_rq *
@@ -447,8 +446,8 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 }
 
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
-		for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
+#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
@@ -7647,27 +7646,10 @@ static inline bool others_have_blocked(struct rq *rq)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-{
-	if (cfs_rq->load.weight)
-		return false;
-
-	if (cfs_rq->avg.load_sum)
-		return false;
-
-	if (cfs_rq->avg.util_sum)
-		return false;
-
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
-	return true;
-}
-
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
@@ -7679,7 +7661,7 @@ static void update_blocked_averages(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7694,13 +7676,6 @@ static void update_blocked_averages(int cpu)
 		if (se && !skip_blocked_update(se))
 			update_load_avg(cfs_rq_of(se), se, 0);
 
-		/*
-		 * There can be a lot of idle CPU cgroups. Don't let fully
-		 * decayed cfs_rqs linger on the list.
-		 */
-		if (cfs_rq_is_decayed(cfs_rq))
-			list_del_leaf_cfs_rq(cfs_rq);
-
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
@@ -10570,10 +10545,10 @@ const struct sched_class fair_sched_class = {
 #ifdef CONFIG_SCHED_DEBUG
 void print_cfs_stats(struct seq_file *m, int cpu)
 {
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
+	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
 		print_cfs_rq(m, cpu, cfs_rq);
 	rcu_read_unlock();
 }