Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp647462imu; Fri, 11 Jan 2019 06:45:49 -0800 (PST) X-Google-Smtp-Source: ALg8bN4RekxtXUU597/x1NUj7JMv7SrAFtVyHQwMt1Wj+F49UNp3uKz9P9lFkUrk+DXaFPBSPlQe X-Received: by 2002:a62:8e19:: with SMTP id k25mr14807347pfe.185.1547217949838; Fri, 11 Jan 2019 06:45:49 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1547217949; cv=none; d=google.com; s=arc-20160816; b=Rbm1XXlYIDBZv7cnVa/vjzhEtDsYriJWUkq94cfdhWSpCl29zDFyQVeFcCSHlhsy63 8aoA/j+I6l0L0oETkzb9pVHX/A+rlFXiHw3ZeOaUJWyqe2b9Tet3LAN+Z1n9LFG9zHOQ l60LGvTe4FYNX+Zq5BQR4JC2z6XgMtY1Hq3q/mdqHO2e6Bsx07KxxcN6uOOjRtG4Scn9 yujSvUiOAN3q4J7HIW97QH/DoATdpD3fFQSNt++HBGVgCDT1xZR4dTF2oPEsGYzQhP9J KKgXYoOyfiJU0JbQ6/dRDPz5c+LZdn+hdFdD/+wg/F7WXkGFYWucUmbKM2Blt5LLu9cx hMhQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=rzsKgb9jlepEYkvfINgBQKyJA+ev6v1H4oC2qrbnnCU=; b=lH+9U4YYt2vTp53jWt7cCLKlWY+exxrtXbiSNtkuAwH/vMLeXMwdGO6/WDxWG1TE7m j3P9T+d61V1z8YmG/W/1GUSPP1OkBhdwXBoojMkTttwF4wxUCbTdyLPYEAS6q9OZmRYd 7z7zdJzFW8jp2TRBRs5QQiedAsJYSQY0jZ8OL09LvU4X4KX0BrHz1mlRtWVLIo3Z53Xr 5NyHu4ro3rAcWo/PNN9d3bOL1LqXackHxU9nBvwhP0DIUp8PrcHoVxN5iIlxehxPcDqB bRd0BJeAps3oWUqfHZrmULCWqEObGo4ABp4SgmgcZsoD9jTFLccvHJVD+E2bklaT0N4n h2xg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=mssTbDbN; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id i4si4083951pfg.218.2019.01.11.06.45.34; Fri, 11 Jan 2019 06:45:49 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=mssTbDbN; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404432AbfAKOnj (ORCPT + 99 others); Fri, 11 Jan 2019 09:43:39 -0500 Received: from mail.kernel.org ([198.145.29.99]:36494 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2404401AbfAKOng (ORCPT ); Fri, 11 Jan 2019 09:43:36 -0500 Received: from localhost (5356596B.cm-6-7b.dynamic.ziggo.nl [83.86.89.107]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id B6FE52063F; Fri, 11 Jan 2019 14:43:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1547217815; bh=GLE/QxMomXvlDseL/eAZ5EW4nbgSCzcGTaTwgDR+SkI=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=mssTbDbNmpJhVWDL0rLzqgmo+OkAxpr7EvZVO/dWziGE1cFjDU/OFXsN/Ub2eWiDN LrhfQrrj5KBmDUXS2P6qWlijT7G6dC7H3XLRJHRrVgtCsHATm5NxeSnskqCU/NwURi 3Fi17+sO7EWHByP+nM+rpJeyJJd+AEfKDNXzwVhU= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Zhipeng Xie , Sargun Dhillon , Xie XiuQi , Linus Torvalds , Vincent Guittot , Bin Li , Mike Galbraith , Peter Zijlstra , Tejun Heo , Thomas Gleixner , Ingo Molnar Subject: [PATCH 4.20 51/65] sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c Date: Fri, 11 Jan 2019 15:15:37 +0100 Message-Id: <20190111131103.218018795@linuxfoundation.org> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190111131055.331350141@linuxfoundation.org> References: <20190111131055.331350141@linuxfoundation.org> User-Agent: quilt/0.65 X-stable: review X-Patchwork-Hint: ignore MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 4.20-stable review patch. If anyone has any objections, please let me know. ------------------ From: Linus Torvalds commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream. Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the scheduler under high loads, starting at around the v4.18 time frame, and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list manipulation. Do a (manual) revert of: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") It turns out that the list_del_leaf_cfs_rq() introduced by this commit is a surprising property that was not considered in followup commits such as: 9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list") As Vincent Guittot explains: "I think that there is a bigger problem with commit a9e7f6544b9c and cfs_rq throttling: Let take the example of the following topology TG2 --> TG1 --> root: 1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1 cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in one path because it has never been used and can't be throttled so tmp_alone_branch will point to leaf_cfs_rq_list at the end. 2) Then TG1 is throttled 3) and we add TG3 as a new child of TG1. 4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1 cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list. With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list. So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1 cfs_rq is removed from the list. Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list but tmp_alone_branch still points to TG3 cfs_rq because its throttled parent can't be enqueued when the lock is released. tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should. So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch points on another TG cfs_rq, the next TG cfs_rq that will be added, will be linked outside rq->leaf_cfs_rq_list - which is bad. In addition, we can break the ordering of the cfs_rq in rq->leaf_cfs_rq_list but this ordering is used to update and propagate the update from leaf down to root." Instead of trying to work through all these cases and trying to reproduce the very high loads that produced the lockup to begin with, simplify the code temporarily by reverting a9e7f6544b9c - which change was clearly not thought through completely. This (hopefully) gives us a kernel that doesn't lock up so people can continue to enjoy their holidays without worrying about regressions. ;-) [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ] Analyzed-by: Xie XiuQi Analyzed-by: Vincent Guittot Reported-by: Zhipeng Xie Reported-by: Sargun Dhillon Reported-by: Xie XiuQi Tested-by: Zhipeng Xie Tested-by: Sargun Dhillon Signed-off-by: Linus Torvalds Acked-by: Vincent Guittot Cc: # v4.13+ Cc: Bin Li Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Tejun Heo Cc: Thomas Gleixner Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman --- kernel/sched/fair.c | 43 +++++++++---------------------------------- 1 file changed, 9 insertions(+), 34 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -352,10 +352,9 @@ static inline void list_del_leaf_cfs_rq( } } -/* Iterate thr' all leaf cfs_rq's on a runqueue */ -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ - list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ - leaf_cfs_rq_list) +/* Iterate through all leaf cfs_rq's on a runqueue: */ +#define for_each_leaf_cfs_rq(rq, cfs_rq) \ + list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) /* Do the two (enqueued) entities belong to the same group ? */ static inline struct cfs_rq * @@ -447,8 +446,8 @@ static inline void list_del_leaf_cfs_rq( { } -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ - for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos) +#define for_each_leaf_cfs_rq(rq, cfs_rq) \ + for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) static inline struct sched_entity *parent_entity(struct sched_entity *se) { @@ -7387,27 +7386,10 @@ static inline bool others_have_blocked(s #ifdef CONFIG_FAIR_GROUP_SCHED -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) -{ - if (cfs_rq->load.weight) - return false; - - if (cfs_rq->avg.load_sum) - return false; - - if (cfs_rq->avg.util_sum) - return false; - - if (cfs_rq->avg.runnable_load_sum) - return false; - - return true; -} - static void update_blocked_averages(int cpu) { struct rq *rq = cpu_rq(cpu); - struct cfs_rq *cfs_rq, *pos; + struct cfs_rq *cfs_rq; const struct sched_class *curr_class; struct rq_flags rf; bool done = true; @@ -7419,7 +7401,7 @@ static void update_blocked_averages(int * Iterates the task_group tree in a bottom up fashion, see * list_add_leaf_cfs_rq() for details. */ - for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { + for_each_leaf_cfs_rq(rq, cfs_rq) { struct sched_entity *se; /* throttled entities do not contribute to load */ @@ -7434,13 +7416,6 @@ static void update_blocked_averages(int if (se && !skip_blocked_update(se)) update_load_avg(cfs_rq_of(se), se, 0); - /* - * There can be a lot of idle CPU cgroups. Don't let fully - * decayed cfs_rqs linger on the list. - */ - if (cfs_rq_is_decayed(cfs_rq)) - list_del_leaf_cfs_rq(cfs_rq); - /* Don't need periodic decay once load/util_avg are null */ if (cfs_rq_has_blocked(cfs_rq)) done = false; @@ -10289,10 +10264,10 @@ const struct sched_class fair_sched_clas #ifdef CONFIG_SCHED_DEBUG void print_cfs_stats(struct seq_file *m, int cpu) { - struct cfs_rq *cfs_rq, *pos; + struct cfs_rq *cfs_rq; rcu_read_lock(); - for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos) + for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) print_cfs_rq(m, cpu, cfs_rq); rcu_read_unlock(); }