Subject: Re: 5.6-rc3: WARNING: CPU: 48 PID: 17435 at kernel/sched/fair.c:380 enqueue_task_fair+0x328/0x440
To: Christian Borntraeger, Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, "linux-kernel@vger.kernel.org"
References: <1a607a98-f12a-77bd-2062-c3e599614331@de.ibm.com> <20200228163545.GA18662@vingu-book> <49a2ebb7-c80b-9e2b-4482-7f9ff938417d@de.ibm.com> <2108173c-beaa-6b84-1bc3-8f575fb95954@de.ibm.com>
From: Dietmar Eggemann
Message-ID: <7be92e79-731b-220d-b187-d38bde80ad16@arm.com>
Date: Wed, 4 Mar 2020 20:19:28 +0100
In-Reply-To: <2108173c-beaa-6b84-1bc3-8f575fb95954@de.ibm.com>
Hi Christian,

On 04/03/2020 18:42, Christian Borntraeger wrote:
>
>
> On 04.03.20 16:26, Vincent Guittot wrote:
>> On Tue, 3 Mar 2020 at 08:55, Vincent Guittot wrote:
>>>
>>> On Tue, 3 Mar 2020 at 08:37, Christian Borntraeger
>>> wrote:
>>>>
>>>>
>>>>
>> [...]
>>>>>>> ---
>>>>>>>  kernel/sched/fair.c | 2 +-
>>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>> index 3c8a379c357e..beb773c23e7d 100644
>>>>>>> --- a/kernel/sched/fair.c
>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>> @@ -4035,8 +4035,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>>>>>>  	__enqueue_entity(cfs_rq, se);
>>>>>>>  	se->on_rq = 1;
>>>>>>>
>>>>>>> +	list_add_leaf_cfs_rq(cfs_rq);
>>>>>>>  	if (cfs_rq->nr_running == 1) {
>>>>>>> -		list_add_leaf_cfs_rq(cfs_rq);
>>>>>>>  		check_enqueue_throttle(cfs_rq);
>>>>>>>  	}
>>>>>>>  }
>>>>>>
>>>>>> Now running for 3 hours. I have not seen the issue yet. I can tell tomorrow if this fixes
>>>>>> the issue.
>>>>>
>>>>>
>>>>> Still running fine. I can tell for sure tomorrow, but I have the impression that this makes the
>>>>> WARN_ON go away.
>>>>
>>>> So I guess this change "fixed" the issue. If you want me to test additional patches, let me know.
>>>
>>> Thanks for the test. For now, I don't have any other patch to test. I
>>> have to look more deeply how the situation happens.
>>> I will let you know if I have other patch to test.
>>
>> So I haven't been able to figure out how we reach this situation yet.
>> In the meantime I'm going to make a clean patch with the fix above.
>>
>> Is it ok if I add a reported-by and a tested-by from you?
>
> Sure.
> I just realized that this system has something special.
> Some month ago I created 2 slices
> $ head /etc/systemd/system/*.slice
> ==> /etc/systemd/system/machine-production.slice <==
> [Unit]
> Description=VM production
> Before=slices.target
> Wants=machine.slice
> [Slice]
> CPUQuota=2000%
> CPUWeight=1000
>
> ==> /etc/systemd/system/machine-test.slice <==
> [Unit]
> Description=VM production
> Before=slices.target
> Wants=machine.slice
> [Slice]
> CPUQuota=300%
> CPUWeight=100
>
>
> And the guests are then put into these slices. That also means that this test will never use more than the 2300%.
> No matter how many CPUs the system has.

If you could run this debug patch on top of your unpatched kernel, it would tell us which task (in the enqueue case) and which taskgroup is causing that. You could then further dump the appropriate taskgroup directory under the cpu cgroup mountpoint (to see e.g. the CFS bandwidth data).

I expect no more than one hit since assert_list_leaf_cfs_rq() uses SCHED_WARN_ON, hence WARN_ONCE.

--8<--

From b709758f476ee4cfc260eceedc45ebcc50d93074 Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann
Date: Sat, 29 Feb 2020 11:07:05 +0000
Subject: [PATCH] test: rq->tmp_alone_branch != &rq->leaf_cfs_rq_list

Signed-off-by: Dietmar Eggemann
---
 kernel/sched/fair.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c8a379c357e..69fc30db7440 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4619,6 +4619,15 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 			break;
 	}
 
+	if (rq->tmp_alone_branch != &rq->leaf_cfs_rq_list) {
+		char path[64];
+
+		sched_trace_cfs_rq_path(cfs_rq, path, 64);
+
+		printk("CPU%d path=%s on_list=%d nr_running=%d\n",
+		       cpu_of(rq), path, cfs_rq->on_list, cfs_rq->nr_running);
+	}
+
 	assert_list_leaf_cfs_rq(rq);
 
 	if (!se)
@@ -5320,6 +5329,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		}
 	}
 
+	if (rq->tmp_alone_branch != &rq->leaf_cfs_rq_list) {
+		char path[64];
+
+		cfs_rq = cfs_rq_of(&p->se);
+
+		sched_trace_cfs_rq_path(cfs_rq, path, 64);
+
+		printk("CPU%d path=%s on_list=%d nr_running=%d p=[%s %d]\n",
+		       cpu_of(rq), path, cfs_rq->on_list, cfs_rq->nr_running,
+		       p->comm, p->pid);
+	}
+
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
-- 
2.17.1
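As context for the slice config quoted above, here is a rough sketch (not part of this thread; the helper name is mine, and it assumes the default 100 ms CFS period that systemd uses for CPUQuota=) of how a CPUQuota= percentage maps to the cpu controller's CFS bandwidth pair:

```python
# Hypothetical helper, for illustration only: systemd's CPUQuota=N%
# becomes a cfs_quota_us of N% of the CFS period, i.e. N/100 CPUs
# worth of runtime per period (assuming the 100 ms default period).
CFS_PERIOD_US = 100_000  # cpu.cfs_period_us default (100 ms)

def cfs_quota_us(cpu_quota_percent: int) -> int:
    """CPUQuota=2000% -> 20 CPUs worth of runtime per period."""
    return cpu_quota_percent * CFS_PERIOD_US // 100

# The two slices would together cap the guests at 2300% (23 CPUs),
# no matter how many CPUs the machine has:
print(cfs_quota_us(2000))  # machine-production.slice -> 2000000 us per period
print(cfs_quota_us(300))   # machine-test.slice       ->  300000 us per period
```

On such a system, the CFS bandwidth data would be readable as cpu.cfs_quota_us and cpu.cfs_period_us under each slice's directory in the cpu cgroup mountpoint.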