Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling
To: Thomas Gleixner, Vincent Guittot
CC: Phil Auld, Linux Kernel Mailing List, Wei Li, "liaoyu (E)", Peter Zijlstra, Dietmar Eggemann, Ingo Molnar
References: <8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com> <87mt18it1y.ffs@tglx> <68baeac9-9fa7-5594-b5e7-4baf8ac86b77@huawei.com>
 <875y774wvp.ffs@tglx> <87pm5f2qm2.ffs@tglx>
From: Xiongfeng Wang
Message-ID: <155adb21-be6e-533c-02f8-600a1e9138f8@huawei.com>
Date: Thu, 29 Jun 2023 09:41:59 +0800
In-Reply-To: <87pm5f2qm2.ffs@tglx>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2023/6/29 6:01, Thomas Gleixner wrote:
> On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
>> On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner wrote:
>>> No, because this is fundamentally wrong.
>>>
>>> If the CPU is on the way out, then the scheduler hotplug machinery
>>> has to handle the period timer so that the problem Xiongfeng analyzed
>>> does not happen in the first place.
>>
>> But the hrtimer was enqueued before it starts to offline the cpu
>
> It does not really matter when it was enqueued. The important point is
> that it was enqueued on that outgoing CPU for whatever reason.
>
>> Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
>> of the outgoing cpu but :
>> - it must run on another target cpu to migrate the hrtimer.
>> - it runs in the context of the caller which can be throttled.
>
> Sure. I completely understand the problem. The hrtimer hotplug callback
> does not run because the task is stuck and waits for the timer to
> expire. Circular dependency.
>
>>> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
>>> timers, but let me look into whether we can migrate hrtimers early in
>>> general.
>>
>> but for that we must check if the timer is enqueued on the outgoing
>> cpu and we then need to choose a target cpu.
>
> You're right. I somehow assumed that cfs knows where it queued stuff,
> but obviously it does not.
>
> I think we can avoid all that by simply taking that user space task out
> of the picture completely, which avoids debating whether there are other
> possible weird conditions to consider altogether.
>
> Something like the untested below should just work.
>
> Thanks,
>
>         tglx
> ---
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
>  	return err;
>  }
>  
> +static long __cpu_device_down(void *arg)
> +{
> +	struct device *dev = arg;
> +
> +	return cpu_down(dev->id, CPUHP_OFFLINE);
> +}
> +
>  /**
>   * cpu_device_down - Bring down a cpu device
>   * @dev: Pointer to the cpu device to offline
> @@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
>   */
>  int cpu_device_down(struct device *dev)
>  {
> -	return cpu_down(dev->id, CPUHP_OFFLINE);
> +	unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
> +
> +	if (cpu >= nr_cpu_ids)
> +		return -EBUSY;
> +
> +	return work_on_cpu(cpu, __cpu_device_down, dev);
>  }
>  
>  int remove_cpu(unsigned int cpu)
> .
>

I tested your patch together with the following kernel modification,
which helps reproduce the issue. The hung task no longer occurs.
Thanks a lot.

Thanks,
Xiongfeng

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -110,6 +110,8 @@
 #define CREATE_TRACE_POINTS
 #include
 
+#include
+
 /*
  * Minimum number of threads to boot the kernel
  */
@@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
 	struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
 	int i;
 
+	mdelay(2000);
+	cond_resched();
+
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
 		struct vm_struct *vm_stack = cached_vm_stacks[i];
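
For anyone following the thread, the CPU selection step in the quoted
kernel/cpu.c patch can be modeled in user space. This is only a rough
sketch, not kernel code: NR_CPU_IDS, any_online_but() and
choose_offline_worker() are hypothetical names made up for illustration,
the online mask is a plain unsigned long, and the helper returns the
lowest-numbered match, whereas the real cpumask_any_but() only promises
some CPU from the mask other than the excluded one.

```c
#include <errno.h>

/* Hypothetical CPU count for this user-space model. */
#define NR_CPU_IDS 8u

/*
 * Stand-in for cpumask_any_but(): return an online CPU other than
 * 'skip', or NR_CPU_IDS when no such CPU exists.
 */
static unsigned int any_online_but(unsigned long online_mask,
				   unsigned int skip)
{
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPU_IDS; cpu++) {
		if (cpu != skip && (online_mask & (1UL << cpu)))
			return cpu;
	}
	return NR_CPU_IDS;
}

/*
 * Model of the decision made in the patched cpu_device_down(): pick the
 * CPU that would run the offline work on behalf of the caller, or fail
 * with -EBUSY when 'victim' is the only CPU left online.
 */
static int choose_offline_worker(unsigned long online_mask,
				 unsigned int victim)
{
	unsigned int cpu = any_online_but(online_mask, victim);

	if (cpu >= NR_CPU_IDS)
		return -EBUSY;

	return (int)cpu;
}
```

With CPUs 0-3 online (mask 0x0f), taking down CPU 0 selects CPU 1 in
this model, and a mask with only the victim set yields -EBUSY. In the
quoted patch the chosen CPU then runs cpu_down() via work_on_cpu(), so
the offline no longer runs in the context of the user space task that
can be throttled.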