Message-ID: <624ce3e7-7262-44c0-5f97-3c1e028f2faf@huawei.com>
Date: Thu, 12 Jan 2023 11:01:43 +0800
Subject: Re: [bug-report] possible s64 overflow in max_vruntime()
From: Zhang Qiao
To: Peter Zijlstra, Waiman Long
CC: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, lkml
References: <73e639d5-702b-0d03-16d9-a965b1963ef6@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2022/12/23 21:57, Zhang Qiao wrote:
>
>
> On 2022/12/22 20:45, Peter Zijlstra wrote:
>> On Wed, Dec 21, 2022 at 11:19:31PM +0800, Zhang Qiao wrote:
>>> Hi folks,
>>>
>>> I found a problem with s64 overflow in max_vruntime().
>>>
>>> I created a task group GROUPA (path: /system.slice/xxx/yyy/CGROUPA) and ran a task in this
>>> group on each CPU. Each task is a busy while loop at 100% CPU usage.
>>>
>>> When net devices are unregistered, flush_all_backlogs() queues a kwork on system_highpri_wq
>>> and wakes up a high-priority kworker thread on each CPU. However, the kworker thread keeps
>>> waiting on the queue and is never scheduled.
>>>
>>> After parsing the vmcore, the vruntime of the kworker is 0x918fdb05287da7c3 and
>>> cfs_rq->min_vruntime is 0x124b17fd59db8d02.
>>>
>>> Why is the difference between cfs_rq->min_vruntime and the kworker's vruntime so large?
>>> 1) The kworker of system_highpri_wq slept for a very long time (about 300 days).
>>> 2) cfs_rq->curr is the ancestor of GROUPA and cfs_rq->curr->load.weight is 2494, so when
>>> the tasks belonging to GROUPA run for a long time, its vruntime increases about 420 times
>>> faster than its execution time, and cfs_rq->min_vruntime also grows rapidly.
>>> 3) When the kworker thread is woken up, its vruntime is set to the maximum of the kworker's
>>> vruntime and cfs_rq->min_vruntime. But in max_vruntime() there is an s64 overflow,
>>> as follows:
>>>
>>> ---------
>>>
>>> static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
>>> {
>>> 	/*
>>> 	 * vruntime=0x124b17fd59db8d02
>>> 	 * max_vruntime=0x918fdb05287da7c3
>>> 	 * vruntime - max_vruntime = 9276074894177461567 > S64_MAX, so the s64 cast overflows
>>> 	 */
>>> 	s64 delta = (s64)(vruntime - max_vruntime);
>>> 	if (delta > 0)
>>> 		max_vruntime = vruntime;
>>>
>>> 	return max_vruntime;
>>> }
>>>
>>> ----------
>>>
>>> max_vruntime() returns the kworker's old vruntime. That is incorrect; the correct result
>>> should be cfs_rq->min_vruntime. The incorrect result is treated as far ahead of
>>> cfs_rq->min_vruntime and causes the kworker thread to starve.
>>>
>>> Does anyone have a good suggestion for solving this problem, or a bugfix patch?
>>
>> I don't understand what you think the problem is. Signed overflow is
>> perfectly fine and works as designed here.
>
> Hi Peter and Waiman,
>
> This problem occurs in a production environment that deploys some DPDK services. When it
> occurs, the system becomes unavailable (for example, many network commands get stuck), so
> I think it's a problem.
>
> Most network commands (such as "ip") require rtnl_mutex, but the rtnl_mutex owner is waiting in
> flush_all_backlogs() for the kworker of system_highpri_wq, until this highpri kworker finishes
> flushing the network packets.
>
> However, this highpri kworker has been sleeping for a long time, and the difference between the
> kworker's vruntime and cfs_rq->min_vruntime is so big that, when it is woken up, it keeps its old
> vruntime due to the s64 overflow in max_vruntime(). Because of the incorrect vruntime, the kworker
> might not be scheduled.
>
> Is it necessary to deal with this problem in the kernel?
> If so, to fix this problem, when a task has been sleeping long enough, we set its vruntime to
> cfs_rq->min_vruntime on wakeup, avoiding the s64 overflow issue in max_vruntime(), as follows:
>

Hi, gentle ping. Please let me know if you have any comments on the issue.


Thanks,
Zhang Qiao.

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e16e9f0124b0..89df8d7bae66 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4336,10 +4336,14 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  #endif
>  }
>
> +/* When a task sleeps for over 200 days, its vruntime is set to cfs_rq->min_vruntime. */
> +#define WAKEUP_REINIT_THRESHOLD_NS	(200LL * 24 * 3600 * NSEC_PER_SEC)
> +
>  static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  {
>  	u64 vruntime = cfs_rq->min_vruntime;
> +	struct rq *rq = rq_of(cfs_rq);
>
>  	/*
>  	 * The 'current' period is already promised to the current tasks,
> @@ -4364,8 +4368,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  		vruntime -= thresh;
>  	}
>
> -	/* ensure we never gain time by being placed backwards. */
> -	se->vruntime = max_vruntime(se->vruntime, vruntime);
> +	if (unlikely(!initial && (s64)(rq_clock_task(rq) - se->exec_start) > WAKEUP_REINIT_THRESHOLD_NS))
> +		se->vruntime = vruntime;
> +	else
> +		/* ensure we never gain time by being placed backwards. */
> +		se->vruntime = max_vruntime(se->vruntime, vruntime);
>  }
>
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
>