Subject: Re: [PATCH] sched: Fix yet more sched_fork() races
From: Zhang Qiao
To: Peter Zijlstra
CC: Borislav Petkov, Tadeusz Struk, x86-ml, lkml, Linus Torvalds
Date: Thu, 17 Feb 2022 20:08:54 +0800
X-Mailing-List: linux-kernel@vger.kernel.org

On 2022/2/17 16:51, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
>> Zhang, Tadeusz, TJ, how does this look?

Sorry for not noticing the emails.

>
> *sigh* I was hoping for some Tested-by, since I've no idea how to
> operate this cgroup stuff properly.

I'll apply this patch and run the previous test suite.

--
Qiao.

>
> Anyway, full patch below. I'll go stick it in sched/urgent.
>
> ---
> Subject: sched: Fix yet more sched_fork() races
> From: Peter Zijlstra
> Date: Mon, 14 Feb 2022 10:16:57 +0100
>
> Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
> invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
> race vs syscalls by not placing the task on the runqueue before it
> gets exposed through the pidhash.
>
> Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
> trying to fix a single instance of this, instead fix the whole class
> of issues, effectively reverting this commit.
>
> Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
> Reported-by: Linus Torvalds
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  include/linux/sched/task.h |    4 ++--
>  kernel/fork.c              |   13 ++++++++++++-
>  kernel/sched/core.c        |   34 +++++++++++++++++++++-------------
>  3 files changed, 35 insertions(+), 16 deletions(-)
>
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -54,8 +54,8 @@ extern asmlinkage void schedule_tail(str
>  extern void init_idle(struct task_struct *idle, int cpu);
>
>  extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
> -extern void sched_post_fork(struct task_struct *p,
> -			    struct kernel_clone_args *kargs);
> +extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
> +extern void sched_post_fork(struct task_struct *p);
>  extern void sched_dead(struct task_struct *p);
>
>  void __noreturn do_task_dead(void);
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2266,6 +2266,17 @@ static __latent_entropy struct task_stru
>  		goto bad_fork_put_pidfd;
>
>  	/*
> +	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
> +	 * the new task on the correct runqueue. All this *before* the task
> +	 * becomes visible.
> +	 *
> +	 * This isn't part of ->can_fork() because while the re-cloning is
> +	 * cgroup specific, it unconditionally needs to place the task on a
> +	 * runqueue.
> +	 */
> +	sched_cgroup_fork(p, args);
> +
> +	/*
>  	 * From this point on we must avoid any synchronous user-space
>  	 * communication until we take the tasklist-lock. In particular, we do
>  	 * not want user-space to be able to predict the process start-time by
> @@ -2375,7 +2386,7 @@ static __latent_entropy struct task_stru
>  	write_unlock_irq(&tasklist_lock);
>
>  	proc_fork_connector(p);
> -	sched_post_fork(p, args);
> +	sched_post_fork(p);
>  	cgroup_post_fork(p, args);
>  	perf_event_fork(p);
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1215,9 +1215,8 @@ int tg_nop(struct task_group *tg, void *
>  }
>  #endif
>
> -static void set_load_weight(struct task_struct *p)
> +static void set_load_weight(struct task_struct *p, bool update_load)
>  {
> -	bool update_load = !(READ_ONCE(p->__state) & TASK_NEW);
>  	int prio = p->static_prio - MAX_RT_PRIO;
>  	struct load_weight *load = &p->se.load;
>
> @@ -4408,7 +4407,7 @@ int sched_fork(unsigned long clone_flags
>  			p->static_prio = NICE_TO_PRIO(0);
>
>  		p->prio = p->normal_prio = p->static_prio;
> -		set_load_weight(p);
> +		set_load_weight(p, false);
>
>  		/*
>  		 * We don't need the reset flag anymore after the fork. It has
> @@ -4426,6 +4425,7 @@ int sched_fork(unsigned long clone_flags
>
>  	init_entity_runnable_average(&p->se);
>
> +
>  #ifdef CONFIG_SCHED_INFO
>  	if (likely(sched_info_on()))
>  		memset(&p->sched_info, 0, sizeof(p->sched_info));
> @@ -4441,18 +4441,23 @@ int sched_fork(unsigned long clone_flags
>  	return 0;
>  }
>
> -void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
> +void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
>  {
>  	unsigned long flags;
> -#ifdef CONFIG_CGROUP_SCHED
> -	struct task_group *tg;
> -#endif
>
> +	/*
> +	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
> +	 * required yet, but lockdep gets upset if rules are violated.
> +	 */
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>  #ifdef CONFIG_CGROUP_SCHED
> -	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> -			  struct task_group, css);
> -	p->sched_task_group = autogroup_task_group(p, tg);
> +	if (1) {
> +		struct task_group *tg;
> +		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
> +				  struct task_group, css);
> +		tg = autogroup_task_group(p, tg);
> +		p->sched_task_group = tg;
> +	}
>  #endif
>  	rseq_migrate(p);
>  	/*
> @@ -4463,7 +4468,10 @@ void sched_post_fork(struct task_struct
>  	if (p->sched_class->task_fork)
>  		p->sched_class->task_fork(p);
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +}
>
> +void sched_post_fork(struct task_struct *p)
> +{
>  	uclamp_post_fork(p);
>  }
>
> @@ -6923,7 +6931,7 @@ void set_user_nice(struct task_struct *p
>  		put_prev_task(rq, p);
>
>  	p->static_prio = NICE_TO_PRIO(nice);
> -	set_load_weight(p);
> +	set_load_weight(p, true);
>  	old_prio = p->prio;
>  	p->prio = effective_prio(p);
>
> @@ -7214,7 +7222,7 @@ static void __setscheduler_params(struct
>  	 */
>  	p->rt_priority = attr->sched_priority;
>  	p->normal_prio = normal_prio(p);
> -	set_load_weight(p);
> +	set_load_weight(p, true);
>  }
>
>  /*
> @@ -9447,7 +9455,7 @@ void __init sched_init(void)
>  #endif
>  	}
>
> -	set_load_weight(&init_task);
> +	set_load_weight(&init_task, false);
>
>  	/*
>  	 * The boot idle thread does lazy MMU switching as well:
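
For readers following the thread, the ordering problem described in the quoted
commit message can be sketched as a small userspace model. This is only an
illustration under stated assumptions: struct toy_task and every toy_*
function are hypothetical stand-ins invented here, not kernel code, and the
model reduces "placed on a runqueue" to a single cpu field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for task_struct; not the kernel definition. */
struct toy_task {
	int  cpu;         /* -1 until the task is placed on a runqueue */
	bool on_pidhash;  /* true once the task can be looked up by pid */
};

/* Models sched_cgroup_fork(): place the task while it is still invisible. */
static void toy_sched_cgroup_fork(struct toy_task *p)
{
	assert(!p->on_pidhash);
	p->cpu = 0;
}

/* Models the later step in copy_process() that exposes the task. */
static void toy_attach_pid(struct toy_task *p)
{
	p->on_pidhash = true;
}

/* Models a syscall that finds the task by pid and touches its runqueue. */
static bool toy_syscall_sees_valid_rq(const struct toy_task *p)
{
	return p->on_pidhash && p->cpu >= 0;
}

/* Ordering with the patch: runqueue placement happens before exposure. */
static bool toy_fork_patched(void)
{
	struct toy_task p = { .cpu = -1, .on_pidhash = false };

	toy_sched_cgroup_fork(&p);
	toy_attach_pid(&p);
	return toy_syscall_sees_valid_rq(&p);
}

/* Ordering without the patch: the task is visible with no runqueue yet. */
static bool toy_fork_unpatched_window(void)
{
	struct toy_task p = { .cpu = -1, .on_pidhash = false };

	toy_attach_pid(&p);
	/* A racing syscall would run here, before the late placement. */
	return toy_syscall_sees_valid_rq(&p);
}
```

In this model toy_fork_patched() returns true and toy_fork_unpatched_window()
returns false, which mirrors why the patch moves the runqueue placement into
sched_cgroup_fork() ahead of the point where the task becomes visible.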