Received: by 2002:a05:6a10:6744:0:0:0:0 with SMTP id w4csp587181pxu; Fri, 23 Oct 2020 08:27:26 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyzG26CWcql8Az5CKUXnoBQJcRqhOWozlA4CNAULjY0fGa/kif0Qeifw2mlCW5Xi2NIPUnY X-Received: by 2002:a17:906:31d0:: with SMTP id f16mr2414147ejf.409.1603466845502; Fri, 23 Oct 2020 08:27:25 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1603466845; cv=none; d=google.com; s=arc-20160816; b=DHy3cyYbegNJn3iOlo0r7f8dCyksyuKdFXy4grv2QlwSvoqf6eSbAw9xz0q5X4E/Xy 7Sc83sxDpCi2R7F2C51aiQZAJMPZKAVgiY+fUywadSb9A0ZOKegwPM0b+i9qt6WGf4dW ww5HW0s2T4p3LN/D5oQmC7NKsQVrtAgo0ADM+/BfA/+j8LN5z9fu3akvem1lSMiEOBDH 6Hid6g/uo3RtsYIP9AFQURvIJ360uG0z0WIs2afNF+NQRlIdhiOWof3Nx9byftf8VtTj /q1PeiP2FOw9KB5/r0xMV/uJd00mroqKIS/zAoqayHfnzV2+Ycw9QUTr3a+JNR93a6kr wP3Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:references:subject:cc:to:from:date :user-agent:message-id:dkim-signature; bh=Vwv9WCc/z8xd5HpFQhxd9azJ93OJ/d1pot3rAFh5zAI=; b=B2qEGE0F61oKrinGhDVMhrKO5FLz2q0uLGLOOode4yP4Ga9yY9fpY03V9lwlWmhH29 BJFjtaNNdEnXFHR9W9Zp4WZzaHVs4LP0wnqR4/md/VtrLzbOWDNhFU+l0f7dz5ehOQg9 j7KBAjf2Izhf/e7F0FdfD71+a5X+YIloGhkLgO1Erj5NE2cO6/YOA8ngvY2pHDvYlnKs XeP5Wp5dsOvCgPTaaiv5gYqyt0ZAncp77XfTv5jRNGonljxiyVmchzhmHyZpVHOxUgpo VCP6AoN2ZDz6WKTvO+FnG1SWKnXJxd73raYkySJwyo41on5P9HO4/uREuy7Rhmm18uhk cFqg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@infradead.org header.s=casper.20170209 header.b=TauBtQLj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id a6si993755edn.512.2020.10.23.08.27.03; Fri, 23 Oct 2020 08:27:25 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@infradead.org header.s=casper.20170209 header.b=TauBtQLj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S462252AbgJWKZa (ORCPT + 99 others); Fri, 23 Oct 2020 06:25:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48034 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S462227AbgJWKZZ (ORCPT ); Fri, 23 Oct 2020 06:25:25 -0400 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 899C4C0613CE for ; Fri, 23 Oct 2020 03:25:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Content-Type:MIME-Version:References: Subject:Cc:To:From:Date:Message-ID:Sender:Reply-To:Content-Transfer-Encoding: Content-ID:Content-Description:In-Reply-To; bh=Vwv9WCc/z8xd5HpFQhxd9azJ93OJ/d1pot3rAFh5zAI=; b=TauBtQLjq3iWr7rzm6PqqcxVcW uUWjA9JgLNsgbJAAouy2oFLVFcolyD3HvtG4+WOjpfk/6MKdfoDo+Qjk0/tFzvj7rMf9/4htJatqH OsmffHxHeMmhqMBricbFTknpwqOfyXNJKjHBdqPGjrTiQCSYRYt1R6k6oh1LhBOFaTV8y9T2K5qJv VeeR5vblNxUAVJCweiE+zsROZ6e1PV7buIqvW+4xLcjkt2rrqxBWHmIQAQqf6CFqpvg/HkJ8aO7wD 6krBP8bx9XkOUTrz4AHEEso0bOWUnRkDpBRPWVd4AbT8db2oYFJFiBnVI0XTiBUq3lSrI2kfCjRam LOrcmU7Q==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by casper.infradead.org with esmtpsa (Exim 4.92.3 #3 (Red Hat Linux)) id 1kVuFo-0002tC-La; Fri, 23 Oct 2020 10:25:02 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id D70503077B1; Fri, 23 Oct 2020 12:24:56 +0200 (CEST) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 0) id 5BEFB29DDA65B; Fri, 23 Oct 2020 12:24:56 +0200 (CEST) Message-ID: <20201023102346.921768277@infradead.org> User-Agent: quilt/0.66 Date: Fri, 23 Oct 2020 12:12:08 +0200 From: Peter Zijlstra To: tglx@linutronix.de, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, bigeasy@linutronix.de, qais.yousef@arm.com, swood@redhat.com, peterz@infradead.org, valentin.schneider@arm.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vincent.donnefort@arm.com, tj@kernel.org, ouwen210@hotmail.com Subject: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr() References: <20201023101158.088940906@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Concurrent migrate_disable() and set_cpus_allowed_ptr() has interesting features. We rely on set_cpus_allowed_ptr() to not return until the task runs inside the provided mask. This expectation is exported to userspace. This means that any set_cpus_allowed_ptr() caller must wait until migrate_enable() allows migrations. At the same time, we don't want migrate_enable() to schedule, due to patterns like: preempt_disable(); migrate_disable(); ... migrate_enable(); preempt_enable(); And: raw_spin_lock(&B); spin_unlock(&A); this means that when migrate_enable() must restore the affinity mask, it cannot wait for completion thereof. Luck will have it that that is exactly the case where there is a pending set_cpus_allowed_ptr(), so let that provide storage for the async stop machine. Much thanks to Valentin who used TLA+ most effective and found lots of 'interesting' cases. Signed-off-by: Peter Zijlstra (Intel) --- include/linux/sched.h | 1 kernel/sched/core.c | 234 +++++++++++++++++++++++++++++++++++++++++++------- 2 files changed, 205 insertions(+), 30 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -713,6 +713,7 @@ struct task_struct { int nr_cpus_allowed; const cpumask_t *cpus_ptr; cpumask_t cpus_mask; + void *migration_pending; #if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) int migration_disabled; #endif --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1732,15 +1732,26 @@ void migrate_enable(void) { struct task_struct *p = current; - if (--p->migration_disabled) + if (p->migration_disabled > 1) { + p->migration_disabled--; return; + } + /* + * Ensure stop_task runs either before or after this, and that + * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule(). + */ + preempt_disable(); + if (p->cpus_ptr != &p->cpus_mask) + __set_cpus_allowed_ptr(p, &p->cpus_mask, SCA_MIGRATE_ENABLE); + /* + * Mustn't clear migration_disabled() until cpus_ptr points back at the + * regular cpus_mask, otherwise things that race (eg. + * select_fallback_rq) get confused. + */ barrier(); - - if (p->cpus_ptr == &p->cpus_mask) - return; - - __set_cpus_allowed_ptr(p, &p->cpus_mask, SCA_MIGRATE_ENABLE); + p->migration_disabled = 0; + preempt_enable(); } EXPORT_SYMBOL_GPL(migrate_enable); @@ -1805,8 +1816,16 @@ static struct rq *move_queued_task(struc } struct migration_arg { - struct task_struct *task; - int dest_cpu; + struct task_struct *task; + int dest_cpu; + struct set_affinity_pending *pending; +}; + +struct set_affinity_pending { + refcount_t refs; + struct completion done; + struct cpu_stop_work stop_work; + struct migration_arg arg; }; /* @@ -1838,16 +1857,19 @@ static struct rq *__migrate_task(struct */ static int migration_cpu_stop(void *data) { + struct set_affinity_pending *pending; struct migration_arg *arg = data; struct task_struct *p = arg->task; + int dest_cpu = arg->dest_cpu; struct rq *rq = this_rq(); + bool complete = false; struct rq_flags rf; /* * The original target CPU might have gone down and we might * be on another CPU but it doesn't matter. */ - local_irq_disable(); + local_irq_save(rf.flags); /* * We need to explicitly wake pending tasks before running * __migrate_task() such that we will not miss enforcing cpus_ptr @@ -1857,21 +1879,83 @@ static int migration_cpu_stop(void *data raw_spin_lock(&p->pi_lock); rq_lock(rq, &rf); + + pending = p->migration_pending; /* * If task_rq(p) != rq, it cannot be migrated here, because we're * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because * we're holding p->pi_lock. */ if (task_rq(p) == rq) { + if (is_migration_disabled(p)) + goto out; + + if (pending) { + p->migration_pending = NULL; + complete = true; + } + + /* migrate_enable() -- we must not race against SCA */ + if (dest_cpu < 0) { + /* + * When this was migrate_enable() but we no longer + * have a @pending, a concurrent SCA 'fixed' things + * and we should be valid again. Nothing to do. + */ + if (!pending) { + WARN_ON_ONCE(!is_cpu_allowed(p, cpu_of(rq))); + goto out; + } + + dest_cpu = cpumask_any_distribute(&p->cpus_mask); + } + if (task_on_rq_queued(p)) - rq = __migrate_task(rq, &rf, p, arg->dest_cpu); + rq = __migrate_task(rq, &rf, p, dest_cpu); else - p->wake_cpu = arg->dest_cpu; + p->wake_cpu = dest_cpu; + + } else if (dest_cpu < 0) { + /* + * This happens when we get migrated between migrate_enable()'s + * preempt_enable() and scheduling the stopper task. At that + * point we're a regular task again and not current anymore. + * + * A !PREEMPT kernel has a giant hole here, which makes it far + * more likely. + */ + + /* + * When this was migrate_enable() but we no longer have an + * @pending, a concurrent SCA 'fixed' things and we should be + * valid again. Nothing to do. + */ + if (!pending) { + WARN_ON_ONCE(!is_cpu_allowed(p, cpu_of(rq))); + goto out; + } + + /* + * When migrate_enable() hits a rq mis-match we can't reliably + * determine is_migration_disabled() and so have to chase after + * it. + */ + task_rq_unlock(rq, p, &rf); + stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop, + &pending->arg, &pending->stop_work); + return 0; } - rq_unlock(rq, &rf); - raw_spin_unlock(&p->pi_lock); +out: + task_rq_unlock(rq, p, &rf); + + if (complete) + complete_all(&pending->done); + + /* For pending->{arg,stop_work} */ + pending = arg->pending; + if (pending && refcount_dec_and_test(&pending->refs)) + wake_up_var(&pending->refs); - local_irq_enable(); return 0; } @@ -1941,6 +2025,110 @@ void do_set_cpus_allowed(struct task_str } /* + * This function is wildly self concurrent, consider at least 3 times. + */ +static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flags *rf, + int dest_cpu, unsigned int flags) +{ + struct set_affinity_pending my_pending = { }, *pending = NULL; + struct migration_arg arg = { + .task = p, + .dest_cpu = dest_cpu, + }; + bool complete = false; + + /* Can the task run on the task's current CPU? If so, we're done */ + if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) { + pending = p->migration_pending; + if (pending) { + refcount_inc(&pending->refs); + p->migration_pending = NULL; + complete = true; + } + task_rq_unlock(rq, p, rf); + + if (complete) + goto do_complete; + + return 0; + } + + if (!(flags & SCA_MIGRATE_ENABLE)) { + /* serialized by p->pi_lock */ + if (!p->migration_pending) { + refcount_set(&my_pending.refs, 1); + init_completion(&my_pending.done); + p->migration_pending = &my_pending; + } else { + pending = p->migration_pending; + refcount_inc(&pending->refs); + } + } + pending = p->migration_pending; + /* + * - !MIGRATE_ENABLE: + * we'll have installed a pending if there wasn't one already. + * + * - MIGRATE_ENABLE: + * we're here because the current CPU isn't matching anymore, + * the only way that can happen is because of a concurrent + * set_cpus_allowed_ptr() call, which should then still be + * pending completion. + * + * Either way, we really should have a @pending here. + */ + if (WARN_ON_ONCE(!pending)) + return -EINVAL; + + if (flags & SCA_MIGRATE_ENABLE) { + + refcount_inc(&pending->refs); /* pending->{arg,stop_work} */ + task_rq_unlock(rq, p, rf); + + pending->arg = (struct migration_arg) { + .task = p, + .dest_cpu = -1, + .pending = pending, + }; + + stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop, + &pending->arg, &pending->stop_work); + + return 0; + } + + if (task_running(rq, p) || p->state == TASK_WAKING) { + + task_rq_unlock(rq, p, rf); + stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg); + + } else { + + if (!is_migration_disabled(p)) { + if (task_on_rq_queued(p)) + rq = move_queued_task(rq, rf, p, dest_cpu); + + p->migration_pending = NULL; + complete = true; + } + task_rq_unlock(rq, p, rf); + +do_complete: + if (complete) + complete_all(&pending->done); + } + + wait_for_completion(&pending->done); + + if (refcount_dec_and_test(&pending->refs)) + wake_up_var(&pending->refs); + + wait_var_event(&my_pending.refs, !refcount_read(&my_pending.refs)); + + return 0; +} + +/* * Change a given task's CPU affinity. Migrate the thread to a * proper CPU and schedule it away if the CPU it's executing on * is removed from the allowed bitmask. @@ -2009,23 +2197,8 @@ static int __set_cpus_allowed_ptr(struct p->nr_cpus_allowed != 1); } - /* Can the task run on the task's current CPU? If so, we're done */ - if (cpumask_test_cpu(task_cpu(p), new_mask)) - goto out; + return affine_move_task(rq, p, &rf, dest_cpu, flags); - if (task_running(rq, p) || p->state == TASK_WAKING) { - struct migration_arg arg = { p, dest_cpu }; - /* Need help from migration thread: drop lock and wait. */ - task_rq_unlock(rq, p, &rf); - stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg); - return 0; - } else if (task_on_rq_queued(p)) { - /* - * OK, since we're going to drop the lock immediately - * afterwards anyway. - */ - rq = move_queued_task(rq, &rf, p, dest_cpu); - } out: task_rq_unlock(rq, p, &rf); @@ -3205,6 +3378,7 @@ static void __sched_fork(unsigned long c init_numa_balancing(clone_flags, p); #ifdef CONFIG_SMP p->wake_entry.u_flags = CSD_TYPE_TTWU; + p->migration_pending = NULL; #endif }