From: "Eric W. Biederman" <ebiederm@xmission.com>
To: linux-kernel@vger.kernel.org
Cc: linux-api@vger.kernel.org, Linus Torvalds, Oleg Nesterov, Ingo Molnar, Thomas Gleixner, Kees Cook, Roland McGrath, Al Viro, David Howells, "Michael Kerrisk (man-pages)", "Eric W. Biederman"
Date: Tue, 6 Jun 2017 14:03:20 -0500
Subject: [PATCH 08/26] exit: Make the runqueue rcu safe
Message-Id: <20170606190338.28347-8-ebiederm@xmission.com>
In-Reply-To: <20170606190338.28347-1-ebiederm@xmission.com>
References: <877f0pym71.fsf@xmission.com> <20170606190338.28347-1-ebiederm@xmission.com>
X-Mailer: git-send-email 2.10.1

Add an rcu_usage counter to task_struct and use it to share the
delayed rcu put logic of release_task with finish_task_switch.  This
guarantees that an rcu grace period will elapse before usage drops to
zero for any task that was on the run queue, making it safe to
unconditionally call get_task_struct in an rcu critical section for
any task on the run queue.  This guarantee in turn allows the fair
scheduler to use ordinary rcu primitives to access tasks on the run
queue, and makes the magic functions task_rcu_dereference and
try_get_task_struct completely unnecessary.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
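Not part of the patch itself: a minimal sketch of the access pattern
the rcu_usage guarantee enables.  The helper name pin_task and the
rcu-published pointer argument are hypothetical stand-ins (dst_rq->curr
in the fair.c hunk below is the real user); it mirrors the removed
try_get_task_struct, but needs only a plain rcu_dereference.

	/*
	 * Illustrative only: because rcu_usage keeps a reference until
	 * an rcu grace period after the task leaves the run queue,
	 * usage cannot reach zero inside the read-side critical
	 * section, so a plain get_task_struct() is always safe here.
	 */
	static struct task_struct *pin_task(struct task_struct __rcu **ptask)
	{
		struct task_struct *task;

		rcu_read_lock();
		task = rcu_dereference(*ptask);
		if (task)
			get_task_struct(task);	/* cannot race to zero */
		rcu_read_unlock();

		return task;
	}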
Biederman" --- include/linux/sched.h | 1 + include/linux/sched/task.h | 4 +-- kernel/exit.c | 83 ++++------------------------------------------ kernel/fork.c | 3 +- kernel/sched/core.c | 2 +- kernel/sched/fair.c | 2 +- 6 files changed, 12 insertions(+), 83 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 2b69fc650201..461ecd20731c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -492,6 +492,7 @@ struct task_struct { volatile long state; void *stack; atomic_t usage; + atomic_t rcu_usage; /* Per task flags (PF_*), defined further below: */ unsigned int flags; unsigned int ptrace; diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index a978d7189cfd..dc4a4f4c566b 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -94,9 +94,7 @@ static inline void put_task_struct(struct task_struct *t) __put_task_struct(t); } -struct task_struct *task_rcu_dereference(struct task_struct **ptask); -struct task_struct *try_get_task_struct(struct task_struct **ptask); - +extern void rcu_put_task_struct(struct task_struct *tsk); #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT extern int arch_task_struct_size __read_mostly; diff --git a/kernel/exit.c b/kernel/exit.c index c3de7ace243c..625e57f1bb5c 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -179,6 +179,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp) put_task_struct(tsk); } +void rcu_put_task_struct(struct task_struct *tsk) +{ + if (atomic_dec_and_test(&tsk->rcu_usage)) + call_rcu(&tsk->rcu, delayed_put_task_struct); +} void release_task(struct task_struct *p) { @@ -218,76 +223,13 @@ void release_task(struct task_struct *p) write_unlock_irq(&tasklist_lock); release_thread(p); - call_rcu(&p->rcu, delayed_put_task_struct); + rcu_put_task_struct(p); p = leader; if (unlikely(zap_leader)) goto repeat; } -/* - * Note that if this function returns a valid task_struct pointer (!NULL) - * task->usage must remain >0 for the duration of the RCU critical section. - */ -struct task_struct *task_rcu_dereference(struct task_struct **ptask) -{ - struct sighand_struct *sighand; - struct task_struct *task; - - /* - * We need to verify that release_task() was not called and thus - * delayed_put_task_struct() can't run and drop the last reference - * before rcu_read_unlock(). We check task->sighand != NULL, - * but we can read the already freed and reused memory. - */ -retry: - task = rcu_dereference(*ptask); - if (!task) - return NULL; - - probe_kernel_address(&task->sighand, sighand); - - /* - * Pairs with atomic_dec_and_test() in put_task_struct(). If this task - * was already freed we can not miss the preceding update of this - * pointer. - */ - smp_rmb(); - if (unlikely(task != READ_ONCE(*ptask))) - goto retry; - - /* - * We've re-checked that "task == *ptask", now we have two different - * cases: - * - * 1. This is actually the same task/task_struct. In this case - * sighand != NULL tells us it is still alive. - * - * 2. This is another task which got the same memory for task_struct. - * We can't know this of course, and we can not trust - * sighand != NULL. - * - * In this case we actually return a random value, but this is - * correct. - * - * If we return NULL - we can pretend that we actually noticed that - * *ptask was updated when the previous task has exited. Or pretend - * that probe_slab_address(&sighand) reads NULL. - * - * If we return the new task (because sighand is not NULL for any - * reason) - this is fine too. 
This (new) task can't go away before - * another gp pass. - * - * And note: We could even eliminate the false positive if re-read - * task->sighand once again to avoid the falsely NULL. But this case - * is very unlikely so we don't care. - */ - if (!sighand) - return NULL; - - return task; -} - void rcuwait_wake_up(struct rcuwait *w) { struct task_struct *task; @@ -317,19 +259,6 @@ void rcuwait_wake_up(struct rcuwait *w) rcu_read_unlock(); } -struct task_struct *try_get_task_struct(struct task_struct **ptask) -{ - struct task_struct *task; - - rcu_read_lock(); - task = task_rcu_dereference(ptask); - if (task) - get_task_struct(task); - rcu_read_unlock(); - - return task; -} - /* * Determine if a process group is "orphaned", according to the POSIX * definition in 2.2.2.52. Orphaned process groups are not to be affected diff --git a/kernel/fork.c b/kernel/fork.c index aa1076c5e4a9..1fe837e8c38e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -567,7 +567,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) * One for us, one for whoever does the "release_task()" (usually * parent) */ - atomic_set(&tsk->usage, 2); + atomic_set(&tsk->rcu_usage, 2); + atomic_set(&tsk->usage, 1); /* For rcu_usage */ #ifdef CONFIG_BLK_DEV_IO_TRACE tsk->btrace_seq = 0; #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 803c3bc274c4..1fccfd397cab 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2762,7 +2762,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) /* Task is done with its stack. */ put_task_stack(prev); - put_task_struct(prev); + rcu_put_task_struct(prev); } tick_nohz_task_switch(); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d71109321841..5c0a1e1cc0f6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1527,7 +1527,7 @@ static void task_numa_compare(struct task_numa_env *env, int dist = env->dist; rcu_read_lock(); - cur = task_rcu_dereference(&dst_rq->curr); + cur = rcu_dereference(dst_rq->curr); if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur))) cur = NULL; -- 2.10.1
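
For readers tracing the counter flow the patch sets up in
dup_task_struct (rcu_usage = 2, usage = 1): below is a standalone
userspace model, not kernel code.  The names mirror the patch; the rcu
grace period is collapsed into an explicit deferred-callback step
(run_pending_rcu_callbacks), which is an assumption of the model.

	#include <stdatomic.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct task_struct {
		atomic_int usage;	/* direct references */
		atomic_int rcu_usage;	/* run queue + release_task references */
	};

	static struct task_struct *deferred;	/* models the call_rcu queue */

	static void __put_task_struct(struct task_struct *t)
	{
		printf("freeing task\n");
		free(t);
	}

	static void put_task_struct(struct task_struct *t)
	{
		/* fetch_sub returns the old value; old == 1 means now zero */
		if (atomic_fetch_sub(&t->usage, 1) == 1)
			__put_task_struct(t);
	}

	static void rcu_put_task_struct(struct task_struct *t)
	{
		/* models call_rcu(&t->rcu, delayed_put_task_struct) */
		if (atomic_fetch_sub(&t->rcu_usage, 1) == 1)
			deferred = t;
	}

	/* models delayed_put_task_struct() running after a grace period */
	static void run_pending_rcu_callbacks(void)
	{
		if (deferred) {
			put_task_struct(deferred);
			deferred = NULL;
		}
	}

	int main(void)
	{
		struct task_struct *t = malloc(sizeof(*t));

		/* dup_task_struct(): two rcu references, one usage reference */
		atomic_init(&t->rcu_usage, 2);
		atomic_init(&t->usage, 1);

		rcu_put_task_struct(t);	/* finish_task_switch() after exit */
		rcu_put_task_struct(t);	/* release_task() */

		/* Until here, any rcu reader could still have pinned the task
		 * with get_task_struct(); the final put is delayed past the
		 * grace period, so usage stays above zero for readers. */
		run_pending_rcu_callbacks();
		return 0;
	}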