Received: by 2002:a25:683:0:0:0:0:0 with SMTP id 125csp210543ybg; Thu, 11 Jun 2020 22:33:10 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxKwWHZ5HDMeC3jSjIfssYvOjhiKVE2MRgIm1AjET3X8ziGGzyLtOb0dZk+jl2vc4N/lyZ0 X-Received: by 2002:a17:906:d9cd:: with SMTP id qk13mr11444962ejb.268.1591939990057; Thu, 11 Jun 2020 22:33:10 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1591939990; cv=none; d=google.com; s=arc-20160816; b=0LspkWZQjrIbqoOrYzewG8LzqrKk+9R8ZCLmVJH+Z+745GkCGKVlujAphPB4hZj24u jImA/1e31T3QQpRs6cICtlS5SRoAVrHihHWfRjriV4TiPqPjev3HtR39n3b/57thm3C7 UtrvbTWb6i/PcvSi7tylB2pqE+dJ4jhX8Hj7V/GZrFDwGIEuEzks0mu/R/bPgu9gi5Sj hciaksX5sIH9tIu6IMrDNxhpSBFQRzO/xxDPaITm35eyoL71hdeNtReCKkaE2yxw8L8N NFF7Zf72o8RN3fzPeFK2XS9ed57ge9VZB4IX4MLSlk6OUrezVnpFJubMs29A+aNUTvqm vHPw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=KCr6uVTbZe9Ngxw9xCnn03oEBKTkc8e8YljBlALdDDw=; b=Ovpo2jHFK14uJzrS7MmkpHoAK0tm1dP+kJ7AjaC/n8pVAQoRdoYpzyQkMVlgmDEYHA ZRjYz5/GU6XA2icC7xQ09aQwFDVdt88OGpEUBrElqveCXkKfjtG/UU+0EXRqcPDBMHCf MubhgFWP/Y8aqbbALntG3+1b8ZN2/Gzldr9vEdjr61EkCXQY97SEC+ktu5JIoXBxvRvv AOMDLLf/yZKYf1TYeHFVeDcfVAt+GM29Yh3tlsWGK7WVcnD3CyrSzRTvH/UI4zaRF0bI C+QVa9DArstwSIywb+yAPJLy3YqRS2mmlAouiyhhLVMjGhSGTbk0DmfqmJNm1DY/i0T8 tmuA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=xblTg13J; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id o6si2869827edr.8.2020.06.11.22.32.45; Thu, 11 Jun 2020 22:33:10 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=xblTg13J; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726488AbgFLFa6 (ORCPT + 99 others); Fri, 12 Jun 2020 01:30:58 -0400 Received: from mail.kernel.org ([198.145.29.99]:41078 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726332AbgFLFa6 (ORCPT ); Fri, 12 Jun 2020 01:30:58 -0400 Received: from mail-wr1-f49.google.com (mail-wr1-f49.google.com [209.85.221.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 4FCD020836 for ; Fri, 12 Jun 2020 05:30:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1591939856; bh=cYq4vqKHsesR3s5ajhm5S9PWZAD1eP2uMwW9ixQuVqI=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=xblTg13JjgfCAJDp0JWxFiUwq2MhD1Edu5hG8hQcU06tqhiqf4EC4vs9sS3CtJKv+ /Ogn6Q88hVnCbJKSUs0PR2pQQUsvJDwNwboL+2MxTW21jQDjl4AiUFG0LgoTt0gn9z 5FQDrzQUBlbCv+aigaSPTqaDzAttk5ct4E7JykZs= Received: by mail-wr1-f49.google.com with SMTP id x14so8405721wrp.2 for ; Thu, 11 Jun 2020 22:30:56 -0700 (PDT) X-Gm-Message-State: AOAM531tXiwZ9NtZh2wycD3mtwRQaZgwpqGsqbGYW8Gzl6sVHCYRe6H3 kCI5iMKTAd4gL5wByG0jLHYaiwOqTXOYCpYvvS1PNA== X-Received: by 2002:adf:ea11:: with SMTP id q17mr12758018wrm.75.1591939854768; Thu, 11 Jun 2020 22:30:54 -0700 (PDT) MIME-Version: 1.0 References: <20200611235305.GA32342@paulmck-ThinkPad-P72> In-Reply-To: <20200611235305.GA32342@paulmck-ThinkPad-P72> From: Andy Lutomirski Date: Thu, 11 Jun 2020 22:30:42 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH RFC] x86/entry: Ask RCU if it needs rcu_irq_{enter,exit}() To: "Paul E. McKenney" Cc: LKML , rcu@vger.kernel.org, Thomas Gleixner , Andrew Lutomirski , X86 ML , Frederic Weisbecker , Steven Rostedt , Joel Fernandes , Mathieu Desnoyers , Will Deacon , Peter Zijlstra Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 11, 2020 at 4:53 PM Paul E. McKenney wrote: > > RCU needs to detect when one if its interrupt handlers interrupted an idle > state, where an idle state is either the idle loop itself or nohz_full > userspace execution. When a CPU has been interrupted from one of these > idle states, RCU can report a quiescent state, helping the current grace > period to come to an end. > > However, there are optimized kernel-entry paths where handlers that > would normally be executed in interrupt context are instead executed > directly from the base process context, but with interrupts disabled. > When such a directly-executed handler is followed by softirq processing > (which re-enables interrupts), it is possible that one of RCU's interrupt > handlers will interrupt this softirq processing. Why is the softirq processing special? ISTM the basic sequence of events is: idle, RCU not watching, IRQs on because we're using a lousy idle mechanism that requires IRQs on. IRQ hits. The ones you're describing are the DEFINE_IDTENTRY_SYSVEC_SIMPLE ones, but I'm not sure why it would matter. IRQ does idtentry_enter_cond_rcu(). idtentry_entry_cond_rcu() sees __rcu_is_watching() return true and does not do rcu_irq_enter(). softirq processing turns on IRQs, a new IRQ hits, and kaboom. But this is a bit of a strange explanation. The SYSVEC_SIMPLE paths don't seem to do irq_exit_rcu(), so there's no softirq processing unless I missed something. But more importantly, what are we doing making it through idtentry_enter_cond_rcu() with rcu not watching at the end? If we hit the idle loop, surely __rcu_is_watching() should have returned false, right? So I added a little warning to detect the case where your patch actually changes something (I don't believe for a second that this warning should be committed -- it's just to get the stack trace): diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index f0b657097a2a..77330b96d8e5 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -592,6 +592,8 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs) return true; } + WARN_ON_ONCE(system_state == SYSTEM_RUNNING && is_idle_task(current)); + /* * If RCU is watching then RCU only wants to check * whether it needs to restart the tick in NOHZ And I get this: [ 1.054927] ------------[ cut here ]------------ [ 1.054928] WARNING: CPU: 1 PID: 0 at arch/x86/entry/common.c:595 idtentry_enter_cond_rcu+0x86/0xa0 [ 1.054929] Modules linked in: [ 1.054930] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.7.0+ #145 [ 1.054931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 [ 1.054931] RIP: 0010:idtentry_enter_cond_rcu+0x86/0xa0 [ 1.054933] Code: 7c 24 08 e8 3c 3e 00 00 e8 57 a5 6e ff e8 d2 49 00 00 84 c0 74 19 90 31 c0 5b c3 65 48 8b 04 25 c0 7e 01 00 f6 40 24 02 74 99 <0f> 0b 90 eb 94 0f 0b 90 eb e2 0f 0b 90 eb 9d 66 66 2e 0f 1f 84 00 [ 1.054933] RSP: 0000:ffffc9000006bc10 EFLAGS: 00010002 [ 1.054934] RAX: ffff88807d529800 RBX: ffffc9000006bc38 RCX: ffffffff81c01327 [ 1.054935] RDX: 0000000000000000 RSI: ffffffff81c00f49 RDI: ffffc9000006bc38 [ 1.054935] RBP: ffffc9000006bc38 R08: 0000000000000000 R09: 0000000000000000 [ 1.054936] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 1.054936] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1.054937] FS: 0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 [ 1.054937] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.054938] CR2: 00000000ffffffff CR3: 0000000002620001 CR4: 0000000000360ee0 [ 1.054938] Call Trace: [ 1.054939] sysvec_call_function_single+0xa/0xd0 [ 1.054939] asm_sysvec_call_function_single+0x31/0x40 [ 1.054940] RIP: 0010:__do_softirq+0xaa/0x437 [ 1.054941] Code: 24 2c 0a 00 00 00 48 89 44 24 10 48 89 44 24 20 44 89 34 24 65 66 c7 05 22 b3 22 7e 00 00 e8 ad dc 40 ff fb 66 0f 1f 44 00 00 ff ff ff ff 48 c7 c6 00 51 60 82 0f bc 04 24 83 c0 01 49 89 f7 [ 1.054941] RSP: 0000:ffffc9000006bd18 EFLAGS: 00000202 [ 1.054942] RAX: 0000000000001a6e RBX: ffff88807d529800 RCX: 0000000000000000 [ 1.054943] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81e000a3 [ 1.054944] RBP: ffffc9000006bdc8 R08: 0000000000000000 R09: 0000000000000000 [ 1.054944] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 1.054945] R13: 0000000000000000 R14: 0000000000000082 R15: 0000000000000000 [ 1.054945] ? __do_softirq+0xa3/0x437 [ 1.054945] ? __do_softirq+0xaa/0x437 [ 1.054946] ? __do_softirq+0xa3/0x437 [ 1.054946] ? update_ts_time_stats+0x4c/0x70 [ 1.054947] irq_exit_rcu+0xaf/0xc0 [ 1.054947] sysvec_apic_timer_interrupt+0x51/0xd0 [ 1.054948] asm_sysvec_apic_timer_interrupt+0x31/0x40 [ 1.054948] RIP: 0010:native_safe_halt+0xe/0x10 [ 1.054949] Code: 8b 00 a8 08 74 80 eb c2 cc cc cc cc e9 07 00 00 00 0f 00 2d 46 bf 4e 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 36 bf 4e 00 fb f4 cc 41 54 55 53 48 83 ec 10 e8 43 b1 66 ff 65 8b 2d bc 07 4e 7e [ 1.054950] RSP: 0000:ffffc9000006bea0 EFLAGS: 00000206 [ 1.054951] RAX: 0000000000001a69 RBX: 0000000000000001 RCX: 0000000000000000 [ 1.054951] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81b30789 [ 1.054952] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 [ 1.054952] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88807d529800 [ 1.054953] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88807d529800 [ 1.054953] ? default_idle+0x19/0x160 [ 1.054954] ? native_safe_halt+0xe/0x10 [ 1.054954] default_idle+0x1e/0x160 [ 1.054955] do_idle+0x1f5/0x270 [ 1.054955] ? _raw_spin_unlock_irqrestore+0x3a/0x50 [ 1.054956] ? trace_hardirqs_on+0x20/0xf0 [ 1.054956] cpu_startup_entry+0x14/0x20 [ 1.054957] start_secondary+0x147/0x170 [ 1.054957] secondary_startup_64+0xa4/0xb0 This is saying we were idle and __rcu_is_watching() correctly returned false. We got sysvec_apic_timer_interrupt(), which did rcu_irq_enter() and then turned on IRQs for softirq processing. Then we got sysvec_call_function_single() inside that, and sysvec_call_function_single() noticed that RCU was already watching and did *not* call rcu_irq_enter(). And, if sysvec_call_function_single() does rcu_is_cpu_rrupt_from_idle(), it will return true. So the issue is that RCU thinks that, since it > +static __always_inline bool rcu_needs_irq_enter(void) > +{ > + return !IS_ENABLED(CONFIG_TINY_RCU) && > + (context_tracking_enabled_cpu(smp_processor_id()) || is_idle_task(current)); > +} x86 shouldn't need this context tracking check -- we won't even call this if we came from user mode, and we make sure we never run with IRQs on when we're in CONTEXT_USER. I think my preference would be to fix this with something more like tglx's patch but with an explanation: diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index f0b657097a2a..93fd9d6fe033 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -571,7 +571,7 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs) return false; } - if (!__rcu_is_watching()) { + if (!__rcu_is_watching() || is_idle_task(current)) { /* * If RCU is not watching then the same careful * sequence vs. lockdep and tracing is required @@ -579,6 +579,18 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs) * * This only happens for IRQs that hit the idle * loop, i.e. if idle is not using MWAIT. + * + * We also do this in nested IRQs in the idie task + * (e.g. we get an IRQ, we run softirqs with IRQs on after + * processing the IRQ, and we get another IRQ inside the + * softirq.) This is because RCU wants to be able to tell + * whether it's running in an IRQ context directly inside + * idle (and hence interrupted a quiescent state) or + * interrupted kernel code that was coincidentally in the + * idle task (and hence the kernel was not quiescent). RCU + * does this by counting rcu_irq_enter() levels. This won't + * prevent scheduling in an otherwise schedulable context + * because the idle task can never schedule. */ WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF)); lockdep_hardirqs_off(CALLER_ADDR0); Alternatively, perhaps rcu_is_cpu_rrupt_from_idle(void) could do something like: if (softirq_count()) return false; if we believe that the only way this could happen was due to a softirq.