2014-11-21 21:26:19

by Andy Lutomirski

Subject: [PATCH v4 0/5] x86: Rework IST interrupts

Following Linus' feedback, this series has grown. When I say "IST
interrupt" below, I'm excluding NMIs. They are their own thing. This
series does *not* change NMI handling.

This adds new functions ist_enter and ist_exit to handle entry to and
exit from IST handlers. This is, I think, a bugfix for RCU handling,
and I've cc'd some RCU people on that patch. It will also catch bugs
involving scheduling on the IST stack.

This series then changes the entry code to switch to the normal stack if
an IST interrupt came from userspace and wasn't a double-fault. This
simplifies the exit logic.

Building on that, I add new functions ist_begin_non_atomic and
ist_end_non_atomic. They nest inside ist_enter and ist_exit, and they
allow scheduling if the right conditions are met and BUG if not.
They're meant to be used in do_machine_check to handle userspace memory
failures.
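
Roughly, the intended call pattern is the following (an illustrative
sketch only; the memory-failure hookup itself is not part of this
series):

	prev_state = ist_enter(regs);		/* atomic from here on */
	...
	if (user_mode_vm(regs)) {
		ist_begin_non_atomic(regs);	/* BUGs if scheduling isn't safe */
		local_irq_enable();
		/* may sleep here, e.g. to call memory_failure() */
		local_irq_disable();
		ist_end_non_atomic();
	}
	...
	ist_exit(regs, prev_state);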

NB: Tony has seen odd behavior when stress-testing injected machine
checks with earlier versions of this series applied. I suspect that
it's a bug in something else, possibly his BIOS. Bugs in this series
shouldn't be ruled out, though.

Changes from v3:
- Reworked everything except the asm.

Changes from v2:
- double_fault is no longer affected.
- Other changes, but those are obsolete now.

Changes from RFC/v1: I forgot. Sorry.

Andy Lutomirski (5):
uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
x86, traps: Track entry into and exit from IST context
x86, entry: Switch stacks on a paranoid entry from userspace
x86: Clean up current_stack_pointer
x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic

Documentation/x86/entry_64.txt | 18 ++++--
Documentation/x86/x86_64/kernel-stacks | 8 ++-
arch/x86/include/asm/thread_info.h | 13 +++-
arch/x86/include/asm/traps.h | 6 ++
arch/x86/kernel/cpu/mcheck/mce.c | 5 ++
arch/x86/kernel/cpu/mcheck/p5.c | 6 ++
arch/x86/kernel/cpu/mcheck/winchip.c | 5 ++
arch/x86/kernel/entry_64.S | 86 ++++++++++++++------------
arch/x86/kernel/irq_32.c | 13 +---
arch/x86/kernel/traps.c | 110 +++++++++++++++++++++++++--------
kernel/events/uprobes.c | 1 -
11 files changed, 183 insertions(+), 88 deletions(-)

--
1.9.3


2014-11-21 21:26:24

by Andy Lutomirski

Subject: [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME

x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set
but not on non-paranoid returns. I suspect that this is a mistake
and that the code only works because int3 is paranoid.

Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a
workaround for the x86 bug. With that bug fixed, we can remove
_TIF_NOTIFY_RESUME from the uprobes code.

Acked-by: Srikar Dronamraju <[email protected]>
Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/thread_info.h | 2 +-
kernel/events/uprobes.c | 1 -
2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 854053889d4d..547e344a6dc6 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -141,7 +141,7 @@ struct thread_info {
/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
(_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME | \
- _TIF_USER_RETURN_NOTIFY)
+ _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1d0af8a2c646..ed8f2cde34c5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1640,7 +1640,6 @@ bool uprobe_deny_signal(void)
if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
utask->state = UTASK_SSTEP_TRAPPED;
set_tsk_thread_flag(t, TIF_UPROBE);
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}
}

--
1.9.3

2014-11-21 21:26:26

by Andy Lutomirski

Subject: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

We currently pretend that IST context is like standard exception
context, but this is incorrect. IST entries from userspace are like
standard exceptions except that they use per-cpu stacks, so they are
atomic. IST entries from kernel space are like NMIs from RCU's
perspective -- they are not quiescent states even if they
interrupted the kernel during a quiescent state.

Add and use ist_enter and ist_exit to track IST context. Even
though x86_32 has no IST stacks, we track these interrupts the same
way.

This fixes two issues:

- Scheduling from an IST interrupt handler will now warn. It would
previously appear to work as long as we got lucky and nothing
overwrote the stack frame. (I don't know of any bugs in this area
that would trigger the warning, but it's good to be on the safe
side.)

- RCU handling in IST context was dangerous. As far as I know,
only machine checks were likely to trigger this, but it's good to
be on the safe side.

Note that the machine check handlers appear to have been missing
any context tracking at all before this patch.

Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Frédéric Weisbecker <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/traps.h | 4 +++
arch/x86/kernel/cpu/mcheck/mce.c | 5 ++++
arch/x86/kernel/cpu/mcheck/p5.c | 6 +++++
arch/x86/kernel/cpu/mcheck/winchip.c | 5 ++++
arch/x86/kernel/traps.c | 49 ++++++++++++++++++++++++++++++------
5 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index bc8352e7010a..eb16a61bfd06 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -1,6 +1,7 @@
#ifndef _ASM_X86_TRAPS_H
#define _ASM_X86_TRAPS_H

+#include <linux/context_tracking_state.h>
#include <linux/kprobes.h>

#include <asm/debugreg.h>
@@ -109,6 +110,9 @@ asmlinkage void smp_thermal_interrupt(void);
asmlinkage void mce_threshold_interrupt(void);
#endif

+extern enum ctx_state ist_enter(struct pt_regs *regs);
+extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
+
/* Interrupts/Exceptions */
enum {
X86_TRAP_DE = 0, /* 0, Divide-by-zero */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 61a9668cebfd..b72509d77337 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -43,6 +43,7 @@
#include <linux/export.h>

#include <asm/processor.h>
+#include <asm/traps.h>
#include <asm/mce.h>
#include <asm/msr.h>

@@ -1016,6 +1017,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
{
struct mca_config *cfg = &mca_cfg;
struct mce m, *final;
+ enum ctx_state prev_state;
int i;
int worst = 0;
int severity;
@@ -1038,6 +1040,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
char *msg = "Unknown";

+ prev_state = ist_enter(regs);
+
this_cpu_inc(mce_exception_count);

if (!cfg->banks)
@@ -1168,6 +1172,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
out:
sync_core();
+ ist_exit(regs, prev_state);
}
EXPORT_SYMBOL_GPL(do_machine_check);

diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mcheck/p5.c
index a3042989398c..ec2663a708e4 100644
--- a/arch/x86/kernel/cpu/mcheck/p5.c
+++ b/arch/x86/kernel/cpu/mcheck/p5.c
@@ -8,6 +8,7 @@
#include <linux/smp.h>

#include <asm/processor.h>
+#include <asm/traps.h>
#include <asm/mce.h>
#include <asm/msr.h>

@@ -17,8 +18,11 @@ int mce_p5_enabled __read_mostly;
/* Machine check handler for Pentium class Intel CPUs: */
static void pentium_machine_check(struct pt_regs *regs, long error_code)
{
+ enum ctx_state prev_state;
u32 loaddr, hi, lotype;

+ prev_state = ist_enter(regs);
+
rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);

@@ -33,6 +37,8 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)
}

add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
+ ist_exit(regs, prev_state);
}

/* Set up machine check reporting for processors with Intel style MCE: */
diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mcheck/winchip.c
index 7dc5564d0cdf..bd5d46a32210 100644
--- a/arch/x86/kernel/cpu/mcheck/winchip.c
+++ b/arch/x86/kernel/cpu/mcheck/winchip.c
@@ -7,14 +7,19 @@
#include <linux/types.h>

#include <asm/processor.h>
+#include <asm/traps.h>
#include <asm/mce.h>
#include <asm/msr.h>

/* Machine check handler for WinChip C6: */
static void winchip_machine_check(struct pt_regs *regs, long error_code)
{
+ enum ctx_state prev_state = ist_enter(regs);
+
printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
+ ist_exit(regs, prev_state);
}

/* Set up machine check reporting on the Winchip C6 series */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 0d0e922fafc1..f5c4b8813774 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
preempt_count_dec();
}

+enum ctx_state ist_enter(struct pt_regs *regs)
+{
+ /*
+ * We are atomic because we're on the IST stack (or we're on x86_32,
+ * in which case we still shouldn't schedule).
+ */
+ preempt_count_add(HARDIRQ_OFFSET);
+
+ if (user_mode_vm(regs)) {
+ /* Other than that, we're just an exception. */
+ return exception_enter();
+ } else {
+ /*
+ * We might have interrupted pretty much anything. In
+ * fact, if we're a machine check, we can even interrupt
+ * NMI processing. We don't want in_nmi() to return true,
+ * but we need to notify RCU.
+ */
+ rcu_nmi_enter();
+ return IN_KERNEL; /* the value is irrelevant. */
+ }
+}
+
+void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
+{
+ preempt_count_sub(HARDIRQ_OFFSET);
+
+ if (user_mode_vm(regs))
+ return exception_exit(prev_state);
+ else
+ rcu_nmi_exit();
+}
+
static nokprobe_inline int
do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
struct pt_regs *regs, long error_code)
@@ -244,14 +277,14 @@ dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
{
enum ctx_state prev_state;

- prev_state = exception_enter();
+ prev_state = ist_enter(regs);
if (notify_die(DIE_TRAP, "stack segment", regs, error_code,
X86_TRAP_SS, SIGBUS) != NOTIFY_STOP) {
preempt_conditional_sti(regs);
do_trap(X86_TRAP_SS, SIGBUS, "stack segment", regs, error_code, NULL);
preempt_conditional_cli(regs);
}
- exception_exit(prev_state);
+ ist_exit(regs, prev_state);
}

dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
@@ -259,8 +292,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
static const char str[] = "double fault";
struct task_struct *tsk = current;

- exception_enter();
- /* Return not checked because double check cannot be ignored */
+ ist_enter(regs);
+ /* Return not checked because double fault cannot be ignored */
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

tsk->thread.error_code = error_code;
@@ -343,7 +376,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
if (poke_int3_handler(regs))
return;

- prev_state = exception_enter();
+ prev_state = ist_enter(regs);
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
SIGTRAP) == NOTIFY_STOP)
@@ -369,7 +402,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
preempt_conditional_cli(regs);
debug_stack_usage_dec();
exit:
- exception_exit(prev_state);
+ ist_exit(regs, prev_state);
}
NOKPROBE_SYMBOL(do_int3);

@@ -433,7 +466,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
unsigned long dr6;
int si_code;

- prev_state = exception_enter();
+ prev_state = ist_enter(regs);

get_debugreg(dr6, 6);

@@ -508,7 +541,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
debug_stack_usage_dec();

exit:
- exception_exit(prev_state);
+ ist_exit(regs, prev_state);
}
NOKPROBE_SYMBOL(do_debug);

--
1.9.3

2014-11-21 21:26:33

by Andy Lutomirski

Subject: [PATCH v4 3/5] x86, entry: Switch stacks on a paranoid entry from userspace

This causes all non-NMI, non-double-fault kernel entries from
userspace to run on the normal kernel stack. Double-fault is
exempt to minimize confusion if we double-fault directly from
userspace due to a bad kernel stack.

This is, surprisingly, simpler and shorter than the current code. It
removes the IMO rather frightening paranoid_userspace path, and it
makes sync_regs much simpler.

There is no risk of stack overflow due to this change -- the kernel
stack that we switch to is empty.

This will also enable us to create non-atomic sections within
machine checks from userspace, which will simplify memory failure
handling. It will also allow the upcoming fsgsbase code to be
simplified, because it doesn't need to worry about usergs when
scheduling in paranoid_exit, as that code no longer exists.

Cc: Oleg Nesterov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
Documentation/x86/entry_64.txt | 18 ++++---
Documentation/x86/x86_64/kernel-stacks | 8 ++--
arch/x86/kernel/entry_64.S | 86 ++++++++++++++++++----------------
arch/x86/kernel/traps.c | 23 ++-------
4 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
index bc7226ef5055..8e53217e6d40 100644
--- a/Documentation/x86/entry_64.txt
+++ b/Documentation/x86/entry_64.txt
@@ -75,9 +75,6 @@ The expensive (paranoid) way is to read back the MSR_GS_BASE value
xorl %ebx,%ebx
1: ret

-and the whole paranoid non-paranoid macro complexity is about whether
-to suffer that RDMSR cost.
-
If we are at an interrupt or user-trap/gate-alike boundary then we can
use the faster check: the stack will be a reliable indicator of
whether SWAPGS was already done: if we see that we are a secondary
@@ -90,6 +87,15 @@ which might have triggered right after a normal entry wrote CS to the
stack but before we executed SWAPGS, then the only safe way to check
for GS is the slower method: the RDMSR.

-So we try only to mark those entry methods 'paranoid' that absolutely
-need the more expensive check for the GS base - and we generate all
-'normal' entry points with the regular (faster) entry macros.
+Therefore, super-atomic entries (except NMI, which is handled separately)
+must use idtentry with paranoid=1 to handle gsbase correctly. This
+triggers three main behavior changes:
+
+ - Interrupt entry will use the slower gsbase check.
+ - Interrupt entry from user mode will switch off the IST stack.
+ - Interrupt exit to kernel mode will not attempt to reschedule.
+
+We try to only use IST entries and the paranoid entry code for vectors
+that absolutely need the more expensive check for the GS base - and we
+generate all 'normal' entry points with the regular (faster) paranoid=0
+variant.
diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks
index a01eec5d1d0b..e3c8a49d1a2f 100644
--- a/Documentation/x86/x86_64/kernel-stacks
+++ b/Documentation/x86/x86_64/kernel-stacks
@@ -40,9 +40,11 @@ An IST is selected by a non-zero value in the IST field of an
interrupt-gate descriptor. When an interrupt occurs and the hardware
loads such a descriptor, the hardware automatically sets the new stack
pointer based on the IST value, then invokes the interrupt handler. If
-software wants to allow nested IST interrupts then the handler must
-adjust the IST values on entry to and exit from the interrupt handler.
-(This is occasionally done, e.g. for debug exceptions.)
+the interrupt came from user mode, then the interrupt handler prologue
+will switch back to the per-thread stack. If software wants to allow
+nested IST interrupts then the handler must adjust the IST values on
+entry to and exit from the interrupt handler. (This is occasionally
+done, e.g. for debug exceptions.)

Events with different IST codes (i.e. with different stacks) can be
nested. For example, a debug interrupt can safely be interrupted by an
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb03fb3..de24b2eac6b2 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1064,6 +1064,11 @@ ENTRY(\sym)
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15

.if \paranoid
+ .if \paranoid == 1
+ CFI_REMEMBER_STATE
+ testl $3, CS(%rsp) /* If coming from userspace, switch */
+ jnz 1f /* stacks. */
+ .endif
call save_paranoid
.else
call error_entry
@@ -1104,6 +1109,36 @@ ENTRY(\sym)
jmp error_exit /* %ebx: no swapgs flag */
.endif

+ .if \paranoid == 1
+ CFI_RESTORE_STATE
+ /*
+ * Paranoid entry from userspace. Switch stacks and treat it
+ * as a normal entry. This means that paranoid handlers
+ * run in real process context if user_mode(regs).
+ */
+1:
+ call error_entry
+
+ DEFAULT_FRAME 0
+
+ movq %rsp,%rdi /* pt_regs pointer */
+ call sync_regs
+ movq %rax,%rsp /* switch stack */
+
+ movq %rsp,%rdi /* pt_regs pointer */
+
+ .if \has_error_code
+ movq ORIG_RAX(%rsp),%rsi /* get error code */
+ movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */
+ .else
+ xorl %esi,%esi /* no error code */
+ .endif
+
+ call \do_sym
+
+ jmp error_exit /* %ebx: no swapgs flag */
+ .endif
+
CFI_ENDPROC
END(\sym)
.endm
@@ -1124,7 +1159,7 @@ idtentry overflow do_overflow has_error_code=0
idtentry bounds do_bounds has_error_code=0
idtentry invalid_op do_invalid_op has_error_code=0
idtentry device_not_available do_device_not_available has_error_code=0
-idtentry double_fault __do_double_fault has_error_code=1 paranoid=1
+idtentry double_fault __do_double_fault has_error_code=1 paranoid=2
idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0
idtentry invalid_TSS do_invalid_TSS has_error_code=1
idtentry segment_not_present do_segment_not_present has_error_code=1
@@ -1305,16 +1340,14 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
#endif

/*
- * "Paranoid" exit path from exception stack.
- * Paranoid because this is used by NMIs and cannot take
- * any kernel state for granted.
- * We don't do kernel preemption checks here, because only
- * NMI should be common and it does not enable IRQs and
- * cannot get reschedule ticks.
+ * "Paranoid" exit path from exception stack. This is invoked
+ * only on return from non-NMI IST interrupts that came
+ * from kernel space.
*
- * "trace" is 0 for the NMI handler only, because irq-tracing
- * is fundamentally NMI-unsafe. (we cannot change the soft and
- * hard flags at once, atomically)
+ * We may be returning to very strange contexts (e.g. very early
+ * in syscall entry), so checking for preemption here would
+ * be complicated. Fortunately, there's no good reason
+ * to try to handle preemption here.
*/

/* ebx: no swapgs flag */
@@ -1324,43 +1357,14 @@ ENTRY(paranoid_exit)
TRACE_IRQS_OFF_DEBUG
testl %ebx,%ebx /* swapgs needed? */
jnz paranoid_restore
- testl $3,CS(%rsp)
- jnz paranoid_userspace
-paranoid_swapgs:
TRACE_IRQS_IRETQ 0
SWAPGS_UNSAFE_STACK
RESTORE_ALL 8
- jmp irq_return
+ INTERRUPT_RETURN
paranoid_restore:
TRACE_IRQS_IRETQ_DEBUG 0
RESTORE_ALL 8
- jmp irq_return
-paranoid_userspace:
- GET_THREAD_INFO(%rcx)
- movl TI_flags(%rcx),%ebx
- andl $_TIF_WORK_MASK,%ebx
- jz paranoid_swapgs
- movq %rsp,%rdi /* &pt_regs */
- call sync_regs
- movq %rax,%rsp /* switch stack for scheduling */
- testl $_TIF_NEED_RESCHED,%ebx
- jnz paranoid_schedule
- movl %ebx,%edx /* arg3: thread flags */
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- xorl %esi,%esi /* arg2: oldset */
- movq %rsp,%rdi /* arg1: &pt_regs */
- call do_notify_resume
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp paranoid_userspace
-paranoid_schedule:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_ANY)
- SCHEDULE_USER
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_OFF
- jmp paranoid_userspace
+ INTERRUPT_RETURN
CFI_ENDPROC
END(paranoid_exit)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f5c4b8813774..6a02760df7b4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -408,27 +408,14 @@ NOKPROBE_SYMBOL(do_int3);

#ifdef CONFIG_X86_64
/*
- * Help handler running on IST stack to switch back to user stack
- * for scheduling or signal handling. The actual stack switch is done in
- * entry.S
+ * Help handler running on IST stack to switch off the IST stack if the
+ * interrupted code was in user mode. The actual stack switch is done in
+ * entry_64.S
*/
asmlinkage __visible struct pt_regs *sync_regs(struct pt_regs *eregs)
{
- struct pt_regs *regs = eregs;
- /* Did already sync */
- if (eregs == (struct pt_regs *)eregs->sp)
- ;
- /* Exception from user space */
- else if (user_mode(eregs))
- regs = task_pt_regs(current);
- /*
- * Exception from kernel and interrupts are enabled. Move to
- * kernel process stack.
- */
- else if (eregs->flags & X86_EFLAGS_IF)
- regs = (struct pt_regs *)(eregs->sp -= sizeof(struct pt_regs));
- if (eregs != regs)
- *regs = *eregs;
+ struct pt_regs *regs = task_pt_regs(current);
+ *regs = *eregs;
return regs;
}
NOKPROBE_SYMBOL(sync_regs);
--
1.9.3

2014-11-21 21:26:35

by Andy Lutomirski

Subject: [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic

In some IST handlers, if the interrupt came from user mode,
we can safely enable preemption. Add helpers to do it safely.

This is intended to be used by the memory failure code in
do_machine_check.
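
For illustration, the eventual do_machine_check hookup might look
roughly like this (a hypothetical sketch, not part of this series;
the severity test and the memory_failure() call are assumptions
about the later memory-failure patch):

	/* still between ist_enter() and ist_exit() */
	if (user_mode_vm(regs) && worst == MCE_AR_SEVERITY) {
		ist_begin_non_atomic(regs);	/* interrupted userspace: may sleep */
		local_irq_enable();
		if (memory_failure(m.addr >> PAGE_SHIFT, MCE_VECTOR, 0) < 0)
			force_sig(SIGBUS, current);
		local_irq_disable();
		ist_end_non_atomic();
	}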

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/traps.h | 2 ++
arch/x86/kernel/traps.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index eb16a61bfd06..04ba537fc721 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -112,6 +112,8 @@ asmlinkage void mce_threshold_interrupt(void);

extern enum ctx_state ist_enter(struct pt_regs *regs);
extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
+extern void ist_begin_non_atomic(struct pt_regs *regs);
+extern void ist_end_non_atomic(void);

/* Interrupts/Exceptions */
enum {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6a02760df7b4..2b5f2e038e3f 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -140,6 +140,44 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
rcu_nmi_exit();
}

+/**
+ * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
+ * @regs: regs passed to the IST exception handler
+ *
+ * IST exception handlers normally cannot schedule. As a special
+ * exception, if the exception interrupted userspace code (i.e.
+ * user_mode_vm(regs) would return true) and the exception was not
+ * a double fault, it can be safe to schedule. ist_begin_non_atomic()
+ * begins a non-atomic section within an ist_enter()/ist_exit() region.
+ * Callers are responsible for enabling interrupts themselves inside
+ * the non-atomic section, and callers must call ist_end_non_atomic()
+ * before ist_exit().
+ */
+void ist_begin_non_atomic(struct pt_regs *regs)
+{
+ BUG_ON(!user_mode_vm(regs));
+
+ /*
+ * Sanity check: we need to be on the normal thread stack. This
+ * will catch asm bugs and any attempt to use ist_begin_non_atomic()
+ * from double_fault.
+ */
+ BUG_ON(((current_stack_pointer() ^ this_cpu_read_stable(kernel_stack))
+ & ~(THREAD_SIZE - 1)) != 0);
+
+ preempt_count_sub(HARDIRQ_OFFSET);
+}
+
+/**
+ * ist_end_non_atomic() - end a non-atomic section in an IST exception
+ *
+ * Ends a non-atomic section started with ist_begin_non_atomic().
+ */
+void ist_end_non_atomic(void)
+{
+ preempt_count_add(HARDIRQ_OFFSET);
+}
+
static nokprobe_inline int
do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
struct pt_regs *regs, long error_code)
--
1.9.3

2014-11-21 21:27:18

by Andy Lutomirski

Subject: [PATCH v4 4/5] x86: Clean up current_stack_pointer

There's no good reason for it to be a macro, and x86_64 will want to
use it, so it should be in a header.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/thread_info.h | 11 +++++++++++
arch/x86/kernel/irq_32.c | 13 +++----------
2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 547e344a6dc6..8b13b0fbda8e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -170,6 +170,17 @@ static inline struct thread_info *current_thread_info(void)
return ti;
}

+static inline unsigned long current_stack_pointer(void)
+{
+ unsigned long sp;
+#ifdef CONFIG_X86_64
+ asm("mov %%rsp,%0" : "=g" (sp));
+#else
+ asm("mov %%esp,%0" : "=g" (sp));
+#endif
+ return sp;
+}
+
#else /* !__ASSEMBLY__ */

/* how to get the thread information struct from ASM */
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 63ce838e5a54..28d28f5eb8f4 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -69,16 +69,9 @@ static void call_on_stack(void *func, void *stack)
: "memory", "cc", "edx", "ecx", "eax");
}

-/* how to get the current stack pointer from C */
-#define current_stack_pointer ({ \
- unsigned long sp; \
- asm("mov %%esp,%0" : "=g" (sp)); \
- sp; \
-})
-
static inline void *current_stack(void)
{
- return (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+ return (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
}

static inline int
@@ -103,7 +96,7 @@ execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq)

/* Save the next esp at the bottom of the stack */
prev_esp = (u32 *)irqstk;
- *prev_esp = current_stack_pointer;
+ *prev_esp = current_stack_pointer();

if (unlikely(overflow))
call_on_stack(print_stack_overflow, isp);
@@ -156,7 +149,7 @@ void do_softirq_own_stack(void)

/* Push the previous esp onto the stack */
prev_esp = (u32 *)irqstk;
- *prev_esp = current_stack_pointer;
+ *prev_esp = current_stack_pointer();

call_on_stack(__do_softirq, isp);
}
--
1.9.3

2014-11-21 21:33:13

by Andy Lutomirski

Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
> We currently pretend that IST context is like standard exception
> context, but this is incorrect. IST entries from userspace are like
> standard exceptions except that they use per-cpu stacks, so they are
> atomic. IST entries from kernel space are like NMIs from RCU's
> perspective -- they are not quiescent states even if they
> interrupted the kernel during a quiescent state.
>
> Add and use ist_enter and ist_exit to track IST context. Even
> though x86_32 has no IST stacks, we track these interrupts the same
> way.

I should add:

I have no idea why RCU read-side critical sections are safe inside
__do_page_fault today. It's guarded by exception_enter(), but that
doesn't do anything if context tracking is off, and context tracking
is usually off. What am I missing here?
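
(For reference, exception_enter() is roughly the following; this is a
simplified sketch from memory, not the verbatim source:

	static inline enum ctx_state exception_enter(void)
	{
		enum ctx_state prev_ctx;

		if (!context_tracking_is_enabled())
			return 0;	/* does nothing when tracking is off */

		prev_ctx = this_cpu_read(context_tracking.state);
		context_tracking_user_exit();	/* tells RCU we entered the kernel */

		return prev_ctx;
	}

i.e. when context tracking is off it never notifies RCU at all.)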

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-21 22:00:19

by Paul E. McKenney

Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 01:26:08PM -0800, Andy Lutomirski wrote:
> We currently pretend that IST context is like standard exception
> context, but this is incorrect. IST entries from userspace are like
> standard exceptions except that they use per-cpu stacks, so they are
> atomic. IST entries from kernel space are like NMIs from RCU's
> perspective -- they are not quiescent states even if they
> interrupted the kernel during a quiescent state.
>
> Add and use ist_enter and ist_exit to track IST context. Even
> though x86_32 has no IST stacks, we track these interrupts the same
> way.
>
> This fixes two issues:
>
> - Scheduling from an IST interrupt handler will now warn. It would
> previously appear to work as long as we got lucky and nothing
> overwrote the stack frame. (I don't know of any bugs in this
> that would trigger the warning, but it's good to be on the safe
> side.)
>
> - RCU handling in IST context was dangerous. As far as I know,
> only machine checks were likely to trigger this, but it's good to
> be on the safe side.
>
> Note that the machine check handlers appears to have been missing
> any context tracking at all before this patch.
>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Josh Triplett <[email protected]>
> Cc: Frédéric Weisbecker <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

Reviewed-by: Paul E. McKenney <[email protected]>

Just out of curiosity... Can an NMI occur in IST context? If it can,
I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
nested calls.

Thanx, Paul


2014-11-21 22:07:13

by Paul E. McKenney

Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
> > We currently pretend that IST context is like standard exception
> > context, but this is incorrect. IST entries from userspace are like
> > standard exceptions except that they use per-cpu stacks, so they are
> > atomic. IST entries from kernel space are like NMIs from RCU's
> > perspective -- they are not quiescent states even if they
> > interrupted the kernel during a quiescent state.
> >
> > Add and use ist_enter and ist_exit to track IST context. Even
> > though x86_32 has no IST stacks, we track these interrupts the same
> > way.
>
> I should add:
>
> I have no idea why RCU read-side critical sections are safe inside
> __do_page_fault today. It's guarded by exception_enter(), but that
> doesn't do anything if context tracking is off, and context tracking
> is usually off. What am I missing here?

Ah! There are three cases:

1. Context tracking is off on a non-idle CPU. In this case, RCU is
still paying attention to CPUs running in both userspace and in
the kernel. So if a page fault happens, RCU will be set up to
notice any RCU read-side critical sections.

2. Context tracking is on on a non-idle CPU. In this case, RCU
might well be ignoring userspace execution: NO_HZ_FULL and
all that. However, as you pointed out, in this case the
context-tracking code lets RCU know that we have entered the
kernel, which means that RCU will again be paying attention to
RCU read-side critical sections.

3. The CPU is idle. In this case, RCU is ignoring the CPU, so
if we take a page fault when context tracking is off, life
will be hard. But the kernel is not supposed to take page
faults in the idle loop, so this is not a problem.

Make sense?

Thanx, Paul


2014-11-21 22:19:42

by Andy Lutomirski

Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
<[email protected]> wrote:
> On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
>> > We currently pretend that IST context is like standard exception
>> > context, but this is incorrect. IST entries from userspace are like
>> > standard exceptions except that they use per-cpu stacks, so they are
>> > atomic. IST entries from kernel space are like NMIs from RCU's
>> > perspective -- they are not quiescent states even if they
>> > interrupted the kernel during a quiescent state.
>> >
>> > Add and use ist_enter and ist_exit to track IST context. Even
>> > though x86_32 has no IST stacks, we track these interrupts the same
>> > way.
>>
>> I should add:
>>
>> I have no idea why RCU read-side critical sections are safe inside
>> __do_page_fault today. It's guarded by exception_enter(), but that
>> doesn't do anything if context tracking is off, and context tracking
>> is usually off. What am I missing here?
>
> Ah! There are three cases:
>
> 1. Context tracking is off on a non-idle CPU. In this case, RCU is
> still paying attention to CPUs running in both userspace and in
> the kernel. So if a page fault happens, RCU will be set up to
> notice any RCU read-side critical sections.
>
> 2. Context tracking is on on a non-idle CPU. In this case, RCU
> might well be ignoring userspace execution: NO_HZ_FULL and
> all that. However, as you pointed out, in this case the
> context-tracking code lets RCU know that we have entered the
> kernel, which means that RCU will again be paying attention to
> RCU read-side critical sections.
>
> 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
> if we take a page fault when context tracking is off, life
> will be hard. But the kernel is not supposed to take page
> faults in the idle loop, so this is not a problem.
>

I guess so, as long as there are really no page faults in the idle loop.

There are, however, machine checks in the idle loop, and maybe kprobes
(haven't checked), so I think this patch might fix real bugs.

> Just out of curiosity... Can an NMI occur in IST context? If it can,
> I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> nested calls.

Yes, and vice versa. That code looked like it handled nesting
correctly, but I wasn't entirely sure.

Also, just to make sure: are we okay if rcu_nmi_enter() is called
before exception_enter if context tracking is on and we came directly
from userspace?

--Andy

2014-11-21 22:20:26

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 02:07:04PM -0800, Paul E. McKenney wrote:
> On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> > On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
> > > We currently pretend that IST context is like standard exception
> > > context, but this is incorrect. IST entries from userspace are like
> > > standard exceptions except that they use per-cpu stacks, so they are
> > > atomic. IST entries from kernel space are like NMIs from RCU's
> > > perspective -- they are not quiescent states even if they
> > > interrupted the kernel during a quiescent state.
> > >
> > > Add and use ist_enter and ist_exit to track IST context. Even
> > > though x86_32 has no IST stacks, we track these interrupts the same
> > > way.
> >
> > I should add:
> >
> > I have no idea why RCU read-side critical sections are safe inside
> > __do_page_fault today. It's guarded by exception_enter(), but that
> > doesn't do anything if context tracking is off, and context tracking
> > is usually off. What am I missing here?
>
> Ah! There are three cases:
>
> 1. Context tracking is off on a non-idle CPU. In this case, RCU is
> still paying attention to CPUs running in both userspace and in
> the kernel. So if a page fault happens, RCU will be set up to
> notice any RCU read-side critical sections.
>
> 2. Context tracking is on on a non-idle CPU. In this case, RCU
> might well be ignoring userspace execution: NO_HZ_FULL and
> all that. However, as you pointed out, in this case the
> context-tracking code lets RCU know that we have entered the
> kernel, which means that RCU will again be paying attention to
> RCU read-side critical sections.
>
> 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
> if we take a page fault when context tracking is off, life
> will be hard. But the kernel is not supposed to take page
> faults in the idle loop, so this is not a problem.
>
> Make sense?

To zoom out the picture for Andy: context tracking is never used in 99%
of workloads. It's only used for NO_HZ_FULL. RCU needs the tick in order
to poll for RCU users, but when we run in userspace RCU isn't used and
thus doesn't need the tick, so we can stop it. Context tracking is there
to tell RCU about CPUs crossing the user/kernel boundary.

These hooks also account for the cputime spent in userspace and in the
kernel in the absence of a tick.
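
As a very rough illustration of that (a user-space toy model only -- none
of the names below are the kernel's, and the real hooks do far more), the
effect of those boundary crossings on RCU can be pictured as a per-CPU
flag that also decides whether the tick is still needed for RCU:

#include <stdbool.h>
#include <stdio.h>

/* Toy per-CPU flag: does RCU have to watch this CPU right now? */
static bool rcu_watching = true;

static void model_user_enter(void)      /* modeled hook: kernel -> user */
{
        rcu_watching = false;           /* RCU may ignore the CPU in userspace */
}

static void model_user_exit(void)       /* modeled hook: user -> kernel */
{
        rcu_watching = true;            /* back in the kernel: RCU must watch */
}

static bool tick_needed_for_rcu(void)
{
        return rcu_watching;            /* only the RCU aspect is modeled here */
}

int main(void)
{
        model_user_exit();
        printf("in kernel, tick needed for RCU: %d\n", tick_needed_for_rcu());
        model_user_enter();
        printf("in userspace, tick needed for RCU: %d\n", tick_needed_for_rcu());
        return 0;
}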

2014-11-21 22:55:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
> >> > We currently pretend that IST context is like standard exception
> >> > context, but this is incorrect. IST entries from userspace are like
> >> > standard exceptions except that they use per-cpu stacks, so they are
> >> > atomic. IST entries from kernel space are like NMIs from RCU's
> >> > perspective -- they are not quiescent states even if they
> >> > interrupted the kernel during a quiescent state.
> >> >
> >> > Add and use ist_enter and ist_exit to track IST context. Even
> >> > though x86_32 has no IST stacks, we track these interrupts the same
> >> > way.
> >>
> >> I should add:
> >>
> >> I have no idea why RCU read-side critical sections are safe inside
> >> __do_page_fault today. It's guarded by exception_enter(), but that
> >> doesn't do anything if context tracking is off, and context tracking
> >> is usually off. What am I missing here?
> >
> > Ah! There are three cases:
> >
> > 1. Context tracking is off on a non-idle CPU. In this case, RCU is
> > still paying attention to CPUs running in both userspace and in
> > the kernel. So if a page fault happens, RCU will be set up to
> > notice any RCU read-side critical sections.
> >
> > 2. Context tracking is on on a non-idle CPU. In this case, RCU
> > might well be ignoring userspace execution: NO_HZ_FULL and
> > all that. However, as you pointed out, in this case the
> > context-tracking code lets RCU know that we have entered the
> > kernel, which means that RCU will again be paying attention to
> > RCU read-side critical sections.
> >
> > 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
> > if we take a page fault when context tracking is off, life
> > will be hard. But the kernel is not supposed to take page
> > faults in the idle loop, so this is not a problem.
>
> I guess so, as long as there are really no page faults in the idle loop.

As far as I know, there are not. If there are, someone needs to let
me know! ;-)

> There are, however, machine checks in the idle loop, and maybe kprobes
> (haven't checked), so I think this patch might fix real bugs.

If you can get ISTs from the idle loop, then the patch is needed.

> > Just out of curiosity... Can an NMI occur in IST context? If it can,
> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> > nested calls.
>
> Yes, and vice versa. That code looked like it handled nesting
> correctly, but I wasn't entirely sure.

It currently does not; please see the patch below. Are you able to test
nesting? It would be really cool if you could do so -- I have no
way to test this patch.

> Also, just to make sure: are we okay if rcu_nmi_enter() is called
> before exception_enter if context tracking is on and we came directly
> from userspace?

If I understand correctly, this will result in context tracking invoking
rcu_user_enter(), which will result in the rcu_dynticks counter having an
odd value. In that case, rcu_nmi_enter() will notice that RCU is already
paying attention to this CPU via its check of (atomic_read(&rdtp->dynticks)
& 0x1), and will thus just return. The matching rcu_nmi_exit() will
notice that the nesting count is zero, and will also just return.

Thus, everything works in that case.

In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
would see that RCU is not paying attention to this CPU and that the
NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
It would then atomically increment rdtp->dynticks, forcing RCU to start
paying attention to this CPU. The matching rcu_nmi_exit() will see
that the nesting count was non-zero, but became zero when decremented.
This will cause rcu_nmi_exit() to atomically increment rdtp->dynticks,
which will tell RCU to stop paying attention to this CPU.
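
To make the parity rule above concrete, here is a minimal user-space model
(illustrative only: the names mirror the kernel's fields but none of this
is kernel code, and the exit side is just the behavior described above,
not the actual rcu_nmi_exit()):

#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int dynticks;             /* odd value: RCU is watching this CPU */
static int dynticks_nmi_nesting;

static void model_nmi_enter(void)
{
        if (dynticks_nmi_nesting == 0 && (atomic_load(&dynticks) & 0x1))
                return;                         /* already watching: no-op */
        dynticks_nmi_nesting++;
        atomic_fetch_add(&dynticks, 1);         /* even -> odd: start watching */
}

static void model_nmi_exit(void)
{
        if (dynticks_nmi_nesting == 0)
                return;                         /* matching enter was a no-op */
        if (--dynticks_nmi_nesting == 0)
                atomic_fetch_add(&dynticks, 1); /* odd -> even: stop watching */
}

int main(void)
{
        /* "NMI" while RCU is already paying attention (dynticks odd). */
        atomic_store(&dynticks, 1);
        model_nmi_enter();
        assert(atomic_load(&dynticks) == 1);    /* untouched */
        model_nmi_exit();

        /* "NMI" from the idle loop (dynticks even). */
        atomic_store(&dynticks, 0);
        model_nmi_enter();
        assert(atomic_load(&dynticks) & 0x1);   /* RCU now watching */
        model_nmi_exit();
        assert(!(atomic_load(&dynticks) & 0x1)); /* and ignoring again */

        printf("parity model OK\n");
        return 0;
}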

Thanx, Paul

------------------------------------------------------------------------

rcu: Make rcu_nmi_enter() handle nesting

Andy Lutomirski is introducing ISTs into x86, which from RCU's
viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter()
and rcu_nmi_exit() must now correctly handle nesting. As luck would
have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
This patch therefore makes rcu_nmi_enter() handle nesting.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8749f43f3f05..875421aff6e3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
         if (rdtp->dynticks_nmi_nesting == 0 &&
             (atomic_read(&rdtp->dynticks) & 0x1))
                 return;
-        rdtp->dynticks_nmi_nesting++;
+        if (rdtp->dynticks_nmi_nesting++ != 0)
+                return; /* Nested NMI/IST/whatever. */
         smp_mb__before_atomic(); /* Force delay from prior write. */
         atomic_inc(&rdtp->dynticks);
         /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
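
In case it helps to see the intended effect of this hunk in isolation,
here is a single-threaded user-space rendering of the bookkeeping
(illustrative only: no memory ordering, no concurrent NMIs, and the exit
side is my reading of the matching logic rather than the kernel's actual
rcu_nmi_exit()). Only the outermost enter/exit pair flips ->dynticks:

#include <assert.h>
#include <stdio.h>

static int dynticks;                    /* odd value: RCU watching */
static int dynticks_nmi_nesting;

static void enter(void)
{
        if (dynticks_nmi_nesting == 0 && (dynticks & 0x1))
                return;                         /* already watching */
        if (dynticks_nmi_nesting++ != 0)
                return;                         /* nested NMI/IST/whatever */
        dynticks++;                             /* outermost: start watching */
}

static void exit_(void)
{
        if (dynticks_nmi_nesting == 0 || --dynticks_nmi_nesting != 0)
                return;
        dynticks++;                             /* outermost: stop watching */
}

int main(void)
{
        /* Nested entries from the idle loop: dynticks must change only once. */
        enter();                                /* outermost */
        assert(dynticks == 1 && dynticks_nmi_nesting == 1);
        enter();                                /* nested */
        assert(dynticks == 1 && dynticks_nmi_nesting == 2);
        exit_();
        assert(dynticks == 1 && dynticks_nmi_nesting == 1);
        exit_();
        assert(dynticks == 2 && dynticks_nmi_nesting == 0);
        printf("nesting model OK\n");
        return 0;
}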

2014-11-21 23:07:13

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
<[email protected]> wrote:
> On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
>> >> > We currently pretend that IST context is like standard exception
>> >> > context, but this is incorrect. IST entries from userspace are like
>> >> > standard exceptions except that they use per-cpu stacks, so they are
>> >> > atomic. IST entries from kernel space are like NMIs from RCU's
>> >> > perspective -- they are not quiescent states even if they
>> >> > interrupted the kernel during a quiescent state.
>> >> >
>> >> > Add and use ist_enter and ist_exit to track IST context. Even
>> >> > though x86_32 has no IST stacks, we track these interrupts the same
>> >> > way.
>> >>
>> >> I should add:
>> >>
>> >> I have no idea why RCU read-side critical sections are safe inside
>> >> __do_page_fault today. It's guarded by exception_enter(), but that
>> >> doesn't do anything if context tracking is off, and context tracking
>> >> is usually off. What am I missing here?
>> >
>> > Ah! There are three cases:
>> >
>> > 1. Context tracking is off on a non-idle CPU. In this case, RCU is
>> > still paying attention to CPUs running in both userspace and in
>> > the kernel. So if a page fault happens, RCU will be set up to
>> > notice any RCU read-side critical sections.
>> >
>> > 2. Context tracking is on on a non-idle CPU. In this case, RCU
>> > might well be ignoring userspace execution: NO_HZ_FULL and
>> > all that. However, as you pointed out, in this case the
>> > context-tracking code lets RCU know that we have entered the
>> > kernel, which means that RCU will again be paying attention to
>> > RCU read-side critical sections.
>> >
>> > 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
>> > if we take a page fault when context tracking is off, life
>> > will be hard. But the kernel is not supposed to take page
>> > faults in the idle loop, so this is not a problem.
>>
>> I guess so, as long as there are really no page faults in the idle loop.
>
> As far as I know, there are not. If there are, someone needs to let
> me know! ;-)
>
>> There are, however, machine checks in the idle loop, and maybe kprobes
>> (haven't checked), so I think this patch might fix real bugs.
>
> If you can get ISTs from the idle loop, then the patch is needed.
>
>> > Just out of curiosity... Can an NMI occur in IST context? If it can,
>> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
>> > nested calls.
>>
>> Yes, and vice versa. That code looked like it handled nesting
>> correctly, but I wasn't entirely sure.
>
> It currently does not, please see below patch. Are you able to test
> nesting? It would be really cool if you could do so -- I have no
> way to test this patch.

I can try. It's sort of easy -- I'll put an int3 into do_nmi and add
a fixup to avoid crashing.

What should I look for? Should I try to force full nohz on and assert
something? I don't really know how to make full nohz work.

>
>> Also, just to make sure: are we okay if rcu_nmi_enter() is called
>> before exception_enter if context tracking is on and we came directly
>> from userspace?
>
> If I understand correctly, this will result in context tracking invoking
> rcu_user_enter(), which will result in the rcu_dynticks counter having an
> odd value. In that case, rcu_nmi_enter() will notice that RCU is already
> paying attention to this CPU via its check of atomic_read(&rdtp->dynticks)
> & 0x1), and will thus just return. The matching rcu_nmi_exit() will
> notice that the nesting count is zero, and will also just return.
>
> Thus, everything works in that case.
>
> In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
> would see that RCU is not paying attention to this CPU and that the
> NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
> It would then atomically increment rtdp->dynticks, forcing RCU to start
> paying attention to this CPU. The matching rcu_nmi_exit() will see
> that the nesting count was non-zero, but became zero when decremented.
> This will cause rcu_nmi_exit() to atomically increment rtdp->dynticks,
> which will tell RCU to stop paying attention to this CPU.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> rcu: Make rcu_nmi_enter() handle nesting
>
> Andy Lutomirski is introducing ISTs into x86, which from RCU's
> viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter()
> and rcu_nmi_exit() must now correctly handle nesting. As luck would
> have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
> This patch therefore makes rcu_nmi_enter() handle nesting.

Thanks. Should I add this to v5 of my series?

--Andy

>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8749f43f3f05..875421aff6e3 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
> if (rdtp->dynticks_nmi_nesting == 0 &&
> (atomic_read(&rdtp->dynticks) & 0x1))
> return;
> - rdtp->dynticks_nmi_nesting++;
> + if (rdtp->dynticks_nmi_nesting++ != 0)
> + return; /* Nested NMI/IST/whatever. */
> smp_mb__before_atomic(); /* Force delay from prior write. */
> atomic_inc(&rdtp->dynticks);
> /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-21 23:38:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
> >> <[email protected]> wrote:
> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
> >> >> > We currently pretend that IST context is like standard exception
> >> >> > context, but this is incorrect. IST entries from userspace are like
> >> >> > standard exceptions except that they use per-cpu stacks, so they are
> >> >> > atomic. IST entries from kernel space are like NMIs from RCU's
> >> >> > perspective -- they are not quiescent states even if they
> >> >> > interrupted the kernel during a quiescent state.
> >> >> >
> >> >> > Add and use ist_enter and ist_exit to track IST context. Even
> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
> >> >> > way.
> >> >>
> >> >> I should add:
> >> >>
> >> >> I have no idea why RCU read-side critical sections are safe inside
> >> >> __do_page_fault today. It's guarded by exception_enter(), but that
> >> >> doesn't do anything if context tracking is off, and context tracking
> >> >> is usually off. What am I missing here?
> >> >
> >> > Ah! There are three cases:
> >> >
> >> > 1. Context tracking is off on a non-idle CPU. In this case, RCU is
> >> > still paying attention to CPUs running in both userspace and in
> >> > the kernel. So if a page fault happens, RCU will be set up to
> >> > notice any RCU read-side critical sections.
> >> >
> >> > 2. Context tracking is on on a non-idle CPU. In this case, RCU
> >> > might well be ignoring userspace execution: NO_HZ_FULL and
> >> > all that. However, as you pointed out, in this case the
> >> > context-tracking code lets RCU know that we have entered the
> >> > kernel, which means that RCU will again be paying attention to
> >> > RCU read-side critical sections.
> >> >
> >> > 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
> >> > if we take a page fault when context tracking is off, life
> >> > will be hard. But the kernel is not supposed to take page
> >> > faults in the idle loop, so this is not a problem.
> >>
> >> I guess so, as long as there are really no page faults in the idle loop.
> >
> > As far as I know, there are not. If there are, someone needs to let
> > me know! ;-)
> >
> >> There are, however, machine checks in the idle loop, and maybe kprobes
> >> (haven't checked), so I think this patch might fix real bugs.
> >
> > If you can get ISTs from the idle loop, then the patch is needed.
> >
> >> > Just out of curiosity... Can an NMI occur in IST context? If it can,
> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> >> > nested calls.
> >>
> >> Yes, and vice versa. That code looked like it handled nesting
> >> correctly, but I wasn't entirely sure.
> >
> > It currently does not, please see below patch. Are you able to test
> > nesting? It would be really cool if you could do so -- I have no
> > way to test this patch.
>
> I can try. It's sort of easy -- I'll put an int3 into do_nmi and add
> a fixup to avoid crashing.
>
> What should I look for? Should I try to force full nohz on and assert
> something? I don't really know how to make full nohz work.

You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
rcu_nmi_exit() to fire.

Thanx, Paul

> >> Also, just to make sure: are we okay if rcu_nmi_enter() is called
> >> before exception_enter if context tracking is on and we came directly
> >> from userspace?
> >
> > If I understand correctly, this will result in context tracking invoking
> > rcu_user_enter(), which will result in the rcu_dynticks counter having an
> > odd value. In that case, rcu_nmi_enter() will notice that RCU is already
> > paying attention to this CPU via its check of atomic_read(&rdtp->dynticks)
> > & 0x1), and will thus just return. The matching rcu_nmi_exit() will
> > notice that the nesting count is zero, and will also just return.
> >
> > Thus, everything works in that case.
> >
> > In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
> > would see that RCU is not paying attention to this CPU and that the
> > NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
> > It would then atomically increment rtdp->dynticks, forcing RCU to start
> > paying attention to this CPU. The matching rcu_nmi_exit() will see
> > that the nesting count was non-zero, but became zero when decremented.
> > This will cause rcu_nmi_exit() to atomically increment rtdp->dynticks,
> > which will tell RCU to stop paying attention to this CPU.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > rcu: Make rcu_nmi_enter() handle nesting
> >
> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
> > viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter()
> > and rcu_nmi_exit() must now correctly handle nesting. As luck would
> > have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
> > This patch therefore makes rcu_nmi_enter() handle nesting.
>
> Thanks. Should I add this to v5 of my series?
>
> --Andy
>
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8749f43f3f05..875421aff6e3 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
> > if (rdtp->dynticks_nmi_nesting == 0 &&
> > (atomic_read(&rdtp->dynticks) & 0x1))
> > return;
> > - rdtp->dynticks_nmi_nesting++;
> > + if (rdtp->dynticks_nmi_nesting++ != 0)
> > + return; /* Nested NMI/IST/whatever. */
> > smp_mb__before_atomic(); /* Force delay from prior write. */
> > atomic_inc(&rdtp->dynticks);
> > /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
>

2014-11-22 02:00:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
<[email protected]> wrote:
> On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
>> >> <[email protected]> wrote:
>> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
>> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
>> >> >> > We currently pretend that IST context is like standard exception
>> >> >> > context, but this is incorrect. IST entries from userspace are like
>> >> >> > standard exceptions except that they use per-cpu stacks, so they are
>> >> >> > atomic. IST entries from kernel space are like NMIs from RCU's
>> >> >> > perspective -- they are not quiescent states even if they
>> >> >> > interrupted the kernel during a quiescent state.
>> >> >> >
>> >> >> > Add and use ist_enter and ist_exit to track IST context. Even
>> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
>> >> >> > way.
>> >> >>
>> >> >> I should add:
>> >> >>
>> >> >> I have no idea why RCU read-side critical sections are safe inside
>> >> >> __do_page_fault today. It's guarded by exception_enter(), but that
>> >> >> doesn't do anything if context tracking is off, and context tracking
>> >> >> is usually off. What am I missing here?
>> >> >
>> >> > Ah! There are three cases:
>> >> >
>> >> > 1. Context tracking is off on a non-idle CPU. In this case, RCU is
>> >> > still paying attention to CPUs running in both userspace and in
>> >> > the kernel. So if a page fault happens, RCU will be set up to
>> >> > notice any RCU read-side critical sections.
>> >> >
>> >> > 2. Context tracking is on on a non-idle CPU. In this case, RCU
>> >> > might well be ignoring userspace execution: NO_HZ_FULL and
>> >> > all that. However, as you pointed out, in this case the
>> >> > context-tracking code lets RCU know that we have entered the
>> >> > kernel, which means that RCU will again be paying attention to
>> >> > RCU read-side critical sections.
>> >> >
>> >> > 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
>> >> > if we take a page fault when context tracking is off, life
>> >> > will be hard. But the kernel is not supposed to take page
>> >> > faults in the idle loop, so this is not a problem.
>> >>
>> >> I guess so, as long as there are really no page faults in the idle loop.
>> >
>> > As far as I know, there are not. If there are, someone needs to let
>> > me know! ;-)
>> >
>> >> There are, however, machine checks in the idle loop, and maybe kprobes
>> >> (haven't checked), so I think this patch might fix real bugs.
>> >
>> > If you can get ISTs from the idle loop, then the patch is needed.
>> >
>> >> > Just out of curiosity... Can an NMI occur in IST context? If it can,
>> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
>> >> > nested calls.
>> >>
>> >> Yes, and vice versa. That code looked like it handled nesting
>> >> correctly, but I wasn't entirely sure.
>> >
>> > It currently does not, please see below patch. Are you able to test
>> > nesting? It would be really cool if you could do so -- I have no
>> > way to test this patch.
>>
>> I can try. It's sort of easy -- I'll put an int3 into do_nmi and add
>> a fixup to avoid crashing.
>>
>> What should I look for? Should I try to force full nohz on and assert
>> something? I don't really know how to make full nohz work.
>
> You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
> rcu_nmi_exit() to fire.

No warning with or without your patch, maybe because all of those
returns skip the labels.

Also, an NMI can happen *during* rcu_nmi_enter or rcu_nmi_exit. Is
that okay? Should those dynticks_nmi_nesting++ things be local_inc
and local_dec_and_test?

That dynticks_nmi_nesting thing seems scary to me. Shouldn't the code
unconditionally increment dynticks_nmi_nesting in rcu_nmi_enter and
unconditionally decrement it in rcu_nmi_exit?

--Andy

>
> Thanx, Paul
>
>> >> Also, just to make sure: are we okay if rcu_nmi_enter() is called
>> >> before exception_enter if context tracking is on and we came directly
>> >> from userspace?
>> >
>> > If I understand correctly, this will result in context tracking invoking
>> > rcu_user_enter(), which will result in the rcu_dynticks counter having an
>> > odd value. In that case, rcu_nmi_enter() will notice that RCU is already
>> > paying attention to this CPU via its check of atomic_read(&rdtp->dynticks)
>> > & 0x1), and will thus just return. The matching rcu_nmi_exit() will
>> > notice that the nesting count is zero, and will also just return.
>> >
>> > Thus, everything works in that case.
>> >
>> > In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
>> > would see that RCU is not paying attention to this CPU and that the
>> > NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
>> > It would then atomically increment rtdp->dynticks, forcing RCU to start
>> > paying attention to this CPU. The matching rcu_nmi_exit() will see
>> > that the nesting count was non-zero, but became zero when decremented.
>> > This will cause rcu_nmi_exit() to atomically increment rtdp->dynticks,
>> > which will tell RCU to stop paying attention to this CPU.
>> >
>> > Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > rcu: Make rcu_nmi_enter() handle nesting
>> >
>> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
>> > viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter()
>> > and rcu_nmi_exit() must now correctly handle nesting. As luck would
>> > have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
>> > This patch therefore makes rcu_nmi_enter() handle nesting.
>>
>> Thanks. Should I add this to v5 of my series?
>>
>> --Andy
>>
>> >
>> > Signed-off-by: Paul E. McKenney <[email protected]>
>> >
>> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> > index 8749f43f3f05..875421aff6e3 100644
>> > --- a/kernel/rcu/tree.c
>> > +++ b/kernel/rcu/tree.c
>> > @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
>> > if (rdtp->dynticks_nmi_nesting == 0 &&
>> > (atomic_read(&rdtp->dynticks) & 0x1))
>> > return;
>> > - rdtp->dynticks_nmi_nesting++;
>> > + if (rdtp->dynticks_nmi_nesting++ != 0)
>> > + return; /* Nested NMI/IST/whatever. */
>> > smp_mb__before_atomic(); /* Force delay from prior write. */
>> > atomic_inc(&rdtp->dynticks);
>> > /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
>> >
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>>
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-22 04:21:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
> >> <[email protected]> wrote:
> >> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
> >> >> <[email protected]> wrote:
> >> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> >> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
> >> >> >> > We currently pretend that IST context is like standard exception
> >> >> >> > context, but this is incorrect. IST entries from userspace are like
> >> >> >> > standard exceptions except that they use per-cpu stacks, so they are
> >> >> >> > atomic. IST entries from kernel space are like NMIs from RCU's
> >> >> >> > perspective -- they are not quiescent states even if they
> >> >> >> > interrupted the kernel during a quiescent state.
> >> >> >> >
> >> >> >> > Add and use ist_enter and ist_exit to track IST context. Even
> >> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
> >> >> >> > way.
> >> >> >>
> >> >> >> I should add:
> >> >> >>
> >> >> >> I have no idea why RCU read-side critical sections are safe inside
> >> >> >> __do_page_fault today. It's guarded by exception_enter(), but that
> >> >> >> doesn't do anything if context tracking is off, and context tracking
> >> >> >> is usually off. What am I missing here?
> >> >> >
> >> >> > Ah! There are three cases:
> >> >> >
> >> >> > 1. Context tracking is off on a non-idle CPU. In this case, RCU is
> >> >> > still paying attention to CPUs running in both userspace and in
> >> >> > the kernel. So if a page fault happens, RCU will be set up to
> >> >> > notice any RCU read-side critical sections.
> >> >> >
> >> >> > 2. Context tracking is on on a non-idle CPU. In this case, RCU
> >> >> > might well be ignoring userspace execution: NO_HZ_FULL and
> >> >> > all that. However, as you pointed out, in this case the
> >> >> > context-tracking code lets RCU know that we have entered the
> >> >> > kernel, which means that RCU will again be paying attention to
> >> >> > RCU read-side critical sections.
> >> >> >
> >> >> > 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
> >> >> > if we take a page fault when context tracking is off, life
> >> >> > will be hard. But the kernel is not supposed to take page
> >> >> > faults in the idle loop, so this is not a problem.
> >> >>
> >> >> I guess so, as long as there are really no page faults in the idle loop.
> >> >
> >> > As far as I know, there are not. If there are, someone needs to let
> >> > me know! ;-)
> >> >
> >> >> There are, however, machine checks in the idle loop, and maybe kprobes
> >> >> (haven't checked), so I think this patch might fix real bugs.
> >> >
> >> > If you can get ISTs from the idle loop, then the patch is needed.
> >> >
> >> >> > Just out of curiosity... Can an NMI occur in IST context? If it can,
> >> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> >> >> > nested calls.
> >> >>
> >> >> Yes, and vice versa. That code looked like it handled nesting
> >> >> correctly, but I wasn't entirely sure.
> >> >
> >> > It currently does not, please see below patch. Are you able to test
> >> > nesting? It would be really cool if you could do so -- I have no
> >> > way to test this patch.
> >>
> >> I can try. It's sort of easy -- I'll put an int3 into do_nmi and add
> >> a fixup to avoid crashing.
> >>
> >> What should I look for? Should I try to force full nohz on and assert
> >> something? I don't really know how to make full nohz work.
> >
> > You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
> > rcu_nmi_exit() to fire.
>
> No warning with or without your patch, maybe because all of those
> returns skip the labels.

I will be guardedly optimistic and take this as a good sign. ;-)

> Also, an NMI can happen *during* rcu_nmi_enter or rcu_nmi_exit. Is
> that okay? Should those dynticks_nmi_nesting++ things be local_inc
> and local_dec_and_test?

Yep, it is OK during rcu_nmi_enter() or rcu_nmi_exit(). The nested
NMI will put the dynticks_nmi_nesting counter back where it was, so
no chance of confusion.

> That dynticks_nmi_nesting thing seems scary to me. Shouldn't the code
> unconditionally increment dynticks_nmi_nesting in rcu_nmi_enter and
> unconditionally decrement it in rcu_nmi_exit?

You might be able to get that to work, but the reason it is not done
that way is that we might get an NMI while not in dyntick-idle
state. In that case, it would be very bad to atomically increment
rcu_dynticks, because that would tell RCU to ignore the CPU while it
was in the NMI handler, which is the opposite of what we want.

But what did you have in mind?

Thanx, Paul

> --Andy
>
> >
> > Thanx, Paul
> >
> >> >> Also, just to make sure: are we okay if rcu_nmi_enter() is called
> >> >> before exception_enter if context tracking is on and we came directly
> >> >> from userspace?
> >> >
> >> > If I understand correctly, this will result in context tracking invoking
> >> > rcu_user_enter(), which will result in the rcu_dynticks counter having an
> >> > odd value. In that case, rcu_nmi_enter() will notice that RCU is already
> >> > paying attention to this CPU via its check of atomic_read(&rdtp->dynticks)
> >> > & 0x1), and will thus just return. The matching rcu_nmi_exit() will
> >> > notice that the nesting count is zero, and will also just return.
> >> >
> >> > Thus, everything works in that case.
> >> >
> >> > In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
> >> > would see that RCU is not paying attention to this CPU and that the
> >> > NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
> >> > It would then atomically increment rtdp->dynticks, forcing RCU to start
> >> > paying attention to this CPU. The matching rcu_nmi_exit() will see
> >> > that the nesting count was non-zero, but became zero when decremented.
> >> > This will cause rcu_nmi_exit() to atomically increment rtdp->dynticks,
> >> > which will tell RCU to stop paying attention to this CPU.
> >> >
> >> > Thanx, Paul
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > rcu: Make rcu_nmi_enter() handle nesting
> >> >
> >> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
> >> > viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter()
> >> > and rcu_nmi_exit() must now correctly handle nesting. As luck would
> >> > have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
> >> > This patch therefore makes rcu_nmi_enter() handle nesting.
> >>
> >> Thanks. Should I add this to v5 of my series?
> >>
> >> --Andy
> >>
> >> >
> >> > Signed-off-by: Paul E. McKenney <[email protected]>
> >> >
> >> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> > index 8749f43f3f05..875421aff6e3 100644
> >> > --- a/kernel/rcu/tree.c
> >> > +++ b/kernel/rcu/tree.c
> >> > @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
> >> > if (rdtp->dynticks_nmi_nesting == 0 &&
> >> > (atomic_read(&rdtp->dynticks) & 0x1))
> >> > return;
> >> > - rdtp->dynticks_nmi_nesting++;
> >> > + if (rdtp->dynticks_nmi_nesting++ != 0)
> >> > + return; /* Nested NMI/IST/whatever. */
> >> > smp_mb__before_atomic(); /* Force delay from prior write. */
> >> > atomic_inc(&rdtp->dynticks);
> >> > /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> >> >
> >>
> >>
> >>
> >> --
> >> Andy Lutomirski
> >> AMA Capital Management, LLC
> >>
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
>

2014-11-22 05:53:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
<[email protected]> wrote:
> On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
>> >> <[email protected]> wrote:
>> >> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
>> >> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
>> >> >> <[email protected]> wrote:
>> >> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
>> >> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
>> >> >> >> > We currently pretend that IST context is like standard exception
>> >> >> >> > context, but this is incorrect. IST entries from userspace are like
>> >> >> >> > standard exceptions except that they use per-cpu stacks, so they are
>> >> >> >> > atomic. IST entries from kernel space are like NMIs from RCU's
>> >> >> >> > perspective -- they are not quiescent states even if they
>> >> >> >> > interrupted the kernel during a quiescent state.
>> >> >> >> >
>> >> >> >> > Add and use ist_enter and ist_exit to track IST context. Even
>> >> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
>> >> >> >> > way.
>> >> >> >>
>> >> >> >> I should add:
>> >> >> >>
>> >> >> >> I have no idea why RCU read-side critical sections are safe inside
>> >> >> >> __do_page_fault today. It's guarded by exception_enter(), but that
>> >> >> >> doesn't do anything if context tracking is off, and context tracking
>> >> >> >> is usually off. What am I missing here?
>> >> >> >
>> >> >> > Ah! There are three cases:
>> >> >> >
>> >> >> > 1. Context tracking is off on a non-idle CPU. In this case, RCU is
>> >> >> > still paying attention to CPUs running in both userspace and in
>> >> >> > the kernel. So if a page fault happens, RCU will be set up to
>> >> >> > notice any RCU read-side critical sections.
>> >> >> >
>> >> >> > 2. Context tracking is on on a non-idle CPU. In this case, RCU
>> >> >> > might well be ignoring userspace execution: NO_HZ_FULL and
>> >> >> > all that. However, as you pointed out, in this case the
>> >> >> > context-tracking code lets RCU know that we have entered the
>> >> >> > kernel, which means that RCU will again be paying attention to
>> >> >> > RCU read-side critical sections.
>> >> >> >
>> >> >> > 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
>> >> >> > if we take a page fault when context tracking is off, life
>> >> >> > will be hard. But the kernel is not supposed to take page
>> >> >> > faults in the idle loop, so this is not a problem.
>> >> >>
>> >> >> I guess so, as long as there are really no page faults in the idle loop.
>> >> >
>> >> > As far as I know, there are not. If there are, someone needs to let
>> >> > me know! ;-)
>> >> >
>> >> >> There are, however, machine checks in the idle loop, and maybe kprobes
>> >> >> (haven't checked), so I think this patch might fix real bugs.
>> >> >
>> >> > If you can get ISTs from the idle loop, then the patch is needed.
>> >> >
>> >> >> > Just out of curiosity... Can an NMI occur in IST context? If it can,
>> >> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
>> >> >> > nested calls.
>> >> >>
>> >> >> Yes, and vice versa. That code looked like it handled nesting
>> >> >> correctly, but I wasn't entirely sure.
>> >> >
>> >> > It currently does not, please see below patch. Are you able to test
>> >> > nesting? It would be really cool if you could do so -- I have no
>> >> > way to test this patch.
>> >>
>> >> I can try. It's sort of easy -- I'll put an int3 into do_nmi and add
>> >> a fixup to avoid crashing.
>> >>
>> >> What should I look for? Should I try to force full nohz on and assert
>> >> something? I don't really know how to make full nohz work.
>> >
>> > You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
>> > rcu_nmi_exit() to fire.
>>
>> No warning with or without your patch, maybe because all of those
>> returns skip the labels.
>
> I will be guardedly optimistic and take this as a good sign. ;-)
>
>> Also, an NMI can happen *during* rcu_nmi_enter or rcu_nmi_exit. Is
>> that okay? Should those dynticks_nmi_nesting++ things be local_inc
>> and local_dec_and_test?
>
> Yep, it is OK during rcu_nmi_enter() or rcu_nmi_exit(). The nested
> NMI will put the dynticks_nmi_nesting counter back where it was, so
> no chance of confusion.
>

That sounds like it's making a scary assumption about the code
generated by the ++ operator.

>> That dynticks_nmi_nesting thing seems scary to me. Shouldn't the code
>> unconditionally increment dynticks_nmi_nesting in rcu_nmi_enter and
>> unconditionally decrement it in rcu_nmi_exit?
>
> You might be able to get that to work, but the reason it is not done
> that way is because we might get an NMI while not in dyntick-idle
> state. In that case, it would be very bad to atomically increment
> rcu_dynticks, because that would tell RCU to ignore the CPU while it
> was in the NMI handler, which is the opposite of what we want.
>
> But what did you have in mind?

If you're willing to have rcu_nmi_enter return a value and pass it
back to rcu_nmi_exit, then it could be like:

int rcu_nmi_enter()
{
        if (rdtp->dynticks & 1) {
                return 0;
        } else {
                rdtp->dynticks++;
                return 1;
        }
}

but wrapped in a cmpxchg loop to make it atomic, and

void rcu_nmi_exit(int state)
{
        if (state)
                rdtp->dynticks++;
}
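
Spelled out, the cmpxchg loop might look roughly like this, with C11
atomics standing in for the kernel's atomic_t API (a user-space sketch of
the idea only, not proposed kernel code):

#include <stdatomic.h>
#include <stdio.h>

static atomic_int dynticks;                  /* odd value: RCU is watching */

/* Returns 1 if this call made dynticks odd (so the matching exit must
 * increment it again), 0 if RCU was already watching this CPU. */
static int sketch_nmi_enter(void)
{
        int old = atomic_load(&dynticks);

        do {
                if (old & 0x1)
                        return 0;            /* already odd: nothing to undo */
        } while (!atomic_compare_exchange_weak(&dynticks, &old, old + 1));
        return 1;
}

static void sketch_nmi_exit(int state)
{
        if (state)
                atomic_fetch_add(&dynticks, 1);   /* odd -> even again */
}

int main(void)
{
        int state;

        atomic_store(&dynticks, 0);          /* e.g. idle: RCU not watching */
        state = sketch_nmi_enter();
        printf("from idle: state=%d dynticks=%d\n", state, atomic_load(&dynticks));
        sketch_nmi_exit(state);

        atomic_store(&dynticks, 1);          /* RCU already watching */
        state = sketch_nmi_enter();
        printf("already watching: state=%d dynticks=%d\n", state, atomic_load(&dynticks));
        sketch_nmi_exit(state);
        return 0;
}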

Otherwise, I think that there may need to be enough state somewhere so
that the outermost nested rcu_nmi_enter knows whether to increment
dynticks. For example, dynticks_nmi_nesting could store the nesting
count * 2 - (1 if the outermost nested user needs to increment
dynticks). Something like:

void rcu_nmi_enter(void)
{
        /* Be very careful -- this function may be called reentrantly on
         * the same CPU. */
        atomically: increment dynticks if it's even.

        /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will
         * not change the state. */

        local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));

        WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
}

void rcu_nmi_exit(void)
{
        WARN_ON(!(dynticks & 1));
        locally atomically: dynticks_nmi_nesting -= 2, unless
        dynticks_nmi_nesting == 1, in which case set it to zero

        if (dynticks_nmi_nesting was 1)
                atomic_inc(&dynticks);
}

The invariant here is that, for a single unnested enter/exit, if
dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
!= 0 *or* dynticks is odd will have no net effect, so the invariant,
in fact, holds for all invocations, nested or otherwise.

At least one of those conditions is true at all times during the
execution of outermost pair, starting with the first atomic operation
and ending with the final atomic_inc. So they nest properly no matter
what else happens (unless, of course, someone else pokes dynticks in
the middle).
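
For what it's worth, the scheme can be checked single-threaded with a few
asserts (again a user-space sketch under the encoding described above;
real NMI reentrancy and memory ordering are not modeled here):

#include <assert.h>
#include <stdio.h>

/* nesting stores 2 * depth, minus 1 if the outermost enter incremented
 * dynticks. Odd dynticks means RCU is watching. */
static int dynticks;
static int nesting;

static void sketch_nmi_enter(void)
{
        int incremented = 0;

        if (!(dynticks & 0x1)) {
                dynticks++;                  /* even -> odd */
                incremented = 1;
        }
        nesting += incremented ? 1 : 2;
        assert(dynticks & 0x1);              /* nesting != 0 implies odd */
}

static void sketch_nmi_exit(void)
{
        assert(dynticks & 0x1);
        if (nesting == 1) {
                nesting = 0;
                dynticks++;                  /* odd -> even */
        } else {
                nesting -= 2;
        }
}

int main(void)
{
        /* From idle (dynticks even): only the outermost pair flips dynticks. */
        sketch_nmi_enter();                  /* nesting = 1 */
        sketch_nmi_enter();                  /* nested: nesting = 3 */
        sketch_nmi_exit();                   /* nesting = 1 */
        sketch_nmi_exit();                   /* nesting = 0, dynticks even */
        assert(nesting == 0 && !(dynticks & 0x1));

        /* While RCU is already watching (dynticks odd): no net effect. */
        dynticks = 1;
        sketch_nmi_enter();                  /* nesting = 2, dynticks untouched */
        sketch_nmi_exit();                   /* nesting = 0, dynticks still odd */
        assert(nesting == 0 && dynticks == 1);

        printf("invariant sketch OK\n");
        return 0;
}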

Thoughts?

>
> Thanx, Paul
>
>> --Andy
>>
>> >
>> > Thanx, Paul
>> >
>> >> >> Also, just to make sure: are we okay if rcu_nmi_enter() is called
>> >> >> before exception_enter if context tracking is on and we came directly
>> >> >> from userspace?
>> >> >
>> >> > If I understand correctly, this will result in context tracking invoking
>> >> > rcu_user_enter(), which will result in the rcu_dynticks counter having an
>> >> > odd value. In that case, rcu_nmi_enter() will notice that RCU is already
>> >> > paying attention to this CPU via its check of atomic_read(&rdtp->dynticks)
>> >> > & 0x1), and will thus just return. The matching rcu_nmi_exit() will
>> >> > notice that the nesting count is zero, and will also just return.
>> >> >
>> >> > Thus, everything works in that case.
>> >> >
>> >> > In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
>> >> > would see that RCU is not paying attention to this CPU and that the
>> >> > NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
>> >> > It would then atomically increment rtdp->dynticks, forcing RCU to start
>> >> > paying attention to this CPU. The matching rcu_nmi_exit() will see
>> >> > that the nesting count was non-zero, but became zero when decremented.
>> >> > This will cause rcu_nmi_exit() to atomically increment rtdp->dynticks,
>> >> > which will tell RCU to stop paying attention to this CPU.
>> >> >
>> >> > Thanx, Paul
>> >> >
>> >> > ------------------------------------------------------------------------
>> >> >
>> >> > rcu: Make rcu_nmi_enter() handle nesting
>> >> >
>> >> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
>> >> > viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter()
>> >> > and rcu_nmi_exit() must now correctly handle nesting. As luck would
>> >> > have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
>> >> > This patch therefore makes rcu_nmi_enter() handle nesting.
>> >>
>> >> Thanks. Should I add this to v5 of my series?
>> >>
>> >> --Andy
>> >>
>> >> >
>> >> > Signed-off-by: Paul E. McKenney <[email protected]>
>> >> >
>> >> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> >> > index 8749f43f3f05..875421aff6e3 100644
>> >> > --- a/kernel/rcu/tree.c
>> >> > +++ b/kernel/rcu/tree.c
>> >> > @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
>> >> > if (rdtp->dynticks_nmi_nesting == 0 &&
>> >> > (atomic_read(&rdtp->dynticks) & 0x1))
>> >> > return;
>> >> > - rdtp->dynticks_nmi_nesting++;
>> >> > + if (rdtp->dynticks_nmi_nesting++ != 0)
>> >> > + return; /* Nested NMI/IST/whatever. */
>> >> > smp_mb__before_atomic(); /* Force delay from prior write. */
>> >> > atomic_inc(&rdtp->dynticks);
>> >> > /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Andy Lutomirski
>> >> AMA Capital Management, LLC
>> >>
>> >
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>>
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-22 16:55:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME

On Fri, Nov 21, 2014 at 01:26:07PM -0800, Andy Lutomirski wrote:
> x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set
> but not on non-paranoid returns. I suspect that this is a mistake
> and that the code only works because int3 is paranoid.
>
> Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a
> workaround for the x86 bug. With that bug fixed, we can remove
> _TIF_NOTIFY_RESUME from the uprobes code.
>
> Acked-by: Srikar Dronamraju <[email protected]>
> Reported-by: Oleg Nesterov <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

Acked-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-11-22 17:20:31

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 01:26:08PM -0800, Andy Lutomirski wrote:
> We currently pretend that IST context is like standard exception
> context, but this is incorrect. IST entries from userspace are like
> standard exceptions except that they use per-cpu stacks, so they are
> atomic. IST entries from kernel space are like NMIs from RCU's
> perspective -- they are not quiescent states even if they
> interrupted the kernel during a quiescent state.
>
> Add and use ist_enter and ist_exit to track IST context. Even
> though x86_32 has no IST stacks, we track these interrupts the same
> way.
>
> This fixes two issues:
>
> - Scheduling from an IST interrupt handler will now warn. It would
> previously appear to work as long as we got lucky and nothing
> overwrote the stack frame. (I don't know of any bugs in this
> that would trigger the warning, but it's good to be on the safe
> side.)
>
> - RCU handling in IST context was dangerous. As far as I know,
> only machine checks were likely to trigger this, but it's good to
> be on the safe side.
>
> Note that the machine check handlers appears to have been missing
> any context tracking at all before this patch.
>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Josh Triplett <[email protected]>
> Cc: Frédéric Weisbecker <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/traps.h | 4 +++
> arch/x86/kernel/cpu/mcheck/mce.c | 5 ++++
> arch/x86/kernel/cpu/mcheck/p5.c | 6 +++++
> arch/x86/kernel/cpu/mcheck/winchip.c | 5 ++++
> arch/x86/kernel/traps.c | 49 ++++++++++++++++++++++++++++++------
> 5 files changed, 61 insertions(+), 8 deletions(-)

...

> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 0d0e922fafc1..f5c4b8813774 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
> preempt_count_dec();
> }
>
> +enum ctx_state ist_enter(struct pt_regs *regs)
> +{
> + /*
> + * We are atomic because we're on the IST stack (or we're on x86_32,
> + * in which case we still shouldn't schedule.
> + */
> + preempt_count_add(HARDIRQ_OFFSET);
> +
> + if (user_mode_vm(regs)) {
> + /* Other than that, we're just an exception. */
> + return exception_enter();
> + } else {
> + /*
> + * We might have interrupted pretty much anything. In
> + * fact, if we're a machine check, we can even interrupt
> + * NMI processing. We don't want in_nmi() to return true,
> + * but we need to notify RCU.
> + */
> + rcu_nmi_enter();
> + return IN_KERNEL; /* the value is irrelevant. */
> + }

I guess dropping the explicit else-branch could make it look a bit nicer
with the curly braces gone and all...

enum ctx_state ist_enter(struct pt_regs *regs)
{
        /*
         * We are atomic because we're on the IST stack (or we're on x86_32,
         * in which case we still shouldn't schedule).
         */
        preempt_count_add(HARDIRQ_OFFSET);

        if (user_mode_vm(regs))
                /* Other than that, we're just an exception. */
                return exception_enter();

        /*
         * We might have interrupted pretty much anything. In fact, if we're a
         * machine check, we can even interrupt NMI processing. We don't want
         * in_nmi() to return true, but we need to notify RCU.
         */
        rcu_nmi_enter();
        return IN_KERNEL; /* the value is irrelevant. */
}

> +}
> +
> +void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
> +{
> + preempt_count_sub(HARDIRQ_OFFSET);
> +
> + if (user_mode_vm(regs))
> + return exception_exit(prev_state);
> + else
> + rcu_nmi_exit();
> +}

Ditto here.

> +
> static nokprobe_inline int
> do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
> struct pt_regs *regs, long error_code)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-11-23 06:19:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> >> <[email protected]> wrote:
> >> > On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
> >> >> <[email protected]> wrote:
> >> >> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
> >> >> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
> >> >> >> <[email protected]> wrote:
> >> >> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> >> >> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <[email protected]> wrote:
> >> >> >> >> > We currently pretend that IST context is like standard exception
> >> >> >> >> > context, but this is incorrect. IST entries from userspace are like
> >> >> >> >> > standard exceptions except that they use per-cpu stacks, so they are
> >> >> >> >> > atomic. IST entries from kernel space are like NMIs from RCU's
> >> >> >> >> > perspective -- they are not quiescent states even if they
> >> >> >> >> > interrupted the kernel during a quiescent state.
> >> >> >> >> >
> >> >> >> >> > Add and use ist_enter and ist_exit to track IST context. Even
> >> >> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
> >> >> >> >> > way.
> >> >> >> >>
> >> >> >> >> I should add:
> >> >> >> >>
> >> >> >> >> I have no idea why RCU read-side critical sections are safe inside
> >> >> >> >> __do_page_fault today. It's guarded by exception_enter(), but that
> >> >> >> >> doesn't do anything if context tracking is off, and context tracking
> >> >> >> >> is usually off. What am I missing here?
> >> >> >> >
> >> >> >> > Ah! There are three cases:
> >> >> >> >
> >> >> >> > 1. Context tracking is off on a non-idle CPU. In this case, RCU is
> >> >> >> > still paying attention to CPUs running in both userspace and in
> >> >> >> > the kernel. So if a page fault happens, RCU will be set up to
> >> >> >> > notice any RCU read-side critical sections.
> >> >> >> >
> >> >> >> > 2. Context tracking is on on a non-idle CPU. In this case, RCU
> >> >> >> > might well be ignoring userspace execution: NO_HZ_FULL and
> >> >> >> > all that. However, as you pointed out, in this case the
> >> >> >> > context-tracking code lets RCU know that we have entered the
> >> >> >> > kernel, which means that RCU will again be paying attention to
> >> >> >> > RCU read-side critical sections.
> >> >> >> >
> >> >> >> > 3. The CPU is idle. In this case, RCU is ignoring the CPU, so
> >> >> >> > if we take a page fault when context tracking is off, life
> >> >> >> > will be hard. But the kernel is not supposed to take page
> >> >> >> > faults in the idle loop, so this is not a problem.
> >> >> >>
> >> >> >> I guess so, as long as there are really no page faults in the idle loop.
> >> >> >
> >> >> > As far as I know, there are not. If there are, someone needs to let
> >> >> > me know! ;-)
> >> >> >
> >> >> >> There are, however, machine checks in the idle loop, and maybe kprobes
> >> >> >> (haven't checked), so I think this patch might fix real bugs.
> >> >> >
> >> >> > If you can get ISTs from the idle loop, then the patch is needed.
> >> >> >
> >> >> >> > Just out of curiosity... Can an NMI occur in IST context? If it can,
> >> >> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> >> >> >> > nested calls.
> >> >> >>
> >> >> >> Yes, and vice versa. That code looked like it handled nesting
> >> >> >> correctly, but I wasn't entirely sure.
> >> >> >
> >> >> > It currently does not, please see below patch. Are you able to test
> >> >> > nesting? It would be really cool if you could do so -- I have no
> >> >> > way to test this patch.
> >> >>
> >> >> I can try. It's sort of easy -- I'll put an int3 into do_nmi and add
> >> >> a fixup to avoid crashing.
> >> >>
> >> >> What should I look for? Should I try to force full nohz on and assert
> >> >> something? I don't really know how to make full nohz work.
> >> >
> >> > You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
> >> > rcu_nmi_exit() to fire.
> >>
> >> No warning with or without your patch, maybe because all of those
> >> returns skip the labels.
> >
> > I will be guardedly optimistic and take this as a good sign. ;-)
> >
> >> Also, an NMI can happen *during* rcu_nmi_enter or rcu_nmi_exit. Is
> >> that okay? Should those dynticks_nmi_nesting++ things be local_inc
> >> and local_dec_and_test?
> >
> > Yep, it is OK during rcu_nmi_enter() or rcu_nmi_exit(). The nested
> > NMI will put the dynticks_nmi_nesting counter back where it was, so
> > no chance of confusion.
>
> That sounds like it's making a scary assumption about the code
> generated by the ++ operator.

Yeah, well, local_inc becomes the ++ operator on many systems. ;-)

I could perhaps be convinced to use ACCESS_ONCE(), will give it some
thought.

> >> That dynticks_nmi_nesting thing seems scary to me. Shouldn't the code
> >> unconditionally increment dynticks_nmi_nesting in rcu_nmi_enter and
> >> unconditionally decrement it in rcu_nmi_exit?
> >
> > You might be able to get that to work, but the reason it is not done
> > that way is because we might get an NMI while not in dyntick-idle
> > state. In that case, it would be very bad to atomically increment
> > rcu_dynticks, because that would tell RCU to ignore the CPU while it
> > was in the NMI handler, which is the opposite of what we want.
> >
> > But what did you have in mind?
>
> If you're willing to have rcu_nmi_enter return a value and pass it
> back to rcu_nmi_exit, then it could be like:
>
> int rcu_nmi_enter()
> {
> if (rdtp->dynticks & 1) {
> return 0;
> } else {
> rdtp->dynticks++;
> return 1;
> }
> }
>
> but wrapped in a cmpxchg loop to make it atomic, and
>
> void rcu_nmi_exit(int state)
> {
> if (state)
> rdtp->dynticks++;
> }

Returning state sounds like a bad idea, if we can reasonably avoid it.

And I think I finally see what you are pointing out about my code: If
another NMI comes in between the time I increment ->dynticks_nmi_nesting
and the time I atomically increment ->dynticks, the nested NMI handler
will incorrectly believe that RCU is already paying attention to this CPU.
Which would indeed not be at all good, so good catch!!!

> Otherwise, I think that there may need to be enough state somewhere so
> that the outermost nested rcu_nmi_enter knows whether to increment
> dynticks. For example, dynticks_nmi_nesting could store the nesting
> count * 2 - (1 if the outermost nested user needs to increment
> dynticks). Something like:
>
> void rcu_nmi_enter(void)
> {
>         /* Be very careful -- this function may be called reentrantly on the
>            same CPU. */
>         atomically: increment dynticks if it's even.
>
>         /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
>          * the state. */
>
>         local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
>
>         WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> }
>
> void rcu_nmi_exit(void)
> {
>         WARN_ON(!(dynticks & 1));
>         locally atomically: dynticks_nmi_nesting -= 2, unless
>         dynticks_nmi_nesting == 1, in which case set it to zero
>
>         if (dynticks_nmi_nesting was 1)
>                 atomic_inc(&dynticks);
> }
>
> The invariant here is that, for a single unnested enter/exit, if
> dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> in fact, holds for all invocations, nested or otherwise.
>
> At least one of those conditions is true at all times during the
> execution of outermost pair, starting with the first atomic operation
> and ending with the final atomic_inc. So they nest properly no matter
> what else happens (unless, of course, someone else pokes dynticks in
> the middle).
>
> Thoughts?

Let's see... The evenness of ->dynticks should be preserved by nested NMI
handlers, so the check and increment need not be atomic. We don't have
any way (other than atomic operations) to do local atomic modifications
on all architectures, because we cannot mask NMIs. (Yes, it can work
on x86, but this is common code that needs to work everywhere.) On the
other hand, presumably NMIs are rare, so atomic modification of the NMI
nesting counter should be OK, at least if it proves absolutely necessary.
And I am thinking that a mechanical proof will be needed here. :-/

But first, let me try generating the code and informally evaluating it:

 1 struct rcu_dynticks {
 2         long long dynticks_nesting;
 3         int dynticks_nmi_nesting;
 4         atomic_t dynticks;
 5 };
 6
 7 void rcu_nmi_enter(void)
 8 {
 9         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
10         int incby = 2;
11
12         if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
13                 smp_mb__before_atomic();
14                 atomic_inc(&rdtp->dynticks);
15                 smp_mb__after_atomic();
16                 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
17                 incby = 1;
18         }
19         rdtp->dynticks_nmi_nesting += incby;
20         barrier();
21 }
22
23 void rcu_nmi_exit(void)
24 {
25         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
26
27         WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
28         WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
29         if (rdtp->dynticks_nmi_nesting != 1) {
30                 rdtp->dynticks_nmi_nesting -= 2;
31                 return;
32         }
33         rdtp->dynticks_nmi_nesting = 0;
34         smp_mb__before_atomic();
35         atomic_inc(&rdtp->dynticks);
36         smp_mb__after_atomic();
37         WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
38 }

Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
assumes that we don't need to increment ->dynticks.

Line 12 checks to see if ->dynticks is even. Note that this check is
stable: If there are nested NMIs, they will increment ->dynticks twice
or not at all, and either way preserves the evenness (to be proven, of
course, but that is the plan). If ->dynticks is even, lines 13-15
atomically increment it, line 16 complains if still even, and line 17
says we will increment ->dynticks_nmi_nesting by only 1.

Either way, line 19 increments ->dynticks_nmi_nesting as needed and
line 20 keeps the compiler from getting too cute.

For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
structure. Lines 27 and 28 complain bitterly if invariants are violated.
If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.

Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
atomically increment ->dynticks with full ordering, and line 37
complains bitterly if ->dynticks is not even.

So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
not nested, which is the correct thing for them to think in that case.
They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
to zero (adding and then subtracting 1). If the NMI happens after the
atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
to zero (adding and subtracting two again). If the NMI happens after
the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
time to one (again adding and subtracting two).

In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
one means we need to atomically increment ->dynticks, and other values
mean that we are partially or fully nested. Reasoning proceeds as for
rcu_nmi_enter(), but in the opposite direction.

Whew! That might even work.

But how about taking a different approach? Assuming that there can
never be more than (say) 14 nesting NMI-like things, use the lower
four bits of ->dynticks to represent the NMI nesting and the upper
28 bits as the counter. This of course requires modifying lots of
places in RCU that check the counter, but it is probably time to
abstract the check anyway.

This would allow my earlier attempted logic to work and (maybe) simplify
the reasoning a bit (and yes, the "magic" constants need macros):

void rcu_nmi_enter(void)
{
        struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
        int nesting = atomic_read(&rdtp->dynticks) & 0xf;
        int incby = 0x01;

        WARN_ON_ONCE(nesting == 0xf);
        if (nesting == 0) {
                if (atomic_read(&rdtp->dynticks) & 0x10)
                        return;
                incby = 0x11;
        }
        smp_mb__before_atomic();
        atomic_add(incby, &rdtp->dynticks);
        smp_mb__after_atomic();
        WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
}

void rcu_nmi_exit(void)
{
        struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
        int nesting = atomic_read(&rdtp->dynticks) & 0xf;
        int incby = 0x0f;

        if (nesting == 0)
                return;
        if (nesting > 1)
                incby = -1;
        smp_mb__before_atomic();
        atomic_add(incby, &rdtp->dynticks);
        smp_mb__after_atomic();
        WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
}
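
[Aside: one way the "magic" constants above could be named, per the remark that they need macros; the macro and helper names below are invented for illustration and are not from the thread.]

#define DYNTICK_NMI_MASK   0x0f   /* low four bits: NMI-like nesting depth */
#define DYNTICK_CTR_INC    0x10   /* one increment of the upper 28-bit counter */

/* Sketch of the abstracted check: with this packing, "RCU is ignoring this
 * CPU" means the counter is even (0x10 clear) and there is no NMI nesting.
 */
static inline bool rcu_dynticks_snap_idle(int snap)
{
        return !(snap & (DYNTICK_CTR_INC | DYNTICK_NMI_MASK));
}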

Over to you! ;-)

Thanx, Paul

2014-11-24 11:39:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] x86: Clean up current_stack_pointer

On Fri, Nov 21, 2014 at 01:26:10PM -0800, Andy Lutomirski wrote:
> There's no good reason for it to be a macro, and x86_64 will want to
> use it, so it should be in a header.
>
> Signed-off-by: Andy Lutomirski <[email protected]>

Acked-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-11-24 15:54:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic

On Fri, Nov 21, 2014 at 01:26:11PM -0800, Andy Lutomirski wrote:
> In some IST handlers, if the interrupt came from user mode,
> we can safely enable preemption. Add helpers to do it safely.
>
> This is intended to be used by the memory failure code in
> do_machine_check.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/traps.h | 2 ++
> arch/x86/kernel/traps.c | 38 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 40 insertions(+)
>
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index eb16a61bfd06..04ba537fc721 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -112,6 +112,8 @@ asmlinkage void mce_threshold_interrupt(void);
>
> extern enum ctx_state ist_enter(struct pt_regs *regs);
> extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
> +extern void ist_begin_non_atomic(struct pt_regs *regs);
> +extern void ist_end_non_atomic(void);
>
> /* Interrupts/Exceptions */
> enum {
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 6a02760df7b4..2b5f2e038e3f 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -140,6 +140,44 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
> rcu_nmi_exit();
> }
>
> +/**
> + * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
> + * @regs: regs passed to the IST exception handler
> + *
> + * IST exception handlers normally cannot schedule. As a special
> + * exception, if the exception interrupted userspace code (i.e.
> + * user_mode_vm(regs) would return true) and the exception was not
> + * a double fault, it can be safe to schedule. ist_begin_non_atomic()
> + * begins a non-atomic section within an ist_enter()/ist_exit() region.
> + * Callers are responsible for enabling interrupts themselves inside
> + * the non-atomic section, and callers must call ist_end_non_atomic()
> + * before ist_exit().
> + */

Ok, I guess this is fine, albeit a bit circumstantial:

I need to do

ist_enter()
ist_begin_non_atomic()

In here, I still need to do

if (user_mode_vm()) {
# do non-atomic stuff
}

AFAICT, right?

ist_end_non_atomic()
ist_exit()

and this whole fun for the context tracking exception_enter/_exit()
calls.

Acked-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-11-24 15:55:18

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 3/5] x86, entry: Switch stacks on a paranoid entry from userspace

On Fri, Nov 21, 2014 at 01:26:09PM -0800, Andy Lutomirski wrote:
> This causes all non-NMI, non-double-fault kernel entries from
> userspace to run on the normal kernel stack. Double-fault is
> exempt to minimize confusion if we double-fault directly from
> userspace due to a bad kernel stack.
>
> This is, surprisingly, simpler and shorter than the current code. It
> removes the IMO rather frightening paranoid_userspace path, and it
> makes sync_regs much simpler.
>
> There is no risk of stack overflow due to this change -- the kernel
> stack that we switch to is empty.
>
> This will also enable us to create non-atomic sections within
> machine checks from userspace, which will simplify memory failure
> handling. It will also allow the upcoming fsgsbase code to be
> simplified, because it doesn't need to worry about usergs when
> scheduling in paranoid_exit, as that code no longer exists.
>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

Acked-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-11-24 17:58:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME

On Sat, Nov 22, 2014 at 8:55 AM, Borislav Petkov <[email protected]> wrote:
> On Fri, Nov 21, 2014 at 01:26:07PM -0800, Andy Lutomirski wrote:
>> x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set
>> but not on non-paranoid returns. I suspect that this is a mistake
>> and that the code only works because int3 is paranoid.
>>
>> Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a
>> workaround for the x86 bug. With that bug fixed, we can remove
>> _TIF_NOTIFY_RESUME from the uprobes code.
>>
>> Acked-by: Srikar Dronamraju <[email protected]>
>> Reported-by: Oleg Nesterov <[email protected]>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>
> Acked-by: Borislav Petkov <[email protected]>
>

This one is now upstream, so I'm dropping it.

--Andy

> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-24 19:49:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Sat, Nov 22, 2014 at 9:20 AM, Borislav Petkov <[email protected]> wrote:
> On Fri, Nov 21, 2014 at 01:26:08PM -0800, Andy Lutomirski wrote:
>> We currently pretend that IST context is like standard exception
>> context, but this is incorrect. IST entries from userspace are like
>> standard exceptions except that they use per-cpu stacks, so they are
>> atomic. IST entries from kernel space are like NMIs from RCU's
>> perspective -- they are not quiescent states even if they
>> interrupted the kernel during a quiescent state.
>>
>> Add and use ist_enter and ist_exit to track IST context. Even
>> though x86_32 has no IST stacks, we track these interrupts the same
>> way.
>>
>> This fixes two issues:
>>
>> - Scheduling from an IST interrupt handler will now warn. It would
>> previously appear to work as long as we got lucky and nothing
>> overwrote the stack frame. (I don't know of any bugs in this
>> that would trigger the warning, but it's good to be on the safe
>> side.)
>>
>> - RCU handling in IST context was dangerous. As far as I know,
>> only machine checks were likely to trigger this, but it's good to
>> be on the safe side.
>>
>> Note that the machine check handlers appear to have been missing
>> any context tracking at all before this patch.
>>
>> Cc: "Paul E. McKenney" <[email protected]>
>> Cc: Josh Triplett <[email protected]>
>> Cc: Frédéric Weisbecker <[email protected]>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/include/asm/traps.h | 4 +++
>> arch/x86/kernel/cpu/mcheck/mce.c | 5 ++++
>> arch/x86/kernel/cpu/mcheck/p5.c | 6 +++++
>> arch/x86/kernel/cpu/mcheck/winchip.c | 5 ++++
>> arch/x86/kernel/traps.c | 49 ++++++++++++++++++++++++++++++------
>> 5 files changed, 61 insertions(+), 8 deletions(-)
>
> ...
>
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 0d0e922fafc1..f5c4b8813774 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
>> preempt_count_dec();
>> }
>>
>> +enum ctx_state ist_enter(struct pt_regs *regs)
>> +{
>> + /*
>> + * We are atomic because we're on the IST stack (or we're on x86_32,
>> + * in which case we still shouldn't schedule).
>> + */
>> + preempt_count_add(HARDIRQ_OFFSET);
>> +
>> + if (user_mode_vm(regs)) {
>> + /* Other than that, we're just an exception. */
>> + return exception_enter();
>> + } else {
>> + /*
>> + * We might have interrupted pretty much anything. In
>> + * fact, if we're a machine check, we can even interrupt
>> + * NMI processing. We don't want in_nmi() to return true,
>> + * but we need to notify RCU.
>> + */
>> + rcu_nmi_enter();
>> + return IN_KERNEL; /* the value is irrelevant. */
>> + }
>
> I guess dropping the explicit else-branch could make it look a bit nicer
> with the curly braces gone and all...
>
> enum ctx_state ist_enter(struct pt_regs *regs)
> {
> /*
> * We are atomic because we're on the IST stack (or we're on x86_32,
> * in which case we still shouldn't schedule).
> */
> preempt_count_add(HARDIRQ_OFFSET);
>
> if (user_mode_vm(regs))
> /* Other than that, we're just an exception. */
> return exception_enter();
>

Two indented lines w/o curly braces make me think of goto fail; :-/

TBH, when there are clearly two options, I tend to prefer the braces
that make it very obvious what's going on. I had some memorable bugs
several years ago that would have been impossible if I had used braces
more liberally.

--Andy

> /*
> * We might have interrupted pretty much anything. In fact, if we're a
> * machine check, we can even interrupt NMI processing. We don't want
> * in_nmi() to return true, but we need to notify RCU.
> */
> rcu_nmi_enter();
> return IN_KERNEL; /* the value is irrelevant. */
> }
>
>> +}
>> +
>> +void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
>> +{
>> + preempt_count_sub(HARDIRQ_OFFSET);
>> +
>> + if (user_mode_vm(regs))
>> + return exception_exit(prev_state);
>> + else
>> + rcu_nmi_exit();
>> +}
>
> Ditto here.
>
>> +
>> static nokprobe_inline int
>> do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
>> struct pt_regs *regs, long error_code)
>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-24 19:53:16

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic

On Mon, Nov 24, 2014 at 7:54 AM, Borislav Petkov <[email protected]> wrote:
> On Fri, Nov 21, 2014 at 01:26:11PM -0800, Andy Lutomirski wrote:
>> In some IST handlers, if the interrupt came from user mode,
>> we can safely enable preemption. Add helpers to do it safely.
>>
>> This is intended to be used my the memory failure code in
>> do_machine_check.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/include/asm/traps.h | 2 ++
>> arch/x86/kernel/traps.c | 38 ++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 40 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
>> index eb16a61bfd06..04ba537fc721 100644
>> --- a/arch/x86/include/asm/traps.h
>> +++ b/arch/x86/include/asm/traps.h
>> @@ -112,6 +112,8 @@ asmlinkage void mce_threshold_interrupt(void);
>>
>> extern enum ctx_state ist_enter(struct pt_regs *regs);
>> extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
>> +extern void ist_begin_non_atomic(struct pt_regs *regs);
>> +extern void ist_end_non_atomic(void);
>>
>> /* Interrupts/Exceptions */
>> enum {
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 6a02760df7b4..2b5f2e038e3f 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -140,6 +140,44 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
>> rcu_nmi_exit();
>> }
>>
>> +/**
>> + * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
>> + * @regs: regs passed to the IST exception handler
>> + *
>> + * IST exception handlers normally cannot schedule. As a special
>> + * exception, if the exception interrupted userspace code (i.e.
>> + * user_mode_vm(regs) would return true) and the exception was not
>> + * a double fault, it can be safe to schedule. ist_begin_non_atomic()
>> + * begins a non-atomic section within an ist_enter()/ist_exit() region.
>> + * Callers are responsible for enabling interrupts themselves inside
>> + * the non-atomic section, and callers must call is_end_non_atomic()
>> + * before ist_exit().
>> + */
>
> Ok, I guess this is is fine, albeit a bit circumstantial:
>
> I need to do
>
> ist_enter()
> ist_begin_non_atomic()
>
> In here, I still need to do
>
> if (user_mode_vm()) {
> # do non-atomic stuff
> }
>
> AFAICT, right?
>
> ist_end_non_atomic()
> ist_exit()
>
> and this whole fun for the context tracking exception_enter/_exit()
> calls.

Not quite. This fun stuff is to satisfy Linus' request to make the
atomicness of the IST entries unconditional instead of depending on
the weird user_mode_vm() check, and to make the escape to non-atomicness
clearly a special case.

So your code above should OOPS instead of working. The code should be:

ist_enter()

if (user_mode_vm()) {
        ist_begin_non_atomic();
        local_irq_enable();
        # do non-atomic stuff
        local_irq_disable();
        ist_end_non_atomic();
}

ist_exit().

Although that user_mode_vm() check might be subsumed by the mce_severity stuff.
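
[Aside: the patch body itself is not quoted in this sub-thread; going by the kernel-doc comment and the description above, the helpers presumably have roughly the shape below, and the BUG_ON is what makes the mis-ordered version oops. The details are an assumption, not the actual patch.]

void ist_begin_non_atomic(struct pt_regs *regs)
{
        BUG_ON(!user_mode_vm(regs));    /* only legal if we interrupted userspace */
        preempt_count_sub(HARDIRQ_OFFSET);
}

void ist_end_non_atomic(void)
{
        preempt_count_add(HARDIRQ_OFFSET);
}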

NB: This series is going to look a bit different once I rebase it.
This part is the same, but Linus merged some patches that change some
of the same code over the weekend.

--Andy

>
> Acked-by: Borislav Petkov <[email protected]>
>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-24 20:22:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
<[email protected]> wrote:
> On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
>> >> <[email protected]> wrote:

> Returning state sounds like a bad idea, if we can reasonably avoid it.

I agree, except that we already do it for exception_enter(), etc. But
yes, changing fewer things is nice.

>
> And I think I finally see what you are pointing out about my code: If
> another NMI comes in between the time I increment ->dynticks_nmi_nesting
> and the time I atomically increment ->dynticks, the nested NMI handler
> will incorrectly believe that RCU is already paying attention to this CPU.
> Which would indeed not be at all good, so good catch!!!
>
>> Otherwise, I think that there may need to be enough state somewhere so
>> that the outermost nested rcu_nmi_enter knows whether to increment
>> dynticks. For example, dynticks_nmi_nesting could store the nesting
>> count * 2 - (1 if the outermost nested user needs to increment
>> dynticks). Something like:
>>
>> void rcu_nmi_enter(void)
>> {
>> /* Be very careful -- this function may be called reentrently on the
>> same CPU. */
>> atomically: increment dynticks if it's even.
>>
>> /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
>> * the state. */
>>
>> local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
>>
>> WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
>> }
>>
>> void rcu_nmi_exit(void)
>> {
>> WARN_ON(!(dynticks & 1));
>> locally atomically: dynticks_nmi_nesting -= 2, unless
>> dynticks_nmi_nesting == 1, in which case set it to zero
>>
>> if (dynticks_nmi_nesting was 1)
>> atomic_inc(&dynticks);
>> }
>>
>> The invariant here is that, for a single unnested enter/exit, if
>> dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
>> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
>> != 0 *or* dynticks is odd will have no net effect, so the invariant,
>> in fact, holds for all invocations, nested or otherwise.
>>
>> At least one of those conditions is true at all times during the
>> execution of outermost pair, starting with the first atomic operation
>> and ending with the final atomic_inc. So they nest properly no matter
>> what else happens (unless, of course, someone else pokes dynticks in
>> the middle).
>>
>> Thoughts?
>
> Let's see... The evenness of ->dynticks should be preserved by nested NMI
> handlers, so the check and increment need not be atomic. We don't have
> any way (other than atomic operations) to do local atomic modifications
> on all architectures, because we cannot mask NMIs. (Yes, it can work
> on x86, but this is common code that needs to work everywhere.) On the
> other hand, presumably NMIs are rare, so atomic modification of the NMI
> nesting counter should be OK, at least if it proves absolutely necessary.
> And I am thinking that a mechanical proof will be needed here. :-/
>
> But first, let me try generating the code and informally evaluating it:
>
> 1 struct rcu_dynticks {
> 2 long long dynticks_nesting;
> 3 int dynticks_nmi_nesting;
> 4 atomic_t dynticks;
> 5 };
> 6
> 7 void rcu_nmi_enter(void)
> 8 {
> 9 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> 10 int incby = 2;
> 11
> 12 if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> 13 smp_mb__before_atomic();
> 14 atomic_inc(&rdtp->dynticks);
> 15 smp_mb__after_atomic();
> 16 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> 17 incby = 1;

WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?

> 18 }
> 19 rdtp->dynticks_nmi_nesting += incby;

Oh, I see why you don't need local_add -- it's because an nmi in the
middle of this increment won't have any effect on the interrupted
code, so even a software RMW will be okay.

> 20 barrier();
> 21 }
> 22
> 23 void rcu_nmi_exit(void)
> 24 {
> 25 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> 26
> 27 WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> 28 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> 29 if (rdtp->dynticks_nmi_nesting != 1) {

WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?

> 30 rdtp->dynticks_nmi_nesting -= 2;
> 31 return;
> 32 }
> 33 rdtp->dynticks_nmi_nesting = 0;
> 34 smp_mb__before_atomic();

This implies barrier(), right?

> 35 atomic_inc(&rdtp->dynticks);
> 36 smp_mb__after_atomic();
> 37 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> 38 }
>
> Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> assumes that we don't need to increment ->dynticks.
>
> Line 12 checks to see if ->dynticks is even. Note that this check is
> stable: If there are nested NMIs, they will increment ->dynticks twice
> or not at all, and either way preserves the evenness (to be proven, of
> course, but that is the plan). If ->dynticks is even, lines 13-15
> atomically increment it, line 16 complains if still even, and line 17
> says we will increment ->dynticks_nmi_nesting by only 1.
>
> Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> line 20 keeps the compiler from getting too cute.
>
> For rcu_nmi_exit(), line 25 again picks up this CPUs rcu_dynticks
> structure. Lines 27 and 28 complain bitterly if invariants are violated.
> If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
>
> Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> atomically increment ->dynticks with full ordering, and line 37
> complains bitterly if ->dynticks is not even.
>
> So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> not nested, which is the correct thing for them to think in that case.
> They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> to zero (adding and then subtracting 1). If the NMI happens after the
> atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> to zero (adding and subtracting two again). If the NMI happens after
> the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> time to one (again adding and subtracting two).
>
> In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> one means we need to atomically increment ->dynticks, and other values
> mean that we are partially or fully nested. Reasoning proceeds as for
> rcu_nmi_enter(), but in the opposite direction.
>
> Whew! That might even work.

I think I like this, with the warnings above.

>
> But how about taking a different approach. Assuming that there can
> never be more than (say) 14 nesting NMI-like things, use the lower
> four bits of ->dynticks to represent the NMI nesting and the upper
> 28 bits as the counter. This of course requires modifying lots of
> places in RCU that check the counter, but it is probably time to
> abstract the check anyway.
>
> This would allow my earlier attempted logic to work and (maybe) simplify
> the reasoning a bit (and yes, the "magic" constants need macros):
>
> void rcu_nmi_enter(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> int incby = 0x01;
>
> WARN_ON_ONCE(nexting == 0xf);
> if (nesting == 0) {
> if (atomic_read(&rdtp->dynticks) & 0x10)
> return;
> incby = 0x11;
> }
> smp_mb__before_atomic();
> atomic_add(&rdtp->dynticks, incby);
> smp_mb__after_atomic();
> WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> }
>
> void rcu_nmi_exit(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> int incby = 0x0f;
>
> if (nesting == 0)
> return;
> if (nesting > 1)
> incby = -1;
> smp_mb__before_atomic();
> atomic_add(&rdtp->dynticks, incby);
> smp_mb__after_atomic();
> WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> }
>
> Over to you! ;-)

This latter one is all you :)

--Andy

>
> Thanx, Paul
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-24 20:54:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> >> <[email protected]> wrote:
> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> >> >> <[email protected]> wrote:
>
> > Returning state sounds like a bad idea, if we can reasonably avoid it.
>
> I agree, except that we already do it for exception_enter(), etc. But
> yes, changing fewer things is nice.
>
> >
> > And I think I finally see what you are pointing out about my code: If
> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
> > and the time I atomically increment ->dynticks, the nested NMI handler
> > will incorrectly believe that RCU is already paying attention to this CPU.
> > Which would indeed not be at all good, so good catch!!!
> >
> >> Otherwise, I think that there may need to be enough state somewhere so
> >> that the outermost nested rcu_nmi_enter knows whether to increment
> >> dynticks. For example, dynticks_nmi_nesting could store the nesting
> >> count * 2 - (1 if the outermost nested user needs to increment
> >> dynticks). Something like:
> >>
> >> void rcu_nmi_enter(void)
> >> {
> >> /* Be very careful -- this function may be called reentrently on the
> >> same CPU. */
> >> atomically: increment dynticks if it's even.
> >>
> >> /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
> >> * the state. */
> >>
> >> local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> >>
> >> WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> >> }
> >>
> >> void rcu_nmi_exit(void)
> >> {
> >> WARN_ON(!(dynticks & 1));
> >> locally atomically: dynticks_nmi_nesting -= 2, unless
> >> dynticks_nmi_nesting == 1, in which case set it to zero
> >>
> >> if (dynticks_nmi_nesting was 1)
> >> atomic_inc(&dynticks);
> >> }
> >>
> >> The invariant here is that, for a single unnested enter/exit, if
> >> dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> >> in fact, holds for all invocations, nested or otherwise.
> >>
> >> At least one of those conditions is true at all times during the
> >> execution of outermost pair, starting with the first atomic operation
> >> and ending with the final atomic_inc. So they nest properly no matter
> >> what else happens (unless, of course, someone else pokes dynticks in
> >> the middle).
> >>
> >> Thoughts?
> >
> > Let's see... The evenness of ->dynticks should be preserved by nested NMI
> > handlers, so the check and increment need not be atomic. We don't have
> > any way (other than atomic operations) to do local atomic modifications
> > on all architectures, because we cannot mask NMIs. (Yes, it can work
> > on x86, but this is common code that needs to work everywhere.) On the
> > other hand, presumably NMIs are rare, so atomic modification of the NMI
> > nesting counter should be OK, at least if it proves absolutely necessary.
> > And I am thinking that a mechanical proof will be needed here. :-/
> >
> > But first, let me try generating the code and informally evaluating it:
> >
> > 1 struct rcu_dynticks {
> > 2 long long dynticks_nesting;
> > 3 int dynticks_nmi_nesting;
> > 4 atomic_t dynticks;
> > 5 };
> > 6
> > 7 void rcu_nmi_enter(void)
> > 8 {
> > 9 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > 10 int incby = 2;
> > 11
> > 12 if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> > 13 smp_mb__before_atomic();
> > 14 atomic_inc(&rdtp->dynticks);
> > 15 smp_mb__after_atomic();
> > 16 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > 17 incby = 1;
>
> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?

That would make sense.

> > 18 }
> > 19 rdtp->dynticks_nmi_nesting += incby;
>
> Oh, I see why you don't need local_add -- it's because an nmi in the
> middle of this increment won't have any effect on the interrupted
> code, so even a software RMW will be okay.

Yep! ;-)

> > 20 barrier();
> > 21 }
> > 22
> > 23 void rcu_nmi_exit(void)
> > 24 {
> > 25 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > 26
> > 27 WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> > 28 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > 29 if (rdtp->dynticks_nmi_nesting != 1) {
>
> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?

This is already implied by the WARN_ON_ONCE() on line 27 and the check
on line 29.

> > 30 rdtp->dynticks_nmi_nesting -= 2;
> > 31 return;
> > 32 }
> > 33 rdtp->dynticks_nmi_nesting = 0;
> > 34 smp_mb__before_atomic();
>
> This implies barrier(), right?

Yep!

> > 35 atomic_inc(&rdtp->dynticks);
> > 36 smp_mb__after_atomic();
> > 37 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > 38 }
> >
> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> > assumes that we don't need to increment ->dynticks.
> >
> > Line 12 checks to see if ->dynticks is even. Note that this check is
> > stable: If there are nested NMIs, they will increment ->dynticks twice
> > or not at all, and either way preserves the evenness (to be proven, of
> > course, but that is the plan). If ->dynticks is even, lines 13-15
> > atomically increment it, line 16 complains if still even, and line 17
> > says we will increment ->dynticks_nmi_nesting by only 1.
> >
> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> > line 20 keeps the compiler from getting too cute.
> >
> > For rcu_nmi_exit(), line 25 again picks up this CPUs rcu_dynticks
> > structure. Lines 27 and 28 complain bitterly if invariants are violated.
> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
> >
> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> > atomically increment ->dynticks with full ordering, and line 37
> > complains bitterly if ->dynticks is not even.
> >
> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> > not nested, which is the correct thing for them to think in that case.
> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> > to zero (adding and then subtracting 1). If the NMI happens after the
> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> > to zero (adding and subtracting two again). If the NMI happens after
> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> > time to one (again adding and subtracting two).
> >
> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> > one means we need to atomically increment ->dynticks, and other values
> > mean that we are partially or fully nested. Reasoning proceeds as for
> > rcu_nmi_enter(), but in the opposite direction.
> >
> > Whew! That might even work.
>
> I think I like this, with the warnings above.

OK with dropping the one that I called out as redundant?

> > But how about taking a different approach. Assuming that there can
> > never be more than (say) 14 nesting NMI-like things, use the lower
> > four bits of ->dynticks to represent the NMI nesting and the upper
> > 28 bits as the counter. This of course requires modifying lots of
> > places in RCU that check the counter, but it is probably time to
> > abstract the check anyway.
> >
> > This would allow my earlier attempted logic to work and (maybe) simplify
> > the reasoning a bit (and yes, the "magic" constants need macros):
> >
> > void rcu_nmi_enter(void)
> > {
> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > int incby = 0x01;
> >
> > WARN_ON_ONCE(nexting == 0xf);
> > if (nesting == 0) {
> > if (atomic_read(&rdtp->dynticks) & 0x10)
> > return;
> > incby = 0x11;
> > }
> > smp_mb__before_atomic();
> > atomic_add(&rdtp->dynticks, incby);
> > smp_mb__after_atomic();
> > WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > }
> >
> > void rcu_nmi_exit(void)
> > {
> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > int incby = 0x0f;
> >
> > if (nesting == 0)
> > return;
> > if (nesting > 1)
> > incby = -1;
> > smp_mb__before_atomic();
> > atomic_add(&rdtp->dynticks, incby);
> > smp_mb__after_atomic();
> > WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > }
> >
> > Over to you! ;-)
>
> This latter one is all you :)

Well, let's see how I feel about it after trying a Promela model of
the first code sequence. ;-)

Thanx, Paul

> --Andy
>
> >
> > Thanx, Paul
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
>

2014-11-24 21:03:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 12:54 PM, Paul E. McKenney
<[email protected]> wrote:
> On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
>> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
>> >> <[email protected]> wrote:
>> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
>> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
>> >> >> <[email protected]> wrote:
>>
>> > Returning state sounds like a bad idea, if we can reasonably avoid it.
>>
>> I agree, except that we already do it for exception_enter(), etc. But
>> yes, changing fewer things is nice.
>>
>> >
>> > And I think I finally see what you are pointing out about my code: If
>> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
>> > and the time I atomically increment ->dynticks, the nested NMI handler
>> > will incorrectly believe that RCU is already paying attention to this CPU.
>> > Which would indeed not be at all good, so good catch!!!
>> >
>> >> Otherwise, I think that there may need to be enough state somewhere so
>> >> that the outermost nested rcu_nmi_enter knows whether to increment
>> >> dynticks. For example, dynticks_nmi_nesting could store the nesting
>> >> count * 2 - (1 if the outermost nested user needs to increment
>> >> dynticks). Something like:
>> >>
>> >> void rcu_nmi_enter(void)
>> >> {
>> >> /* Be very careful -- this function may be called reentrently on the
>> >> same CPU. */
>> >> atomically: increment dynticks if it's even.
>> >>
>> >> /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
>> >> * the state. */
>> >>
>> >> local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
>> >>
>> >> WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
>> >> }
>> >>
>> >> void rcu_nmi_exit(void)
>> >> {
>> >> WARN_ON(!(dynticks & 1));
>> >> locally atomically: dynticks_nmi_nesting -= 2, unless
>> >> dynticks_nmi_nesting == 1, in which case set it to zero
>> >>
>> >> if (dynticks_nmi_nesting was 1)
>> >> atomic_inc(&dynticks);
>> >> }
>> >>
>> >> The invariant here is that, for a single unnested enter/exit, if
>> >> dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
>> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
>> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
>> >> in fact, holds for all invocations, nested or otherwise.
>> >>
>> >> At least one of those conditions is true at all times during the
>> >> execution of outermost pair, starting with the first atomic operation
>> >> and ending with the final atomic_inc. So they nest properly no matter
>> >> what else happens (unless, of course, someone else pokes dynticks in
>> >> the middle).
>> >>
>> >> Thoughts?
>> >
>> > Let's see... The evenness of ->dynticks should be preserved by nested NMI
>> > handlers, so the check and increment need not be atomic. We don't have
>> > any way (other than atomic operations) to do local atomic modifications
>> > on all architectures, because we cannot mask NMIs. (Yes, it can work
>> > on x86, but this is common code that needs to work everywhere.) On the
>> > other hand, presumably NMIs are rare, so atomic modification of the NMI
>> > nesting counter should be OK, at least if it proves absolutely necessary.
>> > And I am thinking that a mechanical proof will be needed here. :-/
>> >
>> > But first, let me try generating the code and informally evaluating it:
>> >
>> > 1 struct rcu_dynticks {
>> > 2 long long dynticks_nesting;
>> > 3 int dynticks_nmi_nesting;
>> > 4 atomic_t dynticks;
>> > 5 };
>> > 6
>> > 7 void rcu_nmi_enter(void)
>> > 8 {
>> > 9 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > 10 int incby = 2;
>> > 11
>> > 12 if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
>> > 13 smp_mb__before_atomic();
>> > 14 atomic_inc(&rdtp->dynticks);
>> > 15 smp_mb__after_atomic();
>> > 16 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > 17 incby = 1;
>>
>> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
>
> That would make sense.
>
>> > 18 }
>> > 19 rdtp->dynticks_nmi_nesting += incby;
>>
>> Oh, I see why you don't need local_add -- it's because an nmi in the
>> middle of this increment won't have any effect on the interrupted
>> code, so even a software RMW will be okay.
>
> Yep! ;-)
>
>> > 20 barrier();
>> > 21 }
>> > 22
>> > 23 void rcu_nmi_exit(void)
>> > 24 {
>> > 25 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > 26
>> > 27 WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
>> > 28 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > 29 if (rdtp->dynticks_nmi_nesting != 1) {
>>
>> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
>
> This is already implied by the WARN_ON_ONCE() on line 27 and the check
> on line 29.

I was worried about negative numbers. Maybe change line 27 to
WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0), then? (Or is it
unsigned? If so, let's make it signed to catch this type of error.)

>
>> > 30 rdtp->dynticks_nmi_nesting -= 2;
>> > 31 return;
>> > 32 }
>> > 33 rdtp->dynticks_nmi_nesting = 0;
>> > 34 smp_mb__before_atomic();
>>
>> This implies barrier(), right?
>
> Yep!
>
>> > 35 atomic_inc(&rdtp->dynticks);
>> > 36 smp_mb__after_atomic();
>> > 37 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>> > 38 }
>> >
>> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
>> > assumes that we don't need to increment ->dynticks.
>> >
>> > Line 12 checks to see if ->dynticks is even. Note that this check is
>> > stable: If there are nested NMIs, they will increment ->dynticks twice
>> > or not at all, and either way preserves the evenness (to be proven, of
>> > course, but that is the plan). If ->dynticks is even, lines 13-15
>> > atomically increment it, line 16 complains if still even, and line 17
>> > says we will increment ->dynticks_nmi_nesting by only 1.
>> >
>> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
>> > line 20 keeps the compiler from getting too cute.
>> >
>> > For rcu_nmi_exit(), line 25 again picks up this CPUs rcu_dynticks
>> > structure. Lines 27 and 28 complain bitterly if invariants are violated.
>> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
>> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
>> >
>> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
>> > atomically increment ->dynticks with full ordering, and line 37
>> > complains bitterly if ->dynticks is not even.
>> >
>> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
>> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
>> > not nested, which is the correct thing for them to think in that case.
>> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
>> > to zero (adding and then subtracting 1). If the NMI happens after the
>> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
>> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
>> > to zero (adding and subtracting two again). If the NMI happens after
>> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
>> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
>> > time to one (again adding and subtracting two).
>> >
>> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
>> > one means we need to atomically increment ->dynticks, and other values
>> > mean that we are partially or fully nested. Reasoning proceeds as for
>> > rcu_nmi_enter(), but in the opposite direction.
>> >
>> > Whew! That might even work.
>>
>> I think I like this, with the warnings above.
>
> OK with dropping the one that I called out as redundant?

Sure, but see above.

>
>> > But how about taking a different approach. Assuming that there can
>> > never be more than (say) 14 nesting NMI-like things, use the lower
>> > four bits of ->dynticks to represent the NMI nesting and the upper
>> > 28 bits as the counter. This of course requires modifying lots of
>> > places in RCU that check the counter, but it is probably time to
>> > abstract the check anyway.
>> >
>> > This would allow my earlier attempted logic to work and (maybe) simplify
>> > the reasoning a bit (and yes, the "magic" constants need macros):
>> >
>> > void rcu_nmi_enter(void)
>> > {
>> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>> > int incby = 0x01;
>> >
>> > WARN_ON_ONCE(nexting == 0xf);
>> > if (nesting == 0) {
>> > if (atomic_read(&rdtp->dynticks) & 0x10)
>> > return;
>> > incby = 0x11;
>> > }
>> > smp_mb__before_atomic();
>> > atomic_add(&rdtp->dynticks, incby);
>> > smp_mb__after_atomic();
>> > WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > }
>> >
>> > void rcu_nmi_exit(void)
>> > {
>> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>> > int incby = 0x0f;
>> >
>> > if (nesting == 0)
>> > return;
>> > if (nesting > 1)
>> > incby = -1;
>> > smp_mb__before_atomic();
>> > atomic_add(&rdtp->dynticks, incby);
>> > smp_mb__after_atomic();
>> > WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>> > }
>> >
>> > Over to you! ;-)
>>
>> This latter one is all you :)
>
> Well, let's see how I feel about it after trying a Promela model of
> the first code sequence. ;-)

:)

Does Promela understand the differences between this type of
reentrancy and real threading?

--Andy

2014-11-24 21:28:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 12:54:41PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
> > On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
> > <[email protected]> wrote:
> > > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> > >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> > >> <[email protected]> wrote:
> > >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> > >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> > >> >> <[email protected]> wrote:
> >
> > > Returning state sounds like a bad idea, if we can reasonably avoid it.
> >
> > I agree, except that we already do it for exception_enter(), etc. But
> > yes, changing fewer things is nice.
> >
> > >
> > > And I think I finally see what you are pointing out about my code: If
> > > another NMI comes in between the time I increment ->dynticks_nmi_nesting
> > > and the time I atomically increment ->dynticks, the nested NMI handler
> > > will incorrectly believe that RCU is already paying attention to this CPU.
> > > Which would indeed not be at all good, so good catch!!!
> > >
> > >> Otherwise, I think that there may need to be enough state somewhere so
> > >> that the outermost nested rcu_nmi_enter knows whether to increment
> > >> dynticks. For example, dynticks_nmi_nesting could store the nesting
> > >> count * 2 - (1 if the outermost nested user needs to increment
> > >> dynticks). Something like:
> > >>
> > >> void rcu_nmi_enter(void)
> > >> {
> > >> /* Be very careful -- this function may be called reentrently on the
> > >> same CPU. */
> > >> atomically: increment dynticks if it's even.
> > >>
> > >> /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
> > >> * the state. */
> > >>
> > >> local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> > >>
> > >> WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> > >> }
> > >>
> > >> void rcu_nmi_exit(void)
> > >> {
> > >> WARN_ON(!(dynticks & 1));
> > >> locally atomically: dynticks_nmi_nesting -= 2, unless
> > >> dynticks_nmi_nesting == 1, in which case set it to zero
> > >>
> > >> if (dynticks_nmi_nesting was 1)
> > >> atomic_inc(&dynticks);
> > >> }
> > >>
> > >> The invariant here is that, for a single unnested enter/exit, if
> > >> dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
> > >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> > >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> > >> in fact, holds for all invocations, nested or otherwise.
> > >>
> > >> At least one of those conditions is true at all times during the
> > >> execution of outermost pair, starting with the first atomic operation
> > >> and ending with the final atomic_inc. So they nest properly no matter
> > >> what else happens (unless, of course, someone else pokes dynticks in
> > >> the middle).
> > >>
> > >> Thoughts?
> > >
> > > Let's see... The evenness of ->dynticks should be preserved by nested NMI
> > > handlers, so the check and increment need not be atomic. We don't have
> > > any way (other than atomic operations) to do local atomic modifications
> > > on all architectures, because we cannot mask NMIs. (Yes, it can work
> > > on x86, but this is common code that needs to work everywhere.) On the
> > > other hand, presumably NMIs are rare, so atomic modification of the NMI
> > > nesting counter should be OK, at least if it proves absolutely necessary.
> > > And I am thinking that a mechanical proof will be needed here. :-/
> > >
> > > But first, let me try generating the code and informally evaluating it:
> > >
> > > 1 struct rcu_dynticks {
> > > 2 long long dynticks_nesting;
> > > 3 int dynticks_nmi_nesting;
> > > 4 atomic_t dynticks;
> > > 5 };
> > > 6
> > > 7 void rcu_nmi_enter(void)
> > > 8 {
> > > 9 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > > 10 int incby = 2;
> > > 11
> > > 12 if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> > > 13 smp_mb__before_atomic();
> > > 14 atomic_inc(&rdtp->dynticks);
> > > 15 smp_mb__after_atomic();
> > > 16 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > > 17 incby = 1;
> >
> > WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
>
> That would make sense.

I take it back. We get here only if there is no lower-level NMI in effect.
I could do WARN_ON_ONCE(rdtp->dynticks_nmi_nesting > 0), though.

Thanx, Paul
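
[Aside: pulling the review together, the enter side would presumably end up looking something like the sketch below -- Paul's code from earlier in the thread plus the warning agreed on here, with ->dynticks_nmi_nesting made signed as Andy suggested so that the exit-side WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0) can catch underflow. A sketch, not the final patch.]

void rcu_nmi_enter(void)
{
        struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
        int incby = 2;

        if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
                smp_mb__before_atomic();
                atomic_inc(&rdtp->dynticks);
                smp_mb__after_atomic();
                WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
                /* Reachable only when no NMI-like handler was already running here. */
                WARN_ON_ONCE(rdtp->dynticks_nmi_nesting > 0);
                incby = 1;
        }
        rdtp->dynticks_nmi_nesting += incby;
        barrier();
}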

> > > 18 }
> > > 19 rdtp->dynticks_nmi_nesting += incby;
> >
> > Oh, I see why you don't need local_add -- it's because an nmi in the
> > middle of this increment won't have any effect on the interrupted
> > code, so even a software RMW will be okay.
>
> Yep! ;-)
>
> > > 20 barrier();
> > > 21 }
> > > 22
> > > 23 void rcu_nmi_exit(void)
> > > 24 {
> > > 25 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > > 26
> > > 27 WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> > > 28 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > > 29 if (rdtp->dynticks_nmi_nesting != 1) {
> >
> > WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
>
> This is already implied by the WARN_ON_ONCE() on line 27 and the check
> on line 29.
>
> > > 30 rdtp->dynticks_nmi_nesting -= 2;
> > > 31 return;
> > > 32 }
> > > 33 rdtp->dynticks_nmi_nesting = 0;
> > > 34 smp_mb__before_atomic();
> >
> > This implies barrier(), right?
>
> Yep!
>
> > > 35 atomic_inc(&rdtp->dynticks);
> > > 36 smp_mb__after_atomic();
> > > 37 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > > 38 }
> > >
> > > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> > > assumes that we don't need to increment ->dynticks.
> > >
> > > Line 12 checks to see if ->dynticks is even. Note that this check is
> > > stable: If there are nested NMIs, they will increment ->dynticks twice
> > > or not at all, and either way preserves the evenness (to be proven, of
> > > course, but that is the plan). If ->dynticks is even, lines 13-15
> > > atomically increment it, line 16 complains if still even, and line 17
> > > says we will increment ->dynticks_nmi_nesting by only 1.
> > >
> > > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> > > line 20 keeps the compiler from getting too cute.
> > >
> > > For rcu_nmi_exit(), line 25 again picks up this CPUs rcu_dynticks
> > > structure. Lines 27 and 28 complain bitterly if invariants are violated.
> > > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> > > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
> > >
> > > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> > > atomically increment ->dynticks with full ordering, and line 37
> > > complains bitterly if ->dynticks is not even.
> > >
> > > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> > > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> > > not nested, which is the correct thing for them to think in that case.
> > > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> > > to zero (adding and then subtracting 1). If the NMI happens after the
> > > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> > > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> > > to zero (adding and subtracting two again). If the NMI happens after
> > > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> > > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> > > time to one (again adding and subtracting two).
> > >
> > > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> > > one means we need to atomically increment ->dynticks, and other values
> > > mean that we are partially or fully nested. Reasoning proceeds as for
> > > rcu_nmi_enter(), but in the opposite direction.
> > >
> > > Whew! That might even work.
> >
> > I think I like this, with the warnings above.
>
> OK with dropping the one that I called out as redundant?
>
> > > But how about taking a different approach. Assuming that there can
> > > never be more than (say) 14 nesting NMI-like things, use the lower
> > > four bits of ->dynticks to represent the NMI nesting and the upper
> > > 28 bits as the counter. This of course requires modifying lots of
> > > places in RCU that check the counter, but it is probably time to
> > > abstract the check anyway.
> > >
> > > This would allow my earlier attempted logic to work and (maybe) simplify
> > > the reasoning a bit (and yes, the "magic" constants need macros):
> > >
> > > void rcu_nmi_enter(void)
> > > {
> > > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > > int incby = 0x01;
> > >
> > > WARN_ON_ONCE(nesting == 0xf);
> > > if (nesting == 0) {
> > > if (atomic_read(&rdtp->dynticks) & 0x10)
> > > return;
> > > incby = 0x11;
> > > }
> > > smp_mb__before_atomic();
> > > atomic_add(incby, &rdtp->dynticks);
> > > smp_mb__after_atomic();
> > > WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > > }
> > >
> > > void rcu_nmi_exit(void)
> > > {
> > > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > > int incby = 0x0f;
> > >
> > > if (nesting == 0)
> > > return;
> > > if (nesting > 1)
> > > incby = -1;
> > > smp_mb__before_atomic();
> > > atomic_add(incby, &rdtp->dynticks);
> > > smp_mb__after_atomic();
> > > WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > > }
> > >
> > > Over to you! ;-)
> >
> > This latter one is all you :)
>
> Well, let's see how I feel about it after trying a Promela model of
> the first code sequence. ;-)
>
> Thanx, Paul
>
> > --Andy
> >
> > >
> > > Thanx, Paul
> > >
> >
> >
> >
> > --
> > Andy Lutomirski
> > AMA Capital Management, LLC
> >

2014-11-24 21:35:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 01:02:51PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 24, 2014 at 12:54 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
> >> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
> >> <[email protected]> wrote:
> >> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> >> >> <[email protected]> wrote:
> >> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> >> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> >> >> >> <[email protected]> wrote:
> >>
> >> > Returning state sounds like a bad idea, if we can reasonably avoid it.
> >>
> >> I agree, except that we already do it for exception_enter(), etc. But
> >> yes, changing fewer things is nice.
> >>
> >> >
> >> > And I think I finally see what you are pointing out about my code: If
> >> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
> >> > and the time I atomically increment ->dynticks, the nested NMI handler
> >> > will incorrectly believe that RCU is already paying attention to this CPU.
> >> > Which would indeed not be at all good, so good catch!!!
> >> >
> >> >> Otherwise, I think that there may need to be enough state somewhere so
> >> >> that the outermost nested rcu_nmi_enter knows whether to increment
> >> >> dynticks. For example, dynticks_nmi_nesting could store the nesting
> >> >> count * 2 - (1 if the outermost nested user needs to increment
> >> >> dynticks). Something like:
> >> >>
> >> >> void rcu_nmi_enter(void)
> >> >> {
> >> >> /* Be very careful -- this function may be called reentrantly on the
> >> >> same CPU. */
> >> >> atomically: increment dynticks if it's even.
> >> >>
> >> >> /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
> >> >> * the state. */
> >> >>
> >> >> local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> >> >>
> >> >> WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> >> >> }
> >> >>
> >> >> void rcu_nmi_exit(void)
> >> >> {
> >> >> WARN_ON(!(dynticks & 1));
> >> >> locally atomically: dynticks_nmi_nesting -= 2, unless
> >> >> dynticks_nmi_nesting == 1, in which case set it to zero
> >> >>
> >> >> if (dynticks_nmi_nesting was 1)
> >> >> atomic_inc(&dynticks);
> >> >> }
> >> >>
> >> >> The invariant here is that, for a single unnested enter/exit, if
> >> >> dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
> >> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> >> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> >> >> in fact, holds for all invocations, nested or otherwise.
> >> >>
> >> >> At least one of those conditions is true at all times during the
> >> >> execution of outermost pair, starting with the first atomic operation
> >> >> and ending with the final atomic_inc. So they nest properly no matter
> >> >> what else happens (unless, of course, someone else pokes dynticks in
> >> >> the middle).
> >> >>
> >> >> Thoughts?
> >> >
> >> > Let's see... The evenness of ->dynticks should be preserved by nested NMI
> >> > handlers, so the check and increment need not be atomic. We don't have
> >> > any way (other than atomic operations) to do local atomic modifications
> >> > on all architectures, because we cannot mask NMIs. (Yes, it can work
> >> > on x86, but this is common code that needs to work everywhere.) On the
> >> > other hand, presumably NMIs are rare, so atomic modification of the NMI
> >> > nesting counter should be OK, at least if it proves absolutely necessary.
> >> > And I am thinking that a mechanical proof will be needed here. :-/
> >> >
> >> > But first, let me try generating the code and informally evaluating it:
> >> >
> >> > 1 struct rcu_dynticks {
> >> > 2 long long dynticks_nesting;
> >> > 3 int dynticks_nmi_nesting;
> >> > 4 atomic_t dynticks;
> >> > 5 };
> >> > 6
> >> > 7 void rcu_nmi_enter(void)
> >> > 8 {
> >> > 9 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> > 10 int incby = 2;
> >> > 11
> >> > 12 if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> >> > 13 smp_mb__before_atomic();
> >> > 14 atomic_inc(&rdtp->dynticks);
> >> > 15 smp_mb__after_atomic();
> >> > 16 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >> > 17 incby = 1;
> >>
> >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
> >
> > That would make sense.
> >
> >> > 18 }
> >> > 19 rdtp->dynticks_nmi_nesting += incby;
> >>
> >> Oh, I see why you don't need local_add -- it's because an nmi in the
> >> middle of this increment won't have any effect on the interrupted
> >> code, so even a software RMW will be okay.
> >
> > Yep! ;-)
> >
> >> > 20 barrier();
> >> > 21 }
> >> > 22
> >> > 23 void rcu_nmi_exit(void)
> >> > 24 {
> >> > 25 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> > 26
> >> > 27 WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> >> > 28 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >> > 29 if (rdtp->dynticks_nmi_nesting != 1) {
> >>
> >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
> >
> > This is already implied by the WARN_ON_ONCE() on line 27 and the check
> > on line 29.
>
> I was worried about negative numbers. Maybe change line 27 to
> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0), then? (Or is it
> unsigned? If so, let's make it signed to catch this type of error.)

Good point, they are signed, so your WARN_ON_ONCE() would work.
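
To make the signedness point concrete, here is a small stand-alone C sketch
(hypothetical, user-space only, with made-up names) showing why the proposed
<= 0 check catches an underflowing ->dynticks_nmi_nesting where the existing
!-style check does not:

#include <stdio.h>

static int dynticks_nmi_nesting;	/* deliberately signed, as in rcu_dynticks */

static void nmi_exit_check(void)
{
	/* Existing check: only catches the counter at exactly zero. */
	if (!dynticks_nmi_nesting)
		printf("old WARN fires (nesting == %d)\n", dynticks_nmi_nesting);
	/* Proposed check: also catches underflow into negative values. */
	if (dynticks_nmi_nesting <= 0)
		printf("new WARN fires (nesting == %d)\n", dynticks_nmi_nesting);
	dynticks_nmi_nesting -= 2;
}

int main(void)
{
	dynticks_nmi_nesting = 2;	/* one non-idle NMI in flight */
	nmi_exit_check();		/* balanced exit: no warning, counter 2 -> 0 */
	nmi_exit_check();		/* unbalanced exit: both checks fire at 0 */
	nmi_exit_check();		/* counter is now -2: only the <= 0 check fires */
	return 0;
}

The second exit trips both checks, but once the counter has gone negative only
the <= 0 form still complains.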

> >> > 30 rdtp->dynticks_nmi_nesting -= 2;
> >> > 31 return;
> >> > 32 }
> >> > 33 rdtp->dynticks_nmi_nesting = 0;
> >> > 34 smp_mb__before_atomic();
> >>
> >> This implies barrier(), right?
> >
> > Yep!
> >
> >> > 35 atomic_inc(&rdtp->dynticks);
> >> > 36 smp_mb__after_atomic();
> >> > 37 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> >> > 38 }
> >> >
> >> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> >> > assumes that we don't need to increment ->dynticks.
> >> >
> >> > Line 12 checks to see if ->dynticks is even. Note that this check is
> >> > stable: If there are nested NMIs, they will increment ->dynticks twice
> >> > or not at all, and either way preserves the evenness (to be proven, of
> >> > course, but that is the plan). If ->dynticks is even, lines 13-15
> >> > atomically increment it, line 16 complains if still even, and line 17
> >> > says we will increment ->dynticks_nmi_nesting by only 1.
> >> >
> >> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> >> > line 20 keeps the compiler from getting too cute.
> >> >
> >> > For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
> >> > structure. Lines 27 and 28 complain bitterly if invariants are violated.
> >> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> >> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
> >> >
> >> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> >> > atomically increment ->dynticks with full ordering, and line 37
> >> > complains bitterly if ->dynticks is not even.
> >> >
> >> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> >> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> >> > not nested, which is the correct thing for them to think in that case.
> >> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> >> > to zero (adding and then subtracting 1). If the NMI happens after the
> >> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> >> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> >> > to zero (adding and subtracting two again). If the NMI happens after
> >> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> >> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> >> > time to one (again adding and subtracting two).
> >> >
> >> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> >> > one means we need to atomically increment ->dynticks, and other values
> >> > mean that we are partially or fully nested. Reasoning proceeds as for
> >> > rcu_nmi_enter(), but in the opposite direction.
> >> >
> >> > Whew! That might even work.
> >>
> >> I think I like this, with the warnings above.
> >
> > OK with dropping the one that I called out as redundant?
>
> Sure, but see above.
>
> >
> >> > But how about taking a different approach. Assuming that there can
> >> > never be more than (say) 14 nesting NMI-like things, use the lower
> >> > four bits of ->dynticks to represent the NMI nesting and the upper
> >> > 28 bits as the counter. This of course requires modifying lots of
> >> > places in RCU that check the counter, but it is probably time to
> >> > abstract the check anyway.
> >> >
> >> > This would allow my earlier attempted logic to work and (maybe) simplify
> >> > the reasoning a bit (and yes, the "magic" constants need macros):
> >> >
> >> > void rcu_nmi_enter(void)
> >> > {
> >> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> >> > int incby = 0x01;
> >> >
> >> > WARN_ON_ONCE(nesting == 0xf);
> >> > if (nesting == 0) {
> >> > if (atomic_read(&rdtp->dynticks) & 0x10)
> >> > return;
> >> > incby = 0x11;
> >> > }
> >> > smp_mb__before_atomic();
> >> > atomic_add(incby, &rdtp->dynticks);
> >> > smp_mb__after_atomic();
> >> > WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >> > }
> >> >
> >> > void rcu_nmi_exit(void)
> >> > {
> >> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> >> > int incby = 0x0f;
> >> >
> >> > if (nesting == 0)
> >> > return;
> >> > if (nesting > 1)
> >> > incby = -1;
> >> > smp_mb__before_atomic();
> >> > atomic_add(incby, &rdtp->dynticks);
> >> > smp_mb__after_atomic();
> >> > WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> >> > }
> >> >
> >> > Over to you! ;-)
> >>
> >> This latter one is all you :)
> >
> > Well, let's see how I feel about it after trying a Promela model of
> > the first code sequence. ;-)
>
> :)
>
> Does Promela understand the differences between this type of
> reentrancy and real threading?

Not as far as I know. But it can be tricked into making this distinction.
One thread just has the Promela code as is, and the other thread has
the same Promela code entirely contained in an atomic block. This means
that the entire second thread must execute at one point in the first
thread, just like an NMI would.

Thanx, Paul

2014-11-24 22:34:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 24, 2014 at 01:02:51PM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 24, 2014 at 12:54 PM, Paul E. McKenney
> > <[email protected]> wrote:
> > > On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
> > >> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
> > >> <[email protected]> wrote:
> > >> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> > >> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> > >> >> <[email protected]> wrote:
> > >> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> > >> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> > >> >> >> <[email protected]> wrote:
> > >>
> > >> > Returning state sounds like a bad idea, if we can reasonably avoid it.
> > >>
> > >> I agree, except that we already do it for exception_enter(), etc. But
> > >> yes, changing fewer things is nice.
> > >>
> > >> >
> > >> > And I think I finally see what you are pointing out about my code: If
> > >> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
> > >> > and the time I atomically increment ->dynticks, the nested NMI handler
> > >> > will incorrectly believe that RCU is already paying attention to this CPU.
> > >> > Which would indeed not be at all good, so good catch!!!
> > >> >
> > >> >> Otherwise, I think that there may need to be enough state somewhere so
> > >> >> that the outermost nested rcu_nmi_enter knows whether to increment
> > >> >> dynticks. For example, dynticks_nmi_nesting could store the nesting
> > >> >> count * 2 - (1 if the outermost nested user needs to increment
> > >> >> dynticks). Something like:
> > >> >>
> > >> >> void rcu_nmi_enter(void)
> > >> >> {
> > >> >> /* Be very careful -- this function may be called reentrantly on the
> > >> >> same CPU. */
> > >> >> atomically: increment dynticks if it's even.
> > >> >>
> > >> >> /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
> > >> >> * the state. */
> > >> >>
> > >> >> local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> > >> >>
> > >> >> WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> > >> >> }
> > >> >>
> > >> >> void rcu_nmi_exit(void)
> > >> >> {
> > >> >> WARN_ON(!(dynticks & 1));
> > >> >> locally atomically: dynticks_nmi_nesting -= 2, unless
> > >> >> dynticks_nmi_nesting == 1, in which case set it to zero
> > >> >>
> > >> >> if (dynticks_nmi_nesting was 1)
> > >> >> atomic_inc(&dynticks);
> > >> >> }
> > >> >>
> > >> >> The invariant here is that, for a single unnested enter/exit, if
> > >> >> dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
> > >> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> > >> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> > >> >> in fact, holds for all invocations, nested or otherwise.
> > >> >>
> > >> >> At least one of those conditions is true at all times during the
> > >> >> execution of outermost pair, starting with the first atomic operation
> > >> >> and ending with the final atomic_inc. So they nest properly no matter
> > >> >> what else happens (unless, of course, someone else pokes dynticks in
> > >> >> the middle).
> > >> >>
> > >> >> Thoughts?
> > >> >
> > >> > Let's see... The evenness of ->dynticks should be preserved by nested NMI
> > >> > handlers, so the check and increment need not be atomic. We don't have
> > >> > any way (other than atomic operations) to do local atomic modifications
> > >> > on all architectures, because we cannot mask NMIs. (Yes, it can work
> > >> > on x86, but this is common code that needs to work everywhere.) On the
> > >> > other hand, presumably NMIs are rare, so atomic modification of the NMI
> > >> > nesting counter should be OK, at least if it proves absolutely necessary.
> > >> > And I am thinking that a mechanical proof will be needed here. :-/
> > >> >
> > >> > But first, let me try generating the code and informally evaluating it:
> > >> >
> > >> > 1 struct rcu_dynticks {
> > >> > 2 long long dynticks_nesting;
> > >> > 3 int dynticks_nmi_nesting;
> > >> > 4 atomic_t dynticks;
> > >> > 5 };
> > >> > 6
> > >> > 7 void rcu_nmi_enter(void)
> > >> > 8 {
> > >> > 9 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >> > 10 int incby = 2;
> > >> > 11
> > >> > 12 if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> > >> > 13 smp_mb__before_atomic();
> > >> > 14 atomic_inc(&rdtp->dynticks);
> > >> > 15 smp_mb__after_atomic();
> > >> > 16 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >> > 17 incby = 1;
> > >>
> > >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
> > >
> > > That would make sense.
> > >
> > >> > 18 }
> > >> > 19 rdtp->dynticks_nmi_nesting += incby;
> > >>
> > >> Oh, I see why you don't need local_add -- it's because an nmi in the
> > >> middle of this increment won't have any effect on the interrupted
> > >> code, so even a software RMW will be okay.
> > >
> > > Yep! ;-)
> > >
> > >> > 20 barrier();
> > >> > 21 }
> > >> > 22
> > >> > 23 void rcu_nmi_exit(void)
> > >> > 24 {
> > >> > 25 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >> > 26
> > >> > 27 WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> > >> > 28 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >> > 29 if (rdtp->dynticks_nmi_nesting != 1) {
> > >>
> > >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
> > >
> > > This is already implied by the WARN_ON_ONCE() on line 27 and the check
> > > on line 29.
> >
> > I was worried about negative numbers. Maybe change line 27 to
> > WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0), then? (Or is it
> > unsigned? If so, let's make it signed to catch this type of error.)
>
> Good point, they are signed, so your WARN_ON_ONCE() would work.
>
> > >> > 30 rdtp->dynticks_nmi_nesting -= 2;
> > >> > 31 return;
> > >> > 32 }
> > >> > 33 rdtp->dynticks_nmi_nesting = 0;
> > >> > 34 smp_mb__before_atomic();
> > >>
> > >> This implies barrier(), right?
> > >
> > > Yep!
> > >
> > >> > 35 atomic_inc(&rdtp->dynticks);
> > >> > 36 smp_mb__after_atomic();
> > >> > 37 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > >> > 38 }
> > >> >
> > >> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> > >> > assumes that we don't need to increment ->dynticks.
> > >> >
> > >> > Line 12 checks to see if ->dynticks is even. Note that this check is
> > >> > stable: If there are nested NMIs, they will increment ->dynticks twice
> > >> > or not at all, and either way preserves the evenness (to be proven, of
> > >> > course, but that is the plan). If ->dynticks is even, lines 13-15
> > >> > atomically increment it, line 16 complains if still even, and line 17
> > >> > says we will increment ->dynticks_nmi_nesting by only 1.
> > >> >
> > >> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> > >> > line 20 keeps the compiler from getting too cute.
> > >> >
> > >> > For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
> > >> > structure. Lines 27 and 28 complain bitterly if invariants are violated.
> > >> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> > >> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
> > >> >
> > >> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> > >> > atomically increment ->dynticks with full ordering, and line 37
> > >> > complains bitterly if ->dynticks is not even.
> > >> >
> > >> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> > >> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> > >> > not nested, which is the correct thing for them to think in that case.
> > >> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> > >> > to zero (adding and then subtracting 1). If the NMI happens after the
> > >> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> > >> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> > >> > to zero (adding and subtracting two again). If the NMI happens after
> > >> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> > >> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> > >> > time to one (again adding and subtracting two).
> > >> >
> > >> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> > >> > one means we need to atomically increment ->dynticks, and other values
> > >> > mean that we are partially or fully nested. Reasoning proceeds as for
> > >> > rcu_nmi_enter(), but in the opposite direction.
> > >> >
> > >> > Whew! That might even work.
> > >>
> > >> I think I like this, with the warnings above.
> > >
> > > OK with dropping the one that I called out as redundant?
> >
> > Sure, but see above.
> >
> > >
> > >> > But how about taking a different approach. Assuming that there can
> > >> > never be more than (say) 14 nesting NMI-like things, use the lower
> > >> > four bits of ->dynticks to represent the NMI nesting and the upper
> > >> > 28 bits as the counter. This of course requires modifying lots of
> > >> > places in RCU that check the counter, but it is probably time to
> > >> > abstract the check anyway.
> > >> >
> > >> > This would allow my earlier attempted logic to work and (maybe) simplify
> > >> > the reasoning a bit (and yes, the "magic" constants need macros):
> > >> >
> > >> > void rcu_nmi_enter(void)
> > >> > {
> > >> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > >> > int incby = 0x01;
> > >> >
> > >> > WARN_ON_ONCE(nesting == 0xf);
> > >> > if (nesting == 0) {
> > >> > if (atomic_read(&rdtp->dynticks) & 0x10)
> > >> > return;
> > >> > incby = 0x11;
> > >> > }
> > >> > smp_mb__before_atomic();
> > >> > atomic_add(incby, &rdtp->dynticks);
> > >> > smp_mb__after_atomic();
> > >> > WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >> > }
> > >> >
> > >> > void rcu_nmi_exit(void)
> > >> > {
> > >> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > >> > int incby = 0x0f;
> > >> >
> > >> > if (nesting == 0)
> > >> > return;
> > >> > if (nesting > 1)
> > >> > incby = -1;
> > >> > smp_mb__before_atomic();
> > >> > atomic_add(incby, &rdtp->dynticks);
> > >> > smp_mb__after_atomic();
> > >> > WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > >> > }
> > >> >
> > >> > Over to you! ;-)
> > >>
> > >> This latter one is all you :)
> > >
> > > Well, let's see how I feel about it after trying a Promela model of
> > > the first code sequence. ;-)
> >
> > :)
> >
> > Does Promela understand the differences between this type of
> > reentrancy and real threading?
>
> Not as far as I know. But it can be tricked into making this distinction.
> One thread just has the Promela code as is, and the other thread has
> the same Promela code entirely contained in an atomic block. This means
> that the entire second thread must execute at one point in the first
> thread, just like an NMI would.

And the following Promela model claims that your approach works.
Should I trust it? ;-)

Thanx, Paul

------------------------------------------------------------------------

/*
* Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
* that allows nesting.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, you can access it online at
* http://www.gnu.org/licenses/gpl-2.0.html.
*
* Copyright IBM Corporation, 2014
*
* Author: Paul E. McKenney <[email protected]>
*/

byte dynticks_nesting;
byte dynticks_nmi_nesting;
byte dynticks;
byte busy;

/*
* Promela version of rcu_nmi_enter().
*/
inline rcu_nmi_enter()
{
assert(dynticks_nmi_nesting >= 0);
if
:: (dynticks & 1) == 0 ->
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 1);
dynticks_nmi_nesting = dynticks_nmi_nesting + 1;
assert(dynticks_nmi_nesting >= 1);
:: else ->
dynticks_nmi_nesting = dynticks_nmi_nesting + 2;
fi;
}

/*
* Promela version of rcu_nmi_exit().
*/
inline rcu_nmi_exit()
{
assert(dynticks_nmi_nesting > 0);
assert((dynticks & 1) != 0);
if
:: dynticks_nmi_nesting != 1 ->
dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
:: else ->
dynticks_nmi_nesting = 0;
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 0);
fi;
}

/*
* Base-level NMI runs non-atomically. Crudely emulates process-level
* dynticks-idle entry/exit.
*/
proctype base_NMI()
{
do
:: if
:: 1 -> atomic {
dynticks = dynticks + 1;
}
busy = 0;
:: 1 -> skip;
fi;
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
if
:: busy -> skip;
:: !busy ->
atomic {
dynticks = dynticks + 1;
}
busy = 1;
fi;
od;
}

/*
* Nested NMI runs atomically to emulate interrupting base_level().
*/
proctype nested_NMI()
{
do
:: atomic {
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
}
od;
}

init {
dynticks_nesting = 0;
dynticks_nmi_nesting = 0;
dynticks = 0;
busy = 0;
run base_NMI();
run nested_NMI();
}

2014-11-24 22:36:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
<[email protected]> wrote:
> On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
>> On Mon, Nov 24, 2014 at 01:02:51PM -0800, Andy Lutomirski wrote:
>> > On Mon, Nov 24, 2014 at 12:54 PM, Paul E. McKenney
>> > <[email protected]> wrote:
>> > > On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
>> > >> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
>> > >> <[email protected]> wrote:
>> > >> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
>> > >> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
>> > >> >> <[email protected]> wrote:
>> > >> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
>> > >> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
>> > >> >> >> <[email protected]> wrote:
>> > >>
>> > >> > Returning state sounds like a bad idea, if we can reasonably avoid it.
>> > >>
>> > >> I agree, except that we already do it for exception_enter(), etc. But
>> > >> yes, changing fewer things is nice.
>> > >>
>> > >> >
>> > >> > And I think I finally see what you are pointing out about my code: If
>> > >> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
>> > >> > and the time I atomically increment ->dynticks, the nested NMI handler
>> > >> > will incorrectly believe that RCU is already paying attention to this CPU.
>> > >> > Which would indeed not be at all good, so good catch!!!
>> > >> >
>> > >> >> Otherwise, I think that there may need to be enough state somewhere so
>> > >> >> that the outermost nested rcu_nmi_enter knows whether to increment
>> > >> >> dynticks. For example, dynticks_nmi_nesting could store the nesting
>> > >> >> count * 2 - (1 if the outermost nested user needs to increment
>> > >> >> dynticks). Something like:
>> > >> >>
>> > >> >> void rcu_nmi_enter(void)
>> > >> >> {
>> > >> >> /* Be very careful -- this function may be called reentrantly on the
>> > >> >> same CPU. */
>> > >> >> atomically: increment dynticks if it's even.
>> > >> >>
>> > >> >> /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
>> > >> >> * the state. */
>> > >> >>
>> > >> >> local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
>> > >> >>
>> > >> >> WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
>> > >> >> }
>> > >> >>
>> > >> >> void rcu_nmi_exit(void)
>> > >> >> {
>> > >> >> WARN_ON(!(dynticks & 1));
>> > >> >> locally atomically: dynticks_nmi_nesting -= 2, unless
>> > >> >> dynticks_nmi_nesting == 1, in which case set it to zero
>> > >> >>
>> > >> >> if (dynticks_nmi_nesting was 1)
>> > >> >> atomic_inc(&dynticks);
>> > >> >> }
>> > >> >>
>> > >> >> The invariant here is that, for a single unnested enter/exit, if
>> > >> >> dynticks_nmi_nesting != 0, then dynticks is odd. As a result, an
>> > >> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
>> > >> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
>> > >> >> in fact, holds for all invocations, nested or otherwise.
>> > >> >>
>> > >> >> At least one of those conditions is true at all times during the
>> > >> >> execution of outermost pair, starting with the first atomic operation
>> > >> >> and ending with the final atomic_inc. So they nest properly no matter
>> > >> >> what else happens (unless, of course, someone else pokes dynticks in
>> > >> >> the middle).
>> > >> >>
>> > >> >> Thoughts?
>> > >> >
>> > >> > Let's see... The evenness of ->dynticks should be preserved by nested NMI
>> > >> > handlers, so the check and increment need not be atomic. We don't have
>> > >> > any way (other than atomic operations) to do local atomic modifications
>> > >> > on all architectures, because we cannot mask NMIs. (Yes, it can work
>> > >> > on x86, but this is common code that needs to work everywhere.) On the
>> > >> > other hand, presumably NMIs are rare, so atomic modification of the NMI
>> > >> > nesting counter should be OK, at least if it proves absolutely necessary.
>> > >> > And I am thinking that a mechanical proof will be needed here. :-/
>> > >> >
>> > >> > But first, let me try generating the code and informally evaluating it:
>> > >> >
>> > >> > 1 struct rcu_dynticks {
>> > >> > 2 long long dynticks_nesting;
>> > >> > 3 int dynticks_nmi_nesting;
>> > >> > 4 atomic_t dynticks;
>> > >> > 5 };
>> > >> > 6
>> > >> > 7 void rcu_nmi_enter(void)
>> > >> > 8 {
>> > >> > 9 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > >> > 10 int incby = 2;
>> > >> > 11
>> > >> > 12 if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
>> > >> > 13 smp_mb__before_atomic();
>> > >> > 14 atomic_inc(&rdtp->dynticks);
>> > >> > 15 smp_mb__after_atomic();
>> > >> > 16 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > >> > 17 incby = 1;
>> > >>
>> > >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
>> > >
>> > > That would make sense.
>> > >
>> > >> > 18 }
>> > >> > 19 rdtp->dynticks_nmi_nesting += incby;
>> > >>
>> > >> Oh, I see why you don't need local_add -- it's because an nmi in the
>> > >> middle of this increment won't have any effect on the interrupted
>> > >> code, so even a software RMW will be okay.
>> > >
>> > > Yep! ;-)
>> > >
>> > >> > 20 barrier();
>> > >> > 21 }
>> > >> > 22
>> > >> > 23 void rcu_nmi_exit(void)
>> > >> > 24 {
>> > >> > 25 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > >> > 26
>> > >> > 27 WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
>> > >> > 28 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > >> > 29 if (rdtp->dynticks_nmi_nesting != 1) {
>> > >>
>> > >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
>> > >
>> > > This is already implied by the WARN_ON_ONCE() on line 27 and the check
>> > > on line 29.
>> >
>> > I was worried about negative numbers. Maybe change line 27 to
>> > WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0), then? (Or is it
>> > unsigned? If so, let's make it signed to catch this type of error.)
>>
>> Good point, they are signed, so your WARN_ON_ONCE() would work.
>>
>> > >> > 30 rdtp->dynticks_nmi_nesting -= 2;
>> > >> > 31 return;
>> > >> > 32 }
>> > >> > 33 rdtp->dynticks_nmi_nesting = 0;
>> > >> > 34 smp_mb__before_atomic();
>> > >>
>> > >> This implies barrier(), right?
>> > >
>> > > Yep!
>> > >
>> > >> > 35 atomic_inc(&rdtp->dynticks);
>> > >> > 36 smp_mb__after_atomic();
>> > >> > 37 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>> > >> > 38 }
>> > >> >
>> > >> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
>> > >> > assumes that we don't need to increment ->dynticks.
>> > >> >
>> > >> > Line 12 checks to see if ->dynticks is even. Note that this check is
>> > >> > stable: If there are nested NMIs, they will increment ->dynticks twice
>> > >> > or not at all, and either way preserves the evenness (to be proven, of
>> > >> > course, but that is the plan). If ->dynticks is even, lines 13-15
>> > >> > atomically increment it, line 16 complains if still even, and line 17
>> > >> > says we will increment ->dynticks_nmi_nesting by only 1.
>> > >> >
>> > >> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
>> > >> > line 20 keeps the compiler from getting too cute.
>> > >> >
>> > >> > For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
>> > >> > structure. Lines 27 and 28 complain bitterly if invariants are violated.
>> > >> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
>> > >> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
>> > >> >
>> > >> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
>> > >> > atomically increment ->dynticks with full ordering, and line 37
>> > >> > complains bitterly if ->dynticks is not even.
>> > >> >
>> > >> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
>> > >> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
>> > >> > not nested, which is the correct thing for them to think in that case.
>> > >> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
>> > >> > to zero (adding and then subtracting 1). If the NMI happens after the
>> > >> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
>> > >> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
>> > >> > to zero (adding and subtracting two again). If the NMI happens after
>> > >> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
>> > >> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
>> > >> > time to one (again adding and subtracting two).
>> > >> >
>> > >> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
>> > >> > one means we need to atomically increment ->dynticks, and other values
>> > >> > mean that we are partially or fully nested. Reasoning proceeds as for
>> > >> > rcu_nmi_enter(), but in the opposite direction.
>> > >> >
>> > >> > Whew! That might even work.
>> > >>
>> > >> I think I like this, with the warnings above.
>> > >
>> > > OK with dropping the one that I called out as redundant?
>> >
>> > Sure, but see above.
>> >
>> > >
>> > >> > But how about taking a different approach. Assuming that there can
>> > >> > never be more than (say) 14 nesting NMI-like things, use the lower
>> > >> > four bits of ->dynticks to represent the NMI nesting and the upper
>> > >> > 28 bits as the counter. This of course requires modifying lots of
>> > >> > places in RCU that check the counter, but it is probably time to
>> > >> > abstract the check anyway.
>> > >> >
>> > >> > This would allow my earlier attempted logic to work and (maybe) simplify
>> > >> > the reasoning a bit (and yes, the "magic" constants need macros):
>> > >> >
>> > >> > void rcu_nmi_enter(void)
>> > >> > {
>> > >> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > >> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>> > >> > int incby = 0x01;
>> > >> >
>> > >> > WARN_ON_ONCE(nesting == 0xf);
>> > >> > if (nesting == 0) {
>> > >> > if (atomic_read(&rdtp->dynticks) & 0x10)
>> > >> > return;
>> > >> > incby = 0x11;
>> > >> > }
>> > >> > smp_mb__before_atomic();
>> > >> > atomic_add(incby, &rdtp->dynticks);
>> > >> > smp_mb__after_atomic();
>> > >> > WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > >> > }
>> > >> >
>> > >> > void rcu_nmi_exit(void)
>> > >> > {
>> > >> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > >> > int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>> > >> > int incby = 0x0f;
>> > >> >
>> > >> > if (nesting == 0)
>> > >> > return;
>> > >> > if (nesting > 1)
>> > >> > incby = -1;
>> > >> > smp_mb__before_atomic();
>> > >> > atomic_add(incby, &rdtp->dynticks);
>> > >> > smp_mb__after_atomic();
>> > >> > WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>> > >> > }
>> > >> >
>> > >> > Over to you! ;-)
>> > >>
>> > >> This latter one is all you :)
>> > >
>> > > Well, let's see how I feel about it after trying a Promela model of
>> > > the first code sequence. ;-)
>> >
>> > :)
>> >
>> > Does Promela understand the differences between this type of
>> > reentrancy and real threading?
>>
>> Not as far as I know. But it can be tricked into making this distinction.
>> One thread just has the Promela code as is, and the other thread has
>> the same Promela code entirely contained in an atomic block. This means
>> that the entire second thread must execute at one point in the first
>> thread, just like an NMI would.
>
> And the following Promela model claims that your approach works.
> Should I trust it? ;-)
>

I think so.

Want to write a patch? If so, whose tree should it go in? I can add
it to my IST series, but that seems a bit odd.

--Andy

> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> /*
> * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
> * that allows nesting.
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> * (at your option) any later version.
> *
> * This program is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU General Public License for more details.
> *
> * You should have received a copy of the GNU General Public License
> * along with this program; if not, you can access it online at
> * http://www.gnu.org/licenses/gpl-2.0.html.
> *
> * Copyright IBM Corporation, 2014
> *
> * Author: Paul E. McKenney <[email protected]>
> */
>
> byte dynticks_nesting;
> byte dynticks_nmi_nesting;
> byte dynticks;
> byte busy;
>
> /*
> * Promela version of rcu_nmi_enter().
> */
> inline rcu_nmi_enter()
> {
> assert(dynticks_nmi_nesting >= 0);
> if
> :: (dynticks & 1) == 0 ->
> atomic {
> dynticks = dynticks + 1;
> }
> assert((dynticks & 1) == 1);
> dynticks_nmi_nesting = dynticks_nmi_nesting + 1;
> assert(dynticks_nmi_nesting >= 1);
> :: else ->
> dynticks_nmi_nesting = dynticks_nmi_nesting + 2;
> fi;
> }
>
> /*
> * Promela version of rcu_nmi_exit().
> */
> inline rcu_nmi_exit()
> {
> assert(dynticks_nmi_nesting > 0);
> assert((dynticks & 1) != 0);
> if
> :: dynticks_nmi_nesting != 1 ->
> dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
> :: else ->
> dynticks_nmi_nesting = 0;
> atomic {
> dynticks = dynticks + 1;
> }
> assert((dynticks & 1) == 0);
> fi;
> }
>
> /*
> * Base-level NMI runs non-atomically. Crudely emulates process-level
> * dynticks-idle entry/exit.
> */
> proctype base_NMI()
> {
> do
> :: if
> :: 1 -> atomic {
> dynticks = dynticks + 1;
> }
> busy = 0;
> :: 1 -> skip;
> fi;
> rcu_nmi_enter();
> assert((dynticks & 1) == 1);
> rcu_nmi_exit();
> if
> :: busy -> skip;
> :: !busy ->
> atomic {
> dynticks = dynticks + 1;
> }
> busy = 1;
> fi;
> od;
> }
>
> /*
> * Nested NMI runs atomically to emulate interrupting base_level().
> */
> proctype nested_NMI()
> {
> do
> :: atomic {
> rcu_nmi_enter();
> assert((dynticks & 1) == 1);
> rcu_nmi_exit();
> }
> od;
> }
>
> init {
> dynticks_nesting = 0;
> dynticks_nmi_nesting = 0;
> dynticks = 0;
> busy = 0;
> run base_NMI();
> run nested_NMI();
> }
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-24 22:58:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:

[ . . . ]

> > And the following Promela model claims that your approach works.
> > Should I trust it? ;-)
> >
>
> I think so.
>
> Want to write a patch? If so, whose tree should it go in? I can add
> it to my IST series, but that seems a bit odd.

Working on it. ;-)

Thanx, Paul

> --Andy
>
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > /*
> > * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
> > * that allows nesting.
> > *
> > * This program is free software; you can redistribute it and/or modify
> > * it under the terms of the GNU General Public License as published by
> > * the Free Software Foundation; either version 2 of the License, or
> > * (at your option) any later version.
> > *
> > * This program is distributed in the hope that it will be useful,
> > * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > * GNU General Public License for more details.
> > *
> > * You should have received a copy of the GNU General Public License
> > * along with this program; if not, you can access it online at
> > * http://www.gnu.org/licenses/gpl-2.0.html.
> > *
> > * Copyright IBM Corporation, 2014
> > *
> > * Author: Paul E. McKenney <[email protected]>
> > */
> >
> > byte dynticks_nesting;
> > byte dynticks_nmi_nesting;
> > byte dynticks;
> > byte busy;
> >
> > /*
> > * Promela version of rcu_nmi_enter().
> > */
> > inline rcu_nmi_enter()
> > {
> > assert(dynticks_nmi_nesting >= 0);
> > if
> > :: (dynticks & 1) == 0 ->
> > atomic {
> > dynticks = dynticks + 1;
> > }
> > assert((dynticks & 1) == 1);
> > dynticks_nmi_nesting = dynticks_nmi_nesting + 1;
> > assert(dynticks_nmi_nesting >= 1);
> > :: else ->
> > dynticks_nmi_nesting = dynticks_nmi_nesting + 2;
> > fi;
> > }
> >
> > /*
> > * Promela version of rcu_nmi_exit().
> > */
> > inline rcu_nmi_exit()
> > {
> > assert(dynticks_nmi_nesting > 0);
> > assert((dynticks & 1) != 0);
> > if
> > :: dynticks_nmi_nesting != 1 ->
> > dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
> > :: else ->
> > dynticks_nmi_nesting = 0;
> > atomic {
> > dynticks = dynticks + 1;
> > }
> > assert((dynticks & 1) == 0);
> > fi;
> > }
> >
> > /*
> > * Base-level NMI runs non-atomically. Crudely emulates process-level
> > * dynticks-idle entry/exit.
> > */
> > proctype base_NMI()
> > {
> > do
> > :: if
> > :: 1 -> atomic {
> > dynticks = dynticks + 1;
> > }
> > busy = 0;
> > :: 1 -> skip;
> > fi;
> > rcu_nmi_enter();
> > assert((dynticks & 1) == 1);
> > rcu_nmi_exit();
> > if
> > :: busy -> skip;
> > :: !busy ->
> > atomic {
> > dynticks = dynticks + 1;
> > }
> > busy = 1;
> > fi;
> > od;
> > }
> >
> > /*
> > * Nested NMI runs atomically to emulate interrupting base_level().
> > */
> > proctype nested_NMI()
> > {
> > do
> > :: atomic {
> > rcu_nmi_enter();
> > assert((dynticks & 1) == 1);
> > rcu_nmi_exit();
> > }
> > od;
> > }
> >
> > init {
> > dynticks_nesting = 0;
> > dynticks_nmi_nesting = 0;
> > dynticks = 0;
> > busy = 0;
> > run base_NMI();
> > run nested_NMI();
> > }
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
>

2014-11-24 23:31:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
> > <[email protected]> wrote:
> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
>
> [ . . . ]
>
> > > And the following Promela model claims that your approach works.
> > > Should I trust it? ;-)
> > >
> >
> > I think so.
> >
> > Want to write a patch? If so, whose tree should it go in? I can add
> > it to my IST series, but that seems a bit odd.
>
> Working on it. ;-)

And here is a sneak preview of the patch. Thoughts?

Thanx, Paul

------------------------------------------------------------------------

rcu: Make rcu_nmi_enter() handle nesting

Andy Lutomirski is introducing ISTs into x86, which from RCU's
viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter() and
rcu_nmi_exit() must now correctly handle nesting. This commit therefore
introduces nesting, using a clever NMI-coordination algorithm suggested
by Andy. The trick is to atomically increment ->dynticks (if needed)
before manipulating ->dynticks_nmi_nesting on entry (and, accordingly,
after on exit). In addition, ->dynticks_nmi_nesting is incremented by
one if ->dynticks was incremented and by two otherwise. This means that
when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal to one, it knows
that ->dynticks must be atomically incremented.
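
For a rough, stand-alone illustration of that bookkeeping (user-space C with
made-up helper names, separate from the patch below), a nested NMI taken on
top of an NMI that interrupted an RCU-idle CPU walks the counters as follows:

#include <stdio.h>

static int dynticks;			/* even == RCU-idle, as above */
static int dynticks_nmi_nesting;

static void nmi_enter(void)
{
	int incby = 2;

	if (!(dynticks & 0x1)) {	/* CPU was RCU-idle */
		dynticks++;
		incby = 1;
	}
	dynticks_nmi_nesting += incby;
}

static void nmi_exit(void)
{
	if (dynticks_nmi_nesting != 1) {
		dynticks_nmi_nesting -= 2;
		return;
	}
	dynticks_nmi_nesting = 0;
	dynticks++;			/* back to RCU-idle */
}

int main(void)
{
	nmi_enter();	/* outer NMI from idle: dynticks 0->1, nesting 0->1 */
	nmi_enter();	/* nested NMI:          nesting 1->3 */
	nmi_exit();	/* nested exit:         nesting 3->1 */
	nmi_exit();	/* outer exit:          nesting 1->0, dynticks 1->2 */
	printf("dynticks=%d nesting=%d\n", dynticks, dynticks_nmi_nesting);
	return 0;
}

Compiled and run, this prints dynticks=2 nesting=0: the CPU ends up RCU-idle
again, with the nested pair leaving ->dynticks untouched and the outer pair
accounting for both increments.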

This NMI-coordination algorithm has been validated by the following
Promela model, for whatever that might be worth:

/*
* Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
* that allows nesting.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, you can access it online at
* http://www.gnu.org/licenses/gpl-2.0.html.
*
* Copyright IBM Corporation, 2014
*
* Author: Paul E. McKenney <[email protected]>
*/

byte dynticks_nesting = 0;
byte dynticks_nmi_nesting = 0;
byte dynticks = 0;

/*
* Promela version of rcu_nmi_enter().
*/
inline rcu_nmi_enter()
{
byte incby;

incby = 2;
assert(dynticks_nmi_nesting >= 0);
if
:: (dynticks & 1) == 0 ->
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 1);
incby = 1;
:: else ->
skip;
fi;
dynticks_nmi_nesting = dynticks_nmi_nesting + incby;
assert(dynticks_nmi_nesting >= 1);
}

/*
* Promela version of rcu_nmi_exit().
*/
inline rcu_nmi_exit()
{
assert(dynticks_nmi_nesting > 0);
assert((dynticks & 1) != 0);
if
:: dynticks_nmi_nesting != 1 ->
dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
:: else ->
dynticks_nmi_nesting = 0;
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 0);
fi;
}

/*
* Base-level NMI runs non-atomically. Crudely emulates process-level
* dynticks-idle entry/exit.
*/
proctype base_NMI()
{
byte busy;

busy = 0;
do
:: if
:: 1 -> atomic {
dynticks = dynticks + 1;
}
busy = 0;
:: 1 -> skip;
fi;
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
if
:: busy -> skip;
:: !busy ->
atomic {
dynticks = dynticks + 1;
}
busy = 1;
fi;
od;
}

/*
* Nested NMI runs atomically to emulate interrupting base_level().
*/
proctype nested_NMI()
{
do
:: atomic {
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
}
od;
}

init {
run base_NMI();
run nested_NMI();
}

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8749f43f3f05..fc0236992655 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -759,39 +759,71 @@ void rcu_irq_enter(void)
/**
* rcu_nmi_enter - inform RCU of entry to NMI context
*
- * If the CPU was idle with dynamic ticks active, and there is no
- * irq handler running, this updates rdtp->dynticks_nmi to let the
- * RCU grace-period handling know that the CPU is active.
+ * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
+ * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
+ * that the CPU is active. This implementation permits nested NMIs, as
+ * long as the nesting level does not overflow an int. (You will probably
+ * run out of stack space first.)
*/
void rcu_nmi_enter(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int incby = 2;

- if (rdtp->dynticks_nmi_nesting == 0 &&
- (atomic_read(&rdtp->dynticks) & 0x1))
- return;
- rdtp->dynticks_nmi_nesting++;
- smp_mb__before_atomic(); /* Force delay from prior write. */
- atomic_inc(&rdtp->dynticks);
- /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
- smp_mb__after_atomic(); /* See above. */
- WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+ /* Complain about underflow. */
+ WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
+
+ /*
+ * If idle from RCU viewpoint, atomically increment ->dynticks
+ * to mark non-idle and increment ->dynticks_nmi_nesting by one.
+ * Otherwise, increment ->dynticks_nmi_nesting by two. This means
+ * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
+ * to be in the outermost NMI handler that interrupted an RCU-idle
+ * period (observation due to Andy Lutomirski).
+ */
+ if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
+ smp_mb__before_atomic(); /* Force delay from prior write. */
+ atomic_inc(&rdtp->dynticks);
+ /* atomic_inc() before later RCU read-side crit sects */
+ smp_mb__after_atomic(); /* See above. */
+ WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+ incby = 1;
+ }
+ rdtp->dynticks_nmi_nesting += incby;
+ barrier();
}

/**
* rcu_nmi_exit - inform RCU of exit from NMI context
*
- * If the CPU was idle with dynamic ticks active, and there is no
- * irq handler running, this updates rdtp->dynticks_nmi to let the
- * RCU grace-period handling know that the CPU is no longer active.
+ * If we are returning from the outermost NMI handler that interrupted an
+ * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
+ * to let the RCU grace-period handling know that the CPU is back to
+ * being RCU-idle.
*/
void rcu_nmi_exit(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);

- if (rdtp->dynticks_nmi_nesting == 0 ||
- --rdtp->dynticks_nmi_nesting != 0)
+ /*
+ * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
+ * (We are exiting an NMI handler, so RCU better be paying attention
+ * to us!)
+ */
+ WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
+ WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+
+ /*
+ * If the nesting level is not 1, the CPU wasn't RCU-idle, so
+ * leave it in non-RCU-idle state.
+ */
+ if (rdtp->dynticks_nmi_nesting != 1) {
+ rdtp->dynticks_nmi_nesting -= 2;
return;
+ }
+
+ /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
+ rdtp->dynticks_nmi_nesting = 0;
/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
smp_mb__before_atomic(); /* See above. */
atomic_inc(&rdtp->dynticks);

2014-11-24 23:36:19

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 3:31 PM, Paul E. McKenney
<[email protected]> wrote:
> On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
>> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
>> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
>> > <[email protected]> wrote:
>> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
>>
>> [ . . . ]
>>
>> > > And the following Promela model claims that your approach works.
>> > > Should I trust it? ;-)
>> > >
>> >
>> > I think so.
>> >
>> > Want to write a patch? If so, whose tree should it go in? I can add
>> > it to my IST series, but that seems a bit odd.
>>
>> Working on it. ;-)
>
> And here is a sneak preview of the patch. Thoughts?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> rcu: Make rcu_nmi_enter() handle nesting
>
> Andy Lutomirski is introducing ISTs into x86, which from RCU's
> viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter() and
> rcu_nmi_exit() must now correctly handle nesting.

You must not be a frequent reader of entry_64.S and the Intel SDM :)
IOW, IST is just a stack switching mechanism, and these interrupts
have been around forever -- they're just buggy right now.

How about:

x86 has multiple types of NMI-like interrupts: real NMIs, machine
checks, and, for some values of NMI-like, debugging and breakpoint
interrupts. These interrupts can nest inside each other. Andy
Lutomirski is adding RCU support to these interrupts, so
rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.

Other than that, I like it.

> This commit therefore
> introduces nesting, using a clever NMI-coordination algorithm suggested
> by Andy. The trick is to atomically increment ->dynticks (if needed)
> before manipulating ->dynticks_nmi_nesting on entry (and, accordingly,
> after on exit). In addition, ->dynticks_nmi_nesting is incremented by
> one if ->dynticks was incremented and by two otherwise. This means that
> when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal to one, it knows
> that ->dynticks must be atomically incremented.
>
> This NMI-coordination algorithm has been validated by the following
> Promela model, for whatever that might be worth:
>
> /*
> * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
> * that allows nesting.
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> * (at your option) any later version.
> *
> * This program is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU General Public License for more details.
> *
> * You should have received a copy of the GNU General Public License
> * along with this program; if not, you can access it online at
> * http://www.gnu.org/licenses/gpl-2.0.html.
> *
> * Copyright IBM Corporation, 2014
> *
> * Author: Paul E. McKenney <[email protected]>
> */
>
> byte dynticks_nesting = 0;
> byte dynticks_nmi_nesting = 0;
> byte dynticks = 0;
>
> /*
> * Promela version of rcu_nmi_enter().
> */
> inline rcu_nmi_enter()
> {
> byte incby;
>
> incby = 2;
> assert(dynticks_nmi_nesting >= 0);
> if
> :: (dynticks & 1) == 0 ->
> atomic {
> dynticks = dynticks + 1;
> }
> assert((dynticks & 1) == 1);
> incby = 1;
> :: else ->
> skip;
> fi;
> dynticks_nmi_nesting = dynticks_nmi_nesting + incby;
> assert(dynticks_nmi_nesting >= 1);
> }
>
> /*
> * Promela version of rcu_nmi_exit().
> */
> inline rcu_nmi_exit()
> {
> assert(dynticks_nmi_nesting > 0);
> assert((dynticks & 1) != 0);
> if
> :: dynticks_nmi_nesting != 1 ->
> dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
> :: else ->
> dynticks_nmi_nesting = 0;
> atomic {
> dynticks = dynticks + 1;
> }
> assert((dynticks & 1) == 0);
> fi;
> }
>
> /*
> * Base-level NMI runs non-atomically. Crudely emulates process-level
> * dynticks-idle entry/exit.
> */
> proctype base_NMI()
> {
> byte busy;
>
> busy = 0;
> do
> :: if
> :: 1 -> atomic {
> dynticks = dynticks + 1;
> }
> busy = 0;
> :: 1 -> skip;
> fi;
> rcu_nmi_enter();
> assert((dynticks & 1) == 1);
> rcu_nmi_exit();
> if
> :: busy -> skip;
> :: !busy ->
> atomic {
> dynticks = dynticks + 1;
> }
> busy = 1;
> fi;
> od;
> }
>
> /*
> * Nested NMI runs atomically to emulate interrupting base_level().
> */
> proctype nested_NMI()
> {
> do
> :: atomic {
> rcu_nmi_enter();
> assert((dynticks & 1) == 1);
> rcu_nmi_exit();
> }
> od;
> }
>
> init {
> run base_NMI();
> run nested_NMI();
> }
>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8749f43f3f05..fc0236992655 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -759,39 +759,71 @@ void rcu_irq_enter(void)
> /**
> * rcu_nmi_enter - inform RCU of entry to NMI context
> *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is active.
> + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> + * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> + * that the CPU is active. This implementation permits nested NMIs, as
> + * long as the nesting level does not overflow an int. (You will probably
> + * run out of stack space first.)
> */
> void rcu_nmi_enter(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> + int incby = 2;
>
> - if (rdtp->dynticks_nmi_nesting == 0 &&
> - (atomic_read(&rdtp->dynticks) & 0x1))
> - return;
> - rdtp->dynticks_nmi_nesting++;
> - smp_mb__before_atomic(); /* Force delay from prior write. */
> - atomic_inc(&rdtp->dynticks);
> - /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> - smp_mb__after_atomic(); /* See above. */
> - WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> + /* Complain about underflow. */
> + WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> +
> + /*
> + * If idle from RCU viewpoint, atomically increment ->dynticks
> + * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> + * Otherwise, increment ->dynticks_nmi_nesting by two. This means
> + * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> + * to be in the outermost NMI handler that interrupted an RCU-idle
> + * period (observation due to Andy Lutomirski).
> + */
> + if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> + smp_mb__before_atomic(); /* Force delay from prior write. */
> + atomic_inc(&rdtp->dynticks);
> + /* atomic_inc() before later RCU read-side crit sects */
> + smp_mb__after_atomic(); /* See above. */
> + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> + incby = 1;
> + }
> + rdtp->dynticks_nmi_nesting += incby;
> + barrier();
> }
>
> /**
> * rcu_nmi_exit - inform RCU of exit from NMI context
> *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is no longer active.
> + * If we are returning from the outermost NMI handler that interrupted an
> + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> + * to let the RCU grace-period handling know that the CPU is back to
> + * being RCU-idle.
> */
> void rcu_nmi_exit(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>
> - if (rdtp->dynticks_nmi_nesting == 0 ||
> - --rdtp->dynticks_nmi_nesting != 0)
> + /*
> + * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> + * (We are exiting an NMI handler, so RCU better be paying attention
> + * to us!)
> + */
> + WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +
> + /*
> + * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> + * leave it in non-RCU-idle state.
> + */
> + if (rdtp->dynticks_nmi_nesting != 1) {
> + rdtp->dynticks_nmi_nesting -= 2;
> return;
> + }
> +
> + /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> + rdtp->dynticks_nmi_nesting = 0;
> /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
> smp_mb__before_atomic(); /* See above. */
> atomic_inc(&rdtp->dynticks);
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-24 23:51:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 03:35:55PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 24, 2014 at 3:31 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
> >> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
> >> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
> >> > <[email protected]> wrote:
> >> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
> >>
> >> [ . . . ]
> >>
> >> > > And the following Promela model claims that your approach works.
> >> > > Should I trust it? ;-)
> >> > >
> >> >
> >> > I think so.
> >> >
> >> > Want to write a patch? If so, whose tree should it go in? I can add
> >> > it to my IST series, but that seems a bit odd.
> >>
> >> Working on it. ;-)
> >
> > And here is a sneak preview of the patch. Thoughts?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > rcu: Make rcu_nmi_enter() handle nesting
> >
> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
> > viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter() and
> > rcu_nmi_exit() must now correctly handle nesting.
>
> You must not be a frequent reader of entry_64.S and the Intel SDM :)
> IOW, IST is just a stack switching mechanism, and these interrupts
> have been around forever -- they're just buggy right now.
>
> How about:
>
> x86 has multiple types of NMI-like interrupts: real NMIs, machine
> checks, and, for some values of NMI-like, debugging and breakpoint
> interrupts. These interrupts can nest inside each other. Andy
> Lutomirski is adding RCU support to these interrupts, so
> rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
>
> Other than that, I like it.

And here is the updated version. Left to my normal workflow, this would
go into 3.20. Please let me know if you need it earlier, but in that
case, I will need you to test it fairly soon. (I don't have any way
to test nested NMI-like things.)

Thanx, Paul

------------------------------------------------------------------------

rcu: Make rcu_nmi_enter() handle nesting

The x86 architecture has multiple types of NMI-like interrupts: real
NMIs, machine checks, and, for some values of NMI-like, debugging
and breakpoint interrupts. These interrupts can nest inside each
other. Andy Lutomirski is adding RCU support to these interrupts,
so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.

This commit therefore introduces nesting, using a clever NMI-coordination
algorithm suggested by Andy. The trick is to atomically increment
->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
(and, accordingly, after on exit). In addition, ->dynticks_nmi_nesting
is incremented by one if ->dynticks was incremented and by two otherwise.
This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
to one, it knows that ->dynticks must be atomically incremented.
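
For illustration only (this trace is not part of the commit message), here is
how the two counters evolve under that scheme when the CPU starts out RCU-idle
and a second NMI nests inside the outermost one:

/*
 * Illustrative trace, derived from the rules above:
 *
 *   event                     ->dynticks   ->dynticks_nmi_nesting
 *   initially (RCU-idle)       even         0
 *   outermost NMI enters       odd          1   (incby == 1)
 *   nested NMI enters          odd          3   (incby == 2)
 *   nested NMI exits           odd          1   (subtract 2)
 *   outermost NMI exits        even         0   (nesting was 1)
 *
 * Only the outermost entry and exit touch ->dynticks, so the RCU-idle
 * state is restored exactly once, however deeply the NMIs nest.
 */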

This NMI-coordination algorithm has been validated by the following
Promela model, for whatever that might be worth:

/*
* Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
* that allows nesting.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, you can access it online at
* http://www.gnu.org/licenses/gpl-2.0.html.
*
* Copyright IBM Corporation, 2014
*
* Author: Paul E. McKenney <[email protected]>
*/

byte dynticks_nesting = 0;
byte dynticks_nmi_nesting = 0;
byte dynticks = 0;

/*
* Promela version of rcu_nmi_enter().
*/
inline rcu_nmi_enter()
{
byte incby;

incby = 2;
assert(dynticks_nmi_nesting >= 0);
if
:: (dynticks & 1) == 0 ->
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 1);
incby = 1;
:: else ->
skip;
fi;
dynticks_nmi_nesting = dynticks_nmi_nesting + incby;
assert(dynticks_nmi_nesting >= 1);
}

/*
* Promela version of rcu_nmi_exit().
*/
inline rcu_nmi_exit()
{
assert(dynticks_nmi_nesting > 0);
assert((dynticks & 1) != 0);
if
:: dynticks_nmi_nesting != 1 ->
dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
:: else ->
dynticks_nmi_nesting = 0;
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 0);
fi;
}

/*
* Base-level NMI runs non-atomically. Crudely emulates process-level
* dynticks-idle entry/exit.
*/
proctype base_NMI()
{
byte busy;

busy = 0;
do
:: if
:: 1 -> atomic {
dynticks = dynticks + 1;
}
busy = 0;
:: 1 -> skip;
fi;
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
if
:: busy -> skip;
:: !busy ->
atomic {
dynticks = dynticks + 1;
}
busy = 1;
fi;
od;
}

/*
* Nested NMI runs atomically to emulate interrupting base_NMI().
*/
proctype nested_NMI()
{
do
:: atomic {
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
}
od;
}

init {
run base_NMI();
run nested_NMI();
}

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8749f43f3f05..fc0236992655 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -759,39 +759,71 @@ void rcu_irq_enter(void)
/**
* rcu_nmi_enter - inform RCU of entry to NMI context
*
- * If the CPU was idle with dynamic ticks active, and there is no
- * irq handler running, this updates rdtp->dynticks_nmi to let the
- * RCU grace-period handling know that the CPU is active.
+ * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
+ * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
+ * that the CPU is active. This implementation permits nested NMIs, as
+ * long as the nesting level does not overflow an int. (You will probably
+ * run out of stack space first.)
*/
void rcu_nmi_enter(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int incby = 2;

- if (rdtp->dynticks_nmi_nesting == 0 &&
- (atomic_read(&rdtp->dynticks) & 0x1))
- return;
- rdtp->dynticks_nmi_nesting++;
- smp_mb__before_atomic(); /* Force delay from prior write. */
- atomic_inc(&rdtp->dynticks);
- /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
- smp_mb__after_atomic(); /* See above. */
- WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+ /* Complain about underflow. */
+ WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
+
+ /*
+ * If idle from RCU viewpoint, atomically increment ->dynticks
+ * to mark non-idle and increment ->dynticks_nmi_nesting by one.
+ * Otherwise, increment ->dynticks_nmi_nesting by two. This means
+ * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
+ * to be in the outermost NMI handler that interrupted an RCU-idle
+ * period (observation due to Andy Lutomirski).
+ */
+ if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
+ smp_mb__before_atomic(); /* Force delay from prior write. */
+ atomic_inc(&rdtp->dynticks);
+ /* atomic_inc() before later RCU read-side crit sects */
+ smp_mb__after_atomic(); /* See above. */
+ WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+ incby = 1;
+ }
+ rdtp->dynticks_nmi_nesting += incby;
+ barrier();
}

/**
* rcu_nmi_exit - inform RCU of exit from NMI context
*
- * If the CPU was idle with dynamic ticks active, and there is no
- * irq handler running, this updates rdtp->dynticks_nmi to let the
- * RCU grace-period handling know that the CPU is no longer active.
+ * If we are returning from the outermost NMI handler that interrupted an
+ * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
+ * to let the RCU grace-period handling know that the CPU is back to
+ * being RCU-idle.
*/
void rcu_nmi_exit(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);

- if (rdtp->dynticks_nmi_nesting == 0 ||
- --rdtp->dynticks_nmi_nesting != 0)
+ /*
+ * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
+ * (We are exiting an NMI handler, so RCU better be paying attention
+ * to us!)
+ */
+ WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
+ WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+
+ /*
+ * If the nesting level is not 1, the CPU wasn't RCU-idle, so
+ * leave it in non-RCU-idle state.
+ */
+ if (rdtp->dynticks_nmi_nesting != 1) {
+ rdtp->dynticks_nmi_nesting -= 2;
return;
+ }
+
+ /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
+ rdtp->dynticks_nmi_nesting = 0;
/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
smp_mb__before_atomic(); /* See above. */
atomic_inc(&rdtp->dynticks);

2014-11-24 23:52:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 3:50 PM, Paul E. McKenney
<[email protected]> wrote:
> On Mon, Nov 24, 2014 at 03:35:55PM -0800, Andy Lutomirski wrote:
>> On Mon, Nov 24, 2014 at 3:31 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
>> >> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
>> >> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
>> >> > <[email protected]> wrote:
>> >> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
>> >>
>> >> [ . . . ]
>> >>
>> >> > > And the following Promela model claims that your approach works.
>> >> > > Should I trust it? ;-)
>> >> > >
>> >> >
>> >> > I think so.
>> >> >
>> >> > Want to write a patch? If so, whose tree should it go in? I can add
>> >> > it to my IST series, but that seems a bit odd.
>> >>
>> >> Working on it. ;-)
>> >
>> > And here is a sneak preview of the patch. Thoughts?
>> >
>> > Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > rcu: Make rcu_nmi_enter() handle nesting
>> >
>> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
>> > viewpoint are NMIs. Because ISTs and NMIs can nest, rcu_nmi_enter() and
>> > rcu_nmi_exit() must now correctly handle nesting.
>>
>> You must not be a frequent reader of entry_64.S and the Intel SDM :)
>> IOW, IST is just a stack switching mechanism, and these interrupts
>> have been around forever -- they're just buggy right now.
>>
>> How about:
>>
>> x86 has multiple types of NMI-like interrupts: real NMIs, machine
>> checks, and, for some values of NMI-like, debugging and breakpoint
>> interrupts. These interrupts can nest inside each other. Andy
>> Lutomirski is adding RCU support to these interrupts, so
>> rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
>>
>> Other than that, I like it.
>
> And here is the updated version. Left to my normal workflow, this would
> go into 3.20. Please let me know if you need it earlier, but in that
> case, I will need you to test it fairly soon. (I don't have any way
> to test nested NMI-like things.)

Dunno. Tony and Borislav -- when do you want the IST stack switching stuff?

--Andy

>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> rcu: Make rcu_nmi_enter() handle nesting
>
> The x86 architecture has multiple types of NMI-like interrupts: real
> NMIs, machine checks, and, for some values of NMI-like, debugging
> and breakpoint interrupts. These interrupts can nest inside each
> other. Andy Lutomirski is adding RCU support to these interrupts,
> so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
>
> This commit therefore introduces nesting, using a clever NMI-coordination
> algorithm suggested by Andy. The trick is to atomically increment
> ->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
> (and, accordingly, after on exit). In addition, ->dynticks_nmi_nesting
> is incremented by one if ->dynticks was incremented and by two otherwise.
> This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
> to one, it knows that ->dynticks must be atomically incremented.
>
> This NMI-coordination algorithm has been validated by the following
> Promela model, for whatever that might be worth:
>
> /*
> * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
> * that allows nesting.
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> * (at your option) any later version.
> *
> * This program is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU General Public License for more details.
> *
> * You should have received a copy of the GNU General Public License
> * along with this program; if not, you can access it online at
> * http://www.gnu.org/licenses/gpl-2.0.html.
> *
> * Copyright IBM Corporation, 2014
> *
> * Author: Paul E. McKenney <[email protected]>
> */
>
> byte dynticks_nesting = 0;
> byte dynticks_nmi_nesting = 0;
> byte dynticks = 0;
>
> /*
> * Promela version of rcu_nmi_enter().
> */
> inline rcu_nmi_enter()
> {
> byte incby;
>
> incby = 2;
> assert(dynticks_nmi_nesting >= 0);
> if
> :: (dynticks & 1) == 0 ->
> atomic {
> dynticks = dynticks + 1;
> }
> assert((dynticks & 1) == 1);
> incby = 1;
> :: else ->
> skip;
> fi;
> dynticks_nmi_nesting = dynticks_nmi_nesting + incby;
> assert(dynticks_nmi_nesting >= 1);
> }
>
> /*
> * Promela version of rcu_nmi_exit().
> */
> inline rcu_nmi_exit()
> {
> assert(dynticks_nmi_nesting > 0);
> assert((dynticks & 1) != 0);
> if
> :: dynticks_nmi_nesting != 1 ->
> dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
> :: else ->
> dynticks_nmi_nesting = 0;
> atomic {
> dynticks = dynticks + 1;
> }
> assert((dynticks & 1) == 0);
> fi;
> }
>
> /*
> * Base-level NMI runs non-atomically. Crudely emulates process-level
> * dynticks-idle entry/exit.
> */
> proctype base_NMI()
> {
> byte busy;
>
> busy = 0;
> do
> :: if
> :: 1 -> atomic {
> dynticks = dynticks + 1;
> }
> busy = 0;
> :: 1 -> skip;
> fi;
> rcu_nmi_enter();
> assert((dynticks & 1) == 1);
> rcu_nmi_exit();
> if
> :: busy -> skip;
> :: !busy ->
> atomic {
> dynticks = dynticks + 1;
> }
> busy = 1;
> fi;
> od;
> }
>
> /*
> * Nested NMI runs atomically to emulate interrupting base_NMI().
> */
> proctype nested_NMI()
> {
> do
> :: atomic {
> rcu_nmi_enter();
> assert((dynticks & 1) == 1);
> rcu_nmi_exit();
> }
> od;
> }
>
> init {
> run base_NMI();
> run nested_NMI();
> }
>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8749f43f3f05..fc0236992655 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -759,39 +759,71 @@ void rcu_irq_enter(void)
> /**
> * rcu_nmi_enter - inform RCU of entry to NMI context
> *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is active.
> + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> + * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> + * that the CPU is active. This implementation permits nested NMIs, as
> + * long as the nesting level does not overflow an int. (You will probably
> + * run out of stack space first.)
> */
> void rcu_nmi_enter(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> + int incby = 2;
>
> - if (rdtp->dynticks_nmi_nesting == 0 &&
> - (atomic_read(&rdtp->dynticks) & 0x1))
> - return;
> - rdtp->dynticks_nmi_nesting++;
> - smp_mb__before_atomic(); /* Force delay from prior write. */
> - atomic_inc(&rdtp->dynticks);
> - /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> - smp_mb__after_atomic(); /* See above. */
> - WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> + /* Complain about underflow. */
> + WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> +
> + /*
> + * If idle from RCU viewpoint, atomically increment ->dynticks
> + * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> + * Otherwise, increment ->dynticks_nmi_nesting by two. This means
> + * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> + * to be in the outermost NMI handler that interrupted an RCU-idle
> + * period (observation due to Andy Lutomirski).
> + */
> + if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> + smp_mb__before_atomic(); /* Force delay from prior write. */
> + atomic_inc(&rdtp->dynticks);
> + /* atomic_inc() before later RCU read-side crit sects */
> + smp_mb__after_atomic(); /* See above. */
> + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> + incby = 1;
> + }
> + rdtp->dynticks_nmi_nesting += incby;
> + barrier();
> }
>
> /**
> * rcu_nmi_exit - inform RCU of exit from NMI context
> *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is no longer active.
> + * If we are returning from the outermost NMI handler that interrupted an
> + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> + * to let the RCU grace-period handling know that the CPU is back to
> + * being RCU-idle.
> */
> void rcu_nmi_exit(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>
> - if (rdtp->dynticks_nmi_nesting == 0 ||
> - --rdtp->dynticks_nmi_nesting != 0)
> + /*
> + * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> + * (We are exiting an NMI handler, so RCU better be paying attention
> + * to us!)
> + */
> + WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +
> + /*
> + * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> + * leave it in non-RCU-idle state.
> + */
> + if (rdtp->dynticks_nmi_nesting != 1) {
> + rdtp->dynticks_nmi_nesting -= 2;
> return;
> + }
> +
> + /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> + rdtp->dynticks_nmi_nesting = 0;
> /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
> smp_mb__before_atomic(); /* See above. */
> atomic_inc(&rdtp->dynticks);
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-25 17:13:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 03:50:58PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 24, 2014 at 03:35:55PM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 24, 2014 at 3:31 PM, Paul E. McKenney
> > <[email protected]> wrote:
> > > On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
> > >> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
> > >> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
> > >> > <[email protected]> wrote:
> > >> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
> > >>
> > >> [ . . . ]
> > >>
> > >> > > And the following Promela model claims that your approach works.
> > >> > > Should I trust it? ;-)
> > >> > >
> > >> >
> > >> > I think so.

OK, I added some coverage checks and also injected some bugs, all of which
the verification caught, so I am feeling at least a bit more confident in
the model. The updated version is included below, along with the script
that runs it.

Thanx, Paul

------------------------------------------------------------------------
rcu_nmi.spin
------------------------------------------------------------------------

/*
* Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
* that allows nesting.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, you can access it online at
* http://www.gnu.org/licenses/gpl-2.0.html.
*
* Copyright IBM Corporation, 2014
*
* Author: Paul E. McKenney <[email protected]>
*/

byte dynticks_nmi_nesting = 0;
byte dynticks = 0;
#define BUSY_INCBY 2 /* set to 1 to force failure. */

/*
* Promela version of rcu_nmi_enter().
*/
inline rcu_nmi_enter()
{
byte incby;
byte tmp;

incby = BUSY_INCBY;
assert(dynticks_nmi_nesting >= 0);
if
:: (dynticks & 1) == 0 ->
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 1);
incby = 1;
:: else ->
skip;
fi;
tmp = dynticks_nmi_nesting;
tmp = tmp + incby;
dynticks_nmi_nesting = tmp;
assert(dynticks_nmi_nesting >= 1);
}

/*
* Promela version of rcu_nmi_exit().
*/
inline rcu_nmi_exit()
{
byte tmp;

assert(dynticks_nmi_nesting > 0);
assert((dynticks & 1) != 0);
if
:: dynticks_nmi_nesting != 1 ->
tmp = dynticks_nmi_nesting;
tmp = tmp - BUSY_INCBY;
dynticks_nmi_nesting = tmp;
:: else ->
dynticks_nmi_nesting = 0;
atomic {
dynticks = dynticks + 1;
}
assert((dynticks & 1) == 0);
fi;
}

/*
* Base-level NMI runs non-atomically. Crudely emulates process-level
* dynticks-idle entry/exit.
*/
proctype base_NMI()
{
byte busy;

busy = 0;
do
:: /* Emulate base-level dynticks and not. */
if
:: 1 -> atomic {
dynticks = dynticks + 1;
}
busy = 1;
:: 1 -> skip;
fi;

/* Verify that we only sometimes have base-level dynticks. */
if
:: busy == 0 -> skip;
:: busy == 1 -> skip;
fi;

/* Model RCU's NMI entry and exit actions. */
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();

/* Emulate re-entering base-level dynticks and not. */
if
:: !busy -> skip;
:: busy ->
atomic {
dynticks = dynticks + 1;
}
busy = 0;
fi;

/* We had better now be in dyntick-idle mode. */
assert((dynticks & 1) == 0);
od;
}

/*
* Nested NMI runs atomically to emulate interrupting base_NMI().
*/
proctype nested_NMI()
{
do
:: /*
* Use an atomic section to model a nested NMI. This is
* guaranteed to interleave into base_NMI() between a pair
* of base_NMI() statements, just as a nested NMI would.
*/
atomic {
/* Verify that we only sometimes are in dynticks. */
if
:: (dynticks & 1) == 0 -> skip;
:: (dynticks & 1) == 1 -> skip;
fi;

/* Model RCU's NMI entry and exit actions. */
rcu_nmi_enter();
assert((dynticks & 1) == 1);
rcu_nmi_exit();
}
od;
}

init {
run base_NMI();
run nested_NMI();
}

------------------------------------------------------------------------
rcu_nmi.sh
------------------------------------------------------------------------

if ! spin -a rcu_nmi.spin
then
echo Spin errors!!!
exit 1
fi
if ! cc -DSAFETY -o pan pan.c
then
echo Compilation errors!!!
exit 1
fi
./pan -m100000
# For errors: spin -p -l -g -t rcu_nmi.spin < rcu_nmi.spin.trail

2014-11-25 18:58:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Mon, Nov 24, 2014 at 03:52:10PM -0800, Andy Lutomirski wrote:
> Dunno. Tony and Borislav -- when do you want the IST stack switching
> stuff?

I'd leave that up to Tony and his testbench. I mean, we can hammer on
it as much as we can and it can pass all testing locally but the real
fun starts once it hits upstream and for that it doesn't matter which
release... IMHO.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-11-25 19:16:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Tue, Nov 25, 2014 at 07:58:03PM +0100, Borislav Petkov wrote:
> On Mon, Nov 24, 2014 at 03:52:10PM -0800, Andy Lutomirski wrote:
> > Dunno. Tony and Borislav -- when do you want the IST stack switching
> > stuff?
>
> I'd leave that up to Tony and his testbench. I mean, we can hammer on
> it as much as we can and it can pass all testing locally but the real
> fun starts once it hits upstream and for that it doesn't matter which
> release... IMHO.

Nothing quite like oddball hardware to make test planning a challenge,
no two ways about it! ;-)

Thanx, Paul

2014-11-27 06:59:45

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context


>
> Signed-off-by: Paul E. McKenney <[email protected]>
>



Reviewed-by: Lai Jiangshan <[email protected]>


> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8749f43f3f05..fc0236992655 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -759,39 +759,71 @@ void rcu_irq_enter(void)
> /**
> * rcu_nmi_enter - inform RCU of entry to NMI context
> *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is active.
> + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> + * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> + * that the CPU is active. This implementation permits nested NMIs, as
> + * long as the nesting level does not overflow an int. (You will probably
> + * run out of stack space first.)
> */
> void rcu_nmi_enter(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> + int incby = 2;
>
> - if (rdtp->dynticks_nmi_nesting == 0 &&
> - (atomic_read(&rdtp->dynticks) & 0x1))
> - return;
> - rdtp->dynticks_nmi_nesting++;
> - smp_mb__before_atomic(); /* Force delay from prior write. */
> - atomic_inc(&rdtp->dynticks);
> - /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> - smp_mb__after_atomic(); /* See above. */
> - WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> + /* Complain about underflow. */
> + WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> +
> + /*
> + * If idle from RCU viewpoint, atomically increment ->dynticks
> + * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> + * Otherwise, increment ->dynticks_nmi_nesting by two. This means
> + * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> + * to be in the outermost NMI handler that interrupted an RCU-idle
> + * period (observation due to Andy Lutomirski).
> + */
> + if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> + smp_mb__before_atomic(); /* Force delay from prior write. */
> + atomic_inc(&rdtp->dynticks);
> + /* atomic_inc() before later RCU read-side crit sects */
> + smp_mb__after_atomic(); /* See above. */
> + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> + incby = 1;
> + }
> + rdtp->dynticks_nmi_nesting += incby;

I prefer a "else" branch here.

if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
...
WARN_ON_ONCE(rdtp->dynticks_nmi_nesting); /* paired with "rdtp->dynticks_nmi_nesting = 0" */
rdtp->dynticks_nmi_nesting = 1; /* paired with "if (rdtp->dynticks_nmi_nesting != 1) {" */
} else {
rdtp->dynticks_nmi_nesting += 2;
}
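
Spelled out against the patch above, that restructuring would look roughly
like this (an untested sketch; the barrier/atomic_inc block elided by "..."
is filled in from the patch):

void rcu_nmi_enter(void)
{
	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);

	/* Complain about underflow. */
	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);

	if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
		/* CPU was RCU-idle: mark it non-idle, as in the patch above. */
		smp_mb__before_atomic(); /* Force delay from prior write. */
		atomic_inc(&rdtp->dynticks);
		/* atomic_inc() before later RCU read-side crit sects */
		smp_mb__after_atomic(); /* See above. */
		WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));

		/* Paired with "rdtp->dynticks_nmi_nesting = 0" in rcu_nmi_exit(). */
		WARN_ON_ONCE(rdtp->dynticks_nmi_nesting);
		/* Paired with "if (rdtp->dynticks_nmi_nesting != 1)" there. */
		rdtp->dynticks_nmi_nesting = 1;
	} else {
		/* Already non-idle: just record one more level of NMI nesting. */
		rdtp->dynticks_nmi_nesting += 2;
	}
	barrier();
}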



> + barrier();
> }
>
> /**
> * rcu_nmi_exit - inform RCU of exit from NMI context
> *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is no longer active.
> + * If we are returning from the outermost NMI handler that interrupted an
> + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> + * to let the RCU grace-period handling know that the CPU is back to
> + * being RCU-idle.
> */
> void rcu_nmi_exit(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>
> - if (rdtp->dynticks_nmi_nesting == 0 ||
> - --rdtp->dynticks_nmi_nesting != 0)
> + /*
> + * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> + * (We are exiting an NMI handler, so RCU better be paying attention
> + * to us!)
> + */
> + WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +
> + /*
> + * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> + * leave it in non-RCU-idle state.
> + */
> + if (rdtp->dynticks_nmi_nesting != 1) {
> + rdtp->dynticks_nmi_nesting -= 2;
> return;
> + }
> +
> + /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> + rdtp->dynticks_nmi_nesting = 0;
> /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
> smp_mb__before_atomic(); /* See above. */
> atomic_inc(&rdtp->dynticks);
>

2014-12-01 16:46:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Thu, Nov 27, 2014 at 03:03:22PM +0800, Lai Jiangshan wrote:
>
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> Reviewed-by: Lai Jiangshan <[email protected]>

Thank you, recorded!

> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8749f43f3f05..fc0236992655 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -759,39 +759,71 @@ void rcu_irq_enter(void)
> > /**
> > * rcu_nmi_enter - inform RCU of entry to NMI context
> > *
> > - * If the CPU was idle with dynamic ticks active, and there is no
> > - * irq handler running, this updates rdtp->dynticks_nmi to let the
> > - * RCU grace-period handling know that the CPU is active.
> > + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> > + * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> > + * that the CPU is active. This implementation permits nested NMIs, as
> > + * long as the nesting level does not overflow an int. (You will probably
> > + * run out of stack space first.)
> > */
> > void rcu_nmi_enter(void)
> > {
> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > + int incby = 2;
> >
> > - if (rdtp->dynticks_nmi_nesting == 0 &&
> > - (atomic_read(&rdtp->dynticks) & 0x1))
> > - return;
> > - rdtp->dynticks_nmi_nesting++;
> > - smp_mb__before_atomic(); /* Force delay from prior write. */
> > - atomic_inc(&rdtp->dynticks);
> > - /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> > - smp_mb__after_atomic(); /* See above. */
> > - WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > + /* Complain about underflow. */
> > + WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> > +
> > + /*
> > + * If idle from RCU viewpoint, atomically increment ->dynticks
> > + * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> > + * Otherwise, increment ->dynticks_nmi_nesting by two. This means
> > + * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> > + * to be in the outermost NMI handler that interrupted an RCU-idle
> > + * period (observation due to Andy Lutomirski).
> > + */
> > + if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> > + smp_mb__before_atomic(); /* Force delay from prior write. */
> > + atomic_inc(&rdtp->dynticks);
> > + /* atomic_inc() before later RCU read-side crit sects */
> > + smp_mb__after_atomic(); /* See above. */
> > + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > + incby = 1;
> > + }
> > + rdtp->dynticks_nmi_nesting += incby;
>
> I prefer a "else" branch here.
>
> if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> ...
> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting); /* paired with "rdtp->dynticks_nmi_nesting = 0" */
> rdtp->dynticks_nmi_nesting = 1; /* paired with "if (rdtp->dynticks_nmi_nesting != 1) {" */
> } else {
> rdtp->dynticks_nmi_nesting += 2;
> }

I avoided this approach due to the extra line.

Thanx, Paul

> > + barrier();
> > }
> >
> > /**
> > * rcu_nmi_exit - inform RCU of exit from NMI context
> > *
> > - * If the CPU was idle with dynamic ticks active, and there is no
> > - * irq handler running, this updates rdtp->dynticks_nmi to let the
> > - * RCU grace-period handling know that the CPU is no longer active.
> > + * If we are returning from the outermost NMI handler that interrupted an
> > + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> > + * to let the RCU grace-period handling know that the CPU is back to
> > + * being RCU-idle.
> > */
> > void rcu_nmi_exit(void)
> > {
> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >
> > - if (rdtp->dynticks_nmi_nesting == 0 ||
> > - --rdtp->dynticks_nmi_nesting != 0)
> > + /*
> > + * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> > + * (We are exiting an NMI handler, so RCU better be paying attention
> > + * to us!)
> > + */
> > + WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> > + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > +
> > + /*
> > + * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> > + * leave it in non-RCU-idle state.
> > + */
> > + if (rdtp->dynticks_nmi_nesting != 1) {
> > + rdtp->dynticks_nmi_nesting -= 2;
> > return;
> > + }
> > +
> > + /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> > + rdtp->dynticks_nmi_nesting = 0;
> > /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
> > smp_mb__before_atomic(); /* See above. */
> > atomic_inc(&rdtp->dynticks);
> >
>

2014-12-11 00:22:18

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Tue, Nov 25, 2014 at 10:58 AM, Borislav Petkov <[email protected]> wrote:
> On Mon, Nov 24, 2014 at 03:52:10PM -0800, Andy Lutomirski wrote:
>> Dunno. Tony and Borislav -- when do you want the IST stack switching
>> stuff?
>
> I'd leave that up to Tony and his testbench. I mean, we can hammer on
> it as much as we can and it can pass all testing locally but the real
> fun starts once it hits upstream and for that it doesn't matter which
> release... IMHO.

So what was the net result of all the mode/RCU discussions?

Do I need some extra magic incantations in the final version
of do_machine_check() beyond what was in this patch:

https://lkml.kernel.org/r/[email protected]

to make everything happy?

-Tony

2014-12-11 00:25:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context

On Wed, Dec 10, 2014 at 4:22 PM, Tony Luck <[email protected]> wrote:
> On Tue, Nov 25, 2014 at 10:58 AM, Borislav Petkov <[email protected]> wrote:
>> On Mon, Nov 24, 2014 at 03:52:10PM -0800, Andy Lutomirski wrote:
>>> Dunno. Tony and Borislav -- when do you want the IST stack switching
>>> stuff?
>>
>> I'd leave that up to Tony and his testbench. I mean, we can hammer on
>> it as much as we can and it can pass all testing locally but the real
>> fun starts once it hits upstream and for that it doesn't matter which
>> release... IMHO.
>
> So what was the net result of all the mode/RCU discussions?
>
> Do I need some extra magic incantations in the final version
> of do_machine_check() beyond what was in this patch:
>
> https://lkml.kernel.org/r/[email protected]
>
> to make everything happy?
>

I think you need ist_begin_non_atomic() before local_irq_enable() and
ist_end_non_atomic() after local_irq_disable(). Otherwise it should
be good.
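
Roughly, as a sketch only (not the actual mce.c code -- the surrounding
context and the memory-failure work are assumed here for illustration):

	/* ... atomic part of do_machine_check() runs first ... */

	ist_begin_non_atomic(regs);	/* may schedule below; BUGs if that is not safe */
	local_irq_enable();

	/* non-atomic work, e.g. handling a userspace memory failure */

	local_irq_disable();
	ist_end_non_atomic();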

--Andy

> -Tony



--
Andy Lutomirski
AMA Capital Management, LLC