2014-12-10 23:34:59

by Luis Chamberlain

Subject: [PATCH v2 0/2] x86: add xen hypercall preemption

From: "Luis R. Rodriguez" <[email protected]>

This is my second series addressing hypercall preemption
on Xen. On the first iteration of this series [0] I tried as
much as possible to avoid cond_resched()-type behaviour,
but after good feedback I've determined that something like
cond_resched(), but in IRQ context, is required for preempting
Xen hypercalls. This series introduces and uses the new cond_resched_irq().

[0] https://lkml.org/lkml/2014/11/26/630

Luis R. Rodriguez (2):
sched: add cond_resched_irq()
x86/xen: allow privcmd hypercalls to be preempted

arch/x86/kernel/entry_32.S | 21 +++++++++++++++++++++
arch/x86/kernel/entry_64.S | 17 +++++++++++++++++
drivers/xen/Makefile | 2 +-
drivers/xen/preempt.c | 17 +++++++++++++++++
drivers/xen/privcmd.c | 2 ++
include/linux/sched.h | 7 +++++++
include/xen/xen-ops.h | 26 ++++++++++++++++++++++++++
kernel/sched/core.c | 10 ++++++++++
8 files changed, 101 insertions(+), 1 deletion(-)
create mode 100644 drivers/xen/preempt.c

--
2.1.1


2014-12-10 23:35:09

by Luis Chamberlain

Subject: [PATCH v2 1/2] sched: add cond_resched_irq()

From: "Luis R. Rodriguez" <[email protected]>

Under special circumstances we may want to force
voluntary preemption even for CONFIG_PREEMPT=n
kernels, with interrupts disabled. This adds a helper
to let us do that.

Cc: Borislav Petkov <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jan Beulich <[email protected]>
Cc: [email protected]
Signed-off-by: Luis R. Rodriguez <[email protected]>
---
include/linux/sched.h | 7 +++++++
kernel/sched/core.c | 10 ++++++++++
2 files changed, 17 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..92da927 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2759,6 +2759,13 @@ static inline int signal_pending_state(long state, struct task_struct *p)
*/
extern int _cond_resched(void);

+/*
+ * Voluntarily preempt the kernel even for CONFIG_PREEMPT=n kernels
+ * under very special circumstances. This is to be used with interrupts
+ * disabled.
+ */
+extern int cond_resched_irq(void);
+
#define cond_resched() ({ \
__might_sleep(__FILE__, __LINE__, 0); \
_cond_resched(); \
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89e7283..573edb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4239,6 +4239,16 @@ int __sched _cond_resched(void)
}
EXPORT_SYMBOL(_cond_resched);

+int __sched cond_resched_irq(void)
+{
+ if (should_resched()) {
+ preempt_schedule_irq();
+ return 1;
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(cond_resched_irq);
+
/*
* __cond_resched_lock() - if a reschedule is pending, drop the given lock,
* call schedule, and on return reacquire the lock.
--
2.1.1

2014-12-10 23:35:35

by Luis Chamberlain

Subject: [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

From: "Luis R. Rodriguez" <[email protected]>

Xen has support for splitting heavy work into a series
of hypercalls, called multicalls, and preempting them through
what Xen calls continuation [0]. Despite this, without
CONFIG_PREEMPT preemption won't happen, and while enabling
CONFIG_RT_GROUP_SCHED can at times help, it's not enough to
make a system usable. Such is the case, for example, when
creating a > 50 GiB HVM guest; we can get softlockups [1] with:

kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]

The soft lockup triggers on the TASK_UNINTERRUPTIBLE hung task check
(default 120 seconds). On the Xen side, in this particular case,
this happens when the following Xen hypervisor code path is taken:

xc_domain_set_pod_target() -->
do_memory_op() -->
arch_memory_op() -->
p2m_pod_set_mem_target()
-- long delay (real or emulated) --

This happens on arch_memory_op() on the XENMEM_set_pod_target memory
op even though arch_memory_op() can handle continuation via
hypercall_create_continuation() for example.

Machines with over 50 GiB of memory are in high demand and hard to
come by, so to help replicate this sort of issue, long delays on
select hypercalls have been emulated in order to be able to test
this on smaller machines [2].

On one hand this issue can be considered expected given that
CONFIG_PREEMPT=n is used; however, we have set a precedent of forced
voluntary preemption in the kernel even for CONFIG_PREEMPT=n through
the use of cond_resched() sprinkled in many places. To address
this issue with Xen hypercalls, though, we need to find a way to aid
the scheduler in the middle of hypercalls. We are motivated to
address this issue on CONFIG_PREEMPT=n as otherwise the system becomes
rather unresponsive for long periods of time; in the worst case (so far
only observed by emulating long delays on select disk-I/O-bound
hypercalls), this can lead to filesystem corruption if the delay happens,
for example, on SCHEDOP_remote_shutdown (when we call 'xl <domain> shutdown').

We can address this problem by checking, on return from the Xen
timer interrupt, whether we should schedule in the middle of a
hypercall. We want to be careful not to always force voluntary
preemption, though, so we only selectively enable preemption
on very specific Xen hypercalls.

This enables hypercall preemption by selectively forcing checks for
voluntary preemption only on ioctl-initiated privcmd hypercalls,
where we know some folks have run into the reported issues [1].

[0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
[1] https://bugzilla.novell.com/show_bug.cgi?id=861093
[2] http://ftp.suse.com/pub/people/mcgrof/xen/emulate-long-xen-hypercalls.patch

Based on original work by: David Vrabel <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jan Beulich <[email protected]>
Cc: [email protected]
Signed-off-by: Luis R. Rodriguez <[email protected]>
---
arch/x86/kernel/entry_32.S | 21 +++++++++++++++++++++
arch/x86/kernel/entry_64.S | 17 +++++++++++++++++
drivers/xen/Makefile | 2 +-
drivers/xen/preempt.c | 17 +++++++++++++++++
drivers/xen/privcmd.c | 2 ++
include/xen/xen-ops.h | 26 ++++++++++++++++++++++++++
6 files changed, 84 insertions(+), 1 deletion(-)
create mode 100644 drivers/xen/preempt.c

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 344b63f..40b5c0c 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -982,7 +982,28 @@ ENTRY(xen_hypervisor_callback)
ENTRY(xen_do_upcall)
1: mov %esp, %eax
call xen_evtchn_do_upcall
+#ifdef CONFIG_PREEMPT
jmp ret_from_intr
+#else
+ GET_THREAD_INFO(%ebp)
+#ifdef CONFIG_VM86
+ movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
+ movb PT_CS(%esp), %al
+ andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
+#else
+ movl PT_CS(%esp), %eax
+ andl $SEGMENT_RPL_MASK, %eax
+#endif
+ cmpl $USER_RPL, %eax
+ jae resume_userspace # returning to v8086 or userspace
+ DISABLE_INTERRUPTS(CLBR_ANY)
+ cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
+ jz resume_kernel
+ movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
+ call cond_resched_irq
+ movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
+ jmp resume_kernel
+#endif /* CONFIG_PREEMPT */
CFI_ENDPROC
ENDPROC(xen_hypervisor_callback)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index c0226ab..0ccdd06 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1170,7 +1170,23 @@ ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
popq %rsp
CFI_DEF_CFA_REGISTER rsp
decl PER_CPU_VAR(irq_count)
+#ifdef CONFIG_PREEMPT
jmp error_exit
+#else
+ movl %ebx, %eax
+ RESTORE_REST
+ DISABLE_INTERRUPTS(CLBR_NONE)
+ TRACE_IRQS_OFF
+ GET_THREAD_INFO(%rcx)
+ testl %eax, %eax
+ je error_exit_user
+ cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
+ jz retint_kernel
+ movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
+ call cond_resched_irq
+ movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
+ jmp retint_kernel
+#endif /* CONFIG_PREEMPT */
CFI_ENDPROC
END(xen_do_hypervisor_callback)

@@ -1398,6 +1414,7 @@ ENTRY(error_exit)
GET_THREAD_INFO(%rcx)
testl %eax,%eax
jne retint_kernel
+error_exit_user:
LOCKDEP_SYS_EXIT_IRQ
movl TI_flags(%rcx),%edx
movl $_TIF_WORK_MASK,%edi
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index 2140398..2ccd359 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -2,7 +2,7 @@ ifeq ($(filter y, $(CONFIG_ARM) $(CONFIG_ARM64)),)
obj-$(CONFIG_HOTPLUG_CPU) += cpu_hotplug.o
endif
obj-$(CONFIG_X86) += fallback.o
-obj-y += grant-table.o features.o balloon.o manage.o
+obj-y += grant-table.o features.o balloon.o manage.o preempt.o
obj-y += events/
obj-y += xenbus/

diff --git a/drivers/xen/preempt.c b/drivers/xen/preempt.c
new file mode 100644
index 0000000..b5a3e98
--- /dev/null
+++ b/drivers/xen/preempt.c
@@ -0,0 +1,17 @@
+/*
+ * Preemptible hypercalls
+ *
+ * Copyright (C) 2014 Citrix Systems R&D ltd.
+ *
+ * This source code is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ */
+
+#include <xen/xen-ops.h>
+
+#ifndef CONFIG_PREEMPT
+DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);
+EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
+#endif
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 569a13b..59ac71c 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -56,10 +56,12 @@ static long privcmd_ioctl_hypercall(void __user *udata)
if (copy_from_user(&hypercall, udata, sizeof(hypercall)))
return -EFAULT;

+ xen_preemptible_hcall_begin();
ret = privcmd_call(hypercall.op,
hypercall.arg[0], hypercall.arg[1],
hypercall.arg[2], hypercall.arg[3],
hypercall.arg[4]);
+ xen_preemptible_hcall_end();

return ret;
}
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 7491ee5..8333821 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -46,4 +46,30 @@ static inline efi_system_table_t __init *xen_efi_probe(void)
}
#endif

+#ifdef CONFIG_PREEMPT
+
+static inline void xen_preemptible_hcall_begin(void)
+{
+}
+
+static inline void xen_preemptible_hcall_end(void)
+{
+}
+
+#else
+
+DECLARE_PER_CPU(bool, xen_in_preemptible_hcall);
+
+static inline void xen_preemptible_hcall_begin(void)
+{
+ __this_cpu_write(xen_in_preemptible_hcall, true);
+}
+
+static inline void xen_preemptible_hcall_end(void)
+{
+ __this_cpu_write(xen_in_preemptible_hcall, false);
+}
+
+#endif /* CONFIG_PREEMPT */
+
#endif /* INCLUDE_XEN_OPS_H */
--
2.1.1

2014-12-10 23:52:12

by Andy Lutomirski

Subject: Re: [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On Wed, Dec 10, 2014 at 3:34 PM, Luis R. Rodriguez
<[email protected]> wrote:
> From: "Luis R. Rodriguez" <[email protected]>
>
> Xen has support for splitting heavy work work into a series
> of hypercalls, called multicalls, and preempting them through
> what Xen calls continuation [0]. Despite this though without
> CONFIG_PREEMPT preemption won't happen and while enabling
> CONFIG_RT_GROUP_SCHED can at times help its not enough to
> make a system usable. Such is the case for example when
> creating a > 50 GiB HVM guest, we can get softlockups [1] with:.
>
> kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>
> The softlock up triggers on the TASK_UNINTERRUPTIBLE hanger check
> (default 120 seconds), on the Xen side in this particular case
> this happens when the following Xen hypervisor code is used:
>
> xc_domain_set_pod_target() -->
> do_memory_op() -->
> arch_memory_op() -->
> p2m_pod_set_mem_target()
> -- long delay (real or emulated) --
>
> This happens on arch_memory_op() on the XENMEM_set_pod_target memory
> op even though arch_memory_op() can handle continuation via
> hypercall_create_continuation() for example.
>
> Machines over 50 GiB of memory are on high demand and hard to come
> by so to help replicate this sort of issue long delays on select
> hypercalls have been emulated in order to be able to test this on
> smaller machines [2].
>
> On one hand this issue can be considered as expected given that
> CONFIG_PREEMPT=n is used however we have forced voluntary preemption
> precedent practices in the kernel even for CONFIG_PREEMPT=n through
> the usage of cond_resched() sprinkled in many places. To address
> this issue with Xen hypercalls though we need to find a way to aid
> to the schedular in the middle of hypercalls. We are motivated to
> address this issue on CONFIG_PREEMPT=n as otherwise the system becomes
> rather unresponsive for long periods of time; in the worst case, at least
> only currently by emulating long delays on select io disk bound
> hypercalls, this can lead to filesystem corruption if the delay happens
> for example on SCHEDOP_remote_shutdown (when we call 'xl <domain> shutdown').
>
> We can address this problem by trying to check if we should schedule
> on the xen timer in the middle of a hypercall on the return from the
> timer interrupt. We want to be careful to not always force voluntary
> preemption though so to do this we only selectively enable preemption
> on very specific xen hypercalls.
>
> This enables hypercall preemption by selectively forcing checks for
> voluntary preempting only on ioctl initiated private hypercalls
> where we know some folks have run into reported issues [1].
>
> [0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
> [1] https://bugzilla.novell.com/show_bug.cgi?id=861093
> [2] http://ftp.suse.com/pub/people/mcgrof/xen/emulate-long-xen-hypercalls.patch
>
> Based on original work by: David Vrabel <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: David Vrabel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: Andy Lutomirski <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Jan Beulich <[email protected]>
> Cc: [email protected]
> Signed-off-by: Luis R. Rodriguez <[email protected]>
> ---
> arch/x86/kernel/entry_32.S | 21 +++++++++++++++++++++
> arch/x86/kernel/entry_64.S | 17 +++++++++++++++++
> drivers/xen/Makefile | 2 +-
> drivers/xen/preempt.c | 17 +++++++++++++++++
> drivers/xen/privcmd.c | 2 ++
> include/xen/xen-ops.h | 26 ++++++++++++++++++++++++++
> 6 files changed, 84 insertions(+), 1 deletion(-)
> create mode 100644 drivers/xen/preempt.c
>
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index 344b63f..40b5c0c 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -982,7 +982,28 @@ ENTRY(xen_hypervisor_callback)
> ENTRY(xen_do_upcall)
> 1: mov %esp, %eax
> call xen_evtchn_do_upcall
> +#ifdef CONFIG_PREEMPT
> jmp ret_from_intr
> +#else
> + GET_THREAD_INFO(%ebp)
> +#ifdef CONFIG_VM86
> + movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
> + movb PT_CS(%esp), %al
> + andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
> +#else
> + movl PT_CS(%esp), %eax
> + andl $SEGMENT_RPL_MASK, %eax
> +#endif
> + cmpl $USER_RPL, %eax
> + jae resume_userspace # returning to v8086 or userspace
> + DISABLE_INTERRUPTS(CLBR_ANY)
> + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> + jz resume_kernel
> + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> + call cond_resched_irq
> + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
> + jmp resume_kernel
> +#endif /* CONFIG_PREEMPT */
> CFI_ENDPROC
> ENDPROC(xen_hypervisor_callback)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index c0226ab..0ccdd06 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1170,7 +1170,23 @@ ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
> popq %rsp
> CFI_DEF_CFA_REGISTER rsp
> decl PER_CPU_VAR(irq_count)
> +#ifdef CONFIG_PREEMPT
> jmp error_exit
> +#else
> + movl %ebx, %eax
> + RESTORE_REST
> + DISABLE_INTERRUPTS(CLBR_NONE)
> + TRACE_IRQS_OFF
> + GET_THREAD_INFO(%rcx)
> + testl %eax, %eax
> + je error_exit_user
> + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> + jz retint_kernel

I think I understand this part.

> + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)

Why? Is the issue that, if preemptible hypercalls nest, you don't
want to preempt again?

> + call cond_resched_irq

On !CONFIG_PREEMPT, there's no preempt_disable, right? So how do you
guarantee that you don't preempt something you shouldn't? Is the idea
that these events will only fire nested *directly* inside a
preemptible hypercall? Also, should you check that IRQs were on when
the event fired? (Are they on in pt_regs?)

> + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
> + jmp retint_kernel
> +#endif /* CONFIG_PREEMPT */
> CFI_ENDPROC

All that being said, this is IMO a bit gross. You've added a bunch of
asm that's kind of like a parallel error_exit, and the error entry and
exit code is hairy enough that this scares me. Can you do this mostly
in C instead? This would look nicer if it could be:

call xen_evtchn_do_upcall
popq %rsp
CFI_DEF_CFA_REGISTER rsp
decl PER_CPU_VAR(irq_count)
+ call xen_end_upcall
jmp error_exit

Where xen_end_upcall would be written in C, nokprobes and notrace (if
needed) and would check pt_regs and whatever else and just call
schedule if needed?

--Andy

2014-12-11 00:30:16

by H. Peter Anvin

Subject: Re: [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On 12/10/2014 03:34 PM, Luis R. Rodriguez wrote:
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index 344b63f..40b5c0c 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -982,7 +982,28 @@ ENTRY(xen_hypervisor_callback)
> ENTRY(xen_do_upcall)
> 1: mov %esp, %eax
> call xen_evtchn_do_upcall
> +#ifdef CONFIG_PREEMPT
> jmp ret_from_intr
> +#else
> + GET_THREAD_INFO(%ebp)
> +#ifdef CONFIG_VM86
> + movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
> + movb PT_CS(%esp), %al
> + andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
> +#else
> + movl PT_CS(%esp), %eax
> + andl $SEGMENT_RPL_MASK, %eax
> +#endif
> + cmpl $USER_RPL, %eax
> + jae resume_userspace # returning to v8086 or userspace
> + DISABLE_INTERRUPTS(CLBR_ANY)
> + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> + jz resume_kernel
> + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> + call cond_resched_irq
> + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
> + jmp resume_kernel
> +#endif /* CONFIG_PREEMPT */
> CFI_ENDPROC
> ENDPROC(xen_hypervisor_callback)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index c0226ab..0ccdd06 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1170,7 +1170,23 @@ ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
> popq %rsp
> CFI_DEF_CFA_REGISTER rsp
> decl PER_CPU_VAR(irq_count)
> +#ifdef CONFIG_PREEMPT
> jmp error_exit
> +#else
> + movl %ebx, %eax
> + RESTORE_REST
> + DISABLE_INTERRUPTS(CLBR_NONE)
> + TRACE_IRQS_OFF
> + GET_THREAD_INFO(%rcx)
> + testl %eax, %eax
> + je error_exit_user
> + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> + jz retint_kernel
> + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> + call cond_resched_irq
> + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
> + jmp retint_kernel
> +#endif /* CONFIG_PREEMPT */
> CFI_ENDPROC
> END(xen_do_hypervisor_callback)
>
> @@ -1398,6 +1414,7 @@ ENTRY(error_exit)
> GET_THREAD_INFO(%rcx)
> testl %eax,%eax
> jne retint_kernel
> +error_exit_user:
> LOCKDEP_SYS_EXIT_IRQ
> movl TI_flags(%rcx),%edx
> movl $_TIF_WORK_MASK,%edi

You're adding a bunch of code for the *non*-preemptive case here... why?

-hpa

2014-12-11 00:55:15

by Luis Chamberlain

Subject: Re: [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On Wed, Dec 10, 2014 at 03:51:48PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 10, 2014 at 3:34 PM, Luis R. Rodriguez
> <[email protected]> wrote:
> > From: "Luis R. Rodriguez" <[email protected]>
> >
> > Xen has support for splitting heavy work work into a series
> > of hypercalls, called multicalls, and preempting them through
> > what Xen calls continuation [0]. Despite this though without
> > CONFIG_PREEMPT preemption won't happen and while enabling
> > CONFIG_RT_GROUP_SCHED can at times help its not enough to
> > make a system usable. Such is the case for example when
> > creating a > 50 GiB HVM guest, we can get softlockups [1] with:.
> >
> > kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
> >
> > The softlock up triggers on the TASK_UNINTERRUPTIBLE hanger check
> > (default 120 seconds), on the Xen side in this particular case
> > this happens when the following Xen hypervisor code is used:
> >
> > xc_domain_set_pod_target() -->
> > do_memory_op() -->
> > arch_memory_op() -->
> > p2m_pod_set_mem_target()
> > -- long delay (real or emulated) --
> >
> > This happens on arch_memory_op() on the XENMEM_set_pod_target memory
> > op even though arch_memory_op() can handle continuation via
> > hypercall_create_continuation() for example.
> >
> > Machines over 50 GiB of memory are on high demand and hard to come
> > by so to help replicate this sort of issue long delays on select
> > hypercalls have been emulated in order to be able to test this on
> > smaller machines [2].
> >
> > On one hand this issue can be considered as expected given that
> > CONFIG_PREEMPT=n is used however we have forced voluntary preemption
> > precedent practices in the kernel even for CONFIG_PREEMPT=n through
> > the usage of cond_resched() sprinkled in many places. To address
> > this issue with Xen hypercalls though we need to find a way to aid
> > to the schedular in the middle of hypercalls. We are motivated to
> > address this issue on CONFIG_PREEMPT=n as otherwise the system becomes
> > rather unresponsive for long periods of time; in the worst case, at least
> > only currently by emulating long delays on select io disk bound
> > hypercalls, this can lead to filesystem corruption if the delay happens
> > for example on SCHEDOP_remote_shutdown (when we call 'xl <domain> shutdown').
> >
> > We can address this problem by trying to check if we should schedule
> > on the xen timer in the middle of a hypercall on the return from the
> > timer interrupt. We want to be careful to not always force voluntary
> > preemption though so to do this we only selectively enable preemption
> > on very specific xen hypercalls.
> >
> > This enables hypercall preemption by selectively forcing checks for
> > voluntary preempting only on ioctl initiated private hypercalls
> > where we know some folks have run into reported issues [1].
> >
> > [0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
> > [1] https://bugzilla.novell.com/show_bug.cgi?id=861093
> > [2] http://ftp.suse.com/pub/people/mcgrof/xen/emulate-long-xen-hypercalls.patch
> >
> > Based on original work by: David Vrabel <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: David Vrabel <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: [email protected]
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Masami Hiramatsu <[email protected]>
> > Cc: Jan Beulich <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Luis R. Rodriguez <[email protected]>
> > ---
> > arch/x86/kernel/entry_32.S | 21 +++++++++++++++++++++
> > arch/x86/kernel/entry_64.S | 17 +++++++++++++++++
> > drivers/xen/Makefile | 2 +-
> > drivers/xen/preempt.c | 17 +++++++++++++++++
> > drivers/xen/privcmd.c | 2 ++
> > include/xen/xen-ops.h | 26 ++++++++++++++++++++++++++
> > 6 files changed, 84 insertions(+), 1 deletion(-)
> > create mode 100644 drivers/xen/preempt.c
> >
> > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > index 344b63f..40b5c0c 100644
> > --- a/arch/x86/kernel/entry_32.S
> > +++ b/arch/x86/kernel/entry_32.S
> > @@ -982,7 +982,28 @@ ENTRY(xen_hypervisor_callback)
> > ENTRY(xen_do_upcall)
> > 1: mov %esp, %eax
> > call xen_evtchn_do_upcall
> > +#ifdef CONFIG_PREEMPT
> > jmp ret_from_intr
> > +#else
> > + GET_THREAD_INFO(%ebp)
> > +#ifdef CONFIG_VM86
> > + movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
> > + movb PT_CS(%esp), %al
> > + andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
> > +#else
> > + movl PT_CS(%esp), %eax
> > + andl $SEGMENT_RPL_MASK, %eax
> > +#endif
> > + cmpl $USER_RPL, %eax
> > + jae resume_userspace # returning to v8086 or userspace
> > + DISABLE_INTERRUPTS(CLBR_ANY)
> > + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + jz resume_kernel
> > + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + call cond_resched_irq
> > + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + jmp resume_kernel
> > +#endif /* CONFIG_PREEMPT */
> > CFI_ENDPROC
> > ENDPROC(xen_hypervisor_callback)
> >
> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> > index c0226ab..0ccdd06 100644
> > --- a/arch/x86/kernel/entry_64.S
> > +++ b/arch/x86/kernel/entry_64.S
> > @@ -1170,7 +1170,23 @@ ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
> > popq %rsp
> > CFI_DEF_CFA_REGISTER rsp
> > decl PER_CPU_VAR(irq_count)
> > +#ifdef CONFIG_PREEMPT
> > jmp error_exit
> > +#else
> > + movl %ebx, %eax
> > + RESTORE_REST
> > + DISABLE_INTERRUPTS(CLBR_NONE)
> > + TRACE_IRQS_OFF
> > + GET_THREAD_INFO(%rcx)
> > + testl %eax, %eax
> > + je error_exit_user
> > + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + jz retint_kernel
>
> I think I understand this part.
>
> > + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
>
> Why? Is the issue that, if preemptible hypercalls nest, you don't
> want to preempt again?


So this callback comes in on the Xen timer; without the per-CPU
variable we would not be able to selectively preempt specific
hypercalls, and we'd make a no-preempt kernel fully preemptive.

I asked the same question, see:

https://lkml.org/lkml/2014/12/3/756

>
> > + call cond_resched_irq
>
> On !CONFIG_PREEMPT, there's no preempt_disable, right? So how do you
> guarantee that you don't preempt something you shouldn't?

Not sure I follow, but in essence this is no different than the use
of cond_resched() on !CONFIG_PREEMPT; the only thing here is we are
in interrupt context. If this is about abuse of making !CONFIG_PREEMPT
voluntarily preemptive at select points, then I had similar concerns,
and David pointed out to me the wide use of cond_resched() in the
kernel.

> Is the idea
> that these events will only fire nested *directly* inside a
> preemptible hypercall?

Yeah, it's the timer interrupt that would trigger the above.

> Also, should you check that IRQs were on when
> the event fired? (Are they on in pt_regs?)

Right before this, xen_evtchn_do_upcall() is issued, which
saves pt_regs and then restores them.

> > + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + jmp retint_kernel
> > +#endif /* CONFIG_PREEMPT */
> > CFI_ENDPROC
>
> All that being said, this is IMO a bit gross.

That was my first reaction, hence my original attempt to get away from
this. I've learned David also tried not to go down this route before;
more on this below.

> You've added a bunch of
> asm that's kind of like a parallel error_exit,

yeah... I first tried to macro'ize this but it looked hairy; if we
wanted to do that (although the name probably ain't the best):

32-bit:
.macro test_from_kernel kernel_ret:req
GET_THREAD_INFO(%ebp)
#ifdef CONFIG_VM86
movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
movb PT_CS(%esp), %al
andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
#else
/*
* We can be coming here from child spawned by kernel_thread().
*/
movl PT_CS(%esp), %eax
andl $SEGMENT_RPL_MASK, %eax
#endif
cmpl $USER_RPL, %eax
jb \kernel_ret
.endm

64-bit:
.macro test_from_kernel kernel_ret:req
movl %ebx,%eax
RESTORE_REST
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
GET_THREAD_INFO(%rcx)
testl %eax,%eax
jne \kernel_ret
.endm

> and the error entry and
> exit code is hairy enough that this scares me.

yeah...

> Can you do this mostly in C instead? This would look a nicer if it could be:
>
> call xen_evtchn_do_upcall
> popq %rsp
> CFI_DEF_CFA_REGISTER rsp
> decl PER_CPU_VAR(irq_count)
> + call xen_end_upcall
> jmp error_exit

It seems David tried this originally eons ago:

http://lists.xen.org/archives/html/xen-devel/2014-02/msg01101.html

and the strategy was shifted based on Jan's feedback that we could
not schedule as we're on the IRQ stack. David evolved the strategy
to asm and to use preempt_schedule_irq(); this new iteration just
revisits the same approach but tries to generalize scheduling in IRQ
context under very special circumstances.

> Where xen_end_upcall would be witten in C, nokprobes and notrace (if
> needed) and would check pt_regs and whatever else and just call
> schedule if needed?

If there's a way to do it it'd be great. I am not sure if we can though.

Luis

2014-12-11 01:03:48

by Luis Chamberlain

Subject: Re: [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On Wed, Dec 10, 2014 at 04:29:06PM -0800, H. Peter Anvin wrote:
> On 12/10/2014 03:34 PM, Luis R. Rodriguez wrote:
> > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > index 344b63f..40b5c0c 100644
> > --- a/arch/x86/kernel/entry_32.S
> > +++ b/arch/x86/kernel/entry_32.S
> > @@ -982,7 +982,28 @@ ENTRY(xen_hypervisor_callback)
> > ENTRY(xen_do_upcall)
> > 1: mov %esp, %eax
> > call xen_evtchn_do_upcall
> > +#ifdef CONFIG_PREEMPT
> > jmp ret_from_intr
> > +#else
> > + GET_THREAD_INFO(%ebp)
> > +#ifdef CONFIG_VM86
> > + movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
> > + movb PT_CS(%esp), %al
> > + andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
> > +#else
> > + movl PT_CS(%esp), %eax
> > + andl $SEGMENT_RPL_MASK, %eax
> > +#endif
> > + cmpl $USER_RPL, %eax
> > + jae resume_userspace # returning to v8086 or userspace
> > + DISABLE_INTERRUPTS(CLBR_ANY)
> > + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + jz resume_kernel
> > + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + call cond_resched_irq
> > + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + jmp resume_kernel
> > +#endif /* CONFIG_PREEMPT */
> > CFI_ENDPROC
> > ENDPROC(xen_hypervisor_callback)
> >
> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> > index c0226ab..0ccdd06 100644
> > --- a/arch/x86/kernel/entry_64.S
> > +++ b/arch/x86/kernel/entry_64.S
> > @@ -1170,7 +1170,23 @@ ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
> > popq %rsp
> > CFI_DEF_CFA_REGISTER rsp
> > decl PER_CPU_VAR(irq_count)
> > +#ifdef CONFIG_PREEMPT
> > jmp error_exit
> > +#else
> > + movl %ebx, %eax
> > + RESTORE_REST
> > + DISABLE_INTERRUPTS(CLBR_NONE)
> > + TRACE_IRQS_OFF
> > + GET_THREAD_INFO(%rcx)
> > + testl %eax, %eax
> > + je error_exit_user
> > + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + jz retint_kernel
> > + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + call cond_resched_irq
> > + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
> > + jmp retint_kernel
> > +#endif /* CONFIG_PREEMPT */
> > CFI_ENDPROC
> > END(xen_do_hypervisor_callback)
> >
> > @@ -1398,6 +1414,7 @@ ENTRY(error_exit)
> > GET_THREAD_INFO(%rcx)
> > testl %eax,%eax
> > jne retint_kernel
> > +error_exit_user:
> > LOCKDEP_SYS_EXIT_IRQ
> > movl TI_flags(%rcx),%edx
> > movl $_TIF_WORK_MASK,%edi
>
> You're adding a bunch of code for the *non*-preemptive case here... why?

This is an issue only for *non*-preemptive kernels.

Some of Xen's hypercalls can take a long time and unfortunately for
*non*-preemptive kernels this can be quite a bit of an issue.
We've handled situations like this with cond_resched() before, which
pushes even *non*-preemptive kernels to behave as voluntarily
preemptive. I was not aware to what extent this was done and what
precedents were set, but it's pretty widespread now... this then just
addresses one particular case where this is also an issue, but now in
IRQ context.

I agree it's a hack, but so are all the other cond_resched() calls then.
I don't think it's a good idea to be spreading use of something like
this everywhere, but after careful review and trying to avoid this
exact code for a while I have not been able to find any other
reasonable alternative.

Luis

2014-12-11 01:04:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On Wed, Dec 10, 2014 at 4:55 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Wed, Dec 10, 2014 at 03:51:48PM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 10, 2014 at 3:34 PM, Luis R. Rodriguez
>> <[email protected]> wrote:
>> > From: "Luis R. Rodriguez" <[email protected]>
>> >
>> > Xen has support for splitting heavy work into a series
>> > of hypercalls, called multicalls, and preempting them through
>> > what Xen calls continuation [0]. Despite this, without
>> > CONFIG_PREEMPT preemption won't happen, and while enabling
>> > CONFIG_RT_GROUP_SCHED can at times help, it's not enough to
>> > make a system usable. Such is the case for example when
>> > creating a > 50 GiB HVM guest; we can get softlockups [1] with:
>> >
>> > kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>> >
>> > The soft lockup triggers on the TASK_UNINTERRUPTIBLE hanger check
>> > (default 120 seconds); on the Xen side, in this particular case,
>> > this happens when the following Xen hypervisor code is used:
>> >
>> > xc_domain_set_pod_target() -->
>> > do_memory_op() -->
>> > arch_memory_op() -->
>> > p2m_pod_set_mem_target()
>> > -- long delay (real or emulated) --
>> >
>> > This happens on arch_memory_op() on the XENMEM_set_pod_target memory
>> > op even though arch_memory_op() can handle continuation via
>> > hypercall_create_continuation() for example.
>> >
>> > Machines over 50 GiB of memory are on high demand and hard to come
>> > by so to help replicate this sort of issue long delays on select
>> > hypercalls have been emulated in order to be able to test this on
>> > smaller machines [2].
>> >
>> > On one hand this issue can be considered expected, given that
>> > CONFIG_PREEMPT=n is used; however, we have set precedents of forced
>> > voluntary preemption in the kernel even for CONFIG_PREEMPT=n through
>> > the usage of cond_resched() sprinkled in many places. To address
>> > this issue with Xen hypercalls, though, we need to find a way to aid
>> > the scheduler in the middle of hypercalls. We are motivated to
>> > address this on CONFIG_PREEMPT=n as otherwise the system becomes
>> > rather unresponsive for long periods of time; in the worst case
>> > (currently only reproduced by emulating long delays on select disk
>> > I/O bound hypercalls) this can lead to filesystem corruption if the
>> > delay happens for example on SCHEDOP_remote_shutdown (when we call
>> > 'xl <domain> shutdown').
>> >
>> > We can address this problem by trying to check if we should schedule
>> > on the xen timer in the middle of a hypercall on the return from the
>> > timer interrupt. We want to be careful to not always force voluntary
>> > preemption though so to do this we only selectively enable preemption
>> > on very specific xen hypercalls.
>> >
>> > This enables hypercall preemption by selectively forcing checks for
>> > voluntary preemption only on ioctl-initiated privcmd hypercalls,
>> > where we know some folks have run into the reported issues [1].
>> >
>> > [0] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=42217cbc5b3e84b8c145d8cfb62dd5de0134b9e8;hp=3a0b9c57d5c9e82c55dd967c84dd06cb43c49ee9
>> > [1] https://bugzilla.novell.com/show_bug.cgi?id=861093
>> > [2] http://ftp.suse.com/pub/people/mcgrof/xen/emulate-long-xen-hypercalls.patch
>> >
>> > Based on original work by: David Vrabel <[email protected]>
>> > Cc: Borislav Petkov <[email protected]>
>> > Cc: David Vrabel <[email protected]>
>> > Cc: Thomas Gleixner <[email protected]>
>> > Cc: Ingo Molnar <[email protected]>
>> > Cc: "H. Peter Anvin" <[email protected]>
>> > Cc: [email protected]
>> > Cc: Andy Lutomirski <[email protected]>
>> > Cc: Steven Rostedt <[email protected]>
>> > Cc: Masami Hiramatsu <[email protected]>
>> > Cc: Jan Beulich <[email protected]>
>> > Cc: [email protected]
>> > Signed-off-by: Luis R. Rodriguez <[email protected]>
>> > ---
>> > arch/x86/kernel/entry_32.S | 21 +++++++++++++++++++++
>> > arch/x86/kernel/entry_64.S | 17 +++++++++++++++++
>> > drivers/xen/Makefile | 2 +-
>> > drivers/xen/preempt.c | 17 +++++++++++++++++
>> > drivers/xen/privcmd.c | 2 ++
>> > include/xen/xen-ops.h | 26 ++++++++++++++++++++++++++
>> > 6 files changed, 84 insertions(+), 1 deletion(-)
>> > create mode 100644 drivers/xen/preempt.c
>> >
>> > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
>> > index 344b63f..40b5c0c 100644
>> > --- a/arch/x86/kernel/entry_32.S
>> > +++ b/arch/x86/kernel/entry_32.S
>> > @@ -982,7 +982,28 @@ ENTRY(xen_hypervisor_callback)
>> > ENTRY(xen_do_upcall)
>> > 1: mov %esp, %eax
>> > call xen_evtchn_do_upcall
>> > +#ifdef CONFIG_PREEMPT
>> > jmp ret_from_intr
>> > +#else
>> > + GET_THREAD_INFO(%ebp)
>> > +#ifdef CONFIG_VM86
>> > + movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
>> > + movb PT_CS(%esp), %al
>> > + andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
>> > +#else
>> > + movl PT_CS(%esp), %eax
>> > + andl $SEGMENT_RPL_MASK, %eax
>> > +#endif
>> > + cmpl $USER_RPL, %eax
>> > + jae resume_userspace # returning to v8086 or userspace
>> > + DISABLE_INTERRUPTS(CLBR_ANY)
>> > + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
>> > + jz resume_kernel
>> > + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
>> > + call cond_resched_irq
>> > + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
>> > + jmp resume_kernel
>> > +#endif /* CONFIG_PREEMPT */
>> > CFI_ENDPROC
>> > ENDPROC(xen_hypervisor_callback)
>> >
>> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> > index c0226ab..0ccdd06 100644
>> > --- a/arch/x86/kernel/entry_64.S
>> > +++ b/arch/x86/kernel/entry_64.S
>> > @@ -1170,7 +1170,23 @@ ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
>> > popq %rsp
>> > CFI_DEF_CFA_REGISTER rsp
>> > decl PER_CPU_VAR(irq_count)
>> > +#ifdef CONFIG_PREEMPT
>> > jmp error_exit
>> > +#else
>> > + movl %ebx, %eax
>> > + RESTORE_REST
>> > + DISABLE_INTERRUPTS(CLBR_NONE)
>> > + TRACE_IRQS_OFF
>> > + GET_THREAD_INFO(%rcx)
>> > + testl %eax, %eax
>> > + je error_exit_user
>> > + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
>> > + jz retint_kernel
>>
>> I think I understand this part.
>>
>> > + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
>>
>> Why? Is the issue that, if preemptible hypercalls nest, you don't
>> want to preempt again?
>
>
> So this is the callback on the Xen timer; without the per-CPU variable
> we would not be able to selectively preempt specific hypercalls,
> and we'd make a no-preempt kernel fully preemptive.
>
> I asked the same question, see:
>
> https://lkml.org/lkml/2014/12/3/756
>
>>
>> > + call cond_resched_irq
>>
>> On !CONFIG_PREEMPT, there's no preempt_disable, right? So how do you
>> guarantee that you don't preempt something you shouldn't?
>
> Not sure I follow, but in essence this is no different from the use
> of cond_resched() on !CONFIG_PREEMPT; the only thing here is we are
> in interrupt context. If this is about the abuse of making !CONFIG_PREEMPT
> voluntarily preemptive at select points, then I had similar concerns,
> and David pointed out to me the wide use of cond_resched() in the
> kernel.
>
>> Is the idea
>> that these events will only fire nested *directly* inside a
>> preemptible hypercall?
>
> Yeah its the timer interrupt that would trigger the above.
>
>> Also, should you check that IRQs were on when
>> the event fired? (Are they on in pt_regs?)
>
> Right before this, xen_evtchn_do_upcall() is issued, which
> saves pt_regs and then restores them.
>
>> > + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
>> > + jmp retint_kernel
>> > +#endif /* CONFIG_PREEMPT */
>> > CFI_ENDPROC
>>
>> All that being said, this is IMO a bit gross.
>
> That was my first reaction hence my original attempt to try to get away from
> this. I've learned David also tried to not go down this route too before,
> more on this below.
>
>> You've added a bunch of
>> asm that's kind of like a parallel error_exit,
>
> yeah... I first tried to macro-ize this but it looked hairy. If we
> wanted to do that (although the name probably isn't best):
>
> 32-bit:
> .macro test_from_kernel kernel_ret:req
> GET_THREAD_INFO(%ebp)
> #ifdef CONFIG_VM86
> movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
> movb PT_CS(%esp), %al
> andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
> #else
> /*
> * We can be coming here from child spawned by kernel_thread().
> */
> movl PT_CS(%esp), %eax
> andl $SEGMENT_RPL_MASK, %eax
> #endif
> cmpl $USER_RPL, %eax
> jb \kernel_ret
> .endm
>
> 64-bit:
> .macro test_from_kernel kernel_ret:req
> movl %ebx,%eax
> RESTORE_REST
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> GET_THREAD_INFO(%rcx)
> testl %eax,%eax
> jne \kernel_ret
> .endm
>
>> and the error entry and
>> exit code is hairy enough that this scares me.
>
> yeah...
>
>> Can you do this mostly in C instead? This would look a nicer if it could be:
>>
>> call xen_evtchn_do_upcall
>> popq %rsp
>> CFI_DEF_CFA_REGISTER rsp
>> decl PER_CPU_VAR(irq_count)
>> + call xen_end_upcall
>> jmp error_exit
>
> It seems David tried this originally eons ago:
>
> http://lists.xen.org/archives/html/xen-devel/2014-02/msg01101.html
>
> and the strategy was shifted based on Jan's feedback that we could
> not schedule as we're on the IRQ stack. David evolved the strategy
> to asm and to use preempt_schedule_irq(); this new iteration just
> revisits the same old approach but tries to generalize scheduling in
> IRQ context under very special circumstances.

Indeed. But look more closely at my proposed one-line asm patch:
we're not on the irq stack.

Also, to make this more obviously safe wrt preempting at the wrong
time, would it be possible to check regs->ip and make sure it's
pointing exactly where you expect before scheduling? (You'd need one
more line of asm to get pt_regs in xen_end_upcall.)

--Andy, the stack switching maestro extraordinaire :)

>
>> Where xen_end_upcall would be written in C, nokprobes and notrace (if
>> needed) and would check pt_regs and whatever else and just call
>> schedule if needed?
>
> If there's a way to do it it'd be great. I am not sure if we can though.
>
> Luis



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-11 11:09:47

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On 10/12/14 23:51, Andy Lutomirski wrote:
> On Wed, Dec 10, 2014 at 3:34 PM, Luis R. Rodriguez
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -1170,7 +1170,23 @@ ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
>> popq %rsp
>> CFI_DEF_CFA_REGISTER rsp
>> decl PER_CPU_VAR(irq_count)
>> +#ifdef CONFIG_PREEMPT
>> jmp error_exit
>> +#else
>> + movl %ebx, %eax
>> + RESTORE_REST
>> + DISABLE_INTERRUPTS(CLBR_NONE)
>> + TRACE_IRQS_OFF
>> + GET_THREAD_INFO(%rcx)
>> + testl %eax, %eax
>> + je error_exit_user
>> + cmpb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
>> + jz retint_kernel
>
> I think I understand this part.
>
>> + movb $0,PER_CPU_VAR(xen_in_preemptible_hcall)
>
> Why? Is the issue that, if preemptible hypercalls nest, you don't
> want to preempt again?

We need to clear and reset this per-cpu variable around the schedule
point since the current task may be rescheduled on a different CPU, or
we may switch to a task that was previously preempted at this point.

That this prevents nested preemption is fine because we only want
hypercalls issued by the privcmd driver on behalf of userspace to be
preemptible.

>> + call cond_resched_irq
>
> On !CONFIG_PREEMPT, there's no preempt_disable, right? So how do you
> guarantee that you don't preempt something you shouldn't? Is the idea
> that these events will only fire nested *directly* inside a
> preemptible hypercall? Also, should you check that IRQs were on when
> the event fired? (Are they on in pt_regs?)

Testing xen_in_preemptible_hcall is sufficient. We bracket the
hypercalls we want to be preemptible like so:

xen_preemptible_hcall_begin();
ret = privcmd_call(hypercall.op,
hypercall.arg[0], hypercall.arg[1],
hypercall.arg[2], hypercall.arg[3],
hypercall.arg[4]);
xen_preemptible_hcall_end();

begin() and end() are somewhat like a Xen-specific preempt_enable() and
preempt_disable(), overriding the default no-preempt state.

>> + movb $1,PER_CPU_VAR(xen_in_preemptible_hcall)
>> + jmp retint_kernel
>> +#endif /* CONFIG_PREEMPT */
>> CFI_ENDPROC
>
> All that being said, this is IMO a bit gross. You've added a bunch of
> asm that's kind of like a parallel error_exit, and the error entry and
> exit code is hairy enough that this scares me. Can you do this mostly
> in C instead? This would look a nicer if it could be:

I abandoned my initial attempt that looked like this because I thought
it was gross too.

> call xen_evtchn_do_upcall
> popq %rsp
> CFI_DEF_CFA_REGISTER rsp
> decl PER_CPU_VAR(irq_count)
> + call xen_end_upcall
> jmp error_exit
>
> Where xen_end_upcall would be written in C, nokprobes and notrace (if
> needed) and would check pt_regs and whatever else and just call
> schedule if needed?

Oh that's a good idea, thanks!

David

2014-12-11 13:32:00

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] sched: add cond_resched_irq()

>>> On 11.12.14 at 00:34, <[email protected]> wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4239,6 +4239,16 @@ int __sched _cond_resched(void)
> }
> EXPORT_SYMBOL(_cond_resched);
>
> +int __sched cond_resched_irq(void)
> +{
> + if (should_resched()) {
> + preempt_schedule_irq();
> + return 1;
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(cond_resched_irq);

Do you really want to export to modules a symbol like this?

Jan

2014-12-11 18:48:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On 12/10/2014 05:03 PM, Luis R. Rodriguez wrote:
>
> This is an issue only for *non*-preemptive kernels.
>
> Some of Xen's hypercalls can take a long time and unfortunately for
> *non*-preemptive kernels this can be quite a bit of an issue.
> We've handled situations like this with cond_resched() before, which
> pushes even *non*-preemptive kernels to behave as voluntarily
> preemptive. I was not aware to what extent this was done and what
> precedents were set, but it's pretty widespread now... this then just
> addresses one particular case where this is also an issue, but now in
> IRQ context.
>
> I agree it's a hack, but so are all the other cond_resched() calls then.
> I don't think it's a good idea to be spreading use of something like
> this everywhere, but after careful review and trying to avoid this
> exact code for a while I have not been able to find any other
> reasonable alternative.
>

This sounds like a patch that is completely unrelated to the rest of the
patch.

-hpa

2014-12-11 20:39:26

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On Thu, Dec 11, 2014 at 10:47:44AM -0800, H. Peter Anvin wrote:
> On 12/10/2014 05:03 PM, Luis R. Rodriguez wrote:
> >
> > This is an issue only for *non*-preemptive kernels.
> >
> > Some of Xen's hypercalls can take a long time and unfortunately for
> > *non*-preemptive kernels this can be quite a bit of an issue.
> > We've handled situations like this with cond_resched() before, which
> > pushes even *non*-preemptive kernels to behave as voluntarily
> > preemptive. I was not aware to what extent this was done and what
> > precedents were set, but it's pretty widespread now... this then just
> > addresses one particular case where this is also an issue, but now in
> > IRQ context.
> >
> > I agree it's a hack, but so are all the other cond_resched() calls then.
> > I don't think it's a good idea to be spreading use of something like
> > this everywhere, but after careful review and trying to avoid this
> > exact code for a while I have not been able to find any other
> > reasonable alternative.
> >
>
> This sounds like a patch that is completely unrelated to the rest of the
> patch.

If you mean architecture and design, then yes; however, this patch
tries to find a resolution within the existing architecture.

Luis

2014-12-11 21:05:47

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

On Thu, Dec 11, 2014 at 11:09:42AM +0000, David Vrabel wrote:
> On 10/12/14 23:51, Andy Lutomirski wrote:
> > On Wed, Dec 10, 2014 at 3:34 PM, Luis R. Rodriguez
> > All that being said, this is IMO a bit gross. You've added a bunch of
> > asm that's kind of like a parallel error_exit, and the error entry and
> > exit code is hairy enough that this scares me. Can you do this mostly
> > in C instead? This would look a nicer if it could be:
>
> I abandoned my initial attempt that looked like this because I thought
> it was gross too.
>
> > call xen_evtchn_do_upcall
> > popq %rsp
> > CFI_DEF_CFA_REGISTER rsp
> > decl PER_CPU_VAR(irq_count)
> > + call xen_end_upcall
> > jmp error_exit
> >
> > Where xen_end_upcall would be written in C, nokprobes and notrace (if
> > needed) and would check pt_regs and whatever else and just call
> > schedule if needed?
>
> Oh that's a good idea, thanks!

David, are you going to respin yourself with the goal to get this upstream?
If so I can move on with life on other matters.

Luis

2014-12-11 21:06:51

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] sched: add cond_resched_irq()

On Thu, Dec 11, 2014 at 06:31:54AM -0700, Jan Beulich wrote:
> >>> On 11.12.14 at 00:34, <[email protected]> wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4239,6 +4239,16 @@ int __sched _cond_resched(void)
> > }
> > EXPORT_SYMBOL(_cond_resched);
> >
> > +int __sched cond_resched_irq(void)
> > +{
> > + if (should_resched()) {
> > + preempt_schedule_irq();
> > + return 1;
> > + }
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(cond_resched_irq);
>
> Do you really want to export to modules a symbol like this?

You mean let's not; true, and a good point. It seems that if
we go with Andy's suggestion we may not need this anyway.

Luis

2014-12-18 19:24:18

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 2/2] x86/xen: allow privcmd hypercalls to be preempted

> index 0000000..b5a3e98
> --- /dev/null
> +++ b/drivers/xen/preempt.c
> @@ -0,0 +1,17 @@
> +/*
> + * Preemptible hypercalls
> + *
> + * Copyright (C) 2014 Citrix Systems R&D ltd.
> + *
> + * This source code is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 of the
> + * License, or (at your option) any later version.
> + */
> +
> +#include <xen/xen-ops.h>
> +
> +#ifndef CONFIG_PREEMPT
> +DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);
> +EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
> +#endif

Please also add this to the patch:


diff --git a/drivers/xen/preempt.c b/drivers/xen/preempt.c
index b5a3e98..5d773dc 100644
--- a/drivers/xen/preempt.c
+++ b/drivers/xen/preempt.c
@@ -13,5 +13,5 @@

#ifndef CONFIG_PREEMPT
DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);
-EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
+EXPORT_PER_CPU_SYMBOL_GPL(xen_in_preemptible_hcall);
#endif