2015-04-01 21:26:40

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

When I wrote the opportunistic SYSRET code, I missed an important
difference between SYSRET and IRET. Both instructions are capable
of setting EFLAGS.TF, but they behave differently when doing so.
IRET will not issue a #DB trap after execution when it sets TF This
is critical -- otherwise you'd never be able to make forward
progress when returning to userspace. SYSRET, on the other hand,
will trap with #DB immediately after returning to CPL3, and the next
instruction will never execute.

This breaks anything that opportunistically SYSRETs to a user
context with TF set. For example, running this code with TF set and
a SIGTRAP handler loaded never gets past post_nop.

extern unsigned char post_nop[];
asm volatile ("pushfq\n\t"
"popq %%r11\n\t"
"nop\n\t"
"post_nop:"
: : "c" (post_nop) : "r11");

In my defense, I can't find this documented in the AMD or Intel
manual.

Fix it by using IRET to restore TF.

Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
Signed-off-by: Andy Lutomirski <[email protected]>
---

This affects 4.0-rc as well as -tip. A full test case lives here:

https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/

It's called single_step_syscall_64.

On Intel systems, the 32-bit version of that test fails for unrelated
reasons, but that's not a regression, and fixing it will be much more
intrusive.

Changes from v1:
- Remove mention of testl from changelog.
- Improve comment per Denys' suggestion.

arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 750c6efcb718..537716380959 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -715,7 +715,21 @@ retint_swapgs: /* return to user-space */
cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
jne opportunistic_sysret_failed

- testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
+ /*
+ * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
+ * restoring TF results in a trap from userspace immediately after
+ * SYSRET. This would cause an infinite loop whenever #DB happens
+ * with register state that satisfies the opportunistic SYSRET
+ * conditions. For example, single-stepping this user code:
+ *
+ * movq $stuck_here,%rcx
+ * pushfq
+ * popq %r11
+ * stuck_here:
+ *
+ * would never get past stuck_here.
+ */
+ testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
jnz opportunistic_sysret_failed

/* nothing to check for RSP */
--
2.3.0


2015-04-02 06:23:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

On Wed, Apr 01, 2015 at 02:26:34PM -0700, Andy Lutomirski wrote:
> When I wrote the opportunistic SYSRET code, I missed an important
> difference between SYSRET and IRET. Both instructions are capable
> of setting EFLAGS.TF, but they behave differently when doing so.
> IRET will not issue a #DB trap after execution when it sets TF This
> is critical -- otherwise you'd never be able to make forward
> progress when returning to userspace. SYSRET, on the other hand,
> will trap with #DB immediately after returning to CPL3, and the next
> instruction will never execute.
>
> This breaks anything that opportunistically SYSRETs to a user
> context with TF set. For example, running this code with TF set and
> a SIGTRAP handler loaded never gets past post_nop.
>
> extern unsigned char post_nop[];
> asm volatile ("pushfq\n\t"
> "popq %%r11\n\t"
> "nop\n\t"
> "post_nop:"
> : : "c" (post_nop) : "r11");
>
> In my defense, I can't find this documented in the AMD or Intel
> manual.
>
> Fix it by using IRET to restore TF.
>
> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
>
> This affects 4.0-rc as well as -tip. A full test case lives here:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
>
> It's called single_step_syscall_64.
>
> On Intel systems, the 32-bit version of that test fails for unrelated
> reasons, but that's not a regression, and fixing it will be much more
> intrusive.
>
> Changes from v1:
> - Remove mention of testl from changelog.
> - Improve comment per Denys' suggestion.
>
> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 750c6efcb718..537716380959 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -715,7 +715,21 @@ retint_swapgs: /* return to user-space */
> cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
> jne opportunistic_sysret_failed
>
> - testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
> + /*
> + * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
> + * restoring TF results in a trap from userspace immediately after
> + * SYSRET. This would cause an infinite loop whenever #DB happens
> + * with register state that satisfies the opportunistic SYSRET
> + * conditions. For example, single-stepping this user code:
> + *
> + * movq $stuck_here,%rcx
> + * pushfq
> + * popq %r11
> + * stuck_here:
> + *
> + * would never get past stuck_here.
> + */
> + testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
> jnz opportunistic_sysret_failed
>
> /* nothing to check for RSP */

Acked-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-02 09:07:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set


* Andy Lutomirski <[email protected]> wrote:

> When I wrote the opportunistic SYSRET code, I missed an important
> difference between SYSRET and IRET. Both instructions are capable
> of setting EFLAGS.TF, but they behave differently when doing so.
> IRET will not issue a #DB trap after execution when it sets TF This
> is critical -- otherwise you'd never be able to make forward
> progress when returning to userspace. SYSRET, on the other hand,
> will trap with #DB immediately after returning to CPL3, and the next
> instruction will never execute.
>
> This breaks anything that opportunistically SYSRETs to a user
> context with TF set. For example, running this code with TF set and
> a SIGTRAP handler loaded never gets past post_nop.
>
> extern unsigned char post_nop[];
> asm volatile ("pushfq\n\t"
> "popq %%r11\n\t"
> "nop\n\t"
> "post_nop:"
> : : "c" (post_nop) : "r11");
>
> In my defense, I can't find this documented in the AMD or Intel
> manual.
>
> Fix it by using IRET to restore TF.
>
> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
>
> This affects 4.0-rc as well as -tip. A full test case lives here:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
>
> It's called single_step_syscall_64.
>
> On Intel systems, the 32-bit version of that test fails for unrelated
> reasons, but that's not a regression, and fixing it will be much more
> intrusive.
>
> Changes from v1:
> - Remove mention of testl from changelog.
> - Improve comment per Denys' suggestion.
>
> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 750c6efcb718..537716380959 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -715,7 +715,21 @@ retint_swapgs: /* return to user-space */
> cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
> jne opportunistic_sysret_failed
>
> - testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
> + /*
> + * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
> + * restoring TF results in a trap from userspace immediately after
> + * SYSRET. This would cause an infinite loop whenever #DB happens
> + * with register state that satisfies the opportunistic SYSRET
> + * conditions. For example, single-stepping this user code:
> + *
> + * movq $stuck_here,%rcx
> + * pushfq
> + * popq %r11
> + * stuck_here:
> + *
> + * would never get past stuck_here.
> + */
> + testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
> jnz opportunistic_sysret_failed

So I merged this as it's an obvious bugfix, but in hindsight I'm
really uneasy about the whole opportunistic SYSRET concept: it appears
that the chance that %rcx matches return-%rip is astronomical - this
is why this bug wasn't noticed live so far.

So should we really be doing this?

It invites fragility and despite being clever, it adds average
overhead to interrupt returns to user-space: the chance of the
optimization hitting and helping us are much lower than the always
paid for cost of doing the tests for it...

So please defend it.

Thanks,

Ingo

2015-04-02 10:07:22

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

On 04/02/2015 11:07 AM, Ingo Molnar wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> When I wrote the opportunistic SYSRET code, I missed an important
>> difference between SYSRET and IRET. Both instructions are capable
>> of setting EFLAGS.TF, but they behave differently when doing so.
>> IRET will not issue a #DB trap after execution when it sets TF This
>> is critical -- otherwise you'd never be able to make forward
>> progress when returning to userspace. SYSRET, on the other hand,
>> will trap with #DB immediately after returning to CPL3, and the next
>> instruction will never execute.
>>
>> This breaks anything that opportunistically SYSRETs to a user
>> context with TF set. For example, running this code with TF set and
>> a SIGTRAP handler loaded never gets past post_nop.
>>
>> extern unsigned char post_nop[];
>> asm volatile ("pushfq\n\t"
>> "popq %%r11\n\t"
>> "nop\n\t"
>> "post_nop:"
>> : : "c" (post_nop) : "r11");
>>
>> In my defense, I can't find this documented in the AMD or Intel
>> manual.
>>
>> Fix it by using IRET to restore TF.
>>
>> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>>
>> This affects 4.0-rc as well as -tip. A full test case lives here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
>>
>> It's called single_step_syscall_64.
>>
>> On Intel systems, the 32-bit version of that test fails for unrelated
>> reasons, but that's not a regression, and fixing it will be much more
>> intrusive.
>>
>> Changes from v1:
>> - Remove mention of testl from changelog.
>> - Improve comment per Denys' suggestion.
>>
>> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index 750c6efcb718..537716380959 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -715,7 +715,21 @@ retint_swapgs: /* return to user-space */
>> cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
>> jne opportunistic_sysret_failed
>>
>> - testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
>> + /*
>> + * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
>> + * restoring TF results in a trap from userspace immediately after
>> + * SYSRET. This would cause an infinite loop whenever #DB happens
>> + * with register state that satisfies the opportunistic SYSRET
>> + * conditions. For example, single-stepping this user code:
>> + *
>> + * movq $stuck_here,%rcx
>> + * pushfq
>> + * popq %r11
>> + * stuck_here:
>> + *
>> + * would never get past stuck_here.
>> + */
>> + testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
>> jnz opportunistic_sysret_failed
>
> So I merged this as it's an obvious bugfix, but in hindsight I'm
> really uneasy about the whole opportunistic SYSRET concept: it appears
> that the chance that %rcx matches return-%rip is astronomical - this
> is why this bug wasn't noticed live so far.
>
> So should we really be doing this?

Andy does this not for the off-chance that userspace's RCX
is equal to return address and R11 == RFLAGS. The chances of that
are astronomically small.

This code path triggers when ptrace/audit/seccomp is active.
Instead of torturing ourselves trying to not divert into IRET return,
now code is steered that way. But then immediately before
actual IRET, we check again: "do we really need IRET?"
IOW "did ptrace really touch pt_regs->ss? ->flags? ->rip? ->rcx?"
which in vast majority of cases will not be true.

Since paranoiacs of Selinux managed to convince many distros
to run with Selinux enabled, many machines in fact run
with at least some on audit machinery *always active*.

--
vda

2015-04-02 10:37:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set


* Denys Vlasenko <[email protected]> wrote:

> On 04/02/2015 11:07 AM, Ingo Molnar wrote:
> >
> > * Andy Lutomirski <[email protected]> wrote:
> >
> >> When I wrote the opportunistic SYSRET code, I missed an important
> >> difference between SYSRET and IRET. Both instructions are capable
> >> of setting EFLAGS.TF, but they behave differently when doing so.
> >> IRET will not issue a #DB trap after execution when it sets TF This
> >> is critical -- otherwise you'd never be able to make forward
> >> progress when returning to userspace. SYSRET, on the other hand,
> >> will trap with #DB immediately after returning to CPL3, and the next
> >> instruction will never execute.
> >>
> >> This breaks anything that opportunistically SYSRETs to a user
> >> context with TF set. For example, running this code with TF set and
> >> a SIGTRAP handler loaded never gets past post_nop.
> >>
> >> extern unsigned char post_nop[];
> >> asm volatile ("pushfq\n\t"
> >> "popq %%r11\n\t"
> >> "nop\n\t"
> >> "post_nop:"
> >> : : "c" (post_nop) : "r11");
> >>
> >> In my defense, I can't find this documented in the AMD or Intel
> >> manual.
> >>
> >> Fix it by using IRET to restore TF.
> >>
> >> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
> >> Signed-off-by: Andy Lutomirski <[email protected]>
> >> ---
> >>
> >> This affects 4.0-rc as well as -tip. A full test case lives here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
> >>
> >> It's called single_step_syscall_64.
> >>
> >> On Intel systems, the 32-bit version of that test fails for unrelated
> >> reasons, but that's not a regression, and fixing it will be much more
> >> intrusive.
> >>
> >> Changes from v1:
> >> - Remove mention of testl from changelog.
> >> - Improve comment per Denys' suggestion.
> >>
> >> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
> >> 1 file changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> >> index 750c6efcb718..537716380959 100644
> >> --- a/arch/x86/kernel/entry_64.S
> >> +++ b/arch/x86/kernel/entry_64.S
> >> @@ -715,7 +715,21 @@ retint_swapgs: /* return to user-space */
> >> cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
> >> jne opportunistic_sysret_failed
> >>
> >> - testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
> >> + /*
> >> + * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
> >> + * restoring TF results in a trap from userspace immediately after
> >> + * SYSRET. This would cause an infinite loop whenever #DB happens
> >> + * with register state that satisfies the opportunistic SYSRET
> >> + * conditions. For example, single-stepping this user code:
> >> + *
> >> + * movq $stuck_here,%rcx
> >> + * pushfq
> >> + * popq %r11
> >> + * stuck_here:
> >> + *
> >> + * would never get past stuck_here.
> >> + */
> >> + testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
> >> jnz opportunistic_sysret_failed
> >
> > So I merged this as it's an obvious bugfix, but in hindsight I'm
> > really uneasy about the whole opportunistic SYSRET concept: it appears
> > that the chance that %rcx matches return-%rip is astronomical - this
> > is why this bug wasn't noticed live so far.
> >
> > So should we really be doing this?
>
> Andy does this not for the off-chance that userspace's RCX is equal
> to return address and R11 == RFLAGS. The chances of that are
> astronomically small.
>
> This code path triggers when ptrace/audit/seccomp is active. Instead
> of torturing ourselves trying to not divert into IRET return, now
> code is steered that way. But then immediately before actual IRET,
> we check again: "do we really need IRET?" IOW "did ptrace really
> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
> cases will not be true.

I keep forgetting about that, my test systems have the audit muck
turned off ;-)

Fair enough - and it's sensible to share the IRET path between
interrupts and complex-return system calls, even though the check
is unnecessary overhead for the pure interrupt return path...

Thanks,

Ingo

2015-04-02 11:14:29

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

On Thu, Apr 2, 2015 at 6:37 AM, Ingo Molnar <[email protected]> wrote:
>
> * Denys Vlasenko <[email protected]> wrote:
>
>> On 04/02/2015 11:07 AM, Ingo Molnar wrote:
>> >
>> > * Andy Lutomirski <[email protected]> wrote:
>> >
>> >> When I wrote the opportunistic SYSRET code, I missed an important
>> >> difference between SYSRET and IRET. Both instructions are capable
>> >> of setting EFLAGS.TF, but they behave differently when doing so.
>> >> IRET will not issue a #DB trap after execution when it sets TF This
>> >> is critical -- otherwise you'd never be able to make forward
>> >> progress when returning to userspace. SYSRET, on the other hand,
>> >> will trap with #DB immediately after returning to CPL3, and the next
>> >> instruction will never execute.
>> >>
>> >> This breaks anything that opportunistically SYSRETs to a user
>> >> context with TF set. For example, running this code with TF set and
>> >> a SIGTRAP handler loaded never gets past post_nop.
>> >>
>> >> extern unsigned char post_nop[];
>> >> asm volatile ("pushfq\n\t"
>> >> "popq %%r11\n\t"
>> >> "nop\n\t"
>> >> "post_nop:"
>> >> : : "c" (post_nop) : "r11");
>> >>
>> >> In my defense, I can't find this documented in the AMD or Intel
>> >> manual.
>> >>
>> >> Fix it by using IRET to restore TF.
>> >>
>> >> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
>> >> Signed-off-by: Andy Lutomirski <[email protected]>
>> >> ---
>> >>
>> >> This affects 4.0-rc as well as -tip. A full test case lives here:
>> >>
>> >> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
>> >>
>> >> It's called single_step_syscall_64.
>> >>
>> >> On Intel systems, the 32-bit version of that test fails for unrelated
>> >> reasons, but that's not a regression, and fixing it will be much more
>> >> intrusive.
>> >>
>> >> Changes from v1:
>> >> - Remove mention of testl from changelog.
>> >> - Improve comment per Denys' suggestion.
>> >>
>> >> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
>> >> 1 file changed, 15 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> >> index 750c6efcb718..537716380959 100644
>> >> --- a/arch/x86/kernel/entry_64.S
>> >> +++ b/arch/x86/kernel/entry_64.S
>> >> @@ -715,7 +715,21 @@ retint_swapgs: /* return to user-space */
>> >> cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
>> >> jne opportunistic_sysret_failed
>> >>
>> >> - testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
>> >> + /*
>> >> + * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
>> >> + * restoring TF results in a trap from userspace immediately after
>> >> + * SYSRET. This would cause an infinite loop whenever #DB happens
>> >> + * with register state that satisfies the opportunistic SYSRET
>> >> + * conditions. For example, single-stepping this user code:
>> >> + *
>> >> + * movq $stuck_here,%rcx
>> >> + * pushfq
>> >> + * popq %r11
>> >> + * stuck_here:
>> >> + *
>> >> + * would never get past stuck_here.
>> >> + */
>> >> + testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
>> >> jnz opportunistic_sysret_failed
>> >
>> > So I merged this as it's an obvious bugfix, but in hindsight I'm
>> > really uneasy about the whole opportunistic SYSRET concept: it appears
>> > that the chance that %rcx matches return-%rip is astronomical - this
>> > is why this bug wasn't noticed live so far.
>> >
>> > So should we really be doing this?
>>
>> Andy does this not for the off-chance that userspace's RCX is equal
>> to return address and R11 == RFLAGS. The chances of that are
>> astronomically small.
>>
>> This code path triggers when ptrace/audit/seccomp is active. Instead
>> of torturing ourselves trying to not divert into IRET return, now
>> code is steered that way. But then immediately before actual IRET,
>> we check again: "do we really need IRET?" IOW "did ptrace really
>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>> cases will not be true.
>
> I keep forgetting about that, my test systems have the audit muck
> turned off ;-)
>
> Fair enough - and it's sensible to share the IRET path between
> interrupts and complex-return system calls, even though the check
> is unnecessary overhead for the pure interrupt return path...


Maybe we could reintroduce TIF_IRET for this purpose instead of
(ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic
check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
it for interrupts.

--
Brian Gerst

2015-04-02 12:24:42

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

On 04/02/2015 01:14 PM, Brian Gerst wrote:
>>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
>>>> really uneasy about the whole opportunistic SYSRET concept: it appears
>>>> that the chance that %rcx matches return-%rip is astronomical - this
>>>> is why this bug wasn't noticed live so far.
>>>>
>>>> So should we really be doing this?
>>>
>>> Andy does this not for the off-chance that userspace's RCX is equal
>>> to return address and R11 == RFLAGS. The chances of that are
>>> astronomically small.
>>>
>>> This code path triggers when ptrace/audit/seccomp is active. Instead
>>> of torturing ourselves trying to not divert into IRET return, now
>>> code is steered that way. But then immediately before actual IRET,
>>> we check again: "do we really need IRET?" IOW "did ptrace really
>>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>>> cases will not be true.
>>
>> I keep forgetting about that, my test systems have the audit muck
>> turned off ;-)
>>
>> Fair enough - and it's sensible to share the IRET path between
>> interrupts and complex-return system calls, even though the check
>> is unnecessary overhead for the pure interrupt return path...
>
>
> Maybe we could reintroduce TIF_IRET for this purpose instead of
> (ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic
> check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
> it for interrupts.

The very first check in the existing code, pt_regs->cx == pt_regs->ip,
will fail for interrupt returns.

You hardly can save anything by placing a (ti->flags & TIF_TRY_SYSRET)
check in front of it, it's almost as expensive.

2015-04-02 12:32:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set


* Denys Vlasenko <[email protected]> wrote:

> On 04/02/2015 01:14 PM, Brian Gerst wrote:
> >>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
> >>>> really uneasy about the whole opportunistic SYSRET concept: it appears
> >>>> that the chance that %rcx matches return-%rip is astronomical - this
> >>>> is why this bug wasn't noticed live so far.
> >>>>
> >>>> So should we really be doing this?
> >>>
> >>> Andy does this not for the off-chance that userspace's RCX is equal
> >>> to return address and R11 == RFLAGS. The chances of that are
> >>> astronomically small.
> >>>
> >>> This code path triggers when ptrace/audit/seccomp is active. Instead
> >>> of torturing ourselves trying to not divert into IRET return, now
> >>> code is steered that way. But then immediately before actual IRET,
> >>> we check again: "do we really need IRET?" IOW "did ptrace really
> >>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
> >>> cases will not be true.
> >>
> >> I keep forgetting about that, my test systems have the audit muck
> >> turned off ;-)
> >>
> >> Fair enough - and it's sensible to share the IRET path between
> >> interrupts and complex-return system calls, even though the check
> >> is unnecessary overhead for the pure interrupt return path...
> >
> >
> > Maybe we could reintroduce TIF_IRET for this purpose instead of
> > (ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic
> > check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
> > it for interrupts.
>
> The very first check in the existing code, pt_regs->cx ==
> pt_regs->ip, will fail for interrupt returns.
>
> You hardly can save anything by placing a (ti->flags &
> TIF_TRY_SYSRET) check in front of it, it's almost as expensive.

Well, what I was thinking of was to have a pure irq (well, async
context) return path, not shared with the weird-syscall-IRET return
path at all ...

It would be open coded, not obfuscated via macros.

That way AFAICS the upsides are:

- it's easier to read (and maintain) what goes on in which case.
'*intr*' labels would truly identify interrupt return related
processing, for a change!

- we can optimize in a more directed fashion - like here

... while the downsides are:

- more code
- a (small) chance of a fix going to one path while not the other.

How much extra code would it be?

Thanks,

Ingo

Subject: [tip:x86/urgent] x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set

Commit-ID: 7ea24169097d3d3a3eab2dcc5773bc43fd5593e7
Gitweb: http://git.kernel.org/tip/7ea24169097d3d3a3eab2dcc5773bc43fd5593e7
Author: Andy Lutomirski <[email protected]>
AuthorDate: Wed, 1 Apr 2015 14:26:34 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 2 Apr 2015 11:09:54 +0200

x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set

When I wrote the opportunistic SYSRET code, I missed an important difference
between SYSRET and IRET.

Both instructions are capable of setting EFLAGS.TF, but they behave differently
when doing so:

- IRET will not issue a #DB trap after execution when it sets TF.
This is critical -- otherwise you'd never be able to make forward progress when
returning to userspace.

- SYSRET, on the other hand, will trap with #DB immediately after
returning to CPL3, and the next instruction will never execute.

This breaks anything that opportunistically SYSRETs to a user
context with TF set. For example, running this code with TF set
and a SIGTRAP handler loaded never gets past 'post_nop':

extern unsigned char post_nop[];
asm volatile ("pushfq\n\t"
"popq %%r11\n\t"
"nop\n\t"
"post_nop:"
: : "c" (post_nop) : "r11");

In my defense, I can't find this documented in the AMD or Intel manual.

Fix it by using IRET to restore TF.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 2a23c6b8a9c4 ("x86_64, entry: Use sysret to return to userspace when possible")
Link: http://lkml.kernel.org/r/9472f1ca4c19a38ecda45bba9c91b7168135fcfa.1427923514.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 2babb39..f0095a7 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -799,7 +799,21 @@ retint_swapgs: /* return to user-space */
cmpq %r11,(EFLAGS-ARGOFFSET)(%rsp) /* R11 == RFLAGS */
jne opportunistic_sysret_failed

- testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
+ /*
+ * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
+ * restoring TF results in a trap from userspace immediately after
+ * SYSRET. This would cause an infinite loop whenever #DB happens
+ * with register state that satisfies the opportunistic SYSRET
+ * conditions. For example, single-stepping this user code:
+ *
+ * movq $stuck_here,%rcx
+ * pushfq
+ * popq %r11
+ * stuck_here:
+ *
+ * would never get past 'stuck_here'.
+ */
+ testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
jnz opportunistic_sysret_failed

/* nothing to check for RSP */

2015-04-02 12:59:47

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

On 04/02/2015 02:31 PM, Ingo Molnar wrote:
>
> * Denys Vlasenko <[email protected]> wrote:
>
>> On 04/02/2015 01:14 PM, Brian Gerst wrote:
>>>>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
>>>>>> really uneasy about the whole opportunistic SYSRET concept: it appears
>>>>>> that the chance that %rcx matches return-%rip is astronomical - this
>>>>>> is why this bug wasn't noticed live so far.
>>>>>>
>>>>>> So should we really be doing this?
>>>>>
>>>>> Andy does this not for the off-chance that userspace's RCX is equal
>>>>> to return address and R11 == RFLAGS. The chances of that are
>>>>> astronomically small.
>>>>>
>>>>> This code path triggers when ptrace/audit/seccomp is active. Instead
>>>>> of torturing ourselves trying to not divert into IRET return, now
>>>>> code is steered that way. But then immediately before actual IRET,
>>>>> we check again: "do we really need IRET?" IOW "did ptrace really
>>>>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>>>>> cases will not be true.
>>>>
>>>> I keep forgetting about that, my test systems have the audit muck
>>>> turned off ;-)
>>>>
>>>> Fair enough - and it's sensible to share the IRET path between
>>>> interrupts and complex-return system calls, even though the check
>>>> is unnecessary overhead for the pure interrupt return path...
>>>
>>>
>>> Maybe we could reintroduce TIF_IRET for this purpose instead of
>>> (ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic
>>> check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
>>> it for interrupts.
>>
>> The very first check in the existing code, pt_regs->cx ==
>> pt_regs->ip, will fail for interrupt returns.
>>
>> You hardly can save anything by placing a (ti->flags &
>> TIF_TRY_SYSRET) check in front of it, it's almost as expensive.
>
> Well, what I was thinking of was to have a pure irq (well, async
> context) return path, not shared with the weird-syscall-IRET return
> path at all ...
>
> It would be open coded, not obfuscated via macros.
>
> That way AFAICS the upsides are:
>
> - it's easier to read (and maintain) what goes on in which case.
> '*intr*' labels would truly identify interrupt return related
> processing, for a change!

Re labels: I fully agree they need cleanup (mass rename).

Something along the lines of

int_ret_from_sys_call -> return_from_syscall
int_with_check -> sysret_check_workmask_in_edi
int_careful -> sysret_check_NEED_RESCHED
int_very_careful -> sysret_check_SYSCALL_EXIT
int_signal -> sysret_check_DO_NOTIFY_MASK
int_restore_rest -> sysret_next_check

ret_from_intr -> return_from_intr
retint_with_reschedule -> intr_check_WORK_MASK
retint_check -> intr_check_workmask_in_edi
retint_careful -> intr_check_NEED_RESCHED
retint_signal -> intr_check_DO_NOTIFY_MASK

retint_swapgs -> return_from_syscall_or_intr
irq_return_via_sysret -> return_via_sysret

retint_kernel -> intr_check_preempt
restore_args -> restore_c_regs
irq_return -> return_via_iret


and then your proposal can be rephrased as "let's stop
merging sysret and intr code paths at retint_swapgs".

Makes sense. It would entail some code duplication,
but the code will be easier to maintain.

> - we can optimize in a more directed fashion - like here
>
> ... while the downsides are:
>
> - more code
> - a (small) chance of a fix going to one path while not the other.
>
> How much extra code would it be?

A screenful or two.

2015-04-02 14:26:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

On Thu, Apr 2, 2015 at 5:31 AM, Ingo Molnar <[email protected]> wrote:
>
> * Denys Vlasenko <[email protected]> wrote:
>
>> On 04/02/2015 01:14 PM, Brian Gerst wrote:
>> >>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
>> >>>> really uneasy about the whole opportunistic SYSRET concept: it appears
>> >>>> that the chance that %rcx matches return-%rip is astronomical - this
>> >>>> is why this bug wasn't noticed live so far.
>> >>>>
>> >>>> So should we really be doing this?
>> >>>
>> >>> Andy does this not for the off-chance that userspace's RCX is equal
>> >>> to return address and R11 == RFLAGS. The chances of that are
>> >>> astronomically small.
>> >>>
>> >>> This code path triggers when ptrace/audit/seccomp is active. Instead
>> >>> of torturing ourselves trying to not divert into IRET return, now
>> >>> code is steered that way. But then immediately before actual IRET,
>> >>> we check again: "do we really need IRET?" IOW "did ptrace really
>> >>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>> >>> cases will not be true.
>> >>
>> >> I keep forgetting about that, my test systems have the audit muck
>> >> turned off ;-)
>> >>
>> >> Fair enough - and it's sensible to share the IRET path between
>> >> interrupts and complex-return system calls, even though the check
>> >> is unnecessary overhead for the pure interrupt return path...
>> >
>> >
>> > Maybe we could reintroduce TIF_IRET for this purpose instead of
>> > (ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic
>> > check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
>> > it for interrupts.
>>
>> The very first check in the existing code, pt_regs->cx ==
>> pt_regs->ip, will fail for interrupt returns.
>>
>> You hardly can save anything by placing a (ti->flags &
>> TIF_TRY_SYSRET) check in front of it, it's almost as expensive.
>
> Well, what I was thinking of was to have a pure irq (well, async
> context) return path, not shared with the weird-syscall-IRET return
> path at all ...
>
> It would be open coded, not obfuscated via macros.
>
> That way AFAICS the upsides are:
>
> - it's easier to read (and maintain) what goes on in which case.
> '*intr*' labels would truly identify interrupt return related
> processing, for a change!
>
> - we can optimize in a more directed fashion - like here
>
> ... while the downsides are:
>
> - more code
> - a (small) chance of a fix going to one path while not the other.
>
> How much extra code would it be?

Negative if we did it right, perhaps.

I think the best approach is a complete rewrite, not an attempt to
incrementally improve it. The current code is held together by gotos
and bailing wire, and I'm surprised it works at all. Some of it seems
to work by accident AFAICT. For example, the sysret audit "fast path"
that I deleted as part of the opportunistic sysret work was quite
buggy AFAICT, but no one who looks at this code cares about audit, and
it's nearly impossible to even tell what the code is supposed to do.

Linus tried to rewrite some of it last year, but it was incomplete.
Here's my vague inventory of what the exit paths need to do on return
to userspace:

- syscall_trace_leave [1] (syscall only)

- context tracking if TIF_NOHZ (all exits)

- one-shot work. These are things that must be done if the flags are
set, and doing it clears the flags. These flags need to be checked
with IRQs off, and IRQs cannot be re-enabled afterwards. This
includes signal delivery, scheduling, and user return notifiers. (all
exits)

- uprobes. Maybe this counts as one-shot work. (all exits, I assume)

- Check for sysret applicability (only important for syscalls)

- espfix, iret, unless we're using sysret (all exits)

There's also the special case of interrupt returns to kernel mode.
That should schedule in preemptable kernels.

To me, this suggests that we should have a total of four asm exit paths:

- syscall return. IMO this should call a single C function, check
the sysret conditions, and jump to the espfix code.

- paranoid return to kernel. This should just return after a
possible swapgs. (We do this now.)

- interrupt/exception return. IMO this should call a single C
function and jump to the espfix code. [2]

- NMI -- this is its own crazy thing.

All of the check/careful/very_careful crap can just go. Good riddance.


[1] syscall_trace_leave contains context tracking hooks that are
AFAICT completely unnecessary. That's almost okay, though -- other
similarly silly context tracking hooks fix up the mess. This might
explain part of why context tracking is so incredibly slow.

[2] I don't even think we need the retint_careful vs retint_kernel
distinction in asm. If we ditch that, we should be able to replace it
with something like:

call prepare_intr_exception_return
testl $ebx,$ebx
jz 1f
SWAPGS
1f:
INTERRUPT_RETURN

Of course, prepare_intr_exception_return would need to DTRT if we're
returning to kernel mode, but that's easy.

--Andy

2015-04-02 15:49:11

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

On 04/02/2015 02:59 PM, Denys Vlasenko wrote:
> On 04/02/2015 02:31 PM, Ingo Molnar wrote:
>> - we can optimize in a more directed fashion - like here
>>
>> ... while the downsides are:
>>
>> - more code
>> - a (small) chance of a fix going to one path while not the other.
>>
>> How much extra code would it be?
>
> A screenful or two.

I took a stab at it:

text data bss dec hex filename
12530 0 0 12530 30f2 entry_64.o2
12562 0 0 12562 3112 entry_64.o

The patch does two steps:

(1) copy-pastes "retint_swapgs:" code into syscall handling code,
the copy is under "syscall_return:" label.

(2) remove "opportunistic sysret" code from "retint_swapgs" code block,
since now it won't be reached by syscall return. This in fact removes
most of the code in question.

Lightly run-tested so far.

Ingo, do you want this in a proper patch form?
--
vda


diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7bc097a..5ea2dd1 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -354,8 +354,8 @@ GLOBAL(int_with_check)
movl TI_flags(%rcx),%edx
andl %edi,%edx
jnz int_careful
- andl $~TS_COMPAT,TI_status(%rcx)
- jmp retint_swapgs
+ andl $~TS_COMPAT,TI_status(%rcx)
+ jmp syscall_return

/* Either reschedule or signal or syscall exit tracking needed. */
/* First do a reschedule test. */
@@ -399,9 +399,86 @@ int_restore_rest:
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check
+
+syscall_return:
+ /* The iretq could re-enable interrupts: */
+ DISABLE_INTERRUPTS(CLBR_ANY)
+ TRACE_IRQS_IRETQ
+
+ /*
+ * Try to use SYSRET instead of IRET if we're returning to
+ * a completely clean 64-bit userspace context.
+ */
+ movq RCX(%rsp),%rcx
+ cmpq %rcx,RIP(%rsp) /* RCX == RIP */
+ jne opportunistic_sysret_failed
+
+ /*
+ * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+ * in kernel space. This essentially lets the user take over
+ * the kernel, since userspace controls RSP. It's not worth
+ * testing for canonicalness exactly -- this check detects any
+ * of the 17 high bits set, which is true for non-canonical
+ * or kernel addresses. (This will pessimize vsyscall=native.
+ * Big deal.)
+ *
+ * If virtual addresses ever become wider, this will need
+ * to be updated to remain correct on both old and new CPUs.
+ */
+ .ifne __VIRTUAL_MASK_SHIFT - 47
+ .error "virtual address width changed -- sysret checks need update"
+ .endif
+ shr $__VIRTUAL_MASK_SHIFT, %rcx
+ jnz opportunistic_sysret_failed
+
+ cmpq $__USER_CS,CS(%rsp) /* CS must match SYSRET */
+ jne opportunistic_sysret_failed
+
+ movq R11(%rsp),%r11
+ cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
+ jne opportunistic_sysret_failed
+
+ /*
+ * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
+ * restoring TF results in a trap from userspace immediately after
+ * SYSRET. This would cause an infinite loop whenever #DB happens
+ * with register state that satisfies the opportunistic SYSRET
+ * conditions. For example, single-stepping this user code:
+ *
+ * movq $stuck_here,%rcx
+ * pushfq
+ * popq %r11
+ * stuck_here:
+ *
+ * would never get past 'stuck_here'.
+ */
+ testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
+ jnz opportunistic_sysret_failed
+
+ /* nothing to check for RSP */
+
+ cmpq $__USER_DS,SS(%rsp) /* SS must match SYSRET */
+ jne opportunistic_sysret_failed
+
+ /*
+ * We win! This label is here just for ease of understanding
+ * perf profiles. Nothing jumps here.
+ */
+syscall_return_via_sysret:
+ CFI_REMEMBER_STATE
+ /* r11 is already restored (see code above) */
+ RESTORE_C_REGS_EXCEPT_R11
+ movq RSP(%rsp),%rsp
+ USERGS_SYSRET64
+ CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
+ SWAPGS
+ jmp restore_args
CFI_ENDPROC
END(system_call)

+
.macro FORK_LIKE func
ENTRY(stub_\func)
CFI_STARTPROC
@@ -672,74 +749,6 @@ retint_swapgs: /* return to user-space */
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_IRETQ

- /*
- * Try to use SYSRET instead of IRET if we're returning to
- * a completely clean 64-bit userspace context.
- */
- movq RCX(%rsp),%rcx
- cmpq %rcx,RIP(%rsp) /* RCX == RIP */
- jne opportunistic_sysret_failed
-
- /*
- * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
- * in kernel space. This essentially lets the user take over
- * the kernel, since userspace controls RSP. It's not worth
- * testing for canonicalness exactly -- this check detects any
- * of the 17 high bits set, which is true for non-canonical
- * or kernel addresses. (This will pessimize vsyscall=native.
- * Big deal.)
- *
- * If virtual addresses ever become wider, this will need
- * to be updated to remain correct on both old and new CPUs.
- */
- .ifne __VIRTUAL_MASK_SHIFT - 47
- .error "virtual address width changed -- sysret checks need update"
- .endif
- shr $__VIRTUAL_MASK_SHIFT, %rcx
- jnz opportunistic_sysret_failed
-
- cmpq $__USER_CS,CS(%rsp) /* CS must match SYSRET */
- jne opportunistic_sysret_failed
-
- movq R11(%rsp),%r11
- cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
- jne opportunistic_sysret_failed
-
- /*
- * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
- * restoring TF results in a trap from userspace immediately after
- * SYSRET. This would cause an infinite loop whenever #DB happens
- * with register state that satisfies the opportunistic SYSRET
- * conditions. For example, single-stepping this user code:
- *
- * movq $stuck_here,%rcx
- * pushfq
- * popq %r11
- * stuck_here:
- *
- * would never get past 'stuck_here'.
- */
- testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
- jnz opportunistic_sysret_failed
-
- /* nothing to check for RSP */
-
- cmpq $__USER_DS,SS(%rsp) /* SS must match SYSRET */
- jne opportunistic_sysret_failed
-
- /*
- * We win! This label is here just for ease of understanding
- * perf profiles. Nothing jumps here.
- */
-irq_return_via_sysret:
- CFI_REMEMBER_STATE
- /* r11 is already restored (see code above) */
- RESTORE_C_REGS_EXCEPT_R11
- movq RSP(%rsp),%rsp
- USERGS_SYSRET64
- CFI_RESTORE_STATE
-
-opportunistic_sysret_failed:
SWAPGS
jmp restore_args

2015-04-02 16:08:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set


* Denys Vlasenko <[email protected]> wrote:

> On 04/02/2015 02:59 PM, Denys Vlasenko wrote:
> > On 04/02/2015 02:31 PM, Ingo Molnar wrote:
> >> - we can optimize in a more directed fashion - like here
> >>
> >> ... while the downsides are:
> >>
> >> - more code
> >> - a (small) chance of a fix going to one path while not the other.
> >>
> >> How much extra code would it be?
> >
> > A screenful or two.
>
> I took a stab at it:
>
> text data bss dec hex filename
> 12530 0 0 12530 30f2 entry_64.o2
> 12562 0 0 12562 3112 entry_64.o
>
> The patch does two steps:
>
> (1) copy-pastes "retint_swapgs:" code into syscall handling code,
> the copy is under "syscall_return:" label.
>
> (2) remove "opportunistic sysret" code from "retint_swapgs" code block,
> since now it won't be reached by syscall return. This in fact removes
> most of the code in question.
>
> Lightly run-tested so far.
>
> Ingo, do you want this in a proper patch form?

Yeah, that looks good to me (only lightly reviewed).

Thanks,

Ingo