Commit-ID: b3b42ac2cbae1f3cecbb6229964a4d48af31d382
Gitweb: http://git.kernel.org/tip/b3b42ac2cbae1f3cecbb6229964a4d48af31d382
Author: H. Peter Anvin <[email protected]>
AuthorDate: Sun, 16 Mar 2014 15:31:54 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 11 Apr 2014 10:10:09 -0700
x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer. We have
a software workaround for that ("espfix") for the 32-bit kernel, but
it relies on a nonzero stack segment base which is not available in
64-bit mode.
Since 16-bit support is somewhat crippled anyway on a 64-bit kernel
(no V86 mode), and most (if not quite all) 64-bit processors support
virtualization for the users who really need it, simply reject
attempts at creating a 16-bit segment when running on top of a 64-bit
kernel.
Cc: Linus Torvalds <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Cc: <[email protected]>
---
arch/x86/kernel/ldt.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index ebc9873..af1d14a 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -229,6 +229,17 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
}
}
+ /*
+ * On x86-64 we do not support 16-bit segments due to
+ * IRET leaking the high bits of the kernel stack address.
+ */
+#ifdef CONFIG_X86_64
+ if (!ldt_info.seg_32bit) {
+ error = -EINVAL;
+ goto out_unlock;
+ }
+#endif
+
fill_ldt(&ldt, &ldt_info);
if (oldmode)
ldt.avl = 0;
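A minimal user-space sketch of what the new check rejects (an illustration only, not part of the patch or the thread; the base address and entry number are arbitrary): it asks modify_ldt() to install a 16-bit data segment, which a 64-bit kernel with the hunk above applied is expected to refuse with EINVAL, while a 32-bit kernel still accepts it.

#include <asm/ldt.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct user_desc d;

	memset(&d, 0, sizeof(d));
	d.entry_number = 0;
	d.base_addr    = 0x1000;	/* arbitrary, for illustration only */
	d.limit        = 0xffff;
	d.seg_32bit    = 0;		/* 16-bit segment: refused on a patched x86-64 kernel */
	d.contents     = MODIFY_LDT_CONTENTS_DATA;
	d.useable      = 1;

	/* func == 1: write an LDT entry */
	if (syscall(SYS_modify_ldt, 1, &d, sizeof(d)) != 0)
		printf("modify_ldt: %s (EINVAL expected on a patched 64-bit kernel)\n",
		       strerror(errno));
	else
		printf("16-bit LDT entry installed\n");
	return 0;
}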
On 04/11/2014 10:36 AM, tip-bot for H. Peter Anvin wrote:
> Commit-ID: b3b42ac2cbae1f3cecbb6229964a4d48af31d382
> Gitweb: http://git.kernel.org/tip/b3b42ac2cbae1f3cecbb6229964a4d48af31d382
> Author: H. Peter Anvin <[email protected]>
> AuthorDate: Sun, 16 Mar 2014 15:31:54 -0700
> Committer: H. Peter Anvin <[email protected]>
> CommitDate: Fri, 11 Apr 2014 10:10:09 -0700
>
> x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
>
> The IRET instruction, when returning to a 16-bit segment, only
> restores the bottom 16 bits of the user space stack pointer. We have
> a software workaround for that ("espfix") for the 32-bit kernel, but
> it relies on a nonzero stack segment base which is not available in
> 64-bit mode.
>
> Since 16-bit support is somewhat crippled anyway on a 64-bit kernel
> (no V86 mode), and most (if not quite all) 64-bit processors support
> virtualization for the users who really need it, simply reject
> attempts at creating a 16-bit segment when running on top of a 64-bit
> kernel.
>
> Cc: Linus Torvalds <[email protected]>
> Signed-off-by: H. Peter Anvin <[email protected]>
> Link: http://lkml.kernel.org/n/[email protected]
> Cc: <[email protected]>
If this is what I think it is (hi, Spender), then it is probably only
useful for 3.14.y and not earlier kernels.
--Andy
On 04/11/2014 11:12 AM, Andy Lutomirski wrote:
>
> If this is what I think it is (hi, Spender), then it is probably only
> useful for 3.14.y and not earlier kernels.
>
Not really. The kernel stack address is sensitive regardless of kASLR;
in fact, it is completely orthogonal to kASLR.
-hpa
Is this bug really still present in modern CPUs? This change breaks
running 16-bit apps in Wine. I have a few really old games I like to
play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
On Fri, Apr 11, 2014 at 1:36 PM, tip-bot for H. Peter Anvin
<[email protected]> wrote:
> Commit-ID: b3b42ac2cbae1f3cecbb6229964a4d48af31d382
> Gitweb: http://git.kernel.org/tip/b3b42ac2cbae1f3cecbb6229964a4d48af31d382
> Author: H. Peter Anvin <[email protected]>
> AuthorDate: Sun, 16 Mar 2014 15:31:54 -0700
> Committer: H. Peter Anvin <[email protected]>
> CommitDate: Fri, 11 Apr 2014 10:10:09 -0700
>
> x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
>
> The IRET instruction, when returning to a 16-bit segment, only
> restores the bottom 16 bits of the user space stack pointer. We have
> a software workaround for that ("espfix") for the 32-bit kernel, but
> it relies on a nonzero stack segment base which is not available in
> 64-bit mode.
>
> Since 16-bit support is somewhat crippled anyway on a 64-bit kernel
> (no V86 mode), and most (if not quite all) 64-bit processors support
> virtualization for the users who really need it, simply reject
> attempts at creating a 16-bit segment when running on top of a 64-bit
> kernel.
>
> Cc: Linus Torvalds <[email protected]>
> Signed-off-by: H. Peter Anvin <[email protected]>
> Link: http://lkml.kernel.org/n/[email protected]
> Cc: <[email protected]>
> ---
> arch/x86/kernel/ldt.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index ebc9873..af1d14a 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -229,6 +229,17 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
> }
> }
>
> + /*
> + * On x86-64 we do not support 16-bit segments due to
> + * IRET leaking the high bits of the kernel stack address.
> + */
> +#ifdef CONFIG_X86_64
> + if (!ldt_info.seg_32bit) {
> + error = -EINVAL;
> + goto out_unlock;
> + }
> +#endif
> +
> fill_ldt(&ldt, &ldt_info);
> if (oldmode)
> ldt.avl = 0;
On 04/11/2014 11:27 AM, Brian Gerst wrote:
> Is this bug really still present in modern CPUs? This change breaks
> running 16-bit apps in Wine. I have a few really old games I like to
> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
It is not a bug, per se, but an architectural definition issue, and it
is present in all x86 processors from all vendors.
Yes, it does break running 16-bit apps in Wine, although Wine could be
modified to put 16-bit apps in a container. However, this is at best a
marginal use case.
-hpa
On Fri, Apr 11, 2014 at 2:29 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/11/2014 11:27 AM, Brian Gerst wrote:
>> Is this bug really still present in modern CPUs? This change breaks
>> running 16-bit apps in Wine. I have a few really old games I like to
>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>
> It is not a bug, per se, but an architectural definition issue, and it
> is present in all x86 processors from all vendors.
>
> Yes, it does break running 16-bit apps in Wine, although Wine could be
> modified to put 16-bit apps in a container. However, this is at best a
> marginal use case.
Marginal or not, it is still userspace breakage.
On Fri, Apr 11, 2014 at 11:27 AM, Brian Gerst <[email protected]> wrote:
> Is this bug really still present in modern CPUs? This change breaks
> running 16-bit apps in Wine. I have a few really old games I like to
> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
Ok, so you actually do this on x86-64, and it currently works? For
some reason I thought that 16-bit windows apps already didn't work.
Because if we have working users of this, then I don't think we can do
the "we don't support 16-bit segments", or at least we need to make it
runtime configurable.
Linus
On Fri, Apr 11, 2014 at 2:41 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Apr 11, 2014 at 11:27 AM, Brian Gerst <[email protected]> wrote:
>> Is this bug really still present in modern CPUs? This change breaks
>> running 16-bit apps in Wine. I have a few really old games I like to
>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>
> Ok, so you actually do this on x86-64, and it currently works? For
> some reason I thought that 16-bit windows apps already didn't work.
>
> Because if we have working users of this, then I don't think we can do
> the "we don't support 16-bit segments", or at least we need to make it
> runtime configurable.
>
> Linus
I haven't tested it recently but I do know it has worked on 64-bit
kernels. There is no reason for it not to, the only thing not
supported in long mode is vm86. 16-bit protected mode is unchanged.
On 04/11/2014 11:41 AM, Linus Torvalds wrote:
>
> Ok, so you actually do this on x86-64, and it currently works? For
> some reason I thought that 16-bit windows apps already didn't work.
>
Some will work, because not all 16-bit software cares about the upper
half of ESP getting randomly corrupted.
That is the "functionality bit" of the problem. The other bit, of
course, is that that random corruption is the address of the kernel stack.
> Because if we have working users of this, then I don't think we can do
> the "we don't support 16-bit segments", or at least we need to make it
> runtime configurable.
I'll let you pick what the policy should be here. I personally think
that we have to be able to draw a line somewhere sometimes (Microsoft
themselves haven't supported running 16-bit binaries for several Windows
generations now), but it is your policy, not mine.
-hpa
On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <[email protected]> wrote:
>
> I haven't tested it recently but I do know it has worked on 64-bit
> kernels. There is no reason for it not to, the only thing not
> supported in long mode is vm86. 16-bit protected mode is unchanged.
Afaik 64-bit windows doesn't support 16-bit binaries, so I just
assumed Wine wouldn't do it either on x86-64. Not for any real
technical reasons, though.
HOWEVER. I'd like to hear something more definitive than "I haven't
tested recently". The "we don't break user space" is about having
actual real *users*, not about test programs.
Are there people actually using 16-bit old windows programs under
wine? That's what matters.
Linus
On 04/11/2014 11:29 AM, H. Peter Anvin wrote:
> On 04/11/2014 11:27 AM, Brian Gerst wrote:
>> Is this bug really still present in modern CPUs? This change breaks
>> running 16-bit apps in Wine. I have a few really old games I like to
>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>
> It is not a bug, per se, but an architectural definition issue, and it
> is present in all x86 processors from all vendors.
>
> Yes, it does break running 16-bit apps in Wine, although Wine could be
> modified to put 16-bit apps in a container. However, this is at best a
> marginal use case.
I wonder if there's an easy-ish good-enough fix:
Allocate some percpu space in the fixmap. (OK, this is ugly, but
kvmclock already does it, so it's possible.) To return to 16-bit
userspace, make sure interrupts are off, copy the whole iret descriptor
to the current cpu's fixmap space, change rsp to point to that space,
and then do the iret.
This won't restore the correct value to the high bits of [er]sp, but it
will at least stop leaking anything interesting to userspace.
--Andy
On 04/11/2014 02:16 PM, Andy Lutomirski wrote:
> On 04/11/2014 11:29 AM, H. Peter Anvin wrote:
>> On 04/11/2014 11:27 AM, Brian Gerst wrote:
>>> Is this bug really still present in modern CPUs? This change breaks
>>> running 16-bit apps in Wine. I have a few really old games I like to
>>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>>
>> It is not a bug, per se, but an architectural definition issue, and it
>> is present in all x86 processors from all vendors.
>>
>> Yes, it does break running 16-bit apps in Wine, although Wine could be
>> modified to put 16-bit apps in a container. However, this is at best a
>> marginal use case.
>
> I wonder if there's an easy-ish good-enough fix:
>
> Allocate some percpu space in the fixmap. (OK, this is ugly, but
> kvmclock already does it, so it's possible.) To return to 16-bit
> userspace, make sure interrupts are off, copy the whole iret descriptor
> to the current cpu's fixmap space, change rsp to point to that space,
> and then do the iret.
>
> This won't restore the correct value to the high bits of [er]sp, but it
> will at least stop leaking anything interesting to userspace.
>
This would fix the infoleak, at the cost of allocating a chunk of memory
for each CPU. It doesn't fix the functionality problem.
If we're going to do a workaround I would prefer to do something that
fixes both, but it is highly nontrivial.
This is a writeup I did to a select audience before this was public:
> Hello,
>
> It appears we have an information leak on x86-64 by which at least bits
> [31:16] of the kernel stack address leak to user space (some silicon,
> including the 64-bit Pentium 4, leaks [63:16]). This is due to the
> behavior of IRET when returning to a 16-bit segment: IRET restores only
> the bottom 16 bits of the stack pointer.
>
> This is known on 32 bits and we, in fact, have a workaround for it
> ("espfix") there. We do not, however, have the equivalent on 64 bits,
> nor does it seem that it is very easy to construct a workaround (see below.)
>
> This is both a functionality problem (16-bit code gets the upper bits of
> %esp corrupted when the kernel is invoked) and an information leak. The
> 32-bit workaround was labeled as a fix for the functionality problem,
> but it of course also addresses the leak.
>
> On 64 bits, the easiest mitigation seems to be to make modify_ldt()
> refuse to install a 16-bit segment when running on a 64-bit kernel.
> 16-bit support is already somewhat crippled on 64 bits since there is no
> V86 support; obviously, for "full service" support we can always set up
> a virtual machine -- most (but sadly, not all) 64-bit parts are also
> virtualization capable.
>
> I would have suggested rejecting modify_ldt() entirely, to reduce attack
> surface, except that some early versions of 32-bit NPTL glibc use
> modify_ldt() to the exclusion of all other methods of establishing the
> thread pointer, so in order to stay compatible with those we would need
> to allow 32-bit segments via modify_ldt() still.
>
> However, there is no doubt this will break some legitimate users of
> 16-bit segments, e.g. Wine for 16-bit Windows apps (which don't work on
> 64-bit Windows either, for what it is worth.)
>
> We may very well have other infoleaks that dwarf this, but the kernel
> stack address is a relatively high value item for exploits.
>
> Some workarounds I have considered:
>
> a. Using paging in a similar way to the 32-bit segment base workaround
>
> This one requires a very large swath of virtual user space (depending on
> allocation policy, as much as 4 GiB per CPU.) The "per CPU" requirement
> comes in as locking is not feasible -- as we return to user space there
> is nowhere to release the lock.
>
> b. Return to user space via compatibility mode
>
> As the kernel lives above the 4 GiB virtual mark, a transition through
> compatibility mode is not practical. This would require the kernel to
> reserve virtual address space below the 4 GiB mark, which may interfere
> with the application, especially an application launched as a 64-bit
> application.
>
> c. Trampoline in kernel space
>
> A trampoline in kernel space is not feasible since all ring transition
> instructions capable of returning to 16-bit mode require the use of the
> stack.
>
> d. Trampoline in user space
>
> A return to the vdso with values set up in registers r8-r15 would enable
> a trampoline in user space. Unfortunately there is no way
> to do a far JMP entirely with register state so this would require
> touching user space memory, possibly in an unsafe manner.
>
> The most likely variant is to use the address of the 16-bit user stack
> and simply hope that this is a safe thing to do.
>
> This appears to be the most feasible workaround if a workaround is
> deemed necessary.
>
> e. Transparently run 16-bit code segments inside a lightweight VMM
>
> The complexity of this solution versus the realized value is staggering.
> It also doesn't work on non-virtualization-capable hardware (including
> running on top of a VMM which doesn't support nested virtualization.)
>
> -hpa
On Fri, Apr 11, 2014 at 2:16 PM, Andy Lutomirski <[email protected]> wrote:
>
> I wonder if there's an easy-ish good-enough fix:
Heh. Yes. Check the thread on lkml about three weeks ago under the
subject "x86-64: Information leak: kernel stack address leaks to user
space". It had exactly that as a suggestion.
Anyway, I ended up pulling the current change - let's see if anybody even cares.
And if somebody *does* care, maybe we can just do a trivial sysctl. If
you are running 16-bit apps under wine, the default kernel setup
already stops you: the 'mmap_min_addr' being non-zero means that they
already will not run.
But yeah, I personally don't care about the high bits of rsp one whit,
since that has never worked on x86-64. But the information leak needs
to be plugged, and a percpu stack can fix that.
I'm a bit worried that a percpu stack can cause issues with NMI's,
which already have too much complexity in them, so I don't think it's
*entirely* trivial to do. And the exception that the 'iretq' can take
adds more complexity wrt kernel stack pointer games. Which is why I'm
not at all sure it's worth it.
Linus
On 04/11/2014 02:24 PM, H. Peter Anvin wrote:
> On 04/11/2014 02:16 PM, Andy Lutomirski wrote:
>> I wonder if there's an easy-ish good-enough fix:
>>
>> Allocate some percpu space in the fixmap. (OK, this is ugly, but
>> kvmclock already does it, so it's possible.) To return to 16-bit
>> userspace, make sure interrupts are off, copy the whole iret descriptor
>> to the current cpu's fixmap space, change rsp to point to that space,
>> and then do the iret.
>>
>> This won't restore the correct value to the high bits of [er]sp, but it
>> will at least stop leaking anything interesting to userspace.
>>
>
> This would fix the infoleak, at the cost of allocating a chunk of memory
> for each CPU. It doesn't fix the functionality problem.
>
> If we're going to do a workaround I would prefer to do something that
> fixes both, but it is highly nontrivial.
>
> This is a writeup I did to a select audience before this was public:
>
>> Hello,
>>
>> This is both a functionality problem (16-bit code gets the upper bits of
>> %esp corrupted when the kernel is invoked) and an information leak. The
>> 32-bit workaround was labeled as a fix for the functionality problem,
>> but it of course also addresses the leak.
How big of a functionality problem is it? Apparently it doesn't break
16-bit code on wine.
Since the high bits of esp have been corrupted on x86_64 since the
beginning, there's no regression issue here if an eventual fix writes
less meaningful crap to those bits -- I see no real reason to try to put
the correct values in there.
>> I would have suggested rejecting modify_ldt() entirely, to reduce attack
>> surface, except that some early versions of 32-bit NPTL glibc use
>> modify_ldt() to exclusion of all other methods of establishing the
>> thread pointer, so in order to stay compatible with those we would need
>> to allow 32-bit segments via modify_ldt() still.
I actually use modify_ldt for amusement: it's the only way I know of to
issue real 32-bit syscalls from 64-bit userspace. Yes, this isn't
really a legitimate use case.
>>
>> a. Using paging in a similar way to the 32-bit segment base workaround
>>
>> This one requires a very large swath of virtual user space (depending on
>> allocation policy, as much as 4 GiB per CPU.) The "per CPU" requirement
>> comes in as locking is not feasible -- as we return to user space there
>> is nowhere to release the lock.
Why not just 4k per CPU? Write the pfn to the pte, invlpg, update rsp,
iret. This leaks the CPU number, but that's all.
To me, this sounds like the easiest solution, so long as rsp is known to
be sufficiently far from a page boundary.
These ptes could even be read-only to limit the extra exposure to
known-address attacks.
If you want a fully correct solution, you can use a fancier allocation
policy that can fit quite a few cpus per 4G :)
>>
>> d. Trampoline in user space
>>
>> A return to the vdso with values set up in registers r8-r15 would enable
>> a trampoline in user space. Unfortunately there is no way
>> to do a far JMP entirely with register state so this would require
>> touching user space memory, possibly in an unsafe manner.
>>
>> The most likely variant is to use the address of the 16-bit user stack
>> and simply hope that this is a safe thing to do.
>>
>> This appears to be the most feasible workaround if a workaround is
>> deemed necessary.
Eww.
--Andy
On 04/11/2014 02:53 PM, Andy Lutomirski wrote:
>
> How big of a functionality problem is it? Apparently it doesn't break
> 16-bit code on wine.
>
It breaks *some* 16-bit code. This is actually the reason that 32 bits
has the espfix workaround - it wasn't identified as an infoleak at the time.
> Since the high bits of esp have been corrupted on x86_64 since the
> beginning, there's no regression issue here if an eventual fix writes
> less meaningful crap to those bits -- I see no real reason to try to put
> the correct values in there.
It is a regression vs. the 32-bit kernel, and if we're going to support
16-bit code we should arguably support 16-bit code correctly.
This is actually how I stumbled onto this problem in the first place: it
broke a compiler test suite for gcc -m16 I was working on. The
workaround for *that* was to run in a VM instead.
>>> I would have suggested rejecting modify_ldt() entirely, to reduce attack
>>> surface, except that some early versions of 32-bit NPTL glibc use
>>> modify_ldt() to exclusion of all other methods of establishing the
>>> thread pointer, so in order to stay compatible with those we would need
>>> to allow 32-bit segments via modify_ldt() still.
>
> I actually use modify_ldt for amusement: it's the only way I know of to
> issue real 32-bit syscalls from 64-bit userspace. Yes, this isn't
> really a legitimate use case.
That's actually wrong on no less than two levels:
1. You can issue real 32-bit system calls from 64-bit user space simply
by invoking int $0x80; it works in 64-bit mode as well.
2. Even if you want to be in 32-bit mode you can simply call via
__USER32_CS, you don't need an LDT entry.
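A rough sketch of point 1 (an illustration added here, not from the thread; it assumes a kernel with 32-bit emulation enabled): a 64-bit program can enter the i386 system-call path with int $0x80. The number 20 is the i386 __NR_getpid.

#include <stdio.h>

int main(void)
{
	int ret;

	/* eax carries the 32-bit syscall number; int $0x80 goes through
	 * the i386 syscall table even when executed from 64-bit code. */
	asm volatile("int $0x80"
		     : "=a" (ret)
		     : "a" (20)		/* i386 __NR_getpid */
		     : "memory");

	printf("getpid via int $0x80: %d\n", ret);
	return 0;
}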
> Why not just 4k per CPU? Write the pfn to the pte, invlpg, update rsp,
> iret. This leaks the CPU number, but that's all.
>
> To me, this sounds like the easiest solution, so long as rsp is known to
> be sufficiently far from a page boundary.
>
> These ptes could even be read-only to limit the extra exposure to
> known-address attacks.
>
> If you want a fully correct solution, you can use a fancier allocation
> policy that can fit quite a few cpus per 4G :)
It's damned hard, because you don't have a logical place to
*deallocate*. That's what ends up killing you.
Also, you will need to port over the equivalent to the espfix recovery
code from 32 bits (what happens if IRET takes an exception), so it is
nontrivial.
>>> d. Trampoline in user space
>>>
>>> A return to the vdso with values set up in registers r8-r15 would enable
>>> a trampoline in user space. Unfortunately there is no way
>>> to do a far JMP entirely with register state so this would require
>>> touching user space memory, possibly in an unsafe manner.
>>>
>>> The most likely variant is to use the address of the 16-bit user stack
>>> and simply hope that this is a safe thing to do.
>>>
>>> This appears to be the most feasible workaround if a workaround is
>>> deemed necessary.
>
> Eww.
I don't think any of the options are anything but.
-hpa
On Fri, Apr 11, 2014 at 2:59 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/11/2014 02:53 PM, Andy Lutomirski wrote:
>>
>> How big of a functionality problem is it? Apparently it doesn't break
>> 16-bit code on wine.
>>
>
> It breaks *some* 16-bit code. This is actually the reason that 32 bits
> has the espfix workaround - it wasn't identified as an infoleak at the time.
>
>> Since the high bits of esp have been corrupted on x86_64 since the
>> beginning, there's no regression issue here if an eventual fix writes
>> less meaningful crap to those bits -- I see no real reason to try to put
>> the correct values in there.
>
> It is a regression vs. the 32-bit kernel, and if we're going to support
> 16-bit code we should arguably support 16-bit code correctly.
>
> This is actually how I stumbled onto this problem in the first place: it
> broke a compiler test suite for gcc -m16 I was working on. The
> workaround for *that* was to run in a VM instead.
>
>>>> I would have suggested rejecting modify_ldt() entirely, to reduce attack
>>>> surface, except that some early versions of 32-bit NPTL glibc use
>>>> modify_ldt() to exclusion of all other methods of establishing the
>>>> thread pointer, so in order to stay compatible with those we would need
>>>> to allow 32-bit segments via modify_ldt() still.
>>
>> I actually use modify_ldt for amusement: it's the only way I know of to
>> issue real 32-bit syscalls from 64-bit userspace. Yes, this isn't
>> really a legitimate use case.
>
> That's actually wrong on no less than two levels:
>
> 1. You can issue real 32-bit system calls from 64-bit user space simply
> by invoking int $0x80; it works in 64-bit mode as well.
>
> 2. Even if you want to be in 32-bit mode you can simply call via
> __USER32_CS, you don't need an LDT entry.
I just looked up my hideous code. I was doing this to test the
now-deleted int 0xcc vsyscall stuff. I used modify_ldt because either
I didn't realize that __USER32_CS was usable or I didn't think it was
ABI. Or I was just being silly.
But yes, breaking my hack would not matter. :)
--Andy
On 04/11/2014 03:15 PM, Andy Lutomirski wrote:
>
> I just looked up my hideous code. I was doing this to test the
> now-deleted int 0xcc vsyscall stuff. I used modify_ldt because either
> I didn't realize that __USER32_CS was usable or I didn't think it was
> ABI. Or I was just being silly.
>
> But yes, breaking my hack would not matter. :)
>
Either way, it wouldn't break it.
-hpa
On Fri, Apr 11, 2014 at 2:50 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <[email protected]> wrote:
>>
>> I haven't tested it recently but I do know it has worked on 64-bit
>> kernels. There is no reason for it not to, the only thing not
>> supported in long mode is vm86. 16-bit protected mode is unchanged.
>
> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
> assumed Wine wouldn't do it either on x86-64. Not for any real
> technical reasons, though.
>
> HOWEVER. I'd like to hear something more definitive than "I haven't
> tested recently". The "we don't break user space" is about having
> actual real *users*, not about test programs.
>
> Are there people actually using 16-bit old windows programs under
> wine? That's what matters.
>
> Linus
I just verified that the game does still run on a 64-bit kernel
(3.13.8-200.fc20.x86_64). It needed an older version of Wine, but
that's a Wine regression and not kernel related.
So Wine regressed and no one noticed? That doesn't sound like an active user base.
On April 11, 2014 9:44:22 PM PDT, Brian Gerst <[email protected]> wrote:
>On Fri, Apr 11, 2014 at 2:50 PM, Linus Torvalds
><[email protected]> wrote:
>> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <[email protected]>
>wrote:
>>>
>>> I haven't tested it recently but I do know it has worked on 64-bit
>>> kernels. There is no reason for it not to, the only thing not
>>> supported in long mode is vm86. 16-bit protected mode is unchanged.
>>
>> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
>> assumed Wine wouldn't do it either on x86-64. Not for any real
>> technical reasons, though.
>>
>> HOWEVER. I'd like to hear something more definitive than "I haven't
>> tested recently". The "we don't break user space" is about having
>> actual real *users*, not about test programs.
>>
>> Are there people actually using 16-bit old windows programs under
>> wine? That's what matters.
>>
>> Linus
>
>I just verified that the game does still run on a 64-bit kernel
>(3.13.8-200.fc20.x86_64). It needed an older version of Wine, but
>that's a Wine regression and not kernel related.
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Sat, Apr 12, 2014 at 10:18:25AM -0700, H. Peter Anvin wrote:
> So Wine regressed and noone noticed? They doesn't sound like an active
> user base.
Btw, wouldn't this obscure use case simply work in a KVM guest with a
kernel <= 3.14?
Because if so, we simply cut it at 3.14: everything newer has the leak
fix, and people who still want to play phone games on an x86 machine can
do so in a guest with an older kernel. Everybody's happy.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
Run a 32-bit VM. The 32-bit kernel does this right.
I suspect it would also work fine in a Qemu user mode guest (is this supported by KVM?), in a ReactOS VM, or some other number of combinations.
The real question is how many real users are actually affected.
On April 12, 2014 12:35:41 PM PDT, Borislav Petkov <[email protected]> wrote:
>On Sat, Apr 12, 2014 at 10:18:25AM -0700, H. Peter Anvin wrote:
>> So Wine regressed and noone noticed? They doesn't sound like an
>active
>> user base.
>
>Btw, wouldn't this obscure use case simply work in a KVM guest with a
>kernel <= 3.14?
>
>Because if so, we simply cut it at 3.14, everything newer has the leak
>fix and people who still want to play phone games on a x86 machine, can
>do so in a guest with an older kernel. Everybody's happy.
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Sat, Apr 12, 2014 at 12:44:42PM -0700, H. Peter Anvin wrote:
> Run a 32-bit VM. The 32-bit kernel does this right.
Yes, even better.
> I suspect it would also work fine in a Qemu user mode guest (is
> this supported by KVM?), in a ReactOS VM, or some other number of
> combinations.
Right.
So basically, there are a lot of different virt scenarios which can all
take care of those use cases *without* encumbering 64-bit with some
insane solution.
> The real question is how many real users are actually affected.
And if they are, virtualize them, for chrissake. It is time we finally
used virt for maybe one of its major use cases - virtualize old/obscure
hw. It should be pretty reliable by now.
:-P
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
For this particular game, not 16-bit in general. The installer, also
16-bit, runs perfectly. Already filed wine bug 35977.
On Sat, Apr 12, 2014 at 1:18 PM, H. Peter Anvin <[email protected]> wrote:
> So Wine regressed and noone noticed? They doesn't sound like an active user base.
>
> On April 11, 2014 9:44:22 PM PDT, Brian Gerst <[email protected]> wrote:
>>On Fri, Apr 11, 2014 at 2:50 PM, Linus Torvalds
>><[email protected]> wrote:
>>> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <[email protected]>
>>wrote:
>>>>
>>>> I haven't tested it recently but I do know it has worked on 64-bit
>>>> kernels. There is no reason for it not to, the only thing not
>>>> supported in long mode is vm86. 16-bit protected mode is unchanged.
>>>
>>> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
>>> assumed Wine wouldn't do it either on x86-64. Not for any real
>>> technical reasons, though.
>>>
>>> HOWEVER. I'd like to hear something more definitive than "I haven't
>>> tested recently". The "we don't break user space" is about having
>>> actual real *users*, not about test programs.
>>>
>>> Are there people actually using 16-bit old windows programs under
>>> wine? That's what matters.
>>>
>>> Linus
>>
>>I just verified that the game does still run on a 64-bit kernel
>>(3.13.8-200.fc20.x86_64). It needed an older version of Wine, but
>>that's a Wine regression and not kernel related.
>
> --
> Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Sat, Apr 12, 2014 at 4:11 PM, Borislav Petkov <[email protected]> wrote:
> On Sat, Apr 12, 2014 at 12:44:42PM -0700, H. Peter Anvin wrote:
>> Run a 32-bit VM. The 32-bit kernel does this right.
>
> Yes, even better.
>
>> I suspect it would also work fine in a Qemu user mode guest (is
>> this supported by KVM?), in a ReactOS VM, or some other number of
>> combinations.
>
> Right.
>
> So basically, there a lot of different virt scenarios which can all take
> care of those use cases *without* encumbering some insane solutions on
> 64-bit.
>
>> The real question is how many real users are actually affected.
>
> And if they are, virtualize them, for chrissake. It is time we finally
> used virt for maybe one of its major use cases - virtualize old/obscure
> hw. It should be pretty reliable by now.
>
> :-P
>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --
My experience with kvm so far is that it is slow and clunky. It may be
OK for a server environment, but interactively it's difficult to use.
On Sat, Apr 12, 2014 at 04:34:14PM -0400, Brian Gerst wrote:
> My experience with kvm so far is that is slow and clunky. It may be OK
> for a server environment, but interactively it's difficult to use.
Are you saying, you've run your game in a guest and perf. is sucky?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Sat, Apr 12, 2014 at 4:59 PM, Borislav Petkov <[email protected]> wrote:
> On Sat, Apr 12, 2014 at 04:34:14PM -0400, Brian Gerst wrote:
>> My experience with kvm so far is that is slow and clunky. It may be OK
>> for a server environment, but interactively it's difficult to use.
>
> Are you saying, you've run your game in a guest and perf. is sucky?
>
Performance is bad in general, running a 32-bit Fedora 20 guest.
On Sat, Apr 12, 2014 at 05:13:40PM -0400, Brian Gerst wrote:
> Performance is bad in general, running a 32-bit Fedora 20 guest.
So this means you haven't tried the game in the guest yet, so that we
can know for sure that a guest doesn't solve your problem or what?
Btw, which game is that and can I get it somewhere to try it here
locally?
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Sat, Apr 12, 2014 at 12:44 PM, H. Peter Anvin <[email protected]> wrote:
> Run a 32-bit VM. The 32-bit kernel does this right.
I really don't think that's the answer.
If people really run these 16-bit programs, we need to allow it.
Clearly it used to work.
Just make the unconditional "don't allow 16-bit segments" be a sysconf entry.
Linus
On 04/12/2014 02:53 PM, Linus Torvalds wrote:
> On Sat, Apr 12, 2014 at 12:44 PM, H. Peter Anvin <[email protected]> wrote:
>> Run a 32-bit VM. The 32-bit kernel does this right.
>
> I really don't think that's the answer.
>
> If people really run these 16-bit programs, we need to allow it.
> Clearly it used to work.
>
> Just make the unconditional "don't allow 16-bit segments" be a sysconf entry.
>
Well, is there more than one user, really... that's my question.
But yes, we can make it configurable, but the default should almost
certainly be off.
-hpa
On 04/12/2014 04:26 PM, Alexander van Heukelum wrote:
>>>
>>> c. Trampoline in kernel space
>>>
>>> A trampoline in kernel space is not feasible since all ring transition
>>> instructions capable of returning to 16-bit mode require the use of the
>>> stack.
>
> "16 bit mode" -> "a mode with 16-bit stack"
Yes... I believe it is the SS.B bit that is relevant, not CS.B (although
I haven't confirmed that experimentally.) Not that that helps one iota,
as far as I can tell.
>>> d. Trampoline in user space
>>>
>>> A return to the vdso with values set up in registers r8-r15 would enable
>>> a trampoline in user space. Unfortunately there is no way
>>> to do a far JMP entirely with register state so this would require
>>> touching user space memory, possibly in an unsafe manner.
>
> d.2. trampoline in user space via long mode
>
> Return from the kernel to a user space trampoline via long mode.
> The kernel changes the stack frame just before executing the iret
> instruction. (the CS and RIP slots are set to run the trampoline code,
> where CS is a long mode segment.) The trampoline code in userspace
> is set up to this single instruction: a far jump to the final CS:EIP
> (compatibility mode).
This still requires user space memory that the kernel can write to.
Long mode is actually exactly identical to what I was suggesting above,
except that I would avoid using self-modifying code in favor of just
parameterization using the high registers.
-hpa
Hi,
> This is a writeup I did to a select audience before this was public:
I'd like to add an option d.2 for your consideration. Can you think of a
fundamental problem with it?
Greetings,
Alexander
> > Some workarounds I have considered:
> >
> > a. Using paging in a similar way to the 32-bit segment base workaround
> >
> > This one requires a very large swath of virtual user space (depending on
> > allocation policy, as much as 4 GiB per CPU.) The "per CPU" requirement
> > comes in as locking is not feasible -- as we return to user space there
> > is nowhere to release the lock.
> >
> > b. Return to user space via compatibility mode
> >
> > As the kernel lives above the 4 GiB virtual mark, a transition through
> > compatibility mode is not practical. This would require the kernel to
> > reserve virtual address space below the 4 GiB mark, which may interfere
> > with the application, especially an application launched as a 64-bit
> > application.
> >
> > c. Trampoline in kernel space
> >
> > A trampoline in kernel space is not feasible since all ring transition
> > instructions capable of returning to 16-bit mode require the use of the
> > stack.
"16 bit mode" -> "a mode with 16-bit stack"
> > d. Trampoline in user space
> >
> > A return to the vdso with values set up in registers r8-r15 would enable
> > a trampoline in user space. Unfortunately there is no way
> > to do a far JMP entirely with register state so this would require
> > touching user space memory, possibly in an unsafe manner.
d.2. trampoline in user space via long mode
Return from the kernel to a user space trampoline via long mode.
The kernel changes the stack frame just before executing the iret
instruction. (the CS and RIP slots are set to run the trampoline code,
where CS is a long mode segment.) The trampoline code in userspace
is set up to this single instruction: a far jump to the final CS:EIP
(compatibility mode).
Because the IRET is now returning to long mode, all registers are
restored fully. The stack cannot be used at this point, but the far
jump doesn't need stack and it will/should make the stack valid
immediately after execution. The IRET enables interrupts, so the
far jump is in the interrupt shadow: it won't be seen, unless it causes
an exception.
> > The most likely variant is to use the address of the 16-bit user stack
> > and simply hope that this is a safe thing to do.
> >
> > This appears to be the most feasible workaround if a workaround is
> > deemed necessary.
> >
> > e. Transparently run 16-bit code segments inside a lightweight VMM
"16-bit code" -> "code with 16-bit stack"
> > The complexity of this solution versus the realized value is staggering.
> > It also doesn't work on non-virtualization-capable hardware (including
> > running on top of a VMM which doesn't support nested virtualization.)
> >
> > -hpa
>
On Sun, Apr 13, 2014, at 1:31, H. Peter Anvin wrote:
> >>> d. Trampoline in user space
> >>>
> >>> A return to the vdso with values set up in registers r8-r15 would enable
> >>> a trampoline in user space. Unfortunately there is no way
> >>> to do a far JMP entirely with register state so this would require
> >>> touching user space memory, possibly in an unsafe manner.
> >
> > d.2. trampoline in user space via long mode
> >
> > Return from the kernel to a user space trampoline via long mode.
> > The kernel changes the stack frame just before executing the iret
> > instruction. (the CS and RIP slots are set to run the trampoline code,
> > where CS is a long mode segment.) The trampoline code in userspace
> > is set up to this single instruction: a far jump to the final CS:EIP
> > (compatibility mode).
>
> This still requires user space memory that the kernel can write to.
> Long mode is actually exactly identical to what I was suggesting above,
> except that I would avoid using self-modifying code in favor of just
> parameterization using the high registers.
No self modifying code... The far jump must be in the indirect form
anyhow. The CS:EIP must be accessible from user mode, but not
necessarily from compatibility mode. So the trampoline (the jump)
and data (CS:EIP) can live pretty much anywhere in virtual memory.
But indeed, I see what you meant now.
Greetings,
Alexander
>
> -hpa
>
On 04/12/2014 04:49 PM, Alexander van Heukelum wrote:
> On Sun, Apr 13, 2014, at 1:31, H. Peter Anvin wrote:
>>>>> d. Trampoline in user space
>>>>>
>>>>> A return to the vdso with values set up in registers r8-r15 would enable
>>>>> a trampoline in user space. Unfortunately there is no way
>>>>> to do a far JMP entirely with register state so this would require
>>>>> touching user space memory, possibly in an unsafe manner.
>>>
>>> d.2. trampoline in user space via long mode
>>>
>>> Return from the kernel to a user space trampoline via long mode.
>>> The kernel changes the stack frame just before executing the iret
>>> instruction. (the CS and RIP slots are set to run the trampoline code,
>>> where CS is a long mode segment.) The trampoline code in userspace
>>> is set up to this single instruction: a far jump to the final CS:EIP
>>> (compatibility mode).
>>
>> This still requires user space memory that the kernel can write to.
>> Long mode is actually exactly identical to what I was suggesting above,
>> except that I would avoid using self-modifying code in favor of just
>> parameterization using the high registers.
>
> No self modifying code... The far jump must be in the indirect form
> anyhow. The CS:EIP must be accessible from user mode, but not
> necessarily from compatibility mode. So the trampoline (the jump)
> and data (CS:EIP) can live pretty much anywhere in virtual memory.
> But indeed, I see what you meant now.
>
This is, in fact, exactly then what I was suggesting, except that data
is passed directly in memory rather than in a register and letting user
space sort it out (this could be in the vdso, but the vdso may be > 4 GB
so it has to be in 64-bit mode until the last instruction.) The
difference isn't huge; mostly an implementation detail.
A signal arriving while in the user space trampoline could seriously
complicate life.
-hpa
On Sat, Apr 12, 2014 at 5:03 PM, H. Peter Anvin <[email protected]> wrote:
>>
>> No self modifying code... The far jump must be in the indirect form
>> anyhow. The CS:EIP must be accessible from user mode, but not
>> necessarily from compatibility mode. So the trampoline (the jump)
>> and data (CS:EIP) can live pretty much anywhere in virtual memory.
>> But indeed, I see what you meant now.
>>
>
> This is, in fact, exactly then what I was suggesting, except that data
> is passed directly in memory rather than in a register and letting user
> space sort it out (this could be in the vdso, but the vdso may be > 4 GB
> so it has to be in 64-bit mode until the last instruction.) The
> difference isn't huge; mostly an implementation detail.
I'm a bit confused as to exactly what everyone is suggesting. I don't
think there's any instruction that can do a direct far jump to an
address stored in a register.
ISTM it does matter whether SS or CS is the offending selector. If
it's SS, then the possible trampoline sequences are:
MOV SS, ??? / POP SS / LSS
JMP/RET
or
IRET (!)
If it's CS, then we just need a far JMP or a RET or an IRET. The far
JMP is kind of nice since we can at least use RIP-relative addressing.
What are the interrupt shadow rules? I thought IRET did not block interrupts.
>
> A signal arriving while in the user space trampoline could seriously
> complicate life.
Agreed.
Note that we're not really guaranteed to have a trampoline at all.
The vdso isn't there in CONFIG_COMPAT_VDSO mode, although the number
of users of this "feature" on OpenSUSE 9 is probably zero.
--Andy
On Sat, Apr 12, 2014 at 6:25 PM, Andy Lutomirski <[email protected]> wrote:
> On Sat, Apr 12, 2014 at 5:03 PM, H. Peter Anvin <[email protected]> wrote:
>> A signal arriving while in the user space trampoline could seriously
>> complicate life.
>
> Agreed.
Maybe I don't agree. Have signals ever worked sensibly when delivered
to a task running on an unexpected stack or code segment?
--Andy
Linus Torvalds <[email protected]> writes:
> On Fri, Apr 11, 2014 at 11:27 AM, Brian Gerst <[email protected]> wrote:
>> Is this bug really still present in modern CPUs? This change breaks
>> running 16-bit apps in Wine. I have a few really old games I like to
>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>
> Ok, so you actually do this on x86-64, and it currently works? For
> some reason I thought that 16-bit windows apps already didn't work.
No, it always worked. I spent some time on this early in the x86-64 port
and it flushed out some bugs in the early segment handling. x86-64
is perfectly compatible with this.
I was also always proud that Linux 64 was more compatible with old Windows
binaries than Win 64...
-Andi
--
[email protected] -- Speaking for myself only
"H. Peter Anvin" <[email protected]> writes:
>
> But yes, we can make it configurable, but the default should almost
> certainly be off.
Why? Either it works or it doesn't.
If it works it doesn't make any sense to have a sysctl.
-Andi
--
[email protected] -- Speaking for myself only
I would think any sensible application with 16-bit segments would be using sigaltstack. Does anyone know what Wine does?
On April 12, 2014 6:29:11 PM PDT, Andy Lutomirski <[email protected]> wrote:
>On Sat, Apr 12, 2014 at 6:25 PM, Andy Lutomirski <[email protected]>
>wrote:
>> On Sat, Apr 12, 2014 at 5:03 PM, H. Peter Anvin <[email protected]>
>wrote:
>>> A signal arriving while in the user space trampoline could seriously
>>> complicate life.
>>
>> Agreed.
>
>Maybe I don't agree. Have signals ever worked sensibly when delivered
>to a task running on an unexpected stack or code segment?
>
>--Andy
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
It leaks security-sensitive information to userspace and corrupts the upper half of ESP because it lacks the equivalent of the espfix workaround.
On April 12, 2014 7:56:48 PM PDT, Andi Kleen <[email protected]> wrote:
>"H. Peter Anvin" <[email protected]> writes:
>>
>> But yes, we can make it configurable, but the default should almost
>> certainly be off.
>
>Why? Either it works or it doesn't.
>
>If it works it doesn't make any sense to have a sysctl.
>
>-Andi
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Sat, Apr 12, 2014 at 7:56 PM, Andi Kleen <[email protected]> wrote:
>
> Why? Either it works or it doesn't.
>
> If it works it doesn't make any sense to have a sysctl.
BS.
It "works" exactly like mmap() at NULL "works".
It is a potential security leak, because x86-64 screwed up the
architecture definition in this area. So it should definitely be
disabled by default, exactly like mmap_min_addr is non-zero by
default.
Linus
On 04/11/2014 02:53 PM, Andy Lutomirski wrote:
>
> If you want a fully correct solution, you can use a fancier allocation
> policy that can fit quite a few cpus per 4G :)
>
The more I think about this, I think this might actually be a reasonable
option, *IF* someone is willing to deal with actually implementing it.
The difference versus my "a" alternative is that, rather than mapping the
existing kernel stack into an alternate part of the address space, we
would have a series of ministacks, each only large enough to handle the
IRET data *and* any exceptions that IRET may throw, until we can switch
back to the real kernel stack. Tests would have to be added to the appropriate exception
paths, as early as possible. We would then *copy* the IRET data to the
ministack before returning. Each ministack would be mapped 65536 times.
If we can get away with 64 bytes per CPU, then we can get away with 4
GiB of address space per 1024 CPUs, so if MAX_CPUS is 16384 we would
need 64 GiB of address space... which is not unreasonable on 64 bits.
The total memory consumption would be about 81 bytes per CPU for the
ministacks plus page tables (just over 16K per 1K CPUs.) Again, fairly
reasonable, but a *lot* of complexity.
-hpa
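A quick back-of-the-envelope check of the numbers above (an editorial sketch; the constants simply restate the figures in the mail): a 64-byte ministack aliased 65536 times occupies 4 MiB of virtual space per CPU, which gives the 4 GiB per 1024 CPUs and the 64 GiB for a 16384-CPU MAX_CPUS quoted above.

#include <stdio.h>

int main(void)
{
	const unsigned long long ministack = 64;	/* bytes per CPU */
	const unsigned long long aliases   = 65536;	/* mappings of each ministack */
	unsigned long long per_cpu = ministack * aliases;	/* virtual span per CPU */

	printf("per CPU        : %llu MiB\n", per_cpu >> 20);
	printf("per 1024 CPUs  : %llu GiB\n", (per_cpu * 1024) >> 30);
	printf("per 16384 CPUs : %llu GiB\n", (per_cpu * 16384) >> 30);
	return 0;
}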
2014-04-12 15:25, H. Peter Anvin:
> On 04/12/2014 02:53 PM, Linus Torvalds wrote:
>> On Sat, Apr 12, 2014 at 12:44 PM, H. Peter Anvin <[email protected]> wrote:
>>> Run a 32-bit VM. The 32-bit kernel does this right.
>>
>> I really don't think that's the answer.
>>
>> If people really run these 16-bit programs, we need to allow it.
>> Clearly it used to work.
>>
>> Just make the unconditional "don't allow 16-bit segments" be a sysconf
>> entry.
>>
>
> Well, is there more than one user, really... that's my question.
>
Just to let you know, I use this too for running some old Win16 apps,
so there's at least two of us... But doesn't this change also break
DPMI apps running in dosemu?
* Borislav Petkov <[email protected]> wrote:
> On Sat, Apr 12, 2014 at 05:13:40PM -0400, Brian Gerst wrote:
> > Performance is bad in general, running a 32-bit Fedora 20 guest.
>
> So this means you haven't tried the game in the guest yet, so that
> we can know for sure that a guest doesn't solve your problem or
> what?
>
> Btw, which game is that and can I get it somewhere to try it here
> locally?
Apparently the game in question is "Exile: Escape from the pit":
http://osdir.com/ml/wine-bugs/2014-04/msg01159.html
Thanks,
Ingo
* H. Peter Anvin <[email protected]> wrote:
> On 04/11/2014 11:41 AM, Linus Torvalds wrote:
> >
> > Ok, so you actually do this on x86-64, and it currently works? For
> > some reason I thought that 16-bit windows apps already didn't work.
> >
>
> Some will work, because not all 16-bit software care about the upper
> half of ESP getting randomly corrupted.
>
> That is the "functionality bit" of the problem. The other bit, of
> course, is that that random corruption is the address of the kernel stack.
>
> > Because if we have working users of this, then I don't think we can do
> > the "we don't support 16-bit segments", or at least we need to make it
> > runtime configurable.
>
> I'll let you pick what the policy should be here. I personally
> think that we have to be able to draw a line somewhere sometimes
> (Microsoft themselves haven't supported running 16-bit binaries for
> several Windows generations now), but it is your policy, not mine.
I think the mmap_min_addr model works pretty well:
- it defaults to secure
- allow a security policy to grant an exception to a known package,
built by the distro
- end user can also grant an exception
This essentially punts any 'makes the system less secure' exceptions
to the distro and the end user.
Thanks,
Ingo
Linus Torvalds <[email protected]> writes:
> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <[email protected]> wrote:
>>
>> I haven't tested it recently but I do know it has worked on 64-bit
>> kernels. There is no reason for it not to, the only thing not
>> supported in long mode is vm86. 16-bit protected mode is unchanged.
>
> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
> assumed Wine wouldn't do it either on x86-64. Not for any real
> technical reasons, though.
>
> HOWEVER. I'd like to hear something more definitive than "I haven't
> tested recently". The "we don't break user space" is about having
> actual real *users*, not about test programs.
>
> Are there people actually using 16-bit old windows programs under
> wine? That's what matters.
Yes, there is still a significant number of users, and we still
regularly get bug reports about specific 16-bit apps. It would be really
nice if we could continue to support them on x86-64, particularly since
Microsoft doesn't ;-)
--
Alexandre Julliard
[email protected]
On Mon, Apr 14, 2014 at 09:21:13AM +0200, Ingo Molnar wrote:
> Apparently the game in question is "Exile: Escape from the pit":
>
> http://osdir.com/ml/wine-bugs/2014-04/msg01159.html
Ah, thanks.
Well, FWIW, you can get the game for free:
http://www.spiderwebsoftware.com/exile/winexile.html
I did run it on an old windoze guest I had lying around. Performance
feels like native considering this game is from the previous century :-)
https://en.wikipedia.org/wiki/Exile_%28video_game_series%29
Wikipedia says you can get it on steam too, which runs linux and stuff.
Thanks.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Borislav Petkov <[email protected]> wrote:
> On Mon, Apr 14, 2014 at 09:21:13AM +0200, Ingo Molnar wrote:
> > Apparently the game in question is "Exile: Escape from the pit":
> >
> > http://osdir.com/ml/wine-bugs/2014-04/msg01159.html
>
> Ah, thanks.
>
> Well, FWIW, you can get the game for free:
>
> http://www.spiderwebsoftware.com/exile/winexile.html
>
> I did run it on an old windoze guest I had lying around. Performance
> feels like native considering this game is from the previous century :-)
In fact it's from the previous millennium :-)
Thanks,
Ingo
For both of these, though, it is really kind of broken that it is a global switch, whereas typically only one application on the whole system needs it, so it would be much better to have application-specific controls. How to do that is another matter...
On April 14, 2014 12:27:56 AM PDT, Ingo Molnar <[email protected]> wrote:
>
>* H. Peter Anvin <[email protected]> wrote:
>
>> On 04/11/2014 11:41 AM, Linus Torvalds wrote:
>> >
>> > Ok, so you actually do this on x86-64, and it currently works? For
>> > some reason I thought that 16-bit windows apps already didn't work.
>> >
>>
>> Some will work, because not all 16-bit software care about the upper
>> half of ESP getting randomly corrupted.
>>
>> That is the "functionality bit" of the problem. The other bit, of
>> course, is that that random corruption is the address of the kernel
>stack.
>>
>> > Because if we have working users of this, then I don't think we can
>do
>> > the "we don't support 16-bit segments", or at least we need to make
>it
>> > runtime configurable.
>>
>> I'll let you pick what the policy should be here. I personally
>> think that we have to be able to draw a line somewhere sometimes
>> (Microsoft themselves haven't supported running 16-bit binaries for
>> several Windows generations now), but it is your policy, not mine.
>
>I think the mmap_min_addr model works pretty well:
>
> - it defaults to secure
>
> - allow a security policy to grant an exception to a known package,
> built by the distro
>
> - end user can also grant an exception
>
>This essentially punts any 'makes the system less secure' exceptions
>to the distro and the end user.
>
>Thanks,
>
> Ingo
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
This is a prototype of espfix for the 64-bit kernel. espfix is a
workaround for the architectural definition of IRET, which fails to
restore bits [31:16] of %esp when returning to a 16-bit stack
segment. We have a workaround for the 32-bit kernel, but that
implementation doesn't work for 64 bits.
The 64-bit implementation works like this:
Set up a ministack for each CPU, which is then mapped 65536 times
using the page tables. This implementation uses the second-to-last
PGD slot for this; with a 64-byte espfix stack this is sufficient for
2^18 CPUs (currently we support a max of 2^13 CPUs.)
64 bytes appear to be sufficient, because NMI and #MC cause a task
switch.
THIS IS A PROTOTYPE AND IS NOT COMPLETE. We need to make sure all
code paths that can interrupt userspace execute this code.
Fortunately we never need to use the espfix stack for nested faults,
so one per CPU is guaranteed to be safe.
Furthermore, this code adds unnecessary instructions to the common
path. For example, on exception entry we push %rdi, pop %rdi, and
then save away %rdi. Ideally we should do this in such a way that we
avoid unnecessary swapgs, especially on the IRET path (the exception
path is going to be very rare, and so is less critical.)
Putting this version out there for people to look at/laugh at/play
with.
Signed-off-by: H. Peter Anvin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Alexander van Heukelum <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Alexandre Julliard <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/setup.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/entry_64.S | 79 ++++++++++++++++++-
arch/x86/kernel/espfix_64.c | 171 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/head64.c | 1 +
arch/x86/kernel/ldt.c | 11 ---
arch/x86/kernel/smpboot.c | 5 ++
arch/x86/mm/dump_pagetables.c | 2 +
init/main.c | 4 +
9 files changed, 264 insertions(+), 12 deletions(-)
create mode 100644 arch/x86/kernel/espfix_64.c
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 9264f04a4c55..84b882eebdf9 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -57,6 +57,8 @@ extern void x86_ce4100_early_setup(void);
static inline void x86_ce4100_early_setup(void) { }
#endif
+extern void init_espfix_cpu(void);
+
#ifndef _SETUP
/*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f4d96000d33a..1cc3789d99d9 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o
obj-y += syscall_$(BITS).o vsyscall_gtod.o
obj-$(CONFIG_X86_64) += vsyscall_64.o
obj-$(CONFIG_X86_64) += vsyscall_emu_64.o
+obj-$(CONFIG_X86_64) += espfix_64.o
obj-$(CONFIG_SYSFS) += ksysfs.o
obj-y += bootflag.o e820.o
obj-y += pci-dma.o quirks.o topology.o kdebugfs.o
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..7cc01770bf21 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -58,6 +58,7 @@
#include <asm/asm.h>
#include <asm/context_tracking.h>
#include <asm/smap.h>
+#include <asm/pgtable_types.h>
#include <linux/err.h>
/* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
@@ -1040,8 +1041,16 @@ restore_args:
RESTORE_ARGS 1,8,1
irq_return:
+ /*
+ * Are we returning to the LDT? Note: in 64-bit mode
+ * SS:RSP on the exception stack is always valid.
+ */
+ testb $4,(SS-RIP)(%rsp)
+ jnz irq_return_ldt
+
+irq_return_iret:
INTERRUPT_RETURN
- _ASM_EXTABLE(irq_return, bad_iret)
+ _ASM_EXTABLE(irq_return_iret, bad_iret)
#ifdef CONFIG_PARAVIRT
ENTRY(native_iret)
@@ -1049,6 +1058,34 @@ ENTRY(native_iret)
_ASM_EXTABLE(native_iret, bad_iret)
#endif
+irq_return_ldt:
+ pushq_cfi %rcx
+ larl (CS-RIP+8)(%rsp), %ecx
+ jnz 1f /* Invalid segment - will #GP at IRET time */
+ testl $0x00200000, %ecx
+ jnz 1f /* Returning to 64-bit mode */
+ larl (SS-RIP+8)(%rsp), %ecx
+ jnz 1f /* Invalid segment - will #SS at IRET time */
+ testl $0x00400000, %ecx
+ jnz 1f /* Not a 16-bit stack segment */
+ pushq_cfi %rsi
+ pushq_cfi %rdi
+ SWAPGS
+ movq PER_CPU_VAR(espfix_stack),%rdi
+ movl (RSP-RIP+3*8)(%rsp),%esi
+ xorw %si,%si
+ orq %rsi,%rdi
+ movq %rsp,%rsi
+ movl $8,%ecx
+ rep;movsq
+ leaq -(8*8)(%rdi),%rsp
+ SWAPGS
+ popq_cfi %rdi
+ popq_cfi %rsi
+1:
+ popq_cfi %rcx
+ jmp irq_return_iret
+
.section .fixup,"ax"
bad_iret:
/*
@@ -1058,6 +1095,7 @@ bad_iret:
* So pretend we completed the iret and took the #GPF in user mode.
*
* We are now running with the kernel GS after exception recovery.
+ * Exception entry will have removed us from the espfix stack.
* But error_entry expects us to have user GS to match the user %cs,
* so swap back.
*/
@@ -1200,6 +1238,17 @@ apicinterrupt IRQ_WORK_VECTOR \
irq_work_interrupt smp_irq_work_interrupt
#endif
+.macro espfix_adjust_stack
+ pushq_cfi %rdi
+ movq %rsp,%rdi
+ sarq $PGDIR_SHIFT,%rdi
+ cmpl $-2,%edi
+ jne 1f
+ call espfix_fix_stack
+1:
+ popq_cfi %rdi /* Fix so we don't need this again */
+.endm
+
/*
* Exception entry points.
*/
@@ -1209,6 +1258,7 @@ ENTRY(\sym)
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */
+ espfix_adjust_stack
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
call error_entry
@@ -1227,6 +1277,7 @@ ENTRY(\sym)
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */
+ espfix_adjust_stack
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
call save_paranoid
@@ -1265,6 +1316,7 @@ ENTRY(\sym)
XCPT_FRAME
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
+ espfix_adjust_stack
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
call error_entry
@@ -1295,6 +1347,7 @@ ENTRY(\sym)
XCPT_FRAME
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
+ espfix_adjust_stack
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
call save_paranoid
@@ -1323,6 +1376,30 @@ zeroentry coprocessor_error do_coprocessor_error
errorentry alignment_check do_alignment_check
zeroentry simd_coprocessor_error do_simd_coprocessor_error
+ /*
+ * Switch from the espfix stack to the proper stack: tricky stuff.
+ * On the stack right now is 5 words of exception frame,
+ * error code/oldeax, RDI, and the return value, so no additional
+ * stack is available.
+ *
+ * We will always be using the user space GS on entry.
+ */
+ENTRY(espfix_fix_stack)
+ SWAPGS
+ cld
+ movq PER_CPU_VAR(kernel_stack),%rdi
+ subq $8*8,%rdi
+ /* Use the real stack to hold these registers for now */
+ movq %rsi,-8(%rdi)
+ movq %rcx,-16(%rdi)
+ movq %rsp,%rsi
+ movl $8,%ecx
+ rep;movsq
+ leaq -(10*8)(%rdi),%rsp
+ popq %rcx
+ popq %rsi
+ SWAPGS
+ retq
/* Reload gs selector with exception handling */
/* edi: new selector */
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
new file mode 100644
index 000000000000..ff8479628ff2
--- /dev/null
+++ b/arch/x86/kernel/espfix_64.c
@@ -0,0 +1,171 @@
+/* ----------------------------------------------------------------------- *
+ *
+ * Copyright 2014 Intel Corporation; author: H. Peter Anvin
+ *
+ * This file is part of the Linux kernel, and is made available under
+ * the terms of the GNU General Public License version 2 or (at your
+ * option) any later version; incorporated herein by reference.
+ *
+ * ----------------------------------------------------------------------- */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/gfp.h>
+#include <asm/pgtable.h>
+
+#define ESPFIX_STACK_SIZE 64
+#define ESPFIX_BASE_ADDR (-2ULL << PGDIR_SHIFT)
+
+#if CONFIG_NR_CPUS >= (8 << 20)/ESPFIX_STACK_SIZE
+# error "Need more than one PGD for the ESPFIX hack"
+#endif
+
+#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define ESPFIX_PGD_FLAGS (__PAGE_KERNEL & ~_PAGE_DIRTY)
+#define ESPFIX_PUD_FLAGS (__PAGE_KERNEL & ~_PAGE_DIRTY)
+#define ESPFIX_PMD_FLAGS (__PAGE_KERNEL & ~_PAGE_DIRTY)
+#define ESPFIX_PTE_FLAGS __PAGE_KERNEL
+
+/* This contains the *bottom* address of the espfix stack */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
+
+/* Initialization mutex - should this be a spinlock? */
+static DEFINE_MUTEX(espfix_init_mutex);
+
+static __page_aligned_bss pud_t espfix_pud_page[PTRS_PER_PUD]
+ __aligned(PAGE_SIZE);
+
+/* This returns the bottom address of the espfix stack for a specific CPU */
+static inline unsigned long espfix_base_addr(int cpu)
+{
+ unsigned long addr = cpu * ESPFIX_STACK_SIZE;
+
+ addr = (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
+ addr += ESPFIX_BASE_ADDR;
+ return addr;
+}
+
+#define PTE_STRIDE (65536/PAGE_SIZE)
+#define ESPFIX_PTE_CLONES (PTRS_PER_PTE/PTE_STRIDE)
+#define ESPFIX_PMD_CLONES PTRS_PER_PMD
+#define ESPFIX_PUD_CLONES (65536/(ESPFIX_PTE_CLONES*ESPFIX_PMD_CLONES))
+
+/*
+ * Check to see if the espfix stuff is already installed.
+ * We do this once before grabbing the lock and, if we have to,
+ * once after.
+ */
+static bool espfix_already_there(unsigned long addr)
+{
+ const pgd_t *pgd_p;
+ pgd_t pgd;
+ const pud_t *pud_p;
+ pud_t pud;
+ const pmd_t *pmd_p;
+ pmd_t pmd;
+ const pte_t *pte_p;
+ pte_t pte;
+ int n;
+
+ pgd_p = &init_level4_pgt[pgd_index(addr)];
+ pgd = ACCESS_ONCE(*pgd_p);
+ if (!pgd_present(pgd))
+ return false;
+
+ pud_p = &espfix_pud_page[pud_index(addr)];
+ for (n = 0; n < ESPFIX_PUD_CLONES; n++) {
+ pud = ACCESS_ONCE(pud_p[n]);
+ if (!pud_present(pud))
+ return false;
+ }
+
+ pmd_p = pmd_offset(&pud, addr);
+ for (n = 0; n < ESPFIX_PMD_CLONES; n++) {
+ pmd = ACCESS_ONCE(pmd_p[n]);
+ if (!pmd_present(pmd))
+ return false;
+ }
+
+ pte_p = pte_offset_kernel(&pmd, addr);
+ for (n = 0; n < ESPFIX_PTE_CLONES; n++) {
+ pte = ACCESS_ONCE(pte_p[n*PTE_STRIDE]);
+ if (!pte_present(pte))
+ return false;
+ }
+
+ return true; /* All aliases present and accounted for */
+}
+
+void init_espfix_cpu(void)
+{
+ int cpu = smp_processor_id();
+ unsigned long addr;
+ pgd_t pgd, *pgd_p;
+ pud_t pud, *pud_p;
+ pmd_t pmd, *pmd_p;
+ pte_t pte, *pte_p;
+ int n;
+ void *stack_page;
+
+ cpu = smp_processor_id();
+ BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
+
+ /* We only have to do this once... */
+ if (likely(this_cpu_read(espfix_stack)))
+ return; /* Already initialized */
+
+ addr = espfix_base_addr(cpu);
+
+ /* Did another CPU already set this up? */
+ if (likely(espfix_already_there(addr)))
+ goto done;
+
+ mutex_lock(&espfix_init_mutex);
+
+ if (unlikely(espfix_already_there(addr)))
+ goto unlock_done;
+
+ pgd_p = &init_level4_pgt[pgd_index(addr)];
+ pgd = *pgd_p;
+ if (!pgd_present(pgd)) {
+ /* This can only happen on the BSP */
+ pgd = __pgd(__pa(espfix_pud_page) |
+ (ESPFIX_PGD_FLAGS & __supported_pte_mask));
+ set_pgd(pgd_p, pgd);
+ }
+
+ pud_p = &espfix_pud_page[pud_index(addr)];
+ pud = *pud_p;
+ if (!pud_present(pud)) {
+ pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+ pud = __pud(__pa(pmd_p) |
+ (ESPFIX_PUD_FLAGS & __supported_pte_mask));
+ for (n = 0; n < ESPFIX_PUD_CLONES; n++)
+ set_pud(&pud_p[n], pud);
+ }
+
+ pmd_p = pmd_offset(&pud, addr);
+ pmd = *pmd_p;
+ if (!pmd_present(pmd)) {
+ pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+ pmd = __pmd(__pa(pte_p) |
+ (ESPFIX_PMD_FLAGS & __supported_pte_mask));
+ for (n = 0; n < ESPFIX_PMD_CLONES; n++)
+ set_pmd(&pmd_p[n], pmd);
+ }
+
+ pte_p = pte_offset_kernel(&pmd, addr);
+ stack_page = (void *)__get_free_page(GFP_KERNEL);
+ pte = __pte(__pa(stack_page) |
+ (ESPFIX_PTE_FLAGS & __supported_pte_mask));
+ for (n = 0; n < ESPFIX_PTE_CLONES; n++)
+ set_pte(&pte_p[n*PTE_STRIDE], pte);
+
+unlock_done:
+ mutex_unlock(&espfix_init_mutex);
+done:
+ this_cpu_write(espfix_stack, addr);
+ printk(KERN_ERR "espfix: Initializing espfix for cpu %d, stack @ %p\n",
+ cpu, (const void *)addr);
+}
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 85126ccbdf6b..dc2d8afcafe9 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -32,6 +32,7 @@
* Manage page tables very early on.
*/
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pud_t espfix_pud_page[PTRS_PER_PUD];
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt = 2;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index af1d14a9ebda..ebc987398923 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -229,17 +229,6 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
}
}
- /*
- * On x86-64 we do not support 16-bit segments due to
- * IRET leaking the high bits of the kernel stack address.
- */
-#ifdef CONFIG_X86_64
- if (!ldt_info.seg_32bit) {
- error = -EINVAL;
- goto out_unlock;
- }
-#endif
-
fill_ldt(&ldt, &ldt_info);
if (oldmode)
ldt.avl = 0;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 34826934d4a7..ff32efb14e33 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -244,6 +244,11 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();
/*
+ * Enable the espfix hack for this CPU
+ */
+ init_espfix_cpu();
+
+ /*
* We need to hold vector_lock so there the set of online cpus
* does not change while we are assigning vectors to cpus. Holding
* this lock ensures we don't half assign or remove an irq from a cpu.
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 20621d753d5f..96bf767a05fc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -327,6 +327,8 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
int i;
struct pg_state st = {};
+ st.to_dmesg = true;
+
if (pgd) {
start = pgd;
st.to_dmesg = true;
diff --git a/init/main.c b/init/main.c
index 9c7fd4c9249f..6cccf5524b3c 100644
--- a/init/main.c
+++ b/init/main.c
@@ -648,6 +648,10 @@ asmlinkage void __init start_kernel(void)
ftrace_init();
+#ifdef CONFIG_X86_64
+ init_espfix_cpu();
+#endif
+
/* Do the rest non-__init'ed, we're now alive */
rest_init();
}
--
1.9.0
On Mon, Apr 21, 2014 at 3:47 PM, H. Peter Anvin <[email protected]> wrote:
> This is a prototype of espfix for the 64-bit kernel. espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment. We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.
>
> The 64-bit implementation works like this:
>
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables. This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)
>
> 64 bytes appear to be sufficient, because NMI and #MC cause a task
> switch.
>
> THIS IS A PROTOTYPE AND IS NOT COMPLETE. We need to make sure all
> code paths that can interrupt userspace execute this code.
> Fortunately we never need to use the espfix stack for nested faults,
> so one per CPU is guaranteed to be safe.
>
> Furthermore, this code adds unnecessary instructions to the common
> path. For example, on exception entry we push %rdi, pop %rdi, and
> then save away %rdi. Ideally we should do this in such a way that we
> avoid unnecessary swapgs, especially on the IRET path (the exception
> path is going to be very rare, and so is less critical.)
>
> Putting this version out there for people to look at/laugh at/play
> with.
Hahaha! :)
Some comments:
Does returning to 64-bit CS with 16-bit SS not need espfix?
Conversely, does 16-bit CS and 32-bit SS need espfix?
> @@ -1058,6 +1095,7 @@ bad_iret:
> * So pretend we completed the iret and took the #GPF in user mode.
> *
> * We are now running with the kernel GS after exception recovery.
> + * Exception entry will have removed us from the espfix stack.
> * But error_entry expects us to have user GS to match the user %cs,
> * so swap back.
> */
What is that referring to?
> + /*
> + * Switch from the espfix stack to the proper stack: tricky stuff.
> + * On the stack right now is 5 words of exception frame,
> + * error code/oldeax, RDI, and the return value, so no additional
> + * stack is available.
> + *
> + * We will always be using the user space GS on entry.
> + */
> +ENTRY(espfix_fix_stack)
> + SWAPGS
> + cld
> + movq PER_CPU_VAR(kernel_stack),%rdi
> + subq $8*8,%rdi
> + /* Use the real stack to hold these registers for now */
> + movq %rsi,-8(%rdi)
> + movq %rcx,-16(%rdi)
> + movq %rsp,%rsi
> + movl $8,%ecx
> + rep;movsq
> + leaq -(10*8)(%rdi),%rsp
> + popq %rcx
> + popq %rsi
> + SWAPGS
> + retq
>
Is it guaranteed that the userspace thread that caused this is dead?
If not, do you need to change RIP so that espfix gets invoked again
when you return from the exception?
> +
> +void init_espfix_cpu(void)
> +{
> + int cpu = smp_processor_id();
> + unsigned long addr;
> + pgd_t pgd, *pgd_p;
> + pud_t pud, *pud_p;
> + pmd_t pmd, *pmd_p;
> + pte_t pte, *pte_p;
> + int n;
> + void *stack_page;
> +
> + cpu = smp_processor_id();
> + BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
> +
> + /* We only have to do this once... */
> + if (likely(this_cpu_read(espfix_stack)))
> + return; /* Already initialized */
> +
> + addr = espfix_base_addr(cpu);
> +
> + /* Did another CPU already set this up? */
> + if (likely(espfix_already_there(addr)))
> + goto done;
> +
> + mutex_lock(&espfix_init_mutex);
> +
> + if (unlikely(espfix_already_there(addr)))
> + goto unlock_done;
Wouldn't it be simpler to just have a single static bool to indicate
whether espfix is initialized?
Even better: why not separate the percpu init from the pagetable init
and just do the pagetable init once from main or even modify_ldt?
--Andy
On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:
>
> Hahaha! :)
>
> Some comments:
>
> Does returning to 64-bit CS with 16-bit SS not need espfix?
There is no such thing. With a 64-bit CS, the flags on SS are ignored
(although you still have to have a non-null SS... the conditions are a
bit complex.)
> Conversely, does 16-bit CS and 32-bit SS need espfix?
It does not, at least to the best of my knowledge (it is controlled by
the SS size, not the CS size.)
I'm going to double-check the corner cases just out of healthy paranoia,
but I'm 98% sure this is correct (and if not, the 32-bit code needs to
be fixed, too.)
>> @@ -1058,6 +1095,7 @@ bad_iret:
>> * So pretend we completed the iret and took the #GPF in user mode.
>> *
>> * We are now running with the kernel GS after exception recovery.
>> + * Exception entry will have removed us from the espfix stack.
>> * But error_entry expects us to have user GS to match the user %cs,
>> * so swap back.
>> */
>
> What is that referring to?
It means that we have already switched back from the espfix stack to the
real stack.
>> + /*
>> + * Switch from the espfix stack to the proper stack: tricky stuff.
>> + * On the stack right now is 5 words of exception frame,
>> + * error code/oldeax, RDI, and the return value, so no additional
>> + * stack is available.
>> + *
>> + * We will always be using the user space GS on entry.
>> + */
>> +ENTRY(espfix_fix_stack)
>> + SWAPGS
>> + cld
>> + movq PER_CPU_VAR(kernel_stack),%rdi
>> + subq $8*8,%rdi
>> + /* Use the real stack to hold these registers for now */
>> + movq %rsi,-8(%rdi)
>> + movq %rcx,-16(%rdi)
>> + movq %rsp,%rsi
>> + movl $8,%ecx
>> + rep;movsq
>> + leaq -(10*8)(%rdi),%rsp
>> + popq %rcx
>> + popq %rsi
>> + SWAPGS
>> + retq
>>
>
> Is it guaranteed that the userspace thread that caused this is dead?
> If not, do you need to change RIP so that espfix gets invoked again
> when you return from the exception?
It is not guaranteed to be dead at all. Why would you need to change
RIP, though?
>> +
>> +void init_espfix_cpu(void)
>> +{
>> + int cpu = smp_processor_id();
>> + unsigned long addr;
>> + pgd_t pgd, *pgd_p;
>> + pud_t pud, *pud_p;
>> + pmd_t pmd, *pmd_p;
>> + pte_t pte, *pte_p;
>> + int n;
>> + void *stack_page;
>> +
>> + cpu = smp_processor_id();
>> + BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
>> +
>> + /* We only have to do this once... */
>> + if (likely(this_cpu_read(espfix_stack)))
>> + return; /* Already initialized */
>> +
>> + addr = espfix_base_addr(cpu);
>> +
>> + /* Did another CPU already set this up? */
>> + if (likely(espfix_already_there(addr)))
>> + goto done;
>> +
>> + mutex_lock(&espfix_init_mutex);
>> +
>> + if (unlikely(espfix_already_there(addr)))
>> + goto unlock_done;
>
> Wouldn't it be simpler to just have a single static bool to indicate
> whether espfix is initialized?
No, you would have to allocate memory for every possible CPU, which I
wanted to avoid in case NR_CPUS >> actual CPUs (I don't know if we have
already done that for percpu, but we *should* if we haven't yet.)
> Even better: why not separate the percpu init from the pagetable init
> and just do the pagetable init once from main or even modify_ldt?
It needs to be done once per CPU. I wanted to do it late enough that
the page allocator is fully functional, so we don't have to do the ugly
hacks to call one allocator or another as the percpu initialization code
does (otherwise it would have made a lot of sense to co-locate with percpu.)
-hpa
On Mon, Apr 21, 2014 at 4:29 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:
>>
>> Hahaha! :)
>>
>> Some comments:
>>
>> Does returning to 64-bit CS with 16-bit SS not need espfix?
>
> There is no such thing. With a 64-bit CS, the flags on SS are ignored
> (although you still have to have a non-null SS... the conditions are a
> bit complex.)
>
>> Conversely, does 16-bit CS and 32-bit SS need espfix?
>
> It does not, at least to the best of my knowledge (it is controlled by
> the SS size, not the CS size.)
>
> I'm going to double-check the corner cases just out of healthy paranoia,
> but I'm 98% sure this is correct (and if not, the 32-bit code needs to
> be fixed, too.)
>
>>> @@ -1058,6 +1095,7 @@ bad_iret:
>>> * So pretend we completed the iret and took the #GPF in user mode.
>>> *
>>> * We are now running with the kernel GS after exception recovery.
>>> + * Exception entry will have removed us from the espfix stack.
>>> * But error_entry expects us to have user GS to match the user %cs,
>>> * so swap back.
>>> */
>>
>> What is that referring to?
>
> It means that we have already switched back from the espfix stack to the
> real stack.
>
>>> + /*
>>> + * Switch from the espfix stack to the proper stack: tricky stuff.
>>> + * On the stack right now is 5 words of exception frame,
>>> + * error code/oldeax, RDI, and the return value, so no additional
>>> + * stack is available.
>>> + *
>>> + * We will always be using the user space GS on entry.
>>> + */
>>> +ENTRY(espfix_fix_stack)
>>> + SWAPGS
>>> + cld
>>> + movq PER_CPU_VAR(kernel_stack),%rdi
>>> + subq $8*8,%rdi
>>> + /* Use the real stack to hold these registers for now */
>>> + movq %rsi,-8(%rdi)
>>> + movq %rcx,-16(%rdi)
>>> + movq %rsp,%rsi
>>> + movl $8,%ecx
>>> + rep;movsq
>>> + leaq -(10*8)(%rdi),%rsp
>>> + popq %rcx
>>> + popq %rsi
>>> + SWAPGS
>>> + retq
>>>
>>
>> Is it guaranteed that the userspace thread that caused this is dead?
>> If not, do you need to change RIP so that espfix gets invoked again
>> when you return from the exception?
>
> It is not guaranteed to be dead at all. Why would you need to change
> RIP, though?
Oh. You're not changing the RSP that you return to. So this should be okay.
>
>>> +
>>> +void init_espfix_cpu(void)
>>> +{
>>> + int cpu = smp_processor_id();
>>> + unsigned long addr;
>>> + pgd_t pgd, *pgd_p;
>>> + pud_t pud, *pud_p;
>>> + pmd_t pmd, *pmd_p;
>>> + pte_t pte, *pte_p;
>>> + int n;
>>> + void *stack_page;
>>> +
>>> + cpu = smp_processor_id();
>>> + BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
>>> +
>>> + /* We only have to do this once... */
>>> + if (likely(this_cpu_read(espfix_stack)))
>>> + return; /* Already initialized */
>>> +
>>> + addr = espfix_base_addr(cpu);
>>> +
>>> + /* Did another CPU already set this up? */
>>> + if (likely(espfix_already_there(addr)))
>>> + goto done;
>>> +
>>> + mutex_lock(&espfix_init_mutex);
>>> +
>>> + if (unlikely(espfix_already_there(addr)))
>>> + goto unlock_done;
>>
>> Wouldn't it be simpler to just have a single static bool to indicate
>> whether espfix is initialized?
>
> No, you would have to allocate memory for every possible CPU, which I
> wanted to avoid in case NR_CPUS >> actual CPUs (I don't know if we have
> already done that for percpu, but we *should* if we haven't yet.)
>
>> Even better: why not separate the percpu init from the pagetable init
>> and just do the pagetable init once from main or even modify_ldt?
>
> It needs to be done once per CPU. I wanted to do it late enough that
> the page allocator is fully functional, so we don't have to do the ugly
> hacks to call one allocator or another as the percpu initialization code
> does (otherwise it would have made a lot of sense to co-locate with percpu.)
Hmm. I guess espfix_already_there isn't so bad. Given that, in the
worst case, I think there are 16 pages allocated, it might make sense
to just track which of those 16 pages have been allocated in some
array. That whole array would probably be shorter than the test of
espfix_already_there. Or am I still failing to understand how this
works?
--Andy
Well, if 2^17 CPUs are allocated we might have 2K pages allocated. We could easily do a bitmap here, of course. NR_CPUS/64 is a small number, and would reduce the code complexity.
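A minimal sketch of what that could look like, with one bit per 4K ministack page, i.e. per group of 64 CPUs (names and structure are made up for illustration; espfix_stack and espfix_base_addr() are the ones from the patch):
#include <linux/bitmap.h>
#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/smp.h>
/* One bit per ministack page; each 4K page holds 4096/64 = 64 ministacks. */
static DECLARE_BITMAP(espfix_page_done, DIV_ROUND_UP(NR_CPUS, 64));
static DEFINE_MUTEX(espfix_init_mutex);
void init_espfix_cpu(void)
{
	unsigned int cpu  = smp_processor_id();
	unsigned int page = cpu / 64;
	if (!test_bit(page, espfix_page_done)) {
		mutex_lock(&espfix_init_mutex);
		if (!test_bit(page, espfix_page_done)) {
			/* ... allocate and install the page tables as in the patch ... */
			set_bit(page, espfix_page_done);
		}
		mutex_unlock(&espfix_init_mutex);
	}
	this_cpu_write(espfix_stack, espfix_base_addr(cpu));
}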
On April 21, 2014 5:37:05 PM PDT, Andrew Lutomirski <[email protected]> wrote:
>On Mon, Apr 21, 2014 at 4:29 PM, H. Peter Anvin <[email protected]> wrote:
>> On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:
>>>
>>> Hahaha! :)
>>>
>>> Some comments:
>>>
>>> Does returning to 64-bit CS with 16-bit SS not need espfix?
>>
>> There is no such thing. With a 64-bit CS, the flags on SS are
>ignored
>> (although you still have to have a non-null SS... the conditions are
>a
>> bit complex.)
>>
>>> Conversely, does 16-bit CS and 32-bit SS need espfix?
>>
>> It does not, at least to the best of my knowledge (it is controlled
>by
>> the SS size, not the CS size.)
>>
>> I'm going to double-check the corner cases just out of healthy
>paranoia,
>> but I'm 98% sure this is correct (and if not, the 32-bit code needs
>to
>> be fixed, too.)
>>
>>>> @@ -1058,6 +1095,7 @@ bad_iret:
>>>> * So pretend we completed the iret and took the #GPF in
>user mode.
>>>> *
>>>> * We are now running with the kernel GS after exception
>recovery.
>>>> + * Exception entry will have removed us from the espfix
>stack.
>>>> * But error_entry expects us to have user GS to match the
>user %cs,
>>>> * so swap back.
>>>> */
>>>
>>> What is that referring to?
>>
>> It means that we have already switched back from the espfix stack to
>the
>> real stack.
>>
>>>> + /*
>>>> + * Switch from the espfix stack to the proper stack: tricky
>stuff.
>>>> + * On the stack right now is 5 words of exception frame,
>>>> + * error code/oldeax, RDI, and the return value, so no
>additional
>>>> + * stack is available.
>>>> + *
>>>> + * We will always be using the user space GS on entry.
>>>> + */
>>>> +ENTRY(espfix_fix_stack)
>>>> + SWAPGS
>>>> + cld
>>>> + movq PER_CPU_VAR(kernel_stack),%rdi
>>>> + subq $8*8,%rdi
>>>> + /* Use the real stack to hold these registers for now */
>>>> + movq %rsi,-8(%rdi)
>>>> + movq %rcx,-16(%rdi)
>>>> + movq %rsp,%rsi
>>>> + movl $8,%ecx
>>>> + rep;movsq
>>>> + leaq -(10*8)(%rdi),%rsp
>>>> + popq %rcx
>>>> + popq %rsi
>>>> + SWAPGS
>>>> + retq
>>>>
>>>
>>> Is it guaranteed that the userspace thread that caused this is dead?
>>> If not, do you need to change RIP so that espfix gets invoked again
>>> when you return from the exception?
>>
>> It is not guaranteed to be dead at all. Why would you need to change
>> RIP, though?
>
>Oh. You're not changing the RSP that you return to. So this should be
>okay.
>
>>
>>>> +
>>>> +void init_espfix_cpu(void)
>>>> +{
>>>> + int cpu = smp_processor_id();
>>>> + unsigned long addr;
>>>> + pgd_t pgd, *pgd_p;
>>>> + pud_t pud, *pud_p;
>>>> + pmd_t pmd, *pmd_p;
>>>> + pte_t pte, *pte_p;
>>>> + int n;
>>>> + void *stack_page;
>>>> +
>>>> + cpu = smp_processor_id();
>>>> + BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
>>>> +
>>>> + /* We only have to do this once... */
>>>> + if (likely(this_cpu_read(espfix_stack)))
>>>> + return; /* Already initialized */
>>>> +
>>>> + addr = espfix_base_addr(cpu);
>>>> +
>>>> + /* Did another CPU already set this up? */
>>>> + if (likely(espfix_already_there(addr)))
>>>> + goto done;
>>>> +
>>>> + mutex_lock(&espfix_init_mutex);
>>>> +
>>>> + if (unlikely(espfix_already_there(addr)))
>>>> + goto unlock_done;
>>>
>>> Wouldn't it be simpler to just have a single static bool to indicate
>>> whether espfix is initialized?
>>
>> No, you would have to allocate memory for every possible CPU, which I
>> wanted to avoid in case NR_CPUS >> actual CPUs (I don't know if we
>have
>> already done that for percpu, but we *should* if we haven't yet.)
>>
>>> Even better: why not separate the percpu init from the pagetable
>init
>>> and just do the pagetable init once from main or even modify_ldt?
>>
>> It needs to be done once per CPU. I wanted to do it late enough that
>> the page allocator is fully functional, so we don't have to do the
>ugly
>> hacks to call one allocator or another as the percpu initialization
>code
>> does (otherwise it would have made a lot of sense to co-locate with
>percpu.)
>
>Hmm. I guess espfix_already_there isn't so bad. Given that, in the
>worst case, I think there are 16 pages allocated, it might make sense
>to just track which of those 16 pages have been allocated in some
>array. That whole array would probably be shorter than the test of
>espfix_already_there. Or am I still failing to understand how this
>works?
>
>--Andy
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Mon, Apr 21, 2014 at 5:53 PM, H. Peter Anvin <[email protected]> wrote:
> Well, if 2^17 CPUs are allocated we might 2K pages allocated. We could easily do a bitmap here, of course. NR_CPUS/64 is a small number, and would reduce the code complexity.
>
Even simpler: just get rid of the check entirely. That is, break out
of the higher level loops once one of them is set (this should be a
big speedup regardless) and don't allocate the page if the first PTE
is already pointing at something.
After all, espfix_already_there is mostly a duplicate of init_espfix_cpu.
--Andy
I wanted to avoid the "another cpu made this allocation, now I have to free" crap, but I also didn't want to grab the lock if there was no work needed.
On April 21, 2014 6:06:19 PM PDT, Andrew Lutomirski <[email protected]> wrote:
>On Mon, Apr 21, 2014 at 5:53 PM, H. Peter Anvin <[email protected]> wrote:
>> Well, if 2^17 CPUs are allocated we might 2K pages allocated. We
>could easily do a bitmap here, of course. NR_CPUS/64 is a small
>number, and would reduce the code complexity.
>>
>
>Even simpler: just get rid of the check entirely. That is, break out
>of the higher level loops once one of them is set (this should be a
>big speedup regardless) and don't allocate the page if the first PTE
>is already pointing at something.
>
>After all, espfix_already_there is mostly a duplicate of
>init_espfix_cpu.
>
>--Andy
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Mon, Apr 21, 2014 at 6:14 PM, H. Peter Anvin <[email protected]> wrote:
> I wanted to avoid the "another cpu made this allocation, now I have to free" crap, but I also didn't want to grab the lock if there was no work needed.
I guess you also want to avoid bouncing all these cachelines around on
boot on big multicore machines.
I'd advocate using the bitmap approach or simplifying the existing
code. For example:
+ for (n = 0; n < ESPFIX_PUD_CLONES; n++) {
+ pud = ACCESS_ONCE(pud_p[n]);
+ if (!pud_present(pud))
+ return false;
+ }
I don't see why that needs to be a loop. How can one clone exist but
not the others?
--Andy
Race condition (although with x86 being globally ordered, it probably can't actually happen.) The bitmask is probably the way to go.
On April 21, 2014 6:28:12 PM PDT, Andrew Lutomirski <[email protected]> wrote:
>On Mon, Apr 21, 2014 at 6:14 PM, H. Peter Anvin <[email protected]> wrote:
>> I wanted to avoid the "another cpu made this allocation, now I have
>to free" crap, but I also didn't want to grab the lock if there was no
>work needed.
>
>I guess you also want to avoid bouncing all these cachelines around on
>boot on bit multicore machines.
>
>I'd advocate using the bitmap approach or simplifying the existing
>code. For example:
>
>+ for (n = 0; n < ESPFIX_PUD_CLONES; n++) {
>+ pud = ACCESS_ONCE(pud_p[n]);
>+ if (!pud_present(pud))
>+ return false;
>+ }
>
>I don't see why that needs to be a loop. How can one clone exist but
>not the others?
>
>--Andy
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin <[email protected]> wrote:
> Race condition (although with x86 being globally ordered, it probably can't actually happen.) The bitmask is probably the way to go.
Does the race matter? In the worst case you take the lock
unnecessarily. But yes, the bitmask is easy.
--Andy
On Mon, Apr 21, 2014 at 06:53:36PM -0700, Andrew Lutomirski wrote:
> On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin <[email protected]> wrote:
> > Race condition (although with x86 being globally ordered, it probably can't actually happen.) The bitmask is probably the way to go.
>
> Does the race matter? In the worst case you take the lock
> unnecessarily. But yes, the bitmask is easy.
I wonder if it would be workable to use a bit in the espfix PGD to
denote that it has been initialized already... I hear, near NX there's
some room :-)
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
Just nitpicks below:
On Mon, Apr 21, 2014 at 03:47:52PM -0700, H. Peter Anvin wrote:
> This is a prototype of espfix for the 64-bit kernel. espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment. We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.
>
> The 64-bit implementation works like this:
>
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables. This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)
I wish we'd put this description in the code instead of in a commit
message as those can get lost in git history over time.
> 64 bytes appear to be sufficient, because NMI and #MC cause a task
> switch.
>
> THIS IS A PROTOTYPE AND IS NOT COMPLETE. We need to make sure all
> code paths that can interrupt userspace execute this code.
> Fortunately we never need to use the espfix stack for nested faults,
> so one per CPU is guaranteed to be safe.
>
> Furthermore, this code adds unnecessary instructions to the common
> path. For example, on exception entry we push %rdi, pop %rdi, and
> then save away %rdi. Ideally we should do this in such a way that we
> avoid unnecessary swapgs, especially on the IRET path (the exception
> path is going to be very rare, and so is less critical.)
>
> Putting this version out there for people to look at/laugh at/play
> with.
>
> Signed-off-by: H. Peter Anvin <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Cc: Linus Torvalds <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Alexander van Heukelum <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Arjan van de Ven <[email protected]>
> Cc: Brian Gerst <[email protected]>
> Cc: Alexandre Julliard <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
...
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1e96c3628bf2..7cc01770bf21 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -58,6 +58,7 @@
> #include <asm/asm.h>
> #include <asm/context_tracking.h>
> #include <asm/smap.h>
> +#include <asm/pgtable_types.h>
> #include <linux/err.h>
>
> /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
> @@ -1040,8 +1041,16 @@ restore_args:
> RESTORE_ARGS 1,8,1
>
> irq_return:
> + /*
> + * Are we returning to the LDT? Note: in 64-bit mode
> + * SS:RSP on the exception stack is always valid.
> + */
> + testb $4,(SS-RIP)(%rsp)
> + jnz irq_return_ldt
> +
> +irq_return_iret:
> INTERRUPT_RETURN
> - _ASM_EXTABLE(irq_return, bad_iret)
> + _ASM_EXTABLE(irq_return_iret, bad_iret)
>
> #ifdef CONFIG_PARAVIRT
> ENTRY(native_iret)
> @@ -1049,6 +1058,34 @@ ENTRY(native_iret)
> _ASM_EXTABLE(native_iret, bad_iret)
> #endif
>
> +irq_return_ldt:
> + pushq_cfi %rcx
> + larl (CS-RIP+8)(%rsp), %ecx
> + jnz 1f /* Invalid segment - will #GP at IRET time */
> + testl $0x00200000, %ecx
> + jnz 1f /* Returning to 64-bit mode */
> + larl (SS-RIP+8)(%rsp), %ecx
> + jnz 1f /* Invalid segment - will #SS at IRET time */
You mean " ... will #GP at IRET time"? But you're right, you're looking
at SS :-)
> + testl $0x00400000, %ecx
> + jnz 1f /* Not a 16-bit stack segment */
> + pushq_cfi %rsi
> + pushq_cfi %rdi
> + SWAPGS
> + movq PER_CPU_VAR(espfix_stack),%rdi
> + movl (RSP-RIP+3*8)(%rsp),%esi
> + xorw %si,%si
> + orq %rsi,%rdi
> + movq %rsp,%rsi
> + movl $8,%ecx
> + rep;movsq
> + leaq -(8*8)(%rdi),%rsp
> + SWAPGS
> + popq_cfi %rdi
> + popq_cfi %rsi
> +1:
> + popq_cfi %rcx
> + jmp irq_return_iret
> +
> .section .fixup,"ax"
> bad_iret:
> /*
...
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 85126ccbdf6b..dc2d8afcafe9 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -32,6 +32,7 @@
> * Manage page tables very early on.
> */
> extern pgd_t early_level4_pgt[PTRS_PER_PGD];
> +extern pud_t espfix_pud_page[PTRS_PER_PUD];
I guess you don't need the "extern" here.
> extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
> static unsigned int __initdata next_early_pgt = 2;
> pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index af1d14a9ebda..ebc987398923 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -229,17 +229,6 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
> }
> }
>
> - /*
> - * On x86-64 we do not support 16-bit segments due to
> - * IRET leaking the high bits of the kernel stack address.
> - */
> -#ifdef CONFIG_X86_64
> - if (!ldt_info.seg_32bit) {
> - error = -EINVAL;
> - goto out_unlock;
> - }
> -#endif
> -
> fill_ldt(&ldt, &ldt_info);
> if (oldmode)
> ldt.avl = 0;
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 34826934d4a7..ff32efb14e33 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -244,6 +244,11 @@ static void notrace start_secondary(void *unused)
> check_tsc_sync_target();
>
> /*
> + * Enable the espfix hack for this CPU
> + */
> + init_espfix_cpu();
> +
> + /*
> * We need to hold vector_lock so there the set of online cpus
> * does not change while we are assigning vectors to cpus. Holding
> * this lock ensures we don't half assign or remove an irq from a cpu.
> diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
> index 20621d753d5f..96bf767a05fc 100644
> --- a/arch/x86/mm/dump_pagetables.c
> +++ b/arch/x86/mm/dump_pagetables.c
> @@ -327,6 +327,8 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
> int i;
> struct pg_state st = {};
>
> + st.to_dmesg = true;
Right, remove before applying :)
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
> I wonder if it would be workable to use a bit in the espfix PGD to
> denote that it has been initialized already... I hear, near NX there's
> some room :-)
Ok, I realized this won't work when I hit send... Oh well.
Anyway, another dumb idea: have we considered making this lazy? I.e.,
preallocate pages to fit the stacks of NR_CPUS after smp init is done but
not set up the percpu espfix stack. Only do that in espfix_fix_stack the
first time we land there and haven't been set up yet on this cpu.
This should cover the 1% out there who still use 16-bit segments, while the
rest, who simply don't use it, get to save themselves the PT-walk in
start_secondary().
Hmmm...
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Tue, Apr 22, 2014 at 7:46 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
>> I wonder if it would be workable to use a bit in the espfix PGD to
>> denote that it has been initialized already... I hear, near NX there's
>> some room :-)
>
> Ok, I realized this won't work when I hit send... Oh well.
>
> Anyway, another dumb idea: have we considered making this lazy? I.e.,
> preallocate pages to fit the stack of NR_CPUS after smp init is done but
> not setup the percpu espfix stack. Only do that in espfix_fix_stack the
> first time we land there and haven't been setup yet on this cpu.
>
> This should cover the 1% out there who still use 16-bit segments and the
> rest simply doesn't use it and get to save themselves the PT-walk in
> start_secondary().
>
> Hmmm...
I'm going to try to do the math to see what's actually going on.
Each 4G slice contains 64kB of ministacks, which corresponds to 1024
ministacks. Virtual addresses are divided up as:
12 bits (0..11): address within page.
9 bits (12..20): identifies the PTE within the level 1 directory
9 bits (21..29): identifies the level 1 directory (pmd?) within the
level 2 directory
9 bits (30..38): identifies the level 2 directory (pud) within the
level 3 directory
Critically, each 1024 CPUs can share the same level 1 directory --
there are just a bunch of copies of the same thing in there.
Similarly, they can share the same level 2 directory, and each slot in
that directory will point to the same level 1 directory.
For the level 3 directory, there is only one globally. It needs 8
entries per 1024 CPUs.
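As a quick sanity check of the per-slice numbers, a throwaway user-space calculation (assuming 4K pages and the 64-byte ministacks from the patch):
#include <stdio.h>
int main(void)
{
	const unsigned long long stack_size = 64;		/* ESPFIX_STACK_SIZE     */
	const unsigned long long slice      = 1ULL << 32;	/* one 4G slice          */
	const unsigned long long unique     = 1ULL << 16;	/* the low 64K is unique */
	printf("ministacks per 4G slice: %llu\n", unique / stack_size);	/* 1024  */
	printf("aliases per 4G slice:    %llu\n", slice / unique);		/* 65536 */
	printf("CPUs per ministack page: %llu\n", 4096 / stack_size);	/* 64    */
	return 0;
}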
I imagine there's a scalability problem here, too: it's okay if each
of a very large number of CPUs waits while shared structures are
allocated, but owners of big systems won't like it if they all
serialize on the way out.
So maybe it would make sense to refactor this into two separate
functions. First, before we start the first non-boot CPU:
static pte_t *slice_pte_tables[NR_CPUS / 1024];
Allocate and initialize them all;
It might even make sense to do this at build time instead of run time.
I can't imagine that parallelizing this would provide any benefit
unless it were done *very* carefully and there were hundreds of
thousands of CPUs. At worst, we're wasting 4 bytes per CPU not
present.
Then, for the per-CPU part, have one init-once structure (please tell
me the kernel has one of these) per 64 possible CPUs. Each CPU will
make sure that its group of 64 cpus is initialized, using the init
once mechanism, and then it will set its percpu variable accordingly.
There are only 64 CPUs per slice, so mutexes may not be so bad here.
--Andy
Honestly, guys... you're painting the bikeshed at the moment.
Initialization is the easiest bit of all this code. The tricky part is
*the rest of the code*, i.e. the stuff in entry_64.S.
Also, the code is butt-ugly at the moment. Aesthetics have not been
dealt with.
-hpa
On Tue, Apr 22, 2014 at 9:10 AM, H. Peter Anvin <[email protected]> wrote:
> Honestly, guys... you're painting the bikeshed at the moment.
>
> Initialization is the easiest bit of all this code. The tricky part is
> *the rest of the code*, i.e. the stuff in entry_64.S.
That's because the initialization code is much simpler, so it's easy
to pick on :) Sorry.
For the espfix_adjust_stack thing, when can it actually need to do
anything? irqs should be off, I think, and MCE, NMI, and debug
exceptions use ist, so that leaves just #SS and #GP, I think. How can
those actually occur? Is there a way to trigger them deliberately
from userspace? Why do you have three espfix_adjust_stack invocations?
What happens on the IST entries? If I've read your patch right,
you're still switching back to the normal stack, which looks
questionable.
Also, if you want to save some register abuse on each exception entry,
could you check the saved RIP instead of the current RSP? I.e. use
the test instruction with offset(%rsp)? Maybe there are multiple
possible values, though, and just testing some bits doesn't help.
--Andy
On Tue, Apr 22, 2014 at 9:33 AM, Andrew Lutomirski <[email protected]> wrote:
>
> For the espfix_adjust_stack thing, when can it actually need to do
> anything? irqs should be off, I think, and MCE, NMI, and debug
> exceptions use ist, so that leaves just #SS and #GP, I think. How can
> those actually occur? Is there a way to trigger them deliberately
> from userspace? Why do you have three espfix_adjust_stack invocations?
Yes, you can very much trigger GP deliberately.
The way to do it is to just make an invalid segment descriptor on the
iret stack. Or make it a valid 16-bit one, but make it a code segment
for the stack pointer, or read-only, or whatever. All of which is
trivial to do with a sigreturn system call. But you can do it other
ways too - enter with a SS that is valid, but do a load_ldt() system
call that makes it invalid, so that by the time you exit it is no
longer valid etc.
There's a reason we mark that "iretq" as taking faults with that
_ASM_EXTABLE(native_iret, bad_iret)
and that "bad_iret" creates a GP fault.
And that's a lot of kernel stack. The whole initial GP fault path,
which goes to the C code that finds the exception table etc. See
do_general_protection_fault() and fixup_exception().
Linus
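For concreteness, the user-space side of the 16-bit LDT case being discussed looks roughly like this (a sketch only, minimal error handling; while the write_ldt() check that this series removes is still in place, the call should simply fail with EINVAL):
#include <asm/ldt.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
int main(void)
{
	struct user_desc desc;
	/* Install a 16-bit data segment in the LDT via modify_ldt(). */
	memset(&desc, 0, sizeof(desc));
	desc.entry_number = 0;
	desc.base_addr    = 0;
	desc.limit        = 0xffff;
	desc.seg_32bit    = 0;			/* the 16-bit case */
	desc.contents     = 0;			/* data, expand-up */
	desc.useable      = 1;
	/* func 1 == write an LDT entry (oldmode) */
	if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0)
		perror("modify_ldt");		/* EINVAL while the 16-bit ban is in place */
	return 0;
}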
On Tue, Apr 22, 2014 at 9:43 AM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 22, 2014 at 9:33 AM, Andrew Lutomirski <[email protected]> wrote:
>>
>> For the espfix_adjust_stack thing, when can it actually need to do
>> anything? irqs should be off, I think, and MCE, NMI, and debug
>> exceptions use ist, so that leaves just #SS and #GP, I think. How can
>> those actually occur? Is there a way to trigger them deliberately
>> from userspace? Why do you have three espfix_adjust_stack invocations?
>
> Yes, you can very much trigger GP deliberately.
>
> The way to do it is to just make an invalid segment descriptor on the
> iret stack. Or make it a valid 16-bit one, but make it a code segment
> for the stack pointer, or read-only, or whatever. All of which is
> trivial to do with a sigreturn system call. But you can do it other
> ways too - enter with a SS that is valid, but do a load_ldt() system
> call that makes it invalid, so that by the time you exit it is no
> longer valid etc.
>
> There's a reason we mark that "iretq" as taking faults with that
>
> _ASM_EXTABLE(native_iret, bad_iret)
>
> and that "bad_iret" creates a GP fault.
>
> And that's a lot of kernel stack. The whole initial GP fault path,
> which goes to the C code that finds the exception table etc. See
> do_general_protection_fault() and fixup_exception().
My point is that it may be safe to remove the special espfix fixup
from #PF, which is probably the most performance-critical piece here,
aside from iret itself.
--Andy
On Tue, Apr 22, 2014 at 10:00 AM, Andrew Lutomirski <[email protected]> wrote:
>
> My point is that it may be safe to remove the special espfix fixup
> from #PF, which is probably the most performance-critical piece here,
> aside from iret itself.
Actually, even that is unsafe.
Why?
The segment table is shared for a process. So you can have one thread
doing a load_ldt() that invalidates a segment, while another thread is
busy taking a page fault. The segment was valid at page fault time and
is saved on the kernel stack, but by the time the page fault returns,
it is no longer valid and the iretq will fault.
Anyway, if done correctly, this whole espfix should be totally free
for normal processes, since it should only trigger if SS is a LDT
entry (bit #2 set in the segment descriptor). So the normal fast-path
should just have a simple test for that.
And if you have a SS that is a descriptor in the LDT, nobody cares
about performance any more.
Linus
On 04/22/2014 10:00 AM, Andrew Lutomirski wrote:
>>
>> Yes, you can very much trigger GP deliberately.
>>
>> The way to do it is to just make an invalid segment descriptor on the
>> iret stack. Or make it a valid 16-bit one, but make it a code segment
>> for the stack pointer, or read-only, or whatever. All of which is
>> trivial to do with a sigreturn system call. But you can do it other
>> ways too - enter with a SS that is valid, but do a load_ldt() system
>> call that makes it invalid, so that by the time you exit it is no
>> longer valid etc.
>>
>> There's a reason we mark that "iretq" as taking faults with that
>>
>> _ASM_EXTABLE(native_iret, bad_iret)
>>
>> and that "bad_iret" creates a GP fault.
>>
>> And that's a lot of kernel stack. The whole initial GP fault path,
>> which goes to the C code that finds the exception table etc. See
>> do_general_protection_fault() and fixup_exception().
>
> My point is that it may be safe to remove the special espfix fixup
> from #PF, which is probably the most performance-critical piece here,
> aside from iret itself.
>
It *might* even be plausible to do full manual sanitization, so that the
IRET cannot fault, but I have to admit to that being somewhat daunting,
especially given the thread/process distinction. I wasn't actually sure
about the status of the LDT on the thread vs process scale (the GDT is
per-CPU, but has some entries that are context-switched per *thread*,
but I hadn't looked at the LDT recently.)
As for Andy's questions:
> What happens on the IST entries? If I've read your patch right,
> you're still switching back to the normal stack, which looks
> questionable.
No, in that case %rsp won't point into the espfix region, and the switch
will be bypassed. We will resume back into the espfix region on IRET,
which is actually required e.g. if we take an NMI in the middle of the
espfix setup.
> Also, if you want to save some register abuse on each exception entry,
> could you check the saved RIP instead of the current RSP? I.e. use
> the test instruction with offset(%rsp)? Maybe there are multiple
> possible values, though, and just testing some bits doesn't help.
I don't see how that would work.
-hpa
On Tue, Apr 22, 2014 at 10:04 AM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 22, 2014 at 10:00 AM, Andrew Lutomirski <[email protected]> wrote:
>>
>> My point is that it may be safe to remove the special espfix fixup
>> from #PF, which is probably the most performance-critical piece here,
>> aside from iret itself.
>
> Actually, even that is unsafe.
>
> Why?
>
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.
Let me try that again: I think it should be safe to remove the check
for "did we fault from the espfix stack" from the #PF entry. You can
certainly have all kinds of weird things happen on return from #PF,
but the overhead that I'm talking about is a test on exception *entry*
to see whether the fault happened on the espfix stack so that we can
switch back to running on a real stack.
If the espfix code and the iret at the end can't cause #PF, then the
check in #PF entry can be removed, I think.
>
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is a LDT
> entry (bit #2 set in the segment descriptor). So the normal fast-path
> should just have a simple test for that.
How? Doesn't something still need to check whether SS is funny before
doing iret?
--Andy
On 04/22/2014 10:04 AM, Linus Torvalds wrote:
>
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.
>
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is a LDT
> entry (bit #2 set in the segment descriptor). So the normal fast-path
> should just have a simple test for that.
>
> And if you have a SS that is a descriptor in the LDT, nobody cares
> about performance any more.
>
The fastpath interference is:
1. Testing for an LDT SS selector before IRET. This is actually easier
than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.
2. Testing for an RSP inside the espfix region on exception entry, so we
can switch back the stack. This has to be done very early in the
exception entry since the espfix stack is so small. If NMI and #MC
didn't use IST it wouldn't work at all (but neither would SYSCALL).
#2 currently saves/restores %rdi, which is also saved further down.
This is obviously wasteful.
-hpa
On 04/22/2014 10:11 AM, Andrew Lutomirski wrote:
>>
>> Anyway, if done correctly, this whole espfix should be totally free
>> for normal processes, since it should only trigger if SS is a LDT
>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>> should just have a simple test for that.
>
> How? Doesn't something still need to check whether SS is funny before
> doing iret?
>
Ideally the tests should be doable such that on a normal machine the
tests can be overlapped with the other things we have to do on that
path. The exit branch will be strongly predicted in the negative
direction, so it shouldn't be a significant problem.
Again, this is not the case in the current prototype.
-hpa
On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski <[email protected]> wrote:
>
>>
>> Anyway, if done correctly, this whole espfix should be totally free
>> for normal processes, since it should only trigger if SS is a LDT
>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>> should just have a simple test for that.
>
> How? Doesn't something still need to check whether SS is funny before
> doing iret?
Just test bit #2. Don't do anything else if it's clear, because you
should be done. You don't need to do anything special if it's clear,
because I don't *think* we have any 16-bit data segments in the GDT on
x86-64.
Linus
On Tue, Apr 22, 2014 at 10:09 AM, H. Peter Anvin <[email protected]> wrote:
>
> As for Andy's questions:
>
>> What happens on the IST entries? If I've read your patch right,
>> you're still switching back to the normal stack, which looks
>> questionable.
>
> No, in that case %rsp won't point into the espfix region, and the switch
> will be bypassed. We will resume back into the espfix region on IRET,
> which is actually required e.g. if we take an NMI in the middle of the
> espfix setup.
Aha. I misread that. Would it be worth adding a comment along the lines of
/* Check whether we are running on the espfix stack. This is
different from checking whether we faulted from the espfix stack,
since an ist exception will have switched us off of the espfix stack.
*/
>
>> Also, if you want to save some register abuse on each exception entry,
>> could you check the saved RIP instead of the current RSP? I.e. use
>> the test instruction with offset(%rsp)? Maybe there are multiple
>> possible values, though, and just testing some bits doesn't help.
>
> I don't see how that would work.
It won't, given the above. I misunderstood what you were checking.
It still seems to me that only #GP needs this special handling. The
IST entries should never run on the espfix stack, and #MC, #DB, NMI,
and #SS (I missed that one earlier) use IST.
Would it ever make sense to make #GP use IST as well? That might
allow espfix_adjust_stack to be removed entirely. I don't know how
much other fiddling would be needed to make that work.
--Andy
On 04/22/2014 10:20 AM, Andrew Lutomirski wrote:
>
> It won't, given the above. I misunderstood what you were checking.
>
> It still seems to me that only #GP needs this special handling. The
> IST entries should never run on the espfix stack, and #MC, #DB, #NM,
> and #SS (I missed that one earlier) use IST.
>
> Would it ever make sense to make #GP use IST as well? That might
> allow espfix_adjust_stack to be removed entirely. I don't know how
> much other fiddling would be needed to make that work.
>
Interesting thought. It might even be able to share an IST entry with #SS.
-hpa
On Tue, Apr 22, 2014 at 10:11:27AM -0700, H. Peter Anvin wrote:
> The fastpath interference is:
>
> 1. Testing for an LDT SS selector before IRET. This is actually easier
> than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.
>
> 2. Testing for an RSP inside the espfix region on exception entry, so we
> can switch back the stack. This has to be done very early in the
> exception entry since the espfix stack is so small. If NMI and #MC
> didn't use IST it wouldn't work at all (but neither would SYSCALL).
>
> #2 currently saves/restores %rdi, which is also saved further down.
> This is obviously wasteful.
Btw, can we runtime-patch the fastpath interference chunk the moment we
see a 16-bit segment? I.e., connect it to the write_ldt() path, in the
hunk you've removed there, and enable the espfix checks the
moment we load a 16-bit segment.
I know, I know, this is not so important right now but let me put it out
there just the same.
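A rough, untested sketch of that idea using the existing static-key
machinery instead of hand-rolled patching (espfix_key, espfix_arm and
espfix_needed are made-up names here, and the real exit-path test would
of course have to live in the entry asm):

#include <linux/jump_label.h>

static struct static_key espfix_key = STATIC_KEY_INIT_FALSE;

/* Would be called from write_ldt() when a !seg_32bit descriptor goes in. */
static void espfix_arm(void)
{
	if (!static_key_enabled(&espfix_key))	/* a raced double inc is harmless */
		static_key_slow_inc(&espfix_key);	/* patches the branch once */
}

/* C-level stand-in for the exit-path test. */
static inline bool espfix_needed(void)
{
	return static_key_false(&espfix_key);	/* nop until the key is armed */
}

(Once armed it would simply stay armed, which seems fine given how rare
16-bit LDT users are.)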
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Tue, Apr 22, 2014 at 10:26 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Apr 22, 2014 at 10:11:27AM -0700, H. Peter Anvin wrote:
>> The fastpath interference is:
>>
>> 1. Testing for an LDT SS selector before IRET. This is actually easier
>> than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.
>>
>> 2. Testing for an RSP inside the espfix region on exception entry, so we
>> can switch back the stack. This has to be done very early in the
>> exception entry since the espfix stack is so small. If NMI and #MC
>> didn't use IST it wouldn't work at all (but neither would SYSCALL).
>>
>> #2 currently saves/restores %rdi, which is also saved further down.
>> This is obviously wasteful.
>
> Btw, can we runtime-patch the fastpath interference chunk the moment we
> see a 16-bit segment? I.e., connect it to the write_ldt() path, i.e. in
> the hunk you've removed in there and enable the espfix checks there the
> moment we load a 16-bit segment.
>
> I know, I know, this is not so important right now but let me put it out
> there just the same.
Or we could add a TIF_NEEDS_ESPFIX that gets set once you have a
16-bit LDT entry.
But I think it makes sense to nail down everything else first. I
suspect that a single test-and-branch in the iret path will be lost in
the noise from iret itself.
--Andy
On 04/22/2014 10:19 AM, Linus Torvalds wrote:
> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski <[email protected]> wrote:
>>
>>>
>>> Anyway, if done correctly, this whole espfix should be totally free
>>> for normal processes, since it should only trigger if SS is a LDT
>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>> should just have a simple test for that.
>>
>> How? Doesn't something still need to check whether SS is funny before
>> doing iret?
>
> Just test bit #2. Don't do anything else if it's clear, because you
> should be done. You don't need to do anything special if it's clear,
> because I don't *think* we have any 16-bit data segments in the GDT on
> x86-64.
>
And we don't (neither do we on i386, and we depend on that invariance.)
Hence:
irq_return:
+ /*
+ * Are we returning to the LDT? Note: in 64-bit mode
+ * SS:RSP on the exception stack is always valid.
+ */
+ testb $4,(SS-RIP)(%rsp)
+ jnz irq_return_ldt
+
+irq_return_iret:
INTERRUPT_RETURN
- _ASM_EXTABLE(irq_return, bad_iret)
+ _ASM_EXTABLE(irq_return_iret, bad_iret)
That is the whole impact of the IRET path.
If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
make sure there is absolutely no way we could end up nested) then the
rest of the fixup code can go away and we kill the common path
exception-handling overhead; the only new overhead is the IST
indirection for #GP, which isn't a performance-critical fault (good
thing, because untangling #GP faults is a major effort.)
-hpa
On Tue, Apr 22, 2014 at 10:29 AM, H. Peter Anvin <[email protected]> wrote:
> On 04/22/2014 10:19 AM, Linus Torvalds wrote:
>> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski <[email protected]> wrote:
>>>
>>>>
>>>> Anyway, if done correctly, this whole espfix should be totally free
>>>> for normal processes, since it should only trigger if SS is a LDT
>>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>>> should just have a simple test for that.
>>>
>>> How? Doesn't something still need to check whether SS is funny before
>>> doing iret?
>>
>> Just test bit #2. Don't do anything else if it's clear, because you
>> should be done. You don't need to do anything special if it's clear,
>> because I don't *think* we have any 16-bit data segments in the GDT on
>> x86-64.
>>
>
> And we don't (neither do we on i386, and we depend on that invariance.)
>
> Hence:
>
> irq_return:
> + /*
> + * Are we returning to the LDT? Note: in 64-bit mode
> + * SS:RSP on the exception stack is always valid.
> + */
> + testb $4,(SS-RIP)(%rsp)
> + jnz irq_return_ldt
> +
> +irq_return_iret:
> INTERRUPT_RETURN
> - _ASM_EXTABLE(irq_return, bad_iret)
> + _ASM_EXTABLE(irq_return_iret, bad_iret)
>
>
> That is the whole impact of the IRET path.
>
> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
> make sure there is absolutely no way we could end up nested) then the
> rest of the fixup code can go away and we kill the common path
> exception-handling overhead; the only new overhead is the IST
> indirection for #GP, which isn't a performance-critical fault (good
> thing, because untangling #GP faults is a major effort.)
I'd be a bit nervous about read_msr_safe and friends. Also, what
happens if userspace triggers a #GP and the signal stack setup causes
a page fault?
--Andy
On 04/22/2014 10:46 AM, Andrew Lutomirski wrote:
>>
>> That is the whole impact of the IRET path.
>>
>> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
>> make sure there is absolutely no way we could end up nested) then the
>> rest of the fixup code can go away and we kill the common path
>> exception-handling overhead; the only new overhead is the IST
>> indirection for #GP, which isn't a performance-critical fault (good
>> thing, because untangling #GP faults is a major effort.)
>
> I'd be a bit nervous about read_msr_safe and friends. Also, what
> happens if userspace triggers a #GP and the signal stack setup causes
> a page fault?
>
Yes, #GPs inside the kernel could be a real problem. MSRs generally
don't trigger #SS. The second scenario shouldn't be a problem; the #PF
will be delivered on the currently active stack.
On the other hand, doing the espfix fixup only for #GP might be an
alternative, as long as we can convince ourselves that it really is the
only fault that could possibly be delivered on the espfix path.
-hpa
On Tue, Apr 22, 2014 at 1:46 PM, Andrew Lutomirski <[email protected]> wrote:
> On Tue, Apr 22, 2014 at 10:29 AM, H. Peter Anvin <[email protected]> wrote:
>> On 04/22/2014 10:19 AM, Linus Torvalds wrote:
>>> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski <[email protected]> wrote:
>>>>
>>>>>
>>>>> Anyway, if done correctly, this whole espfix should be totally free
>>>>> for normal processes, since it should only trigger if SS is a LDT
>>>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>>>> should just have a simple test for that.
>>>>
>>>> How? Doesn't something still need to check whether SS is funny before
>>>> doing iret?
>>>
>>> Just test bit #2. Don't do anything else if it's clear, because you
>>> should be done. You don't need to do anything special if it's clear,
>>> because I don't *think* we have any 16-bit data segments in the GDT on
>>> x86-64.
>>>
>>
>> And we don't (neither do we on i386, and we depend on that invariance.)
>>
>> Hence:
>>
>> irq_return:
>> + /*
>> + * Are we returning to the LDT? Note: in 64-bit mode
>> + * SS:RSP on the exception stack is always valid.
>> + */
>> + testb $4,(SS-RIP)(%rsp)
>> + jnz irq_return_ldt
>> +
>> +irq_return_iret:
>> INTERRUPT_RETURN
>> - _ASM_EXTABLE(irq_return, bad_iret)
>> + _ASM_EXTABLE(irq_return_iret, bad_iret)
>>
>>
>> That is the whole impact of the IRET path.
>>
>> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
>> make sure there is absolutely no way we could end up nested) then the
>> rest of the fixup code can go away and we kill the common path
>> exception-handling overhead; the only new overhead is the IST
>> indirection for #GP, which isn't a performance-critical fault (good
>> thing, because untangling #GP faults is a major effort.)
>
> I'd be a bit nervous about read_msr_safe and friends. Also, what
> happens if userspace triggers a #GP and the signal stack setup causes
> a page fault?
>
> --Andy
Maybe make the #GP handler check what the previous stack was at the start:
1) If we came from userspace, switch to the top of the process stack.
2) If the previous stack was not the espfix stack, switch back to that stack.
3) Switch to the top of the process stack (espfix case)
This leaves the IST available for any recursive faults.
On 04/22/2014 11:03 AM, Brian Gerst wrote:
>
> Maybe make the #GP handler check what the previous stack was at the start:
> 1) If we came from userspace, switch to the top of the process stack.
> 2) If the previous stack was not the espfix stack, switch back to that stack.
> 3) Switch to the top of the process stack (espfix case)
>
> This leaves the IST available for any recursive faults.
>
Do you actually know what the IST is? If so, you should realize the
above is nonsense.
The *hardware* switches stack on an exception; if the vector is set up
as an IST, then we *always* switch to the IST stack, unconditionally.
If the vector is not, then we switch to the process stack if we came
from userspace.
That is the entry condition that we have to deal with. The fact that
the switch to the IST is unconditional is what makes ISTs hard to deal with.
-hpa
On Tue, Apr 22, 2014 at 2:06 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/22/2014 11:03 AM, Brian Gerst wrote:
>>
>> Maybe make the #GP handler check what the previous stack was at the start:
>> 1) If we came from userspace, switch to the top of the process stack.
>> 2) If the previous stack was not the espfix stack, switch back to that stack.
>> 3) Switch to the top of the process stack (espfix case)
>>
>> This leaves the IST available for any recursive faults.
>>
>
> Do you actually know what the IST is? If so, you should realize the
> above is nonsense.
>
> The *hardware* switches stack on an exception; if the vector is set up
> as an IST, then we *always* switch to the IST stack, unconditionally.
> If the vector is not, then we switch to the process stack if we came
> from userspace.
>
> That is the entry condition that we have to deal with. The fact that
> the switch to the IST is unconditional is what makes ISTs hard to deal with.
Right, that is why you switch away from the IST as soon as possible,
copying the data that is already pushed there to another stack so it
won't be overwritten by a recursive fault.
On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>
>> That is the entry condition that we have to deal with. The fact that
>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
>
> Right, that is why you switch away from the IST as soon as possible,
> copying the data that is already pushed there to another stack so it
> won't be overwritten by a recursive fault.
>
That simply will not work if you can take a #GP due to the "safe" MSR
functions from NMI and #MC context, which would be my main concern.
-hpa
On Tue, Apr 22, 2014 at 10:29:45AM -0700, Andrew Lutomirski wrote:
> Or we could add a TIF_NEEDS_ESPFIX that gets set once you have a
> 16-bit LDT entry.
Or something like that, yep.
> But I think it makes sense to nail down everything else first. I
> suspect that a single test-and-branch in the iret path will be lost in
> the noise from iret itself.
The cumulative effects of such additions here and there are nasty
though. If we can make the general path free relatively painlessly, we
should do it, IMO.
But yeah, later.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>>
>>> That is the entry condition that we have to deal with. The fact that
>>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
>>
>> Right, that is why you switch away from the IST as soon as possible,
>> copying the data that is already pushed there to another stack so it
>> won't be overwritten by a recursive fault.
>>
>
> That simply will not work if you can take a #GP due to the "safe" MSR
> functions from NMI and #MC context, which would be my main concern.
In that case (#2 above), you would switch to the previous %rsp (in the
NMI/MC stack), copy the exception frame from the IST, and continue
with the #GP handler. That effectively is the same as it is today,
where no stack switch occurs on the #GP fault.
On 04/22/2014 12:55 PM, Brian Gerst wrote:
> On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin <[email protected]> wrote:
>> On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>>>
>>>> That is the entry condition that we have to deal with. The fact that
>>>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
>>>
>>> Right, that is why you switch away from the IST as soon as possible,
>>> copying the data that is already pushed there to another stack so it
>>> won't be overwritten by a recursive fault.
>>>
>>
>> That simply will not work if you can take a #GP due to the "safe" MSR
>> functions from NMI and #MC context, which would be my main concern.
>
> In that case (#2 above), you would switch to the previous %rsp (in the
> NMI/MC stack), copy the exception frame from the IST, and continue
> with the #GP handler. That effectively is the same as it is today,
> where no stack switch occurs on the #GP fault.
>
1. You take #GP. This causes an IST stack switch.
2. You immediately thereafter take an NMI. This switches stacks again.
3. Now you take another #GP. This causes another IST stack switch, and now you
have clobbered your return information, and cannot resume.
-hpa
On Tue, Apr 22, 2014 at 4:17 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/22/2014 12:55 PM, Brian Gerst wrote:
>> On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin <[email protected]> wrote:
>>> On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>>>>
>>>>> That is the entry condition that we have to deal with. The fact that
>>>>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
>>>>
>>>> Right, that is why you switch away from the IST as soon as possible,
>>>> copying the data that is already pushed there to another stack so it
>>>> won't be overwritten by a recursive fault.
>>>>
>>>
>>> That simply will not work if you can take a #GP due to the "safe" MSR
>>> functions from NMI and #MC context, which would be my main concern.
>>
>> In that case (#2 above), you would switch to the previous %rsp (in the
>> NMI/MC stack), copy the exception frame from the IST, and continue
>> with the #GP handler. That effectively is the same as it is today,
>> where no stack switch occurs on the #GP fault.
>>
>
> 1. You take #GP. This causes an IST stack switch.
> 2. You immediately thereafter take an NMI. This switches stacks again.
> 3. Now you take another #GP. This causes another IST stack switch, and now you
> have clobbered your return information, and cannot resume.
You are right. That will not work.
> That simply will not work if you can take a #GP due to the "safe" MSR
> functions from NMI and #MC context, which would be my main concern.
At some point the IST entry functions subtracted 1K while the
handler ran to handle simple nesting cases.
Not sure that code is still there.
-Andi
On 04/22/2014 04:39 PM, Andi Kleen wrote:
>> That simply will not work if you can take a #GP due to the "safe" MSR
>> functions from NMI and #MC context, which would be my main concern.
>
> At some point the IST entry functions subtracted 1K while the
> handler ran to handle simple nesting cases.
>
> Not sure that code is still there.
Doesn't help if you take an NMI on the first instruction of the #GP handler.
-hpa
Another spin of the prototype. This one avoids the espfix for anything
but #GP, and avoids saving/restoring registers... one can wonder,
though, how much that actually matters in practice.
It still does redundant SWAPGS on the slow path. I'm not sure I
personally care enough to optimize that, as it means some fairly
significant restructuring of some of the code paths. Some of that
restructuring might actually be beneficial, but still...
-hpa
On Tue, Apr 22, 2014 at 6:17 PM, H. Peter Anvin <[email protected]> wrote:
> Another spin of the prototype. This one avoids the espfix for anything
> but #GP, and avoids saving/restoring registers... one can wonder,
> though, how much that actually matters in practice.
>
> It still does redundant SWAPGS on the slow path. I'm not sure I
> personally care enough to optimize that, as it means some fairly
> significant restructuring of some of the code paths. Some of that
> restructuring might actually be beneficial, but still...
>
What's the to_dmesg thing for?
It looks sane, although I haven't checked the detailed register manipulation.
Users of big systems may complain when every single CPU lines up for
that mutex. Maybe no one cares.
--Andy
On 04/22/2014 06:23 PM, Andrew Lutomirski wrote:
>
> What's the to_dmesg thing for?
>
It's for debugging... the espfix page tables generate so many duplicate
entries that trying to output it via a seqfile runs out of memory. I
suspect we need to do something like skip the espfix range or some other
hack.
> It looks sane, although I haven't checked the detailed register manipulation.
>
> Users of big systems may complain when every single CPU lines up for
> that mutex. Maybe no one cares.
Right now the whole smpboot sequence is fully serialized... that needs
to be fixed.
Konrad - I really could use some help figuring out what needs to be done
for this not to break Xen.
-hpa
On 04/22/2014 10:04 AM, Linus Torvalds wrote:
>
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.
>
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is a LDT
> entry (bit #2 set in the segment descriptor). So the normal fast-path
> should just have a simple test for that.
>
> And if you have a SS that is a descriptor in the LDT, nobody cares
> about performance any more.
>
I just realized that with the LDT being a process-level object (unlike
the GDT), we need to remove the filtering on the espfix hack, both for
32-bit and 64-bit kernels. Otherwise there is a race condition between
executing the LAR instruction in the filter and the IRET, which could
allow the leak to become manifest.
The "good" part is that I think the espfix hack is harmless even with a
32/64-bit stack segment, although it has a substantial performance penalty.
Does anyone have any idea if there is a real use case for non-16-bit LDT
segments used as the stack segment? Does Wine use anything like that?
Very old NPTL Linux binaries use LDT segments, but only for data segments.
-hpa
"H. Peter Anvin" <[email protected]> writes:
> Does anyone have any idea if there is a real use case for non-16-bit
> LDT segments used as the stack segment? Does Wine use anything like
> that?
Wine uses them for DPMI support, though that would only get used when
vm86 mode is available.
--
Alexandre Julliard
[email protected]
> Ideally the tests should be doable such that on a normal machine the
> tests can be overlapped with the other things we have to do on that
> path. The exit branch will be strongly predicted in the negative
> direction, so it shouldn't be a significant problem.
>
> Again, this is not the case in the current prototype.
Or you make sure that you switch to those code paths only after software
has executed syscalls that make it possible that it will use a 16-bit SS.
The other question I have is - is there any reason we can't fix up the
IRET to do a 32bit return into a vsyscall type userspace page which then
does a long jump or retf to the right place ?
Alan
On 04/22/2014 09:42 PM, H. Peter Anvin wrote:
> On 04/22/2014 06:23 PM, Andrew Lutomirski wrote:
>> What's the to_dmesg thing for?
>>
> It's for debugging... the espfix page tables generate so many duplicate
> entries that trying to output it via a seqfile runs out of memory. I
> suspect we need to do something like skip the espfix range or some other
> hack.
>
>> It looks sane, although I haven't checked the detailed register manipulation.
>>
>> Users of big systems may complain when every single CPU lines up for
>> that mutex. Maybe no one cares.
> Right now the whole smpboot sequence is fully serialized... that needs
> to be fixed.
>
> Konrad - I really could use some help figuring out what needs to be done
> for this not to break Xen.
This does break Xen PV:
[ 3.683735] ------------[ cut here ]------------
[ 3.683807] WARNING: CPU: 0 PID: 0 at arch/x86/xen/multicalls.c:129
xen_mc_flush+0x1c8/0x1d0()
[ 3.683903] Modules linked in:
[ 3.684006] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.15.0-rc2 #2
[ 3.684176] 0000000000000009 ffffffff81c01de0 ffffffff816cfb15
0000000000000000
[ 3.684416] ffffffff81c01e18 ffffffff81084abd 0000000000000000
0000000000000001
[ 3.684654] 0000000000000000 ffff88023da0b180 0000000000000010
ffffffff81c01e28
[ 3.684893] Call Trace:
[ 3.684962] [<ffffffff816cfb15>] dump_stack+0x45/0x56
[ 3.685032] [<ffffffff81084abd>] warn_slowpath_common+0x7d/0xa0
[ 3.685102] [<ffffffff81084b9a>] warn_slowpath_null+0x1a/0x20
[ 3.685171] [<ffffffff810050a8>] xen_mc_flush+0x1c8/0x1d0
[ 3.685240] [<ffffffff81008155>] xen_set_pgd+0x1f5/0x220
[ 3.685310] [<ffffffff8101975a>] init_espfix_this_cpu+0x36a/0x380
[ 3.685379] [<ffffffff813cb559>] ? acpi_tb_initialize_facs+0x31/0x33
[ 3.685450] [<ffffffff81d27ec6>] start_kernel+0x37f/0x411
[ 3.685517] [<ffffffff81d27950>] ? repair_env_string+0x5c/0x5c
[ 3.685586] [<ffffffff81d27606>] x86_64_start_reservations+0x2a/0x2c
[ 3.685654] [<ffffffff81d2a6df>] xen_start_kernel+0x594/0x5a0
[ 3.685728] ---[ end trace a2cf2d7b2ecab826 ]---
But then I think we may want to rearrange preempt_enable/disable in
xen_set_pgd().
-boris
On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
>> Ideally the tests should be doable such that on a normal machine the
>> tests can be overlapped with the other things we have to do on that
>> path. The exit branch will be strongly predicted in the negative
>> direction, so it shouldn't be a significant problem.
>>
>> Again, this is not the case in the current prototype.
>
> Or you make sure that you switch to those code paths only after software
> has executed syscalls that make it possible that it will use a 16-bit SS.
>
Which, again, would introduce a race, I believe, at least if we have an
LDT at all (and since we only enter these code paths for LDT descriptors
in the first place, it is equivalent to the current code minus the filters.)
> The other question I have is - is there any reason we can't fix up the
> IRET to do a 32bit return into a vsyscall type userspace page which then
> does a long jump or retf to the right place ?
I did a writeup on this a while ago. It does have the problem that you
need additional memory in userspace, which is per-thread and in the
right region of userspace; this pretty much means you have to muck about
with the user space stack when user space is running in weird modes.
This gets complex very quickly and does have some "footprint".
Furthermore, on some CPUs (not including any recent Intel CPUs) there is
still a way to leak bits [63:32]. I believe the in-kernel solution is
actually simpler.
-hpa
On 04/23/2014 07:24 AM, Boris Ostrovsky wrote:
>>
>> Konrad - I really could use some help figuring out what needs to be done
>> for this not to break Xen.
>
> This does break Xen PV:
>
I know it does. This is why I asked for help.
This is fundamentally the problem with PV and *especially* the way Xen
PV was integrated into Linux: it acts as a development brake for native
hardware. Fortunately, Konrad has been quite responsive to that kind of
problem, which hasn't always been true of the Xen community in the past.
-hpa
On Wed, Apr 23, 2014 at 8:53 AM, H. Peter Anvin <[email protected]> wrote:
> On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
>>> Ideally the tests should be doable such that on a normal machine the
>>> tests can be overlapped with the other things we have to do on that
>>> path. The exit branch will be strongly predicted in the negative
>>> direction, so it shouldn't be a significant problem.
>>>
>>> Again, this is not the case in the current prototype.
>>
>> Or you make sure that you switch to those code paths only after software
>> has executed syscalls that make it possible that it will use a 16-bit SS.
>>
>
> Which, again, would introduce a race, I believe, at least if we have an
> LDT at all (and since we only enter these code paths for LDT descriptors
> in the first place, it is equivalent to the current code minus the filters.)
The only way I can see to trigger the race is with sigreturn, but it's
still there. Sigh.
Here are two semi-related things:
1. The Intel manual's description of iretq seems like it forgot
to mention that iret restores the stack pointer in anything except
vm86 mode. Fortunately, the AMD manual seems to think that, when
returning *from* 64-bit mode, RSP is always restored, which I think is
necessary for this patch to work correctly.
2. I've often pondered changing the way we return *to* CPL 0 to bypass
iret entirely. It could be something like:
SS
RSP
EFLAGS
CS
RIP
push 16($rsp)
popfq [does this need to force rex.w somehow?]
ret $64
This may break backtraces if cfi isn't being used and we get an NMI
just before the popfq. I'm not quite sure how that works.
I haven't benchmarked this at all, but the only slow part should be
the popfq, and I doubt it's anywhere near as slow as iret.
>
>> The other question I have is - is there any reason we can't fix up the
>> IRET to do a 32bit return into a vsyscall type userspace page which then
>> does a long jump or retf to the right place ?
>
> I did a writeup on this a while ago. It does have the problem that you
> need additional memory in userspace, which is per-thread and in the
> right region of userspace; this pretty much means you have to muck about
> with the user space stack when user space is running in weird modes.
> This gets complex very quickly and does have some "footprint".
> Furthermore, on some CPUs (not including any recent Intel CPUs) there is
> still a way to leak bits [63:32]. I believe the in-kernel solution is
> actually simpler.
>
There's also no real guarantee that user code won't unmap the vdso.
--Andy
On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>
> The only way I can see to trigger the race is with sigreturn, but it's
> still there. Sigh.
>
I don't see why sigreturn needs to be involved... all you need is
modify_ldt() on one CPU while the other is in the middle of an IRET
return. Small window, so hard to hit, but still.
> 2. I've often pondered changing the way we return *to* CPL 0 to bypass
> iret entirely. It could be something like:
>
> SS
> RSP
> EFLAGS
> CS
> RIP
>
> push 16($rsp)
> popfq [does this need to force rex.w somehow?]
> ret $64
When you say return to CPL 0 you mean intra-kernel return? That isn't
really the problem here, though. I think this will also break the
kernel debugger since it will have the wrong behavior for TF and RF.
>>> The other question I have is - is there any reason we can't fix up the
>>> IRET to do a 32bit return into a vsyscall type userspace page which then
>>> does a long jump or retf to the right place ?
>>
>> I did a writeup on this a while ago. It does have the problem that you
>> need additional memory in userspace, which is per-thread and in the
>> right region of userspace; this pretty much means you have to muck about
>> with the user space stack when user space is running in weird modes.
>> This gets complex very quickly and does have some "footprint".
>> Furthermore, on some CPUs (not including any recent Intel CPUs) there is
>> still a way to leak bits [63:32]. I believe the in-kernel solution is
>> actually simpler.
>>
>
> There's also no real guarantee that user code won't unmap the vdso.
There is, but there is also at some point a "don't do that, then" aspect
to it all.
-hpa
On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin <[email protected]> wrote:
> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>
>> The only way I can see to trigger the race is with sigreturn, but it's
>> still there. Sigh.
>>
>
> I don't see why sigreturn needs to be involved... all you need is
> modify_ldt() on one CPU while the other is in the middle of an IRET
> return. Small window, so hard to hit, but still.
>
If you set the flag as soon as anyone calls modify_ldt, before any
descriptor is installed, then I don't think this can happen. But
there's still sigreturn, and I don't think this is worth all the
complexity to save a single branch on #GP.
>> 2. I've often pondered changing the way we return *to* CPL 0 to bypass
>> iret entirely. It could be something like:
>>
>> SS
>> RSP
>> EFLAGS
>> CS
>> RIP
>>
>> push 16($rsp)
>> popfq [does this need to force rex.w somehow?]
>> ret $64
>
> When you say return to CPL 0 you mean intra-kernel return? That isn't
> really the problem here, though. I think this will also break the
> kernel debugger since it will have the wrong behavior for TF and RF.
I do mean intra-kernel. And yes, this has nothing to do with espfix,
but it would make write_msr_safe fail more quickly :)
--Andy
On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
> On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin <[email protected]> wrote:
>> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>>
>>> The only way I can see to trigger the race is with sigreturn, but it's
>>> still there. Sigh.
>>
>> I don't see why sigreturn needs to be involved... all you need is
>> modify_ldt() on one CPU while the other is in the middle of an IRET
>> return. Small window, so hard to hit, but still.
>
> If you set the flag as soon as anyone calls modify_ldt, before any
> descriptor is installed, then I don't think this can happen. But
> there's still sigreturn, and I don't think this is worth all the
> complexity to save a single branch on #GP.
>
Who cares? Since we only need to enter the fixup path for LDT
selectors, anything that is dependent on having called modify_ldt() is
already redundant.
In some ways that is the saving grace. SS being an LDT selector is
fortunately a rare case.
> I do mean intra-kernel. And yes, this has nothing to do with espfix,
> but it would make write_msr_safe fail more quickly :)
And, pray tell, how important is that?
-hpa
On Wed, Apr 23, 2014 at 10:28 AM, H. Peter Anvin <[email protected]> wrote:
> On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
>> On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin <[email protected]> wrote:
>>> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>>>
>>>> The only way I can see to trigger the race is with sigreturn, but it's
>>>> still there. Sigh.
>>>
>>> I don't see why sigreturn needs to be involved... all you need is
>>> modify_ldt() on one CPU while the other is in the middle of an IRET
>>> return. Small window, so hard to hit, but still.
>>
>> If you set the flag as soon as anyone calls modify_ldt, before any
>> descriptor is installed, then I don't think this can happen. But
>> there's still sigreturn, and I don't think this is worth all the
>> complexity to save a single branch on #GP.
>>
>
> Who cares? Since we only need to enter the fixup path for LDT
> selectors, anything that is dependent on having called modify_ldt() is
> already redundant.
But you still have to test this, and folding it into the existing
check for thread flags would eliminate that. Still, I think this
would not be worth it, even if it were correct.
>
> In some ways that is the saving grace. SS being an LDT selector is
> fortunately a rare case.
>
>> I do mean intra-kernel. And yes, this has nothing to do with espfix,
>> but it would make write_msr_safe fail more quickly :)
>
> And, pray tell, how important is that?
Not very.
Page faults may be a different story for some workloads, particularly
if they are IO-heavy. Returning to preempted kernel threads may also
matter.
For my particular workload, returns from rescheduling interrupts
delivered to idle cpus probably also matters, but the fact that those
interrupts are happening at all is a bug that tglx is working on.
--Andy
On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin <[email protected]> wrote:
> This is a prototype of espfix for the 64-bit kernel. espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment. We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.
Hi,
A comment: The main purpose of espfix is to prevent attackers from
learning sensitive addresses, right? But as far as I can tell, this
mini-stack becomes itself somewhat sensitive:
- The user can put arbitrary data in registers before returning to the
LDT in order to get it saved at a known address accessible from the
kernel. With SMAP and KASLR this might otherwise be difficult.
- If the iret faults, kernel addresses will get stored there (and not
cleared). If a vulnerability could return data from an arbitrary
specified address to the user, this would be harmful.
I guess with the current KASLR implementation you could get the same
effects via brute force anyway, by filling up and browsing memory,
respectively, but ideally there wouldn't be any virtual addresses
guaranteed not to fault.
- If a vulnerability allowed overwriting data at an arbitrary
specified address, the exception frame could get overwritten at
exactly the right moment between the copy and iret (or right after the
iret to mess up fixup_exception)? You probably know better than I
whether or not caches prevent this from actually being possible.
Just raising the issue.
On Wed, Apr 23, 2014 at 9:13 PM, comex <[email protected]> wrote:
> On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin <[email protected]> wrote:
>> This is a prototype of espfix for the 64-bit kernel. espfix is a
>> workaround for the architectural definition of IRET, which fails to
>> restore bits [31:16] of %esp when returning to a 16-bit stack
>> segment. We have a workaround for the 32-bit kernel, but that
>> implementation doesn't work for 64 bits.
>
> Hi,
>
> A comment: The main purpose of espfix is to prevent attackers from
> learning sensitive addresses, right? But as far as I can tell, this
> mini-stack becomes itself somewhat sensitive:
>
> - The user can put arbitrary data in registers before returning to the
> LDT in order to get it saved at a known address accessible from the
> kernel. With SMAP and KASLR this might otherwise be difficult.
For one thing, this only matters on Haswell. Otherwise the user can
put arbitrary data in userspace.
On Haswell, the HPET fixmap is currently a much simpler vector that
can do much the same thing, as long as you're willing to wait for the
HPET counter to contain some particular value. I have patches that
will fix that as a side effect.
Would it pay to randomize the location of the espfix area? Another
somewhat silly idea is to add some random offset to the CPU number mod
NR_CPUS so that an attacker won't know which ministack is which.
> - If the iret faults, kernel addresses will get stored there (and not
> cleared). If a vulnerability could return data from an arbitrary
> specified address to the user, this would be harmful.
Can this be fixed by clearing the ministack in bad_iret? There will
still be a window in which the kernel address is in there, but it'll
be short.
>
> I guess with the current KASLR implementation you could get the same
> effects via brute force anyway, by filling up and browsing memory,
> respectively, but ideally there wouldn't be any virtual addresses
> guaranteed not to fault.
>
> - If a vulnerability allowed overwriting data at an arbitrary
> specified address, the exception frame could get overwritten at
> exactly the right moment between the copy and iret (or right after the
> iret to mess up fixup_exception)? You probably know better than I
> whether or not caches prevent this from actually being possible.
To attack this, you'd change the saved CS value. I don't think caches
would make a difference.
This particular vector hurts: you can safely keep trying until it works.
This just gave me an evil idea: what if we make the whole espfix area
read-only? This has some weird effects. To switch to the espfix
stack, you have to write to an alias. That's a little strange but
harmless and barely complicates the implementation. If the iret
faults, though, I think the result will be a #DF. This may actually
be a good thing: if the #DF handler detects that the cause was a bad
espfix iret, it could just return directly to bad_iret or send the
signal itself the same way that do_stack_segment does. This could
even be written in C :)
Peter, is this idea completely nuts? The only exceptions that can
happen there are NMI, MCE, #DB, #SS, and #GP. The first four use IST,
so they won't double-fault.
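A very rough, untested sketch of what that #DF-side check could look like
(on_espfix_stack() and espfix_fixup() are made-up helpers, and
irq_return_iret is the label from the entry-code hunk earlier in the
thread -- this is only meant to show the shape of the idea):

#include <linux/types.h>
#include <linux/ptrace.h>
#include <asm/traps.h>

/* made-up helpers, declared only so the sketch hangs together */
extern bool on_espfix_stack(unsigned long sp);
extern void espfix_fixup(struct pt_regs *regs);

dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
{
	extern char irq_return_iret[];

	if (regs->ip == (unsigned long)irq_return_iret &&
	    on_espfix_stack(regs->sp)) {
		/*
		 * The IRET off the espfix stack faulted: redirect to the
		 * normal bad_iret/signal handling instead of treating it
		 * as an unrecoverable double fault.
		 */
		espfix_fixup(regs);
		return;
	}

	/* ... otherwise fall through to the usual fatal #DF handling ... */
}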
--Andy
On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>>
>> - The user can put arbitrary data in registers before returning to the
>> LDT in order to get it saved at a known address accessible from the
>> kernel. With SMAP and KASLR this might otherwise be difficult.
>
> For one thing, this only matters on Haswell. Otherwise the user can
> put arbitrary data in userspace.
>
> On Haswell, the HPET fixmap is currently a much simpler vector that
> can do much the same thing, as long as you're willing to wait for the
> HPET counter to contain some particular value. I have patches that
> will fix that as a side effect.
>
> Would it pay to randomize the location of the espfix area? Another
> somewhat silly idea is to add some random offset to the CPU number mod
> NR_CPUS so that an attacker won't know which ministack is which.
Since we store the espfix stack location explicitly, as long as the
scrambling happens in the initialization code that's fine. However, we
don't want to reduce locality lest we massively blow up the memory
requirements.
We could XOR with a random constant with no penalty at all. Only
problem is that this happens early, so the entropy system is not yet
available. Fine if we have RDRAND, but...
>> - If the iret faults, kernel addresses will get stored there (and not
>> cleared). If a vulnerability could return data from an arbitrary
>> specified address to the user, this would be harmful.
>
> Can this be fixed by clearing the ministack in bad_iret? There will
> still be a window in which the kernel address is in there, but it'll
> be short.
We could, if anyone thinks this is actually beneficial.
I'm trying to dig into some of the deeper semantics of IRET to figure
out another issue (a much bigger potential problem), this would affect
that as well. My current belief is that we don't actually have a
problem here.
>> - If a vulnerability allowed overwriting data at an arbitrary
>> specified address, the exception frame could get overwritten at
>> exactly the right moment between the copy and iret (or right after the
>> iret to mess up fixup_exception)? You probably know better than I
>> whether or not caches prevent this from actually being possible.
>
> To attack this, you'd change the saved CS value. I don't think caches
> would make a difference.
>
> This particular vector hurts: you can safely keep trying until it works.
>
> This just gave me an evil idea: what if we make the whole espfix area
> read-only? This has some weird effects. To switch to the espfix
> stack, you have to write to an alias. That's a little strange but
> harmless and barely complicates the implementation. If the iret
> faults, though, I think the result will be a #DF. This may actually
> be a good thing: if the #DF handler detects that the cause was a bad
> espfix iret, it could just return directly to bad_iret or send the
> signal itself the same way that do_stack_segment does. This could
> even be written in C :)
>
> Peter, is this idea completely nuts? The only exceptions that can
> happen there are NMI, MCE, #DB, #SS, and #GP. The first four use IST,
> so they won't double-fault.
It is completely nuts, but sometimes completely nuts is actually useful.
It is more complexity, to be sure, but it doesn't seem completely out
of the realm of reason, and avoids having to unwind the ministack except
in the normally-fatal #DF handler. #DFs are documented as not
recoverable, but we might be able to do something here.
The only real disadvantage I see is the need for more bookkeeping
metadata. Basically the bitmask in espfix_64.c now needs to turn into
an array, plus we need a second percpu variable. Given that if
CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
-hpa
On Thu, Apr 24, 2014 at 3:24 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>>>
>>> - The user can put arbitrary data in registers before returning to the
>>> LDT in order to get it saved at a known address accessible from the
>>> kernel. With SMAP and KASLR this might otherwise be difficult.
>>
>> For one thing, this only matters on Haswell. Otherwise the user can
>> put arbitrary data in userspace.
>>
>> On Haswell, the HPET fixmap is currently a much simpler vector that
>> can do much the same thing, as long as you're willing to wait for the
>> HPET counter to contain some particular value. I have patches that
>> will fix that as a side effect.
>>
>> Would it pay to randomize the location of the espfix area? Another
>> somewhat silly idea is to add some random offset to the CPU number mod
>> NR_CPUS so that an attacker won't know which ministack is which.
>
> Since we store the espfix stack location explicitly, as long as the
> scrambling happens in the initialization code that's fine. However, we
> don't want to reduce locality lest we massively blow up the memory
> requirements.
I was imagining just randomizing a couple of high bits so the whole
espfix area moves as a unit.
>
> We could XOR with a random constant with no penalty at all. Only
> problem is that this happens early, so the entropy system is not yet
> available. Fine if we have RDRAND, but...
How many people have SMAP and not RDRAND? I think this is a complete
nonissue for non-SMAP systems.
>> Peter, is this idea completely nuts? The only exceptions that can
>> happen there are NMI, MCE, #DB, #SS, and #GP. The first four use IST,
>> so they won't double-fault.
>
> It is completely nuts, but sometimes completely nuts is actually useful.
> It is more complexity, to be sure, but it doesn't seem completely out
> of the realm of reason, and avoids having to unwind the ministack except
> in the normally-fatal #DF handler. #DFs are documented as not
> recoverable, but we might be able to do something here.
>
> The only real disadvantage I see is the need for more bookkeeping
> metadata. Basically the bitmask in espfix_64.c now needs to turn into
> an array, plus we need a second percpu variable. Given that if
> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
Doing something in #DF needs percpu data? What am I missing?
--Andy
On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
>
> I was imagining just randomizing a couple of high bits so the whole
> espfix area moves as a unit.
>
>> We could XOR with a random constant with no penalty at all. Only
>> problem is that this happens early, so the entropy system is not yet
>> available. Fine if we have RDRAND, but...
>
> How many people have SMAP and not RDRAND? I think this is a complete
> nonissue for non-SMAP systems.
>
Most likely none, unless some "clever" virtualizer turns off RDRAND out
of spite.
>>> Peter, is this idea completely nuts? The only exceptions that can
>>> happen there are NMI, MCE, #DB, #SS, and #GP. The first four use IST,
>>> so they won't double-fault.
>>
>> It is completely nuts, but sometimes completely nuts is actually useful.
>> It is more complexity, to be sure, but it doesn't seem completely out
>> of the realm of reason, and avoids having to unwind the ministack except
>> in the normally-fatal #DF handler. #DFs are documented as not
>> recoverable, but we might be able to do something here.
>>
>> The only real disadvantage I see is the need for more bookkeeping
>> metadata. Basically the bitmask in espfix_64.c now needs to turn into
>> an array, plus we need a second percpu variable. Given that if
>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
>
> Doing something in #DF needs percpu data? What am I missing?
You need the second percpu variable in the espfix setup code so you have
both the write address and the target rsp (read address).
-hpa
On Thu, Apr 24, 2014 at 3:37 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
>>
>> I was imagining just randomizing a couple of high bits so the whole
>> espfix area moves as a unit.
>>
>>> We could XOR with a random constant with no penalty at all. Only
>>> problem is that this happens early, so the entropy system is not yet
>>> available. Fine if we have RDRAND, but...
>>
>> How many people have SMAP and not RDRAND? I think this is a complete
>> nonissue for non-SMAP systems.
>>
>
> Most likely none, unless some "clever" virtualizer turns off RDRAND out
> of spite.
>
>>>> Peter, is this idea completely nuts? The only exceptions that can
>>>> happen there are NMI, MCE, #DB, #SS, and #GP. The first four use IST,
>>>> so they won't double-fault.
>>>
>>> It is completely nuts, but sometimes completely nuts is actually useful.
>>> It is more complexity, to be sure, but it doesn't seem completely out
>>> of the realm of reason, and avoids having to unwind the ministack except
>>> in the normally-fatal #DF handler. #DFs are documented as not
>>> recoverable, but we might be able to do something here.
>>>
>>> The only real disadvantage I see is the need for more bookkeeping
>>> metadata. Basically the bitmask in espfix_64.c now needs to turn into
>>> an array, plus we need a second percpu variable. Given that if
>>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
>>
>> Doing something in #DF needs percpu data? What am I missing?
>
> You need the second percpu variable in the espfix setup code so you have
> both the write address and the target rsp (read address).
>
Duh. :)
--Andy
Hi!
> This is a prototype of espfix for the 64-bit kernel. espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment. We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.
Just to understand the consequences -- we leak 16 bits of kernel data
to userspace, right? Because it is %esp, we know that we leak a stack
address, which is not too sensitive, but will make kernel address
randomization less useful...?
> The 64-bit implementation works like this:
>
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables. This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)
16-bit stack segments on a 64-bit machine. Who still uses them? Dosemu?
Wine? Would the solution be to disallow that?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Tue, Apr 22, 2014 at 06:17:21PM -0700, H. Peter Anvin wrote:
> Another spin of the prototype. This one avoids the espfix for anything
> but #GP, and avoids saving/restoring registers... one can wonder,
> though, how much that actually matters in practice.
>
> It still does redundant SWAPGS on the slow path. I'm not sure I
> personally care enough to optimize that, as it means some fairly
> significant restructuring of some of the code paths. Some of that
> restructuring might actually be beneficial, but still...
Sorry about being late to the party.
.. snip..
> diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
> new file mode 100644
> index 000000000000..05567d706f92
> --- /dev/null
> +++ b/arch/x86/kernel/espfix_64.c
> @@ -0,0 +1,136 @@
> +/* ----------------------------------------------------------------------- *
> + *
> + * Copyright 2014 Intel Corporation; author: H. Peter Anvin
> + *
> + * This file is part of the Linux kernel, and is made available under
> + * the terms of the GNU General Public License version 2 or (at your
> + * option) any later version; incorporated herein by reference.
> + *
> + * ----------------------------------------------------------------------- */
> +
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/percpu.h>
> +#include <linux/gfp.h>
> +#include <asm/pgtable.h>
> +
> +#define ESPFIX_STACK_SIZE 64UL
> +#define ESPFIX_STACKS_PER_PAGE (PAGE_SIZE/ESPFIX_STACK_SIZE)
> +
> +#define ESPFIX_MAX_CPUS (ESPFIX_STACKS_PER_PAGE << (PGDIR_SHIFT-PAGE_SHIFT-16))
> +#if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS
> +# error "Need more than one PGD for the ESPFIX hack"
> +#endif
> +
> +#define ESPFIX_BASE_ADDR (-2UL << PGDIR_SHIFT)
> +
> +#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
> +
> +/* This contains the *bottom* address of the espfix stack */
> +DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
> +
> +/* Initialization mutex - should this be a spinlock? */
> +static DEFINE_MUTEX(espfix_init_mutex);
> +
> +/* Page allocation bitmap - each page serves ESPFIX_STACKS_PER_PAGE CPUs */
> +#define ESPFIX_MAX_PAGES DIV_ROUND_UP(CONFIG_NR_CPUS, ESPFIX_STACKS_PER_PAGE)
> +#define ESPFIX_MAP_SIZE DIV_ROUND_UP(ESPFIX_MAX_PAGES, BITS_PER_LONG)
> +static unsigned long espfix_page_alloc_map[ESPFIX_MAP_SIZE];
> +
> +static __page_aligned_bss pud_t espfix_pud_page[PTRS_PER_PUD]
> + __aligned(PAGE_SIZE);
> +
> +/*
> + * This returns the bottom address of the espfix stack for a specific CPU.
> + * The math allows for a non-power-of-two ESPFIX_STACK_SIZE, in which case
> + * we have to account for some amount of padding at the end of each page.
> + */
> +static inline unsigned long espfix_base_addr(unsigned int cpu)
> +{
> + unsigned long page, addr;
> +
> + page = (cpu / ESPFIX_STACKS_PER_PAGE) << PAGE_SHIFT;
> + addr = page + (cpu % ESPFIX_STACKS_PER_PAGE) * ESPFIX_STACK_SIZE;
> + addr = (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
> + addr += ESPFIX_BASE_ADDR;
> + return addr;
> +}
> +
> +#define PTE_STRIDE (65536/PAGE_SIZE)
> +#define ESPFIX_PTE_CLONES (PTRS_PER_PTE/PTE_STRIDE)
> +#define ESPFIX_PMD_CLONES PTRS_PER_PMD
> +#define ESPFIX_PUD_CLONES (65536/(ESPFIX_PTE_CLONES*ESPFIX_PMD_CLONES))
> +
> +void init_espfix_this_cpu(void)
> +{
> + unsigned int cpu, page;
> + unsigned long addr;
> + pgd_t pgd, *pgd_p;
> + pud_t pud, *pud_p;
> + pmd_t pmd, *pmd_p;
> + pte_t pte, *pte_p;
> + int n;
> + void *stack_page;
> + pteval_t ptemask;
> +
> + /* We only have to do this once... */
> + if (likely(this_cpu_read(espfix_stack)))
> + return; /* Already initialized */
> +
> + cpu = smp_processor_id();
> + addr = espfix_base_addr(cpu);
> + page = cpu/ESPFIX_STACKS_PER_PAGE;
> +
> + /* Did another CPU already set this up? */
> + if (likely(test_bit(page, espfix_page_alloc_map)))
> + goto done;
> +
> + mutex_lock(&espfix_init_mutex);
> +
> + /* Did we race on the lock? */
> + if (unlikely(test_bit(page, espfix_page_alloc_map)))
> + goto unlock_done;
> +
> + ptemask = __supported_pte_mask;
> +
> + pgd_p = &init_level4_pgt[pgd_index(addr)];
> + pgd = *pgd_p;
> + if (!pgd_present(pgd)) {
> + /* This can only happen on the BSP */
> + pgd = __pgd(__pa_symbol(espfix_pud_page) |
Any particular reason you are using __pgd
> + (_KERNPG_TABLE & ptemask));
> + set_pgd(pgd_p, pgd);
> + }
> +
> + pud_p = &espfix_pud_page[pud_index(addr)];
> + pud = *pud_p;
> + if (!pud_present(pud)) {
> + pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
> + pud = __pud(__pa(pmd_p) | (_KERNPG_TABLE & ptemask));
__pud
> + for (n = 0; n < ESPFIX_PUD_CLONES; n++)
> + set_pud(&pud_p[n], pud);
> + }
> +
> + pmd_p = pmd_offset(&pud, addr);
> + pmd = *pmd_p;
> + if (!pmd_present(pmd)) {
> + pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
> + pmd = __pmd(__pa(pte_p) | (_KERNPG_TABLE & ptemask));
and __pmd?
> + for (n = 0; n < ESPFIX_PMD_CLONES; n++)
> + set_pmd(&pmd_p[n], pmd);
> + }
> +
> + pte_p = pte_offset_kernel(&pmd, addr);
> + stack_page = (void *)__get_free_page(GFP_KERNEL);
> + pte = __pte(__pa(stack_page) | (__PAGE_KERNEL & ptemask));
and __pte instead of the 'pgd', 'pud', 'pmd' and 'pte' macros?
> + for (n = 0; n < ESPFIX_PTE_CLONES; n++)
> + set_pte(&pte_p[n*PTE_STRIDE], pte);
> +
> + /* Job is done for this CPU and any CPU which shares this page */
> + set_bit(page, espfix_page_alloc_map);
> +
> +unlock_done:
> + mutex_unlock(&espfix_init_mutex);
> +done:
> + this_cpu_write(espfix_stack, addr);
> +}
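As an aside, here is a stand-alone user-space sketch of the
espfix_base_addr() arithmetic above, with the constants copied from the
patch and PGDIR_SHIFT assumed to be 39 (4-level paging), just to make the
resulting layout easier to eyeball:

#include <stdio.h>

#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define PGDIR_SHIFT		39	/* assumption: 4-level paging */
#define ESPFIX_STACK_SIZE	64UL
#define ESPFIX_STACKS_PER_PAGE	(PAGE_SIZE/ESPFIX_STACK_SIZE)
#define ESPFIX_BASE_ADDR	(-2UL << PGDIR_SHIFT)

static unsigned long espfix_base_addr(unsigned int cpu)
{
	unsigned long page, addr;

	page = (cpu / ESPFIX_STACKS_PER_PAGE) << PAGE_SHIFT;
	addr = page + (cpu % ESPFIX_STACKS_PER_PAGE) * ESPFIX_STACK_SIZE;
	/* keep the stack offset in the low 16 bits, push the page index up */
	addr = (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
	return addr + ESPFIX_BASE_ADDR;
}

int main(void)
{
	unsigned int cpus[] = { 0, 1, 63, 64, 8191 };
	unsigned int i;

	for (i = 0; i < sizeof(cpus)/sizeof(cpus[0]); i++)
		printf("cpu %4u -> espfix stack bottom %#018lx\n",
		       cpus[i], espfix_base_addr(cpus[i]));
	return 0;
}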
On 04/25/2014 02:02 PM, Konrad Rzeszutek Wilk wrote:
>
> Any particular reason you are using __pgd
>
> __pud
>
> and __pmd?
>
> and __pte instead of the 'pgd', 'pud', 'pmd' and 'pte' macros?
>
Not that I know of other than that the semantics of the various macros
are not described anywhere to the best of my knowledge.
-hpa
On 04/25/2014 05:02 AM, Pavel Machek wrote:
>
> Just to understand the consequences -- we leak 16 bit of kernel data
> to the userspace, right? Because it is %esp, we know that we leak
> stack address, which is not too sensitive, but will make kernel
> address randomization less useful...?
>
It is rather sensitive, in fact.
>> The 64-bit implementation works like this:
>>
>> Set up a ministack for each CPU, which is then mapped 65536 times
>> using the page tables. This implementation uses the second-to-last
>> PGD slot for this; with a 64-byte espfix stack this is sufficient for
>> 2^18 CPUs (currently we support a max of 2^13 CPUs.)
>
> 16-bit stack segments on a 64-bit machine. Who still uses them? Dosemu?
> Wine? Would the solution be to disallow that?
Welcome to the show. We do, in fact disallow it now in the 3.15-rc
series. The Wine guys are complaining.
-hpa
On Wed, Apr 23, 2014 at 09:56:00AM -0700, H. Peter Anvin wrote:
> On 04/23/2014 07:24 AM, Boris Ostrovsky wrote:
> >>
> >> Konrad - I really could use some help figuring out what needs to be done
> >> for this not to break Xen.
> >
> > This does break Xen PV:
> >
>
> I know it does. This is why I asked for help.
This week is chaotic for me but I am taking a stab at it. Should have
something by the end of the week on top of your patch.
>
> This is fundamentally the problem with PV and *especially* the way Xen
> PV was integrated into Linux: it acts as a development brake for native
> hardware. Fortunately, Konrad has been quite responsive to that kind of
> problems, which hasn't always been true of the Xen community in the past.
Thank you for such kind words!
I hope that in Chicago you will have a chance to meet other folks who
are involved in Xen and formulate a similar opinion of them.
Cheers!
>
> -hpa
>
On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>
> This particular vector hurts: you can safely keep trying until it works.
>
> This just gave me an evil idea: what if we make the whole espfix area
> read-only? This has some weird effects. To switch to the espfix
> stack, you have to write to an alias. That's a little strange but
> harmless and barely complicates the implementation. If the iret
> faults, though, I think the result will be a #DF. This may actually
> be a good thing: if the #DF handler detects that the cause was a bad
> espfix iret, it could just return directly to bad_iret or send the
> signal itself the same way that do_stack_segment does. This could
> even be written in C :)
>
> Peter, is this idea completely nuts? The only exceptions that can
> happen there are NMI, MCE, #DB, #SS, and #GP. The first four use IST,
> so they won't double-fault.
>
So I tried writing this bit up, but it fails in some rather spectacular
ways. Furthermore, I have been unable to debug it under Qemu, because
breakpoints don't work right (common Qemu problem, sadly.)
The kernel code is at:
https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
There are two tests:
git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
./run16 test/hello.elf
http://www.zytor.com/~hpa/ldttest.c
The former will exercise the irq_return_ldt path, but not the fault
path; the latter will exercise the fault path, but doesn't actually use
a 16-bit segment.
Under the 3.14 stock kernel, the former should die with SIGBUS and the
latter should pass.
-hpa
On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>
> So I tried writing this bit up, but it fails in some rather spectacular
> ways. Furthermore, I have been unable to debug it under Qemu, because
> breakpoints don't work right (common Qemu problem, sadly.)
>
> The kernel code is at:
>
> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>
> There are two tests:
>
> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
> ./run16 test/hello.elf
> http://www.zytor.com/~hpa/ldttest.c
>
> The former will exercise the irq_return_ldt path, but not the fault
> path; the latter will exercise the fault path, but doesn't actually use
> a 16-bit segment.
>
> Under the 3.14 stock kernel, the former should die with SIGBUS and the
> latter should pass.
>
Current status of the above code: if I remove the randomization in
espfix_64.c then the first test passes; the second generally crashes the
machine. With the randomization there, both generally crash the machine.
All my testing so far has been under KVM or Qemu, so there is always the
possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
something simpler than that.
-hpa
On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin <[email protected]> wrote:
> On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>>
>> So I tried writing this bit up, but it fails in some rather spectacular
>> ways. Furthermore, I have been unable to debug it under Qemu, because
>> breakpoints don't work right (common Qemu problem, sadly.)
>>
>> The kernel code is at:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>>
>> There are two tests:
>>
>> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
>> ./run16 test/hello.elf
>> http://www.zytor.com/~hpa/ldttest.c
>>
>> The former will exercise the irq_return_ldt path, but not the fault
>> path; the latter will exercise the fault path, but doesn't actually use
>> a 16-bit segment.
>>
>> Under the 3.14 stock kernel, the former should die with SIGBUS and the
>> latter should pass.
>>
>
> Current status of the above code: if I remove the randomization in
> espfix_64.c then the first test passes; the second generally crashes the
> machine. With the randomization there, both generally crash the machine.
>
> All my testing so far has been under KVM or Qemu, so there is always the
> possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
> something simpler than that.
I'm compiling your branch. In the mean time, two possibly stupid questions:
What's the assembly code in the double-fault entry for?
Have you tried hbreak in qemu? I've had better luck with hbreak than
regular break in the past.
--Andy
On 04/28/2014 05:02 PM, Andrew Lutomirski wrote:
>
> I'm compiling your branch. In the mean time, two possibly stupid questions:
>
> What's the assembly code in the double-fault entry for?
>
It was easier for me to add it there than to add all the glue (prototypes
and so on) needed to put it into C code... I can convert it to C once it
works.
> Have you tried hbreak in qemu? I've had better luck with hbreak than
> regular break in the past.
Yes, no real change.
-hpa
On Mon, Apr 28, 2014 at 5:02 PM, Andrew Lutomirski <[email protected]> wrote:
> On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin <[email protected]> wrote:
>> On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>>>
>>> So I tried writing this bit up, but it fails in some rather spectacular
>>> ways. Furthermore, I have been unable to debug it under Qemu, because
>>> breakpoints don't work right (common Qemu problem, sadly.)
>>>
>>> The kernel code is at:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>>>
>>> There are two tests:
>>>
>>> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
>>> ./run16 test/hello.elf
>>> http://www.zytor.com/~hpa/ldttest.c
>>>
>>> The former will exercise the irq_return_ldt path, but not the fault
>>> path; the latter will exercise the fault path, but doesn't actually use
>>> a 16-bit segment.
>>>
>>> Under the 3.14 stock kernel, the former should die with SIGBUS and the
>>> latter should pass.
>>>
>>
>> Current status of the above code: if I remove the randomization in
>> espfix_64.c then the first test passes; the second generally crashes the
>> machine. With the randomization there, both generally crash the machine.
>>
>> All my testing so far has been under KVM or Qemu, so there is always the
>> possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
>> something simpler than that.
>
> I'm compiling your branch. In the mean time, two possibly stupid questions:
>
> What's the assembly code in the double-fault entry for?
>
> Have you tried hbreak in qemu? I've had better luck with hbreak than
> regular break in the past.
>
ldttest segfaults on 3.13 and 3.14 for me. It reboots (triple fault?)
on your branch. It even said this:
qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
found during reset
I have no idea what an uncluncked fd is :)
hello.elf fails to die with SIGBUS. Weird. gdb says:
1: x/i $pc
=> 0xffffffff8170559c <irq_return_ldt+90>:
jmp 0xffffffff81705537 <irq_return_iret>
(gdb) si
<signal handler called>
1: x/i $pc
=> 0xffffffff81705537 <irq_return_iret>: iretq
(gdb) si
Cannot access memory at address 0xf0000000f
(gdb) info registers
rax 0xffe4000f00001000 -7881234923384832
rbx 0x1000000010 68719476752
rcx 0xffe4f5580000f000 -7611541041909760
rdx 0x805d000 134598656
rsi 0x102170000ffe3 283772784279523
rdi 0xf00000007 64424509447
rbp 0xf0000000f 0xf0000000f
rsp 0xf0000000f 0xf0000000f
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
r13 0x0 0
r14 0x0 0
r15 0x0 0
rip 0x0 0x0 <irq_stack_union>
eflags 0x0 [ ]
cs 0x0 0
ss 0x37f 895
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
I got this with 'hbreak irq_return_ldt' using 'target remote :1234'
and virtme-run --console --kimg
~/apps/linux-devel/arch/x86/boot/bzImage --qemu-opts -s
This set of registers looks thoroughly bogus. I don't trust it. I'm
now stuck -- single-stepping stays exactly where it started.
Something is rather screwed up here. Telling gdb to continue causes
gdb to explode and 'Hello, Afterworld!' to be displayed.
I was not able to get a breakpoint on __do_double_fault to hit.
FWIW, I think that gdb is known to have issues debugging a guest that
switches bitness.
--Andy
On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
>
> ldttest segfaults on 3.13 and 3.14 for me. It reboots (triple fault?)
> on your branch. It even said this:
>
> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
> found during reset
>
> I have no idea what an uncluncked fd is :)
>
It means 9p wasn't properly shut down.
-hpa
On 04/28/2014 07:38 PM, H. Peter Anvin wrote:
> On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
>>
>> ldttest segfaults on 3.13 and 3.14 for me. It reboots (triple fault?)
>> on your branch. It even said this:
>>
>> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
>> found during reset
>>
>> I have no idea what an uncluncked fd is :)
>>
>
> It means 9p wasn't properly shut down.
>
(A "fid" is like the 9p version of a file descriptor. Sort of.)
-hpa
On 04/28/2014 07:38 PM, H. Peter Anvin wrote:
> On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
>>
>> ldttest segfaults on 3.13 and 3.14 for me. It reboots (triple fault?)
>> on your branch. It even said this:
>>
>> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
>> found during reset
>>
>> I have no idea what an uncluncked fd is :)
>>
>
> It means 9p wasn't properly shut down.
>
OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
it never sets SS to an LDT segment. This means that it should really
have zero footprint versus the espfix code, and implies that we instead
have another bug involved. Why the espfix code should have any effect
whatsoever is a mystery, however... if it indeed does?
I have uploaded a fixed ldttest.c, but it seems we might be chasing more
than that...
-hpa
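For contrast, the missing piece looks roughly like the following from
userspace: install a flat 32-bit data segment in the LDT and load it into
SS, so the kernel's return path actually sees an LDT stack segment. This
is an illustrative standalone sketch, not hpa's ldttest.c; the entry
number, the busy-wait, and the overall layout are arbitrary choices.

/* Sketch: put SS (not just CS) into the LDT from a 64-bit process. */
#include <asm/ldt.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct user_desc desc;
        unsigned short sel;
        volatile unsigned long spin;

        memset(&desc, 0, sizeof(desc));
        desc.entry_number   = 0;        /* LDT slot 0                      */
        desc.base_addr      = 0;        /* flat segment                    */
        desc.limit          = 0xfffff;
        desc.limit_in_pages = 1;
        desc.seg_32bit      = 1;        /* 32-bit, so still permitted on a
                                         * 64-bit kernel                   */
        desc.useable        = 1;
        /* contents = 0, read_exec_only = 0: writable data segment */

        if (syscall(SYS_modify_ldt, 0x11, &desc, sizeof(desc))) {
                perror("modify_ldt");
                return 1;
        }

        /* Selector: index 0, TI = 1 (LDT), RPL = 3. */
        sel = (desc.entry_number << 3) | 4 | 3;
        asm volatile("movw %0, %%ss" : : "r" (sel));

        /*
         * Spin for a while: any hardware interrupt arriving now has to
         * return to us via IRET with SS in the LDT, i.e. through the
         * irq_return_ldt path on an espfix64 kernel.
         */
        for (spin = 0; spin < 500000000UL; spin++)
                ;

        printf("still alive with SS in the LDT (selector %#x)\n", sel);
        return 0;
}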
On 04/28/2014 08:45 PM, H. Peter Anvin wrote:
>
> OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
> it never sets SS to an LDT segment. This means that it should really
> have zero footprint versus the espfix code, and implies that we instead
> have another bug involved. Why the espfix code should have any effect
> whatsoever is a mystery, however... if it indeed does?
>
> I have uploaded a fixed ldttest.c, but it seems we might be chasing more
> than that...
>
In particular, I had already wondered how we avoid an "upside down
swapgs" with a #GP on IRET. The answer might be that we don't...
-hpa
On 04/28/2014 08:45 PM, H. Peter Anvin wrote:
>
> OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
> it never sets SS to an LDT segment. This means that it should really
> have zero footprint versus the espfix code, and implies that we instead
> have another bug involved. Why the espfix code should have any effect
> whatsoever is a mystery, however... if it indeed does?
>
> I have uploaded a fixed ldttest.c, but it seems we might be chasing more
> than that...
>
With the test fixed, the bug was easy to find: we can't compare against
__KERNEL_DS in the doublefault handler, because on this path both the live
SS register and the SS image saved on the stack are zero (the NULL
selector).
With that fixed, both ldttest and run16 pass with the doublefault code, even
with randomization turned back on.
I have pushed out the fix.
There are still things that need fixing: we need to go through the
espfix path even when returning from NMI/MC (which fortunately can't
nest with taking an NMI/MC on the espfix path itself, since in that case
we will have been interrupted while running in the kernel with a kernel
stack.)
(Cc: Rostedt because of the NMI issue.)
-hpa
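Spelled out, the pitfall looks like this in kernel-context C (the helper
names are made up; only the selector facts come from the discussion
above):

#include <linux/types.h>
#include <linux/ptrace.h>
#include <asm/segment.h>

/*
 * On this path the kernel runs with SS = 0 (the NULL selector) and the
 * SS image saved in the exception frame is 0 as well, so this check can
 * never fire:
 */
static bool in_kernel_broken(const struct pt_regs *regs)
{
        return regs->ss == __KERNEL_DS;         /* never true here */
}

/* Keying off CS behaves as intended: */
static bool in_kernel(const struct pt_regs *regs)
{
        return regs->cs == __KERNEL_CS;
}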
On 04/28/2014 09:36 PM, H. Peter Anvin wrote:
>
> There are still things that need fixing: we need to go through the
> espfix path even when returning from NMI/MC (which fortunately can't
> nest with taking an NMI/MC on the espfix path itself, since in that case
> we will have been interrupted while running in the kernel with a kernel
> stack.)
>
> (Cc: Rostedt because of the NMI issue.)
>
NMI is fine: we go through irq_return except for nested NMI. There are
only three IRETs in the kernel (irq_return, nested_nmi_out, and the
early trap handler) and all of them are good.
I think we just need to clean up the PV aspects of this and then we
should be in good shape. I have done a bunch of cleanups to the
development git tree.
I'm considering making 16-bit segment support an EXPERT config option for
both 32- and 64-bit kernels, as it seems like a bit of a waste for
embedded systems that don't need this kind of backward compatibility.
Maybe that is something that can be left for someone else to implement
if they feel like it.
-hpa
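If someone does pick that up, the policy would presumably be queried from
C via IS_ENABLED(). A minimal sketch, where CONFIG_X86_16BIT is a
hypothetical symbol name (only deselectable under EXPERT) and the helper
is invented for illustration:

#include <linux/kconfig.h>
#include <linux/types.h>

/*
 * Hypothetical policy helper: true when the kernel is configured with
 * 16-bit segment support.  Call sites that currently reject 16-bit
 * descriptors or set up espfix unconditionally could be gated on this.
 */
static inline bool x86_16bit_segments_enabled(void)
{
        return IS_ENABLED(CONFIG_X86_16BIT);
}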