From: Nate Eldredge <[email protected]>
Make math_state_restore() save and restore the interrupt flag, rather
than always disabling interrupts.
If math_state_restore() is called in a task that has not used math, it
needs to allocate some memory (via init_fpu()). Since this can sleep,
it enables interrupts first. Currently, it always disables them
afterwards, regardless of whether or not they were enabled on entry.
(See commit aa283f4927 where this was introduced.) This doesn't make
sense, so instead have it put interrupts back the way they were.
This is the cause of Ubuntu bug #1265841
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1265841): if a
user process dumps core on an eCryptfs filesystem while aesni-intel is loaded,
we get a BUG() in __find_get_block() complaining that it was called
with interrupts disabled; then all further accesses to our eCryptfs mount
hang and we have to reboot. The aesni-intel code (encrypting the core
file that we are writing) needs the FPU and quite properly wraps its
code in kernel_fpu_{begin,end}(), the latter of which calls
math_state_restore(). So after kernel_fpu_end(), interrupts may be
disabled, which nobody seems to expect, and they stay that way until
we eventually get to __find_get_block() which barfs. With this patch,
the testcase works fine and no BUG() is triggered.
math_state_restore() may need further review, as it still seems
suspicious that it can unilaterally enable interrupts for itself. It's
not clear to me what are the intended semantics of
math_state_restore() and kernel_fpu_{begin,end}() with respect to
interrupts. Nevertheless, this patch should be appropriate for now.
Signed-off-by: Nate Eldredge <[email protected]>
Tested-by: George Spelvin <[email protected]>
Cc: <[email protected]>
Fixes: aa283f4927
---
Applies to linux-3.13. Previous discussion in linux-kernel thread
"math_state_restore and kernel_fpu_end disable interrupts?"
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b857ed8..09df67d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -628,6 +628,9 @@ void math_state_restore(void)
struct task_struct *tsk = current;
if (!tsk_used_math(tsk)) {
+ unsigned long flags;
+
+ local_save_flags(flags);
local_irq_enable();
/*
* does a slab alloc which can sleep
@@ -639,7 +642,7 @@ void math_state_restore(void)
do_group_exit(SIGKILL);
return;
}
- local_irq_disable();
+ local_irq_restore(flags);
}
__thread_fpu_begin(tsk);
--
Nate Eldredge
[email protected]
On Thu, Jan 30, 2014 at 2:01 PM, Nate Eldredge
<[email protected]> wrote:
>
> If math_state_restore() is called in a task that has not used math, it
> needs to allocate some memory (via init_fpu()). Since this can sleep,
> it enables interrupts first. Currently, it always disables them
> afterwards, regardless of whether or not they were enabled on entry.
> (See commit aa283f4927 where this was introduced.) This doesn't make
> sense, so instead have it put interrupts back the way they were.
Hmm. What doesn't make sense is to have interrupts disabled in the
caller - since clearly math_state_restore() may actually let
interrupts in.
So while I definitely think your patch fixes a bug, I'm wondering if
we shouldn't look deeper at this issue. The comment above says
* Must be called with kernel preemption disabled (eg with local
* local interrupts as in the case of do_device_not_available).
which *kind* of makes sense, but at the same time only reinforces that
whole "if there are reasons why the irq has to be disabled, we can't
just randomly enable it for allocations".
So I really get the feeling that there are more bugs than just the "it
exits with irqs disabled" issue.
> The aesni-intel code (encrypting the core
> file that we are writing) needs the FPU and quite properly wraps its
> code in kernel_fpu_{begin,end}(), the latter of which calls
> math_state_restore().
Ok, and this is interesting too.
kernel_fpu_{begin|end}() shouldn't mess around with tsk_used_math(), I
feel. If the task hasn't used math, no reason to allocate any
save-space, because kernel_fpu_end() will just throw it all away
anyway.
Plus we want to be able to do those things from interrupts, and:
- we can't have blocking allocations
- we should try to not mess with these kinds of process states from
interrupts anyway
So I think that math_state_restore() and the use_eager_fpu() changes
were actually seriously buggy here wrt that whole !tsk_used_math()
case.
I'm adding in some people here, because I think in the end this bug
was introduced by commit 304bceda6a18 ("x86, fpu: use non-lazy fpu
restore for processors supporting xsave") that introduced that
math_state_restore() in kernel_fpu_end(), but we have other commits
(like 5187b28ff08: "x86: Allow FPU to be used at interrupt time even
with eagerfpu") that seem tangential too and might be part of why it
actually *triggers* now.
Comments?
Linus
hi,
On Thu, Jan 30, 2014 at 2:24 PM, Linus Torvalds
<[email protected]> wrote:
> I'm adding in some people here, because I think in the end this bug
> was introduced by commit 304bceda6a18 ("x86, fpu: use non-lazy fpu
> restore for processors supporting xsave") that introduced that
> math_state_restore() in kernel_fpu_end(), but we have other commits
> (like 5187b28ff08: "x86: Allow FPU to be used at interrupt time even
> with eagerfpu") that seem tangential too and might be part of why it
> actually *triggers* now.
>
> Comments?
I haven't been following the recent changes closely, so before I get a
chance to review the current bug and the relevant commits, I wanted to
add that:
a. delayed dynamic allocation of FPU state area was not a good idea
(from me). Given most of the future cases will be anyway using eager
FPU (because of processor features like xsaveopt etc, applications
implicitly using FPU because of optimizations in commonly used
libraries etc), we should probably go back to allocation of FPU state
area during thread creation for everyone (including non-eager cases).
Memory savings will be small anyways, and the code complexity
introducing subtle bugs like this is not worth it.
b. with the above change, kernel_fpu_begin() will just save any user
live math state and be ready for kernel math operations. And
kernel_fpu_end() will drop the kernel math state and for eager-fpu
case restore the user math state.
We will avoid worrying about any memory allocations in the
math_state_restore() with interrupts disabled etc.
If there are no objections, I will see if I can come up with a quick
patch, or will ask HPA to help fill me in.
thanks,
suresh
On Thu, Jan 30, 2014 at 11:33 PM, Suresh Siddha <[email protected]> wrote:
>
> a. delayed dynamic allocation of FPU state area was not a good idea
> (from me). Given most of the future cases will be anyway using eager
> FPU (because of processor features like xsaveopt etc, applications
> implicitly using FPU because of optimizations in commonly used
> libraries etc), we should probably go back to allocation of FPU state
> area during thread creation for everyone (including non-eager cases).
Yes, I suspect that will help some, and probably fix this particular bug.
That said, regardless of the allocation issue, I do think that it's
stupid for kernel_fpu_{begin,end} to save the math state if
"used_math" was not set. So I do think __kernel_fpu_end() as-is is
buggy and stupid. So I do think we should *either* say
(a) "we don't want to restore at all, because once the kernel starts
using math, it might do so a lot, and saving/restoring is a bad idea":
void __kernel_fpu_end(void)
{
	stts();
}
*or*
(b) make the use_eager_fpu() case check tsk_used_math() (in which
case we had better already have an allocation!)
void __kernel_fpu_end(void)
{
	if (use_eager_fpu()) {
		struct task_struct *me = current;

		if (tsk_used_math(me) && likely(!restore_fpu_checking(me)))
			return;
	}
	stts();
}
Quite frankly, I'd almost lean towards (a). Comments? Does anybody
have any loads where the kernel does a lot of fpu stuff (ie network
encryption using the hw engines or something)? I'd really like to hear
if it makes a difference..
Linus
On 02/01/2014 11:27 AM, Linus Torvalds wrote:
>
> (a) "we don't want to restore at all, because once the kernel starts
> using math, it might do so a lot, and saving/restoring is a bad idea":
>
> void __kernel_fpu_end(void)
> {
> stts();
> }
>
> *or*
>
> Quite frankly, I'd almost lean towards (a). Comments? Does anybody
> have any loads where the kernel does a lot of fpu stuff (ie network
> encryption using the hw engines or something)? I'd really like to hear
> if it makes a difference..
>
This will obviously not protect eageronly features (MPX, LWP, ...) so
this means those features are permanently unavailable to the kernel,
even inside kernel_fpu_begin/end. Now, currently I don't think we have
any plans to use those in the kernel (at least not in a way where
kernel_fpu_begin/end makes sense as bracketing), but it is something
worth keeping in mind.
-hpa
On Sat, Feb 1, 2014 at 11:35 AM, H. Peter Anvin <[email protected]> wrote:
>
> This will obviously not protect eageronly features (MPX, LWP, ...) so this
> means those features are permanently unavailable to the kernel, even inside
> kernel_fpu_begin/end. Now, currently I don't think we have any plans to use
> those in the kernel (at least not in a way where kernel_fpu_begin/end makes
> sense as bracketing), but it is something worth keeping in mind.
Hmm. If there are features that (silently) depend on the FPU state
being on, then the plain "stts" doesn't work at all, because we'd be
returning to user space too (not just kernel space) with TS turned
off.
.. which *does* actually bring up something that might work, and might
be a good idea: remove the "restore math state or set TS" from the
normal kernel paths *entirely*, and move it to the "return to user
space" phase.
We could do that with the whole "task_work" thing (or perhaps just
do_notify_resume(), especially after merging the "don't necessarily
return with iret" patch I sent out earlier), with additionally making
sure that scheduling does the right thing wrt a "currently dirty math
state due to kernel use".
The advantage of that would be that we really could do a *lot* of FP
math very cheaply in the kernel, because we'd pay the overhead of
kernel_fpu_begin/end() just once (well, the "end" part would be just
setting the bit that we now have dirty state, the cost would be in the
return-to-user-space-and-restore-fp-state part).
Comments? That would be much more invasive than just changing
__kernel_fpu_end(), but would bring in possibly quite noticeable
advantages under loads that use the FP/vector resources in the kernel.
Linus
Of course, if we are really doing all eager fpu, we could just leave cr0.ts always clear and not touch it at all...
On February 1, 2014 11:46:15 AM PST, Linus Torvalds <[email protected]> wrote:
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On Sat, Feb 1, 2014 at 12:00 PM, H. Peter Anvin <[email protected]> wrote:
> Of course, if we are really doing all eager fpu, we could just leave cr0.ts always clear and not touch it at all...
That's what the eager fpu code tries to do now (apart from process
initialization, I think). But the problem is that even disregarding
the process init issue, it saves and restores around every kernel FPU
use. What the extra flag (and task_work/do_notify_resume) would do is
to just save/restore *once* per kernel entry.
Now, the *common* situation is certainly that the kernel doesn't do
any FPU work at all, but there are clearly exceptions to that. If
certain wireless encryption modes end up actually using hw
acceleration, we might have a *lot* of overhead from the save/restore
code. In fact, I think it triggers even for the idle task, which
doesn't really care.
Linus
On 02/01/2014 11:46 AM, Linus Torvalds wrote:
>
> .. which *does* actually bring up something that might work, and might
> be a good idea: remove the "restore math state or set TS" from the
> normal kernel paths *entirely*, and move it to the "return to user
> space" phase.
>
The task switch code would have to be aware in case of preemption,
though. The nice thing about *that* is that we could probably
completely drop the notion that we can't preempt an FPU-using kernel task.
-hpa
> .. which *does* actually bring up something that might work, and might
> be a good idea: remove the "restore math state or set TS" from the
> normal kernel paths *entirely*, and move it to the "return to user
> space" phase.
This definitely seems much more sensible. For processors without
optimized context save/restore, the decision whether to restore eagerly
is best made once the kernel has decided what it's returning *to*.
From an interrupt handler, there are now three cases:
1) FPU not in use, available for immediate use.
2) FPU in use by user-space; save to <location> before using.
3) FPU in use by kernel code; may not be used (unless we want
to add nested FPU state save hair).
It would be good to know how often #3 happens on a real workload.
On 02/01/2014 01:17 PM, George Spelvin wrote:
>> .. which *does* actually bring up something that might work, and might
>> be a good idea: remove the "restore math state or set TS" from the
>> normal kernel paths *entirely*, and move it to the "return to user
>> space" phase.
>
> This definitely seems much more sensible. For processors without
> optimized context save/restore, the decision whether to restore eagerly
> is best made once the kernel has decided what it's returning *to*.
>
> From an interrupt handler, there are now three cases:
> 1) FPU not in use, available for immediate use.
> 2) FPU in use by user-space; save to <location> before using.
> 3) FPU in use by kernel code; may not be used (unless we want
> to add nested FPU state save hair).
>
OK, let's circle back for a bit. We have an active bug, and we clearly
have a lot of restructuring that could/should be done. We need to fix
the bug first; if we're going to do a bunch of restructuring then that
ought to be separate. The first bit is how we fix the immediate bug.
-hpa
On Sat, Feb 1, 2014 at 3:40 PM, H. Peter Anvin <[email protected]> wrote:
>
> OK, let's circle back for a bit. We have an active bug, and we clearly
> have a lot of restructuring that could/should be done. We need to fix
> the bug first; if we're going to a bunch of restructuring then that
> ought to be separate. The first bit is how we fix the immediate bug.
So either of my suggested changes to __kernel_fpu_end() _should_ fix
the bug, with the caveat that if we do take the "check tsk-used-math"
version, we had better verify that yes, we allocate the save storage
before we set that flag. One of the reasons I tended to prefer simpler
the "just do stts()" version is that it seemed safer. I am a bit
worried about the whole interaction with synchronous task state and
interrupts.
Suresh's notion of just always allocating the FP state save area at
process creation would fix it too.
So many ways to fix it, so little knowledge about the actual usage
patterns and performance issues..
Linus
On Sat, Feb 1, 2014 at 11:27 AM, Linus Torvalds
<[email protected]> wrote:
> That said, regardless of the allocation issue, I do think that it's
> stupid for kernel_fpu_{begin,end} to save the math state if
> "used_math" was not set. So I do think __kernel_fpu_end() as-is is
> buggy and stupid.
For the eager_fpu case, the assumption was that every task always has
'used_math' set. But I think there is a race: we drop the fpu
explicitly by doing drop_fpu(), and meanwhile we may get an interrupt
etc. that ends up using the fpu?
so I will Ack for option "b", as option "a" breaks the features which
don't take into account cr0.TS.
Meanwhile I have the patch removing the delayed dynamic allocation for
non-eager fpu. will post it after some testing.
thanks,
suresh
> OK, let's circle back for a bit. We have an active bug, and we clearly
> have a lot of restructuring that could/should be done. We need to fix
> the bug first; if we're going to a bunch of restructuring then that
> ought to be separate. The first bit is how we fix the immediate bug.
Well, that's what the [PATCH] that started this thread is for, which
seems to greatly reduce the incidence.
The issue being discussed here is the fact that it's far from clear to
people that the result of applying that patch is actually bug-free.
Given how long it's been in shipping kernels (2.6.26 or 3.10, depending
on what you think causes it to trigger) and how few complaints there
have been, I'm happy to take a while to think about it.
On 02/01/2014 05:19 PM, George Spelvin wrote:
>> OK, let's circle back for a bit. We have an active bug, and we clearly
>> have a lot of restructuring that could/should be done. We need to fix
>> the bug first; if we're going to a bunch of restructuring then that
>> ought to be separate. The first bit is how we fix the immediate bug.
>
> Well, that's what the [PATCH] that started this thread is for, which
> seems to greatly reduce the incidence.
>
> The issue being discussed here is the fact that it's far from clear to
> people that the result of applying that patch is actually bug-free.
>
> Given how long it's been in shipping kernels (2.6.26 or 3.10, depending
> on what you think causes it to trigger) and how few complaints there
> have been, I'm happy to take a while to think about it.
>
Yes, I would agree with that. I guess I should have said "how do we fix
the immediate bug(s) properly."
-hpa
On 02/01/2014 05:06 PM, Suresh Siddha wrote:
>
> so I will Ack for option "b", as option "a" breaks the features which
> don't take into account cr0.TS.
>
Even "b" does that, no? "a" should be fine as long as we don't ever use
those features in the kernel, even under kernel_fpu_begin/end().
> Meanwhile I have the patch removing the delayed dynamic allocation for
> non-eager fpu. will post it after some testing.
Thanks.
-hpa
On Sat, Feb 1, 2014 at 5:26 PM, H. Peter Anvin <[email protected]> wrote:
> Even "b" does that, no?
Oh right. It needs an else: only for the non-eager fpu case should we do stts()
void __kernel_fpu_end(void)
{
	if (use_eager_fpu()) {
		struct task_struct *me = current;

		if (tsk_used_math(me) && likely(!restore_fpu_checking(me)))
			return;
	} else
		stts();
}
thanks,
suresh
> "a" should be fine as long as we don't ever use
> those features in the kernel, even under kernel_fpu_begin/end().
On Sat, Feb 1, 2014 at 5:35 PM, Suresh Siddha <[email protected]> wrote:
> On Sat, Feb 1, 2014 at 5:26 PM, H. Peter Anvin <[email protected]> wrote:
>> Even "b" does that, no?
>
> oh right. It needs an else. only for non-eager fpu case we should do stts()
It definitely does not want an else, I think.
If tsk_used_math() is false, or if the FPU restore failed, we
*definitely* need that stts(). Otherwise we'd return to user mode with
random contents in the FP state, and let user mode muck around with
it.
No?
Linus
What does the inner if clause do? It looks like it returns either way...
On February 1, 2014 5:35:13 PM PST, Suresh Siddha <[email protected]> wrote:
On Sat, Feb 1, 2014 at 5:43 PM, H. Peter Anvin <[email protected]> wrote:
> What does the inner if clause do? It looks like it returns either way...
Suresh broke it with his suggested version.
The inner if-statement is supposed to avoid the stts *if* we had used
math *and* the FPU restore worked.
But with the extra "else" that Suresh added, it now always avoids the
stts for the eager-fpu case, which breaks the whole logic for "hey, if
the process hadn't used math, we don't waste time restoring data that
doesn't exist". And, as you say, making the inner if clause pointless.
Linus
On Sat, Feb 1, 2014 at 5:38 PM, Linus Torvalds
<[email protected]> wrote:
> It definitely does not want an else, I think.
>
> If tsk_used_math() is false, or if the FPU restore failed, we
> *definitely* need that stts(). Otherwise we'd return to user mode with
> random contents in the FP state, and let user mode muck around with
> it.
>
> No?
So if the restore failed, we should do something like drop_init_fpu(),
which will restore init-state to the registers.
For eager-fpu paths we don't use clts()/stts() etc.
thanks,
suresh
On Sat, Feb 1, 2014 at 5:47 PM, Suresh Siddha <[email protected]> wrote:
>
> So if the restore failed, we should do something like drop_init_fpu(),
> which will restore init-state to the registers.
>
> for eager-fpu() paths we don't use clts() stts() etc.
Uhhuh. Ok.
Why do we do that, btw? I think it would make much more sense to just
do what I *thought* we did, and just make it a context-switch-time
optimization ("let's always switch FP state"), not make it a huge
semantic difference.
Linus
On 02/01/2014 05:51 PM, Linus Torvalds wrote:
> On Sat, Feb 1, 2014 at 5:47 PM, Suresh Siddha <[email protected]> wrote:
>>
>> So if the restore failed, we should do something like drop_init_fpu(),
>> which will restore init-state to the registers.
>>
>> for eager-fpu() paths we don't use clts() stts() etc.
>
> Uhhuh. Ok.
>
> Why do we do that, btw? I think it would make much more sense to just
> do what I *thought* we did, and just make it a context-switch-time
> optimization ("let's always switch FP state"), not make it a huge
> semantic difference.
>
Twiddling CR0.TS is pretty slow if we're not taking advantage of it.
-hpa
On Sat, Feb 1, 2014 at 5:51 PM, Linus Torvalds
<[email protected]> wrote:
> On Sat, Feb 1, 2014 at 5:47 PM, Suresh Siddha <[email protected]> wrote:
>>
>> So if the restore failed, we should do something like drop_init_fpu(),
>> which will restore init-state to the registers.
>>
>> for eager-fpu() paths we don't use clts() stts() etc.
>
> Uhhuh. Ok.
>
> Why do we do that, btw? I think it would make much more sense to just
> do what I *thought* we did, and just make it a context-switch-time
> optimization ("let's always switch FP state"), not make it a huge
> semantic difference.
clts/stts is more costly, and not all the state under xsave adheres to
cr0.TS/DNA rules.
did I answer your question?
thanks,
suresh
On Sat, Feb 1, 2014 at 5:57 PM, H. Peter Anvin <[email protected]> wrote:
>
> Twiddling CR0.TS is pretty slow if we're not taking advantage of it.
Immaterial.
We *already* avoid twiddling TS if it's not needed.
It is true that we used to twiddle it at every context switch (and
then twiddle it *again* if we decided that we'd want to pre-load the
FPU state anyway, and avoid the extra fault).
But that was fixed, and if we switch from a task that had math state
to another task that has math state, we leave TS alone.
But Suresh apparently hits on the real issue:
> not all the state under xsave adhers to cr0.TS/DNA rules
which if so is sad but yes, makes CR0.TS no longer sufficient.
Linus
Yes, that is exactly the "eageronly" features - currently LWP and MPX.
On February 1, 2014 6:05:05 PM PST, Linus Torvalds <[email protected]> wrote:
On Sat, 2014-02-01 at 17:06 -0800, Suresh Siddha wrote:
> Meanwhile I have the patch removing the delayed dynamic allocation for
> non-eager fpu. will post it after some testing.
Appended the patch for this. Tested for the last 4-5 hours on my laptop.
The real fix for Nate's problem will be coming from Linus, with a
slightly modified option-b that Linus proposed. Linus, please let me
know if you want me to spin it. I can do it sunday night.
thanks,
suresh
---
From: Suresh Siddha <[email protected]>
Subject: x86, fpu: remove the logic of non-eager fpu mem allocation at the first usage
For non-eager fpu mode, thread's fpu state is allocated during the first
fpu usage (in the context of device not available exception). This can be
a blocking call and hence we enable interrupts (which were originally
disabled when the exception happened), allocate memory and disable
interrupts etc. While this saves 512 bytes or so per-thread, there
are some issues in general.
a. Most of the future cases will be anyway using eager
FPU (because of processor features like xsaveopt, LWP, MPX etc) and
they do the allocation at the thread creation itself. Nice to have
one common mechanism as all the state save/restore code is
shared. Avoids the confusion and minimizes the subtle bugs
in the core piece involved with context-switch.
b. If a parent thread uses FPU, during fork() we allocate
the FPU state in the child and copy the state etc. Shortly after this,
during exec() we free it up, so that we can later allocate during
the first usage of FPU. So this free/allocate might be slower
for some workloads.
c. math_state_restore() is called from multiple places
and it is error prone if the caller expects interrupts to be disabled
throughout the execution of math_state_restore(). Can lead to subtle
bugs like Ubuntu bug #1265841.
Memory savings will be small anyways and the code complexity
introducing subtle bugs is not worth it. So remove
the logic of non-eager fpu mem allocation at the first usage.
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/kernel/i387.c | 14 +++++---------
arch/x86/kernel/process.c | 6 ------
arch/x86/kernel/traps.c | 16 ++--------------
arch/x86/kernel/xsave.c | 2 --
4 files changed, 7 insertions(+), 31 deletions(-)
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index e8368c6..4e5f770 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -5,6 +5,7 @@
* General FPU state handling cleanups
* Gareth Hughes <[email protected]>, May 2000
*/
+#include <linux/bootmem.h>
#include <linux/module.h>
#include <linux/regset.h>
#include <linux/sched.h>
@@ -186,6 +187,10 @@ void fpu_init(void)
if (xstate_size == 0)
init_thread_xstate();
+ if (!current->thread.fpu.state)
+ current->thread.fpu.state =
+ alloc_bootmem_align(xstate_size, __alignof__(struct xsave_struct));
+
mxcsr_feature_mask_init();
xsave_init();
eager_fpu_init();
@@ -219,8 +224,6 @@ EXPORT_SYMBOL_GPL(fpu_finit);
*/
int init_fpu(struct task_struct *tsk)
{
- int ret;
-
if (tsk_used_math(tsk)) {
if (cpu_has_fpu && tsk == current)
unlazy_fpu(tsk);
@@ -228,13 +231,6 @@ int init_fpu(struct task_struct *tsk)
return 0;
}
- /*
- * Memory allocation at the first usage of the FPU and other state.
- */
- ret = fpu_alloc(&tsk->thread.fpu);
- if (ret)
- return ret;
-
fpu_finit(&tsk->thread.fpu);
set_stopped_child_used_math(tsk);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3fb8d95..cd9c190 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -128,12 +128,6 @@ void flush_thread(void)
flush_ptrace_hw_breakpoint(tsk);
memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
drop_init_fpu(tsk);
- /*
- * Free the FPU state for non xsave platforms. They get reallocated
- * lazily at the first use.
- */
- if (!use_eager_fpu())
- free_thread_xstate(tsk);
}
static void hard_disable_TSC(void)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 57409f6..3265429 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -623,20 +623,8 @@ void math_state_restore(void)
{
struct task_struct *tsk = current;
- if (!tsk_used_math(tsk)) {
- local_irq_enable();
- /*
- * does a slab alloc which can sleep
- */
- if (init_fpu(tsk)) {
- /*
- * ran out of memory!
- */
- do_group_exit(SIGKILL);
- return;
- }
- local_irq_disable();
- }
+ if (!tsk_used_math(tsk))
+ init_fpu(tsk);
__thread_fpu_begin(tsk);
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index a4b451c..46f6266 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -598,8 +598,6 @@ void xsave_init(void)
static inline void __init eager_fpu_init_bp(void)
{
- current->thread.fpu.state =
- alloc_bootmem_align(xstate_size, __alignof__(struct xsave_struct));
if (!init_xstate_buf)
setup_init_fpu_buf();
}
On Sat, 1 Feb 2014, Linus Torvalds wrote:
> We could do that with the whole "task_work" thing (or perhaps just
> do_notify_resume(), especially after merging the "don't necessarily
> return with iret" patch I sent out earlier), with additionally making
> sure that scheduling does the right thing wrt a "currently dirty math
> state due to kernel use".
>
> The advantage of that would be that we really could do a *lot* of FP
> math very cheaply in the kernel, because we'd pay the overhead of
> kernel_fpu_begin/end() just once (well, the "end" part would be just
> setting the bit that we now have dirty state, the cost would be in the
> return-to-user-space-and-restore-fp-state part).
>
> Comments? That would be much more invasive than just changing
> __kernel_fpu_end(), but would bring in possibly quite noticeable
> advantages under loads that use the FP/vector resources in the kernel.
>
This would be very good, and it needs to work in interrupt context
(softirq) also, and when we interrupt the idle task. With networking we
can really hit kernel_fpu_begin()/end() millions of times per second, and
there's really only a need to do it once per interrupt. This is actually
similar to what I was doing (in do_softirq()) when I noticed eagerfpu was
broken, and now Nate's bug AFAICS happens there as well.
Pekka
On Sat, Feb 1, 2014 at 11:19 PM, Suresh Siddha <[email protected]> wrote:
>
> The real fix for Nate's problem will be coming from Linus, with a
> slightly modified option-b that Linus proposed. Linus, please let me
> know if you want me to spin it. I can do it sunday night.
Please do it, since clearly I wasn't aware enough about the whole
non-TS-checking FPU state details.
Also, since this issue doesn't seem to be a recent regression, I'm not
going to take this patch directly (even though I'm planning on doing
-rc1 in a few hours), and expect that I'll get it through the normal
channels (presumably together with the __kernel_fpu_end cleanups). Ok
with everybody?
Linus
On Sun, 2014-02-02 at 11:15 -0800, Linus Torvalds wrote:
> On Sat, Feb 1, 2014 at 11:19 PM, Suresh Siddha <[email protected]> wrote:
> >
> > The real fix for Nate's problem will be coming from Linus, with a
> > slightly modified option-b that Linus proposed. Linus, please let me
> > know if you want me to spin it. I can do it sunday night.
>
> Please do it, since clearly I wasn't aware enough about the whole
> non-TS-checking FPU state details.
>
> Also, since this issue doesn't seem to be a recent regression, I'm not
> going to take this patch directly (even though I'm planning on doing
> -rc1 in a few hours), and expect that I'll get it through the normal
> channels (presumably together with the __kernel_fpu_end cleanups). Ok
> with everybody?
Here is the second patch, which should fix the issue reported in this
thread. Maarten, Nate, George, please give this patch a try as is and
see if it helps address the issue you ran into. And please ack/review
with your test results.
The other patch, which cleans up the irq_enable/disable logic in
math_state_restore(), was sent yesterday. You can run your
experiments with both of these patches if you want, but your issue should
get fixed with just the appended patch here.
Peter, please push both of these patches through the normal channels
depending on the results.
thanks,
suresh
---
From: Suresh Siddha <[email protected]>
Subject: x86, fpu: check tsk_used_math() in kernel_fpu_end() for eager fpu
For non-eager fpu mode, a thread's fpu state is allocated during its first
fpu usage (in the context of the device-not-available exception). This
(math_state_restore()) can be a blocking call, and hence we enable
interrupts (which were originally disabled when the exception happened),
allocate memory, disable interrupts again, etc.
But eager-fpu mode calls the same math_state_restore() from
kernel_fpu_end(), the assumption being that tsk_used_math() is always
set for eager-fpu mode, thus avoiding the code path that enables
interrupts, allocates the fpu state with a blocking call, and disables
interrupts again.
But the below issue was noticed by Maarten Baert, Nate Eldredge and
few others:
If a user process dumps core on an ecrypt fs while aesni-intel is loaded,
we get a BUG() in __find_get_block() complaining that it was called with
interrupts disabled; then all further accesses to our ecrypt fs hang
and we have to reboot.
The aesni-intel code (encrypting the core file that we are writing) needs
the FPU and quite properly wraps its code in kernel_fpu_{begin,end}(),
the latter of which calls math_state_restore(). So after kernel_fpu_end(),
interrupts may be disabled, which nobody seems to expect, and they stay
that way until we eventually get to __find_get_block() which barfs.
For eager fpu, most of the time tsk_used_math() is true. In a few
instances, during thread exit, signal return handling etc.,
tsk_used_math() might be false.
In kernel_fpu_end(), for eager-fpu, call math_state_restore()
only if tsk_used_math() is set. Otherwise, don't bother. The kernel code
path which cleared tsk_used_math() knows what needs to be done
with the fpu state.
Reported-by: Maarten Baert <[email protected]>
Reported-by: Nate Eldredge <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Cc: George Spelvin <[email protected]>
---
arch/x86/kernel/i387.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 4e5f770..670bba1 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -87,10 +87,19 @@ EXPORT_SYMBOL(__kernel_fpu_begin);
void __kernel_fpu_end(void)
{
- if (use_eager_fpu())
- math_state_restore();
- else
+ if (use_eager_fpu()) {
+ /*
+ * For eager fpu, most the time, tsk_used_math() is true.
+ * Restore the user math as we are done with the kernel usage.
+ * At few instances during thread exit, signal handling etc,
+ * tsk_used_math() is false. Those few places will take proper
+ * actions, so we don't need to restore the math here.
+ */
+ if (likely(tsk_used_math(current)))
+ math_state_restore();
+ } else {
stts();
+ }
}
EXPORT_SYMBOL(__kernel_fpu_end);
On Sun, Feb 2, 2014 at 10:56 PM, Suresh Siddha <[email protected]> wrote:
>
> The other patch, which cleans up the irq_enable/disable logic in
> math_state_restore(), was sent yesterday. You can run your
> experiments with both of these patches if you want, but your issue should
> get fixed with just the appended patch here.
>
> Peter, please push both of these patches through the normal channels
> depending on the results.
Thinking about it some more, this patch is *almost* not needed at all.
I'm wondering if you should just change the first patch to just always
initialize the fpu when it is allocated, and at execve() time (ie in
flush_thread()).
If we do that, then this:
+ if (!tsk_used_math(tsk))
+ init_fpu(tsk);
can be dropped entirely from math_state_restore(). And quite frankly,
at that point, I think all the changes to __kernel_fpu_end() can go
away, because at that point math_state_restore() really does the right
thing - all the allocations are gone, and all the async task state
games are gone, only the "restore state" remains.
Hmm? So the only thing needed would be to add that "init_fpu()" to the
initial bootmem allocation path and to change flush_thread() (it
currently does "drop_init_fpu()", let's just make it initialize the
FPU state using fpu_finit()), and then we could remove the whole
"used_math" bit entirely, and just say that the FPU is always
initialized.
What do you guys think?
Linus
On Mon, 2014-02-03 at 10:20 -0800, Linus Torvalds wrote:
> Thinking about it some more, this patch is *almost* not needed at all.
>
> I'm wondering if you should just change the first patch to just always
> initialize the fpu when it is allocated, and at execve() time (ie in
> flush_thread()).
>
We already do this for the eager-fpu case, in eager_fpu_init() during boot
and in drop_init_fpu() during flush_thread().
> If we do that, then this:
>
> + if (!tsk_used_math(tsk))
> + init_fpu(tsk);
>
> can be dropped entirely from math_state_restore().
yeah, probably for eager-fpu, but:
> And quite frankly,
> at that point, I think all the changes to __kernel_fpu_end() can go
> away, because at that point math_state_restore() really does the right
> thing - all the allocations are gone, and all the async task state
> games are gone, only the "restore state" remains.
>
> Hmm? So the only thing needed would be to add that "init_fpu()" to the
> initial bootmem allocation path and to change flush_thread() (it
> currently does "drop_init_fpu()", let's just make it initialize the
> FPU state using fpu_finit()), and then we could remove the whole
> "used_math" bit entirely, and just say that the FPU is always
> initialized.
>
> What do you guys think?
No. As I mentioned in the changelog, there is one more path which does
drop_fpu(), and we still depend on this used_math bit for eager-fpu:
in the signal restore path for a 32-bit app, where we copy the sig-context
state from the user stack to the kernel manually (because of legacy
reasons, where the fsave state is followed by the fxsave state etc. in the
32-bit signal handler context, and we have to go through convert_to_fxsr() etc.).
From __restore_xstate_sig():
/*
* Drop the current fpu which clears used_math(). This ensures
* that any context-switch during the copy of the new state,
* avoids the intermediate state from getting restored/saved.
* Thus avoiding the new restored state from getting corrupted.
* We will be ready to restore/save the state only after
* set_used_math() is again set.
*/
drop_fpu(tsk);
thanks,
suresh
On Sun, 2 Feb 2014, Suresh Siddha wrote:
> Here is the second patch, which should fix the issue reported in this
> thread. Maarten, Nate, George, please give this patch a try as is and
> see if it helps address the issue you ran into. And please ack/review
> with your test results.
3.13 plus this patch: boots and fixes the testcase I reported (core dump
on ecrypt).
Tested-by: Nate Eldredge <[email protected]>
> diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
> index 4e5f770..670bba1 100644
> --- a/arch/x86/kernel/i387.c
> +++ b/arch/x86/kernel/i387.c
> @@ -87,10 +87,19 @@ EXPORT_SYMBOL(__kernel_fpu_begin);
>
> void __kernel_fpu_end(void)
> {
> - if (use_eager_fpu())
> - math_state_restore();
> - else
> + if (use_eager_fpu()) {
> + /*
> + * For eager fpu, most the time, tsk_used_math() is true.
> + * Restore the user math as we are done with the kernel usage.
> + * At few instances during thread exit, signal handling etc,
> + * tsk_used_math() is false. Those few places will take proper
> + * actions, so we don't need to restore the math here.
> + */
> + if (likely(tsk_used_math(current)))
> + math_state_restore();
> + } else {
> stts();
> + }
> }
> EXPORT_SYMBOL(__kernel_fpu_end);
--
Nate Eldredge
[email protected]
> 3.13 plus this patch: boots and fixes the testcase I reported (core dump
> on ecrypt).
>
> Tested-by: Nate Eldredge <[email protected]>
It's looking good for me so far, too. I'm slower to reproduce, so I'd
like a couple more days before signing off on it, but no complaints yet.
On 03/02/14 07:56, Suresh Siddha wrote:
> Here is the second patch, which should fix the issue reported in this
> thread. Maarten, Nate, George, please give this patch a try as is and
> see if it helps address the issue you ran into.
I can confirm that your patch fixes the bug I reported (core dump in
ecryptfs on ext4). I tested it with Linux 3.12.9 (which is what Arch
Linux uses at the moment).
Maarten Baert
On 03/02/14 07:56, Suresh Siddha wrote:
> Here is the second patch, which should fix the issue reported in this
> thread. Maarten, Nate, George, please give this patch a try as is and
> see if it helps address the issue you ran into.
Maarten reminded me that I should officially say it's worked for me with
no problems.
I wonder if this is also the cause of the "3.14-rc2 XFS backtrace
because irqs_disabled." problem being discussed on linux-kernel &
linux-xfs right now.
Tested-by: George Spelvin <[email protected]>
So, picking up this thread which got dropped on the floor...
On 02/01/2014 11:19 PM, Suresh Siddha wrote:
>
> diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
> index e8368c6..4e5f770 100644
> --- a/arch/x86/kernel/i387.c
> +++ b/arch/x86/kernel/i387.c
> @@ -5,6 +5,7 @@
> * General FPU state handling cleanups
> * Gareth Hughes <[email protected]>, May 2000
> */
> +#include <linux/bootmem.h>
> #include <linux/module.h>
> #include <linux/regset.h>
> #include <linux/sched.h>
> @@ -186,6 +187,10 @@ void fpu_init(void)
> if (xstate_size == 0)
> init_thread_xstate();
>
> + if (!current->thread.fpu.state)
> + current->thread.fpu.state =
> + alloc_bootmem_align(xstate_size, __alignof__(struct xsave_struct));
> +
> mxcsr_feature_mask_init();
> xsave_init();
> eager_fpu_init();
So this bit is giving me a bit of a headache, specifically
alloc_bootmem_align() is an __init function and fpu_init() obviously isn't.
I am doubly confused because init_thread_xstate() only sets the xstate
without any XSAVE features, so the memory allocation we get there will
be insufficient later -- in fact, only a few lines further down the
function, when xsave_init() is called.
I'm wondering if we could put this somewhere inside
xstate_enable_boot_cpu() instead, maybe?
I'm assuming the reason you didn't want to do it in eager_fpu_init_bp()
anymore is that you want the allocation to happen regardless of whether
eagerfpu is enabled, correct?
-hpa