fpu__drop() has an explicit fwait which under some conditions can trigger
a fixable FPU exception while in kernel. Thus, we should attempt to fixup
the exception first, and only call notify_die() if the fixup failed just
like in do_general_protection(). The original call sequence incorrectly
triggers KDB entry on debug kernels under particular FPU-intensive
workloads. This issue had been privately observed, fixed, and tested
on 4.9.98, while this patch brings the fix to the upstream.
Signed-off-by: Siarhei Liakh <[email protected]>
---
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a535dd6..68d77a3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -835,16 +835,18 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr)
char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
"simd exception";
- if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
- return;
cond_local_irq_enable(regs);
if (!user_mode(regs)) {
- if (!fixup_exception(regs, trapnr)) {
- task->thread.error_code = error_code;
- task->thread.trap_nr = trapnr;
+ if (fixup_exception(regs, trapnr))
+ return;
+
+ task->thread.error_code = error_code;
+ task->thread.trap_nr = trapnr;
+
+ if (notify_die(DIE_TRAP, str, regs, error_code,
+ trapnr, SIGFPE) != NOTIFY_STOP)
die(str, regs, error_code);
- }
return;
}
On Thu, Jun 14, 2018 at 10:10 PM Siarhei Liakh
<[email protected]> wrote:
>
> fpu__drop() has an explicit fwait which under some conditions can trigger
> a fixable FPU exception while in kernel. Thus, we should attempt to fixup
> the exception first, and only call notify_die() if the fixup failed just
> like in do_general_protection(). The original call sequence incorrectly
> triggers KDB entry on debug kernels under particular FPU-intensive
> workloads. This issue had been privately observed, fixed, and tested
> on 4.9.98, while this patch brings the fix to the upstream.
Reviewed-by: Andy Lutomirski <[email protected]>
With the caveat that you are perpetuating what is arguably a bug in
some of the other entries: math_error() can now be called with IRQs
off and return with IRQs on. If we actually start asserting good
behavior in the entry code, we'll need to fix this.
--Andy
On Mon, 18 Jun 2018, Andy Lutomirski wrote:
> On Thu, Jun 14, 2018 at 10:10 PM Siarhei Liakh
> <[email protected]> wrote:
> >
> > fpu__drop() has an explicit fwait which under some conditions can trigger
> > a fixable FPU exception while in kernel. Thus, we should attempt to fixup
> > the exception first, and only call notify_die() if the fixup failed just
> > like in do_general_protection(). The original call sequence incorrectly
> > triggers KDB entry on debug kernels under particular FPU-intensive
> > workloads. This issue had been privately observed, fixed, and tested
> > on 4.9.98, while this patch brings the fix to the upstream.
>
> Reviewed-by: Andy Lutomirski <[email protected]>
>
> With the caveat that you are perpetuating what is arguably a bug in
> some of the other entries: math_error() can now be called with IRQs
> off and return with IRQs on. If we actually start asserting good
> behavior in the entry code, we'll need to fix this.
Confused. math_error() is still invoked with interrupts off. What's
different now is that notify_die() is called with interrupts conditionally
enabled while upstream it's always called with interrupts disabled.
Thanks,
tglx
> On Jun 18, 2018, at 11:23 PM, Thomas Gleixner <[email protected]> wrote:
>
>> On Mon, 18 Jun 2018, Andy Lutomirski wrote:
>>
>> On Thu, Jun 14, 2018 at 10:10 PM Siarhei Liakh
>> <[email protected]> wrote:
>>>
>>> fpu__drop() has an explicit fwait which under some conditions can trigger
>>> a fixable FPU exception while in kernel. Thus, we should attempt to fixup
>>> the exception first, and only call notify_die() if the fixup failed just
>>> like in do_general_protection(). The original call sequence incorrectly
>>> triggers KDB entry on debug kernels under particular FPU-intensive
>>> workloads. This issue had been privately observed, fixed, and tested
>>> on 4.9.98, while this patch brings the fix to the upstream.
>>
>> Reviewed-by: Andy Lutomirski <[email protected]>
>>
>> With the caveat that you are perpetuating what is arguably a bug in
>> some of the other entries: math_error() can now be called with IRQs
>> off and return with IRQs on. If we actually start asserting good
>> behavior in the entry code, we'll need to fix this.
>
> Confused. math_error() is still invoked with interrupts off. What's
> different now is that notify_die() is called with interrupts conditionally
> enabled while upstream it's always called with interrupts disabled.
True, but I don’t think that matters. What I’m grumbling about is that we can do cond_local_irq_enable() and then return without local_irq_disable().
Anyway, I think the patch is fine as is. We can unsuck the entry IRQ handling another day.
>
> Thanks,
>
> tglx
>
>
>
On Tue, 19 Jun 2018, Andy Lutomirski wrote:?
> On Jun 19, 2018, at 9:15 AM, Siarhei Liakh <[email protected]> wrote:
>
> > On Mon, 18 Jun 2018, Andy Lutomirski wrote:
> >
> > > > On Thu, Jun 14, 2018 at 10:10 PM Siarhei Liakh
> > > > <[email protected]> wrote:
> > > > >
> > > > > fpu__drop() has an explicit fwait which under some conditions can trigger
> > > > > a fixable FPU exception while in kernel. Thus, we should attempt to fixup
> > > > > the exception first, and only call notify_die() if the fixup failed just
> > > > > like in do_general_protection(). The original call sequence incorrectly
> > > > > triggers KDB entry on debug kernels under particular FPU-intensive
> > > > > workloads. This issue had been privately observed, fixed, and tested
> > > > > on 4.9.98, while this patch brings the fix to the upstream.
> > > >
> > > > Reviewed-by: Andy Lutomirski <[email protected]>
> > > >
> > > > With the caveat that you are perpetuating what is arguably a bug in
> > > > some of the other entries: math_error() can now be called with IRQs
> > > > off and return with IRQs on.? If we actually start asserting good
> > > > behavior in the entry code, we'll need to fix this.
> > >
> > > Confused. math_error() is still invoked with interrupts off. What's
> > > different now is that notify_die() is called with interrupts conditionally
> > > enabled while upstream it's always called with interrupts disabled.
> >
> > I see that notify_die() is being called either way in upstream (ex:
> > do_general_protection() and do_iret_error() vs do_bounds() and etc.).
> > Is there some some sort of general policy/guide documentation available
> > which outlines the expectations of notify_die(), as well as its notifiers?
>
> I doubt it.
>
> The right fix is to delete notify_die(), not to document it. kernel debuggers should
> hook die() directly, and other users (if any) should be moved into the error handlers.
Got it. Unfortunately, this looks like a whole separate code refactoring project
which I cannot undertake at this time. In the mean time, this patch offers a fix for
an immediate issue (KDB tripped when it shouldn't) even if it does nothing to
address the deficiencies in the framework itself.
Thank you.
Commit-ID: 3ae6295ccb7cf6d344908209701badbbbb503e40
Gitweb: https://git.kernel.org/tip/3ae6295ccb7cf6d344908209701badbbbb503e40
Author: Siarhei Liakh <[email protected]>
AuthorDate: Thu, 14 Jun 2018 19:36:07 +0000
Committer: Thomas Gleixner <[email protected]>
CommitDate: Wed, 20 Jun 2018 11:44:56 +0200
x86: Call fixup_exception() before notify_die() in math_error()
fpu__drop() has an explicit fwait which under some conditions can trigger a
fixable FPU exception while in kernel. Thus, we should attempt to fixup the
exception first, and only call notify_die() if the fixup failed just like
in do_general_protection(). The original call sequence incorrectly triggers
KDB entry on debug kernels under particular FPU-intensive workloads.
Andy noted, that this makes the whole conditional irq enable thing even
more inconsistent, but fixing that it outside the scope of this.
Signed-off-by: Siarhei Liakh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Borislav Petkov" <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/DM5PR11MB201156F1CAB2592B07C79A03B17D0@DM5PR11MB2011.namprd11.prod.outlook.com
---
arch/x86/kernel/traps.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 03f3d7695dac..162a31d80ad5 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -834,16 +834,18 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr)
char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
"simd exception";
- if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
- return;
cond_local_irq_enable(regs);
if (!user_mode(regs)) {
- if (!fixup_exception(regs, trapnr)) {
- task->thread.error_code = error_code;
- task->thread.trap_nr = trapnr;
+ if (fixup_exception(regs, trapnr))
+ return;
+
+ task->thread.error_code = error_code;
+ task->thread.trap_nr = trapnr;
+
+ if (notify_die(DIE_TRAP, str, regs, error_code,
+ trapnr, SIGFPE) != NOTIFY_STOP)
die(str, regs, error_code);
- }
return;
}