2021-03-29 16:44:24

by Marco Elver

Subject: Re: I915 CI-run with kfence enabled, issues found

[+Cc x86 maintainers]

On Mon, Mar 29, 2021 at 11:11AM +0000, Sarvela, Tomi P wrote:
> Hello,
>
> I'm Tomi Sarvela, maintainer and original creator of linux i915-CI:
> https://intel-gfx-ci.01.org/
>
> I got a hint from Martin Peres about kfence functionality in the kernel, and it looked
> like something we'd like to enable in future CI runs, so I made a trial run on DRM-Tip.
> We've had regular KASAN-enabled runs, so the expectation was that there
> wouldn't be too many new problems exposed.
>
> On this run two issues were found: one is clearly a kernel (GUC) issue,
> but the other looked a lot like a kfence issue on old platforms. Affected
> were IVB, SNB and ILK, with the bug signature being:
>
> <3> [31.556004] BUG: using smp_processor_id() in preemptible [00000000] code: ...
> <4> [31.556070] caller is invalidate_user_asid+0x13/0x50
>
> I'm not a kernel developer myself, so I can't make hard assertions
> about where the issue originates. In comparison to a kernel without kfence,
> it looks like the newly enabled code is the cause, because the
> "BUG: KFENCE" signature is missing from the trace.
>
> Can someone take a look at the traces and verify if the kfence issue
> exists and is not related to the rest of the kernel?
>
> If there is an issue tracker, I can add this information there.
>
> Example traces:
> https://intel-gfx-ci.01.org/tree/drm-tip/kfence_1/fi-ivb-3770/igt@[email protected]
>
> https://intel-gfx-ci.01.org/tree/drm-tip/kfence_1/fi-snb-2520m/igt@[email protected]
>
> https://intel-gfx-ci.01.org/tree/drm-tip/kfence_1/fi-ilk-650/igt@[email protected]
>
> Kfence-exposed possible GUC issue:
> https://intel-gfx-ci.01.org/tree/drm-tip/kfence_1/fi-kbl-guc/igt@[email protected]
>
> All results can be seen at:
> https://intel-gfx-ci.01.org/tree/drm-tip/kfence_1/index.html
>
> CI_DRM_9910 is a recent DRM-Tip commit without -rc5 pulled in yet.
> kfence_1 is the same commit with kfence defaults turned on:
[...]

It looks like the code path from flush_tlb_one_kernel() to
invalidate_user_asid()'s this_cpu_ptr() has several feature checks, so
probably some feature difference between systems where it triggers and
it doesn't.

As far as I'm aware, there is no restriction on where
flush_tlb_one_kernel() is called. We could of course guard it but I
think that's wrong.

Other than that, I hope the x86 maintainers know what's going on here.

Just for reference, the stack traces in the above logs start with:

| <3> [31.556004] BUG: using smp_processor_id() in preemptible [00000000] code: dmesg/1075
| <4> [31.556070] caller is invalidate_user_asid+0x13/0x50
| <4> [31.556078] CPU: 6 PID: 1075 Comm: dmesg Not tainted 5.12.0-rc4-gda4a2b1a5479-kfence_1+ #1
| <4> [31.556081] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012
| <4> [31.556084] Call Trace:
| <4> [31.556088] dump_stack+0x7f/0xad
| <4> [31.556097] check_preemption_disabled+0xc8/0xd0
| <4> [31.556104] invalidate_user_asid+0x13/0x50
| <4> [31.556109] flush_tlb_one_kernel+0x5/0x20
| <4> [31.556113] kfence_protect+0x56/0x80
| ...........

Thanks,
-- Marco


2021-03-29 17:33:50

by Dave Hansen

Subject: Re: I915 CI-run with kfence enabled, issues found

On 3/29/21 9:40 AM, Marco Elver wrote:
> It looks like the code path from flush_tlb_one_kernel() to
> invalidate_user_asid()'s this_cpu_ptr() has several feature checks, so
> probably some feature difference between systems where it triggers and
> it doesn't.
>
> As far as I'm aware, there is no restriction on where
> flush_tlb_one_kernel() is called. We could of course guard it but I
> think that's wrong.
>
> Other than that, I hope the x86 maintainers know what's going on here.
>
> Just for reference, the stack traces in the above logs start with:
>
> | <3> [31.556004] BUG: using smp_processor_id() in preemptible [00000000] code: dmesg/1075
> | <4> [31.556070] caller is invalidate_user_asid+0x13/0x50
> | <4> [31.556078] CPU: 6 PID: 1075 Comm: dmesg Not tainted 5.12.0-rc4-gda4a2b1a5479-kfence_1+ #1
> | <4> [31.556081] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012
> | <4> [31.556084] Call Trace:
> | <4> [31.556088] dump_stack+0x7f/0xad
> | <4> [31.556097] check_preemption_disabled+0xc8/0xd0
> | <4> [31.556104] invalidate_user_asid+0x13/0x50
> | <4> [31.556109] flush_tlb_one_kernel+0x5/0x20
> | <4> [31.556113] kfence_protect+0x56/0x80
> | ...........

Our naming here isn't great.

But, the "one" in flush_tlb_one_kernel() really refers to two "ones":
1. Flush one single address
2. Flush that address from one CPU's TLB

The reason preempt needs to be off is that it doesn't make any sense to
flush one TLB entry from a "random" CPU. It only makes sense to flush
it when preempt is disabled and you *know* which CPU's TLB you're flushing.

I think kfence needs to be using flush_tlb_kernel_range(). That does
all the IPI fanciness to flush the TLBs on *ALL* CPUs, not just the
current one.

BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
being enabled. That's probably why you don't see this everywhere. We
should probably have unconditional preempt checks in there.
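
Concretely, the contrast between the two APIs (a kernel-style sketch, not compilable outside the tree; the function names are the real x86 interfaces, the comments are paraphrase):

```
/* Flush one address from one (the current) CPU's TLB; "the current
 * CPU" is only well defined while preemption is disabled: */
flush_tlb_one_kernel(addr);

/* Flush the range from *all* CPUs' TLBs via IPI; deterministic, but
 * more expensive and not usable from all contexts: */
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
```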

2021-03-29 17:48:07

by Marco Elver

Subject: Re: I915 CI-run with kfence enabled, issues found

On Mon, 29 Mar 2021 at 19:32, Dave Hansen <[email protected]> wrote:
>
> On 3/29/21 9:40 AM, Marco Elver wrote:
> > It looks like the code path from flush_tlb_one_kernel() to
> > invalidate_user_asid()'s this_cpu_ptr() has several feature checks, so
> > probably some feature difference between systems where it triggers and
> > it doesn't.
> >
> > As far as I'm aware, there is no restriction on where
> > flush_tlb_one_kernel() is called. We could of course guard it but I
> > think that's wrong.
> >
> > Other than that, I hope the x86 maintainers know what's going on here.
> >
> > Just for reference, the stack traces in the above logs start with:
> >
> > | <3> [31.556004] BUG: using smp_processor_id() in preemptible [00000000] code: dmesg/1075
> > | <4> [31.556070] caller is invalidate_user_asid+0x13/0x50
> > | <4> [31.556078] CPU: 6 PID: 1075 Comm: dmesg Not tainted 5.12.0-rc4-gda4a2b1a5479-kfence_1+ #1
> > | <4> [31.556081] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012
> > | <4> [31.556084] Call Trace:
> > | <4> [31.556088] dump_stack+0x7f/0xad
> > | <4> [31.556097] check_preemption_disabled+0xc8/0xd0
> > | <4> [31.556104] invalidate_user_asid+0x13/0x50
> > | <4> [31.556109] flush_tlb_one_kernel+0x5/0x20
> > | <4> [31.556113] kfence_protect+0x56/0x80
> > | ...........
>
> Our naming here isn't great.
>
> But, the "one" in flush_tlb_one_kernel() really refers to two "ones":
> 1. Flush one single address
> 2. Flush that address from one CPU's TLB
>
> The reason preempt needs to be off is that it doesn't make any sense to
> flush one TLB entry from a "random" CPU. It only makes sense to flush
> it when preempt is disabled and you *know* which CPU's TLB you're flushing.

Thanks for the rationale behind needing preempt off.

Though in our case it really is best-effort: it's enough if we hit the
CPU of the currently running task most of the time.

Doing it to all CPUs is too expensive, and we can tolerate this being
approximate (nothing bad will happen, KFENCE might just miss a bug and
that's ok).

> I think kfence needs to be using flush_tlb_kernel_range(). That does
> all the IPI fanciness to flush the TLBs on *ALL* CPUs, not just the
> current one.

The other problem is that this code can be called from interrupts.
This is already documented in arch/x86/include/asm/kfence.h.

> BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
> being enabled. That's probably why you don't see this everywhere. We
> should probably have unconditional preempt checks in there.

In which case I'll add a preempt_disable/enable() pair to
kfence_protect_page() in arch/x86/include/asm/kfence.h.
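
Sketched, that change would look something like this (hypothetical shape of the patch, not necessarily the final commit):

```
/* arch/x86/include/asm/kfence.h, in kfence_protect_page(): */

/*
 * The pair only pins us to one CPU so that invalidate_user_asid()'s
 * this_cpu_ptr() is well defined; the flush stays intentionally
 * local and best-effort (a stale TLB entry on another CPU at worst
 * means KFENCE misses a bug).
 */
preempt_disable();
flush_tlb_one_kernel(addr);
preempt_enable();
```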

Thanks,
-- Marco

2021-03-29 21:07:16

by Dave Hansen

Subject: Re: I915 CI-run with kfence enabled, issues found

On 3/29/21 10:45 AM, Marco Elver wrote:
> On Mon, 29 Mar 2021 at 19:32, Dave Hansen <[email protected]> wrote:
> Doing it to all CPUs is too expensive, and we can tolerate this being
> approximate (nothing bad will happen, KFENCE might just miss a bug and
> that's ok).
...
>> BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
>> being enabled. That's probably why you don't see this everywhere. We
>> should probably have unconditional preempt checks in there.
>
> In which case I'll add a preempt_disable/enable() pair to
> kfence_protect_page() in arch/x86/include/asm/kfence.h.

That sounds sane to me. I'd just plead that the special situation (not
needing deterministic TLB flushes) is obvious. We don't want any folks
copying this code.

BTW, I know you want to avoid the cost of IPIs, but have you considered
any other low-cost ways to get quicker TLB flushes? For instance, you
could loop over all CPUs and set cpu_tlbstate.invalidate_other=1. That
would induce a TLB flush at the next context switch without needing
an IPI.
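
Sketched in kernel style (hypothetical; whether a plain cross-CPU write to cpu_tlbstate like this is safe would need checking):

```
int cpu;

/* Mark the other ASIDs stale on every CPU; each CPU then flushes
 * them lazily at its own next context switch, with no IPI: */
for_each_possible_cpu(cpu)
	per_cpu(cpu_tlbstate.invalidate_other, cpu) = true;
```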

2021-03-29 21:36:19

by Marco Elver

Subject: Re: I915 CI-run with kfence enabled, issues found

On Mon, 29 Mar 2021 at 23:03, Dave Hansen <[email protected]> wrote:
> On 3/29/21 10:45 AM, Marco Elver wrote:
> > On Mon, 29 Mar 2021 at 19:32, Dave Hansen <[email protected]> wrote:
> > Doing it to all CPUs is too expensive, and we can tolerate this being
> > approximate (nothing bad will happen, KFENCE might just miss a bug and
> > that's ok).
> ...
> >> BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
> >> being enabled. That's probably why you don't see this everywhere. We
> >> should probably have unconditional preempt checks in there.
> >
> > In which case I'll add a preempt_disable/enable() pair to
> > kfence_protect_page() in arch/x86/include/asm/kfence.h.
>
> That sounds sane to me. I'd just plead that the special situation (not
> needing deterministic TLB flushes) is obvious. We don't want any folks
> copying this code.
>
> BTW, I know you want to avoid the cost of IPIs, but have you considered
> any other low-cost ways to get quicker TLB flushes? For instance, you
> could loop over all CPUs and set cpu_tlbstate.invalidate_other=1. That
> would induce a TLB flush at the next context switch without needing
> an IPI.

This is interesting. And it seems like it would work well for our
usecase. Ideally we should only flush entries related to the page we
changed. But it seems invalidate_other would flush the entire TLB.

With PTI, flush_tlb_one_kernel() already does that for the current
CPU, but now we'd flush the entire TLB on all CPUs, even if PTI is
off.

Do you have an intuition for how much this would affect large
multi-socket systems? I currently can't quite say, and would err on
the side of caution.

Thanks,
-- Marco

2021-03-29 21:52:18

by Andy Lutomirski

Subject: Re: I915 CI-run with kfence enabled, issues found


> On Mar 29, 2021, at 2:34 PM, Marco Elver <[email protected]> wrote:
>
> On Mon, 29 Mar 2021 at 23:03, Dave Hansen <[email protected]> wrote:
>>> On 3/29/21 10:45 AM, Marco Elver wrote:
>>>> On Mon, 29 Mar 2021 at 19:32, Dave Hansen <[email protected]> wrote:
>>> Doing it to all CPUs is too expensive, and we can tolerate this being
>>> approximate (nothing bad will happen, KFENCE might just miss a bug and
>>> that's ok).
>> ...
>>>> BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
>>>> being enabled. That's probably why you don't see this everywhere. We
>>>> should probably have unconditional preempt checks in there.
>>>
>>> In which case I'll add a preempt_disable/enable() pair to
>>> kfence_protect_page() in arch/x86/include/asm/kfence.h.
>>
>> That sounds sane to me. I'd just plead that the special situation (not
>> needing deterministic TLB flushes) is obvious. We don't want any folks
>> copying this code.
>>
>> BTW, I know you want to avoid the cost of IPIs, but have you considered
>> any other low-cost ways to get quicker TLB flushes? For instance, you
>> could loop over all CPUs and set cpu_tlbstate.invalidate_other=1. That
>> would induce a TLB flush at the next context switch without needing
>> an IPI.
>
> This is interesting. And it seems like it would work well for our
> usecase. Ideally we should only flush entries related to the page we
> changed. But it seems invalidate_other would flush the entire TLB.
>
> With PTI, flush_tlb_one_kernel() already does that for the current
> CPU, but now we'd flush entire TLBs for all CPUs and even if PTI is
> off.
>
> Do you have an intuition for how much this would affect large
> multi-socket systems? I currently can't quite say, and would err on
> the side of caution.

Flushing the kernel TLB for all addresses is rather pricey. ISTR 600
cycles on Skylake, not to mention the cost of losing the TLB. How
common is this?

>
> Thanks,
> -- Marco

2021-03-29 21:57:10

by Marco Elver

Subject: Re: I915 CI-run with kfence enabled, issues found

On Mon, 29 Mar 2021 at 23:47, Andy Lutomirski <[email protected]> wrote:
>
>
> > On Mar 29, 2021, at 2:34 PM, Marco Elver <[email protected]> wrote:
> >
> > On Mon, 29 Mar 2021 at 23:03, Dave Hansen <[email protected]> wrote:
> >>> On 3/29/21 10:45 AM, Marco Elver wrote:
> >>>> On Mon, 29 Mar 2021 at 19:32, Dave Hansen <[email protected]> wrote:
> >>> Doing it to all CPUs is too expensive, and we can tolerate this being
> >>> approximate (nothing bad will happen, KFENCE might just miss a bug and
> >>> that's ok).
> >> ...
> >>>> BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
> >>>> being enabled. That's probably why you don't see this everywhere. We
> >>>> should probably have unconditional preempt checks in there.
> >>>
> >>> In which case I'll add a preempt_disable/enable() pair to
> >>> kfence_protect_page() in arch/x86/include/asm/kfence.h.
> >>
> >> That sounds sane to me. I'd just plead that the special situation (not
> >> needing deterministic TLB flushes) is obvious. We don't want any folks
> >> copying this code.
> >>
> >> BTW, I know you want to avoid the cost of IPIs, but have you considered
> >> any other low-cost ways to get quicker TLB flushes? For instance, you
> >> could loop over all CPUs and set cpu_tlbstate.invalidate_other=1. That
> >> would induce a TLB flush at the next context switch without needing
> >> an IPI.
> >
> > This is interesting. And it seems like it would work well for our
> > usecase. Ideally we should only flush entries related to the page we
> > changed. But it seems invalidate_other would flush the entire TLB.
> >
> > With PTI, flush_tlb_one_kernel() already does that for the current
> > CPU, but now we'd flush entire TLBs for all CPUs and even if PTI is
> > off.
> >
> > Do you have an intuition for how much this would affect large
> > multi-socket systems? I currently can't quite say, and would err on
> > the side of caution.
>
> Flushing the kernel TLB for all addresses is rather pricey. ISTR 600
> cycles on Skylake, not to mention the cost of losing the TLB. How common is this?

AFAIK, invalidate_other resets the asid, so it's not explicit and
perhaps cheaper?

In any case, if we were to do this, it'd be based on the sample
interval of KFENCE, which can be as low as 1ms. But this is a
production debugging feature, so the target machines are not test
machines; for those production deployments we'd be looking at a flush
every ~500ms, though I know of other deployments that use <100ms.

Doesn't sound like much, but as you say, I also worry a bit about
losing the TLB across >100 CPUs even if it's every 500ms.
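
For scale, a back-of-envelope on the direct cost of the flush alone (a sketch assuming the 600-cycle Skylake figure above and a nominal 3 GHz clock; the cost of refilling the lost TLB entries afterwards is not modelled and likely dominates):

```python
def flush_overhead_fraction(interval_s, flush_cycles=600, cpu_hz=3e9):
    """Fraction of one CPU's cycles spent issuing the flush itself,
    given one flush every interval_s seconds."""
    return (flush_cycles / interval_s) / cpu_hz

# 500ms production interval, 100ms, and the 1ms minimum sample interval:
for interval in (0.5, 0.1, 0.001):
    print(f"{interval * 1000:>6.0f} ms -> {flush_overhead_fraction(interval):.1e}")
```

Even at the 1ms minimum sample interval the direct cost is only 2e-4 of a CPU, so the real question is indeed the lost TLB contents, not the flush instruction.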

Thanks,
-- Marco

2021-03-29 23:31:24

by Andy Lutomirski

Subject: Re: I915 CI-run with kfence enabled, issues found


> On Mar 29, 2021, at 2:55 PM, Marco Elver <[email protected]> wrote:
>
> On Mon, 29 Mar 2021 at 23:47, Andy Lutomirski <[email protected]> wrote:
>>
>>
>>>> On Mar 29, 2021, at 2:34 PM, Marco Elver <[email protected]> wrote:
>>>
>>> On Mon, 29 Mar 2021 at 23:03, Dave Hansen <[email protected]> wrote:
>>>>> On 3/29/21 10:45 AM, Marco Elver wrote:
>>>>>> On Mon, 29 Mar 2021 at 19:32, Dave Hansen <[email protected]> wrote:
>>>>> Doing it to all CPUs is too expensive, and we can tolerate this being
>>>>> approximate (nothing bad will happen, KFENCE might just miss a bug and
>>>>> that's ok).
>>>> ...
>>>>>> BTW, the preempt checks in flush_tlb_one_kernel() are dependent on KPTI
>>>>>> being enabled. That's probably why you don't see this everywhere. We
>>>>>> should probably have unconditional preempt checks in there.
>>>>>
>>>>> In which case I'll add a preempt_disable/enable() pair to
>>>>> kfence_protect_page() in arch/x86/include/asm/kfence.h.
>>>>
>>>> That sounds sane to me. I'd just plead that the special situation (not
>>>> needing deterministic TLB flushes) is obvious. We don't want any folks
>>>> copying this code.
>>>>
>>>> BTW, I know you want to avoid the cost of IPIs, but have you considered
>>>> any other low-cost ways to get quicker TLB flushes? For instance, you
>>>> could loop over all CPUs and set cpu_tlbstate.invalidate_other=1. That
>>>> would induce a TLB flush at the next context switch without needing
>>>> an IPI.
>>>
>>> This is interesting. And it seems like it would work well for our
>>> usecase. Ideally we should only flush entries related to the page we
>>> changed. But it seems invalidate_other would flush the entire TLB.
>>>
>>> With PTI, flush_tlb_one_kernel() already does that for the current
>>> CPU, but now we'd flush entire TLBs for all CPUs and even if PTI is
>>> off.
>>>
>>> Do you have an intuition for how much this would affect large
>>> multi-socket systems? I currently can't quite say, and would err on
>>> the side of caution.
>>
>> Flushing the kernel TLB for all addresses is rather pricey. ISTR 600
>> cycles on Skylake, not to mention the cost of losing the TLB. How common is this?
>
> AFAIK, invalidate_other resets the asid, so it's not explicit and
> perhaps cheaper?
>
> In any case, if we were to do this, it'd be based on the sample
> interval of KFENCE, which can be as low as 1ms. But this is a
> production debugging feature, so the target machines are not test
> machines. For those production deployments we'd be looking at every
> ~500ms. But I know of other deployments that use <100ms.
>
> Doesn't sound like much, but as you say, I also worry a bit about
> losing the TLB across >100 CPUs even if it's every 500ms.

On non-PTI, the only way to zap kernel mappings is to do a global flush, either via INVPCID (expensive) or CR4 (extra expensive). In PTI mode, it’s plausible that the implicit flush is good enough, and I’d be happy to review the patch, but it’s a PTI only thing. Much less expensive in PTI mode, too, because it only needs to flush kernel mappings.

If this is best-effort, it might be better to have some work in the exit-to-usermode path, or a thread or similar, that periodically does targeted single-page zaps.