The cpumask can change underneath us, which is generally safe except when we
call into hv_cpu_number_to_vp_number(): if the cpumask ends up empty, we pass
num_possible_cpus() into hv_cpu_number_to_vp_number(), causing it to read
garbage. As reported by KASAN:
[ 83.504763] BUG: KASAN: slab-out-of-bounds in hyperv_flush_tlb_others (include/asm-generic/mshyperv.h:128 arch/x86/hyperv/mmu.c:112)
[ 83.908636] Read of size 4 at addr ffff888267c01370 by task kworker/u8:2/106
[ 84.196669] CPU: 0 PID: 106 Comm: kworker/u8:2 Tainted: G W 5.4.60 #1
[ 84.196669] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 84.196669] Workqueue: writeback wb_workfn (flush-8:0)
[ 84.196669] Call Trace:
[ 84.196669] dump_stack (lib/dump_stack.c:120)
[ 84.196669] print_address_description.constprop.0 (mm/kasan/report.c:375)
[ 84.196669] __kasan_report.cold (mm/kasan/report.c:507)
[ 84.196669] kasan_report (arch/x86/include/asm/smap.h:71 mm/kasan/common.c:635)
[ 84.196669] hyperv_flush_tlb_others (include/asm-generic/mshyperv.h:128 arch/x86/hyperv/mmu.c:112)
[ 84.196669] flush_tlb_mm_range (arch/x86/include/asm/paravirt.h:68 arch/x86/mm/tlb.c:798)
[ 84.196669] ptep_clear_flush (arch/x86/include/asm/tlbflush.h:586 mm/pgtable-generic.c:88)
Fixes: 0e4c88f37693 ("x86/hyper-v: Use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible")
Cc: Vitaly Kuznetsov <[email protected]>
Cc: [email protected]
Signed-off-by: Sasha Levin <[email protected]>
---
arch/x86/hyperv/mmu.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 5208ba49c89a9..b1d6afc5fc4a3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -109,7 +109,9 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	 * must. We will also check all VP numbers when walking the
 	 * supplied CPU set to remain correct in all cases.
 	 */
-	if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64)
+	int last = cpumask_last(cpus);
+
+	if (last < num_possible_cpus() && hv_cpu_number_to_vp_number(last) >= 64)
 		goto do_ex_hypercall;
 
 	for_each_cpu(cpu, cpus) {
--
2.25.1
Sasha Levin <[email protected]> writes:
> [...]
> -	if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64)
> +	int last = cpumask_last(cpus);
> +
> +	if (last < num_possible_cpus() && hv_cpu_number_to_vp_number(last) >= 64)
> 		goto do_ex_hypercall;
In case 'cpus' can end up being empty (I'm genuinely surprised it can)
the check is mandatory indeed. I would, however, just return directly in
this case:
if (last < num_possible_cpus())
return;
if (hv_cpu_number_to_vp_number(last) >= 64)
goto do_ex_hypercall;
as there's nothing to flush, no need to call into
hyperv_flush_tlb_others_ex().
Anyway, the fix seems to be correct, so
Reviewed-by: Vitaly Kuznetsov <[email protected]>
--
Vitaly
On Thu, Oct 01, 2020 at 11:40:04AM +0200, Vitaly Kuznetsov wrote:
> [...]
> if (last < num_possible_cpus())
> return;
I think you want
last >= num_possible_cpus()
here?
A more important question is, if the mask can change willy-nilly, what
is stopping it from changing between these checks? I.e. is there still a
window in which hv_cpu_number_to_vp_number(last) can return garbage?
Wei.
On Thu, Oct 01, 2020 at 11:53:59AM +0000, Wei Liu wrote:
>[...]
>> In case 'cpus' can end up being empty (I'm genuinely surprised it can)
I was just as surprised as you and spent the good part of a day
debugging this. However, a:
WARN_ON(cpumask_empty(cpus));
triggers at that line of code even though we check for cpumask_empty()
at the entry of the function.
>> the check is mandatory indeed. I would, however, just return directly in
>> this case:
Makes sense.
>> if (last < num_possible_cpus())
>> return;
>
>I think you want
>
> last >= num_possible_cpus()
>
>here?
>
>A more important question is, if the mask can change willy-nilly, what
>is stopping it from changing between these checks? I.e. is there still a
>window in which hv_cpu_number_to_vp_number(last) can return garbage?
It's not that hv_cpu_number_to_vp_number() returns garbage, the issue is
that we feed it garbage.
hv_cpu_number_to_vp_number() expects that the input would be in the
range of 0 <= X < num_possible_cpus(), and here, if 'cpus' was empty, we
would pass in X == num_possible_cpus(), making it read out of bounds.
Maybe it's worthwhile to add a WARN_ON() into
hv_cpu_number_to_vp_number() to assert as well.
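To make the out-of-bounds mechanism concrete, here is a minimal sketch
paraphrasing the relevant helpers (simplified, not the exact sources of
any one tree):

	/* cpumask_last() is find_last_bit() over the whole bitmap, and
	 * find_last_bit() returns the bitmap size when no bit is set. */
	static inline unsigned int cpumask_last(const struct cpumask *srcp)
	{
		return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
	}

	/* hv_vp_index[] is allocated in hyperv_init() with
	 * num_possible_cpus() entries. */
	static inline int hv_cpu_number_to_vp_number(int cpu_number)
	{
		return hv_vp_index[cpu_number];
	}

	/* Empty mask => cpumask_last() == nr_cpumask_bits
	 * (num_possible_cpus() here) => hv_vp_index[num_possible_cpus()]
	 * reads one u32 past the end of the array -- the "Read of size 4"
	 * in the KASAN report above. */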
--
Thanks,
Sasha
Wei Liu <[email protected]> writes:
> [...]
>> if (last < num_possible_cpus())
>> return;
>
> I think you want
>
> last >= num_possible_cpus()
>
> here?
Of course, thanks!
>
> A more important question is, if the mask can change willy-nilly, what
> is stopping it from changing between these checks? I.e. is there still a
> window in which hv_cpu_number_to_vp_number(last) can return garbage?
>
AFAIU some CPUs can be dropped from the mask (because they switch to a
different mm?) and if we still flush there it's not a problem. The only
real problem I currently see is that we're passing the cpumask_last()
result to hv_cpu_number_to_vp_number(): cpumask_last() returns
num_possible_cpus() when the mask is empty, and that value must not be
passed to hv_cpu_number_to_vp_number().
--
Vitaly
From: Sasha Levin <[email protected]> Sent: Thursday, October 1, 2020 6:04 AM
> [...]
> >> In case 'cpus' can end up being empty (I'm genuinely surprised it can)
>
> I was just as surprised as you and spent the good part of a day
> debugging this. However, a:
>
> WARN_ON(cpumask_empty(cpus));
>
> triggers at that line of code even though we check for cpumask_empty()
> at the entry of the function.
What does the call stack look like when this triggers? I'm curious about
the path where the 'cpus' could be changing while the flush call is in
progress.
I wonder if CPUs could ever be added to the mask? Removing CPUs can
be handled with some care because an unnecessary flush doesn't hurt
anything. But adding CPUs has serious correctness problems.
>
> >> the check is mandatory indeed. I would, however, just return directly in
> >> this case:
>
> Makes sense.
But we'd need to do a local_irq_restore() before returning.
>
> >> if (last < num_possible_cpus())
> >> return;
> >
> >I think you want
> >
> > last >= num_possible_cpus()
> >
> >here?
Yes, but also the && must become ||
> >
> >A more important question is, if the mask can change willy-nilly, what
> >is stopping it from changing between these checks? I.e. is there still a
> >window in which hv_cpu_number_to_vp_number(last) can return garbage?
>
> It's not that hv_cpu_number_to_vp_number() returns garbage, the issue is
> that we feed it garbage.
>
> hv_cpu_number_to_vp_number() expects that the input would be in the
> range of 0 <= X < num_possible_cpus(), and here, if 'cpus' was empty, we
> would pass in X == num_possible_cpus(), making it read out of bounds.
>
> Maybe it's worthwhile to add a WARN_ON() into
> hv_cpu_number_to_vp_number() to assert as well.
If the input cpumask can be changing, the other risk is the for_each_cpu()
loop, which also has a call to hv_cpu_number_to_vp_number(). But looking at
the implementation of for_each_cpu(), it will always return an in-bounds value,
so everything should be OK.
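For reference, a simplified sketch of that loop (paraphrased from
include/linux/cpumask.h, not verbatim); the '(cpu) < nr_cpu_ids' loop
condition is what keeps the index in bounds even if the mask changes
underneath us:

	#define for_each_cpu(cpu, mask)				\
		for ((cpu) = -1;				\
		     (cpu) = cpumask_next((cpu), (mask)),	\
		     (cpu) < nr_cpu_ids;)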
On Sat, Oct 03, 2020 at 05:40:15PM +0000, Michael Kelley wrote:
> [...]
>
> What does the call stack look like when this triggers? I'm curious about
> the path where the 'cpus' could be changing while the flush call is in
> progress.
>
> I wonder if CPUs could ever be added to the mask? Removing CPUs can
> be handled with some care because an unnecessary flush doesn't hurt
> anything. But adding CPUs has serious correctness problems.
>
The cpumask_empty check is done before disabling irq. Is it possible
the mask is modified by an interrupt?
If there is a reliable way to trigger this bug, we may be able to test
the following patch.
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 5208ba49c89a..23fa08d24c1a 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -66,11 +66,13 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	if (!hv_hypercall_pg)
 		goto do_native;
 
-	if (cpumask_empty(cpus))
-		return;
-
 	local_irq_save(flags);
 
+	if (cpumask_empty(cpus)) {
+		local_irq_restore(flags);
+		return;
+	}
+
 	flush_pcpu = (struct hv_tlb_flush **)
 		this_cpu_ptr(hyperv_pcpu_input_arg);
From: Wei Liu <[email protected]> Sent: Monday, October 5, 2020 7:59 AM
> [...]
This thread died out 3 months ago without any patches being taken.
I recently hit the problem again at random, though not in a
reproducible way.
I'd like to take Wei Liu's latest proposal to check for an empty
cpumask *after* interrupts are disabled. I think this will almost
certainly solve the problem, and in a cleaner way than Sasha's
proposal. I'd also suggest adding a comment in the code to note
the importance of the ordering.
Michael
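For illustration, the entry of hyperv_flush_tlb_others() would then look
roughly like this (a sketch assuming Wei's proposal is applied as posted;
the comment wording is only a suggestion):

	local_irq_save(flags);

	/*
	 * Check the mask only after interrupts have been disabled: an
	 * interrupt may otherwise empty 'cpus' between the check and its
	 * use, and cpumask_last() on an empty mask returns
	 * num_possible_cpus(), which is out of bounds for hv_vp_index[].
	 */
	if (cpumask_empty(cpus)) {
		local_irq_restore(flags);
		return;
	}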
On Tue, Jan 05, 2021 at 04:59:10PM +0000, Michael Kelley wrote:
> [...]
> This thread died out 3 months ago without any patches being taken.
> I recently hit the problem again at random, though not in a
> reproducible way.
>
> I'd like to take Wei Liu's latest proposal to check for an empty
> cpumask *after* interrupts are disabled. I think this will almost
> certainly solve the problem, and in a cleaner way than Sasha's
> proposal. I'd also suggest adding a comment in the code to note
> the importance of the ordering.
>
Sure. Let me prepare a proper patch.
Wei.
On Tue, Jan 05, 2021 at 04:59:10PM +0000, Michael Kelley wrote:
>[...]
>This thread died out 3 months ago without any patches being taken.
>I recently hit the problem again at random, though not in a
>reproducible way.
>
>I'd like to take Wei Liu's latest proposal to check for an empty
>cpumask *after* interrupts are disabled. I think this will almost
>certainly solve the problem, and in a cleaner way than Sasha's
>proposal. I'd also suggest adding a comment in the code to note
>the importance of the ordering.
I found that this syzbot reproducer:
https://syzkaller.appspot.com//bug?id=47befb59c610a69f024db20b927dea80c88fc045
is pretty good at reproducing the issue too:
BUG: KASAN: slab-out-of-bounds in hyperv_flush_tlb_others+0x11ea/0x17c0
Read of size 4 at addr ffff88810005db20 by task 3.c.exe/13007
CPU: 4 PID: 13007 Comm: 3.c.exe Not tainted 5.10.5 #1
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 06/17/2020
Call Trace:
dump_stack+0xa4/0xd9
print_address_description.constprop.0.cold+0xd4/0x509
kasan_report.cold+0x20/0x37
__asan_report_load4_noabort+0x14/0x20
hyperv_flush_tlb_others+0x11ea/0x17c0
flush_tlb_mm_range+0x1fd/0x360
tlb_flush_mmu+0x1b5/0x510
tlb_finish_mmu+0x89/0x360
exit_mmap+0x24f/0x450
mmput+0x121/0x400
do_exit+0x8cf/0x2a70
do_group_exit+0x100/0x300
get_signal+0x3d7/0x1e70
arch_do_signal+0x8c/0x2670
exit_to_user_mode_prepare+0x154/0x1f0
syscall_exit_to_user_mode+0x42/0x280
do_syscall_64+0x45/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x450c2d
Code: Unable to access opcode bytes at RIP 0x450c03.
RSP: 002b:00007f6c81711d68 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 0000000000450c2d
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00000000004e0428
RBP: 00007f6c81711d80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffeeef33d2e
R13: 00007ffeeef33d2f R14: 00007ffeeef33dd0 R15: 00007f6c81711e80
Allocated by task 0:
kasan_save_stack+0x23/0x50
__kasan_kmalloc.constprop.0+0xcf/0xe0
kasan_kmalloc+0x9/0x10
__kmalloc+0x1c8/0x3b0
kmalloc_array+0x12/0x14
hyperv_init+0xd4/0x3a0
apic_intr_mode_init+0xbb/0x1e8
x86_late_time_init+0x96/0xa7
start_kernel+0x317/0x3d3
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x7a/0x7e
secondary_startup_64_no_verify+0xb0/0xbb
The buggy address belongs to the object at ffff88810005db00
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 0 bytes to the right of
32-byte region [ffff88810005db00, ffff88810005db20)
The buggy address belongs to the page:
page:0000000065310ff0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10005d
flags: 0x17ffffc0000200(slab)
raw: 0017ffffc0000200 0000000000000000 0000000100000001 ffff888100043a40
raw: 0000000000000000 0000000000400040 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88810005da00: 00 00 00 00 fc fc fc fc 00 00 00 00 fc fc fc fc
ffff88810005da80: 00 00 00 00 fc fc fc fc 00 00 00 00 fc fc fc fc
>ffff88810005db00: 00 00 00 00 fc fc fc fc 00 00 00 fc fc fc fc fc
^
ffff88810005db80: 00 00 00 fc fc fc fc fc 00 00 00 fc fc fc fc fc
ffff88810005dc00: 00 00 00 fc fc fc fc fc 00 00 00 fc fc fc fc fc
--
Thanks,
Sasha
Hi,
The problem is happening to me very frequently on kernel 4.19.195
Aug 4 03:59:01 c-node04 kernel: [36976.388554] BUG: KASAN: slab-out-of-bounds in hyperv_flush_tlb_others+0xec9/0x1640
Aug 4 03:59:01 c-node04 kernel: [36976.388556] Read of size 4 at addr ffff889e5e127440 by task ps/52478
Aug 4 03:59:01 c-node04 kernel: [36976.388556]
Aug 4 03:59:01 c-node04 kernel: [36976.388560] CPU: 4 PID: 52478 Comm: ps Kdump: loaded Tainted: G W OE 4.19.195-KM9 #1
Aug 4 03:59:01 c-node04 kernel: [36976.388562] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
Aug 4 03:59:01 c-node04 kernel: [36976.388562] Call Trace:
Aug 4 03:59:01 c-node04 kernel: [36976.388569] dump_stack+0x11d/0x1a7
Aug 4 03:59:01 c-node04 kernel: [36976.388572] ? dump_stack_print_info.cold.0+0x1b/0x1b
Aug 4 03:59:01 c-node04 kernel: [36976.388576] ? percpu_ref_tryget_live+0x2f0/0x2f0
Aug 4 03:59:01 c-node04 kernel: [36976.388580] ? rb_erase_cached+0xc4c/0x2880
Aug 4 03:59:01 c-node04 kernel: [36976.388584] ? printk+0x9f/0xc5
Aug 4 03:59:01 c-node04 kernel: [36976.388585] ? snapshot_ioctl.cold.1+0x74/0x74
Aug 4 03:59:01 c-node04 kernel: [36976.388590] print_address_description+0x65/0x22e
Aug 4 03:59:01 c-node04 kernel: [36976.388592] kasan_report.cold.6+0x243/0x2ff
Aug 4 03:59:01 c-node04 kernel: [36976.388594] ? hyperv_flush_tlb_others+0xec9/0x1640
Aug 4 03:59:01 c-node04 kernel: [36976.388596] hyperv_flush_tlb_others+0xec9/0x1640
Aug 4 03:59:01 c-node04 kernel: [36976.388601] ? trace_event_raw_event_hyperv_nested_flush_guest_mapping+0x1b0/0x1b0
Aug 4 03:59:01 c-node04 kernel: [36976.388603] ? mem_cgroup_try_charge+0x3cc/0x7d0
Aug 4 03:59:01 c-node04 kernel: [36976.388608] flush_tlb_mm_range+0x25c/0x370
Aug 4 03:59:01 c-node04 kernel: [36976.388611] ? native_flush_tlb_others+0x3b0/0x3b0
Aug 4 03:59:01 c-node04 kernel: [36976.388616] ptep_clear_flush+0x192/0x1d0
Aug 4 03:59:01 c-node04 kernel: [36976.388618] ? pmd_clear_bad+0x70/0x70
Aug 4 03:59:01 c-node04 kernel: [36976.388622] wp_page_copy+0x861/0x1a30
Aug 4 03:59:01 c-node04 kernel: [36976.388624] ? follow_pfn+0x2f0/0x2f0
Aug 4 03:59:01 c-node04 kernel: [36976.388627] ? active_load_balance_cpu_stop+0x10d0/0x10d0
Aug 4 03:59:01 c-node04 kernel: [36976.388632] ? get_page_from_freelist+0x330c/0x4660
Aug 4 03:59:01 c-node04 kernel: [36976.388638] ? activate_page+0x660/0x660
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? rb_erase+0x2a40/0x2a40
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? wake_up_page_bit+0x4d0/0x4d0
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? unwind_next_frame+0x113e/0x1920
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? __pte_alloc_kernel+0x350/0x350
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? deref_stack_reg+0x130/0x130
Aug 4 03:59:01 c-node04 kernel: [36976.388639] do_wp_page+0x461/0x1ca0
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? deref_stack_reg+0x130/0x130
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? finish_mkwrite_fault+0x710/0x710
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? unwind_next_frame+0x105d/0x1920
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? __pte_alloc_kernel+0x350/0x350
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? __zone_watermark_ok+0x33c/0x640
Aug 4 03:59:01 c-node04 kernel: [36976.388639] ? _raw_spin_lock+0x13/0x30
From: David Mozes <[email protected]>
>
> Hi,
> The problem is happening to me very frequently on kernel 4.19.195
>
David -- could you give us a little more context? Were you running earlier
4.19.xxx versions and did not see this problem? There was a timing
problem in hyperv_flush_tlb_others() that was fixed in early January
2021. The fix was backported to the 4.19 longterm tree, and should
be included in 4.19.195. Outside of that, I'm not aware of a problem
in this area.
For completeness, what version of Hyper-V are you using? And how
many vCPUs in your VM?
Michael
From: David Moses <[email protected]> Sent: Friday, August 6, 2021 2:20 AM
> Hi Michael,
> We are running kernel 4.19.195 (the fix Wei Liu suggested, of moving the
> cpumask_empty check after disabling interrupts, is included in this version)
> with the default Hyper-V version.
> I'm getting the 4-byte garbage read (trace included) once almost every night.
> We are running on Azure VM Standard D64s_v4 with 64 cores (our system
> includes three such VMs); the application is very high IO traffic, involving
> iSCSI.
> We believe this issue causes the stack corruption on the rt scheduler that I
> forwarded in the previous mail.
>
> Let us know what more is needed to clarify the problem.
> Is it just Hyper-V related, or could it be a general kernel issue?
>
> Thx David
>
> Even more than that: when I add the below patch/fix,
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 5b58a6c..165727a 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -298,6 +298,9 @@ static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>   */
>  static inline int hv_cpu_number_to_vp_number(int cpu_number)
>  {
> +	if (WARN_ON_ONCE(cpu_number < 0 || cpu_number >= num_possible_cpus()))
> +		return VP_INVAL;
> +
>  	return hv_vp_index[cpu_number];
>  }
>
> we have evidence that we reach this point;
>
> see below:
> Aug 5 21:03:01 c-node11 kernel: [17147.089261] WARNING: CPU: 15 PID: 8973 at arch/x86/include/asm/mshyperv.h:301 hyperv_flush_tlb_others+0x1f7/0x760
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] RIP: 0010:hyperv_flush_tlb_others+0x1f7/0x760
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] Code: ff ff be 40 00 00 00 48 89 df e8 c4 ff 3a 00
> 85 c0 48 89 c2 78 14 48 8b 3d be 52 32 01 f3 48 0f b8 c7 39 c2 0f 82 7e 01 00 00 <0f> 0b ba ff ff ff ff
> 89 d7 48 89 de e8 68 87 7d 00 3b 05 66 54 32
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] RSP: 0018:ffff8c536bcafa38 EFLAGS: 00010046
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] RAX: 0000000000000040 RBX: ffff8c339542ea00 RCX: ffffffffffffffff
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] RDX: 0000000000000040 RSI: ffffffffffffffff RDI: ffffffffffffffff
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] RBP: ffff8c339878b000 R08: ffffffffffffffff R09: ffffe93ecbcaa0e8
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] R10: 00000000020e0000 R11: 0000000000000000 R12: ffff8c536bcafa88
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] R13: ffffe93efe1ef980 R14: ffff8c339542e600 R15: 00007ffcbc390000
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] FS: 00007fcb8eae37a0(0000) GS:ffff8c339f7c0000(0000) knlGS:0000000000000000
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] CR2: 000000000135d1d8 CR3: 0000004037137005 CR4: 00000000003606e0
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Aug 5 21:03:01 c-node11 kernel: [17147.089275] Call Trace:
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  flush_tlb_mm_range+0xc3/0x120
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  ptep_clear_flush+0x3a/0x40
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  wp_page_copy+0x2e6/0x8f0
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  ? reuse_swap_page+0x13d/0x390
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  do_wp_page+0x99/0x4c0
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  __handle_mm_fault+0xb4e/0x12c0
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  ? memcg_kmem_get_cache+0x76/0x1a0
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  handle_mm_fault+0xd6/0x200
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  __get_user_pages+0x29e/0x780
> Aug 5 21:03:01 c-node11 kernel: [17147.089275]  get_user_pages_remote+0x12c/0x1b0
(FYI -- email to the Linux kernel mailing lists should be in plaintext format, and
not use HTML or other formatting.)
This is an excellent experiment. It certainly suggests that the cpumask that is
passed to hyperv_flush_tlb_others() has bits set for CPUs above 64 that don't exist.
If that's the case, it would seem to be a general kernel issue rather than something
specific to Hyper-V.
Since it looks like you can to add debugging code to the kernel, here are a couple
of thoughts:
1) In hyperv_flush_tlb_others(), after the call to disable interrupts, check the value
of cpumask_last(cpus), and if it is greater than num_possible_cpus(), execute a printk()
statement that outputs the entire contents of the cpumask that is passed in. There's
a special printk format string for printing out bitmaps like cpumasks. Let me know
if you would like some help on this code -- I can provide a diff later today (a rough
sketch follows after these two thoughts). Seeing what the "bad" cpumask looks like
might give some clues as to the problem.
2) As a different experiment, you can disable the Hyper-V specific flush routines
entirely. At the end of the mmu.c source file, have hyperv_setup_mmu_ops()
always return immediately. In this case, the generic Linux kernel flush routines
will be used instead of the Hyper-V ones. The code may be marginally slower,
but it will then be interesting to see if a problem shows up elsewhere.
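A minimal sketch of the check described in thought 1) -- illustrative only,
not a posted diff. It assumes placement in hyperv_flush_tlb_others() right
after local_irq_save(); %*pbl together with cpumask_pr_args() is the
kernel's printk extension for dumping a cpumask as a bit list:

	/* Hypothetical debug check: dump the incoming mask if its last
	 * set bit is out of range for this system. */
	if (cpumask_last(cpus) >= num_possible_cpus())
		pr_warn("hyperv_flush_tlb_others: bad cpumask %*pbl\n",
			cpumask_pr_args(cpus));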
But based on your experiment, I'm guessing that there's a general kernel issue
rather than something specific to Hyper-V.
Have you run 4.19 kernels previous to 4.19.195 that didn't have this problem? If
you have a kernel version that is good, the ultimate step would be to do
a bisect and find out where the problem was introduced in the 4.19-series. That
could take a while, but it would almost certainly identify the problematic
code change and would be beneficial to the Linux kernel community in
general.
Michael
Yes, please provide the diff.
Unfortunately we saw the problem on every 4.19.x version we tested, starting
from 4.19.149, and we saw the issue on 5.4.80 as well.
I believe you are right and it is a general kernel issue, not Hyper-V specific.
Note that the code I added eliminates the panic we got, but the kernel
"doesn't like it".
Any suggestions how we can let the kernel continue working while we do our
experiment?
Thx
David
-----Original Message-----
From: Michael Kelley <[email protected]>
Sent: Friday, August 6, 2021 1:43 PM
To: David Moses <[email protected]>
Cc: David Mozes <[email protected]>; [email protected]; [email protected]
Subject: RE: [PATCH] x86/hyper-v: guard against cpu mask changes in hyperv_flush_tlb_others()
From: David Moses <[email protected]> Sent: Friday, August 6, 2021 2:20 AM
> Hi Michael,
> We are running kernel 4.19.195 (the fix Wei Liu suggested, moving the
> cpumask_empty check after disabling interrupts, is included in this version)
> with the default Hyper-V version.
> I'm getting the 4-byte garbage read (trace included) almost every night.
> We are running on an Azure VM Standard D64s_v4 with 64 cores (our system includes
> three such VMs); the application is very high IO traffic involving iSCSI.
> We believe this issue causes stack corruption in the rt scheduler, as I forwarded
> in the previous mail.
>
> Let us know what more is needed to clarify the problem.
> Is it just Hyper-V related, or could it be a general kernel issue?
>
> Thx David
>
> Even more than that, when I add the below patch/fix:
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 5b58a6c..165727a 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -298,6 +298,9 @@ static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>   */
> static inline int hv_cpu_number_to_vp_number(int cpu_number)
> {
> +       if (WARN_ON_ONCE(cpu_number < 0 || cpu_number >= num_possible_cpus()))
> +               return VP_INVAL;
> +
>         return hv_vp_index[cpu_number];
> }
>
> we have evidence that we reach this point
>
> see below:
> Aug  5 21:03:01 c-node11 kernel: [17147.089261] WARNING: CPU: 15 PID: 8973 at arch/x86/include/asm/mshyperv.h:301 hyperv_flush_tlb_others+0x1f7/0x760
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RIP: 0010:hyperv_flush_tlb_others+0x1f7/0x760
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] Code: ff ff be 40 00 00 00 48 89 df e8 c4 ff 3a 00
> 85 c0 48 89 c2 78 14 48 8b 3d be 52 32 01 f3 48 0f b8 c7 39 c2 0f 82 7e 01 00 00 <0f> 0b ba ff ff ff ff
> 89 d7 48 89 de e8 68 87 7d 00 3b 05 66 54 32
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RSP: 0018:ffff8c536bcafa38 EFLAGS: 00010046
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RAX: 0000000000000040 RBX: ffff8c339542ea00 RCX: ffffffffffffffff
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RDX: 0000000000000040 RSI: ffffffffffffffff RDI: ffffffffffffffff
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RBP: ffff8c339878b000 R08: ffffffffffffffff R09: ffffe93ecbcaa0e8
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] R10: 00000000020e0000 R11: 0000000000000000 R12: ffff8c536bcafa88
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] R13: ffffe93efe1ef980 R14: ffff8c339542e600 R15: 00007ffcbc390000
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] FS:  00007fcb8eae37a0(0000) GS:ffff8c339f7c0000(0000) knlGS:0000000000000000
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] CR2: 000000000135d1d8 CR3: 0000004037137005 CR4: 00000000003606e0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] Call Trace:
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  flush_tlb_mm_range+0xc3/0x120
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  ptep_clear_flush+0x3a/0x40
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  wp_page_copy+0x2e6/0x8f0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  ? reuse_swap_page+0x13d/0x390
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  do_wp_page+0x99/0x4c0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  __handle_mm_fault+0xb4e/0x12c0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  ? memcg_kmem_get_cache+0x76/0x1a0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  handle_mm_fault+0xd6/0x200
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  __get_user_pages+0x29e/0x780
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  get_user_pages_remote+0x12c/0x1b0
From: תומר אבוטבול <[email protected]> Sent: Friday, August 6, 2021 11:03 AM
> Attaching the debugging patches Michael asked for:
> 1) Print the cpumask when cpumask_last(cpus) > num_possible_cpus():
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index e666f7eaf32d..620f656d6195 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -60,6 +60,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
>         struct hv_tlb_flush *flush;
>         u64 status = U64_MAX;
>         unsigned long flags;
> +       unsigned int cpu_last;
>
>         trace_hyperv_mmu_flush_tlb_others(cpus, info);
>
> @@ -68,6 +69,11 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
>
>         local_irq_save(flags);
>
> +       cpu_last = cpumask_last(cpus);
> +       if (cpu_last > num_possible_cpus()) {
I think this should be ">=" since cpus are numbered starting at zero.
In your VM with 64 CPUs, having CPU #64 in the list would be an error.
> +               pr_emerg("ERROR_HYPERV: cpu_last=%*pbl", cpumask_pr_args(cpus));
> +       }
> +
>         /*
>          * Only check the mask _after_ interrupt has been disabled to avoid the
>          * mask changing under our feet.
>
> 2) disable the Hyper-V specific flush routines:
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index e666f7eaf32d..8e77cc84775a 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -235,6 +235,7 @@ static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
>
> void hyperv_setup_mmu_ops(void)
> {
> +       return;
>         if (!(ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED))
>                 return;
Otherwise, this code looks good to me and matches what I had in mind.
Note that the function native_flush_tlb_others() is used when the Hyper-V specific
flush function is disabled per patch #2 above, or when hv_cpu_number_to_vp_number() returns
VP_INVAL. In a quick glance through the code, it appears that native_flush_tlb_others()
will work even if there's a non-existent CPU in the cpumask that is passed as an argument.
So perhaps an immediate workaround is Patch #2 above.
Perhaps hyperv_flush_tlb_others() should be made equally tolerant of a non-existent
CPU being in the list. But if you are willing, I'm still interested in the results of an
experiment with just Patch #1. I'm curious about what the CPU list looks like when
it has a non-existent CPU. Is it complete garbage, or is there just one non-existent
CPU?
The other curiosity is that I haven't seen this Linux panic reported by other users,
and I think it would have come to our attention if it were happening with any frequency.
You see the problem fairly regularly. So I'm wondering what the difference is.
Michael
Sent from my iPhone
> On Aug 7, 2021, at 12:51 AM, Michael Kelley <[email protected]> wrote:
>
> From: תומר אבוטבול <[email protected]> Sent: Friday, August 6, 2021 11:03 AM
>
>> Attaching the debugging patches Michael asked for:
>> 1) Print the cpumask when cpumask_last(cpus) > num_possible_cpus():
>> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
>> index e666f7eaf32d..620f656d6195 100644
>> --- a/arch/x86/hyperv/mmu.c
>> +++ b/arch/x86/hyperv/mmu.c
>> @@ -60,6 +60,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
>>         struct hv_tlb_flush *flush;
>>         u64 status = U64_MAX;
>>         unsigned long flags;
>> +       unsigned int cpu_last;
>>
>>         trace_hyperv_mmu_flush_tlb_others(cpus, info);
>>
>> @@ -68,6 +69,11 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
>>
>>         local_irq_save(flags);
>>
>> +       cpu_last = cpumask_last(cpus);
>> +       if (cpu_last > num_possible_cpus()) {
>
> I think this should be ">=" since cpus are numbered starting at zero.
> In your VM with 64 CPUs, having CPU #64 in the list would be an error.
>
>> +               pr_emerg("ERROR_HYPERV: cpu_last=%*pbl", cpumask_pr_args(cpus));
>> +       }
>> +
>>         /*
>>          * Only check the mask _after_ interrupt has been disabled to avoid the
>>          * mask changing under our feet.
>>
>> 2) disable the Hyper-V specific flush routines:
>> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
>> index e666f7eaf32d..8e77cc84775a 100644
>> --- a/arch/x86/hyperv/mmu.c
>> +++ b/arch/x86/hyperv/mmu.c
>> @@ -235,6 +235,7 @@ static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
>>
>> void hyperv_setup_mmu_ops(void)
>> {
>> +       return;
>>         if (!(ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED))
>>                 return;
>
> Otherwise, this code looks good to me and matches what I had in mind.
>
> Note that the function native_flush_tlb_others() is used when the Hyper-V specific
> flush function is disabled per patch #2 above, or when hv_cpu_number_to_vp_number() returns
> VP_INVAL. In a quick glance through the code, it appears that native_flush_tlb_others()
> will work even if there's a non-existent CPU in the cpumask that is passed as an argument.
> So perhaps an immediate workaround is Patch #2 above.
The current code of hv_cpu_number_to_vp_number() (where I generated the warning) returns VP_INVAL in this case (see the previous mail), and it looks like that does not completely work around the issue:
the CPU hangs even without a panic. We will continue watching.
>
>
> Perhaps hyperv_flush_tlb_others() should be made equally tolerant of a non-existent
> CPU being in the list. But if you are willing, I'm still interested in the results of an
> experiment with just Patch #1. I'm curious about what the CPU list looks like when
> it has a non-existent CPU. Is it complete garbage, or is there just one non-existent
> CPU?
>
We will do it, maybe not next week because of vacation, but the week after.
> The other curiosity is that I haven't seen this Linux panic reported by other users,
> and I think it would have come to our attention if it were happening with any frequency.
> You see the problem fairly regularly. So I'm wondering what the difference is.
>
> Michael
Hi Michael and all,
I am back from the holiday and did your suggestions/requests.
1. While running with patch number 2 (disable the Hyper-V specific flush routines),
as you suspected, we got a panic similar to what we got with the Hyper-V specific flush routines.
Below is the trace we got:
[32097.577728] kernel BUG at kernel/sched/rt.c:1004!
[32097.577738] invalid opcode: 0000 [#1] SMP
[32097.578711] CPU: 45 PID: 51244 Comm: STAR4BLKS0_WORK Kdump: loaded Tainted: G OE 4.19.195-KM9 #1
[32097.578711] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[32097.578711] RIP: 0010:dequeue_top_rt_rq+0x88/0xa0
[32097.578711] Code: 00 48 89 d5 48 0f a3 15 6e 19 82 01 73 d0 48 89 c7 e8 bc b7 fe ff be 02 00 00 00 89 ef 84 c0 74 0b e8 2c 94 04 00 eb b6 0f 0b <0f> 0b e8 b1 93 04 00 eb ab 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
[32097.578711] RSP: 0018:ffff9442e0de7b48 EFLAGS: 00010046
[32097.578711] RAX: ffff94809f9e1e00 RBX: ffff9448295e4c40 RCX: 00000000ffffffff
[32097.578711] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff94809f9e2040
[32097.578711] RBP: ffff94809f9e1e00 R08: fffffffffff0be25 R09: 00000000000216c0
[32097.578711] R10: 00004bbc85e1eff3 R11: 0000000000000000 R12: 0000000000000000
[32097.578711] R13: ffff9448295e4a20 R14: 0000000000021e00 R15: ffff94809fa21e00
[32097.578711] FS: 00007f7b0cea0700(0000) GS:ffff94809f940000(0000) knlGS:0000000000000000
[32097.578711] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32097.578711] CR2: ffffffffff600400 CR3: 000000201d5b3002 CR4: 00000000003606e0
[32097.578711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32097.578711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[32097.578711] Call Trace:
[32097.578711] dequeue_rt_stack+0x3e/0x280
[32097.578711] dequeue_rt_entity+0x1f/0x70
[32097.578711] dequeue_task_rt+0x26/0x70
[32097.578711] push_rt_task+0x1e2/0x220
[32097.578711] push_rt_tasks+0x11/0x20
[32097.578711] __balance_callback+0x3b/0x60
[32097.578711] __schedule+0x6e9/0x830
[32097.578711] schedule+0x28/0x80
[32097.578711] futex_wait_queue_me+0xb9/0x120
[32097.578711] futex_wait+0x139/0x250
[32097.578711] ? try_to_wake_up+0x54/0x460
[32097.578711] ? enqueue_task_rt+0x9f/0xc0
[32097.578711] ? get_futex_key+0x2ee/0x450
[32097.578711] do_futex+0x2eb/0x9f0
[32097.578711] __x64_sys_futex+0x143/0x180
[32097.578711] do_syscall_64+0x59/0x1b0
[32097.578711] ? prepare_exit_to_usermode+0x70/0x90
[32097.578711] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[32097.578711] RIP: 0033:0x7fa2ae151334
[32097.578711] Code: 66 0f 1f 44 00 00 41 52 52 4d 31 d2 ba 02 00 00 00 81 f6 80 00 00 00 64 23 34 25 48 00 00 00 39 d0 75 07 b8 ca 00 00 00 0f 05 <89> d0 87 07 85 c0 75 f1 5a 41 5a c3 83 3d f1 df 20 00 00 74 59 48
[32097.578711] RSP: 002b:00007f7b0ce9f3b0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[32097.578711] RAX: ffffffffffffffda RBX: 00007f7c1da5bc18 RCX: 00007fa2ae151334
[32097.578711] RDX: 0000000000000002 RSI: 0000000000000080 RDI: 00007f7c1da5bc58
[32097.578711] RBP: 00007f7b0ce9f5b0 R08: 00007f7c1da5bc58 R09: 000000000000c82c
[32097.578711] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7b1a149cf0
[32097.578711] R13: 00007f7c1da5bc58 R14: 0000000000000001 R15: 00000000000005a1
2. As you requested, and to help the community, we are running patch no. 1 as well.
This is what we got:
Aug 17 05:36:22 10.230.247.7 [40544.392690] Hyper-V: ERROR_HYPERV: cpu_last=
It looks like we got an empty cpumask!
Would you please let us know what further info you need and what the next step is for debugging this interesting issue?
Thx
David
Please use the "reply all" button in your mail client and avoid
top-posting. It is very difficult for me to decipher this thread...
On Tue, Aug 17, 2021 at 09:16:45AM +0000, David Mozes wrote:
> Hi Michael and all,
> I am back from the holiday and did your suggestions/requests.
>
> 1. While running with patch number 2 (disable the Hyper-V specific flush routines),
> as you suspected, we got a panic similar to what we got with the Hyper-V specific flush routines.
> Below is the trace we got:
>
> [32097.577728] kernel BUG at kernel/sched/rt.c:1004!
> [32097.577738] invalid opcode: 0000 [#1] SMP
> [32097.578711] CPU: 45 PID: 51244 Comm: STAR4BLKS0_WORK Kdump: loaded Tainted: G OE 4.19.195-KM9 #1
It seems that you have out of tree module(s) loaded. Please make sure
they don't do anything unusual.
> [32097.578711] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
> [32097.578711] RIP: 0010:dequeue_top_rt_rq+0x88/0xa0
> [32097.578711] Code: 00 48 89 d5 48 0f a3 15 6e 19 82 01 73 d0 48 89 c7 e8 bc b7 fe ff be 02 00 00 00 89 ef 84 c0 74 0b e8 2c 94 04 00 eb b6 0f 0b <0f> 0b e8 b1 93 04 00 eb ab 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
> [32097.578711] RSP: 0018:ffff9442e0de7b48 EFLAGS: 00010046
> [32097.578711] RAX: ffff94809f9e1e00 RBX: ffff9448295e4c40 RCX: 00000000ffffffff
> [32097.578711] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff94809f9e2040
> [32097.578711] RBP: ffff94809f9e1e00 R08: fffffffffff0be25 R09: 00000000000216c0
> [32097.578711] R10: 00004bbc85e1eff3 R11: 0000000000000000 R12: 0000000000000000
> [32097.578711] R13: ffff9448295e4a20 R14: 0000000000021e00 R15: ffff94809fa21e00
> [32097.578711] FS: 00007f7b0cea0700(0000) GS:ffff94809f940000(0000) knlGS:0000000000000000
> [32097.578711] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [32097.578711] CR2: ffffffffff600400 CR3: 000000201d5b3002 CR4: 00000000003606e0
> [32097.578711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [32097.578711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [32097.578711] Call Trace:
> [32097.578711] dequeue_rt_stack+0x3e/0x280
> [32097.578711] dequeue_rt_entity+0x1f/0x70
> [32097.578711] dequeue_task_rt+0x26/0x70
> [32097.578711] push_rt_task+0x1e2/0x220
> [32097.578711] push_rt_tasks+0x11/0x20
> [32097.578711] __balance_callback+0x3b/0x60
> [32097.578711] __schedule+0x6e9/0x830
> [32097.578711] schedule+0x28/0x80
It looks like the scheduler is in an irrecoverable state. The stack
trace does not show anything related to TLB flush, so it is unclear to
me whether this has anything to do with the original report.
Have you tried running the same setup on baremetal?
> [32097.578711] futex_wait_queue_me+0xb9/0x120
> [32097.578711] futex_wait+0x139/0x250
> [32097.578711] ? try_to_wake_up+0x54/0x460
> [32097.578711] ? enqueue_task_rt+0x9f/0xc0
> [32097.578711] ? get_futex_key+0x2ee/0x450
> [32097.578711] do_futex+0x2eb/0x9f0
> [32097.578711] __x64_sys_futex+0x143/0x180
> [32097.578711] do_syscall_64+0x59/0x1b0
> [32097.578711] ? prepare_exit_to_usermode+0x70/0x90
> [32097.578711] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [32097.578711] RIP: 0033:0x7fa2ae151334
> [32097.578711] Code: 66 0f 1f 44 00 00 41 52 52 4d 31 d2 ba 02 00 00 00 81 f6 80 00 00 00 64 23 34 25 48 00 00 00 39 d0 75 07 b8 ca 00 00 00 0f 05 <89> d0 87 07 85 c0 75 f1 5a 41 5a c3 83 3d f1 df 20 00 00 74 59 48
> [32097.578711] RSP: 002b:00007f7b0ce9f3b0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> [32097.578711] RAX: ffffffffffffffda RBX: 00007f7c1da5bc18 RCX: 00007fa2ae151334
> [32097.578711] RDX: 0000000000000002 RSI: 0000000000000080 RDI: 00007f7c1da5bc58
> [32097.578711] RBP: 00007f7b0ce9f5b0 R08: 00007f7c1da5bc58 R09: 000000000000c82c
> [32097.578711] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7b1a149cf0
> [32097.578711] R13: 00007f7c1da5bc58 R14: 0000000000000001 R15: 00000000000005a1
>
>
> 2. As you requested, and to help the community, we are running patch no. 1 as well.
>
> This is what we got:
>
> Aug 17 05:36:22 10.230.247.7 [40544.392690] Hyper-V: ERROR_HYPERV: cpu_last=
>
> It looks like we got an empty cpumask!
Assuming this is from the patch below, the code already handles an empty
cpumask a few lines later.
You should perhaps move your change to after that, right before cpus is
actually used.
Wei.
>
> Would you please let us know what further info you need and what the next step is for debugging this interesting issue?
>
> Thx
> David
>
>
>
>
> >> 1) Print the cpumask when cpumask_last(cpus) > num_possible_cpus():
> >> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> >> index e666f7eaf32d..620f656d6195 100644
> >> --- a/arch/x86/hyperv/mmu.c
> >> +++ b/arch/x86/hyperv/mmu.c
> >> @@ -60,6 +60,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
> >>         struct hv_tlb_flush *flush;
> >>         u64 status = U64_MAX;
> >>         unsigned long flags;
> >> +       unsigned int cpu_last;
> >>
> >>         trace_hyperv_mmu_flush_tlb_others(cpus, info);
> >>
> >> @@ -68,6 +69,11 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
> >>
> >>         local_irq_save(flags);
> >>
> >> +       cpu_last = cpumask_last(cpus);
> >> +       if (cpu_last > num_possible_cpus()) {
> >
> > I think this should be ">=" since cpus are numbered starting at zero.
> > In your VM with 64 CPUs, having CPU #64 in the list would be an error.
> >
> >> +               pr_emerg("ERROR_HYPERV: cpu_last=%*pbl", cpumask_pr_args(cpus));
> >> +       }
> >> +
> >>         /*
> >>          * Only check the mask _after_ interrupt has been disabled to avoid the
> >>          * mask changing under our feet.
> >>
Hi Wei,
Per your request, I moved the cpumask print to two other places, after the handling of the empty mask; see below.
And I got the following:
Aug 19 02:01:51 c-node05 kernel: [25936.562674] Hyper-V: ERROR_HYPERV2: cpu_last=
Aug 19 02:01:51 c-node05 kernel: [25936.562686] WARNING: CPU: 11 PID: 56432 at arch/x86/include/asm/mshyperv.h:301 hyperv_flush_tlb_others+0x23f/0x7b0
So we got an empty cpumask at a different place in the code.
Let me know if you need further information from us.
How do you suggest handling this situation?
Thx
David
The new print-cpumask patch:
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index a72fa69..31f0683 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -70,9 +70,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
        local_irq_save(flags);

        cpu_last = cpumask_last(cpus);
-       if (cpu_last >= num_possible_cpus()) {
-               pr_emerg("ERROR_HYPERV: cpu_last=%*pbl", cpumask_pr_args(cpus));
-       }
+
        /*
         * Only check the mask _after_ interrupt has been disabled to avoid the
@@ -83,6 +81,12 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
                return;
        }
+
+       if (cpu_last >= num_possible_cpus()) {
+               pr_emerg("ERROR_HYPERV1: cpu_last=%*pbl", cpumask_pr_args(cpus));
+       }
+
+
        flush_pcpu = (struct hv_tlb_flush **)
                this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -121,6 +125,13 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
         * must. We will also check all VP numbers when walking the
         * supplied CPU set to remain correct in all cases.
         */
+       cpu_last = cpumask_last(cpus);
+
+       if (cpu_last >= num_possible_cpus()) {
+               pr_emerg("ERROR_HYPERV2: cpu_last=%*pbl", cpumask_pr_args(cpus));
+       }
+
On Thu, Aug 19, 2021 at 07:55:06AM +0000, David Mozes wrote:
> Hi Wei,
> I moved the cpumask print to two other places, after the handling of the empty mask; see below.
> And I got the following:
>
>
> Aug 19 02:01:51 c-node05 kernel: [25936.562674] Hyper-V: ERROR_HYPERV2: cpu_last=
> Aug 19 02:01:51 c-node05 kernel: [25936.562686] WARNING: CPU: 11 PID: 56432 at arch/x86/include/asm/mshyperv.h:301 hyperv_flush_tlb_others+0x23f/0x7b0
>
> So we got an empty cpumask at a different place in the code.
> Let me know if you need further information from us.
> How do you suggest handling this situation?
>
Please find a way to reproduce this issue with upstream kernels.
Thanks,
Wei.
This is not feasible, since we need a very high load to reproduce the issue.
We have tried a lot but can't achieve the desired load.
On our kernel with less load, it is not reproducible either.
On Sun, Aug 22, 2021 at 04:25:19PM +0000, David Mozes wrote:
> This is not feasible, since we need a very high load to reproduce the issue.
> We have tried a lot but can't achieve the desired load.
> On our kernel with less load, it is not reproducible either.
There isn't much upstream can do if there is no way to reproduce the
issue with an upstream kernel.
You can check all the code paths which may modify cpumask and analyze
them. KCSAN may be useful too, but that's only available in 5.8 and
later.
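For reference, a minimal sketch of what enabling it looks like (assumes a 5.8+
tree and a compiler with -fsanitize=thread support; CONFIG_KCSAN is the real
option name, everything else about the setup is your choice):

        # .config fragment for a KCSAN debug build
        CONFIG_KCSAN=y

Data races on the cpumask would then show up in dmesg as "BUG: KCSAN:
data-race in ..." reports.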
Thanks,
Wei.