Dear all,
resume-from-sleep (mem/S3) on v4.15-rc5-149-g5aa90a845892 triggers the
following bug. If I boot with "pti=off", the kernel does not show this
issue, and neither did kernels before pti was merged:
[ 0.000000] microcode: microcode updated early to revision 0x25, date = 2017-01-27
[ 0.000000] Linux version 4.15.0-rc5+ (brodo@light) (gcc version 7.2.1 20171128 (GCC)) #2 SMP PREEMPT Sat Dec 30 12:03:51 CET 2017
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-testing-git page_poison=on slub_debug=P
...
[ 0.000000] Memory: 7922664K/8291016K available (18460K kernel code, 2408K rwdata, 6548K rodata, 3440K init, 13116K bss, 368352K reserved, 0K cma-reserved)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[ 0.000000] Kernel/User page tables isolation: enabled
...
[ 38.829235] PM: suspend entry (deep)
...
[ 39.893319] Disabling non-boot CPUs ...
[ 39.911270] smpboot: CPU 1 is now offline
[ 39.928370] smpboot: CPU 2 is now offline
[ 39.947837] smpboot: CPU 3 is now offline
[ 39.951703] ACPI: Low-level resume complete
[ 39.951832] ACPI: EC: EC started
[ 39.951840] PM: Restoring platform NVS memory
[ 39.954648] Enabling non-boot CPUs ...
[ 39.954792] x86: Booting SMP configuration:
[ 39.954800] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 39.954834] BUG: using smp_processor_id() in preemptible [00000000] code: sh/465
[ 39.954841] caller is native_cpu_up+0x2f0/0xa30
[ 39.954847] CPU: 0 PID: 465 Comm: sh Not tainted 4.15.0-rc5+ #2
[ 39.954851] Hardware name: Dell Inc. XPS 13 9343/0TM99H, BIOS A11 12/08/2016
[ 39.954855] Call Trace:
[ 39.954863] dump_stack+0x67/0x95
[ 39.954871] check_preemption_disabled+0xd8/0xe0
[ 39.954880] native_cpu_up+0x2f0/0xa30
[ 39.954896] bringup_cpu+0x25/0xa0
[ 39.954902] ? cpuhp_kick_ap+0x70/0x70
[ 39.954909] cpuhp_invoke_callback+0xb8/0xc50
[ 39.954930] _cpu_up+0xad/0x170
[ 39.954943] enable_nonboot_cpus+0x9e/0x320
[ 39.954953] suspend_devices_and_enter+0x33c/0xd40
[ 39.954974] pm_suspend+0x6a0/0x9e0
[ 39.954988] state_store+0x7d/0xf0
[ 39.955002] kernfs_fop_write+0x11c/0x1b0
[ 39.955014] __vfs_write+0x39/0x1d0
[ 39.955024] ? rcu_read_lock_sched_held+0x74/0x80
[ 39.955029] ? preempt_count_sub+0x92/0xd0
[ 39.955036] ? __sb_start_write+0x16a/0x1f0
[ 39.955047] vfs_write+0xcc/0x1b0
[ 39.955058] SyS_write+0x55/0xc0
[ 39.955072] entry_SYSCALL_64_fastpath+0x18/0x85
[ 39.955077] RIP: 0033:0x4b9a7e
[ 39.955081] RSP: 002b:00007ffda426c148 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 39.955088] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00000000004b9a7e
[ 39.955092] RDX: 0000000000000004 RSI: 00000000023da740 RDI: 0000000000000001
[ 39.955095] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 39.955099] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 39.955103] R13: 00000000023d7020 R14: 00000000023d7418 R15: 0000000000000000
[ 39.956435] BUG: using smp_processor_id() in preemptible [00000000] code: sh/465
[ 39.956442] caller is native_cpu_up+0x447/0xa30
[ 39.956448] CPU: 0 PID: 465 Comm: sh Not tainted 4.15.0-rc5+ #2
[ 39.956452] Hardware name: Dell Inc. XPS 13 9343/0TM99H, BIOS A11 12/08/2016
[ 39.956455] Call Trace:
[ 39.956465] dump_stack+0x67/0x95
[ 39.956473] check_preemption_disabled+0xd8/0xe0
[ 39.956482] native_cpu_up+0x447/0xa30
[ 39.956498] bringup_cpu+0x25/0xa0
[ 39.956504] ? cpuhp_kick_ap+0x70/0x70
[ 39.956511] cpuhp_invoke_callback+0xb8/0xc50
[ 39.956532] _cpu_up+0xad/0x170
[ 39.956544] enable_nonboot_cpus+0x9e/0x320
[ 39.956555] suspend_devices_and_enter+0x33c/0xd40
[ 39.956575] pm_suspend+0x6a0/0x9e0
[ 39.956589] state_store+0x7d/0xf0
[ 39.956603] kernfs_fop_write+0x11c/0x1b0
[ 39.956615] __vfs_write+0x39/0x1d0
[ 39.956625] ? rcu_read_lock_sched_held+0x74/0x80
[ 39.956630] ? preempt_count_sub+0x92/0xd0
[ 39.956637] ? __sb_start_write+0x16a/0x1f0
[ 39.956649] vfs_write+0xcc/0x1b0
[ 39.956660] SyS_write+0x55/0xc0
[ 39.956673] entry_SYSCALL_64_fastpath+0x18/0x85
[ 39.956678] RIP: 0033:0x4b9a7e
[ 39.956682] RSP: 002b:00007ffda426c148 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 39.956689] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00000000004b9a7e
[ 39.956693] RDX: 0000000000000004 RSI: 00000000023da740 RDI: 0000000000000001
[ 39.956696] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 39.956700] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 39.956704] R13: 00000000023d7020 R14: 00000000023d7418 R15: 0000000000000000
[ 39.958429] cache: parent cpu1 should not be sleeping
[ 39.959881] CPU1 is up
[ 39.960013] smpboot: Booting Node 0 Processor 2 APIC 0x1
[ 39.960023] BUG: using smp_processor_id() in preemptible [00000000] code: sh/465
... and then the same for CPUs 2 and 3, meaning the warning is triggered six
times overall, twice for each non-boot CPU.
Thanks,
Dominik
On Sat, 30 Dec 2017, Dominik Brodowski wrote:
> Dear all,
>
> resume-from-sleep (mem/S3) on v4.15-rc5-149-g5aa90a845892 triggers the
> following bug. If I boot with "pti=off", the kernel does not show this
> issue, and neither did kernels before pti was merged:
>
> [ 39.951703] ACPI: Low-level resume complete
> [ 39.951832] ACPI: EC: EC started
> [ 39.951840] PM: Restoring platform NVS memory
> [ 39.954648] Enabling non-boot CPUs ...
> [ 39.954792] x86: Booting SMP configuration:
> [ 39.954800] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 39.954834] BUG: using smp_processor_id() in preemptible [00000000] code: sh/465
> [ 39.954841] caller is native_cpu_up+0x2f0/0xa30
I can't reproduce at the moment and I can't find a possible reason for this
by code inspection.
Can you please provide your .config file and perhaps decode the two
offending call sites with
scripts/faddr2line vmlinux native_cpu_up+0x2f0/0xa30 native_cpu_up+0x447/0xa30
Thanks,
tglx
On Sat, Dec 30, 2017 at 04:03:07PM +0100, Thomas Gleixner wrote:
> On Sat, 30 Dec 2017, Dominik Brodowski wrote:
> > resume-from-sleep (mem/S3) on v4.15-rc5-149-g5aa90a845892 triggers the
> > following bug. If I boot with "pti=off", the kernel does not show this
> > issue, and neither did kernels before pti was merged:
> >
> > [ 39.951703] ACPI: Low-level resume complete
> > [ 39.951832] ACPI: EC: EC started
> > [ 39.951840] PM: Restoring platform NVS memory
> > [ 39.954648] Enabling non-boot CPUs ...
> > [ 39.954792] x86: Booting SMP configuration:
> > [ 39.954800] smpboot: Booting Node 0 Processor 1 APIC 0x2
> > [ 39.954834] BUG: using smp_processor_id() in preemptible [00000000] code: sh/465
> > [ 39.954841] caller is native_cpu_up+0x2f0/0xa30
>
> I can't reproduce at the moment and I can't find a possible reason for this
> by code inspection.
Thanks for taking a look at it!
> Can you please provide your .config file
See attached.
> and perhaps decode the two offending call sites with
>
> scripts/faddr2line vmlinux native_cpu_up+0x2f0/0xa30 native_cpu_up+0x447/0xa30
native_cpu_up+0x2f0/0xa30:
invalidate_user_asid at arch/x86/include/asm/tlbflush.h:343
(inlined by) __native_flush_tlb at arch/x86/include/asm/tlbflush.h:351
(inlined by) smpboot_setup_warm_reset_vector at arch/x86/kernel/smpboot.c:129
(inlined by) do_boot_cpu at arch/x86/kernel/smpboot.c:950
(inlined by) native_cpu_up at arch/x86/kernel/smpboot.c:1070
native_cpu_up+0x447/0xa30:
kern_pcid at arch/x86/include/asm/tlbflush.h:105
(inlined by) invalidate_user_asid at arch/x86/include/asm/tlbflush.h:342
(inlined by) __native_flush_tlb at arch/x86/include/asm/tlbflush.h:351
(inlined by) smpboot_restore_warm_reset_vector at arch/x86/kernel/smpboot.c:146
(inlined by) do_boot_cpu at arch/x86/kernel/smpboot.c:1022
(inlined by) native_cpu_up at arch/x86/kernel/smpboot.c:1070
Thanks,
Dominik
On 12/30/2017 07:30 AM, Dominik Brodowski wrote:
>
> native_cpu_up+0x447/0xa30:
> kern_pcid at arch/x86/include/asm/tlbflush.h:105
> (inlined by) invalidate_user_asid at arch/x86/include/asm/tlbflush.h:342
> (inlined by) __native_flush_tlb at arch/x86/include/asm/tlbflush.h:351
> (inlined by) smpboot_restore_warm_reset_vector at arch/x86/kernel/smpboot.c:146
> (inlined by) do_boot_cpu at arch/x86/kernel/smpboot.c:1022
> (inlined by) native_cpu_up at arch/x86/kernel/smpboot.c:1070
This appears to be this path:
> static inline void smpboot_restore_warm_reset_vector(void)
> {
> unsigned long flags;
>
> /*
> * Install writable page 0 entry to set BIOS data area.
> */
> local_flush_tlb();
The PTI code is now tracking when a given ASID needs to get flushed in a
per-cpu variable, and we use smp_processor_id() in local_flush_tlb() to
do that tracking. That's the *proximate* cause of the new warning. I
think it's actually a quite valid warning that's catching something
questionable.
I'm limited here by not knowing how the warm reset vector actually
works, though. I don't know why we TLB flush at all, much less why we
do it after CMOS_WRITE() in the "setup" path but _before_ CMOS_WRITE()
in the "restore" one. Where do we actually "Install writable page 0
entry to set BIOS data area"? Shouldn't we just be flushing _there_?
But, even _doing_ a TLB flush with preempt enabled and interrupts on
seems wrong to me. It just fundamentally doesn't mean anything because
it can theoretically run anywhere and flush *any* TLB. There might be
some other implicit preempt-thwarting going on here, but I can't find it.
The naive fix here is to just preempt_dis/enable() over the area doing
the flush and the writes to the TRAMPOLINE_* area. That'll definitely
shut up the warnings.
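Something like this, i.e. completely untested and only meant to illustrate
the naive version (body reconstructed from memory of current smpboot.c):

	static inline void smpboot_restore_warm_reset_vector(void)
	{
		unsigned long flags;

		/*
		 * Keep the flush (and the per-cpu tracking it now does)
		 * on one CPU, together with the trampoline write below.
		 */
		preempt_disable();
		local_flush_tlb();

		/*
		 * Paranoid: Set warm reset code and vector here back
		 * to default values.
		 */
		spin_lock_irqsave(&rtc_lock, flags);
		CMOS_WRITE(0, 0xf);
		spin_unlock_irqrestore(&rtc_lock, flags);

		*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
		preempt_enable();
	}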
On Sat, 30 Dec 2017, Dominik Brodowski wrote:
> On Sat, Dec 30, 2017 at 04:03:07PM +0100, Thomas Gleixner wrote:
> > On Sat, 30 Dec 2017, Dominik Brodowski wrote:
> > > resume-from-sleep (mem/S3) on v4.15-rc5-149-g5aa90a845892 triggers the
> > > following bug. If I boot with "pti=off", the kernel does not show this
> > > issue, and neither did kernels before pti was merged:
> > >
> > > [ 39.951703] ACPI: Low-level resume complete
> > > [ 39.951832] ACPI: EC: EC started
> > > [ 39.951840] PM: Restoring platform NVS memory
> > > [ 39.954648] Enabling non-boot CPUs ...
> > > [ 39.954792] x86: Booting SMP configuration:
> > > [ 39.954800] smpboot: Booting Node 0 Processor 1 APIC 0x2
> > > [ 39.954834] BUG: using smp_processor_id() in preemptible [00000000] code: sh/465
> > > [ 39.954841] caller is native_cpu_up+0x2f0/0xa30
> >
> > I can't reproduce at the moment and I can't find a possible reason for this
> > by code inspection.
>
> Thanks for taking a look at it!
>
> > Can you please provide your .config file
>
> See attached.
>
> > and perhaps decode the two offending call sites with
> >
> > scripts/faddr2line vmlinux native_cpu_up+0x2f0/0xa30 native_cpu_up+0x447/0xa30
>
> native_cpu_up+0x2f0/0xa30:
> invalidate_user_asid at arch/x86/include/asm/tlbflush.h:343
Ah, that makes sense. Missed that in the maze.
What makes less sense is that tlbflush itself. I'm surely missing something
subtle, but from a first look that tlbflush is pointless.
> (inlined by) __native_flush_tlb at arch/x86/include/asm/tlbflush.h:351
> (inlined by) smpboot_setup_warm_reset_vector at arch/x86/kernel/smpboot.c:129
> (inlined by) do_boot_cpu at arch/x86/kernel/smpboot.c:950
> (inlined by) native_cpu_up at arch/x86/kernel/smpboot.c:1070
>
> native_cpu_up+0x447/0xa30:
> kern_pcid at arch/x86/include/asm/tlbflush.h:105
> (inlined by) invalidate_user_asid at arch/x86/include/asm/tlbflush.h:342
> (inlined by) __native_flush_tlb at arch/x86/include/asm/tlbflush.h:351
> (inlined by) smpboot_restore_warm_reset_vector at arch/x86/kernel/smpboot.c:146
This one even more so, as the stale comment suggests that there was some
page table fiddling at some point in the past.
> (inlined by) do_boot_cpu at arch/x86/kernel/smpboot.c:1022
> (inlined by) native_cpu_up at arch/x86/kernel/smpboot.c:1070
Let me think about it and do some archaeological research.
Thanks,
tglx
On Sat, 30 Dec 2017, Dave Hansen wrote:
> On 12/30/2017 07:30 AM, Dominik Brodowski wrote:
> >
> > native_cpu_up+0x447/0xa30:
> > kern_pcid at arch/x86/include/asm/tlbflush.h:105
> > (inlined by) invalidate_user_asid at arch/x86/include/asm/tlbflush.h:342
> > (inlined by) __native_flush_tlb at arch/x86/include/asm/tlbflush.h:351
> > (inlined by) smpboot_restore_warm_reset_vector at arch/x86/kernel/smpboot.c:146
> > (inlined by) do_boot_cpu at arch/x86/kernel/smpboot.c:1022
> > (inlined by) native_cpu_up at arch/x86/kernel/smpboot.c:1070
>
> This appears to be this path:
>
> > static inline void smpboot_restore_warm_reset_vector(void)
> > {
> > unsigned long flags;
> >
> > /*
> > * Install writable page 0 entry to set BIOS data area.
> > */
> > local_flush_tlb();
>
> The PTI code is now tracking when a given ASID needs to get flushed in a
> per-cpu variable, and we use smp_processor_id() in local_flush_tlb() to
> do that tracking. That's the *proximate* cause of the new warning. I
> think it's actually a quite valid warning that's catching something
> questionable.
>
> I'm limited here by not knowing how the warm reset vector actually
> works, though. I don't know why we TLB flush at all, much less why we
> do it after CMOS_WRITE() in the "setup" path but _before_ CMOS_WRITE()
> in the "restore" one. Where do we actually "Install writable page 0
> entry to set BIOS data area"? Shouldn't we just be flushing _there_?
>
> But, even _doing_ a TLB flush with preempt enabled and interrupts on
> seems wrong to me. It just fundamentally doesn't mean anything because
> it can theoretically run anywhere and flush *any* TLB. There might be
> some other implicit preempt-thwarting going on here, but I can't find it.
>
> The naive fix here is to just preempt_dis/enable() over the area doing
> the flush and the writes to the TRAMPOLINE_* area. That'll definitely
> shut up the warnings.
Well, yes, but it makes no sense at all.
On Sat, Dec 30, 2017 at 10:20 AM, Thomas Gleixner <[email protected]> wrote:
> On Sat, 30 Dec 2017, Dominik Brodowski wrote:
>>
>> native_cpu_up+0x2f0/0xa30:
>> invalidate_user_asid at arch/x86/include/asm/tlbflush.h:343
>
> Ah, that makes sense. Missed that in the maze.
>
> What makes less sense is that tlbflush itself. I'm surely missing something
> subtle, but from a first look that tlbflush is pointless.
Hmm. Regardless of whether the TLB flush makes sense in that
smpboot_setup_warm_reset_vector() location (and I agree that it looks
a bit odd), the warning does look pretty relevant.
The __native_flush_tlb() function looks _very_ broken.
It does:
invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
/*
* If current->mm == NULL then we borrow a mm which may change
* during a task switch and therefore we must not be preempted
* while we write CR3 back:
*/
preempt_disable();
native_write_cr3(__native_read_cr3());
preempt_enable();
but why is that preempt-disabled region only around the cr3 write? The
invalidate_user_asid() logic seems to be very CPU-sensitive too.
And even if there is some reason why invalidate_user_asid() really can
do multiple different percpu accesses and it doesn't matter whether
the thread is bouncing around on different cpu's while it does it,
there doesn't seem to be any _reason_ not to just extend the preempt-disable
over the whole series.
It really looks strange how it does multiple reads (and then a final
write!) to percpu state, when the cpu can change in between.
So I'd suggest moving the preempt_disable() up to the top of that
function, regardless of whether we could then remove that seemingly
stale TLB flush in that crazy
smpboot_setup/restore_warm_reset_vector() dance...
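IOW, something like this (untested, just moving the existing
preempt_disable() up so that all the per-cpu accesses happen on one CPU):

	static inline void __native_flush_tlb(void)
	{
		/*
		 * Cover the loaded_mm_asid read, the per-cpu accesses in
		 * invalidate_user_asid() and the CR3 read+write with one
		 * preempt-disabled region.
		 */
		preempt_disable();
		invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
		native_write_cr3(__native_read_cr3());
		preempt_enable();
	}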
Andy?
Linus
On 12/30/2017 10:40 AM, Linus Torvalds wrote:
> The __native_flush_tlb() function looks _very_ broken.
...
> So I'd suggest moving the preempt_disable() up to the top of that
> function, regardless of whether we could then remove that seemingly
> stale TLB flush in that crazy
> smpboot_setup/restore_warm_reset_vector() dance...
If someone is calling __native_flush_tlb(), shouldn't they already be in
a state where they can't be preempted? It's fundamentally a one-cpu
thing, both the actual CPU TLB flush _and_ the per-cpu variables.
It seems like we might want to _remove_ the explicit
preempt_dis/enable() from here:
preempt_disable();
native_write_cr3(__native_read_cr3());
preempt_enable();
and add some warnings to ensure it's disabled when we enter
__native_flush_tlb().
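I.e., roughly (untested sketch):

	static inline void __native_flush_tlb(void)
	{
		/*
		 * Callers must already be non-preemptible: both the ASID
		 * tracking and the CR3 write only make sense pinned to
		 * one CPU.
		 */
		WARN_ON_ONCE(preemptible());
		invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
		native_write_cr3(__native_read_cr3());
	}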
On Sat, Dec 30, 2017 at 11:03 AM, Dave Hansen
<[email protected]> wrote:
> On 12/30/2017 10:40 AM, Linus Torvalds wrote:
>> The __native_flush_tlb() function looks _very_ broken.
> ...
>> So I'd suggest moving the preempt_disable() up to the top of that
>> function, regardless of whether we could then remove that seemingly
>> stale TLB flush in that crazy
>> smpboot_setup/restore_warm_reset_vector() dance...
>
> If someone is calling __native_flush_tlb(), shouldn't they already be in
> a state where they can't be preempted? It's fundamentally a one-cpu
> thing, both the actual CPU TLB flush _and_ the per-cpu variables.
Hmm. I think you're right.
> It seems like we might want to _remove_ the explicit
> preempt_dis/enable() from here:
>
> preempt_disable();
> native_write_cr3(__native_read_cr3());
> preempt_enable();
>
> and add some warnings to ensure it's disabled when we enter
> __native_flush_tlb().
Agreed, that would certainly also be consistent.
The current code that disables preemption only selectively seems
insane to me. Either all or nothing, not this crazy half-way thing.
Linus
--Andy
> On Dec 30, 2017, at 11:15 AM, Linus Torvalds <[email protected]> wrote:
>
> On Sat, Dec 30, 2017 at 11:03 AM, Dave Hansen
> <[email protected]> wrote:
>> On 12/30/2017 10:40 AM, Linus Torvalds wrote:
>>> The __native_flush_tlb() function looks _very_ broken.
>> ...
>>> So I'd suggest moving the preempt_disable() up to the top of that
>>> function, regardless of whether we could then remove that seemingly
>>> stale TLB flush in that crazy
>>> smpboot_setup/restore_warm_reset_vector() dance...
>>
>> If someone is calling __native_flush_tlb(), shouldn't they already be in
>> a state where they can't be preempted? It's fundamentally a one-cpu
>> thing, both the actual CPU TLB flush _and_ the per-cpu variables.
>
> Hmm. I think you're right.
>
>> It seems like we might want to _remove_ the explicit
>> preempt_dis/enable() from here:
>>
>> preempt_disable();
>> native_write_cr3(__native_read_cr3());
>> preempt_enable();
>>
>> and add some warnings to ensure it's disabled when we enter
>> __native_flush_tlb().
>
> Agreed, that would certainly also be consistent.
>
> The current code that disables preemption only selectively seems
> insane to me. Either all or nothing, not this crazy half-way thing.
Agreed. The current code is bogus. I'd rather have a warning if preemptible.
I'm reasonably confident that IRQs on but preempt off is okay.
>
> Linus
On Sat, 30 Dec 2017, Linus Torvalds wrote:
> On Sat, Dec 30, 2017 at 10:20 AM, Thomas Gleixner <[email protected]> wrote:
> So I'd suggest moving the preempt_disable() up to the top of that
> function, regardless of whether we could then remove that seemingly
> stale TLB flush in that crazy
> smpboot_setup/restore_warm_reset_vector() dance...
Regardless of the other bogosity, I did some digging and had to go back
into the 2.1 era to find the reason for this local_flush_tlb(). That old
code manipulated swapper_pg_dir in do_boot_cpu.
+ local_flush_tlb();
+ SMP_PRINTK(("1.\n"));
+ *((volatile unsigned short *) phys_to_virt(0x469)) = ((unsigned long)stack)>>4;
+ SMP_PRINTK(("2.\n"));
+ *((volatile unsigned short *) phys_to_virt(0x467)) = 0;
+ SMP_PRINTK(("3.\n"));
+
+ maincfg=swapper_pg_dir[0];
+ ((unsigned long *)swapper_pg_dir)[0]=0x102007;
.....
+ swapper_pg_dir[0]=maincfg;
+ local_flush_tlb();
That pagetable fiddling disappeared long ago, but the flush stayed around
forever.
Thanks,
tglx
On Sat, Dec 30, 2017 at 11:41 AM, Thomas Gleixner <[email protected]> wrote:
>
> That pagetable fiddling disappeared long ago, but the flush stayed around
> forever.
Yes, so it all makes sense, and those two tlb invalidates should just
be removed.
Good to have the reason for the left-over code understood.
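I.e. something like this (untested, hunk context approximate):

	--- a/arch/x86/kernel/smpboot.c
	+++ b/arch/x86/kernel/smpboot.c
	@@ smpboot_setup_warm_reset_vector():
	 	spin_unlock_irqrestore(&rtc_lock, flags);
	-	local_flush_tlb();
	@@ smpboot_restore_warm_reset_vector():
	 	unsigned long flags;

	-	/*
	-	 * Install writable page 0 entry to set BIOS data area.
	-	 */
	-	local_flush_tlb();
	-
	 	/*
	 	 * Paranoid:  Set warm reset code and vector here back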
Linus
On Sat, 30 Dec 2017, Linus Torvalds wrote:
> On Sat, Dec 30, 2017 at 10:20 AM, Thomas Gleixner <[email protected]> wrote:
> The __native_flush_tlb() function looks _very_ broken.
>
> It does:
>
> invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
> /*
> * If current->mm == NULL then we borrow a mm which may change
> * during a task switch and therefore we must not be preempted
> * while we write CR3 back:
> */
> preempt_disable();
> native_write_cr3(__native_read_cr3());
> preempt_enable();
>
> but why is that preempt-disabled region only around the cr3 write? The
> invalidate_user_asid() logic seems to be very CPU-sensitive too.
>
> And even if there is some reason why invalidate_user_asid() really can
> do multiple different percpu accesses and it doesn't matter whether
> the thread is bouncing around on different cpu's while it does it,
> there doesn't seem to be any _reason_ not to just extend the preempt-disable
> over the whole series.
>
> It really looks strange how it does multiple reads (and then a final
> write!) to percpu state, when the cpu can change in between.
>
> So I'd suggest moving the preempt_disable() up to the top of that
> function,
That preempt_disable()/enable() was added with:
commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e
Author: Sebastian Andrzej Siewior <[email protected]>
Date: Fri Aug 5 15:37:39 2016 +0200
x86/mm: Disable preemption during CR3 read+write
So we need to look at that scenario before removing it.
Thanks,
tglx
On Sat, Dec 30, 2017 at 12:28 PM, Thomas Gleixner <[email protected]> wrote:
>
> That preempt_disable()/enable() was added with:
>
> commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e
> Author: Sebastian Andrzej Siewior <[email protected]>
> Date: Fri Aug 5 15:37:39 2016 +0200
>
> x86/mm: Disable preemption during CR3 read+write
>
> So we need to look at that scenario before removing it.
Good point, but I think that was actually a mis-feature of the old
"__flush_tlb_up()" implementation that Andy got rid of in commit
ce4a4e565f52 ("x86/mm: Remove the UP asm/tlbflush.h code, always use
the (formerly) SMP code").
So the code sequence that that commit talks about no longer exists.
Instead, we now have the call to __flush_tlb() inside a
get_cpu/put_cpu, which is preempt-safe even on UP (despite the CPU
number obviously being fixed).
So I think Dave is right: we should just remove the
preempt_disable()/enable(). But adding a
WARN_ON_ONCE(preemptible());
might still be a good idea.
Linus
On Sat, 30 Dec 2017, Thomas Gleixner wrote:
> On Sat, 30 Dec 2017, Linus Torvalds wrote:
> > On Sat, Dec 30, 2017 at 10:20 AM, Thomas Gleixner <[email protected]> wrote:
> > The __native_flush_tlb() function looks _very_ broken.
> >
> > It does:
> >
> > invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
> > /*
> > * If current->mm == NULL then we borrow a mm which may change
> > * during a task switch and therefore we must not be preempted
> > * while we write CR3 back:
> > */
> > preempt_disable();
> > native_write_cr3(__native_read_cr3());
> > preempt_enable();
> >
> > but why is that preempt-disabled region only around the cr3 write? The
> > invalidate_user_asid() logic seems to be very CPU-sensitive too.
> >
> > And even if there is some reason why invalidate_user_asid() really can
> > do multiple different percpu accesses and it doesn't matter whether
> > the thread is bouncing around on different cpu's while it does it,
> > there doesn't seem to be any _reason_ not to just extend the preempt-disable
> > over the whole series.
> >
> > It really looks strange how it does multiple reads (and then a final
> > write!) to percpu state, when the cpu can change in between.
> >
> > So I'd suggest moving the preempt_disable() up to the top of that
> > function,
>
> That preempt_disable()/enable() was added with:
>
> commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e
> Author: Sebastian Andrzej Siewior <[email protected]>
> Date: Fri Aug 5 15:37:39 2016 +0200
>
> x86/mm: Disable preemption during CR3 read+write
>
> So we need to look at that scenario before removing it.
The UP distinction does not exist anymore since:
commit ce4a4e565f5264909a18c733b864c3f74467f69e
Author: Andy Lutomirski <[email protected]>
Date: Sun May 28 10:00:14 2017 -0700
x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code
The proper fix back then would have been to add the
preempt_disable()/enable() pair to
static inline void flush_tlb_mm_range(struct mm_struct *mm,
unsigned long start, unsigned long end, unsigned long vmflag)
{
if (mm == current->active_mm)
__flush_tlb_up();
}
and not into __native_flush_tlb().
The common flush_tlb_mm_range() properly disables preemption, so we are
good to remove that.
Thanks,
tglx
On Sat, Dec 30, 2017 at 12:47 PM, Thomas Gleixner <[email protected]> wrote:
>
> The UP distinction does not exist anymore since:
Heh. Crossed emails. We're very much in agreement.
Linus