2020-09-15 19:41:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel Benchmarking

On Tue, Sep 15, 2020 at 12:26 PM Matthieu Baerts
<[email protected]> wrote:
>
> I don't know if this info is useful but I just checked and I can
> reproduce the issue with a single CPU.

Good thinking.

> And the trace is very similar to the previous one:

.. and yes, now there are no messy traces, they all have that
__lock_page_killable() unambiguously in them (and the only '?' entries
are just from stale stuff on the stack which is due to stack frames
that aren't fully initialized and old stack frame data shining
through).

So it does seem like the previous trace uncertainty was likely just a
cross-CPU issue.

Was that an actual UP kernel? It might be good to try that too, just
to see if it could be an SMP race in the page locking code.
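(To spell out what I mean by an actual UP kernel: one built with CONFIG_SMP=n, which compiles out the SMP locking paths entirely, rather than an SMP kernel booted on one CPU. Roughly, assuming the usual in-tree helpers:)

```shell
# Build a true uniprocessor kernel: CONFIG_SMP=n removes the SMP
# spinlock/barrier paths at compile time, unlike booting an SMP
# kernel with a single CPU.
./scripts/config --disable SMP
make olddefconfig
make -j"$(nproc)"

# Illustrative qemu invocation: no -smp option at all, so one vCPU.
# qemu-system-x86_64 -kernel arch/x86/boot/bzImage -append "console=ttyS0" ...
```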

After all, one such theoretical race was one reason I started the rewrite.

Linus


2020-09-15 19:58:27

by Matthieu Baerts

Subject: Re: Kernel Benchmarking

On 15/09/2020 21:32, Linus Torvalds wrote:
> Was that an actual UP kernel? It might be good to try that too, just
> to see if it could be an SMP race in the page locking code.

I am sorry, I am not sure how to verify this. I guess it was one
processor because I removed "-smp 2" option from qemu. So I guess it
switched to a uniprocessor mode.

Also, when I did the test and to make sure I was using only one CPU, I
also printed the output of /proc/cpuinfo:


+ cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC 7401P 24-Core Processor
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 2000.000
cache size	: 512 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 virt_ssbd arat npt nrip_save arch_capabilities
bugs		: fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 4000.00
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:


Do you want me to try another qemu config?

Sorry, it is getting late for me but I also forgot to mention earlier
that with 1 CPU and your new sysctl set to 1, it didn't reproduce my
issue for 6 executions.

> After all, one such theoretical race was one reason I started the rewrite.

And that's a good thing, thank you! :)

Cheers,
Matt
--
Tessares | Belgium | Hybrid Access Solutions
http://www.tessares.net

2020-09-15 23:37:48

by Linus Torvalds

Subject: Re: Kernel Benchmarking

On Tue, Sep 15, 2020 at 12:56 PM Matthieu Baerts
<[email protected]> wrote:
>
> I am sorry, I am not sure how to verify this. I guess it was one
> processor because I removed "-smp 2" option from qemu. So I guess it
> switched to a uniprocessor mode.

Ok, that all sounds fine. So yes, your problem happens even with just
one CPU, and it's not any subtle SMP race.

Which is all good - apart from the bug existing in the first place, of
course. It just reinforces the "it's probably a latent deadlock"
thing.

Linus

2020-09-16 10:44:18

by Jan Kara

Subject: Re: Kernel Benchmarking

On Tue 15-09-20 16:35:45, Linus Torvalds wrote:
> On Tue, Sep 15, 2020 at 12:56 PM Matthieu Baerts
> <[email protected]> wrote:
> >
> > I am sorry, I am not sure how to verify this. I guess it was one
> > processor because I removed "-smp 2" option from qemu. So I guess it
> > switched to a uniprocessor mode.
>
> Ok, that all sounds fine. So yes, your problem happens even with just
> one CPU, and it's not any subtle SMP race.
>
> Which is all good - apart from the bug existing in the first place, of
> course. It just reinforces the "it's probably a latent deadlock"
> thing.

So from the traces another theory that appeared to me is that it could be a
"missed wakeup" problem. Looking at the code in wait_on_page_bit_common() I
found one suspicious thing (which isn't a great match because the problem
seems to happen on UP as well and I think it's mostly a theoretical issue but
still I'll write it here):

wait_on_page_bit_common() has:

	spin_lock_irq(&q->lock);
	SetPageWaiters(page);
	if (!trylock_page_bit_common(page, bit_nr, wait))
	- which expands to:
	(
		if (wait->flags & WQ_FLAG_EXCLUSIVE) {
			if (test_and_set_bit(bit_nr, &page->flags))
				return false;
		} else if (test_bit(bit_nr, &page->flags))
			return false;
	)
		__add_wait_queue_entry_tail(q, wait);
spin_unlock_irq(&q->lock);

Now the suspicious thing is the ordering here. What prevents the compiler
(or the CPU for that matter) from reordering SetPageWaiters() call behind
the __add_wait_queue_entry_tail() call? I know SetPageWaiters() and
test_and_set_bit() operate on the same long but is it really guaranteed
something doesn't reorder these?

In unlock_page() we have:

	if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
		wake_up_page_bit(page, PG_locked);

So if the reordering happens, clear_bit_unlock_is_negative_byte() could
return false even though we have a waiter queued.

And this seems to be a thing commit 2a9127fcf22 ("mm: rewrite
wait_on_page_bit_common() logic") introduced because before we had
set_current_state() between SetPageWaiters() and test_bit() which implies a
memory barrier.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-16 18:48:37

by Linus Torvalds

Subject: Re: Kernel Benchmarking

On Wed, Sep 16, 2020 at 3:34 AM Jan Kara <[email protected]> wrote:
>
> wait_on_page_bit_common() has:
>
> 	spin_lock_irq(&q->lock);
> 	SetPageWaiters(page);
> 	if (!trylock_page_bit_common(page, bit_nr, wait))
> 	- which expands to:
> 	(
> 		if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> 			if (test_and_set_bit(bit_nr, &page->flags))
> 				return false;
> 		} else if (test_bit(bit_nr, &page->flags))
> 			return false;
> 	)
> 		__add_wait_queue_entry_tail(q, wait);
> spin_unlock_irq(&q->lock);
>
> Now the suspicious thing is the ordering here. What prevents the compiler
> (or the CPU for that matter) from reordering SetPageWaiters() call behind
> the __add_wait_queue_entry_tail() call? I know SetPageWaiters() and
> test_and_set_bit() operate on the same long but is it really guaranteed
> something doesn't reorder these?

I agree that we might want to make this much more specific, but no,
those things can't be re-ordered.

Part of it is very much that memory ordering is only about two
different locations - accessing the *same* location is always ordered,
even on weakly ordered CPU's.

And the compiler can't re-order them either, we very much make
test_and_set_bit() have the proper barriers. We'd be very screwed if a
"set_bit()" could pass a "test_and_set_bit".

That said, the PageWaiters bit is obviously very much by design in the
same word as the bit we're testing and setting - because the whole
point is that we can then at clear time check the PageWaiters bit
atomically with the bit we're clearing.

So this code optimally shouldn't use separate operations for those
bits at all. It would be better to just have one atomic sequence with
a cmpxchg that does both at the same time.

So I agree that sequence isn't wonderful. But no, I don't think this is the bug.

And as you mention, Matthieu sees it on UP, so memory ordering
wouldn't have been an issue anyway (and compiler re-ordering would
cause all kinds of other problems and break our bit operations
entirely).

Plus if it was some subtle bug like that, it wouldn't trigger as
consistently as it apparently does for Matthieu.

Of course, the reason that I don't see how it can trigger at all (I
still like my ABBA deadlock scenario, but I don't see anybody holding
any crossing locks in Matthieu's list of processes) means that I'm
clearly missing something.

Linus