2020-09-14 17:48:43

by Linus Torvalds

Subject: Re: Kernel Benchmarking

Michael et al,
Ok, I redid my failed "hybrid mode" patch from scratch (original
patch never sent out, I never got it to a working point).

Having learnt from my mistake, this time, instead of trying to mix the
old and the new code, I just extended the new code, and wrote a _lot_
of comments about it.

I also made it configurable, using a "page_lock_unfairness" knob,
which this patch defaults to 1000 (which is basically infinite).
That's just a value that says how many times we'll try the old unfair
case, so "1000" means "we'll re-queue up to a thousand times before we
say enough is enough" and zero is the fair mode that shows the
performance problems.

I've only (lightly) tested those two extremes; I suspect the
interesting values are in the 1-5 range.

So you can do

echo 0 > /proc/sys/vm/page_lock_unfairness
.. run test ..

and you should get the same numbers as without this patch (within
noise, of course).

Or do

echo 5 > /proc/sys/vm/page_lock_unfairness
.. run test ..

and get numbers for "we accept some unfairness, but if we have to
requeue more than five times, we force the fair mode".

Again, the default is 1000, which is ludicrously high (it's not a
"this many retries per page" count, it's a "for each waiter" count). I
made it that high just because I have *not* run any numbers for that
interesting range, I just checked the extreme cases, and I wanted to
make sure that Michael sees the old performance (modulo other changes
to 5.9, of course).

Comments? The patch has a fair amount of comments in it; in fact, the
code changes are reasonably small, and most of the diff is new and
updated comments about what is going on.

I was burnt by making a mess of this the first time, so I proceeded
more thoughtfully this time. Hopefully the end result is also better.

(Note that it's a commit and has a SHA1, but it's from my "throw-away
tree for testing", so it doesn't have my sign-off or any real commit
message yet: I'll do that once it gets actual testing and comments).

Linus


Attachments:
0001-Page-lock-unfairness-sysctl.patch (9.79 kB)

2020-09-14 20:27:38

by Matthieu Baerts

Subject: Re: Kernel Benchmarking

Hello everyone,

On 14/09/2020 19:47, Linus Torvalds wrote:
> Michael et al,
> Ok, I redid my failed "hybrid mode" patch from scratch (original
> patch never sent out, I never got it to a working point).
>
> Having learnt from my mistake, this time, instead of trying to mix the
> old and the new code, I just extended the new code, and wrote a _lot_
> of comments about it.
>
> I also made it configurable, using a "page_lock_unfairness" knob,
> which this patch defaults to 1000 (which is basically infinite).
> That's just a value that says how many times we'll try the old unfair
> case, so "1000" means "we'll re-queue up to a thousand times before we
> say enough is enough" and zero is the fair mode that shows the
> performance problems.

Thank you for the new patch and all the work around from everybody!

Sorry to jump into this thread, but I wanted to share my issue, which
is also linked to the same commit:

2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")

I have a simple test environment[1] using Docker and virtme[2], with an
almost-default kernel config, validating some tests for the MPTCP
Upstream project[3]. Some of these tests use a modified version of
packetdrill[4].

Recently, some of these packetdrill tests have been failing after 2
minutes (timeout) instead of being executed in a few seconds (~6
seconds). No packets are even exchanged during these two minutes.

I did a git bisect and it also pointed me to 2a9127fcf229.

I can run the same test 10 times without any issue with the parent
commit (v5.8 tag) but with 2a9127fcf229, I have a timeout most of the time.

Of course, when I try to add some debug info on the userspace or
kernel side, I can no longer reproduce the timeout. But without debug,
it is easy for me to check whether the issue is there or not. My issue
doesn't seem to be linked to a small file that needs to be read
multiple times from a FS: only a few bytes should be transferred with
packetdrill, and when there is a timeout, it happens even before that,
because I don't see any packets exchanged at all. I don't think
packetdrill does a lot of IO before transferring a few packets to a
"tun" interface, but I didn't analyse further.

With your new patch and the default value, I no longer have the issue.

> I've only (lightly) tested those two extremes; I suspect the
> interesting values are in the 1-5 range.
>
> So you can do
>
> echo 0 > /proc/sys/vm/page_lock_unfairness
> .. run test ..
>
> and you should get the same numbers as without this patch (within
> noise, of course).

On my side, I still have the issue with 0. Which is good, because that
is expected!

> Or do
>
> echo 5 > /proc/sys/vm/page_lock_unfairness
> .. run test ..
>
> and get numbers for "we accept some unfairness, but if we have to
> requeue more than five times, we force the fair mode".

Already with 1, it is fine on my side: no more timeouts! Same with 5.
I am not measuring performance, only checking that I can run
packetdrill without a timeout. With 1 and 5, the tests finish in a
normal time, which is really good: I didn't hit any timeout in 10
runs, each of them started from a fresh VM. Patch tested with success!

I would be glad to help by validating new modifications or providing new
info. My setup is also easy to put in place: a Docker image is built
with all required tools to start the same VM just like the one I have.
All scripts are on a public repository[1].

Please tell me if I can help!

Cheers,
Matt

[1]
https://github.com/multipath-tcp/mptcp_net-next/blob/scripts/ci/virtme.sh and
https://github.com/multipath-tcp/mptcp_net-next/blob/scripts/ci/Dockerfile.virtme.sh
[2] https://git.kernel.org/pub/scm/utils/kernel/virtme/virtme.git
[3] https://github.com/multipath-tcp/mptcp_net-next/wiki
[4] https://github.com/multipath-tcp/packetdrill
--
Tessares | Belgium | Hybrid Access Solutions
http://www.tessares.net

2020-09-14 20:55:02

by Linus Torvalds

Subject: Re: Kernel Benchmarking

On Mon, Sep 14, 2020 at 1:21 PM Matthieu Baerts
<[email protected]> wrote:
>
> Recently, some of these packetdrill tests have been failing after 2
> minutes (timeout) instead of being executed in a few seconds (~6
> seconds). No packets are even exchanged during these two minutes.

Hmm.

That sounds like a deadlock to me, and sounds like it's a latent bug
waiting to happen.

One way I can see that happening (with the fair page locking) is to do
something like this

thread A does:
lock_page()
do something

thread B:
lock_page - ends up blocking on the lock

thread A continue:
unlock_page() - for the fair case this now transfers the page lock
to thread B
.. do more work
lock_page() - this now blocks because B already owns the lock

thread B continues:
do something that requires A to have continued, but A is blocked on
B, and we have a classic ABBA deadlock

and the only difference here is that with the unfair locks, thread A
would get the page lock and finish whatever it did, and you'd never
see the deadlock.

And by "never" I mean "very very seldom". That's why it sounds like a
latent bug to me - the fact that it triggers with the fair locks
really makes me suspect that it *could* have triggered with the unfair
locks, it just never really did, because we didn't have that
synchronous lock transfer to the waiter.

One of the problems with the page lock is that it's a very very
special lock, and afaik has never worked with lockdep. So we have
absolutely _zero_ coverage of even the simplest ABBA deadlocks with
the page lock.

> I would be glad to help by validating new modifications or providing new
> info. My setup is also easy to put in place: a Docker image is built
> with all required tools to start the same VM just like the one I have.

I'm not familiar enough with packetdrill or any of that infrastructure
- does it do its own kernel modules etc for the packet latency
testing?

But it sounds like it's 100% repeatable with the fair page lock, which
is actually a good thing. It means that if you do a "sysrq-w" while
it's blocking, you should see exactly what is waiting for what.

(Except since it times out nicely eventually, probably at least part
of the waiting is interruptible, and then you need to do "sysrq-t"
instead and it's going to be _very_ verbose and much harder to
pinpoint things, and you'll probably need to have a very big printk
buffer).

There are obviously other ways to do it too - kgdb or whatever - which
you may or may not be more used to.

But sysrq is very traditional and often particularly easy if it's a
very repeatable "things are hung". Not nearly as good as lockdep, of
course. But if the machine is otherwise working, you can just do

echo 'w' > /proc/sysrq-trigger

in another terminal (and again, maybe you need 't', but then you
really want to do it *without* having a full GUI setup or anything
like that, to at least make it somewhat less verbose).

Aside: a quick google shows that Nick Piggin did try to extend lockdep
to the page lock many many years ago. I don't think it ever went
anywhere. To quote Avril Lavigne: "It's complicated".

Linus

2020-09-15 00:44:28

by Linus Torvalds

Subject: Re: Kernel Benchmarking

On Mon, Sep 14, 2020 at 1:53 PM Linus Torvalds
<[email protected]> wrote:
>
> One way I can see that happening (with the fair page locking) is to do
> something like this

Note that the "lock_page()" cases in that deadlock example sequence of
mine are not necessarily explicit lock_page() calls at all. They are
more likely to be something that indirectly causes one - like a user
access that faults in the page and takes lock_page() as part of that.

Linus

2020-09-15 22:51:47

by Matthieu Baerts

Subject: Re: Kernel Benchmarking

Hi Linus,

Thank you very much for your reply, with very clear explanations and
instructions!

On 14/09/2020 22:53, Linus Torvalds wrote:
> On Mon, Sep 14, 2020 at 1:21 PM Matthieu Baerts
> <[email protected]> wrote:
>>
>> Recently, some of these packetdrill tests have been failing after 2
>> minutes (timeout) instead of being executed in a few seconds (~6
>> seconds). No packets are even exchanged during these two minutes.
>
> Hmm.
>
> That sounds like a deadlock to me, and sounds like it's a latent bug
> waiting to happen.

Yesterday evening, I wanted to get confirmation about that using
PROVE_LOCKING but just like today, each time I enable this kconfig, I
cannot reproduce the issue.

Anyway, I am sure you are right and this bug has been there for some
time but is too hard to reproduce.

>> I would be glad to help by validating new modifications or providing new
>> info. My setup is also easy to put in place: a Docker image is built
>> with all required tools to start the same VM just like the one I have.
>
> I'm not familiar enough with packetdrill or any of that infrastructure
> - does it do its own kernel modules etc for the packet latency
> testing?

No, Packetdrill doesn't load any kernel module.

Here is a short description of the execution model of Packetdrill from a
paper the authors wrote:

packetdrill parses an entire test script, and then executes each
timestamped line in real time -- at the pace described by the
timestamps -- to replay and verify the scenario.
- For each system call line, packetdrill executes the system call
and verifies that it returns the expected result.
- For each command line, packetdrill executes the shell command.
- For each incoming packet (denoted by a leading < on the line),
packetdrill constructs a packet and injects it into the kernel.
- For each outgoing packet (denoted by a leading > on the line),
packetdrill sniffs the next outgoing packet and verifies that the
packet's timing and contents match the script.

Source: https://research.google/pubs/pub41316/

> But it sounds like it's 100% repeatable with the fair page lock, which
> is actually a good thing. It means that if you do a "sysrq-w" while
> it's blocking, you should see exactly what is waiting for what.
>
> (Except since it times out nicely eventually, probably at least part
> of the waiting is interruptible, and then you need to do "sysrq-t"
> instead and it's going to be _very_ verbose and much harder to
> pinpoint things, and you'll probably need to have a very big printk
> buffer).

Thank you for this idea! I was focused on using lockdep and I forgot
about this simple method. It is not (yet) a reflex for me to use it!

I think I got an interesting trace, taken 20 seconds after starting
packetdrill:


------------------- 8< -------------------
[ 25.507563] sysrq: Show Blocked State
[ 25.510695] task:packetdrill state:D stack:13848 pid: 188 ppid:
155 flags:0x00004000
[ 25.517841] Call Trace:
[ 25.520103] __schedule+0x3eb/0x680
[ 25.523197] schedule+0x45/0xb0
[ 25.526013] io_schedule+0xd/0x30
[ 25.528964] __lock_page_killable+0x13e/0x280
[ 25.532794] ? file_fdatawait_range+0x20/0x20
[ 25.536605] filemap_fault+0x6b4/0x970
[ 25.539911] ? filemap_map_pages+0x195/0x330
[ 25.543682] __do_fault+0x32/0x90
[ 25.546620] handle_mm_fault+0x8c1/0xe50
[ 25.550050] ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 25.554637] __get_user_pages+0x25c/0x750
[ 25.558101] populate_vma_page_range+0x57/0x60
[ 25.561968] __mm_populate+0xa9/0x150
[ 25.565125] __x64_sys_mlockall+0x151/0x180
[ 25.568787] do_syscall_64+0x33/0x40
[ 25.571915] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 25.576230] RIP: 0033:0x7f21bee46b3b
[ 25.579357] Code: Bad RIP value.
[ 25.582199] RSP: 002b:00007ffcb5f8ad38 EFLAGS: 00000246 ORIG_RAX:
0000000000000097
[ 25.588588] RAX: ffffffffffffffda RBX: 000055c9762f1450 RCX:
00007f21bee46b3b
[ 25.594627] RDX: 00007ffcb5f8ad28 RSI: 0000000000000002 RDI:
0000000000000003
[ 25.600637] RBP: 00007ffcb5f8ad40 R08: 0000000000000001 R09:
0000000000000000
[ 25.606701] R10: 00007f21beec9ac0 R11: 0000000000000246 R12:
000055c9762b30a0
[ 25.612738] R13: 00007ffcb5f8b180 R14: 0000000000000000 R15:
0000000000000000
[ 25.618762] task:packetdrill state:D stack:13952 pid: 190 ppid:
153 flags:0x00004000
[ 25.625781] Call Trace:
[ 25.627987] __schedule+0x3eb/0x680
[ 25.631046] schedule+0x45/0xb0
[ 25.633796] io_schedule+0xd/0x30
[ 25.636726] ? wake_up_page_bit+0xd1/0x100
[ 25.640271] ? file_fdatawait_range+0x20/0x20
[ 25.644022] ? filemap_fault+0x6b4/0x970
[ 25.647427] ? filemap_map_pages+0x195/0x330
[ 25.651146] ? __do_fault+0x32/0x90
[ 25.654227] ? handle_mm_fault+0x8c1/0xe50
[ 25.657752] ? __get_user_pages+0x25c/0x750
[ 25.661368] ? populate_vma_page_range+0x57/0x60
[ 25.665338] ? __mm_populate+0xa9/0x150
[ 25.668707] ? __x64_sys_mlockall+0x151/0x180
[ 25.672467] ? do_syscall_64+0x33/0x40
[ 25.675751] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 25.680213] task:packetdrill state:D stack:13952 pid: 193 ppid:
160 flags:0x00004000
[ 25.687285] Call Trace:
[ 25.689472] __schedule+0x3eb/0x680
[ 25.692547] schedule+0x45/0xb0
[ 25.695314] io_schedule+0xd/0x30
[ 25.698216] __lock_page_killable+0x13e/0x280
[ 25.702013] ? file_fdatawait_range+0x20/0x20
[ 25.705752] filemap_fault+0x6b4/0x970
[ 25.709010] ? filemap_map_pages+0x195/0x330
[ 25.712691] __do_fault+0x32/0x90
[ 25.715620] handle_mm_fault+0x8c1/0xe50
[ 25.719013] __get_user_pages+0x25c/0x750
[ 25.722485] populate_vma_page_range+0x57/0x60
[ 25.726326] __mm_populate+0xa9/0x150
[ 25.729528] __x64_sys_mlockall+0x151/0x180
[ 25.733138] do_syscall_64+0x33/0x40
[ 25.736263] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 25.740587] RIP: 0033:0x7feb59c16b3b
[ 25.743716] Code: Bad RIP value.
[ 25.746653] RSP: 002b:00007ffd75ef7f38 EFLAGS: 00000246 ORIG_RAX:
0000000000000097
[ 25.753019] RAX: ffffffffffffffda RBX: 0000562a49acc450 RCX:
00007feb59c16b3b
[ 25.759077] RDX: 00007ffd75ef7f28 RSI: 0000000000000002 RDI:
0000000000000003
[ 25.765127] RBP: 00007ffd75ef7f40 R08: 0000000000000001 R09:
0000000000000000
[ 25.771231] R10: 00007feb59c99ac0 R11: 0000000000000246 R12:
0000562a49a8e0a0
[ 25.777442] R13: 00007ffd75ef8380 R14: 0000000000000000 R15:
0000000000000000
[ 25.783496] task:packetdrill state:D stack:13952 pid: 194 ppid:
157 flags:0x00004000
[ 25.790536] Call Trace:
[ 25.792726] __schedule+0x3eb/0x680
[ 25.795777] schedule+0x45/0xb0
[ 25.798582] io_schedule+0xd/0x30
[ 25.801473] __lock_page_killable+0x13e/0x280
[ 25.805246] ? file_fdatawait_range+0x20/0x20
[ 25.809015] filemap_fault+0x6b4/0x970
[ 25.812279] ? filemap_map_pages+0x195/0x330
[ 25.815981] __do_fault+0x32/0x90
[ 25.818909] handle_mm_fault+0x8c1/0xe50
[ 25.822458] __get_user_pages+0x25c/0x750
[ 25.825947] populate_vma_page_range+0x57/0x60
[ 25.829775] __mm_populate+0xa9/0x150
[ 25.832973] __x64_sys_mlockall+0x151/0x180
[ 25.836591] do_syscall_64+0x33/0x40
[ 25.839715] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 25.844089] RIP: 0033:0x7f1bdd340b3b
[ 25.847219] Code: Bad RIP value.
[ 25.850079] RSP: 002b:00007fff992f49e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000097
[ 25.856446] RAX: ffffffffffffffda RBX: 0000557ddd3b8450 RCX:
00007f1bdd340b3b
[ 25.862481] RDX: 00007fff992f49d8 RSI: 0000000000000002 RDI:
0000000000000003
[ 25.868455] RBP: 00007fff992f49f0 R08: 0000000000000001 R09:
0000000000000000
[ 25.874528] R10: 00007f1bdd3c3ac0 R11: 0000000000000246 R12:
0000557ddd37a0a0
[ 25.880541] R13: 00007fff992f4e30 R14: 0000000000000000 R15:
0000000000000000
[ 25.886556] task:packetdrill state:D stack:13952 pid: 200 ppid:
162 flags:0x00004000
[ 25.893568] Call Trace:
[ 25.895776] __schedule+0x3eb/0x680
[ 25.898833] schedule+0x45/0xb0
[ 25.901578] io_schedule+0xd/0x30
[ 25.904495] __lock_page_killable+0x13e/0x280
[ 25.908246] ? file_fdatawait_range+0x20/0x20
[ 25.912012] filemap_fault+0x6b4/0x970
[ 25.915270] ? filemap_map_pages+0x195/0x330
[ 25.918964] __do_fault+0x32/0x90
[ 25.921853] handle_mm_fault+0x8c1/0xe50
[ 25.925245] __get_user_pages+0x25c/0x750
[ 25.928720] populate_vma_page_range+0x57/0x60
[ 25.932543] __mm_populate+0xa9/0x150
[ 25.935727] __x64_sys_mlockall+0x151/0x180
[ 25.939348] do_syscall_64+0x33/0x40
[ 25.942466] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 25.946906] RIP: 0033:0x7fb34860bb3b
[ 25.950026] Code: Bad RIP value.
[ 25.952846] RSP: 002b:00007ffea61b7668 EFLAGS: 00000246 ORIG_RAX:
0000000000000097
[ 25.959289] RAX: ffffffffffffffda RBX: 000055c6f01c2450 RCX:
00007fb34860bb3b
[ 25.965453] RDX: 00007ffea61b7658 RSI: 0000000000000002 RDI:
0000000000000003
[ 25.971504] RBP: 00007ffea61b7670 R08: 0000000000000001 R09:
0000000000000000
[ 25.977505] R10: 00007fb34868eac0 R11: 0000000000000246 R12:
000055c6f01840a0
[ 25.983596] R13: 00007ffea61b7ab0 R14: 0000000000000000 R15:
0000000000000000
[ 25.989598] task:packetdrill state:D stack:13952 pid: 203 ppid:
169 flags:0x00004000
[ 25.996611] Call Trace:
[ 25.998823] __schedule+0x3eb/0x680
[ 26.001863] schedule+0x45/0xb0
[ 26.004645] io_schedule+0xd/0x30
[ 26.007576] ? wake_up_page_bit+0xd1/0x100
[ 26.011133] ? file_fdatawait_range+0x20/0x20
[ 26.014900] ? filemap_fault+0x6b4/0x970
[ 26.018282] ? filemap_map_pages+0x195/0x330
[ 26.021973] ? __do_fault+0x32/0x90
[ 26.025017] ? handle_mm_fault+0x8c1/0xe50
[ 26.028551] ? __get_user_pages+0x25c/0x750
[ 26.032163] ? populate_vma_page_range+0x57/0x60
[ 26.036134] ? __mm_populate+0xa9/0x150
[ 26.039487] ? __x64_sys_mlockall+0x151/0x180
[ 26.043260] ? do_syscall_64+0x33/0x40
[ 26.046528] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 26.050968] task:packetdrill state:D stack:13904 pid: 207 ppid:
173 flags:0x00004000
[ 26.058008] Call Trace:
[ 26.060192] __schedule+0x3eb/0x680
[ 26.063248] schedule+0x45/0xb0
[ 26.066032] io_schedule+0xd/0x30
[ 26.068924] __lock_page_killable+0x13e/0x280
[ 26.072677] ? file_fdatawait_range+0x20/0x20
[ 26.076429] filemap_fault+0x6b4/0x970
[ 26.079704] ? filemap_map_pages+0x195/0x330
[ 26.083424] __do_fault+0x32/0x90
[ 26.086342] handle_mm_fault+0x8c1/0xe50
[ 26.089722] __get_user_pages+0x25c/0x750
[ 26.093209] populate_vma_page_range+0x57/0x60
[ 26.097040] __mm_populate+0xa9/0x150
[ 26.100218] __x64_sys_mlockall+0x151/0x180
[ 26.103837] do_syscall_64+0x33/0x40
[ 26.106948] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 26.111256] RIP: 0033:0x7f90fb829b3b
[ 26.114383] Code: Bad RIP value.
[ 26.117183] RSP: 002b:00007ffc3ae07ea8 EFLAGS: 00000246 ORIG_RAX:
0000000000000097
[ 26.123589] RAX: ffffffffffffffda RBX: 0000560bf837d450 RCX:
00007f90fb829b3b
[ 26.129614] RDX: 00007ffc3ae07e98 RSI: 0000000000000002 RDI:
0000000000000003
[ 26.135641] RBP: 00007ffc3ae07eb0 R08: 0000000000000001 R09:
0000000000000000
[ 26.141660] R10: 00007f90fb8acac0 R11: 0000000000000246 R12:
0000560bf833f0a0
[ 26.147675] R13: 00007ffc3ae082f0 R14: 0000000000000000 R15:
0000000000000000
[ 26.153693] task:packetdrill state:D stack:13952 pid: 210 ppid:
179 flags:0x00004000
[ 26.160728] Call Trace:
[ 26.162923] ? sched_clock_cpu+0x95/0xa0
[ 26.166326] ? ttwu_do_wakeup.isra.0+0x34/0xe0
[ 26.170172] ? __schedule+0x3eb/0x680
[ 26.173349] ? schedule+0x45/0xb0
[ 26.176271] ? io_schedule+0xd/0x30
[ 26.179320] ? __lock_page_killable+0x13e/0x280
[ 26.183216] ? file_fdatawait_range+0x20/0x20
[ 26.187031] ? filemap_fault+0x6b4/0x970
[ 26.190451] ? filemap_map_pages+0x195/0x330
[ 26.194128] ? __do_fault+0x32/0x90
[ 26.197161] ? handle_mm_fault+0x8c1/0xe50
[ 26.200692] ? push_rt_tasks+0xc/0x20
[ 26.203866] ? __get_user_pages+0x25c/0x750
[ 26.207474] ? populate_vma_page_range+0x57/0x60
[ 26.211423] ? __mm_populate+0xa9/0x150
[ 26.214763] ? __x64_sys_mlockall+0x151/0x180
[ 26.218506] ? do_syscall_64+0x33/0x40
[ 26.221757] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 26.226216] task:packetdrill state:D stack:13952 pid: 212 ppid:
185 flags:0x00004000
[ 26.233223] Call Trace:
[ 26.235435] __schedule+0x3eb/0x680
[ 26.238487] schedule+0x45/0xb0
[ 26.241234] io_schedule+0xd/0x30
[ 26.244133] __lock_page_killable+0x13e/0x280
[ 26.247890] ? file_fdatawait_range+0x20/0x20
[ 26.251647] filemap_fault+0x6b4/0x970
[ 26.254906] ? filemap_map_pages+0x195/0x330
[ 26.258590] __do_fault+0x32/0x90
[ 26.261462] handle_mm_fault+0x8c1/0xe50
[ 26.264869] __get_user_pages+0x25c/0x750
[ 26.268327] populate_vma_page_range+0x57/0x60
[ 26.272162] __mm_populate+0xa9/0x150
[ 26.275347] __x64_sys_mlockall+0x151/0x180
[ 26.278970] do_syscall_64+0x33/0x40
[ 26.282082] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 26.286466] RIP: 0033:0x7f00e8863b3b
[ 26.289574] Code: Bad RIP value.
[ 26.292420] RSP: 002b:00007fff5b28f378 EFLAGS: 00000246 ORIG_RAX:
0000000000000097
[ 26.298797] RAX: ffffffffffffffda RBX: 000055ea3bc97450 RCX:
00007f00e8863b3b
[ 26.304787] RDX: 00007fff5b28f368 RSI: 0000000000000002 RDI:
0000000000000003
[ 26.310867] RBP: 00007fff5b28f380 R08: 0000000000000001 R09:
0000000000000000
[ 26.316697] R10: 00007f00e88e6ac0 R11: 0000000000000246 R12:
000055ea3bc590a0
[ 26.322525] R13: 00007fff5b28f7c0 R14: 0000000000000000 R15:
0000000000000000
------------------- 8< -------------------


A version processed with "decode_stacktrace.sh" is also attached to
this email; I was not sure it would be readable here.
Please tell me if anything else is needed.

One more thing: only when I have the issue, I also see this kernel
message, which seems clearly linked:

[ 7.198259] sched: RT throttling activated

> There are obviously other ways to do it too - kgdb or whatever - which
> you may or may not be more used to.

I never tried to use kgdb with this setup, but it is clearly a good
opportunity to start! I guess I will be able to easily reproduce the
issue and then generate the crash dump.

> But sysrq is very traditional and often particularly easy if it's a
> very repeatable "things are hung". Not nearly as good as lockdep, of
> course. But if the machine is otherwise working, you can just do
>
> echo 'w' > /proc/sysrq-trigger
>
> in another terminal (and again, maybe you need 't', but then you
> really want to do it *without* having a full GUI setup or anything
> like that, to at least make it somewhat less verbose).

Please tell me if the trace I shared above is helpful. If not I can
easily share the long output from sysrq-t -- no GUI nor any other
services are running in the background -- and if needed I can prioritise
the generation of a crash dump + analysis.

> Aside: a quick google shows that Nick Piggin did try to extend lockdep
> to the page lock many many years ago. I don't think it ever went
> anywhere. To quote Avril Lavigne: "It's complicated".

:-)

Cheers,
Matt
--
Tessares | Belgium | Hybrid Access Solutions
http://www.tessares.net


Attachments:
sysrq_w_1_analysed.txt (13.20 kB)

2020-09-17 17:53:08

by Linus Torvalds

Subject: Re: Kernel Benchmarking

On Mon, Sep 14, 2020 at 10:47 AM Linus Torvalds
<[email protected]> wrote:
>
> (Note that it's a commit and has a SHA1, but it's from my "throw-away
> tree for testing", so it doesn't have my sign-off or any real commit
> message yet: I'll do that once it gets actual testing and comments).

Just to keep the list and the people who were on this thread informed:
Michal ended up doing more benchmarking, and everything seems to line
up, and yes, that patch continues to work fine with an 'unfairness'
value of 5.

So I've committed it to my git tree (not pushed out yet, I have other
pull requests etc I'm handling too), and we'll see if anybody can come
up with a better model for how to avoid the page locking being such a
pain. Or if somebody can figure out why fair locking causes problems
for that packetdrill load that Matthieu reported.

It does strike me that if the main source of contention comes from
that "we need to check that the mapping is still valid as we insert
the page into the page tables", then the page lock really isn't the
obvious lock to use.

It would be much more natural to use the mapping->i_mmap_rwsem, I feel.

Willy? Your "just check for uptodate without any lock" patch itself
feels wrong. That's what we do for plain reads, but the difference is
that a read is a one-time event and a race is fine: we get valid data,
it's just that it's only valid *concurrently* with the truncate or
hole-punching event (ie either all zeroes or old data is fine).

The reason faulting a page in is different from a read is that if you
then map in a stale page, it might have had the correct contents at
the time of the fault, but it will not have the correct contents going
forward.

So a page-in requires fundamentally stronger locking than a read()
does, because of how the page-in causes that "future lifetime" of the
page, in ways a read() event does not.

But truncation that does page cache removal already requires that
i_mmap_rwsem, and in fact the VM already very much uses that (ie when
walking the page mapping).

The other alternative might be just the mapping->private_lock. It's
not a reader-writer lock, but if we don't need to sleep (and I don't
think the final "check ->mapping" can sleep anyway since it has to be
done together with the page table lock), a spinlock would be fine.

Linus