2020-07-14 02:55:32

by Francesco Ruggeri

Subject: soft lockup in fanotify_read

We are getting this soft lockup in fanotify_read.
The reason is that this code does not seem to scale to cases where there
are big bursts of events generated by fanotify_handle_event.
fanotify_read acquires group->notification_lock for each event.
fanotify_handle_event uses the lock to add one event, which also involves
fanotify_merge, which scans the whole list trying to find an event to
merge the new one with.
In our case fanotify_read is invoked with a buffer big enough for 200
events, and what happens is that every time fanotify_read dequeues an
event and releases the lock, fanotify_handle_event adds several more,
scanning a longer and longer list. This causes fanotify_read to wait
longer and longer for the lock, and the soft lockup happens before
fanotify_read can reach 200 events.
Is it intentional for fanotify_read to acquire the lock for each event,
rather than batching together a user buffer worth of events?
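
For illustration, the interaction looks roughly like this (a simplified
sketch of the two code paths, not verbatim kernel source; the helper names
are approximations):

/* Reader side: fanotify_read() takes the lock once per event dequeued. */
while (bytes_copied < count) {
	spin_lock(&group->notification_lock);
	event = dequeue_first_event(group);	/* placeholder for the real helper */
	spin_unlock(&group->notification_lock);

	if (!event)
		break;				/* queue drained: wait or return */

	bytes_copied += copy_event_to_user(group, event, buf + bytes_copied);
}

/* Writer side: fanotify_handle_event() walks the whole queue under the
 * same lock looking for an event to merge the new one with. */
spin_lock(&group->notification_lock);
merged = false;
list_for_each_entry_reverse(test_event, &group->notification_list, list) {
	if (should_merge(test_event, new_event)) {	/* O(queue length) */
		merged = true;
		break;
	}
}
if (!merged)
	list_add_tail(&new_event->list, &group->notification_list);
spin_unlock(&group->notification_lock);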

Thanks,
Francesco Ruggeri

[5752801.578813] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fstrace:23105]
[5752801.586804] Modules linked in: ...
[5752801.586871] CPU: 15 PID: 23105 Comm: fstrace Tainted: G W O L 4.19.112-16802951.AroraKernel419.el7.x86_64 #1
[5752801.586872] Hardware name: Supermicro SYS-6029TP-HTR/X11DPT-PS, BIOS 3.1 04/30/2019
[5752801.586879] RIP: 0010:queued_spin_lock_slowpath+0x141/0x17c
[5752801.586881] Code: 5e c1 e9 12 83 e0 03 ff c9 48 c1 e0 04 48 63 c9 48 05 c0 14 02 00 48 03 04 cd 20 e7 e9 81 48 89 10 8b 42 08 85 c0 75 04 f3 90 <eb> f5 48 8b 0a 48 85 c9 74 03 0f 0d 09 8b 07 66 85 c0 74 93 f3 90
[5752801.586882] RSP: 0018:ffffc90027c5fd18 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[5752801.586883] RAX: 0000000000000000 RBX: ffff8883a9078ce8 RCX: 000000000000001e
[5752801.586884] RDX: ffff88df7f6e14c0 RSI: 0000000000400000 RDI: ffff88b06306e30c
[5752801.586885] RBP: ffffc90027c5fd18 R08: 0000000000000001 R09: ffffffff811e8b7e
[5752801.586886] R10: ffffc90027c5fce0 R11: 0000000000000000 R12: ffff88b06306e300
[5752801.586887] R13: 00000000ff935638 R14: ffff88b4af422600 R15: 0000000000000018
[5752801.586888] FS: 0000000000000000(0000) GS:ffff88df7f6c0000(0063) knlGS:0000000009c6a880
[5752801.586889] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[5752801.586890] CR2: 0000000009430914 CR3: 000000035bbef006 CR4: 00000000007606e0
[5752801.586891] DR0: 000000000001864c DR1: 0000000000000000 DR2: 0000000000000000
[5752801.586891] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[5752801.586892] PKRU: 55555554
[5752801.586892] Call Trace:
[5752801.586898] do_raw_spin_lock+0x1a/0x1c
[5752801.586901] _raw_spin_lock+0xe/0x10
[5752801.586905] fanotify_read+0xa3/0x32d
[5752801.586912] ? __wake_up_sync+0x12/0x12
[5752801.586916] copy_oldmem_page+0xa9/0xa9
[5752801.586920] ? fsnotify_perm+0x60/0x6c
[5752801.586921] ? fsnotify_perm+0x60/0x6c
[5752801.586923] ? security_file_permission+0x37/0x3e
[5752801.586926] vfs_read+0xa4/0xdc
[5752801.586928] ksys_read+0x64/0xab
[5752801.586930] __ia32_sys_read+0x18/0x1a
[5752801.586933] do_fast_syscall_32+0xaf/0xf6
[5752801.586935] entry_SYSENTER_compat+0x6b/0x7a
[5752801.586936] RIP: 0023:0xf7f64c29
[5752801.586938] Code: 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 1c 24 c3 8b 34 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
[5752801.586939] RSP: 002b:00000000ff933828 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[5752801.586940] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000ff9348d0
[5752801.586941] RDX: 00000000000012c0 RSI: 0000000009c72da8 RDI: 00000000ff9348d0
[5752801.586941] RBP: 00000000ff935ba8 R08: 0000000000000000 R09: 0000000000000000
[5752801.586942] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[5752801.586942] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000


2020-07-14 13:11:33

by Amir Goldstein

Subject: Re: soft lockup in fanotify_read

On Tue, Jul 14, 2020 at 5:54 AM Francesco Ruggeri <[email protected]> wrote:
>
> We are getting this soft lockup in fanotify_read.
> The reason is that this code does not seem to scale to cases where there
> are big bursts of events generated by fanotify_handle_event.
> fanotify_read acquires group->notification_lock for each event.
> fanotify_handle_event uses the lock to add one event, which also involves
> fanotify_merge, which scans the whole list trying to find an event to
> merge the new one with.

Yes, that is a terribly inefficient merge algorithm.
If it helps, I am carrying a quick brown-paper-bag fix for this issue in my tree:

@@ -65,6 +74,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
 {
 	struct fsnotify_event *test_event;
 	struct fanotify_event *new;
+	int limit = 128;
+	int i = 0;

 	pr_debug("%s: list=%p event=%p\n", __func__, list, event);
 	new = FANOTIFY_E(event);

@@ -78,6 +89,9 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
 		return 0;

 	list_for_each_entry_reverse(test_event, list, list) {
+		/* Event merges are expensive so should be limited */
+		if (++i > limit)
+			break;
 		if (should_merge(test_event, event)) {

It's somewhere down my TODO list to fix this properly with a hash table.
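
The rough direction would be to bucket queued events by a hash of the
object they refer to, so that a merge attempt only scans one short chain
instead of the whole queue. A toy sketch of the idea (not an actual patch,
and not the real fanotify structures; the bucket count, the object key and
the struct layout are all placeholders):

#define MERGE_HASH_BITS	7			/* 128 buckets, placeholder */

struct queued_event {
	struct hlist_node merge_link;		/* chains events within one bucket */
	const void *object;			/* whatever identifies the event's object */
	unsigned int mask;			/* event type bits */
};

static struct hlist_head merge_hash[1 << MERGE_HASH_BITS];

/* Returns true if the new event was merged into an existing one. */
static bool try_merge_hashed(struct queued_event *new)
{
	struct queued_event *old;
	unsigned int b = hash_ptr(new->object, MERGE_HASH_BITS);

	hlist_for_each_entry(old, &merge_hash[b], merge_link) {
		if (old->object == new->object) {	/* stand-in for should_merge() */
			old->mask |= new->mask;		/* coalesce into the old event */
			return true;
		}
	}
	hlist_add_head(&new->merge_link, &merge_hash[b]);
	return false;
}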

> In our case fanotify_read is invoked with a buffer big enough for 200
> events, and what happens is that every time fanotify_read dequeues an
> event and releases the lock, fanotify_handle_event adds several more,
> scanning a longer and longer list. This causes fanotify_read to wait
> longer and longer for the lock, and the soft lockup happens before
> fanotify_read can reach 200 events.
> Is it intentional for fanotify_read to acquire the lock for each event,
> rather than batching together a user buffer worth of events?

I think it is meant to allow multiple reader threads to read events
fairly, but I'm not sure.

Even if it were fine to read a batch of events per spinlock acquisition,
making the code in the fanotify_read() loop behave well when an error is
hit in one event after a bunch of good events have already been read looks
challenging, but I didn't try. Anyway, the root cause of the issue seems to
be the inefficient merge and not the spinlock taken per event read.

Thanks,
Amir.

2020-07-14 13:14:57

by Jan Kara

Subject: Re: soft lockup in fanotify_read

Hello!

On Mon 13-07-20 19:54:17, Francesco Ruggeri wrote:
> We are getting this soft lockup in fanotify_read.
> The reason is that this code does not seem to scale to cases where there
> are big bursts of events generated by fanotify_handle_event.
> fanotify_read acquires group->notification_lock for each event.
> fanotify_handle_event uses the lock to add one event, which also involves
> fanotify_merge, which scans the whole list trying to find an event to
> merge the new one with.
> In our case fanotify_read is invoked with a buffer big enough for 200
> events, and what happens is that every time fanotify_read dequeues an
> event and releases the lock, fanotify_handle_event adds several more,
> scanning a longer and longer list. This causes fanotify_read to wait
> longer and longer for the lock, and the soft lockup happens before
> fanotify_read can reach 200 events.
> Is it intentional for fanotify_read to acquire the lock for each event,
> rather than batching together a user buffer worth of events?

Thanks for the report and the analysis. I agree what you describe is possible.
The locking is actually fine, I think, but you're correct that the merging
logic isn't ideal and may be too slow for large numbers of queued events.
Amir and I were already discussing how to speed it up but didn't end up
doing anything yet since the issue wasn't really pressing.

WRT fanotify_read() removing events from the list in batches: that's
certainly one possible optimization, but (especially with the recent
extensions to the fanotify interface) it is difficult to tell how many
events will actually fit in the provided buffer, so we'd have to provide a
way to push events back into the event queue, which may get a bit tricky.
And as I wrote above, I think the real problem is actually the fanotify
merge logic, which ends up holding the notification_lock for too long...

But we may want to add a cond_resched() call to the loop in fanotify_read(),
since that loop can currently take time proportional to the user-provided
buffer size, which can be a lot. That will probably also silence the soft
lockup for you (although it doesn't really fix the underlying issue).
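
Something along these lines, i.e. a cond_resched() once per copied event
(illustrative placement only, not a tested patch; the loop shape below is
simplified from fanotify_read()):

	while (1) {
		kevent = get_one_event(group, count);	/* takes notification_lock */
		if (IS_ERR(kevent) || !kevent)
			break;				/* error, or queue drained */

		ret = copy_event_to_user(group, kevent, buf);
		if (ret < 0)
			break;
		buf += ret;
		count -= ret;

		/*
		 * The number of iterations is bounded only by the size of the
		 * user-supplied buffer, so give other tasks a chance to run.
		 */
		cond_resched();
	}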

We'll have a look at what we can do about this :)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-07-14 17:00:03

by Francesco Ruggeri

Subject: Re: soft lockup in fanotify_read

Thanks for the replies and workaround suggestions!

Francesco