MIME-Version: 1.0
References: <20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com>
In-Reply-To: <20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com>
From:   Amir Goldstein <amir73il@gmail.com>
Date:   Tue, 14 Jul 2020 16:10:33 +0300
Message-ID: <CAOQ4uxjLaGyOUd5GOV8oHwBY=nGGtgk4=5bRxmHTr5VsocrhiA@mail.gmail.com>
Subject: Re: soft lockup in fanotify_read
To:     Francesco Ruggeri <fruggeri@arista.com>
Cc:     linux-kernel <linux-kernel@vger.kernel.org>,
        linux-fsdevel <linux-fsdevel@vger.kernel.org>,
        Jan Kara <jack@suse.cz>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Tue, Jul 14, 2020 at 5:54 AM Francesco Ruggeri <fruggeri@arista.com> wrote:
>
> We are getting this soft lockup in fanotify_read.
> The reason is that this code does not seem to scale to cases where there
> are big bursts of events generated by fanotify_handle_event.
> fanotify_read acquires group->notification_lock for each event.
> fanotify_handle_event uses the lock to add one event, which also involves
> fanotify_merge, which scans the whole list trying to find an event to
> merge the new one with.

Yes, that is a terribly inefficient merge algorithm.
If it helps, I am carrying a quick brown-paper-bag fix for this issue in my
tree. It simply caps how many queued events a merge attempt will scan:

@@ -65,6 +74,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
 {
        struct fsnotify_event *test_event;
        struct fanotify_event *new;
+       int limit = 128;
+       int i = 0;

        pr_debug("%s: list=%p event=%p\n", __func__, list, event);
        new = FANOTIFY_E(event);

@@ -78,6 +89,9 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
                return 0;

        list_for_each_entry_reverse(test_event, list, list) {
+               /* Event merges are expensive so should be limited */
+               if (++i > limit)
+                       break;
                if (should_merge(test_event, event)) {

It's somewhere down my TODO list to fix this properly with a hash table.
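
For illustration only, here is a rough userspace sketch of what a
hash-based merge could look like. It is not kernel code and not the
eventual fix: struct event, struct event_queue, merge_hash() and
queue_or_merge() are hypothetical names, and it assumes two events are
mergeable whenever they refer to the same object, which is simpler than
the real should_merge() test. The point is only that a merge candidate is
looked up in one short bucket instead of by walking the whole queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MERGE_HASH_BITS 7
#define MERGE_HASH_SIZE (1u << MERGE_HASH_BITS)

/* Hypothetical userspace stand-in for a queued event. */
struct event {
	uint64_t object_id;        /* assumed merge key: what the event is about */
	uint32_t mask;             /* event type bits, OR-ed together on merge */
	struct event *hash_next;   /* next event in the same hash bucket */
};

/* Hypothetical queue that only tracks the merge hash, not the read order. */
struct event_queue {
	struct event *bucket[MERGE_HASH_SIZE];
	size_t queued;
};

static unsigned int merge_hash(uint64_t object_id)
{
	/* Multiplicative hash; keep the top MERGE_HASH_BITS bits. */
	return (unsigned int)((object_id * 0x9E3779B97F4A7C15ull)
			      >> (64 - MERGE_HASH_BITS));
}

/*
 * Queue ev, or merge it into an already queued event with the same key.
 * Returns 1 if merged (the caller still owns ev), 0 if ev was queued.
 */
static int queue_or_merge(struct event_queue *q, struct event *ev)
{
	struct event **head;
	struct event *it;

	assert(q && ev);
	head = &q->bucket[merge_hash(ev->object_id)];

	/* Only events that hash to the same bucket are examined. */
	for (it = *head; it; it = it->hash_next) {
		if (it->object_id == ev->object_id) {
			it->mask |= ev->mask;
			return 1;
		}
	}

	ev->hash_next = *head;
	*head = ev;
	q->queued++;
	return 0;
}
```

A real implementation would also have to keep the events on the FIFO list
used by the reader and unhash an event when it is dequeued; the sketch
leaves both out.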

> In our case fanotify_read is invoked with a buffer big enough for 200
> events, and what happens is that every time fanotify_read dequeues an
> event and releases the lock, fanotify_handle_event adds several more,
> scanning a longer and longer list. This causes fanotify_read to wait
> longer and longer for the lock, and the soft lockup happens before
> fanotify_read can reach 200 events.
> Is it intentional for fanotify_read to acquire the lock for each event,
> rather than batching together a user buffer worth of events?

I think it is meant to let multiple reader threads read events with
fairness, but I am not sure.

Even if it were fine to read a batch of events on every spinlock acquire,
making the fanotify_read() loop behave well when one event hits an error
after a bunch of good events have already been read looks challenging,
but I didn't try. Anyway, the root cause of the issue seems to be the
inefficient merge and not the spinlock taken for each event read.
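
To make the error-handling concern concrete, here is a rough userspace
sketch of a batched read. Everything in it is hypothetical (struct queue,
copy_event(), read_batch(), and a counter standing in for the spinlock),
and requeueing on failure is just one possible policy, not what
fanotify_read() does. It shows that once a batch has been taken off the
queue, a failure part way through means the unread events have to be put
back at the head under the lock.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64

/* Hypothetical array-backed FIFO standing in for the notification queue. */
struct queue {
	int ev[QUEUE_CAP];
	size_t head;                      /* index of the oldest event */
	size_t len;                       /* number of queued events */
	unsigned long lock_acquisitions;  /* stands in for taking the spinlock */
};

/* Stand-in for copying one event to the user buffer; negative events fail. */
static int copy_event(int ev, int *dst)
{
	if (ev < 0)
		return -1;
	*dst = ev;
	return 0;
}

/*
 * Take up to max events in one "lock" acquisition and copy them out with
 * the lock dropped. On a copy failure, put the failed event and all later
 * ones back at the head of the queue, preserving their order.
 * Returns the number of events copied to out.
 */
static size_t read_batch(struct queue *q, int *out, size_t max)
{
	int batch[QUEUE_CAP];
	size_t n, i, copied = 0;

	assert(q && out);

	q->lock_acquisitions++;           /* lock */
	n = q->len < max ? q->len : max;
	for (i = 0; i < n; i++)
		batch[i] = q->ev[(q->head + i) % QUEUE_CAP];
	q->head = (q->head + n) % QUEUE_CAP;
	q->len -= n;
	                                  /* unlock */

	while (copied < n && copy_event(batch[copied], &out[copied]) == 0)
		copied++;

	if (copied < n) {
		size_t back = n - copied;

		q->lock_acquisitions++;   /* lock again to requeue */
		/* Producers may have queued more meanwhile; assume it still fits. */
		assert(q->len + back <= QUEUE_CAP);
		for (i = 0; i < back; i++) {
			/* Push newest first so the failed event ends up at the head. */
			q->head = (q->head + QUEUE_CAP - 1) % QUEUE_CAP;
			q->ev[q->head] = batch[n - 1 - i];
		}
		q->len += back;
	}
	return copied;
}
```

Even in this toy form the failure path needs a second lock acquisition and
has to cope with events that were queued in the meantime.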

Thanks,
Amir.