2020-09-12 07:29:21

by Amir Goldstein

[permalink] [raw]
Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 1:40 AM Michael Larabel
<[email protected]> wrote:
>
> On 9/11/20 5:07 PM, Linus Torvalds wrote:
> > On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
> > <[email protected]> wrote:
> >> Ok, it's probably simply that fairness is really bad for performance
> >> here in general, and that special case is just that - a special case,
> >> not the main issue.
> > Ahh. It turns out that I should have looked more at the fault path
> > after all. It was higher up in the profile, but I ignored it because I
> > found that lock-unlock-lock pattern lower down.
> >
> > The main contention point is actually filemap_fault(). Your apache
> > test accesses the 'test.html' file that is mmap'ed into memory, and
> > all the threads hammer on that one single file concurrently and that
> > seems to be the main page lock contention.
> >
> > Which is really sad - the page lock there isn't really all that
> > interesting, and the normal "read()" path doesn't even take it. But
> > faulting the page in does so because the page will have a long-term
> > existence in the page tables, and so there's a worry about racing with
> > truncate.
> >
> > Interesting, but also very annoying.
> >
> > Anyway, I don't have a solution for it, but thought I'd let you know
> > that I'm still looking at this.
> >
> > Linus
>
> I've been running your EXT4 patch on more systems and with some
> additional workloads today. While not the original problem, the patch
> does seem to help a fair amount for the MariaDB database server. This
> wasn't one of the workloads regressing on 5.9, but at least with the
> systems tried so far the patch does make a meaningful improvement to
> performance. I haven't run into any apparent issues with the patch, so
> I'm continuing to try it out on more systems and other database/server
> workloads.
>

Michael,

Can you please add a reference to the original problem report and
to the offending commit? This conversation appeared on the list without
this information.

Are filesystems other than ext4 also affected by this performance
regression?

Thanks,
Amir.


2020-09-12 15:57:15

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 10:28:29AM +0300, Amir Goldstein wrote:
> On Sat, Sep 12, 2020 at 1:40 AM Michael Larabel
> <[email protected]> wrote:
> >
> > On 9/11/20 5:07 PM, Linus Torvalds wrote:
> > > On Fri, Sep 11, 2020 at 9:19 AM Linus Torvalds
> > > <[email protected]> wrote:
> > >> Ok, it's probably simply that fairness is really bad for performance
> > >> here in general, and that special case is just that - a special case,
> > >> not the main issue.
> > > Ahh. It turns out that I should have looked more at the fault path
> > > after all. It was higher up in the profile, but I ignored it because I
> > > found that lock-unlock-lock pattern lower down.
> > >
> > > The main contention point is actually filemap_fault(). Your apache
> > > test accesses the 'test.html' file that is mmap'ed into memory, and
> > > all the threads hammer on that one single file concurrently and that
> > > seems to be the main page lock contention.
> > >
> > > Which is really sad - the page lock there isn't really all that
> > > interesting, and the normal "read()" path doesn't even take it. But
> > > faulting the page in does so because the page will have a long-term
> > > existence in the page tables, and so there's a worry about racing with
> > > truncate.

Here's an idea (sorry, no patch, I'm about to go out for the day):

What if we cleared PageUptodate in the truncate path? And then
filemap_fault() looks a lot more like generic_file_buffered_read() where
we check PageUptodate, and only if it's clear do we take the page lock
and call ->readpage.

We'd need to recheck PageUptodate after installing the PTE and zap
it ourselves, and we wouldn't be able to check page_mapped() in the
truncate path any more, which would make me sad. But there's something
to be said for making faults cheaper.

Even the XFS model where we take the MMAPLOCK_SHARED isn't free -- it's
just write-vs-write on a cacheline instead of a visible contention on
the page lock.

2020-09-12 18:00:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel Benchmarking

Sorry, I should have put much more background when I started cc'ing
people and lists..

On Sat, Sep 12, 2020 at 12:28 AM Amir Goldstein <[email protected]> wrote:
>
> Can you please add a reference to the original problem report and
> to the offending commit? This conversation appeared on the list without
> this information.
>
> Are filesystems other than ext4 also affected by this performance
> regression?

So let me expand on this, because it actually comes from a really old
problem that has been around for ages, and that I think I finally
understand.

And sadly, the regression comes from the fix to that old problem.

So I think the VM people (but perhaps not necessarily all filesystem
people) have been aware of a long-time problem with certain loads
causing huge latencies, up to and including watchdogs firing because
processes wouldn't make progress for over half a minute (or whatever
the default blocking watchdog timeout is - I would like to say that
it's some odd number like 22 seconds, but maybe that was RCU).

We've known it's related to long queues for the page lock, and about
three years ago now we added a "bookmark" entry to the page wakeup
queues, because those queues got so long that even just traversing the
wakeup queue was a big latency hit. But it's generally been some heavy
private load on a customer machine, and nobody ever really had a good
test-case for it.

We've actually had tons of different page lockers involved. One of the
suspects (and in fact I think it really was one of the causes, just
not the only one) was the NUMA migration, where under certain loads
with lots and lots of threads, the kernel would decide to try to
migrate a hot page, and lots of threads would come in and all
NUMA-fault on it (because it was some core page everything used), and
as part of the fault they would get the page lock to serialize, and
you'd end up with wait queues that were multiple _thousands_ of
entries long.

So the reports of watchdogs firing go back many many years, and over
the years we've had various band-aid fixes - things that really do
help the symptoms a lot, but really seem to be fixes for the symptoms
rather than something fundamental. That "let's break up the wait queue
with a bookmark so that we can at least enable interrupts" is perhaps
the best example of code that just shouldn't exist, but comes about
because there's been incredible contention on the page lock.

See commits 2554db916586 ("sched/wait: Break up long wake list walk")
and 11a19c7b099f ("sched/wait: Introduce wakeup bookmark in
wake_up_page_bit") for that bookmark thing and some of the list
numbers.

There's been a few actual fixes too - I think Hugh Dickins really
ended up fixing at least part of the NUMA balancing case by changing
some of the reference counting. So I don't think it's _all_ been
band-aids, but the page lock has been a thing that has come up
multiple times over the years.

See for example commit 9a1ea439b16b ("mm:
put_and_wait_on_page_locked() while page is migrated") for a patch
that ended up hopefully fixing at least one of the causes of the long
queues during migration. I say "hopefully", because (again) the loads
that cause these things were those "internal customer load" things
that we don't really have a lot of insight into. Hugh has been
involved over the years presumably exactly because google has been one
of those customers, although not the only one by far.

But the point here is that the page lock has been problematic for
years - with those reports of watchdogs (after tens of seconds!)
firing going back long before the fixes above. It's definitely not a
new thing, although I think it has perhaps become more common due to
"bigger machines running more complex loads becoming more common", but
who knows..

Anyway, for various reasons I was looking at this again a couple of
months ago: we had _yet_ another report of softlockups:

https://lore.kernel.org/lkml/[email protected]/

and we had an unrelated thread about a low-level race in page wakeup
(the original report was wrong, but it led to figuring out another
race):

https://lore.kernel.org/lkml/[email protected]/

and there was something else going on too that I can't recall, that
had made me look at the page locking.

And while there, I realized that the simplest explanation for all
those years of softlockups was simply that the page waiting is very
very unfair indeed.

While somebody is patiently waiting for a page, another process can
(and will) come in and get the page lock from under it, and the
original waiter will end up just re-queueing - at the end of the list.
Which explains how you can get those half-minute latencies - not
because any page lock holder really holds the lock for very long
(almost all of them are CPU-bound, not IO bound), but because under
heavy load and contention, you end up with the poor waiters scheduling
away and *new* page lockers end up being treated very preferentially,
with absolutely nothing keeping them from taking the lock while
somebody else is waiting for it.

ANYWAY. That's a long email of background for the commit that I then
put in the tree this merge window:

2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")

which actually makes the page locking truly fair (well, there's a
small window that allows new lockers to come in, but now it's about
CPU instructions rather than "scheduling a sleeping process to run",
so it's essentially gone as a major and recurring unfairness issue).

And that commit actually had a fair number of positive reports: even
before merging it, Hugh had tested this on yet another "company
internal load" (*cough*google*cough*) that used to show watchdog
failures, and it did seem to solve those, and some of the kernel test
robot benchmarks improved by tens of percent, in one case by 160%.

So everything looked fine. It solves a long-running problem, and for
once I think it really _solves_ it by fixing a fundamental problem,
rather than papering over the symptoms.

But there were a couple of nagging concerns. Hackbench showed wildly
fluctuating performance: some tests improved by a lot, others
regressed by a lot. Not a huge deal, I felt: hackbench isn't a great
benchmark, and the performance fluctuations really seemed to be going
both ways and be very dependent on exact test and machine. So it was a
slight concern, but on the whole not really worth worrying about.

But the reason Michael is on the Cc is because the Phoronix benchmark
suite showed a rather marked decrease in the apache test. Unlike
hackbench, I think that's much more of a "real" test, and it seemed to
be a lot more consistent too. So I asked for profiles (and eventually
just recreated the test locally), and I think I understand what's
going on.

It's the fairness.

Fairness is good, but fairness is usually bad for performance even if
it does get rid of the worst-case issues. In this case, it's _really_
bad for performance, because that page lock has always been unfair,
and we have a lot of patterns that have basically come to
(unintentionally) depend on that unfairness.

In particular, the page locking is often used for just verifying
simple things, with the most common example being "lock page, check
that the mapping is still valid, insert page into page tables, unlock
page".

The reason the apache benchmark regresses is that it basically does a
web server test with a single file ("test.html") that gets served by
just mmap'ing it, and sending it out that way. Using lots of threads,
and using lots of different mappings. So they *all* fault on the read
of that page, and they *all* do that "lock page, check that the
mapping is valid, insert page" dance.

That actually worked ok - not great, but ok - when the page lock was
unfair, and anybody runnable would basically just get it. Yes, people
would occasionally get put on the wait-queue, but those waiting
lockers wouldn't really affect the other ones that are going through
that dance since they would just take the lock anyway. VERY unfair,
but hey, very nice for that load.

It works much less well when the page lock is suddenly fair, and if
anybody starts waiting for it, gets the lock handed to it when the
page is unlocked. Now the page is owned by the next waiter in line,
and they're sleeping, and new page lockers don't magically and
unfairly get to just bypass the older waiter.

This is not a new issue. We've had exactly the same thing happen when
we made spinlocks, semaphores, and rwlocks be fair.

And like those other times, we had to make them fair because *not*
making them fair caused those unacceptable outliers under contention,
to the point of starvation and watchdogs firing.

Anyway, I don't have a great solution. I have a few options (roughly
ordered by "simplest to most complex"):

(a) just revert
(b) add some busy-spinning
(c) reader-writer page lock
(d) try to de-emphasize the page lock

but I'd love to hear comments.

Honestly, (a) is trivial to do. We've had the problem for years, the
really *bad* cases are fairly rare, and the workarounds mostly work.
Yeah, you get watchdogs firing, but it's not exactly _common_.

But equally honestly, I hate (a). I feel like this change really fixed
a fundamental issue, and after looking at the apache benchmark, in
many ways it's not a great benchmark. The reason it shows such a
(relatively) huge regression is that it hammers on just a single small
file. So my inclination is to say "we know how to fix the performance
regression, even if we may not be able to do so for 5.9, and this
benchmark behavior is very unlikely to actually hit a real load".

Option (b) is just because right now the page lock is very much a
black-and-white "try to lock once or sleep". Where most lockers (the
initial actual IO to fill the page being the main exception) are
CPU-bound, not IO bound. So spinning is the usual simplistic fix for
locking behavior like that. It doesn't really "fix" anything, but it
helps the bad contended performance case and we wouldn't get the
scheduling and sleeping behavior.

I can imagine coming up with a ten-liner patch to add some spinning
that claws back much of the performance on that benchmark. Maybe.

I don't like option (b) very much, but it might be the band-aid for
5.9 if we feel that the benchmark results _might_ translate to real
loads.

Option (c) is, I feel, the best one. Reader-writer locks aren't
wonderful, but the page lock really tends to have two very distinct
uses: exclusive for the initial IO and for the (very very unlikely)
truncate and hole punching issues, and then the above kind of "lock to
check that it's still valid" use, which is very very common and
happens on every page fault and then some. And it would be very
natural to make the latter be a read-lock (or even just a sequence
counting one with retry rather than a real lock).

Option (d) is "we already have locking in many filesystems that gives
us exclusion between faulting in a page and truncate/hole punch, so we
shouldn't use the page lock at all".

I do think that the locking that filesystems do is in many ways
inferior - it's done on a per-inode basis rather than on a per-page
basis. But if the filesystems end up doing that *anyway*, what's the
advantage of the finer granularity one? And *because* the common case
is all about the reading case, the bigger granularity tends to work
very well in practice, and basically never sees contention.

So I think option (c) is potentially technically better because it has
smaller locking granularity, but in practice (d) might be easier and
we already effectively do it for several filesystems.

Also, making the page lock be a rw-lock may be "easy" in theory, but
in practice we have the usual "uhhuh, 'struct page' is very crowded,
and finding even just one more bit in the flags to use as a read bit
is not great, and finding a whole reader _count_ would likely require
us to go to that hashed queue, which we know has horrendous cache
behavior from past experience".

This turned out to be a very long email, and probably most people
didn't get this far. But if you did, comments, opinions, suggestions?

Any other suggestions than those (a)-(d) ones above?

Linus

2020-09-12 20:33:54

by Rogério Brito

[permalink] [raw]
Subject: Re: Kernel Benchmarking

Hi, Linus and other people.

On 12/09/2020 14.59, Linus Torvalds wrote:
> On Sat, Sep 12, 2020 at 12:28 AM Amir Goldstein <[email protected]> wrote:
>>
>> Can you please add a reference to the original problem report and
>> to the offending commit? This conversation appeared on the list without
>> this information.
>>
>> Are filesystems other than ext4 also affected by this performance
>> regression?
>
> So let me expand on this, because it actually comes from a really old
> problem that has been around for ages, and that I think I finally
> understand.
>
> And sadly, the regression comes from the fix to that old problem.
>
> So I think the VM people (but perhaps not necessarily all filesystem
> people) have been aware of a long-time problem with certain loads
> causing huge latencies, up to and including watchdogs firing because
> processes wouldn't make progress for over half a minute (or whatever
> the default blocking watchdog timeout is - I would like to say that
> it's some odd number like 22 seconds, but maybe that was RCU).
>
> We've known it's related to long queues for the page lock, and about
> three years ago now we added a "bookmark" entry to the page wakeup
> queues, because those queues got so long that even just traversing the
> wakeup queue was a big latency hit. But it's generally been some heavy
> private load on a customer machine, and nobody ever really had a good
> test-case for it.
>
> We've actually had tons of different page lockers involved. One of the
> suspects (and in fact I think it really was one of the causes, just
> not the only one) was the NUMA migration, where under certain loads
> with lots and lots of threads, the kernel would decide to try to
> migrate a hot page, and lots of threads would come in and all
> NUMA-fault on it (because it was some core page everything used), and
> as part of the fault they would get the page lock to serialize, and
> you'd end up with wait queues that were multiple _thousands_ of
> entries long.
>
> So the reports of watchdogs firing go back many many years, and over
> the years we've had various band-aid fixes - things that really do
> help the symptoms a lot, but really seem to be fixes for the symptoms
> rather than something fundamental. That "let's break up the wait queue
> with a bookmark so that we can at least enable interrupts" is perhaps
> the best example of code that just shouldn't exist, but comes about
> because there's been incredible contention on the page lock.
>
> See commits 2554db916586 ("sched/wait: Break up long wake list walk")
> and 11a19c7b099f ("sched/wait: Introduce wakeup bookmark in
> wake_up_page_bit") for that bookmark thing and some of the list
> numbers.
>
> There's been a few actual fixes too - I think Hugh Dickins really
> ended up fixing at least part of the NUMA balancing case by changing
> some of the reference counting. So I don't think it's _all_ been
> band-aids, but the page lock has been a thing that has come up
> multiple times over the years.
>
> See for example commit 9a1ea439b16b ("mm:
> put_and_wait_on_page_locked() while page is migrated") for a patch
> that ended up hopefully fixing at least one of the causes of the long
> queues during migration. I say "hopefully", because (again) the loads
> that cause these things were those "internal customer load" things
> that we don't really have a lot of insight into. Hugh has been
> involved over the years presumably exactly because google has been one
> of those customers, although not the only one by far.
>
> But the point here is that the page lock has been problematic for
> years - with those reports of watchdogs (after tens of seconds!)
> firing going back long before the fixes above. It's definitely not a
> new thing, although I think it has perhaps become more common due to
> "bigger machines running more complex loads becoming more common", but
> who knows..


First of all, please excuse my layman questions, but this conversation
piqued my interest.

Now, to the subject: is what you describe (RCU or VFS) in some sense
related to, say, copying a "big" file (e.g., a movie) to "slow" media
(in my case, a USB thumb drive, so that I can watch said movie on my
TV)?

I've seen backtraces mentioning "task xxx hung for yyy seconds" and a
non-responsive cp process at that... I say RCU or VFS because I see this
with thumb drives with vfat filesystems (so it wouldn't be quite
related to ext4, apart from the fact that all my Linux-specific
filesystems are ext4).

The same thing happens with my slow home network when I copy things to
my crappy NFS server (an armel system that has only 128MB of RAM).

In both cases (a local or a remote fs), whenever I try to send SIGSTOP
to the cp process, it stays there without being stopped for many
minutes... I would venture a guess that this is because nothing else is
done until the outstanding bytes are actually committed to the
filesystem...

In some sense, a very slow system under "moderate load" is akin to a
high-end, loaded server with many threads competing for resources, in my
experience...

OK, lots of guesses and conjectures on my side, but the interactivity
for the end user suffers a lot (I have not yet tested any 5.9 kernel).


Thanks,

Rogério.



2020-09-12 20:58:46

by Josh Triplett

[permalink] [raw]
Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 10:59:40AM -0700, Linus Torvalds wrote:
> So I think the VM people (but perhaps not necessarily all filesystem
> people) have been aware of a long-time problem with certain loads
> causing huge latencies, up to and including watchdogs firing because
> processes wouldn't make progress for over half a minute (or whatever
> the default blocking watchdog timeout is - I would like to say that
> it's some odd number like 22 seconds, but maybe that was RCU).
>
> We've known it's related to long queues for the page lock, and about
> three years ago now we added a "bookmark" entry to the page wakeup
> queues, because those queues got so long that even just traversing the
> wakeup queue was a big latency hit. But it's generally been some heavy
> private load on a customer machine, and nobody ever really had a good
> test-case for it.

I don't *know* if this is the same bottleneck, but I have an easily
reproducible workload that rather reliably triggers softlockup
watchdogs, massive performance bottlenecks, and processes that hang for
a while without making forward progress, and it seemed worth mentioning
in case it might serve as a reproducer for those private workloads.
(Haven't tested it on a kernel with this fairness fix added; most recent
tests were on 5.7-rc6.)

On a GCP n1-highcpu-96 instance, with nested virtualization enabled,
create a QEMU/KVM VM with the same number of CPUs backed by a disk image
using either NVME or virtio, and in that VM, build a defconfig kernel
with `make -j$(nproc)`. Lots of softlockup warnings, processes that
should be very quick hanging for a long time, and the build on the guest
is up to 5x slower than the host system, with 12-15x the system time.

I've seen similar softlockups with huge VMs running on physical
hardware, not just on cloud systems that allow nested virtualization.
This is *probably* reproducible for anyone who has local hardware with
lots of CPUs, but doing it on GCP should be accessible to anyone.

(I'm not using GCP anymore, and the systems I'm using don't support
nested virtualization, so I don't have this workload readily available
anymore. It was a completely standard Debian image with the cloud kernel
installed, and zero unusual configuration.)

> Fairness is good, but fairness is usually bad for performance even if
> it does get rid of the worst-case issues. In this case, it's _really_
> bad for performance, because that page lock has always been unfair,
> and we have a lot of patterns that have basically come to
> (unintentionally) depend on that unfairness.
>
> In particular, the page locking is often used for just verifying
> simple things, with the most common example being "lock page, check
> that the mapping is still valid, insert page into page tables, unlock
> page".
[...]
> This is not a new issue. We've had exactly the same thing happen when
> we made spinlocks, semaphores, and rwlocks be fair.
>
> And like those other times, we had to make them fair because *not*
> making them fair caused those unacceptable outliers under contention,
> to the point of starvation and watchdogs firing.
>
> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
>
> (a) just revert
> (b) add some busy-spinning
> (c) reader-writer page lock
> (d) try to de-emphasize the page lock
>
> but I'd love to hear comments.

[...]

> Honestly, (a) is trivial to do. We've had the problem for years, the
> really *bad* cases are fairly rare, and the workarounds mostly work.
> Yeah, you get watchdogs firing, but it's not exactly _common_.

I feel like every time I run a non-trivial load inside a huge VM, I end
up hitting those watchdogs; they don't *feel* rare.

> Option (c) is, I feel, the best one. Reader-writer locks aren't
> wonderful, but the page lock really tends to have two very distinct
> uses: exclusive for the initial IO and for the (very very unlikely)
> truncate and hole punching issues, and then the above kind of "lock to
> check that it's still valid" use, which is very very common and
> happens on every page fault and then some. And it would be very
> natural to make the latter be a read-lock (or even just a sequence
> counting one with retry rather than a real lock).
>
> Option (d) is "we already have locking in many filesystems that gives
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
>
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.
>
> So I think option (c) is potentially technically better because it has
> smaller locking granularity, but in practice (d) might be easier and
> we already effectively do it for several filesystems.

If filesystems are going to have to have that lock *anyway*, and it
makes the page lock entirely redundant for that use case, then it
doesn't seem like there's any point to making the page lock cheaper if
we can avoid it entirely. On the other hand, that seems like it might
make locking a *lot* more complicated, if the synchronization on a
struct page is "usually the page lock, but if it's a filesystem page,
then a filesystem-specific lock instead".

So, it seems like there'd be two deciding factors between (c) and (d):
- Whether filesystems might ever be able to use the locks in (c) to
reduce or avoid having to do their own locking for this case. (Seems
like there might be a brlock-style approach that could work for
truncate/hole-punch.)
- Whether (d) would make the locking story excessively complicated
compared to (c).

> This turned out to be a very long email, and probably most people
> didn't get this far. But if you did, comments, opinions, suggestions?
>
> Any other suggestions than those (a)-(d) ones above?
>
> Linus

2020-09-12 21:00:23

by James Bottomley

Subject: Re: Kernel Benchmarking

On Sat, 2020-09-12 at 10:59 -0700, Linus Torvalds wrote:
[...]
> Any other suggestions than those (a)-(d) ones above?

What about reverting and trying to fix the outliers? Say by having a
timer set when a process gets put to sleep waiting on the page lock.
If the timer fires, the process gets woken up and put at the head of
the queue. I suppose it would also be useful to know if this had
happened, so if the timer has to be reset because the process again
fails to win the lock and gets put back to sleep, it should perhaps
be woken after a shorter interval, or perhaps it should spin before
sleeping.

I'm not advocating this as the long term solution, but it could be the
stopgap while people work on (c).

James

2020-09-12 21:16:51

by Linus Torvalds

Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 1:59 PM James Bottomley
<[email protected]> wrote:
>
> On Sat, 2020-09-12 at 10:59 -0700, Linus Torvalds wrote:
> [...]
> > Any other suggestions than those (a)-(d) ones above?
>
> What about revert and try to fix the outliers? Say by having a timer
> set when a process gets put to sleep waiting on the page lock.

No timer needed, I suspect.

I tried to code something like this up yesterday (hmm, Thursday?) as
a "hybrid" scheme, where we'd start out with the old behavior and let
people unfairly get the lock while there were waiters, but when a
waiter woke up and noticed that it still couldn't get the lock, _then_
it would start using the new scheme.

So still be unfair for a bit, but limit the unfairness so that a
waiter won't lose the lock more than once (but obviously while the
waiter initially slept, _many_ other lockers could have come through).

I ended up with a code mess and gave up on it (it seemed to just get
all the complications from the old _and_ the new model), but maybe I
should try again now that I know what went wrong last time. I think I
tried too hard to actually mix the old and the new code.

(If I tried again, I'd not try to mix the new and the old code. I'd
make the new one start out with a non-exclusive wait - which the code
already supports for the whole "wait for PG_writeback to end" case, as
opposed to "wait to take PG_lock" - and then turn it into an exclusive
wait if it fails. That might work out better and not mix entirely
different approaches.)

Linus

2020-09-12 22:54:00

by Matthew Wilcox

Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 10:59:40AM -0700, Linus Torvalds wrote:
> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
>
> (a) just revert
> (b) add some busy-spinning
> (c) reader-writer page lock
> (d) try to de-emphasize the page lock
>
> Option (d) is "we already have locking in many filesystems that gives
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
>
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.

I guess this is option (e). Completely untested; not even compiled,
but it might be a design that means filesystems don't need to take
per-inode locks. I probably screwed up the drop-mmap-lock-for-io
parts of filemap_fault. I definitely didn't update DAX for the
new parameter for finish_fault(), and now I think about it, I didn't
update the header file either, so it definitely won't compile.

diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc..3909613f1c9c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2602,8 +2602,22 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
}

+ if (fpin)
+ goto out_retry;
+ if (likely(PageUptodate(page)))
+ goto uptodate;
+
if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
goto out_retry;
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+
+ /* Did somebody else update it for us? */
+ if (PageUptodate(page)) {
+ unlock_page(page);
+ if (fpin)
+ goto out_retry;
+ goto uptodate;
+ }

/* Did it get truncated? */
if (unlikely(compound_head(page)->mapping != mapping)) {
@@ -2611,14 +2625,6 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
put_page(page);
goto retry_find;
}
- VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
-
- /*
- * We have a locked page in the page cache, now we need to check
- * that it's up-to-date. If not, it is going to be due to an error.
- */
- if (unlikely(!PageUptodate(page)))
- goto page_not_uptodate;

/*
* We've made it this far and we had to drop our mmap_lock, now is the
@@ -2641,10 +2647,6 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}

- vmf->page = page;
- return ret | VM_FAULT_LOCKED;
-
-page_not_uptodate:
/*
* Umm, take care of errors if the page isn't up-to-date.
* Try to re-read it _once_. We do this synchronously,
@@ -2680,6 +2682,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
if (fpin)
fput(fpin);
return ret | VM_FAULT_RETRY;
+
+uptodate:
+ vmf->page = page;
+ return ret | VM_FAULT_UPTODATE;
}
EXPORT_SYMBOL(filemap_fault);

diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76..48fb04e75a3a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3460,6 +3460,8 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return VM_FAULT_HWPOISON;
}

+ if (ret & VM_FAULT_UPTODATE)
+ return ret;
if (unlikely(!(ret & VM_FAULT_LOCKED)))
lock_page(vmf->page);
else
@@ -3684,7 +3686,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*
* Return: %0 on success, %VM_FAULT_ code in case of error.
*/
-vm_fault_t finish_fault(struct vm_fault *vmf)
+vm_fault_t finish_fault(struct vm_fault *vmf, vm_fault_t ret2)
{
struct page *page;
vm_fault_t ret = 0;
@@ -3704,9 +3706,17 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
ret = check_stable_address_space(vmf->vma->vm_mm);
if (!ret)
ret = alloc_set_pte(vmf, page);
+ if (ret2 & VM_FAULT_UPTODATE) {
+ if (!PageUptodate(page)) {
+ /* probably other things to do here */
+ page_remove_rmap(page);
+ pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+ put_page(page);
+ }
+ }
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ return ret | ret2;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3844,8 +3854,9 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

- ret |= finish_fault(vmf);
- unlock_page(vmf->page);
+ ret = finish_fault(vmf, ret);
+ if (!(ret & VM_FAULT_UPTODATE))
+ unlock_page(vmf->page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
put_page(vmf->page);
return ret;
@@ -3878,8 +3889,9 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
__SetPageUptodate(vmf->cow_page);

- ret |= finish_fault(vmf);
- unlock_page(vmf->page);
+ ret = finish_fault(vmf, ret);
+ if (!(ret & VM_FAULT_UPTODATE))
+ unlock_page(vmf->page);
put_page(vmf->page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
@@ -3912,10 +3924,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
}
}

- ret |= finish_fault(vmf);
+ ret = finish_fault(vmf, ret);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
VM_FAULT_RETRY))) {
- unlock_page(vmf->page);
+ if (!(ret & VM_FAULT_UPTODATE))
+ unlock_page(vmf->page);
put_page(vmf->page);
return ret;
}
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..649381703f31 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -176,6 +176,7 @@ void do_invalidatepage(struct page *page, unsigned int offset,
static void
truncate_cleanup_page(struct address_space *mapping, struct page *page)
{
+ ClearPageUptodate(page);
if (page_mapped(page)) {
pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
unmap_mapping_pages(mapping, page->index, nr, false);
@@ -738,7 +739,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
1, false);
}
}
- BUG_ON(page_mapped(page));
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))

2020-09-13 00:43:27

by Dave Chinner

Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 10:59:40AM -0700, Linus Torvalds wrote:

[...]

> In particular, the page locking is often used for just verifying
> simple things, with the most common example being "lock page, check
> that the mapping is still valid, insert page into page tables, unlock
> page".
>
> The reason the apache benchmark regresses is that it basically does a
> web server test with a single file ("test.html") that gets served by
> just mmap'ing it, and sending it out that way. Using lots of threads,
> and using lots of different mappings. So they *all* fault on the read
> of that page, and they *all* do that "lock page, check that the
> mapping is valid, insert page" dance.

Hmmmm. So this is typically a truncate race check, but this isn't
sufficient to protect the fault against all page invalidation races
as the page can be re-inserted into the same mapping at a different
page->index now within EOF.

Hence filesystems that support hole punching have to serialise the
->fault path against the page invalidations done in ->fallocate
operations, because otherwise we get data corruption from the
mm/ truncate checks failing to detect invalidated pages within EOF
correctly.

i.e. truncate/hole punch is a multi-object modification operation,
with the typical atomicity boundary of the operation defined by the
inode_lock() and/or the filesystem transaction that makes the
modification. IOWs, page_lock() based truncation/invalidation checks
aren't atomic w.r.t. the other objects being modified in the same
operation. Truncate avoids this by ordering the file size update
vs the page cache invalidation, but no such ordering protection can
be provided for ->fallocate() operations that directly manipulate
the metadata of user data in the file.

> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
>
> (a) just revert
> (b) add some busy-spinning
> (c) reader-writer page lock
> (d) try to de-emphasize the page lock

....

> Option (d) is "we already have locking in many filesystems that gives
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
>
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.

*nod*

Given that:

1) we have been doing (d) for 5 years (see commit 653c60b633a ("xfs:
introduce mmap/truncate lock")),

2) ext4 also has this same functionality,

3) DAX requires the ability for filesystems to exclude page faults

4) it is a widely deployed and tested solution

5) filesystems will still need to be able to exclude page faults
over a file range while they directly manipulate file metadata to
change the user data in the file

> So I think option (c) is potentially technically better because it has
> smaller locking granularity, but in practice (d) might be easier and
> we already effectively do it for several filesystems.

Right. Even if we go for (c), AFAICT we still need (d), largely
because of reason (5) above. There are a whole
class of "page fault vs direct storage manipulation offload"
serialisation issues that filesystems have to consider (especially
if they want to support DAX), so if we can use that same mechanism
to knock a whole bunch of page locking out of the fault paths then
that seems like a win to me....

> Any other suggestions than those (a)-(d) ones above?

Not really - I've been advocating for (d) as the general mechanism
for truncate/holepunch exclusion for quite a few years now because
it largely seems to work with no obvious/apparent issues.

Just as a FWIW: I agree that the per-inode rwsem could be an issue
here, just as it is for the IO path. As a side project I'm working
on shared/exclusive range locks for the XFS inode to replace the
rwsems for the XFS_IOLOCK_{SHARED,EXCL} and the
XFS_MMAPLOCK_{SHARED,EXCL}.

That will largely alleviate any problems that "per-inode rwsem"
serialisation might cause us here - I've got the DIO fastpath down to
2 atomic operations per lock/unlock, and it's within 10% of rwsems at
up to approx. half a million concurrent DIO read/writes to the same
inode. Concurrent buffered read/write are not far behind direct IO
until I run out of CPU to copy data. None of this requires changes to
anything outside fs/xfs because everything is already correctly
serialised to "atomic" filesystem operations, and range locking
preserves the atomicity of those operations including all the page
cache operations done within them.

Hence I'd much prefer to be moving the page cache in a direction
that results in the page cache not having to care at all about
serialising against racing truncates, hole punches or anything else
that runs page invalidation. That will make the page cache code
simpler, require less locking, and likely have fewer invalidation
related issues over time...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-09-13 02:40:38

by Linus Torvalds

Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 5:41 PM Dave Chinner <[email protected]> wrote:
>
> Hmmmm. So this is typically a truncate race check, but this isn't
> sufficient to protect the fault against all page invalidation races
> as the page can be re-inserted into the same mapping at a different
> page->index now within EOF.

Using some "move" ioctl or similar and using an "invalidate page
mapping, then move it to a different point" model?

Yeah. I think that ends up being basically an extended special case of
the truncate thing (for the invalidate), and would require the
filesystem to serialize externally to the page anyway.

Which they would presumably already do with the MMAPLOCK or similar,
so I guess that's not a huge deal.

The real worry with (d) is that we are using the page lock for other
things too, not *just* the truncate check. Things where the inode lock
wouldn't be helping, like locking against throwing pages out of the
page cache entirely, or the hugepage splitting/merging etc. It's not
being truncated, it's just the VM shrinking the cache or modifying
things in other ways.

So I do worry a bit about trying to make things per-inode (or even
some per-range thing with a smarter lock) for those reasons. We use
the page lock not just for synchronizing with filesystem operations,
but for other page state synchronization too.

In many ways I think keeping it as a page-lock, and making the
filesystem operations just act on the range of pages would be safer.

But the page locking code does have some extreme downsides, exactly
because there are so _many_ pages and we end up having to play some
extreme size games due to that (ie the whole external hashing, but
also just not being able to use any debug locks etc, because we just
don't have the resources to do debugging locks at that kind of
granularity).

That's somewhat longer-term. I'll try to do another version of
the "hybrid fairness" page lock (and/or just try some limited
optimistic spinning) to see if I can at least avoid the nasty
regression. Admittedly it really probably only happens for these kinds
of microbenchmarks that just hammer on one page over and over again,
but it's a big enough regression for a "real enough" load that I
really don't like it.

Linus

2020-09-13 03:19:23

by Matthew Wilcox

Subject: Re: Kernel Benchmarking

On Sun, Sep 13, 2020 at 10:40:57AM +1000, Dave Chinner wrote:
> > The reason the apache benchmark regresses is that it basically does a
> > web server test with a single file ("test.html") that gets served by
> > just mmap'ing it, and sending it out that way. Using lots of threads,
> > and using lots of different mappings. So they *all* fault on the read
> > of that page, and they *all* do that "lock page, check that the
> > mapping is valid, insert page" dance.
>
> Hmmmm. So this is typically a truncate race check, but this isn't
> sufficient to protect the fault against all page invalidation races
> as the page can be re-inserted into the same mapping at a different
> page->index now within EOF.

No it can't. find_get_page() returns the page with an elevated refcount.
The page can't be reused until we call put_page(). It can be removed
from the page cache, but can't go back to the page allocator until the
refcount hits zero.

> 5) filesystems will still need to be able to exclude page faults
> over a file range while they directly manipulate file metadata to
> change the user data in the file

Yes, but they can do that with a lock inside ->readpage (and, for that
matter, in ->readahead()), so there's no need to take a lock for pages
which are stable in cache.

2020-09-13 03:40:53

by Matthew Wilcox

Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 07:39:31PM -0700, Linus Torvalds wrote:
> The real worry with (d) is that we are using the page lock for other
> things too, not *just* the truncate check. Things where the inode lock
> wouldn't be helping, like locking against throwing pages out of the
> page cache entirely, or the hugepage splitting/merging etc. It's not
> being truncated, it's just the VM shrinking the cache or modifying
> things in other ways.

Actually, hugepage splitting is done under the protection of page freezing
where we temporarily set the refcount to zero, so pagecache lookups spin
rather than sleep on the lock. Quite nasty, but also quite rare.

> But the page locking code does have some extreme downsides, exactly
> because there are so _many_ pages and we end up having to play some

The good news is that the THP patchset is making good progress. I have
seven consecutive successful three-hour runs of xfstests, so maybe we'll
see fewer pages in the future.

2020-09-13 23:46:10

by Dave Chinner

Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 07:39:31PM -0700, Linus Torvalds wrote:
> On Sat, Sep 12, 2020 at 5:41 PM Dave Chinner <[email protected]> wrote:
> >
> > Hmmmm. So this is typically a truncate race check, but this isn't
> > sufficient to protect the fault against all page invalidation races
> > as the page can be re-inserted into the same mapping at a different
> > page->index now within EOF.
>
> Using some "move" ioctl or similar and using an "invalidate page
> mapping, then move it to a different point" model?

Right, that's the sort of optimisation we could do inside a
FALLOC_FL_{COLLAPSE,INSERT}_RANGE operation if we wanted to preserve
the page cache contents instead of invalidating it.

> Yeah. I think that ends up being basically an extended special case of
> the truncate thing (for the invalidate), and would require the
> filesystem to serialize externally to the page anyway.

*nod*

> Which they would presumably already do with the MMAPLOCK or similar,
> so I guess that's not a huge deal.
>
> The real worry with (d) is that we are using the page lock for other
> things too, not *just* the truncate check. Things where the inode lock
> wouldn't be helping, like locking against throwing pages out of the
> page cache entirely, or the hugepage splitting/merging etc. It's not
> being truncated, it's just the VM shrinking the cache or modifying
> things in other ways.

Yes, that is a problem, and us FS people don't know/see all the
places this can occur. We generally find out about them when one of
our regression stress tests trips over a data corruption. :(

I have my doubts that complex page cache manipulation operations
like ->migrate_page that rely exclusively on page and internal mm
serialisation are really safe against ->fallocate based invalidation
races. I think they probably also need to be wrapped in the
MMAPLOCK, but I don't understand all the locking and constraints
that ->migrate_page has and there's been no evidence yet that it's a
problem so I've kinda left that alone. I suspect that "no evidence"
thing comes from "filesystem people are largely unable to induce
page migrations in regression testing" so it has pretty much zero
test coverage....

Stuff like THP splitting hasn't been an issue for us because the
file-backed page cache does not support THP (yet!). That's
something I'll be looking closely at in Willy's upcoming patchset.

> So I do worry a bit about trying to make things per-inode (or even
> some per-range thing with a smarter lock) for those reasons. We use
> the page lock not just for synchronizing with filesystem operations,
> but for other page state synchronization too.

Right, I'm not suggesting the page lock goes away, just saying that
we actually need two levels of locking for file-backed pages - one
filesystem, one page level - and that carefully selecting where we
"aggregate" the locking for complex multi-object operations might
make the overall locking simpler.

> In many ways I think keeping it as a page-lock, and making the
> filesystem operations just act on the range of pages would be safer.

Possibly, but that "range of pages" lock still doesn't really solve
the filesystem level serialisation problem. We have to prevent page
faults from running over a range even when there aren't pages in the
page cache over that range (i.e. after we invalidate the range).
Hence we cannot rely on anything struct page related - the
serialisation mechanism has to be external to the cached pages
themselves, but it also has to integrate cleanly into the existing
locking and transaction ordering constraints we have.

> But the page locking code does have some extreme downsides, exactly
> because there are so _many_ pages and we end up having to play some
> extreme size games due to that (ie the whole external hashing, but
> also just not being able to use any debug locks etc, because we just
> don't have the resources to do debugging locks at that kind of
> granularity).

*nod*

The other issue here is that serialisation via individual cache
object locking just doesn't scale in any way to the sizes of
operations that fallocate() can run. fallocate() has 64-bit
operands, so a user could ask us to lock down a full 8EB range of a
file. Locking that page by page, even using 1GB huge page XArray
slot entries, is just not practical... :/

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-09-14 03:32:10

by Matthew Wilcox

Subject: Re: Kernel Benchmarking

On Mon, Sep 14, 2020 at 09:45:03AM +1000, Dave Chinner wrote:
> I have my doubts that complex page cache manipulation operations
> like ->migrate_page that rely exclusively on page and internal mm
> serialisation are really safe against ->fallocate based invalidation
> races. I think they probably also need to be wrapped in the
> MMAPLOCK, but I don't understand all the locking and constraints
> that ->migrate_page has and there's been no evidence yet that it's a
> problem so I've kinda left that alone. I suspect that "no evidence"
> thing comes from "filesystem people are largely unable to induce
> page migrations in regression testing" so it has pretty much zero
> test coverage....

Maybe we can get someone who knows the page migration code to give
us a hack to induce pretty much constant migration?

> Stuff like THP splitting hasn't been an issue for us because the
> file-backed page cache does not support THP (yet!). That's
> something I'll be looking closely at in Willy's upcoming patchset.

One of the things I did was fail every tenth I/O to a THP. That causes
us to split the THP when we come to try to make use of it. Far more
effective than using dm-flakey because I know that failing a readahead
I/O should not cause any test to fail, so any newly-failing test is
caused by the THP code.

I've probably spent more time looking at the page splitting and
truncate/hole-punch/invalidate/invalidate2 paths than anything else.
It's definitely an area where more eyes are welcome, and just having
more people understand it would be good. split_huge_page_to_list and
its various helper functions are about 400 lines of code and, IMO,
a little too complex.

> The other issue here is that serialisation via individual cache
> object locking just doesn't scale in any way to the sizes of
> operations that fallocate() can run. fallocate() has 64 bit
> operands, so a user could ask us to lock down a full 8EB range of
> file. Locking that page by page, even using 1GB huge page Xarray
> slot entries, is just not practical... :/

FWIW, there's not currently a "lock down this range" mechanism in
the page cache. If there were, it wouldn't be restricted to 4k/2M/1G
sizes -- with the XArray today, it's fairly straightforward to
lock ranges which are m * 64^n entries in size (for 1 <= m <= 63, n >= 0).
In the next year or two, I hope to be able to offer a "lock arbitrary
page range" feature which is as cheap to lock 8EiB as it is 128KiB.

It would still be page-ranges, not byte-ranges, so I don't know how well
that fits your needs. It doesn't solve the DIO vs page cache problems
at all, since we want DIO to ranges which happen to be within the same
pages as each other to not conflict.

2020-09-14 09:34:38

by Jan Kara

Subject: Re: Kernel Benchmarking

On Sat 12-09-20 17:32:41, Rogério Brito wrote:
> Now, to the subject: is this that you describe (RCU or VFS), in some sense,
> related to, say, copying a "big" file (e.g., a movie) to a "slow" media (in
> my case, a USB thumb drive, so that I can watch said movie on my TV)?
>
> I've seen backtraces mentioning "task xxx hung for yyy seconds" and a
> non-reponsive cp process at that... I say RCU or VFS because I see this
> with the thumb drives with vfat filesystems (so, it wouldn't be quite
> related to ext4, apart from the fact that all my Linux-specific
> filesystems are ext4).

This is very likely completely different problem. I'd need to see exact
messages and kernel traces but usually errors like these happen when the IO
is very slow and other things (such as grabbing some locks or doing memory
allocation) get blocked waiting for that IO.

The case Linus speaks about is really more about CPU-bound tasks
that heavily hammer the same cached contents.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-15 09:27:56

by Jan Kara

Subject: Re: Kernel Benchmarking

On Mon 14-09-20 09:45:03, Dave Chinner wrote:
> I have my doubts that complex page cache manipulation operations
> like ->migrate_page that rely exclusively on page and internal mm
> serialisation are really safe against ->fallocate based invalidation
> races. I think they probably also need to be wrapped in the
> MMAPLOCK, but I don't understand all the locking and constraints
> that ->migrate_page has and there's been no evidence yet that it's a
> problem so I've kinda left that alone. I suspect that "no evidence"
> thing comes from "filesystem people are largely unable to induce
> page migrations in regression testing" so it has pretty much zero
> test coverage....

Last time I looked, ->migrate_page seemed safe to me. Page migration
happens under page lock so truncate_inode_pages_range() will block until
page migration is done (and this covers currently pretty much anything
fallocate related). And once truncate_inode_pages_range() is done,
there are no pages to migrate :) (plus migration code checks page->mapping
!= NULL after locking the page).
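That ordering can be sketched roughly as follows (simplified pseudocode of
my own, not the actual kernel code paths -- both sides serialize on the page
lock, and migration re-checks page->mapping after acquiring it):

```
/*
 * Simplified pseudocode of the race resolution described above,
 * not the real kernel functions.
 */
migrate_one_page(page)
{
        lock_page(page);                /* truncate takes this too      */
        if (page->mapping == NULL) {    /* truncate already removed the */
                unlock_page(page);      /* page from the file, so there */
                return -EAGAIN;         /* is nothing left to migrate   */
        }
        /* ... copy contents, move the page cache entry, etc ... */
        unlock_page(page);
        return 0;
}
```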

But I agree testing would be nice. When I was chasing a data corruption in
block device page cache caused by page migration, I was using thpscale [1]
or thpfioscale [2] benchmarks from mmtests which create anon hugepage
mapping and bang it from several threads thus making kernel try to compact
pages (and thus migrate other pages that block compaction) really hard. And
with it in parallel I was running the filesystem stress that seemed to
cause issues for the customer... I guess something like fsx & fsstress runs
with this THP stress test in parallel might make decent fstests to have.

Honza

[1] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/thpscale/thpscale.c
[2] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/thpfioscale/thpfioscale.c

--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-16 00:00:28

by Chris Mason

[permalink] [raw]
Subject: Re: Kernel Benchmarking

On 13 Sep 2020, at 23:31, Matthew Wilcox wrote:

> On Mon, Sep 14, 2020 at 09:45:03AM +1000, Dave Chinner wrote:
>> I have my doubts that complex page cache manipulation operations
>> like ->migrate_page that rely exclusively on page and internal mm
>> serialisation are really safe against ->fallocate based invalidation
>> races. I think they probably also need to be wrapped in the
>> MMAPLOCK, but I don't understand all the locking and constraints
>> that ->migrate_page has and there's been no evidence yet that it's a
>> problem so I've kinda left that alone. I suspect that "no evidence"
>> thing comes from "filesystem people are largely unable to induce
>> page migrations in regression testing" so it has pretty much zero
>> test coverage....
>
> Maybe we can get someone who knows the page migration code to give
> us a hack to induce pretty much constant migration?

While debugging migrate page problems, I usually run dbench and

while true; do echo 1 > /proc/sys/vm/compact_memory; done

I’ll do this with a mixture of memory pressure or drop_caches or a
memory hog depending on what I hope to trigger.

Because of hugepage allocations, we tend to bash on migration/compaction
fairly hard in the fleet. We do fallocate in some of these workloads as
well, but I’m sure it doesn’t count as complete coverage for the
races Dave is worried about.

-chris