2014-06-05 11:33:45

by Felipe Contreras

[permalink] [raw]
Subject: Interactivity regression since v3.11 in mm/vmscan.c

Hi,

For a while I've noticed that my machine bogs down in certain
situations, usually while doing heavy I/O operations, it is not just the
I/O operations, but everything, including the graphical interface, even
the mouse pointer.

As far as I can recall this did not happen in the past.

I noticed this specially on certain operations, for example updating a
a game on Steam (to an exteranl USB 3.0 device), or copying TV episodes
to a USB memory stick (probably flash-based).

Today I decided to finally hunt down the problem, so I created a
synthetic test that basically consists on copying a bunch of files from
one drive to another (from an SSD to an external USB 3.0). This is
pretty similar to what I noticed; the graphical interface slows down.

Then I bisected the issue and it turns out that indeed it wasn't
happening in the past, it started happening in v3.11, and it was
triggered by this commit:

e2be15f (mm: vmscan: stall page reclaim and writeback pages based on
dirty/writepage pages encountered)

Then I went back to the latest stable version (v3.14.5), and commented
out the line I think is causing the slow down:

if (nr_unqueued_dirty == nr_taken || nr_immediate)
congestion_wait(BLK_RW_ASYNC, HZ/10);

After that I don't notice the slow down any more.

Anybody has any ideas how to fix the issue properly?

--
Felipe Contreras


2014-06-05 13:37:51

by Michal Hocko

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Thu 05-06-14 06:33:40, Felipe Contreras wrote:
> Hi,

Hi,

> For a while I've noticed that my machine bogs down in certain
> situations, usually while doing heavy I/O operations, it is not just the
> I/O operations, but everything, including the graphical interface, even
> the mouse pointer.
>
> As far as I can recall this did not happen in the past.
>
> I noticed this specially on certain operations, for example updating a
> a game on Steam (to an exteranl USB 3.0 device), or copying TV episodes
> to a USB memory stick (probably flash-based).

We had a similar report for opensuse. The common part was that there was
an IO to a slow USB device going on.

> Today I decided to finally hunt down the problem, so I created a
> synthetic test that basically consists on copying a bunch of files from
> one drive to another (from an SSD to an external USB 3.0). This is
> pretty similar to what I noticed; the graphical interface slows down.
>
> Then I bisected the issue and it turns out that indeed it wasn't
> happening in the past, it started happening in v3.11, and it was
> triggered by this commit:
>
> e2be15f (mm: vmscan: stall page reclaim and writeback pages based on
> dirty/writepage pages encountered)
>
> Then I went back to the latest stable version (v3.14.5), and commented
> out the line I think is causing the slow down:
>
> if (nr_unqueued_dirty == nr_taken || nr_immediate)
> congestion_wait(BLK_RW_ASYNC, HZ/10);

Yes, I came to the same check. I didn't have any confirmation yet so
thanks for your confirmation. I've suggested to reduce this
congestion_wait only to kswapd:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d66a45..ef6a1c0e788c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1566,7 +1566,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* implies that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (nr_unqueued_dirty == nr_taken || nr_immediate)
+ if ((nr_unqueued_dirty == nr_taken || nr_immediate) && current_is_kswapd())
congestion_wait(BLK_RW_ASYNC, HZ/10);
}

But I am still not sure whether this is the right way to fix it. Direct
reclaimers can be throttled later on wait_iff_congested. I guess the
original intention was to throttle kswapd to not scan LRU full of dirty
pages like crazy. So I think it makes some sense to reduce the
congestion_wait only to kswapd.


> After that I don't notice the slow down any more.
>
> Anybody has any ideas how to fix the issue properly?
>
> --
> Felipe Contreras
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Michal Hocko
SUSE Labs

2014-06-05 14:00:16

by Felipe Contreras

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Thu, Jun 5, 2014 at 8:37 AM, Michal Hocko <[email protected]> wrote:
> On Thu 05-06-14 06:33:40, Felipe Contreras wrote:

>> For a while I've noticed that my machine bogs down in certain
>> situations, usually while doing heavy I/O operations, it is not just the
>> I/O operations, but everything, including the graphical interface, even
>> the mouse pointer.
>>
>> As far as I can recall this did not happen in the past.
>>
>> I noticed this specially on certain operations, for example updating a
>> a game on Steam (to an exteranl USB 3.0 device), or copying TV episodes
>> to a USB memory stick (probably flash-based).
>
> We had a similar report for opensuse. The common part was that there was
> an IO to a slow USB device going on.

Well, it's a USB 3.0 device, I can write at 250 MB/s, so it's not
really that slow.

And in fact, when I read and write to and from the same USB 3.0
device, I don't see the issue.

>> Then I went back to the latest stable version (v3.14.5), and commented
>> out the line I think is causing the slow down:
>>
>> if (nr_unqueued_dirty == nr_taken || nr_immediate)
>> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> Yes, I came to the same check. I didn't have any confirmation yet so
> thanks for your confirmation. I've suggested to reduce this
> congestion_wait only to kswapd:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d66a45..ef6a1c0e788c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1566,7 +1566,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> * implies that pages are cycling through the LRU faster than
> * they are written so also forcibly stall.
> */
> - if (nr_unqueued_dirty == nr_taken || nr_immediate)
> + if ((nr_unqueued_dirty == nr_taken || nr_immediate) && current_is_kswapd())
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> }

Unfortunately that doesn't fix the issue for me.

--
Felipe Contreras

2014-06-06 09:16:26

by Michal Hocko

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Thu 05-06-14 09:00:10, Felipe Contreras wrote:
> On Thu, Jun 5, 2014 at 8:37 AM, Michal Hocko <[email protected]> wrote:
> > On Thu 05-06-14 06:33:40, Felipe Contreras wrote:
>
> >> For a while I've noticed that my machine bogs down in certain
> >> situations, usually while doing heavy I/O operations, it is not just the
> >> I/O operations, but everything, including the graphical interface, even
> >> the mouse pointer.
> >>
> >> As far as I can recall this did not happen in the past.
> >>
> >> I noticed this specially on certain operations, for example updating a
> >> a game on Steam (to an exteranl USB 3.0 device), or copying TV episodes
> >> to a USB memory stick (probably flash-based).
> >
> > We had a similar report for opensuse. The common part was that there was
> > an IO to a slow USB device going on.
>
> Well, it's a USB 3.0 device, I can write at 250 MB/s, so it's not
> really that slow.
>
> And in fact, when I read and write to and from the same USB 3.0
> device, I don't see the issue.
>
> >> Then I went back to the latest stable version (v3.14.5), and commented
> >> out the line I think is causing the slow down:
> >>
> >> if (nr_unqueued_dirty == nr_taken || nr_immediate)
> >> congestion_wait(BLK_RW_ASYNC, HZ/10);
> >
> > Yes, I came to the same check. I didn't have any confirmation yet so
> > thanks for your confirmation. I've suggested to reduce this
> > congestion_wait only to kswapd:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 32c661d66a45..ef6a1c0e788c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1566,7 +1566,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > * implies that pages are cycling through the LRU faster than
> > * they are written so also forcibly stall.
> > */
> > - if (nr_unqueued_dirty == nr_taken || nr_immediate)
> > + if ((nr_unqueued_dirty == nr_taken || nr_immediate) && current_is_kswapd())
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > }
>
> Unfortunately that doesn't fix the issue for me.

That is really interesting. So removing the test completely helps but
reducing it to kswapd doesn't. I would expect stalls coming from direct
reclaimers not the kswapd.

Mel has a nice systemtap script (attached) to watch for stalls. Maybe
you can give it a try?

--
Michal Hocko
SUSE Labs


Attachments:
(No filename) (2.37 kB)
watch-dstate-new.pl (10.90 kB)
Download all attachments

2014-06-06 10:24:30

by Felipe Contreras

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Fri, Jun 6, 2014 at 4:58 AM, <[email protected]> wrote:

> Alternatively can we try wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10) ?

I see the same problem with that code.

--
Felipe Contreras

2014-06-06 10:33:30

by Felipe Contreras

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Fri, Jun 6, 2014 at 4:16 AM, Michal Hocko <[email protected]> wrote:

> Mel has a nice systemtap script (attached) to watch for stalls. Maybe
> you can give it a try?

Is there any special configurations I should enable?

I get this:
semantic error: unresolved arity-1 global array name, missing global
declaration?: identifier 'name' at /tmp/stapd6pu9A:4:2
source: name[t]=execname()
^

Pass 2: analysis failed. [man error::pass2]
Number of similar error messages suppressed: 71.
Rerun with -v to see them.
Unexpected exit of STAP script at
/home/felipec/Downloads/watch-dstate-new.pl line 320.



--
Felipe Contreras

2014-06-06 11:03:09

by Michal Hocko

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Fri 06-06-14 05:33:28, Felipe Contreras wrote:
> On Fri, Jun 6, 2014 at 4:16 AM, Michal Hocko <[email protected]> wrote:
>
> > Mel has a nice systemtap script (attached) to watch for stalls. Maybe
> > you can give it a try?
>
> Is there any special configurations I should enable?

You need debuginfo and systemtap AFAIK. I haven't used this script
myself.
--
Michal Hocko
SUSE Labs

2014-06-06 12:27:10

by Felipe Contreras

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Thu, Jun 5, 2014 at 8:37 AM, Michal Hocko <[email protected]> wrote:

> We had a similar report for opensuse. The common part was that there was
> an IO to a slow USB device going on.

Actually I've managed to narrow down my synthetic test, and all I need
is to copy a big file, and it even happens reading and writing to the
SSD (although the stall is less severe).

Here's the test:
http://pastie.org/9264124

Just pass a big file as the first argument.

I don't have much memory in this machine, so I guess running out of
memory is the trigger.

--
Felipe Contreras

2014-06-06 12:54:25

by Felipe Contreras

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Fri, Jun 6, 2014 at 6:03 AM, Michal Hocko <[email protected]> wrote:
> On Fri 06-06-14 05:33:28, Felipe Contreras wrote:
>> On Fri, Jun 6, 2014 at 4:16 AM, Michal Hocko <[email protected]> wrote:
>>
>> > Mel has a nice systemtap script (attached) to watch for stalls. Maybe
>> > you can give it a try?
>>
>> Is there any special configurations I should enable?
>
> You need debuginfo and systemtap AFAIK. I haven't used this script
> myself.

I have both.

--
Felipe Contreras

2014-06-06 23:11:16

by Felipe Contreras

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Fri, Jun 6, 2014 at 5:33 AM, Felipe Contreras
<[email protected]> wrote:
> On Fri, Jun 6, 2014 at 4:16 AM, Michal Hocko <[email protected]> wrote:
>
>> Mel has a nice systemtap script (attached) to watch for stalls. Maybe
>> you can give it a try?
>
> Is there any special configurations I should enable?
>
> I get this:
> semantic error: unresolved arity-1 global array name, missing global
> declaration?: identifier 'name' at /tmp/stapd6pu9A:4:2
> source: name[t]=execname()
> ^
>
> Pass 2: analysis failed. [man error::pass2]
> Number of similar error messages suppressed: 71.
> Rerun with -v to see them.
> Unexpected exit of STAP script at
> /home/felipec/Downloads/watch-dstate-new.pl line 320.

Actually I debugged the problem, and it's that the format of the
script is DOS, not UNIX. After changing the format the script works.

However, it's not returning anything. It's running, but doesn't seem
to find any stalls.

--
Felipe Contreras

2014-06-07 15:23:53

by Felipe Contreras

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Sat, Jun 7, 2014 at 7:35 AM, <[email protected]> wrote:

> Would you please try again based only on comment [1](based on v3.15-rc8)?
> thanks
> Hillf
>
> --- a/mm/vmscan.c Sat Jun 7 18:38:08 2014
> +++ b/mm/vmscan.c Sat Jun 7 20:08:36 2014
> @@ -1566,7 +1566,7 @@ shrink_inactive_list(unsigned long nr_to
> * implies that pages are cycling through the LRU faster than
> * they are written so also forcibly stall.
> */
> - if (nr_unqueued_dirty == nr_taken || nr_immediate)
> + if (nr_immediate)
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> }

That actually seems to work correctly on v3.15-rc8.

--
Felipe Contreras

2014-06-07 18:24:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

So we very recently (as in this merge window) merged a change to this
very area, but that change was very specific to one case.

Hillf's patch (below) apparently fixes the problem Felipe sees, and I
have to say, his problem sounds a *lot* like the kind of horrible
performance I've seen with writing to USB devices. I blamed
non-working per-bdi throttling, but this implies it is more generic
than that. The fact that the very same code also made nfsd very
unhappy makes me think that the code is just fundamentally broken.

And quite frankly, the whole logic is a bit questionable. That

"nr_unqueued_dirty == nr_taken"

test is claimed to be "implies that flushers are not keeping up", but
that's not actually true at all. It just means that

(a) all the pages we isolated are dirty
(b) .. and none of them are under writeback

and it's very possible that none of them are under writeback because
nobody has even decided to start writeback on them yet, because nobody
has even walked through the list yet, so they were all still marked as
referenced. I guess you could say that "flushers are not keeping up",
but *we're* one of the flushers, and it's not that we aren't keeping
up, it's that we haven't even scanned things yet.

So what do we do when we haven't scanned the list enough to see any
non-referenced pages? Do we scan it a bit more? No. We decide to
congestion-wait.

That sounds completely and utterly stupid and broken. Does it make any
sense at all? No it doesn't. It just seems to delay starting any
writeback at all.

I suspect the code comes from "let's not spend too much time scanning
the dirty lists when everything is dirty", and is trying to avoid CPU
use. But what it seems to do is actually to avoid even starting
writeback in the first place, and just "congestion-waiting" even when
nothing is being written back (here "nothing" is not absolute - we're
only looking at a part of the dirty pages, obviously, but we're
looking at the *old* dirty pages, so it's a fairly important part of
it).

So I really get the feeling that this code is broken, and that the
patch to remove that "nr_unqueued_dirty == nr_taken" is correct.

In particular, doesn't that congestion wait - which is supposed to
wait for kswapd - end up waiting even when the process in question
*is* kswapd?

So it's not just processes like nfsd that got throttled down (which no
longer happens because of the recent commit 399ba0b95670), it seems
like kswapd itself gets throttled down because of this test.

So at the *very* least I feel like the new current_may_throttle()
needs to say that "kswapd must not be throttled", but I wonder if that
whole thing just needs to go.

And maybe that recent commit 399ba0b95670 is actually broken, and
wanted to fix just this part too. Maybe it *should* wait for the
"nr_immediate" case, which is the one that is currently aimed at
*only* throttling down kswapd itself. Maybe we should remove the
"current_is_kswapd()" test in the nr_immediate code instead, and make
everybody throttle when they hit the actual _real_ congestion case of
the whole zone being under writeback?

Comments? Mel, this code is mostly attributed to you, I'd like to hear
what you think in particular.

Linus

On Sat, Jun 7, 2014 at 5:35 AM, <[email protected]> wrote:
>
> The comments around the congestion_wait,
> [1]
> *
> * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
> * of pages under pages flagged for immediate reclaim and stall if any
> * are encountered in the nr_immediate check below.
> */
> if (nr_writeback && nr_writeback == nr_taken)
> zone_set_flag(zone, ZONE_WRITEBACK);
>
>
> [2]
> /*
> * If dirty pages are scanned that are not queued for IO, it
> * implies that flushers are not keeping up. In this case, flag
> * the zone ZONE_TAIL_LRU_DIRTY and kswapd will start writing
> * pages from reclaim context. It will forcibly stall in the
> * next check.
> */
> if (nr_unqueued_dirty == nr_taken)
> zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
>
> The "force stall" in [2] conflicts with "start writing pages" in [2], and
> conflicts with "nr_immediate check below" in [1] as well, IIUC.
>
> Would you please try again based only on comment [1](based on v3.15-rc8)?
> thanks
> Hillf
>
> --- a/mm/vmscan.c Sat Jun 7 18:38:08 2014
> +++ b/mm/vmscan.c Sat Jun 7 20:08:36 2014
> @@ -1566,7 +1566,7 @@ shrink_inactive_list(unsigned long nr_to
> * implies that pages are cycling through the LRU faster than
> * they are written so also forcibly stall.
> */
> - if (nr_unqueued_dirty == nr_taken || nr_immediate)
> + if (nr_immediate)
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> }
>
> --

2014-06-08 21:33:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Sat, Jun 7, 2014 at 11:24 AM, Linus Torvalds
<[email protected]> wrote:
>
> Comments? Mel, this code is mostly attributed to you, I'd like to hear
> what you think in particular.

In the meantime, I've removed the "nr_unqueued_dirty == nr_taken"
check for congestion_wait(), since I can't see how it can possibly be
sensible, and Felipe confirmed that it fixes his interactivity issue.

Nobody commented on it, but let's see if we get reactions to the
behavior changing..

Linus

2014-06-09 07:54:03

by Michal Hocko

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Fri 06-06-14 18:11:14, Felipe Contreras wrote:
> On Fri, Jun 6, 2014 at 5:33 AM, Felipe Contreras
> <[email protected]> wrote:
> > On Fri, Jun 6, 2014 at 4:16 AM, Michal Hocko <[email protected]> wrote:
> >
> >> Mel has a nice systemtap script (attached) to watch for stalls. Maybe
> >> you can give it a try?
> >
> > Is there any special configurations I should enable?
> >
> > I get this:
> > semantic error: unresolved arity-1 global array name, missing global
> > declaration?: identifier 'name' at /tmp/stapd6pu9A:4:2
> > source: name[t]=execname()
> > ^
> >
> > Pass 2: analysis failed. [man error::pass2]
> > Number of similar error messages suppressed: 71.
> > Rerun with -v to see them.
> > Unexpected exit of STAP script at
> > /home/felipec/Downloads/watch-dstate-new.pl line 320.
>
> Actually I debugged the problem, and it's that the format of the
> script is DOS, not UNIX. After changing the format the script works.

Ups, I've downloaded it from our bugzilla so maybe it just did some
tricks with the script.

> However, it's not returning anything. It's running, but doesn't seem
> to find any stalls.

Intereting. It was quite good at pointing at stalls. How are you
measuring those stalls during your testing?
--
Michal Hocko
SUSE Labs

2014-06-09 10:02:43

by Felipe Contreras

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

Michal Hocko wrote:
> On Fri 06-06-14 18:11:14, Felipe Contreras wrote:
> > On Fri, Jun 6, 2014 at 5:33 AM, Felipe Contreras
> > <[email protected]> wrote:
> > > On Fri, Jun 6, 2014 at 4:16 AM, Michal Hocko <[email protected]> wrote:
> > >
> > >> Mel has a nice systemtap script (attached) to watch for stalls. Maybe
> > >> you can give it a try?
> > >
> > > Is there any special configurations I should enable?
> > >
> > > I get this:
> > > semantic error: unresolved arity-1 global array name, missing global
> > > declaration?: identifier 'name' at /tmp/stapd6pu9A:4:2
> > > source: name[t]=execname()
> > > ^
> > >
> > > Pass 2: analysis failed. [man error::pass2]
> > > Number of similar error messages suppressed: 71.
> > > Rerun with -v to see them.
> > > Unexpected exit of STAP script at
> > > /home/felipec/Downloads/watch-dstate-new.pl line 320.
> >
> > Actually I debugged the problem, and it's that the format of the
> > script is DOS, not UNIX. After changing the format the script works.
>
> Ups, I've downloaded it from our bugzilla so maybe it just did some
> tricks with the script.
>
> > However, it's not returning anything. It's running, but doesn't seem
> > to find any stalls.
>
> Intereting. It was quite good at pointing at stalls. How are you
> measuring those stalls during your testing?

I'm not measuring them, I simply grab a GUI window and move it around
while the big file is being copied, when the issue happens the window
stops moving, and the mouse, everything hangs for a time, then it
resumes, then hangs again... the interactivity is bad.

--
Felipe Contreras

2014-06-09 12:58:16

by Mel Gorman

[permalink] [raw]
Subject: Re: Interactivity regression since v3.11 in mm/vmscan.c

On Sat, Jun 07, 2014 at 11:24:56AM -0700, Linus Torvalds wrote:
> So we very recently (as in this merge window) merged a change to this
> very area, but that change was very specific to one case.
>
> Hillf's patch (below) apparently fixes the problem Felipe sees, and I
> have to say, his problem sounds a *lot* like the kind of horrible
> performance I've seen with writing to USB devices. I blamed
> non-working per-bdi throttling, but this implies it is more generic
> than that. The fact that the very same code also made nfsd very
> unhappy makes me think that the code is just fundamentally broken.
>
> And quite frankly, the whole logic is a bit questionable. That
>
> "nr_unqueued_dirty == nr_taken"
>
> test is claimed to be "implies that flushers are not keeping up", but
> that's not actually true at all. It just means that
>
> (a) all the pages we isolated are dirty
> (b) .. and none of them are under writeback
>
> and it's very possible that none of them are under writeback because
> nobody has even decided to start writeback on them yet, because nobody
> has even walked through the list yet, so they were all still marked as
> referenced. I guess you could say that "flushers are not keeping up",
> but *we're* one of the flushers, and it's not that we aren't keeping
> up, it's that we haven't even scanned things yet.
>
> So what do we do when we haven't scanned the list enough to see any
> non-referenced pages? Do we scan it a bit more? No. We decide to
> congestion-wait.
>
> That sounds completely and utterly stupid and broken. Does it make any
> sense at all? No it doesn't. It just seems to delay starting any
> writeback at all.
>

The original intent was moving away from direct reclaimers and kswapd just
blocking on congestion for the sake of it and avoiding excessive swapping
during IO. That was not a smooth road.

> I suspect the code comes from "let's not spend too much time scanning
> the dirty lists when everything is dirty", and is trying to avoid CPU
> use.

Yes. At the time we moved away from calling congestion_wait() for all
sorts of reasons there were a number of bugs with different root causes
but looked like kswapd using 99% of CPU during heavy IO.

> But what it seems to do is actually to avoid even starting
> writeback in the first place, and just "congestion-waiting" even when
> nothing is being written back (here "nothing" is not absolute - we're
> only looking at a part of the dirty pages, obviously, but we're
> looking at the *old* dirty pages, so it's a fairly important part of
> it).
>
> So I really get the feeling that this code is broken, and that the
> patch to remove that "nr_unqueued_dirty == nr_taken" is correct.
>

I cannot think of a reason to disagree with that. It was a mistake because
it also failed to take into account that writeback might not have delayed
because the dirty expire limit had not been reached.

> In particular, doesn't that congestion wait - which is supposed to
> wait for kswapd - end up waiting even when the process in question
> *is* kswapd?
>
> So it's not just processes like nfsd that got throttled down (which no
> longer happens because of the recent commit 399ba0b95670), it seems
> like kswapd itself gets throttled down because of this test.
>
> So at the *very* least I feel like the new current_may_throttle()
> needs to say that "kswapd must not be throttled", but I wonder if that
> whole thing just needs to go.
>
> And maybe that recent commit 399ba0b95670 is actually broken, and
> wanted to fix just this part too. Maybe it *should* wait for the
> "nr_immediate" case, which is the one that is currently aimed at
> *only* throttling down kswapd itself. Maybe we should remove the
> "current_is_kswapd()" test in the nr_immediate code instead, and make
> everybody throttle when they hit the actual _real_ congestion case of
> the whole zone being under writeback?
>
> Comments? Mel, this code is mostly attributed to you, I'd like to hear
> what you think in particular.
>

I've no problem with your patch so lets go with it with the caveat that there
are three bugs to watch out for. The first is excessive CPU usage during
reclaim by direct reclaimers or kswapd which should still be controlled
but worth watching for anyway. The second is excessive writeback from
kswapd context resulting in poor IO efficiency which is harder to measure a
performance impact for but looks like high counts for nr_vmscan_write. The
third is excessive swapping during IO as the file LRUs are being scanned
with mostly dirty pages and the reclaimer swaps anonymous pages instead
which will look like interactivity stalls due to swapping during heavy IO. I
can keep an eye out for all three when my regression tests pick up 3.16-rc1.

Thanks.

--
Mel Gorman
SUSE Labs