2010-07-29 12:23:39

by Fengguang Wu

Subject: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

Andrew,

It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
This simple patchset shows the basic idea. Since it's a big behavior change,
there are inevitably lots of details to sort out. I don't know where it will
go after tests and discussions, so the patches are intentionally kept simple.

sync livelock avoidance (more is needed to be complete, but this is the minimum required for the last two patches)
[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp

let the flusher threads do ASYNC writeback for pageout()
[PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
[PATCH 5/5] vmscan: transfer async file writeback to the flusher

The last two patches are the meat; they depend on the first three patches to
kick the background writeback work, so that the for_reclaim writeback can be
serviced in a timely manner.
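As a rough illustration of what patches 4-5/5 aim for (a sketch only, not the
actual patch; the bdi_start_inode_writeback() signature below is an assumption
about the helper proposed in patch 4/5), the reclaim path would stop issuing
->writepage() itself and instead hand the work to the flusher:

static pageout_t queue_pageout_to_flusher(struct page *page,
                                          struct address_space *mapping)
{
        /*
         * Instead of issuing the IO from vmscan, ask the flusher thread
         * to write back the owning inode, starting around @page.  The
         * exact signature of bdi_start_inode_writeback() is a guess.
         */
        bdi_start_inode_writeback(mapping->host, page->index);

        /* The page stays dirty for now and becomes reclaimable once the
         * flusher has cleaned it. */
        return PAGE_KEEP;
}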

Comments are welcome!

Thanks,
Fengguang


2010-07-29 16:10:21

by Jan Kara

Subject: Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

On Thu 29-07-10 19:51:42, Wu Fengguang wrote:
> Andrew,
>
> It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> This simple patchset shows the basic idea. Since it's a big behavior change,
> there are inevitably lots of details to sort out. I don't know where it will
> go after tests and discussions, so the patches are intentionally kept simple.
>
> sync livelock avoidance (need more to be complete, but this is minimal required for the last two patches)
> [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
Well, essentially any WB_SYNC_NONE writeback is still livelockable if you
just grow a file constantly. So your changes are a step in the right
direction but won't fix the issue completely. But what we could do to fix
the issue completely would be to just set wbc->nr_to_write to LONG_MAX
before writing an inode for sync, and use my livelock avoidance using
page tagging for this case (it wouldn't have the possible performance
issue because we are going to write the whole inode anyway).
I can write the patch but frankly there are so many patches floating
around that I'm not sure what I should base it on...
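To sketch the idea (not the final code), with the page-tagging patches
(which add tag_pages_for_writeback() and a PAGECACHE_TAG_TOWRITE radix tree
tag) write_cache_pages() would roughly do:

        int tag;

        if (wbc->sync_mode == WB_SYNC_ALL) {
                /* remember which pages are dirty right now */
                tag_pages_for_writeback(mapping, index, end);
                tag = PAGECACHE_TAG_TOWRITE;
        } else {
                tag = PAGECACHE_TAG_DIRTY;
        }
        /*
         * The lookup loop then uses pagevec_lookup_tag(..., tag, ...), so
         * pages dirtied after the tagging pass are simply not seen and a
         * constantly growing file cannot livelock this pass.  Setting
         * wbc->nr_to_write to LONG_MAX is then safe because the whole
         * inode is going to be written anyway.
         */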

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-07-29 23:25:20

by Dave Chinner

Subject: Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> Andrew,
>
> It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> This simple patchset shows the basic idea. Since it's a big behavior change,
> there are inevitably lots of details to sort out. I don't know where it will
> go after tests and discussions, so the patches are intentionally kept simple.
>
> sync livelock avoidance (need more to be complete, but this is minimal required for the last two patches)
> [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
>
> let the flusher threads do ASYNC writeback for pageout()
> [PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> [PATCH 5/5] vmscan: transfer async file writeback to the flusher

I really do not like this - all it does is transfer random page writeback
from vmscan to the flusher threads rather than avoiding random page
writeback altogether. Random page writeback is nasty - just say no.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-30 05:34:13

by Fengguang Wu

Subject: Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

On Fri, Jul 30, 2010 at 12:09:47AM +0800, Jan Kara wrote:
> On Thu 29-07-10 19:51:42, Wu Fengguang wrote:
> > Andrew,
> >
> > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > This simple patchset shows the basic idea. Since it's a big behavior change,
> > there are inevitably lots of details to sort out. I don't know where it will
> > go after tests and discussions, so the patches are intentionally kept simple.
> >
> > sync livelock avoidance (need more to be complete, but this is minimal required for the last two patches)
> > [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> Well, essentially any WB_SYNC_NONE writeback is still livelockable if you
> just grow a file constantly. So your changes are a step in the right
> direction but won't fix the issue completely.

Right. We have complementary patches to prevent livelocks both inside a
file and among files.
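For the "among files" half, the idea is roughly the following (a sketch of
patch 3/5's sync_after timestamp, not the patch itself):

        /*
         * The sync work records when it was queued; the inode-queueing
         * loop then skips anything dirtied later, so newly dirtied inodes
         * cannot keep the sync pass alive forever.  sync_after is sampled
         * (in jiffies) at the time the sync work is queued.
         */
        static bool inode_dirtied_before_sync(struct inode *inode,
                                              unsigned long sync_after)
        {
                return !time_after(inode->dirtied_when, sync_after);
        }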

> But what we could do to fix
> the issue completely would be to just set wbc->nr_to_write to LONG_MAX
> before writing an inode for sync, and use my livelock avoidance using
> page tagging for this case (it wouldn't have the possible performance
> issue because we are going to write the whole inode anyway).

Yeah, your patches are good for avoiding livelocking on one single busy file.
I didn't forget them :)

> I can write the patch but frankly there are so many patches floating
> around that I'm not sure what I should base it on...

I'm confused too. It may take some time to quiet down...

Thanks,
Fengguang

2010-07-30 08:38:08

by Fengguang Wu

Subject: Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > Andrew,
> >
> > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > This simple patchset shows the basic idea. Since it's a big behavior change,
> > there are inevitably lots of details to sort out. I don't know where it will
> > go after tests and discussions, so the patches are intentionally kept simple.
> >
> > sync livelock avoidance (need more to be complete, but this is minimal required for the last two patches)
> > [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> >
> > let the flusher threads do ASYNC writeback for pageout()
> > [PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > [PATCH 5/5] vmscan: transfer async file writeback to the flusher
>
> I really do not like this - all it does is transfer random page writeback
> from vmscan to the flusher threads rather than avoiding random page
> writeback altogether. Random page writeback is nasty - just say no.

There are cases where we have to do pageout().

- a stressed memcg with lots of dirty pages
- a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages

In the above cases, the whole system may not be that stressed,
except for some local LRU list being busily scanned. If the local
memory stress leads to lots of pageout(), it could bring down the whole
system by congesting the disks with lots of small, seeky IO.

It may be overkill to push global writeback (ie. it's silly to sync
1GB of dirty data because there is a small, stressed 100MB LRU list). The
obvious solution is to keep the pageout() calls and make them smarter
about IO by doing write-around at the same time. The write-around pages
will likely be in the same stressed LRU list, hence they will do good for
page reclaim as well.
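To make that concrete (a sketch only; the 1024-page window and the clamping
are arbitrary), the wbc for such a pageout() request could cover a window
around the target page instead of the single page:

        /* page is the dirty page picked by reclaim */
        pgoff_t index  = page->index;
        pgoff_t window = 1024;          /* illustrative cluster size */
        pgoff_t start  = index > window / 2 ? index - window / 2 : 0;

        struct writeback_control wbc = {
                .sync_mode   = WB_SYNC_NONE,
                .nr_to_write = window,
                .range_start = (loff_t)start << PAGE_CACHE_SHIFT,
                .range_end   = ((loff_t)(start + window) << PAGE_CACHE_SHIFT) - 1,
                .for_reclaim = 1,
        };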

Transferring ASYNC work to the flushers helps with the kswapd-vs-flusher
priority problem too. Currently kswapd/direct reclaim either has to
skip dirty pages on congestion, or risk being blocked in
get_request_wait(); neither is a good option. However, the use of
bdi_start_inode_writeback() does call for a good vmscan throttling scheme
to prevent false OOMs before the flusher is able to clean the
transferred pages. This would be tricky.
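One naive shape such a throttle could take (purely hypothetical, only to show
where it would sit): after handing dirty pages to the flusher, reclaim backs
off briefly instead of escalating straight towards OOM.

        /*
         * Hypothetical: nr_transferred counts pages handed to the flusher
         * in this scan.  Give the flusher a moment to clean them before
         * concluding that reclaim is making no progress.
         */
        if (nr_transferred && priority < DEF_PRIORITY - 2)
                congestion_wait(BLK_RW_ASYNC, HZ / 10);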

If the system is globally memory stressed and runs into pageout(), we
can safely kick the flusher threads for more writeback. There are 3
possible schemes:

- to kick writeback for N pages, eg. the existing wakeup_flusher_threads() calls (see the sketch after this list)

- to lower dirty_expire_interval, eg. to enqueue the current inode
(that contains the current dirty page for pageout()) _plus_ all
older inodes for writeback. This can be done when servicing the
for_reclaim writeback work.

- to lower dirty throttle limit (trying to find a criterion...)
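
The first scheme could look like this (a sketch; nr_dirty_seen is a made-up
per-scan counter and the threshold is arbitrary):

        /*
         * When a reclaim scan keeps bumping into dirty pages, ask the
         * flusher threads to clean roughly that many pages globally.
         */
        if (nr_dirty_seen > SWAP_CLUSTER_MAX)
                wakeup_flusher_threads(nr_dirty_seen);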

Thanks,
Fengguang

2010-07-30 09:22:24

by KOSAKI Motohiro

Subject: Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

> On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> > On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > > Andrew,
> > >
> > > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > > This simple patchset shows the basic idea. Since it's a big behavior change,
> > > there are inevitably lots of details to sort out. I don't know where it will
> > > go after tests and discussions, so the patches are intentionally kept simple.
> > >
> > > sync livelock avoidance (need more to be complete, but this is minimal required for the last two patches)
> > > [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > > [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > > [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> > >
> > > let the flusher threads do ASYNC writeback for pageout()
> > > [PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > > [PATCH 5/5] vmscan: transfer async file writeback to the flusher
> >
> > I really do not like this - all it does is transfer random page writeback
> > from vmscan to the flusher threads rather than avoiding random page
> > writeback altogether. Random page writeback is nasty - just say no.
>
> There are cases we have to do pageout().
>
> - a stressed memcg with lots of dirty pages
> - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages

- a 32-bit highmem system too

Can you please see the following commit? It describes the current design.




commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
Author: akpm <akpm>
Date: Sun Dec 22 01:07:33 2002 +0000

[PATCH] Give kswapd writeback higher priority than pdflush

The `low latency page reclaim' design works by preventing page
allocators from blocking on request queues (and by preventing them from
blocking against writeback of individual pages, but that is immaterial
here).

This has a problem under some situations. pdflush (or a write(2)
caller) could be saturating the queue with highmem pages. This
prevents anyone from writing back ZONE_NORMAL pages. We end up doing
enormous amounts of scanning.

A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
then kill the mmapping applications. The machine instantly goes from
0% of memory dirty to 95% or more. pdflush kicks in and starts writing
the least-recently-dirtied pages, which are all highmem. The queue is
congested so nobody will write back ZONE_NORMAL pages. kswapd chews
50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
efficiency (pages_reclaimed/pages_scanned) falls to 2%.

So this patch changes the policy for kswapd. kswapd may use all of a
request queue, and is prepared to block on request queues.

What will now happen in the above scenario is:

1: The page allocator scans some pages, fails to reclaim enough
memory and takes a nap in blk_congestion_wait().

2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
back pages. (These pages will be rotated to the tail of the
inactive list at IO-completion interrupt time).

This writeback will saturate the queue with ZONE_NORMAL pages.
Conveniently, pdflush will avoid the congested queues. So we end up
writing the correct pages.

In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
efficiency rises from 2% to 40% and things are generally a lot happier.


The downside is that kswapd may now do a lot less page reclaim,
increasing page allocation latency, causing more direct reclaim,
increasing lock contention in the VM, etc. But I have not been able to
demonstrate that in testing.


The other problem is that there is only one kswapd, and there are lots
of disks. That is a generic problem - without being able to co-opt
user processes we don't have enough threads to keep lots of disks saturated.

One fix for this would be to add an additional "really congested"
threshold in the request queues, so kswapd can still perform
nonblocking writeout. This gives kswapd priority over pdflush while
allowing kswapd to feed many disk queues. I doubt if this will be
called for.

BKrev: 3e051055aitHp3bZBPSqmq21KGs5aQ


2010-07-30 11:14:28

by Dave Chinner

Subject: Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

On Fri, Jul 30, 2010 at 03:58:19PM +0800, Wu Fengguang wrote:
> On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> > On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > > Andrew,
> > >
> > > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > > This simple patchset shows the basic idea. Since it's a big behavior change,
> > > there are inevitably lots of details to sort out. I don't know where it will
> > > go after tests and discussions, so the patches are intentionally kept simple.
> > >
> > > sync livelock avoidance (need more to be complete, but this is minimal required for the last two patches)
> > > [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > > [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > > [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> > >
> > > let the flusher threads do ASYNC writeback for pageout()
> > > [PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > > [PATCH 5/5] vmscan: transfer async file writeback to the flusher
> >
> > I really do not like this - all it does is transfer random page writeback
> > from vmscan to the flusher threads rather than avoiding random page
> > writeback altogether. Random page writeback is nasty - just say no.
>
> There are cases we have to do pageout().
>
> - a stressed memcg with lots of dirty pages
> - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages
>
> In the above cases, the whole system may not be that stressed,
> except for some local LRU list being busy scanned. If the local
> memory stress lead to lots of pageout(), it could bring down the whole
> system by congesting the disks with many small seeky IO.
>
> It may be an overkill to push global writeback (ie. it's silly to sync
> 1GB dirty data because there is a small stressed 100MB LRU list).

No it isn't. Dirty pages have to be cleaned sometime and if reclaim has
a need to clean pages, we may as well start cleaning them all.
Kicking background writeback is effectively just starting work we
have already delayed into the future a little bit earlier than we
otherwise would have.

Doing this is only going to hurt performance if the same pages are
being frequently dirtied, but the changes to flush expired inodes
first in background writeback should avoid the worst of that
behaviour. Further, the more clean pages we have, the faster
subsequent memory reclaims are going to free up pages....


> The
> obvious solution is to keep the pageout() calls and make them more IO
> wise by doing write-around at the same time. The write-around pages
> will likely be in the same stressed LRU list, hence will do good for
> page reclaim as well.

You've kind of already done that by telling it to write back 1024
pages starting with a specific page. However, the big problem with
this is that it assumes that the inode has contiguous dirty pages in
the cache. That assumption falls down in many cases, e.g. when you
are writing lots of small files like those a kernel tree contains, and so
you still end up with random IO patterns coming out of reclaim.

> Transferring ASYNC work to the flushers helps the
> kswapd-vs-flusher priority problem too. Currently the
> kswapd/direct reclaim either have to skip dirty pages on
> congestion, or to risk being blocked in get_request_wait(), both
> are not good options. However the use of
> bdi_start_inode_writeback() do ask for a good vmscan throttling
> scheme to prevent it falsely OOM before the flusher is able to
> clean the transfered pages. This would be tricky.

I have no problem with that aspect of the patch - my issue is that it
does nothing to prevent the problem that causes excessive congestion
in the first place...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-30 12:26:14

by Fengguang Wu

Subject: Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

> > There are cases we have to do pageout().
> >
> > - a stressed memcg with lots of dirty pages
> > - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages
>
> - 32bit highmem system too

Ah yes!

> can you please see following commit? this describe current design.

Good stuff. Thanks.

Thanks,
Fengguang


>
>
>
> commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> Author: akpm <akpm>
> Date: Sun Dec 22 01:07:33 2002 +0000
>
> [PATCH] Give kswapd writeback higher priority than pdflush
>
> The `low latency page reclaim' design works by preventing page
> allocators from blocking on request queues (and by preventing them from
> blocking against writeback of individual pages, but that is immaterial
> here).
>
> This has a problem under some situations. pdflush (or a write(2)
> caller) could be saturating the queue with highmem pages. This
> prevents anyone from writing back ZONE_NORMAL pages. We end up doing
> enormous amounts of scanning.
>
> A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
> then kill the mmapping applications. The machine instantly goes from
> 0% of memory dirty to 95% or more. pdflush kicks in and starts writing
> the least-recently-dirtied pages, which are all highmem. The queue is
> congested so nobody will write back ZONE_NORMAL pages. kswapd chews
> 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
> efficiency (pages_reclaimed/pages_scanned) falls to 2%.
>
> So this patch changes the policy for kswapd. kswapd may use all of a
> request queue, and is prepared to block on request queues.
>
> What will now happen in the above scenario is:
>
> 1: The page allocator scans some pages, fails to reclaim enough
> memory and takes a nap in blk_congestion_wait().
>
> 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
> back pages. (These pages will be rotated to the tail of the
> inactive list at IO-completion interrupt time).
>
> This writeback will saturate the queue with ZONE_NORMAL pages.
> Conveniently, pdflush will avoid the congested queues. So we end up
> writing the correct pages.
>
> In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
> efficiency rises from 2% to 40% and things are generally a lot happier.
>
>
> The downside is that kswapd may now do a lot less page reclaim,
> increasing page allocation latency, causing more direct reclaim,
> increasing lock contention in the VM, etc. But I have not been able to
> demonstrate that in testing.
>
>
> The other problem is that there is only one kswapd, and there are lots
> of disks. That is a generic problem - without being able to co-opt
> user processes we don't have enough threads to keep lots of disks saturated.
>
> One fix for this would be to add an additional "really congested"
> threshold in the request queues, so kswapd can still perform
> nonblocking writeout. This gives kswapd priority over pdflush while
> allowing kswapd to feed many disk queues. I doubt if this will be
> called for.
>
> BKrev: 3e051055aitHp3bZBPSqmq21KGs5aQ
>
>

2010-07-30 13:18:34

by Fengguang Wu

Subject: Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

On Fri, Jul 30, 2010 at 07:12:44PM +0800, Dave Chinner wrote:
> On Fri, Jul 30, 2010 at 03:58:19PM +0800, Wu Fengguang wrote:
> > On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> > > On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > > > Andrew,
> > > >
> > > > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > > > This simple patchset shows the basic idea. Since it's a big behavior change,
> > > > there are inevitably lots of details to sort out. I don't know where it will
> > > > go after tests and discussions, so the patches are intentionally kept simple.
> > > >
> > > > sync livelock avoidance (need more to be complete, but this is minimal required for the last two patches)
> > > > [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > > > [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > > > [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> > > >
> > > > let the flusher threads do ASYNC writeback for pageout()
> > > > [PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > > > [PATCH 5/5] vmscan: transfer async file writeback to the flusher
> > >
> > > I really do not like this - all it does is transfer random page writeback
> > > from vmscan to the flusher threads rather than avoiding random page
> > > writeback altogether. Random page writeback is nasty - just say no.
> >
> > There are cases we have to do pageout().
> >
> > - a stressed memcg with lots of dirty pages
> > - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages
> >
> > In the above cases, the whole system may not be that stressed,
> > except for some local LRU list being busy scanned. If the local
> > memory stress lead to lots of pageout(), it could bring down the whole
> > system by congesting the disks with many small seeky IO.
> >
> > It may be an overkill to push global writeback (ie. it's silly to sync
> > 1GB dirty data because there is a small stressed 100MB LRU list).
>
> No it isn't. Dirty pages have to be cleaned sometime and if reclaim has
> a need to clean pages, we may as well start cleaning them all.
> Kicking background writeback is effectively just starting work we
> have already delayed into the future a little bit earlier than we
> otherwise would have.
>
> Doing this is only going to hurt performance if the same pages are
> being frequently dirtied, but the changes to flush expired inodes
> first in background writeback should avoid the worst of that
> behaviour. Further, the more clean pages we have, the faster
> subsequent memory reclaims are going to free up pages....

You have some points here; the data has to be synced anyway, sooner
or later.

However, it still helps to clean the right data first. With
write-around, we may get clean pages in the stressed LRU in 10ms.
Blindly syncing the global inodes... maybe after 10s if unlucky.

So pageout() is still good to have/keep. But sure, we need to improve it
(transfer work to the flusher, do write-around, throttle) as well as
reduce the need for it (kick global writeback and knock down global dirty pages).

> > The
> > obvious solution is to keep the pageout() calls and make them more IO
> > wise by doing write-around at the same time. The write-around pages
> > will likely be in the same stressed LRU list, hence will do good for
> > page reclaim as well.
>
> You've kind of already done that by telling it to write back 1024
> pages starting with a specific page. However, the big problem with
> this is that it assumes that the inode has contiguous dirty pages in

Right. We could use .writeback_index/.nr_to_write instead of
.range_start/.range_end as the writeback parameters. It's a bit racy
to use mapping->writeback_index though.
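For comparison, a sketch of the two ways of parameterising such a request
(the field names are the existing wbc ones; variant B reads
".writeback_index" as "let write_cache_pages() resume from
mapping->writeback_index via range_cyclic"):

        /* Variant A: explicit range around the target page (what the RFC does). */
        struct writeback_control wbc_range = {
                .sync_mode   = WB_SYNC_NONE,
                .nr_to_write = 1024,
                .range_start = (loff_t)page->index << PAGE_CACHE_SHIFT,
                .range_end   = LLONG_MAX,
                .for_reclaim = 1,
        };

        /* Variant B: cyclic writeback resuming from mapping->writeback_index,
         * which is what makes it a bit racy when several requests hit the
         * same inode. */
        struct writeback_control wbc_cyclic = {
                .sync_mode    = WB_SYNC_NONE,
                .nr_to_write  = 1024,
                .range_cyclic = 1,
                .for_reclaim  = 1,
        };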

> the cache. That assumption falls down in many cases, e.g. when you
> are writing lots of small files like those a kernel tree contains, and so
> you still end up with random IO patterns coming out of reclaim.

Small files lead to random IO anyway? You may mean .offset=1, so the
dirty page 0 will be left out. I do plan to do write-around
to cover such issues, since they would be a very common case. Imagine the
dirty page at offset N lies in the Normal zone and N+1 in the DMA32 zone.
If DMA32 is scanned slightly before Normal, then we get page N+1
first, while actually we should start with page N.

> > Transferring ASYNC work to the flushers helps the
> > kswapd-vs-flusher priority problem too. Currently the
> > kswapd/direct reclaim either have to skip dirty pages on
> > congestion, or to risk being blocked in get_request_wait(), both
> > are not good options. However the use of
> > bdi_start_inode_writeback() do ask for a good vmscan throttling
> > scheme to prevent it falsely OOM before the flusher is able to
> > clean the transfered pages. This would be tricky.
>
> I have no problem with that aspect of the patch - my issue is that it
> does nothing to prevent the problem that causes excessive congestion
> in the first place...

No problem. It's merely the first step, stay tuned :)

Thanks,
Fengguang