2009-09-25 14:10:21

by Chris Mason

Subject: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

At unmount time, we do writeback in two stages. First we call
sync_filesystems with wait == 0, and then we call it with wait == 1.

When wait == 1, WB_SYNC_ALL is used. WB_SYNC_ALL passes wait == 1 down to
the filesystem's write_inode function if the inode was I_DIRTY_SYNC, and
write_inode is then expected to commit the running transaction.

The new bdi threads try to keep this two-stage writeback, but the
problem is that they do it by calling bdi_writeback_all, which just
kicks a few procs here and there and returns.

The end result is that btrfs is getting stuck in a loop where it commits
the transaction for every dirty inode, and unmount takes forever.

This patch is one possible fix. It just changes bdi_sync_writeback to
always do a WB_SYNC_NONE run synchronously before the WB_SYNC_ALL run.
I'm not sure I've got the bdi calling conventions right, but we need
something along these lines.

We could also make a synchronous form of bdi_writeback_all, but unmount
really isn't the common case so I think this patch is sufficient.
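
That synchronous form would look roughly like this. This is a sketch
only: a bdi_work_init() helper matching the on-stack initialization in
fs-writeback.c is an assumption, and bdi_list locking is deliberately
omitted here, since we sleep in bdi_wait_on_work_clear():

static void bdi_writeback_all_sync(struct wb_writeback_args *args)
{
        struct backing_dev_info *bdi;

        list_for_each_entry(bdi, &bdi_list, bdi_list) {
                struct bdi_work work;

                /* on-stack work item, waited on before the next bdi */
                bdi_work_init(&work, args);
                work.state |= WS_ONSTACK;

                bdi_queue_work(bdi, &work);
                bdi_wait_on_work_clear(&work);
        }
}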

Signed-off-by: Chris Mason <[email protected]>

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8e1e5e1..27f8e0e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -225,7 +225,7 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
 {
         struct wb_writeback_args args = {
                 .sb = sb,
-                .sync_mode = WB_SYNC_ALL,
+                .sync_mode = WB_SYNC_NONE,
                 .nr_pages = LONG_MAX,
                 .range_cyclic = 0,
         };
@@ -236,6 +236,13 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
 
         bdi_queue_work(bdi, &work);
         bdi_wait_on_work_clear(&work);
+
+        args.sync_mode = WB_SYNC_ALL;
+        args.nr_pages = LONG_MAX;
+
+        work.state = WS_USED | WS_ONSTACK;
+        bdi_queue_work(bdi, &work);
+        bdi_wait_on_work_clear(&work);
 }
 
 /**


2009-09-27 08:35:05

by Andrew Morton

Subject: Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

On Fri, 25 Sep 2009 10:10:14 -0400 Chris Mason <[email protected]> wrote:

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 8e1e5e1..27f8e0e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -225,7 +225,7 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
>  {
>          struct wb_writeback_args args = {
>                  .sb = sb,
> -                .sync_mode = WB_SYNC_ALL,
> +                .sync_mode = WB_SYNC_NONE,
>                  .nr_pages = LONG_MAX,
>                  .range_cyclic = 0,
>          };
> @@ -236,6 +236,13 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> 
>          bdi_queue_work(bdi, &work);
>          bdi_wait_on_work_clear(&work);
> +
> +        args.sync_mode = WB_SYNC_ALL;
> +        args.nr_pages = LONG_MAX;
> +
> +        work.state = WS_USED | WS_ONSTACK;
> +        bdi_queue_work(bdi, &work);
> +        bdi_wait_on_work_clear(&work);
>  }

Those LONG_MAX's are a worry. What prevents a very long
almost-livelock from occurring if userspace is concurrently dirtying
pagecache at a high rate?

2009-09-27 16:44:29

by Jens Axboe

Subject: Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

On Sun, Sep 27 2009, Andrew Morton wrote:
> On Fri, 25 Sep 2009 10:10:14 -0400 Chris Mason <[email protected]> wrote:
>
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 8e1e5e1..27f8e0e 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -225,7 +225,7 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> >  {
> >          struct wb_writeback_args args = {
> >                  .sb = sb,
> > -                .sync_mode = WB_SYNC_ALL,
> > +                .sync_mode = WB_SYNC_NONE,
> >                  .nr_pages = LONG_MAX,
> >                  .range_cyclic = 0,
> >          };
> > @@ -236,6 +236,13 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> > 
> >          bdi_queue_work(bdi, &work);
> >          bdi_wait_on_work_clear(&work);
> > +
> > +        args.sync_mode = WB_SYNC_ALL;
> > +        args.nr_pages = LONG_MAX;
> > +
> > +        work.state = WS_USED | WS_ONSTACK;
> > +        bdi_queue_work(bdi, &work);
> > +        bdi_wait_on_work_clear(&work);
> >  }
>
> Those LONG_MAX's are a worry. What prevents a very long
> almost-livelock from occurring if userspace is concurrently dirtying
> pagecache at a high rate?

Not sure whether Chris' system is back up again, but I discussed this
with him on irc. Since the WB_SYNC_ALL writeback should be queued behind
the WB_SYNC_NONE writeback that the non-wait sync already issued, I'm
not sure why this patch makes a difference. It's definitely not the
right approach.

I'll debug this when I get back.

--
Jens Axboe

2009-09-27 16:52:38

by Andrew Morton

Subject: Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

On Sun, 27 Sep 2009 18:44:32 +0200 Jens Axboe <[email protected]> wrote:

> On Sun, Sep 27 2009, Andrew Morton wrote:
> > On Fri, 25 Sep 2009 10:10:14 -0400 Chris Mason <[email protected]> wrote:
> >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 8e1e5e1..27f8e0e 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -225,7 +225,7 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> > >  {
> > >          struct wb_writeback_args args = {
> > >                  .sb = sb,
> > > -                .sync_mode = WB_SYNC_ALL,
> > > +                .sync_mode = WB_SYNC_NONE,
> > >                  .nr_pages = LONG_MAX,
> > >                  .range_cyclic = 0,
> > >          };
> > > @@ -236,6 +236,13 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> > > 
> > >          bdi_queue_work(bdi, &work);
> > >          bdi_wait_on_work_clear(&work);
> > > +
> > > +        args.sync_mode = WB_SYNC_ALL;
> > > +        args.nr_pages = LONG_MAX;
> > > +
> > > +        work.state = WS_USED | WS_ONSTACK;
> > > +        bdi_queue_work(bdi, &work);
> > > +        bdi_wait_on_work_clear(&work);
> > >  }
> >
> > Those LONG_MAX's are a worry. What prevents a very long
> > almost-livelock from occurring if userspace is concurrently dirtying
> > pagecache at a high rate?
>
> Not sure whether Chris' system is back up again, but I discussed this
> with him on irc. Since the WB_SYNC_ALL writeback should be queued behind
> the WB_SYNC_NONE writeback that the non-wait sync already issued, I'm
> not sure why this patch makes a difference. It's definitely not the
> right approach.
>

I wasn't referring to this patch actually. The code as it stands in
Linus's tree right now attempts to write back up to 2^63 pages...

2009-09-27 16:55:11

by Jens Axboe

Subject: Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

On Sun, Sep 27 2009, Andrew Morton wrote:
> On Sun, 27 Sep 2009 18:44:32 +0200 Jens Axboe <[email protected]> wrote:
>
> > On Sun, Sep 27 2009, Andrew Morton wrote:
> > > On Fri, 25 Sep 2009 10:10:14 -0400 Chris Mason <[email protected]> wrote:
> > >
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 8e1e5e1..27f8e0e 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -225,7 +225,7 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> > > >  {
> > > >          struct wb_writeback_args args = {
> > > >                  .sb = sb,
> > > > -                .sync_mode = WB_SYNC_ALL,
> > > > +                .sync_mode = WB_SYNC_NONE,
> > > >                  .nr_pages = LONG_MAX,
> > > >                  .range_cyclic = 0,
> > > >          };
> > > > @@ -236,6 +236,13 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> > > > 
> > > >          bdi_queue_work(bdi, &work);
> > > >          bdi_wait_on_work_clear(&work);
> > > > +
> > > > +        args.sync_mode = WB_SYNC_ALL;
> > > > +        args.nr_pages = LONG_MAX;
> > > > +
> > > > +        work.state = WS_USED | WS_ONSTACK;
> > > > +        bdi_queue_work(bdi, &work);
> > > > +        bdi_wait_on_work_clear(&work);
> > > >  }
> > >
> > > Those LONG_MAX's are a worry. What prevents a very long
> > > almost-livelock from occurring if userspace is concurrently dirtying
> > > pagecache at a high rate?
> >
> > Not sure whether Chris' system is back up again, but I discussed this
> > with him on irc. Since the WB_SYNC_ALL writeback should be queued behind
> > the WB_SYNC_NONE writeback that the non-wait sync already issued, I'm
> > not sure why this patch makes a difference. It's definitely not the
> > right approach.
> >
>
> I wasn't referring to this patch actually. The code as it stands in
> Linus's tree right now attempts to write back up to 2^63 pages...

I agree, it could make the fs sync take a looong time. This is not a new
issue, though.

--
Jens Axboe

2009-09-27 17:10:41

by Andrew Morton

Subject: Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

On Sun, 27 Sep 2009 18:55:14 +0200 Jens Axboe <[email protected]> wrote:

> > I wasn't referring to this patch actually. The code as it stands in
> > Linus's tree right now attempts to write back up to 2^63 pages...
>
> I agree, it could make the fs sync take a looong time. This is not a new
> issue, though.

It _should_ be a new issue. The old code would estimate the number of
dirty pages up-front and would then add a +50% fudge factor, so if we
started the sync with 1GB dirty memory, we write back a max of 1.5GB.

However that might have got broken.

void sync_inodes_sb(struct super_block *sb, int wait)
{
        struct writeback_control wbc = {
                .sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
                .range_start = 0,
                .range_end = LLONG_MAX,
        };

        if (!wait) {
                unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
                unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);

                wbc.nr_to_write = nr_dirty + nr_unstable +
                                (inodes_stat.nr_inodes - inodes_stat.nr_unused);
        } else
                wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */

        sync_sb_inodes(sb, &wbc);
}

a) the +50% isn't there in 2.6.31

b) the wait=true case appears to be vulnerable to livelock in 2.6.31.

whodidthat

38f21977663126fef53f5585e7f1653d8ebe55c4 did that back in January.
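
For reference, the sort of up-front clamp I mean looks something like
this (an illustrative sketch, not the old code verbatim):

        unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
        unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);

        /*
         * Estimate the dirty memory once at the start and add 50% slack
         * for pages dirtied while the sync runs; never chase new
         * dirtyings beyond that.
         */
        wbc.nr_to_write = (nr_dirty + nr_unstable) * 3 / 2;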

2009-09-28 13:33:02

by Chris Mason

Subject: Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

On Sun, Sep 27, 2009 at 01:34:58AM -0700, Andrew Morton wrote:
> On Fri, 25 Sep 2009 10:10:14 -0400 Chris Mason <[email protected]> wrote:
>
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 8e1e5e1..27f8e0e 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -225,7 +225,7 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> >  {
> >          struct wb_writeback_args args = {
> >                  .sb = sb,
> > -                .sync_mode = WB_SYNC_ALL,
> > +                .sync_mode = WB_SYNC_NONE,
> >                  .nr_pages = LONG_MAX,
> >                  .range_cyclic = 0,
> >          };
> > @@ -236,6 +236,13 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
> > 
> >          bdi_queue_work(bdi, &work);
> >          bdi_wait_on_work_clear(&work);
> > +
> > +        args.sync_mode = WB_SYNC_ALL;
> > +        args.nr_pages = LONG_MAX;
> > +
> > +        work.state = WS_USED | WS_ONSTACK;
> > +        bdi_queue_work(bdi, &work);
> > +        bdi_wait_on_work_clear(&work);
> >  }
>
> Those LONG_MAX's are a worry. What prevents a very long
> almost-livelock from occurring if userspace is concurrently dirtying
> pagecache at a high rate?
>

In this case, we should be called from unmount. But Jens tells me my
patch isn't quite right, because even without it the WB_SYNC_ALL run is
queued onto the tail of the list, after the WB_SYNC_NONE run.

So I need to trace it a little better. My initial theory was that the
nr_dirty count used by the first WB_SYNC_NONE run wasn't big enough.
Once btrfs writepage kicks in, it can make more dirty metadata pages to
close out the delalloc, and if those get written first, the sync could
exit before all the data pages are on disk.

It's a theory anyway; I'll dig in more.

-chris

2009-09-29 16:13:39

by Jan Kara

Subject: Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

On Sun 27-09-09 10:10:35, Andrew Morton wrote:
> On Sun, 27 Sep 2009 18:55:14 +0200 Jens Axboe <[email protected]> wrote:
>
> > > I wasn't referring to this patch actually. The code as it stands in
> > > Linus's tree right now attempts to write back up to 2^63 pages...
> >
> > I agree, it could make the fs sync take a looong time. This is not a new
> > issue, though.
>
> It _should_ be a new issue. The old code would estimate the number of
> dirty pages up-front and would then add a +50% fudge factor, so if we
> started the sync with 1GB dirty memory, we write back a max of 1.5GB.
>
> However that might have got broken.
>
> void sync_inodes_sb(struct super_block *sb, int wait)
> {
>         struct writeback_control wbc = {
>                 .sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
>                 .range_start = 0,
>                 .range_end = LLONG_MAX,
>         };
>
>         if (!wait) {
>                 unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
>                 unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
>
>                 wbc.nr_to_write = nr_dirty + nr_unstable +
>                                 (inodes_stat.nr_inodes - inodes_stat.nr_unused);
>         } else
>                 wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
>
>         sync_sb_inodes(sb, &wbc);
> }
>
> a) the +50% isn't there in 2.6.31
>
> b) the wait=true case appears to be vulnerable to livelock in 2.6.31.
>
> whodidthat
>
> 38f21977663126fef53f5585e7f1653d8ebe55c4 did that back in January.
Yes, I even remember the (long) discussion:
http://lkml.org/lkml/2008/9/22/361
The problem with using .nr_to_write in WB_SYNC_ALL mode is that
write_cache_pages() can return before it has actually written the pages
we wanted written (because someone dirtied other pages in the meantime).
So Nick removed the .nr_to_write limit instead, to guarantee proper data
integrity semantics. I'm not sure what the status of the livelock fixes
he proposed is (added him to CC).

Maybe we could at least fix some cases of possible livelock by setting
.range_end to the current i_size when calling writepages(). That would
solve the common case where somebody is extending the file. For the case
where someone writes into a huge hole, it would at least provide a finite
upper bound, although in practice that might be just the same as without
.range_end.
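
Concretely, the idea is something like this (a sketch only, reusing the
writeback_control fields quoted above):

        /*
         * Bound a data integrity pass by the file size seen when the
         * sync started, so a writer who keeps extending the file cannot
         * livelock us.
         */
        struct writeback_control wbc = {
                .sync_mode = WB_SYNC_ALL,
                .range_start = 0,
                .range_end = i_size_read(inode),        /* not LLONG_MAX */
                .nr_to_write = LONG_MAX,
        };

        do_writepages(inode->i_mapping, &wbc);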

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-29 20:47:26

by Chris Mason

Subject: Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

On Fri, Sep 25, 2009 at 10:10:14AM -0400, Chris Mason wrote:
> At unmount time, we do writeback in two stages. First we call
> sync_filesystems with wait == 0, and then we call it with wait == 1.
>
> When wait == 1, WB_SYNC_ALL is used. WB_SYNC_ALL passes wait == 1 down
> to the filesystem's write_inode function if the inode was I_DIRTY_SYNC,
> and write_inode is then expected to commit the running transaction.
>
> The new bdi threads try to keep this two-stage writeback, but the
> problem is that they do it by calling bdi_writeback_all, which just
> kicks a few procs here and there and returns.
>

And I can't reproduce any problems with mainline today. ENOTABUG.

-chris