2011-06-28 23:43:56

by Curt Wohlgemuth

Subject: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

Contrary to the comment block atop writeback_inodes_sb_nr(),
we *were* calling

        wait_for_completion(&done);

which should not be done, as this is not called for data
integrity sync.

Signed-off-by: Curt Wohlgemuth <[email protected]>
---
fs/fs-writeback.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0f015a0..3f711ac 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1186,17 +1186,14 @@ static void wait_sb_inodes(struct super_block *sb)
  */
 void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr)
 {
-        DECLARE_COMPLETION_ONSTACK(done);
         struct wb_writeback_work work = {
                 .sb = sb,
                 .sync_mode = WB_SYNC_NONE,
-                .done = &done,
                 .nr_pages = nr,
         };
 
         WARN_ON(!rwsem_is_locked(&sb->s_umount));
         bdi_queue_work(sb->s_bdi, &work);
-        wait_for_completion(&done);
 }
 EXPORT_SYMBOL(writeback_inodes_sb_nr);

--
1.7.3.1


2011-06-29 00:54:42

by Dave Chinner

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Tue, Jun 28, 2011 at 04:43:35PM -0700, Curt Wohlgemuth wrote:
> Contrary to the comment block atop writeback_inodes_sb_nr(),
> we *were* calling
>
>         wait_for_completion(&done);
>
> which should not be done, as this is not called for data
> integrity sync.
>
> Signed-off-by: Curt Wohlgemuth <[email protected]>

The comment says it does not wait for IO to be -completed-.

The function as implemented waits for IO to be *submitted*.

This provides the callers with the same blocking semantics (i.e. request
queue full) as if the caller submitted the IO themselves. The code
that uses this function relies on this blocking to delay the next set
of operations it does until after IO has been started, so removing
the completion will change its behaviour significantly.
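
Roughly, the flusher side of this looks as follows (a simplified sketch
of the work loop in fs/fs-writeback.c; details vary by version):

        /* flusher thread: pick up queued work items, submit the IO */
        while ((work = get_next_work_item(bdi)) != NULL) {
                wrote += wb_writeback(wb, work);

                /* only now is a caller sleeping on ->done woken up */
                if (work->done)
                        complete(work->done);
                else
                        kfree(work);
        }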

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-06-29 01:56:50

by Curt Wohlgemuth

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

Hi Dave:

Thanks for the response.

On Tue, Jun 28, 2011 at 5:54 PM, Dave Chinner <[email protected]> wrote:
> On Tue, Jun 28, 2011 at 04:43:35PM -0700, Curt Wohlgemuth wrote:
>> Contrary to the comment block atop writeback_inodes_sb_nr(),
>> we *were* calling
>>
>>         wait_for_completion(&done);
>>
>> which should not be done, as this is not called for data
>> integrity sync.
>>
>> Signed-off-by: Curt Wohlgemuth <[email protected]>
>
> The comment says it does not wait for IO to be -completed-.
>
> The function as implemented waits for IO to be *submitted*.
>
> This provides the callers with the same blocking semantics (i.e. request
> queue full) as if the caller submitted the IO themselves. The code
> that uses this function relies on this blocking to delay the next set
> of operations it does until after IO has been started, so removing
> the completion will change its behaviour significantly.

I don't quite understand this. It's true that all IO done as a result
of calling wb_writeback() on this work item won't finish before the
completion takes place, but submitting all those pages *will* take
place first, and that's a lot of time. To wait on this before we then
call sync_inodes_sb(), and do it all over again, seems odd at best.

Pre-2.6.35 kernels would start non-integrity sync writeback and
immediately return, which would seem like a reasonable "prefetch-y"
thing to do, considering it's going to be immediately followed by a
data integrity sync writeback operation.

The post-2.6.35 semantics are fine; but then I don't understand why we
do both a __sync_filesystem(0) followed by a __sync_filesystem(1) (in
the case of sync(2)). It doesn't seem to be any safer or more correct
to me; why not just do the data integrity sync writeback and call it a
day?
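
For reference, the two passes in question look roughly like this
(simplified from fs/sync.c; the bdi and quota-sync checks are omitted):

        static int __sync_filesystem(struct super_block *sb, int wait)
        {
                if (wait)
                        sync_inodes_sb(sb);         /* WB_SYNC_ALL */
                else
                        writeback_inodes_sb(sb);    /* WB_SYNC_NONE */

                if (sb->s_op->sync_fs)
                        sb->s_op->sync_fs(sb, wait);
                return __sync_blockdev(sb->s_bdev, wait);
        }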

Thanks,
Curt

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2011-06-29 06:42:22

by Christoph Hellwig

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Wed, Jun 29, 2011 at 10:54:22AM +1000, Dave Chinner wrote:
> The comment says it does not wait for IO to be -completed-.
>
> The function as implemented waits for IO to be *submitted*.
>
> This provides the callers with the same blocking semantics (i.e. request
> queue full) as if the caller submitted the IO themselves. The code
> that uses this function relies on this blocking to delay the next set
> of operations it does until after IO has been started, so removing
> the completion will change its behaviour significantly.

The real importance is for locking. If we don't wait for completion
we may still be doing writeback from the flusher thread while we've
already unlocked s_umount. That would give us back the nasty writeback
vs umount races that this scheme fixed. The commit logs that added
it should explain it in more detail (at least I usually write detailed
changelogs).

2011-06-29 08:12:04

by Christoph Hellwig

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Tue, Jun 28, 2011 at 06:56:41PM -0700, Curt Wohlgemuth wrote:
> I don't quite understand this. It's true that all IO done as a result
> of calling wb_writeback() on this work item won't finish before the
> completion takes place, but submitting all those pages *will* take
> place first, and that's a lot of time. To wait on this before we then
> call sync_inodes_sb(), and do it all over again, seems odd at best.
>
> Pre-2.6.35 kernels would start non-integrity sync writeback and
> immediately return, which would seem like a reasonable "prefetch-y"
> thing to do, considering it's going to be immediately followed by a
> data integrity sync writeback operation.

The only old kernel I have around is 2.6.16, so looking at that one
for the old semantics:

- it first does a wakeup_pdflush() which does a truly asynchronous
  writeback of everything in the system. This is replaced with a
  wakeup_flusher_threads(0) in 3.0, which has pretty similar semantics.
- after that things change a bit, as we reordered sync quite a bit to
  a) improve data integrity by properly ordering the steps of the sync
  and b) share code between the global sync and per-fs sync as
  used by umount, sys_syncfs and a few other callers. But both do
  a semi-sync pass and a sync pass on per-sb writeback. For the old
  code it's done from the calling thread's context, while the new code
  moved it to the flusher thread, but still waits for it with the
  completion. No major change in semantics as far as I can see.

The idea of the async pass is to start writeback on as many pages as
possible before actively having to wait on them. I'd agree with your
assessment that the writeback_inodes_sb call might not really help all
that much - given that a lot of the waiting might not actually be for
I/O completion but e.g. for block-level throttling (or maybe cgroups
in your case?).

For sys_sync I'm pretty sure we could simply remove the
writeback_inodes_sb call and get just as good if not better performance,
but we'd still need a solution for the other sync_filesystem callers,
assuming the first write actually benefits them. Maybe you can run
some sys_syncfs microbenchmarks to check it? Note that doing multiple
passes generally doesn't actually help live-lock avoidance either, but
Wu has recently done some work in that area.

Another thing we've discussed a while ago is changing sync_inodes_sb to
do the writeback, or at least the waiting, back in the calling thread's
context so as not to block the flusher threads from getting real work done.

2011-06-29 16:57:22

by Jan Kara

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Wed 29-06-11 04:11:55, Christoph Hellwig wrote:
> On Tue, Jun 28, 2011 at 06:56:41PM -0700, Curt Wohlgemuth wrote:
> > I don't quite understand this. It's true that all IO done as a result
> > of calling wb_writeback() on this work item won't finish before the
> > completion takes place, but submitting all those pages *will* take
> > place first, and that's a lot of time. To wait on this before we then
> > call sync_inodes_sb(), and do it all over again, seems odd at best.
> >
> > Pre-2.6.35 kernels would start non-integrity sync writeback and
> > immediately return, which would seem like a reasonable "prefetch-y"
> > thing to do, considering it's going to be immediately followed by a
> > data integrity sync writeback operation.
>
> The only old kernel I have around is 2.6.16, so looking at that one
> for the old semantics:
>
> - it first does a wakeup_pdflush() which does a truly asynchronous
>   writeback of everything in the system. This is replaced with a
>   wakeup_flusher_threads(0) in 3.0, which has pretty similar semantics.
> - after that things change a bit, as we reordered sync quite a bit to
>   a) improve data integrity by properly ordering the steps of the sync
>   and b) share code between the global sync and per-fs sync as
>   used by umount, sys_syncfs and a few other callers. But both do
>   a semi-sync pass and a sync pass on per-sb writeback. For the old
>   code it's done from the calling thread's context, while the new code
>   moved it to the flusher thread, but still waits for it with the
>   completion. No major change in semantics as far as I can see.
>
> The idea of the async pass is to start writeback on as many pages as
> possible before actively having to wait on them. I'd agree with your
> assessment that the writeback_inodes_sb call might not really help all
> that much - given that a lot of the waiting might not actually be for
> I/O completion but e.g. for block-level throttling (or maybe cgroups
> in your case?).
>
> For sys_sync I'm pretty sure we could simply remove the
> writeback_inodes_sb call and get just as good if not better performance,
Actually, it won't with the current code, because WB_SYNC_ALL writeback
currently has the peculiarity that it looks like:

        for all inodes {
                write all inode data
                wait for inode data
        }

while to achieve good performance we actually need something like:

        for all inodes
                write all inode data
        for all inodes
                wait for inode data

It makes an order-of-magnitude difference when there are lots of
smallish files - SLES had a bug like this, so I know from user reports ;)
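
In terms of the address_space API the efficient scheme would be roughly
(a sketch only - locking, refcounting and inode state checks omitted):

        /* pass 1: get all the IO in flight */
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
                filemap_fdatawrite(inode->i_mapping);

        /* pass 2: only now block waiting for it */
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
                filemap_fdatawait(inode->i_mapping);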

But with the current inode list handling it will be hard to find all
the inodes that need to be waited for, which I guess is the reason why
it is done the way it is.

> but we'd still need a solution for the other sync_filesystem callers,
> assuming the first write actually benefits them. Maybe you can run
> some sys_syncfs microbenchmarks to check it? Note that doing multiple
> passes generally doesn't actually help live-lock avoidance either, but
> Wu has recently done some work in that area.
>
> Another thing we've discussed a while ago is changing sync_inodes_sb to
> do the writeback, or at least the waiting, back in the calling thread's
> context so as not to block the flusher threads from getting real work done.
You mean that sync(1) would actually write the data itself? It would
certainly make some things simpler, but it has its problems as well - for
example, sync racing with the flusher thread writing back inodes can
create a rather bad IO pattern...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-06-29 17:26:40

by Curt Wohlgemuth

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

Hi Christoph:

On Wed, Jun 29, 2011 at 1:11 AM, Christoph Hellwig <[email protected]> wrote:
> On Tue, Jun 28, 2011 at 06:56:41PM -0700, Curt Wohlgemuth wrote:
>> I don't quite understand this. It's true that all IO done as a result
>> of calling wb_writeback() on this work item won't finish before the
>> completion takes place, but submitting all those pages *will* take
>> place first, and that's a lot of time. To wait on this before we then
>> call sync_inodes_sb(), and do it all over again, seems odd at best.
>>
>> Pre-2.6.35 kernels would start non-integrity sync writeback and
>> immediately return, which would seem like a reasonable "prefetch-y"
>> thing to do, considering it's going to be immediately followed by a
>> data integrity sync writeback operation.
>
> The only old kernel I have around is 2.6.16, so looking at that one
> for the old semantics:
>
> - it first does a wakeup_pdflush() which does a truly asynchronous
>   writeback of everything in the system. This is replaced with a
>   wakeup_flusher_threads(0) in 3.0, which has pretty similar semantics.
> - after that things change a bit, as we reordered sync quite a bit to
>   a) improve data integrity by properly ordering the steps of the sync
>   and b) share code between the global sync and per-fs sync as
>   used by umount, sys_syncfs and a few other callers. But both do
>   a semi-sync pass and a sync pass on per-sb writeback. For the old
>   code it's done from the calling thread's context, while the new code
>   moved it to the flusher thread, but still waits for it with the
>   completion. No major change in semantics as far as I can see.

You're right about this. (I'm looking at 2.6.26 as it happens, but
it's got the same behavior.)

The semantics did change between 2.6.34 and 2.6.35 though. When the
work item queue was introduced in 2.6.32, the semantics changed from
what you describe above to what's present through 2.6.34:
writeback_inodes_sb() would enqueue a work item, and return. Your
commit 83ba7b07 ("writeback: simplify the write back thread queue")
added the wait_for_completion() call, putting the semantics back to
where they were pre-2.6.32.

And in fact, this change has only a minor effect on writeback, at least
in that the WB_SYNC_ALL work item won't even get started until the
WB_SYNC_NONE one is finished.

Though one additional change between the old way (pre-2.6.32) and
today: with the old kernel, the pdflush thread would operate
concurrently with the first (and second?) sync path through writeback.
Today of course, they're *all* serialized. So really a call to
sys_sync() will enqueue 3 work items -- one from
wakeup_flusher_threads(), one from writeback_inodes_sb(), and one from
sync_inodes_sb().
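
Schematically (simplified from fs/sync.c of this era; laptop_mode
handling omitted):

        SYSCALL_DEFINE0(sync)
        {
                wakeup_flusher_threads(0);   /* WB_SYNC_NONE, all bdis */
                sync_filesystems(0);         /* writeback_inodes_sb() per sb */
                sync_filesystems(1);         /* sync_inodes_sb() per sb */
                return 0;
        }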

> The idea of the async pass is to start writeback on as many pages as
> possible before actively having to wait on them. I'd agree with your
> assessment that the writeback_inodes_sb call might not really help all
> that much - given that a lot of the waiting might not actually be for
> I/O completion but e.g. for block-level throttling (or maybe cgroups
> in your case?).

What I'm seeing with an admittedly nasty workload -- two FIO scripts
writing as fast as possible to 1000 files each, with a sync 10 seconds
into the workload -- is that the "async pass" will write out a
(nearly) completely separate set of pages from the "sync pass."
Because the work items are serialized, so many new pages are dirtied
between when the first work item starts and when the sync pass runs
that the sync pass has just about as many pages to write out as the
first. So the "prefetch-y" async pass is basically wasted time.

>
> For sys_sync I'm pretty sure we could simply remove the
> writeback_inodes_sb call and get just as good if not better performance,
> but we'd still need a solution for the other sync_filesystem callers,
> assuming the first write actually benefits them. Maybe you can run
> some sys_syncfs microbenchmarks to check it? Note that doing multiple
> passes generally doesn't actually help live-lock avoidance either, but
> Wu has recently done some work in that area.

I've been running tests using sys_sync, and seeing sync take far
longer than I'd expect, but haven't tried sys_syncfs yet. I'll add
that to my list.

Still, both do two completely serialized passes: one "async pass"
followed by the "sync pass" -- during which other work items can get
queued, lots more pages get dirtied, etc.

>
> Another thing we've discussed a while ago is changing sync_inodes_sb to
> do the writeback, or at least the waiting, back in the calling thread's
> context so as not to block the flusher threads from getting real work done.

This would make sense to ensure that sync finishes reasonably fast,
but as Jan points out, would cause that much more interference with
the write pattern...

Thanks,
Curt

2011-06-29 17:55:45

by Christoph Hellwig

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
> > For sys_sync I'm pretty sure we could simply remove the
> > writeback_inodes_sb call and get just as good if not better performance,
> Actually, it won't with the current code, because WB_SYNC_ALL writeback
> currently has the peculiarity that it looks like:
>
>         for all inodes {
>                 write all inode data
>                 wait for inode data
>         }
>
> while to achieve good performance we actually need something like:
>
>         for all inodes
>                 write all inode data
>         for all inodes
>                 wait for inode data
>
> It makes an order-of-magnitude difference when there are lots of
> smallish files - SLES had a bug like this, so I know from user reports ;)

I don't think that's true. The WB_SYNC_ALL writeback is done using
sync_inodes_sb, which operates as:

        for all dirty inodes in bdi:
                if inode belongs to sb
                        write all inode data

        for all inodes in sb:
                wait for inode data

we still do that in a big for-each-sb loop, though.
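
i.e., roughly (simplified from fs/fs-writeback.c):

        void sync_inodes_sb(struct super_block *sb)
        {
                DECLARE_COMPLETION_ONSTACK(done);
                struct wb_writeback_work work = {
                        .sb        = sb,
                        .sync_mode = WB_SYNC_ALL,
                        .nr_pages  = LONG_MAX,
                        .done      = &done,
                };

                bdi_queue_work(sb->s_bdi, &work);
                wait_for_completion(&done);  /* flusher does the writing */
                wait_sb_inodes(sb);          /* final wait, in the caller */
        }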

> You mean that sync(1) would actually write the data itself? It would
> certainly make some things simpler, but it has its problems as well - for
> example, sync racing with the flusher thread writing back inodes can
> create a rather bad IO pattern...

Only the second pass. The idea is that we first try to use the flusher
threads for good I/O patterns, but if we can't get that to work, only
block the caller and not everyone. But that's just an idea so far,
it would need serious benchmarking. And despite what I claimed before,
we actually do the wait in the caller's context already anyway, which
already gives you the easy part of the above effect.

2011-06-29 18:00:14

by Christoph Hellwig

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Wed, Jun 29, 2011 at 10:26:33AM -0700, Curt Wohlgemuth wrote:
> The semantics did change between 2.6.34 and 2.6.35 though. When the
> work item queue was introduced in 2.6.32, the semantics changed from
> what you describe above to what's present through 2.6.34:
> writeback_inodes_sb() would enqueue a work item, and return. Your
> commit 83ba7b07 ("writeback: simplify the write back thread queue")
> added the wait_for_completion() call, putting the semantics back to
> where they were pre-2.6.32.

Yes. The kernels inbetween had that nasty writeback vs umount races
that we could trigger quite often.

> Though one additional change between the old way (pre-2.6.32) and
> today: with the old kernel, the pdflush thread would operate
> concurrently with the first (and second?) sync path through writeback.
> Today of course, they're *all* serialized. So really a call to
> sys_sync() will enqueue 3 work items -- one from
> wakeup_flusher_threads(), one from writeback_inodes_sb(), and one from
> sync_inodes_sb().

Yes. And having WB_SYNC_NONE items from both wakeup_flusher_threads vs
writeback_inodes_sb is plain stupid. The initial conversion to the
new sync_filesystem scheme had removed the wakeup_flusher_threads
call, but that caused a huge regression in some benchmark.

As mentioned before, Wu was working on some code to introduce tagging
so that the WB_SYNC_ALL call won't start writing pages dirtied after
the sync call, which should help with your issue. Although to
completely solve it we really need to get down to just two passes.
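
The underlying page tagging looks roughly like this (the helper is in
mm/page-writeback.c; how the sync path ends up using it is the part
being worked on):

        /* convert the DIRTY radix-tree tags of currently dirty pages
         * to TOWRITE tags before starting to write... */
        tag_pages_for_writeback(mapping, start, end);

        /* ...then write only pages tagged PAGECACHE_TAG_TOWRITE, so
         * pages dirtied after this point are left for a later pass */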

2011-06-29 19:15:24

by Jan Kara

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
> > > For sys_sync I'm pretty sure we could simply remove the
> > > writeback_inodes_sb call and get just as good if not better performance,
> > Actually, it won't with the current code, because WB_SYNC_ALL writeback
> > currently has the peculiarity that it looks like:
> >
> >         for all inodes {
> >                 write all inode data
> >                 wait for inode data
> >         }
> >
> > while to achieve good performance we actually need something like:
> >
> >         for all inodes
> >                 write all inode data
> >         for all inodes
> >                 wait for inode data
> >
> > It makes an order-of-magnitude difference when there are lots of
> > smallish files - SLES had a bug like this, so I know from user reports ;)
>
> I don't think that's true. The WB_SYNC_ALL writeback is done using
> sync_inodes_sb, which operates as:
>
>         for all dirty inodes in bdi:
>                 if inode belongs to sb
>                         write all inode data
>
>         for all inodes in sb:
>                 wait for inode data
>
> we still do that in a big for-each-sb loop, though.
True, but writeback_single_inode() has in it:

        if (wbc->sync_mode == WB_SYNC_ALL) {
                int err = filemap_fdatawait(mapping);
                if (ret == 0)
                        ret = err;
        }

So we end up waiting much earlier. Probably we should remove this wait,
but that will need some audit I guess.

> > You mean that sync(1) would actually write the data itself? It would
> > certainly make some things simpler, but it has its problems as well - for
> > example, sync racing with the flusher thread writing back inodes can
> > create a rather bad IO pattern...
>
> Only the second pass. The idea is that we first try to use the flusher
> threads for good I/O patterns, but if we can't get that to work, only
> block the caller and not everyone. But that's just an idea so far,
> it would need serious benchmarking. And despite what I claimed before,
> we actually do the wait in the caller's context already anyway, which
> already gives you the easy part of the above effect.
Modulo the writeback_single_inode() wait. But if that is dealt with I
agree.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-06-29 20:12:41

by Christoph Hellwig

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Wed, Jun 29, 2011 at 09:15:18PM +0200, Jan Kara wrote:
> True, but writeback_single_inode() has in it:
>
>         if (wbc->sync_mode == WB_SYNC_ALL) {
>                 int err = filemap_fdatawait(mapping);
>                 if (ret == 0)
>                         ret = err;
>         }
>
> So we end up waiting much earlier. Probably we should remove this wait,
> but that will need some audit I guess.

Uhh, indeed. We'll need this wait for things like sync_inode, though.

2011-06-29 21:30:22

by Curt Wohlgemuth

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Wed, Jun 29, 2011 at 11:00 AM, Christoph Hellwig <[email protected]> wrote:
> On Wed, Jun 29, 2011 at 10:26:33AM -0700, Curt Wohlgemuth wrote:
>> The semantics did change between 2.6.34 and 2.6.35 though. When the
>> work item queue was introduced in 2.6.32, the semantics changed from
>> what you describe above to what's present through 2.6.34:
>> writeback_inodes_sb() would enqueue a work item, and return. Your
>> commit 83ba7b07 ("writeback: simplify the write back thread queue")
>> added the wait_for_completion() call, putting the semantics back to
>> where they were pre-2.6.32.
>
> Yes. The kernels in between had those nasty writeback vs umount races
> that we could trigger quite often.

My apologies for not being aware of these races (I tried searching the
mailing lists) -- how did they manifest themselves? I don't see
anything about it in your commit message of 83ba7b07. I do see that
we are not taking s_umount for sys_sync any longer (only in
sync_filesystem(), e.g., for sys_syncfs)...

>> Though one additional change between the old way (pre-2.6.32) and
>> today: with the old kernel, the pdflush thread would operate
>> concurrently with the first (and second?) sync path through writeback.
>> Today of course, they're *all* serialized. So really a call to
>> sys_sync() will enqueue 3 work items -- one from
>> wakeup_flusher_threads(), one from writeback_inodes_sb(), and one from
>> sync_inodes_sb().
>
> Yes. And having WB_SYNC_NONE items from both wakeup_flusher_threads and
> writeback_inodes_sb is plain stupid. The initial conversion to the
> new sync_filesystem scheme had removed the wakeup_flusher_threads
> call, but that caused a huge regression in some benchmark.
>
> As mentioned before, Wu was working on some code to introduce tagging
> so that the WB_SYNC_ALL call won't start writing pages dirtied after
> the sync call, which should help with your issue. Although to
> completely solve it we really need to get down to just two passes.

I can definitely see how tagging with the start of sync would be
helpful; also how going from three to two passes seems like a
no-brainer.

But what's the real point of doing even two passes -- one SYNC_NONE
followed by one SYNC_ALL? Is it to try to get through as many inodes as
possible before we potentially lock up by waiting on an inode on an
unavailable device?

Thanks,
Curt

2011-06-30 12:16:07

by Jan Kara

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Wed 29-06-11 16:12:34, Christoph Hellwig wrote:
> On Wed, Jun 29, 2011 at 09:15:18PM +0200, Jan Kara wrote:
> > True, but writeback_single_inode() has in it:
> >
> >         if (wbc->sync_mode == WB_SYNC_ALL) {
> >                 int err = filemap_fdatawait(mapping);
> >                 if (ret == 0)
> >                         ret = err;
> >         }
> >
> > So we end up waiting much earlier. Probably we should remove this wait,
> > but that will need some audit I guess.
>
> Uhh, indeed. We'll need this wait for things like sync_inode, though.
Yes. Actually, specifically for filesystems like XFS which update the
inode after IO completion, we would need more passes to be efficient and
correct:

        for all inodes
                fdatawrite
        for all inodes
                fdatawait
        for all inodes
                write_inode
        for all inodes
                wait for inode IO

But maybe this could be handled by having a new WB_SYNC_ mode indicating
that writeback_single_inode() should not bother waiting (thus we'd really
end up waiting in sync_inodes_sb()) and then XFS and other filesystems that
need it would write out inodes in their sync_fs() implementation (possibly
using a generic helper)?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-06-30 12:38:13

by Christoph Hellwig

Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

On Thu, Jun 30, 2011 at 02:15:58PM +0200, Jan Kara wrote:
> Yes. Actually, specifically for filesystems like XFS which update the
> inode after IO completion, we would need more passes to be efficient and
> correct:
>
>         for all inodes
>                 fdatawrite
>         for all inodes
>                 fdatawait
>         for all inodes
>                 write_inode
>         for all inodes
>                 wait for inode IO
>
> But maybe this could be handled by having a new WB_SYNC_ mode indicating
> that writeback_single_inode() should not bother waiting (thus we'd really
> end up waiting in sync_inodes_sb()) and then XFS and other filesystems that
> need it would write out inodes in their sync_fs() implementation (possibly
> using a generic helper)?

We do very little in write_inode these days. Basically we just log the
inode size and timestamp updates into the in-memory log. The actual
writeout of the log happens in ->sync_fs already.