2013-06-26 08:45:33

by OGAWA Hirofumi

[permalink] [raw]
Subject: [PATCH] Optimize wait_sb_inodes()

Hi,

Under the following stress test, sync(2) became the top CPU load:

fsstress -s 1 -n 50000 -p 3 -d `pwd`

Profiling with perf showed the CPU load was lock contention on inode_sb_list_lock.

sync(2) is a data integrity operation, so it has to make sure all dirty
data is written out before the sync(2) point. The bdi flusher flushes the
current dirty data and waits for it. But if there is I/O already in flight
when sync_inodes_sb() starts, sync_inodes_sb() could complete before that
prior in-flight I/O does.

So, wait_sb_inodes() walks all inodes on the sb to make sure the
in-flight I/O is done too. While it walks the inodes,
inode_sb_list_lock contends with other operations such as
create(2). This is the cause of the CPU load.

From another point of view, wait_sb_inodes() is (arguably) necessary for
legacy FSes. But, for example, with data=journal on ext*, wait_sb_inodes()
is worse than useless, because ext* can do the same check via its own
transaction list (and in a more efficient way).

Likewise, on tux3, the situation is the same as with data=journal.

Also, even with data=ordered, ext* might be able to check in-flight I/O
via its ordered data list (with some new additional check; I'm not sure).


So, this patch adds an sb->s_op->wait_inodes() handler to replace
wait_sb_inodes(). With this, FSes can optimize the wait by using their
own internal knowledge.

[An alternative way to optimize this would be to push wait_sb_inodes()
down into the ->sync_fs() handler of all FSes. Or if someone fixes this
with another idea, I'm happy enough.]


The following is the profile of the result on tux3 (the ->wait_inodes()
handler is a noop function).

Thanks.

- 13.11% fsstress [kernel.kallsyms] [k] sync_inodes_sb
- sync_inodes_sb
- 99.97% sync_inodes_one_sb
iterate_supers
sys_sync
system_call
sync
- 9.39% fsstress [kernel.kallsyms] [k] _raw_spin_lock
- _raw_spin_lock
+ 62.72% sync_inodes_sb
+ 7.46% _atomic_dec_and_lock
+ 6.15% inode_add_lru
+ 4.43% map_region
+ 3.07% __find_get_buffer
+ 2.71% sync_inodes_one_sb
+ 1.77% tux3_set_buffer_dirty_list
+ 1.43% find_inode
+ 1.02% iput
+ 0.69% dput
+ 0.57% __tux3_get_block
+ 0.53% shrink_dcache_parent
+ 0.53% __d_instantiate

Before patch:

real 2m40.994s
user 0m1.344s
sys 0m15.832s

After patch:

real 2m34.748s
user 0m1.360s
sys 0m5.356s

Signed-off-by: OGAWA Hirofumi <[email protected]>
---

fs/fs-writeback.c | 5 ++++-
include/linux/fs.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)

diff -puN include/linux/fs.h~tux3-fix-wait_sb_inodes include/linux/fs.h
--- tux3fs/include/linux/fs.h~tux3-fix-wait_sb_inodes 2013-06-24 19:03:10.000000000 +0900
+++ tux3fs-hirofumi/include/linux/fs.h 2013-06-24 19:03:10.000000000 +0900
@@ -1595,6 +1595,7 @@ struct super_operations {
int (*write_inode) (struct inode *, struct writeback_control *wbc);
int (*drop_inode) (struct inode *);
void (*evict_inode) (struct inode *);
+ void (*wait_inodes)(struct super_block *);
void (*put_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_fs) (struct super_block *);
diff -puN fs/fs-writeback.c~tux3-fix-wait_sb_inodes fs/fs-writeback.c
--- tux3fs/fs/fs-writeback.c~tux3-fix-wait_sb_inodes 2013-06-24 19:03:10.000000000 +0900
+++ tux3fs-hirofumi/fs/fs-writeback.c 2013-06-24 19:03:10.000000000 +0900
@@ -1401,7 +1401,10 @@ void sync_inodes_sb(struct super_block *
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);

- wait_sb_inodes(sb);
+ if (sb->s_op->wait_inodes)
+ sb->s_op->wait_inodes(sb);
+ else
+ wait_sb_inodes(sb);
}
EXPORT_SYMBOL(sync_inodes_sb);

_

--
OGAWA Hirofumi <[email protected]>


2013-06-26 16:02:46

by Jörn Engel

Subject: Re: [PATCH] Optimize wait_sb_inodes()

On Wed, 26 June 2013 17:45:23 +0900, OGAWA Hirofumi wrote:
>
>
> fs/fs-writeback.c | 5 ++++-
> include/linux/fs.h | 1 +
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff -puN include/linux/fs.h~tux3-fix-wait_sb_inodes include/linux/fs.h
> --- tux3fs/include/linux/fs.h~tux3-fix-wait_sb_inodes 2013-06-24 19:03:10.000000000 +0900
> +++ tux3fs-hirofumi/include/linux/fs.h 2013-06-24 19:03:10.000000000 +0900
> @@ -1595,6 +1595,7 @@ struct super_operations {
> int (*write_inode) (struct inode *, struct writeback_control *wbc);
> int (*drop_inode) (struct inode *);
> void (*evict_inode) (struct inode *);
> + void (*wait_inodes)(struct super_block *);
> void (*put_super) (struct super_block *);
> int (*sync_fs)(struct super_block *sb, int wait);
> int (*freeze_fs) (struct super_block *);
> diff -puN fs/fs-writeback.c~tux3-fix-wait_sb_inodes fs/fs-writeback.c
> --- tux3fs/fs/fs-writeback.c~tux3-fix-wait_sb_inodes 2013-06-24 19:03:10.000000000 +0900
> +++ tux3fs-hirofumi/fs/fs-writeback.c 2013-06-24 19:03:10.000000000 +0900
> @@ -1401,7 +1401,10 @@ void sync_inodes_sb(struct super_block *
> bdi_queue_work(sb->s_bdi, &work);
> wait_for_completion(&done);
>
> - wait_sb_inodes(sb);
> + if (sb->s_op->wait_inodes)
> + sb->s_op->wait_inodes(sb);
> + else
> + wait_sb_inodes(sb);
> }
> EXPORT_SYMBOL(sync_inodes_sb);

Two things. Until there are actual implementations of
s_op->wait_inodes, this is pure obfuscation. You already know this,
of course.

More interestingly, I personally hate methods with a default option if
they are not implemented. Not too bad in this particular case, but
the same pattern has burned me a number of times and wasted weeks of
my life. So I would prefer to unconditionally call
sb->s_op->wait_inodes(sb) and set it to wait_sb_inodes for all
filesystems that don't have a smarter way to do things.

Jörn

--
Linux is more the core point of a concept that surrounds "open source"
which, in turn, is based on a false concept. This concept is that
people actually want to look at source code.
-- Rob Enderle

2013-06-26 23:12:23

by Dave Chinner

Subject: Re: [PATCH] Optimize wait_sb_inodes()

On Wed, Jun 26, 2013 at 05:45:23PM +0900, OGAWA Hirofumi wrote:
> Hi,
>
> On the following stress, sync(2) became top of cpu load.
>
> fsstress -s 1 -n 50000 -p 3 -d `pwd`
>
> After profile by perf, cpu load was lock contention of inode_sb_list_lock.
>
> sync(2) is data integrity operation, so it has to make sure all dirty
> data was written before sync(2) point. The bdi flusher flushes current
> dirty data and wait those. But, if there is in-flight I/O before
> sync_inodes_sb(), sync_inodes_sb() can be done before prior in-flight I/O.
>
> So, wait_sb_inodes() is walking the all inodes on sb to make sure
> in-flight I/O was done too. When it is walking inodes,
> inode_sb_list_lock is contending with other operations like
> create(2). This is the cause of cpu load.

Sure, but avoiding wait_sb_inodes() doesn't address the fast-path
lock contention on inode_sb_list_lock in the create and
evict paths. The problems caused by wait_sb_inodes() are background
noise compared to contention multiple CPUs can put on this lock just
walking large directory structures in parallel. Hence hacking
around wait_sb_inodes() doesn't address the underlying lock
contention problem.

> On another view, wait_sb_inodes() would (arguably) be necessary for
> legacy FSes. But, for example, if data=journal on ext*, wait_sb_inodes()
> would be more than useless, because ext* can be done it by own
> transaction list (and more efficient way).
>
> Likewise, on tux3, the state is same with data=journal.
>
> Also, even if data=ordered, ext* might be able to check in-flight I/O by
> ordered data list (with some new additional check, I'm not sure).

Why would you bother solving this problem differently in every
single filesystem? It's solvable at the VFS by tracking inodes that
are no longer dirty but still under writeback on the BDI. Then
converting wait_sb_inodes() to walk all the dirty and writeback
inodes would be sufficient for data integrity purposes, and it would
be done under the bdi writeback lock, not the inode_sb_list_lock....

Alternatively, splitting up the inode sb list and lock (say via the
per-node list_lru structures in -mm and -next that are being added
for exactly this purpose) would also significantly reduce lock
contention on both the create/evict fast paths and the
wait_sb_inodes() walk that is currently done....

So I think that you should address the problem properly at the VFS
level so everyone benefits, not push interfaces that allow
filesystem specific hacks to work around VFS level deficiencies...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-27 00:01:29

by OGAWA Hirofumi

Subject: Re: [PATCH] Optimize wait_sb_inodes()

Jörn Engel <[email protected]> writes:

> Two things. Until there are actual implementations of
> s_op->wait_inodes, this is pure obfuscation. You already know this,
> of course.

On tux3, the implementation of ->wait_inodes() is the following, because
tux3 guarantees the ordering that wait_sb_inodes() wants to check, like
data=journal.

+static void tux3_wait_inodes(struct super_block *sb)
+{
+ /*
+ * Since tux3 flushes whole delta and guarantee order of
+ * deltas, so tux3 doesn't need to wait inodes.
+ *
+ * Note, when we start to support direct I/O, we might have to
+ * revisit this to check in-progress direct I/O.
+ */
+}

Another (untested) example, for ext*, would be like the following:

static void ext4_wait_inodes(struct super_block *sb)
{
	/* ->sync_fs() guarantees to wait for all */
	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
		return;

	/* FIXME: On data=ordered, we might be able to do something. */
	wait_sb_inodes(sb);
}

> More interestingly, I personally hate methods with a default option if
> they are not implemented. Not too bad in this particular case, but
> the same pattern has burned me a number of times and wasted weeks of
> my life. So I would prefer to unconditionally call
> sb->s_op->wait_inodes(sb) and set it to wait_sb_inodes for all
> filesystems that don't have a smarter way to do things.

I don't have a strong opinion about it, though. Because the optimized
version is optional, this way might be safer.

Well, if there is a reason to push it down, I will do it.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-06-27 00:14:28

by OGAWA Hirofumi

Subject: Re: [PATCH] Optimize wait_sb_inodes()

Dave Chinner <[email protected]> writes:

>> On another view, wait_sb_inodes() would (arguably) be necessary for
>> legacy FSes. But, for example, if data=journal on ext*, wait_sb_inodes()
>> would be more than useless, because ext* can be done it by own
>> transaction list (and more efficient way).
>>
>> Likewise, on tux3, the state is same with data=journal.
>>
>> Also, even if data=ordered, ext* might be able to check in-flight I/O by
>> ordered data list (with some new additional check, I'm not sure).
>
> Why would you bother solving this problem differently in every
> single filesystem? It's solvable at the VFS by tracking inodes that
> are no longer dirty but still under writeback on the BDI. Then
> converting wait_sb_inodes() to walk all the dirty and writeback
> inodes would be sufficient for data integrity purposes, and it would
> be done under the bdi writeback lock, not the inode_sb_list_lock....
>
> Alternatively, splitting up the inode sb list and lock (say via the
> per-node list_lru structures in -mm and -next that are being added
> for exactly this purpose) would also significantly reduce lock
> contention on both the create/evict fast paths and the
> wait_sb_inodes() walk that is currently done....
>
> So I think that you should address the problem properly at the VFS
> level so everyone benefits, not push interfaces that allow
> filesystem specific hacks to work around VFS level deficiencies...

Optimizing wait_sb_inodes() might help the lock contention, but it doesn't
remove the unnecessary wait/check. Some FSes already know about current
in-flight I/O internally, so I think those FSes can do this check
here, or are already doing it in ->sync_fs().

For example, I guess an ext4 implementation (untested) would be something
like the following. If ->sync_fs() does it all, ext4 doesn't need to be
bothered by wait_sb_inodes().

static void ext4_wait_inodes(struct super_block *sb)
{
	/* ->sync_fs() guarantees to wait for all */
	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
		return;

	/* FIXME: On data=ordered, we might be able to avoid this too. */
	wait_sb_inodes(sb);
}

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-06-27 04:47:17

by Dave Chinner

Subject: Re: [PATCH] Optimize wait_sb_inodes()

On Thu, Jun 27, 2013 at 09:14:07AM +0900, OGAWA Hirofumi wrote:
> Dave Chinner <[email protected]> writes:
>
> >> On another view, wait_sb_inodes() would (arguably) be necessary for
> >> legacy FSes. But, for example, if data=journal on ext*, wait_sb_inodes()
> >> would be more than useless, because ext* can be done it by own
> >> transaction list (and more efficient way).
> >>
> >> Likewise, on tux3, the state is same with data=journal.
> >>
> >> Also, even if data=ordered, ext* might be able to check in-flight I/O by
> >> ordered data list (with some new additional check, I'm not sure).
> >
> > Why would you bother solving this problem differently in every
> > single filesystem? It's solvable at the VFS by tracking inodes that
> > are no longer dirty but still under writeback on the BDI. Then
> > converting wait_sb_inodes() to walk all the dirty and writeback
> > inodes would be sufficient for data integrity purposes, and it would
> > be done under the bdi writeback lock, not the inode_sb_list_lock....
> >
> > Alternatively, splitting up the inode sb list and lock (say via the
> > per-node list_lru structures in -mm and -next that are being added
> > for exactly this purpose) would also significantly reduce lock
> > contention on both the create/evict fast paths and the
> > wait_sb_inodes() walk that is currently done....
> >
> > So I think that you should address the problem properly at the VFS
> > level so everyone benefits, not push interfaces that allow
> > filesystem specific hacks to work around VFS level deficiencies...
>
> Optimizing wait_sb_inodes() might help lock contention, but it doesn't
> help unnecessary wait/check.

You have your own wait code, but that doesn't make what the VFS does
unnecessary. Quite frankly, I don't trust individual filesystems to
get it right - there's a long history of filesystem specific data
sync problems (including in XFS), and the best way to avoid that is
to ensure the VFS gets it right for you.

Indeed, we've gone from having sooper special secret sauce data sync
code in XFS to using the VFS over the past few years, and the result
is that it is now more reliable and faster than when we were trying
to be smart and do it all ourselves. We got to where we are by
fixing the problems in the VFS rather than continuing to try to work
around them.

> Since some FSes know about current
> in-flight I/O already in those internal, so I think, those FSes can be
> done it here, or are already doing in ->sync_fs().

Sure, do your internal checks in ->sync_fs(), but if
wait_sb_inodes() does not have any lock contention and very little
overhead, then why do you need to avoid it? This wait has to be done
somewhere between sync_inodes_sb() dispatching all the IO and
->sync_fs completing, so what's the problem with having the VFS do
that *for everyone* efficiently?

Fix the root cause of the problem - the sub-optimal VFS code.
Hacking around it specifically for out-of-tree code is not the way
things get done around here...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-27 05:18:23

by OGAWA Hirofumi

Subject: Re: [PATCH] Optimize wait_sb_inodes()

Dave Chinner <[email protected]> writes:

>> Optimizing wait_sb_inodes() might help lock contention, but it doesn't
>> help unnecessary wait/check.
>
> You have your own wait code, that doesn't make what the VFS does
> unnecessary. Quite frankly, I don't trust individual filesystems to
> get it right - there's a long history of filesystem specific data
> sync problems (including in XFS), and the best way to avoid that is
> to ensure the VFS gets it right for you.
>
> Indeed, we've gone from having sooper special secret sauce data sync
> code in XFS to using the VFS over the past few years, and the result
> is that it is now more reliable and faster than when we were trying
> to be smart and do it all ourselves. We got to where we are by
> fixing the problems in the VFS rather than continuing to try to work
> around them.

I guess you are assuming an FS which uses data=writeback or the like.

>> Since some FSes know about current
>> in-flight I/O already in those internal, so I think, those FSes can be
>> done it here, or are already doing in ->sync_fs().
>
> Sure, do your internal checks in ->sync_fs(), but if
> wait_sb_inodes() does not have any lock contention and very little
> overhead, then why do you need to avoid it? This wait has to be done
> somewhere between sync_inodes_sb() dispatching all the IO and
> ->sync_fs completing, so what's the problem with having the VFS do
> that *for everyone* efficiently?

Are you saying the vfs should track all in-flight I/O in some sort of
transactional way?

Otherwise, the vfs can't know whether data is from after the sync point or
before it, and whether it has to wait or not. An FS with behavior like
data=journal has such tracking already, and can reuse it.

> Fix the root cause of the problem - the sub-optimal VFS code.
> Hacking around it specifically for out-of-tree code is not the way
> things get done around here...

I'm thinking the root cause is that the vfs can't have knowledge of FS
internals, e.g. whether the FS handles data in a transactional way or not.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-06-27 05:50:57

by Daniel Phillips

Subject: Re: [PATCH] Optimize wait_sb_inodes()

Hi Dave,

On Wed, Jun 26, 2013 at 9:47 PM, Dave Chinner <[email protected]> wrote:
> You have your own wait code, that doesn't make what the VFS does
> unnecesary. Quite frankly, I don't trust individual filesystems to
> get it right - there's a long history of filesystem specific data
> sync problems (including in XFS), and the best way to avoid that is
> to ensure the VFS gets it right for you.

I agree that some of the methods Tux3 uses to implement data integrity, sync
and friends may be worth lifting up to core, or better, to a library,
but we will
all be better served if such methods are given time to mature first. After all,
that basically describes the entire evolution of the VFS: new concepts start
in a filesystem, prove themselves useful, then may be lifted up to be shared.

It is important to get the order right: prove first, then lift.

Regards,

Daniel

2013-06-27 06:38:26

by Dave Chinner

Subject: Re: [PATCH] Optimize wait_sb_inodes()

On Thu, Jun 27, 2013 at 02:18:17PM +0900, OGAWA Hirofumi wrote:
> Dave Chinner <[email protected]> writes:
>
> >> Optimizing wait_sb_inodes() might help lock contention, but it doesn't
> >> help unnecessary wait/check.
> >
> > You have your own wait code, that doesn't make what the VFS does
> > unnecessary. Quite frankly, I don't trust individual filesystems to
> > get it right - there's a long history of filesystem specific data
> > sync problems (including in XFS), and the best way to avoid that is
> > to ensure the VFS gets it right for you.
> >
> > Indeed, we've gone from having sooper special secret sauce data sync
> > code in XFS to using the VFS over the past few years, and the result
> > is that it is now more reliable and faster than when we were trying
> > to be smart and do it all ourselves. We got to where we are by
> > fixing the problems in the VFS rather than continuing to try to work
> > around them.
>
> I guess you are assuming FS which is using data=writeback or such.

No, I'm not assuming anything about the filesystem. Indeed, many
filesystems don't work like ext4 data=writeback/ordered, and to
assume that they do at the VFS level is completely wrong.

I'm saying that the VFS - which is tracking all the inodes with
dirty data, and is starting IO on them - needs to also provide
correct and efficient "wait for all inodes under IO" behaviour. It
currently provides "correct", but it's not efficient.

>
> >> Since some FSes know about current
> >> in-flight I/O already in those internal, so I think, those FSes can be
> >> done it here, or are already doing in ->sync_fs().
> >
> > Sure, do your internal checks in ->sync_fs(), but if
> > wait_sb_inodes() does not have any lock contention and very little
> > overhead, then why do you need to avoid it? This wait has to be done
> > somewhere between sync_inodes_sb() dispatching all the IO and
> > ->sync_fs completing, so what's the problem with having the VFS do
> > that *for everyone* efficiently?
>
> Are you saying the vfs should track all in-flight I/O with some sort of
> transactional way?

The VFS doesn't need to know a thing about filesystem transactions.

> Otherwise, vfs can't know the data is whether after sync point or before
> sync point, and have to wait or not. FS is using the behavior like
> data=journal has tracking of those already, and can reuse it.

The VFS writeback code already differentiates between data written
during a sync operation and that dirtied after a sync operation.
Perhaps you should look at the tagging for WB_SYNC_ALL writeback
that write_cache_pages does....

But, anyway, we don't have to do that on the waiting side of things.
All we need to do is add the inode to an "under IO" list on the bdi
when the mapping is initially tagged with pages under writeback,
and remove it from that list during IO completion when the mapping
is no longer tagged as having pages under writeback.

wait_sb_inodes() just needs to walk that list and wait on each inode
to complete IO. It doesn't require *any awareness* of the underlying
filesystem implementation or how the IO is actually issued - if
there's IO in progress at the time wait_sb_inodes() is called, then
it waits for it.

> > Fix the root cause of the problem - the sub-optimal VFS code.
> > Hacking around it specifically for out-of-tree code is not the way
> > things get done around here...
>
> I'm thinking the root cause is vfs can't have knowledge of FS internal,
> e.g. FS is handling data transactional way, or not.

If the filesystem has transactional data/metadata that the VFS is
not tracking, then that is what the ->sync_fs call is for. i.e. so
the filesystem can then do what ever extra writeback/waiting it
needs to do that the VFS is unaware of.

We already cater for what Tux3 needs in the VFS - all you've done is
found an inefficient algorithm that needs fixing.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-27 07:22:11

by OGAWA Hirofumi

Subject: Re: [PATCH] Optimize wait_sb_inodes()

Dave Chinner <[email protected]> writes:

>> Otherwise, vfs can't know the data is whether after sync point or before
>> sync point, and have to wait or not. FS is using the behavior like
>> data=journal has tracking of those already, and can reuse it.
>
> The VFS writeback code already differentiates between data written
> during a sync operation and that dirtied after a sync operation.
> Perhaps you should look at the tagging for WB_SYNC_ALL writeback
> that write_cache_pages does....
>
> But, anyway, we don't have to do that on the waiting side of things.
> All we need to do is add the inode to a "under IO" list on the bdi
> when the mapping is initially tagged with pages under writeback,
> and remove it from that list during IO completion when the mapping
> is no longer tagged as having pages under writeback.
>
> wait_sb_inodes() just needs to walk that list and wait on each inode
> to complete IO. It doesn't require *any awareness* of the underlying
> filesystem implementation or how the IO is actually issued - if
> there's IO in progress at the time wait_sb_inodes() is called, then
> it waits for it.
>
>> > Fix the root cause of the problem - the sub-optimal VFS code.
>> > Hacking around it specifically for out-of-tree code is not the way
>> > things get done around here...
>>
>> I'm thinking the root cause is vfs can't have knowledge of FS internal,
>> e.g. FS is handling data transactional way, or not.
>
> If the filesystem has transactional data/metadata that the VFS is
> not tracking, then that is what the ->sync_fs call is for. i.e. so
> the filesystem can then do what ever extra writeback/waiting it
> needs to do that the VFS is unaware of.
>
> We already cater for what Tux3 needs in the VFS - all you've done is
> found an inefficient algorithm that needs fixing.

write_cache_pages() is a library function, called by each FS, so we can
already assume it is not under vfs control. And write_cache_pages()
doesn't do the right thing for data=journal, because it handles each
inode separately, not all at once. So, new dirty data can be inserted
while marking.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-06-27 07:22:26

by Daniel Phillips

Subject: Re: [PATCH] Optimize wait_sb_inodes()

On Wed, Jun 26, 2013 at 10:18 PM, OGAWA Hirofumi
<[email protected]> wrote:
> ...vfs can't know the data is whether after sync point or before
> sync point, and have to wait or not. FS is using the behavior like
> data=journal has tracking of those already, and can reuse it.

Clearly showing why the current interface is inefficient.

In the old days, core liked to poke its nose into the buffers of
every fs too, and those days are thankfully gone (thanks to
akpm iirc). Maybe the days of core sticking its nose into the
inode dirty state of every fs are soon to end as well.

Why does core even need to know about the inodes a fs is
using? Maybe for some filesystems it saves implementation
code, but for Tux3 it just wastes CPU and adds extra code.

Regards,

Daniel

2013-06-27 09:40:50

by Dave Chinner

Subject: Re: [PATCH] Optimize wait_sb_inodes()

On Thu, Jun 27, 2013 at 04:22:04PM +0900, OGAWA Hirofumi wrote:
> Dave Chinner <[email protected]> writes:
>
> >> Otherwise, vfs can't know the data is whether after sync point or before
> >> sync point, and have to wait or not. FS is using the behavior like
> >> data=journal has tracking of those already, and can reuse it.
> >
> > The VFS writeback code already differentiates between data written
> > during a sync operation and that dirtied after a sync operation.
> > Perhaps you should look at the tagging for WB_SYNC_ALL writeback
> > that write_cache_pages does....
> >
> > But, anyway, we don't have to do that on the waiting side of things.
> > All we need to do is add the inode to a "under IO" list on the bdi
> > when the mapping is initially tagged with pages under writeback,
> > and remove it from that list during IO completion when the mapping
> > is no longer tagged as having pages under writeback.
> >
> > wait_sb_inodes() just needs to walk that list and wait on each inode
> > to complete IO. It doesn't require *any awareness* of the underlying
> > filesystem implementation or how the IO is actually issued - if
> > there's IO in progress at the time wait_sb_inodes() is called, then
> > it waits for it.
> >
> >> > Fix the root cause of the problem - the sub-optimal VFS code.
> >> > Hacking around it specifically for out-of-tree code is not the way
> >> > things get done around here...
> >>
> >> I'm thinking the root cause is vfs can't have knowledge of FS internal,
> >> e.g. FS is handling data transactional way, or not.
> >
> > If the filesystem has transactional data/metadata that the VFS is
> > not tracking, then that is what the ->sync_fs call is for. i.e. so
> > the filesystem can then do what ever extra writeback/waiting it
> > needs to do that the VFS is unaware of.
> >
> > We already cater for what Tux3 needs in the VFS - all you've done is
> > found an inefficient algorithm that needs fixing.
>
> write_cache_pages() is library function to be called from per-FS. So, it
> is not under vfs control can be assume already. And it doesn't do right
> things via write_cache_pages() for data=journal, because it handles for
> each inodes, not at once. So, new dirty data can be inserted while
> marking.

Sure it can. But that newly dirtied data has occurred after the data
integrity writeback call was begun, so it's not part of what the
writeback code call needs to write back. We are quite entitled to
ignore it for the purposes of a data integrity sync because it was
dirtied *after* write_cache_pages() was asked to sync the range of
the inode.

IOWs, the VFS draws a line in the sand at a point in time when each
inode is written for a data integrity sync. You have to do that
somewhere, and there's little point in making that a global barrier
when it is not necessary to do so.

tux3 draws a different line in the sand, as does ext3/4
data=journal. In effect, tux3 and ext3/4 data=journal define a
global point in time that everything is "in sync", and that's way
above what is necessary for a sync(2) operation. The VFS already
has equivalent functionality - it's the state we force filesystems
into when they are frozen. i.e. freezing a filesystem forces it down
into a state where it is transactionally consistent on disk w.r.t
both data and metadata. sync(2) does not require these
"transactionally consistent" semantics, so the VFS does not try to
provide them.

Anyway, this is a moot discussion. I've already got prototype code
that fixes the wait_sb_inodes() problem as somebody is having
problems with many concurrent executions of wait_sb_inodes() causing
severe lock contention...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-27 10:06:42

by OGAWA Hirofumi

Subject: Re: [PATCH] Optimize wait_sb_inodes()

Dave Chinner <[email protected]> writes:

>> >> > Fix the root cause of the problem - the sub-optimal VFS code.
>> >> > Hacking around it specifically for out-of-tree code is not the way
>> >> > things get done around here...
>> >>
>> >> I'm thinking the root cause is vfs can't have knowledge of FS internal,
>> >> e.g. FS is handling data transactional way, or not.
>> >
>> > If the filesystem has transactional data/metadata that the VFS is
>> > not tracking, then that is what the ->sync_fs call is for. i.e. so
>> > the filesystem can then do what ever extra writeback/waiting it
>> > needs to do that the VFS is unaware of.
>> >
>> > We already cater for what Tux3 needs in the VFS - all you've done is
>> > found an inefficient algorithm that needs fixing.
>>
>> write_cache_pages() is library function to be called from per-FS. So, it
>> is not under vfs control can be assume already. And it doesn't do right
>> things via write_cache_pages() for data=journal, because it handles for
>> each inodes, not at once. So, new dirty data can be inserted while
>> marking.
>
> Sure it can. But that newly dirtied data has occurred after the data
> integrity writeback call was begun, so it's not part of what the
> writeback code call needs to write back. We are quite entitled to
> ignore it for the purposes of a data integrity sync because it was
> dirtied *after* write_cache_pages() was asked to sync the range of
> the inode.
>
> IOWs, the VFS draws a line in the sand at a point in time when each
> inode is written for a data integrity sync. You have to do that
> somewhere, and there's little point in making that a global barrier
> when it is not necessary to do so.
>
> tux3 draws a different line in the sand, as does ext3/4
> data=journal. In effect, tux3 and ext3/4 data=journal define a
> global point in time that everything is "in sync", and that's way
> above what is necessary for a sync(2) operation. The VFS already
> has equivalent functionality - it's the state we force filesystems
> into when they are frozen. i.e. freezing a filesystem forces it down
> into a state where it is transactionally consistent on disk w.r.t
> both data and metadata. sync(2) does not require these
> "transactionally consistent" semantics, so the VFS does not try to
> provide them.

That is what I'm calling the unnecessary wait.
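For reference, the shape of the hook the patch proposes can be sketched in plain userspace C. The `super_block`/`super_operations` structs below are simplified stand-ins, not the kernel's, and `sync_wait()` is a hypothetical name for the dispatch that sync_inodes_sb() would do:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (illustration only). */
struct super_block;

struct super_operations {
	/* Optional per-FS replacement for the generic inode walk. */
	void (*wait_inodes)(struct super_block *sb);
};

struct super_block {
	const struct super_operations *s_op;
	int generic_walks;	/* counts fallbacks to the generic path */
};

/* Generic path: walk every inode on the sb. In the real kernel this is
 * wait_sb_inodes(), which contends on inode_sb_list_lock. */
static void wait_sb_inodes(struct super_block *sb)
{
	sb->generic_walks++;
}

/* With the proposed hook, the sync path would dispatch like this. */
static void sync_wait(struct super_block *sb)
{
	if (sb->s_op && sb->s_op->wait_inodes)
		sb->s_op->wait_inodes(sb);	/* FS knows a cheaper way */
	else
		wait_sb_inodes(sb);		/* legacy fallback */
}

/* A tux3/data=journal style FS can make the hook a no-op, because its
 * transaction machinery already guarantees the waited-for state. */
static void noop_wait_inodes(struct super_block *sb)
{
	(void)sb;
}
```

The point of the indirection is that legacy filesystems keep the existing behaviour with no change, while a filesystem with its own transactional bookkeeping can short-circuit the walk entirely.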

> Anyway, this is a moot discussion. I've already got prototype code
> that fixes the wait_sb_inodes() problem as somebody is having
> problems with many concurrent executions of wait_sb_inodes() causing
> severe lock contention...

Sorry, but it sounds like you are just saying "I don't need it".

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-06-27 18:36:39

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] Optimize wait_sb_inodes()

Let's go up a level. Why are you so focused on optimizing sync(2)?
What's the use case where you anticipate applications using sync(2) or
syncfs(2)? Or is this for the purpose of Tux3's benchmarketing?

In practice, sync(2) as a data integrity mechanism is a bit
heavyweight for most applications. In general the complaint with ext3
or ext4 is that fsync(2) is forcing a full filesystem-wide sync, and
not just a sync specific to the file.

System administrators use it sometimes[1], sometimes out of superstition,
but how many applications actually use it? And why?

[1] In fact, I've heard an argument that sync(2) should be optionally
made privileged and only accessible to root, since there were cases
where system administrators would login to a heavily loaded server as
a non-privileged user, type "sync" out of reflex or habit, and the
resulting huge amounts of I/O would blow the 99.9th percentile latency
guarantees for all of the latency-sensitive services running on that
server.

If the goal is to optimize file system freeze operations, sure. But
that's also not a common operation, so if it burns a little extra CPU
time, it wouldn't cause me to think that it was a high priority issue.
Decreasing the wall clock time for a file system freeze operation
would probably be a far more interesting goal.

Regards,

- Ted

2013-06-27 23:37:48

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] Optimize wait_sb_inodes()

Theodore Ts'o <[email protected]> writes:

> Let's go up a level. Why are you so focused on optimizing sync(2)?
> What's the use case where you anticipate applications using sync(2) or
> syncfs(2)? Or is this for the purpose of Tux3's benchmarketing?
>
> In practice, sync(2) as a data integrity mechanism is a bit
> heavyweight for most applications. In general the complaint with ext3
> or ext4 is that fsync(2) is forcing a full filesystem-wide sync, and
> not just a sync specific to the file.
>
> System administrators use it sometimes[1], sometimes out of superstition,
> but how many applications actually use it? And why?
>
> [1] In fact, I've heard an argument that sync(2) should be optionally
> made privileged and only accessible to root, since there were cases
> where system administrators would login to a heavily loaded server as
> a non-privileged user, type "sync" out of reflex or habit, and the
> resulting huge amounts of I/O would blow the 99.9th percentile latency
> guarantees for all of the latency-sensitive services running on that
> server.
>
> If the goal is to optimize file system freeze operations, sure. But
> that's also not a common operation, so if it burns a little extra CPU
> time, it wouldn't cause me to think that it was a high priority issue.
> Decreasing the wall clock time for a file system freeze operation
> would probably be a far more interesting goal.

I'm not focusing on optimizing sync(2) the way you suspect. It might be
an issue of visibility of our work.

Well, anyway, it is simple. This issue came up as a performance
regression when I was experimenting with using the kernel bdi flusher
instead of our own flusher. The issue was sync(2), like I said. And it
was the one issue I couldn't solve on the tux3 side, unlike the other
optimizations.

A simple patch was enough to optimize it, and there was no reason to
keep the patch hidden, so I shared this gap in the VFS with the
community.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-06-27 23:45:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] Optimize wait_sb_inodes()

On Fri, Jun 28, 2013 at 08:37:40AM +0900, OGAWA Hirofumi wrote:
>
> Well, anyway, it is simple. This issue came up as a performance
> regression when I was experimenting with using the kernel bdi flusher
> instead of our own flusher. The issue was sync(2), like I said. And it
> was the one issue I couldn't solve on the tux3 side, unlike the other
> optimizations.

A performance regression using fsstress? That's not a program
intended to be a useful benchmark for measuring performance.

- Ted

2013-06-28 00:30:26

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] Optimize wait_sb_inodes()

Theodore Ts'o <[email protected]> writes:

> On Fri, Jun 28, 2013 at 08:37:40AM +0900, OGAWA Hirofumi wrote:
>>
>> Well, anyway, it is simple. This issue came up as a performance
>> regression when I was experimenting with using the kernel bdi flusher
>> instead of our own flusher. The issue was sync(2), like I said. And it
>> was the one issue I couldn't solve on the tux3 side, unlike the other
>> optimizations.
>
> A performance regression using fsstress? That's not a program
> intended to be a useful benchmark for measuring performance.

Right. fsstress is a stress tool for me too, run as part of CI with a
background vmstat 1. Anyway, that is how I noticed this.

I agree it would not be high priority. But I don't think we should stop
optimizing it.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-06-28 03:00:35

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] Optimize wait_sb_inodes()

Hi Ted,

On Thu, Jun 27, 2013 at 11:36 AM, Theodore Ts'o <[email protected]> wrote:
> If the goal is to optimize file system freeze operations, sure. But
> that's also not a common operation, so if it burns a little extra CPU
> time, it wouldn't cause me to think that it was a high priority issue.
> Decreasing the wall clock time for a file system freeze operation
> would probably be a far more interesting goal.

Personally, any extra CPU burn bothers me. Maybe I am just
obsessive about that. Anyway, extra CPU usually does translate
into wall clock time, so optimizing sync would seem to be worth
the price of admission by that criterion alone. In general it's hard
to see why syncing fast and efficiently could ever be a bad thing.

I agree, not high priority. Not useless either. I see it more as a fit
and finish thing.

Regards,

Daniel

2013-06-28 05:24:24

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] Optimize wait_sb_inodes()

On Fri, Jun 28, 2013 at 09:30:17AM +0900, OGAWA Hirofumi wrote:
> Theodore Ts'o <[email protected]> writes:
>
> > On Fri, Jun 28, 2013 at 08:37:40AM +0900, OGAWA Hirofumi wrote:
> >>
> >> Well, anyway, it is simple. This issue came up as a performance
> >> regression when I was experimenting with using the kernel bdi flusher
> >> instead of our own flusher. The issue was sync(2), like I said. And it
> >> was the one issue I couldn't solve on the tux3 side, unlike the other
> >> optimizations.
> >
> > A performance regression using fsstress? That's not a program
> > intended to be a useful benchmark for measuring performance.
>
> Right. fsstress is a stress tool for me too, run as part of CI with a
> background vmstat 1. Anyway, that is how I noticed this.
>
> I agree it would not be high priority. But I don't think we should stop
> optimizing it.

But you're not proposing any sort of optimisation at all - you're
simply proposing to hack around the problem so you don't have to
care about it. The VFS is a shared resource - it has to work well
for everyone - and that means we need to fix problems and not ignore
them.

As I said, wait_sb_inodes() is fixable. I'm not fixing for tux3,
though - I'm fixing it because it's causing soft lockups on XFS and
ext4 in 3.10-rc6:

https://lkml.org/lkml/2013/6/27/772
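The contention pattern behind those soft lockups can be modeled in
userspace with pthreads: one thread does an O(inodes) walk while holding
a single lock, and another thread repeatedly takes the same lock for a
tiny insertion. This is a rough stand-in for wait_sb_inodes() vs.
create(2) on inode_sb_list_lock; all names below are illustrative, not
kernel code:

```c
#include <assert.h>
#include <pthread.h>

#define NINODES 100000

static pthread_mutex_t inode_sb_list_lock = PTHREAD_MUTEX_INITIALIZER;
static int inode_count = 0;

/* wait_sb_inodes()-style walk: O(inodes) work under the global lock. */
static void *sync_walker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&inode_sb_list_lock);
	for (volatile int i = 0; i < NINODES; i++)
		;	/* stands in for inspecting each inode */
	pthread_mutex_unlock(&inode_sb_list_lock);
	return NULL;
}

/* create(2)-style insertion: a tiny critical section, but it still has
 * to queue behind the entire walk above whenever a sync is running. */
static void *creator(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000; i++) {
		pthread_mutex_lock(&inode_sb_list_lock);
		inode_count++;
		pthread_mutex_unlock(&inode_sb_list_lock);
	}
	return NULL;
}
```

With many concurrent sync_walker()s, as in the report above, every
creator stalls for the duration of each full walk, which is why the lock
hold time, not just the lock itself, is the problem.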

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-28 07:39:54

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] Optimize wait_sb_inodes()

Dave Chinner <[email protected]> writes:

>> > A performance regression using fsstress? That's not a program
>> > intended to be a useful benchmark for measuring performance.
>>
>> Right. fsstress is a stress tool for me too, run as part of CI with a
>> background vmstat 1. Anyway, that is how I noticed this.
>>
>> I agree it would not be high priority. But I don't think we should stop
>> optimizing it.
>
> But you're not proposing any sort of optimisation at all - you're
> simply proposing to hack around the problem so you don't have to
> care about it. The VFS is a shared resource - it has to work well
> for everyone - and that means we need to fix problems and not ignore
> them.

Agreed, the VFS has to work well for everyone. To work well, the VFS
should not force unnecessary overhead/waits on filesystems.

I'm not saying we should stop optimizing wait_sb_inodes() itself; I'm
saying that alone is not enough.
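The original patch description claims a journalling FS can satisfy the
wait from its own transaction list instead of walking every inode. The
cost argument can be modeled in a few lines of userspace C; `toy_fs`
and both wait functions are purely hypothetical names, not ext*/tux3
code:

```c
#include <assert.h>

#define NINODES 1000

/* Toy filesystem state: many inodes, but only the transactions that are
 * actually in flight need waiting on for sync(2) integrity. */
struct toy_fs {
	int inflight;		/* commits not yet on stable storage */
	int inodes_visited;	/* work done by the generic fallback */
};

/* Generic fallback: visit every inode just to find in-flight I/O,
 * so the cost scales with the inode count. */
static void generic_wait(struct toy_fs *fs)
{
	for (int i = 0; i < NINODES; i++)
		fs->inodes_visited++;
	fs->inflight = 0;
}

/* Transaction-list wait: the journal already knows what is in flight,
 * so the cost is proportional to outstanding work, not inode count. */
static void journal_wait(struct toy_fs *fs)
{
	while (fs->inflight > 0)
		fs->inflight--;	/* stands in for waiting on one commit */
}
```

Both paths reach the same integrity point (no in-flight work left); the
difference is that the journal-driven wait does no per-inode scanning at
all, which is the "unnecessary wait" being argued about in this thread.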

Thanks.
--
OGAWA Hirofumi <[email protected]>