2009-09-24 06:54:23

by Shaohua Li

[permalink] [raw]
Subject: [RFC] page-writeback: move inodes from one superblock together

__mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
has several partitions, writeback might keep the spindle moving between
partitions. To reduce that movement, it is better to write a big chunk of one
partition and then move to another. Inodes from one fs are usually in one
partition, so ideally moving inodes from one fs together should reduce spindle
movement. This patch tries to address this. Before per-bdi writeback was added,
the behavior was to write inodes from one fs first and then another, so the
patch restores the previous behavior. The loop in the patch is a bit ugly;
should we add a dirty list for each superblock in bdi_writeback?

A test on a two-partition disk with the attached fio script shows about a
3% ~ 6% improvement.

Signed-off-by: Shaohua Li <[email protected]>

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8e1e5e1..fc87730 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -324,13 +324,29 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 				struct list_head *dispatch_queue,
 				unsigned long *older_than_this)
 {
+	LIST_HEAD(tmp);
+	struct list_head *pos, *node;
+	struct super_block *sb;
+	struct inode *inode;
+
 	while (!list_empty(delaying_queue)) {
-		struct inode *inode = list_entry(delaying_queue->prev,
-						 struct inode, i_list);
+		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
-		list_move(&inode->i_list, dispatch_queue);
+		list_move(&inode->i_list, &tmp);
+	}
+
+	/* Move inodes from one superblock together */
+	while (!list_empty(&tmp)) {
+		inode = list_entry(tmp.prev, struct inode, i_list);
+		sb = inode->i_sb;
+		list_for_each_prev_safe(pos, node, &tmp) {
+			struct inode *inode = list_entry(pos,
+							 struct inode, i_list);
+			if (inode->i_sb == sb)
+				list_move(&inode->i_list, dispatch_queue);
+		}
 	}
 }



Attachments:
newfio (203.00 B)

2009-09-24 07:14:25

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
> has several partitions, writeback might keep the spindle moving between
> partitions. To reduce that movement, it is better to write a big chunk of one
> partition and then move to another. Inodes from one fs are usually in one
> partition, so ideally moving inodes from one fs together should reduce spindle
> movement. This patch tries to address this. Before per-bdi writeback was added,
> the behavior was to write inodes from one fs first and then another, so the
> patch restores the previous behavior. The loop in the patch is a bit ugly;
> should we add a dirty list for each superblock in bdi_writeback?
>
> A test on a two-partition disk with the attached fio script shows about a
> 3% ~ 6% improvement.

Reviewed-by: Wu Fengguang <[email protected]>

Good idea! The optimization looks good to me; it addresses one
weakness of per-bdi writeback.

But one problem is, Jan Kara and I are planning to remove b_io and
hence this move_expired_inodes() function. I'm not sure how to do
this optimization without b_io.

> Signed-off-by: Shaohua Li <[email protected]>
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 8e1e5e1..fc87730 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -324,13 +324,29 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> struct list_head *dispatch_queue,
> unsigned long *older_than_this)
> {
> + LIST_HEAD(tmp);
> + struct list_head *pos, *node;
> + struct super_block *sb;
> + struct inode *inode;
> +
> while (!list_empty(delaying_queue)) {
> - struct inode *inode = list_entry(delaying_queue->prev,
> - struct inode, i_list);
> + inode = list_entry(delaying_queue->prev, struct inode, i_list);
> if (older_than_this &&
> inode_dirtied_after(inode, *older_than_this))
> break;
> - list_move(&inode->i_list, dispatch_queue);
> + list_move(&inode->i_list, &tmp);
> + }
> +
> +	/* Move inodes from one superblock together */
> + while (!list_empty(&tmp)) {
> + inode = list_entry(tmp.prev, struct inode, i_list);
> + sb = inode->i_sb;
> + list_for_each_prev_safe(pos, node, &tmp) {

We are under a spin lock, so isn't it unnecessary to use the safe version?

> + struct inode *inode = list_entry(pos,

Could just reuse inode.

Thanks,
Fengguang

> + struct inode, i_list);
> + if (inode->i_sb == sb)
> + list_move(&inode->i_list, dispatch_queue);
> + }
> }
> }
>
>

Content-Description: newfio
> [global]
> runtime=120
> ioscheduler=cfq
> size=2G
> ioengine=sync
> rw=write
> file_service_type=random:256
> overwrite=1
>
> [sdb1]
> directory=/mnt/b1
> nrfiles=10
> numjobs=4
>
> [sdb2]
> directory=/mnt/b2
> nrfiles=10
> numjobs=4

2009-09-24 07:29:28

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, 24 Sep 2009 15:14:15 +0800
Wu Fengguang <[email protected]> wrote:

> On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > __mark_inode_dirty adds inodes to the wb dirty list in random
> > order. If a disk has several partitions, writeback might keep the
> > spindle moving between partitions. To reduce that movement, it is
> > better to write a big chunk of one partition and then move to
> > another. Inodes from one fs are usually in one partition, so
> > ideally moving inodes from one fs together should reduce spindle
> > movement. This patch tries to address this. Before per-bdi
> > writeback was added, the behavior was to write inodes from one fs
> > first and then another, so the patch restores the previous
> > behavior. The loop in the patch is a bit ugly; should we add a
> > dirty list for each superblock in bdi_writeback?
> >
> > A test on a two-partition disk with the attached fio script shows
> > about a 3% ~ 6% improvement.
>
> Reviewed-by: Wu Fengguang <[email protected]>
>
> Good idea! The optimization looks good to me, it addresses one
> weakness of per-bdi writeback.
>
> But one problem is, Jan Kara and me are planning to remove b_io and
> hence this move_expired_inodes() function. Not sure how to do this
> optimization without b_io.
>
> > Signed-off-by: Shaohua Li <[email protected]>
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 8e1e5e1..fc87730 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -324,13 +324,29 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> >  				struct list_head *dispatch_queue,
> >  				unsigned long *older_than_this)
> >  {
> > +	LIST_HEAD(tmp);
> > +	struct list_head *pos, *node;
> > +	struct super_block *sb;
> > +	struct inode *inode;
> > +
> >  	while (!list_empty(delaying_queue)) {
> > -		struct inode *inode = list_entry(delaying_queue->prev,
> > -						 struct inode, i_list);
> > +		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> >  		if (older_than_this &&
> >  		    inode_dirtied_after(inode, *older_than_this))
> >  			break;
> > -		list_move(&inode->i_list, dispatch_queue);
> > +		list_move(&inode->i_list, &tmp);
> > +	}
> > +
> > +	/* Move inodes from one superblock together */
> > +	while (!list_empty(&tmp)) {
> > +		inode = list_entry(tmp.prev, struct inode, i_list);
> > +		sb = inode->i_sb;
> > +		list_for_each_prev_safe(pos, node, &tmp) {
>
> We are in spin lock, so not necessary to use the safe version?
>

The safe variant is needed for list walks that remove entries from the list;
it has nothing to do with locking ;-)


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-24 07:36:27

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 03:29:35PM +0800, Arjan van de Ven wrote:
> On Thu, 24 Sep 2009 15:14:15 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > > __mark_inode_dirty adds inodes to the wb dirty list in random
> > > order. If a disk has several partitions, writeback might keep
> > > the spindle moving between partitions. To reduce that movement,
> > > it is better to write a big chunk of one partition and then move
> > > to another. Inodes from one fs are usually in one partition, so
> > > ideally moving inodes from one fs together should reduce spindle
> > > movement. This patch tries to address this. Before per-bdi
> > > writeback was added, the behavior was to write inodes from one
> > > fs first and then another, so the patch restores the previous
> > > behavior. The loop in the patch is a bit ugly; should we add a
> > > dirty list for each superblock in bdi_writeback?
> > >
> > > A test on a two-partition disk with the attached fio script
> > > shows about a 3% ~ 6% improvement.
> >
> > Reviewed-by: Wu Fengguang <[email protected]>
> >
> > Good idea! The optimization looks good to me, it addresses one
> > weakness of per-bdi writeback.
> >
> > But one problem is, Jan Kara and me are planning to remove b_io and
> > hence this move_expired_inodes() function. Not sure how to do this
> > optimization without b_io.
> >
> > > Signed-off-by: Shaohua Li <[email protected]>
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 8e1e5e1..fc87730 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -324,13 +324,29 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> > >  				struct list_head *dispatch_queue,
> > >  				unsigned long *older_than_this)
> > >  {
> > > +	LIST_HEAD(tmp);
> > > +	struct list_head *pos, *node;
> > > +	struct super_block *sb;
> > > +	struct inode *inode;
> > > +
> > >  	while (!list_empty(delaying_queue)) {
> > > -		struct inode *inode = list_entry(delaying_queue->prev,
> > > -						 struct inode, i_list);
> > > +		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > >  		if (older_than_this &&
> > >  		    inode_dirtied_after(inode, *older_than_this))
> > >  			break;
> > > -		list_move(&inode->i_list, dispatch_queue);
> > > +		list_move(&inode->i_list, &tmp);
> > > +	}
> > > +
> > > +	/* Move inodes from one superblock together */
> > > +	while (!list_empty(&tmp)) {
> > > +		inode = list_entry(tmp.prev, struct inode, i_list);
> > > +		sb = inode->i_sb;
> > > +		list_for_each_prev_safe(pos, node, &tmp) {
> >
> > We are in spin lock, so not necessary to use the safe version?
> >
>
> safe is needed for list walks that remove entries from the list
> has nothing to do with locking ;-)

Ah yes, thanks!

2009-09-24 07:44:34

by Shaohua Li

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 03:14:15PM +0800, Wu, Fengguang wrote:
> On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
> > has several partitions, writeback might keep the spindle moving between
> > partitions. To reduce that movement, it is better to write a big chunk of one
> > partition and then move to another. Inodes from one fs are usually in one
> > partition, so ideally moving inodes from one fs together should reduce spindle
> > movement. This patch tries to address this. Before per-bdi writeback was added,
> > the behavior was to write inodes from one fs first and then another, so the
> > patch restores the previous behavior. The loop in the patch is a bit ugly;
> > should we add a dirty list for each superblock in bdi_writeback?
> >
> > A test on a two-partition disk with the attached fio script shows about a
> > 3% ~ 6% improvement.
>
> Reviewed-by: Wu Fengguang <[email protected]>
>
> Good idea! The optimization looks good to me, it addresses one
> weakness of per-bdi writeback.
>
> But one problem is, Jan Kara and me are planning to remove b_io and
> hence this move_expired_inodes() function. Not sure how to do this
> optimization without b_io.
>
> > Signed-off-by: Shaohua Li <[email protected]>
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 8e1e5e1..fc87730 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -324,13 +324,29 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> > struct list_head *dispatch_queue,
> > unsigned long *older_than_this)
> > {
> > + LIST_HEAD(tmp);
> > + struct list_head *pos, *node;
> > + struct super_block *sb;
> > + struct inode *inode;
> > +
> > while (!list_empty(delaying_queue)) {
> > - struct inode *inode = list_entry(delaying_queue->prev,
> > - struct inode, i_list);
> > + inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > if (older_than_this &&
> > inode_dirtied_after(inode, *older_than_this))
> > break;
> > - list_move(&inode->i_list, dispatch_queue);
> > + list_move(&inode->i_list, &tmp);
> > + }
> > +
> > +	/* Move inodes from one superblock together */
> > + while (!list_empty(&tmp)) {
> > + inode = list_entry(tmp.prev, struct inode, i_list);
> > + sb = inode->i_sb;
> > + list_for_each_prev_safe(pos, node, &tmp) {
>
> We are in spin lock, so not necessary to use the safe version?
It's to protect the list deletion.

> > + struct inode *inode = list_entry(pos,
>
> Could just reuse inode.
Oops, I forgot to remove it when moving the inode variable to function scope.


__mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
has several partitions, writeback might keep the spindle moving between
partitions. To reduce that movement, it is better to write a big chunk of one
partition and then move to another. Inodes from one fs are usually in one
partition, so ideally moving inodes from one fs together should reduce spindle
movement. This patch tries to address this. Before per-bdi writeback was added,
the behavior was to write inodes from one fs first and then another, so the
patch restores the previous behavior. The loop in the patch is a bit ugly;
should we add a dirty list for each superblock in bdi_writeback?

A test on a two-partition disk with the attached fio script shows about a
3% ~ 6% improvement.

Signed-off-by: Shaohua Li <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8e1e5e1..303a1c5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -324,13 +324,28 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 				struct list_head *dispatch_queue,
 				unsigned long *older_than_this)
 {
+	LIST_HEAD(tmp);
+	struct list_head *pos, *node;
+	struct super_block *sb;
+	struct inode *inode;
+
 	while (!list_empty(delaying_queue)) {
-		struct inode *inode = list_entry(delaying_queue->prev,
-						 struct inode, i_list);
+		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
-		list_move(&inode->i_list, dispatch_queue);
+		list_move(&inode->i_list, &tmp);
+	}
+
+	/* Move inodes from one superblock together */
+	while (!list_empty(&tmp)) {
+		inode = list_entry(tmp.prev, struct inode, i_list);
+		sb = inode->i_sb;
+		list_for_each_prev_safe(pos, node, &tmp) {
+			inode = list_entry(pos, struct inode, i_list);
+			if (inode->i_sb == sb)
+				list_move(&inode->i_list, dispatch_queue);
+		}
 	}
 }

2009-09-24 10:02:19

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
> has several partitions, writeback might keep the spindle moving between
> partitions. To reduce that movement, it is better to write a big chunk of one
> partition and then move to another. Inodes from one fs are usually in one
> partition, so ideally moving inodes from one fs together should reduce spindle
> movement. This patch tries to address this. Before per-bdi writeback was added,
> the behavior was to write inodes from one fs first and then another, so the
> patch restores the previous behavior. The loop in the patch is a bit ugly;
> should we add a dirty list for each superblock in bdi_writeback?
>
> A test on a two-partition disk with the attached fio script shows about a
> 3% ~ 6% improvement.

A side note: given the noticeable performance gain, I wonder if it's
worth generalizing the idea to whole-disk, location-ordered writeback.
That should benefit many small-file workloads by more than 10%,
because this patch only sorts inodes across 2 partitions within a 5s
time window, while the patch below roughly divides the disk into 5
areas and sorts inodes in a larger 25s time window.

http://lkml.org/lkml/2007/8/27/45

Judging from this old patch, the complexity cost would be about 250
lines of code (it needs an rbtree).

Thanks,
Fengguang

> Signed-off-by: Shaohua Li <[email protected]>
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 8e1e5e1..fc87730 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -324,13 +324,29 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> struct list_head *dispatch_queue,
> unsigned long *older_than_this)
> {
> + LIST_HEAD(tmp);
> + struct list_head *pos, *node;
> + struct super_block *sb;
> + struct inode *inode;
> +
> while (!list_empty(delaying_queue)) {
> - struct inode *inode = list_entry(delaying_queue->prev,
> - struct inode, i_list);
> + inode = list_entry(delaying_queue->prev, struct inode, i_list);
> if (older_than_this &&
> inode_dirtied_after(inode, *older_than_this))
> break;
> - list_move(&inode->i_list, dispatch_queue);
> + list_move(&inode->i_list, &tmp);
> + }
> +
> +	/* Move inodes from one superblock together */
> + while (!list_empty(&tmp)) {
> + inode = list_entry(tmp.prev, struct inode, i_list);
> + sb = inode->i_sb;
> + list_for_each_prev_safe(pos, node, &tmp) {
> + struct inode *inode = list_entry(pos,
> + struct inode, i_list);
> + if (inode->i_sb == sb)
> + list_move(&inode->i_list, dispatch_queue);
> + }
> }
> }
>
>


2009-09-24 12:35:21

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24 2009, Wu Fengguang wrote:
> On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
> > has several partitions, writeback might keep the spindle moving between
> > partitions. To reduce that movement, it is better to write a big chunk of one
> > partition and then move to another. Inodes from one fs are usually in one
> > partition, so ideally moving inodes from one fs together should reduce spindle
> > movement. This patch tries to address this. Before per-bdi writeback was added,
> > the behavior was to write inodes from one fs first and then another, so the
> > patch restores the previous behavior. The loop in the patch is a bit ugly;
> > should we add a dirty list for each superblock in bdi_writeback?
> >
> > A test on a two-partition disk with the attached fio script shows about a
> > 3% ~ 6% improvement.
>
> A side note: given the noticeable performance gain, I wonder if it
> deserves to generalize the idea to do whole disk location ordered
> writeback. That should benefit many small file workloads more than
> 10%. Because this patch only sorted 2 partitions and inodes in 5s
> time window, while the below patch will roughly divide the disk into
> 5 areas and sort inodes in a larger 25s time window.
>
> http://lkml.org/lkml/2007/8/27/45
>
> Judging from this old patch, the complexity cost would be about 250
> lines of code (need a rbtree).

First of all, nice patch, I'll add it to the current tree. I too was
pondering using an rbtree for sb+dirty_time insertion and extraction.
But for 100 inodes or less, I bet that just doing the re-sort at
writeback time ends up being cheaper on the CPU cycle side.

--
Jens Axboe

2009-09-24 13:17:37

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

> > Could just reuse inode.
> oops, forgot to remove it when moveing inode to global.

This is the one I merged. I added the below as well; we don't need to
re-iterate the list if we only moved inodes from a single super block.

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b27406d..225c731 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -336,17 +336,27 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
-	struct super_block *sb;
+	struct super_block *sb = NULL;
 	struct inode *inode;
+	int do_sb_sort = 0;
 
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
+		if (sb && sb != inode->i_sb)
+			do_sb_sort = 1;
+		sb = inode->i_sb;
 		list_move(&inode->i_list, &tmp);
 	}
 
+	/* just one sb in list, splice to dispatch_queue and we're done */
+	if (!do_sb_sort) {
+		list_splice(&tmp, dispatch_queue);
+		return;
+	}
+
 	/* Move inodes from one superblock together */
 	while (!list_empty(&tmp)) {
 		inode = list_entry(tmp.prev, struct inode, i_list);
--
Jens Axboe

2009-09-24 13:23:06

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 08:35:19PM +0800, Jens Axboe wrote:
> On Thu, Sep 24 2009, Wu Fengguang wrote:
> > On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > > __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
> > > has several partitions, writeback might keep the spindle moving between
> > > partitions. To reduce that movement, it is better to write a big chunk of one
> > > partition and then move to another. Inodes from one fs are usually in one
> > > partition, so ideally moving inodes from one fs together should reduce spindle
> > > movement. This patch tries to address this. Before per-bdi writeback was added,
> > > the behavior was to write inodes from one fs first and then another, so the
> > > patch restores the previous behavior. The loop in the patch is a bit ugly;
> > > should we add a dirty list for each superblock in bdi_writeback?
> > >
> > > A test on a two-partition disk with the attached fio script shows about a
> > > 3% ~ 6% improvement.
> >
> > A side note: given the noticeable performance gain, I wonder if it
> > deserves to generalize the idea to do whole disk location ordered
> > writeback. That should benefit many small file workloads more than
> > 10%. Because this patch only sorted 2 partitions and inodes in 5s
> > time window, while the below patch will roughly divide the disk into
> > 5 areas and sort inodes in a larger 25s time window.
> >
> > http://lkml.org/lkml/2007/8/27/45
> >
> > Judging from this old patch, the complexity cost would be about 250
> > lines of code (need a rbtree).
>
> First of all, nice patch, I'll add it to the current tree. I too was

You mean Shaohua's patch? It should be a good addition for 2.6.32.

In the long term move_expired_inodes() needs some rework, because it
could be time-consuming to move around all the inodes on a large
system, and thus hold inode_lock for too long (and this patch scales
up the locked time).

So we would need to split the list moves into smaller pieces in the
future, or change the data structure.

> pondering using an rbtree for sb+dirty_time insertion and extraction.

FYI, Michael Rubin did some work on an rbtree implementation, just
in case you are interested:

http://lkml.org/lkml/2008/1/15/25

> But for 100 inodes or less, I bet that just doing the re-sort in
> writeback time ends up being cheaper on the CPU cycle side.

Yeah.

Thanks,
Fengguang

2009-09-24 13:29:54

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 09:17:39PM +0800, Jens Axboe wrote:
> > > Could just reuse inode.
> > oops, forgot to remove it when moveing inode to global.
>
> This is the one I merged. I added the below as well; we don't need to
> re-iterate the list if we only moved inodes from a single super block.

Looks good to me.

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index b27406d..225c731 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -336,17 +336,27 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> {
> LIST_HEAD(tmp);
> struct list_head *pos, *node;
> - struct super_block *sb;
> + struct super_block *sb = NULL;
> struct inode *inode;
> + int do_sb_sort = 0;
>
> while (!list_empty(delaying_queue)) {
> inode = list_entry(delaying_queue->prev, struct inode, i_list);
> if (older_than_this &&
> inode_dirtied_after(inode, *older_than_this))
> break;
> + if (sb && sb != inode->i_sb)
> + do_sb_sort = 1;
> + sb = inode->i_sb;
> list_move(&inode->i_list, &tmp);
> }
>
> + /* just one sb in list, splice to dispatch_queue and we're done */
> + if (!do_sb_sort) {
> + list_splice(&tmp, dispatch_queue);
> + return;
> + }
> +
> /* Move inodes from one superblock together */
> while (!list_empty(&tmp)) {
> inode = list_entry(tmp.prev, struct inode, i_list);
>
> --
> Jens Axboe

2009-09-24 13:29:48

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24 2009, Wu Fengguang wrote:
> On Thu, Sep 24, 2009 at 08:35:19PM +0800, Jens Axboe wrote:
> > On Thu, Sep 24 2009, Wu Fengguang wrote:
> > > On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > > > __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
> > > > has several partitions, writeback might keep the spindle moving between
> > > > partitions. To reduce that movement, it is better to write a big chunk of one
> > > > partition and then move to another. Inodes from one fs are usually in one
> > > > partition, so ideally moving inodes from one fs together should reduce spindle
> > > > movement. This patch tries to address this. Before per-bdi writeback was added,
> > > > the behavior was to write inodes from one fs first and then another, so the
> > > > patch restores the previous behavior. The loop in the patch is a bit ugly;
> > > > should we add a dirty list for each superblock in bdi_writeback?
> > > >
> > > > A test on a two-partition disk with the attached fio script shows about a
> > > > 3% ~ 6% improvement.
> > >
> > > A side note: given the noticeable performance gain, I wonder if it
> > > deserves to generalize the idea to do whole disk location ordered
> > > writeback. That should benefit many small file workloads more than
> > > 10%. Because this patch only sorted 2 partitions and inodes in 5s
> > > time window, while the below patch will roughly divide the disk into
> > > 5 areas and sort inodes in a larger 25s time window.
> > >
> > > http://lkml.org/lkml/2007/8/27/45
> > >
> > > Judging from this old patch, the complexity cost would be about 250
> > > lines of code (need a rbtree).
> >
> > First of all, nice patch, I'll add it to the current tree. I too was
>
> You mean Shaohua's patch? It should be a good addition for 2.6.32.

Yes indeed, the parent patch.

> In long term move_expired_inodes() needs some rework. Because it
> could be time consuming to move around all the inodes in a large
> system, and thus hold inode_lock() for too long time (and this patch
> scales up the locked time).

It does. As mentioned in my reply, for 100 inodes or less, it will still
be faster than e.g. using an rbtree. But the more "reliable" runtime of
an rbtree-based solution is appealing. It's not hugely critical, though.

> So would need to split the list moves into smaller pieces in future,
> or to change data structure.

Yes, those are the two options.

> > pondering using an rbtree for sb+dirty_time insertion and extraction.
>
> FYI Michael Rubin did some work on a rbtree implementation, just
> in case you are interested:
>
> http://lkml.org/lkml/2008/1/15/25

Thanks, I'll take a look.

--
Jens Axboe

2009-09-24 13:46:35

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 09:29:50PM +0800, Jens Axboe wrote:
> On Thu, Sep 24 2009, Wu Fengguang wrote:
> > On Thu, Sep 24, 2009 at 08:35:19PM +0800, Jens Axboe wrote:
> > > On Thu, Sep 24 2009, Wu Fengguang wrote:
> > > > On Thu, Sep 24, 2009 at 02:54:20PM +0800, Li, Shaohua wrote:
> > > > > __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk
> > > > > has several partitions, writeback might keep the spindle moving between
> > > > > partitions. To reduce that movement, it is better to write a big chunk of one
> > > > > partition and then move to another. Inodes from one fs are usually in one
> > > > > partition, so ideally moving inodes from one fs together should reduce spindle
> > > > > movement. This patch tries to address this. Before per-bdi writeback was added,
> > > > > the behavior was to write inodes from one fs first and then another, so the
> > > > > patch restores the previous behavior. The loop in the patch is a bit ugly;
> > > > > should we add a dirty list for each superblock in bdi_writeback?
> > > > >
> > > > > A test on a two-partition disk with the attached fio script shows about a
> > > > > 3% ~ 6% improvement.
> > > >
> > > > A side note: given the noticeable performance gain, I wonder if it
> > > > deserves to generalize the idea to do whole disk location ordered
> > > > writeback. That should benefit many small file workloads more than
> > > > 10%. Because this patch only sorted 2 partitions and inodes in 5s
> > > > time window, while the below patch will roughly divide the disk into
> > > > 5 areas and sort inodes in a larger 25s time window.
> > > >
> > > > http://lkml.org/lkml/2007/8/27/45
> > > >
> > > > Judging from this old patch, the complexity cost would be about 250
> > > > lines of code (need a rbtree).
> > >
> > > First of all, nice patch, I'll add it to the current tree. I too was
> >
> > You mean Shaohua's patch? It should be a good addition for 2.6.32.
>
> Yes indeed, the parent patch.
>
> > In long term move_expired_inodes() needs some rework. Because it
> > could be time consuming to move around all the inodes in a large
> > system, and thus hold inode_lock() for too long time (and this patch
> > scales up the locked time).
>
> It does. As mentioned in my reply, for 100 inodes or less, it will still
> be faster than eg using an rbtree. But the more "reliable" runtime of an
> rbtree based solution is appealing, though. It's not hugely critical,
> though.

Agreed. Desktops are not big worries; servers rarely do many
partitions per disk.

> > So would need to split the list moves into smaller pieces in future,
> > or to change data structure.
>
> Yes, those are the two options.
>
> > > pondering using an rbtree for sb+dirty_time insertion and extraction.

Note that dirty_time may not be unique, so that needs some workaround.
And the resulting rbtree implementation may not be more efficient than
several list traversals, even for a very large list (as long as the
number of superblocks is low).

The good side is, once the sb+dirty_time rbtree is implemented, it
should be trivial to switch the key to sb+inode_number (which also may
not be unique), and to do location-ordered writeback ;)

Thanks,
Fengguang

> > FYI Michael Rubin did some work on a rbtree implementation, just
> > in case you are interested:
> >
> > http://lkml.org/lkml/2008/1/15/25
>
> Thanks, I'll take a look.
>
> --
> Jens Axboe

2009-09-24 13:52:06

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, 24 Sep 2009 21:46:25 +0800
Wu Fengguang <[email protected]> wrote:
>
> Note that dirty_time may not be unique, so need some workaround. And
> the resulted rbtree implementation may not be more efficient than
> several list traversals even for a very large list (as long as
> superblocks numbers are low).
>
> The good side is, once sb+dirty_time rbtree is implemented, it should
> be trivial to switch the key to sb+inode_number (also may not be
> unique), and to do location ordered writeback ;)

would you want to sort by dirty time, or by inode number?
(assuming inode number is loosely related to location on disk)


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-24 14:09:29

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 09:52:17PM +0800, Arjan van de Ven wrote:
> On Thu, 24 Sep 2009 21:46:25 +0800
> Wu Fengguang <[email protected]> wrote:
> >
> > Note that dirty_time may not be unique, so some workaround is needed. And
> > the resulting rbtree implementation may not be more efficient than
> > several list traversals, even for a very large list (as long as
> > the number of superblocks is low).
> >
> > The good side is, once an sb+dirty_time rbtree is implemented, it should
> > be trivial to switch the key to sb+inode_number (which also may not be
> > unique) and do location-ordered writeback ;)
>
> would you want to sort by dirty time, or by inode number?
> (assuming inode number is loosely related to location on disk)

Sort by inode number; dirty time will also be considered when judging
whether the traversed inode is old enough(*) to be eligible for writeback.

(*) this "old enough" criterion has to be relaxed considerably, from the
original >30s to >5s. The promise to the user would change from

"writeback of a dirtied inode will start _around_ 30s after dirtying"

to

"writeback of a dirtied inode will start _within_ 30s of dirtying"



The more detailed algorithm would be:

- put inodes into an rbtree keyed by sb+inode_number
- in each per-5s writeback pass, traverse 1/5 of the rbtree's key range
- in each traversal, sync inodes that were dirtied more than 5s ago

So the user-visible result would be
- every 5s, roughly 1/5 of the disk area will be visited
- each dirtied inode will be synced 5-30s after it was dirtied


Thanks,
Fengguang

2009-09-25 04:16:33

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 10:09:19PM +0800, Wu Fengguang wrote:
> On Thu, Sep 24, 2009 at 09:52:17PM +0800, Arjan van de Ven wrote:
> > On Thu, 24 Sep 2009 21:46:25 +0800
> > Wu Fengguang <[email protected]> wrote:
> > >
> > > Note that dirty_time may not be unique, so some workaround is needed. And
> > > the resulting rbtree implementation may not be more efficient than
> > > several list traversals, even for a very large list (as long as
> > > the number of superblocks is low).
> > >
> > > The good side is, once an sb+dirty_time rbtree is implemented, it should
> > > be trivial to switch the key to sb+inode_number (which also may not be
> > > unique) and do location-ordered writeback ;)
> >
> > would you want to sort by dirty time, or by inode number?
> > (assuming inode number is loosely related to location on disk)
>
> Sort by inode number; dirty time will also be considered when judging
> whether the traversed inode is old enough(*) to be eligible for writeback.

Even if the inode number is directly related to location on disk
(like for XFS), there is no guarantee that the data or related
metadata (indirect blocks) writeback location is in any way related
to the inode number. E.g. when using the 32-bit allocator on XFS
(default for > 1TB filesystems), there is _zero correlation_ between
the inode number and the data location. Hence writeback by inode
number will not improve writeback patterns at all.

Only the filesystem knows what the best writeback pattern really is;
any change is going to affect filesystems differently.

> The more detailed algorithm would be:
>
> - put inodes into an rbtree keyed by sb+inode_number
> - in each per-5s writeback pass, traverse 1/5 of the rbtree's key range
> - in each traversal, sync inodes that were dirtied more than 5s ago
>
> So the user-visible result would be
> - every 5s, roughly 1/5 of the disk area will be visited
> - each dirtied inode will be synced 5-30s after it was dirtied

Personally, I'd prefer that writeback calls a vector that says
"writeback inodes older than N" and implement something like the
above as the generic mechanism. That way filesystems can override
the generic algorithm if there is a better way to track and write
back dirty inodes for that filesystem.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-09-25 05:09:27

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Fri, Sep 25, 2009 at 12:16:19PM +0800, Dave Chinner wrote:
> On Thu, Sep 24, 2009 at 10:09:19PM +0800, Wu Fengguang wrote:
> > On Thu, Sep 24, 2009 at 09:52:17PM +0800, Arjan van de Ven wrote:
> > > On Thu, 24 Sep 2009 21:46:25 +0800
> > > Wu Fengguang <[email protected]> wrote:
> > > >
> > > > Note that dirty_time may not be unique, so some workaround is needed. And
> > > > the resulting rbtree implementation may not be more efficient than
> > > > several list traversals, even for a very large list (as long as
> > > > the number of superblocks is low).
> > > >
> > > > The good side is, once an sb+dirty_time rbtree is implemented, it should
> > > > be trivial to switch the key to sb+inode_number (which also may not be
> > > > unique) and do location-ordered writeback ;)
> > >
> > > would you want to sort by dirty time, or by inode number?
> > > (assuming inode number is loosely related to location on disk)
> >
> > Sort by inode number; dirty time will also be considered when judging
> > whether the traversed inode is old enough(*) to be eligible for writeback.
>
> Even if the inode number is directly related to location on disk
> (like for XFS), there is no guarantee that the data or related
> metadata (indirect blocks) writeback location is in any way related
> to the inode number. E.g. when using the 32-bit allocator on XFS
> (default for > 1TB filesystems), there is _zero correlation_ between
> the inode number and the data location. Hence writeback by inode
> number will not improve writeback patterns at all.

The location ordering is mainly an optimization for _small files_,
so no indirect blocks are involved. A good filesystem will put metadata
and data as close together as possible for small files. Is that true for XFS?

> Only the filesystem knows what the best writeback pattern really is;
> any change is going to affect filesystems differently.
>
> > The more detailed algorithm would be:
> >
> > - put inodes into an rbtree keyed by sb+inode_number
> > - in each per-5s writeback pass, traverse 1/5 of the rbtree's key range
> > - in each traversal, sync inodes that were dirtied more than 5s ago
> >
> > So the user-visible result would be
> > - every 5s, roughly 1/5 of the disk area will be visited
> > - each dirtied inode will be synced 5-30s after it was dirtied
>
> Personally, I'd prefer that writeback calls a vector that says
> "writeback inodes older than N" and implement something like the
> above as the generic mechanism. That way filesystems can override
> the generic algorithm if there is a better way to track and write
> back dirty inodes for that filesystem.

We have wbc->older_than_this. Is it good enough for XFS?

Thanks,
Fengguang