2012-06-12 11:46:20

by Wanpeng Li

Subject: [PATCH v2] writeback: avoid race when update bandwidth

From: Wanpeng Li <[email protected]>

"V1 -> V2"
* remove dirty_lock

Since bdi->wb.list_lock is used to protect the b_* lists,
flushers that call wb_writeback() to write back pages will
get stuck whenever the bandwidth update path holds this lock.
To avoid this race, introduce a new bandwidth_lock that is
dedicated to protecting the bandwidth update path.

Signed-off-by: Wanpeng Li <[email protected]>

---
mm/page-writeback.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c833bf0..e28d36e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -815,7 +815,6 @@ static void global_update_bandwidth(unsigned long thresh,
unsigned long dirty,
unsigned long now)
{
- static DEFINE_SPINLOCK(dirty_lock);
static unsigned long update_time;

/*
@@ -824,12 +823,10 @@ static void global_update_bandwidth(unsigned long thresh,
if (time_before(now, update_time + BANDWIDTH_INTERVAL))
return;

- spin_lock(&dirty_lock);
if (time_after_eq(now, update_time + BANDWIDTH_INTERVAL)) {
update_dirty_limit(thresh, dirty);
update_time = now;
}
- spin_unlock(&dirty_lock);
}

/*
@@ -1032,12 +1029,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long bdi_dirty,
unsigned long start_time)
{
+ static DEFINE_SPINLOCK(bandwidth_lock);
+
if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
return;
- spin_lock(&bdi->wb.list_lock);
+ spin_lock(&bandwidth_lock);
__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
bdi_thresh, bdi_dirty, start_time);
- spin_unlock(&bdi->wb.list_lock);
+ spin_unlock(&bandwidth_lock);
}

/*
--
1.7.9.5


2012-06-12 11:52:35

by Fengguang Wu

Subject: Re: [PATCH v2] writeback: avoid race when update bandwidth

On Tue, Jun 12, 2012 at 07:46:01PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> "V1 -> V2"
> * remove dirty_lock
>
> Since bdi->wb.list_lock is used to protect the b_* lists,
> flushers that call wb_writeback() to write back pages will
> get stuck whenever the bandwidth update path holds this lock.
> To avoid this race, introduce a new bandwidth_lock that is
> dedicated to protecting the bandwidth update path.
>
> Signed-off-by: Wanpeng Li <[email protected]>

Applied with a new title "writeback: use a standalone lock for
updating write bandwidth". "race" is sensitive because it often
refers to some locking error.

Thank you!

Fengguang

2012-06-12 11:58:32

by Wanpeng Li

Subject: Re: [PATCH v2] writeback: avoid race when update bandwidth

On Tue, Jun 12, 2012 at 07:52:19PM +0800, Fengguang Wu wrote:
>On Tue, Jun 12, 2012 at 07:46:01PM +0800, Wanpeng Li wrote:
>> From: Wanpeng Li <[email protected]>
>>
>> "V1 -> V2"
>> * remove dirty_lock
>>
>> Since bdi->wb.list_lock is used to protect the b_* lists,
>> flushers that call wb_writeback() to write back pages will
>> get stuck whenever the bandwidth update path holds this lock.
>> To avoid this race, introduce a new bandwidth_lock that is
>> dedicated to protecting the bandwidth update path.
>>
>> Signed-off-by: Wanpeng Li <[email protected]>
>
>Applied with a new title "writeback: use a standalone lock for
>updating write bandwidth". "race" is sensitive because it often
>refers to some locking error.

OK, Thanks a lot.

Regards,
Wanpeng Li

2012-06-13 03:59:26

by Dave Chinner

Subject: Re: [PATCH v2] writeback: avoid race when update bandwidth

On Tue, Jun 12, 2012 at 07:52:19PM +0800, Fengguang Wu wrote:
> On Tue, Jun 12, 2012 at 07:46:01PM +0800, Wanpeng Li wrote:
> > From: Wanpeng Li <[email protected]>
> >
> > "V1 -> V2"
> > * remove dirty_lock
> >
> > Since bdi->wb.list_lock is used to protect the b_* lists,
> > flushers that call wb_writeback() to write back pages will
> > get stuck whenever the bandwidth update path holds this lock.
> > To avoid this race, introduce a new bandwidth_lock that is
> > dedicated to protecting the bandwidth update path.
> >
> > Signed-off-by: Wanpeng Li <[email protected]>
>
> Applied with a new title "writeback: use a standalone lock for
> updating write bandwidth". "race" is sensitive because it often
> refers to some locking error.

Fengguang - can we get some evidence that this is a contended lock
before changing the scope of it? All of the previous "breaking up
global locks" have been done based on lock contention data, so
moving back to a global lock for this needs to have the same
analysis provided...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-13 12:14:41

by Fengguang Wu

Subject: Re: [PATCH v2] writeback: avoid race when update bandwidth

On Wed, Jun 13, 2012 at 01:59:20PM +1000, Dave Chinner wrote:
> On Tue, Jun 12, 2012 at 07:52:19PM +0800, Fengguang Wu wrote:
> > On Tue, Jun 12, 2012 at 07:46:01PM +0800, Wanpeng Li wrote:
> > > From: Wanpeng Li <[email protected]>
> > >
> > > "V1 -> V2"
> > > * remove dirty_lock
> > >
> > > Since bdi->wb.list_lock is used to protect the b_* lists,
> > > flushers that call wb_writeback() to write back pages will
> > > get stuck whenever the bandwidth update path holds this lock.
> > > To avoid this race, introduce a new bandwidth_lock that is
> > > dedicated to protecting the bandwidth update path.
> > >
> > > Signed-off-by: Wanpeng Li <[email protected]>
> >
> > Applied with a new title "writeback: use a standalone lock for
> > updating write bandwidth". "race" is sensitive because it often
> > refers to some locking error.
>
> Fengguang - can we get some evidence that this is a contended lock
> before changing the scope of it? All of the previous "breaking up
> global locks" have been done based on lock contention data, so
> moving back to a global lock for this needs to have the same
> analysis provided...

Good point. Attached is the lockstat for the case "10 disks each runs
100 dd dirtier tasks":

lkp-ne02/JBOD-10HDD-thresh=4G/xfs-100dd-1-3.2.0-rc5

The wb->list_lock contention is much better than I expected, which is
good. What stands out are the following (waittime-total):
- &rq->lock by double_rq_lock() 6738952.13
- clockevents_lock by clockevents_notify() 2155554.37
- mapping->tree_lock by test_clear_page_writeback() 931550.13
- sb_lock by grab_super_passive() 918815.87
- &zone->lru_lock by pagevec_lru_move_fn() 912681.05

- sysfs_mutex by sysfs_permission() 24029975.20 # mutex
- ip->i_lock by xfs_ilock() 18428284.10 # mrlock

Thanks,
Fengguang

2012-06-14 02:06:05

by Dave Chinner

Subject: Re: [PATCH v2] writeback: avoid race when update bandwidth

On Wed, Jun 13, 2012 at 08:14:34PM +0800, Fengguang Wu wrote:
> On Wed, Jun 13, 2012 at 01:59:20PM +1000, Dave Chinner wrote:
> > On Tue, Jun 12, 2012 at 07:52:19PM +0800, Fengguang Wu wrote:
> > > On Tue, Jun 12, 2012 at 07:46:01PM +0800, Wanpeng Li wrote:
> > > > From: Wanpeng Li <[email protected]>
> > > >
> > > > "V1 -> V2"
> > > > * remove dirty_lock
> > > >
> > > > Since bdi->wb.list_lock is used to protect the b_* lists,
> > > > flushers that call wb_writeback() to write back pages will
> > > > get stuck whenever the bandwidth update path holds this lock.
> > > > To avoid this race, introduce a new bandwidth_lock that is
> > > > dedicated to protecting the bandwidth update path.
> > > >
> > > > Signed-off-by: Wanpeng Li <[email protected]>
> > >
> > > Applied with a new title "writeback: use a standalone lock for
> > > updating write bandwidth". "race" is sensitive because it often
> > > refers to some locking error.
> >
> > Fengguang - can we get some evidence that this is a contended lock
> > before changing the scope of it? All of the previous "breaking up
> > global locks" have been done based on lock contention data, so
> > moving back to a global lock for this needs to have the same
> > analysis provided...
>
> Good point. Attached is the lockstat for the case "10 disks each runs
> 100 dd dirtier tasks":
>
> lkp-ne02/JBOD-10HDD-thresh=4G/xfs-100dd-1-3.2.0-rc5

(nothing attached)

> The wb->list_lock contention is much better than I expected, which is
> good. What stands out are the following (waittime-total):
> - &rq->lock by double_rq_lock() 6738952.13
> - clockevents_lock by clockevents_notify() 2155554.37
> - mapping->tree_lock by test_clear_page_writeback() 931550.13
> - sb_lock by grab_super_passive() 918815.87
> - &zone->lru_lock by pagevec_lru_move_fn() 912681.05
>
> - sysfs_mutex by sysfs_permission() 24029975.20 # mutex
> - ip->i_lock by xfs_ilock() 18428284.10 # mrlock

The wait time is not really an indication of contention problems.
Large wait time is usually an indication that the lock is being used
a lot.

What matters is the number of contentions vs the number of
acquisitions, and the number of those contentions that bounced the
lock. If the number of contentions is >= 0.5% of the acquisitions,
then the lock can be considered hot and needing some work. If I look
here:

http://lists.linux.hp.com/~enw/ext4/3.2/3.2-full-lockstats.2/ffsb_fsscale.xfs.large_file_creates_threads=192/profiling/iteration.1/lock_stat

Which is a 192 thread concurrent write on a 48-core machine, the
wb.list_lock shows 5,532 acquisitions for the entire test, while the
mapping tree lock took 440 million! So your test isn't really one
that shows wb.list_lock contention. The 192-thread mailserver
workload from the same machine:

http://lists.linux.hp.com/~enw/ext4/3.2/3.2-full-lockstats.2/ffsb_fsscale.xfs.mail_server_threads=192/profiling/iteration.1/lock_stat

Shows about 7.1m acquisitions of the wb.list_lock, but only 28,000
contentions. So it isn't really contended enough to justify
replacing it with a global lock.

FWIW, the third most contended lock on that workload is the XFS
delayed write queue lock - 25M acquisitions for 600k contentions - a
rate of about 2% which means quite severe contention. That lock no
longer exists in 3.5 - Christoph completely reworked the delayed
write buffer support to remove the global list and lock because it
was showing up in profiles like this...

Indeed, that profile shows that XFS owns 7 of the 10 most contended
locks, and 3 of them have had significant work done to reduce the
contention since 3.2 as a result of recent profile results like this.
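For reference, the 0.5% rule of thumb above can be checked against these
numbers in a few lines (figures rounded from the quoted lock_stat
reports; the 0.5% threshold is Dave's heuristic, not a kernel-defined
constant):

```python
# Contention rate = contentions / acquisitions; >= 0.5% suggests a hot lock.
def contention_rate(contentions, acquisitions):
    return contentions / acquisitions

HOT = 0.005  # the 0.5% rule of thumb

# mail_server workload: ~7.1M acquisitions, ~28k contentions
wb_rate = contention_rate(28_000, 7_100_000)
# XFS delayed write queue: ~25M acquisitions, ~600k contentions
xfs_rate = contention_rate(600_000, 25_000_000)

print(f"wb.list_lock:     {wb_rate:.2%}  hot: {wb_rate >= HOT}")
print(f"xfs delwri queue: {xfs_rate:.2%}  hot: {xfs_rate >= HOT}")
# wb.list_lock:     0.39%  hot: False
# xfs delwri queue: 2.40%  hot: True
```

Which matches the conclusion: the wb.list_lock sits well under the
threshold, while the delayed write queue lock is far over it.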

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-14 14:00:19

by Fengguang Wu

Subject: Re: [PATCH v2] writeback: avoid race when update bandwidth

On Thu, Jun 14, 2012 at 12:05:59PM +1000, Dave Chinner wrote:
> On Wed, Jun 13, 2012 at 08:14:34PM +0800, Fengguang Wu wrote:
> > On Wed, Jun 13, 2012 at 01:59:20PM +1000, Dave Chinner wrote:
> > > On Tue, Jun 12, 2012 at 07:52:19PM +0800, Fengguang Wu wrote:
> > > > On Tue, Jun 12, 2012 at 07:46:01PM +0800, Wanpeng Li wrote:
> > > > > From: Wanpeng Li <[email protected]>
> > > > >
> > > > > "V1 -> V2"
> > > > > * remove dirty_lock
> > > > >
> > > > > Since bdi->wb.list_lock is used to protect the b_* lists,
> > > > > flushers that call wb_writeback() to write back pages will
> > > > > get stuck whenever the bandwidth update path holds this lock.
> > > > > To avoid this race, introduce a new bandwidth_lock that is
> > > > > dedicated to protecting the bandwidth update path.
> > > > >
> > > > > Signed-off-by: Wanpeng Li <[email protected]>
> > > >
> > > > Applied with a new title "writeback: use a standalone lock for
> > > > updating write bandwidth". "race" is sensitive because it often
> > > > refers to some locking error.
> > >
> > > Fengguang - can we get some evidence that this is a contended lock
> > > before changing the scope of it? All of the previous "breaking up
> > > global locks" have been done based on lock contention data, so
> > > moving back to a global lock for this needs to have the same
> > > analysis provided...
> >
> > Good point. Attached is the lockstat for the case "10 disks each runs
> > 100 dd dirtier tasks":
> >
> > lkp-ne02/JBOD-10HDD-thresh=4G/xfs-100dd-1-3.2.0-rc5
>
> (nothing attached)
>
> > The wb->list_lock contention is much better than I expected, which is
> > good. What stands out are the following (waittime-total):
> > - &rq->lock by double_rq_lock() 6738952.13
> > - clockevents_lock by clockevents_notify() 2155554.37
> > - mapping->tree_lock by test_clear_page_writeback() 931550.13
> > - sb_lock by grab_super_passive() 918815.87
> > - &zone->lru_lock by pagevec_lru_move_fn() 912681.05
> >
> > - sysfs_mutex by sysfs_permission() 24029975.20 # mutex
> > - ip->i_lock by xfs_ilock() 18428284.10 # mrlock
>
> The wait time is not really an indication of contention problems.
> Large wait time is usually an indication that the lock is being used
> a lot.

Right.

> What matters is the number of contentions vs the number of
> acquisitions, and the number of those contentions that bounced the
> lock. If the number of contentions is >= 0.5% of the acquisitions,
> then the lock can be considered hot and needing some work. If I look
> here:

I wonder if anyone has a simple script for sorting lock_stat output
based on that (and perhaps other selectable) criterion? It should be
possible to write one myself, but still... ;-)

Default lock_stat output is sorted by absolute number of contentions.
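A minimal sketch of such a script might look like this (the field
positions are assumed from the lock_stat summary-line layout: class
name, con-bounces, contentions, three waittime columns, acq-bounces,
acquisitions, three holdtime columns; they may need adjusting per
kernel version, and the sample lines below are made up for
illustration):

```python
# Rank lock_stat entries by contention rate (contentions / acquisitions)
# instead of by absolute contention count.

def parse_lock_stat(text):
    """Yield (name, contentions, acquisitions) for each summary line."""
    for line in text.splitlines():
        name, sep, rest = line.partition(':')
        fields = rest.split()
        if not sep or len(fields) < 7:
            continue
        try:
            contentions = int(float(fields[1]))   # assumed column layout
            acquisitions = int(float(fields[6]))
        except ValueError:
            continue  # skip headers and separator lines
        if acquisitions:
            yield name.strip(), contentions, acquisitions

def rank_by_rate(text, threshold=0.005):
    """Sort by contention rate, flagging locks above the 0.5% threshold."""
    rows = sorted(((c / a, n) for n, c, a in parse_lock_stat(text)),
                  reverse=True)
    return [(n, rate, rate >= threshold) for rate, n in rows]

# Made-up sample lines in lock_stat's summary format:
sample = """\
&wb->list_lock:   30000   28000  0.5  100.0  5000.0   40000  7100000  0.1  50.0  90000.0
xfs_delwri_lock: 700000  600000  0.4  200.0  9000.0  800000 25000000  0.1  60.0  80000.0
"""
for name, rate, hot in rank_by_rate(sample):
    print(f"{name:16} {rate:.2%}  hot: {hot}")
```

Feeding it a real /proc/lock_stat dump instead of the sample would give
the ratio-sorted view asked about above.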

> http://lists.linux.hp.com/~enw/ext4/3.2/3.2-full-lockstats.2/ffsb_fsscale.xfs.large_file_creates_threads=192/profiling/iteration.1/lock_stat
>
> Which is a 192 thread concurrent write on a 48-core machine, the
> wb.list_lock shows 5,532 acquisitions for the entire test, while the
> mapping tree lock took 440 million! So your test isn't really one
> that shows wb.list_lock contention. The 192-thread mailserver
> workload from the same machine:
>
> http://lists.linux.hp.com/~enw/ext4/3.2/3.2-full-lockstats.2/ffsb_fsscale.xfs.mail_server_threads=192/profiling/iteration.1/lock_stat
>
> Shows about 7.1m acquisitions of the wb.list_lock, but only 28,000
> contentions. So it isn't really contended enough to justify
> replacing it with a global lock.

Right.

> FWIW, the third most contended lock on that workload is the XFS
> delayed write queue lock - 25M acquisitions for 600k contentions - a
> rate of about 2% which means quite severe contention. That lock no
> longer exists in 3.5 - Christoph completely reworked the delayed
> write buffer support to remove the global list and lock because it
> was showing up in profiles like this...
>
> Indeed, that profile shows that XFS owns 7 of the 10 most contended
> locks, and 3 of them have had significant work done to reduce the
> contention since 3.2 as a result of recent profile results like this.

Nice work!

Thanks,
Fengguang

2012-06-15 00:06:40

by Dave Chinner

Subject: Re: [PATCH v2] writeback: avoid race when update bandwidth

On Thu, Jun 14, 2012 at 10:00:06PM +0800, Fengguang Wu wrote:
> On Thu, Jun 14, 2012 at 12:05:59PM +1000, Dave Chinner wrote:
> > On Wed, Jun 13, 2012 at 08:14:34PM +0800, Fengguang Wu wrote:
> I wonder if anyone has a simple script for sorting lock_stat output
> based on that (and perhaps other selectable) criterion? It should be
> possible to write one myself, but still... ;-)
>
> Default lock_stat output is sorted by absolute number of contentions.

Not that I know of. The default is pretty sane, because a highly
contended lock that is causing performance problems will always show
up near the top. If it's not in the top 10, then it's usually not
worth worrying about until you've dealt with those above it....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-15 00:30:06

by Fengguang Wu

Subject: Re: [PATCH v2] writeback: avoid race when update bandwidth

On Fri, Jun 15, 2012 at 10:06:33AM +1000, Dave Chinner wrote:
> On Thu, Jun 14, 2012 at 10:00:06PM +0800, Fengguang Wu wrote:
> > On Thu, Jun 14, 2012 at 12:05:59PM +1000, Dave Chinner wrote:
> > > On Wed, Jun 13, 2012 at 08:14:34PM +0800, Fengguang Wu wrote:
> > I wonder if anyone has a simple script for sorting lock_stat output
> > based on that (and perhaps other selectable) criterion? It should be
> > possible to write one myself, but still... ;-)
> >
> > Default lock_stat output is sorted by absolute number of contentions.
>
> Not that I know of. The default is pretty sane, because a highly
> contended lock that is causing performance problems will always show
> up near the top. If it's not in the top 10, then it's usually not
> worth worrying about until you've dealt with those above it....

Okay, thanks for the tip!

Thanks,
Fengguang