2007-09-28 06:50:34

by Andrew Morton

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Thu, 27 Sep 2007 23:32:36 -0700 "Chakri n" <[email protected]> wrote:

> Hi,
>
> In my testing, an unresponsive file system can hang all I/O in the system.
> This is not seen in 2.4.
>
> I started 20 threads doing I/O on a NFS share. They are just doing 4K
> writes in a loop.
>
> Now I stop the NFS server hosting the NFS share and start a
> "dd" process to write a file on the local EXT3 file system.
>
> # dd if=/dev/zero of=/tmp/x count=1000
>
> This process never progresses.

yup.

> There is plenty of HIGH MEMORY available in the system, but this
> process never progresses.
>
> ...
>
> The problem seems to be in balance_dirty_pages, which calculates
> dirty_thresh based on only ZONE_NORMAL. The same scenario works fine
> in 2.4: the dd process finishes in no time.
> NFS file systems can go offline for multiple reasons (a failed
> switch, filer, etc.), but that should not affect other file systems on
> the machine.
> Can this behavior be fenced? Can the buffer cache be tuned so that
> other processes do not see the effect?

It's unrelated to the actual value of dirty_thresh: if the machine fills up
with dirty (or unstable) NFS pages then eventually new writers will block
until that condition clears.

2.4 doesn't have this problem at low levels of dirty data because 2.4
VFS/MM doesn't account for NFS pages at all.

I'm not sure what we can do about this from a design perspective, really.
We have data floating about in memory which we're not allowed to discard
and if we allow it to increase without bound it will eventually either
wedge userspace _anyway_ or it will take the machine down, resulting in
data loss.

What would be nice is to write that data to local disk if
possible, then reclaim it. Perhaps David Howells' fscache code can do that (or
could be tweaked to do so).

If you really want to fill all memory with pages which are dirty against a
dead NFS server then you can manually increase
/proc/sys/vm/dirty_background_ratio and dirty_ratio - that should give you
the 2.4 behaviour.
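
For instance (the exact values are only illustrative; pick whatever suits
your workload):

# echo 60 > /proc/sys/vm/dirty_background_ratio
# echo 80 > /proc/sys/vm/dirty_ratio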


<thinks>

Actually we perhaps could address this at the VFS level in another way.
Processes which are writing to the dead NFS server will eventually block in
balance_dirty_pages() once they've exceeded the memory limits and will
remain blocked until the server wakes up - that's the behaviour we want.

What we _don't_ want to happen is for other processes which are writing to
other, non-dead devices to get collaterally blocked. We have patches which
might fix that queued for 2.6.24. Peter?


2007-09-28 13:28:54

by Jonathan Corbet

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

Andrew wrote:
> It's unrelated to the actual value of dirty_thresh: if the machine fills up
> with dirty (or unstable) NFS pages then eventually new writers will block
> until that condition clears.
>
> 2.4 doesn't have this problem at low levels of dirty data because 2.4
> VFS/MM doesn't account for NFS pages at all.

Is it really NFS-related? I was trying to back up my 2.6.23-rc8 system
to an external USB drive the other day when something flaked and the
drive fell off the bus. That, too, was sufficient to wedge the entire
system, even though the only thing which needed the dead drive was one
rsync process. It's kind of a bummer to have to hit the reset button
after the failure of (what should be) a non-critical piece of hardware.

Not that I have a fix to propose...:)

jon


2007-09-28 13:35:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 2007-09-28 at 07:28 -0600, Jonathan Corbet wrote:
> Andrew wrote:
> > It's unrelated to the actual value of dirty_thresh: if the machine fills up
> > with dirty (or unstable) NFS pages then eventually new writers will block
> > until that condition clears.
> >
> > 2.4 doesn't have this problem at low levels of dirty data because 2.4
> > VFS/MM doesn't account for NFS pages at all.
>
> Is it really NFS-related? I was trying to back up my 2.6.23-rc8 system
> to an external USB drive the other day when something flaked and the
> drive fell off the bus. That, too, was sufficient to wedge the entire
> system, even though the only thing which needed the dead drive was one
> rsync process. It's kind of a bummer to have to hit the reset button
> after the failure of (what should be) a non-critical piece of hardware.
>
> Not that I have a fix to propose...:)

the per bdi work in -mm should make the system not drop dead.

Still, would a remove/re-insert of the USB media end up with the same
bdi? That is, would it be recognised as the same and resume the transfer?

Anyway, it would be grand (and dangerous) if we could provide for a
button that would just kill off all outstanding pages against a dead
device.



2007-09-28 16:45:46

by Alan Stern

[permalink] [raw]
Subject: Re: [linux-pm] Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 28 Sep 2007, Peter Zijlstra wrote:

> On Fri, 2007-09-28 at 07:28 -0600, Jonathan Corbet wrote:

> > Is it really NFS-related? I was trying to back up my 2.6.23-rc8 system
> > to an external USB drive the other day when something flaked and the
> > drive fell off the bus. That, too, was sufficient to wedge the entire
> > system, even though the only thing which needed the dead drive was one
> > rsync process. It's kind of a bummer to have to hit the reset button
> > after the failure of (what should be) a non-critical piece of hardware.
> >
> > Not that I have a fix to propose...:)
>
> the per bdi work in -mm should make the system not drop dead.
>
> Still, would a remove,re-insert of the usb media end up with the same
> bdi? That is, would it recognise as the same and resume the transfer.

Removal and replacement of the media might work. I have never tried
it.

But Jon described removal of the device, not the media. Replacing the
device definitely will not work.

Alan Stern



2007-09-28 17:00:53

by Trond Myklebust

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:

> Actually we perhaps could address this at the VFS level in another way.
> Processes which are writing to the dead NFS server will eventually block in
> balance_dirty_pages() once they've exceeded the memory limits and will
> remain blocked until the server wakes up - that's the behaviour we want.
>
> What we _don't_ want to happen is for other processes which are writing to
> other, non-dead devices to get collaterally blocked. We have patches which
> might fix that queued for 2.6.24. Peter?

Do these patches also cause the memory reclaimers to steer clear of
devices that are congested (and stop waiting on a congested device if
they see that it remains congested for a long period of time)? Most of
the collateral blocking I see tends to happen in memory allocation...

Cheers
Trond

2007-09-28 18:04:45

by Andrew Morton

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 28 Sep 2007 07:28:52 -0600 [email protected] (Jonathan Corbet) wrote:

> Andrew wrote:
> > It's unrelated to the actual value of dirty_thresh: if the machine fills up
> > with dirty (or unstable) NFS pages then eventually new writers will block
> > until that condition clears.
> >
> > 2.4 doesn't have this problem at low levels of dirty data because 2.4
> > VFS/MM doesn't account for NFS pages at all.
>
> Is it really NFS-related? I was trying to back up my 2.6.23-rc8 system
> to an external USB drive the other day when something flaked and the
> drive fell off the bus. That, too, was sufficient to wedge the entire
> system, even though the only thing which needed the dead drive was one
> rsync process. It's kind of a bummer to have to hit the reset button
> after the failure of (what should be) a non-critical piece of hardware.
>
> Not that I have a fix to propose...:)
>

That's a USB bug, surely. What should happen is that the kernel attempts
writeback, gets an IO error and then your data gets lost.

2007-09-28 18:49:59

by Andrew Morton

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <[email protected]> wrote:

> On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
>
> > Actually we perhaps could address this at the VFS level in another way.
> > Processes which are writing to the dead NFS server will eventually block in
> > balance_dirty_pages() once they've exceeded the memory limits and will
> > remain blocked until the server wakes up - that's the behaviour we want.
> >
> > What we _don't_ want to happen is for other processes which are writing to
> > other, non-dead devices to get collaterally blocked. We have patches which
> > might fix that queued for 2.6.24. Peter?
>
> Do these patches also cause the memory reclaimers to steer clear of
> devices that are congested (and stop waiting on a congested device if
> they see that it remains congested for a long period of time)? Most of
> the collateral blocking I see tends to happen in memory allocation...
>

No, they don't attempt to do that, but I suspect they put in place
infrastructure which could be used to improve direct-reclaimer latency. In
the throttle_vm_writeout() path, at least.

Do you know where the stalls are occurring? throttle_vm_writeout(), or via
direct calls to congestion_wait() from page_alloc.c and vmscan.c? (running
sysrq-w five or ten times will probably be enough to determine this)
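
If there's no console handy, poking /proc/sysrq-trigger should do the same
thing, assuming sysrq is enabled via /proc/sys/kernel/sysrq:

# echo w > /proc/sysrq-trigger
# dmesg | tail -n 100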



2007-09-28 18:48:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)


On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:

> Do you know where the stalls are occurring? throttle_vm_writeout(), or via
> direct calls to congestion_wait() from page_alloc.c and vmscan.c? (running
> sysrq-w five or ten times will probably be enough to determine this)

would it make sense to instrument congestion_wait() callsites with
vmstats?

2007-09-28 19:16:11

by Trond Myklebust

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
> On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <[email protected]> wrote:
> > Do these patches also cause the memory reclaimers to steer clear of
> > devices that are congested (and stop waiting on a congested device if
> > they see that it remains congested for a long period of time)? Most of
> > the collateral blocking I see tends to happen in memory allocation...
> >
>
> No, they don't attempt to do that, but I suspect they put in place
> infrastructure which could be used to improve direct-reclaimer latency. In
> the throttle_vm_writeout() path, at least.
>
> Do you know where the stalls are occurring? throttle_vm_writeout(), or via
> direct calls to congestion_wait() from page_alloc.c and vmscan.c? (running
> sysrq-w five or ten times will probably be enough to determine this)

Looking back, they were getting caught up in
balance_dirty_pages_ratelimited() and friends. See the attached
example...

Cheers
Trond


Attachments:
(No filename) (9.15 kB)
Attached message - [NFS] NFS on loopback locks up entire system(2.6.23-rc6)?

2007-09-28 19:17:55

by Andrew Morton

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 28 Sep 2007 20:48:59 +0200 Peter Zijlstra <[email protected]> wrote:

>
> On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
>
> > Do you know where the stalls are occurring? throttle_vm_writeout(), or via
> > direct calls to congestion_wait() from page_alloc.c and vmscan.c? (running
> > sysrq-w five or ten times will probably be enough to determine this)
>
> would it make sense to instrument congestion_wait() callsites with
> vmstats?

Better than nothing, but it isn't a great fit: we'd need one vmstat counter
per congestion_wait() callsite, and it's all rather specific to the
kernel-of-the-day.

taskstats delay accounting isn't useful either - it will aggregate all the
schedule() callsites.

profile=sleep is just about ideal for this, isn't it? I suspect that most
people don't know it's there, or forgot about it.
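
For reference, the usual recipe is roughly: boot with profile=sleep on the
kernel command line, run the workload, then dump the profile with something
like

# readprofile -m /boot/System.map | sort -rn | head

assuming readprofile and a matching System.map are available.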

It could be that profile=sleep just tells us "you're spending a lot of time
in io_schedule()" or congestion_wait(), so perhaps we need to teach it to
go for a walk up the stack somehow.

But lockdep knows how to do that already so perhaps we (ie: you ;)) can
bolt sleep instrumentation onto lockdep as we (ie you ;)) did with the
lockstat stuff?

(Searches for the lockstat documentation)

Did we forget to do that?


2007-09-28 19:27:48

by Andrew Morton

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[email protected]> wrote:

> On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
> > On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <[email protected]> wrote:
> > > Do these patches also cause the memory reclaimers to steer clear of
> > > devices that are congested (and stop waiting on a congested device if
> > > they see that it remains congested for a long period of time)? Most of
> > > the collateral blocking I see tends to happen in memory allocation...
> > >
> >
> > No, they don't attempt to do that, but I suspect they put in place
> > infrastructure which could be used to improve direct-reclaimer latency. In
> > the throttle_vm_writeout() path, at least.
> >
> > Do you know where the stalls are occurring? throttle_vm_writeout(), or via
> > direct calls to congestion_wait() from page_alloc.c and vmscan.c? (running
> > sysrq-w five or ten times will probably be enough to determine this)
>
> Looking back, they were getting caught up in
> balance_dirty_pages_ratelimited() and friends. See the attached
> example...

that one is nfs-on-loopback, which is a special case, isn't it?

NFS on loopback used to hang, but then we fixed it. It looks like we
broke it again sometime in the intervening four years or so.


2007-09-28 19:52:44

by Trond Myklebust

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[email protected]> wrote:
> > Looking back, they were getting caught up in
> > balance_dirty_pages_ratelimited() and friends. See the attached
> > example...
>
> that one is nfs-on-loopback, which is a special case, isn't it?

I'm not sure that the hang that is illustrated here is so special. It is
an example of a bog-standard ext3 write, that ends up calling the NFS
client, which is hanging. The fact that it happens to be hanging on the
nfsd process is more or less irrelevant here: the same thing could
happen to any other process in the case where we have an NFS server that
is down.

> NFS on loopback used to hang, but then we fixed it. It looks like we
> broke it again sometime in the intervening four years or so.

It has been quirky all through the 2.6.x series because of this issue.

Cheers
Trond



2007-09-28 20:11:53

by Andrew Morton

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 28 Sep 2007 15:52:28 -0400
Trond Myklebust <[email protected]> wrote:

> On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[email protected]> wrote:
> > > Looking back, they were getting caught up in
> > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > example...
> >
> > that one is nfs-on-loopback, which is a special case, isn't it?
>
> I'm not sure that the hang that is illustrated here is so special. It is
> an example of a bog-standard ext3 write, that ends up calling the NFS
> client, which is hanging. The fact that it happens to be hanging on the
> nfsd process is more or less irrelevant here: the same thing could
> happen to any other process in the case where we have an NFS server that
> is down.

hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?

We should be able to fix that by marking the backing device as
write-congested. That'll have small race windows, but it should be a 99.9%
fix?



2007-09-28 20:24:34

by Daniel Phillips

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Friday 28 September 2007 12:52, Trond Myklebust wrote:
> I'm not sure that the hang that is illustrated here is so special. It
> is an example of a bog-standard ext3 write, that ends up calling the
> NFS client, which is hanging. The fact that it happens to be hanging
> on the nfsd process is more or less irrelevant here: the same thing
> could happen to any other process in the case where we have an NFS
> server that is down.

Hi Trond,

Could you clarify what you meant by "calling the NFS client"? I don't
see any direct call in the posted backtrace.

Regards,

Daniel

2007-09-28 20:32:37

by Trond Myklebust

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> On Fri, 28 Sep 2007 15:52:28 -0400
> Trond Myklebust <[email protected]> wrote:
>
> > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[email protected]> wrote:
> > > > Looking back, they were getting caught up in
> > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > example...
> > >
> > > that one is nfs-on-loopback, which is a special case, isn't it?
> >
> > I'm not sure that the hang that is illustrated here is so special. It is
> > an example of a bog-standard ext3 write, that ends up calling the NFS
> > client, which is hanging. The fact that it happens to be hanging on the
> > nfsd process is more or less irrelevant here: the same thing could
> > happen to any other process in the case where we have an NFS server that
> > is down.
>
> hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
>
> We should be able to fix that by marking the backing device as
> write-congested. That'll have small race windows, but it should be a 99.9%
> fix?

No. The problem would rather appear to be that we're doing
per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
we're measuring variables which are global to the VM. The backing device
that we are selecting may not be writing out any dirty pages, in which
case, we're just spinning in balance_dirty_pages_ratelimited().

Should we therefore perhaps be looking at adding per-backing_dev stats
too?

Trond



2007-09-28 20:43:26

by Andrew Morton

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 28 Sep 2007 16:32:18 -0400
Trond Myklebust <[email protected]> wrote:

> On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> > On Fri, 28 Sep 2007 15:52:28 -0400
> > Trond Myklebust <[email protected]> wrote:
> >
> > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[email protected]> wrote:
> > > > > Looking back, they were getting caught up in
> > > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > > example...
> > > >
> > > > that one is nfs-on-loopback, which is a special case, isn't it?
> > >
> > > I'm not sure that the hang that is illustrated here is so special. It is
> > > an example of a bog-standard ext3 write, that ends up calling the NFS
> > > client, which is hanging. The fact that it happens to be hanging on the
> > > nfsd process is more or less irrelevant here: the same thing could
> > > happen to any other process in the case where we have an NFS server that
> > > is down.
> >
> > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> >
> > We should be able to fix that by marking the backing device as
> > write-congested. That'll have small race windows, but it should be a 99.9%
> > fix?
>
> No. The problem would rather appear to be that we're doing
> per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
> we're measuring variables which are global to the VM. The backing device
> that we are selecting may not be writing out any dirty pages, in which
> case, we're just spinning in balance_dirty_pages_ratelimited().

OK, so it's unrelated to page reclaim.

> Should we therefore perhaps be looking at adding per-backing_dev stats
> too?

That's what mm-per-device-dirty-threshold.patch and friends are doing.
Whether it works adequately is not really known at this time.
Unfortunately kernel developers don't test -mm much.

2007-09-28 21:36:40

by Chakri n

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

Here is a snapshot of the vmstats when the problem happened. I believe
this could help a little.

crash> kmem -V
NR_FREE_PAGES: 680853
NR_INACTIVE: 95380
NR_ACTIVE: 26891
NR_ANON_PAGES: 2507
NR_FILE_MAPPED: 1832
NR_FILE_PAGES: 119779
NR_FILE_DIRTY: 0
NR_WRITEBACK: 18272
NR_SLAB_RECLAIMABLE: 1305
NR_SLAB_UNRECLAIMABLE: 2085
NR_PAGETABLE: 123
NR_UNSTABLE_NFS: 0
NR_BOUNCE: 0
NR_VMSCAN_WRITE: 0
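
For scale, assuming 4K pages, that NR_WRITEBACK figure corresponds to

  18272 pages * 4KB = ~71MB

of data stuck in writeback, while NR_FILE_DIRTY and NR_UNSTABLE_NFS are both zero.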

In my testing, I always saw the processes waiting in
balance_dirty_pages_ratelimited(), never in the throttle_vm_writeout()
path.

But this could be because I have about 4 GB of memory in the system
and plenty of memory is still available.

I will rerun the test limiting memory to 1024 MB and let's see if it
takes a different path.
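
(Presumably the easiest way to clamp it is to boot with mem=1024M on the
kernel command line and rerun the same test.)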

Thanks
--Chakri


On 9/28/07, Andrew Morton <[email protected]> wrote:
> On Fri, 28 Sep 2007 16:32:18 -0400
> Trond Myklebust <[email protected]> wrote:
>
> > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> > > On Fri, 28 Sep 2007 15:52:28 -0400
> > > Trond Myklebust <[email protected]> wrote:
> > >
> > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[email protected]> wrote:
> > > > > > Looking back, they were getting caught up in
> > > > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > > > example...
> > > > >
> > > > > that one is nfs-on-loopback, which is a special case, isn't it?
> > > >
> > > > I'm not sure that the hang that is illustrated here is so special. It is
> > > > an example of a bog-standard ext3 write, that ends up calling the NFS
> > > > client, which is hanging. The fact that it happens to be hanging on the
> > > > nfsd process is more or less irrelevant here: the same thing could
> > > > happen to any other process in the case where we have an NFS server that
> > > > is down.
> > >
> > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> > >
> > > We should be able to fix that by marking the backing device as
> > > write-congested. That'll have small race windows, but it should be a 99.9%
> > > fix?
> >
> > No. The problem would rather appear to be that we're doing
> > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
> > we're measuring variables which are global to the VM. The backing device
> > that we are selecting may not be writing out any dirty pages, in which
> > case, we're just spinning in balance_dirty_pages_ratelimited().
>
> OK, so it's unrelated to page reclaim.
>
> > Should we therefore perhaps be looking at adding per-backing_dev stats
> > too?
>
> That's what mm-per-device-dirty-threshold.patch and friends are doing.
> Whether it works adequately is not really known at this time.
> Unfortunately kernel developers don't test -mm much.
>

2007-09-28 23:33:17

by Chakri n

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

No change in behavior even in the case of a low-memory system. I confirmed
it running on a 1 GB machine.

Thanks
--Chakri

On 9/28/07, Chakri n <[email protected]> wrote:
> Here is a the snapshot of vmstats when the problem happened. I believe
> this could help a little.
>
> crash> kmem -V
> NR_FREE_PAGES: 680853
> NR_INACTIVE: 95380
> NR_ACTIVE: 26891
> NR_ANON_PAGES: 2507
> NR_FILE_MAPPED: 1832
> NR_FILE_PAGES: 119779
> NR_FILE_DIRTY: 0
> NR_WRITEBACK: 18272
> NR_SLAB_RECLAIMABLE: 1305
> NR_SLAB_UNRECLAIMABLE: 2085
> NR_PAGETABLE: 123
> NR_UNSTABLE_NFS: 0
> NR_BOUNCE: 0
> NR_VMSCAN_WRITE: 0
>
> In my testing, I always saw the processes are waiting in
> balance_dirty_pages_ratelimited(), never in throttle_vm_writeout()
> path.
>
> But this could be because I have about 4Gig of memory in the system
> and plenty of mem is still available around.
>
> I will rerun the test limiting memory to 1024MB and lets see if it
> takes in any different path.
>
> Thanks
> --Chakri
>
>
> On 9/28/07, Andrew Morton <[email protected]> wrote:
> > On Fri, 28 Sep 2007 16:32:18 -0400
> > Trond Myklebust <[email protected]> wrote:
> >
> > > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> > > > On Fri, 28 Sep 2007 15:52:28 -0400
> > > > Trond Myklebust <[email protected]> wrote:
> > > >
> > > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[email protected]> wrote:
> > > > > > > Looking back, they were getting caught up in
> > > > > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > > > > example...
> > > > > >
> > > > > > that one is nfs-on-loopback, which is a special case, isn't it?
> > > > >
> > > > > I'm not sure that the hang that is illustrated here is so special. It is
> > > > > an example of a bog-standard ext3 write, that ends up calling the NFS
> > > > > client, which is hanging. The fact that it happens to be hanging on the
> > > > > nfsd process is more or less irrelevant here: the same thing could
> > > > > happen to any other process in the case where we have an NFS server that
> > > > > is down.
> > > >
> > > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> > > >
> > > > We should be able to fix that by marking the backing device as
> > > > write-congested. That'll have small race windows, but it should be a 99.9%
> > > > fix?
> > >
> > > No. The problem would rather appear to be that we're doing
> > > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
> > > we're measuring variables which are global to the VM. The backing device
> > > that we are selecting may not be writing out any dirty pages, in which
> > > case, we're just spinning in balance_dirty_pages_ratelimited().
> >
> > OK, so it's unrelated to page reclaim.
> >
> > > Should we therefore perhaps be looking at adding per-backing_dev stats
> > > too?
> >
> > That's what mm-per-device-dirty-threshold.patch and friends are doing.
> > Whether it works adequately is not really known at this time.
> > Unfortunately kernel developers don't test -mm much.
> >
>


2007-09-29 00:46:59

by Daniel Phillips

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Thursday 27 September 2007 23:50, Andrew Morton wrote:
> Actually we perhaps could address this at the VFS level in another
> way. Processes which are writing to the dead NFS server will
> eventually block in balance_dirty_pages() once they've exceeded the
> memory limits and will remain blocked until the server wakes up -
> that's the behaviour we want.

It is not necessary to restrict total dirty pages at all. Instead it is
necessary to restrict total writeout in flight. This is evident from
the fact that making progress is the one and only reason our kernel
exists, and writeout is how we make progress clearing memory. In other
words, if we guarantee the progress of writeout, we will live happily
ever after and not have to sell the farm.

The current situation has an eerily similar feeling to the VM
instability in early 2.4, which was never solved until we convinced
ourselves that the only way to deal with Moore's law as applied to
number of memory pages was to implement positive control of swapout in
the form of reverse mapping[1]. This time round, we need to add
positive control of writeout in the form of rate limiting.

I _think_ Peter is with me on this, and not only that, but between the
two of us we already have patches for most of the subsystems that need
it, and we have both been busy testing (different subsets of) these
patches to destruction for the better part of a year.

Anyway, to fix the immediate bug before the one true dirty_limit removal
patch lands (promise) I think you are on the right track by noticing
that balance_dirty_pages has to become aware of how congested the
involved block device is, since blocking a writeout process on an
underused block device is clearly a bad idea. Note how much this idea
looks like rate limiting.

[1] We lost the scent for a number of reasons, not least because the
experimental implementation of reverse mapping at the time was buggy
for reasons entirely unrelated to the reverse mapping itself.

Regards,

Daniel


2007-09-29 01:27:59

by Daniel Phillips

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Friday 28 September 2007 06:35, Peter Zijlstra wrote:
> ,,,it would be grand (and dangerous) if we could provide for a
> button that would just kill off all outstanding pages against a dead
> device.

Substitute "resources" for "pages" and you begin to get an idea of how
tricky that actually is. That said, this is exactly what we have done
with ddsnap, for the simple reason that our users, now emboldened by
being able to stop or terminate the user space part, felt justified in
expecting that the system continue as if nothing had happened, and
furthermore, be able to restart ddsnap without a hiccup. (Otherwise
known as a sysop's deity-given right to kill.)

So this is what we do in the specific case of ddsnap:

* When we detect some nasty state change such as our userspace
control daemon disappearing on us, we go poking around and
explicitly release every semaphore that the device driver could
possibly wait on forever (interestingly they are all in our own
driver except for BKL, which is just an artifact of device mapper
not having gone over to unlock_ioctl for no good reason that I
know of).

* Then at the points where the driver falls through some lock thus
released, we check our "ready" flag, and if it indicates "busted",
proceed with whatever cleanup is needed at that point.

Does not sound like an approach one would expect to work reliably, does
it? But there just may be some general principle to be ferreted out
here. (Anyone who has ideas on how bits of this procedure could be
abstracted, please do not hesitate to step boldly forth into the
limelight.)

Incidentally, only a small subset of locks needed special handling as
above. Most can be shown to have no way to block forever, short of an
outright bug.

I shudder to think how much work it would be to bring every driver in
the kernel up to such a standard, particularly if user space components
are involved, as with USB. On the other hand, every driver fixed is
one less driver that sucks. The next one to emerge from the pipeline
will most likely be NBD, which we have been working on in fits and
starts for a while. Look for it to morph into "ddbd", with cross-node
distributed data awareness, in addition to performing its current job
without deadlocking.

Regards,

Daniel


2007-09-29 01:51:46

by Daniel Phillips

[permalink] [raw]
Subject: KDB?

On Friday 28 September 2007 12:16, Trond Myklebust wrote:
> crash> bt 3188

crash> ps|grep ps

Hey, that looks just like kdb! But I heard that kgdb is better in every
way than kdb.

Innocently yours,

Daniel


2007-09-28 06:59:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:

> What we _don't_ want to happen is for other processes which are writing to
> other, non-dead devices to get collaterally blocked. We have patches which
> might fix that queued for 2.6.24. Peter?

Nasty problem, don't do that :-)

But yeah, with per BDI dirty limits we get stuck at whatever ratio that
NFS server/mount (?) has - which could be 100%. Other processes will
then work almost synchronously against their BDIs but it should work.

[ They will lower the NFS BDI's ratio, but some fancy clipping code will
limit the other BDIs' dirty limits so that the total limit is not exceeded.
And with all these NFS pages stuck, that will still be nothing. ]




2007-09-28 08:27:17

by Chakri n

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

Thanks.

The BDI dirty limits sound like a good idea.

Is there already a patch for this, which I could try?

I believe it works like this,

Each BDI will have a limit. If the dirty_thresh exceeds the limit,
all the I/O on the block device will be synchronous.

so, if I have sda & a NFS mount, the dirty limit can be different for
each of them.

I can set dirty limit for
- sda to be 90% and
- NFS mount to be 50%.

So, if the dirty level is greater than 50%, NFS writes synchronously,
but sda can work asynchronously until the dirty level reaches 90%.

Thanks
--Chakri

On 9/27/07, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
>
> > What we _don't_ want to happen is for other processes which are writing to
> > other, non-dead devices to get collaterally blocked. We have patches which
> > might fix that queued for 2.6.24. Peter?
>
> Nasty problem, don't do that :-)
>
> But yeah, with per BDI dirty limits we get stuck at whatever ratio that
> NFS server/mount (?) has - which could be 100%. Other processes will
> then work almost synchronously against their BDIs but it should work.
>
> [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
> limit the other BDIs their dirty limit to not exceed the total limit.
> And with all these NFS pages stuck, that will still be nothing. ]
>
>
>
>


2007-09-28 08:40:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

[ please don't top-post! ]

On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:

> On 9/27/07, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
> >
> > > What we _don't_ want to happen is for other processes which are writing to
> > > other, non-dead devices to get collaterally blocked. We have patches which
> > > might fix that queued for 2.6.24. Peter?
> >
> > Nasty problem, don't do that :-)
> >
> > But yeah, with per BDI dirty limits we get stuck at whatever ratio that
> > NFS server/mount (?) has - which could be 100%. Other processes will
> > then work almost synchronously against their BDIs but it should work.
> >
> > [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
> > limit the other BDIs their dirty limit to not exceed the total limit.
> > And with all these NFS pages stuck, that will still be nothing. ]
> >
> Thanks.
>
> The BDI dirty limits sounds like a good idea.
>
> Is there already a patch for this, which I could try?

v2.6.23-rc8-mm2

> I believe it works like this,
>
> Each BDI, will have a limit. If the dirty_thresh exceeds the limit,
> all the I/O on the block device will be synchronous.
>
> so, if I have sda & a NFS mount, the dirty limit can be different for
> each of them.
>
> I can set dirty limit for
> - sda to be 90% and
> - NFS mount to be 50%.
>
> So, if the dirty limit is greater than 50%, NFS does synchronously,
> but sda can work asynchronously, till dirty limit reaches 90%.

Not quite, the system determines the limit itself in an adaptive
fashion.

bdi_limit = total_limit * p_bdi

Where p is a fraction in [0,1], determined by the relative writeout
speed of the current BDI vs all other BDIs.

So if you were to have 3 BDIs (sda, sdb and 1 nfs mount), and sda is
idle, and the nfs mount gets twice as much traffic as sdb, the ratios
will look like:

p_sda: 0
p_sdb: 1/3
p_nfs: 2/3

Once the traffic exceeds the write speed of the device we build up a
backlog and stuff gets throttled, so these proportions converge to the
relative write speed of the BDIs when saturated with data.
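
To put rough numbers on it (the 800M total below is just an assumed global
dirty limit for the sake of illustration, not something computed for your
machine):

  total_limit = 800M
  bdi_limit(sda) = 800M * 0   =    0M
  bdi_limit(sdb) = 800M * 1/3 = ~267M
  bdi_limit(nfs) = 800M * 2/3 = ~533M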

So what can happen in your case is that the NFS mount, being the only one
with traffic, will get a fraction of 1. If it then disconnects, as in
your case, it will still have all of the dirty limit pinned for NFS.

However other devices will at that moment try to maintain a limit of 0,
which ends up being similar to a sync mount.

So they'll not get stuck, but they will be slow.



2007-09-28 09:01:23

by Chakri n

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

Thanks for explaining the adaptive logic.

> However other devices will at that moment try to maintain a limit of 0,
> which ends up being similar to a sync mount.
>
> So they'll not get stuck, but they will be slow.
>
>

Sync should be OK when the situation is bad like this and someone has
hijacked all the buffers.

But I see my simple dd to write 10 blocks on the local disk never
completes, even after 10 minutes.

[root@h46 ~]# dd if=/dev/zero of=/tmp/x count=10

I think the process is completely stuck and is not progressing at all.

Is something going wrong in the calculations where it does not fall
back to sync mode?

Thanks
--Chakri

On 9/28/07, Peter Zijlstra <[email protected]> wrote:
> [ please don't top-post! ]
>
> On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:
>
> > On 9/27/07, Peter Zijlstra <[email protected]> wrote:
> > > On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
> > >
> > > > What we _don't_ want to happen is for other processes which are writing to
> > > > other, non-dead devices to get collaterally blocked. We have patches which
> > > > might fix that queued for 2.6.24. Peter?
> > >
> > > Nasty problem, don't do that :-)
> > >
> > > But yeah, with per BDI dirty limits we get stuck at whatever ratio that
> > > NFS server/mount (?) has - which could be 100%. Other processes will
> > > then work almost synchronously against their BDIs but it should work.
> > >
> > > [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
> > > limit the other BDIs their dirty limit to not exceed the total limit.
> > > And with all these NFS pages stuck, that will still be nothing. ]
> > >
> > Thanks.
> >
> > The BDI dirty limits sounds like a good idea.
> >
> > Is there already a patch for this, which I could try?
>
> v2.6.23-rc8-mm2
>
> > I believe it works like this,
> >
> > Each BDI, will have a limit. If the dirty_thresh exceeds the limit,
> > all the I/O on the block device will be synchronous.
> >
> > so, if I have sda & a NFS mount, the dirty limit can be different for
> > each of them.
> >
> > I can set dirty limit for
> > - sda to be 90% and
> > - NFS mount to be 50%.
> >
> > So, if the dirty limit is greater than 50%, NFS does synchronously,
> > but sda can work asynchronously, till dirty limit reaches 90%.
>
> Not quite, the system determines the limit itself in an adaptive
> fashion.
>
> bdi_limit = total_limit * p_bdi
>
> Where p is a faction [0,1], and is determined by the relative writeout
> speed of the current BDI vs all other BDIs.
>
> So if you were to have 3 BDIs (sda, sdb and 1 nfs mount), and sda is
> idle, and the nfs mount gets twice as much traffic as sdb, the ratios
> will look like:
>
> p_sda: 0
> p_sdb: 1/3
> p_nfs: 2/3
>
> Once the traffic exceeds the write speed of the device we build up a
> backlog and stuff gets throttled, so these proportions converge to the
> relative write speed of the BDIs when saturated with data.
>
> So what can happen in your case is that the NFS mount is the only one
> with traffic is will get a fraction of 1. If it then disconnects like in
> your case, it will still have all of the dirty limit pinned for NFS.
>
> However other devices will at that moment try to maintain a limit of 0,
> which ends up being similar to a sync mount.
>
> So they'll not get stuck, but they will be slow.
>
>


2007-09-28 09:12:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

On Fri, 2007-09-28 at 02:01 -0700, Chakri n wrote:
> Thanks for explaining the adaptive logic.
>
> > However other devices will at that moment try to maintain a limit of 0,
> > which ends up being similar to a sync mount.
> >
> > So they'll not get stuck, but they will be slow.
> >
> >
>
> Sync should be ok, when the situation is bad like this and some one
> hijacked all the buffers.
>
> But, I see my simple dd to write 10blocks on local disk never
> completes even after 10 minutes.
>
> [root@h46 ~]# dd if=/dev/zero of=/tmp/x count=10
>
> I think the process is completely stuck and is not progressing at all.
>
> Is something going wrong in the calculations where it does not fall
> back to sync mode.

What kernel is that?



2007-09-28 09:20:24

by Chakri n

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

It's 2.6.23-rc6.

Thanks
--Chakri

On 9/28/07, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2007-09-28 at 02:01 -0700, Chakri n wrote:
> > Thanks for explaining the adaptive logic.
> >
> > > However other devices will at that moment try to maintain a limit of 0,
> > > which ends up being similar to a sync mount.
> > >
> > > So they'll not get stuck, but they will be slow.
> > >
> > >
> >
> > Sync should be ok, when the situation is bad like this and some one
> > hijacked all the buffers.
> >
> > But, I see my simple dd to write 10blocks on local disk never
> > completes even after 10 minutes.
> >
> > [root@h46 ~]# dd if=/dev/zero of=/tmp/x count=10
> >
> > I think the process is completely stuck and is not progressing at all.
> >
> > Is something going wrong in the calculations where it does not fall
> > back to sync mode.
>
> What kernel is that?
>
>
>


2007-09-28 09:23:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

[ and one copy for the list too ]

On Fri, 2007-09-28 at 02:20 -0700, Chakri n wrote:
> It's 2.6.23-rc6.

Could you try .23-rc8-mm2? It includes the per-bdi stuff.



2007-09-28 10:36:49

by Chakri n

[permalink] [raw]
Subject: Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

It works on .23-rc8-mm2 without any problems.

"dd" process does not hang any more.

Thanks for all the help.

Cheers
--Chakri


On 9/28/07, Peter Zijlstra <[email protected]> wrote:
> [ and one copy for the list too ]
>
> On Fri, 2007-09-28 at 02:20 -0700, Chakri n wrote:
> > It's 2.6.23-rc6.
>
> Could you try .23-rc8-mm2. It includes the per bdi stuff.
>
>
