2003-06-03 08:35:02

by Adam J. Richter

Subject: Counter-kludge for 2.5.x hanging when writing to block device

For at least the past few months, the Linux 2.5 kernels have
hung when I try to write a large amount of data to a block device.
I most commonly notice this when trying to clear a disk with a command
like "dd if=/dev/zero of=/dev/discs/disc1/disc". Sometimes doing
an mkfs on a big file system is enough to cause the hang.
I wrote a little program that repeatedly writes a 4kB block of zeroes
to the device so I could track how far it got before hanging; on a
computer with 512MB of RAM, it would write 210-215MB of zeroes to the
disk before the hang. When these hangs occur, other processes
continue to run fine, and I can do syncs, which return, but the
hung process never resumes. In the past, I've verified with a
printk that it is looping in balance_dirty_pages, repeatedly
calling blk_congestion_wait, and never leaving the loop.
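
For reference, a minimal sketch of such a test program (the device
path and the 1MB progress interval are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static char buf[4096];	/* one 4kB block of zeroes */
	long long total = 0;
	ssize_t n;
	int fd;

	memset(buf, 0, sizeof(buf));
	fd = open(argc > 1 ? argv[1] : "/dev/discs/disc1/disc", O_WRONLY);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	for (;;) {
		n = write(fd, buf, sizeof(buf));
		if (n < 0) {
			perror("write");
			break;
		}
		total += n;
		/* The last progress line printed shows how far the
		   writer got before the process hung. */
		if (total % (1 << 20) == 0) {
			fprintf(stderr, "\r%lld MB", total >> 20);
			fflush(stderr);
		}
	}
	close(fd);
	return 0;
}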

Here is a counter-kludge that seems to stop the problem.
This is certainly not the "right" fix. It just illustrates a way
to stop the problem.

By the way, I say "counter-kludge", because I get the impression
that blk_congestion_wait is itself a kludge, since it calls
blk_run_queues and waits a fixed amount of time, 100ms in this case,
potentially a big waste of time, rather than awaiting some more
accurate criterion.
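
For context, blk_congestion_wait() in these kernels looks roughly like
this (an approximation from the 2.5.70 tree, not a verbatim copy):

void blk_congestion_wait(int rw, long timeout)
{
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[rw];

	/* Kick the queues so queued writeback makes progress... */
	blk_run_queues();

	/* ...then sleep until the timeout expires or a wakeup
	 * arrives when some queue leaves the congested state. */
	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);
}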

Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."


--- linux-2.5.70-bk7/mm/page-writeback.c 2003-06-02 14:02:39.000000000 -0700
+++ linux/mm/page-writeback.c 2003-06-02 13:59:31.000000000 -0700
@@ -177,7 +177,12 @@
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
+#if 0 /* AJR */
 		blk_congestion_wait(WRITE, HZ/10);
+#else
+		blk_run_queues();
+		break;
+#endif
 	}
 
 	if (nr_reclaimable + ps.nr_writeback <= dirty_thresh)


2003-06-03 08:56:54

by Jens Axboe

Subject: Re: Counter-kludge for 2.5.x hanging when writing to block device

On Tue, Jun 03 2003, Adam J. Richter wrote:
> For at least the past few months, the Linux 2.5 kernels have
> hung when I try to write a large amount of data to a block device.
> I most commonly notice this when trying to clear a disk with a command
> like "dd if=/dev/zero of=/dev/discs/disc1/disc". Sometimes doing
> an mkfs on a big file system is enough to cause the hang.
> I wrote a little program that repeatedly writes a 4kB block of zeroes
> to the device so I could track how far it got before hanging; on a
> computer with 512MB of RAM, it would write 210-215MB of zeroes to the
> disk before the hang. When these hangs occur, other processes
> continue to run fine, and I can do syncs, which return, but the
> hung process never resumes. In the past, I've verified with a
> printk that it is looping in balance_dirty_pages, repeatedly
> calling blk_congestion_wait, and never leaving the loop.
>
> Here is a counter-kludge that seems to stop the problem.
> This is certainly not the "right" fix. It just illustrates a way
> to stop the problem.
>
> By the way, I say "counter-kludge", because I get the impression
> that blk_congestion_wait is itself a kludge, since it calls
> blk_run_queues and waits a fixed amount of time, 100ms in this case,
> potentially a big waste of time, rather than awaiting some more
> accurate criterion.

Does something like this work? Andrew, what's the point of doing the
wait if the queue isn't congested?! I haven't even checked whether this
gets the job done; I think it would be cleaner to pass the backing dev
info in to blk_congestion_wait so we can make the decision in there.

===== mm/page-writeback.c 1.66 vs edited =====
--- 1.66/mm/page-writeback.c Sun Jun 1 23:12:47 2003
+++ edited/mm/page-writeback.c Tue Jun 3 11:09:13 2003
@@ -152,6 +152,7 @@
 		.sync_mode	= WB_SYNC_NONE,
 		.older_than_this = NULL,
 		.nr_to_write	= write_chunk,
+		.encountered_congestion = 0,
 	};
 
 	get_dirty_limits(&ps, &background_thresh, &dirty_thresh);
@@ -178,7 +179,8 @@
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		blk_congestion_wait(WRITE, HZ/10);
+		if (wbc.encountered_congestion)
+			blk_congestion_wait(WRITE, HZ/10);
 	}
 
 	if (nr_reclaimable + ps.nr_writeback <= dirty_thresh)
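
For completeness: wbc.encountered_congestion gets set when a
nonblocking writer backs off from a congested queue, roughly like this
(an approximation of the current writeback paths):

	if (wbc->nonblocking && bdi_write_congested(bdi)) {
		/* Back off instead of blocking on a congested queue,
		 * and record for the caller that we did so. */
		wbc->encountered_congestion = 1;
		break;
	}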

--
Jens Axboe

2003-06-03 09:49:21

by Jens Axboe

Subject: Re: Counter-kludge for 2.5.x hanging when writing to block device

On Tue, Jun 03 2003, Andrew Morton wrote:
> > Does something like this work? Andrew, what's the point of doing the
> > wait if the queue isn't congested?!
>
> We need to wait until the amount of dirty memory in the machine is below
> the designated limits. This is unrelated to queue congestion. The way the
> logic is now we can have 256 megs worth of requests queued on a 32M machine
> and everything throttles and clamps as intended.
>
>
> There are several things wrong with blk_congestion_wait(), including:
>
> a) it should be called throttle_on_io()

Well...

> b) it should check that there are still requests in flight after parking
> itself on the waitqueue rather than relying on the timeout.

This is important; it would be much nicer to pass in the backing dev.
This is a big problem, imho. It's broken right now.

> As for Adam's hang: dunno. I and many others have run mkfs and dd an
> unbelievable number of times. He needs to debug it more.

Agree

--
Jens Axboe

2003-06-03 09:46:43

by Andrew Morton

Subject: Re: Counter-kludge for 2.5.x hanging when writing to block device

Jens Axboe <[email protected]> wrote:
>
> On Tue, Jun 03 2003, Adam J. Richter wrote:
> > For at least the past few months, the Linux 2.5 kernels have
> > hung when I try to write a large amount of data to a block device.

Well ytf is this the first time I've heard about it?

> > I most commonly notice this when trying to clear a disk with a command
> > like "dd if=/dev/zero of=/dev/discs/disc1/disc". Sometimes doing
> > an mkfs on a big file system is enough to cause the hang.
> > I wrote a little program that repeatedly writes a 4kB block of zeroes
> > to the device so I could track how far it got before hanging; on a
> > computer with 512MB of RAM, it would write 210-215MB of zeroes to the
> > disk before the hang. When these hangs occur, other processes
> > continue to run fine, and I can do syncs, which return, but the
> > hung process never resumes. In the past, I've verified with a
> > printk that it is looping in balance_dirty_pages, repeatedly
> > calling blk_congestion_wait, and never leaving the loop.
> >

Please debug it further. Something may have gone wrong with the arithmetic
in balance_dirty_pages().

> > Here is a counter-kludge that seems to stop the problem.
> > This is certainly not the "right" fix. It just illustrates a way
> > to stop the problem.
> >
> > By the way, I say "counter-kludge", because I get the impression
> > that blk_congestion_wait is itself a kludge, since it calls
> > blk_run_queues and waits a fixed amount of time, 100ms in this case,
> > potentially a big waste of time, rather than awaiting some more
> > accurate criterion.

The sleep in blk_congestion_wait() terminates when a request is returned to
the queue. The timeout is only really there for non-request-based backing
devices.
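
Roughly, the wakeup side lives in the request completion path (again
an approximation of the current code, not a verbatim copy):

static void clear_queue_congested(request_queue_t *q, int rw)
{
	enum bdi_state bit;
	wait_queue_head_t *wqh = &congestion_wqh[rw];

	bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
	clear_bit(bit, &q->backing_dev_info.state);
	/* A request was returned to a previously congested queue;
	 * wake anyone throttled in blk_congestion_wait(). */
	if (waitqueue_active(wqh))
		wake_up(wqh);
}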

> Does something like this work? Andrew, what's the point of doing the
> wait if the queue isn't congested?!

We need to wait until the amount of dirty memory in the machine is below
the designated limits. This is unrelated to queue congestion. The way the
logic is now we can have 256 megs worth of requests queued on a 32M machine
and everything throttles and clamps as intended.


There are several things wrong with blk_congestion_wait(), including:

a) it should be called throttle_on_io()

b) it should check that there are still requests in flight after parking
itself on the waitqueue rather than relying on the timeout.

c) for memory reclaim we should terminate the sleep on a certain number
of pages coming unreclaimable, not on write requests being returned or
timeout.

d) network filesystems should be delivering wakeups to throttled
processes rather than relying on the timeout.

But none of these have proven sufficiently problematic to justify futzing
with it. I expect d) will eventually need to be implemented.
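
A sketch of what b) might look like (hypothetical: requests_in_flight()
is an invented stand-in for whatever in-flight accounting we would
actually expose):

void throttle_on_io(int rw, long timeout)	/* named per a) */
{
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[rw];

	blk_run_queues();
	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	/* Only sleep if there is still I/O in flight to wait for;
	 * otherwise the timeout is pure added latency. */
	if (requests_in_flight(rw))
		io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);
}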

As for Adam's hang: dunno. I and many others have run mkfs and dd an
unbelievable number of times. He needs to debug it more.

2003-06-03 10:06:52

by Andrew Morton

Subject: Re: Counter-kludge for 2.5.x hanging when writing to block device

Jens Axboe <[email protected]> wrote:
>
> > b) it should check that there are still requests in flight after parking
> > itself on the waitqueue rather than relying on the timeout.
>
> This is important, would be much nicer to pass in the backing dev. This
> is a big problem, imho. It's broken right now.

The throttling is not really a per-device concept. It is a "global"
concept.

If a process has written to a really slow device and has encountered
throttling due to exceeded dirty memory limits, we _do_ want to wake that
process up (to reevaluate the system state) if a bunch of writes terminate
against a fast device.

There is a fixed amount of system memory which the administrator has
dedicated to buffering of dirty-and-writeback data and I believe that not
discriminating between different bandwidth devices will give the overall
lowest latency. This may be wrong, and maybe we do want to throttle tasks
which write to slow devices more heavily.

Or place the device's nominal bandwidth in the backing_dev_info, account
for dirty memory on a per-queue basis and limit the permissible amount of
dirty memory against slower devices. That's probably not too hard to do
but I'm not sure that the combination of slow and fast devices both under
heavy writeout at the same time is common enough to justify it.
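
Purely as a hypothetical sketch of that last idea (both bandwidth
fields are invented names; nothing like this is in the tree):

/* Scale a queue's share of the global dirty limit by its nominal
 * bandwidth, so slow devices cannot pin most of writeback memory.
 * nominal_bandwidth and total_nominal_bandwidth do not exist. */
static unsigned long queue_dirty_limit(struct backing_dev_info *bdi,
				       unsigned long dirty_thresh)
{
	return dirty_thresh * bdi->nominal_bandwidth /
			total_nominal_bandwidth;
}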

2003-06-03 10:08:59

by Michael Frank

Subject: Re: Counter-kludge for 2.5.x hanging when writing to block device

On Tuesday 03 June 2003 18:00, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> > On Tue, Jun 03 2003, Adam J. Richter wrote:
> > > For at least the past few months, the Linux 2.5 kernels have
> > > hung when I try to write a large amount of data to a block device.
>
> Well ytf is this the first time I've heard about it?
>

Lots of people are using 2.5 in many configurations. This kind of
problem would have shown up long ago.

I suspect a driver- or hardware-specific issue.

More info on the hardware is needed.

Regards
Michael

2003-06-03 14:29:20

by Jens Axboe

Subject: Re: Counter-kludge for 2.5.x hanging when writing to block device

On Tue, Jun 03 2003, Andrew Morton wrote:
> Jens Axboe <[email protected]> wrote:
> >
> > > b) it should check that there are still requests in flight after parking
> > > itself on the waitqueue rather than relying on the timeout.
> >
> > This is important, would be much nicer to pass in the backing dev. This
> > is a big problem, imho. It's broken right now.
>
> The throttling is not really a per-device concept. It is a "global"
> concept.
>
> If a process has written to a really slow device and has encountered
> throttling due to exceeded dirty memory limits, we _do_ want to wake that
> process up (to reevaluate the system state) if a bunch of writes terminate
> against a fast device.
>
> There is a fixed amount of system memory which the administrator has
> dedicated to buffering of dirty-and-writeback data and I believe that not
> discriminating between different bandwidth devices will give the overall
> lowest latency. This may be wrong, and maybe we do want to throttle tasks
> which write to slow devices more heavily.
>
> Or place the device's nominal bandwidth in the backing_dev_info, account
> for dirty memory on a per-queue basis and limit the permissible amount of
> dirty memory against slower devices. That's probably not too hard to do
> but I'm not sure that the combination of slow and fast devices both under
> heavy writeout at the same time is common enough to justify it.

Per-process mixing of slow and fast devices is probably not common
enough to justify any changes, as long as we deal correctly with fast
vs slow globally.

But your mail explains it nicely, thanks.

--
Jens Axboe