2004-03-08 09:59:38

by Martin Schwidefsky

Subject: blk_congestion_wait racy?

Hi,
we have a stupid little program that linearly allocates and touches
memory. We use this to see how fast s390 can swap. If this is combined
with the fastest block device we have (xpram) we see a very strange
effect:

2.6.4-rc2 with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600 of 600
Done.

real 0m2.516s
user 0m0.150s
sys 0m0.570s
#

2.6.4-rc2 with 2 cpus
# time ./mempig 600
Count (1Meg blocks) = 600
600 of 600
Done.

real 0m56.086s
user 0m0.110s
sys 0m0.630s
#

I have the suspicion that the call to blk_congestion_wait in
try_to_free_pages is part of the problem. It initiates a wait for
a queue to exit congestion but this could already have happened
on another cpu before blk_congestion_wait has set up the wait
queue. In this case the process sleeps for 0.1 seconds. With
the swap test setup this happens all the time. If I "fix"
blk_congestion_wait not to wait:

diff -urN linux-2.6/drivers/block/ll_rw_blk.c linux-2.6-fix/drivers/block/ll_rw_blk.c
--- linux-2.6/drivers/block/ll_rw_blk.c Fri Mar 5 14:50:28 2004
+++ linux-2.6-fix/drivers/block/ll_rw_blk.c Fri Mar 5 14:51:05 2004
@@ -1892,7 +1892,9 @@

blk_run_queues();
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+#if 0
io_schedule_timeout(timeout);
+#endif
finish_wait(wqh, &wait);
}

then the system reacts normally again:

2.6.4-rc2 + "fix" with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600 of 600
Done.

real 0m2.523s
user 0m0.200s
sys 0m0.880s
#

2.6.4-rc2 + "fix" with 2 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600 of 600
Done.

real 0m2.029s
user 0m0.250s
sys 0m1.560s
#

Since it isn't a solution to remove the call to io_schedule_timeout,
I tried to understand what the event is that blk_congestion_wait
is waiting for. The comment says it waits for a queue to exit congestion.
That is, starting from prepare_to_wait it waits for a call to
clear_queue_congested. In my test scenario NO queue is congested on
entry to blk_congestion_wait. I'd like to see a proper wait_event
there, but it is non-trivial to define the event to wait for.
Any useful hints?
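
(For reference, the wake-up side I mean looks roughly like the following.
This is only a sketch from memory of the 2.6 block layer, so names and
details may not match 2.6.4 exactly. The point is that it can run, find
nobody on the waitqueue, and return before prepare_to_wait above has put
us on congestion_wqh.)

static void clear_queue_congested(request_queue_t *q, int rw)
{
	enum bdi_state bit;
	wait_queue_head_t *wqh = &congestion_wqh[rw];

	bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
	clear_bit(bit, &q->backing_dev_info.state);
	/* if nobody is sleeping yet, this wakeup is simply lost and the
	 * later sleeper eats the full timeout */
	if (waitqueue_active(wqh))
		wake_up(wqh);
}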

blue skies,
Martin


2004-03-08 12:24:16

by Andrew Morton

Subject: Re: blk_congestion_wait racy?

Martin Schwidefsky <[email protected]> wrote:
>
> Hi,
> we have a stupid little program that linearly allocates and touches
> memory. We use this to see how fast s390 can swap. If this is combined
> with the fastest block device we have (xpram) we see a very strange
> effect:
>
> 2.6.4-rc2 with 1 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600 of 600
> Done.
>
> real 0m2.516s
> user 0m0.150s
> sys 0m0.570s
> #
>
> 2.6.4-rc2 with 2 cpus
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600 of 600
> Done.
>
> real 0m56.086s
> user 0m0.110s
> sys 0m0.630s
> #

Interesting.

> I have the suspicion that the call to blk_congestion_wait in
> try_to_free_pages is part of the problem. It initiates a wait for
> a queue to exit congestion but this could already have happened
> on another cpu before blk_congestion_wait has set up the wait
> queue. In this case the process sleeps for 0.1 seconds.

The comment may be a bit stale. The idea is that the VM needs to take a
nap while the disk system retires some writes. So we go to sleep until a
write request gets put back. We do this regardless of the queue's
congestion state - the queue could have thousands of request slots and may
never even become congested.

> the swap test setup this happens all the time. If I "fix"
> blk_congestion_wait not to wait:
>
> diff -urN linux-2.6/drivers/block/ll_rw_blk.c linux-2.6-fix/drivers/block/ll_rw_blk.c
> --- linux-2.6/drivers/block/ll_rw_blk.c Fri Mar 5 14:50:28 2004
> +++ linux-2.6-fix/drivers/block/ll_rw_blk.c Fri Mar 5 14:51:05 2004
> @@ -1892,7 +1892,9 @@
>
> blk_run_queues();
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +#if 0
> io_schedule_timeout(timeout);
> +#endif
> finish_wait(wqh, &wait);
> }

Gad, that'll make the VM scan its guts out.

> then the system reacts normally again:
>
> 2.6.4-rc2 + "fix" with 1 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600 of 600
> Done.
>
> real 0m2.523s
> user 0m0.200s
> sys 0m0.880s
> #
>
> 2.6.4-rc2 + "fix" with 2 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600 of 600
> Done.
>
> real 0m2.029s
> user 0m0.250s
> sys 0m1.560s
> #

system time was doubled though.

> Since it isn't a solution to remove the call to io_schedule_timeout,
> I tried to understand what the event is that blk_congestion_wait
> is waiting for. The comment says it waits for a queue to exit congestion.

It's just waiting for a write request to complete. It's a pretty crude way
of throttling page reclaim to the I/O system's speed.

> That is, starting from prepare_to_wait it waits for a call to
> clear_queue_congested. In my test scenario NO queue is congested on
> entry to blk_congestion_wait. I'd like to see a proper wait_event
> there, but it is non-trivial to define the event to wait for.
> Any useful hints?

Nope, something is obviously broken. I'll take a look.

Perhaps with two CPUs you are able to get kswapd and mempig running page
reclaim at the same time, which causes seekier swap I/O patterns than with
one CPU, where we only run one app or the other at any time.

Serialising balance_pgdat() and try_to_free_pages() with a global semaphore
would be a way of testing that theory.

2004-03-08 13:38:41

by Martin Schwidefsky

Subject: Re: blk_congestion_wait racy?





> Gad, that'll make the VM scan its guts out.
Yes, I expected something like this.

> > 2.6.4-rc2 + "fix" with 1 cpu
> > sys 0m0.880s
> >
> > 2.6.4-rc2 + "fix" with 2 cpu
> > sys 0m1.560s
>
> system time was doubled though.
That would be the additional cost for not waiting.

> Nope, something is obviously broken. I'll take a look.
That would be very much appreciated.

> Perhaps with two CPUs you are able to get kswapd and mempig running page
> reclaim at the same time, which causes seekier swap I/O patterns than with
> one CPU, where we only run one app or the other at any time.
>
> Serialising balance_pgdat() and try_to_free_pages() with a global semaphore
> would be a way of testing that theory.

Just tried the following patch:

Index: mm/vmscan.c
===================================================================
RCS file: /home/cvs/linux-2.5/mm/vmscan.c,v
retrieving revision 1.45
diff -u -r1.45 vmscan.c
--- mm/vmscan.c 18 Feb 2004 17:45:28 -0000 1.45
+++ mm/vmscan.c 8 Mar 2004 13:30:56 -0000
@@ -848,6 +848,7 @@
* excessive rotation of the inactive list, which is _supposed_ to be an LRU,
* yes?
*/
+static DECLARE_MUTEX(reclaim_sem);
int try_to_free_pages(struct zone **zones,
unsigned int gfp_mask, unsigned int order)
{
@@ -858,6 +859,8 @@
struct reclaim_state *reclaim_state = current->reclaim_state;
int i;

+ down(&reclaim_sem);
+
inc_page_state(allocstall);

for (i = 0; zones[i] != 0; i++)
@@ -884,7 +887,10 @@
wakeup_bdflush(total_scanned);

/* Take a nap, wait for some writeback to complete */
+ up(&reclaim_sem);
blk_congestion_wait(WRITE, HZ/10);
+ down(&reclaim_sem);
+
if (zones[0] - zones[0]->zone_pgdat->node_zones < ZONE_HIGHMEM) {
shrink_slab(total_scanned, gfp_mask);
if (reclaim_state) {
@@ -898,6 +904,9 @@
out:
for (i = 0; zones[i] != 0; i++)
zones[i]->prev_priority = zones[i]->temp_priority;
+
+ up(&reclaim_sem);
+
return ret;
}

@@ -926,6 +935,8 @@
int i;
struct reclaim_state *reclaim_state = current->reclaim_state;

+ down(&reclaim_sem);
+
inc_page_state(pageoutrun);

for (i = 0; i < pgdat->nr_zones; i++) {
@@ -974,8 +985,11 @@
}
if (all_zones_ok)
break;
- if (to_free > 0)
+ if (to_free > 0) {
+ up(&reclaim_sem);
blk_congestion_wait(WRITE, HZ/10);
+ down(&reclaim_sem);
+ }
}

for (i = 0; i < pgdat->nr_zones; i++) {
@@ -983,6 +997,9 @@

zone->prev_priority = zone->temp_priority;
}
+
+ up(&reclaim_sem);
+
return nr_pages - to_free;
}


It didn't help. Still needs almost a minute.

blue skies,
Martin

2004-03-08 23:50:35

by Nick Piggin

Subject: Re: blk_congestion_wait racy?

Martin Schwidefsky wrote:

>
>
>
>>Gad, that'll make the VM scan its guts out.
>>
>Yes, I expected something like this.
>
>
>>>2.6.4-rc2 + "fix" with 1 cpu
>>>sys 0m0.880s
>>>
>>>2.6.4-rc2 + "fix" with 2 cpu
>>>sys 0m1.560s
>>>
>>system time was doubled though.
>>
>That would be the additional cost for not waiting.
>
>

I'd say it's more like cacheline contention or something: reclaim
won't simply be spinning with nothing to do because you're dirtying
plenty of memory. And if any queues were full it will mostly just be
blocking in the block layer.

>>Nope, something is obviously broken. I'll take a look.
>>
>That would be very much appreciated.
>

I'm looking at 2.6.1 source, so apologies if I'm wrong, but
drivers/block/ll_rw_blk.c:
freed_request does not need the memory barrier because the queue is
protected by the per queue spinlock. And I think clear_queue_congested
should have a memory barrier right before if (waitqueue_active(wqh)).
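
Roughly this shape of change, i.e. just a sketch (I haven't checked the
exact 2.6.4 context lines):

 	clear_bit(bit, &q->backing_dev_info.state);
+	smp_mb();	/* make the clear visible before the unlocked
+			 * waitqueue_active() test below */
 	if (waitqueue_active(wqh))
 		wake_up(wqh);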

Another problem is that if there are no requests anywhere in the system,
sleepers in blk_congestion_wait will not get kicked. blk_congestion_wait
could probably have blk_run_queues moved after prepare_to_wait, which
might help.
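
I.e. something like this (untested, against the hunk Martin posted above):

-	blk_run_queues();
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	blk_run_queues();
 	io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);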

Just some ideas.


2004-03-09 17:55:46

by Martin Schwidefsky

Subject: Re: blk_congestion_wait racy?





Hi Nick,

> Another problem is that if there are no requests anywhere in the system,
> sleepers in blk_congestion_wait will not get kicked. blk_congestion_wait
> could probably have blk_run_queues moved after prepare_to_wait, which
> might help.
I tried putting blk_run_queues after prepare_to_wait; it worked, but it
didn't help. The test still needs close to a minute.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2004-03-10 05:27:43

by Nick Piggin

Subject: Re: blk_congestion_wait racy?



Martin Schwidefsky wrote:

>
>
>
>Hi Nick,
>
>
>>Another problem is that if there are no requests anywhere in the system,
>>sleepers in blk_congestion_wait will not get kicked. blk_congestion_wait
>>could probably have blk_run_queues moved after prepare_to_wait, which
>>might help.
>>
>I tried putting blk_run_queues after prepare_to_wait; it worked, but it
>didn't help. The test still needs close to a minute.
>
>

OK. This was *with* the memory barrier changes too, was it? Not that
they should make that much difference. The test is still racy, but
the window just gets smaller.

But I'm guessing that you have no requests in flight by the time
blk_congestion_wait gets called, so nothing ever gets kicked.

I prefer something more like this model: if 'current' submits a request
to a congested queue then it gets put on the congestion waitqueue.
You can then run blk_congestion_wait afterwards and it won't block if
the queue you've written to has come out of congestion at any time.

This also means that you can (should, in fact) stop uncongested queues
from waking up the waiters every time they complete a request. Hmm, I
like it.
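
In rough pseudo-C, with the flag and the hooks made up purely to
illustrate the idea (nothing like this exists in the tree today):

	/* at submission time: remember that we wrote to a congested queue */
	if (bdi_write_congested(&q->backing_dev_info))
		current->wrote_to_congested_queue = 1;	/* hypothetical flag */

	/* clear_queue_congested() on that queue would clear the flag and
	 * wake the task if it is already sleeping */

	/* page reclaim then only sleeps when there is something to wait for */
	void blk_congestion_wait(int rw, long timeout)
	{
		if (!current->wrote_to_congested_queue)
			return;
		io_schedule_timeout(timeout);
		current->wrote_to_congested_queue = 0;
	}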

2004-03-10 05:35:30

by Andrew Morton

Subject: Re: blk_congestion_wait racy?

Nick Piggin <[email protected]> wrote:
>
> But I'm guessing that you have no requests in flight by the time
> blk_congestion_wait gets called, so nothing ever gets kicked.

That's why blk_congestion_wait() in -mm propagates the schedule_timeout()
return value. You can do:

if (blk_congestion_wait(...))
	printk("ouch\n");

If your kernel says ouch much, we have a problem.

2004-03-10 06:41:25

by Nick Piggin

Subject: Re: blk_congestion_wait racy?



Andrew Morton wrote:

>Nick Piggin <[email protected]> wrote:
>
>>But I'm guessing that you have no requests in flight by the time
>> blk_congestion_wait gets called, so nothing ever gets kicked.
>>
>
>That's why blk_congestion_wait() in -mm propagates the schedule_timeout()
>return value. You can do:
>
> if (blk_congestion_wait(...))
>	printk("ouch\n");
>
>If your kernel says ouch much, we have a problem.
>
>

Martin, have you tried adding this printk?

Andrew, could you take the following patch (even though it didn't fix
the problem)?

I think the smp_mb isn't needed because the rl waitqueue stuff is
serialised by the queue spinlocks.

The addition of the smp_mb and the other change is to try to close the
window for races a bit. Obviously they can still happen; it's a racy
interface and it doesn't matter much.


Attachments:
blk-congestion-races.patch (1.31 kB)

2004-03-11 18:25:51

by Martin Schwidefsky

Subject: Re: blk_congestion_wait racy?





> Martin, have you tried adding this printk?

Sorry for the delay. I had to get 2.6.4-mm1 working before doing the
"ouch" test. The new pte_to_pgprot/pgoff_prot_to_pte stuff wasn't easy.
I tested 2.6.4-mm1 with the blk_run_queues move and the ouch printk.
The first interesting observation is that 2.6.4-mm1 behaves MUCH better
than 2.6.4:

2.6.4-mm1 with 1 cpu
# time ./mempig 600
Count (1Meg blocks) = 600
600 of 600
Done.

real 0m2.587s
user 0m0.100s
sys 0m0.730s
#

2.6.4-mm1 with 2 cpus
# time ./mempig 600
Count (1Meg blocks) = 600
600 of 600
Done.

real 0m10.313s
user 0m0.160s
sys 0m0.780s
#

2.6.4 takes > 1min for the test with 2 cpus.

The second observation is that I get only a few "ouch" messages. They
all come from the blk_congestion_wait in try_to_free_pages, as expected.
What I did not expect is that I only got 9 "ouches" for the run with
2 cpus.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2004-03-11 18:55:38

by Andrew Morton

Subject: Re: blk_congestion_wait racy?

Martin Schwidefsky <[email protected]> wrote:
>
> > Martin, have you tried adding this printk?
>
> Sorry for the delay. I had to get 2.6.4-mm1 working before doing the
> "ouch" test. The new pte_to_pgprot/pgoff_prot_to_pte stuff wasn't easy.

Yes, sorry, all the world's an x86 :( Could you please send me whatever
diffs were needed to get it all going?

There are porting instructions in
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.4/2.6.4-mm1/broken-out/remap-file-pages-prot-2.6.4-rc1-mm1-A1.patch
but maybe it's a bit late for that.

> I tested 2.6.4-mm1 with the blk_run_queues move and the ouch printk.
> The first interesting observation is that 2.6.4-mm1 behaves MUCH better
> than 2.6.4:
>
> 2.6.4-mm1 with 1 cpu
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600 of 600
> Done.
>
> real 0m2.587s
> user 0m0.100s
> sys 0m0.730s
> #

I thought you were running a 256MB machine? Two seconds for 400 megs of
swapout? What's up?

> 2.6.4-mm1 with 2 cpus
> # time ./mempig 600
> Count (1Meg blocks) = 600
> 600 of 600
> Done.
>
> real 0m10.313s
> user 0m0.160s
> sys 0m0.780s
> #
>
> 2.6.4 takes > 1min for the test with 2 cpus.
>
> The second observation is that I get only a few "ouch" messages. They
> all come from the blk_congestion_wait in try_to_free_pages, as expected.
> What I did not expect is that I only got 9 "ouches" for the run with
> 2 cpus.

An ouch-per-second sounds reasonable. It could simply be that the CPUs
were off running other tasks - those timeouts are less than scheduling
quanta.

The 4x performance difference remains not understood.

2004-03-11 19:04:54

by Martin Schwidefsky

Subject: Re: blk_congestion_wait racy?





> Yes, sorry, all the world's an x86 :( Could you please send me whatever
> diffs were needed to get it all going?

I am just preparing that mail :-)

> I thought you were running a 256MB machine? Two seconds for 400 megs of
> swapout? What's up?

Roughly 400 MB of swapout. And two seconds isn't that bad ;-)

> An ouch-per-second sounds reasonable. It could simply be that the CPUs
> were off running other tasks - those timeout are less than scheduling
> quanta.

I don't understand why an ouch-per-second is reasonable. The mempig is
the only process that runs on the machine and the blk_congestion_wait
uses HZ/10 as timeout value. I'd expect about 100 ouches for the 10
seconds the test runs.

The 4x performance difference remains not understood.


blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2004-03-11 23:23:55

by Andrew Morton

Subject: Re: blk_congestion_wait racy?

Martin Schwidefsky <[email protected]> wrote:
>
> > An ouch-per-second sounds reasonable. It could simply be that the CPUs
> > were off running other tasks - those timeouts are less than scheduling
> > quanta.
>
> I don't understand why an ouch-per-second is reasonable. The mempig is
> the only process that runs on the machine and the blk_congestion_wait
> uses HZ/10 as timeout value. I'd expect about 100 ouches for the 10
> seconds the test runs.

blk_congestion_wait() is supposed to be terminated by someone releasing a
disk write request. If no write requests are freed in 100 milliseconds
then either Something Is Up or that process simply was not scheduled for
some time after the wakeup was delivered.


2004-03-12 02:36:32

by Nick Piggin

Subject: Re: blk_congestion_wait racy?



Martin Schwidefsky wrote:

>
>
>
>>Yes, sorry, all the world's an x86 :( Could you please send me whatever
>>diffs were needed to get it all going?
>>
>
>I am just preparing that mail :-)
>
>
>>I thought you were running a 256MB machine? Two seconds for 400 megs of
>>swapout? What's up?
>>
>
>Roughly 400 MB of swapout. And two seconds isn't that bad ;-)
>
>
>>An ouch-per-second sounds reasonable. It could simply be that the CPUs
>>were off running other tasks - those timeouts are less than scheduling
>>quanta.
>>
>
>I don't understand why an ouch-per-second is reasonable. The mempig is
>the only process that runs on the machine and the blk_congestion_wait
>uses HZ/10 as timeout value. I'd expect about 100 ouches for the 10
>seconds the test runs.
>
>The 4x performance difference remains not understood.
>
>

It would still be blk_congestion_wait slowing things down, wouldn't
it? Performance was good when you took that out, wasn't it?

And it would not be unusual for you to be waiting needlessly without
seeing the ouch.

I think I will try doing a non-racy blk_congestion_wait after Jens'
unplugging patch gets put into -mm. That should solve your problem.