2004-09-08 10:04:15

by Ingo Molnar

Subject: [patch] max-sectors-2.6.9-rc1-bk14-A0


this is a re-send of the max-sectors patch against 2.6.9-rc1-bk14.

the attached patch introduces two new /sys/block values:

/sys/block/*/queue/max_hw_sectors_kb
/sys/block/*/queue/max_sectors_kb

max_hw_sectors_kb is the maximum that the driver can handle and is
readonly. max_sectors_kb is the current max_sectors value and can be
tuned by root. PAGE_SIZE granularity is enforced.
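
For illustration, a minimal userspace sketch (not part of the patch): it reads the hardware limit and lowers the per-request cap, assuming a hypothetical IDE disk "hda" and an arbitrary 64 KB target, and must be run as root:

#include <stdio.h>

int main(void)
{
	FILE *f;
	int hw_kb, new_kb;

	/* read the (read-only) hardware limit */
	f = fopen("/sys/block/hda/queue/max_hw_sectors_kb", "r");
	if (!f) {
		perror("max_hw_sectors_kb");
		return 1;
	}
	if (fscanf(f, "%d", &hw_kb) != 1)
		hw_kb = 0;
	fclose(f);
	printf("hardware limit: %d KB per request\n", hw_kb);

	/* lower the per-request cap to 64 KB (or the hw limit if smaller);
	 * per the description above, PAGE_SIZE granularity is enforced */
	new_kb = (hw_kb && hw_kb < 64) ? hw_kb : 64;
	f = fopen("/sys/block/hda/queue/max_sectors_kb", "w");
	if (!f) {
		perror("max_sectors_kb");
		return 1;
	}
	fprintf(f, "%d\n", new_kb);
	fclose(f);
	return 0;
}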

It's all locking-safe and all affected layered drivers have been updated
as well. The patch has been in testing for a couple of weeks already as
part of the voluntary-preempt patches and it works just fine - people
use it to reduce IDE IRQ handling latencies. Please apply.

Signed-off-by: Ingo Molnar <[email protected]>

Ingo


Attachments:
max-sectors-2.6.9-rc1-bk14-A0 (6.48 kB)

2004-09-08 10:11:43

by Andrew Morton

Subject: Re: [patch] max-sectors-2.6.9-rc1-bk14-A0

Ingo Molnar <[email protected]> wrote:
>
> the attached patch introduces two new /sys/block values:
>
> /sys/block/*/queue/max_hw_sectors_kb
> /sys/block/*/queue/max_sectors_kb
>
> max_hw_sectors_kb is the maximum that the driver can handle and is
> readonly. max_sectors_kb is the current max_sectors value and can be
> tuned by root. PAGE_SIZE granularity is enforced.
>
> It's all locking-safe and all affected layered drivers have been updated
> as well. The patch has been in testing for a couple of weeks already as
> part of the voluntary-preempt patches and it works just fine - people
> use it to reduce IDE IRQ handling latencies.

Could you remind us what the cause of the latency is, and its duration?

(Am vaguely surprised that it's an issue at, what, 32 pages? Is something
sucky happening?)

2004-09-08 10:18:36

by Jens Axboe

Subject: Re: [patch] max-sectors-2.6.9-rc1-bk14-A0

On Wed, Sep 08 2004, Ingo Molnar wrote:
>
> this is a re-send of the max-sectors patch against 2.6.9-rc1-bk14.
>
> the attached patch introduces two new /sys/block values:
>
> /sys/block/*/queue/max_hw_sectors_kb
> /sys/block/*/queue/max_sectors_kb
>
> max_hw_sectors_kb is the maximum that the driver can handle and is
> readonly. max_sectors_kb is the current max_sectors value and can be
> tuned by root. PAGE_SIZE granularity is enforced.
>
> It's all locking-safe and all affected layered drivers have been updated
> as well. The patch has been in testing for a couple of weeks already as
> part of the voluntary-preempt patches and it works just fine - people
> use it to reduce IDE IRQ handling latencies. Please apply.

Wasn't the move of the ide_lock grabbing enough to solve this problem by
itself?


--
Jens Axboe

2004-09-08 10:48:15

by Ingo Molnar

Subject: Re: [patch] max-sectors-2.6.9-rc1-bk14-A0


* Andrew Morton <[email protected]> wrote:

> > the attached patch introduces two new /sys/block values:
> >
> > /sys/block/*/queue/max_hw_sectors_kb
> > /sys/block/*/queue/max_sectors_kb
> >
> > max_hw_sectors_kb is the maximum that the driver can handle and is
> > readonly. max_sectors_kb is the current max_sectors value and can be
> > tuned by root. PAGE_SIZE granularity is enforced.
> >
> > It's all locking-safe and all affected layered drivers have been updated
> > as well. The patch has been in testing for a couple of weeks already as
> > part of the voluntary-preempt patches and it works just fine - people
> > use it to reduce IDE IRQ handling latencies.
>
> Could you remind us what the cause of the latency is, and its
> duration?
>
> (Am vaguely surprised that it's an issue at, what, 32 pages? Is
> something sucky happening?)

yes, we are touching and completing 32 (or 64?) completely cache-cold
structures: the page and the bio, which sit on two separate cachelines a
pop. We also call into the mempool code for every bio completed. With
the default max_sectors, people have reported hardirq latencies of 1
msec or more. You can see a trace of a 600+ usec latency at:

http://krustophenia.net/testresults.php?dataset=2.6.8-rc4-bk3-O7#/var/www/2.6.8-rc4-bk3-O7/ide_irq_latency_trace.txt

here it's ~8 usecs per page completion - with 64 pages this completion
activity alone is 512 usecs. So people want to have a way to tune down
the maximum overhead in hardirq handlers. Users of the VP patches have
reported good results (== no significant performance impact) with
max_sectors at 32KB (8 pages) or even 16KB (4 pages).
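
A back-of-the-envelope check of those figures (the ~8 usec per-page cost is taken from the trace above; the page counts are the ones mentioned in this thread):

#include <stdio.h>

int main(void)
{
	/* ~8 usec per completed page, as seen in the trace above */
	const double usec_per_page = 8.0;
	/* request sizes (in pages) mentioned in the mail: 64 and 32 for
	 * the default-ish cases, 8 (32KB) and 4 (16KB) for the tuned ones */
	const int pages[] = { 64, 32, 8, 4 };
	int i;

	for (i = 0; i < 4; i++)
		printf("%2d pages (%3d KB): ~%3.0f usec spent completing in hardirq\n",
		       pages[i], pages[i] * 4, pages[i] * usec_per_page);
	return 0;
}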

Ingo

2004-09-08 10:52:47

by Ingo Molnar

Subject: Re: [patch] max-sectors-2.6.9-rc1-bk14-A0


* Jens Axboe <[email protected]> wrote:

> Wasn't the move of the ide_lock grabbing enough to solve this problem
> by itself?

yes and no. It does solve it for the specific case of the
voluntary-preemption patches: there, hardirqs can run in separate kernel
threads which are preemptible (no HARDIRQ_OFFSET). In stock Linux,
hardirqs are not preemptible, so the earlier dropping of ide_lock
doesn't solve the latency.

so in the upstream kernel the only solution is to reduce the size of IO.
(I'll push the hardirq patches later on too but their acceptance should
not hinder people in achieving good latencies.) It can be useful for
other reasons too to reduce IO, so why not? The patch certainly causes
no overhead anywhere in the block layer and people are happy with it.

Ingo

2004-09-08 11:07:01

by Jens Axboe

Subject: Re: [patch] max-sectors-2.6.9-rc1-bk14-A0

On Wed, Sep 08 2004, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > Wasn't the move of the ide_lock grabbing enough to solve this problem
> > by itself?
>
> yes and no. It does solve it for the specific case of the
> voluntary-preemption patches: there, hardirqs can run in separate kernel
> threads which are preemptible (no HARDIRQ_OFFSET). In stock Linux,
> hardirqs are not preemptible, so the earlier dropping of ide_lock
> doesn't solve the latency.
>
> so in the upstream kernel the only solution is to reduce the size of IO.
> (I'll push the hardirq patches later on too but their acceptance should
> not hinder people in achieving good latencies.) It can be useful for
> other reasons too to reduce IO, so why not? The patch certainly causes
> no overhead anywhere in the block layer and people are happy with it.

I'm not particularly against it, I was just curious. The splitting of
max_sectors into a max_hw_sectors is something we need to do anyways, so
I'm quite fine with the patch. You can add my signed-off-by too.

--
Jens Axboe

2004-09-08 11:45:57

by Andrew Morton

Subject: Re: [patch] max-sectors-2.6.9-rc1-bk14-A0

Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > > the attached patch introduces two new /sys/block values:
> > >
> > > /sys/block/*/queue/max_hw_sectors_kb
> > > /sys/block/*/queue/max_sectors_kb
> > >
> > > max_hw_sectors_kb is the maximum that the driver can handle and is
> > > readonly. max_sectors_kb is the current max_sectors value and can be
> > > tuned by root. PAGE_SIZE granularity is enforced.
> > >
> > > It's all locking-safe and all affected layered drivers have been updated
> > > as well. The patch has been in testing for a couple of weeks already as
> > > part of the voluntary-preempt patches and it works just fine - people
> > > use it to reduce IDE IRQ handling latencies.
> >
> > Could you remind us what the cause of the latency is, and its
> > duration?
> >
> > (Am vaguely surprised that it's an issue at, what, 32 pages? Is
> > something sucky happening?)
>
> yes, we are touching and completing 32 (or 64?) completely cache-cold
> structures: the page and the bio, which sit on two separate cachelines a
> pop. We also call into the mempool code for every bio completed. With
> the default max_sectors, people have reported hardirq latencies of 1
> msec or more. You can see a trace of a 600+ usec latency at:
>
> http://krustophenia.net/testresults.php?dataset=2.6.8-rc4-bk3-O7#/var/www/2.6.8-rc4-bk3-O7/ide_irq_latency_trace.txt
>
> here it's ~8 usecs per page completion - with 64 pages this completion
> activity alone is 512 usecs. So people want to have a way to tune down
> the maximum overhead in hardirq handlers. Users of the VP patches have
> reported good results (== no significant performance impact) with
> max_sectors at 32KB (8 pages) or even 16KB (4 pages).

Still sounds a bit odd. How many cachelines can that CPU fetch in 8 usecs?
Several tens at least?

2004-09-08 12:39:24

by Ingo Molnar

Subject: Re: [patch] max-sectors-2.6.9-rc1-bk14-A0


* Andrew Morton <[email protected]> wrote:

> Still sounds a bit odd. How many cachelines can that CPU fetch in 8
> usecs? Several tens at least?

the CPU in question is a 600 MHz C3, so it should be dozens. Assuming a
conservative 200 nsec cacheline-fetch latency and 8 nsecs per word
bursted, a 32-byte (8-word) cacheline could take ~264 nsecs. So with ~8
cachelines touched, that could only explain 2-3 usecs of overhead. The
bio itself is not laid out optimally: the bio and the vector are on two
different cachelines, plus we have the buffer_head too (in the ext3
case) - all on different cachelines.
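
In numbers, treating the 200 nsec fetch latency and the 8 nsec-per-word burst rate as rough assumptions:

#include <stdio.h>

int main(void)
{
	const int latency_ns  = 200;	/* assumed cacheline-fetch latency */
	const int ns_per_word = 8;	/* assumed burst rate, 4-byte words */
	const int words       = 32 / 4;	/* 32-byte cacheline */
	int line_ns = latency_ns + ns_per_word * words;	/* ~264 ns */

	printf("one cacheline: ~%d ns, ~8 cachelines: ~%d ns (~%d usec)\n",
	       line_ns, 8 * line_ns, (8 * line_ns) / 1000);
	return 0;
}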

but the latency does happen and it happens even with tracing turned
completely off.

The main overhead is the completion path for a single page, which goes
like:

__end_that_request_first()
bio_endio()
end_bio_bh_io_sync()
journal_end_buffer_io_sync()
unlock_buffer()
wake_up_buffer()
bio_put()
bio_destructor()
mempool_free()
mempool_free_slab()
kmem_cache_free()
mempool_free()
mempool_free_slab()
kmem_cache_free()

this is quite fat just from an instruction-count POV: 14 functions with
at least 20 instructions each, amounting to ~300 instructions per
iteration - that alone is a sizable icache footprint.

Plus we could be thrashing the cache by touching at least 3 new
cachelines per iteration - which is 192 new (dirty) cachelines for the
full completion, or ~6K of new L1 cache contents. With 128-byte
cachelines it's much worse: at least 24K worth of new cache contents.
I'd suggest at least attempting to merge bio and bio->bi_io_vec into a
single cacheline for the simpler cases (see the sketch below).
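
As an illustration of that layout idea - a sketch with made-up type and field names, not the actual 2.6 <linux/bio.h> definitions:

#include <stdio.h>

/* Illustrative only: co-allocate a small inline vector with the bio so
 * the common single-segment case stays within the bio's own cacheline(s)
 * instead of pulling in a separate, cache-cold bio_vec array. */
struct example_bio_vec {
	void		*bv_page;
	unsigned int	 bv_len;
	unsigned int	 bv_offset;
};

struct example_bio {
	unsigned long		 bi_sector;
	unsigned long		 bi_flags;
	unsigned short		 bi_vcnt;
	unsigned short		 bi_idx;
	struct example_bio_vec	*bi_io_vec;	/* points at bi_inline_vec
						 * for 1-segment bios */
	struct example_bio_vec	 bi_inline_vec[1];
};

int main(void)
{
	printf("sizeof(struct example_bio) = %zu bytes\n",
	       sizeof(struct example_bio));
	return 0;
}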

another detail is the SLAB's FIFO logic memmove-ing the full array:

0.184ms (+0.000ms): kmem_cache_free (mempool_free)
0.185ms (+0.000ms): cache_flusharray (kmem_cache_free)
0.185ms (+0.000ms): free_block (cache_flusharray)
0.200ms (+0.014ms): memmove (cache_flusharray)
0.200ms (+0.000ms): memcpy (memmove)

that's 14 usecs a pop and quite likely a fair amount of new dirty cache
contents.
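
For reference, a simplified sketch of that FIFO array-flush pattern, with made-up names and sizes - not the actual mm/slab.c code:

#include <stdio.h>
#include <string.h>

/* Simplified sketch of a FIFO per-CPU object cache flush, loosely
 * modelled on what the trace shows: free a batch of the oldest entries,
 * then memmove the survivors down to the front of the array. */
struct array_cache_sketch {
	unsigned int avail;		/* entries currently cached */
	unsigned int batchcount;	/* how many to flush at once */
	void *entry[120];
};

static void give_back_to_slab(void *obj)
{
	(void)obj;			/* stand-in for freeing the object */
}

static void flush_array(struct array_cache_sketch *ac)
{
	unsigned int i;

	for (i = 0; i < ac->batchcount; i++)
		give_back_to_slab(ac->entry[i]);

	/* sliding the survivors down touches the whole array - this is
	 * the memmove that shows up as +0.014ms in the trace above */
	memmove(ac->entry, &ac->entry[ac->batchcount],
		sizeof(void *) * (ac->avail - ac->batchcount));
	ac->avail -= ac->batchcount;
}

int main(void)
{
	static int dummy[120];
	struct array_cache_sketch ac = { .avail = 120, .batchcount = 60 };
	unsigned int i;

	for (i = 0; i < ac.avail; i++)
		ac.entry[i] = &dummy[i];
	flush_array(&ac);
	printf("%u entries left after the flush\n", ac.avail);
	return 0;
}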

The building of the sg-list of the next DMA request was responsible for
some of the latency as well:

0.571ms (+0.000ms): ide_build_dmatable (ide_start_dma)
0.571ms (+0.000ms): ide_build_sglist (ide_build_dmatable)
0.572ms (+0.000ms): blk_rq_map_sg (ide_build_sglist)
0.593ms (+0.021ms): do_IRQ (common_interrupt)
0.594ms (+0.000ms): mask_and_ack_8259A (do_IRQ)

this completion codepath isn't something people have really
profiled/measured before, because it's in an irqs-off hardirq path that
triggers relatively rarely. But its contribution to scheduling latencies
can be quite high.

Ingo