blk_recalc_rq_segments assumes that any segments can be merged in the
case of BIOVEC_VIRT_MERGEABLE && !BIOVEC_VIRT_OVERSIZE. However, an
IOMMU can't merge segments if the total length of the segments is
larger than max_segment_size (the LLD restriction).
Due to this bug, an LLD may get a larger number of segments than
nr_hw_segments because the block layer puts more segments into a request
than it should.
This bug could happen on alpha, parisc, and sparc, which use VMERGE.
Like blk_hw_contig_segment() does, this patch uses hw_seg_size for
simplicity, which is a bit larger than an exact value (we don't need
BIOVEC_VIRT_START_SIZE here).
Signed-off-by: FUJITA Tomonori <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: David Miller <[email protected]>
---
block/blk-merge.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..39a22f8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -83,7 +83,8 @@ void blk_recalc_rq_segments(struct request *rq)
continue;
}
new_segment:
- if (BIOVEC_VIRT_MERGEABLE(bvprv, bv) &&
+ if (hw_seg_size + bv->bv_len <= q->max_segment_size &&
+ BIOVEC_VIRT_MERGEABLE(bvprv, bv) &&
!BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len))
hw_seg_size += bv->bv_len;
else {
--
1.5.5.GIT
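As a rough userspace sketch of the miscount described in the changelog above (the 64 KB max_segment_size, the 128 KB vmerge size limit, the 40 KB biovec lengths and the count_hw_segs() helper are all hypothetical, not taken from real hardware): the old check merges everything that stays under the vmerge limit and reports one hw segment, while an IOMMU that honours max_segment_size has to emit at least two; the patched check reports three, deliberately conservative, matching the "a bit larger than an exact value" note.

/* Sketch only: mimics the accounting loop in userspace, not kernel code. */
#include <stdio.h>

#define MAX_SEGMENT_SIZE (64u * 1024)	/* assumed LLD limit (q->max_segment_size) */
#define VMERGE_MAX_SIZE  (128u * 1024)	/* assumed BIO_VMERGE_MAX_SIZE */

static unsigned int count_hw_segs(const unsigned int *len, int n, int patched)
{
	unsigned int hw_seg_size = 0, nr_hw_segs = 0;

	for (int i = 0; i < n; i++) {
		/* every biovec is taken to be BIOVEC_VIRT_MERGEABLE here */
		if (nr_hw_segs &&
		    hw_seg_size + len[i] <= VMERGE_MAX_SIZE &&
		    (!patched || hw_seg_size + len[i] <= MAX_SEGMENT_SIZE)) {
			hw_seg_size += len[i];	/* merged into the current hw segment */
		} else {
			hw_seg_size = len[i];	/* open a new hw segment */
			nr_hw_segs++;
		}
	}
	return nr_hw_segs;
}

int main(void)
{
	const unsigned int len[] = { 40 * 1024, 40 * 1024, 40 * 1024 };

	printf("old check:     %u hw segment(s)\n", count_hw_segs(len, 3, 0));
	printf("patched check: %u hw segment(s)\n", count_hw_segs(len, 3, 1));
	return 0;
}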
On Tue, 15 Jul 2008, FUJITA Tomonori wrote:
> blk_recalc_rq_segments assumes that any segments can be merged in the
> case of BIOVEC_VIRT_MERGEABLE && !BIOVEC_VIRT_OVERSIZE. However, an
> IOMMU can't merge segments if the total length of the segments is
> larger than max_segment_size (the LLD restriction).
>
> Due to this bug, a LLD may get the larger number of segments than
> nr_hw_segments because the block layer puts more segments in a request
> than it should do.
>
> This bug could happen on alpha, parisc, and sparc, which use VMERGE.
Parisc doesn't use virtual merge accounting (there is a variable for it, but
it's always 0). On sparc64 it is broken anyway, with or without your patch.
And alpha alone doesn't justify substantial code bloat in the generic block
layer. So I propose this patch to drop it altogether.
A further patch could be made to drop bi_hw_segments, nr_hw_segments and
similar --- they have no use for anything except alpha, and because there
are few alpha computers and alpha is a discontinued platform, the logic for
predicting the number of hardware segments can't be well tested and
maintained.
> Like blk_hw_contig_segment() does, this patch uses hw_seg_size for
> simplicity, which is a bit larger than an exact value (we don't need
> BIOVEC_VIRT_START_SIZE here).
>
> Signed-off-by: FUJITA Tomonori <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Mikulas Patocka <[email protected]>
> Cc: David Miller <[email protected]>
> ---
> block/blk-merge.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 5efc9e7..39a22f8 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -83,7 +83,8 @@ void blk_recalc_rq_segments(struct request *rq)
> continue;
> }
> new_segment:
> - if (BIOVEC_VIRT_MERGEABLE(bvprv, bv) &&
> + if (hw_seg_size + bv->bv_len <= q->max_segment_size &&
> + BIOVEC_VIRT_MERGEABLE(bvprv, bv) &&
> !BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len))
> hw_seg_size += bv->bv_len;
> else {
>
Drop virtual merge accounting in the bio layer (not the virtual merging
itself). It is used only on Sparc64 (where it is broken and needs to be
disabled) and on Alpha. This logic is very hard to maintain (the generic
code in block/blk-merge.c attempts to predict how an architecture-specific
IOMMU will merge the segments), and the Alpha architecture alone doesn't
justify the maintenance and code-bloat cost.
Signed-off-by: Mikulas Patocka <[email protected]>
---
arch/parisc/kernel/setup.c | 5 ---
arch/x86/kernel/pci-dma.c | 6 ---
block/blk-merge.c | 72 +++------------------------------------------
fs/bio.c | 6 +--
include/asm-alpha/io.h | 3 -
include/asm-ia64/io.h | 26 +---------------
include/asm-parisc/io.h | 6 ---
include/asm-powerpc/io.h | 7 ----
include/asm-sparc64/io.h | 1
include/asm-x86/io_64.h | 3 -
include/linux/bio.h | 15 ---------
11 files changed, 10 insertions(+), 140 deletions(-)
Index: linux-2.6.26-fast/arch/parisc/kernel/setup.c
===================================================================
--- linux-2.6.26-fast.orig/arch/parisc/kernel/setup.c 2008-07-15 14:19:01.000000000 +0200
+++ linux-2.6.26-fast/arch/parisc/kernel/setup.c 2008-07-15 14:19:18.000000000 +0200
@@ -57,11 +57,6 @@ int parisc_bus_is_phys __read_mostly = 1
EXPORT_SYMBOL(parisc_bus_is_phys);
#endif
-/* This sets the vmerge boundary and size, it's here because it has to
- * be available on all platforms (zero means no-virtual merging) */
-unsigned long parisc_vmerge_boundary = 0;
-unsigned long parisc_vmerge_max_size = 0;
-
void __init setup_cmdline(char **cmdline_p)
{
extern unsigned int boot_args[];
Index: linux-2.6.26-fast/arch/x86/kernel/pci-dma.c
===================================================================
--- linux-2.6.26-fast.orig/arch/x86/kernel/pci-dma.c 2008-07-15 14:20:15.000000000 +0200
+++ linux-2.6.26-fast/arch/x86/kernel/pci-dma.c 2008-07-15 14:20:37.000000000 +0200
@@ -30,11 +30,6 @@ int no_iommu __read_mostly;
/* Set this to 1 if there is a HW IOMMU in the system */
int iommu_detected __read_mostly = 0;
-/* This tells the BIO block layer to assume merging. Default to off
- because we cannot guarantee merging later. */
-int iommu_bio_merge __read_mostly = 0;
-EXPORT_SYMBOL(iommu_bio_merge);
-
dma_addr_t bad_dma_address __read_mostly = 0;
EXPORT_SYMBOL(bad_dma_address);
@@ -151,7 +146,6 @@ static __init int iommu_setup(char *p)
}
if (!strncmp(p, "biomerge", 8)) {
- iommu_bio_merge = 4096;
iommu_merge = 1;
force_iommu = 1;
}
Index: linux-2.6.26-fast/block/blk-merge.c
===================================================================
--- linux-2.6.26-fast.orig/block/blk-merge.c 2008-07-15 14:22:25.000000000 +0200
+++ linux-2.6.26-fast/block/blk-merge.c 2008-07-15 14:36:06.000000000 +0200
@@ -66,7 +66,7 @@ void blk_recalc_rq_segments(struct reque
*/
high = page_to_pfn(bv->bv_page) > q->bounce_pfn;
if (high || highprv)
- goto new_hw_segment;
+ goto new_segment;
if (cluster) {
if (seg_size + bv->bv_len > q->max_segment_size)
goto new_segment;
@@ -74,8 +74,6 @@ void blk_recalc_rq_segments(struct reque
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bv))
goto new_segment;
- if (BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len))
- goto new_hw_segment;
seg_size += bv->bv_len;
hw_seg_size += bv->bv_len;
@@ -83,17 +81,11 @@ void blk_recalc_rq_segments(struct reque
continue;
}
new_segment:
- if (BIOVEC_VIRT_MERGEABLE(bvprv, bv) &&
- !BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len))
- hw_seg_size += bv->bv_len;
- else {
-new_hw_segment:
- if (nr_hw_segs == 1 &&
- hw_seg_size > rq->bio->bi_hw_front_size)
- rq->bio->bi_hw_front_size = hw_seg_size;
- hw_seg_size = BIOVEC_VIRT_START_SIZE(bv) + bv->bv_len;
- nr_hw_segs++;
- }
+ if (nr_hw_segs == 1 &&
+ hw_seg_size > rq->bio->bi_hw_front_size)
+ rq->bio->bi_hw_front_size = hw_seg_size;
+ hw_seg_size = bv->bv_len;
+ nr_hw_segs++;
nr_phys_segs++;
bvprv = bv;
@@ -146,22 +138,6 @@ static int blk_phys_contig_segment(struc
return 0;
}
-static int blk_hw_contig_segment(struct request_queue *q, struct bio *bio,
- struct bio *nxt)
-{
- if (!bio_flagged(bio, BIO_SEG_VALID))
- blk_recount_segments(q, bio);
- if (!bio_flagged(nxt, BIO_SEG_VALID))
- blk_recount_segments(q, nxt);
- if (!BIOVEC_VIRT_MERGEABLE(__BVEC_END(bio), __BVEC_START(nxt)) ||
- BIOVEC_VIRT_OVERSIZE(bio->bi_hw_back_size + nxt->bi_hw_front_size))
- return 0;
- if (bio->bi_hw_back_size + nxt->bi_hw_front_size > q->max_segment_size)
- return 0;
-
- return 1;
-}
-
/*
* map a request to scatterlist, return number of sg entries setup. Caller
* must make sure sg can hold rq->nr_phys_segments entries
@@ -317,18 +293,6 @@ int ll_back_merge_fn(struct request_queu
if (!bio_flagged(bio, BIO_SEG_VALID))
blk_recount_segments(q, bio);
len = req->biotail->bi_hw_back_size + bio->bi_hw_front_size;
- if (BIOVEC_VIRT_MERGEABLE(__BVEC_END(req->biotail), __BVEC_START(bio))
- && !BIOVEC_VIRT_OVERSIZE(len)) {
- int mergeable = ll_new_mergeable(q, req, bio);
-
- if (mergeable) {
- if (req->nr_hw_segments == 1)
- req->bio->bi_hw_front_size = len;
- if (bio->bi_hw_segments == 1)
- bio->bi_hw_back_size = len;
- }
- return mergeable;
- }
return ll_new_hw_segment(q, req, bio);
}
@@ -356,18 +320,6 @@ int ll_front_merge_fn(struct request_que
blk_recount_segments(q, bio);
if (!bio_flagged(req->bio, BIO_SEG_VALID))
blk_recount_segments(q, req->bio);
- if (BIOVEC_VIRT_MERGEABLE(__BVEC_END(bio), __BVEC_START(req->bio)) &&
- !BIOVEC_VIRT_OVERSIZE(len)) {
- int mergeable = ll_new_mergeable(q, req, bio);
-
- if (mergeable) {
- if (bio->bi_hw_segments == 1)
- bio->bi_hw_front_size = len;
- if (req->nr_hw_segments == 1)
- req->biotail->bi_hw_back_size = len;
- }
- return mergeable;
- }
return ll_new_hw_segment(q, req, bio);
}
@@ -399,18 +351,6 @@ static int ll_merge_requests_fn(struct r
return 0;
total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
- if (blk_hw_contig_segment(q, req->biotail, next->bio)) {
- int len = req->biotail->bi_hw_back_size +
- next->bio->bi_hw_front_size;
- /*
- * propagate the combined length to the end of the requests
- */
- if (req->nr_hw_segments == 1)
- req->bio->bi_hw_front_size = len;
- if (next->nr_hw_segments == 1)
- next->biotail->bi_hw_back_size = len;
- total_hw_segments--;
- }
if (total_hw_segments > q->max_hw_segments)
return 0;
Index: linux-2.6.26-fast/include/asm-alpha/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-alpha/io.h 2008-07-15 14:14:39.000000000 +0200
+++ linux-2.6.26-fast/include/asm-alpha/io.h 2008-07-15 14:16:15.000000000 +0200
@@ -96,9 +96,6 @@ static inline dma_addr_t __deprecated is
return page_to_phys(page);
}
-/* This depends on working iommu. */
-#define BIO_VMERGE_BOUNDARY (alpha_mv.mv_pci_tbi ? PAGE_SIZE : 0)
-
/* Maximum PIO space address supported? */
#define IO_SPACE_LIMIT 0xffff
Index: linux-2.6.26-fast/include/asm-ia64/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-ia64/io.h 2008-07-15 14:14:45.000000000 +0200
+++ linux-2.6.26-fast/include/asm-ia64/io.h 2008-07-15 14:18:24.000000000 +0200
@@ -430,30 +430,8 @@ extern void memcpy_fromio(void *dst, con
extern void memcpy_toio(volatile void __iomem *dst, const void *src, long n);
extern void memset_io(volatile void __iomem *s, int c, long n);
-# endif /* __KERNEL__ */
-
-/*
- * Enabling BIO_VMERGE_BOUNDARY forces us to turn off I/O MMU bypassing. It is said that
- * BIO-level virtual merging can give up to 4% performance boost (not verified for ia64).
- * On the other hand, we know that I/O MMU bypassing gives ~8% performance improvement on
- * SPECweb-like workloads on zx1-based machines. Thus, for now we favor I/O MMU bypassing
- * over BIO-level virtual merging.
- */
extern unsigned long ia64_max_iommu_merge_mask;
-#if 1
-#define BIO_VMERGE_BOUNDARY 0
-#else
-/*
- * It makes no sense at all to have this BIO_VMERGE_BOUNDARY macro here. Should be
- * replaced by dma_merge_mask() or something of that sort. Note: the only way
- * BIO_VMERGE_BOUNDARY is used is to mask off bits. Effectively, our definition gets
- * expanded into:
- *
- * addr & ((ia64_max_iommu_merge_mask + 1) - 1) == (addr & ia64_max_iommu_vmerge_mask)
- *
- * which is precisely what we want.
- */
-#define BIO_VMERGE_BOUNDARY (ia64_max_iommu_merge_mask + 1)
-#endif
+
+# endif /* __KERNEL__ */
#endif /* _ASM_IA64_IO_H */
Index: linux-2.6.26-fast/include/asm-parisc/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-parisc/io.h 2008-07-15 14:14:52.000000000 +0200
+++ linux-2.6.26-fast/include/asm-parisc/io.h 2008-07-15 14:18:58.000000000 +0200
@@ -4,12 +4,6 @@
#include <linux/types.h>
#include <asm/pgtable.h>
-extern unsigned long parisc_vmerge_boundary;
-extern unsigned long parisc_vmerge_max_size;
-
-#define BIO_VMERGE_BOUNDARY parisc_vmerge_boundary
-#define BIO_VMERGE_MAX_SIZE parisc_vmerge_max_size
-
#define virt_to_phys(a) ((unsigned long)__pa(a))
#define phys_to_virt(a) __va(a)
#define virt_to_bus virt_to_phys
Index: linux-2.6.26-fast/include/asm-powerpc/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-powerpc/io.h 2008-07-15 14:14:56.000000000 +0200
+++ linux-2.6.26-fast/include/asm-powerpc/io.h 2008-07-15 14:19:37.000000000 +0200
@@ -683,13 +683,6 @@ static inline void * phys_to_virt(unsign
*/
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
-/* We do NOT want virtual merging, it would put too much pressure on
- * our iommu allocator. Instead, we want drivers to be smart enough
- * to coalesce sglists that happen to have been mapped in a contiguous
- * way by the iommu
- */
-#define BIO_VMERGE_BOUNDARY 0
-
/*
* 32 bits still uses virt_to_bus() for it's implementation of DMA
* mappings se we have to keep it defined here. We also have some old
Index: linux-2.6.26-fast/include/asm-sparc64/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-sparc64/io.h 2008-07-15 14:15:04.000000000 +0200
+++ linux-2.6.26-fast/include/asm-sparc64/io.h 2008-07-15 14:19:52.000000000 +0200
@@ -16,7 +16,6 @@
/* BIO layer definitions. */
extern unsigned long kern_base, kern_size;
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
-#define BIO_VMERGE_BOUNDARY 8192
static inline u8 _inb(unsigned long addr)
{
Index: linux-2.6.26-fast/include/asm-x86/io_64.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-x86/io_64.h 2008-07-15 14:15:08.000000000 +0200
+++ linux-2.6.26-fast/include/asm-x86/io_64.h 2008-07-15 14:20:11.000000000 +0200
@@ -304,9 +304,6 @@ void memset_io(volatile void __iomem *a,
#define flush_write_buffers()
-extern int iommu_bio_merge;
-#define BIO_VMERGE_BOUNDARY iommu_bio_merge
-
/*
* Convert a virtual cached pointer to an uncached pointer
*/
Index: linux-2.6.26-fast/include/linux/bio.h
===================================================================
--- linux-2.6.26-fast.orig/include/linux/bio.h 2008-07-15 14:15:13.000000000 +0200
+++ linux-2.6.26-fast/include/linux/bio.h 2008-07-15 14:44:58.000000000 +0200
@@ -26,21 +26,8 @@
#ifdef CONFIG_BLOCK
-/* Platforms may set this to teach the BIO layer about IOMMU hardware. */
#include <asm/io.h>
-#if defined(BIO_VMERGE_MAX_SIZE) && defined(BIO_VMERGE_BOUNDARY)
-#define BIOVEC_VIRT_START_SIZE(x) (bvec_to_phys(x) & (BIO_VMERGE_BOUNDARY - 1))
-#define BIOVEC_VIRT_OVERSIZE(x) ((x) > BIO_VMERGE_MAX_SIZE)
-#else
-#define BIOVEC_VIRT_START_SIZE(x) 0
-#define BIOVEC_VIRT_OVERSIZE(x) 0
-#endif
-
-#ifndef BIO_VMERGE_BOUNDARY
-#define BIO_VMERGE_BOUNDARY 0
-#endif
-
#define BIO_DEBUG
#ifdef BIO_DEBUG
@@ -235,8 +222,6 @@ static inline void *bio_data(struct bio
((bvec_to_phys((vec1)) + (vec1)->bv_len) == bvec_to_phys((vec2)))
#endif
-#define BIOVEC_VIRT_MERGEABLE(vec1, vec2) \
- ((((bvec_to_phys((vec1)) + (vec1)->bv_len) | bvec_to_phys((vec2))) & (BIO_VMERGE_BOUNDARY - 1)) == 0)
#define __BIO_SEG_BOUNDARY(addr1, addr2, mask) \
(((addr1) | (mask)) == (((addr2) - 1) | (mask)))
#define BIOVEC_SEG_BOUNDARY(q, b1, b2) \
Index: linux-2.6.26-fast/fs/bio.c
===================================================================
--- linux-2.6.26-fast.orig/fs/bio.c 2008-07-15 14:43:03.000000000 +0200
+++ linux-2.6.26-fast/fs/bio.c 2008-07-15 14:43:58.000000000 +0200
@@ -352,8 +352,7 @@ static int __bio_add_page(struct request
*/
while (bio->bi_phys_segments >= q->max_phys_segments
- || bio->bi_hw_segments >= q->max_hw_segments
- || BIOVEC_VIRT_OVERSIZE(bio->bi_size)) {
+ || bio->bi_hw_segments >= q->max_hw_segments) {
if (retried_segments)
return 0;
@@ -390,8 +389,7 @@ static int __bio_add_page(struct request
}
/* If we may be able to merge these biovecs, force a recount */
- if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec) ||
- BIOVEC_VIRT_MERGEABLE(bvec-1, bvec)))
+ if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
bio->bi_flags &= ~(1 << BIO_SEG_VALID);
bio->bi_vcnt++;
On Tue, 15 Jul 2008 09:37:05 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> On Tue, 15 Jul 2008, FUJITA Tomonori wrote:
>
> > blk_recalc_rq_segments assumes that any segments can be merged in the
> > case of BIOVEC_VIRT_MERGEABLE && !BIOVEC_VIRT_OVERSIZE. However, an
> > IOMMU can't merge segments if the total length of the segments is
> > larger than max_segment_size (the LLD restriction).
> >
> > Due to this bug, a LLD may get the larger number of segments than
> > nr_hw_segments because the block layer puts more segments in a request
> > than it should do.
> >
> > This bug could happen on alpha, parisc, and sparc, which use VMERGE.
>
> Parisc doesn't use virtual merge accounting (there is variable for it but
> it's always 0).
Hmm, really? Looks like the PARISC IOMMUs (ccio-dma.c and sba_iommu.c) set
parisc_vmerge_boundary (CC'ed PARISC mailing list).
> On sparc64 it is broken anyway with or without your patch.
Yeah, we need to modify the SPARC64 IOMMU code (I'm not sure that it's
worth it). Right now, the best fix is setting BIO_VMERGE_BOUNDARY to 0.
> And alpha alone doesn't justify substantial code bloat in generic block
> layer. So I propose this patch to drop it at all.
Jens, what do you think about removing VMERGE code?
>>> This bug could happen on alpha, parisc, and sparc, which use VMERGE.
>>
>> Parisc doesn't use virtual merge accounting (there is variable for it but
>> it's always 0).
>
> Hmm, really? Looks like PARISC IOMMUs (ccio-dma.c and sba_iomm.c) set
> parisc_vmerge_boundary (CC'ed PARISC mailing list).
That's right, I looked only at arch and include.
>> On sparc64 it is broken anyway with or without your patch.
>
> Yeah, we need to modify SPARC64 IOMMU code (I'm not sure that it's
> worth). Right now, the best fix is setting BIO_VMERGE_BOUNDARY to 0.
Even if we fix it now, the question is: how long will it stay fixed? Until
someone makes another change to struct device that restricts boundaries on
some wacky hardware.
Mikulas
>> And alpha alone doesn't justify substantial code bloat in generic block
>> layer. So I propose this patch to drop it at all.
>
> Jens, what do you think about removing VMERGE code?
>
On Tue, 2008-07-15 at 23:20 +0900, FUJITA Tomonori wrote:
> On Tue, 15 Jul 2008 09:37:05 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > On Tue, 15 Jul 2008, FUJITA Tomonori wrote:
> >
> > > blk_recalc_rq_segments assumes that any segments can be merged in the
> > > case of BIOVEC_VIRT_MERGEABLE && !BIOVEC_VIRT_OVERSIZE. However, an
> > > IOMMU can't merge segments if the total length of the segments is
> > > larger than max_segment_size (the LLD restriction).
> > >
> > > Due to this bug, a LLD may get the larger number of segments than
> > > nr_hw_segments because the block layer puts more segments in a request
> > > than it should do.
> > >
> > > This bug could happen on alpha, parisc, and sparc, which use VMERGE.
> >
> > Parisc doesn't use virtual merge accounting (there is variable for it but
> > it's always 0).
>
> Hmm, really? Looks like PARISC IOMMUs (ccio-dma.c and sba_iomm.c) set
> parisc_vmerge_boundary (CC'ed PARISC mailing list).
That's correct. The size and boundary depend on the type of IOMMU (ccio
or sba) so the vmerge boundary parameters are set up in the iommu driver
code.
> > On sparc64 it is broken anyway with or without your patch.
>
> Yeah, we need to modify SPARC64 IOMMU code (I'm not sure that it's
> worth). Right now, the best fix is setting BIO_VMERGE_BOUNDARY to 0.
>
>
> > And alpha alone doesn't justify substantial code bloat in generic block
> > layer. So I propose this patch to drop it at all.
>
> Jens, what do you think about removing VMERGE code?
Actually, it's code I did.
There are plusses and minusses to all of this. The original vmerge code
was done for sparc ... mainly because the benefits of virtual merging
can offset the cost of having to use the iommu. However, most
architectures didn't use it. When I fixed it up to work for parisc (and
introduced the parameters) we were trying to demonstrate that using it
was feasible.
The idea behind vmerging is that assembling and programming sg lists is
expensive, so you want to do it once. Either in the iommu or in the
driver sg list, but not in both. There is evidence that it saves around
7% or so on drivers. However, for architectures that can do it, better
savings are made simply by lifting the iommu out of the I/O path (so
called bypass mode).
I suspect with IOMMUs coming back (and being unable to be bypassed) with
virtualisation, virtual merging might once more become a significant
value.
James
>>> On sparc64 it is broken anyway with or without your patch.
>>
>> Yeah, we need to modify SPARC64 IOMMU code (I'm not sure that it's
>> worth). Right now, the best fix is setting BIO_VMERGE_BOUNDARY to 0.
>>
>>
>>> And alpha alone doesn't justify substantial code bloat in generic block
>>> layer. So I propose this patch to drop it at all.
>>
>> Jens, what do you think about removing VMERGE code?
>
> Actually, it's code I did.
>
> There are plusses and minusses to all of this. The original vmerge code
> was done for sparc ... mainly because the benefits of virtual merging
> can offset the cost of having to use the iommu. However, most
> architectures didn't use it. When I fixed it up to work for parisc (and
> introduced the parameters) we were trying to demonstrate that using it
> was feasible.
>
> The idea behind vmerging is that assembling and programming sg lists is
> expensive, so you want to do it once. Either in the iommu or in the
> driver sg list, but not in both. There is evidence that it saves around
> 7% or so on drivers. However, for architectures that can do it, better
> savings are made simply by lifting the iommu out of the I/O path (so
> called bypass mode).
The problem is with vmerge accounting in the block layer (that is what I'm
proposing to remove), not with vmerge itself.
Vmerge accounting has advantages only if you have a device with a small
number of sg slots --- it allows the block layer to create a request that has
a higher number of segments than the device supports.
If you have a device with, for example, 1024 slots, virtual merge
accounting has no effect, because any request will fit into that size.
Even without virtual merge accounting, the virtual merging will happen, so
there will be no performance penalty for the controller --- the controller
will be programmed with exactly the same number of segments as if virtual
merge accounting were present. (There could even be a slight positive
performance effect if you remove the accounting, because you burn fewer CPU
cycles per request.)
If you have a device with a small number of sg slots (16 or so), vmerge
accounting can improve performance by creating requests with more than 16
segments --- the question is: is there any such device? And is the device
performance-sensitive? (i.e., isn't it such old hardware that no one
cares about performance anyway?)
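To put rough numbers on that case (all values hypothetical, not taken from any real controller): suppose max_hw_segments is 16, a request is built from 32 discontiguous pages, and the block layer predicts the IOMMU will vmerge them down to 12 hardware segments. A minimal sketch of the two policies:

#include <stdio.h>

int main(void)
{
	unsigned int max_hw_segments  = 16;	/* assumed small controller */
	unsigned int nr_phys_segments = 32;	/* discontiguous pages in the request */
	unsigned int nr_hw_segments   = 12;	/* block layer's vmerge prediction */

	/* with accounting, the predicted hw count is what gets checked ... */
	printf("with accounting:    %s\n",
	       nr_hw_segments <= max_hw_segments ? "one big request" : "request split");

	/* ... without it, the raw physical count must fit on its own */
	printf("without accounting: %s\n",
	       nr_phys_segments <= max_hw_segments ? "one big request" : "request split");
	return 0;
}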
> I suspect with IOMMUs coming back (and being unable to be bypassed) with
> virtualisation, virtual merging might once more become a significant
> value.
I suppose that no one would manufacture a new SCSI card with 16 or 32 sg
slots these days, so the accounting of hardware segments has no effect on
modern hardware.
Mikulas
> James
>
>
On Tue, 15 Jul 2008 10:37:58 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> >>> This bug could happen on alpha, parisc, and sparc, which use VMERGE.
> >>
> >> Parisc doesn't use virtual merge accounting (there is variable for it but
> >> it's always 0).
> >
> > Hmm, really? Looks like PARISC IOMMUs (ccio-dma.c and sba_iomm.c) set
> > parisc_vmerge_boundary (CC'ed PARISC mailing list).
>
> That's right, I looked only at arch and include.
>
> >> On sparc64 it is broken anyway with or without your patch.
> >
> > Yeah, we need to modify SPARC64 IOMMU code (I'm not sure that it's
> > worth). Right now, the best fix is setting BIO_VMERGE_BOUNDARY to 0.
>
> Even if we fix it now, the question is: how long it will stay fixed? Until
> someone makes another change to struct device that restricts boundaries on
> some wacky hardware.
I'm not sure how the boundary restriction of a device can break
the VMERGE accounting.
On Tue, 2008-07-15 at 11:24 -0400, Mikulas Patocka wrote:
> >>> On sparc64 it is broken anyway with or without your patch.
> >>
> >> Yeah, we need to modify SPARC64 IOMMU code (I'm not sure that it's
> >> worth). Right now, the best fix is setting BIO_VMERGE_BOUNDARY to 0.
> >>
> >>
> >>> And alpha alone doesn't justify substantial code bloat in generic block
> >>> layer. So I propose this patch to drop it at all.
> >>
> >> Jens, what do you think about removing VMERGE code?
> >
> > Actually, it's code I did.
> >
> > There are plusses and minusses to all of this. The original vmerge code
> > was done for sparc ... mainly because the benefits of virtual merging
> > can offset the cost of having to use the iommu. However, most
> > architectures didn't use it. When I fixed it up to work for parisc (and
> > introduced the parameters) we were trying to demonstrate that using it
> > was feasible.
> >
> > The idea behind vmerging is that assembling and programming sg lists is
> > expensive, so you want to do it once. Either in the iommu or in the
> > driver sg list, but not in both. There is evidence that it saves around
> > 7% or so on drivers. However, for architectures that can do it, better
> > savings are made simply by lifting the iommu out of the I/O path (so
> > called bypass mode).
>
> The problem is with vmerge accounting in block layer (that is what I'm
> proposing to remove), not with vmerge itself.
I don't think that's true ... otherwise parisc would be falling over
left right and centre.
> Vmerge accounting has advantages only if you have device with small amount
> of sg slots --- it allows the block layer to create request that has
> higher number of segments then the device.
This isn't really true either. A lot of devices with a high sg slot
count are still less efficient than an iommu for programming.
Even if they're not, on parisc we have to program the iommu, we can't
bypass, so it still makes sense to only have one large sg list (in the
iommu) and one small one (in the device). Having two large ones reduces
our I/O throughput because of the extra overhead.
> If you have device with for example 1024 slots, the virtual merge
> accounting has no effect, because the any request will fit into that size.
It's not about fitting a request, it's about efficient processing.
> Even without virtual merge accounting, the virtual merging will happen, so
> there will be no performance penalty for the controller --- the controller
> will be programmed with exactly the same number of segments as if virtual
> merge accounting was present. (there could be even slight positive
> performance effect if you remove accounting, because you burn less CPU
> cycles per request)
Yes there is. Both the iommu and the device have to traverse large SG
lists. This is where the inefficiency lies. On PA, we use exactly the
same number of iotlb slots whether virtual merging is in effect or not,
but the device has an internal loop to go over the list. It's that loop
that virtual merging reduces.
Since the virtual merge computation is done inline when the request is built
(by design), it doesn't really detract from the throughput and the cost
is pretty small.
> If you have device will small number of sg slots (16 or so), vmerge
> accounting can improve performance by creating requests with more than 16
> segments --- the question is: is there any such device? And is the device
> performance-sensitive? (i.e. isn't it such an old hardware where no one
> cares about performance anyway?)
>
> > I suspect with IOMMUs coming back (and being unable to be bypassed) with
> > virtualisation, virtual merging might once more become a significant
> > value.
>
> I suppose that no one would manufacture new SCSI card with 16 or 32 sg
> slots these days, so the accounting of hardware segments has no effect on
> modern hardware.
It's not about accounting, it's about performance. There's a cost in
every device to traversing large count sg lists. If you have to bear it
in the iommu (which is usually more efficient because the iotlb tends to
follow mmtlb optimisations) you can reduce the cost by eliminating it
from the device.
James
>> Even if we fix it now, the question is: how long it will stay fixed? Until
>> someone makes another change to struct device that restricts boundaries on
>> some wacky hardware.
>
> I'm not sure how the boundary restriction of a device can break
> the VMERGE accounting.
Because the block layer code doesn't know anything about the device, PCI
access restrictions and so on.
Someone already broke DaveM's Sparc64 merging by adding boundaries (it
was broken even before, but these boundary checks made it worse).
Mikulas
You are mixing two ideas here:
(1) virtual merging --- the IOMMU maps discontinuous segments into a continuous
area that it presents to the device.
(2) virtual merge accounting --- the block layer tries to guess how many
segments will be created by (1) and merges small requests into big ones.
The resulting requests are so big that they couldn't be processed by the
device if (1) weren't in effect.
>> The problem is with vmerge accounting in block layer (that is what I'm
>> proposing to remove), not with vmerge itself.
>
> I don't think that's true ... otherwise parisc would be falling over
> left right and centre.
>
>> Vmerge accounting has advantages only if you have device with small amount
>> of sg slots --- it allows the block layer to create request that has
>> higher number of segments then the device.
>
> This isn't really true either. A lot of devices with a high sg slot
> count are still less efficient than an iommu for programming.
--- for these devices virtual merging (1) improves performance, but
virtual merge accounting (2) doesn't.
> Even if they're not, on parisc we have to program the iommu, we can't
> bypass, so it still makes sense to only have one large sg list (in the
> iommu) and one small one (in the device). Having two large ones reduces
> our I/O throughput because of the extra overhead.
>
>> If you have device with for example 1024 slots, the virtual merge
>> accounting has no effect, because the any request will fit into that size.
>
> It's not about fitting a request, it's about efficient processing.
Virtual merge accounting (2) is about fitting a request. It is a block layer
technique.
>> Even without virtual merge accounting, the virtual merging will happen, so
>> there will be no performance penalty for the controller --- the controller
>> will be programmed with exactly the same number of segments as if virtual
>> merge accounting was present. (there could be even slight positive
>> performance effect if you remove accounting, because you burn less CPU
>> cycles per request)
>
> Yes there is. Both the iommu and the device have to traverse large SG
> lists. This is where the inefficiency lies. On PA, we use exactly the
> same number of iotlb slots whether virtual merging is in effect or not,
> but the device has an internal loop to go over the list. It's that loop
> that virtual merging reduces.
>
> Since the virtual merge computation is in line when the request is built
> (by design) it doesn't really detract from the throughput and the cost
> is pretty small.
The purpose of (1) virtual merging is to save the device's sg slots. The
purpose of (2) virtual merge accounting is to allow the block layer to build
larger requests. If you remove virtual merge accounting, it will cause no
increase in the number of sg slots used.
>>> I suspect with IOMMUs coming back (and being unable to be bypassed) with
>>> virtualisation, virtual merging might once more become a significant
>>> value.
>>
>> I suppose that no one would manufacture new SCSI card with 16 or 32 sg
>> slots these days, so the accounting of hardware segments has no effect on
>> modern hardware.
>
> It's not about accounting, it's about performance. There's a cost in
> every device to traversing large count sg lists. If you have to bear it
> in the iommu (which is usually more efficient because the iotlb tends to
> follow mmtlb optimisations) you can reduce the cost by eliminating it
> from the device.
That's why I'm proposing to remove virtual merge accounting (2), but leave
virtual merging (1) itself. The accounting doesn't reduce the number of sg
slots.
Mikulas
> James
On Tue, 2008-07-15 at 11:58 -0400, Mikulas Patocka wrote:
> You are mixing two ideas here:
>
> (1) virtual merging --- IOMMU maps discontinuous segments into continuous
> area that it presents to the device.
>
> (2) virtual merge accounting --- block layer tries to guess how many
> segments will be created by (1) and merges small requests into big ones.
> The resulting requests are as big that they can't be processed by the
> device if (1) weren't in effect.
No ... I'm not ... the virtual merge implementation requires the block
layer to get this accounting right, otherwise the iommu code can end up
doing the wrong thing.
You're proposing to eliminate the difference between max_phys_segments
and max_hw_segments without actually removing them.
> >> The problem is with vmerge accounting in block layer (that is what I'm
> >> proposing to remove), not with vmerge itself.
> >
> > I don't think that's true ... otherwise parisc would be falling over
> > left right and centre.
> >
> >> Vmerge accounting has advantages only if you have device with small amount
> >> of sg slots --- it allows the block layer to create request that has
> >> higher number of segments then the device.
> >
> > This isn't really true either. A lot of devices with a high sg slot
> > count are still less efficient than an iommu for programming.
>
> --- for these devices virtual merging (1) improves performance, but
> virtual merge accounting (2) doesn't.
>
> > Even if they're not, on parisc we have to program the iommu, we can't
> > bypass, so it still makes sense to only have one large sg list (in the
> > iommu) and one small one (in the device). Having two large ones reduces
> > our I/O throughput because of the extra overhead.
> >
> >> If you have device with for example 1024 slots, the virtual merge
> >> accounting has no effect, because the any request will fit into that size.
> >
> > It's not about fitting a request, it's about efficient processing.
>
> Virtual merge accounting (2) is about fitting a request. It is block layer
> technique.
>
> >> Even without virtual merge accounting, the virtual merging will happen, so
> >> there will be no performance penalty for the controller --- the controller
> >> will be programmed with exactly the same number of segments as if virtual
> >> merge accounting was present. (there could be even slight positive
> >> performance effect if you remove accounting, because you burn less CPU
> >> cycles per request)
> >
> > Yes there is. Both the iommu and the device have to traverse large SG
> > lists. This is where the inefficiency lies. On PA, we use exactly the
> > same number of iotlb slots whether virtual merging is in effect or not,
> > but the device has an internal loop to go over the list. It's that loop
> > that virtual merging reduces.
> >
> > Since the virtual merge computation is in line when the request is built
> > (by design) it doesn't really detract from the throughput and the cost
> > is pretty small.
>
> The purpose of (1) virtual merging is to save device's sg slots. The
> purpose of (2) virtual merge accounting is to allow block layer to build
> larger requests. If you remove virtual merge accounting, it will cause no
> increase in number of sg slots used.
>
> >>> I suspect with IOMMUs coming back (and being unable to be bypassed) with
> >>> virtualisation, virtual merging might once more become a significant
> >>> value.
> >>
> >> I suppose that no one would manufacture new SCSI card with 16 or 32 sg
> >> slots these days, so the accounting of hardware segments has no effect on
> >> modern hardware.
> >
> > It's not about accounting, it's about performance. There's a cost in
> > every device to traversing large count sg lists. If you have to bear it
> > in the iommu (which is usually more efficient because the iotlb tends to
> > follow mmtlb optimisations) you can reduce the cost by eliminating it
> > from the device.
>
> That's why I'm proposing to remove virtual merge accounting (2), but leave
> virtual merging (1) itself. The accounting doesn't reduce number of sg
> slots.
Yes, but it gains very little ... architectures that don't want it can
already turn it off, and it's useful for those, like parisc, who do.
James
On Tue, 15 Jul 2008, James Bottomley wrote:
> On Tue, 2008-07-15 at 11:58 -0400, Mikulas Patocka wrote:
>> You are mixing two ideas here:
>>
>> (1) virtual merging --- IOMMU maps discontinuous segments into continuous
>> area that it presents to the device.
>>
>> (2) virtual merge accounting --- block layer tries to guess how many
>> segments will be created by (1) and merges small requests into big ones.
>> The resulting requests are as big that they can't be processed by the
>> device if (1) weren't in effect.
>
> No ... I'm not ... the virtual merge implementation requires the block
> layer to get this accounting right, otherwise the iommu code can end up
> doing the wrong thing.
Virtual merging (1) can work even without accounting (2). The IOMMU can
always create fewer sg entries than the block layer expects.
> You're proposing to eliminate the difference between max_phys_segments
> and max_hw_segments without actually removing them.
Yes. Only on alpha and pa-risc is there a difference between these values.
And both of these architectures are being discontinued.
>> That's why I'm proposing to remove virtual merge accounting (2), but leave
>> virtual merging (1) itself. The accounting doesn't reduce number of sg
>> slots.
>
> Yes, but it's gains very little ... architectures that don't want it can
> already turn it off, and it's useful for those, like parisc, who do.
>
> James
It increases the maintainability of the code and reduces bloat and bugs.
Mikulas
On Tue, 2008-07-15 at 12:20 -0400, Mikulas Patocka wrote:
> On Tue, 15 Jul 2008, James Bottomley wrote:
>
> > On Tue, 2008-07-15 at 11:58 -0400, Mikulas Patocka wrote:
> >> You are mixing two ideas here:
> >>
> >> (1) virtual merging --- IOMMU maps discontinuous segments into continuous
> >> area that it presents to the device.
> >>
> >> (2) virtual merge accounting --- block layer tries to guess how many
> >> segments will be created by (1) and merges small requests into big ones.
> >> The resulting requests are as big that they can't be processed by the
> >> device if (1) weren't in effect.
> >
> > No ... I'm not ... the virtual merge implementation requires the block
> > layer to get this accounting right, otherwise the iommu code can end up
> > doing the wrong thing.
>
> The virtual merge (1) can work even without accounting (2). IOMMU can
> always create less sg entries then the block layer expects.
It can, but it's not optimal ... and depends on max_phys_segments ==
max_hw_segments.
> > You're proposing to eliminate the difference between max_phys_segments
> > and max_hw_segments without actually removing them.
>
> Yes. Only for alpha and pa-risc, there is difference between these values.
> And both of these architectures are being discontinued.
>
> >> That's why I'm proposing to remove virtual merge accounting (2), but leave
> >> virtual merging (1) itself. The accounting doesn't reduce number of sg
> >> slots.
> >
> > Yes, but it's gains very little ... architectures that don't want it can
> > already turn it off, and it's useful for those, like parisc, who do.
>
> It increases maintainability of the code, reduces bloat and bugs.
That's not really a good reason. You can eliminate code because it's
unused and unlikely to be used, or you can redo it to be better or fix it to
be less buggy. You don't simply eliminate useful functionality that
currently has in-tree users, however marginal you might opine those
users to be.
James
> > > Yes, but it's gains very little ... architectures that don't want it can
> > > already turn it off, and it's useful for those, like parisc, who do.
> >
> > It increases maintainability of the code, reduces bloat and bugs.
>
> That's not really a good reason. You can eliminate code because it's
> unused and unikely to be used or you redo it to better or fix it to be
> less buggy. You don't simply eliminate useful functionality that
> currently has in-tree users, however marginal you might opine those
> users to be.
>
> James
So show a specific device where the virtual merge accounting is useful.
(1) A device that is often used in alpha or pa-risc environments ---
because the accounting is not used on other archs.
(2) A device that is performance-sensitive --- not something outdated or
unusual.
(3) And a device that has a limited sg-list size, so that generic I/O
requests made by the kernel hit this limit. (If the sg-list is so big that
the nr_phys_segments of most requests fits into it, you don't need to count
nr_hw_segments --- because nr_hw_segments < nr_phys_segments and
nr_phys_segments already fits.)
[ A device that traverses its sg-list slowly doesn't fall into category
(3), because virtual merging would happen with or without nr_hw_segments
accounting. ]
Mikulas
On Tue, 15 Jul 2008 11:46:46 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> >> Even if we fix it now, the question is: how long it will stay fixed? Until
> >> someone makes another change to struct device that restricts boundaries on
> >> some wacky hardware.
> >
> > I'm not sure how the boundary restriction of a device can break
> > the VMERGE accounting.
>
> Because block layer code doesn't know anything about the device, pci
> access restrictions and so on.
Not true, the block layer knows about the device restrictions like DMA
boundary.
But that's not the point here, because the boundary restriction doesn't
matter for the VMERGE accounting. An IOMMU just returns an error if it
can't allocate an I/O space that fits the device restrictions.
Please give me an example of how the boundary restriction of a device can
break the VMERGE accounting and an IOMMU, if you still aren't sure.
On Wed, 16 Jul 2008, FUJITA Tomonori wrote:
> On Tue, 15 Jul 2008 11:46:46 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > >> Even if we fix it now, the question is: how long it will stay fixed? Until
> > >> someone makes another change to struct device that restricts boundaries on
> > >> some wacky hardware.
> > >
> > > I'm not sure how the boundary restriction of a device can break
> > > the VMERGE accounting.
> >
> > Because block layer code doesn't know anything about the device, pci
> > access restrictions and so on.
>
> Not true, the block layer knows about the device restrictions like DMA
> boundary.
>
> But it's not the point here because the boundary restriction doesn't
> matter for the VMERGE accounting. An IOMMU just returns an error if it
> can't allocate an I/O space fit for the device restrictions.
>
>
> Please give me an example how the boundary restriction of a device can
> break the VMERGE accounting and an IOMMU if you aren't still sure.
You have dma_get_seg_boundary and dma_get_max_seg_size. On sparc64, adding
one of these broke VMERGE accounting (the VMERGE didn't happen past a 64 KB
boundary, but the bio layer thought that VMERGE would be possible).
And if you fix this case, someone will break it again, sooner or later, by
adding a new restriction.
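A rough sketch of that sparc64 scenario (the addresses, lengths and the 64 KB boundary below are made up; the two predicates are simplified versions of BIOVEC_VIRT_MERGEABLE and __BIO_SEG_BOUNDARY from include/linux/bio.h): the bio layer, looking only at vmerge-boundary alignment of the physical addresses, predicts a merge, while the IOMMU refuses to let the mapped segment cross the device's segment boundary in I/O-virtual space and so emits two segments.

#include <stdio.h>

#define VMERGE_BOUNDARY   0x2000ULL	/* sparc64's BIO_VMERGE_BOUNDARY (8 KB) */
#define SEG_BOUNDARY_MASK 0xffffULL	/* assumed 64 KB dma_get_seg_boundary() mask */

/* simplified BIOVEC_VIRT_MERGEABLE(): both edges aligned to the vmerge
 * boundary, so the bio layer assumes the IOMMU can map them contiguously */
static int bio_predicts_merge(unsigned long long phys_end1,
			      unsigned long long phys_start2)
{
	return ((phys_end1 | phys_start2) & (VMERGE_BOUNDARY - 1)) == 0;
}

/* simplified __BIO_SEG_BOUNDARY: the merged mapping must not cross the
 * device's segment boundary in I/O-virtual space */
static int iommu_allows_merge(unsigned long long iova_start,
			      unsigned long long iova_end)
{
	return (iova_start | SEG_BOUNDARY_MASK) ==
	       ((iova_end - 1) | SEG_BOUNDARY_MASK);
}

int main(void)
{
	/* two 8 KB pieces, physically discontiguous but boundary-aligned */
	unsigned long long phys_end1 = 0x40000, phys_start2 = 0x90000;
	/* the IOMMU happens to place the 16 KB mapping at 0xe000, so it
	 * would end at 0x12000 and cross the 64 KB boundary at 0x10000 */
	unsigned long long iova_start = 0xe000, iova_end = 0x12000;

	printf("bio layer predicts a merge: %d\n",
	       bio_predicts_merge(phys_end1, phys_start2));
	printf("IOMMU actually merges:      %d\n",
	       iommu_allows_merge(iova_start, iova_end));
	return 0;
}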
Mikulas
On Wed, 16 Jul 2008 14:02:27 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> On Wed, 16 Jul 2008, FUJITA Tomonori wrote:
>
> > On Tue, 15 Jul 2008 11:46:46 -0400 (EDT)
> > Mikulas Patocka <[email protected]> wrote:
> >
> > > >> Even if we fix it now, the question is: how long it will stay fixed? Until
> > > >> someone makes another change to struct device that restricts boundaries on
> > > >> some wacky hardware.
> > > >
> > > > I'm not sure how the boundary restriction of a device can break
> > > > the VMERGE accounting.
> > >
> > > Because block layer code doesn't know anything about the device, pci
> > > access restrictions and so on.
> >
> > Not true, the block layer knows about the device restrictions like DMA
> > boundary.
> >
> > But it's not the point here because the boundary restriction doesn't
> > matter for the VMERGE accounting. An IOMMU just returns an error if it
> > can't allocate an I/O space fit for the device restrictions.
> >
> >
> > Please give me an example how the boundary restriction of a device can
> > break the VMERGE accounting and an IOMMU if you aren't still sure.
>
> You have dma_get_seg_boundary and dma_get_max_seg_size. On sparc64, adding
> one of these broken VMERGE accounting (the VMERGE didn't happen past 64-kb
> boundary and bio layer thought that VMERGE would be possible).
If the device has a 64KB boundary restriction, the device also has a
max_seg_size restriction of 64KB or under. So the vmerge accounting
works (though we need to fix it to handle max_seg_size, as discussed).
> And if you fix this case, someone will break it again, sooner or later, by
> adding new restriction.
What is your new restriction?
All the restrictions that IOMMUs need to know about are dma_get_seg_boundary
and dma_get_max_seg_size.
> > > Please give me an example how the boundary restriction of a device can
> > > break the VMERGE accounting and an IOMMU if you aren't still sure.
> >
> > You have dma_get_seg_boundary and dma_get_max_seg_size. On sparc64, adding
> > one of these broken VMERGE accounting (the VMERGE didn't happen past 64-kb
> > boundary and bio layer thought that VMERGE would be possible).
>
> If the device has 64KB boundary restriction, the device also has
> max_seg_size restriction of 64KB or under. So the vmerge acounting
> works (though we need to fix it to handle max_seg_size, as discussed).
>
> > And if you fix this case, someone will break it again, sooner or later, by
> > adding new restriction.
>
> All restrictions that IOMMUs need to know are dma_get_seg_boundary and
> dma_get_max_seg_size.
>
> What is your new restriction?
We don't know what happens in the future. And that is the problem --- that we
don't know --- but we have two pieces of code (blk-merge and the iommu) that
try to calculate the same number (the number of hw segments), and if they get
different results, it will crash. If the calculation were done in one place,
there would be no problem with that.
Mikulas
On Thu, 17 Jul 2008 07:50:24 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> > > > Please give me an example how the boundary restriction of a device can
> > > > break the VMERGE accounting and an IOMMU if you aren't still sure.
> > >
> > > You have dma_get_seg_boundary and dma_get_max_seg_size. On sparc64, adding
> > > one of these broken VMERGE accounting (the VMERGE didn't happen past 64-kb
> > > boundary and bio layer thought that VMERGE would be possible).
> >
> > If the device has 64KB boundary restriction, the device also has
> > max_seg_size restriction of 64KB or under. So the vmerge acounting
> > works (though we need to fix it to handle max_seg_size, as discussed).
> >
> > > And if you fix this case, someone will break it again, sooner or later, by
> > > adding new restriction.
> >
> > All restrictions that IOMMUs need to know are dma_get_seg_boundary and
> > dma_get_max_seg_size.
> >
> > What is your new restriction?
>
> We don't know what happens in the future.
It's very unlikely that new restrictions will be added.
> And that is the problem that we
> don't know --- but we have two pieces of code (blk-merge and iommu) that
> try to calculate the same number (number of hw segments) and if they get
> different result, it will crash. If the calculations were done at one
> place, there would be no problem with that.
I don't think that your argument, 'the problem that we don't know', is
true.
With the vmerge accounting, we calculate in two places. So if we add
a new restriction, we need to handle it in two places. It's a logical
result.
Of course, it's easier to calculate in one place rather than two
places. But 'we don't know what restriction we will need' isn't a
problem.
BTW, as I've already said, I'm not against removing the vmerge
accounting from the block layer.
FUJITA Tomonori wrote:
> On Thu, 17 Jul 2008 07:50:24 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
>>>>> Please give me an example how the boundary restriction of a device can
>>>>> break the VMERGE accounting and an IOMMU if you aren't still sure.
>>>> You have dma_get_seg_boundary and dma_get_max_seg_size. On sparc64, adding
>>>> one of these broken VMERGE accounting (the VMERGE didn't happen past 64-kb
>>>> boundary and bio layer thought that VMERGE would be possible).
>>> If the device has 64KB boundary restriction, the device also has
>>> max_seg_size restriction of 64KB or under. So the vmerge acounting
>>> works (though we need to fix it to handle max_seg_size, as discussed).
>>>
>>>> And if you fix this case, someone will break it again, sooner or later, by
>>>> adding new restriction.
>>> All restrictions that IOMMUs need to know are dma_get_seg_boundary and
>>> dma_get_max_seg_size.
>>>
>>> What is your new restriction?
>> We don't know what happens in the future.
>
> It's very unlikely to add new restrictions.
>
>
>> And that is the problem that we
>> don't know --- but we have two pieces of code (blk-merge and iommu) that
>> try to calculate the same number (number of hw segments) and if they get
>> different result, it will crash. If the calculations were done at one
>> place, there would be no problem with that.
>
> I don't think that your argument, 'the problem that we don't know', is
> true.
>
> With the vmerge accounting, we calculate at two places. So if we add
> a new restriction, we need to handle it at two places. It's a logical
> result.
>
> Of course, it's easier to calculate at one place rather than two
> places. But 'we don't know what restriction we will need' isn't a
> problem.
>
>
> BTW, as I've already said, I'm not against removing the vmerge
> accounting from the block layer.
I have a question. Does the block layer know of the IOMMU in use
for the device? Can it call into the IOMMU to calculate the
restriction?
Thanks Boaz
On Thu, 2008-07-17 at 16:27 +0300, Boaz Harrosh wrote:
> FUJITA Tomonori wrote:
> > On Thu, 17 Jul 2008 07:50:24 -0400 (EDT)
> > Mikulas Patocka <[email protected]> wrote:
> >
> >>>>> Please give me an example how the boundary restriction of a device can
> >>>>> break the VMERGE accounting and an IOMMU if you aren't still sure.
> >>>> You have dma_get_seg_boundary and dma_get_max_seg_size. On sparc64, adding
> >>>> one of these broken VMERGE accounting (the VMERGE didn't happen past 64-kb
> >>>> boundary and bio layer thought that VMERGE would be possible).
> >>> If the device has 64KB boundary restriction, the device also has
> >>> max_seg_size restriction of 64KB or under. So the vmerge acounting
> >>> works (though we need to fix it to handle max_seg_size, as discussed).
> >>>
> >>>> And if you fix this case, someone will break it again, sooner or later, by
> >>>> adding new restriction.
> >>> All restrictions that IOMMUs need to know are dma_get_seg_boundary and
> >>> dma_get_max_seg_size.
> >>>
> >>> What is your new restriction?
> >> We don't know what happens in the future.
> >
> > It's very unlikely to add new restrictions.
> >
> >
> >> And that is the problem that we
> >> don't know --- but we have two pieces of code (blk-merge and iommu) that
> >> try to calculate the same number (number of hw segments) and if they get
> >> different result, it will crash. If the calculations were done at one
> >> place, there would be no problem with that.
> >
> > I don't think that your argument, 'the problem that we don't know', is
> > true.
> >
> > With the vmerge accounting, we calculate at two places. So if we add
> > a new restriction, we need to handle it at two places. It's a logical
> > result.
> >
> > Of course, it's easier to calculate at one place rather than two
> > places. But 'we don't know what restriction we will need' isn't a
> > problem.
> >
> >
> > BTW, as I've already said, I'm not against removing the vmerge
> > accounting from the block layer.
>
> I have a question. Does the block layer know of the IOMMU in use
> for the device? can it call into the IOMMU to calculate the
> restriction?
Yes and no. The parameter PCI_DMA_BUS_IS_PHYS is set if the platform
doesn't have one. Nowadays, that's not enough; with VT and bypass, what
the system really needs to know is whether the device will be using the
iommu.
The idea of calling into the platform iommu code was considered when all
this was done, but it was rejected. Function pointer calls are
incredibly expensive on most platforms that had iommus at the time.
The best way was to construct a theoretical parametrisation of an iommu
and get the block layer to follow that model.
James
From: FUJITA Tomonori <[email protected]>
Date: Thu, 17 Jul 2008 22:18:44 +0900
> BTW, as I've already said, I'm not against removing the vmerge
> accounting from the block layer.
I am also, as stated, not against this.
Fujita-san, please propose a patch so that we can put this
issue behind us :-)
On Sat, 19 Jul 2008, David Miller wrote:
> From: FUJITA Tomonori <[email protected]>
> Date: Thu, 17 Jul 2008 22:18:44 +0900
>
> > BTW, as I've already said, I'm not against removing the vmerge
> > accounting from the block layer.
>
> I am also, as stated, not against this.
>
> Fujita-san, please proposage a patch so that we can put this
> issue behind us :-)
A few days ago I created this.
Another task would be to remove nr_hw_segments from the request, bio and queue
parameters (if this patch is accepted).
Mikulas
---
arch/parisc/kernel/setup.c | 5 ---
arch/x86/kernel/pci-dma.c | 6 ---
block/blk-merge.c | 72 +++------------------------------------------
drivers/parisc/ccio-dma.c | 2 -
drivers/parisc/sba_iommu.c | 2 -
fs/bio.c | 6 +--
include/asm-alpha/io.h | 3 -
include/asm-ia64/io.h | 26 +---------------
include/asm-parisc/io.h | 6 ---
include/asm-powerpc/io.h | 7 ----
include/asm-sparc64/io.h | 1
include/asm-x86/io_64.h | 3 -
include/linux/bio.h | 15 ---------
13 files changed, 10 insertions(+), 144 deletions(-)
Index: linux-2.6.26-fast/arch/parisc/kernel/setup.c
===================================================================
--- linux-2.6.26-fast.orig/arch/parisc/kernel/setup.c 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/arch/parisc/kernel/setup.c 2008-07-15 16:20:15.000000000 +0200
@@ -57,11 +57,6 @@ int parisc_bus_is_phys __read_mostly = 1
EXPORT_SYMBOL(parisc_bus_is_phys);
#endif
-/* This sets the vmerge boundary and size, it's here because it has to
- * be available on all platforms (zero means no-virtual merging) */
-unsigned long parisc_vmerge_boundary = 0;
-unsigned long parisc_vmerge_max_size = 0;
-
void __init setup_cmdline(char **cmdline_p)
{
extern unsigned int boot_args[];
Index: linux-2.6.26-fast/arch/x86/kernel/pci-dma.c
===================================================================
--- linux-2.6.26-fast.orig/arch/x86/kernel/pci-dma.c 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/arch/x86/kernel/pci-dma.c 2008-07-15 16:20:15.000000000 +0200
@@ -30,11 +30,6 @@ int no_iommu __read_mostly;
/* Set this to 1 if there is a HW IOMMU in the system */
int iommu_detected __read_mostly = 0;
-/* This tells the BIO block layer to assume merging. Default to off
- because we cannot guarantee merging later. */
-int iommu_bio_merge __read_mostly = 0;
-EXPORT_SYMBOL(iommu_bio_merge);
-
dma_addr_t bad_dma_address __read_mostly = 0;
EXPORT_SYMBOL(bad_dma_address);
@@ -151,7 +146,6 @@ static __init int iommu_setup(char *p)
}
if (!strncmp(p, "biomerge", 8)) {
- iommu_bio_merge = 4096;
iommu_merge = 1;
force_iommu = 1;
}
Index: linux-2.6.26-fast/block/blk-merge.c
===================================================================
--- linux-2.6.26-fast.orig/block/blk-merge.c 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/block/blk-merge.c 2008-07-15 16:20:15.000000000 +0200
@@ -66,7 +66,7 @@ void blk_recalc_rq_segments(struct reque
*/
high = page_to_pfn(bv->bv_page) > q->bounce_pfn;
if (high || highprv)
- goto new_hw_segment;
+ goto new_segment;
if (cluster) {
if (seg_size + bv->bv_len > q->max_segment_size)
goto new_segment;
@@ -74,8 +74,6 @@ void blk_recalc_rq_segments(struct reque
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bv))
goto new_segment;
- if (BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len))
- goto new_hw_segment;
seg_size += bv->bv_len;
hw_seg_size += bv->bv_len;
@@ -83,17 +81,11 @@ void blk_recalc_rq_segments(struct reque
continue;
}
new_segment:
- if (BIOVEC_VIRT_MERGEABLE(bvprv, bv) &&
- !BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len))
- hw_seg_size += bv->bv_len;
- else {
-new_hw_segment:
- if (nr_hw_segs == 1 &&
- hw_seg_size > rq->bio->bi_hw_front_size)
- rq->bio->bi_hw_front_size = hw_seg_size;
- hw_seg_size = BIOVEC_VIRT_START_SIZE(bv) + bv->bv_len;
- nr_hw_segs++;
- }
+ if (nr_hw_segs == 1 &&
+ hw_seg_size > rq->bio->bi_hw_front_size)
+ rq->bio->bi_hw_front_size = hw_seg_size;
+ hw_seg_size = bv->bv_len;
+ nr_hw_segs++;
nr_phys_segs++;
bvprv = bv;
@@ -146,22 +138,6 @@ static int blk_phys_contig_segment(struc
return 0;
}
-static int blk_hw_contig_segment(struct request_queue *q, struct bio *bio,
- struct bio *nxt)
-{
- if (!bio_flagged(bio, BIO_SEG_VALID))
- blk_recount_segments(q, bio);
- if (!bio_flagged(nxt, BIO_SEG_VALID))
- blk_recount_segments(q, nxt);
- if (!BIOVEC_VIRT_MERGEABLE(__BVEC_END(bio), __BVEC_START(nxt)) ||
- BIOVEC_VIRT_OVERSIZE(bio->bi_hw_back_size + nxt->bi_hw_front_size))
- return 0;
- if (bio->bi_hw_back_size + nxt->bi_hw_front_size > q->max_segment_size)
- return 0;
-
- return 1;
-}
-
/*
* map a request to scatterlist, return number of sg entries setup. Caller
* must make sure sg can hold rq->nr_phys_segments entries
@@ -317,18 +293,6 @@ int ll_back_merge_fn(struct request_queu
if (!bio_flagged(bio, BIO_SEG_VALID))
blk_recount_segments(q, bio);
len = req->biotail->bi_hw_back_size + bio->bi_hw_front_size;
- if (BIOVEC_VIRT_MERGEABLE(__BVEC_END(req->biotail), __BVEC_START(bio))
- && !BIOVEC_VIRT_OVERSIZE(len)) {
- int mergeable = ll_new_mergeable(q, req, bio);
-
- if (mergeable) {
- if (req->nr_hw_segments == 1)
- req->bio->bi_hw_front_size = len;
- if (bio->bi_hw_segments == 1)
- bio->bi_hw_back_size = len;
- }
- return mergeable;
- }
return ll_new_hw_segment(q, req, bio);
}
@@ -356,18 +320,6 @@ int ll_front_merge_fn(struct request_que
blk_recount_segments(q, bio);
if (!bio_flagged(req->bio, BIO_SEG_VALID))
blk_recount_segments(q, req->bio);
- if (BIOVEC_VIRT_MERGEABLE(__BVEC_END(bio), __BVEC_START(req->bio)) &&
- !BIOVEC_VIRT_OVERSIZE(len)) {
- int mergeable = ll_new_mergeable(q, req, bio);
-
- if (mergeable) {
- if (bio->bi_hw_segments == 1)
- bio->bi_hw_front_size = len;
- if (req->nr_hw_segments == 1)
- req->biotail->bi_hw_back_size = len;
- }
- return mergeable;
- }
return ll_new_hw_segment(q, req, bio);
}
@@ -399,18 +351,6 @@ static int ll_merge_requests_fn(struct r
return 0;
total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
- if (blk_hw_contig_segment(q, req->biotail, next->bio)) {
- int len = req->biotail->bi_hw_back_size +
- next->bio->bi_hw_front_size;
- /*
- * propagate the combined length to the end of the requests
- */
- if (req->nr_hw_segments == 1)
- req->bio->bi_hw_front_size = len;
- if (next->nr_hw_segments == 1)
- next->biotail->bi_hw_back_size = len;
- total_hw_segments--;
- }
if (total_hw_segments > q->max_hw_segments)
return 0;
Index: linux-2.6.26-fast/include/asm-alpha/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-alpha/io.h 2008-07-15 16:19:52.000000000 +0200
+++ linux-2.6.26-fast/include/asm-alpha/io.h 2008-07-15 16:20:15.000000000 +0200
@@ -96,9 +96,6 @@ static inline dma_addr_t __deprecated is
return page_to_phys(page);
}
-/* This depends on working iommu. */
-#define BIO_VMERGE_BOUNDARY (alpha_mv.mv_pci_tbi ? PAGE_SIZE : 0)
-
/* Maximum PIO space address supported? */
#define IO_SPACE_LIMIT 0xffff
Index: linux-2.6.26-fast/include/asm-ia64/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-ia64/io.h 2008-07-15 16:19:52.000000000 +0200
+++ linux-2.6.26-fast/include/asm-ia64/io.h 2008-07-15 16:20:15.000000000 +0200
@@ -430,30 +430,8 @@ extern void memcpy_fromio(void *dst, con
extern void memcpy_toio(volatile void __iomem *dst, const void *src, long n);
extern void memset_io(volatile void __iomem *s, int c, long n);
-# endif /* __KERNEL__ */
-
-/*
- * Enabling BIO_VMERGE_BOUNDARY forces us to turn off I/O MMU bypassing. It is said that
- * BIO-level virtual merging can give up to 4% performance boost (not verified for ia64).
- * On the other hand, we know that I/O MMU bypassing gives ~8% performance improvement on
- * SPECweb-like workloads on zx1-based machines. Thus, for now we favor I/O MMU bypassing
- * over BIO-level virtual merging.
- */
extern unsigned long ia64_max_iommu_merge_mask;
-#if 1
-#define BIO_VMERGE_BOUNDARY 0
-#else
-/*
- * It makes no sense at all to have this BIO_VMERGE_BOUNDARY macro here. Should be
- * replaced by dma_merge_mask() or something of that sort. Note: the only way
- * BIO_VMERGE_BOUNDARY is used is to mask off bits. Effectively, our definition gets
- * expanded into:
- *
- * addr & ((ia64_max_iommu_merge_mask + 1) - 1) == (addr & ia64_max_iommu_vmerge_mask)
- *
- * which is precisely what we want.
- */
-#define BIO_VMERGE_BOUNDARY (ia64_max_iommu_merge_mask + 1)
-#endif
+
+# endif /* __KERNEL__ */
#endif /* _ASM_IA64_IO_H */
Index: linux-2.6.26-fast/include/asm-parisc/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-parisc/io.h 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/include/asm-parisc/io.h 2008-07-15 16:20:15.000000000 +0200
@@ -4,12 +4,6 @@
#include <linux/types.h>
#include <asm/pgtable.h>
-extern unsigned long parisc_vmerge_boundary;
-extern unsigned long parisc_vmerge_max_size;
-
-#define BIO_VMERGE_BOUNDARY parisc_vmerge_boundary
-#define BIO_VMERGE_MAX_SIZE parisc_vmerge_max_size
-
#define virt_to_phys(a) ((unsigned long)__pa(a))
#define phys_to_virt(a) __va(a)
#define virt_to_bus virt_to_phys
Index: linux-2.6.26-fast/include/asm-powerpc/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-powerpc/io.h 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/include/asm-powerpc/io.h 2008-07-15 16:20:15.000000000 +0200
@@ -683,13 +683,6 @@ static inline void * phys_to_virt(unsign
*/
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
-/* We do NOT want virtual merging, it would put too much pressure on
- * our iommu allocator. Instead, we want drivers to be smart enough
- * to coalesce sglists that happen to have been mapped in a contiguous
- * way by the iommu
- */
-#define BIO_VMERGE_BOUNDARY 0
-
/*
* 32 bits still uses virt_to_bus() for it's implementation of DMA
* mappings se we have to keep it defined here. We also have some old
Index: linux-2.6.26-fast/include/asm-sparc64/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-sparc64/io.h 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/include/asm-sparc64/io.h 2008-07-15 16:20:15.000000000 +0200
@@ -16,7 +16,6 @@
/* BIO layer definitions. */
extern unsigned long kern_base, kern_size;
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
-#define BIO_VMERGE_BOUNDARY 8192
static inline u8 _inb(unsigned long addr)
{
Index: linux-2.6.26-fast/include/asm-x86/io_64.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-x86/io_64.h 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/include/asm-x86/io_64.h 2008-07-15 16:20:15.000000000 +0200
@@ -304,9 +304,6 @@ void memset_io(volatile void __iomem *a,
#define flush_write_buffers()
-extern int iommu_bio_merge;
-#define BIO_VMERGE_BOUNDARY iommu_bio_merge
-
/*
* Convert a virtual cached pointer to an uncached pointer
*/
Index: linux-2.6.26-fast/include/linux/bio.h
===================================================================
--- linux-2.6.26-fast.orig/include/linux/bio.h 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/include/linux/bio.h 2008-07-15 16:20:15.000000000 +0200
@@ -26,21 +26,8 @@
#ifdef CONFIG_BLOCK
-/* Platforms may set this to teach the BIO layer about IOMMU hardware. */
#include <asm/io.h>
-#if defined(BIO_VMERGE_MAX_SIZE) && defined(BIO_VMERGE_BOUNDARY)
-#define BIOVEC_VIRT_START_SIZE(x) (bvec_to_phys(x) & (BIO_VMERGE_BOUNDARY - 1))
-#define BIOVEC_VIRT_OVERSIZE(x) ((x) > BIO_VMERGE_MAX_SIZE)
-#else
-#define BIOVEC_VIRT_START_SIZE(x) 0
-#define BIOVEC_VIRT_OVERSIZE(x) 0
-#endif
-
-#ifndef BIO_VMERGE_BOUNDARY
-#define BIO_VMERGE_BOUNDARY 0
-#endif
-
#define BIO_DEBUG
#ifdef BIO_DEBUG
@@ -235,8 +222,6 @@ static inline void *bio_data(struct bio
((bvec_to_phys((vec1)) + (vec1)->bv_len) == bvec_to_phys((vec2)))
#endif
-#define BIOVEC_VIRT_MERGEABLE(vec1, vec2) \
- ((((bvec_to_phys((vec1)) + (vec1)->bv_len) | bvec_to_phys((vec2))) & (BIO_VMERGE_BOUNDARY - 1)) == 0)
#define __BIO_SEG_BOUNDARY(addr1, addr2, mask) \
(((addr1) | (mask)) == (((addr2) - 1) | (mask)))
#define BIOVEC_SEG_BOUNDARY(q, b1, b2) \
Index: linux-2.6.26-fast/fs/bio.c
===================================================================
--- linux-2.6.26-fast.orig/fs/bio.c 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/fs/bio.c 2008-07-15 16:20:15.000000000 +0200
@@ -352,8 +352,7 @@ static int __bio_add_page(struct request
*/
while (bio->bi_phys_segments >= q->max_phys_segments
- || bio->bi_hw_segments >= q->max_hw_segments
- || BIOVEC_VIRT_OVERSIZE(bio->bi_size)) {
+ || bio->bi_hw_segments >= q->max_hw_segments) {
if (retried_segments)
return 0;
@@ -390,8 +389,7 @@ static int __bio_add_page(struct request
}
/* If we may be able to merge these biovecs, force a recount */
- if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec) ||
- BIOVEC_VIRT_MERGEABLE(bvec-1, bvec)))
+ if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
bio->bi_flags &= ~(1 << BIO_SEG_VALID);
bio->bi_vcnt++;
Index: linux-2.6.26-fast/drivers/parisc/ccio-dma.c
===================================================================
--- linux-2.6.26-fast.orig/drivers/parisc/ccio-dma.c 2008-07-15 16:29:31.000000000 +0200
+++ linux-2.6.26-fast/drivers/parisc/ccio-dma.c 2008-07-15 16:34:23.000000000 +0200
@@ -1587,8 +1587,6 @@ static int __init ccio_probe(struct pari
ioc_count++;
- parisc_vmerge_boundary = IOVP_SIZE;
- parisc_vmerge_max_size = BITS_PER_LONG * IOVP_SIZE;
parisc_has_iommu();
return 0;
}
Index: linux-2.6.26-fast/drivers/parisc/sba_iommu.c
===================================================================
--- linux-2.6.26-fast.orig/drivers/parisc/sba_iommu.c 2008-07-15 16:29:31.000000000 +0200
+++ linux-2.6.26-fast/drivers/parisc/sba_iommu.c 2008-07-15 16:34:32.000000000 +0200
@@ -1979,8 +1979,6 @@ sba_driver_callback(struct parisc_device
proc_create("sba_iommu-bitmap", 0, root, &sba_proc_bitmap_fops);
#endif
- parisc_vmerge_boundary = IOVP_SIZE;
- parisc_vmerge_max_size = IOVP_SIZE * BITS_PER_LONG;
parisc_has_iommu();
return 0;
}
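For readers who find the unified diff above hard to follow, the net effect on
the recount path is roughly the following. This is a condensed, hand-written
paraphrase of the patched blk_recalc_rq_segments() hunk (the high-memory
bounce check, the loop setup and the trailing bookkeeping are omitted), not
code that compiles on its own:

	/*
	 * With the vmerge branch removed, a biovec either clusters
	 * physically with the previous one or it starts a new physical
	 * *and* a new hardware segment; there is no longer a middle case
	 * in which only the hardware segment keeps growing.
	 */
	if (cluster &&
	    seg_size + bv->bv_len <= q->max_segment_size &&
	    BIOVEC_PHYS_MERGEABLE(bvprv, bv) &&
	    BIOVEC_SEG_BOUNDARY(q, bvprv, bv)) {
		seg_size += bv->bv_len;
		hw_seg_size += bv->bv_len;
	} else {
		if (nr_hw_segs == 1 &&
		    hw_seg_size > rq->bio->bi_hw_front_size)
			rq->bio->bi_hw_front_size = hw_seg_size;
		hw_seg_size = bv->bv_len;
		seg_size = bv->bv_len;
		nr_hw_segs++;
		nr_phys_segs++;
	}
	bvprv = bv;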
On Sat, 2008-07-19 at 21:45 -0400, Mikulas Patocka wrote:
> On Sat, 19 Jul 2008, David Miller wrote:
>
> > From: FUJITA Tomonori <[email protected]>
> > Date: Thu, 17 Jul 2008 22:18:44 +0900
> >
> > > BTW, as I've already said, I'm not against removing the vmerge
> > > accounting from the block layer.
> >
> > I am also, as stated, not against this.
> >
> > Fujita-san, please proposage a patch so that we can put this
> > issue behind us :-)
>
> A few days ago I created this.
>
> Another task would be to remove nr_hw_segments from request, bio and queue
> parameters (if this patch is accepted).
I think we've already established that the code in question is correct
and functional.
As far as I can tell, virtual merging has always been broken on ppc, so
ppc shouldn't enable it. It looks like at some point in history sparc
went from a working vmerge to a non-working one (by copying the broken
ppc code), so the correct fix for both of these arches is simply to turn
off virtual merging.
ppc claims to turn off virtual merging, but in fact the define is
broken. Sparc should now follow ppc.
Try this patch
James
---
diff --git a/include/asm-powerpc/io.h b/include/asm-powerpc/io.h
index 8b62782..0f3212c 100644
--- a/include/asm-powerpc/io.h
+++ b/include/asm-powerpc/io.h
@@ -715,7 +715,7 @@ static inline void * phys_to_virt(unsigned long address)
* to coalesce sglists that happen to have been mapped in a contiguous
* way by the iommu
*/
-#define BIO_VMERGE_BOUNDARY 0
+#undef BIO_VMERGE_BOUNDARY
/*
* 32 bits still uses virt_to_bus() for it's implementation of DMA
diff --git a/include/asm-sparc64/io.h b/include/asm-sparc64/io.h
index 3158960..1da5642 100644
--- a/include/asm-sparc64/io.h
+++ b/include/asm-sparc64/io.h
@@ -16,7 +16,9 @@
/* BIO layer definitions. */
extern unsigned long kern_base, kern_size;
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
-#define BIO_VMERGE_BOUNDARY 8192
+
+/* virtual merging doesn't work on sparc now */
+#undef BIO_VMERGE_BOUNDARY
static inline u8 _inb(unsigned long addr)
{
From: James Bottomley <[email protected]>
Date: Sat, 19 Jul 2008 21:17:08 -0500
> Try this patch
I'd rather remove the vmerge code, it doesn't buy us
anything, and for something so complex and so hard to
keep working correctly its existence is far from
justified these days.
From: Mikulas Patocka <[email protected]>
Date: Sat, 19 Jul 2008 21:45:11 -0400 (EDT)
> A few days ago I created this.
>
> Another task would be to remove nr_hw_segments from request, bio and queue
> parameters (if this patch is accepted).
I'm fine with this:
Acked-by: David S. Miller <[email protected]>
On Sat, 2008-07-19 at 21:07 -0700, David Miller wrote:
> From: James Bottomley <[email protected]>
> Date: Sat, 19 Jul 2008 21:17:08 -0500
>
> > Try this patch
>
> I'd rather remove the vmerge code, it doesn't buy us
> anything, and for something so complex and so hard to
> keep working correctly its existence is far from
> justified these days.
You can ... as soon as BIO_VMERGE_BOUNDARY is undefined or set to zero,
it gets compiled out of the block code.
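(The mechanism is visible in the include/linux/bio.h fragment quoted in the
removal patch earlier in the thread; it is reproduced here, slightly
condensed, with comments added.)

/*
 * From the 2.6.26-era include/linux/bio.h as shown in the hunks above: if an
 * architecture defines neither BIO_VMERGE_BOUNDARY nor BIO_VMERGE_MAX_SIZE,
 * the vmerge predicates collapse to constants and the accounting branches in
 * blk-merge.c and fs/bio.c become dead code.
 */
#if defined(BIO_VMERGE_MAX_SIZE) && defined(BIO_VMERGE_BOUNDARY)
#define BIOVEC_VIRT_START_SIZE(x) (bvec_to_phys(x) & (BIO_VMERGE_BOUNDARY - 1))
#define BIOVEC_VIRT_OVERSIZE(x)   ((x) > BIO_VMERGE_MAX_SIZE)
#else
#define BIOVEC_VIRT_START_SIZE(x) 0
#define BIOVEC_VIRT_OVERSIZE(x)   0
#endif

#ifndef BIO_VMERGE_BOUNDARY
#define BIO_VMERGE_BOUNDARY 0	/* with a zero boundary, BIOVEC_VIRT_MERGEABLE() is effectively never true */
#endif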
Since we're using it successfully in parisc, I don't want the block code
removed, but I don't see a reason to force other architectures to use
it.
However, it has two use cases. One is the legacy one of making rather
dumb I/O cards perform better (which is the primary one on parisc), but
there is a current one making huge transfers go through SCSI using
the sg_table code. The latter is pretty vital to me since I have to
keep the code working, but I don't really have any SCSI cards that can
take advantage of it without virtual merging. As a slight irony, IBM is
trying to persuade me that a ppc would be better than a parisc for big
endian I/O testing ... so I might just be seeing if I can make virtual
merging work on power too.
James
From: James Bottomley <[email protected]>
Date: Sun, 20 Jul 2008 09:52:25 -0500
> Since we're using it successfully in parisc, I don't want the block code
> removed, but I don't see a reason to force other architectures to use
> it.
>
> However, it has two use cases. One is the legacy one of making rather
> dumb I/O cards perform better (which is the primary one on parisc), but
> there is a current one making huge transfers go through SCSI using
> the sg_table code. The latter is pretty vital to me since I have to
> keep the code working, but I don't really have any SCSI cards that can
> take advantage of it without virtual merging. As a slight irony, IBM is
> trying to persuade me that a ppc would be better than a parisc for big
> endian I/O testing ... so I might just be seeing if I can make virtual
> merging work on power too.
All of this is gibberish, we've been over this a few times already
in this thread.
For a dumb I/O card, you advertise SG_ALL capabilities, the IOMMU
is going to merge things as it would have anyways, and you have
code in the driver to advance SG entries after each "dumb I/O".
There is zero value to the vmerge code, the real gains are being
realized already.
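(A sketch of the pattern being described, for anyone who has not written such
a driver. The card_* helpers, the slot count and the function name are
hypothetical stand-ins rather than any real driver's API; the scatterlist and
DMA calls are the ordinary 2.6.26-era ones.)

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/kernel.h>

#define CARD_SG_SLOTS 8	/* hypothetical hardware limit on sg slots */

/* Hypothetical hardware accessors, standing in for the real thing. */
static inline void card_queue_slot(dma_addr_t addr, unsigned int len) { /* program one sg slot */ }
static inline void card_kick(void) { /* start the queued transfer */ }

/*
 * Sketch: advertise SG_ALL to the midlayer, let dma_map_sg() (and whatever
 * IOMMU sits behind it) merge what it can, then feed the mapped list to the
 * card a few slots at a time, advancing with sg_next() after each "dumb I/O".
 */
static void feed_card(struct device *dev, struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg = sgl;
	int mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);

	while (mapped > 0) {
		int chunk = min(mapped, CARD_SG_SLOTS);
		int i;

		for (i = 0; i < chunk; i++, sg = sg_next(sg))
			card_queue_slot(sg_dma_address(sg), sg_dma_len(sg));

		card_kick();
		mapped -= chunk;
	}
}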
On Sun, 2008-07-20 at 10:23 -0700, David Miller wrote:
> From: James Bottomley <[email protected]>
> Date: Sun, 20 Jul 2008 09:52:25 -0500
>
> > Since we're using it successfully in parisc, I don't want the block code
> > removed, but I don't see a reason to force other architectures to use
> > it.
> >
> > However, it has two use cases. One is the legacy one of making rather
> > dumb I/O cards perform better (which is the primary one on parisc), but
> > there is a current one making huge transfers go through SCSI using
> > the sg_table code. The latter is pretty vital to me since I have to
> > keep the code working, but I don't really have any SCSI cards that can
> > take advantage of it without virtual merging. As a slight irony, IBM is
> > trying to persuade me that a ppc would be better than a parisc for big
> > endian I/O testing ... so I might just be seeing if I can make virtual
> > merging work on power too.
>
> All of this is gibberish, we've been over this a few times already
> in this thread.
Really? I must have missed the proposed replacement for the
functionality that we're using then.
> For a dumb I/O card, you advertise SG_ALL capabilities, the IOMMU
> is going to merge things as it would have anyways, and you have
> code in the driver to advance SG entries after each "dumb I/O".
Not that dumb ... they just have a limited number of SG slots. We
wouldn't want to run them as spoon fed PIO because that really would
kill performance.
> There is zero value to the vmerge code, the real gains are being
> realized already.
There is value to me in my testbed, which I can't achieve any other way
(except by buying different SCSI cards).
As I said, you can compile it out on sparc just fine. I wish to keep it
running for parisc, so I'll maintain it. If it ever bit rots out of
parisc like it has done for the other architectures, then feel free to
remove it.
James
> > For a dumb I/O card, you advertise SG_ALL capabilities, the IOMMU
> > is going to merge things as it would have anyways, and you have
> > code in the driver to advance SG entries after each "dumb I/O".
>
> Not that dumb ... they just have a limited number of SG slots. We
> wouldn't want to run them as spoon fed PIO because that really would
> kill performance.
>
> > There is zero value to the vmerge code, the real gains are being
> > realized already.
>
> There is value to me in my testbed, which I can't achieve any other way
> (except by buying different SCSI cards).
>
> As I said, you can compile it out on sparc just fine. I wish to keep it
> running for parisc, so I'll maintain it. If it ever bit rots out of
> parisc like it has done for the other architectures, then feel free to
> remove it.
>
> James
So try #define BIO_VMERGE_BOUNDARY 0 on PA-RISC and tell us what
performance degradation you see (and what driver you use and what the
I/O pattern is).
If you show something specific, we can consider that --- but you haven't
yet told us anything, except generic talk.
Mikulas
On Thu, 2008-07-24 at 11:07 -0400, Mikulas Patocka wrote:
> So try #define BIO_VMERGE_BOUNDARY 0 on PA-RISC and tell us what
> performance degradation you see (and what driver you use and what the
> I/O pattern is).
>
> If you show something specific, we can consider that --- but you haven't
> yet told us anything, except generic talk.
You keep ignoring inconvenient facts. For about the third time:
I run a test bed for sg_tables (large chaining of requests). This runs
on parisc using virtual merging (has to because the final physical table
size can't go over the sg list of the SCSI card). If I turn off virtual
merging I can no longer test sg_tables in vanilla kernels.
James
On Thu, 24 Jul 2008, James Bottomley wrote:
> On Thu, 2008-07-24 at 11:07 -0400, Mikulas Patocka wrote:
> > So try to #define BIO_VMERGE_BOUNDARY 0 for Pa-Risc and tell us what
> > performance degradation do you see (and what driver do you use and what is
> > the I/O pattern).
> >
> > If you show something specific, we can consider that --- but you haven't
> > yet told us anything, except generic talk.
>
> You keep ignoring inconvenient facts. For about the third time:
>
> I run a test bed for sg_tables (large chaining of requests). This runs
> on parisc using virtual merging (has to because the final physical table
> size can't go over the sg list of the SCSI card). If I turn off virtual
> merging I can no longer test sg_tables in vanilla kernels.
>
> James
What sg_tables test do you mean? What does the test do? Why couldn't you
run the test if BIO_VMERGE_BOUNDARY is 0? Normal I/O obviously can work
with BIO_VMERGE_BOUNDARY 0; the kernel will just send more, smaller
requests.
Mikulas
On Thu, 2008-07-24 at 12:34 -0400, Mikulas Patocka wrote:
> On Thu, 24 Jul 2008, James Bottomley wrote:
>
> > On Thu, 2008-07-24 at 11:07 -0400, Mikulas Patocka wrote:
> > > So try to #define BIO_VMERGE_BOUNDARY 0 for Pa-Risc and tell us what
> > > performance degradation do you see (and what driver do you use and what is
> > > the I/O pattern).
> > >
> > > If you show something specific, we can consider that --- but you haven't
> > > yet told us anything, except generic talk.
> >
> > You keep ignoring inconvenient facts. For about the third time:
> >
> > I run a test bed for sg_tables (large chaining of requests). This runs
> > on parisc using virtual merging (has to because the final physical table
> > size can't go over the sg list of the SCSI card). If I turn off virtual
> > merging I can no longer test sg_tables in vanilla kernels.
> >
> > James
>
> What sg_tables test do you mean? What does the test do? Why couldn't you
> run the test if BIO_VMERGE_BOUNDARY is 0? Normal I/O obviously can work
> with BIO_VMERGE_BOUNDARY 0, the kernel will just send more smaller
Look, if you don't really understand what I'm doing, it's not really my
job to educate you. The sg_table discussions are on marc.info, mainly
on the SCSI lists; just look for 'sg chaining' in the header (you need to
use Google ... marc's own search is bad).
You can complain if the code is impacting you ... but I believe I've
optimised it so it isn't. Your basic problem amounts to you not liking
me doing something that has no impact on you ... I'm afraid that's what
freedom leads to (shocking, I know).
James
On Thu, 24 Jul 2008, James Bottomley wrote:
> On Thu, 2008-07-24 at 12:34 -0400, Mikulas Patocka wrote:
> > On Thu, 24 Jul 2008, James Bottomley wrote:
> >
> > > On Thu, 2008-07-24 at 11:07 -0400, Mikulas Patocka wrote:
> > > > So try to #define BIO_VMERGE_BOUNDARY 0 for Pa-Risc and tell us what
> > > > performance degradation do you see (and what driver do you use and what is
> > > > the I/O pattern).
> > > >
> > > > If you show something specific, we can consider that --- but you haven't
> > > > yet told us anything, except generic talk.
> > >
> > > You keep ignoring inconvenient facts. For about the third time:
> > >
> > > I run a test bed for sg_tables (large chaining of requests). This runs
> > > on parisc using virtual merging (has to because the final physical table
> > > size can't go over the sg list of the SCSI card). If I turn off virtual
> > > merging I can no longer test sg_tables in vanilla kernels.
> > >
> > > James
> >
> > What sg_tables test do you mean? What does the test do? Why couldn't you
> > run the test if BIO_VMERGE_BOUNDARY is 0? Normal I/O obviously can work
> > with BIO_VMERGE_BOUNDARY 0, the kernel will just send more smaller
>
> Look, if you don't really understand what I'm doing, it's not really my
> job to educate you. The sg_table discussions are on marc.info, mainly
> on the SCSI lists; just look for 'sg chaining' in the header (need to
> use google site ... marc's search is bad).
>
> You can complain if the code is impacting you ... but I believe I've
> optimised it so it isn't. Your basic problem amounts to you not liking
> me doing something that has no impact on you ... I'm afraid that's what
> freedom leads to (shocking, I know).
>
> James
Chaining of sg_tables is used for drivers with big sg tables --- and
vmerge counting is used for drivers with small sg tables. So what do they
have in common?
Summary, what I mean:
* in blk-merge.c, you have 85 lines, that is 16% of the size of the file,
devoted to counting of hw_segments
* it is only used on two architectures, one already outdated (alpha), the
other being discontinued (pa-risc). On all the other architectures,
hw_segments == phys_segments
* it is prone to bugs and hard to maintain, because the same value must be
calculated in blk-merge.c and in the architectural iommu functions --- if the
values differ, you create a too-long request, corrupt kernel memory and
crash (this happened on sparc64). Anyone changing blk-merge in the future will
risk breaking something on the architectures that use BIO_VMERGE_BOUNDARY
--- and because these architectures are so rare, the bug will go unnoticed
for a long time --- as in the case of sparc64.
* you are just talking about how this code is important for performance
without showing a single proof that it really is (temporarily disable
hw_segments accounting by defining BIO_VMERGE_BOUNDARY 0, as sketched
below, and get the numbers).
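(The shape of that experiment, purely hypothetical and untested here: switch
parisc's asm/io.h to the no-vmerge defaults, which is what James's ppc/sparc
patch above does with #undef, then compare throughput before and after.)

/* In include/asm-parisc/io.h, as a measurement aid only and not a submitted
 * patch: drop (or #undef, after the defines) the two vmerge macros so that
 * include/linux/bio.h falls back to its no-vmerge defaults and hw_segments
 * accounting degenerates to plain phys_segments accounting. */
#undef BIO_VMERGE_BOUNDARY
#undef BIO_VMERGE_MAX_SIZE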
Mikulas
From: Mikulas Patocka <[email protected]>
Date: Thu, 24 Jul 2008 17:49:14 -0400 (EDT)
> * it is prone to bugs and hard to maintain, because the same value must be
> calculated in blk-merge.c and in architectural iommu functions --- if the
> value differs, you create too long request, corrupt kernel memory and
> crash (happened on sparc64). Anyone changing blk-merge in the future will
> risk breaking something on the architectures that use BIO_VMERGE_BOUNDARY
> --- and because these architectures are so rare, the bug will go unnoticed
> for long time --- like in the case of sparc64.
I completely agree with this point.
This VMERGE stuff is now a non-trivial maintenance burden because
anyone who wants to hack on the block layer has to be mindful of
VMERGE but is very unlikely to have access to a system that it can
even be tested on.
And the answer isn't "James Bottomly will test your changes for you",
because that simply doesn't scale.
I still say we should definitely remove the VMERGE code. It's not
worth the maintenance hassle just for some SG chaining test rig
on some obscure platform.
I really only hear one person who really wants this code around any
more. Is that the Linux way? :-) Can't he patch it into his tree when
he needs it or write an alternative way to stress the SG chaining
code? He has the source, right? :-)))
On Thu, 24 Jul 2008 17:49:14 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> On Thu, 24 Jul 2008, James Bottomley wrote:
>
> > On Thu, 2008-07-24 at 12:34 -0400, Mikulas Patocka wrote:
> > > On Thu, 24 Jul 2008, James Bottomley wrote:
> > >
> > > > On Thu, 2008-07-24 at 11:07 -0400, Mikulas Patocka wrote:
> > > > > So try to #define BIO_VMERGE_BOUNDARY 0 for Pa-Risc and tell us what
> > > > > performance degradation do you see (and what driver do you use and what is
> > > > > the I/O pattern).
> > > > >
> > > > > If you show something specific, we can consider that --- but you haven't
> > > > > yet told us anything, except generic talk.
> > > >
> > > > You keep ignoring inconvenient facts. For about the third time:
> > > >
> > > > I run a test bed for sg_tables (large chaining of requests). This runs
> > > > on parisc using virtual merging (has to because the final physical table
> > > > size can't go over the sg list of the SCSI card). If I turn off virtual
> > > > merging I can no longer test sg_tables in vanilla kernels.
> > > >
> > > > James
> > >
> > > What sg_tables test do you mean? What does the test do? Why couldn't you
> > > run the test if BIO_VMERGE_BOUNDARY is 0? Normal I/O obviously can work
> > > with BIO_VMERGE_BOUNDARY 0, the kernel will just send more smaller
> >
> > Look, if you don't really understand what I'm doing, it's not really my
> > job to educate you. The sg_table discussions are on marc.info, mainly
> > on the SCSI lists; just look for 'sg chaining' in the header (need to
> > use google site ... marc's search is bad).
> >
> > You can complain if the code is impacting you ... but I believe I've
> > optimised it so it isn't. Your basic problem amounts to you not liking
> > me doing something that has no impact on you ... I'm afraid that's what
> > freedom leads to (shocking, I know).
> >
> > James
>
> Chaining of sg_tables is used for drivers with big sg tables --- and
> vmerge counting is used for drivers with small sg tables. So what do they
> have in common?
VMERGE enables you to handle a large request even with drivers with
small sg tables.
> Summary, what I mean:
>
> * in blk-merge.c, you have 85 lines, that is 16% of the size of the file,
> devoted to counting of hw_segments
>
> * it is only used on two architectures, one already outdated (alpha), the
> other being discontinued (pa-risc). On all the other architectures,
> hw_segments == phys_segments
BTW, the alpha IOMMU can't handle VMERGE, but it does have the code to
handle it, so a one-line patch could fix the IOMMU.
As I said before, can we leave the decision to keep or remove VMERGE
to Jens? It seems I keep seeing the same arguments again and again.
> * it is only used on two architectures, one already outdated (alpha), the
> other being discontinued (pa-risc). On all the other architectures,
> hw_segments == phys_segments
This "business based" argument should be ignored. Followed to the limit,
there would only be one architecture left...
Dave
--
J. David Anglin [email protected]
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
On Thu, 2008-07-24 at 14:53 -0700, David Miller wrote:
> From: Mikulas Patocka <[email protected]>
> Date: Thu, 24 Jul 2008 17:49:14 -0400 (EDT)
>
> > * it is prone to bugs and hard to maintain, because the same value must be
> > calculated in blk-merge.c and in architectural iommu functions --- if the
> > value differs, you create too long request, corrupt kernel memory and
> > crash (happened on sparc64). Anyone changing blk-merge in the future will
> > risk breaking something on the architectures that use BIO_VMERGE_BOUNDARY
> > --- and because these architectures are so rare, the bug will go unnoticed
> > for long time --- like in the case of sparc64.
>
> I completely agree with this point.
So you think the parametrisation in the block layer is the wrong way to
approach the problem? By that argument, your next patch should be
removing physical merging as well, because it also relies on a
parametrisation model of how the device builds the sg list.
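(The physical-merge parametrisation referred to here is visible in the same
recount function quoted earlier in this thread; the checks below are copied
from that context, with short comments added, and are a fragment rather than
standalone code.)

	/* The block layer's existing model of what a device, or the IOMMU in
	 * front of it, accepts as one physical segment, from
	 * blk_recalc_rq_segments() as shown in the hunks above: */
	if (seg_size + bv->bv_len > q->max_segment_size)	/* per-segment length limit */
		goto new_segment;
	if (!BIOVEC_PHYS_MERGEABLE(bvprv, bv))			/* pages physically contiguous? */
		goto new_segment;
	if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bv))			/* must not cross q->seg_boundary_mask */
		goto new_segment;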
> This VMERGE stuff is now a non-trivial maintenance burden because
> anyone who wants to hack on the block layer has to be mindful of
> VMERGE but is very unlikely to have access to a system that it can
> even be tested on.
I'm sorry sparc broke, really I am ... but you changed your iommu code
from one with a working vmerge to one without and failed to turn off
vmerging. Partly it wasn't noticed because at 2*PAGE_SIZE you have a
strange vmerge boundary, so it's harder to notice. However, I can't
extrapolate that just because this happened on sparc it will inevitably
happen on all other architectures.
> And the answer isn't "James Bottomly will test your changes for you",
> because that simply doesn't scale.
Actually, parisc will test your code: we have a PAGE_SIZE vmerge
boundary, so the effect is much more noticeable.
> I still say we should definitely remove the VMERGE code. It's not
> worth the maintenance hassle just for some SG chaining test rig
> on some obscure platform.
OK ... well you've had your say and so have I. The code in question is
the responsibility of a particular maintainer ... we'll let him decide.
> I really only hear one person who really wants this code around any
> more. Is that the Linux way? :-) Can't he patch it into his tree when
> he needs it or write an alternative way to stress the SG chaining
> code? He has the source, right? :-)))
You're advocating an out of tree patch as a solution? I didn't know
you'd been appointed RHEL maintainer ;-)
James
From: James Bottomley <[email protected]>
Date: Thu, 24 Jul 2008 23:47:50 -0400
> I'm sorry sparc broke, really I am ... but you changed your iommu code
> from one with a working vmerge to one without and failed to turn off
> vmerging. Partly it wasn't noticed because at 2*PAGE_SIZE you have a
> strange vmerge boundary, so it's harder to notice. However, I can't
> extrapolate that just because this happened on sparc it will inevitably
> happen on all other architectures.
The vmerge boundary on sparc64 was 8K, which is equal to sparc64's base
PAGE_SIZE. I don't know where you get that 2*PAGE_SIZE from.