2009-04-01 11:05:39

by Tejun Heo

Subject: [GIT PATCHSET block#for-linus] block: blk-map related fixes

Hi, Jens.

Upon ack, please pull from the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-map-fixes
http://git.kernel.org/?p=linux/kernel/git/tj/misc.git;a=shortlog;h=blk-map-fixes

This patchset contains the following eight patches, which fix subtle
bugs in scatterlist and blk-map and are candidates for 2.6.30
inclusion.

0001-scatterlist-make-sure-sg_miter_next-doesn-t-retur.patch
0002-block-fix-SG_IO-vector-request-data-length-handling.patch
0003-block-fix-queue-bounce-limit-setting.patch
0004-bio-actually-inline-inline-bvecs-into-bio.patch
0005-bio-fix-bio_kmalloc.patch
0006-bio-remove-size-segments-limit-on-bio_-copy-map-_-u.patch
0007-blk-map-let-blk_rq_map_user_iov-support-null-mapp.patch
0008-blk-map-reimplement-blk_rq_map_user-using-blk_rq_.patch

0001 fixes a subtle bug in the sg mapping iterator where zero-sized
maps were returned. The bug isn't critical but can sometimes lead to
unexpected results (when testing for full consumption, for example).
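
As an illustration, here is a minimal sketch (ours, not from the
patchset; sgl, nents, total and buf are hypothetical) of a typical
sg_miter consumption loop. With the bug, sg_miter_next() could return
true with miter.length == 0, so a caller testing for full consumption
could see spurious zero-sized iterations:

	struct sg_mapping_iter miter;
	size_t consumed = 0;

	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC);
	while (consumed < total && sg_miter_next(&miter)) {
		size_t len = min(miter.length, total - consumed);

		/* with the bug, len could be 0 here */
		memcpy(buf + consumed, miter.addr, len);
		consumed += len;
	}
	sg_miter_stop(&miter);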

0002 is visible to userland. It fixes block iovec SG_IOs such that
when sghdr->dxfer_len and the data length in the iovec disagree, the
shorter one wins instead of failing, as described in the SG_IO howto.
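
For example (a hedged userland sketch; fd, buf0, buf1 and cdb are
hypothetical), an SG_IO call whose dxfer_len is shorter than the iovec
total used to fail with -EINVAL and now transfers dxfer_len bytes:

	struct sg_iovec iov[2] = {
		{ .iov_base = buf0, .iov_len = 4096 },
		{ .iov_base = buf1, .iov_len = 4096 },
	};
	struct sg_io_hdr hdr = {
		.interface_id    = 'S',
		.dxfer_direction = SG_DXFER_FROM_DEV,
		.cmdp            = cdb,
		.cmd_len         = sizeof(cdb),
		.iovec_count     = 2,
		.dxferp          = iov,
		.dxfer_len       = 6144,  /* < 8192 byte iovec sum */
	};

	ioctl(fd, SG_IO, &hdr);  /* transfers 6144 bytes */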

0003 is pretty critical. The bug means that GFP_DMA is set in
q->bounce_gfp for many ATA controllers on x86_64, which leads to OOM
if a large SG_IO is issued.
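
To see the off-by-one (our walkthrough, assuming PAGE_SHIFT == 12):

	/* device with a full 32bit DMA mask */
	u64 dma_mask = 0xffffffffULL;
	unsigned long b_pfn = dma_mask >> PAGE_SHIFT;	/* 0xfffff */

	/*
	 * old test:   0xfffff < (0x100000000UL >> 12 == 0x100000) -> true,
	 *             GFP_DMA set although the device reaches the whole
	 *             low 4GB
	 * fixed test: 0xfffff < (0xffffffffUL >> 12 == 0xfffff) -> false,
	 *             no bouncing into the ~16MB DMA zone
	 */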

0004-0008 fix bio_kmalloc(), remove the bio-imposed segment and
length limits when using bio_kmalloc(), and make the
blk_rq_map_{user|kern}*() functions always use a single bio instead
of a chain of bios. This is not only simpler but also makes sure the
resulting request falls inside queue limits as a whole. Previously,
only each bio was checked against the limits, so the whole request
could go over them.
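
For context, a hedged sketch (ours; q, disk, ubuf and len are
hypothetical, command setup and error handling elided) of the
passthrough path these patches affect. After 0008 the mapping below
is always backed by a single kmalloc'ed bio, so the request is
validated against the queue limits as a whole:

	struct request *rq = blk_get_request(q, READ, GFP_KERNEL);
	struct bio *bio;

	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	if (blk_rq_map_user(q, rq, NULL, ubuf, len, GFP_KERNEL) == 0) {
		bio = rq->bio;			/* save for unmap */
		blk_execute_rq(q, disk, rq, 0);
		blk_rq_unmap_user(bio);
	}
	blk_put_request(rq);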

This patchset is on top of the current linux-2.6-block#for-linus[1]
and contains the following changes.

block/blk-map.c | 119 +++------------------------------------------
block/blk-settings.c | 20 ++++---
block/scsi_ioctl.c | 13 ++++
drivers/md/dm-io.c | 4 -
drivers/md/dm.c | 4 -
drivers/scsi/scsi_lib.c | 3 -
fs/bio.c | 97 +++++++++++++++++++-----------------
fs/btrfs/extent_io.c | 2
fs/ext4/extents.c | 4 -
fs/xfs/linux-2.6/xfs_buf.c | 2
include/linux/bio.h | 16 +++---
lib/scatterlist.c | 9 ++-
12 files changed, 109 insertions(+), 184 deletions(-)

Thanks.

--
tejun

[1] 714ed0cf62319b14dc327273a7339a9a199fe046


2009-04-01 11:05:13

by Tejun Heo

Subject: [PATCH 1/8] scatterlist: make sure sg_miter_next() doesn't return 0 sized mappings

Impact: fix not-so-critical but annoying bug

sg_miter_next() returns a zero-sized mapping if there is a zero-sized
sg entry in the list or at the end of each iteration. As the users
always check the ->length field, this bug shouldn't be critical beyond
causing unnecessary iterations.

Fix it.

Signed-off-by: Tejun Heo <[email protected]>
---
lib/scatterlist.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index b7b449d..a295e40 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -347,9 +347,12 @@ bool sg_miter_next(struct sg_mapping_iter *miter)
sg_miter_stop(miter);

/* get to the next sg if necessary. __offset is adjusted by stop */
- if (miter->__offset == miter->__sg->length && --miter->__nents) {
- miter->__sg = sg_next(miter->__sg);
- miter->__offset = 0;
+ while (miter->__offset == miter->__sg->length) {
+ if (--miter->__nents) {
+ miter->__sg = sg_next(miter->__sg);
+ miter->__offset = 0;
+ } else
+ return false;
}

/* map the next page */
--
1.6.0.2

2009-04-01 11:06:22

by Tejun Heo

Subject: [PATCH 3/8] block: fix queue bounce limit setting

Impact: don't set GFP_DMA in q->bounce_gfp unnecessarily

All DMA address limits are expressed in terms of the last addressable
unit (byte or page) instead of one plus that. However, when
determining bounce_gfp for 64bit machines in blk_queue_bounce_limit(),
it compares the specified limit against 0x100000000UL to determine
whether it's below 4G, ending up falsely setting GFP_DMA in
q->bounce_gfp.

As the DMA zone is very small on x86_64, this makes larger SG_IO
transfers very eager to trigger the OOM killer. Fix it. While at it,
rename the parameter to @dma_mask for clarity and convert the comment
to proper winged style.

Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-settings.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 59fd05d..aa4364c 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -156,26 +156,28 @@ EXPORT_SYMBOL(blk_queue_make_request);

/**
* blk_queue_bounce_limit - set bounce buffer limit for queue
- * @q: the request queue for the device
- * @dma_addr: bus address limit
+ * @q: the request queue for the device
+ * @dma_mask: the maximum address the device can handle
*
* Description:
* Different hardware can have different requirements as to what pages
* it can do I/O directly to. A low level driver can call
* blk_queue_bounce_limit to have lower memory pages allocated as bounce
- * buffers for doing I/O to pages residing above @dma_addr.
+ * buffers for doing I/O to pages residing above @dma_mask.
**/
-void blk_queue_bounce_limit(struct request_queue *q, u64 dma_addr)
+void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
{
- unsigned long b_pfn = dma_addr >> PAGE_SHIFT;
+ unsigned long b_pfn = dma_mask >> PAGE_SHIFT;
int dma = 0;

q->bounce_gfp = GFP_NOIO;
#if BITS_PER_LONG == 64
- /* Assume anything <= 4GB can be handled by IOMMU.
- Actually some IOMMUs can handle everything, but I don't
- know of a way to test this here. */
- if (b_pfn < (min_t(u64, 0x100000000UL, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
+ /*
+ * Assume anything <= 4GB can be handled by IOMMU. Actually
+ * some IOMMUs can handle everything, but I don't know of a
+ * way to test this here.
+ */
+ if (b_pfn < (min_t(u64, 0xffffffffUL, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
dma = 1;
q->bounce_pfn = max_low_pfn;
#else
--
1.6.0.2

2009-04-01 11:05:55

by Tejun Heo

Subject: [PATCH 2/8] block: fix SG_IO vector request data length handling

Impact: fix SG_IO behavior such that it matches the documentation

SG_IO howto says that if ->dxfer_len and the sum of the iovec disagree,
the shorter one wins. However, the current implementation returns
-EINVAL for such cases. Trim the iovec if it's longer than ->dxfer_len.

This patch uses iov_*() helpers which take struct iovec * by casting
struct sg_iovec * to it. sg_iovec is always identical to iovec and
this will be further cleaned up with later patches.

Signed-off-by: Tejun Heo <[email protected]>
---
block/scsi_ioctl.c | 13 ++++++++++++-
1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 626ee27..c8e8868 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -288,6 +288,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,

if (hdr->iovec_count) {
const int size = sizeof(struct sg_iovec) * hdr->iovec_count;
+ size_t iov_data_len;
struct sg_iovec *iov;

iov = kmalloc(size, GFP_KERNEL);
@@ -302,8 +303,18 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
goto out;
}

+ /* SG_IO howto says that the shorter of the two wins */
+ iov_data_len = iov_length((struct iovec *)iov,
+ hdr->iovec_count);
+ if (hdr->dxfer_len < iov_data_len) {
+ hdr->iovec_count = iov_shorten((struct iovec *)iov,
+ hdr->iovec_count,
+ hdr->dxfer_len);
+ iov_data_len = hdr->dxfer_len;
+ }
+
ret = blk_rq_map_user_iov(q, rq, NULL, iov, hdr->iovec_count,
- hdr->dxfer_len, GFP_KERNEL);
+ iov_data_len, GFP_KERNEL);
kfree(iov);
} else if (hdr->dxfer_len)
ret = blk_rq_map_user(q, rq, NULL, hdr->dxferp, hdr->dxfer_len,
--
1.6.0.2

2009-04-01 11:06:40

by Tejun Heo

Subject: [PATCH 4/8] bio: actually inline inline bvecs into bio

Impact: cleanup

BIO_INLINE_VECS bvecs are inlined into bio to avoid bvec allocation
for small transfers. This was achieved by declaring zero sized bvec
array at the end of bio and allocating bio with extra bytes at the
end. As BIO_INLINE_VECS is constant, there is no reason to do this
allocation trick. This patch simply defines BIO_INLINE_VECS sized
bvec array at the end. This will help fix bio_kmalloc().

Signed-off-by: Tejun Heo <[email protected]>
---
fs/bio.c | 9 +--------
include/linux/bio.h | 9 +++++----
2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index a040cde..7574839 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -31,12 +31,6 @@

DEFINE_TRACE(block_split);

-/*
- * Test patch to inline a certain number of bi_io_vec's inside the bio
- * itself, to shrink a bio data allocation from two mempool calls to one
- */
-#define BIO_INLINE_VECS 4
-
static mempool_t *bio_split_pool __read_mostly;

/*
@@ -1550,7 +1544,6 @@ void bioset_free(struct bio_set *bs)
*/
struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
{
- unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
struct bio_set *bs;

bs = kzalloc(sizeof(*bs), GFP_KERNEL);
@@ -1559,7 +1552,7 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)

bs->front_pad = front_pad;

- bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
+ bs->bio_slab = bio_find_or_create_slab(front_pad);
if (!bs->bio_slab) {
kfree(bs);
return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b05b1d4..14e5f42 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -40,6 +40,8 @@
#define BIO_MAX_SIZE (BIO_MAX_PAGES << PAGE_CACHE_SHIFT)
#define BIO_MAX_SECTORS (BIO_MAX_SIZE >> 9)

+#define BIO_INLINE_VECS 4
+
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
@@ -104,11 +106,10 @@ struct bio {
bio_destructor_t *bi_destructor; /* destructor */

/*
- * We can inline a number of vecs at the end of the bio, to avoid
- * double allocations for a small number of bio_vecs. This member
- * MUST obviously be kept at the very end of the bio.
+ * Inline small number of vecs in the bio to avoid double
+ * allocations for a small number of bio_vecs.
*/
- struct bio_vec bi_inline_vecs[0];
+ struct bio_vec bi_inline_vecs[BIO_INLINE_VECS];
};

/*
--
1.6.0.2

2009-04-01 11:07:53

by Tejun Heo

Subject: [PATCH 7/8] blk-map: let blk_rq_map_user_iov() support null mapping

Impact: API expansion

Till now, only blk_rq_map_user() supported null mapping. Add null mapping
support to blk_rq_map_user_iov() by moving BIO_NULL_MAPPED setting to
bio_copy_user_iov().

Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-map.c | 3 ---
fs/bio.c | 4 +++-
2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 6718021..ac1961d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -63,9 +63,6 @@ static int __blk_rq_map_user(struct request_queue *q, struct request *rq,
if (IS_ERR(bio))
return PTR_ERR(bio);

- if (map_data && map_data->null_mapped)
- bio->bi_flags |= (1 << BIO_NULL_MAPPED);
-
orig_bio = bio;
blk_queue_bounce(q, &bio);

diff --git a/fs/bio.c b/fs/bio.c
index 8ad9784..728bef9 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -869,7 +869,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
/*
* success
*/
- if (!write_to_vm && (!map_data || !map_data->null_mapped)) {
+ if (unlikely(map_data && map_data->null_mapped))
+ bio->bi_flags |= (1 << BIO_NULL_MAPPED);
+ else if (!write_to_vm) {
ret = __bio_copy_iov(bio, bio->bi_io_vec, iov, iov_count, 0, 0);
if (ret)
goto cleanup;
--
1.6.0.2

2009-04-01 11:07:33

by Tejun Heo

Subject: [PATCH 6/8] bio: remove size/segments limit on bio_{copy|map}_{user|kern}*()

Impact: API improvement, constant renames

bio_{copy|map}_{user|kern}*() don't guarantee successful allocation
and thus don't have any reason to use bio_alloc(). Switch to
bio_kmalloc(). This removes the size/segment limits from these APIs,
which will be used to fix the multiple-bio mapping bug in
blk_rq_map_user().

As this means that the bio size is no longer capped, rename BIO_MAX_*
to BIO_GUARANTEED_*.

Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-map.c | 12 +++++++-----
drivers/md/dm-io.c | 4 ++--
drivers/md/dm.c | 4 ++--
drivers/scsi/scsi_lib.c | 3 ++-
fs/bio.c | 10 +++++-----
fs/btrfs/extent_io.c | 2 +-
fs/ext4/extents.c | 4 ++--
fs/xfs/linux-2.6/xfs_buf.c | 2 +-
include/linux/bio.h | 6 +++---
9 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index f103729..6718021 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -127,17 +127,19 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
while (bytes_read != len) {
unsigned long map_len, end, start;

- map_len = min_t(unsigned long, len - bytes_read, BIO_MAX_SIZE);
+ map_len = min_t(unsigned long, len - bytes_read,
+ BIO_GUARANTEED_SIZE);
end = ((unsigned long)ubuf + map_len + PAGE_SIZE - 1)
>> PAGE_SHIFT;
start = (unsigned long)ubuf >> PAGE_SHIFT;

/*
- * A bad offset could cause us to require BIO_MAX_PAGES + 1
- * pages. If this happens we just lower the requested
- * mapping len by a page so that we can fit
+ * A bad offset could cause us to require
+ * BIO_GUARANTEED_PAGES + 1 pages. If this happens we
+ * just lower the requested mapping len by a page so
+ * that we can fit
*/
- if (end - start > BIO_MAX_PAGES)
+ if (end - start > BIO_GUARANTEED_PAGES)
map_len -= PAGE_SIZE;

ret = __blk_rq_map_user(q, rq, map_data, ubuf, map_len,
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 36e2b5e..38a175a 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -292,8 +292,8 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
(PAGE_SIZE >> SECTOR_SHIFT));
num_bvecs = 1 + min_t(int, bio_get_nr_vecs(where->bdev),
num_bvecs);
- if (unlikely(num_bvecs > BIO_MAX_PAGES))
- num_bvecs = BIO_MAX_PAGES;
+ if (unlikely(num_bvecs > BIO_GUARANTEED_PAGES))
+ num_bvecs = BIO_GUARANTEED_PAGES;
bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
bio->bi_sector = where->sector + (where->count - remaining);
bio->bi_bdev = where->bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8d40f27..4949112 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -894,8 +894,8 @@ static int dm_merge_bvec(struct request_queue *q,
/*
* Find maximum amount of I/O that won't need splitting
*/
- max_sectors = min(max_io_len(md, bvm->bi_sector, ti),
- (sector_t) BIO_MAX_SECTORS);
+ max_sectors = min_t(sector_t, max_io_len(md, bvm->bi_sector, ti),
+ BIO_GUARANTEED_SECTORS);
max_size = (max_sectors << SECTOR_SHIFT) - bvm->bi_size;
if (max_size < 0)
max_size = 0;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index b82ffd9..3196c83 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -351,7 +351,8 @@ static int scsi_req_map_sg(struct request *rq, struct scatterlist *sgl,
bytes = min(bytes, data_len);

if (!bio) {
- nr_vecs = min_t(int, BIO_MAX_PAGES, nr_pages);
+ nr_vecs = min_t(int, BIO_GUARANTEED_PAGES,
+ nr_pages);
nr_pages -= nr_vecs;

bio = bio_alloc(gfp, nr_vecs);
diff --git a/fs/bio.c b/fs/bio.c
index e15e457..8ad9784 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -40,7 +40,7 @@ static mempool_t *bio_split_pool __read_mostly;
*/
#define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
- BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+ BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_GUARANTEED_PAGES),
};
#undef BV

@@ -187,7 +187,7 @@ struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx,
case 65 ... 128:
*idx = 4;
break;
- case 129 ... BIO_MAX_PAGES:
+ case 129 ... BIO_GUARANTEED_PAGES:
*idx = 5;
break;
default:
@@ -818,7 +818,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
return ERR_PTR(-ENOMEM);

ret = -ENOMEM;
- bio = bio_alloc(gfp_mask, nr_pages);
+ bio = bio_kmalloc(gfp_mask, nr_pages);
if (!bio)
goto out_bmd;

@@ -942,7 +942,7 @@ static struct bio *__bio_map_user_iov(struct request_queue *q,
if (!nr_pages)
return ERR_PTR(-EINVAL);

- bio = bio_alloc(gfp_mask, nr_pages);
+ bio = bio_kmalloc(gfp_mask, nr_pages);
if (!bio)
return ERR_PTR(-ENOMEM);

@@ -1126,7 +1126,7 @@ static struct bio *__bio_map_kern(struct request_queue *q, void *data,
int offset, i;
struct bio *bio;

- bio = bio_alloc(gfp_mask, nr_pages);
+ bio = bio_kmalloc(gfp_mask, nr_pages);
if (!bio)
return ERR_PTR(-ENOMEM);

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ebe6b29..8560995 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1899,7 +1899,7 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
}
}
if (this_compressed)
- nr = BIO_MAX_PAGES;
+ nr = BIO_GUARANTEED_PAGES;
else
nr = bio_get_nr_vecs(bdev);

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e0aa4fe..6aad7b2 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2315,8 +2315,8 @@ static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)

while (ee_len > 0) {

- if (ee_len > BIO_MAX_PAGES)
- len = BIO_MAX_PAGES;
+ if (ee_len > BIO_GUARANTEED_PAGES)
+ len = BIO_GUARANTEED_PAGES;
else
len = ee_len;

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index aa1016b..5d7c571 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1211,7 +1211,7 @@ _xfs_buf_ioapply(

next_chunk:
atomic_inc(&bp->b_io_remaining);
- nr_pages = BIO_MAX_SECTORS >> (PAGE_SHIFT - BBSHIFT);
+ nr_pages = BIO_GUARANTEED_SECTORS >> (PAGE_SHIFT - BBSHIFT);
if (nr_pages > total_nr_pages)
nr_pages = total_nr_pages;

diff --git a/include/linux/bio.h b/include/linux/bio.h
index e01e41d..8647dd9 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -36,9 +36,9 @@
#define BIO_BUG_ON
#endif

-#define BIO_MAX_PAGES 256
-#define BIO_MAX_SIZE (BIO_MAX_PAGES << PAGE_CACHE_SHIFT)
-#define BIO_MAX_SECTORS (BIO_MAX_SIZE >> 9)
+#define BIO_GUARANTEED_PAGES 256
+#define BIO_GUARANTEED_SIZE (BIO_GUARANTEED_PAGES << PAGE_CACHE_SHIFT)
+#define BIO_GUARANTEED_SECTORS (BIO_GUARANTEED_SIZE >> 9)

#define BIO_INLINE_VECS 4

--
1.6.0.2

2009-04-01 11:07:00

by Tejun Heo

Subject: [PATCH 5/8] bio: fix bio_kmalloc()

Impact: fix bio_kmalloc() and its destruction path

bio_kmalloc() was broken in two ways.

* bvec_alloc_bs() first allocates bvec using kmalloc() and then
ignores it and allocates again like non-kmalloc bvecs.

* bio_kmalloc_destructor() didn't check for and free bio integrity
data.

This patch fixes the above problems. kmalloc bvec allocation is moved
to bio_alloc_bioset(). While at it, also make the following changes.

* define and use BIO_POOL_NONE so that pool index check in
bvec_free_bs() triggers if inline or kmalloc allocated bvec gets
there.

* relocate destructors on top of each allocation function so that how
they're used is more clear.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/bio.c | 74 +++++++++++++++++++++++++++++----------------------
include/linux/bio.h | 1 +
2 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 7574839..e15e457 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -169,14 +169,6 @@ struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx,
struct bio_vec *bvl;

/*
- * If 'bs' is given, lookup the pool and do the mempool alloc.
- * If not, this is a bio_kmalloc() allocation and just do a
- * kzalloc() for the exact number of vecs right away.
- */
- if (!bs)
- bvl = kmalloc(nr * sizeof(struct bio_vec), gfp_mask);
-
- /*
* see comment near bvec_array define!
*/
switch (nr) {
@@ -254,21 +246,6 @@ void bio_free(struct bio *bio, struct bio_set *bs)
mempool_free(p, bs->bio_pool);
}

-/*
- * default destructor for a bio allocated with bio_alloc_bioset()
- */
-static void bio_fs_destructor(struct bio *bio)
-{
- bio_free(bio, fs_bio_set);
-}
-
-static void bio_kmalloc_destructor(struct bio *bio)
-{
- if (bio_has_allocated_vec(bio))
- kfree(bio->bi_io_vec);
- kfree(bio);
-}
-
void bio_init(struct bio *bio)
{
memset(bio, 0, sizeof(*bio));
@@ -297,7 +274,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
struct bio_vec *bvl = NULL;
struct bio *bio = NULL;
- unsigned long idx = 0;
+ unsigned long idx = BIO_POOL_NONE;
void *p = NULL;

if (bs) {
@@ -319,12 +296,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
if (nr_iovecs <= BIO_INLINE_VECS) {
bvl = bio->bi_inline_vecs;
nr_iovecs = BIO_INLINE_VECS;
- } else {
+ } else if (bs) {
bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
if (unlikely(!bvl))
goto err_free;
-
nr_iovecs = bvec_nr_vecs(idx);
+ } else {
+ bvl = kmalloc(nr_iovecs * sizeof(struct bio_vec), gfp_mask);
+ if (unlikely(!bvl))
+ goto err_free;
}
bio->bi_flags |= idx << BIO_POOL_OFFSET;
bio->bi_max_vecs = nr_iovecs;
@@ -342,6 +322,22 @@ err:
return NULL;
}

+static void bio_fs_destructor(struct bio *bio)
+{
+ bio_free(bio, fs_bio_set);
+}
+
+/**
+ * bio_alloc - allocate a new bio, memory pool backed
+ * @gfp_mask: allocation mask to use
+ * @nr_iovecs: number of iovecs
+ *
+ * Allocate a new bio with @nr_iovecs bvecs. If @gfp_mask
+ * contains __GFP_WAIT, the allocation is guaranteed to succeed.
+ *
+ * RETURNS:
+ * Pointer to new bio on success, NULL on failure.
+ */
struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)
{
struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
@@ -352,12 +348,26 @@ struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)
return bio;
}

-/*
- * Like bio_alloc(), but doesn't use a mempool backing. This means that
- * it CAN fail, but while bio_alloc() can only be used for allocations
- * that have a short (finite) life span, bio_kmalloc() should be used
- * for more permanent bio allocations (like allocating some bio's for
- * initalization or setup purposes).
+static void bio_kmalloc_destructor(struct bio *bio)
+{
+ if (bio_has_allocated_vec(bio))
+ kfree(bio->bi_io_vec);
+ if (bio_integrity(bio))
+ bio_integrity_free(bio);
+ kfree(bio);
+}
+
+/**
+ * bio_kmalloc - allocate a new bio
+ * @gfp_mask: allocation mask to use
+ * @nr_iovecs: number of iovecs
+ *
+ * Similar to bio_alloc() but uses regular kmalloc for allocation
+ * and can fail unless __GFP_NOFAIL is set in @gfp_mask. This is
+ * useful for more permanent or over-sized bio allocations.
+ *
+ * RETURNS:
+ * Pointer to new bio on success, NULL on failure.
*/
struct bio *bio_kmalloc(gfp_t gfp_mask, int nr_iovecs)
{
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 14e5f42..e01e41d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -133,6 +133,7 @@ struct bio {
* top 4 bits of bio flags indicate the pool this bio came from
*/
#define BIO_POOL_BITS (4)
+#define BIO_POOL_NONE ((1 << BIO_POOL_BITS) - 1)
#define BIO_POOL_OFFSET (BITS_PER_LONG - BIO_POOL_BITS)
#define BIO_POOL_MASK (1UL << BIO_POOL_OFFSET)
#define BIO_POOL_IDX(bio) ((bio)->bi_flags >> BIO_POOL_OFFSET)
--
1.6.0.2

2009-04-01 11:08:21

by Tejun Heo

Subject: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

Impact: subtle bugs fixed, cleanup

blk_rq_map_user() supported multi-bio mapping by calling
bio_map/copy_user() multiple times and linking the resulting bios into
rq; however, this had subtle bugs.

* Because each call to bio_map/copy_user() is independent, the segment
limit check was done only per bio, so it was possible to create
requests which are larger than the driver and hardware limits, which
could lead to a disastrous outcome.

* Layers under FS may call blk_rq_map*() functions during request
processing. Under severe memory pressure and with enough bad luck,
this can lead to deadlock. As fs bvec pool is quite small, the
possibility isn't completely theoretical.

This patch reimplements blk_rq_map_user() in terms of
blk_rq_map_user_iov(), which doesn't support multi-bio mapping, and
drops multi-bio handling from blk_rq_unmap_user(). Note that with the
previous patches to remove the bio max size limit and to add null
mapping support to blk_rq_map_user_iov(), this change doesn't remove
any functionality.

Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-map.c | 118 +++++--------------------------------------------------
1 files changed, 10 insertions(+), 108 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index ac1961d..a43c93c 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -40,49 +40,6 @@ static int __blk_rq_unmap_user(struct bio *bio)
return ret;
}

-static int __blk_rq_map_user(struct request_queue *q, struct request *rq,
- struct rq_map_data *map_data, void __user *ubuf,
- unsigned int len, gfp_t gfp_mask)
-{
- unsigned long uaddr;
- struct bio *bio, *orig_bio;
- int reading, ret;
-
- reading = rq_data_dir(rq) == READ;
-
- /*
- * if alignment requirement is satisfied, map in user pages for
- * direct dma. else, set up kernel bounce buffers
- */
- uaddr = (unsigned long) ubuf;
- if (blk_rq_aligned(q, ubuf, len) && !map_data)
- bio = bio_map_user(q, NULL, uaddr, len, reading, gfp_mask);
- else
- bio = bio_copy_user(q, map_data, uaddr, len, reading, gfp_mask);
-
- if (IS_ERR(bio))
- return PTR_ERR(bio);
-
- orig_bio = bio;
- blk_queue_bounce(q, &bio);
-
- /*
- * We link the bounce buffer in and could have to traverse it
- * later so we have to get a ref to prevent it from being freed
- */
- bio_get(bio);
-
- ret = blk_rq_append_bio(q, rq, bio);
- if (!ret)
- return bio->bi_size;
-
- /* if it was boucned we must call the end io function */
- bio_endio(bio, 0);
- __blk_rq_unmap_user(orig_bio);
- bio_put(bio);
- return ret;
-}
-
/**
* blk_rq_map_user - map user data to a request, for REQ_TYPE_BLOCK_PC usage
* @q: request queue where request should be inserted
@@ -109,58 +66,12 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
struct rq_map_data *map_data, void __user *ubuf,
unsigned long len, gfp_t gfp_mask)
{
- unsigned long bytes_read = 0;
- struct bio *bio = NULL;
- int ret;
+ struct sg_iovec iov;

- if (len > (q->max_hw_sectors << 9))
- return -EINVAL;
- if (!len)
- return -EINVAL;
+ iov.iov_base = ubuf;
+ iov.iov_len = len;

- if (!ubuf && (!map_data || !map_data->null_mapped))
- return -EINVAL;
-
- while (bytes_read != len) {
- unsigned long map_len, end, start;
-
- map_len = min_t(unsigned long, len - bytes_read,
- BIO_GUARANTEED_SIZE);
- end = ((unsigned long)ubuf + map_len + PAGE_SIZE - 1)
- >> PAGE_SHIFT;
- start = (unsigned long)ubuf >> PAGE_SHIFT;
-
- /*
- * A bad offset could cause us to require
- * BIO_GUARANTEED_PAGES + 1 pages. If this happens we
- * just lower the requested mapping len by a page so
- * that we can fit
- */
- if (end - start > BIO_GUARANTEED_PAGES)
- map_len -= PAGE_SIZE;
-
- ret = __blk_rq_map_user(q, rq, map_data, ubuf, map_len,
- gfp_mask);
- if (ret < 0)
- goto unmap_rq;
- if (!bio)
- bio = rq->bio;
- bytes_read += ret;
- ubuf += ret;
-
- if (map_data)
- map_data->offset += ret;
- }
-
- if (!bio_flagged(bio, BIO_USER_MAPPED))
- rq->cmd_flags |= REQ_COPY_USER;
-
- rq->buffer = rq->data = NULL;
- return 0;
-unmap_rq:
- blk_rq_unmap_user(bio);
- rq->bio = NULL;
- return ret;
+ return blk_rq_map_user_iov(q, rq, map_data, &iov, 1, len, gfp_mask);
}
EXPORT_SYMBOL(blk_rq_map_user);

@@ -250,23 +161,14 @@ EXPORT_SYMBOL(blk_rq_map_user_iov);
*/
int blk_rq_unmap_user(struct bio *bio)
{
- struct bio *mapped_bio;
- int ret = 0, ret2;
-
- while (bio) {
- mapped_bio = bio;
- if (unlikely(bio_flagged(bio, BIO_BOUNCED)))
- mapped_bio = bio->bi_private;
+ struct bio *mapped_bio = bio;
+ int ret;

- ret2 = __blk_rq_unmap_user(mapped_bio);
- if (ret2 && !ret)
- ret = ret2;
-
- mapped_bio = bio;
- bio = bio->bi_next;
- bio_put(mapped_bio);
- }
+ if (unlikely(bio_flagged(bio, BIO_BOUNCED)))
+ mapped_bio = bio->bi_private;

+ ret = __blk_rq_unmap_user(mapped_bio);
+ bio_put(bio);
return ret;
}
EXPORT_SYMBOL(blk_rq_unmap_user);
--
1.6.0.2

2009-04-01 11:48:59

by FUJITA Tomonori

Subject: Re: [PATCH 2/8] block: fix SG_IO vector request data length handling

On Wed, 1 Apr 2009 20:04:38 +0900
Tejun Heo <[email protected]> wrote:

> Impact: fix SG_IO behavior such that it matches the documentation
>
> SG_IO howto says that if ->dxfer_len and the sum of the iovec disagree,
> the shorter one wins. However, the current implementation returns
> -EINVAL for such cases. Trim the iovec if it's longer than ->dxfer_len.

Is that description about sg's SG_IO?


> This patch uses iov_*() helpers which take struct iovec * by casting
> struct sg_iovec * to it. sg_iovec is always identical to iovec and
> this will be further cleaned up with later patches.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> block/scsi_ioctl.c | 13 ++++++++++++-
> 1 files changed, 12 insertions(+), 1 deletions(-)
>
> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> index 626ee27..c8e8868 100644
> --- a/block/scsi_ioctl.c
> +++ b/block/scsi_ioctl.c
> @@ -288,6 +288,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
>
> if (hdr->iovec_count) {
> const int size = sizeof(struct sg_iovec) * hdr->iovec_count;
> + size_t iov_data_len;
> struct sg_iovec *iov;
>
> iov = kmalloc(size, GFP_KERNEL);
> @@ -302,8 +303,18 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
> goto out;
> }
>
> + /* SG_IO howto says that the shorter of the two wins */
> + iov_data_len = iov_length((struct iovec *)iov,
> + hdr->iovec_count);
> + if (hdr->dxfer_len < iov_data_len) {
> + hdr->iovec_count = iov_shorten((struct iovec *)iov,
> + hdr->iovec_count,
> + hdr->dxfer_len);
> + iov_data_len = hdr->dxfer_len;
> + }
> +
> ret = blk_rq_map_user_iov(q, rq, NULL, iov, hdr->iovec_count,
> - hdr->dxfer_len, GFP_KERNEL);
> + iov_data_len, GFP_KERNEL);
> kfree(iov);
> } else if (hdr->dxfer_len)
> ret = blk_rq_map_user(q, rq, NULL, hdr->dxferp, hdr->dxfer_len,
> --
> 1.6.0.2
>

2009-04-01 11:51:18

by Jens Axboe

Subject: Re: [PATCH 2/8] block: fix SG_IO vector request data length handling

On Wed, Apr 01 2009, FUJITA Tomonori wrote:
> On Wed, 1 Apr 2009 20:04:38 +0900
> Tejun Heo <[email protected]> wrote:
>
> > Impact: fix SG_IO behavior such that it matches the documentation
> >
> > SG_IO howto says that if ->dxfer_len and the sum of the iovec disagree,
> > the shorter one wins. However, the current implementation returns
> > -EINVAL for such cases. Trim the iovec if it's longer than ->dxfer_len.
>
> Is that description about sg's SG_IO?

The more important question is what sg.c actually does, that's more
important than the documentation.

--
Jens Axboe

2009-04-01 11:55:27

by FUJITA Tomonori

Subject: Re: [PATCH 7/8] blk-map: let blk_rq_map_user_iov() support null mapping

On Wed, 1 Apr 2009 20:04:43 +0900
Tejun Heo <[email protected]> wrote:

> Impact: API expansion
>
> Till now, only blk_rq_map_user() supported null mapping. Add null mapping
> support to blk_rq_map_user_iov() by moving BIO_NULL_MAPPED setting to
> bio_copy_user_iov().

Why does blk_rq_map_user_iov() need to support the null mapping?


> Signed-off-by: Tejun Heo <[email protected]>
> ---
> block/blk-map.c | 3 ---
> fs/bio.c | 4 +++-
> 2 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 6718021..ac1961d 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -63,9 +63,6 @@ static int __blk_rq_map_user(struct request_queue *q, struct request *rq,
> if (IS_ERR(bio))
> return PTR_ERR(bio);
>
> - if (map_data && map_data->null_mapped)
> - bio->bi_flags |= (1 << BIO_NULL_MAPPED);
> -
> orig_bio = bio;
> blk_queue_bounce(q, &bio);
>
> diff --git a/fs/bio.c b/fs/bio.c
> index 8ad9784..728bef9 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -869,7 +869,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
> /*
> * success
> */
> - if (!write_to_vm && (!map_data || !map_data->null_mapped)) {
> + if (unlikely(map_data && map_data->null_mapped))
> + bio->bi_flags |= (1 << BIO_NULL_MAPPED);
> + else if (!write_to_vm) {
> ret = __bio_copy_iov(bio, bio->bi_io_vec, iov, iov_count, 0, 0);
> if (ret)
> goto cleanup;
> --
> 1.6.0.2
>

2009-04-01 12:03:43

by Tejun Heo

Subject: Re: [PATCH 7/8] blk-map: let blk_rq_map_user_iov() support null mapping

FUJITA Tomonori wrote:
> On Wed, 1 Apr 2009 20:04:43 +0900
> Tejun Heo <[email protected]> wrote:
>
>> Impact: API expansion
>>
>> Till now, only blk_rq_map_user() supported null mapping. Add null mapping
>> support to blk_rq_map_user_iov() by moving BIO_NULL_MAPPED setting to
>> bio_copy_user_iov().
>
> Why does blk_rq_map_user_iov() need to support the null mapping?

Because the next patch makes blk_rq_map_user() use it.

--
tejun

2009-04-01 12:20:15

by FUJITA Tomonori

Subject: Re: [PATCH 2/8] block: fix SG_IO vector request data length handling

On Wed, 1 Apr 2009 13:50:58 +0200
Jens Axboe <[email protected]> wrote:

> On Wed, Apr 01 2009, FUJITA Tomonori wrote:
> > On Wed, 1 Apr 2009 20:04:38 +0900
> > Tejun Heo <[email protected]> wrote:
> >
> > > Impact: fix SG_IO behavior such that it matches the documentation
> > >
> > > SG_IO howto says that if ->dxfer_len and the sum of the iovec disagree,
> > > the shorter one wins. However, the current implementation returns
> > > -EINVAL for such cases. Trim the iovec if it's longer than ->dxfer_len.
> >
> > Is that description about sg's SG_IO?
>
> The more important question is what sg.c actually does, that's more
> important than the documentation.

Do you think that Doug is a person who makes such a mistake? ;)

Seems that sg worked as the howto says. But I think that I broke it
when I converted sg to use the block layer. I'll fix it soon.

About this patch, as we know, there are lots of subtle differences
between sg's SG_IO and the block's. I'm not sure that it's a good idea
to change the behavior of the block's SG_IO.

2009-04-01 12:25:21

by Jens Axboe

Subject: Re: [PATCH 2/8] block: fix SG_IO vector request data length handling

On Wed, Apr 01 2009, FUJITA Tomonori wrote:
> On Wed, 1 Apr 2009 13:50:58 +0200
> Jens Axboe <[email protected]> wrote:
>
> > On Wed, Apr 01 2009, FUJITA Tomonori wrote:
> > > On Wed, 1 Apr 2009 20:04:38 +0900
> > > Tejun Heo <[email protected]> wrote:
> > >
> > > > Impact: fix SG_IO behavior such that it matches the documentation
> > > >
> > > > SG_IO howto says that if ->dxfer_len and the sum of the iovec disagree,
> > > > the shorter one wins. However, the current implementation returns
> > > > -EINVAL for such cases. Trim the iovec if it's longer than ->dxfer_len.
> > >
> > > Is that description about sg's SG_IO?
> >
> > The more important question is what sg.c actually does, that's more
> > important than the documentation.
>
> Do you think that Doug is a person who makes such a mistake? ;)
>
> Seems that sg worked as the howto says. But I think that I broke it
> when I converted sg to use the block layer. I'll fix it soon.

OK

> About this patch, as we know, there are lots of subtle differences
> between sg's SG_IO and the block's. I'm not sure that it's a good idea
> to change the behavior of the block's SG_IO.

Depends on what it is. For something like this where the alternative is
returning -EINVAL and the patch makes it work, we definitely should just
do it.

--
Jens Axboe

2009-04-01 12:27:34

by Tejun Heo

Subject: Re: [PATCH 2/8] block: fix SG_IO vector request data length handling

FUJITA Tomonori wrote:
> On Wed, 1 Apr 2009 13:50:58 +0200
> Jens Axboe <[email protected]> wrote:
>
>> On Wed, Apr 01 2009, FUJITA Tomonori wrote:
>>> On Wed, 1 Apr 2009 20:04:38 +0900
>>> Tejun Heo <[email protected]> wrote:
>>>
>>>> Impact: fix SG_IO behavior such that it matches the documentation
>>>>
>>>> SG_IO howto says that if ->dxfer_len and the sum of the iovec disagree,
>>>> the shorter one wins. However, the current implementation returns
>>>> -EINVAL for such cases. Trim the iovec if it's longer than ->dxfer_len.
>>> Is that description about sg's SG_IO?

It looks like it's the closest thing.

>> The more important question is what sg.c actually does, that's more
>> important than the documentation.

The current code would fail it with -EINVAL but after a brief look into
2.6.12-rc2, it seems like it would use the shorter one. On the direct
mapping path, it builds considering both lengths, and on the indirect
path it doesn't seem to look at the supplied iov till the transfer is
actually complete using the dxfer_len, and then copies out whatever
can be copied out.

> Do you think that Doug is a person who makes such a mistake? ;)
>
> Seems that sg worked as the howto says. But I think that I broke it
> when I converted sg to use the block layer. I'll fix it soon.
>
> About this patch, as we know, there are lots of subtle differences
> between sg's SG_IO and the block's. I'm not sure that it's a good idea
> to change the behavior of the block's SG_IO.

I think it's better to make the behavior more consistent. Using
shorter dxfer_len can be considered a feature too, so...

Thanks.

--
tejun

2009-04-01 12:29:56

by Jens Axboe

Subject: Re: [PATCH 2/8] block: fix SG_IO vector request data length handling

On Wed, Apr 01 2009, Tejun Heo wrote:
> FUJITA Tomonori wrote:
> > On Wed, 1 Apr 2009 13:50:58 +0200
> > Jens Axboe <[email protected]> wrote:
> >
> >> On Wed, Apr 01 2009, FUJITA Tomonori wrote:
> >>> On Wed, 1 Apr 2009 20:04:38 +0900
> >>> Tejun Heo <[email protected]> wrote:
> >>>
> >>>> Impact: fix SG_IO behavior such that it matches the documentation
> >>>>
> >>>> SG_IO howto says that if ->dxfer_len and the sum of the iovec disagree,
> >>>> the shorter one wins. However, the current implementation returns
> >>>> -EINVAL for such cases. Trim the iovec if it's longer than ->dxfer_len.
> >>> Is that description about sg's SG_IO?
>
> It looks like it's the closest thing.
>
> >> The more important question is what sg.c actually does, that's more
> >> important than the documentation.
>
> The current code would fail it with -EINVAL but after a brief look into
> 2.6.12-rc2, it seems like it would use the shorter one. On the direct
> mapping path, it builds considering both lengths, and on the indirect
> path it doesn't seem to look at the supplied iov till the transfer is
> actually complete using the dxfer_len, and then copies out whatever
> can be copied out.
>
> > Do you think that Doug is a person who makes such a mistake? ;)
> >
> > Seems that sg worked as the howto says. But I think that I broke it
> > when I converted sg to use the block layer. I'll fix it soon.
> >
> > About this patch, as we know, there are lots of subtle differences
> > between sg's SG_IO and the block's. I'm not sure that it's a good idea
> > to change the behavior of the block's SG_IO.
>
> I think it's better to make the behavior more consistent. Using
> shorter dxfer_len can be considered a feature too, so...

It's definitely a good feature; it's how you'd probe lots of scsi
command returns. Setting up the iovec once and passing first a smaller
and then a larger dxfer_len would be a perfectly reasonable way to do
that.
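
Concretely, something like this sketch (ours; fd, iov, n and full_len
are hypothetical) reuses one iovec across probes of different sizes:

	hdr.iovec_count = n;
	hdr.dxferp      = iov;		/* sized for the largest transfer */

	hdr.dxfer_len = 96;		/* small probe first */
	ioctl(fd, SG_IO, &hdr);

	hdr.dxfer_len = full_len;	/* then the full-sized transfer */
	ioctl(fd, SG_IO, &hdr);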

--
Jens Axboe

2009-04-01 12:45:31

by FUJITA Tomonori

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

On Wed, 1 Apr 2009 20:04:44 +0900
Tejun Heo <[email protected]> wrote:

> Impact: subtle bugs fixed, cleanup
>
> blk_rq_map_user() supported multi-bio mapping by calling
> bio_map/copy_user() multiple times and linking the resulting bios into
> rq; however, this had subtle bugs.
>
> * Because each call to bio_map/copy_user() is independent, the segment
> limit check was done only per bio, so it was possible to create
> requests which are larger than the driver and hardware limits, which
> could lead to a disastrous outcome.

What do you mean? blk_rq_append_bio properly checks the segment and
limit, I think.


> * Layers under FS may call blk_rq_map*() functions during request
> processing. Under severe memory pressure and with enough bad luck,
> this can lead to deadlock. As fs bvec pool is quite small, the
> possibility isn't completely theoretical.
>
> This patch reimplements blk_rq_map_user() in terms of
> blk_rq_map_user_iov(), which doesn't support multi-bio mapping, and
> drops multi-bio handling from blk_rq_unmap_user(). Note that with the
> previous patches to remove the bio max size limit and to add null
> mapping support to blk_rq_map_user_iov(), this change doesn't remove
> any functionality.

I don't think that we can drop multi bio handling from
blk_rq_unmap_user(). It may make some users angry. Mike Christie added
it because it was necessary.

2009-04-01 12:47:35

by Jens Axboe

Subject: Re: [PATCH 4/8] bio: actually inline inline bvecs into bio

On Wed, Apr 01 2009, Tejun Heo wrote:
> Impact: cleanup
>
> BIO_INLINE_VECS bvecs are inlined into bio to avoid bvec allocation
> for small transfers. This was achieved by declaring zero sized bvec
> array at the end of bio and allocating bio with extra bytes at the
> end. As BIO_INLINE_VECS is constant, there is no reason to do this
> allocation trick. This patch simply defines BIO_INLINE_VECS sized
> bvec array at the end. This will help fix bio_kmalloc().

I don't like this, it's much nicer to do it with a zero sized array. If
you don't need a bio_vec, then you don't consume the extra space. I
guess for that to really work, we'd need one more slab and mempool
though. At least for non-stack bio's. I also fear that direct uses of
->bi_inline_vecs[] will then crop up, making it harder to go back.

--
Jens Axboe

2009-04-01 12:51:21

by Tejun Heo

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

FUJITA Tomonori wrote:
>> * Because each call to bio_map/copy_user() is independent, the segment
>> limit check was done only per bio, so it was possible to create
>> requests which are larger than the driver and hardware limits, which
>> could lead to a disastrous outcome.
>
> What do you mean? blk_rq_append_bio properly checks the segment and
> limit, I think.

Right, ll_back_merge_fn() does that. Sorry about that.

>> * Layers under FS may call blk_rq_map*() functions during request
>> processing. Under severe memory pressure and with enough bad luck,
>> this can lead to deadlock. As fs bvec pool is quite small, the
>> possibility isn't completely theoretical.
>>
>> This patch reimplements blk_rq_map_user() in terms of
>> blk_rq_map_user_iov(), which doesn't support multi-bio mapping, and
>> drops multi-bio handling from blk_rq_unmap_user(). Note that with the
>> previous patches to remove the bio max size limit and to add null
>> mapping support to blk_rq_map_user_iov(), this change doesn't remove
>> any functionality.
>
> I don't think that we can drop multi bio handling from
> blk_rq_unmap_user(). It may make some users angry. Mike Christie added
> it because it was necessary.

The only user of blk_rq_append_bio() is scsi_lib.c. Is Mike
Christie's code chaining bio's directly into rq?

Thanks.

--
tejun

2009-04-01 12:56:33

by Tejun Heo

Subject: Re: [PATCH 4/8] bio: actually inline inline bvecs into bio

Jens Axboe wrote:
> On Wed, Apr 01 2009, Tejun Heo wrote:
>> Impact: cleanup
>>
>> BIO_INLINE_VECS bvecs are inlined into bio to avoid bvec allocation
>> for small transfers. This was achieved by declaring zero sized bvec
>> array at the end of bio and allocating bio with extra bytes at the
>> end. As BIO_INLINE_VECS is constant, there is no reason to do this
>> allocation trick. This patch simply defines BIO_INLINE_VECS sized
>> bvec array at the end. This will help fix bio_kmalloc().
>
> I don't like this, it's much nicer to do it with a zero sized array. If
> you don't need a bio_vec, then you don't consume the extra space. I
> guess for that to really work, we'd need one more slab and mempool
> though. At least for non-stack bio's. I also fear that direct uses of
> ->bi_inline_vecs[] will then crop up, making it harder to go back.

I don't know. None of the current users cares about it and unless
there's plan to make the number of inlined vecs variable, doing it
separately doesn't buy much and I think we'll probably pay more by
having extra slab and mempool.

Dropping this one isn't difficult tho. If your objection is strong,
I'll drop it and update the next one accordingly. Actually, what we
can do for bio_kmalloc() is to embed whole bvec as we know the size
anyway.

Thanks.

--
tejun

2009-04-01 12:59:30

by Jens Axboe

Subject: Re: [PATCH 4/8] bio: actually inline inline bvecs into bio

On Wed, Apr 01 2009, Tejun Heo wrote:
> Jens Axboe wrote:
> > On Wed, Apr 01 2009, Tejun Heo wrote:
> >> Impact: cleanup
> >>
> >> BIO_INLINE_VECS bvecs are inlined into bio to avoid bvec allocation
> >> for small transfers. This was achieved by declaring zero sized bvec
> >> array at the end of bio and allocating bio with extra bytes at the
> >> end. As BIO_INLINE_VECS is constant, there is no reason to do this
> >> allocation trick. This patch simply defines BIO_INLINE_VECS sized
> >> bvec array at the end. This will help fix bio_kmalloc().
> >
> > I don't like this, it's much nicer to do it with a zero sized array. If
> > you don't need a bio_vec, then you don't consume the extra space. I
> > guess for that to really work, we'd need one more slab and mempool
> > though. At least for non-stack bio's. I also fear that direct uses of
> > ->bi_inline_vecs[] will then crop up, making it harder to go back.
>
> I don't know. None of the current users cares about it and unless
> there's plan to make the number of inlined vecs variable, doing it
> separately doesn't buy much and I think we'll probably pay more by
> having extra slab and mempool.
>
> Dropping this one isn't difficult tho. If your objection is strong,
> I'll drop it and update the next one accordingly. Actually, what we
> can do for bio_kmalloc() is to embed whole bvec as we know the size
> anyway.

Should not be hard, it's just a matter of what size to pass. Thanks!

--
Jens Axboe

2009-04-01 13:00:56

by FUJITA Tomonori

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

On Wed, 01 Apr 2009 21:50:49 +0900
Tejun Heo <[email protected]> wrote:

> FUJITA Tomonori wrote:
> >> * Because each call to bio_map/copy_user() is independent, the segment
> >> limit check was done only per bio, so it was possible to create
> >> requests which are larger than the driver and hardware limits, which
> >> could lead to a disastrous outcome.
> >
> > What do you mean? blk_rq_append_bio properly checks the segment and
> > limit, I think.
>
> Right, ll_back_merge_fn() does that. Sorry about that.
>
> >> * Layers under FS may call blk_rq_map*() functions during request
> >> processing. Under severe memory pressure and with enough bad luck,
> >> this can lead to deadlock. As fs bvec pool is quite small, the
> >> possibility isn't completely theoretical.
> >>
> >> This patch reimplements blk_rq_map_user() in terms of
> >> blk_rq_map_user_iov(), which doesn't support multi-bio mapping, and
> >> drops multi-bio handling from blk_rq_unmap_user(). Note that with the
> >> previous patches to remove the bio max size limit and to add null
> >> mapping support to blk_rq_map_user_iov(), this change doesn't remove
> >> any functionality.
> >
> > I don't think that we can drop multi bio handling from
> > blk_rq_unmap_user(). It may make some users angry. Mike Christie added
> > it because it was necessary.
>
> The only user of blk_rq_append_bio() is scsi_lib.c. Is Mike
> Christie's code chaining bio's directly into rq?

No, we are not talking about blk_rq_append_bio().

We are talking about the multiple bio handling in blk_rq_map_user,
which is the feature that Mike added a long time ago. The feature is
surely necessary for some users. So you can't remove it.

2009-04-01 13:04:13

by Tejun Heo

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

FUJITA Tomonori wrote:
> No, we are not talking about blk_rq_append_bio().
>
> We are talking about the multiple bio handling in blk_rq_map_user,
> which is the feature that Mike added a long time ago. The feature is
> surely necessary for some users. So you can't remove it.

How would someone use that without blk_rq_append_bio()? The only
reason blk_rq_map_user() had multiple bio chaining was to work around
BIO_MAX_SIZE. blk_rq_map_user_iov() doesn't support multiple bio
chaining, so sans blk_rq_append_bio() or playing with rq/bio internals
directly, there's no way to use or even know about multiple bios.

Thanks.

--
tejun

2009-04-01 13:12:20

by FUJITA Tomonori

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

On Wed, 01 Apr 2009 22:03:39 +0900
Tejun Heo <[email protected]> wrote:

> FUJITA Tomonori wrote:
> > No, we are not talking about blk_rq_append_bio().
> >
> > We are talking about the multiple bio handling in blk_rq_map_user,
> > which is the feature that Mike added a long time ago. The feature is
> > surely necessary for some users. So you can't remove it.
>
> How would someone use that without blk_rq_append_bio()? The only

Hmm, I'm not sure what you are talking about.

Why do we need to live without blk_rq_append_bio()?

You want to remove blk_rq_append_bio()? Please make your goal clear.


> reason blk_rq_map_user() had multiple bio chaining was to work around
> BIO_MAX_SIZE. blk_rq_map_user_iov() doesn't support multiple bio
> chaining, so sans blk_rq_append_bio() or playing with rq/bio internals
> directly, there's no way to use or even know about multiple bios.

Yes, only non iovec interface of SG_IO supports large data
transfer. Users have lived with that.

2009-04-01 13:17:37

by Tejun Heo

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

Hello,

FUJITA Tomonori wrote:
> On Wed, 01 Apr 2009 22:03:39 +0900
> Tejun Heo <[email protected]> wrote:
>
>> FUJITA Tomonori wrote:
>>> No, we are not talking about blk_rq_append_bio().
>>>
>>> We are talking about the multiple bio handling in blk_rq_map_user,
>>> which is the feature that Mike added a long time ago. The feature is
>>> surely necessary for some users. So you can't remove it.
>> How would someone use that without blk_rq_append_bio()? The only
>
> Hmm, I'm not sure what you are talking about.
>
> Why do we need to live without blk_rq_append_bio()?
>
> You want to remove blk_rq_append_bio()? Please make your goal clear.

Yeah, I'm writing header message for the next patchset. It will go
out in a few minutes. With the bogus fix part removed, this patch
(and related earlier ones) should have been part of the next set.
And, yes, the goal is removing blk_rq_append_bio() and any and all
request/bio internal meddling with further patchsets.

>> reason blk_rq_map_user() had multiple bio chaining was to work around
>> BIO_MAX_SIZE. blk_rq_map_user_iov() doesn't support multiple bio
>> chaining, so sans blk_rq_append_bio() or playing with rq/bio internals
>> directly, there's no way to use or even know about multiple bios.
>
> Yes, only non iovec interface of SG_IO supports large data
> transfer. Users have lived with that.

This patch doesn't remove any feature. You don't lose anything. What
used to be done with multiple bios is now done with single bio. The
implementation is simpler and shorter. Using or not using multiple
bios doesn't (and shouldn't) make any difference to blk_map_*() users.

Thanks.

--
tejun

2009-04-01 13:18:53

by Tejun Heo

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

Tejun Heo wrote:
>>> reason blk_rq_map_user() had multiple bio chaining was to work around
>>> BIO_MAX_SIZE. blk_rq_map_user_iov() doesn't support multiple bio
>>> chaining, so sans blk_rq_append_bio() or playing with rq/bio internals
>>> directly, there's no way to use or even know about multiple bios.
>> Yes, only non iovec interface of SG_IO supports large data
>> transfer. Users have lived with that.
>
> This patch doesn't remove any feature. You don't lose anything. What
> used to be done with multiple bios is now done with single bio. The
> implementation is simpler and shorter. Using or not using multiple
> bios doesn't (and shouldn't) make any difference to blk_map_*() users.

Oh.. and blk_rq_map_user_iov() now gets to play with large requests
too. :-)

--
tejun

2009-04-01 13:23:44

by FUJITA Tomonori

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

On Wed, 01 Apr 2009 22:17:05 +0900
Tejun Heo <[email protected]> wrote:

> Hello,
>
> FUJITA Tomonori wrote:
> > On Wed, 01 Apr 2009 22:03:39 +0900
> > Tejun Heo <[email protected]> wrote:
> >
> >> FUJITA Tomonori wrote:
> >>> No, we are not talking about blk_rq_append_bio().
> >>>
> >>> We are talking about the multiple bio handling in blk_rq_map_user,
> >>> which is the feature that Mike added a long time ago. The feature is
> >>> surely necessary for some users. So you can't remove it.
> >> How would someone use that without blk_rq_append_bio()? The only
> >
> > Hmm, I'm not sure what you are talking about.
> >
> > Why do we need to live without blk_rq_append_bio()?
> >
> > You want to remove blk_rq_append_bio()? Please make your goal clear.
>
> Yeah, I'm writing header message for the next patchset. It will go
> out in a few minutes. With the bogus fix part removed, this patch
> (and related earlier ones) should have been part of the next set.
> And, yes, the goal is removing blk_rq_append_bio() and any and all
> request/bio internal meddling with further patchsets.

Sounds like a good idea. But I need to review that.

But the 7/8 and 8/8 patches are not bug fixes at all (as I wrote, your
descriptions about the checking are untrue). They can't be for 2.6.30,
so put them in the next patchset.


> >> reason blk_rq_map_user() had multiple bio chaining was to work around
> >> BIO_MAX_SIZE. blk_rq_map_user_iov() doesn't support multiple bio
> >> chaining, so sans blk_rq_append_bio() or playing with rq/bio internals
> >> directly, there's no way to use or even know about multiple bios.
> >
> > Yes, only non iovec interface of SG_IO supports large data
> > transfer. Users have lived with that.
>
> This patch doesn't remove any feature. You don't lose anything. What
> used to be done with multiple bios is now done with single bio. The
> implementation is simpler and shorter. Using or not using multiple
> bios doesn't (and shouldn't) make any difference to blk_map_*() users.

Hmm, with your change, blk_rq_map_user can't handle larger than
BIO_MAX_SIZE, right?

2009-04-01 13:28:57

by Tejun Heo

Subject: Re: [PATCH 8/8] blk-map: reimplement blk_rq_map_user() using blk_rq_map_user_iov()

FUJITA Tomonori wrote:
> Sounds like a good idea. But I need to review that.
>
> But the 7/8 and 8/8 patches are not bug fixes at all (as I wrote, your
> descriptions about the checking are untrue). They can't be for 2.6.30,
> so put them in the next patchset.

Yeah, it should.

>>>> reason blk_rq_map_user() had multiple bio chaining was to work around
>>>> BIO_MAX_SIZE. blk_rq_map_user_iov() doesn't support multiple bio
>>>> chaining, so sans blk_rq_append_bio() or playing with rq/bio internals
>>>> directly, there's no way to use or even know about multiple bios.
>>> Yes, only non iovec interface of SG_IO supports large data
>>> transfer. Users have been lived with that.
>> This patch doesn't remove any feature. You don't lose anything. What
>> used to be done with multiple bios is now done with single bio. The
>> implementation is simpler and shorter. Using or not using multiple
>> bios doesn't (and shouldn't) make any difference to blk_map_*() users.
>
> Hmm, with your change, blk_rq_map_user can't handle larger than
> BIO_MAX_SIZE, right?

Yes, it can. The previous bio_kmalloc() thing was for this.

Thanks.

--
tejun