Hello, Bartlomiej, Jens.
This patchset contains mostly IDE cleanups around rq->buffer, ->data
and ->special usage. Over time, the meanings of those three fields
became blurry. By default, they mean
->buffer  : kernel address of the current bio segment, NULL for !bio
            requests and may also be NULL if the address is in high memory
->data    : direct data pointer specified by the LLD for !fs requests
->special : LLD-specific pointer for special requests, opaque to the
            block layer, also used as a general opaque pointer
Currently, IDE is the biggest offender against the above conventions
and uses all three fields as opaque pointers depending on the
situation. This complicates both IDE and the block layer. Also, with
SCSI completely switched over to bio for everything, there are only a
few places left which still use direct pointers for !fs requests.
This patchset updates IDE so that it uses only ->special for opaque
pointers and always uses bio for data handling.
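To illustrate where this ends up, a driver-internal command after the
conversion looks roughly like the following. This is only a sketch
(the function name is made up); the pattern is what ide_queue_pc_tail()
becomes later in the series:

static int example_issue_pc(ide_drive_t *drive, struct gendisk *disk,
			    struct ide_atapi_pc *pc)
{
	struct request *rq;
	int error;

	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
	rq->cmd_type = REQ_TYPE_SPECIAL;
	rq->special = (char *)pc;	/* opaque pointer goes into ->special */

	if (pc->req_xfer) {
		/* data is always carried by bio, never rq->data/rq->buffer */
		error = blk_rq_map_kern(drive->queue, rq, pc->buf,
					pc->req_xfer, __GFP_WAIT);
		if (error)
			goto put_req;
	}

	memcpy(rq->cmd, pc->c, 12);
	error = blk_execute_rq(drive->queue, disk, rq, 0);
put_req:
	blk_put_request(rq);
	return error;
}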
This patchset contains the following patches.
0001-block-clear-req-errors-on-bio-completion-only-for.patch
0002-block-reorganize-__-bio_map_kern.patch
0003-block-implement-blk_rq_map_kern_prealloc.patch
0004-ide-use-blk_run_queue-instead-of-blk_start_queuei.patch
0005-ide-don-t-set-REQ_SOFTBARRIER.patch
0006-ide-kill-unused-ide_cmd-special.patch
0007-ide-cd-clear-sense-buffer-before-issuing-request-se.patch
0008-ide-floppy-block-pc-always-uses-bio.patch
0009-ide-taskfile-don-t-abuse-rq-buffer.patch
0010-ide-atapi-don-t-abuse-rq-buffer.patch
0011-ide-cd-don-t-abuse-rq-buffer.patch
0012-ide-pm-don-t-abuse-rq-data.patch
0013-ide-atapi-use-bio-for-request-sense.patch
0014-ide-cd-use-bio-for-request-sense.patch
0001-0003 are block layer patches. As IDE uses data requests which
are neither FS nor PC, 0001 updates rq completion so that rq->errors
isn't cleared accidentally for those requests. 0002-0003 implement
blk_rq_map_kern_prealloc() which will be used for bio-mapping the
sense buffer. This is necessary because IDE issues request sense from
IRQ context. I think it would be best if these first three patches
went through pata-2.6 as the impact on other block users is minimal
(0001 could affect other users, but all of the current few users look
okay). Jens, what do you think?
0004-0007 are misc cleanup patches.
0008 drops the unnecessary rq->data handling path from
idefloppy_blockpc_cmd() as block PCs always use bio for data transfer.
0009-0014 clean up rq->buffer and rq->data usage. Where a field is
used as an opaque pointer, rq->special is used instead. Where it is
used for data, the data is mapped using either blk_rq_map_kern() or
blk_rq_map_kern_prealloc().
ide-cd and ide-floppy are tested. I don't have access to an IDE tape
device, so that part hasn't been tested, but given that the atapi
changes are shared between floppy and tape, I don't think too much
could be broken for tape. I'm trying to get an IDE tape drive but
it's quite difficult to come by over here. :-(
This patchset will allow further IDE and block layer cleanups and
ease future improvements to the block layer.
This patchset is on top of linux-next pata-2.6 tree as of 2009-03-23.
A git tree is available at the address below, but please DO NOT pull
from it: the pata-2.6 tree is quilt-based and the base tree I used
isn't the one which is going to go upstream. The following tree is
for review only.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git ide-phase1
Bartlomiej, if you're okay with the changes, and Jens acks the
patches and agrees to push the first three via pata-2.6, please feel
free to include these patches in your tree.
This patchset contains the following changes.
block/blk-core.c | 10 +++--
block/blk-map.c | 50 ++++++++++++++++++++++++++++
drivers/ide/ide-atapi.c | 42 +++++++++++++----------
drivers/ide/ide-cd.c | 51 ++++++++++++++---------------
drivers/ide/ide-disk.c | 1
drivers/ide/ide-floppy.c | 16 +++------
drivers/ide/ide-io.c | 7 +--
drivers/ide/ide-ioctls.c | 1
drivers/ide/ide-park.c | 7 +--
drivers/ide/ide-pm.c | 36 +++++++-------------
drivers/ide/ide-tape.c | 10 +++--
drivers/ide/ide-taskfile.c | 18 ++++++----
fs/bio.c | 79 ++++++++++++++++++++++++++-------------------
include/linux/bio.h | 3 +
include/linux/blkdev.h | 4 ++
include/linux/ide.h | 4 +-
16 files changed, 205 insertions(+), 134 deletions(-)
Thanks.
--
tejun
Impact: code reorganization, cleanup
Reorganize [__]bio_map_kern() such that __bio_map_kern() only maps the
pages while bio_map_kern() handles allocation. Also, if not all pages
can be mapped, __bio_map_kern() fails directly instead of doing
partial mapping and letting bio_map_kern() detect it and fail.
While at it, convert to PFN_*() macros for page number calculations.
The reorganized __bio_map_kern() will be used in the next patch for
bio_map_kern_prealloc().
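For the record, the PFN based calculation in bio_map_kern() gives the
same result as the old open-coded shifts; a quick worked example,
assuming 4k pages (numbers picked arbitrarily for illustration):

/*
 * kaddr = 0x1ff0, len = 0x30:
 *
 *	PFN_DOWN(kaddr)     = 0x1ff0 >> 12           = 1
 *	PFN_UP(kaddr + len) = (0x2020 + 0xfff) >> 12 = 3
 *	nr_pages            = 3 - 1                  = 2
 *
 * i.e. the buffer straddles the page boundary at 0x2000 and needs two
 * bio_vecs -- identical to the old
 * ((kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT) - (kaddr >> PAGE_SHIFT).
 */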
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
fs/bio.c | 56 +++++++++++++++++++++++---------------------------------
1 files changed, 23 insertions(+), 33 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index d4f0632..746b566 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -26,6 +26,7 @@
#include <linux/mempool.h>
#include <linux/workqueue.h>
#include <linux/blktrace_api.h>
+#include <linux/pfn.h>
#include <trace/block.h>
#include <scsi/sg.h> /* for struct sg_iovec */
@@ -1109,41 +1110,27 @@ static void bio_map_kern_endio(struct bio *bio, int err)
}
-static struct bio *__bio_map_kern(struct request_queue *q, void *data,
- unsigned int len, gfp_t gfp_mask)
+static int __bio_map_kern(struct request_queue *q, struct bio *bio,
+ void *data, unsigned int len)
{
- unsigned long kaddr = (unsigned long)data;
- unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
- unsigned long start = kaddr >> PAGE_SHIFT;
- const int nr_pages = end - start;
int offset, i;
- struct bio *bio;
-
- bio = bio_alloc(gfp_mask, nr_pages);
- if (!bio)
- return ERR_PTR(-ENOMEM);
-
- offset = offset_in_page(kaddr);
- for (i = 0; i < nr_pages; i++) {
- unsigned int bytes = PAGE_SIZE - offset;
- if (len <= 0)
- break;
-
- if (bytes > len)
- bytes = len;
+ offset = offset_in_page(data);
+ for (i = 0; len; i++) {
+ unsigned int bytes = min_t(unsigned int,
+ len, PAGE_SIZE - offset);
+ /* don't support partial mapping */
if (bio_add_pc_page(q, bio, virt_to_page(data), bytes,
offset) < bytes)
- break;
+ return -EINVAL;
data += bytes;
len -= bytes;
offset = 0;
}
- bio->bi_end_io = bio_map_kern_endio;
- return bio;
+ return 0;
}
/**
@@ -1159,20 +1146,23 @@ static struct bio *__bio_map_kern(struct request_queue *q, void *data,
struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
gfp_t gfp_mask)
{
+ const unsigned long kaddr = (unsigned long)data;
+ const int nr_pages = PFN_UP(kaddr + len) - PFN_DOWN(kaddr);
struct bio *bio;
+ int error;
- bio = __bio_map_kern(q, data, len, gfp_mask);
- if (IS_ERR(bio))
- return bio;
+ bio = bio_alloc(gfp_mask, nr_pages);
+ if (unlikely(!bio))
+ return ERR_PTR(-ENOMEM);
+ bio->bi_end_io = bio_map_kern_endio;
- if (bio->bi_size == len)
- return bio;
+ error = __bio_map_kern(q, bio, data, len);
+ if (unlikely(error)) {
+ bio_put(bio);
+ return ERR_PTR(error);
+ }
- /*
- * Don't support partial mappings.
- */
- bio_put(bio);
- return ERR_PTR(-EINVAL);
+ return bio;
}
static void bio_copy_kern_endio(struct bio *bio, int err)
--
1.6.0.2
Impact: subtle behavior change
For fs requests, the rq is only a carrier of bios and the rq error
status as a whole doesn't mean much. This is why rq->errors is
cleared on each partial completion of a request: at that point the
error status has been transferred to the respective bios.
For pc requests, rq->errors is used to carry the error status to the
issuer, and thus __end_that_request_first() doesn't clear it in that
case.
This was fine till now as only fs and pc requests used bio and thus
the bio completion path. However, future changes will unify data
access through bio, and all non-fs users care about the rq error
status. Clear rq->errors on bio completion only for fs requests.
In general, the implicit clearing is a bit too subtle, especially as
the meaning of rq->errors is completely dependent on the low level
drivers. Unifying / cleaning up rq->errors usage and letting LLDs
manage it would be better. A TODO comment is added.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-core.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 29bcfac..beb597d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1735,10 +1735,14 @@ static int __end_that_request_first(struct request *req, int error,
trace_block_rq_complete(req->q, req);
/*
- * for a REQ_TYPE_BLOCK_PC request, we want to carry any eventual
- * sense key with us all the way through
+ * For fs requests, rq is just carrier of independent bio's
+ * and each partial completion should be handled separately.
+ * Reset per-request error on each partial completion.
+ *
+ * TODO: tj: This is too subtle. It would be better to let
+ * low level drivers do what they see fit.
*/
- if (!blk_pc_request(req))
+ if (blk_fs_request(req))
req->errors = 0;
if (error && (blk_fs_request(req) && !(req->cmd_flags & REQ_QUIET))) {
--
1.6.0.2
Impact: implement new API
There still are a few places where a request is allocated statically
and initialized manually. Till now, those sites used their own custom
mechanisms to pass data buffers through the request queue. This patch
implements blk_rq_map_kern_prealloc() which initializes an rq with a
pre-allocated bio and bvec so that those code paths can be converted
to bio.
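For illustration, a caller in atomic context ends up looking roughly
like this. This is just a sketch modelled on the ide-cd request sense
conversion later in the series (the drive->request_sense_* fields and
the sense buffer length of 18 come from those later patches):

static void example_queue_sense(ide_drive_t *drive, void *sense)
{
	struct request *rq = &drive->request_sense_rq;
	struct bio *bio = &drive->request_sense_bio;
	struct bio_vec *bvec = drive->request_sense_bvec;
	int error;

	blk_rq_init(NULL, rq);

	/*
	 * No allocation happens here, so this is safe from IRQ context.
	 * @force bypasses the alignment and on-stack checks.
	 */
	error = blk_rq_map_kern_prealloc(drive->queue, rq, bio, bvec,
					 ARRAY_SIZE(drive->request_sense_bvec),
					 sense, 18, true);
	BUG_ON(error);

	/* ... fill rq->cmd[] and queue rq at the head of the queue ... */
}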
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-map.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/bio.c | 23 ++++++++++++++++++++++
include/linux/bio.h | 3 ++
include/linux/blkdev.h | 4 +++
4 files changed, 80 insertions(+), 0 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index f103729..c50aac2 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -317,3 +317,53 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
return 0;
}
EXPORT_SYMBOL(blk_rq_map_kern);
+
+/**
+ * blk_rq_map_kern_prealloc - map kernel data using preallocated structures
+ * @q: request queue where request should be inserted
+ * @rq: request to fill
+ * @bio: bio to use
+ * @bvec: bio_vec array to use
+ * @bvec_len: the number of available bvecs in @bvec
+ * @kbuf: the kernel buffer to map
+ * @len: length of @kbuf
+ * @force: bypass @kbuf alignment and on stack check
+ *
+ * Description:
+ * @len bytes at @kbuf will be mapped into @rq using @bio and
+ * @bvec. The buffer must be suitable for direct mapping and
+ * should not require more than @bvec_len elements to map. This
+ * function doesn't do any allocation and can be called from atomic
+ * context. 0 is returned on success, -errno on failure. Note
+ * that this function does not handle bouncing.
+ *
+ * WARNING: Using this function is strongly discouraged. Use
+ * blk_rq_map_kern() if at all possible.
+ */
+int blk_rq_map_kern_prealloc(struct request_queue *q,
+ struct request *rq, struct bio *bio,
+ struct bio_vec *bvec, unsigned short bvec_len,
+ void *kbuf, unsigned int len, bool force)
+{
+ int error;
+
+ if (len > (q->max_hw_sectors << 9))
+ return -EINVAL;
+ if (!len || !kbuf)
+ return -EINVAL;
+ if (!force &&
+ (!blk_rq_aligned(q, kbuf, len) || object_is_on_stack(kbuf)))
+ return -EINVAL;
+
+ error = bio_map_kern_prealloc(q, bio, bvec, bvec_len, kbuf, len);
+ if (error)
+ return error;
+
+ if (rq_data_dir(rq) == WRITE)
+ bio->bi_rw |= (1 << BIO_RW);
+
+ blk_rq_bio_prep(q, rq, bio);
+ rq->buffer = rq->data = NULL;
+ return 0;
+}
+EXPORT_SYMBOL(blk_rq_map_kern_prealloc);
diff --git a/fs/bio.c b/fs/bio.c
index 746b566..ab3c9c9 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1165,6 +1165,29 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
return bio;
}
+/**
+ * bio_map_kern_prealloc - map kernel address into preallocated bio/bvec
+ * @q: the struct request_queue for the bio
+ * @bio: bio to initialize
+ * @bvec: bio_vec array to initialize
+ * @bvec_len: how many elements are available in @bvec
+ * @data: pointer to buffer to map
+ * @len: length in bytes
+ *
+ * Map @data into @bvec and initialize @bio with it. If there
+ * aren't enough @bvec entries, -EINVAL is returned.
+ */
+int bio_map_kern_prealloc(struct request_queue *q, struct bio *bio,
+ struct bio_vec *bvec, unsigned short bvec_len,
+ void *data, unsigned int len)
+{
+ bio_init(bio);
+ bio->bi_io_vec = bvec;
+ bio->bi_max_vecs = bvec_len;
+
+ return __bio_map_kern(q, bio, data, len);
+}
+
static void bio_copy_kern_endio(struct bio *bio, int err)
{
struct bio_vec *bvec;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d8bd43b..39619eb 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -388,6 +388,9 @@ extern struct bio *bio_map_user_iov(struct request_queue *,
extern void bio_unmap_user(struct bio *);
extern struct bio *bio_map_kern(struct request_queue *, void *, unsigned int,
gfp_t);
+extern int bio_map_kern_prealloc(struct request_queue *q, struct bio *bio,
+ struct bio_vec *bvec, unsigned short bvec_len,
+ void *data, unsigned int len);
extern struct bio *bio_copy_kern(struct request_queue *, void *, unsigned int,
gfp_t, int);
extern void bio_set_pages_dirty(struct bio *bio);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 465d6ba..e322e94 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -781,6 +781,10 @@ extern int blk_rq_map_user(struct request_queue *, struct request *,
gfp_t);
extern int blk_rq_unmap_user(struct bio *);
extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
+extern int blk_rq_map_kern_prealloc(struct request_queue *q,
+ struct request *rq, struct bio *bio,
+ struct bio_vec *bvec, unsigned short bvec_len,
+ void *kbuf, unsigned int len, bool force);
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
struct rq_map_data *, struct sg_iovec *, int,
unsigned int, gfp_t);
--
1.6.0.2
blk_start_queueing() is being phased out in favor of
[__]blk_run_queue(). Switch.
Signed-off-by: Tejun Heo <[email protected]>
---
drivers/ide/ide-park.c | 7 ++-----
1 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/ide/ide-park.c b/drivers/ide/ide-park.c
index 9490b44..31d3d32 100644
--- a/drivers/ide/ide-park.c
+++ b/drivers/ide/ide-park.c
@@ -24,11 +24,8 @@ static void issue_park_cmd(ide_drive_t *drive, unsigned long timeout)
start_queue = 1;
spin_unlock_irq(&hwif->lock);
- if (start_queue) {
- spin_lock_irq(q->queue_lock);
- blk_start_queueing(q);
- spin_unlock_irq(q->queue_lock);
- }
+ if (start_queue)
+ blk_run_queue(q);
return;
}
spin_unlock_irq(&hwif->lock);
--
1.6.0.2
Impact: removal of unused field
No one uses ide_cmd->special anymore. Kill it.
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/ide.h | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/include/linux/ide.h b/include/linux/ide.h
index ccaf372..662ea26 100644
--- a/include/linux/ide.h
+++ b/include/linux/ide.h
@@ -362,7 +362,6 @@ struct ide_cmd {
unsigned int cursg_ofs;
struct request *rq; /* copy of request */
- void *special; /* valid_t generally */
};
/* ATAPI packet command flags */
--
1.6.0.2
Impact: rq->buffer usage cleanup
ide_raw_taskfile() directly uses rq->buffer to carry a pointer to the
data buffer. This complicates both the block interface and IDE
backend request handling. Use blk_rq_map_kern() instead and drop the
special handling for REQ_TYPE_ATA_TASKFILE from ide_map_sg().
Note that the REQ_RW setting is moved up because blk_rq_map_kern()
uses it to initialize the bio rw flag.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
drivers/ide/ide-io.c | 5 +----
drivers/ide/ide-taskfile.c | 18 +++++++++++-------
2 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index 0321278..c2e622f 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -244,10 +244,7 @@ void ide_map_sg(ide_drive_t *drive, struct ide_cmd *cmd)
struct scatterlist *sg = hwif->sg_table;
struct request *rq = cmd->rq;
- if (rq->cmd_type == REQ_TYPE_ATA_TASKFILE) {
- sg_init_one(sg, rq->buffer, rq->nr_sectors * SECTOR_SIZE);
- cmd->sg_nents = 1;
- } else if (!rq->bio) {
+ if (!rq->bio) {
sg_init_one(sg, rq->data, rq->data_len);
cmd->sg_nents = 1;
} else
diff --git a/drivers/ide/ide-taskfile.c b/drivers/ide/ide-taskfile.c
index 243421c..4e0d66f 100644
--- a/drivers/ide/ide-taskfile.c
+++ b/drivers/ide/ide-taskfile.c
@@ -396,7 +396,9 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
- rq->buffer = buf;
+
+ if (cmd->tf_flags & IDE_TFLAG_WRITE)
+ rq->cmd_flags |= REQ_RW;
/*
* (ks) We transfer currently only whole sectors.
@@ -404,18 +406,20 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
* if we would find a solution to transfer any size.
* To support special commands like READ LONG.
*/
- rq->hard_nr_sectors = rq->nr_sectors = nsect;
- rq->hard_cur_sectors = rq->current_nr_sectors = nsect;
-
- if (cmd->tf_flags & IDE_TFLAG_WRITE)
- rq->cmd_flags |= REQ_RW;
+ if (nsect) {
+ error = blk_rq_map_kern(drive->queue, rq, buf,
+ nsect * SECTOR_SIZE, __GFP_WAIT);
+ if (error)
+ goto put_req;
+ }
rq->special = cmd;
cmd->rq = rq;
error = blk_execute_rq(drive->queue, NULL, rq, 0);
- blk_put_request(rq);
+put_req:
+ blk_put_request(rq);
return error;
}
--
1.6.0.2
ide doesn't have to worry about REQ_SOFTBARRIER. Don't set it.
Signed-off-by: Tejun Heo <[email protected]>
---
drivers/ide/ide-disk.c | 1 -
drivers/ide/ide-ioctls.c | 1 -
2 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index c998cf8..d5f7d79 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -400,7 +400,6 @@ static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
cmd->protocol = ATA_PROT_NODATA;
rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
- rq->cmd_flags |= REQ_SOFTBARRIER;
rq->special = cmd;
}
diff --git a/drivers/ide/ide-ioctls.c b/drivers/ide/ide-ioctls.c
index 7701427..c22b4c7 100644
--- a/drivers/ide/ide-ioctls.c
+++ b/drivers/ide/ide-ioctls.c
@@ -229,7 +229,6 @@ static int generic_drive_reset(ide_drive_t *drive)
rq->cmd_type = REQ_TYPE_SPECIAL;
rq->cmd_len = 1;
rq->cmd[0] = REQ_DRIVE_RESET;
- rq->cmd_flags |= REQ_SOFTBARRIER;
if (blk_execute_rq(drive->queue, NULL, rq, 1))
ret = rq->errors;
blk_put_request(rq);
--
1.6.0.2
Impact: remove unnecessary code path
Block pc requests always use bio and rq->data is always NULL, so
there is no need to worry about the !rq->bio case in
idefloppy_blockpc_cmd(). Note that ide-atapi uses ide_pio_bytes() for
bio PIO transfers, which handles sg fine.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
drivers/ide/ide-floppy.c | 14 ++++++--------
1 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/ide/ide-floppy.c b/drivers/ide/ide-floppy.c
index 2b4868d..3b22e06 100644
--- a/drivers/ide/ide-floppy.c
+++ b/drivers/ide/ide-floppy.c
@@ -216,15 +216,13 @@ static void idefloppy_blockpc_cmd(struct ide_disk_obj *floppy,
ide_init_pc(pc);
memcpy(pc->c, rq->cmd, sizeof(pc->c));
pc->rq = rq;
- if (rq->data_len && rq_data_dir(rq) == WRITE)
- pc->flags |= PC_FLAG_WRITING;
- pc->buf = rq->data;
- if (rq->bio)
+ if (rq->data_len) {
pc->flags |= PC_FLAG_DMA_OK;
- /*
- * possibly problematic, doesn't look like ide-floppy correctly
- * handled scattered requests if dma fails...
- */
+ if (rq_data_dir(rq) == WRITE)
+ pc->flags |= PC_FLAG_WRITING;
+ }
+ /* pio will be performed by ide_pio_bytes() which handles sg fine */
+ pc->buf = NULL;
pc->req_xfer = pc->buf_size = rq->data_len;
}
--
1.6.0.2
Impact: rq->buffer usage cleanup
ide-atapi uses rq->buffer as a private opaque value for internal
special requests. rq->special isn't used in these cases (the only
case where rq->special is used is for ide-tape rw requests). Use
rq->special instead.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
drivers/ide/ide-atapi.c | 4 ++--
drivers/ide/ide-floppy.c | 2 +-
drivers/ide/ide-tape.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 11a680c..5ab3014 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -88,7 +88,7 @@ static void ide_queue_pc_head(ide_drive_t *drive, struct gendisk *disk,
blk_rq_init(NULL, rq);
rq->cmd_type = REQ_TYPE_SPECIAL;
rq->cmd_flags |= REQ_PREEMPT;
- rq->buffer = (char *)pc;
+ rq->special = (char *)pc;
rq->rq_disk = disk;
if (pc->req_xfer) {
@@ -117,7 +117,7 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk *disk,
rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
rq->cmd_type = REQ_TYPE_SPECIAL;
- rq->buffer = (char *)pc;
+ rq->special = (char *)pc;
if (pc->req_xfer) {
rq->data = pc->buf;
diff --git a/drivers/ide/ide-floppy.c b/drivers/ide/ide-floppy.c
index 3b22e06..9460033 100644
--- a/drivers/ide/ide-floppy.c
+++ b/drivers/ide/ide-floppy.c
@@ -264,7 +264,7 @@ static ide_startstop_t ide_floppy_do_request(ide_drive_t *drive,
pc = &floppy->queued_pc;
idefloppy_create_rw_cmd(drive, pc, rq, (unsigned long)block);
} else if (blk_special_request(rq)) {
- pc = (struct ide_atapi_pc *) rq->buffer;
+ pc = (struct ide_atapi_pc *)rq->special;
} else if (blk_pc_request(rq)) {
pc = &floppy->queued_pc;
idefloppy_blockpc_cmd(floppy, pc, rq);
diff --git a/drivers/ide/ide-tape.c b/drivers/ide/ide-tape.c
index cb942a9..3d16c77 100644
--- a/drivers/ide/ide-tape.c
+++ b/drivers/ide/ide-tape.c
@@ -834,7 +834,7 @@ static ide_startstop_t idetape_do_request(ide_drive_t *drive,
goto out;
}
if (rq->cmd[13] & REQ_IDETAPE_PC1) {
- pc = (struct ide_atapi_pc *) rq->buffer;
+ pc = (struct ide_atapi_pc *)rq->special;
rq->cmd[13] &= ~(REQ_IDETAPE_PC1);
rq->cmd[13] |= REQ_IDETAPE_PC2;
goto out;
--
1.6.0.2
Impact: unify request data buffer handling
rq->data is used mostly to pass a kernel buffer through the request
queue without using bio. There are only a couple of places in the
kernel which still do this, and converting them to bio isn't
difficult.
This patch converts ide-cd to use bio instead of rq->data for request
sense and internal pc commands. As request sense is issued from
atomic context, drive->request_sense_bio and ->request_sense_bvec[2]
are used via bio_map_kern_prealloc(). Internal pc commands use
blk_rq_map_kern().
cdrom_do_block_pc() now doesn't have to deal with REQ_TYPE_ATA_PC
special case. Simplified.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
drivers/ide/ide-cd.c | 31 ++++++++++++++++++-------------
1 files changed, 18 insertions(+), 13 deletions(-)
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index de6ce8d..22c95b9 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -209,8 +209,12 @@ static void cdrom_analyze_sense_data(ide_drive_t *drive,
static void cdrom_queue_request_sense(ide_drive_t *drive, void *sense,
struct request *failed_command)
{
- struct cdrom_info *info = drive->driver_data;
- struct request *rq = &drive->request_sense_rq;
+ struct cdrom_info *info = drive->driver_data;
+ struct request *rq = &drive->request_sense_rq;
+ struct bio *bio = &drive->request_sense_bio;
+ struct bio_vec *bvec = drive->request_sense_bvec;
+ unsigned int bvec_len = ARRAY_SIZE(drive->request_sense_bvec);
+ int error;
ide_debug_log(IDE_DBG_SENSE, "enter");
@@ -224,10 +228,12 @@ static void cdrom_queue_request_sense(ide_drive_t *drive, void *sense,
rq->cmd_type = REQ_TYPE_ATA_PC;
rq->rq_disk = info->disk;
- rq->data = sense;
+ error = blk_rq_map_kern_prealloc(drive->queue, rq, bio, bvec, bvec_len,
+ sense, 18, true);
+ BUG_ON(error);
+
rq->cmd[0] = GPCMD_REQUEST_SENSE;
rq->cmd[4] = 18;
- rq->data_len = 18;
rq->cmd_type = REQ_TYPE_SENSE;
rq->cmd_flags |= REQ_PREEMPT;
@@ -556,8 +562,12 @@ int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
rq->cmd_flags |= cmd_flags;
rq->timeout = timeout;
if (buffer) {
- rq->data = buffer;
- rq->data_len = *bufflen;
+ error = blk_rq_map_kern(drive->queue, rq, buffer,
+ *bufflen, __GFP_WAIT);
+ if (error) {
+ blk_put_request(rq);
+ return error;
+ }
}
error = blk_execute_rq(drive->queue, info->disk, rq, 0);
@@ -847,15 +857,10 @@ static void cdrom_do_block_pc(ide_drive_t *drive, struct request *rq)
drive->dma = 0;
/* sg request */
- if (rq->bio || ((rq->cmd_type == REQ_TYPE_ATA_PC) && rq->data_len)) {
+ if (rq->bio) {
struct request_queue *q = drive->queue;
+ char *buf = bio_data(rq->bio);
unsigned int alignment;
- char *buf;
-
- if (rq->bio)
- buf = bio_data(rq->bio);
- else
- buf = rq->data;
drive->dma = !!(drive->dev_flags & IDE_DFLAG_USING_DMA);
--
1.6.0.2
Impact: rq->buffer usage cleanup
ide-cd uses rq->buffer to carry a pointer to the original request
when issuing REQUEST_SENSE. Use rq->special instead.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
drivers/ide/ide-cd.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index b3736b4..de6ce8d 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -232,8 +232,8 @@ static void cdrom_queue_request_sense(ide_drive_t *drive, void *sense,
rq->cmd_type = REQ_TYPE_SENSE;
rq->cmd_flags |= REQ_PREEMPT;
- /* NOTE! Save the failed command in "rq->buffer" */
- rq->buffer = (void *) failed_command;
+ /* NOTE! Save the failed command in "rq->special" */
+ rq->special = (void *)failed_command;
if (failed_command)
ide_debug_log(IDE_DBG_SENSE, "failed_cmd: 0x%x",
@@ -247,10 +247,10 @@ static void cdrom_queue_request_sense(ide_drive_t *drive, void *sense,
static void ide_cd_complete_failed_rq(ide_drive_t *drive, struct request *rq)
{
/*
- * For REQ_TYPE_SENSE, "rq->buffer" points to the original
+ * For REQ_TYPE_SENSE, "rq->special" points to the original
* failed request
*/
- struct request *failed = (struct request *)rq->buffer;
+ struct request *failed = (struct request *)rq->special;
struct cdrom_info *info = drive->driver_data;
void *sense = &info->sense_data;
--
1.6.0.2
Impact: cleanup rq->data usage
ide-pm uses rq->data to carry a pointer to struct request_pm_state
through the request queue, and rq->special is used to carry a pointer
to a local struct ide_cmd, which isn't necessary. Use rq->special
for the request_pm_state instead and use a local ide_cmd in
ide_start_power_step().
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
drivers/ide/ide-io.c | 2 +-
drivers/ide/ide-pm.c | 36 ++++++++++++++----------------------
2 files changed, 15 insertions(+), 23 deletions(-)
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index c2e622f..9ad9bc1 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -364,7 +364,7 @@ static ide_startstop_t start_request (ide_drive_t *drive, struct request *rq)
if (rq->cmd_type == REQ_TYPE_ATA_TASKFILE)
return execute_drive_cmd(drive, rq);
else if (blk_pm_request(rq)) {
- struct request_pm_state *pm = rq->data;
+ struct request_pm_state *pm = rq->special;
#ifdef DEBUG_PM
printk("%s: start_power_step(step: %d)\n",
drive->name, pm->pm_step);
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index bb7858e..84b8df4 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -7,7 +7,6 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)
ide_hwif_t *hwif = drive->hwif;
struct request *rq;
struct request_pm_state rqpm;
- struct ide_cmd cmd;
int ret;
/* call ACPI _GTM only once */
@@ -15,11 +14,9 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)
ide_acpi_get_timing(hwif);
memset(&rqpm, 0, sizeof(rqpm));
- memset(&cmd, 0, sizeof(cmd));
rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
rq->cmd_type = REQ_TYPE_PM_SUSPEND;
- rq->special = &cmd;
- rq->data = &rqpm;
+ rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_SUSPEND;
if (mesg.event == PM_EVENT_PRETHAW)
mesg.event = PM_EVENT_FREEZE;
@@ -41,7 +38,6 @@ int generic_ide_resume(struct device *dev)
ide_hwif_t *hwif = drive->hwif;
struct request *rq;
struct request_pm_state rqpm;
- struct ide_cmd cmd;
int err;
/* call ACPI _PS0 / _STM only once */
@@ -53,12 +49,10 @@ int generic_ide_resume(struct device *dev)
ide_acpi_exec_tfs(drive);
memset(&rqpm, 0, sizeof(rqpm));
- memset(&cmd, 0, sizeof(cmd));
rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
rq->cmd_type = REQ_TYPE_PM_RESUME;
rq->cmd_flags |= REQ_PREEMPT;
- rq->special = &cmd;
- rq->data = &rqpm;
+ rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_RESUME;
rqpm.pm_state = PM_EVENT_ON;
@@ -77,7 +71,7 @@ int generic_ide_resume(struct device *dev)
void ide_complete_power_step(ide_drive_t *drive, struct request *rq)
{
- struct request_pm_state *pm = rq->data;
+ struct request_pm_state *pm = rq->special;
#ifdef DEBUG_PM
printk(KERN_INFO "%s: complete_power_step(step: %d)\n",
@@ -107,10 +101,8 @@ void ide_complete_power_step(ide_drive_t *drive, struct request *rq)
ide_startstop_t ide_start_power_step(ide_drive_t *drive, struct request *rq)
{
- struct request_pm_state *pm = rq->data;
- struct ide_cmd *cmd = rq->special;
-
- memset(cmd, 0, sizeof(*cmd));
+ struct request_pm_state *pm = rq->special;
+ struct ide_cmd cmd = { };
switch (pm->pm_step) {
case IDE_PM_FLUSH_CACHE: /* Suspend step 1 (flush cache) */
@@ -123,12 +115,12 @@ ide_startstop_t ide_start_power_step(ide_drive_t *drive, struct request *rq)
return ide_stopped;
}
if (ata_id_flush_ext_enabled(drive->id))
- cmd->tf.command = ATA_CMD_FLUSH_EXT;
+ cmd.tf.command = ATA_CMD_FLUSH_EXT;
else
- cmd->tf.command = ATA_CMD_FLUSH;
+ cmd.tf.command = ATA_CMD_FLUSH;
goto out_do_tf;
case IDE_PM_STANDBY: /* Suspend step 2 (standby) */
- cmd->tf.command = ATA_CMD_STANDBYNOW1;
+ cmd.tf.command = ATA_CMD_STANDBYNOW1;
goto out_do_tf;
case IDE_PM_RESTORE_PIO: /* Resume step 1 (restore PIO) */
ide_set_max_pio(drive);
@@ -141,7 +133,7 @@ ide_startstop_t ide_start_power_step(ide_drive_t *drive, struct request *rq)
ide_complete_power_step(drive, rq);
return ide_stopped;
case IDE_PM_IDLE: /* Resume step 2 (idle) */
- cmd->tf.command = ATA_CMD_IDLEIMMEDIATE;
+ cmd.tf.command = ATA_CMD_IDLEIMMEDIATE;
goto out_do_tf;
case IDE_PM_RESTORE_DMA: /* Resume step 3 (restore DMA) */
/*
@@ -163,10 +155,10 @@ ide_startstop_t ide_start_power_step(ide_drive_t *drive, struct request *rq)
return ide_stopped;
out_do_tf:
- cmd->tf_flags = IDE_TFLAG_TF | IDE_TFLAG_DEVICE;
- cmd->protocol = ATA_PROT_NODATA;
+ cmd.tf_flags = IDE_TFLAG_TF | IDE_TFLAG_DEVICE;
+ cmd.protocol = ATA_PROT_NODATA;
- return do_rw_taskfile(drive, cmd);
+ return do_rw_taskfile(drive, &cmd);
}
/**
@@ -180,7 +172,7 @@ out_do_tf:
void ide_complete_pm_rq(ide_drive_t *drive, struct request *rq)
{
struct request_queue *q = drive->queue;
- struct request_pm_state *pm = rq->data;
+ struct request_pm_state *pm = rq->special;
unsigned long flags;
ide_complete_power_step(drive, rq);
@@ -206,7 +198,7 @@ void ide_complete_pm_rq(ide_drive_t *drive, struct request *rq)
void ide_check_pm_state(ide_drive_t *drive, struct request *rq)
{
- struct request_pm_state *pm = rq->data;
+ struct request_pm_state *pm = rq->special;
if (blk_pm_suspend_request(rq) &&
pm->pm_step == IDE_PM_START_SUSPEND)
--
1.6.0.2
Impact: unify request data buffer handling
rq->data is used mostly to pass a kernel buffer through the request
queue without using bio. There are only a couple of places in the
kernel which still do this, and converting them to bio isn't
difficult.
This patch converts ide-atapi to use bio instead of rq->data. For
ide_queue_pc_tail() which can block, blk_rq_map_kern() is used. For
ide_queue_pc_head() which is used for request sense and called from
atomic context, drive->request_sense_bio and ->request_sense_bvec[2]
are added and used via bio_map_kern_prealloc().
ide-tape is updated to map sg for special requests.
This change makes all ide-atapi commands except for tape bh ones use
bio. Drop tp_ops->output/input_data path from ide_pc_intr().
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
drivers/ide/ide-atapi.c | 38 ++++++++++++++++++++++----------------
drivers/ide/ide-tape.c | 8 +++++---
include/linux/ide.h | 3 +++
3 files changed, 30 insertions(+), 19 deletions(-)
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 5ab3014..bc7dfcf 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -83,7 +83,9 @@ EXPORT_SYMBOL_GPL(ide_init_pc);
* pass through the driver.
*/
static void ide_queue_pc_head(ide_drive_t *drive, struct gendisk *disk,
- struct ide_atapi_pc *pc, struct request *rq)
+ struct ide_atapi_pc *pc, struct request *rq,
+ struct bio *bio, struct bio_vec *bvec,
+ unsigned short bvec_len)
{
blk_rq_init(NULL, rq);
rq->cmd_type = REQ_TYPE_SPECIAL;
@@ -92,8 +94,12 @@ static void ide_queue_pc_head(ide_drive_t *drive, struct gendisk *disk,
rq->rq_disk = disk;
if (pc->req_xfer) {
- rq->data = pc->buf;
- rq->data_len = pc->req_xfer;
+ int error;
+
+ error = blk_rq_map_kern_prealloc(drive->queue, rq, bio,
+ bvec, bvec_len,
+ pc->buf, pc->req_xfer, true);
+ BUG_ON(error);
}
memcpy(rq->cmd, pc->c, 12);
@@ -120,16 +126,19 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk *disk,
rq->special = (char *)pc;
if (pc->req_xfer) {
- rq->data = pc->buf;
- rq->data_len = pc->req_xfer;
+ error = blk_rq_map_kern(drive->queue, rq, pc->buf, pc->req_xfer,
+ __GFP_WAIT);
+ if (error)
+ goto put_req;
}
memcpy(rq->cmd, pc->c, 12);
if (drive->media == ide_tape)
rq->cmd[13] = REQ_IDETAPE_PC1;
error = blk_execute_rq(drive->queue, disk, rq, 0);
- blk_put_request(rq);
+put_req:
+ blk_put_request(rq);
return error;
}
EXPORT_SYMBOL_GPL(ide_queue_pc_tail);
@@ -196,13 +205,16 @@ EXPORT_SYMBOL_GPL(ide_create_request_sense_cmd);
void ide_retry_pc(ide_drive_t *drive, struct gendisk *disk)
{
struct request *rq = &drive->request_sense_rq;
+ struct bio *bio = &drive->request_sense_bio;
+ struct bio_vec *bvec = drive->request_sense_bvec;
struct ide_atapi_pc *pc = &drive->request_sense_pc;
+ unsigned short bvec_len = ARRAY_SIZE(drive->request_sense_bvec);
(void)ide_read_error(drive);
ide_create_request_sense_cmd(drive, pc);
if (drive->media == ide_tape)
set_bit(IDE_AFLAG_IGNORE_DSC, &drive->atapi_flags);
- ide_queue_pc_head(drive, disk, pc, rq);
+ ide_queue_pc_head(drive, disk, pc, rq, bio, bvec, bvec_len);
}
EXPORT_SYMBOL_GPL(ide_retry_pc);
@@ -277,7 +289,6 @@ static ide_startstop_t ide_pc_intr(ide_drive_t *drive)
struct ide_cmd *cmd = &hwif->cmd;
struct request *rq = hwif->rq;
const struct ide_tp_ops *tp_ops = hwif->tp_ops;
- xfer_func_t *xferfunc;
unsigned int timeout, done;
u16 bcount;
u8 stat, ireason, dsc = 0;
@@ -409,16 +420,11 @@ static ide_startstop_t ide_pc_intr(ide_drive_t *drive)
return ide_do_reset(drive);
}
- xferfunc = write ? tp_ops->output_data : tp_ops->input_data;
-
- if (drive->media == ide_floppy && pc->buf == NULL) {
+ if (drive->media == ide_tape && pc->bh)
+ done = drive->pc_io_buffers(drive, pc, bcount, write);
+ else {
done = min_t(unsigned int, bcount, cmd->nleft);
ide_pio_bytes(drive, cmd, write, done);
- } else if (drive->media == ide_tape && pc->bh) {
- done = drive->pc_io_buffers(drive, pc, bcount, write);
- } else {
- done = min_t(unsigned int, bcount, pc->req_xfer - pc->xferred);
- xferfunc(drive, NULL, pc->cur_pos, done);
}
/* Update the current position */
diff --git a/drivers/ide/ide-tape.c b/drivers/ide/ide-tape.c
index 3d16c77..d52f8b3 100644
--- a/drivers/ide/ide-tape.c
+++ b/drivers/ide/ide-tape.c
@@ -750,7 +750,7 @@ static ide_startstop_t idetape_do_request(ide_drive_t *drive,
idetape_tape_t *tape = drive->driver_data;
struct ide_atapi_pc *pc = NULL;
struct request *postponed_rq = tape->postponed_rq;
- struct ide_cmd cmd;
+ struct ide_cmd cmd = { };
u8 stat;
debug_log(DBG_SENSE, "sector: %llu, nr_sectors: %lu,"
@@ -837,6 +837,10 @@ static ide_startstop_t idetape_do_request(ide_drive_t *drive,
pc = (struct ide_atapi_pc *)rq->special;
rq->cmd[13] &= ~(REQ_IDETAPE_PC1);
rq->cmd[13] |= REQ_IDETAPE_PC2;
+ if (pc->req_xfer) {
+ ide_init_sg_cmd(&cmd, pc->req_xfer);
+ ide_map_sg(drive, &cmd);
+ }
goto out;
}
if (rq->cmd[13] & REQ_IDETAPE_PC2) {
@@ -846,8 +850,6 @@ static ide_startstop_t idetape_do_request(ide_drive_t *drive,
BUG();
out:
- memset(&cmd, 0, sizeof(cmd));
-
if (rq_data_dir(rq))
cmd.tf_flags |= IDE_TFLAG_WRITE;
diff --git a/include/linux/ide.h b/include/linux/ide.h
index 662ea26..8d3bb27 100644
--- a/include/linux/ide.h
+++ b/include/linux/ide.h
@@ -638,8 +638,11 @@ struct ide_drive_s {
unsigned long atapi_flags;
+ /* for request sense */
struct ide_atapi_pc request_sense_pc;
struct request request_sense_rq;
+ struct bio request_sense_bio;
+ struct bio_vec request_sense_bvec[2];
};
typedef struct ide_drive_s ide_drive_t;
--
1.6.0.2
Impact: code simplification
ide_cd_request_sense_fixup() clears the tail of the sense buffer if
the device didn't completely fill it. This patch makes
cdrom_queue_request_sense() clear the sense buffer before issuing the
command instead of clearing it afterwards. This simplifies code and
eases future changes.
Signed-off-by: Tejun Heo <[email protected]>
---
drivers/ide/ide-cd.c | 12 ++++--------
1 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 35729a4..b3736b4 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -217,6 +217,8 @@ static void cdrom_queue_request_sense(ide_drive_t *drive, void *sense,
if (sense == NULL)
sense = &info->sense_data;
+ memset(sense, 0, 18);
+
/* stuff the sense request in front of our current request */
blk_rq_init(NULL, rq);
rq->cmd_type = REQ_TYPE_ATA_PC;
@@ -520,14 +522,8 @@ static void ide_cd_request_sense_fixup(ide_drive_t *drive, struct ide_cmd *cmd)
* and some drives don't send them. Sigh.
*/
if (rq->cmd[0] == GPCMD_REQUEST_SENSE &&
- cmd->nleft > 0 && cmd->nleft <= 5) {
- unsigned int ofs = cmd->nbytes - cmd->nleft;
-
- while (cmd->nleft > 0) {
- *((u8 *)rq->data + ofs++) = 0;
- cmd->nleft--;
- }
- }
+ cmd->nleft > 0 && cmd->nleft <= 5)
+ cmd->nleft = 0;
}
int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
--
1.6.0.2
Tejun Heo wrote:
> Impact: implement new API
>
> There still are a few places where a request is allocated statically
> and initialized manually. Till now, those sites used their own custom
> mechanisms to pass data buffers through the request queue. This patch
> implements blk_rq_map_kern_prealloc() which initializes an rq with a
> pre-allocated bio and bvec so that those code paths can be converted
> to bio.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
Hi dear Tejun Heo
I have a similar, totally unrelated patch queued; perhaps we can
unify the efforts and satisfy both our needs in one go.
I've sent this patch:
http://www.spinics.net/lists/linux-scsi/msg34082.html
In an effort to un-export blk_rq_append_bio().
Perhaps you could reorder the code below a bit?
My proposal is:
* blk_rq_map_kern_prealloc => is simplified to be
int blk_rq_map_bio(struct request_queue *q, struct request *rq,
struct bio *bio);
* The extra checks currently inside blk_rq_map_kern_prealloc are moved
to bio_map_kern_prealloc()
* Users call bio_map_kern_prealloc() directly and then use blk_rq_map_bio()
in a two stage process.
So blk_rq_map_bio() becomes a BLOCK_PC command's way of submitting a
pre-allocated bio, the way FS_PC commands use generic_make_request().
Thanks for doing all this. BTW, how close are we to removing
req->data and req->buffer?
Have a good day
Boaz
> ---
> block/blk-map.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
> fs/bio.c | 23 ++++++++++++++++++++++
> include/linux/bio.h | 3 ++
> include/linux/blkdev.h | 4 +++
> 4 files changed, 80 insertions(+), 0 deletions(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index f103729..c50aac2 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -317,3 +317,53 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
> return 0;
> }
> EXPORT_SYMBOL(blk_rq_map_kern);
> +
> +/**
> + * blk_rq_map_kern_prealloc - map kernel data using preallocated structures
> + * @q: request queue where request should be inserted
> + * @rq: request to fill
> + * @bio: bio to use
> + * @bvec: bio_vec array to use
> + * @bvec_len: the number of available bvecs in @bvec
> + * @kbuf: the kernel buffer to map
> + * @len: length of @kbuf
> + * @force: bypass @kbuf alignment and on stack check
> + *
> + * Description:
> + * @len bytes at @kbuf will be mapped into @rq using @bio and
> + * @bvec. The buffer must be suitable for direct mapping and
> + * should not require more than @bvec_len elements to map. This
> + * function doesn't do any allocation and can be called from atomic
> + * context. 0 is returned on success, -errno on failure. Note
> + * that this function does not handle bouncing.
> + *
> + * WARNING: Using this function is strongly discouraged. Use
> + * blk_rq_map_kern() if at all possible.
> + */
> +int blk_rq_map_kern_prealloc(struct request_queue *q,
> + struct request *rq, struct bio *bio,
> + struct bio_vec *bvec, unsigned short bvec_len,
> + void *kbuf, unsigned int len, bool force)
> +{
> + int error;
> +
> + if (len > (q->max_hw_sectors << 9))
> + return -EINVAL;
> + if (!len || !kbuf)
> + return -EINVAL;
> + if (!force &&
> + (!blk_rq_aligned(q, kbuf, len) || object_is_on_stack(kbuf)))
> + return -EINVAL;
> +
> + error = bio_map_kern_prealloc(q, bio, bvec, bvec_len, kbuf, len);
> + if (error)
> + return error;
> +
> + if (rq_data_dir(rq) == WRITE)
> + bio->bi_rw |= (1 << BIO_RW);
> +
> + blk_rq_bio_prep(q, rq, bio);
> + rq->buffer = rq->data = NULL;
> + return 0;
> +}
> +EXPORT_SYMBOL(blk_rq_map_kern_prealloc);
> diff --git a/fs/bio.c b/fs/bio.c
> index 746b566..ab3c9c9 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -1165,6 +1165,29 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
> return bio;
> }
>
> +/**
> + * bio_map_kern_prealloc - map kernel address into preallocated bio/bvec
> + * @q: the struct request_queue for the bio
> + * @bio: bio to initialize
> + * @bvec: bio_vec array to initialize
> + * @bvec_len: how many elements are available in @bvec
> + * @data: pointer to buffer to map
> + * @len: length in bytes
> + *
> + * Map @data into @bvec and initialize @bio with it. If there
> + * aren't enough @bvec entries, -EINVAL is returned.
> + */
> +int bio_map_kern_prealloc(struct request_queue *q, struct bio *bio,
> + struct bio_vec *bvec, unsigned short bvec_len,
> + void *data, unsigned int len)
> +{
> + bio_init(bio);
> + bio->bi_io_vec = bvec;
> + bio->bi_max_vecs = bvec_len;
> +
> + return __bio_map_kern(q, bio, data, len);
> +}
> +
> static void bio_copy_kern_endio(struct bio *bio, int err)
> {
> struct bio_vec *bvec;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d8bd43b..39619eb 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -388,6 +388,9 @@ extern struct bio *bio_map_user_iov(struct request_queue *,
> extern void bio_unmap_user(struct bio *);
> extern struct bio *bio_map_kern(struct request_queue *, void *, unsigned int,
> gfp_t);
> +extern int bio_map_kern_prealloc(struct request_queue *q, struct bio *bio,
> + struct bio_vec *bvec, unsigned short bvec_len,
> + void *data, unsigned int len);
> extern struct bio *bio_copy_kern(struct request_queue *, void *, unsigned int,
> gfp_t, int);
> extern void bio_set_pages_dirty(struct bio *bio);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 465d6ba..e322e94 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -781,6 +781,10 @@ extern int blk_rq_map_user(struct request_queue *, struct request *,
> gfp_t);
> extern int blk_rq_unmap_user(struct bio *);
> extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
> +extern int blk_rq_map_kern_prealloc(struct request_queue *q,
> + struct request *rq, struct bio *bio,
> + struct bio_vec *bvec, unsigned short bvec_len,
> + void *kbuf, unsigned int len, bool force);
> extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
> struct rq_map_data *, struct sg_iovec *, int,
> unsigned int, gfp_t);
Hello,
Boaz Harrosh wrote:
> I have a similar, totally unrelated patch queued, perhaps we can unify
> the efforts, to satisfy both our needs in one stone.
>
> I've sent this patch:
> http://www.spinics.net/lists/linux-scsi/msg34082.html
>
> In an effort to un-export blk_rq_append_bio().
>
> Perhaps you could reorder the code below a bit?
>
> My proposal is:
> * blk_rq_map_kern_prealloc => is simplified to be
> int blk_rq_map_bio(struct request_queue *q, struct request *rq,
> struct bio *bio);
>
> * The extra checks currently inside blk_rq_map_kern_prealloc are moved
> to bio_map_kern_prealloc()
>
> * Users call bio_map_kern_prealloc() directly and then use blk_rq_map_bio()
> in a two stage process.
>
> So blk_rq_map_bio() becomes a BLOCK_PC command's way of submitting a
> pre-allocated bio, the way FS_PC commands use generic_make_request().
Sounds good. I'll make the change.
> Thanks for doing all this. BTW, how close are we to removing
> req->data and req->buffer?
Those patches will be going out later today. They will also kill
rq->hard_* and rq->*nr_sectors. A single byte-granular data length
for everything, along with a single way to carry data, via bio! :-)
Thanks.
--
tejun
Hi Tejun,
On Wed, Mar 25, 2009 at 01:06:09AM +0900, Tejun Heo wrote:
> Impact: code simplification
>
> ide_cd_request_sense_fixup() clears the tail of the sense buffer if
> the device didn't completely fill it. This patch makes
> cdrom_queue_request_sense() clear the sense buffer before issuing the
> command instead of clearing it afterwards. This simplifies code and
> eases future changes.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> drivers/ide/ide-cd.c | 12 ++++--------
> 1 files changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
> index 35729a4..b3736b4 100644
> --- a/drivers/ide/ide-cd.c
> +++ b/drivers/ide/ide-cd.c
> @@ -217,6 +217,8 @@ static void cdrom_queue_request_sense(ide_drive_t *drive, void *sense,
> if (sense == NULL)
> sense = &info->sense_data;
>
> + memset(sense, 0, 18);
> +
Hmm, not quite. If it's info->sense_data, then 18 is not correct;
sizeof(struct request_sense) would more likely be the proper length.
So, actually, it would be more like:
unsigned sense_len = 18;
if (sense == NULL) {
sense = &info->sense_data;
sense_len = sizeof(struct request_sense);
}
memset(sense, 0, sense_len);
> /* stuff the sense request in front of our current request */
> blk_rq_init(NULL, rq);
> rq->cmd_type = REQ_TYPE_ATA_PC;
> @@ -520,14 +522,8 @@ static void ide_cd_request_sense_fixup(ide_drive_t *drive, struct ide_cmd *cmd)
> * and some drives don't send them. Sigh.
> */
> if (rq->cmd[0] == GPCMD_REQUEST_SENSE &&
> - cmd->nleft > 0 && cmd->nleft <= 5) {
> - unsigned int ofs = cmd->nbytes - cmd->nleft;
> -
> - while (cmd->nleft > 0) {
> - *((u8 *)rq->data + ofs++) = 0;
> - cmd->nleft--;
> - }
> - }
> + cmd->nleft > 0 && cmd->nleft <= 5)
> + cmd->nleft = 0;
> }
>
Otherwise, I like it.
--
Regards/Gruss,
Boris.
Hello,
Borislav Petkov wrote:
> Hmm, not quite. If its info->sense_data, then 18 is not correct. Rather,
> sizeof(struct request_sense) would be more likely the proper length. So,
> actually, it would be more like:
>
> unsigned sense_len = 18;
>
> if (sense == NULL) {
> sense = &info->sense_data;
> sense_len = sizeof(struct request_sense);
> }
>
> memset(sense, 0, sense_len);
Okay, will update as such. Thanks.
--
tejun
Hello,
A few issues.
Tejun Heo wrote:
>> Perhaps you could reorder the code below a bit?
>>
>> My proposal is:
>> * blk_rq_map_kern_prealloc => is simplified to be
>> int blk_rq_map_bio(struct request_queue *q, struct request *rq,
>> struct bio *bio);
The thing is that the prealloc variant should be allowed to be called
from IRQ context and blk_queue_bounce() shouldn't be called.
Hmmm... well, the caller is supposed to know what it's doing and maybe
we can just add a comment that it shouldn't be called with buffers
which might get bounced from IRQ context.
>> * The extra checks currently inside blk_rq_map_kern_prealloc are moved
>> to bio_map_kern_prealloc()
>>
>> * Users call bio_map_kern_prealloc() directly and then use blk_rq_map_bio()
>> in a two stage process.
This breaks consistency with the blk_rq_map_*() family of functions. Do
you have a plan to make them all consistent? I think we really need
to maintain API consistency.
Thanks.
--
tejun
On 03/26/2009 09:42 AM, Tejun Heo wrote:
> Hello,
>
> A few issues.
>
> Tejun Heo wrote:
>>> Perhaps you could reorder the code below a bit?
>>>
>>> My proposal is:
>>> * blk_rq_map_kern_prealloc => is simplified to be
>>> int blk_rq_map_bio(struct request_queue *q, struct request *rq,
>>> struct bio *bio);
>
> The thing is that the prealloc variant should be allowed to be called
> from IRQ context and blk_queue_bounce() shouldn't be called.
> Hmmm... well, the caller is supposed to know what it's doing and maybe
> we can just add a comment that it shouldn't be called with buffers
> which might get bounced from IRQ context.
>
Hmm, that is a problem. I would suggest a flag or a check. My bios
come from the VFS; they need bouncing.
Can you think of a solution?
We could just call blk_queue_bounce(). IRQ callers need to make sure
their buffers don't need bouncing anyway, so there is no such bug,
right? If a programmer gets it wrong, he will get a BUG check that
tells him so.
>>> * The extra checks currently inside blk_rq_map_kern_prealloc are moved
>>> to bio_map_kern_prealloc()
>>>
>>> * Users call bio_map_kern_prealloc() directly and then use blk_rq_map_bio()
>>> in a two stage process.
>
> This breaks consistency with blk_rq_map_*() family of functions. Do
> you have a plan to make them all consistent? I think we really need
> to maintain API consistency.
>
I agree, that is why I called it blk_make_request() originally. But
that is not good for you since your request is pre-allocated as well
as the bio. It needs a different name, blk_rq_set_bio() maybe? I
don't know, do you have any ideas?
> Thanks.
>
Thanks
Boaz
On 03/26/2009 10:05 AM, Boaz Harrosh wrote:
> On 03/26/2009 09:42 AM, Tejun Heo wrote:
>> The thing is that the prealloc variant should be allowed to be called
>> from IRQ context and blk_queue_bounce() shouldn't be called.
>> Hmmm... well, the caller is supposed to know what it's doing and maybe
>> we can just add a comment that it shouldn't be called with buffers
>> which might get bounced from IRQ context.
>>
>
> Hmm that is a problem. I would suggest a flag or a check. My bios come
> from VFS they need bouncing.
>
> Can you think of a solution?
>
> We could just call blk_queue_bounce(). IRQ callers need to make sure their
> buffers don't need bouncing anyway, so there is no such bug right? If a programmer
> gets it wrong he will get a BUG check that tells him that.
>
I just realized that in your original call you had the "force" flag;
we can keep that flag for blk_rq_set_bio(), or whatever we end up
naming it.
Thanks
Boaz
On Wed, Mar 25, 2009 at 01:06:13AM +0900, Tejun Heo wrote:
> Impact: rq->buffer usage cleanup
>
> ide-cd uses rq->buffer to carry pointer to the original request when
> issuing REQUEST_SENSE. Use rq->special instead.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
--
Regards/Gruss,
Boris.
Boaz Harrosh wrote:
> On 03/26/2009 10:05 AM, Boaz Harrosh wrote:
>> On 03/26/2009 09:42 AM, Tejun Heo wrote:
>>> The thing is that the prealloc variant should be allowed to be called
>>> from IRQ context and blk_queue_bounce() shouldn't be called.
>>> Hmmm... well, the caller is supposed to know what it's doing and maybe
>>> we can just add a comment that it shouldn't be called with buffers
>>> which might get bounced from IRQ context.
>>>
>> Hmm that is a problem. I would suggest a flag or a check. My bios come
>> from VFS they need bouncing.
>>
>> Can you think of a solution?
>>
>> We could just call blk_queue_bounce(). IRQ callers need to make sure their
>> buffers don't need bouncing anyway, so there is no such bug right? If a programmer
>> gets it wrong he will get a BUG check that tells him that.
>>
>
> I just realized that in your original call you had the "force" flag,
> we can keep that flag for the blk_rq_set_bio(), or what ever we
> should name it.
Eh... Currently, fs requests are built from bio whereas PC/kernel
requests are usually mapped into the request the driver has already
allocated, so there are two different directions of initializing a
request.
For the latter, the APIs are blk_rq_map_{user[_iov]|kern}(). To add
to the fun, we have blk_rq_map_sg() which does a completely different
thing of mapping an rq's bios to sg and is usually called from low
level do_request() when starting to process a request.
You're suggesting to add blk_rq_map_bio() which doesn't really map
anything but rather allocates and initializes an rq from bio and
basically boils down to blk_rq_bio_prep() and blk_queue_bounce().
I think it's about time to cool it a bit and try to see what's going
on. We don't really want blk_rq_map_*() to do three completely
different things, right? Usage of blk_rq_map_{user[_iov]|kern}()
isn't too widespread, so cleaning up the API shouldn't be difficult.
So, let's think about the whole API.
There is a fundamental difference between fs and pc requests. For fs
requests, the rq (struct request) doesn't really matter. The bio is
the final goal of an fs request and the rq is just the carrier of
bios into the block layer and drivers. For pc requests, this isn't
true. Here, bios are used as data carriers and the issuer cares much
more about the rq itself, which is configured to carry the command
the issuer wants to execute. This is true for completion too. For
fs requests, the rq error status doesn't matter; for pc requests,
it's the other way around. The difference explains why we have two
different initialization directions, so to me the current API seems
logical, although we really should be using different names.
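To put the two directions into (very rough, error handling omitted)
code:

	/* fs: the bio is the real entity, the block layer builds the rq */
	struct bio *bio = bio_alloc(GFP_NOIO, nr_vecs);
	/* ... bio_add_page() the data, set bi_sector/bi_bdev/bi_end_io ... */
	submit_bio(READ, bio);

	/* pc: the rq is the real entity, bios merely carry the data */
	struct request *rq = blk_get_request(q, READ, __GFP_WAIT);
	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	/* ... fill rq->cmd[], timeout, sense ... */
	blk_rq_map_kern(q, rq, buf, len, __GFP_WAIT);
	blk_execute_rq(q, NULL, rq, 0);
	blk_put_request(rq);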
Can you please explain what your need is? Why do you want
blk_make_request()?
Thanks.
--
tejun
On 03/26/2009 12:19 PM, Tejun Heo wrote:
> Boaz Harrosh wrote:
>> On 03/26/2009 10:05 AM, Boaz Harrosh wrote:
>>> On 03/26/2009 09:42 AM, Tejun Heo wrote:
>>>> The thing is that the prealloc variant should be allowed to be called
>>>> from IRQ context and blk_queue_bounce() shouldn't be called.
>>>> Hmmm... well, the caller is supposed to know what it's doing and maybe
>>>> we can just add a comment that it shouldn't be called with buffers
>>>> which might get bounced from IRQ context.
>>>>
>>> Hmm that is a problem. I would suggest a flag or a check. My bios come
>>> from VFS they need bouncing.
>>>
>>> Can you think of a solution?
>>>
>>> We could just call blk_queue_bounce(). IRQ callers need to make sure their
>>> buffers don't need bouncing anyway, so there is no such bug right? If a programmer
>>> gets it wrong he will get a BUG check that tells him that.
>>>
>> I just realized that in your original call you had the "force" flag,
>> we can keep that flag for the blk_rq_set_bio(), or what ever we
>> should name it.
>
> Eh... Currently, fs requests are built from bio whereas PC/kernel
> requests are usually mapped into the request the driver has already
> allocated, so there are two different directions of initializing a
> request.
>
> For the latter, the APIs are blk_rq_map_{user[_iov]|kern}(). To add
> to the fun, we have blk_rq_map_sg() which does a completely different
> thing of mapping an rq's bios to sg and is usually called from low
> level do_request() when starting to process a request.
>
Let's remove blk_rq_map_sg() from the discussion for a moment. That
one is a totally bogus name that gives me the creeps every time I
read it. Using the same terminology, it should at minimum have been
blk_sg_map_request() (which I agree is even more cryptic, but at
least it gets the source and destination right).
> You're suggesting to add blk_rq_map_bio() which doesn't really map
> anything but rather allocates and initializes an rq from bio and
> basically boils down to blk_rq_bio_prep() and blk_queue_bounce().
>
Yes, as I said, I completely agree. It does not map anything and
should not be named blk_rq_map_*. It should have a different, better
name. Hell, it exists today as blk_rq_append_bio(), but that one
suggests usage in ways some people don't like, and they want to
remove it. My opinion is that both you and I could just use the good
old blk_rq_append_bio() and be happy.
So basically, in practice, it just comes down to renaming
blk_rq_append_bio() to something. We both just don't know what to
call it?
> I think it's about time to cool it a bit and try to see what's going
> on. We don't really want blk_rq_map_*() to do three completely
> different things, right? Usage of blk_rq_map_{user[_iov]|kern}()
> isn't too wide spread, so cleaning up the API shouldn't be difficult.
>
Currently blk_rq_map_sg() is the only oddball; renaming that to something
sane will fix the current situation.
What we both additionally want is just plain old blk_rq_append_bio,
maybe with a better name?
> So, let's think about the whole API.
>
> There is fundamental difference between fs and pc requests. For fs
> requests, rq (struct request) doesn't really matter. bio is the final
> goal of fs requests and rq is just the carrier of bios into the block
> layer and drivers.
This is because the one that cares about the request comes later down
the stack in the form of a block driver. So an FS_PC command is just a
flag that says: careful, only a half-baked request here, someone needs to
fill in the extra information.
At the LLD level there is no difference between a READ_10 issued by
user-mode through sg.c pushing BLOCK_PC, or FS_PC commands from ext3
converted to READ_10 by sd.c on the way down.
I'm saying all that to better explain my answer below.
> For pc requests, this isn't true. Here, bios are
> used as data carrier and the issuer cares much more about the rq
> itself which will be configured to carry the command the issuer wants
> to execute.
BLOCK_PC commands are "protocol specific" and the issuer has all
the needed information up front. So it needs the mapping, as well
as all the usual prep_fn work done on the request, in one go.
> This is also true for completion too. For fs requests,
> rq error status doesn't matter. For pc requests, the other way
> around.
Again a stack thing: the mid-level block driver will carry status
in req->errors up to when it completes the request, where it should
translate its specific error into a generic VFS error
(-EIO, -ENOSPC, -ENOMEM, etc.). So they are both used, only at different
stages of the request lifetime. Since BLOCK_PC requests are protocol specific
and they kind of bypass the block driver, they need the protocol-specific
information, just as above when the issuer spoke the protocol's language in issuing a READ_10.
> The difference explains why we have two different
> initialization directions, so to me, the current API seems logical
> although we really should be using different names.
>
> Can you please explain what your need is? Why do you want
> blk_make_request()?
>
I don't need blk_make_request per se; I just need a way to
allocate a request and then associate a pre-built bio with it,
one that was prepared by an upper filesystem.
(Which talks OSD, which does not fit the block device model at all,
but this is too much information for you)
Furthermore, my life is more complicated than that because I need
a bidi-request, which is a second request hanging on the first one. And
each request can map up to 6 segments of source information that
are finally laid linear on the wire. So my bio can not be built
with a single blk_rq_map_xxx call.
To make the story short blk_rq_append_bio() can be used by your
code and mine, perhaps with a better name.
> Thanks.
>
I feel I took too much of your time with my petty problems; I just
felt that adding an associate-this-request-with-this-bio
API to blkdev.h could be done in a way that makes both of us happy.
Thanks a lot
Boaz
Hello, Boaz.
Boaz Harrosh wrote:
>> For the latter, the APIs are blk_rq_map_{user[_iov]|kern}(). To add
>> to the fun, we have blk_rq_map_sg() which does a completely different
>> thing of mapping an rq's bios to sg and is usually called from low
>> level do_request() when starting to process a request.
>
> Lets just remove for a moment blk_rq_map_sg() from the discussion.
> That one is totally bogus name that gives me the crips every time
> I read its name. If using the same terminology then at minimum it should have
> been blk_sg_map_request() (Which I agree is even more cryptic but at least
> it gets the source-destination right).
Ah... yeah, one of the two definitely needs to get a different name.
>> You're suggesting to add blk_rq_map_bio() which doesn't really map
>> anything but rather allocates and initializes an rq from bio and
>> basically boils down to blk_rq_bio_prep() and blk_queue_bounce().
>
> Yes, as I said, I completely agree. It does not map anything and
> should not be named blk_rq_map_. It should have a different better
> name. Hell it exists today, as blk_rq_append_bio(), but that one
> suggests usage in ways some people don't like, and they want to
> remove it. My opinion is that: Both me and you could just use the
> good old blk_rq_append_bio() and be happy.
>
> So basically in practice it just comes down to renaming
> blk_rq_append_bio() to something. We both just don't know how to call
> it?
Subtly different. blk_rq_append_bio() doesn't bounce. It seems we
really need to figure out who needs what and how to provide the
interface in a uniform way.
>> I think it's about time to cool it a bit and try to see what's going
>> on. We don't really want blk_rq_map_*() to do three completely
>> different things, right? Usage of blk_rq_map_{user[_iov]|kern}()
>> isn't too wide spread, so cleaning up the API shouldn't be difficult.
>
> Currently blk_rq_map_sg() is the only odd ball, renaming that to something
> sane will fix the current situation.
>
> What additionally we both want is just plain old blk_rq_append_bio and
> maybe with a better name?
Yeah, basically, something similar to that but without the subtle
traps. :-)
>> The difference explains why we have two different
>> initialization directions, so to me, the current API seems logical
>> although we really should be using different names.
>>
>> Can you please explain what your need is? Why do you want
>> blk_make_request()?
>>
>
> I don't need blk_make_request per-se, I just need a way to
> allocate a request and then associate a pre built bio with it,
> that was prepared by an upper filesystem.
>
> (Which talks OSD, which does not fit the block device model at all,
> but this is too much information for you)
>
> Farther more, My life is more complicated then that because, I need
> a bidi-request which is second request hanging on the first one. And
> each request can map up to 6 segments of source information that
> are finally laid linear on the wire. So my bio can not be built
> with a single blk_rq_map_xxx call.
>
> To make the story short blk_rq_append_bio() can be used by your
> code and mine, perhaps with a better name.
>
> I feel I took to much of your time with my petty problems, Just
> that I felt that if adding an associate-this-request-with-this-bio
> API to blkdev.h could be done in a way that makes both of us happy.
Thanks for the explanation.
I think the problem is the unclear distinction between bio initialization
and rq initialization. Ideally, the blk_rq_map*() functions should
initialize the rq, call the appropriate bio map function, link the result into
the rq and be done with it, but currently the blk_rq_map*() functions do more
than that, and that makes cutting the bio initialization apart
difficult, so in cases where bios need to be initialized separately
from the rq, the interface gets weird. What level of bio initialization
should the init-rq-from-bio function expect? Is it already
bounced? Does it have an aligned buffer, etc.?
If we can define a cleaner bio mapping interface and make the rq mapping
functions thinner wrappers around those, it will be much better, but
that doesn't seem to be a trivial thing. Is it possible for you to
stay with blk_rq_append_bio() for the time being? I'll revisit the
mapping API once the current pile of patches I'm sitting on settle
down a bit.
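For reference, a rough sketch of the stopgap I mean (illustrative helper
name, current signatures, error handling trimmed):

#include <linux/blkdev.h>
#include <linux/bio.h>

/* Sketch: hang a pre-built (and already bounced) bio off a freshly
 * allocated PC request. */
static struct request *pc_rq_from_bio(struct request_queue *q, struct bio *bio)
{
	struct request *rq = blk_get_request(q, bio_data_dir(bio), GFP_KERNEL);

	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	if (blk_rq_append_bio(q, rq, bio)) {	/* links the bio, updates data_len */
		blk_put_request(rq);
		return NULL;
	}
	return rq;
}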
Thanks.
--
tejun
On 03/26/2009 02:07 PM, Tejun Heo wrote:
> Hello, Boaz.
>
> Boaz Harrosh wrote:
>>> For the latter, the APIs are blk_rq_map_{user[_iov]|kern}(). To add
>>> to the fun, we have blk_rq_map_sg() which does a completely different
>>> thing of mapping an rq's bios to sg and is usually called from low
>>> level do_request() when starting to process a request.
>> Lets just remove for a moment blk_rq_map_sg() from the discussion.
>> That one is totally bogus name that gives me the crips every time
>> I read its name. If using the same terminology then at minimum it should have
>> been blk_sg_map_request() (Which I agree is even more cryptic but at least
>> it gets the source-destination right).
>
> Ah... yeah, one of the two definitely needs to get a different name.
>
>>> You're suggesting to add blk_rq_map_bio() which doesn't really map
>>> anything but rather allocates and initializes an rq from bio and
>>> basically boils down to blk_rq_bio_prep() and blk_queue_bounce().
>> Yes, as I said, I completely agree. It does not map anything and
>> should not be named blk_rq_map_. It should have a different better
>> name. Hell it exists today, as blk_rq_append_bio(), but that one
>> suggests usage in ways some people don't like, and they want to
>> remove it. My opinion is that: Both me and you could just use the
>> good old blk_rq_append_bio() and be happy.
>>
>> So basically in practice it just comes down to renaming
>> blk_rq_append_bio() to something. We both just don't know how to call
>> it?
>
> Subtly different. blk_rq_append_bio() doesn't bounce. It seems we
> really need to figure out who needs what and how to provide the
> interface in uniform way.
>
>>> I think it's about time to cool it a bit and try to see what's going
>>> on. We don't really want blk_rq_map_*() to do three completely
>>> different things, right? Usage of blk_rq_map_{user[_iov]|kern}()
>>> isn't too wide spread, so cleaning up the API shouldn't be difficult.
>> Currently blk_rq_map_sg() is the only odd ball, renaming that to something
>> sane will fix the current situation.
>>
>> What additionally we both want is just plain old blk_rq_append_bio and
>> maybe with a better name?
>
> Yeah, basically, something similar to that but without the subtle
> traps. :-)
>
>>> The difference explains why we have two different
>>> initialization directions, so to me, the current API seems logical
>>> although we really should be using different names.
>>>
>>> Can you please explain what your need is? Why do you want
>>> blk_make_request()?
>>>
>> I don't need blk_make_request per-se, I just need a way to
>> allocate a request and then associate a pre built bio with it,
>> that was prepared by an upper filesystem.
>>
>> (Which talks OSD, which does not fit the block device model at all,
>> but this is too much information for you)
>>
>> Farther more, My life is more complicated then that because, I need
>> a bidi-request which is second request hanging on the first one. And
>> each request can map up to 6 segments of source information that
>> are finally laid linear on the wire. So my bio can not be built
>> with a single blk_rq_map_xxx call.
>>
>> To make the story short blk_rq_append_bio() can be used by your
>> code and mine, perhaps with a better name.
>>
>> I feel I took to much of your time with my petty problems, Just
>> that I felt that if adding an associate-this-request-with-this-bio
>> API to blkdev.h could be done in a way that makes both of us happy.
>
> Thanks for the explanation.
>
> I think the problem is unclear distinction between bio initialization
> and rq initialization. Ideally, blk_rq_map*() functions should
> initialize rq, call appropriate bio map function, link the result into
> rq and be done with it, but currently blk_rq_map*() functions do more
> than that and that makes cutting out the bio initialization apart
> difficult, so in case where bios need to be initialized separately
> from rq, the interface gets weird. What level of bio initialization
> should the init-rq-from-bio function should expect? Is it already
> bounced? Does it have aligned buffer and etc?
>
> If we can define cleaner bio mapping interface and make rq mapping
> functions thinner wrapper around those, it will be much better, but
> that doesn't seem to be a trivial thing. Is it possible for you to
> stay with blk_rq_append_bio() for the time being? I'll revisit the
> mapping API once the current pile of patches I'm sitting on settle
> down a bit.
>
Yes sure, I'll be thinking about this some more too. Once your patch
pile settles we can sort out an additional cleanup that unifies some
of today's exported functions, just like you did with the blk_end_request.
> Thanks.
>
Do you have a public tree where I can inspect the whole lot in one central
place? That should be the IDE tree you published a few days ago,
right? I'll look for it.
Thanks
Boaz
Hello,
Boaz Harrosh wrote:
> Yes sure, I'll be thinking about this some more too. Once your patch
> pile settles we can sort out an additional cleanup that unifies some
> of today's exported functions, just like you did with the blk_end_request.
I'm currently looking at the mapping API. It's actually the final
obstacle. It seems scsi_req_map_sg() should really be in blk-map and
doing that would solve most issues we've been talking about, right?
Anyway, I'll take a deeper look and see whether I can clean this up.
> Do you have a public tree I can inspect the all lot in one central
> place? That should be the IDE tree you published few days ago
> right? I'll look for it.
I have ~40 more patches on top of those two. They're mostly ready.
I'll post them soonish (I want to say later today but I said later
today yesterday too :-)
Thanks.
--
tejun
On Wed, Mar 25, 2009 at 01:06:15AM +0900, Tejun Heo wrote:
> Impact: unify request data buffer handling
>
> rq->data is used mostly to pass kernel buffer through request queue
> without using bio. There are only a couple of places which still do
> this in kernel and converting to bio isn't difficult.
>
> This patch converts ide-atapi to use bio instead of rq->data. For
> ide_queue_pc_tail() which can block, blk_rq_map_kern() is used. For
> ide_queue_pc_head() which is used for request sense and called from
> atomic context, drive->request_sense_bio and ->request_sense_bvec[2]
> are added and used via bio_map_kern_prealloc().
>
> ide-tape is updated to map sg for special requests.
>
> This change makes all ide-atapi commands except for tape bh ones use
> bio. Drop tp_ops->output/input_data path from ide_pc_intr().
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
--
Regards/Gruss,
Boris.
Hi,
On Tuesday 24 March 2009, Tejun Heo wrote:
> Hello, Bartlomiej, Jens.
>
> This patchset contains mostly IDE cleanups around rq->buffer, data and
> special usage. Over time, the meanings of those three fields became
> blurry. Be default, they mean
[...]
> Bartlomiej, if you're okay with the changes and Jens acks the patches
> and pushing the first three patches via pata-2.6, please feel free to
> include these patches into your tree.
Everything looks fine to me. Once I get Jens' ack (block bits)
and Borislav's one (ATAPI bits) I'll happily merge this patchset.
Thanks,
Bart
On Sat, Mar 28, 2009 at 02:51:01PM +0100, Bartlomiej Zolnierkiewicz wrote:
[..]
> > Bartlomiej, if you're okay with the changes and Jens acks the patches
> > and pushing the first three patches via pata-2.6, please feel free to
> > include these patches into your tree.
>
> Everything looks fine to me. Once I get Jens' ack (block bits)
> and Borislav's one (ATAPI bits) I'll happily merge this patchset.
Yep, I just finished testing the whole series here so you can go ahead
and apply them pending a recast of patch 07/14 you'll hopefully be
receiving soon from Tejun.
--
Regards/Gruss,
Boris.
Borislav Petkov wrote:
> On Sat, Mar 28, 2009 at 02:51:01PM +0100, Bartlomiej Zolnierkiewicz wrote:
>
> [..]
>
>>> Bartlomiej, if you're okay with the changes and Jens acks the patches
>>> and pushing the first three patches via pata-2.6, please feel free to
>>> include these patches into your tree.
>> Everything looks fine to me. Once I get Jens' ack (block bits)
>> and Borislav's one (ATAPI bits) I'll happily merge this patchset.
>
> Yep, I just finished testing the whole series here so you can go ahead
> and apply them pending a recast of patch 07/14 you'll hopefully be
> receiving soon from Tejun.
After the earlier discussion with Boaz Harrosh, I ended up redoing the
blk_rq_map_{kern|user}_*() stuff and am now in the process of testing
it. The patchset is rather large (20+ patches). As the prealloc
patch would fit on top of that series, I think it would be better to
sequence things via git merge. Is there a git tree I can use for
pata-2.6 changes? Or is the current linus master okay?
Thanks.
P.S.: Boaz, the above patchset implements blk_rq_map_kern_sgl() and
kills the SCSI layer hack. That was the issue you were trying to
solve, right?
--
tejun
On 03/30/2009 12:12 PM, Tejun Heo wrote:
> Borislav Petkov wrote:
>> On Sat, Mar 28, 2009 at 02:51:01PM +0100, Bartlomiej Zolnierkiewicz wrote:
>>
>> [..]
>>
>>>> Bartlomiej, if you're okay with the changes and Jens acks the patches
>>>> and pushing the first three patches via pata-2.6, please feel free to
>>>> include these patches into your tree.
>>> Everything looks fine to me. Once I get Jens' ack (block bits)
>>> and Borislav's one (ATAPI bits) I'll happily merge this patchset.
>> Yep, I just finished testing the whole series here so you can go ahead
>> and apply them pending a recast of patch 07/14 you'll hopefully be
>> receiving soon from Tejun.
>
> After the earlier discussion with Boaz Harrosh, I ended up redoing the
> blk_rq_map_{kern|user}_*() stuff and am now in the process of testing
> it. The patchset is rather large (20+ patches). As the prealloc
> patch would fit on top of that series, I think it would be better to
> sequence things via git merge. Is there a git tree I can use for
> pata-2.6 changes? Or is the current linus master okay?
>
> Thanks.
>
> P.S.: Boaz, the above patchset implements blk_rq_map_kern_sgl()
What is blk_rq_map_kern_sgl()? I'm not sure I remember.
What is its API, and what does it do?
> and
> kills the SCSI layer hack. That was the issue you were trying to
> solve, right?
>
If you post the URL of the git tree, I can have a look and start an early review
if you want.
Thanks for doing this
Boaz
On Monday 30 March 2009, Tejun Heo wrote:
> Borislav Petkov wrote:
> > On Sat, Mar 28, 2009 at 02:51:01PM +0100, Bartlomiej Zolnierkiewicz wrote:
> >
> > [..]
> >
> >>> Bartlomiej, if you're okay with the changes and Jens acks the patches
> >>> and pushing the first three patches via pata-2.6, please feel free to
> >>> include these patches into your tree.
> >> Everything looks fine to me. Once I get Jens' ack (block bits)
> >> and Borislav's one (ATAPI bits) I'll happily merge this patchset.
> >
> > Yep, I just finished testing the whole series here so you can go ahead
> > and apply them pending a recast of patch 07/14 you'll hopefully be
> > receiving soon from Tejun.
>
> After the earlier discussion with Boaz Harrosh, I ended up redoing the
> blk_rq_map_{kern|user}_*() stuff and am now in the process of testing
> it. The patchset is rather large (20+ patches). As the prealloc
> patch would fit on top of that series, I think it would be better to
> sequence things via git merge. Is there a git tree I can use for
> pata-2.6 changes? Or is the current linus master okay?
Not yet, we discussed this two weeks ago and the only thing that changed
in the meantime is that we are in the merge window now and I'm ultra busy
instead of being just really busy. Do not worry though -- your last two
patchsets are a very good motivator for quilt->git conversion (much better
than i.e. some unproductive preaching).
Thanks,
Bart
Boaz Harrosh wrote:
>> P.S.: Boaz, the above patchset implements blk_rq_map_kern_sgl()
>
> What is blk_rq_map_kern_sgl() ? I'm not sure I remember ?
> What is it's API? and what does it do?
>
>> and kills the SCSI layer hack. That was the issue you were trying
>> to solve, right?
>
> If you post URL of git tree I can have a look and start an early review
> if you want
Here's the patchset. It builds but I haven't even booted it yet, so
expect some breakages.
http://git.kernel.org/?p=linux/kernel/git/tj/misc.git;a=shortlog;h=map-untested
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git map-untested
Thanks.
--
tejun
Hello, Bartlomiej.
Bartlomiej Zolnierkiewicz wrote:
> Not yet, we discussed this two weeks ago and the only thing that changed
> in the meantime is that we are in the merge window now and I'm ultra busy
> instead of being just really busy. Do not worry though -- your last two
> patchsets are a very good motivator for quilt->git conversion (much better
> than i.e. some unproductive preaching).
Sounds great. Just lemme know when your tree is ready. I'll prep a
cleanly merged git tree you and Jens can pull from.
Thanks.
--
tejun
On 03/30/2009 08:20 PM, Tejun Heo wrote:
> Boaz Harrosh wrote:
>>> P.S.: Boaz, the above patchset implements blk_rq_map_kern_sgl()
>> What is blk_rq_map_kern_sgl() ? I'm not sure I remember ?
>> What is it's API? and what does it do?
>>
>>> and kills the SCSI layer hack. That was the issue you were trying
>>> to solve, right?
>> If you post URL of git tree I can have a look and start an early review
>> if you want
>
> Here's the patchset. It builds but I haven't even booted it yet, so
> expect some breakages.
>
> http://git.kernel.org/?p=linux/kernel/git/tj/misc.git;a=shortlog;h=map-untested
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git map-untested
>
> Thanks.
>
Hi Tejun.
I hope you have not been working on stabilizing this code yet; I've just seen
it this morning (Israel time). It has obvious bugs in the use of scatterlists.
But before you go running to fix them: don't. I don't like the scatterlist API
at all. We were working hard to remove them from scsi and I don't like them
for the block layer either.
If it is for my personal needs then all I need is a simple:
"I have this bio please attach it to a request". The reason I have a bio
is because it comes from a filesystem and because I will be needing to use
the raid engines that are bio based.
I must have confused you with my blk_rq_map_pages() patches; these were made just
as an example of the ugliness and fragility of not using bios. It was supposed to be
a bad example, not to be accepted. (Hence the absence of my Signed-off-by:)
Scatterlists are bad because they are two times too fat for what they do here; they
need to be allocated, chained and slabbed, and finally they need to be converted to
a bio. Working directly with a bio is the shortest route.
Looking at the patches, it seems you are changing them as I write (git-web is freezing),
so I'm sending this mail now. I was in the process of looking at them from the beginning.
So far it looks like a very deep change. I have some comments, but it will need a proper
review.
I do not understand what your intention was for blk_rq_map_kern_sgl(). Who's
the user for it?
BTW: now you absolutely must do something with the name of blk_rq_map_sg() which does not
do any mapping and does not change request at all.
Boaz
Hello, Boaz.
Boaz Harrosh wrote:
> I hope you have not been working on stabilizing this code yet, I've
> just seen it this mornning (Israel time). It has obvious bugs, in
> the use of scatterlists.
Actually, I am now, and it's mostly working.
> But before you go running to fix them. Don't. I don't like the
> scatter-list API at all. We where working hard to remove them from
> scsi and I don't like them for block layer either.
>
> If it is for my personal needs then all I need is a simple: "I have
> this bio please attach it to a request". The reason I have a bio is
> because it comes from a filesystem and because I will be needing to
> use the raid engines that are bio based.
>
> I must have confused you with my blk_rq_map_pages() patches, these
> where made just as an example of the ugliness and fragility of not
> using bios. It was suppose to be a bad example, not to be
> accepted. (Hence the absence of my Singed-off-by:) Scatterlists are
> bad because they are two times too fat for what they do here, they
> need to be allocated, chained and slabbed, and finally they need to
> be converted to a bio. Working directly with a bio is the shortest
> route.
sgl or bvec doesn't make much of difference. The only difference
between sgl and bvec is the extra entries for DMA mapping. The
problem is that there needs to be a common way to describe the pages.
The reason why the blk/bio mapping looks that complex is because all
four cases - user map/copy and kern map/copy - are similar but subtly
different. Without a common medium to carry the list of pages, each
path ended up mixing what's common with what's different and thus the
not quite manageable code.
The reason why I want with sgl is because it's the sg mapping
iterator. Other than that, using bvec or sgl wouldn't make any
difference. All that's necessary is common carrier of page/offset/len
list.
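For example, with the mapping iterator, copying a linear kernel buffer
into an sgl is just the following sketch (sg_copy_from_buffer() in
lib/scatterlist.c already does essentially this):

#include <linux/scatterlist.h>
#include <linux/string.h>

/* Sketch: walk an sgl with the sg mapping iterator and copy a linear
 * kernel buffer into it. */
static size_t copy_buf_to_sgl(struct scatterlist *sgl, unsigned int nents,
			      const void *buf, size_t len)
{
	struct sg_mapping_iter miter;
	size_t done = 0;

	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC);
	while (done < len && sg_miter_next(&miter)) {
		size_t chunk = min(miter.length, len - done);

		memcpy(miter.addr, (const char *)buf + done, chunk);
		done += chunk;
	}
	sg_miter_stop(&miter);

	return done;
}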
But given the use of sgl there, I don't think the performance
arguments holds much ground. How high performance or scalability is
required in PC bio map paths? The added cost is at the most one more
sgl allocation and use of more sparse data structure which BTW
wouldn't be that large anyway. Both aren't worth worrying about
considering how relative coldness of the path and other overhead of
doing IO.
If you think sgl is too heavy for things like this, the right way to
go is to introduce some generic mechanism to carry sg chunks and use
it in bio, but I don't think the overhead of the current sgl justifies
that. If you have qualms about the sgl API, please clean it up. I
just don't think bvec should be used outside of block/fs interface.
As I wrote before, non-FS users have no reason to worry about bio.
All it should think about is the request it needs to issue and the
memory area it's gonna use as in/out buffers. bio just doesn't have a
place there.
> I do not understand what was your intention for the
> blk_rq_map_kern_sgl() how's the user for it?
Well, the only in-kern user which was tinkering with rq/bio internals
was scsi_req_map_sg(). Do you have another user?
> BTW: now you absolutely must do something with the name of
> blk_rq_map_sg() which does not do any mapping and does not change
> request at all.
Heh... yeap, definitely. Can you think of a good one?
Thanks.
--
tejun
I really should proof read before hitting the send button. My head is
somehow pretty good at messing up sentences.
Tejun Heo wrote:
> The reason why I want with sgl is because it's the sg mapping
went it has
...
> wouldn't be that large anyway. Both aren't worth worrying about
> considering how relative coldness of the path and other overhead of
Eh.. that was a nice mix of "how the path is relatively cold" and
"relative coldness of the path". :-)
Thanks.
--
tejun
On 03/31/2009 12:05 PM, Tejun Heo wrote:
> Hello, Boaz.
>
> Boaz Harrosh wrote:
>> I hope you have not been working on stabilizing this code yet, I've
>> just seen it this mornning (Israel time). It has obvious bugs, in
>> the use of scatterlists.
>
> Actually, I'm now and it's mostly working now.
>
hmm, I've seen for example use of:
sg_set_page(&sg,,) on a stack scatterlist entry, and a call to a function
that does for_each_sg(), but the for_each_sg() looks for a terminating bit.
You must do sg_init_one() or sg_init_table() first.
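In other words, something like this minimal sketch (not code from your
tree) is needed before anyone walks that on-stack entry:

#include <linux/scatterlist.h>

/* Sketch: an on-stack, single-entry sgl has to be initialized so the end
 * marker is set before a for_each_sg()/sg_next() walk sees it. */
static unsigned int single_entry_len(void *buf, unsigned int len)
{
	struct scatterlist sg, *s;
	unsigned int total = 0;
	int i;

	sg_init_one(&sg, buf, len);	/* zeroes the entry, sets buf/len, marks end */

	for_each_sg(&sg, s, 1, i)	/* safe now: the termination bit is in place */
		total += s->length;

	return total;
}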
Are these code paths exercised?
>> But before you go running to fix them. Don't. I don't like the
>> scatter-list API at all. We where working hard to remove them from
>> scsi and I don't like them for block layer either.
>>
>> If it is for my personal needs then all I need is a simple: "I have
>> this bio please attach it to a request". The reason I have a bio is
>> because it comes from a filesystem and because I will be needing to
>> use the raid engines that are bio based.
>>
>> I must have confused you with my blk_rq_map_pages() patches, these
>> where made just as an example of the ugliness and fragility of not
>> using bios. It was suppose to be a bad example, not to be
>> accepted. (Hence the absence of my Singed-off-by:) Scatterlists are
>> bad because they are two times too fat for what they do here, they
>> need to be allocated, chained and slabbed, and finally they need to
>> be converted to a bio. Working directly with a bio is the shortest
>> route.
>
> sgl or bvec doesn't make much of difference. The only difference
> between sgl and bvec is the extra entries for DMA mapping.
Exactly, and in 32bit largemem that part is 200% of the first
part, so bvec can be 33% of sgl at times, and 50% usually.
> The
> problem is that there needs to be common way to describe the pages.
> The reason why the blk/bio mapping looks that complex is because all
> four cases - user map/copy and kern map/copy - are similar but subtly
> different. Without a common medium to carry the list of pages, each
> path ended up mixing what's common with what's different and thus the
> not quite manageable code.
Yes, the mess is more historical and patch-by-patch-wise than actual
technical difficulty. And it is more accommodating to what user code has
at hand than to what makes the block layer clean.
>
> The reason why I want with sgl is because it's the sg mapping
> iterator.
Actually no! The "sg mapping iterator" is no success. It has that chaining
ugliness and bit markers, and all that information missing from it.
It was born out of necessity, and was a nightmare to stabilize around
the 2.6.21 kernel if I remember correctly.
At the time there was that long argument of it needing to be an sg_table
and chained at the table level. Because look how much information is missing
from a scatterlist. Like the length, the allocated size, associated data, ...
scatterlist is both too large and too little to be a nice carrier structure
for pages.
> Other than that, using bvec or sgl wouldn't make any
> difference. All that's necessary is common carrier of page/offset/len
> list.
>
bvec has the same shortages as a scatterlist, a good carrier is a bio
> But given the use of sgl there, I don't think the performance
> arguments holds much ground. How high performance or scalability is
> required in PC bio map paths? The added cost is at the most one more
> sgl allocation and use of more sparse data structure which BTW
> wouldn't be that large anyway.
One pageful of scatterlist can be as small as 128 nents. This is a stupid
512K. This is way too small for any real IO. Start chaining these and you get
a nightmare. The governing structure should be something much longer, and smarter.
(Don't even think of allocating something linearly bigger than one page;
it does not work.)
> Both aren't worth worrying about
> considering how relative coldness of the path and other overhead of
> doing IO.
>
Not true; with ram-disk performance benchmarks we could see a 3-6% difference
with an extra scatterlist allocation. SSDs and RDMA devices come close to
ram disk performance these days.
> If you think sgl is too heavy for things like this, the right way to
> go is to introduce some generic mechanism to carry sg chunks and use
> it in bio, but I don't think the overhead of the current sgl justifies
> that. If you have qualms about the sgl API, please clean it up.
sgl is a compromise with historical code at the DMA and LLD level.
It should not have any new users above the block layer.
> I
> just don't think bvec should be used outside of block/fs interface.
> As I wrote before, non-FS users have no reason to worry about bio.
> All it should think about is the requst it needs to issue and the
> memory area it's gonna use as in/out buffers. bio just doesn't have a
> place there.
>
I don't understand what happens to all the people that start to work on the block
layer, they start getting defensive about bio being private to the request
level. But the Genie is out of the bag already (I know cats and bottles).
bio is already heavily used above the block layer from directly inside filesystems
to all kind of raid engines, DM MD managers, multi paths, integrity information ...
Because bio is exactly that Ideal page carrier you are talking about.
The usage pattern needed by everybody everywhere is:
Allocate an abstract container starting empty. add some pages to it,
grow / chain endlessly as needed, (or as constrained from other factors).
Submit it down the stack. Next stage can re-use it, split it, merge it, mux
it demux it, bounce it, hang event notifiers, the works.
The only such structure today is a bio. If an API should be cleaned up it
is the bio API. Add the nice iterator to it if you like sg_next() a lot.
Make it so clean that all users of block-request take a bio, easily add
pages / memory segments to it, and then submit it into a request with a single
API, blk_req_add_bio(). or better blk_make_request(bio).
Inventing half baked intermediate linear structures will not hold.
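In code, that pattern is nothing more than the following sketch
(illustrative helper name; the pages, callback and target device are
assumed to come from the caller):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Sketch of the allocate/add/submit pattern described above. */
static void submit_page_list(struct block_device *bdev, sector_t sector,
			     struct page **pages, int nr_pages,
			     bio_end_io_t *end_io, void *private)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, nr_pages);
	int i;

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_end_io = end_io;
	bio->bi_private = private;

	for (i = 0; i < nr_pages; i++)
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE)
			break;		/* hit a queue limit: chain/submit another bio */

	submit_bio(WRITE, bio);		/* the next stage may split/merge/bounce it */
}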
>> I do not understand what was your intention for the
>> blk_rq_map_kern_sgl() how's the user for it?
>
> Well, the only in-kern user which was tinkering with rq/bio internals
> was scsi_req_map_sg(). Do you have another user?
>
scsi_req_map_sg() is gone. All its users now go directly to blk_rq_map_*
and submit requests directly. So there are no external users of sgl.
(Should be submitted soon)
>> BTW: now you absolutely must do something with the name of
>> blk_rq_map_sg() which does not do any mapping and does not change
>> request at all.
>
> Heh... yeap, definitely. Can you think of a good one?
>
I would say blk_rq_fill_sgl, maybe
> Thanks.
>
Other than the sgl usage I like your cleanups a lot; I do however have
some comments. Should I hold these until you post them or start shooting
away?
Thanks
Boaz
Hello, Boaz.
Boaz Harrosh wrote:
> On 03/31/2009 12:05 PM, Tejun Heo wrote:
>> Hello, Boaz.
>>
>> Boaz Harrosh wrote:
>>> I hope you have not been working on stabilizing this code yet, I've
>>> just seen it this mornning (Israel time). It has obvious bugs, in
>>> the use of scatterlists.
>> Actually, I'm now and it's mostly working now.
>>
>
> hmm, I've seen for example use of:
> sg_set_page(&sg,,) on a stack scatterlist entry, and call to a function
> that does for_each_sg(), but the for_each_sg() looks for a terminating bit.
> You must do sg_init_one() or sg_init_table() first.
>
> Are these code paths exercised?
As I wrote, it was a preliminary tree which didn't even go through
single round of testing. Missing sg_init_table() blew up on the first
test run and got fixed.
>> sgl or bvec doesn't make much of difference. The only difference
>> between sgl and bvec is the extra entries for DMA mapping.
>
> Exactly, and in 32bit largemem that part is 200% of the first
> part, so bvec can be 33% of sgl at times, and 50% usually.
And why would all that matter for temporary short-lived sgls on a
relatively cold path?
>> The reason why I want with sgl is because it's the sg mapping
>> iterator.
>
> Actually no! "sg mapping iterator" success. it has that chaining
> ugliness and bit markers. And all that missing information from it.
> It was born out of necessity, and was a nightmare to stabilize around
> the 2.6.21st kernel if I remember correctly.
What? If you're unhappy with sgl interface, improve it. Don't go
around it. Yeah, it took some work to work the sgl changes out but
it's stable now, so why not use it?
> At the time there was that long argument of it needing to be an
> sg_table and chained at the table level. Because look how much
> information is missing form a scatterlist. Like the length, the
> allocated size associated data, ... scatterlist is both too large
> and too little to be a nice carrier structure for pages.
Again, then, please go ahead and improve it. Whether you like it or
not, it's the de-facto standard segment carrying structure which is
widely used. You can't go around that.
>> Other than that, using bvec or sgl wouldn't make any
>> difference. All that's necessary is common carrier of page/offset/len
>> list.
>>
>
> bvec has the same shortages as a scatterlist, a good carrier is a bio
bio is _NOT_ designed to be a generic page carrier. Please do not
abuse the interface just because the code is there. If you think the
current sg code is lacking, improve it and then apply it to bio if
possible, don't twist and abuse bio.
>> But given the use of sgl there, I don't think the performance
>> arguments holds much ground. How high performance or scalability is
>> required in PC bio map paths? The added cost is at the most one more
>> sgl allocation and use of more sparse data structure which BTW
>> wouldn't be that large anyway.
>
> One pageful of scatterlist can be as small as 128 nents. This is a
> stupid 512K. This is way too small for any real IO. Start chaining
> these and you get a nightmare. The governing structure should be
> something much longer, and smarter. (Don't even think of allocating
> something linearly bigger then one page it does not work).
I'm not trying to use sgl inside block layer. I'm using it as page
carrier for kernel API (what else would you use?) and page carrier for
internal implementation because sgl has some useful helpers to deal
with them.
>> Both aren't worth worrying about considering how relative coldness
>> of the path and other overhead of doing IO.
>
> Not true, with ram-disk performance benchmarks we could see 3-6% difference
> with an extra sccaterlist allocation. SSDs and RDMA devices come close to
> ram disk performance these days.
And you're doing that via PC requests using blk_rq_map_*() interface?
Wouldn't those IOs go through regular bio channel?
>> If you think sgl is too heavy for things like this, the right way to
>> go is to introduce some generic mechanism to carry sg chunks and use
>> it in bio, but I don't think the overhead of the current sgl justifies
>> that. If you have qualms about the sgl API, please clean it up.
>
> sgl is a compromise with historical code at the DMA and LLD level.
> It should not have any new users above the block layer.
bio is something which should be exported outside of block layer?
>> I just don't think bvec should be used outside of block/fs
>> interface. As I wrote before, non-FS users have no reason to worry
>> about bio. All it should think about is the requst it needs to
>> issue and the memory area it's gonna use as in/out buffers. bio
>> just doesn't have a place there.
>
> I don't understand what happens to all the people that start to work
> on the block layer, they start getting defensive about bio being
> private to the request level. But the Genie is out of the bag
> already (I know cats and bottles). bio is already heavily used
> above the block layer from directly inside filesystems to all kind
> of raid engines, DM MD managers, multi paths, integrity information
> ...
>
> Because bio is exactly that Ideal page carrier you are talking about.
> The usage pattern needed by everybody everywhere is:
> Allocate an abstract container starting empty. add some pages to it,
> grow / chain endlessly as needed, (or as constrained from other factors).
> Submit it down the stack. Next stage can re-use it, split it, merge it, mux
> it demux it, bounce it, hang event notifiers, the works.
>
> The only such structure today is a bio. If an API should be cleaned up it
> is the bio API. Add the nice iterator to it if you like sg_next() a lot.
>
> Make it so clean that all users of block-request take a bio, easily add
> pages / memory segments to it, and then submit it into a request with a single
> API, blk_req_add_bio(). or better blk_make_request(bio).
>
> Inventing half baked intermediate linear structures will not hold.
Thanks for the credit but I didn't invent sgl.
bio is the unit of block IO viewed from filesystems. Let's leave it
there. If you're unhappy with sg_table or sgl, what needs to be done
is extending sg_table such that it does all the great things you
described above and then integrating it into bio. If the extra fields
for DMA mapping matter, create a new structure which will be used for
that purpose and build API around it and then replace private bvec
handling inside bio and apply the new API gradually replacing sgl
usages where it makes sense.
In all seriousness, using bio as a generic page list carrier is a
horrible, horrible idea. That's a path which will lead to strange
situations. bio is an interface tightly tied to the block and fs
interface. No one else has any business playing with it. md/dm got
caught up in between, and I know that people haven't been happy with the
situation and there have been attempts to move over to a request-based
implementation, although I don't know what happened thereafter, so
please stop pushing bio as a generic data carrier because it is not and
should not ever be.
For now, the only thing that matters is whether to use sgl in
blk_rq_map_kern_sgl() or not. I suppose you're pushing for using bio
instead, right?
>>> BTW: now you absolutely must do something with the name of
>>> blk_rq_map_sg() which does not do any mapping and does not change
>>> request at all.
>> Heh... yeap, definitely. Can you think of a good one?
>
> I would say blk_rq_fill_sgl, maybe
Sounds good to me.
> Other then the sgl usage I like your clean ups a lot, I do however
> have some comments. Should I hold these until you post them or start
> shooting away?
Please wait a bit.
Thanks.
--
tejun
On 03/31/2009 04:43 PM, Tejun Heo wrote:
> Hello, Boaz.
>
Hi Tejun
> Boaz Harrosh wrote:
>> On 03/31/2009 12:05 PM, Tejun Heo wrote:
>>> Hello, Boaz.
>>>
>
>>> sgl or bvec doesn't make much of difference. The only difference
>>> between sgl and bvec is the extra entries for DMA mapping.
>> Exactly, and in 32bit largemem that part is 200% of the first
>> part, so bvec can be 33% of sgl at times, and 50% usually.
>
> And why would all that matter for temporary short-lived sgls on a
> relatively cold path?
>
Its use is uncommon, yes, but it is not cold. It is very hot for its
users. Any SG_IO users from sg, bsg and others go through here, sometimes
very hotly.
My benchmarks were done with the sg3_dd utility, going through sg.c
for direct IO submission (sg3_utils package).
I will use some such path for the exofs filesystem, which is very
hot for me.
>
> What? If you're unhappy with sgl interface, improve it. Don't go
> around it. Yeah, it took some work to work the sgl changes out but
> it's stable now, so why not use it?
>
I hate it, but pragmatically I will be too lazy to do anything
about it.
Again, it is heavily used low down in drivers land and at DMA time,
which is what it is for. It is not used in mm or VFS or user-mode-to-kernel.
I thought we were talking about the interface above the block layer.
Scatterlists have no place there.
>> At the time there was that long argument of it needing to be an
>> sg_table and chained at the table level. Because look how much
>> information is missing form a scatterlist. Like the length, the
>> allocated size associated data, ... scatterlist is both too large
>> and too little to be a nice carrier structure for pages.
>
> Again, then, please go ahead and improve it. Whether you like it or
> not, it's the de-facto standard segment carrying structure which is
> widely used. You can't go around that.
>
>>> Other than that, using bvec or sgl wouldn't make any
>>> difference. All that's necessary is common carrier of page/offset/len
>>> list.
>>>
>> bvec has the same shortages as a scatterlist, a good carrier is a bio
>
> bio is _NOT_ designed to be a generic page carrier. Please do not
> abuse the interface just because the code is there. If you think the
> currnent sg code is lacking, improve it and then apply it to bio if
> possible, don't twist and abuse bio.
>
I'm only using bio the exact same way as other filesystems. The only
difference is that I need it with a block_pc command.
>>> But given the use of sgl there, I don't think the performance
>>> arguments holds much ground. How high performance or scalability is
>>> required in PC bio map paths? The added cost is at the most one more
>>> sgl allocation and use of more sparse data structure which BTW
>>> wouldn't be that large anyway.
>> One pageful of scatterlist can be as small as 128 nents. This is a
>> stupid 512K. This is way too small for any real IO. Start chaining
>> these and you get a nightmare. The governing structure should be
>> something much longer, and smarter. (Don't even think of allocating
>> something linearly bigger then one page it does not work).
>
> I'm not trying to use sgl inside block layer. I'm using it as page
> carrier for kernel API (what else would you use?) and page carrier for
> internal implementation because sgl has some useful helpers to deal
> with them.
>
>>> Both aren't worth worrying about considering how relative coldness
>>> of the path and other overhead of doing IO.
>> Not true, with ram-disk performance benchmarks we could see 3-6% difference
>> with an extra sccaterlist allocation. SSDs and RDMA devices come close to
>> ram disk performance these days.
>
> And you're doing that via PC requests using blk_rq_map_*() interface?
> Wouldn't those IOs go through regular bio channel?
>
Sure, that's the point. I need a block_pc command interface that is just as
hot as the current "regular bio channel".
And so do CD-burners, tapes, you name it. Any non-FS heavy usage of
scsi devices whose usage does not fit the narrow READ/WRITE flat array
of sectors. They can have very heavy I/O, osd for instance.
>>> If you think sgl is too heavy for things like this, the right way to
>>> go is to introduce some generic mechanism to carry sg chunks and use
>>> it in bio, but I don't think the overhead of the current sgl justifies
>>> that. If you have qualms about the sgl API, please clean it up.
>> sgl is a compromise with historical code at the DMA and LLD level.
>> It should not have any new users above the block layer.
>
> bio is something which should be exported outside of block layer?
>
It is today; it's used in the interface between filesystems and the block layer,
and in between stacked block devices.
>>> I just don't think bvec should be used outside of block/fs
>>> interface. As I wrote before, non-FS users have no reason to worry
>>> about bio. All it should think about is the requst it needs to
>>> issue and the memory area it's gonna use as in/out buffers. bio
>>> just doesn't have a place there.
>> I don't understand what happens to all the people that start to work
>> on the block layer, they start getting defensive about bio being
>> private to the request level. But the Genie is out of the bag
>> already (I know cats and bottles). bio is already heavily used
>> above the block layer from directly inside filesystems to all kind
>> of raid engines, DM MD managers, multi paths, integrity information
>> ...
>>
>> Because bio is exactly that Ideal page carrier you are talking about.
>> The usage pattern needed by everybody everywhere is:
>> Allocate an abstract container starting empty. add some pages to it,
>> grow / chain endlessly as needed, (or as constrained from other factors).
>> Submit it down the stack. Next stage can re-use it, split it, merge it, mux
>> it demux it, bounce it, hang event notifiers, the works.
>>
>> The only such structure today is a bio. If an API should be cleaned up it
>> is the bio API. Add the nice iterator to it if you like sg_next() a lot.
>>
>> Make it so clean that all users of block-request take a bio, easily add
>> pages / memory segments to it, and then submit it into a request with a single
>> API, blk_req_add_bio(). or better blk_make_request(bio).
>>
>> Inventing half baked intermediate linear structures will not hold.
>
> Thanks for the credit but I didn't invent sgl.
>
> bio is unit of block IO viewed from filesystems. Let's leave it
> there. If you're unhappy with sg_table or sgl, what needs to be done
> is extending sg_table such that it does all the great things you
> described above and then integrating it into bio. If the extra fields
> for DMA mapping matters, create a new structure which will be used for
> that purpose and build API around it and then replace private bvec
> handling inside bio and apply the new API gradually replacing sgl
> usages where it makes sense.
>
> In all seriousness, using bio as generic page list carrier is a
> horrible horrible idea. That's a path which will lead to strange
> situations. bio isa an interface tightly tied to block and fs
> interface. No one else has any business playing with it. md/dm got
> caught up inbetween and I know that people haven't been happy with the
> situation and there haven been attempts to move over to request based
> implementation although I don't know what happened thereafter, so
> please stop pushing bio as generic data carrier because it is not and
> should not ever be.
>
> For now, the only thing that matter is whether to use sgl in
> blk_rq_map_kern_sgl() or not. I suppose you're pushing for using bio
> instead, right?
>
Currently I do not see any candidate users for blk_rq_map_kern_sgl().
sg.c was the last user, and TOMO has converted that to blk_rq_map_user/_iov
with map_data in places. It should all be going into 2.6.30.
Me, I will not use sgl. My main motivation is exofs, an osd-based
filesystem, and the pNFS-objects-layout driver, both of which live in
"fs" space. The only solution, which I currently have implemented,
is using bios. sgl is not an option.
Thanks
Boaz
On Tue, 31 Mar 2009 16:04:39 +0300
Boaz Harrosh <[email protected]> wrote:
> > I
> > just don't think bvec should be used outside of block/fs interface.
> > As I wrote before, non-FS users have no reason to worry about bio.
> > All it should think about is the requst it needs to issue and the
> > memory area it's gonna use as in/out buffers. bio just doesn't have a
> > place there.
> >
>
> I don't understand what happens to all the people that start to work on the block
> layer, they start getting defensive about bio being private to the request
> level. But the Genie is out of the bag already (I know cats and bottles).
> bio is already heavily used above the block layer from directly inside filesystems
> to all kind of raid engines, DM MD managers, multi paths, integrity information ...
>
> Because bio is exactly that Ideal page carrier you are talking about.
Wrong. multipath doesn't use bio. md accesses the bio internals
(it's not nice) and has its own way to carry pages. dm has its own
mechanism on top of bio. And bio doesn't work nicely for file
systems such as btrfs, which handle multiple devices.
Please stop your wrong argument 'bio is the ideal page carrier'.
> The usage pattern needed by everybody everywhere is:
> Allocate an abstract container starting empty. add some pages to it,
> grow / chain endlessly as needed, (or as constrained from other factors).
> Submit it down the stack. Next stage can re-use it, split it, merge it, mux
> it demux it, bounce it, hang event notifiers, the works.
FUJITA Tomonori wrote:
> On Tue, 31 Mar 2009 16:04:39 +0300
> Boaz Harrosh <[email protected]> wrote:
>
>>> I
>>> just don't think bvec should be used outside of block/fs interface.
>>> As I wrote before, non-FS users have no reason to worry about bio.
>>> All it should think about is the requst it needs to issue and the
>>> memory area it's gonna use as in/out buffers. bio just doesn't have a
>>> place there.
>>>
>> I don't understand what happens to all the people that start to work on the block
>> layer, they start getting defensive about bio being private to the request
>> level. But the Genie is out of the bag already (I know cats and bottles).
>> bio is already heavily used above the block layer from directly inside filesystems
>> to all kind of raid engines, DM MD managers, multi paths, integrity information ...
>>
>> Because bio is exactly that Ideal page carrier you are talking about.
>
> Wrong. multi path doesn't use bio. md accesses to the bio internal
> (it's not nice) and has the own way to carry pages. dm has the own
> mechanism on the top of bio. And bio doesn't work nicely for file
> systems such as btrfs, which handle multiple devices.
>
> Please stop your wrong argument 'bio is the ideal page carrier'.
What is the multi-device problem with bio?
The idea behind bio is to be the ideal page carrier :) It sounds like
bio needs revision, not abandonment.
Jeff
On Mon, 13 Apr 2009 03:54:38 -0400
Jeff Garzik <[email protected]> wrote:
> FUJITA Tomonori wrote:
> > On Tue, 31 Mar 2009 16:04:39 +0300
> > Boaz Harrosh <[email protected]> wrote:
> >
> >>> I
> >>> just don't think bvec should be used outside of block/fs interface.
> >>> As I wrote before, non-FS users have no reason to worry about bio.
> >>> All it should think about is the requst it needs to issue and the
> >>> memory area it's gonna use as in/out buffers. bio just doesn't have a
> >>> place there.
> >>>
> >> I don't understand what happens to all the people that start to work on the block
> >> layer, they start getting defensive about bio being private to the request
> >> level. But the Genie is out of the bag already (I know cats and bottles).
> >> bio is already heavily used above the block layer from directly inside filesystems
> >> to all kind of raid engines, DM MD managers, multi paths, integrity information ...
> >>
> >> Because bio is exactly that Ideal page carrier you are talking about.
> >
> > Wrong. multi path doesn't use bio. md accesses to the bio internal
> > (it's not nice) and has the own way to carry pages. dm has the own
> > mechanism on the top of bio. And bio doesn't work nicely for file
> > systems such as btrfs, which handle multiple devices.
> >
> > Please stop your wrong argument 'bio is the ideal page carrier'.
>
> What is the multi-device problem with bio?
Well, if it worked nicely, I guess we wouldn't have something like
drivers/md/{dm-bio-record.h, dm-bio-list.h}, the btrfs_multi_bio struct,
or md's own page carrier?
> The idea behind bio is to be the ideal page carrier :) It sounds like
> bio needs revision, not abandonment.
I think that bio was designed to be an ideal page carrier for file
systems. But it's not ideal for everyone. I don't think that Boaz's
argument to use bio everywhere such as under drivers/scsi is a good
idea.
Yeah, a page-carrier structure needs revision anyway. At least, we
need to unify the way (or code) to handle multiple devices.
FUJITA Tomonori wrote:
> On Mon, 13 Apr 2009 03:54:38 -0400
> Jeff Garzik <[email protected]> wrote:
>
>> FUJITA Tomonori wrote:
>>> On Tue, 31 Mar 2009 16:04:39 +0300
>>> Boaz Harrosh <[email protected]> wrote:
>>>
>>>>> I
>>>>> just don't think bvec should be used outside of block/fs interface.
>>>>> As I wrote before, non-FS users have no reason to worry about bio.
>>>>> All it should think about is the requst it needs to issue and the
>>>>> memory area it's gonna use as in/out buffers. bio just doesn't have a
>>>>> place there.
>>>>>
>>>> I don't understand what happens to all the people that start to work on the block
>>>> layer, they start getting defensive about bio being private to the request
>>>> level. But the Genie is out of the bag already (I know cats and bottles).
>>>> bio is already heavily used above the block layer from directly inside filesystems
>>>> to all kind of raid engines, DM MD managers, multi paths, integrity information ...
>>>>
>>>> Because bio is exactly that Ideal page carrier you are talking about.
>>> Wrong. multi path doesn't use bio. md accesses to the bio internal
>>> (it's not nice) and has the own way to carry pages. dm has the own
>>> mechanism on the top of bio. And bio doesn't work nicely for file
>>> systems such as btrfs, which handle multiple devices.
>>>
>>> Please stop your wrong argument 'bio is the ideal page carrier'.
>> What is the multi-device problem with bio?
>
> Well, if it works nicely, I guess that we don't have something like
> drivers/dm/{dm-bio-record.h, dm-bio-list.h}, btrfs_multi_bio struct,
> or md's own page carrier?
It was an honest question. I am seeking information, not denying your
argument.
What, specifically, is this multi-device problem with bio, please?
Jeff
On Wed, Mar 25, 2009 at 01:06:16AM +0900, Tejun Heo wrote:
> Impact: unify request data buffer handling
>
> rq->data is used mostly to pass kernel buffer through request queue
> without using bio. There are only a couple of places which still do
> this in kernel and converting to bio isn't difficult.
>
> This patch converts ide-cd to use bio instead of rq->data for request
> sense and internal pc commands. As request sense is issued from
> atomic context, drive->request_sense_bio and ->request_sense_bvec[2]
> are used via bio_map_kern_prealloc(). Internal pc commands use
> blk_rq_map_kern().
>
> cdrom_do_block_pc() now doesn't have to deal with REQ_TYPE_ATA_PC
> special case. Simplified.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> drivers/ide/ide-cd.c | 31 ++++++++++++++++++-------------
> 1 files changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
> index de6ce8d..22c95b9 100644
> --- a/drivers/ide/ide-cd.c
> +++ b/drivers/ide/ide-cd.c
> @@ -209,8 +209,12 @@ static void cdrom_analyze_sense_data(ide_drive_t *drive,
> static void cdrom_queue_request_sense(ide_drive_t *drive, void *sense,
> struct request *failed_command)
> {
> - struct cdrom_info *info = drive->driver_data;
> - struct request *rq = &drive->request_sense_rq;
> + struct cdrom_info *info = drive->driver_data;
> + struct request *rq = &drive->request_sense_rq;
> + struct bio *bio = &drive->request_sense_bio;
> + struct bio_vec *bvec = drive->request_sense_bvec;
> + unsigned int bvec_len = ARRAY_SIZE(drive->request_sense_bvec);
> + int error;
>
> ide_debug_log(IDE_DBG_SENSE, "enter");
>
> @@ -224,10 +228,12 @@ static void cdrom_queue_request_sense(ide_drive_t *drive, void *sense,
> rq->cmd_type = REQ_TYPE_ATA_PC;
By the way, this line can go too, since the correct type is REQ_TYPE_SENSE now.
> rq->rq_disk = info->disk;
>
> - rq->data = sense;
> + error = blk_rq_map_kern_prealloc(drive->queue, rq, bio, bvec, bvec_len,
> + sense, 18, true);
> + BUG_ON(error);
> +
> rq->cmd[0] = GPCMD_REQUEST_SENSE;
> rq->cmd[4] = 18;
> - rq->data_len = 18;
>
> rq->cmd_type = REQ_TYPE_SENSE;
> rq->cmd_flags |= REQ_PREEMPT;
> @@ -556,8 +562,12 @@ int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
> rq->cmd_flags |= cmd_flags;
> rq->timeout = timeout;
> if (buffer) {
> - rq->data = buffer;
> - rq->data_len = *bufflen;
> + error = blk_rq_map_kern(drive->queue, rq, buffer,
> + *bufflen, __GFP_WAIT);
> + if (error) {
> + blk_put_request(rq);
> + return error;
> + }
> }
>
> error = blk_execute_rq(drive->queue, info->disk, rq, 0);
> @@ -847,15 +857,10 @@ static void cdrom_do_block_pc(ide_drive_t *drive, struct request *rq)
> drive->dma = 0;
>
> /* sg request */
> - if (rq->bio || ((rq->cmd_type == REQ_TYPE_ATA_PC) && rq->data_len)) {
> + if (rq->bio) {
> struct request_queue *q = drive->queue;
> + char *buf = bio_data(rq->bio);
> unsigned int alignment;
> - char *buf;
> -
> - if (rq->bio)
> - buf = bio_data(rq->bio);
> - else
> - buf = rq->data;
>
> drive->dma = !!(drive->dev_flags & IDE_DFLAG_USING_DMA);
>
> --
> 1.6.0.2
>
--
Regards/Gruss,
Boris.
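For reference, the two mapping paths the patch above relies on boil down
to the following. This is a condensed sketch only: the wrapper function
and its arguments are invented here for illustration, while the
blk_rq_map_kern_prealloc() and blk_rq_map_kern() calls mirror the hunks
quoted above.

#include <linux/ide.h>
#include <linux/blkdev.h>

/*
 * Condensed sketch of the two mapping paths used by the patch above.
 * The wrapper function is illustrative only; the mapping calls follow
 * the quoted hunks.
 */
static int cd_map_data(ide_drive_t *drive, struct request *rq,
                       void *sense, void *buffer, unsigned int bufflen)
{
        int error;

        if (sense) {
                /* IRQ-safe path: bio and bvec storage preallocated per drive */
                error = blk_rq_map_kern_prealloc(drive->queue, rq,
                                &drive->request_sense_bio,
                                drive->request_sense_bvec,
                                ARRAY_SIZE(drive->request_sense_bvec),
                                sense, 18, true);
                BUG_ON(error);  /* lowmem, preallocated: must not fail */
        } else if (buffer) {
                /* process-context path: may allocate, so it may also fail */
                error = blk_rq_map_kern(drive->queue, rq, buffer, bufflen,
                                        __GFP_WAIT);
                if (error)
                        return error;
        }
        return 0;
}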
On Thu, 26 Mar 2009 10:05:57 +0200
Boaz Harrosh <[email protected]> wrote:
> On 03/26/2009 09:42 AM, Tejun Heo wrote:
> > Hello,
> >
> > A few issues.
> >
> > Tejun Heo wrote:
> >>> Perhaps you could reorder the code below a bit?
> >>>
> >>> My proposal is:
> >>> * blk_rq_map_kern_prealloc => is simplified to be
> >>> int blk_rq_map_bio(struct request_queue *q, struct request *rq,
> >>> struct bio *bio);
> >
> > The thing is that the prealloc variant should be callable from IRQ
> > context, so blk_queue_bounce() shouldn't be called.
> > Hmmm... well, the caller is supposed to know what it's doing, and maybe
> > we can just add a comment that it shouldn't be called from IRQ context
> > with buffers which might get bounced.
> >
>
> Hmm, that is a problem. I would suggest a flag or a check. My bios come
> from the VFS; they need bouncing.
>
> Can you think of a solution?
>
> We could just call blk_queue_bounce(). IRQ callers need to make sure their
> buffers don't need bouncing anyway, so there is no such bug, right? If a
> programmer gets it wrong, he will get a BUG check that tells him so.
If I understand the old IDE code correctly, it's broken with respect to
kernel buffer bouncing.
Fixing this in the old IDE code needs major surgery, I think. When I
converted the old IDE code to use the block layer API for allocating
requests, I gave up on the stack-allocated requests (used in IRQ context).
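To make the compromise being discussed concrete, something along these
lines could sit in the mapping helper. This is purely a sketch: the
function name and the from_irq flag are invented for illustration, and
the PageHighMem() test is only a partial guard, since a queue with a low
bounce_pfn may still need lowmem pages bounced, which an IRQ-context
caller simply has to avoid by construction.

#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/highmem.h>

/*
 * Sketch only: bounce when the caller can sleep, otherwise assert that
 * the buffer obviously cannot need bouncing. Not complete: a queue
 * with a restrictive bounce_pfn can still require bouncing of lowmem
 * pages, which IRQ-context callers must avoid by construction.
 */
static void map_kern_bounce_or_check(struct request_queue *q,
                                     struct bio **bio, void *buf,
                                     bool from_irq)
{
        if (!from_irq)
                blk_queue_bounce(q, bio);       /* may allocate and copy */
        else
                BUG_ON(PageHighMem(virt_to_page(buf)));
}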
On Mon, 13 Apr 2009 04:31:08 -0400
Jeff Garzik <[email protected]> wrote:
> FUJITA Tomonori wrote:
> > On Mon, 13 Apr 2009 03:54:38 -0400
> > Jeff Garzik <[email protected]> wrote:
> >
> >> FUJITA Tomonori wrote:
> >>> On Tue, 31 Mar 2009 16:04:39 +0300
> >>> Boaz Harrosh <[email protected]> wrote:
> >>>
> >>>>> I
> >>>>> just don't think bvec should be used outside of the block/fs interface.
> >>>>> As I wrote before, non-FS users have no reason to worry about bio.
> >>>>> All they should think about is the request they need to issue and the
> >>>>> memory area they're gonna use as in/out buffers. bio just doesn't have a
> >>>>> place there.
> >>>>>
> >>>> I don't understand what happens to all the people who start to work on the block
> >>>> layer; they start getting defensive about bio being private to the request
> >>>> level. But the Genie is out of the bag already (I know, cats and bottles).
> >>>> bio is already heavily used above the block layer, from directly inside filesystems
> >>>> to all kinds of RAID engines, DM/MD managers, multipath, integrity information ...
> >>>>
> >>>> Because bio is exactly that ideal page carrier you are talking about.
> >>> Wrong. Multipath doesn't use bio. md accesses bio internals (which
> >>> isn't nice) and has its own way to carry pages. dm has its own
> >>> mechanism on top of bio. And bio doesn't work nicely for file
> >>> systems such as btrfs, which handle multiple devices.
> >>>
> >>> Please stop making the wrong argument that 'bio is the ideal page carrier'.
> >> What is the multi-device problem with bio?
> >
> > Well, if it worked nicely, I guess we wouldn't have things like
> > drivers/dm/{dm-bio-record.h, dm-bio-list.h}, the btrfs_multi_bio struct,
> > or md's own page carrier?
>
> It was an honest question. I am seeking information, not denying your
> argument.
>
> What, specifically, is this multi-device problem with bio, please?
dm-bio-record.h says:
/*
 * There are lots of mutable fields in the bio struct that get
 * changed by the lower levels of the block layer. Some targets,
 * such as multipath, may wish to resubmit a bio on error. The
 * functions in this file help the target record and restore the
 * original bio state.
 */
I guess that dm-bio-list.h is a way to link multiple bios for multiple
devices into a single structure, but others use bio->bi_next for a
different purpose (for example, building a large request in the block
mapping API), so it's confusing.
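For reference, the helpers dm-bio-list.h provides are roughly the
following (reproduced from memory, so treat it as a sketch). The point
is that the bios are chained through bi_next, the same field the block
layer itself uses while merging, which is where the confusion comes from.

#include <linux/bio.h>

/* Singly linked list of bios, chained through bio->bi_next. */
struct bio_list {
        struct bio *head;
        struct bio *tail;
};

static inline void bio_list_add(struct bio_list *bl, struct bio *bio)
{
        bio->bi_next = NULL;
        if (bl->tail)
                bl->tail->bi_next = bio;
        else
                bl->head = bio;
        bl->tail = bio;
}

static inline struct bio *bio_list_pop(struct bio_list *bl)
{
        struct bio *bio = bl->head;

        if (bio) {
                bl->head = bio->bi_next;
                if (!bl->head)
                        bl->tail = NULL;
                bio->bi_next = NULL;
        }
        return bio;
}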
On Mon, 2009-04-13 at 04:31 -0400, Jeff Garzik wrote:
> >> What is the multi-device problem with bio?
> >
> > Well, if it worked nicely, I guess we wouldn't have things like
> > drivers/dm/{dm-bio-record.h, dm-bio-list.h}, the btrfs_multi_bio struct,
> > or md's own page carrier?
>
> It was an honest question. I am seeking information, not denying your
> argument.
I think it wasn't quite designed like that. The originally designed
carrier of pages was the bio_vec array. The bio was designed to carry
the bio_vec into the top end of the block layer. bio_vec is incredibly
lightweight (it's more like a scatterlist), but bio is a lot heavier and
now has a lot of quirky initialisers (like callbacks and merge counts)
that need to be set correctly, making it a more difficult thing to use
generically. bio is really designed to be the merge unit for requests,
not a generic carrier of pages.
The final irony is that most of the page list mappings we do want to go
straight to requests, making the bio complexity even more superfluous.
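For reference, the bio_vec being described really is just three words
per segment, as defined in include/linux/bio.h at the time:

struct bio_vec {
        struct page     *bv_page;       /* page holding the data */
        unsigned int    bv_len;         /* length of the segment in bytes */
        unsigned int    bv_offset;      /* byte offset within the page */
};

whereas struct bio adds the device, sector, flags, completion callback,
reference count and merge bookkeeping on top of it.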
> What, specifically, is this multi-device problem with bio, please?
The multi-device problem is that it's very hard to split a bio up again
once we've assembled it. So if you get a single bio whose I/O crosses
multiple devices in md or dm, there's a lot of work that has to go on
behind the scenes to get the split to happen, because we need one bio per
underlying device that we send I/O to.
James
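To see what "a lot of work behind the scenes" means in code, here is
roughly what a stacking driver has to do when a bio straddles two
underlying devices. This is an illustrative sketch only: the devices and
the boundary are invented, and bio_split() of this vintage only copes
with single-segment bios, which is exactly why dm and md carry so much
extra machinery for the general case.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch of a stacking driver splitting a bio at a device boundary.
 * bio_split() only handles single-segment bios here, so real dm/md
 * code has to do much more (clone per target, fix up bi_idx/bi_size,
 * chain completions back to the parent, record state for resubmission).
 */
static void submit_split(struct bio *bio, struct block_device *below0,
                         struct block_device *below1, sector_t boundary)
{
        if (bio->bi_sector + (bio->bi_size >> 9) <= boundary) {
                bio->bi_bdev = below0;          /* fits entirely in the first device */
                generic_make_request(bio);
        } else {
                struct bio_pair *bp = bio_split(bio, boundary - bio->bi_sector);

                bp->bio1.bi_bdev = below0;
                bp->bio2.bi_bdev = below1;
                bp->bio2.bi_sector -= boundary; /* remap into the second device */
                generic_make_request(&bp->bio1);
                generic_make_request(&bp->bio2);
                bio_pair_release(bp);
        }
}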