2008-01-09 08:52:51

by Jens Axboe

Subject: [PATCH][RFC] fast file mapping for loop

Hi,

loop.c currently uses the page cache interface to do IO to file backed
devices. This works reasonably well for simple things, like mapping an
iso9660 file for direct mount and other read-only workloads. Writing is
somewhat problematic, as anyone who has really used this feature can
attest to - it tends to confuse the vm (hello kswapd) since it breaks
dirty accounting and behaves very erratically on writeout. Did I mention
that it's pretty slow as well, for both reads and writes?

It also behaves differently than a real drive. For writes, completions
are done once they hit page cache. Since loop queues bio's async and
hands them off to a thread, you can have a huge backlog of stuff to do.
It's hard to guarantee data safety for file systems on top of
loop without making it even slower than it currently is.

Back when loop was only used for iso9660 mounting and other simple
things, this didn't matter. Now it's often used in xen (and other)
setups where we do care about performance AND writing. So the below is an
attempt at speeding up loop and making it behave like a real device.
It's a somewhat quick hack and is still missing one piece to be
complete, but I'll throw it out there for people to play with and
comment on.

So how does it work? Instead of punting IO to a thread and passing it
through the page cache, we attempt to send the IO directly to the
filesystem block that it maps to. loop maintains a prio tree of known
extents in the file (populated lazily on demand, as needed); a condensed
sketch of the remapping follows the two lists below. Advantages of this
approach:

- It's fast, loop will basically work at device speed.
- It's efficient, loop doesn't put a huge amount of load on the
system when busy. When I did comparison tests on my notebook with an
external drive, running a simple tiobench on the current in-kernel
loop with a sparse file backing rendered the notebook basically
unusable while the test was ongoing. The remapper version had no more
impact than it did when used directly on the external drive.
- It behaves like a real block device.
- It's easy to support IO barriers, which is needed to ensure safety
especially in virtualized setups.

Disadvantages:

- The file block mappings must not change while loop is using the file.
This means that we have to ensure exclusive access to the file and
this is the bit that is currently missing in the implementation. It
would be nice if we could just do this via open(), ideas welcome...
- It'll tie down a bit of memory for the prio tree. This is GREATLY
offset by the reduced page cache footprint though.
- It cannot be used with the loop encryption stuff. dm-crypt should be
used instead, on top of loop (which, I think, is even the recommended
way to do this today, so not a big deal).
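
To make the mechanics concrete, here is a condensed model of the
remapping (a sketch only: no locking, no hole handling, no lazy
population; the names mirror loop_redirect_bio() in the patch below):

	struct loop_file_extent {
		sector_t disk_block;	/* start, in fs-sized blocks */
		sector_t file_block;	/* start, in 512b sectors, 0 == hole */
		unsigned int nr_blocks;	/* extent length, in fs-sized blocks */
	};

	static int remap_bio(struct loop_device *lo, struct bio *bio)
	{
		/* convert the 512b sector into an fs block number */
		sector_t disk_block = bio->bi_sector >> (lo->blkbits - 9);
		struct loop_file_extent *lfe = loop_lookup_extent(lo, disk_block);
		sector_t extent_off;

		if (!lfe || !lfe->file_block)
			return 0;	/* unmapped or a hole, take the slow path */

		/* sector offset of this bio inside the extent */
		extent_off = bio->bi_sector - (lfe->disk_block << (lo->blkbits - 9));
		bio->bi_bdev = lo->fs_bdev;
		bio->bi_sector = lfe->file_block + extent_off;
		return 1;	/* resubmit directly to the real device */
	}

So with a 4k fs block size (blkbits == 12), a bio at sector 2048 maps to
fs block 2048 >> 3 == 256; if that block falls in an extent with
disk_block == 240 and file_block == 81920, the bio is re-aimed at sector
81920 + (2048 - 240 * 8) == 82048 on the backing device.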

For now, this patch automatically enables the new operation mode (called
fastfs). I added an ioctl (LOOP_SET_FASTFS) that should be implemented
in losetup; once that is done, we can remove this hunk from the code:

+ /*
+ * This needs to be done after setup with another ioctl,
+ * not automatically like this.
+ */
+ loop_init_fastfs(lo);
+

from loop_set_fd().
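
For completeness, the losetup side of LOOP_SET_FASTFS is then just a
plain ioctl against an already bound loop device. A minimal sketch
(0x4C07 being the value this patch adds to loop.h):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	#ifndef LOOP_SET_FASTFS
	#define LOOP_SET_FASTFS	0x4C07
	#endif

	int main(int argc, char **argv)
	{
		int fd;

		if (argc < 2) {
			fprintf(stderr, "usage: %s /dev/loopN\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDWR);
		if (fd < 0 || ioctl(fd, LOOP_SET_FASTFS, 0) < 0) {
			perror("LOOP_SET_FASTFS");
			return 1;
		}
		close(fd);
		return 0;
	}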

Patch is against 2.6.24-rc7 ('ish, as of this morning), will probably
apply easily to 2.6.23 as well.


diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 56e2304..e49bfa8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -481,16 +481,51 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
return ret;
}

+#define __lo_throttle(wq, lock, condition) \
+do { \
+ DEFINE_WAIT(__wait); \
+ for (;;) { \
+ prepare_to_wait((wq), &__wait, TASK_UNINTERRUPTIBLE); \
+ if (condition) \
+ break; \
+ spin_unlock_irq((lock)); \
+ io_schedule(); \
+ spin_lock_irq((lock)); \
+ } \
+ finish_wait((wq), &__wait); \
+} while (0)
+
+#define lo_act_bio(bio) ((bio)->bi_bdev)
+#define LO_BIO_THROTTLE 128
+
/*
- * Add bio to back of pending list
+ * A normal block device will throttle on request allocation. Do the same
+ * for loop to prevent millions of bio's queued internally.
+ */
+static void loop_bio_throttle(struct loop_device *lo, struct bio *bio)
+{
+ if (lo_act_bio(bio))
+ __lo_throttle(&lo->lo_bio_wait, &lo->lo_lock,
+ lo->lo_bio_cnt < LO_BIO_THROTTLE);
+}
+
+/*
+ * Add bio to back of pending list and wakeup thread
*/
static void loop_add_bio(struct loop_device *lo, struct bio *bio)
{
+ loop_bio_throttle(lo, bio);
+
if (lo->lo_biotail) {
lo->lo_biotail->bi_next = bio;
lo->lo_biotail = bio;
} else
lo->lo_bio = lo->lo_biotail = bio;
+
+ if (lo_act_bio(bio))
+ lo->lo_bio_cnt++;
+
+ wake_up(&lo->lo_event);
}

/*
@@ -505,11 +540,230 @@ static struct bio *loop_get_bio(struct loop_device *lo)
lo->lo_biotail = NULL;
lo->lo_bio = bio->bi_next;
bio->bi_next = NULL;
+
+ if (lo_act_bio(bio))
+ lo->lo_bio_cnt--;
+ if (lo->lo_bio_cnt < LO_BIO_THROTTLE || !lo->lo_bio)
+ wake_up(&lo->lo_bio_wait);
}

return bio;
}

+struct loop_file_extent {
+ struct prio_tree_node prio_node;
+ sector_t disk_block;
+ sector_t file_block;
+ unsigned int nr_blocks;
+};
+
+#define node_to_lfe(n) prio_tree_entry((n), struct loop_file_extent, prio_node)
+
+static void loop_prio_remove_node(struct loop_device *lo,
+ struct loop_file_extent *lfe)
+{
+ if (lo->last_lookup == &lfe->prio_node)
+ lo->last_lookup = NULL;
+ if (lo->last_insert == &lfe->prio_node)
+ lo->last_insert = NULL;
+
+ prio_tree_remove(&lo->prio_root, &lfe->prio_node);
+}
+
+/*
+ * Drop and free all stored extents
+ */
+static void loop_drop_extents(struct loop_device *lo)
+{
+ struct prio_tree_node *node;
+ struct prio_tree_iter iter;
+ unsigned int nr_extents = 0;
+
+ /*
+ * We may need a few iterations to prune all entries, since we are
+ * removing nodes while iterating the tree.
+ */
+ do {
+ struct loop_file_extent *lfe;
+
+ prio_tree_iter_init(&iter, &lo->prio_root, 0, (pgoff_t) -1);
+
+ while ((node = prio_tree_next(&iter)) != NULL) {
+ lfe = node_to_lfe(node);
+ loop_prio_remove_node(lo, lfe);
+ kfree(lfe);
+ nr_extents++;
+ }
+ } while (!prio_tree_empty(&lo->prio_root));
+
+ printk(KERN_INFO "loop%d: dropped %u extents\n", lo->lo_number, nr_extents);
+}
+
+/*
+ * Most lookups are for the same extent a number of times, before switching
+ * to a new one. This happens for bio page adds, for instance. So cache
+ * the last lookup to prevent doing a full prio_tree lookup if we are within
+ * the range of the current 'last_lookup'
+ */
+static inline int loop_check_last_node(struct loop_device *lo, sector_t block)
+{
+ struct prio_tree_node *n = lo->last_lookup;
+
+ if (n)
+ return (block >= n->start) && (block <= n->last);
+
+ return 0;
+}
+
+/*
+ * Find extent mapping this lo device block to the file block on the real
+ * device
+ */
+static struct loop_file_extent *loop_lookup_extent(struct loop_device *lo,
+ sector_t block)
+{
+ if (!loop_check_last_node(lo, block)) {
+ struct prio_tree_iter iter;
+
+ prio_tree_iter_init(&iter, &lo->prio_root, block, block);
+ lo->last_lookup = prio_tree_next(&iter);
+ }
+
+ if (lo->last_lookup)
+ return node_to_lfe(lo->last_lookup);
+
+ return NULL;
+}
+
+static void loop_handle_extent_hole(struct loop_device *lo, struct bio *bio)
+{
+ /*
+ * for a read, just zero the data and end the io
+ */
+ if (bio_data_dir(bio) == READ) {
+ struct bio_vec *bvec;
+ unsigned long flags;
+ int i;
+
+ bio_for_each_segment(bvec, bio, i) {
+ char *dst = bvec_kmap_irq(bvec, &flags);
+
+ memset(dst, 0, bvec->bv_len);
+ bvec_kunmap_irq(dst, &flags);
+ }
+ bio_endio(bio, 0);
+ } else {
+ /*
+ * let the page cache handling path do this bio, and then
+ * lookup the mapped blocks after the io has been issued to
+ * instantiate extents.
+ */
+ loop_add_bio(lo, bio);
+ }
+}
+
+/*
+ * Alloc a hint bio to tell the loop thread to read file blocks for a given
+ * range
+ */
+#define LOOP_EXTENT_RW_MAGIC 0x19283746
+static void loop_schedule_extent_mapping(struct loop_device *lo,
+ sector_t disk_block,
+ unsigned int nr_blocks, int wait)
+{
+ struct bio *bio, stackbio;
+
+ might_sleep_if(wait);
+
+ /*
+ * it's ok if we occasionally fail. if called with blocking set,
+ * then use an on-stack bio since that must not fail.
+ */
+ if (wait) {
+ bio = &stackbio;
+ bio_init(bio);
+ } else
+ bio = bio_alloc(GFP_ATOMIC, 0);
+
+ if (bio) {
+ DECLARE_COMPLETION_ONSTACK(comp);
+
+ bio->bi_rw = LOOP_EXTENT_RW_MAGIC;
+ bio->bi_sector = disk_block;
+ bio->bi_size = nr_blocks;
+
+ loop_add_bio(lo, bio);
+
+ if (wait) {
+ /*
+ * ok to set here, loop_add_bio() doesn't drop lock
+ * for this bio (!lo_act_bio(bio))
+ */
+ bio->bi_private = &comp;
+
+ /*
+ * never called with wait != 0 where it's not
+ * allowed to use spin_unlock_irq() which
+ * unconditionally enables interrupts.
+ */
+ spin_unlock_irq(&lo->lo_lock);
+ wait_for_completion(&comp);
+ spin_lock_irq(&lo->lo_lock);
+ }
+ }
+}
+
+/*
+ * Change mapping of the bio, so that it points to the real bdev and offset
+ */
+static int loop_redirect_bio(struct loop_device *lo, struct bio *bio)
+{
+ struct loop_file_extent *lfe;
+ sector_t disk_block;
+ sector_t extent_off;
+
+ disk_block = bio->bi_sector >> (lo->blkbits - 9);
+
+ /*
+ * if no extent exists yet, map this extent first and populate a few
+ * ahead
+ */
+ lfe = loop_lookup_extent(lo, disk_block);
+ if (!lfe) {
+ unsigned int nr_blocks = bio->bi_size >> lo->blkbits;
+
+ loop_schedule_extent_mapping(lo, disk_block, nr_blocks, 1);
+
+ lfe = loop_lookup_extent(lo, disk_block);
+ BUG_ON(!lfe);
+ }
+
+ /*
+ * handle sparse io
+ */
+ if (!lfe->file_block) {
+ loop_handle_extent_hole(lo, bio);
+ return 0;
+ }
+
+ /*
+ * not a hole, redirect
+ */
+ extent_off = bio->bi_sector - (lfe->disk_block << (lo->blkbits - 9));
+ bio->bi_bdev = lo->fs_bdev;
+ bio->bi_sector = lfe->file_block + extent_off;
+ return 1;
+}
+
+/*
+ * Wait on bio's on our list to complete before sending a barrier bio
+ * to the below device. Called with lo_lock held.
+ */
+static void loop_wait_on_bios(struct loop_device *lo)
+{
+ __lo_throttle(&lo->lo_bio_wait, &lo->lo_lock, !lo->lo_bio);
+}
+
static int loop_make_request(struct request_queue *q, struct bio *old_bio)
{
struct loop_device *lo = q->queuedata;
@@ -525,15 +779,33 @@ static int loop_make_request(struct request_queue *q, struct bio *old_bio)
goto out;
if (unlikely(rw == WRITE && (lo->lo_flags & LO_FLAGS_READ_ONLY)))
goto out;
+ if (lo->lo_flags & LO_FLAGS_FASTFS) {
+ /*
+ * If we get a barrier bio, then we just need to wait for
+ * existing bio's to be complete. This can only happen
+ * on the 'new' extent mapped loop, since that is the only
+ * one that supports barriers.
+ */
+ if (bio_barrier(old_bio))
+ loop_wait_on_bios(lo);
+
+ if (loop_redirect_bio(lo, old_bio))
+ goto out_redir;
+
+ goto out_end;
+ }
loop_add_bio(lo, old_bio);
- wake_up(&lo->lo_event);
spin_unlock_irq(&lo->lo_lock);
return 0;

out:
- spin_unlock_irq(&lo->lo_lock);
bio_io_error(old_bio);
+out_end:
+ spin_unlock_irq(&lo->lo_lock);
return 0;
+out_redir:
+ spin_unlock_irq(&lo->lo_lock);
+ return 1;
}

/*
@@ -547,20 +819,62 @@ static void loop_unplug(struct request_queue *q)
blk_run_address_space(lo->lo_backing_file->f_mapping);
}

+static void loop_unplug_fastfs(struct request_queue *q)
+{
+ struct loop_device *lo = q->queuedata;
+ struct request_queue *rq = bdev_get_queue(lo->fs_bdev);
+
+ clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags);
+
+ if (rq->unplug_fn)
+ rq->unplug_fn(rq);
+}
+
struct switch_request {
struct file *file;
struct completion wait;
};

static void do_loop_switch(struct loop_device *, struct switch_request *);
+static void loop_hole_filled(struct loop_device *, struct bio *);
+static int loop_init_fastfs(struct loop_device *);
+static int loop_read_bmap(struct loop_device *, struct inode *, sector_t,
+ sector_t, unsigned int);

static inline void loop_handle_bio(struct loop_device *lo, struct bio *bio)
{
- if (unlikely(!bio->bi_bdev)) {
+ if (!bio->bi_bdev && bio->bi_rw == LOOP_EXTENT_RW_MAGIC) {
+ struct inode *inode = lo->lo_backing_file->f_mapping->host;
+
+ loop_read_bmap(lo, inode, bio->bi_sector, (sector_t) -1, 8);
+
+ /*
+ * completion bio is stack allocated (since it's sync)
+ */
+ if (bio->bi_private)
+ complete(bio->bi_private);
+ else
+ bio_put(bio);
+ } else if (!bio->bi_bdev) {
do_loop_switch(lo, bio->bi_private);
bio_put(bio);
} else {
- int ret = do_bio_filebacked(lo, bio);
+ int ret;
+
+ ret = do_bio_filebacked(lo, bio);
+
+ /*
+ * must be filling a hole. kick off writeback of the pages
+ * so that we know they have a disk mapping. lookup the new
+ * disk blocks and update our prio tree, splitting the extent
+ * covering the old hole and adding new extents for the new
+ * blocks.
+ */
+ if (lo->lo_flags & LO_FLAGS_FASTFS) {
+ filemap_fdatawrite(lo->lo_backing_file->f_mapping);
+ loop_hole_filled(lo, bio);
+ }
+
bio_endio(bio, ret);
}
}
@@ -610,7 +924,7 @@ static int loop_thread(void *data)
static int loop_switch(struct loop_device *lo, struct file *file)
{
struct switch_request w;
- struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+ struct bio *bio = bio_alloc(GFP_KERNEL, 0);
if (!bio)
return -ENOMEM;
init_completion(&w.wait);
@@ -637,6 +951,21 @@ static void do_loop_switch(struct loop_device *lo, struct switch_request *p)
mapping->host->i_bdev->bd_block_size : PAGE_SIZE;
lo->old_gfp_mask = mapping_gfp_mask(mapping);
mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
+
+ if (lo->lo_flags & LO_FLAGS_FASTFS) {
+ unsigned long flags;
+
+ /*
+ * teardown existing extent mappings and init for a new file
+ */
+ lo->lo_flags &= ~LO_FLAGS_FASTFS;
+
+ spin_lock_irqsave(&lo->lo_lock, flags);
+ loop_drop_extents(lo);
+ loop_init_fastfs(lo);
+ spin_unlock_irqrestore(&lo->lo_lock, flags);
+ }
+
complete(&p->wait);
}

@@ -700,6 +1029,321 @@ static int loop_change_fd(struct loop_device *lo, struct file *lo_file,
return error;
}

+static struct loop_file_extent *
+loop_merge_last_node(struct loop_device *lo, sector_t disk_block,
+ sector_t file_block, unsigned int nr_blocks)
+{
+ unsigned int file_blocks, lfe_file_blocks;
+ struct loop_file_extent *lfe;
+ struct prio_tree_node *node;
+
+ node = lo->last_insert;
+ if (unlikely(!node))
+ return NULL;
+
+ lfe = node_to_lfe(node);
+ file_blocks = nr_blocks << (lo->blkbits - 9);
+ lfe_file_blocks = lfe->nr_blocks << (lo->blkbits - 9);
+
+ /*
+ * See if we can add to front or back of this node
+ */
+ if ((lfe->disk_block + lfe->nr_blocks == disk_block) &&
+ (lfe->file_block + lfe_file_blocks == file_block)) {
+ /*
+ * back merge
+ */
+ return lfe;
+ } else if ((disk_block + nr_blocks == lfe->disk_block) &&
+ (file_block + file_blocks == lfe->file_block)) {
+ /*
+ * front merge
+ */
+ lfe->disk_block = disk_block;
+ return lfe;
+ }
+
+ return NULL;
+}
+
+static int __loop_add_extent(struct loop_device *lo,
+ struct loop_file_extent *lfe)
+{
+ struct prio_tree_node *ret;
+
+ INIT_PRIO_TREE_NODE(&lfe->prio_node);
+ lfe->prio_node.start = lfe->disk_block;
+ lfe->prio_node.last = lfe->disk_block + lfe->nr_blocks - 1;
+
+ ret = prio_tree_insert(&lo->prio_root, &lfe->prio_node);
+ if (ret == &lfe->prio_node) {
+ if (lfe->file_block)
+ lo->last_insert = &lfe->prio_node;
+ return 0;
+ }
+
+ return -EEXIST;
+}
+
+/*
+ * Add an extent starting at 'disk_block' loop block and 'file_block'
+ * fs block, spanning 'nr_blocks'. file_block may be 0, in which case
+ * this extent describes a hole in the file.
+ */
+static int loop_add_extent(struct loop_device *lo, sector_t disk_block,
+ sector_t file_block, unsigned int nr_blocks)
+{
+ struct loop_file_extent *lfe;
+ int ret;
+
+ file_block = file_block << (lo->blkbits - 9);
+ spin_lock_irq(&lo->lo_lock);
+
+ /*
+ * See if we can merge with the last inserted node
+ */
+ lfe = loop_merge_last_node(lo, disk_block, file_block, nr_blocks);
+ if (lfe) {
+ lfe->nr_blocks += nr_blocks;
+ loop_prio_remove_node(lo, lfe);
+ } else {
+ /*
+ * no merge, allocate and add a new node
+ */
+ lfe = kzalloc(sizeof(*lfe), GFP_ATOMIC);
+ if (!lfe) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ lfe->disk_block = disk_block;
+ lfe->file_block = file_block;
+ lfe->nr_blocks = nr_blocks;
+ }
+
+ ret = __loop_add_extent(lo, lfe);
+ if (ret) {
+ /*
+ * Most likely we raced with someone else in setting up this
+ * extent, so just drop this one
+ */
+ kfree(lfe);
+ }
+
+out:
+ spin_unlock_irq(&lo->lo_lock);
+ return ret;
+}
+
+/*
+ * See if adding this bvec would cause us to spill into a new extent. If so,
+ * disallow the add to start a new bio. This ensures that the bio we receive
+ * in loop_make_request() never spans two extents or more.
+ */
+static int loop_merge_bvec(struct request_queue *q, struct bio *bio,
+ struct bio_vec *bvec)
+{
+ struct loop_device *lo = q->queuedata;
+ struct loop_file_extent *lfe;
+ unsigned int nr_blocks, ret;
+ unsigned long flags;
+ sector_t disk_block;
+
+ disk_block = bio->bi_sector >> (lo->blkbits - 9);
+ nr_blocks = (bio->bi_size + bvec->bv_len) >> lo->blkbits;
+ ret = bvec->bv_len;
+
+ spin_lock_irqsave(&lo->lo_lock, flags);
+ lfe = loop_lookup_extent(lo, disk_block);
+ if (lfe) {
+ /*
+ * have extent, disallow if outside that extent
+ */
+ if (disk_block + nr_blocks > lfe->disk_block + lfe->nr_blocks)
+ ret = 0;
+ } else {
+ /*
+ * no extent yet, disallow if we already added to the bio and
+ * schedule read of blocks for this extent
+ */
+ loop_schedule_extent_mapping(lo, disk_block, nr_blocks, 0);
+ if (bio->bi_size)
+ ret = 0;
+ }
+
+ spin_unlock_irqrestore(&lo->lo_lock, flags);
+ return ret;
+}
+
+/*
+ * Read and populate the prio tree starting from 'block' disk block
+ * and to 'end', unless we hit 'max_ext' first.
+ */
+static int loop_read_bmap(struct loop_device *lo, struct inode *inode,
+ sector_t block, sector_t end, unsigned int max_ext)
+{
+ sector_t expected_block, disk_start, file_start;
+ unsigned int extent_blocks, nr_extents;
+
+ if (end == (sector_t) -1) {
+ struct inode *inode = lo->lo_backing_file->f_mapping->host;
+
+ end = i_size_read(inode) >> inode->i_blkbits;
+ }
+
+ nr_extents = 0;
+ expected_block = block + 1;
+ file_start = disk_start = -1;
+ extent_blocks = 0;
+
+ while (block <= end && nr_extents < max_ext) {
+ sector_t file_block = bmap(inode, block);
+
+ if (disk_start == -1) {
+start_extent:
+ disk_start = block;
+ file_start = file_block;
+ extent_blocks = 1;
+ } else if (expected_block == file_block)
+ extent_blocks++;
+ else {
+ if (loop_add_extent(lo, disk_start, file_start, extent_blocks)) {
+ disk_start = -1;
+ break;
+ }
+ nr_extents++;
+ goto start_extent;
+ }
+
+ if (file_block)
+ expected_block = file_block + 1;
+ else
+ expected_block = 0;
+
+ block++;
+ }
+
+ if (disk_start != -1) {
+ if (!loop_add_extent(lo, disk_start, file_start, extent_blocks))
+ nr_extents++;
+ }
+
+ return nr_extents;
+}
+
+/*
+ * Initialize the members pertaining to extent mapping. We will populate
+ * the tree lazily on demand, as a full scan of a big file can take some
+ * time.
+ */
+static int loop_init_fastfs(struct loop_device *lo)
+{
+ struct file *file = lo->lo_backing_file;
+ struct inode *inode = file->f_mapping->host;
+ struct request_queue *fs_q;
+
+ if (!S_ISREG(inode->i_mode))
+ return -EINVAL;
+
+ /*
+ * Need a working bmap. TODO: use the same optimization that
+ * direct-io.c does for get_block() mapping more than one block
+ * at the time.
+ */
+ if (inode->i_mapping->a_ops->bmap == NULL)
+ return -EINVAL;
+
+ INIT_PRIO_TREE_ROOT(&lo->prio_root);
+ lo->blkbits = inode->i_blkbits;
+ lo->fs_bdev = file->f_mapping->host->i_sb->s_bdev;
+ lo->lo_flags |= LO_FLAGS_FASTFS;
+ lo->lo_queue->unplug_fn = loop_unplug_fastfs;
+
+ blk_queue_merge_bvec(lo->lo_queue, loop_merge_bvec);
+ blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN, NULL);
+
+ fs_q = bdev_get_queue(lo->fs_bdev);
+ blk_queue_stack_limits(lo->lo_queue, fs_q);
+
+ printk(KERN_INFO "loop%d: fast redirect\n", lo->lo_number);
+ return 0;
+}
+
+/*
+ * We filled the hole at the location specified by bio. Lookup the extent
+ * covering this hole, and either split it into two or shorten it depending
+ * on what part the bio covers. After that we need to lookup the new blocks
+ * and add extents for those.
+ */
+static void loop_hole_filled(struct loop_device *lo, struct bio *bio)
+{
+ struct inode *inode = lo->lo_backing_file->f_mapping->host;
+ struct loop_file_extent *lfe, *lfe_e;
+ unsigned int nr_blocks;
+ sector_t disk_block;
+
+ disk_block = bio->bi_sector >> (lo->blkbits - 9);
+ lfe_e = NULL;
+
+ spin_lock_irq(&lo->lo_lock);
+ lfe = loop_lookup_extent(lo, disk_block);
+ BUG_ON(!lfe);
+
+ /*
+ * Remove extent from tree and trim or split it
+ */
+ loop_prio_remove_node(lo, lfe);
+ spin_unlock_irq(&lo->lo_lock);
+
+ nr_blocks = bio->bi_size >> lo->blkbits;
+
+ /*
+ * Either we need to trim at the front, at the end, or split
+ * it into two if the write is in the middle
+ */
+ if (disk_block == lfe->disk_block) {
+ /*
+ * Trim front
+ */
+ lfe->disk_block += nr_blocks;
+ lfe->nr_blocks -= nr_blocks;
+ } else if (disk_block + nr_blocks == lfe->disk_block + lfe->nr_blocks) {
+ /*
+ * Trim end
+ */
+ lfe->nr_blocks -= nr_blocks;
+ } else {
+ unsigned int total_blocks = lfe->nr_blocks;
+
+ /*
+ * Split extent in two
+ */
+ lfe->nr_blocks = disk_block - lfe->disk_block;
+
+ lfe_e = kzalloc(sizeof(*lfe_e), GFP_NOIO | __GFP_NOFAIL);
+ lfe_e->disk_block = disk_block + nr_blocks;
+ lfe_e->nr_blocks = total_blocks - nr_blocks - lfe->nr_blocks;
+ }
+
+ if (!lfe->nr_blocks) {
+ kfree(lfe);
+ lfe = NULL;
+ }
+
+ spin_lock_irq(&lo->lo_lock);
+
+ /*
+ * Re-add hole extent(s)
+ */
+ if (lfe)
+ __loop_add_extent(lo, lfe);
+ if (lfe_e)
+ __loop_add_extent(lo, lfe_e);
+
+ spin_unlock_irq(&lo->lo_lock);
+
+ loop_read_bmap(lo, inode, disk_block, disk_block + nr_blocks - 1, -1);
+}
+
static inline int is_loop_device(struct file *file)
{
struct inode *i = file->f_mapping->host;
@@ -748,6 +1392,7 @@ static int loop_set_fd(struct loop_device *lo, struct file *lo_file,

mapping = file->f_mapping;
inode = mapping->host;
+ lo->lo_flags = 0;

if (!(file->f_mode & FMODE_WRITE))
lo_flags |= LO_FLAGS_READ_ONLY;
@@ -811,6 +1456,12 @@ static int loop_set_fd(struct loop_device *lo, struct file *lo_file,

set_blocksize(bdev, lo_blocksize);

+ /*
+ * This needs to be done after setup with another ioctl,
+ * not automatically like this.
+ */
+ loop_init_fastfs(lo);
+
lo->lo_thread = kthread_create(loop_thread, lo, "loop%d",
lo->lo_number);
if (IS_ERR(lo->lo_thread)) {
@@ -896,6 +1547,9 @@ static int loop_clr_fd(struct loop_device *lo, struct block_device *bdev)

kthread_stop(lo->lo_thread);

+ if (lo->lo_flags & LO_FLAGS_FASTFS)
+ loop_drop_extents(lo);
+
lo->lo_backing_file = NULL;

loop_release_xfer(lo);
@@ -943,6 +1597,9 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
if (info->lo_encrypt_type) {
unsigned int type = info->lo_encrypt_type;

+ if (lo->lo_flags & LO_FLAGS_FASTFS)
+ return -EINVAL;
+
if (type >= MAX_LO_CRYPT)
return -EINVAL;
xfer = xfer_funcs[type];
@@ -1153,6 +1810,9 @@ static int lo_ioctl(struct inode * inode, struct file * file,
case LOOP_GET_STATUS64:
err = loop_get_status64(lo, (struct loop_info64 __user *) arg);
break;
+ case LOOP_SET_FASTFS:
+ err = loop_init_fastfs(lo);
+ break;
default:
err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL;
}
@@ -1412,6 +2072,7 @@ static struct loop_device *loop_alloc(int i)
lo->lo_number = i;
lo->lo_thread = NULL;
init_waitqueue_head(&lo->lo_event);
+ init_waitqueue_head(&lo->lo_bio_wait);
spin_lock_init(&lo->lo_lock);
disk->major = LOOP_MAJOR;
disk->first_minor = i;
diff --git a/include/linux/loop.h b/include/linux/loop.h
index 26a0a10..e3c6ce0 100644
--- a/include/linux/loop.h
+++ b/include/linux/loop.h
@@ -18,6 +18,7 @@
#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/mutex.h>
+#include <linux/prio_tree.h>

/* Possible states of device */
enum {
@@ -50,6 +51,7 @@ struct loop_device {

struct file * lo_backing_file;
struct block_device *lo_device;
+ struct block_device *fs_bdev;
unsigned lo_blocksize;
void *key_data;

@@ -58,14 +60,21 @@ struct loop_device {
spinlock_t lo_lock;
struct bio *lo_bio;
struct bio *lo_biotail;
+ unsigned int lo_bio_cnt;
int lo_state;
struct mutex lo_ctl_mutex;
struct task_struct *lo_thread;
wait_queue_head_t lo_event;
+ wait_queue_head_t lo_bio_wait;

struct request_queue *lo_queue;
struct gendisk *lo_disk;
struct list_head lo_list;
+
+ struct prio_tree_root prio_root;
+ struct prio_tree_node *last_insert;
+ struct prio_tree_node *last_lookup;
+ unsigned int blkbits;
};

#endif /* __KERNEL__ */
@@ -76,6 +85,7 @@ struct loop_device {
enum {
LO_FLAGS_READ_ONLY = 1,
LO_FLAGS_USE_AOPS = 2,
+ LO_FLAGS_FASTFS = 4,
};

#include <asm/posix_types.h> /* for __kernel_old_dev_t */
@@ -159,5 +169,6 @@ int loop_unregister_transfer(int number);
#define LOOP_SET_STATUS64 0x4C04
#define LOOP_GET_STATUS64 0x4C05
#define LOOP_CHANGE_FD 0x4C06
+#define LOOP_SET_FASTFS 0x4C07

#endif
diff --git a/lib/prio_tree.c b/lib/prio_tree.c
index ccfd850..0f550f7 100644
--- a/lib/prio_tree.c
+++ b/lib/prio_tree.c
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/prio_tree.h>
+#include <linux/module.h>

/*
* A clever mix of heap and radix trees forms a radix priority search tree (PST)
@@ -255,6 +256,7 @@ struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
BUG();
return NULL;
}
+EXPORT_SYMBOL(prio_tree_insert);

/*
* Remove a prio_tree_node @node from a radix priority search tree @root. The
@@ -304,6 +306,7 @@ void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node)
while (cur != node)
cur = prio_tree_replace(root, cur->parent, cur);
}
+EXPORT_SYMBOL(prio_tree_remove);

/*
* Following functions help to enumerate all prio_tree_nodes in the tree that
@@ -482,3 +485,4 @@ repeat:

goto repeat;
}
+EXPORT_SYMBOL(prio_tree_next);

--
Jens Axboe


2008-01-09 09:31:38

by Christoph Hellwig

Subject: Re: [PATCH][RFC] fast file mapping for loop

On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
> - The file block mappings must not change while loop is using the file.
> This means that we have to ensure exclusive access to the file and
> this is the bit that is currently missing in the implementation. It
> would be nice if we could just do this via open(), ideas welcome...

And the way this is done is simply broken. It means you have to get
rid of things like delayed or unwritten extents beforehand, and it'll be
a complete pain for COW or non-block backed filesystems.

The right way to do this is to allow direct I/O from kernel sources
where the filesystem is in-charge of submitting the actual I/O after
the pages are handed to it. I think Peter Zijlstra has been looking
into something like that for swap over nfs.

2008-01-09 09:43:36

by Jens Axboe

Subject: Re: [PATCH][RFC] fast file mapping for loop

On Wed, Jan 09 2008, Christoph Hellwig wrote:
> On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
> > - The file block mappings must not change while loop is using the file.
> > This means that we have to ensure exclusive access to the file and
> > this is the bit that is currently missing in the implementation. It
> > would be nice if we could just do this via open(), ideas welcome...
>
> And the way this is done is simply broken. It means you have to get
> rid of things like delayed or unwritten extents beforehand, and it'll be
> a complete pain for COW or non-block backed filesystems.

COW is not that hard to handle, you just need to be notified of moving
blocks. If you view the patch as just a tighter integration between loop
and fs, I don't think it's necessarily that broken.

I did consider these cases, and it can be done with the existing
approach.

> The right way to do this is to allow direct I/O from kernel sources
> where the filesystem is in-charge of submitting the actual I/O after
> the pages are handed to it. I think Peter Zijlstra has been looking
> into something like that for swap over nfs.

That does sound like a nice approach, but a lot more work. It'll behave
differently too; the advantage of what I proposed is that it behaves
like a real device.

I'm not asking you to love it (in fact I knew some people would complain
about this approach and I understand why), just tossing it out there to
get things rolling. If we end up doing it differently I don't really
care, I'm not married to any solution but merely wish to solve a
problem. If that ends up being solved differently, the outcome is the
same to me.

--
Jens Axboe

2008-01-09 11:01:20

by Chris Mason

Subject: Re: [PATCH][RFC] fast file mapping for loop

On Wed, 9 Jan 2008 10:43:21 +0100
Jens Axboe <[email protected]> wrote:

> On Wed, Jan 09 2008, Christoph Hellwig wrote:
> > On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
> > > - The file block mappings must not change while loop is using the
> > > file. This means that we have to ensure exclusive access to the
> > > file and this is the bit that is currently missing in the
> > > implementation. It would be nice if we could just do this via
> > > open(), ideas welcome...
> >
> > And the way this is done is simply broken. It means you have to get
> > rid of things like delayed or unwritten extents beforehand, and it'll be
> > a complete pain for COW or non-block backed filesystems.
>
> COW is not that hard to handle, you just need to be notified of moving
> blocks. If you view the patch as just a tighter integration between
> loop and fs, I don't think it's necessarily that broken.
>
Filling holes (delayed allocation) and COW are definitely a problem.
But at least for the loop use case, most non-cow filesystems will want
to preallocate the space for the loop file and be done with it. Sparse
loop definitely has uses, but generally those users are willing to give
up a little performance.
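
For illustration, preallocating is just a posix_fallocate() away from
userspace (a sketch; glibc emulates it with writes on filesystems
without a native allocate path):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* preallocate a loop backing file so writes never hit a hole */
	static int preallocate(const char *path, off_t size)
	{
		int fd = open(path, O_RDWR | O_CREAT, 0600);
		int err;

		if (fd < 0)
			return -1;
		err = posix_fallocate(fd, 0, size);	/* 0 or an errno value */
		if (err)
			fprintf(stderr, "%s: %s\n", path, strerror(err));
		close(fd);
		return err ? -1 : 0;
	}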

Jens' patch falls back to buffered writes for the hole case and
pretends cow doesn't exist. It's a good starting point that I hope to
extend with something like the extent_map apis.

> I did consider these cases, and it can be done with the existing
> approach.
>
> > The right way to do this is to allow direct I/O from kernel sources
> > where the filesystem is in-charge of submitting the actual I/O after
> > the pages are handed to it. I think Peter Zijlstra has been looking
> > into something like that for swap over nfs.
>
> That does sound like a nice approach, but a lot more work. It'll
> behave differently too, the advantage of what I proposed is that it
> behaves like a real device.

The problem with O_DIRECT (or even O_SYNC) loop is that every write
into loop becomes synchronous, and it really changes the performance of
things like filemap_fdatawrite.

If we just hand ownership of the file over to loop entirely and prevent
other openers (perhaps even forcing backups through the loop device),
we get fewer corner cases and much better performance.

-chris

2008-01-09 15:34:34

by Andi Kleen

Subject: Re: [PATCH][RFC] fast file mapping for loop

Jens Axboe <[email protected]> writes:
>
> So how does it work? Instead of punting IO to a thread and passing it
> through the page cache, we instead attempt to send the IO directly to the

Great -- something like this was needed for a long time.

> - The file block mappings must not change while loop is using the file.
> This means that we have to ensure exclusive access to the file and
> this is the bit that is currently missing in the implementation. It
> would be nice if we could just do this via open(), ideas welcome...

get_write_access()/put_write_access() will block other writers.

But as pointed out by others that is not enough for this.
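
For reference, the pairing would look something like this (hypothetical
helper names, sketched against the loop_device in Jens' patch; the
put_write_access() would go in loop_clr_fd()):

	/* take a write reference on the backing inode while the mapping is live */
	static int loop_pin_backing_file(struct loop_device *lo)
	{
		struct inode *inode = lo->lo_backing_file->f_mapping->host;

		return get_write_access(inode);
	}

	/* drop it again when the loop device is torn down */
	static void loop_unpin_backing_file(struct loop_device *lo)
	{
		struct inode *inode = lo->lo_backing_file->f_mapping->host;

		put_write_access(inode);
	}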

I suppose you could use a whitelist, like a special flag for file systems
(such as ext2/ext3) that do not reallocate blocks.

-Andi

2008-01-09 23:17:31

by Alasdair G Kergon

Subject: Re: [PATCH][RFC] fast file mapping for loop

Here's the latest version of dm-loop, for comparison.

To try it out,
ln -s dmsetup dmlosetup
and supply similar basic parameters to losetup.
(using dmsetup version 1.02.11 or higher)

Alasdair

From: Bryn Reeves <[email protected]>

This implements a loopback target for device mapper allowing a regular
file to be treated as a block device.

Signed-off-by: Bryn Reeves <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>

---
drivers/md/Kconfig | 9
drivers/md/Makefile | 1
drivers/md/dm-loop.c | 1036 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 1046 insertions(+)

Index: linux-2.6.24-rc6/drivers/md/Kconfig
===================================================================
--- linux-2.6.24-rc6.orig/drivers/md/Kconfig 2008-01-07 18:32:05.000000000 +0000
+++ linux-2.6.24-rc6/drivers/md/Kconfig 2008-01-07 18:33:12.000000000 +0000
@@ -229,6 +229,15 @@ config DM_CRYPT

If unsure, say N.

+config DM_LOOP
+ tristate "Loop target (EXPERIMENTAL)"
+ depends on BLK_DEV_DM && EXPERIMENTAL
+ ---help---
+ This device-mapper target allows you to treat a regular file as
+ a block device.
+
+ If unsure, say N.
+
config DM_SNAPSHOT
tristate "Snapshot target (EXPERIMENTAL)"
depends on BLK_DEV_DM && EXPERIMENTAL
Index: linux-2.6.24-rc6/drivers/md/Makefile
===================================================================
--- linux-2.6.24-rc6.orig/drivers/md/Makefile 2008-01-07 18:32:05.000000000 +0000
+++ linux-2.6.24-rc6/drivers/md/Makefile 2008-01-07 18:33:12.000000000 +0000
@@ -34,6 +34,7 @@ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
+obj-$(CONFIG_DM_LOOP) += dm-loop.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_EMC) += dm-emc.o
obj-$(CONFIG_DM_MULTIPATH_HP) += dm-hp-sw.o
Index: linux-2.6.24-rc6/drivers/md/dm-loop.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.24-rc6/drivers/md/dm-loop.c 2008-01-07 18:33:12.000000000 +0000
@@ -0,0 +1,1036 @@
+/*
+ * Copyright (C) 2006-2008 Red Hat, Inc. All rights reserved.
+ *
+ * This file is part of device-mapper.
+ *
+ * Extent mapping implementation heavily influenced by mm/swapfile.c
+ * Bryn Reeves <[email protected]>
+ *
+ * File mapping and block lookup algorithms support by
+ * Heinz Mauelshagen <[email protected]>.
+ *
+ * This file is released under the GPL.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/vmalloc.h>
+#include <linux/syscalls.h>
+#include <linux/workqueue.h>
+#include <linux/file.h>
+#include <linux/bio.h>
+
+#include "dm.h"
+#include "dm-bio-list.h"
+
+#define DM_LOOP_DAEMON "kloopd"
+#define DM_MSG_PREFIX "loop"
+
+enum flags { DM_LOOP_BMAP, DM_LOOP_FSIO };
+
+/*--------------------------------------------------------------------
+ * Loop context
+ *--------------------------------------------------------------------*/
+
+struct loop_c {
+ unsigned long flags;
+
+ /* Backing store */
+
+ struct file *filp;
+ char *path;
+ loff_t offset;
+ struct block_device *bdev;
+ unsigned blkbits; /* file system block size shift bits */
+
+ loff_t size; /* size of entire file in bytes */
+ loff_t blocks; /* blocks allocated to loop file */
+ sector_t mapped_sectors; /* size of mapped area in sectors */
+
+ int (*map_fn)(struct dm_target *, struct bio *);
+ void *map_data;
+};
+
+/*
+ * Block map extent
+ */
+struct dm_loop_extent {
+ sector_t start; /* start sector in mapped device */
+ sector_t to; /* start sector on target device */
+ sector_t len; /* length in sectors */
+};
+
+/*
+ * Temporary extent list
+ */
+struct extent_list {
+ struct dm_loop_extent *extent;
+ struct list_head list;
+};
+
+static struct kmem_cache *dm_loop_extent_cache;
+
+/*
+ * Block map private context
+ */
+struct block_map_c {
+ int nr_extents; /* number of extents in map */
+ struct dm_loop_extent **map; /* linear map of extent pointers */
+ struct dm_loop_extent **mru; /* pointer to mru entry */
+ spinlock_t mru_lock; /* protects mru */
+};
+
+/*
+ * File map private context
+ */
+struct file_map_c {
+ spinlock_t lock; /* protects in */
+ struct bio_list in; /* new bios for processing */
+ struct bio_list work; /* bios queued for processing */
+ struct workqueue_struct *wq; /* workqueue */
+ struct work_struct ws; /* loop work */
+ struct loop_c *loop; /* for filp & offset */
+};
+
+/*--------------------------------------------------------------------
+ * Generic helpers
+ *--------------------------------------------------------------------*/
+
+static sector_t blk2sect(struct loop_c *lc, blkcnt_t block)
+{
+ return block << (lc->blkbits - SECTOR_SHIFT);
+}
+
+static blkcnt_t sec2blk(struct loop_c *lc, sector_t sector)
+{
+ return sector >> (lc->blkbits - SECTOR_SHIFT);
+}
+
+/*--------------------------------------------------------------------
+ * File I/O helpers
+ *--------------------------------------------------------------------*/
+
+/*
+ * transfer data to/from file using the read/write file_operations.
+ */
+static int fs_io(int rw, struct file *filp, loff_t *pos, struct bio_vec *bv)
+{
+ ssize_t r;
+ void __user *ptr = (void __user __force *) kmap(bv->bv_page) + bv->bv_offset;
+ mm_segment_t old_fs = get_fs();
+
+ set_fs(get_ds());
+ r = (rw == READ) ? filp->f_op->read(filp, ptr, bv->bv_len, pos) :
+ filp->f_op->write(filp, ptr, bv->bv_len, pos);
+ set_fs(old_fs);
+ kunmap(bv->bv_page);
+
+ return (r == bv->bv_len) ? 0 : -EIO;
+}
+
+/*
+ * Handle I/O for one bio
+ */
+static void do_one_bio(struct file_map_c *fc, struct bio *bio)
+{
+ int r = 0, rw = bio_data_dir(bio);
+ loff_t start = (bio->bi_sector << 9) + fc->loop->offset, pos = start;
+ struct bio_vec *bv, *bv_end = bio->bi_io_vec + bio->bi_vcnt;
+
+ for (bv = bio->bi_io_vec; bv < bv_end; bv++) {
+ r = fs_io(rw, fc->loop->filp, &pos, bv);
+ if (r) {
+ DMERR("%s error %d", rw ? "write" : "read", r);
+ break;
+ }
+ }
+
+ bio_endio(bio, r);
+}
+
+/*
+ * Worker thread for a 'file' type loop device
+ */
+static void do_loop_work(struct work_struct *ws)
+{
+ struct file_map_c *fc = container_of(ws, struct file_map_c, ws);
+ struct bio *bio;
+
+ /* quickly grab all new bios queued and add them to the work list */
+ spin_lock_irq(&fc->lock);
+ bio_list_merge(&fc->work, &fc->in);
+ bio_list_init(&fc->in);
+ spin_unlock_irq(&fc->lock);
+
+ /* work the list and do file I/O on all bios */
+ while ((bio = bio_list_pop(&fc->work)))
+ do_one_bio(fc, bio);
+}
+
+/*
+ * Create work queue and initialize work
+ */
+static int loop_work_init(struct loop_c *lc)
+{
+ struct file_map_c *fc = lc->map_data;
+
+ fc->wq = create_singlethread_workqueue(DM_LOOP_DAEMON);
+ if (!fc->wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/*
+ * Destroy work queue
+ */
+static void loop_work_exit(struct loop_c *lc)
+{
+ struct file_map_c *fc = lc->map_data;
+
+ if (fc->wq)
+ destroy_workqueue(fc->wq);
+}
+
+/*
+ * DM_LOOP_FSIO map_fn. Mapping just queues bios to the file map
+ * context and lets the daemon deal with them.
+ */
+static int loop_file_map(struct dm_target *ti, struct bio *bio)
+{
+ int wake;
+ struct loop_c *lc = ti->private;
+ struct file_map_c *fc = lc->map_data;
+
+ spin_lock_irq(&fc->lock);
+ wake = bio_list_empty(&fc->in);
+ bio_list_add(&fc->in, bio);
+ spin_unlock_irq(&fc->lock);
+
+ /*
+ * Only call queue_work() if necessary to avoid
+ * superfluous preempt_{disable/enable}() overhead.
+ */
+ if (wake)
+ queue_work(fc->wq, &fc->ws);
+
+ /* Handling bio - will submit later. */
+ return 0;
+}
+
+/*
+ * Shutdown the workqueue and free a file mapping
+ */
+static void destroy_file_map(struct loop_c *lc)
+{
+ loop_work_exit(lc);
+ kfree(lc->map_data);
+}
+
+/*
+ * Set up a file map context and workqueue
+ */
+static int setup_file_map(struct loop_c *lc)
+{
+ struct file_map_c *fc = kzalloc(sizeof(*fc), GFP_KERNEL);
+
+ if (!fc)
+ return -ENOMEM;
+
+ spin_lock_init(&fc->lock);
+ bio_list_init(&fc->in);
+ bio_list_init(&fc->work);
+ INIT_WORK(&fc->ws, do_loop_work);
+ fc->loop = lc;
+
+ lc->map_data = fc;
+ lc->map_fn = loop_file_map;
+
+ return loop_work_init(lc);
+}
+
+/*--------------------------------------------------------------------
+ * Block I/O helpers
+ *--------------------------------------------------------------------*/
+
+static int contains_sector(struct dm_loop_extent *e, sector_t s)
+{
+ if (likely(e))
+ return s < (e->start + (e->len)) && s >= e->start;
+
+ return 0;
+}
+
+/*
+ * Walk over a linked list of extent_list structures, freeing them as
+ * we go. Does not free el->extent.
+ */
+static void destroy_extent_list(struct list_head *head)
+{
+ struct list_head *curr, *n;
+ struct extent_list *el;
+
+ if (list_empty(head))
+ return;
+
+ list_for_each_safe(curr, n, head) {
+ el = list_entry(curr, struct extent_list, list);
+ list_del(curr);
+ kfree(el);
+ }
+}
+
+/*
+ * Add a new extent to the tail of the list at *head with
+ * start/to/len parameters. Allocates from the extent cache.
+ */
+static int list_add_extent(struct list_head *head, sector_t start,
+ sector_t to, sector_t len)
+{
+ struct dm_loop_extent *extent;
+ struct extent_list *list;
+
+ extent = kmem_cache_alloc(dm_loop_extent_cache, GFP_KERNEL);
+ if (!extent)
+ goto out;
+
+ list = kmalloc(sizeof(*list), GFP_KERNEL);
+ if (!list)
+ goto out;
+
+ extent->start = start;
+ extent->to = to;
+ extent->len = len;
+
+ list->extent = extent;
+ list_add_tail(&list->list, head);
+
+ return 0;
+out:
+ if (extent)
+ kmem_cache_free(dm_loop_extent_cache, extent);
+ return -ENOMEM;
+}
+
+/*
+ * Return an extent range (i.e. beginning and ending physical block numbers).
+ */
+static int extent_range(struct inode *inode,
+ blkcnt_t logical_blk, blkcnt_t last_blk,
+ blkcnt_t *begin_blk, blkcnt_t *end_blk)
+{
+ sector_t dist = 0, phys_blk, probe_blk = logical_blk;
+
+ /* Find beginning physical block of extent starting at logical_blk. */
+ *begin_blk = phys_blk = bmap(inode, probe_blk);
+ if (!phys_blk)
+ return -ENXIO;
+
+ for (; phys_blk == *begin_blk + dist; dist++) {
+ *end_blk = phys_blk;
+ if (++probe_blk > last_blk)
+ break;
+
+ phys_blk = bmap(inode, probe_blk);
+ if (unlikely(!phys_blk))
+ return -ENXIO;
+ }
+
+ return 0;
+}
+
+/*
+ * Create a sequential list of extents from an inode and return
+ * it in *head. On success return the number of extents found or
+ * -ERRNO on failure.
+ */
+static int loop_extents(struct loop_c *lc, struct inode *inode,
+ struct list_head *head)
+{
+ sector_t start = 0;
+ int r, nr_extents = 0;
+ blkcnt_t nr_blks = 0, begin_blk = 0, end_blk = 0;
+ blkcnt_t after_last_blk = sec2blk(lc,
+ (lc->mapped_sectors + (lc->offset >> 9)));
+ blkcnt_t logical_blk = sec2blk(lc, (lc->offset >> 9));
+
+ /* for each block in the mapped region */
+ while (logical_blk < after_last_blk) {
+ r = extent_range(inode, logical_blk, after_last_blk - 1,
+ &begin_blk, &end_blk);
+
+ /* sparse file fallback */
+ if (unlikely(r)) {
+ DMWARN("%s has a hole; sparse file detected - "
+ "switching to filesystem I/O", lc->path);
+ clear_bit(DM_LOOP_BMAP, &lc->flags);
+ set_bit(DM_LOOP_FSIO, &lc->flags);
+ return r;
+ }
+
+ nr_blks = 1 + end_blk - begin_blk;
+
+ if (unlikely(!nr_blks))
+ continue;
+
+ r = list_add_extent(head, start, blk2sect(lc, begin_blk),
+ blk2sect(lc, nr_blks));
+ if (unlikely(r))
+ return r;
+
+ /* advance to next extent */
+ nr_extents++;
+ start += blk2sect(lc, nr_blks);
+ logical_blk += nr_blks;
+ }
+
+ return nr_extents;
+}
+
+/*
+ * Walk over the extents in a block_map_c, returning them to the cache and
+ * freeing bc->map and bc.
+ */
+static void destroy_block_map(struct block_map_c *bc)
+{
+ unsigned i;
+
+ if (!bc)
+ return;
+
+ for (i = 0; i < bc->nr_extents; i++)
+ kmem_cache_free(dm_loop_extent_cache, bc->map[i]);
+
+ DMDEBUG("destroying block map of %d entries", i);
+
+ vfree(bc->map);
+ kfree(bc);
+}
+
+/*
+ * Find an extent in *bc using binary search. Returns a pointer into the
+ * extent map. Calculate index as (extent - bc->map).
+ */
+static struct dm_loop_extent **extent_binary_lookup(struct block_map_c *bc,
+ struct dm_loop_extent **extent_mru, sector_t sector)
+{
+ unsigned nr_extents = bc->nr_extents;
+ unsigned delta, dist, prev_dist = 0;
+ struct dm_loop_extent **eptr;
+
+ /* Optimize lookup range based on MRU extent. */
+ dist = extent_mru - bc->map;
+ if ((*extent_mru)->start >= sector)
+ delta = dist = dist / 2;
+ else {
+ delta = (nr_extents - dist) / 2;
+ dist += delta;
+ }
+
+ eptr = bc->map + dist;
+ while (*eptr && !contains_sector(*eptr, sector)) {
+ if (sector >= (*eptr)->start + (*eptr)->len) {
+ prev_dist = dist;
+ if (delta > 1)
+ delta /= 2;
+ dist += delta;
+ } else {
+ delta = (dist - prev_dist) / 2;
+ if (!delta)
+ delta = 1;
+ dist -= delta;
+ }
+ eptr = bc->map + dist;
+ }
+
+ return eptr;
+}
+
+/*
+ * Lookup an extent for a sector using the mru cache and binary search.
+ */
+static struct dm_loop_extent *extent_lookup(struct block_map_c *bc, sector_t sector)
+{
+ struct dm_loop_extent **eptr;
+
+ spin_lock_irq(&bc->mru_lock);
+ eptr = bc->mru;
+ spin_unlock_irq(&bc->mru_lock);
+
+ if (contains_sector(*eptr, sector))
+ return *eptr;
+
+ eptr = extent_binary_lookup(bc, eptr, sector);
+ if (!eptr)
+ return NULL;
+
+ spin_lock_irq(&bc->mru_lock);
+ bc->mru = eptr;
+ spin_unlock_irq(&bc->mru_lock);
+
+ return *eptr;
+}
+
+/*
+ * DM_LOOP_BMAP map_fn. Looks up the sector in the extent map and
+ * rewrites the bio device and bi_sector fields.
+ */
+static int loop_block_map(struct dm_target *ti, struct bio *bio)
+{
+ struct loop_c *lc = ti->private;
+ struct dm_loop_extent *extent = extent_lookup(lc->map_data, bio->bi_sector);
+
+ if (likely(extent)) {
+ bio->bi_bdev = lc->bdev;
+ bio->bi_sector = extent->to + (bio->bi_sector - extent->start);
+ return 1; /* Done with bio -> submit */
+ }
+
+ DMERR("no matching extent in map for sector %llu",
+ (unsigned long long) bio->bi_sector + ti->begin);
+ BUG();
+
+ return -EIO;
+}
+
+/*
+ * Turn an extent_list into a linear pointer map of nr_extents + 1 entries
+ * and set the final entry to NULL.
+ */
+static struct dm_loop_extent **build_extent_map(struct list_head *head, int nr_extents,
+ unsigned long *flags)
+{
+ unsigned map_size, cache_size;
+ struct dm_loop_extent **map, **curr;
+ struct list_head *pos;
+ struct extent_list *el;
+
+ map_size = sizeof(*map) * (nr_extents + 1);
+ cache_size = kmem_cache_size(dm_loop_extent_cache) * nr_extents;
+
+ map = vmalloc(map_size);
+ if (!map)
+ return ERR_PTR(-ENOMEM);
+ curr = map;
+
+ DMDEBUG("allocated extent map of %u %s for %d extents (%u %s)",
+ (map_size < 8192) ? map_size : map_size >> 10,
+ (map_size < 8192) ? "bytes" : "kilobytes", nr_extents,
+ (cache_size < 8192) ? cache_size : cache_size >> 10,
+ (cache_size < 8192) ? "bytes" : "kilobytes");
+
+ list_for_each(pos, head) {
+ el = list_entry(pos, struct extent_list, list);
+ *(curr++) = el->extent;
+ }
+ *curr = NULL;
+
+ return map;
+}
+
+/*
+ * Set up a block map context and extent map
+ */
+static int setup_block_map(struct loop_c *lc, struct inode *inode)
+{
+ int r, nr_extents;
+ struct block_map_c *bc;
+ LIST_HEAD(head);
+
+ if (!inode || !inode->i_sb || !inode->i_sb->s_bdev)
+ return -ENXIO;
+
+ /* build a linked list of extents in linear order */
+ r = nr_extents = loop_extents(lc, inode, &head);
+ if (nr_extents < 1)
+ goto out;
+
+ r = -ENOMEM;
+ bc = kzalloc(sizeof(*bc), GFP_KERNEL);
+ if (!bc)
+ goto out;
+
+ /* create a linear map of pointers into the extent cache */
+ bc->map = build_extent_map(&head, nr_extents, &lc->flags);
+ destroy_extent_list(&head);
+
+ if (IS_ERR(bc->map)) {
+ r = PTR_ERR(bc->map);
+ goto out;
+ }
+
+ spin_lock_init(&bc->mru_lock);
+ bc->mru = bc->map;
+ bc->nr_extents = nr_extents;
+ lc->bdev = inode->i_sb->s_bdev;
+ lc->map_data = bc;
+ lc->map_fn = loop_block_map;
+
+ return 0;
+
+out:
+ return r;
+}
+
+/*--------------------------------------------------------------------
+ * Generic helpers
+ *--------------------------------------------------------------------*/
+
+/*
+ * Invalidate all unlocked loop file pages
+ */
+static int loop_invalidate_file(struct file *filp)
+{
+ int r;
+
+ /* Same as generic_file_direct_IO() */
+ unmap_mapping_range(filp->f_mapping, 0, ~0UL, 0);
+
+ r = filemap_write_and_wait(filp->f_mapping);
+ if (r)
+ return r;
+
+ /*
+ * This will remove all pages except dirty ones.
+ * If there are dirty pages at this point, it means that the user
+ * is writing to the file and the coherency is lost anyway.
+ * If the user was writing to the file simultaneously, this
+ * returns non-zero, but we ignore that.
+ */
+ invalidate_inode_pages2_range(filp->f_mapping, 0, ~0UL);
+
+ return 0;
+}
+
+/*
+ * Acquire or release a "no-truncate" lock on *filp.
+ * We overload the S_SWAPFILE flag for loop targets because
+ * it provides the same no-truncate semantics we require, and
+ * holding onto i_sem is no longer an option.
+ */
+static void file_truncate_lock(struct file *filp)
+{
+ struct inode *inode = filp->f_mapping->host;
+
+ mutex_lock(&inode->i_mutex);
+ inode->i_flags |= S_SWAPFILE;
+ mutex_unlock(&inode->i_mutex);
+}
+
+static void file_truncate_unlock(struct file *filp)
+{
+ struct inode *inode = filp->f_mapping->host;
+
+ mutex_lock(&inode->i_mutex);
+ inode->i_flags &= ~S_SWAPFILE;
+ mutex_unlock(&inode->i_mutex);
+}
+
+/*
+ * Fill out split_io for target backing store
+ */
+static void set_split_io(struct dm_target *ti)
+{
+ struct loop_c *lc = ti->private;
+
+ if (test_bit(DM_LOOP_BMAP, &lc->flags))
+ /* Split I/O at block boundaries */
+ ti->split_io = 1 << (lc->blkbits - SECTOR_SHIFT);
+ else
+ ti->split_io = 64;
+
+ DMDEBUG("splitting io at %llu sector boundaries",
+ (unsigned long long) ti->split_io);
+}
+
+/*
+ * Check that the loop file is regular and available.
+ */
+static int loop_check_file(struct dm_target *ti)
+{
+ struct loop_c *lc = ti->private;
+ struct file *filp = lc->filp;
+ struct inode *inode = filp->f_mapping->host;
+
+ if (!inode)
+ return -ENXIO;
+
+ ti->error = "backing file must be a regular file";
+ if (!S_ISREG(inode->i_mode))
+ return -EINVAL;
+
+ ti->error = "backing file is mapped into userspace for writing";
+ if (mapping_writably_mapped(filp->f_mapping))
+ return -EBUSY;
+
+ if (mapping_mapped(filp->f_mapping))
+ DMWARN("%s is mapped into userspace", lc->path);
+
+ if (!inode->i_sb || !inode->i_sb->s_bdev) {
+ DMWARN("%s has no blockdevice - switching to filesystem I/O",
+ lc->path);
+ clear_bit(DM_LOOP_BMAP, &lc->flags);
+ set_bit(DM_LOOP_FSIO, &lc->flags);
+ }
+
+ ti->error = "backing file already in use";
+ if (IS_SWAPFILE(inode))
+ return -EBUSY;
+
+ return 0;
+}
+
+/*
+ * Check loop file size and store it in the loop context
+ */
+static int loop_setup_size(struct dm_target *ti)
+{
+ struct loop_c *lc = ti->private;
+ struct inode *inode = lc->filp->f_mapping->host;
+ int r = -EINVAL;
+
+ lc->size = i_size_read(inode);
+ lc->blkbits = inode->i_blkbits;
+
+ ti->error = "backing file is empty";
+ if (!lc->size)
+ goto out;
+
+ DMDEBUG("set backing file size to %llu", (unsigned long long) lc->size);
+
+ ti->error = "backing file cannot be less than one block in size";
+ if (lc->size < (blk2sect(lc, 1) << 9))
+ goto out;
+
+ ti->error = "loop file offset must be a multiple of fs blocksize";
+ if (lc->offset & ((1 << lc->blkbits) - 1))
+ goto out;
+
+ ti->error = "loop file offset too large";
+ if (lc->offset > (lc->size - (1 << 9)))
+ goto out;
+
+ lc->mapped_sectors = (lc->size - lc->offset) >> 9;
+ DMDEBUG("set mapped sectors to %llu (%llu bytes)",
+ (unsigned long long) lc->mapped_sectors,
+ (lc->size - lc->offset));
+
+ if ((lc->offset + (lc->mapped_sectors << 9)) < lc->size)
+ DMWARN("not using %llu bytes in incomplete block at EOF",
+ lc->size - (lc->offset + (lc->mapped_sectors << 9)));
+
+ ti->error = "mapped region cannot be smaller than target size";
+ if (lc->size - lc->offset < (ti->len << 9))
+ goto out;
+
+ r = 0;
+
+out:
+ return r;
+}
+
+/*
+ * release a loop file
+ */
+static void loop_put_file(struct file *filp)
+{
+ if (!filp)
+ return;
+
+ file_truncate_unlock(filp);
+ filp_close(filp, NULL);
+}
+
+/*
+ * Open loop file and perform type, availability and size checks.
+ */
+static int loop_get_file(struct dm_target *ti)
+{
+ int flags = ((dm_table_get_mode(ti->table) & FMODE_WRITE) ?
+ O_RDWR : O_RDONLY) | O_LARGEFILE;
+ struct loop_c *lc = ti->private;
+ struct file *filp;
+ int r = 0;
+
+ ti->error = "could not open backing file";
+ filp = filp_open(lc->path, flags, 0);
+ if (IS_ERR(filp))
+ return PTR_ERR(filp);
+ lc->filp = filp;
+ r = loop_check_file(ti);
+ if (r)
+ goto err;
+
+ r = loop_setup_size(ti);
+ if (r)
+ goto err;
+
+ file_truncate_lock(filp);
+ return 0;
+
+err:
+ fput(filp);
+ return r;
+}
+
+/*
+ * invalidate mapped pages belonging to the loop file
+ */
+static void loop_flush(struct dm_target *ti)
+{
+ struct loop_c *lc = ti->private;
+
+ loop_invalidate_file(lc->filp);
+}
+
+/*--------------------------------------------------------------------
+ * Device-mapper target methods
+ *--------------------------------------------------------------------*/
+
+/*
+ * Generic loop map function. Re-base I/O to target begin and submit
+ */
+static int loop_map(struct dm_target *ti, struct bio *bio,
+ union map_info *context)
+{
+ struct loop_c *lc = ti->private;
+
+ if (unlikely(bio_barrier(bio)))
+ return -EOPNOTSUPP;
+
+ bio->bi_sector -= ti->begin;
+
+ if (lc->map_fn)
+ return lc->map_fn(ti, bio);
+
+ return -EIO;
+}
+
+/*
+ * File status helper
+ */
+static ssize_t loop_file_status(struct loop_c *lc, char *result,
+ unsigned maxlen)
+{
+ ssize_t sz = 0;
+ struct file_map_c *fc = lc->map_data;
+ int qlen;
+
+ spin_lock_irq(&fc->lock);
+ qlen = bio_list_size(&fc->work);
+ qlen += bio_list_size(&fc->in);
+ spin_unlock_irq(&fc->lock);
+
+ DMEMIT("file %d", qlen);
+
+ return sz;
+}
+
+/*
+ * File status helper
+ */
+static ssize_t loop_block_status(struct loop_c *lc, char *result,
+ unsigned maxlen)
+{
+ ssize_t sz = 0;
+ struct block_map_c *bc = lc->map_data;
+ int mru;
+
+ spin_lock_irq(&bc->mru_lock);
+ mru = bc->mru - bc->map;
+ spin_unlock_irq(&bc->mru_lock);
+
+ DMEMIT("block %d %d", bc->nr_extents, mru);
+
+ return sz;
+}
+
+/*
+ * This needs some thought on handling unlinked backing files. Some parts of
+ * the kernel return a cached name (now invalid), while others return a dcache
+ * "/path/to/foo (deleted)" name (never was/is valid). Which is "better" is
+ * debatable.
+ *
+ * On the one hand, using a cached name gives table output which is directly
+ * usable assuming the user re-creates the unlinked image file, on the other
+ * it is more consistent with e.g. swap to use the dcache name.
+ *
+ */
+static int loop_status(struct dm_target *ti, status_type_t type, char *result,
+ unsigned maxlen)
+{
+ struct loop_c *lc = ti->private;
+ ssize_t sz = 0;
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ if (test_bit(DM_LOOP_BMAP, &lc->flags))
+ sz += loop_block_status(lc, result, maxlen - sz);
+ else if (test_bit(DM_LOOP_FSIO, &lc->flags))
+ sz += loop_file_status(lc, result, maxlen - sz);
+ break;
+
+ case STATUSTYPE_TABLE:
+ DMEMIT("%s %llu", lc->path, lc->offset);
+ break;
+ }
+ return 0;
+}
+
+/*
+ * Destroy a loopback mapping
+ */
+static void loop_dtr(struct dm_target *ti)
+{
+ struct loop_c *lc = ti->private;
+
+ if ((dm_table_get_mode(ti->table) & FMODE_WRITE))
+ loop_invalidate_file(lc->filp);
+
+ if (test_bit(DM_LOOP_BMAP, &lc->flags) && lc->map_data)
+ destroy_block_map((struct block_map_c *)lc->map_data);
+ if (test_bit(DM_LOOP_FSIO, &lc->flags) && lc->map_data)
+ destroy_file_map(lc);
+
+ loop_put_file(lc->filp);
+ DMINFO("released file %s", lc->path);
+
+ kfree(lc);
+}
+
+/*
+ * Construct a loopback mapping: <path> <offset>
+ */
+static int loop_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+ struct loop_c *lc = NULL;
+ int r = -EINVAL;
+
+ ti->error = "invalid argument count";
+ if (argc != 2)
+ goto err;
+
+ r = -ENOMEM;
+ ti->error = "cannot allocate loop context";
+ lc = kzalloc(sizeof(*lc), GFP_KERNEL);
+ if (!lc)
+ goto err;
+
+ /* default */
+ set_bit(DM_LOOP_BMAP, &lc->flags);
+ ti->error = "cannot allocate loop path";
+ lc->path = kstrdup(argv[0], GFP_KERNEL);
+ if (!lc->path)
+ goto err;
+
+ ti->private = lc;
+
+ r = -EINVAL;
+ ti->error = "invalid file offset";
+ if (sscanf(argv[1], "%lld", &lc->offset) != 1)
+ goto err;
+
+ if (lc->offset)
+ DMDEBUG("setting file offset to %lld", lc->offset);
+
+ /* open & check file and set size parameters */
+ r = loop_get_file(ti);
+
+ /* ti->error has been set by loop_get_file */
+ if (r)
+ goto err;
+
+ ti->error = "could not create loop mapping";
+ if (test_bit(DM_LOOP_BMAP, &lc->flags))
+ r = setup_block_map(lc, lc->filp->f_mapping->host);
+ if (test_bit(DM_LOOP_FSIO, &lc->flags))
+ r = setup_file_map(lc);
+
+ if (r)
+ goto err_putf;
+
+ loop_invalidate_file(lc->filp);
+
+ set_split_io(ti);
+ if (lc->bdev)
+ dm_set_device_limits(ti, lc->bdev);
+
+ DMDEBUG("constructed loop target on %s "
+ "(%lldk, %llu sectors)", lc->path,
+ (lc->size >> 10), (unsigned long long)lc->mapped_sectors);
+ ti->error = NULL;
+
+ return 0;
+
+err_putf:
+ loop_put_file(lc->filp);
+err:
+ kfree(lc);
+ return r;
+}
+
+static struct target_type loop_target = {
+ .name = "loop",
+ .version = {0, 0, 2},
+ .module = THIS_MODULE,
+ .ctr = loop_ctr,
+ .dtr = loop_dtr,
+ .map = loop_map,
+ .presuspend = loop_flush,
+ .flush = loop_flush,
+ .status = loop_status,
+};
+
+/*--------------------------------------------------------------------
+ * Module bits
+ *--------------------------------------------------------------------*/
+static int __init dm_loop_init(void)
+{
+ int r;
+
+ /*
+ * Create the extent cache before registering the target, so the
+ * error path never leaves a registered target behind.
+ */
+ r = -ENOMEM;
+ dm_loop_extent_cache = KMEM_CACHE(dm_loop_extent, SLAB_HWCACHE_ALIGN);
+ if (!dm_loop_extent_cache)
+ goto err;
+
+ r = dm_register_target(&loop_target);
+ if (r < 0) {
+ DMERR("register failed %d", r);
+ goto err;
+ }
+
+ DMINFO("version %u.%u.%u loaded",
+ loop_target.version[0], loop_target.version[1],
+ loop_target.version[2]);
+
+ return 0;
+
+err:
+ if (dm_loop_extent_cache)
+ kmem_cache_destroy(dm_loop_extent_cache);
+
+ return r;
+}
+
+static void __exit dm_loop_exit(void)
+{
+ int r;
+
+ r = dm_unregister_target(&loop_target);
+ kmem_cache_destroy(dm_loop_extent_cache);
+
+ if (r < 0)
+ DMERR("target unregister failed %d", r);
+ else
+ DMINFO("version %u.%u.%u unloaded",
+ loop_target.version[0], loop_target.version[1],
+ loop_target.version[2]);
+}
+
+module_init(dm_loop_init);
+module_exit(dm_loop_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Bryn Reeves <[email protected]>");
+MODULE_DESCRIPTION("device-mapper loop target");
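
For reference, the constructor above takes just "<path> <offset>", so
without the dmlosetup symlink mentioned below the target can also be
driven with raw dmsetup along these lines - a sketch, with example names
and a sector count that must match the backing file:

  # back a 1GiB image with a dm-loop device at offset 0
  # (table line: <start> <len> <target> <path> <offset>, in 512b sectors)
  echo "0 2097152 loop /var/images/disk.img 0" | dmsetup create loop0

  # STATUSTYPE_INFO output is "block <nr_extents> <mru>" in the default
  # block-map mode, or "file <queued bios>" in fsio mode
  dmsetup status loop0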

2008-01-09 23:43:32

by Roland

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

oh, nice to see that this is still alive.

i tried this around half a year ago because i needed more than 256 loop devices and iirc, it was working quite fine.
at least i got no crashes and was able to mount and access more than 300 iso-images with that.

shortly after, the loop device was extended to handle a larger number of loop devices and i went that way because dm-loop was not in mainline.

i have taken a look at the wiki at http://sources.redhat.com/lvm2/wiki/DMLoop from time to time, but there didn't seem to be much happening.

were there fixes/changes since then?

do you think it's ready for mainline?



>Here's the latest version of dm-loop, for comparison.
>
>To try it out,
> ln -s dmsetup dmlosetup
>and supply similar basic parameters to losetup.
>(using dmsetup version 1.02.11 or higher)
>
>Alasdair
>
>From: Bryn Reeves <[email protected]>
>
>This implements a loopback target for device mapper allowing a regular
>file to be treated as a block device.
>
>Signed-off-by: Bryn Reeves <[email protected]>
>Signed-off-by: Alasdair G Kergon <[email protected]>


2008-01-09 23:54:22

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10, 2008 at 12:43:19AM +0100, [email protected] wrote:
> oh, nice to see that this is still alive.

> at least i got no crashes and was able to mount and acess more than 300 iso-images with that.

> were there fixes/chances since then?

Little has changed for some time - mostly code cleanups and the occasional bug fix.

It's time to give it wider exposure, I think, and we'll find out how well it holds up.

Alasdair
--
[email protected]

2008-01-10 01:42:59

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Wednesday 09 January 2008 19:52, Jens Axboe wrote:

> So how does it work? Instead of punting IO to a thread and passing it
> through the page cache, we instead attempt to send the IO directly to the
> filesystem block that it maps to.

You told Christoph that just using direct-IO from kernel still doesn't
give you the required behaviour... What about queueing the IO directly
*and* using direct-IO? I guess it still has to go through the underlying
filesystem, but that's probably a good thing.

> loop maintains a prio tree of known
> extents in the file (populated lazily on demand, as needed).

Just a quick question (I haven't looked closely at the code): how come
you are using a prio tree for extents? I don't think they could be
overlapping?

2008-01-10 08:31:43

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> Here's the latest version of dm-loop, for comparison.
>
> To try it out,
> ln -s dmsetup dmlosetup
> and supply similar basic parameters to losetup.
> (using dmsetup version 1.02.11 or higher)

Why oh why does dm always insist on reinventing everything? That's bad
enough in itself, but on top of that most of the extra stuff ends up
being essentially unmaintained.

If we instead improve loop, everyone wins.

Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit
outside your own sandbox.

--
Jens Axboe

2008-01-10 08:35:04

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10 2008, Nick Piggin wrote:
> On Wednesday 09 January 2008 19:52, Jens Axboe wrote:
>
> > So how does it work? Instead of punting IO to a thread and passing it
> > through the page cache, we instead attempt to send the IO directly to the
> > filesystem block that it maps to.
>
> You told Christoph that just using direct-IO from kernel still doesn't
> give you the required behaviour... What about queueing the IO directly
> *and* using direct-IO? I guess it still has to go through the underlying
> filesystem, but that's probably a good thing.

If it was O_DIRECT and aio, then we would be close.

> > loop maintains a prio tree of known
> > extents in the file (populated lazily on demand, as needed).
>
> Just a quick question (I haven't looked closely at the code): how come
> you are using a prio tree for extents? I don't think they could be
> overlapping?

Because I'm really lazy - the core of this was basically first written
as a quick hack and then I always go shopping for reusable data
structures. prio trees fit the bill nicely: they describe extents and
allow lookup with a key anywhere in that extent. You are right in that
I don't need the overlap handling at all, and Chris already tried to
talk me into reusing his btrfs extent code :-)

So I may just do the latter, turning it into a lib/extent-map.c in the
longer run. My first priority was just having something that worked so I
could test it. At the end of the day, not a single soul would ever
notice if the prio tree ended up being slightly slower than a custom
solution.
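
To make that concrete, what gets stored per extent is roughly this
(illustrative only, not the actual structures from the patch):

struct extent {
	sector_t file_start;	/* where the extent begins, in loop sectors */
	sector_t nr_sectors;	/* length of the extent */
	sector_t disk_start;	/* corresponding start on the backing device */
};

/*
 * Any sector in [file_start, file_start + nr_sectors) must find this
 * extent. The prio tree gives that for free; since file extents never
 * overlap, a plain sorted tree with a "largest file_start <= key"
 * lookup would work just as well.
 */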

--
Jens Axboe

2008-01-10 08:38:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10, 2008 at 12:42:25PM +1100, Nick Piggin wrote:
> > So how does it work? Instead of punting IO to a thread and passing it
> > through the page cache, we instead attempt to send the IO directly to the
> > filesystem block that it maps to.
>
> You told Christoph that just using direct-IO from kernel still doesn't
> give you the required behaviour... What about queueing the IO directly
> *and* using direct-IO? I guess it still has to go through the underlying
> filesystem, but that's probably a good thing.

We definitely need to go through the filesystem for I/O submission,
and also for I/O completion. Thinking about it, the async submission might be
what Peter actually implemented for his network swapping patches, as
you really wouldn't want to write it out synchronously.


Peter, any chance you could chime in here?
>
> > loop maintains a prio tree of known
> > extents in the file (populated lazily on demand, as needed).
>
> Just a quick question (I haven't looked closely at the code): how come
> you are using a prio tree for extents? I don't think they could be
> overlapping?

IMHO this shouldn't be done in the loop driver anyway. Filesystems have
their own efficient extent lookup trees (well, at least xfs and btrfs
do), and we should leverage that instead of reinventing it.

2008-01-10 08:42:19

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10 2008, Jens Axboe wrote:
> On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > Here's the latest version of dm-loop, for comparison.
> >
> > To try it out,
> > ln -s dmsetup dmlosetup
> > and supply similar basic parameters to losetup.
> > (using dmsetup version 1.02.11 or higher)
>
> Why oh why does dm always insist on reinventing everything? That's bad
> enough in itself, but on top of that most of the extra stuff ends up
> being essentially unmaintained.
>
> If we instead improve loop, everyone wins.
>
> Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit
> outside your own sandbox.

So I looked at the code - it seems you build a full extent map of the blocks
in the file, filling holes as you go along. I initially did that as well,
but that is too slow to be usable in real life.

You also don't support sparse files, falling back to normal fs
read/write paths. Supporting sparse files properly is a must, people
generally don't want to prealloc a huge disk backing.
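
Proper hole handling splits in two: reads of a hole can be completed
immediately with zeroes and never touch the fs, while writes are the hard
part because blocks must be allocated first. The read side is about as
simple as this sketch (extent_is_hole() is a made-up helper here;
zero_fill_bio() is the existing block layer primitive, and bio_endio()
is shown in its newer two-argument form):

	if (bio_data_dir(bio) == READ && extent_is_hole(extent)) {
		zero_fill_bio(bio);	/* holes read as all zeroes */
		bio_endio(bio, 0);	/* complete without fs involvement */
		return 0;
	}
	/* a write over a hole must punt to the fs to allocate blocks */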

--
Jens Axboe

2008-01-10 08:43:49

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Wed, Jan 09 2008, Andi Kleen wrote:
> Jens Axboe <[email protected]> writes:
> >
> > So how does it work? Instead of punting IO to a thread and passing it
> > through the page cache, we instead attempt to send the IO directly to the
>
> Great -- something like this was needed for a long time.
>
> > - The file block mappings must not change while loop is using the file.
> > This means that we have to ensure exclusive access to the file and
> > this is the bit that is currently missing in the implementation. It
> > would be nice if we could just do this via open(), ideas welcome...
>
> get_write_access()/put_write_access() will block other writers.
>
> But as pointed out by others that is not enough for this.

Yeah, basically allowing O_RDONLY | O_DIRECT opens should be ok, but we
can't allow writes and we can't allow page cache to exist for this file
outside of loop.

> I suppose you could use a white list like a special flag for file systems
> (like ext2/ext3) that do not reallocate blocks.

Irk, but yeah we probably need something like that for now until Chris
proposes his API addition.

--
Jens Axboe

2008-01-10 08:45:19

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10 2008, Christoph Hellwig wrote:
> > > loop maintains a prio tree of known
> > > extents in the file (populated lazily on demand, as needed).
> >
> > Just a quick question (I haven't looked closely at the code): how come
> > you are using a prio tree for extents? I don't think they could be
> > overlapping?
>
> IMHO this shouldn't be done in the loop driver anyway. Filesystems have
> their own efficient extent lookup trees (well, at least xfs and btrfs
> do), and we should leverage that instead of reinventing it.

Completely agree, it's just needed right now for this solution since all
we have is a crappy bmap() interface to get at those mappings.

--
Jens Axboe

2008-01-10 08:55:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > IMHO this shouldn't be done in the loop driver anyway. Filesystems have
> > their own efficient extent lookup trees (well, at least xfs and btrfs
> > do), and we should leverage that instead of reinventing it.
>
> Completely agree, it's just needed right now for this solution since all
> we have is a crappy bmap() interface to get at those mappings.

So let's fix the interface instead of piling crap on top of it. As I
said, I think Peter has something to start with, so let's beat on it
until we have something suitable. If we aren't done by end of Feb
I'm happy to host a hackfest to get it sorted around the fs/storage
summit..

2008-01-10 09:01:39

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10 2008, Christoph Hellwig wrote:
> On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > > IMHO this shouldn't be done in the loop driver anyway. Filesystems have
> > > their own efficient extent lookup trees (well, at least xfs and btrfs
> > > do), and we should leverage that instead of reinventing it.
> >
> > Completely agree, it's just needed right now for this solution since all
> > we have is a crappy bmap() interface to get at those mappings.
>
> So let's fix the interface instead of piling crap on top of it. As I
> said, I think Peter has something to start with, so let's beat on it
> until we have something suitable.

Sure, I'm all for doing it the Right Way. I wasn't aware of anything
Peter was doing in this area, so let's please see it.

It's not like opportunities to improve this haven't been around. My plan
was/is to convert to using the get_block() tricks of O_DIRECT; one could
easily argue that the work should have been done then. And perhaps
direct-io.c wouldn't be such a steaming pile of crap if it had been
done, and loop would already be fine since we could have tapped into
that.

> If we aren't done by end of Feb I'm happy to host a hackfest to get it
> sorted around the fs/storage summit..

Count me in :)

--
Jens Axboe

2008-01-10 09:37:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop


On Thu, 2008-01-10 at 08:37 +0000, Christoph Hellwig wrote:

> Peter, any chance you could chime in here?

I have this patch to add swap_out/_in methods. I expect we can loosen
the requirement for swapcache pages and change the name a little.

previously posted here:
http://lkml.org/lkml/2007/5/4/143

---
Subject: mm: add support for non block device backed swap files

New address_space_operations methods are added:
int swapfile(struct address_space *, int);
int swap_out(struct file *, struct page *, struct writeback_control *);
int swap_in(struct file *, struct page *);

When during sys_swapon() the swapfile() method is found and returns no error
the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
make use of swap_{out,in}() to write/read swapcache pages.

The swapfile method will be used to communicate to the address_space that the
VM relies on it, and the address_space should take adequate measures (like
reserving memory for mempools or the like).

This new interface can be used to obviate the need for ->bmap in the swapfile
code. A filesystem would need to load (and maybe even allocate) the full block
map for a file into memory and pin it there on ->swapfile(,1) so that
->swap_{out,in}() have instant access to it. It can be released on
->swapfile(,0).

The reason to provide ->swap_{out,in}() over using {write,read}page() is to
1) make a distinction between swapcache and pagecache pages, and
2) provide a struct file * for credential context (normally not needed
in the context of writepage, as the page content is normally dirtied
using one of the following interfaces:
write_{begin,end}()
{prepare,commit}_write()
page_mkwrite()
all of which do have the file context).

Signed-off-by: Peter Zijlstra <[email protected]>
---
Documentation/filesystems/Locking | 19 ++++++++++++
Documentation/filesystems/vfs.txt | 17 +++++++++++
include/linux/buffer_head.h | 2 -
include/linux/fs.h | 8 +++++
include/linux/swap.h | 3 +
mm/Kconfig | 3 +
mm/page_io.c | 58 ++++++++++++++++++++++++++++++++++++++
mm/swap_state.c | 5 +++
mm/swapfile.c | 22 +++++++++++++-
9 files changed, 135 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -164,6 +164,7 @@ enum {
SWP_USED = (1 << 0), /* is slot in swap_info[] used? */
SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */
SWP_ACTIVE = (SWP_USED | SWP_WRITEOK),
+ SWP_FILE = (1 << 2), /* file swap area */
/* add others here before... */
SWP_SCANNING = (1 << 8), /* refcount in scan_swap_map */
};
@@ -261,6 +262,8 @@ extern void swap_unplug_io_fn(struct bac
/* linux/mm/page_io.c */
extern int swap_readpage(struct file *, struct page *);
extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
extern void end_swap_bio_read(struct bio *bio, int err);

/* linux/mm/swap_state.c */
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
#include <linux/bio.h>
#include <linux/swapops.h>
#include <linux/writeback.h>
+#include <linux/buffer_head.h>
#include <asm/pgtable.h>

static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -102,6 +103,18 @@ int swap_writepage(struct page *page, st
unlock_page(page);
goto out;
}
+#ifdef CONFIG_SWAP_FILE
+ {
+ struct swap_info_struct *sis = page_swap_info(page);
+ if (sis->flags & SWP_FILE) {
+ ret = sis->swap_file->f_mapping->
+ a_ops->swap_out(sis->swap_file, page, wbc);
+ if (!ret)
+ count_vm_event(PSWPOUT);
+ return ret;
+ }
+ }
+#endif
bio = get_swap_bio(GFP_NOIO, page_private(page), page,
end_swap_bio_write);
if (bio == NULL) {
@@ -120,6 +133,39 @@ out:
return ret;
}

+#ifdef CONFIG_SWAP_FILE
+void swap_sync_page(struct page *page)
+{
+ struct swap_info_struct *sis = page_swap_info(page);
+
+ if (sis->flags & SWP_FILE) {
+ const struct address_space_operations * a_ops =
+ sis->swap_file->f_mapping->a_ops;
+ if (a_ops->sync_page)
+ a_ops->sync_page(page);
+ } else
+ block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+ struct swap_info_struct *sis = page_swap_info(page);
+
+ if (sis->flags & SWP_FILE) {
+ const struct address_space_operations * a_ops =
+ sis->swap_file->f_mapping->a_ops;
+ int (*spd)(struct page *) = a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+ if (!spd)
+ spd = __set_page_dirty_buffers;
+#endif
+ return (*spd)(page);
+ }
+
+ return __set_page_dirty_nobuffers(page);
+}
+#endif
+
int swap_readpage(struct file *file, struct page *page)
{
struct bio *bio;
@@ -127,6 +173,18 @@ int swap_readpage(struct file *file, str

BUG_ON(!PageLocked(page));
ClearPageUptodate(page);
+#ifdef CONFIG_SWAP_FILE
+ {
+ struct swap_info_struct *sis = page_swap_info(page);
+ if (sis->flags & SWP_FILE) {
+ ret = sis->swap_file->f_mapping->
+ a_ops->swap_in(sis->swap_file, page);
+ if (!ret)
+ count_vm_event(PSWPIN);
+ return ret;
+ }
+ }
+#endif
bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
end_swap_bio_read);
if (bio == NULL) {
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -27,8 +27,13 @@
*/
static const struct address_space_operations swap_aops = {
.writepage = swap_writepage,
+#ifdef CONFIG_SWAP_FILE
+ .sync_page = swap_sync_page,
+ .set_page_dirty = swap_set_page_dirty,
+#else
.sync_page = block_sync_page,
.set_page_dirty = __set_page_dirty_nobuffers,
+#endif
.migratepage = migrate_page,
};

Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1015,6 +1015,13 @@ static void destroy_swap_extents(struct
list_del(&se->list);
kfree(se);
}
+#ifdef CONFIG_SWAP_FILE
+ if (sis->flags & SWP_FILE) {
+ sis->flags &= ~SWP_FILE;
+ sis->swap_file->f_mapping->a_ops->
+ swapfile(sis->swap_file->f_mapping, 0);
+ }
+#endif
}

/*
@@ -1107,6 +1114,19 @@ static int setup_swap_extents(struct swa
goto done;
}

+#ifdef CONFIG_SWAP_FILE
+ if (sis->swap_file->f_mapping->a_ops->swapfile) {
+ ret = sis->swap_file->f_mapping->a_ops->
+ swapfile(sis->swap_file->f_mapping, 1);
+ if (!ret) {
+ sis->flags |= SWP_FILE;
+ ret = add_swap_extent(sis, 0, sis->max, 0);
+ *span = sis->pages;
+ }
+ goto done;
+ }
+#endif
+
blkbits = inode->i_blkbits;
blocks_per_page = PAGE_SIZE >> blkbits;

@@ -1671,7 +1691,7 @@ asmlinkage long sys_swapon(const char __

mutex_lock(&swapon_mutex);
spin_lock(&swap_lock);
- p->flags = SWP_ACTIVE;
+ p->flags |= SWP_WRITEOK;
nr_swap_pages += nr_good_pages;
total_swap_pages += nr_good_pages;

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -479,6 +479,14 @@ struct address_space_operations {
int (*migratepage) (struct address_space *,
struct page *, struct page *);
int (*launder_page) (struct page *);
+
+ /*
+ * swapfile support
+ */
+ int (*swapfile)(struct address_space *, int);
+ int (*swap_out)(struct file *file, struct page *page,
+ struct writeback_control *wbc);
+ int (*swap_in)(struct file *file, struct page *page);
};

/*
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -171,6 +171,9 @@ prototypes:
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
int (*launder_page) (struct page *);
+ int (*swapfile) (struct address_space *, int);
+ int (*swap_out) (struct file *, struct page *, struct writeback_control *);
+ int (*swap_in) (struct file *, struct page *);

locking rules:
All except set_page_dirty may block
@@ -192,6 +195,9 @@ invalidatepage: no yes
releasepage: no yes
direct_IO: no
launder_page: no yes
+swapfile: no
+swap_out: no yes, unlocks
+swap_in: no yes, unlocks

->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
may be called from the request handler (/dev/loop).
@@ -291,6 +297,19 @@ cleaned, or an error value if not. Note
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.

+ ->swapfile() will be called with a non-zero argument on address spaces
+backing non-block-device-backed swapfiles. A return value of zero indicates
+success, in which case this address space can be used for backing swapspace.
+The swapspace operations will be proxied to the address space operations.
+Swapoff will call this method with a zero argument to release the address
+space.
+
+ ->swap_out() when swapfile() returned success, this method is used to
+write the swap page.
+
+ ->swap_in() when swapfile() returned success, this method is used to
+read the swap page.
+
Note: currently almost all instances of address_space methods are
using BKL for internal serialization and that's one of the worst sources
of contention. Normally they are calling library functions (in fs/buffer.c)
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -185,6 +185,9 @@ config BOUNCE
def_bool y
depends on BLOCK && MMU && (ZONE_DMA || HIGHMEM)

+config SWAP_FILE
+ def_bool n
+
config NR_QUICK
int
depends on QUICKLIST
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -335,7 +335,7 @@ static inline void invalidate_inode_buff
static inline int remove_inode_buffers(struct inode *inode) { return 1; }
static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
static inline void invalidate_bdev(struct block_device *bdev) {}
-
+static inline void block_sync_page(struct page *page) { }

#endif /* CONFIG_BLOCK */
#endif /* _LINUX_BUFFER_HEAD_H */
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -542,6 +542,10 @@ struct address_space_operations {
/* migrate the contents of a page to the specified target */
int (*migratepage) (struct page *, struct page *);
int (*launder_page) (struct page *);
+ int (*swapfile)(struct address_space *, int);
+ int (*swap_out)(struct file *file, struct page *page,
+ struct writeback_control *wbc);
+ int (*swap_in)(struct file *file, struct page *page);
};

writepage: called by the VM to write a dirty page to backing store.
@@ -727,6 +731,19 @@ struct address_space_operations {
prevent redirtying the page, it is kept locked during the whole
operation.

+ swapfile: Called with a non-zero argument when swapon is used on a file. A
+ return value of zero indicates success, in which case this
+ address_space can be used to back swapspace. The swapspace operations
+ will be proxied to this address space's ->swap_{out,in} methods.
+ Swapoff will call this method with a zero argument to release the
+ address space.
+
+ swap_out: Called to write a swapcache page to a backing store, similar to
+ writepage.
+
+ swap_in: Called to read a swapcache page from a backing store, similar to
+ readpage.
+
The File Object
===============
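
As an illustration of how a filesystem would hook into this, an adopter
might wire the methods up roughly as follows (a sketch against the
interface above; all myfs_* names are invented):

static int myfs_swapfile(struct address_space *mapping, int enable)
{
	/*
	 * enable != 0: called from sys_swapon(); load and pin the block
	 * map so ->swap_{out,in}() never have to allocate anything.
	 * enable == 0: called from swapoff; drop the pinned map again.
	 */
	if (enable)
		return myfs_pin_block_map(mapping->host);
	myfs_release_block_map(mapping->host);
	return 0;
}

static const struct address_space_operations myfs_aops = {
	/* ... the usual readpage/writepage/etc ... */
	.swapfile	= myfs_swapfile,
	.swap_out	= myfs_swap_out,	/* write one swapcache page */
	.swap_in	= myfs_swap_in,		/* read one swapcache page */
};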


2008-01-10 09:49:59

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10 2008, Peter Zijlstra wrote:
>
> On Thu, 2008-01-10 at 08:37 +0000, Christoph Hellwig wrote:
>
> > Peter, any chance you could chime in here?
>
> I have this patch to add swap_out/_in methods. I expect we can loosen
> the requirement for swapcache pages and change the name a little.
>
> previously posted here:
> http://lkml.org/lkml/2007/5/4/143
>
> ---
> Subject: mm: add support for non block device backed swap files
>
> New address_space_operations methods are added:
> int swapfile(struct address_space *, int);
> int swap_out(struct file *, struct page *, struct writeback_control *);
> int swap_in(struct file *, struct page *);
>
> When during sys_swapon() the swapfile() method is found and returns no error
> the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
> make use of swap_{out,in}() to write/read swapcache pages.
>
> The swapfile method will be used to communicate to the address_space that the
> VM relies on it, and the address_space should take adequate measures (like
> reserving memory for mempools or the like).
>
> This new interface can be used to obviate the need for ->bmap in the swapfile
> code. A filesystem would need to load (and maybe even allocate) the full block
> map for a file into memory and pin it there on ->swapfile(,1) so that
> ->swap_{out,in}() have instant access to it. It can be released on
> ->swapfile(,0).

So this is where I don't think that's good enough, you cannot require a
full block/extent mapping of a file on setup. It can take quite some
time, a little testing I did here easily took 5 seconds for only a
couple of gigabytes. And that wasn't even worst case for that size. It
also wastes memory by populating extents that we may never read or
write.

If you look at the loop addition I did, it populates lazily as needed
with some very simple logic to populate-ahead. In practice that performs
as well as a pre-populated map, the first IO to a given range will just
be a little slower since we have to bmap() it.
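
In outline the lazy scheme is no more than this (illustrative, the helper
names are not from the patch):

static struct extent *lookup_extent(struct loop_device *lo, sector_t sec)
{
	struct extent *e = extent_tree_find(lo, sec);

	if (!e) {
		/*
		 * Miss: bmap() a window starting here and populate a bit
		 * ahead, so sequential IO keeps hitting the tree instead
		 * of going back to the fs for every request.
		 */
		e = populate_extents(lo, sec, LO_POPULATE_WINDOW);
	}
	return e;
}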

Do you have plans to improve this area?

--
Jens Axboe

2008-01-10 09:53:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop


On Thu, 2008-01-10 at 10:49 +0100, Jens Axboe wrote:
> On Thu, Jan 10 2008, Peter Zijlstra wrote:
> >
> > On Thu, 2008-01-10 at 08:37 +0000, Christoph Hellwig wrote:
> >
> > > Peter, any chance you could chime in here?
> >
> > I have this patch to add swap_out/_in methods. I expect we can loosen
> > the requirement for swapcache pages and change the name a little.
> >
> > previously posted here:
> > http://lkml.org/lkml/2007/5/4/143
> >
> > ---
> > Subject: mm: add support for non block device backed swap files
> >
> > New address_space_operations methods are added:
> > int swapfile(struct address_space *, int);
> > int swap_out(struct file *, struct page *, struct writeback_control *);
> > int swap_in(struct file *, struct page *);
> >
> > When during sys_swapon() the swapfile() method is found and returns no error
> > the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
> > make use of swap_{out,in}() to write/read swapcache pages.
> >
> > The swapfile method will be used to communicate to the address_space that the
> > VM relies on it, and the address_space should take adequate measures (like
> > reserving memory for mempools or the like).
> >
> > This new interface can be used to obviate the need for ->bmap in the swapfile
> > code. A filesystem would need to load (and maybe even allocate) the full block
> > map for a file into memory and pin it there on ->swapfile(,1) so that
> > ->swap_{out,in}() have instant access to it. It can be released on
> > ->swapfile(,0).
>
> So this is where I don't think that's good enough, you cannot require a
> full block/extent mapping of a file on setup. It can take quite some
> time, a little testing I did here easily took 5 seconds for only a
> couple of gigabytes. And that wasn't even worst case for that size. It
> also wastes memory by populating extents that we may never read or
> write.
>
> If you look at the loop addition I did, it populates lazily as needed
> with some very simple logic to populate-ahead. In practice that performs
> as well as a pre-populated map, the first IO to a given range will just
> be a little slower since we have to bmap() it.
>
> Do you have plans to improve this area?

Nope, for swap it _must_ be there; there is just no way we can do block
allocation on swapout.

That said, the swapfile() interface can be used to pre-populate the
extent/block mapping, and when using swap_{in,out}() without it, it can be
done lazily.

2008-01-10 10:02:45

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10 2008, Peter Zijlstra wrote:
>
> On Thu, 2008-01-10 at 10:49 +0100, Jens Axboe wrote:
> > On Thu, Jan 10 2008, Peter Zijlstra wrote:
> > >
> > > On Thu, 2008-01-10 at 08:37 +0000, Christoph Hellwig wrote:
> > >
> > > > Peter, any chance you could chime in here?
> > >
> > > I have this patch to add swap_out/_in methods. I expect we can loosen
> > > the requirement for swapcache pages and change the name a little.
> > >
> > > previously posted here:
> > > http://lkml.org/lkml/2007/5/4/143
> > >
> > > ---
> > > Subject: mm: add support for non block device backed swap files
> > >
> > > New address_space_operations methods are added:
> > > int swapfile(struct address_space *, int);
> > > int swap_out(struct file *, struct page *, struct writeback_control *);
> > > int swap_in(struct file *, struct page *);
> > >
> > > When during sys_swapon() the swapfile() method is found and returns no error
> > > the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
> > > make use of swap_{out,in}() to write/read swapcache pages.
> > >
> > > The swapfile method will be used to communicate to the address_space that the
> > > VM relies on it, and the address_space should take adequate measures (like
> > > reserving memory for mempools or the like).
> > >
> > > This new interface can be used to obviate the need for ->bmap in the swapfile
> > > code. A filesystem would need to load (and maybe even allocate) the full block
> > > map for a file into memory and pin it there on ->swapfile(,1) so that
> > > ->swap_{out,in}() have instant access to it. It can be released on
> > > ->swapfile(,0).
> >
> > So this is where I don't think that's good enough, you cannot require a
> > full block/extent mapping of a file on setup. It can take quite some
> > time, a little testing I did here easily took 5 seconds for only a
> > couple of gigabytes. And that wasn't even worst case for that size. It
> > also wastes memory by populating extents that we may never read or
> > write.
> >
> > If you look at the loop addition I did, it populates lazily as needed
> > with some very simple logic to populate-ahead. In practice that performs
> > as well as a pre-populated map, the first IO to a given range will just
> > be a little slower since we have to bmap() it.
> >
> > Do you have plans to improve this area?
>
> Nope, for swap it _must_ be there; there is just no way we can do block
> allocation on swapout.

I appreciate that fact, just saying that we could have more flexibility
for other uses.

> That said, the swapfile() interface can be used to pre-populate the
> extent/block mapping, and when using swap_{in,out}() without it, it can be
> done lazily.

I think the interface is too simple; I would much rather have a way to
dig into the file mappings and be allowed to request population and so
on. Without that, it can't be used for e.g. loop.

--
Jens Axboe

2008-01-10 10:21:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop


On Thu, 2008-01-10 at 11:02 +0100, Jens Axboe wrote:
> On Thu, Jan 10 2008, Peter Zijlstra wrote:
> >
> > On Thu, 2008-01-10 at 10:49 +0100, Jens Axboe wrote:
> > > On Thu, Jan 10 2008, Peter Zijlstra wrote:
> > > >
> > > > On Thu, 2008-01-10 at 08:37 +0000, Christoph Hellwig wrote:
> > > >
> > > > > Peter, any chance you could chime in here?
> > > >
> > > > I have this patch to add swap_out/_in methods. I expect we can loosen
> > > > the requirement for swapcache pages and change the name a little.
> > > >
> > > > previously posted here:
> > > > http://lkml.org/lkml/2007/5/4/143
> > > >
> > > > ---
> > > > Subject: mm: add support for non block device backed swap files
> > > >
> > > > New address_space_operations methods are added:
> > > > int swapfile(struct address_space *, int);
> > > > int swap_out(struct file *, struct page *, struct writeback_control *);
> > > > int swap_in(struct file *, struct page *);
> > > >
> > > > When during sys_swapon() the swapfile() method is found and returns no error
> > > > the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
> > > > make use of swap_{out,in}() to write/read swapcache pages.
> > > >
> > > > The swapfile method will be used to communicate to the address_space that the
> > > > VM relies on it, and the address_space should take adequate measures (like
> > > > reserving memory for mempools or the like).
> > > >
> > > > This new interface can be used to obviate the need for ->bmap in the swapfile
> > > > code. A filesystem would need to load (and maybe even allocate) the full block
> > > > map for a file into memory and pin it there on ->swapfile(,1) so that
> > > > ->swap_{out,in}() have instant access to it. It can be released on
> > > > ->swapfile(,0).
> > >
> > > So this is where I don't think that's good enough, you cannot require a
> > > full block/extent mapping of a file on setup. It can take quite some
> > > time, a little testing I did here easily took 5 seconds for only a
> > > couple of gigabytes. And that wasn't even worst case for that size. It
> > > also wastes memory by populating extents that we may never read or
> > > write.
> > >
> > > If you look at the loop addition I did, it populates lazily as needed
> > > with some very simple logic to populate-ahead. In practice that performs
> > > as well as a pre-populated map, the first IO to a given range will just
> > > be a little slower since we have to bmap() it.
> > >
> > > Do you have plans to improve this area?
> >
> > Nope, for swap it _must_ be there; there is just no way we can do block
> > allocation on swapout.
>
> I appreciate that fact, just saying that we could have more flexibility
> for other uses.
>
> > That said, the swapfile() interface can be used to pre-populate the
> > extent/block mapping, and when using swap_{in,out}() without it, it can be
> > done lazily.
>
> I think the interface is too simple; I would much rather have a way to
> dig into the file mappings and be allowed to request population and so
> on. Without that, it can't be used for e.g. loop.

I'm open to suggestions :-); if you can propose an interface we both can
use, I'm all ears.

Although poking at the extents sounds like bmap, and we were just trying
to get rid of that.

Perhaps something like badvise(), where you can suggest loading/dropping
extent information on a voluntary basis.
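
As a strawman, it could be as simple as this (nothing like it exists,
purely hypothetical):

#define BADV_WILLNEED	0	/* please load extent info for this range */
#define BADV_DONTNEED	1	/* extent info for this range may be dropped */

int badvise(int fd, loff_t offset, loff_t len, int advice);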

2008-01-10 12:49:21

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, 10 Jan 2008 09:31:31 +0100
Jens Axboe <[email protected]> wrote:

> On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > Here's the latest version of dm-loop, for comparison.
> >
> > To try it out,
> > ln -s dmsetup dmlosetup
> > and supply similar basic parameters to losetup.
> > (using dmsetup version 1.02.11 or higher)
>
> Why oh why does dm always insist on reinventing everything? That's bad
> enough in itself, but on top of that most of the extra stuff ends up
> being essentially unmaintained.

I don't quite get how the dm version is reinventing things. They use
the dmsetup command that they use for everything else and provide a
small and fairly clean module for bio specific loop instead of piling
it onto loop.c....

Their code doesn't have the fancy hole handling that yours does, but
neither did yours 4 days ago ;)

>
> If we instead improve loop, everyone wins.
>
> Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a
> bit outside your own sandbox.
>

It is a natural fit in either place, as both loop and dm have a good
infrastructure for it. I'm not picky about where it ends up, but dm
wouldn't be a bad place.

-chris

2008-01-10 12:55:11

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, 10 Jan 2008 08:54:59 +0000
Christoph Hellwig <[email protected]> wrote:

> On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > > IMHO this shouldn't be done in the loop driver anyway.
> > > Filesystems have their own effricient extent lookup trees (well,
> > > at least xfs and btrfs do), and we should leverage that instead
> > > of reinventing it.
> >
> > Completely agree, it's just needed right now for this solution
> > since all we have is a crappy bmap() interface to get at those
> > mappings.
>
> So let's fix the interface instead of piling crap on top of it. As I
> said, I think Peter has something to start with, so let's beat on it
> until we have something suitable. If we aren't done by end of Feb
> I'm happy to host a hackfest to get it sorted around the fs/storage
> summit..
>

Ok, I've been meaning to break my extent_map code up, and this is a
very good reason. I'll work up a sample today based on Jens' code.

The basic goals:

* Loop (swap) calls into the FS for each mapping. Any caching happens
on the FS side.
* The FS returns an extent, filling any holes

Swap would need to use an extra call early on for preallocation.

Step two is having a call back into the FS allow the FS to delay the
bios until commit completion so that COW and delalloc blocks can be
fully on disk when the bios are reported as done. Jens, can you add
some way to queue the bio completions up?
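
In interface terms, the per-mapping call in the goals above might end up
looking something like this (just a guess at the shape the sample will
take, not actual code):

/*
 * Ask the fs for the extent backing 'start'. With create set, holes
 * covering the range get allocated first; with create clear, a hole is
 * returned as such so the caller can zero-fill reads itself.
 */
struct extent_map *(*map_extent)(struct inode *inode, u64 start,
				 u64 len, int create);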

-chris

2008-01-10 12:57:53

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10 2008, Chris Mason wrote:
> On Thu, 10 Jan 2008 09:31:31 +0100
> Jens Axboe <[email protected]> wrote:
>
> > On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > > Here's the latest version of dm-loop, for comparison.
> > >
> > > To try it out,
> > > ln -s dmsetup dmlosetup
> > > and supply similar basic parameters to losetup.
> > > (using dmsetup version 1.02.11 or higher)
> >
> > Why oh why does dm always insist on reinventing everything? That's bad
> > enough in itself, but on top of that most of the extra stuff ends up
> > being essentially unmaintained.
>
> I don't quite get how the dm version is reinventing things. They use

Things like raid, and now file mapping functionality. I'm sure there are
more examples; it's how dm was always developed, probably originating
back to when they developed mostly out of tree. And I think it's a bad
idea; we don't want duplicate functionality. If something is wrong with
loop, fix it, don't write dm-loop.

> the dmsetup command that they use for everything else and provide a
> small and fairly clean module for bio specific loop instead of piling
> it onto loop.c....

If loop.c is a problem, I'd rather see a newloop.c (with a better name,
of course) that we can transition to.

> Their code doesn't have the fancy hole handling that yours does, but
> neither did yours 4 days ago ;)

Well, mine didn't exist 4 days ago; I was just listing missing
functionality.

>
> >
> > If we instead improve loop, everyone wins.
> >
> > Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a
> > bit outside your own sandbox.
> >
>
> It is a natural fit in either place, as both loop and dm have a good
> infrastructure for it. I'm not picky about where it ends up, but dm
> wouldn't be a bad place.

I know that's your opinion; I reserve the right to have my own on where
this functionality belongs :)

--
Jens Axboe

2008-01-10 13:03:39

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, Jan 10 2008, Chris Mason wrote:
> On Thu, 10 Jan 2008 08:54:59 +0000
> Christoph Hellwig <[email protected]> wrote:
>
> > On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > > > IMHO this shouldn't be done in the loop driver anyway.
> > > > Filesystems have their own effricient extent lookup trees (well,
> > > > at least xfs and btrfs do), and we should leverage that instead
> > > > of reinventing it.
> > >
> > > Completely agree, it's just needed right now for this solution
> > > since all we have is a crappy bmap() interface to get at those
> > > mappings.
> >
> > So let's fix the interface instead of piling crap on top of it. As I
> > said, I think Peter has something to start with, so let's beat on it
> > until we have something suitable. If we aren't done by end of Feb
> > I'm happy to host a hackfest to get it sorted around the fs/storage
> > summit..
> >
>
> Ok, I've been meaning to break my extent_map code up, and this is a
> very good reason. I'll work up a sample today based on Jens' code.

Great!

> The basic goals:
>
> * Loop (swap) calls into the FS for each mapping. Any caching happens
> on the FS side.
> * The FS returns an extent, filling any holes

We don't want to fill holes for a read, but I guess that's a given?

> Swap would need to use an extra call early on for preallocation.
>
> Step two is having a call back into the FS allow the FS to delay the
> bios until commit completion so that COW and delalloc blocks can be
> fully on disk when the bios are reported as done. Jens, can you add
> some way to queue the bio completions up?

Sure, a function to save a completed bio and a function to execute
completions on those already stored?
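
Something like this, say (a sketch reusing the bio list helpers the
dm-loop patch earlier in the thread also uses; the names are invented):

static struct bio_list deferred_bios;	/* protected by a suitable lock */

/* stash a finished bio instead of completing it */
static void loop_defer_completion(struct bio *bio)
{
	bio_list_add(&deferred_bios, bio);
}

/* run by the fs once the commit is safely on disk */
static void loop_run_completions(void)
{
	struct bio *bio;

	while ((bio = bio_list_pop(&deferred_bios)) != NULL)
		bio_endio(bio, 0);	/* two-argument bio_endio() form */
}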

--
Jens Axboe

2008-01-10 13:47:03

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thu, 10 Jan 2008 14:03:24 +0100
Jens Axboe <[email protected]> wrote:

> On Thu, Jan 10 2008, Chris Mason wrote:
> > On Thu, 10 Jan 2008 08:54:59 +0000
> > Christoph Hellwig <[email protected]> wrote:
> >
> > > On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > > > > IMHO this shouldn't be done in the loop driver anyway.
> > > > > Filesystems have their own effricient extent lookup trees
> > > > > (well, at least xfs and btrfs do), and we should leverage
> > > > > that instead of reinventing it.
> > > >
> > > > Completely agree, it's just needed right now for this solution
> > > > since all we have is a crappy bmap() interface to get at those
> > > > mappings.
> > >
> > > So let's fix the interface instead of piling crap on top of it.
> > > As I said, I think Peter has something to start with, so let's beat
> > > on it until we have something suitable. If we aren't done by
> > > end of Feb I'm happy to host a hackfest to get it sorted around
> > > the fs/storage summit..
> > >
> >
> > Ok, I've been meaning to break my extent_map code up, and this is a
> > very good reason. I'll work up a sample today based on Jens' code.
>
> Great!

Grin, we'll see how the sample looks.

>
> > The basic goals:
> >
> > * Loop (swap) calls into the FS for each mapping. Any caching
> > happens on the FS side.
> > * The FS returns an extent, filling any holes
>
> We don't want to fill holes for a read, but I guess that's a given?

Right.

>
> > Swap would need to use an extra call early on for preallocation.
> >
> > Step two is having a call back into the FS allow the FS to delay the
> > bios until commit completion so that COW and delalloc blocks can be
> > fully on disk when the bios are reported as done. Jens, can you add
> > some way to queue the bio completions up?
>
> Sure, a function to save a completed bio and a function to execute
> completions on those already stored?
>

Sounds right, I'm mostly looking for a way to aggregate a few writes to
make the commits a little larger.

-chris

2008-01-10 23:01:37

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Thursday January 10, [email protected] wrote:
> On Thu, Jan 10 2008, Chris Mason wrote:
> > On Thu, 10 Jan 2008 09:31:31 +0100
> > Jens Axboe <[email protected]> wrote:
> >
> > > On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > > > Here's the latest version of dm-loop, for comparison.
> > > >
> > > > To try it out,
> > > > ln -s dmsetup dmlosetup
> > > > and supply similar basic parameters to losetup.
> > > > (using dmsetup version 1.02.11 or higher)
> > >
> > > Why oh why does dm always insist on reinventing everything? That's bad
> > > enough in itself, but on top of that most of the extra stuff ends up
> > > being essentially unmaintained.
> >
> > I don't quite get how the dm version is reinventing things. They use
>
> Things like raid, and now file mapping functionality. I'm sure there are
> more examples; it's how dm was always developed, probably originating
> back to when they developed mostly out of tree. And I think it's a bad
> idea; we don't want duplicate functionality. If something is wrong with
> loop, fix it, don't write dm-loop.

I'm with Jens here.

We currently have two interfaces that interesting block devices can be
written for: 'dm' and 'block'.
We really should aim to have just one. I would call it 'block' and
move anything really useful from dm into block.

As far as I can tell, the important things that 'dm' has that 'block'
doesn't have are:

- a standard ioctl interface for assembling and creating interesting
devices.
For 'block', everybody just rolls their own; e.g. md, loop, and
nbd all use totally different approaches for setup and tear down
etc.

- suspend/reconfigure/resume.
This is something that I would really like to see in 'block'. If
I had a filesystem mounted on /dev/sda1 and I wanted to make it a
raid1, it would be cool if I could
suspend /dev/sda1
build a raid1 from sda1 and something else
plug the raid1 in as 'sda1'.
resume sda1

- Integrated 'linear' mapping.
This is the bit of 'dm' that I think of as yucky. If I read the
code correctly, every dm device is a linear array of a bunch of
targets. Each target can be a stripe-set(raid0) or a multipath or
a raid1 or a plain block device or whatever.
Having 'linear' at a different level to everything else seems a
bit ugly, but it isn't really a big deal.

I would really like to see every 'dm' target being just a regular
'block' device. Then a 'linear' block device could be used to
assemble dm targets into a dm device. Or the targets could be used
directly if the 'linear' function wasn't needed.

Each target/device could respond to both dm ioctls and 'ad hoc'
ioctls. That is a bit ugly, but backwards compatibility always is,
and it isn't a big cost.

I think the way forward here is to put the important
suspend/reconfig/resume functionality into the block layer, then
work on making code work with multiple ioctl interfaces.

I *don't* think the way forward is to duplicate current block devices
as dm targets. This is duplication of effort (which I admit isn't
always a bad thing) and a maintenance headache (which is).

"Help, I'm having a problem with the loop driver"
"Which one, dm or regular, because if it is the 'dm' one you have to
ask over there ..."

It has already happened occasionally with raid1 and multipath - both
of which have pointless multiple implementations.


>
> > the dmsetup command that they use for everything else and provide a
> > small and fairly clean module for bio specific loop instead of piling
> > it onto loop.c....

I'm missing something here... "fairly clean module for bio specific
loop". Isn't that what 'loop.c' is meant to be?

NeilBrown

2008-01-11 01:01:44

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

Jens Axboe wrote:
> Hi,
>
> loop.c currently uses the page cache interface to do IO to file backed
> devices. This works reasonably well for simple things, like mapping an
> iso9660 file for direct mount and other read-only workloads. Writing is
> somewhat problematic, as anyone who has really used this feature can
> attest to - it tends to confuse the vm (hello kswapd) since it break
> dirty accounting and behaves very erratically on writeout. Did I mention
> that it's pretty slow as well, for both reads and writes?
>
Since you are looking for comments, I'll mention a loop-related behavior
I've been seeing, in case it's useful - it can be used to tickle the bad
behavior on demand.

I have a 6GB sparse file, which I mount with cryptoloop and populate as
an ext3 filesystem (more later on why). I then copy ~5.8GB of data to
the filesystem, which is unmounted to be burnt to a DVD. Before it's
burned, the "dvdisaster" application is used to add some ECC information
to the end and make an image which fits on a DVD-DL. Media will be
burned and distributed to multiple locations.

The problem:

When copying with rsync, the copy runs at ~25MB/s for a while, then
falls into a pattern of bursts of 25MB/s followed by 10-15 sec of iowait
with no disk activity. So I tried doing the copy with cpio
find . -depth | cpio -pdm /mnt/loop
which shows exactly the same behavior. Then, for no good reason, I tried
find . -depth | cpio -pBdm /mnt/loop
and the copy ran at 25MB/s for the whole data set.

I was able to see similar results with a pure loop mount; I only mention
the crypto for accuracy. Many of these images have been shipped over
the last two years, so new loop code would only be useful in this case
if it stayed compatible and old data sets could still be read.

> It also behaves differently than a real drive. For writes, completions
> are done once they hit page cache. Since loop queues bio's async and
> hands them off to a thread, you can have a huge backlog of stuff to do.
> It's hard to attempt to guarentee data safety for file systems on top of
> loop without making it even slower than it currently is.
>
> Back when loop was only used for iso9660 mounting and other simple
> things, this didn't matter. Now it's often used in xen (and others)
> setups where we do care about performance AND writing. So the below is a
> attempt at speeding up loop and making it behave like a real device.
> It's a somewhat quick hack and is still missing one piece to be
> complete, but I'll throw it out there for people to play with and
> comment on.
>
> So how does it work? Instead of punting IO to a thread and passing it
> through the page cache, we instead attempt to send the IO directly to the
> filesystem block that it maps to. loop maintains a prio tree of known
> extents in the file (populated lazily on demand, as needed). Advantages
> of this approach:
>
> - It's fast, loop will basically work at device speed.
> - It's fast, loop it doesn't put a huge amount of system load on the
> system when busy. When I did comparison tests on my notebook with an
> external drive, running a simple tiobench on the current in-kernel
> loop with a sparse file backing rendered the notebook basically
> unusable while the test was ongoing. The remapper version had no more
> impact than it did when used directly on the external drive.
> - It behaves like a real block device.
> - It's easy to support IO barriers, which is needed to ensure safety
> especially in virtualized setups.
>
> Disadvantages:
>
> - The file block mappings must not change while loop is using the file.
> This means that we have to ensure exclusive access to the file and
> this is the bit that is currently missing in the implementation. It
> would be nice if we could just do this via open(), ideas welcome...
> - It'll tie down a bit of memory for the prio tree. This is GREATLY
> offset by the reduced page cache foot print though.
> - It cannot be used with the loop encryption stuff. dm-crypt should be
> used instead, on top of loop (which, I think, is even the recommended
> way to do this today, so not a big deal).
>

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2008-01-11 07:58:28

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Fri, Jan 11 2008, Mikulas Patocka wrote:
> > So I looked at the code - it seems you build a full extent map of the blocks
> > in the file, filling holes as you go along. I initially did that as well,
> > but that is too slow to be usable in real life.
> >
> > You also don't support sparse files, falling back to normal fs
> > read/write paths. Supporting sparse files properly is a must, people
> > generally don't want to prealloc a huge disk backing.
>
> How would you do sparse file support with passthrough loopback that
> doesn't use pagecache?
>
> Holes are allocated by the get_block function provided by each filesystem, and
> the function gets a buffer that is supposed to be in the pagecache. Now if
> you want to allocate holes without pagecache, there's a problem --- a new
> interface to all filesystems is needed.
>
> It could be possible to use the pagecache interface for filling holes and
> the passthrough interface for other requests --- but get_block is allowed to
> move other blocks on the filesystem (and on UFS it really does), so
> calling get_block to fill a hole could move other unrelated blocks, which
> would result in a desynchronized block map and corruption of both
> filesystems.

Please read the posted patch and the posts from Chris as well, it
basically covers everything in your email.

The patch, as posted, doesn't work if the fs moves blocks around.

--
Jens Axboe

2008-01-11 08:04:32

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

> So I looked at the code - it seems you build a full extent map of the blocks
> in the file, filling holes as you go along. I initially did that as well,
> but that is too slow to be usable in real life.
>
> You also don't support sparse files, falling back to normal fs
> read/write paths. Supporting sparse files properly is a must, people
> generally don't want to prealloc a huge disk backing.

How would you do sparse file support with passthrough loopback that
doesn't use pagecache?

Holes are allocated by the get_block function provided by each filesystem, and
the function gets a buffer that is supposed to be in the pagecache. Now if
you want to allocate holes without pagecache, there's a problem --- a new
interface to all filesystems is needed.

It could be possible to use the pagecache interface for filling holes and
the passthrough interface for other requests --- but get_block is allowed to
move other blocks on the filesystem (and on UFS it really does), so
calling get_block to fill a hole could move other unrelated blocks, which
would result in a desynchronized block map and corruption of both
filesystems.
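
For reference, the interface in question is the classic get_block
callback, which is buffer_head based and therefore assumes the pagecache
(as declared in include/linux/fs.h):

typedef int (get_block_t)(struct inode *inode, sector_t iblock,
			struct buffer_head *bh_result, int create);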

Mikulas

2008-01-11 14:22:26

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Fri, 11 Jan 2008 10:01:18 +1100
Neil Brown <[email protected]> wrote:

> On Thursday January 10, [email protected] wrote:
> > On Thu, Jan 10 2008, Chris Mason wrote:
> > > On Thu, 10 Jan 2008 09:31:31 +0100
> > > Jens Axboe <[email protected]> wrote:
> > >
> > > > On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > > > > Here's the latest version of dm-loop, for comparison.
> > > > >
> > > > > To try it out,
> > > > > ln -s dmsetup dmlosetup
> > > > > and supply similar basic parameters to losetup.
> > > > > (using dmsetup version 1.02.11 or higher)
> > > >
> > > > Why oh why does dm always insist on reinventing everything? That's
> > > > bad enough in itself, but on top of that most of the extra
> > > > stuff ends up being essentially unmaintained.
> > >
> > > I don't quite get how the dm version is reinventing things. They
> > > use
> >
> > Things like raid, and now file mapping functionality. I'm sure there
> > are more examples; it's how dm has always been developed, probably
> > originating back to when it was developed mostly out of tree. And I
> > think it's a bad idea - we don't want duplicate functionality. If
> > something is wrong with loop, fix it, don't write dm-loop.
>
> I'm with Jens here.
>
> We currently have two interfaces that interesting block devices can be
> written for: 'dm' and 'block'.
> We really should aim to have just one. I would call it 'block' and
> move anything really useful from dm into block.
>
> As far as I can tell, the important things that 'dm' has that 'block'
> doesn't have are:
>
> - a standard ioctl interface for assembling and creating interesting
> devices.
> For 'block', everybody just rolls their own. e.g. md, loop, and
> nbd all use totally different approaches for setup and tear down
> etc.
>
> - suspend/reconfigure/resume.
> This is something that I would really like to see in 'block'. If
> I had a filesystem mounted on /dev/sda1 and I wanted to make it a
> raid1, it would be cool if I could
> suspend /dev/sda1
> build a raid1 from sda1 and something else
> plug the raid1 in as 'sda1'.
> resume sda1
>
> - Integrated 'linear' mapping.
> This is the bit of 'dm' that I think of as yucky. If I read the
> code correctly, every dm device is a linear array of a bunch of
> targets. Each target can be a stripe-set(raid0) or a multipath or
> a raid1 or a plain block device or whatever.
> Having 'linear' at a different level to everything else seems a
> bit ugly, but it isn't really a big deal.
>

DM is also a framework where you can introduce completely new types of
block devices without having to go through the associated pain of
finding major numbers. In terms of developing new things with greater
flexibility, I think it is easier.

> I would really like to see every 'dm' target being just a regular
> 'block' device. Then a 'linear' block device could be used to
> assemble dm targets into a dm device. Or the targets could be used
> directly if the 'linear' function wasn't needed.
>
> Each target/device could respond to both dm ioctls and 'ad hoc'
> ioctls. That is a bit ugly, but backwards compatibility always is,
> and it isn't a big cost.
>
> I think the way forward here is to put the important
> suspend/reconfig/resume functionality into the block layer, then
> work on making code work with multiple ioctl interfaces.
>
> I *don't* think the way forward is to duplicate current block devices
> as dm targets. This is duplication of effort (which I admit isn't
> always a bad thing) and a maintenance headache (which is).
>

raid in dm aside (that's an entirely different debate ;), loop is a
pile of things which dm can nicely layer out into pieces (dm-crypt vs
loopback crypt). Also, dm doesn't have to jump through hoops to get a
variable number of minors.

Yes, the loop side was recently improved for # of minors, and it does
have enough in there for userland to do a variable number of minors, but
this is one specific case where dm is just easier.

At any rate, I'm all for ideas that make dm less of the evil stepchild
of the block layer ;) I'm not saying everything should be dm, but I
did want to point out that dm-loop isn't entirely silly.

I have a version of Jens' patch in testing here that makes a new API
with the FS for mapping extents, and I hope to post it later today.

-chris

2008-01-11 18:17:48

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

Hi Jens,

This looks really useful.

On Wednesday 09 January 2008 00:52, Jens Axboe wrote:
> Disadvantages:
>
> - The file block mappings must not change while loop is using the
> file. This means that we have to ensure exclusive access to the file
> and this is the bit that is currently missing in the implementation.
> It would be nice if we could just do this via open(), ideas
> welcome...

Get_block methods are pretty fast and you have caching in the level
above you, so you might be able to get away with no cache of physical
addresses at all, in which case you just need i_mutex and i_alloc_sem
at get_block time. This would save a pile of code and still have the
main benefit of avoiding double caching.
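
(As a rough sketch of that locking - illustrative only, using the
2.6.23-era inode fields:)

/*
 * Sketch of the suggestion above: no private extent cache, just
 * serialize each mapping call against truncate and other block
 * allocators via i_mutex and i_alloc_sem.
 */
static int locked_get_block(struct inode *inode, sector_t iblock,
			    struct buffer_head *bh, int create,
			    get_block_t get_block)
{
	int ret;

	mutex_lock(&inode->i_mutex);
	down_read(&inode->i_alloc_sem);
	ret = get_block(inode, iblock, bh, create);
	up_read(&inode->i_alloc_sem);
	mutex_unlock(&inode->i_mutex);
	return ret;
}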

If you use ->get_block instead of bmap, it will fill in file holes for
you, but of course get_block is not exposed, and Al is likely to bark
at anyone who exposes it.

Instead of exposing get_block you could expose an aops method
like ->bio_transfer that would hide the use of *_get_block in a library
routine, just as __blockdev_direct_IO does. Chances are, there are
other users besides loop that would be interested in a generic way of
performing bio transfers to files.
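
(Made concrete, it might look something like this - the method name
and signature are the suggestion above fleshed out, not an existing
kernel interface:)

/* hypothetical addition to struct address_space_operations */
int (*bio_transfer)(struct address_space *mapping,
		    struct bio *bio, int rw);

/* loop could then handle a queued bio with just: */
if (mapping->a_ops->bio_transfer)
	err = mapping->a_ops->bio_transfer(mapping, bio,
					   bio_data_dir(bio));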

I presume you would fall back to the existing approach for any
filesystem without get_block. You could handle this transparently with
a default library method that does read/write.

Regards,

Daniel

2008-01-11 18:23:33

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Fri, Jan 11 2008, Daniel Phillips wrote:
> Hi Jens,
>
> This looks really useful.
>
> On Wednesday 09 January 2008 00:52, Jens Axboe wrote:
> > Disadvantages:
> >
> > - The file block mappings must not change while loop is using the
> > file. This means that we have to ensure exclusive access to the file
> > and this is the bit that is currently missing in the implementation.
> > It would be nice if we could just do this via open(), ideas
> > welcome...
>
> Get_block methods are pretty fast and you have caching in the level
> above you, so you might be able to get away with no cache of physical
> addresses at all, in which case you just need i_mutex and i_alloc_sem
> at get_block time. This would save a pile of code and still have the
> main benefit of avoiding double caching.

I'm not too fond of the tree either, but it serves an important purpose
as well - we need to be careful in calling bmap() so as not to deadlock
the fs under vm pressure. So the current code punts to a thread for
bmap() on extents we don't already have stored in loop. Dropping the
tree would slow things down, of course: we would still have to punt
every IO to loopd instead of just doing a quick remap. But...

> If you use ->get_block instead of bmap, it will fill in file holes for
> you, but of course get_block is not exposed, and Al is likely to bark
> at anyone who exposes it.
>
> Instead of exposing get_block you could expose an aops method
> like ->bio_transfer that would hide the use of *_get_block in a library
> routine, just as __blockdev_direct_IO does. Chances are, there are
> other users besides loop that would be interested in a generic way of
> performing bio transfers to files.
>
> I presume you would fall back to the existing approach for any
> filesystem without get_block. You could handle this transparently with
> a default library method that does read/write.

... things are already moving forward - Chris has a new interface for
this and has tied it in with the loop fastfs mode. I think he'll post it
later today.

--
Jens Axboe

2008-01-14 17:11:41

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

Hello everyone,

Here is a modified version of Jens' patch. The basic idea is to push
the mapping maintenance out of loop and down into the filesystem (ext2
in this case).

Two new address_space operations are added: one to map extents and the
other to provide callbacks into the FS as IO is completed.

Still TODO for this patch:

* Add exclusion between filling holes and readers. This is partly
implemented: when a hole is filled by the FS, the extent is flagged as
having a hole. The idea is to check this flag before trying to read
the blocks and just send back all zeros (see the sketch after this
list).

The flag would be cleared when the blocks filling the hole have been
written.

* Exclude page cache readers and writers

* Add a way for the FS to request a commit before telling the higher
layers the IO is complete. This way we can make sure metadata related
to holes is on disk before claiming the IO is really done. COW-based
filesystems will also need it.

* Change loop to use fast mapping only when the new address_space op is
provided (whoops, forgot about this until just now)

* A few other bits for COW, not really relevant until there
is...something COW using it.
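
A sketch of the read-side check from the first TODO - hypothetical
code that would sit next to the existing EXTENT_MAP_HOLE test in
loop_redirect_bio() in the patch below:

/*
 * Sketch only - this exclusion is not implemented yet. The flag is
 * set while a hole is being filled; readers must see zeros until the
 * newly allocated blocks have actually been written.
 */
if (bio_data_dir(bio) == READ &&
    (lfe->flags & (1 << EXTENT_MAP_HOLE_FILLED))) {
	loop_handle_extent_hole(lo, bio);	/* zero-fills reads */
	free_extent_map(lfe);
	return 0;
}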

-chris

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -76,6 +76,7 @@
#include <linux/gfp.h>
#include <linux/kthread.h>
#include <linux/splice.h>
+#include <linux/extent_map.h>

#include <asm/uaccess.h>

@@ -481,16 +482,51 @@ static int do_bio_filebacked(struct loop
return ret;
}

+#define __lo_throttle(wq, lock, condition) \
+do { \
+ DEFINE_WAIT(__wait); \
+ for (;;) { \
+ prepare_to_wait((wq), &__wait, TASK_UNINTERRUPTIBLE); \
+ if (condition) \
+ break; \
+ spin_unlock_irq((lock)); \
+ io_schedule(); \
+ spin_lock_irq((lock)); \
+ } \
+ finish_wait((wq), &__wait); \
+} while (0)
+
+#define lo_act_bio(bio) ((bio)->bi_bdev)
+#define LO_BIO_THROTTLE 128
+
/*
- * Add bio to back of pending list
+ * A normal block device will throttle on request allocation. Do the same
+ * for loop to prevent millions of bio's queued internally.
+ */
+static void loop_bio_throttle(struct loop_device *lo, struct bio *bio)
+{
+ if (lo_act_bio(bio))
+ __lo_throttle(&lo->lo_bio_wait, &lo->lo_lock,
+ lo->lo_bio_cnt < LO_BIO_THROTTLE);
+}
+
+/*
+ * Add bio to back of pending list and wakeup thread
*/
static void loop_add_bio(struct loop_device *lo, struct bio *bio)
{
+ loop_bio_throttle(lo, bio);
+
if (lo->lo_biotail) {
lo->lo_biotail->bi_next = bio;
lo->lo_biotail = bio;
} else
lo->lo_bio = lo->lo_biotail = bio;
+
+ if (lo_act_bio(bio))
+ lo->lo_bio_cnt++;
+
+ wake_up(&lo->lo_event);
}

/*
@@ -510,6 +546,178 @@ static struct bio *loop_get_bio(struct l
return bio;
}

+static void loop_exit_fastfs(struct loop_device *lo)
+{
+ /*
+ * drop what page cache we instantiated filling holes
+ */
+ invalidate_inode_pages2(lo->lo_backing_file->f_mapping);
+
+ blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_NONE, NULL);
+}
+
+static inline u64 lo_bio_offset(struct loop_device *lo, struct bio *bio)
+{
+ return (u64)lo->lo_offset + ((u64)bio->bi_sector << 9);
+}
+
+/*
+ * Find extent mapping this lo device block to the file block on the real
+ * device
+ */
+static struct extent_map *loop_lookup_extent(struct loop_device *lo,
+ u64 offset, gfp_t gfp_mask)
+{
+ struct address_space *mapping;
+ struct extent_map *em;
+ u64 len = 1 << lo->blkbits;
+
+ mapping = lo->lo_backing_file->f_mapping;
+ em = mapping->a_ops->map_extent(mapping, NULL, 0,
+ offset, len, 0, gfp_mask);
+ return em;
+}
+
+/*
+ * Alloc a hint bio to tell the loop thread to read file blocks for a given
+ * range
+ */
+static void loop_schedule_extent_mapping(struct loop_device *lo,
+ sector_t sector,
+ unsigned long len, int wait)
+{
+ struct bio *bio, stackbio;
+
+ /*
+ * it's ok if we occasionally fail. if called with blocking set,
+ * then use an on-stack bio since that must not fail.
+ */
+ if (wait) {
+ bio = &stackbio;
+ bio_init(bio);
+ } else
+ bio = bio_alloc(GFP_ATOMIC, 0);
+
+ if (bio) {
+ DECLARE_COMPLETION_ONSTACK(comp);
+
+ bio->bi_rw = LOOP_EXTENT_RW_MAGIC;
+ bio->bi_sector = sector;
+ bio->bi_size = len;
+
+ loop_add_bio(lo, bio);
+
+ if (wait) {
+ /*
+ * ok to set here, loop_add_bio() doesn't drop lock
+ * for this bio (!lo_act_bio(bio))
+ */
+ bio->bi_private = &comp;
+
+ /*
+ * never called with wait != 0 where it's not
+ * allowed to use spin_unlock_irq() which
+ * unconditionally enables interrupts.
+ */
+ spin_unlock_irq(&lo->lo_lock);
+ wait_for_completion(&comp);
+ spin_lock_irq(&lo->lo_lock);
+ }
+ }
+}
+
+static void loop_handle_extent_hole(struct loop_device *lo, struct bio *bio)
+{
+ /*
+ * for a read, just zero the data and end the io
+ */
+ if (bio_data_dir(bio) == READ) {
+ struct bio_vec *bvec;
+ unsigned long flags;
+ int i;
+
+ bio_for_each_segment(bvec, bio, i) {
+ char *dst = bvec_kmap_irq(bvec, &flags);
+
+ memset(dst, 0, bvec->bv_len);
+ bvec_kunmap_irq(dst, &flags);
+ }
+ bio_endio(bio, 0);
+ } else {
+ /*
+ * let the page cache handling path do this bio, and then
+ * lookup the mapped blocks after the io has been issued to
+ * instantiate extents.
+ */
+ loop_add_bio(lo, bio);
+ }
+}
+
+static inline int lo_is_switch_bio(struct bio *bio)
+{
+ return !bio->bi_bdev && bio->bi_rw == LOOP_SWITCH_RW_MAGIC;
+}
+
+static inline int lo_is_map_bio(struct bio *bio)
+{
+ return !bio->bi_bdev && bio->bi_rw == LOOP_EXTENT_RW_MAGIC;
+}
+
+/*
+ * Change mapping of the bio, so that it points to the real bdev and offset
+ */
+static int loop_redirect_bio(struct loop_device *lo, struct bio *bio)
+{
+ struct extent_map *lfe;
+ u64 extent_off;
+ u64 disk_block;
+ u64 start = lo_bio_offset(lo, bio);
+
+ lfe = loop_lookup_extent(lo, start, GFP_ATOMIC);
+ if (IS_ERR(lfe))
+ return -EIO;
+
+ while(!lfe) {
+ loop_schedule_extent_mapping(lo, bio->bi_sector,
+ bio->bi_size, 1);
+ lfe = loop_lookup_extent(lo, start, GFP_ATOMIC);
+ if (IS_ERR(lfe))
+ return -EIO;
+ }
+ /*
+ * handle sparse io
+ */
+ if (lfe->block_start == EXTENT_MAP_HOLE) {
+ loop_handle_extent_hole(lo, bio);
+ free_extent_map(lfe);
+ return 0;
+ }
+
+ /*
+ * not a hole, redirect
+ */
+ disk_block = lfe->block_start;
+ extent_off = start - lfe->start;
+ bio->bi_bdev = lfe->bdev;
+ bio->bi_sector = (disk_block + extent_off) >> 9;
+ free_extent_map(lfe);
+ return 1;
+}
+
+/*
+ * Wait on bio's on our list to complete before sending a barrier bio
+ * to the below device. Called with lo_lock held.
+ */
+static void loop_wait_on_bios(struct loop_device *lo)
+{
+ __lo_throttle(&lo->lo_bio_wait, &lo->lo_lock, !lo->lo_bio);
+}
+
+static void loop_wait_on_switch(struct loop_device *lo)
+{
+ __lo_throttle(&lo->lo_bio_wait, &lo->lo_lock, !lo->lo_switch);
+}
+
static int loop_make_request(struct request_queue *q, struct bio *old_bio)
{
struct loop_device *lo = q->queuedata;
@@ -525,15 +733,39 @@ static int loop_make_request(struct requ
goto out;
if (unlikely(rw == WRITE && (lo->lo_flags & LO_FLAGS_READ_ONLY)))
goto out;
+ if (lo->lo_flags & LO_FLAGS_FASTFS) {
+ /*
+ * If we get a barrier bio, then we just need to wait for
+ * existing bio's to be complete. This can only happen
+ * on the 'new' extent mapped loop, since that is the only
+ * one that supports barriers.
+ */
+ if (bio_barrier(old_bio))
+ loop_wait_on_bios(lo);
+
+ /*
+ * if file switch is in progress, wait for it to complete
+ */
+ if (!lo_is_switch_bio(old_bio) && lo->lo_switch)
+ loop_wait_on_switch(lo);
+
+ if (loop_redirect_bio(lo, old_bio))
+ goto out_redir;
+ goto out_end;
+ }
loop_add_bio(lo, old_bio);
- wake_up(&lo->lo_event);
spin_unlock_irq(&lo->lo_lock);
return 0;

out:
+ bio_io_error(old_bio);
+out_end:
spin_unlock_irq(&lo->lo_lock);
- bio_io_error(old_bio);
return 0;
+
+out_redir:
+ spin_unlock_irq(&lo->lo_lock);
+ return 1;
}

/*
@@ -547,21 +779,117 @@ static void loop_unplug(struct request_q
blk_run_address_space(lo->lo_backing_file->f_mapping);
}

+static void loop_unplug_fastfs(struct request_queue *q)
+{
+ struct loop_device *lo = q->queuedata;
+ struct request_queue *rq = bdev_get_queue(lo->fs_bdev);
+
+ clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags);
+
+ if (rq->unplug_fn)
+ rq->unplug_fn(rq);
+}
+
struct switch_request {
struct file *file;
struct completion wait;
};

static void do_loop_switch(struct loop_device *, struct switch_request *);
+static int loop_init_fastfs(struct loop_device *);
+
+static void end_bio_hole_filling(struct bio *bio, int err)
+{
+ struct bio *orig_bio = bio->bi_private;
+ struct inode *inode = bio->bi_bdev->bd_inode;
+ u64 start = orig_bio->bi_sector << 9;
+ u64 len = bio->bi_size;
+
+ if (inode->i_mapping->a_ops->extent_io_complete) {
+ inode->i_mapping->a_ops->extent_io_complete(inode->i_mapping,
+ start, len);
+ }
+ bio_put(bio);
+ bio_endio(orig_bio, err);
+}
+
+static int fill_extent_hole(struct loop_device *lo, struct bio *bio)
+{
+ struct address_space *mapping = lo->lo_backing_file->f_mapping;
+ struct bio *new_bio;
+ struct extent_map *em;
+ u64 len = bio->bi_size;
+ u64 start = lo_bio_offset(lo, bio);
+ u64 disk_block;
+ u64 extent_off;
+ int err;
+
+ new_bio = bio_clone(bio, GFP_NOIO);
+ if (!new_bio) {
+ err = -ENOMEM;
+ goto out;
+ }
+ /*
+ * change the sector so we can find the correct file offset in our
+ * endio
+ */
+ bio->bi_sector = lo_bio_offset(lo, bio) >> 9;
+
+ mutex_lock(&mapping->host->i_mutex);
+
+ em = mapping->a_ops->map_extent(mapping, NULL, 0,
+ start, len, 1, GFP_KERNEL);
+ mark_inode_dirty(mapping->host);
+ mutex_unlock(&mapping->host->i_mutex);
+
+ if (!em || IS_ERR(em)) {
+ err = -EIO;
+ goto out;;
+ }
+
+
+ disk_block = em->block_start;
+ extent_off = start - em->start;
+ new_bio->bi_sector = (disk_block + extent_off) >> 9;
+ new_bio->bi_bdev = em->bdev;
+ new_bio->bi_private = bio;
+ new_bio->bi_size = bio->bi_size;
+ new_bio->bi_end_io = end_bio_hole_filling;
+ free_extent_map(em);
+
+ generic_make_request(new_bio);
+ return 0;
+
+out:
+ bio_endio(bio, err);
+ return 0;
+}

static inline void loop_handle_bio(struct loop_device *lo, struct bio *bio)
{
- if (unlikely(!bio->bi_bdev)) {
+ struct extent_map *lfe;
+
+ if (lo_is_map_bio(bio)) {
+ lfe = loop_lookup_extent(lo, lo_bio_offset(lo, bio),
+ GFP_KERNEL);
+ free_extent_map(lfe);
+ if (bio->bi_private)
+ complete(bio->bi_private);
+ else
+ bio_put(bio);
+ } else if (lo_is_switch_bio(bio)) {
do_loop_switch(lo, bio->bi_private);
bio_put(bio);
} else {
- int ret = do_bio_filebacked(lo, bio);
- bio_endio(bio, ret);
+ int ret;
+
+ if (lo->lo_flags & LO_FLAGS_FASTFS) {
+ /* we only get here when filling holes */
+ ret = fill_extent_hole(lo, bio);
+ } else {
+ ret = do_bio_filebacked(lo, bio);
+ bio_endio(bio, ret);
+ }
}
}

@@ -581,6 +909,7 @@ static int loop_thread(void *data)
{
struct loop_device *lo = data;
struct bio *bio;
+ int bio_act;

set_user_nice(current, -20);

@@ -588,7 +917,6 @@ static int loop_thread(void *data)

wait_event_interruptible(lo->lo_event,
lo->lo_bio || kthread_should_stop());
-
if (!lo->lo_bio)
continue;
spin_lock_irq(&lo->lo_lock);
@@ -596,7 +924,19 @@ static int loop_thread(void *data)
spin_unlock_irq(&lo->lo_lock);

BUG_ON(!bio);
+ if (lo_act_bio(bio))
+ bio_act = 1;
+ else
+ bio_act = 0;
+
loop_handle_bio(lo, bio);
+
+ spin_lock_irq(&lo->lo_lock);
+ if (bio_act)
+ lo->lo_bio_cnt--;
+ if (lo->lo_bio_cnt < LO_BIO_THROTTLE || !lo->lo_bio)
+ wake_up(&lo->lo_bio_wait);
+ spin_unlock_irq(&lo->lo_lock);
}

return 0;
@@ -610,13 +950,15 @@ static int loop_switch(struct loop_devic
static int loop_switch(struct loop_device *lo, struct file *file)
{
struct switch_request w;
- struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+ struct bio *bio = bio_alloc(GFP_KERNEL, 0);
if (!bio)
return -ENOMEM;
init_completion(&w.wait);
w.file = file;
bio->bi_private = &w;
bio->bi_bdev = NULL;
+ bio->bi_rw = LOOP_SWITCH_RW_MAGIC;
+ lo->lo_switch = 1;
loop_make_request(lo->lo_queue, bio);
wait_for_completion(&w.wait);
return 0;
@@ -630,6 +972,10 @@ static void do_loop_switch(struct loop_d
struct file *file = p->file;
struct file *old_file = lo->lo_backing_file;
struct address_space *mapping = file->f_mapping;
+ const int fastfs = lo->lo_flags & LO_FLAGS_FASTFS;
+
+ if (fastfs)
+ loop_exit_fastfs(lo);

mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
lo->lo_backing_file = file;
@@ -637,6 +983,13 @@ static void do_loop_switch(struct loop_d
mapping->host->i_bdev->bd_block_size : PAGE_SIZE;
lo->old_gfp_mask = mapping_gfp_mask(mapping);
mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
+
+ if (fastfs)
+ loop_init_fastfs(lo);
+
+ lo->lo_switch = 0;
+ wake_up(&lo->lo_bio_wait);
+
complete(&p->wait);
}

@@ -700,6 +1053,85 @@ static int loop_change_fd(struct loop_de
return error;
}

+/*
+ * See if adding this bvec would cause us to spill into a new extent. If so,
+ * disallow the add to start a new bio. This ensures that the bio we receive
+ * in loop_make_request() never spans two extents or more.
+ */
+static int loop_merge_bvec(struct request_queue *q, struct bio *bio,
+ struct bio_vec *bvec)
+{
+ struct loop_device *lo = q->queuedata;
+ struct extent_map *lfe;
+ unsigned int ret;
+ u64 start;
+ u64 len;
+
+ start = lo_bio_offset(lo, bio);
+ len = bio->bi_size + bvec->bv_len;
+ ret = bvec->bv_len;
+
+ lfe = loop_lookup_extent(lo, start, GFP_ATOMIC);
+ if (lfe && !IS_ERR(lfe)) {
+ /*
+ * have extent, disallow if outside that extent
+ */
+ if (start + len > lfe->start + lfe->len)
+ ret = 0;
+
+ free_extent_map(lfe);
+ } else {
+ if (bio->bi_size)
+ ret = 0;
+ }
+ return ret;
+}
+
+/*
+ * Initialize the members pertaining to extent mapping. We will populate
+ * the tree lazily on demand, as a full scan of a big file can take some
+ * time.
+ */
+static int loop_init_fastfs(struct loop_device *lo)
+{
+ struct file *file = lo->lo_backing_file;
+ struct inode *inode = file->f_mapping->host;
+ struct request_queue *fs_q;
+ int ret;
+
+ if (!S_ISREG(inode->i_mode))
+ return -EINVAL;
+
+ /*
+ * Need a working bmap. TODO: use the same optimization that
+ * direct-io.c does for get_block() mapping more than one block
+ * at the time.
+ */
+ if (inode->i_mapping->a_ops->bmap == NULL)
+ return -EINVAL;
+ /*
+ * invalidate all page cache belonging to this file, it could become
+ * stale when we directly overwrite blocks.
+ */
+ ret = invalidate_inode_pages2(file->f_mapping);
+ if (unlikely(ret))
+ return ret;
+
+ lo->blkbits = inode->i_blkbits;
+ lo->fs_bdev = file->f_mapping->host->i_sb->s_bdev;
+ lo->lo_flags |= LO_FLAGS_FASTFS;
+ lo->lo_queue->unplug_fn = loop_unplug_fastfs;
+
+ blk_queue_merge_bvec(lo->lo_queue, loop_merge_bvec);
+ blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN, NULL);
+
+ fs_q = bdev_get_queue(lo->fs_bdev);
+ blk_queue_stack_limits(lo->lo_queue, fs_q);
+
+ printk(KERN_INFO "loop%d: fast redirect\n", lo->lo_number);
+ return 0;
+}
+
static inline int is_loop_device(struct file *file)
{
struct inode *i = file->f_mapping->host;
@@ -748,6 +1180,7 @@ static int loop_set_fd(struct loop_devic

mapping = file->f_mapping;
inode = mapping->host;
+ lo->lo_flags = 0;

if (!(file->f_mode & FMODE_WRITE))
lo_flags |= LO_FLAGS_READ_ONLY;
@@ -810,6 +1243,12 @@ static int loop_set_fd(struct loop_devic
bd_set_size(bdev, size << 9);

set_blocksize(bdev, lo_blocksize);
+
+ /*
+ * This needs to be done after setup with another ioctl,
+ * not automatically like this.
+ */
+ loop_init_fastfs(lo);

lo->lo_thread = kthread_create(loop_thread, lo, "loop%d",
lo->lo_number);
@@ -896,6 +1335,9 @@ static int loop_clr_fd(struct loop_devic

kthread_stop(lo->lo_thread);

+ if (lo->lo_flags & LO_FLAGS_FASTFS)
+ loop_exit_fastfs(lo);
+
lo->lo_backing_file = NULL;

loop_release_xfer(lo);
@@ -943,6 +1385,9 @@ loop_set_status(struct loop_device *lo,
if (info->lo_encrypt_type) {
unsigned int type = info->lo_encrypt_type;

+ if (lo->lo_flags & LO_FLAGS_FASTFS)
+ return -EINVAL;
+
if (type >= MAX_LO_CRYPT)
return -EINVAL;
xfer = xfer_funcs[type];
@@ -950,6 +1395,13 @@ loop_set_status(struct loop_device *lo,
return -EINVAL;
} else
xfer = NULL;
+
+ /*
+ * for remaps, offset must be a multiple of full blocks
+ */
+ if ((lo->lo_flags & LO_FLAGS_FASTFS) &&
+ (((1 << lo->blkbits) - 1) & info->lo_offset))
+ return -EINVAL;

err = loop_init_xfer(lo, xfer, info);
if (err)
@@ -1152,6 +1604,9 @@ static int lo_ioctl(struct inode * inode
break;
case LOOP_GET_STATUS64:
err = loop_get_status64(lo, (struct loop_info64 __user *) arg);
+ break;
+ case LOOP_SET_FASTFS:
+ err = loop_init_fastfs(lo);
break;
default:
err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL;
@@ -1412,6 +1867,7 @@ static struct loop_device *loop_alloc(in
lo->lo_number = i;
lo->lo_thread = NULL;
init_waitqueue_head(&lo->lo_event);
+ init_waitqueue_head(&lo->lo_bio_wait);
spin_lock_init(&lo->lo_lock);
disk->major = LOOP_MAJOR;
disk->first_minor = i;
diff --git a/fs/Makefile b/fs/Makefile
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
- stack.o
+ stack.o extent_map.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -31,6 +31,7 @@
#include <linux/seqlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
+#include <linux/extent_map.h>
#include "internal.h"


@@ -2170,6 +2171,7 @@ void __init vfs_caches_init(unsigned lon

dcache_init();
inode_init();
+ extent_map_init();
files_init(mempages);
mnt_init();
bdev_cache_init();
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -1,5 +1,6 @@
#include <linux/fs.h>
#include <linux/ext2_fs.h>
+#include <linux/extent_map.h>

/*
* ext2 mount options
@@ -62,6 +63,8 @@ struct ext2_inode_info {
struct mutex truncate_mutex;
struct inode vfs_inode;
struct list_head i_orphan; /* unlinked but open inodes */
+
+ struct extent_map_tree extent_tree;
};

/*
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -584,6 +584,7 @@ got:
ei->i_block_alloc_info = NULL;
ei->i_block_group = group;
ei->i_dir_start_lookup = 0;
+ extent_map_tree_init(&ei->extent_tree);
ei->i_state = EXT2_STATE_NEW;
ext2_set_inode_flags(inode);
spin_lock(&sbi->s_next_gen_lock);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -31,6 +31,7 @@
#include <linux/writeback.h>
#include <linux/buffer_head.h>
#include <linux/mpage.h>
+#include <linux/extent_map.h>
#include "ext2.h"
#include "acl.h"
#include "xip.h"
@@ -70,9 +71,11 @@ void ext2_delete_inode (struct inode * i
if (inode->i_blocks)
ext2_truncate (inode);
ext2_free_inode (inode);
+ remove_extent_mappings(&EXT2_I(inode)->extent_tree, 0, (u64)-1);

return;
no_delete:
+ remove_extent_mappings(&EXT2_I(inode)->extent_tree, 0, (u64)-1);
clear_inode(inode); /* We must guarantee clearing of inode... */
}

@@ -709,6 +712,16 @@ int ext2_get_block(struct inode *inode,

}

+static struct extent_map *ext2_map_extent(struct address_space *mapping,
+ struct page *page,
+ size_t page_offset, u64 start,
+ u64 len, int create, gfp_t gfp_mask)
+{
+ return map_extent_get_block(&EXT2_I(mapping->host)->extent_tree,
+ mapping, start, len, create, gfp_mask,
+ ext2_get_block);
+}
+
static int ext2_writepage(struct page *page, struct writeback_control *wbc)
{
return block_write_full_page(page, ext2_get_block, wbc);
@@ -796,6 +809,7 @@ const struct address_space_operations ex
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
+ .map_extent = ext2_map_extent,
};

const struct address_space_operations ext2_aops_xip = {
@@ -1242,6 +1256,7 @@ void ext2_read_inode (struct inode * ino
ei->i_state = 0;
ei->i_block_group = (ino - 1) / EXT2_INODES_PER_GROUP(inode->i_sb);
ei->i_dir_start_lookup = 0;
+ extent_map_tree_init(&ei->extent_tree);

/*
* NOTE! The in-memory inode i_data array is in little-endian order
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -156,6 +156,7 @@ static struct inode *ext2_alloc_inode(st

static void ext2_destroy_inode(struct inode *inode)
{
+ remove_extent_mappings(&EXT2_I(inode)->extent_tree, 0, (u64)-1);
kmem_cache_free(ext2_inode_cachep, EXT2_I(inode));
}

@@ -168,6 +169,7 @@ static void init_once(struct kmem_cache
init_rwsem(&ei->xattr_sem);
#endif
mutex_init(&ei->truncate_mutex);
+ extent_map_tree_init(&ei->extent_tree);
inode_init_once(&ei->vfs_inode);
}

diff --git a/fs/extent_map.c b/fs/extent_map.c
new file mode 100644
--- /dev/null
+++ b/fs/extent_map.c
@@ -0,0 +1,402 @@
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/version.h>
+#include <linux/fs.h>
+#include <linux/buffer_head.h>
+#include <linux/extent_map.h>
+
+static struct kmem_cache *extent_map_cache;
+
+int __init extent_map_init(void)
+{
+ extent_map_cache = kmem_cache_create("extent_map",
+ sizeof(struct extent_map), 0,
+ SLAB_MEM_SPREAD, NULL);
+ if (!extent_map_cache)
+ return -ENOMEM;
+ return 0;
+}
+
+void __exit extent_map_exit(void)
+{
+ if (extent_map_cache)
+ kmem_cache_destroy(extent_map_cache);
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree)
+{
+ tree->map.rb_node = NULL;
+ tree->last = NULL;
+ rwlock_init(&tree->lock);
+}
+EXPORT_SYMBOL(extent_map_tree_init);
+
+struct extent_map *alloc_extent_map(gfp_t mask)
+{
+ struct extent_map *em;
+ em = kmem_cache_alloc(extent_map_cache, mask);
+ if (!em || IS_ERR(em))
+ return em;
+ atomic_set(&em->refs, 1);
+ em->flags = 0;
+ return em;
+}
+EXPORT_SYMBOL(alloc_extent_map);
+
+void free_extent_map(struct extent_map *em)
+{
+ if (!em)
+ return;
+ if (atomic_dec_and_test(&em->refs))
+ kmem_cache_free(extent_map_cache, em);
+}
+EXPORT_SYMBOL(free_extent_map);
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+ struct rb_node *node)
+{
+ struct rb_node ** p = &root->rb_node;
+ struct rb_node * parent = NULL;
+ struct extent_map *entry;
+
+ while(*p) {
+ parent = *p;
+ entry = rb_entry(parent, struct extent_map, rb_node);
+
+ if (offset < entry->start)
+ p = &(*p)->rb_left;
+ else if (offset >= entry->start + entry->len)
+ p = &(*p)->rb_right;
+ else
+ return parent;
+ }
+
+ entry = rb_entry(node, struct extent_map, rb_node);
+ rb_link_node(node, parent, p);
+ rb_insert_color(node, root);
+ return NULL;
+}
+
+static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
+ struct rb_node **prev_ret)
+{
+ struct rb_node * n = root->rb_node;
+ struct rb_node *prev = NULL;
+ struct extent_map *entry;
+ struct extent_map *prev_entry = NULL;
+
+ while(n) {
+ entry = rb_entry(n, struct extent_map, rb_node);
+ prev = n;
+ prev_entry = entry;
+
+ if (offset < entry->start)
+ n = n->rb_left;
+ else if (offset >= entry->start + entry->len)
+ n = n->rb_right;
+ else
+ return n;
+ }
+ if (!prev_ret)
+ return NULL;
+ while(prev && (offset >= prev_entry->start + prev_entry->len)) {
+ prev = rb_next(prev);
+ prev_entry = rb_entry(prev, struct extent_map, rb_node);
+ }
+ *prev_ret = prev;
+ return NULL;
+}
+
+static inline struct rb_node *tree_search(struct rb_root *root, u64 offset)
+{
+ struct rb_node *prev;
+ struct rb_node *ret;
+ ret = __tree_search(root, offset, &prev);
+ if (!ret)
+ return prev;
+ return ret;
+}
+
+static int tree_delete(struct rb_root *root, u64 offset)
+{
+ struct rb_node *node;
+ struct extent_map *entry;
+
+ node = __tree_search(root, offset, NULL);
+ if (!node)
+ return -ENOENT;
+ entry = rb_entry(node, struct extent_map, rb_node);
+ rb_erase(node, root);
+ return 0;
+}
+
+static int mergable_maps(struct extent_map *prev, struct extent_map *next)
+{
+ if (extent_map_end(prev) == next->start &&
+ prev->flags == next->flags &&
+ ((next->block_start == EXTENT_MAP_HOLE &&
+ prev->block_start == EXTENT_MAP_HOLE) ||
+ (next->block_start == EXTENT_MAP_INLINE &&
+ prev->block_start == EXTENT_MAP_INLINE) ||
+ (next->block_start == EXTENT_MAP_DELALLOC &&
+ prev->block_start == EXTENT_MAP_DELALLOC) ||
+ (next->block_start < EXTENT_MAP_LAST_BYTE - 1 &&
+ next->block_start == extent_map_block_end(prev)))) {
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * add_extent_mapping tries a simple forward/backward merge with existing
+ * mappings. The extent_map struct passed in will be inserted into
+ * the tree directly (no copies made, just a reference taken).
+ */
+int add_extent_mapping(struct extent_map_tree *tree,
+ struct extent_map *em)
+{
+ int ret = 0;
+ struct extent_map *merge = NULL;
+ struct rb_node *rb;
+ unsigned long flags;
+
+ write_lock_irqsave(&tree->lock, flags);
+ rb = tree_insert(&tree->map, em->start, &em->rb_node);
+ if (rb) {
+ ret = -EEXIST;
+ goto out;
+ }
+ atomic_inc(&em->refs);
+ if (em->start != 0) {
+ rb = rb_prev(&em->rb_node);
+ if (rb)
+ merge = rb_entry(rb, struct extent_map, rb_node);
+ if (rb && mergable_maps(merge, em)) {
+ em->start = merge->start;
+ em->len += merge->len;
+ em->block_start = merge->block_start;
+ rb_erase(&merge->rb_node, &tree->map);
+ free_extent_map(merge);
+ }
+ }
+ rb = rb_next(&em->rb_node);
+ if (rb)
+ merge = rb_entry(rb, struct extent_map, rb_node);
+ if (rb && mergable_maps(em, merge)) {
+ em->len += merge->len;
+ rb_erase(&merge->rb_node, &tree->map);
+ free_extent_map(merge);
+ }
+ tree->last = em;
+out:
+ write_unlock_irqrestore(&tree->lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL(add_extent_mapping);
+
+/*
+ * lookup_extent_mapping returns the first extent_map struct in the
+ * tree that intersects the [start, len] range. There may
+ * be additional objects in the tree that intersect, so check the object
+ * returned carefully to make sure you don't need additional lookups.
+ */
+struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
+ u64 start, u64 len)
+{
+ struct extent_map *em;
+ struct extent_map *last;
+ struct rb_node *rb_node;
+ unsigned long flags;
+
+ read_lock_irqsave(&tree->lock, flags);
+ last = tree->last;
+ if (last && start >= last->start &&
+ start + len <= extent_map_end(last)) {
+ em = last;
+ atomic_inc(&em->refs);
+ goto out;
+ }
+ rb_node = tree_search(&tree->map, start);
+ if (!rb_node) {
+ em = NULL;
+ goto out;
+ }
+ if (IS_ERR(rb_node)) {
+ em = ERR_PTR(PTR_ERR(rb_node));
+ goto out;
+ }
+ em = rb_entry(rb_node, struct extent_map, rb_node);
+ if (extent_map_end(em) <= start || em->start >= start + len) {
+ em = NULL;
+ goto out;
+ }
+ atomic_inc(&em->refs);
+ tree->last = em;
+out:
+ read_unlock_irqrestore(&tree->lock, flags);
+ return em;
+}
+EXPORT_SYMBOL(lookup_extent_mapping);
+
+/*
+ * removes an extent_map struct from the tree. No reference counts are
+ * dropped, and no checks are done to see if the range is in use
+ */
+int remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
+{
+ int ret;
+ unsigned long flags;
+
+ write_lock_irqsave(&tree->lock, flags);
+ ret = tree_delete(&tree->map, em->start);
+ tree->last = NULL;
+ write_unlock_irqrestore(&tree->lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL(remove_extent_mapping);
+
+static struct extent_map *__map_extent(struct extent_map_tree *tree,
+ struct address_space *mapping,
+ u64 start, u64 len, int create,
+ gfp_t gfp_mask, get_block_t get_block)
+{
+ struct inode *inode = mapping->host;
+ struct extent_map *em;
+ struct buffer_head result;
+ sector_t start_block;
+ u64 cur_len;
+ int ret;
+
+again:
+ em = lookup_extent_mapping(tree, start, len);
+ if (em) {
+ /*
+ * we may have found an extent that starts after the
+ * requested range. Double check and alter the length
+ * appropriately
+ */
+ if (em->start > start) {
+ len = em->start - start;
+ } else if (!create || em->block_start != EXTENT_MAP_HOLE) {
+ return em;
+ }
+ free_extent_map(em);
+
+ }
+ if (gfp_mask & GFP_ATOMIC)
+ return NULL;
+
+ em = alloc_extent_map(GFP_NOFS);
+ if (!em)
+ return ERR_PTR(-ENOMEM);
+
+ len = min_t(u64, len, (size_t)-1);
+ result.b_state = 0;
+ result.b_size = len;
+ start_block = start >> inode->i_blkbits;
+
+ if (len < inode->i_sb->s_blocksize) {
+ printk("warning2: mapping length %Lu\n", len);
+ }
+
+ /*
+ * FIXME if there are errors later on, we end up exposing stale
+ * data on disk while filling holes.
+ */
+ ret = get_block(inode, start_block,
+ &result, create);
+ if (ret < 0) {
+ free_extent_map(em);
+ return ERR_PTR(ret);
+ }
+
+ cur_len = result.b_size;
+ em->start = start;
+ em->len = cur_len;
+ em->bdev = result.b_bdev;
+
+ if (create && buffer_new(&result)) {
+ remove_extent_mappings(tree, em->start, em->len);
+ em->flags = (1 << EXTENT_MAP_HOLE_FILLED);
+ }
+
+ if (buffer_mapped(&result))
+ em->block_start = (u64)result.b_blocknr << inode->i_blkbits;
+ else {
+ em->block_start = EXTENT_MAP_HOLE;
+ if (create) {
+ free_extent_map(em);
+ return ERR_PTR(-EIO);
+ }
+ }
+ ret = add_extent_mapping(tree, em);
+ if (ret == -EEXIST) {
+ free_extent_map(em);
+ goto again;
+ }
+ return em;
+}
+
+struct extent_map *map_extent_get_block(struct extent_map_tree *tree,
+ struct address_space *mapping,
+ u64 start, u64 len, int create,
+ gfp_t gfp_mask, get_block_t get_block)
+{
+ struct extent_map *em;
+ u64 last;
+ u64 map_ahead_len = 0;
+
+ em = __map_extent(tree, mapping, start, len, create,
+ gfp_mask, get_block);
+
+ /*
+ * if we're doing a write or we found a large extent, return it
+ */
+ if (IS_ERR(em) || !em || create || start + len < extent_map_end(em)) {
+ return em;
+ }
+
+ /*
+ * otherwise, try to walk forward a bit and see if we can build
+ * something bigger.
+ */
+ do {
+ last = extent_map_end(em);
+ free_extent_map(em);
+ em = __map_extent(tree, mapping, last, len, create,
+ gfp_mask, get_block);
+ if (IS_ERR(em) || !em)
+ break;
+ map_ahead_len += extent_map_end(em) - last;
+ } while(em->start <= start && start + len <= extent_map_end(em) &&
+ em->block_start < EXTENT_MAP_LAST_BYTE &&
+ map_ahead_len < (512 * 1024));
+
+ /* make sure we return the extent for this range */
+ if (!em || IS_ERR(em) || em->start > start ||
+ start + len > extent_map_end(em)) {
+ free_extent_map(em);
+ em = __map_extent(tree, mapping, start, len, create,
+ gfp_mask, get_block);
+ }
+ return em;
+}
+EXPORT_SYMBOL(map_extent_get_block);
+
+int remove_extent_mappings(struct extent_map_tree *tree,
+ u64 start, u64 len)
+{
+ struct extent_map *em;
+
+ while((em = lookup_extent_mapping(tree, start, len))) {
+ remove_extent_mapping(tree, em);
+ /* once for us */
+ free_extent_map(em);
+ /* once for the tree */
+ free_extent_map(em);
+ }
+ return 0;
+}
+EXPORT_SYMBOL(remove_extent_mappings);
diff --git a/include/linux/extent_map.h b/include/linux/extent_map.h
new file mode 100644
--- /dev/null
+++ b/include/linux/extent_map.h
@@ -0,0 +1,58 @@
+#ifndef __EXTENTMAP__
+#define __EXTENTMAP__
+
+#include <linux/rbtree.h>
+
+/* special values for struct extent_map->block_start */
+#define EXTENT_MAP_LAST_BYTE (u64)-4
+#define EXTENT_MAP_HOLE (u64)-3
+#define EXTENT_MAP_INLINE (u64)-2
+#define EXTENT_MAP_DELALLOC (u64)-1
+
+/* bit flags for struct extent_map->flags */
+#define EXTENT_MAP_COMMIT_REQUIRED 1
+#define EXTENT_MAP_HOLE_FILLED 2
+
+struct extent_map_tree {
+ struct rb_root map;
+ rwlock_t lock;
+ struct extent_map *last;
+};
+
+struct extent_map {
+ struct rb_node rb_node;
+ u64 start;
+ u64 len;
+ u64 block_start;
+ struct block_device *bdev;
+ atomic_t refs;
+ unsigned long flags;
+};
+
+static inline u64 extent_map_end(struct extent_map *em)
+{
+ return em->start + em->len;
+}
+
+static inline u64 extent_map_block_end(struct extent_map *em)
+{
+ return em->block_start + em->len;
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree);
+struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
+ u64 start, u64 end);
+struct extent_map *map_extent_get_block(struct extent_map_tree *tree,
+ struct address_space *mapping,
+ u64 start, u64 len, int create,
+ gfp_t gfp_mask, get_block_t get_block);
+int add_extent_mapping(struct extent_map_tree *tree,
+ struct extent_map *em);
+int remove_extent_mappings(struct extent_map_tree *tree,
+ u64 start, u64 len);
+int remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
+struct extent_map *alloc_extent_map(gfp_t mask);
+void free_extent_map(struct extent_map *em);
+int __init extent_map_init(void);
+void __exit extent_map_exit(void);
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -439,6 +439,7 @@ static inline size_t iov_iter_count(stru
}


+struct extent_map;
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
@@ -479,6 +480,15 @@ struct address_space_operations {
int (*migratepage) (struct address_space *,
struct page *, struct page *);
int (*launder_page) (struct page *);
+
+ /* raw extent mapping to the disk */
+ struct extent_map *(*map_extent)(struct address_space *mapping,
+ struct page *page,
+ size_t page_offset,
+ u64 start, u64 len, int create,
+ gfp_t gfp_mask);
+ int (*extent_io_complete)(struct address_space *mapping,
+ u64 start, u64 len);
};

/*
diff --git a/include/linux/loop.h b/include/linux/loop.h
--- a/include/linux/loop.h
+++ b/include/linux/loop.h
@@ -18,6 +18,7 @@
#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/mutex.h>
+#include <linux/rbtree.h>

/* Possible states of device */
enum {
@@ -50,22 +51,31 @@ struct loop_device {

struct file * lo_backing_file;
struct block_device *lo_device;
+ struct block_device *fs_bdev;
unsigned lo_blocksize;
void *key_data;
+ unsigned int lo_switch;

gfp_t old_gfp_mask;

spinlock_t lo_lock;
struct bio *lo_bio;
struct bio *lo_biotail;
+ unsigned int lo_bio_cnt;
int lo_state;
struct mutex lo_ctl_mutex;
struct task_struct *lo_thread;
wait_queue_head_t lo_event;
+ wait_queue_head_t lo_bio_wait;

struct request_queue *lo_queue;
struct gendisk *lo_disk;
struct list_head lo_list;
+
+ struct prio_tree_root prio_root;
+ struct prio_tree_node *last_insert;
+ struct prio_tree_node *last_lookup;
+ unsigned int blkbits;
};

#endif /* __KERNEL__ */
@@ -76,6 +86,7 @@ enum {
enum {
LO_FLAGS_READ_ONLY = 1,
LO_FLAGS_USE_AOPS = 2,
+ LO_FLAGS_FASTFS = 4,
};

#include <asm/posix_types.h> /* for __kernel_old_dev_t */
@@ -159,5 +170,11 @@ int loop_unregister_transfer(int number)
#define LOOP_SET_STATUS64 0x4C04
#define LOOP_GET_STATUS64 0x4C05
#define LOOP_CHANGE_FD 0x4C06
+#define LOOP_SET_FASTFS 0x4C07
+
+enum {
+ LOOP_EXTENT_RW_MAGIC = 0x19283746,
+ LOOP_SWITCH_RW_MAGIC = 0xfeedbeef,
+};

#endif

2008-01-14 17:54:18

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Mon, Jan 14 2008, Chris Mason wrote:
> Hello everyone,
>
> Here is a modified version of Jens' patch. The basic idea is to push
> the mapping maintenance out of loop and down into the filesystem (ext2
> in this case).
>
> Two new address_space operations are added: one to map
> extents and the other to provide callbacks into the FS as IO is
> completed.
>
> Still TODO for this patch:
>
> * Add exclusion between filling holes and readers. This is partly
> implemented: when a hole is filled by the FS, the extent is flagged as
> having a hole. The idea is to check this flag before trying to read
> the blocks and just send back all zeros.
>
> The flag would be cleared when the blocks filling the hole have been
> written.
>
> * Exclude page cache readers and writers
>
> * Add a way for the FS to request a commit before telling the higher
> layers the IO is complete. This way we can make sure metadata related
> to holes is on disk before claiming the IO is really done. COW-based
> filesystems will also need it.
>
> * Change loop to use fast mapping only when the new address_space op is
> provided (whoops, forgot about this until just now)
>
> * A few other bits for COW, not really relevant until there
> is...something COW using it.

Looks pretty good. Essentially the loop side is 100% the same; it just
offloads the extent ownership to the fs (where it belongs). So I like
it. Attaching a small cleanup/fixup patch for loop; I don't think it
needs further explanation.

One suggestion - free_extent_map(), I would call that put_extent_map()
instead.

diff -u b/drivers/block/loop.c b/drivers/block/loop.c
--- b/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -677,13 +677,14 @@
if (IS_ERR(lfe))
return -EIO;

- while(!lfe) {
+ while (!lfe) {
loop_schedule_extent_mapping(lo, bio->bi_sector,
bio->bi_size, 1);
lfe = loop_lookup_extent(lo, start, GFP_ATOMIC);
if (IS_ERR(lfe))
return -EIO;
}
+
/*
* handle sparse io
*/
@@ -802,13 +803,13 @@
{
struct bio *orig_bio = bio->bi_private;
struct inode *inode = bio->bi_bdev->bd_inode;
+ struct address_space *mapping = inode->i_mapping;
u64 start = orig_bio->bi_sector << 9;
u64 len = bio->bi_size;

- if (inode->i_mapping->a_ops->extent_io_complete) {
- inode->i_mapping->a_ops->extent_io_complete(inode->i_mapping,
- start, len);
- }
+ if (mapping->a_ops->extent_io_complete)
+ mapping->a_ops->extent_io_complete(mapping, start, len);
+
bio_put(bio);
bio_endio(orig_bio, err);
}
@@ -829,6 +830,7 @@
err = -ENOMEM;
goto out;
}
+
/*
* change the sector so we can find the correct file offset in our
* endio
@@ -847,7 +849,6 @@
goto out;;
}

-
disk_block = em->block_start;
extent_off = start - em->start;
new_bio->bi_sector = (disk_block + extent_off) >> 9;
@@ -924,11 +925,8 @@
spin_unlock_irq(&lo->lo_lock);

BUG_ON(!bio);
- if (lo_act_bio(bio))
- bio_act = 1;
- else
- bio_act = 0;

+ bio_act = lo_act_bio(bio);
loop_handle_bio(lo, bio);

spin_lock_irq(&lo->lo_lock);
@@ -1103,11 +1101,9 @@
return -EINVAL;

/*
- * Need a working bmap. TODO: use the same optimization that
- * direct-io.c does for get_block() mapping more than one block
- * at the time.
+ * Need a working extent_map
*/
- if (inode->i_mapping->a_ops->bmap == NULL)
+ if (inode->i_mapping->a_ops->map_extent == NULL)
return -EINVAL;
/*
* invalidate all page cache belonging to this file, it could become
diff -u b/include/linux/loop.h b/include/linux/loop.h
--- b/include/linux/loop.h
+++ b/include/linux/loop.h
@@ -18,7 +18,6 @@
#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/mutex.h>
-#include <linux/rbtree.h>

/* Possible states of device */
enum {
@@ -72,9 +71,6 @@
struct gendisk *lo_disk;
struct list_head lo_list;

- struct prio_tree_root prio_root;
- struct prio_tree_node *last_insert;
- struct prio_tree_node *last_lookup;
unsigned int blkbits;
};


--
Jens Axboe

2008-01-15 09:25:29

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Mon, Jan 14 2008, Jens Axboe wrote:
> On Mon, Jan 14 2008, Chris Mason wrote:
> > Hello everyone,
> >
> > Here is a modified version of Jens' patch. The basic idea is to push
> > the mapping maintenance out of loop and down into the filesystem (ext2
> > in this case).
> >
> > Two new address_space operations are added: one to map
> > extents and the other to provide callbacks into the FS as IO is
> > completed.
> >
> > Still TODO for this patch:
> >
> > * Add exclusion between filling holes and readers. This is partly
> > implemented: when a hole is filled by the FS, the extent is flagged as
> > having a hole. The idea is to check this flag before trying to read
> > the blocks and just send back all zeros.
> >
> > The flag would be cleared when the blocks filling the hole have been
> > written.
> >
> > * Exclude page cache readers and writers
> >
> > * Add a way for the FS to request a commit before telling the higher
> > layers the IO is complete. This way we can make sure metadata related
> > to holes is on disk before claiming the IO is really done. COW-based
> > filesystems will also need it.
> >
> > * Change loop to use fast mapping only when the new address_space op is
> > provided (whoops, forgot about this until just now)
> >
> > * A few other bits for COW, not really relevant until there
> > is...something COW using it.
>
> Looks pretty good. Essentially the loop side is 100% the same, it just
> offloads the extent ownership to the fs (where it belongs). So I like
> it. Attaching a small cleanup/fixup patch for loop, don't think it needs
> further explanations.
>
> One suggestion - free_extent_map(), I would call that put_extent_map()
> instead.

I split and merged the patch into five bits (added ext3 support), so
perhaps that will be easier for people to read/review. They are attached
and also exist in the loop-extent_map branch here:

http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map

--
Jens Axboe


Attachments:
0001-fs-add-extent_map-library.patch (13.35 kB)
0002-fs-add-extent_map-hooks-to-the-address-space-operat.patch (1.25 kB)
0003-ext2-add-support-for-extent_map-API.patch (3.68 kB)
0004-ext3-add-support-for-extent_map-API.patch (4.35 kB)
0005-loop-fastfs-support.patch (17.59 kB)

2008-01-15 09:37:08

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Tue, Jan 15 2008, Jens Axboe wrote:
> On Mon, Jan 14 2008, Jens Axboe wrote:
> > On Mon, Jan 14 2008, Chris Mason wrote:
> > > Hello everyone,
> > >
> > > Here is a modified version of Jens' patch. The basic idea is to push
> > > the mapping maintenance out of loop and down into the filesystem (ext2
> > > in this case).
> > >
> > > Two new address_space operations are added: one to map
> > > extents and the other to provide callbacks into the FS as IO is
> > > completed.
> > >
> > > Still TODO for this patch:
> > >
> > > * Add exclusion between filling holes and readers. This is partly
> > > implemented: when a hole is filled by the FS, the extent is flagged as
> > > having a hole. The idea is to check this flag before trying to read
> > > the blocks and just send back all zeros.
> > >
> > > The flag would be cleared when the blocks filling the hole have been
> > > written.
> > >
> > > * Exclude page cache readers and writers
> > >
> > > * Add a way for the FS to request a commit before telling the higher
> > > layers the IO is complete. This way we can make sure metadata related
> > > to holes is on disk before claiming the IO is really done. COW-based
> > > filesystems will also need it.
> > >
> > > * Change loop to use fast mapping only when the new address_space op is
> > > provided (whoops, forgot about this until just now)
> > >
> > > * A few other bits for COW, not really relevant until there
> > > is...something COW using it.
> >
> > Looks pretty good. Essentially the loop side is 100% the same, it just
> > offloads the extent ownership to the fs (where it belongs). So I like
> > it. Attaching a small cleanup/fixup patch for loop, don't think it needs
> > further explanations.
> >
> > One suggestion - free_extent_map(), I would call that put_extent_map()
> > instead.
>
> I split and merged the patch into five bits (added ext3 support), so
> perhaps that would be easier for people to read/review. Attached and
> also exist in the loop-extent_map branch here:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map

Seems my ext3 version doesn't work - it craps out in
ext3_get_blocks_handle(), triggering this assert:

J_ASSERT(handle != NULL || create == 0);

I'll see if I can fix that, being fairly fs ignorant...

--
Jens Axboe

2008-01-15 10:08:02

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Tue, Jan 15 2008, Jens Axboe wrote:
> On Tue, Jan 15 2008, Jens Axboe wrote:
> > On Mon, Jan 14 2008, Jens Axboe wrote:
> > > On Mon, Jan 14 2008, Chris Mason wrote:
> > > > Hello everyone,
> > > >
> > > > Here is a modified version of Jens' patch. The basic idea is to push
> > > > the mapping maintenance out of loop and down into the filesystem (ext2
> > > > in this case).
> > > >
> > > > Two new address_space operations are added: one to map
> > > > extents and the other to provide callbacks into the FS as IO is
> > > > completed.
> > > >
> > > > Still TODO for this patch:
> > > >
> > > > * Add exclusion between filling holes and readers. This is partly
> > > > implemented: when a hole is filled by the FS, the extent is flagged as
> > > > having a hole. The idea is to check this flag before trying to read
> > > > the blocks and just send back all zeros.
> > > >
> > > > The flag would be cleared when the blocks filling the hole have been
> > > > written.
> > > >
> > > > * Exclude page cache readers and writers
> > > >
> > > > * Add a way for the FS to request a commit before telling the higher
> > > > layers the IO is complete. This way we can make sure metadata related
> > > > to holes is on disk before claiming the IO is really done. COW-based
> > > > filesystems will also need it.
> > > >
> > > > * Change loop to use fast mapping only when the new address_space op is
> > > > provided (whoops, forgot about this until just now)
> > > >
> > > > * A few other bits for COW, not really relevant until there
> > > > is...something COW using it.
> > >
> > > Looks pretty good. Essentially the loop side is 100% the same, it just
> > > offloads the extent ownership to the fs (where it belongs). So I like
> > > it. Attaching a small cleanup/fixup patch for loop, don't think it needs
> > > further explanations.
> > >
> > > One suggestion - free_extent_map(), I would call that put_extent_map()
> > > instead.
> >
> > I split and merged the patch into five bits (added ext3 support), so
> > perhaps that would be easier for people to read/review. Attached and
> > also exist in the loop-extent_map branch here:
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map
>
> Seems my ext3 version doesn't work - it craps out in
> ext3_get_blocks_handle(), triggering this assert:
>
> J_ASSERT(handle != NULL || create == 0);
>
> I'll see if I can fix that, being fairly fs ignorant...

This works, but it's probably pretty suboptimal (should the new journal
handle be ended in map_io_complete()?). And yes, I know the >> 9 isn't
correct, since the fs block size is larger; it's just making sure that
we always have enough journal credits.

Punting to Chris!

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 55e677d..e97181a 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1002,11 +1002,25 @@ static struct extent_map *ext3_map_extent(struct address_space *mapping,
gfp_t gfp_mask)
{
struct extent_map_tree *tree = &EXT3_I(mapping->host)->extent_tree;
+ handle_t *handle = NULL;
+ struct extent_map *ret;

- return map_extent_get_block(tree, mapping, start, len, create, gfp_mask,
+ if (create) {
+ handle = ext3_journal_start(mapping->host, len >> 9);
+ if (IS_ERR(handle))
+ return (struct extent_map *) handle;
+ }
+
+ ret = map_extent_get_block(tree, mapping, start, len, create, gfp_mask,
ext3_get_block);
+ if (handle)
+ ext3_journal_stop(handle);
+
+ return ret;
}

+
+
/*
* `handle' can be NULL if create is zero
*/

--
Jens Axboe

2008-01-15 14:05:44

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH][RFC] fast file mapping for loop

On Tue, 15 Jan 2008 11:07:40 +0100
Jens Axboe <[email protected]> wrote:

> > > I split and merged the patch into five bits (added ext3 support),
> > > so perhaps that would be easier for people to read/review.
> > > Attached and also exist in the loop-extent_map branch here:

Thanks!

> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map
> >
> > Seems my ext3 version doesn't work, it craps out in
> > ext3_get_blocks_handle() triggering this bug:
> >
> > J_ASSERT(handle != NULL || create == 0);
> >
> > I'll see if I can fix that, being fairly fs ignorant...
>
> This works, but probably pretty suboptimal (should end the new journal
> in map_io_complete()?). And yes I know the >> 9 isn't correct, since
> the fs block size is larger. Just making sure that we always have
> enough blocks.

You can use DIO_CREDITS instead of len >> 9, just like the ext3
O_DIRECT code does. Your current patch is fine, except it breaks
data=ordered rules. My plan to work within data=ordered (sketched in
code below):

1) Inside ext3_map_extent (while the transaction is running), increment
a counter in the ext3 journal for the number of pending IOs. Then end
the transaction handle.

2) Drop this counter inside the IO completion call.

3) Change the ext3 commit code to wait for the IO count to reach zero.

I'll give it a shot later this week; until then your current patch is
just data=writeback, which is good enough for testing.
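
A rough sketch of that counter - the field and function names are
invented for illustration; jbd's journal_t has no such members:

/* step 1: bumped inside ext3_map_extent while the handle is running */
static void ext3_extent_io_start(journal_t *journal)
{
	atomic_inc(&journal->j_pending_extent_ios);
}

/* step 2: dropped from the extent_io_complete callback */
static void ext3_extent_io_done(journal_t *journal)
{
	if (atomic_dec_and_test(&journal->j_pending_extent_ios))
		wake_up(&journal->j_extent_io_wait);
}

/* step 3: commit waits until all pending extent IO has landed */
static void ext3_commit_wait_extent_ios(journal_t *journal)
{
	wait_event(journal->j_extent_io_wait,
		   atomic_read(&journal->j_pending_extent_ios) == 0);
}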

-chris