2007-06-24 01:45:48

by Nick Piggin

Subject: [RFC] fsblock


I'm announcing "fsblock" now because it is quite intrusive and so I'd
like to get some thoughts about significantly changing this core part
of the kernel.

fsblock is a rewrite of the "buffer layer" (ding dong the witch is
dead), which I have been working on, on and off, and which is now at
the stage where some of the basics are working-ish. This email is
going to be long...

Firstly, what is the buffer layer? The buffer layer isn't really a
buffer layer as in the buffer cache of unix: the block device cache
is unified with the pagecache (in terms of the pagecache, a blkdev
file is just like any other, but with a 1:1 mapping between offset
and block).

There are filesystem APIs to access the block device, but these go
through the block device pagecache as well. These don't exactly
define the buffer layer either.

The buffer layer is a layer between the pagecache and the block
device for block based filesystems. It keeps a translation between
logical offset and physical block number, as well as meta
information such as locks, dirtiness, and IO status of each block.
This information is tracked via the buffer_head structure.
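
For reference, this is roughly the per-block state being talked about;
a simplified (abridged) sketch of the 2.6-era buffer_head, see
include/linux/buffer_head.h for the real thing:

struct buffer_head {
	unsigned long b_state;		/* lock, dirty, uptodate, ... flags */
	struct buffer_head *b_this_page;/* circular list of the page's buffers */
	struct page *b_page;		/* the pagecache page this maps */
	sector_t b_blocknr;		/* physical block number on the device */
	size_t b_size;			/* block size */
	char *b_data;			/* pointer into the page's data */
	struct block_device *b_bdev;	/* device the block lives on */
	atomic_t b_count;		/* users of this buffer_head */
	/* ... end_io callback, private data, assoc list, etc. */
};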

Why rewrite the buffer layer? Lots of people have had a desire to
completely rip out the buffer layer, but we can't do that[*] because
it does actually serve a useful purpose. Why the bad rap? Because
the code is old and crufty, and buffer_head is an awful name. It must
be among the oldest code in the core fs/vm, and it has stayed that way
mainly because of the inertia of so many and such complex filesystems.

[*] About the furthest we could go is use the struct page for the
information otherwise stored in the buffer_head, but this would be
tricky and suboptimal for filesystems with non page sized blocks and
would probably bloat the struct page as well.

So why rewrite rather than incremental improvements? Incremental
improvements are logically the correct way to do this, and we probably
could go from buffer.c to fsblock.c in steps. But I didn't do this
because: a) the blinding pace at which things move in this area would
make me an old man before it would be complete; b) I didn't actually
know exactly what it was going to look like before starting on it; c)
I wanted stable root filesystems and such when testing it; and d) I
found it reasonably easy to have both layers coexist (it uses an extra
page flag, but even that wouldn't be needed if the old buffer layer
was better decoupled from the page cache).

I started this as an exercise to see how the buffer layer could be
improved, and I think it is working out OK so far. The name is fsblock
because it basically ties the fs layer to the block layer. I think
Andrew has wanted to rename buffer_head to block before, but block is
too clashy, and it isn't a great deal more descriptive than buffer_head.
I believe fsblock is.

I'll go through a list of things where I have hopefully improved on the
buffer layer, off the top of my head. The big caveat here is that minix
is the only real filesystem I have converted so far, and complex
journalled filesystems might pose some problems that water down its
goodness (I don't know).

- data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on
64-bit (could easily be 32 if we can have int bitops). Compare this
to around 50 and 100ish for struct buffer_head. With a 4K page and 1K
blocks, IO requires 10% RAM overhead in buffer heads alone. With
fsblocks you're down to around 3%.
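
For comparison, a rough sketch of what struct fsblock carries, inferred
from the fs/fsblock.c code in the patch below (the real definition is
in include/linux/fsblock_types.h; debug and vmap-cache fields omitted,
and the exact layout here is a guess):

struct fsblock {
	unsigned long	flags;		/* BL_locked, BL_dirty, BL_uptodate, ... */
	atomic_t	count;		/* reference count */
	sector_t	block_nr;	/* physical block number */
	void		*private;	/* fs / association private data */
	struct page	*page;		/* (first) pagecache page of the block */
};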

- Structure packing. A page gets a number of buffer heads that are
allocated individually and chained into a linked list. fsblocks are
allocated contiguously, so
cacheline footprint is smaller in the above situation.

- Data / metadata separation. I have a struct fsblock and a struct
fsblock_meta, so we could put more stuff into the usually less used
fsblock_meta without bloating it up too much. After a few tricks, these
are no longer any different in my code, and dirty up the typing quite
a lot (and I'm aware it still has some warnings, thanks). So if not
useful this could be taken out.
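
From init_mblock() in the patch below, the metadata variant embeds a
plain fsblock and adds association fields; a rough sketch only (the
real definition is in include/linux/fsblock_types.h):

struct fsblock_meta {
	struct fsblock		block;		/* embedded plain fsblock */
	struct list_head	assoc_list;	/* association list linkage */
	struct address_space	*assoc_mapping;	/* associated inode mapping */
};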

- Locking. fsblocks completely use the pagecache for locking and lookups.
The page lock is used, but there is no extra per-inode lock that buffer
has. Would go very nicely with lockless pagecache. RCU is used for one
non-blocking fsblock lookup (find_get_block), but I'd really rather hope
filesystems can tolerate that lookup blocking, so we could get rid of RCU
completely.
(actually this is not quite true because mapping->private_lock is still
used for mark_buffer_dirty_inode equivalent, but that's a relatively
rare operation).

- Coupling with pagecache metadata. Pagecache pages contain some metadata
that is logically redundant because it is tracked in buffers as well
(eg. a page is dirty if one or more buffers are dirty, or uptodate if
all buffers are uptodate). This is great because it means we can avoid that
layer in some situations, but they can get out of sync. eg. if a
filesystem writes a buffer out by hand, its pagecache page will stay
dirty, and the next "writeout" will notice it has no dirty buffers and
call it clean. fsblock-based writeout or readin will update page
metadata too, which is cleaner. It also uses page locking for IO ops
instead of an extra layer of locking which seems nice.
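
The patch has an XXX about adding a page / block invariant checker; for
the sub-page case it would assert something like the following
(hypothetical sketch, not in the patch):

/*
 * A dirty page must have at least one dirty block, and an uptodate
 * page must have all of its blocks uptodate. fsblock read/write paths
 * keep page and block state in sync rather than letting them drift.
 */
static void check_page_block_state(struct page *page)
{
	struct fsblock *block = page_blocks(page), *b;
	int some_dirty = 0, all_uptodate = 1;

	if (!fsblock_subpage(block))
		return;

	for_each_block(block, b) {
		if (test_bit(BL_dirty, &b->flags))
			some_dirty = 1;
		if (!test_bit(BL_uptodate, &b->flags))
			all_uptodate = 0;
	}
	FSB_BUG_ON(PageDirty(page) && !some_dirty);
	FSB_BUG_ON(PageUptodate(page) && !all_uptodate);
}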

- No deadlocks (hopefully). The buffer layer is technically deadlocky by
design, because it can require memory allocations at page writeout-time.
It also has one path that cannot tolerate memory allocation failures.
No such problems for fsblock, which keeps fsblock metadata around for as
long as a page is dirty (this still has problems vs get_user_pages, but
that's going to require an audit of all get_user_pages sites. Phew).

- In line with the above item, filesystem block allocation is performed
before a page is dirtied. In the buffer layer, mmap writes can dirty a
page with no backing blocks, which is a problem if the filesystem runs
out of space (ENOSPC). (Patches exist for buffer.c for this.)

- Block memory accessors for filesystems. If the buffer layer were ever
to be replaced completely, this would mean block device pagecache would
not be restricted to lowmem. It also doesn't have theoretical CPU cache
aliasing problems that buffer heads do.
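
Accessor usage is roughly like this (hypothetical helper, based on
vmap_block()/vunmap_block() from the patch below; error handling
abridged):

/* Copy len bytes starting at off out of a block without assuming the
 * backing pages are mapped into lowmem. */
static void copy_from_block(struct fsblock *block, void *dst,
			    off_t off, size_t len)
{
	void *p = vmap_block(block, off, len);

	if (!IS_ERR(p)) {
		memcpy(dst, p, len);
		vunmap_block(block, off, len, p);
	}
}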

- A real "nobh" mode. nobh was created I think mainly to avoid problems
with buffer_head memory consumption, especially on lowmem machines. It
is basically a hack (sorry), which requires special code in filesystems,
and duplication of quite a bit of tricky buffer layer code (and bugs).
It also doesn't work so well for buffers with non-trivial private data
(like most journalling ones). fsblock implements this with basically a
few lines of code, and it should work in situations like ext3.

- Similarly, it gets around the circular reference problem where a buffer
holds a ref on a page and a page holds a ref on a buffer, but the page
has been removed from pagecache. These occur with some journalled fses
like ext3 ordered, and eventually fill up memory and have to be
reclaimed via the LRU (which is often not a problem, but I have seen
real workloads where the reclaim causes throughput to drop quite a lot).

- An inode's metadata must be tracked per-inode in order for fsync to
work correctly. buffer contains helpers to do this for basic
filesystems, but any given block can be metadata for only a single
inode. This is not really correct for things like inode descriptor
blocks. fsblock can track multiple inodes per block. (This is
non-trivial, and it may be overkill, so it could be reverted to a
simpler scheme like buffer's.)
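
In terms of the API in the patch, the intended usage is roughly
(hypothetical filesystem code):

/* After modifying an inode's on-disk descriptor within mblock,
 * associate the dirty metadata block with that inode, so fsync can
 * later find and write it. Several inodes may share one block. */
static void fs_dirty_inode_block(struct inode *inode,
				 struct fsblock_meta *mblock)
{
	/* ... update the inode descriptor inside the block ... */
	mark_mblock_dirty_inode(mblock, inode);
}

/* fsync path: write back and wait on all metadata blocks associated
 * with this inode's mapping. */
static int fs_sync_inode_metadata(struct inode *inode)
{
	return fsblock_sync(inode->i_mapping);
}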

- Large block support. I can mount and run an 8K block size minix3 fs on
my 4K page system and it didn't require anything special in the fs. We
can go up to about 32MB blocks now, and gigabyte+ blocks would only
require one more bit in the fsblock flags. fsblock_superpage blocks
are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
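
In other words (a sketch of what the predicates in
include/linux/fsblock.h presumably reduce to; the real helpers live in
the header, not shown here):

/* A superpage block spans several pagecache pages, a midpage block is
 * exactly one page, and several subpage blocks share a single page. */
static inline int fsblock_superpage(struct fsblock *block)
{
	return fsblock_size(block) > PAGE_CACHE_SIZE;
}

static inline int fsblock_midpage(struct fsblock *block)
{
	return fsblock_size(block) == PAGE_CACHE_SIZE;
}

static inline int fsblock_subpage(struct fsblock *block)
{
	return fsblock_size(block) < PAGE_CACHE_SIZE;
}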

Core pagecache code is pretty creaky with respect to this. I think it is
mostly race free, but it requires stupid unlocking and relocking hacks
because the vm usually passes single locked pages to the fs layers, and we
need to lock all pages of a block in offset ascending order. This could
be avoided by locking only the first page of a block in the fsblock
layer, but that's a bit scary too. Probably better would be to
move towards offset,length rather than page based fs APIs where everything
can be batched up nicely and this sort of non-trivial locking can be more
optimal.

Large blocks also have a performance black spot where an 8K sized and
aligned write(2) would require a read-modify-write (RMW) in the
filesystem. Again this is because of the page based nature of the fs
API, and this too would be fixed if the APIs were better.

Large block memory access via the filesystem uses vmap, but it will go
back to kmap if the access doesn't cross a page. Filesystems really
should keep accesses within a page where possible, because vmap is slow
as anything. For performance testing I've implemented a vmap cache,
which basically wouldn't work on 32-bit systems because of limited vmap
space (and yes it sometimes tries to unmap in interrupt context, I
know, I'm using loop). We could possibly do a self-limiting cache, but
I'd rather build some helpers to hide the raw multi-page access for
things like bitmap scanning and bit setting etc. and avoid too many
vmaps.
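
The kind of helper meant here might look something like this
(hypothetical; mblock_block, fsblock_size, for_each_page etc. are from
the patch, and the function itself is made up for illustration):

/* Scan a superpage-sized allocation bitmap block for a zero bit one
 * pagecache page at a time with kmap, so the block never needs to be
 * vmapped at all. Returns the bit number, or -1 if none free. */
static long mblock_find_first_zero_bit(struct fsblock_meta *mblock)
{
	struct fsblock *block = mblock_block(mblock);
	unsigned int size = fsblock_size(block);
	struct page *page = block->page, *p;
	long found = -1, base = 0;

	FSB_BUG_ON(!fsblock_superpage(block));

	for_each_page(page, size, p) {
		if (found < 0) {
			unsigned long *addr = kmap(p);
			long bit = find_first_zero_bit(addr,
						       PAGE_CACHE_SIZE * 8);
			if (bit < PAGE_CACHE_SIZE * 8)
				found = base + bit;
			kunmap(p);
		}
		base += PAGE_CACHE_SIZE * 8;
	} end_for_each_page;

	return found;
}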

- Code size. I'm sure I'm still missing some things, but at the moment we
can do this in about the same amount of icache as buffer.c. If we turn
off large block support, I think it is around 2/3 the size.

That's basically it for now. I have a few more ideas for cool things, but
there are only so many hours in a day. Comments are non-existent so far,
and there is lots of debugging stuff and some things are a little dirty,
but it should be slightly familiar if you understand buffer.c. I'm not so
interested in hearing about trivial nitpicking at this point because things
are far from final or proposed for upstream. There is still a race or two,
but I think they can all be solved.

So. Comments? Is this something we want? If yes, then how would we
transition from buffer.c to fsblock.c?


2007-06-24 01:46:37

by Nick Piggin

Subject: [patch 1/3] add the fsblock layer


Rewrite the buffer layer.

---
fs/Makefile | 2
fs/buffer.c | 31
fs/fs-writeback.c | 13
fs/fsblock.c | 2511 ++++++++++++++++++++++++++++++++++++++++++
fs/inode.c | 37
fs/splice.c | 3
include/linux/buffer_head.h | 1
include/linux/fsblock.h | 347 +++++
include/linux/fsblock_types.h | 70 +
include/linux/page-flags.h | 15
init/main.c | 2
mm/filemap.c | 7
mm/page_alloc.c | 3
mm/swap.c | 7
mm/truncate.c | 93 -
mm/vmscan.c | 6
16 files changed, 3077 insertions(+), 71 deletions(-)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -90,6 +90,8 @@
#define PG_reclaim 17 /* To be reclaimed asap */
#define PG_buddy 19 /* Page is free, on buddy lists */

+#define PG_blocks 20 /* Page has block mappings */
+
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */

@@ -134,8 +136,17 @@ static inline void SetPageUptodate(struc
if (!test_and_set_bit(PG_uptodate, &page->flags))
page_clear_dirty(page);
}
+static inline int TestSetPageUptodate(struct page *page)
+{
+ if (!test_and_set_bit(PG_uptodate, &page->flags)) {
+ page_clear_dirty(page);
+ return 0;
+ }
+ return 1;
+}
#else
#define SetPageUptodate(page) set_bit(PG_uptodate, &(page)->flags)
+#define TestSetPageUptodate(page) test_and_set_bit(PG_uptodate, &(page)->flags)
#endif
#define ClearPageUptodate(page) clear_bit(PG_uptodate, &(page)->flags)

@@ -217,6 +228,10 @@ static inline void SetPageUptodate(struc
#define __SetPageBuddy(page) __set_bit(PG_buddy, &(page)->flags)
#define __ClearPageBuddy(page) __clear_bit(PG_buddy, &(page)->flags)

+#define PageBlocks(page) test_bit(PG_blocks, &(page)->flags)
+#define SetPageBlocks(page) set_bit(PG_blocks, &(page)->flags)
+#define ClearPageBlocks(page) clear_bit(PG_blocks, &(page)->flags)
+
#define PageMappedToDisk(page) test_bit(PG_mappedtodisk, &(page)->flags)
#define SetPageMappedToDisk(page) set_bit(PG_mappedtodisk, &(page)->flags)
#define ClearPageMappedToDisk(page) clear_bit(PG_mappedtodisk, &(page)->flags)
Index: linux-2.6/fs/Makefile
===================================================================
--- linux-2.6.orig/fs/Makefile
+++ linux-2.6/fs/Makefile
@@ -14,7 +14,7 @@ obj-y := open.o read_write.o file_table.
stack.o

ifeq ($(CONFIG_BLOCK),y)
-obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
+obj-y += fsblock.o buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
else
obj-y += no-block.o
endif
Index: linux-2.6/fs/fsblock.c
===================================================================
--- /dev/null
+++ linux-2.6/fs/fsblock.c
@@ -0,0 +1,2511 @@
+/*
+ * fs/fsblock.c
+ *
+ * Copyright (C) 2007 Nick Piggin, SuSE Labs, Novell Inc.
+ */
+
+#include <linux/fsblock.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/bio.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/bitops.h>
+#include <linux/pagevec.h>
+#include <linux/pagemap.h>
+#include <linux/page-flags.h>
+#include <linux/rcupdate.h> /* XXX: get rid of RCU */
+#include <linux/module.h>
+#include <linux/bit_spinlock.h> /* bit_spin_lock for subpage blocks */
+#include <linux/vmalloc.h> /* vmap for superpage blocks */
+#include <linux/gfp.h>
+//#include <linux/buffer_head.h> /* too much crap in me */
+extern int try_to_free_buffers(struct page *);
+
+/* XXX: add a page / block invariant checker function? */
+
+#include <asm/atomic.h>
+
+#define SECTOR_SHIFT MIN_SECTOR_SHIFT
+#define NR_SUB_SIZES (1 << (PAGE_CACHE_SHIFT - MIN_SECTOR_SHIFT))
+
+static struct kmem_cache *block_cache __read_mostly;
+static struct kmem_cache *mblock_cache __read_mostly;
+
+static void block_ctor(void *data, struct kmem_cache *cachep,
+ unsigned long flags)
+{
+ struct fsblock *block = data;
+ atomic_set(&block->count, 0);
+}
+
+void __init fsblock_init(void)
+{
+ block_cache = kmem_cache_create("fsblock-data",
+ sizeof(struct fsblock), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_DESTROY_BY_RCU,
+ block_ctor, NULL);
+
+ mblock_cache = kmem_cache_create("fsblock-metadata",
+ sizeof(struct fsblock_meta), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_DESTROY_BY_RCU,
+ block_ctor, NULL);
+}
+
+static void init_block(struct page *page, struct fsblock *block, unsigned int bits)
+{
+ block->flags = 0;
+ block->block_nr = -1;
+ block->page = page;
+ block->private = NULL;
+ FSB_BUG_ON(atomic_read(&block->count));
+ atomic_inc(&block->count);
+ __set_bit(BL_locked, &block->flags);
+ fsblock_set_bits(block, bits);
+#ifdef FSB_DEBUG
+ atomic_set(&block->vmap_count, 0);
+#endif
+}
+
+static void init_mblock(struct page *page, struct fsblock_meta *mblock, unsigned int bits)
+{
+ init_block(page, &mblock->block, bits);
+ __set_bit(BL_metadata, &mblock->block.flags);
+ INIT_LIST_HEAD(&mblock->assoc_list);
+ mblock->assoc_mapping = NULL;
+}
+
+static struct fsblock *alloc_blocks(struct page *page, unsigned int bits, gfp_t gfp_flags)
+{
+ struct fsblock *block;
+ int nid = page_to_nid(page);
+
+ if (bits >= PAGE_CACHE_SHIFT) { /* !subpage */
+ block = kmem_cache_alloc_node(block_cache, gfp_flags, nid);
+ if (likely(block))
+ init_block(page, block, bits);
+ } else {
+ int nr = PAGE_CACHE_SIZE >> bits;
+ /* XXX: could have a range of cache sizes */
+ block = kmalloc_node(sizeof(struct fsblock)*nr, gfp_flags, nid);
+ if (likely(block)) {
+ int i;
+ for (i = 0; i < nr; i++) {
+ struct fsblock *b = block + i;
+ atomic_set(&b->count, 0);
+ init_block(page, b, bits);
+ }
+ }
+ }
+ return block;
+}
+
+static struct fsblock_meta *alloc_mblocks(struct page *page, unsigned int bits, gfp_t gfp_flags)
+{
+ struct fsblock_meta *mblock;
+ int nid = page_to_nid(page);
+
+ if (bits >= PAGE_CACHE_SHIFT) { /* !subpage */
+ mblock = kmem_cache_alloc_node(mblock_cache, gfp_flags, nid);
+ if (likely(mblock))
+ init_mblock(page, mblock, bits);
+ } else {
+ int nr = PAGE_CACHE_SIZE >> bits;
+ mblock = kmalloc_node(sizeof(struct fsblock_meta)*nr, gfp_flags, nid);
+ if (likely(mblock)) {
+ int i;
+ for (i = 0; i < nr; i++) {
+ struct fsblock_meta *mb = mblock + i;
+ atomic_set(&mb->block.count, 0);
+ init_mblock(page, mb, bits);
+ }
+ }
+ }
+ return mblock;
+}
+
+#ifdef FSB_DEBUG
+static void assert_block(struct fsblock *block)
+{
+ struct page *page = block->page;
+
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(!PageBlocks(page));
+
+ if (fsblock_superpage(block)) {
+ struct page *p;
+
+ FSB_BUG_ON(page->index != first_page_idx(page->index,
+ fsblock_size(block)));
+
+ for_each_page(page, fsblock_size(block), p) {
+ FSB_BUG_ON(!PageBlocks(p));
+ FSB_BUG_ON(page_blocks(p) != block);
+ } end_for_each_page;
+ } else if (fsblock_subpage(block)) {
+ struct fsblock *b;
+ block = page_blocks(block->page);
+
+ for_each_block(block, b)
+ FSB_BUG_ON(b->page != page);
+ }
+}
+
+static void free_block_check(struct fsblock *block)
+{
+ unsigned long flags = block->flags;
+ unsigned long badflags =
+ (1 << BL_locked |
+ 1 << BL_dirty |
+ /* 1 << BL_error | */
+ 1 << BL_new |
+ 1 << BL_writeback |
+ 1 << BL_readin |
+ 1 << BL_sync_io);
+ unsigned long goodflags = 0;
+ unsigned int size = fsblock_size(block);
+ unsigned int count = atomic_read(&block->count);
+ unsigned int vmap_count = atomic_read(&block->vmap_count);
+ void *private = block->private;
+
+ if ((flags & badflags) || ((flags & goodflags) != goodflags) || count || private || vmap_count) {
+ printk("block flags = %lx\n", flags);
+ printk("block size = %u\n", size);
+ printk("block count = %u\n", count);
+ printk("block private = %p\n", private);
+ printk("vmap count = %u\n", vmap_count);
+ BUG();
+ }
+}
+#else
+static inline void assert_block(struct fsblock *block) {}
+#endif
+
+static void rcu_free_block(struct rcu_head *head)
+{
+ struct fsblock *block = container_of(head, struct fsblock, rcu_head);
+ kfree(block);
+}
+
+static void free_block(struct fsblock *block)
+{
+ if (fsblock_subpage(block)) {
+#ifdef FSB_DEBUG
+ unsigned int bits = fsblock_bits(block);
+ int i, nr = PAGE_CACHE_SIZE >> bits;
+
+ for (i = 0; i < nr; i++) {
+ struct fsblock *b;
+ if (test_bit(BL_metadata, &block->flags))
+ b = &(block_mblock(block) + i)->block;
+ else
+ b = block + i;
+ free_block_check(b);
+ }
+#endif
+
+ INIT_RCU_HEAD(&block->rcu_head);
+ call_rcu(&block->rcu_head, rcu_free_block);
+ } else {
+#ifdef VMAP_CACHE
+ if (test_bit(BL_vmapped, &block->flags)) {
+ vunmap(block->vaddr);
+ block->vaddr = NULL;
+ clear_bit(BL_vmapped, &block->flags);
+ }
+#endif
+#ifdef FSB_DEBUG
+ free_block_check(block);
+#endif
+ if (test_bit(BL_metadata, &block->flags))
+ kmem_cache_free(mblock_cache, block);
+ else
+ kmem_cache_free(block_cache, block);
+ }
+}
+
+int block_get_unless_zero(struct fsblock *block)
+{
+ return atomic_inc_not_zero(&block->count);
+}
+
+void block_get(struct fsblock *block)
+{
+ FSB_BUG_ON(atomic_read(&block->count) == 0);
+ atomic_inc(&block->count);
+}
+EXPORT_SYMBOL(block_get);
+
+static int fsblock_noblock __read_mostly = 1; /* Like nobh mode */
+
+void block_put(struct fsblock *block)
+{
+ int free_it;
+ struct page *page;
+
+ page = block->page;
+ free_it = 0;
+ if (!page->mapping || fsblock_noblock) {
+ free_it = 1;
+ page_cache_get(page);
+ }
+
+#ifdef FSB_DEBUG
+ FSB_BUG_ON(atomic_read(&block->count) == 2 &&
+ atomic_read(&block->vmap_count));
+#endif
+ FSB_BUG_ON(atomic_read(&block->count) <= 1);
+
+ /* dec_return required for the release memory barrier */
+ if (atomic_dec_return(&block->count) == 1) {
+ if (free_it && !test_bit(BL_dirty, &block->flags)) {
+ /*
+ * At this point we'd like to try stripping the block
+ * if it is only existing in a self-referential
+ * relationship with the pagecache (ie. the pagecache
+ * is truncated as well).
+ */
+ if (!TestSetPageLocked(page)) {
+ try_to_free_blocks(page);
+ unlock_page(page);
+ }
+ }
+ }
+ if (free_it)
+ page_cache_release(page);
+}
+EXPORT_SYMBOL(block_put);
+
+static int sleep_on_block(void *unused)
+{
+ io_schedule();
+ return 0;
+}
+
+void lock_block(struct fsblock *block)
+{
+ might_sleep();
+
+ if (!trylock_block(block))
+ wait_on_bit_lock(&block->flags, BL_locked, sleep_on_block,
+ TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(lock_block);
+
+void unlock_block(struct fsblock *block)
+{
+ FSB_BUG_ON(!test_bit(BL_locked, &block->flags));
+ smp_mb__before_clear_bit();
+ clear_bit(BL_locked, &block->flags);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&block->flags, BL_locked);
+}
+EXPORT_SYMBOL(unlock_block);
+
+void wait_on_block_locked(struct fsblock *block)
+{
+ might_sleep();
+
+ if (test_bit(BL_locked, &block->flags))
+ wait_on_bit(&block->flags, BL_locked, sleep_on_block,
+ TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(wait_on_block_locked);
+
+static void set_block_sync_io(struct fsblock *block)
+{
+ FSB_BUG_ON(!PageLocked(block->page));
+ FSB_BUG_ON(test_bit(BL_sync_io, &block->flags));
+#ifdef FSB_DEBUG
+ if (fsblock_superpage(block)) {
+ struct page *page = block->page, *p;
+ for_each_page(page, fsblock_size(block), p) {
+ FSB_BUG_ON(!PageLocked(p));
+ FSB_BUG_ON(PageWriteback(p));
+ } end_for_each_page;
+ } else {
+ FSB_BUG_ON(!PageLocked(block->page));
+ FSB_BUG_ON(PageWriteback(block->page));
+ }
+#endif
+ set_bit(BL_sync_io, &block->flags);
+}
+
+static void end_block_sync_io(struct fsblock *block)
+{
+ FSB_BUG_ON(!PageLocked(block->page));
+ FSB_BUG_ON(!test_bit(BL_sync_io, &block->flags));
+ clear_bit(BL_sync_io, &block->flags);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&block->flags, BL_sync_io);
+}
+
+static void wait_on_block_sync_io(struct fsblock *block)
+{
+ might_sleep();
+
+ FSB_BUG_ON(!PageLocked(block->page));
+ if (test_bit(BL_sync_io, &block->flags))
+ wait_on_bit(&block->flags, BL_sync_io, sleep_on_block,
+ TASK_UNINTERRUPTIBLE);
+}
+
+static void iolock_block(struct fsblock *block)
+{
+ struct page *page, *p;
+ might_sleep();
+
+ page = block->page;
+ if (!fsblock_superpage(block))
+ lock_page(page);
+ else {
+ for_each_page(page, fsblock_size(block), p) {
+ lock_page(p);
+ } end_for_each_page;
+ }
+}
+
+static void iounlock_block(struct fsblock *block)
+{
+ struct page *page, *p;
+
+ page = block->page;
+ if (!fsblock_superpage(block))
+ unlock_page(page);
+ else {
+ for_each_page(page, fsblock_size(block), p) {
+ unlock_page(p);
+ } end_for_each_page;
+ }
+}
+
+static void wait_on_block_iolock(struct fsblock *block)
+{
+ struct page *page, *p;
+ might_sleep();
+
+ page = block->page;
+ if (!fsblock_superpage(block))
+ wait_on_page_locked(page);
+ else {
+ for_each_page(page, fsblock_size(block), p) {
+ wait_on_page_locked(p);
+ } end_for_each_page;
+ }
+}
+
+static void set_block_writeback(struct fsblock *block)
+{
+ struct page *page, *p;
+ might_sleep();
+
+ page = block->page;
+ if (!fsblock_superpage(block)) {
+ set_page_writeback(page);
+ unlock_page(page);
+ } else {
+ for_each_page(page, fsblock_size(block), p) {
+ set_page_writeback(p);
+ unlock_page(p);
+ } end_for_each_page;
+ }
+}
+
+static void end_block_writeback(struct fsblock *block)
+{
+ struct page *page, *p;
+
+ page = block->page;
+ if (!fsblock_superpage(block))
+ end_page_writeback(page);
+ else {
+ for_each_page(page, fsblock_size(block), p) {
+ end_page_writeback(p);
+ } end_for_each_page;
+ }
+}
+
+static void wait_on_block_writeback(struct fsblock *block)
+{
+ struct page *page, *p;
+ might_sleep();
+
+ page = block->page;
+ if (!fsblock_superpage(block))
+ wait_on_page_writeback(page);
+ else {
+ for_each_page(page, fsblock_size(block), p) {
+ wait_on_page_writeback(p);
+ } end_for_each_page;
+ }
+}
+
+static struct block_device *mapping_data_bdev(struct address_space *mapping)
+{
+ struct inode *inode = mapping->host;
+ if (unlikely(S_ISBLK(inode->i_mode)))
+ return inode->i_bdev;
+ else
+ return inode->i_sb->s_bdev;
+}
+
+static struct fsblock *find_get_page_block(struct page *page)
+{
+ struct fsblock *block;
+
+ rcu_read_lock();
+again:
+ block = page_blocks_rcu(page);
+ if (block) {
+ /*
+ * Might be better off implementing this as a bit spinlock
+ * rather than count (which requires tricks with ordering
+ * eg. release vs set page dirty).
+ */
+ if (block_get_unless_zero(block)) {
+ if ((page_blocks_rcu(page) != block)) {
+ block_put(block);
+ block = NULL;
+ }
+ } else {
+ cpu_relax();
+ goto again;
+ }
+ }
+ rcu_read_unlock();
+
+ return block;
+}
+
+static int __set_page_dirty_noblocks(struct page *page)
+{
+ FSB_BUG_ON(!PageBlocks(page));
+ FSB_BUG_ON(!fsblock_subpage(page_blocks(page)) && !PageUptodate(page));
+
+ return __set_page_dirty_nobuffers(page);
+}
+
+int fsblock_set_page_dirty(struct page *page)
+{
+ struct fsblock *block;
+ int ret = 0;
+
+ FSB_BUG_ON(!PageUptodate(page));
+ FSB_BUG_ON(!PageBlocks(page));
+// FSB_BUG_ON(!PageLocked(page)); /* XXX: this can go away when we pin a page's metadata */
+
+ block = page_blocks(page);
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+
+ for_each_block(block, b) {
+ FSB_BUG_ON(!test_bit(BL_uptodate, &b->flags));
+ if (!test_bit(BL_dirty, &b->flags)) {
+ set_bit(BL_dirty, &b->flags);
+ ret = 1;
+ }
+ }
+ } else {
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ if (!test_bit(BL_dirty, &block->flags)) {
+ set_bit(BL_dirty, &block->flags);
+ ret = 1;
+ }
+ }
+ /*
+ * XXX: this is slightly racy because the above blocks could be
+ * cleaned in a writeback that's underway, while the page will
+ * still get marked dirty below. This technically breaks some
+ * invariants that we check for (that a dirty page must have at
+ * least 1 dirty buffer). Eventually we could just relax those
+ * invariants, but keep them in for now to catch bugs.
+ */
+ return __set_page_dirty_noblocks(page);
+}
+EXPORT_SYMBOL(fsblock_set_page_dirty);
+
+/*
+ * Do we need a fast atomic version for just page sized / aligned maps?
+ */
+void *vmap_block(struct fsblock *block, off_t off, size_t len)
+{
+ struct address_space *mapping = block->page->mapping;
+ unsigned int size = fsblock_size(block);
+
+ FSB_BUG_ON(off < 0);
+ FSB_BUG_ON(off + len > size);
+
+ if (!fsblock_superpage(block)) {
+ unsigned int page_offset = 0;
+ if (fsblock_subpage(block))
+ page_offset = block_page_offset(block, size);
+#ifdef FSB_DEBUG
+ atomic_inc(&block->vmap_count);
+#endif
+ return kmap(block->page) + page_offset + off;
+ } else {
+ pgoff_t pgoff, start, end;
+ unsigned long pos;
+
+#ifdef VMAP_CACHE
+ if (test_bit(BL_vmapped, &block->flags)) {
+ while (test_bit(BL_vmap_lock, &block->flags))
+ cpu_relax();
+ smp_rmb();
+#ifdef FSB_DEBUG
+ atomic_inc(&block->vmap_count);
+#endif
+ return block->vaddr + off;
+ }
+#endif
+
+ pgoff = block->page->index;
+ FSB_BUG_ON(test_bit(BL_metadata, &block->flags) &&
+ pgoff != block->block_nr * (size >> PAGE_CACHE_SHIFT));
+ start = pgoff + (off >> PAGE_CACHE_SHIFT);
+ end = pgoff + ((off + len - 1) >> PAGE_CACHE_SHIFT);
+ pos = off & ~PAGE_CACHE_MASK;
+
+#ifndef VMAP_CACHE
+ if (start == end) {
+ struct page *page;
+
+ page = find_page(mapping, start);
+ FSB_BUG_ON(!page);
+
+#ifdef FSB_DEBUG
+ atomic_inc(&block->vmap_count);
+#endif
+ return kmap(page) + pos;
+ } else
+#endif
+ {
+ int nr;
+ struct page **pages;
+ void *addr;
+#ifndef VMAP_CACHE
+ nr = end - start + 1;
+#else
+ nr = size >> PAGE_CACHE_SHIFT;
+#endif
+ pages = kmalloc(nr * sizeof(struct page *), GFP_NOFS);
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+#ifndef VMAP_CACHE
+ find_pages(mapping, start, nr, pages);
+#else
+ find_pages(mapping, pgoff, nr, pages);
+#endif
+
+ addr = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
+ kfree(pages);
+ if (!addr)
+ return ERR_PTR(-ENOMEM);
+
+#ifdef FSB_DEBUG
+ atomic_inc(&block->vmap_count);
+#endif
+#ifndef VMAP_CACHE
+ return addr + pos;
+#else
+ bit_spin_lock(BL_vmap_lock, &block->flags);
+ if (!test_bit(BL_vmapped, &block->flags)) {
+ block->vaddr = addr;
+ set_bit(BL_vmapped, &block->flags);
+ }
+ bit_spin_unlock(BL_vmap_lock, &block->flags);
+ if (block->vaddr != addr)
+ vunmap(addr);
+ return block->vaddr + off;
+#endif
+ }
+ }
+}
+EXPORT_SYMBOL(vmap_block);
+
+void vunmap_block(struct fsblock *block, off_t off, size_t len, void *vaddr)
+{
+#ifdef FSB_DEBUG
+ FSB_BUG_ON(atomic_read(&block->vmap_count) <= 0);
+ atomic_dec(&block->vmap_count);
+#endif
+ if (!fsblock_superpage(block))
+ kunmap(block->page);
+#ifndef VMAP_CACHE
+ else {
+ unsigned int size = fsblock_size(block);
+ pgoff_t pgoff, start, end;
+
+ pgoff = block->block_nr * (size >> PAGE_CACHE_SHIFT);
+ FSB_BUG_ON(pgoff != block->page->index);
+ start = pgoff + (off >> PAGE_CACHE_SHIFT);
+ end = pgoff + ((off + len - 1) >> PAGE_CACHE_SHIFT);
+
+ if (start == end) {
+ struct address_space *mapping = block->page->mapping;
+ struct page *page;
+
+ page = find_page(mapping, start);
+ FSB_BUG_ON(!page);
+
+ kunmap(page);
+ } else {
+ unsigned long pos;
+
+ pos = off & ~PAGE_CACHE_MASK;
+ vunmap(vaddr - pos);
+ }
+ }
+#endif
+}
+EXPORT_SYMBOL(vunmap_block);
+
+static struct fsblock *__find_get_block(struct address_space *mapping, sector_t blocknr)
+{
+ struct inode *inode = mapping->host;
+ struct page *page;
+ pgoff_t pgoff;
+
+ pgoff = sector_pgoff(blocknr, inode->i_blkbits);
+
+ page = find_get_page(mapping, pgoff);
+ if (page) {
+ struct fsblock *block;
+ block = find_get_page_block(page);
+ if (block) {
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+ for_each_block(block, b) {
+ if (b->block_nr == blocknr) {
+ block_get(b);
+ block_put(block);
+ block = b;
+ goto found;
+ }
+ }
+ FSB_BUG();
+ } else
+ FSB_BUG_ON(block->block_nr != blocknr);
+found:
+ FSB_BUG_ON(!test_bit(BL_mapped, &block->flags));
+ }
+
+ page_cache_release(page);
+ return block;
+ }
+ return NULL;
+}
+
+struct fsblock_meta *find_get_mblock(struct block_device *bdev, sector_t blocknr, unsigned int size)
+{
+ struct fsblock *block;
+
+ block = __find_get_block(bdev->bd_inode->i_mapping, blocknr);
+ if (block) {
+ if (test_bit(BL_metadata, &block->flags)) {
+ /*
+ * XXX: need a better way than 'size' to tag and
+ * identify metadata fsblocks?
+ */
+ if (fsblock_size(block) == size)
+ return block_mblock(block);
+ }
+
+ block_put(block);
+ }
+ return NULL;
+}
+EXPORT_SYMBOL(find_get_mblock);
+
+static void attach_block_page(struct page *page, struct fsblock *block)
+{
+ if (PageUptodate(page))
+ set_bit(BL_uptodate, &block->flags);
+ unlock_block(block); /* XXX: need this? */
+}
+
+/* This goes away when we get rid of buffer.c */
+static int invalidate_aliasing_buffers(struct page *page, unsigned int size)
+{
+ if (!size_is_superpage(size)) {
+ if (PagePrivate(page))
+ return try_to_free_buffers(page);
+ } else {
+ struct page *p;
+
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(!PageLocked(p));
+ FSB_BUG_ON(PageBlocks(p));
+
+ if (PagePrivate(p)) {
+ if (!try_to_free_buffers(p))
+ return 0;
+ }
+ } end_for_each_page;
+ }
+ return 1;
+}
+
+static int __try_to_free_blocks(struct page *page, int all_locked);
+static int invalidate_aliasing_blocks(struct page *page, unsigned int size)
+{
+ if (!size_is_superpage(size)) {
+ if (PageBlocks(page)) {
+ /* could check for compatible blocks here, but meh */
+ return __try_to_free_blocks(page, 1);
+ }
+ } else {
+ struct page *p;
+
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(!PageLocked(p));
+ FSB_BUG_ON(PageBlocks(p));
+
+ if (PageBlocks(p)) {
+ if (!__try_to_free_blocks(p, 1))
+ return 0;
+ }
+ } end_for_each_page;
+ }
+ return 1;
+}
+
+#define CREATE_METADATA 0x01
+#define CREATE_DIRTY 0x02
+static int create_unmapped_blocks(struct page *page, gfp_t gfp_flags, unsigned int size, unsigned int flags)
+{
+ unsigned int bits = ffs(size) - 1;
+ struct fsblock *block;
+
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(PageDirty(page));
+ FSB_BUG_ON(PageWriteback(page));
+ FSB_BUG_ON(PageBlocks(page));
+ FSB_BUG_ON(flags & CREATE_DIRTY);
+
+ if (!invalidate_aliasing_buffers(page, size))
+ return -EBUSY;
+
+ /*
+ * XXX: maybe use private alloc funcions so fses can embed block into
+ * their fs-private block rather than using ->private? Maybe ->private
+ * is easier though...
+ */
+ if (!(flags & CREATE_METADATA)) {
+ block = alloc_blocks(page, bits, gfp_flags);
+ if (!block)
+ return -ENOMEM;
+ } else {
+ struct fsblock_meta *mblock;
+ mblock = alloc_mblocks(page, bits, gfp_flags);
+ if (!mblock)
+ return -ENOMEM;
+ block = mblock_block(mblock);
+ }
+
+ if (!fsblock_superpage(block)) {
+ attach_page_blocks(page, block);
+ /*
+ * Ensure ordering between setting page->block ptr and reading
+ * PageDirty, thus giving synchronisation between this and
+ * fsblock_set_page_dirty()
+ */
+ smp_mb();
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+ for_each_block(block, b)
+ attach_block_page(page, b);
+ } else
+ attach_block_page(page, block);
+ } else {
+ struct page *p;
+ int uptodate = 1;
+ FSB_BUG_ON(page->index != first_page_idx(page->index, size));
+
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(!PageLocked(p));
+ FSB_BUG_ON(PageDirty(p));
+ FSB_BUG_ON(PageWriteback(p));
+ FSB_BUG_ON(PageBlocks(p));
+ attach_page_blocks(p, block);
+ } end_for_each_page;
+ smp_mb();
+ for_each_page(page, size, p) {
+ if (!PageUptodate(p))
+ uptodate = 0;
+ } end_for_each_page;
+ if (uptodate)
+ set_bit(BL_uptodate, &block->flags);
+ unlock_block(block);
+ }
+
+ assert_block(block);
+
+ return 0;
+}
+
+static struct page *create_lock_page_range(struct address_space *mapping,
+ pgoff_t pgoff, unsigned int size)
+{
+ struct page *page;
+ gfp_t gfp;
+
+ gfp = mapping_gfp_mask(mapping) & ~__GFP_FS;
+ page = find_or_create_page(mapping, pgoff, gfp);
+ if (!page)
+ return NULL;
+
+ FSB_BUG_ON(!page->mapping);
+ page_cache_release(page);
+
+ if (size_is_superpage(size)) {
+ int i, nr = size >> PAGE_CACHE_SHIFT;
+
+ FSB_BUG_ON(pgoff != first_page_idx(pgoff, size));
+
+ for (i = 1; i < nr; i++) {
+ struct page *p;
+
+ p = find_or_create_page(mapping, pgoff + i, gfp);
+ if (!p) {
+ nr = i;
+ for (i = 0; i < nr; i++) {
+ p = find_page(mapping, pgoff + i);
+ FSB_BUG_ON(!p);
+ unlock_page(p);
+ }
+ return NULL;
+ }
+ FSB_BUG_ON(!p->mapping);
+ page_cache_release(p);
+ /*
+ * don't want a ref hanging around (see end io handlers
+ * for pagecache). Page lock pins the pcache ref.
+ * XXX: this is a little unclean.
+ */
+ }
+ }
+ FSB_BUG_ON(page->index != pgoff);
+ return page;
+}
+
+static void unlock_page_range(struct page *page, unsigned int size)
+{
+ if (!size_is_superpage(size))
+ unlock_page(page);
+ else {
+ struct page *p;
+
+ FSB_BUG_ON(page->index != first_page_idx(page->index, size));
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(!p);
+ unlock_page(p);
+ } end_for_each_page;
+ }
+}
+
+struct fsblock_meta *find_or_create_mblock(struct block_device *bdev, sector_t blocknr, unsigned int size)
+{
+ struct inode *bd_inode = bdev->bd_inode;
+ struct address_space *bd_mapping = bd_inode->i_mapping;
+ struct page *page;
+ struct fsblock_meta *mblock;
+ pgoff_t pgoff;
+ int ret;
+
+ pgoff = sector_pgoff(blocknr, bd_inode->i_blkbits);
+
+ mblock = find_get_mblock(bdev, blocknr, size);
+ if (mblock)
+ return mblock;
+
+ page = create_lock_page_range(bd_mapping, pgoff, size);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+
+ if (!invalidate_aliasing_blocks(page, size)) {
+ mblock = ERR_PTR(-EBUSY);
+ goto failed;
+ }
+ ret = create_unmapped_blocks(page, GFP_NOFS, size, CREATE_METADATA);
+ if (ret) {
+ mblock = ERR_PTR(ret);
+ goto failed;
+ }
+
+ mblock = page_mblocks(page);
+ /*
+ * XXX: technically this is just the block_dev.c direct
+ * mapping. So maybe logically in that file? (OTOH it *is*
+ * "metadata")
+ */
+ if (fsblock_subpage(&mblock->block)) {
+ struct fsblock_meta *ret = NULL, *mb;
+ sector_t base_block;
+ base_block = pgoff << (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
+ __for_each_mblock(mblock, size, mb) {
+ mb->block.block_nr = base_block;
+ set_bit(BL_mapped, &mb->block.flags);
+ if (mb->block.block_nr == blocknr) {
+ FSB_BUG_ON(ret);
+ ret = mb;
+ }
+ base_block++;
+ }
+ FSB_BUG_ON(!ret);
+ mblock = ret;
+ } else {
+ mblock->block.block_nr = blocknr;
+ set_bit(BL_mapped, &mblock->block.flags);
+ }
+ mblock_get(mblock);
+failed:
+ unlock_page_range(page, size);
+ return mblock;
+}
+EXPORT_SYMBOL(find_or_create_mblock);
+
+static void block_end_read(struct fsblock *block, int uptodate)
+{
+ int sync_io;
+ int finished_readin = 1;
+ struct page *page = block->page;
+
+ FSB_BUG_ON(test_bit(BL_uptodate, &block->flags));
+ FSB_BUG_ON(test_bit(BL_error, &block->flags));
+
+ sync_io = test_bit(BL_sync_io, &block->flags);
+
+ if (unlikely(!uptodate)) {
+ set_bit(BL_error, &block->flags);
+ if (!fsblock_superpage(block))
+ SetPageError(page);
+ else {
+ struct page *p;
+ for_each_page(page, fsblock_size(block), p) {
+ SetPageError(p);
+ } end_for_each_page;
+ }
+ } else
+ set_bit(BL_uptodate, &block->flags);
+
+ if (fsblock_subpage(block)) {
+ unsigned long flags;
+ struct fsblock *b, *first = page_blocks(block->page);
+
+ local_irq_save(flags);
+ bit_spin_lock(BL_rd_lock, &first->flags);
+ clear_bit(BL_readin, &block->flags);
+ for_each_block(page_blocks(page), b) {
+ if (test_bit(BL_readin, &b->flags)) {
+ finished_readin = 0;
+ break;
+ }
+ if (!test_bit(BL_uptodate, &b->flags))
+ uptodate = 0;
+ }
+ bit_spin_unlock(BL_rd_lock, &first->flags);
+ local_irq_restore(flags);
+ } else
+ clear_bit(BL_readin, &block->flags);
+
+ if (sync_io)
+ finished_readin = 0; /* don't unlock */
+ if (!fsblock_superpage(block)) {
+ FSB_BUG_ON(PageWriteback(page));
+ if (uptodate)
+ SetPageUptodate(page);
+ if (finished_readin)
+ unlock_page(page);
+ /*
+ * XXX: don't know whether or not to keep the page
+ * refcount elevated or simply rely on the page lock...
+ */
+ } else {
+ struct page *p;
+
+ for_each_page(page, fsblock_size(block), p) {
+ FSB_BUG_ON(PageDirty(p));
+ FSB_BUG_ON(PageWriteback(p));
+ if (uptodate)
+ SetPageUptodate(p);
+ if (finished_readin)
+ unlock_page(p);
+ } end_for_each_page;
+ }
+
+ if (sync_io)
+ end_block_sync_io(block);
+
+ block_put(block);
+}
+
+static void block_end_write(struct fsblock *block, int uptodate)
+{
+ int sync_io;
+ int finished_writeback = 1;
+ struct page *page = block->page;
+
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ FSB_BUG_ON(test_bit(BL_error, &block->flags));
+
+ sync_io = test_bit(BL_sync_io, &block->flags);
+
+ if (unlikely(!uptodate)) {
+ set_bit(BL_error, &block->flags);
+ if (!fsblock_superpage(block))
+ SetPageError(page);
+ else {
+ struct page *p;
+ for_each_page(page, fsblock_size(block), p) {
+ SetPageError(p);
+ } end_for_each_page;
+ }
+ set_bit(AS_EIO, &page->mapping->flags);
+ }
+
+ if (fsblock_subpage(block)) {
+ unsigned long flags;
+ struct fsblock *b, *first = page_blocks(block->page);
+
+ local_irq_save(flags);
+ bit_spin_lock(BL_wb_lock, &first->flags);
+ clear_bit(BL_writeback, &block->flags);
+ for_each_block(first, b) {
+ if (test_bit(BL_writeback, &b->flags)) {
+ finished_writeback = 0;
+ break;
+ }
+ }
+ bit_spin_unlock(BL_wb_lock, &first->flags);
+ local_irq_restore(flags);
+ } else
+ clear_bit(BL_writeback, &block->flags);
+
+ if (!sync_io) {
+ if (finished_writeback) {
+ if (!fsblock_superpage(block)) {
+ end_page_writeback(page);
+ } else {
+ struct page *p;
+ for_each_page(page, fsblock_size(block), p) {
+ FSB_BUG_ON(!p->mapping);
+ end_page_writeback(p);
+ } end_for_each_page;
+ }
+ }
+ } else
+ end_block_sync_io(block);
+
+ block_put(block);
+}
+
+int fsblock_strip = 1;
+
+static int block_end_bio_io(struct bio *bio, unsigned int bytes_done, int err)
+{
+ struct fsblock *block = bio->bi_private;
+ int uptodate;
+
+ if (bio->bi_size)
+ return 1;
+
+ uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+
+ if (err == -EOPNOTSUPP) {
+ printk(KERN_WARNING "block_end_bio_io: op not supported!\n");
+ WARN_ON(uptodate);
+ }
+
+ FSB_BUG_ON(!(test_bit(BL_readin, &block->flags) ^
+ test_bit(BL_writeback, &block->flags)));
+
+ if (test_bit(BL_readin, &block->flags))
+ block_end_read(block, uptodate);
+ else
+ block_end_write(block, uptodate);
+
+ bio_put(bio);
+
+ return 0;
+}
+
+static int submit_block(struct fsblock *block, int rw)
+{
+ struct page *page = block->page;
+ struct address_space *mapping = page->mapping;
+ struct bio *bio;
+ int ret = 0;
+ unsigned int bits = fsblock_bits(block);
+ unsigned int size = 1 << bits;
+ int nr = (size + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+
+#if 0
+ printk("submit_block for %s [blocknr=%lu, sector=%lu, size=%u]\n",
+ (test_bit(BL_readin, &block->flags) ? "read" : "write"),
+ (unsigned long)block->block_nr,
+ (unsigned long)block->block_nr * (size >> SECTOR_SHIFT), size);
+#endif
+
+ FSB_BUG_ON(!PageLocked(page) && !PageWriteback(page));
+ FSB_BUG_ON(!mapping);
+ FSB_BUG_ON(!test_bit(BL_mapped, &block->flags));
+
+ clear_bit(BL_error, &block->flags);
+
+ bio = bio_alloc(GFP_NOIO, nr);
+ bio->bi_sector = block->block_nr << (bits - SECTOR_SHIFT);
+ bio->bi_bdev = mapping_data_bdev(mapping);
+ bio->bi_end_io = block_end_bio_io;
+ bio->bi_private = block;
+
+ if (!fsblock_superpage(block)) {
+ unsigned int offset = 0;
+
+ if (fsblock_subpage(block))
+ offset = block_page_offset(block, size);
+ if (bio_add_page(bio, page, size, offset) != size)
+ FSB_BUG();
+ } else {
+ struct page *p;
+ int i;
+
+ i = 0;
+ for_each_page(page, size, p) {
+ if (bio_add_page(bio, p, PAGE_CACHE_SIZE, 0) != PAGE_CACHE_SIZE)
+ FSB_BUG();
+ i++;
+ } end_for_each_page;
+ FSB_BUG_ON(i != nr);
+ }
+
+ block_get(block);
+ bio_get(bio);
+ submit_bio(rw, bio);
+
+ if (bio_flagged(bio, BIO_EOPNOTSUPP))
+ ret = -EOPNOTSUPP;
+
+ bio_put(bio);
+ return ret;
+}
+
+static int read_block(struct fsblock *block)
+{
+ FSB_BUG_ON(PageWriteback(block->page));
+ FSB_BUG_ON(test_bit(BL_readin, &block->flags));
+ FSB_BUG_ON(test_bit(BL_writeback, &block->flags));
+ FSB_BUG_ON(test_bit(BL_dirty, &block->flags));
+ set_bit(BL_readin, &block->flags);
+ return submit_block(block, READ);
+}
+
+static int write_block(struct fsblock *block)
+{
+ FSB_BUG_ON(!PageWriteback(block->page));
+ FSB_BUG_ON(test_bit(BL_readin, &block->flags));
+ FSB_BUG_ON(test_bit(BL_writeback, &block->flags));
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ set_bit(BL_writeback, &block->flags);
+ return submit_block(block, WRITE);
+}
+
+int sync_block(struct fsblock *block)
+{
+ int ret = 0;
+
+ if (test_bit(BL_dirty, &block->flags)) {
+ struct page *page = block->page;
+
+ iolock_block(block);
+ wait_on_block_writeback(block);
+ FSB_BUG_ON(PageWriteback(page)); /* because block is locked */
+ if (test_bit(BL_dirty, &block->flags)) {
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ clear_bit(BL_dirty, &block->flags);
+
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+ for_each_block(page_blocks(page), b) {
+ if (test_bit(BL_dirty, &b->flags))
+ goto page_dirty;
+ }
+ }
+ if (!fsblock_superpage(block)) {
+ ret = clear_page_dirty_for_io(page);
+ FSB_BUG_ON(!ret);
+ } else {
+ struct page *p;
+ for_each_page(page, fsblock_size(block), p) {
+ clear_page_dirty_for_io(p);
+ } end_for_each_page;
+ }
+page_dirty:
+ set_block_writeback(block);
+
+ ret = write_block(block);
+ if (!ret) {
+ wait_on_block_writeback(block);
+ if (test_bit(BL_error, &block->flags))
+ ret = -EIO;
+ }
+ } else
+ iounlock_block(block);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(sync_block);
+
+void mark_mblock_uptodate(struct fsblock_meta *mblock)
+{
+ struct fsblock *block = mblock_block(mblock);
+ struct page *page = block->page;
+
+ if (fsblock_superpage(block)) {
+ struct page *p;
+ for_each_page(page, fsblock_size(block), p) {
+ SetPageUptodate(p);
+ } end_for_each_page;
+ } else if (fsblock_midpage(block)) {
+ SetPageUptodate(page);
+ } /* XXX: could check for all subblocks uptodate */
+ set_bit(BL_uptodate, &block->flags);
+}
+
+int mark_mblock_dirty(struct fsblock_meta *mblock)
+{
+ struct page *page;
+ FSB_BUG_ON(!fsblock_superpage(&mblock->block) &&
+ !test_bit(BL_uptodate, &mblock->block.flags));
+
+ if (test_and_set_bit(BL_dirty, &mblock->block.flags))
+ return 0;
+
+ page = mblock_block(mblock)->page;
+ if (!fsblock_superpage(mblock_block(mblock))) {
+ __set_page_dirty_noblocks(page);
+ } else {
+ struct page *p;
+ for_each_page(page, fsblock_size(mblock_block(mblock)), p) {
+ __set_page_dirty_noblocks(p);
+ } end_for_each_page;
+ }
+ return 1;
+}
+EXPORT_SYMBOL(mark_mblock_dirty);
+
+/*
+ * XXX: this is good, but is complex and inhibits block reclaim for now.
+ * Reworking so that it gets removed if the block is cleaned might be a
+ * good option? (would require a block flag)
+ */
+struct mb_assoc {
+ struct list_head mlist;
+ struct address_space *mapping;
+
+ struct list_head blist;
+ struct fsblock_meta *mblock;
+};
+
+int mark_mblock_dirty_inode(struct fsblock_meta *mblock, struct inode *inode)
+{
+ struct address_space *mapping = inode->i_mapping;
+ struct fsblock *block = mblock_block(mblock);
+ struct mb_assoc *mba;
+ int ret;
+
+ ret = mark_mblock_dirty(mblock);
+
+ bit_spin_lock(BL_assoc_lock, &block->flags);
+ if (block->private) {
+ mba = (struct mb_assoc *)block->private;
+ do {
+ FSB_BUG_ON(mba->mblock != mblock);
+ if (mba->mapping == inode->i_mapping)
+ goto out;
+ mba = list_entry(mba->blist.next,struct mb_assoc,blist);
+ } while (mba != block->private);
+ }
+ mba = kmalloc(sizeof(struct mb_assoc), GFP_ATOMIC);
+ if (unlikely(!mba)) {
+ bit_spin_unlock(BL_assoc_lock, &block->flags);
+ sync_block(block);
+ return ret;
+ }
+ INIT_LIST_HEAD(&mba->mlist);
+ mba->mapping = mapping;
+ INIT_LIST_HEAD(&mba->blist);
+ mba->mblock = mblock;
+ if (block->private)
+ list_add(&mba->blist, ((struct mb_assoc *)block->private)->blist.prev);
+ block->private = mba;
+ spin_lock(&mapping->private_lock);
+ list_add_tail(&mba->mlist, &mapping->private_list);
+ spin_unlock(&mapping->private_lock);
+
+out:
+ bit_spin_unlock(BL_assoc_lock, &block->flags);
+ return ret;
+}
+EXPORT_SYMBOL(mark_mblock_dirty_inode);
+
+int fsblock_sync(struct address_space *mapping)
+{
+ int err, ret;
+ LIST_HEAD(list);
+ struct mb_assoc *mba, *tmp;
+
+ spin_lock(&mapping->private_lock);
+ list_splice_init(&mapping->private_list, &list);
+ spin_unlock(&mapping->private_lock);
+
+ err = 0;
+ list_for_each_entry_safe(mba, tmp, &list, mlist) {
+ struct fsblock *block = mblock_block(mba->mblock);
+
+ FSB_BUG_ON(mba->mapping != mapping);
+
+ bit_spin_lock(BL_assoc_lock, &block->flags);
+ if (list_empty(&mba->blist))
+ block->private = NULL;
+ else {
+ if (block->private == mba)
+ block->private = list_entry(mba->blist.next,struct mb_assoc,blist);
+ list_del(&mba->blist);
+ }
+ bit_spin_unlock(BL_assoc_lock, &block->flags);
+
+ iolock_block(block);
+ wait_on_block_writeback(block);
+ if (test_bit(BL_dirty, &block->flags)) {
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ clear_bit(BL_dirty, &block->flags);
+ ret = write_block(block);
+ if (ret && !err)
+ err = ret;
+ } else
+ iounlock_block(block);
+ }
+
+ while (!list_empty(&list)) {
+ struct fsblock *block;
+
+ /* Go in reverse order to reduce context switching */
+ mba = list_entry(list.prev, struct mb_assoc, mlist);
+ list_del(&mba->mlist);
+
+ block = mblock_block(mba->mblock);
+ wait_on_block_writeback(block);
+ if (test_bit(BL_error, &block->flags)) {
+ if (!err)
+ err = -EIO;
+ set_bit(AS_EIO, &mba->mapping->flags);
+ }
+ kfree(mba);
+ }
+ return err;
+}
+EXPORT_SYMBOL(fsblock_sync);
+
+int fsblock_release(struct address_space *mapping, int force)
+{
+ struct mb_assoc *mba;
+ LIST_HEAD(list);
+
+ if (!mapping_has_private(mapping))
+ return 1;
+
+ spin_lock(&mapping->private_lock);
+ if (!force) {
+ list_for_each_entry(mba, &mapping->private_list, mlist) {
+ struct fsblock *block = mblock_block(mba->mblock);
+ if (test_bit(BL_dirty, &block->flags)) {
+ spin_unlock(&mapping->private_lock);
+ return 0;
+ }
+ }
+ }
+ list_splice_init(&mapping->private_list, &list);
+ spin_unlock(&mapping->private_lock);
+
+ while (!list_empty(&list)) {
+ struct fsblock *block;
+
+ mba = list_entry(list.prev, struct mb_assoc, mlist);
+ list_del(&mba->mlist);
+
+ block = mblock_block(mba->mblock);
+ bit_spin_lock(BL_assoc_lock, &block->flags);
+ if (list_empty(&mba->blist))
+ block->private = NULL;
+ else {
+ if (block->private == mba)
+ block->private = list_entry(mba->blist.next,struct mb_assoc,blist);
+ list_del(&mba->blist);
+ }
+ bit_spin_unlock(BL_assoc_lock, &block->flags);
+
+ if (test_bit(BL_error, &block->flags))
+ set_bit(AS_EIO, &mba->mapping->flags);
+ kfree(mba);
+ }
+ return 1;
+}
+EXPORT_SYMBOL(fsblock_release);
+
+static void sync_underlying_metadata(struct fsblock *block)
+{
+ struct address_space *mapping = block->page->mapping;
+ struct block_device *bdev = mapping_data_bdev(mapping);
+ struct fsblock *meta_block;
+ sector_t blocknr = block->block_nr;
+
+ /* XXX: should this just invalidate rather than write back? */
+
+ FSB_BUG_ON(test_bit(BL_metadata, &block->flags));
+
+ meta_block = __find_get_block(bdev->bd_inode->i_mapping, blocknr);
+ if (meta_block) {
+ int err;
+
+ FSB_BUG_ON(!test_bit(BL_metadata, &meta_block->flags));
+ /*
+ * Could actually do a memory copy here to bring
+ * the block uptodate. Probably not worthwhile.
+ */
+ FSB_BUG_ON(block == meta_block);
+ err = sync_block(meta_block);
+ if (!err)
+ FSB_BUG_ON(test_bit(BL_dirty, &meta_block->flags));
+ else {
+ clear_bit(BL_dirty, &meta_block->flags);
+ wait_on_block_iolock(meta_block);
+ }
+ }
+}
+
+struct fsblock_meta *mbread(struct block_device *bdev, sector_t blocknr, unsigned int size)
+{
+ struct fsblock_meta *mblock;
+
+ mblock = find_or_create_mblock(bdev, blocknr, size);
+ if (!IS_ERR(mblock)) {
+ struct fsblock *block = &mblock->block;
+
+ if (!test_bit(BL_uptodate, &block->flags)) {
+ iolock_block(block);
+ if (!test_bit(BL_uptodate, &block->flags)) {
+ int ret;
+ FSB_BUG_ON(PageWriteback(block->page));
+ FSB_BUG_ON(test_bit(BL_dirty, &block->flags));
+ set_block_sync_io(block);
+ ret = read_block(block);
+ if (ret) {
+ /* XXX: handle errors properly */
+ block_put(block);
+ mblock = ERR_PTR(ret);
+ } else {
+ wait_on_block_sync_io(block);
+ if (!test_bit(BL_uptodate, &block->flags))
+ mblock = ERR_PTR(-EIO);
+ FSB_BUG_ON(size >= PAGE_CACHE_SIZE && !PageUptodate(block->page));
+ }
+ }
+ iounlock_block(block);
+ }
+ }
+
+ return mblock;
+}
+EXPORT_SYMBOL(mbread);
+
+/*
+ * XXX: maybe either don't have a generic version, or change the
+ * insert_mapping scheme so that it fills fsblocks rather than inserts them
+ * live into pages?
+ */
+sector_t fsblock_bmap(struct address_space *mapping, sector_t blocknr, insert_mapping_fn *insert_mapping)
+{
+ struct fsblock *block;
+ struct inode *inode = mapping->host;
+ sector_t ret;
+
+ block = __find_get_block(mapping, blocknr);
+ if (!block) {
+ pgoff_t pgoff = sector_pgoff(blocknr, inode->i_blkbits);
+ unsigned int size = 1 << inode->i_blkbits;
+ struct page *page;
+
+ page = create_lock_page_range(mapping, pgoff, size);
+ if (!page)
+ return 0;
+
+ if (create_unmapped_blocks(page, GFP_NOFS, size, CREATE_METADATA))
+ return 0;
+
+ ret = insert_mapping(mapping, pgoff, PAGE_CACHE_SIZE, 0);
+
+ block = __find_get_block(mapping, blocknr);
+ FSB_BUG_ON(!block);
+
+ unlock_page_range(page, size);
+ }
+
+ FSB_BUG_ON(test_bit(BL_new, &block->flags));
+ ret = 0;
+ if (test_bit(BL_mapped, &block->flags))
+ ret = block->block_nr;
+
+ return ret;
+}
+EXPORT_SYMBOL(fsblock_bmap);
+
+static int relock_superpage_block(struct page **pagep, unsigned int size)
+{
+ struct page *page = *pagep;
+ pgoff_t index = page->index;
+ pgoff_t first = first_page_idx(page->index, size);
+ struct address_space *mapping = page->mapping;
+
+ /*
+ * XXX: this is a bit of a hack because the ->readpage and other
+ * aops APIs are not so nice. Should convert over to a ->read_range
+ * API that does the offset, length thing and allows caller locking?
+ * (also getting rid of ->readpages).
+ */
+ unlock_page(page);
+ *pagep = create_lock_page_range(mapping, first, size);
+ if (!*pagep) {
+ lock_page(page);
+ return -ENOMEM;
+ }
+ if (page != find_page(mapping, index)) {
+ unlock_page_range(*pagep, size);
+ return AOP_TRUNCATED_PAGE;
+ }
+ return 0;
+}
+
+static int block_read_helper(struct page *page, struct fsblock *block)
+{
+ FSB_BUG_ON(test_bit(BL_new, &block->flags));
+
+ if (test_bit(BL_uptodate, &block->flags))
+ return 0;
+
+ FSB_BUG_ON(PageUptodate(page));
+
+ if (!test_bit(BL_mapped, &block->flags)) {
+ unsigned int size = fsblock_size(block);
+ unsigned int offset = block_page_offset(block, size);
+ zero_user_page(page, offset, size, KM_USER0);
+ set_bit(BL_uptodate, &block->flags);
+ return 0;
+ }
+
+ if (!test_bit(BL_uptodate, &block->flags)) {
+ FSB_BUG_ON(test_bit(BL_readin, &block->flags));
+ FSB_BUG_ON(test_bit(BL_writeback, &block->flags));
+ set_bit(BL_readin, &block->flags);
+ return 1;
+ }
+ return 0;
+}
+
+int fsblock_read_page(struct page *page, insert_mapping_fn *insert_mapping)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ unsigned int size = 1 << inode->i_blkbits;
+ struct fsblock *block;
+ int ret;
+
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(PageUptodate(page));
+ FSB_BUG_ON(PageWriteback(page));
+
+ if (size_is_superpage(size)) {
+ struct page *orig_page = page;
+
+ ret = relock_superpage_block(&page, size);
+ if (ret)
+ return ret;
+ if (PageUptodate(orig_page))
+ goto out_unlock;
+ }
+
+ if (!PageBlocks(page)) {
+ ret = create_unmapped_blocks(page, GFP_NOFS, size, 0);
+ if (ret)
+ goto out_unlock;
+ }
+
+ /* XXX: optimise away if page is mapped to disk */
+ ret = insert_mapping(mapping, page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ /* XXX: SetPageError on failure? */
+ if (ret)
+ goto out_unlock;
+
+ block = page_blocks(page);
+
+ if (!fsblock_superpage(block)) {
+
+ if (fsblock_subpage(block)) {
+ int nr = 0;
+ struct fsblock *b;
+ for_each_block(block, b)
+ nr += block_read_helper(page, b);
+ if (nr == 0) {
+ /* Hole? */
+ SetPageUptodate(page);
+ goto out_unlock;
+ }
+ for_each_block(block, b) {
+ if (!test_bit(BL_readin, &b->flags))
+ continue;
+
+ ret = submit_block(b, READ);
+ if (ret)
+ goto out_unlock;
+ /*
+ * XXX: must handle errors properly (eg. wait
+ * for outstanding reads before unlocking the
+ * page?
+ */
+ }
+ } else {
+ if (block_read_helper(page, block)) {
+ ret = submit_block(block, READ);
+ if (ret)
+ goto out_unlock;
+ } else {
+ SetPageUptodate(page);
+ goto out_unlock;
+ }
+ }
+ } else {
+ struct page *p;
+
+ ret = 0;
+
+ FSB_BUG_ON(test_bit(BL_new, &block->flags));
+ FSB_BUG_ON(test_bit(BL_uptodate, &block->flags));
+ FSB_BUG_ON(test_bit(BL_dirty, &block->flags));
+
+ if (!test_bit(BL_mapped, &block->flags)) {
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(PageUptodate(p));
+ zero_user_page(p, 0, PAGE_CACHE_SIZE, KM_USER0);
+ SetPageUptodate(p);
+ unlock_page(p);
+ } end_for_each_page;
+ set_bit(BL_uptodate, &block->flags);
+ } else {
+ ret = read_block(block);
+ if (ret)
+ goto out_unlock;
+ }
+ }
+ FSB_BUG_ON(ret);
+ return 0;
+
+out_unlock:
+ unlock_page_range(page, size);
+ return ret;
+}
+EXPORT_SYMBOL(fsblock_read_page);
+
+static int block_write_helper(struct page *page, struct fsblock *block)
+{
+ FSB_BUG_ON(!test_bit(BL_mapped, &block->flags));
+
+ if (test_bit(BL_new, &block->flags)) {
+ sync_underlying_metadata(block);
+ clear_bit(BL_new, &block->flags);
+ set_bit(BL_dirty, &block->flags);
+ }
+
+ if (test_bit(BL_dirty, &block->flags)) {
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ clear_bit(BL_dirty, &block->flags);
+ FSB_BUG_ON(test_bit(BL_readin, &block->flags));
+ FSB_BUG_ON(test_bit(BL_writeback, &block->flags));
+ set_bit(BL_writeback, &block->flags);
+ return 1;
+ /*
+ * XXX: Careful of ordering between clear buffer / page dirty
+ * and set buffer / page dirty
+ */
+ }
+ return 0;
+}
+
+/* XXX: must obey non-blocking writeout! */
+int fsblock_write_page(struct page *page, insert_mapping_fn *insert_mapping,
+ struct writeback_control *wbc)
+{
+ struct address_space *mapping = page->mapping;
+ unsigned int size = 1 << mapping->host->i_blkbits;
+ struct fsblock *block;
+ int ret;
+
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(PageWriteback(page));
+
+ if (size_is_superpage(size)) {
+ struct page *orig_page = page;
+
+ redirty_page_for_writepage(wbc, orig_page);
+ ret = relock_superpage_block(&page, size);
+ if (ret)
+ return ret;
+ if (!clear_page_dirty_for_io(orig_page))
+ goto out_unlock;
+ }
+
+ if (!PageBlocks(page)) {
+ FSB_BUG(); /* XXX: should always have blocks here */
+ FSB_BUG_ON(!PageUptodate(page));
+ /* XXX: should rework (eg use page_mkwrite) so as to always
+ * have blocks by this stage!!! */
+ ret = create_unmapped_blocks(page, GFP_NOFS, size, CREATE_DIRTY);
+ if (ret)
+ goto out_unlock;
+ }
+
+ ret = insert_mapping(mapping, page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 1);
+ if (ret)
+ goto out_unlock;
+
+ /*
+ * XXX: todo - i_size handling ... should it be here? No - I would
+ * prefer partial page zeroing to go in filemap_nopage, and to tolerate
+ * filesystems writing garbage past EOF -- there is no sane way to do it
+ * other than invalidating the partial page before zeroing and writing
+ * it out, so that we can guarantee it isn't touched after being
+ * zeroed.
+ */
+
+ block = page_blocks(page);
+
+ if (!fsblock_superpage(block)) {
+
+ if (fsblock_subpage(block)) {
+ int nr = 0;
+ struct fsblock *b;
+ for_each_block(block, b)
+ nr += block_write_helper(page, b);
+ /* XXX: technically could happen (see set_page_dirty_blocks) */
+ FSB_BUG_ON(nr == 0);
+ if (nr == 0)
+ goto out_unlock;
+
+ FSB_BUG_ON(PageWriteback(page));
+ set_page_writeback(page);
+ unlock_page(page);
+ for_each_block(block, b) {
+ if (!test_bit(BL_writeback, &b->flags))
+ continue;
+ ret = submit_block(b, WRITE);
+ if (ret)
+ goto out_unlock;
+ /* XXX: error handling */
+ }
+ } else {
+ if (block_write_helper(page, block)) {
+ FSB_BUG_ON(PageWriteback(page));
+ set_page_writeback(page);
+ unlock_page(page);
+ ret = submit_block(block, WRITE);
+ if (ret)
+ goto out_unlock;
+ } else {
+ FSB_BUG(); /* XXX: see above */
+ goto out_unlock;
+ }
+ }
+ } else {
+ struct page *p;
+
+ FSB_BUG_ON(!test_bit(BL_mapped, &block->flags));
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ FSB_BUG_ON(!test_bit(BL_dirty, &block->flags));
+ FSB_BUG_ON(test_bit(BL_new, &block->flags));
+
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(page_blocks(p) != block);
+ FSB_BUG_ON(!PageUptodate(p));
+ } end_for_each_page;
+
+ for_each_page(page, size, p) {
+ clear_page_dirty_for_io(p);
+ FSB_BUG_ON(PageWriteback(p));
+ FSB_BUG_ON(!PageUptodate(p));
+ set_page_writeback(p);
+ unlock_page(p);
+ } end_for_each_page;
+
+ /* XXX: recheck ordering here! don't want to lose dirty bits */
+
+ clear_bit(BL_dirty, &block->flags);
+ ret = write_block(block);
+ if (ret)
+ goto out_unlock;
+ }
+ FSB_BUG_ON(ret);
+ return 0;
+
+out_unlock:
+ unlock_page_range(page, size);
+ return ret;
+}
+EXPORT_SYMBOL(fsblock_write_page);
+
+static int block_dirty_helper(struct page *page, struct fsblock *block)
+{
+ FSB_BUG_ON(!test_bit(BL_mapped, &block->flags));
+
+ if (test_bit(BL_uptodate, &block->flags))
+ return 0;
+
+ FSB_BUG_ON(PageUptodate(page));
+
+ if (test_bit(BL_new, &block->flags)) {
+ unsigned int size = fsblock_size(block);
+ unsigned int offset = block_page_offset(block, size);
+ zero_user_page(page, offset, size, KM_USER0);
+ set_bit(BL_uptodate, &block->flags);
+ sync_underlying_metadata(block);
+ clear_bit(BL_new, &block->flags);
+ set_bit(BL_dirty, &block->flags);
+ __set_page_dirty_noblocks(page);
+ return 0;
+ }
+ return 1;
+}
+
+static int fsblock_prepare_write_super(struct page *orig_page,
+ unsigned int size, unsigned from, unsigned to,
+ insert_mapping_fn *insert_mapping)
+{
+ struct address_space *mapping = orig_page->mapping;
+ struct fsblock *block;
+ struct page *page = orig_page, *p;
+ int ret;
+
+ ret = relock_superpage_block(&page, size);
+ if (ret)
+ return ret;
+
+ FSB_BUG_ON(PageBlocks(page) != PageBlocks(orig_page));
+ if (!PageBlocks(page)) {
+ FSB_BUG_ON(PageDirty(orig_page));
+ FSB_BUG_ON(PageDirty(page));
+ ret = create_unmapped_blocks(page, GFP_NOFS, size, 0);
+ if (ret)
+ goto out_unlock;
+ }
+ FSB_BUG_ON(!PageBlocks(page));
+
+ ret = insert_mapping(mapping, (loff_t)page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 1);
+ if (ret)
+ goto out_unlock;
+
+ block = page_blocks(page);
+
+ if (test_bit(BL_new, &block->flags)) {
+ for_each_page(page, size, p) {
+ if (!PageUptodate(p)) {
+ FSB_BUG_ON(PageDirty(p));
+ zero_user_page(p, 0, PAGE_CACHE_SIZE, KM_USER0);
+ SetPageUptodate(p);
+ }
+ __set_page_dirty_noblocks(p);
+ } end_for_each_page;
+
+ set_bit(BL_uptodate, &block->flags);
+ sync_underlying_metadata(block);
+ clear_bit(BL_new, &block->flags);
+ set_bit(BL_dirty, &block->flags);
+ } else if (!test_bit(BL_uptodate, &block->flags)) {
+ FSB_BUG_ON(test_bit(BL_dirty, &block->flags));
+
+ set_block_sync_io(block);
+ ret = read_block(block);
+ if (ret)
+ goto out_unlock;
+ wait_on_block_sync_io(block);
+ if (!test_bit(BL_uptodate, &block->flags)) {
+ ret = -EIO;
+ goto out_unlock;
+ }
+ }
+
+ return 0;
+
+out_unlock:
+ unlock_page_range(page, size);
+ lock_page(orig_page);
+ FSB_BUG_ON(!ret);
+ return ret;
+}
+
+int fsblock_prepare_write(struct page *page, unsigned from, unsigned to,
+ insert_mapping_fn *insert_mapping)
+{
+ struct address_space *mapping = page->mapping;
+ unsigned int size = 1 << mapping->host->i_blkbits;
+ struct fsblock *block;
+ int ret, nr;
+
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(from > PAGE_CACHE_SIZE);
+ FSB_BUG_ON(to > PAGE_CACHE_SIZE);
+ FSB_BUG_ON(from > to);
+
+ if (size_is_superpage(size))
+ return fsblock_prepare_write_super(page, size, from, to, insert_mapping);
+
+ if (!PageBlocks(page)) {
+ ret = create_unmapped_blocks(page, GFP_NOFS, size, 0);
+ if (ret)
+ return ret;
+ }
+
+ ret = insert_mapping(mapping, (loff_t)page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 1);
+ if (ret)
+ return ret;
+
+ block = page_blocks(page);
+
+ nr = 0;
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+ for_each_block(block, b)
+ nr += block_dirty_helper(page, b);
+ } else
+ nr += block_dirty_helper(page, block);
+ if (nr == 0)
+ SetPageUptodate(page);
+
+ if (PageUptodate(page))
+ return 0;
+
+ if (to - from == PAGE_CACHE_SIZE)
+ return 0;
+
+ /*
+ * XXX: this is stupid, could do better with write_begin aops, or
+ * just zero out unwritten partial blocks.
+ */
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+ for_each_block(block, b) {
+ if (test_bit(BL_uptodate, &b->flags))
+ continue;
+ set_block_sync_io(b);
+ ret = read_block(b);
+ if (ret)
+ break;
+ }
+
+ for_each_block(block, b) {
+ wait_on_block_sync_io(b);
+ if (!ret && !test_bit(BL_uptodate, &b->flags))
+ ret = -EIO;
+ }
+ if (ret)
+ return ret;
+ } else {
+
+ FSB_BUG_ON(test_bit(BL_uptodate, &block->flags));
+ set_block_sync_io(block);
+ ret = read_block(block);
+ if (ret)
+ return ret;
+ wait_on_block_sync_io(block);
+ if (test_bit(BL_error, &block->flags))
+ SetPageError(page);
+ if (!test_bit(BL_uptodate, &block->flags))
+ return -EIO;
+ }
+ SetPageUptodate(page);
+
+ return 0;
+}
+EXPORT_SYMBOL(fsblock_prepare_write);
+
+static int __fsblock_commit_write_super(struct page *orig_page,
+ struct fsblock *block, unsigned from, unsigned to)
+{
+ unsigned int size = fsblock_size(block);
+ struct page *page, *p;
+
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ set_bit(BL_dirty, &block->flags);
+ page = block->page;
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(!PageUptodate(p));
+ __set_page_dirty_noblocks(p);
+ } end_for_each_page;
+ unlock_page_range(page, size);
+ lock_page(orig_page);
+
+ return 0;
+}
+
+static int __fsblock_commit_write_sub(struct page *page,
+ struct fsblock *block, unsigned from, unsigned to)
+{
+ struct fsblock *b;
+
+ for_each_block(block, b) {
+ if (to - from < PAGE_CACHE_SIZE)
+ FSB_BUG_ON(!test_bit(BL_uptodate, &b->flags));
+ else
+ set_bit(BL_uptodate, &b->flags);
+ if (!test_bit(BL_dirty, &b->flags))
+ set_bit(BL_dirty, &b->flags);
+ }
+ if (to - from < PAGE_CACHE_SIZE)
+ FSB_BUG_ON(!PageUptodate(page));
+ else
+ SetPageUptodate(page);
+ __set_page_dirty_noblocks(page);
+
+ return 0;
+}
+
+int __fsblock_commit_write(struct page *page, unsigned from, unsigned to)
+{
+ struct fsblock *block;
+
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(from > PAGE_CACHE_SIZE);
+ FSB_BUG_ON(to > PAGE_CACHE_SIZE);
+ FSB_BUG_ON(from > to);
+ FSB_BUG_ON(!PageBlocks(page));
+
+ block = page_blocks(page);
+ FSB_BUG_ON(!test_bit(BL_mapped, &block->flags));
+
+ if (fsblock_superpage(block))
+ return __fsblock_commit_write_super(page, block, from, to);
+ if (fsblock_subpage(block))
+ return __fsblock_commit_write_sub(page, block, from, to);
+
+ if (to - from < PAGE_CACHE_SIZE) {
+ FSB_BUG_ON(!PageUptodate(page));
+ FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+ } else {
+ set_bit(BL_uptodate, &block->flags);
+ SetPageUptodate(page);
+ }
+
+ if (!test_bit(BL_dirty, &block->flags))
+ set_bit(BL_dirty, &block->flags);
+ __set_page_dirty_noblocks(page);
+
+ return 0;
+}
+EXPORT_SYMBOL(__fsblock_commit_write);
+
+int fsblock_commit_write(struct file *file, struct page *page, unsigned from, unsigned to)
+{
+ struct inode *inode;
+ loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ int ret;
+
+ inode = page->mapping->host;
+ ret = __fsblock_commit_write(page, from, to);
+
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold i_mutex.
+ */
+ if (!ret && pos > inode->i_size) {
+ i_size_write(inode, pos);
+ mark_inode_dirty(inode);
+ }
+ return ret;
+
+}
+EXPORT_SYMBOL(fsblock_commit_write);
+
+/* XXX: I think this is racy (must verify against page_mkclean). Must have
+ * some operation to pin a page's metadata while dirtying it (this will
+ * also fix get_user_pages dirtying once callers are converted).
+ */
+int fsblock_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+ struct address_space *mapping;
+ const struct address_space_operations *a_ops;
+ int ret = 0;
+
+ lock_page(page);
+ mapping = page->mapping;
+ if (!mapping) {
+ /* Caller will take care of it */
+ goto out;
+ }
+ a_ops = mapping->a_ops;
+
+ /* XXX: don't instantiate blocks past isize! (same for truncate?) */
+ ret = a_ops->prepare_write(NULL, page, 0, PAGE_CACHE_SIZE);
+ if (ret == 0)
+ ret = __fsblock_commit_write(page, 0, PAGE_CACHE_SIZE);
+out:
+ unlock_page(page);
+
+ return ret;
+}
+EXPORT_SYMBOL(fsblock_page_mkwrite);
+
+static int fsblock_truncate_page_super(struct address_space *mapping, loff_t from)
+{
+ pgoff_t index;
+ unsigned offset;
+ const struct address_space_operations *a_ops = mapping->a_ops;
+ unsigned int size = 1 << mapping->host->i_blkbits;
+ unsigned int nr_pages;
+ unsigned int length;
+ int i, err;
+
+ length = from & (size - 1);
+ if (length == 0)
+ return 0;
+
+ nr_pages = ((size - length + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT);
+ index = from >> PAGE_CACHE_SHIFT;
+ offset = from & (PAGE_CACHE_SIZE-1);
+
+ err = 0;
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page;
+
+ page = grab_cache_page(mapping, index + i);
+ if (!page) {
+ err = -ENOMEM;
+ break;
+ }
+
+ err = a_ops->prepare_write(NULL, page, offset, PAGE_CACHE_SIZE);
+ if (!err) {
+ FSB_BUG_ON(!PageBlocks(page));
+ zero_user_page(page, offset, PAGE_CACHE_SIZE-offset, KM_USER0);
+ err = __fsblock_commit_write(page, offset, PAGE_CACHE_SIZE);
+ }
+
+ unlock_page(page);
+ page_cache_release(page);
+ if (err)
+ break;
+ offset = 0;
+ }
+ return err;
+}
+
+int fsblock_truncate_page(struct address_space *mapping, loff_t from)
+{
+ pgoff_t index;
+ unsigned offset;
+ struct page *page;
+ const struct address_space_operations *a_ops = mapping->a_ops;
+ unsigned int size = 1 << mapping->host->i_blkbits;
+ unsigned int length;
+ int ret;
+
+ if (size_is_superpage(size))
+ return fsblock_truncate_page_super(mapping, from);
+
+ length = from & (size - 1);
+ if (length == 0)
+ return 0;
+
+ index = from >> PAGE_CACHE_SHIFT;
+ offset = from & (PAGE_CACHE_SIZE-1);
+
+ page = grab_cache_page(mapping, index);
+ if (!page)
+ return -ENOMEM;
+
+ ret = a_ops->prepare_write(NULL, page, offset, PAGE_CACHE_SIZE);
+ if (ret == 0) {
+ zero_user_page(page, offset, PAGE_CACHE_SIZE-offset, KM_USER0);
+ /*
+ * a_ops->commit_write would extend i_size :( Have to assume
+ * caller uses fsblock_prepare_write.
+ */
+ ret = __fsblock_commit_write(page, offset, PAGE_CACHE_SIZE);
+ }
+ unlock_page(page);
+ page_cache_release(page);
+ return ret;
+}
+EXPORT_SYMBOL(fsblock_truncate_page);
+
+static int can_free_block(struct fsblock *block)
+{
+ return atomic_read(&block->count) == 1 &&
+ !test_bit(BL_dirty, &block->flags) &&
+ !block->private;
+}
+
+static int __try_to_free_block(struct fsblock *block)
+{
+ int ret = 0;
+ if (can_free_block(block)) {
+ if (atomic_dec_and_test(&block->count)) {
+ if (!test_bit(BL_dirty, &block->flags)) {
+ ret = 1;
+ goto out;
+ }
+ }
+ atomic_inc(&block->count);
+ }
+out:
+ unlock_block(block);
+
+ return ret;
+}
+
+static int try_to_free_block(struct fsblock *block)
+{
+ /*
+ * XXX: get rid of block locking from here and invalidate_block --
+ * use page lock instead?
+ */
+ if (trylock_block(block))
+ return __try_to_free_block(block);
+ return 0;
+}
+
+static int try_to_free_blocks_super(struct page *orig_page, int all_locked)
+{
+ unsigned int size;
+ struct fsblock *block;
+ struct page *page, *p;
+ int i;
+ int ret = 0;
+
+ FSB_BUG_ON(!PageLocked(orig_page));
+ FSB_BUG_ON(!PageBlocks(orig_page));
+
+ if (PageDirty(orig_page) || PageWriteback(orig_page))
+ return ret;
+
+ block = page_blocks(orig_page);
+ page = block->page;
+ size = fsblock_size(block);
+
+ i = 0;
+ if (!all_locked) {
+ for_each_page(page, size, p) {
+ if (p != orig_page) {
+ if (TestSetPageLocked(p))
+ goto out;
+ i++;
+ if (PageWriteback(p))
+ goto out;
+ }
+ } end_for_each_page;
+ }
+
+ assert_block(block);
+
+ if (!can_free_block(block))
+ goto out;
+ if (!try_to_free_block(block))
+ goto out;
+
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(!PageLocked(p));
+ FSB_BUG_ON(!PageBlocks(p));
+ FSB_BUG_ON(PageWriteback(p));
+ clear_page_blocks(p);
+ } end_for_each_page;
+
+ free_block(block);
+
+ ret = 1;
+
+out:
+ if (i > 0) {
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(PageDirty(p)); /* XXX: racy? */
+ if (p != orig_page) {
+ unlock_page(p);
+ i--;
+ if (i == 0)
+ break;
+ }
+ } end_for_each_page;
+ }
+ return ret;
+}
+
+static int __try_to_free_blocks(struct page *page, int all_locked)
+{
+ unsigned int size;
+ struct fsblock *block;
+
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(!PageBlocks(page));
+
+ if (PageDirty(page) || PageWriteback(page))
+ return 0;
+
+ block = page_blocks(page);
+ if (fsblock_superpage(block))
+ return try_to_free_blocks_super(page, all_locked);
+
+ assert_block(block);
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+
+ for_each_block(block, b) {
+ if (!can_free_block(b))
+ return 0;
+ }
+
+ for_each_block(block, b) {
+ /*
+ * must decrement head block last, so that if the
+ * find_get_page_block fails, then the blocks will
+ * really be freed.
+ */
+ if (b == block)
+ continue;
+ if (!try_to_free_block(b))
+ goto error;
+ }
+ if (!try_to_free_block(block))
+ goto error;
+
+ size = fsblock_size(block);
+ FSB_BUG_ON(block != page_blocks(page));
+ goto success;
+error:
+ for_each_block(block, b) {
+ if (atomic_read(&b->count) == 0)
+ atomic_inc(&b->count);
+ }
+ return 0;
+ } else {
+ if (!can_free_block(block))
+ return 0;
+ if (!try_to_free_block(block))
+ return 0;
+ size = PAGE_CACHE_SIZE;
+ }
+
+success:
+ clear_page_blocks(page);
+ free_block(block);
+ return 1;
+}
+
+int try_to_free_blocks(struct page *page)
+{
+ return __try_to_free_blocks(page, 0);
+}
+
+static void invalidate_block(struct fsblock *block)
+{
+ lock_block(block);
+ /*
+ * XXX
+ * FSB_BUG_ON(test_bit(BL_new, &block->flags));
+ * -- except vmtruncate of new pages can come here
+ * via prepare_write failure
+ */
+ clear_bit(BL_new, &block->flags);
+ clear_bit(BL_dirty, &block->flags);
+ clear_bit(BL_uptodate, &block->flags);
+ clear_bit(BL_mapped, &block->flags);
+ block->block_nr = -1;
+ unlock_block(block);
+ /* XXX: if metadata, then have an fs-private release? */
+}
+
+void fsblock_invalidate_page(struct page *page, unsigned long offset)
+{
+ struct fsblock *block;
+
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(PageWriteback(page));
+ FSB_BUG_ON(!PageBlocks(page));
+
+ block = page_blocks(page);
+ if (fsblock_superpage(block)) {
+ struct page *orig_page = page;
+ struct page *p;
+ unsigned int size = fsblock_size(block);
+ /* XXX: the below may not work for hole punching? */
+ if (page->index & ((size >> PAGE_CACHE_SHIFT) - 1))
+ return;
+ if (offset != 0)
+ return;
+ page = block->page;
+ /* XXX: could lock these pages? */
+ for_each_page(page, size, p) {
+ FSB_BUG_ON(PageWriteback(p));
+#if 0
+ XXX: generic code should not do it for us
+ if (p->index == orig_page->index)
+ continue;
+#endif
+ cancel_dirty_page(p, PAGE_CACHE_SIZE);
+ ClearPageUptodate(p);
+ ClearPageMappedToDisk(p);
+ } end_for_each_page;
+ invalidate_block(block);
+ try_to_free_blocks(orig_page);
+ return;
+ }
+
+ if (fsblock_subpage(block)) {
+ unsigned int size = fsblock_size(block);
+ unsigned int curr;
+ struct fsblock *b;
+
+ curr = 0;
+ for_each_block(block, b) {
+ if (curr >= offset)
+ invalidate_block(b);
+ /* advance curr for every block, including ones before offset */
+ curr += size;
+ }
+ } else {
+ if (offset == 0)
+ invalidate_block(block);
+ }
+ if (offset == 0)
+ try_to_free_blocks(page);
+}
+EXPORT_SYMBOL(fsblock_invalidate_page);
+
+static struct vm_operations_struct fsblock_file_vm_ops = {
+ .nopage = filemap_nopage,
+ .populate = filemap_populate,
+ .page_mkwrite = fsblock_page_mkwrite,
+};
+
+/* This is used for a general mmap of a disk file */
+
+int fsblock_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &fsblock_file_vm_ops;
+ return 0;
+}
+EXPORT_SYMBOL(fsblock_file_mmap);
Index: linux-2.6/include/linux/fsblock.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/fsblock.h
@@ -0,0 +1,347 @@
+#ifndef __FSBLOCK_H__
+#define __FSBLOCK_H__
+
+#include <linux/fsblock_types.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/bitops.h>
+#include <linux/page-flags.h>
+#include <linux/mm_types.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rcupdate.h>
+#include <linux/gfp.h>
+#include <asm/atomic.h>
+
+#define MIN_SECTOR_SHIFT 9 /* 512 bytes */
+
+#define BL_bits_mask 0x000f
+
+#define BL_locked 4
+#define BL_dirty 5
+#define BL_error 6
+#define BL_uptodate 7
+
+#define BL_mapped 8
+#define BL_new 9
+#define BL_writeback 10
+#define BL_readin 11
+
+#define BL_sync_io 12 /* IO completion doesn't unlock/unwriteback */
+#define BL_metadata 13 /* Metadata. If set, page->mapping is the
+ * blkdev inode. */
+#define BL_wb_lock 14 /* writeback lock (on first subblock in page) */
+#define BL_rd_lock 14 /* readin lock (never active with wb_lock) */
+#define BL_assoc_lock 14
+#ifdef VMAP_CACHE
+#define BL_vmap_lock 14
+#define BL_vmapped 15
+#endif
+
+/*
+ * XXX: eventually want to replace BL_pagecache_io with synchronised block
+ * and page flags manipulations (set_page_dirty of get_user_pages could be
+ * a problem? could have some extra call to pin buffers, though?).
+ */
+
+/*
+ * XXX: should distinguish data blocks from metadata blocks. Data block
+ * attachment (or dirtying?) could cause the page to *also* be added to
+ * the blkdev page_tree (with the host inode still at page->mapping). This
+ * could allow coherent blkdev/pagecache and also sweet block device based
+ * page writeout.
+ */
+
+static inline struct fsblock_meta *block_mblock(struct fsblock *block)
+{
+ FSB_BUG_ON(!test_bit(BL_metadata, &block->flags));
+ return (struct fsblock_meta *)block;
+}
+
+static inline struct fsblock *mblock_block(struct fsblock_meta *mblock)
+{
+ return &mblock->block;
+}
+
+static inline unsigned int fsblock_bits(struct fsblock *block)
+{
+ unsigned int bits = (block->flags & BL_bits_mask) + MIN_SECTOR_SHIFT;
+#ifdef FSB_DEBUG
+ if (!test_bit(BL_metadata, &block->flags))
+ FSB_BUG_ON(block->page->mapping->host->i_blkbits != bits);
+#endif
+ return bits;
+}
+
+static inline void fsblock_set_bits(struct fsblock *block, unsigned int bits)
+{
+ FSB_BUG_ON(block->flags & BL_bits_mask);
+ FSB_BUG_ON(bits < MIN_SECTOR_SHIFT);
+ FSB_BUG_ON(bits > BL_bits_mask + MIN_SECTOR_SHIFT);
+ block->flags |= bits - MIN_SECTOR_SHIFT;
+}
+
+static inline int size_is_superpage(unsigned int size)
+{
+#ifdef BLOCK_SUPERPAGE_SUPPORT
+ return size > PAGE_CACHE_SIZE;
+#else
+ return 0;
+#endif
+}
+
+static inline int fsblock_subpage(struct fsblock *block)
+{
+ return fsblock_bits(block) < PAGE_CACHE_SHIFT;
+}
+
+static inline int fsblock_midpage(struct fsblock *block)
+{
+ return fsblock_bits(block) == PAGE_CACHE_SHIFT;
+}
+
+static inline int fsblock_superpage(struct fsblock *block)
+{
+#ifdef BLOCK_SUPERPAGE_SUPPORT
+ return fsblock_bits(block) > PAGE_CACHE_SHIFT;
+#else
+ return 0;
+#endif
+}
+
+static inline unsigned int fsblock_size(struct fsblock *block)
+{
+ return 1 << fsblock_bits(block);
+}
+
+static inline int sizeof_block(struct fsblock *block)
+{
+ if (test_bit(BL_metadata, &block->flags))
+ return sizeof(struct fsblock_meta);
+ else
+ return sizeof(struct fsblock);
+
+}
+
+static inline struct fsblock *page_blocks_rcu(struct page *page)
+{
+ return rcu_dereference((struct fsblock *)page->private);
+}
+
+static inline struct fsblock *page_blocks(struct page *page)
+{
+ struct fsblock *block;
+ FSB_BUG_ON(!PageBlocks(page));
+ block = (struct fsblock *)page->private;
+ FSB_BUG_ON(!fsblock_superpage(block) && block->page != page);
+ /* XXX these go bang if put here
+ FSB_BUG_ON(PageUptodate(page) && !test_bit(BL_uptodate, &block->flags));
+ FSB_BUG_ON(test_bit(BL_dirty, &block->flags) && !PageDirty(page));
+ */
+ return block;
+}
+
+static inline struct fsblock_meta *page_mblocks(struct page *page)
+{
+ return block_mblock(page_blocks(page));
+}
+
+static inline void attach_page_blocks(struct page *page, struct fsblock *block)
+{
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(PageBlocks(page));
+ FSB_BUG_ON(PagePrivate(page));
+ SetPageBlocks(page);
+ smp_wmb(); /* Rather than rcu_assign_pointer */
+ page->private = (unsigned long)block;
+ page_cache_get(page);
+}
+
+static inline void clear_page_blocks(struct page *page)
+{
+ FSB_BUG_ON(!PageLocked(page));
+ FSB_BUG_ON(!PageBlocks(page));
+ FSB_BUG_ON(PageDirty(page));
+ ClearPageBlocks(page);
+ page->private = (unsigned long)NULL;
+ page_cache_release(page);
+}
+
+
+static inline void map_fsblock(struct fsblock *block, sector_t blocknr)
+{
+ FSB_BUG_ON(test_bit(BL_mapped, &block->flags));
+ block->block_nr = blocknr;
+ set_bit(BL_mapped, &block->flags);
+#ifdef FSB_DEBUG
+ /* XXX: test for inside bdev? */
+ if (test_bit(BL_metadata, &block->flags)) {
+ FSB_BUG_ON(block->block_nr << fsblock_bits(block) >> PAGE_CACHE_SHIFT
+ != block->page->index);
+ }
+#endif
+}
+
+#define assert_first_block(first) \
+({ \
+ FSB_BUG_ON((struct fsblock *)first != page_blocks(first->page));\
+ first; \
+})
+
+#define block_inbounds(first, b, bsize, size_of) \
+({ \
+ int ret; \
+ FSB_BUG_ON(!fsblock_subpage(first)); \
+ FSB_BUG_ON(sizeof_block(first) != size_of); \
+ ret = ((unsigned long)b - (unsigned long)first) * bsize < \
+ PAGE_CACHE_SIZE * size_of; \
+ if (ret) { \
+ FSB_BUG_ON(!fsblock_subpage(b)); \
+ FSB_BUG_ON(test_bit(BL_metadata, &first->flags) != \
+ test_bit(BL_metadata, &b->flags)); \
+ FSB_BUG_ON(sizeof_block(b) != size_of); \
+ } \
+ ret; \
+})
+
+#define for_each_block(first, b) \
+ for (b = assert_first_block(first); block_inbounds(first, b, fsblock_size(first), sizeof_block(first)); b = (void *)((unsigned long)b + sizeof_block(first)))
+
+#define __for_each_block(first, size, b) \
+ for (b = assert_first_block(first); block_inbounds(first, b, size, sizeof(struct fsblock)); b++)
+
+#define __for_each_mblock(first, size, mb) \
+ for (mb = block_mblock(assert_first_block(mblock_block(first))); block_inbounds(mblock_block(first), mblock_block(mb), size, sizeof(struct fsblock_meta)); mb++)
+
+
+#define first_page_idx(idx, bsize) ((idx) & ~(((bsize) >> PAGE_CACHE_SHIFT)-1))
+
+static inline struct page *find_page(struct address_space *mapping, pgoff_t index)
+{
+ struct page *page;
+
+ read_lock_irq(&mapping->tree_lock);
+ page = radix_tree_lookup(&mapping->page_tree, index);
+ read_unlock_irq(&mapping->tree_lock);
+
+ return page;
+}
+
+static inline void find_pages(struct address_space *mapping, pgoff_t start, int nr_pages, struct page **pages)
+{
+ int ret;
+
+ read_lock_irq(&mapping->tree_lock);
+ ret = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)pages, start, nr_pages);
+ read_unlock_irq(&mapping->tree_lock);
+ FSB_BUG_ON(ret != nr_pages);
+}
+
+#define for_each_page(page, size, p) \
+do { \
+ pgoff_t ___idx = (page)->index; \
+ int ___i, ___nr = (size) >> PAGE_CACHE_SHIFT; \
+ (p) = (page); \
+ FSB_BUG_ON(___idx != first_page_idx(___idx, size)); \
+ for (___i = 0; ___i < ___nr; ___i++) { \
+ (p) = find_page(page->mapping, ___idx + ___i); \
+ FSB_BUG_ON(!(p)); \
+ { struct { int i; } page; (void)page.i; \
+
+#define end_for_each_page } } } while (0)
+
+static inline pgoff_t sector_pgoff(sector_t blocknr, unsigned int blkbits)
+{
+ if (blkbits <= PAGE_CACHE_SHIFT)
+ return blocknr >> (PAGE_CACHE_SHIFT - blkbits);
+ else
+ return blocknr << (blkbits - PAGE_CACHE_SHIFT);
+}
+
+static inline sector_t pgoff_sector(pgoff_t pgoff, unsigned int blkbits)
+{
+ if (blkbits <= PAGE_CACHE_SHIFT)
+ return (sector_t)pgoff << (PAGE_CACHE_SHIFT - blkbits);
+ else
+ return (sector_t)pgoff >> (blkbits - PAGE_CACHE_SHIFT);
+}
+
+static inline unsigned int block_page_offset(struct fsblock *block, unsigned int size)
+{
+ unsigned int idx;
+ unsigned int size_of = sizeof_block(block);
+ idx = (unsigned long)block - (unsigned long)page_blocks(block->page);
+ return size * (idx / size_of);
+}
+
+int fsblock_set_page_dirty(struct page *page);
+
+struct fsblock_meta *find_get_mblock(struct block_device *bdev, sector_t blocknr, unsigned int size);
+
+struct fsblock_meta *find_or_create_mblock(struct block_device *bdev, sector_t blocknr, unsigned int size);
+
+struct fsblock_meta *mbread(struct block_device *bdev, sector_t blocknr, unsigned int size);
+
+
+static inline struct fsblock_meta *sb_find_get_mblock(struct super_block *sb, sector_t blocknr)
+{
+ return find_get_mblock(sb->s_bdev, blocknr, sb->s_blocksize);
+}
+
+static inline struct fsblock_meta *sb_find_or_create_mblock(struct super_block *sb, sector_t blocknr)
+{
+ return find_or_create_mblock(sb->s_bdev, blocknr, sb->s_blocksize);
+}
+
+static inline struct fsblock_meta *sb_mbread(struct super_block *sb, sector_t blocknr)
+{
+ return mbread(sb->s_bdev, blocknr, sb->s_blocksize);
+}
+
+void mark_mblock_uptodate(struct fsblock_meta *mblock);
+int mark_mblock_dirty(struct fsblock_meta *mblock);
+int mark_mblock_dirty_inode(struct fsblock_meta *mblock, struct inode *inode);
+
+int sync_block(struct fsblock *block);
+
+/* XXX: are these always for metablocks? (no, directory in pagecache?) */
+void *vmap_block(struct fsblock *block, off_t off, size_t len);
+void vunmap_block(struct fsblock *block, off_t off, size_t len, void *vaddr);
+
+void block_get(struct fsblock *block);
+#define mblock_get(b) block_get(mblock_block(b))
+void block_put(struct fsblock *block);
+#define mblock_put(b) block_put(mblock_block(b))
+
+static inline int trylock_block(struct fsblock *block)
+{
+ return likely(!test_and_set_bit(BL_locked, &block->flags));
+}
+void lock_block(struct fsblock *block);
+void unlock_block(struct fsblock *block);
+
+sector_t fsblock_bmap(struct address_space *mapping, sector_t block, insert_mapping_fn *insert_mapping);
+
+int fsblock_read_page(struct page *page, insert_mapping_fn *insert_mapping);
+int fsblock_write_page(struct page *page, insert_mapping_fn *insert_mapping,
+ struct writeback_control *wbc);
+
+int fsblock_prepare_write(struct page *page, unsigned from, unsigned to, insert_mapping_fn insert_mapping);
+int __fsblock_commit_write(struct page *page, unsigned from, unsigned to);
+int fsblock_commit_write(struct file *file, struct page *page, unsigned from, unsigned to);
+int fsblock_page_mkwrite(struct vm_area_struct *vma, struct page *page);
+int fsblock_truncate_page(struct address_space *mapping, loff_t from);
+void fsblock_invalidate_page(struct page *page, unsigned long offset);
+int fsblock_release(struct address_space *mapping, int force);
+int fsblock_sync(struct address_space *mapping);
+
+//int alloc_mapping_blocks(struct address_space *mapping, pgoff_t pgoff, gfp_t gfp_flags);
+int try_to_free_blocks(struct page *page);
+
+int fsblock_file_mmap(struct file *file, struct vm_area_struct *vma);
+
+void fsblock_init(void);
+
+#endif
Index: linux-2.6/include/linux/fsblock_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/fsblock_types.h
@@ -0,0 +1,70 @@
+#ifndef __FSBLOCK_TYPES_H__
+#define __FSBLOCK_TYPES_H__
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+#include <linux/rcupdate.h>
+#include <asm/atomic.h>
+
+#define FSB_DEBUG 1
+
+#ifdef FSB_DEBUG
+# define FSB_BUG() BUG()
+# define FSB_BUG_ON(x) BUG_ON(x)
+#else
+# define FSB_BUG()
+# define FSB_BUG_ON(x)
+#endif
+
+#define BLOCK_SUPERPAGE_SUPPORT 1
+
+/*
+ * XXX: this is a hack for filesystems that regularly vmap the entire block,
+ * and it won't even work on systems with limited vmalloc space.
+ * Filesystems should instead vmap in page sized chunks (with some helpers
+ * provided). Currently racy when vunmapping from end_io interrupt context.
+ */
+#define VMAP_CACHE 1
+
+struct address_space;
+
+/*
+ * inode == page->mapping->host
+ * bsize == inode->i_blkbits
+ * bdev == inode->i_bdev
+ */
+struct fsblock {
+ atomic_t count;
+ union {
+ struct {
+ unsigned long flags; /* XXX: flags could be int for better packing */
+
+ sector_t block_nr;
+ void *private;
+ struct page *page; /* Superpage block pages found via ->mapping */
+ };
+ struct rcu_head rcu_head; /* XXX: can go away if we have kmem caches for fsblocks */
+ };
+
+#ifdef VMAP_CACHE
+ void *vaddr;
+#endif
+
+#ifdef FSB_DEBUG
+ atomic_t vmap_count;
+#endif
+};
+
+struct fsblock_meta {
+ struct fsblock block;
+
+ /* Nothing else, at the moment */
+ /* XXX: could just get rid of fsblock_meta? */
+};
+
+typedef int (insert_mapping_fn)(struct address_space *mapping,
+ loff_t off, size_t len, int create);
+
+#endif
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -50,6 +50,7 @@
#include <linux/key.h>
#include <linux/unwind.h>
#include <linux/buffer_head.h>
+#include <linux/fsblock.h>
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
#include <linux/pid_namespace.h>
@@ -613,6 +614,7 @@ asmlinkage void __init start_kernel(void
fork_init(num_physpages);
proc_caches_init();
buffer_init();
+ fsblock_init();
unnamed_dev_init();
key_init();
security_init();
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -199,6 +199,7 @@ static void bad_page(struct page *page)
dump_stack();
page->flags &= ~(1 << PG_lru |
1 << PG_private |
+ 1 << PG_blocks |
1 << PG_locked |
1 << PG_active |
1 << PG_dirty |
@@ -436,6 +437,7 @@ static inline int free_pages_check(struc
(page->flags & (
1 << PG_lru |
1 << PG_private |
+ 1 << PG_blocks |
1 << PG_locked |
1 << PG_active |
1 << PG_slab |
@@ -590,6 +592,7 @@ static int prep_new_page(struct page *pa
(page->flags & (
1 << PG_lru |
1 << PG_private |
+ 1 << PG_blocks |
1 << PG_locked |
1 << PG_active |
1 << PG_dirty |
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -34,6 +34,7 @@
#include <linux/hash.h>
#include <linux/suspend.h>
#include <linux/buffer_head.h>
+#include <linux/fsblock.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
#include <linux/notifier.h>
@@ -988,6 +989,11 @@ grow_dev_page(struct block_device *bdev,

BUG_ON(!PageLocked(page));

+ if (PageBlocks(page)) {
+ if (try_to_free_blocks(page))
+ return NULL;
+ }
+
if (page_has_buffers(page)) {
bh = page_buffers(page);
if (bh->b_size == size) {
@@ -1596,6 +1602,10 @@ static int __block_write_full_page(struc
last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;

if (!page_has_buffers(page)) {
+ if (PageBlocks(page)) {
+ if (try_to_free_blocks(page))
+ return -EBUSY;
+ }
create_empty_buffers(page, blocksize,
(1 << BH_Dirty)|(1 << BH_Uptodate));
}
@@ -1757,8 +1767,13 @@ static int __block_prepare_write(struct
BUG_ON(from > to);

blocksize = 1 << inode->i_blkbits;
- if (!page_has_buffers(page))
+ if (!page_has_buffers(page)) {
+ if (PageBlocks(page)) {
+ if (try_to_free_blocks(page))
+ return -EBUSY;
+ }
create_empty_buffers(page, blocksize, 0);
+ }
head = page_buffers(page);

bbits = inode->i_blkbits;
@@ -1911,8 +1926,13 @@ int block_read_full_page(struct page *pa

BUG_ON(!PageLocked(page));
blocksize = 1 << inode->i_blkbits;
- if (!page_has_buffers(page))
+ if (!page_has_buffers(page)) {
+ if (PageBlocks(page)) {
+ if (try_to_free_blocks(page))
+ return -EBUSY;
+ }
create_empty_buffers(page, blocksize, 0);
+ }
head = page_buffers(page);

iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
@@ -2475,8 +2495,13 @@ int block_truncate_page(struct address_s
if (!page)
goto out;

- if (!page_has_buffers(page))
+ if (!page_has_buffers(page)) {
+ if (PageBlocks(page)) {
+ if (try_to_free_blocks(page))
+ return -EBUSY;
+ }
create_empty_buffers(page, blocksize, 0);
+ }

/* Find the buffer that contains "offset" */
bh = page_buffers(page);
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -25,6 +25,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>
+#include <linux/fsblock.h>
#include <linux/module.h>
#include <linux/syscalls.h>
#include <linux/uio.h>
@@ -73,7 +74,7 @@ static int page_cache_pipe_buf_steal(str
*/
wait_on_page_writeback(page);

- if (PagePrivate(page))
+ if (PagePrivate(page) || PageBlocks(page))
try_to_release_page(page, GFP_KERNEL);

/*
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -230,6 +230,7 @@ static inline void attach_page_buffers(s
struct buffer_head *head)
{
page_cache_get(page);
+ BUG_ON(PageBlocks(page));
SetPagePrivate(page);
set_page_private(page, (unsigned long)head);
}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -37,6 +37,7 @@
* FIXME: remove all knowledge of the buffer layer from the core VM
*/
#include <linux/buffer_head.h> /* for generic_osync_inode */
+#include <linux/fsblock.h>

#include <asm/mman.h>

@@ -2473,9 +2474,13 @@ int try_to_release_page(struct page *pag
if (PageWriteback(page))
return 0;

+ BUG_ON(!(PagePrivate(page) ^ PageBlocks(page)));
if (mapping && mapping->a_ops->releasepage)
return mapping->a_ops->releasepage(page, gfp_mask);
- return try_to_free_buffers(page);
+ if (PagePrivate(page))
+ return try_to_free_buffers(page);
+ else
+ return try_to_free_blocks(page);
}

EXPORT_SYMBOL(try_to_release_page);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -24,6 +24,7 @@
#include <linux/module.h>
#include <linux/mm_inline.h>
#include <linux/buffer_head.h> /* for try_to_release_page() */
+#include <linux/fsblock.h>
#include <linux/module.h>
#include <linux/percpu_counter.h>
#include <linux/percpu.h>
@@ -412,8 +413,10 @@ void pagevec_strip(struct pagevec *pvec)
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];

- if (PagePrivate(page) && !TestSetPageLocked(page)) {
- if (PagePrivate(page))
+ if ((PagePrivate(page) || PageBlocks(page)) &&
+ !TestSetPageLocked(page)) {
+ BUG_ON(PagePrivate(page) && PageBlocks(page));
+ if (PagePrivate(page) || PageBlocks(page))
try_to_release_page(page, 0);
unlock_page(page);
}
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -15,8 +15,8 @@
#include <linux/highmem.h>
#include <linux/pagevec.h>
#include <linux/task_io_accounting_ops.h>
-#include <linux/buffer_head.h> /* grr. try_to_release_page,
- do_invalidatepage */
+#include <linux/buffer_head.h> /* try_to_release_page, do_invalidatepage */
+#include <linux/fsblock.h>


/**
@@ -36,20 +36,28 @@
void do_invalidatepage(struct page *page, unsigned long offset)
{
void (*invalidatepage)(struct page *, unsigned long);
+
+ if (!PagePrivate(page) && !PageBlocks(page))
+ return;
+
invalidatepage = page->mapping->a_ops->invalidatepage;
-#ifdef CONFIG_BLOCK
- if (!invalidatepage)
- invalidatepage = block_invalidatepage;
-#endif
if (invalidatepage)
(*invalidatepage)(page, offset);
+#ifdef CONFIG_BLOCK
+ else if (PagePrivate(page))
+ block_invalidatepage(page, offset);
+#endif
}

static inline void truncate_partial_page(struct page *page, unsigned partial)
{
+ /*
+ * XXX: this is only to get the already-invalidated tail and thus
+ * it doesn't actually "dirty" the page. This probably should be
+ * solved in the fs truncate_page operation.
+ */
zero_user_page(page, partial, PAGE_CACHE_SIZE - partial, KM_USER0);
- if (PagePrivate(page))
- do_invalidatepage(page, partial);
+ do_invalidatepage(page, partial);
}

/*
@@ -97,13 +105,14 @@ truncate_complete_page(struct address_sp

cancel_dirty_page(page, PAGE_CACHE_SIZE);

- if (PagePrivate(page))
- do_invalidatepage(page, 0);
+ do_invalidatepage(page, 0);

- ClearPageUptodate(page);
- ClearPageMappedToDisk(page);
- remove_from_page_cache(page);
- page_cache_release(page); /* pagecache ref */
+ if (!PageBlocks(page)) {
+ ClearPageUptodate(page);
+ ClearPageMappedToDisk(page);
+ remove_from_page_cache(page);
+ page_cache_release(page); /* pagecache ref */
+ }
}

/*
@@ -122,8 +131,9 @@ invalidate_complete_page(struct address_
if (page->mapping != mapping)
return 0;

- if (PagePrivate(page) && !try_to_release_page(page, 0))
- return 0;
+ if (PagePrivate(page) || PageBlocks(page))
+ if (!try_to_release_page(page, 0))
+ return 0;

ret = remove_mapping(mapping, page);

@@ -176,24 +186,18 @@ void truncate_inode_pages_range(struct a
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
- pgoff_t page_index = page->index;

- if (page_index > end) {
- next = page_index;
+ next = page->index+1;
+ if (next-1 > end)
break;
- }

- if (page_index > next)
- next = page_index;
- next++;
- if (TestSetPageLocked(page))
+ if (PageWriteback(page))
continue;
- if (PageWriteback(page)) {
+ if (!TestSetPageLocked(page)) {
+ if (!PageWriteback(page))
+ truncate_complete_page(mapping, page);
unlock_page(page);
- continue;
}
- truncate_complete_page(mapping, page);
- unlock_page(page);
}
pagevec_release(&pvec);
cond_resched();
@@ -210,28 +214,18 @@ void truncate_inode_pages_range(struct a
}

next = start;
- for ( ; ; ) {
- cond_resched();
- if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
- if (next == start)
- break;
- next = start;
- continue;
- }
- if (pvec.pages[0]->index > end) {
- pagevec_release(&pvec);
- break;
- }
+ while (next <= end &&
+ pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

- if (page->index > end)
- break;
lock_page(page);
+ next = page->index + 1;
+ if (next-1 > end) {
+ unlock_page(page);
+ break;
+ }
wait_on_page_writeback(page);
- if (page->index > next)
- next = page->index;
- next++;
truncate_complete_page(mapping, page);
unlock_page(page);
}
@@ -326,17 +320,18 @@ invalidate_complete_page2(struct address
if (page->mapping != mapping)
return 0;

- if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
- return 0;
+ if (PagePrivate(page) || PageBlocks(page))
+ if (!try_to_release_page(page, GFP_KERNEL))
+ return 0;

write_lock_irq(&mapping->tree_lock);
if (PageDirty(page))
goto failed;

- BUG_ON(PagePrivate(page));
+ BUG_ON(PagePrivate(page) || PageBlocks(page));
+ ClearPageUptodate(page);
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
- ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
failed:
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -23,8 +23,8 @@
#include <linux/file.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>
-#include <linux/buffer_head.h> /* for try_to_release_page(),
- buffer_heads_over_limit */
+#include <linux/buffer_head.h> /* try_to_release_page, buffer_heads_over_limit*/
+#include <linux/fsblock.h>
#include <linux/mm_inline.h>
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
@@ -565,7 +565,7 @@ static unsigned long shrink_page_list(st
* process address space (page_count == 1) it can be freed.
* Otherwise, leave the page on the LRU so it is swappable.
*/
- if (PagePrivate(page)) {
+ if (PagePrivate(page) || PageBlocks(page)) {
if (!try_to_release_page(page, sc->gfp_mask))
goto activate_locked;
if (!mapping && page_count(page) == 1)
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -22,6 +22,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
+#include <linux/fsblock.h>
#include "internal.h"

/**
@@ -636,9 +637,15 @@ int generic_osync_inode(struct inode *in
if (what & OSYNC_DATA)
err = filemap_fdatawrite(mapping);
if (what & (OSYNC_METADATA|OSYNC_DATA)) {
- err2 = sync_mapping_buffers(mapping);
- if (!err)
- err = err2;
+ if (!mapping->a_ops->sync) {
+ err2 = sync_mapping_buffers(mapping);
+ if (!err)
+ err = err2;
+ } else {
+ err2 = mapping->a_ops->sync(mapping);
+ if (!err)
+ err = err2;
+ }
}
if (what & OSYNC_DATA) {
err2 = filemap_fdatawait(mapping);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -32,6 +32,7 @@
* FIXME: remove all knowledge of the buffer layer from this file
*/
#include <linux/buffer_head.h>
+#include <linux/fsblock.h>

/*
* New inode.c implementation.
@@ -170,7 +171,7 @@ static struct inode *alloc_inode(struct

void destroy_inode(struct inode *inode)
{
- BUG_ON(inode_has_buffers(inode));
+ BUG_ON(mapping_has_private(&inode->i_data));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
@@ -241,10 +242,14 @@ void __iget(struct inode * inode)
*/
void clear_inode(struct inode *inode)
{
+ struct address_space *mapping = &inode->i_data;
+
might_sleep();
- invalidate_inode_buffers(inode);
+ if (!mapping->a_ops->release)
+ invalidate_inode_buffers(inode);

- BUG_ON(inode->i_data.nrpages);
+ BUG_ON(mapping_has_private(mapping));
+ BUG_ON(mapping->nrpages);
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
wait_on_inode(inode);
@@ -307,6 +312,7 @@ static int invalidate_list(struct list_h
for (;;) {
struct list_head * tmp = next;
struct inode * inode;
+ struct address_space * mapping;

/*
* We can reschedule here without worrying about the list's
@@ -320,7 +326,12 @@ static int invalidate_list(struct list_h
if (tmp == head)
break;
inode = list_entry(tmp, struct inode, i_sb_list);
- invalidate_inode_buffers(inode);
+ mapping = &inode->i_data;
+ if (!mapping->a_ops->release)
+ invalidate_inode_buffers(inode);
+ else
+ mapping->a_ops->release(mapping, 1); /* XXX: should be done in fs? */
+ BUG_ON(mapping_has_private(mapping));
if (!atomic_read(&inode->i_count)) {
list_move(&inode->i_list, dispose);
inode->i_state |= I_FREEING;
@@ -363,13 +374,15 @@ EXPORT_SYMBOL(invalidate_inodes);

static int can_unuse(struct inode *inode)
{
+ struct address_space *mapping = &inode->i_data;
+
if (inode->i_state)
return 0;
- if (inode_has_buffers(inode))
+ if (mapping_has_private(mapping))
return 0;
if (atomic_read(&inode->i_count))
return 0;
- if (inode->i_data.nrpages)
+ if (mapping->nrpages)
return 0;
return 1;
}
@@ -398,6 +411,7 @@ static void prune_icache(int nr_to_scan)
spin_lock(&inode_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
struct inode *inode;
+ struct address_space *mapping;

if (list_empty(&inode_unused))
break;
@@ -408,10 +422,17 @@ static void prune_icache(int nr_to_scan)
list_move(&inode->i_list, &inode_unused);
continue;
}
- if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ mapping = &inode->i_data;
+ if (mapping_has_private(mapping) || mapping->nrpages) {
+ int ret;
+
__iget(inode);
spin_unlock(&inode_lock);
- if (remove_inode_buffers(inode))
+ if (mapping->a_ops->release)
+ ret = mapping->a_ops->release(mapping, 0);
+ else
+ ret = remove_inode_buffers(inode);
+ if (ret)
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
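
On the metadata side, the mblock calls declared in fsblock.h above are meant
to take over from sb_bread()/brelse() style access in filesystems. Purely as
a sketch of what a conversion might look like, assuming only those
declarations -- the myfs_* name is invented and error handling is minimal:

#include <linux/fsblock.h>

/*
 * Hypothetical example only: read a metadata block, map part of it into
 * the kernel address space, modify it and mark it dirty via the
 * fsblock_meta API -- roughly the mblock analogue of the old
 * sb_bread()/mark_buffer_dirty()/brelse() sequence.
 */
static int myfs_touch_block(struct super_block *sb, sector_t blocknr,
                            unsigned int offset, u8 val)
{
        struct fsblock_meta *mblock;
        u8 *data;

        mblock = sb_mbread(sb, blocknr);        /* like sb_bread() */
        if (!mblock)
                return -EIO;

        /* vmap_block() maps [offset, offset + len) of the block */
        data = vmap_block(mblock_block(mblock), offset, sizeof(*data));
        *data = val;
        vunmap_block(mblock_block(mblock), offset, sizeof(*data), data);

        mark_mblock_dirty(mblock);              /* like mark_buffer_dirty() */
        mblock_put(mblock);                     /* like brelse() */
        return 0;
}

That is, sb_mbread() pins and reads the block, vmap_block()/vunmap_block()
give temporary access to its contents, and mblock_put() drops the reference.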

2007-06-24 01:47:16

by Nick Piggin

[permalink] [raw]
Subject: [patch 2/3] block_dev: convert to fsblock


Convert block_dev mostly to fsblocks.

---
fs/block_dev.c | 204 +++++++++++++++++++++++++++++++++++++++-----
fs/buffer.c | 113 ++----------------------
fs/super.c | 2
include/linux/buffer_head.h | 9 -
include/linux/fs.h | 29 ++++++
5 files changed, 225 insertions(+), 132 deletions(-)
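
Before the patch itself, a rough sketch of how these entry points are
intended to be wired up by a filesystem (blkdev_insert_mapping() below is
the concrete in-tree example): the filesystem supplies an insert_mapping_fn
that maps the fsblocks covering a byte range, and points its
address_space_operations at the fsblock helpers. This is only an
illustration against the fsblock.h declarations above; the myfs_* names are
invented and the callback body is elided:

#include <linux/fs.h>
#include <linux/writeback.h>
#include <linux/fsblock.h>

/*
 * Hypothetical insert_mapping_fn: for each page in [off, off + len), look
 * up (or allocate, when create is set) the on-disk blocks and map_fsblock()
 * any fsblock that is not yet mapped. blkdev_insert_mapping() below is a
 * real implementation of this pattern.
 */
static int myfs_insert_mapping(struct address_space *mapping, loff_t off,
                               size_t len, int create)
{
        /* ... walk the pages, map_fsblock() each unmapped block ... */
        return 0;
}

static int myfs_readpage(struct file *file, struct page *page)
{
        return fsblock_read_page(page, myfs_insert_mapping);
}

static int myfs_writepage(struct page *page, struct writeback_control *wbc)
{
        return fsblock_write_page(page, myfs_insert_mapping, wbc);
}

static int myfs_prepare_write(struct file *file, struct page *page,
                              unsigned from, unsigned to)
{
        return fsblock_prepare_write(page, from, to, myfs_insert_mapping);
}

static const struct address_space_operations myfs_aops = {
        .readpage       = myfs_readpage,
        .writepage      = myfs_writepage,
        .prepare_write  = myfs_prepare_write,
        .commit_write   = fsblock_commit_write,
        .set_page_dirty = fsblock_set_page_dirty,
        .invalidatepage = fsblock_invalidate_page,
        /* new hooks added by the fsblock patch, used by generic code */
        .release        = fsblock_release,
        .sync           = fsblock_sync,
};

In the block_dev conversion below the same wiring is done, with a
PagePrivate() check so pages that still carry buffer_heads keep going
through the old buffer paths during the transition.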

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -16,7 +16,9 @@
#include <linux/blkdev.h>
#include <linux/module.h>
#include <linux/blkpg.h>
+#include <linux/fsblock.h>
#include <linux/buffer_head.h>
+#include <linux/pagevec.h>
#include <linux/writeback.h>
#include <linux/mpage.h>
#include <linux/mount.h>
@@ -61,14 +63,14 @@ static void kill_bdev(struct block_devic
{
if (bdev->bd_inode->i_mapping->nrpages == 0)
return;
- invalidate_bh_lrus();
+ invalidate_bh_lrus(); /* XXX: this can go when buffers goes */
truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
}

int set_blocksize(struct block_device *bdev, int size)
{
/* Size must be a power of two, and between 512 and PAGE_SIZE */
- if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
+ if (size < 512 || !is_power_of_2(size))
return -EINVAL;

/* Size cannot be smaller than the size supported by the device */
@@ -92,7 +94,7 @@ int sb_set_blocksize(struct super_block
if (set_blocksize(sb->s_bdev, size))
return 0;
/* If we get here, we know size is power of two
- * and it's value is between 512 and PAGE_SIZE */
+ * and its value is >= 512 */
sb->s_blocksize = size;
sb->s_blocksize_bits = blksize_bits(size);
return sb->s_blocksize;
@@ -112,19 +114,12 @@ EXPORT_SYMBOL(sb_min_blocksize);

static int
blkdev_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create)
+ struct buffer_head *bh, int create)
{
if (iblock >= max_block(I_BDEV(inode))) {
if (create)
return -EIO;
-
- /*
- * for reads, we're just trying to fill a partial page.
- * return a hole, they will have to call get_block again
- * before they can fill it, and they will get -EIO at that
- * time
- */
- return 0;
+ return 0;
}
bh->b_bdev = I_BDEV(inode);
bh->b_blocknr = iblock;
@@ -132,6 +127,66 @@ blkdev_get_block(struct inode *inode, se
return 0;
}

+static int blkdev_insert_mapping(struct address_space *mapping, loff_t off,
+ size_t len, int create)
+{
+ sector_t blocknr;
+ struct inode *inode = mapping->host;
+ pgoff_t next, end;
+ struct pagevec pvec;
+ int ret = 0;
+
+ pagevec_init(&pvec, 0);
+ next = off >> PAGE_CACHE_SHIFT;
+ end = (off + len) >> PAGE_CACHE_SHIFT;
+ blocknr = off >> inode->i_blkbits;
+ while (next <= end && pagevec_lookup(&pvec, mapping, next,
+ min(end - next, (pgoff_t)PAGEVEC_SIZE))) {
+ unsigned int i;
+
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct fsblock *block;
+ struct page *page = pvec.pages[i];
+
+ BUG_ON(page->index != next);
+ BUG_ON(blocknr != pgoff_sector(next, inode->i_blkbits));
+ BUG_ON(!PageLocked(page));
+
+ if (blocknr >= max_block(I_BDEV(inode))) {
+ if (create)
+ ret = -ENOMEM;
+
+ /*
+ * for reads, we're just trying to fill a
+ * partial page. return a hole, they will
+ * have to call in again before they can fill
+ * it, and they will get -EIO at that time
+ */
+ continue; /* xxx: could be smarter, stop now */
+ }
+
+ block = page_blocks(page);
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+ for_each_block(block, b) {
+ if (!test_bit(BL_mapped, &b->flags))
+ map_fsblock(b, blocknr);
+ blocknr++;
+ }
+ } else {
+ if (!test_bit(BL_mapped, &block->flags))
+ map_fsblock(block, blocknr);
+ blocknr++;
+ }
+ next++;
+ }
+ pagevec_release(&pvec);
+ }
+
+ return ret;
+}
+
+#if 0
static int
blkdev_get_blocks(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
@@ -170,6 +225,7 @@ blkdev_direct_IO(int rw, struct kiocb *i
return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode),
iov, offset, nr_segs, blkdev_get_blocks, NULL);
}
+#endif

#if 0
static int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error)
@@ -368,24 +424,127 @@ backout:
}
#endif

+/*
+ * Write out and wait upon all the dirty data associated with a block
+ * device via its mapping. Does not take the superblock lock.
+ */
+int sync_blockdev(struct block_device *bdev)
+{
+ int ret = 0;
+
+ if (bdev)
+ ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
+ return ret;
+}
+EXPORT_SYMBOL(sync_blockdev);
+
+/*
+ * Write out and wait upon all dirty data associated with this
+ * device. Filesystem data as well as the underlying block
+ * device. Takes the superblock lock.
+ */
+int fsync_bdev(struct block_device *bdev)
+{
+ struct super_block *sb = get_super(bdev);
+ if (sb) {
+ int res = fsync_super(sb);
+ drop_super(sb);
+ return res;
+ }
+ return sync_blockdev(bdev);
+}
+
+/**
+ * freeze_bdev -- lock a filesystem and force it into a consistent state
+ * @bdev: blockdevice to lock
+ *
+ * This takes the block device bd_mount_sem to make sure no new mounts
+ * happen on bdev until thaw_bdev() is called.
+ * If a superblock is found on this device, we take the s_umount semaphore
+ * on it to make sure nobody unmounts until the snapshot creation is done.
+ */
+struct super_block *freeze_bdev(struct block_device *bdev)
+{
+ struct super_block *sb;
+
+ down(&bdev->bd_mount_sem);
+ sb = get_super(bdev);
+ if (sb && !(sb->s_flags & MS_RDONLY)) {
+ sb->s_frozen = SB_FREEZE_WRITE;
+ smp_wmb();
+
+ __fsync_super(sb);
+
+ sb->s_frozen = SB_FREEZE_TRANS;
+ smp_wmb();
+
+ sync_blockdev(sb->s_bdev);
+
+ if (sb->s_op->write_super_lockfs)
+ sb->s_op->write_super_lockfs(sb);
+ }
+
+ sync_blockdev(bdev);
+ return sb; /* thaw_bdev releases s->s_umount and bd_mount_sem */
+}
+EXPORT_SYMBOL(freeze_bdev);
+
+/**
+ * thaw_bdev -- unlock filesystem
+ * @bdev: blockdevice to unlock
+ * @sb: associated superblock
+ *
+ * Unlocks the filesystem and marks it writeable again after freeze_bdev().
+ */
+void thaw_bdev(struct block_device *bdev, struct super_block *sb)
+{
+ if (sb) {
+ BUG_ON(sb->s_bdev != bdev);
+
+ if (sb->s_op->unlockfs)
+ sb->s_op->unlockfs(sb);
+ sb->s_frozen = SB_UNFROZEN;
+ smp_wmb();
+ wake_up(&sb->s_wait_unfrozen);
+ drop_super(sb);
+ }
+
+ up(&bdev->bd_mount_sem);
+}
+EXPORT_SYMBOL(thaw_bdev);
+
static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
{
- return block_write_full_page(page, blkdev_get_block, wbc);
+ if (PagePrivate(page))
+ return block_write_full_page(page, blkdev_get_block, wbc);
+ return fsblock_write_page(page, blkdev_insert_mapping, wbc);
}

static int blkdev_readpage(struct file * file, struct page * page)
{
- return block_read_full_page(page, blkdev_get_block);
+ return fsblock_read_page(page, blkdev_insert_mapping);
}

static int blkdev_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
{
- return block_prepare_write(page, from, to, blkdev_get_block);
+ if (PagePrivate(page))
+ return block_prepare_write(page, from, to, blkdev_get_block);
+ return fsblock_prepare_write(page, from, to, blkdev_insert_mapping);
}

static int blkdev_commit_write(struct file *file, struct page *page, unsigned from, unsigned to)
{
- return block_commit_write(page, from, to);
+ if (PagePrivate(page))
+ return generic_commit_write(file, page, from, to);
+ return fsblock_commit_write(file, page, from, to);
+}
+
+static void blkdev_invalidate_page(struct page *page, unsigned long offset)
+{
+ if (PagePrivate(page))
+ block_invalidatepage(page, offset);
+ else
+ fsblock_invalidate_page(page, offset);
}

/*
@@ -840,7 +999,7 @@ static void free_bd_holder(struct bd_hol
/**
* find_bd_holder - find matching struct bd_holder from the block device
*
- * @bdev: struct block device to be searched
+ * @bdev: struct fsblock device to be searched
* @bo: target struct bd_holder
*
* Returns matching entry with @bo in @bdev->bd_holder_list.
@@ -1272,6 +1431,10 @@ static int __blkdev_put(struct block_dev
bdev->bd_part_count--;

if (!--bdev->bd_openers) {
+ /*
+ * XXX: This could go away when block dev and inode
+ * mappings are in sync?
+ */
sync_blockdev(bdev);
kill_bdev(bdev);
}
@@ -1325,11 +1488,14 @@ static long block_ioctl(struct file *fil
const struct address_space_operations def_blk_aops = {
.readpage = blkdev_readpage,
.writepage = blkdev_writepage,
- .sync_page = block_sync_page,
+// .sync_page = block_sync_page, /* xxx: gone w/ explicit plugging */
.prepare_write = blkdev_prepare_write,
.commit_write = blkdev_commit_write,
.writepages = generic_writepages,
- .direct_IO = blkdev_direct_IO,
+// .direct_IO = blkdev_direct_IO,
+ .set_page_dirty = fsblock_set_page_dirty,
+ .invalidatepage = blkdev_invalidate_page,
+ /* XXX: .sync */
};

const struct file_operations def_blk_fops = {
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -147,95 +147,6 @@ void end_buffer_write_sync(struct buffer
}

/*
- * Write out and wait upon all the dirty data associated with a block
- * device via its mapping. Does not take the superblock lock.
- */
-int sync_blockdev(struct block_device *bdev)
-{
- int ret = 0;
-
- if (bdev)
- ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
- return ret;
-}
-EXPORT_SYMBOL(sync_blockdev);
-
-/*
- * Write out and wait upon all dirty data associated with this
- * device. Filesystem data as well as the underlying block
- * device. Takes the superblock lock.
- */
-int fsync_bdev(struct block_device *bdev)
-{
- struct super_block *sb = get_super(bdev);
- if (sb) {
- int res = fsync_super(sb);
- drop_super(sb);
- return res;
- }
- return sync_blockdev(bdev);
-}
-
-/**
- * freeze_bdev -- lock a filesystem and force it into a consistent state
- * @bdev: blockdevice to lock
- *
- * This takes the block device bd_mount_sem to make sure no new mounts
- * happen on bdev until thaw_bdev() is called.
- * If a superblock is found on this device, we take the s_umount semaphore
- * on it to make sure nobody unmounts until the snapshot creation is done.
- */
-struct super_block *freeze_bdev(struct block_device *bdev)
-{
- struct super_block *sb;
-
- down(&bdev->bd_mount_sem);
- sb = get_super(bdev);
- if (sb && !(sb->s_flags & MS_RDONLY)) {
- sb->s_frozen = SB_FREEZE_WRITE;
- smp_wmb();
-
- __fsync_super(sb);
-
- sb->s_frozen = SB_FREEZE_TRANS;
- smp_wmb();
-
- sync_blockdev(sb->s_bdev);
-
- if (sb->s_op->write_super_lockfs)
- sb->s_op->write_super_lockfs(sb);
- }
-
- sync_blockdev(bdev);
- return sb; /* thaw_bdev releases s->s_umount and bd_mount_sem */
-}
-EXPORT_SYMBOL(freeze_bdev);
-
-/**
- * thaw_bdev -- unlock filesystem
- * @bdev: blockdevice to unlock
- * @sb: associated superblock
- *
- * Unlocks the filesystem and marks it writeable again after freeze_bdev().
- */
-void thaw_bdev(struct block_device *bdev, struct super_block *sb)
-{
- if (sb) {
- BUG_ON(sb->s_bdev != bdev);
-
- if (sb->s_op->unlockfs)
- sb->s_op->unlockfs(sb);
- sb->s_frozen = SB_UNFROZEN;
- smp_wmb();
- wake_up(&sb->s_wait_unfrozen);
- drop_super(sb);
- }
-
- up(&bdev->bd_mount_sem);
-}
-EXPORT_SYMBOL(thaw_bdev);
-
-/*
* Various filesystems appear to want __find_get_block to be non-blocking.
* But it's the page lock which protects the buffers. To get around this,
* we get exclusion from try_to_free_buffers with the blockdev mapping's
@@ -574,11 +485,6 @@ static inline void __remove_assoc_queue(
bh->b_assoc_map = NULL;
}

-int inode_has_buffers(struct inode *inode)
-{
- return !list_empty(&inode->i_data.private_list);
-}
-
/*
* osync is designed to support O_SYNC io. It waits synchronously for
* all already-submitted IO to complete, but does not queue any new
@@ -818,8 +724,9 @@ static int fsync_buffers_list(spinlock_t
*/
void invalidate_inode_buffers(struct inode *inode)
{
- if (inode_has_buffers(inode)) {
- struct address_space *mapping = &inode->i_data;
+ struct address_space *mapping = &inode->i_data;
+
+ if (mapping_has_private(mapping)) {
struct list_head *list = &mapping->private_list;
struct address_space *buffer_mapping = mapping->assoc_mapping;

@@ -838,10 +745,10 @@ void invalidate_inode_buffers(struct ino
*/
int remove_inode_buffers(struct inode *inode)
{
+ struct address_space *mapping = &inode->i_data;
int ret = 1;

- if (inode_has_buffers(inode)) {
- struct address_space *mapping = &inode->i_data;
+ if (mapping_has_private(mapping)) {
struct list_head *list = &mapping->private_list;
struct address_space *buffer_mapping = mapping->assoc_mapping;

@@ -990,7 +897,7 @@ grow_dev_page(struct block_device *bdev,
BUG_ON(!PageLocked(page));

if (PageBlocks(page)) {
- if (try_to_free_blocks(page))
+ if (!try_to_free_blocks(page))
return NULL;
}

@@ -1603,7 +1510,7 @@ static int __block_write_full_page(struc

if (!page_has_buffers(page)) {
if (PageBlocks(page)) {
- if (try_to_free_blocks(page))
+ if (!try_to_free_blocks(page))
return -EBUSY;
}
create_empty_buffers(page, blocksize,
@@ -1769,7 +1676,7 @@ static int __block_prepare_write(struct
blocksize = 1 << inode->i_blkbits;
if (!page_has_buffers(page)) {
if (PageBlocks(page)) {
- if (try_to_free_blocks(page))
+ if (!try_to_free_blocks(page))
return -EBUSY;
}
create_empty_buffers(page, blocksize, 0);
@@ -1928,7 +1835,7 @@ int block_read_full_page(struct page *pa
blocksize = 1 << inode->i_blkbits;
if (!page_has_buffers(page)) {
if (PageBlocks(page)) {
- if (try_to_free_blocks(page))
+ if (!try_to_free_blocks(page))
return -EBUSY;
}
create_empty_buffers(page, blocksize, 0);
@@ -2497,7 +2404,7 @@ int block_truncate_page(struct address_s

if (!page_has_buffers(page)) {
if (PageBlocks(page)) {
- if (try_to_free_blocks(page))
+ if (!try_to_free_blocks(page))
return -EBUSY;
}
create_empty_buffers(page, blocksize, 0);
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -28,7 +28,7 @@
#include <linux/blkdev.h>
#include <linux/quotaops.h>
#include <linux/namei.h>
-#include <linux/buffer_head.h> /* for fsync_super() */
+#include <linux/fs.h> /* for fsync_super() */
#include <linux/mount.h>
#include <linux/security.h>
#include <linux/syscalls.h>
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -158,22 +158,14 @@ void end_buffer_write_sync(struct buffer

/* Things to do with buffers at mapping->private_list */
void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
-int inode_has_buffers(struct inode *);
void invalidate_inode_buffers(struct inode *);
int remove_inode_buffers(struct inode *inode);
int sync_mapping_buffers(struct address_space *mapping);
void unmap_underlying_metadata(struct block_device *bdev, sector_t block);

void mark_buffer_async_write(struct buffer_head *bh);
-void invalidate_bdev(struct block_device *);
-int sync_blockdev(struct block_device *bdev);
void __wait_on_buffer(struct buffer_head *);
wait_queue_head_t *bh_waitq_head(struct buffer_head *bh);
-int fsync_bdev(struct block_device *);
-struct super_block *freeze_bdev(struct block_device *);
-void thaw_bdev(struct block_device *, struct super_block *);
-int fsync_super(struct super_block *);
-int fsync_no_super(struct block_device *);
struct buffer_head *__find_get_block(struct block_device *bdev, sector_t block,
unsigned size);
struct buffer_head *__getblk(struct block_device *bdev, sector_t block,
@@ -317,7 +309,6 @@ extern int __set_page_dirty_buffers(stru
static inline void buffer_init(void) {}
static inline int try_to_free_buffers(struct page *page) { return 1; }
static inline int sync_blockdev(struct block_device *bdev) { return 0; }
-static inline int inode_has_buffers(struct inode *inode) { return 0; }
static inline void invalidate_inode_buffers(struct inode *inode) {}
static inline int remove_inode_buffers(struct inode *inode) { return 1; }
static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -430,6 +430,20 @@ struct address_space_operations {
int (*migratepage) (struct address_space *,
struct page *, struct page *);
int (*launder_page) (struct page *);
+
+ /*
+ * release_mapping releases any private data on the mapping so that
+ * it may be reclaimed. Returns 1 on success or 0 on failure. Second
+ * parameter 'force' causes dirty data to be invalidated. (XXX: could
+ * have other flags like sync/async, etc).
+ */
+ int (*release)(struct address_space *, int);
+
+ /*
+ * sync writes back and waits for any private data on the mapping,
+ * as a data consistency operation.
+ */
+ int (*sync)(struct address_space *);
};

struct backing_dev_info;
@@ -497,6 +511,14 @@ struct block_device {
int mapping_tagged(struct address_space *mapping, int tag);

/*
+ * Does this mapping have anything on its private list?
+ */
+static inline int mapping_has_private(struct address_space *mapping)
+{
+ return !list_empty(&mapping->private_list);
+}
+
+/*
* Might pages of this file be mapped into userspace?
*/
static inline int mapping_mapped(struct address_space *mapping)
@@ -1503,6 +1525,13 @@ extern void bd_forget(struct inode *inod
extern void bdput(struct block_device *);
extern struct block_device *open_by_devnum(dev_t, unsigned);
extern const struct address_space_operations def_blk_aops;
+void invalidate_bdev(struct block_device *);
+int sync_blockdev(struct block_device *bdev);
+struct super_block *freeze_bdev(struct block_device *);
+void thaw_bdev(struct block_device *, struct super_block *);
+int fsync_bdev(struct block_device *);
+int fsync_super(struct super_block *);
+int fsync_no_super(struct block_device *);
#else
static inline void bd_forget(struct inode *inode) {}
#endif
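
A rough sketch of how a converted filesystem is expected to use the two new
operations (this is illustration, not part of the patch; the example_* names
are placeholders, while fsblock_release and fsblock_sync are the generic
helpers the minix conversion below plugs in):

	static const struct address_space_operations example_aops = {
		.readpage	= example_readpage,
		.writepage	= example_writepage,
		/* ... the usual entries ... */
		.release	= fsblock_release,	/* drop per-mapping fsblock data; 'force' invalidates dirty blocks */
		.sync		= fsblock_sync,		/* write back and wait on the mapping's private metadata */
	};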

2007-06-24 01:47:45

by Nick Piggin

[permalink] [raw]
Subject: [patch 3/3] minix: convert to fsblock


Convert minix from buffer head to fsblock.

---
fs/minix/bitmap.c | 148 +++++++++++++++++++++----------
fs/minix/file.c | 6 -
fs/minix/inode.c | 172 ++++++++++++++++++++++--------------
fs/minix/itree_common.c | 227 ++++++++++++++++++++++++++++++++----------------
fs/minix/itree_v1.c | 7 -
fs/minix/itree_v2.c | 7 -
fs/minix/minix.h | 17 ++-
7 files changed, 382 insertions(+), 202 deletions(-)

Index: linux-2.6/fs/minix/minix.h
===================================================================
--- linux-2.6.orig/fs/minix/minix.h
+++ linux-2.6/fs/minix/minix.h
@@ -1,4 +1,5 @@
#include <linux/fs.h>
+#include <linux/fsblock.h>
#include <linux/pagemap.h>
#include <linux/minix_fs.h>

@@ -37,16 +38,18 @@ struct minix_sb_info {
int s_dirsize;
int s_namelen;
int s_link_max;
- struct buffer_head ** s_imap;
- struct buffer_head ** s_zmap;
- struct buffer_head * s_sbh;
+ struct fsblock_meta ** s_imap;
+ struct fsblock_meta ** s_zmap;
+ struct fsblock_meta * s_smblock;
struct minix_super_block * s_ms;
unsigned short s_mount_state;
unsigned short s_version;
};

-extern struct minix_inode * minix_V1_raw_inode(struct super_block *, ino_t, struct buffer_head **);
-extern struct minix2_inode * minix_V2_raw_inode(struct super_block *, ino_t, struct buffer_head **);
+extern struct minix_inode * minix_V1_raw_inode(struct super_block *, ino_t, struct fsblock_meta **);
+extern void minix_put_raw_inode(struct super_block *sb, ino_t ino, struct fsblock_meta *mblock, struct minix_inode *p);
+extern struct minix2_inode * minix_V2_raw_inode(struct super_block *, ino_t, struct fsblock_meta **);
+extern void minix2_put_raw_inode(struct super_block *sb, ino_t ino, struct fsblock_meta *mblock, struct minix2_inode *p);
extern struct inode * minix_new_inode(const struct inode * dir, int * error);
extern void minix_free_inode(struct inode * inode);
extern unsigned long minix_count_free_inodes(struct minix_sb_info *sbi);
@@ -60,8 +63,8 @@ extern void V2_minix_truncate(struct ino
extern void minix_truncate(struct inode *);
extern int minix_sync_inode(struct inode *);
extern void minix_set_inode(struct inode *, dev_t);
-extern int V1_minix_get_block(struct inode *, long, struct buffer_head *, int);
-extern int V2_minix_get_block(struct inode *, long, struct buffer_head *, int);
+extern int V1_minix_insert_mapping(struct address_space *, loff_t, size_t, int);
+extern int V2_minix_insert_mapping(struct address_space *, loff_t, size_t, int);
extern unsigned V1_minix_blocks(loff_t, struct super_block *);
extern unsigned V2_minix_blocks(loff_t, struct super_block *);

Index: linux-2.6/fs/minix/itree_common.c
===================================================================
--- linux-2.6.orig/fs/minix/itree_common.c
+++ linux-2.6/fs/minix/itree_common.c
@@ -1,31 +1,29 @@
/* Generic part */

typedef struct {
- block_t *p;
+ block_t *mem;
+ int offset;
block_t key;
- struct buffer_head *bh;
+ struct fsblock_meta *mblock;
} Indirect;

static DEFINE_RWLOCK(pointers_lock);

-static inline void add_chain(Indirect *p, struct buffer_head *bh, block_t *v)
+static inline void add_chain(Indirect *p, struct fsblock_meta *mblock, block_t *mem, int offset)
{
- p->key = *(p->p = v);
- p->bh = bh;
+ p->mem = mem;
+ p->offset = offset;
+ p->key = mem[offset];
+ p->mblock = mblock;
}

static inline int verify_chain(Indirect *from, Indirect *to)
{
- while (from <= to && from->key == *from->p)
+ while (from <= to && from->key == from->mem[from->offset])
from++;
return (from > to);
}

-static inline block_t *block_end(struct buffer_head *bh)
-{
- return (block_t *)((char*)bh->b_data + bh->b_size);
-}
-
static inline Indirect *get_branch(struct inode *inode,
int depth,
int *offsets,
@@ -34,35 +32,43 @@ static inline Indirect *get_branch(struc
{
struct super_block *sb = inode->i_sb;
Indirect *p = chain;
- struct buffer_head *bh;
+ struct fsblock_meta *mblock;

*err = 0;
/* i_data is not going away, no lock needed */
- add_chain (chain, NULL, i_data(inode) + *offsets);
+ add_chain (chain, NULL, i_data(inode), *offsets);
if (!p->key)
- goto no_block;
+ goto out;
while (--depth) {
- bh = sb_bread(sb, block_to_cpu(p->key));
- if (!bh)
- goto failure;
+ void *data;
+
+ mblock = sb_mbread(sb, block_to_cpu(p->key));
+ if (!mblock) {
+ *err = -EIO;
+ goto out;
+ }
read_lock(&pointers_lock);
- if (!verify_chain(chain, p))
- goto changed;
- add_chain(++p, bh, (block_t *)bh->b_data + *++offsets);
+ if (!verify_chain(chain, p)) {
+ /* changed */
+ *err = -EAGAIN;
+ goto out_unlock;
+ }
+ data = vmap_block(mblock_block(mblock), 0, sb->s_blocksize);
+ if (!data) {
+ *err = -ENOMEM;
+ goto out_unlock;
+ }
+ add_chain(++p, mblock, (block_t *)data, *++offsets);
read_unlock(&pointers_lock);
if (!p->key)
- goto no_block;
+ goto out;
}
return NULL;

-changed:
+out_unlock:
read_unlock(&pointers_lock);
- brelse(bh);
- *err = -EAGAIN;
- goto no_block;
-failure:
- *err = -EIO;
-no_block:
+ mblock_put(mblock);
+out:
return p;
}

@@ -71,35 +77,54 @@ static int alloc_branch(struct inode *in
int *offsets,
Indirect *branch)
{
+ struct super_block *sb = inode->i_sb;
int n = 0;
int i;
int parent = minix_new_block(inode);
+ int ret = -ENOSPC;

branch[0].key = cpu_to_block(parent);
if (parent) for (n = 1; n < num; n++) {
- struct buffer_head *bh;
+ struct fsblock_meta *mblock;
+ void *data;
+
/* Allocate the next block */
int nr = minix_new_block(inode);
if (!nr)
break;
branch[n].key = cpu_to_block(nr);
- bh = sb_getblk(inode->i_sb, parent);
- lock_buffer(bh);
- memset(bh->b_data, 0, bh->b_size);
- branch[n].bh = bh;
- branch[n].p = (block_t*) bh->b_data + offsets[n];
- *branch[n].p = branch[n].key;
- set_buffer_uptodate(bh);
- unlock_buffer(bh);
- mark_buffer_dirty_inode(bh, inode);
+ mblock = sb_find_or_create_mblock(sb, parent);
+ if (IS_ERR(mblock)) {
+ ret = PTR_ERR(mblock);
+ break;
+ }
+
+ data = vmap_block(mblock_block(mblock), 0, sb->s_blocksize);
+ if (!data) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ lock_block(mblock);
+ memset(data, 0, sb->s_blocksize); /* XXX: or mblock->size */
+
+ branch[n].mblock = mblock;
+ branch[n].mem = data;
+ branch[n].offset = offsets[n];
+ branch[n].mem[branch[n].offset] = branch[n].key;
+ mark_mblock_uptodate(mblock);
+ unlock_block(mblock);
+ mark_mblock_dirty_inode(mblock, inode);
parent = nr;
}
if (n == num)
return 0;

/* Allocation failed, free what we already allocated */
- for (i = 1; i < n; i++)
- bforget(branch[i].bh);
+ for (i = 1; i < n; i++) {
+ vunmap_block(branch[i].mblock, 0, sb->s_blocksize, branch[i].mem);
+ mblock_put(branch[i].mblock);
+ }
for (i = 0; i < n; i++)
minix_free_block(inode, block_to_cpu(branch[i].key));
return -ENOSPC;
@@ -110,15 +135,16 @@ static inline int splice_branch(struct i
Indirect *where,
int num)
{
+ struct super_block *sb = inode->i_sb;
int i;

write_lock(&pointers_lock);

/* Verify that place we are splicing to is still there and vacant */
- if (!verify_chain(chain, where-1) || *where->p)
+ if (!verify_chain(chain, where-1) || where->mem[where->offset])
goto changed;

- *where->p = where->key;
+ where->mem[where->offset] = where->key;

write_unlock(&pointers_lock);

@@ -127,31 +153,37 @@ static inline int splice_branch(struct i
inode->i_ctime = CURRENT_TIME_SEC;

/* had we spliced it onto indirect block? */
- if (where->bh)
- mark_buffer_dirty_inode(where->bh, inode);
+ if (where->mblock)
+ mark_mblock_dirty_inode(where->mblock, inode);

mark_inode_dirty(inode);
return 0;

changed:
write_unlock(&pointers_lock);
- for (i = 1; i < num; i++)
- bforget(where[i].bh);
+ for (i = 1; i < num; i++) {
+ vunmap_block(where[i].mblock, 0, sb->s_blocksize, where[i].mem);
+ mblock_put(where[i].mblock);
+ }
for (i = 0; i < num; i++)
minix_free_block(inode, block_to_cpu(where[i].key));
return -EAGAIN;
}

-static inline int get_block(struct inode * inode, sector_t block,
- struct buffer_head *bh, int create)
+static inline int insert_block(struct inode *inode, struct fsblock *block, sector_t blocknr, int create)
{
+ struct super_block *sb = inode->i_sb;
int err = -EIO;
int offsets[DEPTH];
Indirect chain[DEPTH];
Indirect *partial;
int left;
- int depth = block_to_path(inode, block, offsets);
+ int depth;

+ if (test_bit(BL_mapped, &block->flags))
+ return 0;
+
+ depth = block_to_path(inode, blocknr, offsets);
if (depth == 0)
goto out;

@@ -161,7 +193,7 @@ reread:
/* Simplest case - block found, no allocation needed */
if (!partial) {
got_it:
- map_bh(bh, inode->i_sb, block_to_cpu(chain[depth-1].key));
+ map_fsblock(block, block_to_cpu(chain[depth-1].key));
/* Clean up and exit */
partial = chain+depth-1; /* the whole chain */
goto cleanup;
@@ -171,7 +203,9 @@ got_it:
if (!create || err == -EIO) {
cleanup:
while (partial > chain) {
- brelse(partial->bh);
+ vunmap_block(partial->mblock, 0, sb->s_blocksize, partial->mem);
+ mblock_put(partial->mblock);
+ /* XXX: balance puts and unmaps etc etc */
partial--;
}
out:
@@ -194,17 +228,56 @@ out:
if (splice_branch(inode, chain, partial, left) < 0)
goto changed;

- set_buffer_new(bh);
+ set_bit(BL_new, &block->flags);
goto got_it;

changed:
while (partial > chain) {
- brelse(partial->bh);
+ vunmap_block(partial->mblock, 0, sb->s_blocksize, partial->mem);
+ mblock_put(partial->mblock);
partial--;
}
goto reread;
}

+static inline int insert_mapping(struct address_space *mapping, loff_t pos,
+ size_t len, int create)
+{
+ struct inode *inode = mapping->host;
+ struct page *page;
+ sector_t blocknr;
+ pgoff_t pgoff, end;
+ struct fsblock *block;
+ int ret;
+
+ BUG_ON(len != PAGE_CACHE_SIZE); /* XXX can't do this yet... */
+
+ pgoff = pos >> PAGE_CACHE_SHIFT;
+ end = (pos + len) >> PAGE_CACHE_SHIFT;
+ blocknr = pos >> inode->i_blkbits;
+
+ page = find_page(mapping, pgoff);
+ BUG_ON(!PageLocked(page));
+
+ /* XXX: sort out brelse & bforget vs block_put */
+
+ block = page_blocks(page);
+ if (fsblock_subpage(block)) {
+ struct fsblock *b;
+ ret = 0;
+ for_each_block(block, b) {
+ ret = insert_block(inode, b, blocknr, create);
+ if (ret)
+ break;
+ blocknr++;
+ }
+ } else {
+ ret = insert_block(inode, block, blocknr, create);
+ }
+
+ return ret;
+}
+
static inline int all_zeroes(block_t *p, block_t *q)
{
while (p < q)
@@ -219,6 +292,7 @@ static Indirect *find_shared(struct inod
Indirect chain[DEPTH],
block_t *top)
{
+ struct super_block *sb = inode->i_sb;
Indirect *partial, *p;
int k, err;

@@ -230,23 +304,25 @@ static Indirect *find_shared(struct inod
write_lock(&pointers_lock);
if (!partial)
partial = chain + k-1;
- if (!partial->key && *partial->p) {
+ if (!partial->key && partial->mem[partial->offset]) {
write_unlock(&pointers_lock);
goto no_top;
}
- for (p=partial;p>chain && all_zeroes((block_t*)p->bh->b_data,p->p);p--)
- ;
+ p = partial;
+ while (p > chain && all_zeroes(p->mem, &p->mem[p->offset]))
+ p--;
if (p == chain + k - 1 && p > chain) {
- p->p--;
+ p->offset--;
} else {
- *top = *p->p;
- *p->p = 0;
+ *top = p->mem[p->offset];
+ p->mem[p->offset] = 0;
}
write_unlock(&pointers_lock);

while(partial > p)
{
- brelse(partial->bh);
+ vunmap_block(partial->mblock, 0, sb->s_blocksize, partial->mem);
+ mblock_put(partial->mblock);
partial--;
}
no_top:
@@ -268,21 +344,25 @@ static inline void free_data(struct inod

static void free_branches(struct inode *inode, block_t *p, block_t *q, int depth)
{
- struct buffer_head * bh;
+ struct super_block *sb = inode->i_sb;
+ struct fsblock_meta *mblock;
unsigned long nr;

if (depth--) {
for ( ; p < q ; p++) {
+ block_t *start, *end;
nr = block_to_cpu(*p);
if (!nr)
continue;
*p = 0;
- bh = sb_bread(inode->i_sb, nr);
- if (!bh)
+ mblock = sb_mbread(sb, nr);
+ if (!mblock)
continue;
- free_branches(inode, (block_t*)bh->b_data,
- block_end(bh), depth);
- bforget(bh);
+ start = vmap_block(mblock, 0, sb->s_blocksize);
+ end = (block_t *)((unsigned long)start + sb->s_blocksize);
+ free_branches(inode, start, end, depth);
+ vunmap_block(mblock, 0, sb->s_blocksize, start);
+ mblock_put(mblock);
minix_free_block(inode, nr);
mark_inode_dirty(inode);
}
@@ -303,7 +383,7 @@ static inline void truncate (struct inod
long iblock;

iblock = (inode->i_size + sb->s_blocksize -1) >> sb->s_blocksize_bits;
- block_truncate_page(inode->i_mapping, inode->i_size, get_block);
+ fsblock_truncate_page(inode->i_mapping, inode->i_size);

n = block_to_path(inode, iblock, offsets);
if (!n)
@@ -321,15 +401,18 @@ static inline void truncate (struct inod
if (partial == chain)
mark_inode_dirty(inode);
else
- mark_buffer_dirty_inode(partial->bh, inode);
+ mark_mblock_dirty_inode(partial->mblock, inode);
free_branches(inode, &nr, &nr+1, (chain+n-1) - partial);
}
/* Clear the ends of indirect blocks on the shared branch */
while (partial > chain) {
- free_branches(inode, partial->p + 1, block_end(partial->bh),
- (chain+n-1) - partial);
- mark_buffer_dirty_inode(partial->bh, inode);
- brelse (partial->bh);
+ block_t *start, *end;
+ start = &partial->mem[partial->offset + 1];
+ end = (block_t *)((unsigned long)partial->mem + sb->s_blocksize);
+ free_branches(inode, start, end, (chain+n-1) - partial);
+ mark_mblock_dirty_inode(partial->mblock, inode);
+ vunmap_block(partial->mblock, 0, sb->s_blocksize, partial->mem);
+ mblock_put(partial->mblock);
partial--;
}
do_indirects:
Index: linux-2.6/fs/minix/itree_v1.c
===================================================================
--- linux-2.6.orig/fs/minix/itree_v1.c
+++ linux-2.6/fs/minix/itree_v1.c
@@ -1,4 +1,4 @@
-#include <linux/buffer_head.h>
+#include <linux/fsblock.h>
#include "minix.h"

enum {DEPTH = 3, DIRECT = 7}; /* Only double indirect */
@@ -44,10 +44,9 @@ static int block_to_path(struct inode *

#include "itree_common.c"

-int V1_minix_get_block(struct inode * inode, long block,
- struct buffer_head *bh_result, int create)
+int V1_minix_insert_mapping(struct address_space *mapping, loff_t off, size_t len, int create)
{
- return get_block(inode, block, bh_result, create);
+ return insert_mapping(mapping, off, len, create);
}

void V1_minix_truncate(struct inode * inode)
Index: linux-2.6/fs/minix/itree_v2.c
===================================================================
--- linux-2.6.orig/fs/minix/itree_v2.c
+++ linux-2.6/fs/minix/itree_v2.c
@@ -1,4 +1,4 @@
-#include <linux/buffer_head.h>
+#include <linux/fsblock.h>
#include "minix.h"

enum {DIRECT = 7, DEPTH = 4}; /* Have triple indirect */
@@ -50,10 +50,9 @@ static int block_to_path(struct inode *

#include "itree_common.c"

-int V2_minix_get_block(struct inode * inode, long block,
- struct buffer_head *bh_result, int create)
+int V2_minix_insert_mapping(struct address_space *mapping, loff_t off, size_t len, int create)
{
- return get_block(inode, block, bh_result, create);
+ return insert_mapping(mapping, off, len, create);
}

void V2_minix_truncate(struct inode * inode)
Index: linux-2.6/fs/minix/bitmap.c
===================================================================
--- linux-2.6.orig/fs/minix/bitmap.c
+++ linux-2.6/fs/minix/bitmap.c
@@ -13,39 +13,48 @@

#include "minix.h"
#include <linux/smp_lock.h>
-#include <linux/buffer_head.h>
+#include <linux/fsblock.h>
#include <linux/bitops.h>
#include <linux/sched.h>

static int nibblemap[] = { 4,3,3,2,3,2,2,1,3,2,2,1,2,1,1,0 };

-static unsigned long count_free(struct buffer_head *map[], unsigned numblocks, __u32 numbits)
+static unsigned long count_free(struct fsblock_meta *map[], unsigned numblocks, __u32 numbits)
{
unsigned i, j, sum = 0;
- struct buffer_head *bh;
+ struct fsblock_meta *mblock;
+ unsigned int size;
+ char *data;

- for (i=0; i<numblocks-1; i++) {
- if (!(bh=map[i]))
+ for (i = 0; i < numblocks - 1; i++) {
+ if (!(mblock = map[i]))
return(0);
- for (j=0; j<bh->b_size; j++)
- sum += nibblemap[bh->b_data[j] & 0xf]
- + nibblemap[(bh->b_data[j]>>4) & 0xf];
+ size = fsblock_size(mblock_block(mblock));
+ data = vmap_block(mblock_block(mblock), 0, size);
+ for (j = 0; j < size; j++)
+ sum += nibblemap[data[j] & 0xf]
+ + nibblemap[(data[j]>>4) & 0xf];
+ vunmap_block(mblock_block(mblock), 0, size, data);
}

- if (numblocks==0 || !(bh=map[numblocks-1]))
+ if (numblocks == 0 || !(mblock = map[numblocks-1]))
return(0);
- i = ((numbits - (numblocks-1) * bh->b_size * 8) / 16) * 2;
+ size = fsblock_size(mblock_block(mblock));
+ i = ((numbits - (numblocks-1) * size * 8) / 16) * 2;
+ data = vmap_block(mblock, 0, size);
for (j=0; j<i; j++) {
- sum += nibblemap[bh->b_data[j] & 0xf]
- + nibblemap[(bh->b_data[j]>>4) & 0xf];
+ sum += nibblemap[data[j] & 0xf]
+ + nibblemap[(data[j]>>4) & 0xf];
}

i = numbits%16;
if (i!=0) {
- i = *(__u16 *)(&bh->b_data[j]) | ~((1<<i) - 1);
+ i = *(__u16 *)(&data[j]) | ~((1<<i) - 1);
sum += nibblemap[i & 0xf] + nibblemap[(i>>4) & 0xf];
sum += nibblemap[(i>>8) & 0xf] + nibblemap[(i>>12) & 0xf];
}
+ vunmap_block(mblock, 0, size, data);
+
return(sum);
}

@@ -53,7 +62,9 @@ void minix_free_block(struct inode *inod
{
struct super_block *sb = inode->i_sb;
struct minix_sb_info *sbi = minix_sb(sb);
- struct buffer_head *bh;
+ struct fsblock_meta *mblock;
+ char *data;
+ unsigned int size;
int k = sb->s_blocksize_bits + 3;
unsigned long bit, zone;

@@ -68,13 +79,16 @@ void minix_free_block(struct inode *inod
printk("minix_free_block: nonexistent bitmap buffer\n");
return;
}
- bh = sbi->s_zmap[zone];
+ mblock = sbi->s_zmap[zone];
+ size = fsblock_size(mblock_block(mblock));
+ data = vmap_block(mblock, 0, size);
lock_kernel();
- if (!minix_test_and_clear_bit(bit, bh->b_data))
+ if (!minix_test_and_clear_bit(bit, data))
printk("minix_free_block (%s:%lu): bit already cleared\n",
sb->s_id, block);
unlock_kernel();
- mark_buffer_dirty(bh);
+ vunmap_block(mblock, 0, size, data);
+ mark_mblock_dirty_inode(mblock, inode);
return;
}

@@ -85,21 +99,26 @@ int minix_new_block(struct inode * inode
int i;

for (i = 0; i < sbi->s_zmap_blocks; i++) {
- struct buffer_head *bh = sbi->s_zmap[i];
+ struct fsblock_meta *mblock = sbi->s_zmap[i];
+ unsigned int size = fsblock_size(mblock_block(mblock));
+ char *data;
int j;

+ data = vmap_block(mblock, 0, size);
lock_kernel();
- j = minix_find_first_zero_bit(bh->b_data, bits_per_zone);
+ j = minix_find_first_zero_bit(data, bits_per_zone);
if (j < bits_per_zone) {
- minix_set_bit(j, bh->b_data);
+ minix_set_bit(j, data);
unlock_kernel();
- mark_buffer_dirty(bh);
+ vunmap_block(mblock, 0, size, data);
+ mark_mblock_dirty_inode(mblock, inode);
j += i * bits_per_zone + sbi->s_firstdatazone-1;
if (j < sbi->s_firstdatazone || j >= sbi->s_nzones)
break;
return j;
}
unlock_kernel();
+ vunmap_block(mblock, 0, size, data);
}
return 0;
}
@@ -112,11 +131,12 @@ unsigned long minix_count_free_blocks(st
}

struct minix_inode *
-minix_V1_raw_inode(struct super_block *sb, ino_t ino, struct buffer_head **bh)
+minix_V1_raw_inode(struct super_block *sb, ino_t ino, struct fsblock_meta **mblock)
{
int block;
struct minix_sb_info *sbi = minix_sb(sb);
struct minix_inode *p;
+ unsigned int size;

if (!ino || ino > sbi->s_ninodes) {
printk("Bad inode number on dev %s: %ld is out of range\n",
@@ -126,24 +146,32 @@ minix_V1_raw_inode(struct super_block *s
ino--;
block = 2 + sbi->s_imap_blocks + sbi->s_zmap_blocks +
ino / MINIX_INODES_PER_BLOCK;
- *bh = sb_bread(sb, block);
- if (!*bh) {
+ *mblock = sb_mbread(sb, block);
+ if (!*mblock) {
printk("Unable to read inode block\n");
return NULL;
}
- p = (void *)(*bh)->b_data;
+ size = fsblock_size(mblock_block(*mblock));
+ p = vmap_block(*mblock, 0, size);
return p + ino % MINIX_INODES_PER_BLOCK;
}

+void minix_put_raw_inode(struct super_block *sb, ino_t ino, struct fsblock_meta *mblock, struct minix_inode *p)
+{
+ unsigned int size = fsblock_size(mblock_block(mblock));
+ vunmap_block(mblock, 0, size, p - ino%MINIX_INODES_PER_BLOCK);
+ mblock_put(mblock);
+}
+
struct minix2_inode *
-minix_V2_raw_inode(struct super_block *sb, ino_t ino, struct buffer_head **bh)
+minix_V2_raw_inode(struct super_block *sb, ino_t ino, struct fsblock_meta **mblock)
{
int block;
struct minix_sb_info *sbi = minix_sb(sb);
struct minix2_inode *p;
int minix2_inodes_per_block = sb->s_blocksize / sizeof(struct minix2_inode);
+ unsigned int size;

- *bh = NULL;
if (!ino || ino > sbi->s_ninodes) {
printk("Bad inode number on dev %s: %ld is out of range\n",
sb->s_id, (long)ino);
@@ -152,49 +180,64 @@ minix_V2_raw_inode(struct super_block *s
ino--;
block = 2 + sbi->s_imap_blocks + sbi->s_zmap_blocks +
ino / minix2_inodes_per_block;
- *bh = sb_bread(sb, block);
- if (!*bh) {
+ *mblock = sb_mbread(sb, block);
+ if (!*mblock) {
printk("Unable to read inode block\n");
return NULL;
}
- p = (void *)(*bh)->b_data;
+ size = fsblock_size(mblock_block(*mblock));
+ p = vmap_block(*mblock, 0, size);
return p + ino % minix2_inodes_per_block;
}

+void minix2_put_raw_inode(struct super_block *sb, ino_t ino, struct fsblock_meta *mblock, struct minix2_inode *p)
+{
+ int minix2_inodes_per_block = sb->s_blocksize / sizeof(struct minix2_inode);
+ unsigned int size = fsblock_size(mblock_block(mblock));
+
+ ino--;
+ vunmap_block(mblock, 0, size, p - ino%minix2_inodes_per_block);
+ mblock_put(mblock);
+}
+
/* Clear the link count and mode of a deleted inode on disk. */

static void minix_clear_inode(struct inode *inode)
{
- struct buffer_head *bh = NULL;
+ struct super_block *sb = inode->i_sb;
+ ino_t ino = inode->i_ino;
+ struct fsblock_meta *mblock;

if (INODE_VERSION(inode) == MINIX_V1) {
struct minix_inode *raw_inode;
- raw_inode = minix_V1_raw_inode(inode->i_sb, inode->i_ino, &bh);
+ raw_inode = minix_V1_raw_inode(sb, ino, &mblock);
if (raw_inode) {
raw_inode->i_nlinks = 0;
raw_inode->i_mode = 0;
+ mark_mblock_dirty(mblock);
+ minix_put_raw_inode(sb, ino, mblock, raw_inode);
}
} else {
struct minix2_inode *raw_inode;
- raw_inode = minix_V2_raw_inode(inode->i_sb, inode->i_ino, &bh);
+ raw_inode = minix_V2_raw_inode(sb, ino, &mblock);
if (raw_inode) {
raw_inode->i_nlinks = 0;
raw_inode->i_mode = 0;
+ mark_mblock_dirty(mblock);
+ minix2_put_raw_inode(sb, ino, mblock, raw_inode);
}
}
- if (bh) {
- mark_buffer_dirty(bh);
- brelse (bh);
- }
}

void minix_free_inode(struct inode * inode)
{
struct super_block *sb = inode->i_sb;
struct minix_sb_info *sbi = minix_sb(inode->i_sb);
- struct buffer_head *bh;
+ struct fsblock_meta *mblock;
int k = sb->s_blocksize_bits + 3;
unsigned long ino, bit;
+ unsigned int size;
+ char *data;

ino = inode->i_ino;
if (ino < 1 || ino > sbi->s_ninodes) {
@@ -210,12 +253,15 @@ void minix_free_inode(struct inode * ino

minix_clear_inode(inode); /* clear on-disk copy */

- bh = sbi->s_imap[ino];
+ mblock = sbi->s_imap[ino];
+ size = fsblock_size(mblock_block(mblock));
+ data = vmap_block(mblock, 0, size);
lock_kernel();
- if (!minix_test_and_clear_bit(bit, bh->b_data))
+ if (!minix_test_and_clear_bit(bit, data))
printk("minix_free_inode: bit %lu already cleared\n", bit);
unlock_kernel();
- mark_buffer_dirty(bh);
+ vunmap_block(mblock, 0, size, data);
+ mark_mblock_dirty(mblock);
out:
clear_inode(inode); /* clear in-memory copy */
}
@@ -225,7 +271,9 @@ struct inode * minix_new_inode(const str
struct super_block *sb = dir->i_sb;
struct minix_sb_info *sbi = minix_sb(sb);
struct inode *inode = new_inode(sb);
- struct buffer_head * bh;
+ struct fsblock_meta * mblock;
+ unsigned int size;
+ char * data;
int bits_per_zone = 8 * sb->s_blocksize;
unsigned long j;
int i;
@@ -235,28 +283,32 @@ struct inode * minix_new_inode(const str
return NULL;
}
j = bits_per_zone;
- bh = NULL;
+ mblock = NULL;
*error = -ENOSPC;
lock_kernel();
for (i = 0; i < sbi->s_imap_blocks; i++) {
- bh = sbi->s_imap[i];
- j = minix_find_first_zero_bit(bh->b_data, bits_per_zone);
+ mblock = sbi->s_imap[i];
+ size = fsblock_size(mblock_block(mblock));
+ data = vmap_block(mblock, 0, size);
+ j = minix_find_first_zero_bit(data, bits_per_zone);
if (j < bits_per_zone)
break;
+ vunmap_block(mblock, 0, size, data);
}
- if (!bh || j >= bits_per_zone) {
+ if (!mblock || j >= bits_per_zone) {
unlock_kernel();
iput(inode);
return NULL;
}
- if (minix_test_and_set_bit(j, bh->b_data)) { /* shouldn't happen */
+ if (minix_test_and_set_bit(j, data)) { /* shouldn't happen */
unlock_kernel();
printk("minix_new_inode: bit already set\n");
iput(inode);
return NULL;
}
unlock_kernel();
- mark_buffer_dirty(bh);
+ vunmap_block(mblock, 0, size, data);
+ mark_mblock_dirty(mblock);
j += i * bits_per_zone;
if (!j || j > sbi->s_ninodes) {
iput(inode);
Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c
+++ linux-2.6/fs/minix/inode.c
@@ -12,7 +12,7 @@

#include <linux/module.h>
#include "minix.h"
-#include <linux/buffer_head.h>
+#include <linux/fsblock.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/highuid.h>
@@ -25,27 +25,34 @@ static int minix_remount (struct super_b

static void minix_delete_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ struct address_space *mapping = &inode->i_data;
+
+ truncate_inode_pages(mapping, 0);
inode->i_size = 0;
minix_truncate(inode);
+ fsblock_release(mapping, 1);
minix_free_inode(inode);
}

static void minix_put_super(struct super_block *sb)
{
int i;
+ unsigned int offset;
struct minix_sb_info *sbi = minix_sb(sb);

if (!(sb->s_flags & MS_RDONLY)) {
if (sbi->s_version != MINIX_V3) /* s_state is now out from V3 sb */
sbi->s_ms->s_state = sbi->s_mount_state;
- mark_buffer_dirty(sbi->s_sbh);
+ mark_mblock_dirty(sbi->s_smblock);
}
for (i = 0; i < sbi->s_imap_blocks; i++)
- brelse(sbi->s_imap[i]);
+ mblock_put(sbi->s_imap[i]);
for (i = 0; i < sbi->s_zmap_blocks; i++)
- brelse(sbi->s_zmap[i]);
- brelse (sbi->s_sbh);
+ mblock_put(sbi->s_zmap[i]);
+
+ offset = BLOCK_SIZE - mblock_block(sbi->s_smblock)->block_nr * sb->s_blocksize;
+ vunmap_block(sbi->s_smblock, offset, BLOCK_SIZE, sbi->s_ms);
+ mblock_put(sbi->s_smblock);
kfree(sbi->s_imap);
sb->s_fs_info = NULL;
kfree(sbi);
@@ -119,7 +126,7 @@ static int minix_remount (struct super_b
/* Mounting a rw partition read-only. */
if (sbi->s_version != MINIX_V3)
ms->s_state = sbi->s_mount_state;
- mark_buffer_dirty(sbi->s_sbh);
+ mark_mblock_dirty(sbi->s_smblock);
} else {
/* Mount a partition which is read-only, read-write. */
if (sbi->s_version != MINIX_V3) {
@@ -128,7 +135,7 @@ static int minix_remount (struct super_b
} else {
sbi->s_mount_state = MINIX_VALID_FS;
}
- mark_buffer_dirty(sbi->s_sbh);
+ mark_mblock_dirty(sbi->s_smblock);

if (!(sbi->s_mount_state & MINIX_VALID_FS))
printk("MINIX-fs warning: remounting unchecked fs, "
@@ -142,13 +149,17 @@ static int minix_remount (struct super_b

static int minix_fill_super(struct super_block *s, void *data, int silent)
{
- struct buffer_head *bh;
- struct buffer_head **map;
+ struct fsblock_meta *mblock;
+ struct fsblock_meta **map;
struct minix_super_block *ms;
struct minix3_super_block *m3s = NULL;
unsigned long i, block;
struct inode *root_inode;
struct minix_sb_info *sbi;
+ char *d;
+ unsigned int size = BLOCK_SIZE;
+ sector_t blocknr = BLOCK_SIZE / size;
+ unsigned int offset = BLOCK_SIZE - blocknr * size;

sbi = kzalloc(sizeof(struct minix_sb_info), GFP_KERNEL);
if (!sbi)
@@ -158,15 +169,15 @@ static int minix_fill_super(struct super
BUILD_BUG_ON(32 != sizeof (struct minix_inode));
BUILD_BUG_ON(64 != sizeof(struct minix2_inode));

- if (!sb_set_blocksize(s, BLOCK_SIZE))
+ if (!sb_set_blocksize(s, size))
goto out_bad_hblock;

- if (!(bh = sb_bread(s, 1)))
+ if (!(mblock = sb_mbread(s, blocknr)))
goto out_bad_sb;

- ms = (struct minix_super_block *) bh->b_data;
+ ms = vmap_block(mblock, offset, BLOCK_SIZE); /* XXX: unmap where? */
sbi->s_ms = ms;
- sbi->s_sbh = bh;
+ sbi->s_smblock = mblock;
sbi->s_mount_state = ms->s_state;
sbi->s_ninodes = ms->s_ninodes;
sbi->s_nzones = ms->s_nzones;
@@ -198,8 +209,8 @@ static int minix_fill_super(struct super
sbi->s_dirsize = 32;
sbi->s_namelen = 30;
sbi->s_link_max = MINIX2_LINK_MAX;
- } else if ( *(__u16 *)(bh->b_data + 24) == MINIX3_SUPER_MAGIC) {
- m3s = (struct minix3_super_block *) bh->b_data;
+ } else if ( *((__u16 *)ms + 12) == MINIX3_SUPER_MAGIC) {
+ m3s = (struct minix3_super_block *)ms;
s->s_magic = m3s->s_magic;
sbi->s_imap_blocks = m3s->s_imap_blocks;
sbi->s_zmap_blocks = m3s->s_zmap_blocks;
@@ -213,7 +224,22 @@ static int minix_fill_super(struct super
sbi->s_version = MINIX_V3;
sbi->s_link_max = MINIX2_LINK_MAX;
sbi->s_mount_state = MINIX_VALID_FS;
- sb_set_blocksize(s, m3s->s_blocksize);
+ size = m3s->s_blocksize;
+ if (size != BLOCK_SIZE) {
+ blocknr = BLOCK_SIZE / size;
+ offset = BLOCK_SIZE - blocknr * size;
+
+ vunmap_block(mblock, offset, BLOCK_SIZE, ms);
+ mblock_put(mblock);
+ if (!sb_set_blocksize(s, size))
+ goto out_bad_hblock;
+ if (!(mblock = sb_mbread(s, blocknr)))
+ goto out_bad_sb;
+ ms = vmap_block(mblock, offset, BLOCK_SIZE);
+ m3s = (struct minix3_super_block *)ms;
+ sbi->s_ms = ms;
+ sbi->s_smblock = mblock;
+ }
} else
goto out_no_fs;

@@ -222,7 +248,7 @@ static int minix_fill_super(struct super
*/
if (sbi->s_imap_blocks == 0 || sbi->s_zmap_blocks == 0)
goto out_illegal_sb;
- i = (sbi->s_imap_blocks + sbi->s_zmap_blocks) * sizeof(bh);
+ i = (sbi->s_imap_blocks + sbi->s_zmap_blocks) * sizeof(mblock);
map = kzalloc(i, GFP_KERNEL);
if (!map)
goto out_no_map;
@@ -231,18 +257,23 @@ static int minix_fill_super(struct super

block=2;
for (i=0 ; i < sbi->s_imap_blocks ; i++) {
- if (!(sbi->s_imap[i]=sb_bread(s, block)))
+ if (!(sbi->s_imap[i] = sb_mbread(s, block)))
goto out_no_bitmap;
block++;
}
for (i=0 ; i < sbi->s_zmap_blocks ; i++) {
- if (!(sbi->s_zmap[i]=sb_bread(s, block)))
+ if (!(sbi->s_zmap[i] = sb_mbread(s, block)))
goto out_no_bitmap;
block++;
}

- minix_set_bit(0,sbi->s_imap[0]->b_data);
- minix_set_bit(0,sbi->s_zmap[0]->b_data);
+ d = vmap_block(sbi->s_imap[0], 0, size);
+ minix_set_bit(0, d);
+ vunmap_block(sbi->s_imap[0], 0, size, d);
+
+ d = vmap_block(sbi->s_zmap[0], 0, size);
+ minix_set_bit(0, d);
+ vunmap_block(sbi->s_zmap[0], 0, size, d);

/* set up enough so that it can read an inode */
s->s_op = &minix_sops;
@@ -260,8 +291,9 @@ static int minix_fill_super(struct super
if (!(s->s_flags & MS_RDONLY)) {
if (sbi->s_version != MINIX_V3) /* s_state is now out from V3 sb */
ms->s_state &= ~MINIX_VALID_FS;
- mark_buffer_dirty(bh);
+ mark_mblock_dirty(mblock);
}
+
if (!(sbi->s_mount_state & MINIX_VALID_FS))
printk("MINIX-fs: mounting unchecked file system, "
"running fsck is recommended\n");
@@ -283,9 +315,9 @@ out_no_bitmap:
printk("MINIX-fs: bad superblock or unable to read bitmaps\n");
out_freemap:
for (i = 0; i < sbi->s_imap_blocks; i++)
- brelse(sbi->s_imap[i]);
+ mblock_put(sbi->s_imap[i]);
for (i = 0; i < sbi->s_zmap_blocks; i++)
- brelse(sbi->s_zmap[i]);
+ mblock_put(sbi->s_zmap[i]);
kfree(sbi->s_imap);
goto out_release;

@@ -304,7 +336,8 @@ out_no_fs:
printk("VFS: Can't find a Minix filesystem V1 | V2 | V3 "
"on device %s.\n", s->s_id);
out_release:
- brelse(bh);
+ vunmap_block(mblock, offset, BLOCK_SIZE, ms);
+ mblock_put(mblock);
goto out;

out_bad_hblock:
@@ -333,38 +366,45 @@ static int minix_statfs(struct dentry *d
return 0;
}

-static int minix_get_block(struct inode *inode, sector_t block,
- struct buffer_head *bh_result, int create)
+static int minix_insert_mapping(struct address_space *mapping, loff_t off, size_t len, int create)
{
- if (INODE_VERSION(inode) == MINIX_V1)
- return V1_minix_get_block(inode, block, bh_result, create);
+ if (INODE_VERSION(mapping->host) == MINIX_V1)
+ return V1_minix_insert_mapping(mapping, off, len, create);
else
- return V2_minix_get_block(inode, block, bh_result, create);
+ return V2_minix_insert_mapping(mapping, off, len, create);
}

static int minix_writepage(struct page *page, struct writeback_control *wbc)
{
- return block_write_full_page(page, minix_get_block, wbc);
+ return fsblock_write_page(page, minix_insert_mapping, wbc);
}
+
static int minix_readpage(struct file *file, struct page *page)
{
- return block_read_full_page(page,minix_get_block);
+ return fsblock_read_page(page, minix_insert_mapping);
}
+
static int minix_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
{
- return block_prepare_write(page,from,to,minix_get_block);
+ return fsblock_prepare_write(page, from, to, minix_insert_mapping);
}
+
static sector_t minix_bmap(struct address_space *mapping, sector_t block)
{
- return generic_block_bmap(mapping,block,minix_get_block);
+ return fsblock_bmap(mapping, block, minix_insert_mapping);
}
+
static const struct address_space_operations minix_aops = {
.readpage = minix_readpage,
.writepage = minix_writepage,
- .sync_page = block_sync_page,
+// .sync_page = block_sync_page,
.prepare_write = minix_prepare_write,
- .commit_write = generic_commit_write,
- .bmap = minix_bmap
+ .commit_write = fsblock_commit_write,
+ .bmap = minix_bmap,
+ .set_page_dirty = fsblock_set_page_dirty,
+ .invalidatepage = fsblock_invalidate_page,
+ .release = fsblock_release,
+ .sync = fsblock_sync,
};

static const struct inode_operations minix_symlink_inode_operations = {
@@ -396,12 +436,12 @@ void minix_set_inode(struct inode *inode
*/
static void V1_minix_read_inode(struct inode * inode)
{
- struct buffer_head * bh;
+ struct fsblock_meta *mblock;
struct minix_inode * raw_inode;
struct minix_inode_info *minix_inode = minix_i(inode);
int i;

- raw_inode = minix_V1_raw_inode(inode->i_sb, inode->i_ino, &bh);
+ raw_inode = minix_V1_raw_inode(inode->i_sb, inode->i_ino, &mblock);
if (!raw_inode) {
make_bad_inode(inode);
return;
@@ -419,7 +459,7 @@ static void V1_minix_read_inode(struct i
for (i = 0; i < 9; i++)
minix_inode->u.i1_data[i] = raw_inode->i_zone[i];
minix_set_inode(inode, old_decode_dev(raw_inode->i_zone[0]));
- brelse(bh);
+ minix_put_raw_inode(inode->i_sb, inode->i_ino, mblock, raw_inode);
}

/*
@@ -427,12 +467,13 @@ static void V1_minix_read_inode(struct i
*/
static void V2_minix_read_inode(struct inode * inode)
{
- struct buffer_head * bh;
+ struct fsblock_meta *mblock;
struct minix2_inode * raw_inode;
struct minix_inode_info *minix_inode = minix_i(inode);
int i;
+ ino_t ino = inode->i_ino;

- raw_inode = minix_V2_raw_inode(inode->i_sb, inode->i_ino, &bh);
+ raw_inode = minix_V2_raw_inode(inode->i_sb, ino, &mblock);
if (!raw_inode) {
make_bad_inode(inode);
return;
@@ -452,7 +493,7 @@ static void V2_minix_read_inode(struct i
for (i = 0; i < 10; i++)
minix_inode->u.i2_data[i] = raw_inode->i_zone[i];
minix_set_inode(inode, old_decode_dev(raw_inode->i_zone[0]));
- brelse(bh);
+ minix2_put_raw_inode(inode->i_sb, ino, mblock, raw_inode);
}

/*
@@ -469,14 +510,14 @@ static void minix_read_inode(struct inod
/*
* The minix V1 function to synchronize an inode.
*/
-static struct buffer_head * V1_minix_update_inode(struct inode * inode)
+static struct fsblock_meta * V1_minix_update_inode(struct inode * inode)
{
- struct buffer_head * bh;
+ struct fsblock_meta * mblock;
struct minix_inode * raw_inode;
struct minix_inode_info *minix_inode = minix_i(inode);
int i;

- raw_inode = minix_V1_raw_inode(inode->i_sb, inode->i_ino, &bh);
+ raw_inode = minix_V1_raw_inode(inode->i_sb, inode->i_ino, &mblock);
if (!raw_inode)
return NULL;
raw_inode->i_mode = inode->i_mode;
@@ -489,21 +530,23 @@ static struct buffer_head * V1_minix_upd
raw_inode->i_zone[0] = old_encode_dev(inode->i_rdev);
else for (i = 0; i < 9; i++)
raw_inode->i_zone[i] = minix_inode->u.i1_data[i];
- mark_buffer_dirty(bh);
- return bh;
+ mblock_get(mblock);
+ mark_mblock_dirty_inode(mblock, inode);
+ minix_put_raw_inode(inode->i_sb, inode->i_ino, mblock, raw_inode);
+ return mblock;
}

/*
* The minix V2 function to synchronize an inode.
*/
-static struct buffer_head * V2_minix_update_inode(struct inode * inode)
+static struct fsblock_meta * V2_minix_update_inode(struct inode * inode)
{
- struct buffer_head * bh;
+ struct fsblock_meta * mblock;
struct minix2_inode * raw_inode;
struct minix_inode_info *minix_inode = minix_i(inode);
int i;

- raw_inode = minix_V2_raw_inode(inode->i_sb, inode->i_ino, &bh);
+ raw_inode = minix_V2_raw_inode(inode->i_sb, inode->i_ino, &mblock);
if (!raw_inode)
return NULL;
raw_inode->i_mode = inode->i_mode;
@@ -518,11 +561,13 @@ static struct buffer_head * V2_minix_upd
raw_inode->i_zone[0] = old_encode_dev(inode->i_rdev);
else for (i = 0; i < 10; i++)
raw_inode->i_zone[i] = minix_inode->u.i2_data[i];
- mark_buffer_dirty(bh);
- return bh;
+ mblock_get(mblock);
+ mark_mblock_dirty_inode(mblock, inode);
+ minix2_put_raw_inode(inode->i_sb, inode->i_ino, mblock, raw_inode);
+ return mblock;
}

-static struct buffer_head *minix_update_inode(struct inode *inode)
+static struct fsblock_meta *minix_update_inode(struct inode *inode)
{
if (INODE_VERSION(inode) == MINIX_V1)
return V1_minix_update_inode(inode);
@@ -532,29 +577,28 @@ static struct buffer_head *minix_update_

static int minix_write_inode(struct inode * inode, int wait)
{
- brelse(minix_update_inode(inode));
+ mblock_put(minix_update_inode(inode));
return 0;
}

int minix_sync_inode(struct inode * inode)
{
int err = 0;
- struct buffer_head *bh;
+ struct fsblock_meta *mblock;

- bh = minix_update_inode(inode);
- if (bh && buffer_dirty(bh))
- {
- sync_dirty_buffer(bh);
- if (buffer_req(bh) && !buffer_uptodate(bh))
+ mblock = minix_update_inode(inode);
+ if (mblock && test_bit(BL_dirty, &mblock_block(mblock)->flags)) {
+ sync_block(mblock_block(mblock));
+ if (test_bit(BL_error, &mblock_block(mblock)->flags))
{
printk("IO error syncing minix inode [%s:%08lx]\n",
inode->i_sb->s_id, inode->i_ino);
err = -1;
}
}
- else if (!bh)
+ else if (!mblock)
err = -1;
- brelse (bh);
+ mblock_put(mblock);
return err;
}

Index: linux-2.6/fs/minix/file.c
===================================================================
--- linux-2.6.orig/fs/minix/file.c
+++ linux-2.6/fs/minix/file.c
@@ -6,7 +6,7 @@
* minix regular file handling primitives
*/

-#include <linux/buffer_head.h> /* for fsync_inode_buffers() */
+#include <linux/fsblock.h>
#include "minix.h"

/*
@@ -21,7 +21,7 @@ const struct file_operations minix_file_
.aio_read = generic_file_aio_read,
.write = do_sync_write,
.aio_write = generic_file_aio_write,
- .mmap = generic_file_mmap,
+ .mmap = fsblock_file_mmap,
.fsync = minix_sync_file,
.sendfile = generic_file_sendfile,
};
@@ -36,7 +36,7 @@ int minix_sync_file(struct file * file,
struct inode *inode = dentry->d_inode;
int err;

- err = sync_mapping_buffers(inode->i_mapping);
+ err = fsblock_sync(inode->i_mapping);
if (!(inode->i_state & I_DIRTY))
return err;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
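
Pulling the central interface change out of the diff for clarity: the
per-block get_block() callback that fills in a buffer_head is replaced by a
per-range insert_mapping() on the address_space, which maps (and, with
create != 0, allocates) the blocks backing a byte range before the page is
dirtied:

	/* old buffer-layer callback: one block at a time, result in a buffer_head */
	static int minix_get_block(struct inode *inode, sector_t block,
			struct buffer_head *bh_result, int create);

	/* fsblock replacement: map a whole byte range of the mapping in one call */
	static int minix_insert_mapping(struct address_space *mapping,
			loff_t off, size_t len, int create);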

2007-06-24 01:54:07

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

Just clarifying a few things. Don't you hate rereading a long work you
wrote? (Oh, you're supposed to do that *before* you press send?)

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
>
> I'm announcing "fsblock" now because it is quite intrusive and so I'd
> like to get some thoughts about significantly changing this core part
> of the kernel.
>
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...
>
> Firstly, what is the buffer layer? The buffer layer isn't really a
> buffer layer as in the buffer cache of unix: the block device cache
> is unified with the pagecache (in terms of the pagecache, a blkdev
> file is just like any other, but with a 1:1 mapping between offset
> and block).

I mean, in Linux, the block device cache is unified. UNIX I believe
did all its caching in a buffer cache, below the filesystem.


> - Large block support. I can mount and run an 8K block size minix3 fs on
> my 4K page system and it didn't require anything special in the fs. We

Oh, and I don't have a Linux mkfs that makes minixv3 filesystems.
I had an image kindly made for me because I don't use minix. If
you want to test large block support, I won't email it to you though:
you can just convert ext2 or ext3 to fsblock ;)

2007-06-24 03:08:10

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] fsblock

Nick Piggin wrote:
> - No deadlocks (hopefully). The buffer layer is technically deadlocky by
> design, because it can require memory allocations at page writeout-time.
> It also has one path that cannot tolerate memory allocation failures.
> No such problems for fsblock, which keeps fsblock metadata around for as
> long as a page is dirty (this still has problems vs get_user_pages, but
> that's going to require an audit of all get_user_pages sites. Phew).
>
> - In line with the above item, filesystem block allocation is performed
> before a page is dirtied. In the buffer layer, mmap writes can dirty a
> page with no backing blocks which is a problem if the filesystem is
> ENOSPC (patches exist for buffer.c for this).

This raises an eyebrow... The handling of ENOSPC prior to mmap write is
more an ABI behavior, so I don't see how this can be fixed with internal
changes, yet without changing behavior currently exported to userland
(and thus affecting code based on such assumptions).


> - An inode's metadata must be tracked per-inode in order for fsync to
> work correctly. buffer contains helpers to do this for basic
> filesystems, but any block can be only the metadata for a single inode.
> This is not really correct for things like inode descriptor blocks.
> fsblock can track multiple inodes per block. (This is non trivial,
> and it may be overkill so it could be reverted to a simpler scheme
> like buffer).

hrm; no specific comment but this seems like an idea/area that needs to
be fleshed out more, by converting some of the more advanced filesystems.


> - Large block support. I can mount and run an 8K block size minix3 fs on
> my 4K page system and it didn't require anything special in the fs. We
> can go up to about 32MB blocks now, and gigabyte+ blocks would only
> require one more bit in the fsblock flags. fsblock_superpage blocks
> are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

definitely useful, especially if I rewrite my ibu filesystem for 2.6.x,
like I've been planning.


> So. Comments? Is this something we want? If yes, then how would we
> transition from buffer.c to fsblock.c?

Your work is definitely interesting, but I think it will be even more
interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are
converted.

My gut feeling is that there are several problem areas you haven't hit
yet, with the new code.

Also, once things are converted, the question of transitioning from
buffer.c will undoubtedly answer itself. That's the way several of us
handle transitions: finish all the work, then look with fresh eyes and
conceive a path from the current code to your enhanced code.

Jeff


2007-06-24 03:48:14

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
> Nick Piggin wrote:
> >- No deadlocks (hopefully). The buffer layer is technically deadlocky by
> > design, because it can require memory allocations at page writeout-time.
> > It also has one path that cannot tolerate memory allocation failures.
> > No such problems for fsblock, which keeps fsblock metadata around for as
> > long as a page is dirty (this still has problems vs get_user_pages, but
> > that's going to require an audit of all get_user_pages sites. Phew).
> >
> >- In line with the above item, filesystem block allocation is performed
> > before a page is dirtied. In the buffer layer, mmap writes can dirty a
> > page with no backing blocks which is a problem if the filesystem is
> > ENOSPC (patches exist for buffer.c for this).
>
> This raises an eyebrow... The handling of ENOSPC prior to mmap write is
> more an ABI behavior, so I don't see how this can be fixed with internal
> changes, yet without changing behavior currently exported to userland
> (and thus affecting code based on such assumptions).

I believe people are happy to have it SIGBUS (which is how the VM
is already set up with page_mkwrite, and what fsblock does).
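
As a sketch of how that plays out (the examplefs_* names are hypothetical;
the minix patch gets this via fsblock_file_mmap, which presumably installs a
handler of roughly this shape): block allocation happens in ->page_mkwrite()
before the page can be dirtied, and a failure there, e.g. -ENOSPC, makes the
fault path deliver SIGBUS rather than leaving an unbacked dirty page behind.

	/* sketch only -- follows the ->page_mkwrite() API of that era */
	static int examplefs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
	{
		struct address_space *mapping = vma->vm_file->f_mapping;

		/* reserve/allocate the backing blocks now; a negative return
		 * here causes the write fault to SIGBUS instead of dirtying
		 * the page (examplefs_insert_mapping is a hypothetical hook) */
		return examplefs_insert_mapping(mapping, page_offset(page),
						PAGE_CACHE_SIZE, 1);
	}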


> >- An inode's metadata must be tracked per-inode in order for fsync to
> > work correctly. buffer contains helpers to do this for basic
> > filesystems, but any block can be only the metadata for a single inode.
> > This is not really correct for things like inode descriptor blocks.
> > fsblock can track multiple inodes per block. (This is non trivial,
> > and it may be overkill so it could be reverted to a simpler scheme
> > like buffer).
>
> hrm; no specific comment but this seems like an idea/area that needs to
> be fleshed out more, by converting some of the more advanced filesystems.

Yep. It's conceptually fairly simple though, and it might be easier
than having filesystems implement their own complex syncing that finds
and syncs everything themselves.


> >- Large block support. I can mount and run an 8K block size minix3 fs on
> > my 4K page system and it didn't require anything special in the fs. We
> > can go up to about 32MB blocks now, and gigabyte+ blocks would only
> > require one more bit in the fsblock flags. fsblock_superpage blocks
> > are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
>
> definitely useful, especially if I rewrite my ibu filesystem for 2.6.x,
> like I've been planning.

Yeah, it wasn't the primary motivation for the rewrite, but it would
be negligent to not even consider large blocks in such a rewrite, I
think.


> >So. Comments? Is this something we want? If yes, then how would we
> >transition from buffer.c to fsblock.c?
>
> Your work is definitely interesting, but I think it will be even more
> interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are
> converted.

Well minix has dir in pagecache ;) But you're completely right: ext2
will be the next step, and then ext3 and things like XFS and NTFS
will be the real test. I think I could eventually get ext2 done (one
of the biggest headaches is simply converting ->b_data accesses),
but it's unlikely I could do a journalling one myself.


> My gut feeling is that there are several problem areas you haven't hit
> yet, with the new code.

I would agree with your gut :)


> Also, once things are converted, the question of transitioning from
> buffer.c will undoubtedly answer itself. That's the way several of us
> handle transitions: finish all the work, then look with fresh eyes and
> conceive a path from the current code to your enhanced code.

Yeah, that would be nice. It's very difficult because of the sheer
amount of filesystem code. I'd say it would be feasible to step
buffer.c into fsblock.c, but if we were to track all (or even the
common) filesystems along with that, it would introduce a huge number
of kind-of-redundant changes that I don't think all fs maintainers
would have time to write (and, as I said, I can't do it myself).
Anyway, let's cross that bridge if and when we come to it.

For now, the big thing that needs to be done is convert a "big" fs
and see if the results tell us that it's workable.

Thanks for the comments Jeff.


2007-06-24 04:18:35

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...

Long overdue. Thank you.


-- wli

2007-06-24 13:20:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] fsblock

Nick Piggin <[email protected]> writes:
>
> - Structure packing. A page gets a number of buffer heads that are
> allocated in a linked list. fsblocks are allocated contiguously, so
> cacheline footprint is smaller in the above situation.

It would be interesting to test whether that makes a difference for
database benchmarks running over file systems. Databases eat a lot of
cache, so in theory any cache-footprint improvement in the kernel
(which therefore often runs cache-cold) should be beneficial.

But I guess it would need at least ext2 to test; Minix is probably not
good enough.

In general, have you benchmarked the CPU overhead of the old versus the
new code? E.g. when we went to BIO, scalability went up, but the CPU
cost of a single request also went up. It would be nice not to continue
that trend, or better, to reverse it.

> - Large block support. I can mount and run an 8K block size minix3 fs on
> my 4K page system and it didn't require anything special in the fs. We
> can go up to about 32MB blocks now, and gigabyte+ blocks would only
> require one more bit in the fsblock flags. fsblock_superpage blocks
> are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

Can it be cleanly ifdefed or optimized away? Unless the fragmentation
problem is solved, it would seem rather pointless to me. Also, I
personally still think the right way to approach this is a larger soft
page size.

-Andi

2007-06-24 13:54:32

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Sun, Jun 24, 2007 at 05:47:55AM +0200, Nick Piggin wrote:
> On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
>
> > >- Large block support. I can mount and run an 8K block size minix3 fs on
> > > my 4K page system and it didn't require anything special in the fs. We
> > > can go up to about 32MB blocks now, and gigabyte+ blocks would only
> > > require one more bit in the fsblock flags. fsblock_superpage blocks
> > > are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
> >
> > definitely useful, especially if I rewrite my ibu filesystem for 2.6.x,
> > like I've been planning.
>
> Yeah, it wasn't the primary motivation for the rewrite, but it would
> be negligent to not even consider large blocks in such a rewrite, I
> think.

I'll join the cheering here, thanks for starting on this.

>
> > My gut feeling is that there are several problem areas you haven't hit
> > yet, with the new code.
>
> I would agree with your gut :)
>

Without having read the code yet (light reading for Monday morning ;),
ext3 and reiserfs use buffer heads for data=ordered to help them do
deadlock-free writeback. Basically they need to be able to write out
the pending data=ordered pages, potentially with the transaction lock
held (or, if not held, while blocking new transactions from starting).

But, writepage, prepare_write and commit_write all need to start a
transaction with the page lock already held. So, if the page lock were
used for data=ordered writeback, there would be a lock inversion between
the transaction lock and the page lock.

Using buffer heads instead allows the FS to send file data down inside
the transaction code, without taking the page lock. So, locking wrt
data=ordered is definitely going to be tricky.

The best long term option may be making the locking order
transaction -> page lock, and change writepage to punt to some other
queue when it needs to start a transaction.
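
The inversion can be laid out schematically (not code from any particular
filesystem, just the two acquisition orders side by side):

	/* ordered-data writeback, driven from the journal commit path:
	 *	transaction lock held (or new transactions blocked)
	 *	  -> would then need lock_page() to push the data out
	 *
	 * writepage / prepare_write / commit_write, driven from the VM:
	 *	page lock already held by the caller
	 *	  -> then needs to start (or nest into) a transaction
	 *
	 * If ordered writeback also went through the page lock, the two
	 * paths would take {transaction, page} in opposite orders: an
	 * ABBA deadlock. Writing via buffer heads lets the commit path
	 * avoid the page lock entirely.
	 */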

-chris

2007-06-24 14:33:13

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

Nick Piggin <[email protected]> writes:


[haven't read everything, just commenting on something that caught my eye]

> +struct fsblock {
> + atomic_t count;
> + union {
> + struct {
> + unsigned long flags; /* XXX: flags could be int for better packing */

int is not supported by many architectures, but works on x86 at least.

Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the smallest
possible type for each architecture. And a couple of ugly casts for set_bit et.al.
but those could be also hidden in macros. Should be relatively easy to do.
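
As a very rough sketch of that idea (the config symbol below is
hypothetical, and the casts are only valid on architectures where the
bitops can safely operate on int-sized words):

/* Declare a bitmap using the smallest type the architecture can do
 * atomic bitops on. */
#ifdef CONFIG_ARCH_HAS_INT_ATOMIC_BITOPS	/* hypothetical symbol */
#define DECLARE_ATOMIC_BITMAP(name, maxbit) \
	u32 name[((maxbit) + 31) / 32]
#else
#define DECLARE_ATOMIC_BITMAP(name, maxbit) \
	unsigned long name[BITS_TO_LONGS(maxbit)]
#endif

/* The ugly casts for set_bit et al., hidden behind wrappers: */
#define atomic_bitmap_set(nr, map)	set_bit(nr, (unsigned long *)(map))
#define atomic_bitmap_clear(nr, map)	clear_bit(nr, (unsigned long *)(map))
#define atomic_bitmap_test(nr, map)	test_bit(nr, (const unsigned long *)(map))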

-Andi

2007-06-24 20:19:20

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer


> Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the smallest
> possible type for each architecture. And a couple of ugly casts for set_bit et.al.
> but those could be also hidden in macros. Should be relatively easy to do.

or make a "smallbit" type that is small/supported, so 64 bit if 32 bit
isn't supported, otherwise 32


2007-06-24 23:01:52

by NeilBrown

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

On Sunday June 24, [email protected] wrote:
>
> +#define PG_blocks 20 /* Page has block mappings */
> +

I've only had a very quick look, but this line looks *very* wrong.
You should be using PG_private.

There should never be any confusion about whether ->private has
buffers or blocks attached as the only routines that ever look in
->private are address_space operations (or should be. I think 'NULL'
is sometimes special cased, as in try_to_release_page. It would be
good to do some preliminary work and tidy all that up).

Why do you think you need PG_blocks?

NeilBrown

2007-06-25 06:59:05

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

Chris Mason wrote:
> On Sun, Jun 24, 2007 at 05:47:55AM +0200, Nick Piggin wrote:

>>>My gut feeling is that there are several problem areas you haven't hit
>>>yet, with the new code.
>>
>>I would agree with your gut :)
>>
>
>
> Without having read the code yet (light reading for monday morning ;),
> ext3 and reiserfs use buffers heads for data=ordered to help them do
> deadlock free writeback. Basically they need to be able to write out
> the pending data=ordered pages, potentially with the transaction lock
> held (or if not held, while blocking new transactions from starting).
>
> But, writepage, prepare_write and commit_write all need to start a
> transaction with the page lock already held. So, if the page lock were
> used for data=ordered writeback, there would be a lock inversion between
> the transaction lock and the page lock.

Ah, thanks for that information.


> Using buffer heads instead allows the FS to send file data down inside
> the transaction code, without taking the page lock. So, locking wrt
> data=ordered is definitely going to be tricky.
>
> The best long term option may be making the locking order
> transaction -> page lock, and change writepage to punt to some other
> queue when it needs to start a transaction.

Yeah, that's what I would like, and I think it would come naturally
if we move away from these "pass down a single, locked page APIs"
in the VM, and let the filesystem do the locking and potentially
batching of larger ranges.

write_begin/write_end is a step in that direction (and it helps
OCFS and GFS quite a bit). I think there is also not much reason
for writepage call sites to have to lock the page and clear
the dirty bit themselves (which has always seemed ugly to me).
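
For context, the write_begin/write_end address_space operations being
referred to have roughly this shape (shown as proposed around this
time; the details were still being discussed):

int (*write_begin)(struct file *file, struct address_space *mapping,
		loff_t pos, unsigned len, unsigned flags,
		struct page **pagep, void **fsdata);
int (*write_end)(struct file *file, struct address_space *mapping,
		loff_t pos, unsigned len, unsigned copied,
		struct page *page, void *fsdata);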

So yes, I definitely want to move the aops API along with fsblock.

That I have tried to keep it within the existing API for the moment
is just because that makes things a bit easier...

--
SUSE Labs, Novell Inc.

2007-06-25 07:16:40

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

Andi Kleen wrote:
> Nick Piggin <[email protected]> writes:
>
>>- Structure packing. A page gets a number of buffer heads that are
>> allocated in a linked list. fsblocks are allocated contiguously, so
>> cacheline footprint is smaller in the above situation.
>
>
> It would be interesting to test if that makes a difference for
> database benchmarks running over file systems. Databases
> eat a lot of cache so in theory any cache improvements
> in the kernel which often runs cache cold then should be beneficial.
>
> But I guess it would need at least ext2 to test; Minix is probably not
> good enough.

Yeah, you are right. ext2 would be cool to port as it would be
a reasonable platform for basic performance testing and comparisons.


> In general have you benchmarked the CPU overhead of old vs new code?
> e.g. when we went to BIO scalability went up, but CPU costs
> of a single request also went up. It would be nice to not continue
> or better reverse that trend.

At the moment there are still a few silly things in the code, such
as always calling the insert_mapping indirect function (which is
the get_block equivalent), and it still does a bit more RMWing than
it should.

Also, it always goes to the pagecache radix-tree to find fsblocks,
whereas the buffer layer has a per-CPU cache front-end... so in
that regard, fsblock is really designed with lockless pagecache
in mind, where find_get_page is much faster even in the serial case
(though fsblock shouldn't exactly be slow with the current pagecache).
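
A minimal sketch of that lookup path, assuming a hypothetical
page_blocks() accessor for the fsblock array attached to a page (these
are not the actual helper names from the patch):

static struct fsblock *find_get_fsblock(struct address_space *mapping,
				sector_t blocknr, unsigned int blkbits)
{
	unsigned int blocks_per_page = 1 << (PAGE_CACHE_SHIFT - blkbits);
	pgoff_t index = blocknr >> (PAGE_CACHE_SHIFT - blkbits);
	struct page *page;

	/* one pagecache radix-tree lookup per block lookup (no per-CPU LRU) */
	page = find_get_page(mapping, index);
	if (!page)
		return NULL;
	if (!PagePrivate(page)) {
		page_cache_release(page);
		return NULL;
	}
	/* caller drops the page reference when it is done with the block */
	return page_blocks(page) + (blocknr & (blocks_per_page - 1));
}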

However, I don't think there are any fundamental performance
problems with fsblock. It even uses one less layer of locking to
do regular IO compared with buffer.c, so in theory it might even
have some advantage.

Single threaded performance of request submission is something I
will definitely try to keep optimal.


>>- Large block support. I can mount and run an 8K block size minix3 fs on
>> my 4K page system and it didn't require anything special in the fs. We
>> can go up to about 32MB blocks now, and gigabyte+ blocks would only
>> require one more bit in the fsblock flags. fsblock_superpage blocks
>> are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
>
>
> Can it be cleanly ifdefed or optimized away?

Yeah, it pretty well stays out of the way when using <= PAGE_CACHE_SIZE
size blocks, generally just a single test and branch of an already-used
cacheline. It can be optimised away completely by commenting out
#define BLOCK_SUPERPAGE_SUPPORT from fsblock.h.
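
Roughly along these lines (a sketch; fsblock_size() is an illustrative
accessor rather than the real one):

#ifdef BLOCK_SUPERPAGE_SUPPORT
#define fsblock_superpage(block)	(fsblock_size(block) > PAGE_CACHE_SIZE)
#else
#define fsblock_superpage(block)	0	/* superpage branches fold away */
#endif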


> Unless the fragmentation
> problem is not solved it would seem rather pointless to me. Also I personally
> still think the right way to approach this is larger softpage size.

It does not suffer from a fragmentation problem. It will do scatter
gather IO if the pagecache of that block is not contiguous. My naming
may be a little confusing: fsblock_superpage (a function that returns
true if the given fsblock is larger than PAGE_CACHE_SIZE) refers only
to the fsblock being larger than a page, and has no connection to VM
superpages.

Don't get me wrong, I think soft page size is a good idea for other
reasons as well (less page metadata and page operations), and that
8 or 16K would probably be a good sweet spot for today's x86 systems.

--
SUSE Labs, Novell Inc.

2007-06-25 07:19:59

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

Andi Kleen wrote:
> Nick Piggin <[email protected]> writes:
>
>
> [haven't read everything, just commenting on something that caught my eye]
>
>
>>+struct fsblock {
>>+ atomic_t count;
>>+ union {
>>+ struct {
>>+ unsigned long flags; /* XXX: flags could be int for better packing */
>
>
> int is not supported by many architectures, but works on x86 at least.

Yeah, that would be nice. We could actually use this for buffer_head as well,
but saving 4% there isn't as important as saving 20% for fsblock :)


> Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the smallest
> possible type for each architecture. And a couple of ugly casts for set_bit et.al.
> but those could be also hidden in macros. Should be relatively easy to do.

Cool. It would probably be useful for other things as well.

--
SUSE Labs, Novell Inc.

2007-06-25 07:42:30

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

Neil Brown wrote:
> On Sunday June 24, [email protected] wrote:
>
>>
>>+#define PG_blocks 20 /* Page has block mappings */
>>+
>
>
> I've only had a very quick look, but this line looks *very* wrong.
> You should be using PG_private.
>
> There should never be any confusion about whether ->private has
> buffers or blocks attached as the only routines that ever look in
> ->private are address_space operations (or should be. I think 'NULL'
> is sometimes special cased, as in try_to_release_page. It would be
> good to do some preliminary work and tidy all that up).

There is a lot of confusion, actually :)
But as you see in the patch, I added a couple more aops APIs, and
am working toward decoupling it as much as possible. It's pretty
close after the fsblock patch... however:


> Why do you think you need PG_blocks?

Block device pagecache (buffer cache) has to be able to accept
attachment of either buffers or blocks for filesystem metadata,
and call into either buffer.c or fsblock.c based on that.

If the page flag is really important, we can do some awful hack
like assuming the first long of the private data is flags, and
those flags will tell us whether the structure is a buffer_head
or fsblock ;) But for now it is just easier to use a page flag.

--
SUSE Labs, Novell Inc.

2007-06-25 08:58:39

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

On Sun, Jun 24, 2007 at 01:18:42PM -0700, Arjan van de Ven wrote:
>
> > Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the smallest
> > possible type for each architecture. And a couple of ugly casts for set_bit et.al.
> > but those could be also hidden in macros. Should be relatively easy to do.
>
> or make a "smallbit" type that is small/supported, so 64 bit if 32 bit
> isn't supported, otherwise 32

That wouldn't handle the case where you only need e.g. 8 bits.
That's fine for x86 too. It only hates atomic accesses crossing cache line
boundaries (but handles them too, just slowly).

-Andi

2007-06-25 12:28:30

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Mon, Jun 25, 2007 at 04:58:48PM +1000, Nick Piggin wrote:
>
> >Using buffer heads instead allows the FS to send file data down inside
> >the transaction code, without taking the page lock. So, locking wrt
> >data=ordered is definitely going to be tricky.
> >
> >The best long term option may be making the locking order
> >transaction -> page lock, and change writepage to punt to some other
> >queue when it needs to start a transaction.
>
> Yeah, that's what I would like, and I think it would come naturally
> if we move away from these "pass down a single, locked page APIs"
> in the VM, and let the filesystem do the locking and potentially
> batching of larger ranges.

Definitely.

>
> write_begin/write_end is a step in that direction (and it helps
> OCFS and GFS quite a bit). I think there is also not much reason
> for writepage sites to require the page to lock the page and clear
> the dirty bit themselves (which has seems ugly to me).

If we keep the page mapping information with the page all the time (ie
writepage doesn't have to call get_block ever), it may be possible to
avoid sending down a locked page. But, I don't know the delayed
allocation internals well enough to say for sure if that is true.

Either way, writepage is the easiest of the bunch because it can be
deferred.

-chris

2007-06-25 12:32:33

by Chris Mason

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

On Mon, Jun 25, 2007 at 05:41:58PM +1000, Nick Piggin wrote:
> Neil Brown wrote:
> >On Sunday June 24, [email protected] wrote:
> >
> >>
> >>+#define PG_blocks 20 /* Page has block mappings */
> >>+
> >
> >
> >I've only had a very quick look, but this line looks *very* wrong.
> >You should be using PG_private.
> >
> >There should never be any confusion about whether ->private has
> >buffers or blocks attached as the only routines that ever look in
> >->private are address_space operations (or should be. I think 'NULL'
> >is sometimes special cased, as in try_to_release_page. It would be
> >good to do some preliminary work and tidy all that up).
>
> There is a lot of confusion, actually :)
> But as you see in the patch, I added a couple more aops APIs, and
> am working toward decoupling it as much as possible. It's pretty
> close after the fsblock patch... however:
>
>
> >Why do you think you need PG_blocks?
>
> Block device pagecache (buffer cache) has to be able to accept
> attachment of either buffers or blocks for filesystem metadata,
> and call into either buffer.c or fsblock.c based on that.
>
> If the page flag is really important, we can do some awful hack
> like assuming the first long of the private data is flags, and
> those flags will tell us whether the structure is a buffer_head
> or fsblock ;) But for now it is just easier to use a page flag.

The block device pagecache isn't special, and certainly isn't that much
code. I would suggest keeping it buffer head specific and making a
second variant that does only fsblocks. This is mostly to keep the
semantics of PagePrivate sane, let's not fuzz the line.

-chris

2007-06-25 13:22:23

by Chris Mason

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

On Sun, Jun 24, 2007 at 03:46:13AM +0200, Nick Piggin wrote:
> Rewrite the buffer layer.

Overall, I like the basic concepts, but it is hard to track the locking
rules. Could you please write them up?

I like the way you split out the assoc_buffers from the main fsblock
code, but the list setup is still something of a wart. It also provides
poor ordering of blocks for writeback.

I think it makes sense to replace the assoc_buffers list head with a
radix tree sorted by block number. mark_buffer_dirty_inode would up the
reference count and put it into the radix, the various flushing routines
would walk the radix etc.

If you wanted to be able to drop the reference count once the block was
written you could have a back pointer to the appropriate inode.
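
A rough sketch of that arrangement (names are illustrative only, and
fsblock_get() is a hypothetical reference helper; note also that the
radix-tree index is an unsigned long, which does not cover a full
64-bit sector_t on 32-bit kernels):

struct inode_assoc_blocks {
	spinlock_t		lock;
	struct radix_tree_root	dirty;	/* keyed by block number */
};

static int assoc_block_dirty(struct inode_assoc_blocks *assoc,
				sector_t blocknr, struct fsblock *block)
{
	int err;

	err = radix_tree_preload(GFP_NOFS);
	if (err)
		return err;
	spin_lock(&assoc->lock);
	err = radix_tree_insert(&assoc->dirty,
				(unsigned long)blocknr, block);
	if (!err)
		fsblock_get(block);	/* reference dropped after writeback */
	spin_unlock(&assoc->lock);
	radix_tree_preload_end();
	/* flushers walk the tree in ascending block order with
	 * radix_tree_gang_lookup(), giving well-ordered writeback */
	return err;
}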

-chris

2007-06-26 02:34:41

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

Chris Mason wrote:
> On Mon, Jun 25, 2007 at 05:41:58PM +1000, Nick Piggin wrote:
>
>>Neil Brown wrote:

>>>Why do you think you need PG_blocks?
>>
>>Block device pagecache (buffer cache) has to be able to accept
>>attachment of either buffers or blocks for filesystem metadata,
>>and call into either buffer.c or fsblock.c based on that.
>>
>>If the page flag is really important, we can do some awful hack
>>like assuming the first long of the private data is flags, and
>>those flags will tell us whether the structure is a buffer_head
>>or fsblock ;) But for now it is just easier to use a page flag.
>
>
> The block device pagecache isn't special, and certainly isn't that much
> code. I would suggest keeping it buffer head specific and making a
> second variant that does only fsblocks. This is mostly to keep the
> semantics of PagePrivate sane, lets not fuzz the line.

That would require a new inode and address_space for the fsblock
type blockdev pagecache, wouldn't it? I just can't think of a
better non-intrusive way of allowing a buffer_head filesystem and
an fsblock filesystem to live on the same blkdev together.

--
SUSE Labs, Novell Inc.

2007-06-26 02:42:42

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

Chris Mason wrote:
> On Sun, Jun 24, 2007 at 03:46:13AM +0200, Nick Piggin wrote:
>
>>Rewrite the buffer layer.
>
>
> Overall, I like the basic concepts, but it is hard to track the locking
> rules. Could you please write them up?

Yeah I will do that.

Thanks for taking a look. One thing I am thinking about is to get
rid of the unmap_underlying_metadata calls from the generic code.
I found they were required for minix to prevent corruption, however
I don't know exactly what metadata is interfering here (maybe it
is indirect blocks or something?). Anyway, I think I will make it
a requirement that the filesystem has to already handle this before
returning a newly allocated block -- they can probably do it more
efficiently and we avoid the extra work on every block allocation.


> I like the way you split out the assoc_buffers from the main fsblock
> code, but the list setup is still something of a wart. It also provides
> poor ordering of blocks for writeback.

Yeah, I didn't know how much effort to put in here because I don't
know whether modern filesystems are going to need to implement their
own management of this stuff or not.

I haven't actually instrumented this in something like ext2 to see
how much IO comes from the assoc buffers...


> I think it makes sense to replace the assoc_buffers list head with a
> radix tree sorted by block number. mark_buffer_dirty_inode would up the
> reference count and put it into the radix, the various flushing routines
> would walk the radix etc.
>
> If you wanted to be able to drop the reference count once the block was
> written you could have a back pointer to the appropriate inode.

I was actually thinking about a radix-tree :) One annoyance is that
unsigned long != sector_t :P An rbtree would probably be OK.

--
SUSE Labs, Novell Inc.

2007-06-26 02:48:47

by NeilBrown

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

On Tuesday June 26, [email protected] wrote:
> Chris Mason wrote:
> >
> > The block device pagecache isn't special, and certainly isn't that much
> > code. I would suggest keeping it buffer head specific and making a
> > second variant that does only fsblocks. This is mostly to keep the
> > semantics of PagePrivate sane, lets not fuzz the line.
>
> That would require a new inode and address_space for the fsblock
> type blockdev pagecache, wouldn't it? I just can't think of a
> better non-intrusive way of allowing a buffer_head filesystem and
> an fsblock filesystem to live on the same blkdev together.

I don't think they would ever try to. Both filesystems would bd_claim
the blkdev, and only one would win.

The issue is more of a filesystem sharing a blockdev with the
block-special device (i.e. open("/dev/sda1"), read) isn't it?

If a filesystem wants to attach information to the blockdev pagecache
that is different from what the blockdev wants to attach, then I think "Yes"
- a new inode and address space is what it needs to create.

Then you get into consistency issues between the metadata and direct
blockdevice access. Do we care about those?


NeilBrown

2007-06-26 03:07:05

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
>
> I'm announcing "fsblock" now because it is quite intrusive and so I'd
> like to get some thoughts about significantly changing this core part
> of the kernel.

Can you rename it to something other than shorthand for
"filesystem block"? e.g. When you say:

> - In line with the above item, filesystem block allocation is performed

What are we actually talking about here? Filesystem block allocation
is something a filesystem does to allocate blocks on disk, not
allocate a mapping structure in memory.

Realistically, this is not about "filesystem blocks", this is
about file offset to disk blocks. i.e. it's a mapping.

> Probably better would be to
> move towards offset,length rather than page based fs APIs where everything
> can be batched up nicely and this sort of non-trivial locking can be more
> optimal.

If we are going to turn over the API completely like this, can
we seriously look at moving to this sort of interface at the same
time?

With an offset/len interface, we can start to track contiguous
ranges of blocks rather than persisting with a structure per
filesystem block. If you want to save memory, that's where
we need to go.

XFS uses "iomaps" for this purpose - it's basically:

- start offset into file
- start block on disk
- length of mapping
- state

With special "disk blocks" for indicating delayed allocation
blocks (-1) and unwritten extents (-2). Worst case we end up
with is an iomap per filesystem block.
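
In C terms the structure being described is roughly this (a sketch
only; the actual XFS definition differs in naming and detail):

#define IOMAP_DELAY	((sector_t)-1)	/* delayed allocation, no disk block yet */
#define IOMAP_UNWRITTEN	((sector_t)-2)	/* allocated but unwritten extent */

struct iomap_sketch {
	loff_t		offset;		/* start offset into file */
	sector_t	blkno;		/* start block on disk, or a marker above */
	size_t		length;		/* length of the mapping */
	unsigned int	state;		/* mapping state flags */
};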

If we allow iomaps to be split and combined along with range
locking, we can parallelise read and write access to each
file on an iomap basis, etc. There's plenty of goodness that
comes from indexing by range....

FWIW, I really see little point in making all the filesystems
work with fsblocks if the plan is to change the API again in
a major way a year down the track. Let's get all the changes
we think are necessary in one basket first, and then work out
a coherent plan to implement them ;)

> - Large block support. I can mount and run an 8K block size minix3 fs on
> my 4K page system and it didn't require anything special in the fs. We
> can go up to about 32MB blocks now, and gigabyte+ blocks would only
> require one more bit in the fsblock flags. fsblock_superpage blocks
> are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

My 2c worth - this is a damn complex way of introducing large block
size support. It has all the problems I pointed out that it would
have (locking issues, vmap overhead, every filesystem needs
major changes and it's not very efficient) and it's going to take
quite some time to stabilise.

If this is the only real feature that fsblocks are going to give us,
then I think this is a waste of time. If we are going to replace
buffer heads, lets do it with something that is completely
independent of filesystem block size and not introduce something
that is just a bufferhead on steroids.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-26 03:07:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

Neil Brown wrote:
> On Tuesday June 26, [email protected] wrote:
>
>>Chris Mason wrote:
>>
>>>The block device pagecache isn't special, and certainly isn't that much
>>>code. I would suggest keeping it buffer head specific and making a
>>>second variant that does only fsblocks. This is mostly to keep the
>>>semantics of PagePrivate sane, lets not fuzz the line.
>>
>>That would require a new inode and address_space for the fsblock
>>type blockdev pagecache, wouldn't it? I just can't think of a
>>better non-intrusive way of allowing a buffer_head filesystem and
>>an fsblock filesystem to live on the same blkdev together.
>
>
> I don't think they would ever try to. Both filesystems would bd_claim
> the blkdev, and only one would win.

Hmm OK, I might have confused myself thinking about partitions...

> The issue is more of a filesystem sharing a blockdev with the
> block-special device (i.e. open("/dev/sda1"), read) isn't it?
>
> If a filesystem wants to attach information to the blockdev pagecache
> that is different to what blockdev want to attach, then I think "Yes"
> - a new inode and address space is what it needs to create.
>
> Then you get into consistency issues between the metadata and direct
> blockdevice access. Do we care about those?

Yeah that issue is definitely a real one. The problem is not just
consistency, but "how do the block device aops even know that the
PG_private page they have has buffer heads or fsblocks", so it is
an oopsable condition rather than just a plain consistency issue
(consistency is already not guaranteed).

--
SUSE Labs, Novell Inc.

2007-06-26 03:55:29

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

David Chinner wrote:
> On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
>
>>I'm announcing "fsblock" now because it is quite intrusive and so I'd
>>like to get some thoughts about significantly changing this core part
>>of the kernel.
>
>
> Can you rename it to something other than shorthand for
> "filesystem block"? e.g. When you say:
>
>
>>- In line with the above item, filesystem block allocation is performed
>
>
> What are we actually talking aout here? filesystem block allocation
> is something a filesystem does to allocate blocks on disk, not
> allocate a mapping structure in memory.
>
> Realistically, this is not about "filesystem blocks", this is
> about file offset to disk blocks. i.e. it's a mapping.

Yeah, fsblock ~= the layer between the fs and the block layers.
But don't take the name too literally, like a struct page isn't
actually a page of memory ;)


>> Probably better would be to
>> move towards offset,length rather than page based fs APIs where everything
>> can be batched up nicely and this sort of non-trivial locking can be more
>> optimal.
>
>
> If we are going to turn over the API completely like this, can
> we seriously look at moving to this sort of interface at the same
> time?

Yeah we can move to anything. But note that fsblock is perfectly
happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_
at >.


> With a offset/len interface, we can start to track contiguous
> ranges of blocks rather than persisting with a structure per
> filesystem block. If you want to save memory, thet's where
> we need to go.
>
> XFS uses "iomaps" for this purpose - it's basically:
>
> - start offset into file
> - start block on disk
> - length of mapping
> - state
>
> With special "disk blocks" for indicating delayed allocation
> blocks (-1) and unwritten extents (-2). Worst case we end up
> with is an iomap per filesystem block.

I was thinking about doing an extent based scheme, but it has
some issues as well. Block based is lightweight and simple, and it
aligns nicely with the pagecache structures.


> If we allow iomaps to be split and combined along with range
> locking, we can parallelise read and write access to each
> file on an iomap basis, etc. There's plenty of goodness that
> comes from indexing by range....

Some operations AFAIKS will always need to be per-page (eg. in
the core VM it wants to lock a single page to fault it in, or
wait for a single page to writeout etc). So I didn't see a huge
gain in a one-lock-per-extent type arrangement.

If you're worried about parallelisability, then I don't see what
iomaps give you that buffer heads or fsblocks do not? In fact
they would be worse because there are fewer of them? :)

But remember that once the filesystems have accessor APIs and
can handle multiple pages per fsblock, that would already be
most of the work done for the fs and the mm to go to an extent
based representation.


> FWIW, I really see little point in making all the filesystems
> work with fsblocks if the plan is to change the API again in
> a major way a year down the track. Let's get all the changes
> we think are necessary in one basket first, and then work out
> a coherent plan to implement them ;)

The aops API changes and the fsblock layer are kind of two
separate things. I'm slowly implementing things as I go (eg.
see perform_write aop, which is exactly the offset,length
based API that I'm talking about).

fsblocks can be implemented on the old or the new APIs. New
APIs won't invalidate work to convert a filesystem to fsblocks.


>>- Large block support. I can mount and run an 8K block size minix3 fs on
>> my 4K page system and it didn't require anything special in the fs. We
>> can go up to about 32MB blocks now, and gigabyte+ blocks would only
>> require one more bit in the fsblock flags. fsblock_superpage blocks
>> are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
>
>
> My 2c worth - this is a damn complex way of introducing large block
> size support. It has all the problems I pointed out that it would
> have (locking issues, vmap overhead, every filesystem needs needs
> major changes and it's not very efficient) and it's going to take
> quite some time to stabilise.

What locking issues? It locks pages in pagecache offset ascending
order, which already has precedent and is really the only sane way
to do it so it's not like it precludes other possible sane lock
orderings.

vmap overhead is an issue, however I did it mainly for ease of
conversion. I guess things like superblocks and such would make
use of it happily. Most other things should be able to be
implemented with page based helpers (just a couple of bitops
helpers would pretty much cover minix). If it is still a problem,
then I can implement a proper vmap cache.

But the major changes in the filesystem are not for vmaps, but for
page accessors. As I said, this allows blkdev to move out of
lowmem and also closes CPU cache coherency problems (as well as
not having to carry around a vmem pointer, of course).


> If this is the only real feature that fsblocks are going to give us,
> then I think this is a waste of time. If we are going to replace
> buffer heads, lets do it with something that is completely
> independent of filesystem block size and not introduce something
> that is just a bufferhead on steroids.

Well if you ignore all my other points, then yes it is the only thing
that fsblocks gives us :) But it would be very easy to overengineer
this. I don't really see a good case for extents here because we have
to manage these discrete pages anyway. The large block support in
fsblock is probably 500 lines when you take out the debugging stuff.

And I don't see how you think an extent representation would solve
the page locking, complexity, intrusiveness, or vmap problems at all?
Solve them for one and it should be a good solution for the other,
right?

Anyway, let's suppose that we move to a virtually mapped kernel
with defragmentation support and did higher order pagecache for
large block support and decided the remaining advantages of fsblock
were not worth keeping around support for that. Well that could be
taken out and fsblock still wouldn't be a bad thing to have IMO.
fsblock is actually supposed to be a simplified and slimmed down
buffer_head, rather than a steroid-filled one.

--
SUSE Labs, Novell Inc.

2007-06-26 09:23:38

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> David Chinner wrote:
> >On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> >>I'm announcing "fsblock" now because it is quite intrusive and so I'd
> >>like to get some thoughts about significantly changing this core part
> >>of the kernel.
> >
> >Can you rename it to something other than shorthand for
> >"filesystem block"? e.g. When you say:
> >
> >>- In line with the above item, filesystem block allocation is performed
> >
> >What are we actually talking aout here? filesystem block allocation
> >is something a filesystem does to allocate blocks on disk, not
> >allocate a mapping structure in memory.
> >
> >Realistically, this is not about "filesystem blocks", this is
> >about file offset to disk blocks. i.e. it's a mapping.
>
> Yeah, fsblock ~= the layer between the fs and the block layers.

Sure, but it's not a "filesystem block" which is what you are
calling it. IMO, it's overloading a well known term with something
different, and that's just confusing.

Can we call it a block mapping layer or something like that?
e.g. struct blkmap?

> >> Probably better would be to
> >> move towards offset,length rather than page based fs APIs where
> >> everything
> >> can be batched up nicely and this sort of non-trivial locking can be more
> >> optimal.
> >
> >If we are going to turn over the API completely like this, can
> >we seriously look at moving to this sort of interface at the same
> >time?
>
> Yeah we can move to anything. But note that fsblock is perfectly
> happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_
> at >.

Extent based block mapping is entirely independent of block size.
Please don't confuse the two....

> >With a offset/len interface, we can start to track contiguous
> >ranges of blocks rather than persisting with a structure per
> >filesystem block. If you want to save memory, thet's where
> >we need to go.
> >
> >XFS uses "iomaps" for this purpose - it's basically:
> >
> > - start offset into file
> > - start block on disk
> > - length of mapping
> > - state
> >
> >With special "disk blocks" for indicating delayed allocation
> >blocks (-1) and unwritten extents (-2). Worst case we end up
> >with is an iomap per filesystem block.
>
> I was thinking about doing an extent based scheme, but it has
> some issues as well. Block based is light weight and simple, it
> aligns nicely with the pagecache structures.

Yes. Block based is simple, but has flexibility and scalability
problems, e.g. the number of fsblocks that are required to map large
files. It's not uncommon for us to have millions of bufferheads
lying around after writing a single large file that only has a
handful of extents. That's 5-6 orders of magnitude difference there
in memory usage and as memory and disk sizes get larger, this will
become more of a problem....

> >If we allow iomaps to be split and combined along with range
> >locking, we can parallelise read and write access to each
> >file on an iomap basis, etc. There's plenty of goodness that
> >comes from indexing by range....
>
> Some operations AFAIKS will always need to be per-page (eg. in
> the core VM it wants to lock a single page to fault it in, or
> wait for a single page to writeout etc). So I didn't see a huge
> gain in a one-lock-per-extent type arrangement.

For VM operations, no, but they would continue to be locked on a
per-page basis. However, we can do filesystem block operations
without needing to hold page locks. e.g. space reservation and
allocation......

> If you're worried about parallelisability, then I don't see what
> iomaps give you that buffer heads or fsblocks do not? In fact
> they would be worse because there are fewer of them? :)

No, that's wrong. I'm not talking about VM parallelisation,
I want to be able to support multiple writers to a single file.
i.e. removing the i_mutex restriction on writes. To do that
you've got to have a range locking scheme integrated into
the block map for the file so that concurrent lookups and
allocations don't trip over each other.

iomaps can double as range locks simply because iomaps are
expressions of ranges within the file. Seeing as you can only
access a given range exclusively to modify it, inserting an empty
mapping into the tree as a range lock gives an effective method of
allowing safe parallel reads, writes and allocation into the file.

The fsblocks and the vm page cache interface cannot be used to
facilitate this because a radix tree is the wrong type of tree to
store this information in. A sparse, range based tree (e.g. btree)
is the right way to do this and it matches very well with
a range based API.

None of what I'm talking about requires any changes to the existing
page cache or VM address space. I'm proposing that we should
treat the block mapping as an address space in its own right, i.e.
perhaps the struct page should not have block mapping objects
attached to it at all.

By separating out the block mapping from the page cache, we make the
page cache completely independent of filesystem block size, and it
can just operate wholly on pages. We can implement a generic extent
mapping tree instead of every filesystem having to (re)implement
their own. And if the filesystem does its job of preventing
fragmentation, the amount of memory consumed by the tree will
be orders of magnitude lower than any fsblock based indexing.

I also like what this implies for keeping track of sub-block dirty
ranges, i.e. no need for RMW cycles if we are doing sector sized
and aligned I/O - we can keep track of sub-block dirty state in the
block mapping tree easily *and* we know exactly what sector on disk
it maps to. That means we don't care about filesystem block size
as it no longer has any influence on RMW boundaries.

None of this is possible with fsblocks, so I really think that
fsblocks are not the step forward we need. They are just bufferheads
under another name and hence have all the same restrictions that
bufferheads imply. We should be looking to eliminate bufferheads
entirely rather than perpetuating them as fsblocks.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-26 11:14:29

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> > >
> > >Realistically, this is not about "filesystem blocks", this is
> > >about file offset to disk blocks. i.e. it's a mapping.
> >
> > Yeah, fsblock ~= the layer between the fs and the block layers.
>
> Sure, but it's not a "filesystem block" which is what you are
> calling it. IMO, it's overloading a well known term with something
> different, and that's just confusing.

Well it is the metadata used to manage the filesystem block for the
given bit of pagecache (even if the block is not actually allocated
or even a hole, it is deemed to be so by the filesystem).

> Can we call it a block mapping layer or something like that?
> e.g. struct blkmap?

I'm not fixed on fsblock, but blkmap doesn't grab me either. It
is a map from the pagecache to the block layer, but blkmap sounds
like it is a map from the block to somewhere.

fsblkmap ;)


> > >> Probably better would be to
> > >> move towards offset,length rather than page based fs APIs where
> > >> everything
> > >> can be batched up nicely and this sort of non-trivial locking can be more
> > >> optimal.
> > >
> > >If we are going to turn over the API completely like this, can
> > >we seriously look at moving to this sort of interface at the same
> > >time?
> >
> > Yeah we can move to anything. But note that fsblock is perfectly
> > happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_
> > at >.
>
> Extent based block mapping is entirely independent of block size.
> Please don't confuse the two....

I'm not, but it seemed like you thought that fsblock is tied
to changing the aops APIs. It is not, but they can be changed to
give improvements in a good number of areas (*including* better
large block support).


> > >With special "disk blocks" for indicating delayed allocation
> > >blocks (-1) and unwritten extents (-2). Worst case we end up
> > >with is an iomap per filesystem block.
> >
> > I was thinking about doing an extent based scheme, but it has
> > some issues as well. Block based is light weight and simple, it
> > aligns nicely with the pagecache structures.
>
> Yes. Block based is simple, but has flexibility and scalability
> problems. e.g the number of fsblocks that are required to map large
> files. It's not uncommon for use to have millions of bufferheads
> lying around after writing a single large file that only has a
> handful of extents. That's 5-6 orders of magnitude difference there
> in memory usage and as memory and disk sizes get larger, this will
> become more of a problem....

I guess fsblock is 3 times smaller, and you would probably have 16
times fewer of them for such a filesystem (given a 4K page size),
which still leaves a few orders of magnitude ;)

However, fsblock has this nice feature where it can drop the blocks
when the last reference goes away, so you really only have fsblocks
around for dirty or currently-being-read blocks...

But you've given me a good idea: I'll gear the filesystem-side APIs to
be more extent based as well (eg. fsblock's get_block equivalent).
That way it should be much easier to change over to such extents in
future or even have an extent based representation sitting in front
of the fsblock one and acting as a high density cache in your above
situation.


> > >If we allow iomaps to be split and combined along with range
> > >locking, we can parallelise read and write access to each
> > >file on an iomap basis, etc. There's plenty of goodness that
> > >comes from indexing by range....
> >
> > Some operations AFAIKS will always need to be per-page (eg. in
> > the core VM it wants to lock a single page to fault it in, or
> > wait for a single page to writeout etc). So I didn't see a huge
> > gain in a one-lock-per-extent type arrangement.
>
> For VM operations, no, but they would continue to be locked on a
> per-page basis. However, we can do filesystem block operations
> without needing to hold page locks. e.g. space reservation and
> allocation......

You could do that without holding the page locks as well AFAIKS.
Actually again it might be a bit troublesome with the current
aops APIs, but I don't think fsblock stands in your way there
either.

> > If you're worried about parallelisability, then I don't see what
> > iomaps give you that buffer heads or fsblocks do not? In fact
> > they would be worse because there are fewer of them? :)
>
> No, that's wrong. I'm not talking about VM parallelisation,
> I want to be able to support multiple writers to a single file.
> i.e. removing the i_mutex restriction on writes. To do that
> you've got to have a range locking scheme integrated into
> the block map for the file so that concurrent lookups and
> allocations don't trip over each other.

> iomaps can double as range locks simply because iomaps are
> expressions of ranges within the file. Seeing as you can only
> access a given range exclusively to modify it, inserting an empty
> mapping into the tree as a range lock gives an effective method of
> allowing safe parallel reads, writes and allocation into the file.
>
> The fsblocks and the vm page cache interface cannot be used to
> facilitate this because a radix tree is the wrong type of tree to
> store this information in. A sparse, range based tree (e.g. btree)
> is the right way to do this and it matches very well with
> a range based API.
>
> None of what I'm talking about requires any changes to the existing
> page cache or VM address space. I'm proposing that we should be
> treat the block mapping as an address space in it's own right. i.e.
> perhaps the struct page should not have block mapping objects
> attached to it at all.
>
> By separating out the block mapping from the page cache, we make the
> page cache completely independent of filesystem block size, and it
> can just operate wholly on pages. We can implement a generic extent
> mapping tree instead of every filesystem having to (re)implement
> their own. And if the filesystem does it's job of preventing
> fragmentation, the amount of memory consumed by the tree will
> be orders of magnitude lower than any fsblock based indexing.

The independent mapping tree is something I have been thinking
about, but you still need to tie the page to the block at some
point and you need to track IO details and such.

The problem with implementing it in generic code is that it
will add another layer of locking and data structure that may
be better done in the filesystem (because you _do_ already
need to do all the per-page stuff as well). This was my thing
about overengineering: fsblock is supposed to be just a very
light layer.


> I also like what this implies for keeping track of sub-block dirty
> ranges. i.e. no need for RMW cycles for if we are doing sector sized
> and aligned I/O - we can keep track of sub-block dirty state in the
> block mapping tree easily *and* we know exactly what sector on disk
> it maps to. That means we don't care about filesystem block size
> as it no longer has any influence on RMW boundaries.
>
> None of this is possible with fsblocks, so I really think that
> fsblocks are not the step forward we need. They are just bufferheads
> under another name and hence have all the same restrictions that
> bufferheads imply. We should be looking to eliminate bufferheads
> entirely rather than perpetuating them as fsblocks.....

I don't know why you think none of that is possible with fsblocks.
You could easily keep an in-memory btree or similar as the
authoritative block management structure and feed the fsblock
layer from that.

There is nothing about fsblock that is tied to i_mutex, and all
its locking basically comes for free on top of the page based
locking that's already required in the VM.

2007-06-26 12:30:00

by Chris Mason

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

On Tue, Jun 26, 2007 at 01:07:43PM +1000, Nick Piggin wrote:
> Neil Brown wrote:
> >On Tuesday June 26, [email protected] wrote:
> >
> >>Chris Mason wrote:
> >>
> >>>The block device pagecache isn't special, and certainly isn't that much
> >>>code. I would suggest keeping it buffer head specific and making a
> >>>second variant that does only fsblocks. This is mostly to keep the
> >>>semantics of PagePrivate sane, lets not fuzz the line.
> >>
> >>That would require a new inode and address_space for the fsblock
> >>type blockdev pagecache, wouldn't it? I just can't think of a
> >>better non-intrusive way of allowing a buffer_head filesystem and
> >>an fsblock filesystem to live on the same blkdev together.
> >
> >
> >I don't think they would ever try to. Both filesystems would bd_claim
> >the blkdev, and only one would win.
>
> Hmm OK, I might have confused myself thinking about partitions...
>
> >The issue is more of a filesystem sharing a blockdev with the
> >block-special device (i.e. open("/dev/sda1"), read) isn't it?
> >
> >If a filesystem wants to attach information to the blockdev pagecache
> >that is different to what blockdev want to attach, then I think "Yes"
> >- a new inode and address space is what it needs to create.
> >
> >Then you get into consistency issues between the metadata and direct
> >blockdevice access. Do we care about those?
>
> Yeah that issue is definitely a real one. The problem is not just
> consistency, but "how do the block device aops even know that the
> PG_private page they have has buffer heads or fsblocks", so it is
> an oopsable condition rather than just a plain consistency issue
> (consistency is already not guaranteed).

Since we're testing new code, I would just leave the blkdev address
space alone. If a filesystem wants to use fsblocks, they allocate a new
inode during mount, stuff it into their private super block (or in the
generic super), and use that for everything. Basically ignoring the
block device address space completely.

It means there will be some inconsistency between what you get when
reading the block device file and the filesystem metadata, but we've got
that already (ext2 dir in page cache).

-chris

2007-06-26 12:37:58

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

[ ... fsblocks vs extent range mapping ]

> iomaps can double as range locks simply because iomaps are
> expressions of ranges within the file. Seeing as you can only
> access a given range exclusively to modify it, inserting an empty
> mapping into the tree as a range lock gives an effective method of
> allowing safe parallel reads, writes and allocation into the file.
>
> The fsblocks and the vm page cache interface cannot be used to
> facilitate this because a radix tree is the wrong type of tree to
> store this information in. A sparse, range based tree (e.g. btree)
> is the right way to do this and it matches very well with
> a range based API.

I'm really not against the extent based page cache idea, but I kind of
assumed it would be too big a change for this kind of generic setup. At
any rate, if we'd like to do it, it may be best to ditch the idea of
"attach mapping information to a page", and switch to "lookup mapping
information and range locking for a page".

A btree could be used to hold the range mapping and locking, but it
could just as easily be a radix tree where you do a gang lookup for the
end of the range (the same way my placeholder patch did). It'll still
find intersecting range locks but is much faster for random
insertion/deletion than a btree.

-chris

2007-06-27 05:32:59

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
>
> [ ... fsblocks vs extent range mapping ]
>
> > iomaps can double as range locks simply because iomaps are
> > expressions of ranges within the file. Seeing as you can only
> > access a given range exclusively to modify it, inserting an empty
> > mapping into the tree as a range lock gives an effective method of
> > allowing safe parallel reads, writes and allocation into the file.
> >
> > The fsblocks and the vm page cache interface cannot be used to
> > facilitate this because a radix tree is the wrong type of tree to
> > store this information in. A sparse, range based tree (e.g. btree)
> > is the right way to do this and it matches very well with
> > a range based API.
>
> I'm really not against the extent based page cache idea, but I kind of
> assumed it would be too big a change for this kind of generic setup. At
> any rate, if we'd like to do it, it may be best to ditch the idea of
> "attach mapping information to a page", and switch to "lookup mapping
> information and range locking for a page".

Well the get_block equivalent API is an extent based one now, and I'll
look at what is required in making map_fsblock a more generic call
that could be used for an extent-based scheme.
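
Something along these lines, purely as a sketch (map_fsblock's real
signature in the patch may look quite different):

struct block_extent {
	sector_t	blocknr;	/* first disk block, or a hole/delalloc marker */
	unsigned int	nr_blocks;	/* number of contiguous blocks mapped */
	unsigned int	flags;		/* new / unwritten / delayed, etc. */
};

/* map up to max_blocks blocks of the file starting at 'block' */
int (*map_blocks)(struct inode *inode, sector_t block,
		unsigned int max_blocks, struct block_extent *ext,
		int create);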

An extent based thing IMO really isn't appropriate as the main generic
layer here though. If it is really useful and popular, then it could
be turned into generic code and sit along side fsblock or underneath
fsblock...

It definitely isn't trivial to drive the IO directly from something
like that which doesn't correspond to filesystem block size. Splitting
parts of your extent tree when things go dirty or uptodate or partially
under IO, etc., and joining things back up again when they are mergeable.
Not that it would be impossible, but it would be a lot more heavyweight
than fsblock.

I think using fsblock to drive the IO and keep the pagecache flags
uptodate and using a btree in the filesystem to manage extents of block
allocations wouldn't be a bad idea though. Do any filesystems actually
do this?

2007-06-27 06:05:57

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
> I think using fsblock to drive the IO and keep the pagecache flags
> uptodate and using a btree in the filesystem to manage extents of block
> allocations wouldn't be a bad idea though. Do any filesystems actually
> do this?

Yes. XFS. But we still need to hold state in buffer heads (BH_delay,
BH_unwritten) that is needed to determine what type of
allocation/extent conversion is necessary during writeback. i.e.
what we originally mapped the page as during the ->prepare_write
call.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-27 11:54:17

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
> On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> > On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> >
> > [ ... fsblocks vs extent range mapping ]
> >
> > > iomaps can double as range locks simply because iomaps are
> > > expressions of ranges within the file. Seeing as you can only
> > > access a given range exclusively to modify it, inserting an empty
> > > mapping into the tree as a range lock gives an effective method of
> > > allowing safe parallel reads, writes and allocation into the file.
> > >
> > > The fsblocks and the vm page cache interface cannot be used to
> > > facilitate this because a radix tree is the wrong type of tree to
> > > store this information in. A sparse, range based tree (e.g. btree)
> > > is the right way to do this and it matches very well with
> > > a range based API.
> >
> > I'm really not against the extent based page cache idea, but I kind of
> > assumed it would be too big a change for this kind of generic setup. At
> > any rate, if we'd like to do it, it may be best to ditch the idea of
> > "attach mapping information to a page", and switch to "lookup mapping
> > information and range locking for a page".
>
> Well the get_block equivalent API is extent based one now, and I'll
> look at what is required in making map_fsblock a more generic call
> that could be used for an extent-based scheme.
>
> An extent based thing IMO really isn't appropriate as the main generic
> layer here though. If it is really useful and popular, then it could
> be turned into generic code and sit along side fsblock or underneath
> fsblock...

Let's look at a typical example of how IO actually gets done today,
starting with sys_write():

sys_write(file, buffer, 1MB)
	for each page:
		prepare_write()
			allocate contiguous chunks of disk
			attach buffers
		copy_from_user()
		commit_write()
			dirty buffers

pdflush:
	writepages()
		find pages with contiguous chunks of disk
		build and submit large bios

So, we replace prepare_write and commit_write with an extent based api,
but we keep the dirty-each-buffer part. writepages has to turn that
back into extents (bio sized), and the result is completely full of
dark, dark corner cases.

I do think fsblocks is a nice cleanup on its own, but Dave has a good
point that it makes sense to look for ways to generalize things even more.

-chris

2007-06-27 12:39:28

by Kyle Moffett

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Jun 26, 2007, at 07:14:14, Nick Piggin wrote:
> On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
>> Can we call it a block mapping layer or something like that? e.g.
>> struct blkmap?
>
> I'm not fixed on fsblock, but blkmap doesn't grab me either. It is
> a map from the pagecache to the block layer, but blkmap sounds like
> it is a map from the block to somewhere.
>
> fsblkmap ;)

vmblock? pgblock?

Cheers,
Kyle Moffett

2007-06-27 15:18:59

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [RFC] fsblock

On 27 Jun 2007, at 12:50, Chris Mason wrote:
> On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
>> On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
>>> On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
>>>> On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
>>>
>>> [ ... fsblocks vs extent range mapping ]
>>>
>>>> iomaps can double as range locks simply because iomaps are
>>>> expressions of ranges within the file. Seeing as you can only
>>>> access a given range exclusively to modify it, inserting an empty
>>>> mapping into the tree as a range lock gives an effective method of
>>>> allowing safe parallel reads, writes and allocation into the file.
>>>>
>>>> The fsblocks and the vm page cache interface cannot be used to
>>>> facilitate this because a radix tree is the wrong type of tree to
>>>> store this information in. A sparse, range based tree (e.g. btree)
>>>> is the right way to do this and it matches very well with
>>>> a range based API.
>>>
>>> I'm really not against the extent based page cache idea, but I
>>> kind of
>>> assumed it would be too big a change for this kind of generic
>>> setup. At
>>> any rate, if we'd like to do it, it may be best to ditch the idea of
>>> "attach mapping information to a page", and switch to "lookup
>>> mapping
>>> information and range locking for a page".
>>
>> Well the get_block equivalent API is extent based one now, and I'll
>> look at what is required in making map_fsblock a more generic call
>> that could be used for an extent-based scheme.
>>
>> An extent based thing IMO really isn't appropriate as the main
>> generic
>> layer here though. If it is really useful and popular, then it could
>> be turned into generic code and sit along side fsblock or underneath
>> fsblock...
>
> Lets look at a typical example of how IO actually gets done today,
> starting with sys_write():

Yes, this is very inefficient, which is one of the reasons I don't use
the generic file write helpers in NTFS.

Another reason is that supporting logical block sizes larger than
PAGE_CACHE_SIZE becomes a pain if it is not done this way when the
write targets a hole: that requires all pages in the hole to be locked
simultaneously, which would mean dropping the page lock to acquire the
others that are of lower page index and then re-taking the page lock,
which is horrible - much better to lock them all at once from the
outset.

The other reason is that in NTFS there is such a thing as the
initialized size of an attribute, which basically states "anything
past this byte offset must be returned as 0 on read, i.e. it does not
have to be read from disk at all". On a write beyond the
initialized_size you have to zero on disk everything between the old
initialized size and the start of the write before you begin writing,
and certainly before you update the initialized_size, otherwise a
concurrent read would see random old data from the disk.

For NTFS this effectively becomes:

> sys_write(file, buffer, 1MB)

allocate space for the entire 1MB write

if write offset past the initialized_size:
	zero out on disk starting at initialized_size up to the start
	offset for the write, and update the initialized size to be
	equal to the start offset of the write

do {
	if (current position is in a hole and the NTFS logical block
	    size is > PAGE_CACHE_SIZE) {
		/* work on (NTFS logical block size / PAGE_CACHE_SIZE)
		 * pages in one go */
		do_pages = vol->cluster_size / PAGE_CACHE_SIZE;
	} else {
		/* work on only one page */
		do_pages = 1;
	}
	fault in for read (do_pages * PAGE_CACHE_SIZE) bytes worth of
	source pages
	grab do_pages worth of pages
	prepare_write - attach buffers to the grabbed pages
	copy data from source to the grabbed & prepared pages
	commit_write the copied pages by dirtying their buffers
} while (data left to write);

The allocation in advance is a huge win both in terms of avoiding
fragmentation (NTFS still uses a very simple/stupid allocator so you
get a lot of fragmentation if two processes write to different files
simultaneously and do so in small chunks) and in terms of performance.

I have wondered whether I should perhaps turn on the "multi page"
stuff for all writes, rather than just for ones that go into a hole
where the logical block size is greater than PAGE_CACHE_SIZE, as that
might improve performance even further, but I haven't had the time/
inclination to experiment...

And I have also wondered whether to go direct to bio/whole pages at
once instead of bothering with dirtying each buffer, but the buffers
(which are always 512 bytes on NTFS) allow me to easily support
dirtying smaller parts of the page, which is desirable at least on
volumes with a logical block size < PAGE_CACHE_SIZE, as different bits
of the page could then reside in completely different locations on
disk, so writing out unneeded bits of the page could waste a lot of
time in disk head seeks.
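
For illustration, a minimal standalone sketch of the sub-page dirtying idea
described above: only the 512-byte buffers that the modified byte range
actually touches get marked dirty, so writeback can skip the rest of the
page. The structure and names here are hypothetical, not the NTFS or
fsblock code.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	4096u
#define BUF_SIZE	512u
#define BUFS_PER_PAGE	(PAGE_SIZE / BUF_SIZE)

struct page_buffers {
	uint8_t dirty[BUFS_PER_PAGE];	/* one flag per 512-byte buffer */
};

/* Mark dirty only the buffers covering [off, off + len) within the page. */
static void dirty_buffers(struct page_buffers *pb, unsigned off, unsigned len)
{
	unsigned first = off / BUF_SIZE;
	unsigned last = (off + len - 1) / BUF_SIZE;
	unsigned i;

	for (i = first; i <= last; i++)
		pb->dirty[i] = 1;
}

int main(void)
{
	struct page_buffers pb = { { 0 } };
	unsigned i;

	dirty_buffers(&pb, 1000, 200);	/* touches buffers 1 and 2 only */
	for (i = 0; i < BUFS_PER_PAGE; i++)
		printf("buffer %u: %s\n", i, pb.dirty[i] ? "dirty" : "clean");
	return 0;
}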

Best regards,

Anton

> for each page:
> prepare_write()
> allocate contiguous chunks of disk
> attach buffers
> copy_from_user()
> commit_write()
> dirty buffers
>
> pdflush:
> writepages()
> find pages with contiguous chunks of disk
> build and submit large bios
>
> So, we replace prepare_write and commit_write with an extent based
> api,
> but we keep the dirty each buffer part. writepages has to turn that
> back into extents (bio sized), and the result is completely full of
> dark
> dark corner cases.
>
> I do think fsblocks is a nice cleanup on its own, but Dave has a good
> point that it makes sense to look for ways generalize things even
> more.
>
> -chris

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


2007-06-27 22:36:20

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
> > On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> > > On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > > > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> > >
> > > [ ... fsblocks vs extent range mapping ]
> > >
> > > > iomaps can double as range locks simply because iomaps are
> > > > expressions of ranges within the file. Seeing as you can only
> > > > access a given range exclusively to modify it, inserting an empty
> > > > mapping into the tree as a range lock gives an effective method of
> > > > allowing safe parallel reads, writes and allocation into the file.
> > > >
> > > > The fsblocks and the vm page cache interface cannot be used to
> > > > facilitate this because a radix tree is the wrong type of tree to
> > > > store this information in. A sparse, range based tree (e.g. btree)
> > > > is the right way to do this and it matches very well with
> > > > a range based API.
> > >
> > > I'm really not against the extent based page cache idea, but I kind of
> > > assumed it would be too big a change for this kind of generic setup. At
> > > any rate, if we'd like to do it, it may be best to ditch the idea of
> > > "attach mapping information to a page", and switch to "lookup mapping
> > > information and range locking for a page".
> >
> > Well the get_block equivalent API is extent based one now, and I'll
> > look at what is required in making map_fsblock a more generic call
> > that could be used for an extent-based scheme.
> >
> > An extent based thing IMO really isn't appropriate as the main generic
> > layer here though. If it is really useful and popular, then it could
> > be turned into generic code and sit along side fsblock or underneath
> > fsblock...
>
> Lets look at a typical example of how IO actually gets done today,
> starting with sys_write():
>
> sys_write(file, buffer, 1MB)
> for each page:
> prepare_write()
> allocate contiguous chunks of disk
> attach buffers
> copy_from_user()
> commit_write()
> dirty buffers
>
> pdflush:
> writepages()
> find pages with contiguous chunks of disk
> build and submit large bios
>
> So, we replace prepare_write and commit_write with an extent based api,
> but we keep the dirty each buffer part. writepages has to turn that
> back into extents (bio sized), and the result is completely full of dark
> dark corner cases.

Yup - I've been on the painful end of those dark corner cases several
times in the last few months.

It's also worth pointing out that mpage_readpages() already works on
an extent basis - it overloads bufferheads to provide a "map_bh" that
can point to a range of blocks in the same state. The code then iterates
the map_bh range a page at a time building bios (i.e. not even using
buffer heads) from that map......
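
For illustration, a minimal standalone sketch of that pattern: cache one
range mapping ("extent") and only call back into the filesystem when a
block falls outside the cached range. The types and the fs_map_extent()
callback here are hypothetical stand-ins, not the real mpage code.

#include <stdint.h>
#include <stdio.h>

struct extent_map {
	uint64_t lblk;		/* first logical block of the cached extent */
	uint64_t pblk;		/* corresponding physical block */
	uint64_t len;		/* length in blocks */
	int valid;
};

/* Stand-in for the filesystem's mapping callback: made-up 8-block extents. */
static void fs_map_extent(uint64_t lblk, struct extent_map *em)
{
	em->lblk = lblk & ~(uint64_t)7;
	em->pblk = em->lblk + 1000;
	em->len = 8;
	em->valid = 1;
}

static uint64_t map_block(struct extent_map *em, uint64_t lblk)
{
	/* Only go back to the filesystem when outside the cached extent. */
	if (!em->valid || lblk < em->lblk || lblk >= em->lblk + em->len)
		fs_map_extent(lblk, em);
	return em->pblk + (lblk - em->lblk);
}

int main(void)
{
	struct extent_map em = { 0, 0, 0, 0 };
	uint64_t lblk;

	for (lblk = 0; lblk < 20; lblk++)
		printf("logical %llu -> physical %llu\n",
		       (unsigned long long)lblk,
		       (unsigned long long)map_block(&em, lblk));
	return 0;
}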

> I do think fsblocks is a nice cleanup on its own, but Dave has a good
> point that it makes sense to look for ways generalize things even more.

*nod*

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-28 02:45:06

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote:
> On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> > Lets look at a typical example of how IO actually gets done today,
> > starting with sys_write():
> >
> > sys_write(file, buffer, 1MB)
> > for each page:
> > prepare_write()
> > allocate contiguous chunks of disk
> > attach buffers
> > copy_from_user()
> > commit_write()
> > dirty buffers
> >
> > pdflush:
> > writepages()
> > find pages with contiguous chunks of disk
> > build and submit large bios
> >
> > So, we replace prepare_write and commit_write with an extent based api,
> > but we keep the dirty each buffer part. writepages has to turn that
> > back into extents (bio sized), and the result is completely full of dark
> > dark corner cases.

That's true but I don't think an extent data structure means we can
become too far divorced from the pagecache or the native block size
-- what will end up happening is that often we'll need "stuff" to map
between all those as well, even if it is only at IO-time.

But the point is taken, and I do believe that at least for APIs, extent
based seems like the best way to go. And that should allow fsblock to
be replaced or augmented in future without _too_ much pain.


> Yup - I've been on the painful end of those dark corner cases several
> times in the last few months.
>
> It's also worth pointing out that mpage_readpages() already works on
> an extent basis - it overloads bufferheads to provide a "map_bh" that
> can point to a range of blocks in the same state. The code then iterates
> the map_bh range a page at a time building bios (i.e. not even using
> buffer heads) from that map......

One issue I have with the current nobh and mpage stuff is that it
requires multiple calls into get_block (first to prepare write, then
to writepage), it doesn't allow filesystems to attach resources
required for writeout at prepare_write time, and it doesn't play nicely
with buffers in general. (not to mention that nobh error handling is
buggy).

I haven't done any mpage-like code for fsblocks yet, but I think they
wouldn't be too much trouble, and wouldn't have any of the above
problems...

2007-06-28 12:23:41

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:
> On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote:
> > On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> > > Lets look at a typical example of how IO actually gets done today,
> > > starting with sys_write():
> > >
> > > sys_write(file, buffer, 1MB)
> > > for each page:
> > > prepare_write()
> > > allocate contiguous chunks of disk
> > > attach buffers
> > > copy_from_user()
> > > commit_write()
> > > dirty buffers
> > >
> > > pdflush:
> > > writepages()
> > > find pages with contiguous chunks of disk
> > > build and submit large bios
> > >
> > > So, we replace prepare_write and commit_write with an extent based api,
> > > but we keep the dirty each buffer part. writepages has to turn that
> > > back into extents (bio sized), and the result is completely full of dark
> > > dark corner cases.
>
> That's true but I don't think an extent data structure means we can
> become too far divorced from the pagecache or the native block size
> -- what will end up happening is that often we'll need "stuff" to map
> between all those as well, even if it is only at IO-time.

I think the fundamental difference is that fsblock still does:
mapping_info = page->something, where something is attached on a per
page basis. What we really want is mapping_info = lookup_mapping(page),
where that function goes and finds something stored on a per extent
basis, with extra bits for tracking dirty and locked state.

Ideally, in at least some of the cases the dirty and locked state could
be at an extent granularity (streaming IO) instead of the block
granularity (random IO).

In my little brain, even block based filesystems should be able to take
advantage of this...but such things are always easier to believe in
before the coding starts.
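
For illustration, a minimal standalone sketch of the lookup_mapping() idea
described above: the mapping and its dirty/locked state are kept per extent
in a range-indexed structure for the file, and a lookup by file offset
returns the covering extent, rather than dereferencing something hung off
each page. All names here are hypothetical.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define EXT_DIRTY	0x1
#define EXT_LOCKED	0x2

struct extent {
	uint64_t start;		/* file offset, bytes */
	uint64_t len;		/* length, bytes */
	uint64_t disk;		/* on-disk byte offset */
	unsigned flags;		/* EXT_* state, per extent rather than per block */
};

/* A toy per-file "tree": just a sorted array of extents. */
static struct extent file_map[] = {
	{ 0,       1 << 20, 8 << 20,  0 },
	{ 1 << 20, 1 << 20, 64 << 20, EXT_DIRTY },
};

static struct extent *lookup_mapping(uint64_t off)
{
	size_t i;

	for (i = 0; i < sizeof(file_map) / sizeof(file_map[0]); i++)
		if (off >= file_map[i].start &&
		    off < file_map[i].start + file_map[i].len)
			return &file_map[i];
	return NULL;		/* hole: nothing attached anywhere */
}

int main(void)
{
	uint64_t off = 1536 * 1024;		/* 1.5MB into the file */
	struct extent *e = lookup_mapping(off);

	if (e)
		printf("offset %llu -> disk %llu, dirty=%d\n",
		       (unsigned long long)off,
		       (unsigned long long)(e->disk + (off - e->start)),
		       !!(e->flags & EXT_DIRTY));
	return 0;
}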

>
> But the point is taken, and I do believe that at least for APIs, extent
> based seems like the best way to go. And that should allow fsblock to
> be replaced or augmented in future without _too_ much pain.
>
>
> > Yup - I've been on the painful end of those dark corner cases several
> > times in the last few months.
> >
> > It's also worth pointing out that mpage_readpages() already works on
> > an extent basis - it overloads bufferheads to provide a "map_bh" that
> > can point to a range of blocks in the same state. The code then iterates
> > the map_bh range a page at a time building bios (i.e. not even using
> > buffer heads) from that map......
>
> One issue I have with the current nobh and mpage stuff is that it
> requires multiple calls into get_block (first to prepare write, then
> to writepage), it doesn't allow filesystems to attach resources
> required for writeout at prepare_write time, and it doesn't play nicely
> with buffers in general. (not to mention that nobh error handling is
> buggy).
>
> I haven't done any mpage-like code for fsblocks yet, but I think they
> wouldn't be too much trouble, and wouldn't have any of the above
> problems...

Could be, but the fundamental issue of sometimes pages have mappings
attached and sometimes they don't is still there. The window is
smaller, but non-zero.

-chris

2007-06-29 02:09:16

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Thu, Jun 28, 2007 at 08:20:31AM -0400, Chris Mason wrote:
> On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:
> > That's true but I don't think an extent data structure means we can
> > become too far divorced from the pagecache or the native block size
> > -- what will end up happening is that often we'll need "stuff" to map
> > between all those as well, even if it is only at IO-time.
>
> I think the fundamental difference is that fsblock still does:
> mapping_info = page->something, where something is attached on a per
> page basis. What we really want is mapping_info = lookup_mapping(page),

lookup_block_mapping(page).... ;)

But yes, that is the essence of what I was saying. Thanks for
describing it so concisely, Chris.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-29 02:33:30

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Thu, Jun 28, 2007 at 08:20:31AM -0400, Chris Mason wrote:
> On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:
> >
> > That's true but I don't think an extent data structure means we can
> > become too far divorced from the pagecache or the native block size
> > -- what will end up happening is that often we'll need "stuff" to map
> > between all those as well, even if it is only at IO-time.
>
> I think the fundamental difference is that fsblock still does:
> mapping_info = page->something, where something is attached on a per
> page basis. What we really want is mapping_info = lookup_mapping(page),
> where that function goes and finds something stored on a per extent
> basis, with extra bits for tracking dirty and locked state.
>
> Ideally, in at least some of the cases the dirty and locked state could
> be at an extent granularity (streaming IO) instead of the block
> granularity (random IO).
>
> In my little brain, even block based filesystems should be able to take
> advantage of this...but such things are always easier to believe in
> before the coding starts.

Now I wouldn't for a minute deny that at least some of the block
information would be better stored in extent/tree format (if XFS
does it, it must be good!).

And yes, I'm sure filesystems with even basic block based allocation
could get a reasonable ratio of blocks to extents.

However I think it is fundamentally another layer or at least
more complexity... fsblocks uses the existing pagecache mapping as
(much of) the data structure and uses the existing pagecache locking
for the locking. And it fundamentally just provides a block access
and IO layer into the pagecache for the filesystem, which I think will
often be needed anyway.

But that said, I would like to see a generic extent mapping layer
sitting between fsblock and the filesystem (I might even have a crack
at it myself)... and I could be proven completely wrong and it may be
that fsblock isn't required at all after such a layer goes in. So I
will try to keep all the APIs extent based.

The first thing I actually looked at for "get_blocks" was for the
filesystem to build up a tree of mappings itself, completely unconnected
from the pagecache. It just ended up being a little more work and
locking but the idea isn't insane :)


> > One issue I have with the current nobh and mpage stuff is that it
> > requires multiple calls into get_block (first to prepare write, then
> > to writepage), it doesn't allow filesystems to attach resources
> > required for writeout at prepare_write time, and it doesn't play nicely
> > with buffers in general. (not to mention that nobh error handling is
> > buggy).
> >
> > I haven't done any mpage-like code for fsblocks yet, but I think they
> > wouldn't be too much trouble, and wouldn't have any of the above
> > problems...
>
> Could be, but the fundamental issue of sometimes pages have mappings
> attached and sometimes they don't is still there. The window is
> smaller, but non-zero.

The aim for fsblocks is that any page under IO will always have fsblocks,
which I hope is going to make this easy. In the fsblocks patch I sent out
there is a window (with mmapped pages), however that's a bug which can be
fixed rather than a fundamental problem. So writepages will be less of a
problem.

Readpages may indeed be more efficient at block mapping with extents than
with individual fsblocks (or it could be, if it were an extent based API
itself).

Well I don't know. Extents are always going to have benefits, but I don't
know if it means the fsblock part could go away completely. I'll keep
it in mind though.

2007-06-30 10:40:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

On Tue, Jun 26, 2007 at 12:34:26PM +1000, Nick Piggin wrote:
> That would require a new inode and address_space for the fsblock
> type blockdev pagecache, wouldn't it?

Yes. That's easily possible, XFS already does it for its own
buffer cache.

2007-06-30 10:40:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch 1/3] add the fsblock layer

On Tue, Jun 26, 2007 at 08:26:50AM -0400, Chris Mason wrote:
> Since we're testing new code, I would just leave the blkdev address
> space alone. If a filesystem wants to use fsblocks, they allocate a new
> inode during mount, stuff it into their private super block (or in the
> generic super), and use that for everything. Basically ignoring the
> block device address space completely.

Exactly, same thing XFS does.

2007-06-30 10:43:00

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
> >- In line with the above item, filesystem block allocation is performed
> > before a page is dirtied. In the buffer layer, mmap writes can dirty a
> > page with no backing blocks which is a problem if the filesystem is
> > ENOSPC (patches exist for buffer.c for this).
>
> This raises an eyebrow... The handling of ENOSPC prior to mmap write is
> more an ABI behavior, so I don't see how this can be fixed with internal
> changes, yet without changing behavior currently exported to userland
> (and thus affecting code based on such assumptions).

Not really, the current behaviour is a bug. And it's not actually buffer
layer specific - XFS now has a fix for that bug and it's generic enough
that everyone could use it.

2007-06-30 10:44:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Mon, Jun 25, 2007 at 08:25:21AM -0400, Chris Mason wrote:
> > write_begin/write_end is a step in that direction (and it helps
> > OCFS and GFS quite a bit). I think there is also not much reason
> > for writepage sites to require the page to lock the page and clear
> > the dirty bit themselves (which has seems ugly to me).
>
> If we keep the page mapping information with the page all the time (ie
> writepage doesn't have to call get_block ever), it may be possible to
> avoid sending down a locked page. But, I don't know the delayed
> allocation internals well enough to say for sure if that is true.

The point of delayed allocations is that the mapping information doesn't
even exist until writepage for new allocations :)

2007-06-30 11:05:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] fsblock

Warning ahead: I've only briefly skimmed the patches, so the comments
in this mail are very high-level.

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
> fsblock is a rewrite of the "buffer layer" (ding dong the witch is
> dead), which I have been working on, on and off and is now at the stage
> where some of the basics are working-ish. This email is going to be
> long...
>
> Firstly, what is the buffer layer? The buffer layer isn't really a
> buffer layer as in the buffer cache of unix: the block device cache
> is unified with the pagecache (in terms of the pagecache, a blkdev
> file is just like any other, but with a 1:1 mapping between offset
> and block).
>
> There are filesystem APIs to access the block device, but these go
> through the block device pagecache as well. These don't exactly
> define the buffer layer either.
>
> The buffer layer is a layer between the pagecache and the block
> device for block based filesystems. It keeps a translation between
> logical offset and physical block number, as well as meta
> information such as locks, dirtyness, and IO status of each block.
> This information is tracked via the buffer_head structure.
>

The traditional unix buffer cache is always physical block indexed and
is used for all data/metadata/blockdevice node access. There have been
a lot of variants of schemes where data, or some of the data, is in a
separate inode,logical-block indexed scheme. Most modern OSes including
Linux now always do the inode,logical block index, with some noop
substitute for the metadata and block device node variants of operation.

Now what you replace is a really crappy hybrid of a traditional
unix buffercache implemented on top of the pagecache for the block
device node (for metadata) and a lot of abuse of the same data
structure as used in the buffercache for keeping metainformation
about the actual data mapping.

> Why rewrite the buffer layer? Lots of people have had a desire to
> completely rip out the buffer layer, but we can't do that[*] because
> it does actually serve a useful purpose. Why the bad rap? Because
> the code is old and crufty, and buffer_head is an awful name. It must
> be among the oldest code in the core fs/vm, and the main reason is
> because of the inertia of so many and such complex filesystems.

Actually most of the code is no older than 10 years. Just compare
fs/buffer.c in 2.2 and 2.6. buffer_head is a perfectly fine name
for one of its uses in the traditional buffercache.

I also think there is little to no reason to get rid of that use:
this buffercache is what most Linux block-based filesystems (except,
most notably, xfs and jfs) are written to, and it fits them very nicely.

What I'd really like to see is to get rid of the abuse of struct buffer_head
in the data path, and of the sometimes too intimate coupling of the buffer
cache with page cache internals.

> - Data / metadata separation. I have a struct fsblock and a struct
> fsblock_meta, so we could put more stuff into the usually less used
> fsblock_meta without bloating it up too much. After a few tricks, these
> are no longer any different in my code, and dirty up the typing quite
> a lot (and I'm aware it still has some warnings, thanks). So if not
> useful this could be taken out.


That's what I mean. And from a quick glimpse at your code they're still
far too deeply coupled in fsblock. We really don't want to share
anything between the buffer cache and data mapping operations - they are
so deeply different that this sharing is what creates the enormous
complexity we have to deal with.

> - No deadlocks (hopefully). The buffer layer is technically deadlocky by
> design, because it can require memory allocations at page writeout-time.
> It also has one path that cannot tolerate memory allocation failures.
> No such problems for fsblock, which keeps fsblock metadata around for as
> long as a page is dirty (this still has problems vs get_user_pages, but
> that's going to require an audit of all get_user_pages sites. Phew).

The whole concept of delayed allocation requires page allocations at
writeout time, as do various network protocols or even storage drivers.

> - In line with the above item, filesystem block allocation is performed
> before a page is dirtied. In the buffer layer, mmap writes can dirty a
> page with no backing blocks which is a problem if the filesystem is
> ENOSPC (patches exist for buffer.c for this).

Not really something that is the block layer's fault but rather the
laziness of the filesystem maintainers.

> - Large block support. I can mount and run an 8K block size minix3 fs on
> my 4K page system and it didn't require anything special in the fs. We
> can go up to about 32MB blocks now, and gigabyte+ blocks would only
> require one more bit in the fsblock flags. fsblock_superpage blocks
> are > PAGE_CACHE_SIZE, midpage ==, and subpage <.
>
> Core pagecache code is pretty creaky with respect to this. I think it is
> mostly race free, but it requires stupid unlocking and relocking hacks
> because the vm usually passes single locked pages to the fs layers, and we
> need to lock all pages of a block in offset ascending order. This could be
> avoided by doing locking on only the first page of a block for locking in
> the fsblock layer, but that's a bit scary too. Probably better would be to
> move towards offset,length rather than page based fs APIs where everything
> can be batched up nicely and this sort of non-trivial locking can be more
> optimal.

See now why people like large order page cache so much :)

> Large block memory access via filesystem uses vmap, but it will go back
> to kmap if the access doesn't cross a page. Filesystems really should do
> this because vmap is slow as anything. I've implemented a vmap cache
> which basically wouldn't work on 32-bit systems (because of limited vmap
> space) for performance testing (and yes it sometimes tries to unmap in
> interrupt context, I know, I'm using loop). We could possibly do a self
> limiting cache, but I'd rather build some helpers to hide the raw multi
> page access for things like bitmap scanning and bit setting etc. and
> avoid too much vmaps.

And this is a complete pain in the ass. XFS uses vmap in its metadata buffer
cache due to requirements carried over from IRIX (in fact that's why I
implemented vmap in its current form). This works okay most of the time, but
there are a lot of scenarios where you run out of vmalloc space, as you
mention. What's also nasty is that you can't call vunmap from irq context,
and vunmap is rather bad for system performance due to the TLB flushing
overhead.
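
For illustration, a sketch of the kmap fallback mentioned in the quoted
paragraph above: use kmap() when the requested range fits within a single
page, and fall back to the much more expensive vmap() only when the access
crosses a page boundary. This is a hypothetical sketch, not the fsblock or
XFS code, and it assumes process context (vunmap() cannot be called from
irq context).

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/vmalloc.h>
#include <linux/errno.h>

struct blk_map {
	void *addr;		/* address to use for the access */
	struct page *kmapped;	/* non-NULL if we used kmap() */
	void *vmapped;		/* non-NULL if we used vmap() */
};

static int map_block_range(struct page **pages, unsigned int nr_pages,
			   unsigned int offset, unsigned int len,
			   struct blk_map *m)
{
	m->kmapped = NULL;
	m->vmapped = NULL;

	if (offset / PAGE_SIZE == (offset + len - 1) / PAGE_SIZE) {
		/* Access stays within one page: a cheap kmap is enough. */
		m->kmapped = pages[offset / PAGE_SIZE];
		m->addr = (char *)kmap(m->kmapped) + offset % PAGE_SIZE;
		return 0;
	}

	/* Access crosses pages: map the whole block contiguously. */
	m->vmapped = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
	if (!m->vmapped)
		return -ENOMEM;
	m->addr = (char *)m->vmapped + offset;
	return 0;
}

static void unmap_block_range(struct blk_map *m)
{
	if (m->kmapped)
		kunmap(m->kmapped);
	else if (m->vmapped)
		vunmap(m->vmapped);
}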


So as a closing comment I'd say I'd rather keep buffer_heads for metadata
for now and try to decouple the data path from them. Your fsblock patches
are a very nice start for this, but I'd rather skip the intermediate step
and go towards the extent based API Dave has been outlining. Having dealt
with the I/O path of a high performance filesystem for a while, I find
per-page or sub-page structures a real pain to deal with, and I'd really
prefer data structures that cover as many blocks in the same state as
possible.

2007-06-30 11:10:46

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] fsblock

Christoph Hellwig wrote:
> On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:
>>> - In line with the above item, filesystem block allocation is performed
>>> before a page is dirtied. In the buffer layer, mmap writes can dirty a
>>> page with no backing blocks which is a problem if the filesystem is
>>> ENOSPC (patches exist for buffer.c for this).
>> This raises an eyebrow... The handling of ENOSPC prior to mmap write is
>> more an ABI behavior, so I don't see how this can be fixed with internal
>> changes, yet without changing behavior currently exported to userland
>> (and thus affecting code based on such assumptions).

> Not really, the current behaviour is a bug. And it's not actually buffer
> layer specific - XFS now has a fix for that bug and it's generic enough
> that everyone could use it.

I'm not sure I follow. If you require block allocation at mmap(2) time,
rather than when a page is actually dirtied, you are denying userspace
the ability to do sparse files with mmap.

A quick Google readily turns up people who have built upon the
mmap-sparse-file assumption, and I don't think we want to break those
assumptions as a "bug fix."

Where is the bug?

Jeff


2007-06-30 11:13:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] fsblock

On Sat, Jun 30, 2007 at 07:10:27AM -0400, Jeff Garzik wrote:
> >Not really, the current behaviour is a bug. And it's not actually buffer
> >layer specific - XFS now has a fix for that bug and it's generic enough
> >that everyone could use it.
>
> I'm not sure I follow. If you require block allocation at mmap(2) time,
> rather than when a page is actually dirtied, you are denying userspace
> the ability to do sparse files with mmap.
>
> A quick Google readily turns up people who have built upon the
> mmap-sparse-file assumption, and I don't think we want to break those
> assumptions as a "bug fix."
>
> Where is the bug?

It's not mmap time but page dirtying time. Currently the default behaviour
is not to allocate at page dirtying time but rather at writeout time in
some scenarios.

(and s/allocation/reservation/ applies for delalloc of course)
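
For illustration, a sketch of doing the reservation at page dirtying time
via the ->page_mkwrite() hook (using its circa-2.6.22 signature), so ENOSPC
is reported when an mmapped page is first made writable rather than being
discovered at writeout. The my_fs_* names are hypothetical stand-ins for
filesystem code.

#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical: reserve (delalloc) or allocate blocks backing this page. */
static int my_fs_reserve_blocks(struct inode *inode, pgoff_t index)
{
	/* ... would return -ENOSPC if no space can be reserved ... */
	return 0;
}

static int my_fs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
	struct inode *inode = vma->vm_file->f_mapping->host;

	/* Failing here surfaces ENOSPC before the page can be dirtied. */
	return my_fs_reserve_blocks(inode, page->index);
}

static struct vm_operations_struct my_fs_vm_ops = {
	.page_mkwrite	= my_fs_page_mkwrite,
};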