LinuxLists.cc - [RFC] basic delayed allocation in VFS

2007-07-26 10:03:09

Subject: [RFC] basic delayed allocation in VFS

Good day,

please review ...

thanks, Alex

basic delayed allocation in VFS:

* block_prepare_write() can be passed special ->get_block() which
doesn't allocate blocks, but reserve them and mark bh delayed
* a filesystem can use mpage_da_writepages() with other ->get_block()
which doesn't defer allocation. mpage_da_writepages() finds all
non-allocated blocks and try to allocate them with minimal calls
to ->get_block(), then submit IO using __mpage_writepage()

Signed-off-by: Alex Tomas <[email protected]>

Index: linux-2.6.22-rc4/fs/buffer.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/buffer.c 2007-06-05 04:57:25.000000000 +0400
+++ linux-2.6.22-rc4/fs/buffer.c 2007-07-17 22:40:49.000000000 +0400
@@ -1630,7 +1630,8 @@ static int __block_write_full_page(struc
*/
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
- } else if (!buffer_mapped(bh) && buffer_dirty(bh)) {
+ } else if ((!buffer_mapped(bh) || buffer_delay(bh))
+ && buffer_dirty(bh)) {
WARN_ON(bh->b_size != blocksize);
err = get_block(inode, block, bh, 1);
if (err)
Index: linux-2.6.22-rc4/include/linux/mpage.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/mpage.h 2007-06-05 04:57:25.000000000 +0400
+++ linux-2.6.22-rc4/include/linux/mpage.h 2007-07-25 09:48:44.000000000 +0400
@@ -20,5 +20,7 @@ int mpage_writepages(struct address_spac
struct writeback_control *wbc, get_block_t get_block);
int mpage_writepage(struct page *page, get_block_t *get_block,
struct writeback_control *wbc);
+int mpage_da_writepages(struct address_space *mapping,
+ struct writeback_control *wbc, get_block_t get_block);

#endif
Index: linux-2.6.22-rc4/fs/mpage.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/mpage.c 2007-06-05 04:57:25.000000000 +0400
+++ linux-2.6.22-rc4/fs/mpage.c 2007-07-26 11:59:39.000000000 +0400
@@ -10,6 +10,8 @@
* Initial version
* 27Jun2002 [email protected]
* use bio_add_page() to build bio's just the right size
+ * 26Jul2007 [email protected] AKA bzzz
+ * basic delayed allocation support
*/

#include <linux/kernel.h>
@@ -732,3 +734,404 @@ int mpage_writepage(struct page *page, g
return ret;
}
EXPORT_SYMBOL(mpage_writepage);
+
+/*
+ * Delayed allocation stuff
+ */
+
+struct mpage_da_data {
+ struct inode *inode;
+ struct buffer_head lbh; /* extent of blocks */
+ unsigned long first_page, next_page; /* extent of pages */
+ get_block_t *get_block;
+ struct writeback_control *wbc;
+};
+
+
+/*
+ * mpage_da_submit_io - walks through extent of pages and try to write
+ * them with __mpage_writepage()
+ *
+ * @mpd->inode: inode
+ * @mpd->first_page: first page of the extent
+ * @mpd->next_page: page after the last page of the extent
+ * @mpd->get_block: the filesystem's block mapper function
+ *
+ * By the time mpage_da_submit_io() is called we expect all blocks
+ * to be allocated. this may be wrong if allocation failed.
+ *
+ * As pages are already locked by write_cache_pages(), we can't use it
+ */
+static int mpage_da_submit_io(struct mpage_da_data *mpd)
+{
+ struct address_space *mapping = mpd->inode->i_mapping;
+ struct mpage_data mpd_pp = {
+ .bio = NULL,
+ .last_block_in_bio = 0,
+ .get_block = mpd->get_block,
+ .use_writepage = 1,
+ };
+ int ret = 0, err, nr_pages, i;
+ unsigned long index, end;
+ struct pagevec pvec;
+
+ BUG_ON(mpd->next_page <= mpd->first_page);
+
+ pagevec_init(&pvec, 0);
+ index = mpd->first_page;
+ end = mpd->next_page - 1;
+
+ while (index <= end) {
+ /* XXX: optimize tail */
+ nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
+ if (nr_pages == 0)
+ break;for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+
+ index = page->index;
+ if (index > end)
+ break;
+ index++;
+
+ err = __mpage_writepage(page, mpd->wbc, &mpd_pp);
+
+ /*
+ * In error case, we have to continue because
+ * remaining pages are still locked
+ * XXX: unlock and re-dirty them?
+ */
+ if (ret == 0)
+ ret = err;
+ }
+ pagevec_release(&pvec);
+ }
+ if (mpd_pp.bio)
+ mpage_bio_submit(WRITE, mpd_pp.bio);
+
+ return ret;
+}
+
+/*
+ * mpage_put_bnr_to_bhs - walk blocks and assign them actual numbers
+ *
+ * @mpd->inode - inode to walk through
+ * @exbh->b_blocknr - first block on a disk
+ * @exbh->b_size - amount of space in bytes
+ * @logical - first logical block to start assignment with
+ *
+ * the function goes through all passed space and put actual disk
+ * block numbers into buffer heads, dropping BH_Delay
+ */
+static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical,
+ struct buffer_head *exbh)
+{
+ struct inode *inode = mpd->inode;
+ struct address_space *mapping = inode->i_mapping;
+ int blocks = exbh->b_size >> inode->i_blkbits;
+ sector_t pblock = exbh->b_blocknr, cur_logical;
+ struct buffer_head *head, *bh;
+ unsigned long index, end;
+ struct pagevec pvec;
+ int nr_pages, i;
+
+ index = logical >> (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ end = (logical + blocks - 1) >> (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ cur_logical = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+
+ pagevec_init(&pvec, 0);
+
+ while (index <= end) {
+ /* XXX: optimize tail */
+ nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
+ if (nr_pages == 0)
+ break;
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+
+ index = page->index;
+ if (index > end)
+ break;
+ index++;
+
+ BUG_ON(!PageLocked(page));
+ BUG_ON(PageWriteback(page));
+ BUG_ON(!page_has_buffers(page));
+
+ head = bh = page_buffers(page);
+
+ /* skip blocks out of the range */
+ do {
+ if (cur_logical >= logical)
+ break;
+ cur_logical++;
+ pblock++;
+ } while ((bh = bh->b_this_page) != head);
+
+ do {
+ if (cur_logical >= logical + blocks)
+ break;
+
+ if (buffer_delay(bh)) {
+ bh->b_blocknr = pblock;
+ clear_buffer_delay(bh);
+ } else if (buffer_mapped(bh)) {
+ BUG_ON(bh->b_blocknr != pblock);
+ }
+
+ cur_logical++;
+ pblock++;
+ } while ((bh = bh->b_this_page) != head);
+ }
+ pagevec_release(&pvec);
+ }
+}
+
+
+/*
+ * __unmap_underlying_blocks - just a helper function to unmap
+ * set of blocks described by @bh
+ */
+static inline void __unmap_underlying_blocks(struct inode *inode,
+ struct buffer_head *bh)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ int blocks, i;
+
+ blocks = bh->b_size >> inode->i_blkbits;
+ for (i = 0; i < blocks; i++)
+ unmap_underlying_metadata(bdev, bh->b_blocknr + i);
+}
+
+/*
+ * mpage_da_map_blocks - go through given space
+ *
+ * @mpd->lbh - bh describing space
+ * @mpd->get_block - the filesystem's block mapper function
+ *
+ * The function skips space we know is already mapped to disk blocks.
+ *
+ * The function ignores errors ->get_block() returns, thus real
+ * error handling is postponed to __mpage_writepage()
+ */
+static void mpage_da_map_blocks(struct mpage_da_data *mpd)
+{
+ struct buffer_head *lbh = &mpd->lbh;
+ int err = 0, remain = lbh->b_size;
+ sector_t next = lbh->b_blocknr;
+ struct buffer_head new;
+
+ /*
+ * We consider only non-mapped and non-allocated blocks
+ */
+ if (buffer_mapped(lbh) && !buffer_delay(lbh))
+ return;
+
+ while (remain) {
+ new.b_state = lbh->b_state;
+ new.b_blocknr = 0;
+ new.b_size = remain;
+ err = mpd->get_block(mpd->inode, next, &new, 1);
+ if (err) {
+ /*
+ * Rather than implement own error handling
+ * here, we just leave remaining blocks
+ * unallocated and try again with ->writepage()
+ */
+ break;
+ }
+ BUG_ON(new.b_size == 0);
+
+ if (buffer_new(&new))
+ __unmap_underlying_blocks(mpd->inode, &new);
+
+ /*
+ * If blocks are delayed marked, we need to
+ * put actual blocknr and drop delayed bit
+ */
+ if (buffer_delay(lbh))
+ mpage_put_bnr_to_bhs(mpd, next, &new);
+
+ /* go for the remaining blocks */
+ next += new.b_size >> mpd->inode->i_blkbits;
+ remain -= new.b_size;
+ }
+}
+
+#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | (1 << BH_Delay))
+
+/*
+ * mpage_add_bh_to_extent - try to add one more block to extent of blocks
+ *
+ * @mpd->lbh - extent of blocks
+ * @logical - logical number of the block in the file
+ * @bh - bh of the block (used to access block's state)
+ *
+ * the function is used to collect contig. blocks in same state
+ */
+static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
+ sector_t logical, struct buffer_head *bh)
+{
+ struct buffer_head *lbh = &mpd->lbh;
+ sector_t next;
+
+ next = lbh->b_blocknr + (lbh->b_size >> mpd->inode->i_blkbits);
+
+ /*
+ * First block in the extent
+ */
+ if (lbh->b_size == 0) {
+ lbh->b_blocknr = logical;
+ lbh->b_size = bh->b_size;
+ lbh->b_state = bh->b_state & BH_FLAGS;
+ return;
+ }
+
+ /*
+ * Can we merge the block to our big extent?
+ */
+ if (logical == next && (bh->b_state & BH_FLAGS) == lbh->b_state) {
+ lbh->b_size += bh->b_size;
+ return;
+ }
+
+ /*
+ * We couldn't merge the block to our extent, so we
+ * need to flush current extent and start new one
+ */
+ mpage_da_map_blocks(mpd);
+
+ /*
+ * Now start a new extent
+ */
+ lbh->b_size = bh->b_size;
+ lbh->b_state = bh->b_state & BH_FLAGS;
+ lbh->b_blocknr = logical;
+}
+
+/*
+ * __mpage_da_writepage - finds extent of pages and blocks
+ *
+ * @page: page to consider
+ * @wbc: not used, we just follow rules
+ * @data: context
+ *
+ * The function finds extents of pages and scan them for all blocks.
+ */
+static int __mpage_da_writepage(struct page *page,
+ struct writeback_control *wbc, void *data)
+{
+ struct mpage_da_data *mpd = data;
+ struct inode *inode = mpd->inode;
+ struct buffer_head *bh, *head, fake;
+ sector_t logical;
+
+ /*
+ * Can we merge this page to current extent?
+ */
+ if (mpd->next_page != page->index) {
+ /*
+ * Nope, we can't. So, we map non-allocated blocks
+ * and start IO on them using __mpage_writepage()
+ */
+ if (mpd->next_page != mpd->first_page) {
+ mpage_da_map_blocks(mpd);
+ mpage_da_submit_io(mpd);
+ }
+
+ /*
+ * Start next extent of pages ...
+ */
+ mpd->first_page = page->index;
+
+ /*
+ * ... and blocks
+ */
+ mpd->lbh.b_size = 0;
+ mpd->lbh.b_state = 0;
+ mpd->lbh.b_blocknr = 0;
+ }
+
+ mpd->next_page = page->index + 1;
+ logical = (sector_t) page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+
+ if (!page_has_buffers(page)) {
+ /*
+ * There is no attached buffer heads yet (mmap?)
+ * we treat the page asfull of dirty blocks
+ */
+ bh = &fake;
+ bh->b_size = PAGE_CACHE_SIZE;
+ bh->b_state = 0;
+ set_buffer_dirty(bh);
+ set_buffer_uptodate(bh);
+ mpage_add_bh_to_extent(mpd, logical, bh);
+ } else {
+ /*
+ * Page with regular buffer heads, just add all dirty ones
+ */
+ bh = head = page_buffers(page);
+ do {
+ BUG_ON(buffer_locked(bh));
+ if (buffer_dirty(bh))
+ mpage_add_bh_to_extent(mpd, logical, bh);
+ logical++;
+ } while ((bh = bh->b_this_page) != head);
+ }
+
+ return 0;
+}
+
+/*
+ * mpage_da_writepages - walk the list of dirty pages of the given
+ * address space, allocates non-allocated blocks, maps newly-allocated
+ * blocks to existing bhs and issue IO them
+ *
+ * @mapping: address space structure to write
+ * @wbc: subtract the number of written pages from *@wbc->nr_to_write
+ * @get_block: the filesystem's block mapper function.
+ *
+ * This is a library function, which implements the writepages()
+ * address_space_operation.
+ *
+ * In order to avoid duplication of logic that deals with partial pages,
+ * multiple bio per page, etc, we find non-allocated blocks, allocate
+ * them with minimal calls to ->get_block() and re-use __mpage_writepage()
+ *
+ * It's important that we call __mpage_writepage() only once for each
+ * involved page, otherwise we'd have to implement more complicated logic
+ * to deal with pages w/o PG_lock or w/ PG_writeback and so on.
+ *
+ * See comments to mpage_writepages()
+ */
+int mpage_da_writepages(struct address_space *mapping,
+ struct writeback_control *wbc, get_block_t get_block)
+{
+ struct mpage_da_data mpd;
+ int ret;
+
+ if (!get_block)
+ return generic_writepages(mapping, wbc);
+
+ mpd.wbc = wbc;
+ mpd.inode = mapping->host;
+ mpd.lbh.b_size = 0;
+ mpd.lbh.b_state = 0;
+ mpd.lbh.b_blocknr = 0;
+ mpd.first_page = 0;
+ mpd.next_page = 0;
+ mpd.get_block = get_block;
+
+ ret = write_cache_pages(mapping, wbc, __mpage_da_writepage, &mpd);
+
+ /*
+ * Handle last extent of pages
+ */
+ if (mpd.next_page != mpd.first_page) {
+ mpage_da_map_blocks(&mpd);
+ mpage_da_submit_io(&mpd);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(mpage_da_writepages);
+

2007-07-26 10:32:56

by Jeff Garzik

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

Alex Tomas wrote:
> Good day,
>
> please review ...
>
> thanks, Alex
>
>
> basic delayed allocation in VFS:
>
> * block_prepare_write() can be passed special ->get_block() which
> doesn't allocate blocks, but reserve them and mark bh delayed
> * a filesystem can use mpage_da_writepages() with other ->get_block()
> which doesn't defer allocation. mpage_da_writepages() finds all
> non-allocated blocks and try to allocate them with minimal calls
> to ->get_block(), then submit IO using __mpage_writepage()
>
>
> Signed-off-by: Alex Tomas <[email protected]>

Is this based on Christoph's work?

Christoph, or some other XFS hacker, already did generic delalloc,
modeled on the XFS delalloc code.

Jeff

2007-07-26 11:26:27

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

Jeff Garzik wrote:
> Is this based on Christoph's work?
>
> Christoph, or some other XFS hacker, already did generic delalloc,
> modeled on the XFS delalloc code.

nope, this one is simple (something I'd prefer for ext4).

thanks, Alex

2007-07-26 11:47:56

by Aneesh Kumar K.V

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

Alex Tomas wrote:
> Good day,
>
> please review ...
>
> thanks, Alex
>
>
> basic delayed allocation in VFS:
>
> * block_prepare_write() can be passed special ->get_block() which
> doesn't allocate blocks, but reserve them and mark bh delayed
> * a filesystem can use mpage_da_writepages() with other ->get_block()
> which doesn't defer allocation. mpage_da_writepages() finds all
> non-allocated blocks and try to allocate them with minimal calls
> to ->get_block(), then submit IO using __mpage_writepage()
>

I missed this patch when looking at the ext4 patches. Can we mark related
patch as [ PATCH 1/2 ] so that we know that another patch is going to follow.

-aneesh

2007-07-26 12:05:17

by Jeff Garzik

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

Alex Tomas wrote:
> Jeff Garzik wrote:
>> Is this based on Christoph's work?
>>
>> Christoph, or some other XFS hacker, already did generic delalloc,
>> modeled on the XFS delalloc code.
>
> nope, this one is simple (something I'd prefer for ext4).

The XFS one is proven and the work was already completed.

What were the specific technical issues that made it unsuitable for ext4?

I would rather not reinvent the wheel, particularly if the reinvention
is less capable than the existing work.

Jeff

2007-07-26 13:33:33

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

It duplicates fs/mpage.c in bio building and introduces new generic API
(iomap, map_blocks_t, etc). In contrast, my trivial implementation re-use
existing code in fs/mpage.c, doesn't introduce new API and I tend to think
provides quite the same functionality. I can be wrong, of course ...

thanks, Alex

Jeff Garzik wrote:
> The XFS one is proven and the work was already completed.
>
> What were the specific technical issues that made it unsuitable for ext4?
>
> I would rather not reinvent the wheel, particularly if the reinvention
> is less capable than the existing work.
>
> Jeff
>
>

2007-07-27 05:07:14

by David Chinner

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

[please don't top post!]

On Thu, Jul 26, 2007 at 05:33:08PM +0400, Alex Tomas wrote:
> Jeff Garzik wrote:
> >The XFS one is proven and the work was already completed.
> >
> >What were the specific technical issues that made it unsuitable for ext4?
> >
> >I would rather not reinvent the wheel, particularly if the reinvention
> >is less capable than the existing work.
>
> It duplicates fs/mpage.c in bio building and introduces new generic API
> (iomap, map_blocks_t, etc).

Using a new API for new functionality is a bad thing?

> In contrast, my trivial implementation re-use
> existing code in fs/mpage.c, doesn't introduce new API and I tend to think
> provides quite the same functionality. I can be wrong, of course ...

No, it doesn't provide the same functionality.

Firstly, XFS attaches a different I/O completion to delalloc writes
to allow us to update the file size when the write is beyond the
current on disk EOF. This code cannot do that as all it does is
allocation and present "normal looking" buffers to the generic code
path.

Secondly, apart from delalloc, XFS cannot use the generic code paths
for writeback because unwritten extent conversion also requires
custom I/O completion handlers. Given that __mpage_writepage() only
calls ->writepage when it is confused, XFS simply cannot use this
API.

Also, looking at the way mpage_da_map_blocks() is done - if we have
an 128MB delalloc extent - ext4 will allocate that will allocate it
in one go, right? What happens if we then crash after only writing a
few megabytes of that extent? stale data exposure? XFS can allocate
multiple gigabytes in a single get_blocks call so even if ext4 can't
do this, it's a problem for XFS.....

So without the ability to attach specific I/O completions to bios
or support for unwritten extents directly in __mpage_writepage,
there is no way XFS can use this "generic" delayed allocation code.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-07-27 07:52:22

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

David Chinner wrote:
> Using a new API for new functionality is a bad thing?

if existing API can be used ...

> No, it doesn't provide the same functionality.
>
> Firstly, XFS attaches a different I/O completion to delalloc writes
> to allow us to update the file size when the write is beyond the
> current on disk EOF. This code cannot do that as all it does is
> allocation and present "normal looking" buffers to the generic code
> path.

good point, I was going to take care of it in a separate patch
to support data=ordered.

> Secondly, apart from delalloc, XFS cannot use the generic code paths
> for writeback because unwritten extent conversion also requires
> custom I/O completion handlers. Given that __mpage_writepage() only
> calls ->writepage when it is confused, XFS simply cannot use this
> API.

this doesn't mean fs/mpage.c should go, right?

> Also, looking at the way mpage_da_map_blocks() is done - if we have
> an 128MB delalloc extent - ext4 will allocate that will allocate it
> in one go, right? What happens if we then crash after only writing a
> few megabytes of that extent? stale data exposure? XFS can allocate
> multiple gigabytes in a single get_blocks call so even if ext4 can't
> do this, it's a problem for XFS.....

what happens if IO to 2nd MB is completed, while IO to 1st MB is not
(probably sitting in queue) ? do you update on-disk size in this case?
how do you track this?

> So without the ability to attach specific I/O completions to bios
> or support for unwritten extents directly in __mpage_writepage,
> there is no way XFS can use this "generic" delayed allocation code.

I didn't say "generic", see Subject: :)

thanks, Alex

2007-07-27 12:34:15

by Jeff Garzik

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

Alex Tomas wrote:
>> So without the ability to attach specific I/O completions to bios
>> or support for unwritten extents directly in __mpage_writepage,
>> there is no way XFS can use this "generic" delayed allocation code.
>
> I didn't say "generic", see Subject: :)

Well, it shouldn't even be in the VFS layer if it's only usable by one
filesystem.

Jeff

2007-07-27 12:38:44

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

David Chinner wrote:
> Firstly, XFS attaches a different I/O completion to delalloc writes
> to allow us to update the file size when the write is beyond the
> current on disk EOF. This code cannot do that as all it does is
> allocation and present "normal looking" buffers to the generic code
> path.

how do you implement fsync(2) ? you'd have to wait such IO to complete,
then update the inode and write it through the log?

> Also, looking at the way mpage_da_map_blocks() is done - if we have
> an 128MB delalloc extent - ext4 will allocate that will allocate it
> in one go, right? What happens if we then crash after only writing a
> few megabytes of that extent? stale data exposure? XFS can allocate
> multiple gigabytes in a single get_blocks call so even if ext4 can't
> do this, it's a problem for XFS.....

I just realized that you're talking about data=ordered mode in ext4,
where care is taken to prevent on-disk references to no-yet-written
blocks. The solution is to wait such IO to complete before metadata
commit. And the key thing here is to allocate and attach to inode
blocks we're writing immediately. IOW, there is no unwritten blocks
attached to inode (except fallocate(2) case), but there may be blocks
preallocated for this inode in-core. same gigabytes, but different
way ;)

I have no single objection to custom IO completion callback per
mpage_writepages().

thanks, Alex

2007-07-27 12:43:20

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

Jeff Garzik wrote:
> Alex Tomas wrote:
>>> So without the ability to attach specific I/O completions to bios
>>> or support for unwritten extents directly in __mpage_writepage,
>>> there is no way XFS can use this "generic" delayed allocation code.
>>
>> I didn't say "generic", see Subject: :)
>
> Well, it shouldn't even be in the VFS layer if it's only usable by one
> filesystem.

sorry, but it seems I can say the same about iomap/ioend. I think
mpage_da_writepages() is simple enough to be adopted by other
filesystem, ext2 for example.

thanks, Alex

2007-07-28 19:51:24

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Thu, Jul 26, 2007 at 06:32:56AM -0400, Jeff Garzik wrote:
> Is this based on Christoph's work?
>
> Christoph, or some other XFS hacker, already did generic delalloc,
> modeled on the XFS delalloc code.

This is not based on my attempt to make the xfs writeout path generic.
Alex's variant is a lot simpler and thus missed various bits required
for high sustained writeout performance or xfs functionality.

That doesn't mean I want to arge against Alex's code although I'd of
course be more happy if we could actually shared code between multiple
filesystems.

Of ourse the code in it's current form should not go into mpage.c but
rather into ext4 so that it doesn't bloat the kernel for everyone.

2007-07-28 19:53:30

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Fri, Jul 27, 2007 at 03:07:14PM +1000, David Chinner wrote:
> > It duplicates fs/mpage.c in bio building and introduces new generic API
> > (iomap, map_blocks_t, etc).
>
> Using a new API for new functionality is a bad thing?

Depends on wht you do. This patch is just a quickhack to shoe-horn
delalloc support into ext4. Introducing a new abstraction is overkill.
If we really want an overhaul of the writeback path that's extent-aware,
and efficient for delalloc and unwritten extents introducing a proper
iomap-like data structure would make sense. That beeing said I personally
hate the ubffer_head abuse for bmap data that we have in various places
as it's utterly confusing and wasting stack space, but that's a different
discussion.

2007-07-28 19:56:05

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote:
> >Secondly, apart from delalloc, XFS cannot use the generic code paths
> >for writeback because unwritten extent conversion also requires
> >custom I/O completion handlers. Given that __mpage_writepage() only
> >calls ->writepage when it is confused, XFS simply cannot use this
> >API.
>
> this doesn't mean fs/mpage.c should go, right?

mpage.c read side is fine for every block based filesystem I know.
mpage.c write side is fine for every simple (non-delalloc, non-unwritten
extent, etc) filesystem. So it surely shouldn't go.

> I didn't say "generic", see Subject: :)

then it shouldn't be in generic code.

2007-07-28 19:56:40

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

Christoph Hellwig wrote:
> This is not based on my attempt to make the xfs writeout path generic.
> Alex's variant is a lot simpler and thus missed various bits required
> for high sustained writeout performance or xfs functionality.

I'd very appreciate any details about high writeout performance.

> That doesn't mean I want to arge against Alex's code although I'd of
> course be more happy if we could actually shared code between multiple
> filesystems.

I'm not against at all, of course. but xfs writeout code looks .. hmm ..
very xfs :)

thanks, Alex

2007-07-28 19:57:18

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Fri, Jul 27, 2007 at 04:38:44PM +0400, Alex Tomas wrote:
> I just realized that you're talking about data=ordered mode in ext4,
> where care is taken to prevent on-disk references to no-yet-written
> blocks.

Any reference to non-written blocks is a bug.

2007-07-29 09:18:28

by David Chinner

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote:
> David Chinner wrote:
> >Using a new API for new functionality is a bad thing?
>
> if existing API can be used ...

Sure, but using the existing APIs is no good if the only filesystem
in the kernel that supports delalloc cannot use the new code....

> >Also, looking at the way mpage_da_map_blocks() is done - if we have
> >an 128MB delalloc extent - ext4 will allocate that will allocate it
> >in one go, right? What happens if we then crash after only writing a
> >few megabytes of that extent? stale data exposure? XFS can allocate
> >multiple gigabytes in a single get_blocks call so even if ext4 can't
> >do this, it's a problem for XFS.....
>
> what happens if IO to 2nd MB is completed, while IO to 1st MB is not
> (probably sitting in queue) ? do you update on-disk size in this case?
> how do you track this?

We're updating the in-memory on-disk inode here, not the actual
inode on disk. That means that if we crashed right here, the file
size on disk would not be changed at all and the filesystem would
behave as if both writes did not ever occur and we simply end up
with empty "preallocated" blocks beyond EOF....

But this is really irrelevant - the issue at hand is what we want
for VFS level delalloc support. IMO, that mechanism needs to support
both XFS and ext4, and I'd prefer if it doesn't perpetuate the
bufferhead abuses of the past (i.e. define an iomap structure
instead of overloading bufferheads yet again).

> >So without the ability to attach specific I/O completions to bios
> >or support for unwritten extents directly in __mpage_writepage,
> >there is no way XFS can use this "generic" delayed allocation code.
>
> I didn't say "generic", see Subject: :)

No, you didn't, but VFS level functionality implies that
functionality is both generic and able to be used by all
filesystems.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-07-29 12:09:44

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

David Chinner wrote:
> On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote:
> But this is really irrelevant - the issue at hand is what we want
> for VFS level delalloc support. IMO, that mechanism needs to support
> both XFS and ext4, and I'd prefer if it doesn't perpetuate the
> bufferhead abuses of the past (i.e. define an iomap structure
> instead of overloading bufferheads yet again).

I'm not sure I understand very well. where would you track uptodate,
dirty and other states then? do you propose to separate block states
from block mapping?

thanks, Alex

2007-07-29 17:30:38

by Andreas Dilger

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Jul 28, 2007 20:51 +0100, Christoph Hellwig wrote:
> That doesn't mean I want to arge against Alex's code although I'd of
> course be more happy if we could actually shared code between multiple
> filesystems.
>
> Of ourse the code in it's current form should not go into mpage.c but
> rather into ext4 so that it doesn't bloat the kernel for everyone.

Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
was rejected because "that functionality should go into the VFS".
Since the performance improvement of delalloc is quite large, we'd
like to get this into the kernel one way or another. Can we make a
decision if the ext4-specific delalloc is acceptable?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-07-29 17:48:10

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

Andreas Dilger wrote:
> Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
> was rejected because "that functionality should go into the VFS".
> Since the performance improvement of delalloc is quite large, we'd
> like to get this into the kernel one way or another. Can we make a
> decision if the ext4-specific delalloc is acceptable?

I think the latter one is better because it supports bs < pagesize
(though I'm not sure about data=ordered yet). I'm not against putting
most of the patch into fs/ext4/, but at least few bits to be changed
in fs/ - exports in fs/mpage.c and one "if" in __block_write_full_page().

thanks, Alex

2007-07-29 19:22:50

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Sun, Jul 29, 2007 at 09:48:10PM +0400, Alex Tomas wrote:
> I think the latter one is better because it supports bs < pagesize
> (though I'm not sure about data=ordered yet). I'm not against putting
> most of the patch into fs/ext4/, but at least few bits to be changed
> in fs/ - exports in fs/mpage.c and one "if" in __block_write_full_page().

The changes to __block_write_full_page is obviously fine, and exporting
mpage.c bits sounds fine to me aswell, although I'd like to take a look
at the final patch.

2007-07-29 19:24:40

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
> Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
> was rejected because "that functionality should go into the VFS".
> Since the performance improvement of delalloc is quite large, we'd
> like to get this into the kernel one way or another. Can we make a
> decision if the ext4-specific delalloc is acceptable?

I'm a big proponent of having proper common delalloc code, but the
one proposed here is not generic for the existing filesystem using
delalloc. It's still on my todo list to revamp the xfs code to get
rid of some of the existing mess and make it useable genericly. If
the ext4 users are fine with the end result we could move to generic
code.

Note that moving to VFS is bullshit either way, writeback code is
nowhere near the VFS nor should it.

2007-07-29 19:51:07

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

I'm a bit worried about one thing ... it looks like XFS and ext4
use different techniques to order data and metadata referencing
them. now I'm not that optimistic that we can separate ordering
from delalloc itself clean and reasonable way. In general, I'd
prefer common code in fs/ (mm/?) of course, for number of reasons.

thanks, Alex

Christoph Hellwig wrote:
> I'm a big proponent of having proper common delalloc code, but the
> one proposed here is not generic for the existing filesystem using
> delalloc. It's still on my todo list to revamp the xfs code to get
> rid of some of the existing mess and make it useable genericly. If
> the ext4 users are fine with the end result we could move to generic
> code.
>
> Note that moving to VFS is bullshit either way, writeback code is
> nowhere near the VFS nor should it.

2007-07-30 00:28:52

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Sun, Jul 29, 2007 at 08:24:37PM +0100, Christoph Hellwig wrote:
> I'm a big proponent of having proper common delalloc code, but the
> one proposed here is not generic for the existing filesystem using
> delalloc. It's still on my todo list to revamp the xfs code to get
> rid of some of the existing mess and make it useable genericly. If
> the ext4 users are fine with the end result we could move to generic
> code.

Do you think it would be faster for you to revamp the code or to give
instructions about how you'd like to clean up the code and what has to
be preserved in order to keep XFS happy, so someone else could give it
a try? Or do you think the code is to grotty and/or tricky for
someone else to attempt this?

> Note that moving to VFS is bullshit either way, writeback code is
> nowhere near the VFS nor should it.

Agreed. I would think the something like mm/delayed_alloc.c would be
preferable. Ideally it would be like the filemap.c code, where it
would be relatively easy for most standard filesystems to hook into it
and get the advantages of delayed allocation. (Although granted it
will probably require more effort on the part of a filesystem author
than filemap!)

- Ted

2007-07-30 00:29:26

by David Chinner

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Sun, Jul 29, 2007 at 04:09:20PM +0400, Alex Tomas wrote:
> David Chinner wrote:
> >On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote:
> >But this is really irrelevant - the issue at hand is what we want
> >for VFS level delalloc support. IMO, that mechanism needs to support
> >both XFS and ext4, and I'd prefer if it doesn't perpetuate the
> >bufferhead abuses of the past (i.e. define an iomap structure
> >instead of overloading bufferheads yet again).
>
> I'm not sure I understand very well.

->get_blocks abuses bufferheads to provide an offset/length/state
mapping. That's all it needs. That what the iomap structure is used
for. It's smaller than a bufferhead, it's descriptive of it's use
and you don't get it confused with the other 10 ways bufferheads
are used and abused.

> where would you track uptodate, dirty and other states then?
> do you propose to separate block states from block mapping?

No. They still get tracked in the bufferheads attached to the page.
That's what bufferheads were originally intended for(*).

Cheers,

Dave.

(*) I recently proposed a separate block map tree for this rather
than using buffer heads for this because of the memory footprint of
N bufferheads per page on contiguous mappings. That's future work,
not something we really need to consider here. Chris Mason's extent
map tree patches are a start on this concept.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-07-30 17:49:17

by Mingming Cao

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Sun, 2007-07-29 at 20:24 +0100, Christoph Hellwig wrote:
> On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
> > Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
> > was rejected because "that functionality should go into the VFS".
> > Since the performance improvement of delalloc is quite large, we'd
> > like to get this into the kernel one way or another. Can we make a
> > decision if the ext4-specific delalloc is acceptable?
>
> I'm a big proponent of having proper common delalloc code, but the
> one proposed here is not generic for the existing filesystem using
> delalloc.

To be fair, what Alex have so far is probably good enough for ext2/3
delayed allocation.

> It's still on my todo list to revamp the xfs code to get
> rid of some of the existing mess and make it useable genericly. If
> the ext4 users are fine with the end result we could move to generic
> code.
>

Are you okay with having a ext4 delayed allocation implementation (i.e.
moving the code proposed in this thread to fs/ext4) first? Then later
when you come up with a generic delayed allocation for both ext4 and xfs
we could make use of that generic implementation. Is that a acceptable
approach?

Andrew, what do you think?

Regards,
Mingming

2007-07-30 19:43:49

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC] basic delayed allocation in VFS

On Mon, 30 Jul 2007 10:49:14 -0700
Mingming Cao <[email protected]> wrote:

> On Sun, 2007-07-29 at 20:24 +0100, Christoph Hellwig wrote:
> > On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
> > > Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
> > > was rejected because "that functionality should go into the VFS".
> > > Since the performance improvement of delalloc is quite large, we'd
> > > like to get this into the kernel one way or another. Can we make a
> > > decision if the ext4-specific delalloc is acceptable?
> >
> > I'm a big proponent of having proper common delalloc code, but the
> > one proposed here is not generic for the existing filesystem using
> > delalloc.
>
> To be fair, what Alex have so far is probably good enough for ext2/3
> delayed allocation.
>
> > It's still on my todo list to revamp the xfs code to get
> > rid of some of the existing mess and make it useable genericly. If
> > the ext4 users are fine with the end result we could move to generic
> > code.
> >
>
> Are you okay with having a ext4 delayed allocation implementation (i.e.
> moving the code proposed in this thread to fs/ext4) first? Then later
> when you come up with a generic delayed allocation for both ext4 and xfs
> we could make use of that generic implementation. Is that a acceptable
> approach?
>
> Andrew, what do you think?
>

There's a decent risk that the generic implementation would never happen.

I'd have thought that it'd be pretty tricky to make anything which is in
XFS suitable for general use, because after years of tuning and tweaking
it'll be full of xfs-specific things, but I haven't looked.

And a similar thing will happen if an ext4-specific version is merged.

The sad fact is that if we have a generic version, it turns out being a
least-common-denominator thing which never fully meets the requirements of
any of its users. We end up filling the generic code up with
caller-selectable optional functionality for each filesystem. (See
fs/direct-io.c).

The whole approach of making the pagecache/data handling be part of the VFS
hasn't been a great success, IMO. It was fine for ext2 and similar (jfs,
minix, etc). But for filesytems which do fancier things with data it
hasn't worked out well. otoh, moving it all into the fs would have been a
bad decision too, so we just muddle through, making compromises.

So, umm, yes, on balance I do agree that we should explore doing some of
this in the VFS, and I believe that we should do it on the initial merge
rather than promising to ourselves that we'll fix it up later. This will
devolve into the ext4 and xfs people working out which bits can and should
be moved into the VFS, and working out what they should look like.