2007-05-04 13:35:51

by Eric W. Biederman

Subject: Re: [00/17] Large Blocksize Support V3

Theodore Tso <[email protected]> writes:

> On Fri, Apr 27, 2007 at 01:48:49AM -0700, Andrew Morton wrote:
>> And other filesystems (ie: ext4) _might_ use it. But ext4 is extent-based,
>> so perhaps it's not worth churning the on-disk format to get a bit of a
>> boost in the block allocator.
>
> Well, ext3 could definitely use it; there are people using 8k and 16k
> blocksizes on ia64 systems today. Those filesystems can't be mounted
> on x86 or x86_64 systems because our pagesize is 4k, though.
>
> And I imagine that ext4 might want to use a large blocksize too ---
> after all, XFS is extent based as well, and not _all_ of the
> advantages of using a larger blocksize are related to brain-damaged
> storage subsystems with short SG list support. Whether the advantages
> offset the internal fragmentation overhead or the complexity of adding
> fragments support is a different question, of course.
>
> So while the jury is out about how many other filesystems might use
> it, I suspect it's more than you might think. At the very least,
> there may be some IA64 users who might be trying to transition their
> way to x86_64, and have existing filesystems using a 8k or 16k
> block filesystems. :-)

How much of a problem would it be if those blocks were not necessarily
contiguous in RAM, but placed in normal 4K pages in the page cache?

I expect meta data operations would have to be modified but that otherwise
you would not care.

Eric


2007-05-07 04:30:06

by David Chinner

Subject: Re: [00/17] Large Blocksize Support V3

On Fri, May 04, 2007 at 07:33:54AM -0600, Eric W. Biederman wrote:
> >
> > So while the jury is out about how many other filesystems might use
> > it, I suspect it's more than you might think. At the very least,
> > there may be some IA64 users who might be trying to transition their
> > way to x86_64, and have existing filesystems using a 8k or 16k
> > block filesystems. :-)
>
> How much of a problem would it be if those blocks were not necessarily
> contiguous in RAM, but placed in normal 4K pages in the page cache?

If you need to treat the block in a contiguous range, then you need to
vmap() the discontiguous pages. That has substantial overhead if you
have to do it regularly.

We do this in xfs_buf.c for > page size blocks - the overhead that
caused when operating on inode clusters resulted in us doing some
pointer fiddling and directly addressing the contents of each page
to avoid the vmap overhead. See xfs_buf_offset() and friends....
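
As a rough illustration of the direct-addressing idea (a userspace sketch, not the xfs_buf.c code; the `sketch_` names and the 4K page size are assumptions made for the example), a byte offset into a buffer built from discontiguous pages can be resolved by picking the page and indexing into it, with no contiguous mapping:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12                          /* assumed 4K pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for a buffer backed by discontiguous pages. */
struct sketch_buf {
	unsigned char **pages;   /* one pointer per (already mapped) page */
	size_t nr_pages;
};

/*
 * Return the address of byte `offset` inside the buffer. The pointer is
 * only usable up to the end of the page it lands in; anything that runs
 * past that point has to be handled page by page.
 */
static void *sketch_buf_offset(const struct sketch_buf *b, size_t offset)
{
	size_t idx = offset >> SKETCH_PAGE_SHIFT;

	assert(idx < b->nr_pages);
	return b->pages[idx] + (offset & (SKETCH_PAGE_SIZE - 1));
}
```

This avoids the mapping cost but only helps while each object stays within one page.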

> I expect meta data operations would have to be modified but that otherwise
> you would not care.

I think you might need to modify the copy-in and copy-out operations
substantially (e.g. prepare_/commit_write()) as they assume a buffer doesn't
span multiple pages.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-07 04:49:48

by Eric W. Biederman

Subject: Re: [00/17] Large Blocksize Support V3

David Chinner <[email protected]> writes:

> On Fri, May 04, 2007 at 07:33:54AM -0600, Eric W. Biederman wrote:
>> >
>> > So while the jury is out about how many other filesystems might use
>> > it, I suspect it's more than you might think. At the very least,
>> > there may be some IA64 users who might be trying to transition their
>> > way to x86_64, and have existing filesystems using a 8k or 16k
>> > block filesystems. :-)
>>
>> How much of a problem would it be if those blocks were not necessarily
>> contiguous in RAM, but placed in normal 4K pages in the page cache?
>
> If you need to treat the block in a contiguous range, then you need to
> vmap() the discontiguous pages. That has substantial overhead if you
> have to do it regularly.

Which is why I would prefer not to do it. I think vmap is not really
compatible with the design of the linux page cache.

Although we can't even count on the pages being mapped into low
memory right now and have to call kmap if we want to access them
so things might not be that bad. Even if it was a multipage kmap
type operation.

> We do this in xfs_buf.c for > page size blocks - the overhead that
> caused when operating on inode clusters resulted in us doing some
> pointer fiddling and directly addressing the contents of each page
> to avoid the vmap overhead. See xfs_buf_offset() and friends....
>
>> I expect meta data operations would have to be modified but that otherwise
>> you would not care.
>
> I think you might need to modify the copy-in and copy-out operations
> substantially (e.g. prepare_/commit_write()) as they assume a buffer doesn't
> span multiple pages.....

But in a filesystem like ext2, except for zeroing some unused hunks
of the page, all that really happens is you set up for DMA straight out
of the page cache. So this is primarily an issue for meta-data.

Eric

2007-05-07 05:27:55

by David Chinner

Subject: Re: [00/17] Large Blocksize Support V3

On Sun, May 06, 2007 at 10:48:23PM -0600, Eric W. Biederman wrote:
> David Chinner <[email protected]> writes:
>
> > On Fri, May 04, 2007 at 07:33:54AM -0600, Eric W. Biederman wrote:
> >> >
> >> > So while the jury is out about how many other filesystems might use
> >> > it, I suspect it's more than you might think. At the very least,
> >> > there may be some IA64 users who might be trying to transition their
> >> > way to x86_64, and have existing filesystems using a 8k or 16k
> >> > block filesystems. :-)
> >>
> >> How much of a problem would it be if those blocks were not necessarily
> >> contiguous in RAM, but placed in normal 4K pages in the page cache?
> >
> > If you need to treat the block in a contiguous range, then you need to
> > vmap() the discontiguous pages. That has substantial overhead if you
> > have to do it regularly.
>
> Which is why I would prefer not to do it. I think vmap is not really
> compatible with the design of the linux page cache.

Right - so how do we efficiently manipulate data inside a large
block that spans multiple discontiguous pages if we don't vmap
it?

> Although we can't even count on the pages being mapped into low
> memory right now and have to call kmap if we want to access them
> so things might not be that bad. Even if it was a multipage kmap
> type operation.

Except when your structures span page boundaries. Then you can't directly
reference the structure - it needs to be copied out elsewhere, modified
and copied back. That's messy and will require significant modification
to any filesystem that wants large block sizes....
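
To make the copy-out, modify, copy-back pattern concrete, here is a minimal userspace sketch. The helper names and the fixed 4K page size are hypothetical, and real kernel code would also have to map each page before touching it:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)   /* assumed page size */

/* Copy `len` bytes starting at byte `off` of a page array into `dst`. */
static void sketch_copy_out(unsigned char *const *pages, size_t nr_pages,
			    size_t off, void *dst, size_t len)
{
	unsigned char *d = dst;

	assert(off <= nr_pages * SKETCH_PAGE_SIZE);
	assert(len <= nr_pages * SKETCH_PAGE_SIZE - off);
	while (len > 0) {
		size_t suboff = off % SKETCH_PAGE_SIZE;
		size_t chunk = SKETCH_PAGE_SIZE - suboff;  /* room left in page */

		if (chunk > len)
			chunk = len;
		memcpy(d, pages[off / SKETCH_PAGE_SIZE] + suboff, chunk);
		d += chunk;
		off += chunk;
		len -= chunk;
	}
}

/* Copy `len` bytes from `src` back into the page array at byte `off`. */
static void sketch_copy_in(unsigned char *const *pages, size_t nr_pages,
			   size_t off, const void *src, size_t len)
{
	const unsigned char *s = src;

	assert(off <= nr_pages * SKETCH_PAGE_SIZE);
	assert(len <= nr_pages * SKETCH_PAGE_SIZE - off);
	while (len > 0) {
		size_t suboff = off % SKETCH_PAGE_SIZE;
		size_t chunk = SKETCH_PAGE_SIZE - suboff;

		if (chunk > len)
			chunk = len;
		memcpy(pages[off / SKETCH_PAGE_SIZE] + suboff, s, chunk);
		s += chunk;
		off += chunk;
		len -= chunk;
	}
}
```

A structure that straddles a boundary would be fetched with `sketch_copy_out()` into a local copy, modified there, and written back with `sketch_copy_in()`.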

> > We do this in xfs_buf.c for > page size blocks - the overhead that
> > caused when operating on inode clusters resulted in us doing some
> > pointer fiddling and directly addressing the contents of each page
> > to avoid the vmap overhead. See xfs_buf_offset() and friends....
> >
> >> I expect meta data operations would have to be modified but that otherwise
> >> you would not care.
> >
> > I think you might need to modify the copy-in and copy-out operations
> > substantially (e.g. prepare_/commit_write()) as they assume a buffer doesn't
> > span multiple pages.....
>
> But in a filesystem like ext2, except for zeroing some unused hunks
> of the page, all that really happens is you set up for DMA straight out
> of the page cache. So this is primarily an issue for meta-data.

I'm not sure I follow you here - copyin/copyout is to userspace and
has to handle things like RMW cycles to a filesystem block. e.g. if
we get a partial block over-write, we need to read in all the bits
around it and that will span multiple discontiguous pages. Currently
these functions only handle RMW operations on something up to a
single page in size - to handle a RMW cycle on a block larger than a
page they are going to need substantial modification or entirely
new interfaces.

The high order page cache avoids the need to redesign interfaces
because it doesn't change the interfaces between the filesystem
and the page cache - everything still effectively operates
on single pages and the filesystem block size never exceeds the
size of a single page.....

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-07 06:43:57

by Eric W. Biederman

Subject: Re: [00/17] Large Blocksize Support V3

David Chinner <[email protected]> writes:

> On Sun, May 06, 2007 at 10:48:23PM -0600, Eric W. Biederman wrote:
>> David Chinner <[email protected]> writes:
>>
>> > On Fri, May 04, 2007 at 07:33:54AM -0600, Eric W. Biederman wrote:
>> >> >
>> >> > So while the jury is out about how many other filesystems might use
>> >> > it, I suspect it's more than you might think. At the very least,
>> >> > there may be some IA64 users who might be trying to transition their
>> >> > way to x86_64, and have existing filesystems using a 8k or 16k
>> >> > block filesystems. :-)
>> >>
>> >> How much of a problem would it be if those blocks were not necessarily
>> >> contiguous in RAM, but placed in normal 4K pages in the page cache?
>> >
>> > If you need to treat the block in a contiguous range, then you need to
>> > vmap() the discontiguous pages. That has substantial overhead if you
>> > have to do it regularly.
>>
>> Which is why I would prefer not to do it. I think vmap is not really
>> compatible with the design of the linux page cache.
>
> Right - so how do we efficiently manipulate data inside a large
> block that spans multiple discontiguous pages if we don't vmap
> it?

You don't manipulate data except for copy_from_user, copy_to_user.
That is comparatively easy to deal with, and certainly doesn't
need vmap.

Meta-data may be trickier, but a lot of that depends on your
individual filesystem and how it organizes its meta-data.

>> Although we can't even count on the pages being mapped into low
>> memory right now and have to call kmap if we want to access them
>> so things might not be that bad. Even if it was a multipage kmap
>> type operation.
>
> Except when your structures span page boundaries. Then you can't directly
> reference the structure - it needs to be copied out elsewhere, modified
> and copied back. That's messy and will require significant modification
> to any filesystem that wants large block sizes....

Potentially. This is just a meta-data problem, and possibly we
solve it with something like vmap. Possibly the filesystem won't
cross those kinds of boundaries and we simply never care.

The fact that it is a meta-data problem suggests it isn't the fast
path and we can incur a little more cost. Especially if filesystems
with large block sizes are rare.

> I'm not sure I follow you here - copyin/copyout is to userspace and
> has to handle things like RMW cycles to a filesystem block. e.g. if
> we get a partial block over-write, we need to read in all the bits
> around it and that will span multiple discontiguous pages. Currently
> these functions only handle RMW operations on something up to a
> single page in size - to handle a RMW cycle on a block larger than a
> page they are going to need substantial modification or entirely
> new interfaces.

Bleh. It has been too many days since I hacked on that code; I forgot
which piece that was. Yes. prepare_write is called before
we write to the page cache from the filesystem.

We already handle multiple page writes fairly well in that context.
prepare_write, commit_write may need a page cache but it may not.
All that really needs to happen is that all of the pages that
are part of the block get marked dirty in the page cache so one
won't get written without the others.
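
One way to picture the "dirty all pages of the block together" rule is the following userspace sketch. It assumes a 4K page size and a block size that is a whole multiple of it, and models page state with a plain array of flags rather than any real page cache interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)   /* assumed page size */

/*
 * Mark dirty every page that belongs to the same filesystem block as
 * page `index`, so that no page of the block can be written back
 * without its siblings. Returns the index of the block's first page.
 */
static size_t sketch_dirty_block(bool *dirty, size_t nr_pages,
				 size_t index, size_t block_size)
{
	size_t per_block, first, i;

	assert(block_size >= SKETCH_PAGE_SIZE);
	assert(block_size % SKETCH_PAGE_SIZE == 0);
	assert(index < nr_pages);

	per_block = block_size / SKETCH_PAGE_SIZE;
	first = index - index % per_block;     /* round down to block start */
	for (i = first; i < first + per_block && i < nr_pages; i++)
		dirty[i] = true;
	return first;
}
```

With 16K blocks, touching page 5 dirties pages 4 through 7.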

> The high order page cache avoids the need to redesign interfaces
> because it doesn't change the interfaces between the filesystem
> and the page cache - everything still effectively operates
> on single pages and the filesystem block size never exceeds the
> size of a single page.....

Yes, instead of having to redesign the interface between the
fs and the page cache for those filesystems that handle large
blocks we instead need to redesign significant parts of the VM interface.
Shift the redesign work to another group of people and call it trivial.

That is hardly a gain when it looks like you can have the same effect
with some moderately simple changes to mm/filemap.c and the existing
interfaces.

Eric

2007-05-07 06:49:00

by William Lee Irwin III

Subject: Re: [00/17] Large Blocksize Support V3

David Chinner <[email protected]> writes:
>> Right - so how do we efficiently manipulate data inside a large
>> block that spans multiple discontiguous pages if we don't vmap
>> it?

On Mon, May 07, 2007 at 12:43:19AM -0600, Eric W. Biederman wrote:
> You don't manipulate data except for copy_from_user, copy_to_user.
> That is comparatively easy to deal with, and certainly doesn't
> need vmap.
> Meta-data may be trickier, but a lot of that depends on your
> individual filesystem and how it organizes its meta-data.

I wonder what happened to my pagearray patches.


-- wli

2007-05-07 07:06:36

by William Lee Irwin III

Subject: Re: [00/17] Large Blocksize Support V3

David Chinner <[email protected]> writes:
>>> Right - so how do we efficiently manipulate data inside a large
>>> block that spans multiple discontiguous pages if we don't vmap
>>> it?

On Mon, May 07, 2007 at 12:43:19AM -0600, Eric W. Biederman wrote:
>> You don't manipulate data except for copy_from_user, copy_to_user.
>> That is comparatively easy to deal with, and certainly doesn't
>> need vmap.
>> Meta-data may be trickier, but a lot of that depends on your
>> individual filesystem and how it organizes its meta-data.

On Sun, May 06, 2007 at 11:49:25PM -0700, William Lee Irwin III wrote:
> I wonder what happened to my pagearray patches.

I never really got the thing working, but I had an idea for a sort of
library to do this. This is/was probably against something like 2.6.5
but I honestly have no idea. Maybe this makes it something of an API
proposal.


-- wli


Index: linux-2.6/include/linux/pagearray.h
===================================================================
--- linux-2.6.orig/include/linux/pagearray.h 2004-04-06 10:56:48.000000000 -0700
+++ linux-2.6/include/linux/pagearray.h 2005-04-22 06:06:02.677494584 -0700
@@ -0,0 +1,24 @@
+#ifndef _LINUX_PAGEARRAY_H
+#define _LINUX_PAGEARRAY_H
+
+struct scatterlist;
+struct vm_area_struct;
+struct page;
+
+struct pagearray {
+ struct page **pages;
+ int nr_pages;
+ size_t length;
+};
+
+int alloc_page_array(struct pagearray *, const int, const size_t);
+void free_page_array(struct pagearray *);
+void zero_page_array(struct pagearray *);
+struct page *nopage_page_array(const struct vm_area_struct *, unsigned long, unsigned long, int *, struct pagearray *);
+int mmap_page_array(const struct vm_area_struct *, struct pagearray *, const size_t, const size_t);
+int copy_page_array_to_user(struct pagearray *, void __user *, const size_t, const size_t);
+int copy_page_array_from_user(struct pagearray *, void __user *, const size_t, const size_t);
+struct scatterlist *pagearray_to_scatterlist(struct pagearray *, size_t, size_t, int *);
+void *vmap_pagearray(struct pagearray *);
+
+#endif /* _LINUX_PAGEARRAY_H */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2005-04-22 06:01:29.786980248 -0700
+++ linux-2.6/mm/Makefile 2005-04-22 06:06:02.677494584 -0700
@@ -10,7 +10,7 @@
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
readahead.o slab.o swap.o truncate.o vmscan.o \
- prio_tree.o $(mmu-y)
+ prio_tree.o pagearray.o $(mmu-y)

obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
Index: linux-2.6/mm/pagearray.c
===================================================================
--- linux-2.6.orig/mm/pagearray.c 2004-04-06 10:56:48.000000000 -0700
+++ linux-2.6/mm/pagearray.c 2005-04-22 06:20:26.154226168 -0700
@@ -0,0 +1,293 @@
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/module.h>
+#include <linux/highmem.h>
+#include <linux/pagearray.h>
+#include <asm/uaccess.h>
+#include <asm/scatterlist.h>
+
+/**
+ * alloc_page_array - allocate an array of pages
+ * @pages: the array of pages to be allocated
+ * @gfp_mask: the GFP flags to be passed to the allocator
+ * @length: the amount of data the array needs to hold
+ *
+ * Allocate an array of page pointers long enough so that when full of
+ * pages, the amount of data in length may be stored, then allocate the
+ * pages for each position in the array.
+ */
+int alloc_page_array(struct pagearray *pages, const int gfp_mask, const size_t length)
+{
+ int k;
+ pages->length = PAGE_ALIGN(length);
+ pages->nr_pages = PAGE_ALIGN(length) >> PAGE_SHIFT;
+ pages->pages = kmalloc(pages->nr_pages*sizeof(struct page *), gfp_mask);
+ if (!pages->pages)
+ return -ENOMEM;
+ memset(pages->pages, 0, pages->nr_pages*sizeof(struct page *));
+ for (k = 0; k < pages->nr_pages; ++k) {
+ pages->pages[k] = alloc_page(gfp_mask);
+ if (!pages->pages[k])
+ goto enomem;
+ }
+ return 0;
+enomem:
+ for (--k; k >= 0; --k)
+ __free_page(pages->pages[k]);
+ kfree(pages->pages);
+ memset(pages, 0, sizeof(struct pagearray));
+ return -ENOMEM;
+}
+EXPORT_SYMBOL(alloc_page_array);
+
+/**
+ * free_page_array - free an array of pages
+ * @pages: the array of pages to be freed
+ *
+ * Free an array of pages, including the pages pointed to by the array.
+ */
+void free_page_array(struct pagearray *pages)
+{
+ int k;
+ for (k = 0; k < pages->nr_pages; ++k)
+ __free_page(pages->pages[k]);
+ kfree(pages->pages);
+ memset(pages, 0, sizeof(struct pagearray));
+}
+EXPORT_SYMBOL(free_page_array);
+
+/**
+ * zero_page_array - zero an array of pages
+ * @pages: the array of pages
+ *
+ * Zero out a set of pages pointed to by an array of page pointers.
+ */
+void zero_page_array(struct pagearray *pages)
+{
+ int k;
+ for (k = 0; k < pages->nr_pages; ++k)
+ clear_highpage(pages->pages[k]);
+}
+EXPORT_SYMBOL(zero_page_array);
+
+/**
+ * nopage_page_array - retrieve the page to satisfy a fault with
+ * @vma: the user virtual memory area the fault occurred on
+ * @pgoff: an offset into the underlying array to add to ->vm_pgoff
+ * @vaddr: the user virtual address the fault occurred on
+ * @type: the type of fault that occurred, to be returned
+ * @pages: the array of page pointers
+ *
+ * This is a trivial helper for ->nopage() methods. Simply return the
+ * result of this function after retrieving the page array and its
+ * descriptive parameters from vma->vm_private_data, for instance:
+ * return nopage_page_array(vma, pgoff, vaddr, type, pages);
+ * as the last thing in the ->nopage() method after fetching the
+ * parameters from vma->vm_private_data.
+ */
+struct page *nopage_page_array(const struct vm_area_struct *vma, unsigned long pgoff, unsigned long vaddr, int *type, struct pagearray *pages)
+{
+ if (vaddr >= vma->vm_end)
+ goto sigbus;
+ pgoff += vma->vm_pgoff + ((vaddr - vma->vm_start) >> PAGE_SHIFT);
+ if (pgoff >= PAGE_ALIGN(pages->length)/PAGE_SIZE)
+ goto sigbus;
+ if (pgoff >= pages->nr_pages)
+ goto sigbus;
+ get_page(pages->pages[pgoff]);
+ if (type)
+ *type = VM_FAULT_MINOR;
+ return pages->pages[pgoff];
+sigbus:
+ if (type)
+ *type = VM_FAULT_SIGBUS;
+ return NOPAGE_SIGBUS;
+}
+EXPORT_SYMBOL(nopage_page_array);
+
+/**
+ * mmap_page_array - mmap an array of pages
+ * @vma: the vma where the mmapping is done
+ * @pages: the array of page pointers
+ * @offset: the offset into the vma in bytes where mmapping should be done
+ * @length: the amount of data that should be mmap'd, in bytes
+ *
+ * vma->vm_pgoff specifies how far out into the page array mmapping
+ * should be done. The page array is treated as a list of the pieces
+ * of an object and vma->vm_pgoff the offset into that object.
+ * vma->vm_page_prot in turn specifies the protections to map with.
+ * offset says where in userspace relative to vma->vm_start to put
+ * the mappings of the pieces of the page array. length specifies how
+ * much data should be mapped into userspace.
+ */
+#ifdef CONFIG_MMU
+int mmap_page_array(const struct vm_area_struct *vma, struct pagearray *pages, const size_t offset, const size_t length)
+{
+ int k, ret = 0;
+ unsigned long end, off, vaddr = vma->vm_start + offset;
+ off = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+ end = vaddr + length;
+ if (vaddr >= end)
+ return -EINVAL;
+ else if (offset != PAGE_ALIGN(offset))
+ return -EINVAL;
+ else if (offset + length > pages->length)
+ return -EINVAL;
+ k = off >> PAGE_SHIFT;
+ while (vaddr < end && !ret) {
+ pgd_t *pgd;
+ pud_t *pud;
+
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pgd = pgd_offset(vma->vm_mm, vaddr);
+ pud = pud_alloc(vma->vm_mm, pgd, vaddr);
+ if (!pud) {
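+ ret = -ENOMEM;
+ /* no break: fall through to spin_unlock(); !ret ends the loop */
+ } else {
+ pmd_t *pmd = pmd_alloc(vma->vm_mm, pud, vaddr);
+ if (!pmd) {
+ ret = -ENOMEM;
+ /* no break: the lock must be dropped below */
+ } else {
+ pte_t val, *pte;
+
+ pte = pte_alloc_map(vma->vm_mm, pmd, vaddr);
+ if (!pte) {
+ ret = -ENOMEM;
+ /* no break: the lock must be dropped below */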
+ } else {
+ val = mk_pte(pages->pages[k], vma->vm_page_prot);
+ set_pte(pte, val);
+ pte_unmap(pte);
+ update_mmu_cache(vma, vaddr, val);
+ }
+ }
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ vaddr += PAGE_SIZE;
+ off += PAGE_SIZE;
+ ++k;
+ }
+ return ret;
+}
+#else
+int mmap_page_array(const struct vm_area_struct *vma, struct pagearray *pages, const size_t offset, const size_t length)
+{
+ return -ENOSYS;
+}
+#endif
+EXPORT_SYMBOL(mmap_page_array);
+
+static int copy_page_array(struct pagearray *pages, char __user *buf, const size_t offset, const size_t length, const int rw)
+{
+ size_t pos = 0, off = offset, remaining = length;
+ int k;
+
+ if (offset > pages->length || length > pages->length - offset)
+ return -EFAULT;
+ else if (length > MM_VM_SIZE(current->mm))
+ return -EFAULT;
+ else if ((unsigned long)buf > MM_VM_SIZE(current->mm) - length)
+ return -EFAULT;
+
+ for (k = off >> PAGE_SHIFT; k < pages->nr_pages && remaining > 0; ++k) {
+ unsigned long left, tail, suboff = off & ~PAGE_MASK;
+ char *kbuf = kmap_atomic(pages->pages[k], KM_USER0);
+ tail = min(PAGE_SIZE - suboff, (unsigned long)remaining);
+ if (rw)
+ left = __copy_to_user(&buf[pos], &kbuf[suboff], tail);
+ else
+ left = __copy_from_user(&kbuf[suboff], &buf[pos], tail);
+ kunmap_atomic(kbuf, KM_USER0);
+ if (left) {
+ kbuf = kmap(pages->pages[k]);
+ if (rw)
+ left = __copy_to_user(&buf[pos], &kbuf[suboff], tail);
+ else
+ left = __copy_from_user(&kbuf[suboff], &buf[pos], tail);
+ kunmap(pages->pages[k]);
+ }
+ BUG_ON(tail - left > remaining);
+ remaining -= tail - left;
+ pos += tail - left;
+ off = (off + PAGE_SIZE) & PAGE_MASK;
+ if (left)
+ break;
+ }
+ return remaining;
+}
+
+/**
+ * copy_page_array_to_user - copy data from a page array to userspace
+ * @pages: the array of page pointers holding the data
+ * @buf: the user virtual address to start depositing the data at
+ * @offset: the offset into the page array to start copying data from
+ * @length: how much data to copy
+ *
+ * Copy data from a page array, starting offset bytes into the array
+ * when it's treated as a list of the pieces of an object in order,
+ * to userspace.
+ */
+int copy_page_array_to_user(struct pagearray *pages, void __user *buf, const size_t offset, const size_t length)
+{
+ return copy_page_array(pages, buf, offset, length, 1);
+}
+EXPORT_SYMBOL(copy_page_array_to_user);
+
+/**
+ * copy_page_array_from_user - copy data from userspace to a page array
+ * @pages: the array of page pointers holding the data
+ * @buf: the user virtual address to start reading the data from
+ * @offset: the offset into the page array to start copying data to
+ * @length: how much data to copy
+ *
+ * Copy data to a page array, starting offset bytes into the array
+ * when it's treated as a list of the pieces of an object in order,
+ * from userspace.
+ */
+int copy_page_array_from_user(struct pagearray *pages, void __user *buf, const size_t offset, const size_t length)
+{
+ return copy_page_array(pages, buf, offset, length, 0);
+}
+EXPORT_SYMBOL(copy_page_array_from_user);
+
+/**
+ * pagearray_to_scatterlist - generate a scatterlist for a slice of a pagearray
+ * @pages: the pagearray to make a scatterlist for
+ * @offset: the offset into the pagearray of the start of the slice
+ * @length: the length of the slice of the pagearray
+ * @sglist_len: the size of the generated scatterlist
+ *
+ * Set up a scatterlist covering a slice of a pagearray, starting at offset
+ * bytes into the pagearray, with length length.
+ */
+struct scatterlist *pagearray_to_scatterlist(struct pagearray *pages, size_t offset, size_t length, int *sglist_len)
+{
+ struct scatterlist *sg;
+ int i, nr_pages =
+ (PAGE_ALIGN(offset + length) - (offset & PAGE_MASK))/PAGE_SIZE;
+ sg = kmalloc(nr_pages * sizeof(struct scatterlist), GFP_KERNEL);
+ if (!sg)
+ return NULL;
+ memset(sg, 0, nr_pages * sizeof(struct scatterlist));
+ sg[0].page = pages->pages[offset >> PAGE_SHIFT];
+ sg[0].offset = offset & ~PAGE_MASK;
+ sg[0].length = PAGE_SIZE - sg[0].offset;
+ offset = (offset + PAGE_SIZE) & PAGE_MASK;
+ for (i = 1; i < nr_pages - 1; ++i) {
+ sg[i].page = pages->pages[i];
+ sg[i].length = PAGE_SIZE;
+ }
+ sg[i].page = pages->pages[i];
+ sg[i].length = (offset + length) & ~PAGE_MASK;
+ *sglist_len = nr_pages;
+ return sg;
+}
+EXPORT_SYMBOL(pagearray_to_scatterlist);
+
+void *vmap_pagearray(struct pagearray *pages)
+{
+ return vmap(pages->pages, pages->nr_pages, VM_MAP, PAGE_KERNEL);
+}
+EXPORT_SYMBOL(vmap_pagearray);

2007-05-07 16:06:15

by Christoph Lameter

Subject: Re: [00/17] Large Blocksize Support V3

On Mon, 7 May 2007, Eric W. Biederman wrote:

> Yes, instead of having to redesign the interface between the
> fs and the page cache for those filesystems that handle large
> blocks we instead need to redesign significant parts of the VM interface.
> Shift the redesign work to another group of people and call it trivial.

To some extent that is true. But then there will also be additional
gain: We can likely get the VM to handle larger pages too which may get
rid of hugetlb fs etc. The work is pretty straightforward: No locking
changes f.e. So hardly a redesign. I think the crucial point is the
antifrag/defrag issue if we want to generalize it.

I have an updated patch here that relies on page reservations. Adds
something called page pools. On bootup you need to specify how many pages
of each size you want. The page cache will then use those pages for
filesystems that need larger blocksize.

The interesting thing about that one is that it actually enables support
for multiple blocksizes with a single larger pagesize. If f.e. we setup a
pool of 64k pages then the block layer can segment that into 16k pieces.
So one can actually use 16k 32k and 64k block size with a single larger
page size.
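
The arithmetic behind carving one pool page into several block sizes can be sketched as follows. This is illustration only: the function names are made up, and the rule that a block size must divide the pool page size evenly is an assumption of the sketch, not something taken from the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of blocks of `block_size` bytes that one pool page can hold,
 * or 0 if the block size does not divide the pool page evenly.
 */
static size_t sketch_blocks_per_page(size_t pool_page_size, size_t block_size)
{
	if (block_size == 0 || block_size > pool_page_size ||
	    pool_page_size % block_size != 0)
		return 0;
	return pool_page_size / block_size;
}

/* Byte offset of block number `n` within the pool page that holds it. */
static size_t sketch_block_offset(size_t pool_page_size, size_t block_size,
				  size_t n)
{
	size_t per_page = sketch_blocks_per_page(pool_page_size, block_size);

	assert(per_page > 0);
	return (n % per_page) * block_size;
}
```

For a 64k pool page this gives four 16k blocks, two 32k blocks, or one 64k block per page.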

2007-05-07 17:29:18

by William Lee Irwin III

Subject: Re: [00/17] Large Blocksize Support V3

On Mon, 7 May 2007, Eric W. Biederman wrote:
>> Yes, instead of having to redesign the interface between the
>> fs and the page cache for those filesystems that handle large
>> blocks we instead need to redesign significant parts of the VM interface.
>> Shift the redesign work to another group of people and call it trivial.

On Mon, May 07, 2007 at 09:06:05AM -0700, Christoph Lameter wrote:
> To some extent that is true. But then there will also be additional
> gain: We can likely get the VM to handle larger pages too which may get
> rid of hugetlb fs etc. The work is pretty straightforward: No locking
> changes f.e. So hardly a redesign. I think the crucial point is the
> antifrag/defrag issue if we want to generalize it.

Sadly, a backward compatibility stub must be retained in perpetuity.
It should be able to be reduced to the point it doesn't need its own
dedicated source files or config options, but it'll need something to
deal with the arch code.


-- wli

2007-05-08 08:49:14

by William Lee Irwin III

Subject: Re: [00/17] Large Blocksize Support V3

On Mon, May 07, 2007 at 12:06:38AM -0700, William Lee Irwin III wrote:
> +int alloc_page_array(struct pagearray *, const int, const size_t);
> +void free_page_array(struct pagearray *);
> +void zero_page_array(struct pagearray *);
> +struct page *nopage_page_array(const struct vm_area_struct *, unsigned long, unsigned long, int *, struct pagearray *);
> +int mmap_page_array(const struct vm_area_struct *, struct pagearray *, const size_t, const size_t);
> +int copy_page_array_to_user(struct pagearray *, void __user *, const size_t, const size_t);
> +int copy_page_array_from_user(struct pagearray *, void __user *, const size_t, const size_t);
> +struct scatterlist *pagearray_to_scatterlist(struct pagearray *, size_t, size_t, int *);
> +void *vmap_pagearray(struct pagearray *);

This should probably have memcpy to/from pagearrays. Whole-hog read
and write f_op implementations would be good, too, since ISTR some
drivers basically do little besides that on their internal buffers.

vmap_pagearray() should take flags, esp. VM_IOREMAP but perhaps also
protections besides PAGE_KERNEL in case uncachedness is desirable. I'm
not entirely sure what it'd be used for if discontiguity is so heavily
supported. My wild guess is drivers that do things that are just too
weird to support with the discontig API, since that's how I used it.
It should support vmap()'ing interior sub-ranges, too.

The pagearray mmap() support is schizophrenic as to whether it prefills
or faults and not all that complete as far as manipulating the mmap()
goes. Shooting down ptes, flipping pages, or whatever drivers actually
do with the things should have helpers arranged. Coherent sets of
helpers for faulting vs. mmap()'ing idioms would be good.

pagearray_to_scatterlist() should probably take the scatterlist as an
argument instead of allocating the scatterlist itself.
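
A caller-supplied variant might look roughly like the userspace sketch below. `struct sketch_seg` is a hypothetical stand-in for a scatterlist entry and the 4K page size is assumed; the point is only the slicing of (offset, length) into per-page pieces:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)   /* assumed page size */

/* Hypothetical stand-in for one scatterlist entry. */
struct sketch_seg {
	size_t page_index;   /* which page of the array */
	size_t offset;       /* byte offset within that page */
	size_t length;       /* bytes covered within that page */
};

/*
 * Describe the slice [offset, offset + length) of a page array as
 * per-page segments stored in the caller's `segs` array. Returns the
 * number of entries used, or -1 if `max_segs` entries are not enough.
 */
static int sketch_fill_segs(struct sketch_seg *segs, int max_segs,
			    size_t offset, size_t length)
{
	int n = 0;

	assert(segs != NULL && max_segs >= 0);
	while (length > 0) {
		size_t suboff = offset % SKETCH_PAGE_SIZE;
		size_t chunk = SKETCH_PAGE_SIZE - suboff;  /* room left in page */

		if (chunk > length)
			chunk = length;
		if (n == max_segs)
			return -1;
		segs[n].page_index = offset / SKETCH_PAGE_SIZE;
		segs[n].offset = suboff;
		segs[n].length = chunk;
		n++;
		offset += chunk;
		length -= chunk;
	}
	return n;
}
```

Only the first segment can start mid-page and only the last can end mid-page, which also covers the case where the whole slice fits in a single page.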

Something to construct bio's from pagearrays might help.

s/page_array/pagearray/g should probably be done. Prefixing with
pagearray_ instead of randomly positioning it within the name would
be good, too.

Some working API conversions on drivers sound like a good idea. I had
some large number of API conversions about, now lost, but they'd be
bitrotted anyway.

struct pagearray is better off as an opaque type so large pagearray
handling can be added in later via radix trees or some such, likewise
for expansion and contraction. Keeping drivers' hands off the internals
is just a good idea in general.

I'm somewhat less clear on what filesystems need to do here, or if it
would be useful for them to efficiently manipulate data inside a
large block that spans multiple discontiguous pages. I expect some
changes are needed at the very least to fill a pagearray with whatever
predetermined pages are needed. Filesystems probably need other changes
to handle sparse pagearrays and refilling pages within them via IO.


-- wli