2003-05-13 22:15:55

by Andrew Morton

Subject: Re: [RFC][PATCH] Interface to invalidate regions of mmaps

"Paul E. McKenney" <[email protected]> wrote:
>
> This patch adds an API to allow networked and distributed filesystems
> to invalidate portions of (or all of) a file. This is needed to
> provide POSIX or near-POSIX semantics in such filesystems, as
> discussed on LKML late last year:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103609089604576&w=2
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103167761917669&w=2
>
> Thoughts?

What filesystems would be needing this, and when could we see live code
which actually uses it?

> +/*
> + * Helper function for invalidate_mmap_range().
> + * Both hba and hlen are page numbers in PAGE_SIZE units.
> + */
> +static void
> +invalidate_mmap_range_list(struct list_head *head,
> + unsigned long const hba,
> + unsigned long const hlen)

Be nice to consolidate this with vmtruncate_list, so that it gets
exercised.


2003-05-13 22:58:49

by Zach Brown

Subject: Re: [RFC][PATCH] Interface to invalidate regions of mmaps

Andrew Morton wrote:

> What filesystems would be needing this, and when could we see live code
> which actually uses it?

on the one hand, lustre would very much like something like this. our
posix IO guarantees are centered around a DLM that knows about file
extents and the presence of pages in the page cache is tied to holding
these locks. it's very common for us to get a lock cancelation which
invalidates a region of a file that falls in the middle of what is cached.

worse still, our (possibly gi-normous) files are backed by striping the
file across multiple storage targets and the locks live on these
targets. if you imagine a file that is built by alternating 64k-wide
stripes across 4 targets, we can get a lock cancelation that invalidates
pages at offset 0->15, 64->79, 128->143, and so on.
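
The striping arithmetic above can be sketched in userspace C. This is only an illustration, not Lustre code: it assumes 4k pages (so a 64k stripe is 16 pages), and the names `stripe_layout`, `page_range` and `target_page_ranges` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout description; not a Lustre structure. */
struct stripe_layout {
	unsigned long stripe_pages;	/* pages per stripe, e.g. 64k / 4k = 16 */
	unsigned long nr_targets;	/* number of storage targets */
};

struct page_range {
	unsigned long first;		/* first page index, inclusive */
	unsigned long last;		/* last page index, inclusive */
};

/*
 * Write into out[] the page ranges of a file of file_pages pages that
 * live on the given target, stopping after max ranges.
 * Returns the number of ranges written.
 */
size_t target_page_ranges(const struct stripe_layout *l,
			  unsigned long target,
			  unsigned long file_pages,
			  struct page_range *out, size_t max)
{
	unsigned long stride, start;
	size_t n = 0;

	assert(l->stripe_pages > 0 && l->nr_targets > 0);
	assert(target < l->nr_targets);

	/* Each target owns one stripe out of every nr_targets stripes. */
	stride = l->stripe_pages * l->nr_targets;
	for (start = target * l->stripe_pages;
	     start < file_pages && n < max;
	     start += stride) {
		unsigned long last = start + l->stripe_pages - 1;

		/* Clamp the final stripe to the end of the file. */
		if (last >= file_pages)
			last = file_pages - 1;
		out[n].first = start;
		out[n].last = last;
		n++;
	}
	return n;
}
```

With `stripe_pages = 16` and `nr_targets = 4`, a lock cancelation on target 0 covers one disjoint range per stripe, which is why a single contiguous invalidate call is not enough here.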

so what we'd like most is the ability to invalidate a region of the file
in an efficient go.

void truncate_inode_pages(struct address_space * mapping, loff_t lstart,
loff_t end)

that sort of thing. this might not suck so bad if the page cache was an
rbtree :) in any case, what we've been doing so far is tracking dirty
page offsets in our own rbtree thing in lustre and calling
truncate_complete_page for these offsets as locks are canceled. (our
locks are page-aligned, so we don't worry so much about partial page
pain in these particular paths).

but on the other hand, this doesn't solve another problem we have with
opportunistic lock extents and sparse page cache populations. Ideally
we'd like a FS specific pointer in struct page so we can associate pages
in the cache with a lock, but I can't imagine suggesting such a thing
within earshot of wli. so we'd still have to track the dirty offsets to
avoid having to pass through offsets 0 ... i_size only to find that one
page in the 8T file that was cached.

https://lxr.lustre.org/source/llite/file.c?v=b_devel#602

is the most relevant part of the story.

- z


2003-05-13 23:11:19

by Andrew Morton

Subject: Re: [RFC][PATCH] Interface to invalidate regions of mmaps

Zach Brown <[email protected]> wrote:
>
> so what we'd like most is the ability to invalidate a region of the file
> in an efficient go.
>
> void truncate_inode_pages(struct address_space * mapping, loff_t lstart,
> loff_t end)
>
> that sort of thing.

That's trivial in 2.5.

> this might not suck so bad if the page cache was an
> rbtree :)

Or a radix tree.

> but on the other hand, this doesn't solve another problem we have with
> opportunistic lock extents and sparse page cache populations. Ideally
> we'd like a FS specific pointer in struct page so we can associate pages
> in the cache with a lock,

In 2.5, page->buffers was abstracted out to page->private, and is available
to filesystems for functions such as this.
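
As a rough userspace model of that idea, the sketch below stores a filesystem-specific lock pointer in a per-page word. Everything here is hypothetical (`mock_page`, `mock_extent_lock` and the helpers are not kernel or Lustre names); the only thing taken from the discussion is that a filesystem may keep its own data behind a private per-page field, assumed here to be an unsigned long wide enough to hold a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent lock covering pages first_page..last_page. */
struct mock_extent_lock {
	unsigned long first_page;
	unsigned long last_page;
};

/* Userspace stand-in for a page frame; not the kernel's struct page. */
struct mock_page {
	unsigned long index;	/* page offset within the file */
	unsigned long private;	/* filesystem-owned word, 0 when unused */
};

/* Associate a cached page with the lock that protects it. */
void mock_page_attach_lock(struct mock_page *page,
			   struct mock_extent_lock *lock)
{
	assert(page->private == 0);	/* the word must not already be in use */
	/* Assumes a pointer fits in an unsigned long. */
	page->private = (unsigned long)lock;
}

/* Return the lock attached to the page, or NULL if there is none. */
struct mock_extent_lock *mock_page_lock(const struct mock_page *page)
{
	return (struct mock_extent_lock *)page->private;
}

/* Nonzero if the page has a lock and lies inside that lock's extent. */
int mock_page_covered(const struct mock_page *page)
{
	const struct mock_extent_lock *lock = mock_page_lock(page);

	return lock != NULL &&
	       page->index >= lock->first_page &&
	       page->index <= lock->last_page;
}
```

Going from a page to its lock this way avoids a separate offset-to-lock lookup when deciding which cached pages a cancelation affects.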


> but I can't imagine suggesting such a thing
> within earshot of wli.

wli doesn't have to run your kernel. If you want to add a pointer to the
pageframe, go add it. But I'd suggest that you do it with a view to
migrating it to page->private.

When you finally decide to do your development in a development kernel ;)


2003-05-13 23:14:34

by William Lee Irwin III

Subject: Re: [RFC][PATCH] Interface to invalidate regions of mmaps

On Tue, May 13, 2003 at 04:11:31PM -0700, Zach Brown wrote:
> but on the other hand, this doesn't solve another problem we have with
> opportunistic lock extents and sparse page cache populations. Ideally
> we'd like a FS specific pointer in struct page so we can associate pages
> in the cache with a lock, but I can't imagine suggesting such a thing
> within earshot of wli. so we'd still have to track the dirty offsets to
> avoid having to pass through offsets 0 ... i_size only to find that one
> page in the 8T file that was cached.

Nah, don't worry about sizeof(struct page) anymore; I'll just jack up
PAGE_SIZE to compensate.


-- wli

2003-05-13 23:31:37

by Paul E. McKenney

Subject: Re: [RFC][PATCH] Interface to invalidate regions of mmaps

On Tue, May 13, 2003 at 03:21:41PM -0700, Andrew Morton wrote:
> "Paul E. McKenney" <[email protected]> wrote:
> >
> > This patch adds an API to allow networked and distributed filesystems
> > to invalidate portions of (or all of) a file. This is needed to
> > provide POSIX or near-POSIX semantics in such filesystems, as
> > discussed on LKML late last year:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103609089604576&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103167761917669&w=2
> >
> > Thoughts?
>
> What filesystems would be needing this, and when could we see live code
> which actually uses it?

Working on getting it out... But I suspect that others need
this functionality as well, given the threads noted above.

> > +/*
> > + * Helper function for invalidate_mmap_range().
> > + * Both hba and hlen are page numbers in PAGE_SIZE units.
> > + */
> > +static void
> > +invalidate_mmap_range_list(struct list_head *head,
> > + unsigned long const hba,
> > + unsigned long const hlen)
>
> Be nice to consolidate this with vmtruncate_list, so that it gets
> exercised.

Good point from both you and wli -- here is the updated vmtruncate
patch (now depends on the invalidate_mmap_range patch).

Thanx, Paul

diff -urN -X dontdiff linux-2.5.69.invalidate_mmap_range/mm/memory.c linux-2.5.69.vmtruncate/mm/memory.c
--- linux-2.5.69.invalidate_mmap_range/mm/memory.c Tue May 13 14:56:41 2003
+++ linux-2.5.69.vmtruncate/mm/memory.c Tue May 13 15:19:23 2003
@@ -1063,6 +1063,7 @@
/*
* Helper function for invalidate_mmap_range().
* Both hba and hlen are page numbers in PAGE_SIZE units.
+ * An hlen of zero blows away the entire portion of the file after hba.
*/
static void
invalidate_mmap_range_list(struct list_head *head,
@@ -1078,6 +1079,8 @@
unsigned long zea;

hea = hba + hlen - 1; /* avoid overflow. */
+ if (hea < hba)
+ hea = ULONG_MAX;
list_for_each(curr, head) {
vp = list_entry(curr, struct vm_area_struct, shared);
vba = vp->vm_pgoff;
@@ -1128,37 +1131,6 @@
up(&mapping->i_shared_sem);
}

-static void vmtruncate_list(struct list_head *head, unsigned long pgoff)
-{
- unsigned long start, end, len, diff;
- struct vm_area_struct *vma;
- struct list_head *curr;
-
- list_for_each(curr, head) {
- vma = list_entry(curr, struct vm_area_struct, shared);
- start = vma->vm_start;
- end = vma->vm_end;
- len = end - start;
-
- /* mapping wholly truncated? */
- if (vma->vm_pgoff >= pgoff) {
- zap_page_range(vma, start, len);
- continue;
- }
-
- /* mapping wholly unaffected? */
- len = len >> PAGE_SHIFT;
- diff = pgoff - vma->vm_pgoff;
- if (diff >= len)
- continue;
-
- /* Ok, partially affected.. */
- start += diff << PAGE_SHIFT;
- len = (len - diff) << PAGE_SHIFT;
- zap_page_range(vma, start, len);
- }
-}
-
/*
* Handle all mappings that got truncated by a "truncate()"
* system call.
@@ -1176,17 +1148,12 @@
if (inode->i_size < offset)
goto do_expand;
inode->i_size = offset;
+ pgoff = (offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
down(&mapping->i_shared_sem);
- if (list_empty(&mapping->i_mmap) && list_empty(&mapping->i_mmap_shared))
- goto out_unlock;
-
- pgoff = (offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (!list_empty(&mapping->i_mmap))
- vmtruncate_list(&mapping->i_mmap, pgoff);
- if (!list_empty(&mapping->i_mmap_shared))
- vmtruncate_list(&mapping->i_mmap_shared, pgoff);
-
-out_unlock:
+ if (unlikely(!list_empty(&mapping->i_mmap)))
+ invalidate_mmap_range_list(&mapping->i_mmap, pgoff, 0);
+ if (unlikely(!list_empty(&mapping->i_mmap_shared)))
+ invalidate_mmap_range_list(&mapping->i_mmap_shared, pgoff, 0);
up(&mapping->i_shared_sem);
truncate_inode_pages(mapping, offset);
goto out_truncate;
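
The range arithmetic in the patch above can be checked in isolation. The following is a userspace sketch, not the kernel code: `SKETCH_PAGE_SHIFT` assumes 4k pages, and the function names, along with `zba`, are made up for illustration (`hba`, `hlen`, `hea`, `vba` and `zea` follow the patch).

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SHIFT	12UL	/* assumption: 4k pages */
#define SKETCH_PAGE_SIZE	(1UL << SKETCH_PAGE_SHIFT)

/* Round a byte offset up to a page index, as vmtruncate() does for pgoff. */
unsigned long first_page_past(unsigned long long offset)
{
	return (unsigned long)((offset + SKETCH_PAGE_SIZE - 1) >>
			       SKETCH_PAGE_SHIFT);
}

/*
 * Last page of the range starting at hba with length hlen.
 * Unsigned arithmetic wraps, so hlen == 0 (or a range running past
 * ULONG_MAX) yields hea < hba, which is taken to mean "to end of file".
 */
unsigned long range_last_page(unsigned long hba, unsigned long hlen)
{
	unsigned long hea = hba + hlen - 1;

	if (hea < hba)
		hea = ULONG_MAX;
	return hea;
}

/*
 * Intersect a mapping covering file pages vba..vea with the range
 * hba..hea (all inclusive). Returns 0 if they do not overlap; otherwise
 * stores the pages to zap in *zba..*zea and returns 1.
 */
int range_overlap(unsigned long vba, unsigned long vea,
		  unsigned long hba, unsigned long hea,
		  unsigned long *zba, unsigned long *zea)
{
	if (hea < vba || vea < hba)
		return 0;
	*zba = hba > vba ? hba : vba;
	*zea = hea < vea ? hea : vea;
	return 1;
}
```

Truncating to a byte offset in the middle of page 14 gives `pgoff = 15`; with `hlen = 0` the range runs to `ULONG_MAX`, so a mapping of pages 10..19 is zapped from page 15 onward, which matches the "partially affected" case of the removed vmtruncate_list().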

2003-05-13 23:44:53

by Zach Brown

Subject: Re: [RFC][PATCH] Interface to invalidate regions of mmaps


> In 2.5, page->buffers was abstracted out to page->private, and is available
> to filesystems for functions such as this.

that's great news!

> When you finally decide to do your development in a development kernel ;)

customers seem to have the strangest aversion to development kernels :)

but, yeah, I should be doing 2.5 work soon and will holler if
simplifications make themselves apparent.

- z