Hi,
Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
to use this to pre-fault pages. He currently uses mlock/munlock for
this purpose.
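Roughly, the goal is to replace that userspace hack with a single
advisory call; a sketch (buf page-aligned, error checking omitted):

	/* today: fault the range in by locking, then undo the locking */
	mlock(buf, len);
	munlock(buf, len);

	/* with this patch: purely advisory, no VM_LOCKED side effects */
	madvise(buf, len, MADV_WILLNEED);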
[ compile tested only ]
Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/mm/madvise.c b/mm/madvise.c
index 93ee375..eff60ce 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -100,6 +100,24 @@ out:
return error;
}

+static long madvice_willneed_anon(struct vm_area_struct *vma,
+				  struct vm_area_struct **prev,
+				  unsigned long start, unsigned long end)
+{
+	int ret, len;
+
+	*prev = vma;
+	if (end > vma->vm_end)
+		end = vma->vm_end;
+
+	len = end - start;
+	ret = get_user_pages(current, current->mm, start, len,
+			0, 0, NULL, NULL);
+	if (ret < 0)
+		return ret;
+	return ret == len ? 0 : -1;
+}
+
/*
* Schedule all required I/O operations. Do not wait for completion.
*/
@@ -110,7 +128,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
	struct file *file = vma->vm_file;

	if (!file)
-		return -EBADF;
+		return madvice_willneed_anon(vma, prev, start, end);

	if (file->f_mapping->a_ops->get_xip_page) {
		/* no bad return value, but ignore advice */
On Thu, 20 Dec 2007, Peter Zijlstra wrote:
>
> Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
> to use this to pre-fault pages. He currently uses mlock/munlock for
> this purpose.
I certainly agree with this in principle: it just seems an unnecessary
and surprising restriction to refuse on anonymous vmas; I guess the only
reason for not adding this was not having anyone asking for it until now.
Though, does Lennart realize he could use MAP_POPULATE in the mmap?
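For an anonymous mapping that would be something like (just a sketch):

	p = mmap(NULL, len, PROT_READ|PROT_WRITE,
		 MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);

though that only helps at mmap time, of course.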
>
> [ compile tested only ]
I haven't tried it either, but generally it looks plausible.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 93ee375..eff60ce 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -100,6 +100,24 @@ out:
> return error;
> }
>
> +static long madvice_willneed_anon(struct vm_area_struct *vma,
> + struct vm_area_struct **prev,
> + unsigned long start, unsigned long end)
madvise.c uses "madvise_" rather than "madvice_" throughout,
so please go with the flow.
> +{
> + int ret, len;
> +
> + *prev = vma;
> + if (end > vma->vm_end)
> + end = vma->vm_end;
Please check, but I think the upper level ensures end is within range.
> +
> + len = end - start;
> + ret = get_user_pages(current, current->mm, start, len,
> + 0, 0, NULL, NULL);
> + if (ret < 0)
> + return ret;
> + return ret == len ? 0 : -1;
It's not good to return -1 as an alternative to a real errno:
it'll look like -EPERM. If you copied that from somewhere, better
send a patch to fix the somewhere! Ah, yes, make_pages_present: it
happens that nobody is interested in its return value, so we could
make it a void; but that'd just be a cleanup. What to do here if
non-negative ret less than len? Oh, just return 0, that's good
enough in this case (the file case always returns 0).
Hmm, might it be better to use make_pages_present itself,
fixing its retval, rather than using get_user_pages directly?
(I'd hope the caching makes its repeat of find_vma not an overhead.)
Interesting divergence: make_pages_present faults in writable pages
in a writable vma, whereas the file case's force_page_cache_readahead
doesn't even insert the pages into the mm.
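For reference, make_pages_present currently reads roughly as follows
(quoting from memory, so minor details may differ):

	int make_pages_present(unsigned long addr, unsigned long end)
	{
		int ret, len, write;
		struct vm_area_struct *vma;

		vma = find_vma(current->mm, addr);
		write = (vma->vm_flags & VM_WRITE) != 0;
		BUG_ON(addr >= end);
		BUG_ON(end > vma->vm_end);
		/* len is a page count here, not bytes */
		len = (end + PAGE_SIZE - 1) / PAGE_SIZE - addr / PAGE_SIZE;
		ret = get_user_pages(current, current->mm, addr,
				len, write, 0, NULL, NULL);
		if (ret < 0)
			return ret;
		return ret == len ? 0 : -1;	/* the -1 in question */
	}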
> +}
> +
> /*
> * Schedule all required I/O operations. Do not wait for completion.
> */
> @@ -110,7 +128,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
> struct file *file = vma->vm_file;
>
> if (!file)
> - return -EBADF;
> + return madvice_willneed_anon(vma, prev, start, end);
>
> if (file->f_mapping->a_ops->get_xip_page) {
> /* no bad return value, but ignore advice */
And there's a correctly invisible hunk to the patch too: this
extension of MADV_WILLNEED also does not require down_write of
mmap_sem, so madvise_need_mmap_write can remain unchanged.
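(For completeness, that dispatch is roughly:

	static int madvise_need_mmap_write(int behavior)
	{
		switch (behavior) {
		case MADV_REMOVE:
		case MADV_WILLNEED:
		case MADV_DONTNEED:
			return 0;
		default:
			/* be safe, default to 1. list exceptions explicitly */
			return 1;
		}
	}

quoting from memory, so details may differ; the point is just that
MADV_WILLNEED stays in the down_read list.)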
Hugh
On Thu, 2007-12-20 at 14:09 +0000, Hugh Dickins wrote:
> On Thu, 20 Dec 2007, Peter Zijlstra wrote:
> >
> > Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
> > to use this to pre-fault pages. He currently uses mlock/munlock for
> > this purpose.
>
> I certainly agree with this in principle: it just seems an unnecessary
> and surprising restriction to refuse on anonymous vmas; I guess the only
> reason for not adding this was not having anyone asking for it until now.
> Though, does Lennart realize he could use MAP_POPULATE in the mmap?
I think he's trying to get his data swapped-in.
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > ---
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 93ee375..eff60ce 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -100,6 +100,24 @@ out:
> > return error;
> > }
> >
> > +static long madvice_willneed_anon(struct vm_area_struct *vma,
> > + struct vm_area_struct **prev,
> > + unsigned long start, unsigned long end)
>
> madvise.c uses "madvise_" rather than "madvice_" throughout,
> so please go with the flow.
Ah, quite. I hadn't noticed this, will fix.
> > +{
> > + int ret, len;
> > +
> > + *prev = vma;
> > + if (end > vma->vm_end)
> > + end = vma->vm_end;
>
> Please check, but I think the upper level ensures end is within range.
It certainly looks like it, but since the file case did this check I
thought it prudent to also do it. I guess I might as well remove both.
> > +
> > + len = end - start;
> > + ret = get_user_pages(current, current->mm, start, len,
> > + 0, 0, NULL, NULL);
> > + if (ret < 0)
> > + return ret;
> > + return ret == len ? 0 : -1;
>
> It's not good to return -1 as an alternative to a real errno:
> it'll look like -EPERM. If you copied that from somewhere, better
> send a patch to fix the somewhere! Ah, yes, make_pages_present: it
> happens that nobody is interested in its return value, so we could
> make it a void; but that'd just be a cleanup. What to do here if
> non-negative ret less than len? Oh, just return 0, that's good
> enough in this case (the file case always returns 0).
ok, return 0; it is.
> Hmm, might it be better to use make_pages_present itself,
> fixing its retval, rather than using get_user_pages directly?
> (I'd hope the caching makes its repeat of find_vma not an overhead.)
>
> Interesting divergence: make_pages_present faults in writable pages
> in a writable vma, whereas the file case's force_page_cache_readahead
> doesn't even insert the pages into the mm.
Yeah, the find_vma and write fault thing are the reason I didn't use
make_pages_present.
I had noticed the difference in pte population between
force_page_cache_readahead and make_pages_present, but it seemed to me
that writing a function to walk the page tables and populate the
swapcache but not populate the ptes wasn't worth the effort.
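(Concretely, that would mean read_swap_cache_async(), which in this
tree has the shape

	struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr);

and returns the page with a reference held.)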
> > +}
> > +
> > /*
> > * Schedule all required I/O operations. Do not wait for completion.
> > */
> > @@ -110,7 +128,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
> > struct file *file = vma->vm_file;
> >
> > if (!file)
> > - return -EBADF;
> > + return madvice_willneed_anon(vma, prev, start, end);
> >
> > if (file->f_mapping->a_ops->get_xip_page) {
> > /* no bad return value, but ignore advice */
>
> And there's a correctly invisible hunk to the patch too: this
> extension of MADV_WILLNEED also does not require down_write of
> mmap_sem, so madvise_need_mmap_write can remain unchanged.
Indeed, I did check that :-)
Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/mm/madvise.c b/mm/madvise.c
index 93ee375..563bf00 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -100,6 +100,21 @@ out:
return error;
}

+static long madvise_willneed_anon(struct vm_area_struct *vma,
+				  struct vm_area_struct **prev,
+				  unsigned long start, unsigned long end)
+{
+	int ret;
+
+	*prev = vma;
+	ret = get_user_pages(current, current->mm, start,
+			(end - start) >> PAGE_SHIFT, 0, 0, NULL, NULL);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
/*
* Schedule all required I/O operations. Do not wait for completion.
*/
@@ -110,7 +125,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
	struct file *file = vma->vm_file;

	if (!file)
-		return -EBADF;
+		return madvise_willneed_anon(vma, prev, start, end);

	if (file->f_mapping->a_ops->get_xip_page) {
		/* no bad return value, but ignore advice */
@@ -119,8 +134,6 @@ static long madvise_willneed(struct vm_area_struct * vma,
	*prev = vma;
	start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	if (end > vma->vm_end)
-		end = vma->vm_end;
	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

	force_page_cache_readahead(file->f_mapping,
			file, start, max_sane_readahead(end - start));
On Thu, 2007-12-20 at 15:47 +0100, Peter Zijlstra wrote:
> On Thu, 2007-12-20 at 14:09 +0000, Hugh Dickins wrote:
> > Interesting divergence: make_pages_present faults in writable pages
> > in a writable vma, whereas the file case's force_page_cache_readahead
> > doesn't even insert the pages into the mm.
>
> Yeah, the find_vma and write fault thing are the reason I didn't use
> make_pages_present.
>
> I had noticed the difference in pte population between
> force_page_cache_readahead and make_pages_present, but it seemed to me
> that writing a function to walk the page tables and populate the
> swapcache but not populate the ptes wasn't worth the effort.
Ah, another, more important difference:
force_page_cache_readahead will not wait for the read to complete,
whereas get_user_pages() will be fully synchronous.
I think I'd better come up with something else then,..
On Thu, 2007-12-20 at 15:56 +0100, Peter Zijlstra wrote:
> On Thu, 2007-12-20 at 15:47 +0100, Peter Zijlstra wrote:
> > On Thu, 2007-12-20 at 14:09 +0000, Hugh Dickins wrote:
>
> > > Interesting divergence: make_pages_present faults in writable pages
> > > in a writable vma, whereas the file case's force_page_cache_readahead
> > > doesn't even insert the pages into the mm.
> >
> > Yeah, the find_vma and write fault thing are the reason I didn't use
> > make_pages_present.
> >
> > I had noticed the difference in pte population between
> > force_page_cache_readahead and make_pages_present, but it seemed to me
> > that writing a function to walk the page tables and populate the
> > swapcache but not populate the ptes wasn't worth the effort.
>
> Ah, another, more important difference:
>
> force_page_cache_readahead will not wait for the read to complete,
> whereas get_user_pages() will be fully synchronous.
>
> I think I'd better come up with something else then,..
This depends on the page table walk code from -mm.
---
A best effort implementation of madvise(WILLNEED) for anonymous pages.
Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/mm/madvise.c b/mm/madvise.c
index 93ee375..e6f772a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -11,6 +11,8 @@
#include <linux/mempolicy.h>
#include <linux/hugetlb.h>
#include <linux/sched.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>

/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -100,6 +102,34 @@ out:
return error;
}

+static int madvise_willneed_anon_pte(pte_t *ptep,
+		unsigned long start, unsigned long end, void *arg)
+{
+	struct vm_area_struct *vma = arg;
+	struct page *page;
+
+	page = read_swap_cache_async(pte_to_swp_entry(*ptep), GFP_KERNEL,
+			vma, start);
+	if (page)
+		page_cache_release(page);
+
+	return 0;
+}
+
+static long madvise_willneed_anon(struct vm_area_struct * vma,
+		struct vm_area_struct ** prev,
+		unsigned long start, unsigned long end)
+{
+	struct mm_walk walk = {
+		.pte_entry = madvise_willneed_anon_pte,
+	};
+
+	*prev = vma;
+	walk_page_range(vma->vm_mm, start, end, &walk, vma);
+
+	return 0;
+}
+
/*
* Schedule all required I/O operations. Do not wait for completion.
*/
@@ -110,7 +140,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
	struct file *file = vma->vm_file;

	if (!file)
-		return -EBADF;
+		return madvise_willneed_anon(vma, prev, start, end);

	if (file->f_mapping->a_ops->get_xip_page) {
		/* no bad return value, but ignore advice */
On Thu, 2007-12-20 at 16:18 +0100, Peter Zijlstra wrote:
> +static int madvise_willneed_anon_pte(pte_t *ptep,
> + unsigned long start, unsigned long end, void *arg)
> +{
> + struct vm_area_struct *vma = arg;
> + struct page *page;
> +
> + page = read_swap_cache_async(pte_to_swp_entry(*ptep), GFP_KERNEL,
Argh, with HIGHPTE this is done inside a kmap_atomic.
/me goes complicate the code with page pre-allocation..
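That is, schematically:

	pte = pte_offset_map(pmd, addr);	/* kmap_atomic() under HIGHPTE */
	walk->pte_entry(pte, addr, ...);	/* runs in atomic context */
		/* ... and my callback then calls
		 * read_swap_cache_async(..., GFP_KERNEL, ...),
		 * which may sleep -- not allowed here */
	pte_unmap(pte);				/* kunmap_atomic() */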
On Thu, 20 Dec 2007, Peter Zijlstra wrote:
> On Thu, 2007-12-20 at 14:09 +0000, Hugh Dickins wrote:
> > On Thu, 20 Dec 2007, Peter Zijlstra wrote:
> >
> > I certainly agree with this in principle: it just seems an unnecessary
> > and surprising restriction to refuse on anonymous vmas; I guess the only
> > reason for not adding this was not having anyone asking for it until now.
> > Though, does Lennart realize he could use MAP_POPULATE in the mmap?
>
> I think he's trying to get his data swapped-in.
That's perfectly reasonable, fair enough.
> > > +{
> > > + int ret, len;
> > > +
> > > + *prev = vma;
> > > + if (end > vma->vm_end)
> > > + end = vma->vm_end;
> >
> > Please check, but I think the upper level ensures end is within range.
>
> It certainly looks like it, but since the file case did this check I
> thought it prudent to also do it. I guess I might as well remove both.
Ah, so it does. Yes, please do remove both.
> > Hmm, might it be better to use make_pages_present itself,
> > fixing its retval, rather than using get_user_pages directly?
> > (I'd hope the caching makes its repeat of find_vma not an overhead.)
> >
> > Interesting divergence: make_pages_present faults in writable pages
> > in a writable vma, whereas the file case's force_page_cache_readahead
> > doesn't even insert the pages into the mm.
>
> Yeah, the find_vma and write fault thing are the reason I didn't use
> make_pages_present.
The write fault thing is irrelevant now, actually: now do_anonymous_page
doesn't use ZERO_PAGE, it puts in a writable page if the vma flags permit,
even when it's just a read fault (and its write_access arg is redundant).
>
> I had noticed the difference in pte population between
> force_page_cache_readahead and make_pages_present, but it seemed to me
> that writing a function to walk the page tables and populate the
> swapcache but not populate the ptes wasn't worth the effort.
I was about to agree with you, when you made the observation:
> Ah, another, more important difference:
>
> force_page_cache_readahead will not wait for the read to complete,
> whereas get_user_pages() will be fully synchronous.
>
> I think I'd better come up with something else then,..
Yes, that's an interesting point. Maybe first put in what you have,
to stop it from saying -EBADF on anon; then make it asynch later.
The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
but might prove useful for more general use when swapping in.
Not really the same as Con's swap prefetch, but worth looking
at that for reference. But I guess this becomes a much bigger
issue than you were intending to get into here.
Hugh
On Thu, 20.12.07 14:09, Hugh Dickins ([email protected]) wrote:
> > Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
> > to use this to pre-fault pages. He currently uses mlock/munlock for
> > this purpose.
>
> I certainly agree with this in principle: it just seems an unnecessary
> and surprising restriction to refuse on anonymous vmas; I guess the only
> reason for not adding this was not having anyone asking for it until now.
> Though, does Lennart realize he could use MAP_POPULATE in the mmap?
Not really. First, if the mmap() is hidden somewhere in glibc (e.g. as
part of malloc() or whatever) it's not really possible to do
MAP_POPULATE. Also, I need this for some memory that is allocated for
the whole runtime but only seldom used. Thus I am happy if it is
swapped out, but every time I want to use it I want to make sure it is
paged in before I pass it on to the RT thread. So, there's an
mmap() during startup only, and then, during the whole runtime of my
program I want to page in the memory again and again, with long
intervals in between, but with no calls to mmap()/munmap().
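In code the pattern is basically this (sketch, names made up):

	/* startup only; the mmap() happens somewhere inside libc */
	buf = allocate_rt_buffer(len);

	/* repeatedly, with long gaps, before the RT thread touches it;
	 * madvise() wants a page-aligned address, so round buf down */
	madvise(page_base(buf), len, MADV_WILLNEED);
	hand_buffer_to_rt_thread(buf);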
Lennart
--
Lennart Poettering Red Hat, Inc.
lennart [at] poettering [dot] net ICQ# 11060553
http://0pointer.net/lennart/ GnuPG 0x1A015CC4
On Thu, 2007-12-20 at 15:26 +0000, Hugh Dickins wrote:
> The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
> but might prove useful for more general use when swapping in.
> Not really the same as Con's swap prefetch, but worth looking
> at that for reference. But I guess this becomes a much bigger
> issue than you were intending to get into here.
heh, yeah, got somewhat more complex than I'd hoped for.
last patch for today (not even compile tested), will do a proper patch
and test it tomorrow.
---
A best effort MADV_WILLNEED implementation for anonymous memory.
It adds a batch method to the page table walk routines so we can
copy a few ptes while holding the kmap and process them after dropping
it, which makes it possible to allocate the backing pages using GFP_KERNEL.
Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5c3655f..391a453 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -726,6 +726,7 @@ unsigned long unmap_vmas(struct mmu_gather **tlb,
* @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
* @pte_entry: if set, called for each non-empty PTE (4th-level) entry
* @pte_hole: if set, called for each hole at all levels
+ * @pte_batch: if set, called for each batch of up to %WALK_BATCH_SIZE PTE entries
*
* (see walk_page_range for more details)
*/
@@ -735,8 +736,16 @@ struct mm_walk {
	int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
	int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
	int (*pte_hole)(unsigned long, unsigned long, void *);
+	int (*pte_batch)(unsigned long, unsigned long, void *);
};

+#define WALK_BATCH_SIZE 32
+
+static inline unsigned int walk_addr_index(unsigned long addr)
+{
+	return (addr >> PAGE_SHIFT) % WALK_BATCH_SIZE;
+}
+
int walk_page_range(const struct mm_struct *, unsigned long addr,
		unsigned long end, const struct mm_walk *walk,
		void *private);
diff --git a/mm/madvise.c b/mm/madvise.c
index 93ee375..86610a0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -11,6 +11,8 @@
#include <linux/mempolicy.h>
#include <linux/hugetlb.h>
#include <linux/sched.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>

/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -100,17 +102,71 @@ out:
return error;
}

+struct madvise_willneed_anon_data {
+	pte_t entries[WALK_BATCH_SIZE];
+	struct vm_area_struct *vma;
+};
+
+static int madvise_willneed_anon_pte(pte_t *ptep,
+		unsigned long addr, unsigned long end, void *arg)
+{
+	struct madvise_willneed_anon_data *data = arg;
+
+	data->entries[walk_addr_index(addr)] = *ptep;
+
+	return 0;
+}
+
+static int madvise_willneed_anon_batch(unsigned long addr,
+		unsigned long end, void *arg)
+{
+	struct madvise_willneed_anon_data *data = arg;
+	unsigned int i;
+
+	for (; addr != end; addr += PAGE_SIZE) {
+		pte_t pte = data->entries[walk_addr_index(addr)];
+
+		if (is_swap_pte(pte)) {
+			struct page *page =
+				read_swap_cache_async(pte_to_swp_entry(pte),
+					GFP_KERNEL, data->vma, addr);
+			if (page)
+				page_cache_release(page);
+		}
+	}
+
+	return 0;
+}
+
+static long madvise_willneed_anon(struct vm_area_struct *vma,
+		struct vm_area_struct **prev,
+		unsigned long start, unsigned long end)
+{
+	struct madvise_willneed_anon_data data = {
+		.vma = vma,
+	};
+	struct mm_walk walk = {
+		.pte_entry = madvise_willneed_anon_pte,
+		.pte_batch = madvise_willneed_anon_batch,
+	};
+
+	*prev = vma;
+	walk_page_range(vma->vm_mm, start, end, &walk, &data);
+
+	return 0;
+}
+
/*
* Schedule all required I/O operations. Do not wait for completion.
*/
-static long madvise_willneed(struct vm_area_struct * vma,
-		struct vm_area_struct ** prev,
+static long madvise_willneed(struct vm_area_struct *vma,
+		struct vm_area_struct **prev,
		unsigned long start, unsigned long end)
{
	struct file *file = vma->vm_file;

	if (!file)
-		return -EBADF;
+		return madvise_willneed_anon(vma, prev, start, end);

	if (file->f_mapping->a_ops->get_xip_page) {
		/* no bad return value, but ignore advice */
@@ -119,8 +175,6 @@ static long madvise_willneed(struct vm_area_struct * vma,
	*prev = vma;
	start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	if (end > vma->vm_end)
-		end = vma->vm_end;
	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

	force_page_cache_readahead(file->f_mapping,
			file, start, max_sane_readahead(end - start));
@@ -147,8 +201,8 @@ static long madvise_willneed(struct vm_area_struct * vma,
 * An interface that causes the system to free clean pages and flush
 * dirty pages is already available as msync(MS_INVALIDATE).
 */
-static long madvise_dontneed(struct vm_area_struct * vma,
-		struct vm_area_struct ** prev,
+static long madvise_dontneed(struct vm_area_struct *vma,
+		struct vm_area_struct **prev,
		unsigned long start, unsigned long end)
{
	*prev = vma;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index b4f27d2..25fc656 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -2,12 +2,45 @@
#include <linux/highmem.h>
#include <linux/sched.h>

+static int walk_pte_range_batch(pmd_t *pmd, unsigned long addr, unsigned long end,
+		const struct mm_walk *walk, void *private)
+{
+	int err = 0;
+
+	do {
+		unsigned int i;
+		pte_t *pte, *map;
+		unsigned long start = addr;
+		int err2;
+
+		map = pte = pte_offset_map(pmd, addr);
+		for (i = 0; i < WALK_BATCH_SIZE && addr != end;
+				i++, pte++, addr += PAGE_SIZE) {
+			err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
+			if (err)
+				break;
+		}
+		pte_unmap(map);
+
+		err2 = walk->pte_batch(start, addr, private);
+		if (!err)
+			err = err2;
+		if (err)
+			break;
+	} while (addr != end);
+
+	return err;
+}
+
static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
		const struct mm_walk *walk, void *private)
{
	pte_t *pte;
	int err = 0;

+	if (walk->pte_batch)
+		return walk_pte_range_batch(pmd, addr, end, walk, private);
+
	pte = pte_offset_map(pmd, addr);
	do {
		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
On Thu, Dec 20, 2007 at 05:53:41PM +0100, Peter Zijlstra wrote:
>
> On Thu, 2007-12-20 at 15:26 +0000, Hugh Dickins wrote:
>
> > The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
> > but might prove useful for more general use when swapping in.
> > Not really the same as Con's swap prefetch, but worth looking
> > at that for reference. But I guess this becomes a much bigger
> > issue than you were intending to get into here.
>
> heh, yeah, got somewhat more complex than I'd hoped for.
>
> last patch for today (not even compile tested), will do a proper patch
> and test it tomorrow.
>
> ---
> A best effort MADV_WILLNEED implementation for anonymous memory.
>
> It adds a batch method to the page table walk routines so we can
> copy a few ptes while holding the kmap and process them after dropping
> it, which makes it possible to allocate the backing pages using GFP_KERNEL.
Yuck. We actually need to just fix the atomic kmap issue in the
existing pagemap code rather than add a new method, I think.
If performance of map/unmap is too slow at a granularity of 1, we can
add some internal batching in the CONFIG_HIGHPTE case.
--
Mathematics is the supreme nostalgia of our time.
On Thu, 2007-12-20 at 11:11 -0600, Matt Mackall wrote:
> On Thu, Dec 20, 2007 at 05:53:41PM +0100, Peter Zijlstra wrote:
> >
> > On Thu, 2007-12-20 at 15:26 +0000, Hugh Dickins wrote:
> >
> > > The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
> > > but might prove useful for more general use when swapping in.
> > > Not really the same as Con's swap prefetch, but worth looking
> > > at that for reference. But I guess this becomes a much bigger
> > > issue than you were intending to get into here.
> >
> > heh, yeah, got somewhat more complex than I'd hoped for.
> >
> > last patch for today (not even compile tested), will do a proper patch
> > and test it tomorrow.
> >
> > ---
> > A best effort MADV_WILLNEED implementation for anonymous memory.
> >
> > It adds a batch method to the page table walk routines so we can
> > copy a few ptes while holding the kmap and process them after dropping
> > it, which makes it possible to allocate the backing pages using GFP_KERNEL.
>
> Yuck. We actually need to just fix the atomic kmap issue in the
> existing pagemap code rather than add a new method, I think.
>
> If performance of map/unmap is too slow at a granularity of 1, we can
> add some internal batching in the CONFIG_HIGHPTE case.
OK, sounds like a much better idea indeed. Will implement that.
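Something like this inside walk_pte_range(), say -- only a sketch of
the internal batching idea, not tested:

	static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
			const struct mm_walk *walk, void *private)
	{
		pte_t batch[WALK_BATCH_SIZE];	/* on-stack pte copies */
		int err = 0;

		while (addr != end) {
			unsigned long start = addr;
			unsigned int i, n = 0;
			pte_t *map, *pte;

			map = pte = pte_offset_map(pmd, addr); /* kmap_atomic */
			while (n < WALK_BATCH_SIZE && addr != end) {
				batch[n++] = *pte++;
				addr += PAGE_SIZE;
			}
			pte_unmap(map);	/* drop the kmap before calling out */

			/* callbacks may now sleep (GFP_KERNEL and friends) */
			for (i = 0; i < n; i++) {
				err = walk->pte_entry(&batch[i],
						start + i * PAGE_SIZE,
						start + (i + 1) * PAGE_SIZE,
						private);
				if (err)
					return err;
			}
		}

		return err;
	}

with the batching entirely internal, so the mm_walk interface keeps a
single pte_entry callback.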