2005-11-02 01:15:32

by Badari Pulavarty

[permalink] [raw]
Subject: [PATCH] 2.6.14 patch for supporting madvise(MADV_FREE)

Hi All,

Here is the patch to support madvise(MADV_FREE) - which frees
up the given range of pages and truncates the underlying backing
store. This basically provides "punch hole into file" functionality.
Currently it supports ONLY shmfs/tmpfs - where we have short term
need. Other filesystems return -ENOSYS.

Yes. This is a *crazy* interface to do it. But this is what
we exactly need for now. Here is the discussion on linux-mm
(for all the fun discussion and the naming):

http://marc.theaimsgroup.com/?l=linux-mm&m=113078625426989&w=2

Andrew, could you include this in your next -mm release ?
BTW, for completeness - this patch includes reiser4-truncate-inode-
pages-range patch from your -mm series. If you want me to re-work
my patch without that, please let me know.

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-
rc5/2.6.14-rc5-mm1/broken-out/reiser4-truncate_inode_pages_range.patch

I tested with my test cases and Jeff Dike was kind enough to provide
a test case with UML - which found more bugs. I thank Andrea & Hugh
for helping me out heavily :)

Comments ?

Thanks,
Badari



Attachments:
madvise-free.patch (22.06 kB)

2005-11-02 01:43:35

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_FREE)

On Tue, Nov 01, 2005 at 05:15:01PM -0800, Badari Pulavarty wrote:
> Here is the patch to support madvise(MADV_FREE) - which frees
> up the given range of pages and truncates the underlying backing
> store. This basically provides "punch hole into file" functionality.
> Currently it supports ONLY shmfs/tmpfs - where we have short term
> need. Other filesystems return -ENOSYS.

MADV_FREE as a name isn't right if we return -ENOSYS for anonymoys
memory.

MADV_FREE in other OS works _only_ on anonymous memory and returns
-EINVAL if used on filebacked vmas. Infact we probably should rename our
MADV_DONTNEED to MADV_FREE.

http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrde?a=view

"This value cannot be used on mappings that have underlying file objects."

Our MADV_DONTNEED exactly matches the MADV_FREE semantics, and it seems
the MADV_DONTNEED of other OS isn't destructive like ours. Except our
MADV_DONTNEED also works on filebacked mappings but it's destructive
only on anonymous memory.


I thought Andrew suggested MADV_REMOVE for the new feature.

This feature didn't exist in other OS yet AFIK, so a new MADV_name for
it makes sense. I'm not completely against extending MADV_FREE but then we
shouldn't return -ENOSYS on anonymous memory and we should do the same
thing MADV_DONTNEED does on anonymous memory. Probably a new name is
safer to avoid confusion (think an application running MADV_FREE and
expecting -EINVAL when used on filebacked mappings).

Thanks!

2005-11-02 15:49:45

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_FREE)

On Wed, 2005-11-02 at 02:43 +0100, Andrea Arcangeli wrote:
> On Tue, Nov 01, 2005 at 05:15:01PM -0800, Badari Pulavarty wrote:
> > Here is the patch to support madvise(MADV_FREE) - which frees
> > up the given range of pages and truncates the underlying backing
> > store. This basically provides "punch hole into file" functionality.
> > Currently it supports ONLY shmfs/tmpfs - where we have short term
> > need. Other filesystems return -ENOSYS.
>
> MADV_FREE as a name isn't right if we return -ENOSYS for anonymoys
> memory.
>
> MADV_FREE in other OS works _only_ on anonymous memory and returns
> -EINVAL if used on filebacked vmas. Infact we probably should rename our
> MADV_DONTNEED to MADV_FREE.
>
> http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrde?a=view
>
> "This value cannot be used on mappings that have underlying file objects."
>
> Our MADV_DONTNEED exactly matches the MADV_FREE semantics, and it seems
> the MADV_DONTNEED of other OS isn't destructive like ours. Except our
> MADV_DONTNEED also works on filebacked mappings but it's destructive
> only on anonymous memory.
>
>
> I thought Andrew suggested MADV_REMOVE for the new feature.

Yep. My bad. Andrew did suggest MADV_REMOVE. Let me rename and
generate patch once again !! Thanks for pointing out.

>
> This feature didn't exist in other OS yet AFIK, so a new MADV_name for
> it makes sense. I'm not completely against extending MADV_FREE but then we
> shouldn't return -ENOSYS on anonymous memory and we should do the same
> thing MADV_DONTNEED does on anonymous memory. Probably a new name is
> safer to avoid confusion (think an application running MADV_FREE and
> expecting -EINVAL when used on filebacked mappings).
>
> Thanks!
>

Thanks,
Badari

2005-11-02 16:13:06

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)

Hi Andrew & Andrea,

Here is the updated patch with name change again :(
Hopefully this would be final. (MADV_REMOVE).

BTW, I am not sure if we need to hold i_sem and i_allocsem
all the way ? I wanted to be safe - but this may be overkill ?


+ /* XXX - Do we need both i_sem and i_allocsem all the way ? */
+ down(&inode->i_sem);
+ down_write(&inode->i_alloc_sem);
+ unmap_mapping_range(mapping, offset, (end - offset), 1);
+ truncate_inode_pages_range(mapping, offset, end);
+ inode->i_op->truncate_range(inode, offset, end);
+ up_write(&inode->i_alloc_sem);
+ up(&inode->i_sem);


Thanks,
Badari



Attachments:
madvise-remove.patch (22.15 kB)

2005-11-02 19:49:15

by Blaisorblade

[permalink] [raw]
Subject: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))

On Wednesday 02 November 2005 17:12, Badari Pulavarty wrote:
> Hi Andrew & Andrea,
>
> Here is the updated patch with name change again :(
> Hopefully this would be final. (MADV_REMOVE).
>
> BTW, I am not sure if we need to hold i_sem and i_allocsem
> all the way ? I wanted to be safe - but this may be overkill ?
While looking into this, I probably found another problem, a race with
install_page(), which doesn't use the seqlock-style check we use for
everything else (aka do_no_page) but simply assumes a page is valid if its
index is below the current file size.

This is clearly "truncate" specific, and is already racy. Suppose I truncate a
file and reduce its size, and then re-extend it, the page which I previously
fetched from the cache is invalid. The current install_page code generates
corruption.

In fact the page is fetched from the caller of install_page and passed to it.

This affects anybody using MAP_POPULATE or using remap_file_pages.

> + /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> + down(&inode->i_sem);
> + down_write(&inode->i_alloc_sem);
> + unmap_mapping_range(mapping, offset, (end - offset), 1);
In my opinion, as already said, unmap_mapping_range can be called without
these two locks, as it operates only on mappings for the file.

However currently it's called with these locks held in vmtruncate, but I think
the locks are held in that case only because we need to truncate the file,
and are hold in excess also across this call.

Instead, we need to protect against concurrent faults on the mapping (not
against concurrent mmaps)...and that is done through (struct address_space*)
mapping->truncate_count.

=====
Finally, there is MAP_POPULATE and other pre-faulting, i.e. install_page (no,
this is not peculiar to VM_NONLINEAR, even if some code is shared), but
install_page checks explicitly for truncation; the problem is that the check
is rather bogus, compared to the rest of checks:

/*
* This page may have been truncated. Tell the
* caller about it.
*/
err = -EINVAL;
inode = vma->vm_file->f_mapping->host;
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (!page->mapping || page->index >= size)
goto err_unlock;

I remember there being a BUG_ON and Linus fixing it up.

It should be converted to the normal checks used for the rest
(->truncate_count based - see do_no_page()).

To do so, the caller (*_populate) needs to call again *_getpage, if
install_page detects a race, but it must also save and pass the
truncate_count.

So, we probably want need to cleanup and join {filemap,shmem}_populate
together, because the only real difference between them is the function
called to lookup the page from disk ({shmem,filemap}_getpage).

So, we should replace struct vm_operations_struct "populate" method with a
"getpage" method, by using the shmem_getpage prototype, which is better
engineered, see my comment:

page = filemap_getpage(file, pgoff, nonblock);

/* XXX: This is wrong, a filesystem I/O error may have happened. Fix
that as
* done in shmem_populate calling shmem_getpage */
if (!page && !nonblock)
return -ENOMEM;

> + truncate_inode_pages_range(mapping, offset, end);
> + inode->i_op->truncate_range(inode, offset, end);
> + up_write(&inode->i_alloc_sem);
> + up(&inode->i_sem);

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade





___________________________________
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB
http://mail.yahoo.it

2005-11-02 20:13:50

by Hugh Dickins

[permalink] [raw]
Subject: Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))

On Wed, 2 Nov 2005, Blaisorblade wrote:
> While looking into this, I probably found another problem, a race with
> install_page(), which doesn't use the seqlock-style check we use for
> everything else (aka do_no_page) but simply assumes a page is valid if its
> index is below the current file size.
>
> This is clearly "truncate" specific, and is already racy. Suppose I truncate a
> file and reduce its size, and then re-extend it, the page which I previously
> fetched from the cache is invalid. The current install_page code generates
> corruption.

No, it should be fine as is (unless perhaps some barrier is needed).

The check
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (!page->mapping || page->index >= size)
goto err_unlock;
handles the case that worries you: page->mapping will be NULL.

do_no_page has to do the more complicated truncate_count business because
it deals with all kinds of ->nopage, some of which leave page->mapping NULL:
so it's unable to distinguish one where the driver left it NULL from one
where truncation has suddenly made it NULL.

Hugh

2005-11-02 20:46:10

by Hugh Dickins

[permalink] [raw]
Subject: Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))

On Wed, 2 Nov 2005, Hugh Dickins wrote:
> On Wed, 2 Nov 2005, Blaisorblade wrote:
>
> No, it should be fine as is (unless perhaps some barrier is needed).

We already have the barrier needed: we're holding page_table_lock (pte lock).

Hugh

2005-11-02 21:36:52

by Badari Pulavarty

[permalink] [raw]
Subject: Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))

On Wed, 2005-11-02 at 20:54 +0100, Blaisorblade wrote:
> On Wednesday 02 November 2005 17:12, Badari Pulavarty wrote:
> > Hi Andrew & Andrea,
> >
> > Here is the updated patch with name change again :(
> > Hopefully this would be final. (MADV_REMOVE).
> >
> > BTW, I am not sure if we need to hold i_sem and i_allocsem
> > all the way ? I wanted to be safe - but this may be overkill ?
> While looking into this, I probably found another problem, a race with
> install_page(), which doesn't use the seqlock-style check we use for
> everything else (aka do_no_page) but simply assumes a page is valid if its
> index is below the current file size.
>
> This is clearly "truncate" specific, and is already racy. Suppose I truncate a
> file and reduce its size, and then re-extend it, the page which I previously
> fetched from the cache is invalid. The current install_page code generates
> corruption.
>
> In fact the page is fetched from the caller of install_page and passed to it.
>
> This affects anybody using MAP_POPULATE or using remap_file_pages.
>
> > + /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> > + down(&inode->i_sem);
> > + down_write(&inode->i_alloc_sem);
> > + unmap_mapping_range(mapping, offset, (end - offset), 1);
> In my opinion, as already said, unmap_mapping_range can be called without
> these two locks, as it operates only on mappings for the file.
>
> However currently it's called with these locks held in vmtruncate, but I think
> the locks are held in that case only because we need to truncate the file,
> and are hold in excess also across this call.

I agree, I can push down the locking only for ->truncate_range - if
no one has objections. (But again, it so special case - no one really
cares about the performance of this interface ?).

Thanks,
Badari

2005-11-02 21:57:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))

On Wed, 2 Nov 2005, Badari Pulavarty wrote:
> On Wed, 2005-11-02 at 20:54 +0100, Blaisorblade wrote:
> > > + /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> > > + down(&inode->i_sem);
> > > + down_write(&inode->i_alloc_sem);
> > > + unmap_mapping_range(mapping, offset, (end - offset), 1);
> > In my opinion, as already said, unmap_mapping_range can be called without
> > these two locks, as it operates only on mappings for the file.
> >
> > However currently it's called with these locks held in vmtruncate, but I think
> > the locks are held in that case only because we need to truncate the file,
> > and are hold in excess also across this call.
>
> I agree, I can push down the locking only for ->truncate_range - if
> no one has objections. (But again, it so special case - no one really
> cares about the performance of this interface ?).

I can't remember why i_alloc_sem got introduced, and don't have time to
work it out: something to do with direct I/O races, perhaps? Someone
else must advise, perhaps you will be able to drop that one.

But I think you'd be very unwise to drop i_sem too. i_mmap_lock gets
dropped whenever preemption demands here, i_sem is what's preventing
someone else coming along and doing a concurrent truncate or remove.
You don't want that.

Sorry, I've not yet had time to study your patch: I do intend to,
but cannot promise when. I fear it won't be as easy as making
these occasional responses.

Hugh

2005-11-02 22:03:08

by Badari Pulavarty

[permalink] [raw]
Subject: Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))

On Wed, 2005-11-02 at 21:55 +0000, Hugh Dickins wrote:
> On Wed, 2 Nov 2005, Badari Pulavarty wrote:
> > On Wed, 2005-11-02 at 20:54 +0100, Blaisorblade wrote:
> > > > + /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> > > > + down(&inode->i_sem);
> > > > + down_write(&inode->i_alloc_sem);
> > > > + unmap_mapping_range(mapping, offset, (end - offset), 1);
> > > In my opinion, as already said, unmap_mapping_range can be called without
> > > these two locks, as it operates only on mappings for the file.
> > >
> > > However currently it's called with these locks held in vmtruncate, but I think
> > > the locks are held in that case only because we need to truncate the file,
> > > and are hold in excess also across this call.
> >
> > I agree, I can push down the locking only for ->truncate_range - if
> > no one has objections. (But again, it so special case - no one really
> > cares about the performance of this interface ?).
>
> I can't remember why i_alloc_sem got introduced, and don't have time to
> work it out: something to do with direct I/O races, perhaps? Someone
> else must advise, perhaps you will be able to drop that one.

Yep. i_alloc_sem is supposed to protect DIO races with truncate.

> But I think you'd be very unwise to drop i_sem too. i_mmap_lock gets
> dropped whenever preemption demands here, i_sem is what's preventing
> someone else coming along and doing a concurrent truncate or remove.
> You don't want that.
>
> Sorry, I've not yet had time to study your patch: I do intend to,
> but cannot promise when. I fear it won't be as easy as making
> these occasional responses.

Thanks Hugh. For now, I will leave those locks alone. We can re-visit
later, if we really care about the performance of this interface.
Better be safe than sorry.

Thanks,
Badari

2005-11-12 00:26:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)

Badari Pulavarty <[email protected]> wrote:
>
> +/*
> + * Application wants to free up the pages and associated backing store.
> + * This is effectively punching a hole into the middle of a file.
> + *
> + * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
> + * Other filesystems return -ENOSYS.
> + */
> +static long madvise_remove(struct vm_area_struct * vma,
> + unsigned long start, unsigned long end)
> +{
> + struct address_space *mapping;
> + loff_t offset, endoff;
> +
> + if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB))
> + return -EINVAL;
> +
> + if (!vma->vm_file || !vma->vm_file->f_mapping
> + || !vma->vm_file->f_mapping->host) {
> + return -EINVAL;
> + }
> +
> + mapping = vma->vm_file->f_mapping;
> + if (mapping == &swapper_space) {
> + return -EINVAL;
> + }
> +
> + offset = (loff_t)(start - vma->vm_start)
> + + (vma->vm_pgoff << PAGE_SHIFT);
> + endoff = (loff_t)(end - vma->vm_start - 1)
> + + (vma->vm_pgoff << PAGE_SHIFT);
> + return vmtruncate_range(mapping->host, offset, endoff);
> +}
> +

I'm suspecting you tested this on a 64-bit machine, yes? On 32-bit that
vm_pgoff shift is going to overflow.

Fixes-thus-far below. Please rerun all tests on x86?

Why does madvise_remove() have an explicit check for swapper_space?

In your testing, how are you determining that the code is successfully
removing the correct number of pages, from the correct file offset?


diff -puN mm/madvise.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy mm/madvise.c
--- devel/mm/madvise.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy 2005-11-11 16:12:43.000000000 -0800
+++ devel-akpm/mm/madvise.c 2005-11-11 16:16:50.000000000 -0800
@@ -147,8 +147,8 @@ static long madvise_dontneed(struct vm_a
* NOTE: Currently, only shmfs/tmpfs is supported for this operation.
* Other filesystems return -ENOSYS.
*/
-static long madvise_remove(struct vm_area_struct * vma,
- unsigned long start, unsigned long end)
+static long madvise_remove(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
{
struct address_space *mapping;
loff_t offset, endoff;
@@ -162,14 +162,13 @@ static long madvise_remove(struct vm_are
}

mapping = vma->vm_file->f_mapping;
- if (mapping == &swapper_space) {
+ if (mapping == &swapper_space)
return -EINVAL;
- }

offset = (loff_t)(start - vma->vm_start)
- + (vma->vm_pgoff << PAGE_SHIFT);
+ + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
endoff = (loff_t)(end - vma->vm_start - 1)
- + (vma->vm_pgoff << PAGE_SHIFT);
+ + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
return vmtruncate_range(mapping->host, offset, endoff);
}

diff -puN mm/memory.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy mm/memory.c
--- devel/mm/memory.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy 2005-11-11 16:16:54.000000000 -0800
+++ devel-akpm/mm/memory.c 2005-11-11 16:17:59.000000000 -0800
@@ -1608,10 +1608,9 @@ out_big:
out_busy:
return -ETXTBSY;
}
-
EXPORT_SYMBOL(vmtruncate);

-int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end)
+int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
{
struct address_space *mapping = inode->i_mapping;

@@ -1634,7 +1633,6 @@ int vmtruncate_range(struct inode * inod

return 0;
}
-
EXPORT_SYMBOL(vmtruncate_range);

/*
_

2005-11-12 00:34:29

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)

Badari Pulavarty <[email protected]> wrote:
>
> +int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end)
> +{
> + struct address_space *mapping = inode->i_mapping;
> +
> + /*
> + * If the underlying filesystem is not going to provide
> + * a way to truncate a range of blocks (punch a hole) -
> + * we should return failure right now.
> + */
> + if (!inode->i_op || !inode->i_op->truncate_range)
> + return -ENOSYS;
> +
> + /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> + down(&inode->i_sem);
> + down_write(&inode->i_alloc_sem);
> + unmap_mapping_range(mapping, offset, (end - offset), 1);
> + truncate_inode_pages_range(mapping, offset, end);
> + inode->i_op->truncate_range(inode, offset, end);
> + up_write(&inode->i_alloc_sem);
> + up(&inode->i_sem);
> +
> + return 0;
> +}

Yes, we need to take i_alloc_sem for writing. To prevent concurrent
direct-io reads from coming in and instantiated by unwritten blocks.

tmpfs doesn't implements direct-io though.

2005-11-12 00:34:55

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)

On Fri, 2005-11-11 at 16:25 -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > +/*
> > + * Application wants to free up the pages and associated backing store.
> > + * This is effectively punching a hole into the middle of a file.
> > + *
> > + * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
> > + * Other filesystems return -ENOSYS.
> > + */
> > +static long madvise_remove(struct vm_area_struct * vma,
> > + unsigned long start, unsigned long end)
> > +{
> > + struct address_space *mapping;
> > + loff_t offset, endoff;
> > +
> > + if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB))
> > + return -EINVAL;
> > +
> > + if (!vma->vm_file || !vma->vm_file->f_mapping
> > + || !vma->vm_file->f_mapping->host) {
> > + return -EINVAL;
> > + }
> > +
> > + mapping = vma->vm_file->f_mapping;
> > + if (mapping == &swapper_space) {
> > + return -EINVAL;
> > + }
> > +
> > + offset = (loff_t)(start - vma->vm_start)
> > + + (vma->vm_pgoff << PAGE_SHIFT);
> > + endoff = (loff_t)(end - vma->vm_start - 1)
> > + + (vma->vm_pgoff << PAGE_SHIFT);
> > + return vmtruncate_range(mapping->host, offset, endoff);
> > +}
> > +
>
> I'm suspecting you tested this on a 64-bit machine, yes? On 32-bit that
> vm_pgoff shift is going to overflow.

Yes. I have moved to all 64-bit (amd64, em64t, ppc64) machines. My bad.

>
> Fixes-thus-far below. Please rerun all tests on x86?
>

I will verify. Thanks.

> Why does madvise_remove() have an explicit check for swapper_space?

I really don't remember (I yanked code from some other kernel routine
vmtruncate()). If you think its unnecessary, I can take it out.

> In your testing, how are you determining that the code is successfully
> removing the correct number of pages, from the correct file offset?

I verified with test programs, added debug printk + looked through live
"crash" session + verified with UML testcases.

>
> diff -puN mm/madvise.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy mm/madvise.c
> --- devel/mm/madvise.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy 2005-11-11 16:12:43.000000000 -0800
> +++ devel-akpm/mm/madvise.c 2005-11-11 16:16:50.000000000 -0800
> @@ -147,8 +147,8 @@ static long madvise_dontneed(struct vm_a
> * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
> * Other filesystems return -ENOSYS.
> */
> -static long madvise_remove(struct vm_area_struct * vma,
> - unsigned long start, unsigned long end)
> +static long madvise_remove(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> {
> struct address_space *mapping;
> loff_t offset, endoff;
> @@ -162,14 +162,13 @@ static long madvise_remove(struct vm_are
> }
>
> mapping = vma->vm_file->f_mapping;
> - if (mapping == &swapper_space) {
> + if (mapping == &swapper_space)
> return -EINVAL;
> - }
>
> offset = (loff_t)(start - vma->vm_start)
> - + (vma->vm_pgoff << PAGE_SHIFT);
> + + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> endoff = (loff_t)(end - vma->vm_start - 1)
> - + (vma->vm_pgoff << PAGE_SHIFT);
> + + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> return vmtruncate_range(mapping->host, offset, endoff);
> }
>
> diff -puN mm/memory.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy mm/memory.c
> --- devel/mm/memory.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy 2005-11-11 16:16:54.000000000 -0800
> +++ devel-akpm/mm/memory.c 2005-11-11 16:17:59.000000000 -0800
> @@ -1608,10 +1608,9 @@ out_big:
> out_busy:
> return -ETXTBSY;
> }
> -
> EXPORT_SYMBOL(vmtruncate);
>
> -int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end)
> +int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
> {
> struct address_space *mapping = inode->i_mapping;
>
> @@ -1634,7 +1633,6 @@ int vmtruncate_range(struct inode * inod
>
> return 0;
> }
> -
> EXPORT_SYMBOL(vmtruncate_range);
>
> /*
> _
>
>

2005-11-12 01:43:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)

Badari Pulavarty <[email protected]> wrote:
>
> > Why does madvise_remove() have an explicit check for swapper_space?
>
> I really don't remember (I yanked code from some other kernel routine
> vmtruncate()).

I don't see such a thing anywhere. vmtruncate() has the IS_SWAPFILE()
test, which I guess vmtruncate_range() ought to have too, for
future-safety.

Logically, vmtruncate() should just be a special case of vmtruncate_range().
But it's not - ugly, but hard to do anything about (need to implement
->truncate_range in all filesystems, but "know" which ones only support
->truncate_range() at eof).

>
> > In your testing, how are you determining that the code is successfully
> > removing the correct number of pages, from the correct file offset?
>
> I verified with test programs, added debug printk + looked through live
> "crash" session + verified with UML testcases.

OK, well please be sure to test it on 32-bit and 64-bit, operating in three
ranges of the file: <2G, 2G-4G amd >4G.

2005-11-12 04:41:14

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)

Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
>
>>>Why does madvise_remove() have an explicit check for swapper_space?
>>
>>I really don't remember (I yanked code from some other kernel routine
>>vmtruncate()).
>
>
> I don't see such a thing anywhere. vmtruncate() has the IS_SWAPFILE()
> test, which I guess vmtruncate_range() ought to have too, for
> future-safety.

Yep. That was the check. Since I don't have inode and have mapping
handy anyway, check was made using that. I could change it, if you wish.

>
> Logically, vmtruncate() should just be a special case of vmtruncate_range().
> But it's not - ugly, but hard to do anything about (need to implement
> ->truncate_range in all filesystems, but "know" which ones only support
> ->truncate_range() at eof).
>
>
>>>In your testing, how are you determining that the code is successfully
>>>removing the correct number of pages, from the correct file offset?
>>
>>I verified with test programs, added debug printk + looked through live
>>"crash" session + verified with UML testcases.
>
>
> OK, well please be sure to test it on 32-bit and 64-bit, operating in three
> ranges of the file: <2G, 2G-4G amd >4G.
>
Will do.

Thanks,
Badari

2006-01-16 13:06:58

by Andrea Arcangeli

[permalink] [raw]
Subject: differences between MADV_FREE and MADV_DONTNEED

Now that MADV_REMOVE is in, should we discuss MADV_FREE?

MADV_FREE in Solaris is destructive and only works on anonymous memory,
while MADV_DONTNEED seems to never be destructive (which I assume it
means it's a noop on anonymous memory).

Our MADV_DONTNEED is destructive on anonymous memory, while it's
non-destructive on file mappings.

Perhaps we could move the destructive anonymous part of MADV_DONTNEED to
MADV_FREE?

Or we could as well go relaxed and define MADV_FREE and MADV_DONTNEED
the same way (that still leaves the question if we risk to break apps
ported from solaris where MADV_DONTNEED is apparently always not
destructive).

I only read the docs, I don't know in practice what MADV_DONTNEED does
on solaris (does it return -EINVAL if run on anonymous memory or not?).

http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrgk?a=view

BTW, I don't know how other specifications define MADV_FREE, but besides
MADV_REMOVE I've also got the request to provide MADV_FREE in linux,
this is why I'm asking. (right now I'm telling them to use #ifdef
__linux__ #define MADV_FREE MADV_DONTNEED but that's quite an hack since
it could break if we make MADV_DONTNEED non-destructive in the future)

Thanks.

2006-01-16 16:02:21

by Suleiman Souhlal

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

Andrea Arcangeli wrote:
> Now that MADV_REMOVE is in, should we discuss MADV_FREE?
>
> MADV_FREE in Solaris is destructive and only works on anonymous memory,
> while MADV_DONTNEED seems to never be destructive (which I assume it
> means it's a noop on anonymous memory).

FWIW, in FreeBSD, MADV_DONTNEED is not destructive, and just makes pages
(including anonymous ones) more likely to get swapped out.

> Our MADV_DONTNEED is destructive on anonymous memory, while it's
> non-destructive on file mappings.
>
> Perhaps we could move the destructive anonymous part of MADV_DONTNEED to
> MADV_FREE?

This would seem like the best way to go, since it would bring Linux's
behavior more in line with what other systems do.

> Or we could as well go relaxed and define MADV_FREE and MADV_DONTNEED
> the same way (that still leaves the question if we risk to break apps
> ported from solaris where MADV_DONTNEED is apparently always not
> destructive).
>
> I only read the docs, I don't know in practice what MADV_DONTNEED does
> on solaris (does it return -EINVAL if run on anonymous memory or not?).
>
> http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrgk?a=view
>
> BTW, I don't know how other specifications define MADV_FREE, but besides
> MADV_REMOVE I've also got the request to provide MADV_FREE in linux,
> this is why I'm asking. (right now I'm telling them to use #ifdef
> __linux__ #define MADV_FREE MADV_DONTNEED but that's quite an hack since
> it could break if we make MADV_DONTNEED non-destructive in the future)

FreeBSD's MADV_FREE only works on anonymous memory (it's a noop for
vnode-backed memory), and marks the pages clean before moving them to
the inactive queue, so that they can be freed or reused quickly, without
causing a pagefault.

-- Suleiman

2006-01-16 16:28:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

On Mon, Jan 16, 2006 at 08:02:07AM -0800, Suleiman Souhlal wrote:
> FWIW, in FreeBSD, MADV_DONTNEED is not destructive, and just makes pages
> (including anonymous ones) more likely to get swapped out.

We can also use it for the same purpose, we could add the pages to
swapcache mark them dirty and zap the ptes _after_ that.

> This would seem like the best way to go, since it would bring Linux's
> behavior more in line with what other systems do.

Agreed.

> FreeBSD's MADV_FREE only works on anonymous memory (it's a noop for
> vnode-backed memory), and marks the pages clean before moving them to
> the inactive queue, so that they can be freed or reused quickly, without
> causing a pagefault.

Well, perhaps solaris is also a noop and not necessairly a -EINVAL, all
I know from the docs is "This value cannot be used on mappings that have
underlying file objects.", so I expected -EINVAL but it may be a noop.

2006-01-16 17:03:14

by Suleiman Souhlal

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

Andrea Arcangeli wrote:

> We can also use it for the same purpose, we could add the pages to
> swapcache mark them dirty and zap the ptes _after_ that.

Wouldn't that cause the pages to get swapped out immediately?
If so, I don't think this would be the best approach: It would be better
to just move the pages to the inactive list, if they aren't there
already, so that they get swapped out only when they really need to be.

-- Suleiman

2006-01-16 17:24:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

On Mon, Jan 16, 2006 at 09:03:00AM -0800, Suleiman Souhlal wrote:
> Andrea Arcangeli wrote:
>
> >We can also use it for the same purpose, we could add the pages to
> >swapcache mark them dirty and zap the ptes _after_ that.
>
> Wouldn't that cause the pages to get swapped out immediately?

Not really, it would be a non blocking operation. But they could be
swapped out shortly later (that's the whole point of DONTNEED, right?),
once there is more memory pressure. Otherwise if they're used again, a
minor fault will happen and it will find the swapcache uptodate in ram.

2006-01-16 21:45:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

Andrea Arcangeli <[email protected]> writes:

> On Mon, Jan 16, 2006 at 09:03:00AM -0800, Suleiman Souhlal wrote:
>> Andrea Arcangeli wrote:
>>
>> >We can also use it for the same purpose, we could add the pages to
>> >swapcache mark them dirty and zap the ptes _after_ that.
>>
>> Wouldn't that cause the pages to get swapped out immediately?
>
> Not really, it would be a non blocking operation. But they could be
> swapped out shortly later (that's the whole point of DONTNEED, right?),
> once there is more memory pressure. Otherwise if they're used again, a
> minor fault will happen and it will find the swapcache uptodate in ram.

As I recall the logic with DONTNEED was to mark the mapping of
the page clean so the page didn't need to be swapped out, it could
just be dropped.

That is why they anonymous and the file backed cases differ.

Part of the point is to avoid the case of swapping the pages out if
the application doesn't care what is on them anymore.

Eric

2006-01-17 00:24:16

by Suleiman Souhlal

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

Eric W. Biederman wrote:
> As I recall the logic with DONTNEED was to mark the mapping of
> the page clean so the page didn't need to be swapped out, it could
> just be dropped.
>
> That is why they anonymous and the file backed cases differ.
>
> Part of the point is to avoid the case of swapping the pages out if
> the application doesn't care what is on them anymore.

Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon",
and MADV_FREE "I will never need this again".

-- Suleiman

2006-01-17 01:04:13

by Nicholas Miell

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
> Eric W. Biederman wrote:
> > As I recall the logic with DONTNEED was to mark the mapping of
> > the page clean so the page didn't need to be swapped out, it could
> > just be dropped.
> >
> > That is why they anonymous and the file backed cases differ.
> >
> > Part of the point is to avoid the case of swapping the pages out if
> > the application doesn't care what is on them anymore.
>
> Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon",
> and MADV_FREE "I will never need this again".
>

POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
flags defined as follows:

POSIX_MADV_NORMAL
Specifies that the application has no advice to give on its behavior
with respect to the specified range. It is the default characteristic if
no advice is given for a range of memory.
POSIX_MADV_SEQUENTIAL
Specifies that the application expects to access the specified range
sequentially from lower addresses to higher addresses.
POSIX_MADV_RANDOM
Specifies that the application expects to access the specified range
in a random order.
POSIX_MADV_WILLNEED
Specifies that the application expects to access the specified range
in the near future.
POSIX_MADV_DONTNEED
Specifies that the application expects that it will not access the
specified range in the near future.

Note that glibc forwards posix_madvise() directly to madvise(2), which
means that right now, POSIX conformant apps which use
posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
data on Linux systems.

--
Nicholas Miell <[email protected]>

2006-01-17 01:06:16

by Blaisorblade

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

On Monday 16 January 2006 14:06, Andrea Arcangeli wrote:
> Now that MADV_REMOVE is in, should we discuss MADV_FREE?

> MADV_FREE in Solaris is destructive and only works on anonymous memory,

I.e. it's a restriction of MADV_REMOVE. Is there anything conceivable relying
on errors or no behaviour on file-backed memory? If relying on errors we
could need an API, but if relying only on the NO-OP thing the correctness
semantics are already implemented. I.e. data are retained on both Solaris
MADV_FREE and Linux MADV_REMOVE for file-backed case, they get a different
semantics for caching.

> while MADV_DONTNEED seems to never be destructive (which I assume it
> means it's a noop on anonymous memory).

It could be a "swap it out", as mentioned in Linux comments on our madvise
semantics about "other Unices".

> Our MADV_DONTNEED is destructive on anonymous memory, while it's
> non-destructive on file mappings.

Indeed, not even that. See our madvise_dontneed() comment - dirty data are
discarded in both cases, and the comment suggests msync(MS_INVALIDATE). It
also speaks of "other implementation", which could also refer to Solaris.

> Perhaps we could move the destructive anonymous part of MADV_DONTNEED to
> MADV_FREE?

Why changing existing apps behaviour? That's nonsense, unless you have a
standard. Indeed, however, posix_madvise exists, and it's DONTNEED semantics
are the Solaris ones. Don't know past behaviour about "breaking existing to
comply to standards" (new syscall slot?).

> Or we could as well go relaxed and define MADV_FREE and MADV_DONTNEED
> the same way (that still leaves the question if we risk to break apps
> ported from solaris where MADV_DONTNEED is apparently always not
> destructive).

Provide our fine-grained semantics with new, not misunderstandable identifiers
(MADV_FREE_DISCARD, MADV_FREE_CACHE, for instance).

For current names, libc could provide a "let user choose the meaning of
things", like it does for signals with _BSD_SOURCE, _POSIX_SOURCE and so on.

> I only read the docs, I don't know in practice what MADV_DONTNEED does
> on solaris (does it return -EINVAL if run on anonymous memory or not?).

> http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrgk?a=view

> BTW, I don't know how other specifications define MADV_FREE, but besides
> MADV_REMOVE I've also got the request to provide MADV_FREE in linux,
> this is why I'm asking. (right now I'm telling them to use #ifdef
> __linux__ #define MADV_FREE MADV_DONTNEED but that's quite an hack since
> it could break if we make MADV_DONTNEED non-destructive in the future)

Making their apps work by causing the same breakage to Linux apps is a better
idea?

> Thanks.

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade





___________________________________
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB
http://mail.yahoo.it

2006-01-17 01:33:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

On Tue, Jan 17, 2006 at 02:06:09AM +0100, Blaisorblade wrote:
> I.e. it's a restriction of MADV_REMOVE. Is there anything conceivable
> relying on errors or no behaviour on file-backed memory? If relying on
> errors we could need an API, but if relying only on the NO-OP thing the
> correctness semantics are already implemented. I.e. data are retained on both
> Solaris MADV_FREE and Linux MADV_REMOVE for file-backed case, they get a
> different semantics for caching.

Not sure to understand but merging MADV_REMOVE into MADV_FREE apparently
would break freebsd apps that might expect a noop instead. And it could
break Solaris apps if they execpt a -EINVAL (though the latter is more
dubious, but I doubt making differences is worth it and if freebsd makes
it a noop I'd stick with the noop and leave MADV_REMOVE alone).

> are the Solaris ones. Don't know past behaviour about "breaking existing to
> comply to standards" (new syscall slot?).

The change I suggested would be backwards compatible because it can only
affect performance.

The only thing that can break right now, is running a non-linux (and
apparently posix too) app on a linux system that will corrupt memory
with potential data loss.

> Provide our fine-grained semantics with new, not misunderstandable identifiers
> (MADV_FREE_DISCARD, MADV_FREE_CACHE, for instance).

Why should we deviate for the sake of porting pain, when we can comply
at no tangible risk for us?

> Making their apps work by causing the same breakage to Linux apps is a better
> idea?

Again: if an app breaks it means it's working by pure luck because it's
depending on fragile timings in the first place.

Call it a potential lower performance or less efficient memory
utilization, a breakage not.

If we were to make MADV_DONTNEED more aggressive, then we'd be risking a
breakage, but we're going to relax it instead.

2006-01-17 12:44:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

On Mon, Jan 16, 2006 at 05:04:07PM -0800, Nicholas Miell wrote:
> On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
> > Eric W. Biederman wrote:
> > > As I recall the logic with DONTNEED was to mark the mapping of
> > > the page clean so the page didn't need to be swapped out, it could
> > > just be dropped.
> > >
> > > That is why they anonymous and the file backed cases differ.
> > >
> > > Part of the point is to avoid the case of swapping the pages out if
> > > the application doesn't care what is on them anymore.
> >
> > Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon",
> > and MADV_FREE "I will never need this again".
> >
>
> POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
> flags defined as follows:
>
> POSIX_MADV_NORMAL
> Specifies that the application has no advice to give on its behavior
> with respect to the specified range. It is the default characteristic if
> no advice is given for a range of memory.
> POSIX_MADV_SEQUENTIAL
> Specifies that the application expects to access the specified range
> sequentially from lower addresses to higher addresses.
> POSIX_MADV_RANDOM
> Specifies that the application expects to access the specified range
> in a random order.
> POSIX_MADV_WILLNEED
> Specifies that the application expects to access the specified range
> in the near future.
> POSIX_MADV_DONTNEED
> Specifies that the application expects that it will not access the
> specified range in the near future.
>
> Note that glibc forwards posix_madvise() directly to madvise(2), which
> means that right now, POSIX conformant apps which use
> posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
> data on Linux systems.

Does our MAD_DONTNEED numerical value match glibc's POSIX_MADV_DONTNEED?

In either case I'd say we should backout this patch for now. We should
implement a real MADV_DONTNEED and rename the current one to MADV_FREE,
but that's 2.6.17 material.

2006-01-17 18:25:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

Christoph Hellwig <[email protected]> writes:

> On Mon, Jan 16, 2006 at 05:04:07PM -0800, Nicholas Miell wrote:
>> On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
>> > Eric W. Biederman wrote:
>> > > As I recall the logic with DONTNEED was to mark the mapping of
>> > > the page clean so the page didn't need to be swapped out, it could
>> > > just be dropped.
>> > >
>> > > That is why they anonymous and the file backed cases differ.
>> > >
>> > > Part of the point is to avoid the case of swapping the pages out if
>> > > the application doesn't care what is on them anymore.
>> >
>> > Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon",
>> > and MADV_FREE "I will never need this again".
>> >
>>
>> POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
>> flags defined as follows:
>>
>> POSIX_MADV_NORMAL
>> Specifies that the application has no advice to give on its behavior
>> with respect to the specified range. It is the default characteristic if
>> no advice is given for a range of memory.
>> POSIX_MADV_SEQUENTIAL
>> Specifies that the application expects to access the specified range
>> sequentially from lower addresses to higher addresses.
>> POSIX_MADV_RANDOM
>> Specifies that the application expects to access the specified range
>> in a random order.
>> POSIX_MADV_WILLNEED
>> Specifies that the application expects to access the specified range
>> in the near future.
>> POSIX_MADV_DONTNEED
>> Specifies that the application expects that it will not access the
>> specified range in the near future.
>>
>> Note that glibc forwards posix_madvise() directly to madvise(2), which
>> means that right now, POSIX conformant apps which use
>> posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
>> data on Linux systems.
>
> Does our MAD_DONTNEED numerical value match glibc's POSIX_MADV_DONTNEED?
>
> In either case I'd say we should backout this patch for now. We should
> implement a real MADV_DONTNEED and rename the current one to MADV_FREE,
> but that's 2.6.17 material.

We definitely need to check this. I am fairly certain I have seen this conversation
before.

Eric


2006-01-17 19:06:12

by Badari Pulavarty

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

On Tue, 2006-01-17 at 12:43 +0000, Christoph Hellwig wrote:
> On Mon, Jan 16, 2006 at 05:04:07PM -0800, Nicholas Miell wrote:
> > On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
> > > Eric W. Biederman wrote:
> > > > As I recall the logic with DONTNEED was to mark the mapping of
> > > > the page clean so the page didn't need to be swapped out, it could
> > > > just be dropped.
> > > >
> > > > That is why they anonymous and the file backed cases differ.
> > > >
> > > > Part of the point is to avoid the case of swapping the pages out if
> > > > the application doesn't care what is on them anymore.
> > >
> > > Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon",
> > > and MADV_FREE "I will never need this again".
> > >
> >
> > POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
> > flags defined as follows:
> >
> > POSIX_MADV_NORMAL
> > Specifies that the application has no advice to give on its behavior
> > with respect to the specified range. It is the default characteristic if
> > no advice is given for a range of memory.
> > POSIX_MADV_SEQUENTIAL
> > Specifies that the application expects to access the specified range
> > sequentially from lower addresses to higher addresses.
> > POSIX_MADV_RANDOM
> > Specifies that the application expects to access the specified range
> > in a random order.
> > POSIX_MADV_WILLNEED
> > Specifies that the application expects to access the specified range
> > in the near future.
> > POSIX_MADV_DONTNEED
> > Specifies that the application expects that it will not access the
> > specified range in the near future.
> >
> > Note that glibc forwards posix_madvise() directly to madvise(2), which
> > means that right now, POSIX conformant apps which use
> > posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
> > data on Linux systems.
>
> Does our MAD_DONTNEED numerical value match glibc's POSIX_MADV_DONTNEED?
>
> In either case I'd say we should backout this patch for now. We should
> implement a real MADV_DONTNEED and rename the current one to MADV_FREE,
> but that's 2.6.17 material.

Christoph,

What patch are you recommending backing out ?

Thanks,
Badari

2006-01-17 22:55:24

by Nicholas Miell

[permalink] [raw]
Subject: Re: differences between MADV_FREE and MADV_DONTNEED

On Tue, 2006-01-17 at 11:23 -0700, Eric W. Biederman wrote:
> Christoph Hellwig <[email protected]> writes:
> > On Mon, Jan 16, 2006 at 05:04:07PM -0800, Nicholas Miell wrote:
> >> On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
> >> > Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon",
> >> > and MADV_FREE "I will never need this again".
> >> >
> >>
> >> POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
> >> flags defined as follows:
> >>
> >> POSIX_MADV_NORMAL
> >> Specifies that the application has no advice to give on its behavior
> >> with respect to the specified range. It is the default characteristic if
> >> no advice is given for a range of memory.
> >> POSIX_MADV_SEQUENTIAL
> >> Specifies that the application expects to access the specified range
> >> sequentially from lower addresses to higher addresses.
> >> POSIX_MADV_RANDOM
> >> Specifies that the application expects to access the specified range
> >> in a random order.
> >> POSIX_MADV_WILLNEED
> >> Specifies that the application expects to access the specified range
> >> in the near future.
> >> POSIX_MADV_DONTNEED
> >> Specifies that the application expects that it will not access the
> >> specified range in the near future.
> >>
> >> Note that glibc forwards posix_madvise() directly to madvise(2), which
> >> means that right now, POSIX conformant apps which use
> >> posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
> >> data on Linux systems.
> >
> > Does our MAD_DONTNEED numerical value match glibc's POSIX_MADV_DONTNEED?
> >
> > In either case I'd say we should backout this patch for now. We should
> > implement a real MADV_DONTNEED and rename the current one to MADV_FREE,
> > but that's 2.6.17 material.
>
> We definitely need to check this. I am fairly certain I have seen this conversation
> before.

Yes, POSIX_MADV_* have the same values as MADV_*. And if you're trying
to find the actual implementation of posix_madvise() to verify its
behavior, it is generated by script from a line in
libc/sysdeps/unix/sysv/linux/syscalls.list.

--
Nicholas Miell <[email protected]>