2002-12-09 03:56:53

by Kingsley Cheung

Subject: [TRIVIAL PATCH 2.4.20] madvise_willneed makes bad limit comparison

Hi,

'madvise_willneed' makes an incorrect rss limit comparison. It
directly compares rlim[RLIMIT_RSS].rlim_cur to rss. The former is in
bytes, whereas the latter is in pages. The fix for this is trivial.
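
To make the unit mismatch concrete, here is a minimal userspace sketch, not kernel code. The helper name `exceeds_rss_limit` and its parameters are hypothetical; it only assumes what the description above states, namely that the limit is in bytes while the rss and the request are in pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>   /* rlim_t, RLIM_INFINITY */

/*
 * Hypothetical helper, not kernel code: returns true when rss_pages plus
 * request_pages would exceed an RLIMIT_RSS-style limit given in bytes.
 * page_shift is log2 of the page size (12 for 4 KiB pages).
 */
static bool exceeds_rss_limit(unsigned long rss_pages,
                              unsigned long request_pages,
                              rlim_t limit_bytes,
                              unsigned int page_shift)
{
    /* Shifting by the full width of the type would be undefined. */
    assert(page_shift < sizeof(rlim_t) * 8);

    if (limit_bytes == RLIM_INFINITY)
        return false;   /* no limit set, nothing to compare */

    /* Convert bytes to pages so both sides use the same unit. */
    rlim_t limit_pages = limit_bytes >> page_shift;

    return (rlim_t)rss_pages + request_pages > limit_pages;
}
```

With a 1 MiB limit and 4 KiB pages the limit is 256 pages, so 200 resident pages plus a 100-page request exceeds it. Comparing the page count 300 directly against the byte value 1048576, as the unpatched code does, would not.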

[As an aside, one question is whether this limit check is needed at
all. Most rss limit enforcement implementations that I've seen are
'soft', whereas this would give the limit 'hard' semantics. Do we
really want 'hard' limit semantics?]


diff -urN linux-2.4.20/mm/filemap.c linux-2.4.20patched/mm/filemap.c
--- linux-2.4.20/mm/filemap.c Mon Dec 9 14:19:13 2002
+++ linux-2.4.20patched/mm/filemap.c Mon Dec 9 14:36:08 2002
@@ -2471,10 +2471,12 @@

/* Make sure this doesn't exceed the process's max rss. */
error = -EIO;
- rlim_rss = current->rlim ? current->rlim[RLIMIT_RSS].rlim_cur :
- LONG_MAX; /* default: see resource.h */
- if ((vma->vm_mm->rss + (end - start)) > rlim_rss)
- return error;
+ rlim_rss = current->rlim[RLIMIT_RSS].rlim_cur;
+ if (rlim_rss != RLIM_INFINITY) {
+ rlim_rss >>= PAGE_SHIFT;
+ if ((vma->vm_mm->rss + (end - start)) > rlim_rss)
+ return error;
+ }

/* round to cluster boundaries if this isn't a "random" area. */
if (!VM_RandomReadHint(vma)) {


--
Kingsley


2002-12-09 04:45:42

by Kingsley Cheung

Subject: [TRIVIAL PATCH 2.5.50] madvise_willneed makes bad limit comparison

Hi,

Trivial patch against 2.5.50, similar to the patch I sent earlier for
2.4.20, to fix the bad rss limit comparison in 'madvise_willneed'.


diff -urN linux-2.5.50/mm/madvise.c linux-2.5.50patched/mm/madvise.c
--- linux-2.5.50/mm/madvise.c Thu Nov 28 09:35:53 2002
+++ linux-2.5.50patched/mm/madvise.c Mon Dec 9 15:20:48 2002
@@ -75,10 +75,12 @@

/* Make sure this doesn't exceed the process's max rss. */
error = -EIO;
- rlim_rss = current->rlim ? current->rlim[RLIMIT_RSS].rlim_cur :
- LONG_MAX; /* default: see resource.h */
- if ((vma->vm_mm->rss + (end - start)) > rlim_rss)
- return error;
+ rlim_rss = current->rlim[RLIMIT_RSS].rlim_cur;
+ if (rlim_rss != RLIM_INFINITY) {
+ rlim_rss >>= PAGE_SHIFT;
+ if ((vma->vm_mm->rss + (end - start)) > rlim_rss)
+ return error;
+ }

do_page_cache_readahead(file->f_dentry->d_inode->i_mapping, file, start, end - start);
return 0;


--
Kingsley

2002-12-09 22:20:00

by Andrew Morton

Subject: Re: [TRIVIAL PATCH 2.4.20] madvise_willneed makes bad limit comparison

Kingsley Cheung wrote:
>
> Hi,
>
> 'madvise_willneed' makes an incorrect rss limit comparison. It
> directly compares rlim[RLIMIT_RSS].rlim_cur to rss. The former is in
> bytes, whereas the latter is in pages. The fix for this is trivial.
>

It's surely a bug, but looking at the code, one does ask "what
on earth is it trying to do"?

1) -EIO is not a recognised (or appropriate) return value.

2) If the MADV_WILLNEED call fails, all the user needs to do is to
use a smaller chunk, and walk across the file using that chunk
size! The only system-protecting limit here is the request queue
size.

3) We don't know that the application will try to map all that readahead
at the same time anyway. And if it does, the rlimits will catch it.
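
Point 2 can be pictured from userspace with a rough sketch like the one below. The function name `willneed_in_chunks` and the chunk size are illustrative assumptions; only `madvise()` and `MADV_WILLNEED` are real interfaces. It walks a mapping in fixed-size pieces, so a per-call size check never sees the whole range at once.

```c
#define _DEFAULT_SOURCE 1   /* expose madvise() on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>       /* madvise, MADV_WILLNEED */

/*
 * Issue MADV_WILLNEED over [addr, addr + len) in pieces of at most
 * `chunk` bytes. addr and chunk are assumed to be page-aligned.
 * Returns the number of madvise() calls made; *failed (if non-NULL)
 * receives how many of those calls returned an error.
 */
static size_t willneed_in_chunks(void *addr, size_t len, size_t chunk,
                                 size_t *failed)
{
    assert(chunk > 0);

    char *p = addr;
    size_t calls = 0, errors = 0;

    while (len > 0) {
        size_t n = len < chunk ? len : chunk;

        if (madvise(p, n, MADV_WILLNEED) != 0)
            errors++;   /* it is only advice, so keep going */
        p += n;
        len -= n;
        calls++;
    }
    if (failed)
        *failed = errors;
    return calls;
}
```

A 16-page mapping walked in 4-page chunks results in four calls, each of which asks for only a quarter of the range.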

Linus used "half the size of the inactive list" in sys_readahead. That's
probably as good as anything else. I'd suggest that we just share
that bit of code in madvise.

hmm. Also the new readahead code will allocate all that memory up-front
before putting it under I/O. I'll fix up do_page_cache_readahead()
for that.


> [As an aside, one question is whether this limit check is needed at
> all. Most rss limit enforcement implementations that I've seen are
> 'soft', whereas this would give the limit 'hard' semantics. Do we
> really want 'hard' limit semantics?]
>

I agree that failing with an error is inappropriate.

We should limit the readahead according to machine size, disk bandwidth,
free memory availability, shoe size, etc. And once that's done then
it _has_ to return success. Otherwise the application would see
different results depending on system size and activity.

It is just "advice".
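
As a small model of the behaviour argued for here, the sketch below trims a request to half of an inactive-list size and then reports success regardless. The names `sane_readahead` and `advise_willneed` are hypothetical, and the inactive page count is passed in as a plain parameter because this is a userspace illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Trim a request of nr pages to at most half of `inactive` pages. */
static unsigned long sane_readahead(unsigned long nr, unsigned long inactive)
{
    unsigned long max = inactive / 2;

    return nr < max ? nr : max;
}

/*
 * Hypothetical advisory call: the request is trimmed, never rejected.
 * *queued receives the number of pages that would be read ahead.
 * Always returns 0, so callers see the same result on any machine.
 */
static int advise_willneed(unsigned long nr, unsigned long inactive,
                           unsigned long *queued)
{
    assert(queued != NULL);

    *queued = sane_readahead(nr, inactive);
    return 0;
}
```

For example, a 1000-page request against a 512-page inactive list queues 256 pages and still returns 0, while a 10-page request is passed through unchanged.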

2002-12-10 06:01:18

by Kingsley Cheung

Subject: Re: [TRIVIAL PATCH 2.4.20] madvise_willneed makes bad limit comparison

On Sun, Dec 08, 2002 at 10:29:41PM -0800, Andrew Morton wrote:
>
> It's surely a bug, but looking at the code, one does ask "what
> on earth is it trying to do"?
>

Thanks for that Andrew. Yisshhhh. I merely took the bug at face
value. This is not as trivial as I first thought.

> 1) -EIO is not a recognised (or appropriate) return value.
>

Aye, I overlooked that.

> 2) If the MADV_WILLNEED call fails, all the user needs to do is to
> use a smaller chunk, and walk across the file using that chunk
> size! The only system-protecting limit here is the request queue
> size.
>
> 3) We don't know that the application will try to map all that readahead
> at the same time anyway. And if it does, the rlimits will catch it.
>

Yes. Though currently there is no enforcement for RLIMIT_RSS
implemented. I guess when it's there it will catch it when the process
starts faulting on those pages.

> Linus used "half the size of the inactive list" in sys_readahead. That's
> probably as good as anything else. I'd suggest that we just share
> that bit of code in madvise.
>

<snip>

> I agree that failing with an error is inappropriate.
>
> We should limit the readahead according to machine size, disk bandwidth,
> free memory availability, shoe size, etc. And once that's done then
> it _has_ to return success. Otherwise the application would see
> different results depending on system size and activity.
>
> It is just "advice".

So then something of the following without the check is more
appropriate or a starting point then?


diff -urN linux-2.4.20/mm/filemap.c linux-2.4.20patched/mm/filemap.c
--- linux-2.4.20/mm/filemap.c Mon Dec 9 14:19:13 2002
+++ linux-2.4.20patched/mm/filemap.c Tue Dec 10 15:30:05 2002
@@ -2455,7 +2455,7 @@
{
long error = -EBADF;
struct file * file;
- unsigned long size, rlim_rss;
+ unsigned long size, max;

/* Doesn't work if there's no mapped file. */
if (!vma->vm_file)
@@ -2469,12 +2469,10 @@
end = vma->vm_end;
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

- /* Make sure this doesn't exceed the process's max rss. */
- error = -EIO;
- rlim_rss = current->rlim ? current->rlim[RLIMIT_RSS].rlim_cur :
- LONG_MAX; /* default: see resource.h */
- if ((vma->vm_mm->rss + (end - start)) > rlim_rss)
- return error;
+ /* Like sys_readahead, limit to a sane percentage of inactive list.. */
+ max = nr_inactive_pages / 2;
+ if ((end - start) > max)
+ end = start + max;

/* round to cluster boundaries if this isn't a "random" area. */
if (!VM_RandomReadHint(vma)) {


--
Kingsley

2002-12-10 07:48:00

by Andrew Morton

Subject: Re: [TRIVIAL PATCH 2.4.20] madvise_willneed makes bad limit comparison

Kingsley Cheung wrote:
>
>...
> So then something of the following without the check is more
> appropriate or a starting point then?
>

Looks good to me. Here's a 2.5 version...

include/linux/mm.h | 1 +
mm/filemap.c | 12 +-----------
mm/madvise.c | 23 +++++------------------
mm/readahead.c | 13 +++++++++++++
4 files changed, 20 insertions(+), 29 deletions(-)

--- 25/mm/filemap.c~max_sane_readahead Mon Dec 9 23:44:47 2002
+++ 25-akpm/mm/filemap.c Mon Dec 9 23:47:14 2002
@@ -898,20 +898,10 @@ static ssize_t
do_readahead(struct address_space *mapping, struct file *filp,
unsigned long index, unsigned long nr)
{
- unsigned long max;
- unsigned long active;
- unsigned long inactive;
-
if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
return -EINVAL;

- /* Limit it to a sane percentage of the inactive list.. */
- get_zone_counts(&active, &inactive);
- max = inactive / 2;
- if (nr > max)
- nr = max;
-
- do_page_cache_readahead(mapping, filp, index, nr);
+ do_page_cache_readahead(mapping, filp, index, max_sane_readahead(nr));
return 0;
}

--- 25/mm/readahead.c~max_sane_readahead Mon Dec 9 23:44:55 2002
+++ 25-akpm/mm/readahead.c Mon Dec 9 23:48:39 2002
@@ -465,3 +465,16 @@ void handle_ra_miss(struct address_space
ra->next_size = min;
}
}
+
+/*
+ * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
+ * sensible upper limit.
+ */
+unsigned long max_sane_readahead(unsigned long nr)
+{
+ unsigned long active;
+ unsigned long inactive;
+
+ get_zone_counts(&active, &inactive);
+ return min(nr, inactive / 2);
+}
--- 25/include/linux/mm.h~max_sane_readahead Mon Dec 9 23:47:45 2002
+++ 25-akpm/include/linux/mm.h Mon Dec 9 23:48:06 2002
@@ -524,6 +524,7 @@ void page_cache_readaround(struct addres
unsigned long offset);
void handle_ra_miss(struct address_space *mapping,
struct file_ra_state *ra);
+unsigned long max_sane_readahead(unsigned long nr);

/* Do stack extension */
extern int expand_stack(struct vm_area_struct * vma, unsigned long address);
--- 25/mm/madvise.c~max_sane_readahead Mon Dec 9 23:52:37 2002
+++ 25-akpm/mm/madvise.c Mon Dec 9 23:52:41 2002
@@ -51,36 +51,23 @@ static long madvise_behavior(struct vm_a
}

/*
- * Schedule all required I/O operations, then run the disk queue
- * to make sure they are started. Do not wait for completion.
+ * Schedule all required I/O operations. Do not wait for completion.
*/
static long madvise_willneed(struct vm_area_struct * vma,
unsigned long start, unsigned long end)
{
- long error = -EBADF;
- struct file * file;
- unsigned long size, rlim_rss;
+ struct file *file = vma->vm_file;

- /* Doesn't work if there's no mapped file. */
if (!vma->vm_file)
- return error;
- file = vma->vm_file;
- size = (file->f_dentry->d_inode->i_size + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
+ return -EBADF;

start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
if (end > vma->vm_end)
end = vma->vm_end;
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

- /* Make sure this doesn't exceed the process's max rss. */
- error = -EIO;
- rlim_rss = current->rlim ? current->rlim[RLIMIT_RSS].rlim_cur :
- LONG_MAX; /* default: see resource.h */
- if ((vma->vm_mm->rss + (end - start)) > rlim_rss)
- return error;
-
- do_page_cache_readahead(file->f_dentry->d_inode->i_mapping, file, start, end - start);
+ do_page_cache_readahead(file->f_dentry->d_inode->i_mapping,
+ file, start, max_sane_readahead(end - start));
return 0;
}


_
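
The start/end arithmetic that madvise_willneed keeps in the patch above turns virtual addresses into page indexes within the file. A worked userspace sketch of that conversion follows; the helper name `addr_to_file_page` is hypothetical, and `vm_start`, `vm_pgoff` and `page_shift` are plain parameters standing in for the kernel's vma fields and PAGE_SHIFT.

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the conversion used in the patch:
 * map a virtual address inside a mapping to a page index in the file.
 *   vm_start   - first address of the mapping
 *   vm_pgoff   - file offset at which the mapping starts, in pages
 *   page_shift - log2 of the page size (12 for 4 KiB pages)
 */
static unsigned long addr_to_file_page(unsigned long addr,
                                       unsigned long vm_start,
                                       unsigned long vm_pgoff,
                                       unsigned int page_shift)
{
    assert(addr >= vm_start);

    return ((addr - vm_start) >> page_shift) + vm_pgoff;
}
```

With a mapping that starts at 0x40000000 and begins 16 pages into the file, the address range 0x40003000 to 0x40008000 corresponds to file pages 19 to 24, so `end - start` is a count of 5 pages, which is the quantity handed to max_sane_readahead().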