2015-02-24 12:58:22

by Rafael Aquini

Subject: [PATCH] mm: readahead: get back a sensible upper limit

commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for memoryless NUMA
nodes and limit readahead pages")[1] imposed a 2 MB hard limit on readahead by
changing max_sane_readahead(), to sort out a corner case where a thread running
on a memoryless NUMA node would have its readahead capability disabled.

The aforementioned change, despite fixing that corner case, is detrimental to
other ordinary workloads that memory-map big files and rely on the readahead()
or posix_fadvise(WILLNEED) syscalls to populate the system's cache with most of
the file.
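
The pattern these workloads follow looks roughly like this (illustrative
fragment only; path/len stand in for the file being mapped, error handling
omitted):

	int fd = open(path, O_RDONLY);

	/* ask the kernel to start reading the whole file into the page cache */
	posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

	/* then touch the mapping, expecting most pages to already be cached */
	char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);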

Laurence Oberman reports, via https://bugzilla.redhat.com/show_bug.cgi?id=1187940,
slowdowns of up to 3-4 times after the changes from the mentioned commit [1] were
introduced into the RHEL kernel. We also have an upstream bugzilla open for a
similar complaint: https://bugzilla.kernel.org/show_bug.cgi?id=79111

This patch brings back the old behavior of max_sane_readahead(), where we used
to consider NR_INACTIVE_FILE and NR_FREE_PAGES pages to derive a sensible,
adjustable readahead upper limit. It also keeps the 2 MB value introduced by
commit [1], now as a floor for the computed limit, to avoid regressions on
CONFIG_HAVE_MEMORYLESS_NODES systems, where numa_mem_id() might, due to some
bug, end up not returning the 'local memory' node for a memoryless-node CPU.

Reported-by: Laurence Oberman <[email protected]>
Tested-by: Laurence Oberman <[email protected]>
Signed-off-by: Rafael Aquini <[email protected]>
---
mm/readahead.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 9356758..73f934d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -203,6 +203,7 @@ out:
return ret;
}

+#define MAX_READAHEAD ((512 * 4096) / PAGE_CACHE_SIZE)
/*
* Chunk the readahead into 2 megabyte units, so that we don't pin too much
* memory at once.
@@ -217,7 +218,7 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
while (nr_to_read) {
int err;

- unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+ unsigned long this_chunk = MAX_READAHEAD;

if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
@@ -232,14 +233,15 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
return 0;
}

-#define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE)
/*
* Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
* sensible upper limit.
*/
unsigned long max_sane_readahead(unsigned long nr)
{
- return min(nr, MAX_READAHEAD);
+ return min(nr, max(MAX_READAHEAD,
+ (node_page_state(numa_mem_id(), NR_INACTIVE_FILE) +
+ node_page_state(numa_mem_id(), NR_FREE_PAGES)) / 2));
}

/*
--
1.9.3


2015-02-24 20:50:27

by David Rientjes

Subject: Re: [PATCH] mm: readahead: get back a sensible upper limit

On Tue, 24 Feb 2015, Rafael Aquini wrote:

> commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for memoryless NUMA
> nodes and limit readahead pages")[1] imposed a 2 MB hard limit on readahead by
> changing max_sane_readahead(), to sort out a corner case where a thread running
> on a memoryless NUMA node would have its readahead capability disabled.
>
> [ rest of changelog and patch snipped ]

I think Linus suggested avoiding the complexity of any heuristics involving
per-node memory state, specifically in
http://www.kernelhub.org/?msg=413344&p=2, and proposed the fixed
MAX_READAHEAD size instead.

If we are to go forward with this revert, then I believe the change to
numa_mem_id() will fix the memoryless node issue as pointed out in that
thread.

2015-02-24 21:13:54

by Rafael Aquini

Subject: Re: [PATCH] mm: readahead: get back a sensible upper limit

On Tue, Feb 24, 2015 at 12:50:20PM -0800, David Rientjes wrote:
> On Tue, 24 Feb 2015, Rafael Aquini wrote:
>
> > [ original changelog and patch snipped ]
>
> I think Linus suggested avoiding the complexity of any heuristics involving
> per-node memory state, specifically in
> http://www.kernelhub.org/?msg=413344&p=2, and proposed the fixed
> MAX_READAHEAD size instead.
>

The problem I think that thread skipped over is that we were already shipping
readahead to the I/O system in 2 MB chunks -- take a look at
force_page_cache_readahead() -- and the 2 MB hard ceiling Raghavendra's patch
introduced ended up capping all readahead activity at 2 MB regardless of the
system memory state.
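
Roughly speaking: before [1], a posix_fadvise(WILLNEED) over, say, a 1 GB file
was capped only by the node memory heuristic and then submitted in 2 MB chunks;
after [1], the cap itself became 2 MB, so only the first 2 MB gets read ahead
and the rest of the file is faulted in through the ordinary per-file readahead
window (read_ahead_kb).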

By doing so, it ended up affecting all users of force_page_cache_readahead(),
like readahead(2) and posix_fadvise(2). At the time, Raghavendra was able to
report tangible gains for readahead on NUMA layouts where some nodes are
memoryless, and no one else reported measurable losses due to the change.

Unfortunately, we now have people complaining about losses, for ordinary
workloads, that are as tangible as the gains reported for the original corner
case. That's why I think the right thing to do here is to partially revert
that change, to get the old behaviour back without losing the 2 MB fallback
it introduced for the corner cases.

Regards,

-- Rafael

> If we are to go forward with this revert, then I believe the change to
> numa_mem_id() will fix the memoryless node issue as pointed out in that
> thread.

2015-02-24 21:56:27

by Linus Torvalds

Subject: Re: [PATCH] mm: readahead: get back a sensible upper limit

On Tue, Feb 24, 2015 at 4:58 AM, Rafael Aquini <[email protected]> wrote:
>
> This patch brings back the old behavior of max_sane_readahead()

Yeah no.

There was a reason that code was killed. No way in hell are we
bringing back the insanities with node memory etc.

Also, we have never actually heard of anything sane that actually
depended on this. Last time this came up it was a made-up benchmark,
not an actual real load that cared.

Who can possibly care about this in real life?

Linus

2015-02-24 22:09:00

by Rafael Aquini

Subject: Re: [PATCH] mm: readahead: get back a sensible upper limit

On Tue, Feb 24, 2015 at 01:56:25PM -0800, Linus Torvalds wrote:
> On Tue, Feb 24, 2015 at 4:58 AM, Rafael Aquini <[email protected]> wrote:
> >
> > This patch brings back the old behavior of max_sane_readahead()
>
> Yeah no.
>
> There was a reason that code was killed. No way in hell are we
> bringing back the insanities with node memory etc.
>

Would you consider bringing it back, but utilizing global memory state
instead of the node memory state?
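
Something along these lines (untested sketch, keeping the same /2 heuristic
but switching to the global counters):

unsigned long max_sane_readahead(unsigned long nr)
{
	return min(nr, max(MAX_READAHEAD,
			   (global_page_state(NR_INACTIVE_FILE) +
			    global_page_state(NR_FREE_PAGES)) / 2));
}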

> Also, we have never actually heard of anything sane that actually
> depended on this. Last time this came up it was a made-up benchmark,
> not an actual real load that cared.
>
> Who can possibly care about this in real life?
>
People are filing bugs complaining that their applications which memory-map
files are getting hurt by it.

-- Rafael

2015-02-24 22:12:30

by Linus Torvalds

Subject: Re: [PATCH] mm: readahead: get back a sensible upper limit

On Tue, Feb 24, 2015 at 2:08 PM, Rafael Aquini <[email protected]> wrote:
>
> Would you consider bringing it back, but instead of node memory state,
> utilizing global memory state instead?

Maybe. At least it would be saner than picking random values that make
absolutely no sense.

> People filing bugs complaining their applications that memory map files
> are getting hurt by it.

Show them. And as mentioned, last time this came up (and it has come
up before), it wasn't actually a real load, but some benchmark that
just did the prefetch, and then people were upset because their
benchmark numbers changed.

Which quite frankly doesn't make me care. The benchmark could equally
well just be changed to do prefetching in saner chunks instead.
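
For illustration, an untested sketch of such a chunked approach, where fdin,
src and size stand for the benchmark's file descriptor, read-only mapping and
file size (hypothetical names):

	#define WINDOW (2UL * 1024 * 1024)	/* 2 MB prefetch window */
	static char dst[WINDOW];
	off_t off;

	for (off = 0; off + (off_t)WINDOW <= size; off += WINDOW) {
		/* kick off readahead for the next window while copying this one */
		posix_fadvise(fdin, off + WINDOW, WINDOW, POSIX_FADV_WILLNEED);
		memcpy(dst, src + off, WINDOW);
	}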

So I really want to see real numbers from real loads, not some
nebulous "people noticed and complain" that doesn't even specify what
they did.

Linus

2015-02-24 22:54:12

by Laurence Oberman

Subject: Re: [PATCH] mm: readahead: get back a sensible upper limit

----- Original Message -----
From: "Linus Torvalds" <[email protected]>
To: "Rafael Aquini" <[email protected]>
Sent: Tuesday, February 24, 2015 5:12:28 PM
Subject: Re: [PATCH] mm: readahead: get back a sensible upper limit

[ quoted text snipped ]

Hello,

Any way to get a change in, even if it's based on global memory state, would
help this customer and others. They noticed this change (they call it a
performance regression) when they moved to the newer kernel. I understand
that we won't be able to do a straight revert.

Patch applied to our kernel

diff -Nurp linux-2.6.32-504.3.3.el6.orig/mm/readahead.c linux-2.6.32-504.3.3.el6/mm/readahead.c
--- linux-2.6.32-504.3.3.el6.orig/mm/readahead.c 2014-12-12 15:29:35.000000000 -0500
+++ linux-2.6.32-504.3.3.el6/mm/readahead.c 2015-02-03 11:05:15.103030796 -0500
@@ -228,14 +228,14 @@ int force_page_cache_readahead(struct ad
return ret;
}

-#define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE)
/*
* Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
* sensible upper limit.
*/
unsigned long max_sane_readahead(unsigned long nr)
{
- return min(nr, MAX_READAHEAD);
+ return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
+ + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
}

/*

This is the customer's requirement:

Note:

The customer wants the default read_ahead_kb to be small (32), and to control the size when necessary via posix_fadvise().

"
To reiterate, what we want is readahead to be small (we use 32KB in production), but leverage posix_fadvise if we need to load a big file.
In the past that worked since I/O size was only limited by max_sectors_kb.
In the newer kernel it does not work anymore, since readahead also limits I/O sizes
"

Test program used to reproduce.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h> /* mmap() is defined in this header */
#include <fcntl.h>

static void err_quit(const char *msg)
{
	fputs(msg, stderr);
	exit(1);
}

int main(int argc, char *argv[])
{
	int fdin;
	char *src;
	char dst[262144];
	struct stat statbuf;
	unsigned long i;
	long ret;

	if (argc != 2)
		err_quit("usage: a.out <fromfile>\n");

	/* open the input file */
	if ((fdin = open(argv[1], O_RDONLY)) < 0) {
		printf("can't open %s for reading\n", argv[1]);
		return 0;
	}

	/* ask the kernel to read the whole file ahead into the page cache */
	ret = posix_fadvise(fdin, 0, 0, POSIX_FADV_WILLNEED);
	printf("ret = %ld\n", ret);

	/* find size of input file */
	if (fstat(fdin, &statbuf) < 0) {
		printf("fstat error\n");
		return 0;
	}
	printf("Size of input file = %ld\n", statbuf.st_size);

	/* mmap the input file */
	src = mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, fdin, 0);
	if (src == MAP_FAILED) {
		printf("mmap error for input");
		return 0;
	}

	/*
	 * This copies the input file to the data buffer in 256k blocks;
	 * assumes the input file is larger than 256k.
	 */
	for (i = 0; i < statbuf.st_size - 262144; i += 262144)
		memcpy(dst, src + i, 262144);

	return 0;
} /* main */
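
(For reference: built with something like gcc -o mmapexe mmapexe.c -- source
file name hypothetical -- and run against a ~1 GB input file, ./xaa, as shown
below.)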


Without the patch
-----------------
Here we end up constrained to 32 KB; see the reads Size column below.

read_ahead set to 32.
queue]# cat read_ahead_kb
32

echo 3 > /proc/sys/vm/drop_caches

time ./mmapexe ./xaa
Size of input file = 1048576000

Run time is > 3 times the time seen when the patch is applied.

real 0m21.101s
user 0m0.295s
sys 0m0.744s

I/O size below is constrained to 32 KB; see reads (Size).

# DISK STATISTICS (/sec)
# <---------reads---------><---------writes---------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util
09:47:43 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0
09:47:44 sdg 3660 0 59 62 0 0 0 0 62 2 3 1 7
09:47:45 sdg 59040 0 1845 32 0 0 0 0 32 1 1 0 100
09:47:46 sdg 58624 0 1832 32 0 0 0 0 32 1 1 0 99
09:47:47 sdg 63072 0 1971 32 0 0 0 0 32 1 1 0 99
09:47:48 sdg 61856 0 1933 32 0 0 0 0 32 1 1 0 99
09:47:49 sdg 59488 0 1859 32 0 0 0 0 32 1 1 0 99
09:47:50 sdg 62240 0 1945 32 0 0 0 0 32 1 1 0 99
09:47:51 sdg 42080 0 1315 32 0 0 0 0 32 1 1 0 99
09:47:52 sdg 22112 0 691 32 0 0 0 0 32 1 3 1 99
09:47:53 sdg 42240 0 1320 32 0 0 0 0 32 1 1 0 99
09:47:54 sdg 41472 0 1296 32 0 0 0 0 32 1 1 0 99
09:47:55 sdg 42080 0 1315 32 0 0 0 0 32 1 1 0 99
09:47:56 sdg 42112 0 1316 32 0 0 0 0 32 1 1 0 99
09:47:57 sdg 42144 0 1317 32 0 0 0 0 32 1 1 0 99
09:47:58 sdg 42176 0 1318 32 0 0 0 0 32 1 1 0 99
09:47:59 sdg 42048 0 1314 32 0 0 0 0 32 1 1 0 99
09:48:00 sdg 40384 0 1262 32 0 0 0 0 32 1 1 0 99
09:48:01 sdg 29792 0 931 32 0 0 0 0 32 1 1 1 99
09:48:02 sdg 49984 0 1562 32 0 0 0 0 32 1 1 0 99
09:48:03 sdg 59488 0 1859 32 0 0 0 0 32 1 1 0 99
09:48:04 sdg 59520 0 1860 32 0 0 0 0 32 1 1 0 99
09:48:05 sdg 57664 0 1802 32 0 0 0 0 32 1 1 0 100
..

With the revert patch.
----------------------

read_ahead set to 32.
queue]# cat read_ahead_kb
32

[queue]# echo 3 > /proc/sys/vm/drop_caches

I/O sizes are able to use the larger size (512 KB) thanks to the posix_fadvise()

[queue]# collectl -sD -oT -i1 | grep -e "#" -e sdg
# DISK STATISTICS (/sec)
# <---------reads---------><---------writes---------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util
21:49:11 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0
21:49:12 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0
21:49:13 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0
21:49:14 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0
21:49:15 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0
21:49:16 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0
21:49:17 sdg 23568 183 53 445 0 0 0 0 444 130 100 3 20
21:49:18 sdg 162304 320 317 512 0 0 0 0 512 143 418 3 99
21:49:19 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99
21:49:20 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99
21:49:21 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99
21:49:22 sdg 163840 320 320 512 0 0 0 0 512 143 447 3 99
21:49:23 sdg 163840 213 320 512 0 0 0 0 512 125 447 3 99
21:49:24 sdg 18944 0 37 512 0 0 0 0 512 19 442 2 9
21:49:25 sdg 0 0 0 0 0 0 0 0 0 0 0 0 0

time ./mmapexe ./xaa

Size of input file = 1048576000

real 0m6.329s
user 0m0.243s
sys 0m0.260s


Laurence Oberman
Red Hat Global Support Service
SEG Team