Currently, max_sane_readahead() returns zero on a cpu whose local numa
node has no memory; fix this by checking for the empty numa node case
during the calculation. We also limit the number of readahead pages to
4096 (16MB with 4KB pages).
Signed-off-by: Raghavendra K T <[email protected]>
---
This patch limits readahead to 4096 pages (a 16MB limit was suggested
by Linus) and also handles the case of a memoryless cpu causing
readahead failures. We still do not consider [fm]advise()-specific
calculations here. I have dropped the idea of iterating over numa
nodes to calculate free pages. I am not sure whether there is any
impact on big streaming apps. Comments/suggestions?
mm/readahead.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 7cdbb44..be4d205 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -237,14 +237,25 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
return ret;
}
+#define MAX_REMOTE_READAHEAD 4096UL
/*
* Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
* sensible upper limit.
*/
unsigned long max_sane_readahead(unsigned long nr)
{
- return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
- + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+ unsigned long local_free_page;
+ unsigned long sane_nr = min(nr, MAX_REMOTE_READAHEAD);
+
+ local_free_page = node_page_state(numa_node_id(), NR_INACTIVE_FILE)
+ + node_page_state(numa_node_id(), NR_FREE_PAGES);
+
+ /*
+ * Readahead onto remote memory is better than no readahead when local
+ * numa node does not have memory. We sanitize readahead size depending
+ * on free memory in the local node but limiting to 4k pages.
+ */
+ return local_free_page ? min(sane_nr, local_free_page / 2) : sane_nr;
}
/*
--
1.7.11.7
On Mon 06-01-14 15:51:55, Raghavendra K T wrote:
> Currently, max_sane_readahead() returns zero on a cpu whose local numa
> node has no memory; fix this by checking for the empty numa node case
> during the calculation. We also limit the number of readahead pages to
> 4096 (16MB with 4KB pages).
>
> Signed-off-by: Raghavendra K T <[email protected]>
> ---
> This patch limits readahead to 4096 pages (a 16MB limit was suggested
> by Linus) and also handles the case of a memoryless cpu causing
> readahead failures. We still do not consider [fm]advise()-specific
> calculations here. I have dropped the idea of iterating over numa
> nodes to calculate free pages. I am not sure whether there is any
> impact on big streaming apps. Comments/suggestions?
As you say, I would also be interested in what impact this has on a
streaming application. It should be rather easy to check: create a 1 GB
file and drop caches. Then measure how long it takes to open the file,
call fadvise FADV_WILLNEED, and read the whole file (for kernels with
and without your patch). Do several measurements so that we get some
meaningful statistics. The resulting numbers can then be part of the
changelog. Thanks!
Honza
> mm/readahead.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 7cdbb44..be4d205 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -237,14 +237,25 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
> return ret;
> }
>
> +#define MAX_REMOTE_READAHEAD 4096UL
> /*
> * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> * sensible upper limit.
> */
> unsigned long max_sane_readahead(unsigned long nr)
> {
> - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> + unsigned long local_free_page;
> + unsigned long sane_nr = min(nr, MAX_REMOTE_READAHEAD);
> +
> + local_free_page = node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> + + node_page_state(numa_node_id(), NR_FREE_PAGES);
> +
> + /*
> + * Readahead onto remote memory is better than no readahead when local
> + * numa node does not have memory. We sanitize readahead size depending
> + * on free memory in the local node but limiting to 4k pages.
> + */
> + return local_free_page ? min(sane_nr, local_free_page / 2) : sane_nr;
> }
>
> /*
> --
> 1.7.11.7
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
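To make Jan's recipe concrete, a minimal timing harness could look like
the sketch below. This is only an illustration, not code from the
thread: the file name "testfile", the direct write to
/proc/sys/vm/drop_caches (which needs root), and the gettimeofday-based
timing are all assumptions.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>

int main(void)
{
        char buf[4096];
        struct stat st;
        struct timeval t0, t1;

        /* Drop the page cache so every run starts cold (needs root). */
        int dc = open("/proc/sys/vm/drop_caches", O_WRONLY);
        if (dc >= 0) {
                write(dc, "3", 1);
                close(dc);
        }

        gettimeofday(&t0, NULL);

        int fd = open("testfile", O_RDONLY);
        if (fd < 0)
                return 1;
        fstat(fd, &st);
        /* Ask for readahead of the whole file, then read it back. */
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
        while (read(fd, buf, sizeof(buf)) > 0)
                ;
        close(fd);

        gettimeofday(&t1, NULL);
        printf("elapsed: %.3f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
        return 0;
}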
On Mon, 6 Jan 2014 15:51:55 +0530 Raghavendra K T <[email protected]> wrote:
> Currently, max_sane_readahead() returns zero on a cpu whose local numa
> node has no memory; fix this by checking for the empty numa node case
> during the calculation. We also limit the number of readahead pages to
> 4096 (16MB with 4KB pages).
>
> ...
>
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -237,14 +237,25 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
> return ret;
> }
>
> +#define MAX_REMOTE_READAHEAD 4096UL
> /*
> * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> * sensible upper limit.
> */
> unsigned long max_sane_readahead(unsigned long nr)
> {
> - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> + unsigned long local_free_page;
> + unsigned long sane_nr = min(nr, MAX_REMOTE_READAHEAD);
> +
> + local_free_page = node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> + + node_page_state(numa_node_id(), NR_FREE_PAGES);
> +
> + /*
> + * Readahead onto remote memory is better than no readahead when local
> + * numa node does not have memory. We sanitize readahead size depending
> + * on free memory in the local node but limiting to 4k pages.
> + */
> + return local_free_page ? min(sane_nr, local_free_page / 2) : sane_nr;
> }
So if the local node has two free pages, we do just one page of
readahead.
Then the local node has one free page and we do zero pages of readahead.
Assuming that bug(!) is fixed, the local node now has zero free pages
and we suddenly resume doing large readahead.
This transition from large readahead to very small readahead then back
to large readahead is illogical, surely?
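The jump Andrew describes can be seen by evaluating the patch's return
expression for a few values. The sketch below is a userspace
transcription of the formula, not kernel code, and the 512-page request
size is a made-up example:

#include <stdio.h>

#define MAX_REMOTE_READAHEAD 4096UL

/* The patch's formula, transcribed for userspace. */
static unsigned long sane(unsigned long nr, unsigned long local_free_page)
{
        unsigned long sane_nr = nr < MAX_REMOTE_READAHEAD ?
                                nr : MAX_REMOTE_READAHEAD;

        if (!local_free_page)
                return sane_nr;
        return sane_nr < local_free_page / 2 ?
               sane_nr : local_free_page / 2;
}

int main(void)
{
        unsigned long free_pages[] = { 4, 2, 1, 0 };

        for (int i = 0; i < 4; i++)
                printf("local free pages = %lu -> readahead = %lu pages\n",
                       free_pages[i], sane(512, free_pages[i]));
        return 0;
}

With 4, 2, 1, and 0 local free pages it prints readahead sizes of 2, 1,
0, and then suddenly 512 pages: exactly the discontinuity in question.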
On 01/06/2014 04:26 PM, Jan Kara wrote:
> On Mon 06-01-14 15:51:55, Raghavendra K T wrote:
>> Currently, max_sane_readahead() returns zero on a cpu whose local numa
>> node has no memory; fix this by checking for the empty numa node case
>> during the calculation. We also limit the number of readahead pages to
>> 4096 (16MB with 4KB pages).
>>
>> Signed-off-by: Raghavendra K T <[email protected]>
>> ---
>> This patch limits readahead to 4096 pages (a 16MB limit was suggested
>> by Linus) and also handles the case of a memoryless cpu causing
>> readahead failures. We still do not consider [fm]advise()-specific
>> calculations here. I have dropped the idea of iterating over numa
>> nodes to calculate free pages. I am not sure whether there is any
>> impact on big streaming apps. Comments/suggestions?
> As you say, I would also be interested in what impact this has on a
> streaming application. It should be rather easy to check: create a 1 GB
> file and drop caches. Then measure how long it takes to open the file,
> call fadvise FADV_WILLNEED, and read the whole file (for kernels with
> and without your patch). Do several measurements so that we get some
> meaningful statistics. The resulting numbers can then be part of the
> changelog. Thanks!
>
Hi Honza,
Thanks for the idea. (Sorry for the delay; I spent some time doing
fadvise and other benchmarking.) Here is the result on my x240 machine
with 32 cpus (w/ HT) and 128GB ram.
The test below used a 1GB test file, as suggested.
x base_result
+ patched_result
N Min Max Median Avg Stddev
x 12 7.217 7.444 7.2345 7.2603333 0.06442802
+ 12 7.24 7.431 7.243 7.2684167 0.059649672
From the result we can see that the patch has little impact.
I shall include the result in the changelog when I resend (or in the
next version), depending on other comments.
---
the test program looked something like this:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

char buf[4096];

int main()
{
        int fd = open("testfile", O_RDONLY);
        unsigned long read_bytes = 0;
        int sz;

        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        do {
                sz = read(fd, buf, 4096);
                read_bytes += sz;
        } while (sz > 0);

        close(fd);
        printf(" Total bytes read = %lu \n", read_bytes);
        return 0;
}
On 01/07/2014 03:43 AM, Andrew Morton wrote:
> On Mon, 6 Jan 2014 15:51:55 +0530 Raghavendra K T <[email protected]> wrote:
>
>> + /*
>> + * Readahead onto remote memory is better than no readahead when local
>> + * numa node does not have memory. We sanitize readahead size depending
>> + * on free memory in the local node but limiting to 4k pages.
>> + */
>> + return local_free_page ? min(sane_nr, local_free_page / 2) : sane_nr;
>> }
>
> So if the local node has two free pages, we do just one page of
> readahead.
>
> Then the local node has one free page and we do zero pages of readahead.
>
> Assuming that bug(!) is fixed, the local node now has zero free pages
> and we suddenly resume doing large readahead.
>
> This transition from large readahead to very small readahead then back
> to large readahead is illogical, surely?
>
>
Hi Andrew, thanks for having a look at this.
You are correct that there is a transition from small readahead to
large once we have zero free pages.
I am not sure I can defend it well, but I'll give it a try :).
Assuming an evenly distributed cpu/memory load, if we have very little
free+inactive memory we are probably in bad shape already.
But in a situation like the one below [1], which I mentioned earlier
(the cpu has no local memory node populated, so we always have to
depend on a remote node), isn't sanitized readahead onto remote memory
better than none?
Having said that, I have not been able to come up with a sane
implementation that solves this readahead failure bug while avoiding
the anomaly you pointed out :(. Hints/ideas? Please let me know.
[1]: IBM P730
----------------------------------
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus:
node 1 size: 12288 MB
node 1 free: 10440 MB
node distances:
node 0 1
0: 10 40
1: 40 10
On Wed 08-01-14 14:07:03, Raghavendra K T wrote:
> On 01/06/2014 04:26 PM, Jan Kara wrote:
> >On Mon 06-01-14 15:51:55, Raghavendra K T wrote:
> >>Currently, max_sane_readahead() returns zero on a cpu whose local numa
> >>node has no memory; fix this by checking for the empty numa node case
> >>during the calculation. We also limit the number of readahead pages to
> >>4096 (16MB with 4KB pages).
> >>
> >>Signed-off-by: Raghavendra K T <[email protected]>
> >>---
> >>This patch limits readahead to 4096 pages (a 16MB limit was suggested
> >>by Linus) and also handles the case of a memoryless cpu causing
> >>readahead failures. We still do not consider [fm]advise()-specific
> >>calculations here. I have dropped the idea of iterating over numa
> >>nodes to calculate free pages. I am not sure whether there is any
> >>impact on big streaming apps. Comments/suggestions?
> > As you say, I would also be interested in what impact this has on a
> > streaming application. It should be rather easy to check: create a
> > 1 GB file and drop caches. Then measure how long it takes to open the
> > file, call fadvise FADV_WILLNEED, and read the whole file (for kernels
> > with and without your patch). Do several measurements so that we get
> > some meaningful statistics. The resulting numbers can then be part of
> > the changelog. Thanks!
> >
>
> Hi Honza,
>
> Thanks for the idea. (Sorry for the delay; I spent some time doing
> fadvise and other benchmarking.) Here is the result on my x240 machine
> with 32 cpus (w/ HT) and 128GB ram.
>
> The test below used a 1GB test file, as suggested.
>
> x base_result
> + patched_result
> N Min Max Median Avg Stddev
> x 12 7.217 7.444 7.2345 7.2603333 0.06442802
> + 12 7.24 7.431 7.243 7.2684167 0.059649672
>
> From the result we can see that the patch has little impact.
> I shall include the result in the changelog when I resend (or in the
> next version), depending on other comments.
>
> ---
> the test program looked something like this:
>
> char buf[4096];
>
> int main()
> {
> int fd = open("testfile", O_RDONLY);
> unsigned long read_bytes = 0;
> int sz;
> posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
Hum, but this call should have rather been:
struct stat st;
fstat(fd, &st);
posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
The posix_fadvise() call you had doesn't do anything...
Honza
> do {
> sz = read(fd, buf, 4096);
> read_bytes += sz;
> } while (sz > 0);
>
> close(fd);
> printf (" Total bytes read = %lu \n", read_bytes);
> return 0;
> }
>
>
>
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 08-01-14 14:19:23, Raghavendra K T wrote:
> On 01/07/2014 03:43 AM, Andrew Morton wrote:
> >On Mon, 6 Jan 2014 15:51:55 +0530 Raghavendra K T <[email protected]> wrote:
> >
> >>+ /*
> >>+ * Readahead onto remote memory is better than no readahead when local
> >>+ * numa node does not have memory. We sanitize readahead size depending
> >>+ * on free memory in the local node but limiting to 4k pages.
> >>+ */
> >>+ return local_free_page ? min(sane_nr, local_free_page / 2) : sane_nr;
> >> }
> >
> >So if the local node has two free pages, we do just one page of
> >readahead.
> >
> >Then the local node has one free page and we do zero pages of readahead.
> >
> >Assuming that bug(!) is fixed, the local node now has zero free pages
> >and we suddenly resume doing large readahead.
> >
> >This transition from large readahead to very small readahead then back
> >to large readahead is illogical, surely?
> >
> >
>
> You are correct that there is a transition from small readahead to
> large once we have zero free pages.
> I am not sure I can defend it well, but I'll give it a try :).
>
> Assuming an evenly distributed cpu/memory load, if we have very little
> free+inactive memory we are probably in bad shape already.
>
> But in a situation like the one below [1], which I mentioned earlier
> (the cpu has no local memory node populated, so we always have to
> depend on a remote node), isn't sanitized readahead onto remote memory
> better than none?
>
> Having said that, I have not been able to come up with a sane
> implementation that solves this readahead failure bug while avoiding
> the anomaly you pointed out :(. Hints/ideas? Please let me know.
So if we would be happy with just fixing corner cases like this, we could
use the total node memory size to detect them, couldn't we? If the total
node memory size is 0, we can use 16 MB (or the global number of free
pages / 2, if we are uneasy with a fixed 16 MB limit) as an upper bound...
Honza
>
>
> [1]: IBM P730
> ----------------------------------
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
> 22 23 24 25 26 27 28 29 30 31
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus:
> node 1 size: 12288 MB
> node 1 free: 10440 MB
> node distances:
> node 0 1
> 0: 10 40
> 1: 40 10
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 01/08/2014 04:17 PM, Jan Kara wrote:
> On Wed 08-01-14 14:19:23, Raghavendra K T wrote:
>> On 01/07/2014 03:43 AM, Andrew Morton wrote:
>>> On Mon, 6 Jan 2014 15:51:55 +0530 Raghavendra K T <[email protected]> wrote:
[...]
>> Having said that, I have not been able to come up with a sane
>> implementation that solves this readahead failure bug while avoiding
>> the anomaly you pointed out :(. Hints/ideas? Please let me know.
> So if we would be happy with just fixing corner cases like this, we could
> use the total node memory size to detect them, couldn't we? If the total
> node memory size is 0, we can use 16 MB (or the global number of free
> pages / 2, if we are uneasy with a fixed 16 MB limit) as an upper bound...
>
Thanks Honza.
This seems to be a more sensible option; I'll send the patch with that
change (including the 16MB limit, if nobody disagrees).
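For concreteness, Jan's suggestion might translate into something like
the sketch below. This is only a guess at the shape of the change, not
the posted V4; in particular, using node_present_pages() as the "total
node memory size is 0" test is an assumption.

#define MAX_REMOTE_READAHEAD 4096UL	/* 16MB worth of 4KB pages */

unsigned long max_sane_readahead(unsigned long nr)
{
        int node = numa_node_id();
        unsigned long local_pages;

        /*
         * A memoryless node can never satisfy readahead locally; fall
         * back to a fixed cap instead of returning zero (assumption:
         * node_present_pages() is the total-size test Jan means).
         */
        if (!node_present_pages(node))
                return min(nr, MAX_REMOTE_READAHEAD);

        local_pages = node_page_state(node, NR_INACTIVE_FILE)
                    + node_page_state(node, NR_FREE_PAGES);
        return min(nr, local_pages / 2);
}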
On 01/08/2014 04:08 PM, Jan Kara wrote:
> On Wed 08-01-14 14:07:03, Raghavendra K T wrote:
>> On 01/06/2014 04:26 PM, Jan Kara wrote:
>>> On Mon 06-01-14 15:51:55, Raghavendra K T wrote:
>> ---
>> the test program looked something like this:
>>
>> char buf[4096];
>>
>> int main()
>> {
>> int fd = open("testfile", O_RDONLY);
>> unsigned long read_bytes = 0;
>> int sz;
>> posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
> Hum, but this call should have rather been:
> struct stat st;
>
> fstat(fd, &st);
> posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
>
> The posix_fadvise() call you had doesn't do anything...
>
> Honza
I reran the test with that change; no change in the outcome though.
(I had earlier tested with a hardcoded size etc., but fstat was the
correct thing to do.. thanks.) I will include the result in V4.
>
>> do {
>> sz = read(fd, buf, 4096);
>> read_bytes += sz;
>> } while (sz > 0);
>>
>> close(fd);
>> printf (" Total bytes read = %lu \n", read_bytes);
>> return 0;
>> }