2014-01-22 11:03:43

by Raghavendra K T

Subject: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

max_sane_readahead returns zero on the cpu having no local memory
node. Fix that by returning a sanitized number of pages viz.,
minimum of (requested pages, 4k)

Result:
A fadvise experiment with FADV_WILLNEED on an x240 NUMA machine (32GB *
4G RAM) with a 1GB testfile (12 iterations) yielded:

Kernel    Avg      Stddev
base      7.2963   1.10 %
patched   7.2972   1.18 %
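
For reference, the experiment can be reproduced with a small program along
these lines (a minimal sketch, not the exact harness used; the testfile
path is an assumption, and the page cache should be dropped between
iterations):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct timespec t0, t1;
        char buf[4096];
        int fd = open("/mnt/testfile", O_RDONLY);       /* the 1GB testfile */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);   /* kicks off readahead */
        while (read(fd, buf, sizeof(buf)) > 0)
                ;                                       /* sequential read */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%.4f\n", (t1.tv_sec - t0.tv_sec) +
                         (t1.tv_nsec - t0.tv_nsec) / 1e9);
        close(fd);
        return 0;
}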

Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
---
Changes in V5:
- Drop the 4k limit for normal readahead. (Jan Kara)

Changes in V4:
- Check for total node memory to decide whether we don't
have local memory (Jan Kara)
- Add 4k page limit on readahead for normal and remote readahead (Linus)
(Linus's suggestion was a 16MB limit, i.e. 4096 4KB pages).

Changes in V3:
- Drop iterating over numa nodes that calculates total free pages (Linus)

Agreed that we do not have control over allocation for readahead on a
particular numa node, and hence for remote readahead we cannot further
sanitize based on potential free pages of that node. We also do not
want to iterate through all nodes to find total free pages.

Suggestions and comments welcome

mm/readahead.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 7cdbb44..9d2afd0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -237,14 +237,32 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
         return ret;
 }
 
+#define MAX_REMOTE_READAHEAD   4096UL
 /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-                + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+        unsigned long local_free_page;
+        int nid;
+
+        nid = numa_node_id();
+        if (node_present_pages(nid)) {
+                /*
+                 * We sanitize readahead size depending on free memory in
+                 * the local node.
+                 */
+                local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
+                                + node_page_state(nid, NR_FREE_PAGES);
+                return min(nr, local_free_page / 2);
+        }
+        /*
+         * Readahead onto remote memory is better than no readahead when local
+         * numa node does not have memory. We limit the readahead to 4k
+         * pages though to avoid thrashing page cache.
+         */
+        return min(nr, MAX_REMOTE_READAHEAD);
 }
 
 /*
--
1.7.11.7


2014-02-03 08:24:30

by Raghavendra K T

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 01/22/2014 04:23 PM, Raghavendra K T wrote:
> max_sane_readahead returns zero on the cpu having no local memory
> node. Fix that by returning a sanitized number of pages viz.,
> minimum of (requested pages, 4k)
>
> Result:
> A fadvise experiment with FADV_WILLNEED on an x240 NUMA machine (32GB *
> 4G RAM) with a 1GB testfile (12 iterations) yielded:
>
> Kernel    Avg      Stddev
> base      7.2963   1.10 %
> patched   7.2972   1.18 %
>
> Reviewed-by: Jan Kara <[email protected]>
> Signed-off-by: Raghavendra K T <[email protected]>
> ---

Could you please let me know what you feel about the patch?

2014-02-06 22:51:09

by Andrew Morton

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Wed, 22 Jan 2014 16:23:45 +0530 Raghavendra K T <[email protected]> wrote:

> max_sane_readahead returns zero on the cpu having no local memory
> node. Fix that by returning a sanitized number of pages viz.,
> minimum of (requested pages, 4k)

um, fix what? The changelog should describe the user-visible impact of
the current implementation. There are a whole bunch of reasons for
this, but I tire of typing them in day after day after day.

> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -237,14 +237,32 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>          return ret;
>  }
>  
> +#define MAX_REMOTE_READAHEAD   4096UL
>  /*
>   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
>   * sensible upper limit.
>   */
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
> -        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> -                + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> +        unsigned long local_free_page;
> +        int nid;
> +
> +        nid = numa_node_id();
> +        if (node_present_pages(nid)) {
> +                /*
> +                 * We sanitize readahead size depending on free memory in
> +                 * the local node.
> +                 */
> +                local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
> +                                + node_page_state(nid, NR_FREE_PAGES);
> +                return min(nr, local_free_page / 2);
> +        }
> +        /*
> +         * Readahead onto remote memory is better than no readahead when local
> +         * numa node does not have memory. We limit the readahead to 4k
> +         * pages though to avoid thrashing page cache.
> +         */
> +        return min(nr, MAX_REMOTE_READAHEAD);
>  }

Looks reasonable to me. Please send along a fixed up changelog.

2014-02-06 22:58:24

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 6 Feb 2014, Andrew Morton wrote:

> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -237,14 +237,32 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >          return ret;
> >  }
> >  
> > +#define MAX_REMOTE_READAHEAD   4096UL
> >  /*
> >   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> >   * sensible upper limit.
> >   */
> >  unsigned long max_sane_readahead(unsigned long nr)
> >  {
> > -        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > -                + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > +        unsigned long local_free_page;
> > +        int nid;
> > +
> > +        nid = numa_node_id();

If you're intending this to be cached for your calls into
node_page_state() you need nid = ACCESS_ONCE(numa_node_id()).

What's the downside of just using numa_mem_id() here instead which is
usually "local memory to this memoryless node cpu" and forget about
testing node_present_pages(nid)?
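
To make the distinction concrete, a sketch (node numbers are illustrative:
a cpu sitting on memoryless node 0 whose nearest memory is node 1):

        int nid = numa_node_id();  /* -> 0: the cpu's own node, even without memory */
        int mem = numa_mem_id();   /* -> 1: nearest node that has memory, given
                                    *       CONFIG_HAVE_MEMORYLESS_NODES */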

> > +        if (node_present_pages(nid)) {
> > +                /*
> > +                 * We sanitize readahead size depending on free memory in
> > +                 * the local node.
> > +                 */
> > +                local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
> > +                                + node_page_state(nid, NR_FREE_PAGES);
> > +                return min(nr, local_free_page / 2);
> > +        }
> > +        /*
> > +         * Readahead onto remote memory is better than no readahead when local
> > +         * numa node does not have memory. We limit the readahead to 4k
> > +         * pages though to avoid thrashing page cache.
> > +         */
> > +        return min(nr, MAX_REMOTE_READAHEAD);
> >  }

2014-02-06 23:22:23

by Andrew Morton

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 6 Feb 2014 14:58:21 -0800 (PST) David Rientjes <[email protected]> wrote:

> > > +#define MAX_REMOTE_READAHEAD   4096UL
> > >  /*
> > >   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > >   * sensible upper limit.
> > >   */
> > >  unsigned long max_sane_readahead(unsigned long nr)
> > >  {
> > > -        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > > -                + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > > +        unsigned long local_free_page;
> > > +        int nid;
> > > +
> > > +        nid = numa_node_id();
>
> If you're intending this to be cached for your calls into
> node_page_state() you need nid = ACCESS_ONCE(numa_node_id()).

ugh. That's too subtle and we didn't even document it.

We could put the ACCESS_ONCE inside numa_node_id() I assume but we
still have the same problem as smp_processor_id(): the numa_node_id()
return value is wrong as soon as you obtain it if running preemptibly.

We could plaster Big Fat Warnings all over the place or we could treat
numa_node_id() and derivatives in the same way as smp_processor_id()
(which is a huge pain). Or something else, but we've left a big hand
grenade here and Raghavendra won't be the last one to pull the pin?
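
(For reference, the usual way to get a self-consistent id, as with
smp_processor_id(), is to pin the task; a sketch, not a proposal for this
particular call site:

        int nid;

        preempt_disable();
        nid = numa_node_id();   /* cannot change while preemption is off */
        /* ... do all the node_page_state() lookups with this nid ... */
        preempt_enable();

whether that is worth it for a readahead heuristic is exactly the open
question.)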

2014-02-06 23:48:26

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 6 Feb 2014, Andrew Morton wrote:

> On Thu, 6 Feb 2014 14:58:21 -0800 (PST) David Rientjes <[email protected]> wrote:
>
> > > > +#define MAX_REMOTE_READAHEAD   4096UL
> > > >  /*
> > > >   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > > >   * sensible upper limit.
> > > >   */
> > > >  unsigned long max_sane_readahead(unsigned long nr)
> > > >  {
> > > > -        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > > > -                + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > > > +        unsigned long local_free_page;
> > > > +        int nid;
> > > > +
> > > > +        nid = numa_node_id();
> >
> > If you're intending this to be cached for your calls into
> > node_page_state() you need nid = ACCESS_ONCE(numa_node_id()).
>
> ugh. That's too subtle and we didn't even document it.
>
> We could put the ACCESS_ONCE inside numa_node_id() I assume but we
> still have the same problem as smp_processor_id(): the numa_node_id()
> return value is wrong as soon as you obtain it if running preemptibly.
>
> We could plaster Big Fat Warnings all over the place or we could treat
> numa_node_id() and derivatives in the same way as smp_processor_id()
> (which is a huge pain). Or something else, but we've left a big hand
> grenade here and Raghavendra won't be the last one to pull the pin?
>

Normally it wouldn't matter because there's no significant downside to it
racing, things like mempolicies which use numa_node_id() extensively would
result in, oops, a page allocation on the wrong node.

This stands out to me, though, because you're expecting the calculation to
be correct for a specific node.

The patch is still wrong, though, it should just do

        int node = ACCESS_ONCE(numa_mem_id());
        return min(nr, (node_page_state(node, NR_INACTIVE_FILE) +
                        node_page_state(node, NR_FREE_PAGES)) / 2);

since we want to readahead based on the cpu's local node, the comment
saying we're reading ahead onto "remote memory" is wrong since a
memoryless node has local affinity to numa_mem_id().

2014-02-06 23:58:22

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 6 Feb 2014, David Rientjes wrote:

> > > > > +#define MAX_REMOTE_READAHEAD 4096UL

> Normally it wouldn't matter because there's no significant downside to it
> racing, things like mempolicies which use numa_node_id() extensively would
> result in, oops, a page allocation on the wrong node.
>
> This stands out to me, though, because you're expecting the calculation to
> be correct for a specific node.
>
> The patch is still wrong, though, it should just do
>
>         int node = ACCESS_ONCE(numa_mem_id());
>         return min(nr, (node_page_state(node, NR_INACTIVE_FILE) +
>                         node_page_state(node, NR_FREE_PAGES)) / 2);
>
> since we want to readahead based on the cpu's local node, the comment
> saying we're reading ahead onto "remote memory" is wrong since a
> memoryless node has local affinity to numa_mem_id().
>

Oops, forgot about the MAX_REMOTE_READAHEAD which needs to be factored in
as well, but this handles the bound on local node's statistics.

2014-02-07 10:36:35

by Raghavendra K T

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 02/07/2014 05:28 AM, David Rientjes wrote:
> On Thu, 6 Feb 2014, David Rientjes wrote:
>
>>>>>> +#define MAX_REMOTE_READAHEAD 4096UL
>
>> Normally it wouldn't matter because there's no significant downside to it
>> racing, things like mempolicies which use numa_node_id() extensively would
>> result in, oops, a page allocation on the wrong node.
>>
>> This stands out to me, though, because you're expecting the calculation to
>> be correct for a specific node.
>>
>> The patch is still wrong, though, it should just do
>>
>>         int node = ACCESS_ONCE(numa_mem_id());
>>         return min(nr, (node_page_state(node, NR_INACTIVE_FILE) +
>>                         node_page_state(node, NR_FREE_PAGES)) / 2);
>>
>> since we want to readahead based on the cpu's local node, the comment
>> saying we're reading ahead onto "remote memory" is wrong since a
>> memoryless node has local affinity to numa_mem_id().
>>
>
> Oops, forgot about the MAX_REMOTE_READAHEAD which needs to be factored in
> as well, but this handles the bound on local node's statistics.
>

So following the discussion, the TODO list for my patch is:

1) Update the changelog with the user-visible impact of the patch.
(Andrew's suggestion)
2) Add ACCESS_ONCE to numa_node_id().
3) Change the "readahead into remote memory" part of the documentation,
which is misleading.

(I feel no need to add numa_mem_id() since we would specifically limit
the readahead with MAX_REMOTE_READAHEAD in the memoryless cpu case.)

2014-02-07 20:41:42

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Fri, 7 Feb 2014, Raghavendra K T wrote:

> So following the discussion, the TODO list for my patch is:
>
> 1) Update the changelog with the user-visible impact of the patch.
> (Andrew's suggestion)
> 2) Add ACCESS_ONCE to numa_node_id().
> 3) Change the "readahead into remote memory" part of the documentation,
> which is misleading.
>
> (I feel no need to add numa_mem_id() since we would specifically limit
> the readahead with MAX_REMOTE_READAHEAD in the memoryless cpu case.)
>

I don't understand what you're saying, numa_mem_id() is local memory to
current's cpu. When a node consists only of cpus and not memory it is not
true that all memory is then considered remote, you won't find that in any
specification that defines memory affinity including the ACPI spec. I can
trivially define all cpus on my system to be on memoryless nodes and
having that affect readahead behavior when, in fact, there is affinity
would be ridiculous.

2014-02-10 08:15:34

by Raghavendra K T

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 02/08/2014 02:11 AM, David Rientjes wrote:
> On Fri, 7 Feb 2014, Raghavendra K T wrote:
>> 3) Change the "readahead into remote memory" part of the documentation,
>> which is misleading.
>>
>> (I feel no need to add numa_mem_id() since we would specifically limit
>> the readahead with MAX_REMOTE_READAHEAD in the memoryless cpu case.)
>>
>
> I don't understand what you're saying, numa_mem_id() is local memory to
> current's cpu. When a node consists only of cpus and not memory it is not
> true that all memory is then considered remote, you won't find that in any
> specification that defines memory affinity including the ACPI spec. I can
> trivially define all cpus on my system to be on memoryless nodes and
> having that affect readahead behavior when, in fact, there is affinity
> would be ridiculous.
>
As you rightly pointed out, I'll drop the remote memory term and use
something like:

"* Ensure readahead success on a memoryless node cpu. But we limit
* the readahead to 4k pages to avoid thrashing page cache." ..

Regarding ACCESS_ONCE, since we would have to add it inside the
function, and since there is still nothing preventing us from running on a
different cpu with a different node (as Andrew pointed out), I have not
included it in the current patch that I am posting.
Moreover this case is hopefully not fatal, since it is just a hint for
how much readahead we can do.

So there are many possible implementations:
(1) use numa_mem_id(), apply the free page limit and use the 4k page
limit for all cases
(Jan had reservations about this case)

(2) for the normal case: use the free memory calculation and do not
apply the 4k limit (no change).
for the memoryless cpu case: use numa_mem_id() for a more accurate
calculation of the limit and also apply the 4k limit.

(3) for the normal case: use the free memory calculation and do not
apply the 4k limit (no change).
for the memoryless case: apply the 4k page limit

(4) use numa_mem_id() and apply only the free page limit.

So, I'll be resending the patch with changelog and comment changes
based on your and Andrew's feedback (the type (3) implementation).



2014-02-10 08:23:29

by Raghavendra K T

Subject: Re: [RFC PATCH V5 RESEND] mm readahead: Fix readahead fail for no local memory and limit readahead pages

* Andrew Morton <[email protected]> [2014-02-06 14:51:05]:

> On Wed, 22 Jan 2014 16:23:45 +0530 Raghavendra K T <[email protected]> wrote:
>
>
> Looks reasonable to me. Please send along a fixed up changelog.
>

Hi Andrew,
Sorry, it took some time to get and measure the benefit on the memoryless
system. Resending the patch with changelog and comment changes based on
your and David's suggestions.

----8<---
From fc8186b5c33a34810a34f5aadd50082463117636 Mon Sep 17 00:00:00 2001
From: Raghavendra K T <[email protected]>
Date: Mon, 25 Nov 2013 14:29:03 +0530
Subject: [RFC PATCH V5 RESEND] mm readahead: Fix readahead fail for no local
memory and limit readahead pages

Currently max_sane_readahead() returns zero on a cpu having no local memory
node, which leads to readahead failure. Fix the readahead failure by
returning the minimum of (requested pages, 4k). Users running an
application that needs readahead, such as a streaming application, on a
memoryless cpu see a considerable boost in performance.

Result:
A fadvise experiment with FADV_WILLNEED on a PPC machine having a
memoryless CPU, with a 1GB testfile (12 iterations), yielded a 46.66%
improvement:

Kernel        Avg         Stddev
base_ppc      11.946833   1.34%
patched_ppc   6.3720833   1.80%

The result below shows that there is no impact on the normal NUMA case
with the patch:

A fadvise experiment with FADV_WILLNEED on an x240 NUMA machine (32GB *
4G RAM) with a 1GB testfile (12 iterations) yielded:

Kernel    Avg      Stddev
base      7.2963   1.10 %
patched   7.2972   1.18 %

Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
---

Changes in V5:
- Updated the changelog with benefit seen (Andrew)
- Discard the remote memory term in the comment since a memoryless CPU
will have affinity to numa_mem_id() (David)
- Drop the 4k limit for normal readahead. (Jan Kara)

Changes in V4:
- Check for total node memory to decide whether we don't
have local memory (Jan Kara)
- Add 4k page limit on readahead for normal and remote readahead (Linus)
(Linus's suggestion was a 16MB limit, i.e. 4096 4KB pages).

Changes in V3:
- Drop iterating over numa nodes that calculates total free pages (Linus)

Agreed that we do not have control over allocation for readahead on a
particular numa node, and hence for remote readahead we cannot further
sanitize based on potential free pages of that node. We also do not
want to iterate through all nodes to find total free pages.

Suggestions and comments welcome

mm/readahead.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 0de2360..4c7343b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -233,14 +233,31 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
         return 0;
 }
 
+#define MAX_REMOTE_READAHEAD   4096UL
 /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-                + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+        unsigned long local_free_page;
+        int nid;
+
+        nid = numa_node_id();
+        if (node_present_pages(nid)) {
+                /*
+                 * We sanitize readahead size depending on free memory in
+                 * the local node.
+                 */
+                local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
+                                + node_page_state(nid, NR_FREE_PAGES);
+                return min(nr, local_free_page / 2);
+        }
+        /*
+         * Ensure readahead success on a memoryless node cpu. But we limit
+         * the readahead to 4k pages to avoid thrashing page cache.
+         */
+        return min(nr, MAX_REMOTE_READAHEAD);
 }
 
 /*
--
1.7.11.7

2014-02-10 10:05:55

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Mon, 10 Feb 2014, Raghavendra K T wrote:

> As you rightly pointed out, I'll drop the remote memory term and use
> something like:
>
> "* Ensure readahead success on a memoryless node cpu. But we limit
> * the readahead to 4k pages to avoid thrashing page cache." ..
>

I don't know how to proceed here after pointing it out twice, I'm afraid.

numa_mem_id() is local memory for a memoryless node. node_present_pages()
has no place in your patch.

> Regarding ACCESS_ONCE, since we would have to add it inside the
> function, and since there is still nothing preventing us from running on a
> different cpu with a different node (as Andrew pointed out), I have not
> included it in the current patch that I am posting.
> Moreover this case is hopefully not fatal, since it is just a hint for
> how much readahead we can do.
>

I have no idea why you think the ACCESS_ONCE() is a problem. It's relying
on gcc's implementation to ensure that the equation is done only for one
node. It has absolutely nothing to do with the fact that the process may
be moved to another cpu upon returning or even immediately after the
calculation is done. Is it possible that node0 has 80% of memory free and
node1 has 80% of memory inactive? Well, then your equation doesn't work
quite so well if the process moves.

There is no downside whatsoever to using it, I have no idea why you think
it's better without it.

> So there are many possible implementations:
> (1) use numa_mem_id(), apply the free page limit and use the 4k page
> limit for all cases
> (Jan had reservations about this case)
>
> (2) for the normal case: use the free memory calculation and do not
> apply the 4k limit (no change).
> for the memoryless cpu case: use numa_mem_id() for a more accurate
> calculation of the limit and also apply the 4k limit.
>
> (3) for the normal case: use the free memory calculation and do not
> apply the 4k limit (no change).
> for the memoryless case: apply the 4k page limit
>
> (4) use numa_mem_id() and apply only the free page limit.
>
> So, I'll be resending the patch with changelog and comment changes
> based on your and Andrew's feedback (the type (3) implementation).
>

It's frustrating to have to say something three times. Ask yourself what
happens if ALL NODES WITH CPUS DO NOT HAVE MEMORY?

2014-02-10 12:19:52

by Raghavendra K T

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 02/10/2014 03:35 PM, David Rientjes wrote:
> On Mon, 10 Feb 2014, Raghavendra K T wrote:
>
>> As you rightly pointed out, I'll drop the remote memory term and use
>> something like:
>>
>> "* Ensure readahead success on a memoryless node cpu. But we limit
>> * the readahead to 4k pages to avoid thrashing page cache." ..
>>
>
> I don't know how to proceed here after pointing it out twice, I'm afraid.
>
> numa_mem_id() is local memory for a memoryless node. node_present_pages()
> has no place in your patch.

Hi David, I am happy to see your pointer regarding numa_mem_id(). I did
not mean to be dismissive or offensive; sorry if the conversation came
across that way.

So I understood that you are suggesting implementations like the ones below.

1) I do not have a problem with the below approach; I could post this in
the next version.
(But this does not include the 4k limit Linus mentioned applying.)

unsigned long max_sane_readahead(unsigned long nr)
{
        unsigned long local_free_page;
        int nid;

        nid = numa_mem_id();

        /*
         * We sanitize readahead size depending on free memory in
         * the local node.
         */
        local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
                        + node_page_state(nid, NR_FREE_PAGES);
        return min(nr, local_free_page / 2);
}

2) I did not go for the below because Honza (Jan Kara) had some
concerns about the 4k limit for the normal case, and since I am not
the expert, I was waiting for opinions.

unsigned long max_sane_readahead(unsigned long nr)
{
        unsigned long local_free_page, sane_nr;
        int nid;

        nid = numa_mem_id();
        /* limit the max readahead to 4k pages */
        sane_nr = min(nr, MAX_REMOTE_READAHEAD);

        /*
         * We sanitize readahead size depending on free memory in
         * the local node.
         */
        local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
                        + node_page_state(nid, NR_FREE_PAGES);
        return min(sane_nr, local_free_page / 2);
}

>
>> Regarding ACCESS_ONCE, since we would have to add it inside the
>> function, and since there is still nothing preventing us from running on a
>> different cpu with a different node (as Andrew pointed out), I have not
>> included it in the current patch that I am posting.
>> Moreover this case is hopefully not fatal, since it is just a hint for
>> how much readahead we can do.
>>
>
> I have no idea why you think the ACCESS_ONCE() is a problem. It's relying
> on gcc's implementation to ensure that the equation is done only for one
> node. It has absolutely nothing to do with the fact that the process may
> be moved to another cpu upon returning or even immediately after the
> calculation is done. Is it possible that node0 has 80% of memory free and
> node1 has 80% of memory inactive? Well, then your equation doesn't work
> quite so well if the process moves.
>
> There is no downside whatsoever to using it, I have no idea why you think
> it's better without it.

I have no problem introducing ACCESS_ONCE too. But I skipped it only
after I got the error below:

mm/readahead.c: In function 'max_sane_readahead':
mm/readahead.c:246: error: lvalue required as unary '&' operand
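
That error follows from how ACCESS_ONCE() is defined: it takes the address
of its argument, so the argument must be an lvalue, and a function call is
not one. A sketch (the macro is quoted from memory from the compiler.h of
that era):

        #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

        int nid = ACCESS_ONCE(numa_node_id()); /* &(function call): does not compile */

        int node = numa_mem_id();              /* store the result first ...      */
        nid = ACCESS_ONCE(node);               /* ... then &node is a valid lvalue */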

>
>> So there are many possible implementations:
>> (1) use numa_mem_id(), apply the free page limit and use the 4k page
>> limit for all cases
>> (Jan had reservations about this case)
>>
>> (2) for the normal case: use the free memory calculation and do not
>> apply the 4k limit (no change).
>> for the memoryless cpu case: use numa_mem_id() for a more accurate
>> calculation of the limit and also apply the 4k limit.
>>
>> (3) for the normal case: use the free memory calculation and do not
>> apply the 4k limit (no change).
>> for the memoryless case: apply the 4k page limit
>>
>> (4) use numa_mem_id() and apply only the free page limit.
>>
>> So, I'll be resending the patch with changelog and comment changes
>> based on your and Andrew's feedback (the type (3) implementation).
>>
>
> It's frustrating to have to say something three times. Ask yourself what
> happens if ALL NODES WITH CPUS DO NOT HAVE MEMORY?
>

True, and this is the reason why we could go for implementation (1) that
I posted above. It was just that I did not want to float a new version
without knowing whether Andrew was expecting a new patch or just
changelog updates.

2014-02-10 21:35:42

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Mon, 10 Feb 2014, Raghavendra K T wrote:

> So I understood that you are suggesting implementations like the ones below.
>
> 1) I do not have a problem with the below approach; I could post this in
> the next version.
> (But this does not include the 4k limit Linus mentioned applying.)
>
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         unsigned long local_free_page;
>         int nid;
>
>         nid = numa_mem_id();
>
>         /*
>          * We sanitize readahead size depending on free memory in
>          * the local node.
>          */
>         local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
>                         + node_page_state(nid, NR_FREE_PAGES);
>         return min(nr, local_free_page / 2);
> }
>
> 2) I did not go for the below because Honza (Jan Kara) had some
> concerns about the 4k limit for the normal case, and since I am not
> the expert, I was waiting for opinions.
>
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         unsigned long local_free_page, sane_nr;
>         int nid;
>
>         nid = numa_mem_id();
>         /* limit the max readahead to 4k pages */
>         sane_nr = min(nr, MAX_REMOTE_READAHEAD);
>
>         /*
>          * We sanitize readahead size depending on free memory in
>          * the local node.
>          */
>         local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
>                         + node_page_state(nid, NR_FREE_PAGES);
>         return min(sane_nr, local_free_page / 2);
> }
>

I have no opinion on the 4KB pages, either of the above is just fine.

2014-02-13 07:01:29

by Raghavendra K T

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 02/11/2014 03:05 AM, David Rientjes wrote:
> On Mon, 10 Feb 2014, Raghavendra K T wrote:
>
>> So I understood that you are suggesting implementations like the ones below.
>>
>> 1) I do not have a problem with the below approach; I could post this in
>> the next version.
>> (But this does not include the 4k limit Linus mentioned applying.)
>>
>> unsigned long max_sane_readahead(unsigned long nr)
>> {
>>         unsigned long local_free_page;
>>         int nid;
>>
>>         nid = numa_mem_id();
>>
>>         /*
>>          * We sanitize readahead size depending on free memory in
>>          * the local node.
>>          */
>>         local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
>>                         + node_page_state(nid, NR_FREE_PAGES);
>>         return min(nr, local_free_page / 2);
>> }
>>
>> 2) I did not go for the below because Honza (Jan Kara) had some
>> concerns about the 4k limit for the normal case, and since I am not
>> the expert, I was waiting for opinions.
>>
>> unsigned long max_sane_readahead(unsigned long nr)
>> {
>>         unsigned long local_free_page, sane_nr;
>>         int nid;
>>
>>         nid = numa_mem_id();
>>         /* limit the max readahead to 4k pages */
>>         sane_nr = min(nr, MAX_REMOTE_READAHEAD);
>>
>>         /*
>>          * We sanitize readahead size depending on free memory in
>>          * the local node.
>>          */
>>         local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
>>                         + node_page_state(nid, NR_FREE_PAGES);
>>         return min(sane_nr, local_free_page / 2);
>> }
>>
>
> I have no opinion on the 4KB pages, either of the above is just fine.
>

I was able to test implementation (1) on the system where the readahead
problem occurred. Unfortunately it did not help.

The reason seems to be the CONFIG_HAVE_MEMORYLESS_NODES dependency of
numa_mem_id(). The PPC machine on which I am facing the problem has a
topology like this:

numactl -H
---------
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25
...
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 8 9 10 11 32 33 34 35 ...
node 1 size: 8071 MB
node 1 free: 2479 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

So it seems numa_mem_id() does not help for all the configs.
Am I missing something?

2014-02-13 08:05:35

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 13 Feb 2014, Raghavendra K T wrote:

> I was able to test (1) implementation on the system where readahead problem
> occurred. Unfortunately it did not help.
>
> Reason seem to be that CONFIG_HAVE_MEMORYLESS_NODES dependency of
> numa_mem_id(). The PPC machine I am facing problem has topology like
> this:
>
> numactl -H
> ---------
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> ...
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 8 9 10 11 32 33 34 35 ...
> node 1 size: 8071 MB
> node 1 free: 2479 MB
> node distances:
> node 0 1
> 0: 10 20
> 1: 20 10
>
> So it seems numa_mem_id() does not help for all the configs..
> Am I missing something ?
>

You need the patch from http://marc.info/?l=linux-mm&m=139093411119013
first.

2014-02-13 09:59:41

by Raghavendra K T

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 02/13/2014 01:35 PM, David Rientjes wrote:
> On Thu, 13 Feb 2014, Raghavendra K T wrote:
>
>> I was able to test (1) implementation on the system where readahead problem
>> occurred. Unfortunately it did not help.
>>
>> Reason seem to be that CONFIG_HAVE_MEMORYLESS_NODES dependency of
>> numa_mem_id(). The PPC machine I am facing problem has topology like
>> this:
[...]
>>
>> So it seems numa_mem_id() does not help for all the configs..
>> Am I missing something ?
>>
>
> You need the patch from http://marc.info/?l=linux-mm&m=139093411119013
> first.

Thanks David, unfortunately even after applying that patch, I do not see
the improvement.

Interestingly numa_mem_id() seems to still return the value of a
memoryless node.
Maybe the per cpu _numa_mem_ values are not set properly. Need to dig further ....

2014-02-13 21:06:48

by Andrew Morton

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 13 Feb 2014 00:05:31 -0800 (PST) David Rientjes <[email protected]> wrote:

> On Thu, 13 Feb 2014, Raghavendra K T wrote:
>
> > I was able to test (1) implementation on the system where readahead problem
> > occurred. Unfortunately it did not help.
> >
> > Reason seem to be that CONFIG_HAVE_MEMORYLESS_NODES dependency of
> > numa_mem_id(). The PPC machine I am facing problem has topology like
> > this:
> >
> > numactl -H
> > ---------
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> > ...
> > node 0 size: 0 MB
> > node 0 free: 0 MB
> > node 1 cpus: 8 9 10 11 32 33 34 35 ...
> > node 1 size: 8071 MB
> > node 1 free: 2479 MB
> > node distances:
> > node 0 1
> > 0: 10 20
> > 1: 20 10
> >
> > So it seems numa_mem_id() does not help for all the configs..
> > Am I missing something ?
> >
>
> You need the patch from http://marc.info/?l=linux-mm&m=139093411119013
> first.

That (un-signed-off) powerpc patch appears to be moribund. What's up?

2014-02-13 21:42:35

by Nishanth Aravamudan

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 13.02.2014 [13:06:43 -0800], Andrew Morton wrote:
> On Thu, 13 Feb 2014 00:05:31 -0800 (PST) David Rientjes <[email protected]> wrote:
>
> > On Thu, 13 Feb 2014, Raghavendra K T wrote:
> >
> > > I was able to test (1) implementation on the system where readahead problem
> > > occurred. Unfortunately it did not help.
> > >
> > > Reason seem to be that CONFIG_HAVE_MEMORYLESS_NODES dependency of
> > > numa_mem_id(). The PPC machine I am facing problem has topology like
> > > this:
> > >
> > > numactl -H
> > > ---------
> > > available: 2 nodes (0-1)
> > > node 0 cpus: 0 1 2 3 4 5 6 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> > > ...
> > > node 0 size: 0 MB
> > > node 0 free: 0 MB
> > > node 1 cpus: 8 9 10 11 32 33 34 35 ...
> > > node 1 size: 8071 MB
> > > node 1 free: 2479 MB
> > > node distances:
> > > node 0 1
> > > 0: 10 20
> > > 1: 20 10
> > >
> > > So it seems numa_mem_id() does not help for all the configs..
> > > Am I missing something ?
> > >
> >
> > You need the patch from http://marc.info/?l=linux-mm&m=139093411119013
> > first.
>
> That (un-signed-off) powerpc patch appears to be moribund. What's up?

Gah, thanks for catching that Andrew, not sure what went wrong. I've
appended my S-o-b. I've asked Ben to take a look, but I think he's still
catching up on his queue after travelling.

-Nish

2014-02-13 22:41:10

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 13 Feb 2014, Raghavendra K T wrote:

> Thanks David, unfortunately even after applying that patch, I do not see
> the improvement.
>
> Interestingly numa_mem_id() seem to still return the value of a
> memoryless node.
> May be per cpu _numa_mem_ values are not set properly. Need to dig out ....
>

I believe ppc will be relying on __build_all_zonelists() to set
numa_mem_id() to be the proper node, and that relies on the ordering of
the zonelist built for the memoryless node. It would be very strange if
local_memory_node() is returning a memoryless node because it is the first
zone for node_zonelist(GFP_KERNEL) (why would a memoryless node be on the
zonelist at all?).
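
For reference, local_memory_node() at that time looked roughly like this
(quoted from memory from mm/page_alloc.c; details may differ):

int local_memory_node(int node)
{
        struct zone *zone;

        /* pick the first suitable zone in the node's zonelist and
         * report which node it lives on */
        (void)first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
                                   gfp_zone(GFP_KERNEL),
                                   NULL, &zone);
        return zone->node;
}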

I think the real problem is that build_all_zonelists() is only called at
init when the boot cpu is online so it's only setting numa_mem_id()
properly for the boot cpu. Does it return a node with memory if you
toggle /proc/sys/vm/numa_zonelist_order? Do

echo node > /proc/sys/vm/numa_zonelist_order
echo zone > /proc/sys/vm/numa_zonelist_order
echo default > /proc/sys/vm/numa_zonelist_order

and check if it returns the proper value at either point. This will force
build_all_zonelists() and numa_mem_id() to point to the proper node since
all cpus are now online.

So the prerequisite for CONFIG_HAVE_MEMORYLESS_NODES is that there is an
arch-specific set_numa_mem() that makes this mapping correct like ia64
does. If that's the case, then it's (1) completely undocumented and (2)
Nishanth's patch is incomplete because anything that adds
CONFIG_HAVE_MEMORYLESS_NODES needs to do the proper set_numa_mem() for it
to be any different than numa_node_id().

2014-02-14 00:14:51

by Nishanth Aravamudan

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 13.02.2014 [14:41:04 -0800], David Rientjes wrote:
> On Thu, 13 Feb 2014, Raghavendra K T wrote:
>
> > Thanks David, unfortunately even after applying that patch, I do not see
> > the improvement.
> >
> > Interestingly numa_mem_id() seem to still return the value of a
> > memoryless node.
> > May be per cpu _numa_mem_ values are not set properly. Need to dig out ....
> >
>
> I believe ppc will be relying on __build_all_zonelists() to set
> numa_mem_id() to be the proper node, and that relies on the ordering of
> the zonelist built for the memoryless node. It would be very strange if
> local_memory_node() is returning a memoryless node because it is the first
> zone for node_zonelist(GFP_KERNEL) (why would a memoryless node be on the
> zonelist at all?).
>
> I think the real problem is that build_all_zonelists() is only called at
> init when the boot cpu is online so it's only setting numa_mem_id()
> properly for the boot cpu. Does it return a node with memory if you
> toggle /proc/sys/vm/numa_zonelist_order? Do
>
> echo node > /proc/sys/vm/numa_zonelist_order
> echo zone > /proc/sys/vm/numa_zonelist_order
> echo default > /proc/sys/vm/numa_zonelist_order
>
> and check if it returns the proper value at either point. This will force
> build_all_zonelists() and numa_mem_id() to point to the proper node since
> all cpus are now online.
>
> So the prerequisite for CONFIG_HAVE_MEMORYLESS_NODES is that there is an
> arch-specific set_numa_mem() that makes this mapping correct like ia64
> does. If that's the case, then it's (1) completely undocumented and (2)
> Nishanth's patch is incomplete because anything that adds
> CONFIG_HAVE_MEMORYLESS_NODES needs to do the proper set_numa_mem() for it
> to be any different than numa_node_id().

I'm working on this latter bit now. I tried to mirror ia64, but it looks
like they have CONFIG_USE_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
It seems like CONFIG_USE_PERCPU_NUMA_NODE_ID and
CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?

I'll keep working, but would appreciate any further insight.

-Nish

2014-02-14 00:37:56

by Linus Torvalds

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

Is this whole thread still just for the crazy and pointless
"max_sane_readahead()"?

Or is there some *real* reason we should care?

Because if it really is just for max_sane_readahead(), then for the
love of God, let us just do this

unsigned long max_sane_readahead(unsigned long nr)
{
        return min(nr, 128);
}

and bury this whole idiotic thread.

Seriously, if your IO subsystem needs more than 512kB of read-ahead to
get full performance, your IO subsystem is just bad, and has latencies
that are long enough that you should just replace it. There's no real
reason to bend over backwards for that, and the whole complexity and
fragility of the insane "let's try to figure out how much memory this
node has" is just not worth it. The read-ahead should be small enough
that we should never need to care, and large enough that you get
reasonable IO throughput. The above does that.

Goddammit, there's a reason the whole "Gordian knot" parable is
famous. We're making this all too damn complicated FOR NO GOOD REASON.

Just cut the rope, people. Our aim is not to generate some kind of job
security by making things as complicated as possible.

Linus

On Thu, Feb 13, 2014 at 4:14 PM, Nishanth Aravamudan
<[email protected]> wrote:
>
> I'm working on this latter bit now. I tried to mirror ia64, but it looks
> like they have CONFIG_USER_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
> It seems like CONFIG_USER_PERCPU_NUMA_NODE_ID and
> CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?
>
> I'll keep working, but would appreciate any further insight.
>
> -Nish
>

2014-02-14 00:45:24

by Andrew Morton

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 13 Feb 2014 16:37:53 -0800 Linus Torvalds <[email protected]> wrote:

> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, 128);
> }

I bet nobody will notice.

It should be 128*4096/PAGE_CACHE_SIZE so that variations in PAGE_SIZE
don't affect readahead behaviour.
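
Concretely, something like this (a sketch; the UL suffix matters because
the kernel's min() type-checks its arguments against nr's unsigned long):

unsigned long max_sane_readahead(unsigned long nr)
{
        /* 512KB worth of pages, independent of the page size */
        return min(nr, 128UL * 4096 / PAGE_CACHE_SIZE);
}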

2014-02-14 04:32:49

by Nishanth Aravamudan

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

Hi Linus,

On 13.02.2014 [16:37:53 -0800], Linus Torvalds wrote:
> Is this whole thread still just for the crazy and pointless
> "max_sane_readahead()"?
>
> Or is there some *real* reason we should care?

There is an open issue on powerpc with memoryless nodes (inasmuch as we
can have them, but the kernel doesn't support it properly). There is a
separate discussion going on on linuxppc-dev about what is necessary for
CONFIG_HAVE_MEMORYLESS_NODES to be supported.

> Because if it really is just for max_sane_readahead(), then for the
> love of God, let us just do this
>
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, 128);
> }
>
> and bury this whole idiotic thread.

Agreed that for the readahead case the above is probably more than
sufficient.

Apologies for hijacking the thread, my comments below were purely about
the memoryless node support, not about readahead specifically.

Thanks,
Nish

> Seriously, if your IO subsystem needs more than 512kB of read-ahead to
> get full performance, your IO subsystem is just bad, and has latencies
> that are long enough that you should just replace it. There's no real
> reason to bend over backwards for that, and the whole complexity and
> fragility of the insane "let's try to figure out how much memory this
> node has" is just not worth it. The read-ahead should be small enough
> that we should never need to care, and large enough that you get
> reasonable IO throughput. The above does that.
>
> Goddammit, there's a reason the whole "Gordian knot" parable is
> famous. We're making this all too damn complicated FOR NO GOOD REASON.
>
> Just cut the rope, people. Our aim is not to generate some kind of job
> security by making things as complicated as possible.
>
> Linus
>
> On Thu, Feb 13, 2014 at 4:14 PM, Nishanth Aravamudan
> <[email protected]> wrote:
> >
> > I'm working on this latter bit now. I tried to mirror ia64, but it looks
> > like they have CONFIG_USER_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
> > It seems like CONFIG_USER_PERCPU_NUMA_NODE_ID and
> > CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?
> >
> > I'll keep working, but would appreciate any further insight.
> >
> > -Nish
> >
>

2014-02-14 05:47:37

by Nishanth Aravamudan

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 13.02.2014 [14:41:04 -0800], David Rientjes wrote:
> On Thu, 13 Feb 2014, Raghavendra K T wrote:
>
> > Thanks David, unfortunately even after applying that patch, I do not see
> > the improvement.
> >
> > Interestingly numa_mem_id() seem to still return the value of a
> > memoryless node.
> > May be per cpu _numa_mem_ values are not set properly. Need to dig out ....
> >
>
> I believe ppc will be relying on __build_all_zonelists() to set
> numa_mem_id() to be the proper node, and that relies on the ordering of
> the zonelist built for the memoryless node. It would be very strange if
> local_memory_node() is returning a memoryless node because it is the first
> zone for node_zonelist(GFP_KERNEL) (why would a memoryless node be on the
> zonelist at all?).
>
> I think the real problem is that build_all_zonelists() is only called at
> init when the boot cpu is online so it's only setting numa_mem_id()
> properly for the boot cpu. Does it return a node with memory if you
> toggle /proc/sys/vm/numa_zonelist_order? Do
>
> echo node > /proc/sys/vm/numa_zonelist_order
> echo zone > /proc/sys/vm/numa_zonelist_order
> echo default > /proc/sys/vm/numa_zonelist_order
>
> and check if it returns the proper value at either point. This will force
> build_all_zonelists() and numa_mem_id() to point to the proper node since
> all cpus are now online.

Yep, after massaging the code to allow CONFIG_USE_PERCPU_NUMA_NODE_ID,
you're right that the memory node is wrong. The cpu node is right (they
are all on node 0), but that could be lucky. The memory node is right
for the boot cpu. I did notice that some CPUs now think the cpu node is
1, which is wrong.

> So the prerequisite for CONFIG_HAVE_MEMORYLESS_NODES is that there is an
> arch-specific set_numa_mem() that makes this mapping correct like ia64
> does. If that's the case, then it's (1) completely undocumented and (2)
> Nishanth's patch is incomplete because anything that adds
> CONFIG_HAVE_MEMORYLESS_NODES needs to do the proper set_numa_mem() for it
> to be any different than numa_node_id().

I'll work on getting the set_numa_mem() and set_numa_node() correct for
powerpc.

Thanks,
Nish

2014-02-14 07:43:14

by Jan Kara

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu 13-02-14 16:37:53, Linus Torvalds wrote:
> Is this whole thread still just for the crazy and pointless
> "max_sane_readahead()"?
>
> Or is there some *real* reason we should care?
>
> Because if it really is just for max_sane_readahead(), then for the
> love of God, let us just do this
>
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, 128);
> }
>
> and bury this whole idiotic thread.
max_sane_readahead() is also used for limiting the amount of readahead for
[fm]advise(2) WILLNEED, and that is used e.g. by a dynamic linker to preload
shared libraries into memory. So I'm convinced this usecase *will* notice
the change - effectively we limit preloading of shared libraries to the
first 512KB of the file, but libraries get accessed in a rather random manner.

Maybe the limits for WILLNEED and for standard readahead should be different.
It makes sense to me, and people seem to keep forgetting that
max_sane_readahead() limits WILLNEED preloading as well.
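
The dynamic linker pattern in question is roughly this, sketched in user
space (error handling is minimal; the point is the madvise() call):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* sketch: preload one shared library's file pages, ld.so style */
static void preload(const char *path)
{
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0)
                return;
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED)
                madvise(map, st.st_size, MADV_WILLNEED);
        /* the WILLNEED request is clamped by max_sane_readahead(), so a
         * hard 512KB cap silently truncates the preload of any larger
         * library */
}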

Honza

> On Thu, Feb 13, 2014 at 4:14 PM, Nishanth Aravamudan
> <[email protected]> wrote:
> >
> > I'm working on this latter bit now. I tried to mirror ia64, but it looks
> > like they have CONFIG_USER_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
> > It seems like CONFIG_USER_PERCPU_NUMA_NODE_ID and
> > CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?
> >
> > I'll keep working, but would appreciate any further insight.
> >
> > -Nish
> >
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-02-14 10:54:11

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, 13 Feb 2014, Nishanth Aravamudan wrote:

> There is an open issue on powerpc with memoryless nodes (inasmuch as we
> can have them, but the kernel doesn't support it properly). There is a
> separate discussion going on on linuxppc-dev about what is necessary for
> CONFIG_HAVE_MEMORYLESS_NODES to be supported.
>

Yeah, and this is causing problems with the slub allocator as well.

> Apologies for hijacking the thread, my comments below were purely about
> the memoryless node support, not about readahead specifically.
>

Neither you nor Raghavendra have any reason to apologize to anybody.
Memoryless node support on powerpc isn't working very well right now and
you're trying to fix it, that fix is needed both in this thread and in
your fixes for slub. It's great to see both of you working hard on your
platform to make it work the best.

I think what you'll need to do in addition to your
CONFIG_HAVE_MEMORYLESS_NODE fix, which is obviously needed, is to enable
CONFIG_USE_PERCPU_NUMA_NODE_ID for the same NUMA configurations and then
use set_numa_node() or set_cpu_numa_node() to properly store the mapping
between cpu and node rather than numa_cpu_lookup_table. Then you should
be able to do away with your own implementation of cpu_to_node().

After that, I think it should be as simple as doing

set_numa_node(cpu_to_node(cpu));
set_numa_mem(local_memory_node(cpu_to_node(cpu)));

probably before taking vector_lock in smp_callin(). The cpu-to-node
mapping should be done much earlier in boot while the nodes are being
initialized, I don't think there should be any problem there.

While you're at it, I think you'll also want to add a comment that
setting up the cpu sibling mask must be done before the smp_wmb() before
notify_cpu_starting(cpu), it's crucial to have before the cpu is brought
online and why we need the store memory barrier.

But, again, please don't apologize for developing your architecture and
attacking bugs as they arise, it's very admirable and I'm happy to help in
any way that I can.

2014-02-17 19:28:14

by Nishanth Aravamudan

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 14.02.2014 [02:54:06 -0800], David Rientjes wrote:
> On Thu, 13 Feb 2014, Nishanth Aravamudan wrote:
>
> > There is an open issue on powerpc with memoryless nodes (inasmuch as we
> > can have them, but the kernel doesn't support it properly). There is a
> > separate discussion going on on linuxppc-dev about what is necessary for
> > CONFIG_HAVE_MEMORYLESS_NODES to be supported.
> >
>
> Yeah, and this is causing problems with the slub allocator as well.
>
> > Apologies for hijacking the thread, my comments below were purely about
> > the memoryless node support, not about readahead specifically.
> >
>
> Neither you nor Raghavendra have any reason to apologize to anybody.
> Memoryless node support on powerpc isn't working very well right now and
> you're trying to fix it, that fix is needed both in this thread and in
> your fixes for slub. It's great to see both of you working hard on your
> platform to make it work the best.
>
> I think what you'll need to do in addition to your
> CONFIG_HAVE_MEMORYLESS_NODE fix, which is obviously needed, is to enable
> CONFIG_USE_PERCPU_NUMA_NODE_ID for the same NUMA configurations and then
> use set_numa_node() or set_cpu_numa_node() to properly store the mapping
> between cpu and node rather than numa_cpu_lookup_table. Then you should
> be able to do away with your own implementation of cpu_to_node().
>
> After that, I think it should be as simple as doing
>
> set_numa_node(cpu_to_node(cpu));
> set_numa_mem(local_memory_node(cpu_to_node(cpu)));
>
> probably before taking vector_lock in smp_callin(). The cpu-to-node
> mapping should be done much earlier in boot while the nodes are being
> initialized, I don't think there should be any problem there.

vector_lock/smp_callin are ia64 specific things, I believe? I think the
equivalent is just in start_secondary() for powerpc? (which in fact is
what calls smp_callin on powerpc).

Here is what I'm running into now:

setup_arch ->
do_init_bootmem ->
cpu_numa_callback ->
numa_setup_cpu ->
map_cpu_to_node ->
update_numa_cpu_lookup_table

Which currently updates the powerpc-specific numa_cpu_lookup_table. I
would like to update that function to use set_cpu_numa_node() and
set_cpu_numa_mem(), but local_memory_node() is not yet functional
because build_all_zonelists() is called later in start_kernel(). Would it
make sense for first_zones_zonelist() to return NUMA_NO_NODE if we
don't have a zone?

Thanks,
Nish

2014-02-17 22:57:25

by Linus Torvalds

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, Feb 13, 2014 at 11:43 PM, Jan Kara <[email protected]> wrote:
>
> max_sane_readahead() is also used for limiting amount of readahead for
> [fm]advice(2) WILLNEED and that is used e.g. by a dynamic linker to preload
> shared libraries into memory. So I'm convinced this usecase *will* notice
> the change - effectively we limit preloading of shared libraries to the
> first 512KB in the file but libraries get accessed in a rather random manner.
>
> Maybe limits for WILLNEED and for standard readahead should be different.
> It makes sence to me and people seem to keep forgetting that
> max_sane_readahead() limits also WILLNEED preloading.

Good point. But it's probably overly complex to have two different
limits. The "512kB" thing was entirely random - the only real issue is
that it should be small enough that it won't be a problem on any
reasonable current machines, and big enough to get perfectly fine IO
patterns unless your IO subsystem sucks so bad that it's not even
worth worrying about.

If we just add a third requirement that it be "big enough that
reasonable uses of [fm]advice() will work well enough", then your
shared library example might well be grounds for saying "let's just do
2MB instead". That's still small enough that it won't really hurt any
modern machines.

And if it means that WILLNEED won't necessarily always read the whole
file for big files - well, we never guaranteed that to begin with.

Linus

2014-02-17 22:59:03

by Linus Torvalds

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Thu, Feb 13, 2014 at 8:32 PM, Nishanth Aravamudan
<[email protected]> wrote:
>
> Agreed that for the readahead case the above is probably more than
> sufficient.
>
> Apologies for hijacking the thread, my comments below were purely about
> the memoryless node support, not about readahead specifically.

Ok, no problem. I just wanted to make sure that we're not going down
some fragile rat's nest just for something silly that wasn't worth it.

Linus

2014-02-17 23:14:13

by David Rientjes

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On Mon, 17 Feb 2014, Nishanth Aravamudan wrote:

> Here is what I'm running into now:
>
> setup_arch ->
> do_init_bootmem ->
> cpu_numa_callback ->
> numa_setup_cpu ->
> map_cpu_to_node ->
> update_numa_cpu_lookup_table
>
> > Which currently updates the powerpc-specific numa_cpu_lookup_table. I
> would like to update that function to use set_cpu_numa_node() and
> set_cpu_numa_mem(), but local_memory_node() is not yet functional
> because build_all_zonelists is called later in start_kernel. Would it
> make sense for first_zones_zonelist() to return NUMA_NO_NODE if we
> don't have a zone?
>

Hmm, I don't think we'll want to modify the generic first_zones_zonelist()
for a special case that is only true during boot. Instead, would it make
sense to modify numa_setup_cpu() to use the generic cpu_to_node() instead
of using a powerpc mapping and then do the set_cpu_numa_mem() after
paging_init() when the zonelists will have been built and zones without
present pages are properly excluded?
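
In code, the suggested split would look something like this (a sketch; the
exact call sites are an assumption, not a tested patch):

        /* early, from numa_setup_cpu(): record only the cpu -> node mapping */
        set_cpu_numa_node(cpu, nid);

        /* later, once build_all_zonelists() has run after paging_init(): */
        set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));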

2014-02-18 01:31:13

by Nishanth Aravamudan

Subject: Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages

On 17.02.2014 [15:14:06 -0800], David Rientjes wrote:
> On Mon, 17 Feb 2014, Nishanth Aravamudan wrote:
>
> > Here is what I'm running into now:
> >
> > setup_arch ->
> > do_init_bootmem ->
> > cpu_numa_callback ->
> > numa_setup_cpu ->
> > map_cpu_to_node ->
> > update_numa_cpu_lookup_table
> >
> > Which currently updates the powerpc-specific numa_cpu_lookup_table. I
> > would like to update that function to use set_cpu_numa_node() and
> > set_cpu_numa_mem(), but local_memory_node() is not yet functional
> > because build_all_zonelists is called later in start_kernel. Would it
> > make sense for first_zones_zonelist() to return NUMA_NO_NODE if we
> > don't have a zone?
> >
>
> Hmm, I don't think we'll want to modify the generic first_zones_zonelist()
> for a special case that is only true during boot. Instead, would it make
> sense to modify numa_setup_cpu() to use the generic cpu_to_node() instead
> of using a powerpc mapping and then do the set_cpu_numa_mem() after
> paging_init() when the zonelists will have been built and zones without
> present pages are properly excluded?

Sorry, I was unclear in my e-mail. I meant to modify
local_memory_node(), not first_zones_zonelist(). Well, it only needs the
following, I think?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3758a0..5de4337 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3650,6 +3650,8 @@ int local_memory_node(int node)
                         gfp_zone(GFP_KERNEL),
                         NULL,
                         &zone);
+        if (!zone)
+                return NUMA_NO_NODE;
         return zone->node;
 }
 #endif

I think that condition should only happen during boot -- maybe even
deserving of an unlikely(), but I don't think the above is considered a
hot path. If the above isn't palatable, I can look into your suggestion
instead.

Thanks,
Nish