Hi,
I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 by commit
56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
enable zone reclaim). The commit message suggests that zone reclaim
is desirable for all NUMA configurations.
History has shown that zone reclaim is more often harmful than
helpful and leads to performance problems. The default RECLAIM_DISTANCE
for the generic case was increased from 20 to 30 around 3.0
(32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
I strongly suspect that the patch is incorrect and should be
reverted. Before I send a revert I would like to understand what
led to the patch in the first place. I do not see why PPC would use only
LOCAL_DISTANCE and REMOTE_DISTANCE, and in fact the machines I have
seen use different values.
Anton, could you comment please?
--
Michal Hocko
SUSE Labs
On Tue, 18 Feb 2014, Michal Hocko wrote:
> Hi,
> I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> enable zone reclaim). The commit message suggests that the zone reclaim
> is desirable for all NUMA configurations.
>
> History has shown that the zone reclaim is more often harmful than
> helpful and leads to performance problems. The default RECLAIM_DISTANCE
> for generic case has been increased from 20 to 30 around 3.0
> (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
>
> I strongly suspect that the patch is incorrect and it should be
> reverted. Before I will send a revert I would like to understand what
> led to the patch in the first place. I do not see why would PPC use only
> LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> seen use different values.
>
I strongly suspect that the patch is correct, since powerpc node distances
are different from those of the architectures you're talking about and get
doubled for every NUMA domain that the hardware supports.
Hi Michal,
On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> Hi,
> I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> enable zone reclaim). The commit message suggests that the zone reclaim
> is desirable for all NUMA configurations.
>
> History has shown that the zone reclaim is more often harmful than
> helpful and leads to performance problems. The default RECLAIM_DISTANCE
> for generic case has been increased from 20 to 30 around 3.0
> (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
Interesting.
> I strongly suspect that the patch is incorrect and it should be
> reverted. Before I will send a revert I would like to understand what
> led to the patch in the first place. I do not see why would PPC use only
> LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> seen use different values.
>
> Anton, could you comment please?
I'll let Anton comment here, but while looking into this issue as part of
the CONFIG_HAVE_MEMORYLESS_NODES work, I realized that any LPAR with
memoryless nodes will set zone_reclaim_mode to 1. I think we want to
ignore memoryless nodes when we set up the reclaim mode, like the
following. I'll send it as a proper patch if you agree.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5de4337..4f6ff6f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
{
int i;
- for_each_online_node(i)
- if (node_distance(nid, i) <= RECLAIM_DISTANCE)
+ for_each_online_node(i) {
+ if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
+ local_memory_node(nid) != nid)
node_set(i, NODE_DATA(nid)->reclaim_nodes);
else
zone_reclaim_mode = 1;
Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
not set, but if it is, I think semantically it will indicate that
memoryless nodes *have* to reclaim remotely.
And actually the above won't work, because the call path is
start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
free_area_init_node -> init_zone_allows_reclaim], which is called before
build_all_zonelists. This is a similar ordering problem to the one I'm
having with the MEMORYLESS_NODES support; I will work on it.
Thanks,
Nish
On 18.02.2014 [15:34:05 -0800], Nishanth Aravamudan wrote:
> Hi Michal,
>
> On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> > Hi,
> > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > enable zone reclaim). The commit message suggests that the zone reclaim
> > is desirable for all NUMA configurations.
> >
> > History has shown that the zone reclaim is more often harmful than
> > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > for generic case has been increased from 20 to 30 around 3.0
> > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
>
> Interesting.
>
> > I strongly suspect that the patch is incorrect and it should be
> > reverted. Before I will send a revert I would like to understand what
> > led to the patch in the first place. I do not see why would PPC use only
> > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > seen use different values.
> >
> > Anton, could you comment please?
>
> I'll let Anton comment here, but in looking into this issue in working
> on CONFIG_HAVE_MEMORYLESS_NODE support, I realized that any LPAR with
> memoryless nodes will set zone_reclaim_mode to 1. I think we want to
> ignore memoryless nodes when we set up the reclaim mode like the
> following? I'll send it as a proper patch if you agree?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5de4337..4f6ff6f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> {
> int i;
>
> - for_each_online_node(i)
> - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> + for_each_online_node(i) {
> + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> + local_memory_node(nid) != nid)
> node_set(i, NODE_DATA(nid)->reclaim_nodes);
> else
> zone_reclaim_mode = 1;
>
> Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
> not set, but if it is, I think semantically it will indicate that
> memoryless nodes *have* to reclaim remotely.
>
> And actually the above won't work, because the callpath is
>
> start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
> free_area_init_node -> init_zone_allows_reclaim] which is called before
> build_all_zonelists. This is a similar ordering problem as I'm having
> with the MEMORYLESS_NODE support, will work on it.
How about the following?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5de4337..1a0eced 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
int i;
for_each_online_node(i)
- if (node_distance(nid, i) <= RECLAIM_DISTANCE)
+ if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
+ !NODE_DATA(nid)->node_present_pages)
node_set(i, NODE_DATA(nid)->reclaim_nodes);
else
zone_reclaim_mode = 1;
@@ -4901,13 +4902,13 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
- init_zone_allows_reclaim(nid);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
#endif
calculate_node_totalpages(pgdat, start_pfn, end_pfn,
zones_size, zholes_size);
+ init_zone_allows_reclaim(nid);
alloc_node_mem_map(pgdat);
#ifdef CONFIG_FLAT_NODE_MEM_MAP
printk(KERN_DEBUG "free_area_init_node: node %d, pgdat %08lx, node_mem_map %08lx\n",
I think it's safe to move init_zone_allows_reclaim, because I don't
think any allocations are occurring here that could cause us to reclaim
anyway, right? Moving it allows us to safely reference
node_present_pages.
Thanks,
Nish
On 18.02.2014 [15:58:00 -0800], Nishanth Aravamudan wrote:
> On 18.02.2014 [15:34:05 -0800], Nishanth Aravamudan wrote:
> > Hi Michal,
> >
> > On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> > > Hi,
> > > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > > enable zone reclaim). The commit message suggests that the zone reclaim
> > > is desirable for all NUMA configurations.
> > >
> > > History has shown that the zone reclaim is more often harmful than
> > > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > > for generic case has been increased from 20 to 30 around 3.0
> > > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
> >
> > Interesting.
> >
> > > I strongly suspect that the patch is incorrect and it should be
> > > reverted. Before I will send a revert I would like to understand what
> > > led to the patch in the first place. I do not see why would PPC use only
> > > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > > seen use different values.
> > >
> > > Anton, could you comment please?
> >
> > I'll let Anton comment here, but in looking into this issue in working
> > on CONFIG_HAVE_MEMORYLESS_NODE support, I realized that any LPAR with
> > memoryless nodes will set zone_reclaim_mode to 1. I think we want to
> > ignore memoryless nodes when we set up the reclaim mode like the
> > following? I'll send it as a proper patch if you agree?
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5de4337..4f6ff6f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > {
> > int i;
> >
> > - for_each_online_node(i)
> > - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > + for_each_online_node(i) {
> > + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> > + local_memory_node(nid) != nid)
> > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > else
> > zone_reclaim_mode = 1;
> >
> > Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
> > not set, but if it is, I think semantically it will indicate that
> > memoryless nodes *have* to reclaim remotely.
> >
> > And actually the above won't work, because the callpath is
> >
> > start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
> > free_area_init_node -> init_zone_allows_reclaim] which is called before
> > build_all_zonelists. This is a similar ordering problem as I'm having
> > with the MEMORYLESS_NODE support, will work on it.
>
> How about the following?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5de4337..1a0eced 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> int i;
>
> for_each_online_node(i)
> - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> + !NODE_DATA(nid)->node_present_pages)
err s/nid/i/ above.
-Nish
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> How about the following?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5de4337..1a0eced 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> int i;
>
> for_each_online_node(i)
> - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> + !NODE_DATA(i)->node_present_pages)
> node_set(i, NODE_DATA(nid)->reclaim_nodes);
> else
> zone_reclaim_mode = 1;
[ I changed the above from NODE_DATA(nid) -> NODE_DATA(i) as you caught
so we're looking at the right code. ]
That can't be right, it would allow reclaiming from a memoryless node. I
think what you want is
for_each_online_node(i) {
if (!node_present_pages(i))
continue;
if (node_distance(nid, i) <= RECLAIM_DISTANCE) {
node_set(i, NODE_DATA(nid)->reclaim_nodes);
continue;
}
/* Always try to reclaim locally */
zone_reclaim_mode = 1;
}
but we really should be able to do for_each_node_state(i, N_MEMORY) here
and memoryless nodes should already be excluded from that mask.
> @@ -4901,13 +4902,13 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>
> pgdat->node_id = nid;
> pgdat->node_start_pfn = node_start_pfn;
> - init_zone_allows_reclaim(nid);
> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> #endif
> calculate_node_totalpages(pgdat, start_pfn, end_pfn,
> zones_size, zholes_size);
>
> + init_zone_allows_reclaim(nid);
> alloc_node_mem_map(pgdat);
> #ifdef CONFIG_FLAT_NODE_MEM_MAP
> printk(KERN_DEBUG "free_area_init_node: node %d, pgdat %08lx, node_mem_map %08lx\n",
>
> I think it's safe to move init_zone_allows_reclaim, because I don't
> think any allocates are occurring here that could cause us to reclaim
> anyways, right? Moving it allows us to safely reference
> node_present_pages.
>
Yeah, this is fine.
On Tue 18-02-14 14:27:11, David Rientjes wrote:
> On Tue, 18 Feb 2014, Michal Hocko wrote:
>
> > Hi,
> > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > enable zone reclaim). The commit message suggests that the zone reclaim
> > is desirable for all NUMA configurations.
> >
> > History has shown that the zone reclaim is more often harmful than
> > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > for generic case has been increased from 20 to 30 around 3.0
> > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
> >
> > I strongly suspect that the patch is incorrect and it should be
> > reverted. Before I will send a revert I would like to understand what
> > led to the patch in the first place. I do not see why would PPC use only
> > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > seen use different values.
> >
>
> I strongly suspect that the patch is correct since powerpc node distances
> are different than the architectures you're talking about and get doubled
> for every NUMA domain that the hardware supports.
Even if the units of the distance are different on PPC, should every NUMA
machine have zone_reclaim enabled? That doesn't seem right to me.
--
Michal Hocko
SUSE Labs
On Wed, 19 Feb 2014, Michal Hocko wrote:
> > I strongly suspect that the patch is correct since powerpc node distances
> > are different than the architectures you're talking about and get doubled
> > for every NUMA domain that the hardware supports.
>
> Even if the units of the distance is different on PPC should every NUMA
> machine have zone_reclaim enabled? That doesn't right to me.
>
In my experience on powerpc it's very much correct: there's typically a
significant latency in remote access, and we don't have the benefit of a
SLIT that actually defines the locality between proximity domains like we
do on other architectures. We have had significant issues with thp, for
example, being allocated remotely instead of local pages, much more
drastic than on our x86 machines, particularly AMD machines.
On Tue 18-02-14 15:34:05, Nishanth Aravamudan wrote:
> Hi Michal,
>
> On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> > Hi,
> > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > enable zone reclaim). The commit message suggests that the zone reclaim
> > is desirable for all NUMA configurations.
> >
> > History has shown that the zone reclaim is more often harmful than
> > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > for generic case has been increased from 20 to 30 around 3.0
> > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
>
> Interesting.
>
> > I strongly suspect that the patch is incorrect and it should be
> > reverted. Before I will send a revert I would like to understand what
> > led to the patch in the first place. I do not see why would PPC use only
> > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > seen use different values.
> >
> > Anton, could you comment please?
>
> I'll let Anton comment here, but in looking into this issue in working
> on CONFIG_HAVE_MEMORYLESS_NODE support, I realized that any LPAR with
> memoryless nodes will set zone_reclaim_mode to 1. I think we want to
> ignore memoryless nodes when we set up the reclaim mode like the
> following? I'll send it as a proper patch if you agree?
Funny enough, ppc memoryless node setup is what led me to this code.
We had a setup like this:
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 7168 MB
node 2 free: 6019 MB
node distances:
node 0 2
0: 10 40
2: 40 10
This ends up enabling zone_reclaim although there is only a single node
with memory. Not that RECLAIM_DISTANCE would make any difference here, as
the distance is even above the default RECLAIM_DISTANCE.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5de4337..4f6ff6f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> {
> int i;
>
> - for_each_online_node(i)
> - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> + for_each_online_node(i) {
> + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> + local_memory_node(nid) != nid)
> node_set(i, NODE_DATA(nid)->reclaim_nodes);
> else
> zone_reclaim_mode = 1;
>
> Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
> not set, but if it is, I think semantically it will indicate that
> memoryless nodes *have* to reclaim remotely.
>
> And actually the above won't work, because the callpath is
>
> start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
> free_area_init_node -> init_zone_allows_reclaim] which is called before
> build_all_zonelists. This is a similar ordering problem as I'm having
> with the MEMORYLESS_NODE support, will work on it.
I think you just want for_each_node_state(nid, N_MEMORY) and skip all
the memoryless nodes, no?
--
Michal Hocko
SUSE Labs
On Tue 18-02-14 17:43:38, David Rientjes wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
>
> > How about the following?
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5de4337..1a0eced 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > int i;
> >
> > for_each_online_node(i)
> > - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> > + !NODE_DATA(i)->node_present_pages)
> > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > else
> > zone_reclaim_mode = 1;
>
> [ I changed the above from NODE_DATA(nid) -> NODE_DATA(i) as you caught
> so we're looking at the right code. ]
>
> That can't be right, it would allow reclaiming from a memoryless node. I
> think what you want is
>
> for_each_online_node(i) {
> if (!node_present_pages(i))
> continue;
> if (node_distance(nid, i) <= RECLAIM_DISTANCE) {
> node_set(i, NODE_DATA(nid)->reclaim_nodes);
> continue;
> }
> /* Always try to reclaim locally */
> zone_reclaim_mode = 1;
> }
>
> but we really should be able to do for_each_node_state(i, N_MEMORY) here
> and memoryless nodes should already be excluded from that mask.
Agreed! Actually the code I am currently interested in is based on the 3.0
kernel, where zone_reclaim_mode is set in build_zonelists, which relies on
find_next_best_node, which iterates only over N_HIGH_MEMORY nodes, and
those should have non-zero present pages.
[...]
--
Michal Hocko
SUSE Labs
On Wed 19-02-14 00:20:21, David Rientjes wrote:
> On Wed, 19 Feb 2014, Michal Hocko wrote:
>
> > > I strongly suspect that the patch is correct since powerpc node distances
> > > are different than the architectures you're talking about and get doubled
> > > for every NUMA domain that the hardware supports.
> >
> > Even if the units of the distance is different on PPC should every NUMA
> > machine have zone_reclaim enabled? That doesn't right to me.
> >
>
> In my experience on powerpc it's very correct, there's typically a
> significant latency in remote access and we don't have the benefit of a
> SLIT that actually defines the locality between proximity domains like we
> do on other architectures.
Interesting. So is PPC NUMA basically about local vs. very distant?
Should REMOTE_DISTANCE reflect that as well? Or can we have
distance < REMOTE_DISTANCE where it would still make sense to have
zone_reclaim enabled?
[...]
Thanks!
--
Michal Hocko
SUSE Labs
On 19.02.2014 [08:24:38 -0800], Nishanth Aravamudan wrote:
> On 18.02.2014 [17:43:38 -0800], David Rientjes wrote:
> > On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> >
> > > How about the following?
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 5de4337..1a0eced 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > > int i;
> > >
> > > for_each_online_node(i)
> > > - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > > + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> > > + !NODE_DATA(i)->node_present_pages)
> > > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > > else
> > > zone_reclaim_mode = 1;
> >
> > [ I changed the above from NODE_DATA(nid) -> NODE_DATA(i) as you caught
> > so we're looking at the right code. ]
> >
> > That can't be right, it would allow reclaiming from a memoryless node. I
> > think what you want is
>
> Gah, you're right.
>
> > for_each_online_node(i) {
> > if (!node_present_pages(i))
> > continue;
> > if (node_distance(nid, i) <= RECLAIM_DISTANCE) {
> > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > continue;
> > }
> > /* Always try to reclaim locally */
> > zone_reclaim_mode = 1;
> > }
> >
> > but we really should be able to do for_each_node_state(i, N_MEMORY) here
> > and memoryless nodes should already be excluded from that mask.
>
> Yep, I found that afterwards, which simplifies the logic. I'll add this
> to my series :)
In looking at the code, I am wondering about the following:
init_zone_allows_reclaim() is called for each nid from
free_area_init_node(), which means that calculate_node_totalpages() for
other "later" nids and check_for_memory() [which sets up the N_MEMORY
nodemask] haven't been called yet.
So, would it make sense to pull up the
/* Any memory on that node */
if (pgdat->node_present_pages)
node_set_state(nid, N_MEMORY);
check_for_memory(pgdat, nid);
into free_area_init_node()?
Thanks,
Nish
We had a report about strange OOM killer strikes on a PPC machine
although there was a lot of free swap and tons of anonymous memory
which could have been swapped out. In the end it turned out that the OOM
was a side effect of zone reclaim, which wasn't unmapping and swapping
out, and so the system was pushed to the OOM. Although this sounds like a
bug somewhere in the kswapd vs. zone reclaim vs. direct reclaim
interaction, numactl on the said hardware suggests that zone reclaim
should not have been set in the first place:
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 7168 MB
node 2 free: 6019 MB
node distances:
node 0 2
0: 10 40
2: 40 10
So all the CPUs are associated with Node0, which doesn't have any memory,
while Node2 contains all the available memory. The node distances cause
zone_reclaim_mode to be enabled automatically.
Zone reclaim is intended to keep allocations local, but this doesn't
make any sense on memoryless nodes. So let's exclude such nodes
from init_zone_allows_reclaim, which evaluates the zone reclaim behavior
and the suitable reclaim_nodes.
Signed-off-by: Michal Hocko <[email protected]>
---
I haven't gotten to testing this yet, so I am sending it as an RFC for
now. But does this look reasonable?
mm/page_alloc.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e953f07edb0..4a44bdc7a8cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1855,7 +1855,7 @@ static void __paginginit init_zone_allows_reclaim(int nid)
{
int i;
- for_each_online_node(i)
+ for_each_node_state(i, N_HIGH_MEMORY)
if (node_distance(nid, i) <= RECLAIM_DISTANCE)
node_set(i, NODE_DATA(nid)->reclaim_nodes);
else
@@ -4901,7 +4901,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
- init_zone_allows_reclaim(nid);
+ if (node_state(nid, N_HIGH_MEMORY))
+ init_zone_allows_reclaim(nid);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
#endif
--
1.9.0.rc3
On 19.02.2014 [18:03:03 +0100], Michal Hocko wrote:
> We had a report about strange OOM killer strikes on a PPC machine
> although there was a lot of swap free and a tons of anonymous memory
> which could be swapped out. In the end it turned out that the OOM was
> a side effect of zone reclaim which wasn't doesn't unmap and swapp out
> and so the system was pushed to the OOM. Although this sounds like a bug
> somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction
> numactl on the said hardware suggests that the zone reclaim should
> have been set in the first place:
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus:
> node 2 size: 7168 MB
> node 2 free: 6019 MB
> node distances:
> node 0 2
> 0: 10 40
> 2: 40 10
>
> So all the CPUs are associated with Node0 which doesn't have any memory
> while Node2 contains all the available memory. Node distances cause an
> automatic zone_reclaim_mode enabling.
>
> Zone reclaim is intended to keep the allocations local but this doesn't
> make any sense on the memory less nodes. So let's exlcude such nodes
> for init_zone_allows_reclaim which evaluates zone reclaim behavior and
> suitable reclaim_nodes.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> I haven't got to testing this so I am sending this as an RFC for now.
> But does this look reasonable?
>
> mm/page_alloc.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e953f07edb0..4a44bdc7a8cf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1855,7 +1855,7 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> {
> int i;
>
> - for_each_online_node(i)
> + for_each_node_state(i, N_HIGH_MEMORY)
> if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> node_set(i, NODE_DATA(nid)->reclaim_nodes);
> else
> @@ -4901,7 +4901,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>
> pgdat->node_id = nid;
> pgdat->node_start_pfn = node_start_pfn;
> - init_zone_allows_reclaim(nid);
> + if (node_state(nid, N_HIGH_MEMORY))
> + init_zone_allows_reclaim(nid);
I don't think this will work, because what sets N_HIGH_MEMORY (and
shouldn't it be N_MEMORY?) is check_for_memory() (free_area_init_nodes()
for N_MEMORY), which is run *after* init_zone_allows_reclaim(). Further,
the for_each_node_state() loop doesn't make sense at this point, because
we are actually setting up the nids as we go. So node 0 will only see
node 0 in the N_HIGH_MEMORY mask (if any), node 1 will only see nodes 0
and 1, etc.
I'm working on testing a patch that reorders some of this in hopefully a
safe way.
Thanks,
Nish
> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> #endif
> --
> 1.9.0.rc3
>
On Wed 19-02-14 09:16:28, Nishanth Aravamudan wrote:
> On 19.02.2014 [18:03:03 +0100], Michal Hocko wrote:
> > We had a report about strange OOM killer strikes on a PPC machine
> > although there was a lot of swap free and a tons of anonymous memory
> > which could be swapped out. In the end it turned out that the OOM was
> > a side effect of zone reclaim which wasn't doesn't unmap and swapp out
> > and so the system was pushed to the OOM. Although this sounds like a bug
> > somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction
> > numactl on the said hardware suggests that the zone reclaim should
> > have been set in the first place:
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > node 0 size: 0 MB
> > node 0 free: 0 MB
> > node 2 cpus:
> > node 2 size: 7168 MB
> > node 2 free: 6019 MB
> > node distances:
> > node 0 2
> > 0: 10 40
> > 2: 40 10
> >
> > So all the CPUs are associated with Node0 which doesn't have any memory
> > while Node2 contains all the available memory. Node distances cause an
> > automatic zone_reclaim_mode enabling.
> >
> > Zone reclaim is intended to keep the allocations local but this doesn't
> > make any sense on the memory less nodes. So let's exlcude such nodes
> > for init_zone_allows_reclaim which evaluates zone reclaim behavior and
> > suitable reclaim_nodes.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > I haven't got to testing this so I am sending this as an RFC for now.
> > But does this look reasonable?
> >
> > mm/page_alloc.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3e953f07edb0..4a44bdc7a8cf 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1855,7 +1855,7 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > {
> > int i;
> >
> > - for_each_online_node(i)
> > + for_each_node_state(i, N_HIGH_MEMORY)
> > if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > else
> > @@ -4901,7 +4901,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
> >
> > pgdat->node_id = nid;
> > pgdat->node_start_pfn = node_start_pfn;
> > - init_zone_allows_reclaim(nid);
> > + if (node_state(nid, N_HIGH_MEMORY))
> > + init_zone_allows_reclaim(nid);
>
> I don't think this will work, because what sets N_HIGH_MEMORY (and
> shouldn't it be N_MEMORY?)
This should be the same thing AFAIU.
> is check_for_memory() (free_area_init_nodes() for N_MEMORY), which is
> run *after* init_zone_allows_reclaim().
early_calculate_totalpages sets the memory state before we get here.
> Further, the for_each_node_state() loop doesn't make sense at this
> point, becuase we are actually setting up the nids as we go. So node
> 0, will only see node 0 in the N_HIGH_MEMORY mask (if any). Node 1,
> will only see nodes 0 and 1, etc.
I am not sure I understand what you are saying here. Why would we
consider the distance to a memoryless node in the first place?
for_each_node_state just makes sure we are comparing only against nodes
which have some memory.
> I'm working on testing a patch that reorders some of this in hopefully a
> safe way.
Although it might make some sense to reorganize the code (it's a mess if
you ask me), I do not think it is necessary.
--
Michal Hocko
SUSE Labs
On 19.02.2014 [18:32:59 +0100], Michal Hocko wrote:
> On Wed 19-02-14 09:16:28, Nishanth Aravamudan wrote:
> > On 19.02.2014 [18:03:03 +0100], Michal Hocko wrote:
> > > We had a report about strange OOM killer strikes on a PPC machine
> > > although there was a lot of swap free and a tons of anonymous memory
> > > which could be swapped out. In the end it turned out that the OOM was
> > > a side effect of zone reclaim which wasn't doesn't unmap and swapp out
> > > and so the system was pushed to the OOM. Although this sounds like a bug
> > > somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction
> > > numactl on the said hardware suggests that the zone reclaim should
> > > have been set in the first place:
> > > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > node 0 size: 0 MB
> > > node 0 free: 0 MB
> > > node 2 cpus:
> > > node 2 size: 7168 MB
> > > node 2 free: 6019 MB
> > > node distances:
> > > node 0 2
> > > 0: 10 40
> > > 2: 40 10
> > >
> > > So all the CPUs are associated with Node0 which doesn't have any memory
> > > while Node2 contains all the available memory. Node distances cause an
> > > automatic zone_reclaim_mode enabling.
> > >
> > > Zone reclaim is intended to keep allocations local but this doesn't
> > > make any sense on memoryless nodes. So let's exclude such nodes
> > > from init_zone_allows_reclaim, which evaluates zone reclaim behavior
> > > and sets up suitable reclaim_nodes.
> > >
> > > Signed-off-by: Michal Hocko <[email protected]>
> > > ---
> > > I haven't got to testing this so I am sending this as an RFC for now.
> > > But does this look reasonable?
> > >
> > > mm/page_alloc.c | 5 +++--
> > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 3e953f07edb0..4a44bdc7a8cf 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1855,7 +1855,7 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > > {
> > > int i;
> > >
> > > - for_each_online_node(i)
> > > + for_each_node_state(i, N_HIGH_MEMORY)
> > > if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > > else
> > > @@ -4901,7 +4901,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
> > >
> > > pgdat->node_id = nid;
> > > pgdat->node_start_pfn = node_start_pfn;
> > > - init_zone_allows_reclaim(nid);
> > > + if (node_state(nid, N_HIGH_MEMORY))
> > > + init_zone_allows_reclaim(nid);
> >
> > I don't think this will work, because what sets N_HIGH_MEMORY (and
> > shouldn't it be N_MEMORY?)
>
> This should be the same thing AFAIU.
I don't think they are guaranteed to be? And, in any case, semantically,
we care if a node has MEMORY, not if it has HIGH_MEMORY?
> > is check_for_memory() (free_area_init_nodes() for N_MEMORY), which is
> > run *after* init_zone_allows_reclaim().
>
> early_calculate_totalpages sets the memory state before we get here.
Ah, I did not see this, thanks! But that only does
node_set_state(nid, N_MEMORY);
which to me means we should rely only on N_MEMORY being set in the
aforementioned loop.
> > Further, the for_each_node_state() loop doesn't make sense at this
> > point, because we are actually setting up the nids as we go. So node
> > 0 will only see node 0 in the N_HIGH_MEMORY mask (if any), node 1
> > will only see nodes 0 and 1, etc.
>
> I am not sure I understand what you are saying here. Why would we
> consider distance to a memoryless node in the first place?
> for_each_node_state just makes sure we are comparing only to a node
> which has some memory.
I apologize, I missed the call to early_calculate_totalpages(), so I was
simply saying that looping over the N_MEMORY/N_HIGH_MEMORY mask wouldn't
necessarily be right if we are setting up that mask on a node-by-node
basis. But early_calculate_totalpages ensures the N_MEMORY iteration
will be fine.
>
> > I'm working on testing a patch that reorders some of this in hopefully a
> > safe way.
>
> Although it might make some sense to reorganize the code (it's a mess if
> you ask me), I do not think it is necessary.
Agreed.
Thanks,
Nish
On 19.02.2014 [18:03:03 +0100], Michal Hocko wrote:
> We had a report about strange OOM killer strikes on a PPC machine
> although there was a lot of free swap and tons of anonymous memory
> which could be swapped out. In the end it turned out that the OOM was
> a side effect of zone reclaim, which doesn't unmap and swap out, and
> so the system was pushed to OOM. Although this sounds like a bug
> somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction,
> numactl on the said hardware suggests why zone reclaim got enabled
> in the first place:
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus:
> node 2 size: 7168 MB
> node 2 free: 6019 MB
> node distances:
> node 0 2
> 0: 10 40
> 2: 40 10
>
> So all the CPUs are associated with Node0 which doesn't have any memory
> while Node2 contains all the available memory. Node distances cause an
> automatic zone_reclaim_mode enabling.
>
> Zone reclaim is intended to keep allocations local but this doesn't
> make any sense on memoryless nodes. So let's exclude such nodes
> from init_zone_allows_reclaim, which evaluates zone reclaim behavior
> and sets up suitable reclaim_nodes.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> I haven't got to testing this so I am sending this as an RFC for now.
> But does this look reasonable?
>
> mm/page_alloc.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e953f07edb0..4a44bdc7a8cf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1855,7 +1855,7 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> {
> int i;
>
> - for_each_online_node(i)
> + for_each_node_state(i, N_HIGH_MEMORY)
> if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> node_set(i, NODE_DATA(nid)->reclaim_nodes);
> else
> @@ -4901,7 +4901,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>
> pgdat->node_id = nid;
> pgdat->node_start_pfn = node_start_pfn;
> - init_zone_allows_reclaim(nid);
> + if (node_state(nid, N_HIGH_MEMORY))
> + init_zone_allows_reclaim(nid);
I'm still new to this code, but isn't this saying that if a node has no
memory, then it shouldn't reclaim from any node? But, for a memoryless
node to ensure progress later if reclaim is necessary, it *must* reclaim
from other nodes? So wouldn't we want to set reclaim_nodes() in that
case to node_states[N_MEMORY]?
Thanks,
Nish
On 19.02.2014 [09:23:13 +0100], Michal Hocko wrote:
> On Tue 18-02-14 15:34:05, Nishanth Aravamudan wrote:
> > Hi Michal,
> >
> > On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> > > Hi,
> > > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > > enable zone reclaim). The commit message suggests that the zone reclaim
> > > is desirable for all NUMA configurations.
> > >
> > > History has shown that the zone reclaim is more often harmful than
> > > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > > for generic case has been increased from 20 to 30 around 3.0
> > > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
> >
> > Interesting.
> >
> > > I strongly suspect that the patch is incorrect and it should be
> > > reverted. Before I will send a revert I would like to understand what
> > > led to the patch in the first place. I do not see why would PPC use only
> > > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > > seen use different values.
> > >
> > > Anton, could you comment please?
> >
> > I'll let Anton comment here, but in looking into this issue in working
> > on CONFIG_HAVE_MEMORYLESS_NODE support, I realized that any LPAR with
> > memoryless nodes will set zone_reclaim_mode to 1. I think we want to
> > ignore memoryless nodes when we set up the reclaim mode like the
> > following? I'll send it as a proper patch if you agree?
>
> Funny enough, ppc memoryless node setup is what led me to this code.
> We had a setup like this:
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus:
> node 2 size: 7168 MB
> node 2 free: 6019 MB
> node distances:
> node 0 2
> 0: 10 40
> 2: 40 10
Yeah, I think this happens fairly often ... and we didn't properly
support it (particularly with SLUB) on powerpc. I'll cc you on my
patchset.
> Which ends up enabling zone_reclaim although there is only a single node
> with memory. Not that RECLAIM_DISTANCE would make any difference here as
> the distance is even above default RECLAIM_DISTANCE.
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5de4337..4f6ff6f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > {
> > int i;
> >
> > - for_each_online_node(i)
> > - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > + for_each_online_node(i) {
> > + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> > + local_memory_node(nid) != nid)
> > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > else
> > zone_reclaim_mode = 1;
> >
> > Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
> > not set, but if it is, I think semantically it will indicate that
> > memoryless nodes *have* to reclaim remotely.
> >
> > And actually the above won't work, because the callpath is
> >
> > start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
> > free_area_init_node -> init_zone_allows_reclaim] which is called before
> > build_all_zonelists. This is a similar ordering problem as I'm having
> > with the MEMORYLESS_NODE support, will work on it.
>
> I think you just want for_each_node_state(nid, N_MEMORY) and skip all
> the memoryless nodes, no?
Yep, thanks!
-Nish
On 18.02.2014 [17:43:38 -0800], David Rientjes wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
>
> > How about the following?
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5de4337..1a0eced 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > int i;
> >
> > for_each_online_node(i)
> > - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> > + !NODE_DATA(i)->node_present_pages)
> > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > else
> > zone_reclaim_mode = 1;
>
> [ I changed the above from NODE_DATA(nid) -> NODE_DATA(i) as you caught
> so we're looking at the right code. ]
>
> That can't be right, it would allow reclaiming from a memoryless node. I
> think what you want is
Gah, you're right.
> for_each_online_node(i) {
> if (!node_present_pages(i))
> continue;
> if (node_distance(nid, i) <= RECLAIM_DISTANCE) {
> node_set(i, NODE_DATA(nid)->reclaim_nodes);
> continue;
> }
> /* Always try to reclaim locally */
> zone_reclaim_mode = 1;
> }
>
> but we really should be able to do for_each_node_state(i, N_MEMORY) here
> and memoryless nodes should already be excluded from that mask.
Yep, I found that afterwards, which simplifies the logic. I'll add this
to my series :)
<snip>
> > I think it's safe to move init_zone_allows_reclaim, because I don't
> > think any allocates are occurring here that could cause us to reclaim
> > anyways, right? Moving it allows us to safely reference
> > node_present_pages.
> >
>
> Yeah, this is fine.
Thanks,
Nish
On Wed 19-02-14 09:49:41, Nishanth Aravamudan wrote:
> On 19.02.2014 [18:32:59 +0100], Michal Hocko wrote:
> > On Wed 19-02-14 09:16:28, Nishanth Aravamudan wrote:
[...]
> > > I don't think this will work, because what sets N_HIGH_MEMORY (and
> > > shouldn't it be N_MEMORY?)
> >
> > This should be the same thing AFAIU.
>
> I don't think they are guaranteed to be? And, in any case, semantically,
> we care if a node has MEMORY, not if it has HIGH_MEMORY?
I don't know. The whole MEMORY vs HIGH_MEMORY thing is really
confusing. But my understanding was that HIGH_MEMORY is a superset of the
other one. But now that I look at the code again it seems that N_MEMORY
is the right thing to use here. I will repost the patch tomorrow if
other parts are good.
[...]
--
Michal Hocko
SUSE Labs
On Wed, 19 Feb 2014, Michal Hocko wrote:
> Interesting. So is the PPC NUMA basically about local vs. very distant?
The point is that it's impossible to tell how distant nodes are from one
NUMA domain to the next.
> Should REMOTE_DISTANCE reflect that as well? Or can we have
> distance < REMOTE_DISTANCE and it would still make sense to have
> zone_reclaim enabled?
>
Ppc doesn't want to allocate in a different NUMA domain unless required;
the latency of a remote access is too high. Everything that isn't in the
same domain has a distance >10 and is set up as 2^(domain hops) * 10. We
don't have the ability, as with a SLIT, to define whether remote nodes
have local, remote, or costly latency, so the safe setting is a
RECLAIM_DISTANCE of 10.
On Wed, 19 Feb 2014, Nishanth Aravamudan wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3e953f07edb0..4a44bdc7a8cf 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1855,7 +1855,7 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > {
> > int i;
> >
> > - for_each_online_node(i)
> > + for_each_node_state(i, N_HIGH_MEMORY)
> > if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > else
> > @@ -4901,7 +4901,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
> >
> > pgdat->node_id = nid;
> > pgdat->node_start_pfn = node_start_pfn;
> > - init_zone_allows_reclaim(nid);
> > + if (node_state(nid, N_HIGH_MEMORY))
> > + init_zone_allows_reclaim(nid);
>
> I'm still new to this code, but isn't this saying that if a node has no
> memory, then it shouldn't reclaim from any node? But, for a memoryless
> node to ensure progress later if reclaim is necessary, it *must* reclaim
> from other nodes? So wouldn't we want to set reclaim_nodes() in that
> case to node_states[N_MEMORY]?
>
The only time when pgdat->reclaim_nodes or zone_reclaim_mode matters is
when iterating through a zonelist for page allocation and a memoryless
node should never appear in a zonelist for page allocation, so this is
just preventing setting zone_reclaim_mode unnecessarily because the only
nodes with > RECLAIM_DISTANCE to another node are memoryless. So this
patch is fine as long as it gets s/N_HIGH_MEMORY/N_MEMORY/.
Acked-by: David Rientjes <[email protected]>
On 19.02.2014 [13:56:00 -0800], David Rientjes wrote:
> On Wed, 19 Feb 2014, Nishanth Aravamudan wrote:
>
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 3e953f07edb0..4a44bdc7a8cf 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1855,7 +1855,7 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > > {
> > > int i;
> > >
> > > - for_each_online_node(i)
> > > + for_each_node_state(i, N_HIGH_MEMORY)
> > > if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > > else
> > > @@ -4901,7 +4901,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
> > >
> > > pgdat->node_id = nid;
> > > pgdat->node_start_pfn = node_start_pfn;
> > > - init_zone_allows_reclaim(nid);
> > > + if (node_state(nid, N_HIGH_MEMORY))
> > > + init_zone_allows_reclaim(nid);
> >
> > I'm still new to this code, but isn't this saying that if a node has no
> > memory, then it shouldn't reclaim from any node? But, for a memoryless
> > node to ensure progress later if reclaim is necessary, it *must* reclaim
> > from other nodes? So wouldn't we want to set reclaim_nodes() in that
> > case to node_states[N_MEMORY]?
> >
>
> The only time when pgdat->reclaim_nodes or zone_reclaim_mode matters is
> when iterating through a zonelist for page allocation and a memoryless
> node should never appear in a zonelist for page allocation, so this is
> just preventing setting zone_reclaim_mode unnecessarily because the only
> nodes with > RECLAIM_DISTANCE to another node are memoryless. So this
> patch is fine as long as it gets s/N_HIGH_MEMORY/N_MEMORY/.
Ah yes, sorry, I've been looking at this code perhaps too much and going
a bit cross-eyed!
I wonder if we should also put some comments in? But
Acked-by: Nishanth Aravamudan <[email protected]>
Tested-by: Nishanth Aravamudan <[email protected]>
Thanks,
Nish
On Wed 19-02-14 15:05:58, Nishanth Aravamudan wrote:
> On 19.02.2014 [13:56:00 -0800], David Rientjes wrote:
> > On Wed, 19 Feb 2014, Nishanth Aravamudan wrote:
> >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 3e953f07edb0..4a44bdc7a8cf 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1855,7 +1855,7 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > > > {
> > > > int i;
> > > >
> > > > - for_each_online_node(i)
> > > > + for_each_node_state(i, N_HIGH_MEMORY)
> > > > if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > > > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > > > else
> > > > @@ -4901,7 +4901,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
> > > >
> > > > pgdat->node_id = nid;
> > > > pgdat->node_start_pfn = node_start_pfn;
> > > > - init_zone_allows_reclaim(nid);
> > > > + if (node_state(nid, N_HIGH_MEMORY))
> > > > + init_zone_allows_reclaim(nid);
> > >
> > > I'm still new to this code, but isn't this saying that if a node has no
> > > memory, then it shouldn't reclaim from any node? But, for a memoryless
> > > node to ensure progress later if reclaim is necessary, it *must* reclaim
> > > from other nodes? So wouldn't we want to set reclaim_nodes() in that
> > > case to node_states[N_MEMORY]?
> > >
> >
> > The only time when pgdat->reclaim_nodes or zone_reclaim_mode matters is
> > when iterating through a zonelist for page allocation and a memoryless
> > node should never appear in a zonelist for page allocation, so this is
> > just preventing setting zone_reclaim_mode unnecessarily because the only
> > nodes with > RECLAIM_DISTANCE to another node are memoryless. So this
> > patch is fine as long as it gets s/N_HIGH_MEMORY/N_MEMORY/.
>
> Ah yes, sorry, I've been looking at this code perhaps too much and going
> a bit cross-eyed!
>
> I wonder if we should also put some comments in? But
>
> Acked-by: Nishanth Aravamudan <[email protected]>
> Tested-by: Nishanth Aravamudan <[email protected]>
Thanks both of you for acks and testing. I will submit the updated patch
and include Andrew to pick it up.
--
Michal Hocko
SUSE Labs
On Wed 19-02-14 08:33:45, Nishanth Aravamudan wrote:
> On 19.02.2014 [08:24:38 -0800], Nishanth Aravamudan wrote:
> > On 18.02.2014 [17:43:38 -0800], David Rientjes wrote:
> > > On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> > >
> > > > How about the following?
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 5de4337..1a0eced 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> > > > int i;
> > > >
> > > > for_each_online_node(i)
> > > > - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > > > + if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> > > > + !NODE_DATA(i)->node_present_pages)
> > > > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > > > else
> > > > zone_reclaim_mode = 1;
> > >
> > > [ I changed the above from NODE_DATA(nid) -> NODE_DATA(i) as you caught
> > > so we're looking at the right code. ]
> > >
> > > That can't be right, it would allow reclaiming from a memoryless node. I
> > > think what you want is
> >
> > Gah, you're right.
> >
> > > for_each_online_node(i) {
> > > if (!node_present_pages(i))
> > > continue;
> > > if (node_distance(nid, i) <= RECLAIM_DISTANCE) {
> > > node_set(i, NODE_DATA(nid)->reclaim_nodes);
> > > continue;
> > > }
> > > /* Always try to reclaim locally */
> > > zone_reclaim_mode = 1;
> > > }
> > >
> > > but we really should be able to do for_each_node_state(i, N_MEMORY) here
> > > and memoryless nodes should already be excluded from that mask.
> >
> > Yep, I found that afterwards, which simplifies the logic. I'll add this
> > to my series :)
>
> In looking at the code, I am wondering about the following:
>
> init_zone_allows_reclaim() is called for each nid from
> free_area_init_node(). Which means that calculate_node_totalpages for
> other "later" nids and check_for_memory() [which sets up the N_MEMORY
> nodemask] hasn't been called yet.
>
> So, would it make sense to pull up the
> /* Any memory on that node */
> if (pgdat->node_present_pages)
> node_set_state(nid, N_MEMORY);
> check_for_memory(pgdat, nid);
> into free_area_init_node()?
Dunno, but it shouldn't be needed because nodes are set to N_MEMORY earlier
in early_calculate_totalpages, as mentioned in the other email.
--
Michal Hocko
SUSE Labs