2022-03-29 14:09:03

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH] mm: migrate: set demotion targets differently

Hi Jagdish,

On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
> The current implementation to identify the demotion
> targets limits some of the opportunities to share
> the demotion targets between multiple source nodes.
>
> Implement a logic to identify the loop in the demotion
> targets such that all the possibilities of demotion can
> be utilized. Don't share the used targets between all
> the nodes, instead create the used targets from scratch
> for each individual node based on for what all node this
> node is a demotion target. This helps to share the demotion
> targets without missing any possible way of demotion.
>
> e.g. with below NUMA topology, where node 0 & 1 are
> cpu + dram nodes, node 2 & 3 are equally slower memory
> only nodes, and node 4 is slowest memory only node,
>
> available: 5 nodes (0-4)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus: 2 3
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus:
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node 4 cpus:
> node 4 size: n MB
> node 4 free: n MB
> node distances:
> node 0 1 2 3 4
> 0: 10 20 40 40 80
> 1: 20 10 40 40 80
> 2: 40 40 10 40 80
> 3: 40 40 40 10 80
> 4: 80 80 80 80 10
>
> The existing implementation gives below demotion targets,
>
> node demotion_target
> 0 3, 2
> 1 4
> 2 X
> 3 X
> 4 X
>
> With this patch applied, below are the demotion targets,
>
> node demotion_target
> 0 3, 2
> 1 3, 2
> 2 3
> 3 4
> 4 X

Node 2 and node 3 both are slow memory and have same distance, why node
2 should demote cold memory to node 3? They should have the same target
demotion node 4, which is the slowest memory node, right?

>
> e.g. with below NUMA topology, where node 0, 1 & 2 are
> cpu + dram nodes and node 3 is slow memory node,
>
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus: 2 3
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 4 5
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node distances:
> node 0 1 2 3
> 0: 10 20 20 40
> 1: 20 10 20 40
> 2: 20 20 10 40
> 3: 40 40 40 10
>
> The existing implementation gives below demotion targets,
>
> node demotion_target
> 0 3
> 1 X
> 2 X
> 3 X
>
> With this patch applied, below are the demotion targets,
>
> node demotion_target
> 0 3
> 1 3
> 2 3
> 3 X

Sounds reasonable.

>
> with below NUMA topology, where node 0 & 2 are cpu + dram
> nodes and node 1 & 3 are slow memory nodes,
>
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node distances:
> node 0 1 2 3
> 0: 10 40 20 80
> 1: 40 10 80 80
> 2: 20 80 10 40
> 3: 80 80 40 10
>
> The existing implementation gives below demotion targets,
>
> node demotion_target
> 0 3
> 1 X
> 2 3
> 3 X

If I understand correctly, this is not true. The demotion route should
be as below with existing implementation:
node 0 ---> node 1
node 1 ---> X
node 2 ---> node 3
node 3 ---> X

>
> With this patch applied, below are the demotion targets,
>
> node demotion_target
> 0 1
> 1 3
> 2 3
> 3 X
>
> As it can be seen above, node 3 can be demotion target for node
> 1 but existing implementation doesn't configure it that way. It
> is better to move pages from node 1 to node 3 instead of moving
> it from node 1 to swap.

Which means node 3 is the slowest memory node?

>
> Signed-off-by: Jagdish Gediya <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------
> 1 file changed, 41 insertions(+), 34 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3d60823afd2d..7ec8d934e706 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used,
> */
> static void __set_migration_target_nodes(void)
> {
> - nodemask_t next_pass = NODE_MASK_NONE;
> - nodemask_t this_pass = NODE_MASK_NONE;
> nodemask_t used_targets = NODE_MASK_NONE;
> int node, best_distance;
> + nodemask_t *src_nodes;
> +
> + src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL);
> + if (!src_nodes)
> + return;
>
> /*
> * Avoid any oddities like cycles that could occur
> @@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void)
> */
> disable_all_migrate_targets();
>
> - /*
> - * Allocations go close to CPUs, first. Assume that
> - * the migration path starts at the nodes with CPUs.
> - */
> - next_pass = node_states[N_CPU];
> -again:
> - this_pass = next_pass;
> - next_pass = NODE_MASK_NONE;
> - /*
> - * To avoid cycles in the migration "graph", ensure
> - * that migration sources are not future targets by
> - * setting them in 'used_targets'. Do this only
> - * once per pass so that multiple source nodes can
> - * share a target node.
> - *
> - * 'used_targets' will become unavailable in future
> - * passes. This limits some opportunities for
> - * multiple source nodes to share a destination.
> - */
> - nodes_or(used_targets, used_targets, this_pass);
> + for_each_online_node(node) {
> + int tmp_node;
>
> - for_each_node_mask(node, this_pass) {
> best_distance = -1;
> + used_targets = NODE_MASK_NONE;
> +
> + /*
> + * Avoid adding same node as the demotion target.
> + */
> + node_set(node, used_targets);
> +
> + /*
> + * Add CPU NUMA nodes to the used target list so that it
> + * won't be considered a demotion target.
> + */
> + nodes_or(used_targets, used_targets, node_states[N_CPU]);
> +
> + /*
> + * Add all nodes that has appeared as source node of demotion
> + * for this target node.
> + *
> + * To avoid cycles in the migration "graph", ensure
> + * that migration sources are not future targets by
> + * setting them in 'used_targets'.
> + */
> + for_each_node_mask(tmp_node, src_nodes[node])
> + nodes_or(used_targets, used_targets, src_nodes[tmp_node]);
> +
> + /*
> + * Now update the demotion src nodes with other nodes in graph
> + * which got computed above.
> + */
> + nodes_or(src_nodes[node], src_nodes[node], used_targets);
>
> /*
> * Try to set up the migration path for the node, and the target
> @@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void)
> best_distance = node_distance(node, target_node);
>
> /*
> - * Visit targets from this pass in the next pass.
> - * Eventually, every node will have been part of
> - * a pass, and will become set in 'used_targets'.
> + * Add this node in the src_nodes list so that we can
> + * detect the looping.
> */
> - node_set(target_node, next_pass);
> + node_set(node, src_nodes[target_node]);
> } while (1);
> }
> - /*
> - * 'next_pass' contains nodes which became migration
> - * targets in this pass. Make additional passes until
> - * no more migrations targets are available.
> - */
> - if (!nodes_empty(next_pass))
> - goto again;
> +
> + kfree(src_nodes);
> }
>
> /*


2022-03-30 05:23:11

by Jagdish Gediya

[permalink] [raw]
Subject: Re: [PATCH] mm: migrate: set demotion targets differently

On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote:
Hi Baolin,
> Hi Jagdish,
>
> On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
> > The current implementation to identify the demotion
> > targets limits some of the opportunities to share
> > the demotion targets between multiple source nodes.
> >
> > Implement a logic to identify the loop in the demotion
> > targets such that all the possibilities of demotion can
> > be utilized. Don't share the used targets between all
> > the nodes, instead create the used targets from scratch
> > for each individual node based on for what all node this
> > node is a demotion target. This helps to share the demotion
> > targets without missing any possible way of demotion.
> >
> > e.g. with below NUMA topology, where node 0 & 1 are
> > cpu + dram nodes, node 2 & 3 are equally slower memory
> > only nodes, and node 4 is slowest memory only node,
> >
> > available: 5 nodes (0-4)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus: 2 3
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus:
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node 4 cpus:
> > node 4 size: n MB
> > node 4 free: n MB
> > node distances:
> > node 0 1 2 3 4
> > 0: 10 20 40 40 80
> > 1: 20 10 40 40 80
> > 2: 40 40 10 40 80
> > 3: 40 40 40 10 80
> > 4: 80 80 80 80 10
> >
> > The existing implementation gives below demotion targets,
> >
> > node demotion_target
> > 0 3, 2
> > 1 4
> > 2 X
> > 3 X
> > 4 X
> >
> > With this patch applied, below are the demotion targets,
> >
> > node demotion_target
> > 0 3, 2
> > 1 3, 2
> > 2 3
> > 3 4
> > 4 X
>
> Node 2 and node 3 both are slow memory and have same distance, why node 2
> should demote cold memory to node 3? They should have the same target
> demotion node 4, which is the slowest memory node, right?
>
Current demotion target finding algorithm works based on best distance, as distance between node 2 & 3 is 40 and distance between node 2 & 4 is 80, node 2 demotes to node 3.
> >
> > e.g. with below NUMA topology, where node 0, 1 & 2 are
> > cpu + dram nodes and node 3 is slow memory node,
> >
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus: 2 3
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 4 5
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node distances:
> > node 0 1 2 3
> > 0: 10 20 20 40
> > 1: 20 10 20 40
> > 2: 20 20 10 40
> > 3: 40 40 40 10
> >
> > The existing implementation gives below demotion targets,
> >
> > node demotion_target
> > 0 3
> > 1 X
> > 2 X
> > 3 X
> >
> > With this patch applied, below are the demotion targets,
> >
> > node demotion_target
> > 0 3
> > 1 3
> > 2 3
> > 3 X
>
> Sounds reasonable.
>
> >
> > with below NUMA topology, where node 0 & 2 are cpu + dram
> > nodes and node 1 & 3 are slow memory nodes,
> >
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus:
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 2 3
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node distances:
> > node 0 1 2 3
> > 0: 10 40 20 80
> > 1: 40 10 80 80
> > 2: 20 80 10 40
> > 3: 80 80 40 10
> >
> > The existing implementation gives below demotion targets,
> >
> > node demotion_target
> > 0 3
> > 1 X
> > 2 3
> > 3 X
>
> If I understand correctly, this is not true. The demotion route should be as
> below with existing implementation:
> node 0 ---> node 1
> node 1 ---> X
> node 2 ---> node 3
> node 3 ---> X
>
Its typo, It should be 0 -> 1, Will correct it in v2.
> >
> > With this patch applied, below are the demotion targets,
> >
> > node demotion_target
> > 0 1
> > 1 3
> > 2 3
> > 3 X
> >
> > As it can be seen above, node 3 can be demotion target for node
> > 1 but existing implementation doesn't configure it that way. It
> > is better to move pages from node 1 to node 3 instead of moving
> > it from node 1 to swap.
>
> Which means node 3 is the slowest memory node?
>
Node 1 and 3 are equally slower but 1 is near to 0 and 3 is near to 2. Basically you can think of it like node 1 is slow memory logical node near to node 0 and node 3 is slow memory logical node near to node 2.
> >
> > Signed-off-by: Jagdish Gediya <[email protected]>
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > ---
> > mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------
> > 1 file changed, 41 insertions(+), 34 deletions(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 3d60823afd2d..7ec8d934e706 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used,
> > */
> > static void __set_migration_target_nodes(void)
> > {
> > - nodemask_t next_pass = NODE_MASK_NONE;
> > - nodemask_t this_pass = NODE_MASK_NONE;
> > nodemask_t used_targets = NODE_MASK_NONE;
> > int node, best_distance;
> > + nodemask_t *src_nodes;
> > +
> > + src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL);
> > + if (!src_nodes)
> > + return;
> > /*
> > * Avoid any oddities like cycles that could occur
> > @@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void)
> > */
> > disable_all_migrate_targets();
> > - /*
> > - * Allocations go close to CPUs, first. Assume that
> > - * the migration path starts at the nodes with CPUs.
> > - */
> > - next_pass = node_states[N_CPU];
> > -again:
> > - this_pass = next_pass;
> > - next_pass = NODE_MASK_NONE;
> > - /*
> > - * To avoid cycles in the migration "graph", ensure
> > - * that migration sources are not future targets by
> > - * setting them in 'used_targets'. Do this only
> > - * once per pass so that multiple source nodes can
> > - * share a target node.
> > - *
> > - * 'used_targets' will become unavailable in future
> > - * passes. This limits some opportunities for
> > - * multiple source nodes to share a destination.
> > - */
> > - nodes_or(used_targets, used_targets, this_pass);
> > + for_each_online_node(node) {
> > + int tmp_node;
> > - for_each_node_mask(node, this_pass) {
> > best_distance = -1;
> > + used_targets = NODE_MASK_NONE;
> > +
> > + /*
> > + * Avoid adding same node as the demotion target.
> > + */
> > + node_set(node, used_targets);
> > +
> > + /*
> > + * Add CPU NUMA nodes to the used target list so that it
> > + * won't be considered a demotion target.
> > + */
> > + nodes_or(used_targets, used_targets, node_states[N_CPU]);
> > +
> > + /*
> > + * Add all nodes that has appeared as source node of demotion
> > + * for this target node.
> > + *
> > + * To avoid cycles in the migration "graph", ensure
> > + * that migration sources are not future targets by
> > + * setting them in 'used_targets'.
> > + */
> > + for_each_node_mask(tmp_node, src_nodes[node])
> > + nodes_or(used_targets, used_targets, src_nodes[tmp_node]);
> > +
> > + /*
> > + * Now update the demotion src nodes with other nodes in graph
> > + * which got computed above.
> > + */
> > + nodes_or(src_nodes[node], src_nodes[node], used_targets);
> > /*
> > * Try to set up the migration path for the node, and the target
> > @@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void)
> > best_distance = node_distance(node, target_node);
> > /*
> > - * Visit targets from this pass in the next pass.
> > - * Eventually, every node will have been part of
> > - * a pass, and will become set in 'used_targets'.
> > + * Add this node in the src_nodes list so that we can
> > + * detect the looping.
> > */
> > - node_set(target_node, next_pass);
> > + node_set(node, src_nodes[target_node]);
> > } while (1);
> > }
> > - /*
> > - * 'next_pass' contains nodes which became migration
> > - * targets in this pass. Make additional passes until
> > - * no more migrations targets are available.
> > - */
> > - if (!nodes_empty(next_pass))
> > - goto again;
> > +
> > + kfree(src_nodes);
> > }
> > /*

2022-03-30 17:33:47

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH] mm: migrate: set demotion targets differently



On 3/29/2022 10:04 PM, Jagdish Gediya wrote:
> On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote:
> Hi Baolin,
>> Hi Jagdish,
>>
>> On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
>>> The current implementation to identify the demotion
>>> targets limits some of the opportunities to share
>>> the demotion targets between multiple source nodes.
>>>
>>> Implement a logic to identify the loop in the demotion
>>> targets such that all the possibilities of demotion can
>>> be utilized. Don't share the used targets between all
>>> the nodes, instead create the used targets from scratch
>>> for each individual node based on for what all node this
>>> node is a demotion target. This helps to share the demotion
>>> targets without missing any possible way of demotion.
>>>
>>> e.g. with below NUMA topology, where node 0 & 1 are
>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>> only nodes, and node 4 is slowest memory only node,
>>>
>>> available: 5 nodes (0-4)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus: 2 3
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus:
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node 4 cpus:
>>> node 4 size: n MB
>>> node 4 free: n MB
>>> node distances:
>>> node 0 1 2 3 4
>>> 0: 10 20 40 40 80
>>> 1: 20 10 40 40 80
>>> 2: 40 40 10 40 80
>>> 3: 40 40 40 10 80
>>> 4: 80 80 80 80 10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node demotion_target
>>> 0 3, 2
>>> 1 4
>>> 2 X
>>> 3 X
>>> 4 X
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node demotion_target
>>> 0 3, 2
>>> 1 3, 2
>>> 2 3
>>> 3 4
>>> 4 X
>>
>> Node 2 and node 3 both are slow memory and have same distance, why node 2
>> should demote cold memory to node 3? They should have the same target
>> demotion node 4, which is the slowest memory node, right?
>>
> Current demotion target finding algorithm works based on best distance, as distance between node 2 & 3 is 40 and distance between node 2 & 4 is 80, node 2 demotes to node 3.

If node 2 can demote to node 3, which means node 3's memory is colder
than node 2, right? The accessing time of node 3 should be larger than
node 2, then we can demote colder memory to node 3 from node 2.

But node 2 and node 3 are same memory type and have same distance, the
accessing time of node 2 and node 3 should be same too, so why add so
many page migration between node 2 and node 3? I'm still not sure the
benefits.

Huang Ying and Dave, how do you think about this demotion targets?

>>>
>>> e.g. with below NUMA topology, where node 0, 1 & 2 are
>>> cpu + dram nodes and node 3 is slow memory node,
>>>
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus: 2 3
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus: 4 5
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node distances:
>>> node 0 1 2 3
>>> 0: 10 20 20 40
>>> 1: 20 10 20 40
>>> 2: 20 20 10 40
>>> 3: 40 40 40 10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node demotion_target
>>> 0 3
>>> 1 X
>>> 2 X
>>> 3 X
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node demotion_target
>>> 0 3
>>> 1 3
>>> 2 3
>>> 3 X
>>
>> Sounds reasonable.
>>
>>>
>>> with below NUMA topology, where node 0 & 2 are cpu + dram
>>> nodes and node 1 & 3 are slow memory nodes,
>>>
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus:
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus: 2 3
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node distances:
>>> node 0 1 2 3
>>> 0: 10 40 20 80
>>> 1: 40 10 80 80
>>> 2: 20 80 10 40
>>> 3: 80 80 40 10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node demotion_target
>>> 0 3
>>> 1 X
>>> 2 3
>>> 3 X
>>
>> If I understand correctly, this is not true. The demotion route should be as
>> below with existing implementation:
>> node 0 ---> node 1
>> node 1 ---> X
>> node 2 ---> node 3
>> node 3 ---> X
>>
> Its typo, It should be 0 -> 1, Will correct it in v2.
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node demotion_target
>>> 0 1
>>> 1 3
>>> 2 3
>>> 3 X
>>>
>>> As it can be seen above, node 3 can be demotion target for node
>>> 1 but existing implementation doesn't configure it that way. It
>>> is better to move pages from node 1 to node 3 instead of moving
>>> it from node 1 to swap.
>>
>> Which means node 3 is the slowest memory node?
>>
> Node 1 and 3 are equally slower but 1 is near to 0 and 3 is near to 2. Basically you can think of it like node 1 is slow memory logical node near to node 0 and node 3 is slow memory logical node near to node 2.

OK.