Michael Bringmann <[email protected]> writes:
> On 05/29/2017 12:32 AM, Michael Ellerman wrote:
>> Reza Arbab <[email protected]> writes:
>>
>>> On Fri, May 26, 2017 at 01:46:58PM +1000, Michael Ellerman wrote:
>>>> Reza Arbab <[email protected]> writes:
>>>>
>>>>> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>>>>>> The commit message for 3af229f2071f says:
>>>>>>
>>>>>> In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>>>>> do not support node hotplug on power in the first place, so the nodes
>>>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> that are online when we come up are the nodes that will be present for
>>>>>> the lifetime of this kernel.
>>>>>>
>>>>>> Is that no longer true?
>>>>>
>>>>> I don't know what the reasoning behind that statement was at the time,
>>>>> but as far as I can tell, the only thing missing for node hotplug now is
>>>>> Balbir's patchset [1]. He fixes the resource issue which motivated
>>>>> 3af229f2071f and reverts it.
>>>>>
>>>>> With that set, I can instantiate a new numa node just by doing
>>>>> add_memory(nid, ...) where nid doesn't currently exist.
>>>>
>>>> But does that actually happen on any real system?
>>>
>>> I don't know if anything currently tries to do this. My interest in
>>> having this working is so that in the future, our coherent gpu memory
>>> could be added as a distinct node by the device driver.
>>
>> Sure. If/when that happens, we would hopefully still have some way to
>> limit the size of the possible map.
>>
>> That would ideally be a firmware property that tells us the maximum
>> number of GPUs that might be hot-added, or we punt and cap it at some
>> "sane" maximum number.
>>
>> But until that happens it's silly to say we can have up to 256 nodes
>> when in practice most of our systems have 8 or less.
>>
>> So I'm still waiting for an explanation from Michael B on how he's
>> seeing this bug in practice.
>
> I already answered this in an earlier message.
Which one? I must have missed it.
> I will give an example.
>
> * Let there be a configuration with nodes (0, 4-5, 8) that boots with 1 VP
> and 10G of memory in a shared processor configuration.
> * At boot time, 4 nodes are put into the possible map by the PowerPC boot
> code.
I'm pretty sure we never add nodes to the possible map, it starts out
with MAX_NUMNODES possible and that's it.
Do you actually see mention of nodes 0 and 8 in the dmesg?
What does it say?
> * Subsequently, the NUMA code executes and puts the 10G memory into nodes
> 4 & 5. No memory goes into Node 0. So we now have 2 nodes in the
> node_online_map.
> * The VP and its threads get assigned to Node 4.
> * Then when 'initmem_init()' in 'powerpc/numa.c' executes the instruction,
> nodes_and(node_possible_map, node_possible_map, node_online_map);
> the content of the node_possible_map is reduced to nodes 4-5.
> * Later on we hot-add 90G of memory to the system. It tries to put the
> memory into nodes 0, 4-5, 8 based on the memory association map. We
> should see memory put into all 4 nodes. However, since we have reduced
> the 'node_possible_map' to only nodes 4 & 5, we can now only put memory
> into 2 of the configured nodes.
Right. So it's not that you're hot adding memory into a previously
unseen node as you implied in earlier mails.
> # We want to be able to put memory into all 4 nodes via hot-add operations,
> not only the nodes that 'survive' boot time initialization. We could
> make a number of changes to ensure that all of the nodes in the initial
> configuration provided by the pHyp can be used, but this one appears to
> be the simplest, only using resources requested by the pHyp at boot --
> even if those resources are not used immediately.
I don't think that's what the patch does. It just marks 32 (!?) nodes as
online. Or if you're talking about reverting 3af229f2071f that leaves
you with 256 possible nodes. Both of which are wasteful.
The right fix is to make sure any nodes which are present at boot remain
in the possible map, even if they don't have memory/CPUs assigned at
boot.
What does your device tree look like? Can you send us the output of:
$ lsprop /proc/device-tree
cheers
On Thu, Jun 01, 2017 at 07:36:31PM +1000, Michael Ellerman wrote:
>I don't think that's what the patch does. It just marks 32 (!?) nodes
>as online. Or if you're talking about reverting 3af229f2071f that
>leaves you with 256 possible nodes. Both of which are wasteful.
To be clear, with Balbir's set the latter is no longer wasteful.
>The right fix is to make sure any nodes which are present at boot
>remain in the possible map, even if they don't have memory/CPUs
>assigned at boot.
I'm still hoping 3af229f2071f could indeed be reverted some day, but
until then the following would follow your suggestion for our GPU nodes.
What do you think?
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -895,6 +895,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
void __init initmem_init(void)
{
int nid, cpu;
+ struct device_node *dn;
max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
max_pfn = max_low_pfn;
@@ -911,6 +912,18 @@ void __init initmem_init(void)
*/
nodes_and(node_possible_map, node_possible_map, node_online_map);
+ /*
+ * Consider an ibm,coherent-device-memory node possible. Even though
+ * it is not online at boot, it may be hotplugged later.
+ */
+ for_each_compatible_node(dn, NULL, "ibm,coherent-device-memory") {
+ nid = of_node_to_nid_single(dn);
+ if (nid < 0)
+ continue;
+
+ node_set(nid, node_possible_map);
+ }
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn;
--
Reza Arbab
On 06/01/2017 04:36 AM, Michael Ellerman wrote:
> Michael Bringmann <[email protected]> writes:
>
>> On 05/29/2017 12:32 AM, Michael Ellerman wrote:
>>> Reza Arbab <[email protected]> writes:
>>>
>>>> On Fri, May 26, 2017 at 01:46:58PM +1000, Michael Ellerman wrote:
>>>>> Reza Arbab <[email protected]> writes:
>>>>>
>>>>>> On Thu, May 25, 2017 at 04:19:53PM +1000, Michael Ellerman wrote:
>>>>>>> The commit message for 3af229f2071f says:
>>>>>>>
>>>>>>> In practice, we never see a system with 256 NUMA nodes, and in fact, we
>>>>>>> do not support node hotplug on power in the first place, so the nodes
>>>>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>>> that are online when we come up are the nodes that will be present for
>>>>>>> the lifetime of this kernel.
>>>>>>>
>>>>>>> Is that no longer true?
>>>>>>
>>>>>> I don't know what the reasoning behind that statement was at the time,
>>>>>> but as far as I can tell, the only thing missing for node hotplug now is
>>>>>> Balbir's patchset [1]. He fixes the resource issue which motivated
>>>>>> 3af229f2071f and reverts it.
>>>>>>
>>>>>> With that set, I can instantiate a new numa node just by doing
>>>>>> add_memory(nid, ...) where nid doesn't currently exist.
>>>>>
>>>>> But does that actually happen on any real system?
>>>>
>>>> I don't know if anything currently tries to do this. My interest in
>>>> having this working is so that in the future, our coherent gpu memory
>>>> could be added as a distinct node by the device driver.
>>>
>>> Sure. If/when that happens, we would hopefully still have some way to
>>> limit the size of the possible map.
>>>
>>> That would ideally be a firmware property that tells us the maximum
>>> number of GPUs that might be hot-added, or we punt and cap it at some
>>> "sane" maximum number.
>>>
>>> But until that happens it's silly to say we can have up to 256 nodes
>>> when in practice most of our systems have 8 or less.
>>>
>>> So I'm still waiting for an explanation from Michael B on how he's
>>> seeing this bug in practice.
>>
>> I already answered this in an earlier message.
>
> Which one? I must have missed it.
>
>> I will give an example.
>>
>> * Let there be a configuration with nodes (0, 4-5, 8) that boots with 1 VP
>> and 10G of memory in a shared processor configuration.
>> * At boot time, 4 nodes are put into the possible map by the PowerPC boot
>> code.
>
> I'm pretty sure we never add nodes to the possible map, it starts out
> with MAX_NUMNODES possible and that's it.
Let me reword that. It enables the nodes in the possible map.
>
> Do you actually see mention of nodes 0 and 8 in the dmesg?
When the 'numa.c' code is built with debug messages, and the system was
given that configuration by pHyp, yes, I did.
>
> What does it say?
The debug message for each core thread would be something like,
removing cpu 64 from node 0
adding cpu 64 to node 8
repeated for all 8 threads of the CPU, and usually with the messages
for all of the CPUs coming out intermixed on the console/dmesg log.
>
>> * Subsequently, the NUMA code executes and puts the 10G memory into nodes
>> 4 & 5. No memory goes into Node 0. So we now have 2 nodes in the
>> node_online_map.
>> * The VP and its threads get assigned to Node 4.
>> * Then when 'initmem_init()' in 'powerpc/numa.c' executes the instruction,
>> nodes_and(node_possible_map, node_possible_map, node_online_map);
>> the content of the node_possible_map is reduced to nodes 4-5.
>> * Later on we hot-add 90G of memory to the system. It tries to put the
>> memory into nodes 0, 4-5, 8 based on the memory association map. We
>> should see memory put into all 4 nodes. However, since we have reduced
>> the 'node_possible_map' to only nodes 4 & 5, we can now only put memory
>> into 2 of the configured nodes.
>
> Right. So it's not that you're hot adding memory into a previously
> unseen node as you implied in earlier mails.
In the sense that the nodes were defined in the device tree, that is correct.
In the sense that those nodes are currently deleted from node_possible_map in
'numa.c' by the instruction 'nodes_and(node_possible_map, node_possible_map,
node_online_map);', the nodes are no longer available to place memory or CPU.
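To illustrate the effect with the node numbers from the example (a sketch only, not code from
the tree):

        /*
         * Before initmem_init() runs, nodes 0, 4, 5 and 8 are possible and
         * only nodes 4 and 5 are online.
         */
        nodes_and(node_possible_map, node_possible_map, node_online_map);
        /*
         * node_possible_map is now just { 4, 5 }, so a later hot-add whose
         * associativity maps to node 0 or 8 has no node to land in.
         */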
>> # We want to be able to put memory into all 4 nodes via hot-add operations,
>> not only the nodes that 'survive' boot time initialization. We could
>> make a number of changes to ensure that all of the nodes in the initial
>> configuration provided by the pHyp can be used, but this one appears to
>> be the simplest, only using resources requested by the pHyp at boot --
>> even if those resources are not used immediately.
>
> I don't think that's what the patch does. It just marks 32 (!?) nodes as
> online. Or if you're talking about reverting 3af229f2071f that leaves
> you with 256 possible nodes. Both of which are wasteful.
>
> The right fix is to make sure any nodes which are present at boot remain
> in the possible map, even if they don't have memory/CPUs assigned at
> boot.
Okay, I can try to insert code that extracts all of the nodes from the
ibm,associativity-lookup-arrays property and merge them with the nodes
put into the online map from the CPUs that were found previously during
boot of the powerpc code.
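Roughly along these lines (a sketch only; the helper name is made up, and it assumes the same
property layout that of_get_assoc_arrays() parses, with min_common_depth selecting the node id
cell within each array):

/* Sketch: mark every node named in ibm,associativity-lookup-arrays as possible. */
static void __init setup_nodes_from_lookup_arrays(void)
{
        struct device_node *memory;
        const __be32 *prop;
        int len, n_arrays, array_sz, i, nid;

        memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
        if (!memory)
                return;

        prop = of_get_property(memory, "ibm,associativity-lookup-arrays", &len);
        if (!prop || len < 2 * sizeof(unsigned int))
                goto out;

        n_arrays = of_read_number(prop++, 1);
        array_sz = of_read_number(prop++, 1);

        if (min_common_depth < 1 || min_common_depth > array_sz)
                goto out;

        for (i = 0; i < n_arrays; i++) {
                /* The node id is the cell min_common_depth - 1 into each array. */
                nid = of_read_number(&prop[i * array_sz + min_common_depth - 1], 1);
                if (nid >= 0 && nid < MAX_NUMNODES)
                        node_set(nid, node_possible_map);
        }

out:
        of_node_put(memory);
}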
> What does your device tree look like? Can you send us the output of:
>
> $ lsprop /proc/device-tree
See attachment 'device-tree.log'. Note though that this boot of my test
system only has 2 nodes, 0 and 2.
>
> cheers
>
--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]
Michael Bringmann <[email protected]> writes:
> On 06/01/2017 04:36 AM, Michael Ellerman wrote:
>> Do you actually see mention of nodes 0 and 8 in the dmesg?
>
> When the 'numa.c' code is built with debug messages, and the system was
> given that configuration by pHyp, yes, I did.
>
>> What does it say?
>
> The debug message for each core thread would be something like,
>
> removing cpu 64 from node 0
> adding cpu 64 to node 8
>
> repeated for all 8 threads of the CPU, and usually with the messages
> for all of the CPUs coming out intermixed on the console/dmesg log.
OK. I meant what do you see at boot.
I'm curious how we're discovering node 0 and 8 at all if neither has any
memory or CPUs assigned at boot.
>> Right. So it's not that you're hot adding memory into a previously
>> unseen node as you implied in earlier mails.
>
> In the sense that the nodes were defined in the device tree, that is correct.
Where are they defined in the device tree? That's what I'm trying to understand.
> In the sense that those nodes are currently deleted from node_possible_map in
>> 'numa.c' by the instruction 'nodes_and(node_possible_map, node_possible_map,
> node_online_map);', the nodes are no longer available to place memory or CPU.
Yeah I understand that part.
> Okay, I can try to insert code that extracts all of the nodes from the
> ibm,associativity-lookup-arrays property and merge them with the nodes
> put into the online map from the CPUs that were found previously during
> boot of the powerpc code.
Hmm, will that work?
Looking at PAPR it's not clear to me that it will work for nodes that
have no memory assigned at boot.
This property is used to duplicate the function of the
“ibm,associativity” property in a /memory node. Each “assigned” LMB
represented has an index valued between 0 and M-1 which is used as an
index into this table to select which associativity list to use for
the LMB. “unassigned” LMBs are place holders for potential DLPAR
additions, for which the associativity list index is meaningless and
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
is given the reserved value of -1. This static property need only
contain values relevant for the LMBs presented in the
“ibm,dynamic-reconfiguration-memory” node; for a dynamic LPAR addition
of a new LMB, the device tree fragment reported by the
ibm,configure-connector RTAS function is a /memory node, with the
inclusion of the “ibm,associativity” device tree property defined in
Section C.6.2.2‚ “Properties of the Children of Root‚” on page 1059.
>> What does your device tree look like? Can you send us the output of:
>>
>> $ lsprop /proc/device-tree
Thanks. I forgot that lsprop will truncate long properties, I actually
wanted to see all of the ibm,dynamic-memory property.
But looking at the code I see the only place we set a nid online is if
there is a CPU assigned to it:
static int __init parse_numa_properties(void)
{
...
for_each_present_cpu(i) {
...
cpu = of_get_cpu_node(i, NULL);
nid = of_node_to_nid_single(cpu);
...
node_set_online(nid);
}
Or for memory nodes (same function):
for_each_node_by_type(memory, "memory") {
...
nid = of_node_to_nid_single(memory);
...
node_set_online(nid);
...
}
Or for entries in ibm,dynamic-memory that are assigned:
static void __init parse_drconf_memory(struct device_node *memory)
{
...
for (; n != 0; --n) {
...
/* skip this block if the reserved bit is set in flags (0x80)
or if the block is not assigned to this partition (0x8) */
if ((drmem.flags & DRCONF_MEM_RESERVED)
|| !(drmem.flags & DRCONF_MEM_ASSIGNED))
continue;
...
do {
...
nid = of_drconf_to_nid_single(&drmem, &aa);
node_set_online(nid);
...
} while (--ranges);
}
}
So I don't see how we can even be aware that nodes 0 and 8 exist at boot
based on that. Maybe there's another path I'm missing
though.
cheers
On 06/06/2017 04:48 AM, Michael Ellerman wrote:
> Michael Bringmann <[email protected]> writes:
>> On 06/01/2017 04:36 AM, Michael Ellerman wrote:
>>> Do you actually see mention of nodes 0 and 8 in the dmesg?
>>
>> When the 'numa.c' code is built with debug messages, and the system was
>> given that configuration by pHyp, yes, I did.
>>
>>> What does it say?
>>
>> The debug message for each core thread would be something like,
>>
>> removing cpu 64 from node 0
>> adding cpu 64 to node 8
>>
>> repeated for all 8 threads of the CPU, and usually with the messages
>> for all of the CPUs coming out intermixed on the console/dmesg log.
>
> OK. I meant what do you see at boot.
Here is an example with nodes 0,2,6,7, node 0 starts out empty:
[ 0.000000] Initmem setup node 0
[ 0.000000] NODE_DATA [mem 0x3bff7d6300-0x3bff7dffff]
[ 0.000000] NODE_DATA(0) on node 7
[ 0.000000] Initmem setup node 2 [mem 0x00000000-0x13ffffffff]
[ 0.000000] NODE_DATA [mem 0x13ffff6300-0x13ffffffff]
[ 0.000000] Initmem setup node 6 [mem 0x1400000000-0x34afffffff]
[ 0.000000] NODE_DATA [mem 0x34afff6300-0x34afffffff]
[ 0.000000] Initmem setup node 7 [mem 0x34b0000000-0x3bffffffff]
[ 0.000000] NODE_DATA [mem 0x3bff7cc600-0x3bff7d62ff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000000000-0x0000003bffffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 2: [mem 0x0000000000000000-0x00000013ffffffff]
[ 0.000000] node 6: [mem 0x0000001400000000-0x00000034afffffff]
[ 0.000000] node 7: [mem 0x00000034b0000000-0x0000003bffffffff]
[ 0.000000] Could not find start_pfn for node 0
[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
[ 0.000000] Initmem setup node 2 [mem 0x0000000000000000-0x00000013ffffffff]
[ 0.000000] Initmem setup node 6 [mem 0x0000001400000000-0x00000034afffffff]
[ 0.000000] Initmem setup node 7 [mem 0x00000034b0000000-0x0000003bffffffff]
[ 0.000000] percpu: Embedded 3 pages/cpu @c000003bf8000000 s155672 r0 d40936 u262144
[ 0.000000] Built 4 zonelists in Node order, mobility grouping on. Total pages: 3928320
and,
[root@ltcalpine2-lp20 ~]# numactl --hardware
available: 4 nodes (0,2,6-7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 32 33 34 35 36 37 38 39 56 57 58 59 60 61 62 63
node 2 size: 81792 MB
node 2 free: 81033 MB
node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 40 41 42 43 44 45 46 47
node 6 size: 133743 MB
node 6 free: 133097 MB
node 7 cpus: 48 49 50 51 52 53 54 55
node 7 size: 29877 MB
node 7 free: 29599 MB
node distances:
node 0 2 6 7
0: 10 40 40 40
2: 40 10 40 40
6: 40 40 10 20
7: 40 40 20 10
[root@ltcalpine2-lp20 ~]#
>
> I'm curious how we're discovering node 0 and 8 at all if neither has any
> memory or CPUs assigned at boot.
Both are in the memory associativity lookup arrays. And we are circling
back to the same question.
>
>>> Right. So it's not that you're hot adding memory into a previously
>>> unseen node as you implied in earlier mails.
>>
>> In the sense that the nodes were defined in the device tree, that is correct.
>
> Where are they defined in the device tree? That's what I'm trying to understand.
The nodes for memory are defined once, in "ibm,associativity-lookup-arrays".
I wish that there was an official set of node properties in the device-tree,
instead of having them be the values of other properties.
>
>> In the sense that those nodes are currently deleted from node_possible_map in
>> 'numa.c' by the instruction 'nodes_and(node_possible_map, node_possible_map,
>> node_online_map);', the nodes are no longer available to place memory or CPU.
>
> Yeah I understand that part.
>
>> Okay, I can try to insert code that extracts all of the nodes from the
>> ibm,associativity-lookup-arrays property and merge them with the nodes
>> put into the online map from the CPUs that were found previously during
>> boot of the powerpc code.
>
> Hmm, will that work?
The nodes are defined in the associativity lookup array, so they have at least
been reserved for us by the pHyp. On the other hand, if we are only to use
nodes that have resources at boot, why are there extra node values specified?
What I am not 100% clear on -- and why I preferred letting all of the nodes
originally defined remain possible for subsequent hot-add operations -- is
whether the nodes to be used for hot-added CPUs would always be a subset of
the nodes used for hot-added memory.
* The hot-added CPUs in Shared CPU configurations may be mapped to nodes by
the value returned to the kernel by the VPHN hcall.
* So far in my tests, this has not been a problem, but I could not be positive
from the PAPR.
>
> Looking at PAPR it's not clear to me that it will work for nodes that
> have no memory assigned at boot.
>
> This property is used to duplicate the function of the
> “ibm,associativity” property in a /memory node. Each “assigned” LMB
> represented has an index valued between 0 and M-1 which is used as an
> index into this table to select which associativity list to use for
> the LMB. “unassigned” LMBs are place holders for potential DLPAR
> additions, for which the associativity list index is meaningless and
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> is given the reserved value of -1. This static property need only
> contain values relevant for the LMBs presented in the
> “ibm,dynamic-reconfiguration-memory” node; for a dynamic LPAR addition
> of a new LMB, the device tree fragment reported by the
> ibm,configure-connector RTAS function is a /memory node, with the
> inclusion of the “ibm,associativity” device tree property defined in
> Section C.6.2.2‚ “Properties of the Children of Root‚” on page 1059.
I don't see any place that builds new /memory nodes in conjunction with
hot-added memory. The code for powerpc treats the definitions provided
by 'ibm,dynamic-reconfiguration-memory' as the primary reference wherever
hot-added memory comes into play. It looks like the '/memory' properties
are a fallback, or are used for 'pseries', according to one comment.
>
>>> What does your device tree look like? Can you send us the output of:
>>>
>>> $ lsprop /proc/device-tree
>
> Thanks. I forgot that lsprop will truncate long properties, I actually
> wanted to see all of the ibm,dynamic-memory property.
>
> But looking at the code I see the only place we set a nid online is if
> there is a CPU assigned to it:
>
> static int __init parse_numa_properties(void)
> {
> ...
> for_each_present_cpu(i) {
> ...
> cpu = of_get_cpu_node(i, NULL);
> nid = of_node_to_nid_single(cpu);
> ...
> node_set_online(nid);
> }
>
> Or for memory nodes (same function):
>
> for_each_node_by_type(memory, "memory") {
> ...
> nid = of_node_to_nid_single(memory);
> ...
> node_set_online(nid);
> ...
> }
>
> Or for entries in ibm,dynamic-memory that are assigned:
>
> static void __init parse_drconf_memory(struct device_node *memory)
> {
> ...
> for (; n != 0; --n) {
> ...
> /* skip this block if the reserved bit is set in flags (0x80)
> or if the block is not assigned to this partition (0x8) */
> if ((drmem.flags & DRCONF_MEM_RESERVED)
> || !(drmem.flags & DRCONF_MEM_ASSIGNED))
> continue;
>
> ...
> do {
> ...
> nid = of_drconf_to_nid_single(&drmem, &aa);
> node_set_online(nid);
> ...
> } while (--ranges);
> }
> }
>
>
> So I don't see how we can even be aware that nodes 0 and 8 exist at boot
> based on that. Maybe there's another path I'm missing
> though.
We don't 'fill in' the nodes, but we are aware that they exist per the 'ibm,associativity-lookup-arrays'
property or the responses provided by the pHyp to the VPHN hcall. We don't associate either of these
resources with the nodes, but does that mean that the nodes do not exist?
The code currently says that only nodes booted with resources "exist", i.e. it can't hot-add new nodes,
but is that just a problem of the kernel implementation? I think so.
However, this is the problem that concerns users running systems that hot-add a lot of resources.
They see the associativity arrays (and 'hypinfo' table internal to the pHyp), and they ask why the
kernel only records new resources into the boot-time nodes, while pHyp appears to distribute across
all of the memory nodes specified to the kernel's LPAR at boot.
I think that all of the nodes specified by the pHyp should exist as far as the kernel is concerned, and
that we are trying to find the best way to make them visible here.
>
> cheers
>
>
--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]
On Thu, 2017-06-01 at 16:33 -0500, Reza Arbab wrote:
> On Thu, Jun 01, 2017 at 07:36:31PM +1000, Michael Ellerman wrote:
> > I don't think that's what the patch does. It just marks 32 (!?) nodes
> > as online. Or if you're talking about reverting 3af229f2071f that
> > leaves you with 256 possible nodes. Both of which are wasteful.
>
> To be clear, with Balbir's set the latter is no longer wasteful.
>
> > The right fix is to make sure any nodes which are present at boot
> > remain in the possible map, even if they don't have memory/CPUs
> > assigned at boot.
>
> I'm still hoping 3af229f2071f could indeed be reverted some day, but
> until then the following would follow your suggestion for our GPU nodes.
> What do you think?
>
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -895,6 +895,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
> void __init initmem_init(void)
> {
> int nid, cpu;
> + struct device_node *dn;
>
> max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
> max_pfn = max_low_pfn;
> @@ -911,6 +912,18 @@ void __init initmem_init(void)
> */
> nodes_and(node_possible_map, node_possible_map, node_online_map);
>
> + /*
> + * Consider an ibm,coherent-device-memory node possible. Even though
> + * it is not online at boot, it may be hotplugged later.
> + */
> + for_each_compatible_node(dn, NULL, "ibm,coherent-device-memory") {
> + nid = of_node_to_nid_single(dn);
> + if (nid < 0)
> + continue;
> +
> + node_set(nid, node_possible_map);
> + }
> +
I think it looks reasonable, although I'd really like to set a limit
in firmware on the number of nodes and fix memcg hotplug correctly in
the medium term.
Balbir Singh.
> for_each_online_node(nid) {
> unsigned long start_pfn, end_pfn;
>
>
Reza Arbab <[email protected]> writes:
> On Thu, Jun 01, 2017 at 07:36:31PM +1000, Michael Ellerman wrote:
>>I don't think that's what the patch does. It just marks 32 (!?) nodes
>>as online. Or if you're talking about reverting 3af229f2071f that
>>leaves you with 256 possible nodes. Both of which are wasteful.
>
> To be clear, with Balbir's set the latter is no longer wasteful.
AFAIK that series has been rejected.
And even so, it only fixes one known case where having a very large
possible map is wasteful, there could be others we haven't found, or
there could be more introduced in future.
So we will always prefer to keep the possible map as small as it can be.
>>The right fix is to make sure any nodes which are present at boot
>>remain in the possible map, even if they don't have memory/CPUs
>>assigned at boot.
>
> I'm still hoping 3af229f2071f could indeed be reverted some day, but
> until then the following would follow your suggestion for our GPU nodes.
> What do you think?
Yeah we can do something like that.
We might want to come up with a generic property, rather than looking
specifically for ibm,coherent-device-memory.
And we might also need a way for skiboot to tell us "in addition to
what is in the device tree at boot, up to N more nodes could be hot
plugged".
cheers
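Something like this could consume such a hint in initmem_init() (a sketch only; the
"ibm,max-possible-nodes" property name is hypothetical, nothing provides it today):

        struct device_node *root = of_find_node_by_path("/");
        u32 max_nodes, nid;

        /* Keep the first N nodes possible if firmware tells us N (hypothetical property). */
        if (root && !of_property_read_u32(root, "ibm,max-possible-nodes", &max_nodes)) {
                for (nid = 0; nid < max_nodes && nid < MAX_NUMNODES; nid++)
                        node_set(nid, node_possible_map);
        }
        of_node_put(root);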
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -895,6 +895,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
> void __init initmem_init(void)
> {
> int nid, cpu;
> + struct device_node *dn;
>
> max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
> max_pfn = max_low_pfn;
> @@ -911,6 +912,18 @@ void __init initmem_init(void)
> */
> nodes_and(node_possible_map, node_possible_map, node_online_map);
>
> + /*
> + * Consider an ibm,coherent-device-memory node possible. Even though
> + * it is not online at boot, it may be hotplugged later.
> + */
> + for_each_compatible_node(dn, NULL, "ibm,coherent-device-memory") {
> + nid = of_node_to_nid_single(dn);
> + if (nid < 0)
> + continue;
> +
> + node_set(nid, node_possible_map);
> + }
> +
> for_each_online_node(nid) {
> unsigned long start_pfn, end_pfn;
>
>
> --
> Reza Arbab
Michael Bringmann <[email protected]> writes:
> On 06/06/2017 04:48 AM, Michael Ellerman wrote:
>> Michael Bringmann <[email protected]> writes:
>>> On 06/01/2017 04:36 AM, Michael Ellerman wrote:
>>>> Do you actually see mention of nodes 0 and 8 in the dmesg?
>>>
>>> When the 'numa.c' code is built with debug messages, and the system was
>>> given that configuration by pHyp, yes, I did.
>>>
>>>> What does it say?
>>>
>>> The debug message for each core thread would be something like,
>>>
>>> removing cpu 64 from node 0
>>> adding cpu 64 to node 8
>>>
>>> repeated for all 8 threads of the CPU, and usually with the messages
>>> for all of the CPUs coming out intermixed on the console/dmesg log.
>>
>> OK. I meant what do you see at boot.
>
> Here is an example with nodes 0,2,6,7, node 0 starts out empty:
>
> [ 0.000000] Initmem setup node 0
> [ 0.000000] NODE_DATA [mem 0x3bff7d6300-0x3bff7dffff]
> [ 0.000000] NODE_DATA(0) on node 7
> [ 0.000000] Initmem setup node 2 [mem 0x00000000-0x13ffffffff]
> [ 0.000000] NODE_DATA [mem 0x13ffff6300-0x13ffffffff]
> [ 0.000000] Initmem setup node 6 [mem 0x1400000000-0x34afffffff]
> [ 0.000000] NODE_DATA [mem 0x34afff6300-0x34afffffff]
> [ 0.000000] Initmem setup node 7 [mem 0x34b0000000-0x3bffffffff]
> [ 0.000000] NODE_DATA [mem 0x3bff7cc600-0x3bff7d62ff]
>
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000000000-0x0000003bffffffff]
> [ 0.000000] DMA32 empty
> [ 0.000000] Normal empty
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 2: [mem 0x0000000000000000-0x00000013ffffffff]
> [ 0.000000] node 6: [mem 0x0000001400000000-0x00000034afffffff]
> [ 0.000000] node 7: [mem 0x00000034b0000000-0x0000003bffffffff]
> [ 0.000000] Could not find start_pfn for node 0
> [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
> [ 0.000000] Initmem setup node 2 [mem 0x0000000000000000-0x00000013ffffffff]
> [ 0.000000] Initmem setup node 6 [mem 0x0000001400000000-0x00000034afffffff]
> [ 0.000000] Initmem setup node 7 [mem 0x00000034b0000000-0x0000003bffffffff]
> [ 0.000000] percpu: Embedded 3 pages/cpu @c000003bf8000000 s155672 r0 d40936 u262144
> [ 0.000000] Built 4 zonelists in Node order, mobility grouping on. Total pages: 3928320
>
> and,
>
> [root@ltcalpine2-lp20 ~]# numactl --hardware
> available: 4 nodes (0,2,6-7)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 16 17 18 19 20 21 22 23 32 33 34 35 36 37 38 39 56 57 58 59 60 61 62 63
> node 2 size: 81792 MB
> node 2 free: 81033 MB
> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 40 41 42 43 44 45 46 47
> node 6 size: 133743 MB
> node 6 free: 133097 MB
> node 7 cpus: 48 49 50 51 52 53 54 55
> node 7 size: 29877 MB
> node 7 free: 29599 MB
> node distances:
> node 0 2 6 7
> 0: 10 40 40 40
> 2: 40 10 40 40
> 6: 40 40 10 20
> 7: 40 40 20 10
> [root@ltcalpine2-lp20 ~]#
What kernel is that running?
And can you show me the full ibm,dynamic-memory and lookup-arrays
properties for that system?
cheers
Here is the information from 2 different kernels. I have not been able to retrieve
the information matching yesterday's attachments, yet, as those dumps were
acquired in April.
Attached please find 2 dumps of similar material from kernels running with my
current patches (Linux 4.4, Linux 4.12).
On 06/07/2017 07:08 AM, Michael Ellerman wrote:
> Michael Bringmann <[email protected]> writes:
>
>> On 06/06/2017 04:48 AM, Michael Ellerman wrote:
>>> Michael Bringmann <[email protected]> writes:
>>>> On 06/01/2017 04:36 AM, Michael Ellerman wrote:
>>>>> Do you actually see mention of nodes 0 and 8 in the dmesg?
>>>>
>>>> When the 'numa.c' code is built with debug messages, and the system was
>>>> given that configuration by pHyp, yes, I did.
>>>>
>>>>> What does it say?
>>>>
>>>> The debug message for each core thread would be something like,
>>>>
>>>> removing cpu 64 from node 0
>>>> adding cpu 64 to node 8
>>>>
>>>> repeated for all 8 threads of the CPU, and usually with the messages
>>>> for all of the CPUs coming out intermixed on the console/dmesg log.
>>>
>>> OK. I meant what do you see at boot.
>>
>> Here is an example with nodes 0,2,6,7, node 0 starts out empty:
>>
>> [ 0.000000] Initmem setup node 0
>> [ 0.000000] NODE_DATA [mem 0x3bff7d6300-0x3bff7dffff]
>> [ 0.000000] NODE_DATA(0) on node 7
>> [ 0.000000] Initmem setup node 2 [mem 0x00000000-0x13ffffffff]
>> [ 0.000000] NODE_DATA [mem 0x13ffff6300-0x13ffffffff]
>> [ 0.000000] Initmem setup node 6 [mem 0x1400000000-0x34afffffff]
>> [ 0.000000] NODE_DATA [mem 0x34afff6300-0x34afffffff]
>> [ 0.000000] Initmem setup node 7 [mem 0x34b0000000-0x3bffffffff]
>> [ 0.000000] NODE_DATA [mem 0x3bff7cc600-0x3bff7d62ff]
>>
>> [ 0.000000] Zone ranges:
>> [ 0.000000] DMA [mem 0x0000000000000000-0x0000003bffffffff]
>> [ 0.000000] DMA32 empty
>> [ 0.000000] Normal empty
>> [ 0.000000] Movable zone start for each node
>> [ 0.000000] Early memory node ranges
>> [ 0.000000] node 2: [mem 0x0000000000000000-0x00000013ffffffff]
>> [ 0.000000] node 6: [mem 0x0000001400000000-0x00000034afffffff]
>> [ 0.000000] node 7: [mem 0x00000034b0000000-0x0000003bffffffff]
>> [ 0.000000] Could not find start_pfn for node 0
>> [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
>> [ 0.000000] Initmem setup node 2 [mem 0x0000000000000000-0x00000013ffffffff]
>> [ 0.000000] Initmem setup node 6 [mem 0x0000001400000000-0x00000034afffffff]
>> [ 0.000000] Initmem setup node 7 [mem 0x00000034b0000000-0x0000003bffffffff]
>> [ 0.000000] percpu: Embedded 3 pages/cpu @c000003bf8000000 s155672 r0 d40936 u262144
>> [ 0.000000] Built 4 zonelists in Node order, mobility grouping on. Total pages: 3928320
>>
>> and,
>>
>> [root@ltcalpine2-lp20 ~]# numactl --hardware
>> available: 4 nodes (0,2,6-7)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 2 cpus: 16 17 18 19 20 21 22 23 32 33 34 35 36 37 38 39 56 57 58 59 60 61 62 63
>> node 2 size: 81792 MB
>> node 2 free: 81033 MB
>> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 40 41 42 43 44 45 46 47
>> node 6 size: 133743 MB
>> node 6 free: 133097 MB
>> node 7 cpus: 48 49 50 51 52 53 54 55
>> node 7 size: 29877 MB
>> node 7 free: 29599 MB
>> node distances:
>> node 0 2 6 7
>> 0: 10 40 40 40
>> 2: 40 10 40 40
>> 6: 40 40 10 20
>> 7: 40 40 20 10
>> [root@ltcalpine2-lp20 ~]#
>
> What kernel is that running?
>
> And can you show me the full ibm,dynamic-memory and lookup-arrays
> properties for that system?
>
> cheers
>
>
--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]
Michael Bringmann <[email protected]> writes:
> Here is the information from 2 different kernels. I have not been able to retrieve
> the information matching yesterday's attachments, yet, as those dumps were
> acquired in April.
>
> Attached please find 2 dumps of similar material from kernels running with my
> current patches (Linux 4.4, Linux 4.12).
OK thanks.
I'd actually like to see the dmesg output from a kernel *without* your
patches.
Looking at the device tree properties:
ltcalpine2-lp9:/proc/device-tree/ibm,dynamic-reconfiguration-memory # lsprop ibm,associativity-lookup-arrays
ibm,associativity-lookup-arrays
00000004 = 4 arrays
00000004 = of 4 entries each
00000000 00000000 00000000 00000000
00000000 00000000 00000001 00000001
00000000 00000003 00000006 00000006
00000000 00000003 00000007 00000007
Which does tell us that nodes 0, 1, 6 and 7 exist.
So your idea of looking at that and setting any node found in there
online should work.
My only worry is that behaviour appears to be completely undocumented in
PAPR, ie. PAPR explicitly says that property only needs to contain
values for LMBs present at boot.
But possibly we can talk to the PowerVM/PAPR guys and have that changed
so that it becomes something we can rely on.
cheers
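As a small standalone illustration of how that dump decodes (on this system the third and
fourth cells of each array agree, so whichever one min_common_depth selects gives nodes 0, 1,
6 and 7; the depth value below is an assumption):

#include <stdio.h>

/* The raw ibm,associativity-lookup-arrays cells from the lsprop output above. */
static const unsigned int cells[] = {
        4, 4,                   /* 4 arrays of 4 entries each */
        0, 0, 0, 0,
        0, 0, 1, 1,
        0, 3, 6, 6,
        0, 3, 7, 7,
};

int main(void)
{
        unsigned int n_arrays = cells[0], array_sz = cells[1];
        const unsigned int *arrays = &cells[2];
        unsigned int min_common_depth = 4;      /* assumed for this system */
        unsigned int i;

        for (i = 0; i < n_arrays; i++)
                printf("array %u -> node %u\n", i,
                       arrays[i * array_sz + min_common_depth - 1]);
        return 0;
}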
On a related note, we are discussing the addition of 2 new device-tree properties
with Pete Heyrman and his fellows that should simplify the determination of the
set of required nodes.
* One property would provide the total/max number of nodes needed by the kernel
on the current hardware.
* A second property would provide the total/max number of nodes that the kernel
could use on any system to which it could be migrated.
These properties aren't available, yet, and it takes time to define new properties
in the PAPR and have them implemented in pHyp and the kernel. As an intermediary
step, the systems which are doing a lot of dynamic hot-add/hot-remove configuration
could provide equivalent information to the PowerPC kernel with a command line
parameter. The 'numa.c' code would then read this value and fill in the necessary
entries in the 'node_possible_map'.
Would you foresee any problems with using such a feature?
Thanks.
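For the interim command-line idea, a minimal sketch of what the 'numa.c' side might look like
(the parameter name and helpers are hypothetical):

/*
 * Sketch: accept "numa_possible_nodes=N" on the kernel command line (the
 * name is made up) and keep the first N nodes possible even if they have
 * no memory or CPUs at boot.
 */
static unsigned int cmdline_possible_nodes;

static int __init parse_possible_nodes(char *str)
{
        return kstrtouint(str, 0, &cmdline_possible_nodes);
}
early_param("numa_possible_nodes", parse_possible_nodes);

/* Called from initmem_init(), after the nodes_and() that shrinks the map. */
static void __init apply_cmdline_possible_nodes(void)
{
        unsigned int nid;

        for (nid = 0; nid < cmdline_possible_nodes && nid < MAX_NUMNODES; nid++)
                node_set(nid, node_possible_map);
}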
On 06/13/2017 05:45 AM, Michael Ellerman wrote:
> Michael Bringmann <[email protected]> writes:
>
>> Here is the information from 2 different kernels. I have not been able to retrieve
>> the information matching yesterday's attachments, yet, as those dumps were
>> acquired in April.
>>
>> Attached please find 2 dumps of similar material from kernels running with my
>> current patches (Linux 4.4, Linux 4.12).
>
> OK thanks.
>
> I'd actually like to see the dmesg output from a kernel *without* your
> patches.
>
> Looking at the device tree properties:
>
> ltcalpine2-lp9:/proc/device-tree/ibm,dynamic-reconfiguration-memory # lsprop ibm,associativity-lookup-arrays
> ibm,associativity-lookup-arrays
> 00000004 = 4 arrays
> 00000004 = of 4 entries each
> 00000000 00000000 00000000 00000000
> 00000000 00000000 00000001 00000001
> 00000000 00000003 00000006 00000006
> 00000000 00000003 00000007 00000007
>
>
> Which does tell us that nodes 0, 1, 6 and 7 exist.
>
> So your idea of looking at that and setting any node found in there
> online should work.
>
> My only worry is that behaviour appears to be completely undocumented in
> PAPR, ie. PAPR explicitly says that property only needs to contain
> values for LMBs present at boot.
>
> But possibly we can talk to the PowerVM/PAPR guys and have that changed
> so that it becomes something we can rely on.
>
> cheers
>
>
--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]
On Wed, Jun 14, 2017 at 3:25 PM, Balbir Singh <[email protected]> wrote:
>
>
> On Wed, Jun 14, 2017 at 8:21 AM, Michael Bringmann <[email protected]>
> wrote:
>>
>> On a related note, we are discussing the addition of 2 new device-tree
>> properties
>> with Pete Heyrman and his fellows that should simplify the determination
>> of the
>> set of required nodes.
>>
>> * One property would provide the total/max number of nodes needed by the
>> kernel
>> on the current hardware.
>
>
Yes, that would be nice to have
>
>>
>> * A second property would provide the total/max number of nodes that the
>> kernel
>> could use on any system to which it could be migrated.
>>
>
Not sure about this one, are you suggesting more memory can be added
depending on the migration target?
>
>
>>
>> These properties aren't available, yet, and it takes time to define new
>> properties
>> in the PAPR and have them implemented in pHyp and the kernel. As an
>> intermediary
>> step, the systems which are doing a lot of dynamic hot-add/hot-remove
>> configuration
>> could provide equivalent information to the PowerPC kernel with a command
>> line
>> parameter. The 'numa.c' code would then read this value and fill in the
>> necessary
>> entries in the 'node_possible_map'.
>>
>> Would you foresee any problems with using such a feature?
>
>
Sorry my mailer goofed up, resending
Balbir Singh
Hello:
On 06/14/2017 12:27 AM, Balbir Singh wrote:
> On Wed, Jun 14, 2017 at 3:25 PM, Balbir Singh <[email protected]> wrote:
>>
>>
>> On Wed, Jun 14, 2017 at 8:21 AM, Michael Bringmann <[email protected]>
>> wrote:
>>>
>>> On a related note, we are discussing the addition of 2 new device-tree
>>> properties
>>> with Pete Heyrman and his fellows that should simplify the determination
>>> of the
>>> set of required nodes.
>>>
>>> * One property would provide the total/max number of nodes needed by the
>>> kernel
>>> on the current hardware.
>>
>>
>
> Yes, that would be nice to have
>
>>
>>>
>>> * A second property would provide the total/max number of nodes that the
>>> kernel
>>> could use on any system to which it could be migrated.
>>>
>>
>
> Not sure about this one, are you suggesting more memory can be added
> depending on the migration target?
We would use only one of these numbers to allocate nodes. I have only been
on the periphery of the discussions, so I cannot communicate the full
reasoning as to why both measures would be needed. We would like to have
the first number for node allocation/initialization, but if only the second
value were provided, we would likely need to use it.
>>
>>
>>>
>>> These properties aren't available, yet, and it takes time to define new
>>> properties
>>> in the PAPR and have them implemented in pHyp and the kernel. As an
>>> intermediary
>>> step, the systems which are doing a lot of dynamic hot-add/hot-remove
>>> configuration
>>> could provide equivalent information to the PowerPC kernel with a command
>>> line
>>> parameter. The 'numa.c' code would then read this value and fill in the
>>> necessary
>>> entries in the 'node_possible_map'.
>>>
>>> Would you foresee any problems with using such a feature?
>>
>>
>
> Sorry my mailer goofed up, resending
>
> Balbir Singh
>
Thanks.
--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]