LinuxLists.cc - [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

2022-01-06 23:21:07

Subject: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

From: Long Li <[email protected]>

When the kernel boots with parameters restricting the number of cpus or NUMA
nodes, e.g. maxcpus=X or numa=off, the vPCI driver should only set to the NUMA
node to a value that is valid in the current running kernel.

Signed-off-by: Long Li <[email protected]>
---
drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index fc1a29acadbb..8686343eff4c 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
if (!hv_dev)
continue;

- if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
- set_dev_node(&dev->dev, hv_dev->desc.virtual_numa_node);
+ if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
+ int cpu;
+ bool found_node = false;
+
+ for_each_possible_cpu(cpu)
+ if (cpu_to_node(cpu) ==
+ hv_dev->desc.virtual_numa_node) {
+ found_node = true;
+ break;
+ }
+
+ if (found_node)
+ set_dev_node(&dev->dev,
+ hv_dev->desc.virtual_numa_node);
+ }

put_pcichild(hv_dev);
}
--
2.25.1

2022-01-07 15:34:21

by Michael Kelley (LINUX)

[permalink] [raw]

Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

From: [email protected] <[email protected]> Sent: Thursday, January 6, 2022 3:20 PM
>
> When the kernel boots with parameters restricting the number of cpus or NUMA
> nodes, e.g. maxcpus=X or numa=off, the vPCI driver should only set to the NUMA
> node to a value that is valid in the current running kernel.
>
> Signed-off-by: Long Li <[email protected]>
> ---
> drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-
> hyperv.c
> index fc1a29acadbb..8686343eff4c 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct
> hv_pcibus_device *hbus)
> if (!hv_dev)
> continue;
>
> - if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> - set_dev_node(&dev->dev, hv_dev->desc.virtual_numa_node);
> + if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> + int cpu;
> + bool found_node = false;
> +
> + for_each_possible_cpu(cpu)
> + if (cpu_to_node(cpu) ==
> + hv_dev->desc.virtual_numa_node) {
> + found_node = true;
> + break;
> + }
> +
> + if (found_node)
> + set_dev_node(&dev->dev,
> + hv_dev->desc.virtual_numa_node);
> + }

I'm wondering about this approach vs. just comparing against nr_node_ids.
Comparing against nr_node_ids would handle the case of numa=off on the
kernel boot line, or a kernel built with CONFIG_NUMA=n, or the use of
numa=fake. Your approach is also affected by which CPUs are online,
since cpu_to_node() references percpu data. It would seem to produce
more variable results since CPUs can go online and offline while the VM
is running. If a network VF device was removed and re-added, the results
of your algorithm could be different for the re-add, depending on which
CPUs were online at the time.

My impression (which may be incorrect) is that the device numa_node
is primarily to allow the driver to allocate memory from the closest
NUMA node, and such memory allocations don't really need to be
affected by which CPUs are online.

Thoughts?

>
> put_pcichild(hv_dev);
> }
> --
> 2.25.1

2022-01-07 20:31:47

by Long Li

[permalink] [raw]

Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

> Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> with parameters affecting NUMA topology
>
> From: [email protected] <[email protected]> Sent:
> Thursday, January 6, 2022 3:20 PM
> >
> > When the kernel boots with parameters restricting the number of cpus
> > or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver should only
> > set to the NUMA node to a value that is valid in the current running kernel.
> >
> > Signed-off-by: Long Li <[email protected]>
> > ---
> > drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > 1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci- hyperv.c index
> > fc1a29acadbb..8686343eff4c 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct
> > hv_pcibus_device *hbus)
> > if (!hv_dev)
> > continue;
> >
> > - if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > - set_dev_node(&dev->dev, hv_dev-
> >desc.virtual_numa_node);
> > + if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > + int cpu;
> > + bool found_node = false;
> > +
> > + for_each_possible_cpu(cpu)
> > + if (cpu_to_node(cpu) ==
> > + hv_dev->desc.virtual_numa_node) {
> > + found_node = true;
> > + break;
> > + }
> > +
> > + if (found_node)
> > + set_dev_node(&dev->dev,
> > + hv_dev-
> >desc.virtual_numa_node);
> > + }
>
> I'm wondering about this approach vs. just comparing against nr_node_ids.

I was trying to fix this by comparing with nr_node_ids. This worked for numa=off, but it didn't work with maxcpus=X.

maxcpus=X is commonly used in kdump kernels. In this config, the memory system is initialized in a way that only the NUMA nodes within maxcpus are setup and can be used by the drivers.

> Comparing against nr_node_ids would handle the case of numa=off on the
> kernel boot line, or a kernel built with CONFIG_NUMA=n, or the use of
> numa=fake. Your approach is also affected by which CPUs are online, since
> cpu_to_node() references percpu data. It would seem to produce more
> variable results since CPUs can go online and offline while the VM is running.
> If a network VF device was removed and re-added, the results of your
> algorithm could be different for the re-add, depending on which CPUs were
> online at the time.
>
> My impression (which may be incorrect) is that the device numa_node is
> primarily to allow the driver to allocate memory from the closest NUMA node,
> and such memory allocations don't really need to be affected by which CPUs
> are online.

Yes, this is the reason I'm using for_each_possible_cpu(). Even if some CPUs are not online, the memory system is setup in a way that allow driver to allocate memory on that NUMA node. The algorithm guarantees the value of NUMA node is valid when calling set_dev_node().

>
> Thoughts?
>
> >
> > put_pcichild(hv_dev);
> > }
> > --
> > 2.25.1

2022-01-10 16:12:41

by Michael Kelley (LINUX)

[permalink] [raw]

Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

From: Long Li <[email protected]> Sent: Friday, January 7, 2022 12:32 PM
> >
> > From: [email protected] <[email protected]> Sent:
> > Thursday, January 6, 2022 3:20 PM
> > >
> > > When the kernel boots with parameters restricting the number of cpus
> > > or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver should only
> > > set to the NUMA node to a value that is valid in the current running kernel.
> > >
> > > Signed-off-by: Long Li <[email protected]>
> > > ---
> > > drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > 1 file changed, 15 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > b/drivers/pci/controller/pci- hyperv.c index
> > > fc1a29acadbb..8686343eff4c 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct
> > > hv_pcibus_device *hbus)
> > > if (!hv_dev)
> > > continue;
> > >
> > > - if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > - set_dev_node(&dev->dev, hv_dev->desc.virtual_numa_node);
> > > + if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > + int cpu;
> > > + bool found_node = false;
> > > +
> > > + for_each_possible_cpu(cpu)
> > > + if (cpu_to_node(cpu) ==
> > > + hv_dev->desc.virtual_numa_node) {
> > > + found_node = true;
> > > + break;
> > > + }
> > > +
> > > + if (found_node)
> > > + set_dev_node(&dev->dev,
> > > + hv_dev->desc.virtual_numa_node);
> > > + }
> >
> > I'm wondering about this approach vs. just comparing against nr_node_ids.
>
> I was trying to fix this by comparing with nr_node_ids. This worked for
> numa=off, but it didn't work with maxcpus=X.
>
> maxcpus=X is commonly used in kdump kernels. In this config, the memory
> system is initialized in a way that only the NUMA nodes within maxcpus are
> setup and can be used by the drivers.

In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA
nodes, the number of NUMA nodes configured in the kernel is not
affected by maxcpus= on the kernel boot line. This VM has 48 vCPUs
and 2 NUMA nodes, and is Generation 2. Even with maxcpus=4 or
maxcpus=1, these lines are output during boot:

[ 0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
[ 0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]

and

[ 0.280039] Initmem setup node 0 [mem 0x0000000000001000-0x0000007edfffffff]
[ 0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-0x000000fcdfffffff]

It's perfectly legit to have a NUMA node with memory but no CPUs. The
memory assigned to the NUMA node is determined by the ACPI SRAT. So
I'm wondering what is causing the kdump issue you see. Or maybe the
behavior of older kernels is different.

>
> > Comparing against nr_node_ids would handle the case of numa=off on the
> > kernel boot line, or a kernel built with CONFIG_NUMA=n, or the use of
> > numa=fake. Your approach is also affected by which CPUs are online, since
> > cpu_to_node() references percpu data. It would seem to produce more
> > variable results since CPUs can go online and offline while the VM is running.
> > If a network VF device was removed and re-added, the results of your
> > algorithm could be different for the re-add, depending on which CPUs were
> > online at the time.
> >
> > My impression (which may be incorrect) is that the device numa_node is
> > primarily to allow the driver to allocate memory from the closest NUMA node,
> > and such memory allocations don't really need to be affected by which CPUs
> > are online.
>
> Yes, this is the reason I'm using for_each_possible_cpu(). Even if some CPUs
> are not online, the memory system is setup in a way that allow driver to
> allocate memory on that NUMA node. The algorithm guarantees the value of
> NUMA node is valid when calling set_dev_node().
>

I'm thinking the code here should check against nr_node_ids, to catch the
numa=off or CONFIG_NUMA=n cases. Then could use either node_online()
or numa_map_to_online_node(), but I'm still curious as to how we would
get an offline NUMA node given how Hyper-V normally sets up a VM.

NUMA nodes only transition from online to offline if there are no CPUs
or memory assigned. That can happen if the CPUs are taken offline (or
never came online) and if the memory is hot-removed. We don't currently
support hot-remove memory in Hyper-V VMs, though there has been
some discussion about adding it. I'm not sure how that case is supposed
to be handled if the NUMA node is stashed in some device and get used
during dma_alloc_coherent(), for example. That seems to be a general
Linux problem unless there's a mechanism for handling it that I haven't
noticed.

Michael

2022-01-13 00:59:21

by Long Li

[permalink] [raw]

Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

> Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> with parameters affecting NUMA topology
>
> From: Long Li <[email protected]> Sent: Friday, January 7, 2022 12:32 PM
> > >
> > > From: [email protected] <[email protected]> Sent:
> > > Thursday, January 6, 2022 3:20 PM
> > > >
> > > > When the kernel boots with parameters restricting the number of
> > > > cpus or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver
> > > > should only set to the NUMA node to a value that is valid in the current
> running kernel.
> > > >
> > > > Signed-off-by: Long Li <[email protected]>
> > > > ---
> > > > drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > > 1 file changed, 15 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > b/drivers/pci/controller/pci- hyperv.c index
> > > > fc1a29acadbb..8686343eff4c 100644
> > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct
> > > > hv_pcibus_device *hbus)
> > > > if (!hv_dev)
> > > > continue;
> > > >
> > > > - if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > - set_dev_node(&dev->dev, hv_dev-
> >desc.virtual_numa_node);
> > > > + if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > + int cpu;
> > > > + bool found_node = false;
> > > > +
> > > > + for_each_possible_cpu(cpu)
> > > > + if (cpu_to_node(cpu) ==
> > > > + hv_dev->desc.virtual_numa_node) {
> > > > + found_node = true;
> > > > + break;
> > > > + }
> > > > +
> > > > + if (found_node)
> > > > + set_dev_node(&dev->dev,
> > > > + hv_dev->desc.virtual_numa_node);
> > > > + }
> > >
> > > I'm wondering about this approach vs. just comparing against nr_node_ids.
> >
> > I was trying to fix this by comparing with nr_node_ids. This worked
> > for numa=off, but it didn't work with maxcpus=X.
> >
> > maxcpus=X is commonly used in kdump kernels. In this config, the
> > memory system is initialized in a way that only the NUMA nodes within
> > maxcpus are setup and can be used by the drivers.
>
> In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA nodes, the
> number of NUMA nodes configured in the kernel is not affected by maxcpus= on
> the kernel boot line. This VM has 48 vCPUs and 2 NUMA nodes, and is
> Generation 2. Even with maxcpus=4 or maxcpus=1, these lines are output during
> boot:
>
> [ 0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> [ 0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
>
> and
>
> [ 0.280039] Initmem setup node 0 [mem 0x0000000000001000-
> 0x0000007edfffffff]
> [ 0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-
> 0x000000fcdfffffff]
>
> It's perfectly legit to have a NUMA node with memory but no CPUs. The
> memory assigned to the NUMA node is determined by the ACPI SRAT. So I'm
> wondering what is causing the kdump issue you see. Or maybe the behavior of
> older kernels is different.

Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus). But I'm not sure if that matters as the descriptions on these two in the kernel doc are the same.

On my system (4 NUMA nodes) with kdump boot line: (maybe if you try a VM with 4 NUMA nodes, you can see the problem)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-azure root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1 console=ttyS0 earlyprintk=ttyS0 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0 elfcorehdr=4038049140K

I see the following:
[ 0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
[ 0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-0x3c2bef5bfff]
[ 0.413031] Zone ranges:
[ 0.414117] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.416522] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.418932] Normal [mem 0x0000000100000000-0x000003c2bef5cfff]
[ 0.421357] Device empty
[ 0.422454] Movable zone start for each node
[ 0.424109] Early memory node ranges
[ 0.425541] node 0: [mem 0x0000000000001000-0x000000000009ffff]
[ 0.428050] node 0: [mem 0x000000001d000000-0x000000002cffffff]
[ 0.430547] node 3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
[ 0.432963] Initmem setup node 0 [mem 0x0000000000001000-0x000000002cffffff]
[ 0.435695] Initmem setup node 3 [mem 0x000003c27f000000-0x000003c2bef5cfff]
[ 0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
[ 0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
[ 0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges

It's unclear to me why node 1 and 2 are missing. But I don't think it's a Hyper-V problem since it's only affected by setting nr_cpus over kernel boot line. Later, a device driver (mlx5 in this example) tries to allocate memory on node 1 and fails:

[ 137.348836] BUG: unable to handle page fault for address: 0000000000001cc8
[ 137.352224] #PF: supervisor read access in kernel mode
[ 137.354884] #PF: error_code(0x0000) - not-present page
[ 137.357440] PGD 0 P4D 0
[ 137.358976] Oops: 0000 [#1] SMP NOPTI
[ 137.361072] CPU: 0 PID: 445 Comm: systemd-udevd Not tainted 5.11.0-1025-azure #27~20.04.1-Ubuntu
[ 137.365160] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 10/27/2020
[ 137.369915] RIP: 0010:__alloc_pages_nodemask+0x10e/0x300
[ 137.372590] Code: ff ff 84 c0 0f 85 21 01 00 00 44 89 e0 48 8b 55 b0 8b 75 c4 c1 e8 0c 48 8b 7d a8 83 e0 01 88 45 c8 48 85 d2 0f 85 44 01 00 00 <3b> 77 08 0f 82 3b 01 00 00 48 89 7d b8 48 8b 07 44 89 e2 81 e2 00
[ 137.381311] RSP: 0018:ffffabcd4137b858 EFLAGS: 00010246
[ 137.383941] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 137.387397] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000001cc0
[ 137.390724] RBP: ffffabcd4137b8b0 R08: ffffffffffffffff R09: ffffffffffffffff
[ 137.394347] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000cc0
[ 137.397698] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc0
[ 137.401055] FS: 00007fefb6620880(0000) GS:ffff90103ec00000(0000) knlGS:0000000000000000
[ 137.404970] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 137.407888] CR2: 0000000000001cc8 CR3: 000003c280d78000 CR4: 0000000000350ef0
[ 137.411294] Call Trace:
[ 137.412882] __dma_direct_alloc_pages+0x8e/0x120
[ 137.415399] dma_direct_alloc+0x66/0x2a0
[ 137.417517] dma_alloc_attrs+0x3e/0x50
[ 137.419590] mlx5_cmd_init+0xe5/0x570 [mlx5_core]

>
> >
> > > Comparing against nr_node_ids would handle the case of numa=off on
> > > the kernel boot line, or a kernel built with CONFIG_NUMA=n, or the
> > > use of numa=fake. Your approach is also affected by which CPUs are
> > > online, since
> > > cpu_to_node() references percpu data. It would seem to produce more
> > > variable results since CPUs can go online and offline while the VM is running.
> > > If a network VF device was removed and re-added, the results of your
> > > algorithm could be different for the re-add, depending on which CPUs
> > > were online at the time.
> > >
> > > My impression (which may be incorrect) is that the device numa_node
> > > is primarily to allow the driver to allocate memory from the closest
> > > NUMA node, and such memory allocations don't really need to be
> > > affected by which CPUs are online.
> >
> > Yes, this is the reason I'm using for_each_possible_cpu(). Even if
> > some CPUs are not online, the memory system is setup in a way that
> > allow driver to allocate memory on that NUMA node. The algorithm
> > guarantees the value of NUMA node is valid when calling set_dev_node().
> >
>
> I'm thinking the code here should check against nr_node_ids, to catch the
> numa=off or CONFIG_NUMA=n cases. Then could use either node_online() or
> numa_map_to_online_node(), but I'm still curious as to how we would get an
> offline NUMA node given how Hyper-V normally sets up a VM.
>
> NUMA nodes only transition from online to offline if there are no CPUs or
> memory assigned. That can happen if the CPUs are taken offline (or never came
> online) and if the memory is hot-removed. We don't currently support hot-
> remove memory in Hyper-V VMs, though there has been some discussion about
> adding it. I'm not sure how that case is supposed to be handled if the NUMA
> node is stashed in some device and get used during dma_alloc_coherent(), for
> example. That seems to be a general Linux problem unless there's a mechanism
> for handling it that I haven't noticed.
>
> Michael

2022-01-17 10:41:21

by Michael Kelley (LINUX)

[permalink] [raw]

Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

From: Long Li <[email protected]> Sent: Wednesday, January 12, 2022 4:59 PM
>
> > Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> > with parameters affecting NUMA topology
> >
> > From: Long Li <[email protected]> Sent: Friday, January 7, 2022 12:32 PM
> > > >
> > > > From: [email protected] <[email protected]> Sent:
> > > > Thursday, January 6, 2022 3:20 PM
> > > > >
> > > > > When the kernel boots with parameters restricting the number of
> > > > > cpus or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver
> > > > > should only set to the NUMA node to a value that is valid in the current running kernel.
> > > > >
> > > > > Signed-off-by: Long Li <[email protected]>
> > > > > ---
> > > > > drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > > > 1 file changed, 15 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > > b/drivers/pci/controller/pci- hyperv.c index
> > > > > fc1a29acadbb..8686343eff4c 100644
> > > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct
> > > > > hv_pcibus_device *hbus)
> > > > > if (!hv_dev)
> > > > > continue;
> > > > >
> > > > > - if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > > - set_dev_node(&dev->dev, hv_dev->desc.virtual_numa_node);
> > > > > + if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > > + int cpu;
> > > > > + bool found_node = false;
> > > > > +
> > > > > + for_each_possible_cpu(cpu)
> > > > > + if (cpu_to_node(cpu) ==
> > > > > + hv_dev->desc.virtual_numa_node) {
> > > > > + found_node = true;
> > > > > + break;
> > > > > + }
> > > > > +
> > > > > + if (found_node)
> > > > > + set_dev_node(&dev->dev,
> > > > > + hv_dev->desc.virtual_numa_node);
> > > > > + }
> > > >
> > > > I'm wondering about this approach vs. just comparing against nr_node_ids.
> > >
> > > I was trying to fix this by comparing with nr_node_ids. This worked
> > > for numa=off, but it didn't work with maxcpus=X.
> > >
> > > maxcpus=X is commonly used in kdump kernels. In this config, the
> > > memory system is initialized in a way that only the NUMA nodes within
> > > maxcpus are setup and can be used by the drivers.
> >
> > In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA nodes, the
> > number of NUMA nodes configured in the kernel is not affected by maxcpus= on
> > the kernel boot line. This VM has 48 vCPUs and 2 NUMA nodes, and is
> > Generation 2. Even with maxcpus=4 or maxcpus=1, these lines are output during
> > boot:
> >
> > [ 0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> > [ 0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
> >
> > and
> >
> > [ 0.280039] Initmem setup node 0 [mem 0x0000000000001000-0x0000007edfffffff]
> > [ 0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-0x000000fcdfffffff]
> >
> > It's perfectly legit to have a NUMA node with memory but no CPUs. The
> > memory assigned to the NUMA node is determined by the ACPI SRAT. So I'm
> > wondering what is causing the kdump issue you see. Or maybe the behavior of
> > older kernels is different.
>
> Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus). But I'm not sure if that
> matters as the descriptions on these two in the kernel doc are the same.
>
> On my system (4 NUMA nodes) with kdump boot line: (maybe if you try a VM with 4
> NUMA nodes, you can see the problem)
> [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-azure
> root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1 console=ttyS0
> earlyprintk=ttyS0 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1
> irqpoll nousb ata_piix.prefer_ms_hyperv=0 elfcorehdr=4038049140K
>
> I see the following:
> [ 0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
> [ 0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-0x3c2bef5bfff]
> [ 0.413031] Zone ranges:
> [ 0.414117] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.416522] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> [ 0.418932] Normal [mem 0x0000000100000000-0x000003c2bef5cfff]
> [ 0.421357] Device empty
> [ 0.422454] Movable zone start for each node
> [ 0.424109] Early memory node ranges
> [ 0.425541] node 0: [mem 0x0000000000001000-0x000000000009ffff]
> [ 0.428050] node 0: [mem 0x000000001d000000-0x000000002cffffff]
> [ 0.430547] node 3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
> [ 0.432963] Initmem setup node 0 [mem 0x0000000000001000-0x000000002cffffff]
> [ 0.435695] Initmem setup node 3 [mem 0x000003c27f000000-0x000003c2bef5cfff]
> [ 0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
> [ 0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
> [ 0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
> [ 0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges
>
> It's unclear to me why node 1 and 2 are missing. But I don't think it's a Hyper-V problem
> since it's only affected by setting nr_cpus over kernel boot line. Later, a device driver
> (mlx5 in this example) tries to allocate memory on node 1 and fails:
>

To summarize some offline conversation, we've figured out that
the "missing" NUMA nodes are not due to setting maxcpus=1 or
nr_cpus=1. Setting the cpu count doesn't affect any of this.

Instead, Linux is modifying the memory map prior to starting the
kdump kernel so that most of the memory is not touched and is
preserved to be dumped, which is the whole point of kdump. This
modified memory map has no memory in NUMA nodes 1 and 2, so
it is correct to just see nodes 0 and 3 as online.

I think code fix here is pretty simple:

int node;

node = hv_dev->desc.virtual_numa_node;
if ((hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
&& (node < nr_node_ids))
set_dev_node(&dev->dev, numa_map_to_online_node(node));

Michael

2022-01-21 10:59:09

by Long Li

[permalink] [raw]

Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

> Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> with parameters affecting NUMA topology
>
> From: Long Li <[email protected]> Sent: Wednesday, January 12, 2022 4:59
> PM
> >
> > > Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel
> > > boots with parameters affecting NUMA topology
> > >
> > > From: Long Li <[email protected]> Sent: Friday, January 7, 2022
> > > 12:32 PM
> > > > >
> > > > > From: [email protected] <[email protected]> Sent:
> > > > > Thursday, January 6, 2022 3:20 PM
> > > > > >
> > > > > > When the kernel boots with parameters restricting the number
> > > > > > of cpus or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI
> > > > > > driver should only set to the NUMA node to a value that is valid in the
> current running kernel.
> > > > > >
> > > > > > Signed-off-by: Long Li <[email protected]>
> > > > > > ---
> > > > > > drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > > > > 1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > > > b/drivers/pci/controller/pci- hyperv.c index
> > > > > > fc1a29acadbb..8686343eff4c 100644
> > > > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > > > @@ -1835,8 +1835,21 @@ static void
> > > > > > hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> > > > > > if (!hv_dev)
> > > > > > continue;
> > > > > >
> > > > > > - if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > > > - set_dev_node(&dev->dev, hv_dev-
> >desc.virtual_numa_node);
> > > > > > + if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > > > + int cpu;
> > > > > > + bool found_node = false;
> > > > > > +
> > > > > > + for_each_possible_cpu(cpu)
> > > > > > + if (cpu_to_node(cpu) ==
> > > > > > + hv_dev->desc.virtual_numa_node) {
> > > > > > + found_node = true;
> > > > > > + break;
> > > > > > + }
> > > > > > +
> > > > > > + if (found_node)
> > > > > > + set_dev_node(&dev->dev,
> > > > > > + hv_dev-
> >desc.virtual_numa_node);
> > > > > > + }
> > > > >
> > > > > I'm wondering about this approach vs. just comparing against
> nr_node_ids.
> > > >
> > > > I was trying to fix this by comparing with nr_node_ids. This
> > > > worked for numa=off, but it didn't work with maxcpus=X.
> > > >
> > > > maxcpus=X is commonly used in kdump kernels. In this config, the
> > > > memory system is initialized in a way that only the NUMA nodes
> > > > within maxcpus are setup and can be used by the drivers.
> > >
> > > In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA
> > > nodes, the number of NUMA nodes configured in the kernel is not
> > > affected by maxcpus= on the kernel boot line. This VM has 48 vCPUs
> > > and 2 NUMA nodes, and is Generation 2. Even with maxcpus=4 or
> > > maxcpus=1, these lines are output during
> > > boot:
> > >
> > > [ 0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> > > [ 0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
> > >
> > > and
> > >
> > > [ 0.280039] Initmem setup node 0 [mem 0x0000000000001000-
> 0x0000007edfffffff]
> > > [ 0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-
> 0x000000fcdfffffff]
> > >
> > > It's perfectly legit to have a NUMA node with memory but no CPUs.
> > > The memory assigned to the NUMA node is determined by the ACPI SRAT.
> > > So I'm wondering what is causing the kdump issue you see. Or maybe
> > > the behavior of older kernels is different.
> >
> > Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus). But
> > I'm not sure if that matters as the descriptions on these two in the kernel doc
> are the same.
> >
> > On my system (4 NUMA nodes) with kdump boot line: (maybe if you try a
> > VM with 4 NUMA nodes, you can see the problem)
> > [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-azure
> > root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1
> > console=ttyS0
> > earlyprintk=ttyS0 reset_devices systemd.unit=kdump-tools-dump.service
> > nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0
> > elfcorehdr=4038049140K
> >
> > I see the following:
> > [ 0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
> > [ 0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-0x3c2bef5bfff]
> > [ 0.413031] Zone ranges:
> > [ 0.414117] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > [ 0.416522] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> > [ 0.418932] Normal [mem 0x0000000100000000-0x000003c2bef5cfff]
> > [ 0.421357] Device empty
> > [ 0.422454] Movable zone start for each node
> > [ 0.424109] Early memory node ranges
> > [ 0.425541] node 0: [mem 0x0000000000001000-0x000000000009ffff]
> > [ 0.428050] node 0: [mem 0x000000001d000000-0x000000002cffffff]
> > [ 0.430547] node 3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
> > [ 0.432963] Initmem setup node 0 [mem 0x0000000000001000-
> 0x000000002cffffff]
> > [ 0.435695] Initmem setup node 3 [mem 0x000003c27f000000-
> 0x000003c2bef5cfff]
> > [ 0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
> > [ 0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
> > [ 0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
> > [ 0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges
> >
> > It's unclear to me why node 1 and 2 are missing. But I don't think
> > it's a Hyper-V problem since it's only affected by setting nr_cpus
> > over kernel boot line. Later, a device driver
> > (mlx5 in this example) tries to allocate memory on node 1 and fails:
> >
>
> To summarize some offline conversation, we've figured out that the "missing"
> NUMA nodes are not due to setting maxcpus=1 or nr_cpus=1. Setting the cpu
> count doesn't affect any of this.
>
> Instead, Linux is modifying the memory map prior to starting the kdump kernel
> so that most of the memory is not touched and is
> preserved to be dumped, which is the whole point of kdump. This
> modified memory map has no memory in NUMA nodes 1 and 2, so it is correct
> to just see nodes 0 and 3 as online.
>
> I think code fix here is pretty simple:
>
> int node;
>
> node = hv_dev->desc.virtual_numa_node;
> if ((hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> && (node < nr_node_ids))
> set_dev_node(&dev->dev, numa_map_to_online_node(node));
>
> Michael

Okay, this looks good.

I'm sending a V2 (with a minor change) after testing is done.

Long

2022-01-21 11:11:34

by Michael Kelley (LINUX)

[permalink] [raw]

Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

From: Long Li <[email protected]> Sent: Tuesday, January 18, 2022 2:44 PM
> >
> > From: Long Li <[email protected]> Sent: Wednesday, January 12, 2022 4:59
> > PM
> > >
> > > > Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel
> > > > boots with parameters affecting NUMA topology
> > > >
> > > > From: Long Li <[email protected]> Sent: Friday, January 7, 2022
> > > > 12:32 PM
> > > > > >
> > > > > > From: [email protected] <[email protected]> Sent:
> > > > > > Thursday, January 6, 2022 3:20 PM
> > > > > > >
> > > > > > > When the kernel boots with parameters restricting the number
> > > > > > > of cpus or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI
> > > > > > > driver should only set to the NUMA node to a value that is valid in the
> > current running kernel.
> > > > > > >
> > > > > > > Signed-off-by: Long Li <[email protected]>
> > > > > > > ---
> > > > > > > drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > > > > > 1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > > > > b/drivers/pci/controller/pci- hyperv.c index
> > > > > > > fc1a29acadbb..8686343eff4c 100644
> > > > > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > > > > @@ -1835,8 +1835,21 @@ static void
> > > > > > > hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> > > > > > > if (!hv_dev)
> > > > > > > continue;
> > > > > > >
> > > > > > > - if (hv_dev->desc.flags &
> > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > > > > - set_dev_node(&dev->dev, hv_dev-
> > >desc.virtual_numa_node);
> > > > > > > + if (hv_dev->desc.flags &
> > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > > > > + int cpu;
> > > > > > > + bool found_node = false;
> > > > > > > +
> > > > > > > + for_each_possible_cpu(cpu)
> > > > > > > + if (cpu_to_node(cpu) ==
> > > > > > > + hv_dev->desc.virtual_numa_node) {
> > > > > > > + found_node = true;
> > > > > > > + break;
> > > > > > > + }
> > > > > > > +
> > > > > > > + if (found_node)
> > > > > > > + set_dev_node(&dev->dev,
> > > > > > > + hv_dev-
> > >desc.virtual_numa_node);
> > > > > > > + }
> > > > > >
> > > > > > I'm wondering about this approach vs. just comparing against
> > nr_node_ids.
> > > > >
> > > > > I was trying to fix this by comparing with nr_node_ids. This
> > > > > worked for numa=off, but it didn't work with maxcpus=X.
> > > > >
> > > > > maxcpus=X is commonly used in kdump kernels. In this config, the
> > > > > memory system is initialized in a way that only the NUMA nodes
> > > > > within maxcpus are setup and can be used by the drivers.
> > > >
> > > > In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA
> > > > nodes, the number of NUMA nodes configured in the kernel is not
> > > > affected by maxcpus= on the kernel boot line. This VM has 48 vCPUs
> > > > and 2 NUMA nodes, and is Generation 2. Even with maxcpus=4 or
> > > > maxcpus=1, these lines are output during
> > > > boot:
> > > >
> > > > [ 0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> > > > [ 0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
> > > >
> > > > and
> > > >
> > > > [ 0.280039] Initmem setup node 0 [mem 0x0000000000001000-
> > 0x0000007edfffffff]
> > > > [ 0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-
> > 0x000000fcdfffffff]
> > > >
> > > > It's perfectly legit to have a NUMA node with memory but no CPUs.
> > > > The memory assigned to the NUMA node is determined by the ACPI SRAT.
> > > > So I'm wondering what is causing the kdump issue you see. Or maybe
> > > > the behavior of older kernels is different.
> > >
> > > Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus). But
> > > I'm not sure if that matters as the descriptions on these two in the kernel doc
> > are the same.
> > >
> > > On my system (4 NUMA nodes) with kdump boot line: (maybe if you try a
> > > VM with 4 NUMA nodes, you can see the problem)
> > > [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-azure
> > > root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1
> > > console=ttyS0
> > > earlyprintk=ttyS0 reset_devices systemd.unit=kdump-tools-dump.service
> > > nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0
> > > elfcorehdr=4038049140K
> > >
> > > I see the following:
> > > [ 0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
> > > [ 0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-0x3c2bef5bfff]
> > > [ 0.413031] Zone ranges:
> > > [ 0.414117] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > > [ 0.416522] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> > > [ 0.418932] Normal [mem 0x0000000100000000-0x000003c2bef5cfff]
> > > [ 0.421357] Device empty
> > > [ 0.422454] Movable zone start for each node
> > > [ 0.424109] Early memory node ranges
> > > [ 0.425541] node 0: [mem 0x0000000000001000-0x000000000009ffff]
> > > [ 0.428050] node 0: [mem 0x000000001d000000-0x000000002cffffff]
> > > [ 0.430547] node 3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
> > > [ 0.432963] Initmem setup node 0 [mem 0x0000000000001000-
> > 0x000000002cffffff]
> > > [ 0.435695] Initmem setup node 3 [mem 0x000003c27f000000-
> > 0x000003c2bef5cfff]
> > > [ 0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
> > > [ 0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
> > > [ 0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
> > > [ 0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges
> > >
> > > It's unclear to me why node 1 and 2 are missing. But I don't think
> > > it's a Hyper-V problem since it's only affected by setting nr_cpus
> > > over kernel boot line. Later, a device driver
> > > (mlx5 in this example) tries to allocate memory on node 1 and fails:
> > >
> >
> > To summarize some offline conversation, we've figured out that the "missing"
> > NUMA nodes are not due to setting maxcpus=1 or nr_cpus=1. Setting the cpu
> > count doesn't affect any of this.
> >
> > Instead, Linux is modifying the memory map prior to starting the kdump kernel
> > so that most of the memory is not touched and is
> > preserved to be dumped, which is the whole point of kdump. This
> > modified memory map has no memory in NUMA nodes 1 and 2, so it is correct
> > to just see nodes 0 and 3 as online.
> >
> > I think code fix here is pretty simple:
> >
> > int node;
> >
> > node = hv_dev->desc.virtual_numa_node;
> > if ((hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > && (node < nr_node_ids))
> > set_dev_node(&dev->dev, numa_map_to_online_node(node));
> >
> > Michael
>
> Okay, this looks good.
>
> I'm sending a V2 (with a minor change) after testing is done.
>
> Long

Please leave a comment in the code as to why a NUMA node might be
offline. In the future, somebody new might not know what can happen.
I certainly didn't. :-(

Michael

2022-01-21 11:52:29

by Long Li

[permalink] [raw]

Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

> Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> with parameters affecting NUMA topology
>
> From: Long Li <[email protected]> Sent: Tuesday, January 18, 2022 2:44 PM
> > >
> > > From: Long Li <[email protected]> Sent: Wednesday, January 12,
> > > 2022 4:59 PM
> > > >
> > > > > Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when
> > > > > kernel boots with parameters affecting NUMA topology
> > > > >
> > > > > From: Long Li <[email protected]> Sent: Friday, January 7,
> > > > > 2022
> > > > > 12:32 PM
> > > > > > >
> > > > > > > From: [email protected] <[email protected]> Sent:
> > > > > > > Thursday, January 6, 2022 3:20 PM
> > > > > > > >
> > > > > > > > When the kernel boots with parameters restricting the
> > > > > > > > number of cpus or NUMA nodes, e.g. maxcpus=X or numa=off,
> > > > > > > > the vPCI driver should only set to the NUMA node to a
> > > > > > > > value that is valid in the
> > > current running kernel.
> > > > > > > >
> > > > > > > > Signed-off-by: Long Li <[email protected]>
> > > > > > > > ---
> > > > > > > > drivers/pci/controller/pci-hyperv.c | 17
> > > > > > > > +++++++++++++++--
> > > > > > > > 1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > > > > > b/drivers/pci/controller/pci- hyperv.c index
> > > > > > > > fc1a29acadbb..8686343eff4c 100644
> > > > > > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > > > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > > > > > @@ -1835,8 +1835,21 @@ static void
> > > > > > > > hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> > > > > > > > if (!hv_dev)
> > > > > > > > continue;
> > > > > > > >
> > > > > > > > - if (hv_dev->desc.flags &
> > > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > > > > > - set_dev_node(&dev->dev, hv_dev-
> > > >desc.virtual_numa_node);
> > > > > > > > + if (hv_dev->desc.flags &
> > > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > > > > > + int cpu;
> > > > > > > > + bool found_node = false;
> > > > > > > > +
> > > > > > > > + for_each_possible_cpu(cpu)
> > > > > > > > + if (cpu_to_node(cpu) ==
> > > > > > > > + hv_dev->desc.virtual_numa_node) {
> > > > > > > > + found_node = true;
> > > > > > > > + break;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + if (found_node)
> > > > > > > > + set_dev_node(&dev->dev,
> > > > > > > > + hv_dev-
> > > >desc.virtual_numa_node);
> > > > > > > > + }
> > > > > > >
> > > > > > > I'm wondering about this approach vs. just comparing against
> > > nr_node_ids.
> > > > > >
> > > > > > I was trying to fix this by comparing with nr_node_ids. This
> > > > > > worked for numa=off, but it didn't work with maxcpus=X.
> > > > > >
> > > > > > maxcpus=X is commonly used in kdump kernels. In this config,
> > > > > > the memory system is initialized in a way that only the NUMA
> > > > > > nodes within maxcpus are setup and can be used by the drivers.
> > > > >
> > > > > In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA
> > > > > nodes, the number of NUMA nodes configured in the kernel is not
> > > > > affected by maxcpus= on the kernel boot line. This VM has 48
> > > > > vCPUs and 2 NUMA nodes, and is Generation 2. Even with
> > > > > maxcpus=4 or maxcpus=1, these lines are output during
> > > > > boot:
> > > > >
> > > > > [ 0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> > > > > [ 0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
> > > > >
> > > > > and
> > > > >
> > > > > [ 0.280039] Initmem setup node 0 [mem 0x0000000000001000-
> > > 0x0000007edfffffff]
> > > > > [ 0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-
> > > 0x000000fcdfffffff]
> > > > >
> > > > > It's perfectly legit to have a NUMA node with memory but no CPUs.
> > > > > The memory assigned to the NUMA node is determined by the ACPI SRAT.
> > > > > So I'm wondering what is causing the kdump issue you see. Or
> > > > > maybe the behavior of older kernels is different.
> > > >
> > > > Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus).
> > > > But I'm not sure if that matters as the descriptions on these two
> > > > in the kernel doc
> > > are the same.
> > > >
> > > > On my system (4 NUMA nodes) with kdump boot line: (maybe if you
> > > > try a VM with 4 NUMA nodes, you can see the problem)
> > > > [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-
> azure
> > > > root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1
> > > > console=ttyS0
> > > > earlyprintk=ttyS0 reset_devices
> > > > systemd.unit=kdump-tools-dump.service
> > > > nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0
> > > > elfcorehdr=4038049140K
> > > >
> > > > I see the following:
> > > > [ 0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
> > > > [ 0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-
> 0x3c2bef5bfff]
> > > > [ 0.413031] Zone ranges:
> > > > [ 0.414117] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > > > [ 0.416522] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> > > > [ 0.418932] Normal [mem 0x0000000100000000-0x000003c2bef5cfff]
> > > > [ 0.421357] Device empty
> > > > [ 0.422454] Movable zone start for each node
> > > > [ 0.424109] Early memory node ranges
> > > > [ 0.425541] node 0: [mem 0x0000000000001000-0x000000000009ffff]
> > > > [ 0.428050] node 0: [mem 0x000000001d000000-0x000000002cffffff]
> > > > [ 0.430547] node 3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
> > > > [ 0.432963] Initmem setup node 0 [mem 0x0000000000001000-
> > > 0x000000002cffffff]
> > > > [ 0.435695] Initmem setup node 3 [mem 0x000003c27f000000-
> > > 0x000003c2bef5cfff]
> > > > [ 0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
> > > > [ 0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
> > > > [ 0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
> > > > [ 0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges
> > > >
> > > > It's unclear to me why node 1 and 2 are missing. But I don't think
> > > > it's a Hyper-V problem since it's only affected by setting nr_cpus
> > > > over kernel boot line. Later, a device driver
> > > > (mlx5 in this example) tries to allocate memory on node 1 and fails:
> > > >
> > >
> > > To summarize some offline conversation, we've figured out that the
> "missing"
> > > NUMA nodes are not due to setting maxcpus=1 or nr_cpus=1. Setting
> > > the cpu count doesn't affect any of this.
> > >
> > > Instead, Linux is modifying the memory map prior to starting the
> > > kdump kernel so that most of the memory is not touched and is
> > > preserved to be dumped, which is the whole point of kdump. This
> > > modified memory map has no memory in NUMA nodes 1 and 2, so it is
> > > correct to just see nodes 0 and 3 as online.
> > >
> > > I think code fix here is pretty simple:
> > >
> > > int node;
> > >
> > > node = hv_dev->desc.virtual_numa_node;
> > > if ((hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > && (node < nr_node_ids))
> > > set_dev_node(&dev->dev, numa_map_to_online_node(node));
> > >
> > > Michael
> >
> > Okay, this looks good.
> >
> > I'm sending a V2 (with a minor change) after testing is done.
> >
> > Long
>
> Please leave a comment in the code as to why a NUMA node might be
> offline. In the future, somebody new might not know what can happen.
> I certainly didn't. :-(

Will do that.

Long