4 commits are involved here:
A (2016): commit 0de8ce3ee8e3 ("PCI: hv: Allocate physically contiguous hypercall params buffer")
B (2017): commit be66b6736591 ("PCI: hv: Use page allocation for hbus structure")
C (2019): commit 877b911a5ba0 ("PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer")
D (2018): commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Patch D introduced the per-CPU hypercall input page "hyperv_pcpu_input_arg"
in 2018. With patch D, we no longer need the per-Hyper-V-PCI-bus hypercall
input page "hbus->retarget_msi_interrupt_params" that was added in patch A,
and the issue addressed by patch B is no longer an issue, and we can also
get rid of patch C.
The change here is required for PCI device assignment to work for
Confidential VMs (CVMs), because otherwise we would have to call
set_memory_decrypted() for "hbus->retarget_msi_interrupt_params" before
calling the hypercall HVCALL_RETARGET_INTERRUPT.
Signed-off-by: Dexuan Cui <[email protected]>
---
drivers/pci/controller/pci-hyperv.c | 48 +++++------------------------
1 file changed, 7 insertions(+), 41 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 337f3b4a04fc0..bc32662c6bb7f 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -508,20 +508,11 @@ struct hv_pcibus_device {
struct msi_domain_info msi_info;
struct irq_domain *irq_domain;
- spinlock_t retarget_msi_interrupt_lock;
-
struct workqueue_struct *wq;
/* Highest slot of child device with resources allocated */
int wslot_res_allocated;
bool use_calls; /* Use hypercalls to access mmio cfg space */
-
- /* hypercall arg, must not cross page boundary */
- struct hv_retarget_device_interrupt retarget_msi_interrupt_params;
-
- /*
- * Don't put anything here: retarget_msi_interrupt_params must be last
- */
};
/*
@@ -645,9 +636,9 @@ static void hv_arch_irq_unmask(struct irq_data *data)
hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
int_desc = data->chip_data;
- spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
+ local_irq_save(flags);
- params = &hbus->retarget_msi_interrupt_params;
+ params = *this_cpu_ptr(hyperv_pcpu_input_arg);
memset(params, 0, sizeof(*params));
params->partition_id = HV_PARTITION_ID_SELF;
params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
@@ -680,7 +671,7 @@ static void hv_arch_irq_unmask(struct irq_data *data)
if (!alloc_cpumask_var(&tmp, GFP_ATOMIC)) {
res = 1;
- goto exit_unlock;
+ goto out;
}
cpumask_and(tmp, dest, cpu_online_mask);
@@ -689,7 +680,7 @@ static void hv_arch_irq_unmask(struct irq_data *data)
if (nr_bank <= 0) {
res = 1;
- goto exit_unlock;
+ goto out;
}
/*
@@ -708,8 +699,8 @@ static void hv_arch_irq_unmask(struct irq_data *data)
res = hv_do_hypercall(HVCALL_RETARGET_INTERRUPT | (var_size << 17),
params, NULL);
-exit_unlock:
- spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
+out:
+ local_irq_restore(flags);
/*
* During hibernation, when a CPU is offlined, the kernel tries
@@ -3598,35 +3589,11 @@ static int hv_pci_probe(struct hv_device *hdev,
bool enter_d0_retry = true;
int ret;
- /*
- * hv_pcibus_device contains the hypercall arguments for retargeting in
- * hv_irq_unmask(). Those must not cross a page boundary.
- */
- BUILD_BUG_ON(sizeof(*hbus) > HV_HYP_PAGE_SIZE);
-
bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
if (!bridge)
return -ENOMEM;
- /*
- * With the recent 59bb47985c1d ("mm, sl[aou]b: guarantee natural
- * alignment for kmalloc(power-of-two)"), kzalloc() is able to allocate
- * a 4KB buffer that is guaranteed to be 4KB-aligned. Here the size and
- * alignment of hbus is important because hbus's field
- * retarget_msi_interrupt_params must not cross a 4KB page boundary.
- *
- * Here we prefer kzalloc to get_zeroed_page(), because a buffer
- * allocated by the latter is not tracked and scanned by kmemleak, and
- * hence kmemleak reports the pointer contained in the hbus buffer
- * (i.e. the hpdev struct, which is created in new_pcichild_device() and
- * is tracked by hbus->children) as memory leak (false positive).
- *
- * If the kernel doesn't have 59bb47985c1d, get_zeroed_page() *must* be
- * used to allocate the hbus buffer and we can avoid the kmemleak false
- * positive by using kmemleak_alloc() and kmemleak_free() to ask
- * kmemleak to track and scan the hbus buffer.
- */
- hbus = kzalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
+ hbus = kzalloc(sizeof(*hbus), GFP_KERNEL);
if (!hbus)
return -ENOMEM;
@@ -3683,7 +3650,6 @@ static int hv_pci_probe(struct hv_device *hdev,
INIT_LIST_HEAD(&hbus->dr_list);
spin_lock_init(&hbus->config_lock);
spin_lock_init(&hbus->device_list_lock);
- spin_lock_init(&hbus->retarget_msi_interrupt_lock);
hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
hbus->bridge->domain_nr);
if (!hbus->wq) {
--
2.25.1
From: Dexuan Cui <[email protected]> Sent: Saturday, April 8, 2023 2:11 PM
>
> 4 commits are involved here:
> A (2016): commit 0de8ce3ee8e3 ("PCI: hv: Allocate physically contiguous hypercall
> params buffer")
> B (2017): commit be66b6736591 ("PCI: hv: Use page allocation for hbus structure")
> C (2019): commit 877b911a5ba0 ("PCI: hv: Avoid a kmemleak false positive caused by
> the hbus buffer")
> D (2018): commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
>
> Patch D introduced the per-CPU hypercall input page "hyperv_pcpu_input_arg"
> in 2018. With patch D, we no longer need the per-Hyper-V-PCI-bus hypercall
> input page "hbus->retarget_msi_interrupt_params" that was added in patch A,
> and the issue addressed by patch B is no longer an issue, and we can also
> get rid of patch C.
>
> The change here is required for PCI device assignment to work for
> Confidential VMs (CVMs), because otherwise we would have to call
> set_memory_decrypted() for "hbus->retarget_msi_interrupt_params" before
> calling the hypercall HVCALL_RETARGET_INTERRUPT.
Well, not all CVMs. It's not required for SEV-SNP vTOM VMs on Hyper-V because
of the paravisor. Is it more accurate to say "for Confidential VMs (CVMs) running
without a paravisor"?
Otherwise,
Reviewed-by: Michael Kelley <[email protected]>
>
> Signed-off-by: Dexuan Cui <[email protected]>
> ---
> drivers/pci/controller/pci-hyperv.c | 48 +++++------------------------
> 1 file changed, 7 insertions(+), 41 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 337f3b4a04fc0..bc32662c6bb7f 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -508,20 +508,11 @@ struct hv_pcibus_device {
> struct msi_domain_info msi_info;
> struct irq_domain *irq_domain;
>
> - spinlock_t retarget_msi_interrupt_lock;
> -
> struct workqueue_struct *wq;
>
> /* Highest slot of child device with resources allocated */
> int wslot_res_allocated;
> bool use_calls; /* Use hypercalls to access mmio cfg space */
> -
> - /* hypercall arg, must not cross page boundary */
> - struct hv_retarget_device_interrupt retarget_msi_interrupt_params;
> -
> - /*
> - * Don't put anything here: retarget_msi_interrupt_params must be last
> - */
> };
>
> /*
> @@ -645,9 +636,9 @@ static void hv_arch_irq_unmask(struct irq_data *data)
> hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> int_desc = data->chip_data;
>
> - spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> + local_irq_save(flags);
>
> - params = &hbus->retarget_msi_interrupt_params;
> + params = *this_cpu_ptr(hyperv_pcpu_input_arg);
> memset(params, 0, sizeof(*params));
> params->partition_id = HV_PARTITION_ID_SELF;
> params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
> @@ -680,7 +671,7 @@ static void hv_arch_irq_unmask(struct irq_data *data)
>
> if (!alloc_cpumask_var(&tmp, GFP_ATOMIC)) {
> res = 1;
> - goto exit_unlock;
> + goto out;
> }
>
> cpumask_and(tmp, dest, cpu_online_mask);
> @@ -689,7 +680,7 @@ static void hv_arch_irq_unmask(struct irq_data *data)
>
> if (nr_bank <= 0) {
> res = 1;
> - goto exit_unlock;
> + goto out;
> }
>
> /*
> @@ -708,8 +699,8 @@ static void hv_arch_irq_unmask(struct irq_data *data)
> res = hv_do_hypercall(HVCALL_RETARGET_INTERRUPT | (var_size << 17),
> params, NULL);
>
> -exit_unlock:
> - spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
> +out:
> + local_irq_restore(flags);
>
> /*
> * During hibernation, when a CPU is offlined, the kernel tries
> @@ -3598,35 +3589,11 @@ static int hv_pci_probe(struct hv_device *hdev,
> bool enter_d0_retry = true;
> int ret;
>
> - /*
> - * hv_pcibus_device contains the hypercall arguments for retargeting in
> - * hv_irq_unmask(). Those must not cross a page boundary.
> - */
> - BUILD_BUG_ON(sizeof(*hbus) > HV_HYP_PAGE_SIZE);
> -
> bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
> if (!bridge)
> return -ENOMEM;
>
> - /*
> - * With the recent 59bb47985c1d ("mm, sl[aou]b: guarantee natural
> - * alignment for kmalloc(power-of-two)"), kzalloc() is able to allocate
> - * a 4KB buffer that is guaranteed to be 4KB-aligned. Here the size and
> - * alignment of hbus is important because hbus's field
> - * retarget_msi_interrupt_params must not cross a 4KB page boundary.
> - *
> - * Here we prefer kzalloc to get_zeroed_page(), because a buffer
> - * allocated by the latter is not tracked and scanned by kmemleak, and
> - * hence kmemleak reports the pointer contained in the hbus buffer
> - * (i.e. the hpdev struct, which is created in new_pcichild_device() and
> - * is tracked by hbus->children) as memory leak (false positive).
> - *
> - * If the kernel doesn't have 59bb47985c1d, get_zeroed_page() *must* be
> - * used to allocate the hbus buffer and we can avoid the kmemleak false
> - * positive by using kmemleak_alloc() and kmemleak_free() to ask
> - * kmemleak to track and scan the hbus buffer.
> - */
> - hbus = kzalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
> + hbus = kzalloc(sizeof(*hbus), GFP_KERNEL);
> if (!hbus)
> return -ENOMEM;
>
> @@ -3683,7 +3650,6 @@ static int hv_pci_probe(struct hv_device *hdev,
> INIT_LIST_HEAD(&hbus->dr_list);
> spin_lock_init(&hbus->config_lock);
> spin_lock_init(&hbus->device_list_lock);
> - spin_lock_init(&hbus->retarget_msi_interrupt_lock);
> hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
> hbus->bridge->domain_nr);
> if (!hbus->wq) {
> --
> 2.25.1
On Tue, Apr 11, 2023 at 05:27:41PM +0000, Michael Kelley (LINUX) wrote:
> From: Dexuan Cui <[email protected]> Sent: Saturday, April 8, 2023 2:11 PM
> >
> > 4 commits are involved here:
> > A (2016): commit 0de8ce3ee8e3 ("PCI: hv: Allocate physically contiguous hypercall
> > params buffer")
> > B (2017): commit be66b6736591 ("PCI: hv: Use page allocation for hbus structure")
> > C (2019): commit 877b911a5ba0 ("PCI: hv: Avoid a kmemleak false positive caused by
> > the hbus buffer")
> > D (2018): commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
> >
> > Patch D introduced the per-CPU hypercall input page "hyperv_pcpu_input_arg"
> > in 2018. With patch D, we no longer need the per-Hyper-V-PCI-bus hypercall
> > input page "hbus->retarget_msi_interrupt_params" that was added in patch A,
> > and the issue addressed by patch B is no longer an issue, and we can also
> > get rid of patch C.
> >
> > The change here is required for PCI device assignment to work for
> > Confidential VMs (CVMs), because otherwise we would have to call
> > set_memory_decrypted() for "hbus->retarget_msi_interrupt_params" before
> > calling the hypercall HVCALL_RETARGET_INTERRUPT.
>
> Well, not all CVMs. It's not required for SEV-SNP vTOM VMs on Hyper-V because
> of the paravisor. Is it more accurate to say "for Confidential VMs (CVMs) running
> without a paravisor"?
>
Let me know how this text should be reworded.
> Otherwise,
>
> Reviewed-by: Michael Kelley <[email protected]>
> From: Wei Liu <[email protected]>
> Sent: Wednesday, April 12, 2023 6:42 PM
> > > ...
> > > The change here is required for PCI device assignment to work for
> > > Confidential VMs (CVMs), because otherwise we would have to call
> > > set_memory_decrypted() for "hbus->retarget_msi_interrupt_params"
> > > before calling the hypercall HVCALL_RETARGET_INTERRUPT.
> >
> > Well, not all CVMs. It's not required for SEV-SNP vTOM VMs on
> > Hyper-V because of the paravisor. Is it more accurate to say
> > "for Confidential VMs (CVMs) running without a paravisor"?
>
> Let me know how this text should be reworded.
Thanks, I just posted v2:
https://lwn.net/ml/linux-kernel/20230421013025.17152-1-decui%40microsoft.com/