2024-06-04 05:11:09

by Michael Kelley

[permalink] [raw]
Subject: [RFC 00/12] Hyper-V guests use Linux IRQs for channel interrupts

From: Michael Kelley <[email protected]>

This patch set makes significant changes in how Linux guests running on Hyper-V
handle interrupts from VMBus devices. It's "RFC" because of the level of change,
and because I'm interested in macro-level feedback on the whole idea. The usual
detailed-level code review feedback is also welcome.

Background
----------
Linux guests running on Hyper-V are presented with a range of synthetic
devices, including storage, networking, mouse, keyboard, etc. Guests are also
presented with management-related pseudo-devices, including time
synchronization, heartbeat, and host-initiated shutdown. These devices use the
VMBus connection between the host and the guest, and each device has one or
more VMBus channels for passing messages between the host and guest.
Separately, a control path exists for creating and tearing down VMBus channels
when needed. Hyper-V VMBus and its synthetic devices play a role similar to
virtio and virtio devices.

The host interrupts the guest when it sends a new message to the guest in an
otherwise empty channel, and when it sends a new control message to the guest.
All interrupts are multiplexed onto a single per-CPU architectural interrupt as
defined by the processor architecture. On arm64, this architectural interrupt
is a PPI, and is represented in Linux as a Linux per-CPU IRQ. The x86/x64
architecture does not provide per-CPU interrupts, so Linux reserves a fixed x86
interrupt vector (HYPERVISOR_CALLBACK_VECTOR) on all CPUs, with a hard-coded
interrupt handler. No Linux IRQ is assigned on x86/x64.

The interrupt handler for the architectural interrupt is the VMBus interrupt
handler, and it runs whenever the host generates an interrupt for a channel or
a control message. The VMBus interrupt handler consults data structures
provided by the per-CPU Hyper-V synthetic interrupt controller (synic) to
determine which channel or channels have a pending interrupt, or if there is a
pending control message. When doing this demultiplexing, the VMBus interrupt
handler directly invokes the per-device handlers, and processes any control
messages.

In this scheme, the individual VMBus devices and their channels are not
modelled as Linux IRQs. /proc/interrupts shows counts for the top-level
architectural interrupt, but not for the individual VMBus devices. The VMBus
driver has implemented separate reporting of per-device data under /sys.
Similarly, /proc/irq/<nn> does not show affinity information for individual
VMBus devices, and the affinity cannot be changed via /proc/irq. Again, a
separate interface under /sys is provided for changing the vCPU that a VMBus
channel should interrupt.

What's New
----------
Linux kernel community members have commented that it would be better for the
Hyper-V synic and the handling of VMBus channel interrupts to be modelled as
Linux IRQs. Then they would participate in existing IRQ handling mechanisms,
and not need something separate under /sys. This patch set provides such an
implementation.

With this patch set, the output of /proc/interrupts looks like this in a Linux
guest on a Hyper-V x86/x64 Generation 2 VM with 8 vCPUs (the columns for
CPUs 2 thru 5 are omitted due to text width considerations):

CPU0 CPU1 CPU6 CPU7
8: 0 0 0 0 IO-APIC 8-edge rtc0
9: 2 0 0 0 IO-APIC 9-fasteoi acpi
24: 16841 0 0 0 VMBus 7 heartbeat
25: 1 0 0 0 VMBus 9 shutdown
26: 6752 0 0 0 VMBus 10 timesync
27: 1 0 0 0 VMBus 11 vss
28: 22095 0 0 0 VMBus 14 pri@storvsc1
29: 0 85 0 0 VMBus 15 pri@storvsc0
30: 3 0 0 0 VMBus 1 balloon
31: 106 0 0 0 VMBus 3 mouse
32: 73 0 0 0 VMBus 4 keyboard
33: 6 0 0 0 VMBus 5 framebuffer
34: 0 0 0 0 VMBus 13 netvsc
35: 1 0 0 0 VMBus 8 kvp
36: 0 0 0 0 VMBus 16 netvsc
37: 0 0 0 0 VMBus 17 netvsc
38: 0 0 0 0 VMBus 18 netvsc
39: 0 0 2375 0 VMBus 19 netvsc
40: 0 0 0 1178 VMBus 20 netvsc
41: 1003 0 0 0 VMBus 21 netvsc
42: 0 4337 0 0 VMBus 22 netvsc
43: 0 0 0 0 VMBus 23 sub@storvsc1
44: 0 0 0 0 VMBus 24 sub@storvsc0
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 0 0 0 0 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 Performance monitoring interrupts
IWI: 1 0 0 2 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 411 233 349 318 Rescheduling interrupts
CAL: 135236 29211 99526 36424 Function call interrupts
TLB: 0 0 0 0 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 109 109 109 109 Machine check polls
HYP: 71 0 0 0 Hypervisor callback interrupts
HRE: 0 0 0 0 Hyper-V reenlightenment interrupts
HVS: 183391 109695 181539 339239 Hyper-V stimer0 interrupts
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 Posted-interrupt wakeup event

IRQs 24 thru 44 are the VMBus channel IRQs that didn't previously exist. Some
devices, such as storvsc and netvsc, have multiple channels, each of which has
a separate IRQ. The HYP line now shows counts only for control messages instead
of for all VMBus interrupts. The HVS line continues to show Hyper-V synthetic
timer (stimer0) interrupts. (In older versions of Hyper-V where stimer0
interrupts are delivered as control messages, those interrupts are accounted
for on the HYP line. Since this is legacy behavior, it seemed unnecessary to
create a separate Linux IRQ to model these messages.)

The basic approach is to update the VMBus channel create/open/close/delete
infrastructure, and the VMBus interrupt handler, to create and process the
additional IRQs. The goal is that individual VMBus drivers require no code
changes to work with the new IRQs. However, a driver may optionally annotate
the IRQ using knowledge that only the driver has. For example, in the above
/proc/interrupts output, the storvsc driver for synthetic disks has been
updated to distinguish the primary channel from the subchannels, and to
distinguish controller 0 from controller 1. Since storvsc models a SCSI
controller, the numeric designations are set to match the "host0" and "host1"
SCSI controllers that could be seen with "lsscsi -H -v". Similarly annotating
the "netvsc" entries is a "to do" item.

In furtherance of the "no changes to existing VMBus drivers" goal, all VMBus
IRQs use the same interrupt handler. Each channel has a callback function
associated with it, and the interrupt handler invokes this callback in the same
way as before VMBus IRQs were introduced.

VMBus code creates a standalone linear IRQ domain for the VMBus IRQs. The hwirq
#'s are the VMBus channel relid's, which are integers in the range 1 to 2047.
Each relid uniquely identifies a VMBus channel, and a channel interrupt from
the host provides the relid. A mapping from IRQ to hwirq (relid) is created
when a new channel is created, and the mapping is deleted when the channel is
deleted, so adding and removing VMBus devices in a running VM is fully
functional.

Getting the IRQ counting correct requires a new IRQ flow handler in
/kernel/irq/chip.c. See Patch 6. The existing flow handlers don't handle this
case, and creating a custom flow handler outside of chip.c isn't feasible
because the needed components aren't available to code outside of /kernel/irq.

When a guest CPU is taken offline, Linux automatically changes the affinity of
IRQs assigned to that CPU so that the IRQ is handled on another CPU that is
still online. Unfortunately, VMBus channel IRQs are not able to take advantage
of that mechanism. At the point the mechanism runs, it isn't possible to know
when Hyper-V started interrupting the new CPU; hence pending interrupts may be
stranded on the old offline CPU. Existing VMBus code in Linux prevents a CPU
from going offline when a VMBus channel interrupt is affined to it, and that
code must be retained. Of course, VMBus channel IRQs can manually be affined to
a different CPU, which could then allow a CPU to be taken offline. I have an
idea for a somewhat convoluted way to allow the automatic Linux mechanism to
work, but that will be another patch set.

Having VMBus channel interrupts appear in /proc/irq means that user-space
irqbalance can change the affinity of the interrupts, where previously it could
not. For example, this gives irqbalance the ability to change (and perhaps
degrade) the default affinity configuration that was designed to maximize
network throughput. This is yet another reason to disable irqbalance in VM
images.

The custom VMBus entries under /sys are retained for backwards compatibility,
but they are integrated. For example, changing the affinity via /proc/irq is
reflected in /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu, and vice versa.

Patches
-------
The patches have been tested on x86/x64 Generation 1 and Generation 2 VMs, and
on arm64 VMs.

* Patches 1 and 2 fixes some error handling that is needed by subsequent
patches. They could be applied independent of this patch set.

* Patches 3,4, and 5 add support for IRQ names, and add annotations in the
storvsc and the Hyper-V vPCI driver so that descriptions in /proc/interrupts
convey useful information.

* Patch 6 adds a new IRQ flow handler needed to properly account for IRQs as
VMBus IRQs or as the top-level architectural interrupt.

* Patch 7 creates the new IRQ domain for VMBus channel IRQs.

* Patch 8 allocates the VMBus channel IRQs, and creates the mapping between
VMBus relid (used as the hwirq) and the Linux IRQ number. That mapping is then
used for lookups.

* Patch 9 does request_irq() for the new VMBus channel IRQs, and converts the
top-level VMBus architectural interrupt handler to use the new VMBus channel
IRQs for handling channel interrupts. With this patch, /proc/interrupts shows
counts for the VMBus channel IRQs.

* Patch 10 implements the irq_set_affinity() function for VMBus channel IRQs,
and integrates it with the existing Hyper-V-specific sysfs mechanism for
changing the CPU that a VMBus channel will interrupt.

* Patch 11 updates hv_synic_cleanup() during CPU offlining to handle the lack
of waiting for the MODIFYCHANNEL message in vmbus_irq_set_affinity() in Patch
10.

* Patch 12 fixes a race in VMBus channel interrupt assignments that existed
prior to this patch set when taking a CPU offline.

Open Topics
-----------
* The performance impact of the additional IRQ processing in the interrupt path
appears to be minimal, based on looking at the code. Measurements are still to
be taken to confirm this.

* Does the netvsc driver have sufficient information to annotate what appears
in /proc/interrupts, particularly when multiple synthetic NICs are present? At
first glance, it appears that the identifying info isn't generated until after
vmbus_open() is called, and by then it's too late to do the IRQ annotation.
Need some input from a netvsc expert.

* The VMBus irq domain and irq chip data structures are placed in the
vmbus_connection structure. Starting with VMBus protocol version 5.0, the VMBus
host side supports multiple connections from a single guest instance (see
commit ae20b254306a6). While there's no upstream kernel code that creates
multiple VMBus connections, it's unclear whether the IRQ domain should be
global across all VMBus connection, or per connection. Are relid's global (to
the VM) or per connection?

* I have not tried or tested any hv_sock channels. Is there a handy test
program that would be convenient for opening multiple hv_sock connections to
the guest, have them persist long enough to see what /proc/interrupts shows,
and try changing the IRQ affinity?

* I have not tried or tested Linux guest hibernation paths.

The patches build and run against 6.10-rc2 and next-20240603.

Michael Kelley (12):
Drivers: hv: vmbus: Drop unsupported VMBus devices earlier
Drivers: hv: vmbus: Fix error path that deletes non-existent sysfs
group
Drivers: hv: vmbus: Add an IRQ name to VMBus channels
PCI: hv: Annotate the VMBus channel IRQ name
scsi: storvsc: Annotate the VMBus channel IRQ name
genirq: Add per-cpu flow handler with conditional IRQ stats
Drivers: hv: vmbus: Set up irqdomain and irqchip for the VMBus
connection
Drivers: hv: vmbus: Allocate an IRQ per channel and use for relid
mapping
Drivers: hv: vmbus: Use Linux IRQs to handle VMBus channel interrupts
Drivers: hv: vmbus: Implement vmbus_irq_set_affinity
Drivers: hv: vmbus: Wait for MODIFYCHANNEL to finish when offlining
CPUs
Drivers: hv: vmbus: Ensure IRQ affinity isn't set to a CPU going
offline

arch/x86/kernel/cpu/mshyperv.c | 9 +-
drivers/hv/channel.c | 125 ++++------
drivers/hv/channel_mgmt.c | 139 ++++++++----
drivers/hv/connection.c | 48 ++--
drivers/hv/hv.c | 90 +++++++-
drivers/hv/hv_common.c | 2 +-
drivers/hv/hyperv_vmbus.h | 22 +-
drivers/hv/vmbus_drv.c | 338 +++++++++++++++++++---------
drivers/pci/controller/pci-hyperv.c | 5 +
drivers/scsi/storvsc_drv.c | 9 +
include/asm-generic/mshyperv.h | 9 +-
include/linux/hyperv.h | 9 +
include/linux/irq.h | 1 +
kernel/irq/chip.c | 29 +++
14 files changed, 567 insertions(+), 268 deletions(-)

--
2.25.1



2024-06-04 05:11:17

by Michael Kelley

[permalink] [raw]
Subject: [RFC 02/12] Drivers: hv: vmbus: Fix error path that deletes non-existent sysfs group

From: Michael Kelley <[email protected]>

If vmbus_device_create() returns an error to vmbus_add_channel_work(),
the cleanup path calls free_channel(), which in turn calls
vmbus_remove_channel_attr_group(). But the channel attr group hasn't
been created yet, causing sysfs_remove_group() to generate multiple
WARNs about non-existent entries.

Fix the WARNs by adding a flag to struct vmbus_channel to indicate
whether the sysfs group for the channel has been created. Use the
flag to determine if the sysfs group should be removed.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/hv/vmbus_drv.c | 5 ++++-
include/linux/hyperv.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 12a707ab73f8..291a8358370b 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1842,6 +1842,7 @@ int vmbus_add_channel_kobj(struct hv_device *dev, struct vmbus_channel *channel)
dev_err(device, "Unable to set up channel sysfs files\n");
return ret;
}
+ channel->channel_attr_set = true;

kobject_uevent(kobj, KOBJ_ADD);

@@ -1853,7 +1854,9 @@ int vmbus_add_channel_kobj(struct hv_device *dev, struct vmbus_channel *channel)
*/
void vmbus_remove_channel_attr_group(struct vmbus_channel *channel)
{
- sysfs_remove_group(&channel->kobj, &vmbus_chan_group);
+ if (channel->channel_attr_set)
+ sysfs_remove_group(&channel->kobj, &vmbus_chan_group);
+ channel->channel_attr_set = false;
}

/*
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 5e39baa7f6cb..d52c916cc492 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -980,6 +980,7 @@ struct vmbus_channel {
* For sysfs per-channel properties.
*/
struct kobject kobj;
+ bool channel_attr_set;

/*
* For performance critical channels (storage, networking
--
2.25.1


2024-06-04 05:11:55

by Michael Kelley

[permalink] [raw]
Subject: [RFC 03/12] Drivers: hv: vmbus: Add an IRQ name to VMBus channels

From: Michael Kelley <[email protected]>

In preparation for assigning Linux IRQs to VMBus channels, assign a
string name to each channel. The name is used when the IRQ is
requested, and it will appear in the output of /proc/interrupts to
make it easier to identify the device the IRQ is associated with.

Many VMBus devices are single-instance, with a single channel. For
such devices, a default string name can be determined based on the
GUID identifying the device. So add default names to the vmbus_devs
table that lists recognized devices. When a channel is created, set
the channel name to that default so a reasonable name is displayed
without requiring VMBus driver modifications.

However, individual VMBus device drivers may be optionally modified
to override the default name to provide additional information. In
cases where a driver supports multiple instances of a device, or a
device has multiple channels, the additional information may be
helpful to display in /proc/interrupts.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/hv/channel_mgmt.c | 45 +++++++++++++++++++++++++++++++--------
include/linux/hyperv.h | 5 +++++
2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index a216a0aede73..adbe184e5197 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -20,6 +20,7 @@
#include <linux/delay.h>
#include <linux/cpu.h>
#include <linux/hyperv.h>
+#include <linux/string.h>
#include <asm/mshyperv.h>
#include <linux/sched/isolation.h>

@@ -33,6 +34,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_IDE_GUID,
.perf_device = true,
.allowed_in_isolated = false,
+ .irq_name = "storvsc",
},

/* SCSI */
@@ -40,6 +42,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_SCSI_GUID,
.perf_device = true,
.allowed_in_isolated = true,
+ .irq_name = "storvsc",
},

/* Fibre Channel */
@@ -47,6 +50,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_SYNTHFC_GUID,
.perf_device = true,
.allowed_in_isolated = false,
+ .irq_name = "storvsc",
},

/* Synthetic NIC */
@@ -54,6 +58,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_NIC_GUID,
.perf_device = true,
.allowed_in_isolated = true,
+ .irq_name = "netvsc",
},

/* Network Direct */
@@ -61,6 +66,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_ND_GUID,
.perf_device = true,
.allowed_in_isolated = false,
+ .irq_name = "netdirect",
},

/* PCIE */
@@ -68,6 +74,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_PCIE_GUID,
.perf_device = false,
.allowed_in_isolated = true,
+ .irq_name = "vpci",
},

/* Synthetic Frame Buffer */
@@ -75,6 +82,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_SYNTHVID_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "framebuffer",
},

/* Synthetic Keyboard */
@@ -82,6 +90,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_KBD_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "keyboard",
},

/* Synthetic MOUSE */
@@ -89,6 +98,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_MOUSE_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "mouse",
},

/* KVP */
@@ -96,6 +106,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_KVP_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "kvp",
},

/* Time Synch */
@@ -103,6 +114,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_TS_GUID,
.perf_device = false,
.allowed_in_isolated = true,
+ .irq_name = "timesync",
},

/* Heartbeat */
@@ -110,6 +122,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_HEART_BEAT_GUID,
.perf_device = false,
.allowed_in_isolated = true,
+ .irq_name = "heartbeat",
},

/* Shutdown */
@@ -117,6 +130,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_SHUTDOWN_GUID,
.perf_device = false,
.allowed_in_isolated = true,
+ .irq_name = "shutdown",
},

/* File copy */
@@ -126,6 +140,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_FCOPY_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "fcopy",
},

/* Backup */
@@ -133,6 +148,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_VSS_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "vss",
},

/* Dynamic Memory */
@@ -140,6 +156,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_DM_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "balloon",
},

/*
@@ -198,20 +215,30 @@ static bool is_unsupported_vmbus_devs(const guid_t *guid)
return false;
}

-static u16 hv_get_dev_type(const struct vmbus_channel *channel)
+static void hv_set_dev_type_and_irq_name(struct vmbus_channel *channel)
{
const guid_t *guid = &channel->offermsg.offer.if_type;
+ char *name = NULL;
u16 i;

- if (is_hvsock_channel(channel))
- return HV_UNKNOWN;
-
- for (i = HV_IDE; i < HV_UNKNOWN; i++) {
- if (guid_equal(guid, &vmbus_devs[i].guid))
- return i;
+ if (is_hvsock_channel(channel)) {
+ i = HV_UNKNOWN;
+ name = "hv_sock";
+ goto found;
}
+
+ for (i = HV_IDE; i < HV_UNKNOWN; i++)
+ if (guid_equal(guid, &vmbus_devs[i].guid)) {
+ name = vmbus_devs[i].irq_name;
+ goto found;
+ }
+
pr_info("Unknown GUID: %pUl\n", guid);
- return i;
+
+found:
+ channel->device_id = i;
+ if (name)
+ strscpy(channel->irq_name, name, VMBUS_CHAN_IRQ_NAME_MAX);
}

/**
@@ -970,7 +997,7 @@ static void vmbus_setup_channel_state(struct vmbus_channel *channel,
sizeof(struct vmbus_channel_offer_channel));
channel->monitor_grp = (u8)offer->monitorid / 32;
channel->monitor_bit = (u8)offer->monitorid % 32;
- channel->device_id = hv_get_dev_type(channel);
+ hv_set_dev_type_and_irq_name(channel);
}

/*
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index d52c916cc492..897d96fa19d4 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -826,6 +826,7 @@ struct vmbus_device {
guid_t guid;
bool perf_device;
bool allowed_in_isolated;
+ char *irq_name;
};

#define VMBUS_DEFAULT_MAX_PKT_SIZE 4096
@@ -837,6 +838,8 @@ struct vmbus_gpadl {
bool decrypted;
};

+#define VMBUS_CHAN_IRQ_NAME_MAX 32
+
struct vmbus_channel {
struct list_head listentry;

@@ -1068,6 +1071,8 @@ struct vmbus_channel {

/* The max size of a packet on this channel */
u32 max_pkt_size;
+
+ char irq_name[VMBUS_CHAN_IRQ_NAME_MAX];
};

#define lock_requestor(channel, flags) \
--
2.25.1


2024-06-04 05:11:55

by Michael Kelley

[permalink] [raw]
Subject: [RFC 04/12] PCI: hv: Annotate the VMBus channel IRQ name

From: Michael Kelley <[email protected]>

In preparation for assigning Linux IRQs to VMBus channels, annotate
the IRQ name in the single VMBus channel used for setup and teardown
of a virtual PCI device in a Hyper-V guest. The annotation adds the
16-bit PCI domain ID that the Hyper-V vPCI driver assigns to the
virtual PCI bus for the device.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/pci/controller/pci-hyperv.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 5992280e8110..4f70cddb61dc 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3705,6 +3705,9 @@ static int hv_pci_probe(struct hv_device *hdev,
hdev->channel->request_addr_callback = vmbus_request_addr;
hdev->channel->rqstor_size = HV_PCI_RQSTOR_SIZE;

+ snprintf(hdev->channel->irq_name, VMBUS_CHAN_IRQ_NAME_MAX,
+ "vpci:%04x", dom);
+
ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
hv_pci_onchannelcallback, hbus);
if (ret)
@@ -4018,6 +4021,8 @@ static int hv_pci_resume(struct hv_device *hdev)
hdev->channel->next_request_id_callback = vmbus_next_request_id;
hdev->channel->request_addr_callback = vmbus_request_addr;
hdev->channel->rqstor_size = HV_PCI_RQSTOR_SIZE;
+ snprintf(hdev->channel->irq_name, VMBUS_CHAN_IRQ_NAME_MAX,
+ "vpci:%04x", hbus->bridge->domain_nr);

ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
hv_pci_onchannelcallback, hbus);
--
2.25.1


2024-06-04 05:12:42

by Michael Kelley

[permalink] [raw]
Subject: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

From: Michael Kelley <[email protected]>

Hyper-V VMBus devices generate interrupts that are multiplexed
onto a single per-CPU architectural interrupt. The top-level VMBus
driver ISR demultiplexes these interrupts and invokes per-device
handlers. Currently, these per-device handlers are not modeled as
Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
to the top level architectural interrupt. Visibility into per-device
interrupt stats requires accessing VMBus-specific entries in sysfs.
The top-level VMBus driver ISR also handles management-related
interrupts that are not attributable to a particular VMBus device.

As part of changing VMBus to model VMBus per-device handlers as
normal Linux IRQs, the top-level VMBus driver needs to conditionally
account for interrupts. If it passes the interrupt off to a
device-specific IRQ, the interrupt stats are done by that IRQ
handler, and accounting for the interrupt at the top level
is duplicative. But if it handles a management-related interrupt
itself, then it should account for the interrupt itself.

Introduce a new flow handler that provides this functionality.
The new handler parallels handle_percpu_irq(), but does stats
only if the ISR returns other than IRQ_NONE. The existing
handle_untracked_irq() can't be used because it doesn't work for
per-cpu IRQs, and it doesn't provide conditional stats.

Signed-off-by: Michael Kelley <[email protected]>
---
include/linux/irq.h | 1 +
kernel/irq/chip.c | 29 +++++++++++++++++++++++++++++
2 files changed, 30 insertions(+)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index a217e1029c1d..8825b95cefe5 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -656,6 +656,7 @@ extern void handle_edge_eoi_irq(struct irq_desc *desc);
extern void handle_simple_irq(struct irq_desc *desc);
extern void handle_untracked_irq(struct irq_desc *desc);
extern void handle_percpu_irq(struct irq_desc *desc);
+extern void handle_percpu_demux_irq(struct irq_desc *desc);
extern void handle_percpu_devid_irq(struct irq_desc *desc);
extern void handle_bad_irq(struct irq_desc *desc);
extern void handle_nested_irq(unsigned int irq);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index dc94e0bf2c94..1f37a9db4a4d 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -910,6 +910,35 @@ void handle_percpu_irq(struct irq_desc *desc)
chip->irq_eoi(&desc->irq_data);
}

+/**
+ * handle_percpu_demux_irq - Per CPU local irq handler for IRQs
+ * that may demultiplex to other IRQs
+ * @desc: the interrupt description structure for this irq
+ *
+ * For per CPU interrupts on SMP machines without locking requirements.
+ * Used for IRQs that may demultiplex to other IRQs that will do the
+ * stats tracking and randomness. If the handler result indicates no
+ * demultiplexing is done, account for the interrupt on this IRQ.
+ */
+void handle_percpu_demux_irq(struct irq_desc *desc)
+{
+ struct irq_chip *chip = irq_desc_get_chip(desc);
+
+ if (chip->irq_ack)
+ chip->irq_ack(&desc->irq_data);
+
+ if (__handle_irq_event_percpu(desc))
+ /*
+ * PER CPU interrupts are not serialized. Do not touch
+ * desc->tot_count.
+ */
+ __kstat_incr_irqs_this_cpu(desc);
+
+ if (chip->irq_eoi)
+ chip->irq_eoi(&desc->irq_data);
+}
+EXPORT_SYMBOL_GPL(handle_percpu_demux_irq);
+
/**
* handle_percpu_devid_irq - Per CPU local irq handler with per cpu dev ids
* @desc: the interrupt description structure for this irq
--
2.25.1


2024-06-04 05:12:59

by Michael Kelley

[permalink] [raw]
Subject: [RFC 07/12] Drivers: hv: vmbus: Set up irqdomain and irqchip for the VMBus connection

From: Michael Kelley <[email protected]>

In preparation for assigning Linux IRQs to VMBus channels, set up an
irqdomain and irqchip for the VMBus connection. The irqdomain is linear,
with the VMBus relid used as the "hwirq" value. A relid is a unique
index assigned by Hyper-V to each VMBus channel, with values ranging
from 1 to 2047. Because these hwirqs don't map to anything in the
architectural hardware, the domain is not part of the domain hierarchy.

VMBus channel interrupts provide minimal management functionality, so
provide only a minimal set of irqchip functions. The set_affinity function
is a place-holder that is populated in a subsequent patch.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/hv/connection.c | 24 +++++++++-----
drivers/hv/hyperv_vmbus.h | 9 +++++
drivers/hv/vmbus_drv.c | 60 +++++++++++++++++++++++++++++++++-
include/asm-generic/mshyperv.h | 6 ++++
4 files changed, 90 insertions(+), 9 deletions(-)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index f001ae880e1d..cb01784e5c3b 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -21,21 +21,29 @@
#include <linux/export.h>
#include <linux/io.h>
#include <linux/set_memory.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/irqdomain.h>
#include <asm/mshyperv.h>

#include "hyperv_vmbus.h"


struct vmbus_connection vmbus_connection = {
- .conn_state = DISCONNECTED,
- .unload_event = COMPLETION_INITIALIZER(
- vmbus_connection.unload_event),
- .next_gpadl_handle = ATOMIC_INIT(0xE1E10),
-
- .ready_for_suspend_event = COMPLETION_INITIALIZER(
- vmbus_connection.ready_for_suspend_event),
+ .conn_state = DISCONNECTED,
+ .unload_event = COMPLETION_INITIALIZER(
+ vmbus_connection.unload_event),
+ .next_gpadl_handle = ATOMIC_INIT(0xE1E10),
+
+ .vmbus_irq_chip.name = "VMBus",
+ .vmbus_irq_chip.irq_set_affinity = vmbus_irq_set_affinity,
+ .vmbus_irq_chip.irq_mask = vmbus_irq_mask,
+ .vmbus_irq_chip.irq_unmask = vmbus_irq_unmask,
+
+ .ready_for_suspend_event = COMPLETION_INITIALIZER(
+ vmbus_connection.ready_for_suspend_event),
.ready_for_resume_event = COMPLETION_INITIALIZER(
- vmbus_connection.ready_for_resume_event),
+ vmbus_connection.ready_for_resume_event),
};
EXPORT_SYMBOL_GPL(vmbus_connection);

diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 76ac5185a01a..95d4d47d34f7 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -18,7 +18,11 @@
#include <asm/hyperv-tlfs.h>
#include <linux/atomic.h>
#include <linux/hyperv.h>
+#include <linux/fwnode.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/irqdomain.h>

#include "hv_trace.h"

@@ -258,6 +262,11 @@ struct vmbus_connection {
/* Array of channels */
struct vmbus_channel **channels;

+ /* IRQ domain data */
+ struct fwnode_handle *vmbus_fwnode;
+ struct irq_domain *vmbus_irq_domain;
+ struct irq_chip vmbus_irq_chip;
+
/*
* An offer message is handled first on the work_queue, and then
* is further handled on handle_primary_chan_wq or
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 291a8358370b..cbccdfad49a2 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -36,6 +36,9 @@
#include <linux/syscore_ops.h>
#include <linux/dma-map-ops.h>
#include <linux/pci.h>
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+#include <linux/hardirq.h>
#include <clocksource/hyperv_timer.h>
#include <asm/mshyperv.h>
#include "hyperv_vmbus.h"
@@ -1306,6 +1309,40 @@ static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
return IRQ_HANDLED;
}

+int vmbus_irq_set_affinity(struct irq_data *data,
+ const struct cpumask *dest, bool force)
+{
+ return 0;
+}
+
+/*
+ * VMBus channel interrupts do not need to be masked or unmasked, and the
+ * Hyper-V synic doesn't provide any masking functionality anyway. But in the
+ * absence of these irqchip functions, the IRQ subsystem keeps the IRQ marked
+ * as "masked". To prevent any problems associated with staying the "masked"
+ * state, and so that IRQ status shown in debugfs doesn't indicate "masked",
+ * provide null implementations.
+ */
+void vmbus_irq_unmask(struct irq_data *data)
+{
+}
+
+void vmbus_irq_mask(struct irq_data *data)
+{
+}
+
+static int vmbus_irq_map(struct irq_domain *d, unsigned int irq,
+ irq_hw_number_t hw)
+{
+ irq_set_chip_and_handler(irq,
+ &vmbus_connection.vmbus_irq_chip, handle_simple_irq);
+ return 0;
+}
+
+static const struct irq_domain_ops vmbus_domain_ops = {
+ .map = vmbus_irq_map,
+};
+
/*
* vmbus_bus_init -Main vmbus driver initialization routine.
*
@@ -1340,6 +1377,7 @@ static int vmbus_bus_init(void)
if (vmbus_irq == -1) {
hv_setup_vmbus_handler(vmbus_isr);
} else {
+ irq_set_handler(vmbus_irq, handle_percpu_demux_irq);
vmbus_evt = alloc_percpu(long);
ret = request_percpu_irq(vmbus_irq, vmbus_percpu_isr,
"Hyper-V VMbus", vmbus_evt);
@@ -1355,6 +1393,20 @@ static int vmbus_bus_init(void)
if (ret)
goto err_alloc;

+ /* Create IRQ domain for VMBus devices */
+ vmbus_connection.vmbus_fwnode = irq_domain_alloc_named_fwnode("hv-vmbus");
+ if (!vmbus_connection.vmbus_fwnode) {
+ ret = -ENOMEM;
+ goto err_alloc;
+ }
+ vmbus_connection.vmbus_irq_domain = irq_domain_create_linear(
+ vmbus_connection.vmbus_fwnode, MAX_CHANNEL_RELIDS,
+ &vmbus_domain_ops, NULL);
+ if (!vmbus_connection.vmbus_irq_domain) {
+ ret = -ENOMEM;
+ goto err_fwnode;
+ }
+
/*
* Initialize the per-cpu interrupt state and stimer state.
* Then connect to the host.
@@ -1362,7 +1414,7 @@ static int vmbus_bus_init(void)
ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vmbus:online",
hv_synic_init, hv_synic_cleanup);
if (ret < 0)
- goto err_alloc;
+ goto err_domain;
hyperv_cpuhp_online = ret;

ret = vmbus_connect();
@@ -1382,6 +1434,10 @@ static int vmbus_bus_init(void)

err_connect:
cpuhp_remove_state(hyperv_cpuhp_online);
+err_domain:
+ irq_domain_remove(vmbus_connection.vmbus_irq_domain);
+err_fwnode:
+ irq_domain_free_fwnode(vmbus_connection.vmbus_fwnode);
err_alloc:
hv_synic_free();
if (vmbus_irq == -1) {
@@ -2690,6 +2746,8 @@ static void __exit vmbus_exit(void)
hv_debug_rm_all_dir();

vmbus_free_channels();
+ irq_domain_remove(vmbus_connection.vmbus_irq_domain);
+ irq_domain_free_fwnode(vmbus_connection.vmbus_fwnode);
kfree(vmbus_connection.channels);

/*
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 8fe7aaab2599..0488ff8b511f 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -24,6 +24,7 @@
#include <acpi/acpi_numa.h>
#include <linux/cpumask.h>
#include <linux/nmi.h>
+#include <linux/irq.h>
#include <asm/ptrace.h>
#include <asm/hyperv-tlfs.h>

@@ -187,6 +188,11 @@ void hv_remove_kexec_handler(void);
void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
void hv_remove_crash_handler(void);

+extern void vmbus_irq_mask(struct irq_data *data);
+extern void vmbus_irq_unmask(struct irq_data *data);
+extern int vmbus_irq_set_affinity(struct irq_data *data,
+ const struct cpumask *dest, bool force);
+
extern int vmbus_interrupt;
extern int vmbus_irq;

--
2.25.1


2024-06-04 05:13:23

by Michael Kelley

[permalink] [raw]
Subject: [RFC 08/12] Drivers: hv: vmbus: Allocate an IRQ per channel and use for relid mapping

From: Michael Kelley <[email protected]>

Current code uses the "channels" array in the VMBus connection to map from
relid to VMBus channel. Functions exist to establish and remove mappings,
and to map from a relid to its channel.

To implement assigning Linux IRQs to VMBus channels, repurpose the existing
relid functions. Change the "map" function to create an IRQ mapping in the
VMBus IRQ domain, and set the handler data for the IRQ to be the VMBus
channel. Change the "unmap" function to remove the IRQ mapping after
freeing the IRQ if it has been requested. Change the relid2channel()
function to look up the mapping and return both the channel and the
corresponding irqdesc.

While Linux IRQs are allocated for each channel, they are used only for
relid mapping. A subsequent patch changes the interrupt handling path
to use the Linux IRQ.

With these changes, the "channels" array in the VMBus connection is no
longer used. Remove it. Also fixup comments that refer to the channels
array and make them refer to the IRQ mapping instead.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/hv/channel_mgmt.c | 52 ++++++++++++++++++++++++++++-----------
drivers/hv/connection.c | 23 ++++++-----------
drivers/hv/hyperv_vmbus.h | 5 +---
drivers/hv/vmbus_drv.c | 8 +++---
include/linux/hyperv.h | 3 +++
5 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index adbe184e5197..da42aaae994e 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -10,6 +10,9 @@

#include <linux/kernel.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/irqdomain.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/mm.h>
@@ -410,7 +413,10 @@ static void free_channel(struct vmbus_channel *channel)

void vmbus_channel_map_relid(struct vmbus_channel *channel)
{
- if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
+ int res;
+ u32 relid = channel->offermsg.child_relid;
+
+ if (WARN_ON(relid >= MAX_CHANNEL_RELIDS))
return;
/*
* The mapping of the channel's relid is visible from the CPUs that
@@ -437,18 +443,33 @@ void vmbus_channel_map_relid(struct vmbus_channel *channel)
* of the VMBus driver and vmbus_chan_sched() can not run before
* vmbus_bus_resume() has completed execution (cf. resume_noirq).
*/
- virt_store_mb(
- vmbus_connection.channels[channel->offermsg.child_relid],
- channel);
+
+ channel->irq = irq_create_mapping(vmbus_connection.vmbus_irq_domain, relid);
+ if (!channel->irq) {
+ pr_err("irq_create_mapping failed for relid %d\n", relid);
+ return;
+ }
+
+ res = irq_set_handler_data(channel->irq, channel);
+ if (res) {
+ irq_dispose_mapping(channel->irq);
+ channel->irq = 0;
+ pr_err("irq_set_handler_data failed with %d for relid %d\n",
+ res, relid);
+ return;
+ }
+
+ irq_set_status_flags(channel->irq, IRQ_MOVE_PCNTXT);
}

void vmbus_channel_unmap_relid(struct vmbus_channel *channel)
{
- if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
- return;
- WRITE_ONCE(
- vmbus_connection.channels[channel->offermsg.child_relid],
- NULL);
+ if (channel->irq_requested) {
+ irq_update_affinity_hint(channel->irq, NULL);
+ free_irq(channel->irq, channel);
+ }
+ channel->irq_requested = false;
+ irq_dispose_mapping(channel->irq);
}

static void vmbus_release_relid(u32 relid)
@@ -478,10 +499,10 @@ void hv_process_channel_removal(struct vmbus_channel *channel)
!is_hvsock_channel(channel));

/*
- * Upon suspend, an in-use hv_sock channel is removed from the array of
- * channels and the relid is invalidated. After hibernation, when the
+ * Upon suspend, an in-use hv_sock channel is removed from the IRQ
+ * map and the relid is invalidated. After hibernation, when the
* user-space application destroys the channel, it's unnecessary and
- * unsafe to remove the channel from the array of channels. See also
+ * unsafe to remove the channel from the IRQ map. See also
* the inline comments before the call of vmbus_release_relid() below.
*/
if (channel->offermsg.child_relid != INVALID_RELID)
@@ -533,6 +554,9 @@ static void vmbus_add_channel_work(struct work_struct *work)
struct vmbus_channel *primary_channel = newchannel->primary_channel;
int ret;

+ if (!newchannel->irq)
+ goto err_deq_chan;
+
/*
* This state is used to indicate a successful open
* so that when we do close the channel normally, we
@@ -1144,7 +1168,7 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
vmbus_setup_channel_state(oldchannel, offer);
}

- /* Add the channel back to the array of channels. */
+ /* Re-establish the channel's IRQ mapping using the new relid */
vmbus_channel_map_relid(oldchannel);
check_ready_for_resume_event();

@@ -1225,7 +1249,7 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
}

mutex_lock(&vmbus_connection.channel_mutex);
- channel = relid2channel(rescind->child_relid);
+ channel = relid2channel(rescind->child_relid, NULL);
if (channel != NULL) {
/*
* Guarantee that no other instance of vmbus_onoffer_rescind()
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index cb01784e5c3b..a105eecdeec2 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -308,14 +308,6 @@ int vmbus_connect(void)
pr_info("Vmbus version:%d.%d\n",
version >> 16, version & 0xFFFF);

- vmbus_connection.channels = kcalloc(MAX_CHANNEL_RELIDS,
- sizeof(struct vmbus_channel *),
- GFP_KERNEL);
- if (vmbus_connection.channels == NULL) {
- ret = -ENOMEM;
- goto cleanup;
- }
-
kfree(msginfo);
return 0;

@@ -373,15 +365,16 @@ void vmbus_disconnect(void)
* relid2channel - Get the channel object given its
* child relative id (ie channel id)
*/
-struct vmbus_channel *relid2channel(u32 relid)
+struct vmbus_channel *relid2channel(u32 relid, struct irq_desc **desc_ptr)
{
- if (vmbus_connection.channels == NULL) {
- pr_warn_once("relid2channel: relid=%d: No channels mapped!\n", relid);
- return NULL;
- }
- if (WARN_ON(relid >= MAX_CHANNEL_RELIDS))
+ struct irq_desc *desc;
+
+ desc = irq_resolve_mapping(vmbus_connection.vmbus_irq_domain, relid);
+ if (!desc)
return NULL;
- return READ_ONCE(vmbus_connection.channels[relid]);
+ if (desc_ptr)
+ *desc_ptr = desc;
+ return irq_desc_get_handler_data(desc);
}

/*
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 95d4d47d34f7..bf35bb40c55e 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -259,9 +259,6 @@ struct vmbus_connection {
struct list_head chn_list;
struct mutex channel_mutex;

- /* Array of channels */
- struct vmbus_channel **channels;
-
/* IRQ domain data */
struct fwnode_handle *vmbus_fwnode;
struct irq_domain *vmbus_irq_domain;
@@ -364,7 +361,7 @@ void vmbus_remove_channel_attr_group(struct vmbus_channel *channel);
void vmbus_channel_map_relid(struct vmbus_channel *channel);
void vmbus_channel_unmap_relid(struct vmbus_channel *channel);

-struct vmbus_channel *relid2channel(u32 relid);
+struct vmbus_channel *relid2channel(u32 relid, struct irq_desc **desc);

void vmbus_free_channels(void);

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index cbccdfad49a2..8fd03d41e71a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1219,6 +1219,7 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
for_each_set_bit(relid, recv_int_page, maxbits) {
void (*callback_fn)(void *context);
struct vmbus_channel *channel;
+ struct irq_desc *desc;

if (!sync_test_and_clear_bit(relid, recv_int_page))
continue;
@@ -1236,7 +1237,7 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
rcu_read_lock();

/* Find channel based on relid */
- channel = relid2channel(relid);
+ channel = relid2channel(relid, &desc);
if (channel == NULL)
goto sched_unlock_rcu;

@@ -2466,10 +2467,10 @@ static int vmbus_bus_suspend(struct device *dev)

list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
/*
- * Remove the channel from the array of channels and invalidate
+ * Remove the channel from the IRQ map and invalidate
* the channel's relid. Upon resume, vmbus_onoffer() will fix
* up the relid (and other fields, if necessary) and add the
- * channel back to the array.
+ * channel back to the IRQ map.
*/
vmbus_channel_unmap_relid(channel);
channel->offermsg.child_relid = INVALID_RELID;
@@ -2748,7 +2749,6 @@ static void __exit vmbus_exit(void)
vmbus_free_channels();
irq_domain_remove(vmbus_connection.vmbus_irq_domain);
irq_domain_free_fwnode(vmbus_connection.vmbus_fwnode);
- kfree(vmbus_connection.channels);

/*
* The vmbus panic notifier is always registered, hence we should
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 897d96fa19d4..d545b1b635e5 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1072,6 +1072,9 @@ struct vmbus_channel {
/* The max size of a packet on this channel */
u32 max_pkt_size;

+ /* The Linux IRQ for the channel in the "hv-vmbus" IRQ domain */
+ u32 irq;
+ bool irq_requested;
char irq_name[VMBUS_CHAN_IRQ_NAME_MAX];
};

--
2.25.1


2024-06-04 05:13:31

by Michael Kelley

[permalink] [raw]
Subject: [RFC 09/12] Drivers: hv: vmbus: Use Linux IRQs to handle VMBus channel interrupts

From: Michael Kelley <[email protected]>

Do the following:

1) Create an interrupt handler for VMBus channel interrupts by pulling
out portions of vmbus_chan_sched() into vmbus_chan_handler(). The
outer part of vmbus_chan_sched() that loops through the synic event
page bitmap remains unchanged. But when a pending VMBus channel
interrupt is found, call generic_handle_irq_desc() to invoke
handle_simple_irq() and then vmbus_chan_handler() for the channel's
IRQ. handle_simple_irq() does the IRQ stats for that channel's IRQ,
so that per-channel interrupt counts appear in /proc/interrupts. The
overall processing of VMBus channel interrupts is unchanged except
for the intervening handle_simple_irq() that does the stats. No acks
or EOIs are required for VMBus channel IRQs.

2) Update __vmbus_open() to call request_irq(), specifying the previously
setup channel IRQ name and vmbus_chan_handler() as the interrupt
handler. Set the IRQ affinity to the target_cpu assigned when the
channel was created.

3) Update vmbus_isr() to return "false" if it only handles VMBus
interrupts, which were passed to the channel IRQ handler. If
vmbus_isr() handles one or more control message interrupts, then
return "true". Update the related definitions to specify a boolean
return value.

4) The callers of vmbus_isr() increment IRQ stats for the top-level
IRQ only if "true" is returned. On x86, the caller is
sysvec_hyperv_callback(), which manages the stats directly. On
arm64, the caller is vmbus_percpu_isr(), which maps the boolean
return value to IRQ_NONE ("false") or IRQ_HANDLED ("true").
Then handle_percpu_demux_irq() conditionally updates the
stats based on the return value from vmbus_percpu_isr().

With these changes, interrupts from VMBus channels are now
processed as Linux IRQs that are demultiplexed from the main
VMBus interrupt.

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/kernel/cpu/mshyperv.c | 9 ++--
drivers/hv/channel.c | 25 +++++++++-
drivers/hv/hv_common.c | 2 +-
drivers/hv/vmbus_drv.c | 84 +++++++++++++++++++---------------
include/asm-generic/mshyperv.h | 3 +-
5 files changed, 79 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e0fd57a8ba84..18bc282a99db 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -110,7 +110,7 @@ void hv_set_msr(unsigned int reg, u64 value)
}
EXPORT_SYMBOL_GPL(hv_set_msr);

-static void (*vmbus_handler)(void);
+static bool (*vmbus_handler)(void);
static void (*hv_stimer0_handler)(void);
static void (*hv_kexec_handler)(void);
static void (*hv_crash_handler)(struct pt_regs *regs);
@@ -119,9 +119,8 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
{
struct pt_regs *old_regs = set_irq_regs(regs);

- inc_irq_stat(irq_hv_callback_count);
- if (vmbus_handler)
- vmbus_handler();
+ if (vmbus_handler && vmbus_handler())
+ inc_irq_stat(irq_hv_callback_count);

if (ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED)
apic_eoi();
@@ -129,7 +128,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
set_irq_regs(old_regs);
}

-void hv_setup_vmbus_handler(void (*handler)(void))
+void hv_setup_vmbus_handler(bool (*handler)(void))
{
vmbus_handler = handler;
}
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index fb8cd8469328..1aa020b538f1 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -638,6 +638,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
struct vmbus_channel_open_channel *open_msg;
struct vmbus_channel_msginfo *open_info = NULL;
struct page *page = newchannel->ringbuffer_page;
+ u32 relid = newchannel->offermsg.child_relid;
u32 send_pages, recv_pages;
unsigned long flags;
int err;
@@ -685,13 +686,31 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
if (err)
goto error_free_gpadl;

+ /* Request the IRQ and assign to target_cpu */
+ err = request_irq(newchannel->irq, vmbus_chan_handler, 0,
+ newchannel->irq_name, newchannel);
+ if (err) {
+ pr_err("request_irq failed with %d for relid %d irq %d\n",
+ err, relid, newchannel->irq);
+ goto error_free_gpadl;
+ }
+ err = irq_set_affinity_and_hint(newchannel->irq,
+ cpumask_of(newchannel->target_cpu));
+ if (err) {
+ pr_err("irq_set_affinity_and_hint failed with %d for relid %d irq %d\n",
+ err, relid, newchannel->irq);
+ free_irq(newchannel->irq, newchannel);
+ goto error_free_gpadl;
+ }
+ newchannel->irq_requested = true;
+
/* Create and init the channel open message */
open_info = kzalloc(sizeof(*open_info) +
sizeof(struct vmbus_channel_open_channel),
GFP_KERNEL);
if (!open_info) {
err = -ENOMEM;
- goto error_free_gpadl;
+ goto error_free_irq;
}

init_completion(&open_info->waitevent);
@@ -759,6 +778,10 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
error_free_info:
kfree(open_info);
+error_free_irq:
+ irq_update_affinity_hint(newchannel->irq, NULL);
+ free_irq(newchannel->irq, newchannel);
+ newchannel->irq_requested = false;
error_free_gpadl:
vmbus_teardown_gpadl(newchannel, &newchannel->ringbuffer_gpadlhandle);
error_clean_ring:
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 9c452bfbd571..38a23add721c 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -610,7 +610,7 @@ bool __weak hv_isolation_type_tdx(void)
}
EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);

-void __weak hv_setup_vmbus_handler(void (*handler)(void))
+void __weak hv_setup_vmbus_handler(bool (*handler)(void))
{
}
EXPORT_SYMBOL_GPL(hv_setup_vmbus_handler);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 8fd03d41e71a..b73be7c02d37 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1193,6 +1193,45 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
}
#endif /* CONFIG_PM_SLEEP */

+irqreturn_t vmbus_chan_handler(int irq, void *dev_id)
+{
+ void (*callback_fn)(void *context);
+ struct vmbus_channel *channel = dev_id;
+
+ /*
+ * Make sure that the ring buffer data structure doesn't get
+ * freed while we dereference the ring buffer pointer. Test
+ * for the channel's onchannel_callback being NULL within a
+ * sched_lock critical section. See also the inline comments
+ * in vmbus_reset_channel_cb().
+ */
+ spin_lock(&channel->sched_lock);
+
+ callback_fn = channel->onchannel_callback;
+ if (unlikely(callback_fn == NULL))
+ goto spin_unlock;
+
+ trace_vmbus_chan_sched(channel);
+
+ ++channel->interrupts;
+
+ switch (channel->callback_mode) {
+ case HV_CALL_ISR:
+ (*callback_fn)(channel->channel_callback_context);
+ break;
+
+ case HV_CALL_BATCHED:
+ hv_begin_read(&channel->inbound);
+ fallthrough;
+ case HV_CALL_DIRECT:
+ tasklet_schedule(&channel->callback_event);
+ }
+
+spin_unlock:
+ spin_unlock(&channel->sched_lock);
+ return IRQ_HANDLED;
+}
+
/*
* Schedule all channels with events pending
*/
@@ -1217,7 +1256,6 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
return;

for_each_set_bit(relid, recv_int_page, maxbits) {
- void (*callback_fn)(void *context);
struct vmbus_channel *channel;
struct irq_desc *desc;

@@ -1244,43 +1282,14 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
if (channel->rescind)
goto sched_unlock_rcu;

- /*
- * Make sure that the ring buffer data structure doesn't get
- * freed while we dereference the ring buffer pointer. Test
- * for the channel's onchannel_callback being NULL within a
- * sched_lock critical section. See also the inline comments
- * in vmbus_reset_channel_cb().
- */
- spin_lock(&channel->sched_lock);
-
- callback_fn = channel->onchannel_callback;
- if (unlikely(callback_fn == NULL))
- goto sched_unlock;
-
- trace_vmbus_chan_sched(channel);
-
- ++channel->interrupts;
-
- switch (channel->callback_mode) {
- case HV_CALL_ISR:
- (*callback_fn)(channel->channel_callback_context);
- break;
-
- case HV_CALL_BATCHED:
- hv_begin_read(&channel->inbound);
- fallthrough;
- case HV_CALL_DIRECT:
- tasklet_schedule(&channel->callback_event);
- }
+ generic_handle_irq_desc(desc);

-sched_unlock:
- spin_unlock(&channel->sched_lock);
sched_unlock_rcu:
rcu_read_unlock();
}
}

-static void vmbus_isr(void)
+static bool vmbus_isr(void)
{
struct hv_per_cpu_context *hv_cpu
= this_cpu_ptr(hv_context.cpu_context);
@@ -1299,15 +1308,18 @@ static void vmbus_isr(void)
vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED);
} else
tasklet_schedule(&hv_cpu->msg_dpc);
- }

- add_interrupt_randomness(vmbus_interrupt);
+ add_interrupt_randomness(vmbus_interrupt);
+ return true;
+ }
+ return false;
}

static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
{
- vmbus_isr();
- return IRQ_HANDLED;
+ if (vmbus_isr())
+ return IRQ_HANDLED;
+ return IRQ_NONE;
}

int vmbus_irq_set_affinity(struct irq_data *data,
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 0488ff8b511f..0a5559b9d5f7 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -178,7 +178,7 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)

int hv_get_hypervisor_version(union hv_hypervisor_version_info *info);

-void hv_setup_vmbus_handler(void (*handler)(void));
+void hv_setup_vmbus_handler(bool (*handler)(void));
void hv_remove_vmbus_handler(void);
void hv_setup_stimer0_handler(void (*handler)(void));
void hv_remove_stimer0_handler(void);
@@ -188,6 +188,7 @@ void hv_remove_kexec_handler(void);
void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
void hv_remove_crash_handler(void);

+extern irqreturn_t vmbus_chan_handler(int irq, void *dev_id);
extern void vmbus_irq_mask(struct irq_data *data);
extern void vmbus_irq_unmask(struct irq_data *data);
extern int vmbus_irq_set_affinity(struct irq_data *data,
--
2.25.1


2024-06-04 05:13:55

by Michael Kelley

[permalink] [raw]
Subject: [RFC 10/12] Drivers: hv: vmbus: Implement vmbus_irq_set_affinity

From: Michael Kelley <[email protected]>

Pull out core code from target_cpu_store() to implement
vmbus_irq_set_affinity() so the affinity of VMBus channel interrupts
can be updated from user space via /proc/irq.

Since vmbus_irq_set_affinity() runs with interrupts disabled,
vmbus_send_modifychannel() can't wait for an ACK from Hyper-V. As
such, remove the "wait for ack" version of vmbus_send_modifychannel().
Not waiting isn't a problem unless the old CPU is quickly taken offline
before Hyper-V makes the change, which is dealt with in a subsequent
patch.

Also change target_cpu_store() to call irq_set_affinity() so that
changes made via /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu
are in sync with the /proc/irq interface. The cpus_read_lock() is
no longer needed in target_cpu_store() because irq_set_affinity()
ensures that the interrupt affinity is not set to an offline
CPU.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/hv/channel.c | 97 ++++++-------------------
drivers/hv/vmbus_drv.c | 161 +++++++++++++++++++++++++----------------
2 files changed, 121 insertions(+), 137 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 1aa020b538f1..b7920072e243 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -212,79 +212,6 @@ int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id,
}
EXPORT_SYMBOL_GPL(vmbus_send_tl_connect_request);

-static int send_modifychannel_without_ack(struct vmbus_channel *channel, u32 target_vp)
-{
- struct vmbus_channel_modifychannel msg;
- int ret;
-
- memset(&msg, 0, sizeof(msg));
- msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL;
- msg.child_relid = channel->offermsg.child_relid;
- msg.target_vp = target_vp;
-
- ret = vmbus_post_msg(&msg, sizeof(msg), true);
- trace_vmbus_send_modifychannel(&msg, ret);
-
- return ret;
-}
-
-static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target_vp)
-{
- struct vmbus_channel_modifychannel *msg;
- struct vmbus_channel_msginfo *info;
- unsigned long flags;
- int ret;
-
- info = kzalloc(sizeof(struct vmbus_channel_msginfo) +
- sizeof(struct vmbus_channel_modifychannel),
- GFP_KERNEL);
- if (!info)
- return -ENOMEM;
-
- init_completion(&info->waitevent);
- info->waiting_channel = channel;
-
- msg = (struct vmbus_channel_modifychannel *)info->msg;
- msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL;
- msg->child_relid = channel->offermsg.child_relid;
- msg->target_vp = target_vp;
-
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
- list_add_tail(&info->msglistentry, &vmbus_connection.chn_msg_list);
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
-
- ret = vmbus_post_msg(msg, sizeof(*msg), true);
- trace_vmbus_send_modifychannel(msg, ret);
- if (ret != 0) {
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
- list_del(&info->msglistentry);
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
- goto free_info;
- }
-
- /*
- * Release channel_mutex; otherwise, vmbus_onoffer_rescind() could block on
- * the mutex and be unable to signal the completion.
- *
- * See the caller target_cpu_store() for information about the usage of the
- * mutex.
- */
- mutex_unlock(&vmbus_connection.channel_mutex);
- wait_for_completion(&info->waitevent);
- mutex_lock(&vmbus_connection.channel_mutex);
-
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
- list_del(&info->msglistentry);
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
-
- if (info->response.modify_response.status)
- ret = -EAGAIN;
-
-free_info:
- kfree(info);
- return ret;
-}
-
/*
* Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt.
*
@@ -294,14 +221,32 @@ static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target
* out an ACK, we can not know when the host will stop interrupting the "old"
* vCPU and start interrupting the "new" vCPU for the given channel.
*
+ * But even if Hyper-V provides the ACK, we don't wait for it because the
+ * caller, vmbus_irq_set_affinity(), is running with a spin lock held. The
+ * unknown delay in when the host will start interrupting the new vCPU is not
+ * a problem unless the old vCPU is taken offline, and that situation is dealt
+ * with separately in the CPU offlining path.
+ *
* The CHANNELMSG_MODIFYCHANNEL message type is supported since VMBus version
* VERSION_WIN10_V4_1.
*/
int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp)
{
- if (vmbus_proto_version >= VERSION_WIN10_V5_3)
- return send_modifychannel_with_ack(channel, target_vp);
- return send_modifychannel_without_ack(channel, target_vp);
+ struct vmbus_channel_modifychannel msg;
+ int ret;
+
+ if (vmbus_proto_version < VERSION_WIN10_V4_1)
+ return -EINVAL;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL;
+ msg.child_relid = channel->offermsg.child_relid;
+ msg.target_vp = target_vp;
+
+ ret = vmbus_post_msg(&msg, sizeof(msg), false);
+ trace_vmbus_send_modifychannel(&msg, ret);
+
+ return ret;
}
EXPORT_SYMBOL_GPL(vmbus_send_modifychannel);

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index b73be7c02d37..87f2f3436136 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -22,7 +22,6 @@
#include <linux/kernel_stat.h>
#include <linux/of_address.h>
#include <linux/clockchips.h>
-#include <linux/cpu.h>
#include <linux/sched/isolation.h>
#include <linux/sched/task_stack.h>

@@ -1322,10 +1321,107 @@ static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
return IRQ_NONE;
}

+/*
+ * This function is invoked by user space affinity changes initiated
+ * from /proc/irq/<nn> or from the legacy VMBus-specific interface at
+ * /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu.
+ *
+ * In the former case, the /proc implementation ensures that unmapping
+ * (i.e., deleting) the IRQ will pend while this function is in progress.
+ * Since deleting the channel unmaps the IRQ first, the channel can't go
+ * away either.
+ *
+ * In the latter case, the VMBus connection channel_mutex is held, which
+ * prevents channel deltion, and therefore IRQ unampping as well.
+ *
+ * So in both cases, accessing the channel and IRQ data structures is safe.
+ */
int vmbus_irq_set_affinity(struct irq_data *data,
const struct cpumask *dest, bool force)
{
- return 0;
+ static int next_cpu;
+ static cpumask_t tempmask;
+ int origin_cpu, target_cpu;
+ struct vmbus_channel *channel = irq_data_get_irq_handler_data(data);
+ int ret;
+
+ if (!channel) {
+ pr_err("Bad channel in vmbus_irq_set_affinity for relid %ld\n",
+ data->hwirq);
+ return -EINVAL;
+ }
+
+ /* Don't consider CPUs that are isolated */
+ if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
+ cpumask_and(&tempmask, dest,
+ housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+ else
+ cpumask_copy(&tempmask, dest);
+
+ /*
+ * If Hyper-V is already targeting a CPU in the new affinity mask,
+ * keep that targeting and Hyper-V doesn't need to be updated. But
+ * still set effective affinity as it may be unset when the IRQ is
+ * first created.
+ */
+ origin_cpu = channel->target_cpu;
+ if (cpumask_test_cpu(origin_cpu, &tempmask)) {
+ target_cpu = origin_cpu;
+ goto update_effective;
+ }
+
+ /*
+ * Pick a CPU from the new affinity mask. As a simple heuristic to
+ * spread out the selection when the mask contains multiple CPUs,
+ * start with whatever CPU was last selected.
+ */
+ target_cpu = cpumask_next_wrap(next_cpu, &tempmask, nr_cpu_ids, false);
+ if (target_cpu >= nr_cpu_ids)
+ return -EINVAL;
+ next_cpu = target_cpu;
+
+ /*
+ * Hyper-V will ignore MODIFYCHANNEL messages for "non-open" channels;
+ * avoid sending the message and fail here for such channels.
+ */
+ if (channel->state != CHANNEL_OPENED_STATE)
+ return -EIO;
+
+ ret = vmbus_send_modifychannel(channel,
+ hv_cpu_number_to_vp_number(target_cpu));
+ if (ret)
+ return ret;
+
+ /*
+ * Warning. At this point, there is *no* guarantee that the host will
+ * have successfully processed the vmbus_send_modifychannel() request.
+ * See the header comment of vmbus_send_modifychannel() for more info.
+ *
+ * Lags in the processing of the above vmbus_send_modifychannel() can
+ * result in missed interrupts if the "old" target CPU is taken offline
+ * before Hyper-V starts sending interrupts to the "new" target CPU.
+ * hv_synic_cleanup() will ensure no interrupts are missed.
+ *
+ * But apart from this offlining scenario, the code tolerates such
+ * lags. It will function correctly even if a channel interrupt comes
+ * in on a CPU that is different from the channel target_cpu value.
+ */
+
+ channel->target_cpu = target_cpu;
+
+ /* See init_vp_index(). */
+ if (hv_is_perf_channel(channel))
+ hv_update_allocated_cpus(origin_cpu, target_cpu);
+
+ /* Currently set only for storvsc channels. */
+ if (channel->change_target_cpu_callback) {
+ (*channel->change_target_cpu_callback)(channel,
+ origin_cpu, target_cpu);
+ }
+
+update_effective:
+ irq_data_update_effective_affinity(data, cpumask_of(target_cpu));
+ return IRQ_SET_MASK_OK;
}

/*
@@ -1655,7 +1751,7 @@ static ssize_t target_cpu_show(struct vmbus_channel *channel, char *buf)
static ssize_t target_cpu_store(struct vmbus_channel *channel,
const char *buf, size_t count)
{
- u32 target_cpu, origin_cpu;
+ u32 target_cpu;
ssize_t ret = count;

if (vmbus_proto_version < VERSION_WIN10_V4_1)
@@ -1668,17 +1764,6 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel,
if (target_cpu >= nr_cpumask_bits)
return -EINVAL;

- if (!cpumask_test_cpu(target_cpu, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)))
- return -EINVAL;
-
- /* No CPUs should come up or down during this. */
- cpus_read_lock();
-
- if (!cpu_online(target_cpu)) {
- cpus_read_unlock();
- return -EINVAL;
- }
-
/*
* Synchronizes target_cpu_store() and channel closure:
*
@@ -1703,55 +1788,9 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel,
*/
mutex_lock(&vmbus_connection.channel_mutex);

- /*
- * Hyper-V will ignore MODIFYCHANNEL messages for "non-open" channels;
- * avoid sending the message and fail here for such channels.
- */
- if (channel->state != CHANNEL_OPENED_STATE) {
- ret = -EIO;
- goto cpu_store_unlock;
- }
-
- origin_cpu = channel->target_cpu;
- if (target_cpu == origin_cpu)
- goto cpu_store_unlock;
-
- if (vmbus_send_modifychannel(channel,
- hv_cpu_number_to_vp_number(target_cpu))) {
- ret = -EIO;
- goto cpu_store_unlock;
- }
-
- /*
- * For version before VERSION_WIN10_V5_3, the following warning holds:
- *
- * Warning. At this point, there is *no* guarantee that the host will
- * have successfully processed the vmbus_send_modifychannel() request.
- * See the header comment of vmbus_send_modifychannel() for more info.
- *
- * Lags in the processing of the above vmbus_send_modifychannel() can
- * result in missed interrupts if the "old" target CPU is taken offline
- * before Hyper-V starts sending interrupts to the "new" target CPU.
- * But apart from this offlining scenario, the code tolerates such
- * lags. It will function correctly even if a channel interrupt comes
- * in on a CPU that is different from the channel target_cpu value.
- */
-
- channel->target_cpu = target_cpu;
-
- /* See init_vp_index(). */
- if (hv_is_perf_channel(channel))
- hv_update_allocated_cpus(origin_cpu, target_cpu);
-
- /* Currently set only for storvsc channels. */
- if (channel->change_target_cpu_callback) {
- (*channel->change_target_cpu_callback)(channel,
- origin_cpu, target_cpu);
- }
+ ret = irq_set_affinity(channel->irq, cpumask_of(target_cpu));

-cpu_store_unlock:
mutex_unlock(&vmbus_connection.channel_mutex);
- cpus_read_unlock();
return ret;
}
static VMBUS_CHAN_ATTR(cpu, 0644, target_cpu_show, target_cpu_store);
--
2.25.1


2024-06-04 05:14:13

by Michael Kelley

[permalink] [raw]
Subject: [RFC 11/12] Drivers: hv: vmbus: Wait for MODIFYCHANNEL to finish when offlining CPUs

From: Michael Kelley <[email protected]>

vmbus_irq_set_affinity() may issue a MODIFYCHANNEL request to Hyper-V to
target a VMBus channel interrupt to a different CPU. While newer versions
of Hyper-V send a response to the guest when the change is complete,
vmbus_irq_set_affinity() does not wait for the response because it is
running with interrupts disabled. So Hyper-V may continue to direct
interrupts to the old CPU for a short window after vmbus_irq_set_affinity()
completes. This lag is not a problem during normal operation. But if
the old CPU is taken offline during that window, Hyper-V may drop
the interrupt because the synic in the target CPU is disabled. Dropping
the interrupt may cause the VMBus channel to hang.

To prevent this, wait for in-process MODIFYCHANNEL requests when taking
a CPU offline. On newer versions of Hyper-V, completion can be confirmed
by waiting for the response sent by Hyper-V. But on older versions of
Hyper-V that don't send a response, wait a fixed interval of time that
empirically should be "long enough", as that's the best that can be done.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/hv/channel.c | 3 ++
drivers/hv/channel_mgmt.c | 32 ++++--------------
drivers/hv/hv.c | 70 +++++++++++++++++++++++++++++++++++----
drivers/hv/hyperv_vmbus.h | 8 +++++
4 files changed, 81 insertions(+), 32 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index b7920072e243..b7ee95373049 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -246,6 +246,9 @@ int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp)
ret = vmbus_post_msg(&msg, sizeof(msg), false);
trace_vmbus_send_modifychannel(&msg, ret);

+ if (!ret)
+ vmbus_connection.modchan_sent++;
+
return ret;
}
EXPORT_SYMBOL_GPL(vmbus_send_modifychannel);
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index da42aaae994e..960a2f0367d8 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -1465,40 +1465,20 @@ static void vmbus_ongpadl_created(struct vmbus_channel_message_header *hdr)
* vmbus_onmodifychannel_response - Modify Channel response handler.
*
* This is invoked when we received a response to our channel modify request.
- * Find the matching request, copy the response and signal the requesting thread.
+ * Increment the count of responses received. No locking is needed because
+ * responses are always received on the VMBUS_CONNECT_CPU.
*/
static void vmbus_onmodifychannel_response(struct vmbus_channel_message_header *hdr)
{
struct vmbus_channel_modifychannel_response *response;
- struct vmbus_channel_msginfo *msginfo;
- unsigned long flags;

response = (struct vmbus_channel_modifychannel_response *)hdr;
+ if (response->status)
+ pr_err("Error status %x in MODIFYCHANNEL response for relid %d\n",
+ response->status, response->child_relid);
+ vmbus_connection.modchan_completed++;

trace_vmbus_onmodifychannel_response(response);
-
- /*
- * Find the modify msg, copy the response and signal/unblock the wait event.
- */
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
-
- list_for_each_entry(msginfo, &vmbus_connection.chn_msg_list, msglistentry) {
- struct vmbus_channel_message_header *responseheader =
- (struct vmbus_channel_message_header *)msginfo->msg;
-
- if (responseheader->msgtype == CHANNELMSG_MODIFYCHANNEL) {
- struct vmbus_channel_modifychannel *modifymsg;
-
- modifymsg = (struct vmbus_channel_modifychannel *)msginfo->msg;
- if (modifymsg->child_relid == response->child_relid) {
- memcpy(&msginfo->response.modify_response, response,
- sizeof(*response));
- complete(&msginfo->waitevent);
- break;
- }
- }
- }
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
}

/*
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index a8ad728354cb..76658dfc5008 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -401,6 +401,56 @@ void hv_synic_disable_regs(unsigned int cpu)
disable_percpu_irq(vmbus_irq);
}

+static void hv_synic_wait_for_modifychannel(int cpu)
+{
+ int i = 5;
+ u64 base;
+
+ /*
+ * If we're on a VMBus version where MODIFYCHANNEL doesn't send acks,
+ * just sleep for 20 milliseconds and hope that gives Hyper-V enough
+ * time to process them. Empirical data on recent server-class CPUs
+ * (both x86 and arm64) shows that the Hyper-V response is typically
+ * received and processed in the guest within a few hundred
+ * microseconds. The 20 millisecond wait is somewhat arbitrary and
+ * intended to give plenty to time in case there are multiple
+ * MODIFYCHANNEL requests in progress and the host is busy. It's
+ * the best we can do.
+ */
+ if (vmbus_proto_version < VERSION_WIN10_V5_3) {
+ usleep_range(20000, 25000);
+ return;
+ }
+
+ /*
+ * Otherwise compare the current value of modchan_completed against
+ * modchan_sent. If some MODIFYCHANNEL requests have been sent that
+ * haven't completed, sleep 5 milliseconds and check again. If the
+ * requests still haven't completed after 5 attempts, output a
+ * message and proceed anyway.
+ *
+ * Hyper-V guarantees to process MODIFYCHANNEL requests in the order
+ * they are received from the guest, so simply comparing the counts
+ * is sufficient.
+ *
+ * Note that this check may encompass MODIFYCHANNEL requests that are
+ * unrelated to the CPU that is going offline. But the only effect is
+ * to potentially wait a little bit longer than necessary. CPUs going
+ * offline and affinity changes that result in MODIFYCHANNEL are
+ * relatively rare and it's not worth the complexity to track them more
+ * precisely.
+ */
+ base = READ_ONCE(vmbus_connection.modchan_sent);
+ while (READ_ONCE(vmbus_connection.modchan_completed) < base && i) {
+ usleep_range(5000, 10000);
+ i--;
+ }
+
+ if (i == 0)
+ pr_err("Timed out waiting for MODIFYCHANNEL. CPU %d sent %lld completed %lld\n",
+ cpu, base, vmbus_connection.modchan_completed);
+}
+
#define HV_MAX_TRIES 3
/*
* Scan the event flags page of 'this' CPU looking for any bit that is set. If we find one
@@ -485,13 +535,21 @@ int hv_synic_cleanup(unsigned int cpu)
/*
* channel_found == false means that any channels that were previously
* assigned to the CPU have been reassigned elsewhere with a call of
- * vmbus_send_modifychannel(). Scan the event flags page looking for
- * bits that are set and waiting with a timeout for vmbus_chan_sched()
- * to process such bits. If bits are still set after this operation
- * and VMBus is connected, fail the CPU offlining operation.
+ * vmbus_send_modifychannel(). First wait until any MODIFYCHANNEL
+ * requests have been completed by Hyper-V, after which we know that
+ * no new bits in the event flags will be set. Then scan the event flags
+ * page looking for bits that are set and waiting with a timeout for
+ * vmbus_chan_sched() to process such bits. If bits are still set
+ * after this operation, fail the CPU offlining operation.
*/
- if (vmbus_proto_version >= VERSION_WIN10_V4_1 && hv_synic_event_pending())
- return -EBUSY;
+ if (vmbus_proto_version >= VERSION_WIN10_V4_1) {
+ hv_synic_wait_for_modifychannel(cpu);
+ if (hv_synic_event_pending()) {
+ pr_err("Events pending when trying to offline CPU %d\n",
+ cpu);
+ return -EBUSY;
+ }
+ }

always_cleanup:
hv_stimer_legacy_cleanup(cpu);
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index bf35bb40c55e..571b2955b38e 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -264,6 +264,14 @@ struct vmbus_connection {
struct irq_domain *vmbus_irq_domain;
struct irq_chip vmbus_irq_chip;

+ /*
+ * VM-wide counts of MODIFYCHANNEL messages sent and completed.
+ * Used when taking a CPU offline to make sure the relevant
+ * MODIFYCHANNEL messages have been completed.
+ */
+ u64 modchan_sent;
+ u64 modchan_completed;
+
/*
* An offer message is handled first on the work_queue, and then
* is further handled on handle_primary_chan_wq or
--
2.25.1


2024-06-04 05:14:19

by Michael Kelley

[permalink] [raw]
Subject: [RFC 12/12] Drivers: hv: vmbus: Ensure IRQ affinity isn't set to a CPU going offline

From: Michael Kelley <[email protected]>

hv_synic_cleanup() currently prevents a CPU from going offline if any
VMBus channel IRQs are targeted at that CPU. However, current code has a
race in that an IRQ could be affinitized to the CPU after the check in
hv_synic_cleanup() and before the CPU is removed from cpu_online_mask.
Any channel interrupts could be lost and the channel would hang.

Fix this by adding a flag for each CPU indicating if the synic is online.
Filter the new affinity with these flags so that vmbus_irq_set_affinity()
doesn't pick a CPU where the synic is already offline.

Also add a spin lock so that vmbus_irq_set_affinity() changing the
channel target_cpu and sending the MODIFYCHANNEL message to Hyper-V
are atomic with respect to the checks made in hv_synic_cleanup().
hv_synic_cleanup() needs these operations to be atomic so that it
can correctly count the MODIFYCHANNEL messages that need to be
ack'ed by Hyper-V.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/hv/connection.c | 1 +
drivers/hv/hv.c | 22 ++++++++++++++++++++--
drivers/hv/hyperv_vmbus.h | 2 ++
drivers/hv/vmbus_drv.c | 34 ++++++++++++++++++++++++++++------
4 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index a105eecdeec2..b44ce3d39135 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -213,6 +213,7 @@ int vmbus_connect(void)

INIT_LIST_HEAD(&vmbus_connection.chn_list);
mutex_init(&vmbus_connection.channel_mutex);
+ spin_lock_init(&vmbus_connection.set_affinity_lock);

/*
* Setup the vmbus event connection for channel interrupt
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 76658dfc5008..89e491219129 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -338,6 +338,8 @@ int hv_synic_init(unsigned int cpu)
{
hv_synic_enable_regs(cpu);

+ cpumask_set_cpu(cpu, &vmbus_connection.synic_online);
+
hv_stimer_legacy_init(cpu, VMBUS_MESSAGE_SINT);

return 0;
@@ -513,6 +515,17 @@ int hv_synic_cleanup(unsigned int cpu)
* TODO: Re-bind the channels to different CPUs.
*/
mutex_lock(&vmbus_connection.channel_mutex);
+ spin_lock(&vmbus_connection.set_affinity_lock);
+
+ /*
+ * Once the check for channels assigned to this CPU is complete, we
+ * must not allow a channel to be assigned to this CPU. So mark
+ * the synic as no longer online. This cpumask is checked in
+ * vmbus_irq_set_affinity() to prevent setting the affinity of
+ * an IRQ to such a CPU.
+ */
+ cpumask_clear_cpu(cpu, &vmbus_connection.synic_online);
+
list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
if (channel->target_cpu == cpu) {
channel_found = true;
@@ -527,10 +540,11 @@ int hv_synic_cleanup(unsigned int cpu)
if (channel_found)
break;
}
+ spin_unlock(&vmbus_connection.set_affinity_lock);
mutex_unlock(&vmbus_connection.channel_mutex);

if (channel_found)
- return -EBUSY;
+ goto set_online;

/*
* channel_found == false means that any channels that were previously
@@ -547,7 +561,7 @@ int hv_synic_cleanup(unsigned int cpu)
if (hv_synic_event_pending()) {
pr_err("Events pending when trying to offline CPU %d\n",
cpu);
- return -EBUSY;
+ goto set_online;
}
}

@@ -557,4 +571,8 @@ int hv_synic_cleanup(unsigned int cpu)
hv_synic_disable_regs(cpu);

return 0;
+
+set_online:
+ cpumask_set_cpu(cpu, &vmbus_connection.synic_online);
+ return -EBUSY;
}
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 571b2955b38e..92ae5af10778 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -263,6 +263,8 @@ struct vmbus_connection {
struct fwnode_handle *vmbus_fwnode;
struct irq_domain *vmbus_irq_domain;
struct irq_chip vmbus_irq_chip;
+ cpumask_t synic_online;
+ spinlock_t set_affinity_lock;

/*
* VM-wide counts of MODIFYCHANNEL messages sent and completed.
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 87f2f3436136..3430ad42d7ba 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1351,6 +1351,14 @@ int vmbus_irq_set_affinity(struct irq_data *data,
return -EINVAL;
}

+ /*
+ * The spin lock must be held so that checking synic_online, sending
+ * the MODIFYCHANNEL message, and setting channel->target_cpu are
+ * atomic with respect to hv_synic_cleanup() clearing the CPU in
+ * synic_online and doing the search.
+ */
+ spin_lock(&vmbus_connection.set_affinity_lock);
+
/* Don't consider CPUs that are isolated */
if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
cpumask_and(&tempmask, dest,
@@ -1367,30 +1375,39 @@ int vmbus_irq_set_affinity(struct irq_data *data,
origin_cpu = channel->target_cpu;
if (cpumask_test_cpu(origin_cpu, &tempmask)) {
target_cpu = origin_cpu;
+ spin_unlock(&vmbus_connection.set_affinity_lock);
goto update_effective;
}

/*
* Pick a CPU from the new affinity mask. As a simple heuristic to
* spread out the selection when the mask contains multiple CPUs,
- * start with whatever CPU was last selected.
+ * start with whatever CPU was last selected. Also filter out any
+ * CPUs where synic_online isn't set -- these CPUs are in the process
+ * of going offline and must not have channel interrupts assigned
+ * to them.
*/
+ cpumask_and(&tempmask, &tempmask, &vmbus_connection.synic_online);
target_cpu = cpumask_next_wrap(next_cpu, &tempmask, nr_cpu_ids, false);
- if (target_cpu >= nr_cpu_ids)
- return -EINVAL;
+ if (target_cpu >= nr_cpu_ids) {
+ ret = -EINVAL;
+ goto unlock;
+ }
next_cpu = target_cpu;

/*
* Hyper-V will ignore MODIFYCHANNEL messages for "non-open" channels;
* avoid sending the message and fail here for such channels.
*/
- if (channel->state != CHANNEL_OPENED_STATE)
- return -EIO;
+ if (channel->state != CHANNEL_OPENED_STATE) {
+ ret = -EIO;
+ goto unlock;
+ }

ret = vmbus_send_modifychannel(channel,
hv_cpu_number_to_vp_number(target_cpu));
if (ret)
- return ret;
+ goto unlock;

/*
* Warning. At this point, there is *no* guarantee that the host will
@@ -1408,6 +1425,7 @@ int vmbus_irq_set_affinity(struct irq_data *data,
*/

channel->target_cpu = target_cpu;
+ spin_unlock(&vmbus_connection.set_affinity_lock);

/* See init_vp_index(). */
if (hv_is_perf_channel(channel))
@@ -1422,6 +1440,10 @@ int vmbus_irq_set_affinity(struct irq_data *data,
update_effective:
irq_data_update_effective_affinity(data, cpumask_of(target_cpu));
return IRQ_SET_MASK_OK;
+
+unlock:
+ spin_unlock(&vmbus_connection.set_affinity_lock);
+ return ret;
}

/*
--
2.25.1


2024-06-04 05:15:54

by Michael Kelley

[permalink] [raw]
Subject: [RFC 05/12] scsi: storvsc: Annotate the VMBus channel IRQ name

From: Michael Kelley <[email protected]>

In preparation for assigning Linux IRQs to VMBus channels, annotate the
IRQ name in the VMBus channels used by the virtual IDE, SCSI, and Fibre
Channel controllers in a Hyper-V guest. Distinguish the primary channel
and sub-channels with "pri" and "sub" prefixes. Identify controllers
with a numeric suffix that matches the numeric part of the "host<n>"
name displayed by "lsscsi" and SCSI-related entries in sysfs.

Signed-off-by: Michael Kelley <[email protected]>
---
drivers/scsi/storvsc_drv.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 7ceb982040a5..11b3fc3b24c9 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -686,6 +686,8 @@ static void handle_sc_creation(struct vmbus_channel *new_sc)
new_sc->max_pkt_size = STORVSC_MAX_PKT_SIZE;

new_sc->next_request_id_callback = storvsc_next_request_id;
+ snprintf(new_sc->irq_name, VMBUS_CHAN_IRQ_NAME_MAX, "sub@storvsc%d",
+ stor_device->host->host_no);

ret = vmbus_open(new_sc,
aligned_ringbuffer_size,
@@ -1322,6 +1324,7 @@ static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size,
bool is_fc)
{
struct vmstorage_channel_properties props;
+ struct storvsc_device *stor_device;
int ret;

memset(&props, 0, sizeof(struct vmstorage_channel_properties));
@@ -1329,6 +1332,12 @@ static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size,
device->channel->max_pkt_size = STORVSC_MAX_PKT_SIZE;
device->channel->next_request_id_callback = storvsc_next_request_id;

+ stor_device = get_out_stor_device(device);
+ if (!stor_device)
+ return -EINVAL;
+ snprintf(device->channel->irq_name, VMBUS_CHAN_IRQ_NAME_MAX, "pri@storvsc%d",
+ stor_device->host->host_no);
+
ret = vmbus_open(device->channel,
ring_size,
ring_size,
--
2.25.1


2024-06-04 18:14:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

Michael!

On Mon, Jun 03 2024 at 22:09, [email protected] wrote:
> Hyper-V VMBus devices generate interrupts that are multiplexed
> onto a single per-CPU architectural interrupt. The top-level VMBus
> driver ISR demultiplexes these interrupts and invokes per-device
> handlers. Currently, these per-device handlers are not modeled as
> Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
> to the top level architectural interrupt. Visibility into per-device
> interrupt stats requires accessing VMBus-specific entries in sysfs.
> The top-level VMBus driver ISR also handles management-related
> interrupts that are not attributable to a particular VMBus device.
>
> As part of changing VMBus to model VMBus per-device handlers as
> normal Linux IRQs, the top-level VMBus driver needs to conditionally
> account for interrupts. If it passes the interrupt off to a
> device-specific IRQ, the interrupt stats are done by that IRQ
> handler, and accounting for the interrupt at the top level
> is duplicative. But if it handles a management-related interrupt
> itself, then it should account for the interrupt itself.
>
> Introduce a new flow handler that provides this functionality.
> The new handler parallels handle_percpu_irq(), but does stats
> only if the ISR returns other than IRQ_NONE. The existing
> handle_untracked_irq() can't be used because it doesn't work for
> per-cpu IRQs, and it doesn't provide conditional stats.

There is a two other options to solve this:

1) Move the inner workings of handle_percpu_irq() out into
a static function which returns the 'handled' value and
share it between the two handler functions.

2) Allocate a proper interrupt for the management mode and invoke it
via generic_handle_irq() just as any other demultiplex interrupt.
That spares all the special casing in the core code and just
works.

Thanks,

tglx

2024-06-04 23:03:45

by Michael Kelley

[permalink] [raw]
Subject: RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

From: Thomas Gleixner <[email protected]> Sent: Tuesday, June 4, 2024 11:14 AM
>
> Michael!
>
> On Mon, Jun 03 2024 at 22:09, [email protected] wrote:
> > Hyper-V VMBus devices generate interrupts that are multiplexed
> > onto a single per-CPU architectural interrupt. The top-level VMBus
> > driver ISR demultiplexes these interrupts and invokes per-device
> > handlers. Currently, these per-device handlers are not modeled as
> > Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
> > to the top level architectural interrupt. Visibility into per-device
> > interrupt stats requires accessing VMBus-specific entries in sysfs.
> > The top-level VMBus driver ISR also handles management-related
> > interrupts that are not attributable to a particular VMBus device.
> >
> > As part of changing VMBus to model VMBus per-device handlers as
> > normal Linux IRQs, the top-level VMBus driver needs to conditionally
> > account for interrupts. If it passes the interrupt off to a
> > device-specific IRQ, the interrupt stats are done by that IRQ
> > handler, and accounting for the interrupt at the top level
> > is duplicative. But if it handles a management-related interrupt
> > itself, then it should account for the interrupt itself.
> >
> > Introduce a new flow handler that provides this functionality.
> > The new handler parallels handle_percpu_irq(), but does stats
> > only if the ISR returns other than IRQ_NONE. The existing
> > handle_untracked_irq() can't be used because it doesn't work for
> > per-cpu IRQs, and it doesn't provide conditional stats.
>
> There is a two other options to solve this:
>

Thanks for taking a look. Unfortunately, unless I'm missing something,
both options you suggest have downsides.

> 1) Move the inner workings of handle_percpu_irq() out into
> a static function which returns the 'handled' value and
> share it between the two handler functions.

The "inner workings" aren't quite the same in the two cases.
handle_percpu_irq() uses handle_irq_event_percpu() while
handle_percpu_demux_irq() uses __handle_irq_event_percpu().
The latter doesn't do add_interrupt_randomness() because the
demultiplexed IRQ handler will do it. Doing add_interrupt_randomness()
twice doesn't break anything, but it's more overhead in the hard irq
path, which I'm trying to avoid. The extra functionality in the
non-double-underscore version could be hoisted up to
handle_percpu_irq(), but that offsets gains from sharing the
inner workings.

>
> 2) Allocate a proper interrupt for the management mode and invoke it
> via generic_handle_irq() just as any other demultiplex interrupt.
> That spares all the special casing in the core code and just
> works.

Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
and the interrupt counting is done in Hyper-V specific code that could be
removed. The demux'ed interrupt does the counting.

But on arm64 the top-level interrupt *is* a Linux IRQ, so each
interrupt will get double-counted, which is a problem. Having to add
handle_percpu_demux_irq() to handle arm64 correctly isn't as clean
as I wish it could be. But I couldn't find a better approach.

Michael

2024-06-05 13:42:02

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

On Tue, Jun 04 2024 at 23:03, Michael Kelley wrote:
> From: Thomas Gleixner <[email protected]> Sent: Tuesday, June 4, 2024 11:14 AM
>> 1) Move the inner workings of handle_percpu_irq() out into
>> a static function which returns the 'handled' value and
>> share it between the two handler functions.
>
> The "inner workings" aren't quite the same in the two cases.
> handle_percpu_irq() uses handle_irq_event_percpu() while
> handle_percpu_demux_irq() uses __handle_irq_event_percpu().
> The latter doesn't do add_interrupt_randomness() because the
> demultiplexed IRQ handler will do it. Doing add_interrupt_randomness()
> twice doesn't break anything, but it's more overhead in the hard irq
> path, which I'm trying to avoid. The extra functionality in the
> non-double-underscore version could be hoisted up to
> handle_percpu_irq(), but that offsets gains from sharing the
> inner workings.

That's not rocket science to solve:

static irqreturn_t helper(desc, func)
{
boiler_plate..
ret = func(desc)
boiler_plate..
return ret;
}

No?

TBH, I still hate that conditional accounting :)

>> 2) Allocate a proper interrupt for the management mode and invoke it
>> via generic_handle_irq() just as any other demultiplex interrupt.
>> That spares all the special casing in the core code and just
>> works.
>
> Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
> and the interrupt counting is done in Hyper-V specific code that could be
> removed. The demux'ed interrupt does the counting.
>
> But on arm64 the top-level interrupt *is* a Linux IRQ, so each
> interrupt will get double-counted, which is a problem.

What is the problem?

You have: toplevel, mgmt, device[], right?

They are all accounted for seperately and each toplevel interrupt might
result in demultiplexing one or more interrupts (mgmt, device[]), no?

IMO accounting the toplevel interrupt seperately is informative because
it allows you to figure out whether demultiplexing is clustered or not,
but I lost that argument long ago. That's why most demultiplex muck
installs a chained handler, which is a design fail on it's own.

Thanks,

tglx


2024-06-05 14:31:52

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> From: Thomas Gleixner <[email protected]> Sent: Wednesday, June 5, 2024 6:20 AM
>
> In /proc/interrupts, the double-counting isn't a problem, and is
> potentially helpful as you say. But /proc/stat, for example, shows a total
> interrupt count, which will be roughly double what it was before. That
> /proc/stat value then shows up in user space in vmstat, for example.
> That's what I was concerned about, though it's not a huge problem in
> the grand scheme of things.

That's trivial to solve. We can mark interrupts to be excluded from
/proc/stat accounting.

Thanks,

tglx

2024-06-05 14:36:37

by Michael Kelley

[permalink] [raw]
Subject: RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

From: Thomas Gleixner <[email protected]> Sent: Wednesday, June 5, 2024 6:20 AM
>
> On Tue, Jun 04 2024 at 23:03, Michael Kelley wrote:
> > From: Thomas Gleixner <[email protected]> Sent: Tuesday, June 4, 2024 11:14 AM
> >> 1) Move the inner workings of handle_percpu_irq() out into
> >> a static function which returns the 'handled' value and
> >> share it between the two handler functions.
> >
> > The "inner workings" aren't quite the same in the two cases.
> > handle_percpu_irq() uses handle_irq_event_percpu() while
> > handle_percpu_demux_irq() uses __handle_irq_event_percpu().
> > The latter doesn't do add_interrupt_randomness() because the
> > demultiplexed IRQ handler will do it. Doing add_interrupt_randomness()
> > twice doesn't break anything, but it's more overhead in the hard irq
> > path, which I'm trying to avoid. The extra functionality in the
> > non-double-underscore version could be hoisted up to
> > handle_percpu_irq(), but that offsets gains from sharing the
> > inner workings.
>
> That's not rocket science to solve:
>
> static irqreturn_t helper(desc, func)
> {
> boiler_plate..
> ret = func(desc)
> boiler_plate..
> return ret;
> }
>
> No?
>
> TBH, I still hate that conditional accounting :)
>
> >> 2) Allocate a proper interrupt for the management mode and invoke it
> >> via generic_handle_irq() just as any other demultiplex interrupt.
> >> That spares all the special casing in the core code and just
> >> works.
> >
> > Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
> > and the interrupt counting is done in Hyper-V specific code that could be
> > removed. The demux'ed interrupt does the counting.
> >
> > But on arm64 the top-level interrupt *is* a Linux IRQ, so each
> > interrupt will get double-counted, which is a problem.
>
> What is the problem?
>
> You have: toplevel, mgmt, device[], right?
>
> They are all accounted for seperately and each toplevel interrupt might
> result in demultiplexing one or more interrupts (mgmt, device[]), no?
>
> IMO accounting the toplevel interrupt seperately is informative because
> it allows you to figure out whether demultiplexing is clustered or not,
> but I lost that argument long ago. That's why most demultiplex muck
> installs a chained handler, which is a design fail on it's own.

In /proc/interrupts, the double-counting isn't a problem, and is
potentially helpful as you say. But /proc/stat, for example, shows a total
interrupt count, which will be roughly double what it was before. That
/proc/stat value then shows up in user space in vmstat, for example.
That's what I was concerned about, though it's not a huge problem in
the grand scheme of things.

Michael

2024-06-06 03:15:19

by Michael Kelley

[permalink] [raw]
Subject: RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

From: Thomas Gleixner <[email protected]> Sent: Wednesday, June 5, 2024 7:20 AM
>
> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> > From: Thomas Gleixner <[email protected]> Sent: Wednesday, June 5, 2024 6:20 AM
> >
> > In /proc/interrupts, the double-counting isn't a problem, and is
> > potentially helpful as you say. But /proc/stat, for example, shows a total
> > interrupt count, which will be roughly double what it was before. That
> > /proc/stat value then shows up in user space in vmstat, for example.
> > That's what I was concerned about, though it's not a huge problem in
> > the grand scheme of things.
>
> That's trivial to solve. We can mark interrupts to be excluded from
> /proc/stat accounting.
>

OK. On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
out the HYP interrupts. But what do you envision on arm64, where
there is no arch_irq_stat_cpu()? On arm64, the top-level interrupt is a
normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
with no breakout by IRQ. Identifying the right IRQ and subtracting it
out later looks a lot uglier than the conditional stats accounting.

Or if there's some other approach I'm missing, please enlighten me!

Michael

2024-06-06 09:34:50

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote:
> From: Thomas Gleixner <[email protected]> Sent: Wednesday, June 5, 2024 7:20 AM
>>
>> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
>> > From: Thomas Gleixner <[email protected]> Sent: Wednesday, June 5, 2024 6:20 AM
>> >
>> > In /proc/interrupts, the double-counting isn't a problem, and is
>> > potentially helpful as you say. But /proc/stat, for example, shows a total
>> > interrupt count, which will be roughly double what it was before. That
>> > /proc/stat value then shows up in user space in vmstat, for example.
>> > That's what I was concerned about, though it's not a huge problem in
>> > the grand scheme of things.
>>
>> That's trivial to solve. We can mark interrupts to be excluded from
>> /proc/stat accounting.
>>
>
> OK. On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
> out the HYP interrupts. But what do you envision on arm64, where
> there is no arch_irq_stat_cpu()? On arm64, the top-level interrupt is a
> normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
> with no breakout by IRQ. Identifying the right IRQ and subtracting it
> out later looks a lot uglier than the conditional stats accounting.

Sure. There are two ways to solve that:

1) Introduce a IRQ_NO_PER_CPU_STATS flag, mark the interrupt
accordingly and make the stats increment conditional on it.
The downside is that the conditional affects every interrupt.

2) Do something like this:

static inline
void __handle_percpu_irq(struct irq_desc *desc, irqreturn_t (*handle)(struct irq_desc *))
{
struct irq_chip *chip = irq_desc_get_chip(desc);

if (chip->irq_ack)
chip->irq_ack(&desc->irq_data);

handle(desc);

if (chip->irq_eoi)
chip->irq_eoi(&desc->irq_data);
}

void handle_percpu_irq(struct irq_desc *desc)
{
/*
* PER CPU interrupts are not serialized. Do not touch
* desc->tot_count.
*/
__kstat_incr_irqs_this_cpu(desc);
__handle_percpu_irq(desc, handle_irq_event_percpu);
}

void handle_percpu_irq_nostat(struct irq_desc *desc)
{
__this_cpu_inc(desc->kstat_irqs->cnt);
__handle_percpu_irq(desc, __handle_irq_event_percpu);
}

So that keeps the interrupt accounted for in /proc/interrupts. If you
don't want that remove the __this_cpu_inc() and mark the interrupt with
irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from
/proc/interrupts too.

Thanks,

tglx

2024-06-06 15:37:05

by Michael Kelley

[permalink] [raw]
Subject: RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

From: Thomas Gleixner <[email protected]> Sent: Thursday, June 6, 2024 2:34 AM
>
> On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote:
> > From: Thomas Gleixner <[email protected]> Sent: Wednesday, June 5, 2024 7:20 AM
> >>
> >> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> >> > From: Thomas Gleixner <[email protected]> Sent: Wednesday, June 5, 2024 6:20 AM
> >> >
> >> > In /proc/interrupts, the double-counting isn't a problem, and is
> >> > potentially helpful as you say. But /proc/stat, for example, shows a total
> >> > interrupt count, which will be roughly double what it was before. That
> >> > /proc/stat value then shows up in user space in vmstat, for example.
> >> > That's what I was concerned about, though it's not a huge problem in
> >> > the grand scheme of things.
> >>
> >> That's trivial to solve. We can mark interrupts to be excluded from
> >> /proc/stat accounting.
> >>
> >
> > OK. On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
> > out the HYP interrupts. But what do you envision on arm64, where
> > there is no arch_irq_stat_cpu()? On arm64, the top-level interrupt is a
> > normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
> > with no breakout by IRQ. Identifying the right IRQ and subtracting it
> > out later looks a lot uglier than the conditional stats accounting.
>
> Sure. There are two ways to solve that:
>
> 1) Introduce a IRQ_NO_PER_CPU_STATS flag, mark the interrupt
> accordingly and make the stats increment conditional on it.
> The downside is that the conditional affects every interrupt.
>
> 2) Do something like this:
>
> static inline
> void __handle_percpu_irq(struct irq_desc *desc, irqreturn_t (*handle)(struct irq_desc
> *))
> {
> struct irq_chip *chip = irq_desc_get_chip(desc);
>
> if (chip->irq_ack)
> chip->irq_ack(&desc->irq_data);
>
> handle(desc);
>
> if (chip->irq_eoi)
> chip->irq_eoi(&desc->irq_data);
> }
>
> void handle_percpu_irq(struct irq_desc *desc)
> {
> /*
> * PER CPU interrupts are not serialized. Do not touch
> * desc->tot_count.
> */
> __kstat_incr_irqs_this_cpu(desc);
> __handle_percpu_irq(desc, handle_irq_event_percpu);
> }
>
> void handle_percpu_irq_nostat(struct irq_desc *desc)
> {
> __this_cpu_inc(desc->kstat_irqs->cnt);
> __handle_percpu_irq(desc, __handle_irq_event_percpu);
> }
>
> So that keeps the interrupt accounted for in /proc/interrupts. If you
> don't want that remove the __this_cpu_inc() and mark the interrupt with
> irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from
> /proc/interrupts too.
>

Yes, this works for not double-counting in the first place. Account for the
control message interrupts in their own Linux IRQ. Then for the top-level
interrupt, instead of adding a new handler with conditional accounting,
add a new per-CPU handler that does no accounting. I had not noticed
the IRQ_HIDDEN flag, and that solves my concern about having an
entry in /proc/interrupts that always shows zero interrupts. And with
no double-counting, the interrupt counts in /proc/stat won't be bloated.

On x86, I'll have to separately make the "HYP" line in /proc/interrupts
go away, but that's easy.

Thanks,

Michael