Hi all,
This series introduces changes to the VMBus drivers to reassign (at
runtime) the CPU that a VMBus channel will interrupt. This feature
can be used for load balancing or other purposes, e.g. to offline a
given vCPU (vCPUs with no interrupts assigned can be taken offline).
Thanks,
Andrea
Cc: "David S. Miller" <[email protected]>
Cc: Lorenzo Pieralisi <[email protected]>
Cc: Andrew Murray <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Andrea Parri (Microsoft) (11):
Drivers: hv: vmbus: Always handle the VMBus messages on CPU0
Drivers: hv: vmbus: Don't bind the offer&rescind works to a specific
CPU
Drivers: hv: vmbus: Replace the per-CPU channel lists with a global
array of channels
hv_netvsc: Disable NAPI before closing the VMBus channel
hv_utils: Always execute the fcopy and vss callbacks in a tasklet
Drivers: hv: vmbus: Use a spin lock for synchronizing channel
scheduling vs. channel removal
PCI: hv: Prepare hv_compose_msi_msg() for the
VMBus-channel-interrupt-to-vCPU reassignment functionality
Drivers: hv: vmbus: Remove the unused HV_LOCALIZED channel affinity
logic
Drivers: hv: vmbus: Synchronize init_vp_index() vs. CPU hotplug
Drivers: hv: vmbus: Introduce the CHANNELMSG_MODIFYCHANNEL message
type
scsi: storvsc: Re-init stor_chns when a channel interrupt is
re-assigned
drivers/hv/channel.c | 58 +++--
drivers/hv/channel_mgmt.c | 326 ++++++++++++++--------------
drivers/hv/connection.c | 58 +----
drivers/hv/hv.c | 16 +-
drivers/hv/hv_fcopy.c | 2 +-
drivers/hv/hv_snapshot.c | 2 +-
drivers/hv/hv_trace.h | 19 ++
drivers/hv/hyperv_vmbus.h | 32 ++-
drivers/hv/vmbus_drv.c | 241 ++++++++++++++++----
drivers/net/hyperv/netvsc.c | 7 +-
drivers/pci/controller/pci-hyperv.c | 44 ++--
drivers/scsi/storvsc_drv.c | 95 +++++++-
include/linux/hyperv.h | 51 ++---
13 files changed, 599 insertions(+), 352 deletions(-)
--
2.24.0
A Linux guest has to pick a "connect CPU" to communicate with the
Hyper-V host. This CPU cannot be taken offline because Hyper-V does
not provide a way to change that CPU assignment.
The current code sets the connect CPU to whatever CPU ends up running
vmbus_negotiate_version(), and this will generate problems if that
CPU is later taken offline.
Establish CPU0 as the connect CPU, and add logic to prevent the
connect CPU from being taken offline. We could pick some other CPU,
and we could pick that "other CPU" dynamically if there was a reason to
do so at some point in the future. But for now, #defining the connect
CPU to 0 is the most straightforward and least complex solution.
While at it, add inline comments explaining why the offer and rescind
messages should not be handled by the same serialized work queue.
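For reference, the offlining veto relies on the CPU hotplug teardown
callback returning an error. Below is a minimal, self-contained sketch
of that mechanism (the example_* names and the dynamic hotplug state are
made up for illustration; the actual change is the -EBUSY check added to
hv_synic_cleanup() in the diff):

#include <linux/cpu.h>
#include <linux/cpuhotplug.h>
#include <linux/errno.h>
#include <linux/init.h>

#define EXAMPLE_PINNED_CPU 0    /* stands in for VMBUS_CONNECT_CPU */

static int example_cpu_online(unsigned int cpu)
{
        return 0;       /* nothing to do when a CPU comes up */
}

static int example_cpu_offline(unsigned int cpu)
{
        /*
         * Returning an error from the teardown callback makes the CPU
         * hotplug core abort (and roll back) the offlining of @cpu.
         */
        if (cpu == EXAMPLE_PINNED_CPU)
                return -EBUSY;
        return 0;
}

static int __init example_init(void)
{
        int ret;

        ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
                                        "example:online",
                                        example_cpu_online,
                                        example_cpu_offline);
        return ret < 0 ? ret : 0;
}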
Suggested-by: Dexuan Cui <[email protected]>
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
---
drivers/hv/connection.c | 20 +-------------------
drivers/hv/hv.c | 7 +++++++
drivers/hv/hyperv_vmbus.h | 11 ++++++-----
drivers/hv/vmbus_drv.c | 20 +++++++++++++++++---
4 files changed, 31 insertions(+), 27 deletions(-)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 74e77de89b4f3..f4bd306d2cef9 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -69,7 +69,6 @@ MODULE_PARM_DESC(max_version,
int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
{
int ret = 0;
- unsigned int cur_cpu;
struct vmbus_channel_initiate_contact *msg;
unsigned long flags;
@@ -102,24 +101,7 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
- /*
- * We want all channel messages to be delivered on CPU 0.
- * This has been the behavior pre-win8. This is not
- * perf issue and having all channel messages delivered on CPU 0
- * would be ok.
- * For post win8 hosts, we support receiving channel messagges on
- * all the CPUs. This is needed for kexec to work correctly where
- * the CPU attempting to connect may not be CPU 0.
- */
- if (version >= VERSION_WIN8_1) {
- cur_cpu = get_cpu();
- msg->target_vcpu = hv_cpu_number_to_vp_number(cur_cpu);
- vmbus_connection.connect_cpu = cur_cpu;
- put_cpu();
- } else {
- msg->target_vcpu = 0;
- vmbus_connection.connect_cpu = 0;
- }
+ msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
/*
* Add to list before we send the request since we may
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 6098e0cbdb4b0..e2b3310454640 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -249,6 +249,13 @@ int hv_synic_cleanup(unsigned int cpu)
bool channel_found = false;
unsigned long flags;
+ /*
+ * Hyper-V does not provide a way to change the connect CPU once
+ * it is set; we must prevent the connect CPU from going offline.
+ */
+ if (cpu == VMBUS_CONNECT_CPU)
+ return -EBUSY;
+
/*
* Search for channels which are bound to the CPU we're about to
* cleanup. In case we find one and vmbus is still connected we need to
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 70b30e223a578..67fb1edcbf527 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -212,12 +212,13 @@ enum vmbus_connect_state {
#define MAX_SIZE_CHANNEL_MESSAGE HV_MESSAGE_PAYLOAD_BYTE_COUNT
-struct vmbus_connection {
- /*
- * CPU on which the initial host contact was made.
- */
- int connect_cpu;
+/*
+ * The CPU that Hyper-V will interrupt for VMBUS messages, such as
+ * CHANNELMSG_OFFERCHANNEL and CHANNELMSG_RESCIND_CHANNELOFFER.
+ */
+#define VMBUS_CONNECT_CPU 0
+struct vmbus_connection {
u32 msg_conn_id;
atomic_t offer_in_progress;
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 029378c27421d..7600615e13754 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1056,14 +1056,28 @@ void vmbus_on_msg_dpc(unsigned long data)
/*
* If we are handling the rescind message;
* schedule the work on the global work queue.
+ *
+ * The OFFER message and the RESCIND message should
+ * not be handled by the same serialized work queue,
+ * because the OFFER handler may call vmbus_open(),
+ * which tries to open the channel by sending an
+ * OPEN_CHANNEL message to the host and waits for
+ * the host's response; however, if the host has
+ * rescinded the channel before it receives the
+ * OPEN_CHANNEL message, the host just silently
+ * ignores the OPEN_CHANNEL message; as a result,
+ * the guest's OFFER handler hangs for ever, if we
+ * handle the RESCIND message in the same serialized
+ * work queue: the RESCIND handler can not start to
+ * run before the OFFER handler finishes.
*/
- schedule_work_on(vmbus_connection.connect_cpu,
+ schedule_work_on(VMBUS_CONNECT_CPU,
&ctx->work);
break;
case CHANNELMSG_OFFERCHANNEL:
atomic_inc(&vmbus_connection.offer_in_progress);
- queue_work_on(vmbus_connection.connect_cpu,
+ queue_work_on(VMBUS_CONNECT_CPU,
vmbus_connection.work_queue,
&ctx->work);
break;
@@ -1110,7 +1124,7 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
INIT_WORK(&ctx->work, vmbus_onmessage_work);
- queue_work_on(vmbus_connection.connect_cpu,
+ queue_work_on(VMBUS_CONNECT_CPU,
vmbus_connection.work_queue,
&ctx->work);
}
--
2.24.0
The offer and rescind works are currently scheduled on the so-called
"connect CPU". However, this is not really needed: the works can be
synchronized by relying on the offer_in_progress counter and on the
channel_mutex mutex, and this synchronization is already in place.
So, remove the unnecessary "bind to the connect CPU" constraint and
update the inline comments accordingly.
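To illustrate the synchronization this patch relies upon, here is a
simplified, self-contained sketch of the offer_in_progress/channel_mutex
pattern (the example_* names are invented; the real code and a matching
diagram are in the diff below):

#include <linux/atomic.h>
#include <linux/delay.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/types.h>

static LIST_HEAD(example_chn_list);
static DEFINE_MUTEX(example_channel_mutex);
static atomic_t example_offer_in_progress = ATOMIC_INIT(1);

struct example_channel {
        u32 relid;
        struct list_head listentry;
};

/* "Offer" side: publish the channel under the mutex. */
static void example_process_offer(struct example_channel *chan)
{
        mutex_lock(&example_channel_mutex);
        atomic_dec(&example_offer_in_progress);
        list_add_tail(&chan->listentry, &example_chn_list);
        mutex_unlock(&example_channel_mutex);
}

/* "Rescind" side: wait for in-flight offers, then search under the mutex. */
static struct example_channel *example_find_for_rescind(u32 relid)
{
        struct example_channel *chan, *found = NULL;

        while (atomic_read(&example_offer_in_progress) != 0)
                msleep(1);

        mutex_lock(&example_channel_mutex);
        list_for_each_entry(chan, &example_chn_list, listentry) {
                if (chan->relid == relid) {
                        found = chan;
                        break;
                }
        }
        mutex_unlock(&example_channel_mutex);

        return found;
}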
Suggested-by: Dexuan Cui <[email protected]>
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
---
drivers/hv/channel_mgmt.c | 21 ++++++++++++++++-----
drivers/hv/vmbus_drv.c | 39 ++++++++++++++++++++++++++++-----------
2 files changed, 44 insertions(+), 16 deletions(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 0370364169c4e..1191f3d76d111 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -1025,11 +1025,22 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
* offer comes in first and then the rescind.
* Since we process these events in work elements,
* and with preemption, we may end up processing
- * the events out of order. Given that we handle these
- * work elements on the same CPU, this is possible only
- * in the case of preemption. In any case wait here
- * until the offer processing has moved beyond the
- * point where the channel is discoverable.
+ * the events out of order. We rely on the synchronization
+ * provided by offer_in_progress and by channel_mutex for
+ * ordering these events:
+ *
+ * { Initially: offer_in_progress = 1 }
+ *
+ * CPU1 CPU2
+ *
+ * [vmbus_process_offer()] [vmbus_onoffer_rescind()]
+ *
+ * LOCK channel_mutex WAIT_ON offer_in_progress == 0
+ * DECREMENT offer_in_progress LOCK channel_mutex
+ * INSERT chn_list SEARCH chn_list
+ * UNLOCK channel_mutex UNLOCK channel_mutex
+ *
+ * Forbids: CPU2's SEARCH from *not* seeing CPU1's INSERT
*/
while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 7600615e13754..903b1ec6a259e 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1048,8 +1048,9 @@ void vmbus_on_msg_dpc(unsigned long data)
/*
* The host can generate a rescind message while we
* may still be handling the original offer. We deal with
- * this condition by ensuring the processing is done on the
- * same CPU.
+ * this condition by relying on the synchronization provided
+ * by offer_in_progress and by channel_mutex. See also the
+ * inline comments in vmbus_onoffer_rescind().
*/
switch (hdr->msgtype) {
case CHANNELMSG_RESCIND_CHANNELOFFER:
@@ -1071,16 +1072,34 @@ void vmbus_on_msg_dpc(unsigned long data)
* work queue: the RESCIND handler can not start to
* run before the OFFER handler finishes.
*/
- schedule_work_on(VMBUS_CONNECT_CPU,
- &ctx->work);
+ schedule_work(&ctx->work);
break;
case CHANNELMSG_OFFERCHANNEL:
+ /*
+ * The host sends the offer message of a given channel
+ * before sending the rescind message of the same
+ * channel. These messages are sent to the guest's
+ * connect CPU; the guest then starts processing them
+ * in the tasklet handler on this CPU:
+ *
+ * VMBUS_CONNECT_CPU
+ *
+ * [vmbus_on_msg_dpc()]
+ * atomic_inc() // CHANNELMSG_OFFERCHANNEL
+ * queue_work()
+ * ...
+ * [vmbus_on_msg_dpc()]
+ * schedule_work() // CHANNELMSG_RESCIND_CHANNELOFFER
+ *
+ * We rely on the memory-ordering properties of the
+ * queue_work() and schedule_work() primitives, which
+ * guarantee that the atomic increment will be visible
+ * to the CPUs which will execute the offer & rescind
+ * works by the time these works will start execution.
+ */
atomic_inc(&vmbus_connection.offer_in_progress);
- queue_work_on(VMBUS_CONNECT_CPU,
- vmbus_connection.work_queue,
- &ctx->work);
- break;
+ fallthrough;
default:
queue_work(vmbus_connection.work_queue, &ctx->work);
@@ -1124,9 +1143,7 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
INIT_WORK(&ctx->work, vmbus_onmessage_work);
- queue_work_on(VMBUS_CONNECT_CPU,
- vmbus_connection.work_queue,
- &ctx->work);
+ queue_work(vmbus_connection.work_queue, &ctx->work);
}
#endif /* CONFIG_PM_SLEEP */
--
2.24.0
When Hyper-V sends an interrupt to the guest, the guest has to figure
out which channel the interrupt is associated with. Hyper-V sets a bit
in a memory page that is shared with the guest, indicating a particular
"relid" that the interrupt is associated with. The current Linux code
then uses a set of per-CPU linked lists to map a given "relid" to a
pointer to a channel structure.
This design introduces a synchronization problem if the CPU that Hyper-V
will interrupt for a certain channel is changed. If the interrupt comes
on the "old CPU" and the channel was already moved to the per-CPU list
of the "new CPU", then the relid -> channel mapping will fail and the
interrupt is dropped. Similarly, if the interrupt comes on the new CPU
but the channel was not moved to the per-CPU list of the new CPU, then
the mapping will fail and the interrupt is dropped.
Relids are integers ranging from 0 to 2047. The mapping from relids to
channel structures can be done by setting up an array with 2048 entries,
each entry being a pointer to a channel structure (hence total size ~16K
bytes, which is not a problem). The array is global, so there are no
per-CPU linked lists to update. The array can be searched and updated
by simply loading from and storing to the array at the specified
index. With no per-CPU data structures, the above-mentioned
synchronization problem is avoided and relid2channel() becomes simpler.
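The scheme boils down to the following sketch (illustrative only; the
example_* names are made up, and the real helpers are
vmbus_channel_map_relid(), vmbus_channel_unmap_relid() and
relid2channel() in the diff below):

#include <linux/atomic.h>
#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/types.h>

#define EXAMPLE_MAX_RELIDS 2048         /* relids range from 0 to 2047 */

struct example_channel;                 /* opaque in this sketch */

static struct example_channel **example_channels;      /* global array */

static int example_alloc_map(void)
{
        example_channels = kcalloc(EXAMPLE_MAX_RELIDS,
                                   sizeof(*example_channels), GFP_KERNEL);
        return example_channels ? 0 : -ENOMEM;
}

static void example_map_relid(u32 relid, struct example_channel *chan)
{
        if (WARN_ON(relid >= EXAMPLE_MAX_RELIDS))
                return;
        /* Release: readers that see the pointer also see the channel data. */
        smp_store_release(&example_channels[relid], chan);
}

static void example_unmap_relid(u32 relid)
{
        if (WARN_ON(relid >= EXAMPLE_MAX_RELIDS))
                return;
        WRITE_ONCE(example_channels[relid], NULL);
}

static struct example_channel *example_relid2channel(u32 relid)
{
        if (WARN_ON(relid >= EXAMPLE_MAX_RELIDS))
                return NULL;
        return READ_ONCE(example_channels[relid]);
}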
Suggested-by: Michael Kelley <[email protected]>
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
---
drivers/hv/channel_mgmt.c | 158 ++++++++++++++++++++++----------------
drivers/hv/connection.c | 38 +++------
drivers/hv/hv.c | 2 -
drivers/hv/hyperv_vmbus.h | 14 ++--
drivers/hv/vmbus_drv.c | 48 +++++++-----
include/linux/hyperv.h | 5 --
6 files changed, 139 insertions(+), 126 deletions(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 1191f3d76d111..9b1449c839575 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -319,7 +319,6 @@ static struct vmbus_channel *alloc_channel(void)
init_completion(&channel->rescind_event);
INIT_LIST_HEAD(&channel->sc_list);
- INIT_LIST_HEAD(&channel->percpu_list);
tasklet_init(&channel->callback_event,
vmbus_on_event, (unsigned long)channel);
@@ -340,23 +339,28 @@ static void free_channel(struct vmbus_channel *channel)
kobject_put(&channel->kobj);
}
-static void percpu_channel_enq(void *arg)
+void vmbus_channel_map_relid(struct vmbus_channel *channel)
{
- struct vmbus_channel *channel = arg;
- struct hv_per_cpu_context *hv_cpu
- = this_cpu_ptr(hv_context.cpu_context);
-
- list_add_tail_rcu(&channel->percpu_list, &hv_cpu->chan_list);
+ if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
+ return;
+ /*
+ * Pairs with the READ_ONCE() in vmbus_chan_sched(). Guarantees
+ * that vmbus_chan_sched() will find up-to-date data.
+ */
+ smp_store_release(
+ &vmbus_connection.channels[channel->offermsg.child_relid],
+ channel);
}
-static void percpu_channel_deq(void *arg)
+void vmbus_channel_unmap_relid(struct vmbus_channel *channel)
{
- struct vmbus_channel *channel = arg;
-
- list_del_rcu(&channel->percpu_list);
+ if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
+ return;
+ WRITE_ONCE(
+ vmbus_connection.channels[channel->offermsg.child_relid],
+ NULL);
}
-
static void vmbus_release_relid(u32 relid)
{
struct vmbus_channel_relid_released msg;
@@ -376,17 +380,25 @@ void hv_process_channel_removal(struct vmbus_channel *channel)
struct vmbus_channel *primary_channel;
unsigned long flags;
- BUG_ON(!mutex_is_locked(&vmbus_connection.channel_mutex));
+ lockdep_assert_held(&vmbus_connection.channel_mutex);
BUG_ON(!channel->rescind);
- if (channel->target_cpu != get_cpu()) {
- put_cpu();
- smp_call_function_single(channel->target_cpu,
- percpu_channel_deq, channel, true);
- } else {
- percpu_channel_deq(channel);
- put_cpu();
- }
+ /*
+ * hv_process_channel_removal() could find INVALID_RELID only for
+ * hv_sock channels. See the inline comments in vmbus_onoffer().
+ */
+ WARN_ON(channel->offermsg.child_relid == INVALID_RELID &&
+ !is_hvsock_channel(channel));
+
+ /*
+ * Upon suspend, an in-use hv_sock channel is removed from the array of
+ * channels and the relid is invalidated. After hibernation, when the
+ * user-space appplication destroys the channel, it's unnecessary and
+ * unsafe to remove the channel from the array of channels. See also
+ * the inline comments before the call of vmbus_release_relid() below.
+ */
+ if (channel->offermsg.child_relid != INVALID_RELID)
+ vmbus_channel_unmap_relid(channel);
if (channel->primary_channel == NULL) {
list_del(&channel->listentry);
@@ -447,16 +459,6 @@ static void vmbus_add_channel_work(struct work_struct *work)
init_vp_index(newchannel, dev_type);
- if (newchannel->target_cpu != get_cpu()) {
- put_cpu();
- smp_call_function_single(newchannel->target_cpu,
- percpu_channel_enq,
- newchannel, true);
- } else {
- percpu_channel_enq(newchannel);
- put_cpu();
- }
-
/*
* This state is used to indicate a successful open
* so that when we do close the channel normally, we
@@ -523,17 +525,10 @@ static void vmbus_add_channel_work(struct work_struct *work)
spin_unlock_irqrestore(&primary_channel->lock, flags);
}
- mutex_unlock(&vmbus_connection.channel_mutex);
+ /* vmbus_process_offer() has mapped the channel. */
+ vmbus_channel_unmap_relid(newchannel);
- if (newchannel->target_cpu != get_cpu()) {
- put_cpu();
- smp_call_function_single(newchannel->target_cpu,
- percpu_channel_deq,
- newchannel, true);
- } else {
- percpu_channel_deq(newchannel);
- put_cpu();
- }
+ mutex_unlock(&vmbus_connection.channel_mutex);
vmbus_release_relid(newchannel->offermsg.child_relid);
@@ -599,6 +594,8 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel)
spin_unlock_irqrestore(&channel->lock, flags);
}
+ vmbus_channel_map_relid(newchannel);
+
mutex_unlock(&vmbus_connection.channel_mutex);
/*
@@ -937,8 +934,6 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
oldchannel = find_primary_channel_by_offer(offer);
if (oldchannel != NULL) {
- atomic_dec(&vmbus_connection.offer_in_progress);
-
/*
* We're resuming from hibernation: all the sub-channel and
* hv_sock channels we had before the hibernation should have
@@ -946,36 +941,65 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
* primary channel that we had before the hibernation.
*/
+ /*
+ * { Initially: channel relid = INVALID_RELID,
+ * channels[valid_relid] = NULL }
+ *
+ * CPU1 CPU2
+ *
+ * [vmbus_onoffer()] [vmbus_device_release()]
+ *
+ * LOCK channel_mutex LOCK channel_mutex
+ * STORE channel relid = valid_relid LOAD r1 = channel relid
+ * MAP_RELID channel if (r1 != INVALID_RELID)
+ * UNLOCK channel_mutex UNMAP_RELID channel
+ * UNLOCK channel_mutex
+ *
+ * Forbids: r1 == valid_relid &&
+ * channels[valid_relid] == channel
+ *
+ * Note. r1 can be INVALID_RELID only for an hv_sock channel.
+ * None of the hv_sock channels which were present before the
+ * suspend are re-offered upon the resume. See the WARN_ON()
+ * in hv_process_channel_removal().
+ */
+ mutex_lock(&vmbus_connection.channel_mutex);
+
+ atomic_dec(&vmbus_connection.offer_in_progress);
+
WARN_ON(oldchannel->offermsg.child_relid != INVALID_RELID);
/* Fix up the relid. */
oldchannel->offermsg.child_relid = offer->child_relid;
offer_sz = sizeof(*offer);
- if (memcmp(offer, &oldchannel->offermsg, offer_sz) == 0) {
- check_ready_for_resume_event();
- return;
+ if (memcmp(offer, &oldchannel->offermsg, offer_sz) != 0) {
+ /*
+ * This is not an error, since the host can also change
+ * the other field(s) of the offer, e.g. on WS RS5
+ * (Build 17763), the offer->connection_id of the
+ * Mellanox VF vmbus device can change when the host
+ * reoffers the device upon resume.
+ */
+ pr_debug("vmbus offer changed: relid=%d\n",
+ offer->child_relid);
+
+ print_hex_dump_debug("Old vmbus offer: ",
+ DUMP_PREFIX_OFFSET, 16, 4,
+ &oldchannel->offermsg, offer_sz,
+ false);
+ print_hex_dump_debug("New vmbus offer: ",
+ DUMP_PREFIX_OFFSET, 16, 4,
+ offer, offer_sz, false);
+
+ /* Fix up the old channel. */
+ vmbus_setup_channel_state(oldchannel, offer);
}
- /*
- * This is not an error, since the host can also change the
- * other field(s) of the offer, e.g. on WS RS5 (Build 17763),
- * the offer->connection_id of the Mellanox VF vmbus device
- * can change when the host reoffers the device upon resume.
- */
- pr_debug("vmbus offer changed: relid=%d\n",
- offer->child_relid);
-
- print_hex_dump_debug("Old vmbus offer: ", DUMP_PREFIX_OFFSET,
- 16, 4, &oldchannel->offermsg, offer_sz,
- false);
- print_hex_dump_debug("New vmbus offer: ", DUMP_PREFIX_OFFSET,
- 16, 4, offer, offer_sz, false);
-
- /* Fix up the old channel. */
- vmbus_setup_channel_state(oldchannel, offer);
-
+ /* Add the channel back to the array of channels. */
+ vmbus_channel_map_relid(oldchannel);
check_ready_for_resume_event();
+ mutex_unlock(&vmbus_connection.channel_mutex);
return;
}
@@ -1033,14 +1057,14 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
*
* CPU1 CPU2
*
- * [vmbus_process_offer()] [vmbus_onoffer_rescind()]
+ * [vmbus_onoffer()] [vmbus_onoffer_rescind()]
*
* LOCK channel_mutex WAIT_ON offer_in_progress == 0
* DECREMENT offer_in_progress LOCK channel_mutex
- * INSERT chn_list SEARCH chn_list
+ * STORE channels[] LOAD channels[]
* UNLOCK channel_mutex UNLOCK channel_mutex
*
- * Forbids: CPU2's SEARCH from *not* seeing CPU1's INSERT
+ * Forbids: CPU2's LOAD from *not* seeing CPU1's STORE
*/
while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index f4bd306d2cef9..11170d9a2e1a5 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -248,6 +248,14 @@ int vmbus_connect(void)
pr_info("Vmbus version:%d.%d\n",
version >> 16, version & 0xFFFF);
+ vmbus_connection.channels = kcalloc(MAX_CHANNEL_RELIDS,
+ sizeof(struct vmbus_channel *),
+ GFP_KERNEL);
+ if (vmbus_connection.channels == NULL) {
+ ret = -ENOMEM;
+ goto cleanup;
+ }
+
kfree(msginfo);
return 0;
@@ -295,33 +303,9 @@ void vmbus_disconnect(void)
*/
struct vmbus_channel *relid2channel(u32 relid)
{
- struct vmbus_channel *channel;
- struct vmbus_channel *found_channel = NULL;
- struct list_head *cur, *tmp;
- struct vmbus_channel *cur_sc;
-
- BUG_ON(!mutex_is_locked(&vmbus_connection.channel_mutex));
-
- list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
- if (channel->offermsg.child_relid == relid) {
- found_channel = channel;
- break;
- } else if (!list_empty(&channel->sc_list)) {
- /*
- * Deal with sub-channels.
- */
- list_for_each_safe(cur, tmp, &channel->sc_list) {
- cur_sc = list_entry(cur, struct vmbus_channel,
- sc_list);
- if (cur_sc->offermsg.child_relid == relid) {
- found_channel = cur_sc;
- break;
- }
- }
- }
- }
-
- return found_channel;
+ if (WARN_ON(relid >= MAX_CHANNEL_RELIDS))
+ return NULL;
+ return READ_ONCE(vmbus_connection.channels[relid]);
}
/*
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index e2b3310454640..17bf1f229152b 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -117,8 +117,6 @@ int hv_synic_alloc(void)
pr_err("Unable to allocate post msg page\n");
goto err;
}
-
- INIT_LIST_HEAD(&hv_cpu->chan_list);
}
return 0;
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 67fb1edcbf527..2216cd5e8e8bf 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -132,12 +132,6 @@ struct hv_per_cpu_context {
* basis.
*/
struct tasklet_struct msg_dpc;
-
- /*
- * To optimize the mapping of relid to channel, maintain
- * per-cpu list of the channels based on their CPU affinity.
- */
- struct list_head chan_list;
};
struct hv_context {
@@ -202,6 +196,8 @@ int hv_ringbuffer_read(struct vmbus_channel *channel,
/* TODO: Need to make this configurable */
#define MAX_NUM_CHANNELS_SUPPORTED 256
+#define MAX_CHANNEL_RELIDS \
+ max(MAX_NUM_CHANNELS_SUPPORTED, HV_EVENT_FLAGS_COUNT)
enum vmbus_connect_state {
DISCONNECTED,
@@ -251,6 +247,9 @@ struct vmbus_connection {
struct list_head chn_list;
struct mutex channel_mutex;
+ /* Array of channels */
+ struct vmbus_channel **channels;
+
/*
* An offer message is handled first on the work_queue, and then
* is further handled on handle_primary_chan_wq or
@@ -337,6 +336,9 @@ int vmbus_add_channel_kobj(struct hv_device *device_obj,
void vmbus_remove_channel_attr_group(struct vmbus_channel *channel);
+void vmbus_channel_map_relid(struct vmbus_channel *channel);
+void vmbus_channel_unmap_relid(struct vmbus_channel *channel);
+
struct vmbus_channel *relid2channel(u32 relid);
void vmbus_free_channels(void);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 903b1ec6a259e..301e3f484bb1a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1198,33 +1198,39 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
if (relid == 0)
continue;
+ /*
+ * Pairs with the kfree_rcu() in vmbus_chan_release().
+ * Guarantees that the channel data structure doesn't
+ * get freed while the channel pointer below is being
+ * dereferenced.
+ */
rcu_read_lock();
/* Find channel based on relid */
- list_for_each_entry_rcu(channel, &hv_cpu->chan_list, percpu_list) {
- if (channel->offermsg.child_relid != relid)
- continue;
+ channel = relid2channel(relid);
+ if (channel == NULL)
+ goto sched_unlock_rcu;
- if (channel->rescind)
- continue;
+ if (channel->rescind)
+ goto sched_unlock_rcu;
- trace_vmbus_chan_sched(channel);
+ trace_vmbus_chan_sched(channel);
- ++channel->interrupts;
+ ++channel->interrupts;
- switch (channel->callback_mode) {
- case HV_CALL_ISR:
- vmbus_channel_isr(channel);
- break;
+ switch (channel->callback_mode) {
+ case HV_CALL_ISR:
+ vmbus_channel_isr(channel);
+ break;
- case HV_CALL_BATCHED:
- hv_begin_read(&channel->inbound);
- /* fallthrough */
- case HV_CALL_DIRECT:
- tasklet_schedule(&channel->callback_event);
- }
+ case HV_CALL_BATCHED:
+ hv_begin_read(&channel->inbound);
+ fallthrough;
+ case HV_CALL_DIRECT:
+ tasklet_schedule(&channel->callback_event);
}
+sched_unlock_rcu:
rcu_read_unlock();
}
}
@@ -2208,9 +2214,12 @@ static int vmbus_bus_suspend(struct device *dev)
list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
/*
- * Invalidate the field. Upon resume, vmbus_onoffer() will fix
- * up the field, and the other fields (if necessary).
+ * Remove the channel from the array of channels and invalidate
+ * the channel's relid. Upon resume, vmbus_onoffer() will fix
+ * up the relid (and other fields, if necessary) and add the
+ * channel back to the array.
*/
+ vmbus_channel_unmap_relid(channel);
channel->offermsg.child_relid = INVALID_RELID;
if (is_hvsock_channel(channel)) {
@@ -2450,6 +2459,7 @@ static void __exit vmbus_exit(void)
hv_debug_rm_all_dir();
vmbus_free_channels();
+ kfree(vmbus_connection.channels);
if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
kmsg_dump_unregister(&hv_kmsg_dumper);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 692c89ccf5dfd..6c794fd5c903e 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -854,11 +854,6 @@ struct vmbus_channel {
* Support per-channel state for use by vmbus drivers.
*/
void *per_channel_state;
- /*
- * To support per-cpu lookup mapping of relid to channel,
- * link up channels based on their CPU affinity.
- */
- struct list_head percpu_list;
/*
* Defer freeing channel until after all cpu's have
--
2.24.0
vmbus_chan_sched() might call the netvsc driver callback function that
ends up scheduling NAPI work. This "work" can access the channel ring
buffer, so we must ensure that any such work is completed and that the
ring buffer is no longer being accessed before freeing the ring buffer
data structure in the channel closure path. To this end, disable NAPI
before calling vmbus_close() in netvsc_device_remove().
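Reduced to its essence, the teardown ordering established by this patch
is as follows (a sketch with placeholder names, not the actual netvsc
code; error handling and the per-channel loop are omitted):

#include <linux/hyperv.h>
#include <linux/netdevice.h>

/*
 * NAPI must be disabled before vmbus_close() tears down the channel
 * ring buffers, because a NAPI poll scheduled from the channel callback
 * may still be reading those ring buffers.
 */
static void example_teardown(struct hv_device *device,
                             struct napi_struct *napi)
{
        napi_disable(napi);     /* waits for any in-flight poll to finish */
        netif_napi_del(napi);   /* disassociate NAPI from the net device */

        vmbus_close(device->channel);   /* now safe to free the ring buffers */
}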
Suggested-by: Michael Kelley <[email protected]>
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: <[email protected]>
---
drivers/hv/channel.c | 6 ++++++
drivers/net/hyperv/netvsc.c | 7 +++++--
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 23f358cb7f494..256ee90c74460 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -609,6 +609,12 @@ void vmbus_reset_channel_cb(struct vmbus_channel *channel)
* the former is accessing channel->inbound.ring_buffer, the latter
* could be freeing the ring_buffer pages, so here we must stop it
* first.
+ *
+ * vmbus_chan_sched() might call the netvsc driver callback function
+ * that ends up scheduling NAPI work that accesses the ring buffer.
+ * At this point, we have to ensure that any such work is completed
+ * and that the channel ring buffer is no longer being accessed, cf.
+ * the calls to napi_disable() in netvsc_device_remove().
*/
tasklet_disable(&channel->callback_event);
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 1b320bcf150a4..806cc85d10033 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -635,9 +635,12 @@ void netvsc_device_remove(struct hv_device *device)
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
- /* And disassociate NAPI context from device */
- for (i = 0; i < net_device->num_chn; i++)
+ /* Disable NAPI and disassociate its context from the device. */
+ for (i = 0; i < net_device->num_chn; i++) {
+ /* See also vmbus_reset_channel_cb(). */
+ napi_disable(&net_device->chan_table[i].napi);
netif_napi_del(&net_device->chan_table[i].napi);
+ }
/*
* At this point, no one should be accessing net_device
--
2.24.0
The fcopy and vss callback functions could be running in a tasklet
at the same time they are called in hv_poll_channel(). The current code
serializes the invocations of these functions, and their accesses to
the channel ring buffer, by sending an IPI to the CPU that is allowed
to access the ring buffer, cf. hv_poll_channel(). This IPI mechanism
becomes infeasible if we allow changing the CPU that a channel will
interrupt. Instead modify the callback wrappers to always execute
the fcopy and vss callbacks in a tasklet, thus mirroring the solution
for the kvp callback functions adopted since commit a3ade8cc474d8
("HV: properly delay KVP packets when negotiation is in progress").
This will ensure that the callback function can't run on two CPUs at
the same time.
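For context, the property being relied upon is that a given tasklet
never runs concurrently with itself on two CPUs. A minimal sketch of the
pattern (the example_* names are invented and independent of the actual
fcopy/vss code):

#include <linux/interrupt.h>

static struct tasklet_struct example_tasklet;

static void example_callback(unsigned long data)
{
        /* Process the ring buffer of the channel identified by @data. */
}

static void example_setup(void *channel)
{
        tasklet_init(&example_tasklet, example_callback,
                     (unsigned long)channel);
}

/* Callback wrapper: defer to the tasklet instead of calling directly. */
static void example_poll_wrapper(void *channel)
{
        tasklet_schedule(&example_tasklet);
}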
Suggested-by: Michael Kelley <[email protected]>
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
---
drivers/hv/hv_fcopy.c | 2 +-
drivers/hv/hv_snapshot.c | 2 +-
drivers/hv/hyperv_vmbus.h | 7 +------
3 files changed, 3 insertions(+), 8 deletions(-)
diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
index bb9ba3f7c7949..5040d7e0cd9e9 100644
--- a/drivers/hv/hv_fcopy.c
+++ b/drivers/hv/hv_fcopy.c
@@ -71,7 +71,7 @@ static void fcopy_poll_wrapper(void *channel)
{
/* Transaction is finished, reset the state here to avoid races. */
fcopy_transaction.state = HVUTIL_READY;
- hv_fcopy_onchannelcallback(channel);
+ tasklet_schedule(&((struct vmbus_channel *)channel)->callback_event);
}
static void fcopy_timeout_func(struct work_struct *dummy)
diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c
index 1c75b38f0d6da..783779e4cc1a5 100644
--- a/drivers/hv/hv_snapshot.c
+++ b/drivers/hv/hv_snapshot.c
@@ -80,7 +80,7 @@ static void vss_poll_wrapper(void *channel)
{
/* Transaction is finished, reset the state here to avoid races. */
vss_transaction.state = HVUTIL_READY;
- hv_vss_onchannelcallback(channel);
+ tasklet_schedule(&((struct vmbus_channel *)channel)->callback_event);
}
/*
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 2216cd5e8e8bf..ff187cf5896b9 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -377,12 +377,7 @@ static inline void hv_poll_channel(struct vmbus_channel *channel,
{
if (!channel)
return;
-
- if (in_interrupt() && (channel->target_cpu == smp_processor_id())) {
- cb(channel);
- return;
- }
- smp_call_function_single(channel->target_cpu, cb, channel, true);
+ cb(channel);
}
enum hvutil_device_state {
--
2.24.0
Since vmbus_chan_sched() dereferences the ring buffer pointer, we have
to make sure that the ring buffer data structures don't get freed while
such dereferencing is happening. The current code does this by sending
an IPI to the CPU that is allowed to access that ring buffer from
interrupt level, cf. vmbus_reset_channel_cb(). But with the new functionality
to allow changing the CPU that a channel will interrupt, we can't be
sure what CPU will be running the vmbus_chan_sched() function for a
particular channel, so the current IPI mechanism is infeasible.
Instead synchronize vmbus_chan_sched() and vmbus_reset_channel_cb() by
using the (newly introduced) per-channel spin lock "sched_lock". Move
the test for onchannel_callback being NULL before the "switch" control
statement in vmbus_chan_sched(), in order not to access the ring buffer
once vmbus_reset_channel_cb() has completed on the channel.
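Schematically, the resulting synchronization looks as follows (a
self-contained sketch with invented example_* names; the real code is in
the diff below):

#include <linux/spinlock.h>

struct example_channel {
        spinlock_t sched_lock;
        void (*onchannel_callback)(void *context);
        void *callback_context;
};

/* Scheduling side, runs in interrupt context; cf. vmbus_chan_sched(). */
static void example_chan_sched(struct example_channel *chan)
{
        void (*cb)(void *context);

        spin_lock(&chan->sched_lock);
        cb = chan->onchannel_callback;
        if (cb)
                cb(chan->callback_context);     /* ring buffer still alive */
        spin_unlock(&chan->sched_lock);
}

/* Removal side, process context; cf. vmbus_reset_channel_cb(). */
static void example_reset_channel_cb(struct example_channel *chan)
{
        unsigned long flags;

        spin_lock_irqsave(&chan->sched_lock, flags);
        chan->onchannel_callback = NULL;
        spin_unlock_irqrestore(&chan->sched_lock, flags);

        /* From here on, no new callback invocation can start. */
}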
Suggested-by: Michael Kelley <[email protected]>
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
---
drivers/hv/channel.c | 24 +++++++-----------------
drivers/hv/channel_mgmt.c | 1 +
drivers/hv/vmbus_drv.c | 30 +++++++++++++++++-------------
include/linux/hyperv.h | 6 ++++++
4 files changed, 31 insertions(+), 30 deletions(-)
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 256ee90c74460..132e476f87b2e 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -594,15 +594,10 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
}
EXPORT_SYMBOL_GPL(vmbus_teardown_gpadl);
-static void reset_channel_cb(void *arg)
-{
- struct vmbus_channel *channel = arg;
-
- channel->onchannel_callback = NULL;
-}
-
void vmbus_reset_channel_cb(struct vmbus_channel *channel)
{
+ unsigned long flags;
+
/*
* vmbus_on_event(), running in the per-channel tasklet, can race
* with vmbus_close_internal() in the case of SMP guest, e.g., when
@@ -618,17 +613,12 @@ void vmbus_reset_channel_cb(struct vmbus_channel *channel)
*/
tasklet_disable(&channel->callback_event);
- channel->sc_creation_callback = NULL;
+ /* See the inline comments in vmbus_chan_sched(). */
+ spin_lock_irqsave(&channel->sched_lock, flags);
+ channel->onchannel_callback = NULL;
+ spin_unlock_irqrestore(&channel->sched_lock, flags);
- /* Stop the callback asap */
- if (channel->target_cpu != get_cpu()) {
- put_cpu();
- smp_call_function_single(channel->target_cpu, reset_channel_cb,
- channel, true);
- } else {
- reset_channel_cb(channel);
- put_cpu();
- }
+ channel->sc_creation_callback = NULL;
/* Re-enable tasklet for use on re-open */
tasklet_enable(&channel->callback_event);
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 9b1449c839575..c53f58ba06dcf 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -315,6 +315,7 @@ static struct vmbus_channel *alloc_channel(void)
if (!channel)
return NULL;
+ spin_lock_init(&channel->sched_lock);
spin_lock_init(&channel->lock);
init_completion(&channel->rescind_event);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 301e3f484bb1a..ebe2716f583d2 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1147,18 +1147,6 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
}
#endif /* CONFIG_PM_SLEEP */
-/*
- * Direct callback for channels using other deferred processing
- */
-static void vmbus_channel_isr(struct vmbus_channel *channel)
-{
- void (*callback_fn)(void *);
-
- callback_fn = READ_ONCE(channel->onchannel_callback);
- if (likely(callback_fn != NULL))
- (*callback_fn)(channel->channel_callback_context);
-}
-
/*
* Schedule all channels with events pending
*/
@@ -1189,6 +1177,7 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
return;
for_each_set_bit(relid, recv_int_page, maxbits) {
+ void (*callback_fn)(void *context);
struct vmbus_channel *channel;
if (!sync_test_and_clear_bit(relid, recv_int_page))
@@ -1214,13 +1203,26 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
if (channel->rescind)
goto sched_unlock_rcu;
+ /*
+ * Make sure that the ring buffer data structure doesn't get
+ * freed while we dereference the ring buffer pointer. Test
+ * for the channel's onchannel_callback being NULL within a
+ * sched_lock critical section. See also the inline comments
+ * in vmbus_reset_channel_cb().
+ */
+ spin_lock(&channel->sched_lock);
+
+ callback_fn = channel->onchannel_callback;
+ if (unlikely(callback_fn == NULL))
+ goto sched_unlock;
+
trace_vmbus_chan_sched(channel);
++channel->interrupts;
switch (channel->callback_mode) {
case HV_CALL_ISR:
- vmbus_channel_isr(channel);
+ (*callback_fn)(channel->channel_callback_context);
break;
case HV_CALL_BATCHED:
@@ -1230,6 +1232,8 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
tasklet_schedule(&channel->callback_event);
}
+sched_unlock:
+ spin_unlock(&channel->sched_lock);
sched_unlock_rcu:
rcu_read_unlock();
}
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 6c794fd5c903e..ce32ab186192f 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -771,6 +771,12 @@ struct vmbus_channel {
void (*onchannel_callback)(void *context);
void *channel_callback_context;
+ /*
+ * Synchronize channel scheduling and channel removal; see the inline
+ * comments in vmbus_chan_sched() and vmbus_reset_channel_cb().
+ */
+ spinlock_t sched_lock;
+
/*
* A channel can be marked for one of three modes of reading:
* BATCHED - callback called from taslket and should read
--
2.24.0
The current implementation of hv_compose_msi_msg() is incompatible with
the new functionality that allows changing the vCPU a VMBus channel will
interrupt: if this function always calls hv_pci_onchannelcallback() in
the polling loop, an interrupt delivered to a different CPU could cause
hv_pci_onchannelcallback() to run simultaneously in a tasklet on that
CPU, which would break the callback's assumption that it never runs
concurrently with itself. The current code also has a problem in that it is not
synchronized with vmbus_reset_channel_cb(): hv_compose_msi_msg() could
be accessing the ring buffer via the call of hv_pci_onchannelcallback()
well after the time that vmbus_reset_channel_cb() has finished.
Fix these issues as follows. Disable the channel tasklet before
entering the polling loop in hv_compose_msi_msg() and re-enable it when
done. This will prevent hv_pci_onchannelcallback() from running in a
tasklet on a different CPU. Moreover, poll by always calling
hv_pci_onchannelcallback(), but check the channel callback function for
NULL and invoke the callback within a sched_lock critical section. This
will prevent hv_compose_msi_msg() from accessing the ring buffer after
vmbus_reset_channel_cb() has acquired the sched_lock spinlock.
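Schematically, the polling pattern after this change is the following
(a sketch; the helper and its signature are invented, "drain" stands in
for hv_pci_onchannelcallback(), and sched_lock is the per-channel lock
introduced earlier in the series):

#include <linux/completion.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/hyperv.h>
#include <linux/interrupt.h>
#include <linux/spinlock.h>

static int example_poll_completion(struct vmbus_channel *channel,
                                   struct completion *event,
                                   void (*drain)(void *context),
                                   void *context)
{
        unsigned long flags;

        /* Keep the callback from running concurrently in the tasklet. */
        tasklet_disable(&channel->callback_event);

        while (!try_wait_for_completion(event)) {
                /* Don't touch the ring buffer after channel removal. */
                spin_lock_irqsave(&channel->sched_lock, flags);
                if (unlikely(channel->onchannel_callback == NULL)) {
                        spin_unlock_irqrestore(&channel->sched_lock, flags);
                        tasklet_enable(&channel->callback_event);
                        return -ENODEV;
                }
                drain(context);
                spin_unlock_irqrestore(&channel->sched_lock, flags);

                udelay(100);
        }

        tasklet_enable(&channel->callback_event);
        return 0;
}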
Suggested-by: Michael Kelley <[email protected]>
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
Cc: Lorenzo Pieralisi <[email protected]>
Cc: Andrew Murray <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: <[email protected]>
---
drivers/pci/controller/pci-hyperv.c | 44 ++++++++++++++++++-----------
1 file changed, 28 insertions(+), 16 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 9977abff92fc5..e6020480a28b1 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1350,11 +1350,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
struct irq_cfg *cfg = irqd_cfg(data);
struct hv_pcibus_device *hbus;
+ struct vmbus_channel *channel;
struct hv_pci_dev *hpdev;
struct pci_bus *pbus;
struct pci_dev *pdev;
struct cpumask *dest;
- unsigned long flags;
struct compose_comp_ctxt comp;
struct tran_int_desc *int_desc;
struct {
@@ -1372,6 +1372,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
dest = irq_data_get_effective_affinity_mask(data);
pbus = pdev->bus;
hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
+ channel = hbus->hdev->channel;
hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
if (!hpdev)
goto return_null_message;
@@ -1428,43 +1429,52 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
goto free_int_desc;
}
+ /*
+ * Prevents hv_pci_onchannelcallback() from running concurrently
+ * in the tasklet.
+ */
+ tasklet_disable(&channel->callback_event);
+
/*
* Since this function is called with IRQ locks held, can't
* do normal wait for completion; instead poll.
*/
while (!try_wait_for_completion(&comp.comp_pkt.host_event)) {
+ unsigned long flags;
+
/* 0xFFFF means an invalid PCI VENDOR ID. */
if (hv_pcifront_get_vendor_id(hpdev) == 0xFFFF) {
dev_err_once(&hbus->hdev->device,
"the device has gone\n");
- goto free_int_desc;
+ goto enable_tasklet;
}
/*
- * When the higher level interrupt code calls us with
- * interrupt disabled, we must poll the channel by calling
- * the channel callback directly when channel->target_cpu is
- * the current CPU. When the higher level interrupt code
- * calls us with interrupt enabled, let's add the
- * local_irq_save()/restore() to avoid race:
- * hv_pci_onchannelcallback() can also run in tasklet.
+ * Make sure that the ring buffer data structure doesn't get
+ * freed while we dereference the ring buffer pointer. Test
+ * for the channel's onchannel_callback being NULL within a
+ * sched_lock critical section. See also the inline comments
+ * in vmbus_reset_channel_cb().
*/
- local_irq_save(flags);
-
- if (hbus->hdev->channel->target_cpu == smp_processor_id())
- hv_pci_onchannelcallback(hbus);
-
- local_irq_restore(flags);
+ spin_lock_irqsave(&channel->sched_lock, flags);
+ if (unlikely(channel->onchannel_callback == NULL)) {
+ spin_unlock_irqrestore(&channel->sched_lock, flags);
+ goto enable_tasklet;
+ }
+ hv_pci_onchannelcallback(hbus);
+ spin_unlock_irqrestore(&channel->sched_lock, flags);
if (hpdev->state == hv_pcichild_ejecting) {
dev_err_once(&hbus->hdev->device,
"the device is being ejected\n");
- goto free_int_desc;
+ goto enable_tasklet;
}
udelay(100);
}
+ tasklet_enable(&channel->callback_event);
+
if (comp.comp_pkt.completion_status < 0) {
dev_err(&hbus->hdev->device,
"Request for interrupt failed: 0x%x",
@@ -1488,6 +1498,8 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
put_pcichild(hpdev);
return;
+enable_tasklet:
+ tasklet_enable(&channel->callback_event);
free_int_desc:
kfree(int_desc);
drop_reference:
--
2.24.0
init_vp_index() may access the cpu_online_mask mask via its calls of
cpumask_of_node(). Make sure to protect these accesses with a
cpus_read_lock() critical section.
Also, remove some hardcoded instances of CPU 0 from init_vp_index()
and replace them with VMBUS_CONNECT_CPU. The connect CPU cannot go
offline, since Hyper-V does not provide a way to change it.
Finally, order the accesses to target_cpu made by init_vp_index() and
hv_synic_cleanup() by relying on the channel_mutex; this is achieved
by moving the call of init_vp_index() into vmbus_process_offer().
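For illustration, the cpus_read_lock() pattern being introduced looks
roughly like this (a minimal sketch; example_pick_cpu() is a hypothetical
helper, not a function in this series):

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

/*
 * Pick an online CPU from @node while holding the CPU hotplug read
 * lock, so that cpumask_of_node()/cpu_online_mask cannot change under
 * our feet while we make the choice.
 */
static unsigned int example_pick_cpu(int node)
{
        unsigned int cpu;

        cpus_read_lock();

        cpu = cpumask_first_and(cpumask_of_node(node), cpu_online_mask);
        if (cpu >= nr_cpu_ids)
                cpu = cpumask_first(cpu_online_mask);

        cpus_read_unlock();

        return cpu;
}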
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
---
drivers/hv/channel_mgmt.c | 47 ++++++++++++++++++++++++++++-----------
drivers/hv/hv.c | 7 +++---
2 files changed, 38 insertions(+), 16 deletions(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 8f34494bb53fb..34672dc2fc935 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -18,6 +18,7 @@
#include <linux/module.h>
#include <linux/completion.h>
#include <linux/delay.h>
+#include <linux/cpu.h>
#include <linux/hyperv.h>
#include <asm/mshyperv.h>
@@ -445,13 +446,8 @@ static void vmbus_add_channel_work(struct work_struct *work)
container_of(work, struct vmbus_channel, add_channel_work);
struct vmbus_channel *primary_channel = newchannel->primary_channel;
unsigned long flags;
- u16 dev_type;
int ret;
- dev_type = hv_get_dev_type(newchannel);
-
- init_vp_index(newchannel, dev_type);
-
/*
* This state is used to indicate a successful open
* so that when we do close the channel normally, we
@@ -483,7 +479,7 @@ static void vmbus_add_channel_work(struct work_struct *work)
if (!newchannel->device_obj)
goto err_deq_chan;
- newchannel->device_obj->device_id = dev_type;
+ newchannel->device_obj->device_id = hv_get_dev_type(newchannel);
/*
* Add the new device to the bus. This will kick off device-driver
* binding which eventually invokes the device driver's AddDevice()
@@ -539,6 +535,25 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel)
unsigned long flags;
bool fnew = true;
+ /*
+ * Initialize the target_CPU before inserting the channel in
+ * the chn_list and sc_list lists, within the channel_mutex
+ * critical section:
+ *
+ * CPU1 CPU2
+ *
+ * [vmbus_process_offer()] [hv_synic_cleanup()]
+ *
+ * STORE target_cpu LOCK channel_mutex
+ * LOCK channel_mutex SEARCH chn_list
+ * INSERT chn_list LOAD target_cpu
+ * UNLOCK channel_mutex UNLOCK channel_mutex
+ *
+ * Forbids: CPU2's SEARCH from seeing CPU1's INSERT &&
+ * CPU2's LOAD from *not* seeing CPU1's STORE
+ */
+ init_vp_index(newchannel, hv_get_dev_type(newchannel));
+
mutex_lock(&vmbus_connection.channel_mutex);
/* Remember the channels that should be cleaned up upon suspend. */
@@ -635,7 +650,7 @@ static DEFINE_SPINLOCK(bind_channel_to_cpu_lock);
* channel interrupt load by binding a channel to VCPU.
*
* For pre-win8 hosts or non-performance critical channels we assign the
- * first CPU in the first NUMA node.
+ * VMBUS_CONNECT_CPU.
*
* Starting with win8, performance critical channels will be distributed
* evenly among all the available NUMA nodes. Once the node is assigned,
@@ -654,17 +669,22 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
!alloc_cpumask_var(&available_mask, GFP_KERNEL)) {
/*
* Prior to win8, all channel interrupts are
- * delivered on cpu 0.
+ * delivered on VMBUS_CONNECT_CPU.
* Also if the channel is not a performance critical
- * channel, bind it to cpu 0.
- * In case alloc_cpumask_var() fails, bind it to cpu 0.
+ * channel, bind it to VMBUS_CONNECT_CPU.
+ * In case alloc_cpumask_var() fails, bind it to
+ * VMBUS_CONNECT_CPU.
*/
- channel->numa_node = 0;
- channel->target_cpu = 0;
- channel->target_vp = hv_cpu_number_to_vp_number(0);
+ channel->numa_node = cpu_to_node(VMBUS_CONNECT_CPU);
+ channel->target_cpu = VMBUS_CONNECT_CPU;
+ channel->target_vp =
+ hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
return;
}
+ /* No CPUs can come up or down during this. */
+ cpus_read_lock();
+
/*
* Serializes the accesses to the global variable next_numa_node_id.
* See also the header comment of the spin lock declaration.
@@ -702,6 +722,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
spin_unlock(&bind_channel_to_cpu_lock);
+ cpus_read_unlock();
free_cpumask_var(available_mask);
}
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 17bf1f229152b..188b42b07f07b 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -256,9 +256,10 @@ int hv_synic_cleanup(unsigned int cpu)
/*
* Search for channels which are bound to the CPU we're about to
- * cleanup. In case we find one and vmbus is still connected we need to
- * fail, this will effectively prevent CPU offlining. There is no way
- * we can re-bind channels to different CPUs for now.
+ * cleanup. In case we find one and vmbus is still connected, we
+ * fail; this will effectively prevent CPU offlining.
+ *
+ * TODO: Re-bind the channels to different CPUs.
*/
mutex_lock(&vmbus_connection.channel_mutex);
list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
--
2.24.0
VMBus version 4.1 and later support the CHANNELMSG_MODIFYCHANNEL(22)
message type which can be used to request Hyper-V to change the vCPU
that a channel will interrupt.
Introduce the CHANNELMSG_MODIFYCHANNEL message type, and define the
vmbus_send_modifychannel() function to send CHANNELMSG_MODIFYCHANNEL
requests to the host via a hypercall. The function is then used to
define a sysfs "store" operation, which allows to change the (v)CPU
the channel will interrupt by using the sysfs interface. The feature
can be used for load balancing or other purposes.
One interesting catch here is that Hyper-V can *not* currently ACK
CHANNELMSG_MODIFYCHANNEL messages with the promise that (after the ACK
is sent) the channel won't send any more interrupts to the "old" CPU.
This peculiarity of the CHANNELMSG_MODIFYCHANNEL messages is problematic
if the user wants to take a CPU offline, since we don't want to take a
CPU offline (and, potentially, "lose" channel interrupts on such a CPU)
while the host is still processing a CHANNELMSG_MODIFYCHANNEL message
associated with that CPU.
It is worth mentioning, however, that we have been unable to observe
the above-mentioned "race": in all our tests, CHANNELMSG_MODIFYCHANNEL
requests appeared *as if* they were processed synchronously by the host.
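As a usage illustration (not part of the patch): once the series is
applied, user space can retarget a channel's interrupt by writing to the
per-channel "cpu" sysfs attribute. The path below follows the existing
/sys/bus/vmbus layout, with the device GUID and relid being
system-specific; the helper itself is hypothetical.

#include <stdio.h>

/*
 * Write a target CPU into the per-channel "cpu" attribute, e.g.
 *   /sys/bus/vmbus/devices/<device GUID>/channels/<relid>/cpu
 * Returns 0 on success, -1 on failure (the kernel rejects offline CPUs,
 * non-open channels and hosts older than VMBus 4.1).
 */
static int vmbus_chan_set_cpu(const char *chan_sysfs_dir, unsigned int cpu)
{
        char path[256];
        FILE *f;
        int ret;

        snprintf(path, sizeof(path), "%s/cpu", chan_sysfs_dir);
        f = fopen(path, "w");
        if (!f)
                return -1;
        ret = (fprintf(f, "%u\n", cpu) > 0) ? 0 : -1;
        if (fclose(f) != 0)
                ret = -1;
        return ret;
}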
Suggested-by: Michael Kelley <[email protected]>
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
---
drivers/hv/channel.c | 28 ++++++++++
drivers/hv/channel_mgmt.c | 2 +-
drivers/hv/hv_trace.h | 19 +++++++
drivers/hv/vmbus_drv.c | 108 +++++++++++++++++++++++++++++++++++++-
include/linux/hyperv.h | 10 +++-
5 files changed, 163 insertions(+), 4 deletions(-)
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 132e476f87b2e..90070b337c10d 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -289,6 +289,34 @@ int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id,
}
EXPORT_SYMBOL_GPL(vmbus_send_tl_connect_request);
+/*
+ * Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt.
+ *
+ * CHANNELMSG_MODIFYCHANNEL messages are asynchronous. Also, Hyper-V does not
+ * ACK such messages. IOW we can't know when the host will stop interrupting
+ * the "old" vCPU and start interrupting the "new" vCPU for the given channel.
+ *
+ * The CHANNELMSG_MODIFYCHANNEL message type is supported since VMBus version
+ * VERSION_WIN10_V4_1.
+ */
+int vmbus_send_modifychannel(u32 child_relid, u32 target_vp)
+{
+ struct vmbus_channel_modifychannel conn_msg;
+ int ret;
+
+ memset(&conn_msg, 0, sizeof(conn_msg));
+ conn_msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL;
+ conn_msg.child_relid = child_relid;
+ conn_msg.target_vp = target_vp;
+
+ ret = vmbus_post_msg(&conn_msg, sizeof(conn_msg), true);
+
+ trace_vmbus_send_modifychannel(&conn_msg, ret);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vmbus_send_modifychannel);
+
/*
* create_gpadl_header - Creates a gpadl for the specified buffer
*/
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 34672dc2fc935..818a9d8bf649e 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -1353,7 +1353,7 @@ channel_message_table[CHANNELMSG_COUNT] = {
{ CHANNELMSG_19, 0, NULL },
{ CHANNELMSG_20, 0, NULL },
{ CHANNELMSG_TL_CONNECT_REQUEST, 0, NULL },
- { CHANNELMSG_22, 0, NULL },
+ { CHANNELMSG_MODIFYCHANNEL, 0, NULL },
{ CHANNELMSG_TL_CONNECT_RESULT, 0, NULL },
};
diff --git a/drivers/hv/hv_trace.h b/drivers/hv/hv_trace.h
index e70783e33680f..a43bc76c2d5d0 100644
--- a/drivers/hv/hv_trace.h
+++ b/drivers/hv/hv_trace.h
@@ -296,6 +296,25 @@ TRACE_EVENT(vmbus_send_tl_connect_request,
)
);
+TRACE_EVENT(vmbus_send_modifychannel,
+ TP_PROTO(const struct vmbus_channel_modifychannel *msg,
+ int ret),
+ TP_ARGS(msg, ret),
+ TP_STRUCT__entry(
+ __field(u32, child_relid)
+ __field(u32, target_vp)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->child_relid = msg->child_relid;
+ __entry->target_vp = msg->target_vp;
+ __entry->ret = ret;
+ ),
+ TP_printk("binding child_relid 0x%x to target_vp 0x%x, ret %d",
+ __entry->child_relid, __entry->target_vp, __entry->ret
+ )
+ );
+
DECLARE_EVENT_CLASS(vmbus_channel,
TP_PROTO(const struct vmbus_channel *channel),
TP_ARGS(channel),
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index ebe2716f583d2..84d2f22c569aa 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1550,8 +1550,24 @@ static ssize_t vmbus_chan_attr_show(struct kobject *kobj,
return attribute->show(chan, buf);
}
+static ssize_t vmbus_chan_attr_store(struct kobject *kobj,
+ struct attribute *attr, const char *buf,
+ size_t count)
+{
+ const struct vmbus_chan_attribute *attribute
+ = container_of(attr, struct vmbus_chan_attribute, attr);
+ struct vmbus_channel *chan
+ = container_of(kobj, struct vmbus_channel, kobj);
+
+ if (!attribute->store)
+ return -EIO;
+
+ return attribute->store(chan, buf, count);
+}
+
static const struct sysfs_ops vmbus_chan_sysfs_ops = {
.show = vmbus_chan_attr_show,
+ .store = vmbus_chan_attr_store,
};
static ssize_t out_mask_show(struct vmbus_channel *channel, char *buf)
@@ -1622,11 +1638,99 @@ static ssize_t write_avail_show(struct vmbus_channel *channel, char *buf)
}
static VMBUS_CHAN_ATTR_RO(write_avail);
-static ssize_t show_target_cpu(struct vmbus_channel *channel, char *buf)
+static ssize_t target_cpu_show(struct vmbus_channel *channel, char *buf)
{
return sprintf(buf, "%u\n", channel->target_cpu);
}
-static VMBUS_CHAN_ATTR(cpu, S_IRUGO, show_target_cpu, NULL);
+static ssize_t target_cpu_store(struct vmbus_channel *channel,
+ const char *buf, size_t count)
+{
+ ssize_t ret = count;
+ u32 target_cpu;
+
+ if (vmbus_proto_version < VERSION_WIN10_V4_1)
+ return -EIO;
+
+ if (sscanf(buf, "%uu", &target_cpu) != 1)
+ return -EIO;
+
+ /* Validate target_cpu for the cpumask_test_cpu() operation below. */
+ if (target_cpu >= nr_cpumask_bits)
+ return -EINVAL;
+
+ /* No CPUs should come up or down during this. */
+ cpus_read_lock();
+
+ if (!cpumask_test_cpu(target_cpu, cpu_online_mask)) {
+ cpus_read_unlock();
+ return -EINVAL;
+ }
+
+ /*
+ * Synchronizes target_cpu_store() and channel closure:
+ *
+ * { Initially: state = CHANNEL_OPENED }
+ *
+ * CPU1 CPU2
+ *
+ * [target_cpu_store()] [vmbus_disconnect_ring()]
+ *
+ * LOCK channel_mutex LOCK channel_mutex
+ * LOAD r1 = state LOAD r2 = state
+ * IF (r1 == CHANNEL_OPENED) IF (r2 == CHANNEL_OPENED)
+ * SEND MODIFYCHANNEL STORE state = CHANNEL_OPEN
+ * [...] SEND CLOSECHANNEL
+ * UNLOCK channel_mutex UNLOCK channel_mutex
+ *
+ * Forbids: r1 == r2 == CHANNEL_OPENED (i.e., CPU1's LOCK precedes
+ * CPU2's LOCK) && CPU2's SEND precedes CPU1's SEND
+ *
+ * Note. The host processes the channel messages "sequentially", in
+ * the order in which they are received on a per-partition basis.
+ */
+ mutex_lock(&vmbus_connection.channel_mutex);
+
+ /*
+ * Hyper-V will ignore MODIFYCHANNEL messages for "non-open" channels;
+ * avoid sending the message and fail here for such channels.
+ */
+ if (channel->state != CHANNEL_OPENED_STATE) {
+ ret = -EIO;
+ goto cpu_store_unlock;
+ }
+
+ if (channel->target_cpu == target_cpu)
+ goto cpu_store_unlock;
+
+ if (vmbus_send_modifychannel(channel->offermsg.child_relid,
+ hv_cpu_number_to_vp_number(target_cpu))) {
+ ret = -EIO;
+ goto cpu_store_unlock;
+ }
+
+ /*
+ * Warning. At this point, there is *no* guarantee that the host will
+ * have successfully processed the vmbus_send_modifychannel() request.
+ * See the header comment of vmbus_send_modifychannel() for more info.
+ *
+ * Lags in the processing of the above vmbus_send_modifychannel() can
+ * result in missed interrupts if the "old" target CPU is taken offline
+ * before Hyper-V starts sending interrupts to the "new" target CPU.
+ * But apart from this offlining scenario, the code tolerates such
+ * lags. It will function correctly even if a channel interrupt comes
+ * in on a CPU that is different from the channel target_cpu value.
+ */
+
+ channel->target_cpu = target_cpu;
+ channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
+ channel->numa_node = cpu_to_node(target_cpu);
+
+cpu_store_unlock:
+ mutex_unlock(&vmbus_connection.channel_mutex);
+ cpus_read_unlock();
+ return ret;
+}
+static VMBUS_CHAN_ATTR(cpu, 0644, target_cpu_show, target_cpu_store);
static ssize_t channel_pending_show(struct vmbus_channel *channel,
char *buf)
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index f8e7c22d41a1a..edfcd42319ef3 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -425,7 +425,7 @@ enum vmbus_channel_message_type {
CHANNELMSG_19 = 19,
CHANNELMSG_20 = 20,
CHANNELMSG_TL_CONNECT_REQUEST = 21,
- CHANNELMSG_22 = 22,
+ CHANNELMSG_MODIFYCHANNEL = 22,
CHANNELMSG_TL_CONNECT_RESULT = 23,
CHANNELMSG_COUNT
};
@@ -620,6 +620,13 @@ struct vmbus_channel_tl_connect_request {
guid_t host_service_id;
} __packed;
+/* Modify Channel parameters, cf. vmbus_send_modifychannel() */
+struct vmbus_channel_modifychannel {
+ struct vmbus_channel_message_header header;
+ u32 child_relid;
+ u32 target_vp;
+} __packed;
+
struct vmbus_channel_version_response {
struct vmbus_channel_message_header header;
u8 version_supported;
@@ -1505,6 +1512,7 @@ extern __u32 vmbus_proto_version;
int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id,
const guid_t *shv_host_servie_id);
+int vmbus_send_modifychannel(u32 child_relid, u32 target_vp);
void vmbus_set_event(struct vmbus_channel *channel);
/* Get the start of the ring buffer. */
--
2.24.0
The logic is unused since commit 509879bdb30b8 ("Drivers: hv: Introduce
a policy for controlling channel affinity").
This logic assumes that a channel's target_cpu doesn't change during the
lifetime of a channel, but this assumption is incompatible with the new
functionality that allows changing the vCPU a channel will interrupt.
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
---
drivers/hv/channel_mgmt.c | 105 +++++++++-----------------------------
include/linux/hyperv.h | 27 ----------
2 files changed, 25 insertions(+), 107 deletions(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index c53f58ba06dcf..8f34494bb53fb 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -412,14 +412,6 @@ void hv_process_channel_removal(struct vmbus_channel *channel)
spin_unlock_irqrestore(&primary_channel->lock, flags);
}
- /*
- * We need to free the bit for init_vp_index() to work in the case
- * of sub-channel, when we reload drivers like hv_netvsc.
- */
- if (channel->affinity_policy == HV_LOCALIZED)
- cpumask_clear_cpu(channel->target_cpu,
- &primary_channel->alloced_cpus_in_node);
-
/*
* Upon suspend, an in-use hv_sock channel is marked as "rescinded" and
* the relid is invalidated; after hibernation, when the user-space app
@@ -641,20 +633,21 @@ static DEFINE_SPINLOCK(bind_channel_to_cpu_lock);
/*
* Starting with Win8, we can statically distribute the incoming
* channel interrupt load by binding a channel to VCPU.
- * We distribute the interrupt loads to one or more NUMA nodes based on
- * the channel's affinity_policy.
*
* For pre-win8 hosts or non-performance critical channels we assign the
* first CPU in the first NUMA node.
+ *
+ * Starting with win8, performance critical channels will be distributed
+ * evenly among all the available NUMA nodes. Once the node is assigned,
+ * we will assign the CPU based on a simple round robin scheme.
*/
static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
{
- u32 cur_cpu;
bool perf_chn = vmbus_devs[dev_type].perf_device;
- struct vmbus_channel *primary = channel->primary_channel;
- int next_node;
cpumask_var_t available_mask;
struct cpumask *alloced_mask;
+ u32 target_cpu;
+ int numa_node;
if ((vmbus_proto_version == VERSION_WS2008) ||
(vmbus_proto_version == VERSION_WIN7) || (!perf_chn) ||
@@ -672,31 +665,27 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
return;
}
- spin_lock(&bind_channel_to_cpu_lock);
-
/*
- * Based on the channel affinity policy, we will assign the NUMA
- * nodes.
+ * Serializes the accesses to the global variable next_numa_node_id.
+ * See also the header comment of the spin lock declaration.
*/
+ spin_lock(&bind_channel_to_cpu_lock);
- if ((channel->affinity_policy == HV_BALANCED) || (!primary)) {
- while (true) {
- next_node = next_numa_node_id++;
- if (next_node == nr_node_ids) {
- next_node = next_numa_node_id = 0;
- continue;
- }
- if (cpumask_empty(cpumask_of_node(next_node)))
- continue;
- break;
+ while (true) {
+ numa_node = next_numa_node_id++;
+ if (numa_node == nr_node_ids) {
+ next_numa_node_id = 0;
+ continue;
}
- channel->numa_node = next_node;
- primary = channel;
+ if (cpumask_empty(cpumask_of_node(numa_node)))
+ continue;
+ break;
}
- alloced_mask = &hv_context.hv_numa_map[primary->numa_node];
+ channel->numa_node = numa_node;
+ alloced_mask = &hv_context.hv_numa_map[numa_node];
if (cpumask_weight(alloced_mask) ==
- cpumask_weight(cpumask_of_node(primary->numa_node))) {
+ cpumask_weight(cpumask_of_node(numa_node))) {
/*
* We have cycled through all the CPUs in the node;
* reset the alloced map.
@@ -704,57 +693,13 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
cpumask_clear(alloced_mask);
}
- cpumask_xor(available_mask, alloced_mask,
- cpumask_of_node(primary->numa_node));
+ cpumask_xor(available_mask, alloced_mask, cpumask_of_node(numa_node));
- cur_cpu = -1;
-
- if (primary->affinity_policy == HV_LOCALIZED) {
- /*
- * Normally Hyper-V host doesn't create more subchannels
- * than there are VCPUs on the node but it is possible when not
- * all present VCPUs on the node are initialized by guest.
- * Clear the alloced_cpus_in_node to start over.
- */
- if (cpumask_equal(&primary->alloced_cpus_in_node,
- cpumask_of_node(primary->numa_node)))
- cpumask_clear(&primary->alloced_cpus_in_node);
- }
-
- while (true) {
- cur_cpu = cpumask_next(cur_cpu, available_mask);
- if (cur_cpu >= nr_cpu_ids) {
- cur_cpu = -1;
- cpumask_copy(available_mask,
- cpumask_of_node(primary->numa_node));
- continue;
- }
-
- if (primary->affinity_policy == HV_LOCALIZED) {
- /*
- * NOTE: in the case of sub-channel, we clear the
- * sub-channel related bit(s) in
- * primary->alloced_cpus_in_node in
- * hv_process_channel_removal(), so when we
- * reload drivers like hv_netvsc in SMP guest, here
- * we're able to re-allocate
- * bit from primary->alloced_cpus_in_node.
- */
- if (!cpumask_test_cpu(cur_cpu,
- &primary->alloced_cpus_in_node)) {
- cpumask_set_cpu(cur_cpu,
- &primary->alloced_cpus_in_node);
- cpumask_set_cpu(cur_cpu, alloced_mask);
- break;
- }
- } else {
- cpumask_set_cpu(cur_cpu, alloced_mask);
- break;
- }
- }
+ target_cpu = cpumask_first(available_mask);
+ cpumask_set_cpu(target_cpu, alloced_mask);
- channel->target_cpu = cur_cpu;
- channel->target_vp = hv_cpu_number_to_vp_number(cur_cpu);
+ channel->target_cpu = target_cpu;
+ channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
spin_unlock(&bind_channel_to_cpu_lock);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index ce32ab186192f..f8e7c22d41a1a 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -689,11 +689,6 @@ union hv_connection_id {
} u;
};
-enum hv_numa_policy {
- HV_BALANCED = 0,
- HV_LOCALIZED,
-};
-
enum vmbus_device_type {
HV_IDE = 0,
HV_SCSI,
@@ -808,10 +803,6 @@ struct vmbus_channel {
u32 target_vp;
/* The corresponding CPUID in the guest */
u32 target_cpu;
- /*
- * State to manage the CPU affiliation of channels.
- */
- struct cpumask alloced_cpus_in_node;
int numa_node;
/*
* Support for sub-channels. For high performance devices,
@@ -898,18 +889,6 @@ struct vmbus_channel {
*/
bool low_latency;
- /*
- * NUMA distribution policy:
- * We support two policies:
- * 1) Balanced: Here all performance critical channels are
- * distributed evenly amongst all the NUMA nodes.
- * This policy will be the default policy.
- * 2) Localized: All channels of a given instance of a
- * performance critical service will be assigned CPUs
- * within a selected NUMA node.
- */
- enum hv_numa_policy affinity_policy;
-
bool probe_done;
/*
@@ -965,12 +944,6 @@ static inline bool is_sub_channel(const struct vmbus_channel *c)
return c->offermsg.offer.sub_channel_index != 0;
}
-static inline void set_channel_affinity_state(struct vmbus_channel *c,
- enum hv_numa_policy policy)
-{
- c->affinity_policy = policy;
-}
-
static inline void set_channel_read_mode(struct vmbus_channel *c,
enum hv_callback_mode mode)
{
--
2.24.0
For each storvsc_device, storvsc keeps track of the channel target CPUs
associated with the device (alloced_cpus) and uses this information to
fill a "cache" (stor_chns) that maps each CPU to a channel according to a
heuristic. Update the alloced_cpus mask and the stor_chns array when a
channel of the storvsc device is re-assigned to a different CPU.
Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: <[email protected]>
---
drivers/hv/vmbus_drv.c | 4 ++
drivers/scsi/storvsc_drv.c | 95 ++++++++++++++++++++++++++++++++++----
include/linux/hyperv.h | 3 ++
3 files changed, 94 insertions(+), 8 deletions(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 84d2f22c569aa..7199fee2b5869 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel,
* in on a CPU that is different from the channel target_cpu value.
*/
+ if (channel->change_target_cpu_callback)
+ (*channel->change_target_cpu_callback)(channel,
+ channel->target_cpu, target_cpu);
+
channel->target_cpu = target_cpu;
channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
channel->numa_node = cpu_to_node(target_cpu);
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index fb41636519ee8..a680592b9d32a 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device(
}
+void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new)
+{
+ struct storvsc_device *stor_device;
+ struct vmbus_channel *cur_chn;
+ bool old_is_alloced = false;
+ struct hv_device *device;
+ unsigned long flags;
+ int cpu;
+
+ device = channel->primary_channel ?
+ channel->primary_channel->device_obj
+ : channel->device_obj;
+ stor_device = get_out_stor_device(device);
+ if (!stor_device)
+ return;
+
+ /* See storvsc_do_io() -> get_og_chn(). */
+ spin_lock_irqsave(&device->channel->lock, flags);
+
+ /*
+ * Determines if the storvsc device has other channels assigned to
+ * the "old" CPU to update the alloced_cpus mask and the stor_chns
+ * array.
+ */
+ if (device->channel != channel && device->channel->target_cpu == old) {
+ cur_chn = device->channel;
+ old_is_alloced = true;
+ goto old_is_alloced;
+ }
+ list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) {
+ if (cur_chn == channel)
+ continue;
+ if (cur_chn->target_cpu == old) {
+ old_is_alloced = true;
+ goto old_is_alloced;
+ }
+ }
+
+old_is_alloced:
+ if (old_is_alloced)
+ WRITE_ONCE(stor_device->stor_chns[old], cur_chn);
+ else
+ cpumask_clear_cpu(old, &stor_device->alloced_cpus);
+
+ /* "Flush" the stor_chns array. */
+ for_each_possible_cpu(cpu) {
+ if (stor_device->stor_chns[cpu] && !cpumask_test_cpu(
+ cpu, &stor_device->alloced_cpus))
+ WRITE_ONCE(stor_device->stor_chns[cpu], NULL);
+ }
+
+ WRITE_ONCE(stor_device->stor_chns[new], channel);
+ cpumask_set_cpu(new, &stor_device->alloced_cpus);
+
+ spin_unlock_irqrestore(&device->channel->lock, flags);
+}
+
static void handle_sc_creation(struct vmbus_channel *new_sc)
{
struct hv_device *device = new_sc->primary_channel->device_obj;
@@ -648,6 +705,8 @@ static void handle_sc_creation(struct vmbus_channel *new_sc)
return;
}
+ new_sc->change_target_cpu_callback = storvsc_change_target_cpu;
+
/* Add the sub-channel to the array of available channels. */
stor_device->stor_chns[new_sc->target_cpu] = new_sc;
cpumask_set_cpu(new_sc->target_cpu, &stor_device->alloced_cpus);
@@ -876,6 +935,8 @@ static int storvsc_channel_init(struct hv_device *device, bool is_fc)
if (stor_device->stor_chns == NULL)
return -ENOMEM;
+ device->channel->change_target_cpu_callback = storvsc_change_target_cpu;
+
stor_device->stor_chns[device->channel->target_cpu] = device->channel;
cpumask_set_cpu(device->channel->target_cpu,
&stor_device->alloced_cpus);
@@ -1248,8 +1309,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
const struct cpumask *node_mask;
int num_channels, tgt_cpu;
- if (stor_device->num_sc == 0)
+ if (stor_device->num_sc == 0) {
+ stor_device->stor_chns[q_num] = stor_device->device->channel;
return stor_device->device->channel;
+ }
/*
* Our channel array is sparsley populated and we
@@ -1258,7 +1321,6 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
* The strategy is simple:
* I. Ensure NUMA locality
* II. Distribute evenly (best effort)
- * III. Mapping is persistent.
*/
node_mask = cpumask_of_node(cpu_to_node(q_num));
@@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
if (cpumask_test_cpu(tgt_cpu, node_mask))
num_channels++;
}
- if (num_channels == 0)
+ if (num_channels == 0) {
+ stor_device->stor_chns[q_num] = stor_device->device->channel;
return stor_device->device->channel;
+ }
hash_qnum = q_num;
while (hash_qnum >= num_channels)
@@ -1295,6 +1359,7 @@ static int storvsc_do_io(struct hv_device *device,
struct storvsc_device *stor_device;
struct vstor_packet *vstor_packet;
struct vmbus_channel *outgoing_channel, *channel;
+ unsigned long flags;
int ret = 0;
const struct cpumask *node_mask;
int tgt_cpu;
@@ -1308,10 +1373,11 @@ static int storvsc_do_io(struct hv_device *device,
request->device = device;
/*
- * Select an an appropriate channel to send the request out.
+ * Select an appropriate channel to send the request out.
*/
- if (stor_device->stor_chns[q_num] != NULL) {
- outgoing_channel = stor_device->stor_chns[q_num];
+ /* See storvsc_change_target_cpu(). */
+ outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]);
+ if (outgoing_channel != NULL) {
if (outgoing_channel->target_cpu == q_num) {
/*
* Ideally, we want to pick a different channel if
@@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device,
continue;
if (tgt_cpu == q_num)
continue;
- channel = stor_device->stor_chns[tgt_cpu];
+ channel = READ_ONCE(
+ stor_device->stor_chns[tgt_cpu]);
+ if (channel == NULL)
+ continue;
if (hv_get_avail_to_write_percent(
&channel->outbound)
> ring_avail_percent_lowater) {
@@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device,
for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
if (cpumask_test_cpu(tgt_cpu, node_mask))
continue;
- channel = stor_device->stor_chns[tgt_cpu];
+ channel = READ_ONCE(
+ stor_device->stor_chns[tgt_cpu]);
+ if (channel == NULL)
+ continue;
if (hv_get_avail_to_write_percent(
&channel->outbound)
> ring_avail_percent_lowater) {
@@ -1360,7 +1432,14 @@ static int storvsc_do_io(struct hv_device *device,
}
}
} else {
+ spin_lock_irqsave(&device->channel->lock, flags);
+ outgoing_channel = stor_device->stor_chns[q_num];
+ if (outgoing_channel != NULL) {
+ spin_unlock_irqrestore(&device->channel->lock, flags);
+ goto found_channel;
+ }
outgoing_channel = get_og_chn(stor_device, q_num);
+ spin_unlock_irqrestore(&device->channel->lock, flags);
}
found_channel:
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index edfcd42319ef3..9018b89614b78 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -773,6 +773,9 @@ struct vmbus_channel {
void (*onchannel_callback)(void *context);
void *channel_callback_context;
+ void (*change_target_cpu_callback)(struct vmbus_channel *channel,
+ u32 old, u32 new);
+
/*
* Synchronize channel scheduling and channel removal; see the inline
* comments in vmbus_chan_sched() and vmbus_reset_channel_cb().
--
2.24.0
"Andrea Parri (Microsoft)" <[email protected]> writes:
> A Linux guest have to pick a "connect CPU" to communicate with the
> Hyper-V host. This CPU can not be taken offline because Hyper-V does
> not provide a way to change that CPU assignment.
>
> Current code sets the connect CPU to whatever CPU ends up running the
> function vmbus_negotiate_version(), and this will generate problems if
> that CPU is taken offine.
>
> Establish CPU0 as the connect CPU, and add logics to prevents the
> connect CPU from being taken offline. We could pick some other CPU,
> and we could pick that "other CPU" dynamically if there was a reason to
> do so at some point in the future. But for now, #defining the connect
> CPU to 0 is the most straightforward and least complex solution.
>
> While on this, add inline comments explaining "why" offer and rescind
> messages should not be handled by a same serialized work queue.
>
> Suggested-by: Dexuan Cui <[email protected]>
> Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> ---
> drivers/hv/connection.c | 20 +-------------------
> drivers/hv/hv.c | 7 +++++++
> drivers/hv/hyperv_vmbus.h | 11 ++++++-----
> drivers/hv/vmbus_drv.c | 20 +++++++++++++++++---
> 4 files changed, 31 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 74e77de89b4f3..f4bd306d2cef9 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -69,7 +69,6 @@ MODULE_PARM_DESC(max_version,
> int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
> {
> int ret = 0;
> - unsigned int cur_cpu;
> struct vmbus_channel_initiate_contact *msg;
> unsigned long flags;
>
> @@ -102,24 +101,7 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
>
> msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
> msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
> - /*
> - * We want all channel messages to be delivered on CPU 0.
> - * This has been the behavior pre-win8. This is not
> - * perf issue and having all channel messages delivered on CPU 0
> - * would be ok.
> - * For post win8 hosts, we support receiving channel messagges on
> - * all the CPUs. This is needed for kexec to work correctly where
> - * the CPU attempting to connect may not be CPU 0.
> - */
> - if (version >= VERSION_WIN8_1) {
> - cur_cpu = get_cpu();
> - msg->target_vcpu = hv_cpu_number_to_vp_number(cur_cpu);
> - vmbus_connection.connect_cpu = cur_cpu;
> - put_cpu();
> - } else {
> - msg->target_vcpu = 0;
> - vmbus_connection.connect_cpu = 0;
> - }
> + msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
>
> /*
> * Add to list before we send the request since we may
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 6098e0cbdb4b0..e2b3310454640 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -249,6 +249,13 @@ int hv_synic_cleanup(unsigned int cpu)
> bool channel_found = false;
> unsigned long flags;
>
> + /*
> + * Hyper-V does not provide a way to change the connect CPU once
> + * it is set; we must prevent the connect CPU from going offline.
> + */
> + if (cpu == VMBUS_CONNECT_CPU)
> + return -EBUSY;
> +
> /*
> * Search for channels which are bound to the CPU we're about to
> * cleanup. In case we find one and vmbus is still connected we need to
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 70b30e223a578..67fb1edcbf527 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -212,12 +212,13 @@ enum vmbus_connect_state {
>
> #define MAX_SIZE_CHANNEL_MESSAGE HV_MESSAGE_PAYLOAD_BYTE_COUNT
>
> -struct vmbus_connection {
> - /*
> - * CPU on which the initial host contact was made.
> - */
> - int connect_cpu;
> +/*
> + * The CPU that Hyper-V will interrupt for VMBUS messages, such as
> + * CHANNELMSG_OFFERCHANNEL and CHANNELMSG_RESCIND_CHANNELOFFER.
> + */
> +#define VMBUS_CONNECT_CPU 0
>
> +struct vmbus_connection {
> u32 msg_conn_id;
>
> atomic_t offer_in_progress;
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 029378c27421d..7600615e13754 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1056,14 +1056,28 @@ void vmbus_on_msg_dpc(unsigned long data)
> /*
> * If we are handling the rescind message;
> * schedule the work on the global work queue.
> + *
> + * The OFFER message and the RESCIND message should
> + * not be handled by the same serialized work queue,
> + * because the OFFER handler may call vmbus_open(),
> + * which tries to open the channel by sending an
> + * OPEN_CHANNEL message to the host and waits for
> + * the host's response; however, if the host has
> + * rescinded the channel before it receives the
> + * OPEN_CHANNEL message, the host just silently
> + * ignores the OPEN_CHANNEL message; as a result,
> + * the guest's OFFER handler hangs for ever, if we
> + * handle the RESCIND message in the same serialized
> + * work queue: the RESCIND handler can not start to
> + * run before the OFFER handler finishes.
> */
> - schedule_work_on(vmbus_connection.connect_cpu,
> + schedule_work_on(VMBUS_CONNECT_CPU,
> &ctx->work);
> break;
>
> case CHANNELMSG_OFFERCHANNEL:
> atomic_inc(&vmbus_connection.offer_in_progress);
> - queue_work_on(vmbus_connection.connect_cpu,
> + queue_work_on(VMBUS_CONNECT_CPU,
> vmbus_connection.work_queue,
> &ctx->work);
> break;
> @@ -1110,7 +1124,7 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
>
> INIT_WORK(&ctx->work, vmbus_onmessage_work);
>
> - queue_work_on(vmbus_connection.connect_cpu,
> + queue_work_on(VMBUS_CONNECT_CPU,
> vmbus_connection.work_queue,
> &ctx->work);
> }
I tried to refresh my memory on why 'connect_cpu' was introduced and it
all comes down to the following commit:
commit 7268644734f6a300353a4c4ff8bf3e013ba80f89
Author: Alex Ng <[email protected]>
Date: Fri Feb 26 15:13:22 2016 -0800
Drivers: hv: vmbus: Support kexec on ws2012 r2 and above
which, for some unknown reason, kept hardcoding '0' for pre-win2012-r2
(hv_context.vp_index[smp_processor_id()] would do exactly the same in all
cases, I guess). Later, 'connect_cpu' appeared just to remember our
choice; I can't see why we didn't go with CPU0 for simplicity.
Reviewed-by: Vitaly Kuznetsov <[email protected]>
--
Vitaly
"Andrea Parri (Microsoft)" <[email protected]> writes:
> The offer and rescind works are currently scheduled on the so called
> "connect CPU". However, this is not really needed: we can synchronize
> the works by relying on the usage of the offer_in_progress counter and
> of the channel_mutex mutex. This synchronization is already in place.
> So, remove this unnecessary "bind to the connect CPU" constraint and
> update the inline comments accordingly.
>
> Suggested-by: Dexuan Cui <[email protected]>
> Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> ---
> drivers/hv/channel_mgmt.c | 21 ++++++++++++++++-----
> drivers/hv/vmbus_drv.c | 39 ++++++++++++++++++++++++++++-----------
> 2 files changed, 44 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 0370364169c4e..1191f3d76d111 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -1025,11 +1025,22 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
> * offer comes in first and then the rescind.
> * Since we process these events in work elements,
> * and with preemption, we may end up processing
> - * the events out of order. Given that we handle these
> - * work elements on the same CPU, this is possible only
> - * in the case of preemption. In any case wait here
> - * until the offer processing has moved beyond the
> - * point where the channel is discoverable.
> + * the events out of order. We rely on the synchronization
> + * provided by offer_in_progress and by channel_mutex for
> + * ordering these events:
> + *
> + * { Initially: offer_in_progress = 1 }
> + *
> + * CPU1 CPU2
> + *
> + * [vmbus_process_offer()] [vmbus_onoffer_rescind()]
> + *
> + * LOCK channel_mutex WAIT_ON offer_in_progress == 0
> + * DECREMENT offer_in_progress LOCK channel_mutex
> + * INSERT chn_list SEARCH chn_list
> + * UNLOCK channel_mutex UNLOCK channel_mutex
> + *
> + * Forbids: CPU2's SEARCH from *not* seeing CPU1's INSERT
WAIT_ON offer_in_progress == 0
LOCK channel_mutex
seems to be racy: what happens if offer_in_progress increments after we
read it but before we managed to acquire channel_mutex?
I think this should be changed to
LOCK channel_mutex
CHECK offer_in_progress == 0
EQUAL? GOTO proceed with rescind handling
NOT EQUAL?
WHILE offer_in_progress != 0 {
UNLOCK channel_mutex
MSLEEP(1)
LOCK channel_mutex
}
proceed with rescind handling:
...
UNLOCK channel_mutex
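To make the suggestion concrete, here is a minimal C sketch of the pattern
described above; the wrapper function and its name are illustrative only,
while offer_in_progress and channel_mutex are the existing vmbus_connection
fields (msleep() needs <linux/delay.h>):

static void vmbus_rescind_wait_for_offers(void)
{
        mutex_lock(&vmbus_connection.channel_mutex);
        /*
         * The decision to proceed is made while holding channel_mutex,
         * after observing offer_in_progress == 0 under that same lock.
         */
        while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
                mutex_unlock(&vmbus_connection.channel_mutex);
                msleep(1);
                mutex_lock(&vmbus_connection.channel_mutex);
        }
        /* ... proceed with the rescind handling ... */
        mutex_unlock(&vmbus_connection.channel_mutex);
}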
> */
>
> while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 7600615e13754..903b1ec6a259e 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1048,8 +1048,9 @@ void vmbus_on_msg_dpc(unsigned long data)
> /*
> * The host can generate a rescind message while we
> * may still be handling the original offer. We deal with
> - * this condition by ensuring the processing is done on the
> - * same CPU.
> + * this condition by relying on the synchronization provided
> + * by offer_in_progress and by channel_mutex. See also the
> + * inline comments in vmbus_onoffer_rescind().
> */
> switch (hdr->msgtype) {
> case CHANNELMSG_RESCIND_CHANNELOFFER:
> @@ -1071,16 +1072,34 @@ void vmbus_on_msg_dpc(unsigned long data)
> * work queue: the RESCIND handler can not start to
> * run before the OFFER handler finishes.
> */
> - schedule_work_on(VMBUS_CONNECT_CPU,
> - &ctx->work);
> + schedule_work(&ctx->work);
> break;
>
> case CHANNELMSG_OFFERCHANNEL:
> + /*
> + * The host sends the offer message of a given channel
> + * before sending the rescind message of the same
> + * channel. These messages are sent to the guest's
> + * connect CPU; the guest then starts processing them
> + * in the tasklet handler on this CPU:
> + *
> + * VMBUS_CONNECT_CPU
> + *
> + * [vmbus_on_msg_dpc()]
> + * atomic_inc() // CHANNELMSG_OFFERCHANNEL
> + * queue_work()
> + * ...
> + * [vmbus_on_msg_dpc()]
> + * schedule_work() // CHANNELMSG_RESCIND_CHANNELOFFER
> + *
> + * We rely on the memory-ordering properties of the
> + * queue_work() and schedule_work() primitives, which
> + * guarantee that the atomic increment will be visible
> + * to the CPUs which will execute the offer & rescind
> + * works by the time these works will start execution.
> + */
> atomic_inc(&vmbus_connection.offer_in_progress);
> - queue_work_on(VMBUS_CONNECT_CPU,
> - vmbus_connection.work_queue,
> - &ctx->work);
> - break;
> + fallthrough;
>
> default:
> queue_work(vmbus_connection.work_queue, &ctx->work);
> @@ -1124,9 +1143,7 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
>
> INIT_WORK(&ctx->work, vmbus_onmessage_work);
>
> - queue_work_on(VMBUS_CONNECT_CPU,
> - vmbus_connection.work_queue,
> - &ctx->work);
> + queue_work(vmbus_connection.work_queue, &ctx->work);
> }
> #endif /* CONFIG_PM_SLEEP */
--
Vitaly
"Andrea Parri (Microsoft)" <[email protected]> writes:
> When Hyper-V sends an interrupt to the guest, the guest has to figure
> out which channel the interrupt is associated with. Hyper-V sets a bit
> in a memory page that is shared with the guest, indicating a particular
> "relid" that the interrupt is associated with. The current Linux code
> then uses a set of per-CPU linked lists to map a given "relid" to a
> pointer to a channel structure.
>
> This design introduces a synchronization problem if the CPU that Hyper-V
> will interrupt for a certain channel is changed. If the interrupt comes
> on the "old CPU" and the channel was already moved to the per-CPU list
> of the "new CPU", then the relid -> channel mapping will fail and the
> interrupt is dropped. Similarly, if the interrupt comes on the new CPU
> but the channel was not moved to the per-CPU list of the new CPU, then
> the mapping will fail and the interrupt is dropped.
>
> Relids are integers ranging from 0 to 2047. The mapping from relids to
> channel structures can be done by setting up an array with 2048 entries,
> each entry being a pointer to a channel structure (hence total size ~16K
> bytes, which is not a problem). The array is global, so there are no
> per-CPU linked lists to update. The array can be searched and updated
> by simply loading and storing the array at the specified index. With no
> per-CPU data structures, the above mentioned synchronization problem is
> avoided and the relid2channel() function gets simpler.
>
> Suggested-by: Michael Kelley <[email protected]>
> Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> ---
> drivers/hv/channel_mgmt.c | 158 ++++++++++++++++++++++----------------
> drivers/hv/connection.c | 38 +++------
> drivers/hv/hv.c | 2 -
> drivers/hv/hyperv_vmbus.h | 14 ++--
> drivers/hv/vmbus_drv.c | 48 +++++++-----
> include/linux/hyperv.h | 5 --
> 6 files changed, 139 insertions(+), 126 deletions(-)
>
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 1191f3d76d111..9b1449c839575 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -319,7 +319,6 @@ static struct vmbus_channel *alloc_channel(void)
> init_completion(&channel->rescind_event);
>
> INIT_LIST_HEAD(&channel->sc_list);
> - INIT_LIST_HEAD(&channel->percpu_list);
>
> tasklet_init(&channel->callback_event,
> vmbus_on_event, (unsigned long)channel);
> @@ -340,23 +339,28 @@ static void free_channel(struct vmbus_channel *channel)
> kobject_put(&channel->kobj);
> }
>
> -static void percpu_channel_enq(void *arg)
> +void vmbus_channel_map_relid(struct vmbus_channel *channel)
> {
> - struct vmbus_channel *channel = arg;
> - struct hv_per_cpu_context *hv_cpu
> - = this_cpu_ptr(hv_context.cpu_context);
> -
> - list_add_tail_rcu(&channel->percpu_list, &hv_cpu->chan_list);
> + if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
> + return;
> + /*
> + * Pairs with the READ_ONCE() in vmbus_chan_sched(). Guarantees
> + * that vmbus_chan_sched() will find up-to-date data.
> + */
> + smp_store_release(
> + &vmbus_connection.channels[channel->offermsg.child_relid],
> + channel);
> }
>
> -static void percpu_channel_deq(void *arg)
> +void vmbus_channel_unmap_relid(struct vmbus_channel *channel)
> {
> - struct vmbus_channel *channel = arg;
> -
> - list_del_rcu(&channel->percpu_list);
> + if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
> + return;
> + WRITE_ONCE(
> + vmbus_connection.channels[channel->offermsg.child_relid],
> + NULL);
I don't think this smp_store_release()/WRITE_ONCE() fanciness gives you
anything. Basically, without proper synchronization with a lock, there are
no constructions that will give you any additional guarantee on top of
just doing X=1. E.g. smp_store_release() is just
barrier();
*p = v;
if I'm not mistaken. Nobody tells you when *some other CPU* will see the
update - 'eventually' is your best guess. Here, you're only setting one
pointer.
Percpu structures have an advantage: we (almost) never access them from
different CPUs, so just doing updates atomically (and writing a 64-bit
pointer on x86_64 is atomic) is OK.
I haven't looked at all possible scenarios but I'd suggest protecting
this array with a spinlock (in case we can have simultaneous accesses
from different CPUs and care about the result, of course).
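For illustration, a minimal sketch of such a spinlock-protected variant is
below; the lock name is an assumption, and callers running in interrupt
context (e.g. vmbus_chan_sched()) would need the _irqsave variants:

/* Assumed name, for illustration only; not something the patch adds. */
static DEFINE_SPINLOCK(vmbus_relid_lock);

void vmbus_channel_map_relid(struct vmbus_channel *channel)
{
        u32 relid = channel->offermsg.child_relid;

        if (WARN_ON(relid >= MAX_CHANNEL_RELIDS))
                return;
        spin_lock(&vmbus_relid_lock);
        vmbus_connection.channels[relid] = channel;
        spin_unlock(&vmbus_relid_lock);
}

struct vmbus_channel *relid2channel(u32 relid)
{
        struct vmbus_channel *channel;

        if (WARN_ON(relid >= MAX_CHANNEL_RELIDS))
                return NULL;
        spin_lock(&vmbus_relid_lock);
        channel = vmbus_connection.channels[relid];
        spin_unlock(&vmbus_relid_lock);
        return channel;
}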
> }
>
> -
> static void vmbus_release_relid(u32 relid)
> {
> struct vmbus_channel_relid_released msg;
> @@ -376,17 +380,25 @@ void hv_process_channel_removal(struct vmbus_channel *channel)
> struct vmbus_channel *primary_channel;
> unsigned long flags;
>
> - BUG_ON(!mutex_is_locked(&vmbus_connection.channel_mutex));
> + lockdep_assert_held(&vmbus_connection.channel_mutex);
> BUG_ON(!channel->rescind);
>
> - if (channel->target_cpu != get_cpu()) {
> - put_cpu();
> - smp_call_function_single(channel->target_cpu,
> - percpu_channel_deq, channel, true);
> - } else {
> - percpu_channel_deq(channel);
> - put_cpu();
> - }
> + /*
> + * hv_process_channel_removal() could find INVALID_RELID only for
> + * hv_sock channels. See the inline comments in vmbus_onoffer().
> + */
> + WARN_ON(channel->offermsg.child_relid == INVALID_RELID &&
> + !is_hvsock_channel(channel));
> +
> + /*
> + * Upon suspend, an in-use hv_sock channel is removed from the array of
> + * channels and the relid is invalidated. After hibernation, when the
> + * user-space appplication destroys the channel, it's unnecessary and
> + * unsafe to remove the channel from the array of channels. See also
> + * the inline comments before the call of vmbus_release_relid() below.
> + */
> + if (channel->offermsg.child_relid != INVALID_RELID)
> + vmbus_channel_unmap_relid(channel);
>
> if (channel->primary_channel == NULL) {
> list_del(&channel->listentry);
> @@ -447,16 +459,6 @@ static void vmbus_add_channel_work(struct work_struct *work)
>
> init_vp_index(newchannel, dev_type);
>
> - if (newchannel->target_cpu != get_cpu()) {
> - put_cpu();
> - smp_call_function_single(newchannel->target_cpu,
> - percpu_channel_enq,
> - newchannel, true);
> - } else {
> - percpu_channel_enq(newchannel);
> - put_cpu();
> - }
> -
> /*
> * This state is used to indicate a successful open
> * so that when we do close the channel normally, we
> @@ -523,17 +525,10 @@ static void vmbus_add_channel_work(struct work_struct *work)
> spin_unlock_irqrestore(&primary_channel->lock, flags);
> }
>
> - mutex_unlock(&vmbus_connection.channel_mutex);
> + /* vmbus_process_offer() has mapped the channel. */
> + vmbus_channel_unmap_relid(newchannel);
>
> - if (newchannel->target_cpu != get_cpu()) {
> - put_cpu();
> - smp_call_function_single(newchannel->target_cpu,
> - percpu_channel_deq,
> - newchannel, true);
> - } else {
> - percpu_channel_deq(newchannel);
> - put_cpu();
> - }
> + mutex_unlock(&vmbus_connection.channel_mutex);
>
> vmbus_release_relid(newchannel->offermsg.child_relid);
>
> @@ -599,6 +594,8 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel)
> spin_unlock_irqrestore(&channel->lock, flags);
> }
>
> + vmbus_channel_map_relid(newchannel);
> +
> mutex_unlock(&vmbus_connection.channel_mutex);
>
> /*
> @@ -937,8 +934,6 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
> oldchannel = find_primary_channel_by_offer(offer);
>
> if (oldchannel != NULL) {
> - atomic_dec(&vmbus_connection.offer_in_progress);
> -
> /*
> * We're resuming from hibernation: all the sub-channel and
> * hv_sock channels we had before the hibernation should have
> @@ -946,36 +941,65 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
> * primary channel that we had before the hibernation.
> */
>
> + /*
> + * { Initially: channel relid = INVALID_RELID,
> + * channels[valid_relid] = NULL }
> + *
> + * CPU1 CPU2
> + *
> + * [vmbus_onoffer()] [vmbus_device_release()]
> + *
> + * LOCK channel_mutex LOCK channel_mutex
> + * STORE channel relid = valid_relid LOAD r1 = channel relid
> + * MAP_RELID channel if (r1 != INVALID_RELID)
> + * UNLOCK channel_mutex UNMAP_RELID channel
> + * UNLOCK channel_mutex
> + *
> + * Forbids: r1 == valid_relid &&
> + * channels[valid_relid] == channel
> + *
> + * Note. r1 can be INVALID_RELID only for an hv_sock channel.
> + * None of the hv_sock channels which were present before the
> + * suspend are re-offered upon the resume. See the WARN_ON()
> + * in hv_process_channel_removal().
> + */
> + mutex_lock(&vmbus_connection.channel_mutex);
> +
> + atomic_dec(&vmbus_connection.offer_in_progress);
> +
> WARN_ON(oldchannel->offermsg.child_relid != INVALID_RELID);
> /* Fix up the relid. */
> oldchannel->offermsg.child_relid = offer->child_relid;
>
> offer_sz = sizeof(*offer);
> - if (memcmp(offer, &oldchannel->offermsg, offer_sz) == 0) {
> - check_ready_for_resume_event();
> - return;
> + if (memcmp(offer, &oldchannel->offermsg, offer_sz) != 0) {
> + /*
> + * This is not an error, since the host can also change
> + * the other field(s) of the offer, e.g. on WS RS5
> + * (Build 17763), the offer->connection_id of the
> + * Mellanox VF vmbus device can change when the host
> + * reoffers the device upon resume.
> + */
> + pr_debug("vmbus offer changed: relid=%d\n",
> + offer->child_relid);
> +
> + print_hex_dump_debug("Old vmbus offer: ",
> + DUMP_PREFIX_OFFSET, 16, 4,
> + &oldchannel->offermsg, offer_sz,
> + false);
> + print_hex_dump_debug("New vmbus offer: ",
> + DUMP_PREFIX_OFFSET, 16, 4,
> + offer, offer_sz, false);
> +
> + /* Fix up the old channel. */
> + vmbus_setup_channel_state(oldchannel, offer);
> }
>
> - /*
> - * This is not an error, since the host can also change the
> - * other field(s) of the offer, e.g. on WS RS5 (Build 17763),
> - * the offer->connection_id of the Mellanox VF vmbus device
> - * can change when the host reoffers the device upon resume.
> - */
> - pr_debug("vmbus offer changed: relid=%d\n",
> - offer->child_relid);
> -
> - print_hex_dump_debug("Old vmbus offer: ", DUMP_PREFIX_OFFSET,
> - 16, 4, &oldchannel->offermsg, offer_sz,
> - false);
> - print_hex_dump_debug("New vmbus offer: ", DUMP_PREFIX_OFFSET,
> - 16, 4, offer, offer_sz, false);
> -
> - /* Fix up the old channel. */
> - vmbus_setup_channel_state(oldchannel, offer);
> -
> + /* Add the channel back to the array of channels. */
> + vmbus_channel_map_relid(oldchannel);
> check_ready_for_resume_event();
>
> + mutex_unlock(&vmbus_connection.channel_mutex);
> return;
> }
>
> @@ -1033,14 +1057,14 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
> *
> * CPU1 CPU2
> *
> - * [vmbus_process_offer()] [vmbus_onoffer_rescind()]
> + * [vmbus_onoffer()] [vmbus_onoffer_rescind()]
> *
> * LOCK channel_mutex WAIT_ON offer_in_progress == 0
> * DECREMENT offer_in_progress LOCK channel_mutex
> - * INSERT chn_list SEARCH chn_list
> + * STORE channels[] LOAD channels[]
> * UNLOCK channel_mutex UNLOCK channel_mutex
> *
> - * Forbids: CPU2's SEARCH from *not* seeing CPU1's INSERT
> + * Forbids: CPU2's LOAD from *not* seeing CPU1's STORE
> */
>
> while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index f4bd306d2cef9..11170d9a2e1a5 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -248,6 +248,14 @@ int vmbus_connect(void)
> pr_info("Vmbus version:%d.%d\n",
> version >> 16, version & 0xFFFF);
>
> + vmbus_connection.channels = kcalloc(MAX_CHANNEL_RELIDS,
> + sizeof(struct vmbus_channel *),
> + GFP_KERNEL);
> + if (vmbus_connection.channels == NULL) {
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> +
> kfree(msginfo);
> return 0;
>
> @@ -295,33 +303,9 @@ void vmbus_disconnect(void)
> */
> struct vmbus_channel *relid2channel(u32 relid)
> {
> - struct vmbus_channel *channel;
> - struct vmbus_channel *found_channel = NULL;
> - struct list_head *cur, *tmp;
> - struct vmbus_channel *cur_sc;
> -
> - BUG_ON(!mutex_is_locked(&vmbus_connection.channel_mutex));
> -
> - list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> - if (channel->offermsg.child_relid == relid) {
> - found_channel = channel;
> - break;
> - } else if (!list_empty(&channel->sc_list)) {
> - /*
> - * Deal with sub-channels.
> - */
> - list_for_each_safe(cur, tmp, &channel->sc_list) {
> - cur_sc = list_entry(cur, struct vmbus_channel,
> - sc_list);
> - if (cur_sc->offermsg.child_relid == relid) {
> - found_channel = cur_sc;
> - break;
> - }
> - }
> - }
> - }
> -
> - return found_channel;
> + if (WARN_ON(relid >= MAX_CHANNEL_RELIDS))
> + return NULL;
> + return READ_ONCE(vmbus_connection.channels[relid]);
> }
>
> /*
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index e2b3310454640..17bf1f229152b 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -117,8 +117,6 @@ int hv_synic_alloc(void)
> pr_err("Unable to allocate post msg page\n");
> goto err;
> }
> -
> - INIT_LIST_HEAD(&hv_cpu->chan_list);
> }
>
> return 0;
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 67fb1edcbf527..2216cd5e8e8bf 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -132,12 +132,6 @@ struct hv_per_cpu_context {
> * basis.
> */
> struct tasklet_struct msg_dpc;
> -
> - /*
> - * To optimize the mapping of relid to channel, maintain
> - * per-cpu list of the channels based on their CPU affinity.
> - */
> - struct list_head chan_list;
> };
>
> struct hv_context {
> @@ -202,6 +196,8 @@ int hv_ringbuffer_read(struct vmbus_channel *channel,
> /* TODO: Need to make this configurable */
> #define MAX_NUM_CHANNELS_SUPPORTED 256
>
> +#define MAX_CHANNEL_RELIDS \
> + max(MAX_NUM_CHANNELS_SUPPORTED, HV_EVENT_FLAGS_COUNT)
>
> enum vmbus_connect_state {
> DISCONNECTED,
> @@ -251,6 +247,9 @@ struct vmbus_connection {
> struct list_head chn_list;
> struct mutex channel_mutex;
>
> + /* Array of channels */
> + struct vmbus_channel **channels;
> +
> /*
> * An offer message is handled first on the work_queue, and then
> * is further handled on handle_primary_chan_wq or
> @@ -337,6 +336,9 @@ int vmbus_add_channel_kobj(struct hv_device *device_obj,
>
> void vmbus_remove_channel_attr_group(struct vmbus_channel *channel);
>
> +void vmbus_channel_map_relid(struct vmbus_channel *channel);
> +void vmbus_channel_unmap_relid(struct vmbus_channel *channel);
> +
> struct vmbus_channel *relid2channel(u32 relid);
>
> void vmbus_free_channels(void);
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 903b1ec6a259e..301e3f484bb1a 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1198,33 +1198,39 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
> if (relid == 0)
> continue;
>
> + /*
> + * Pairs with the kfree_rcu() in vmbus_chan_release().
> + * Guarantees that the channel data structure doesn't
> + * get freed while the channel pointer below is being
> + * dereferenced.
> + */
> rcu_read_lock();
>
> /* Find channel based on relid */
> - list_for_each_entry_rcu(channel, &hv_cpu->chan_list, percpu_list) {
> - if (channel->offermsg.child_relid != relid)
> - continue;
> + channel = relid2channel(relid);
> + if (channel == NULL)
> + goto sched_unlock_rcu;
>
> - if (channel->rescind)
> - continue;
> + if (channel->rescind)
> + goto sched_unlock_rcu;
>
> - trace_vmbus_chan_sched(channel);
> + trace_vmbus_chan_sched(channel);
>
> - ++channel->interrupts;
> + ++channel->interrupts;
>
> - switch (channel->callback_mode) {
> - case HV_CALL_ISR:
> - vmbus_channel_isr(channel);
> - break;
> + switch (channel->callback_mode) {
> + case HV_CALL_ISR:
> + vmbus_channel_isr(channel);
> + break;
>
> - case HV_CALL_BATCHED:
> - hv_begin_read(&channel->inbound);
> - /* fallthrough */
> - case HV_CALL_DIRECT:
> - tasklet_schedule(&channel->callback_event);
> - }
> + case HV_CALL_BATCHED:
> + hv_begin_read(&channel->inbound);
> + fallthrough;
> + case HV_CALL_DIRECT:
> + tasklet_schedule(&channel->callback_event);
> }
>
> +sched_unlock_rcu:
> rcu_read_unlock();
> }
> }
> @@ -2208,9 +2214,12 @@ static int vmbus_bus_suspend(struct device *dev)
>
> list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> /*
> - * Invalidate the field. Upon resume, vmbus_onoffer() will fix
> - * up the field, and the other fields (if necessary).
> + * Remove the channel from the array of channels and invalidate
> + * the channel's relid. Upon resume, vmbus_onoffer() will fix
> + * up the relid (and other fields, if necessary) and add the
> + * channel back to the array.
> */
> + vmbus_channel_unmap_relid(channel);
> channel->offermsg.child_relid = INVALID_RELID;
>
> if (is_hvsock_channel(channel)) {
> @@ -2450,6 +2459,7 @@ static void __exit vmbus_exit(void)
> hv_debug_rm_all_dir();
>
> vmbus_free_channels();
> + kfree(vmbus_connection.channels);
>
> if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
> kmsg_dump_unregister(&hv_kmsg_dumper);
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 692c89ccf5dfd..6c794fd5c903e 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -854,11 +854,6 @@ struct vmbus_channel {
> * Support per-channel state for use by vmbus drivers.
> */
> void *per_channel_state;
> - /*
> - * To support per-cpu lookup mapping of relid to channel,
> - * link up channels based on their CPU affinity.
> - */
> - struct list_head percpu_list;
>
> /*
> * Defer freeing channel until after all cpu's have
--
Vitaly
"Andrea Parri (Microsoft)" <[email protected]> writes:
> VMBus version 4.1 and later support the CHANNELMSG_MODIFYCHANNEL(22)
> message type which can be used to request Hyper-V to change the vCPU
> that a channel will interrupt.
>
> Introduce the CHANNELMSG_MODIFYCHANNEL message type, and define the
> vmbus_send_modifychannel() function to send CHANNELMSG_MODIFYCHANNEL
> requests to the host via a hypercall. The function is then used to
> define a sysfs "store" operation, which allows to change the (v)CPU
> the channel will interrupt by using the sysfs interface. The feature
> can be used for load balancing or other purposes.
>
> One interesting catch here is that Hyper-V can *not* currently ACK
> CHANNELMSG_MODIFYCHANNEL messages with the promise that (after the ACK
> is sent) the channel won't send any more interrupts to the "old" CPU.
>
> The peculiarity of the CHANNELMSG_MODIFYCHANNEL messages is problematic
> if the user want to take a CPU offline, since we don't want to take a
> CPU offline (and, potentially, "lose" channel interrupts on such CPU)
> if the host is still processing a CHANNELMSG_MODIFYCHANNEL message
> associated to that CPU.
>
> It is worth mentioning, however, that we have been unable to observe
> the above mentioned "race": in all our tests, CHANNELMSG_MODIFYCHANNEL
> requests appeared *as if* they were processed synchronously by the
> host.
Hyper-V engineers never want to make our lives easier :-)
I can only think of a 'lazy' approach to setting channel CPU affinity:
we actually re-assign it to the target CPU only when we receive the first
interrupt there - but that is not very different from re-assigning it
immediately and still handling interrupts on the old CPU, as you do.
One more thing: it was already discussed several times but we never got
to it. I think this question was even raised on Michael's latest
'Hyper-V on ARM' submission. What about implementing a Hyper-V-specific
IRQ chip which would then support setting CPU affinity? The greatest
benefit is that we don't need new tools to do e.g. load balancing;
irqbalance will just work.
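To make the idea a bit more concrete, a minimal sketch of such an irq_chip
follows; the chip itself, storing the channel as the irq chip data, and the
policy of picking the first online CPU in the mask are all assumptions for
illustration, not something this series implements:

#include <linux/irq.h>
#include <linux/hyperv.h>
#include <asm/mshyperv.h>

/* Sketch only: assumes the vmbus_channel was set as the irq chip data. */
static int hv_vmbus_irq_set_affinity(struct irq_data *data,
                                     const struct cpumask *dest, bool force)
{
        struct vmbus_channel *channel = irq_data_get_irq_chip_data(data);
        unsigned int cpu = cpumask_first_and(dest, cpu_online_mask);

        if (cpu >= nr_cpu_ids)
                return -EINVAL;

        if (vmbus_send_modifychannel(channel->offermsg.child_relid,
                                     hv_cpu_number_to_vp_number(cpu)))
                return -EIO;

        return IRQ_SET_MASK_OK;
}

static struct irq_chip hv_vmbus_irq_chip = {
        .name                   = "HV-VMBUS",
        .irq_set_affinity       = hv_vmbus_irq_set_affinity,
};

With something like this in place, writing to /proc/irq/<n>/smp_affinity
(and hence irqbalance) would drive the same MODIFYCHANNEL path as the sysfs
store added by this patch.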
>
> Suggested-by: Michael Kelley <[email protected]>
> Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> ---
> drivers/hv/channel.c | 28 ++++++++++
> drivers/hv/channel_mgmt.c | 2 +-
> drivers/hv/hv_trace.h | 19 +++++++
> drivers/hv/vmbus_drv.c | 108 +++++++++++++++++++++++++++++++++++++-
> include/linux/hyperv.h | 10 +++-
> 5 files changed, 163 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
> index 132e476f87b2e..90070b337c10d 100644
> --- a/drivers/hv/channel.c
> +++ b/drivers/hv/channel.c
> @@ -289,6 +289,34 @@ int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id,
> }
> EXPORT_SYMBOL_GPL(vmbus_send_tl_connect_request);
>
> +/*
> + * Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt.
> + *
> + * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. Also, Hyper-V does not
> + * ACK such messages. IOW we can't know when the host will stop interrupting
> + * the "old" vCPU and start interrupting the "new" vCPU for the given channel.
> + *
> + * The CHANNELMSG_MODIFYCHANNEL message type is supported since VMBus version
> + * VERSION_WIN10_V4_1.
> + */
> +int vmbus_send_modifychannel(u32 child_relid, u32 target_vp)
> +{
> + struct vmbus_channel_modifychannel conn_msg;
> + int ret;
> +
> + memset(&conn_msg, 0, sizeof(conn_msg));
> + conn_msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL;
> + conn_msg.child_relid = child_relid;
> + conn_msg.target_vp = target_vp;
> +
> + ret = vmbus_post_msg(&conn_msg, sizeof(conn_msg), true);
> +
> + trace_vmbus_send_modifychannel(&conn_msg, ret);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vmbus_send_modifychannel);
> +
> /*
> * create_gpadl_header - Creates a gpadl for the specified buffer
> */
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 34672dc2fc935..818a9d8bf649e 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -1353,7 +1353,7 @@ channel_message_table[CHANNELMSG_COUNT] = {
> { CHANNELMSG_19, 0, NULL },
> { CHANNELMSG_20, 0, NULL },
> { CHANNELMSG_TL_CONNECT_REQUEST, 0, NULL },
> - { CHANNELMSG_22, 0, NULL },
> + { CHANNELMSG_MODIFYCHANNEL, 0, NULL },
> { CHANNELMSG_TL_CONNECT_RESULT, 0, NULL },
> };
>
> diff --git a/drivers/hv/hv_trace.h b/drivers/hv/hv_trace.h
> index e70783e33680f..a43bc76c2d5d0 100644
> --- a/drivers/hv/hv_trace.h
> +++ b/drivers/hv/hv_trace.h
> @@ -296,6 +296,25 @@ TRACE_EVENT(vmbus_send_tl_connect_request,
> )
> );
>
> +TRACE_EVENT(vmbus_send_modifychannel,
> + TP_PROTO(const struct vmbus_channel_modifychannel *msg,
> + int ret),
> + TP_ARGS(msg, ret),
> + TP_STRUCT__entry(
> + __field(u32, child_relid)
> + __field(u32, target_vp)
> + __field(int, ret)
> + ),
> + TP_fast_assign(
> + __entry->child_relid = msg->child_relid;
> + __entry->target_vp = msg->target_vp;
> + __entry->ret = ret;
> + ),
> + TP_printk("binding child_relid 0x%x to target_vp 0x%x, ret %d",
> + __entry->child_relid, __entry->target_vp, __entry->ret
> + )
> + );
> +
> DECLARE_EVENT_CLASS(vmbus_channel,
> TP_PROTO(const struct vmbus_channel *channel),
> TP_ARGS(channel),
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index ebe2716f583d2..84d2f22c569aa 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1550,8 +1550,24 @@ static ssize_t vmbus_chan_attr_show(struct kobject *kobj,
> return attribute->show(chan, buf);
> }
>
> +static ssize_t vmbus_chan_attr_store(struct kobject *kobj,
> + struct attribute *attr, const char *buf,
> + size_t count)
> +{
> + const struct vmbus_chan_attribute *attribute
> + = container_of(attr, struct vmbus_chan_attribute, attr);
> + struct vmbus_channel *chan
> + = container_of(kobj, struct vmbus_channel, kobj);
> +
> + if (!attribute->store)
> + return -EIO;
> +
> + return attribute->store(chan, buf, count);
> +}
> +
> static const struct sysfs_ops vmbus_chan_sysfs_ops = {
> .show = vmbus_chan_attr_show,
> + .store = vmbus_chan_attr_store,
> };
>
> static ssize_t out_mask_show(struct vmbus_channel *channel, char *buf)
> @@ -1622,11 +1638,99 @@ static ssize_t write_avail_show(struct vmbus_channel *channel, char *buf)
> }
> static VMBUS_CHAN_ATTR_RO(write_avail);
>
> -static ssize_t show_target_cpu(struct vmbus_channel *channel, char *buf)
> +static ssize_t target_cpu_show(struct vmbus_channel *channel, char *buf)
> {
> return sprintf(buf, "%u\n", channel->target_cpu);
> }
> -static VMBUS_CHAN_ATTR(cpu, S_IRUGO, show_target_cpu, NULL);
> +static ssize_t target_cpu_store(struct vmbus_channel *channel,
> + const char *buf, size_t count)
> +{
> + ssize_t ret = count;
> + u32 target_cpu;
> +
> + if (vmbus_proto_version < VERSION_WIN10_V4_1)
> + return -EIO;
> +
> + if (sscanf(buf, "%uu", &target_cpu) != 1)
> + return -EIO;
> +
> + /* Validate target_cpu for the cpumask_test_cpu() operation below. */
> + if (target_cpu >= nr_cpumask_bits)
> + return -EINVAL;
> +
> + /* No CPUs should come up or down during this. */
> + cpus_read_lock();
> +
> + if (!cpumask_test_cpu(target_cpu, cpu_online_mask)) {
> + cpus_read_unlock();
> + return -EINVAL;
> + }
> +
> + /*
> + * Synchronizes target_cpu_store() and channel closure:
> + *
> + * { Initially: state = CHANNEL_OPENED }
> + *
> + * CPU1 CPU2
> + *
> + * [target_cpu_store()] [vmbus_disconnect_ring()]
> + *
> + * LOCK channel_mutex LOCK channel_mutex
> + * LOAD r1 = state LOAD r2 = state
> + * IF (r1 == CHANNEL_OPENED) IF (r2 == CHANNEL_OPENED)
> + * SEND MODIFYCHANNEL STORE state = CHANNEL_OPEN
> + * [...] SEND CLOSECHANNEL
> + * UNLOCK channel_mutex UNLOCK channel_mutex
> + *
> + * Forbids: r1 == r2 == CHANNEL_OPENED (i.e., CPU1's LOCK precedes
> + * CPU2's LOCK) && CPU2's SEND precedes CPU1's SEND
> + *
> + * Note. The host processes the channel messages "sequentially", in
> + * the order in which they are received on a per-partition basis.
> + */
> + mutex_lock(&vmbus_connection.channel_mutex);
> +
> + /*
> + * Hyper-V will ignore MODIFYCHANNEL messages for "non-open" channels;
> + * avoid sending the message and fail here for such channels.
> + */
> + if (channel->state != CHANNEL_OPENED_STATE) {
> + ret = -EIO;
> + goto cpu_store_unlock;
> + }
> +
> + if (channel->target_cpu == target_cpu)
> + goto cpu_store_unlock;
> +
> + if (vmbus_send_modifychannel(channel->offermsg.child_relid,
> + hv_cpu_number_to_vp_number(target_cpu))) {
> + ret = -EIO;
> + goto cpu_store_unlock;
> + }
> +
> + /*
> + * Warning. At this point, there is *no* guarantee that the host will
> + * have successfully processed the vmbus_send_modifychannel() request.
> + * See the header comment of vmbus_send_modifychannel() for more info.
> + *
> + * Lags in the processing of the above vmbus_send_modifychannel() can
> + * result in missed interrupts if the "old" target CPU is taken offline
> + * before Hyper-V starts sending interrupts to the "new" target CPU.
> + * But apart from this offlining scenario, the code tolerates such
> + * lags. It will function correctly even if a channel interrupt comes
> + * in on a CPU that is different from the channel target_cpu value.
> + */
> +
> + channel->target_cpu = target_cpu;
> + channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
> + channel->numa_node = cpu_to_node(target_cpu);
> +
> +cpu_store_unlock:
> + mutex_unlock(&vmbus_connection.channel_mutex);
> + cpus_read_unlock();
> + return ret;
> +}
> +static VMBUS_CHAN_ATTR(cpu, 0644, target_cpu_show, target_cpu_store);
>
> static ssize_t channel_pending_show(struct vmbus_channel *channel,
> char *buf)
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index f8e7c22d41a1a..edfcd42319ef3 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -425,7 +425,7 @@ enum vmbus_channel_message_type {
> CHANNELMSG_19 = 19,
> CHANNELMSG_20 = 20,
> CHANNELMSG_TL_CONNECT_REQUEST = 21,
> - CHANNELMSG_22 = 22,
> + CHANNELMSG_MODIFYCHANNEL = 22,
> CHANNELMSG_TL_CONNECT_RESULT = 23,
> CHANNELMSG_COUNT
> };
> @@ -620,6 +620,13 @@ struct vmbus_channel_tl_connect_request {
> guid_t host_service_id;
> } __packed;
>
> +/* Modify Channel parameters, cf. vmbus_send_modifychannel() */
> +struct vmbus_channel_modifychannel {
> + struct vmbus_channel_message_header header;
> + u32 child_relid;
> + u32 target_vp;
> +} __packed;
> +
> struct vmbus_channel_version_response {
> struct vmbus_channel_message_header header;
> u8 version_supported;
> @@ -1505,6 +1512,7 @@ extern __u32 vmbus_proto_version;
>
> int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id,
> const guid_t *shv_host_servie_id);
> +int vmbus_send_modifychannel(u32 child_relid, u32 target_vp);
> void vmbus_set_event(struct vmbus_channel *channel);
>
> /* Get the start of the ring buffer. */
--
Vitaly
On Wed, 25 Mar 2020 23:54:58 +0100
"Andrea Parri (Microsoft)" <[email protected]> wrote:
> vmbus_chan_sched() might call the netvsc driver callback function that
> ends up scheduling NAPI work. This "work" can access the channel ring
> buffer, so we must ensure that any such work is completed and that the
> ring buffer is no longer being accessed before freeing the ring buffer
> data structure in the channel closure path. To this end, disable NAPI
> before calling vmbus_close() in netvsc_device_remove().
>
> Suggested-by: Michael Kelley <[email protected]>
> Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Cc: <[email protected]>
Do you have a test that reproduces this issue?
The netvsc device is somewhat unique in that it needs NAPI
enabled on the primary channel to interact with the host.
Therefore it can't call napi_disable in the normal dev->stop() place.
Acked-by: Stephen Hemminger <[email protected]>
On Thu, Mar 26, 2020 at 03:16:21PM +0100, Vitaly Kuznetsov wrote:
> "Andrea Parri (Microsoft)" <[email protected]> writes:
>
> > The offer and rescind works are currently scheduled on the so called
> > "connect CPU". However, this is not really needed: we can synchronize
> > the works by relying on the usage of the offer_in_progress counter and
> > of the channel_mutex mutex. This synchronization is already in place.
> > So, remove this unnecessary "bind to the connect CPU" constraint and
> > update the inline comments accordingly.
> >
> > Suggested-by: Dexuan Cui <[email protected]>
> > Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> > ---
> > drivers/hv/channel_mgmt.c | 21 ++++++++++++++++-----
> > drivers/hv/vmbus_drv.c | 39 ++++++++++++++++++++++++++++-----------
> > 2 files changed, 44 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> > index 0370364169c4e..1191f3d76d111 100644
> > --- a/drivers/hv/channel_mgmt.c
> > +++ b/drivers/hv/channel_mgmt.c
> > @@ -1025,11 +1025,22 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
> > * offer comes in first and then the rescind.
> > * Since we process these events in work elements,
> > * and with preemption, we may end up processing
> > - * the events out of order. Given that we handle these
> > - * work elements on the same CPU, this is possible only
> > - * in the case of preemption. In any case wait here
> > - * until the offer processing has moved beyond the
> > - * point where the channel is discoverable.
> > + * the events out of order. We rely on the synchronization
> > + * provided by offer_in_progress and by channel_mutex for
> > + * ordering these events:
> > + *
> > + * { Initially: offer_in_progress = 1 }
> > + *
> > + * CPU1 CPU2
> > + *
> > + * [vmbus_process_offer()] [vmbus_onoffer_rescind()]
> > + *
> > + * LOCK channel_mutex WAIT_ON offer_in_progress == 0
> > + * DECREMENT offer_in_progress LOCK channel_mutex
> > + * INSERT chn_list SEARCH chn_list
> > + * UNLOCK channel_mutex UNLOCK channel_mutex
> > + *
> > + * Forbids: CPU2's SEARCH from *not* seeing CPU1's INSERT
>
> WAIT_ON offer_in_progress == 0
> LOCK channel_mutex
>
> seems to be racy: what happens if offer_in_progress increments after we
> read it but before we managed to acquire channel_mutex?
Note that the RESCIND work must see the increment which is performed
"before" queueing the work in question (and the associated OFFER work);
cf. the comment in vmbus_on_msg_dpc() below and
dbb92f88648d6 ("workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()")
AFAICT, this suffices to meet the intended behavior as sketched above.
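In code terms, the pattern is roughly as follows (the *_ctx names are
made up here; the calls mirror the hunk below):

	/* vmbus_on_msg_dpc(), CHANNELMSG_OFFERCHANNEL: */
	atomic_inc(&vmbus_connection.offer_in_progress);
	queue_work(vmbus_connection.work_queue, &offer_ctx->work);

	/* vmbus_on_msg_dpc(), later, CHANNELMSG_RESCIND_CHANNELOFFER: */
	schedule_work(&rescind_ctx->work);

	/* vmbus_onoffer_rescind(), running from rescind_ctx->work: */
	while (atomic_read(&vmbus_connection.offer_in_progress) != 0)
		msleep(1);	/* must observe the atomic_inc() above */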
I might be missing something of course, can you elaborate on the issue
here?
Thanks,
Andrea
>
> I think this shold be changed to
>
> LOCK channel_mutex
> CHECK offer_in_progress == 0
> EQUAL? GOTO proceed with rescind handling
> NOT EQUAL?
> WHILE offer_in_progress) != 0 {
> UNLOCK channel_mutex
> MSLEEP(1)
> LOCK channel_mutex
> }
> proceed with rescind handling:
> ...
> UNLOCK channel_mutex
>
> > */
> >
> > while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index 7600615e13754..903b1ec6a259e 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1048,8 +1048,9 @@ void vmbus_on_msg_dpc(unsigned long data)
> > /*
> > * The host can generate a rescind message while we
> > * may still be handling the original offer. We deal with
> > - * this condition by ensuring the processing is done on the
> > - * same CPU.
> > + * this condition by relying on the synchronization provided
> > + * by offer_in_progress and by channel_mutex. See also the
> > + * inline comments in vmbus_onoffer_rescind().
> > */
> > switch (hdr->msgtype) {
> > case CHANNELMSG_RESCIND_CHANNELOFFER:
> > @@ -1071,16 +1072,34 @@ void vmbus_on_msg_dpc(unsigned long data)
> > * work queue: the RESCIND handler can not start to
> > * run before the OFFER handler finishes.
> > */
> > - schedule_work_on(VMBUS_CONNECT_CPU,
> > - &ctx->work);
> > + schedule_work(&ctx->work);
> > break;
> >
> > case CHANNELMSG_OFFERCHANNEL:
> > + /*
> > + * The host sends the offer message of a given channel
> > + * before sending the rescind message of the same
> > + * channel. These messages are sent to the guest's
> > + * connect CPU; the guest then starts processing them
> > + * in the tasklet handler on this CPU:
> > + *
> > + * VMBUS_CONNECT_CPU
> > + *
> > + * [vmbus_on_msg_dpc()]
> > + * atomic_inc() // CHANNELMSG_OFFERCHANNEL
> > + * queue_work()
> > + * ...
> > + * [vmbus_on_msg_dpc()]
> > + * schedule_work() // CHANNELMSG_RESCIND_CHANNELOFFER
> > + *
> > + * We rely on the memory-ordering properties of the
> > + * queue_work() and schedule_work() primitives, which
> > + * guarantee that the atomic increment will be visible
> > + * to the CPUs which will execute the offer & rescind
> > + * works by the time these works will start execution.
> > + */
> > atomic_inc(&vmbus_connection.offer_in_progress);
> > - queue_work_on(VMBUS_CONNECT_CPU,
> > - vmbus_connection.work_queue,
> > - &ctx->work);
> > - break;
> > + fallthrough;
> >
> > default:
> > queue_work(vmbus_connection.work_queue, &ctx->work);
> > @@ -1124,9 +1143,7 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
> >
> > INIT_WORK(&ctx->work, vmbus_onmessage_work);
> >
> > - queue_work_on(VMBUS_CONNECT_CPU,
> > - vmbus_connection.work_queue,
> > - &ctx->work);
> > + queue_work(vmbus_connection.work_queue, &ctx->work);
> > }
> > #endif /* CONFIG_PM_SLEEP */
>
> --
> Vitaly
>
On Thu, Mar 26, 2020 at 03:31:20PM +0100, Vitaly Kuznetsov wrote:
> "Andrea Parri (Microsoft)" <[email protected]> writes:
>
> > When Hyper-V sends an interrupt to the guest, the guest has to figure
> > out which channel the interrupt is associated with. Hyper-V sets a bit
> > in a memory page that is shared with the guest, indicating a particular
> > "relid" that the interrupt is associated with. The current Linux code
> > then uses a set of per-CPU linked lists to map a given "relid" to a
> > pointer to a channel structure.
> >
> > This design introduces a synchronization problem if the CPU that Hyper-V
> > will interrupt for a certain channel is changed. If the interrupt comes
> > on the "old CPU" and the channel was already moved to the per-CPU list
> > of the "new CPU", then the relid -> channel mapping will fail and the
> > interrupt is dropped. Similarly, if the interrupt comes on the new CPU
> > but the channel was not moved to the per-CPU list of the new CPU, then
> > the mapping will fail and the interrupt is dropped.
> >
> > Relids are integers ranging from 0 to 2047. The mapping from relids to
> > channel structures can be done by setting up an array with 2048 entries,
> > each entry being a pointer to a channel structure (hence total size ~16K
> > bytes, which is not a problem). The array is global, so there are no
> > per-CPU linked lists to update. The array can be searched and updated
> > by simply loading and storing the array at the specified index. With no
> > per-CPU data structures, the above mentioned synchronization problem is
> > avoided and the relid2channel() function gets simpler.
> >
> > Suggested-by: Michael Kelley <[email protected]>
> > Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> > ---
> > drivers/hv/channel_mgmt.c | 158 ++++++++++++++++++++++----------------
> > drivers/hv/connection.c | 38 +++------
> > drivers/hv/hv.c | 2 -
> > drivers/hv/hyperv_vmbus.h | 14 ++--
> > drivers/hv/vmbus_drv.c | 48 +++++++-----
> > include/linux/hyperv.h | 5 --
> > 6 files changed, 139 insertions(+), 126 deletions(-)
> >
> > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> > index 1191f3d76d111..9b1449c839575 100644
> > --- a/drivers/hv/channel_mgmt.c
> > +++ b/drivers/hv/channel_mgmt.c
> > @@ -319,7 +319,6 @@ static struct vmbus_channel *alloc_channel(void)
> > init_completion(&channel->rescind_event);
> >
> > INIT_LIST_HEAD(&channel->sc_list);
> > - INIT_LIST_HEAD(&channel->percpu_list);
> >
> > tasklet_init(&channel->callback_event,
> > vmbus_on_event, (unsigned long)channel);
> > @@ -340,23 +339,28 @@ static void free_channel(struct vmbus_channel *channel)
> > kobject_put(&channel->kobj);
> > }
> >
> > -static void percpu_channel_enq(void *arg)
> > +void vmbus_channel_map_relid(struct vmbus_channel *channel)
> > {
> > - struct vmbus_channel *channel = arg;
> > - struct hv_per_cpu_context *hv_cpu
> > - = this_cpu_ptr(hv_context.cpu_context);
> > -
> > - list_add_tail_rcu(&channel->percpu_list, &hv_cpu->chan_list);
> > + if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
> > + return;
> > + /*
> > + * Pairs with the READ_ONCE() in vmbus_chan_sched(). Guarantees
> > + * that vmbus_chan_sched() will find up-to-date data.
> > + */
> > + smp_store_release(
> > + &vmbus_connection.channels[channel->offermsg.child_relid],
> > + channel);
> > }
> >
> > -static void percpu_channel_deq(void *arg)
> > +void vmbus_channel_unmap_relid(struct vmbus_channel *channel)
> > {
> > - struct vmbus_channel *channel = arg;
> > -
> > - list_del_rcu(&channel->percpu_list);
> > + if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
> > + return;
> > + WRITE_ONCE(
> > + vmbus_connection.channels[channel->offermsg.child_relid],
> > + NULL);
>
> I don't think this smp_store_release()/WRITE_ONCE() fanciness gives you
> anything. Basically, without proper synchronization with a lock there is
> no such constructions which will give you any additional guarantee on
> top of just doing X=1. E.g. smp_store_release() is just
> barrier();
> *p = v;
> if I'm not mistaken. Nobody tells you when *some other CPU* will see the
> update - 'eventually' is your best guess. Here, you're only setting one
> pointer.
>
> Percpu structures have an advantage: we (almost) never access them from
> different CPUs so just doing updates atomically (and writing 64bit
> pointer on x86_64 is atomic) is OK.
>
> I haven't looked at all possible scenarios but I'd suggest protecting
> this array with a spinlock (in case we can have simultaneous accesses
> from different CPUs and care about the result, of course).
The smp_store_release()+READ_ONCE() pair should guarantee that any store
to the channel fields performed before (in program order) the "mapping"
of the channel is visible to the CPU which observes that mapping; this
guarantee is expected to hold for all architectures.
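In litmus-test form, the pattern is essentially the message-passing one
in, e.g., MP+onceassign+derefonce.litmus from tools/memory-model, with
smp_store_release()/READ_ONCE() in place of rcu_assign_pointer()/
rcu_dereference(); a sketch (variable names are mine, not the code's):

C vmbus-relid-publish

(*
 * Result: Never
 *
 * P0 mimics vmbus_channel_map_relid(): initialize a channel field,
 * then publish the channel pointer.  P1 mimics the lookup done by
 * vmbus_chan_sched()/relid2channel(): if P1 observes the new pointer,
 * it must also observe the initialized field.
 *)

{
chan=z;
z=0;
}

P0(int *x, int **chan)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(chan, x);
}

P1(int *x, int **chan)
{
	int *r0;
	int r1;

	r0 = READ_ONCE(*chan);
	r1 = READ_ONCE(*r0);
}

exists (1:r0=x /\ 1:r1=0)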
Notice that this approach follows the current/upstream code, cf. the
rcu_assign_pointer() in list_add_tail_rcu() and notice that (both before
and after this series) vmbus_chan_sched() accesses the channel array
without any mutex/lock held.
I'd be inclined to stick to the current code (unless more turns out to
be required). Thoughts?
Thanks,
Andrea
Andrea Parri <[email protected]> writes:
> On Thu, Mar 26, 2020 at 03:16:21PM +0100, Vitaly Kuznetsov wrote:
>> "Andrea Parri (Microsoft)" <[email protected]> writes:
>>
>> > The offer and rescind works are currently scheduled on the so called
>> > "connect CPU". However, this is not really needed: we can synchronize
>> > the works by relying on the usage of the offer_in_progress counter and
>> > of the channel_mutex mutex. This synchronization is already in place.
>> > So, remove this unnecessary "bind to the connect CPU" constraint and
>> > update the inline comments accordingly.
>> >
>> > Suggested-by: Dexuan Cui <[email protected]>
>> > Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
>> > ---
>> > drivers/hv/channel_mgmt.c | 21 ++++++++++++++++-----
>> > drivers/hv/vmbus_drv.c | 39 ++++++++++++++++++++++++++++-----------
>> > 2 files changed, 44 insertions(+), 16 deletions(-)
>> >
>> > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
>> > index 0370364169c4e..1191f3d76d111 100644
>> > --- a/drivers/hv/channel_mgmt.c
>> > +++ b/drivers/hv/channel_mgmt.c
>> > @@ -1025,11 +1025,22 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
>> > * offer comes in first and then the rescind.
>> > * Since we process these events in work elements,
>> > * and with preemption, we may end up processing
>> > - * the events out of order. Given that we handle these
>> > - * work elements on the same CPU, this is possible only
>> > - * in the case of preemption. In any case wait here
>> > - * until the offer processing has moved beyond the
>> > - * point where the channel is discoverable.
>> > + * the events out of order. We rely on the synchronization
>> > + * provided by offer_in_progress and by channel_mutex for
>> > + * ordering these events:
>> > + *
>> > + * { Initially: offer_in_progress = 1 }
>> > + *
>> > + * CPU1 CPU2
>> > + *
>> > + * [vmbus_process_offer()] [vmbus_onoffer_rescind()]
>> > + *
>> > + * LOCK channel_mutex WAIT_ON offer_in_progress == 0
>> > + * DECREMENT offer_in_progress LOCK channel_mutex
>> > + * INSERT chn_list SEARCH chn_list
>> > + * UNLOCK channel_mutex UNLOCK channel_mutex
>> > + *
>> > + * Forbids: CPU2's SEARCH from *not* seeing CPU1's INSERT
>>
>> WAIT_ON offer_in_progress == 0
>> LOCK channel_mutex
>>
>> seems to be racy: what happens if offer_in_progress increments after we
>> read it but before we managed to aquire channel_mutex?
>
> Remark that the RESCIND work must see the increment which is performed
> "before" queueing the work in question (and the associated OFFER work),
> cf. the comment in vmbus_on_msg_dpc() below and
>
> dbb92f88648d6 ("workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()")
>
> AFAICT, this suffices to meet the intended behavior as sketched above.
> I might be missing something of course, can you elaborate on the issue
> here?
>
In case we believe that the OFFER -> RESCIND sequence is always ordered
by the host AND we don't care about other offers in the queue, the
suggested locking is OK: we're guaranteed to process the RESCIND after we
finished processing the OFFER for the same channel. However, waiting for
'offer_in_progress == 0' looks fishy, so I'd suggest we at least add a
comment explaining that the wait is only needed to serialize us with a
possible OFFER for the same channel - and nothing else. I'd personally
still slightly prefer the algorithm I suggested, as it guarantees we take
channel_mutex with offer_in_progress == 0 -- even if there are no issues
we can think of today (not strongly though).
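For reference, the loop I have in mind would be something like this in C
(untested, purely illustrative):

	mutex_lock(&vmbus_connection.channel_mutex);
	while (atomic_read(&vmbus_connection.offer_in_progress) != 0) {
		mutex_unlock(&vmbus_connection.channel_mutex);
		msleep(1);
		mutex_lock(&vmbus_connection.channel_mutex);
	}
	/* ... proceed with the rescind handling ... */
	mutex_unlock(&vmbus_connection.channel_mutex);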
--
Vitaly
Andrea Parri <[email protected]> writes:
> On Thu, Mar 26, 2020 at 03:31:20PM +0100, Vitaly Kuznetsov wrote:
>> "Andrea Parri (Microsoft)" <[email protected]> writes:
>>
>> > When Hyper-V sends an interrupt to the guest, the guest has to figure
>> > out which channel the interrupt is associated with. Hyper-V sets a bit
>> > in a memory page that is shared with the guest, indicating a particular
>> > "relid" that the interrupt is associated with. The current Linux code
>> > then uses a set of per-CPU linked lists to map a given "relid" to a
>> > pointer to a channel structure.
>> >
>> > This design introduces a synchronization problem if the CPU that Hyper-V
>> > will interrupt for a certain channel is changed. If the interrupt comes
>> > on the "old CPU" and the channel was already moved to the per-CPU list
>> > of the "new CPU", then the relid -> channel mapping will fail and the
>> > interrupt is dropped. Similarly, if the interrupt comes on the new CPU
>> > but the channel was not moved to the per-CPU list of the new CPU, then
>> > the mapping will fail and the interrupt is dropped.
>> >
>> > Relids are integers ranging from 0 to 2047. The mapping from relids to
>> > channel structures can be done by setting up an array with 2048 entries,
>> > each entry being a pointer to a channel structure (hence total size ~16K
>> > bytes, which is not a problem). The array is global, so there are no
>> > per-CPU linked lists to update. The array can be searched and updated
>> > by simply loading and storing the array at the specified index. With no
>> > per-CPU data structures, the above mentioned synchronization problem is
>> > avoided and the relid2channel() function gets simpler.
>> >
>> > Suggested-by: Michael Kelley <[email protected]>
>> > Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
>> > ---
>> > drivers/hv/channel_mgmt.c | 158 ++++++++++++++++++++++----------------
>> > drivers/hv/connection.c | 38 +++------
>> > drivers/hv/hv.c | 2 -
>> > drivers/hv/hyperv_vmbus.h | 14 ++--
>> > drivers/hv/vmbus_drv.c | 48 +++++++-----
>> > include/linux/hyperv.h | 5 --
>> > 6 files changed, 139 insertions(+), 126 deletions(-)
>> >
>> > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
>> > index 1191f3d76d111..9b1449c839575 100644
>> > --- a/drivers/hv/channel_mgmt.c
>> > +++ b/drivers/hv/channel_mgmt.c
>> > @@ -319,7 +319,6 @@ static struct vmbus_channel *alloc_channel(void)
>> > init_completion(&channel->rescind_event);
>> >
>> > INIT_LIST_HEAD(&channel->sc_list);
>> > - INIT_LIST_HEAD(&channel->percpu_list);
>> >
>> > tasklet_init(&channel->callback_event,
>> > vmbus_on_event, (unsigned long)channel);
>> > @@ -340,23 +339,28 @@ static void free_channel(struct vmbus_channel *channel)
>> > kobject_put(&channel->kobj);
>> > }
>> >
>> > -static void percpu_channel_enq(void *arg)
>> > +void vmbus_channel_map_relid(struct vmbus_channel *channel)
>> > {
>> > - struct vmbus_channel *channel = arg;
>> > - struct hv_per_cpu_context *hv_cpu
>> > - = this_cpu_ptr(hv_context.cpu_context);
>> > -
>> > - list_add_tail_rcu(&channel->percpu_list, &hv_cpu->chan_list);
>> > + if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
>> > + return;
>> > + /*
>> > + * Pairs with the READ_ONCE() in vmbus_chan_sched(). Guarantees
>> > + * that vmbus_chan_sched() will find up-to-date data.
>> > + */
>> > + smp_store_release(
>> > + &vmbus_connection.channels[channel->offermsg.child_relid],
>> > + channel);
>> > }
>> >
>> > -static void percpu_channel_deq(void *arg)
>> > +void vmbus_channel_unmap_relid(struct vmbus_channel *channel)
>> > {
>> > - struct vmbus_channel *channel = arg;
>> > -
>> > - list_del_rcu(&channel->percpu_list);
>> > + if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
>> > + return;
>> > + WRITE_ONCE(
>> > + vmbus_connection.channels[channel->offermsg.child_relid],
>> > + NULL);
>>
>> I don't think this smp_store_release()/WRITE_ONCE() fanciness gives you
>> anything. Basically, without proper synchronization with a lock there is
>> no such constructions which will give you any additional guarantee on
>> top of just doing X=1. E.g. smp_store_release() is just
>> barrier();
>> *p = v;
>> if I'm not mistaken. Nobody tells you when *some other CPU* will see the
>> update - 'eventually' is your best guess. Here, you're only setting one
>> pointer.
>>
>> Percpu structures have an advantage: we (almost) never access them from
>> different CPUs so just doing updates atomically (and writing 64bit
>> pointer on x86_64 is atomic) is OK.
>>
>> I haven't looked at all possible scenarios but I'd suggest protecting
>> this array with a spinlock (in case we can have simultaneous accesses
>> from different CPUs and care about the result, of course).
>
> The smp_store_release()+READ_ONCE() pair should guarantee that any store
> to the channel fields performed before (in program order) the "mapping"
> of the channel are visible to the CPU which observes that mapping; this
> guarantee is expected to hold for all architectures.
Yes, basically the order is preserved (but no guarantees WHEN another
CPU will see it).
>
> Notice that this apporach follows the current/upstream code, cf. the
> rcu_assign_pointer() in list_add_tail_rcu() and notice that (both before
> and after this series) vmbus_chan_sched() accesses the channel array
> without any mutex/lock held.
>
> I'd be inclined to stick to the current code (unless more turns out to
> be required). Thoughts?
Correct me if I'm wrong, but currently vmbus_chan_sched() accesses the
per-CPU list of channels on the same CPU, so we don't need a spinlock to
guarantee that during an interrupt we'll be able to see the update if it
happened before the interrupt (in chronological order). With a global
list of relids, who guarantees that an interrupt handler on another CPU
will actually see the modified list?
--
Vitaly
On Thu, Mar 26, 2020 at 08:26:36AM -0700, Stephen Hemminger wrote:
> On Wed, 25 Mar 2020 23:54:58 +0100
> "Andrea Parri (Microsoft)" <[email protected]> wrote:
>
> > vmbus_chan_sched() might call the netvsc driver callback function that
> > ends up scheduling NAPI work. This "work" can access the channel ring
> > buffer, so we must ensure that any such work is completed and that the
> > ring buffer is no longer being accessed before freeing the ring buffer
> > data structure in the channel closure path. To this end, disable NAPI
> > before calling vmbus_close() in netvsc_device_remove().
> >
> > Suggested-by: Michael Kelley <[email protected]>
> > Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> > Cc: "David S. Miller" <[email protected]>
> > Cc: <[email protected]>
>
> Do you have a test that reproduces this issue?
I don't (or I'm not aware of such a test).
>
> The netvsc device is somewhat unique in that it needs NAPI
> enabled on the primary channel to interact with the host.
> Therefore it can't call napi_disable in the normal dev->stop() place.
>
> Acked-by: Stephen Hemminger <[email protected]>
Thanks!
Andrea
> In case we believe that OFFER -> RESCINF sequence is always ordered
> by the host AND we don't care about other offers in the queue the
> suggested locking is OK: we're guaranteed to process RESCIND after we
> finished processing OFFER for the same channel. However, waiting for
> 'offer_in_progress == 0' looks fishy so I'd suggest we at least add a
> comment explaining that the wait is only needed to serialize us with
> possible OFFER for the same channel - and nothing else. I'd personally
> still slightly prefer the algorythm I suggested as it guarantees we take
> channel_mutex with offer_in_progress == 0 -- even if there are no issues
> we can think of today (not strongly though).
Does it? offer_in_progress is incremented without channel_mutex...
In any case, I have no objections to applying the changes you suggested.
To avoid misunderstandings: vmbus_bus_suspend() has a similar usage... Are
you suggesting that I apply similar changes there?
Alternatively: FWIW, the comment in vmbus_onoffer_rescind() does refer
to "The offer msg and the corresponding rescind msg...". I am all ears
if you have any concrete suggestions for improving these comments.
Thanks,
Andrea
> Correct me if I'm wrong, but currently vmbus_chan_sched() accesses
> per-cpu list of channels on the same CPU so we don't need a spinlock to
> guarantee that during an interrupt we'll be able to see the update if it
> happened before the interrupt (in chronological order). With a global
> list of relids, who guarantees that an interrupt handler on another CPU
> will actually see the modified list?
Thanks for pointing this out!
The offer/resume path has implicit full memory barriers, in program
order after the array store, which should guarantee the visibility of
the store to *all* CPUs before the offer/resume can complete (cf.
tools/memory-model/Documentation/explanation.txt, Sect. #13,
and assuming that the offer/resume for a channel must complete before
the corresponding handler runs, which seems to be the case considering
that some essential channel fields are initialized only later...).
IIUC, the spinlock approach you suggested would work and be "simpler";
an obvious side effect would be, well, a global synchronization point
in vmbus_chan_sched()...
Thoughts?
Thanks,
Andrea
On Thu, Mar 26, 2020 at 03:46:23PM +0100, Vitaly Kuznetsov wrote:
> "Andrea Parri (Microsoft)" <[email protected]> writes:
>
> > VMBus version 4.1 and later support the CHANNELMSG_MODIFYCHANNEL(22)
> > message type which can be used to request Hyper-V to change the vCPU
> > that a channel will interrupt.
> >
> > Introduce the CHANNELMSG_MODIFYCHANNEL message type, and define the
> > vmbus_send_modifychannel() function to send CHANNELMSG_MODIFYCHANNEL
> > requests to the host via a hypercall. The function is then used to
> > define a sysfs "store" operation, which allows to change the (v)CPU
> > the channel will interrupt by using the sysfs interface. The feature
> > can be used for load balancing or other purposes.
> >
> > One interesting catch here is that Hyper-V can *not* currently ACK
> > CHANNELMSG_MODIFYCHANNEL messages with the promise that (after the ACK
> > is sent) the channel won't send any more interrupts to the "old" CPU.
> >
> > The peculiarity of the CHANNELMSG_MODIFYCHANNEL messages is problematic
> > if the user want to take a CPU offline, since we don't want to take a
> > CPU offline (and, potentially, "lose" channel interrupts on such CPU)
> > if the host is still processing a CHANNELMSG_MODIFYCHANNEL message
> > associated to that CPU.
> >
> > It is worth mentioning, however, that we have been unable to observe
> > the above mentioned "race": in all our tests, CHANNELMSG_MODIFYCHANNEL
> > requests appeared *as if* they were processed synchronously by the
> > host.
>
> Hyper-V engineers never want to make our lifes easier :-)
Haha. I'd say more exciting! ;-) ;-)
> I can only think of a 'lazy' approach to setting channel CPU affinity:
> we actually re-assign it to the target CPU when we receive first
> interrupt there - but it's not very different from re-assigning it there
> but still handling interrupts on the old CPU like you do.
Interesting! I'm wondering whether it'd make sense to use a similar
approach to (lazily) "unblock" the "old" CPUs; let me think more...
> One more thing: it was already discussed several times but we never get
> to it. I think this question was even raised on Michael's latest
> 'Hyper-V on ARM' submission. What about implmenting a Hyper-V specific
> IRQ chip which would now support setting CPU affinity? The greatest
> benefit is that we don't need new tools to do e.g. load balancing,
> irqbalance will just work.
Thank you for the suggestions; TBH, I haven't considered such an approach
so far (and I'd need more time to come up with an informed comment...)
OTOH, I have done some initial investigation of the current (in-kernel)
balancing scheme/init_vp_index() and possible improvements/extensions
there... Hopefully another, follow-up series to come soon!
Thanks,
Andrea
On Thu, Mar 26, 2020 at 03:05:17PM +0100, Vitaly Kuznetsov wrote:
> "Andrea Parri (Microsoft)" <[email protected]> writes:
>
> > A Linux guest have to pick a "connect CPU" to communicate with the
> > Hyper-V host. This CPU can not be taken offline because Hyper-V does
> > not provide a way to change that CPU assignment.
> >
> > Current code sets the connect CPU to whatever CPU ends up running the
> > function vmbus_negotiate_version(), and this will generate problems if
> > that CPU is taken offine.
> >
> > Establish CPU0 as the connect CPU, and add logics to prevents the
> > connect CPU from being taken offline. We could pick some other CPU,
> > and we could pick that "other CPU" dynamically if there was a reason to
> > do so at some point in the future. But for now, #defining the connect
> > CPU to 0 is the most straightforward and least complex solution.
> >
> > While on this, add inline comments explaining "why" offer and rescind
> > messages should not be handled by a same serialized work queue.
> >
> > Suggested-by: Dexuan Cui <[email protected]>
> > Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> > ---
> > drivers/hv/connection.c | 20 +-------------------
> > drivers/hv/hv.c | 7 +++++++
> > drivers/hv/hyperv_vmbus.h | 11 ++++++-----
> > drivers/hv/vmbus_drv.c | 20 +++++++++++++++++---
> > 4 files changed, 31 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> > index 74e77de89b4f3..f4bd306d2cef9 100644
> > --- a/drivers/hv/connection.c
> > +++ b/drivers/hv/connection.c
> > @@ -69,7 +69,6 @@ MODULE_PARM_DESC(max_version,
> > int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
> > {
> > int ret = 0;
> > - unsigned int cur_cpu;
> > struct vmbus_channel_initiate_contact *msg;
> > unsigned long flags;
> >
> > @@ -102,24 +101,7 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
> >
> > msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
> > msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
> > - /*
> > - * We want all channel messages to be delivered on CPU 0.
> > - * This has been the behavior pre-win8. This is not
> > - * perf issue and having all channel messages delivered on CPU 0
> > - * would be ok.
> > - * For post win8 hosts, we support receiving channel messagges on
> > - * all the CPUs. This is needed for kexec to work correctly where
> > - * the CPU attempting to connect may not be CPU 0.
> > - */
> > - if (version >= VERSION_WIN8_1) {
> > - cur_cpu = get_cpu();
> > - msg->target_vcpu = hv_cpu_number_to_vp_number(cur_cpu);
> > - vmbus_connection.connect_cpu = cur_cpu;
> > - put_cpu();
> > - } else {
> > - msg->target_vcpu = 0;
> > - vmbus_connection.connect_cpu = 0;
> > - }
> > + msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
> >
> > /*
> > * Add to list before we send the request since we may
> > diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> > index 6098e0cbdb4b0..e2b3310454640 100644
> > --- a/drivers/hv/hv.c
> > +++ b/drivers/hv/hv.c
> > @@ -249,6 +249,13 @@ int hv_synic_cleanup(unsigned int cpu)
> > bool channel_found = false;
> > unsigned long flags;
> >
> > + /*
> > + * Hyper-V does not provide a way to change the connect CPU once
> > + * it is set; we must prevent the connect CPU from going offline.
> > + */
> > + if (cpu == VMBUS_CONNECT_CPU)
> > + return -EBUSY;
> > +
> > /*
> > * Search for channels which are bound to the CPU we're about to
> > * cleanup. In case we find one and vmbus is still connected we need to
> > diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> > index 70b30e223a578..67fb1edcbf527 100644
> > --- a/drivers/hv/hyperv_vmbus.h
> > +++ b/drivers/hv/hyperv_vmbus.h
> > @@ -212,12 +212,13 @@ enum vmbus_connect_state {
> >
> > #define MAX_SIZE_CHANNEL_MESSAGE HV_MESSAGE_PAYLOAD_BYTE_COUNT
> >
> > -struct vmbus_connection {
> > - /*
> > - * CPU on which the initial host contact was made.
> > - */
> > - int connect_cpu;
> > +/*
> > + * The CPU that Hyper-V will interrupt for VMBUS messages, such as
> > + * CHANNELMSG_OFFERCHANNEL and CHANNELMSG_RESCIND_CHANNELOFFER.
> > + */
> > +#define VMBUS_CONNECT_CPU 0
> >
> > +struct vmbus_connection {
> > u32 msg_conn_id;
> >
> > atomic_t offer_in_progress;
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index 029378c27421d..7600615e13754 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1056,14 +1056,28 @@ void vmbus_on_msg_dpc(unsigned long data)
> > /*
> > * If we are handling the rescind message;
> > * schedule the work on the global work queue.
> > + *
> > + * The OFFER message and the RESCIND message should
> > + * not be handled by the same serialized work queue,
> > + * because the OFFER handler may call vmbus_open(),
> > + * which tries to open the channel by sending an
> > + * OPEN_CHANNEL message to the host and waits for
> > + * the host's response; however, if the host has
> > + * rescinded the channel before it receives the
> > + * OPEN_CHANNEL message, the host just silently
> > + * ignores the OPEN_CHANNEL message; as a result,
> > + * the guest's OFFER handler hangs for ever, if we
> > + * handle the RESCIND message in the same serialized
> > + * work queue: the RESCIND handler can not start to
> > + * run before the OFFER handler finishes.
> > */
> > - schedule_work_on(vmbus_connection.connect_cpu,
> > + schedule_work_on(VMBUS_CONNECT_CPU,
> > &ctx->work);
> > break;
> >
> > case CHANNELMSG_OFFERCHANNEL:
> > atomic_inc(&vmbus_connection.offer_in_progress);
> > - queue_work_on(vmbus_connection.connect_cpu,
> > + queue_work_on(VMBUS_CONNECT_CPU,
> > vmbus_connection.work_queue,
> > &ctx->work);
> > break;
> > @@ -1110,7 +1124,7 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
> >
> > INIT_WORK(&ctx->work, vmbus_onmessage_work);
> >
> > - queue_work_on(vmbus_connection.connect_cpu,
> > + queue_work_on(VMBUS_CONNECT_CPU,
> > vmbus_connection.work_queue,
> > &ctx->work);
> > }
>
> I tried to refresh my memory on why 'connect_cpu' was introduced and it
> all comes down to the following commit:
>
> commit 7268644734f6a300353a4c4ff8bf3e013ba80f89
> Author: Alex Ng <[email protected]>
> Date: Fri Feb 26 15:13:22 2016 -0800
>
> Drivers: hv: vmbus: Support kexec on ws2012 r2 and above
>
> which for some unknown reason kept hardcoding '0' for pre-win2012-r2 (
> hv_context.vp_index[smp_processor_id()] in all cases would do exactly
> the same I guess ). Later, 'connect_cpu' appeared just to remember our
> choice, I can't see why we didn't go with CPU0 for simplicity.
>
> Reviewed-by: Vitaly Kuznetsov <[email protected]>
Applied! Thanks again for the review and all the suggestions,
Andrea
From: Andrea Parri <[email protected]> Sent: Saturday, March 28, 2020 10:09 AM
>
> > In case we believe that OFFER -> RESCINF sequence is always ordered
> > by the host AND we don't care about other offers in the queue the
> > suggested locking is OK: we're guaranteed to process RESCIND after we
> > finished processing OFFER for the same channel. However, waiting for
> > 'offer_in_progress == 0' looks fishy so I'd suggest we at least add a
> > comment explaining that the wait is only needed to serialize us with
> > possible OFFER for the same channel - and nothing else. I'd personally
> > still slightly prefer the algorythm I suggested as it guarantees we take
> > channel_mutex with offer_in_progress == 0 -- even if there are no issues
> > we can think of today (not strongly though).
>
> Does it? offer_in_progress is incremented without channel_mutex...
>
> IAC, I have no objections to apply the changes you suggested. To avoid
> misunderstandings: vmbus_bus_suspend() presents a similar usage... Are
> you suggesting that I apply similar changes there?
>
> Alternatively: FWIW, the comment in vmbus_onoffer_rescind() does refer
> to "The offer msg and the corresponding rescind msg...". I am all ears
> if you have any concrete suggestions to improve these comments.
>
Given that waiting for 'offer_in_progress == 0' is the current code, I think
there's an argument to be made for not changing it if the change isn't strictly
necessary. This patch set introduces enough change that *is* necessary. :-)
Michael
From: Andrea Parri <[email protected]> Sent: Saturday, March 28, 2020 11:22 AM
>
> > Correct me if I'm wrong, but currently vmbus_chan_sched() accesses
> > per-cpu list of channels on the same CPU so we don't need a spinlock to
> > guarantee that during an interrupt we'll be able to see the update if it
> > happened before the interrupt (in chronological order). With a global
> > list of relids, who guarantees that an interrupt handler on another CPU
> > will actually see the modified list?
>
> Thanks for pointing this out!
>
> The offer/resume path presents implicit full memory barriers, program
> -order after the array store which should guarantee the visibility of
> the store to *all* CPUs before the offer/resume can complete (c.f.,
>
> tools/memory-model/Documentation/explanation.txt, Sect. #13
>
> and assuming that the offer/resume for a channel must complete before
> the corresponding handler, which seems to be the case considered that
> some essential channel fields are initialized only later...)
>
> IIUC, the spin lock approach you suggested will work and be "simpler";
> an obvious side effect would be, well, a global synchronization point
> in vmbus_chan_sched()...
>
> Thoughts?
>
Note that this global array is accessed overwhelmingly with reads. Once
the system is initialized, channels only rarely come or go, so writes will
be rare. So the array can be cached on all CPUs, and we need to avoid
any global synchronization points. Leveraging the full semantics of the
memory model (across all architectures) seems like the right approach
to preserve a high level of concurrency.
Michael
Michael Kelley <[email protected]> writes:
> From: Andrea Parri <[email protected]> Sent: Saturday, March 28, 2020 10:09 AM
>>
>> > In case we believe that OFFER -> RESCINF sequence is always ordered
>> > by the host AND we don't care about other offers in the queue the
>> > suggested locking is OK: we're guaranteed to process RESCIND after we
>> > finished processing OFFER for the same channel. However, waiting for
>> > 'offer_in_progress == 0' looks fishy so I'd suggest we at least add a
>> > comment explaining that the wait is only needed to serialize us with
>> > possible OFFER for the same channel - and nothing else. I'd personally
>> > still slightly prefer the algorythm I suggested as it guarantees we take
>> > channel_mutex with offer_in_progress == 0 -- even if there are no issues
>> > we can think of today (not strongly though).
>>
>> Does it? offer_in_progress is incremented without channel_mutex...
>>
No, it does not; you're right, by itself the change is insufficient.
>> IAC, I have no objections to apply the changes you suggested. To avoid
>> misunderstandings: vmbus_bus_suspend() presents a similar usage... Are
>> you suggesting that I apply similar changes there?
>>
>> Alternatively: FWIW, the comment in vmbus_onoffer_rescind() does refer
>> to "The offer msg and the corresponding rescind msg...". I am all ears
>> if you have any concrete suggestions to improve these comments.
>>
>
> Given that waiting for 'offer_in_progress == 0' is the current code, I think
> there's an argument to made for not changing it if the change isn't strictly
> necessary. This patch set introduces enough change that *is* necessary. :-)
>
Sure. I was thinking a bit more about this and it seems that over the years
we've made the synchronization in the channel code too complex (every time
for a good reason, but still). Now (before this series) we have at least:
vmbus_connection.channel_mutex
vmbus_connection.offer_in_progress
channel.probe_done
channel.rescind
Workqueues (vmbus_connection.work_queue,
queue_work_on(vmbus_connection.connect_cpu),...)
channel.lock spinlock (the least of the problems)
Maybe there's room for improvement? Off the top of my head I'd suggest a
state machine for each channel (e.g. something like
OFFERED->OPENING->OPEN->RESCIND_REQ->RESCINDED->CLOSED) + refcounting
(subchannels, open/rescind/... requests in progress, ...) + non-blocking
request handling like "Can we handle this rescind offer now? No, the
refcount is too big. OK, reschedule the work". Maybe not the best
design ever, and I'd gladly support any other design which improves the
readability of the code and makes all state changes and the synchronization
between them more obvious.
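Very roughly, and purely for illustration (none of the names below exist
in the code or in this series):

	enum vmbus_channel_state {
		VMBUS_CHAN_OFFERED,
		VMBUS_CHAN_OPENING,
		VMBUS_CHAN_OPEN,
		VMBUS_CHAN_RESCIND_REQ,
		VMBUS_CHAN_RESCINDED,
		VMBUS_CHAN_CLOSED,
	};

	struct vmbus_channel_lifecycle {
		enum vmbus_channel_state state;	/* protected by channel_mutex */
		refcount_t inflight;	/* open/rescind/... requests in progress */
	};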
Note, VMBus channel handling driven by messages (unlike events for the
ring buffer) is not performance critical; we just need to ensure completeness
(all requests are handled correctly) with forward progress guarantees
(no deadlocks).
I understand that the absence of 'hot' issues in the current code is what
makes the value of a redesign questionable, and sorry for hijacking the
series, which doesn't seem to make things worse :-)
--
Vitaly
Andrea Parri <[email protected]> writes:
>> Correct me if I'm wrong, but currently vmbus_chan_sched() accesses
>> per-cpu list of channels on the same CPU so we don't need a spinlock to
>> guarantee that during an interrupt we'll be able to see the update if it
>> happened before the interrupt (in chronological order). With a global
>> list of relids, who guarantees that an interrupt handler on another CPU
>> will actually see the modified list?
>
> Thanks for pointing this out!
>
> The offer/resume path presents implicit full memory barriers, program
> -order after the array store which should guarantee the visibility of
> the store to *all* CPUs before the offer/resume can complete (c.f.,
>
> tools/memory-model/Documentation/explanation.txt, Sect. #13
>
> and assuming that the offer/resume for a channel must complete before
> the corresponding handler, which seems to be the case considered that
> some essential channel fields are initialized only later...)
>
> IIUC, the spin lock approach you suggested will work and be "simpler";
> an obvious side effect would be, well, a global synchronization point
> in vmbus_chan_sched()...
>
> Thoughts?
This is, of course, very theoretical: if we're seeing an interrupt for
a channel at the same time we're writing its relid, we're already in
trouble. I can, however, try to suggest one tiny improvement:
vmbus_chan_sched() currently clears the bit in the event page and then
searches for a channel with this relid; if we allow the search to
(temporarily) fail, we can reverse the logic: search for the channel and
clear the bit only if we succeed. In case we fail, next time (next IRQ)
we'll try again and likely succeed. The only purpose is to make sure no
interrupts are ever lost. This may be overkill; we may want to try
to count how many times (if ever) this happens.
Just a thought though.
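Something along these lines (untested, just to illustrate the reversed
check; the names mirror vmbus_chan_sched() and the new global array):

	for_each_set_bit(relid, recv_int_page, maxbits) {
		struct vmbus_channel *channel;

		if (relid == 0)
			continue;

		/* Find the channel first ... */
		channel = READ_ONCE(vmbus_connection.channels[relid]);
		if (!channel)
			continue;	/* leave the bit set, retry on the next IRQ */

		/* ... and only consume the event once we know who it is for. */
		sync_clear_bit(relid, recv_int_page);

		/* ... deliver to the channel as today ... */
	}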
--
Vitaly
From: Andrea Parri (Microsoft) <[email protected]> Sent: Wednesday, March 25, 2020 3:55 PM
>
> For each storvsc_device, storvsc keeps track of the channel target CPUs
> associated to the device (alloced_cpus) and it uses this information to
> fill a "cache" (stor_chns) mapping CPU->channel according to a certain
> heuristic. Update the alloced_cpus mask and the stor_chns array when a
> channel of the storvsc device is re-assigned to a different CPU.
>
> Signed-off-by: Andrea Parri (Microsoft) <[email protected]>
> Cc: "James E.J. Bottomley" <[email protected]>
> Cc: "Martin K. Petersen" <[email protected]>
> Cc: <[email protected]>
> ---
> drivers/hv/vmbus_drv.c | 4 ++
> drivers/scsi/storvsc_drv.c | 95 ++++++++++++++++++++++++++++++++++----
> include/linux/hyperv.h | 3 ++
> 3 files changed, 94 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 84d2f22c569aa..7199fee2b5869 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel,
> * in on a CPU that is different from the channel target_cpu value.
> */
>
> + if (channel->change_target_cpu_callback)
> + (*channel->change_target_cpu_callback)(channel,
> + channel->target_cpu, target_cpu);
> +
> channel->target_cpu = target_cpu;
> channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
> channel->numa_node = cpu_to_node(target_cpu);
I think there's an ordering problem here. The change_target_cpu_callback
will allow storvsc to flush the cache that it is keeping, but there's a window
after the storvsc callback releases the spin lock and before this function
changes channel->target_cpu to the new value. In that window, the cache
could get refilled based on the old value of channel->target_cpu, which is
exactly what we don't want. Generally with caches, you have to set the new
value first, then flush the cache, and I think that works in this case. The
callback function doesn't depend on the value of channel->target_cpu,
and any cache filling that might happen after channel->target_cpu is set
to the new value but before the callback function runs is OK. But please
double-check my thinking. :-)
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index fb41636519ee8..a680592b9d32a 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device(
>
> }
>
> +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new)
> +{
> + struct storvsc_device *stor_device;
> + struct vmbus_channel *cur_chn;
> + bool old_is_alloced = false;
> + struct hv_device *device;
> + unsigned long flags;
> + int cpu;
> +
> + device = channel->primary_channel ?
> + channel->primary_channel->device_obj
> + : channel->device_obj;
> + stor_device = get_out_stor_device(device);
> + if (!stor_device)
> + return;
> +
> + /* See storvsc_do_io() -> get_og_chn(). */
> + spin_lock_irqsave(&device->channel->lock, flags);
> +
> + /*
> + * Determines if the storvsc device has other channels assigned to
> + * the "old" CPU to update the alloced_cpus mask and the stor_chns
> + * array.
> + */
> + if (device->channel != channel && device->channel->target_cpu == old) {
> + cur_chn = device->channel;
> + old_is_alloced = true;
> + goto old_is_alloced;
> + }
> + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) {
> + if (cur_chn == channel)
> + continue;
> + if (cur_chn->target_cpu == old) {
> + old_is_alloced = true;
> + goto old_is_alloced;
> + }
> + }
> +
> +old_is_alloced:
> + if (old_is_alloced)
> + WRITE_ONCE(stor_device->stor_chns[old], cur_chn);
> + else
> + cpumask_clear_cpu(old, &stor_device->alloced_cpus);
I think target_cpu_store() can get called in parallel on multiple CPUs for different
channels on the same storvsc device, but multiple changes to a single channel are
serialized by higher levels of sysfs. So this function could run after multiple
channels have been changed, in which case there's not just a single "old" value,
and the above algorithm might not work, especially if channel->target_cpu is
updated before calling this function per my earlier comment. I can see a
couple of possible ways to deal with this. One is to put the update of
channel->target_cpu in this function, within the spin lock boundaries so
that the cache flush and target_cpu update are atomic. Another idea is to
process multiple changes in this function, by building a temp copy of
alloced_cpus by walking the channel list, using XOR to create a cpumask
of the changes, and then processing all the changes in a loop instead of
just handling a single change as with the current code at the old_is_alloced
label. But I haven't completely thought through this idea.
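Very roughly, and completely untested, the XOR idea might look like this
(declarations, error handling and locking elided; on-stack cpumasks only
for brevity; 'cur_chn' and 'cpu' as in the patch):

	struct cpumask new_alloced, changed;

	cpumask_clear(&new_alloced);
	cpumask_set_cpu(device->channel->target_cpu, &new_alloced);
	list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list)
		cpumask_set_cpu(cur_chn->target_cpu, &new_alloced);

	/* CPUs that gained or lost channels since the last update */
	cpumask_xor(&changed, &new_alloced, &stor_device->alloced_cpus);

	for_each_cpu(cpu, &changed) {
		if (cpumask_test_cpu(cpu, &new_alloced)) {
			/* newly alloced CPU; stor_chns[] refilled lazily */
			cpumask_set_cpu(cpu, &stor_device->alloced_cpus);
		} else {
			/* no longer alloced: drop the cached channel */
			WRITE_ONCE(stor_device->stor_chns[cpu], NULL);
			cpumask_clear_cpu(cpu, &stor_device->alloced_cpus);
		}
	}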
> +
> + /* "Flush" the stor_chns array. */
> + for_each_possible_cpu(cpu) {
> + if (stor_device->stor_chns[cpu] && !cpumask_test_cpu(
> + cpu, &stor_device->alloced_cpus))
> + WRITE_ONCE(stor_device->stor_chns[cpu], NULL);
> + }
> +
> + WRITE_ONCE(stor_device->stor_chns[new], channel);
> + cpumask_set_cpu(new, &stor_device->alloced_cpus);
> +
> + spin_unlock_irqrestore(&device->channel->lock, flags);
> +}
> +
> static void handle_sc_creation(struct vmbus_channel *new_sc)
> {
> struct hv_device *device = new_sc->primary_channel->device_obj;
> @@ -648,6 +705,8 @@ static void handle_sc_creation(struct vmbus_channel *new_sc)
> return;
> }
>
> + new_sc->change_target_cpu_callback = storvsc_change_target_cpu;
> +
> /* Add the sub-channel to the array of available channels. */
> stor_device->stor_chns[new_sc->target_cpu] = new_sc;
> cpumask_set_cpu(new_sc->target_cpu, &stor_device->alloced_cpus);
> @@ -876,6 +935,8 @@ static int storvsc_channel_init(struct hv_device *device, bool is_fc)
> if (stor_device->stor_chns == NULL)
> return -ENOMEM;
>
> + device->channel->change_target_cpu_callback = storvsc_change_target_cpu;
> +
> stor_device->stor_chns[device->channel->target_cpu] = device->channel;
> cpumask_set_cpu(device->channel->target_cpu,
> &stor_device->alloced_cpus);
> @@ -1248,8 +1309,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device
> *stor_device,
> const struct cpumask *node_mask;
> int num_channels, tgt_cpu;
>
> - if (stor_device->num_sc == 0)
> + if (stor_device->num_sc == 0) {
> + stor_device->stor_chns[q_num] = stor_device->device->channel;
> return stor_device->device->channel;
> + }
>
> /*
> * Our channel array is sparsley populated and we
> @@ -1258,7 +1321,6 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device
> *stor_device,
> * The strategy is simple:
> * I. Ensure NUMA locality
> * II. Distribute evenly (best effort)
> - * III. Mapping is persistent.
> */
>
> node_mask = cpumask_of_node(cpu_to_node(q_num));
> @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device
> *stor_device,
> if (cpumask_test_cpu(tgt_cpu, node_mask))
> num_channels++;
> }
> - if (num_channels == 0)
> + if (num_channels == 0) {
> + stor_device->stor_chns[q_num] = stor_device->device->channel;
Is the above added line just fixing a bug in the existing code? I'm not seeing how
it would derive from the other changes in this patch.
> return stor_device->device->channel;
> + }
>
> hash_qnum = q_num;
> while (hash_qnum >= num_channels)
> @@ -1295,6 +1359,7 @@ static int storvsc_do_io(struct hv_device *device,
> struct storvsc_device *stor_device;
> struct vstor_packet *vstor_packet;
> struct vmbus_channel *outgoing_channel, *channel;
> + unsigned long flags;
> int ret = 0;
> const struct cpumask *node_mask;
> int tgt_cpu;
> @@ -1308,10 +1373,11 @@ static int storvsc_do_io(struct hv_device *device,
>
> request->device = device;
> /*
> - * Select an an appropriate channel to send the request out.
> + * Select an appropriate channel to send the request out.
> */
> - if (stor_device->stor_chns[q_num] != NULL) {
> - outgoing_channel = stor_device->stor_chns[q_num];
> + /* See storvsc_change_target_cpu(). */
> + outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]);
> + if (outgoing_channel != NULL) {
> if (outgoing_channel->target_cpu == q_num) {
> /*
> * Ideally, we want to pick a different channel if
> @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device,
> continue;
> if (tgt_cpu == q_num)
> continue;
> - channel = stor_device->stor_chns[tgt_cpu];
> + channel = READ_ONCE(
> + stor_device->stor_chns[tgt_cpu]);
> + if (channel == NULL)
> + continue;
The channel == NULL case is new because a cache flush could be happening
in parallel on another CPU. I'm wondering about the tradeoffs of
continuing in the loop (as you have coded in this patch) vs. a "goto" back to
the top level "if" statement. With the "continue" you might finish the
loop without finding any matches, and fall through to the next approach.
But it's only a single I/O operation, and if it comes up with a less than
optimal channel choice, it's no big deal. So I guess it's really a wash.
> if (hv_get_avail_to_write_percent(
> &channel->outbound)
> > ring_avail_percent_lowater) {
> @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device,
> for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
> if (cpumask_test_cpu(tgt_cpu, node_mask))
> continue;
> - channel = stor_device->stor_chns[tgt_cpu];
> + channel = READ_ONCE(
> + stor_device->stor_chns[tgt_cpu]);
> + if (channel == NULL)
> + continue;
Same comment here.
> if (hv_get_avail_to_write_percent(
> &channel->outbound)
> > ring_avail_percent_lowater) {
> @@ -1360,7 +1432,14 @@ static int storvsc_do_io(struct hv_device *device,
> }
> }
> } else {
> + spin_lock_irqsave(&device->channel->lock, flags);
> + outgoing_channel = stor_device->stor_chns[q_num];
> + if (outgoing_channel != NULL) {
> + spin_unlock_irqrestore(&device->channel->lock, flags);
> + goto found_channel;
> + }
> outgoing_channel = get_og_chn(stor_device, q_num);
> + spin_unlock_irqrestore(&device->channel->lock, flags);
> }
>
> found_channel:
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index edfcd42319ef3..9018b89614b78 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -773,6 +773,9 @@ struct vmbus_channel {
> void (*onchannel_callback)(void *context);
> void *channel_callback_context;
>
> + void (*change_target_cpu_callback)(struct vmbus_channel *channel,
> + u32 old, u32 new);
> +
> /*
> * Synchronize channel scheduling and channel removal; see the inline
> * comments in vmbus_chan_sched() and vmbus_reset_channel_cb().
> --
> 2.24.0
> > @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel,
> > * in on a CPU that is different from the channel target_cpu value.
> > */
> >
> > + if (channel->change_target_cpu_callback)
> > + (*channel->change_target_cpu_callback)(channel,
> > + channel->target_cpu, target_cpu);
> > +
> > channel->target_cpu = target_cpu;
> > channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
> > channel->numa_node = cpu_to_node(target_cpu);
>
> I think there's an ordering problem here. The change_target_cpu_callback
> will allow storvsc to flush the cache that it is keeping, but there's a window
> after the storvsc callback releases the spin lock and before this function
> changes channel->target_cpu to the new value. In that window, the cache
> could get refilled based on the old value of channel->target_cpu, which is
> exactly what we don't want. Generally with caches, you have to set the new
> value first, then flush the cache, and I think that works in this case. The
> callback function doesn't depend on the value of channel->target_cpu,
> and any cache filling that might happen after channel->target_cpu is set
> to the new value but before the callback function runs is OK. But please
> double-check my thinking. :-)
Sorry, I don't see the problem. AFAICT, the "cache" gets refilled based
on the values of alloced_cpus and on the current state of the cache but
not based on the value of channel->target_cpu. The callback invocation
uses the value of the "old" target_cpu; I think I ended up placing the
callback call where it is to avoid introducing a local variable
"old_cpu". ;-)
> > @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device(
> >
> > }
> >
> > +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new)
> > +{
> > + struct storvsc_device *stor_device;
> > + struct vmbus_channel *cur_chn;
> > + bool old_is_alloced = false;
> > + struct hv_device *device;
> > + unsigned long flags;
> > + int cpu;
> > +
> > + device = channel->primary_channel ?
> > + channel->primary_channel->device_obj
> > + : channel->device_obj;
> > + stor_device = get_out_stor_device(device);
> > + if (!stor_device)
> > + return;
> > +
> > + /* See storvsc_do_io() -> get_og_chn(). */
> > + spin_lock_irqsave(&device->channel->lock, flags);
> > +
> > + /*
> > + * Determines if the storvsc device has other channels assigned to
> > + * the "old" CPU to update the alloced_cpus mask and the stor_chns
> > + * array.
> > + */
> > + if (device->channel != channel && device->channel->target_cpu == old) {
> > + cur_chn = device->channel;
> > + old_is_alloced = true;
> > + goto old_is_alloced;
> > + }
> > + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) {
> > + if (cur_chn == channel)
> > + continue;
> > + if (cur_chn->target_cpu == old) {
> > + old_is_alloced = true;
> > + goto old_is_alloced;
> > + }
> > + }
> > +
> > +old_is_alloced:
> > + if (old_is_alloced)
> > + WRITE_ONCE(stor_device->stor_chns[old], cur_chn);
> > + else
> > + cpumask_clear_cpu(old, &stor_device->alloced_cpus);
>
> I think target_cpu_store() can get called in parallel on multiple CPUs for different
> channels on the same storvsc device, but multiple changes to a single channel are
> serialized by higher levels of sysfs. So this function could run after multiple
> channels have been changed, in which case there's not just a single "old" value,
> and the above algorithm might not work, especially if channel->target_cpu is
> updated before calling this function per my earlier comment. I can see a
> couple of possible ways to deal with this. One is to put the update of
> channel->target_cpu in this function, within the spin lock boundaries so
> that the cache flush and target_cpu update are atomic. Another idea is to
> process multiple changes in this function, by building a temp copy of
> alloced_cpus by walking the channel list, use XOR to create a cpumask
> with changes, and then process all the changes in a loop instead of
> just handling a single change as with the current code at the old_is_alloced
> label. But I haven't completely thought through this idea.
Same here: the invocations of target_cpu_store() are serialized on the
per-connection channel_mutex...
> > @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device
> > *stor_device,
> > if (cpumask_test_cpu(tgt_cpu, node_mask))
> > num_channels++;
> > }
> > - if (num_channels == 0)
> > + if (num_channels == 0) {
> > + stor_device->stor_chns[q_num] = stor_device->device->channel;
>
> Is the above added line just fixing a bug in the existing code? I'm not seeing how
> it would derive from the other changes in this patch.
It was rather intended as an optimization: each time I/O for a device
is initiated on a CPU for which "num_channels == 0", the current
code ends up calling get_og_chn() (in an attempt to fill the cache) and
returns the device's primary channel. In the current code, the cost of
this operation is basically the cost of parsing alloced_cpus, but with
the changes introduced here it also involves acquiring (and releasing)
the primary channel's lock. To be upfront, I should say that I haven't
observed any measurable effects due to this addition in my experiments;
OTOH, caching the returned/"found" value made sense...
> > @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device,
> > continue;
> > if (tgt_cpu == q_num)
> > continue;
> > - channel = stor_device->stor_chns[tgt_cpu];
> > + channel = READ_ONCE(
> > + stor_device->stor_chns[tgt_cpu]);
> > + if (channel == NULL)
> > + continue;
>
> The channel == NULL case is new because a cache flush could be happening
> in parallel on another CPU. I'm wondering about the tradeoffs of
> continuing in the loop (as you have coded in this patch) vs. a "goto" back to
> the top level "if" statement. With the "continue" you might finish the
> loop without finding any matches, and fall through to the next approach.
> But it's only a single I/O operation, and if it comes up with a less than
> optimal channel choice, it's no big deal. So I guess it's really a wash.
Yes, I considered both approaches; they both "worked" here. I was a
bit concerned about the number of "possible" gotos (again, mainly a
theoretical issue, since I can imagine that the cache flushes will be
relatively "rare" events in most cases and, in any case, they happen
to be serialized); the "continue" looked like a suitable and simpler
approach/compromise, at least for the time being.
>
> > if (hv_get_avail_to_write_percent(
> > &channel->outbound)
> > > ring_avail_percent_lowater) {
> > @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device,
> > for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
> > if (cpumask_test_cpu(tgt_cpu, node_mask))
> > continue;
> > - channel = stor_device->stor_chns[tgt_cpu];
> > + channel = READ_ONCE(
> > + stor_device->stor_chns[tgt_cpu]);
> > + if (channel == NULL)
> > + continue;
>
> Same comment here.
Similarly here.
Thoughts?
Thanks,
Andrea
From: Andrea Parri <[email protected]> Sent: Monday, March 30, 2020 11:55 AM
>
> > > @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel
> *channel,
> > > * in on a CPU that is different from the channel target_cpu value.
> > > */
> > >
> > > + if (channel->change_target_cpu_callback)
> > > + (*channel->change_target_cpu_callback)(channel,
> > > + channel->target_cpu, target_cpu);
> > > +
> > > channel->target_cpu = target_cpu;
> > > channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
> > > channel->numa_node = cpu_to_node(target_cpu);
> >
> > I think there's an ordering problem here. The change_target_cpu_callback
> > will allow storvsc to flush the cache that it is keeping, but there's a window
> > after the storvsc callback releases the spin lock and before this function
> > changes channel->target_cpu to the new value. In that window, the cache
> > could get refilled based on the old value of channel->target_cpu, which is
> > exactly what we don't want. Generally with caches, you have to set the new
> > value first, then flush the cache, and I think that works in this case. The
> > callback function doesn't depend on the value of channel->target_cpu,
> > and any cache filling that might happen after channel->target_cpu is set
> > to the new value but before the callback function runs is OK. But please
> > double-check my thinking. :-)
>
> Sorry, I don't see the problem. AFAICT, the "cache" gets refilled based
> on the values of alloced_cpus and on the current state of the cache but
> not based on the value of channel->target_cpu. The callback invocation
> uses the value of the "old" target_cpu; I think I ended up placing the
> callback call where it is for not having to introduce a local variable
> "old_cpu". ;-)
>
You are right. My comment is bogus.
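(For reference, condensing the quoted hunk -- locking and input
validation elided -- the ordering is:)

        /* In target_cpu_store(), condensed: */

        /*
         * Let the driver (e.g. storvsc) flush its per-CPU cache while the
         * channel still carries the old target_cpu; the callback is handed
         * both the old and the new CPU and does not read
         * channel->target_cpu itself.
         */
        if (channel->change_target_cpu_callback)
                (*channel->change_target_cpu_callback)(channel,
                                channel->target_cpu, target_cpu);

        /* Only then is the channel re-targeted. */
        channel->target_cpu = target_cpu;
        channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
        channel->numa_node = cpu_to_node(target_cpu);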
>
> > > @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device(
> > >
> > > }
> > >
> > > +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new)
> > > +{
> > > + struct storvsc_device *stor_device;
> > > + struct vmbus_channel *cur_chn;
> > > + bool old_is_alloced = false;
> > > + struct hv_device *device;
> > > + unsigned long flags;
> > > + int cpu;
> > > +
> > > + device = channel->primary_channel ?
> > > + channel->primary_channel->device_obj
> > > + : channel->device_obj;
> > > + stor_device = get_out_stor_device(device);
> > > + if (!stor_device)
> > > + return;
> > > +
> > > + /* See storvsc_do_io() -> get_og_chn(). */
> > > + spin_lock_irqsave(&device->channel->lock, flags);
> > > +
> > > + /*
> > > + * Determines if the storvsc device has other channels assigned to
> > > + * the "old" CPU to update the alloced_cpus mask and the stor_chns
> > > + * array.
> > > + */
> > > + if (device->channel != channel && device->channel->target_cpu == old) {
> > > + cur_chn = device->channel;
> > > + old_is_alloced = true;
> > > + goto old_is_alloced;
> > > + }
> > > + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) {
> > > + if (cur_chn == channel)
> > > + continue;
> > > + if (cur_chn->target_cpu == old) {
> > > + old_is_alloced = true;
> > > + goto old_is_alloced;
> > > + }
> > > + }
> > > +
> > > +old_is_alloced:
> > > + if (old_is_alloced)
> > > + WRITE_ONCE(stor_device->stor_chns[old], cur_chn);
> > > + else
> > > + cpumask_clear_cpu(old, &stor_device->alloced_cpus);
> >
> > I think target_cpu_store() can get called in parallel on multiple CPUs for different
> > channels on the same storvsc device, but multiple changes to a single channel are
> > serialized by higher levels of sysfs. So this function could run after multiple
> > channels have been changed, in which case there's not just a single "old" value,
> > and the above algorithm might not work, especially if channel->target_cpu is
> > updated before calling this function per my earlier comment. I can see a
> > couple of possible ways to deal with this. One is to put the update of
> > channel->target_cpu in this function, within the spin lock boundaries so
> > that the cache flush and target_cpu update are atomic. Another idea is to
> > process multiple changes in this function, by building a temp copy of
> > alloced_cpus by walking the channel list, use XOR to create a cpumask
> > with changes, and then process all the changes in a loop instead of
> > just handling a single change as with the current code at the old_is_alloced
> > label. But I haven't completely thought through this idea.
>
> Same here: the invocations of target_cpu_store() are serialized on the
> per-connection channel_mutex...
Agreed. My comment is not valid.
>
>
> > > @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct
> storvsc_device
> > > *stor_device,
> > > if (cpumask_test_cpu(tgt_cpu, node_mask))
> > > num_channels++;
> > > }
> > > - if (num_channels == 0)
> > > + if (num_channels == 0) {
> > > + stor_device->stor_chns[q_num] = stor_device->device->channel;
> >
> > Is the above added line just fixing a bug in the existing code? I'm not seeing how
> > it would derive from the other changes in this patch.
>
> It was rather intended as an optimization: each time I/O for a device
> is initiated on a CPU for which "num_channels == 0", the current code
> ends up calling get_og_chn() (in an attempt to fill the cache) and
> returns the device's primary channel. In the current code, the cost of
> this operation is basically the cost of parsing alloced_cpus, but with
> the changes introduced here it also involves acquiring (and releasing)
> the primary channel's lock. I should say up front that I haven't
> observed any measurable effect due to this addition in my experiments;
> OTOH, caching the returned/"found" value made sense...
OK. That's what I thought. The existing code does not produce an incorrect
result, but the cache isn't working as intended. This fixes it.
>
>
> > > @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device,
> > > continue;
> > > if (tgt_cpu == q_num)
> > > continue;
> > > - channel = stor_device->stor_chns[tgt_cpu];
> > > + channel = READ_ONCE(
> > > + stor_device->stor_chns[tgt_cpu]);
> > > + if (channel == NULL)
> > > + continue;
> >
> > The channel == NULL case is new because a cache flush could be happening
> > in parallel on another CPU. I'm wondering about the tradeoffs of
> > continuing in the loop (as you have coded in this patch) vs. a "goto" back to
> > the top level "if" statement. With the "continue" you might finish the
> > loop without finding any matches, and fall through to the next approach.
> > But it's only a single I/O operation, and if it comes up with a less than
> > optimal channel choice, it's no big deal. So I guess it's really a wash.
>
> Yes, I considered both approaches; they both "worked" here. I was a
> bit concerned about the number of "possible" gotos (again, mainly a
> theoretical issue, since I can imagine that the cache flushes will be
> relatively "rare" events in most cases and, in any case, they happen
> to be serialized); the "continue" looked like a suitable and simpler
> approach/compromise, at least for the time being.
Yes, I'm OK with your patch "as is". I was just thinking about the
alternative, and evidently you did too.
>
>
> >
> > > if (hv_get_avail_to_write_percent(
> > > &channel->outbound)
> > > > ring_avail_percent_lowater) {
> > > @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device,
> > > for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
> > > if (cpumask_test_cpu(tgt_cpu, node_mask))
> > > continue;
> > > - channel = stor_device->stor_chns[tgt_cpu];
> > > + channel = READ_ONCE(
> > > + stor_device->stor_chns[tgt_cpu]);
> > > + if (channel == NULL)
> > > + continue;
> >
> > Same comment here.
>
> Similarly here.
Agreed.
>
> Thoughts?
>
> Thanks,
> Andrea
On Mon, Mar 30, 2020 at 02:24:16PM +0200, Vitaly Kuznetsov wrote:
> Michael Kelley <[email protected]> writes:
>
> > From: Andrea Parri <[email protected]> Sent: Saturday, March 28, 2020 10:09 AM
> >>
> >> > In case we believe that the OFFER -> RESCIND sequence is always ordered
> >> > by the host AND we don't care about other offers in the queue, the
> >> > suggested locking is OK: we're guaranteed to process RESCIND after we
> >> > finished processing OFFER for the same channel. However, waiting for
> >> > 'offer_in_progress == 0' looks fishy, so I'd suggest we at least add a
> >> > comment explaining that the wait is only needed to serialize us with a
> >> > possible OFFER for the same channel - and nothing else. I'd personally
> >> > still slightly prefer the algorithm I suggested, as it guarantees we take
> >> > channel_mutex with offer_in_progress == 0 -- even if there are no issues
> >> > we can think of today (not strongly though).
> >>
> >> Does it? offer_in_progress is incremented without channel_mutex...
> >>
>
> No, it does not, you're right, by itself the change is insufficient.
>
> >> IAC, I have no objections to applying the changes you suggested. To avoid
> >> misunderstandings: vmbus_bus_suspend() has a similar usage... Are
> >> you suggesting that I apply similar changes there?
> >>
> >> Alternatively: FWIW, the comment in vmbus_onoffer_rescind() does refer
> >> to "The offer msg and the corresponding rescind msg...". I am all ears
> >> if you have any concrete suggestions to improve these comments.
> >>
> >
> > Given that waiting for 'offer_in_progress == 0' is the current code, I think
> > there's an argument to be made for not changing it if the change isn't strictly
> > necessary. This patch set introduces enough change that *is* necessary. :-)
> >
>
> Sure. I was thinking a bit more about this, and it seems that over the years
> we've made the channel synchronization code too complex (every time
> for a good reason, but still). Now (before this series) we have at least:
>
> vmbus_connection.channel_mutex
> vmbus_connection.offer_in_progress
> channel.probe_done
> channel.rescind
> Workqueues (vmbus_connection.work_queue,
> queue_work_on(vmbus_connection.connect_cpu),...)
> channel.lock spinlock (the least of the problems)
>
> Maybe there's room for improvement? Off the top of my head I'd suggest a
> state machine for each channel (e.g. something like
> OFFERED->OPENING->OPEN->RESCIND_REQ->RESCINDED->CLOSED) + refcounting
> (subchannels, open/rescind/... requests in progress, ...) + non-blocking
> request handling like "Can we handle this rescind offer now? No,
> refcount is too big. OK, rescheduling the work". Maybe not the best
> design ever and I'd gladly support any other which improves the
> readability of the code and makes all state changes and synchronization
> between them more obvious.
>
> Note, VMBus channel handling driven by messages (unlike events for the ring
> buffer) is not performance critical; we just need to ensure completeness
> (all requests are handled correctly) with forward progress guarantees
> (no deadlocks).
>
> I understand that the absence of 'hot' issues in the current code is what
> makes the virtue of a redesign questionable, and sorry for hijacking the
> series, which doesn't seem to make things worse :-)
(Back here... Sorry for the delay.)
FWIW, what you wrote above makes sense to me; and *yes*, the series in
question was not intended as "let us undertake such a redesign" (quite
the opposite in fact...)
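(Purely as an illustration of the direction you describe -- the states
and fields below are made up, nothing like this exists in the series:)

        /*
         * Illustration only: one possible shape for the per-channel state
         * machine sketched above -- an explicit state, a count of in-flight
         * requests, and a lock guarding the transitions; a rescind that
         * finds the count too high would re-queue its work instead of
         * blocking.
         */
        enum vmbus_chan_state {
                CHAN_OFFERED,
                CHAN_OPENING,
                CHAN_OPEN,
                CHAN_RESCIND_REQ,
                CHAN_RESCINDED,
                CHAN_CLOSED,
        };

        struct vmbus_chan_sm {
                enum vmbus_chan_state   state;
                refcount_t              requests;       /* open/rescind/... in flight */
                spinlock_t              lock;           /* guards state transitions */
        };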
With regard to this specific patch, it seems to me that we've reached
a certain consensus or, at least, I don't see complaints ;-). Please
let me know if I misunderstood.
Thanks,
Andrea
On Mon, Mar 30, 2020 at 02:45:54PM +0200, Vitaly Kuznetsov wrote:
> Andrea Parri <[email protected]> writes:
>
> >> Correct me if I'm wrong, but currently vmbus_chan_sched() accesses
> >> per-cpu list of channels on the same CPU so we don't need a spinlock to
> >> guarantee that during an interrupt we'll be able to see the update if it
> >> happened before the interrupt (in chronological order). With a global
> >> list of relids, who guarantees that an interrupt handler on another CPU
> >> will actually see the modified list?
> >
> > Thanks for pointing this out!
> >
> > The offer/resume path contains implicit full memory barriers, in program
> > order after the array store, which should guarantee the visibility of
> > the store to *all* CPUs before the offer/resume can complete (cf.
> >
> > tools/memory-model/Documentation/explanation.txt, Sect. #13
> >
> > and assuming that the offer/resume for a channel must complete before
> > the corresponding handler, which seems to be the case considering that
> > some essential channel fields are initialized only later...)
> >
> > IIUC, the spin lock approach you suggested will work and be "simpler";
> > an obvious side effect would be, well, a global synchronization point
> > in vmbus_chan_sched()...
> >
> > Thoughts?
>
> This is, of course, very theoretical: if we're seeing an interrupt for
> a channel at the same time we're writing its relid, we're already in
> trouble. I can, however, try to suggest one tiny improvement:
Indeed. I think the idea (still quite informal) is that:
1) the mapping of the channel relid is propagated to (visible from)
all CPUs before add_channel_work is queued (full barrier in
queue_work()),
2) add_channel_work is queued before the channel is opened (aka,
before the channel ring buffer is allocated/initialized and the
OPENCHANNEL msg is sent to and acked by Hyper-V, cf. OPEN_STATE),
3) the channel is opened before Hyper-V can start sending interrupts
for the channel, and hence before vmbus_chan_sched() can find the
channel's relid set in recv_int_page,
4) vmbus_chan_sched() finds the channel's relid set in recv_int_page
before it searches/loads from the channel array (full barrier
in sync_test_and_clear_bit()).
This is for the "normal"/not-resuming-from-hibernation case; for the
latter, notice that:
a) vmbus_isr() (and vmbus_chan_sched()) cannot run until
vmbus_bus_resume() has finished (@resume_noirq callback),
b) vmbus_bus_resume() cannot complete before nr_chan_fixup_on_resume
equals 0 in check_ready_for_resume_event().
(and check_ready_for_resume_event() also provides a full barrier).
If this makes sense to you, I'll try to add some of the above in comments.
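Roughly, the pairing I have in mind (condensed sketch; the array and
helper names are placeholders rather than the ones used in the series):

        /* Offer/resume path (array/helper names are placeholders): */
        channels[relid] = newchannel;           /* 1) publish the relid mapping */
        queue_work(wq, &newchannel->add_channel_work);  /* full barrier in queue_work() */
        /* 2)-3) the channel is opened, and can interrupt, only after the work runs */

        /* Interrupt path, vmbus_chan_sched(): */
        if (sync_test_and_clear_bit(relid, recv_int_page)) {   /* 4) full barrier */
                channel = READ_ONCE(channels[relid]);
                if (channel != NULL)
                        ;       /* dispatch to the channel's callback/tasklet */
        }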
Thanks,
Andrea
>
> vmbus_chan_sched() currently clears the bit in the event page and then
> searches for a channel with this relid; if we allow the search to
> (temporarily) fail, we can reverse the logic: search for the channel and
> clear the bit only if we succeed. If we fail, next time (next IRQ)
> we'll try again and likely succeed. The only purpose is to make sure no
> interrupts are ever lost. This may be overkill; we may want to
> count how many times (if ever) this happens.
>
> Just a thought though.
>
> --
> Vitaly
>
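(For completeness, a minimal sketch of the reversal suggested above,
with relid2channel() assumed as the lookup helper; this is not what the
series currently does:)

        /*
         * Look the channel up first and consume (clear) the event bit only
         * once the lookup has succeeded: a relid whose mapping is not yet
         * visible keeps its bit set and is simply retried on the next
         * interrupt, so no interrupt is ever lost.
         */
        channel = relid2channel(relid);
        if (channel == NULL)
                continue;       /* leave the bit set; retry on the next IRQ */
        sync_test_and_clear_bit(relid, recv_int_page);
        /* ... dispatch to the channel ... */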
On Mon, Mar 30, 2020 at 07:49:00PM +0000, Michael Kelley wrote:
> From: Andrea Parri <[email protected]> Sent: Monday, March 30, 2020 11:55 AM
> >
> > > > @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel
> > *channel,
> > > > * in on a CPU that is different from the channel target_cpu value.
> > > > */
> > > >
> > > > + if (channel->change_target_cpu_callback)
> > > > + (*channel->change_target_cpu_callback)(channel,
> > > > + channel->target_cpu, target_cpu);
> > > > +
> > > > channel->target_cpu = target_cpu;
> > > > channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
> > > > channel->numa_node = cpu_to_node(target_cpu);
> > >
> > > I think there's an ordering problem here. The change_target_cpu_callback
> > > will allow storvsc to flush the cache that it is keeping, but there's a window
> > > after the storvsc callback releases the spin lock and before this function
> > > changes channel->target_cpu to the new value. In that window, the cache
> > > could get refilled based on the old value of channel->target_cpu, which is
> > > exactly what we don't want. Generally with caches, you have to set the new
> > > value first, then flush the cache, and I think that works in this case. The
> > > callback function doesn't depend on the value of channel->target_cpu,
> > > and any cache filling that might happen after channel->target_cpu is set
> > > to the new value but before the callback function runs is OK. But please
> > > double-check my thinking. :-)
> >
> > Sorry, I don't see the problem. AFAICT, the "cache" gets refilled based
> > on the values of alloced_cpus and on the current state of the cache but
> > not based on the value of channel->target_cpu. The callback invocation
> > uses the value of the "old" target_cpu; I think I ended up placing the
> > callback call where it is for not having to introduce a local variable
> > "old_cpu". ;-)
> >
>
> You are right. My comment is bogus.
>
> >
> > > > @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device(
> > > >
> > > > }
> > > >
> > > > +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new)
> > > > +{
> > > > + struct storvsc_device *stor_device;
> > > > + struct vmbus_channel *cur_chn;
> > > > + bool old_is_alloced = false;
> > > > + struct hv_device *device;
> > > > + unsigned long flags;
> > > > + int cpu;
> > > > +
> > > > + device = channel->primary_channel ?
> > > > + channel->primary_channel->device_obj
> > > > + : channel->device_obj;
> > > > + stor_device = get_out_stor_device(device);
> > > > + if (!stor_device)
> > > > + return;
> > > > +
> > > > + /* See storvsc_do_io() -> get_og_chn(). */
> > > > + spin_lock_irqsave(&device->channel->lock, flags);
> > > > +
> > > > + /*
> > > > + * Determines if the storvsc device has other channels assigned to
> > > > + * the "old" CPU to update the alloced_cpus mask and the stor_chns
> > > > + * array.
> > > > + */
> > > > + if (device->channel != channel && device->channel->target_cpu == old) {
> > > > + cur_chn = device->channel;
> > > > + old_is_alloced = true;
> > > > + goto old_is_alloced;
> > > > + }
> > > > + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) {
> > > > + if (cur_chn == channel)
> > > > + continue;
> > > > + if (cur_chn->target_cpu == old) {
> > > > + old_is_alloced = true;
> > > > + goto old_is_alloced;
> > > > + }
> > > > + }
> > > > +
> > > > +old_is_alloced:
> > > > + if (old_is_alloced)
> > > > + WRITE_ONCE(stor_device->stor_chns[old], cur_chn);
> > > > + else
> > > > + cpumask_clear_cpu(old, &stor_device->alloced_cpus);
> > >
> > > I think target_cpu_store() can get called in parallel on multiple CPUs for different
> > > channels on the same storvsc device, but multiple changes to a single channel are
> > > serialized by higher levels of sysfs. So this function could run after multiple
> > > channels have been changed, in which case there's not just a single "old" value,
> > > and the above algorithm might not work, especially if channel->target_cpu is
> > > updated before calling this function per my earlier comment. I can see a
> > > couple of possible ways to deal with this. One is to put the update of
> > > channel->target_cpu in this function, within the spin lock boundaries so
> > > that the cache flush and target_cpu update are atomic. Another idea is to
> > > process multiple changes in this function, by building a temp copy of
> > > alloced_cpus by walking the channel list, use XOR to create a cpumask
> > > with changes, and then process all the changes in a loop instead of
> > > just handling a single change as with the current code at the old_is_alloced
> > > label. But I haven't completely thought through this idea.
> >
> > Same here: the invocations of target_cpu_store() are serialized on the
> > per-connection channel_mutex...
>
> Agreed. My comment is not valid.
>
> >
> >
> > > > @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct
> > storvsc_device
> > > > *stor_device,
> > > > if (cpumask_test_cpu(tgt_cpu, node_mask))
> > > > num_channels++;
> > > > }
> > > > - if (num_channels == 0)
> > > > + if (num_channels == 0) {
> > > > + stor_device->stor_chns[q_num] = stor_device->device->channel;
> > >
> > > Is the above added line just fixing a bug in the existing code? I'm not seeing how
> > > it would derive from the other changes in this patch.
> >
> > It was rather intended as an optimization: each time I/O for a device
> > is initiated on a CPU for which "num_channels == 0", the current code
> > ends up calling get_og_chn() (in an attempt to fill the cache) and
> > returns the device's primary channel. In the current code, the cost of
> > this operation is basically the cost of parsing alloced_cpus, but with
> > the changes introduced here it also involves acquiring (and releasing)
> > the primary channel's lock. I should say up front that I haven't
> > observed any measurable effect due to this addition in my experiments;
> > OTOH, caching the returned/"found" value made sense...
>
> OK. That's what I thought. The existing code does not produce an incorrect
> result, but the cache isn't working as intended. This fixes it.
>
> >
> >
> > > > @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device,
> > > > continue;
> > > > if (tgt_cpu == q_num)
> > > > continue;
> > > > - channel = stor_device->stor_chns[tgt_cpu];
> > > > + channel = READ_ONCE(
> > > > + stor_device->stor_chns[tgt_cpu]);
> > > > + if (channel == NULL)
> > > > + continue;
> > >
> > > The channel == NULL case is new because a cache flush could be happening
> > > in parallel on another CPU. I'm wondering about the tradeoffs of
> > > continuing in the loop (as you have coded in this patch) vs. a "goto" back to
> > > the top level "if" statement. With the "continue" you might finish the
> > > loop without finding any matches, and fall through to the next approach.
> > > But it's only a single I/O operation, and if it comes up with a less than
> > > optimal channel choice, it's no big deal. So I guess it's really a wash.
> >
> > Yes, I considered both approaches; they both "worked" here. I was a
> > bit concerned about the number of "possible" gotos (again, mainly a
> > theoretical issue, since I can imagine that the cache flushes will be
> > relatively "rare" events in most cases and, in any case, they happen
> > to be serialized); the "continue" looked like a suitable and simpler
> > approach/compromise, at least for the time being.
>
> Yes, I'm OK with your patch "as is". I was just thinking about the
> alternative, and evidently you did too.
Thank you for the confirmation. I'm wondering: could I take this as a
Reviewed-by: for this patch? ;-)
Thanks,
Andrea
>
> >
> >
> > >
> > > > if (hv_get_avail_to_write_percent(
> > > > &channel->outbound)
> > > > > ring_avail_percent_lowater) {
> > > > @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device,
> > > > for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
> > > > if (cpumask_test_cpu(tgt_cpu, node_mask))
> > > > continue;
> > > > - channel = stor_device->stor_chns[tgt_cpu];
> > > > + channel = READ_ONCE(
> > > > + stor_device->stor_chns[tgt_cpu]);
> > > > + if (channel == NULL)
> > > > + continue;
> > >
> > > Same comment here.
> >
> > Similarly here.
>
> Agreed.
>
> >
> > Thoughts?
> >
> > Thanks,
> > Andrea
> > I can only think of a 'lazy' approach to setting channel CPU affinity:
> > we actually re-assign it to the target CPU when we receive the first
> > interrupt there - but it's not very different from re-assigning it there
> > but still handling interrupts on the old CPU like you do.
>
> Interesting! I'm wondering whether it'd make sense to use a similar
> approach to (lazily) "unblock" the "old" CPUs; let me think more...
So, I spent some more time thinking about this... AFAICT, one main
issue with this approach (besides the fact that we'd need to "save"
the "old" CPUs...) is that we might be blocking that CPU for "too
long" (depending on the channel/device usage); also, this approach
seemed to make the handling of target_cpu a bit more involved, and
this is a concern considering that (as mentioned before) we'd like to
keep the target_cpu of *all* channels in the system balanced.
I'm sticking to the approach implemented here for now: I'm planning
to send a new version of the series with minor changes soon.
Thanks,
Andrea
Andrea Parri <[email protected]> writes:
> On Mon, Mar 30, 2020 at 02:45:54PM +0200, Vitaly Kuznetsov wrote:
>> Andrea Parri <[email protected]> writes:
>>
>> >> Correct me if I'm wrong, but currently vmbus_chan_sched() accesses
>> >> per-cpu list of channels on the same CPU so we don't need a spinlock to
>> >> guarantee that during an interrupt we'll be able to see the update if it
>> >> happened before the interrupt (in chronological order). With a global
>> >> list of relids, who guarantees that an interrupt handler on another CPU
>> >> will actually see the modified list?
>> >
>> > Thanks for pointing this out!
>> >
>> > The offer/resume path contains implicit full memory barriers, in program
>> > order after the array store, which should guarantee the visibility of
>> > the store to *all* CPUs before the offer/resume can complete (cf.
>> >
>> > tools/memory-model/Documentation/explanation.txt, Sect. #13
>> >
>> > and assuming that the offer/resume for a channel must complete before
>> > the corresponding handler, which seems to be the case considering that
>> > some essential channel fields are initialized only later...)
>> >
>> > IIUC, the spin lock approach you suggested will work and be "simpler";
>> > an obvious side effect would be, well, a global synchronization point
>> > in vmbus_chan_sched()...
>> >
>> > Thoughts?
>>
>> This is, of course, very theoretical: if we're seeing an interrupt for
>> a channel at the same time we're writing its relid, we're already in
>> trouble. I can, however, try to suggest one tiny improvement:
>
> Indeed. I think the idea (still quite informal) is that:
>
> 1) the mapping of the channel relid is propagated to (visible from)
> all CPUs before add_channel_work is queued (full barrier in
> queue_work()),
>
> 2) add_channel_work is queued before the channel is opened (aka,
> before the channel ring buffer is allocated/initialized and the
> OPENCHANNEL msg is sent to and acked by Hyper-V, cf. OPEN_STATE),
>
> 3) the channel is opened before Hyper-V can start sending interrupts
> for the channel, and hence before vmbus_chan_sched() can find the
> channel's relid set in recv_int_page,
>
> 4) vmbus_chan_sched() finds the channel's relid set in recv_int_page
> before it searches/loads from the channel array (full barrier
> in sync_test_and_clear_bit()).
>
> This is for the "normal"/not-resuming-from-hibernation case; for the
> latter, notice that:
>
> a) vmbus_isr() (and vmbus_chan_sched()) cannot run until
> vmbus_bus_resume() has finished (@resume_noirq callback),
>
> b) vmbus_bus_resume() cannot complete before nr_chan_fixup_on_resume
> equals 0 in check_ready_for_resume_event().
>
> (and check_ready_for_resume_event() also provides a full barrier).
>
> If this makes sense to you, I'll try to add some of the above in comments.
>
It does, thank you!
--
Vitaly