This is to introduce "xenwatch multithreading" (or "multithreaded xenwatch",
abbreviated as 'mtwatch'). The implementation of xen mtwatch involves the
following components:
* dom0 linux kernel
* xen toolstack
Here is what this RFC is going to discuss:
- what is the problem
- what is the objective
- what is the solution
- where is the challenge
- patch set
what is the problem
===================
xenwatch_thread is a single kernel thread that processes the callback
functions of subscribed xenwatch events one by one. The xenwatch thread is
stalled in 'D' state if any callback function is stalled and uninterruptible.
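To illustrate, below is a simplified sketch of the existing single-threaded
event loop (modeled on xenwatch_thread() in drivers/xen/xenbus/xenbus_xs.c;
error handling is elided):

/*
 * Simplified sketch: every watch callback runs on this single thread,
 * so one stalled callback blocks all subsequent events.
 */
static int xenwatch_thread(void *unused)
{
	struct list_head *ent;
	struct xs_watch_event *event;

	for (;;) {
		wait_event_interruptible(watch_events_waitq,
					 !list_empty(&watch_events));

		if (kthread_should_stop())
			break;

		mutex_lock(&xenwatch_mutex);

		spin_lock(&watch_events_lock);
		ent = watch_events.next;
		if (ent != &watch_events)
			list_del(ent);
		spin_unlock(&watch_events_lock);

		if (ent != &watch_events) {
			event = list_entry(ent, struct xs_watch_event, list);
			/* A hang here stalls xenwatch in 'D' state. */
			event->handle->callback(event->handle, event->path,
						event->token);
			kfree(event);
		}

		mutex_unlock(&xenwatch_mutex);
	}

	return 0;
}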
The creation/destruction of a domU fails if xenwatch is stalled in 'D' state,
as the paravirtual driver init/uninit cannot complete. Usually, the only
option is to reboot the dom0 server, unless there is a solution/workaround to
move forward and complete the stalled xenwatch event callback function. Below
is the output of 'xl create' when xenwatch is stalled (the issue is
reproduced on purpose by hooking netif_receive_skb() to intercept an sk_buff
sent out from vifX.Y on dom0 with the patch at
https://github.com/finallyjustice/patchset/blob/master/xenwatch-stall-by-vif.patch):
# xl create pv.cfg
Parsing config from pv.cfg
libxl: error: libxl_device.c:1080:device_backend_callback: Domain 2:unable to add device with path /local/domain/0/backend/vbd/2/51712
libxl: error: libxl_create.c:1278:domcreate_launch_dm: Domain 2:unable to add disk devices
libxl: error: libxl_device.c:1080:device_backend_callback: Domain 2:unable to remove device with path /local/domain/0/backend/vbd/2/51712
libxl: error: libxl_domain.c:1073:devices_destroy_cb: Domain 2:libxl__devices_destroy failed
libxl: error: libxl_domain.c:1000:libxl__destroy_domid: Domain 2:Non-existant domain
libxl: error: libxl_domain.c:959:domain_destroy_callback: Domain 2:Unable to destroy guest
libxl: error: libxl_domain.c:886:domain_destroy_cb: Domain 2:Destruction of domain failed
Three scenarios are discussed below to demonstrate the limitation of
single-threaded xenwatch:
scenario 1
----------
In this scenario, xenwatch is stalled at kthread_stop(), waiting for
xenvif_dealloc_kthread() to exit. However, unless all in-flight packets (sent
out from xen-netback with SKBTX_DEV_ZEROCOPY set) are released successfully
and correctly (e.g., by bond/vlan/bridge/tap/NIC), xenvif_dealloc_kthread()
will never stop and exit. Below is the call stack of the xenwatch thread:
---------------------------------------------
xenwatch call stack:
[<0>] kthread_stop
[<0>] xenvif_disconnect_data
[<0>] set_backend_state
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
---------------------------------------------
A similar issue has been reported and discussed on xen-devel in the past:
https://lists.xenproject.org/archives/html/xen-devel/2016-06/msg00195.html
scenario 2
----------
In this scenario, xenwatch is stalled at kthread_stop(), waiting for
xen_blkif_schedule() to complete all pending I/O requests. When there is an
issue with the loop device used by xen-blkback, xen_blkif_schedule() cannot
complete and exit. xenwatch is stalled until xen_blkif_schedule() is able to
finish all pending I/O requests. Below is the call stack of xenwatch when it
is stalled:
---------------------------------------------
xenwatch call stack:
[<0>] kthread_stop
[<0>] xen_blkif_disconnect
[<0>] xen_blkbk_remove
[<0>] xenbus_dev_remove
[<0>] __device_release_driver
[<0>] device_release_driver
[<0>] bus_remove_device
[<0>] device_del
[<0>] device_unregister
[<0>] frontend_changed
[<0>] xenbus_otherend_changed
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
---------------------------------------------
scenario 3
----------
In this scenario, xenwatch is stalled at gnttab_unmap_refs_sync() when some
persistent pages (of xen-blkback) are still mapped and used by the dom0
filesystem or block layer. When there is an issue with the filesystem or
block layer, the persistent pages assigned to the submitted bio are not
released successfully or correctly, so xenwatch is stalled forever. Below is
the call stack of the stalled xenwatch:
---------------------------------------------
xenwatch call stack:
[<0>] gnttab_unmap_refs_sync
[<0>] free_persistent_gnts
[<0>] xen_blkbk_free_caches
[<0>] xen_blkif_disconnect
[<0>] xen_blkbk_remove
[<0>] xenbus_dev_remove
[<0>] __device_release_driver
[<0>] device_release_driver
[<0>] bus_remove_device
[<0>] device_del
[<0>] device_unregister
[<0>] frontend_changed
[<0>] xenbus_otherend_changed
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
---------------------------------------------
From the above scenarios, we may conclude that the stability of xenwatch
heavily relies on the xenwatch callback functions, that is, on the stability
of other dom0 kernel components such as networking, NIC, filesystem or block
layer. When xenwatch is stalled, people tend to blame xen, although the root
cause is not on the xen side.
what is the objective
=====================
The objective of this RFC is to guarantee that xenwatch is always able to
respond to incoming xenwatch events, even when some callback function is
already stalled, in order to avoid an immediate dom0 reboot. We should
guarantee that only the per-domU xenwatch thread is stalled when its event
callback function hangs.

The xenwatch stall issue on domU is not as significant as on dom0, which is
the privileged management domain responsible for domain creation/destruction,
as rebooting a domU is not a severe issue. However, it is always necessary
for the administrator to schedule downtime to reboot dom0. Therefore, we only
cover the dom0 xenwatch stall issue in this RFC.
what is the solution
====================
The general idea of the solution is to create a kernel thread for every domU,
in addition to the default xenwatch thread.

For each incoming xenwatch event, xenwatch_thread() first calculates the
domid that the event belongs to and forwards the event accordingly: the event
is processed by the per-domU watch thread if the domid is not 0; otherwise,
it is processed by the default xenwatch_thread(), as sketched below.
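Distilled into pseudo-C, the dispatch decision looks roughly like the sketch
below. dispatch_event() is a hypothetical wrapper written for illustration
only; mtwatch_process() is the real function added by the kernel patch at the
end of this RFC, and it returns 0 once the event has been queued on the
matching per-domU thread:

/* Hypothetical wrapper showing how each dequeued event is dispatched. */
static void dispatch_event(struct xs_watch_event *event)
{
	/*
	 * mtwatch_process() queues the event on [xen-mtwatch-<domid>] and
	 * returns 0; it returns non-zero if the domid is 0, .get_domid()
	 * is not implemented, or no per-domU thread exists.
	 */
	if (!xen_mtwatch || mtwatch_process(event)) {
		/* Fall back to the default xenwatch thread, as before. */
		event->handle->callback(event->handle, event->path,
					event->token);
		kfree(event);
	}
}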
As this issue is only significant to dom0, the solution covers only dom0. The
domU case (including driver domains) is not considered in this RFC.

A kernel parameter 'xen_mtwatch' is introduced to control whether the feature
is enabled. The feature is disabled by default if 'xen_mtwatch' is not set on
the dom0 kernel command line in grub.
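For example, on a grub2-based dom0 the parameter can be appended to the dom0
kernel command line (illustrative snippet; the exact file and the command to
regenerate grub.cfg depend on the distribution):

# /etc/default/grub: add 'xen_mtwatch' to the dom0 kernel command line,
# then regenerate grub.cfg (e.g., grub2-mkconfig -o /boot/grub2/grub.cfg).
GRUB_CMDLINE_LINUX="... xen_mtwatch"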
where is the challenge
======================
There are two challenges in the design of xen mtwatch:

1. The calculation of the domid given a xenwatch event path.
2. When to create/destroy a per-domU kthread.

About the domid calculation: instead of having a single intelligent function
to calculate the domid for all event paths, a new callback function
.get_domid() is introduced as a member of 'struct xenbus_watch', as shown
below:
/* Register callback to watch this node. */
struct xenbus_watch
{
	struct list_head list;

	/* Path being watched. */
	const char *node;

	/* Callback (executed in a process context with no locks held). */
	void (*callback)(struct xenbus_watch *,
			 const char *path, const char *token);
+
+	/* Callback to help calculate the domid the path belongs to */
+	domid_t (*get_domid)(struct xenbus_watch *watch,
+			     const char *path, const char *token);
+
+	/* Get the owner's domid if the watch is for a specific domain */
+	domid_t (*get_owner)(struct xenbus_watch *watch);
};
Below is a sample implementation of the .get_domid() method for the xenwatch
at the xenstore entry 'state':
+static domid_t otherend_get_domid(struct xenbus_watch *watch,
+				  const char *path,
+				  const char *token)
+{
+	struct xenbus_device *xendev =
+		container_of(watch, struct xenbus_device, otherend_watch);
+
+	return xendev->otherend_id;
+}

 static int watch_otherend(struct xenbus_device *dev)
 {
 	struct xen_bus_type *bus =
 		container_of(dev->dev.bus, struct xen_bus_type, bus);

+	dev->otherend_watch.get_domid = otherend_get_domid;
Therefore, each xenwatch subscriber is expected to implement the callback
function to calculate the domid. The xenwatch event is processed by the
default xenwatch thread if .get_domid() is not implemented.
About per-domU xenwatch thread create/destroy: a new type of xenstore node is
introduced: '/local/domain/0/mtwatch/<domid>'.

Suppose the new domid is 7. During the domU (domid=7) creation, the xen
toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
of '/local/domain/7'. When the domU (domid=7) is destroyed, the last xenstore
operation by the xen toolstack is to remove '/local/domain/0/mtwatch/7'.
The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'. Kernel
thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
while this kernel thread is destroyed when the corresponding xenstore node is
removed.
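For illustration, assuming the standard xenstore CLI tools are available in
dom0, the lifecycle of the node (and hence of the per-domU kernel thread) can
be observed or mimicked manually; only the existence of the node matters, not
its value:

# While domU (domid=7) is running, the entry '7' is listed:
xenstore-ls /local/domain/0/mtwatch

# Mimicking the toolstack by hand (for testing only):
xenstore-write /local/domain/0/mtwatch/7 ""  # dom0 creates [xen-mtwatch-7]
xenstore-rm /local/domain/0/mtwatch/7        # dom0 destroys [xen-mtwatch-7]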
patch set
=========
One linux patch and one xen patch follow this RFC to help demonstrate the
idea:
[RFC PATCH linux 1/2] xenbus: introduce xenwatch multithreading to dom0 linux kernel
[RFC PATCH xen 2/2] libxl: introduce xenwatch multithreading to xen toolstack
The patch below can help reproduce the issue on purpose:
https://github.com/finallyjustice/patchset/blob/master/xenwatch-stall-by-vif.patch
Please let me know your input on this RFC. Thank you very much!
Dongli Zhang
This patch is based on xen-4.10.0.
This patch introduces xenwatch multithreading (or multithreaded xenwatch,
abbreviated as 'mtwatch') on the xen toolstack side.

In addition to the existing single xenwatch thread, each domU has its own
kernel thread ([xen-mtwatch-<domid>]) to process its xenwatch events.

The create/destroy of each per-domU mtwatch thread is controlled via
xenstore. The dom0 kernel watches the node '/local/domain/0/mtwatch', under
which each domU has an entry. The node '/local/domain/0/mtwatch/<domid>' is
written to xenstore first during domU creation, while the same node is
removed from xenstore as the last xenstore operation during domU
destruction. The corresponding watch callback creates or destroys the
per-domU mtwatch thread.
Signed-off-by: Dongli Zhang <[email protected]>
---
tools/libxl/libxl_create.c | 12 ++++++++++++
tools/libxl/libxl_domain.c | 26 ++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index f15fb21..579216a 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -527,6 +527,7 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
const char *dom_type;
char *uuid_string;
char *dom_path, *vm_path, *libxl_path;
+ char *mtwatch_path;
struct xs_permissions roperm[2];
struct xs_permissions rwperm[1];
struct xs_permissions noperm[1];
@@ -601,6 +602,14 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
goto out;
}
+ mtwatch_path = GCSPRINTF("%s/mtwatch/%d",
+ libxl__xs_get_dompath(gc, 0), *domid);
+ if (!mtwatch_path) {
+ LOGD(ERROR, *domid, "cannot allocate mtwatch path");
+ rc = ERROR_FAIL;
+ goto out;
+ }
+
noperm[0].id = 0;
noperm[0].perms = XS_PERM_NONE;
@@ -615,6 +624,9 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
retry_transaction:
t = xs_transaction_start(ctx->xsh);
+ xs_rm(ctx->xsh, t, mtwatch_path);
+ libxl__xs_mknod(gc, t, mtwatch_path, noperm, ARRAY_SIZE(noperm));
+
xs_rm(ctx->xsh, t, dom_path);
libxl__xs_mknod(gc, t, dom_path, roperm, ARRAY_SIZE(roperm));
diff --git a/tools/libxl/libxl_domain.c b/tools/libxl/libxl_domain.c
index 814f812..6638f66 100644
--- a/tools/libxl/libxl_domain.c
+++ b/tools/libxl/libxl_domain.c
@@ -1051,6 +1051,30 @@ out:
return;
}
+static void libxl_rm_mtwatch_node(libxl__gc *gc, uint32_t domid)
+{
+ char *mtwatch_path;
+ xs_transaction_t t = 0;
+ int rc;
+
+ mtwatch_path = GCSPRINTF("%s/mtwatch/%d",
+ libxl__xs_get_dompath(gc, 0), domid);
+ if (!mtwatch_path) {
+ LOGD(ERROR, domid, "cannot allocate mtwatch path");
+ return;
+ }
+
+ rc = libxl__xs_transaction_start(gc, &t);
+ if (rc) goto out;
+
+ libxl__xs_path_cleanup(gc, t, mtwatch_path);
+
+ libxl__xs_transaction_commit(gc, &t);
+
+out:
+ libxl__xs_transaction_abort(gc, &t);
+}
+
static void devices_destroy_cb(libxl__egc *egc,
libxl__devices_remove_state *drs,
int rc)
@@ -1083,6 +1107,8 @@ static void devices_destroy_cb(libxl__egc *egc,
xs_rm(ctx->xsh, XBT_NULL, libxl__xs_libxl_path(gc, domid));
xs_rm(ctx->xsh, XBT_NULL, GCSPRINTF( "/local/domain/%d/hvmloader", domid));
+ libxl_rm_mtwatch_node(gc, domid);
+
/* This is async operation, we already hold CTX lock */
lock = libxl__lock_domain_userdata(gc, domid);
if (!lock) {
--
2.7.4
This patch is based on v4.16-rc7.
This patch introduces xenwatch multithreading (or multithreaded xenwatch,
abbreviated as 'mtwatch') to the dom0 kernel. In addition to the existing
single xenwatch thread, each domU has its own kernel thread
([xen-mtwatch-<domid>]) to process its xenwatch events.

The create/destroy of each per-domU mtwatch thread is controlled via
xenstore. The dom0 kernel watches the node '/local/domain/0/mtwatch', under
which each domU has an entry. The node '/local/domain/0/mtwatch/<domid>' is
written to xenstore first during domU creation, while the same node is
removed from xenstore as the last xenstore operation during domU
destruction. The corresponding watch callback creates or destroys the
per-domU mtwatch thread.

A kernel parameter 'xen_mtwatch' is introduced to control whether the
feature is enabled. The feature is disabled by default if 'xen_mtwatch' is
not set on the dom0 kernel command line.
Signed-off-by: Dongli Zhang <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 3 +
drivers/xen/xenbus/xenbus_probe.c | 39 ++-
drivers/xen/xenbus/xenbus_probe_backend.c | 55 ++++
drivers/xen/xenbus/xenbus_xs.c | 333 +++++++++++++++++++++++-
include/xen/xenbus.h | 55 ++++
5 files changed, 481 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1d1d53f..46a484ba 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4672,6 +4672,9 @@
the unplug protocol
never -- do not unplug even if version check succeeds
+ xen_mtwatch [KNL,XEN]
+ Enables the multithreaded xenwatch (mtwatch).
+
xen_nopvspin [X86,XEN]
Disables the ticketlock slowpath using Xen PV
optimizations.
diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index ec9eb4f..5f88db1 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -129,12 +129,34 @@ static int talk_to_otherend(struct xenbus_device *dev)
}
+static domid_t otherend_get_domid(struct xenbus_watch *watch,
+ const char *path,
+ const char *token)
+{
+ struct xenbus_device *xendev =
+ container_of(watch, struct xenbus_device, otherend_watch);
+
+ return xendev->otherend_id;
+}
+
+
+static domid_t otherend_get_owner(struct xenbus_watch *watch)
+{
+ struct xenbus_device *xendev =
+ container_of(watch, struct xenbus_device, otherend_watch);
+
+ return xendev->otherend_id;
+}
+
static int watch_otherend(struct xenbus_device *dev)
{
struct xen_bus_type *bus =
container_of(dev->dev.bus, struct xen_bus_type, bus);
+ dev->otherend_watch.get_domid = otherend_get_domid;
+ dev->otherend_watch.get_owner = otherend_get_owner;
+
return xenbus_watch_pathfmt(dev, &dev->otherend_watch,
bus->otherend_changed,
"%s/%s", dev->otherend, "state");
@@ -521,7 +543,22 @@ int xenbus_probe_devices(struct xen_bus_type *bus)
}
EXPORT_SYMBOL_GPL(xenbus_probe_devices);
-static unsigned int char_count(const char *str, char c)
+domid_t path_to_domid(const char *path)
+{
+ const char *p = path;
+ domid_t domid = 0;
+
+ while (*p) {
+ if (*p < '0' || *p > '9')
+ break;
+ domid = (domid << 3) + (domid << 1) + (*p - '0');
+ p++;
+ }
+
+ return domid;
+}
+
+unsigned int char_count(const char *str, char c)
{
unsigned int i, ret = 0;
diff --git a/drivers/xen/xenbus/xenbus_probe_backend.c b/drivers/xen/xenbus/xenbus_probe_backend.c
index b0bed4f..80be062 100644
--- a/drivers/xen/xenbus/xenbus_probe_backend.c
+++ b/drivers/xen/xenbus/xenbus_probe_backend.c
@@ -211,9 +211,61 @@ static void backend_changed(struct xenbus_watch *watch,
xenbus_dev_changed(path, &xenbus_backend);
}
+/* path: backend/<pvdev>/<domid>/... */
+static domid_t be_get_domid(struct xenbus_watch *watch,
+ const char *path,
+ const char *token)
+{
+ const char *p = path;
+
+ if (char_count(path, '/') < 2)
+ return 0;
+
+ p = strchr(p, '/') + 1;
+ p = strchr(p, '/') + 1;
+
+ return path_to_domid(p);
+}
+
static struct xenbus_watch be_watch = {
.node = "backend",
.callback = backend_changed,
+ .get_domid = be_get_domid,
+};
+
+static void mtwatch_changed(struct xenbus_watch *watch,
+ const char *path, const char *token)
+{
+ domid_t domid;
+ const char *p = path;
+
+ if (char_count(path, '/') != 1)
+ return;
+
+ p = strchr(p, '/') + 1;
+ domid = path_to_domid(p);
+
+ xen_mtwatch_callback(domid, path);
+}
+
+/* path: mtwatch/<domid> */
+static domid_t mtwatch_get_domid(struct xenbus_watch *watch,
+ const char *path, const char *token)
+{
+ const char *p = path;
+
+ if (char_count(path, '/') != 1)
+ return 0;
+
+ p = strchr(p, '/') + 1;
+
+ return path_to_domid(p);
+}
+
+static struct xenbus_watch mtwatch_watch = {
+ .node = "mtwatch",
+ .callback = mtwatch_changed,
+ .get_domid = mtwatch_get_domid,
};
static int read_frontend_details(struct xenbus_device *xendev)
@@ -245,6 +297,9 @@ static int backend_probe_and_watch(struct notifier_block *notifier,
xenbus_probe_devices(&xenbus_backend);
register_xenbus_watch(&be_watch);
+ if (xen_mtwatch)
+ register_xenbus_watch(&mtwatch_watch);
+
return NOTIFY_DONE;
}
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index 3f3b293..7f30b7a 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -95,6 +95,242 @@ static pid_t xenwatch_pid;
static DEFINE_MUTEX(xenwatch_mutex);
static DECLARE_WAIT_QUEUE_HEAD(watch_events_waitq);
+bool xen_mtwatch;
+EXPORT_SYMBOL_GPL(xen_mtwatch);
+
+struct mtwatch_info *mtwatch_info;
+
+static bool param_xen_mtwatch;
+static __init int xen_parse_mtwatch(char *arg)
+{
+ param_xen_mtwatch = true;
+ return 0;
+}
+early_param("xen_mtwatch", xen_parse_mtwatch);
+
+struct mtwatch_domain *mtwatch_find_domain(domid_t domid)
+{
+ struct mtwatch_domain *domain;
+ int hash = MTWATCH_HASH(domid);
+ struct hlist_head *hash_head = &mtwatch_info->domain_hash[hash];
+
+ hlist_for_each_entry_rcu(domain, hash_head, hash_node) {
+ if (domain->domid == domid)
+ return domain;
+ }
+
+ return NULL;
+}
+
+static int mtwatch_thread(void *arg)
+{
+ struct mtwatch_domain *domain = (struct mtwatch_domain *) arg;
+ struct list_head *ent;
+ struct xs_watch_event *event;
+
+ domain->pid = current->pid;
+
+ for (;;) {
+ wait_event_interruptible(domain->events_wq,
+ !list_empty(&domain->events) ||
+ kthread_should_stop());
+
+ /*
+ * domain->state is already set to MTWATCH_DOMAIN_DOWN (to
+ * avoid new event to domain->events) when
+ * kthread_should_stop()==true, so that it is not required
+ * to cleanup domain->events once the loop breaks.
+ */
+ if (kthread_should_stop() && list_empty(&domain->events))
+ break;
+
+ mutex_lock(&domain->domain_mutex);
+
+ spin_lock(&domain->events_lock);
+ ent = domain->events.next;
+ if (ent != &domain->events)
+ list_del(ent);
+ spin_unlock(&domain->events_lock);
+
+ if (ent != &domain->events) {
+ event = list_entry(ent, struct xs_watch_event, list);
+ event->handle->callback(event->handle, event->path,
+ event->token);
+ kfree(event);
+ }
+
+ mutex_unlock(&domain->domain_mutex);
+ }
+
+ return 0;
+}
+
+/* protected by xenwatch_mutex so xenbus_watch is always valid */
+static int mtwatch_process(struct xs_watch_event *event)
+{
+ domid_t domid;
+ struct xenbus_watch *watch = event->handle;
+ struct mtwatch_domain *domain;
+ int rc = 1;
+
+ if (!watch->get_domid)
+ return rc;
+
+ domid = watch->get_domid(event->handle, event->path, event->token);
+ if (!domid)
+ return rc;
+
+ rcu_read_lock();
+
+ domain = mtwatch_find_domain(domid);
+ if (!domain) {
+ rcu_read_unlock();
+ return rc;
+ }
+
+ spin_lock(&domain->events_lock);
+ if (domain->state != MTWATCH_DOMAIN_DOWN) {
+ list_add(&event->list, &domain->events);
+ wake_up(&domain->events_wq);
+ rc = 0;
+ }
+ spin_unlock(&domain->events_lock);
+
+ rcu_read_unlock();
+
+ return rc;
+}
+
+
+static void delayed_destroy_domain(struct rcu_head *head)
+{
+ struct mtwatch_domain *domain;
+
+ domain = container_of(head, struct mtwatch_domain, rcu);
+ kfree(domain);
+}
+
+static void xen_mtwatch_purge_domain(struct work_struct *work)
+{
+ struct mtwatch_domain *domain;
+ struct list_head *node;
+
+ while (!list_empty(&mtwatch_info->purge_list)) {
+
+ spin_lock(&mtwatch_info->purge_lock);
+ node = mtwatch_info->purge_list.next;
+ if (node != &mtwatch_info->purge_list)
+ list_del(node);
+ spin_unlock(&mtwatch_info->purge_lock);
+
+ if (node != &mtwatch_info->purge_list) {
+ domain = list_entry(node, struct mtwatch_domain,
+ purge_node);
+ kthread_stop(domain->task);
+
+ call_rcu(&domain->rcu, delayed_destroy_domain);
+ }
+ }
+}
+
+static void mtwatch_create_domain(domid_t domid)
+{
+ struct mtwatch_domain *domain;
+
+ if (!domid) {
+ pr_warn("default xenwatch thread is for dom0\n");
+ return;
+ }
+
+ spin_lock(&mtwatch_info->domain_lock);
+
+ if (mtwatch_find_domain(domid)) {
+ spin_unlock(&mtwatch_info->domain_lock);
+ return;
+ }
+
+ domain = kzalloc(sizeof(*domain), GFP_ATOMIC);
+ if (!domain) {
+ spin_unlock(&mtwatch_info->domain_lock);
+ return;
+ }
+
+ domain->domid = domid;
+ mutex_init(&domain->domain_mutex);
+ INIT_LIST_HEAD(&domain->purge_node);
+
+ init_waitqueue_head(&domain->events_wq);
+ spin_lock_init(&domain->events_lock);
+ INIT_LIST_HEAD(&domain->events);
+
+ hlist_add_head_rcu(&domain->hash_node,
+ &mtwatch_info->domain_hash[MTWATCH_HASH(domid)]);
+
+ list_add_tail_rcu(&domain->list_node, &mtwatch_info->domain_list);
+
+ spin_unlock(&mtwatch_info->domain_lock);
+
+ domain->task = kthread_run(mtwatch_thread, domain,
+ "xen-mtwatch-%d", domid);
+
+ if (IS_ERR(domain->task)) {
+ pr_err("mtwatch kthread creation failed\n");
+
+ spin_lock(&mtwatch_info->domain_lock);
+ hlist_del_rcu(&domain->hash_node);
+ list_del_rcu(&domain->list_node);
+ spin_unlock(&mtwatch_info->domain_lock);
+
+ call_rcu(&domain->rcu, delayed_destroy_domain);
+
+ return;
+ }
+
+ domain->state = MTWATCH_DOMAIN_UP;
+}
+
+static void mtwatch_destroy_domain(domid_t domid)
+{
+ struct mtwatch_domain *domain;
+
+ spin_lock(&mtwatch_info->domain_lock);
+
+ domain = mtwatch_find_domain(domid);
+ if (!domain) {
+ spin_unlock(&mtwatch_info->domain_lock);
+ return;
+ }
+
+ spin_lock(&domain->events_lock);
+ domain->state = MTWATCH_DOMAIN_DOWN;
+ spin_unlock(&domain->events_lock);
+
+ hlist_del_rcu(&domain->hash_node);
+ list_del_rcu(&domain->list_node);
+
+ spin_unlock(&mtwatch_info->domain_lock);
+
+ spin_lock(&mtwatch_info->purge_lock);
+ list_add(&domain->purge_node, &mtwatch_info->purge_list);
+ spin_unlock(&mtwatch_info->purge_lock);
+
+ schedule_work(&mtwatch_info->purge_work);
+}
+
+void xen_mtwatch_callback(domid_t domid, const char *path)
+{
+ int exists;
+
+ exists = xenbus_exists(XBT_NIL, path, "");
+
+ if (!exists) {
+ mtwatch_destroy_domain(domid);
+ return;
+ }
+
+ mtwatch_create_domain(domid);
+}
+
static void xs_suspend_enter(void)
{
spin_lock(&xs_state_lock);
@@ -778,6 +1014,71 @@ int register_xenbus_watch(struct xenbus_watch *watch)
}
EXPORT_SYMBOL_GPL(register_xenbus_watch);
+static void __unregister_single_mtwatch(struct xenbus_watch *watch,
+ struct mtwatch_domain *domain)
+{
+ struct xs_watch_event *event, *tmp;
+
+ if (current->pid != domain->pid)
+ mutex_lock(&domain->domain_mutex);
+
+ spin_lock(&domain->events_lock);
+ list_for_each_entry_safe(event, tmp,
+ &domain->events, list) {
+ if (event->handle != watch)
+ continue;
+ list_del(&event->list);
+ kfree(event);
+ }
+ spin_unlock(&domain->events_lock);
+
+ if (current->pid != domain->pid)
+ mutex_unlock(&domain->domain_mutex);
+}
+
+static void unregister_single_mtwatch(struct xenbus_watch *watch,
+ domid_t domid)
+{
+ struct mtwatch_domain *domain;
+
+ rcu_read_lock();
+
+ domain = mtwatch_find_domain(domid);
+
+ if (WARN_ON_ONCE(unlikely(!domain))) {
+ rcu_read_unlock();
+ return;
+ }
+
+ __unregister_single_mtwatch(watch, domain);
+
+ rcu_read_unlock();
+}
+
+static void unregister_all_mtwatch(struct xenbus_watch *watch)
+{
+ struct mtwatch_domain *domain;
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(domain, &mtwatch_info->domain_list,
+ list_node) {
+ __unregister_single_mtwatch(watch, domain);
+ }
+
+ rcu_read_unlock();
+}
+
+static void unregister_mtwatch(struct xenbus_watch *watch)
+{
+ domid_t domid = watch->get_owner ? watch->get_owner(watch) : 0;
+
+ if (domid)
+ unregister_single_mtwatch(watch, domid);
+ else
+ unregister_all_mtwatch(watch);
+}
+
void unregister_xenbus_watch(struct xenbus_watch *watch)
{
struct xs_watch_event *event, *tmp;
@@ -816,6 +1117,9 @@ void unregister_xenbus_watch(struct xenbus_watch *watch)
if (current->pid != xenwatch_pid)
mutex_unlock(&xenwatch_mutex);
+
+ if (xen_mtwatch && watch->get_domid)
+ unregister_mtwatch(watch);
}
EXPORT_SYMBOL_GPL(unregister_xenbus_watch);
@@ -879,9 +1183,13 @@ static int xenwatch_thread(void *unused)
if (ent != &watch_events) {
event = list_entry(ent, struct xs_watch_event, list);
- event->handle->callback(event->handle, event->path,
- event->token);
- kfree(event);
+
+ if (!xen_mtwatch || mtwatch_process(event)) {
+ event->handle->callback(event->handle,
+ event->path,
+ event->token);
+ kfree(event);
+ }
}
mutex_unlock(&xenwatch_mutex);
@@ -927,6 +1235,25 @@ int xs_init(void)
if (err)
return err;
+ if (xen_initial_domain() && param_xen_mtwatch) {
+ int i;
+
+ mtwatch_info = kmalloc(sizeof(*mtwatch_info), GFP_KERNEL);
+
+ for (i = 0; i < MTWATCH_HASH_SIZE; i++)
+ INIT_HLIST_HEAD(&mtwatch_info->domain_hash[i]);
+ spin_lock_init(&mtwatch_info->domain_lock);
+ INIT_LIST_HEAD(&mtwatch_info->domain_list);
+
+ spin_lock_init(&mtwatch_info->purge_lock);
+ INIT_LIST_HEAD(&mtwatch_info->purge_list);
+ INIT_WORK(&mtwatch_info->purge_work, xen_mtwatch_purge_domain);
+
+ xen_mtwatch = true;
+
+ pr_info("xenwatch multithreading is enabled\n");
+ }
+
task = kthread_run(xenwatch_thread, NULL, "xenwatch");
if (IS_ERR(task))
return PTR_ERR(task);
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 869c816..917d3ca 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -62,6 +62,13 @@ struct xenbus_watch
/* Callback (executed in a process context with no locks held). */
void (*callback)(struct xenbus_watch *,
const char *path, const char *token);
+
+ /* Callback to help calculate the domid the path belongs to */
+ domid_t (*get_domid)(struct xenbus_watch *watch,
+ const char *path, const char *token);
+
+ /* Get the owner's domid if the watch is for a specific domain */
+ domid_t (*get_owner)(struct xenbus_watch *watch);
};
@@ -229,8 +236,56 @@ const char *xenbus_strstate(enum xenbus_state state);
int xenbus_dev_is_online(struct xenbus_device *dev);
int xenbus_frontend_closed(struct xenbus_device *dev);
+domid_t path_to_domid(const char *path);
+unsigned int char_count(const char *str, char c);
+void xen_mtwatch_callback(domid_t domid, const char *path);
+
extern const struct file_operations xen_xenbus_fops;
extern struct xenstore_domain_interface *xen_store_interface;
extern int xen_store_evtchn;
+extern bool xen_mtwatch;
+
+#define MTWATCH_HASH_SIZE 256
+#define MTWATCH_HASH(_id) ((int)(_id)&(MTWATCH_HASH_SIZE-1))
+
+struct mtwatch_info {
+ /* the mtwatch_domain is put on both a hash table and a list */
+ spinlock_t domain_lock;
+ struct hlist_head domain_hash[MTWATCH_HASH_SIZE];
+ struct list_head domain_list;
+
+ /*
+ * when a per-domU kthread is going to be destroyed, it would put
+ * a request on list purge_list, which will be flushed by
+ * purge_work later.
+ */
+ struct work_struct purge_work;
+ spinlock_t purge_lock;
+ struct list_head purge_list;
+};
+
+enum mtwatch_domain_state {
+ MTWATCH_DOMAIN_UP = 1,
+ MTWATCH_DOMAIN_DOWN = 2,
+};
+
+struct mtwatch_domain {
+ domid_t domid;
+ struct task_struct *task;
+
+ pid_t pid;
+ struct mutex domain_mutex;
+ struct rcu_head rcu;
+
+ struct hlist_node hash_node;
+ struct list_head list_node;
+ struct list_head purge_node;
+
+ wait_queue_head_t events_wq;
+ spinlock_t events_lock;
+ struct list_head events;
+ enum mtwatch_domain_state state;
+};
+
#endif /* _XEN_XENBUS_H */
--
2.7.4
On 07/04/18 13:25, Dongli Zhang wrote:
> This is to introduce "xenwatch multithreading" (or "multithreaded xenwatch",
> abbreviated as 'mtwatch'). The implementation of xen mtwatch involves below
> components:
>
> * dom0 linux kernel
> * xen toolstack
>
> Here are what the RFC is going to discuss:
>
> - what is the problem
> - what is the objective
> - what is the solution
Instead of creating one thread per domU, wouldn't it make much more
sense to use another mechanism, e.g. a workqueue, to deliver the
watch events? This would be one central place, wouldn't need any
changes in Xen tools or complex mechanisms to select the correct
thread, save resources, and domUs would benefit, too.
Juergen
Hi Juergen,
On 04/07/2018 07:59 PM, Juergen Gross wrote:
> On 07/04/18 13:25, Dongli Zhang wrote:
>> This is to introduce "xenwatch multithreading" (or "multithreaded xenwatch",
>> abbreviated as 'mtwatch'). The implementation of xen mtwatch involves below
>> components:
>>
>> * dom0 linux kernel
>> * xen toolstack
>>
>> Here are what the RFC is going to discuss:
>>
>> - what is the problem
>> - what is the objective
>> - what is the solution
>
> Instead of creating one thread per domU, wouldn't it make much more
> sense to use another mechanism, e.g. a workqueue, to deliver the
> watch events? This would be one central place, wouldn't need any
> changes in Xen tools or complex mechanisms to select the correct
> thread, save resources, and domUs would benefit, too.
I think the xenwatch events of the same domU are expected to be processed in
order. I do not think we are able to guarantee that the events of the same
domU are processed in order with a central workqueue (unless it is an ordered
workqueue). Suppose an event callback function is stalled: we expect all
following events belonging to the same domU to be delayed until there is a
workaround/solution to unblock the stalled function.
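For reference, the "ordered workqueue" alternative would look roughly like
the minimal sketch below (assuming the standard workqueue API): it preserves
queueing order, but executes at most one work item at a time system-wide, so
a single stalled callback would still block the events of all other domUs,
just like the current single xenwatch thread.

/* Sketch: order is preserved, but concurrency is limited to 1. */
static struct workqueue_struct *xenwatch_wq;

static int __init xenwatch_wq_init(void)
{
	xenwatch_wq = alloc_ordered_workqueue("xenwatch", 0);
	return xenwatch_wq ? 0 : -ENOMEM;
}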
For instance, as mentioned in the mailing list archive below, once the NIC is
reset, all in-flight packets are flushed, and the stalled callback function
moves forward and completes. As all events (especially the events belonging
to the same domU) are still queued, the pv device backend can disconnect
correctly and successfully. With a central workqueue, the following events
belonging to the same domU would already have been processed, so the pv
device backend might not disconnect successfully.
https://lists.xenproject.org/archives/html/xen-devel/2016-06/msg00195.html
Dongli Zhang
On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
> About per-domU xenwatch thread create/destroy, a new type of xenstore node is
> introduced: '/local/domain/0/mtwatch/<domid>'.
>
> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>
> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'. Kernel
> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
> while this kernel thread is destroyed when the corresponding xenstore node is
> removed.
Instead of inventing yet another node, can you not watch /local/domain
directly?
Wei.
Hi Wei,
On 04/23/2018 10:09 PM, Wei Liu wrote:
> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>> About per-domU xenwatch thread create/destroy, a new type of xenstore node is
>> introduced: '/local/domain/0/mtwatch/<domid>'.
>>
>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>
>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'. Kernel
>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
>> while this kernel thread is destroyed when the corresponding xenstore node is
>> removed.
>
> Instead of inventing yet another node, can you not watch /local/domain
> directly?
Would you like to watch at /local/domain directly? Or is your question "is there
any other way to not watch at /local/domain, while no extra xenstore node will
be introduced"?
Actually, the first prototype of this idea was to watch at /local/domain
directly to get aware of the domU create/destroy, so that xen toolstack will not
get involved. Joao Martins (CCed) had a concern on the performance as watching
at /local/domain would lead to large amount of xenwatch events.
Dongli Zhang
Hi Juergen,
On 04/24/2018 01:22 PM, Juergen Gross wrote:
> On 24/04/18 01:55, Dongli Zhang wrote:
>> Hi Wei,
>>
>> On 04/23/2018 10:09 PM, Wei Liu wrote:
>>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>>> About per-domU xenwatch thread create/destroy, a new type of xenstore node is
>>>> introduced: '/local/domain/0/mtwatch/<domid>'.
>>>>
>>>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
>>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>>
>>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'. Kernel
>>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
>>>> while this kernel thread is destroyed when the corresponding xenstore node is
>>>> removed.
>>>
>>> Instead of inventing yet another node, can you not watch /local/domain
>>> directly?
>>
>> Would you like to watch at /local/domain directly? Or is your question "is there
>> any other way to not watch at /local/domain, while no extra xenstore node will
>> be introduced"?
>>
>> Actually, the first prototype of this idea was to watch at /local/domain
>> directly to get aware of the domU create/destroy, so that xen toolstack will not
>> get involved. Joao Martins (CCed) had a concern on the performance as watching
>> at /local/domain would lead to large amount of xenwatch events.
>
> That's what the special watches "@introduceDomain" and "@releaseDomain"
> are meant for.
I used to consider to watch at "@introduceDomain". However, there is no domain
information appended with "@introduceDomain" and it is still required for dom0
kernel to proactively confirm who is created.
Dongli Zhang
On 24/04/18 01:55, Dongli Zhang wrote:
> Hi Wei,
>
> On 04/23/2018 10:09 PM, Wei Liu wrote:
>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>> About per-domU xenwatch thread create/destroy, a new type of xenstore node is
>>> introduced: '/local/domain/0/mtwatch/<domid>'.
>>>
>>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>
>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'. Kernel
>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
>>> while this kernel thread is destroyed when the corresponding xenstore node is
>>> removed.
>>
>> Instead of inventing yet another node, can you not watch /local/domain
>> directly?
>
> Would you like to watch at /local/domain directly? Or is your question "is there
> any other way to not watch at /local/domain, while no extra xenstore node will
> be introduced"?
>
> Actually, the first prototype of this idea was to watch at /local/domain
> directly to get aware of the domU create/destroy, so that xen toolstack will not
> get involved. Joao Martins (CCed) had a concern on the performance as watching
> at /local/domain would lead to large amount of xenwatch events.
That's what the special watches "@introduceDomain" and "@releaseDomain"
are meant for.
Juergen
On 24/04/18 07:52, Dongli Zhang wrote:
> Hi Juergen,
>
> On 04/24/2018 01:22 PM, Juergen Gross wrote:
>> On 24/04/18 01:55, Dongli Zhang wrote:
>>> Hi Wei,
>>>
>>> On 04/23/2018 10:09 PM, Wei Liu wrote:
>>>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>>>> About per-domU xenwatch thread create/destroy, a new type of xenstore node is
>>>>> introduced: '/local/domain/0/mtwatch/<domid>'.
>>>>>
>>>>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>>>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
>>>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>>>
>>>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'. Kernel
>>>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
>>>>> while this kernel thread is destroyed when the corresponding xenstore node is
>>>>> removed.
>>>>
>>>> Instead of inventing yet another node, can you not watch /local/domain
>>>> directly?
>>>
>>> Would you like to watch at /local/domain directly? Or is your question "is there
>>> any other way to not watch at /local/domain, while no extra xenstore node will
>>> be introduced"?
>>>
>>> Actually, the first prototype of this idea was to watch at /local/domain
>>> directly to get aware of the domU create/destroy, so that xen toolstack will not
>>> get involved. Joao Martins (CCed) had a concern on the performance as watching
>>> at /local/domain would lead to large amount of xenwatch events.
>>
>> That's what the special watches "@introduceDomain" and "@releaseDomain"
>> are meant for.
>
> I used to consider to watch at "@introduceDomain". However, there is no domain
> information appended with "@introduceDomain" and it is still required for dom0
> kernel to proactively confirm who is created.
That isn't too hard, right? You just need to read /local/domain to get
the list of its children and look for new domains there.
Juergen
On 24/04/18 08:10, Dongli Zhang wrote:
>
>
> On 04/24/2018 02:03 PM, Juergen Gross wrote:
>> On 24/04/18 07:52, Dongli Zhang wrote:
>>> Hi Juergen,
>>>
>>> On 04/24/2018 01:22 PM, Juergen Gross wrote:
>>>> On 24/04/18 01:55, Dongli Zhang wrote:
>>>>> Hi Wei,
>>>>>
>>>>> On 04/23/2018 10:09 PM, Wei Liu wrote:
>>>>>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>>>>>> About per-domU xenwatch thread create/destroy, a new type of xenstore node is
>>>>>>> introduced: '/local/domain/0/mtwatch/<domid>'.
>>>>>>>
>>>>>>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>>>>>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
>>>>>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>>>>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>>>>>
>>>>>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'. Kernel
>>>>>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
>>>>>>> while this kernel thread is destroyed when the corresponding xenstore node is
>>>>>>> removed.
>>>>>>
>>>>>> Instead of inventing yet another node, can you not watch /local/domain
>>>>>> directly?
>>>>>
>>>>> Would you like to watch at /local/domain directly? Or is your question "is there
>>>>> any other way to not watch at /local/domain, while no extra xenstore node will
>>>>> be introduced"?
>>>>>
>>>>> Actually, the first prototype of this idea was to watch at /local/domain
>>>>> directly to get aware of the domU create/destroy, so that xen toolstack will not
>>>>> get involved. Joao Martins (CCed) had a concern on the performance as watching
>>>>> at /local/domain would lead to large amount of xenwatch events.
>>>>
>>>> That's what the special watches "@introduceDomain" and "@releaseDomain"
>>>> are meant for.
>>>
>>> I used to consider to watch at "@introduceDomain". However, there is no domain
>>> information appended with "@introduceDomain" and it is still required for dom0
>>> kernel to proactively confirm who is created.
>>
>> That isn't too hard, right? You just need to read /local/domain to get
>> the list of its children and look for new domains there.
>
> You are right. I will try to limit the modification within linux kernel, and try
> to not dirty xen toolstack.
>
> Thank you very much for the suggestion.
When going that route you should extend xenbus_directory() in the kernel
to use XS_DIRECTORY_PART in case XS_DIRECTORY returns E2BIG (see
xs_directory() in Xen's tools/xenstore/xs.c ). This will enable the
kernel to support more than about 1000 domains.
Juergen
On 04/24/2018 02:03 PM, Juergen Gross wrote:
> On 24/04/18 07:52, Dongli Zhang wrote:
>> Hi Juergen,
>>
>> On 04/24/2018 01:22 PM, Juergen Gross wrote:
>>> On 24/04/18 01:55, Dongli Zhang wrote:
>>>> Hi Wei,
>>>>
>>>> On 04/23/2018 10:09 PM, Wei Liu wrote:
>>>>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>>>>> About per-domU xenwatch thread create/destroy, a new type of xenstore node is
>>>>>> introduced: '/local/domain/0/mtwatch/<domid>'.
>>>>>>
>>>>>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>>>>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
>>>>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>>>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>>>>
>>>>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'. Kernel
>>>>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
>>>>>> while this kernel thread is destroyed when the corresponding xenstore node is
>>>>>> removed.
>>>>>
>>>>> Instead of inventing yet another node, can you not watch /local/domain
>>>>> directly?
>>>>
>>>> Would you like to watch at /local/domain directly? Or is your question "is there
>>>> any other way to not watch at /local/domain, while no extra xenstore node will
>>>> be introduced"?
>>>>
>>>> Actually, the first prototype of this idea was to watch at /local/domain
>>>> directly to get aware of the domU create/destroy, so that xen toolstack will not
>>>> get involved. Joao Martins (CCed) had a concern on the performance as watching
>>>> at /local/domain would lead to large amount of xenwatch events.
>>>
>>> That's what the special watches "@introduceDomain" and "@releaseDomain"
>>> are meant for.
>>
>> I used to consider to watch at "@introduceDomain". However, there is no domain
>> information appended with "@introduceDomain" and it is still required for dom0
>> kernel to proactively confirm who is created.
>
> That isn't too hard, right? You just need to read /local/domain to get
> the list of its children and look for new domains there.
You are right. I will try to limit the modification within linux kernel, and try
to not dirty xen toolstack.
Thank you very much for the suggestion.
Dongli Zhang