2015-07-13 04:08:07

by Bandan Das

Subject: [RFC PATCH 0/4] Shared vhost design

Hello,

There have been discussions on improving the current vhost design. The first
attempt, to my knowledge, was Shirley Ma's patch to create a dedicated vhost
worker per cgroup.

http://comments.gmane.org/gmane.linux.network/224730

Later, I posted a cmwq-based approach for performance comparisons
http://comments.gmane.org/gmane.linux.network/286858

More recently, there was the Elvis work presented at KVM Forum 2013
http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf

The Elvis patches rely on a common vhost thread design for scalability
along with polling for performance. Since there are two major changes
being proposed, we decided to split up the work. The first part (this RFC)
proposes a re-design of the vhost threading model, and the second part
(not posted yet) will focus more on improving performance.

I am posting this with the hope that we can have a meaningful discussion
on the proposed new architecture. We have run some tests to show that the new
design is scalable and, in terms of performance, comparable to the current
stable design.

Test Setup:
The testing is based on the setup described in the Elvis proposal.
The initial tests are just an aggregate of Netperf STREAM and MAERTS but
as we progress, I am happy to run more tests. The hosts are two identical
16 core Haswell systems with point-to-point network links. For the first 10 runs,
with n=1 up to n=10 guests running in parallel, I booted the target system with nr_cpus=8
and mem=12G. The purpose was to compare resource utilization
and how it affects performance. Finally, with the number of guests set at 14,
I didn't limit the number of CPUs booted on the host or the memory seen by
the kernel, but booted the kernel with isolcpus=14,15; these two CPUs are
reserved for running the vhost threads. The guests are pinned to cpus 0-13 and,
based on which cpu a guest is running on, the corresponding I/O thread is
pinned to cpu 14 or 15.
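The pinning for the last run can be sketched as a small shell helper. Note this is an illustrative reconstruction: the parity-based mapping of guest CPUs to I/O thread CPUs and the taskset invocation are my assumptions about the setup, not the exact commands used.

```shell
# Host boot line (config, shown for context):
#   isolcpus=14,15
#
# Map a guest's pinned CPU (0-13) to the vhost I/O thread CPU (14 or 15).
# Assumption: even guest CPUs map to CPU 14, odd guest CPUs to CPU 15.
io_cpu_for_guest() {
    echo $((14 + $1 % 2))
}

# Illustrative use: pin the vhost worker serving a guest on CPU 5
# (VHOST_PID is a placeholder for the worker thread's pid):
#   taskset -pc "$(io_cpu_for_guest 5)" "$VHOST_PID"
io_cpu_for_guest 4   # -> 14
io_cpu_for_guest 5   # -> 15
```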

Results
# X axis is number of guests
# Y axis is netperf number
# nr_cpus=8 and mem=12G
#Number of Guests #Baseline #ELVIS
1 1119.3 1111.0
2 1135.6 1130.2
3 1135.5 1131.6
4 1136.0 1127.1
5 1118.6 1129.3
6 1123.4 1129.8
7 1128.7 1135.4
8 1129.9 1137.5
9 1130.6 1135.1
10 1129.3 1138.9
14* 1173.8 1216.9

#* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
# I/O thread runs on CPU 14 or 15 depending on which guest it's serving

There's a simple graph at
http://people.redhat.com/~bdas/elvis/data/results.png
that shows how task affinity results in a jump, and that even without it,
as the number of guests increases, the shared vhost design performs
slightly better.

Observations:
1. In terms of "stock" performance, the results are comparable.
2. However, with a tuned setup, even without polling, we see an improvement
with the new design.
3. Making the new design simulate the old behavior is just a matter of setting
the number of guests per vhost thread to 1.
4. Setting a per-guest limit on the work being done by a specific vhost
thread may be needed for fairness.
5. cgroup associations need to be figured out. I just slightly hacked the
current cgroup association mechanism to work with the new model. Ccing the
cgroups list for input/comments.
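Observation 3 can be sketched concretely. The modprobe line assumes vhost is built as a module and uses the devs_per_worker parameter introduced later in this series; workers_needed is just an illustrative model of the pool-growth policy, not kernel code.

```shell
# Restore the old one-thread-per-device behavior (illustrative):
#   modprobe vhost devs_per_worker=1
#
# Model of the pool-growth policy: each worker serves at most
# devs_per_worker devices, so the pool grows to
# ceil(ndevs / devs_per_worker) workers.
workers_needed() {
    ndevs=$1
    per_worker=$2
    echo $(( (ndevs + per_worker - 1) / per_worker ))
}

workers_needed 10 7   # -> 2 workers with the default threshold
workers_needed 10 1   # -> 10 workers, i.e. the old per-device model
```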

Many thanks to Razya Ladelsky and Eyal Moscovici of IBM for the initial
patches and the helpful testing suggestions and discussions.

Bandan Das (4):
vhost: Introduce a universal thread to serve all users
vhost: Limit the number of devices served by a single worker thread
cgroup: Introduce a function to compare cgroups
vhost: Add cgroup-aware creation of worker threads

drivers/vhost/net.c | 6 +-
drivers/vhost/scsi.c | 18 ++--
drivers/vhost/vhost.c | 272 +++++++++++++++++++++++++++++++++++--------------
drivers/vhost/vhost.h | 32 +++++-
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 40 ++++++++
6 files changed, 275 insertions(+), 94 deletions(-)

--
2.4.3


2015-07-13 04:08:17

by Bandan Das

Subject: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

vhost threads are per-device, but in most cases a single thread
is enough. This change creates a single thread that is used to
serve all guests.

However, this complicates cgroup associations. The current policy
is to attach the per-device thread to all cgroups of the parent process
that the device is associated with. This is no longer possible if we
have a single thread. So, we end up moving the thread around to the
cgroups of whichever device needs servicing. This is a very
inefficient protocol but seems to be the only way to integrate
cgroups support.

Signed-off-by: Razya Ladelsky <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
---
drivers/vhost/scsi.c | 15 +++--
drivers/vhost/vhost.c | 150 ++++++++++++++++++++++++--------------------------
drivers/vhost/vhost.h | 19 +++++--
3 files changed, 97 insertions(+), 87 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index ea32b38..6c42936 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd *cmd)

llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);

- vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+ vhost_work_queue(vs->dev.worker, &vs->vs_completion_work);
}

static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
@@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
}

llist_add(&evt->list, &vs->vs_event_list);
- vhost_work_queue(&vs->dev, &vs->vs_event_work);
+ vhost_work_queue(vs->dev.worker, &vs->vs_event_work);
}

static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
@@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
/* Flush both the vhost poll and vhost work */
for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
vhost_scsi_flush_vq(vs, i);
- vhost_work_flush(&vs->dev, &vs->vs_completion_work);
- vhost_work_flush(&vs->dev, &vs->vs_event_work);
+ vhost_work_flush(vs->dev.worker, &vs->vs_completion_work);
+ vhost_work_flush(vs->dev.worker, &vs->vs_event_work);

/* Wait for all reqs issued before the flush to be finished */
for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
@@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
if (!vqs)
goto err_vqs;

- vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
- vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
+ vhost_work_init(&vs->dev, &vs->vs_completion_work,
+ vhost_scsi_complete_cmd_work);
+
+ vhost_work_init(&vs->dev, &vs->vs_event_work,
+ vhost_scsi_evt_work);

vs->vs_events_nr = 0;
vs->vs_events_missed = false;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..951c96b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -11,6 +11,8 @@
* Generic code for virtio server in host kernel.
*/

+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/eventfd.h>
#include <linux/vhost.h>
#include <linux/uio.h>
@@ -28,6 +30,9 @@

#include "vhost.h"

+/* Just one worker thread to service all devices */
+static struct vhost_worker *worker;
+
enum {
VHOST_MEMORY_MAX_NREGIONS = 64,
VHOST_MEMORY_F_LOG = 0x1,
@@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
return 0;
}

-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_dev *dev,
+ struct vhost_work *work, vhost_work_fn_t fn)
{
INIT_LIST_HEAD(&work->node);
work->fn = fn;
init_waitqueue_head(&work->done);
work->flushing = 0;
work->queue_seq = work->done_seq = 0;
+ work->dev = dev;
}
EXPORT_SYMBOL_GPL(vhost_work_init);

@@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
poll->dev = dev;
poll->wqh = NULL;

- vhost_work_init(&poll->work, fn);
+ vhost_work_init(dev, &poll->work, fn);
}
EXPORT_SYMBOL_GPL(vhost_poll_init);

@@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll)
}
EXPORT_SYMBOL_GPL(vhost_poll_stop);

-static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
- unsigned seq)
+static bool vhost_work_seq_done(struct vhost_worker *worker,
+ struct vhost_work *work, unsigned seq)
{
int left;

- spin_lock_irq(&dev->work_lock);
+ spin_lock_irq(&worker->work_lock);
left = seq - work->done_seq;
- spin_unlock_irq(&dev->work_lock);
+ spin_unlock_irq(&worker->work_lock);
return left <= 0;
}

-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work)
{
unsigned seq;
int flushing;

- spin_lock_irq(&dev->work_lock);
+ spin_lock_irq(&worker->work_lock);
seq = work->queue_seq;
work->flushing++;
- spin_unlock_irq(&dev->work_lock);
- wait_event(work->done, vhost_work_seq_done(dev, work, seq));
- spin_lock_irq(&dev->work_lock);
+ spin_unlock_irq(&worker->work_lock);
+ wait_event(work->done, vhost_work_seq_done(worker, work, seq));
+ spin_lock_irq(&worker->work_lock);
flushing = --work->flushing;
- spin_unlock_irq(&dev->work_lock);
+ spin_unlock_irq(&worker->work_lock);
BUG_ON(flushing < 0);
}
EXPORT_SYMBOL_GPL(vhost_work_flush);
@@ -148,29 +155,30 @@ EXPORT_SYMBOL_GPL(vhost_work_flush);
* locks that are also used by the callback. */
void vhost_poll_flush(struct vhost_poll *poll)
{
- vhost_work_flush(poll->dev, &poll->work);
+ vhost_work_flush(poll->dev->worker, &poll->work);
}
EXPORT_SYMBOL_GPL(vhost_poll_flush);

-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+void vhost_work_queue(struct vhost_worker *worker,
+ struct vhost_work *work)
{
unsigned long flags;

- spin_lock_irqsave(&dev->work_lock, flags);
+ spin_lock_irqsave(&worker->work_lock, flags);
if (list_empty(&work->node)) {
- list_add_tail(&work->node, &dev->work_list);
+ list_add_tail(&work->node, &worker->work_list);
work->queue_seq++;
- spin_unlock_irqrestore(&dev->work_lock, flags);
- wake_up_process(dev->worker);
+ spin_unlock_irqrestore(&worker->work_lock, flags);
+ wake_up_process(worker->thread);
} else {
- spin_unlock_irqrestore(&dev->work_lock, flags);
+ spin_unlock_irqrestore(&worker->work_lock, flags);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);

void vhost_poll_queue(struct vhost_poll *poll)
{
- vhost_work_queue(poll->dev, &poll->work);
+ vhost_work_queue(poll->dev->worker, &poll->work);
}
EXPORT_SYMBOL_GPL(vhost_poll_queue);

@@ -203,19 +211,18 @@ static void vhost_vq_reset(struct vhost_dev *dev,

static int vhost_worker(void *data)
{
- struct vhost_dev *dev = data;
+ struct vhost_worker *worker = data;
struct vhost_work *work = NULL;
unsigned uninitialized_var(seq);
mm_segment_t oldfs = get_fs();

set_fs(USER_DS);
- use_mm(dev->mm);

for (;;) {
/* mb paired w/ kthread_stop */
set_current_state(TASK_INTERRUPTIBLE);

- spin_lock_irq(&dev->work_lock);
+ spin_lock_irq(&worker->work_lock);
if (work) {
work->done_seq = seq;
if (work->flushing)
@@ -223,21 +230,35 @@ static int vhost_worker(void *data)
}

if (kthread_should_stop()) {
- spin_unlock_irq(&dev->work_lock);
+ spin_unlock_irq(&worker->work_lock);
__set_current_state(TASK_RUNNING);
break;
}
- if (!list_empty(&dev->work_list)) {
- work = list_first_entry(&dev->work_list,
+ if (!list_empty(&worker->work_list)) {
+ work = list_first_entry(&worker->work_list,
struct vhost_work, node);
list_del_init(&work->node);
seq = work->queue_seq;
} else
work = NULL;
- spin_unlock_irq(&dev->work_lock);
+ spin_unlock_irq(&worker->work_lock);

if (work) {
+ struct vhost_dev *dev = work->dev;
+
__set_current_state(TASK_RUNNING);
+
+ if (current->mm != dev->mm) {
+ if (current->mm) unuse_mm(current->mm);
+ use_mm(dev->mm);
+ }
+
+ /* TODO: Consider a more elegant solution */
+ if (worker->owner != dev->owner) {
+ /* Should check for return value */
+ cgroup_attach_task_all(dev->owner, current);
+ worker->owner = dev->owner;
+ }
work->fn(work);
if (need_resched())
schedule();
@@ -245,7 +266,6 @@ static int vhost_worker(void *data)
schedule();

}
- unuse_mm(dev->mm);
set_fs(oldfs);
return 0;
}
@@ -304,9 +324,8 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->log_file = NULL;
dev->memory = NULL;
dev->mm = NULL;
- spin_lock_init(&dev->work_lock);
- INIT_LIST_HEAD(&dev->work_list);
- dev->worker = NULL;
+ dev->worker = worker;
+ dev->owner = current;

for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
@@ -331,31 +350,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_check_owner);

-struct vhost_attach_cgroups_struct {
- struct vhost_work work;
- struct task_struct *owner;
- int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
- struct vhost_attach_cgroups_struct *s;
-
- s = container_of(work, struct vhost_attach_cgroups_struct, work);
- s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
- struct vhost_attach_cgroups_struct attach;
-
- attach.owner = current;
- vhost_work_init(&attach.work, vhost_attach_cgroups_work);
- vhost_work_queue(dev, &attach.work);
- vhost_work_flush(dev, &attach.work);
- return attach.ret;
-}
-
/* Caller should have device mutex */
bool vhost_dev_has_owner(struct vhost_dev *dev)
{
@@ -366,7 +360,6 @@ EXPORT_SYMBOL_GPL(vhost_dev_has_owner);
/* Caller should have device mutex */
long vhost_dev_set_owner(struct vhost_dev *dev)
{
- struct task_struct *worker;
int err;

/* Is there an owner already? */
@@ -377,28 +370,15 @@ long vhost_dev_set_owner(struct vhost_dev *dev)

/* No owner, become one */
dev->mm = get_task_mm(current);
- worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
- if (IS_ERR(worker)) {
- err = PTR_ERR(worker);
- goto err_worker;
- }
-
dev->worker = worker;
- wake_up_process(worker); /* avoid contributing to loadavg */
-
- err = vhost_attach_cgroups(dev);
- if (err)
- goto err_cgroup;

err = vhost_dev_alloc_iovecs(dev);
if (err)
- goto err_cgroup;
+ goto err_alloc;

return 0;
-err_cgroup:
- kthread_stop(worker);
+err_alloc:
dev->worker = NULL;
-err_worker:
if (dev->mm)
mmput(dev->mm);
dev->mm = NULL;
@@ -472,11 +452,6 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
/* No one will access memory at this point */
kfree(dev->memory);
dev->memory = NULL;
- WARN_ON(!list_empty(&dev->work_list));
- if (dev->worker) {
- kthread_stop(dev->worker);
- dev->worker = NULL;
- }
if (dev->mm)
mmput(dev->mm);
dev->mm = NULL;
@@ -1567,11 +1542,32 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);

static int __init vhost_init(void)
{
+ struct vhost_worker *w =
+ kzalloc(sizeof(*w), GFP_KERNEL);
+ if (!w)
+ return -ENOMEM;
+
+ w->thread = kthread_create(vhost_worker,
+ w, "vhost-worker");
+ if (IS_ERR(w->thread))
+ return PTR_ERR(w->thread);
+
+ worker = w;
+ spin_lock_init(&worker->work_lock);
+ INIT_LIST_HEAD(&worker->work_list);
+ wake_up_process(worker->thread);
+ pr_info("Created universal thread to service requests\n");
+
return 0;
}

static void __exit vhost_exit(void)
{
+ if (worker) {
+ kthread_stop(worker->thread);
+ WARN_ON(!list_empty(&worker->work_list));
+ kfree(worker);
+ }
}

module_init(vhost_init);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8c1c792..2f204ce 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -22,6 +22,7 @@ struct vhost_work {
int flushing;
unsigned queue_seq;
unsigned done_seq;
+ struct vhost_dev *dev;
};

/* Poll a file (eventfd or socket) */
@@ -35,8 +36,8 @@ struct vhost_poll {
struct vhost_dev *dev;
};

-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+void vhost_work_init(struct vhost_dev *dev,
+ struct vhost_work *work, vhost_work_fn_t fn);

void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
unsigned long mask, struct vhost_dev *dev);
@@ -44,7 +45,6 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file);
void vhost_poll_stop(struct vhost_poll *poll);
void vhost_poll_flush(struct vhost_poll *poll);
void vhost_poll_queue(struct vhost_poll *poll);
-void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work);
long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);

struct vhost_log {
@@ -116,11 +116,22 @@ struct vhost_dev {
int nvqs;
struct file *log_file;
struct eventfd_ctx *log_ctx;
+ /* vhost shared worker */
+ struct vhost_worker *worker;
+ /* for cgroup support */
+ struct task_struct *owner;
+};
+
+struct vhost_worker {
spinlock_t work_lock;
struct list_head work_list;
- struct task_struct *worker;
+ struct task_struct *thread;
+ struct task_struct *owner;
};

+void vhost_work_queue(struct vhost_worker *worker,
+ struct vhost_work *work);
+void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work);
void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
long vhost_dev_set_owner(struct vhost_dev *dev);
bool vhost_dev_has_owner(struct vhost_dev *dev);
--
2.4.3

2015-07-13 04:08:46

by Bandan Das

Subject: [RFC PATCH 2/4] vhost: Limit the number of devices served by a single worker thread

When the number of devices increase, the universal thread model
(introduced in the preceding patch) may end up being the bottleneck.
Moreover, a single worker thread also forces us to change cgroups
based on the device we are serving.

We introduce a worker pool struct that starts with one worker
thread, and we keep adding more threads when the number of devices
served reaches a certain threshold. The default value is set at 7 but is
not based on any empirical data. The value can also be changed by
the user with the devs_per_worker module parameter.

Note that this patch doesn't change how cgroups work. We still
keep moving the worker thread around to the cgroups of the
device we are serving at the moment.

Signed-off-by: Razya Ladelsky <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
---
drivers/vhost/net.c | 6 +--
drivers/vhost/scsi.c | 3 +-
drivers/vhost/vhost.c | 135 +++++++++++++++++++++++++++++++++++++++++---------
drivers/vhost/vhost.h | 13 ++++-
4 files changed, 128 insertions(+), 29 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7d137a4..7bfa019 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -705,7 +705,8 @@ static int vhost_net_open(struct inode *inode, struct file *f)
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
}
- vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
+ if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX))
+ return dev->err;

vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
@@ -801,9 +802,6 @@ static int vhost_net_release(struct inode *inode, struct file *f)
sockfd_put(rx_sock);
/* Make sure no callbacks are outstanding */
synchronize_rcu_bh();
- /* We do an extra flush before freeing memory,
- * since jobs can re-queue themselves. */
- vhost_net_flush(n);
kfree(n->dev.vqs);
kvfree(n);
return 0;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 6c42936..97de2db 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1601,7 +1601,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
vqs[i] = &vs->vqs[i].vq;
vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
}
- vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ);
+ if (vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ))
+ return vs->dev.err;

vhost_scsi_init_inflight(vs, NULL);

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 951c96b..6a5d4c0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -27,11 +27,19 @@
#include <linux/kthread.h>
#include <linux/cgroup.h>
#include <linux/module.h>
+#include <linux/moduleparam.h>

#include "vhost.h"

-/* Just one worker thread to service all devices */
-static struct vhost_worker *worker;
+static int __read_mostly devs_per_worker = 7;
+module_param(devs_per_worker, int, S_IRUGO);
+MODULE_PARM_DESC(devs_per_worker, "Setup the number of devices being served by a worker thread");
+
+/* Only used to give a unique id to a vhost thread at the moment */
+static unsigned int total_vhost_workers;
+
+/* Pool of vhost threads */
+static struct vhost_pool *vhost_pool;

enum {
VHOST_MEMORY_MAX_NREGIONS = 64,
@@ -270,6 +278,63 @@ static int vhost_worker(void *data)
return 0;
}

+static void vhost_create_worker(struct vhost_dev *dev)
+{
+ struct vhost_worker *worker;
+ struct vhost_pool *pool = vhost_pool;
+
+ worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+ if (!worker) {
+ dev->err = -ENOMEM;
+ return;
+ }
+
+ worker->thread = kthread_create(vhost_worker,
+ worker,
+ "vhost-%d",
+ total_vhost_workers);
+ if (IS_ERR(worker->thread)) {
+ dev->err = PTR_ERR(worker->thread);
+ goto therror;
+ }
+
+ spin_lock_init(&worker->work_lock);
+ INIT_LIST_HEAD(&worker->work_list);
+ list_add(&worker->node, &pool->workers);
+ worker->owner = NULL;
+ worker->num_devices++;
+ total_vhost_workers++;
+ dev->worker = worker;
+ dev->worker_assigned = true;
+ return;
+
+therror:
+ if (!IS_ERR_OR_NULL(worker->thread))
+ kthread_stop(worker->thread);
+ kfree(worker);
+}
+
+static int vhost_dev_assign_worker(struct vhost_dev *dev)
+{
+ struct vhost_worker *worker;
+
+ mutex_lock(&vhost_pool->pool_lock);
+ list_for_each_entry(worker, &vhost_pool->workers, node) {
+ if (worker->num_devices < devs_per_worker) {
+ dev->worker = worker;
+ dev->worker_assigned = true;
+ worker->num_devices++;
+ break;
+ }
+ }
+ if (!dev->worker_assigned)
+ /* create a new worker */
+ vhost_create_worker(dev);
+ mutex_unlock(&vhost_pool->pool_lock);
+
+ return dev->err;
+}
+
static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
{
kfree(vq->indirect);
@@ -311,7 +376,7 @@ static void vhost_dev_free_iovecs(struct vhost_dev *dev)
vhost_vq_free_iovecs(dev->vqs[i]);
}

-void vhost_dev_init(struct vhost_dev *dev,
+int vhost_dev_init(struct vhost_dev *dev,
struct vhost_virtqueue **vqs, int nvqs)
{
struct vhost_virtqueue *vq;
@@ -324,9 +389,14 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->log_file = NULL;
dev->memory = NULL;
dev->mm = NULL;
- dev->worker = worker;
+ dev->worker = NULL;
+ dev->err = 0;
+ dev->worker_assigned = false;
dev->owner = current;

+ if (vhost_dev_assign_worker(dev))
+ goto done;
+
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
vq->log = NULL;
@@ -339,6 +409,9 @@ void vhost_dev_init(struct vhost_dev *dev,
vhost_poll_init(&vq->poll, vq->handle_kick,
POLLIN, dev);
}
+
+done:
+ return dev->err;
}
EXPORT_SYMBOL_GPL(vhost_dev_init);

@@ -370,7 +443,6 @@ long vhost_dev_set_owner(struct vhost_dev *dev)

/* No owner, become one */
dev->mm = get_task_mm(current);
- dev->worker = worker;

err = vhost_dev_alloc_iovecs(dev);
if (err)
@@ -424,6 +496,24 @@ void vhost_dev_stop(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_stop);

+static void vhost_deassign_worker(struct vhost_dev *dev)
+{
+ if (dev->worker) {
+ mutex_lock(&vhost_pool->pool_lock);
+ WARN_ON(dev->worker->num_devices <= 0);
+ if (!--dev->worker->num_devices) {
+ WARN_ON(!list_empty(&dev->worker->work_list));
+ list_del(&dev->worker->node);
+ kthread_stop(dev->worker->thread);
+ dev->worker->thread = NULL;
+ kfree(dev->worker);
+ }
+ mutex_unlock(&vhost_pool->pool_lock);
+ }
+
+ dev->worker = NULL;
+}
+
/* Caller should have device mutex if and only if locked is set */
void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
{
@@ -452,6 +542,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
/* No one will access memory at this point */
kfree(dev->memory);
dev->memory = NULL;
+ vhost_deassign_worker(dev);
if (dev->mm)
mmput(dev->mm);
dev->mm = NULL;
@@ -1542,31 +1633,29 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);

static int __init vhost_init(void)
{
- struct vhost_worker *w =
- kzalloc(sizeof(*w), GFP_KERNEL);
- if (!w)
+ struct vhost_pool *pool =
+ kzalloc(sizeof(*pool), GFP_KERNEL);
+ if (!pool)
return -ENOMEM;
-
- w->thread = kthread_create(vhost_worker,
- w, "vhost-worker");
- if (IS_ERR(w->thread))
- return PTR_ERR(w->thread);
-
- worker = w;
- spin_lock_init(&worker->work_lock);
- INIT_LIST_HEAD(&worker->work_list);
- wake_up_process(worker->thread);
- pr_info("Created universal thread to service requests\n");
+ mutex_init(&pool->pool_lock);
+ INIT_LIST_HEAD(&pool->workers);
+ vhost_pool = pool;

return 0;
}

static void __exit vhost_exit(void)
{
- if (worker) {
- kthread_stop(worker->thread);
- WARN_ON(!list_empty(&worker->work_list));
- kfree(worker);
+ struct vhost_worker *worker, *n;
+
+ if (vhost_pool) {
+ list_for_each_entry_safe(worker, n, &vhost_pool->workers, node) {
+ kthread_stop(worker->thread);
+ WARN_ON(!list_empty(&worker->work_list));
+ list_del(&worker->node);
+ kfree(worker);
+ }
+ kfree(vhost_pool);
}
}

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2f204ce..a45193b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -120,19 +120,30 @@ struct vhost_dev {
struct vhost_worker *worker;
/* for cgroup support */
struct task_struct *owner;
+ bool worker_assigned;
+ int err;
};

struct vhost_worker {
spinlock_t work_lock;
+ unsigned id;
struct list_head work_list;
struct task_struct *thread;
struct task_struct *owner;
+ int num_devices;
+ struct list_head node;
+};
+
+struct vhost_pool {
+ struct work_struct work;
+ struct mutex pool_lock;
+ struct list_head workers;
};

void vhost_work_queue(struct vhost_worker *worker,
struct vhost_work *work);
void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work);
-void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
+int vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
long vhost_dev_set_owner(struct vhost_dev *dev);
bool vhost_dev_has_owner(struct vhost_dev *dev);
long vhost_dev_check_owner(struct vhost_dev *);
--
2.4.3

2015-07-13 04:09:01

by Bandan Das

Subject: [RFC PATCH 3/4] cgroup: Introduce a function to compare cgroups

This function takes two tasks and iterates through all
hierarchies to check if they belong to the same cgroups.
It skips the default hierarchy and any
hierarchies with no subsystems attached. This function
will be used by the next patch to add rudimentary cgroup support
to vhost workers.

Signed-off-by: Bandan Das <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..606fb5b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -933,6 +933,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
struct task_struct *css_task_iter_next(struct css_task_iter *it);
void css_task_iter_end(struct css_task_iter *it);

+int cgroup_match_groups(struct task_struct *tsk1, struct task_struct *tsk2);
int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 469dd54..ba4121e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2465,6 +2465,46 @@ out_unlock_cgroup:
}

/**
+ * cgroup_match_groups - check if tsk1 and tsk2 belong to
+ * the same cgroups in all hierarchies
+ * Returns 0 if they match, 1 if they differ
+ */
+int cgroup_match_groups(struct task_struct *tsk1, struct task_struct *tsk2)
+{
+ struct cgroup_root *root;
+ int retval = 0;
+
+ WARN_ON(!tsk1 || !tsk2);
+
+ mutex_lock(&cgroup_mutex);
+ for_each_root(root) {
+ struct cgroup *cg_tsk1;
+ struct cgroup *cg_tsk2;
+
+ /* Default hierarchy */
+ if (root == &cgrp_dfl_root)
+ continue;
+ /* No subsystems attached */
+ if (!root->subsys_mask)
+ continue;
+
+ down_read(&css_set_rwsem);
+ cg_tsk1 = task_cgroup_from_root(tsk1, root);
+ cg_tsk2 = task_cgroup_from_root(tsk2, root);
+ up_read(&css_set_rwsem);
+
+ if (cg_tsk1 != cg_tsk2) {
+ retval = 1;
+ break;
+ }
+ }
+ mutex_unlock(&cgroup_mutex);
+
+ return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_match_groups);
+
+/**
* cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from'
* @from: attach to all cgroups of a given task
* @tsk: the task to be attached
--
2.4.3

2015-07-13 04:09:24

by Bandan Das

Subject: [RFC PATCH 4/4] vhost: Add cgroup-aware creation of worker threads

With the help of the cgroup comparison function introduced
in the previous patch, this patch changes the worker creation policy:
if the new device belongs to different cgroups than the devices an
existing worker is serving, we create a new worker thread even if we
haven't reached the devs_per_worker threshold.

Signed-off-by: Bandan Das <[email protected]>
---
drivers/vhost/vhost.c | 47 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6a5d4c0..dc0fa37 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -261,12 +261,6 @@ static int vhost_worker(void *data)
use_mm(dev->mm);
}

- /* TODO: Consider a more elegant solution */
- if (worker->owner != dev->owner) {
- /* Should check for return value */
- cgroup_attach_task_all(dev->owner, current);
- worker->owner = dev->owner;
- }
work->fn(work);
if (need_resched())
schedule();
@@ -278,6 +272,36 @@ static int vhost_worker(void *data)
return 0;
}

+struct vhost_attach_cgroups_struct {
+ struct vhost_work work;
+ struct task_struct *owner;
+ int ret;
+};
+
+static void vhost_attach_cgroups_work(struct vhost_work *work)
+{
+ struct vhost_attach_cgroups_struct *s;
+
+ s = container_of(work, struct vhost_attach_cgroups_struct, work);
+ s->ret = cgroup_attach_task_all(s->owner, current);
+}
+
+static void vhost_attach_cgroups(struct vhost_dev *dev,
+ struct vhost_worker *worker)
+{
+ struct vhost_attach_cgroups_struct attach;
+
+ attach.owner = dev->owner;
+ vhost_work_init(dev, &attach.work, vhost_attach_cgroups_work);
+ vhost_work_queue(worker, &attach.work);
+ vhost_work_flush(worker, &attach.work);
+
+ if (!attach.ret)
+ worker->owner = dev->owner;
+
+ dev->err = attach.ret;
+}
+
static void vhost_create_worker(struct vhost_dev *dev)
{
struct vhost_worker *worker;
@@ -300,8 +324,14 @@ static void vhost_create_worker(struct vhost_dev *dev)

spin_lock_init(&worker->work_lock);
INIT_LIST_HEAD(&worker->work_list);
+
+ /* attach to the cgroups of the process that created us */
+ vhost_attach_cgroups(dev, worker);
+ if (dev->err)
+ goto therror;
+ worker->owner = dev->owner;
+
list_add(&worker->node, &pool->workers);
- worker->owner = NULL;
worker->num_devices++;
total_vhost_workers++;
dev->worker = worker;
@@ -320,7 +350,8 @@ static int vhost_dev_assign_worker(struct vhost_dev *dev)

mutex_lock(&vhost_pool->pool_lock);
list_for_each_entry(worker, &vhost_pool->workers, node) {
- if (worker->num_devices < devs_per_worker) {
+ if (worker->num_devices < devs_per_worker &&
+ (!cgroup_match_groups(dev->owner, worker->owner))) {
dev->worker = worker;
dev->worker_assigned = true;
worker->num_devices++;
--
2.4.3

2015-07-27 19:48:24

by Bandan Das

Subject: Re: [RFC PATCH 0/4] Shared vhost design

Eyal Moscovici <[email protected]> writes:

> Hi,
>
> The test showed the same relative numbers as we got in our internal
> testing. I was wondering about the configuration in regards to NUMA. From
Thanks for confirming.

> our testing we saw that if the VMs are spread across 2 NUMA nodes then
> having a shared vhost thread per node performs better then having the two
> threads in the same core.

IIUC, this is similar to my test setup and observations, i.e.,
> 14* 1173.8 1216.9

In this case, there's a shared vhost thread on CPU 14 for NUMA node 0
and another on CPU 15 for NUMA node 1. Guests running on CPUs 0,2,4,6,8,10,12
are serviced by vhost-0, which runs on CPU 14, and guests running on CPUs
1,3,5,7,9,11,13 are serviced by vhost-1 (NUMA node 1). I tried some other
configurations, but this one gave me the best results.


Eyal, I think it makes sense to add polling on top of these patches and
get numbers for them too. Thoughts?

Bandan

> Eyal Moscovici
> HL-Cloud Infrastructure Solutions
> IBM Haifa Research Lab
>
>
>
> From: Bandan Das <[email protected]>
> To: [email protected]
> Cc: [email protected], [email protected],
> [email protected], Eyal Moscovici/Haifa/IBM@IBMIL, Razya
> Ladelsky/Haifa/IBM@IBMIL, [email protected], [email protected]
> Date: 07/13/2015 07:08 AM
> Subject: [RFC PATCH 0/4] Shared vhost design
>
>
>
> Hello,
>
> There have been discussions on improving the current vhost design. The first
> attempt, to my knowledge, was Shirley Ma's patch to create a dedicated vhost
> worker per cgroup.
>
> http://comments.gmane.org/gmane.linux.network/224730
>
> Later, I posted a cmwq based approach for performance comparisons
> http://comments.gmane.org/gmane.linux.network/286858
>
> More recently was the Elvis work that was presented in KVM Forum 2013
> http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf
>
> The Elvis patches rely on common vhost thread design for scalability
> along with polling for performance. Since there are two major changes
> being proposed, we decided to split up the work. The first (this RFC),
> proposing a re-design of the vhost threading model and the second part
> (not posted yet) to focus more on improving performance.
>
> I am posting this with the hope that we can have a meaningful discussion
> on the proposed new architecture. We have run some tests to show that the new
> design is scalable and, in terms of performance, is comparable to the current
> stable design.
>
> Test Setup:
> The testing is based on the setup described in the Elvis proposal.
> The initial tests are just an aggregate of Netperf STREAM and MAERTS but
> as we progress, I am happy to run more tests. The hosts are two identical
> 16 core Haswell systems with point to point network links. For the first
> 10 runs, with n=1 up to n=10 guests running in parallel, I booted the
> target system with nr_cpus=8 and mem=12G. The purpose was to do a
> comparison of resource utilization and how it affects performance.
> Finally, with the number of guests set at 14, I didn't limit the number
> of CPUs booted on the host or limit memory seen by the kernel but booted
> the kernel with isolcpus=14,15 that will be used to run the vhost
> threads. The guests are pinned to cpus 0-13 and based on which
> cpu the guest is running on, the corresponding I/O thread is either pinned
> to cpu 14 or 15.
>
> Results
> # X axis is number of guests
> # Y axis is netperf number
> # nr_cpus=8 and mem=12G
> #Number of Guests #Baseline #ELVIS
> 1 1119.3 1111.0
> 2 1135.6 1130.2
> 3 1135.5 1131.6
> 4 1136.0 1127.1
> 5 1118.6 1129.3
> 6 1123.4 1129.8
> 7 1128.7 1135.4
> 8 1129.9 1137.5
> 9 1130.6 1135.1
> 10 1129.3 1138.9
> 14* 1173.8 1216.9
>
> #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
> # I/O thread runs on CPU 14 or 15 depending on which guest it's serving
>
> There's a simple graph at
> http://people.redhat.com/~bdas/elvis/data/results.png
> that shows how task affinity results in a jump and even without it,
> as the number of guests increases, the shared vhost design performs
> slightly better.
>
> Observations:
> 1. In terms of "stock" performance, the results are comparable.
> 2. However, with a tuned setup, even without polling, we see an improvement
> with the new design.
> 3. Making the new design simulate old behavior would be a matter of setting
> the number of guests per vhost thread to 1.
> 4. Maybe, setting a per guest limit on the work being done by a specific vhost
> thread is needed for it to be fair.
> 5. cgroup associations need to be figured out. I just slightly hacked the
> current cgroup association mechanism to work with the new model. Ccing cgroups
> for input/comments.
>
> Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial
> patches, the helpful testing suggestions and discussions.
>
> Bandan Das (4):
> vhost: Introduce a universal thread to serve all users
> vhost: Limit the number of devices served by a single worker thread
> cgroup: Introduce a function to compare cgroups
> vhost: Add cgroup-aware creation of worker threads
>
> drivers/vhost/net.c | 6 +-
> drivers/vhost/scsi.c | 18 ++--
> drivers/vhost/vhost.c | 272 +++++++++++++++++++++++++++++++++++--------------
> drivers/vhost/vhost.h | 32 +++++-
> include/linux/cgroup.h | 1 +
> kernel/cgroup.c | 40 ++++++++
> 6 files changed, 275 insertions(+), 94 deletions(-)

2015-07-27 21:02:20

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 0/4] Shared vhost design

On Mon, Jul 13, 2015 at 12:07:31AM -0400, Bandan Das wrote:
> Hello,
>
> There have been discussions on improving the current vhost design. The first
> attempt, to my knowledge was Shirley Ma's patch to create a dedicated vhost
> worker per cgroup.
>
> http://comments.gmane.org/gmane.linux.network/224730
>
> Later, I posted a cmwq based approach for performance comparisons
> http://comments.gmane.org/gmane.linux.network/286858
>
> More recently was the Elvis work that was presented in KVM Forum 2013
> http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf
>
> The Elvis patches rely on common vhost thread design for scalability
> along with polling for performance. Since there are two major changes
> being proposed, we decided to split up the work. The first (this RFC),
> proposing a re-design of the vhost threading model and the second part
> (not posted yet) to focus more on improving performance.
>
> I am posting this with the hope that we can have a meaningful discussion
> on the proposed new architecture. We have run some tests to show that the new
> design is scalable and in terms of performance, is comparable to the current
> stable design.
>
> Test Setup:
> The testing is based on the setup described in the Elvis proposal.
> The initial tests are just an aggregate of Netperf STREAM and MAERTS but
> as we progress, I am happy to run more tests. The hosts are two identical
> 16 core Haswell systems with point to point network links. For the first 10 runs,
> with n=1 up to n=10 guests running in parallel, I booted the target system with nr_cpus=8
> and mem=12G. The purpose was to do a comparison of resource utilization
> and how it affects performance. Finally, with the number of guests set at 14,
> I didn't limit the number of CPUs booted on the host or limit memory seen by
> the kernel but booted the kernel with isolcpus=14,15 that will be used to run
> the vhost threads. The guests are pinned to cpus 0-13 and based on which
> cpu the guest is running on, the corresponding I/O thread is either pinned
> to cpu 14 or 15.
>
> Results
> # X axis is number of guests
> # Y axis is netperf number
> # nr_cpus=8 and mem=12G
> #Number of Guests #Baseline #ELVIS
> 1 1119.3 1111.0
> 2 1135.6 1130.2
> 3 1135.5 1131.6
> 4 1136.0 1127.1
> 5 1118.6 1129.3
> 6 1123.4 1129.8
> 7 1128.7 1135.4
> 8 1129.9 1137.5
> 9 1130.6 1135.1
> 10 1129.3 1138.9
> 14* 1173.8 1216.9

I'm a bit too busy now with 2.4 and related stuff; I will review once we
finish 2.4. But I'd like to ask two things:
- did you actually test a config where cgroups were used?
- does the design address the issue of VM 1 being blocked
(e.g. because it hits swap) and blocking VM 2?

>
> #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
> # I/O thread runs on CPU 14 or 15 depending on which guest it's serving
>
> There's a simple graph at
> http://people.redhat.com/~bdas/elvis/data/results.png
> that shows how task affinity results in a jump and even without it,
> as the number of guests increases, the shared vhost design performs
> slightly better.
>
> Observations:
> 1. In terms of "stock" performance, the results are comparable.
> 2. However, with a tuned setup, even without polling, we see an improvement
> with the new design.
> 3. Making the new design simulate old behavior would be a matter of setting
> the number of guests per vhost thread to 1.
> 4. Maybe, setting a per guest limit on the work being done by a specific vhost
> thread is needed for it to be fair.
> 5. cgroup associations need to be figured out. I just slightly hacked the
> current cgroup association mechanism to work with the new model. Ccing cgroups
> for input/comments.
>
> Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial
> patches, the helpful testing suggestions and discussions.
>
> Bandan Das (4):
> vhost: Introduce a universal thread to serve all users
> vhost: Limit the number of devices served by a single worker thread
> cgroup: Introduce a function to compare cgroups
> vhost: Add cgroup-aware creation of worker threads
>
> drivers/vhost/net.c | 6 +-
> drivers/vhost/scsi.c | 18 ++--
> drivers/vhost/vhost.c | 272 +++++++++++++++++++++++++++++++++++--------------
> drivers/vhost/vhost.h | 32 +++++-
> include/linux/cgroup.h | 1 +
> kernel/cgroup.c | 40 ++++++++
> 6 files changed, 275 insertions(+), 94 deletions(-)
>
> --
> 2.4.3

2015-07-27 21:07:29

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 0/4] Shared vhost design

On Mon, Jul 27, 2015 at 03:48:19PM -0400, Bandan Das wrote:
> Eyal Moscovici <[email protected]> writes:
>
> > Hi,
> >
> > The test showed the same relative numbers as we got in our internal
> > testing. I was wondering about the configuration in regards to NUMA. From
> Thanks for confirming.
>
> > our testing we saw that if the VMs are spread across 2 NUMA nodes then
> > having a shared vhost thread per node performs better than having the two
> > threads on the same core.
>
> IIUC, this is similar to my test setup and observations, i.e.,
> > 14* 1173.8 1216.9
>
> In this case, there's a shared vhost thread on CPU 14 for NUMA node 0
> and another on CPU 15 for NUMA node 1. Guests running on CPUs 0,2,4,6,8,10,12
> are serviced by vhost-0 that runs on CPU 14 and guests running on CPUs 1,3,5,7,9,11,13
> get serviced by vhost-1 (NUMA node 1). I tried some other configurations but
> this one gave me the best results.
>
>
> Eyal, I think it makes sense to add polling on top of these patches and
> get numbers for them too. Thoughts?
>
> Bandan

So simple polling by vhost is kind of OK for some guests, but I think to
really make it work for a reasonably wide selection of guests/workloads
you need to combine it with 1. polling the NIC - it makes no sense to me
to only poll one side of the equation; and probably 2. polling in the
guest.

2015-07-27 21:12:22

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 4/4] vhost: Add cgroup-aware creation of worker threads

On Mon, Jul 13, 2015 at 12:07:35AM -0400, Bandan Das wrote:
> With the help of the cgroup function to compare groups introduced
> in the previous patch, this changes the worker creation policy.
> If the new device belongs to different cgroups than any of the
> devices we are currently serving, we end up creating a new worker
> thread even if we haven't reached the devs_per_worker threshold.
>
> Signed-off-by: Bandan Das <[email protected]>

Would it make sense to integrate this into the workqueue mechanism somehow?
Just a thought - correctly accounting the kernel's work
on behalf of specific userspace groups might have value generally.
Or is the use case too special?
Cc Tejun for comments.

> ---
> drivers/vhost/vhost.c | 47 +++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 6a5d4c0..dc0fa37 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -261,12 +261,6 @@ static int vhost_worker(void *data)
> use_mm(dev->mm);
> }
>
> - /* TODO: Consider a more elegant solution */
> - if (worker->owner != dev->owner) {
> - /* Should check for return value */
> - cgroup_attach_task_all(dev->owner, current);
> - worker->owner = dev->owner;
> - }
> work->fn(work);
> if (need_resched())
> schedule();
> @@ -278,6 +272,36 @@ static int vhost_worker(void *data)
> return 0;
> }
>
> +struct vhost_attach_cgroups_struct {
> + struct vhost_work work;
> + struct task_struct *owner;
> + int ret;
> +};
> +
> +static void vhost_attach_cgroups_work(struct vhost_work *work)
> +{
> + struct vhost_attach_cgroups_struct *s;
> +
> + s = container_of(work, struct vhost_attach_cgroups_struct, work);
> + s->ret = cgroup_attach_task_all(s->owner, current);
> +}
> +
> +static void vhost_attach_cgroups(struct vhost_dev *dev,
> + struct vhost_worker *worker)
> +{
> + struct vhost_attach_cgroups_struct attach;
> +
> + attach.owner = dev->owner;
> + vhost_work_init(dev, &attach.work, vhost_attach_cgroups_work);
> + vhost_work_queue(worker, &attach.work);
> + vhost_work_flush(worker, &attach.work);
> +
> + if (!attach.ret)
> + worker->owner = dev->owner;
> +
> + dev->err = attach.ret;
> +}
> +
> static void vhost_create_worker(struct vhost_dev *dev)
> {
> struct vhost_worker *worker;
> @@ -300,8 +324,14 @@ static void vhost_create_worker(struct vhost_dev *dev)
>
> spin_lock_init(&worker->work_lock);
> INIT_LIST_HEAD(&worker->work_list);
> +
> + /* attach to the cgroups of the process that created us */
> + vhost_attach_cgroups(dev, worker);
> + if (dev->err)
> + goto therror;
> + worker->owner = dev->owner;
> +
> list_add(&worker->node, &pool->workers);
> - worker->owner = NULL;
> worker->num_devices++;
> total_vhost_workers++;
> dev->worker = worker;
> @@ -320,7 +350,8 @@ static int vhost_dev_assign_worker(struct vhost_dev *dev)
>
> mutex_lock(&vhost_pool->pool_lock);
> list_for_each_entry(worker, &vhost_pool->workers, node) {
> - if (worker->num_devices < devs_per_worker) {
> + if (worker->num_devices < devs_per_worker &&
> + (!cgroup_match_groups(dev->owner, worker->owner))) {
> dev->worker = worker;
> dev->worker_assigned = true;
> worker->num_devices++;
> --
> 2.4.3

2015-08-01 18:48:38

by Bandan Das

Subject: Re: [RFC PATCH 0/4] Shared vhost design

Eyal Moscovici <[email protected]> writes:
...
>
> We can start to implement polling, but I am unsure if the cgroups
> integration will be sufficient. The polling vhost thread should be
> scheduled all the time, and just moving it from one cgroup to the other
> won't be sufficient.
> I think it needs deeper integration, to the point where either we have a
> vhost thread for each cgroup or the vhost thread enforces the cgroup
> policies over its polled VM guests.

Agreed, what we have with cgroups is not sufficient. I am waiting
for Tejun et al to comment on our approach :) Michael asked whether
it's possible to integrate cgroups into workqueues, which I think is a more
generic and the preferred solution. I just don't know yet how easy/difficult
it is to implement this with the new cgroups unified hierarchy.


BTW, I am working on the numbers you had asked for. Honestly, I think the
cost of cgroups could be similar to running a vhost thread/guest since
that is how cgroups integration currently works. But it's good to have the
numbers before us.

>>
>> So simple polling by vhost is kind of ok for some guests, but I think to
>> really make it work for a reasonably wide selection of guests/workloads
>> you need to combine it with 1. polling the NIC - it makes no sense to me
>> to only poll one side of the equation; and probably 2. - polling in
>> guest.
>>
>
> I agree that we need polling on the NIC which could probably be achieved
> by using
> the polling interface introduced in kernel 3.11:
> http://lwn.net/Articles/551284/
> although I never tried using it myself.
> About your point about polling inside the guest, I think it is orthogonal
> to polling
> in the host.
>
>
> Eyal Moscovici
> HL-Cloud Infrastructure Solutions
> IBM Haifa Research Lab

2015-08-08 22:40:49

by Bandan Das

Subject: Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

Eyal Moscovici <[email protected]> writes:

> Hi,
>
> Do you know what is the overhead of switching the vhost thread from one
> cgroup to another?

I misinterpreted this question earlier. I think what you are asking is:
when the VM process is moved from one cgroup to another, what is the
overhead of moving the vhost thread to the new cgroup?

This design does not provide any hooks for the vhost thread to move to
a new cgroup. Rather, I think a better approach is to create a new vhost
thread and bind the process to it if the process is migrated to a new
cgroup. This is much less complicated, and there's a good chance that
it's impossible to migrate the vhost thread anyway, since it's serving
other guests. I will address this in v2.

> Eyal Moscovici
> HL-Cloud Infrastructure Solutions
> IBM Haifa Research Lab
>
>
>
> From: Bandan Das <[email protected]>
> To: [email protected]
> Cc: [email protected], [email protected],
> [email protected], Eyal Moscovici/Haifa/IBM@IBMIL, Razya
> Ladelsky/Haifa/IBM@IBMIL, [email protected], [email protected]
> Date: 07/13/2015 07:08 AM
> Subject: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users
>
>
>
> vhost threads are per-device, but in most cases a single thread
> is enough. This change creates a single thread that is used to
> serve all guests.
>
> However, this complicates cgroups associations. The current policy
> is to attach the per-device thread to all cgroups of the parent process
> that the device is associated with. This is no longer possible if we
> have a single thread. So, we end up moving the thread around to the
> cgroups of whichever device needs servicing. This is a very
> inefficient protocol but seems to be the only way to integrate
> cgroups support.
>
> Signed-off-by: Razya Ladelsky <[email protected]>
> Signed-off-by: Bandan Das <[email protected]>
> ---
> drivers/vhost/scsi.c | 15 +++--
> drivers/vhost/vhost.c | 150 ++++++++++++++++++++++++--------------------
> drivers/vhost/vhost.h | 19 +++++--
> 3 files changed, 97 insertions(+), 87 deletions(-)
>
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index ea32b38..6c42936 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd *cmd)
>
> llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
>
> - vhost_work_queue(&vs->dev, &vs->vs_completion_work);
> + vhost_work_queue(vs->dev.worker, &vs->vs_completion_work);
> }
>
> static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
> @@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
> }
>
> llist_add(&evt->list, &vs->vs_event_list);
> - vhost_work_queue(&vs->dev, &vs->vs_event_work);
> + vhost_work_queue(vs->dev.worker, &vs->vs_event_work);
> }
>
> static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
> @@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
> /* Flush both the vhost poll and vhost work */
> for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
> vhost_scsi_flush_vq(vs, i);
> - vhost_work_flush(&vs->dev, &vs->vs_completion_work);
> - vhost_work_flush(&vs->dev, &vs->vs_event_work);
> + vhost_work_flush(vs->dev.worker, &vs->vs_completion_work);
> + vhost_work_flush(vs->dev.worker, &vs->vs_event_work);
>
> /* Wait for all reqs issued before the flush to be finished */
> for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
> @@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
> if (!vqs)
> goto err_vqs;
>
> - vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
> - vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
> + vhost_work_init(&vs->dev, &vs->vs_completion_work,
> + vhost_scsi_complete_cmd_work);
> +
> + vhost_work_init(&vs->dev, &vs->vs_event_work,
> + vhost_scsi_evt_work);
>
> vs->vs_events_nr = 0;
> vs->vs_events_missed = false;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ee2826..951c96b 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -11,6 +11,8 @@
> * Generic code for virtio server in host kernel.
> */
>
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> #include <linux/eventfd.h>
> #include <linux/vhost.h>
> #include <linux/uio.h>
> @@ -28,6 +30,9 @@
>
> #include "vhost.h"
>
> +/* Just one worker thread to service all devices */
> +static struct vhost_worker *worker;
> +
> enum {
> VHOST_MEMORY_MAX_NREGIONS = 64,
> VHOST_MEMORY_F_LOG = 0x1,
> @@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> return 0;
> }
>
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> +void vhost_work_init(struct vhost_dev *dev,
> + struct vhost_work *work, vhost_work_fn_t fn)
> {
> INIT_LIST_HEAD(&work->node);
> work->fn = fn;
> init_waitqueue_head(&work->done);
> work->flushing = 0;
> work->queue_seq = work->done_seq = 0;
> + work->dev = dev;
> }
> EXPORT_SYMBOL_GPL(vhost_work_init);
>
> @@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll,
> vhost_work_fn_t fn,
> poll->dev = dev;
> poll->wqh = NULL;
>
> - vhost_work_init(&poll->work, fn);
> + vhost_work_init(dev, &poll->work, fn);
> }
> EXPORT_SYMBOL_GPL(vhost_poll_init);
>
> @@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll)
> }
> EXPORT_SYMBOL_GPL(vhost_poll_stop);
>
> -static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
> - unsigned seq)
> +static bool vhost_work_seq_done(struct vhost_worker *worker,
> + struct vhost_work *work, unsigned seq)
> {
> int left;
>
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(&worker->work_lock);
> left = seq - work->done_seq;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
> return left <= 0;
> }
>
> -void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
> +void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work)
> {
> unsigned seq;
> int flushing;
>
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(&worker->work_lock);
> seq = work->queue_seq;
> work->flushing++;
> - spin_unlock_irq(&dev->work_lock);
> - wait_event(work->done, vhost_work_seq_done(dev, work, seq));
> - spin_lock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
> + wait_event(work->done, vhost_work_seq_done(worker, work, seq));
> + spin_lock_irq(&worker->work_lock);
> flushing = --work->flushing;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
> BUG_ON(flushing < 0);
> }
> EXPORT_SYMBOL_GPL(vhost_work_flush);
> @@ -148,29 +155,30 @@ EXPORT_SYMBOL_GPL(vhost_work_flush);
> * locks that are also used by the callback. */
> void vhost_poll_flush(struct vhost_poll *poll)
> {
> - vhost_work_flush(poll->dev, &poll->work);
> + vhost_work_flush(poll->dev->worker, &poll->work);
> }
> EXPORT_SYMBOL_GPL(vhost_poll_flush);
>
> -void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
> +void vhost_work_queue(struct vhost_worker *worker,
> + struct vhost_work *work)
> {
> unsigned long flags;
>
> - spin_lock_irqsave(&dev->work_lock, flags);
> + spin_lock_irqsave(&worker->work_lock, flags);
> if (list_empty(&work->node)) {
> - list_add_tail(&work->node, &dev->work_list);
> + list_add_tail(&work->node, &worker->work_list);
> work->queue_seq++;
> - spin_unlock_irqrestore(&dev->work_lock, flags);
> - wake_up_process(dev->worker);
> + spin_unlock_irqrestore(&worker->work_lock, flags);
> + wake_up_process(worker->thread);
> } else {
> - spin_unlock_irqrestore(&dev->work_lock, flags);
> + spin_unlock_irqrestore(&worker->work_lock, flags);
> }
> }
> EXPORT_SYMBOL_GPL(vhost_work_queue);
>
> void vhost_poll_queue(struct vhost_poll *poll)
> {
> - vhost_work_queue(poll->dev, &poll->work);
> + vhost_work_queue(poll->dev->worker, &poll->work);
> }
> EXPORT_SYMBOL_GPL(vhost_poll_queue);
>
> @@ -203,19 +211,18 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>
> static int vhost_worker(void *data)
> {
> - struct vhost_dev *dev = data;
> + struct vhost_worker *worker = data;
> struct vhost_work *work = NULL;
> unsigned uninitialized_var(seq);
> mm_segment_t oldfs = get_fs();
>
> set_fs(USER_DS);
> - use_mm(dev->mm);
>
> for (;;) {
> /* mb paired w/ kthread_stop */
> set_current_state(TASK_INTERRUPTIBLE);
>
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(&worker->work_lock);
> if (work) {
> work->done_seq = seq;
> if (work->flushing)
> @@ -223,21 +230,35 @@ static int vhost_worker(void *data)
> }
>
> if (kthread_should_stop()) {
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
> __set_current_state(TASK_RUNNING);
> break;
> }
> - if (!list_empty(&dev->work_list)) {
> - work = list_first_entry(&dev->work_list,
> + if (!list_empty(&worker->work_list)) {
> + work = list_first_entry(&worker->work_list,
> struct vhost_work, node);
> list_del_init(&work->node);
> seq = work->queue_seq;
> } else
> work = NULL;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
>
> if (work) {
> + struct vhost_dev *dev = work->dev;
> +
> __set_current_state(TASK_RUNNING);
> +
> + if (current->mm != dev->mm) {
> + unuse_mm(current->mm);
> + use_mm(dev->mm);
> + }
> +
> + /* TODO: Consider a more elegant solution */
> + if (worker->owner != dev->owner) {
> + /* Should check for return value */
> + cgroup_attach_task_all(dev->owner, current);
> + worker->owner = dev->owner;
> + }
> work->fn(work);
> if (need_resched())
> schedule();
> @@ -245,7 +266,6 @@ static int vhost_worker(void *data)
> schedule();
>
> }
> - unuse_mm(dev->mm);
> set_fs(oldfs);
> return 0;
> }
> @@ -304,9 +324,8 @@ void vhost_dev_init(struct vhost_dev *dev,
> dev->log_file = NULL;
> dev->memory = NULL;
> dev->mm = NULL;
> - spin_lock_init(&dev->work_lock);
> - INIT_LIST_HEAD(&dev->work_list);
> - dev->worker = NULL;
> + dev->worker = worker;
> + dev->owner = current;
>
> for (i = 0; i < dev->nvqs; ++i) {
> vq = dev->vqs[i];
> @@ -331,31 +350,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
> }
> EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
>
> -struct vhost_attach_cgroups_struct {
> - struct vhost_work work;
> - struct task_struct *owner;
> - int ret;
> -};
> -
> -static void vhost_attach_cgroups_work(struct vhost_work *work)
> -{
> - struct vhost_attach_cgroups_struct *s;
> -
> - s = container_of(work, struct vhost_attach_cgroups_struct, work);
> - s->ret = cgroup_attach_task_all(s->owner, current);
> -}
> -
> -static int vhost_attach_cgroups(struct vhost_dev *dev)
> -{
> - struct vhost_attach_cgroups_struct attach;
> -
> - attach.owner = current;
> - vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> - vhost_work_queue(dev, &attach.work);
> - vhost_work_flush(dev, &attach.work);
> - return attach.ret;
> -}
> -
> /* Caller should have device mutex */
> bool vhost_dev_has_owner(struct vhost_dev *dev)
> {
> @@ -366,7 +360,6 @@ EXPORT_SYMBOL_GPL(vhost_dev_has_owner);
> /* Caller should have device mutex */
> long vhost_dev_set_owner(struct vhost_dev *dev)
> {
> - struct task_struct *worker;
> int err;
>
> /* Is there an owner already? */
> @@ -377,28 +370,15 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>
> /* No owner, become one */
> dev->mm = get_task_mm(current);
> - worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
> - if (IS_ERR(worker)) {
> - err = PTR_ERR(worker);
> - goto err_worker;
> - }
> -
> dev->worker = worker;
> - wake_up_process(worker); /* avoid contributing to loadavg */
> -
> - err = vhost_attach_cgroups(dev);
> - if (err)
> - goto err_cgroup;
>
> err = vhost_dev_alloc_iovecs(dev);
> if (err)
> - goto err_cgroup;
> + goto err_alloc;
>
> return 0;
> -err_cgroup:
> - kthread_stop(worker);
> +err_alloc:
> dev->worker = NULL;
> -err_worker:
> if (dev->mm)
> mmput(dev->mm);
> dev->mm = NULL;
> @@ -472,11 +452,6 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
> /* No one will access memory at this point */
> kfree(dev->memory);
> dev->memory = NULL;
> - WARN_ON(!list_empty(&dev->work_list));
> - if (dev->worker) {
> - kthread_stop(dev->worker);
> - dev->worker = NULL;
> - }
> if (dev->mm)
> mmput(dev->mm);
> dev->mm = NULL;
> @@ -1567,11 +1542,32 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
>
> static int __init vhost_init(void)
> {
> + struct vhost_worker *w =
> + kzalloc(sizeof(*w), GFP_KERNEL);
> + if (!w)
> + return -ENOMEM;
> +
> + w->thread = kthread_create(vhost_worker,
> + w, "vhost-worker");
> + if (IS_ERR(w->thread))
> + return PTR_ERR(w->thread);
> +
> + worker = w;
> + spin_lock_init(&worker->work_lock);
> + INIT_LIST_HEAD(&worker->work_list);
> + wake_up_process(worker->thread);
> + pr_info("Created universal thread to service requests\n");
> +
> return 0;
> }
>
> static void __exit vhost_exit(void)
> {
> + if (worker) {
> + kthread_stop(worker->thread);
> + WARN_ON(!list_empty(&worker->work_list));
> + kfree(worker);
> + }
> }
>
> module_init(vhost_init);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 8c1c792..2f204ce 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -22,6 +22,7 @@ struct vhost_work {
> int flushing;
> unsigned queue_seq;
> unsigned done_seq;
> + struct vhost_dev *dev;
> };
>
> /* Poll a file (eventfd or socket) */
> @@ -35,8 +36,8 @@ struct vhost_poll {
> struct vhost_dev *dev;
> };
>
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> -void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
> +void vhost_work_init(struct vhost_dev *dev,
> + struct vhost_work *work, vhost_work_fn_t fn);
>
> void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> unsigned long mask, struct vhost_dev *dev);
> @@ -44,7 +45,6 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file);
> void vhost_poll_stop(struct vhost_poll *poll);
> void vhost_poll_flush(struct vhost_poll *poll);
> void vhost_poll_queue(struct vhost_poll *poll);
> -void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work);
> long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
>
> struct vhost_log {
> @@ -116,11 +116,22 @@ struct vhost_dev {
> int nvqs;
> struct file *log_file;
> struct eventfd_ctx *log_ctx;
> + /* vhost shared worker */
> + struct vhost_worker *worker;
> + /* for cgroup support */
> + struct task_struct *owner;
> +};
> +
> +struct vhost_worker {
> spinlock_t work_lock;
> struct list_head work_list;
> - struct task_struct *worker;
> + struct task_struct *thread;
> + struct task_struct *owner;
> };
>
> +void vhost_work_queue(struct vhost_worker *worker,
> + struct vhost_work *work);
> +void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work);
> void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> long vhost_dev_set_owner(struct vhost_dev *dev);
> bool vhost_dev_has_owner(struct vhost_dev *dev);

2015-08-08 23:06:43

by Bandan Das

Subject: Re: [RFC PATCH 0/4] Shared vhost design

Hi Michael,

"Michael S. Tsirkin" <[email protected]> writes:

> On Mon, Jul 13, 2015 at 12:07:31AM -0400, Bandan Das wrote:
>> Hello,
>>
>> There have been discussions on improving the current vhost design. The first
>> attempt, to my knowledge was Shirley Ma's patch to create a dedicated vhost
>> worker per cgroup.
>>
>> http://comments.gmane.org/gmane.linux.network/224730
>>
>> Later, I posted a cmwq based approach for performance comparisons
>> http://comments.gmane.org/gmane.linux.network/286858
>>
>> More recently was the Elvis work that was presented in KVM Forum 2013
>> http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf
>>
>> The Elvis patches rely on common vhost thread design for scalability
>> along with polling for performance. Since there are two major changes
>> being proposed, we decided to split up the work. The first (this RFC),
>> proposing a re-design of the vhost threading model and the second part
>> (not posted yet) to focus more on improving performance.
>>
>> I am posting this with the hope that we can have a meaningful discussion
>> on the proposed new architecture. We have run some tests to show that the new
>> design is scalable and in terms of performance, is comparable to the current
>> stable design.
>>
>> Test Setup:
>> The testing is based on the setup described in the Elvis proposal.
>> The initial tests are just an aggregate of Netperf STREAM and MAERTS but
>> as we progress, I am happy to run more tests. The hosts are two identical
>> 16 core Haswell systems with point to point network links. For the first 10 runs,
>> with n=1 up to n=10 guests running in parallel, I booted the target system with nr_cpus=8
>> and mem=12G. The purpose was to do a comparison of resource utilization
>> and how it affects performance. Finally, with the number of guests set at 14,
>> I didn't limit the number of CPUs booted on the host or limit memory seen by
>> the kernel but boot the kernel with isolcpus=14,15 that will be used to run
>> the vhost threads. The guests are pinned to cpus 0-13 and based on which
>> cpu the guest is running on, the corresponding I/O thread is either pinned
>> to cpu 14 or 15.
>> Results
>> # X axis is number of guests
>> # Y axis is netperf number
>> # nr_cpus=8 and mem=12G
>> #Number of Guests #Baseline #ELVIS
>> 1 1119.3 1111.0
>> 2 1135.6 1130.2
>> 3 1135.5 1131.6
>> 4 1136.0 1127.1
>> 5 1118.6 1129.3
>> 6 1123.4 1129.8
>> 7 1128.7 1135.4
>> 8 1129.9 1137.5
>> 9 1130.6 1135.1
>> 10 1129.3 1138.9
>> 14* 1173.8 1216.9
>
> I'm a bit too busy now, with 2.4 and related stuff, will review once we
> finish 2.4. But I'd like to ask two things:
> - did you actually test a config where cgroups were used?

Here are some numbers with a simple cgroup setup.

Three cgroups with cpusets: cpu=0,2,4 for cgroup1, cpu=1,3,5 for cgroup2, and cpu=6,7
for cgroup3 (even though cpus 6 and 7 are on different NUMA nodes).

I ran netperf for 1 to 9 guests, assigning the first guest
to cgroup1, the second to cgroup2, the third to cgroup3, and repeating this sequence
up to 9 guests.

The numbers - (TCP_STREAM + TCP_MAERTS)/2

#Number of Guests #ELVIS (Mbps)
1 1056.9
2 1122.5
3 1122.8
4 1123.2
5 1122.6
6 1110.3
7 1116.3
8 1121.8
9 1118.5

Maybe my cgroup setup was too simple, but these numbers are comparable
to the no-cgroups results above. I wrote some tracing code to trace
cgroup_match_groups() and measure the cgroup search overhead, but it seemed
unnecessary for this particular test.


> - does the design address the issue of VM 1 being blocked
> (e.g. because it hits swap) and blocking VM 2?
Good question. I haven't thought of this yet. But IIUC,
the worker thread will complete VM1's job and then move on to
executing VM2's scheduled work. It doesn't matter if VM1 is
blocked currently. I think it would be a problem though if/when
polling is introduced.

>>
>> #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
>> # I/O thread runs on CPU 14 or 15 depending on which guest it's serving
>>
>> There's a simple graph at
>> http://people.redhat.com/~bdas/elvis/data/results.png
>> that shows how task affinity results in a jump and even without it,
>> as the number of guests increases, the shared vhost design performs
>> slightly better.
>>
>> Observations:
>> 1. In terms of "stock" performance, the results are comparable.
>> 2. However, with a tuned setup, even without polling, we see an improvement
>> with the new design.
>> 3. Making the new design simulate old behavior would be a matter of setting
>> the number of guests per vhost threads to 1.
>> 4. Maybe, setting a per guest limit on the work being done by a specific vhost
>> thread is needed for it to be fair.
>> 5. cgroup associations need to be figured out. I just slightly hacked the
>> current cgroup association mechanism to work with the new model. Ccing cgroups
>> for input/comments.
>>
>> Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial
>> patches, the helpful testing suggestions and discussions.
>>
>> Bandan Das (4):
>> vhost: Introduce a universal thread to serve all users
>> vhost: Limit the number of devices served by a single worker thread
>> cgroup: Introduce a function to compare cgroups
>> vhost: Add cgroup-aware creation of worker threads
>>
>> drivers/vhost/net.c | 6 +-
>> drivers/vhost/scsi.c | 18 ++--
>> drivers/vhost/vhost.c | 272 +++++++++++++++++++++++++++++++++++--------------
>> drivers/vhost/vhost.h | 32 +++++-
>> include/linux/cgroup.h | 1 +
>> kernel/cgroup.c | 40 ++++++++
>> 6 files changed, 275 insertions(+), 94 deletions(-)
>>
>> --
>> 2.4.3

2015-08-09 12:45:54

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Shared vhost design

On Sat, Aug 08, 2015 at 07:06:38PM -0400, Bandan Das wrote:
> Hi Michael,
>
> "Michael S. Tsirkin" <[email protected]> writes:
>
> > On Mon, Jul 13, 2015 at 12:07:31AM -0400, Bandan Das wrote:
> >> Hello,
> >>
> >> There have been discussions on improving the current vhost design. The first
> >> attempt, to my knowledge was Shirley Ma's patch to create a dedicated vhost
> >> worker per cgroup.
> >>
> >> http://comments.gmane.org/gmane.linux.network/224730
> >>
> >> Later, I posted a cmwq based approach for performance comparisons
> >> http://comments.gmane.org/gmane.linux.network/286858
> >>
> >> More recently was the Elvis work that was presented in KVM Forum 2013
> >> http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf
> >>
> >> The Elvis patches rely on common vhost thread design for scalability
> >> along with polling for performance. Since there are two major changes
> >> being proposed, we decided to split up the work. The first (this RFC),
> >> proposing a re-design of the vhost threading model and the second part
> >> (not posted yet) to focus more on improving performance.
> >>
> >> I am posting this with the hope that we can have a meaningful discussion
> >> on the proposed new architecture. We have run some tests to show that the new
> >> design is scalable and in terms of performance, is comparable to the current
> >> stable design.
> >>
> >> Test Setup:
> >> The testing is based on the setup described in the Elvis proposal.
> >> The initial tests are just an aggregate of Netperf STREAM and MAERTS but
> >> as we progress, I am happy to run more tests. The hosts are two identical
> >> 16 core Haswell systems with point to point network links. For the first 10 runs,
> >> with n=1 up to n=10 guests running in parallel, I booted the target system with nr_cpus=8
> >> and mem=12G. The purpose was to do a comparison of resource utilization
> >> and how it affects performance. Finally, with the number of guests set at 14,
> >> I didn't limit the number of CPUs booted on the host or limit memory seen by
> >> the kernel but boot the kernel with isolcpus=14,15 that will be used to run
> >> the vhost threads. The guests are pinned to cpus 0-13 and based on which
> >> cpu the guest is running on, the corresponding I/O thread is either pinned
> >> to cpu 14 or 15.
> >> Results
> >> # X axis is number of guests
> >> # Y axis is netperf number
> >> # nr_cpus=8 and mem=12G
> >> #Number of Guests #Baseline #ELVIS
> >> 1 1119.3 1111.0
> >> 2 1135.6 1130.2
> >> 3 1135.5 1131.6
> >> 4 1136.0 1127.1
> >> 5 1118.6 1129.3
> >> 6 1123.4 1129.8
> >> 7 1128.7 1135.4
> >> 8 1129.9 1137.5
> >> 9 1130.6 1135.1
> >> 10 1129.3 1138.9
> >> 14* 1173.8 1216.9
> >
> > I'm a bit too busy now, with 2.4 and related stuff, will review once we
> > finish 2.4. But I'd like to ask two things:
> > - did you actually test a config where cgroups were used?
>
> Here are some numbers with a simple cgroup setup.
>
> Three cgroups with cpusets cpu=0,2,4 for cgroup1, cpu=1,3,5 for cgroup2 and cpu=6,7
> for cgroup3 (even though 6,7 have different numa nodes)
>
> I run netperf for 1 to 9 guests starting with assigning the first guest
> to cgroup1, second to cgroup2, third to cgroup3 and repeat this sequence
> up to 9 guests.
>
> The numbers - (TCP_STREAM + TCP_MAERTS)/2
>
> #Number of Guests #ELVIS (Mbps)
> 1 1056.9
> 2 1122.5
> 3 1122.8
> 4 1123.2
> 5 1122.6
> 6 1110.3
> 7 1116.3
> 8 1121.8
> 9 1118.5
>
> Maybe, my cgroup setup was too simple but these numbers are comparable
> to the no cgroups results above. I wrote some tracing code to trace
> cgroup_match_groups() and find cgroup search overhead but it seemed
> unnecessary for this particular test.
>
>
> > - does the design address the issue of VM 1 being blocked
> > (e.g. because it hits swap) and blocking VM 2?
> Good question. I haven't thought of this yet. But IIUC,
> the worker thread will complete VM1's job and then move on to
> executing VM2's scheduled work.
> It doesn't matter if VM1 is
> blocked currently. I think it would be a problem though if/when
> polling is introduced.

Sorry, I wasn't clear. If VM1's memory is in swap, attempts to
access it might block the service thread, so it won't
complete VM2's job.



>
> >>
> >> #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
> >> # I/O thread runs on CPU 14 or 15 depending on which guest it's serving
> >>
> >> There's a simple graph at
> >> http://people.redhat.com/~bdas/elvis/data/results.png
> >> that shows how task affinity results in a jump and even without it,
> >> as the number of guests increases, the shared vhost design performs
> >> slightly better.
> >>
> >> Observations:
> >> 1. In terms of "stock" performance, the results are comparable.
> >> 2. However, with a tuned setup, even without polling, we see an improvement
> >> with the new design.
> >> 3. Making the new design simulate old behavior would be a matter of setting
> >> the number of guests per vhost threads to 1.
> >> 4. Maybe, setting a per guest limit on the work being done by a specific vhost
> >> thread is needed for it to be fair.
> >> 5. cgroup associations need to be figured out. I just slightly hacked the
> >> current cgroup association mechanism to work with the new model. Ccing cgroups
> >> for input/comments.
> >>
> >> Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial
> >> patches, the helpful testing suggestions and discussions.
> >>
> >> Bandan Das (4):
> >> vhost: Introduce a universal thread to serve all users
> >> vhost: Limit the number of devices served by a single worker thread
> >> cgroup: Introduce a function to compare cgroups
> >> vhost: Add cgroup-aware creation of worker threads
> >>
> >> drivers/vhost/net.c | 6 +-
> >> drivers/vhost/scsi.c | 18 ++--
> >> drivers/vhost/vhost.c | 272 +++++++++++++++++++++++++++++++++++--------------
> >> drivers/vhost/vhost.h | 32 +++++-
> >> include/linux/cgroup.h | 1 +
> >> kernel/cgroup.c | 40 ++++++++
> >> 6 files changed, 275 insertions(+), 94 deletions(-)
> >>
> >> --
> >> 2.4.3

2015-08-09 15:41:05

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Shared vhost design

On Sun, Aug 09, 2015 at 05:57:53PM +0300, Eyal Moscovici wrote:
> Eyal Moscovici
> HL-Cloud Infrastructure Solutions
> IBM Haifa Research Lab
>
> "Michael S. Tsirkin" <[email protected]> wrote on 08/09/2015 03:45:47 PM:
>
> > From: "Michael S. Tsirkin" <[email protected]>
> > To: Bandan Das <[email protected]>
> > Cc: [email protected], [email protected], [email protected],
> > Eyal Moscovici/Haifa/IBM@IBMIL, Razya Ladelsky/Haifa/IBM@IBMIL,
> > [email protected], [email protected]
> > Date: 08/09/2015 03:46 PM
> > Subject: Re: [RFC PATCH 0/4] Shared vhost design
> >
> > On Sat, Aug 08, 2015 at 07:06:38PM -0400, Bandan Das wrote:
> > > Hi Michael,
> > >
> > > "Michael S. Tsirkin" <[email protected]> writes:
> > >
> > > > On Mon, Jul 13, 2015 at 12:07:31AM -0400, Bandan Das wrote:
> > > >> Hello,
> > > >>
> > > >> There have been discussions on improving the current vhost design. The first
> > > >> attempt, to my knowledge was Shirley Ma's patch to create a dedicated vhost
> > > >> worker per cgroup.
> > > >>
> > > >> http://comments.gmane.org/gmane.linux.network/224730
> > > >>
> > > >> Later, I posted a cmwq based approach for performance comparisons
> > > >> http://comments.gmane.org/gmane.linux.network/286858
> > > >>
> > > >> More recently was the Elvis work that was presented in KVM Forum 2013
> > > >> http://www.linux-kvm.org/images/a/a3/Kvm-forum-2013-elvis.pdf
> > > >>
> > > >> The Elvis patches rely on common vhost thread design for scalability
> > > >> along with polling for performance. Since there are two major changes
> > > >> being proposed, we decided to split up the work. The first (this RFC),
> > > >> proposing a re-design of the vhost threading model and the second part
> > > >> (not posted yet) to focus more on improving performance.
> > > >>
> > > >> I am posting this with the hope that we can have a meaningful discussion
> > > >> on the proposed new architecture. We have run some tests to show that the new
> > > >> design is scalable and in terms of performance, is comparable to the current
> > > >> stable design.
> > > >>
> > > >> Test Setup:
> > > >> The testing is based on the setup described in the Elvis proposal.
> > > >> The initial tests are just an aggregate of Netperf STREAM and MAERTS but
> > > >> as we progress, I am happy to run more tests. The hosts are two identical
> > > >> 16 core Haswell systems with point to point network links. For the first 10 runs,
> > > >> with n=1 up to n=10 guests running in parallel, I booted the target system with nr_cpus=8
> > > >> and mem=12G. The purpose was to do a comparison of resource utilization
> > > >> and how it affects performance. Finally, with the number of guests set at 14,
> > > >> I didn't limit the number of CPUs booted on the host or limit memory seen by
> > > >> the kernel but boot the kernel with isolcpus=14,15 that will be used to run
> > > >> the vhost threads. The guests are pinned to cpus 0-13 and based on which
> > > >> cpu the guest is running on, the corresponding I/O thread is either pinned
> > > >> to cpu 14 or 15.
> > > >> Results
> > > >> # X axis is number of guests
> > > >> # Y axis is netperf number
> > > >> # nr_cpus=8 and mem=12G
> > > >> #Number of Guests #Baseline #ELVIS
> > > >> 1 1119.3 1111.0
> > > >> 2 1135.6 1130.2
> > > >> 3 1135.5 1131.6
> > > >> 4 1136.0 1127.1
> > > >> 5 1118.6 1129.3
> > > >> 6 1123.4 1129.8
> > > >> 7 1128.7 1135.4
> > > >> 8 1129.9 1137.5
> > > >> 9 1130.6 1135.1
> > > >> 10 1129.3 1138.9
> > > >> 14* 1173.8 1216.9
> > > >
> > > > I'm a bit too busy now, with 2.4 and related stuff, will review once we
> > > > finish 2.4. But I'd like to ask two things:
> > > > - did you actually test a config where cgroups were used?
> > >
> > > Here are some numbers with a simple cgroup setup.
> > >
> > > Three cgroups with cpusets cpu=0,2,4 for cgroup1, cpu=1,3,5 for cgroup2 and cpu=6,7
> > > for cgroup3 (even though 6,7 have different numa nodes)
> > >
> > > I run netperf for 1 to 9 guests starting with assigning the first guest
> > > to cgroup1, second to cgroup2, third to cgroup3 and repeat this sequence
> > > up to 9 guests.
> > >
> > > The numbers - (TCP_STREAM + TCP_MAERTS)/2
> > >
> > > #Number of Guests #ELVIS (Mbps)
> > > 1 1056.9
> > > 2 1122.5
> > > 3 1122.8
> > > 4 1123.2
> > > 5 1122.6
> > > 6 1110.3
> > > 7 1116.3
> > > 8 1121.8
> > > 9 1118.5
> > >
> > > Maybe, my cgroup setup was too simple but these numbers are comparable
> > > to the no cgroups results above. I wrote some tracing code to trace
> > > cgroup_match_groups() and find cgroup search overhead but it seemed
> > > unnecessary for this particular test.
> > >
> > >
> > > > - does the design address the issue of VM 1 being blocked
> > > > (e.g. because it hits swap) and blocking VM 2?
> > > Good question. I haven't thought of this yet. But IIUC,
> > > the worker thread will complete VM1's job and then move on to
> > > executing VM2's scheduled work.
> > > It doesn't matter if VM1 is
> > > blocked currently. I think it would be a problem though if/when
> > > polling is introduced.
> >
> > Sorry, I wasn't clear. If VM1's memory is in swap, attempts to
> > access it might block the service thread, so it won't
> > complete VM2's job.
> >
>
> We are not talking about correctness, only about performance issues. In this
> case, if the VM is swapped out, you are most likely in a state of memory pressure.
> Aren't the performance effects of swapping in only the specific vring pages
> negligible compared to the performance effects of the memory pressure itself?

VM1 is under pressure, but VM2 might not be.

> >
> >
> > >
> > > >>
> > > >> #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
> > > >> # I/O thread runs on CPU 14 or 15 depending on which guest it's serving
> > > >>
> > > >> There's a simple graph at
> > > >> http://people.redhat.com/~bdas/elvis/data/results.png
> > > >> that shows how task affinity results in a jump and even without it,
> > > >> as the number of guests increases, the shared vhost design performs
> > > >> slightly better.
> > > >>
> > > >> Observations:
> > > >> 1. In terms of "stock" performance, the results are comparable.
> > > >> 2. However, with a tuned setup, even without polling, we see an improvement
> > > >> with the new design.
> > > >> 3. Making the new design simulate old behavior would be a matter of setting
> > > >> the number of guests per vhost threads to 1.
> > > >> 4. Maybe, setting a per guest limit on the work being done by a specific vhost
> > > >> thread is needed for it to be fair.
> > > >> 5. cgroup associations need to be figured out. I just slightly hacked the
> > > >> current cgroup association mechanism to work with the new model. Ccing cgroups
> > > >> for input/comments.
> > > >>
> > > >> Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial
> > > >> patches, the helpful testing suggestions and discussions.
> > > >>
> > > >> Bandan Das (4):
> > > >> vhost: Introduce a universal thread to serve all users
> > > >> vhost: Limit the number of devices served by a single worker thread
> > > >> cgroup: Introduce a function to compare cgroups
> > > >> vhost: Add cgroup-aware creation of worker threads
> > > >>
> > > >> drivers/vhost/net.c | 6 +-
> > > >> drivers/vhost/scsi.c | 18 ++--
> > > >> drivers/vhost/vhost.c | 272 +++++++++++++++++++++++++++++++++++--------------
> > > >> drivers/vhost/vhost.h | 32 +++++-
> > > >> include/linux/cgroup.h | 1 +
> > > >> kernel/cgroup.c | 40 ++++++++
> > > >> 6 files changed, 275 insertions(+), 94 deletions(-)
> > > >>
> > > >> --
> > > >> 2.4.3
> >

2015-08-10 09:28:03

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote:
> vhost threads are per-device, but in most cases a single thread
> is enough. This change creates a single thread that is used to
> serve all guests.
>
> However, this complicates cgroups associations. The current policy
> is to attach the per-device thread to all cgroups of the parent process
> that the device is associated it. This is no longer possible if we
> have a single thread. So, we end up moving the thread around to
> cgroups of whichever device that needs servicing. This is a very
> inefficient protocol but seems to be the only way to integrate
> cgroups support.
>
> Signed-off-by: Razya Ladelsky <[email protected]>
> Signed-off-by: Bandan Das <[email protected]>

BTW, how does this interact with virtio net MQ?
It would seem that MQ gains from more parallelism and
CPU locality.

> ---
> drivers/vhost/scsi.c | 15 +++--
> drivers/vhost/vhost.c | 150 ++++++++++++++++++++++++--------------------------
> drivers/vhost/vhost.h | 19 +++++--
> 3 files changed, 97 insertions(+), 87 deletions(-)
>
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index ea32b38..6c42936 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd *cmd)
>
> llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
>
> - vhost_work_queue(&vs->dev, &vs->vs_completion_work);
> + vhost_work_queue(vs->dev.worker, &vs->vs_completion_work);
> }
>
> static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
> @@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
> }
>
> llist_add(&evt->list, &vs->vs_event_list);
> - vhost_work_queue(&vs->dev, &vs->vs_event_work);
> + vhost_work_queue(vs->dev.worker, &vs->vs_event_work);
> }
>
> static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
> @@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
> /* Flush both the vhost poll and vhost work */
> for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
> vhost_scsi_flush_vq(vs, i);
> - vhost_work_flush(&vs->dev, &vs->vs_completion_work);
> - vhost_work_flush(&vs->dev, &vs->vs_event_work);
> + vhost_work_flush(vs->dev.worker, &vs->vs_completion_work);
> + vhost_work_flush(vs->dev.worker, &vs->vs_event_work);
>
> /* Wait for all reqs issued before the flush to be finished */
> for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
> @@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
> if (!vqs)
> goto err_vqs;
>
> - vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
> - vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
> + vhost_work_init(&vs->dev, &vs->vs_completion_work,
> + vhost_scsi_complete_cmd_work);
> +
> + vhost_work_init(&vs->dev, &vs->vs_event_work,
> + vhost_scsi_evt_work);
>
> vs->vs_events_nr = 0;
> vs->vs_events_missed = false;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ee2826..951c96b 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -11,6 +11,8 @@
> * Generic code for virtio server in host kernel.
> */
>
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> #include <linux/eventfd.h>
> #include <linux/vhost.h>
> #include <linux/uio.h>
> @@ -28,6 +30,9 @@
>
> #include "vhost.h"
>
> +/* Just one worker thread to service all devices */
> +static struct vhost_worker *worker;
> +
> enum {
> VHOST_MEMORY_MAX_NREGIONS = 64,
> VHOST_MEMORY_F_LOG = 0x1,
> @@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> return 0;
> }
>
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> +void vhost_work_init(struct vhost_dev *dev,
> + struct vhost_work *work, vhost_work_fn_t fn)
> {
> INIT_LIST_HEAD(&work->node);
> work->fn = fn;
> init_waitqueue_head(&work->done);
> work->flushing = 0;
> work->queue_seq = work->done_seq = 0;
> + work->dev = dev;
> }
> EXPORT_SYMBOL_GPL(vhost_work_init);
>
> @@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> poll->dev = dev;
> poll->wqh = NULL;
>
> - vhost_work_init(&poll->work, fn);
> + vhost_work_init(dev, &poll->work, fn);
> }
> EXPORT_SYMBOL_GPL(vhost_poll_init);
>
> @@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll)
> }
> EXPORT_SYMBOL_GPL(vhost_poll_stop);
>
> -static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
> - unsigned seq)
> +static bool vhost_work_seq_done(struct vhost_worker *worker,
> + struct vhost_work *work, unsigned seq)
> {
> int left;
>
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(&worker->work_lock);
> left = seq - work->done_seq;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
> return left <= 0;
> }
>
> -void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
> +void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work)
> {
> unsigned seq;
> int flushing;
>
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(&worker->work_lock);
> seq = work->queue_seq;
> work->flushing++;
> - spin_unlock_irq(&dev->work_lock);
> - wait_event(work->done, vhost_work_seq_done(dev, work, seq));
> - spin_lock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
> + wait_event(work->done, vhost_work_seq_done(worker, work, seq));
> + spin_lock_irq(&worker->work_lock);
> flushing = --work->flushing;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
> BUG_ON(flushing < 0);
> }
> EXPORT_SYMBOL_GPL(vhost_work_flush);
> @@ -148,29 +155,30 @@ EXPORT_SYMBOL_GPL(vhost_work_flush);
> * locks that are also used by the callback. */
> void vhost_poll_flush(struct vhost_poll *poll)
> {
> - vhost_work_flush(poll->dev, &poll->work);
> + vhost_work_flush(poll->dev->worker, &poll->work);
> }
> EXPORT_SYMBOL_GPL(vhost_poll_flush);
>
> -void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
> +void vhost_work_queue(struct vhost_worker *worker,
> + struct vhost_work *work)
> {
> unsigned long flags;
>
> - spin_lock_irqsave(&dev->work_lock, flags);
> + spin_lock_irqsave(&worker->work_lock, flags);
> if (list_empty(&work->node)) {
> - list_add_tail(&work->node, &dev->work_list);
> + list_add_tail(&work->node, &worker->work_list);
> work->queue_seq++;
> - spin_unlock_irqrestore(&dev->work_lock, flags);
> - wake_up_process(dev->worker);
> + spin_unlock_irqrestore(&worker->work_lock, flags);
> + wake_up_process(worker->thread);
> } else {
> - spin_unlock_irqrestore(&dev->work_lock, flags);
> + spin_unlock_irqrestore(&worker->work_lock, flags);
> }
> }
> EXPORT_SYMBOL_GPL(vhost_work_queue);
>
> void vhost_poll_queue(struct vhost_poll *poll)
> {
> - vhost_work_queue(poll->dev, &poll->work);
> + vhost_work_queue(poll->dev->worker, &poll->work);
> }
> EXPORT_SYMBOL_GPL(vhost_poll_queue);
>
> @@ -203,19 +211,18 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>
> static int vhost_worker(void *data)
> {
> - struct vhost_dev *dev = data;
> + struct vhost_worker *worker = data;
> struct vhost_work *work = NULL;
> unsigned uninitialized_var(seq);
> mm_segment_t oldfs = get_fs();
>
> set_fs(USER_DS);
> - use_mm(dev->mm);
>
> for (;;) {
> /* mb paired w/ kthread_stop */
> set_current_state(TASK_INTERRUPTIBLE);
>
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(&worker->work_lock);
> if (work) {
> work->done_seq = seq;
> if (work->flushing)
> @@ -223,21 +230,35 @@ static int vhost_worker(void *data)
> }
>
> if (kthread_should_stop()) {
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
> __set_current_state(TASK_RUNNING);
> break;
> }
> - if (!list_empty(&dev->work_list)) {
> - work = list_first_entry(&dev->work_list,
> + if (!list_empty(&worker->work_list)) {
> + work = list_first_entry(&worker->work_list,
> struct vhost_work, node);
> list_del_init(&work->node);
> seq = work->queue_seq;
> } else
> work = NULL;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(&worker->work_lock);
>
> if (work) {
> + struct vhost_dev *dev = work->dev;
> +
> __set_current_state(TASK_RUNNING);
> +
> + if (current->mm != dev->mm) {
> + unuse_mm(current->mm);
> + use_mm(dev->mm);
> + }
> +
> + /* TODO: Consider a more elegant solution */
> + if (worker->owner != dev->owner) {
> + /* Should check for return value */
> + cgroup_attach_task_all(dev->owner, current);
> + worker->owner = dev->owner;
> + }
> work->fn(work);
> if (need_resched())
> schedule();
> @@ -245,7 +266,6 @@ static int vhost_worker(void *data)
> schedule();
>
> }
> - unuse_mm(dev->mm);
> set_fs(oldfs);
> return 0;
> }
> @@ -304,9 +324,8 @@ void vhost_dev_init(struct vhost_dev *dev,
> dev->log_file = NULL;
> dev->memory = NULL;
> dev->mm = NULL;
> - spin_lock_init(&dev->work_lock);
> - INIT_LIST_HEAD(&dev->work_list);
> - dev->worker = NULL;
> + dev->worker = worker;
> + dev->owner = current;
>
> for (i = 0; i < dev->nvqs; ++i) {
> vq = dev->vqs[i];
> @@ -331,31 +350,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
> }
> EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
>
> -struct vhost_attach_cgroups_struct {
> - struct vhost_work work;
> - struct task_struct *owner;
> - int ret;
> -};
> -
> -static void vhost_attach_cgroups_work(struct vhost_work *work)
> -{
> - struct vhost_attach_cgroups_struct *s;
> -
> - s = container_of(work, struct vhost_attach_cgroups_struct, work);
> - s->ret = cgroup_attach_task_all(s->owner, current);
> -}
> -
> -static int vhost_attach_cgroups(struct vhost_dev *dev)
> -{
> - struct vhost_attach_cgroups_struct attach;
> -
> - attach.owner = current;
> - vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> - vhost_work_queue(dev, &attach.work);
> - vhost_work_flush(dev, &attach.work);
> - return attach.ret;
> -}
> -
> /* Caller should have device mutex */
> bool vhost_dev_has_owner(struct vhost_dev *dev)
> {
> @@ -366,7 +360,6 @@ EXPORT_SYMBOL_GPL(vhost_dev_has_owner);
> /* Caller should have device mutex */
> long vhost_dev_set_owner(struct vhost_dev *dev)
> {
> - struct task_struct *worker;
> int err;
>
> /* Is there an owner already? */
> @@ -377,28 +370,15 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>
> /* No owner, become one */
> dev->mm = get_task_mm(current);
> - worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
> - if (IS_ERR(worker)) {
> - err = PTR_ERR(worker);
> - goto err_worker;
> - }
> -
> dev->worker = worker;
> - wake_up_process(worker); /* avoid contributing to loadavg */
> -
> - err = vhost_attach_cgroups(dev);
> - if (err)
> - goto err_cgroup;
>
> err = vhost_dev_alloc_iovecs(dev);
> if (err)
> - goto err_cgroup;
> + goto err_alloc;
>
> return 0;
> -err_cgroup:
> - kthread_stop(worker);
> +err_alloc:
> dev->worker = NULL;
> -err_worker:
> if (dev->mm)
> mmput(dev->mm);
> dev->mm = NULL;
> @@ -472,11 +452,6 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
> /* No one will access memory at this point */
> kfree(dev->memory);
> dev->memory = NULL;
> - WARN_ON(!list_empty(&dev->work_list));
> - if (dev->worker) {
> - kthread_stop(dev->worker);
> - dev->worker = NULL;
> - }
> if (dev->mm)
> mmput(dev->mm);
> dev->mm = NULL;
> @@ -1567,11 +1542,32 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
>
> static int __init vhost_init(void)
> {
> + struct vhost_worker *w =
> + kzalloc(sizeof(*w), GFP_KERNEL);
> + if (!w)
> + return -ENOMEM;
> +
> + w->thread = kthread_create(vhost_worker,
> + w, "vhost-worker");
> + if (IS_ERR(w->thread))
> + return PTR_ERR(w->thread);
> +
> + worker = w;
> + spin_lock_init(&worker->work_lock);
> + INIT_LIST_HEAD(&worker->work_list);
> + wake_up_process(worker->thread);
> + pr_info("Created universal thread to service requests\n");
> +
> return 0;
> }
>
> static void __exit vhost_exit(void)
> {
> + if (worker) {
> + kthread_stop(worker->thread);
> + WARN_ON(!list_empty(&worker->work_list));
> + kfree(worker);
> + }
> }
>
> module_init(vhost_init);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 8c1c792..2f204ce 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -22,6 +22,7 @@ struct vhost_work {
> int flushing;
> unsigned queue_seq;
> unsigned done_seq;
> + struct vhost_dev *dev;
> };
>
> /* Poll a file (eventfd or socket) */
> @@ -35,8 +36,8 @@ struct vhost_poll {
> struct vhost_dev *dev;
> };
>
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> -void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
> +void vhost_work_init(struct vhost_dev *dev,
> + struct vhost_work *work, vhost_work_fn_t fn);
>
> void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> unsigned long mask, struct vhost_dev *dev);
> @@ -44,7 +45,6 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file);
> void vhost_poll_stop(struct vhost_poll *poll);
> void vhost_poll_flush(struct vhost_poll *poll);
> void vhost_poll_queue(struct vhost_poll *poll);
> -void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work);
> long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
>
> struct vhost_log {
> @@ -116,11 +116,22 @@ struct vhost_dev {
> int nvqs;
> struct file *log_file;
> struct eventfd_ctx *log_ctx;
> + /* vhost shared worker */
> + struct vhost_worker *worker;
> + /* for cgroup support */
> + struct task_struct *owner;
> +};
> +
> +struct vhost_worker {
> spinlock_t work_lock;
> struct list_head work_list;
> - struct task_struct *worker;
> + struct task_struct *thread;
> + struct task_struct *owner;
> };
>
> +void vhost_work_queue(struct vhost_worker *worker,
> + struct vhost_work *work);
> +void vhost_work_flush(struct vhost_worker *worker, struct vhost_work *work);
> void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> long vhost_dev_set_owner(struct vhost_dev *dev);
> bool vhost_dev_has_owner(struct vhost_dev *dev);
> --
> 2.4.3

2015-08-10 20:00:28

by Bandan Das

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Shared vhost design

"Michael S. Tsirkin" <[email protected]> writes:

> On Sat, Aug 08, 2015 at 07:06:38PM -0400, Bandan Das wrote:
>> Hi Michael,
...
>>
>> > - does the design address the issue of VM 1 being blocked
>> > (e.g. because it hits swap) and blocking VM 2?
>> Good question. I haven't thought of this yet. But IIUC,
>> the worker thread will complete VM1's job and then move on to
>> executing VM2's scheduled work.
>> It doesn't matter if VM1 is
>> blocked currently. I think it would be a problem though if/when
>> polling is introduced.
>
> Sorry, I wasn't clear. If VM1's memory is in swap, attempts to
> access it might block the service thread, so it won't
> complete VM2's job.

Ah ok, I understand now. I am pretty sure the current RFC doesn't
take care of this :) I will add this to my todo list for v2.

Bandan

>
>
>>
>> >>
>> >> #* Last run with the vCPU and I/O thread(s) pinned, no CPU/memory limit imposed.
>> >> # I/O thread runs on CPU 14 or 15 depending on which guest it's serving
>> >>
>> >> There's a simple graph at
>> >> http://people.redhat.com/~bdas/elvis/data/results.png
>> >> that shows how task affinity results in a jump and even without it,
>> >> as the number of guests increase, the shared vhost design performs
>> >> slightly better.
>> >>
>> >> Observations:
>> >> 1. In terms of "stock" performance, the results are comparable.
>> >> 2. However, with a tuned setup, even without polling, we see an improvement
>> >> with the new design.
>> >> 3. Making the new design simulate old behavior would be a matter of setting
>> >> the number of guests per vhost threads to 1.
>> >> 4. Maybe, setting a per guest limit on the work being done by a specific vhost
>> >> thread is needed for it to be fair.
>> >> 5. cgroup associations needs to be figured out. I just slightly hacked the
>> >> current cgroup association mechanism to work with the new model. Ccing cgroups
>> >> for input/comments.
>> >>
>> >> Many thanks to Razya Ladelsky and Eyal Moscovici, IBM for the initial
>> >> patches, the helpful testing suggestions and discussions.
>> >>
>> >> Bandan Das (4):
>> >> vhost: Introduce a universal thread to serve all users
>> >> vhost: Limit the number of devices served by a single worker thread
>> >> cgroup: Introduce a function to compare cgroups
>> >> vhost: Add cgroup-aware creation of worker threads
>> >>
>> >> drivers/vhost/net.c | 6 +-
>> >> drivers/vhost/scsi.c | 18 ++--
>> >> drivers/vhost/vhost.c | 272 +++++++++++++++++++++++++++++++++++--------------
>> >> drivers/vhost/vhost.h | 32 +++++-
>> >> include/linux/cgroup.h | 1 +
>> >> kernel/cgroup.c | 40 ++++++++
>> >> 6 files changed, 275 insertions(+), 94 deletions(-)
>> >>
>> >> --
>> >> 2.4.3

2015-08-10 20:09:34

by Bandan Das

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

"Michael S. Tsirkin" <[email protected]> writes:

> On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote:
>> vhost threads are per-device, but in most cases a single thread
>> is enough. This change creates a single thread that is used to
>> serve all guests.
>>
>> However, this complicates cgroup associations. The current policy
>> is to attach the per-device thread to all cgroups of the parent process
>> that the device is associated with. This is no longer possible if we
>> have a single thread. So, we end up moving the thread around to
>> the cgroups of whichever device needs servicing. This is a very
>> inefficient protocol but seems to be the only way to integrate
>> cgroups support.
>>
>> Signed-off-by: Razya Ladelsky <[email protected]>
>> Signed-off-by: Bandan Das <[email protected]>
>
> BTW, how does this interact with virtio net MQ?
> It would seem that MQ gains from more parallelism and
> CPU locality.

Hm.. Good point. As of this version, this design will always have
one worker thread servicing a guest. Now suppose we have 10 virtio
queues for a guest; surely we could benefit from spawning off another
worker, just like we do for a new guest/device with
the devs_per_worker parameter.

...

2015-08-10 21:05:19

by Bandan Das

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users

Bandan Das <[email protected]> writes:

> "Michael S. Tsirkin" <[email protected]> writes:
>
>> On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote:
>>> vhost threads are per-device, but in most cases a single thread
>>> is enough. This change creates a single thread that is used to
>>> serve all guests.
>>>
>>> However, this complicates cgroup associations. The current policy
>>> is to attach the per-device thread to all cgroups of the parent process
>>> that the device is associated with. This is no longer possible if we
>>> have a single thread. So, we end up moving the thread around to
>>> the cgroups of whichever device needs servicing. This is a very
>>> inefficient protocol but seems to be the only way to integrate
>>> cgroups support.
>>>
>>> Signed-off-by: Razya Ladelsky <[email protected]>
>>> Signed-off-by: Bandan Das <[email protected]>
>>
>> BTW, how does this interact with virtio net MQ?
>> It would seem that MQ gains from more parallelism and
>> CPU locality.
>
> Hm.. Good point. As of this version, this design will always have
> one worker thread servicing a guest. Now suppose we have 10 virtio
> queues for a guest; surely we could benefit from spawning off another
> worker, just like we do for a new guest/device with
> the devs_per_worker parameter.

So, I did a quick smoke test with virtio-net and the Elvis patches.
virtio net MQ already spawns a new worker thread for every queue,
it seems? So, the above setup already works! :) I will run some tests and
post back the results.

...