Hi all,
this series introduces the frontend for the newly introduced PV Calls
protocol.
PV Calls is a paravirtualized protocol that allows the implementation of
a set of POSIX functions in a different domain. The PV Calls frontend
sends POSIX function calls to the backend, which implements them, acts
on the function call, and returns the result to the frontend.
For more information about PV Calls, please read:
https://xenbits.xen.org/docs/unstable/misc/pvcalls.html
This patch series only implements the frontend driver. It doesn't
attempt to redirect POSIX calls to it. The functions exported in
pvcalls-front.h are meant to be used for that. A separate patch series
will be sent to use them and hook them into the system.
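As a rough illustration, a hypothetical in-kernel consumer could drive
the exported entry points like this (a sketch only, not part of this
series; the hook-up series will define the real wiring into the socket
layer):

    /* Hypothetical caller sketch -- not part of this series. */
    static int example_stream_connect(struct socket *sock,
                                      struct sockaddr *addr, int addr_len)
    {
            int ret;

            ret = pvcalls_front_socket(sock);       /* PVCALLS_SOCKET */
            if (ret < 0)
                    return ret;
            /* PVCALLS_CONNECT; allocates the active socket rings */
            return pvcalls_front_connect(sock, addr, addr_len, 0);
    }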
Changes in v3:
In addition to addressing all comments, in this version of the series I
also made one other significant change: I implemented non-blocking
accept, which I didn't realize was missing in the last version of the
series (non-blocking accept is usually called only after a successful
poll, when there are already outstanding connections to accept). To do
that, I also changed the id for sockets sent to the backend from the
value of the struct socket pointer to the value of the struct
sock_mapping pointer. It's all documented in the patch descriptions.
This is the full list of changes:
- use struct sock_mapping* instead of struct socket* as socket id to
share with the backend
- allocate struct sock_mapping for a socket in pvcalls_front_socket
- in pvcalls_front_accept check if the request is non-blocking and
return -EAGAIN in that case
- store req_id and struct sock_mapping * of an inflight accept request,
so that we can resume when it's ready, in case the accept is
non-blocking
- remove unnecessary parentheses
- rename RING_ORDER to PVCALLS_RING_ORDER
- make pvcalls_front_dev static
- remove ref local variable from pvcalls_front_probe
- move patch #12 before patch #2
- move dev_set_drvdata before any goto error in pvcalls_front_probe
- combine src and dst calculation lines in pvcalls_front_event_handler
- remove unnecessary != 0 in the wait_event and wait_event_interruptible
tests
- add an in-code comment about sk_send_head
- In v2 I applied some changes to the wrong patch. Move changes to the
right patches.
- refactor the code to get a req_id into a function
- use wait_event (instead of wait_event_interruptible) to wait for a
response from the backend because we cannot cope with missing
responses
- remove the usage of PVCALLS_FRONT_MAX_SPIN from recvmsg
- remove unnecessary !bedata check in pvcalls_front_release
Changes in v2:
- use xenbus_read_unsigned when possible
- call dev_set_drvdata earlier in pvcalls_front_probe so as not to
dereference a NULL pointer in the error path
- set ret appropriately in pvcalls_front_probe
- include pvcalls-front.h in pvcalls-front.c
- call wake_up only once after the consuming loop in pvcalls_front_event_handler
- don't leak "bytes" in case of errors in create_active
- call spin_unlock appropriately in case of errors in create_active
- remove all BUG_ONs
- don't leak newsock->sk in pvcalls_front_accept in case of errors
- rename PVCALLS_FRON_MAX_SPIN to PVCALLS_FRONT_MAX_SPIN
- return bool from pvcalls_front_read_todo
- add a barrier after setting PVCALLS_FLAG_POLL_RET in
pvcalls_front_event_handler
- remove outdated comment in pvcalls_front_free_map
- clear sock->sk->sk_send_head later in pvcalls_front_release
- make XEN_PVCALLS_FRONTEND tristate
- don't add an empty resume function
Stefano Stabellini (13):
xen/pvcalls: introduce the pvcalls xenbus frontend
xen/pvcalls: implement frontend disconnect
xen/pvcalls: connect to the backend
xen/pvcalls: implement socket command and handle events
xen/pvcalls: implement connect command
xen/pvcalls: implement bind command
xen/pvcalls: implement listen command
xen/pvcalls: implement accept command
xen/pvcalls: implement sendmsg
xen/pvcalls: implement recvmsg
xen/pvcalls: implement poll command
xen/pvcalls: implement release command
xen: introduce a Kconfig option to enable the pvcalls frontend
drivers/xen/Kconfig | 9 +
drivers/xen/Makefile | 1 +
drivers/xen/pvcalls-front.c | 1140 +++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 28 ++
4 files changed, 1178 insertions(+)
create mode 100644 drivers/xen/pvcalls-front.c
create mode 100644 drivers/xen/pvcalls-front.h
Send PVCALLS_LISTEN to the backend.
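Like every other command in this series, listen follows the same
request/response pattern over the shared command ring; condensed, the
shape is (a sketch of the code below, locking and error handling
trimmed):

    spin_lock(&bedata->pvcallss_lock);
    ret = get_request(bedata, &req_id);     /* reserve a slot */
    req = RING_GET_REQUEST(&bedata->ring, req_id);
    /* fill in req->req_id, req->cmd and req->u.<cmd> */
    bedata->ring.req_prod_pvt++;
    RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
    spin_unlock(&bedata->pvcallss_lock);
    if (notify)
            notify_remote_via_irq(bedata->irq);
    wait_event(bedata->inflight_req,
               READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
    ret = bedata->rsp[req_id].ret;  /* the backend's return value */
    smp_mb();                       /* read ret before recycling the slot */
    WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);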
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 1 +
2 files changed, 48 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 5ccef34..b2757f5 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -367,6 +367,53 @@ int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
return 0;
}
+int pvcalls_front_listen(struct socket *sock, int backlog)
+{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map;
+ struct xen_pvcalls_request *req;
+ int notify, req_id, ret;
+
+ if (!pvcalls_front_dev)
+ return -ENOTCONN;
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
+ if (!map)
+ return -ENOTSOCK;
+
+ if (map->passive.status != PVCALLS_STATUS_BIND)
+ return -EOPNOTSUPP;
+
+ spin_lock(&bedata->pvcallss_lock);
+ ret = get_request(bedata, &req_id);
+ if (ret < 0) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+ req = RING_GET_REQUEST(&bedata->ring, req_id);
+ req->req_id = req_id;
+ req->cmd = PVCALLS_LISTEN;
+ req->u.listen.id = (uint64_t) map;
+ req->u.listen.backlog = backlog;
+
+ bedata->ring.req_prod_pvt++;
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
+ spin_unlock(&bedata->pvcallss_lock);
+ if (notify)
+ notify_remote_via_irq(bedata->irq);
+
+ wait_event(bedata->inflight_req,
+ READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
+
+ map->passive.status = PVCALLS_STATUS_LISTEN;
+ ret = bedata->rsp[req_id].ret;
+ /* read ret, then set this rsp slot to be reused */
+ smp_mb();
+ WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
+ return ret;
+}
+
static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
index 8b0a274..aa8fe10 100644
--- a/drivers/xen/pvcalls-front.h
+++ b/drivers/xen/pvcalls-front.h
@@ -9,5 +9,6 @@ int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
int pvcalls_front_bind(struct socket *sock,
struct sockaddr *addr,
int addr_len);
+int pvcalls_front_listen(struct socket *sock, int backlog);
#endif
--
1.9.1
Send PVCALLS_BIND to the backend. Introduce a new structure, part of
struct sock_mapping, to store information specific to passive sockets.
Introduce a status field to keep track of the status of the passive
socket.
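A hypothetical caller would exercise the passive-socket path in the
usual POSIX order, with the status field tracking the progression (a
sketch; listen and accept are implemented by other patches in this
series):

    /* Hypothetical caller sketch -- not part of this series. */
    ret = pvcalls_front_socket(sock);           /* status: uninitialized */
    if (!ret)
            ret = pvcalls_front_bind(sock, addr, addr_len);   /* -> BIND */
    if (!ret)
            ret = pvcalls_front_listen(sock, backlog);        /* -> LISTEN */
    if (!ret)
            ret = pvcalls_front_accept(sock, newsock, flags);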
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 3 +++
2 files changed, 62 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 379b8fb..5ccef34 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -59,6 +59,13 @@ struct sock_mapping {
wait_queue_head_t inflight_conn_req;
} active;
+ struct {
+ /* Socket status */
+#define PVCALLS_STATUS_UNINITALIZED 0
+#define PVCALLS_STATUS_BIND 1
+#define PVCALLS_STATUS_LISTEN 2
+ uint8_t status;
+ } passive;
};
};
@@ -308,6 +315,58 @@ int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
return ret;
}
+int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map = NULL;
+ struct xen_pvcalls_request *req;
+ int notify, req_id, ret;
+
+ if (!pvcalls_front_dev)
+ return -ENOTCONN;
+ if (addr->sa_family != AF_INET || sock->type != SOCK_STREAM)
+ return -ENOTSUPP;
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
+ if (map == NULL)
+ return -EINVAL;
+
+ spin_lock(&bedata->pvcallss_lock);
+ ret = get_request(bedata, &req_id);
+ if (ret < 0) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+ req = RING_GET_REQUEST(&bedata->ring, req_id);
+ req->req_id = req_id;
+ map->sock = sock;
+ req->cmd = PVCALLS_BIND;
+ req->u.bind.id = (uint64_t) map;
+ memcpy(req->u.bind.addr, addr, sizeof(*addr));
+ req->u.bind.len = addr_len;
+
+ init_waitqueue_head(&map->passive.inflight_accept_req);
+
+ map->active_socket = false;
+
+ bedata->ring.req_prod_pvt++;
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
+ spin_unlock(&bedata->pvcallss_lock);
+ if (notify)
+ notify_remote_via_irq(bedata->irq);
+
+ wait_event(bedata->inflight_req,
+ READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
+
+ map->passive.status = PVCALLS_STATUS_BIND;
+ ret = bedata->rsp[req_id].ret;
+ /* read ret, then set this rsp slot to be reused */
+ smp_mb();
+ WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
+ return ret;
+}
+
static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
index 63b0417..8b0a274 100644
--- a/drivers/xen/pvcalls-front.h
+++ b/drivers/xen/pvcalls-front.h
@@ -6,5 +6,8 @@
int pvcalls_front_socket(struct socket *sock);
int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
int addr_len, int flags);
+int pvcalls_front_bind(struct socket *sock,
+ struct sockaddr *addr,
+ int addr_len);
#endif
--
1.9.1
For active sockets, check the indexes and use the inflight_conn_req
waitqueue to wait.
For passive sockets, if an accept is outstanding
(PVCALLS_FLAG_ACCEPT_INFLIGHT), check whether it has been answered by
looking at bedata->rsp[req_id]. If so, return POLLIN. Otherwise, use the
inflight_accept_req waitqueue.
If no accepts are inflight, send PVCALLS_POLL to the backend. If a POLL
request is already outstanding and awaiting a response, use the
inflight_req waitqueue: inflight_req is woken up when a new response is
received; on wakeup we check whether the POLL response has arrived by
looking at the PVCALLS_FLAG_POLL_RET flag. We set the flag from
pvcalls_front_event_handler if the response was for a POLL command.
In pvcalls_front_event_handler, get the struct sock_mapping from the
poll id (we previously converted struct sock_mapping* to uint64_t and
used it as id).
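The resulting ordering between the event handler and
pvcalls_front_poll_passive can be summarized as follows (a condensed
sketch of the code below):

    /*
     * Event handler, on a POLL response:
     *     set_bit(PVCALLS_FLAG_POLL_RET, ...);
     *     smp_wmb();              // publish RET before clearing INFLIGHT
     *     clear_bit(PVCALLS_FLAG_POLL_INFLIGHT, ...);
     *
     * pvcalls_front_poll_passive:
     *     if (test_and_clear_bit(PVCALLS_FLAG_POLL_RET, ...))
     *             return POLLIN;          // response already arrived
     *     if (test_and_set_bit(PVCALLS_FLAG_POLL_INFLIGHT, ...))
     *             poll_wait(...);         // a POLL is already in flight
     *     else
     *             send a new PVCALLS_POLL request;
     *
     * Checking RET before INFLIGHT needs no extra read barrier: the
     * conditional bit operations create a control dependency that
     * preserves the order.
     */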
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 135 +++++++++++++++++++++++++++++++++++++++++---
drivers/xen/pvcalls-front.h | 3 +
2 files changed, 129 insertions(+), 9 deletions(-)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 635a83a..1c975d6 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -72,6 +72,8 @@ struct sock_mapping {
* Only one poll operation can be inflight for a given socket.
*/
#define PVCALLS_FLAG_ACCEPT_INFLIGHT 0
+#define PVCALLS_FLAG_POLL_INFLIGHT 1
+#define PVCALLS_FLAG_POLL_RET 2
uint8_t flags;
uint32_t inflight_req_id;
struct sock_mapping *accept_map;
@@ -139,15 +141,32 @@ static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
rsp = RING_GET_RESPONSE(&bedata->ring, bedata->ring.rsp_cons);
req_id = rsp->req_id;
- dst = (uint8_t *)&bedata->rsp[req_id] + sizeof(rsp->req_id);
- src = (uint8_t *)rsp + sizeof(rsp->req_id);
- memcpy(dst, src, sizeof(*rsp) - sizeof(rsp->req_id));
- /*
- * First copy the rest of the data, then req_id. It is
- * paired with the barrier when accessing bedata->rsp.
- */
- smp_wmb();
- WRITE_ONCE(bedata->rsp[req_id].req_id, rsp->req_id);
+ if (rsp->cmd == PVCALLS_POLL) {
+ struct sock_mapping *map = (struct sock_mapping *)
+ rsp->u.poll.id;
+
+ set_bit(PVCALLS_FLAG_POLL_RET,
+ (void *)&map->passive.flags);
+ /*
+ * Set RET, then clear INFLIGHT. It pairs with
+ * the checks at the beginning of
+ * pvcalls_front_poll_passive.
+ */
+ smp_wmb();
+ clear_bit(PVCALLS_FLAG_POLL_INFLIGHT,
+ (void *)&map->passive.flags);
+ } else {
+ dst = (uint8_t *)&bedata->rsp[req_id] +
+ sizeof(rsp->req_id);
+ src = (uint8_t *)rsp + sizeof(rsp->req_id);
+ memcpy(dst, src, sizeof(*rsp) - sizeof(rsp->req_id));
+ /*
+ * First copy the rest of the data, then req_id. It is
+ * paired with the barrier when accessing bedata->rsp.
+ */
+ smp_wmb();
+ WRITE_ONCE(bedata->rsp[req_id].req_id, rsp->req_id);
+ }
done = 1;
bedata->ring.rsp_cons++;
@@ -736,6 +755,104 @@ int pvcalls_front_accept(struct socket *sock, struct socket *newsock, int flags)
return ret;
}
+static unsigned int pvcalls_front_poll_passive(struct file *file,
+ struct pvcalls_bedata *bedata,
+ struct sock_mapping *map,
+ poll_table *wait)
+{
+ int notify, req_id, ret;
+ struct xen_pvcalls_request *req;
+
+ if (test_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
+ (void *)&map->passive.flags)) {
+ uint32_t req_id = READ_ONCE(map->passive.inflight_req_id);
+ if (req_id != PVCALLS_INVALID_ID &&
+ READ_ONCE(bedata->rsp[req_id].req_id) == req_id)
+ return POLLIN;
+
+ poll_wait(file, &map->passive.inflight_accept_req, wait);
+ return 0;
+ }
+
+ if (test_and_clear_bit(PVCALLS_FLAG_POLL_RET,
+ (void *)&map->passive.flags))
+ return POLLIN;
+
+ /*
+ * First check RET, then INFLIGHT. No barriers necessary to
+ * ensure execution ordering because of the conditional
+ * instructions creating control dependencies.
+ */
+
+ if (test_and_set_bit(PVCALLS_FLAG_POLL_INFLIGHT,
+ (void *)&map->passive.flags)) {
+ poll_wait(file, &bedata->inflight_req, wait);
+ return 0;
+ }
+
+ spin_lock(&bedata->pvcallss_lock);
+ ret = get_request(bedata, &req_id);
+ if (ret < 0) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+ req = RING_GET_REQUEST(&bedata->ring, req_id);
+ req->req_id = req_id;
+ req->cmd = PVCALLS_POLL;
+ req->u.poll.id = (uint64_t) map;
+
+ bedata->ring.req_prod_pvt++;
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
+ spin_unlock(&bedata->pvcallss_lock);
+ if (notify)
+ notify_remote_via_irq(bedata->irq);
+
+ poll_wait(file, &bedata->inflight_req, wait);
+ return 0;
+}
+
+static unsigned int pvcalls_front_poll_active(struct file *file,
+ struct pvcalls_bedata *bedata,
+ struct sock_mapping *map,
+ poll_table *wait)
+{
+ unsigned int mask = 0;
+ int32_t in_error, out_error;
+ struct pvcalls_data_intf *intf = map->active.ring;
+
+ out_error = intf->out_error;
+ in_error = intf->in_error;
+
+ poll_wait(file, &map->active.inflight_conn_req, wait);
+ if (pvcalls_front_write_todo(map))
+ mask |= POLLOUT | POLLWRNORM;
+ if (pvcalls_front_read_todo(map))
+ mask |= POLLIN | POLLRDNORM;
+ if (in_error != 0 || out_error != 0)
+ mask |= POLLERR;
+
+ return mask;
+}
+
+unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
+{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map;
+
+ if (!pvcalls_front_dev)
+ return POLLNVAL;
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
+ if (!map)
+ return POLLNVAL;
+ if (map->active_socket)
+ return pvcalls_front_poll_active(file, bedata, map, wait);
+ else
+ return pvcalls_front_poll_passive(file, bedata, map, wait);
+}
+
static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
index de24041..25e05b8 100644
--- a/drivers/xen/pvcalls-front.h
+++ b/drivers/xen/pvcalls-front.h
@@ -20,5 +20,8 @@ int pvcalls_front_recvmsg(struct socket *sock,
struct msghdr *msg,
size_t len,
int flags);
+unsigned int pvcalls_front_poll(struct file *file,
+ struct socket *sock,
+ poll_table *wait);
#endif
--
1.9.1
Send PVCALLS_RELEASE to the backend and wait for a reply. Take both
in_mutex and out_mutex to avoid concurrent accesses. Then, free the
socket.
For passive sockets, check whether we have already pre-allocated an
active socket for the purpose of being accepted. If so, free that as
well.
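For the active case the teardown ordering matters; condensed, the idea
is (a sketch of the code below):

    /*
     * 1. map->active.ring->in_error = -EBADF;  // make readers bail out
     * 2. wake_up_interruptible(&map->active.inflight_conn_req);
     * 3. mutex_lock(&map->active.in_mutex);    // wait for the last reader
     * 4. mutex_lock(&map->active.out_mutex);   // wait for the last writer
     * 5. pvcalls_front_free_map(bedata, map);  // now safe to tear down
     */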
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 88 +++++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 1 +
2 files changed, 89 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 1c975d6..775a6d2 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -192,6 +192,23 @@ static irqreturn_t pvcalls_front_conn_handler(int irq, void *sock_map)
return IRQ_HANDLED;
}
+static void pvcalls_front_free_map(struct pvcalls_bedata *bedata,
+ struct sock_mapping *map)
+{
+ int i;
+
+ spin_lock(&bedata->pvcallss_lock);
+ if (!list_empty(&map->list))
+ list_del_init(&map->list);
+ spin_unlock(&bedata->pvcallss_lock);
+
+ for (i = 0; i < (1 << map->active.ring->ring_order); i++)
+ gnttab_end_foreign_access(map->active.ring->ref[i], 0, 0);
+ gnttab_end_foreign_access(map->active.ref, 0, 0);
+ free_page((unsigned long)map->active.ring);
+ unbind_from_irqhandler(map->active.irq, map);
+}
+
int pvcalls_front_socket(struct socket *sock)
{
struct pvcalls_bedata *bedata;
@@ -853,6 +870,77 @@ unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
return pvcalls_front_poll_passive(file, bedata, map, wait);
}
+int pvcalls_front_release(struct socket *sock)
+{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map;
+ int req_id, notify, ret;
+ struct xen_pvcalls_request *req;
+
+ if (!pvcalls_front_dev)
+ return -EIO;
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ if (sock->sk == NULL)
+ return 0;
+
+ map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
+ if (map == NULL)
+ return 0;
+
+ spin_lock(&bedata->pvcallss_lock);
+ ret = get_request(bedata, &req_id);
+ if (ret < 0) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+ WRITE_ONCE(sock->sk->sk_send_head, NULL);
+
+ req = RING_GET_REQUEST(&bedata->ring, req_id);
+ req->req_id = req_id;
+ req->cmd = PVCALLS_RELEASE;
+ req->u.release.id = (uint64_t)map;
+
+ bedata->ring.req_prod_pvt++;
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
+ spin_unlock(&bedata->pvcallss_lock);
+ if (notify)
+ notify_remote_via_irq(bedata->irq);
+
+ wait_event(bedata->inflight_req,
+ READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
+
+ if (map->active_socket) {
+ /*
+ * Set in_error and wake up inflight_conn_req to force
+ * recvmsg waiters to exit.
+ */
+ map->active.ring->in_error = -EBADF;
+ wake_up_interruptible(&map->active.inflight_conn_req);
+
+ mutex_lock(&map->active.in_mutex);
+ mutex_lock(&map->active.out_mutex);
+ pvcalls_front_free_map(bedata, map);
+ mutex_unlock(&map->active.out_mutex);
+ mutex_unlock(&map->active.in_mutex);
+ kfree(map);
+ } else {
+ spin_lock(&bedata->pvcallss_lock);
+ if (READ_ONCE(map->passive.inflight_req_id) !=
+ PVCALLS_INVALID_ID) {
+ pvcalls_front_free_map(bedata,
+ map->passive.accept_map);
+ kfree(map->passive.accept_map);
+ }
+ list_del_init(&map->list);
+ kfree(map);
+ spin_unlock(&bedata->pvcallss_lock);
+ }
+ WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
+
+ return 0;
+}
+
static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
index 25e05b8..3332978 100644
--- a/drivers/xen/pvcalls-front.h
+++ b/drivers/xen/pvcalls-front.h
@@ -23,5 +23,6 @@ int pvcalls_front_recvmsg(struct socket *sock,
unsigned int pvcalls_front_poll(struct file *file,
struct socket *sock,
poll_table *wait);
+int pvcalls_front_release(struct socket *sock);
#endif
--
1.9.1
Also add pvcalls-front to the Makefile.
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/Kconfig | 9 +++++++++
drivers/xen/Makefile | 1 +
2 files changed, 10 insertions(+)
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 4545561..0b2c828 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -196,6 +196,15 @@ config XEN_PCIDEV_BACKEND
If in doubt, say m.
+config XEN_PVCALLS_FRONTEND
+ tristate "XEN PV Calls frontend driver"
+ depends on INET && XEN
+ help
+ Experimental frontend for the Xen PV Calls protocol
+ (https://xenbits.xen.org/docs/unstable/misc/pvcalls.html). It
+ sends a small set of POSIX calls to the backend, which
+ implements them.
+
config XEN_PVCALLS_BACKEND
bool "XEN PV Calls backend driver"
depends on INET && XEN && XEN_BACKEND
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index 480b928..afb9e03 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_XEN_EFI) += efi.o
obj-$(CONFIG_XEN_SCSI_BACKEND) += xen-scsiback.o
obj-$(CONFIG_XEN_AUTO_XLATE) += xlate_mmu.o
obj-$(CONFIG_XEN_PVCALLS_BACKEND) += pvcalls-back.o
+obj-$(CONFIG_XEN_PVCALLS_FRONTEND) += pvcalls-front.o
xen-evtchn-y := evtchn.o
xen-gntdev-y := gntdev.o
xen-gntalloc-y := gntalloc.o
--
1.9.1
Implement recvmsg by copying data from the "in" ring. If not enough data
is available and the recvmsg call is blocking, then wait on the
inflight_conn_req waitqueue. Take the active socket in_mutex so that
only one function can access the ring at any given time.
Note that, unlike sendmsg, recvmsg does not spin waiting for data: the
PVCALLS_FRONT_MAX_SPIN optimization was removed from the recvmsg path
in v3 of this series (see the changelog in the cover letter).
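When the available bytes wrap past the end of the ring, __read_ring
(below) copies them in two chunks. A worked example with a hypothetical
8-byte ring:

    /*
     * array_size = 8, cons = 6, prod = 10:
     *     size        = pvcalls_queued(10, 6, 8) = 4
     *     masked_cons = 6, masked_prod = 2  (masked_prod <= masked_cons,
     *                                        i.e. the data wraps)
     * For len = 4:
     *     copy_to_iter(data->in + 6, 2, ...);  // offsets 6 and 7
     *     copy_to_iter(data->in,     2, ...);  // wrapped offsets 0 and 1
     */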
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 102 ++++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 4 ++
2 files changed, 106 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 369acde..635a83a 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -105,6 +105,20 @@ static int pvcalls_front_write_todo(struct sock_mapping *map)
return size - pvcalls_queued(prod, cons, size);
}
+static bool pvcalls_front_read_todo(struct sock_mapping *map)
+{
+ struct pvcalls_data_intf *intf = map->active.ring;
+ RING_IDX cons, prod;
+ int32_t error;
+
+ cons = intf->in_cons;
+ prod = intf->in_prod;
+ error = intf->in_error;
+ return (error != 0 ||
+ pvcalls_queued(prod, cons,
+ XEN_FLEX_RING_SIZE(intf->ring_order)) != 0);
+}
+
static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
{
struct xenbus_device *dev = dev_id;
@@ -434,6 +448,94 @@ int pvcalls_front_sendmsg(struct socket *sock, struct msghdr *msg,
return tot_sent;
}
+static int __read_ring(struct pvcalls_data_intf *intf,
+ struct pvcalls_data *data,
+ struct iov_iter *msg_iter,
+ size_t len, int flags)
+{
+ RING_IDX cons, prod, size, masked_prod, masked_cons;
+ RING_IDX array_size = XEN_FLEX_RING_SIZE(intf->ring_order);
+ int32_t error;
+
+ cons = intf->in_cons;
+ prod = intf->in_prod;
+ error = intf->in_error;
+ /* get pointers before reading from the ring */
+ virt_rmb();
+ if (error < 0)
+ return error;
+
+ size = pvcalls_queued(prod, cons, array_size);
+ masked_prod = pvcalls_mask(prod, array_size);
+ masked_cons = pvcalls_mask(cons, array_size);
+
+ if (size == 0)
+ return 0;
+
+ if (len > size)
+ len = size;
+
+ if (masked_prod > masked_cons) {
+ copy_to_iter(data->in + masked_cons, len, msg_iter);
+ } else {
+ if (len > (array_size - masked_cons)) {
+ copy_to_iter(data->in + masked_cons,
+ array_size - masked_cons, msg_iter);
+ copy_to_iter(data->in,
+ len - (array_size - masked_cons),
+ msg_iter);
+ } else {
+ copy_to_iter(data->in + masked_cons, len, msg_iter);
+ }
+ }
+ /* read data from the ring before increasing the index */
+ virt_mb();
+ if (!(flags & MSG_PEEK))
+ intf->in_cons += len;
+
+ return len;
+}
+
+int pvcalls_front_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
+ int flags)
+{
+ struct pvcalls_bedata *bedata;
+ int ret = -EAGAIN;
+ struct sock_mapping *map;
+
+ if (!pvcalls_front_dev)
+ return -ENOTCONN;
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
+ if (!map)
+ return -ENOTSOCK;
+
+ if (flags & (MSG_CMSG_CLOEXEC|MSG_ERRQUEUE|MSG_OOB|MSG_TRUNC))
+ return -EOPNOTSUPP;
+
+ mutex_lock(&map->active.in_mutex);
+ if (len > XEN_FLEX_RING_SIZE(map->active.ring->ring_order))
+ len = XEN_FLEX_RING_SIZE(map->active.ring->ring_order);
+
+ while (!(flags & MSG_DONTWAIT) && !pvcalls_front_read_todo(map)) {
+ wait_event_interruptible(map->active.inflight_conn_req,
+ pvcalls_front_read_todo(map));
+ }
+ ret = __read_ring(map->active.ring, &map->active.data,
+ &msg->msg_iter, len, flags);
+
+ if (ret > 0)
+ notify_remote_via_irq(map->active.irq);
+ if (ret == 0)
+ ret = -EAGAIN;
+ if (ret == -ENOTCONN)
+ ret = 0;
+
+ mutex_unlock(&map->active.in_mutex);
+ return ret;
+}
+
int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
{
struct pvcalls_bedata *bedata;
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
index d937c24..de24041 100644
--- a/drivers/xen/pvcalls-front.h
+++ b/drivers/xen/pvcalls-front.h
@@ -16,5 +16,9 @@ int pvcalls_front_accept(struct socket *sock,
int pvcalls_front_sendmsg(struct socket *sock,
struct msghdr *msg,
size_t len);
+int pvcalls_front_recvmsg(struct socket *sock,
+ struct msghdr *msg,
+ size_t len,
+ int flags);
#endif
--
1.9.1
Send data to an active socket by copying data to the "out" ring. Take
the active socket out_mutex so that only one function can access the
ring at any given time.
If not enough room is available on the ring, rather than returning
immediately or sleep-waiting, spin for up to 5000 cycles. This small
optimization turns out to improve performance significantly.
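Both directions rely on the free-running index arithmetic provided by
the pvcalls protocol header; roughly (a sketch of the semantics of
pvcalls_queued/pvcalls_mask, assuming a power-of-two ring size):

    /*
     * queued(prod, cons, ring_size) ~ prod - cons;       // bytes in use
     * mask(idx, ring_size)          = idx & (ring_size - 1);
     *
     * The indexes are free running, so prod - cons stays correct even
     * after the 32-bit counters wrap; the free space on the "out" ring
     * is ring_size - queued.
     */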
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 109 ++++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 3 ++
2 files changed, 112 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index f83b910..369acde 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -29,6 +29,7 @@
#define PVCALLS_INVALID_ID UINT_MAX
#define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
#define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
+#define PVCALLS_FRONT_MAX_SPIN 5000
struct pvcalls_bedata {
struct xen_pvcalls_front_ring ring;
@@ -88,6 +89,22 @@ static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
return 0;
}
+static int pvcalls_front_write_todo(struct sock_mapping *map)
+{
+ struct pvcalls_data_intf *intf = map->active.ring;
+ RING_IDX cons, prod, size = XEN_FLEX_RING_SIZE(intf->ring_order);
+ int32_t error;
+
+ cons = intf->out_cons;
+ prod = intf->out_prod;
+ error = intf->out_error;
+ if (error == -ENOTCONN)
+ return 0;
+ if (error != 0)
+ return error;
+ return size - pvcalls_queued(prod, cons, size);
+}
+
static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
{
struct xenbus_device *dev = dev_id;
@@ -325,6 +342,98 @@ int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
return ret;
}
+static int __write_ring(struct pvcalls_data_intf *intf,
+ struct pvcalls_data *data,
+ struct iov_iter *msg_iter,
+ size_t len)
+{
+ RING_IDX cons, prod, size, masked_prod, masked_cons;
+ RING_IDX array_size = XEN_FLEX_RING_SIZE(intf->ring_order);
+ int32_t error;
+
+ cons = intf->out_cons;
+ prod = intf->out_prod;
+ error = intf->out_error;
+ /* read indexes before continuing */
+ virt_mb();
+
+ if (error < 0)
+ return error;
+
+ size = pvcalls_queued(prod, cons, array_size);
+ if (size >= array_size)
+ return 0;
+ if (len > array_size - size)
+ len = array_size - size;
+
+ masked_prod = pvcalls_mask(prod, array_size);
+ masked_cons = pvcalls_mask(cons, array_size);
+
+ if (masked_prod < masked_cons) {
+ copy_from_iter(data->out + masked_prod, len, msg_iter);
+ } else {
+ if (len > array_size - masked_prod) {
+ copy_from_iter(data->out + masked_prod,
+ array_size - masked_prod, msg_iter);
+ copy_from_iter(data->out,
+ len - (array_size - masked_prod),
+ msg_iter);
+ } else {
+ copy_from_iter(data->out + masked_prod, len, msg_iter);
+ }
+ }
+ /* write to ring before updating pointer */
+ virt_wmb();
+ intf->out_prod += len;
+
+ return len;
+}
+
+int pvcalls_front_sendmsg(struct socket *sock, struct msghdr *msg,
+ size_t len)
+{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map;
+ int sent = 0, tot_sent = 0;
+ int count = 0, flags;
+
+ if (!pvcalls_front_dev)
+ return -ENOTCONN;
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
+ if (!map)
+ return -ENOTSOCK;
+
+ flags = msg->msg_flags;
+ if (flags & (MSG_CONFIRM|MSG_DONTROUTE|MSG_EOR|MSG_OOB))
+ return -EOPNOTSUPP;
+
+ mutex_lock(&map->active.out_mutex);
+ if ((flags & MSG_DONTWAIT) && !pvcalls_front_write_todo(map)) {
+ mutex_unlock(&map->active.out_mutex);
+ return -EAGAIN;
+ }
+
+again:
+ count++;
+ sent = __write_ring(map->active.ring,
+ &map->active.data, &msg->msg_iter,
+ len);
+ if (sent > 0) {
+ len -= sent;
+ tot_sent += sent;
+ notify_remote_via_irq(map->active.irq);
+ }
+ if (sent >= 0 && len > 0 && count < PVCALLS_FRONT_MAX_SPIN)
+ goto again;
+ if (sent < 0)
+ tot_sent = sent;
+
+ mutex_unlock(&map->active.out_mutex);
+ return tot_sent;
+}
+
int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
{
struct pvcalls_bedata *bedata;
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
index ab4f1da..d937c24 100644
--- a/drivers/xen/pvcalls-front.h
+++ b/drivers/xen/pvcalls-front.h
@@ -13,5 +13,8 @@ int pvcalls_front_bind(struct socket *sock,
int pvcalls_front_accept(struct socket *sock,
struct socket *newsock,
int flags);
+int pvcalls_front_sendmsg(struct socket *sock,
+ struct msghdr *msg,
+ size_t len);
#endif
--
1.9.1
Introduce a waitqueue to allow only one outstanding accept command at
any given time and to implement polling on the passive socket. Introduce
a flags field to keep track of in-flight accept and poll commands.
Send PVCALLS_ACCEPT to the backend. Allocate a new active socket. Make
sure that only one accept command is executed at any given time by
setting PVCALLS_FLAG_ACCEPT_INFLIGHT and waiting on the
inflight_accept_req waitqueue.
Convert the new struct sock_mapping pointer into a uint64_t and use it
as id for the new socket to pass to the backend.
Check if the accept call is non-blocking: in that case, after sending
the ACCEPT command to the backend, store the sock_mapping pointer of
the new struct and the inflight req_id, then return -EAGAIN (the
backend will only respond when there is something to accept). The next
time accept is called, we check whether the ACCEPT command has been
answered: if so, we pick up where we left off; otherwise we return
-EAGAIN again.
Note that, unlike the other commands, we can use
wait_event_interruptible (instead of wait_event) in the case of accept
because we are able to track the req_id of the ACCEPT response we are
waiting for.
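From a hypothetical caller's point of view, the non-blocking flow looks
like this (a sketch; wait_until_pollin is an illustrative placeholder,
not a real helper):

    /* Hypothetical caller sketch -- not part of this series. */
    ret = pvcalls_front_accept(sock, newsock, SOCK_NONBLOCK);
    if (ret == -EAGAIN) {
            /* the ACCEPT is now in flight; poll until the backend answers */
            wait_until_pollin(sock);
            /* finds the stored req_id/accept_map and completes */
            ret = pvcalls_front_accept(sock, newsock, SOCK_NONBLOCK);
    }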
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 111 ++++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 3 ++
2 files changed, 114 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index b2757f5..f83b910 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -65,6 +65,16 @@ struct sock_mapping {
#define PVCALLS_STATUS_BIND 1
#define PVCALLS_STATUS_LISTEN 2
uint8_t status;
+ /*
+ * Internal state-machine flags.
+ * Only one accept operation can be inflight for a socket.
+ * Only one poll operation can be inflight for a given socket.
+ */
+#define PVCALLS_FLAG_ACCEPT_INFLIGHT 0
+ uint8_t flags;
+ uint32_t inflight_req_id;
+ struct sock_mapping *accept_map;
+ wait_queue_head_t inflight_accept_req;
} passive;
};
};
@@ -414,6 +424,107 @@ int pvcalls_front_listen(struct socket *sock, int backlog)
return ret;
}
+int pvcalls_front_accept(struct socket *sock, struct socket *newsock, int flags)
+{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map;
+ struct sock_mapping *map2 = NULL;
+ struct xen_pvcalls_request *req;
+ int notify, req_id, ret, evtchn, nonblock;
+
+ if (!pvcalls_front_dev)
+ return -ENOTCONN;
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
+ if (!map)
+ return -ENOTSOCK;
+
+ if (map->passive.status != PVCALLS_STATUS_LISTEN)
+ return -EINVAL;
+
+ nonblock = flags & SOCK_NONBLOCK;
+ /*
+ * Backend only supports 1 inflight accept request, will return
+ * errors for the others
+ */
+ if (test_and_set_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
+ (void *)&map->passive.flags)) {
+ req_id = READ_ONCE(map->passive.inflight_req_id);
+ if (req_id != PVCALLS_INVALID_ID &&
+ READ_ONCE(bedata->rsp[req_id].req_id) == req_id)
+ goto received;
+ if (nonblock)
+ return -EAGAIN;
+ if (wait_event_interruptible(map->passive.inflight_accept_req,
+ !test_and_set_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
+ (void *)&map->passive.flags)))
+ return -EINTR;
+ }
+
+ spin_lock(&bedata->pvcallss_lock);
+ ret = get_request(bedata, &req_id);
+ if (ret < 0) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+ map2 = kzalloc(sizeof(*map2), GFP_KERNEL);
+ if (map2 == NULL) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return -ENOMEM;
+ }
+ ret = create_active(map2, &evtchn);
+ if (ret < 0) {
+ kfree(map2);
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+ list_add_tail(&map2->list, &bedata->socket_mappings);
+
+ req = RING_GET_REQUEST(&bedata->ring, req_id);
+ req->req_id = req_id;
+ req->cmd = PVCALLS_ACCEPT;
+ req->u.accept.id = (uint64_t) map;
+ req->u.accept.ref = map2->active.ref;
+ req->u.accept.id_new = (uint64_t) map2;
+ req->u.accept.evtchn = evtchn;
+ map->passive.accept_map = map2;
+
+ bedata->ring.req_prod_pvt++;
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
+ spin_unlock(&bedata->pvcallss_lock);
+ if (notify)
+ notify_remote_via_irq(bedata->irq);
+ if (nonblock) {
+ WRITE_ONCE(map->passive.inflight_req_id, req_id);
+ return -EAGAIN;
+ }
+
+ if (wait_event_interruptible(bedata->inflight_req,
+ READ_ONCE(bedata->rsp[req_id].req_id) == req_id))
+ return -EINTR;
+
+received:
+ map2 = map->passive.accept_map;
+ map2->sock = newsock;
+ newsock->sk = kzalloc(sizeof(*newsock->sk), GFP_KERNEL);
+ if (!newsock->sk) {
+ WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
+ WRITE_ONCE(map->passive.inflight_req_id, PVCALLS_INVALID_ID);
+ pvcalls_front_free_map(bedata, map2);
+ kfree(map2);
+ return -ENOMEM;
+ }
+ WRITE_ONCE(newsock->sk->sk_send_head, (void *)map2);
+
+ clear_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT, (void *)&map->passive.flags);
+ wake_up(&map->passive.inflight_accept_req);
+
+ ret = bedata->rsp[req_id].ret;
+ /* read ret, then set this rsp slot to be reused */
+ smp_mb();
+ WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
+ WRITE_ONCE(map->passive.inflight_req_id, PVCALLS_INVALID_ID);
+ return ret;
+}
+
static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
index aa8fe10..ab4f1da 100644
--- a/drivers/xen/pvcalls-front.h
+++ b/drivers/xen/pvcalls-front.h
@@ -10,5 +10,8 @@ int pvcalls_front_bind(struct socket *sock,
struct sockaddr *addr,
int addr_len);
int pvcalls_front_listen(struct socket *sock, int backlog);
+int pvcalls_front_accept(struct socket *sock,
+ struct socket *newsock,
+ int flags);
#endif
--
1.9.1
Introduce a data structure named pvcalls_bedata. It contains pointers to
the command ring, the event channel, a list of active sockets and a list
of passive sockets. List accesses are protected by a spin_lock.
Introduce a waitqueue to allow waiting for a response on commands sent
to the backend.
Introduce an array of struct xen_pvcalls_response to store commands
responses.
Implement pvcalls frontend removal function. Go through the list of
active and passive sockets and free them all, one at a time.
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index a8d38c2..a126195 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -20,6 +20,29 @@
#include <xen/xenbus.h>
#include <xen/interface/io/pvcalls.h>
+#define PVCALLS_INVALID_ID UINT_MAX
+#define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
+#define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
+
+struct pvcalls_bedata {
+ struct xen_pvcalls_front_ring ring;
+ grant_ref_t ref;
+ int irq;
+
+ struct list_head socket_mappings;
+ struct list_head socketpass_mappings;
+ spinlock_t pvcallss_lock;
+
+ wait_queue_head_t inflight_req;
+ struct xen_pvcalls_response rsp[PVCALLS_NR_REQ_PER_RING];
+};
+static struct xenbus_device *pvcalls_front_dev;
+
+static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
+{
+ return IRQ_HANDLED;
+}
+
static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
@@ -27,6 +50,34 @@
static int pvcalls_front_remove(struct xenbus_device *dev)
{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map = NULL, *n;
+
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ list_for_each_entry_safe(map, n, &bedata->socket_mappings, list) {
+ mutex_lock(&map->active.in_mutex);
+ mutex_lock(&map->active.out_mutex);
+ pvcalls_front_free_map(bedata, map);
+ mutex_unlock(&map->active.out_mutex);
+ mutex_unlock(&map->active.in_mutex);
+ kfree(map);
+ }
+ list_for_each_entry_safe(map, n, &bedata->socketpass_mappings, list) {
+ spin_lock(&bedata->pvcallss_lock);
+ list_del_init(&map->list);
+ spin_unlock(&bedata->pvcallss_lock);
+ kfree(map);
+ }
+ if (bedata->irq > 0)
+ unbind_from_irqhandler(bedata->irq, dev);
+ if (bedata->ref >= 0)
+ gnttab_end_foreign_access(bedata->ref, 0, 0);
+ kfree(bedata->ring.sring);
+ kfree(bedata);
+ dev_set_drvdata(&dev->dev, NULL);
+ xenbus_switch_state(dev, XenbusStateClosed);
+ pvcalls_front_dev = NULL;
return 0;
}
--
1.9.1
Send PVCALLS_CONNECT to the backend. Allocate a new ring and evtchn for
the active socket.
Introduce fields in struct sock_mapping to keep track of active sockets.
Introduce a waitqueue to allow the frontend to wait on data coming from
the backend on the active socket (recvmsg command).
Two mutexes (one for reads and one for writes) will be used to protect
the active socket in and out rings from concurrent accesses.
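The layout that create_active (below) sets up for each active socket
can be pictured as follows (a sketch):

    /*
     * map->active.ring          one page (struct pvcalls_data_intf),
     *                           granted to the backend as map->active.ref
     * map->active.ring->ref[i]  grants for the 2^PVCALLS_RING_ORDER
     *                           data pages
     * map->active.data.in       first XEN_FLEX_RING_SIZE(order) bytes
     *                           of the data pages
     * map->active.data.out      second half, used for writes
     */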
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 146 ++++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 2 +
2 files changed, 148 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 7c4a7cb..379b8fb 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -13,6 +13,10 @@
*/
#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/socket.h>
+
+#include <net/sock.h>
#include <xen/events.h>
#include <xen/grant_table.h>
@@ -44,6 +48,18 @@ struct sock_mapping {
bool active_socket;
struct list_head list;
struct socket *sock;
+ union {
+ struct {
+ int irq;
+ grant_ref_t ref;
+ struct pvcalls_data_intf *ring;
+ struct pvcalls_data data;
+ struct mutex in_mutex;
+ struct mutex out_mutex;
+
+ wait_queue_head_t inflight_conn_req;
+ } active;
+ };
};
static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
@@ -97,6 +113,18 @@ static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
return IRQ_HANDLED;
}
+static irqreturn_t pvcalls_front_conn_handler(int irq, void *sock_map)
+{
+ struct sock_mapping *map = sock_map;
+
+ if (map == NULL)
+ return IRQ_HANDLED;
+
+ wake_up_interruptible(&map->active.inflight_conn_req);
+
+ return IRQ_HANDLED;
+}
+
int pvcalls_front_socket(struct socket *sock)
{
struct pvcalls_bedata *bedata;
@@ -162,6 +190,124 @@ int pvcalls_front_socket(struct socket *sock)
return ret;
}
+static int create_active(struct sock_mapping *map, int *evtchn)
+{
+ void *bytes;
+ int ret = -ENOMEM, irq = -1, i;
+
+ *evtchn = -1; /* so the error path can tell if it was allocated */
+
+ init_waitqueue_head(&map->active.inflight_conn_req);
+
+ map->active.ring = (struct pvcalls_data_intf *)
+ __get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (map->active.ring == NULL)
+ goto out_error;
+ memset(map->active.ring, 0, XEN_PAGE_SIZE);
+ map->active.ring->ring_order = PVCALLS_RING_ORDER;
+ bytes = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ map->active.ring->ring_order);
+ if (bytes == NULL)
+ goto out_error;
+ for (i = 0; i < (1 << map->active.ring->ring_order); i++)
+ map->active.ring->ref[i] = gnttab_grant_foreign_access(
+ pvcalls_front_dev->otherend_id,
+ pfn_to_gfn(virt_to_pfn(bytes) + i), 0);
+
+ map->active.ref = gnttab_grant_foreign_access(
+ pvcalls_front_dev->otherend_id,
+ pfn_to_gfn(virt_to_pfn((void *)map->active.ring)), 0);
+
+ map->active.data.in = bytes;
+ map->active.data.out = bytes +
+ XEN_FLEX_RING_SIZE(map->active.ring->ring_order);
+
+ ret = xenbus_alloc_evtchn(pvcalls_front_dev, evtchn);
+ if (ret)
+ goto out_error;
+ irq = bind_evtchn_to_irqhandler(*evtchn, pvcalls_front_conn_handler,
+ 0, "pvcalls-frontend", map);
+ if (irq < 0) {
+ ret = irq;
+ goto out_error;
+ }
+
+ map->active.irq = irq;
+ map->active_socket = true;
+ mutex_init(&map->active.in_mutex);
+ mutex_init(&map->active.out_mutex);
+
+ return 0;
+
+out_error:
+ if (irq >= 0)
+ unbind_from_irqhandler(irq, map);
+ else if (*evtchn >= 0)
+ xenbus_free_evtchn(pvcalls_front_dev, *evtchn);
+ free_pages((unsigned long)map->active.data.in, PVCALLS_RING_ORDER);
+ free_page((unsigned long)map->active.ring);
+ return ret;
+}
+
+int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
+ int addr_len, int flags)
+{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map = NULL;
+ struct xen_pvcalls_request *req;
+ int notify, req_id, ret, evtchn;
+
+ if (!pvcalls_front_dev)
+ return -ENETUNREACH;
+ if (addr->sa_family != AF_INET || sock->type != SOCK_STREAM)
+ return -ENOTSUPP;
+
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
+ if (!map)
+ return -EINVAL;
+
+ spin_lock(&bedata->pvcallss_lock);
+ ret = get_request(bedata, &req_id);
+ if (ret < 0) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+ ret = create_active(map, &evtchn);
+ if (ret < 0) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+
+ req = RING_GET_REQUEST(&bedata->ring, req_id);
+ req->req_id = req_id;
+ req->cmd = PVCALLS_CONNECT;
+ req->u.connect.id = (uint64_t)map;
+ memcpy(req->u.connect.addr, addr, sizeof(*addr));
+ req->u.connect.len = addr_len;
+ req->u.connect.flags = flags;
+ req->u.connect.ref = map->active.ref;
+ req->u.connect.evtchn = evtchn;
+
+ map->sock = sock;
+
+ bedata->ring.req_prod_pvt++;
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
+ spin_unlock(&bedata->pvcallss_lock);
+
+ if (notify)
+ notify_remote_via_irq(bedata->irq);
+
+ wait_event(bedata->inflight_req,
+ READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
+
+ ret = bedata->rsp[req_id].ret;
+ /* read ret, then set this rsp slot to be reused */
+ smp_mb();
+ WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
+ return ret;
+}
+
static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
index b7dabed..63b0417 100644
--- a/drivers/xen/pvcalls-front.h
+++ b/drivers/xen/pvcalls-front.h
@@ -4,5 +4,7 @@
#include <linux/net.h>
int pvcalls_front_socket(struct socket *sock);
+int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
+ int addr_len, int flags);
#endif
--
1.9.1
Send a PVCALLS_SOCKET command to the backend, use the masked
req_prod_pvt as req_id. This way, req_id is guaranteed to be between 0
and PVCALLS_NR_REQ_PER_RING - 1. We already have a slot in the rsp array
ready for the response, and there cannot be two outstanding responses
with the same req_id.
Wait for the response by waiting on the inflight_req waitqueue and
check for the req_id field in rsp[req_id]. Use atomic accesses and
barriers to read the field. Note that the barriers are simple smp
barriers (as opposed to virt barriers) because they are for internal
frontend synchronization, not frontend<->backend communication.
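The pairing can be sketched as follows (condensed from the code below):

    /*
     * Writer (pvcalls_front_event_handler):
     *     memcpy(body of bedata->rsp[req_id], body of rsp, ...);
     *     smp_wmb();          // body must be visible before req_id
     *     WRITE_ONCE(bedata->rsp[req_id].req_id, rsp->req_id);
     *
     * Reader (any command waiting in wait_event):
     *     wait until READ_ONCE(bedata->rsp[req_id].req_id) == req_id;
     *     ret = bedata->rsp[req_id].ret;   // body is complete by now
     */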
Once a response is received, clear the corresponding rsp slot by setting
req_id to PVCALLS_INVALID_ID. Note that PVCALLS_INVALID_ID is invalid
only from the frontend point of view. It is not part of the PVCalls
protocol.
pvcalls_front_event_handler is in charge of copying responses from the
ring to the appropriate rsp slot. It is done by copying the body of the
response first, then by copying req_id atomically. After the copies,
wake up anybody waiting on waitqueue.
pvcallss_lock protects accesses to the ring.
Create a new struct sock_mapping and convert the pointer into a
uint64_t and use it as id for the new socket to pass to the backend. The
struct will be fully initialized later on connect or bind. In this patch
the struct sock_mapping is empty, the fields will be added by the next
patch.
sock->sk->sk_send_head is not used for ip sockets: reuse the field to
store a pointer to the struct sock_mapping corresponding to the socket.
This way, we can easily get the struct sock_mapping from the struct
socket.
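In code, the convention used throughout the series is simply:

    /* at socket creation time: */
    WRITE_ONCE(sock->sk->sk_send_head, (void *)map);

    /* in every later entry point: */
    map = (struct sock_mapping *)READ_ONCE(sock->sk->sk_send_head);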
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 119 ++++++++++++++++++++++++++++++++++++++++++++
drivers/xen/pvcalls-front.h | 8 +++
2 files changed, 127 insertions(+)
create mode 100644 drivers/xen/pvcalls-front.h
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 2afe36d..7c4a7cb 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -20,6 +20,8 @@
#include <xen/xenbus.h>
#include <xen/interface/io/pvcalls.h>
+#include "pvcalls-front.h"
+
#define PVCALLS_INVALID_ID UINT_MAX
#define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
#define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
@@ -38,11 +40,128 @@ struct pvcalls_bedata {
};
static struct xenbus_device *pvcalls_front_dev;
+struct sock_mapping {
+ bool active_socket;
+ struct list_head list;
+ struct socket *sock;
+};
+
+static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
+{
+ *req_id = bedata->ring.req_prod_pvt & (RING_SIZE(&bedata->ring) - 1);
+ if (RING_FULL(&bedata->ring) ||
+ READ_ONCE(bedata->rsp[*req_id].req_id) != PVCALLS_INVALID_ID)
+ return -EAGAIN;
+ return 0;
+}
+
static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
{
+ struct xenbus_device *dev = dev_id;
+ struct pvcalls_bedata *bedata;
+ struct xen_pvcalls_response *rsp;
+ uint8_t *src, *dst;
+ int req_id = 0, more = 0, done = 0;
+
+ if (dev == NULL)
+ return IRQ_HANDLED;
+
+ bedata = dev_get_drvdata(&dev->dev);
+ if (bedata == NULL)
+ return IRQ_HANDLED;
+
+again:
+ while (RING_HAS_UNCONSUMED_RESPONSES(&bedata->ring)) {
+ rsp = RING_GET_RESPONSE(&bedata->ring, bedata->ring.rsp_cons);
+
+ req_id = rsp->req_id;
+ dst = (uint8_t *)&bedata->rsp[req_id] + sizeof(rsp->req_id);
+ src = (uint8_t *)rsp + sizeof(rsp->req_id);
+ memcpy(dst, src, sizeof(*rsp) - sizeof(rsp->req_id));
+ /*
+ * First copy the rest of the data, then req_id. It is
+ * paired with the barrier when accessing bedata->rsp.
+ */
+ smp_wmb();
+ WRITE_ONCE(bedata->rsp[req_id].req_id, rsp->req_id);
+
+ done = 1;
+ bedata->ring.rsp_cons++;
+ }
+
+ RING_FINAL_CHECK_FOR_RESPONSES(&bedata->ring, more);
+ if (more)
+ goto again;
+ if (done)
+ wake_up(&bedata->inflight_req);
return IRQ_HANDLED;
}
+int pvcalls_front_socket(struct socket *sock)
+{
+ struct pvcalls_bedata *bedata;
+ struct sock_mapping *map = NULL;
+ struct xen_pvcalls_request *req;
+ int notify, req_id, ret;
+
+ if (!pvcalls_front_dev)
+ return -EACCES;
+ /*
+ * PVCalls only supports domain AF_INET,
+ * type SOCK_STREAM and protocol 0 sockets for now.
+ *
+ * Check socket type here, AF_INET and protocol checks are done
+ * by the caller.
+ */
+ if (sock->type != SOCK_STREAM)
+ return -ENOTSUPP;
+
+ bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
+
+ map = kzalloc(sizeof(*map), GFP_KERNEL);
+ if (map == NULL)
+ return -ENOMEM;
+ /*
+ * sock->sk->sk_send_head is not used for ip sockets: reuse the
+ * field to store a pointer to the struct sock_mapping
+ * corresponding to the socket. This way, we can easily get the
+ * struct sock_mapping from the struct socket.
+ */
+ WRITE_ONCE(sock->sk->sk_send_head, (void *)map);
+
+ spin_lock(&bedata->pvcallss_lock);
+ list_add_tail(&map->list, &bedata->socket_mappings);
+
+ ret = get_request(bedata, &req_id);
+ if (ret < 0) {
+ spin_unlock(&bedata->pvcallss_lock);
+ return ret;
+ }
+ req = RING_GET_REQUEST(&bedata->ring, req_id);
+ req->req_id = req_id;
+ req->cmd = PVCALLS_SOCKET;
+ req->u.socket.id = (uint64_t) map;
+ req->u.socket.domain = AF_INET;
+ req->u.socket.type = SOCK_STREAM;
+ req->u.socket.protocol = 0;
+
+ bedata->ring.req_prod_pvt++;
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
+ spin_unlock(&bedata->pvcallss_lock);
+ if (notify)
+ notify_remote_via_irq(bedata->irq);
+
+ wait_event(bedata->inflight_req,
+ READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
+
+ ret = bedata->rsp[req_id].ret;
+ /* read ret, then set this rsp slot to be reused */
+ smp_mb();
+ WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
+
+ return ret;
+}
+
static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
new file mode 100644
index 0000000..b7dabed
--- /dev/null
+++ b/drivers/xen/pvcalls-front.h
@@ -0,0 +1,8 @@
+#ifndef __PVCALLS_FRONT_H__
+#define __PVCALLS_FRONT_H__
+
+#include <linux/net.h>
+
+int pvcalls_front_socket(struct socket *sock);
+
+#endif
--
1.9.1
Introduce a xenbus frontend for the pvcalls protocol, as defined by
https://xenbits.xen.org/docs/unstable/misc/pvcalls.html.
This patch only adds the stubs, the code will be added by the following
patches.
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 61 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
create mode 100644 drivers/xen/pvcalls-front.c
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
new file mode 100644
index 0000000..a8d38c2
--- /dev/null
+++ b/drivers/xen/pvcalls-front.c
@@ -0,0 +1,61 @@
+/*
+ * (c) 2017 Stefano Stabellini <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+
+#include <xen/events.h>
+#include <xen/grant_table.h>
+#include <xen/xen.h>
+#include <xen/xenbus.h>
+#include <xen/interface/io/pvcalls.h>
+
+static const struct xenbus_device_id pvcalls_front_ids[] = {
+ { "pvcalls" },
+ { "" }
+};
+
+static int pvcalls_front_remove(struct xenbus_device *dev)
+{
+ return 0;
+}
+
+static int pvcalls_front_probe(struct xenbus_device *dev,
+ const struct xenbus_device_id *id)
+{
+ return 0;
+}
+
+static void pvcalls_front_changed(struct xenbus_device *dev,
+ enum xenbus_state backend_state)
+{
+}
+
+static struct xenbus_driver pvcalls_front_driver = {
+ .ids = pvcalls_front_ids,
+ .probe = pvcalls_front_probe,
+ .remove = pvcalls_front_remove,
+ .otherend_changed = pvcalls_front_changed,
+};
+
+static int __init pvcalls_frontend_init(void)
+{
+ if (!xen_domain())
+ return -ENODEV;
+
+ pr_info("Initialising Xen pvcalls frontend driver\n");
+
+ return xenbus_register_frontend(&pvcalls_front_driver);
+}
+
+module_init(pvcalls_frontend_init);
--
1.9.1
Implement the probe function for the pvcalls frontend. Read the
supported versions, max-page-order and function-calls nodes from
xenstore.
Only one frontend<->backend connection is supported at any given time
for a guest. Store the active frontend device to a static pointer.
Introduce a stub function for the event handler.
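The resulting xenstore handshake can be summarized as follows (a
sketch; the node names are the ones used by the code below):

    /*
     * read from the backend (dev->otherend):
     *     versions, max-page-order, function-calls
     * written to the frontend area (dev->nodename), in one transaction:
     *     version = 1, ring-ref = <grant ref>, port = <event channel>
     */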
Signed-off-by: Stefano Stabellini <[email protected]>
CC: [email protected]
CC: [email protected]
---
drivers/xen/pvcalls-front.c | 130 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 130 insertions(+)
diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index a126195..2afe36d 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -84,12 +84,142 @@ static int pvcalls_front_remove(struct xenbus_device *dev)
static int pvcalls_front_probe(struct xenbus_device *dev,
const struct xenbus_device_id *id)
{
+ int ret = -ENOMEM, evtchn, i;
+ unsigned int max_page_order, function_calls, len;
+ char *versions;
+ grant_ref_t gref_head = 0;
+ struct xenbus_transaction xbt;
+ struct pvcalls_bedata *bedata = NULL;
+ struct xen_pvcalls_sring *sring;
+
+ if (pvcalls_front_dev != NULL) {
+ dev_err(&dev->dev, "only one PV Calls connection supported\n");
+ return -EINVAL;
+ }
+
+ versions = xenbus_read(XBT_NIL, dev->otherend, "versions", &len);
+ if (IS_ERR(versions))
+ return PTR_ERR(versions);
+ if (!len) {
+ kfree(versions);
+ return -EINVAL;
+ }
+ if (strcmp(versions, "1")) {
+ kfree(versions);
+ return -EINVAL;
+ }
+ kfree(versions);
+ max_page_order = xenbus_read_unsigned(dev->otherend,
+ "max-page-order", 0);
+ if (max_page_order < PVCALLS_RING_ORDER)
+ return -ENODEV;
+ function_calls = xenbus_read_unsigned(dev->otherend,
+ "function-calls", 0);
+ if (function_calls != 1)
+ return -ENODEV;
+ pr_info("%s max-page-order is %u\n", __func__, max_page_order);
+
+ bedata = kzalloc(sizeof(struct pvcalls_bedata), GFP_KERNEL);
+ if (!bedata)
+ return -ENOMEM;
+
+ dev_set_drvdata(&dev->dev, bedata);
+ pvcalls_front_dev = dev;
+ init_waitqueue_head(&bedata->inflight_req);
+ for (i = 0; i < PVCALLS_NR_REQ_PER_RING; i++)
+ bedata->rsp[i].req_id = PVCALLS_INVALID_ID;
+
+ sring = (struct xen_pvcalls_sring *) __get_free_page(GFP_KERNEL |
+ __GFP_ZERO);
+ if (!sring)
+ goto error;
+ SHARED_RING_INIT(sring);
+ FRONT_RING_INIT(&bedata->ring, sring, XEN_PAGE_SIZE);
+
+ ret = xenbus_alloc_evtchn(dev, &evtchn);
+ if (ret)
+ goto error;
+
+ bedata->irq = bind_evtchn_to_irqhandler(evtchn,
+ pvcalls_front_event_handler,
+ 0, "pvcalls-frontend", dev);
+ if (bedata->irq < 0) {
+ ret = bedata->irq;
+ goto error;
+ }
+
+ ret = gnttab_alloc_grant_references(1, &gref_head);
+ if (ret < 0)
+ goto error;
+ bedata->ref = gnttab_claim_grant_reference(&gref_head);
+ if (bedata->ref < 0) {
+ ret = bedata->ref;
+ goto error;
+ }
+ gnttab_grant_foreign_access_ref(bedata->ref, dev->otherend_id,
+ virt_to_gfn((void *)sring), 0);
+
+ again:
+ ret = xenbus_transaction_start(&xbt);
+ if (ret) {
+ xenbus_dev_fatal(dev, ret, "starting transaction");
+ goto error;
+ }
+ ret = xenbus_printf(xbt, dev->nodename, "version", "%u", 1);
+ if (ret)
+ goto error_xenbus;
+ ret = xenbus_printf(xbt, dev->nodename, "ring-ref", "%d", bedata->ref);
+ if (ret)
+ goto error_xenbus;
+ ret = xenbus_printf(xbt, dev->nodename, "port", "%u",
+ evtchn);
+ if (ret)
+ goto error_xenbus;
+ ret = xenbus_transaction_end(xbt, 0);
+ if (ret) {
+ if (ret == -EAGAIN)
+ goto again;
+ xenbus_dev_fatal(dev, ret, "completing transaction");
+ goto error;
+ }
+
+ INIT_LIST_HEAD(&bedata->socket_mappings);
+ INIT_LIST_HEAD(&bedata->socketpass_mappings);
+ spin_lock_init(&bedata->pvcallss_lock);
+ xenbus_switch_state(dev, XenbusStateInitialised);
+
return 0;
+
+ error_xenbus:
+ xenbus_transaction_end(xbt, 1);
+ xenbus_dev_fatal(dev, ret, "writing xenstore");
+ error:
+ pvcalls_front_remove(dev);
+ return ret;
}
static void pvcalls_front_changed(struct xenbus_device *dev,
enum xenbus_state backend_state)
{
+ switch (backend_state) {
+ case XenbusStateReconfiguring:
+ case XenbusStateReconfigured:
+ case XenbusStateInitialising:
+ case XenbusStateInitialised:
+ case XenbusStateUnknown:
+ break;
+
+ case XenbusStateInitWait:
+ break;
+
+ case XenbusStateConnected:
+ xenbus_switch_state(dev, XenbusStateConnected);
+ break;
+
+ case XenbusStateClosed:
+ if (dev->state == XenbusStateClosed)
+ break;
+ /* Missed the backend's CLOSING state -- fallthrough */
+ case XenbusStateClosing:
+ xenbus_frontend_closed(dev);
+ break;
+ }
}
static struct xenbus_driver pvcalls_front_driver = {
--
1.9.1
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Hi all,
>
> this series introduces the frontend for the newly introduced PV Calls
> protocol.
>
> PV Calls is a paravirtualized protocol that allows the implementation of
> a set of POSIX functions in a different domain. The PV Calls frontend
> sends POSIX function calls to the backend, which implements them and
> returns a value to the frontend and acts on the function call.
>
> For more information about PV Calls, please read:
>
> https://xenbits.xen.org/docs/unstable/misc/pvcalls.html
>
> This patch series only implements the frontend driver. It doesn't
> attempt to redirect POSIX calls to it. The functions exported in
> pvcalls-front.h are meant to be used for that. A separate patch series
> will be sent to use them and hook them into the system.
Stefano,
Should this be reviewed or are you going to send another version?
-boris
On Thu, 10 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > [snip]
>
> Stefano,
>
>
> Should this be reviewed or are you going to send another version?
Sorry for the delay, I am dealing with a personal issue at the moment
and I didn't manage to address your valid comments on the release
function implementation.
However in this version I did address all other comments, so it would be
good if you could take a look at it, bearing in mind that the problem on
release is still unaddressed here.
Thanks,
Stefano
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Introduce a xenbus frontend for the pvcalls protocol, as defined by
> https://xenbits.xen.org/docs/unstable/misc/pvcalls.html.
>
> This patch only adds the stubs, the code will be added by the following
> patches.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
Reviewed-by: Boris Ostrovsky <[email protected]>
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Introduce a data structure named pvcalls_bedata. It contains pointers to
> the command ring, the event channel, a list of active sockets and a list
> of passive sockets. Lists accesses are protected by a spin_lock.
>
> Introduce a waitqueue to allow waiting for a response on commands sent
> to the backend.
>
> Introduce an array of struct xen_pvcalls_response to store commands
> responses.
>
> Implement pvcalls frontend removal function. Go through the list of
> active and passive sockets and free them all, one at a time.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 51 insertions(+)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index a8d38c2..a126195 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -20,6 +20,29 @@
> #include <xen/xenbus.h>
> #include <xen/interface/io/pvcalls.h>
>
> +#define PVCALLS_INVALID_ID UINT_MAX
> +#define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
> +#define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
> +
> +struct pvcalls_bedata {
> + struct xen_pvcalls_front_ring ring;
> + grant_ref_t ref;
> + int irq;
> +
> + struct list_head socket_mappings;
> + struct list_head socketpass_mappings;
> + spinlock_t pvcallss_lock;
In the backend this is called socket_lock and (subjectively) it sounds
like a better name here too.
> +
> + wait_queue_head_t inflight_req;
> + struct xen_pvcalls_response rsp[PVCALLS_NR_REQ_PER_RING];
> +};
> +static struct xenbus_device *pvcalls_front_dev;
> +
> +static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> +{
> + return IRQ_HANDLED;
> +}
> +
> static const struct xenbus_device_id pvcalls_front_ids[] = {
> { "pvcalls" },
> { "" }
> @@ -27,6 +50,34 @@
>
> static int pvcalls_front_remove(struct xenbus_device *dev)
> {
> + struct pvcalls_bedata *bedata;
> + struct sock_mapping *map = NULL, *n;
> +
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + list_for_each_entry_safe(map, n, &bedata->socket_mappings, list) {
> + mutex_lock(&map->active.in_mutex);
> + mutex_lock(&map->active.out_mutex);
> + pvcalls_front_free_map(bedata, map);
> + mutex_unlock(&map->active.out_mutex);
> + mutex_unlock(&map->active.in_mutex);
> + kfree(map);
I think this is the same issue as the one discussed for some other patch
--- unlocking and then immediately freeing a lock.
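Something like this instead (untested sketch): wait for any holders to
drop the mutexes rather than taking and releasing them around the free,
so we never free memory that still backs a locked mutex:

	list_for_each_entry_safe(map, n, &bedata->socket_mappings, list) {
		/* wait until no sendmsg/recvmsg is inside the rings */
		while (mutex_is_locked(&map->active.in_mutex) ||
		       mutex_is_locked(&map->active.out_mutex))
			cpu_relax();

		pvcalls_front_free_map(bedata, map);
		kfree(map);
	}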
> + }
> + list_for_each_entry_safe(map, n, &bedata->socketpass_mappings, list) {
> + spin_lock(&bedata->pvcallss_lock);
> + list_del_init(&map->list);
> + spin_unlock(&bedata->pvcallss_lock);
> + kfree(map);
> + }
> + if (bedata->irq > 0)
> + unbind_from_irqhandler(bedata->irq, dev);
> + if (bedata->ref >= 0)
> + gnttab_end_foreign_access(bedata->ref, 0, 0);
> + kfree(bedata->ring.sring);
> + kfree(bedata);
> + dev_set_drvdata(&dev->dev, NULL);
> + xenbus_switch_state(dev, XenbusStateClosed);
Should we first move the state to Closed and then free things up? Or it
doesn't matter?
-boris
> + pvcalls_front_dev = NULL;
> return 0;
> }
>
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Implement the probe function for the pvcalls frontend. Read the
> supported versions, max-page-order and function-calls nodes from
> xenstore.
>
> Only one frontend<->backend connection is supported at any given time
> for a guest. Store the active frontend device to a static pointer.
>
> Introduce a stub function for the event handler.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 130 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 130 insertions(+)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index a126195..2afe36d 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -84,12 +84,142 @@ static int pvcalls_front_remove(struct xenbus_device *dev)
> static int pvcalls_front_probe(struct xenbus_device *dev,
> const struct xenbus_device_id *id)
> {
> + int ret = -ENOMEM, evtchn, i;
> + unsigned int max_page_order, function_calls, len;
> + char *versions;
> + grant_ref_t gref_head = 0;
> + struct xenbus_transaction xbt;
> + struct pvcalls_bedata *bedata = NULL;
> + struct xen_pvcalls_sring *sring;
> +
> + if (pvcalls_front_dev != NULL) {
> + dev_err(&dev->dev, "only one PV Calls connection supported\n");
> + return -EINVAL;
> + }
> +
> + versions = xenbus_read(XBT_NIL, dev->otherend, "versions", &len);
> + if (!len)
> + return -EINVAL;
> + if (strcmp(versions, "1")) {
> + kfree(versions);
> + return -EINVAL;
> + }
> + kfree(versions);
> + max_page_order = xenbus_read_unsigned(dev->otherend,
> + "max-page-order", 0);
> + if (max_page_order < PVCALLS_RING_ORDER)
> + return -ENODEV;
> + function_calls = xenbus_read_unsigned(dev->otherend,
> + "function-calls", 0);
> + if (function_calls != 1)
XENBUS_FUNCTIONS_CALLS
> + return -ENODEV;
> + pr_info("%s max-page-order is %u\n", __func__, max_page_order);
> +
> + bedata = kzalloc(sizeof(struct pvcalls_bedata), GFP_KERNEL);
> + if (!bedata)
> + return -ENOMEM;
> +
> + dev_set_drvdata(&dev->dev, bedata);
> + pvcalls_front_dev = dev;
> + init_waitqueue_head(&bedata->inflight_req);
> + for (i = 0; i < PVCALLS_NR_REQ_PER_RING; i++)
> + bedata->rsp[i].req_id = PVCALLS_INVALID_ID;
> +
> + sring = (struct xen_pvcalls_sring *) __get_free_page(GFP_KERNEL |
> + __GFP_ZERO);
> + if (!sring)
> + goto error;
> + SHARED_RING_INIT(sring);
> + FRONT_RING_INIT(&bedata->ring, sring, XEN_PAGE_SIZE);
> +
> + ret = xenbus_alloc_evtchn(dev, &evtchn);
> + if (ret)
> + goto error;
> +
> + bedata->irq = bind_evtchn_to_irqhandler(evtchn,
> + pvcalls_front_event_handler,
> + 0, "pvcalls-frontend", dev);
> + if (bedata->irq < 0) {
In the previous patch you free up irq if it is strictly >0. Should have
been >=0.
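I.e., to match this check, the remove path would be:

	if (bedata->irq >= 0)
		unbind_from_irqhandler(bedata->irq, dev);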
> + ret = bedata->irq;
> + goto error;
> + }
> +
> + ret = gnttab_alloc_grant_references(1, &gref_head);
> + if (ret < 0)
> + goto error;
> + bedata->ref = gnttab_claim_grant_reference(&gref_head);
> + if (bedata->ref < 0) {
> + ret = bedata->ref;
> + goto error;
> + }
> + gnttab_grant_foreign_access_ref(bedata->ref, dev->otherend_id,
> + virt_to_gfn((void *)sring), 0);
> +
> + again:
> + ret = xenbus_transaction_start(&xbt);
> + if (ret) {
> + xenbus_dev_fatal(dev, ret, "starting transaction");
> + goto error;
> + }
> + ret = xenbus_printf(xbt, dev->nodename, "version", "%u", 1);
> + if (ret)
> + goto error_xenbus;
> + ret = xenbus_printf(xbt, dev->nodename, "ring-ref", "%d", bedata->ref);
> + if (ret)
> + goto error_xenbus;
> + ret = xenbus_printf(xbt, dev->nodename, "port", "%u",
> + evtchn);
> + if (ret)
> + goto error_xenbus;
> + ret = xenbus_transaction_end(xbt, 0);
> + if (ret) {
> + if (ret == -EAGAIN)
> + goto again;
> + xenbus_dev_fatal(dev, ret, "completing transaction");
> + goto error;
> + }
> +
> + INIT_LIST_HEAD(&bedata->socket_mappings);
> + INIT_LIST_HEAD(&bedata->socketpass_mappings);
> + spin_lock_init(&bedata->pvcallss_lock);
> + xenbus_switch_state(dev, XenbusStateInitialised);
> +
> return 0;
> +
> + error_xenbus:
> + xenbus_transaction_end(xbt, 1);
> + xenbus_dev_fatal(dev, ret, "writing xenstore");
> + error:
> + pvcalls_front_remove(dev);
Going back to the previous patch --- it seems to me that on some error
paths certain structures may not be usable in pvcalls_front_remove().
For example, socket_mappings list is definitely not initialized.
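One way out (untested sketch): do the cheap initializations right after
the allocation, before anything can fail:

	bedata = kzalloc(sizeof(struct pvcalls_bedata), GFP_KERNEL);
	if (!bedata)
		return -ENOMEM;

	INIT_LIST_HEAD(&bedata->socket_mappings);
	INIT_LIST_HEAD(&bedata->socketpass_mappings);
	spin_lock_init(&bedata->pvcallss_lock);
	dev_set_drvdata(&dev->dev, bedata);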
-boris
> + return ret;
> }
>
> static void pvcalls_front_changed(struct xenbus_device *dev,
> enum xenbus_state backend_state)
> {
> + switch (backend_state) {
> + case XenbusStateReconfiguring:
> + case XenbusStateReconfigured:
> + case XenbusStateInitialising:
> + case XenbusStateInitialised:
> + case XenbusStateUnknown:
> + break;
> +
> + case XenbusStateInitWait:
> + break;
> +
> + case XenbusStateConnected:
> + xenbus_switch_state(dev, XenbusStateConnected);
> + break;
> +
> + case XenbusStateClosed:
> + if (dev->state == XenbusStateClosed)
> + break;
> + /* Missed the backend's CLOSING state -- fallthrough */
> + case XenbusStateClosing:
> + xenbus_frontend_closed(dev);
> + break;
> + }
> }
>
> static struct xenbus_driver pvcalls_front_driver = {
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Send a PVCALLS_SOCKET command to the backend, use the masked
> req_prod_pvt as req_id. This way, req_id is guaranteed to be between 0
> and PVCALLS_NR_REQ_PER_RING - 1. We already have a slot in the rsp array
> ready for the response, and there cannot be two outstanding responses
> with the same req_id.
>
> Wait for the response by waiting on the inflight_req waitqueue and
> check for the req_id field in rsp[req_id]. Use atomic accesses and
> barriers to read the field. Note that the barriers are simple smp
> barriers (as opposed to virt barriers) because they are for internal
> frontend synchronization, not frontend<->backend communication.
>
> Once a response is received, clear the corresponding rsp slot by setting
> req_id to PVCALLS_INVALID_ID. Note that PVCALLS_INVALID_ID is invalid
> only from the frontend point of view. It is not part of the PVCalls
> protocol.
>
> pvcalls_front_event_handler is in charge of copying responses from the
> ring to the appropriate rsp slot. It is done by copying the body of the
> response first, then by copying req_id atomically. After the copies,
> wake up anybody waiting on waitqueue.
>
> pvcallss_lock protects accesses to the ring.
>
> Create a new struct sock_mapping and convert the pointer into an
> uint64_t and use it as id for the new socket to pass to the backend. The
> struct will be fully initialized later on connect or bind. In this patch
> the struct sock_mapping is empty, the fields will be added by the next
> patch.
>
> sock->sk->sk_send_head is not used for ip sockets: reuse the field to
> store a pointer to the struct sock_mapping corresponding to the socket.
> This way, we can easily get the struct sock_mapping from the struct
> socket.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 119 ++++++++++++++++++++++++++++++++++++++++++++
> drivers/xen/pvcalls-front.h | 8 +++
> 2 files changed, 127 insertions(+)
> create mode 100644 drivers/xen/pvcalls-front.h
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 2afe36d..7c4a7cb 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -20,6 +20,8 @@
> #include <xen/xenbus.h>
> #include <xen/interface/io/pvcalls.h>
>
> +#include "pvcalls-front.h"
> +
> #define PVCALLS_INVALID_ID UINT_MAX
> #define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
> #define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
> @@ -38,11 +40,128 @@ struct pvcalls_bedata {
> };
> static struct xenbus_device *pvcalls_front_dev;
>
> +struct sock_mapping {
> + bool active_socket;
> + struct list_head list;
> + struct socket *sock;
> +};
> +
> +static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
> +{
> + *req_id = bedata->ring.req_prod_pvt & (RING_SIZE(&bedata->ring) - 1);
> + if (RING_FULL(&bedata->ring) ||
> + READ_ONCE(bedata->rsp[*req_id].req_id) != PVCALLS_INVALID_ID)
> + return -EAGAIN;
> + return 0;
> +}
> +
> static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> {
> + struct xenbus_device *dev = dev_id;
> + struct pvcalls_bedata *bedata;
> + struct xen_pvcalls_response *rsp;
> + uint8_t *src, *dst;
> + int req_id = 0, more = 0, done = 0;
> +
> + if (dev == NULL)
> + return IRQ_HANDLED;
> +
> + bedata = dev_get_drvdata(&dev->dev);
> + if (bedata == NULL)
> + return IRQ_HANDLED;
> +
> +again:
> + while (RING_HAS_UNCONSUMED_RESPONSES(&bedata->ring)) {
> + rsp = RING_GET_RESPONSE(&bedata->ring, bedata->ring.rsp_cons);
> +
> + req_id = rsp->req_id;
> + dst = (uint8_t *)&bedata->rsp[req_id] + sizeof(rsp->req_id);
> + src = (uint8_t *)rsp + sizeof(rsp->req_id);
> + memcpy(dst, src, sizeof(*rsp) - sizeof(rsp->req_id));
> + /*
> + * First copy the rest of the data, then req_id. It is
> + * paired with the barrier when accessing bedata->rsp.
> + */
> + smp_wmb();
> + WRITE_ONCE(bedata->rsp[req_id].req_id, rsp->req_id);
> +
> + done = 1;
> + bedata->ring.rsp_cons++;
> + }
> +
> + RING_FINAL_CHECK_FOR_RESPONSES(&bedata->ring, more);
> + if (more)
> + goto again;
> + if (done)
> + wake_up(&bedata->inflight_req);
> return IRQ_HANDLED;
> }
>
> +int pvcalls_front_socket(struct socket *sock)
> +{
> + struct pvcalls_bedata *bedata;
> + struct sock_mapping *map = NULL;
> + struct xen_pvcalls_request *req;
> + int notify, req_id, ret;
> +
> + if (!pvcalls_front_dev)
> + return -EACCES;
> + /*
> + * PVCalls only supports domain AF_INET,
> + * type SOCK_STREAM and protocol 0 sockets for now.
> + *
> + * Check socket type here, AF_INET and protocol checks are done
> + * by the caller.
> + */
> + if (sock->type != SOCK_STREAM)
> + return -ENOTSUPP;
> +
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + map = kzalloc(sizeof(*map), GFP_KERNEL);
> + if (map == NULL)
> + return -ENOMEM;
> + /*
> + * sock->sk->sk_send_head is not used for ip sockets: reuse the
> + * field to store a pointer to the struct sock_mapping
> + * corresponding to the socket. This way, we can easily get the
> + * struct sock_mapping from the struct socket.
Is this a safe assumption? Is it conceivable that at some point in the
future sk_send_head might become used?
> + */
> + WRITE_ONCE(sock->sk->sk_send_head, (void *)map);
> +
> + spin_lock(&bedata->pvcallss_lock);
> + list_add_tail(&map->list, &bedata->socket_mappings);
> +
> + ret = get_request(bedata, &req_id);
> + if (ret < 0) {
> + spin_unlock(&bedata->pvcallss_lock);
> + return ret;
What is happening to the mapping you've linked above? Should it be
unlinked and freed? (or defer its creation until we've got a request slot?)
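E.g. (untested):

	ret = get_request(bedata, &req_id);
	if (ret < 0) {
		list_del(&map->list);
		spin_unlock(&bedata->pvcallss_lock);
		WRITE_ONCE(sock->sk->sk_send_head, NULL);
		kfree(map);
		return ret;
	}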
> + }
> + req = RING_GET_REQUEST(&bedata->ring, req_id);
> + req->req_id = req_id;
> + req->cmd = PVCALLS_SOCKET;
> + req->u.socket.id = (uint64_t) map;
> + req->u.socket.domain = AF_INET;
> + req->u.socket.type = SOCK_STREAM;
> + req->u.socket.protocol = 0;
IPPROTO_IP?
-boris
> +
> + bedata->ring.req_prod_pvt++;
> + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> + spin_unlock(&bedata->pvcallss_lock);
> + if (notify)
> + notify_remote_via_irq(bedata->irq);
> +
> + wait_event(bedata->inflight_req,
> + READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
> +
> + ret = bedata->rsp[req_id].ret;
> + /* read ret, then set this rsp slot to be reused */
> + smp_mb();
> + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> +
> + return ret;
> +}
> +
> static const struct xenbus_device_id pvcalls_front_ids[] = {
> { "pvcalls" },
> { "" }
> diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> new file mode 100644
> index 0000000..b7dabed
> --- /dev/null
> +++ b/drivers/xen/pvcalls-front.h
> @@ -0,0 +1,8 @@
> +#ifndef __PVCALLS_FRONT_H__
> +#define __PVCALLS_FRONT_H__
> +
> +#include <linux/net.h>
> +
> +int pvcalls_front_socket(struct socket *sock);
> +
> +#endif
>
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Send PVCALLS_CONNECT to the backend. Allocate a new ring and evtchn for
> the active socket.
>
> Introduce fields in struct sock_mapping to keep track of active sockets.
> Introduce a waitqueue to allow the frontend to wait on data coming from
> the backend on the active socket (recvmsg command).
>
> Two mutexes (one for reads and one for writes) will be used to protect
> the active socket in and out rings from concurrent accesses.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 146 ++++++++++++++++++++++++++++++++++++++++++++
> drivers/xen/pvcalls-front.h | 2 +
> 2 files changed, 148 insertions(+)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 7c4a7cb..379b8fb 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -13,6 +13,10 @@
> */
>
> #include <linux/module.h>
> +#include <linux/net.h>
> +#include <linux/socket.h>
> +
> +#include <net/sock.h>
>
> #include <xen/events.h>
> #include <xen/grant_table.h>
> @@ -44,6 +48,18 @@ struct sock_mapping {
> bool active_socket;
> struct list_head list;
> struct socket *sock;
> + union {
> + struct {
> + int irq;
> + grant_ref_t ref;
> + struct pvcalls_data_intf *ring;
> + struct pvcalls_data data;
> + struct mutex in_mutex;
> + struct mutex out_mutex;
> +
> + wait_queue_head_t inflight_conn_req;
> + } active;
> + };
> };
>
> static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
> @@ -97,6 +113,18 @@ static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> return IRQ_HANDLED;
> }
>
> +static irqreturn_t pvcalls_front_conn_handler(int irq, void *sock_map)
> +{
> + struct sock_mapping *map = sock_map;
> +
> + if (map == NULL)
> + return IRQ_HANDLED;
> +
> + wake_up_interruptible(&map->active.inflight_conn_req);
> +
> + return IRQ_HANDLED;
> +}
> +
> int pvcalls_front_socket(struct socket *sock)
> {
> struct pvcalls_bedata *bedata;
> @@ -162,6 +190,124 @@ int pvcalls_front_socket(struct socket *sock)
> return ret;
> }
>
> +static int create_active(struct sock_mapping *map, int *evtchn)
> +{
> + void *bytes;
> + int ret = -ENOMEM, irq = -1, i;
> +
> + init_waitqueue_head(&map->active.inflight_conn_req);
> +
> + map->active.ring = (struct pvcalls_data_intf *)
> + __get_free_page(GFP_KERNEL | __GFP_ZERO);
> + if (map->active.ring == NULL)
> + goto out_error;
> + memset(map->active.ring, 0, XEN_PAGE_SIZE);
Not needed (page allocated with __GFP_ZERO)
BTW, are you operating on XEN_PAGE_SIZE or PAGE_SIZE? Because if it's
the former then __get_free_page() may be way more than what you need.
(below too)
> + map->active.ring->ring_order = PVCALLS_RING_ORDER;
> + bytes = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> + map->active.ring->ring_order);
> + if (bytes == NULL)
> + goto out_error;
> + for (i = 0; i < (1 << map->active.ring->ring_order); i++)
PVCALLS_RING_ORDER
> + map->active.ring->ref[i] = gnttab_grant_foreign_access(
> + pvcalls_front_dev->otherend_id,
> + pfn_to_gfn(virt_to_pfn(bytes) + i), 0);
> +
> + map->active.ref = gnttab_grant_foreign_access(
> + pvcalls_front_dev->otherend_id,
> + pfn_to_gfn(virt_to_pfn((void *)map->active.ring)), 0);
> +
> + map->active.data.in = bytes;
> + map->active.data.out = bytes +
> + XEN_FLEX_RING_SIZE(map->active.ring->ring_order);
> +
> + ret = xenbus_alloc_evtchn(pvcalls_front_dev, evtchn);
> + if (ret)
> + goto out_error;
> + irq = bind_evtchn_to_irqhandler(*evtchn, pvcalls_front_conn_handler,
> + 0, "pvcalls-frontend", map);
> + if (irq < 0) {
> + ret = irq;
> + goto out_error;
> + }
> +
> + map->active.irq = irq;
> + map->active_socket = true;
> + mutex_init(&map->active.in_mutex);
> + mutex_init(&map->active.out_mutex);
> +
> + return 0;
> +
> +out_error:
> + if (irq >= 0)
> + unbind_from_irqhandler(irq, map);
> + else if (*evtchn >= 0)
*evtchn may have been passed in as >=0. You probably want to set to -1
or something in the beginning.
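E.g.:

	int ret = -ENOMEM, irq = -1, i;

	*evtchn = -1;
	init_waitqueue_head(&map->active.inflight_conn_req);

so that the 'else if (*evtchn >= 0)' test below only fires when
xenbus_alloc_evtchn() actually succeeded.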
> + xenbus_free_evtchn(pvcalls_front_dev, *evtchn);
> + kfree(map->active.data.in);
> + kfree(map->active.ring);
> + kfree(map);
Probably a matter of personal style but I think it's better to free map
in the caller.
Actually, should it be freed? The caller gets an error if get_request()
or create_active() fail. In the first case map is not freed. Is the
caller going to distinguish EAGAIN from any other error and know
whether map still needs to be freed?
(Also I think you need to unlink the map from socket_mapping list before
freeing it).
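E.g. the error path could limit itself to undoing create_active()'s own
allocations and leave map, still on socket_mappings, to the caller.
Untested sketch; note the buffers come from __get_free_page(s)(), so
free_page()/free_pages() rather than kfree():

	out_error:
		if (irq >= 0)
			unbind_from_irqhandler(irq, map);
		else if (*evtchn >= 0)
			xenbus_free_evtchn(pvcalls_front_dev, *evtchn);
		free_pages((unsigned long)map->active.data.in,
			   PVCALLS_RING_ORDER);
		free_page((unsigned long)map->active.ring);
		return ret;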
-boris
> + return ret;
> +}
> +
> +int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> + int addr_len, int flags)
> +{
> + struct pvcalls_bedata *bedata;
> + struct sock_mapping *map = NULL;
> + struct xen_pvcalls_request *req;
> + int notify, req_id, ret, evtchn;
> +
> + if (!pvcalls_front_dev)
> + return -ENETUNREACH;
> + if (addr->sa_family != AF_INET || sock->type != SOCK_STREAM)
> + return -ENOTSUPP;
> +
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> + if (!map)
> + return -EINVAL;
> +
> + spin_lock(&bedata->pvcallss_lock);
> + ret = get_request(bedata, &req_id);
> + if (ret < 0) {
> + spin_unlock(&bedata->pvcallss_lock);
> + return ret;
> + }
> + ret = create_active(map, &evtchn);
> + if (ret < 0) {
> + spin_unlock(&bedata->pvcallss_lock);
> + return ret;
> + }
> +
> + req = RING_GET_REQUEST(&bedata->ring, req_id);
> + req->req_id = req_id;
> + req->cmd = PVCALLS_CONNECT;
> + req->u.connect.id = (uint64_t)map;
> + memcpy(req->u.connect.addr, addr, sizeof(*addr));
> + req->u.connect.len = addr_len;
> + req->u.connect.flags = flags;
> + req->u.connect.ref = map->active.ref;
> + req->u.connect.evtchn = evtchn;
> +
> + map->sock = sock;
> +
> + bedata->ring.req_prod_pvt++;
> + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> + spin_unlock(&bedata->pvcallss_lock);
> +
> + if (notify)
> + notify_remote_via_irq(bedata->irq);
> +
> + wait_event(bedata->inflight_req,
> + READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
> +
> + ret = bedata->rsp[req_id].ret;
> + /* read ret, then set this rsp slot to be reused */
> + smp_mb();
> + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> + return ret;
> +}
> +
> static const struct xenbus_device_id pvcalls_front_ids[] = {
> { "pvcalls" },
> { "" }
> diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> index b7dabed..63b0417 100644
> --- a/drivers/xen/pvcalls-front.h
> +++ b/drivers/xen/pvcalls-front.h
> @@ -4,5 +4,7 @@
> #include <linux/net.h>
>
> int pvcalls_front_socket(struct socket *sock);
> +int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> + int addr_len, int flags);
>
> #endif
>
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Send PVCALLS_BIND to the backend. Introduce a new structure, part of
> struct sock_mapping, to store information specific to passive sockets.
>
> Introduce a status field to keep track of the status of the passive
> socket.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
> drivers/xen/pvcalls-front.h | 3 +++
> 2 files changed, 62 insertions(+)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 379b8fb..5ccef34 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -59,6 +59,13 @@ struct sock_mapping {
>
> wait_queue_head_t inflight_conn_req;
> } active;
> + struct {
> + /* Socket status */
> +#define PVCALLS_STATUS_UNINITALIZED 0
> +#define PVCALLS_STATUS_BIND 1
> +#define PVCALLS_STATUS_LISTEN 2
> + uint8_t status;
> + } passive;
> };
> };
>
> @@ -308,6 +315,58 @@ int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> return ret;
> }
>
> +int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> +{
> + struct pvcalls_bedata *bedata;
> + struct sock_mapping *map = NULL;
> + struct xen_pvcalls_request *req;
> + int notify, req_id, ret;
> +
> + if (!pvcalls_front_dev)
> + return -ENOTCONN;
> + if (addr->sa_family != AF_INET || sock->type != SOCK_STREAM)
No protocol check? (in the previous patch too)
-boris
> + return -ENOTSUPP;
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> + if (map == NULL)
> + return -EINVAL;
> +
> + spin_lock(&bedata->pvcallss_lock);
> + ret = get_request(bedata, &req_id);
> + if (ret < 0) {
> + spin_unlock(&bedata->pvcallss_lock);
> + return ret;
> + }
> + req = RING_GET_REQUEST(&bedata->ring, req_id);
> + req->req_id = req_id;
> + map->sock = sock;
> + req->cmd = PVCALLS_BIND;
> + req->u.bind.id = (uint64_t) map;
> + memcpy(req->u.bind.addr, addr, sizeof(*addr));
> + req->u.bind.len = addr_len;
> +
> + init_waitqueue_head(&map->passive.inflight_accept_req);
> +
> + map->active_socket = false;
> +
> + bedata->ring.req_prod_pvt++;
> + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> + spin_unlock(&bedata->pvcallss_lock);
> + if (notify)
> + notify_remote_via_irq(bedata->irq);
> +
> + wait_event(bedata->inflight_req,
> + READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
> +
> + map->passive.status = PVCALLS_STATUS_BIND;
> + ret = bedata->rsp[req_id].ret;
> + /* read ret, then set this rsp slot to be reused */
> + smp_mb();
> + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> + return 0;
> +}
> +
> static const struct xenbus_device_id pvcalls_front_ids[] = {
> { "pvcalls" },
> { "" }
> diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> index 63b0417..8b0a274 100644
> --- a/drivers/xen/pvcalls-front.h
> +++ b/drivers/xen/pvcalls-front.h
> @@ -6,5 +6,8 @@
> int pvcalls_front_socket(struct socket *sock);
> int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> int addr_len, int flags);
> +int pvcalls_front_bind(struct socket *sock,
> + struct sockaddr *addr,
> + int addr_len);
>
> #endif
>
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Send PVCALLS_LISTEN to the backend.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
Reviewed-by: Boris Ostrovsky <[email protected]>
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Introduce a waitqueue to allow only one outstanding accept command at
> any given time and to implement polling on the passive socket. Introduce
> a flags field to keep track of in-flight accept and poll commands.
>
> Send PVCALLS_ACCEPT to the backend. Allocate a new active socket. Make
> sure that only one accept command is executed at any given time by
> setting PVCALLS_FLAG_ACCEPT_INFLIGHT and waiting on the
> inflight_accept_req waitqueue.
>
> Convert the new struct sock_mapping pointer into an uint64_t and use it
> as id for the new socket to pass to the backend.
>
> Check if the accept call is non-blocking: in that case, after sending the
> ACCEPT command to the backend, store the sock_mapping pointer of the new
> struct and the inflight req_id, then return -EAGAIN (the backend will
> respond only when there is something to accept). Next time accept is called,
> we'll check if the ACCEPT command has been answered, if so we'll pick up
> where we left off, otherwise we return -EAGAIN again.
>
> Note that, unlike for the other commands, we can use
> wait_event_interruptible (instead of wait_event) in the case of accept
> as we are able to track the req_id of the ACCEPT response that we are
> waiting for.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 111 ++++++++++++++++++++++++++++++++++++++++++++
> drivers/xen/pvcalls-front.h | 3 ++
> 2 files changed, 114 insertions(+)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index b2757f5..f83b910 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -65,6 +65,16 @@ struct sock_mapping {
> #define PVCALLS_STATUS_BIND 1
> #define PVCALLS_STATUS_LISTEN 2
> uint8_t status;
> + /*
> + * Internal state-machine flags.
> + * Only one accept operation can be inflight for a socket.
> + * Only one poll operation can be inflight for a given socket.
> + */
> +#define PVCALLS_FLAG_ACCEPT_INFLIGHT 0
> + uint8_t flags;
> + uint32_t inflight_req_id;
> + struct sock_mapping *accept_map;
> + wait_queue_head_t inflight_accept_req;
> } passive;
> };
> };
> @@ -414,6 +424,107 @@ int pvcalls_front_listen(struct socket *sock, int backlog)
> return ret;
> }
>
> +int pvcalls_front_accept(struct socket *sock, struct socket *newsock, int flags)
> +{
> + struct pvcalls_bedata *bedata;
> + struct sock_mapping *map;
> + struct sock_mapping *map2 = NULL;
> + struct xen_pvcalls_request *req;
> + int notify, req_id, ret, evtchn, nonblock;
> +
> + if (!pvcalls_front_dev)
> + return -ENOTCONN;
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> + if (!map)
> + return -ENOTSOCK;
> +
> + if (map->passive.status != PVCALLS_STATUS_LISTEN)
> + return -EINVAL;
> +
> + nonblock = flags & SOCK_NONBLOCK;
> + /*
> + * Backend only supports 1 inflight accept request, will return
> + * errors for the others
> + */
> + if (test_and_set_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
> + (void *)&map->passive.flags)) {
> + req_id = READ_ONCE(map->passive.inflight_req_id);
> + if (req_id != PVCALLS_INVALID_ID &&
> + READ_ONCE(bedata->rsp[req_id].req_id) == req_id)
> + goto received;
> + if (nonblock)
> + return -EAGAIN;
> + if (wait_event_interruptible(map->passive.inflight_accept_req,
> + !test_and_set_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
> + (void *)&map->passive.flags)))
> + return -EINTR;
> + }
> +
> + spin_lock(&bedata->pvcallss_lock);
> + ret = get_request(bedata, &req_id);
> + if (ret < 0) {
> + spin_unlock(&bedata->pvcallss_lock);
> + return ret;
> + }
> + map2 = kzalloc(sizeof(*map2), GFP_KERNEL);
> + if (map2 == NULL)
> + return -ENOMEM;
> + ret = create_active(map2, &evtchn);
> + if (ret < 0) {
> + kfree(map2);
In the connect patch create_active() frees maps2 (and I had some
comments about it)
> + return -ENOMEM;
Both error paths need an unlock.
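I.e.:

	map2 = kzalloc(sizeof(*map2), GFP_KERNEL);
	if (map2 == NULL) {
		spin_unlock(&bedata->pvcallss_lock);
		return -ENOMEM;
	}
	ret = create_active(map2, &evtchn);
	if (ret < 0) {
		spin_unlock(&bedata->pvcallss_lock);
		kfree(map2);
		return ret;
	}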
> + }
> + list_add_tail(&map2->list, &bedata->socket_mappings);
> +
> + req = RING_GET_REQUEST(&bedata->ring, req_id);
> + req->req_id = req_id;
> + req->cmd = PVCALLS_ACCEPT;
> + req->u.accept.id = (uint64_t) map;
> + req->u.accept.ref = map2->active.ref;
> + req->u.accept.id_new = (uint64_t) map2;
> + req->u.accept.evtchn = evtchn;
> + map->passive.accept_map = map2;
> +
> + bedata->ring.req_prod_pvt++;
> + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> + spin_unlock(&bedata->pvcallss_lock);
> + if (notify)
> + notify_remote_via_irq(bedata->irq);
> + if (nonblock) {
> + WRITE_ONCE(map->passive.inflight_req_id, req_id);
> + return -EAGAIN;
Would it be worth checking (maybe a few times) whether the response has
come back?
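E.g. (untested sketch, reusing the PVCALLS_FRONT_MAX_SPIN idea from the
sendmsg patch; 'i' would need to be declared):

	if (nonblock) {
		WRITE_ONCE(map->passive.inflight_req_id, req_id);
		for (i = 0; i < PVCALLS_FRONT_MAX_SPIN; i++) {
			if (READ_ONCE(bedata->rsp[req_id].req_id) == req_id)
				goto received;
			cpu_relax();
		}
		return -EAGAIN;
	}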
> + }
> +
> + if (wait_event_interruptible(bedata->inflight_req,
> + READ_ONCE(bedata->rsp[req_id].req_id) == req_id))
> + return -EINTR;
> +
> +received:
> + map2 = map->passive.accept_map;
I think this could go to the inflight check in the beginning of this
routine (before the 'goto'). Otherwise map2 has already been assigned above.
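I.e.:

	req_id = READ_ONCE(map->passive.inflight_req_id);
	if (req_id != PVCALLS_INVALID_ID &&
	    READ_ONCE(bedata->rsp[req_id].req_id) == req_id) {
		map2 = map->passive.accept_map;
		goto received;
	}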
> + map2->sock = newsock;
> + newsock->sk = kzalloc(sizeof(*newsock->sk), GFP_KERNEL);
> + if (!newsock->sk) {
> + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> + WRITE_ONCE(map->passive.inflight_req_id, PVCALLS_INVALID_ID);
> + pvcalls_front_free_map(bedata, map2);
> + kfree(map2);
> + return -ENOMEM;
> + }
> + WRITE_ONCE(newsock->sk->sk_send_head, (void *)map2);
> +
> + clear_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT, (void *)&map->passive.flags);
> + wake_up(&map->passive.inflight_accept_req);
> +
> + ret = bedata->rsp[req_id].ret;
You can just return bedata->rsp[req_id].ret;
-boris
> + /* read ret, then set this rsp slot to be reused */
> + smp_mb();
> + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> + WRITE_ONCE(map->passive.inflight_req_id, PVCALLS_INVALID_ID);
> + return ret;
> +}
> +
> static const struct xenbus_device_id pvcalls_front_ids[] = {
> { "pvcalls" },
> { "" }
> diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> index aa8fe10..ab4f1da 100644
> --- a/drivers/xen/pvcalls-front.h
> +++ b/drivers/xen/pvcalls-front.h
> @@ -10,5 +10,8 @@ int pvcalls_front_bind(struct socket *sock,
> struct sockaddr *addr,
> int addr_len);
> int pvcalls_front_listen(struct socket *sock, int backlog);
> +int pvcalls_front_accept(struct socket *sock,
> + struct socket *newsock,
> + int flags);
>
> #endif
>
>> +
>> + ret = bedata->rsp[req_id].ret;
>
> You can just return bedata->rsp[req_id].ret;
Or maybe not. The slot may get reused by the time you get to the end.
-boris
>
> -boris
>
>> + /* read ret, then set this rsp slot to be reused */
>> + smp_mb();
>> + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
>> + WRITE_ONCE(map->passive.inflight_req_id, PVCALLS_INVALID_ID);
>> + return ret;
>> +}
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Send data to an active socket by copying data to the "out" ring. Take
> the active socket out_mutex so that only one function can access the
> ring at any given time.
>
> If not enough room is available on the ring, rather than returning
> immediately or sleep-waiting, spin for up to 5000 cycles. This small
> optimization turns out to improve performance significantly.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 109 ++++++++++++++++++++++++++++++++++++++++++++
> drivers/xen/pvcalls-front.h | 3 ++
> 2 files changed, 112 insertions(+)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index f83b910..369acde 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -29,6 +29,7 @@
> #define PVCALLS_INVALID_ID UINT_MAX
> #define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
> #define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
> +#define PVCALLS_FRONT_MAX_SPIN 5000
>
> struct pvcalls_bedata {
> struct xen_pvcalls_front_ring ring;
> @@ -88,6 +89,22 @@ static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
> return 0;
> }
>
> +static int pvcalls_front_write_todo(struct sock_mapping *map)
> +{
> + struct pvcalls_data_intf *intf = map->active.ring;
> + RING_IDX cons, prod, size = XEN_FLEX_RING_SIZE(intf->ring_order);
> + int32_t error;
> +
> + cons = intf->out_cons;
> + prod = intf->out_prod;
> + error = intf->out_error;
> + if (error == -ENOTCONN)
> + return 0;
> + if (error != 0)
> + return error;
> + return size - pvcalls_queued(prod, cons, size);
Do you ever look at actual return value except whether it's zero or not?
Both here and in the poll patch you look for !=0 and never check for an
error.
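If not, this could just mirror pvcalls_front_read_todo() and return
bool (sketch):

	static bool pvcalls_front_write_todo(struct sock_mapping *map)
	{
		struct pvcalls_data_intf *intf = map->active.ring;
		RING_IDX cons, prod, size = XEN_FLEX_RING_SIZE(intf->ring_order);
		int32_t error;

		cons = intf->out_cons;
		prod = intf->out_prod;
		error = intf->out_error;
		if (error == -ENOTCONN)
			return false;
		if (error != 0)
			return true;	/* let the caller run into the error */
		return size - pvcalls_queued(prod, cons, size) > 0;
	}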
> +}
> +
> static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> {
> struct xenbus_device *dev = dev_id;
> @@ -325,6 +342,98 @@ int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> return ret;
> }
>
> +static int __write_ring(struct pvcalls_data_intf *intf,
> + struct pvcalls_data *data,
> + struct iov_iter *msg_iter,
> + size_t len)
> +{
> + RING_IDX cons, prod, size, masked_prod, masked_cons;
> + RING_IDX array_size = XEN_FLEX_RING_SIZE(intf->ring_order);
> + int32_t error;
> +
> + cons = intf->out_cons;
> + prod = intf->out_prod;
> + error = intf->out_error;
> + /* read indexes before continuing */
> + virt_mb();
> +
> + if (error < 0)
> + return error;
This can be done before the barrier. (In fact, this can be done first
thing, before cons and prod are read).
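I.e.:

	error = intf->out_error;
	if (error < 0)
		return error;

	cons = intf->out_cons;
	prod = intf->out_prod;
	/* read indexes before continuing */
	virt_mb();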
> +
> + size = pvcalls_queued(prod, cons, array_size);
> + if (size >= array_size)
> + return 0;
> + if (len > array_size - size)
> + len = array_size - size;
> +
> + masked_prod = pvcalls_mask(prod, array_size);
> + masked_cons = pvcalls_mask(cons, array_size);
> +
> + if (masked_prod < masked_cons) {
> + copy_from_iter(data->out + masked_prod, len, msg_iter);
> + } else {
> + if (len > array_size - masked_prod) {
> + copy_from_iter(data->out + masked_prod,
> + array_size - masked_prod, msg_iter);
> + copy_from_iter(data->out,
> + len - (array_size - masked_prod),
> + msg_iter);
> + } else {
> + copy_from_iter(data->out + masked_prod, len, msg_iter);
> + }
> + }
> + /* write to ring before updating pointer */
> + virt_wmb();
> + intf->out_prod += len;
> +
> + return len;
You are returning size_t. I actually was going to ask in one of the
previous patches whether using int for sizes was correct. Unfortunately
I can't remember which struct/function I was looking at ;-(
Of course, you are also possibly returning a (negative) error here.
> +}
> +
> +int pvcalls_front_sendmsg(struct socket *sock, struct msghdr *msg,
> + size_t len)
> +{
> + struct pvcalls_bedata *bedata;
> + struct sock_mapping *map;
> + int sent = 0, tot_sent = 0;
'sent' doesn't need to be initialized.
> + int count = 0, flags;
> +
> + if (!pvcalls_front_dev)
> + return -ENOTCONN;
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> + if (!map)
> + return -ENOTSOCK;
IIRC the error value for sk_send_head being zero is inconsistent across
patches.
> +
> + flags = msg->msg_flags;
> + if (flags & (MSG_CONFIRM|MSG_DONTROUTE|MSG_EOR|MSG_OOB))
> + return -EOPNOTSUPP;
> +
> + mutex_lock(&map->active.out_mutex);
> + if ((flags & MSG_DONTWAIT) && !pvcalls_front_write_todo(map)) {
> + mutex_unlock(&map->active.out_mutex);
> + return -EAGAIN;
> + }
> +
> +again:
> + count++;
> + sent = __write_ring(map->active.ring,
> + &map->active.data, &msg->msg_iter,
> + len);
> + if (sent > 0) {
> + len -= sent;
> + tot_sent += sent;
> + notify_remote_via_irq(map->active.irq);
> + }
> + if (sent >= 0 && len > 0 && count < PVCALLS_FRONT_MAX_SPIN)
> + goto again;
> + if (sent < 0)
> + tot_sent = sent;
What does it mean when an error is detected on the interface? Does it
need to be somehow propagated to the caller?
-boris
> +
> + mutex_unlock(&map->active.out_mutex);
> + return tot_sent;
> +}
> +
> int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> {
> struct pvcalls_bedata *bedata;
> diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> index ab4f1da..d937c24 100644
> --- a/drivers/xen/pvcalls-front.h
> +++ b/drivers/xen/pvcalls-front.h
> @@ -13,5 +13,8 @@ int pvcalls_front_bind(struct socket *sock,
> int pvcalls_front_accept(struct socket *sock,
> struct socket *newsock,
> int flags);
> +int pvcalls_front_sendmsg(struct socket *sock,
> + struct msghdr *msg,
> + size_t len);
>
> #endif
>
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Implement recvmsg by copying data from the "in" ring. If not enough data
> is available and the recvmsg call is blocking, then wait on the
> inflight_conn_req waitqueue. Take the active socket in_mutex so that
> only one function can access the ring at any given time.
>
> If no data is available on the ring, rather than returning immediately
> or sleep-waiting, spin for up to 5000 cycles. This small optimization
> turns out to improve performance and latency significantly.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 102 ++++++++++++++++++++++++++++++++++++++++++++
> drivers/xen/pvcalls-front.h | 4 ++
> 2 files changed, 106 insertions(+)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 369acde..635a83a 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -105,6 +105,20 @@ static int pvcalls_front_write_todo(struct sock_mapping *map)
> return size - pvcalls_queued(prod, cons, size);
> }
>
> +static bool pvcalls_front_read_todo(struct sock_mapping *map)
> +{
> + struct pvcalls_data_intf *intf = map->active.ring;
> + RING_IDX cons, prod;
> + int32_t error;
> +
> + cons = intf->in_cons;
> + prod = intf->in_prod;
> + error = intf->in_error;
> + return (error != 0 ||
> + pvcalls_queued(prod, cons,
> + XEN_FLEX_RING_SIZE(intf->ring_order)) != 0);
> +}
> +
> static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> {
> struct xenbus_device *dev = dev_id;
> @@ -434,6 +448,94 @@ int pvcalls_front_sendmsg(struct socket *sock, struct msghdr *msg,
> return tot_sent;
> }
>
> +static int __read_ring(struct pvcalls_data_intf *intf,
> + struct pvcalls_data *data,
> + struct iov_iter *msg_iter,
> + size_t len, int flags)
> +{
> + RING_IDX cons, prod, size, masked_prod, masked_cons;
> + RING_IDX array_size = XEN_FLEX_RING_SIZE(intf->ring_order);
> + int32_t error;
> +
> + cons = intf->in_cons;
> + prod = intf->in_prod;
> + error = intf->in_error;
> + /* get pointers before reading from the ring */
> + virt_rmb();
> + if (error < 0)
> + return error;
> +
> + size = pvcalls_queued(prod, cons, array_size);
> + masked_prod = pvcalls_mask(prod, array_size);
> + masked_cons = pvcalls_mask(cons, array_size);
> +
> + if (size == 0)
> + return 0;
> +
> + if (len > size)
> + len = size;
> +
> + if (masked_prod > masked_cons) {
> + copy_to_iter(data->in + masked_cons, len, msg_iter);
> + } else {
> + if (len > (array_size - masked_cons)) {
> + copy_to_iter(data->in + masked_cons,
> + array_size - masked_cons, msg_iter);
> + copy_to_iter(data->in,
> + len - (array_size - masked_cons),
> + msg_iter);
> + } else {
> + copy_to_iter(data->in + masked_cons, len, msg_iter);
> + }
> + }
> + /* read data from the ring before increasing the index */
> + virt_mb();
> + if (!(flags & MSG_PEEK))
> + intf->in_cons += len;
> +
> + return len;
> +}
> +
> +int pvcalls_front_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> + int flags)
> +{
> + struct pvcalls_bedata *bedata;
> + int ret = -EAGAIN;
Not necessary to initialize.
> + struct sock_mapping *map;
> +
> + if (!pvcalls_front_dev)
> + return -ENOTCONN;
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> + if (!map)
> + return -ENOTSOCK;
> +
> + if (flags & (MSG_CMSG_CLOEXEC|MSG_ERRQUEUE|MSG_OOB|MSG_TRUNC))
> + return -EOPNOTSUPP;
> +
> + mutex_lock(&map->active.in_mutex);
> + if (len > XEN_FLEX_RING_SIZE(map->active.ring->ring_order))
> + len = XEN_FLEX_RING_SIZE(map->active.ring->ring_order);
> +
> + while (!(flags & MSG_DONTWAIT) && !pvcalls_front_read_todo(map)) {
> + wait_event_interruptible(map->active.inflight_conn_req,
> + pvcalls_front_read_todo(map));
> + }
> + ret = __read_ring(map->active.ring, &map->active.data,
> + &msg->msg_iter, len, flags);
> +
> + if (ret > 0)
> + notify_remote_via_irq(map->active.irq);
> + if (ret == 0)
> + ret = -EAGAIN;
> + if (ret == -ENOTCONN)
> + ret = 0;
> +
> + mutex_unlock(&map->active.in_mutex);
> + return ret;
Are errors converted by the caller? (I am asking because recvmsg can
only return -1 for errors)
-boris
> +}
> +
> int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> {
> struct pvcalls_bedata *bedata;
> diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> index d937c24..de24041 100644
> --- a/drivers/xen/pvcalls-front.h
> +++ b/drivers/xen/pvcalls-front.h
> @@ -16,5 +16,9 @@ int pvcalls_front_accept(struct socket *sock,
> int pvcalls_front_sendmsg(struct socket *sock,
> struct msghdr *msg,
> size_t len);
> +int pvcalls_front_recvmsg(struct socket *sock,
> + struct msghdr *msg,
> + size_t len,
> + int flags);
>
> #endif
>
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> For active sockets, check the indexes and use the inflight_conn_req
> waitqueue to wait.
>
> For passive sockets if an accept is outstanding
> (PVCALLS_FLAG_ACCEPT_INFLIGHT), check if it has been answered by looking
> at bedata->rsp[req_id]. If so, return POLLIN. Otherwise use the
> inflight_accept_req waitqueue.
>
> If no accepts are inflight, send PVCALLS_POLL to the backend. If we have
> outstanding POLL requests awaiting for a response use the inflight_req
> waitqueue: inflight_req is awaken when a new response is received; on
> wakeup we check whether the POLL response is arrived by looking at the
> PVCALLS_FLAG_POLL_RET flag. We set the flag from
> pvcalls_front_event_handler, if the response was for a POLL command.
>
> In pvcalls_front_event_handler, get the struct sock_mapping from the
> poll id (we previously converted struct sock_mapping* to uint64_t and
> used it as id).
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 135 +++++++++++++++++++++++++++++++++++++++++---
> drivers/xen/pvcalls-front.h | 3 +
> 2 files changed, 129 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 635a83a..1c975d6 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -72,6 +72,8 @@ struct sock_mapping {
> * Only one poll operation can be inflight for a given socket.
> */
> #define PVCALLS_FLAG_ACCEPT_INFLIGHT 0
> +#define PVCALLS_FLAG_POLL_INFLIGHT 1
> +#define PVCALLS_FLAG_POLL_RET 2
> uint8_t flags;
> uint32_t inflight_req_id;
> struct sock_mapping *accept_map;
> @@ -139,15 +141,32 @@ static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> rsp = RING_GET_RESPONSE(&bedata->ring, bedata->ring.rsp_cons);
>
> req_id = rsp->req_id;
> - dst = (uint8_t *)&bedata->rsp[req_id] + sizeof(rsp->req_id);
> - src = (uint8_t *)rsp + sizeof(rsp->req_id);
> - memcpy(dst, src, sizeof(*rsp) - sizeof(rsp->req_id));
> - /*
> - * First copy the rest of the data, then req_id. It is
> - * paired with the barrier when accessing bedata->rsp.
> - */
> - smp_wmb();
> - WRITE_ONCE(bedata->rsp[req_id].req_id, rsp->req_id);
> + if (rsp->cmd == PVCALLS_POLL) {
> + struct sock_mapping *map = (struct sock_mapping *)
> + rsp->u.poll.id;
> +
> + set_bit(PVCALLS_FLAG_POLL_RET,
> + (void *)&map->passive.flags);
> + /*
> + * Set RET, then clear INFLIGHT. It pairs with
> + * the checks at the beginning of
> + * pvcalls_front_poll_passive.
> + */
> + smp_wmb();
> + clear_bit(PVCALLS_FLAG_POLL_INFLIGHT,
> + (void *)&map->passive.flags);
> + } else {
> + dst = (uint8_t *)&bedata->rsp[req_id] +
> + sizeof(rsp->req_id);
> + src = (uint8_t *)rsp + sizeof(rsp->req_id);
> + memcpy(dst, src, sizeof(*rsp) - sizeof(rsp->req_id));
> + /*
> + * First copy the rest of the data, then req_id. It is
> + * paired with the barrier when accessing bedata->rsp.
> + */
> + smp_wmb();
> + WRITE_ONCE(bedata->rsp[req_id].req_id, rsp->req_id);
> + }
>
> done = 1;
> bedata->ring.rsp_cons++;
> @@ -736,6 +755,104 @@ int pvcalls_front_accept(struct socket *sock, struct socket *newsock, int flags)
> return ret;
> }
>
> +static unsigned int pvcalls_front_poll_passive(struct file *file,
> + struct pvcalls_bedata *bedata,
> + struct sock_mapping *map,
> + poll_table *wait)
> +{
> + int notify, req_id, ret;
> + struct xen_pvcalls_request *req;
> +
I am a bit confused by the logic here.
> + if (test_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
> + (void *)&map->passive.flags)) {
> + uint32_t req_id = READ_ONCE(map->passive.inflight_req_id);
> + if (req_id != PVCALLS_INVALID_ID &&
> + READ_ONCE(bedata->rsp[req_id].req_id) == req_id)
> + return POLLIN;
Is this a successful accept? Why is it returning POLLIN only and not
POLLIN | POLLRDNORM (or POLLOUT, for that matter)?
> +
> + poll_wait(file, &map->passive.inflight_accept_req, wait);
> + return 0;
> + }
> +
> + if (test_and_clear_bit(PVCALLS_FLAG_POLL_RET,
> + (void *)&map->passive.flags))
> + return POLLIN;
Is this the previous poll request completing?
> +
> + /*
> + * First check RET, then INFLIGHT. No barriers necessary to
> + * ensure execution ordering because of the conditional
> + * instructions creating control dependencies.
> + */
> +
> + if (test_and_set_bit(PVCALLS_FLAG_POLL_INFLIGHT,
> + (void *)&map->passive.flags)) {
> + poll_wait(file, &bedata->inflight_req, wait);
> + return 0;
> + }
This I don't understand, could you explain?
> +
> + spin_lock(&bedata->pvcallss_lock);
> + ret = get_request(bedata, &req_id);
> + if (ret < 0) {
> + spin_unlock(&bedata->pvcallss_lock);
> + return ret;
> + }
> + req = RING_GET_REQUEST(&bedata->ring, req_id);
> + req->req_id = req_id;
> + req->cmd = PVCALLS_POLL;
> + req->u.poll.id = (uint64_t) map;
> +
> + bedata->ring.req_prod_pvt++;
> + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> + spin_unlock(&bedata->pvcallss_lock);
> + if (notify)
> + notify_remote_via_irq(bedata->irq);
> +
> + poll_wait(file, &bedata->inflight_req, wait);
> + return 0;
> +}
> +
> +static unsigned int pvcalls_front_poll_active(struct file *file,
> + struct pvcalls_bedata *bedata,
> + struct sock_mapping *map,
> + poll_table *wait)
> +{
> + unsigned int mask = 0;
> + int32_t in_error, out_error;
> + struct pvcalls_data_intf *intf = map->active.ring;
> +
> + out_error = intf->out_error;
> + in_error = intf->in_error;
> +
> + poll_wait(file, &map->active.inflight_conn_req, wait);
> + if (pvcalls_front_write_todo(map))
> + mask |= POLLOUT | POLLWRNORM;
> + if (pvcalls_front_read_todo(map))
> + mask |= POLLIN | POLLRDNORM;
> + if (in_error != 0 || out_error != 0)
> + mask |= POLLERR;
> +
> + return mask;
> +}
> +
> +unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
> + poll_table *wait)
> +{
> + struct pvcalls_bedata *bedata;
> + struct sock_mapping *map;
> +
> + if (!pvcalls_front_dev)
> + return POLLNVAL;
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
I just noticed this --- why is it READ_ONCE? Are you concerned that
sk_send_head may change?
-boris
> + if (!map)
> + return POLLNVAL;
> + if (map->active_socket)
> + return pvcalls_front_poll_active(file, bedata, map, wait);
> + else
> + return pvcalls_front_poll_passive(file, bedata, map, wait);
> +}
> +
> static const struct xenbus_device_id pvcalls_front_ids[] = {
> { "pvcalls" },
> { "" }
> diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> index de24041..25e05b8 100644
> --- a/drivers/xen/pvcalls-front.h
> +++ b/drivers/xen/pvcalls-front.h
> @@ -20,5 +20,8 @@ int pvcalls_front_recvmsg(struct socket *sock,
> struct msghdr *msg,
> size_t len,
> int flags);
> +unsigned int pvcalls_front_poll(struct file *file,
> + struct socket *sock,
> + poll_table *wait);
>
> #endif
On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> Send PVCALLS_RELEASE to the backend and wait for a reply. Take both
> in_mutex and out_mutex to avoid concurrent accesses. Then, free the
> socket.
>
> For passive sockets, check whether we have already pre-allocated an
> active socket for the purpose of being accepted. If so, free that as
> well.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> drivers/xen/pvcalls-front.c | 88 +++++++++++++++++++++++++++++++++++++++++++++
> drivers/xen/pvcalls-front.h | 1 +
> 2 files changed, 89 insertions(+)
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 1c975d6..775a6d2 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -192,6 +192,23 @@ static irqreturn_t pvcalls_front_conn_handler(int irq, void *sock_map)
> return IRQ_HANDLED;
> }
>
> +static void pvcalls_front_free_map(struct pvcalls_bedata *bedata,
> + struct sock_mapping *map)
> +{
> + int i;
> +
> + spin_lock(&bedata->pvcallss_lock);
> + if (!list_empty(&map->list))
> + list_del_init(&map->list);
> + spin_unlock(&bedata->pvcallss_lock);
> +
> + for (i = 0; i < (1 << map->active.ring->ring_order); i++)
> + gnttab_end_foreign_access(map->active.ring->ref[i], 0, 0);
> + gnttab_end_foreign_access(map->active.ref, 0, 0);
> + free_page((unsigned long)map->active.ring);
> + unbind_from_irqhandler(map->active.irq, map);
Would it be better to first unbind the handler? Any chance an interrupt
might come in?
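I.e. (untested):

	static void pvcalls_front_free_map(struct pvcalls_bedata *bedata,
					   struct sock_mapping *map)
	{
		int i;

		unbind_from_irqhandler(map->active.irq, map);

		spin_lock(&bedata->pvcallss_lock);
		if (!list_empty(&map->list))
			list_del_init(&map->list);
		spin_unlock(&bedata->pvcallss_lock);

		for (i = 0; i < (1 << map->active.ring->ring_order); i++)
			gnttab_end_foreign_access(map->active.ring->ref[i], 0, 0);
		gnttab_end_foreign_access(map->active.ref, 0, 0);
		free_page((unsigned long)map->active.ring);
	}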
> +}
> +
> int pvcalls_front_socket(struct socket *sock)
> {
> struct pvcalls_bedata *bedata;
> @@ -853,6 +870,77 @@ unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
> return pvcalls_front_poll_passive(file, bedata, map, wait);
> }
>
> +int pvcalls_front_release(struct socket *sock)
> +{
> + struct pvcalls_bedata *bedata;
> + struct sock_mapping *map;
> + int req_id, notify, ret;
> + struct xen_pvcalls_request *req;
> +
> + if (!pvcalls_front_dev)
> + return -EIO;
> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> +
> + if (sock->sk == NULL)
> + return 0;
This can go above bedata access.
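I.e.:

	if (!pvcalls_front_dev)
		return -EIO;
	if (sock->sk == NULL)
		return 0;

	bedata = dev_get_drvdata(&pvcalls_front_dev->dev);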
(You are going to address locking here so I won't review the rest)
-boris
> +
> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> + if (map == NULL)
> + return 0;
> +
> + spin_lock(&bedata->pvcallss_lock);
> + ret = get_request(bedata, &req_id);
> + if (ret < 0) {
> + spin_unlock(&bedata->pvcallss_lock);
> + return ret;
> + }
> + WRITE_ONCE(sock->sk->sk_send_head, NULL);
> +
> + req = RING_GET_REQUEST(&bedata->ring, req_id);
> + req->req_id = req_id;
> + req->cmd = PVCALLS_RELEASE;
> + req->u.release.id = (uint64_t)map;
> +
> + bedata->ring.req_prod_pvt++;
> + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> + spin_unlock(&bedata->pvcallss_lock);
> + if (notify)
> + notify_remote_via_irq(bedata->irq);
> +
> + wait_event(bedata->inflight_req,
> + READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
> +
> + if (map->active_socket) {
> + /*
> + * Set in_error and wake up inflight_conn_req to force
> + * recvmsg waiters to exit.
> + */
> + map->active.ring->in_error = -EBADF;
> + wake_up_interruptible(&map->active.inflight_conn_req);
> +
> + mutex_lock(&map->active.in_mutex);
> + mutex_lock(&map->active.out_mutex);
> + pvcalls_front_free_map(bedata, map);
> + mutex_unlock(&map->active.out_mutex);
> + mutex_unlock(&map->active.in_mutex);
> + kfree(map);
> + } else {
> + spin_lock(&bedata->pvcallss_lock);
> + if (READ_ONCE(map->passive.inflight_req_id) !=
> + PVCALLS_INVALID_ID) {
> + pvcalls_front_free_map(bedata,
> + map->passive.accept_map);
> + kfree(map->passive.accept_map);
> + }
> + list_del_init(&map->list);
> + kfree(map);
> + spin_unlock(&bedata->pvcallss_lock);
> + }
> + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> +
> + return 0;
> +}
> +
> static const struct xenbus_device_id pvcalls_front_ids[] = {
> { "pvcalls" },
> { "" }
> diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> index 25e05b8..3332978 100644
> --- a/drivers/xen/pvcalls-front.h
> +++ b/drivers/xen/pvcalls-front.h
> @@ -23,5 +23,6 @@ int pvcalls_front_recvmsg(struct socket *sock,
> unsigned int pvcalls_front_poll(struct file *file,
> struct socket *sock,
> poll_table *wait);
> +int pvcalls_front_release(struct socket *sock);
>
> #endif
On Sat, 12 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Send PVCALLS_CONNECT to the backend. Allocate a new ring and evtchn for
> > the active socket.
> >
> > Introduce fields in struct sock_mapping to keep track of active sockets.
> > Introduce a waitqueue to allow the frontend to wait on data coming from
> > the backend on the active socket (recvmsg command).
> >
> > Two mutexes (one for reads and one for writes) will be used to protect
> > the active socket in and out rings from concurrent accesses.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 146 ++++++++++++++++++++++++++++++++++++++++++++
> > drivers/xen/pvcalls-front.h | 2 +
> > 2 files changed, 148 insertions(+)
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index 7c4a7cb..379b8fb 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -13,6 +13,10 @@
> > */
> > #include <linux/module.h>
> > +#include <linux/net.h>
> > +#include <linux/socket.h>
> > +
> > +#include <net/sock.h>
> > #include <xen/events.h>
> > #include <xen/grant_table.h>
> > @@ -44,6 +48,18 @@ struct sock_mapping {
> > bool active_socket;
> > struct list_head list;
> > struct socket *sock;
> > + union {
> > + struct {
> > + int irq;
> > + grant_ref_t ref;
> > + struct pvcalls_data_intf *ring;
> > + struct pvcalls_data data;
> > + struct mutex in_mutex;
> > + struct mutex out_mutex;
> > +
> > + wait_queue_head_t inflight_conn_req;
> > + } active;
> > + };
> > };
> > static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
> > @@ -97,6 +113,18 @@ static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> > return IRQ_HANDLED;
> > }
> > +static irqreturn_t pvcalls_front_conn_handler(int irq, void *sock_map)
> > +{
> > + struct sock_mapping *map = sock_map;
> > +
> > + if (map == NULL)
> > + return IRQ_HANDLED;
> > +
> > + wake_up_interruptible(&map->active.inflight_conn_req);
> > +
> > + return IRQ_HANDLED;
> > +}
> > +
> > int pvcalls_front_socket(struct socket *sock)
> > {
> > struct pvcalls_bedata *bedata;
> > @@ -162,6 +190,124 @@ int pvcalls_front_socket(struct socket *sock)
> > return ret;
> > }
> > +static int create_active(struct sock_mapping *map, int *evtchn)
> > +{
> > + void *bytes;
> > + int ret = -ENOMEM, irq = -1, i;
> > +
> > + init_waitqueue_head(&map->active.inflight_conn_req);
> > +
> > + map->active.ring = (struct pvcalls_data_intf *)
> > + __get_free_page(GFP_KERNEL | __GFP_ZERO);
> > + if (map->active.ring == NULL)
> > + goto out_error;
> > + memset(map->active.ring, 0, XEN_PAGE_SIZE);
>
>
> Not needed (page allocated with __GFP_ZERO)
OK
> BTW, are you operating on XEN_PAGE_SIZE or PAGE_SIZE? Because if it's the
> former then __get_free_page() may be way more than what you need. (below too)
It is the former, but a kernel built with 64K pages is not able to
allocate a single 4K page, so we'll have to accept the memory waste.
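For the record, the relationship is (illustrative sketch only, not part
of the series; XEN_PAGE_SIZE comes from include/xen/page.h):

#include <linux/mm.h>	/* PAGE_SIZE */
#include <xen/page.h>	/* XEN_PAGE_SIZE */

/*
 * Illustrative only: __get_free_page() always hands out PAGE_SIZE
 * bytes, but only the first XEN_PAGE_SIZE bytes of the interface
 * page are shared with the backend.
 */
static inline unsigned long pvcalls_iface_waste(void)
{
	/* 0 on a 4K-page kernel, 60K on a 64K-page kernel */
	return PAGE_SIZE - XEN_PAGE_SIZE;
}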
> > + map->active.ring->ring_order = PVCALLS_RING_ORDER;
> > + bytes = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> > + map->active.ring->ring_order);
> > + if (bytes == NULL)
> > + goto out_error;
> > + for (i = 0; i < (1 << map->active.ring->ring_order); i++)
>
> PVCALLS_RING_ORDER
OK
> > + map->active.ring->ref[i] = gnttab_grant_foreign_access(
> > + pvcalls_front_dev->otherend_id,
> > + pfn_to_gfn(virt_to_pfn(bytes) + i), 0);
> > +
> > + map->active.ref = gnttab_grant_foreign_access(
> > + pvcalls_front_dev->otherend_id,
> > + pfn_to_gfn(virt_to_pfn((void *)map->active.ring)), 0);
> > +
> > + map->active.data.in = bytes;
> > + map->active.data.out = bytes +
> > + XEN_FLEX_RING_SIZE(map->active.ring->ring_order);
> > +
> > + ret = xenbus_alloc_evtchn(pvcalls_front_dev, evtchn);
> > + if (ret)
> > + goto out_error;
> > + irq = bind_evtchn_to_irqhandler(*evtchn, pvcalls_front_conn_handler,
> > + 0, "pvcalls-frontend", map);
> > + if (irq < 0) {
> > + ret = irq;
> > + goto out_error;
> > + }
> > +
> > + map->active.irq = irq;
> > + map->active_socket = true;
> > + mutex_init(&map->active.in_mutex);
> > + mutex_init(&map->active.out_mutex);
> > +
> > + return 0;
> > +
> > +out_error:
> > + if (irq >= 0)
> > + unbind_from_irqhandler(irq, map);
> > + else if (*evtchn >= 0)
>
> *evtchn may have been passed in as >=0. You probably want to set to -1 or
> something in the beginning.
OK
> > + xenbus_free_evtchn(pvcalls_front_dev, *evtchn);
> > + kfree(map->active.data.in);
> > + kfree(map->active.ring);
> > + kfree(map);
>
> Probably a matter of personal style but I think it's better to free map in the
> caller.
>
> Actually, should it be freed? The caller gets an error if get_request() or
> create_active() fail. In the first case map is not freed. Is the caller going
> to distinguish EAGAIN from any other error and know that map does not need to
> be allocated?
You are right, map shouldn't be freed, as "close" is still supposed to
be called afterwards, even if "connect" failed.
> (Also I think you need to unlink the map from socket_mapping list before
> freeing it).
As map shouldn't be freed, it should also not be removed from the
socket_mappings list.
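So the out_error path would look roughly like this (just a sketch; it
also switches the releases to free_pages()/free_page() to match the
__get_free_page/__get_free_pages allocations, since kfree() is not the
right release call for page allocator memory, and it relies on map
being zeroed by its kzalloc so the frees are safe on the early exits):

out_error:
	if (irq >= 0)
		unbind_from_irqhandler(irq, map);
	else if (*evtchn >= 0)	/* assumes *evtchn = -1 at the top, as above */
		xenbus_free_evtchn(pvcalls_front_dev, *evtchn);
	free_pages((unsigned long)map->active.data.in, PVCALLS_RING_ORDER);
	free_page((unsigned long)map->active.ring);
	/* no kfree(map): the caller keeps ownership of map and a later
	 * "close" releases it, even when connect fails */
	return ret;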
> > + return ret;
> > +}
> > +
> > +int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> > + int addr_len, int flags)
> > +{
> > + struct pvcalls_bedata *bedata;
> > + struct sock_mapping *map = NULL;
> > + struct xen_pvcalls_request *req;
> > + int notify, req_id, ret, evtchn;
> > +
> > + if (!pvcalls_front_dev)
> > + return -ENETUNREACH;
> > + if (addr->sa_family != AF_INET || sock->type != SOCK_STREAM)
> > + return -ENOTSUPP;
> > +
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> > + if (!map)
> > + return -EINVAL;
> > +
> > + spin_lock(&bedata->pvcallss_lock);
> > + ret = get_request(bedata, &req_id);
> > + if (ret < 0) {
> > + spin_unlock(&bedata->pvcallss_lock);
> > + return ret;
> > + }
> > + ret = create_active(map, &evtchn);
> > + if (ret < 0) {
> > + spin_unlock(&bedata->pvcallss_lock);
> > + return ret;
> > + }
> > +
> > + req = RING_GET_REQUEST(&bedata->ring, req_id);
> > + req->req_id = req_id;
> > + req->cmd = PVCALLS_CONNECT;
> > + req->u.connect.id = (uint64_t)map;
> > + memcpy(req->u.connect.addr, addr, sizeof(*addr));
> > + req->u.connect.len = addr_len;
> > + req->u.connect.flags = flags;
> > + req->u.connect.ref = map->active.ref;
> > + req->u.connect.evtchn = evtchn;
> > +
> > + map->sock = sock;
> > +
> > + bedata->ring.req_prod_pvt++;
> > + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> > + spin_unlock(&bedata->pvcallss_lock);
> > +
> > + if (notify)
> > + notify_remote_via_irq(bedata->irq);
> > +
> > + wait_event(bedata->inflight_req,
> > + READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
> > +
> > + ret = bedata->rsp[req_id].ret;
> > + /* read ret, then set this rsp slot to be reused */
> > + smp_mb();
> > + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> > + return ret;
> > +}
> > +
> > static const struct xenbus_device_id pvcalls_front_ids[] = {
> > { "pvcalls" },
> > { "" }
> > diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> > index b7dabed..63b0417 100644
> > --- a/drivers/xen/pvcalls-front.h
> > +++ b/drivers/xen/pvcalls-front.h
> > @@ -4,5 +4,7 @@
> > #include <linux/net.h>
> > int pvcalls_front_socket(struct socket *sock);
> > +int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> > + int addr_len, int flags);
> > #endif
> >
>
On Sat, 12 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Send PVCALLS_BIND to the backend. Introduce a new structure, part of
> > struct sock_mapping, to store information specific to passive sockets.
> >
> > Introduce a status field to keep track of the status of the passive
> > socket.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
> > drivers/xen/pvcalls-front.h | 3 +++
> > 2 files changed, 62 insertions(+)
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index 379b8fb..5ccef34 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -59,6 +59,13 @@ struct sock_mapping {
> > wait_queue_head_t inflight_conn_req;
> > } active;
> > + struct {
> > + /* Socket status */
> > +#define PVCALLS_STATUS_UNINITALIZED 0
> > +#define PVCALLS_STATUS_BIND 1
> > +#define PVCALLS_STATUS_LISTEN 2
> > + uint8_t status;
> > + } passive;
> > };
> > };
> > @@ -308,6 +315,58 @@ int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> > return ret;
> > }
> > +int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> > +{
> > + struct pvcalls_bedata *bedata;
> > + struct sock_mapping *map = NULL;
> > + struct xen_pvcalls_request *req;
> > + int notify, req_id, ret;
> > +
> > + if (!pvcalls_front_dev)
> > + return -ENOTCONN;
> > + if (addr->sa_family != AF_INET || sock->type != SOCK_STREAM)
>
>
> No protocol check? (in the previous patch too)
As there is only one protocol for AF_INET SOCK_STREAM sockets, it is not
possible to get a "wrong" protocol here. In fact the field is not even
saved in struct socket or struct sockaddr.
The check should be done at the time the socket is created
(corresponding to the "socket" syscall), by the caller of
pvcalls_front_socket. I made sure to have a comment in
pvcalls_front_socket to clarify it:
* Check socket type here, AF_INET and protocol checks are done
* by the caller.
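In other words, the caller is expected to do something like this
hypothetical sketch before pvcalls_front_socket() is ever reached:

/* Hypothetical caller-side check; PV Calls only supports this one
 * triple. The exact errno is up to the caller. */
if (family != AF_INET || type != SOCK_STREAM || protocol != IPPROTO_IP)
	return -ENOTSUPP;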
> > + return -ENOTSUPP;
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> > + if (map == NULL)
> > + return -EINVAL;
> > +
> > + spin_lock(&bedata->pvcallss_lock);
> > + ret = get_request(bedata, &req_id);
> > + if (ret < 0) {
> > + spin_unlock(&bedata->pvcallss_lock);
> > + return ret;
> > + }
> > + req = RING_GET_REQUEST(&bedata->ring, req_id);
> > + req->req_id = req_id;
> > + map->sock = sock;
> > + req->cmd = PVCALLS_BIND;
> > + req->u.bind.id = (uint64_t) map;
> > + memcpy(req->u.bind.addr, addr, sizeof(*addr));
> > + req->u.bind.len = addr_len;
> > +
> > + init_waitqueue_head(&map->passive.inflight_accept_req);
> > +
> > + map->active_socket = false;
> > +
> > + bedata->ring.req_prod_pvt++;
> > + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> > + spin_unlock(&bedata->pvcallss_lock);
> > + if (notify)
> > + notify_remote_via_irq(bedata->irq);
> > +
> > + wait_event(bedata->inflight_req,
> > + READ_ONCE(bedata->rsp[req_id].req_id) == req_id);
> > +
> > + map->passive.status = PVCALLS_STATUS_BIND;
> > + ret = bedata->rsp[req_id].ret;
> > + /* read ret, then set this rsp slot to be reused */
> > + smp_mb();
> > + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> > + return 0;
> > +}
> > +
> > static const struct xenbus_device_id pvcalls_front_ids[] = {
> > { "pvcalls" },
> > { "" }
> > diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> > index 63b0417..8b0a274 100644
> > --- a/drivers/xen/pvcalls-front.h
> > +++ b/drivers/xen/pvcalls-front.h
> > @@ -6,5 +6,8 @@
> > int pvcalls_front_socket(struct socket *sock);
> > int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> > int addr_len, int flags);
> > +int pvcalls_front_bind(struct socket *sock,
> > + struct sockaddr *addr,
> > + int addr_len);
> > #endif
> >
>
On Mon, 14 Aug 2017, Boris Ostrovsky wrote:
> > > +
> > > + ret = bedata->rsp[req_id].ret;
> >
> > You can just return bedata->rsp[req_id].ret;
>
> Or maybe not. The slot may get reused by the time you get to the end.
Right!
> >
> > -boris
> >
> > > + /* read ret, then set this rsp slot to be reused */
> > > + smp_mb();
> > > + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> > > + WRITE_ONCE(map->passive.inflight_req_id, PVCALLS_INVALID_ID);
> > > + return ret;
> > > +}
>
On Mon, 14 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Introduce a waitqueue to allow only one outstanding accept command at
> > any given time and to implement polling on the passive socket. Introduce
> > a flags field to keep track of in-flight accept and poll commands.
> >
> > Send PVCALLS_ACCEPT to the backend. Allocate a new active socket. Make
> > sure that only one accept command is executed at any given time by
> > setting PVCALLS_FLAG_ACCEPT_INFLIGHT and waiting on the
> > inflight_accept_req waitqueue.
> >
> > Convert the new struct sock_mapping pointer into an uint64_t and use it
> > as id for the new socket to pass to the backend.
> >
> > Check if the accept call is non-blocking: in that case after sending the
> > ACCEPT command to the backend store the sock_mapping pointer of the new
> > struct and the inflight req_id then return -EAGAIN (which will respond
> > only when there is something to accept). Next time accept is called,
> > we'll check if the ACCEPT command has been answered, if so we'll pick up
> > where we left off, otherwise we return -EAGAIN again.
> >
> > Note that, differently from the other commands, we can use
> > wait_event_interruptible (instead of wait_event) in the case of accept
> > as we are able to track the req_id of the ACCEPT response that we are
> > waiting.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 111 ++++++++++++++++++++++++++++++++++++++++++++
> > drivers/xen/pvcalls-front.h | 3 ++
> > 2 files changed, 114 insertions(+)
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index b2757f5..f83b910 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -65,6 +65,16 @@ struct sock_mapping {
> > #define PVCALLS_STATUS_BIND 1
> > #define PVCALLS_STATUS_LISTEN 2
> > uint8_t status;
> > + /*
> > + * Internal state-machine flags.
> > + * Only one accept operation can be inflight for a socket.
> > + * Only one poll operation can be inflight for a given socket.
> > + */
> > +#define PVCALLS_FLAG_ACCEPT_INFLIGHT 0
> > + uint8_t flags;
> > + uint32_t inflight_req_id;
> > + struct sock_mapping *accept_map;
> > + wait_queue_head_t inflight_accept_req;
> > } passive;
> > };
> > };
> > @@ -414,6 +424,107 @@ int pvcalls_front_listen(struct socket *sock, int backlog)
> > return ret;
> > }
> > +int pvcalls_front_accept(struct socket *sock, struct socket *newsock, int flags)
> > +{
> > + struct pvcalls_bedata *bedata;
> > + struct sock_mapping *map;
> > + struct sock_mapping *map2 = NULL;
> > + struct xen_pvcalls_request *req;
> > + int notify, req_id, ret, evtchn, nonblock;
> > +
> > + if (!pvcalls_front_dev)
> > + return -ENOTCONN;
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> > + if (!map)
> > + return -ENOTSOCK;
> > +
> > + if (map->passive.status != PVCALLS_STATUS_LISTEN)
> > + return -EINVAL;
> > +
> > + nonblock = flags & SOCK_NONBLOCK;
> > + /*
> > + * Backend only supports 1 inflight accept request, will return
> > + * errors for the others
> > + */
> > + if (test_and_set_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
> > + (void *)&map->passive.flags)) {
> > + req_id = READ_ONCE(map->passive.inflight_req_id);
> > + if (req_id != PVCALLS_INVALID_ID &&
> > + READ_ONCE(bedata->rsp[req_id].req_id) == req_id)
> > + goto received;
> > + if (nonblock)
> > + return -EAGAIN;
> > + if (wait_event_interruptible(map->passive.inflight_accept_req,
> > + !test_and_set_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
> > + (void *)&map->passive.flags)))
> > + return -EINTR;
> > + }
> > +
> > + spin_lock(&bedata->pvcallss_lock);
> > + ret = get_request(bedata, &req_id);
> > + if (ret < 0) {
> > + spin_unlock(&bedata->pvcallss_lock);
> > + return ret;
> > + }
> > + map2 = kzalloc(sizeof(*map2), GFP_KERNEL);
> > + if (map2 == NULL)
> > + return -ENOMEM;
> > + ret = create_active(map2, &evtchn);
> > + if (ret < 0) {
> > + kfree(map2);
>
>
> In the connect patch create_active() frees map2 (and I had some comments
> about it)
Yes, you were right, there is no need to free map in create_active. map
will be freed by the next "close" call on the passive socket. However,
we do need to free map2 because it was created in this call a few lines
above.
> > + return -ENOMEM;
>
> Both error paths need an unlock.
You are right!
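Both error paths would then become something like this (sketch for v4):

	map2 = kzalloc(sizeof(*map2), GFP_KERNEL);
	if (map2 == NULL) {
		spin_unlock(&bedata->pvcallss_lock);
		return -ENOMEM;
	}
	ret = create_active(map2, &evtchn);
	if (ret < 0) {
		kfree(map2);	/* map2 was allocated just above */
		spin_unlock(&bedata->pvcallss_lock);
		return ret;
	}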
> > + }
> > + list_add_tail(&map2->list, &bedata->socket_mappings);
> > +
> > + req = RING_GET_REQUEST(&bedata->ring, req_id);
> > + req->req_id = req_id;
> > + req->cmd = PVCALLS_ACCEPT;
> > + req->u.accept.id = (uint64_t) map;
> > + req->u.accept.ref = map2->active.ref;
> > + req->u.accept.id_new = (uint64_t) map2;
> > + req->u.accept.evtchn = evtchn;
> > + map->passive.accept_map = map2;
> > +
> > + bedata->ring.req_prod_pvt++;
> > + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> > + spin_unlock(&bedata->pvcallss_lock);
> > + if (notify)
> > + notify_remote_via_irq(bedata->irq);
> > + if (nonblock) {
> > + WRITE_ONCE(map->passive.inflight_req_id, req_id);
> > + return -EAGAIN;
>
> Would it be worth checking (maybe a few times) whether the response has come
> back?
Yes, it is something to keep in mind. I'll add an in-code comment to
remember. I don't think this is a path worth optimizing now because it
is not very common to set the passive socket as non-blocking. Many
implementations leave the passive socket as blocking and use "poll" to
check if there is something to do before calling "accept".
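For reference, that common userspace pattern is (plain POSIX sketch;
handle_connection() is a made-up placeholder and listen_fd is a bound,
listening socket):

#include <poll.h>
#include <sys/socket.h>

struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };

/* block in poll(), then accept() only once a connection is pending */
if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
	int conn = accept(listen_fd, NULL, NULL);
	if (conn >= 0)
		handle_connection(conn);
}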
> > + }
> > +
> > + if (wait_event_interruptible(bedata->inflight_req,
> > + READ_ONCE(bedata->rsp[req_id].req_id) == req_id))
> > + return -EINTR;
> > +
> > +received:
> > + map2 = map->passive.accept_map;
>
> I think this could go to the inflight check in the beginning of this routine
> (before the 'goto'). Otherwise map2 has already been assigned above.
Yes, I'll do that. The code worked anyway because of the
map->passive.accept_map = map2;
assignment in the other path.
> > + map2->sock = newsock;
> > + newsock->sk = kzalloc(sizeof(*newsock->sk), GFP_KERNEL);
> > + if (!newsock->sk) {
> > + WRITE_ONCE(bedata->rsp[req_id].req_id, PVCALLS_INVALID_ID);
> > + WRITE_ONCE(map->passive.inflight_req_id, PVCALLS_INVALID_ID);
> > + pvcalls_front_free_map(bedata, map2);
> > + kfree(map2);
> > + return -ENOMEM;
> > + }
> > + WRITE_ONCE(newsock->sk->sk_send_head, (void *)map2);
> > +
> > + clear_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT, (void *)&map->passive.flags);
> > + wake_up(&map->passive.inflight_accept_req);
> > +
> > + ret = bedata->rsp[req_id].ret;
On Tue, 15 Aug 2017, Boris Ostrovsky wrote:
> > @@ -736,6 +755,104 @@ int pvcalls_front_accept(struct socket *sock, struct socket *newsock, int flags)
> > return ret;
> > }
> >
> > +static unsigned int pvcalls_front_poll_passive(struct file *file,
> > + struct pvcalls_bedata *bedata,
> > + struct sock_mapping *map,
> > + poll_table *wait)
> > +{
> > + int notify, req_id, ret;
> > + struct xen_pvcalls_request *req;
> > +
>
> I am a bit confused by the logic here.
>
> > + if (test_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT,
> > + (void *)&map->passive.flags)) {
> > + uint32_t req_id = READ_ONCE(map->passive.inflight_req_id);
> > + if (req_id != PVCALLS_INVALID_ID &&
> > + READ_ONCE(bedata->rsp[req_id].req_id) == req_id)
> > + return POLLIN;
>
> This is successful accept?
Yes: if an accept reply is pending, there must be something to read on
the passive socket.
> Why is it returning POLLIN only and not
> POLLIN | POLLRDNORM (or POLLOUT, for that matter)?
Keep in mind that this is a passive socket. There is no writing to it.
But yes, POLLIN and POLLRDNORM are equivalent, so it is best to set
both. I'll do that.
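i.e. the early return becomes:

	return POLLIN | POLLRDNORM;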
> > +
> > + poll_wait(file, &map->passive.inflight_accept_req, wait);
> > + return 0;
> > + }
> > +
> > + if (test_and_clear_bit(PVCALLS_FLAG_POLL_RET,
> > + (void *)&map->passive.flags))
> > + return POLLIN;
>
> This is previous poll request completing?
Yes, a previous POLL request completing.
> > +
> > + /*
> > + * First check RET, then INFLIGHT. No barriers necessary to
> > + * ensure execution ordering because of the conditional
> > + * instructions creating control dependencies.
> > + */
> > +
> > + if (test_and_set_bit(PVCALLS_FLAG_POLL_INFLIGHT,
> > + (void *)&map->passive.flags)) {
> > + poll_wait(file, &bedata->inflight_req, wait);
> > + return 0;
> > + }
>
> This I don't understand, could you explain?
If none of the conditions above apply (there isn't an
accept reply ready, and there isn't a poll reply ready) then it means
that we actually need to tell the caller to wait. This is done by
calling poll_wait to add a wait_queue to the poll table.
Then we need to tell the backend that we are polling; that is done by
the code below, which sends a PVCALLS_POLL message to the backend.
The check on PVCALLS_FLAG_POLL_INFLIGHT here is to avoid sending
multiple PVCALLS_POLL requests to the backend. Thus, if there is one
already in-flight, we can just call poll_wait and return immediately
without sending a second request.
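To summarize, the ladder in pvcalls_front_poll_passive is (comment-only
sketch of the code under review):

/*
 * pvcalls_front_poll_passive() decision ladder:
 *
 * 1. accept in flight?
 *      reply ready  -> return POLLIN | POLLRDNORM (per above)
 *      not yet      -> poll_wait() on inflight_accept_req, return 0
 * 2. PVCALLS_FLAG_POLL_RET set?    -> clear it, return POLLIN
 * 3. PVCALLS_FLAG_POLL_INFLIGHT?   -> poll_wait() only, return 0
 * 4. otherwise                     -> send PVCALLS_POLL, then
 *                                     poll_wait() and return 0
 */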
> > +
> > + spin_lock(&bedata->pvcallss_lock);
> > + ret = get_request(bedata, &req_id);
> > + if (ret < 0) {
> > + spin_unlock(&bedata->pvcallss_lock);
> > + return ret;
> > + }
> > + req = RING_GET_REQUEST(&bedata->ring, req_id);
> > + req->req_id = req_id;
> > + req->cmd = PVCALLS_POLL;
> > + req->u.poll.id = (uint64_t) map;
> > +
> > + bedata->ring.req_prod_pvt++;
> > + RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&bedata->ring, notify);
> > + spin_unlock(&bedata->pvcallss_lock);
> > + if (notify)
> > + notify_remote_via_irq(bedata->irq);
> > +
> > + poll_wait(file, &bedata->inflight_req, wait);
> > + return 0;
> > +}
> > +
> > +static unsigned int pvcalls_front_poll_active(struct file *file,
> > + struct pvcalls_bedata *bedata,
> > + struct sock_mapping *map,
> > + poll_table *wait)
> > +{
> > + unsigned int mask = 0;
> > + int32_t in_error, out_error;
> > + struct pvcalls_data_intf *intf = map->active.ring;
> > +
> > + out_error = intf->out_error;
> > + in_error = intf->in_error;
> > +
> > + poll_wait(file, &map->active.inflight_conn_req, wait);
> > + if (pvcalls_front_write_todo(map))
> > + mask |= POLLOUT | POLLWRNORM;
> > + if (pvcalls_front_read_todo(map))
> > + mask |= POLLIN | POLLRDNORM;
> > + if (in_error != 0 || out_error != 0)
> > + mask |= POLLERR;
> > +
> > + return mask;
> > +}
> > +
> > +unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
> > + poll_table *wait)
> > +{
> > + struct pvcalls_bedata *bedata;
> > + struct sock_mapping *map;
> > +
> > + if (!pvcalls_front_dev)
> > + return POLLNVAL;
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
>
> I just noticed this --- why is it READ_ONCE? Are you concerned that
> sk_send_head may change?
No, but I wanted to avoid partial reads. A caller could call
pvcalls_front_accept and pvcalls_front_poll on newsock almost at the
same time (it is probably not the correct way to use the API), so I
wanted to make sure that "map" is either read correctly, or not read at
all.
> > + if (!map)
> > + return POLLNVAL;
> > + if (map->active_socket)
> > + return pvcalls_front_poll_active(file, bedata, map, wait);
> > + else
> > + return pvcalls_front_poll_passive(file, bedata, map, wait);
> > +}
> > +
> > static const struct xenbus_device_id pvcalls_front_ids[] = {
> > { "pvcalls" },
> > { "" }
> > diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> > index de24041..25e05b8 100644
> > --- a/drivers/xen/pvcalls-front.h
> > +++ b/drivers/xen/pvcalls-front.h
> > @@ -20,5 +20,8 @@ int pvcalls_front_recvmsg(struct socket *sock,
> > struct msghdr *msg,
> > size_t len,
> > int flags);
> > +unsigned int pvcalls_front_poll(struct file *file,
> > + struct socket *sock,
> > + poll_table *wait);
> >
> > #endif
>
On Tue, 15 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Send PVCALLS_RELEASE to the backend and wait for a reply. Take both
> > in_mutex and out_mutex to avoid concurrent accesses. Then, free the
> > socket.
> >
> > For passive sockets, check whether we have already pre-allocated an
> > active socket for the purpose of being accepted. If so, free that as
> > well.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 88 +++++++++++++++++++++++++++++++++++++++++++++
> > drivers/xen/pvcalls-front.h | 1 +
> > 2 files changed, 89 insertions(+)
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index 1c975d6..775a6d2 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -192,6 +192,23 @@ static irqreturn_t pvcalls_front_conn_handler(int irq, void *sock_map)
> > return IRQ_HANDLED;
> > }
> >
> > +static void pvcalls_front_free_map(struct pvcalls_bedata *bedata,
> > + struct sock_mapping *map)
> > +{
> > + int i;
> > +
> > + spin_lock(&bedata->pvcallss_lock);
> > + if (!list_empty(&map->list))
> > + list_del_init(&map->list);
> > + spin_unlock(&bedata->pvcallss_lock);
> > +
> > + for (i = 0; i < (1 << map->active.ring->ring_order); i++)
> > + gnttab_end_foreign_access(map->active.ring->ref[i], 0, 0);
> > + gnttab_end_foreign_access(map->active.ref, 0, 0);
> > + free_page((unsigned long)map->active.ring);
> > + unbind_from_irqhandler(map->active.irq, map);
>
> Would it be better to first unbind the handler? Any chance an interrupt
> might come in?
Fair enough, I'll do that.
> > +}
> > +
> > int pvcalls_front_socket(struct socket *sock)
> > {
> > struct pvcalls_bedata *bedata;
> > @@ -853,6 +870,77 @@ unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
> > return pvcalls_front_poll_passive(file, bedata, map, wait);
> > }
> >
> > +int pvcalls_front_release(struct socket *sock)
> > +{
> > + struct pvcalls_bedata *bedata;
> > + struct sock_mapping *map;
> > + int req_id, notify, ret;
> > + struct xen_pvcalls_request *req;
> > +
> > + if (!pvcalls_front_dev)
> > + return -EIO;
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + if (sock->sk == NULL)
> > + return 0;
>
> This can go above bedata access.
Yes, good idea.
> (You are going to address locking here so I won't review the rest)
Yes, I will. Thanks for the review! And sorry for taking so long to
come back to you.
On Mon, 14 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Implement recvmsg by copying data from the "in" ring. If not enough data
> > is available and the recvmsg call is blocking, then wait on the
> > inflight_conn_req waitqueue. Take the active socket in_mutex so that
> > only one function can access the ring at any given time.
> >
> > If no data is available on the ring, rather than returning immediately
> > or sleep-waiting, spin for up to 5000 cycles. This small optimization
> > turns out to improve performance and latency significantly.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 102 ++++++++++++++++++++++++++++++++++++++++++++
> > drivers/xen/pvcalls-front.h | 4 ++
> > 2 files changed, 106 insertions(+)
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index 369acde..635a83a 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -105,6 +105,20 @@ static int pvcalls_front_write_todo(struct sock_mapping *map)
> > return size - pvcalls_queued(prod, cons, size);
> > }
> > +static bool pvcalls_front_read_todo(struct sock_mapping *map)
> > +{
> > + struct pvcalls_data_intf *intf = map->active.ring;
> > + RING_IDX cons, prod;
> > + int32_t error;
> > +
> > + cons = intf->in_cons;
> > + prod = intf->in_prod;
> > + error = intf->in_error;
> > + return (error != 0 ||
> > + pvcalls_queued(prod, cons,
> > + XEN_FLEX_RING_SIZE(intf->ring_order)) != 0);
> > +}
> > +
> > static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> > {
> > struct xenbus_device *dev = dev_id;
> > @@ -434,6 +448,94 @@ int pvcalls_front_sendmsg(struct socket *sock, struct msghdr *msg,
> > return tot_sent;
> > }
> > +static int __read_ring(struct pvcalls_data_intf *intf,
> > + struct pvcalls_data *data,
> > + struct iov_iter *msg_iter,
> > + size_t len, int flags)
> > +{
> > + RING_IDX cons, prod, size, masked_prod, masked_cons;
> > + RING_IDX array_size = XEN_FLEX_RING_SIZE(intf->ring_order);
> > + int32_t error;
> > +
> > + cons = intf->in_cons;
> > + prod = intf->in_prod;
> > + error = intf->in_error;
> > + /* get pointers before reading from the ring */
> > + virt_rmb();
> > + if (error < 0)
> > + return error;
> > +
> > + size = pvcalls_queued(prod, cons, array_size);
> > + masked_prod = pvcalls_mask(prod, array_size);
> > + masked_cons = pvcalls_mask(cons, array_size);
> > +
> > + if (size == 0)
> > + return 0;
> > +
> > + if (len > size)
> > + len = size;
> > +
> > + if (masked_prod > masked_cons) {
> > + copy_to_iter(data->in + masked_cons, len, msg_iter);
> > + } else {
> > + if (len > (array_size - masked_cons)) {
> > + copy_to_iter(data->in + masked_cons,
> > + array_size - masked_cons, msg_iter);
> > + copy_to_iter(data->in,
> > + len - (array_size - masked_cons),
> > + msg_iter);
> > + } else {
> > + copy_to_iter(data->in + masked_cons, len, msg_iter);
> > + }
> > + }
> > + /* read data from the ring before increasing the index */
> > + virt_mb();
> > + if (!(flags & MSG_PEEK))
> > + intf->in_cons += len;
> > +
> > + return len;
> > +}
> > +
> > +int pvcalls_front_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> > + int flags)
> > +{
> > + struct pvcalls_bedata *bedata;
> > + int ret = -EAGAIN;
>
> Not necessary to initialize.
Yes, you are right.
> > + struct sock_mapping *map;
> > +
> > + if (!pvcalls_front_dev)
> > + return -ENOTCONN;
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> > + if (!map)
> > + return -ENOTSOCK;
> > +
> > + if (flags & (MSG_CMSG_CLOEXEC|MSG_ERRQUEUE|MSG_OOB|MSG_TRUNC))
> > + return -EOPNOTSUPP;
> > +
> > + mutex_lock(&map->active.in_mutex);
> > + if (len > XEN_FLEX_RING_SIZE(map->active.ring->ring_order))
> > + len = XEN_FLEX_RING_SIZE(map->active.ring->ring_order);
> > +
> > + while (!(flags & MSG_DONTWAIT) && !pvcalls_front_read_todo(map)) {
> > + wait_event_interruptible(map->active.inflight_conn_req,
> > + pvcalls_front_read_todo(map));
> > + }
> > + ret = __read_ring(map->active.ring, &map->active.data,
> > + &msg->msg_iter, len, flags);
> > +
> > + if (ret > 0)
> > + notify_remote_via_irq(map->active.irq);
> > + if (ret == 0)
> > + ret = -EAGAIN;
> > + if (ret == -ENOTCONN)
> > + ret = 0;
> > +
> > + mutex_unlock(&map->active.in_mutex);
> > + return ret;
>
> Are errors converted by the caller? (I am asking because recvmsg can only
> return -1 for errors)
Yes, the caller will set errno appropriately.
> > +}
> > +
> > int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> > {
> > struct pvcalls_bedata *bedata;
> > diff --git a/drivers/xen/pvcalls-front.h b/drivers/xen/pvcalls-front.h
> > index d937c24..de24041 100644
> > --- a/drivers/xen/pvcalls-front.h
> > +++ b/drivers/xen/pvcalls-front.h
> > @@ -16,5 +16,9 @@ int pvcalls_front_accept(struct socket *sock,
> > int pvcalls_front_sendmsg(struct socket *sock,
> > struct msghdr *msg,
> > size_t len);
> > +int pvcalls_front_recvmsg(struct socket *sock,
> > + struct msghdr *msg,
> > + size_t len,
> > + int flags);
> > #endif
> >
>
On Sat, 12 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Send a PVCALLS_SOCKET command to the backend, use the masked
> > req_prod_pvt as req_id. This way, req_id is guaranteed to be between 0
> > and PVCALLS_NR_REQ_PER_RING. We already have a slot in the rsp array
> > ready for the response, and there cannot be two outstanding responses
> > with the same req_id.
> >
> > Wait for the response by waiting on the inflight_req waitqueue and
> > check for the req_id field in rsp[req_id]. Use atomic accesses and
> > barriers to read the field. Note that the barriers are simple smp
> > barriers (as opposed to virt barriers) because they are for internal
> > frontend synchronization, not frontend<->backend communication.
> >
> > Once a response is received, clear the corresponding rsp slot by setting
> > req_id to PVCALLS_INVALID_ID. Note that PVCALLS_INVALID_ID is invalid
> > only from the frontend point of view. It is not part of the PVCalls
> > protocol.
> >
> > pvcalls_front_event_handler is in charge of copying responses from the
> > ring to the appropriate rsp slot. It is done by copying the body of the
> > response first, then by copying req_id atomically. After the copies,
> > wake up anybody waiting on waitqueue.
> >
> > pvcallss_lock protects accesses to the ring.
> >
> > Create a new struct sock_mapping and convert the pointer into an
> > uint64_t and use it as id for the new socket to pass to the backend. The
> > struct will be fully initialized later on connect or bind. In this patch
> > the struct sock_mapping is empty, the fields will be added by the next
> > patch.
> >
> > sock->sk->sk_send_head is not used for ip sockets: reuse the field to
> > store a pointer to the struct sock_mapping corresponding to the socket.
> > This way, we can easily get the struct sock_mapping from the struct
> > socket.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 119 ++++++++++++++++++++++++++++++++++++++++++++
> > drivers/xen/pvcalls-front.h | 8 +++
> > 2 files changed, 127 insertions(+)
> > create mode 100644 drivers/xen/pvcalls-front.h
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index 2afe36d..7c4a7cb 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -20,6 +20,8 @@
> > #include <xen/xenbus.h>
> > #include <xen/interface/io/pvcalls.h>
> > +#include "pvcalls-front.h"
> > +
> > #define PVCALLS_INVALID_ID UINT_MAX
> > #define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
> > #define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
> > @@ -38,11 +40,128 @@ struct pvcalls_bedata {
> > };
> > static struct xenbus_device *pvcalls_front_dev;
> > +struct sock_mapping {
> > + bool active_socket;
> > + struct list_head list;
> > + struct socket *sock;
> > +};
> > +
> > +static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
> > +{
> > + *req_id = bedata->ring.req_prod_pvt & (RING_SIZE(&bedata->ring) - 1);
> > + if (RING_FULL(&bedata->ring) ||
> > + READ_ONCE(bedata->rsp[*req_id].req_id) != PVCALLS_INVALID_ID)
> > + return -EAGAIN;
> > + return 0;
> > +}
> > +
> > static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> > {
> > + struct xenbus_device *dev = dev_id;
> > + struct pvcalls_bedata *bedata;
> > + struct xen_pvcalls_response *rsp;
> > + uint8_t *src, *dst;
> > + int req_id = 0, more = 0, done = 0;
> > +
> > + if (dev == NULL)
> > + return IRQ_HANDLED;
> > +
> > + bedata = dev_get_drvdata(&dev->dev);
> > + if (bedata == NULL)
> > + return IRQ_HANDLED;
> > +
> > +again:
> > + while (RING_HAS_UNCONSUMED_RESPONSES(&bedata->ring)) {
> > + rsp = RING_GET_RESPONSE(&bedata->ring, bedata->ring.rsp_cons);
> > +
> > + req_id = rsp->req_id;
> > + dst = (uint8_t *)&bedata->rsp[req_id] + sizeof(rsp->req_id);
> > + src = (uint8_t *)rsp + sizeof(rsp->req_id);
> > + memcpy(dst, src, sizeof(*rsp) - sizeof(rsp->req_id));
> > + /*
> > + * First copy the rest of the data, then req_id. It is
> > + * paired with the barrier when accessing bedata->rsp.
> > + */
> > + smp_wmb();
> > + WRITE_ONCE(bedata->rsp[req_id].req_id, rsp->req_id);
> > +
> > + done = 1;
> > + bedata->ring.rsp_cons++;
> > + }
> > +
> > + RING_FINAL_CHECK_FOR_RESPONSES(&bedata->ring, more);
> > + if (more)
> > + goto again;
> > + if (done)
> > + wake_up(&bedata->inflight_req);
> > return IRQ_HANDLED;
> > }
> > +int pvcalls_front_socket(struct socket *sock)
> > +{
> > + struct pvcalls_bedata *bedata;
> > + struct sock_mapping *map = NULL;
> > + struct xen_pvcalls_request *req;
> > + int notify, req_id, ret;
> > +
> > + if (!pvcalls_front_dev)
> > + return -EACCES;
> > + /*
> > + * PVCalls only supports domain AF_INET,
> > + * type SOCK_STREAM and protocol 0 sockets for now.
> > + *
> > + * Check socket type here, AF_INET and protocol checks are done
> > + * by the caller.
> > + */
> > + if (sock->type != SOCK_STREAM)
> > + return -ENOTSUPP;
> > +
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + map = kzalloc(sizeof(*map), GFP_KERNEL);
> > + if (map == NULL)
> > + return -ENOMEM;
> > + /*
> > + * sock->sk->sk_send_head is not used for ip sockets: reuse the
> > + * field to store a pointer to the struct sock_mapping
> > + * corresponding to the socket. This way, we can easily get the
> > + * struct sock_mapping from the struct socket.
>
>
> Is this a safe assumption? Is it conceivable that at some point in the future
> sk_send_head might become used?
I don't think there is any risk of sk_send_head being used anytime soon:
among all the protocols supported under net/, sk_send_head is only used
by dccp.
> > + */
> > + WRITE_ONCE(sock->sk->sk_send_head, (void *)map);
> > +
> > + spin_lock(&bedata->pvcallss_lock);
> > + list_add_tail(&map->list, &bedata->socket_mappings);
> > +
> > + ret = get_request(bedata, &req_id);
> > + if (ret < 0) {
> > + spin_unlock(&bedata->pvcallss_lock);
> > + return ret;
> What is happening to the mapping you've linked above? Should it be unlinked
> and freed? (or defer its creation until we've got a request slot?)
Yes, you are right.
> > + }
> > + req = RING_GET_REQUEST(&bedata->ring, req_id);
> > + req->req_id = req_id;
> > + req->cmd = PVCALLS_SOCKET;
> > + req->u.socket.id = (uint64_t) map;
> > + req->u.socket.domain = AF_INET;
> > + req->u.socket.type = SOCK_STREAM;
> > + req->u.socket.protocol = 0;
>
> IPPROTO_IP?
Yes, thanks
On Mon, 14 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Send data to an active socket by copying data to the "out" ring. Take
> > the active socket out_mutex so that only one function can access the
> > ring at any given time.
> >
> > If not enough room is available on the ring, rather than returning
> > immediately or sleep-waiting, spin for up to 5000 cycles. This small
> > optimization turns out to improve performance significantly.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 109 ++++++++++++++++++++++++++++++++++++++++++++
> > drivers/xen/pvcalls-front.h | 3 ++
> > 2 files changed, 112 insertions(+)
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index f83b910..369acde 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -29,6 +29,7 @@
> > #define PVCALLS_INVALID_ID UINT_MAX
> > #define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
> > #define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
> > +#define PVCALLS_FRONT_MAX_SPIN 5000
> > struct pvcalls_bedata {
> > struct xen_pvcalls_front_ring ring;
> > @@ -88,6 +89,22 @@ static inline int get_request(struct pvcalls_bedata *bedata, int *req_id)
> > return 0;
> > }
> > +static int pvcalls_front_write_todo(struct sock_mapping *map)
> > +{
> > + struct pvcalls_data_intf *intf = map->active.ring;
> > + RING_IDX cons, prod, size = XEN_FLEX_RING_SIZE(intf->ring_order);
> > + int32_t error;
> > +
> > + cons = intf->out_cons;
> > + prod = intf->out_prod;
> > + error = intf->out_error;
> > + if (error == -ENOTCONN)
> > + return 0;
> > + if (error != 0)
> > + return error;
> > + return size - pvcalls_queued(prod, cons, size);
>
> Do you ever look at the actual return value except whether it's zero or
> not? Both here and in the poll patch you look for !=0 and never check for
> an error.
No, I can turn the return value into bool.
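Something like this (sketch; note that a genuine error other than
-ENOTCONN still reads as "something to do", so poll and sendmsg get to
observe it):

static bool pvcalls_front_write_todo(struct sock_mapping *map)
{
	struct pvcalls_data_intf *intf = map->active.ring;
	RING_IDX cons, prod, size = XEN_FLEX_RING_SIZE(intf->ring_order);
	int32_t error;

	error = intf->out_error;
	if (error == -ENOTCONN)
		return false;
	if (error != 0)
		return true;

	cons = intf->out_cons;
	prod = intf->out_prod;
	return !!(size - pvcalls_queued(prod, cons, size));
}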
> > +}
> > +
> > static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> > {
> > struct xenbus_device *dev = dev_id;
> > @@ -325,6 +342,98 @@ int pvcalls_front_connect(struct socket *sock, struct sockaddr *addr,
> > return ret;
> > }
> > +static int __write_ring(struct pvcalls_data_intf *intf,
> > + struct pvcalls_data *data,
> > + struct iov_iter *msg_iter,
> > + size_t len)
> > +{
> > + RING_IDX cons, prod, size, masked_prod, masked_cons;
> > + RING_IDX array_size = XEN_FLEX_RING_SIZE(intf->ring_order);
> > + int32_t error;
> > +
> > + cons = intf->out_cons;
> > + prod = intf->out_prod;
> > + error = intf->out_error;
> > + /* read indexes before continuing */
> > + virt_mb();
> > +
> > + if (error < 0)
> > + return error;
>
> This can be done before the barrier. (In fact, this can be done first thing,
> before cons and prod are read).
Good point.
> > +
> > + size = pvcalls_queued(prod, cons, array_size);
> > + if (size >= array_size)
> > + return 0;
> > + if (len > array_size - size)
> > + len = array_size - size;
> > +
> > + masked_prod = pvcalls_mask(prod, array_size);
> > + masked_cons = pvcalls_mask(cons, array_size);
> > +
> > + if (masked_prod < masked_cons) {
> > + copy_from_iter(data->out + masked_prod, len, msg_iter);
> > + } else {
> > + if (len > array_size - masked_prod) {
> > + copy_from_iter(data->out + masked_prod,
> > + array_size - masked_prod, msg_iter);
> > + copy_from_iter(data->out,
> > + len - (array_size - masked_prod),
> > + msg_iter);
> > + } else {
> > + copy_from_iter(data->out + masked_prod, len, msg_iter);
> > + }
> > + }
> > + /* write to ring before updating pointer */
> > + virt_wmb();
> > + intf->out_prod += len;
> > +
> > + return len;
>
> You are returning size_t. I actually was going to ask in one of the previous
> patches whether using int for sizes was correct. Unfortunately I can't
> remember which struct/function I was looking at ;-(
>
> Of course, you are also possibly returning a (negative) error here.
You are right, len should not be size_t but ssize_t or int, especially
given that we cannot write more than array_size bytes to the ring.
For simplicity, I'll change the type of len to int in __write_ring and
add a check on len >= INT_MAX in the caller, as sketched below.
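Roughly:

	/* in pvcalls_front_sendmsg(), before the __write_ring() loop;
	 * INT_MAX comes from linux/limits.h */
	if (len >= INT_MAX)
		len = INT_MAX;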
> > +}
> > +
> > +int pvcalls_front_sendmsg(struct socket *sock, struct msghdr *msg,
> > + size_t len)
> > +{
> > + struct pvcalls_bedata *bedata;
> > + struct sock_mapping *map;
> > + int sent = 0, tot_sent = 0;
>
> 'sent' doesn't need to be initialized.
OK
> > + int count = 0, flags;
> > +
> > + if (!pvcalls_front_dev)
> > + return -ENOTCONN;
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> > + if (!map)
> > + return -ENOTSOCK;
>
> IIRC the error value for sk_send_head being zero is inconsistent across
> patches.
I'll fix it.
> > +
> > + flags = msg->msg_flags;
> > + if (flags & (MSG_CONFIRM|MSG_DONTROUTE|MSG_EOR|MSG_OOB))
> > + return -EOPNOTSUPP;
> > +
> > + mutex_lock(&map->active.out_mutex);
> > + if ((flags & MSG_DONTWAIT) && !pvcalls_front_write_todo(map)) {
> > + mutex_unlock(&map->active.out_mutex);
> > + return -EAGAIN;
> > + }
> > +
> > +again:
> > + count++;
> > + sent = __write_ring(map->active.ring,
> > + &map->active.data, &msg->msg_iter,
> > + len);
> > + if (sent > 0) {
> > + len -= sent;
> > + tot_sent += sent;
> > + notify_remote_via_irq(map->active.irq);
> > + }
> > + if (sent >= 0 && len > 0 && count < PVCALLS_FRONT_MAX_SPIN)
> > + goto again;
> > + if (sent < 0)
> > + tot_sent = sent;
>
> What does it mean when an error is detected on the interface? Does it need to
> be somehow propagated to the caller?
The error could be a genuine socket error set by the backend, for
example ECONNRESET (Connection reset by peer). The error will be
returned by __write_ring. By setting tot_sent to sent, we return it to
the caller of pvcalls_front_sendmsg. Eventually the userspace
application will see errno set to ECONNRESET.
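From the application's point of view it then looks like any other
socket error (plain POSIX sketch; fd is a placeholder for a connected
stream socket backed by pvcalls):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

char buf[64] = "hello";
ssize_t n = send(fd, buf, sizeof(buf), 0);
if (n < 0 && errno == ECONNRESET)
	fprintf(stderr, "send: %s\n", strerror(errno));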
On Fri, 11 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Introduce a data structure named pvcalls_bedata. It contains pointers to
> > the command ring, the event channel, a list of active sockets and a list
> > of passive sockets. Lists accesses are protected by a spin_lock.
> >
> > Introduce a waitqueue to allow waiting for a response on commands sent
> > to the backend.
> >
> > Introduce an array of struct xen_pvcalls_response to store commands
> > responses.
> >
> > Implement pvcalls frontend removal function. Go through the list of
> > active and passive sockets and free them all, one at a time.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 51 insertions(+)
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index a8d38c2..a126195 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -20,6 +20,29 @@
> > #include <xen/xenbus.h>
> > #include <xen/interface/io/pvcalls.h>
> >
> > +#define PVCALLS_INVALID_ID UINT_MAX
> > +#define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER
> > +#define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
> > +
> > +struct pvcalls_bedata {
> > + struct xen_pvcalls_front_ring ring;
> > + grant_ref_t ref;
> > + int irq;
> > +
> > + struct list_head socket_mappings;
> > + struct list_head socketpass_mappings;
> > + spinlock_t pvcallss_lock;
>
> In the backend this is called socket_lock and (subjectively) it sounds
> like a better name here too.
I'll rename
> > +
> > + wait_queue_head_t inflight_req;
> > + struct xen_pvcalls_response rsp[PVCALLS_NR_REQ_PER_RING];
> > +};
> > +static struct xenbus_device *pvcalls_front_dev;
> > +
> > +static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
> > +{
> > + return IRQ_HANDLED;
> > +}
> > +
> > static const struct xenbus_device_id pvcalls_front_ids[] = {
> > { "pvcalls" },
> > { "" }
> > @@ -27,6 +50,34 @@
> >
> > static int pvcalls_front_remove(struct xenbus_device *dev)
> > {
> > + struct pvcalls_bedata *bedata;
> > + struct sock_mapping *map = NULL, *n;
> > +
> > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > +
> > + list_for_each_entry_safe(map, n, &bedata->socket_mappings, list) {
> > + mutex_lock(&map->active.in_mutex);
> > + mutex_lock(&map->active.out_mutex);
> > + pvcalls_front_free_map(bedata, map);
> > + mutex_unlock(&map->active.out_mutex);
> > + mutex_unlock(&map->active.in_mutex);
> > + kfree(map);
>
> I think this is the same issue as the one discussed for some other patch
> --- unlocking and then immediately freeing a lock.
Yes, I'll fix this too.
> > + }
> > + list_for_each_entry_safe(map, n, &bedata->socketpass_mappings, list) {
> > + spin_lock(&bedata->pvcallss_lock);
> > + list_del_init(&map->list);
> > + spin_unlock(&bedata->pvcallss_lock);
> > + kfree(map);
> > + }
> > + if (bedata->irq > 0)
> > + unbind_from_irqhandler(bedata->irq, dev);
> > + if (bedata->ref >= 0)
> > + gnttab_end_foreign_access(bedata->ref, 0, 0);
> > + kfree(bedata->ring.sring);
> > + kfree(bedata);
> > + dev_set_drvdata(&dev->dev, NULL);
> > + xenbus_switch_state(dev, XenbusStateClosed);
>
> Should we first move the state to Closed and then free things up? Or it
> doesn't matter?
I believe that is already done by the xenbus driver: this function is
supposed to be called after the frontend state is set to Closing.
> > + pvcalls_front_dev = NULL;
> > return 0;
> > }
> >
>
On Fri, 11 Aug 2017, Boris Ostrovsky wrote:
> On 07/31/2017 06:57 PM, Stefano Stabellini wrote:
> > Implement the probe function for the pvcalls frontend. Read the
> > supported versions, max-page-order and function-calls nodes from
> > xenstore.
> >
> > Only one frontend<->backend connection is supported at any given time
> > for a guest. Store the active frontend device to a static pointer.
> >
> > Introduce a stub functions for the event handler.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > drivers/xen/pvcalls-front.c | 130 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 130 insertions(+)
> >
> > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> > index a126195..2afe36d 100644
> > --- a/drivers/xen/pvcalls-front.c
> > +++ b/drivers/xen/pvcalls-front.c
> > @@ -84,12 +84,142 @@ static int pvcalls_front_remove(struct xenbus_device *dev)
> > static int pvcalls_front_probe(struct xenbus_device *dev,
> > const struct xenbus_device_id *id)
> > {
> > + int ret = -ENOMEM, evtchn, i;
> > + unsigned int max_page_order, function_calls, len;
> > + char *versions;
> > + grant_ref_t gref_head = 0;
> > + struct xenbus_transaction xbt;
> > + struct pvcalls_bedata *bedata = NULL;
> > + struct xen_pvcalls_sring *sring;
> > +
> > + if (pvcalls_front_dev != NULL) {
> > + dev_err(&dev->dev, "only one PV Calls connection supported\n");
> > + return -EINVAL;
> > + }
> > +
> > + versions = xenbus_read(XBT_NIL, dev->otherend, "versions", &len);
> > + if (!len)
> > + return -EINVAL;
> > + if (strcmp(versions, "1")) {
> > + kfree(versions);
> > + return -EINVAL;
> > + }
> > + kfree(versions);
> > + max_page_order = xenbus_read_unsigned(dev->otherend,
> > + "max-page-order", 0);
> > + if (max_page_order < PVCALLS_RING_ORDER)
> > + return -ENODEV;
> > + function_calls = xenbus_read_unsigned(dev->otherend,
> > + "function-calls", 0);
> > + if (function_calls != 1)
>
> XENBUS_FUNCTIONS_CALLS
XENBUS_FUNCTIONS_CALLS is actually defined as "1". I'll add a comment
instead.
> > + return -ENODEV;
> > + pr_info("%s max-page-order is %u\n", __func__, max_page_order);
> > +
> > + bedata = kzalloc(sizeof(struct pvcalls_bedata), GFP_KERNEL);
> > + if (!bedata)
> > + return -ENOMEM;
> > +
> > + dev_set_drvdata(&dev->dev, bedata);
> > + pvcalls_front_dev = dev;
> > + init_waitqueue_head(&bedata->inflight_req);
> > + for (i = 0; i < PVCALLS_NR_REQ_PER_RING; i++)
> > + bedata->rsp[i].req_id = PVCALLS_INVALID_ID;
> > +
> > + sring = (struct xen_pvcalls_sring *) __get_free_page(GFP_KERNEL |
> > + __GFP_ZERO);
> > + if (!sring)
> > + goto error;
> > + SHARED_RING_INIT(sring);
> > + FRONT_RING_INIT(&bedata->ring, sring, XEN_PAGE_SIZE);
> > +
> > + ret = xenbus_alloc_evtchn(dev, &evtchn);
> > + if (ret)
> > + goto error;
> > +
> > + bedata->irq = bind_evtchn_to_irqhandler(evtchn,
> > + pvcalls_front_event_handler,
> > + 0, "pvcalls-frontend", dev);
> > + if (bedata->irq < 0) {
>
> In the previous patch you free up irq if it is strictly >0. Should have
> been >=0.
OK
> > + ret = bedata->irq;
> > + goto error;
> > + }
> > +
> > + ret = gnttab_alloc_grant_references(1, &gref_head);
> > + if (ret < 0)
> > + goto error;
> > + bedata->ref = gnttab_claim_grant_reference(&gref_head);
> > + if (bedata->ref < 0) {
> > + ret = bedata->ref;
> > + goto error;
> > + }
> > + gnttab_grant_foreign_access_ref(bedata->ref, dev->otherend_id,
> > + virt_to_gfn((void *)sring), 0);
> > +
> > + again:
> > + ret = xenbus_transaction_start(&xbt);
> > + if (ret) {
> > + xenbus_dev_fatal(dev, ret, "starting transaction");
> > + goto error;
> > + }
> > + ret = xenbus_printf(xbt, dev->nodename, "version", "%u", 1);
> > + if (ret)
> > + goto error_xenbus;
> > + ret = xenbus_printf(xbt, dev->nodename, "ring-ref", "%d", bedata->ref);
> > + if (ret)
> > + goto error_xenbus;
> > + ret = xenbus_printf(xbt, dev->nodename, "port", "%u",
> > + evtchn);
> > + if (ret)
> > + goto error_xenbus;
> > + ret = xenbus_transaction_end(xbt, 0);
> > + if (ret) {
> > + if (ret == -EAGAIN)
> > + goto again;
> > + xenbus_dev_fatal(dev, ret, "completing transaction");
> > + goto error;
> > + }
> > +
> > + INIT_LIST_HEAD(&bedata->socket_mappings);
> > + INIT_LIST_HEAD(&bedata->socketpass_mappings);
> > + spin_lock_init(&bedata->pvcallss_lock);
> > + xenbus_switch_state(dev, XenbusStateInitialised);
> > +
> > return 0;
> > +
> > + error_xenbus:
> > + xenbus_transaction_end(xbt, 1);
> > + xenbus_dev_fatal(dev, ret, "writing xenstore");
> > + error:
> > + pvcalls_front_remove(dev);
>
> Going back to the previous patch --- it seems to me that on some error
> paths certain structures may not be usable in pvcalls_front_remove().
> For example, socket_mappings list is definitely not initialized.
You are right, I'll fix that.
> > + return ret;
> > }
> >
> > static void pvcalls_front_changed(struct xenbus_device *dev,
> > enum xenbus_state backend_state)
> > {
> > + switch (backend_state) {
> > + case XenbusStateReconfiguring:
> > + case XenbusStateReconfigured:
> > + case XenbusStateInitialising:
> > + case XenbusStateInitialised:
> > + case XenbusStateUnknown:
> > + break;
> > +
> > + case XenbusStateInitWait:
> > + break;
> > +
> > + case XenbusStateConnected:
> > + xenbus_switch_state(dev, XenbusStateConnected);
> > + break;
> > +
> > + case XenbusStateClosed:
> > + if (dev->state == XenbusStateClosed)
> > + break;
> > + /* Missed the backend's CLOSING state -- fallthrough */
> > + case XenbusStateClosing:
> > + xenbus_frontend_closed(dev);
> > + break;
> > + }
> > }
> >
> > static struct xenbus_driver pvcalls_front_driver = {
>
>>> +
>>> +unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
>>> + poll_table *wait)
>>> +{
>>> + struct pvcalls_bedata *bedata;
>>> + struct sock_mapping *map;
>>> +
>>> + if (!pvcalls_front_dev)
>>> + return POLLNVAL;
>>> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
>>> +
>>> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
>> I just noticed this --- why is it READ_ONCE? Are you concerned that
>> sk_send_head may change?
> No, but I wanted to avoid partial reads. A caller could call
> pvcalls_front_accept and pvcalls_front_poll on newsock almost at the
> same time (it is probably not the correct way to use the API), I wanted
> to make sure that "map" is either read correctly, or not read at all.
How can you have a partial read on a pointer?
-boris
On Tue, 12 Sep 2017, Boris Ostrovsky wrote:
> >>> +
> >>> +unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
> >>> + poll_table *wait)
> >>> +{
> >>> + struct pvcalls_bedata *bedata;
> >>> + struct sock_mapping *map;
> >>> +
> >>> + if (!pvcalls_front_dev)
> >>> + return POLLNVAL;
> >>> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> >>> +
> >>> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> >> I just noticed this --- why is it READ_ONCE? Are you concerned that
> >> sk_send_head may change?
> > No, but I wanted to avoid partial reads. A caller could call
> > pvcalls_front_accept and pvcalls_front_poll on newsock almost at the
> > same time (it is probably not the correct way to use the API), I wanted
> > to make sure that "map" is either read correctly, or not read at all.
>
> How can you have a partial read on a pointer?
I don't think that the compiler makes any promises about translating a
pointer read into a single read instruction. Of course, I expect gcc to
actually do it without any need for READ/WRITE_ONCE.
On 09/12/2017 06:17 PM, Stefano Stabellini wrote:
> On Tue, 12 Sep 2017, Boris Ostrovsky wrote:
>>>>> +
>>>>> +unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
>>>>> + poll_table *wait)
>>>>> +{
>>>>> + struct pvcalls_bedata *bedata;
>>>>> + struct sock_mapping *map;
>>>>> +
>>>>> + if (!pvcalls_front_dev)
>>>>> + return POLLNVAL;
>>>>> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
>>>>> +
>>>>> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
>>>> I just noticed this --- why is it READ_ONCE? Are you concerned that
>>>> sk_send_head may change?
>>> No, but I wanted to avoid partial reads. A caller could call
>>> pvcalls_front_accept and pvcalls_front_poll on newsock almost at the
>>> same time (it is probably not the correct way to use the API), I wanted
>>> to make sure that "map" is either read correctly, or not read at all.
>> How can you have a partial read on a pointer?
> I don't think that the compiler makes any promises about translating a
> pointer read into a single read instruction. Of course, I expect gcc to
> actually do it without any need for READ/WRITE_ONCE.
READ_ONCE() only guarantees ordering but not atomicity. It resolves (for
64-bit pointers) to
case 8: *(__u64 *)res = *(volatile __u64 *)p; break;
so if the compiler was breaking accesses into two, nothing would have
prevented it from breaking them here (I don't think the volatile
declaration would affect this). Moreover, for sizes >8 bytes READ_ONCE()
is __builtin_memcpy(), which is definitely not atomic.
So you can't rely on READ_ONCE being atomic from that perspective.
OTOH, I am pretty sure pointer accesses are guaranteed to be atomic. For
example, atomic64_read() is READ_ONCE(u64), which (per above) is a
dereference of a 64-bit pointer in C.
-boris
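For reference, the size switch quoted above is the body of
__read_once_size(), roughly as it appears in include/linux/compiler.h
around this time (a sketch reproduced from memory; check the exact tree
before relying on it):

    static __always_inline void __read_once_size(const volatile void *p,
                                                 void *res, int size)
    {
        switch (size) {
        case 1: *(__u8  *)res = *(volatile __u8  *)p; break;
        case 2: *(__u16 *)res = *(volatile __u16 *)p; break;
        case 4: *(__u32 *)res = *(volatile __u32 *)p; break;
        case 8: *(__u64 *)res = *(volatile __u64 *)p; break;
        default:
                barrier();
                __builtin_memcpy((void *)res, (const void *)p, size);
                barrier();
        }
    }

Only the default arm falls back to memcpy; the word-sized arms are
single volatile accesses.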
On Tue, 12 Sep 2017, Boris Ostrovsky wrote:
> On 09/12/2017 06:17 PM, Stefano Stabellini wrote:
> > On Tue, 12 Sep 2017, Boris Ostrovsky wrote:
> >>>>> +
> >>>>> +unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
> >>>>> + poll_table *wait)
> >>>>> +{
> >>>>> + struct pvcalls_bedata *bedata;
> >>>>> + struct sock_mapping *map;
> >>>>> +
> >>>>> + if (!pvcalls_front_dev)
> >>>>> + return POLLNVAL;
> >>>>> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> >>>>> +
> >>>>> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> >>>> I just noticed this --- why is it READ_ONCE? Are you concerned that
> >>>> sk_send_head may change?
> >>> No, but I wanted to avoid partial reads. A caller could call
> >>> pvcalls_front_accept and pvcalls_front_poll on newsock almost at the
> >>> same time (it is probably not the correct way to use the API); I wanted
> >>> to make sure that "map" is either read correctly or not read at all.
> >> How can you have a partial read on a pointer?
> > I don't think that the compiler makes any promises about translating a
> > pointer read into a single read instruction. Of course, I expect gcc to
> > actually do it without any need for READ/WRITE_ONCE.
>
> READ_ONCE() only guarantees ordering but not atomicity. It resolves (for
> 64-bit pointers) to
>
> case 8: *(__u64 *)res = *(volatile __u64 *)p; break;
>
> so if the compiler were breaking accesses in two, nothing would have
> prevented it from doing so here (I don't think the volatile declaration
> would affect this). Moreover, for sizes >8 bytes READ_ONCE() is a
> __builtin_memcpy(), which is definitely not atomic.
>
> So you can't rely on READ_ONCE being atomic from that perspective.
I thought that READ_ONCE guaranteed atomicity for sizes less than or equal
to the machine word size. It doesn't make any atomicity guarantees for
sizes >8 bytes.
> OTOH, I am pretty sure pointer accesses are guaranteed to be atomic. For
> example, atomic64_read() is READ_ONCE(u64), which (per above) is a
> dereference of a 64-bit pointer in C.
I am happy to remove the READ_ONCE and WRITE_ONCE, if we all think it is
safe.
On Tue, 12 Sep 2017, Stefano Stabellini wrote:
> On Tue, 12 Sep 2017, Boris Ostrovsky wrote:
> > On 09/12/2017 06:17 PM, Stefano Stabellini wrote:
> > > On Tue, 12 Sep 2017, Boris Ostrovsky wrote:
> > >>>>> +
> > >>>>> +unsigned int pvcalls_front_poll(struct file *file, struct socket *sock,
> > >>>>> + poll_table *wait)
> > >>>>> +{
> > >>>>> + struct pvcalls_bedata *bedata;
> > >>>>> + struct sock_mapping *map;
> > >>>>> +
> > >>>>> + if (!pvcalls_front_dev)
> > >>>>> + return POLLNVAL;
> > >>>>> + bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
> > >>>>> +
> > >>>>> + map = (struct sock_mapping *) READ_ONCE(sock->sk->sk_send_head);
> > >>>> I just noticed this --- why is it READ_ONCE? Are you concerned that
> > >>>> sk_send_head may change?
> > >>> No, but I wanted to avoid partial reads. A caller could call
> > >>> pvcalls_front_accept and pvcalls_front_poll on newsock almost at the
> > >>> same time (it is probably not the correct way to use the API); I wanted
> > >>> to make sure that "map" is either read correctly or not read at all.
> > >> How can you have a partial read on a pointer?
> > > I don't think that the compiler makes any promises about translating a
> > > pointer read into a single read instruction. Of course, I expect gcc to
> > > actually do it without any need for READ/WRITE_ONCE.
> >
> > READ_ONCE() only guarantees ordering but not atomicity. It resolves (for
> > 64-bit pointers) to
> >
> > case 8: *(__u64 *)res = *(volatile __u64 *)p; break;
> >
> > so if the compiler were breaking accesses in two, nothing would have
> > prevented it from doing so here (I don't think the volatile declaration
> > would affect this). Moreover, for sizes >8 bytes READ_ONCE() is a
> > __builtin_memcpy(), which is definitely not atomic.
> >
> > So you can't rely on READ_ONCE being atomic from that perspective.
>
> I thought that READ_ONCE guaranteed atomicity for sizes less than or equal
> to the machine word size. It doesn't make any atomicity guarantees for
> sizes >8 bytes.
>
>
> > OTOH, I am pretty sure pointer accesses are guaranteed to be atomic. For
> > example, atomic64_read() is READ_ONCE(u64), which (per above) is a
> > dereference of a 64-bit pointer in C.
>
> I am happy to remove the READ_ONCE and WRITE_ONCE, if we all think it is
> safe.
Looking at other code in Linux, it seems this assumption is made in many
places. I'll remove READ/WRITE_ONCE.
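The follow-up change is then just the plain accesses, e.g. (a sketch of
the intended change, not an actual diff from the series):

    /* accept side: plain publishing store of the new mapping */
    newsock->sk->sk_send_head = (void *)map2;

    /* poll side: plain, naturally aligned pointer-sized load */
    map = (struct sock_mapping *)sock->sk->sk_send_head;

relying on the fact that aligned pointer-sized accesses do not tear on
any architecture Linux supports.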