Hi,
This patch addresses the race I'm describing below.
It currently mostly affects trans_rdma, but the fix has to go in the
common part of the code.
virtio and tcp really aren't consciously protected though, so I believe
this can't hurt. They *might* be safe at the moment because other
changes to req are made within an unrelated spin lock and the
unlocking does a write barrier... But I wouldn't bet much on it.
Short version:
9P/RDMA stops having odd crashes about null dereferences in
p9_parse_header once every few hours of heavy use.
Long version:
Null deref in here:
...
[exception RIP: p9_parse_header+37]
RIP: ffffffffa07477d7 RSP: ffff88105b8bf8c8 RFLAGS: 00010292
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88105b8bf94b RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88105b8bf918 R8: 0000000000000000 R9: 000000000000000c
R10: ffff88082fc00020 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88105b8bf94b R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88105b8bf920] p9_client_rpc at ffffffffa074841b [9pnet]
#10 [ffff88105b8bfa00] p9_client_getattr_dotl at ffffffffa0749278
#[9pnet]
#11 [ffff88105b8bfad0] v9fs_inode_from_fid_dotl at ffffffffa0763562 [9p]
#12 [ffff88105b8bfb10] v9fs_get_new_inode_from_fid at ffffffffa0762630
#[9p]
...
Looking at the disassembly/full bt, p9_parse_header's first argument
(req->rc) is indeed NULL, but looking for req in the stack, it's 'rc'
member isn't.
Illustrating for the rdma code, I'm looking at two threads:
First, in net/9p/client.c, we have:
------------------------------------------------------------------------
p9_client_rpc(...)
{
...
err = c->trans_mod->request(c, req);
...
err = wait_event_interruptible(*req->wq,
req->status >= REQ_STATUS_RCVD);
...
/* stuff using req->rc: p9_check_errors() is inlined in and
calls: p9_parse_header(req->rc, NULL, &type, NULL, 0); */
...
}
------------------------------------------------------------------------
Second, in the rdma side of the code:
------------------------------------------------------------------------
handle_recv(...)
{
...
req = p9_tag_lookup(client, tag);
...
req->rc = c->rc;
req->status = REQ_STATUS_RCVD;
p9_client_cb(client, req);
...
}
where p9_client_cb() is back in client.c:
void p9_client_cb(struct p9_client *c, struct p9_req_t *req)
{
p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc->tag);
wake_up(req->wq);
p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc->tag);
}
------------------------------------------------------------------------
So, basically, one thread sends a message and waits till a reply is
received, another thread receives messages, fills appropriate structures
and wakes up the first thread.
Now let's look at this (albeit unlikely) possible timing:
Thread 1 | Thread 2
|
send request |
/* scheduled off somewhere */ |
| receives reply
| set req->rc
| set req->status
| wake_up()
| /* nothing to wake up, no write
| barrier */
| cpu cache flushes req->status
| before req->rc
|
wait_event_interruptible |
-> nothing to wait for, |
since status is good: |
no read barrier |
|
potential incorrect use of req->rc |
Now I read in Documentation/memory-barrier.txt that
wait_event_interruptible implies a general memory barrier, but looking
at the code if the condition is already fulfilled then it does not; so
we need an explicit read barrier after the wait.
(a scheduling should also do a barrier, so arguably if the condition is
filled directly then there should be that one, but it doesn't feel safe
timing-wise)
I also read that wake_up does a write barrier if and only if it wakes
another thread up.. But I'm not sure about that either, and it actually
wouldn't be strong enough for us.
Thanks to Peter Zijlstra for helping me sort this out :)
What we need is:
[w] req->rc, 1 [r] req->status, 1
wmb rmb
[w] req->status, 1 [r] req->rc
Where the wmb ensures that rc gets written before status,
and the rmb ensures that if you observe status == 1, rc is the new value.
Anyway, this patch does this in the tidiest way I could come up with.
I'm not sure it would be a good idea not to do the barriers for
TCP/virtio, but if one of you *want* to and have a clean way to, feel
free to suggest a patch that adds barrier to RDMA only.
(FWIW, I didn't try virtio, but for tcp I actually noticed better
performances for some reason. I'm not sure of why, but I'd need to run
more tests...)
I'm also interested in any comment to make the commit message better; I
tried to sum this whole thing up but I'm not sure I made it clear enough
(the whole reason I need a 0/1 patch message!)
Thank you for your time if you have read this far! :)
Dominique Martinet (1):
9P: Add memory barriers to protect request fields over cb/rpc threads
handoff
include/net/9p/client.h | 2 +-
net/9p/client.c | 16 +++++++++++++++-
net/9p/trans_fd.c | 15 ++++++---------
net/9p/trans_rdma.c | 3 +--
net/9p/trans_virtio.c | 3 +--
5 files changed, 24 insertions(+), 15 deletions(-)
--
1.8.1.4
We need barriers to guarantee this pattern works as intended:
[w] req->rc, 1 [r] req->status, 1
wmb rmb
[w] req->status, 1 [r] req->rc
Where the wmb ensures that rc gets written before status,
and the rmb ensures that if you observe status == 1, rc is the new value.
Signed-off-by: Dominique Martinet <[email protected]>
---
include/net/9p/client.h | 2 +-
net/9p/client.c | 16 +++++++++++++++-
net/9p/trans_fd.c | 15 ++++++---------
net/9p/trans_rdma.c | 3 +--
net/9p/trans_virtio.c | 3 +--
5 files changed, 24 insertions(+), 15 deletions(-)
diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index c38a005..115aeac 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -261,7 +261,7 @@ int p9_client_mkdir_dotl(struct p9_fid *fid, char *name, int mode,
int p9_client_lock_dotl(struct p9_fid *fid, struct p9_flock *flock, u8 *status);
int p9_client_getlock_dotl(struct p9_fid *fid, struct p9_getlock *fl);
struct p9_req_t *p9_tag_lookup(struct p9_client *, u16);
-void p9_client_cb(struct p9_client *c, struct p9_req_t *req);
+void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status);
int p9_parse_header(struct p9_fcall *, int32_t *, int8_t *, int16_t *, int);
int p9stat_read(struct p9_client *, char *, int, struct p9_wstat *);
diff --git a/net/9p/client.c b/net/9p/client.c
index ee8fd6b..03431f0 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -415,9 +415,17 @@ static void p9_free_req(struct p9_client *c, struct p9_req_t *r)
* req: request received
*
*/
-void p9_client_cb(struct p9_client *c, struct p9_req_t *req)
+void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status)
{
p9_debug(P9_DEBUG_MUX, " tag %d\n", req->tc->tag);
+
+ /*
+ * This barrier is needed to make sure any change made to req before
+ * the other thread wakes up will indeed be seen by the waiting side.
+ */
+ smp_wmb();
+ req->status = status;
+
wake_up(req->wq);
p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc->tag);
}
@@ -751,6 +759,12 @@ again:
err = wait_event_interruptible(*req->wq,
req->status >= REQ_STATUS_RCVD);
+ /*
+ * Make sure our req is coherent with regard to updates in other
+ * threads - echoes to wmb() in the callback
+ */
+ smp_rmb();
+
if ((err == -ERESTARTSYS) && (c->status == Connected)
&& (type == P9_TFLUSH)) {
sigpending = 1;
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 9321a77..74e6bbc 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -212,15 +212,9 @@ static void p9_conn_cancel(struct p9_conn *m, int err)
m->err = err;
list_for_each_entry_safe(req, rtmp, &m->req_list, req_list) {
- req->status = REQ_STATUS_ERROR;
- if (!req->t_err)
- req->t_err = err;
list_move(&req->req_list, &cancel_list);
}
list_for_each_entry_safe(req, rtmp, &m->unsent_req_list, req_list) {
- req->status = REQ_STATUS_ERROR;
- if (!req->t_err)
- req->t_err = err;
list_move(&req->req_list, &cancel_list);
}
spin_unlock_irqrestore(&m->client->lock, flags);
@@ -228,7 +222,9 @@ static void p9_conn_cancel(struct p9_conn *m, int err)
list_for_each_entry_safe(req, rtmp, &cancel_list, req_list) {
p9_debug(P9_DEBUG_ERROR, "call back req %p\n", req);
list_del(&req->req_list);
- p9_client_cb(m->client, req);
+ if (!req->t_err)
+ req->t_err = err;
+ p9_client_cb(m->client, req, REQ_STATUS_ERROR);
}
}
@@ -302,6 +298,7 @@ static void p9_read_work(struct work_struct *work)
{
int n, err;
struct p9_conn *m;
+ int status = REQ_STATUS_ERROR;
m = container_of(work, struct p9_conn, rq);
@@ -375,10 +372,10 @@ static void p9_read_work(struct work_struct *work)
p9_debug(P9_DEBUG_TRANS, "got new packet\n");
spin_lock(&m->client->lock);
if (m->req->status != REQ_STATUS_ERROR)
- m->req->status = REQ_STATUS_RCVD;
+ status = REQ_STATUS_RCVD;
list_del(&m->req->req_list);
spin_unlock(&m->client->lock);
- p9_client_cb(m->client, m->req);
+ p9_client_cb(m->client, m->req, status);
m->rbuf = NULL;
m->rpos = 0;
m->rsize = 0;
diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
index 8f68df5..f127ae5 100644
--- a/net/9p/trans_rdma.c
+++ b/net/9p/trans_rdma.c
@@ -305,8 +305,7 @@ handle_recv(struct p9_client *client, struct p9_trans_rdma *rdma,
}
req->rc = c->rc;
- req->status = REQ_STATUS_RCVD;
- p9_client_cb(client, req);
+ p9_client_cb(client, req, REQ_STATUS_RCVD);
return;
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 9c5a1aa..81785b7b 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -164,8 +164,7 @@ static void req_done(struct virtqueue *vq)
p9_debug(P9_DEBUG_TRANS, ": rc %p\n", rc);
p9_debug(P9_DEBUG_TRANS, ": lookup tag %d\n", rc->tag);
req = p9_tag_lookup(chan->client, rc->tag);
- req->status = REQ_STATUS_RCVD;
- p9_client_cb(chan->client, req);
+ p9_client_cb(chan->client, req, REQ_STATUS_RCVD);
}
}
--
1.8.1.4
Hi,
Dominique Martinet wrote on Fri, Jan 17, 2014 :
> We need barriers to guarantee this pattern works as intended:
> [w] req->rc, 1 [r] req->status, 1
> wmb rmb
> [w] req->status, 1 [r] req->rc
>
> Where the wmb ensures that rc gets written before status,
> and the rmb ensures that if you observe status == 1, rc is the new value.
>
> Signed-off-by: Dominique Martinet <[email protected]>
> ---
> include/net/9p/client.h | 2 +-
> net/9p/client.c | 16 +++++++++++++++-
> net/9p/trans_fd.c | 15 ++++++---------
> net/9p/trans_rdma.c | 3 +--
> net/9p/trans_virtio.c | 3 +--
> 5 files changed, 24 insertions(+), 15 deletions(-)
Reminder to get some attention now that the pull request for merge
window has passed :)
Please ask if this needs a resend, but should still apply as is.
Would be particularily nice to get confirmation that this doesn't slow
everything down for tcp/virtio folk or raises another kind of blocker
(although I have nothing better short of another transport-specific
function call after the p9_client_rpc wait is over, which is worse than
this to me, but suggestions are always welcome!)
Cheers,
--
Dominique Martinet