Eric Van Hensbergen wrote on Thu, Jul 25, 2013 :
> So, the cancel function should be used to flush any pending requests that
> haven't actually been sent yet. Looking at the 9p RDMA code, it looks like
> the thought was that this wasn't going to be possible. Regardless of
> removing unsent requests, the flush will still be sent and if the server
> processes it before the original request and sends a flush response back
> then we need to clear the posted buffer. This is what rdma_cancelled is
> supposed to be doing. So, the fix is to hook it into the structure -- but
> looking at the code it seems like we probably need to do something more to
> reclaim the buffer rather than just incrementing a counter.
>
> To be clear this has less to do with recovery and more to do with the
> proper implementation of 9p flush semantics. By and large, those semantics
> won't impact static file system users -- but if anyone is using the
> transport to access synthetic filesystems or files then they'll definitely
> want to have a properly implemented flush setup. The way to test this is
> to get a blocking read on a remote named pipe or fifo and then ^C it.
Ok, I knew about the concept of flush but didn't think a ^C would cause
a -ESYSRESTART, so didn't think of that.
That said, reading from, say, a fifo is an entierly local operation: the
client does a walk, getattr, doesn't do anything 9p-wise, and clunks
when it's done with it.
As for the function needing a bit more work, there's a race, but on
"normal" requests I think it is about right - the answer lays in a
comment in rdma_request:
/* When an error occurs between posting the recv and the send,
* there will be a receive context posted without a pending request.
* Since there is no way to "un-post" it, we remember it and skip
* post_recv() for the next request.
* So here,
* see if we are this `next request' and need to absorb an excess rc.
* If yes, then drop and free our own, and do not recv_post().
**/
Basically, receive buffers are sent in a queue, and we can't "retrieve"
it back, so we just don't sent next one.
There is one problem though - if the server handles the original request
before getting the flush, the receive buffer will be consumed and we
won't send a new one, so we'll starve the reception queue.
I'm afraid I don't have any bright idea there...
While we are on reception buffer issues, there is another problem with
the queue of receive buffers, even without flush, in the following
scenario:
- post a buffer for tag 0, on a hanging request
- post a buffer for tag 1
- reply for tag 1 will come on buffer from tag 0
- post another request with tag 1.. the buffer already is in the queue,
and we don't know we can post the buffer associated with tag 0 back.
I haven't found how to reproduce this perfectly yet, but a dd with
blocksize 1MB and one with blocksize 10B in parallel brought the
mountpoint down (and the whole server was completely unavailable for the
duration of the dd - TCP sessions timed out, I even got IO errors on the
local disk :D)
Regards,
--
Dominique Martinet
I think I need to stop sending mails before triple-checking things!
So sorry for the multiple mails again.
Dominique Martinet wrote on Thu, Jul 25, 2013 :
> [rdma_cancelled]
> There is one problem though - if the server handles the original request
> before getting the flush, the receive buffer will be consumed and we
> won't send a new one, so we'll starve the reception queue.
> I'm afraid I don't have any bright idea there...
This still looks correct to me.
> While we are on reception buffer issues, there is another problem with
> the queue of receive buffers, even without flush, in the following
> scenario:
> - post a buffer for tag 0, on a hanging request
> - post a buffer for tag 1
> - reply for tag 1 will come on buffer from tag 0
> - post another request with tag 1.. the buffer already is in the queue,
> and we don't know we can post the buffer associated with tag 0 back.
It actually looks like the reply buffers are swapped properly - taken
out of the req struct into the context on send, then given back to the
appropriate req on reception, so on normal operation there's no problem
with what I described - sorry for crying wolf.
> I haven't found how to reproduce this perfectly yet, but a dd with
> blocksize 1MB and one with blocksize 10B in parallel brought the
> mountpoint down (and the whole server was completely unavailable for the
> duration of the dd - TCP sessions timed out, I even got IO errors on the
> local disk :D)
I need to run more tests to explain what happens with the two dds, but
it's easily reproductible with debugs on, I guess that helps with a race
somewhere.
Regards,
--
Dominique Martinet