Hi,
We've found a softlockup on the nfsv3 client (udp), while testing an iptables
rule, which simply drops traffic destined for the nfs server. The softlockup
occurs because xs_sendpages() returns a positive result (resulting in an
-EAGAIN), masking out the -EPERM that the lower level ip code returns.
The first patch restructures xs_sendpages() such that it can return a 'sent'
value in addition to an error value. It no longer over-loads the return value
as both a 'sent' and 'error' value. In this way the 'higher' level code can
make more informed decisions. This patch is intended to be a no-op.
The second patch makes use of the new return value to propagate the -EPERM
up instead of the -EAGAIN.
Thanks,
-Jason
Jason Baron (2):
rpc: return sent and err from xs_sendpages()
rpc: Add -EPERM processing for xs_udp_send_request()
net/sunrpc/clnt.c | 2 ++
net/sunrpc/xprtsock.c | 86 ++++++++++++++++++++++++++++-----------------------
2 files changed, 49 insertions(+), 39 deletions(-)
--
1.8.2.rc2
If an iptables drop rule is added for an nfs server, the client can end up in
a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
is ignored since the prior bits of the packet may have been successfully queued
and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
thinks that because some bits were queued it should return -EAGAIN. We then try
the request and again and a softlockup occurs. The test sequence is simply:
1) open a file on the nfs server '/nfs/foo' (mounted using udp)
2) iptables -A OUTPUT -d <nfs server ip> -j DROP
3) write to /nfs/foo
4) close /nfs/foo
5) iptables -D OUTPUT -d <nfs server ip> -j DROP
The softlockup occurs in step 4 above.
The previous patch, allows xs_sendpages() to return both a sent count and
any error values that may have occurred. Thus, if we get an -EPERM, return
that to the higher level code.
With this patch in place we can successfully abort the above sequence and
avoid the softlockup.
I also tried the above test case on an nfs mount on tcp and although the system
does not softlockup, I still ended up with the 'hung_task' firing after 120
seconds, due to the i/o being stuck. The tcp case appears a bit harder to fix,
since -EPERM appears to get ignored much lower down in the stack and does not
propogate up to xs_sendpages(). The tcp case is not quite as insidious as the
softlockup (since its not consuming cpu) and it is not addressed here.
Reported-by: Yigong Lou <[email protected]>
Signed-off-by: Jason Baron <[email protected]>
---
net/sunrpc/clnt.c | 2 ++
net/sunrpc/xprtsock.c | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 488ddee..a316012 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1913,6 +1913,7 @@ call_transmit_status(struct rpc_task *task)
case -EHOSTDOWN:
case -EHOSTUNREACH:
case -ENETUNREACH:
+ case -EPERM:
if (RPC_IS_SOFTCONN(task)) {
xprt_end_transmit(task);
rpc_exit(task, task->tk_status);
@@ -2047,6 +2048,7 @@ call_status(struct rpc_task *task)
case -EAGAIN:
task->tk_action = call_transmit;
break;
+ case -EPERM:
case -EIO:
/* shutdown or soft timeout */
rpc_exit(task, status);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 38eb17d..c41c5b2 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -643,6 +643,10 @@ static int xs_udp_send_request(struct rpc_task *task)
dprintk("RPC: xs_udp_send_request(%u) = %d\n",
xdr->len - req->rq_bytes_sent, status);
+ /* firewall is blocking us, no sense in retrying */
+ if (status == -EPERM)
+ goto process_status;
+
if (sent > 0 || status == 0) {
req->rq_xmit_bytes_sent += sent;
if (sent >= req->rq_slen)
@@ -651,6 +655,7 @@ static int xs_udp_send_request(struct rpc_task *task)
status = -EAGAIN;
}
+process_status:
switch (status) {
case -ENOTSOCK:
status = -ENOTCONN;
@@ -666,6 +671,7 @@ static int xs_udp_send_request(struct rpc_task *task)
case -ENOBUFS:
case -EPIPE:
case -ECONNREFUSED:
+ case -EPERM:
/* When the server has died, an ICMP port unreachable message
* prompts ECONNREFUSED. */
clear_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags);
--
1.8.2.rc2
On 09/18/2014 03:51 PM, Jason Baron wrote:
> If an error is returned after the first bits of a packet have already been
> successfully queued, xs_sendpages() will return a positive 'int' value
> indicating success. Callers seem to treat this as -EAGAIN.
>
> However, there are cases where its not a question of waiting for the write
> queue to drain. For example, when there is an iptables rule dropping packets
> to the destination, the lower level code can return -EPERM only after parts
> of the packet have been successfully queued. In this case, we can end up
> continuously retrying.
>
> This patch is meant to effectively be a no-op in preparation for subsequent
> patches that can make decisions based on both on the number of bytes sent
> by xs_sendpages() and any errors that may have be returned.
Nit: I don't like calling this patch a "no-op", since it does change the code. Saying that it's preparing for the next patch with no change in behavior is probably good enough! :)
>
> Signed-off-by: Jason Baron <[email protected]>
> ---
> net/sunrpc/xprtsock.c | 80 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 41 insertions(+), 39 deletions(-)
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 43cd89e..38eb17d 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -399,13 +399,13 @@ static int xs_send_kvec(struct socket *sock, struct sockaddr *addr, int addrlen,
> return kernel_sendmsg(sock, &msg, NULL, 0, 0);
> }
>
> -static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned int base, int more, bool zerocopy)
> +static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned int base, int more, bool zerocopy, int *sent_p)
> {
> ssize_t (*do_sendpage)(struct socket *sock, struct page *page,
> int offset, size_t size, int flags);
> struct page **ppage;
> unsigned int remainder;
> - int err, sent = 0;
> + int err;
>
> remainder = xdr->page_len - base;
> base += xdr->page_base;
> @@ -424,15 +424,15 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
> err = do_sendpage(sock, *ppage, base, len, flags);
> if (remainder == 0 || err != len)
> break;
> - sent += err;
> + *sent_p += err;
> ppage++;
> base = 0;
> }
> - if (sent == 0)
> - return err;
> - if (err > 0)
> - sent += err;
> - return sent;
> + if (err > 0) {
> + *sent_p += err;
> + err = 0;
> + }
> + return err;
> }
>
> /**
> @@ -445,10 +445,11 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
> * @zerocopy: true if it is safe to use sendpage()
Please update the documentation here to include the new parameter.
Anna
> *
> */
> -static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen, struct xdr_buf *xdr, unsigned int base, bool zerocopy)
> +static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen, struct xdr_buf *xdr, unsigned int base, bool zerocopy, int *sent_p)
> {
> unsigned int remainder = xdr->len - base;
> - int err, sent = 0;
> + int err = 0;
> + int sent = 0;
>
> if (unlikely(!sock))
> return -ENOTSOCK;
> @@ -465,7 +466,7 @@ static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen,
> err = xs_send_kvec(sock, addr, addrlen, &xdr->head[0], base, remainder != 0);
> if (remainder == 0 || err != len)
> goto out;
> - sent += err;
> + *sent_p += err;
> base = 0;
> } else
> base -= xdr->head[0].iov_len;
> @@ -473,23 +474,23 @@ static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen,
> if (base < xdr->page_len) {
> unsigned int len = xdr->page_len - base;
> remainder -= len;
> - err = xs_send_pagedata(sock, xdr, base, remainder != 0, zerocopy);
> - if (remainder == 0 || err != len)
> + err = xs_send_pagedata(sock, xdr, base, remainder != 0, zerocopy, &sent);
> + *sent_p += sent;
> + if (remainder == 0 || sent != len)
> goto out;
> - sent += err;
> base = 0;
> } else
> base -= xdr->page_len;
>
> if (base >= xdr->tail[0].iov_len)
> - return sent;
> + return 0;
> err = xs_send_kvec(sock, NULL, 0, &xdr->tail[0], base, 0);
> out:
> - if (sent == 0)
> - return err;
> - if (err > 0)
> - sent += err;
> - return sent;
> + if (err > 0) {
> + *sent_p += err;
> + err = 0;
> + }
> + return err;
> }
>
> static void xs_nospace_callback(struct rpc_task *task)
> @@ -573,19 +574,20 @@ static int xs_local_send_request(struct rpc_task *task)
> container_of(xprt, struct sock_xprt, xprt);
> struct xdr_buf *xdr = &req->rq_snd_buf;
> int status;
> + int sent = 0;
>
> xs_encode_stream_record_marker(&req->rq_snd_buf);
>
> xs_pktdump("packet data:",
> req->rq_svec->iov_base, req->rq_svec->iov_len);
>
> - status = xs_sendpages(transport->sock, NULL, 0,
> - xdr, req->rq_bytes_sent, true);
> + status = xs_sendpages(transport->sock, NULL, 0, xdr, req->rq_bytes_sent,
> + true, &sent);
> dprintk("RPC: %s(%u) = %d\n",
> __func__, xdr->len - req->rq_bytes_sent, status);
> - if (likely(status >= 0)) {
> - req->rq_bytes_sent += status;
> - req->rq_xmit_bytes_sent += status;
> + if (likely(sent > 0) || status == 0) {
> + req->rq_bytes_sent += sent;
> + req->rq_xmit_bytes_sent += sent;
> if (likely(req->rq_bytes_sent >= req->rq_slen)) {
> req->rq_bytes_sent = 0;
> return 0;
> @@ -626,6 +628,7 @@ static int xs_udp_send_request(struct rpc_task *task)
> struct rpc_xprt *xprt = req->rq_xprt;
> struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
> struct xdr_buf *xdr = &req->rq_snd_buf;
> + int sent = 0;
> int status;
>
> xs_pktdump("packet data:",
> @@ -634,17 +637,15 @@ static int xs_udp_send_request(struct rpc_task *task)
>
> if (!xprt_bound(xprt))
> return -ENOTCONN;
> - status = xs_sendpages(transport->sock,
> - xs_addr(xprt),
> - xprt->addrlen, xdr,
> - req->rq_bytes_sent, true);
> + status = xs_sendpages(transport->sock, xs_addr(xprt), xprt->addrlen,
> + xdr, req->rq_bytes_sent, true, &sent);
>
> dprintk("RPC: xs_udp_send_request(%u) = %d\n",
> xdr->len - req->rq_bytes_sent, status);
>
> - if (status >= 0) {
> - req->rq_xmit_bytes_sent += status;
> - if (status >= req->rq_slen)
> + if (sent > 0 || status == 0) {
> + req->rq_xmit_bytes_sent += sent;
> + if (sent >= req->rq_slen)
> return 0;
> /* Still some bytes left; set up for a retry later. */
> status = -EAGAIN;
> @@ -713,6 +714,7 @@ static int xs_tcp_send_request(struct rpc_task *task)
> struct xdr_buf *xdr = &req->rq_snd_buf;
> bool zerocopy = true;
> int status;
> + int sent = 0;
>
> xs_encode_stream_record_marker(&req->rq_snd_buf);
>
> @@ -730,26 +732,26 @@ static int xs_tcp_send_request(struct rpc_task *task)
> * to cope with writespace callbacks arriving _after_ we have
> * called sendmsg(). */
> while (1) {
> - status = xs_sendpages(transport->sock,
> - NULL, 0, xdr, req->rq_bytes_sent,
> - zerocopy);
> + sent = 0;
> + status = xs_sendpages(transport->sock, NULL, 0, xdr,
> + req->rq_bytes_sent, zerocopy, &sent);
>
> dprintk("RPC: xs_tcp_send_request(%u) = %d\n",
> xdr->len - req->rq_bytes_sent, status);
>
> - if (unlikely(status < 0))
> + if (unlikely(sent == 0 && status < 0))
> break;
>
> /* If we've sent the entire packet, immediately
> * reset the count of bytes sent. */
> - req->rq_bytes_sent += status;
> - req->rq_xmit_bytes_sent += status;
> + req->rq_bytes_sent += sent;
> + req->rq_xmit_bytes_sent += sent;
> if (likely(req->rq_bytes_sent >= req->rq_slen)) {
> req->rq_bytes_sent = 0;
> return 0;
> }
>
> - if (status != 0)
> + if (sent != 0)
> continue;
> status = -EAGAIN;
> break;
On Thu, Sep 18, 2014 at 9:54 PM, Jason Baron <[email protected]> wrote:
> On 09/18/2014 05:20 PM, Trond Myklebust wrote:
>> On Thu, Sep 18, 2014 at 5:02 PM, Jason Baron <[email protected]> wrote:
>>> On 09/18/2014 04:51 PM, Trond Myklebust wrote:
>>>> On Thu, Sep 18, 2014 at 3:51 PM, Jason Baron <[email protected]> wrote:
>>>>> If an iptables drop rule is added for an nfs server, the client can end up in
>>>>> a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
>>>>> is ignored since the prior bits of the packet may have been successfully queued
>>>>> and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
>>>>> thinks that because some bits were queued it should return -EAGAIN. We then try
>>>>> the request and again and a softlockup occurs. The test sequence is simply:
>>>>>
>>>>> 1) open a file on the nfs server '/nfs/foo' (mounted using udp)
>>>>> 2) iptables -A OUTPUT -d <nfs server ip> -j DROP
>>>>> 3) write to /nfs/foo
>>>>> 4) close /nfs/foo
>>>>> 5) iptables -D OUTPUT -d <nfs server ip> -j DROP
>>>>>
>>>>> The softlockup occurs in step 4 above.
>>>> For UDP, the expected and documented behaviour in the case above is as follows:
>>>> - if the mount is soft, then return EIO on the first major timeout.
>>> yeah - so this case is a softlockup in my testing :(
>>>
>>>> - if the mount is hard, then retry indefinitely on timeout.
>>>>
>>>> Won't these 2 patches end up propagating an EPERM to the application?
>>>> That would be a definite violation of both hard and soft semantics.
>>> ok, yeah it does propogate the -EPERM up - I wasn't aware of the correct
>>> semantics - thanks.
>>>
>>> I can rework the patches such that they return -EIO instead for a soft mount,
>>> and verify that we keep retrying for a hard one.
>>>
>> Doesn't the soft timeout currently trigger after the major timeout? If
>> not, do we understand why it isn't doing so?
>
> No, the soft timeout does not currently trigger after the major timeout. Instead,
> the kernel spins indefinitely, and triggers a softlockup.
>
> The reason is that xs_sendpages() returns a positive value in this case
> and xs_udp_send_request() turns it in an -EAGAIN for the write operation.
> Subsequently, we call call_transmit_status() and then call_status() which
> sees the EAGAIN, which just starts all over again with a 'call_transmit()'.
> So we are stuck spinning indefinitely in kernel space.
>
> Simply moving the -EPERM up in this patch, results in the behavior you
> described above - EIO after a major timeout on a soft mount, and indefinte
> retries on a hard mount - but without the cpu consumption. IE applying
> this on top of this patch:
>
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -2019,6 +2019,7 @@ call_status(struct rpc_task *task)
> case -EHOSTDOWN:
> case -EHOSTUNREACH:
> case -ENETUNREACH:
> + case -EPERM:
> if (RPC_IS_SOFTCONN(task)) {
> rpc_exit(task, status);
> break;
> @@ -2048,7 +2049,6 @@ call_status(struct rpc_task *task)
> case -EAGAIN:
> task->tk_action = call_transmit;
> break;
> - case -EPERM:
> case -EIO:
> /* shutdown or soft timeout */
> rpc_exit(task, status);
>
> We could also 'translate' the -EPERM into an '-ENETUNREACH' or such,
> in the return from xs_udp_send_request(), if you think that would make
> more sense?
>
> Hopefully, I've explained things better.
>
>
Yep. Can you please resend the patch with the above fix? I think we
can live with the EPERM in the RPC_IS_SOFTCONN case, given that it is
in practice only ever passed back to the 'mount' syscall.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On 09/19/2014 05:16 PM, Jason Baron wrote:
> On 09/19/2014 03:41 PM, Trond Myklebust wrote:
>> On Thu, Sep 18, 2014 at 9:54 PM, Jason Baron <[email protected]> wrote:
>>> On 09/18/2014 05:20 PM, Trond Myklebust wrote:
>>>> On Thu, Sep 18, 2014 at 5:02 PM, Jason Baron <[email protected]> wrote:
>>>>> On 09/18/2014 04:51 PM, Trond Myklebust wrote:
>>>>>> On Thu, Sep 18, 2014 at 3:51 PM, Jason Baron <[email protected]> wrote:
>>>>>>> If an iptables drop rule is added for an nfs server, the client can end up in
>>>>>>> a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
>>>>>>> is ignored since the prior bits of the packet may have been successfully queued
>>>>>>> and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
>>>>>>> thinks that because some bits were queued it should return -EAGAIN. We then try
>>>>>>> the request and again and a softlockup occurs. The test sequence is simply:
>>>>>>>
>>>>>>> 1) open a file on the nfs server '/nfs/foo' (mounted using udp)
>>>>>>> 2) iptables -A OUTPUT -d <nfs server ip> -j DROP
>>>>>>> 3) write to /nfs/foo
>>>>>>> 4) close /nfs/foo
>>>>>>> 5) iptables -D OUTPUT -d <nfs server ip> -j DROP
>>>>>>>
>>>>>>> The softlockup occurs in step 4 above.
>>>>>> For UDP, the expected and documented behaviour in the case above is as follows:
>>>>>> - if the mount is soft, then return EIO on the first major timeout.
>>>>> yeah - so this case is a softlockup in my testing :(
>>>>>
>>>>>> - if the mount is hard, then retry indefinitely on timeout.
>>>>>>
>>>>>> Won't these 2 patches end up propagating an EPERM to the application?
>>>>>> That would be a definite violation of both hard and soft semantics.
>>>>> ok, yeah it does propogate the -EPERM up - I wasn't aware of the correct
>>>>> semantics - thanks.
>>>>>
>>>>> I can rework the patches such that they return -EIO instead for a soft mount,
>>>>> and verify that we keep retrying for a hard one.
>>>>>
>>>> Doesn't the soft timeout currently trigger after the major timeout? If
>>>> not, do we understand why it isn't doing so?
>>>
>>> No, the soft timeout does not currently trigger after the major timeout. Instead,
>>> the kernel spins indefinitely, and triggers a softlockup.
>>>
>>> The reason is that xs_sendpages() returns a positive value in this case
>>> and xs_udp_send_request() turns it in an -EAGAIN for the write operation.
>>> Subsequently, we call call_transmit_status() and then call_status() which
>>> sees the EAGAIN, which just starts all over again with a 'call_transmit()'.
>>> So we are stuck spinning indefinitely in kernel space.
>>>
>>> Simply moving the -EPERM up in this patch, results in the behavior you
>>> described above - EIO after a major timeout on a soft mount, and indefinte
>>> retries on a hard mount - but without the cpu consumption. IE applying
>>> this on top of this patch:
>>>
>>> --- a/net/sunrpc/clnt.c
>>> +++ b/net/sunrpc/clnt.c
>>> @@ -2019,6 +2019,7 @@ call_status(struct rpc_task *task)
>>> case -EHOSTDOWN:
>>> case -EHOSTUNREACH:
>>> case -ENETUNREACH:
>>> + case -EPERM:
>>> if (RPC_IS_SOFTCONN(task)) {
>>> rpc_exit(task, status);
>>> break;
>>> @@ -2048,7 +2049,6 @@ call_status(struct rpc_task *task)
>>> case -EAGAIN:
>>> task->tk_action = call_transmit;
>>> break;
>>> - case -EPERM:
>>> case -EIO:
>>> /* shutdown or soft timeout */
>>> rpc_exit(task, status);
>>>
>>> We could also 'translate' the -EPERM into an '-ENETUNREACH' or such,
>>> in the return from xs_udp_send_request(), if you think that would make
>>> more sense?
>>>
>>> Hopefully, I've explained things better.
>>>
>>>
>>
>> Yep. Can you please resend the patch with the above fix? I think we
>> can live with the EPERM in the RPC_IS_SOFTCONN case, given that it is
>> in practice only ever passed back to the 'mount' syscall.
>>
>
> Hi,
>
> So after some more testing on this new patch, the test sequence I described
> works fine - but if I set the firewall rule first and then do an open, it
> appears that the open() wouldn't time out even on a soft mount (whereas
> with the previous patch it incorrectly returned -EPERM almost immediately).
> It appears that the rpc request is getting queued up onto one of the wait
> queues (xprt_backlog or xprt_sending) in that case, but I'm not sure why.
> I'll have to look more into it next week.
>
> Thanks,
>
> -Jason
>
>
Hi Trond,
Ok, so they do timeout now with this patch (for a soft mount) - I simply wasn't
waiting long enough (took around 30 minutes in some cases). So I think this
patch is ok. If it makes sense I can clean it up based on the comments, and
re-submit?
Thanks,
-Jason
On Mon, Sep 22, 2014 at 1:55 PM, Jason Baron <[email protected]> wrote:
> On 09/19/2014 05:16 PM, Jason Baron wrote:
>> On 09/19/2014 03:41 PM, Trond Myklebust wrote:
>>> On Thu, Sep 18, 2014 at 9:54 PM, Jason Baron <[email protected]> wrote:
>>>> On 09/18/2014 05:20 PM, Trond Myklebust wrote:
>>>>> On Thu, Sep 18, 2014 at 5:02 PM, Jason Baron <[email protected]> wrote:
>>>>>> On 09/18/2014 04:51 PM, Trond Myklebust wrote:
>>>>>>> On Thu, Sep 18, 2014 at 3:51 PM, Jason Baron <[email protected]> wrote:
>>>>>>>> If an iptables drop rule is added for an nfs server, the client can end up in
>>>>>>>> a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
>>>>>>>> is ignored since the prior bits of the packet may have been successfully queued
>>>>>>>> and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
>>>>>>>> thinks that because some bits were queued it should return -EAGAIN. We then try
>>>>>>>> the request and again and a softlockup occurs. The test sequence is simply:
>>>>>>>>
>>>>>>>> 1) open a file on the nfs server '/nfs/foo' (mounted using udp)
>>>>>>>> 2) iptables -A OUTPUT -d <nfs server ip> -j DROP
>>>>>>>> 3) write to /nfs/foo
>>>>>>>> 4) close /nfs/foo
>>>>>>>> 5) iptables -D OUTPUT -d <nfs server ip> -j DROP
>>>>>>>>
>>>>>>>> The softlockup occurs in step 4 above.
>>>>>>> For UDP, the expected and documented behaviour in the case above is as follows:
>>>>>>> - if the mount is soft, then return EIO on the first major timeout.
>>>>>> yeah - so this case is a softlockup in my testing :(
>>>>>>
>>>>>>> - if the mount is hard, then retry indefinitely on timeout.
>>>>>>>
>>>>>>> Won't these 2 patches end up propagating an EPERM to the application?
>>>>>>> That would be a definite violation of both hard and soft semantics.
>>>>>> ok, yeah it does propogate the -EPERM up - I wasn't aware of the correct
>>>>>> semantics - thanks.
>>>>>>
>>>>>> I can rework the patches such that they return -EIO instead for a soft mount,
>>>>>> and verify that we keep retrying for a hard one.
>>>>>>
>>>>> Doesn't the soft timeout currently trigger after the major timeout? If
>>>>> not, do we understand why it isn't doing so?
>>>>
>>>> No, the soft timeout does not currently trigger after the major timeout. Instead,
>>>> the kernel spins indefinitely, and triggers a softlockup.
>>>>
>>>> The reason is that xs_sendpages() returns a positive value in this case
>>>> and xs_udp_send_request() turns it in an -EAGAIN for the write operation.
>>>> Subsequently, we call call_transmit_status() and then call_status() which
>>>> sees the EAGAIN, which just starts all over again with a 'call_transmit()'.
>>>> So we are stuck spinning indefinitely in kernel space.
>>>>
>>>> Simply moving the -EPERM up in this patch, results in the behavior you
>>>> described above - EIO after a major timeout on a soft mount, and indefinte
>>>> retries on a hard mount - but without the cpu consumption. IE applying
>>>> this on top of this patch:
>>>>
>>>> --- a/net/sunrpc/clnt.c
>>>> +++ b/net/sunrpc/clnt.c
>>>> @@ -2019,6 +2019,7 @@ call_status(struct rpc_task *task)
>>>> case -EHOSTDOWN:
>>>> case -EHOSTUNREACH:
>>>> case -ENETUNREACH:
>>>> + case -EPERM:
>>>> if (RPC_IS_SOFTCONN(task)) {
>>>> rpc_exit(task, status);
>>>> break;
>>>> @@ -2048,7 +2049,6 @@ call_status(struct rpc_task *task)
>>>> case -EAGAIN:
>>>> task->tk_action = call_transmit;
>>>> break;
>>>> - case -EPERM:
>>>> case -EIO:
>>>> /* shutdown or soft timeout */
>>>> rpc_exit(task, status);
>>>>
>>>> We could also 'translate' the -EPERM into an '-ENETUNREACH' or such,
>>>> in the return from xs_udp_send_request(), if you think that would make
>>>> more sense?
>>>>
>>>> Hopefully, I've explained things better.
>>>>
>>>>
>>>
>>> Yep. Can you please resend the patch with the above fix? I think we
>>> can live with the EPERM in the RPC_IS_SOFTCONN case, given that it is
>>> in practice only ever passed back to the 'mount' syscall.
>>>
>>
>> Hi,
>>
>> So after some more testing on this new patch, the test sequence I described
>> works fine - but if I set the firewall rule first and then do an open, it
>> appears that the open() wouldn't time out even on a soft mount (whereas
>> with the previous patch it incorrectly returned -EPERM almost immediately).
>> It appears that the rpc request is getting queued up onto one of the wait
>> queues (xprt_backlog or xprt_sending) in that case, but I'm not sure why.
>> I'll have to look more into it next week.
>>
>> Thanks,
>>
>> -Jason
>>
>>
>
> Hi Trond,
>
> Ok, so they do timeout now with this patch (for a soft mount) - I simply wasn't
> waiting long enough (took around 30 minutes in some cases). So I think this
> patch is ok. If it makes sense I can clean it up based on the comments, and
> re-submit?
>
Please do. Thanks!
Trond
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On 09/18/2014 05:20 PM, Trond Myklebust wrote:
> On Thu, Sep 18, 2014 at 5:02 PM, Jason Baron <[email protected]> wrote:
>> On 09/18/2014 04:51 PM, Trond Myklebust wrote:
>>> On Thu, Sep 18, 2014 at 3:51 PM, Jason Baron <[email protected]> wrote:
>>>> If an iptables drop rule is added for an nfs server, the client can end up in
>>>> a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
>>>> is ignored since the prior bits of the packet may have been successfully queued
>>>> and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
>>>> thinks that because some bits were queued it should return -EAGAIN. We then try
>>>> the request and again and a softlockup occurs. The test sequence is simply:
>>>>
>>>> 1) open a file on the nfs server '/nfs/foo' (mounted using udp)
>>>> 2) iptables -A OUTPUT -d <nfs server ip> -j DROP
>>>> 3) write to /nfs/foo
>>>> 4) close /nfs/foo
>>>> 5) iptables -D OUTPUT -d <nfs server ip> -j DROP
>>>>
>>>> The softlockup occurs in step 4 above.
>>> For UDP, the expected and documented behaviour in the case above is as follows:
>>> - if the mount is soft, then return EIO on the first major timeout.
>> yeah - so this case is a softlockup in my testing :(
>>
>>> - if the mount is hard, then retry indefinitely on timeout.
>>>
>>> Won't these 2 patches end up propagating an EPERM to the application?
>>> That would be a definite violation of both hard and soft semantics.
>> ok, yeah it does propogate the -EPERM up - I wasn't aware of the correct
>> semantics - thanks.
>>
>> I can rework the patches such that they return -EIO instead for a soft mount,
>> and verify that we keep retrying for a hard one.
>>
> Doesn't the soft timeout currently trigger after the major timeout? If
> not, do we understand why it isn't doing so?
No, the soft timeout does not currently trigger after the major timeout. Instead,
the kernel spins indefinitely, and triggers a softlockup.
The reason is that xs_sendpages() returns a positive value in this case
and xs_udp_send_request() turns it in an -EAGAIN for the write operation.
Subsequently, we call call_transmit_status() and then call_status() which
sees the EAGAIN, which just starts all over again with a 'call_transmit()'.
So we are stuck spinning indefinitely in kernel space.
Simply moving the -EPERM up in this patch, results in the behavior you
described above - EIO after a major timeout on a soft mount, and indefinte
retries on a hard mount - but without the cpu consumption. IE applying
this on top of this patch:
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2019,6 +2019,7 @@ call_status(struct rpc_task *task)
case -EHOSTDOWN:
case -EHOSTUNREACH:
case -ENETUNREACH:
+ case -EPERM:
if (RPC_IS_SOFTCONN(task)) {
rpc_exit(task, status);
break;
@@ -2048,7 +2049,6 @@ call_status(struct rpc_task *task)
case -EAGAIN:
task->tk_action = call_transmit;
break;
- case -EPERM:
case -EIO:
/* shutdown or soft timeout */
rpc_exit(task, status);
We could also 'translate' the -EPERM into an '-ENETUNREACH' or such,
in the return from xs_udp_send_request(), if you think that would make
more sense?
Hopefully, I've explained things better.
Thanks,
-Jason
On 09/18/2014 04:51 PM, Trond Myklebust wrote:
> On Thu, Sep 18, 2014 at 3:51 PM, Jason Baron <[email protected]> wrote:
>> If an iptables drop rule is added for an nfs server, the client can end up in
>> a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
>> is ignored since the prior bits of the packet may have been successfully queued
>> and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
>> thinks that because some bits were queued it should return -EAGAIN. We then try
>> the request and again and a softlockup occurs. The test sequence is simply:
>>
>> 1) open a file on the nfs server '/nfs/foo' (mounted using udp)
>> 2) iptables -A OUTPUT -d <nfs server ip> -j DROP
>> 3) write to /nfs/foo
>> 4) close /nfs/foo
>> 5) iptables -D OUTPUT -d <nfs server ip> -j DROP
>>
>> The softlockup occurs in step 4 above.
> For UDP, the expected and documented behaviour in the case above is as follows:
> - if the mount is soft, then return EIO on the first major timeout.
yeah - so this case is a softlockup in my testing :(
> - if the mount is hard, then retry indefinitely on timeout.
>
> Won't these 2 patches end up propagating an EPERM to the application?
> That would be a definite violation of both hard and soft semantics.
ok, yeah it does propogate the -EPERM up - I wasn't aware of the correct
semantics - thanks.
I can rework the patches such that they return -EIO instead for a soft mount,
and verify that we keep retrying for a hard one.
Thanks,
-Jason
On 09/19/2014 03:41 PM, Trond Myklebust wrote:
> On Thu, Sep 18, 2014 at 9:54 PM, Jason Baron <[email protected]> wrote:
>> On 09/18/2014 05:20 PM, Trond Myklebust wrote:
>>> On Thu, Sep 18, 2014 at 5:02 PM, Jason Baron <[email protected]> wrote:
>>>> On 09/18/2014 04:51 PM, Trond Myklebust wrote:
>>>>> On Thu, Sep 18, 2014 at 3:51 PM, Jason Baron <[email protected]> wrote:
>>>>>> If an iptables drop rule is added for an nfs server, the client can end up in
>>>>>> a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
>>>>>> is ignored since the prior bits of the packet may have been successfully queued
>>>>>> and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
>>>>>> thinks that because some bits were queued it should return -EAGAIN. We then try
>>>>>> the request and again and a softlockup occurs. The test sequence is simply:
>>>>>>
>>>>>> 1) open a file on the nfs server '/nfs/foo' (mounted using udp)
>>>>>> 2) iptables -A OUTPUT -d <nfs server ip> -j DROP
>>>>>> 3) write to /nfs/foo
>>>>>> 4) close /nfs/foo
>>>>>> 5) iptables -D OUTPUT -d <nfs server ip> -j DROP
>>>>>>
>>>>>> The softlockup occurs in step 4 above.
>>>>> For UDP, the expected and documented behaviour in the case above is as follows:
>>>>> - if the mount is soft, then return EIO on the first major timeout.
>>>> yeah - so this case is a softlockup in my testing :(
>>>>
>>>>> - if the mount is hard, then retry indefinitely on timeout.
>>>>>
>>>>> Won't these 2 patches end up propagating an EPERM to the application?
>>>>> That would be a definite violation of both hard and soft semantics.
>>>> ok, yeah it does propogate the -EPERM up - I wasn't aware of the correct
>>>> semantics - thanks.
>>>>
>>>> I can rework the patches such that they return -EIO instead for a soft mount,
>>>> and verify that we keep retrying for a hard one.
>>>>
>>> Doesn't the soft timeout currently trigger after the major timeout? If
>>> not, do we understand why it isn't doing so?
>>
>> No, the soft timeout does not currently trigger after the major timeout. Instead,
>> the kernel spins indefinitely, and triggers a softlockup.
>>
>> The reason is that xs_sendpages() returns a positive value in this case
>> and xs_udp_send_request() turns it in an -EAGAIN for the write operation.
>> Subsequently, we call call_transmit_status() and then call_status() which
>> sees the EAGAIN, which just starts all over again with a 'call_transmit()'.
>> So we are stuck spinning indefinitely in kernel space.
>>
>> Simply moving the -EPERM up in this patch, results in the behavior you
>> described above - EIO after a major timeout on a soft mount, and indefinte
>> retries on a hard mount - but without the cpu consumption. IE applying
>> this on top of this patch:
>>
>> --- a/net/sunrpc/clnt.c
>> +++ b/net/sunrpc/clnt.c
>> @@ -2019,6 +2019,7 @@ call_status(struct rpc_task *task)
>> case -EHOSTDOWN:
>> case -EHOSTUNREACH:
>> case -ENETUNREACH:
>> + case -EPERM:
>> if (RPC_IS_SOFTCONN(task)) {
>> rpc_exit(task, status);
>> break;
>> @@ -2048,7 +2049,6 @@ call_status(struct rpc_task *task)
>> case -EAGAIN:
>> task->tk_action = call_transmit;
>> break;
>> - case -EPERM:
>> case -EIO:
>> /* shutdown or soft timeout */
>> rpc_exit(task, status);
>>
>> We could also 'translate' the -EPERM into an '-ENETUNREACH' or such,
>> in the return from xs_udp_send_request(), if you think that would make
>> more sense?
>>
>> Hopefully, I've explained things better.
>>
>>
>
> Yep. Can you please resend the patch with the above fix? I think we
> can live with the EPERM in the RPC_IS_SOFTCONN case, given that it is
> in practice only ever passed back to the 'mount' syscall.
>
Hi,
So after some more testing on this new patch, the test sequence I described
works fine - but if I set the firewall rule first and then do an open, it
appears that the open() wouldn't time out even on a soft mount (whereas
with the previous patch it incorrectly returned -EPERM almost immediately).
It appears that the rpc request is getting queued up onto one of the wait
queues (xprt_backlog or xprt_sending) in that case, but I'm not sure why.
I'll have to look more into it next week.
Thanks,
-Jason
On Thu, Sep 18, 2014 at 5:02 PM, Jason Baron <[email protected]> wrote:
> On 09/18/2014 04:51 PM, Trond Myklebust wrote:
>> On Thu, Sep 18, 2014 at 3:51 PM, Jason Baron <[email protected]> wrote:
>>> If an iptables drop rule is added for an nfs server, the client can end up in
>>> a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
>>> is ignored since the prior bits of the packet may have been successfully queued
>>> and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
>>> thinks that because some bits were queued it should return -EAGAIN. We then try
>>> the request and again and a softlockup occurs. The test sequence is simply:
>>>
>>> 1) open a file on the nfs server '/nfs/foo' (mounted using udp)
>>> 2) iptables -A OUTPUT -d <nfs server ip> -j DROP
>>> 3) write to /nfs/foo
>>> 4) close /nfs/foo
>>> 5) iptables -D OUTPUT -d <nfs server ip> -j DROP
>>>
>>> The softlockup occurs in step 4 above.
>> For UDP, the expected and documented behaviour in the case above is as follows:
>> - if the mount is soft, then return EIO on the first major timeout.
>
> yeah - so this case is a softlockup in my testing :(
>
>> - if the mount is hard, then retry indefinitely on timeout.
>>
>> Won't these 2 patches end up propagating an EPERM to the application?
>> That would be a definite violation of both hard and soft semantics.
>
> ok, yeah it does propogate the -EPERM up - I wasn't aware of the correct
> semantics - thanks.
>
> I can rework the patches such that they return -EIO instead for a soft mount,
> and verify that we keep retrying for a hard one.
>
Doesn't the soft timeout currently trigger after the major timeout? If
not, do we understand why it isn't doing so?
I'm just trying to figure out what the motivation is for this patch.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
If an error is returned after the first bits of a packet have already been
successfully queued, xs_sendpages() will return a positive 'int' value
indicating success. Callers seem to treat this as -EAGAIN.
However, there are cases where its not a question of waiting for the write
queue to drain. For example, when there is an iptables rule dropping packets
to the destination, the lower level code can return -EPERM only after parts
of the packet have been successfully queued. In this case, we can end up
continuously retrying.
This patch is meant to effectively be a no-op in preparation for subsequent
patches that can make decisions based on both on the number of bytes sent
by xs_sendpages() and any errors that may have be returned.
Signed-off-by: Jason Baron <[email protected]>
---
net/sunrpc/xprtsock.c | 80 ++++++++++++++++++++++++++-------------------------
1 file changed, 41 insertions(+), 39 deletions(-)
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89e..38eb17d 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -399,13 +399,13 @@ static int xs_send_kvec(struct socket *sock, struct sockaddr *addr, int addrlen,
return kernel_sendmsg(sock, &msg, NULL, 0, 0);
}
-static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned int base, int more, bool zerocopy)
+static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned int base, int more, bool zerocopy, int *sent_p)
{
ssize_t (*do_sendpage)(struct socket *sock, struct page *page,
int offset, size_t size, int flags);
struct page **ppage;
unsigned int remainder;
- int err, sent = 0;
+ int err;
remainder = xdr->page_len - base;
base += xdr->page_base;
@@ -424,15 +424,15 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
err = do_sendpage(sock, *ppage, base, len, flags);
if (remainder == 0 || err != len)
break;
- sent += err;
+ *sent_p += err;
ppage++;
base = 0;
}
- if (sent == 0)
- return err;
- if (err > 0)
- sent += err;
- return sent;
+ if (err > 0) {
+ *sent_p += err;
+ err = 0;
+ }
+ return err;
}
/**
@@ -445,10 +445,11 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
* @zerocopy: true if it is safe to use sendpage()
*
*/
-static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen, struct xdr_buf *xdr, unsigned int base, bool zerocopy)
+static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen, struct xdr_buf *xdr, unsigned int base, bool zerocopy, int *sent_p)
{
unsigned int remainder = xdr->len - base;
- int err, sent = 0;
+ int err = 0;
+ int sent = 0;
if (unlikely(!sock))
return -ENOTSOCK;
@@ -465,7 +466,7 @@ static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen,
err = xs_send_kvec(sock, addr, addrlen, &xdr->head[0], base, remainder != 0);
if (remainder == 0 || err != len)
goto out;
- sent += err;
+ *sent_p += err;
base = 0;
} else
base -= xdr->head[0].iov_len;
@@ -473,23 +474,23 @@ static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen,
if (base < xdr->page_len) {
unsigned int len = xdr->page_len - base;
remainder -= len;
- err = xs_send_pagedata(sock, xdr, base, remainder != 0, zerocopy);
- if (remainder == 0 || err != len)
+ err = xs_send_pagedata(sock, xdr, base, remainder != 0, zerocopy, &sent);
+ *sent_p += sent;
+ if (remainder == 0 || sent != len)
goto out;
- sent += err;
base = 0;
} else
base -= xdr->page_len;
if (base >= xdr->tail[0].iov_len)
- return sent;
+ return 0;
err = xs_send_kvec(sock, NULL, 0, &xdr->tail[0], base, 0);
out:
- if (sent == 0)
- return err;
- if (err > 0)
- sent += err;
- return sent;
+ if (err > 0) {
+ *sent_p += err;
+ err = 0;
+ }
+ return err;
}
static void xs_nospace_callback(struct rpc_task *task)
@@ -573,19 +574,20 @@ static int xs_local_send_request(struct rpc_task *task)
container_of(xprt, struct sock_xprt, xprt);
struct xdr_buf *xdr = &req->rq_snd_buf;
int status;
+ int sent = 0;
xs_encode_stream_record_marker(&req->rq_snd_buf);
xs_pktdump("packet data:",
req->rq_svec->iov_base, req->rq_svec->iov_len);
- status = xs_sendpages(transport->sock, NULL, 0,
- xdr, req->rq_bytes_sent, true);
+ status = xs_sendpages(transport->sock, NULL, 0, xdr, req->rq_bytes_sent,
+ true, &sent);
dprintk("RPC: %s(%u) = %d\n",
__func__, xdr->len - req->rq_bytes_sent, status);
- if (likely(status >= 0)) {
- req->rq_bytes_sent += status;
- req->rq_xmit_bytes_sent += status;
+ if (likely(sent > 0) || status == 0) {
+ req->rq_bytes_sent += sent;
+ req->rq_xmit_bytes_sent += sent;
if (likely(req->rq_bytes_sent >= req->rq_slen)) {
req->rq_bytes_sent = 0;
return 0;
@@ -626,6 +628,7 @@ static int xs_udp_send_request(struct rpc_task *task)
struct rpc_xprt *xprt = req->rq_xprt;
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
struct xdr_buf *xdr = &req->rq_snd_buf;
+ int sent = 0;
int status;
xs_pktdump("packet data:",
@@ -634,17 +637,15 @@ static int xs_udp_send_request(struct rpc_task *task)
if (!xprt_bound(xprt))
return -ENOTCONN;
- status = xs_sendpages(transport->sock,
- xs_addr(xprt),
- xprt->addrlen, xdr,
- req->rq_bytes_sent, true);
+ status = xs_sendpages(transport->sock, xs_addr(xprt), xprt->addrlen,
+ xdr, req->rq_bytes_sent, true, &sent);
dprintk("RPC: xs_udp_send_request(%u) = %d\n",
xdr->len - req->rq_bytes_sent, status);
- if (status >= 0) {
- req->rq_xmit_bytes_sent += status;
- if (status >= req->rq_slen)
+ if (sent > 0 || status == 0) {
+ req->rq_xmit_bytes_sent += sent;
+ if (sent >= req->rq_slen)
return 0;
/* Still some bytes left; set up for a retry later. */
status = -EAGAIN;
@@ -713,6 +714,7 @@ static int xs_tcp_send_request(struct rpc_task *task)
struct xdr_buf *xdr = &req->rq_snd_buf;
bool zerocopy = true;
int status;
+ int sent = 0;
xs_encode_stream_record_marker(&req->rq_snd_buf);
@@ -730,26 +732,26 @@ static int xs_tcp_send_request(struct rpc_task *task)
* to cope with writespace callbacks arriving _after_ we have
* called sendmsg(). */
while (1) {
- status = xs_sendpages(transport->sock,
- NULL, 0, xdr, req->rq_bytes_sent,
- zerocopy);
+ sent = 0;
+ status = xs_sendpages(transport->sock, NULL, 0, xdr,
+ req->rq_bytes_sent, zerocopy, &sent);
dprintk("RPC: xs_tcp_send_request(%u) = %d\n",
xdr->len - req->rq_bytes_sent, status);
- if (unlikely(status < 0))
+ if (unlikely(sent == 0 && status < 0))
break;
/* If we've sent the entire packet, immediately
* reset the count of bytes sent. */
- req->rq_bytes_sent += status;
- req->rq_xmit_bytes_sent += status;
+ req->rq_bytes_sent += sent;
+ req->rq_xmit_bytes_sent += sent;
if (likely(req->rq_bytes_sent >= req->rq_slen)) {
req->rq_bytes_sent = 0;
return 0;
}
- if (status != 0)
+ if (sent != 0)
continue;
status = -EAGAIN;
break;
--
1.8.2.rc2
On Thu, Sep 18, 2014 at 3:51 PM, Jason Baron <[email protected]> wrote:
> If an iptables drop rule is added for an nfs server, the client can end up in
> a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
> is ignored since the prior bits of the packet may have been successfully queued
> and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
> thinks that because some bits were queued it should return -EAGAIN. We then try
> the request and again and a softlockup occurs. The test sequence is simply:
>
> 1) open a file on the nfs server '/nfs/foo' (mounted using udp)
> 2) iptables -A OUTPUT -d <nfs server ip> -j DROP
> 3) write to /nfs/foo
> 4) close /nfs/foo
> 5) iptables -D OUTPUT -d <nfs server ip> -j DROP
>
> The softlockup occurs in step 4 above.
For UDP, the expected and documented behaviour in the case above is as follows:
- if the mount is soft, then return EIO on the first major timeout.
- if the mount is hard, then retry indefinitely on timeout.
Won't these 2 patches end up propagating an EPERM to the application?
That would be a definite violation of both hard and soft semantics.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]