2010-09-01 12:24:39

by Greg

[permalink] [raw]
Subject: Relocate NFS root FS for maintenance

Hi,

I don't know if this is the right place to ask about such a problem; if
not, please accept my apologies.

I have a pseudo active/active NFSv4 configuration: 2 NFSv4 servers
(1:1.1.2-6lenny1), exporting 3 filesystems. Each filesystem is backed by
a SAN volume, accessed via iSCSI and multipath.

filer-01 :
/var/local
10.0.0.0/16(rw,fsid=0,crossmnt,insecure,async,no_subtree_check)
/var/local/large
10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=3)

filer-02 :
/var/local
10.0.0.0/16(rw,fsid=0,crossmnt,insecure,async,no_subtree_check)
/var/local/small
10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=1)
/var/local/medium
10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=2)

filer-01 shares a volume on the first SAN; filer-02 shares 2 volumes on
the second SAN.

I have to do a firmware upgrade on the SANs, so I have to umount
/var/local/* on the NFS servers. One way could be to relocate the NFS root FS
from /var/local to /var/maintenance, with empty directories large, medium
and small.
But how can I do that? I tried changing /etc/exports and then running
exportfs -rvf, but got "Stale NFS file handle" errors on the clients...

If not, is there a better way to do this without shutting down the NFS servers?

--
Greg



2010-09-20 12:49:28

by Menyhart Zoltan

[permalink] [raw]
Subject: Locking question around "...PagePrivate()"

Hi,

nfs_inode_add_request() runs under the protection of the Page Lock
while doing SetPagePrivate() and set_page_private().

The consumers of PagePrivate() also run under the protection of the Page Lock.

On the other hand, nfs_inode_remove_request() invoked by
nfs_writeback_release_full() and nfs_commit_release() apparently does
not take the Page Lock.

nfs_inode_remove_request() includes the sequence of

set_page_private(req->wb_page, 0);
ClearPagePrivate(req->wb_page);

When the base kernel calls e.g.

mapping->a_ops->releasepage(page, gfp_mask)

then nfs_release_page() may see an incoherent "private" state.

Should not all the "private" operations be protected by the Page Lock?
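
To make the window being described concrete, here is a rough schematic of
the interleaving (illustration only, not the actual fs/nfs code paths; the
reader side merely stands in for nfs_release_page()):

/* Writer, cf. nfs_inode_remove_request(), page lock not held: */
set_page_private(req->wb_page, 0);          /* (1) */
ClearPagePrivate(req->wb_page);             /* (2) */

/* Reader, e.g. via mapping->a_ops->releasepage(), page lock held: */
if (PagePrivate(page)) {
        /* If this runs between (1) and (2), PagePrivate() still tests
         * true but page_private() already reads back as 0. */
        struct nfs_page *req = (struct nfs_page *)page_private(page);
        ...
}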

Thanks,

Zoltan Menyhart

2010-09-08 13:33:44

by Menyhart Zoltan

[permalink] [raw]
Subject: Re :statfs() gives ESTALE error

Hi,

An NFS client executes a statfs("file", &buff) call.
"file" exists / existed, the client has read / written it,
but it has already closed it.

user_path(pathname, &path) looks up "file" successfully in the
directory cache and restarts the aging timer of the directory entry,
even if "file" has already been removed from the server, because the
lookupcache=positive option I use keeps the entries valid for a while.

nfs_statfs() returns ESTALE if "file" has already been removed from the
server.

If the user application repeats the statfs("file", &buff) call, we
are stuck: "file" remains young forever in the directory-cache.
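
For illustration, the client-visible symptom can be sketched from user space
with a loop like the one below (a minimal example only; /mnt/nfs/file is a
hypothetical path on a mount using lookupcache=positive):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/vfs.h>                    /* statfs() */

int main(void)
{
        struct statfs buf;

        for (;;) {
                if (statfs("/mnt/nfs/file", &buf) == 0) {
                        printf("f_blocks = %llu\n",
                               (unsigned long long)buf.f_blocks);
                        return 0;
                }
                if (errno != ESTALE) {
                        perror("statfs");
                        return 1;
                }
                /*
                 * The cached, still-"valid" dentry is reused on every
                 * retry, so the loop can spin on ESTALE forever.
                 */
                fprintf(stderr, "statfs: %s, retrying\n", strerror(errno));
        }
}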

Signed-off-by: Zoltan Menyhart <[email protected]>


Attachments:
nfs_statfs.patch (595.00 B)

2010-09-02 16:06:59

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Relocate NFS root FS for maintenance

On Thu, Sep 02, 2010 at 09:32:18AM +0200, Greg wrote:
> Tom,
>
> >
> >What you are missing here is that the client uses two things to
> >access content on
> >the servers - path and file handles. When you do the reshare, you
> >would be pointing any
> >new requests to the empty directories. But, any application which
> >already had a
> >file handle would have a reference to the old mount (via the fsid
> >part of the file handle).
>
> OK thanks for the precisions. It's why I'm getting those errors :
> Sep 1 15:12:53 varan-14 kernel: [3424547.256518] NFS: server
> filer-large-vip.local error: fileid changed
> Sep 1 15:12:53 varan-14 kernel: [3424547.256518] fsid 0:13:
> expected fileid 0x2, got 0x41
>
> >The options I see are to:
> >
> >1) Shutdown NFS/remove write access to the export/etc - this is
> >along the lines of what
> >you have done. And the result is that the server will inform the
> >client of an error.
> >
> >2) Disconnect the servers from the network. (Or partition the
> >network). In this scenario,
> >the client will be getting timeouts and will probably use a retry schema.
> >
> >3) Shutdown the NFS clients - harsh, but they will not be
> >accessing the servers and you
> >can easily do the upgrades.
> >
> >These all result in downtime for both your servers and your clients.
>
> So in short there is no way to do a maintenance on attached storage,
> could it be hard drives, RAID, iSCSI or anything.
> Is such a "feature" in the roadmap ?

I think I just don't understand your setup, because I'm confused about
what you're asking for. (If the thing you're upgrading is literally the
only path between the nfs servers and their storage, then what can we
do?)

Would you mind explaining again from the start exactly what your setup
is and what you're hoping to accomplish by moving these mountpoints?
Apologies for being dense!

--b.

2010-09-02 07:32:21

by Greg

[permalink] [raw]
Subject: Re: Relocate NFS root FS for maintenance

Tom,

>
> What you are missing here is that the client uses two things to access
> content on
> the servers - path and file handles. When you do the reshare, you
> would be pointing any
> new requests to the empty directories. But, any application which
> already had a
> file handle would have a reference to the old mount (via the fsid part
> of the file handle).

OK, thanks for the clarification. That's why I'm getting these errors:
Sep 1 15:12:53 varan-14 kernel: [3424547.256518] NFS: server
filer-large-vip.local error: fileid changed
Sep 1 15:12:53 varan-14 kernel: [3424547.256518] fsid 0:13: expected
fileid 0x2, got 0x41

> The options I see are to:
>
> 1) Shutdown NFS/remove write access to the export/etc - this is along
> the lines of what
> you have done. And the result is that the server will inform the
> client of an error.
>
> 2) Disconnect the servers from the network. (Or partition the
> network). In this scenario,
> the client will be getting timeouts and will probably use a retry schema.
>
> 3) Shutdown the NFS clients - harsh, but they will not be accessing
> the servers and you
> can easily do the upgrades.
>
> These all result in downtime for both your servers and your clients.

So in short, there is no way to do maintenance on attached storage, be
it hard drives, RAID, iSCSI or anything else.
Is such a "feature" on the roadmap?

--
Greg


2010-09-07 06:59:29

by Greg

[permalink] [raw]
Subject: Re: Relocate NFS root FS for maintenance

On 02/09/2010 18:06, J. Bruce Fields wrote:
> I think I just don't understand your setup, because I'm confused about
> what you're asking for. (If the thing you're upgrading is literally the
> only path between the nfs servers and their storage, then what can we
> do?)

The setup:
clients (apache) ---> server1 nfs4 --> SAN storage via iscsi

I want to do a firmware upgrade on the SAN without stopping the NFS
server, because Apache will block on I/O and hit its limits within 2
seconds... and then never recover, so I have to reboot all the client servers.

A trick could be to relocate the export to an empty real partition on
the NFS server, so clients could access the share but not the data.
Apache would not complain, just answer with 404 Not Found errors :)

--
Greg


2010-09-02 06:52:27

by Menyhart Zoltan

[permalink] [raw]
Subject: statfs() gives ESTALE error

Hi,

An NFS client executes a statfs("file", &buff) call.
"file" exists / existed, the client has read / written it,
but it has already closed it.

user_path(pathname, &path) looks up "file" successfully in the
directory cache and restarts the aging timer of the directory entry,
even if "file" has already been removed from the server, because the
lookupcache=positive option I use keeps the entries valid for a while.

nfs_statfs() returns ESTALE if "file" has already been removed from the
server.

If the user application repeats the statfs("file", &buff) call, we
are stuck: "file" remains young forever in the directory-cache.

Should not this directory-entry be cleaned up?

--- linux-2.6.32.x86_64-orig/fs/nfs/super.c 2010-08-02 14:34:57.000000000 +0200
+++ linux-2.6.32.x86_64/fs/nfs/super.c 2010-08-31 16:57:30.000000000 +0200
@@ -429,6 +429,22 @@
int error;

error = server->nfs_client->rpc_ops->statfs(server, fh, &res);
+ if (unlikely(error == -ESTALE)){
+
+ struct inode *pd_inode;
+
+ BUG_ON(dentry->d_parent == NULL);
+ pd_inode = dentry->d_parent->d_inode;
+ BUG_ON(pd_inode == NULL);
+ /*
+ * This forces the revalidation code in nfs_lookup_revalidate() to do a
+ * full lookup on all child dentries of 'dir' whenever a change occurs
+ * on the server that might have invalidated our dcache.
+ */
+ spin_lock(&pd_inode->i_lock);
+ nfs_force_lookup_revalidate(pd_inode);
+ spin_unlock(&pd_inode->i_lock);
+ }
if (error < 0)
goto out_err;
buf->f_type = NFS_SUPER_MAGIC;

It's for 2.6.32, but 2.6.36 does not differ in this respect.

Thanks,

Zoltan Menyhart

2010-09-08 20:25:30

by Trond Myklebust

[permalink] [raw]
Subject: Re: Re :statfs() gives ESTALE error

On Wed, 2010-09-08 at 15:33 +0200, Menyhart Zoltan wrote:
> Hi,
>
> An NFS client executes a statfs("file", &buff) call.
> "file" exists / existed, the client has read / written it,
> but it has already closed it.
>
> user_path(pathname, &path) looks up "file" successfully in the
> directory-cache and restarts the aging timer of the directory-entry.
> Even if "file" has already been removed from the server, because the
> lookupcache=positive option I use, keeps the entries valid for a while.
>
> nfs_statfs() returns ESTALE if "file" has already been removed from the
> server.
>
> If the user application repeats the statfs("file", &buff) call, we
> are stuck: "file" remains young forever in the directory-cache.
>
> Signed-off-by: Zoltan Menyhart <[email protected]>
---
Please include the above as part of the patch itself if you send it as
an attachment. Otherwise, I need to manually sew the two back together
before I can apply them through git.

> diff -Nru linux-2.6.36-rc3-orig/fs/nfs/super.c
> linux-2.6.36-rc3-new/fs/nfs/super.c
> --- linux-2.6.36-rc3-orig/fs/nfs/super.c 2010-08-29
> 17:36:04.000000000 +0200
> +++ linux-2.6.36-rc3-new/fs/nfs/super.c 2010-09-08 11:33:03.078684569
> +0200
> @@ -431,7 +431,15 @@
> goto out_err;
>
> error = server->nfs_client->rpc_ops->statfs(server, fh, &res);
> + if (unlikely(error == -ESTALE)) {
>
> + struct dentry *pd_dentry;
> +
> + if ((pd_dentry = dget_parent(dentry)) != NULL) {

Assignments inside an 'if ()' are frowned upon by the Linux style gods.
I'll fix this up.
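
For reference, the fixed-up version would presumably just move the
assignment out of the condition, along these lines (a sketch, not
necessarily the exact code that was applied):

        pd_dentry = dget_parent(dentry);
        if (pd_dentry != NULL) {
                nfs_zap_caches(pd_dentry->d_inode);
                dput(pd_dentry);
        }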

> + nfs_zap_caches(pd_dentry->d_inode);
> + dput(pd_dentry);
> + }
> + }
> nfs_free_fattr(res.fattr);
> if (error < 0)
> goto out_err;


2010-09-01 21:52:38

by Tom Haynes

[permalink] [raw]
Subject: Re: Relocate NFS root FS for maintenance

On 9/1/2010 7:17 AM, Greg wrote:
> Hi,
>
> I don't know if it is the good place to ask for such a problem, if not
> please apologize.
>
> I have a pseudo active/active NFSv4 configuration: 2 NFSv4 servers
> (1:1.1.2-6lenny1), exporting 3 filesystems. Each filesystem are
> connected to a SAN volume, via iscsi and multipath.
>
> filer-01 :
> /var/local
> 10.0.0.0/16(rw,fsid=0,crossmnt,insecure,async,no_subtree_check)
> /var/local/large
> 10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=3)
>
> filer-02 :
> /var/local
> 10.0.0.0/16(rw,fsid=0,crossmnt,insecure,async,no_subtree_check)
> /var/local/small
> 10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=1)
> /var/local/medium
> 10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=2)
>
> filer-01 share a volume on the first SAN, filer-02 share 2 volumes on
> the second SAN.
>
> I have to do a firmware upgrade on the SANs, so I have to umount
> /var/local/* on the nfs servers. On way could be to relocate NFS root
> FS from /var/local to /var/maintenance with empty directories large,
> medium and small
> But how to do that ? I try to change /etc/exports then exportfs -rvf,
> but got "Stale NFS file handle" on clients...
>
> If not, is there a better way to do that, without to shutdown NFS
> servers ?
>

Greg,

What you are missing here is that the client uses two things to access
content on
the servers - path and file handles. When you do the reshare, you would
be pointing any
new requests to the empty directories. But, any application which
already had a
file handle would have a reference to the old mount (via the fsid part
of the file handle).

The client is detecting that the old mount is no longer being exported
by the server, so
it replies with ESTALE.

The options I see are to:

1) Shutdown NFS/remove write access to the export/etc - this is along
the lines of what
you have done. And the result is that the server will inform the client
of an error.

2) Disconnect the servers from the network. (Or partition the network).
In this scenario,
the client will be getting timeouts and will probably use a retry schema.

3) Shutdown the NFS clients - harsh, but they will not be accessing the
servers and you
can easily do the upgrades.

These all result in downtime for both your servers and your clients.

A long-term solution will be possible when replication is supported by
both the server and the clients. It is a protocol feature which I do not
think is implemented yet.

Tom

2010-09-01 17:35:10

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Relocate NFS root FS for maintenance

On Wed, Sep 01, 2010 at 02:17:45PM +0200, Greg wrote:
> Hi,
>
> I don't know if it is the good place to ask for such a problem, if
> not please apologize.
>
> I have a pseudo active/active NFSv4 configuration: 2 NFSv4 servers
> (1:1.1.2-6lenny1), exporting 3 filesystems. Each filesystem are
> connected to a SAN volume, via iscsi and multipath.
>
> filer-01 :
> /var/local
> 10.0.0.0/16(rw,fsid=0,crossmnt,insecure,async,no_subtree_check)
> /var/local/large
> 10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=3)
>
> filer-02 :
> /var/local
> 10.0.0.0/16(rw,fsid=0,crossmnt,insecure,async,no_subtree_check)
> /var/local/small
> 10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=1)
> /var/local/medium
> 10.0.0.0/16(rw,insecure,async,no_subtree_check,no_root_squash,fsid=2)
>
> filer-01 share a volume on the first SAN, filer-02 share 2 volumes
> on the second SAN.
>
> I have to do a firmware upgrade on the SANs, so I have to umount
> /var/local/* on the nfs servers. On way could be to relocate NFS
> root FS from /var/local to /var/maintenance with empty directories
> large, medium and small
> But how to do that ? I try to change /etc/exports then exportfs
> -rvf, but got "Stale NFS file handle" on clients...
>
> If not, is there a better way to do that, without to shutdown NFS servers ?

I'm a little confused as to what you actually want to happen. Clearly
clients can't continue as usual, because you're taking all the storage
away. So presumably you expect the clients to just hang till the
upgrade is done? If so, why not just shut down the servers?

--b.

2010-09-07 18:32:10

by Trond Myklebust

[permalink] [raw]
Subject: Re: statfs() gives ESTALE error

On Thu, 2010-09-02 at 08:56 +0200, Menyhart Zoltan wrote:
> Hi,
>
> An NFS client executes a statfs("file", &buff) call.
> "file" exists / existed, the client has read / written it,
> but it has already closed it.
>
> user_path(pathname, &path) looks up "file" successfully in the
> directory-cache and restarts the aging timer of the directory-entry.
> Even if "file" has already been removed from the server, because the
> lookupcache=positive opcion I use, keeps the entries valid for a while.
>
> nfs_statfs() returns ESTALE if "file" has already been removed from the
> server.
>
> If the user application repeats the statfs("file", &buff) call, we
> are stuck: "file" remains young forever in the directory-cache.
>
> Should not this directory-entry be cleaned up?
>
> --- linux-2.6.32.x86_64-orig/fs/nfs/super.c 2010-08-02 14:34:57.000000000 +0200
> +++ linux-2.6.32.x86_64/fs/nfs/super.c 2010-08-31 16:57:30.000000000 +0200
> @@ -429,6 +429,22 @@
> int error;
>
> error = server->nfs_client->rpc_ops->statfs(server, fh, &res);
> + if (unlikely(error == -ESTALE)){
> +
> + struct inode *pd_inode;
> +
> + BUG_ON(dentry->d_parent == NULL);
> + pd_inode = dentry->d_parent->d_inode;

Should be using dget_parent() to ensure that the parent directory
doesn't disappear beneath us while we're working on it (with a matching
dput()).

> + BUG_ON(pd_inode == NULL);

Why the 2 BUG_ON()? If the conditions are true, then the kernel will
Oops anyway.

> + /*
> + * This forces the revalidation code in nfs_lookup_revalidate() to do a
> + * full lookup on all child dentries of 'dir' whenever a change occurs
> + * on the server that might have invalidated our dcache.
> + */
> + spin_lock(&pd_inode->i_lock);
> + nfs_force_lookup_revalidate(pd_inode);
> + spin_unlock(&pd_inode->i_lock);

The above should probably be a nfs_zap_caches(). We need to not only
revalidate the lookup cache, but also to clear the readdir cache.
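
In other words, the three locked lines quoted above would presumably
collapse into a single call (sketch):

        nfs_zap_caches(pd_inode);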

> + }
> if (error < 0)
> goto out_err;
> buf->f_type = NFS_SUPER_MAGIC;
>
> It's for the 2.6.32, yet the 2.6.36 dos not change from this aspect.

Can you fix the above, then resend (for 2.6.36) with a Signed-off-by:
line?

Cheers
Trond


2010-09-20 13:55:36

by Trond Myklebust

[permalink] [raw]
Subject: Re: Locking question around "...PagePrivate()"

On Mon, 2010-09-20 at 14:49 +0200, Menyhart Zoltan wrote:
> Hi,
>
> nfs_inode_add_request() runs under the protection of the Page Lock
> while setting up SetPagePrivate(), set_page_private().
>
> The consumers of PagePrivate() run also under the protection of the Page Lock.
>
> On the other hand, nfs_inode_remove_request() invoked by
> nfs_writeback_release_full() and nfs_commit_release() apparently does
> not take the Page Lock.
>
> nfs_inode_remove_request() includes the sequence of
>
> set_page_private(req->wb_page, 0);
> ClearPagePrivate(req->wb_page);
>
> When the base kernel calls e.g.
>
> mapping->a_ops->releasepage(page, gfp_mask)
>
> then nfs_release_page() may see an incoherent "private" state.
>
> Should not all the "private" operations be protected by the Page Lock?

No.

Cheers
Trond


2010-10-21 20:38:02

by J. Bruce Fields

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

On Tue, Oct 05, 2010 at 10:22:30AM +0200, Menyhart Zoltan wrote:
> Due to some race conditions, the reference count can become 0
> while "xprt" is still on a "pool":

Apologies, your email got buried in my inbox....

>
> WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
> [] kref_get+0x23/0x2d
> [] svc_xprt_get+0x12/0x14 [sunrpc]
> [] svc_recv+0x2db/0x78a [sunrpc]

Which kernel exactly did you see this on? Is it reproducible?

> I think we should increase the reference counter before adding "xprt"
> onto any list.

I don't see the xprt added to any list after the svc_xprt_get() you've
added below.

Help, I'm confused!

--b.

>
> Obviously, we keep increasing the reference count before passing "xprt"
> to another thread via "rqstp->rq_xprt".
>
> Thanks,
>
> Zoltan Menyhart

> "svc_xprt_get()" should be invoked before adding "xprt" onto "pool" by
> "svc_xprt_enqueue()".
> Otherwise "xprt->xpt_ref" can drop to 0 due to race conditions.
>
> Signed-off-by: Zoltan Menyhart <[email protected]>
>
> --- old/net/sunrpc/svc_xprt.c 2010-10-05 09:47:51.000000000 +0200
> +++ new/net/sunrpc/svc_xprt.c 2010-10-05 09:47:20.000000000 +0200
> @@ -369,6 +369,7 @@
> }
>
> process:
> + svc_xprt_get(xprt);
> if (!list_empty(&pool->sp_threads)) {
> rqstp = list_entry(pool->sp_threads.next,
> struct svc_rqst,
> @@ -381,7 +382,6 @@
> "svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
> rqstp, rqstp->rq_xprt);
> rqstp->rq_xprt = xprt;
> - svc_xprt_get(xprt);
> rqstp->rq_reserved = serv->sv_max_mesg;
> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> pool->sp_stats.threads_woken++;
> @@ -655,7 +655,6 @@
> xprt = svc_xprt_dequeue(pool);
> if (xprt) {
> rqstp->rq_xprt = xprt;
> - svc_xprt_get(xprt);
> rqstp->rq_reserved = serv->sv_max_mesg;
> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> } else {


2010-10-25 11:57:24

by Menyhart Zoltan

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

> Maybe your fix is right, but I'm not sure: It looks to me like if
> svc_xprt_enqueue() gets to "process:" in a situation where the caller
> holds the only reference, then that's already a bug. Do you know who
> the caller of svc_xprt_enqueue() was when this happened?

Unfortunately, I do not.
We saw lots of warnings like this (before my patch):

WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
Hardware name: Stoutland Platform
Modules linked in: rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 scsi_dh_emc dm_round_robin dm_multipath iTCO_wdt i2c_i801 i2c_core igb ioatdma iTCO_vendor_support dca raid0 lpfc usb_storage scsi_transport_fc scsi_tgt [last unloaded: mlx4_core]
Pid: 3571, comm: nfsd Tainted: G D W 2.6.32.9-21.fc12.Bull.6.x86_64 #1
Call Trace:
[<ffffffff81050af0>] warn_slowpath_common+0x7c/0x94
[<ffffffff81050b1c>] warn_slowpath_null+0x14/0x16
[<ffffffff81222cca>] kref_get+0x23/0x2d
[<ffffffffa01d0a90>] svc_xprt_get+0x12/0x14 [sunrpc]
[<ffffffffa01d1903>] svc_recv+0x2db/0x78a [sunrpc]
[<ffffffff8104b2ef>] ? default_wake_function+0x0/0x14
[<ffffffffa02a1898>] nfsd+0xac/0x13f [nfsd]
[<ffffffffa02a17ec>] ? nfsd+0x0/0x13f [nfsd]
[<ffffffff8106ee0e>] kthread+0x7f/0x87
[<ffffffff8100cd6a>] child_rip+0xa/0x20
[<ffffffff8106ed8f>] ? kthread+0x0/0x87
[<ffffffff8100cd60>] ? child_rip+0x0/0x20

By the time you see messages like this, the guilty task is already gone...

> Doh. Wait, when you say "has not been corrected on 2.6.36-rc3", do you
> mean you've actually *seen* the problem occur on 2.6.36-rc3?

We do not really run kernels newer than 2.6.32 in any great number,
just some test configurations up to 2.6.36-rc6...
I have not seen this problem on recent kernels.

I'm not sure that I can really follow your explanation.

I have two reasons why I think my patch is good.

1. There are several code sequences where, just after calling "svc_xprt_enqueue()",
we drop "kref", i.e. we do not hand our reference over to the consumer. Therefore
"kref" should be increased by "svc_xprt_enqueue()" itself. See:

svc_revisit()
{
...
svc_xprt_enqueue(xprt);
svc_xprt_put(xprt);
}

svc_xprt_release()
{
...
svc_reserve(rqstp, 0):
...
svc_xprt_enqueue(xprt);
svc_xprt_put(xprt);
}

svc_check_conn_limits()
{
...
svc_xprt_enqueue(xprt);
svc_xprt_put(xprt);
}

svc_age_temp_xprts()
{
...
svc_xprt_enqueue(xprt);
svc_xprt_put(xprt);
}

2. Increasing "kref" by "svc_recv()" is too late: by the time you can increase
"kref", the structure may have already been destroyed.

As "svc_xprt_enqueue()" has not been modified since, I deduce that my patch is
correct for the 2.6.36-rc kernels, too.
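
As an aside, the ordering rule being argued for here - take the reference
on behalf of the consumer before the object is published on a shared queue,
rather than having the consumer take it after dequeueing - can be sketched
in plain user-space C. Everything below is an illustration only (invented
names, not the sunrpc code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
        atomic_int refcount;
};

static struct obj *queue_slot;          /* stands in for the pool's list */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void obj_put(struct obj *o)
{
        if (atomic_fetch_sub(&o->refcount, 1) == 1) {
                printf("last reference dropped, freeing\n");
                free(o);
        }
}

/* Take the consumer's reference *before* the object becomes visible on
 * the shared queue; this is the ordering the patch argues for. */
static void enqueue(struct obj *o)
{
        pthread_mutex_lock(&queue_lock);
        atomic_fetch_add(&o->refcount, 1);
        queue_slot = o;
        pthread_mutex_unlock(&queue_lock);
}

static void *consumer(void *arg)
{
        struct obj *o;

        (void)arg;
        pthread_mutex_lock(&queue_lock);
        o = queue_slot;
        queue_slot = NULL;
        pthread_mutex_unlock(&queue_lock);
        if (o)
                obj_put(o);     /* drops the reference enqueue() took */
        return NULL;
}

int main(void)
{
        struct obj *o = calloc(1, sizeof(*o));
        pthread_t t;

        atomic_init(&o->refcount, 1);   /* the producer's own reference */
        enqueue(o);                     /* consumer's reference taken here */
        pthread_create(&t, NULL, consumer, NULL);
        obj_put(o);                     /* safe: the queued reference keeps
                                         * the object alive for the consumer */
        pthread_join(t, NULL);
        return 0;
}

If enqueue() instead left the atomic_fetch_add() to the consumer (the
pattern svc_recv() used), the producer's obj_put() could free the object
while it was still reachable through queue_slot.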

Thanks.

Zoltan Menyhart

2010-10-26 00:29:03

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Mon, Oct 25, 2010 at 08:11:45PM -0400, J. Bruce Fields wrote:
> On Tue, Oct 26, 2010 at 10:54:47AM +1100, Neil Brown wrote:
> > Note that once we always set XPT_BUSY and it stays set. So we call
> > svc_delete_xprt instread of svc_close_xprt.
> >
> > Maybe we don't actually need to list_del_init - both the pool and the xprt
> > will soon be freed and if there is linkage between them, who cares??
> > In that case we wouldn't need to for_each_pool after all ???

So, untested:

commit 5e0bffc6376d8d3316771e12af698631a4ca2634
Author: J. Bruce Fields <[email protected]>
Date: Mon Oct 25 18:11:21 2010 -0400

svcrpc: simplify svc_close_all

There's no need to be fooling with XPT_BUSY now that all the threads
are gone.

The list_del_init() here could execute at the same time as the
svc_xprt_enqueue()'s list_add_tail(), with undefined results. We don't
really care at this point, but it might result in a spurious
list-corruption warning or something.

And svc_close() isn't adding any value; just call svc_delete_xprt()
directly.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index dd61cd0..8c018df 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -941,16 +941,16 @@ void svc_close_all(struct list_head *xprt_list)
struct svc_xprt *xprt;
struct svc_xprt *tmp;

+ /*
+ * The server is shutting down, and no more threads are running.
+ * svc_xprt_enqueue() might still be running, but at worst it
+ * will re-add the xprt to sp_sockets, which will soon get
+ * freed. So we don't bother with any more locking, and don't
+ * leave the close to the (nonexistent) server threads:
+ */
list_for_each_entry_safe(xprt, tmp, xprt_list, xpt_list) {
set_bit(XPT_CLOSE, &xprt->xpt_flags);
- if (test_bit(XPT_BUSY, &xprt->xpt_flags)) {
- /* Waiting to be processed, but no threads left,
- * So just remove it from the waiting list
- */
- list_del_init(&xprt->xpt_ready);
- clear_bit(XPT_BUSY, &xprt->xpt_flags);
- }
- svc_close_xprt(xprt);
+ svc_delete_xprt(xprt);
}
}


2010-10-26 00:11:51

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Tue, Oct 26, 2010 at 10:54:47AM +1100, Neil Brown wrote:
> On Mon, 25 Oct 2010 19:03:35 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Tue, Oct 26, 2010 at 09:58:36AM +1100, Neil Brown wrote:
> > > On Mon, 25 Oct 2010 16:21:56 -0400
> > > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > > On Mon, Oct 25, 2010 at 12:43:57PM +1100, Neil Brown wrote:
> > > > > On Sun, 24 Oct 2010 21:21:30 -0400
> > > > > "J. Bruce Fields" <[email protected]> wrote:
> > > > >
> > > > > > Once an xprt has been deleted, there's no reason to allow it to be
> > > > > > enqueued--at worst, that might cause the xprt to be re-added to some
> > > > > > global list, resulting in later corruption.
> > > > > >
> > > > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > > >
> > > > > Yep, this makes svc_close_xprt() behave the same way as svc_recv() which
> > > > > calls svc_delete_xprt but does not clear XPT_BUSY. The other branches in
> > > > > svc_recv call svc_xprt_received, but the XPT_CLOSE branch doesn't
> > > > >
> > > > > Reviewed-by: NeilBrown <[email protected]>
> > > >
> > > > Also, of course:
> > > >
> > > > > > svc_xprt_get(xprt);
> > > > > > svc_delete_xprt(xprt);
> > > > > > - clear_bit(XPT_BUSY, &xprt->xpt_flags);
> > > > > > svc_xprt_put(xprt);
> > > >
> > > > The get/put is pointless: the only reason I can see for doing that of
> > > > course was to be able to safely clear the bit afterwards.
> > > >
> > >
> > > Agreed.
> > >
> > > I like patches that get rid of code!!
> >
> > Unfortunately, I'm stuck on just one more point: is svc_close_all()
> > really safe? It assumes it doesn't need any locking to speak of any
> > more because the server threads are gone--but the xprt's themselves
> > could still be producing events, right? (So data could be arriving that
> > results in calls to svc_xprt_enqueue, for example?)
> >
> > If that's right, I'm not sure what to do there....
> >
> > --b.
>
> Yes, svc_close_all is racy w.r.t. svc_xprt_enqueue.
> I guess we've never lost that race?
>
> The race happens if the test_and_set(XPT_BUSY) in svc_xprt_enqueue happens
> before the test_bit(XPT_BUSY) in svc_close_all, but the list_add_tail at the
> end of svc_xprt_enqueue happens before (or during!) the list_del_init in
> svc_close_all.
>
> We cannot really lock against this race as svc_xprt_enqueue holds the pool
> lock, and svc_close_all doesn't know which pool to lock (as xprt->pool isn't
> set until after XPT_BUSY is set).
>
> Maybe we just need to lock all pools in that case??
>
> So svc_close_all becomes something like:
>
>
> void svc_close_all(struct list_head *xprt_list)
> {
> struct svc_xprt *xprt;
> struct svc_xprt *tmp;
> struct svc_pool *pool;
>
> list_for_each_entry_safe(xprt, tmp, xprt_list, xpt_list) {
> set_bit(XPT_CLOSE, &xprt->xpt_flags);
> if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
> /* Waiting to be processed, but no threads left,
> * So just remove it from the waiting list. First
> * we need to ensure svc_xprt_enqueue isn't still
> * queuing the xprt to some pool.
> */
> for_each_pool(pool, xprt->xpt_server) {
> spin_lock(&pool->sp_lock);
> spin_unlock(&pool->sp_lock);
> }
> list_del_init(&xprt->xpt_ready);
> }
> svc_delete_xprt(xprt);
> }
> }
>
>
> Note that once we always set XPT_BUSY and it stays set. So we call
> svc_delete_xprt instread of svc_close_xprt.
>
> Maybe we don't actually need to list_del_init - both the pool and the xprt
> will soon be freed and if there is linkage between them, who cares??
> In that case we wouldn't need to for_each_pool after all ???

Yeah, that's what I'm thinking now: if svc_xprt_enqueue is the only
thing that could still be running, then at worst it adds the xprt back
to sp_sockets or something, which we'll delete soon.

And yes I think we can just remove svc_close_all()'s attempts to handle
the situation--if its list_del_init(&xprt->xpt_ready) executes
simultaneously with svc_xprt_enqueue()'s list_add_tail()--well, maybe we
don't care what the results are, but if nothing else we could end up
with a spurious list-corruption warning.

So, delete more code, I think....

--b.

2010-10-25 23:08:30

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 4/4] svcrpc: svc_tcp_sendto XTP_DEAD check is redundant

On Mon, 25 Oct 2010 13:46:32 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Oct 25, 2010 at 11:03:08AM -0400, J. Bruce Fields wrote:
> > On Mon, Oct 25, 2010 at 01:10:24PM +1100, Neil Brown wrote:
> > > On Sun, 24 Oct 2010 21:21:33 -0400
> > > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > > The only caller (svc_send) has already checked XPT_DEAD.
> > > >
> > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > > ---
> > > > net/sunrpc/svcsock.c | 3 ---
> > > > 1 files changed, 0 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > > index 1454739..07919e1 100644
> > > > --- a/net/sunrpc/svcsock.c
> > > > +++ b/net/sunrpc/svcsock.c
> > > > @@ -1135,9 +1135,6 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> > > > reclen = htonl(0x80000000|((xbufp->len ) - 4));
> > > > memcpy(xbufp->head[0].iov_base, &reclen, 4);
> > > >
> > > > - if (test_bit(XPT_DEAD, &rqstp->rq_xprt->xpt_flags))
> > > > - return -ENOTCONN;
> > > > -
> > > > sent = svc_sendto(rqstp, &rqstp->rq_res);
> > > > if (sent != xbufp->len) {
> > > > printk(KERN_NOTICE
> > >
> > >
> > > So after removing all these references to XPT_DEAD, do we need XPT_DEAD at
> > > all???
> > >
> > > I think it is only used in two other places.
> > >
> > > 1/ In svc_revisit we don't queue the deferred request to an XPT_DEAD
> > > transport.
> > > We could avoid that but changing the 'owner' of a deferred request from the
> > > service to the xprt, and call cache_clean_deferred in svc_delete_xprt
> >
> > That use does seem a bit of a hack to me, so I'd be happy to get rid of
> > it.
>
> Eh, but then don't we end up doing the same check when deferring, to
> prevent deferring a request on a dead xprt?

Good point.
However....

If we change svc_defer to set handle.owner to rqstp->rq_xprt rather than
rqstp->rq_server (as already suggested), then we don't need the svc_xprt_get().
i.e. the deferred request doesn't need to hold a reference to the xprt,
because when the xprt is finally removed it is easy (cache_clean_deferred) to
remove all those deferred requests.
So we put the call to cache_clean_deferred in svc_xprt_free. By this stage
we are guaranteed not to get more deferrals as the xprt is dead.
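
A rough sketch of that idea (illustrative only, not a tested patch; the
surrounding context is approximate):

In svc_defer():

-	dr->handle.owner = rqstp->rq_server;
+	dr->handle.owner = rqstp->rq_xprt;

and in svc_xprt_free():

+	/* No new deferrals can arrive once the xprt is dead, so any
+	 * still queued against it can be dropped here. */
+	cache_clean_deferred(xprt);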

>
> Maybe we should leave well enough alone here.

That is certainly an option, especially if it turns out that we cannot remove
XPT_DEAD completely.

Thanks,
NeilBrown

>
> --b.


2010-10-25 20:22:00

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Mon, Oct 25, 2010 at 12:43:57PM +1100, Neil Brown wrote:
> On Sun, 24 Oct 2010 21:21:30 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > Once an xprt has been deleted, there's no reason to allow it to be
> > enqueued--at worst, that might cause the xprt to be re-added to some
> > global list, resulting in later corruption.
> >
> > Signed-off-by: J. Bruce Fields <[email protected]>
>
> Yep, this makes svc_close_xprt() behave the same way as svc_recv() which
> calls svc_delete_xprt but does not clear XPT_BUSY. The other branches in
> svc_recv call svc_xprt_received, but the XPT_CLOSE branch doesn't
>
> Reviewed-by: NeilBrown <[email protected]>

Also, of course:

> > svc_xprt_get(xprt);
> > svc_delete_xprt(xprt);
> > - clear_bit(XPT_BUSY, &xprt->xpt_flags);
> > svc_xprt_put(xprt);

The get/put is pointless: the only reason I can see for doing that of
course was to be able to safely clear the bit afterwards.

--b.

2010-10-05 08:22:53

by Menyhart Zoltan

[permalink] [raw]
Subject: "xprt" reference count drops to 0

Hi,

Due to some race conditions, the reference count can become 0
while "xprt" is still on a "pool":

WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
[] kref_get+0x23/0x2d
[] svc_xprt_get+0x12/0x14 [sunrpc]
[] svc_recv+0x2db/0x78a [sunrpc]

I think we should increase the reference counter before adding "xprt"
onto any list.

Obviously, we keep increasing the reference count before passing "xprt"
to another thread via "rqstp->rq_xprt".

Thanks,

Zoltan Menyhart


Attachments:
diff (1.00 kB)

2010-10-25 23:54:54

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Mon, 25 Oct 2010 19:03:35 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Tue, Oct 26, 2010 at 09:58:36AM +1100, Neil Brown wrote:
> > On Mon, 25 Oct 2010 16:21:56 -0400
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Mon, Oct 25, 2010 at 12:43:57PM +1100, Neil Brown wrote:
> > > > On Sun, 24 Oct 2010 21:21:30 -0400
> > > > "J. Bruce Fields" <[email protected]> wrote:
> > > >
> > > > > Once an xprt has been deleted, there's no reason to allow it to be
> > > > > enqueued--at worst, that might cause the xprt to be re-added to some
> > > > > global list, resulting in later corruption.
> > > > >
> > > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > >
> > > > Yep, this makes svc_close_xprt() behave the same way as svc_recv() which
> > > > calls svc_delete_xprt but does not clear XPT_BUSY. The other branches in
> > > > svc_recv call svc_xprt_received, but the XPT_CLOSE branch doesn't
> > > >
> > > > Reviewed-by: NeilBrown <[email protected]>
> > >
> > > Also, of course:
> > >
> > > > > svc_xprt_get(xprt);
> > > > > svc_delete_xprt(xprt);
> > > > > - clear_bit(XPT_BUSY, &xprt->xpt_flags);
> > > > > svc_xprt_put(xprt);
> > >
> > > The get/put is pointless: the only reason I can see for doing that of
> > > course was to be able to safely clear the bit afterwards.
> > >
> >
> > Agreed.
> >
> > I like patches that get rid of code!!
>
> Unfortunately, I'm stuck on just one more point: is svc_close_all()
> really safe? It assumes it doesn't need any locking to speak of any
> more because the server threads are gone--but the xprt's themselves
> could still be producing events, right? (So data could be arriving that
> results in calls to svc_xprt_enqueue, for example?)
>
> If that's right, I'm not sure what to do there....
>
> --b.

Yes, svc_close_all is racy w.r.t. svc_xprt_enqueue.
I guess we've never lost that race?

The race happens if the test_and_set(XPT_BUSY) in svc_xprt_enqueue happens
before the test_bit(XPT_BUSY) in svc_close_all, but the list_add_tail at the
end of svc_xprt_enqueue happens before (or during!) the list_del_init in
svc_close_all.

We cannot really lock against this race as svc_xprt_enqueue holds the pool
lock, and svc_close_all doesn't know which pool to lock (as xprt->pool isn't
set until after XPT_BUSY is set).

Maybe we just need to lock all pools in that case??

So svc_close_all becomes something like:


void svc_close_all(struct list_head *xprt_list)
{
struct svc_xprt *xprt;
struct svc_xprt *tmp;
struct svc_pool *pool;

list_for_each_entry_safe(xprt, tmp, xprt_list, xpt_list) {
set_bit(XPT_CLOSE, &xprt->xpt_flags);
if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
/* Waiting to be processed, but no threads left,
* So just remove it from the waiting list. First
* we need to ensure svc_xprt_enqueue isn't still
* queuing the xprt to some pool.
*/
for_each_pool(pool, xprt->xpt_server) {
spin_lock(&pool->sp_lock);
spin_unlock(&pool->sp_lock);
}
list_del_init(&xprt->xpt_ready);
}
svc_delete_xprt(xprt);
}
}


Note that we now always set XPT_BUSY and it stays set, so we call
svc_delete_xprt instead of svc_close_xprt.

Maybe we don't actually need to list_del_init - both the pool and the xprt
will soon be freed and if there is linkage between them, who cares??
In that case we wouldn't need to for_each_pool after all ???


NeilBrown

2010-10-26 01:28:29

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Mon, 25 Oct 2010 20:30:08 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Oct 25, 2010 at 08:28:58PM -0400, J. Bruce Fields wrote:
> > On Mon, Oct 25, 2010 at 08:11:45PM -0400, J. Bruce Fields wrote:
> > > On Tue, Oct 26, 2010 at 10:54:47AM +1100, Neil Brown wrote:
> > > > Note that once we always set XPT_BUSY and it stays set. So we call
> > > > svc_delete_xprt instread of svc_close_xprt.
> > > >
> > > > Maybe we don't actually need to list_del_init - both the pool and the xprt
> > > > will soon be freed and if there is linkage between them, who cares??
> > > > In that case we wouldn't need to for_each_pool after all ???
> >
> > So, untested:
>
> Ditto. But, hey, it deletes more code, it must be good.
>
> --b.
>
> commit 99401f6f6c3d0782ef79bb37749ace2accd4c79a
> Author: J. Bruce Fields <[email protected]>
> Date: Mon Oct 25 20:24:48 2010 -0400
>
> svcrpc: simplify svc_close_xprt, fix minor race
>
> We're assuming that in the XPT_BUSY case, somebody else will take care
> of the close. But that's not necessarily the case (e.g. if
> svc_xprt_enqueue() exits due to a lack of write space). So the close
> could in theory be delayed indefinitely. We should be calling
> svc_xprt_enqueue() after every set of XPT_CLOSE.
>
> Also with svc_close_all() no longer relying on svc_close, it no longer
> needs to be able to immediately do the svc_delete_xprt() itself.
>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 8c018df..dfee123 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -928,11 +928,7 @@ void svc_delete_xprt(struct svc_xprt *xprt)
> void svc_close_xprt(struct svc_xprt *xprt)
> {
> set_bit(XPT_CLOSE, &xprt->xpt_flags);
> - if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
> - /* someone else will have to effect the close */
> - return;
> -
> - svc_delete_xprt(xprt);
> + svc_xprt_enqueue(xprt);
> }
> EXPORT_SYMBOL_GPL(svc_close_xprt);
>

I'm not quite so sure about this one.

svc_close_xprt gets called when user-space writes some magic string to some
magic file. This could be done when there are no active nfsd threads.
It is a bit far-fetched, but you could tell nfsd to open a socket, then tell
it to close it, before telling it to start any threads.

In that case the socket would stay open until the first thread was started.

So ... not quite convinced (much as the code reduction is nice!).

NeilBrown

2010-10-25 17:46:44

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 4/4] svcrpc: svc_tcp_sendto XTP_DEAD check is redundant

On Mon, Oct 25, 2010 at 11:03:08AM -0400, J. Bruce Fields wrote:
> On Mon, Oct 25, 2010 at 01:10:24PM +1100, Neil Brown wrote:
> > On Sun, 24 Oct 2010 21:21:33 -0400
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > The only caller (svc_send) has already checked XPT_DEAD.
> > >
> > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > ---
> > > net/sunrpc/svcsock.c | 3 ---
> > > 1 files changed, 0 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > index 1454739..07919e1 100644
> > > --- a/net/sunrpc/svcsock.c
> > > +++ b/net/sunrpc/svcsock.c
> > > @@ -1135,9 +1135,6 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> > > reclen = htonl(0x80000000|((xbufp->len ) - 4));
> > > memcpy(xbufp->head[0].iov_base, &reclen, 4);
> > >
> > > - if (test_bit(XPT_DEAD, &rqstp->rq_xprt->xpt_flags))
> > > - return -ENOTCONN;
> > > -
> > > sent = svc_sendto(rqstp, &rqstp->rq_res);
> > > if (sent != xbufp->len) {
> > > printk(KERN_NOTICE
> >
> >
> > So after removing all these references to XPT_DEAD, do we need XPT_DEAD at
> > all???
> >
> > I think it is only used in two other places.
> >
> > 1/ In svc_revisit we don't queue the deferred request to an XPT_DEAD
> > transport.
> > We could avoid that but changing the 'owner' of a deferred request from the
> > service to the xprt, and call cache_clean_deferred in svc_delete_xprt
>
> That use does seem a bit of a hack to me, so I'd be happy to get rid of
> it.

Eh, but then don't we end up doing the same check when deferring, to
prevent deferring a request on a dead xprt?

Maybe we should leave well enough alone here.

--b.

2010-10-25 01:09:29

by J. Bruce Fields

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

On Fri, Oct 22, 2010 at 11:32:50PM -0400, J. Bruce Fields wrote:
> On Fri, Oct 22, 2010 at 07:21:33PM -0400, J. Bruce Fields wrote:
> > On Fri, Oct 22, 2010 at 07:01:12PM -0400, J. Bruce Fields wrote:
> > > On Fri, Oct 22, 2010 at 05:20:07PM -0400, J. Bruce Fields wrote:
> > > > On Fri, Oct 22, 2010 at 05:00:50PM +0200, Menyhart Zoltan wrote:
> > > > > J. Bruce Fields wrote:
> > > > > >On Tue, Oct 05, 2010 at 10:22:30AM +0200, Menyhart Zoltan wrote:
> > > > > >>Due to some race conditions, the reference count can become 0
> > > > > >>while "xprt" is still on a "pool":
> > > > > >
> > > > > >Apologies, your email got buried in my inbox....
> > > > > >
> > > > > >>
> > > > > >>WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
> > > > > >> [] kref_get+0x23/0x2d
> > > > > >> [] svc_xprt_get+0x12/0x14 [sunrpc]
> > > > > >> [] svc_recv+0x2db/0x78a [sunrpc]
> > > > > >
> > > > > >Which kernel exactly did you see this on? Is it reproduceable?
> > > > >
> > > > > I saw it on a 2.6.32.
> > > > > It has not been corrected for the 2.6.36-rc3 yet.
>
> Doh. Wait, when you say "has not been corrected on 2.6.36-rc3", do you
> mean you've actually *seen* the problem occur on 2.6.36-rc3?
>
> If not, it's more likely you're seeing the problem that was fixed by
> 1b644b6e6f6160ae35ce4b52c2ca89ed3e356e18, in 2.6.34-rc1.

But, as long as we're here, I think there's a minor fix and some cleanup
we could do; patches follow.

--b.

2010-10-25 01:21:42

by J. Bruce Fields

[permalink] [raw]
Subject: [PATCH 2/4] svcrpc: assume svc_delete_xprt() called only once

As long as DEAD exports are left BUSY, and svc_delete_xprt is called
only with BUSY held, then svc_delete_xprt() will never be called on an
xprt that is already DEAD.

Signed-off-by: J. Bruce Fields <[email protected]>
---
net/sunrpc/svc_xprt.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index db90e4d..409881a 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -903,7 +903,7 @@ void svc_delete_xprt(struct svc_xprt *xprt)

/* Only do this once */
if (test_and_set_bit(XPT_DEAD, &xprt->xpt_flags))
- return;
+ BUG();

dprintk("svc: svc_delete_xprt(%p)\n", xprt);
xprt->xpt_ops->xpo_detach(xprt);
--
1.7.1


2010-10-26 01:33:57

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 4/4] svcrpc: svc_tcp_sendto XTP_DEAD check is redundant

On Tue, Oct 26, 2010 at 10:08:21AM +1100, Neil Brown wrote:
> On Mon, 25 Oct 2010 13:46:32 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Oct 25, 2010 at 11:03:08AM -0400, J. Bruce Fields wrote:
> > > On Mon, Oct 25, 2010 at 01:10:24PM +1100, Neil Brown wrote:
> > > > On Sun, 24 Oct 2010 21:21:33 -0400
> > > > "J. Bruce Fields" <[email protected]> wrote:
> > > >
> > > > > The only caller (svc_send) has already checked XPT_DEAD.
> > > > >
> > > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > > > ---
> > > > > net/sunrpc/svcsock.c | 3 ---
> > > > > 1 files changed, 0 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > > > index 1454739..07919e1 100644
> > > > > --- a/net/sunrpc/svcsock.c
> > > > > +++ b/net/sunrpc/svcsock.c
> > > > > @@ -1135,9 +1135,6 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> > > > > reclen = htonl(0x80000000|((xbufp->len ) - 4));
> > > > > memcpy(xbufp->head[0].iov_base, &reclen, 4);
> > > > >
> > > > > - if (test_bit(XPT_DEAD, &rqstp->rq_xprt->xpt_flags))
> > > > > - return -ENOTCONN;
> > > > > -
> > > > > sent = svc_sendto(rqstp, &rqstp->rq_res);
> > > > > if (sent != xbufp->len) {
> > > > > printk(KERN_NOTICE
> > > >
> > > >
> > > > So after removing all these references to XPT_DEAD, do we need XPT_DEAD at
> > > > all???
> > > >
> > > > I think it is only used in two other places.
> > > >
> > > > 1/ In svc_revisit we don't queue the deferred request to an XPT_DEAD
> > > > transport.
> > > > We could avoid that but changing the 'owner' of a deferred request from the
> > > > service to the xprt, and call cache_clean_deferred in svc_delete_xprt
> > >
> > > That use does seem a bit of a hack to me, so I'd be happy to get rid of
> > > it.
> >
> > Eh, but then don't we end up doing the same check when deferring, to
> > prevent deferring a request on a dead xprt?
>
> Good point.
> However....
>
> If we change svc_defer to set handle.owner to rqstp->rq_xprt rather than
> rqstp->rq_server (As already suggested), then we don't need the svc_xprt_get.
> i.e. the deferred request doesn't need to hold a reference to the xprt,
> because when the xprt is finally removed it is easy (cache_clean_deferred) to
> remove all those deferred requests.
> So we put the call to cache_clean_deferred in svc_xprt_free. By this stage
> we are guaranteed not to get more deferrals as the xprt is dead.

Oh, OK, got it. I'll keep that in mind....

And I've also got some more cleanup, possibly bordering on barn-painting
(though I like the second one).

--b.

commit a08af80085c3ee539ca25729d7c7e9af6d060d71
Author: J. Bruce Fields <[email protected]>
Date: Mon Oct 25 12:50:15 2010 -0400

svcrpc: don't set then immediately clear XPT_DEFERRED

There's no harm to doing this, since the only caller will immediately
call svc_enqueue() afterwards, ensuring we don't miss the remaining
deferred requests just because XPT_DEFERRED was briefly cleared.

But why not just do this the simple way?

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c82fe73..a74cb67 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -1059,14 +1059,13 @@ static struct svc_deferred_req *svc_deferred_dequeue(struct svc_xprt *xprt)
if (!test_bit(XPT_DEFERRED, &xprt->xpt_flags))
return NULL;
spin_lock(&xprt->xpt_lock);
- clear_bit(XPT_DEFERRED, &xprt->xpt_flags);
if (!list_empty(&xprt->xpt_deferred)) {
dr = list_entry(xprt->xpt_deferred.next,
struct svc_deferred_req,
handle.recent);
list_del_init(&dr->handle.recent);
- set_bit(XPT_DEFERRED, &xprt->xpt_flags);
- }
+ } else
+ clear_bit(XPT_DEFERRED, &xprt->xpt_flags);
spin_unlock(&xprt->xpt_lock);
return dr;
}
commit 79a54abe64963acc82a2bf2a61b4efff5e782805
Author: J. Bruce Fields <[email protected]>
Date: Mon Oct 25 14:12:40 2010 -0400

nfsd4: centralize more calls to svc_xprt_received

Follow up on b48fa6b99100dc7772af3cd276035fcec9719ceb by moving all the
svc_xprt_received() calls for the main xprt to one place. The clearing
of XPT_BUSY here is critical to the correctness of the server, so I'd
prefer it to be obvious where we do it.

The only substantive result is moving svc_xprt_received() after
svc_receive_deferred(). Other than a (likely insignificant) delay
waking up the next thread, that should be harmless.

Also reshuffle the exit code a little to skip a few other steps that we
don't care about in the svc_delete_xprt() case.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index a74cb67..dd61cd0 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -716,7 +716,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
if (test_bit(XPT_CLOSE, &xprt->xpt_flags)) {
dprintk("svc_recv: found XPT_CLOSE\n");
svc_delete_xprt(xprt);
- } else if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
+ /* Leave XPT_BUSY set on the dead xprt: */
+ goto out;
+ }
+ if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
struct svc_xprt *newxpt;
newxpt = xprt->xpt_ops->xpo_accept(xprt);
if (newxpt) {
@@ -741,28 +744,23 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
spin_unlock_bh(&serv->sv_lock);
svc_xprt_received(newxpt);
}
- svc_xprt_received(xprt);
} else {
dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
rqstp, pool->sp_id, xprt,
atomic_read(&xprt->xpt_ref.refcount));
rqstp->rq_deferred = svc_deferred_dequeue(xprt);
- if (rqstp->rq_deferred) {
- svc_xprt_received(xprt);
+ if (rqstp->rq_deferred)
len = svc_deferred_recv(rqstp);
- } else {
+ else
len = xprt->xpt_ops->xpo_recvfrom(rqstp);
- svc_xprt_received(xprt);
- }
dprintk("svc: got len=%d\n", len);
}
+ svc_xprt_received(xprt);

/* No data, incomplete (TCP) read, or accept() */
- if (len == 0 || len == -EAGAIN) {
- rqstp->rq_res.len = 0;
- svc_xprt_release(rqstp);
- return -EAGAIN;
- }
+ if (len == 0 || len == -EAGAIN)
+ goto out;
+
clear_bit(XPT_OLD, &xprt->xpt_flags);

rqstp->rq_secure = svc_port_is_privileged(svc_addr(rqstp));
@@ -771,6 +769,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
if (serv->sv_stats)
serv->sv_stats->netcnt++;
return len;
+out:
+ rqstp->rq_res.len = 0;
+ svc_xprt_release(rqstp);
+ return -EAGAIN;
}
EXPORT_SYMBOL_GPL(svc_recv);


2010-10-25 23:23:48

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 4/4] svcrpc: svc_tcp_sendto XTP_DEAD check is redundant

On Mon, 25 Oct 2010 11:03:08 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Oct 25, 2010 at 01:10:24PM +1100, Neil Brown wrote:
> > On Sun, 24 Oct 2010 21:21:33 -0400
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > The only caller (svc_send) has already checked XPT_DEAD.
> > >
> > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > ---
> > > net/sunrpc/svcsock.c | 3 ---
> > > 1 files changed, 0 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > index 1454739..07919e1 100644
> > > --- a/net/sunrpc/svcsock.c
> > > +++ b/net/sunrpc/svcsock.c
> > > @@ -1135,9 +1135,6 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> > > reclen = htonl(0x80000000|((xbufp->len ) - 4));
> > > memcpy(xbufp->head[0].iov_base, &reclen, 4);
> > >
> > > - if (test_bit(XPT_DEAD, &rqstp->rq_xprt->xpt_flags))
> > > - return -ENOTCONN;
> > > -
> > > sent = svc_sendto(rqstp, &rqstp->rq_res);
> > > if (sent != xbufp->len) {
> > > printk(KERN_NOTICE
> >
> >
> > So after removing all these references to XPT_DEAD, do we need XPT_DEAD at
> > all???
> >
> > I think it is only used in two other places.
> >
> > 1/ In svc_revisit we don't queue the deferred request to an XPT_DEAD
> > transport.
> > We could avoid that but changing the 'owner' of a deferred request from the
> > service to the xprt, and call cache_clean_deferred in svc_delete_xprt
>
> That use does seem a bit of a hack to me, so I'd be happy to get rid of
> it.
>
> > 2/ in svc_send(). I wonder if we need this at all. There doesn't seem to be
> > any locking to ensure that XPT_DEAD doesn't get set immediately after the
> > test, and the underlying sendto (whether tcp or udp or whatever) should fail
> > if the socket is closed, and if it doesn't it shouldn't really matter??
>
> Does it make a difference in the case of a half-close? If the client
> follows a request immediately by a FIN, and if that results in our
> setting DEAD (I think it does, assuming svc_tcp_state_change() is called
> in that case), then the current code may have the effect of preventing
> us from sending the reply.
>
> I don't know if that's good or bad.
>
> > So can we get rid of XPT_DEAL altogether?
>
> OK, I also had another use in mind: for the purposes of 4.1 (which needs
> to know when a connection goes down, e.g. to know that it's no longer
> available for callbacks), I added a list of callbacks to the xprt,
> called on svc_delete_xprt().
>
> I just noticed that I think my current code allows the 4.1 code to
> register an xprt after svc_delete_xprt() is called. I could fix that
> race by checking for DEAD after trying to register.
>
> (That callback code already seems messier than it should be, so maybe
> someone else could suggest a better scheme. I'm stuck.
>
> In any case, it wouldn't be so bad if that were the one remaining use of
> DEAD.)

Could you use XPT_CLOSE for that??

NeilBrown


>
> --b.


2010-10-25 01:21:42

by J. Bruce Fields

[permalink] [raw]
Subject: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

Once an xprt has been deleted, there's no reason to allow it to be
enqueued--at worst, that might cause the xprt to be re-added to some
global list, resulting in later corruption.

Signed-off-by: J. Bruce Fields <[email protected]>
---
net/sunrpc/svc_xprt.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index bef1e88..db90e4d 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -938,7 +938,6 @@ void svc_close_xprt(struct svc_xprt *xprt)

svc_xprt_get(xprt);
svc_delete_xprt(xprt);
- clear_bit(XPT_BUSY, &xprt->xpt_flags);
svc_xprt_put(xprt);
}
EXPORT_SYMBOL_GPL(svc_close_xprt);
--
1.7.1


2010-10-26 00:30:11

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Mon, Oct 25, 2010 at 08:28:58PM -0400, J. Bruce Fields wrote:
> On Mon, Oct 25, 2010 at 08:11:45PM -0400, J. Bruce Fields wrote:
> > On Tue, Oct 26, 2010 at 10:54:47AM +1100, Neil Brown wrote:
> > > Note that once we always set XPT_BUSY and it stays set. So we call
> > > svc_delete_xprt instread of svc_close_xprt.
> > >
> > > Maybe we don't actually need to list_del_init - both the pool and the xprt
> > > will soon be freed and if there is linkage between them, who cares??
> > > In that case we wouldn't need to for_each_pool after all ???
>
> So, untested:

Ditto. But, hey, it deletes more code, it must be good.

--b.

commit 99401f6f6c3d0782ef79bb37749ace2accd4c79a
Author: J. Bruce Fields <[email protected]>
Date: Mon Oct 25 20:24:48 2010 -0400

svcrpc: simplify svc_close_xprt, fix minor race

We're assuming that in the XPT_BUSY case, somebody else will take care
of the close. But that's not necessarily the case (e.g. if
svc_xprt_enqueue() exits due to a lack of write space). So the close
could in theory be delayed indefinitely. We should be calling
svc_xprt_enqueue() after every set of XPT_CLOSE.

Also with svc_close_all() no longer relying on svc_close, it no longer
needs to be able to immediately do the svc_delete_xprt() itself.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 8c018df..dfee123 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -928,11 +928,7 @@ void svc_delete_xprt(struct svc_xprt *xprt)
void svc_close_xprt(struct svc_xprt *xprt)
{
set_bit(XPT_CLOSE, &xprt->xpt_flags);
- if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
- /* someone else will have to effect the close */
- return;
-
- svc_delete_xprt(xprt);
+ svc_xprt_enqueue(xprt);
}
EXPORT_SYMBOL_GPL(svc_close_xprt);


2010-10-26 12:59:56

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Tue, Oct 26, 2010 at 12:28:20PM +1100, Neil Brown wrote:
> On Mon, 25 Oct 2010 20:30:08 -0400
> "J. Bruce Fields" <[email protected]> wrote:
> >
> > svcrpc: simplify svc_close_xprt, fix minor race
> >
> > We're assuming that in the XPT_BUSY case, somebody else will take care
> > of the close. But that's not necessarily the case (e.g. if
> > svc_xprt_enqueue() exits due to a lack of write space). So the close
> > could in theory be delayed indefinitely. We should be calling
> > svc_xprt_enqueue() after every set of XPT_CLOSE.
> >
> > Also with svc_close_all() no longer relying on svc_close, it no longer
> > needs to be able to immediately do the svc_delete_xprt() itself.
> >
> > Signed-off-by: J. Bruce Fields <[email protected]>
> >
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index 8c018df..dfee123 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -928,11 +928,7 @@ void svc_delete_xprt(struct svc_xprt *xprt)
> > void svc_close_xprt(struct svc_xprt *xprt)
> > {
> > set_bit(XPT_CLOSE, &xprt->xpt_flags);
> > - if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
> > - /* someone else will have to effect the close */
> > - return;
> > -
> > - svc_delete_xprt(xprt);
> > + svc_xprt_enqueue(xprt);
> > }
> > EXPORT_SYMBOL_GPL(svc_close_xprt);
> >
>
> I'm not quite so sure about this one.
>
> svc_close_xprt gets called when user-space writes some magic string to some
> magic file. This could be done when there are no active nfsd threads.
> It is a bit far fetched, but you could tell nfsd to open a socket, then tell
> it to close it, before telling it to start any threads.
>
> In that case the socket would stay open until the first thread was started.

Thanks, that's an odd case, but I think you're right. OK:

--b.

commit 6a04c0fa098b60e93d4077c12f7fa7a2f189df95
Author: J. Bruce Fields <[email protected]>
Date: Mon Oct 25 20:24:48 2010 -0400

svcrpc: fix minor svc_close_xprt race

We're assuming that in the XPT_BUSY case, somebody else will take care
of the close. But that's not necessarily the case. So the close could
in theory be delayed indefinitely.

We should instead be calling svc_xprt_enqueue() after every set of
XPT_CLOSE.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 8c018df..99996ad 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -928,10 +928,17 @@ void svc_delete_xprt(struct svc_xprt *xprt)
void svc_close_xprt(struct svc_xprt *xprt)
{
set_bit(XPT_CLOSE, &xprt->xpt_flags);
- if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
+ if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
/* someone else will have to effect the close */
+ svc_xprt_enqueue(xprt);
return;
-
+ }
+ /*
+ * We expect svc_close_xprt() to work even when no threads are
+ * running (e.g., while configuring the server before starting
+ * any threads), so if the transport isn't busy, we delete
+ * it ourself:
+ */
svc_delete_xprt(xprt);
}
EXPORT_SYMBOL_GPL(svc_close_xprt);

2010-10-25 02:10:32

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 4/4] svcrpc: svc_tcp_sendto XPT_DEAD check is redundant

On Sun, 24 Oct 2010 21:21:33 -0400
"J. Bruce Fields" <[email protected]> wrote:

> The only caller (svc_send) has already checked XPT_DEAD.
>
> Signed-off-by: J. Bruce Fields <[email protected]>
> ---
> net/sunrpc/svcsock.c | 3 ---
> 1 files changed, 0 insertions(+), 3 deletions(-)
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 1454739..07919e1 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1135,9 +1135,6 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> reclen = htonl(0x80000000|((xbufp->len ) - 4));
> memcpy(xbufp->head[0].iov_base, &reclen, 4);
>
> - if (test_bit(XPT_DEAD, &rqstp->rq_xprt->xpt_flags))
> - return -ENOTCONN;
> -
> sent = svc_sendto(rqstp, &rqstp->rq_res);
> if (sent != xbufp->len) {
> printk(KERN_NOTICE


So after removing all these references to XPT_DEAD, do we need XPT_DEAD at
all???

I think it is only used in two other places.

1/ In svc_revisit we don't queue the deferred request to an XPT_DEAD
transport.
We could avoid that by changing the 'owner' of a deferred request from the
service to the xprt, and calling cache_clean_deferred in svc_delete_xprt.

2/ in svc_send(). I wonder if we need this at all. There doesn't seem to be
any locking to ensure that XPT_DEAD doesn't get set immediately after the
test, and the underlying sendto (whether tcp or udp or whatever) should fail
if the socket is closed, and if it doesn't it shouldn't really matter??

So can we get rid of XPT_DEAD altogether?

NeilBrown


2010-10-25 14:36:58

by J. Bruce Fields

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

On Mon, Oct 25, 2010 at 01:56:31PM +0200, Menyhart Zoltan wrote:
> We do not really use kernels more recent than 2.6.32 in any great number,
> just some test configurations up to 2.6.36-rc6...
> I have not seen this problem on recent kernels.
>
> I'm not sure that I can really follow your explanation.

Look at the callers of svc_xprt_put(). You'll see that all of them have
an obvious paired svc_xprt_get():

- Any svc_xprt_put() that clears a rqstp->rq_xprt is paired with
the svc_xprt_get() that originally set rqstp->rq_xprt
- The svc_xprt_put() in svc_check_conn_limits() is paired with
the immediately preceding svc_xprt_get() that took the xprt
off the sv_tempsocks list.
- Etc.

*Except* for the svc_xprt_put() at the end of svc_delete_xprt(). That
one you can think of as paired with the initial reference created by the
kref_init() in svc_xprt_init().

So, each xprt has one special reference which keeps the xprt from being
removed before svc_delete_xprt() is called.

There are only two callers of svc_delete_xprt(): svc_close_xprt(), and
svc_recv().

When we exit svc_xprt_enqueue(), we exit with XPT_BUSY set on the xprt
in question.

That flag will be cleared only by the svc_recv() call that picks up this
xprt (or by svc_close_all() if nfsd shuts down before then).

That prevents svc_close_xprt() from deleting the xprt, since it returns
without doing anything if XPT_BUSY is already set.

So the only code that could normally delete this xprt is the svc_recv()
that processes the xprt. It takes the xprt off the pool->sp_sockets
list before it does that.

So, returning from svc_xprt_enqueue() with an xprt on ->sp_sockets that
has XPT_BUSY set is safe.

Think of XPT_BUSY as a kind of lock, ownership of which can pass from
the svc_xprt_enqueue() that queues up an xprt to the svc_recv() that
processes it, and note that an xprt can only be deleted under this
XPT_BUSY lock.

--b.
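
(Illustrative aside, not part of the original thread: the reference scheme
described above can be modelled in a few lines of plain C11. The names
fake_xprt, fake_xprt_get() and so on are invented for the sketch; only the
idea of a single embedded reference, dropped exactly once by the delete
path while every other get is explicitly paired with a put, comes from the
explanation.)

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_xprt {
        atomic_int refcount;            /* plays the role of xpt_ref */
};

static struct fake_xprt *fake_xprt_create(void)
{
        struct fake_xprt *x = malloc(sizeof(*x));
        atomic_init(&x->refcount, 1);   /* the embedded reference */
        return x;
}

static void fake_xprt_get(struct fake_xprt *x)
{
        atomic_fetch_add(&x->refcount, 1);
}

static void fake_xprt_put(struct fake_xprt *x)
{
        if (atomic_fetch_sub(&x->refcount, 1) == 1) {
                printf("last reference gone, freeing\n");
                free(x);
        }
}

static void fake_xprt_delete(struct fake_xprt *x)
{
        /* Only the delete path drops the embedded reference, so the
         * object cannot disappear before "delete" has run. */
        fake_xprt_put(x);
}

int main(void)
{
        struct fake_xprt *x = fake_xprt_create();

        fake_xprt_get(x);       /* e.g. a worker picking the object up */
        fake_xprt_put(x);       /* ...and dropping it when done */

        fake_xprt_delete(x);    /* finally: drop the embedded reference */
        return 0;
}

The point of the model is only the pairing discipline: as long as the
embedded reference outlives every published pointer, no get can run on a
freed object.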

2010-10-26 16:05:22

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Tue, Oct 26, 2010 at 08:59:47AM -0400, J. Bruce Fields wrote:
> On Tue, Oct 26, 2010 at 12:28:20PM +1100, Neil Brown wrote:
> > On Mon, 25 Oct 2010 20:30:08 -0400
> > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > svcrpc: simplify svc_close_xprt, fix minor race
> > >
> > > We're assuming that in the XPT_BUSY case, somebody else will take care
> > > of the close. But that's not necessarily the case (e.g. if
> > > svc_xprt_enqueue() exits due to a lack of write space). So the close
> > > could in theory be delayed indefinitely. We should be calling
> > > svc_xprt_enqueue() after every set of XPT_CLOSE.
> > >
> > > Also with svc_close_all() no longer relying on svc_close, it no longer
> > > needs to be able to immediately do the svc_delete_xprt() itself.
> > >
> > > Signed-off-by: J. Bruce Fields <[email protected]>
> > >
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 8c018df..dfee123 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -928,11 +928,7 @@ void svc_delete_xprt(struct svc_xprt *xprt)
> > > void svc_close_xprt(struct svc_xprt *xprt)
> > > {
> > > set_bit(XPT_CLOSE, &xprt->xpt_flags);
> > > - if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
> > > - /* someone else will have to effect the close */
> > > - return;
> > > -
> > > - svc_delete_xprt(xprt);
> > > + svc_xprt_enqueue(xprt);
> > > }
> > > EXPORT_SYMBOL_GPL(svc_close_xprt);
> > >
> >
> > I'm not quite so sure about this one.
> >
> > svc_close_xprt gets called when user-space writes some magic string to some
> > magic file. This could be done when there are no active nfsd threads.
> > It is a bit far fetched, but you could tell nfsd to open a socket, then tell
> > it to close it, before telling it to start any threads.
> >
> > In that case the socket would stay open until the first thread was started.
>
> Thanks, that's an odd case, but I think you're right. OK:

Actually.... I think the real race here is the fault of the write-space
check: once we've taken XPT_BUSY, we really do need to guarantee that
we'll arrange to have svc_xprt_enqueue() called again after clearing
XPT_BUSY, to avoid missing any events that arrived while we were
processing ours. So putting the wspace check after taking XPT_BUSY in
svc_xprt_enqueue looks to me like a bug that could cause processing to
stall unnecessarily in some unlikely cases.

--b.

commit cf501cb479873d6e77abba131ea40c11aa924d72
Author: J. Bruce Fields <[email protected]>
Date: Tue Oct 26 11:32:03 2010 -0400

svcrpc: fix wspace-checking race

We call svc_xprt_enqueue() after something happens which we think may
require handling from a server thread. To avoid such events being lost,
svc_xprt_enqueue() must guarantee that there will be a svc_recv() call
from a server thread following any such event. It does that by either
waking up a server thread itself, or checking that XPT_BUSY is set (in
which case somebody else is doing it).

But the check of XPT_BUSY could occur just as someone finishes
processing some other event, and just before they clear XPT_BUSY.

Therefore it's important not to clear XPT_BUSY without subsequently
doing another svc_xprt_enqueue() to check whether the xprt should be
requeued.

The xpo_wspace() check in svc_xprt_enqueue() breaks this rule, allowing
an event to be missed in situations like:

data arrives
call svc_tcp_data_ready():
call svc_xprt_enqueue():
set BUSY
find no write space
svc_reserve():
free up write space
call svc_xprt_enqueue():
test BUSY
clear BUSY

So, instead, check wspace in the same places that the state flags are
checked: before taking BUSY, and in svc_recv().

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index d2261fe..9c690fa 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -92,12 +92,17 @@ static inline void unregister_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user
static inline int register_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user *u)
{
spin_lock(&xpt->xpt_lock);
- list_add(&u->list, &xpt->xpt_users);
- spin_unlock(&xpt->xpt_lock);
if (test_bit(XPT_CLOSE, &xpt->xpt_flags)) {
- unregister_xpt_user(xpt, u);
+ /*
+ * The connection is about to be deleted soon (or,
+ * worse, may already be deleted--in which case we've
+ * already notified the xpt_users).
+ */
+ spin_unlock(&xpt->xpt_lock);
return -ENOTCONN;
}
+ list_add(&u->list, &xpt->xpt_users);
+ spin_unlock(&xpt->xpt_lock);
return 0;
}

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 99996ad..04238a9 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -302,6 +302,15 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
list_del(&rqstp->rq_list);
}

+static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
+{
+ if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
+ return true;
+ if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
+ return xprt->xpt_ops->xpo_has_wspace(xprt);
+ return false;
+}
+
/*
* Queue up a transport with data pending. If there are idle nfsd
* processes, wake 'em up.
@@ -314,8 +323,7 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
struct svc_rqst *rqstp;
int cpu;

- if (!(xprt->xpt_flags &
- ((1<<XPT_CONN)|(1<<XPT_DATA)|(1<<XPT_CLOSE)|(1<<XPT_DEFERRED))))
+ if (!svc_xprt_has_something_to_do(xprt))
return;

cpu = get_cpu();
@@ -345,25 +353,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
BUG_ON(xprt->xpt_pool != NULL);
xprt->xpt_pool = pool;

- /* Handle pending connection */
- if (test_bit(XPT_CONN, &xprt->xpt_flags))
- goto process;
-
- /* Handle close in-progress */
- if (test_bit(XPT_CLOSE, &xprt->xpt_flags))
- goto process;
-
- /* Check if we have space to reply to a request */
- if (!xprt->xpt_ops->xpo_has_wspace(xprt)) {
- /* Don't enqueue while not enough space for reply */
- dprintk("svc: no write space, transport %p not enqueued\n",
- xprt);
- xprt->xpt_pool = NULL;
- clear_bit(XPT_BUSY, &xprt->xpt_flags);
- goto out_unlock;
- }
-
- process:
if (!list_empty(&pool->sp_threads)) {
rqstp = list_entry(pool->sp_threads.next,
struct svc_rqst,
@@ -744,7 +733,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
spin_unlock_bh(&serv->sv_lock);
svc_xprt_received(newxpt);
}
- } else {
+ } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
rqstp, pool->sp_id, xprt,
atomic_read(&xprt->xpt_ref.refcount));
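
(Illustrative aside, not from the thread: a single-threaded userspace model
of the rule the commit message states, namely that whoever clears the busy
flag must re-run the enqueue check afterwards, or an event that arrived
while the flag was held can be lost. The flag and function names are
invented; compile with any C11 compiler.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool busy = false;        /* plays the role of XPT_BUSY */
static atomic_bool pending = false;     /* plays the role of XPT_DATA etc. */

static void handle_event(void)
{
        printf("handled one event\n");
}

/* Called whenever something happens that may need a thread's attention,
 * in the same spirit as svc_xprt_enqueue(). */
static void fake_enqueue(void)
{
        if (!atomic_load(&pending))
                return;                 /* nothing to do */
        if (atomic_exchange(&busy, true))
                return;                 /* current holder will recheck */

        while (atomic_exchange(&pending, false))
                handle_event();

        atomic_store(&busy, false);
        /* The rule: after clearing "busy", check again, because an event
         * may have arrived between the last check and the clear. */
        fake_enqueue();
}

int main(void)
{
        atomic_store(&pending, true);
        fake_enqueue();
        return 0;
}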

2010-10-22 23:21:34

by J. Bruce Fields

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

On Fri, Oct 22, 2010 at 07:01:12PM -0400, J. Bruce Fields wrote:
> On Fri, Oct 22, 2010 at 05:20:07PM -0400, J. Bruce Fields wrote:
> > On Fri, Oct 22, 2010 at 05:00:50PM +0200, Menyhart Zoltan wrote:
> > > J. Bruce Fields wrote:
> > > >On Tue, Oct 05, 2010 at 10:22:30AM +0200, Menyhart Zoltan wrote:
> > > >>Due to some race conditions, the reference count can become 0
> > > >>while "xprt" is still on a "pool":
> > > >
> > > >Apologies, your email got buried in my inbox....
> > > >
> > > >>
> > > >>WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
> > > >> [] kref_get+0x23/0x2d
> > > >> [] svc_xprt_get+0x12/0x14 [sunrpc]
> > > >> [] svc_recv+0x2db/0x78a [sunrpc]
> > > >
> > > >Which kernel exactly did you see this on? Is it reproducible?
> > >
> > > I saw it on a 2.6.32.
> > > It has not been corrected for the 2.6.36-rc3 yet.
> > > The patch is for the 2.6.36-rc3.
> > >
> > > It is a narrow window: you need a high workload and a bit of luck to
> > > delay the current CPU just after "svc_xprt_enqueue()" returns.
> > >
> > > >>I think we should increase the reference counter before adding "xprt"
> > > >>onto any list.
> > > >
> > > >I don't see the xprt added to any list after the svc_xprt_get() you've
> > > >added below.
> > >
> > > "svc_xprt_enqueue()" has got two ways to pass an "xprt":
> > > - via "rqstp->rq_xprt" if a worker is available,
> > > - on the "pool->sp_sockets" list otherwise
> > >
> > > if (!list_empty(&pool->sp_threads)) {
> > > rqstp = list_entry(pool->sp_threads.next, struct svc_rqst, rq_list);
> > > svc_thread_dequeue(pool, rqstp);
> > > rqstp->rq_xprt = xprt;
> > > svc_xprt_get(xprt);
> > > rqstp->rq_reserved = serv->sv_max_mesg;
> > > atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> > > pool->sp_stats.threads_woken++;
> > > wake_up(&rqstp->rq_wait);
> > > } else {
> > > list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
> > > pool->sp_stats.sockets_queued++;
> > > }
> > >
> > > In the 1st case, there is a "svc_xprt_get(xprt)", in the 2nd one, there is not.
> > > Once "svc_xprt_enqueue()" returns, at some places, "svc_xprt_put(xprt)" is
> > > invoked. If we have passed the "else" branch, the "kref" can drop down to 0.
> >
> > Maybe your fix is right, but I'm not sure: It looks to me like if
> > svc_xprt_enqueue() gets to "process:" in a situation where the caller
> > holds the only reference, then that's already a bug. Do you know who
> > the caller of svc_xprt_enqueue() was when this happened?
>
> Hm. Maybe something like this could happen: two threads call
> svc_check_conn_limits at about the same time, and both pick the same
> victim xprt.
>
>
> thread 1 thread 2
> ^^^^^^^^ ^^^^^^^^
> set CLOSE set CLOSE
> call svc_xprt_enqueue
> set BUSY
>
> thread 3
> ^^^^^^^^ call svc_xprt_enqueue
> call svc_recv
> dequeue our xprt
> check DEAD, see it unset
> call svc_delete_xprt
> remove xprt from any
> global lists
> put xprt
> clear BUSY

I'm wrong here; svc_recv leaves the xprt BUSY in the case where it calls
svc_delete_xprt().

svc_close_xprt() does clear BUSY after calling svc_delete_xprt(), for
some bizarre reason, but that's probably harmless, except maybe at
server shutdown time. I assume you weren't shutting down the server
when you saw this.

Ugh. cc'ing Neil, maybe he has some idea.

--b.

> test_and_set_bit BUSY
> test CLOSE, go to process:
> make xprt globally visible
> again
> ARGH!
>
>
> The put in svc_delete_xprt() is meant to happen only when the xprt is
> taken off any rqstp's or global lists. We shouldn't be able to requeue
> the xprt after that's done.
>
> So, by the time both svc_check_conn_limits calls return, the reference
> count has probably gone to zero, and the xprt is freed while there
> are still references to it somewhere.
>
> It seems wrong to be clearing BUSY after deleting an xprt; what good
> could come of letting someone try to process an xprt that's already
> DEAD?
>
> But I need to go back over that. Maybe I've missed something.
>
> --b.

2010-10-25 01:44:07

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Sun, 24 Oct 2010 21:21:30 -0400
"J. Bruce Fields" <[email protected]> wrote:

> Once an xprt has been deleted, there's no reason to allow it to be
> enqueued--at worst, that might cause the xprt to be re-added to some
> global list, resulting in later corruption.
>
> Signed-off-by: J. Bruce Fields <[email protected]>

Yep, this makes svc_close_xprt() behave the same way as svc_recv() which
calls svc_delete_xprt but does not clear XPT_BUSY. The other branches in
svc_recv call svc_xprt_received, but the XPT_CLOSE branch doesn't

Reviewed-by: NeilBrown <[email protected]>

NeilBrown

> ---
> net/sunrpc/svc_xprt.c | 1 -
> 1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index bef1e88..db90e4d 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -938,7 +938,6 @@ void svc_close_xprt(struct svc_xprt *xprt)
>
> svc_xprt_get(xprt);
> svc_delete_xprt(xprt);
> - clear_bit(XPT_BUSY, &xprt->xpt_flags);
> svc_xprt_put(xprt);
> }
> EXPORT_SYMBOL_GPL(svc_close_xprt);


2010-10-25 15:03:13

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 4/4] svcrpc: svc_tcp_sendto XPT_DEAD check is redundant

On Mon, Oct 25, 2010 at 01:10:24PM +1100, Neil Brown wrote:
> On Sun, 24 Oct 2010 21:21:33 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > The only caller (svc_send) has already checked XPT_DEAD.
> >
> > Signed-off-by: J. Bruce Fields <[email protected]>
> > ---
> > net/sunrpc/svcsock.c | 3 ---
> > 1 files changed, 0 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 1454739..07919e1 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -1135,9 +1135,6 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> > reclen = htonl(0x80000000|((xbufp->len ) - 4));
> > memcpy(xbufp->head[0].iov_base, &reclen, 4);
> >
> > - if (test_bit(XPT_DEAD, &rqstp->rq_xprt->xpt_flags))
> > - return -ENOTCONN;
> > -
> > sent = svc_sendto(rqstp, &rqstp->rq_res);
> > if (sent != xbufp->len) {
> > printk(KERN_NOTICE
>
>
> So after removing all these references to XPT_DEAD, do we need XPT_DEAD at
> all???
>
> I think it is only used in two other places.
>
> 1/ In svc_revisit we don't queue the deferred request to an XPT_DEAD
> transport.
> We could avoid that by changing the 'owner' of a deferred request from the
> service to the xprt, and calling cache_clean_deferred in svc_delete_xprt.

That use does seem a bit of a hack to me, so I'd be happy to get rid of
it.

> 2/ in svc_send(). I wonder if we need this at all. There doesn't seem to be
> any locking to ensure that XPT_DEAD doesn't get set immediately after the
> test, and the underlying sendto (whether tcp or udp or whatever) should fail
> if the socket is closed, and if it doesn't it shouldn't really matter??

Does it make a difference in the case of a half-close? If the client
follows a request immediately by a FIN, and if that results in our
setting DEAD (I think it does, assuming svc_tcp_state_change() is called
in that case), then the current code may have the effect of preventing
us from sending the reply.

I don't know if that's good or bad.

> So can we get rid of XPT_DEAD altogether?

OK, I also had another use in mind: for the purposes of 4.1 (which needs
to know when a connection goes down, e.g. to know that it's no longer
available for callbacks), I added a list of callbacks to the xprt,
called on svc_delete_xprt().

I just noticed that I think my current code allows the 4.1 code to
register an xprt after svc_delete_xprt() is called. I could fix that
race by checking for DEAD after trying to register.

(That callback code already seems messier than it should be, so maybe
someone else could suggest a better scheme. I'm stuck.

In any case, it wouldn't be so bad if that were the one remaining use of
DEAD.)

--b.
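
(Illustrative aside, not from the thread: one way to close the
register-after-delete window discussed here is to test a "dead" flag under
the same lock that protects the callback list, roughly in the spirit of the
register_xpt_user change in the later wspace patch. A minimal userspace
model follows; all names are invented, and it links with -pthread.)

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct user_cb {
        struct user_cb *next;
        void (*notify)(void);
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct user_cb *users;
static bool dead;

/* Returns 0 on success, -1 if the transport has already been deleted. */
static int register_user(struct user_cb *u)
{
        pthread_mutex_lock(&lock);
        if (dead) {
                /* Too late: the delete path has already notified (or will
                 * never see) the list, so refuse the registration. */
                pthread_mutex_unlock(&lock);
                return -1;
        }
        u->next = users;
        users = u;
        pthread_mutex_unlock(&lock);
        return 0;
}

static void delete_transport(void)
{
        pthread_mutex_lock(&lock);
        dead = true;
        for (struct user_cb *u = users; u; u = u->next)
                u->notify();            /* notify each registered user once */
        users = NULL;
        pthread_mutex_unlock(&lock);
}

static void went_down(void)
{
        puts("connection went down");
}

int main(void)
{
        struct user_cb u = { .next = NULL, .notify = went_down };

        if (register_user(&u) == 0)
                delete_transport();
        if (register_user(&u) == -1)    /* after delete, registration fails */
                puts("registration refused");
        return 0;
}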

2010-10-22 15:01:40

by Menyhart Zoltan

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

J. Bruce Fields wrote:
> On Tue, Oct 05, 2010 at 10:22:30AM +0200, Menyhart Zoltan wrote:
>> Due to some race conditions, the reference count can become 0
>> while "xprt" is still on a "pool":
>
> Apologies, your email got buried in my inbox....
>
>>
>> WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
>> [] kref_get+0x23/0x2d
>> [] svc_xprt_get+0x12/0x14 [sunrpc]
>> [] svc_recv+0x2db/0x78a [sunrpc]
>
> Which kernel exactly did you see this on? Is it reproducible?

I saw it on a 2.6.32.
It has not been corrected for the 2.6.36-rc3 yet.
The patch is for the 2.6.36-rc3.

It is a narrow window: you need a high workload and a bit of luck to
delay the current CPU just after "svc_xprt_enqueue()" returns.

>> I think we should increase the reference counter before adding "xprt"
>> onto any list.
>
> I don't see the xprt added to any list after the svc_xprt_get() you've
> added below.

"svc_xprt_enqueue()" has got two ways to pass an "xprt":
- via "rqstp->rq_xprt" if a worker is available,
- on the "pool->sp_sockets" list otherwise

if (!list_empty(&pool->sp_threads)) {
rqstp = list_entry(pool->sp_threads.next, struct svc_rqst, rq_list);
svc_thread_dequeue(pool, rqstp);
rqstp->rq_xprt = xprt;
svc_xprt_get(xprt);
rqstp->rq_reserved = serv->sv_max_mesg;
atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
pool->sp_stats.threads_woken++;
wake_up(&rqstp->rq_wait);
} else {
list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
pool->sp_stats.sockets_queued++;
}

In the 1st case, there is a "svc_xprt_get(xprt)", in the 2nd one, there is not.
Once "svc_xprt_enqueue()" returns, at some places, "svc_xprt_put(xprt)" is
invoked. If we have passed the "else" branch, the "kref" can drop down to 0.

"svc_recv()" includes:

xprt = svc_xprt_dequeue(pool);
if (xprt) {
rqstp->rq_xprt = xprt;
svc_xprt_get(xprt);
rqstp->rq_reserved = serv->sv_max_mesg;
atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
} else {

i.e. it calls "svc_xprt_get(xprt)" if it has picked up "xprt" from the "pool" list.
Yet it is too late, if the caller of "svc_xprt_enqueue()" has already had the
time to call "svc_xprt_put(xprt)".

I think "svc_xprt_get(xprt)" should be called in "svc_xprt_enqueue()" before
"list_add_tail(&xprt->xpt_ready, &pool->sp_sockets)" to keep the reference
counter valid.
It is much simpler to move the call "svc_xprt_get(xprt)" just before

if (!list_empty(&pool->sp_threads))

Thanks,

Zoltan
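
(Illustrative aside, not from the thread: the ordering Zoltan argues for,
taking the reference before the object becomes visible on a shared list,
can be modelled in userspace as below. The queue's reference is handed over
to whoever dequeues the item, so a later put by the producer cannot free it
while it is still reachable. All names are invented; link with -pthread.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct item {
        atomic_int ref;
        struct item *next;
};

static struct item *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void item_get(struct item *it)
{
        atomic_fetch_add(&it->ref, 1);
}

static void item_put(struct item *it)
{
        if (atomic_fetch_sub(&it->ref, 1) == 1) {
                puts("last reference dropped, freeing");
                free(it);
        }
}

static void enqueue(struct item *it)
{
        item_get(it);                   /* reference for the queue... */
        pthread_mutex_lock(&queue_lock);
        it->next = queue_head;          /* ...taken before publication */
        queue_head = it;
        pthread_mutex_unlock(&queue_lock);
}

static struct item *dequeue(void)
{
        pthread_mutex_lock(&queue_lock);
        struct item *it = queue_head;
        if (it)
                queue_head = it->next;
        pthread_mutex_unlock(&queue_lock);
        return it;                      /* queue's reference moves to caller */
}

int main(void)
{
        struct item *it = calloc(1, sizeof(*it));
        atomic_init(&it->ref, 1);       /* producer's own reference */

        enqueue(it);
        item_put(it);                   /* producer drops its reference: item survives */

        struct item *got = dequeue();
        if (got)
                item_put(got);          /* consumer drops the queue's reference */
        return 0;
}

The design point is only the ordering: taking the extra reference before
list_add-style publication means no combination of producer put and
consumer get can ever see the count hit zero early.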




2010-10-25 23:04:00

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Tue, Oct 26, 2010 at 09:58:36AM +1100, Neil Brown wrote:
> On Mon, 25 Oct 2010 16:21:56 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Oct 25, 2010 at 12:43:57PM +1100, Neil Brown wrote:
> > > On Sun, 24 Oct 2010 21:21:30 -0400
> > > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > > Once an xprt has been deleted, there's no reason to allow it to be
> > > > enqueued--at worst, that might cause the xprt to be re-added to some
> > > > global list, resulting in later corruption.
> > > >
> > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > >
> > > Yep, this makes svc_close_xprt() behave the same way as svc_recv() which
> > > calls svc_delete_xprt but does not clear XPT_BUSY. The other branches in
> > > svc_recv call svc_xprt_received, but the XPT_CLOSE branch doesn't
> > >
> > > Reviewed-by: NeilBrown <[email protected]>
> >
> > Also, of course:
> >
> > > > svc_xprt_get(xprt);
> > > > svc_delete_xprt(xprt);
> > > > - clear_bit(XPT_BUSY, &xprt->xpt_flags);
> > > > svc_xprt_put(xprt);
> >
> > The get/put is pointless: the only reason I can see for doing that of
> > course was to be able to safely clear the bit afterwards.
> >
>
> Agreed.
>
> I like patches that get rid of code!!

Unfortunately, I'm stuck on just one more point: is svc_close_all()
really safe? It assumes it doesn't need any locking to speak of any
more because the server threads are gone--but the xprts themselves
could still be producing events, right? (So data could be arriving that
results in calls to svc_xprt_enqueue, for example?)

If that's right, I'm not sure what to do there....

--b.

2010-10-22 23:01:13

by J. Bruce Fields

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

On Fri, Oct 22, 2010 at 05:20:07PM -0400, J. Bruce Fields wrote:
> On Fri, Oct 22, 2010 at 05:00:50PM +0200, Menyhart Zoltan wrote:
> > J. Bruce Fields wrote:
> > >On Tue, Oct 05, 2010 at 10:22:30AM +0200, Menyhart Zoltan wrote:
> > >>Due to some race conditions, the reference count can become 0
> > >>while "xprt" is still on a "pool":
> > >
> > >Apologies, your email got buried in my inbox....
> > >
> > >>
> > >>WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
> > >> [] kref_get+0x23/0x2d
> > >> [] svc_xprt_get+0x12/0x14 [sunrpc]
> > >> [] svc_recv+0x2db/0x78a [sunrpc]
> > >
> > >Which kernel exactly did you see this on? Is it reproducible?
> >
> > I saw it on a 2.6.32.
> > It has not been corrected for the 2.6.36-rc3 yet.
> > The patch is for the 2.6.36-rc3.
> >
> > It is a narrow window: you need a high workload and a bit of luck to
> > delay the current CPU just after "svc_xprt_enqueue()" returns.
> >
> > >>I think we should increase the reference counter before adding "xprt"
> > >>onto any list.
> > >
> > >I don't see the xprt added to any list after the svc_xprt_get() you've
> > >added below.
> >
> > "svc_xprt_enqueue()" has got two ways to pass an "xprt":
> > - via "rqstp->rq_xprt" if a worker is available,
> > - on the "pool->sp_sockets" list otherwise
> >
> > if (!list_empty(&pool->sp_threads)) {
> > rqstp = list_entry(pool->sp_threads.next, struct svc_rqst, rq_list);
> > svc_thread_dequeue(pool, rqstp);
> > rqstp->rq_xprt = xprt;
> > svc_xprt_get(xprt);
> > rqstp->rq_reserved = serv->sv_max_mesg;
> > atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> > pool->sp_stats.threads_woken++;
> > wake_up(&rqstp->rq_wait);
> > } else {
> > list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
> > pool->sp_stats.sockets_queued++;
> > }
> >
> > In the 1st case, there is a "svc_xprt_get(xprt)", in the 2nd one, there is not.
> > Once "svc_xprt_enqueue()" returns, at some places, "svc_xprt_put(xprt)" is
> > invoked. If we have passed the "else" branch, the "kref" can drop down to 0.
>
> Maybe your fix is right, but I'm not sure: It looks to me like if
> svc_xprt_enqueue() gets to "process:" in a situation where the caller
> holds the only reference, then that's already a bug. Do you know who
> the caller of svc_xprt_enqueue() was when this happened?

Hm. Maybe something like this could happen: two threads call
svc_check_conn_limits at about the same time, and both pick the same
victim xprt.


thread 1 thread 2
^^^^^^^^ ^^^^^^^^
set CLOSE set CLOSE
call svc_xprt_enqueue
set BUSY

thread 3
^^^^^^^^ call svc_xprt_enqueue
call svc_recv
dequeue our xprt
check DEAD, see it unset
call svc_delete_xprt
remove xprt from any
global lists
put xprt
clear BUSY
test_and_set_bit BUSY
test CLOSE, go to process:
make xprt globally visible
again
ARGH!


The put in svc_delete_xprt() is meant to happen only when the xprt is
taken off any rqstp's or global lists. We shouldn't be able to requeue
the xprt after that's done.

So, by the time both svc_check_conn_limits calls return, the reference
count has probably gone to zero, and the xprt is freed while there
are still references to it somewhere.

It seems wrong to be clearing BUSY after deleting an xprt; what good
could come of letting someone try to process an xprt that's already
DEAD?

But I need to go back over that. Maybe I've missed something.

--b.

2010-10-25 01:21:43

by J. Bruce Fields

[permalink] [raw]
Subject: [PATCH 3/4] svcrpc: no need for XPT_DEAD check in svc_xprt_enqueue

If any xprt marked DEAD is also left BUSY for the rest of its life, then
the XPT_DEAD check here is superfluous--we'll get the same result from
the XPT_BUSY check just after.

Signed-off-by: J. Bruce Fields <[email protected]>
---
net/sunrpc/svc_xprt.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 409881a..3a771f0 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -330,12 +330,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
"svc_xprt_enqueue: "
"threads and transports both waiting??\n");

- if (test_bit(XPT_DEAD, &xprt->xpt_flags)) {
- /* Don't enqueue dead transports */
- dprintk("svc: transport %p is dead, not enqueued\n", xprt);
- goto out_unlock;
- }
-
pool->sp_stats.packets++;

/* Mark transport as busy. It will remain in this state until
--
1.7.1


2010-10-22 21:20:09

by J. Bruce Fields

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

On Fri, Oct 22, 2010 at 05:00:50PM +0200, Menyhart Zoltan wrote:
> J. Bruce Fields wrote:
> >On Tue, Oct 05, 2010 at 10:22:30AM +0200, Menyhart Zoltan wrote:
> >>Due to some race conditions, the reference count can become 0
> >>while "xprt" is still on a "pool":
> >
> >Apologies, your email got buried in my inbox....
> >
> >>
> >>WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
> >> [] kref_get+0x23/0x2d
> >> [] svc_xprt_get+0x12/0x14 [sunrpc]
> >> [] svc_recv+0x2db/0x78a [sunrpc]
> >
> >Which kernel exactly did you see this on? Is it reproducible?
>
> I saw it on a 2.6.32.
> It has not been corrected for the 2.6.36-rc3 yet.
> The patch is for the 2.6.36-rc3.
>
> It is a narrow window: you need a high workload and a bit of luck to
> delay the current CPU just after "svc_xprt_enqueue()" returns.
>
> >>I think we should increase the reference counter before adding "xprt"
> >>onto any list.
> >
> >I don't see the xprt added to any list after the svc_xprt_get() you've
> >added below.
>
> "svc_xprt_enqueue()" has got two ways to pass an "xprt":
> - via "rqstp->rq_xprt" if a worker is available,
> - on the "pool->sp_sockets" list otherwise
>
> if (!list_empty(&pool->sp_threads)) {
> rqstp = list_entry(pool->sp_threads.next, struct svc_rqst, rq_list);
> svc_thread_dequeue(pool, rqstp);
> rqstp->rq_xprt = xprt;
> svc_xprt_get(xprt);
> rqstp->rq_reserved = serv->sv_max_mesg;
> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> pool->sp_stats.threads_woken++;
> wake_up(&rqstp->rq_wait);
> } else {
> list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
> pool->sp_stats.sockets_queued++;
> }
>
> In the 1st case, there is a "svc_xprt_get(xprt)", in the 2nd one, there is not.
> Once "svc_xprt_enqueue()" returns, at some places, "svc_xprt_put(xprt)" is
> invoked. If we have passed the "else" branch, the "kref" can drop down to 0.

Maybe your fix is right, but I'm not sure: It looks to me like if
svc_xprt_enqueue() gets to "process:" in a situation where the caller
holds the only reference, then that's already a bug. Do you know who
the caller of svc_xprt_enqueue() was when this happened?

--b.

>
> "svc_recv()" includes:
>
> xprt = svc_xprt_dequeue(pool);
> if (xprt) {
> rqstp->rq_xprt = xprt;
> svc_xprt_get(xprt);
> rqstp->rq_reserved = serv->sv_max_mesg;
> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> } else {
>
> i.e. it calls "svc_xprt_get(xprt)" if it has picked up "xprt" from the "pool" list.
> Yet it is too late, if the caller of "svc_xprt_enqueue()" has already had the
> time to call "svc_xprt_put(xprt)".
>
> I think "svc_xprt_get(xprt)" should be called in "svc_xprt_enqueue()" before
> "list_add_tail(&xprt->xpt_ready, &pool->sp_sockets)" to keep the reference
> counter valid.
> It is much simpler to move the call "svc_xprt_get(xprt)" just before
>
> if (!list_empty(&pool->sp_threads))
>
> Thanks,
>
> Zoltan
>
>
>

2010-10-25 01:21:44

by J. Bruce Fields

[permalink] [raw]
Subject: [PATCH 4/4] svcrpc: svc_tcp_sendto XPT_DEAD check is redundant

The only caller (svc_send) has already checked XPT_DEAD.

Signed-off-by: J. Bruce Fields <[email protected]>
---
net/sunrpc/svcsock.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1454739..07919e1 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1135,9 +1135,6 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
reclen = htonl(0x80000000|((xbufp->len ) - 4));
memcpy(xbufp->head[0].iov_base, &reclen, 4);

- if (test_bit(XPT_DEAD, &rqstp->rq_xprt->xpt_flags))
- return -ENOTCONN;
-
sent = svc_sendto(rqstp, &rqstp->rq_res);
if (sent != xbufp->len) {
printk(KERN_NOTICE
--
1.7.1


2010-10-23 03:32:52

by J. Bruce Fields

[permalink] [raw]
Subject: Re: "xprt" reference count drops to 0

On Fri, Oct 22, 2010 at 07:21:33PM -0400, J. Bruce Fields wrote:
> On Fri, Oct 22, 2010 at 07:01:12PM -0400, J. Bruce Fields wrote:
> > On Fri, Oct 22, 2010 at 05:20:07PM -0400, J. Bruce Fields wrote:
> > > On Fri, Oct 22, 2010 at 05:00:50PM +0200, Menyhart Zoltan wrote:
> > > > J. Bruce Fields wrote:
> > > > >On Tue, Oct 05, 2010 at 10:22:30AM +0200, Menyhart Zoltan wrote:
> > > > >>Due to some race conditions, the reference count can become 0
> > > > >>while "xprt" is still on a "pool":
> > > > >
> > > > >Apologies, your email got buried in my inbox....
> > > > >
> > > > >>
> > > > >>WARNING: at lib/kref.c:43 kref_get+0x23/0x2d()
> > > > >> [] kref_get+0x23/0x2d
> > > > >> [] svc_xprt_get+0x12/0x14 [sunrpc]
> > > > >> [] svc_recv+0x2db/0x78a [sunrpc]
> > > > >
> > > > >Which kernel exactly did you see this on? Is it reproducible?
> > > >
> > > > I saw it on a 2.6.32.
> > > > It has not been corrected for the 2.6.36-rc3 yet.

Doh. Wait, when you say "has not been corrected on 2.6.36-rc3", do you
mean you've actually *seen* the problem occur on 2.6.36-rc3?

If not, it's more likely you're seeing the problem that was fixed by
1b644b6e6f6160ae35ce4b52c2ca89ed3e356e18, in 2.6.34-rc1.

--b.

2010-10-25 22:58:45

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Mon, 25 Oct 2010 16:21:56 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Oct 25, 2010 at 12:43:57PM +1100, Neil Brown wrote:
> > On Sun, 24 Oct 2010 21:21:30 -0400
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > Once an xprt has been deleted, there's no reason to allow it to be
> > > enqueued--at worst, that might cause the xprt to be re-added to some
> > > global list, resulting in later corruption.
> > >
> > > Signed-off-by: J. Bruce Fields <[email protected]>
> >
> > Yep, this makes svc_close_xprt() behave the same way as svc_recv() which
> > calls svc_delete_xprt but does not clear XPT_BUSY. The other branches in
> > svc_recv call svc_xprt_received, but the XPT_CLOSE branch doesn't
> >
> > Reviewed-by: NeilBrown <[email protected]>
>
> Also, of course:
>
> > > svc_xprt_get(xprt);
> > > svc_delete_xprt(xprt);
> > > - clear_bit(XPT_BUSY, &xprt->xpt_flags);
> > > svc_xprt_put(xprt);
>
> The get/put is pointless: the only reason I can see for doing that of
> course was to be able to safely clear the bit afterwards.
>

Agreed.

I like patches that get rid of code!!

NeilBrown

2010-10-26 01:25:23

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 4/4] svcrpc: svc_tcp_sendto XPT_DEAD check is redundant

On Tue, Oct 26, 2010 at 10:23:38AM +1100, Neil Brown wrote:
> On Mon, 25 Oct 2010 11:03:08 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Oct 25, 2010 at 01:10:24PM +1100, Neil Brown wrote:
> > > On Sun, 24 Oct 2010 21:21:33 -0400
> > > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > > The only caller (svc_send) has already checked XPT_DEAD.
> > > >
> > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > > ---
> > > > net/sunrpc/svcsock.c | 3 ---
> > > > 1 files changed, 0 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > > index 1454739..07919e1 100644
> > > > --- a/net/sunrpc/svcsock.c
> > > > +++ b/net/sunrpc/svcsock.c
> > > > @@ -1135,9 +1135,6 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> > > > reclen = htonl(0x80000000|((xbufp->len ) - 4));
> > > > memcpy(xbufp->head[0].iov_base, &reclen, 4);
> > > >
> > > > - if (test_bit(XPT_DEAD, &rqstp->rq_xprt->xpt_flags))
> > > > - return -ENOTCONN;
> > > > -
> > > > sent = svc_sendto(rqstp, &rqstp->rq_res);
> > > > if (sent != xbufp->len) {
> > > > printk(KERN_NOTICE
> > >
> > >
> > > So after removing all these references to XPT_DEAD, do we need XPT_DEAD at
> > > all???
> > >
> > > I think it is only used in two other places.
> > >
> > > 1/ In svc_revisit we don't queue the deferred request to an XPT_DEAD
> > > transport.
> > > We could avoid that by changing the 'owner' of a deferred request from the
> > > service to the xprt, and calling cache_clean_deferred in svc_delete_xprt.
> >
> > That use does seem a bit of a hack to me, so I'd be happy to get rid of
> > it.
> >
> > > 2/ in svc_send(). I wonder if we need this at all. There doesn't seem to be
> > > any locking to ensure that XPT_DEAD doesn't get set immediately after the
> > > test, and the underlying sendto (whether tcp or udp or whatever) should fail
> > > if the socket is closed, and if it doesn't it shouldn't really matter??
> >
> > Does it make a difference in the case of a half-close? If the client
> > follows a request immediately by a FIN, and if that results in our
> > setting DEAD (I think it does, assuming svc_tcp_state_change() is called
> > in that case), then the current code may have the effect of preventing
> > us from sending the reply.
> >
> > I don't know if that's good or bad.
> >
> > > So can we get rid of XPT_DEAD altogether?
> >
> > OK, I also had another use in mind: for the purposes of 4.1 (which needs
> > to know when a connection goes down, e.g. to know that it's no longer
> > available for callbacks), I added a list of callbacks to the xprt,
> > called on svc_delete_xprt().
> >
> > I just noticed that I think my current code allows the 4.1 code to
> > register an xprt after svc_delete_xprt() is called. I could fix that
> > race by checking for DEAD after trying to register.
> >
> > (That callback code already seems messier than it should be, so maybe
> > someone else could suggest a better scheme. I'm stuck.
> >
> > In any case, it wouldn't be so bad if that were the one remaining use of
> > DEAD.)
>
> Could you use XPT_CLOSE for that??

Good idea, yes, I think so.

--b.

2010-11-12 19:00:10

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 1/4] svcrpc: never clear XPT_BUSY on dead xprt

On Tue, Oct 26, 2010 at 12:05:17PM -0400, J. Bruce Fields wrote:
> On Tue, Oct 26, 2010 at 08:59:47AM -0400, J. Bruce Fields wrote:
> > On Tue, Oct 26, 2010 at 12:28:20PM +1100, Neil Brown wrote:
> > > On Mon, 25 Oct 2010 20:30:08 -0400
> > > "J. Bruce Fields" <[email protected]> wrote:
> > > >
> > > > svcrpc: simplify svc_close_xprt, fix minor race
> > > >
> > > > We're assuming that in the XPT_BUSY case, somebody else will take care
> > > > of the close. But that's not necessarily the case (e.g. if
> > > > svc_xprt_enqueue() exits due to a lack of write space). So the close
> > > > could in theory be delayed indefinitely. We should be calling
> > > > svc_xprt_enqueue() after every set of XPT_CLOSE.
> > > >
> > > > Also with svc_close_all() no longer relying on svc_close, it no longer
> > > > needs to be able to immediately do the svc_delete_xprt() itself.
> > > >
> > > > Signed-off-by: J. Bruce Fields <[email protected]>
> > > >
> > > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > > index 8c018df..dfee123 100644
> > > > --- a/net/sunrpc/svc_xprt.c
> > > > +++ b/net/sunrpc/svc_xprt.c
> > > > @@ -928,11 +928,7 @@ void svc_delete_xprt(struct svc_xprt *xprt)
> > > > void svc_close_xprt(struct svc_xprt *xprt)
> > > > {
> > > > set_bit(XPT_CLOSE, &xprt->xpt_flags);
> > > > - if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
> > > > - /* someone else will have to effect the close */
> > > > - return;
> > > > -
> > > > - svc_delete_xprt(xprt);
> > > > + svc_xprt_enqueue(xprt);
> > > > }
> > > > EXPORT_SYMBOL_GPL(svc_close_xprt);
> > > >
> > >
> > > I'm not quite so sure about this one.
> > >
> > > svc_close_xprt gets called when user-space writes some magic string to some
> > > magic file. This could be done when there are no active nfsd threads.
> > > It is a bit far fetched, but you could tell nfsd to open a socket, then tell
> > > it to close it, before telling it to start any threads.
> > >
> > > In that case the socket would stay open until the first thread was started.
> >
> > Thanks, that's an odd case, but I think you're right. OK:
>
> Actually.... I think the real race here is the fault of the write-space
> check: once we've taken XPT_BUSY, we really do need to guarantee that
> we'll arrange to have svc_xprt_enqueue() called again after clearing
> XPT_BUSY, to avoid missing any events that arrived while we were
> processing ours. So putting the wspace check after taking XPT_BUSY in
> svc_xprt_enqueue looks to me like a bug that could cause processing to
> stall unnecessarily in some unlikely cases.

I'm going to apply this for 2.6.38 if you can't think of a reason not
to.

(My one last-minute worry was whether there was some reason xpo_wspace()
needed to be called with XPT_BUSY held; but I can't see any.)

--b.

>
> --b.
>
> commit cf501cb479873d6e77abba131ea40c11aa924d72
> Author: J. Bruce Fields <[email protected]>
> Date: Tue Oct 26 11:32:03 2010 -0400
>
> svcrpc: fix wspace-checking race
>
> We call svc_xprt_enqueue() after something happens which we think may
> require handling from a server thread. To avoid such events being lost,
> svc_xprt_enqueue() must guarantee that there will be a svc_recv() call
> from a server thread following any such event. It does that by either
> waking up a server thread itself, or checking that XPT_BUSY is set (in
> which case somebody else is doing it).
>
> But the check of XPT_BUSY could occur just as someone finishes
> processing some other event, and just before they clear XPT_BUSY.
>
> Therefore it's important not to clear XPT_BUSY without subsequently
> doing another svc_xprt_enqueue() to check whether the xprt should be
> requeued.
>
> The xpo_wspace() check in svc_xprt_enqueue() breaks this rule, allowing
> an event to be missed in situations like:
>
> data arrives
> call svc_tcp_data_ready():
> call svc_xprt_enqueue():
> set BUSY
> find no write space
> svc_reserve():
> free up write space
> call svc_xprt_enqueue():
> test BUSY
> clear BUSY
>
> So, instead, check wspace in the same places that the state flags are
> checked: before taking BUSY, and in svc_recv().
>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> index d2261fe..9c690fa 100644
> --- a/include/linux/sunrpc/svc_xprt.h
> +++ b/include/linux/sunrpc/svc_xprt.h
> @@ -92,12 +92,17 @@ static inline void unregister_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user
> static inline int register_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user *u)
> {
> spin_lock(&xpt->xpt_lock);
> - list_add(&u->list, &xpt->xpt_users);
> - spin_unlock(&xpt->xpt_lock);
> if (test_bit(XPT_CLOSE, &xpt->xpt_flags)) {
> - unregister_xpt_user(xpt, u);
> + /*
> + * The connection is about to be deleted soon (or,
> + * worse, may already be deleted--in which case we've
> + * already notified the xpt_users).
> + */
> + spin_unlock(&xpt->xpt_lock);
> return -ENOTCONN;
> }
> + list_add(&u->list, &xpt->xpt_users);
> + spin_unlock(&xpt->xpt_lock);
> return 0;
> }
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 99996ad..04238a9 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -302,6 +302,15 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
> list_del(&rqstp->rq_list);
> }
>
> +static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> +{
> + if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> + return true;
> + if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> + return xprt->xpt_ops->xpo_has_wspace(xprt);
> + return false;
> +}
> +
> /*
> * Queue up a transport with data pending. If there are idle nfsd
> * processes, wake 'em up.
> @@ -314,8 +323,7 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> struct svc_rqst *rqstp;
> int cpu;
>
> - if (!(xprt->xpt_flags &
> - ((1<<XPT_CONN)|(1<<XPT_DATA)|(1<<XPT_CLOSE)|(1<<XPT_DEFERRED))))
> + if (!svc_xprt_has_something_to_do(xprt))
> return;
>
> cpu = get_cpu();
> @@ -345,25 +353,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> BUG_ON(xprt->xpt_pool != NULL);
> xprt->xpt_pool = pool;
>
> - /* Handle pending connection */
> - if (test_bit(XPT_CONN, &xprt->xpt_flags))
> - goto process;
> -
> - /* Handle close in-progress */
> - if (test_bit(XPT_CLOSE, &xprt->xpt_flags))
> - goto process;
> -
> - /* Check if we have space to reply to a request */
> - if (!xprt->xpt_ops->xpo_has_wspace(xprt)) {
> - /* Don't enqueue while not enough space for reply */
> - dprintk("svc: no write space, transport %p not enqueued\n",
> - xprt);
> - xprt->xpt_pool = NULL;
> - clear_bit(XPT_BUSY, &xprt->xpt_flags);
> - goto out_unlock;
> - }
> -
> - process:
> if (!list_empty(&pool->sp_threads)) {
> rqstp = list_entry(pool->sp_threads.next,
> struct svc_rqst,
> @@ -744,7 +733,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
> spin_unlock_bh(&serv->sv_lock);
> svc_xprt_received(newxpt);
> }
> - } else {
> + } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> rqstp, pool->sp_id, xprt,
> atomic_read(&xprt->xpt_ref.refcount));