The rpc_pipefs pattern used for the GETDEVICEINFO implementation in the
blocklayout driver is fundamentally not thread safe. Workloads like
dbench manage to trigger concurrent GETDEVICEINFO calls though, so we
need to add a critical section around it.
I also wonder if we should avoid sending multiple GETDEVICEINFO calls
at a higher level in the NFS client, as sending multiple requests only
to discard the duplicate results when they are added to the cache
doesn't seem very helpful.
The rpc_pipefs code isn't thread safe, leading to occasional use after
frees when running xfstests generic/241 (dbench).
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfs/blocklayout/rpc_pipefs.c | 14 +++++++++-----
fs/nfs/netns.h | 1 +
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/blocklayout/rpc_pipefs.c b/fs/nfs/blocklayout/rpc_pipefs.c
index 8d04bda..da58ff7 100644
--- a/fs/nfs/blocklayout/rpc_pipefs.c
+++ b/fs/nfs/blocklayout/rpc_pipefs.c
@@ -65,17 +65,18 @@ bl_resolve_deviceid(struct nfs_server *server, struct pnfs_block_volume *b,
 
 	dprintk("%s CREATING PIPEFS MESSAGE\n", __func__);
 
+	mutex_lock(&nn->bl_mutex);
 	bl_pipe_msg.bl_wq = &nn->bl_wq;
 
 	b->simple.len += 4;	/* single volume */
 	if (b->simple.len > PAGE_SIZE)
-		return -EIO;
+		goto out_unlock;
 
 	memset(msg, 0, sizeof(*msg));
 	msg->len = sizeof(*bl_msg) + b->simple.len;
 	msg->data = kzalloc(msg->len, gfp_mask);
 	if (!msg->data)
-		goto out;
+		goto out_free_data;
 
 	bl_msg = msg->data;
 	bl_msg->type = BL_DEVICE_MOUNT,
@@ -87,7 +88,7 @@ bl_resolve_deviceid(struct nfs_server *server, struct pnfs_block_volume *b,
 	rc = rpc_queue_upcall(nn->bl_device_pipe, msg);
 	if (rc < 0) {
 		remove_wait_queue(&nn->bl_wq, &wq);
-		goto out;
+		goto out_free_data;
 	}
 
 	set_current_state(TASK_UNINTERRUPTIBLE);
@@ -98,12 +99,14 @@ bl_resolve_deviceid(struct nfs_server *server, struct pnfs_block_volume *b,
 	if (reply->status != BL_DEVICE_REQUEST_PROC) {
 		printk(KERN_WARNING "%s failed to decode device: %d\n",
 			__func__, reply->status);
-		goto out;
+		goto out_free_data;
 	}
 
 	dev = MKDEV(reply->major, reply->minor);
-out:
+out_free_data:
 	kfree(msg->data);
+out_unlock:
+	mutex_unlock(&nn->bl_mutex);
 	return dev;
 }
 
@@ -233,6 +236,7 @@ static int nfs4blocklayout_net_init(struct net *net)
 	struct nfs_net *nn = net_generic(net, nfs_net_id);
 	struct dentry *dentry;
 
+	mutex_init(&nn->bl_mutex);
 	init_waitqueue_head(&nn->bl_wq);
 	nn->bl_device_pipe = rpc_mkpipe_data(&bl_upcall_ops, 0);
 	if (IS_ERR(nn->bl_device_pipe))
diff --git a/fs/nfs/netns.h b/fs/nfs/netns.h
index ef221fb..f0e06e4 100644
--- a/fs/nfs/netns.h
+++ b/fs/nfs/netns.h
@@ -19,6 +19,7 @@ struct nfs_net {
 	struct rpc_pipe *bl_device_pipe;
 	struct bl_dev_msg bl_mount_reply;
 	wait_queue_head_t bl_wq;
+	struct mutex bl_mutex;
 	struct list_head nfs_client_list;
 	struct list_head nfs_volume_list;
 #if IS_ENABLED(CONFIG_NFS_V4)
--
1.9.1
On Fri, Sep 26, 2014 at 12:21:06PM -0400, Trond Myklebust wrote:
> > Not without getting rid of the rpc_pipefs interface. That is on my
> > very long term TODO list, but it will require new userspace support.
>
> Why is that? rpc_pipefs was designed to be message based, so it should
> work quite well in a multi-threaded environment. We certainly don't
> use mutexes around the gssd up/downcall, and the only reason for the
> mutex in idmapd is to deal with the keyring upcall.
Ok, the GSS code dynamically allocates a new message for each upcall
and thus doesn't have the issue. I'll take a look at whether I can do
the same in the blocklayout driver.
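
Roughly, the difference looks like the sketch below. This is
illustrative pseudo-kernel code only, not the actual gssd or
blocklayout implementation; the struct names are made up:

/*
 * Illustrative only: contrast between the shared per-netns reply
 * buffer the blocklayout upcall uses today and a gssd-style scheme
 * where every upcall carries its own message and reply.
 */

/* blocklayout today: one reply buffer shared by all callers. */
struct nfs_net_today {				/* made-up name */
	struct rpc_pipe *bl_device_pipe;
	struct bl_dev_msg bl_mount_reply;	/* every downcall writes here */
	wait_queue_head_t bl_wq;		/* every caller sleeps here */
};

/*
 * Two tasks in bl_resolve_deviceid() both sleep on bl_wq and both read
 * bl_mount_reply, so a downcall for one request can wake and complete
 * the other one - which is where the use-after-frees come from and why
 * the patch serializes the whole upcall under bl_mutex.
 */

/* gssd-style alternative: per-call state, nothing shared. */
struct bl_upcall_ctx {				/* made-up name */
	struct rpc_pipe_msg msg;		/* queued to the pipe */
	struct bl_dev_msg reply;		/* filled by the downcall */
	struct completion done;			/* completed by the downcall */
};

The per-call context removes the shared bl_mount_reply state, but it
only helps if the downcall can identify which request it belongs to.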
On Fri, Sep 26, 2014 at 11:48 AM, Christoph Hellwig <[email protected]> wrote:
> On Fri, Sep 26, 2014 at 10:29:34AM -0400, Trond Myklebust wrote:
>> It worries me that we're putting a mutex directly in the writeback
>> path. For small arrays, it might be acceptable, but what if you have a
>> block device with 1000s of disks on the back end?
>>
>> Is there no better way to fix this issue?
>
> Not without getting rid of the rpc_pipefs interface. That is on my
> very long term TODO list, but it will require new userspace support.
Why is that? rpc_pipefs was designed to be message based, so it should
work quite well in a multi-threaded environment. We certainly don't
use mutexes around the gssd up/downcall, and the only reason for the
mutex in idmapd is to deal with the keyring upcall.
> Note that I'm actually worried about GETDEVICEINFO from the writeback
> path in general. There is a lot that happens when we don't have
> a device in cache, including the need to open a block device for
> the block layout driver, which is a complex operation full of
> GFP_KERNEL allocations, or an even more complex scsi device scan
> for the object layout. It's been on my nearer-term todo list
> to find reproducers for the deadlocks in this area, which seem
> very possible, and then to look into a fix; I can't really
> think of anything less drastic than refusing block or object layout
> I/O from memory reclaim if we don't have the device cached yet.
> The situation for file layouts seems less severe, so I'll need
> help from people more familiar with them to think about the situation there.
Agreed,
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Fri, Sep 26, 2014 at 10:02 AM, Christoph Hellwig <[email protected]> wrote:
> The rpc_pipefs code isn't thread safe, leading to occasional use after
> frees when running xfstests generic/241 (dbench).
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/nfs/blocklayout/rpc_pipefs.c | 14 +++++++++-----
> fs/nfs/netns.h | 1 +
> 2 files changed, 10 insertions(+), 5 deletions(-)
>
>
It worries me that we're putting a mutex directly in the writeback
path. For small arrays, it might be acceptable, but what if you have a
block device with 1000s of disks on the back end?
Is there no better way to fix this issue?
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Fri, Sep 26, 2014 at 10:29:34AM -0400, Trond Myklebust wrote:
> It worries me that we're putting a mutex directly in the writeback
> path. For small arrays, it might be acceptable, but what if you have a
> block device with 1000s of disks on the back end?
>
> Is there no better way to fix this issue?
Not without getting rid of the rpc_pipefs interface. That is on my
very long term TODO list, but it will require new userspace support.
Note that I'm actually worried about GETDEVICEINFO from the writeback
path in general. There is a lot that happens when we don't have
a device in cache, including the need to open a block device for
the block layout driver, which is a complex operation full of
GFP_KERNEL allocations, or an even more complex scsi device scan
for the object layout. It's been on my nearer-term todo list
to find reproducers for the deadlocks in this area, which seem
very possible, and then to look into a fix; I can't really
think of anything less drastic than refusing block or object layout
I/O from memory reclaim if we don't have the device cached yet.
The situation for file layouts seems less severe, so I'll need
help from people more familiar with them to think about the situation there.
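
To make that idea concrete, the check I have in mind would look
something like the sketch below. This is purely hypothetical:
bl_layout_io_allowed() and bl_device_is_cached() do not exist, the
latter just stands in for a lookup in the existing deviceid cache, and
the PF_MEMALLOC test is only one possible way to detect reclaim:

/*
 * Hypothetical sketch only: refuse to set up a new block layout device
 * while we are in memory reclaim, and fall back to I/O through the MDS
 * instead.
 */
static bool bl_layout_io_allowed(struct nfs_server *server,
				 const struct nfs4_deviceid *devid)
{
	/* Device already known: safe to proceed from any context. */
	if (bl_device_is_cached(server, devid))	/* made-up helper */
		return true;

	/*
	 * Getting the device would mean a GETDEVICEINFO upcall, opening
	 * a block device and GFP_KERNEL allocations.  Doing that from
	 * direct reclaim risks deadlock, so punt back to the MDS.
	 */
	if (current->flags & PF_MEMALLOC)
		return false;

	return true;
}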
On Fri, Sep 26, 2014 at 12:21:06PM -0400, Trond Myklebust wrote:
> > On Fri, Sep 26, 2014 at 10:29:34AM -0400, Trond Myklebust wrote:
> >> It worries me that we're putting a mutex directly in the writeback
> >> path. For small arrays, it might be acceptable, but what if you have a
> >> block device with 1000s of disks on the back end?
> >>
> >> Is there no better way to fix this issue?
> >
> > Not without getting rid of the rpc_pipefs interface. That is on my
> > very long term TODO list, but it will require new userspace support.
>
> Why is that? rpc_pipefs was designed to be message based, so it should
> work quite well in a multi-threaded environment. We certainly don't
> use mutexes around the gssd up/downcall, and the only reason for the
> mutex in idmapd is to deal with the keyring upcall.
Turns out we can't do anything like that with the existing blkmapd
ABI. The other callers that support concurrency have some sort
of unique identifier embedded in their messages both on the upcall
and the downcall, so that the kernel can find the right structure
on the downcall side.
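
For reference, what a concurrency-capable ABI would roughly have to
carry is sketched below. These structures are made up and are not the
existing bl_msg_hdr/bl_dev_msg format, and bl_find_request() is a
made-up lookup helper:

/*
 * Hypothetical only: an upcall/downcall format that tags every request
 * with a unique identifier, so the kernel can match the reply to the
 * right waiter instead of serializing everything under bl_mutex.
 */
struct bl_msg_hdr_v2 {		/* made-up upcall header */
	u8  type;
	u16 totallen;
	u32 xid;		/* unique per-request identifier */
};

struct bl_dev_reply_v2 {	/* made-up downcall format */
	u32 xid;		/* echoed back by blkmapd */
	u32 status;
	u32 major;
	u32 minor;
};

/*
 * On the downcall the kernel would then do something like:
 *
 *	req = bl_find_request(nn, reply->xid);
 *	if (req) {
 *		req->reply = *reply;
 *		complete(&req->done);
 *	}
 */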
Because of that I'd like to request inclusion of this patch with a Cc
to stable for 3.17.
On Fri, Oct 24, 2014 at 04:29:39PM +0200, Christoph Hellwig wrote:
> Turns out we can't do anything like that with the existing blkmapd
> ABI. The other callers that support concurrency have some sort
> of unique identifier embedded in their messages both on the upcall
> and the downcall, so that the kernel can find the right structure
> on the downcall side.
>
> Because of that I'd like to request inclusion of this patch with a Cc
> to stable for 3.17.
ping? This is an easily exploitable oops, so I'd really like to see a
fix in 3.18-rc and 3.17-stable.
Thanks,
Christoph