Currently when falling back to doing I/O through the MDS (via
pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
without releasing the reference taken on the dreq
via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
nfs_direct_pgio_init. It then takes another reference on the dreq via
nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
as a result the requester will become stuck in inode_dio_wait. Once
that happens, other processes accessing the inode will become stuck as
well.
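(For anyone tracing the refcounting, here is a tiny standalone C model of the
imbalance. struct dreq and the get_dreq()/put_dreq() helpers are local
stand-ins that only mirror the kernel names, and the call chains in the
comments follow the description above rather than the actual fs/nfs source.)

/*
 * Toy userspace model of the dreq reference imbalance described above.
 * Build and run:  cc -o dreq_model dreq_model.c && ./dreq_model
 */
#include <stdio.h>

struct dreq { int count; };	/* stand-in for struct nfs_direct_req */

static void get_dreq(struct dreq *d) { d->count++; }
static int  put_dreq(struct dreq *d) { return --d->count == 0; }

int main(void)
{
	struct dreq dreq = { .count = 1 };	/* reference held by the requester */

	/* pNFS attempt: pnfs_generic_pg_{read|write}pages ->
	 * nfs_pgheader_init -> nfs_direct_pgio_init takes a reference... */
	get_dreq(&dreq);
	/* ...but the MDS fallback frees the header via hdr->release()
	 * without a matching put, so this reference is leaked. */

	/* MDS retry: nfs_generic_pg_pgios -> nfs_pgheader_init ->
	 * nfs_direct_pgio_init takes a second reference... */
	get_dreq(&dreq);
	put_dreq(&dreq);	/* ...which the MDS I/O completion does drop */

	put_dreq(&dreq);	/* the requester drops its own reference */

	/* The leaked reference keeps the count above zero, so the final
	 * put that would end the direct I/O (and let inode_dio_wait()
	 * return) never happens. */
	printf("dreq count = %d -> %s\n", dreq.count,
	       dreq.count ? "stuck in inode_dio_wait()" : "I/O complete");
	return 0;
}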
Moving the init_hdr call down to nfs_initiate_pgio ensures we take the
reference on the dreq only once.
This can be reproduced (sometimes) by performing "storage failover
takeover" commands on a NetApp filer while doing direct I/O from a client.
This can also be reproduced using SystemTap to simulate a failure while
doing direct I/O from a client (from Dave Wysochanski
<[email protected]>):
stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'
Suggested-by: Benjamin Coddington <[email protected]>
Signed-off-by: Scott Mayhew <[email protected]>
---
fs/nfs/pagelist.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index d0543e1..d478c19 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -54,9 +54,6 @@ void nfs_pgheader_init(struct nfs_pageio_descriptor *desc,
hdr->dreq = desc->pg_dreq;
hdr->release = release;
hdr->completion_ops = desc->pg_completion_ops;
- if (hdr->completion_ops->init_hdr)
- hdr->completion_ops->init_hdr(hdr);
-
hdr->pgio_mirror_idx = desc->pg_mirror_idx;
}
EXPORT_SYMBOL_GPL(nfs_pgheader_init);
@@ -607,6 +604,9 @@ int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_header *hdr,
};
int ret = 0;
+ if (hdr->completion_ops->init_hdr)
+ hdr->completion_ops->init_hdr(hdr);
+
hdr->rw_ops->rw_initiate(hdr, &msg, rpc_ops, &task_setup_data, how);
dprintk("NFS: initiated pgio call "
--
2.9.5
On Wed, 13 Dec 2017, Trond Myklebust wrote:
> On Wed, 2017-12-13 at 11:15 -0500, Scott Mayhew wrote:
> > Currently when falling back to doing I/O through the MDS (via
> > pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
> > without releasing the reference taken on the dreq
> > via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
> > nfs_direct_pgio_init. It then takes another reference on the dreq
> > via
> > nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
> > as a result the requester will become stuck in inode_dio_wait. Once
> > that happens, other processes accessing the inode will become stuck
> > as
> > well.
> >
> > Moving the init_hdr call down to nfs_initiate_pgio ensures we take
> > the
> > reference on the dreq only once.
> >
> > This can be reproduced (sometimes) by performing "storage failover
> > takeover" commands on NetApp filer while doing direct I/O from a
> > client.
> >
> > This can also be reproduced using SystemTap to simulate a failure
> > while
> > doing direct I/O from a client (from Dave Wysochanski
> > <[email protected]>):
> >
> > stap -v -g -e 'probe
> > module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").retu
> > rn { $return=NULL; exit(); }'
> >
> > Suggested-by: Benjamin Coddington <[email protected]>
> > Signed-off-by: Scott Mayhew <[email protected]>
> > ---
> > fs/nfs/pagelist.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> > index d0543e1..d478c19 100644
> > --- a/fs/nfs/pagelist.c
> > +++ b/fs/nfs/pagelist.c
> > @@ -54,9 +54,6 @@ void nfs_pgheader_init(struct nfs_pageio_descriptor
> > *desc,
> > hdr->dreq = desc->pg_dreq;
> > hdr->release = release;
> > hdr->completion_ops = desc->pg_completion_ops;
> > - if (hdr->completion_ops->init_hdr)
> > - hdr->completion_ops->init_hdr(hdr);
> > -
> > hdr->pgio_mirror_idx = desc->pg_mirror_idx;
> > }
> > EXPORT_SYMBOL_GPL(nfs_pgheader_init);
> > @@ -607,6 +604,9 @@ int nfs_initiate_pgio(struct rpc_clnt *clnt,
> > struct nfs_pgio_header *hdr,
> > };
> > int ret = 0;
> >
> > + if (hdr->completion_ops->init_hdr)
> > + hdr->completion_ops->init_hdr(hdr);
> > +
> > hdr->rw_ops->rw_initiate(hdr, &msg, rpc_ops,
> > &task_setup_data, how);
> >
> > dprintk("NFS: initiated pgio call "
>
> Won't this break as soon as we hit a call to nfs_pgio_error()? Also,
> what about block layout, which never calls nfs_initiate_pgio()?
Yes, it appears it would break both of those. Sorry I missed that.
>
> I'd prefer that we fix this in pnfs_read_through_mds() and
> pnfs_write_through_mds() by ensuring that those functions clean up
> correctly. Should they be calling hdr->completion_ops->completion()
> instead of calling hdr->release() directly?
We were wondering if that might be the case too. I'll get to testing
that right away.
Thanks,
Scott
>
> --
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> [email protected]
Currently when falling back to doing I/O through the MDS (via
pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
without releasing the reference taken on the dreq
via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
nfs_direct_pgio_init. It then takes another reference on the dreq via
nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
as a result the requester will become stuck in inode_dio_wait. Once
that happens, other processes accessing the inode will become stuck as
well.
Ensure that pnfs_read_through_mds() and pnfs_write_through_mds() clean
up correctly by calling hdr->completion_ops->completion() instead of
calling hdr->release() directly.
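(A sketch of why ->completion() is the right cleanup here, assuming the usual
direct-I/O helpers in fs/nfs/direct.c: the completion callback is where a
put_dreq() is paired with the reference taken in ->init_hdr() before the
header is freed, whereas a bare hdr->release() frees the header and leaves
that reference dangling. The function below is illustrative only, and
complete_the_dreq() is a hypothetical placeholder for the final-reference
handling.)

static void direct_pgio_completion(struct nfs_pgio_header *hdr)
{
	struct nfs_direct_req *dreq = hdr->dreq;

	/* ... per-request accounting elided ... */

	if (put_dreq(dreq))			/* drop the ->init_hdr() reference */
		complete_the_dreq(dreq);	/* hypothetical: ends the DIO and
						 * lets inode_dio_wait() return */
	hdr->release(hdr);			/* only now free the header */
}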
This can be reproduced (sometimes) by performing "storage failover
takeover" commands on a NetApp filer while doing direct I/O from a client.
This can also be reproduced using SystemTap to simulate a failure while
doing direct I/O from a client (from Dave Wysochanski
<[email protected]>):
stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'
Suggested-by: Trond Myklebust <[email protected]>
Signed-off-by: Scott Mayhew <[email protected]>
---
fs/nfs/pnfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index d602fe9..eb098cc 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -2255,7 +2255,7 @@ pnfs_write_through_mds(struct nfs_pageio_descriptor *desc,
nfs_pageio_reset_write_mds(desc);
mirror->pg_recoalesce = 1;
}
- hdr->release(hdr);
+ hdr->completion_ops->completion(hdr);
}
static enum pnfs_try_status
@@ -2378,7 +2378,7 @@ pnfs_read_through_mds(struct nfs_pageio_descriptor *desc,
nfs_pageio_reset_read_mds(desc);
mirror->pg_recoalesce = 1;
}
- hdr->release(hdr);
+ hdr->completion_ops->completion(hdr);
}
/*
--
2.9.5
Trond, Anna,
Any concerns with this patch?
We left this running on multiple machines through the holidays... all doing
direct I/O to a NetApp filer while performing hourly storage failovers on
the filer. No more issues with processes getting stuck in
inode_dio_wait.
-Scott
On Fri, 15 Dec 2017, Scott Mayhew wrote:
> Currently when falling back to doing I/O through the MDS (via
> pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
> without releasing the reference taken on the dreq
> via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
> nfs_direct_pgio_init. It then takes another reference on the dreq via
> nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
> as a result the requester will become stuck in inode_dio_wait. Once
> that happens, other processes accessing the inode will become stuck as
> well.
>
> Ensure that pnfs_read_through_mds() and pnfs_write_through_mds() clean
> up correctly by calling hdr->completion_ops->completion() instead of
> calling hdr->release() directly.
>
> This can be reproduced (sometimes) by performing "storage failover
> takeover" commands on NetApp filer while doing direct I/O from a client.
>
> This can also be reproduced using SystemTap to simulate a failure while
> doing direct I/O from a client (from Dave Wysochanski
> <[email protected]>):
>
> stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'
>
> Suggested-by: Trond Myklebust <[email protected]>
> Signed-off-by: Scott Mayhew <[email protected]>
> ---
> fs/nfs/pnfs.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index d602fe9..eb098cc 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -2255,7 +2255,7 @@ pnfs_write_through_mds(struct nfs_pageio_descriptor *desc,
> nfs_pageio_reset_write_mds(desc);
> mirror->pg_recoalesce = 1;
> }
> - hdr->release(hdr);
> + hdr->completion_ops->completion(hdr);
> }
>
> static enum pnfs_try_status
> @@ -2378,7 +2378,7 @@ pnfs_read_through_mds(struct nfs_pageio_descriptor *desc,
> nfs_pageio_reset_read_mds(desc);
> mirror->pg_recoalesce = 1;
> }
> - hdr->release(hdr);
> + hdr->completion_ops->completion(hdr);
> }
>
> /*
> --
> 2.9.5
>
On Thu, 11 Jan 2018, Trond Myklebust wrote:
> On Thu, 2018-01-11 at 08:10 -0500, Scott Mayhew wrote:
> > Trond, Anna,
> >
> > Any concerns with this patch?
> >
> > We left this run on multiple machines through the holidays... all
> > doing
> > direct I/O to a Netapp filer while doing storage failovers on an
> > hourly
> > basis on the filer. No more issues with processes getting stuck in
> > inode_dio_wait.
>
> I'm fine with this version. Should it be a stable fix? If so, do we
> have a "Fixes:" commit for it?
Yeah, it should probably be a stable fix. The hdr->release() call was
added in commit 1ca018d28d ("pNFS: Fix a memory leak when attempted pnfs
fails")... but I can reproduce this on earlier kernels than that (I
tested 4.0), so it looks like this goes way back...
-Scott
>
> --
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> [email protected]