LinuxLists.cc - [PATCHSET 00/10] pnfsd + pnfsd-exofs: Fixes and enhancements to layouts / recalls

2012-09-13 23:24:57

Subject: [PATCHSET 00/10] pnfsd + pnfsd-exofs: Fixes and enhancements to layouts / recalls

Hi Benny, Linux pNFS Server hackers

I'm submitting a set of changes to the pNFSD-Server tree, which fix and
enhance the server, to make it usable by exofs-raid1/5, as well as
panfs. (Panfs is Panasas's out-of-tree cluster FS client, used in the
implementation of a pNFS MDS)

I have been sitting on these patches for months now, because there is
one patch in the center of the set with a SPLITME: title. It is a
combination of a few intertwined changes which could be staged
better, but for lack of time I have never done that.
I'm submitting it as is. In the long run it does not matter because
they are all SQUASHMEs anyway, but better staging would make review
easier. Sorry about that.

But I would like these patches in, for BAT, so we can all work on the
same code base. (Without them, pnfsd is not usable for me)

These are the patches:

[PATCH 01/10] Revert "pnfsd-exofs: Two clients must not write to the same RAID stripe"
[PATCH 02/10] Revert "pnfsd-exofs: Add autologin support to exofs"

These two revert unfinished / buggy code that is already
in Benny's tree. It is reintroduced at the end of the
list.

[PATCH 03/10] SQUASHME: pnfsd: Pass less arguments to init_layout()
[PATCH 04/10] SQUASHME: Remove unused lr_flags & co
[PATCH 05/10] {SPLITME} SQUASHME: pnfsd: Revamp the all layout_return operations
[PATCH 06/10] SQUASHME: pnfsd-exofs: layout_return API changes
[PATCH 07/10] SQUASHME: pnfsd: Something very wrong with layout_recall of RETURN_FILE

These 5 are a deep cleanup, fixes, and API enhancements to the core
server. [PATCH 05/10] could have been split further, into 3 patches.
Read the individual commit messages for more explanations.
(I hope I explained them well)

[PATCH 08/10] SQUASHME: pnfsd-exofs: Autologin XDR also encode URI in device_info
[PATCH 09/10] SQUASHME: pnfsd-exofs: Autologin support to get_device_info
[PATCH 10/10] pnfsd-exofs: Two clients must not write to the same RAID stripe

These three reintroduce the two patches reverted above, better divided
for later SQUASHME.

These patches can be fetched from:
git://git.open-osd.org/linux-open-osd.git pnfsd-exofs-devel

Based on pnfs-all-latest (v3.5):
[081ddba3] pnfs: mimic vanilla nfs4 stateid allocation in pNFS
And up to:
[9613a8aa] pnfsd-exofs: Two clients must not write to the same RAID stripe

However these patches by themselves are not enough. There are important fixes
to exofs that went into 3.6-rc2 which must be applied for a working system.
I have all these on a different branch, based on v3.5
(They also include the usual UML fixes that I must apply for UML to work)

These patches can be fetched from:
git://git.open-osd.org/linux-open-osd.git debugable_linux-next

Based on:
[28a33cbc] Linux 3.5
And up to:
[e7936c3b] RFC: do_xor_speed Broken on UML do to jiffies

This *debugable_linux-next* branch can just be merged with any
pnfs (v3.5) branch. For convenience I have such a merged branch at:
git://git.open-osd.org/linux-open-osd.git merge_and_compile

It is all on the gitweb at:
http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/merge_and_compile

Thanks for your review
Boaz

2012-09-13 23:38:18

by Boaz Harrosh

Subject: [PATCH 08/10] SQUASHME: pnfsd-exofs: Autologin XDR also encode URI in device_info

From: Sachin Bhamare <[email protected]>

Add the missing bits to encode the autologin info strings
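
For reviewers less familiar with the wire format: XDR (RFC 4506) encodes a
variable-length opaque or string as a 4-byte big-endian length followed by
the bytes, zero-padded to a multiple of 4. The following is a minimal
user-space sketch of that rule only. struct xdr_buf, xdr_put_u32() and
xdr_put_opaque() are hypothetical stand-ins, not the exp_xdr_* helpers
used in this patch, whose internals are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space stand-in for the kernel's exp_xdr_stream. */
struct xdr_buf {
	uint8_t *p;	/* next free byte */
	uint8_t *end;	/* one past the last usable byte */
};

/* Wire size of an opaque<> of len bytes: length word + padded data. */
static size_t xdr_opaque_size(size_t len)
{
	return 4 + ((len + 3) & ~(size_t)3);
}

/* Append a 32-bit value in big-endian order; -1 if out of space. */
static int xdr_put_u32(struct xdr_buf *x, uint32_t v)
{
	if ((size_t)(x->end - x->p) < 4)
		return -1;
	x->p[0] = (uint8_t)(v >> 24);
	x->p[1] = (uint8_t)(v >> 16);
	x->p[2] = (uint8_t)(v >> 8);
	x->p[3] = (uint8_t)v;
	x->p += 4;
	return 0;
}

/* Append length word, data, then zero padding; -1 if out of space. */
static int xdr_put_opaque(struct xdr_buf *x, const void *data, uint32_t len)
{
	size_t padded = xdr_opaque_size(len) - 4;

	assert(data != NULL || len == 0);
	if ((size_t)(x->end - x->p) < 4 + padded)
		return -1;
	xdr_put_u32(x, len);
	if (len)
		memcpy(x->p, data, len);
	memset(x->p + len, 0, padded - len);
	x->p += padded;
	return 0;
}
```

With this sketch, a targetaddr-like record would be one xdr_put_u32() for
the availability flag followed by two xdr_put_opaque() calls, one for the
netid and one for the address string. Space is checked before anything is
written, so a failed call leaves the buffer untouched.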

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exportfs/pnfs_osd_xdr_srv.c | 45 +++++++++++++++++++++++++++++++++++++++---
include/linux/pnfs_osd_xdr.h | 5 +++++
2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/fs/exportfs/pnfs_osd_xdr_srv.c b/fs/exportfs/pnfs_osd_xdr_srv.c
index 35b3d32..04a3681 100644
--- a/fs/exportfs/pnfs_osd_xdr_srv.c
+++ b/fs/exportfs/pnfs_osd_xdr_srv.c
@@ -178,6 +178,42 @@ static enum nfsstat4 _encode_string(struct exp_xdr_stream *xdr,
return 0;
}

+/* struct pnfs_osd_targetaddr {
+ * u32 ota_available;
+ * struct pnfs_osd_net_addr ota_netaddr;
+ * };
+ */
+static inline enum nfsstat4 pnfs_osd_xdr_encode_targetaddr(
+ struct exp_xdr_stream *xdr,
+ struct pnfs_osd_targetaddr *taddr)
+{
+ __be32 *p;
+
+ /* ota_available */
+ p = exp_xdr_reserve_space(xdr, 4);
+ if (!p)
+ return NFS4ERR_TOOSMALL;
+ p = exp_xdr_encode_u32(p, taddr->ota_available);
+
+ /* encode r_netid */
+ p = exp_xdr_reserve_space(xdr, 4 + taddr->ota_netaddr.r_netid.len);
+ if (!p)
+ return NFS4ERR_TOOSMALL;
+
+ p = exp_xdr_encode_opaque(p,
+ taddr->ota_netaddr.r_netid.data,
+ taddr->ota_netaddr.r_netid.len);
+
+ /* encode r_addr */
+ p = exp_xdr_reserve_space(xdr, 4 + taddr->ota_netaddr.r_addr.len);
+ if (!p)
+ return NFS4ERR_TOOSMALL;
+ p = exp_xdr_encode_opaque(p,
+ taddr->ota_netaddr.r_addr.data,
+ taddr->ota_netaddr.r_addr.len);
+ return 0;
+}
+
/* struct pnfs_osd_deviceaddr {
* struct pnfs_osd_targetid oda_targetid;
* struct pnfs_osd_targetaddr oda_targetaddr;
@@ -193,17 +229,20 @@ enum nfsstat4 pnfs_osd_xdr_encode_deviceaddr(
__be32 *p;
enum nfsstat4 err;

- p = exp_xdr_reserve_space(xdr, 4 + 4 + sizeof(devaddr->oda_lun));
+ p = exp_xdr_reserve_space(xdr, sizeof(u32));
if (!p)
return NFS4ERR_TOOSMALL;

/* Empty oda_targetid */
p = exp_xdr_encode_u32(p, OBJ_TARGET_ANON);

- /* Empty oda_targetaddr for now */
- p = exp_xdr_encode_u32(p, 0);
+ /* oda_targetaddr */
+ err = pnfs_osd_xdr_encode_targetaddr(xdr, &devaddr->oda_targetaddr);
+ if (err)
+ return err;

/* oda_lun */
+ p = exp_xdr_reserve_space(xdr, sizeof(devaddr->oda_lun));
exp_xdr_encode_bytes(p, devaddr->oda_lun, sizeof(devaddr->oda_lun));

err = _encode_string(xdr, &devaddr->oda_systemid);
diff --git a/include/linux/pnfs_osd_xdr.h b/include/linux/pnfs_osd_xdr.h
index 435dd5f..3aab6e2 100644
--- a/include/linux/pnfs_osd_xdr.h
+++ b/include/linux/pnfs_osd_xdr.h
@@ -148,6 +148,11 @@ enum pnfs_osd_targetid_type {
OBJ_TARGET_SCSI_DEVICE_ID = 3,
};

+enum pnfs_osd_target_ota {
+ OBJ_OTA_UNAVAILABLE = 0,
+ OBJ_OTA_AVAILABLE = 1,
+};
+
/* union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) {
* case OBJ_TARGET_SCSI_NAME:
* string oti_scsi_name<>;
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:35:50

by Boaz Harrosh

Subject: [PATCH 04/10] SQUASHME: Remove unused lr_flags & co

The member nfsd4_pnfs_layoutreturn::lr_flags was
only set and never used anywhere.

Perhaps the intention was to put it inside
nfsd4_pnfs_layoutreturn_arg, to be passed to the
s_pnfs_op->layout_return() operation, but it is
not so.

The following patches will change the layoutreturn API,
which will make these flags unnecessary, so just drop
them.

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfsd/nfs4pnfsd.c | 16 ++++++----------
fs/nfsd/xdr4.h | 6 ------
2 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/fs/nfsd/nfs4pnfsd.c b/fs/nfsd/nfs4pnfsd.c
index f0e193a..5228e3b 100644
--- a/fs/nfsd/nfs4pnfsd.c
+++ b/fs/nfsd/nfs4pnfsd.c
@@ -350,15 +350,13 @@ destroy_layout(struct nfs4_layout *lp)
}

void fs_layout_return(struct super_block *sb, struct inode *ino,
- struct nfsd4_pnfs_layoutreturn *lrp, int flags,
- void *recall_cookie)
+ struct nfsd4_pnfs_layoutreturn *lrp, void *recall_cookie)
{
int ret;

if (unlikely(!sb->s_pnfs_op->layout_return))
return;

- lrp->lr_flags = flags;
lrp->args.lr_cookie = recall_cookie;

if (!ino) /* FSID or ALL */
@@ -366,10 +364,10 @@ void fs_layout_return(struct super_block *sb, struct inode *ino,

ret = sb->s_pnfs_op->layout_return(ino, &lrp->args);
dprintk("%s: inode %lu iomode=%d offset=0x%llx length=0x%llx "
- "cookie = %p flags 0x%x status=%d\n",
+ "cookie=%p status=%d\n",
__func__, ino->i_ino, lrp->args.lr_seg.iomode,
lrp->args.lr_seg.offset, lrp->args.lr_seg.length,
- recall_cookie, flags, ret);
+ recall_cookie, ret);
}

static u64
@@ -1077,7 +1075,7 @@ out:
nfs4_unlock_state();

/* call exported filesystem layout_return (ignore return-code) */
- fs_layout_return(sb, ino, lrp, 0, recall_cookie);
+ fs_layout_return(sb, ino, lrp, recall_cookie);

out_no_fs_call:
dprintk("pNFS %s: exit status %d\n", __func__, status);
@@ -1214,8 +1212,7 @@ nomatching_layout(struct nfs4_layoutrecall *clr)
recall_cookie = layoutrecall_done(clr);
spin_unlock(&layout_lock);

- fs_layout_return(clr->clr_sb, inode, &lr, LR_FLAG_INTERN,
- recall_cookie);
+ fs_layout_return(clr->clr_sb, inode, &lr, recall_cookie);
iput(inode);
}

@@ -1250,7 +1247,6 @@ void pnfsd_roc(struct nfs4_client *clp, struct nfs4_file *fp)
found = true;
dprintk("%s: fp=%p clp=%p: return on close", __func__, fp, clp);
fs_layout_return(fp->fi_inode->i_sb, fp->fi_inode, &lr,
- LR_FLAG_INTERN,
empty ? PNFS_LAST_LAYOUT_NO_RECALLS : NULL);
}
spin_unlock(&layout_lock);
@@ -1308,7 +1304,7 @@ void pnfs_expire_client(struct nfs4_client *clp)
dprintk("%s: inode %lu lp %p clp %p\n", __func__, inode->i_ino,
lp, clp);

- fs_layout_return(inode->i_sb, inode, &lr, LR_FLAG_EXPIRE,
+ fs_layout_return(inode->i_sb, inode, &lr,
empty ? PNFS_LAST_LAYOUT_NO_RECALLS : NULL);
iput(inode);
}
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 9db2c0b..6350337 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -460,14 +460,8 @@ struct nfsd4_pnfs_layoutcommit {
struct nfsd4_pnfs_layoutcommit_res res;
};

-enum layoutreturn_flags {
- LR_FLAG_INTERN = 1 << 0, /* internal return */
- LR_FLAG_EXPIRE = 1 << 1, /* return on client expiration */
-};
-
struct nfsd4_pnfs_layoutreturn {
struct nfsd4_pnfs_layoutreturn_arg args;
- u32 lr_flags;
stateid_t lr_sid; /* request/resopnse */
u32 lrs_present; /* response */
};
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:33:44

by Boaz Harrosh

Subject: [PATCH 01/10] Revert "pnfsd-exofs: Two clients must not write to the same RAID stripe"

This reverts commit c5c391c6f12e09a65e37ebe3e8c437d075d0befd.
---
fs/exofs/export.c | 48 ++++++------------------------------------------
1 file changed, 6 insertions(+), 42 deletions(-)

diff --git a/fs/exofs/export.c b/fs/exofs/export.c
index bc69073..a53f575 100644
--- a/fs/exofs/export.c
+++ b/fs/exofs/export.c
@@ -29,9 +29,6 @@

#include "linux/nfsd/pnfs_osd_xdr_srv.h"

-/* TODO: put in sysfs per sb */
-const static unsigned sb_shared_num_stripes = 8;
-
static int exofs_layout_type(struct super_block *sb)
{
return LAYOUT_OSD2_OBJECTS;
@@ -97,27 +94,14 @@ void ore_layout_2_pnfs_layout(struct pnfs_osd_layout *pl,
}
}

-static bool _align_io(struct ore_layout *layout, struct nfsd4_layout_seg *lseg,
- bool shared)
+static void _align_io(struct ore_layout *layout, u64 *offset, u64 *length)
{
u64 stripe_size = (layout->group_width - layout->parity) *
layout->stripe_unit;
u64 group_size = stripe_size * layout->group_depth;

- /* TODO: Don't ignore shared flag. Single writer can get a full group */
- if (lseg->iomode != IOMODE_READ &&
- (layout->parity || (layout->mirrors_p1 > 1))) {
- /* RAID writes */
- lseg->offset = div64_u64(lseg->offset, stripe_size) *
- stripe_size;
- lseg->length = stripe_size * sb_shared_num_stripes;
- return true;
- } else {
- /* reads or no data redundancy */
- lseg->offset = div64_u64(lseg->offset, group_size) * group_size;
- lseg->length = group_size;
- return false;
- }
+ *offset = div64_u64(*offset, group_size) * group_size;
+ *length = group_size;
}

static enum nfsstat4 exofs_layout_get(
@@ -132,41 +116,21 @@ static enum nfsstat4 exofs_layout_get(
struct pnfs_osd_layout layout;
__be32 *start;
unsigned i;
- bool in_recall, need_recall;
+ bool in_recall;
enum nfsstat4 nfserr;

EXOFS_DBGMSG("(0x%lx) REQUESTED offset=0x%llx len=0x%llx iomod=0x%x\n",
inode->i_ino, res->lg_seg.offset,
res->lg_seg.length, res->lg_seg.iomode);

- need_recall = _align_io(&sbi->layout, &res->lg_seg,
- test_bit(OBJ_LAYOUT_IS_GIVEN, &oi->i_flags));
+ _align_io(&sbi->layout, &res->lg_seg.offset, &res->lg_seg.length);
+ res->lg_seg.iomode = IOMODE_RW;
res->lg_return_on_close = true;

EXOFS_DBGMSG("(0x%lx) RETURNED offset=0x%llx len=0x%llx iomod=0x%x\n",
inode->i_ino, res->lg_seg.offset,
res->lg_seg.length, res->lg_seg.iomode);

- if (need_recall) {
- int rc = cb_layout_recall(inode, IOMODE_RW, res->lg_seg.offset,
- res->lg_seg.length, (void *)0x17);
- switch (rc) {
- case 0:
- case -EAGAIN:
- EXOFS_DBGMSG("(0x%lx) @@@ Sharing of RAID5/1 stripe\n",
- inode->i_ino);
- return NFS4ERR_RECALLCONFLICT;
- default:
- /* This is fine for now */
- /* TODO: Fence object off */
- EXOFS_DBGMSG("(0x%lx) !!!cb_layout_recall => %d\n",
- inode->i_ino, rc);
- /*fallthrough*/
- case -ENOENT:
- break;
- }
- }
-
/* skip opaque size, will be filled-in later */
start = exp_xdr_reserve_qwords(xdr, 1);
if (!start) {
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:39:27

by Boaz Harrosh

Subject: [PATCH 10/10] pnfsd-exofs: Two clients must not write to the same RAID stripe

If we have file redundancy RAID1/4/5/6 then two clients cannot
write to the same stripe/region.

We take care of this by giving out smaller regions of the file.
Before any layout_get we make sure to recall the same exact
region from any client. If a recall was issued we return
NFS4ERR_RECALLCONFLICT. The client will come again later for
its layout.

Meanwhile the first client can flush data and release the
layout. The next time the segment might be free and the
lo_get can succeed.

It is very possible that multiple writers will fight and
some clients will starve forever. But the smaller the
region, and if the clients randomize a wait, it should
statistically be OK.
(We could manage a fairness queue. What about a lo_available
notification?)

On the other hand a very small segment will hurt performance.
Default size is 8 stripes.
TODO: Let segment size be set in sysfs.

TODO:
For debugging we always give out small segments. But we should
only start giving out small segments on a shared file. The
first/single writer should get a large seg as before.
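
To make the segment arithmetic concrete, here is a user-space sketch of
the alignment rule described above, mirroring the logic of _align_io()
in this patch. struct layout_geom and struct seg are hypothetical
stand-ins for struct ore_layout and struct nfsd4_layout_seg, and plain
division replaces div64_u64().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct ore_layout used here. */
struct layout_geom {
	uint64_t stripe_unit;
	unsigned group_width;
	unsigned parity;
	unsigned mirrors_p1;
	uint64_t group_depth;
};

/* Hypothetical stand-in for struct nfsd4_layout_seg. */
struct seg {
	uint64_t offset;
	uint64_t length;
	bool write;	/* true for anything but a read-only layout */
};

enum { SHARED_NUM_STRIPES = 8 };	/* default from the patch */

/* Returns true when the caller must recall the region first. */
static bool align_seg(const struct layout_geom *g, struct seg *s)
{
	uint64_t stripe_size = (uint64_t)(g->group_width - g->parity) *
			       g->stripe_unit;
	uint64_t group_size = stripe_size * g->group_depth;

	assert(stripe_size != 0 && group_size != 0);

	if (s->write && (g->parity || g->mirrors_p1 > 1)) {
		/* RAID writes: small, stripe-aligned segment */
		s->offset = s->offset / stripe_size * stripe_size;
		s->length = stripe_size * SHARED_NUM_STRIPES;
		return true;
	}
	/* reads, or no data redundancy: a whole group */
	s->offset = s->offset / group_size * group_size;
	s->length = group_size;
	return false;
}
```

As a worked example under assumed geometry (stripe_unit 64KiB,
group_width 5, parity 1, group_depth 16), stripe_size is 262144 and
group_size is 4194304. A write request at offset 1000000 is aligned down
to 786432 with a length of 2097152 (8 stripes), while a read at the same
offset gets the whole first group, offset 0 and length 4194304.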

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/export.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/fs/exofs/export.c b/fs/exofs/export.c
index c5712f3..a1f112f 100644
--- a/fs/exofs/export.c
+++ b/fs/exofs/export.c
@@ -29,6 +29,9 @@

#include "linux/nfsd/pnfs_osd_xdr_srv.h"

+/* TODO: put in sysfs per sb */
+const static unsigned sb_shared_num_stripes = 8;
+
static int exofs_layout_type(struct super_block *sb)
{
return LAYOUT_OSD2_OBJECTS;
@@ -94,14 +97,27 @@ void ore_layout_2_pnfs_layout(struct pnfs_osd_layout *pl,
}
}

-static void _align_io(struct ore_layout *layout, u64 *offset, u64 *length)
+static bool _align_io(struct ore_layout *layout, struct nfsd4_layout_seg *lseg,
+ bool shared)
{
u64 stripe_size = (layout->group_width - layout->parity) *
layout->stripe_unit;
u64 group_size = stripe_size * layout->group_depth;

- *offset = div64_u64(*offset, group_size) * group_size;
- *length = group_size;
+ /* TODO: Don't ignore shared flag. Single writer can get a full group */
+ if (lseg->iomode != IOMODE_READ &&
+ (layout->parity || (layout->mirrors_p1 > 1))) {
+ /* RAID writes */
+ lseg->offset = div64_u64(lseg->offset, stripe_size) *
+ stripe_size;
+ lseg->length = stripe_size * sb_shared_num_stripes;
+ return true;
+ } else {
+ /* reads or no data redundancy */
+ lseg->offset = div64_u64(lseg->offset, group_size) * group_size;
+ lseg->length = group_size;
+ return false;
+ }
}

static enum nfsstat4 exofs_layout_get(
@@ -116,15 +132,15 @@ static enum nfsstat4 exofs_layout_get(
struct pnfs_osd_layout layout;
__be32 *start;
unsigned i;
- bool in_recall;
+ bool in_recall, need_recall;
enum nfsstat4 nfserr;

EXOFS_DBGMSG("(0x%lx) REQUESTED offset=0x%llx len=0x%llx iomod=0x%x\n",
inode->i_ino, res->lg_seg.offset,
res->lg_seg.length, res->lg_seg.iomode);

- _align_io(&sbi->layout, &res->lg_seg.offset, &res->lg_seg.length);
- res->lg_seg.iomode = IOMODE_RW;
+ need_recall = _align_io(&sbi->layout, &res->lg_seg,
+ test_bit(OBJ_LAYOUT_IS_GIVEN, &oi->i_flags));
res->lg_return_on_close = true;
res->lg_lo_cookie = inode; /* Just for debug prints */

@@ -132,6 +148,26 @@ static enum nfsstat4 exofs_layout_get(
inode->i_ino, res->lg_seg.offset,
res->lg_seg.length, res->lg_seg.iomode);

+ if (need_recall) {
+ int rc = cb_layout_recall(inode, IOMODE_RW, res->lg_seg.offset,
+ res->lg_seg.length, (void *)0x17);
+ switch (rc) {
+ case 0:
+ case -EAGAIN:
+ EXOFS_DBGMSG("(0x%lx) @@@ Sharing of RAID5/1 stripe\n",
+ inode->i_ino);
+ return NFS4ERR_RECALLCONFLICT;
+ default:
+ /* This is fine for now */
+ /* TODO: Fence object off */
+ EXOFS_DBGMSG("(0x%lx) !!!cb_layout_recall => %d\n",
+ inode->i_ino, rc);
+ /*fallthrough*/
+ case -ENOENT:
+ break;
+ }
+ }
+
/* skip opaque size, will be filled-in later */
start = exp_xdr_reserve_qwords(xdr, 1);
if (!start) {
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:38:48

by Boaz Harrosh

Subject: [PATCH 09/10] SQUASHME: pnfsd-exofs: Autologin support to get_device_info

From: Sachin Bhamare <[email protected]>

In exofs_get_device_info also send the URI string set by the
user-mode mounter associated with each device, to enable
autologin in the client.
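
The per-device URI is reached by going from the embedded ore device back
to its enclosing exofs device with container_of(). A minimal user-space
sketch of that pattern follows. The struct layouts are assumptions made
for illustration (only the member names ored, uri and urilen appear in
this patch), and dev_uri() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalent of the kernel macro, without type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Assumed shape of the embedded device; only `od` is used in the patch. */
struct ore_dev_sketch {
	void *od;
};

/* Assumed shape of the wrapper that carries the autologin URI. */
struct exofs_dev_sketch {
	struct ore_dev_sketch ored;
	unsigned urilen;
	const char *uri;
};

/* Hypothetical helper: recover the URI from a pointer to the inner device. */
static const char *dev_uri(struct ore_dev_sketch *ored, unsigned *len)
{
	struct exofs_dev_sketch *edev;

	assert(ored != NULL && len != NULL);
	edev = container_of(ored, struct exofs_dev_sketch, ored);
	*len = edev->urilen;
	return edev->uri;
}
```

This only works if every pointer handed to dev_uri() really is the ored
member of an exofs_dev_sketch. A standalone ore_dev_sketch would make the
pointer arithmetic land on unrelated memory.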

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/export.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/exofs/export.c b/fs/exofs/export.c
index 809fa19..c5712f3 100644
--- a/fs/exofs/export.c
+++ b/fs/exofs/export.c
@@ -324,6 +324,7 @@ int exofs_get_device_info(struct super_block *sb, struct exp_xdr_stream *xdr,
{
struct exofs_sb_info *sbi = sb->s_fs_info;
struct pnfs_osd_deviceaddr devaddr;
+ struct exofs_dev *edev;
const struct osd_dev_info *odi;
u64 devno = devid->devid;
__be32 *start;
@@ -337,7 +338,8 @@ int exofs_get_device_info(struct super_block *sb, struct exp_xdr_stream *xdr,
return -ENODEV;
}

- odi = osduld_device_info(sbi->oc.ods[devno]->od);
+ edev = container_of(sbi->oc.ods[devno], typeof(*edev), ored);
+ odi = osduld_device_info(edev->ored.od);

devaddr.oda_systemid.len = odi->systemid_len;
devaddr.oda_systemid.data = (void *)odi->systemid; /* !const cast */
@@ -345,6 +347,10 @@ int exofs_get_device_info(struct super_block *sb, struct exp_xdr_stream *xdr,
devaddr.oda_osdname.len = odi->osdname_len ;
devaddr.oda_osdname.data = (void *)odi->osdname;/* !const cast */

+ devaddr.oda_targetaddr.ota_available = OBJ_OTA_AVAILABLE;
+ devaddr.oda_targetaddr.ota_netaddr.r_addr.data = (void *)edev->uri;
+ devaddr.oda_targetaddr.ota_netaddr.r_addr.len = edev->urilen;
+
/* skip opaque size, will be filled-in later */
start = exp_xdr_reserve_qwords(xdr, 1);
if (!start) {
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:36:31

by Boaz Harrosh

Subject: [PATCH 05/10] {SPLITME} SQUASHME: pnfsd: Revamp the all layout_return operations

Cleanups:
- In nfs4_pnfs_return_layout, move local variables into the
code sections that use them, to make their scope easier to
understand. (And untangle the error handling)
- Lots of places had simulated layout_returns based on
lo_segments. They are all in a single place now.

Fixes:
Every code path that eventually called fs_layout_return
had different locking. Some held both the layout_spinlock
as well as the nfs-lock, some one or the other, some none.
Fix all sites to never hold any locks before calling
the FS.

Enhancements:
We change the code so there is now a one-to-one
relationship between a layout_gotten and a layout_returned.
Note that this was already the case in ROC and expire_client.
We now do the same for layouts explicitly returned in
nfs4_pnfs_return_layout and nomatching_layout.

Now, for each lo_segment received from the FS and added to
the layoutDB, we send a corresponding layout_return when
the lo_segment is removed from the DB (and before
it is destroyed, releasing the inode ref).

An FS can now attach an *lo_cookie* to each lo_segment,
and before this lo_segment is forever released this
lo_cookie is layout_returned to the FS. An FS can have
resources associated with each lo_segment, and when
the client is completely done with that lo_segment
it can now release those resources. (Eliminating the
need for the FS to keep its own lo_lists)

If the client sent a partial lo_return we do report
such a partial lo_return to the FS, and as before, we adjust
our lo_DB to only hold the remainder of the lo_segment.
The final lo_return that has removed the lo_segment from
the DB will pass the lo_cookie to the FS, denoting closure.

If one lo_return spans multiple lo_segments, as can happen
for example in nfs4_pnfs_return_layout or nomatching_layout,
the FS is called multiple times, once for each lo_segment
released, together with its lo_cookie.

And finally, like today, if a return satisfies a recall,
before that recall is released the recall_cookie is also
passed back to the FS, attached to the last segment
matching the recall. (If because of races the recall was
actually empty, a special lo_return is sent with just the
recall_cookie)

We also have a new flag in layout_return that tells the
FS that the lo_list is *empty*. It is attached to the very
last lo returned.

This new system makes the code surprisingly smaller and
cleaner, because all code sites look the same and use
common code. (And it can be done even better, if some of
the lo_return and lo_get API gets united a bit)

And probably some more stuff ...
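
The return rules above can be summarized with a small user-space sketch.
It is not the kernel code: struct lo_sketch, fs_return_fn,
return_lo_list() and the recording callback are hypothetical names, a
singly linked list replaces list_head, and all locking is left out. (In
the patch the segments are dequeued under layout_lock and the FS is only
called afterwards, with no locks held.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One dequeued layout segment waiting to be returned to the FS. */
struct lo_sketch {
	struct lo_sketch *next;
	void *lo_cookie;	/* FS-private, handed out at layout_get */
};

/* Hypothetical stand-in for the FS layout_return operation. */
typedef void (*fs_return_fn)(void *lo_cookie, void *recall_cookie,
			     bool empty, void *ctx);

/* Example callback that only records what the FS was told. */
struct fs_record {
	unsigned calls;
	void *last_lo_cookie;
	void *last_recall_cookie;
	bool last_empty;
};

static void record_return(void *lo_cookie, void *recall_cookie,
			  bool empty, void *ctx)
{
	struct fs_record *rec = ctx;

	rec->calls++;
	rec->last_lo_cookie = lo_cookie;
	rec->last_recall_cookie = recall_cookie;
	rec->last_empty = empty;
}

/* Returns every segment on the list; gives the number of FS calls. */
static unsigned return_lo_list(struct lo_sketch *head, void *recall_cookie,
			       fs_return_fn fs_return, void *ctx)
{
	unsigned calls = 0;

	assert(fs_return != NULL);

	if (!head) {
		/* Recall raced with returns: report just the recall_cookie */
		if (recall_cookie) {
			fs_return(NULL, recall_cookie, true, ctx);
			calls++;
		}
		return calls;
	}

	while (head) {
		struct lo_sketch *lo = head;
		bool empty;

		head = lo->next;	/* unlink before calling the FS */
		lo->next = NULL;
		empty = (head == NULL);

		/* recall_cookie rides only on the very last segment */
		fs_return(lo->lo_cookie, empty ? recall_cookie : NULL,
			  empty, ctx);
		calls++;
	}
	return calls;
}
```

Each segment yields exactly one call carrying its own lo_cookie, the
last call also carries the recall_cookie and the empty flag, and an
empty list with a pending recall_cookie still produces one call so the
FS is not left waiting.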

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfsd/nfs4pnfsd.c | 200 +++++++++++++++++++++++-----------------
fs/nfsd/pnfsd.h | 1 +
include/linux/nfsd/nfsd4_pnfs.h | 3 +
3 files changed, 117 insertions(+), 87 deletions(-)

diff --git a/fs/nfsd/nfs4pnfsd.c b/fs/nfsd/nfs4pnfsd.c
index 5228e3b..e0ad1d7 100644
--- a/fs/nfsd/nfs4pnfsd.c
+++ b/fs/nfsd/nfs4pnfsd.c
@@ -304,13 +304,14 @@ init_layout(struct nfs4_layout *lp,
struct nfsd4_pnfs_layoutget *lgp,
struct nfsd4_pnfs_layoutget_res *res)
{
- dprintk("pNFS %s: lp %p ls %p clp %p fp %p ino %p\n", __func__,
- lp, ls, clp, fp, fp->fi_inode);
+ dprintk("pNFS %s: lp %p ls %p clp %p fp %p ino %p lo_cookie %p\n",
+ __func__, lp, ls, clp, fp, fp->fi_inode, res->lg_lo_cookie);

get_nfs4_file(fp);
lp->lo_client = clp;
lp->lo_file = fp;
- memcpy(&lp->lo_seg, &res->lg_seg, sizeof(lp->lo_seg));
+ lp->lo_seg = res->lg_seg;
+ lp->lo_cookie = res->lg_lo_cookie;
get_layout_state(ls); /* put on destroy_layout */
lp->lo_state = ls;
update_layout_stateid(ls, &lgp->lg_sid);
@@ -349,25 +350,25 @@ destroy_layout(struct nfs4_layout *lp)
put_nfs4_file(fp);
}

-void fs_layout_return(struct super_block *sb, struct inode *ino,
- struct nfsd4_pnfs_layoutreturn *lrp, void *recall_cookie)
+void fs_layout_return(struct inode *ino, struct nfsd4_pnfs_layoutreturn *lrp,
+ bool empty, void *recall_cookie, void *lo_cookie)
{
+ struct super_block *sb = ino->i_sb;
int ret;

if (unlikely(!sb->s_pnfs_op->layout_return))
return;

lrp->args.lr_cookie = recall_cookie;
-
- if (!ino) /* FSID or ALL */
- ino = sb->s_root->d_inode;
+ lrp->args.lr_lo_cookie = lo_cookie;
+ lrp->args.lr_empty = empty;

ret = sb->s_pnfs_op->layout_return(ino, &lrp->args);
dprintk("%s: inode %lu iomode=%d offset=0x%llx length=0x%llx "
- "cookie=%p status=%d\n",
+ "cookie=%p empty=%x status=%d\n",
__func__, ino->i_ino, lrp->args.lr_seg.iomode,
lrp->args.lr_seg.offset, lrp->args.lr_seg.length,
- recall_cookie, ret);
+ recall_cookie, empty, ret);
}

static u64
@@ -880,10 +881,65 @@ out:
dprintk("%s:End lo %llu:%lld\n", __func__, lo->offset, lo->length);
}

+static void
+pnfsd_return_lo_list(struct list_head *lo_destroy_list, struct inode *ino_orig,
+ struct nfsd4_pnfs_layoutreturn *lr_orig, void *cb_cookie)
+{
+ struct nfs4_layout *lo, *nextlp;
+
+ if (list_empty(lo_destroy_list) && cb_cookie) {
+ /* This is a rare race case where at the time of recall there
+ * were some layouts, which got freed before the recall-done.
+ * and the recall was left without any actual layouts to free.
+ * If the FS gave us a cb_cookie it is waiting for it so we
+ * report back about it here.
+ */
+
+ /* Any caller with a cb_cookie must pass a none null
+ * ino_orig and an lr_orig.
+ */
+ struct inode *inode = igrab(ino_orig);
+
+ /* Probably a BUG_ON but just in case. Caller of cb_recall must
+ * take care of this. Please report to ml */
+ if (WARN_ON(!inode))
+ return;
+
+ fs_layout_return(inode, lr_orig, true, cb_cookie, NULL);
+ iput(inode);
+ return;
+ }
+
+ list_for_each_entry_safe(lo, nextlp, lo_destroy_list, lo_perfile) {
+ struct inode *inode = lo->lo_file->fi_inode;
+ struct nfsd4_pnfs_layoutreturn lr;
+ bool empty;
+
+ memset(&lr, 0, sizeof(lr));
+ lr.args.lr_return_type = RETURN_FILE;
+ lr.args.lr_seg = lo->lo_seg;
+
+ list_del(&lo->lo_perfile);
+ empty = list_empty(lo_destroy_list);
+
+ fs_layout_return(inode, &lr, empty, empty ? cb_cookie : NULL,
+ lo->lo_cookie);
+
+ /* FIXME: A comment at destroy_layout says we need layout_lock
+ * But is that true? dequeue_layout was done under lock
+ * Must we lock for destroy_layout?
+ */
+ spin_lock(&layout_lock);
+ destroy_layout(lo); /* this will put the lo_file */
+ spin_unlock(&layout_lock);
+ }
+}
+
static int
pnfs_return_file_layouts(struct nfs4_client *clp, struct nfs4_file *fp,
struct nfsd4_pnfs_layoutreturn *lrp,
- struct nfs4_layout_state *ls)
+ struct nfs4_layout_state *ls,
+ struct list_head *lo_destroy_list)
{
int layouts_found = 0;
struct nfs4_layout *lp, *nextlp;
@@ -908,7 +964,7 @@ pnfs_return_file_layouts(struct nfs4_client *clp, struct nfs4_file *fp,
if (!lp->lo_seg.length) {
lrp->lrs_present = 0;
dequeue_layout(lp);
- destroy_layout(lp);
+ list_add_tail(&lp->lo_perfile, lo_destroy_list);
}
}
if (ls && layouts_found && lrp->lrs_present)
@@ -920,7 +976,8 @@ pnfs_return_file_layouts(struct nfs4_client *clp, struct nfs4_file *fp,

static int
pnfs_return_client_layouts(struct nfs4_client *clp,
- struct nfsd4_pnfs_layoutreturn *lrp, u64 ex_fsid)
+ struct nfsd4_pnfs_layoutreturn *lrp, u64 ex_fsid,
+ struct list_head *lo_destroy_list)
{
int layouts_found = 0;
struct nfs4_layout *lp, *nextlp;
@@ -938,7 +995,7 @@ pnfs_return_client_layouts(struct nfs4_client *clp,

layouts_found++;
dequeue_layout(lp);
- destroy_layout(lp);
+ list_add_tail(&lp->lo_perfile, lo_destroy_list);
}
spin_unlock(&layout_lock);

@@ -1000,8 +1057,8 @@ int nfs4_pnfs_return_layout(struct super_block *sb, struct svc_fh *current_fh,
struct inode *ino = current_fh->fh_dentry->d_inode;
struct nfs4_file *fp = NULL;
struct nfs4_client *clp;
- struct nfs4_layout_state *ls = NULL;
struct nfs4_layoutrecall *clr, *nextclr;
+ LIST_HEAD(lo_destroy_list);
u64 ex_fsid = current_fh->fh_export->ex_fsid;
void *recall_cookie = NULL;

@@ -1013,6 +1070,8 @@ int nfs4_pnfs_return_layout(struct super_block *sb, struct svc_fh *current_fh,
goto out;

if (lrp->args.lr_return_type == RETURN_FILE) {
+ struct nfs4_layout_state *ls = NULL;
+
fp = find_file(ino);
if (!fp) {
nfs4_unlock_state();
@@ -1023,7 +1082,7 @@ int nfs4_pnfs_return_layout(struct super_block *sb, struct svc_fh *current_fh,
* don't then it means all layouts were ROC and at this
* point we returned all of them on file close.
*/
- goto out_no_fs_call;
+ goto out_see_about_recalls;
}

/* Check the stateid */
@@ -1033,12 +1092,12 @@ int nfs4_pnfs_return_layout(struct super_block *sb, struct svc_fh *current_fh,
goto out_put_file;

/* update layouts */
- layouts_found = pnfs_return_file_layouts(clp, fp, lrp, ls);
- /* optimize for the all-empty case */
- if (list_empty(&fp->fi_layouts))
- recall_cookie = PNFS_LAST_LAYOUT_NO_RECALLS;
+ layouts_found = pnfs_return_file_layouts(clp, fp, lrp, ls,
+ &lo_destroy_list);
+ put_layout_state(ls);
} else {
- layouts_found = pnfs_return_client_layouts(clp, lrp, ex_fsid);
+ layouts_found = pnfs_return_client_layouts(clp, lrp, ex_fsid,
+ &lo_destroy_list);
}

dprintk("pNFS %s: clp %p fp %p layout_type 0x%x iomode %d "
@@ -1049,6 +1108,7 @@ int nfs4_pnfs_return_layout(struct super_block *sb, struct svc_fh *current_fh,
ex_fsid,
lrp->args.lr_seg.offset, lrp->args.lr_seg.length, layouts_found);

+out_see_about_recalls:
/* update layoutrecalls
* note: for RETURN_{FSID,ALL}, fp may be NULL
*/
@@ -1066,18 +1126,15 @@ int nfs4_pnfs_return_layout(struct super_block *sb, struct svc_fh *current_fh,
}
spin_unlock(&layout_lock);

+ pnfsd_return_lo_list(&lo_destroy_list, ino ? ino : sb->s_root->d_inode,
+ lrp, recall_cookie);
+
out_put_file:
if (fp)
put_nfs4_file(fp);
- if (ls)
- put_layout_state(ls);
out:
nfs4_unlock_state();

- /* call exported filesystem layout_return (ignore return-code) */
- fs_layout_return(sb, ino, lrp, recall_cookie);
-
-out_no_fs_call:
dprintk("pNFS %s: exit status %d\n", __func__, status);
return status;
}
@@ -1188,32 +1245,26 @@ nomatching_layout(struct nfs4_layoutrecall *clr)
.args.lr_seg = clr->cb.cbl_seg,
};
struct inode *inode;
+ LIST_HEAD(lo_destroy_list);
void *recall_cookie;

- if (clr->clr_file) {
- inode = igrab(clr->clr_file->fi_inode);
- if (WARN_ON(!inode))
- return;
- } else {
- inode = NULL;
- }
-
dprintk("%s: clp %p fp %p: simulating layout_return\n", __func__,
clr->clr_client, clr->clr_file);

if (clr->cb.cbl_recall_type == RETURN_FILE)
pnfs_return_file_layouts(clr->clr_client, clr->clr_file, &lr,
- NULL);
+ NULL, &lo_destroy_list);
else
pnfs_return_client_layouts(clr->clr_client, &lr,
- clr->cb.cbl_fsid.major);
+ clr->cb.cbl_fsid.major,
+ &lo_destroy_list);

spin_lock(&layout_lock);
recall_cookie = layoutrecall_done(clr);
spin_unlock(&layout_lock);

- fs_layout_return(clr->clr_sb, inode, &lr, recall_cookie);
- iput(inode);
+ inode = clr->clr_file->fi_inode ?: clr->clr_sb->s_root->d_inode;
+ pnfsd_return_lo_list(&lo_destroy_list, inode, &lr, recall_cookie);
}

/* Return On Close:
@@ -1224,38 +1275,35 @@ nomatching_layout(struct nfs4_layoutrecall *clr)
void pnfsd_roc(struct nfs4_client *clp, struct nfs4_file *fp)
{
struct nfs4_layout *lo, *nextlp;
- bool found = false;
+ LIST_HEAD(lo_destroy_list);
+
+
+ /* TODO: We need to also free layout recalls like pnfs_expire_client */
+ dprintk("%s: clp %p fp %p: simulating layout_return\n", __func__,
+ clp, fp);

dprintk("%s: fp=%p clp=%p", __func__, fp, clp);
spin_lock(&layout_lock);
list_for_each_entry_safe (lo, nextlp, &fp->fi_layouts, lo_perfile) {
- struct nfsd4_pnfs_layoutreturn lr;
- bool empty;

/* Check for a match */
if (!lo->lo_state->ls_roc || lo->lo_client != clp)
continue;

- /* Return the layout */
- memset(&lr, 0, sizeof(lr));
- lr.args.lr_return_type = RETURN_FILE;
- lr.args.lr_seg = lo->lo_seg;
+ /* Mark layout for return */
dequeue_layout(lo);
- destroy_layout(lo); /* do not access lp after this */
-
- empty = list_empty(&fp->fi_layouts);
- found = true;
- dprintk("%s: fp=%p clp=%p: return on close", __func__, fp, clp);
- fs_layout_return(fp->fi_inode->i_sb, fp->fi_inode, &lr,
- empty ? PNFS_LAST_LAYOUT_NO_RECALLS : NULL);
+ list_add_tail(&lo->lo_perfile, &lo_destroy_list);
}
spin_unlock(&layout_lock);
- if (!found)
- dprintk("%s: no layout found", __func__);
+
+ pnfsd_return_lo_list(&lo_destroy_list, NULL, NULL, NULL);
}

void pnfs_expire_client(struct nfs4_client *clp)
{
+ struct nfs4_layout *lo, *nextlo;
+ LIST_HEAD(lo_destroy_list);
+
for (;;) {
struct nfs4_layoutrecall *lrp = NULL;

@@ -1275,39 +1323,17 @@ void pnfs_expire_client(struct nfs4_client *clp)
put_layoutrecall(lrp);
}

- for (;;) {
- struct nfs4_layout *lp = NULL;
- struct inode *inode = NULL;
- struct nfsd4_pnfs_layoutreturn lr;
- bool empty = false;
-
- spin_lock(&layout_lock);
- if (!list_empty(&clp->cl_layouts)) {
- lp = list_entry(clp->cl_layouts.next,
- struct nfs4_layout, lo_perclnt);
- inode = igrab(lp->lo_file->fi_inode);
- memset(&lr, 0, sizeof(lr));
- lr.args.lr_return_type = RETURN_FILE;
- lr.args.lr_seg = lp->lo_seg;
- empty = list_empty(&lp->lo_file->fi_layouts);
- BUG_ON(lp->lo_client != clp);
- dequeue_layout(lp);
- destroy_layout(lp); /* do not access lp after this */
- }
- spin_unlock(&layout_lock);
- if (!lp)
- break;
-
- if (WARN_ON(!inode))
- break;
-
- dprintk("%s: inode %lu lp %p clp %p\n", __func__, inode->i_ino,
- lp, clp);
-
- fs_layout_return(inode->i_sb, inode, &lr,
- empty ? PNFS_LAST_LAYOUT_NO_RECALLS : NULL);
- iput(inode);
+ spin_lock(&layout_lock);
+ list_for_each_entry_safe(lo, nextlo, &clp->cl_layouts, lo_perclnt) {
+ BUG_ON(lo->lo_client != clp);
+ dequeue_layout(lo);
+ list_add_tail(&lo->lo_perfile, &lo_destroy_list);
+ dprintk("%s: inode %lu lp %p clp %p\n", __func__,
+ lo->lo_file->fi_inode->i_ino, lo, clp);
}
+ spin_unlock(&layout_lock);
+
+ pnfsd_return_lo_list(&lo_destroy_list, NULL, NULL, NULL);
}

struct create_recall_list_arg {
diff --git a/fs/nfsd/pnfsd.h b/fs/nfsd/pnfsd.h
index 35859ff..53ed6f1 100644
--- a/fs/nfsd/pnfsd.h
+++ b/fs/nfsd/pnfsd.h
@@ -56,6 +56,7 @@ struct nfs4_layout {
struct nfs4_client *lo_client;
struct nfs4_layout_state *lo_state;
struct nfsd4_layout_seg lo_seg;
+ void *lo_cookie;
};

struct pnfs_inval_state {
diff --git a/include/linux/nfsd/nfsd4_pnfs.h b/include/linux/nfsd/nfsd4_pnfs.h
index 8d3d384..a35b93e 100644
--- a/include/linux/nfsd/nfsd4_pnfs.h
+++ b/include/linux/nfsd/nfsd4_pnfs.h
@@ -93,6 +93,7 @@ struct nfsd4_pnfs_layoutget_arg {
struct nfsd4_pnfs_layoutget_res {
struct nfsd4_layout_seg lg_seg; /* request/resopnse */
u32 lg_return_on_close;
+ void *lg_lo_cookie; /* fs private */
};

struct nfsd4_pnfs_layoutcommit_arg {
@@ -119,6 +120,8 @@ struct nfsd4_pnfs_layoutreturn_arg {
u32 lrf_body_len; /* request */
void *lrf_body; /* request */
void *lr_cookie; /* fs private */
+ void *lr_lo_cookie; /* fs private */
+ bool lr_empty; /* request */
};

/* pNFS Metadata to Data server state communication */
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:34:23

by Boaz Harrosh

Subject: [PATCH 02/10] Revert "pnfsd-exofs: Add autologin support to exofs"

This reverts commit 0157f33be71f4607021c595743e5454031319111.
The version in Benny's tree needs a better implementation; it is
reintroduced later in this set.

Boaz
---
fs/exofs/export.c | 8 +-------
fs/exportfs/pnfs_osd_xdr_srv.c | 45 +++---------------------------------------
include/linux/pnfs_osd_xdr.h | 5 -----
3 files changed, 4 insertions(+), 54 deletions(-)

diff --git a/fs/exofs/export.c b/fs/exofs/export.c
index a53f575..621bd11 100644
--- a/fs/exofs/export.c
+++ b/fs/exofs/export.c
@@ -321,7 +321,6 @@ int exofs_get_device_info(struct super_block *sb, struct exp_xdr_stream *xdr,
{
struct exofs_sb_info *sbi = sb->s_fs_info;
struct pnfs_osd_deviceaddr devaddr;
- struct exofs_dev *edev;
const struct osd_dev_info *odi;
u64 devno = devid->devid;
__be32 *start;
@@ -335,8 +334,7 @@ int exofs_get_device_info(struct super_block *sb, struct exp_xdr_stream *xdr,
return -ENODEV;
}

- edev = container_of(sbi->oc.ods[devno], typeof(*edev), ored);
- odi = osduld_device_info(edev->ored.od);
+ odi = osduld_device_info(sbi->oc.ods[devno]->od);

devaddr.oda_systemid.len = odi->systemid_len;
devaddr.oda_systemid.data = (void *)odi->systemid; /* !const cast */
@@ -344,10 +342,6 @@ int exofs_get_device_info(struct super_block *sb, struct exp_xdr_stream *xdr,
devaddr.oda_osdname.len = odi->osdname_len ;
devaddr.oda_osdname.data = (void *)odi->osdname;/* !const cast */

- devaddr.oda_targetaddr.ota_available = OBJ_OTA_AVAILABLE;
- devaddr.oda_targetaddr.ota_netaddr.r_addr.data = (void *)edev->uri;
- devaddr.oda_targetaddr.ota_netaddr.r_addr.len = edev->urilen;
-
/* skip opaque size, will be filled-in later */
start = exp_xdr_reserve_qwords(xdr, 1);
if (!start) {
diff --git a/fs/exportfs/pnfs_osd_xdr_srv.c b/fs/exportfs/pnfs_osd_xdr_srv.c
index 04a3681..35b3d32 100644
--- a/fs/exportfs/pnfs_osd_xdr_srv.c
+++ b/fs/exportfs/pnfs_osd_xdr_srv.c
@@ -178,42 +178,6 @@ static enum nfsstat4 _encode_string(struct exp_xdr_stream *xdr,
return 0;
}

-/* struct pnfs_osd_targetaddr {
- * u32 ota_available;
- * struct pnfs_osd_net_addr ota_netaddr;
- * };
- */
-static inline enum nfsstat4 pnfs_osd_xdr_encode_targetaddr(
- struct exp_xdr_stream *xdr,
- struct pnfs_osd_targetaddr *taddr)
-{
- __be32 *p;
-
- /* ota_available */
- p = exp_xdr_reserve_space(xdr, 4);
- if (!p)
- return NFS4ERR_TOOSMALL;
- p = exp_xdr_encode_u32(p, taddr->ota_available);
-
- /* encode r_netid */
- p = exp_xdr_reserve_space(xdr, 4 + taddr->ota_netaddr.r_netid.len);
- if (!p)
- return NFS4ERR_TOOSMALL;
-
- p = exp_xdr_encode_opaque(p,
- taddr->ota_netaddr.r_netid.data,
- taddr->ota_netaddr.r_netid.len);
-
- /* encode r_addr */
- p = exp_xdr_reserve_space(xdr, 4 + taddr->ota_netaddr.r_addr.len);
- if (!p)
- return NFS4ERR_TOOSMALL;
- p = exp_xdr_encode_opaque(p,
- taddr->ota_netaddr.r_addr.data,
- taddr->ota_netaddr.r_addr.len);
- return 0;
-}
-
/* struct pnfs_osd_deviceaddr {
* struct pnfs_osd_targetid oda_targetid;
* struct pnfs_osd_targetaddr oda_targetaddr;
@@ -229,20 +193,17 @@ enum nfsstat4 pnfs_osd_xdr_encode_deviceaddr(
__be32 *p;
enum nfsstat4 err;

- p = exp_xdr_reserve_space(xdr, sizeof(u32));
+ p = exp_xdr_reserve_space(xdr, 4 + 4 + sizeof(devaddr->oda_lun));
if (!p)
return NFS4ERR_TOOSMALL;

/* Empty oda_targetid */
p = exp_xdr_encode_u32(p, OBJ_TARGET_ANON);

- /* oda_targetaddr */
- err = pnfs_osd_xdr_encode_targetaddr(xdr, &devaddr->oda_targetaddr);
- if (err)
- return err;
+ /* Empty oda_targetaddr for now */
+ p = exp_xdr_encode_u32(p, 0);

/* oda_lun */
- p = exp_xdr_reserve_space(xdr, sizeof(devaddr->oda_lun));
exp_xdr_encode_bytes(p, devaddr->oda_lun, sizeof(devaddr->oda_lun));

err = _encode_string(xdr, &devaddr->oda_systemid);
diff --git a/include/linux/pnfs_osd_xdr.h b/include/linux/pnfs_osd_xdr.h
index 3aab6e2..435dd5f 100644
--- a/include/linux/pnfs_osd_xdr.h
+++ b/include/linux/pnfs_osd_xdr.h
@@ -148,11 +148,6 @@ enum pnfs_osd_targetid_type {
OBJ_TARGET_SCSI_DEVICE_ID = 3,
};

-enum pnfs_osd_target_ota {
- OBJ_OTA_UNAVAILABLE = 0,
- OBJ_OTA_AVAILABLE = 1,
-};
-
/* union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) {
* case OBJ_TARGET_SCSI_NAME:
* string oti_scsi_name<>;
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:37:12

by Boaz Harrosh

Subject: [PATCH 06/10] SQUASHME: pnfsd: layout_return API changes

The layout_return API is changed to fix member naming
(lr_cookie becomes lr_cb_cookie), and all users are fixed.

Exofs is not yet using the new lo_cookie, only
the lr_empty indication.
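
As a rough illustration of the filesystem side after this change, here is a
minimal userspace sketch. The struct layouts and the helper name
handle_layout_return() are hypothetical stand-ins, not the exofs code; only
the member names lr_cb_cookie, lr_lo_cookie and lr_empty come from the patch.
The assumption is that the "layout is given" flag is cleared only when the
server reports that no layouts remain, and that the return value tells the
caller whether a recall waiter exists.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of struct nfsd4_pnfs_layoutreturn_arg. */
struct layoutreturn_args {
	void *lr_cb_cookie;	/* set when the return answers a recall */
	void *lr_lo_cookie;	/* fs private, handed out at layout_get */
	bool lr_empty;		/* no layouts remain on the file */
};

/* Hypothetical per-inode state, standing in for the exofs i_flags bits. */
struct inode_state {
	bool layout_is_given;
	bool in_recall;
};

/*
 * Returns true when a recall was in progress and this return is relevant
 * to it (assumption: the caller would then wake the waiter).
 */
static bool handle_layout_return(struct inode_state *st,
				 const struct layoutreturn_args *args)
{
	if (!args->lr_cb_cookie && !args->lr_empty)
		return false;	/* plain partial return, nothing to update */

	if (args->lr_empty)
		st->layout_is_given = false;	/* last layout is gone */

	return st->in_recall;
}
```

The point of the split is that a recall-driven return (lr_cb_cookie set) no
longer implies that the file has no layouts left; only lr_empty does.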

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/export.c | 11 +++++++----
fs/nfsd/nfs4pnfsd.c | 2 +-
include/linux/nfsd/nfsd4_pnfs.h | 2 +-
3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/exofs/export.c b/fs/exofs/export.c
index 621bd11..809fa19 100644
--- a/fs/exofs/export.c
+++ b/fs/exofs/export.c
@@ -126,6 +126,7 @@ static enum nfsstat4 exofs_layout_get(
_align_io(&sbi->layout, &res->lg_seg.offset, &res->lg_seg.length);
res->lg_seg.iomode = IOMODE_RW;
res->lg_return_on_close = true;
+ res->lg_lo_cookie = inode; /* Just for debug prints */

EXOFS_DBGMSG("(0x%lx) RETURNED offset=0x%llx len=0x%llx iomod=0x%x\n",
inode->i_ino, res->lg_seg.offset,
@@ -292,19 +293,21 @@ static int exofs_layout_return(
};
struct pnfs_osd_ioerr ioerr;

- EXOFS_DBGMSG("(0x%lx) cookie %p body_len %d\n",
- inode->i_ino, args->lr_cookie, args->lrf_body_len);
+ EXOFS_DBGMSG("(0x%lx) lo_cookie=%p cb_cookie=%p empty=%d body_len %d\n",
+ inode->i_ino, args->lr_lo_cookie, args->lr_cb_cookie,
+ args->lr_empty, args->lrf_body_len);

while (pnfs_osd_xdr_decode_ioerr(&ioerr, &xdr))
exofs_handle_error(&ioerr);

- if (args->lr_cookie) {
+ if (args->lr_cb_cookie || args->lr_empty) {
struct exofs_i_info *oi = exofs_i(inode);
bool in_recall;

spin_lock(&oi->i_layout_lock);
in_recall = test_bit(OBJ_IN_LAYOUT_RECALL, &oi->i_flags);
- __clear_bit(OBJ_LAYOUT_IS_GIVEN, &oi->i_flags);
+ if (args->lr_empty)
+ __clear_bit(OBJ_LAYOUT_IS_GIVEN, &oi->i_flags);
spin_unlock(&oi->i_layout_lock);

/* TODO: how to communicate cookie with the waiter */
diff --git a/fs/nfsd/nfs4pnfsd.c b/fs/nfsd/nfs4pnfsd.c
index e0ad1d7..e8e7709 100644
--- a/fs/nfsd/nfs4pnfsd.c
+++ b/fs/nfsd/nfs4pnfsd.c
@@ -359,7 +359,7 @@ void fs_layout_return(struct inode *ino, struct nfsd4_pnfs_layoutreturn *lrp,
if (unlikely(!sb->s_pnfs_op->layout_return))
return;

- lrp->args.lr_cookie = recall_cookie;
+ lrp->args.lr_cb_cookie = recall_cookie;
lrp->args.lr_lo_cookie = lo_cookie;
lrp->args.lr_empty = empty;

diff --git a/include/linux/nfsd/nfsd4_pnfs.h b/include/linux/nfsd/nfsd4_pnfs.h
index a35b93e..6bd03f9 100644
--- a/include/linux/nfsd/nfsd4_pnfs.h
+++ b/include/linux/nfsd/nfsd4_pnfs.h
@@ -119,7 +119,7 @@ struct nfsd4_pnfs_layoutreturn_arg {
u32 lr_reclaim; /* request */
u32 lrf_body_len; /* request */
void *lrf_body; /* request */
- void *lr_cookie; /* fs private */
+ void *lr_cb_cookie; /* fs private */
void *lr_lo_cookie; /* fs private */
bool lr_empty; /* request */
};
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:35:09

by Boaz Harrosh

Subject: [PATCH 03/10] SQUASHME: pnfsd: Pass less arguments to init_layout()

Instead of passing all parameters individually, one of which
was unused, pass the structures they originate from.
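
The shape of the change can be sketched in plain C as below. All types here
are simplified, hypothetical stand-ins for the nfsd structures, and
init_layout_sketch() is not the real init_layout(); in particular, what it
does with the stateid is only a placeholder. It shows the callee picking the
fields it needs out of the request and result structures.

```c
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical stand-ins for the nfsd types. */
struct seg { uint32_t iomode; uint64_t offset, length; };
struct stateid { uint32_t generation; };

struct layoutget_args { struct stateid lg_sid; };	/* request side */
struct layoutget_res { struct seg lg_seg; };		/* result side */

struct layout {
	struct seg lo_seg;
	struct stateid lo_sid;
};

/*
 * The callee receives the two structures and extracts what it needs,
 * so call sites do not change when another field becomes relevant.
 */
static void init_layout_sketch(struct layout *lp,
			       const struct layoutget_args *lgp,
			       const struct layoutget_res *res)
{
	memcpy(&lp->lo_seg, &res->lg_seg, sizeof(lp->lo_seg));
	lp->lo_sid = lgp->lg_sid;	/* placeholder for the stateid update */
}
```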

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfsd/nfs4pnfsd.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/nfs4pnfsd.c b/fs/nfsd/nfs4pnfsd.c
index 509b260..f0e193a 100644
--- a/fs/nfsd/nfs4pnfsd.c
+++ b/fs/nfsd/nfs4pnfsd.c
@@ -301,9 +301,8 @@ init_layout(struct nfs4_layout *lp,
struct nfs4_layout_state *ls,
struct nfs4_file *fp,
struct nfs4_client *clp,
- struct svc_fh *current_fh,
- struct nfsd4_layout_seg *seg,
- stateid_t *stateid)
+ struct nfsd4_pnfs_layoutget *lgp,
+ struct nfsd4_pnfs_layoutget_res *res)
{
dprintk("pNFS %s: lp %p ls %p clp %p fp %p ino %p\n", __func__,
lp, ls, clp, fp, fp->fi_inode);
@@ -311,10 +310,10 @@ init_layout(struct nfs4_layout *lp,
get_nfs4_file(fp);
lp->lo_client = clp;
lp->lo_file = fp;
- memcpy(&lp->lo_seg, seg, sizeof(lp->lo_seg));
+ memcpy(&lp->lo_seg, &res->lg_seg, sizeof(lp->lo_seg));
get_layout_state(ls); /* put on destroy_layout */
lp->lo_state = ls;
- update_layout_stateid(ls, stateid);
+ update_layout_stateid(ls, &lgp->lg_sid);
list_add_tail(&lp->lo_perclnt, &clp->cl_layouts);
list_add_tail(&lp->lo_perfile, &fp->fi_layouts);
dprintk("pNFS %s end\n", __func__);
@@ -829,7 +828,7 @@ nfs4_pnfs_get_layout(struct nfsd4_pnfs_layoutget *lgp,
goto out_freelayout;

/* Can't merge, so let's initialize this new layout */
- init_layout(lp, ls, fp, clp, lgp->lg_fhp, &res.lg_seg, &lgp->lg_sid);
+ init_layout(lp, ls, fp, clp, lgp, &res);
out_unlock:
if (ls)
put_layout_state(ls);
--
1.7.10.2.677.gb6bc67f

2012-09-13 23:37:47

by Boaz Harrosh

Subject: [PATCH 07/10] SQUASHME: pnfsd: Something very wrong with layout_recall(RETURN_FILE)

In patch:
pnfsd: layout recall layout state

cl_has_file_layout() no longer inspects the layout structures added per file;
it inspects whether the file has a layout_state.

So it is counting layout_states and not layouts.

This is bad because the layout_state is added to the file before the call to
the filesystem. If the FS does a recall, nfsd is confused into thinking it
already has a layout and issues a recall, instead of returning -ENOENT, i.e.
the list is empty. The client then truly returns nomatching_layout, and when
the lo_return(s) are emulated the system gets stuck in some reference
mismatch. (UML, so no crash trace.)

Now let's say that the state should be set before the call to the FS. Then I
don't see where the state is removed in the case of an ERROR return from
FS->layout_get, meaning cl_has_file_layout() will always think it has some
count.

Also, when a layout is returned it is the layout list that is inspected and
freed, so how is cl_has_file_layout() emptied?

In any case, I do not agree that it is the state that needs to be searched
in cl_has_file_layout(); it is the layouts that are needed. Otherwise the
whole, very delicate layout <---> recall dance is totally broken.

What was the meaning of the Poet?

I reverted cl_has_file_layout() to its historical processing.

Also, cl_has_file_layout() returned true for any layout on a file, but we must
inspect IO_MODE and LSEG for a partial match as well.
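
The partial-match rule can be sketched in userspace C as follows. This is
only an illustration: struct seg, seg_end(), seg_overlapping() and
recall_matches_layout() are hypothetical names, and the overlap test is an
assumption about what lo_seg_overlapping() checks. The iomode values follow
NFSv4.1 (READ = 1, RW = 2, ANY = 3), which is why a bitwise AND works as the
mode test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct nfsd4_layout_seg. */
struct seg {
	uint32_t iomode;	/* 1 = READ, 2 = RW, 3 = ANY */
	uint64_t offset;
	uint64_t length;	/* UINT64_MAX means "to end of file" */
};

/* Exclusive end of a segment, saturating instead of overflowing. */
static uint64_t seg_end(const struct seg *s)
{
	if (s->length >= UINT64_MAX - s->offset)
		return UINT64_MAX;
	return s->offset + s->length;
}

/* True when the two byte ranges share at least one byte. */
static bool seg_overlapping(const struct seg *a, const struct seg *b)
{
	return a->offset < seg_end(b) && b->offset < seg_end(a);
}

/* A recall applies to a held layout only if range and iomode both match. */
static bool recall_matches_layout(const struct seg *recall,
				  const struct seg *held)
{
	return seg_overlapping(recall, held) &&
	       (recall->iomode & held->iomode) != 0;
}
```

With this rule, a READ recall over the whole file does not match a client
that holds only an RW layout, so no recall would be sent to it.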

The below works for me. State also looks good. I can now safely call
cb_recall from within a layout_get operation.

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfsd/nfs4pnfsd.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/nfsd/nfs4pnfsd.c b/fs/nfsd/nfs4pnfsd.c
index e8e7709..523d3d0 100644
--- a/fs/nfsd/nfs4pnfsd.c
+++ b/fs/nfsd/nfs4pnfsd.c
@@ -1177,24 +1177,27 @@ out:
}

static bool
-cl_has_file_layout(struct nfs4_client *clp, struct nfs4_file *fp, stateid_t *lsid)
+cl_has_file_layout(struct nfs4_client *clp, struct nfs4_file *fp,
+ stateid_t *lsid, struct nfsd4_pnfs_cb_layout *cbl)
{
- struct nfs4_layout_state *ls;
+ struct nfs4_layout *lo;
+ bool ret = false;

spin_lock(&layout_lock);
- list_for_each_entry (ls, &fp->fi_layout_states, ls_perfile)
- if (same_clid(&ls->ls_stid.sc_stateid.si_opaque.so_clid,
- &clp->cl_clientid)) {
+ list_for_each_entry(lo, &fp->fi_layouts, lo_perfile) {
+ if (same_clid(&lo->lo_client->cl_clientid, &clp->cl_clientid) &&
+ lo_seg_overlapping(&cbl->cbl_seg, &lo->lo_seg) &&
+ (cbl->cbl_seg.iomode & lo->lo_seg.iomode))
goto found;
- }
- spin_unlock(&layout_lock);
- return false;
-
+ }
+ goto unlock;
found:
- update_layout_stateid_locked(ls, lsid);
+ /* I'm going to send a recall on this layout; update its state */
+ update_layout_stateid_locked(lo->lo_state, lsid);
+ ret = true;
+unlock:
spin_unlock(&layout_lock);
-
- return true;
+ return ret;
}

static int
@@ -1226,7 +1229,7 @@ cl_has_layout(struct nfs4_client *clp, struct nfsd4_pnfs_cb_layout *cbl,
{
switch (cbl->cbl_recall_type) {
case RETURN_FILE:
- return cl_has_file_layout(clp, lrfile, lsid);
+ return cl_has_file_layout(clp, lrfile, lsid, cbl);
case RETURN_FSID:
return cl_has_fsid_layout(clp, &cbl->cbl_fsid);
default:
--
1.7.10.2.677.gb6bc67f