2011-06-07 17:24:19

by Jim Rees

Subject: [PATCH 00/88] pnfs block layout driver

This patch set adds a block layout driver to the pnfs client.

Benny Halevy (25):
pnfs: add set-clear layoutdriver interface
pnfs: xdr support for three word attribute bitmap
pnfsblock: select BLK_DEV_DM when PNFS_BLOCK is configured
SQUASHME: pnfs-block: convert APIs pnfs-post-submit
SQUASHME: pnfsblock: get rid of threshold policy ops
SQUASHME: pnfs-block: nfs4_blk_add_block_disk ret must be signed
SQUASHME: pnfs-block: use new alloc/free_layout API
SQUASHME: pnfs-block: use new commit api
SQUASHME: pnfs-block: use new read_pagelist api
SQUASHME: pnfs-block: use new write_pagelist api
SQUASHME: pnfs-block: apply types rename
SQUASHME: pnfs-block: Revert "pnfsblock: expose block_class
interface"
SQUASHME: pnfsblock: remove obsolete include file from blocklayout.h
SQUASHME: pnfsblock: use nfs4_deviceid
SQUASHME: pnfsblock: no callback ops
SQAUSHME: pnfsblock: no PNFS_NFS_SERVER
SQUASHME: pnfsblock: no dev_notify_types
SQUASHME: pnfsblock: use new struct pnfs_layout_hdr
SQUASHME: pnfs-block: deprecate get_stripesize
SQUASHME: pnfs-block: use {set,clear}_layoutdriver
SQUASHME: pnfs-block: fixup setup_layoutcommit arguments
SQUASHME: pnfs-block: fixup cleanup_layoutcommit arguments
SQUASHME: pnfs-block: fixup encode_layoutcommit arguments
SQUASHME: pnfs-block: fixup layoutcommit methods args
SQUASHME: pnfs-block: use pnfs_layout_hdr field prefix

Boaz Harrosh (1):
SQUASHME: pnfs-block: remove of CONFIG_PNFS fallout

Fred (1):
pnfsblock: find_get_extent

Fred Isaman (39):
pnfs_post_submit: Restore "pnfs: pnfs_do_flush" part 1
pnfs_post_submit: Restore the pnfs_write_end part of "pnfs: commit
and pnfs_write_end"
pnfs: HACK: ask for layout_blksize on mount
pnfs: HACK: modify write_end_cleanup
HACK: propagate fsdata into nfs_writepage_setup
pnfs: HACK: adjust eof handling
pnfsblock: define PNFS_BLOCK Kconfig option
pnfsblock: blocklayout stub
pnfsblock: expose scsi interface
pnfsblock: scan scsi devices
pnfsblock: call and parse getdevicelist
pnfsblock: dm kernel interface
pnfsblock: create and destroy dm metadevice
pnfsblock: construct and load md table
pnfsblock: layout alloc and free
pnfsblock: basic extent code
pnfsblock: lseg alloc and free
pnfsblock: xdr decode pnfs_block_layout4
pnfsblock: merge extents
pnfsblock: bl_read_pagelist
pnfsblock: allow use of PG_owner_priv_1 flag
pnfsblock: read path error handling
pnfsblock: SPLITME: add extent manipulation functions
pnfsblock: write_begin
pnfsblock: write_end
pnfsblock: write_end_cleanup
pnfsblock: bl_write_pagelist support functions
pnfsblock: bl_write_pagelist
pnfsblock: note written INVAL areas for layoutcommit
pnfsblock: bl_setup_layoutcommit
pnfsblock: encode_layoutcommit
pnfsblock: cleanup_layoutcommit
pnfsblock: merge rw extents
pnfsblock: debugging dprintks for clist info
SQUASHME: pnfsblock: write_begin adjust for removed fields
SQUASHME: pnfsblock: write_end adjust for removed ok_to_use_pnfs
SQUASHME: pnfsblock: write_end_cleanup adjust for removed
ok_to_use_pnfs
SQUASHME: pnfsblock: bl_write_pagelist support functions adjust for
missing PG_USE_PNFS
SQUASHME: pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS

J. Bruce Fields (1):
SQUASHME: pnfs-block: fix compile breakage

Jim Rees (5):
pnfs-block: Add support for simple rpc pipefs
pnfs-block: Remove device creation from kernel
move include lines out of include file
SQUASHME: pnfs-block: Return failure from bl_initialize_mountpoint
pnfs-block: fix blocklayoutdev.c for new blkdev_get_by_dev()

Mike Sager (1):
pnfsblock: use the session max response size for getdeviceinfo's
maxcount

Peng Tao (4):
pnfs: let layoutcommit code handle multiple segments
SQUASHME: pnfs: blocklayout: port block layout code
Add configurable prefetch size for layoutget
NFS41: do not update isize if inode needs layoutcommit

Steve Dickson (1):
SQUASHME: pnfsblock: compile error in blocklayout code

Tao Guo (3):
SQUASHME: pnfsblock: fix bug when decoding block device info.
pnfsblock: expose block_class interface
pnfsblock: iterating all local block disks instead of only scsi disks
when initializing mount point.

Zhang Jingwang (7):
SQAUSHME: blocklayoutdriver: NULL pointer reference when committing
too many extents
SQUASHME: pnfsblock: Fix a memory leak
SQUASHME: pnfsblock: Wrong extent refcount in block extents list
SQUASHME: pnfsblock: Implement release_inval_marks
SQUASHME: pnfsblock: Fix missing extent in commit list
pnfsblock: Lookup list entry of layouts and tags in reverse order
SQUASHME: pnfsblock: set pnfs_blksize before calling
set_pnfs_layoutdriver

fs/nfs/Kconfig | 10 +
fs/nfs/Makefile | 1 +
fs/nfs/blocklayout/Makefile | 6 +
fs/nfs/blocklayout/block-device-discovery-pipe.c | 66 ++
fs/nfs/blocklayout/blocklayout.c | 1103 ++++++++++++++++++++++
fs/nfs/blocklayout/blocklayout.h | 297 ++++++
fs/nfs/blocklayout/blocklayoutdev.c | 346 +++++++
fs/nfs/blocklayout/blocklayoutdm.c | 120 +++
fs/nfs/blocklayout/extents.c | 940 ++++++++++++++++++
fs/nfs/client.c | 8 +-
fs/nfs/file.c | 26 +-
fs/nfs/inode.c | 3 +-
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 6 +-
fs/nfs/nfs4xdr.c | 104 ++-
fs/nfs/pnfs.c | 96 ++-
fs/nfs/pnfs.h | 126 +++-
fs/nfs/sysctl.c | 10 +
fs/nfs/write.c | 12 +-
include/linux/nfs_fs.h | 3 +-
include/linux/nfs_fs_sb.h | 4 +-
include/linux/nfs_xdr.h | 3 +-
include/linux/sunrpc/rpc_pipe_fs.h | 4 +
include/linux/sunrpc/simple_rpc_pipefs.h | 105 ++
net/sunrpc/Makefile | 2 +-
net/sunrpc/simple_rpc_pipefs.c | 423 +++++++++
26 files changed, 3778 insertions(+), 48 deletions(-)
create mode 100644 fs/nfs/blocklayout/Makefile
create mode 100644 fs/nfs/blocklayout/block-device-discovery-pipe.c
create mode 100644 fs/nfs/blocklayout/blocklayout.c
create mode 100644 fs/nfs/blocklayout/blocklayout.h
create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c
create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c
create mode 100644 fs/nfs/blocklayout/extents.c
create mode 100644 include/linux/sunrpc/simple_rpc_pipefs.h
create mode 100644 net/sunrpc/simple_rpc_pipefs.c

--
1.7.4.1



2011-06-07 17:29:25

by Jim Rees

Subject: [PATCH 30/88] pnfsblock: write_end

From: Fred Isaman <[email protected]>

Implements bl_write_end, which basically just calls SetPageUptodate.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index b3ad99d..f4851c1 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -818,6 +818,23 @@ bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
return ret;
}

+/* CAREFUL - what happens if copied < count??? */
+static int
+bl_write_end(struct inode *inode, struct page *page, loff_t pos,
+ unsigned count, unsigned copied, struct pnfs_fsdata *fsdata)
+{
+ dprintk("%s enter, %u@%lld, %i\n", __func__, count, pos,
+ fsdata ? fsdata->ok_to_use_pnfs : -1);
+ print_page(page);
+ if (fsdata) {
+ if (fsdata->ok_to_use_pnfs) {
+ dprintk("%s using pnfs\n", __func__);
+ SetPageUptodate(page);
+ }
+ }
+ return 0;
+}
+
static ssize_t
bl_get_stripesize(struct pnfs_layout_type *lo)
{
@@ -862,6 +879,7 @@ static struct layoutdriver_io_operations blocklayout_io_operations = {
.read_pagelist = bl_read_pagelist,
.write_pagelist = bl_write_pagelist,
.write_begin = bl_write_begin,
+ .write_end = bl_write_end,
.alloc_layout = bl_alloc_layout,
.free_layout = bl_free_layout,
.alloc_lseg = bl_alloc_lseg,
--
1.7.4.1


2011-06-10 23:20:19

by Boaz Harrosh

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 06/10/2011 12:23 PM, Benny Halevy wrote:
> On 2011-06-10 10:09, [email protected] wrote:
>
> A simple algorithm I can suggest is:
> - on initialization, calculate and save, per layout driver
> - maximum layout size
> - take into account csr_fore_chan_attrs.ca_maxresponsesize and possible other parameters
> - keep a working copy of the maximum value and the calculated copy.
> - alignment value.
> - on miss, see if there's an adjacent layout segment in cache
> - if found, ask for twice the found segment size, up to the maximum value,
> aligned on the alignment value.
> - if the server returns less than the layoutget range, keep note of the returned length
> (but do not adjust the maximum yet, as the server may return a short segment for various
> reasons)
> - if the server is consistent about returning less than was asked, adjust the
> working copy of the maximum length
> - if the maximum was adjusted try bumping it up after X (TBD) layoutgets or T seconds
> to see if that was just due to high load or conflicts on the server
> - on any error returned for LAYOUTGET reset the algorithm parameters
> - on session reestablishment recalculate maximums.
>
> Benny
>

I completely disagree with all this. NACK!

The only proper thing a client can do is ask for what it needs, and only the application
can know that. At the VFS level the client is only second-guessing, which is completely
pointless.

The only one that can know about structure, alignments, optimal IO sizes and layouts
is the server. The server even has more information with which to second-guess the
application: the file size and its share and lock disposition. Please see my
simple server-side algorithm.

You must understand one most important thing: any smart decision a client can
make comes after it has received the layout (stripe_unit, number of devices, etc.). But
by then it is too late, because it has already sent the layout_get. Only the server knows
beforehand what the optimal size is. The client should just be a transparent
pipe from the application to the server. It should never, ever set policy. Only a server
can and should do that.

Let's put the effort and algorithms where they belong, please?

Boaz
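
For reference, the sizing step in the client-side algorithm quoted above (double the adjacent cached segment, cap at the maximum, align) could look roughly like the userspace sketch below. The function name, the parameter names, and the choice to start with one alignment unit when nothing is cached are assumptions made for illustration; none of them come from the patch set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch set: choose the length to
 * request in the next LAYOUTGET after a cache miss.
 *   adjacent_len: length of an adjacent cached segment, 0 if none
 *   max_len:      working copy of the maximum layout size
 *   align:        alignment value for the layout driver
 */
uint64_t next_layoutget_len(uint64_t adjacent_len, uint64_t max_len,
			    uint64_t align)
{
	uint64_t want;

	assert(align > 0 && max_len >= align);

	/* Assumption: with nothing cached, start with one alignment unit. */
	if (adjacent_len == 0)
		return align;

	/* Ask for twice the adjacent segment, capped at the maximum. */
	want = adjacent_len > max_len / 2 ? max_len : adjacent_len * 2;

	/* Round up to the alignment, then pull back inside the maximum. */
	want = (want + align - 1) / align * align;
	if (want > max_len)
		want = max_len / align * align;
	return want;
}
```

The adaptive parts of the proposal (tracking short returns, lowering and later re-raising the working maximum) are deliberately left out; they would only change the max_len value passed in.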

2011-06-07 17:30:47

by Jim Rees

Subject: [PATCH 40/88] SQAUSHME: blocklayoutdriver: NULL pointer reference when committing too many extents

From: Zhang Jingwang <[email protected]>

If there are too many extents to commit, the xdr buffer will be used up.

So we check the return value of xdr_reserve_space() and encode as many extents as possible into the xdr buffer, leaving the rest in the bl_commit list.

Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/extents.c | 29 ++++++++++++++++-------------
1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 98452ca..09a7c5c 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -745,7 +745,7 @@ encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
{
sector_t start, end;
struct pnfs_block_short_extent *lce, *save;
- unsigned int count;
+ unsigned int count = 0;
struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
struct list_head *ranges = &bld->ranges;
__be32 *p, *xdr_start;
@@ -760,29 +760,32 @@ encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
* entire block to be marked WRITTEN before it can be added.
*/
spin_lock(&bl->bl_ext_lock);
- list_splice_init(&bl->bl_commit, ranges);
- count = bl->bl_count;
- bl->bl_count = 0;
/* Want to adjust for possible truncate */
/* We now want to adjust argument range */
- spin_unlock(&bl->bl_ext_lock);

- dprintk("%s found %i ranges\n", __func__, count);
/* XDR encode the ranges found */
- xdr_start = p = xdr_reserve_space(xdr, 8);
- p++;
- WRITE32(count);
- list_for_each_entry_safe(lce, save, ranges, bse_node) {
+ xdr_start = xdr_reserve_space(xdr, 8);
+ if (!xdr_start)
+ goto out;
+ list_for_each_entry_safe(lce, save, &bl->bl_commit, bse_node) {
p = xdr_reserve_space(xdr, 7 * 4 + sizeof(lce->bse_devid.data));
-
+ if (!p)
+ break;
WRITE_DEVID(&lce->bse_devid);
WRITE64(lce->bse_f_offset << 9);
WRITE64(lce->bse_length << 9);
WRITE64(0LL);
WRITE32(PNFS_BLOCK_READWRITE_DATA);
+ list_del(&lce->bse_node);
+ list_add_tail(&lce->bse_node, ranges);
+ bl->bl_count--;
+ count++;
}
-
- *xdr_start = cpu_to_be32((xdr->p - xdr_start - 1) * 4);
+ xdr_start[0] = cpu_to_be32((xdr->p - xdr_start - 1) * 4);
+ xdr_start[1] = cpu_to_be32(count);
+out:
+ spin_unlock(&bl->bl_ext_lock);
+ dprintk("%s found %i ranges\n", __func__, count);
return 0;
}

--
1.7.4.1
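
The pattern this patch introduces (reserve space per entry, stop at the first failed reservation, write the count only after the loop) can be illustrated outside the kernel. The sketch below is a simplified userspace analogue, assuming a plain bounded byte buffer in place of the xdr stream and an array in place of the bl_commit list; struct outbuf, outbuf_reserve() and encode_some() are hypothetical names, and values are copied in host byte order rather than XDR encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ENTRY_BYTES 8	/* hypothetical fixed size of one encoded entry */

/* Hypothetical stand-in for an xdr stream: a bounded byte buffer. */
struct outbuf {
	uint8_t *data;
	size_t cap;
	size_t used;
};

/* Reserve nbytes, or return NULL when the buffer is exhausted. */
uint8_t *outbuf_reserve(struct outbuf *b, size_t nbytes)
{
	uint8_t *p;

	if (nbytes > b->cap - b->used)
		return NULL;
	p = b->data + b->used;
	b->used += nbytes;
	return p;
}

/* Encode entries from pending[] until space runs out.  Returns the number
 * encoded; the caller keeps pending[count..n-1] for the next round. */
size_t encode_some(struct outbuf *b, const uint64_t *pending, size_t n)
{
	size_t count = 0;
	uint8_t *hdr = outbuf_reserve(b, 4);	/* room for the count */
	uint32_t c32;

	if (!hdr)
		return 0;
	while (count < n) {
		uint8_t *p = outbuf_reserve(b, ENTRY_BYTES);

		if (!p)
			break;	/* buffer full: leave the rest pending */
		memcpy(p, &pending[count], ENTRY_BYTES);
		count++;
	}
	c32 = (uint32_t)count;
	memcpy(hdr, &c32, 4);	/* count is known only after the loop */
	return count;
}
```

Writing the count last is the key design point: before the loop finishes, the encoder cannot know how many entries will fit.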


2011-06-07 17:35:22

by Jim Rees

Subject: [PATCH 80/88] SQUASHME: pnfs-block: fixup setup_layoutcommit arguments

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 5f11fb8..ff94ee2 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -617,7 +617,7 @@ bl_alloc_lseg(struct pnfs_layout_hdr *lo,

static int
bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutcommit_args *arg)
+ struct nfs4_layoutcommit_op_args *arg)
{
struct nfs_server *nfss = NFS_SERVER(lo->inode);
struct bl_layoutupdate_data *layoutupdate_data;
--
1.7.4.1


2011-06-07 17:34:37

by Jim Rees

Subject: [PATCH 73/88] SQUASHME: pnfsblock: no dev_notify_types

From: Benny Halevy <[email protected]>

not supported yet

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 5e50c93..ac53a3f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -725,7 +725,6 @@ nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,

memcpy(&dev->dev_id, d_id, sizeof(*d_id));
dev->layout_type = LAYOUT_BLOCK_VOLUME;
- dev->dev_notify_types = 0;
dev->pages = pages;
dev->pgbase = 0;
dev->pglen = PAGE_SIZE * max_pages;
--
1.7.4.1


2011-06-07 17:28:49

by Jim Rees

Subject: [PATCH 24/88] pnfsblock: find_get_extent

From: Fred <[email protected]>

Implement find_get_extent(), one of the core extent manipulation
routines.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>

pnfsblock: fix print format warnings for sector_t and size_t

gcc spews warnings about these on x86_64, e.g.:
fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’
fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 3 ++
fs/nfs/blocklayout/extents.c | 47 ++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 13fc0e2..e992b94 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -203,6 +203,9 @@ struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
int nfs4_blk_flatten(struct pnfs_blk_volume *, int, struct pnfs_block_dev *);
void free_block_dev(struct pnfs_block_dev *bdev);
/* extents.c */
+struct pnfs_block_extent *
+find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
+ struct pnfs_block_extent **cow_read);
void put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *alloc_extent(void);
int add_and_merge_extent(struct pnfs_block_layout *bl,
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index ce7b6f7..944f824 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -193,3 +193,50 @@ add_and_merge_extent(struct pnfs_block_layout *bl,
put_extent(new);
return -EIO;
}
+
+/* Returns extent, or NULL. If a second READ extent exists, it is returned
+ * in cow_read, if given.
+ *
+ * The extents are kept in two separate ordered lists, one for READ and NONE,
+ * one for READWRITE and INVALID. Within each list, we assume:
+ * 1. Extents are ordered by file offset.
+ * 2. For any given isect, there is at most one extent that matches.
+ */
+struct pnfs_block_extent *
+find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
+ struct pnfs_block_extent **cow_read)
+{
+ struct pnfs_block_extent *be, *cow, *ret;
+ int i;
+
+ dprintk("%s enter with isect %llu\n", __func__, (u64)isect);
+ cow = ret = NULL;
+ spin_lock(&bl->bl_ext_lock);
+ for (i = 0; i < EXTENT_LISTS; i++) {
+ if (ret &&
+ (!cow_read || ret->be_state != PNFS_BLOCK_INVALID_DATA))
+ break;
+ list_for_each_entry(be, &bl->bl_extents[i], be_node) {
+ if (isect < be->be_f_offset)
+ break;
+ if (isect < be->be_f_offset + be->be_length) {
+ /* We have found an extent */
+ dprintk("%s Get %p (%i)\n", __func__, be,
+ atomic_read(&be->be_refcnt.refcount));
+ kref_get(&be->be_refcnt);
+ if (!ret)
+ ret = be;
+ else if (be->be_state != PNFS_BLOCK_READ_DATA)
+ put_extent(be);
+ else
+ cow = be;
+ break;
+ }
+ }
+ }
+ spin_unlock(&bl->bl_ext_lock);
+ if (cow_read)
+ *cow_read = cow;
+ print_bl_extent(ret);
+ return ret;
+}
--
1.7.4.1
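
The inner loop of find_get_extent() relies on the two list invariants stated in its comment: extents are ordered by file offset, and at most one extent matches a given sector. A minimal userspace sketch of that lookup, assuming a sorted array instead of a kernel list and ignoring locking, refcounts and the second (copy-on-write) list, might look like this. struct ext and find_ext() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified extent covering sectors
 * [f_offset, f_offset + length). */
struct ext {
	uint64_t f_offset;
	uint64_t length;
};

/* exts[] must be sorted by f_offset and must not overlap.
 * Returns the extent containing isect, or NULL if there is none. */
const struct ext *find_ext(const struct ext *exts, size_t n, uint64_t isect)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (isect < exts[i].f_offset)
			break;	/* sorted: no later extent can match */
		if (isect < exts[i].f_offset + exts[i].length)
			return &exts[i];
	}
	return NULL;
}
```

The early break is what the ordering invariant buys: a hole in the file is detected as soon as the scan passes the sector, without walking the rest of the list.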


2011-06-10 04:04:22

by Benny Halevy

Subject: Re: [PATCH 00/88] pnfs block layout driver

On 2011-06-09 18:15, Jim Rees wrote:
> Boaz Harrosh wrote:
>
> Who is going to SQUASH all the SQUASHMEs and rethink the whole patch
> separation, into something that makes a more logical progression
> and is easier to review? The way it is now I'm not able to review it;
> sorry, I got lost trying to understand which is which.
>
> I'm open to suggestions and happy to do the work. I agree that 88 patches

Thanks!

> is nearly indigestible. However I note that Benny seems to have pulled in
> the entire set so I'm not sure how to proceed at this point. Also this code patch

For 2.6.39 it is what it is but for 3.[01] we should clean up the patchset and I'll
rebase it into the tree again. When it's ready for final review and submission we'll
have a for-3.1 branch based off of Trond's respective branch with all the queued
patches.

> was in Benny's 2.6.38 and only got dropped when the 3.0 merge came along, so
> most of it's already been under review for a year or more.

Ehhhhh, we need to re-review it for submission taking into account
the major changes that went into 2.6.39 and 3.0...

Benny

> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-06-07 17:33:20

by Jim Rees

Subject: [PATCH 61/88] SQUASHME: pnfs-block: use new alloc/free_layout API

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 9 ++++-----
fs/nfs/blocklayout/blocklayout.h | 15 +++++++++++++--
fs/nfs/blocklayout/blocklayoutdev.c | 2 +-
3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 92f0b4b..63d3b5a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -552,18 +552,17 @@ release_inval_marks(struct pnfs_inval_markings *marks)

/* Note we are relying on caller locking to prevent nasty races. */
static void
-bl_free_layout(void *p)
+bl_free_layout(struct pnfs_layout_type *lo)
{
- struct pnfs_block_layout *bl = p;
+ struct pnfs_block_layout *bl = BLK_LO2EXT(lo);

dprintk("%s enter\n", __func__);
release_extents(bl, NULL);
release_inval_marks(&bl->bl_inval);
kfree(bl);
- return;
}

-static void *
+static struct pnfs_layout_type *
bl_alloc_layout(struct inode *inode)
{
struct pnfs_block_layout *bl;
@@ -579,7 +578,7 @@ bl_alloc_layout(struct inode *inode)
bl->bl_count = 0;
bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
- return bl;
+ return &bl->bl_layout;
}

static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 0efed8d..d316b7f 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -177,6 +177,7 @@ static inline int choose_list(enum exstate4 state)
}

struct pnfs_block_layout {
+ struct pnfs_layout_type bl_layout;
struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
spinlock_t bl_ext_lock; /* Protects list manipulation */
struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
@@ -193,8 +194,18 @@ struct bl_layoutupdate_data {
};

#define BLK_ID(lo) ((struct block_mount_id *)(PNFS_NFS_SERVER(lo)->pnfs_ld_data))
-#define BLK_LSEG2EXT(lseg) ((struct pnfs_block_layout *)lseg->layout->ld_data)
-#define BLK_LO2EXT(lo) ((struct pnfs_block_layout *)lo->ld_data)
+
+static inline struct pnfs_block_layout *
+BLK_LO2EXT(struct pnfs_layout_type *lo)
+{
+ return container_of(lo, struct pnfs_block_layout, bl_layout);
+}
+
+static inline struct pnfs_block_layout *
+BLK_LSEG2EXT(struct pnfs_layout_segment *lseg)
+{
+ return BLK_LO2EXT(lseg->layout);
+}

uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);

diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index a866f5c..7285d5e 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -620,7 +620,7 @@ int
nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
struct nfs4_pnfs_layoutget_res *lgr)
{
- struct pnfs_block_layout *bl = PNFS_LD_DATA(lo);
+ struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
uint32_t *p = (uint32_t *)lgr->layout.buf;
uint32_t *end = (uint32_t *)((char *)lgr->layout.buf + lgr->layout.len);
int i, status = -EIO;
--
1.7.4.1
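
The core of this change is replacing a void pointer to driver-private data with a generic header embedded in the private structure, recovered through container_of(). The following userspace sketch shows the idiom under simplified assumptions: the macro is a plain offsetof() version without the kernel's type checking, and struct generic_layout, struct block_layout and to_block_layout() are hypothetical stand-ins for struct pnfs_layout_type, struct pnfs_block_layout and BLK_LO2EXT().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(), without the
 * kernel's type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic header, standing in for struct pnfs_layout_type. */
struct generic_layout {
	int refcount;
};

/* Hypothetical driver-private layout, standing in for
 * struct pnfs_block_layout.  The patch puts the header first; it is
 * placed second here to show that the idiom does not depend on that. */
struct block_layout {
	unsigned int blocksize;
	struct generic_layout hdr;	/* embedded, not pointed to */
};

/* Recover the private structure from a pointer to its embedded header. */
struct block_layout *to_block_layout(struct generic_layout *lo)
{
	return container_of(lo, struct block_layout, hdr);
}
```

Compared with the old ld_data cast, the allocation function can hand the generic layer a typed pointer (&bl->bl_layout in the patch), and a single kfree() of the containing structure releases both parts.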


2011-06-08 07:06:44

by Peng Tao

Subject: Re: [PATCH 88/88] NFS41: do not update isize if inode needs layoutcommit

On 6/8/11, Benny Halevy <[email protected]> wrote:
> Better send generic patches separately.
> This patch needs to go upstream and to stable 2.6.39
It has been sent separately before. Please see
http://www.spinics.net/list/linux-nfs/msg21586.html

>
> Benny
>
> On 2011-06-07 13:36, Jim Rees wrote:
>> From: Peng Tao <[email protected]>
>>
>> Layout commit is supposed to set server file size similar to nfs pages.
>> We should not update client file size for the same reason.
>> Otherwise we will lose what we have at hand.
>>
>> Signed-off-by: Peng Tao <[email protected]>
>> Signed-off-by: Jim Rees <[email protected]>
>> ---
>> fs/nfs/inode.c | 3 ++-
>> 1 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 144f2a3..3f1eb81 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -1294,7 +1294,8 @@ static int nfs_update_inode(struct inode *inode,
>> struct nfs_fattr *fattr)
>> if (new_isize != cur_isize) {
>> /* Do we perhaps have any outstanding writes, or has
>> * the file grown beyond our last write? */
>> - if (nfsi->npages == 0 || new_isize > cur_isize) {
>> + if ((nfsi->npages == 0 && !test_bit(NFS_INO_LAYOUTCOMMIT,
>> &nfsi->flags)) ||
>> + new_isize > cur_isize) {
>> i_size_write(inode, new_isize);
>> invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
>> }


--
Thanks,
-Bergwolf
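
The condition changed in the quoted hunk can be read as a small predicate: adopt the server's file size only if the client has nothing outstanding that could still change it, or if the file has grown. The sketch below restates that logic in userspace, assuming plain parameters in place of the nfsi fields and the NFS_INO_LAYOUTCOMMIT flag test; should_take_server_size() is a hypothetical name, not a function in fs/nfs/inode.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical restatement of the patched test in nfs_update_inode().
 *   npages:               outstanding nfs pages for the inode
 *   layoutcommit_pending: stands in for the NFS_INO_LAYOUTCOMMIT flag
 *   cur_isize, new_isize: client's current size and size from the server
 */
bool should_take_server_size(unsigned long npages, bool layoutcommit_pending,
			     uint64_t cur_isize, uint64_t new_isize)
{
	if (new_isize == cur_isize)
		return false;
	/* A pending layoutcommit, like outstanding pages, means the client
	 * may know about data the server has not been told of yet, so a
	 * smaller server size must not overwrite the local one. */
	return (npages == 0 && !layoutcommit_pending) ||
	       new_isize > cur_isize;
}
```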

2011-06-07 17:28:44

by Jim Rees

Subject: [PATCH 23/88] pnfsblock: merge extents

From: Fred Isaman <[email protected]>

Replace a stub, so that extents underlying the layouts are properly
added, merged, or ignored as necessary.

Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: delete the new node before put it]
Signed-off-by: Mingyang Guo <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 10 +++
fs/nfs/blocklayout/blocklayoutdev.c | 19 +++++-
fs/nfs/blocklayout/extents.c | 128 +++++++++++++++++++++++++++++++++++
3 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index f91939d..13fc0e2 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -135,6 +135,14 @@ enum extentclass4 {
EXTENT_LISTS = 2,
};

+static inline int choose_list(enum exstate4 state)
+{
+ if (state == PNFS_BLOCK_READ_DATA || state == PNFS_BLOCK_NONE_DATA)
+ return RO_EXTENT;
+ else
+ return RW_EXTENT;
+}
+
struct pnfs_block_layout {
struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
spinlock_t bl_ext_lock; /* Protects list manipulation */
@@ -197,4 +205,6 @@ void free_block_dev(struct pnfs_block_dev *bdev);
/* extents.c */
void put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *alloc_extent(void);
+int add_and_merge_extent(struct pnfs_block_layout *bl,
+ struct pnfs_block_extent *new);
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 77190fd..ac5c117 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -642,7 +642,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
uint32_t *end = (uint32_t *)((char *)lgr->layout.buf + lgr->layout.len);
int i, status = -EIO;
uint32_t count;
- struct pnfs_block_extent *be = NULL;
+ struct pnfs_block_extent *be = NULL, *save;
uint64_t tmp; /* Used by READSECTOR */
struct layout_verification lv = {
.mode = lgr->lseg.iomode,
@@ -706,9 +706,22 @@ nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
/* Extents decoded properly, now try to merge them in to
* existing layout extents.
*/
- /* STUB - instead we just throw them away */
+ spin_lock(&bl->bl_ext_lock);
+ list_for_each_entry_safe(be, save, &extents, be_node) {
+ list_del(&be->be_node);
+ status = add_and_merge_extent(bl, be);
+ if (status) {
+ spin_unlock(&bl->bl_ext_lock);
+ /* This is a fairly catastrophic error, as the
+ * entire layout extent lists are now corrupted.
+ * We should have some way to distinguish this.
+ */
+ be = NULL;
+ goto out_err;
+ }
+ }
+ spin_unlock(&bl->bl_ext_lock);
status = 0;
- goto out_err;
out:
dprintk("%s returns %i\n", __func__, status);
return status;
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index a952d39..ce7b6f7 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -33,6 +33,17 @@
#include "blocklayout.h"
#define NFSDBG_FACILITY NFSDBG_PNFS_LD

+static void print_bl_extent(struct pnfs_block_extent *be)
+{
+ dprintk("PRINT EXTENT extent %p\n", be);
+ if (be) {
+ dprintk(" be_f_offset %llu\n", (u64)be->be_f_offset);
+ dprintk(" be_length %llu\n", (u64)be->be_length);
+ dprintk(" be_v_offset %llu\n", (u64)be->be_v_offset);
+ dprintk(" be_state %d\n", be->be_state);
+ }
+}
+
static void
destroy_extent(struct kref *kref)
{
@@ -65,3 +76,120 @@ struct pnfs_block_extent *alloc_extent(void)
be->be_inval = NULL;
return be;
}
+
+void print_elist(struct list_head *list)
+{
+ struct pnfs_block_extent *be;
+ dprintk("****************\n");
+ dprintk("Extent list looks like:\n");
+ list_for_each_entry(be, list, be_node) {
+ print_bl_extent(be);
+ }
+ dprintk("****************\n");
+}
+
+static inline int
+extents_consistent(struct pnfs_block_extent *old, struct pnfs_block_extent *new)
+{
+ /* Note this assumes new->be_f_offset >= old->be_f_offset */
+ return (new->be_state == old->be_state) &&
+ ((new->be_state == PNFS_BLOCK_NONE_DATA) ||
+ ((new->be_v_offset - old->be_v_offset ==
+ new->be_f_offset - old->be_f_offset) &&
+ new->be_mdev == old->be_mdev));
+}
+
+/* Adds new to appropriate list in bl, modifying new and removing existing
+ * extents as appropriate to deal with overlaps.
+ *
+ * See find_get_extent for list constraints.
+ *
+ * Refcount on new is already set. If end up not using it, or error out,
+ * need to put the reference.
+ *
+ * Lock is held by caller.
+ */
+int
+add_and_merge_extent(struct pnfs_block_layout *bl,
+ struct pnfs_block_extent *new)
+{
+ struct pnfs_block_extent *be, *tmp;
+ sector_t end = new->be_f_offset + new->be_length;
+ struct list_head *list;
+
+ dprintk("%s enter with be=%p\n", __func__, new);
+ print_bl_extent(new);
+ list = &bl->bl_extents[choose_list(new->be_state)];
+ print_elist(list);
+
+ /* Scan for proper place to insert, extending new to the left
+ * as much as possible.
+ */
+ list_for_each_entry_safe(be, tmp, list, be_node) {
+ if (new->be_f_offset < be->be_f_offset)
+ break;
+ if (end <= be->be_f_offset + be->be_length) {
+ /* new is a subset of existing be*/
+ if (extents_consistent(be, new)) {
+ dprintk("%s: new is subset, ignoring\n",
+ __func__);
+ put_extent(new);
+ return 0;
+ } else
+ goto out_err;
+ } else if (new->be_f_offset <=
+ be->be_f_offset + be->be_length) {
+ /* new overlaps or abuts existing be */
+ if (extents_consistent(be, new)) {
+ /* extend new to fully replace be */
+ new->be_length += new->be_f_offset -
+ be->be_f_offset;
+ new->be_f_offset = be->be_f_offset;
+ new->be_v_offset = be->be_v_offset;
+ dprintk("%s: removing %p\n", __func__, be);
+ list_del(&be->be_node);
+ put_extent(be);
+ } else if (new->be_f_offset !=
+ be->be_f_offset + be->be_length)
+ goto out_err;
+ }
+ }
+ /* Note that if we never hit the above break, be will not point to a
+ * valid extent. However, in that case &be->be_node==list.
+ */
+ list_add_tail(&new->be_node, &be->be_node);
+ dprintk("%s: inserting new\n", __func__);
+ print_elist(list);
+ /* Scan forward for overlaps. If we find any, extend new and
+ * remove the overlapped extent.
+ */
+ be = list_prepare_entry(new, list, be_node);
+ list_for_each_entry_safe_continue(be, tmp, list, be_node) {
+ if (end < be->be_f_offset)
+ break;
+ /* new overlaps or abuts existing be */
+ if (extents_consistent(be, new)) {
+ if (end < be->be_f_offset + be->be_length) {
+ /* extend new to fully cover be */
+ end = be->be_f_offset + be->be_length;
+ new->be_length = end - new->be_f_offset;
+ }
+ dprintk("%s: removing %p\n", __func__, be);
+ list_del(&be->be_node);
+ put_extent(be);
+ } else if (end != be->be_f_offset) {
+ list_del(&new->be_node);
+ goto out_err;
+ }
+ }
+ dprintk("%s: after merging\n", __func__);
+ print_elist(list);
+ /* STUB - The per-list consistency checks have all been done,
+ * should now check cross-list consistency.
+ */
+ return 0;
+
+ out_err:
+ put_extent(new);
+ return -EIO;
+}
--
1.7.4.1
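
The merge step in add_and_merge_extent() is, at its core, insertion of an interval into a sorted list with absorption of anything it overlaps or abuts. The userspace sketch below shows only that part, assuming a sorted array instead of a kernel list. It deliberately omits the consistency checks (state, device and volume offset via extents_consistent()) and the -EIO error path of the real function; struct range and insert_and_merge() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified extent: sectors [start, end). */
struct range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/* Insert nr into r[0..n-1], which is sorted by start and contains no
 * overlapping or abutting ranges.  Ranges that overlap or abut nr are
 * absorbed into it.  r must have room for n + 1 entries.
 * Returns the new number of entries. */
size_t insert_and_merge(struct range *r, size_t n, struct range nr)
{
	size_t lo = 0, hi, i;

	/* Skip ranges that end strictly before nr starts. */
	while (lo < n && r[lo].end < nr.start)
		lo++;

	/* Absorb every range that overlaps or abuts nr. */
	hi = lo;
	while (hi < n && r[hi].start <= nr.end) {
		if (r[hi].start < nr.start)
			nr.start = r[hi].start;
		if (r[hi].end > nr.end)
			nr.end = r[hi].end;
		hi++;
	}

	/* Replace r[lo..hi-1] with nr, shifting the tail as needed. */
	if (hi == lo) {
		for (i = n; i > lo; i--)
			r[i] = r[i - 1];
	} else {
		for (i = hi; i < n; i++)
			r[lo + 1 + (i - hi)] = r[i];
	}
	r[lo] = nr;
	return n - (hi - lo) + 1;
}
```

The kernel version does the same thing in two passes (extend left while scanning for the insertion point, then extend right), because it has to stop and fail when a neighbouring extent overlaps but is not consistent with the new one.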


2011-06-07 17:31:33

by Jim Rees

Subject: [PATCH 46/88] SQUASHME: pnfsblock: Fix missing extent in commit list

From: Zhang Jingwang <[email protected]>

When offset is in the middle of an extent, we shouldn't step forward to
offset + extent->be_length; otherwise we may miss some extents in the
commit list.

Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 11 ++++-------
1 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index b0ad836..cf306e9 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -348,16 +348,13 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
end = (offset + count + PAGE_CACHE_SIZE - 1) & (long)(PAGE_CACHE_MASK);
end >>= 9;
while (isect < end) {
+ sector_t len;
be = find_get_extent(bl, isect, NULL);
BUG_ON(!be); /* FIXME */
- if (be->be_state != PNFS_BLOCK_INVALID_DATA)
- isect += be->be_length;
- else {
- sector_t len;
- len = min(end, be->be_f_offset + be->be_length) - isect;
+ len = min(end, be->be_f_offset + be->be_length) - isect;
+ if (be->be_state == PNFS_BLOCK_INVALID_DATA)
mark_for_commit(be, isect, len); /* What if fails? */
- isect += len;
- }
+ isect += len;
put_extent(be);
}
}
--
1.7.4.1
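
The bug fixed here is a stepping error: when the walk starts in the middle of an extent, advancing by the extent's full length overshoots into the next one. A small userspace sketch of the corrected walk, assuming a sorted array of extents and counting sectors instead of calling mark_for_commit(), is given below. struct wext and count_invalid_sectors() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified extent: sectors [f_offset, f_offset + length),
 * with a flag standing in for PNFS_BLOCK_INVALID_DATA. */
struct wext {
	uint64_t f_offset;
	uint64_t length;
	int invalid;
};

/* Walk [isect, end) and count the sectors that lie in invalid extents.
 * Returns -1 if some sector in the range is not covered by any extent. */
int64_t count_invalid_sectors(const struct wext *exts, size_t n,
			      uint64_t isect, uint64_t end)
{
	int64_t total = 0;

	while (isect < end) {
		const struct wext *be = NULL;
		uint64_t be_end, len;
		size_t i;

		for (i = 0; i < n; i++) {
			if (isect >= exts[i].f_offset &&
			    isect < exts[i].f_offset + exts[i].length) {
				be = &exts[i];
				break;
			}
		}
		if (!be)
			return -1;

		/* Step to the end of this extent (or of the range),
		 * measured from isect, not from the extent's start. */
		be_end = be->f_offset + be->length;
		len = (end < be_end ? end : be_end) - isect;
		if (be->invalid)
			total += (int64_t)len;
		isect += len;
	}
	return total;
}
```

With extents {0, 8, valid}, {8, 8, invalid}, {16, 8, invalid} and the range [4, 20), this walk counts 12 invalid sectors. Stepping by the full length of the first extent, as the code did before the fix, would jump from sector 4 to 12 and miss sectors 8 through 11.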


2011-06-07 17:33:25

by Jim Rees

Subject: [PATCH 62/88] SQUASHME: pnfs-block: use new commit api

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 63d3b5a..e2ee90a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -99,9 +99,8 @@ dont_like_caller(struct nfs_page *req)
}

static enum pnfs_try_status
-bl_commit(struct pnfs_layout_type *lo,
- int sync,
- struct nfs_write_data *nfs_data)
+bl_commit(struct nfs_write_data *nfs_data,
+ int sync)
{
dprintk("%s enter\n", __func__);
return PNFS_NOT_ATTEMPTED;
--
1.7.4.1


2011-06-07 17:28:55

by Jim Rees

Subject: [PATCH 25/88] pnfsblock: bl_read_pagelist

From: Fred Isaman <[email protected]>

Note: When the upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as an error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.

When the upper layer resends the read/write request to finish the rest
of the request, pgbase is the position where we should start.

Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 252 +++++++++++++++++++++++++++++++++++++-
fs/nfs/blocklayout/blocklayout.h | 1 +
fs/nfs/blocklayout/extents.c | 6 +
3 files changed, 256 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index f54e9a9..22ea965 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -32,6 +32,7 @@
#include <linux/module.h>
#include <linux/init.h>

+#include <linux/bio.h> /* struct bio */
#include <linux/vmalloc.h>
#include "blocklayout.h"

@@ -44,6 +45,45 @@ MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");
/* Callback operations to the pNFS client */
struct pnfs_client_operations *pnfs_callback_ops;

+static void print_page(struct page *page)
+{
+ dprintk("PRINTPAGE page %p\n", page);
+ dprintk(" PagePrivate %d\n", PagePrivate(page));
+ dprintk(" PageUptodate %d\n", PageUptodate(page));
+ dprintk(" PageError %d\n", PageError(page));
+ dprintk(" PageDirty %d\n", PageDirty(page));
+ dprintk(" PageReferenced %d\n", PageReferenced(page));
+ dprintk(" PageLocked %d\n", PageLocked(page));
+ dprintk(" PageWriteback %d\n", PageWriteback(page));
+ dprintk(" PageMappedToDisk %d\n", PageMappedToDisk(page));
+ dprintk("\n");
+}
+
+/* Given the be associated with isect, determine if page data needs to be
+ * initialized.
+ */
+static int is_hole(struct pnfs_block_extent *be, sector_t isect)
+{
+ if (be->be_state == PNFS_BLOCK_NONE_DATA)
+ return 1;
+ else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+ return 0;
+ else
+ return !is_sector_initialized(be->be_inval, isect);
+}
+
+static int
+dont_like_caller(struct nfs_page *req)
+{
+ if (atomic_read(&req->wb_complete)) {
+ /* Called by _multi */
+ return 1;
+ } else {
+ /* Called by _one */
+ return 0;
+ }
+}
+
static enum pnfs_try_status
bl_commit(struct pnfs_layout_type *lo,
int sync,
@@ -53,16 +93,222 @@ bl_commit(struct pnfs_layout_type *lo,
return PNFS_NOT_ATTEMPTED;
}

+/* The data we are handed might be spread across several bios. We need
+ * to track when the last one is finished.
+ */
+struct parallel_io {
+ struct kref refcnt;
+ struct rpc_call_ops call_ops;
+ void (*pnfs_callback) (void *data);
+ void *data;
+};
+
+static inline struct parallel_io *alloc_parallel(void *data)
+{
+ struct parallel_io *rv;
+
+ rv = kmalloc(sizeof(*rv), GFP_KERNEL);
+ if (rv) {
+ rv->data = data;
+ kref_init(&rv->refcnt);
+ }
+ return rv;
+}
+
+static inline void get_parallel(struct parallel_io *p)
+{
+ kref_get(&p->refcnt);
+}
+
+static void destroy_parallel(struct kref *kref)
+{
+ struct parallel_io *p = container_of(kref, struct parallel_io, refcnt);
+
+ dprintk("%s enter\n", __func__);
+ p->pnfs_callback(p->data);
+ kfree(p);
+}
+
+static inline void put_parallel(struct parallel_io *p)
+{
+ kref_put(&p->refcnt, destroy_parallel);
+}
+
+static struct bio *
+bl_submit_bio(int rw, struct bio *bio)
+{
+ if (bio) {
+ get_parallel(bio->bi_private);
+ dprintk("%s submitting %s bio %u@%llu\n", __func__,
+ rw == READ ? "read" : "write",
+ bio->bi_size, (u64)bio->bi_sector);
+ submit_bio(rw, bio);
+ }
+ return NULL;
+}
+
+static inline void
+bl_done_with_rpage(struct page *page, const int ok)
+{
+ if (ok) {
+ SetPageUptodate(page);
+ } else {
+ ClearPageUptodate(page);
+ SetPageError(page);
+ }
+ /* Page is unlocked via rpc_release. Should really be done here. */
+}
+
+/* This is basically copied from mpage_end_io_read */
+static void bl_end_io_read(struct bio *bio, int err)
+{
+ void *data = bio->bi_private;
+ const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+ struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+
+ do {
+ struct page *page = bvec->bv_page;
+
+ if (--bvec >= bio->bi_io_vec)
+ prefetchw(&bvec->bv_page->flags);
+ bl_done_with_rpage(page, uptodate);
+ } while (bvec >= bio->bi_io_vec);
+ bio_put(bio);
+ put_parallel(data);
+}
+
+static void bl_read_cleanup(struct work_struct *work)
+{
+ struct rpc_task *task;
+ struct nfs_read_data *rdata;
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+ rdata = container_of(task, struct nfs_read_data, task);
+ pnfs_callback_ops->nfs_readlist_complete(rdata);
+}
+
+static void
+bl_end_par_io_read(void *data)
+{
+ struct nfs_read_data *rdata = data;
+
+ INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
+ schedule_work(&rdata->task.u.tk_work);
+}
+
+/* We don't want normal .rpc_call_done callback used, so we replace it
+ * with this stub.
+ */
+static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
+{
+ return;
+}
+
static enum pnfs_try_status
bl_read_pagelist(struct pnfs_layout_type *lo,
struct page **pages,
unsigned int pgbase,
unsigned nr_pages,
- loff_t offset,
+ loff_t f_offset,
size_t count,
- struct nfs_read_data *nfs_data)
+ struct nfs_read_data *rdata)
{
- dprintk("%s enter\n", __func__);
+ int i, hole;
+ struct bio *bio = NULL;
+ struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+ sector_t isect, extent_length = 0;
+ struct parallel_io *par;
+ int pg_index = pgbase >> PAGE_CACHE_SHIFT;
+
+ dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
+ nr_pages, f_offset, count);
+
+ if (dont_like_caller(rdata->req)) {
+ dprintk("%s dont_like_caller failed\n", __func__);
+ goto use_mds;
+ }
+ par = alloc_parallel(rdata);
+ if (!par)
+ goto use_mds;
+ par->call_ops = *rdata->pdata.call_ops;
+ par->call_ops.rpc_call_done = bl_rpc_do_nothing;
+ par->pnfs_callback = bl_end_par_io_read;
+ /* At this point, we can no longer jump to use_mds */
+
+ isect = (sector_t) (f_offset >> 9);
+ /* Code assumes extents are page-aligned */
+ for (i = pg_index; i < nr_pages; i++) {
+ if (!extent_length) {
+ /* We've used up the previous extent */
+ put_extent(be);
+ put_extent(cow_read);
+ bio = bl_submit_bio(READ, bio);
+ /* Get the next one */
+ be = find_get_extent(BLK_LSEG2EXT(rdata->pdata.lseg),
+ isect, &cow_read);
+ if (!be) {
+ /* Error out this page */
+ bl_done_with_rpage(pages[i], 0);
+ break;
+ }
+ extent_length = be->be_length -
+ (isect - be->be_f_offset);
+ if (cow_read) {
+ sector_t cow_length = cow_read->be_length -
+ (isect - cow_read->be_f_offset);
+ extent_length = min(extent_length, cow_length);
+ }
+ }
+ hole = is_hole(be, isect);
+ if (hole && !cow_read) {
+ bio = bl_submit_bio(READ, bio);
+ /* Fill hole w/ zeroes w/o accessing device */
+ dprintk("%s Zeroing page for hole\n", __func__);
+ zero_user(pages[i], 0,
+ min_t(int, PAGE_CACHE_SIZE, count));
+ print_page(pages[i]);
+ bl_done_with_rpage(pages[i], 1);
+ } else {
+ struct pnfs_block_extent *be_read;
+
+ be_read = (hole && cow_read) ? cow_read : be;
+ for (;;) {
+ if (!bio) {
+ bio = bio_alloc(GFP_NOIO, nr_pages - i);
+ if (!bio) {
+ /* Error out this page */
+ bl_done_with_rpage(pages[i], 0);
+ break;
+ }
+ bio->bi_sector = isect -
+ be_read->be_f_offset +
+ be_read->be_v_offset;
+ bio->bi_bdev = be_read->be_mdev;
+ bio->bi_end_io = bl_end_io_read;
+ bio->bi_private = par;
+ }
+ if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
+ break;
+ bio = bl_submit_bio(READ, bio);
+ }
+ }
+ isect += PAGE_CACHE_SIZE >> 9;
+ extent_length -= PAGE_CACHE_SIZE >> 9;
+ }
+ if ((isect << 9) >= rdata->inode->i_size) {
+ rdata->res.eof = 1;
+ rdata->res.count = rdata->inode->i_size - f_offset;
+ } else {
+ rdata->res.count = (isect << 9) - f_offset;
+ }
+ put_extent(be);
+ put_extent(cow_read);
+ bl_submit_bio(READ, bio);
+ put_parallel(par);
+ return PNFS_ATTEMPTED;
+
+ use_mds:
+ dprintk("Giving up and using normal NFS\n");
return PNFS_NOT_ATTEMPTED;
}

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index e992b94..8b06c93 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -208,6 +208,7 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
struct pnfs_block_extent **cow_read);
void put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *alloc_extent(void);
+int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 944f824..31fe359 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -33,6 +33,12 @@
#include "blocklayout.h"
#define NFSDBG_FACILITY NFSDBG_PNFS_LD

+int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
+{
+ /* STUB */
+ return 0;
+}
+
static void print_bl_extent(struct pnfs_block_extent *be)
{
dprintk("PRINT EXTENT extent %p\n", be);
--
1.7.4.1
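
The parallel_io bookkeeping in this patch follows a common pattern: the submitter holds one reference, each submitted bio takes another, and the completion callback runs when the last reference is dropped. A minimal single-threaded user-space sketch of that pattern follows. The names `par_io`, `par_alloc`, `par_get`, `par_put` and `count_done` are hypothetical, and a plain int replaces the kernel's atomic kref, so this is an illustration rather than the driver's implementation.

```c
#include <stdlib.h>

/* Hypothetical model of struct parallel_io; not safe for concurrent use. */
struct par_io {
    int refcnt;               /* kref in the kernel code */
    void (*done)(void *data); /* models pnfs_callback */
    void *data;
};

/* Allocate with one reference held by the submitter. */
static struct par_io *par_alloc(void (*done)(void *), void *data)
{
    struct par_io *p = malloc(sizeof(*p));

    if (p) {
        p->refcnt = 1;
        p->done = done;
        p->data = data;
    }
    return p;
}

/* Taken once per submitted bio. */
static void par_get(struct par_io *p)
{
    p->refcnt++;
}

/* Dropped by each bio completion and, finally, by the submitter. */
static void par_put(struct par_io *p)
{
    if (--p->refcnt == 0) {
        p->done(p->data); /* runs exactly once, after the last put */
        free(p);
    }
}

/* Example callback: count how many times completion fired. */
static void count_done(void *data)
{
    ++*(int *)data;
}
```

Because the submitter keeps its own reference until every bio has been queued, the callback cannot fire early even if the first bio completes before the second one is submitted.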


2011-06-07 17:34:49

by Jim Rees

Subject: [PATCH 75/88] SQUASHME: pnfsblock: compile error in blocklayout code

From: Steve Dickson <[email protected]>

I needed to make the following change to get the
block layout code to compile in a Fedora build
environment.

Signed-off-by: Steve Dickson <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 7db7768..9e7bd62 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -34,7 +34,7 @@

#include <linux/nfs_fs.h>
#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/
-#include <../pnfs.h>
+#include "../pnfs.h"

#define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)

--
1.7.4.1


2011-06-07 17:34:19

by Jim Rees

Subject: [PATCH 70/88] SQUASHME: pnfsblock: use nfs4_deviceid

From: Benny Halevy <[email protected]>

---
fs/nfs/blocklayout/blocklayout.c | 2 +-
fs/nfs/blocklayout/blocklayout.h | 10 +++++-----
fs/nfs/blocklayout/blocklayoutdev.c | 8 ++++----
fs/nfs/blocklayout/extents.c | 2 +-
4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index f49c68c..071e7ef 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -683,7 +683,7 @@ static void free_blk_mountid(struct block_mount_id *mid)
*/
static struct pnfs_block_dev *
nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
- struct pnfs_deviceid *d_id,
+ struct nfs4_deviceid *d_id,
struct list_head *sdlist)
{
struct pnfs_device *dev;
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index fc0fb23..d62e3f9 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -55,7 +55,7 @@ struct block_mount_id {

struct pnfs_block_dev {
struct list_head bm_node;
- struct pnfs_deviceid bm_mdevid; /* associated devid */
+ struct nfs4_deviceid bm_mdevid; /* associated devid */
struct block_device *bm_mdev; /* meta device itself */
};

@@ -132,7 +132,7 @@ struct pnfs_inval_tracking {
struct pnfs_block_extent {
struct kref be_refcnt;
struct list_head be_node; /* link into lseg list */
- struct pnfs_deviceid be_devid; /* STUB - remevable??? */
+ struct nfs4_deviceid be_devid; /* STUB - remevable??? */
struct block_device *be_mdev;
sector_t be_f_offset; /* the starting offset in the file */
sector_t be_length; /* the size of the extent */
@@ -144,7 +144,7 @@ struct pnfs_block_extent {
/* Shortened extent used by LAYOUTCOMMIT */
struct pnfs_block_short_extent {
struct list_head bse_node;
- struct pnfs_deviceid bse_devid; /* STUB - removable??? */
+ struct nfs4_deviceid bse_devid; /* STUB - removable??? */
struct block_device *bse_mdev;
sector_t bse_f_offset; /* the starting offset in the file */
sector_t bse_length; /* the size of the extent */
@@ -226,7 +226,7 @@ uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
memcpy((x), p, nbytes); \
p += XDR_QUADLEN(nbytes); \
} while (0)
-#define READ_DEVID(x) COPYMEM((x)->data, NFS4_PNFS_DEVICEID4_SIZE)
+#define READ_DEVID(x) COPYMEM((x)->data, NFS4_DEVICEID4_SIZE)
#define READ_SECTOR(x) do { \
READ64(tmp); \
if (tmp & 0x1ff) { \
@@ -248,7 +248,7 @@ uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
#define WRITEMEM(ptr, nbytes) do { \
p = xdr_encode_opaque_fixed(p, ptr, nbytes); \
} while (0)
-#define WRITE_DEVID(x) WRITEMEM((x)->data, NFS4_PNFS_DEVICEID4_SIZE)
+#define WRITE_DEVID(x) WRITEMEM((x)->data, NFS4_DEVICEID4_SIZE)

/* blocklayoutdev.c */
struct block_device *nfs4_blkdev_get(dev_t dev);
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index e9ea86a..17bd25a 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -133,7 +133,7 @@ nfs4_blk_decode_device(struct nfs_server *server,
goto out_err;

rv->bm_mdev = bd;
- memcpy(&rv->bm_mdevid, &dev->dev_id, sizeof(struct pnfs_deviceid));
+ memcpy(&rv->bm_mdevid, &dev->dev_id, sizeof(struct nfs4_deviceid));
dprintk("%s Created device %s with bd_block_size %u\n",
__func__,
bd->bd_disk->disk_name,
@@ -153,7 +153,7 @@ out_err:

/* Map deviceid returned by the server to constructed block_device */
static struct block_device *translate_devid(struct pnfs_layout_hdr *lo,
- struct pnfs_deviceid *id)
+ struct nfs4_deviceid *id)
{
struct block_device *rv = NULL;
struct block_mount_id *mid;
@@ -164,7 +164,7 @@ static struct block_device *translate_devid(struct pnfs_layout_hdr *lo,
spin_lock(&mid->bm_lock);
list_for_each_entry(dev, &mid->bm_devlist, bm_node) {
if (memcmp(id->data, dev->bm_mdevid.data,
- NFS4_PNFS_DEVICEID4_SIZE) == 0) {
+ NFS4_DEVICEID4_SIZE) == 0) {
rv = dev->bm_mdev;
goto out;
}
@@ -254,7 +254,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
READ32(count);

dprintk("%s enter, number of extents %i\n", __func__, count);
- BLK_READBUF(p, end, (28 + NFS4_PNFS_DEVICEID4_SIZE) * count);
+ BLK_READBUF(p, end, (28 + NFS4_DEVICEID4_SIZE) * count);

/* Decode individual extents, putting them in temporary
* staging area until whole layout is decoded to make error
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 20cc863..40dff82 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -794,7 +794,7 @@ _prep_new_extent(struct pnfs_block_extent *new,
{
kref_init(&new->be_refcnt);
/* don't need to INIT_LIST_HEAD(&new->be_node) */
- memcpy(&new->be_devid, &orig->be_devid, sizeof(struct pnfs_deviceid));
+ memcpy(&new->be_devid, &orig->be_devid, sizeof(struct nfs4_deviceid));
new->be_mdev = orig->be_mdev;
new->be_f_offset = offset;
new->be_length = length;
--
1.7.4.1


2011-06-09 13:32:17

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-09 07:49, Jim Rees wrote:
> Benny Halevy wrote:
>
> >> But note that this patch doesn't change anything unless you set the sysctl.
> > there is a default value of 2M. maybe we can set it to page size by
> > default so other layout are not affected and block layout can let
> > users set it by hand if they care about performance. does this make
> > sense?
>
> If doing it at all why use a sysctl rather than a mount option?
> Or maybe coding the logic for prefetching the layout iff sequential
> access is detected is the right thing to do.
>
> I would rather see some automatic solution than to add either a sysctl or a
> mount option. For now you can just drop that patch, as it's not needed for
> basic pnfs block.
>
> My understanding is that layoutget specifies a min and max, and the server

There's a min. What do you consider the max?
Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?

> is returning the min. Trond and Fred believe this should be fixed on the
> server.

Agreed.

Benny

> Here's the original report of the problem:
>
> From: Bergwolf
>
> From the network trace for pnfs, we can see that the root cause of the slow
> performance is too many small layoutgets. Specifically, the client asks for a
> layout of only 4K (one page) each time, and the server returns 8K due to
> block size alignment.
>
> The total IO time is 256/1.68 = 152 seconds.
> There are 256*1024/8 = 32768 layoutgets for the 256MB file.
> On average, the time spent on each layoutget is 0.00456 seconds according to the
> trace.
> The total layoutget time is 32768 * 0.00456 = 149 seconds, which takes up about
> 98% of the total IO time.
>
> So we should optimize layoutget's granularity to get better performance. For
> instance, use a configurable prefetch size of 2MB or so.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
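
The arithmetic in the quoted report can be reproduced with two small helpers. This is only a sketch: `layoutget_calls` and `layoutget_seconds` are hypothetical names, sizes are in KB, and the estimate for larger layouts assumes the per-call latency stays at the 0.00456 seconds seen in the trace, which the report does not claim.

```c
/* Number of LAYOUTGET calls needed to cover file_kb when each call
 * returns layout_kb (rounded up). layout_kb must be non-zero. */
static unsigned long layoutget_calls(unsigned long file_kb,
                                     unsigned long layout_kb)
{
    return (file_kb + layout_kb - 1) / layout_kb;
}

/* Total time spent in LAYOUTGET for a given per-call latency. */
static double layoutget_seconds(unsigned long calls, double secs_per_call)
{
    return (double)calls * secs_per_call;
}
```

For the 256MB file, 8K layouts need 32768 calls (about 149 seconds at the traced latency), while 2MB layouts would need 128 calls.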

2011-06-10 03:07:00

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-09 06:58, Jim Rees wrote:
> Benny Halevy wrote:
>
> > My understanding is that layoutget specifies a min and max, and the server
>
> There's a min. What do you consider the max?
> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>
> The spec doesn't say max, it says "desired." I guess I assumed the server
> wouldn't normally return more than desired.

No, the server may freely upgrade the returned layout segment by returning
a layout for a larger byte range or even returning a RW layout where a READ
layout was asked for.

Benny

>
> 18.43.3. DESCRIPTION
> ...
>
> The LAYOUTGET operation returns layout information for the specified
> byte-range: a layout. The client actually specifies two ranges, both
> starting at the offset in the loga_offset field. The first range is
> between loga_offset and loga_offset + loga_length - 1 inclusive.
> This range indicates the desired range the client wants the layout to
> cover. The second range is between loga_offset and loga_offset +
> loga_minlength - 1 inclusive. This range indicates the required
> range the client needs the layout to cover. Thus, loga_minlength
> MUST be less than or equal to loga_length.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
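
The two ranges described in the quoted spec text can be written down as a pair of checks. This is a sketch under stated assumptions: `struct lg_args`, `lg_args_valid` and `lg_covers_required` are hypothetical names, and special length values and 64-bit overflow are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the LAYOUTGET arguments (loga_* fields). */
struct lg_args {
    uint64_t offset;    /* loga_offset */
    uint64_t length;    /* loga_length: desired range */
    uint64_t minlength; /* loga_minlength: required range */
};

/* The spec requires loga_minlength <= loga_length. */
static bool lg_args_valid(const struct lg_args *a)
{
    return a->minlength <= a->length;
}

/* Does a returned layout [r_off, r_off + r_len) cover the required range?
 * The server may return more than the desired range, so there is
 * deliberately no upper bound check here. */
static bool lg_covers_required(const struct lg_args *a,
                               uint64_t r_off, uint64_t r_len)
{
    return r_off <= a->offset &&
           r_off + r_len >= a->offset + a->minlength;
}
```

A client asking for a minimum of 4K at offset 4096 is satisfied by an 8K layout starting at 0, which matches the block-aligned reply described in the original report.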

2011-06-07 17:33:13

by Jim Rees

Subject: [PATCH 60/88] SQUASHME: pnfs-block: nfs4_blk_add_block_disk ret must be signed

From: Benny Halevy <[email protected]>

otherwise checking for ret < 0 is futile.

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayoutdev.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index ef39c36..a866f5c 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -105,7 +105,8 @@ nfs4_blk_add_block_disk(struct device *cdev,
static char *claim_ptr = "I belong to pnfs block driver";
struct block_device *bdev;
struct gendisk *gd;
- unsigned int major, minor, ret = 0;
+ unsigned int major, minor;
+ int ret;
dev_t dev;

dprintk("%s enter \n", __func__);
--
1.7.4.1
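
Why the check is futile with an unsigned ret can be shown in user space. This is a sketch with hypothetical function names. The unsigned case widens the value to long long before comparing, which yields the same result as `ret < 0` on an unsigned int while avoiding a tautological-comparison warning.

```c
/* With unsigned ret, a negative error code wraps to a large positive
 * value, so a "ret < 0" test can never be true. */
static int detects_error_unsigned(int rc)
{
    unsigned int ret = (unsigned int)rc; /* e.g. -5 wraps to UINT_MAX - 4 */
    long long widened = ret;             /* the value "ret < 0" really sees */

    return widened < 0;
}

/* With signed ret, the error code keeps its sign and the test works. */
static int detects_error_signed(int rc)
{
    int ret = rc;

    return ret < 0;
}
```

Passing -5 (the usual value of -EIO) to both functions gives 0 for the unsigned version and 1 for the signed one.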


2011-06-07 17:26:30

by Jim Rees

Subject: [PATCH 05/88] pnfs: xdr support for three word attribute bitmap

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Fred Isaman <[email protected]>
[pnfs: clean up xdr]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/nfs4xdr.c | 59 +++++++++++++++++++++++++++++++++++----------
include/linux/nfs_fs_sb.h | 2 +-
include/linux/nfs_xdr.h | 2 +-
3 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index aeb7bc2..5e4447c 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -91,7 +91,7 @@ static int nfs4_stat_to_errno(int);
#define encode_getfh_maxsz (op_encode_hdr_maxsz)
#define decode_getfh_maxsz (op_decode_hdr_maxsz + 1 + \
((3+NFS4_FHSIZE) >> 2))
-#define nfs4_fattr_bitmap_maxsz 3
+#define nfs4_fattr_bitmap_maxsz 4
#define encode_getattr_maxsz (op_encode_hdr_maxsz + nfs4_fattr_bitmap_maxsz)
#define nfs4_name_maxsz (1 + ((3 + NFS4_MAXNAMLEN) >> 2))
#define nfs4_path_maxsz (1 + ((3 + NFS4_MAXPATHLEN) >> 2))
@@ -1095,6 +1095,35 @@ static void encode_getattr_two(struct xdr_stream *xdr, uint32_t bm0, uint32_t bm
hdr->replen += decode_getattr_maxsz;
}

+static void
+encode_getattr_three(struct xdr_stream *xdr,
+ uint32_t bm0, uint32_t bm1, uint32_t bm2,
+ struct compound_hdr *hdr)
+{
+ __be32 *p;
+
+ p = reserve_space(xdr, 4);
+ *p = cpu_to_be32(OP_GETATTR);
+ if (bm2) {
+ p = reserve_space(xdr, 16);
+ *p++ = cpu_to_be32(3);
+ *p++ = cpu_to_be32(bm0);
+ *p++ = cpu_to_be32(bm1);
+ *p = cpu_to_be32(bm2);
+ } else if (bm1) {
+ p = reserve_space(xdr, 12);
+ *p++ = cpu_to_be32(2);
+ *p++ = cpu_to_be32(bm0);
+ *p = cpu_to_be32(bm1);
+ } else {
+ p = reserve_space(xdr, 8);
+ *p++ = cpu_to_be32(1);
+ *p = cpu_to_be32(bm0);
+ }
+ hdr->nops++;
+ hdr->replen += decode_getattr_maxsz;
+}
+
static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
{
encode_getattr_two(xdr, bitmask[0] & nfs4_fattr_bitmap[0],
@@ -2947,14 +2976,17 @@ static int decode_attr_bitmap(struct xdr_stream *xdr, uint32_t *bitmap)
goto out_overflow;
bmlen = be32_to_cpup(p);

- bitmap[0] = bitmap[1] = 0;
+ bitmap[0] = bitmap[1] = bitmap[2] = 0;
p = xdr_inline_decode(xdr, (bmlen << 2));
if (unlikely(!p))
goto out_overflow;
if (bmlen > 0) {
bitmap[0] = be32_to_cpup(p++);
- if (bmlen > 1)
- bitmap[1] = be32_to_cpup(p);
+ if (bmlen > 1) {
+ bitmap[1] = be32_to_cpup(p++);
+ if (bmlen > 2)
+ bitmap[2] = be32_to_cpup(p);
+ }
}
return 0;
out_overflow:
@@ -2986,8 +3018,9 @@ static int decode_attr_supported(struct xdr_stream *xdr, uint32_t *bitmap, uint3
return ret;
bitmap[0] &= ~FATTR4_WORD0_SUPPORTED_ATTRS;
} else
- bitmask[0] = bitmask[1] = 0;
- dprintk("%s: bitmask=%08x:%08x\n", __func__, bitmask[0], bitmask[1]);
+ bitmask[0] = bitmask[1] = bitmask[2] = 0;
+ dprintk("%s: bitmask=%08x:%08x:%08x\n", __func__,
+ bitmask[0], bitmask[1], bitmask[2]);
return 0;
}

@@ -4041,7 +4074,7 @@ out_overflow:
static int decode_server_caps(struct xdr_stream *xdr, struct nfs4_server_caps_res *res)
{
__be32 *savep;
- uint32_t attrlen, bitmap[2] = {0};
+ uint32_t attrlen, bitmap[3] = {0};
int status;

if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4067,7 +4100,7 @@ xdr_error:
static int decode_statfs(struct xdr_stream *xdr, struct nfs_fsstat *fsstat)
{
__be32 *savep;
- uint32_t attrlen, bitmap[2] = {0};
+ uint32_t attrlen, bitmap[3] = {0};
int status;

if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4099,7 +4132,7 @@ xdr_error:
static int decode_pathconf(struct xdr_stream *xdr, struct nfs_pathconf *pathconf)
{
__be32 *savep;
- uint32_t attrlen, bitmap[2] = {0};
+ uint32_t attrlen, bitmap[3] = {0};
int status;

if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4239,7 +4272,7 @@ static int decode_getfattr_generic(struct xdr_stream *xdr, struct nfs_fattr *fat
{
__be32 *savep;
uint32_t attrlen,
- bitmap[2] = {0};
+ bitmap[3] = {0};
int status;

status = decode_op_hdr(xdr, OP_GETATTR);
@@ -4328,7 +4361,7 @@ static int decode_attr_pnfstype(struct xdr_stream *xdr, uint32_t *bitmap,
static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
{
__be32 *savep;
- uint32_t attrlen, bitmap[2];
+ uint32_t attrlen, bitmap[3];
int status;

if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4775,7 +4808,7 @@ static int decode_getacl(struct xdr_stream *xdr, struct rpc_rqst *req,
{
__be32 *savep;
uint32_t attrlen,
- bitmap[2] = {0};
+ bitmap[3] = {0};
struct kvec *iov = req->rq_rcv_buf.head;
int status;

@@ -6606,7 +6639,7 @@ out:
int nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry,
int plus)
{
- uint32_t bitmap[2] = {0};
+ uint32_t bitmap[3] = {0};
uint32_t len;
__be32 *p = xdr_inline_decode(xdr, 4);
if (unlikely(!p))
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 87694ca..61052fc 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -130,7 +130,7 @@ struct nfs_server {
#endif

#ifdef CONFIG_NFS_V4
- u32 attr_bitmask[2];/* V4 bitmask representing the set
+ u32 attr_bitmask[3];/* V4 bitmask representing the set
of attributes supported on this
filesystem */
u32 cache_consistency_bitmask[2];
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index a4f3dd6..6c32c9d 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -957,7 +957,7 @@ struct nfs4_server_caps_arg {
};

struct nfs4_server_caps_res {
- u32 attr_bitmask[2];
+ u32 attr_bitmask[3];
u32 acl_bitmask;
u32 has_links;
u32 has_symlinks;
--
1.7.4.1
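
The length selection in encode_getattr_three and the matching decode can be modelled in user space as follows. This is a sketch only: `encode_bitmap3` and `decode_bitmap3` are hypothetical names, and words are kept in host byte order, whereas the kernel code converts with cpu_to_be32 and be32_to_cpup.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the bitmap length followed by only the words that are needed.
 * Returns the number of 32-bit words written (2, 3 or 4), matching the
 * 8, 12 and 16 bytes reserved in the patch. */
static size_t encode_bitmap3(uint32_t bm0, uint32_t bm1, uint32_t bm2,
                             uint32_t out[4])
{
    uint32_t len = bm2 ? 3 : (bm1 ? 2 : 1);

    out[0] = len;
    out[1] = bm0;
    if (len > 1)
        out[2] = bm1;
    if (len > 2)
        out[3] = bm2;
    return (size_t)len + 1;
}

/* Read back up to three words; words not present on the wire stay zero. */
static void decode_bitmap3(const uint32_t *in, uint32_t bitmap[3])
{
    uint32_t bmlen = in[0];

    bitmap[0] = bitmap[1] = bitmap[2] = 0;
    if (bmlen > 0)
        bitmap[0] = in[1];
    if (bmlen > 1)
        bitmap[1] = in[2];
    if (bmlen > 2)
        bitmap[2] = in[3];
}
```

Note that a non-zero third word forces all three words onto the wire even when the second word is zero, which is why the maximum encoded size grows from 3 to 4 words.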


2011-06-07 17:34:31

by Jim Rees

Subject: [PATCH 72/88] SQAUSHME: pnfsblock: no PNFS_NFS_SERVER

From: Benny Halevy <[email protected]>

---
fs/nfs/blocklayout/blocklayout.c | 2 +-
fs/nfs/blocklayout/blocklayout.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 5dbd9c6..5e50c93 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -619,7 +619,7 @@ static int
bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
struct nfs4_layoutcommit_args *arg)
{
- struct nfs_server *nfss = PNFS_NFS_SERVER(lo);
+ struct nfs_server *nfss = NFS_SERVER(lo->inode);
struct bl_layoutupdate_data *layoutupdate_data;

dprintk("%s enter\n", __func__);
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index d62e3f9..7db7768 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -191,7 +191,7 @@ struct bl_layoutupdate_data {
struct list_head ranges;
};

-#define BLK_ID(lo) ((struct block_mount_id *)(PNFS_NFS_SERVER(lo)->pnfs_ld_data))
+#define BLK_ID(lo) ((struct block_mount_id *)(NFS_SERVER(lo->inode)->pnfs_ld_data))

static inline struct pnfs_block_layout *
BLK_LO2EXT(struct pnfs_layout_hdr *lo)
--
1.7.4.1


2011-06-07 17:26:16

by Jim Rees

Subject: [PATCH 03/88] pnfs_post_submit: Restore "pnfs: pnfs_do_flush" part 1

From: Fred Isaman <[email protected]>

This adds the hooks in nfs_write_begin and nfs_write_end needed
by the block server

Signed-off-by: Fred Isaman <[email protected]>
[pnfs: prevent offset overflow in _pnfs_do_flush]
[pnfs: pnfs_has_layout take_ref parameter should be bool]
[pnfs: clean up put_unlock_current_layout's interface]
[pnfs: introduce lseg valid bit]
[pnfsblock: cleanup pnfs_free_fsdata]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/file.c | 18 +++++++++++++-
fs/nfs/pnfs.c | 41 +++++++++++++++++++++++++++++++++
fs/nfs/pnfs.h | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 129 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 2f093ed..8c54b32 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -384,12 +384,15 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index = pos >> PAGE_CACHE_SHIFT;
struct page *page;
int once_thru = 0;
+ struct pnfs_layout_segment *lseg;

dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
file->f_path.dentry->d_parent->d_name.name,
file->f_path.dentry->d_name.name,
mapping->host->i_ino, len, (long long) pos);
-
+ lseg = pnfs_update_layout(mapping->host,
+ nfs_file_open_context(file),
+ pos, len, IOMODE_RW, GFP_NOFS);
start:
/*
* Prevent starvation issues if someone is doing a consistency
@@ -409,6 +412,9 @@ start:
if (ret) {
unlock_page(page);
page_cache_release(page);
+ *pagep = NULL;
+ *fsdata = NULL;
+ goto out;
} else if (!once_thru &&
nfs_want_read_modify_write(file, page, pos, len)) {
once_thru = 1;
@@ -417,6 +423,12 @@ start:
if (!ret)
goto start;
}
+ ret = pnfs_write_begin(file, page, pos, len, lseg, fsdata);
+ out:
+ if (ret) {
+ put_lseg(lseg);
+ *fsdata = NULL;
+ }
return ret;
}

@@ -426,6 +438,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
{
unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
int status;
+ struct pnfs_layout_segment *lseg;

dfprintk(PAGECACHE, "NFS: write_end(%s/%s(%ld), %u@%lld)\n",
file->f_path.dentry->d_parent->d_name.name,
@@ -452,10 +465,13 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
zero_user_segment(page, pglen, PAGE_CACHE_SIZE);
}

+ lseg = nfs4_pull_lseg_from_fsdata(file, fsdata);
status = nfs_updatepage(file, page, offset, copied);

unlock_page(page);
page_cache_release(page);
+ pnfs_write_end_cleanup(file, fsdata);
+ put_lseg(lseg);

if (status < 0)
return status;
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 12b228a..c88a8ee 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1138,6 +1138,41 @@ pnfs_try_to_write_data(struct nfs_write_data *wdata,
}

/*
+ * This gives the layout driver an opportunity to read in page "around"
+ * the data to be written. It returns 0 on success, otherwise an error code
+ * which will either be passed up to user, or ignored if
+ * some previous part of write succeeded.
+ * Note the range [pos, pos+len-1] is entirely within the page.
+ */
+int _pnfs_write_begin(struct inode *inode, struct page *page,
+ loff_t pos, unsigned len,
+ struct pnfs_layout_segment *lseg,
+ struct pnfs_fsdata **fsdata)
+{
+ struct pnfs_fsdata *data;
+ int status = 0;
+
+ dprintk("--> %s: pos=%llu len=%u\n",
+ __func__, (unsigned long long)pos, len);
+ data = kzalloc(sizeof(struct pnfs_fsdata), GFP_KERNEL);
+ if (!data) {
+ status = -ENOMEM;
+ goto out;
+ }
+ data->lseg = lseg; /* refcount passed into data to be managed there */
+ status = NFS_SERVER(inode)->pnfs_curr_ld->write_begin(
+ lseg, page, pos, len, data);
+ if (status) {
+ kfree(data);
+ data = NULL;
+ }
+out:
+ *fsdata = data;
+ dprintk("<-- %s: status=%d\n", __func__, status);
+ return status;
+}
+
+/*
* Called by non rpc-based layout drivers
*/
int
@@ -1250,6 +1285,12 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
}
EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);

+void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
+{
+ /* lseg refcounting handled directly in nfs_write_end */
+ kfree(fsdata);
+}
+
/*
* For the LAYOUT4_NFSV4_1_FILES layout type, NFS_DATA_SYNC WRITEs and
* NFS_UNSTABLE WRITEs with a COMMIT to data servers must store enough
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index b3481e5..5712053 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -54,6 +54,10 @@ enum pnfs_try_status {
PNFS_NOT_ATTEMPTED = 1,
};

+struct pnfs_fsdata {
+ struct pnfs_layout_segment *lseg;
+};
+
#ifdef CONFIG_NFS_V4_1

#define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
@@ -106,6 +110,9 @@ struct pnfs_layoutdriver_type {
*/
enum pnfs_try_status (*read_pagelist) (struct nfs_read_data *nfs_data);
enum pnfs_try_status (*write_pagelist) (struct nfs_write_data *nfs_data, int how);
+ int (*write_begin) (struct pnfs_layout_segment *lseg, struct page *page,
+ loff_t pos, unsigned count,
+ struct pnfs_fsdata *fsdata);

void (*free_deviceid_node) (struct nfs4_deviceid_node *);

@@ -180,6 +187,7 @@ enum pnfs_try_status pnfs_try_to_write_data(struct nfs_write_data *,
enum pnfs_try_status pnfs_try_to_read_data(struct nfs_read_data *,
const struct rpc_call_ops *);
bool pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, struct nfs_page *req);
+void pnfs_free_fsdata(struct pnfs_fsdata *fsdata);
void pnfs_cleanup_layoutcommit(struct inode *,
struct nfs4_layoutcommit_data *);
int pnfs_layout_process(struct nfs4_layoutget *lgp);
@@ -193,6 +201,10 @@ void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
struct pnfs_layout_hdr *lo,
struct nfs4_state *open_state);
+int _pnfs_write_begin(struct inode *inode, struct page *page,
+ loff_t pos, unsigned len,
+ struct pnfs_layout_segment *lseg,
+ struct pnfs_fsdata **fsdata);
int mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
struct pnfs_layout_range *recall_range);
@@ -305,6 +317,32 @@ pnfs_ld_layoutret_on_setattr(struct inode *inode)
PNFS_LAYOUTRET_ON_SETATTR;
}

+static inline int pnfs_write_begin(struct file *filp, struct page *page,
+ loff_t pos, unsigned len,
+ struct pnfs_layout_segment *lseg,
+ void **fsdata)
+{
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct nfs_server *nfss = NFS_SERVER(inode);
+ int status = 0;
+
+ *fsdata = lseg;
+ if (lseg && nfss->pnfs_curr_ld->write_begin)
+ status = _pnfs_write_begin(inode, page, pos, len, lseg,
+ (struct pnfs_fsdata **) fsdata);
+ return status;
+}
+
+static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
+{
+ struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
+
+ if (fsdata && nfss->pnfs_curr_ld) {
+ if (nfss->pnfs_curr_ld->write_begin)
+ pnfs_free_fsdata(fsdata);
+ }
+}
+
static inline int pnfs_return_layout(struct inode *ino)
{
struct nfs_inode *nfsi = NFS_I(ino);
@@ -325,6 +363,19 @@ static inline void pnfs_pageio_init(struct nfs_pageio_descriptor *pgio,
pgio->pg_test = ld->pg_test;
}

+static inline struct pnfs_layout_segment *
+nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
+{
+ if (fsdata) {
+ struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
+
+ if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_begin)
+ return ((struct pnfs_fsdata *) fsdata)->lseg;
+ return (struct pnfs_layout_segment *)fsdata;
+ }
+ return NULL;
+}
+
#else /* CONFIG_NFS_V4_1 */

static inline void pnfs_destroy_all_layouts(struct nfs_client *clp)
@@ -372,6 +423,19 @@ static inline int pnfs_return_layout(struct inode *ino)
return 0;
}

+static inline int pnfs_write_begin(struct file *filp, struct page *page,
+ loff_t pos, unsigned len,
+ struct pnfs_layout_segment *lseg,
+ void **fsdata)
+{
+ *fsdata = NULL;
+ return 0;
+}
+
+static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
+{
+}
+
static inline bool
pnfs_ld_layoutret_on_setattr(struct inode *inode)
{
@@ -443,6 +507,13 @@ static inline int pnfs_layoutcommit_inode(struct inode *inode, bool sync)
static inline void nfs4_deviceid_purge_client(struct nfs_client *ncl)
{
}
+
+static inline struct pnfs_layout_segment *
+nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
+{
+ return NULL;
+}
+
#endif /* CONFIG_NFS_V4_1 */

#endif /* FS_NFS_PNFS_H */
--
1.7.4.1
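
Note on the fsdata convention in the patch above: the void *fsdata cookie
carries either a bare lseg pointer (driver has no write_begin hook) or a
kmalloc'ed struct pnfs_fsdata wrapping it (driver has the hook), and
nfs4_pull_lseg_from_fsdata() picks the interpretation by checking the hook.
A minimal userspace sketch of that convention follows. Every model_* name is
a stand-in invented for illustration, calloc/free replace kzalloc/kfree, and
the extra NULL check on the driver pointer is an assumption of the model, not
something the patch does.

```c
/* Userspace model only: model_* types are stand-ins, not the kernel structs. */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct model_lseg { int id; };                    /* for pnfs_layout_segment */
struct model_fsdata { struct model_lseg *lseg; }; /* for pnfs_fsdata */

/* Hypothetical driver descriptor: only the presence of write_begin matters. */
struct model_ld {
	int (*write_begin)(struct model_lseg *lseg, struct model_fsdata *data);
};

static int model_ok_write_begin(struct model_lseg *l, struct model_fsdata *d)
{
	(void)l; (void)d;
	return 0;
}

static int model_bad_write_begin(struct model_lseg *l, struct model_fsdata *d)
{
	(void)l; (void)d;
	return -EIO;
}

/* Store the bare lseg, or a wrapper around it when the driver has the hook. */
static int model_write_begin(const struct model_ld *ld,
			     struct model_lseg *lseg, void **fsdata)
{
	struct model_fsdata *data;
	int status;

	*fsdata = lseg;
	if (!lseg || !ld || !ld->write_begin)
		return 0;
	data = calloc(1, sizeof(*data));
	if (!data) {
		*fsdata = NULL;
		return -ENOMEM;
	}
	data->lseg = lseg;
	status = ld->write_begin(lseg, data);
	if (status) {
		/* On hook failure the wrapper is freed and the cookie is NULL. */
		free(data);
		data = NULL;
	}
	*fsdata = data;
	return status;
}

/* Recover the lseg: unwrap only if the driver would have wrapped it. */
static struct model_lseg *model_pull_lseg(const struct model_ld *ld,
					  void *fsdata)
{
	if (!fsdata)
		return NULL;
	if (ld && ld->write_begin)
		return ((struct model_fsdata *)fsdata)->lseg;
	return fsdata;
}

/* Free the wrapper only when one was allocated. */
static void model_write_end_cleanup(const struct model_ld *ld, void *fsdata)
{
	if (fsdata && ld && ld->write_begin)
		free(fsdata);
}
```

The point of the sketch is that producer and consumer must agree on the same
predicate (hook present or not); if the current layout driver changed between
write_begin and write_end, the cookie would be misinterpreted.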


2011-06-07 17:26:29

by Jim Rees

Subject: [PATCH 04/88] pnfs_post_submit: Restore the pnfs_write_end part of "pnfs: commit and pnfs_write_end"

From: Fred Isaman <[email protected]>

pnfs: commit and pnfs_write_end

Add hooks in the nfs_write_end path, giving a driver the potential for
post-copy manipulation of the page.

[pnfs: pass lseg from write_begin to write_end]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
[pnfs: fix pnfs_commit update_layout range]
Whole file semantics are different for COMMIT (0,0) and layouts
(0,NFS4_MAX_UINT64).
Reported-by: Alexandros Batsakis <[email protected]>
Signed-off-by: Andy Adamson <[email protected]>
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
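
Illustration only, not part of the patch: the hook described above is
optional, so the generic path has to treat "no driver" and "driver without
write_end" as success, and it must still reach the unlock/cleanup code when
the hook fails. The sketch below models that control flow in userspace. All
model_* names are hypothetical, and the strict hook that rejects short copies
is just one possible answer to the "copied < len" question, not what the
block driver does.

```c
/* Userspace model only; names prefixed model_ are hypothetical stand-ins. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_ld {
	/* Optional hook, like the write_end member added by the patch. */
	int (*write_end)(unsigned len, unsigned copied);
};

/* Records which steps of the write_end path ran. */
struct model_trace {
	int updated;
	int cleaned;
};

/* Hypothetical hook: treats a short copy as an error. */
static int model_strict_write_end(unsigned len, unsigned copied)
{
	return copied < len ? -EIO : 0;
}

/* Dispatch to the hook if present; otherwise report success. */
static int model_pnfs_write_end(const struct model_ld *ld,
				unsigned len, unsigned copied)
{
	if (ld && ld->write_end)
		return ld->write_end(len, copied);
	return 0;
}

/* Tail of the write_end path: skip the page update on hook failure,
 * but always run the cleanup step. */
static int model_nfs_write_end(const struct model_ld *ld, unsigned len,
			       unsigned copied, struct model_trace *t)
{
	int status = model_pnfs_write_end(ld, len, copied);

	if (status)
		goto out;
	t->updated = 1;		/* stands in for nfs_updatepage() */
out:
	t->cleaned = 1;		/* stands in for unlock, release, cleanup */
	return status;
}
```

With no hook installed a short copy passes straight through to the page
update, which is the case the CAREFUL comment in the patch is asking about.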
fs/nfs/file.c | 4 ++++
fs/nfs/pnfs.h | 25 +++++++++++++++++++++++++
2 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8c54b32..3af1c00 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -466,8 +466,12 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
}

lseg = nfs4_pull_lseg_from_fsdata(file, fsdata);
+ status = pnfs_write_end(file, page, pos, len, copied, lseg);
+ if (status)
+ goto out;
status = nfs_updatepage(file, page, offset, copied);

+out:
unlock_page(page);
page_cache_release(page);
pnfs_write_end_cleanup(file, fsdata);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 5712053..cfa8ea6 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -113,6 +113,9 @@ struct pnfs_layoutdriver_type {
int (*write_begin) (struct pnfs_layout_segment *lseg, struct page *page,
loff_t pos, unsigned count,
struct pnfs_fsdata *fsdata);
+ int (*write_end)(struct inode *inode, struct page *page, loff_t pos,
+ unsigned count, unsigned copied,
+ struct pnfs_layout_segment *lseg);

void (*free_deviceid_node) (struct nfs4_deviceid_node *);

@@ -333,6 +336,21 @@ static inline int pnfs_write_begin(struct file *filp, struct page *page,
return status;
}

+/* CAREFUL - what happens if copied < len??? */
+static inline int pnfs_write_end(struct file *filp, struct page *page,
+ loff_t pos, unsigned len, unsigned copied,
+ struct pnfs_layout_segment *lseg)
+{
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct nfs_server *nfss = NFS_SERVER(inode);
+
+ if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_end)
+ return nfss->pnfs_curr_ld->write_end(inode, page, pos, len,
+ copied, lseg);
+ else
+ return 0;
+}
+
static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
{
struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
@@ -432,6 +450,13 @@ static inline int pnfs_write_begin(struct file *filp, struct page *page,
return 0;
}

+static inline int pnfs_write_end(struct file *filp, struct page *page,
+ loff_t pos, unsigned len, unsigned copied,
+ struct pnfs_layout_segment *lseg)
+{
+ return 0;
+}
+
static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
{
}
--
1.7.4.1


2011-06-07 17:34:42

by Jim Rees

Subject: [PATCH 74/88] SQUASHME: pnfsblock: use new struct pnfs_layout_hdr

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 22 +++++++---------------
1 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index ac53a3f..23dbe91 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -546,7 +546,7 @@ release_inval_marks(struct pnfs_inval_markings *marks)

/* Note we are relying on caller locking to prevent nasty races. */
static void
-bl_free_layout(struct pnfs_layout_hdr *lo)
+bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
{
struct pnfs_block_layout *bl = BLK_LO2EXT(lo);

@@ -557,7 +557,7 @@ bl_free_layout(struct pnfs_layout_hdr *lo)
}

static struct pnfs_layout_hdr *
-bl_alloc_layout(struct inode *inode)
+bl_alloc_layout_hdr(struct inode *inode)
{
struct pnfs_block_layout *bl;

@@ -1105,15 +1105,17 @@ bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
return 1;
}

-static struct layoutdriver_io_operations blocklayout_io_operations = {
+static struct pnfs_layoutdriver_type blocklayout_type = {
+ .id = LAYOUT_BLOCK_VOLUME,
+ .name = "LAYOUT_BLOCK_VOLUME",
.commit = bl_commit,
.read_pagelist = bl_read_pagelist,
.write_pagelist = bl_write_pagelist,
.write_begin = bl_write_begin,
.write_end = bl_write_end,
.write_end_cleanup = bl_write_end_cleanup,
- .alloc_layout = bl_alloc_layout,
- .free_layout = bl_free_layout,
+ .alloc_layout_hdr = bl_alloc_layout_hdr,
+ .free_layout_hdr = bl_free_layout_hdr,
.alloc_lseg = bl_alloc_lseg,
.free_lseg = bl_free_lseg,
.setup_layoutcommit = bl_setup_layoutcommit,
@@ -1121,20 +1123,10 @@ static struct layoutdriver_io_operations blocklayout_io_operations = {
.cleanup_layoutcommit = bl_cleanup_layoutcommit,
.initialize_mountpoint = bl_initialize_mountpoint,
.uninitialize_mountpoint = bl_uninitialize_mountpoint,
-};
-
-static struct layoutdriver_policy_operations blocklayout_policy_operations = {
.get_stripesize = bl_get_stripesize,
.pg_test = bl_pg_test,
};

-static struct pnfs_layoutdriver_type blocklayout_type = {
- .id = LAYOUT_BLOCK_VOLUME,
- .name = "LAYOUT_BLOCK_VOLUME",
- .ld_io_ops = &blocklayout_io_operations,
- .ld_policy_ops = &blocklayout_policy_operations,
-};
-
static int __init nfs4blocklayout_init(void)
{
int ret;
--
1.7.4.1


2011-06-07 17:34:25

by Jim Rees

Subject: [PATCH 71/88] SQUASHME: pnfsblock: no callback ops

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 19 ++++++++++---------
1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 071e7ef..5dbd9c6 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -44,7 +44,6 @@ MODULE_AUTHOR("Andy Adamson <[email protected]>");
MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");

/* Callback operations to the pNFS client */
-static struct pnfs_client_operations *pnfs_block_callback_ops;

static void print_page(struct page *page)
{
@@ -199,7 +198,7 @@ static void bl_read_cleanup(struct work_struct *work)
dprintk("%s enter\n", __func__);
task = container_of(work, struct rpc_task, u.tk_work);
rdata = container_of(task, struct nfs_read_data, task);
- pnfs_block_callback_ops->nfs_readlist_complete(rdata);
+ pnfs_read_done(rdata);
}

static void
@@ -411,7 +410,7 @@ static void bl_write_cleanup(struct work_struct *work)
mark_extents_written(BLK_LSEG2EXT(wdata->pdata.lseg),
wdata->args.offset, wdata->args.count);
}
- pnfs_block_callback_ops->nfs_writelist_complete(wdata);
+ pnfs_writeback_done(wdata);
}

/* Called when last of bios associated with a bl_write_pagelist call finishes */
@@ -733,7 +732,7 @@ nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
dev->mincount = 0;

dprintk("%s: dev_id: %s\n", __func__, dev->dev_id.data);
- rc = pnfs_block_callback_ops->nfs_getdeviceinfo(server, dev);
+ rc = nfs4_proc_getdeviceinfo(server, dev);
dprintk("%s getdevice info returns %d\n", __func__, rc);
if (rc)
goto out_free;
@@ -783,8 +782,7 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
goto out_error;
dlist->eof = 0;
while (!dlist->eof) {
- status = pnfs_block_callback_ops->nfs_getdevicelist(
- server, fh, dlist);
+ status = nfs4_proc_getdevicelist(server, fh, dlist);
if (status)
goto out_error;
dprintk("%s GETDEVICELIST numdevs=%i, eof=%i\n",
@@ -1140,11 +1138,14 @@ static struct pnfs_layoutdriver_type blocklayout_type = {

static int __init nfs4blocklayout_init(void)
{
+ int ret;
+
dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);

- pnfs_block_callback_ops = pnfs_register_layoutdriver(&blocklayout_type);
- bl_pipe_init();
- return 0;
+ ret = pnfs_register_layoutdriver(&blocklayout_type);
+ if (!ret)
+ bl_pipe_init();
+ return ret;
}

static void __exit nfs4blocklayout_exit(void)
--
1.7.4.1


2011-06-07 17:28:11

by Jim Rees

Subject: [PATCH 20/88] pnfsblock: basic extent code

From: Fred Isaman <[email protected]>

Adds structures and basic create/delete code for extents.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
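
Illustration only, not part of the patch: the delete side added here drains
every extent list and drops one reference per entry, so an extent is freed
only when the list held the last reference. The userspace sketch below models
that behaviour under simplifying assumptions: a plain int replaces the kref,
a singly linked list replaces list_head, there is no locking, and all model_*
names are invented for the example.

```c
/* Userspace model only; model_* names are hypothetical, no locking. */
#include <assert.h>
#include <stdlib.h>

struct model_extent {
	int refcnt;			/* stands in for the kref */
	struct model_extent *next;	/* stands in for be_node */
};

static int model_destroyed;		/* counts extents actually freed */

/* Take an extra reference, as a second user of the extent would. */
static void model_get_extent(struct model_extent *be)
{
	be->refcnt++;
}

/* Drop one reference and free the extent when it was the last one. */
static void model_put_extent(struct model_extent *be)
{
	if (be && --be->refcnt == 0) {
		model_destroyed++;
		free(be);
	}
}

/* Push a new extent holding one reference (the list's own). */
static int model_push_extent(struct model_extent **head)
{
	struct model_extent *be = calloc(1, sizeof(*be));

	if (!be)
		return -1;
	be->refcnt = 1;
	be->next = *head;
	*head = be;
	return 0;
}

/* Unlink every extent from every list and drop the list's reference. */
static void model_release_extents(struct model_extent **lists, int nlists)
{
	int i;

	for (i = 0; i < nlists; i++) {
		while (lists[i]) {
			struct model_extent *be = lists[i];

			lists[i] = be->next;
			be->next = NULL;
			model_put_extent(be);
		}
	}
}
```

An extent that some other path still references survives the drain and is
freed by that path's final put, which is the reason for unlinking before
dropping the reference.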
fs/nfs/blocklayout/Makefile | 3 +-
fs/nfs/blocklayout/blocklayout.c | 17 ++++++++++-
fs/nfs/blocklayout/blocklayout.h | 23 +++++++++++++++-
fs/nfs/blocklayout/extents.c | 55 ++++++++++++++++++++++++++++++++++++++
4 files changed, 94 insertions(+), 4 deletions(-)
create mode 100644 fs/nfs/blocklayout/extents.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 2c4c062..1e7619f 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,5 @@
# Makefile for the pNFS block layout driver kernel module
#
obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o blocklayoutdev.o blocklayoutdm.o
+blocklayoutdriver-objs := blocklayout.o blocklayoutdev.o blocklayoutdm.o \
+ extents.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 677836c..fb06f3a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -83,12 +83,25 @@ bl_write_pagelist(struct pnfs_layout_type *lo,
return PNFS_NOT_ATTEMPTED;
}

-/* STUB */
+/* FIXME - range ignored */
static void
release_extents(struct pnfs_block_layout *bl,
struct nfs4_pnfs_layout_segment *range)
{
- return;
+ int i;
+ struct pnfs_block_extent *be;
+
+ spin_lock(&bl->bl_ext_lock);
+ for (i = 0; i < EXTENT_LISTS; i++) {
+ while (!list_empty(&bl->bl_extents[i])) {
+ be = list_first_entry(&bl->bl_extents[i],
+ struct pnfs_block_extent,
+ be_node);
+ list_del(&be->be_node);
+ put_extent(be);
+ }
+ }
+ spin_unlock(&bl->bl_ext_lock);
}

/* STUB */
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index dd25f1a..4e6f8fc 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -99,10 +99,30 @@ struct pnfs_blk_sig {
struct pnfs_blk_sig_comp si_comps[PNFS_BLOCK_MAX_SIG_COMP];
};

+enum exstate4 {
+ PNFS_BLOCK_READWRITE_DATA = 0,
+ PNFS_BLOCK_READ_DATA = 1,
+ PNFS_BLOCK_INVALID_DATA = 2, /* mapped, but data is invalid */
+ PNFS_BLOCK_NONE_DATA = 3 /* unmapped, it's a hole */
+};
+
struct pnfs_inval_markings {
/* STUB */
};

+/* sector_t fields are all in 512-byte sectors */
+struct pnfs_block_extent {
+ struct kref be_refcnt;
+ struct list_head be_node; /* link into lseg list */
+ struct pnfs_deviceid be_devid; /* STUB - removable??? */
+ struct block_device *be_mdev;
+ sector_t be_f_offset; /* the starting offset in the file */
+ sector_t be_length; /* the size of the extent */
+ sector_t be_v_offset; /* the starting offset in the volume */
+ enum exstate4 be_state; /* the state of this extent */
+ struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
+};
+
static inline void
INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
{
@@ -170,5 +190,6 @@ struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
struct pnfs_device *dev);
int nfs4_blk_flatten(struct pnfs_blk_volume *, int, struct pnfs_block_dev *);
void free_block_dev(struct pnfs_block_dev *bdev);
-
+/* extents.c */
+void put_extent(struct pnfs_block_extent *be);
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
new file mode 100644
index 0000000..efdcc08
--- /dev/null
+++ b/fs/nfs/blocklayout/extents.c
@@ -0,0 +1,55 @@
+/*
+ * linux/fs/nfs/blocklayout/extents.c
+ *
+ * Module for the NFSv4.1 pNFS block layout driver.
+ *
+ * Copyright (c) 2006 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Andy Adamson <[email protected]>
+ * Fred Isaman <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+
+#include "blocklayout.h"
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+static void
+destroy_extent(struct kref *kref)
+{
+ struct pnfs_block_extent *be;
+
+ be = container_of(kref, struct pnfs_block_extent, be_refcnt);
+ dprintk("%s be=%p\n", __func__, be);
+ kfree(be);
+}
+
+void
+put_extent(struct pnfs_block_extent *be)
+{
+ if (be) {
+ dprintk("%s enter %p (%i)\n", __func__, be,
+ atomic_read(&be->be_refcnt.refcount));
+ kref_put(&be->be_refcnt, destroy_extent);
+ }
+}
+
--
1.7.4.1


2011-06-07 17:33:49

by Jim Rees

Subject: [PATCH 65/88] pnfs-block: Add support for simple rpc pipefs

Signed-off-by: Eric Anderle <[email protected]>
Signed-off-by: Jim Rees <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
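
Illustration only, not part of the patch: the message format introduced below
is a fixed header followed immediately by an optional payload, with the
payload located by stepping one header past the message pointer. The
userspace sketch below shows that layout and the length check. It is a model
under stated assumptions: the model_* names are invented, MODEL_MAX_MSG
stands in for PAGE_SIZE, calloc replaces kzalloc, and failures return NULL
rather than an ERR_PTR value.

```c
/* Userspace model only; model_* names and MODEL_MAX_MSG are hypothetical. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct model_hdr {
	uint32_t msgid;
	uint8_t type;
	uint8_t flags;
	uint16_t totallen;	/* header plus payload plus padding */
	uint32_t status;
};

/* The payload starts right after the header. */
#define model_payload_of(h) ((void *)((h) + 1))

#define MODEL_MAX_MSG 4096u	/* stands in for PAGE_SIZE */

static struct model_hdr *model_alloc_msg(uint32_t msgid, uint8_t type,
					 uint8_t flags, const void *data,
					 uint16_t datalen, uint16_t padlen)
{
	/* Sum in size_t so two large 16-bit lengths cannot wrap around. */
	size_t totallen = sizeof(struct model_hdr) + datalen + padlen;
	struct model_hdr *msg;

	if (totallen > MODEL_MAX_MSG)
		return NULL;
	msg = calloc(1, totallen);
	if (!msg)
		return NULL;
	msg->msgid = msgid;
	msg->type = type;
	msg->flags = flags;
	msg->totallen = (uint16_t)totallen;
	if (datalen)
		memcpy(model_payload_of(msg), data, datalen);
	return msg;
}
```

A caller with a struct foo payload would read it back with
struct foo *foop = model_payload_of(msg), matching the usage the header
comment in the patch describes for payload_of().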
include/linux/sunrpc/rpc_pipe_fs.h | 4 +
include/linux/sunrpc/simple_rpc_pipefs.h | 111 ++++++++
net/sunrpc/Makefile | 2 +-
net/sunrpc/simple_rpc_pipefs.c | 424 ++++++++++++++++++++++++++++++
4 files changed, 540 insertions(+), 1 deletions(-)
create mode 100644 include/linux/sunrpc/simple_rpc_pipefs.h
create mode 100644 net/sunrpc/simple_rpc_pipefs.c

diff --git a/include/linux/sunrpc/rpc_pipe_fs.h b/include/linux/sunrpc/rpc_pipe_fs.h
index 6f942c9..2177d50 100644
--- a/include/linux/sunrpc/rpc_pipe_fs.h
+++ b/include/linux/sunrpc/rpc_pipe_fs.h
@@ -12,6 +12,10 @@ struct rpc_pipe_msg {
size_t len;
size_t copied;
int errno;
+#define PIPEFS_AUTOFREE_RPCMSG 0x01 /* frees rpc_pipe_msg */
+#define PIPEFS_AUTOFREE_RPCMSG_DATA 0x02 /* frees rpc_pipe_msg->data */
+#define PIPEFS_AUTOFREE_UPCALL_MSG PIPEFS_AUTOFREE_RPCMSG_DATA
+ u8 flags;
};

struct rpc_pipe_ops {
diff --git a/include/linux/sunrpc/simple_rpc_pipefs.h b/include/linux/sunrpc/simple_rpc_pipefs.h
new file mode 100644
index 0000000..02e8147
--- /dev/null
+++ b/include/linux/sunrpc/simple_rpc_pipefs.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright (c) 2008 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * David M. Richter <[email protected]>
+ *
+ * Drawing on work done by Andy Adamson <[email protected]> and
+ * Marius Eriksen <[email protected]>. Thanks for the help over the
+ * years, guys.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * With thanks to CITI's project sponsor and partner, IBM.
+ */
+
+#ifndef _SIMPLE_RPC_PIPEFS_H_
+#define _SIMPLE_RPC_PIPEFS_H_
+
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/mount.h>
+#include <linux/sched.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/sunrpc/rpc_pipe_fs.h>
+
+
+#define payload_of(headerp) ((void *)(headerp + 1))
+
+/*
+ * struct pipefs_hdr -- the generic message format for simple_rpc_pipefs.
+ * Messages may simply be the header itself, although having an optional
+ * data payload follow the header allows much more flexibility.
+ *
+ * Messages are created using pipefs_alloc_init_msg() and
+ * pipefs_alloc_init_msg_padded(), both of which accept a pointer to an
+ * (optional) data payload.
+ *
+ * Given a struct pipefs_hdr *msg that has a struct foo payload, the data
+ * can be accessed using: struct foo *foop = payload_of(msg)
+ */
+struct pipefs_hdr {
+ u32 msgid;
+ u8 type;
+ u8 flags;
+ u16 totallen; /* length of entire message, including hdr itself */
+ u32 status;
+};
+
+/*
+ * struct pipefs_list -- a type of list used for tracking callers who've made an
+ * upcall and are blocked waiting for a reply.
+ *
+ * See pipefs_queue_upcall_waitreply() and pipefs_assign_upcall_reply().
+ */
+struct pipefs_list {
+ struct list_head list;
+ spinlock_t list_lock;
+};
+
+
+/* See net/sunrpc/simple_rpc_pipefs.c for more info on using these functions. */
+extern struct dentry *pipefs_mkpipe(const char *name,
+ const struct rpc_pipe_ops *ops,
+ int wait_for_open);
+extern void pipefs_closepipe(struct dentry *pipe);
+extern void pipefs_init_list(struct pipefs_list *list);
+extern struct pipefs_hdr *pipefs_alloc_init_msg(u32 msgid, u8 type, u8 flags,
+ void *data, u16 datalen);
+extern struct pipefs_hdr *pipefs_alloc_init_msg_padded(u32 msgid, u8 type,
+ u8 flags, void *data,
+ u16 datalen, u16 padlen);
+extern struct pipefs_hdr *pipefs_queue_upcall_waitreply(struct dentry *pipe,
+ struct pipefs_hdr *msg,
+ struct pipefs_list
+ *uplist, u8 upflags,
+ u32 timeout);
+extern int pipefs_queue_upcall_noreply(struct dentry *pipe,
+ struct pipefs_hdr *msg, u8 upflags);
+extern int pipefs_assign_upcall_reply(struct pipefs_hdr *reply,
+ struct pipefs_list *uplist);
+extern struct pipefs_hdr *pipefs_readmsg(struct file *filp,
+ const char __user *src, size_t len);
+extern ssize_t pipefs_generic_upcall(struct file *filp,
+ struct rpc_pipe_msg *rpcmsg,
+ char __user *dst, size_t buflen);
+extern void pipefs_generic_destroy_msg(struct rpc_pipe_msg *rpcmsg);
+
+#endif /* _SIMPLE_RPC_PIPEFS_H_ */
diff --git a/net/sunrpc/Makefile b/net/sunrpc/Makefile
index 9d2fca5..e102040 100644
--- a/net/sunrpc/Makefile
+++ b/net/sunrpc/Makefile
@@ -12,7 +12,7 @@ sunrpc-y := clnt.o xprt.o socklib.o xprtsock.o sched.o \
svc.o svcsock.o svcauth.o svcauth_unix.o \
addr.o rpcb_clnt.o timer.o xdr.o \
sunrpc_syms.o cache.o rpc_pipe.o \
- svc_xprt.o
+ svc_xprt.o simple_rpc_pipefs.o
sunrpc-$(CONFIG_NFS_V4_1) += backchannel_rqst.o bc_svc.o
sunrpc-$(CONFIG_PROC_FS) += stats.o
sunrpc-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/net/sunrpc/simple_rpc_pipefs.c b/net/sunrpc/simple_rpc_pipefs.c
new file mode 100644
index 0000000..c9306aa
--- /dev/null
+++ b/net/sunrpc/simple_rpc_pipefs.c
@@ -0,0 +1,424 @@
+/*
+ * net/sunrpc/simple_rpc_pipefs.c
+ *
+ * Copyright (c) 2008 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * David M. Richter <[email protected]>
+ *
+ * Drawing on work done by Andy Adamson <[email protected]> and
+ * Marius Eriksen <[email protected]>. Thanks for the help over the
+ * years, guys.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * With thanks to CITI's project sponsor and partner, IBM.
+ */
+
+#include <linux/completion.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/sunrpc/simple_rpc_pipefs.h>
+
+
+/*
+ * Make an rpc_pipefs pipe named @name at the root of the mounted rpc_pipefs
+ * filesystem.
+ *
+ * If @wait_for_open is non-zero and an upcall is later queued but the userland
+ * end of the pipe has not yet been opened, the upcall will remain queued until
+ * the pipe is opened; otherwise, the upcall queueing will return with -EPIPE.
+ */
+struct dentry *pipefs_mkpipe(const char *name, const struct rpc_pipe_ops *ops,
+ int wait_for_open)
+{
+ struct dentry *dir, *pipe;
+ struct vfsmount *mnt;
+
+ mnt = rpc_get_mount();
+ if (IS_ERR(mnt)) {
+ pipe = ERR_CAST(mnt);
+ goto out;
+ }
+ dir = mnt->mnt_root;
+ if (!dir) {
+ pipe = ERR_PTR(-ENOENT);
+ goto out;
+ }
+ pipe = rpc_mkpipe(dir, name, NULL, ops,
+ wait_for_open ? RPC_PIPE_WAIT_FOR_OPEN : 0);
+out:
+ return pipe;
+}
+EXPORT_SYMBOL(pipefs_mkpipe);
+
+/*
+ * Shutdown a pipe made by pipefs_mkpipe().
+ * XXX: do we need to retain an extra reference on the mount?
+ */
+void pipefs_closepipe(struct dentry *pipe)
+{
+ rpc_unlink(pipe);
+ rpc_put_mount();
+}
+EXPORT_SYMBOL(pipefs_closepipe);
+
+/*
+ * Initialize a struct pipefs_list, which keeps track of callers that have
+ * made an upcall and are blocked awaiting a reply.
+ *
+ * See pipefs_queue_upcall_waitreply() and pipefs_assign_upcall_reply() for
+ * how to use it.
+ */
+inline void pipefs_init_list(struct pipefs_list *list)
+{
+ INIT_LIST_HEAD(&list->list);
+ spin_lock_init(&list->list_lock);
+}
+EXPORT_SYMBOL(pipefs_init_list);
+
+/*
+ * Alloc/init a generic pipefs message header and copy into its message body
+ * an arbitrary data payload.
+ *
+ * struct pipefs_hdr's are meant to serve as generic, general-purpose message
+ * headers for easy rpc_pipefs I/O. When an upcall is made, the
+ * struct pipefs_hdr is assigned to a struct rpc_pipe_msg and delivered
+ * therein. --And yes, the naming can seem a little confusing at first:
+ *
+ * When one thinks of an upcall "message", in simple_rpc_pipefs that's a
+ * struct pipefs_hdr (possibly with an attached message body). A
+ * struct rpc_pipe_msg is actually only the -vehicle- by which the "real"
+ * message is delivered and processed.
+ */
+struct pipefs_hdr *pipefs_alloc_init_msg_padded(u32 msgid, u8 type, u8 flags,
+ void *data, u16 datalen, u16 padlen)
+{
+ size_t totallen; /* wider than u16 so the sum below cannot wrap before the size check */
+ struct pipefs_hdr *msg = NULL;
+
+ totallen = sizeof(*msg) + datalen + padlen;
+ if (totallen > PAGE_SIZE) {
+ msg = ERR_PTR(-E2BIG);
+ goto out;
+ }
+
+ msg = kzalloc(totallen, GFP_KERNEL);
+ if (!msg) {
+ msg = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+
+ msg->msgid = msgid;
+ msg->type = type;
+ msg->flags = flags;
+ msg->totallen = totallen;
+ memcpy(payload_of(msg), data, datalen);
+out:
+ return msg;
+}
+EXPORT_SYMBOL(pipefs_alloc_init_msg_padded);
+
+/*
+ * See the description of pipefs_alloc_init_msg_padded().
+ */
+struct pipefs_hdr *pipefs_alloc_init_msg(u32 msgid, u8 type, u8 flags,
+ void *data, u16 datalen)
+{
+ return pipefs_alloc_init_msg_padded(msgid, type, flags, data,
+ datalen, 0);
+}
+EXPORT_SYMBOL(pipefs_alloc_init_msg);
+
+
+static void pipefs_init_rpcmsg(struct rpc_pipe_msg *rpcmsg,
+ struct pipefs_hdr *msg, u8 upflags)
+{
+ memset(rpcmsg, 0, sizeof(*rpcmsg));
+ rpcmsg->data = msg;
+ rpcmsg->len = msg->totallen;
+ rpcmsg->flags = upflags;
+}
+
+static struct rpc_pipe_msg *pipefs_alloc_init_rpcmsg(struct pipefs_hdr *msg,
+ u8 upflags)
+{
+ struct rpc_pipe_msg *rpcmsg;
+
+ rpcmsg = kmalloc(sizeof(*rpcmsg), GFP_KERNEL);
+ if (!rpcmsg)
+ return ERR_PTR(-ENOMEM);
+
+ pipefs_init_rpcmsg(rpcmsg, msg, upflags);
+ return rpcmsg;
+}
+
+
+/* represents an upcall that'll block and wait for a reply */
+struct pipefs_upcall {
+ u32 msgid;
+ struct rpc_pipe_msg rpcmsg;
+ struct list_head list;
+ wait_queue_head_t waitq;
+ struct pipefs_hdr *reply;
+};
+
+
+static void pipefs_init_upcall_waitreply(struct pipefs_upcall *upcall,
+ struct pipefs_hdr *msg, u8 upflags)
+{
+ upcall->reply = NULL;
+ upcall->msgid = msg->msgid;
+ INIT_LIST_HEAD(&upcall->list);
+ init_waitqueue_head(&upcall->waitq);
+ pipefs_init_rpcmsg(&upcall->rpcmsg, msg, upflags);
+}
+
+static int __pipefs_queue_upcall_waitreply(struct dentry *pipe,
+ struct pipefs_upcall *upcall,
+ struct pipefs_list *uplist,
+ u32 timeout)
+{
+ int err = 0;
+ DECLARE_WAITQUEUE(wq, current);
+
+ add_wait_queue(&upcall->waitq, &wq);
+ spin_lock(&uplist->list_lock);
+ list_add(&upcall->list, &uplist->list);
+ spin_unlock(&uplist->list_lock);
+
+ err = rpc_queue_upcall(pipe->d_inode, &upcall->rpcmsg);
+ if (err < 0)
+ goto out;
+
+ if (timeout) {
+ /* retval of 0 means timer expired */
+ err = schedule_timeout_uninterruptible(timeout);
+ if (err == 0 && upcall->reply == NULL)
+ err = -ETIMEDOUT;
+ } else {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule();
+ __set_current_state(TASK_RUNNING);
+ }
+
+out:
+ spin_lock(&uplist->list_lock);
+ list_del_init(&upcall->list);
+ spin_unlock(&uplist->list_lock);
+ remove_wait_queue(&upcall->waitq, &wq);
+ return err;
+}
+
+/*
+ * Queue a pipefs msg for an upcall to userspace, place the calling thread
+ * on @uplist, and block the thread to wait for a reply. If @timeout is
+ * nonzero, the thread will be blocked for at most @timeout jiffies.
+ *
+ * (To convert time units into jiffies, consider the functions
+ * msecs_to_jiffies(), usecs_to_jiffies(), timeval_to_jiffies(), and
+ * timespec_to_jiffies().)
+ *
+ * Once a reply is received by your downcall handler, call
+ * pipefs_assign_upcall_reply() with @uplist to find the corresponding upcall,
+ * assign the reply, and wake the waiting thread.
+ *
+ * This function's return value pointer may be an error and should be checked
+ * with IS_ERR() before attempting to access the reply message.
+ *
+ * Callers are responsible for freeing @msg, unless pipefs_generic_destroy_msg()
+ * is used as the ->destroy_msg() callback and the PIPEFS_AUTOFREE_UPCALL_MSG
+ * flag is set in @upflags. See also rpc_pipe_fs.h.
+ */
+struct pipefs_hdr *pipefs_queue_upcall_waitreply(struct dentry *pipe,
+ struct pipefs_hdr *msg,
+ struct pipefs_list *uplist,
+ u8 upflags, u32 timeout)
+{
+ int err = 0;
+ struct pipefs_upcall upcall;
+
+ pipefs_init_upcall_waitreply(&upcall, msg, upflags);
+ err = __pipefs_queue_upcall_waitreply(pipe, &upcall, uplist, timeout);
+ if (err < 0) {
+ kfree(upcall.reply);
+ upcall.reply = ERR_PTR(err);
+ }
+
+ return upcall.reply;
+}
+EXPORT_SYMBOL(pipefs_queue_upcall_waitreply);
+
+/*
+ * Queue a pipefs msg for an upcall to userspace and immediately return (i.e.,
+ * no reply is expected).
+ *
+ * Callers are responsible for freeing @msg, unless pipefs_generic_destroy_msg()
+ * is used as the ->destroy_msg() callback and the PIPEFS_AUTOFREE_UPCALL_MSG
+ * flag is set in @upflags. See also rpc_pipe_fs.h.
+ */
+int pipefs_queue_upcall_noreply(struct dentry *pipe, struct pipefs_hdr *msg,
+ u8 upflags)
+{
+ int err = 0;
+ struct rpc_pipe_msg *rpcmsg;
+
+ upflags |= PIPEFS_AUTOFREE_RPCMSG;
+ rpcmsg = pipefs_alloc_init_rpcmsg(msg, upflags);
+ if (IS_ERR(rpcmsg)) {
+ err = PTR_ERR(rpcmsg);
+ goto out;
+ }
+ err = rpc_queue_upcall(pipe->d_inode, rpcmsg);
+out:
+ return err;
+}
+EXPORT_SYMBOL(pipefs_queue_upcall_noreply);
+
+
+static struct pipefs_upcall *pipefs_find_upcall_msgid(u32 msgid,
+ struct pipefs_list *uplist)
+{
+ struct pipefs_upcall *upcall;
+
+ spin_lock(&uplist->list_lock);
+ list_for_each_entry(upcall, &uplist->list, list)
+ if (upcall->msgid == msgid)
+ goto out;
+ upcall = NULL;
+out:
+ spin_unlock(&uplist->list_lock);
+ return upcall;
+}
+
+/*
+ * In your rpc_pipe_ops->downcall() handler, once you've read in a downcall
+ * message and have determined that it is a reply to a waiting upcall,
+ * you can use this function to find the appropriate upcall, assign the result,
+ * and wake the upcall thread.
+ *
+ * The reply message must have the same msgid as the original upcall message's.
+ *
+ * See also pipefs_queue_upcall_waitreply() and pipefs_readmsg().
+ */
+int pipefs_assign_upcall_reply(struct pipefs_hdr *reply,
+ struct pipefs_list *uplist)
+{
+ int err = 0;
+ struct pipefs_upcall *upcall;
+
+ upcall = pipefs_find_upcall_msgid(reply->msgid, uplist);
+ if (!upcall) {
+ printk(KERN_ERR "%s: ERROR: have reply but no matching upcall "
+ "for msgid %d\n", __func__, reply->msgid);
+ err = -ENOENT;
+ goto out;
+ }
+ upcall->reply = reply;
+ wake_up(&upcall->waitq);
+out:
+ return err;
+}
+EXPORT_SYMBOL(pipefs_assign_upcall_reply);
+
+/*
+ * Generic method to read-in and return a newly-allocated message which begins
+ * with a struct pipefs_hdr.
+ */
+struct pipefs_hdr *pipefs_readmsg(struct file *filp, const char __user *src,
+ size_t len)
+{
+ int err = 0, hdrsize;
+ struct pipefs_hdr *msg = NULL;
+
+ hdrsize = sizeof(*msg);
+ if (len < hdrsize) {
+ printk(KERN_ERR "%s: ERROR: header is too short (%d vs %d)\n",
+ __func__, (int) len, hdrsize);
+ err = -EINVAL;
+ goto out;
+ }
+
+ msg = kzalloc(len, GFP_KERNEL);
+ if (!msg) {
+ err = -ENOMEM;
+ goto out;
+ }
+ if (copy_from_user(msg, src, len))
+ err = -EFAULT;
+out:
+ if (err) {
+ kfree(msg);
+ msg = ERR_PTR(err);
+ }
+ return msg;
+}
+EXPORT_SYMBOL(pipefs_readmsg);
+
+/*
+ * Generic rpc_pipe_ops->upcall() handler implementation.
+ *
+ * Don't call this directly: to make an upcall, use
+ * pipefs_queue_upcall_waitreply() or pipefs_queue_upcall_noreply().
+ */
+ssize_t pipefs_generic_upcall(struct file *filp, struct rpc_pipe_msg *rpcmsg,
+ char __user *dst, size_t buflen)
+{
+ char *data;
+ ssize_t len, left;
+
+ data = (char *)rpcmsg->data + rpcmsg->copied;
+ len = rpcmsg->len - rpcmsg->copied;
+ if (len > buflen)
+ len = buflen;
+
+ left = copy_to_user(dst, data, len);
+ if (left == len && len != 0) {
+ rpcmsg->errno = -EFAULT;
+ return -EFAULT;
+ }
+
+ len -= left;
+ rpcmsg->copied += len;
+ rpcmsg->errno = 0;
+ return len;
+}
+EXPORT_SYMBOL(pipefs_generic_upcall);
+
+/*
+ * Generic rpc_pipe_ops->destroy_msg() handler implementation.
+ *
+ * Items are only freed if @rpcmsg->flags has been set appropriately.
+ * See pipefs_queue_upcall_noreply() and rpc_pipe_fs.h.
+ */
+void pipefs_generic_destroy_msg(struct rpc_pipe_msg *rpcmsg)
+{
+ if (rpcmsg->flags & PIPEFS_AUTOFREE_UPCALL_MSG)
+ kfree(rpcmsg->data);
+ if (rpcmsg->flags & PIPEFS_AUTOFREE_RPCMSG)
+ kfree(rpcmsg);
+}
+EXPORT_SYMBOL(pipefs_generic_destroy_msg);
--
1.7.4.1


2011-06-07 17:27:07

by Jim Rees

Subject: [PATCH 10/88] pnfsblock: define PNFS_BLOCK Kconfig option

From: Fred Isaman <[email protected]>

Define a configuration variable to enable/disable compilation of the
block driver code.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
[pnfs-block: fix CONFIG_PNFS_BLOCK dependencies]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/Kconfig | 8 ++++++++
fs/nfs/Makefile | 1 +
fs/nfs/blocklayout/Makefile | 5 +++++
3 files changed, 14 insertions(+), 0 deletions(-)
create mode 100644 fs/nfs/blocklayout/Makefile

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index c96f9eb..d6bfd87 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -105,6 +105,14 @@ config PNFS_PANLAYOUT

If unsure, say N.

+config PNFS_BLOCK
+ tristate "Provide a pNFS block client (EXPERIMENTAL)"
+ depends on NFS_FS && PNFS
+ help
+ Say M or Y here if you want your pNFS client to support the block protocol.
+
+ If unsure, say N.
+
config ROOT_NFS
bool "Root file system on NFS"
depends on NFS_FS=y && IP_PNP
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6a34f7d..b58613d 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
nfs_layout_nfsv41_files-y := nfs4filelayout.o nfs4filelayoutdev.o

obj-$(CONFIG_PNFS_OBJLAYOUT) += objlayout/
+obj-$(CONFIG_PNFS_BLOCK) += blocklayout/
diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
new file mode 100644
index 0000000..f214c1c
--- /dev/null
+++ b/fs/nfs/blocklayout/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the pNFS block layout driver kernel module
+#
+obj-$(CONFIG_PNFS_BLOCK) +=
+blocklayoutdriver-objs :=
--
1.7.4.1


2011-06-07 17:32:49

by Jim Rees

Subject: [PATCH 57/88] SQUASHME: pnfsblock: write_end_cleanup adjust for removed ok_to_use_pnfs

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index b1df445..43a5617 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1048,9 +1048,13 @@ bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
sector_t *pos;
struct address_space *mapping = filp->f_mapping;
struct pnfs_fsdata *fake_data;
+ struct pnfs_layout_segment *lseg;

if (!fsdata)
return;
+ lseg = fsdata->lseg;
+ if (!lseg)
+ return;
pos = fsdata->private;
if (!pos)
return;
@@ -1079,7 +1083,8 @@ bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
unlock_page(page);
continue;
}
- fake_data->ok_to_use_pnfs = 1;
+ get_lseg(lseg);
+ fake_data->lseg = lseg;
fake_data->bypass_eof = 1;
mapping->a_ops->write_end(filp, mapping,
index << PAGE_CACHE_SHIFT,
--
1.7.4.1


2011-06-07 17:35:26

by Jim Rees

Subject: [PATCH 81/88] SQUASHME: pnfs-block: fixup cleanup_layoutcommit arguments

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 5 +++--
fs/nfs/blocklayout/blocklayout.h | 2 +-
fs/nfs/blocklayout/extents.c | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index ff94ee2..1865392 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -653,10 +653,11 @@ bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,

static void
bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutcommit_args *arg, int status)
+ struct nfs4_layoutcommit_op_args *arg,
+ struct nfs4_layoutcommit_op_res *res)
{
dprintk("%s enter\n", __func__);
- clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), arg, status);
+ clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), arg, res->status);
kfree(arg->layoutdriver_data);
}

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 9e7bd62..4266ae7 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -278,7 +278,7 @@ int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
const struct nfs4_layoutcommit_args *arg);
void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
- const struct nfs4_layoutcommit_args *arg,
+ const struct nfs4_layoutcommit_op_args *arg,
int status);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 40dff82..0ff65f3 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -922,7 +922,7 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)

void
clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
- const struct nfs4_layoutcommit_args *arg,
+ const struct nfs4_layoutcommit_op_args *arg,
int status)
{
struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
--
1.7.4.1


2011-06-10 14:32:28

by Peng, Tao

Subject: RE: [PATCH 87/88] Add configurable prefetch size for layoutget


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
Sent: Friday, June 10, 2011 8:48 PM
To: Peng, Tao
Cc: [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 02:02, [email protected] wrote:
> Hi, Benny,
>
> Cheers,
> -Bergwolf
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
> Sent: Friday, June 10, 2011 5:30 AM
> To: Peng Tao
> Cc: Jim Rees; [email protected]; peter honeyman
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-09 07:54, Peng Tao wrote:
>> On Thu, Jun 9, 2011 at 2:06 PM, Benny Halevy <[email protected]> wrote:
>>> On 2011-06-08 03:15, Peng Tao wrote:
>>>> On 6/8/11, Jim Rees <[email protected]> wrote:
>>>>> Benny Halevy wrote:
>>>>>
>>>>> NAK.
>>>>> This affects all layout types. In particular it is undesired
>>>>> for write layouts that extend the file with the objects layout.
>>>>> The server can extend the layout segments range
>>>>> over what the client requested so why would the client
>>>>> ask for artificially large layouts?
>>>>>
>>>>> This has actually been the subject of some debate over Thursday night
>>>>> beers. The problem we're trying to solve is that the client is spending 98%
>>>>> of its time in layoutget. This patch gives us something like a 10x
>>>>> speedup. But many of us think it's not the right fix. I suggest we discuss
>>>>> next week.
>>>>>
>>>
>>> Sure.
>>>
>>>>> But note that this patch doesn't change anything unless you set the sysctl.
>>>> there is a default value of 2M. maybe we can set it to page size by
>>>> default so other layout are not affected and block layout can let
>>>> users set it by hand if they care about performance. does this make
>>>> sense?
>>>
>>> If doing it at all why use a sysctl rather than a mount option?
>> The purpose of using a sysctl is to give client the ability to change
>> it on the fly. In theory, layout prefetching can benefit all layout
>> types. So the patch tries to solve it in the pnfs generic layer.
>>
>
> But the need for this varies per-server and many times per application.
> Think sequential vs. random I/O. Therefore a mount option would help
> tuning the behavior on a per-use basis. Global behavior must be implemented
> using a dynamic algorithm that would take both the workload and the server
> observed behavior into account.
> [PT] Indeed. A dynamic algorithm should be able to solve all of this, but it usually takes longer to design and get accepted. It has to prove better in most scenarios without hurting the rest.

We need to find an acceptable solution to push this driver upstream.
I understand that developing a dynamic algorithm in the given time frame is
too big a challenge, but hacking in yet another client tunable is out of the
question too. For testing at the Bakeathon I'd consider taking a DEVONLY version
of this patch that is enabled using a config option and defaults to zero, so that
it has no effect at run time until the sysctl sets it differently.
But keep in mind this is not suitable for pushing upstream.
[PT] Thanks for your understanding. We truly want to solve the performance problem and are open to suggestions. And this will be a feature that benefits all layout types, am I right?

-Tao

2011-06-09 15:02:20

by Peng Tao

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

Hi, Jim,

On Thu, Jun 9, 2011 at 7:49 PM, Jim Rees <[email protected]> wrote:
> Benny Halevy wrote:
>
>  >> But note that this patch doesn't change anything unless you set the sysctl.
>  > there is a default value of 2M. maybe we can set it to page size by
>  > default so other layout are not affected and block layout can let
>  > users set it by hand if they care about performance. does this make
>  > sense?
>
>  If doing it at all why use a sysctl rather than a mount option?
>  Or maybe coding the logic for prefetching the layout iff sequential
>  access is detected is the right thing to do.
>
> I would rather see some automatic solution than to add either a sysctl or a
> mount option.  For now you can just drop that patch, as it's not needed for
> basic pnfs block.
The current code w/o the patch makes block layout performance very poor.
Can we have it in for now, and when we have something smarter, change
it back later?

>
> My understanding is that layoutget specifies a min and max, and the server
> is returning the min.  Trond and Fred believe this should be fixed on the
> server.  Here's the original report of the problem:
>
> From: Bergwolf
>
> From the network trace for pnfs, we can see that the root cause of the slow
> performance is too many small layoutgets. Specifically, the client asks for a
> layout of only one 4K page (and the server returns 8K due to block size
> alignment) each time.
>
> The total IO time is 256/1.68 = 152 seconds.
> There are 256*1024/8 = 32768 layoutgets for the 256MB file.
> On average, the time spent on each layoutget is 0.00456 seconds according to the
> trace.
> The total layoutget time is 32768 * 0.00456 = 149 seconds, which takes up about
> 98% of the total IO time.
>
> So we should optimize layoutget's granularity to get better performance. For
> instance, use a configurable prefetch size of 2MB or so.
>



--
Thanks,
-Bergwolf
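
The arithmetic in the report quoted above can be checked with a few lines of C. This is only a sketch: the function names are hypothetical, and the inputs (256MB file, 8K per returned layout, 0.00456 seconds per layoutget) are the figures from the report.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of LAYOUTGET calls needed to cover a file of file_mb megabytes
 * when each returned layout covers layout_kb kilobytes.
 * (Hypothetical helper, not part of the patch set.)
 */
static uint64_t layoutget_count(uint64_t file_mb, uint64_t layout_kb)
{
	assert(layout_kb != 0);
	return file_mb * 1024 / layout_kb;
}

/* Total time spent in LAYOUTGET, in seconds, at a fixed cost per call. */
static double layoutget_seconds(uint64_t count, double secs_per_call)
{
	return (double)count * secs_per_call;
}
```

With the reported numbers, layoutget_count(256, 8) gives 32768 calls and layoutget_seconds(32768, 0.00456) gives about 149 seconds. With a 2MB layout per call, layoutget_count(256, 2048) gives 128 calls. How much time that saves depends on whether the per-call cost stays the same for larger layouts, which the report does not measure.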

2011-06-07 17:28:38

by Jim Rees

Subject: [PATCH 22/88] pnfsblock: xdr decode pnfs_block_layout4

From: Fred Isaman <[email protected]>

XDR decodes the block layout payload sent in LAYOUTGET result, storing
the result in an extent list.

Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 2 +
fs/nfs/blocklayout/blocklayoutdev.c | 165 ++++++++++++++++++++++++++++++++++-
fs/nfs/blocklayout/extents.c | 12 +++
3 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index bcf85be..f91939d 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -142,6 +142,7 @@ struct pnfs_block_layout {
sector_t bl_blocksize; /* Server blocksize in sectors */
};

+#define BLK_ID(lo) ((struct block_mount_id *)(PNFS_MOUNTID(lo)->mountid))
#define BLK_LSEG2EXT(lseg) ((struct pnfs_block_layout *)lseg->layout->ld_data)
#define BLK_LO2EXT(lo) ((struct pnfs_block_layout *)lo->ld_data)

@@ -195,4 +196,5 @@ int nfs4_blk_flatten(struct pnfs_blk_volume *, int, struct pnfs_block_dev *);
void free_block_dev(struct pnfs_block_dev *bdev);
/* extents.c */
void put_extent(struct pnfs_block_extent *be);
+struct pnfs_block_extent *alloc_extent(void);
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 818cc1c..77190fd 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -554,11 +554,172 @@ nfs4_blk_decode_device(struct super_block *sb,
return rv;
}

+/* Map deviceid returned by the server to constructed block_device */
+static struct block_device *translate_devid(struct pnfs_layout_type *lo,
+ struct pnfs_deviceid *id)
+{
+ struct block_device *rv = NULL;
+ struct block_mount_id *mid;
+ struct pnfs_block_dev *dev;
+
+ dprintk("%s enter, lo=%p, id=%p\n", __func__, lo, id);
+ mid = BLK_ID(lo);
+ spin_lock(&mid->bm_lock);
+ list_for_each_entry(dev, &mid->bm_devlist, bm_node) {
+ if (memcmp(id->data, dev->bm_mdevid.data,
+ NFS4_PNFS_DEVICEID4_SIZE) == 0) {
+ rv = dev->bm_mdev;
+ goto out;
+ }
+ }
+ out:
+ spin_unlock(&mid->bm_lock);
+ dprintk("%s returning %p\n", __func__, rv);
+ return rv;
+}
+
+/* Tracks info needed to ensure extents in layout obey constraints of spec */
+struct layout_verification {
+ u32 mode; /* R or RW */
+ u64 start; /* Expected start of next non-COW extent */
+ u64 inval; /* Start of INVAL coverage */
+ u64 cowread; /* End of COW read coverage */
+};
+
+/* Verify the extent meets the layout requirements of the pnfs-block draft,
+ * section 2.3.1.
+ */
+static int verify_extent(struct pnfs_block_extent *be,
+ struct layout_verification *lv)
+{
+ if (lv->mode == IOMODE_READ) {
+ if (be->be_state == PNFS_BLOCK_READWRITE_DATA ||
+ be->be_state == PNFS_BLOCK_INVALID_DATA)
+ return -EIO;
+ if (be->be_f_offset != lv->start)
+ return -EIO;
+ lv->start += be->be_length;
+ return 0;
+ }
+ /* lv->mode == IOMODE_RW */
+ if (be->be_state == PNFS_BLOCK_READWRITE_DATA) {
+ if (be->be_f_offset != lv->start)
+ return -EIO;
+ if (lv->cowread > lv->start)
+ return -EIO;
+ lv->start += be->be_length;
+ lv->inval = lv->start;
+ return 0;
+ } else if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+ if (be->be_f_offset != lv->start)
+ return -EIO;
+ lv->start += be->be_length;
+ return 0;
+ } else if (be->be_state == PNFS_BLOCK_READ_DATA) {
+ if (be->be_f_offset > lv->start)
+ return -EIO;
+ if (be->be_f_offset < lv->inval)
+ return -EIO;
+ if (be->be_f_offset < lv->cowread)
+ return -EIO;
+ /* It looks like you might want to min this with lv->start,
+ * but you really don't.
+ */
+ lv->inval = lv->inval + be->be_length;
+ lv->cowread = be->be_f_offset + be->be_length;
+ return 0;
+ } else
+ return -EIO;
+}
+
/* XDR decode pnfs_block_layout4 structure */
int
nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
struct nfs4_pnfs_layoutget_res *lgr)
{
- /* STUB */
- return -EIO;
+ struct pnfs_block_layout *bl = PNFS_LD_DATA(lo);
+ uint32_t *p = (uint32_t *)lgr->layout.buf;
+ uint32_t *end = (uint32_t *)((char *)lgr->layout.buf + lgr->layout.len);
+ int i, status = -EIO;
+ uint32_t count;
+ struct pnfs_block_extent *be = NULL;
+ uint64_t tmp; /* Used by READSECTOR */
+ struct layout_verification lv = {
+ .mode = lgr->lseg.iomode,
+ .start = lgr->lseg.offset >> 9,
+ .inval = lgr->lseg.offset >> 9,
+ .cowread = lgr->lseg.offset >> 9,
+ };
+
+ LIST_HEAD(extents);
+
+ BLK_READBUF(p, end, 4);
+ READ32(count);
+
+ dprintk("%s enter, number of extents %i\n", __func__, count);
+ BLK_READBUF(p, end, (28 + NFS4_PNFS_DEVICEID4_SIZE) * count);
+
+ /* Decode individual extents, putting them in temporary
+ * staging area until whole layout is decoded to make error
+ * recovery easier.
+ */
+ for (i = 0; i < count; i++) {
+ be = alloc_extent();
+ if (!be) {
+ status = -ENOMEM;
+ goto out_err;
+ }
+ READ_DEVID(&be->be_devid);
+ be->be_mdev = translate_devid(lo, &be->be_devid);
+ if (!be->be_mdev)
+ goto out_err;
+ /* The next three values are read in as bytes,
+ * but stored as 512-byte sector lengths
+ */
+ READ_SECTOR(be->be_f_offset);
+ READ_SECTOR(be->be_length);
+ READ_SECTOR(be->be_v_offset);
+ READ32(be->be_state);
+ if (be->be_state == PNFS_BLOCK_INVALID_DATA)
+ be->be_inval = &bl->bl_inval;
+ if (verify_extent(be, &lv)) {
+ dprintk("%s verify failed\n", __func__);
+ goto out_err;
+ }
+ list_add_tail(&be->be_node, &extents);
+ }
+ if (p != end) {
+ dprintk("%s Undecoded cruft at end of opaque\n", __func__);
+ be = NULL;
+ goto out_err;
+ }
+ if (lgr->lseg.offset + lgr->lseg.length != lv.start << 9) {
+ dprintk("%s Final length mismatch\n", __func__);
+ be = NULL;
+ goto out_err;
+ }
+ if (lv.start < lv.cowread) {
+ dprintk("%s Final uncovered COW extent\n", __func__);
+ be = NULL;
+ goto out_err;
+ }
+ /* Extents decoded properly, now try to merge them in to
+ * existing layout extents.
+ */
+ /* STUB - instead we just throw them away */
+ status = 0;
+ goto out_err;
+ out:
+ dprintk("%s returns %i\n", __func__, status);
+ return status;
+
+ out_err:
+ put_extent(be);
+ while (!list_empty(&extents)) {
+ be = list_first_entry(&extents, struct pnfs_block_extent,
+ be_node);
+ list_del(&be->be_node);
+ put_extent(be);
+ }
+ goto out;
}
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index efdcc08..a952d39 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -53,3 +53,15 @@ put_extent(struct pnfs_block_extent *be)
}
}

+struct pnfs_block_extent *alloc_extent(void)
+{
+ struct pnfs_block_extent *be;
+
+ be = kmalloc(sizeof(struct pnfs_block_extent), GFP_KERNEL);
+ if (!be)
+ return NULL;
+ INIT_LIST_HEAD(&be->be_node);
+ kref_init(&be->be_refcnt);
+ be->be_inval = NULL;
+ return be;
+}
--
1.7.4.1


2011-06-07 17:33:03

by Jim Rees

Subject: [PATCH 59/88] SQUASHME: pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 77e9512..92f0b4b 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -448,8 +448,8 @@ bl_write_pagelist(struct pnfs_layout_type *lo,
int pg_index = pgbase >> PAGE_CACHE_SHIFT;

dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
- if (!test_bit(PG_USE_PNFS, &wdata->req->wb_flags)) {
- dprintk("PG_USE_PNFS not set\n");
+ if (!wdata->req->wb_lseg) {
+ dprintk("%s no lseg, falling back to MDS\n", __func__);
return PNFS_NOT_ATTEMPTED;
}
if (dont_like_caller(wdata->req)) {
--
1.7.4.1


2011-06-10 03:06:04

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-09 08:07, Peng Tao wrote:
> Hi, Jim and Benny,
>
> On Thu, Jun 9, 2011 at 9:58 PM, Jim Rees <[email protected]> wrote:
>> Benny Halevy wrote:
>>
>> > My understanding is that layoutget specifies a min and max, and the server
>>
>> There's a min. What do you consider the max?
>> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>
>> The spec doesn't say max, it says "desired." I guess I assumed the server
>> wouldn't normally return more than desired.
> In fact server is returning "desired" length. The problem is that we
> call pnfs_update_layout in nfs_write_begin, and it will end up setting
> both minlength and length to page size. There is no space for client
> to collapse layoutget range in nfs_write_begin.
>

That's a different issue. Deferring pnfs_update_layout from write_begin to
flush time when the whole page is written would help send a more meaningful
desired range, as well as avoid needless read-modify-writes in case the
application also wrote the whole preallocated block.

Benny

>>
>> 18.43.3. DESCRIPTION
>> ...
>>
>> The LAYOUTGET operation returns layout information for the specified
>> byte-range: a layout. The client actually specifies two ranges, both
>> starting at the offset in the loga_offset field. The first range is
>> between loga_offset and loga_offset + loga_length - 1 inclusive.
>> This range indicates the desired range the client wants the layout to
>> cover. The second range is between loga_offset and loga_offset +
>> loga_minlength - 1 inclusive. This range indicates the required
>> range the client needs the layout to cover. Thus, loga_minlength
>> MUST be less than or equal to loga_length.
>>
>
>
>

--
Benny Halevy
CTO, Tonian Inc.

Tel: +972-54-802-8340
[email protected]
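
The two LAYOUTGET ranges described in the quoted section 18.43.3 can be sketched in C as follows. The struct and helper names are hypothetical; only the loga_offset, loga_length and loga_minlength fields come from the spec text. The sketch assumes nonzero lengths and does not handle the special all-ones length value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three LAYOUTGET range arguments. */
struct layoutget_range {
	uint64_t offset;     /* loga_offset */
	uint64_t length;     /* loga_length: desired range */
	uint64_t minlength;  /* loga_minlength: required range */
};

/* loga_minlength MUST be less than or equal to loga_length. */
static bool layoutget_range_valid(const struct layoutget_range *r)
{
	return r->minlength <= r->length;
}

/* Last byte (inclusive) of the desired range. */
static uint64_t layoutget_desired_last(const struct layoutget_range *r)
{
	assert(r->length != 0);
	return r->offset + r->length - 1;
}

/* Last byte (inclusive) of the required range. */
static uint64_t layoutget_required_last(const struct layoutget_range *r)
{
	assert(r->minlength != 0);
	return r->offset + r->minlength - 1;
}

/* Bytes the server may grant beyond what the client strictly requires. */
static uint64_t layoutget_slack(const struct layoutget_range *r)
{
	assert(layoutget_range_valid(r));
	return r->length - r->minlength;
}
```

When both length and minlength are set to the page size, as described for nfs_write_begin above, layoutget_slack() is 0: the desired range and the required range are the same single page.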

2011-06-11 01:46:30

by Peng Tao

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

Hi, Benny,

On Sat, Jun 11, 2011 at 5:15 AM, Benny Halevy <[email protected]> wrote:
> On 2011-06-10 16:03, Fred Isaman wrote:
>> On Fri, Jun 10, 2011 at 3:23 PM, Benny Halevy <[email protected]> wrote:
>>>
>>> A simple algorithm I can suggest is:
>>> - on initialization, calculate and save, per layout driver
>>>  - maximum layout size
>>
>> I must be misunderstanding something.  Layout size has nothing to do
>> with io size (other than the obvious fact that you want the layout >
>> io).
>
> Exactly, and if the layouts you get from the server are too small
> it's hard to do efficient I/O (modulo layout segment merging
> or gathering)
>
>>
>> I don't know about the object driver, but for both the file and block
>> drivers the client wants as much as the server will give it.
>>
>
> For blocks the message buffer size (as mentioned below) may be
> a limiting factor hence the limit.
It is true that the message buffer size can limit the number of extents
returned by the block server. But the length that each extent addresses
is unknown. It can be any size (from layout_blksize to even larger than
the file size), determined by the file's on-disk layout. The client
cannot guess the maximum segment size based on the message buffer size.

> Another constraint for blocks and for objects to some extent is
> provisional allocation where you don't want to just arbitrarily
> ask for artificially large write layouts and this may be interpreted
> as intent to write to the whole range and will result in excessive
> provisional allocation of resources.
>
> Bottom line, I completely agree with "the client wants as much as the
> server will give it" so it should ask for what it needs and let the
> server decide whether to extend or trim the range if need be.
>
> Benny
>
>> Fred
>>
>>>    - take into account csr_fore_chan_attrs.ca_maxresponsesize and possible other parameters
>>>  - keep a working copy of the maximum value and the calculated copy.
>>>  - alignment value.
>>> - on miss, see if there's an adjacent layout segment in cache
>>> - if found, ask for twice the found segment size, up to the maximum value,
>>>  aligned on the alignment value.
>>> - if the server returns less the layoutget range, keep note of the returned length
>>>  (but not adjust maximum yet, as the server may return a short segment for various
>>>   reasons)
>>> - if the server is consistent about returning less than was asked, adjust the
>>>  - working copy of the maximum length
>>> - if the maximum was adjusted try bumping it up after X (TBD) layoutgets or T seconds
>>>  to see if that was just due to high load or conflicts on the server
>>> - on any error returned for LAYOUTGET reset the algorithm parameters
>>> - on session reestablishment recalculate maximums.
>>>
>>> Benny
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



--
Thanks,
-Bergwolf
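
The simple algorithm quoted above could be sketched roughly as below. This is a userspace illustration under stated assumptions, not code from the patch set: every name here (struct lg_sizer, LG_SHORT_LIMIT and the lg_* helpers) is hypothetical, "consistent" is arbitrarily taken to mean three short replies in a row, and the timed retry ("after X layoutgets or T seconds") is left to the caller, who would call lg_sizer_reset().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: consecutive short replies that count as "consistent". */
#define LG_SHORT_LIMIT 3

/* Hypothetical per-layout-driver sizing state. */
struct lg_sizer {
	uint64_t calc_max;    /* maximum calculated at initialization */
	uint64_t work_max;    /* working copy, may be lowered at run time */
	uint64_t align;       /* alignment value, must be nonzero */
	unsigned short_runs;  /* consecutive short replies seen so far */
};

static uint64_t lg_round_up(uint64_t v, uint64_t align)
{
	return (v + align - 1) / align * align;
}

static void lg_sizer_init(struct lg_sizer *s, uint64_t max, uint64_t align)
{
	assert(align != 0 && max >= align);
	s->calc_max = max;
	s->work_max = max;
	s->align = align;
	s->short_runs = 0;
}

/* Use on a LAYOUTGET error, on session re-establishment, or to retry
 * the calculated maximum after the working copy was lowered. */
static void lg_sizer_reset(struct lg_sizer *s)
{
	s->work_max = s->calc_max;
	s->short_runs = 0;
}

/* Length to ask for on a miss. adjacent_len is the size of an adjacent
 * cached segment, or 0 if none was found. Overflow is not handled. */
static uint64_t lg_next_request(const struct lg_sizer *s, uint64_t needed,
				uint64_t adjacent_len)
{
	uint64_t want = needed;

	if (adjacent_len != 0 && 2 * adjacent_len > want)
		want = 2 * adjacent_len;
	want = lg_round_up(want, s->align);
	if (want > s->work_max)
		want = s->work_max;
	if (want < needed)	/* never ask for less than is required */
		want = lg_round_up(needed, s->align);
	return want;
}

/* Record a reply; lower the working maximum only after the server has
 * returned less than was asked LG_SHORT_LIMIT times in a row. */
static void lg_note_reply(struct lg_sizer *s, uint64_t asked, uint64_t got)
{
	if (got >= asked) {
		s->short_runs = 0;
		return;
	}
	if (++s->short_runs >= LG_SHORT_LIMIT) {
		uint64_t m = lg_round_up(got, s->align);

		if (m < s->align)
			m = s->align;
		if (m < s->work_max)
			s->work_max = m;
		s->short_runs = 0;
	}
}
```

Keeping calc_max separate from work_max is what lets the client go back to the larger value later, in case the short replies were only due to load or conflicts on the server.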

2011-06-07 17:32:57

by Jim Rees

Subject: [PATCH 58/88] SQUASHME: pnfsblock: bl_write_pagelist support functions adjust for missing PG_USE_PNFS

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 43a5617..77e9512 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1115,12 +1115,10 @@ bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
dprintk("%s enter\n", __func__);
- if (pgio->pg_iswrite) {
- return test_bit(PG_USE_PNFS, &prev->wb_flags) ==
- test_bit(PG_USE_PNFS, &req->wb_flags);
- } else {
+ if (pgio->pg_iswrite)
+ return prev->wb_lseg == req->wb_lseg;
+ else
return 1;
- }
}

static struct layoutdriver_io_operations blocklayout_io_operations = {
--
1.7.4.1


2011-06-09 14:55:10

by Peng Tao

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On Thu, Jun 9, 2011 at 2:06 PM, Benny Halevy <[email protected]> wrote:
> On 2011-06-08 03:15, Peng Tao wrote:
>> On 6/8/11, Jim Rees <[email protected]> wrote:
>>> Benny Halevy wrote:
>>>
>>>   NAK.
>>>   This affects all layout types.  In particular it is undesired
>>>   for write layouts that extend the file with the objects layout.
>>>   The server can extend the layout segments range
>>>   over what the client requested so why would the client
>>>   ask for artificially large layouts?
>>>
>>> This has actually been the subject of some debate over Thursday night
>>> beers.  The problem we're trying to solve is that the client is spending 98%
>>> of its time in layoutget.  This patch gives us something like a 10x
>>> speedup.  But many of us think it's not the right fix.  I suggest we discuss
>>> next week.
>>>
>
> Sure.
>
>>> But note that this patch doesn't change anything unless you set the sysctl.
>> there is a default value of 2M. maybe we can set it to page size by
>> default so other layout are not affected and block layout can let
>> users set it by hand if they care about performance. does this make
>> sense?
>
> If doing it at all why use a sysctl rather than a mount option?
The purpose of using a sysctl is to give the client the ability to change
it on the fly. In theory, layout prefetching can benefit all layout
types, so the patch tries to solve it in the generic pnfs layer.

> Or maybe coding the logic for prefetching the layout iff sequential
> access is detected is the right thing to do.
Yeah, an automatic decision would be a better way.

>
> Benny
>
>>>
>>
>>
>



--
Thanks,
-Bergwolf

2011-06-07 17:26:44

by Jim Rees

Subject: [PATCH 07/88] pnfs: HACK: modify write_end_cleanup

From: Fred Isaman <[email protected]>

This needs to be changed, but will require a major rewrite
of the block layout's IO code. Including it here so I can
get some current code into the tree.

Change write_end_cleanup API to allow block driver the ability to
handle server blocks larger than a page in size. It needs to
initiate writes to adjacent pages, so pass in filp so that it
can call the appropriate aop functions. A field is also added to
fsdata to hold the list of pages.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/pnfs.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index cfa8ea6..ac536bc 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -56,6 +56,7 @@ enum pnfs_try_status {

struct pnfs_fsdata {
struct pnfs_layout_segment *lseg;
+ void *private;
};

#ifdef CONFIG_NFS_V4_1
@@ -116,6 +117,8 @@ struct pnfs_layoutdriver_type {
int (*write_end)(struct inode *inode, struct page *page, loff_t pos,
unsigned count, unsigned copied,
struct pnfs_layout_segment *lseg);
+ void (*write_end_cleanup)(struct file *filp,
+ struct pnfs_fsdata *fsdata);

void (*free_deviceid_node) (struct nfs4_deviceid_node *);

@@ -356,6 +359,8 @@ static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);

if (fsdata && nfss->pnfs_curr_ld) {
+ if (nfss->pnfs_curr_ld->write_end_cleanup)
+ nfss->pnfs_curr_ld->write_end_cleanup(filp, fsdata);
if (nfss->pnfs_curr_ld->write_begin)
pnfs_free_fsdata(fsdata);
}
--
1.7.4.1


2011-06-07 17:34:55

by Jim Rees

Subject: [PATCH 76/88] SQUASHME: pnfs-block: deprecate get_stripesize

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 23dbe91..57a7f04 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1083,13 +1083,6 @@ bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
fsdata->private = NULL;
}

-static ssize_t
-bl_get_stripesize(struct pnfs_layout_hdr *lo)
-{
- dprintk("%s enter\n", __func__);
- return 0;
-}
-
/* This is called by nfs_can_coalesce_requests via nfs_pageio_do_add_request.
* Should return False if there is a reason requests can not be coalesced,
* otherwise, should default to returning True.
@@ -1123,7 +1116,6 @@ static struct pnfs_layoutdriver_type blocklayout_type = {
.cleanup_layoutcommit = bl_cleanup_layoutcommit,
.initialize_mountpoint = bl_initialize_mountpoint,
.uninitialize_mountpoint = bl_uninitialize_mountpoint,
- .get_stripesize = bl_get_stripesize,
.pg_test = bl_pg_test,
};

--
1.7.4.1


2011-06-07 17:27:58

by Jim Rees

Subject: [PATCH 18/88] pnfsblock: construct and load md table

From: Fred Isaman <[email protected]>

Uses preparsed information gathered from GETDEVICEINFO to
create a dm device table that represents the given volume
topology.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 3 +-
fs/nfs/blocklayout/blocklayoutdm.c | 191 +++++++++++++++++++++++++++++++++++-
2 files changed, 191 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index b705906..d695f8e 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -40,7 +40,8 @@
extern struct class shost_class; /* exported from drivers/scsi/hosts.c */
extern int dm_dev_create(struct dm_ioctl *param); /* from dm-ioctl.c */
extern int dm_dev_remove(struct dm_ioctl *param); /* from dm-ioctl.c */
-
+extern int dm_do_resume(struct dm_ioctl *param);
+extern int dm_table_load(struct dm_ioctl *param, size_t param_size);

struct block_mount_id {
struct super_block *bm_sb; /* back pointer */
diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
index 0e04494..4bff748 100644
--- a/fs/nfs/blocklayout/blocklayoutdm.c
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -36,6 +36,31 @@

#define NFSDBG_FACILITY NFSDBG_PNFS_LD

+/* Defines used for calculating memory usage in nfs4_blk_flatten() */
+#define ARGSIZE 24 /* Max bytes needed for linear target arg string */
+#define SPECSIZE (sizeof8(struct dm_target_spec) + ARGSIZE)
+#define SPECS_PER_PAGE (PAGE_SIZE / SPECSIZE)
+#define SPEC_HEADER_ADJUST (SPECS_PER_PAGE - \
+ (PAGE_SIZE - sizeof8(struct dm_ioctl)) / SPECSIZE)
+#define roundup8(x) (((x)+7) & ~7)
+#define sizeof8(x) roundup8(sizeof(x))
+
+/* Given x>=1, return smallest n such that 2**n >= x */
+static unsigned long find_order(int x)
+{
+ unsigned long rv = 0;
+ for (x--; x; x >>= 1)
+ rv++;
+ return rv;
+}
+
+/* Debugging aid */
+static void print_extent(u64 meta_offset, dev_t disk,
+ u64 disk_offset, u64 length)
+{
+ dprintk("%lli:, %d:%d %lli, %lli\n", meta_offset, MAJOR(disk),
+ MINOR(disk), disk_offset, length);
+}
static int dev_create(const char *name, dev_t *dev)
{
struct dm_ioctl ctrl;
@@ -60,6 +85,14 @@ static int dev_remove(const char *name)
return dm_dev_remove(&ctrl);
}

+static int dev_resume(const char *name)
+{
+ struct dm_ioctl ctrl;
+ memset(&ctrl, 0, sizeof(ctrl));
+ strncpy(ctrl.name, name, DM_NAME_LEN-1);
+ return dm_do_resume(&ctrl);
+}
+
/*
* Release meta device
*/
@@ -141,10 +174,164 @@ struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
return NULL;
}

-/* Stub */
+/*
+ * Given a vol_offset into root, returns the disk and disk_offset it
+ * corresponds to, as well as the length of the contiguous segment thereafter.
+ * All offsets/lengths are in 512-byte sectors.
+ */
+static int nfs4_blk_resolve(int root, struct pnfs_blk_volume *vols,
+ u64 vol_offset, dev_t *disk, u64 *disk_offset,
+ u64 *length)
+{
+ struct pnfs_blk_volume *node;
+ u64 node_offset;
+
+ /* Walk down device tree until we hit a leaf node (VOLUME_SIMPLE) */
+ node = &vols[root];
+ node_offset = vol_offset;
+ *length = node->bv_size;
+ while (1) {
+ dprintk("offset=%lli, length=%lli\n",
+ node_offset, *length);
+ if (node_offset > node->bv_size)
+ return -EIO;
+ switch (node->bv_type) {
+ case PNFS_BLOCK_VOLUME_SIMPLE:
+ *disk = node->bv_dev;
+ dprintk("%s VOLUME_SIMPLE: node->bv_dev %d:%d\n",
+ __func__,
+ MAJOR(node->bv_dev),
+ MINOR(node->bv_dev));
+ *disk_offset = node_offset;
+ *length = min(*length, node->bv_size - node_offset);
+ return 0;
+ case PNFS_BLOCK_VOLUME_SLICE:
+ dprintk("%s VOLUME_SLICE:\n", __func__);
+ *length = min(*length, node->bv_size - node_offset);
+ node_offset += node->bv_offset;
+ node = node->bv_vols[0];
+ break;
+ case PNFS_BLOCK_VOLUME_CONCAT: {
+ u64 next = 0, sum = 0;
+ int i;
+ dprintk("%s VOLUME_CONCAT:\n", __func__);
+ for (i = 0; i < node->bv_vol_n; i++) {
+ next = sum + node->bv_vols[i]->bv_size;
+ if (node_offset < next)
+ break;
+ sum = next;
+ }
+ *length = min(*length, next - node_offset);
+ node_offset -= sum;
+ node = node->bv_vols[i];
+ }
+ break;
+ case PNFS_BLOCK_VOLUME_STRIPE: {
+ u64 global_s_no;
+ u64 stripe_pos;
+ u64 local_s_no;
+ u64 disk_number;
+
+ dprintk("%s VOLUME_STRIPE:\n", __func__);
+ global_s_no = node_offset;
+ /* BUG - note this assumes stripe_unit <= 2**32 */
+ stripe_pos = (u64) do_div(global_s_no,
+ (u32)node->bv_stripe_unit);
+ local_s_no = global_s_no;
+ disk_number = (u64) do_div(local_s_no,
+ (u32) node->bv_vol_n);
+ *length = min(*length,
+ node->bv_stripe_unit - stripe_pos);
+ node_offset = local_s_no * node->bv_stripe_unit +
+ stripe_pos;
+ node = node->bv_vols[disk_number];
+ }
+ break;
+ default:
+ return -EIO;
+ }
+ }
+}
+
+/*
+ * Create an LVM dm device table that represents the volume topology returned
+ * by GETDEVICELIST or GETDEVICEINFO.
+ *
+ * vols: topology with VOLUME_SIMPLEs mapped to visible scsi disks.
+ * size: number of volumes in vols.
+ */
int nfs4_blk_flatten(struct pnfs_blk_volume *vols, int size,
struct pnfs_block_dev *bdev)
{
- return 0;
+ u64 meta_offset = 0;
+ u64 meta_size = vols[size-1].bv_size;
+ dev_t disk;
+ u64 disk_offset, len;
+ int status = 0, count = 0, pages_needed;
+ struct dm_ioctl *ctl;
+ struct dm_target_spec *spec;
+ char *args = NULL;
+ unsigned long p;
+
+ dprintk("%s enter. mdevname %s number of volumes %d\n", __func__,
+ bdev->bm_mdevname, size);
+
+ /* We need to reserve memory to store segments, so need to count
+ * segments. This means we resolve twice, basically throwing away
+ * all info from first run apart from the count. Seems like
+ * there should be a better way.
+ */
+ for (meta_offset = 0; meta_offset < meta_size; meta_offset += len) {
+ status = nfs4_blk_resolve(size-1, vols, meta_offset, &disk,
+ &disk_offset, &len);
+ /* TODO Check status */
+ count += 1;
+ }
+
+ dprintk("%s: Have %i segments\n", __func__, count);
+ pages_needed = ((count + SPEC_HEADER_ADJUST) / SPECS_PER_PAGE) + 1;
+ dprintk("%s: Need %i pages\n", __func__, pages_needed);
+ p = __get_free_pages(GFP_KERNEL, find_order(pages_needed));
+ if (!p)
+ return -ENOMEM;
+ /* A dm_ioctl is placed at the beginning, followed by a series of
+ * (dm_target_spec, argument string) pairs.
+ */
+ ctl = (struct dm_ioctl *) p;
+ spec = (struct dm_target_spec *) (p + sizeof8(*ctl));
+ memset(ctl, 0, sizeof(*ctl));
+ ctl->data_start = (char *) spec - (char *) ctl;
+ ctl->target_count = count;
+ strncpy(ctl->name, bdev->bm_mdevname, DM_NAME_LEN);
+
+ dprintk("%s ctl->name %s\n", __func__, ctl->name);
+ for (meta_offset = 0; meta_offset < meta_size; meta_offset += len) {
+ status = nfs4_blk_resolve(size-1, vols, meta_offset, &disk,
+ &disk_offset, &len);
+ if (!len)
+ break;
+ /* TODO Check status */
+ print_extent(meta_offset, disk, disk_offset, len);
+ spec->sector_start = meta_offset;
+ spec->length = len;
+ spec->status = 0;
+ strcpy(spec->target_type, "linear");
+ args = (char *) (spec + 1);
+ sprintf(args, "%i:%i %lli",
+ MAJOR(disk), MINOR(disk), disk_offset);
+ dprintk("%s args %s\n", __func__, args);
+ spec->next = roundup8(sizeof(*spec) + strlen(args) + 1);
+ spec = (struct dm_target_spec *) (((char *) spec) + spec->next);
+ }
+ ctl->data_size = (char *) spec - (char *) ctl;
+
+ status = dm_table_load(ctl, ctl->data_size);
+ dprintk("%s dm_table_load returns %d\n", __func__, status);
+
+ dev_resume(bdev->bm_mdevname);
+
+ free_pages(p, find_order(pages_needed));
+ dprintk("%s returns %d\n", __func__, status);
+ return status;
}

--
1.7.4.1
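
The VOLUME_STRIPE arithmetic in nfs4_blk_resolve() above is easier to follow in isolation. The sketch below redoes the same divisions with plain 64-bit operators instead of do_div(); struct stripe_loc and stripe_resolve() are hypothetical names for illustration, and all values are in 512-byte sectors, as in the patch.

```c
#include <stdint.h>

/* Hypothetical result type: where a striped-volume offset lands. */
struct stripe_loc {
	uint64_t disk;		/* index of the child volume */
	uint64_t offset;	/* sector offset within that child */
	uint64_t contig;	/* sectors left in the current stripe unit */
};

/* Map an offset on a striped volume to (child, child offset, run length). */
static struct stripe_loc stripe_resolve(uint64_t vol_offset,
					uint64_t stripe_unit, uint64_t ndisks)
{
	struct stripe_loc loc;
	uint64_t global_s_no = vol_offset / stripe_unit; /* stripe unit index */
	uint64_t stripe_pos = vol_offset % stripe_unit;  /* offset in the unit */
	uint64_t local_s_no = global_s_no / ndisks;      /* unit index on child */

	loc.disk = global_s_no % ndisks;
	loc.offset = local_s_no * stripe_unit + stripe_pos;
	loc.contig = stripe_unit - stripe_pos;
	return loc;
}
```

For example, with a stripe unit of 8 sectors across 3 children, offset 29 falls in global stripe unit 3 at position 5, so it maps to child 0 at offset 13 with 3 contiguous sectors remaining. The patch additionally clamps the run length against the length inherited from the parent node, which this sketch leaves out.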


2011-06-10 02:16:45

by Boaz Harrosh

Subject: Re: [PATCH 00/88] pnfs block layout driver

On 06/09/2011 03:15 PM, Jim Rees wrote:
> Boaz Harrosh wrote:
>
> Who is going to SQUASH all the SQUASHMEs and rethink the whole patch
> separation, into something that makes a more logical progression
> and is easier to review? The way it is now I'm not able to review;
> sorry, I got lost trying to understand which is which.
>
> I'm open to suggestions and happy to do the work. I agree that 88 patches
> is nearly indigestible. However I note that Benny seems to have pulled in
> the entire set so I'm not sure how to proceed at this point. Also this code
> was in Benny's 2.6.38 and only got dropped when the 3.0 merge came along, so
> most of it's already been under review for a year or more.

Let's start by squashing all the SQUASHMEs into their proper place; that will
get you down to 49. You might find that this gets hard, and that it is
easier to take the complete code and re-divide it into patches afresh,
crafted more or less on the old division strategy.

It's what Fred did at the final files submission. And it is what I did more
or less with the objects final submission. But you might find that for you
it is easier to rebase and squash the patches together after you re-order
them. It's your call and you will not know before you experiment a little.
I can show you some techniques I use on both these paths.
(Let's meet in B next week)

Then you should do a second pass to see that each patch is compilable and
makes sense, and maybe also reorder the introduction of the generic parts
close to where they are used in the patchset progression. Again, like we did
for files and objects.

Thanks
Boaz

2011-06-07 17:28:05

by Jim Rees

Subject: [PATCH 19/88] pnfsblock: layout alloc and free

From: Fred Isaman <[email protected]>

Allocate the empty list-heads that will hold all the extent data
for the layout.

Signed-off-by: Fred Isaman <[email protected]>
[pnfs: move pnfs_layout_type inline in nfs_inode]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 33 ++++++++++++++++++++++++++++++++-
fs/nfs/blocklayout/blocklayout.h | 25 +++++++++++++++++++++++++
2 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index ebaa48a..677836c 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -83,18 +83,49 @@ bl_write_pagelist(struct pnfs_layout_type *lo,
return PNFS_NOT_ATTEMPTED;
}

+/* STUB */
+static void
+release_extents(struct pnfs_block_layout *bl,
+ struct nfs4_pnfs_layout_segment *range)
+{
+ return;
+}
+
+/* STUB */
+static void
+release_inval_marks(void)
+{
+ return;
+}
+
+/* Note we are relying on caller locking to prevent nasty races. */
static void
bl_free_layout(void *p)
{
+ struct pnfs_block_layout *bl = p;
+
dprintk("%s enter\n", __func__);
+ release_extents(bl, NULL);
+ release_inval_marks();
+ kfree(bl);
return;
}

static void *
bl_alloc_layout(struct pnfs_mount_type *mtype, struct inode *inode)
{
+ struct pnfs_block_layout *bl;
+
dprintk("%s enter\n", __func__);
- return NULL;
+ bl = kzalloc(sizeof(*bl), GFP_KERNEL);
+ if (!bl)
+ return NULL;
+ spin_lock_init(&bl->bl_ext_lock);
+ INIT_LIST_HEAD(&bl->bl_extents[0]);
+ INIT_LIST_HEAD(&bl->bl_extents[1]);
+ bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
+ INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
+ return bl;
}

static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index d695f8e..dd25f1a 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -99,6 +99,31 @@ struct pnfs_blk_sig {
struct pnfs_blk_sig_comp si_comps[PNFS_BLOCK_MAX_SIG_COMP];
};

+struct pnfs_inval_markings {
+ /* STUB */
+};
+
+static inline void
+INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
+{
+ /* STUB */
+}
+
+enum extentclass4 {
+ RW_EXTENT = 0, /* READWRITE and INVAL */
+ RO_EXTENT = 1, /* READ and NONE */
+ EXTENT_LISTS = 2,
+};
+
+struct pnfs_block_layout {
+ struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
+ spinlock_t bl_ext_lock; /* Protects list manipulation */
+ struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
+ sector_t bl_blocksize; /* Server blocksize in sectors */
+};
+
+#define BLK_LO2EXT(lo) ((struct pnfs_block_layout *)lo->ld_data)
+
uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);

#define BLK_READBUF(p, e, nbytes) do { \
--
1.7.4.1
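
As a rough userspace sketch of what bl_alloc_layout() sets up above: one empty list per extent class, plus the server block size converted from bytes to 512-byte sectors with a shift by 9. The demo_* names are hypothetical, calloc() stands in for kzalloc(), and the spinlock and invalidation markings are omitted.

```c
#include <stdint.h>
#include <stdlib.h>

enum extent_class { RW_EXTENT = 0, RO_EXTENT = 1, EXTENT_LISTS = 2 };

/* Hypothetical minimal circular list head, in the style of list_head. */
struct demo_list {
	struct demo_list *next, *prev;
};

struct demo_block_layout {
	struct demo_list extents[EXTENT_LISTS];	/* RW and RO extent lists */
	uint64_t blocksize;			/* server block size, in sectors */
};

/* An empty circular list points at itself in both directions. */
static void demo_list_init(struct demo_list *l)
{
	l->next = l;
	l->prev = l;
}

static struct demo_block_layout *demo_alloc_layout(uint32_t blksize_bytes)
{
	struct demo_block_layout *bl = calloc(1, sizeof(*bl));
	int i;

	if (!bl)
		return NULL;
	for (i = 0; i < EXTENT_LISTS; i++)
		demo_list_init(&bl->extents[i]);
	bl->blocksize = blksize_bytes >> 9;	/* bytes to 512-byte sectors */
	return bl;
}
```

With a 4096-byte server block size this yields a blocksize of 8 sectors; the caller releases the structure with free(), mirroring the kfree() in bl_free_layout().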


2011-06-10 02:20:46

by Boaz Harrosh

Subject: Re: [PATCH 00/88] pnfs block layout driver

On 06/09/2011 07:16 PM, Boaz Harrosh wrote:
> On 06/09/2011 03:15 PM, Jim Rees wrote:
>> Boaz Harrosh wrote:
>>
>> Who is going to SQUASH all the SQUASHMEs and rethink the whole patch
>> separation, into something that makes a more logical progression
>> and is easier to review? The way it is now I'm not able to review;
>> sorry, I got lost trying to understand which is which.
>>
>> I'm open to suggestions and happy to do the work. I agree that 88 patches
>> is nearly indigestible. However I note that Benny seems to have pulled in
>> the entire set so I'm not sure how to proceed at this point. Also this code
>> was in Benny's 2.6.38 and only got dropped when the 3.0 merge came along, so
>> most of it's already been under review for a year or more.
>
> Let's start by squashing all the SQUASHMEs into their proper place; that will
> get you down to 49. You might find that this gets hard, and that it is
> easier to take the complete code and re-divide it into patches afresh,
> crafted more or less on the old division strategy.
>
> It's what Fred did at the final files submission. And it is what I did more
> or less with the objects final submission. But you might find that for you
> it is easier to rebase and squash the patches together after you re-order
> them. It's your call and you will not know before you experiment a little.
> I can show you some techniques I use on both these paths.
> (Let's meet in B next week)
>
> Then you should do a second pass to see that each patch is compilable and
> makes sense, and maybe also reorder the introduction of the generic parts
> close to where they are used in the patchset progression. Again, like we did
> for files and objects.
>
> Thanks
> Boaz

Oh, and I forgot: once you do all that, we might want to revisit all these patches
that have HACK in their title. I guess Fred can help with some of these.

Boaz

2011-06-07 17:26:58

by Jim Rees

Subject: [PATCH 09/88] pnfs: HACK: adjust eof handling

From: Fred Isaman <[email protected]>

This needs to be changed, but will require a major rewrite
of the block layout's IO code. Including it here so I can
get some current code into the tree.

To deal with multipage blocks, the block driver sometimes needs to
write pages of zeros past the EOF without advancing the eof to the
written page. This gives a minimal infrastructure to allow that to happen.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/pnfs.h | 14 ++++++++++++++
fs/nfs/write.c | 3 ++-
2 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index ac536bc..b50cf3a 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -56,6 +56,7 @@ enum pnfs_try_status {

struct pnfs_fsdata {
struct pnfs_layout_segment *lseg;
+ int bypass_eof;
void *private;
};

@@ -313,6 +314,13 @@ static inline void pnfs_clear_request_commit(struct nfs_page *req)
put_lseg(req->wb_commit_lseg);
}

+static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
+ struct pnfs_fsdata *fsdata)
+{
+ return !fsdata || ((struct pnfs_layout_segment *)fsdata == lseg) ||
+ !fsdata->bypass_eof;
+}
+
/* Should the pNFS client commit and return the layout upon a setattr */
static inline bool
pnfs_ld_layoutret_on_setattr(struct inode *inode)
@@ -427,6 +435,12 @@ pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
return NULL;
}

+static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
+ struct pnfs_fsdata *fsdata)
+{
+ return 1;
+}
+
static inline enum pnfs_try_status
pnfs_try_to_read_data(struct nfs_read_data *data,
const struct rpc_call_ops *call_ops)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index fc36db8..75e2a6b 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -683,7 +683,8 @@ static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
if (IS_ERR(req))
return PTR_ERR(req);
/* Update file length */
- nfs_grow_file(page, offset, count);
+ if (pnfs_grow_ok(lseg, fsdata))
+ nfs_grow_file(page, offset, count);
nfs_mark_uptodate(page, req->wb_pgbase, req->wb_bytes);
nfs_mark_request_dirty(req);
nfs_clear_page_tag_locked(req);
--
1.7.4.1
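
The intent of the bypass_eof flag above can be sketched as a small userspace model. This is a simplification under stated assumptions: the demo_* names are hypothetical, the pointer comparison between fsdata and lseg in pnfs_grow_ok() is deliberately left out, and the size update only models the idea that a write past EOF normally advances the file size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pnfs_fsdata; only the flag matters here. */
struct demo_fsdata {
	int bypass_eof;	/* set when writing zero padding past EOF */
};

/* Growing the file is allowed unless the driver asked to bypass EOF. */
static int demo_grow_ok(const struct demo_fsdata *fsdata)
{
	return !fsdata || !fsdata->bypass_eof;
}

/* File size after a write ending at 'end', given the current size. */
static uint64_t demo_size_after_write(uint64_t isize, uint64_t end,
				      const struct demo_fsdata *fsdata)
{
	if (demo_grow_ok(fsdata) && end > isize)
		return end;
	return isize;
}
```

In this model a padding write with bypass_eof set leaves the size at its old value even though it ends beyond it, which is the behavior the commit message describes for pages of zeros written past the EOF.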


2011-06-07 17:32:36

by Jim Rees

Subject: [PATCH 55/88] SQUASHME: pnfsblock: write_begin adjust for removed fields

From: Fred Isaman <[email protected]>

ok_to_use_pnfs and PG_USE_PNFS are gone, instead test req->wb_lseg for NULL.
This also means that the entire do_flush routine is redundant.

Signed-off-by: Fred Isaman <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 21 ++++-----------------
1 files changed, 4 insertions(+), 17 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 66044d4..eb5760f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -988,10 +988,10 @@ bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
if (bl->bl_blocksize < (PAGE_CACHE_SIZE >> 9)) {
dprintk("%s Can't handle blocksize %llu\n", __func__,
(u64)bl->bl_blocksize);
- fsdata->ok_to_use_pnfs = 0;
+ put_lseg(fsdata->lseg);
+ fsdata->lseg = NULL;
return 0;
}
- fsdata->ok_to_use_pnfs = 1;
if (PageMappedToDisk(page)) {
/* Basically, this is a flag that says we have
* successfully called write_begin already on this page.
@@ -1014,7 +1014,8 @@ bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
* should be true if we get here.
*/
BUG_ON(PagePrivate(page));
- fsdata->ok_to_use_pnfs = 0;
+ put_lseg(fsdata->lseg);
+ fsdata->lseg = NULL;
kfree(pages_to_mark);
ret = 0;
} else {
@@ -1122,19 +1123,6 @@ bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
}
}

-/* This checks if old req will likely use same io method as soon
- * to be created request, and returns False if they are the same.
- */
-static int
-bl_do_flush(struct pnfs_layout_segment *lseg, struct nfs_page *req,
- struct pnfs_fsdata *fsdata)
-{
- int will_try_pnfs;
- dprintk("%s enter\n", __func__);
- will_try_pnfs = fsdata ? (fsdata->ok_to_use_pnfs) : (lseg != NULL);
- return will_try_pnfs != test_bit(PG_USE_PNFS, &req->wb_flags);
-}
-
static struct layoutdriver_io_operations blocklayout_io_operations = {
.commit = bl_commit,
.read_pagelist = bl_read_pagelist,
@@ -1156,7 +1144,6 @@ static struct layoutdriver_io_operations blocklayout_io_operations = {
static struct layoutdriver_policy_operations blocklayout_policy_operations = {
.get_stripesize = bl_get_stripesize,
.pg_test = bl_pg_test,
- .do_flush = bl_do_flush,
};

static struct pnfs_layoutdriver_type blocklayout_type = {
--
1.7.4.1


2011-06-07 17:26:51

by Jim Rees

Subject: [PATCH 08/88] HACK: propagate fsdata into nfs_writepage_setup

From: Fred Isaman <[email protected]>

This is needed for the eof hack

Signed-off-by: Fred Isaman <[email protected]>
---
fs/nfs/file.c | 4 ++--
fs/nfs/write.c | 9 ++++++---
include/linux/nfs_fs.h | 3 ++-
3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 3af1c00..1768762 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -469,7 +469,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
status = pnfs_write_end(file, page, pos, len, copied, lseg);
if (status)
goto out;
- status = nfs_updatepage(file, page, offset, copied);
+ status = nfs_updatepage(file, page, offset, copied, lseg, fsdata);

out:
unlock_page(page);
@@ -597,7 +597,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

ret = VM_FAULT_LOCKED;
if (nfs_flush_incompatible(filp, page) == 0 &&
- nfs_updatepage(filp, page, 0, pagelen) == 0)
+ nfs_updatepage(filp, page, 0, pagelen, NULL, NULL) == 0)
goto out;

ret = VM_FAULT_SIGBUS;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index e268e3b..fc36db8 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -673,7 +673,9 @@ out:
}

static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
- unsigned int offset, unsigned int count)
+ unsigned int offset, unsigned int count,
+ struct pnfs_layout_segment *lseg, void *fsdata)
+
{
struct nfs_page *req;

@@ -734,7 +736,8 @@ static int nfs_write_pageuptodate(struct page *page, struct inode *inode)
* things with a page scheduled for an RPC call (e.g. invalidate it).
*/
int nfs_updatepage(struct file *file, struct page *page,
- unsigned int offset, unsigned int count)
+ unsigned int offset, unsigned int count,
+ struct pnfs_layout_segment *lseg, void *fsdata)
{
struct nfs_open_context *ctx = nfs_file_open_context(file);
struct inode *inode = page->mapping->host;
@@ -759,7 +762,7 @@ int nfs_updatepage(struct file *file, struct page *page,
offset = 0;
}

- status = nfs_writepage_setup(ctx, page, offset, count);
+ status = nfs_writepage_setup(ctx, page, offset, count, lseg, fsdata);
if (status < 0)
nfs_set_pageerror(page);

diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 1b93b9c..d45e3b3 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -510,7 +510,8 @@ extern int nfs_congestion_kb;
extern int nfs_writepage(struct page *page, struct writeback_control *wbc);
extern int nfs_writepages(struct address_space *, struct writeback_control *);
extern int nfs_flush_incompatible(struct file *file, struct page *page);
-extern int nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
+extern int nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int,
+ struct pnfs_layout_segment *, void *);
extern void nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);

/*
--
1.7.4.1


2011-06-07 17:32:24

by Jim Rees

Subject: [PATCH 53/88] SQUASHME: pnfsblock: set pnfs_blksize before calling set_pnfs_layoutdriver

From: Zhang Jingwang <[email protected]>

For the block/volume layout driver, set_pnfs_layoutdriver() calls
initialize_mountpoint(), which checks the value of pnfs_blksize, so
pnfs_blksize must be set before set_pnfs_layoutdriver() is called.

Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/client.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6c6236b..b2c6920 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -937,8 +937,8 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fh *mntf
if (server->wsize > NFS_MAX_FILE_IO_SIZE)
server->wsize = NFS_MAX_FILE_IO_SIZE;
server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);
server->pnfs_blksize = fsinfo->blksize;
+ set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);

server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);

--
1.7.4.1
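
The ordering issue fixed above can be shown with a toy model. Everything here is hypothetical: the demo_* names are invented, and the assumption that the mount-time hook fails on a zero block size is for illustration only (the commit message says only that the value is checked).

```c
/* Hypothetical stand-in for the relevant part of struct nfs_server. */
struct demo_server {
	unsigned int pnfs_blksize;
	int ld_ready;
};

/* Assumed behavior: the driver hook rejects an unset block size. */
static int demo_initialize_mountpoint(struct demo_server *s)
{
	if (s->pnfs_blksize == 0)
		return -1;
	s->ld_ready = 1;
	return 0;
}

/* Old order: the hook runs before the block size is stored. */
static int demo_set_fsinfo_old(struct demo_server *s, unsigned int blksize)
{
	int ret = demo_initialize_mountpoint(s);

	s->pnfs_blksize = blksize;
	return ret;
}

/* Patched order: store the block size first, then run the hook. */
static int demo_set_fsinfo_fixed(struct demo_server *s, unsigned int blksize)
{
	s->pnfs_blksize = blksize;
	return demo_initialize_mountpoint(s);
}
```

With the old order the hook sees a zero block size on a freshly zeroed server structure; with the patched order it sees the value reported by the server.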


2011-06-07 17:30:16

by Jim Rees

Subject: [PATCH 36/88] pnfsblock: encode_layoutcommit

From: Fred Isaman <[email protected]>

In the blocklayout driver, two things happen during
layoutcommit and its cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout's rw
extents list, for reads.

In the new system, where the actual XDR encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit and bl_cleanup_layoutcommit.
(The generic layer provides a void pointer to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on the above structure.
FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
to set_to_rw() as before, but without the XDR decoding of
the list that was needed before. The commit-extent-list is
then freed, and finally the allocated structure is freed.

Signed-off-by: Fred Isaman <[email protected]>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <[email protected]>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <[email protected]>
[pnfsblock: prevent commit list corruption]
[pnfsblock: fix layoutcommit with an empty opaque]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 1 +
fs/nfs/blocklayout/blocklayout.h | 3 ++
fs/nfs/blocklayout/extents.c | 48 ++++++++++++++++++++++++++++++++++++++
3 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 0277974..6132e8e 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -653,6 +653,7 @@ bl_encode_layoutcommit(struct pnfs_layout_type *lo, struct xdr_stream *xdr,
const struct pnfs_layoutcommit_arg *arg)
{
dprintk("%s enter\n", __func__);
+ encode_pnfs_block_layoutupdate(BLK_LO2EXT(lo), xdr, arg);
}

static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 780d757..1c110e1 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -263,6 +263,9 @@ void put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *alloc_extent(void);
struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
+int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+ struct xdr_stream *xdr,
+ const struct pnfs_layoutcommit_arg *arg);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
int mark_for_commit(struct pnfs_block_extent *be,
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index d4e4a92..b2f8643 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -680,3 +680,51 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
print_bl_extent(ret);
return ret;
}
+
+int
+encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+ struct xdr_stream *xdr,
+ const struct pnfs_layoutcommit_arg *arg)
+{
+ sector_t start, end;
+ struct pnfs_block_short_extent *lce, *save;
+ unsigned int count;
+ struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
+ struct list_head *ranges = &bld->ranges;
+ __be32 *p, *xdr_start;
+
+ dprintk("%s enter\n", __func__);
+ start = arg->lseg.offset >> 9;
+ end = start + (arg->lseg.length >> 9);
+ dprintk("%s set start=%llu, end=%llu\n",
+ __func__, (u64)start, (u64)end);
+
+ /* BUG - creation of bl_commit is buggy - need to wait for
+ * entire block to be marked WRITTEN before it can be added.
+ */
+ spin_lock(&bl->bl_ext_lock);
+ list_splice_init(&bl->bl_commit, ranges);
+ count = bl->bl_count;
+ bl->bl_count = 0;
+ /* Want to adjust for possible truncate */
+ /* We now want to adjust argument range */
+ spin_unlock(&bl->bl_ext_lock);
+
+ dprintk("%s found %i ranges\n", __func__, count);
+ /* XDR encode the ranges found */
+ xdr_start = p = xdr_reserve_space(xdr, 8);
+ p++;
+ WRITE32(count);
+ list_for_each_entry_safe(lce, save, ranges, bse_node) {
+ p = xdr_reserve_space(xdr, 7 * 4 + sizeof(lce->bse_devid.data));
+
+ WRITE_DEVID(&lce->bse_devid);
+ WRITE64(lce->bse_f_offset << 9);
+ WRITE64(lce->bse_length << 9);
+ WRITE64(0LL);
+ WRITE32(PNFS_BLOCK_READWRITE_DATA);
+ }
+
+ *xdr_start = cpu_to_be32((xdr->p - xdr_start - 1) * 4);
+ return 0;
+}
--
1.7.4.1
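
encode_pnfs_block_layoutupdate() above reserves a length word first, writes the extent count and the extents, and only then fills in the length. A minimal userspace sketch of that back-patching technique follows. The demo_* names and the fixed-size buffer are hypothetical, and each extent is reduced to two 64-bit fields rather than the full device id, offsets, and state that the patch writes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size encode buffer standing in for an xdr_stream. */
struct demo_xdr {
	uint8_t buf[256];
	size_t pos;
};

struct demo_extent {
	uint64_t offset;
	uint64_t length;
};

/* Store a 32-bit value in big-endian (XDR) byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void put_be64(uint8_t *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

/* Encode <length><count><extents...>; the length is filled in last. */
static int demo_encode_update(struct demo_xdr *x,
			      const struct demo_extent *ext, uint32_t count)
{
	size_t need = 8 + (size_t)count * 16;
	size_t start, i;

	if (need > sizeof(x->buf) - x->pos)
		return -1;
	start = x->pos;
	x->pos += 4;			/* reserve the length word */
	put_be32(x->buf + x->pos, count);
	x->pos += 4;
	for (i = 0; i < count; i++) {
		put_be64(x->buf + x->pos, ext[i].offset);
		put_be64(x->buf + x->pos + 8, ext[i].length);
		x->pos += 16;
	}
	/* Back-patch: number of bytes that follow the length word. */
	put_be32(x->buf + start, (uint32_t)(x->pos - start - 4));
	return 0;
}
```

For two extents the length word comes out as 36: 4 bytes for the count plus 2 × 16 bytes of extent data. This matches the patch's final assignment, which counts the words after the length word and multiplies by 4.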


2011-06-10 19:02:45

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 10:17, [email protected] wrote:
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
> Sent: Friday, June 10, 2011 8:36 PM
> To: Peng, Tao
> Cc: [email protected]; [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-10 01:36, [email protected] wrote:
>> Hi, Benny,
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
>> Sent: Friday, June 10, 2011 5:23 AM
>> To: Jim Rees
>> Cc: Peng Tao; [email protected]; peter honeyman
>> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>>
>> On 2011-06-09 06:58, Jim Rees wrote:
>>> Benny Halevy wrote:
>>>
>>> > My understanding is that layoutget specifies a min and max, and the server
>>>
>>> There's a min. What do you consider the max?
>>> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>>
>>> The spec doesn't say max, it says "desired." I guess I assumed the server
>>> wouldn't normally return more than desired.
>>
>> No, the server may freely upgrade the returned layout segment by returning
>> a layout for a larger byte range or even returning a RW layout where a READ
>> layout was asked for.
>> [PT] It is true that the server can upgrade the layout segment freely. But there is always a price to pay. The server has to deal with all kinds of clients.
>> If the server returns more than was asked for, it may hurt other clients.
>
> And if all clients ask for more than they need and the server just
> gives it to them, what do you get out of that?
> [PT] We cannot avoid this even if the client has an automatic layout prefetch algorithm implemented... think about all clients doing sequential IO...

Right and that's why the server has to be intelligent about it.

Benny

2011-06-07 17:32:30

by Jim Rees

Subject: [PATCH 54/88] SQUASHME: pnfsblock: get rid of threshold policy ops

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 9 ---------
1 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 688984f..66044d4 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1105,13 +1105,6 @@ bl_get_stripesize(struct pnfs_layout_type *lo)
return 0;
}

-static ssize_t
-bl_get_io_threshold(struct pnfs_layout_type *lo, struct inode *inode)
-{
- dprintk("%s enter\n", __func__);
- return 0;
-}
-
/* This is called by nfs_can_coalesce_requests via nfs_pageio_do_add_request.
* Should return False if there is a reason requests can not be coalesced,
* otherwise, should default to returning True.
@@ -1162,8 +1155,6 @@ static struct layoutdriver_io_operations blocklayout_io_operations = {

static struct layoutdriver_policy_operations blocklayout_policy_operations = {
.get_stripesize = bl_get_stripesize,
- .get_read_threshold = bl_get_io_threshold,
- .get_write_threshold = bl_get_io_threshold,
.pg_test = bl_pg_test,
.do_flush = bl_do_flush,
};
--
1.7.4.1


2011-06-07 17:27:19

by Jim Rees

Subject: [PATCH 12/88] pnfsblock: expose scsi interface

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
drivers/scsi/hosts.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 4f7a582..7d91903 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -50,10 +50,11 @@ static void scsi_host_cls_release(struct device *dev)
put_device(&class_to_shost(dev)->shost_gendev);
}

-static struct class shost_class = {
+struct class shost_class = {
.name = "scsi_host",
.dev_release = scsi_host_cls_release,
};
+EXPORT_SYMBOL(shost_class);

/**
* scsi_host_set_state - Take the given host through the host state model.
--
1.7.4.1


2011-06-10 06:05:01

by Peng, Tao

Subject: RE: [PATCH 87/88] Add configurable prefetch size for layoutget

Hi, Benny,

Cheers,
-Bergwolf


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
Sent: Friday, June 10, 2011 5:30 AM
To: Peng Tao
Cc: Jim Rees; [email protected]; peter honeyman
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-09 07:54, Peng Tao wrote:
> On Thu, Jun 9, 2011 at 2:06 PM, Benny Halevy <[email protected]> wrote:
>> On 2011-06-08 03:15, Peng Tao wrote:
>>> On 6/8/11, Jim Rees <[email protected]> wrote:
>>>> Benny Halevy wrote:
>>>>
>>>> NAK.
>>>> This affects all layout types. In particular it is undesired
>>>> for write layouts that extend the file with the objects layout.
>>>> The server can extend the layout segments range
>>>> over what the client requested so why would the client
>>>> ask for artificially large layouts?
>>>>
>>>> This has actually been the subject of some debate over Thursday night
>>>> beers. The problem we're trying to solve is that the client is spending 98%
>>>> of its time in layoutget. This patch gives us something like a 10x
>>>> speedup. But many of us think it's not the right fix. I suggest we discuss
>>>> next week.
>>>>
>>
>> Sure.
>>
>>>> But note that this patch doesn't change anything unless you set the sysctl.
>>> There is a default value of 2M. Maybe we can set it to page size by
>>> default so other layouts are not affected, and block layout users can
>>> set it by hand if they care about performance. Does this make
>>> sense?
>>
>> If doing it at all why use a sysctl rather than a mount option?
> The purpose of using a sysctl is to give client the ability to change
> it on the fly. In theory, layout prefetching can benefit all layout
> types. So the patch tries to solve it in the pnfs generic layer.
>

But the need for this varies per-server and many times per application.
Think sequential vs. random I/O. Therefore a mount option would help
tuning the behavior on a per-use basis. Global behavior must be implemented
using a dynamic algorithm that would take both the workload and the server
observed behavior into account.
[PT] Indeed. A dynamic algorithm should be able to solve all this, but it often takes longer to design and get accepted. It has to prove better in most scenarios without hurting the rest.



2011-06-07 17:30:33

by Jim Rees

Subject: [PATCH 38/88] pnfsblock: merge rw extents

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/extents.c | 48 ++++++++++++++++++++++++++++++++++++++---
1 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 1719a67..a05ee2a 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -773,12 +773,43 @@ _prep_new_extent(struct pnfs_block_extent *new,
new->be_inval = orig->be_inval;
}

+/* Tries to merge be with extent in front of it in list.
+ * Frees storage if not used.
+ */
+static struct pnfs_block_extent *
+_front_merge(struct pnfs_block_extent *be, struct list_head *head,
+ struct pnfs_block_extent *storage)
+{
+ struct pnfs_block_extent *prev;
+
+ if (!storage)
+ goto no_merge;
+ if (&be->be_node == head || be->be_node.prev == head)
+ goto no_merge;
+ prev = list_entry(be->be_node.prev, struct pnfs_block_extent, be_node);
+ if ((prev->be_f_offset + prev->be_length != be->be_f_offset) ||
+ !extents_consistent(prev, be))
+ goto no_merge;
+ _prep_new_extent(storage, prev, prev->be_f_offset,
+ prev->be_length + be->be_length, prev->be_state);
+ list_replace(&prev->be_node, &storage->be_node);
+ put_extent(prev);
+ list_del(&be->be_node);
+ put_extent(be);
+ return storage;
+
+ no_merge:
+ kfree(storage);
+ return be;
+}
+
static u64
set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
{
- u64 rv = 0;
+ u64 rv = offset + length;
struct pnfs_block_extent *be, *e1, *e2, *e3, *new, *old;
struct pnfs_block_extent *children[3];
+ struct pnfs_block_extent *merge1 = NULL, *merge2 = NULL;
int i = 0, j;

dprintk("%s(%llu, %llu)\n", __func__, offset, length);
@@ -792,7 +823,6 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)

spin_lock(&bl->bl_ext_lock);
be = find_get_extent_locked(bl, offset);
- print_bl_extent(be);
rv = be->be_f_offset + be->be_length;
if (be->be_state != PNFS_BLOCK_INVALID_DATA) {
spin_unlock(&bl->bl_ext_lock);
@@ -805,13 +835,15 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
PNFS_BLOCK_INVALID_DATA);
children[i++] = e1;
kref_get(&e1->be_refcnt);
+ print_bl_extent(e1);
} else
- kfree(e1);
+ merge1 = e1;
_prep_new_extent(e2, be, offset,
min(length, be->be_f_offset + be->be_length - offset),
PNFS_BLOCK_READWRITE_DATA);
children[i++] = e2;
kref_get(&e2->be_refcnt);
+ print_bl_extent(e2);
if (offset + length < be->be_f_offset + be->be_length) {
_prep_new_extent(e3, be, e2->be_f_offset + e2->be_length,
be->be_f_offset + be->be_length -
@@ -819,8 +851,9 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
PNFS_BLOCK_INVALID_DATA);
children[i++] = e3;
kref_get(&e3->be_refcnt);
+ print_bl_extent(e3);
} else
- kfree(e3);
+ merge2 = e3;

/* Remove be from list, and insert the e* */
/* We don't get refs on e*, since this list is the base reference
@@ -831,11 +864,18 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
new = children[0];
list_replace(&be->be_node, &new->be_node);
put_extent(be);
+ new = _front_merge(new, &bl->bl_extents[RW_EXTENT], merge1);
for (j = 1; j < i; j++) {
old = new;
new = children[j];
list_add(&new->be_node, &old->be_node);
}
+ if (merge2) {
+ /* This is a HACK, should just create a _back_merge function */
+ new = list_entry(new->be_node.next,
+ struct pnfs_block_extent, be_node);
+ new = _front_merge(new, &bl->bl_extents[RW_EXTENT], merge2);
+ }
spin_unlock(&bl->bl_ext_lock);

/* Since we removed the base reference above, be is now scheduled for
--
1.7.4.1
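
For readers following the merge logic in the patch above, here is a minimal user-space sketch of the idea behind `_front_merge`: an extent is folded into its predecessor only when the two are contiguous in file offset and otherwise compatible. It is not code from the patch set. The `struct extent` type, the `state` field standing in for whatever `extents_consistent()` checks, and the sorted array used instead of a linked list are all assumptions made for illustration; the kernel code works on `struct pnfs_block_extent` list nodes under `bl_ext_lock` and uses preallocated storage for the merged extent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent; offsets and lengths are in 512-byte sectors. */
struct extent {
	uint64_t f_offset;	/* start of the extent in the file */
	uint64_t length;	/* length of the extent */
	int state;		/* stand-in for the checks in extents_consistent() */
};

/* Returns 1 if b starts exactly where a ends and the states match. */
static int can_merge(const struct extent *a, const struct extent *b)
{
	return a->f_offset + a->length == b->f_offset && a->state == b->state;
}

/*
 * Try to merge ext[idx] into ext[idx - 1] in a sorted array of n extents.
 * On success the array shrinks by one entry. Returns the new count.
 */
static size_t front_merge(struct extent *ext, size_t n, size_t idx)
{
	size_t i;

	assert(ext != NULL);
	if (idx == 0 || idx >= n || !can_merge(&ext[idx - 1], &ext[idx]))
		return n;	/* nothing in front, or not mergeable */
	ext[idx - 1].length += ext[idx].length;
	for (i = idx; i + 1 < n; i++)	/* close the gap left by ext[idx] */
		ext[i] = ext[i + 1];
	return n - 1;
}
```

Calling `front_merge` on the newly written extent, and then once more on the entry that follows it, mirrors how the patch reuses `_front_merge` for the back merge instead of adding a separate `_back_merge` function.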


2011-06-07 17:29:18

by Jim Rees

Subject: [PATCH 29/88] pnfsblock: write_begin

From: Fred Isaman <[email protected]>

Implements bl_write_begin and bl_do_flush, allowing block driver to read
in page "around" the data that is about to be copied to the page.

[pnfsblock: fix 64-bit compiler warnings for write_begin]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 196 ++++++++++++++++++++++++++++++++++++++
1 files changed, 196 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 99de9e3..b3ad99d 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -32,6 +32,7 @@
#include <linux/module.h>
#include <linux/init.h>

+#include <linux/buffer_head.h> /* various write calls */
#include <linux/bio.h> /* struct bio */
#include <linux/vmalloc.h>
#include "blocklayout.h"
@@ -637,6 +638,186 @@ bl_uninitialize_mountpoint(struct pnfs_mount_type *mtype)
return 0;
}

+/* STUB - mark intersection of layout and page as bad, so is not
+ * used again.
+ */
+static void mark_bad_read(void)
+{
+ return;
+}
+
+/* Copied from buffer.c */
+static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate)
+{
+ if (uptodate) {
+ set_buffer_uptodate(bh);
+ } else {
+ /* This happens, due to failed READA attempts. */
+ clear_buffer_uptodate(bh);
+ }
+ unlock_buffer(bh);
+}
+
+/* Copied from buffer.c */
+static void end_buffer_read_nobh(struct buffer_head *bh, int uptodate)
+{
+ __end_buffer_read_notouch(bh, uptodate);
+}
+
+/*
+ * map_block: map a requested I/O block (isect) into an offset in the LVM
+ * meta block_device
+ */
+static void
+map_block(sector_t isect, struct pnfs_block_extent *be, struct buffer_head *bh)
+{
+ dprintk("%s enter be=%p\n", __func__, be);
+
+ set_buffer_mapped(bh);
+ bh->b_bdev = be->be_mdev;
+ bh->b_blocknr = (isect - be->be_f_offset + be->be_v_offset) >>
+ (be->be_mdev->bd_inode->i_blkbits - 9);
+
+ dprintk("%s isect %ld, bh->b_blocknr %ld, using bsize %Zd\n",
+ __func__, (long)isect,
+ (long)bh->b_blocknr,
+ bh->b_size);
+ return;
+}
+
+/* Given an unmapped page, zero it (or read in page for COW),
+ * and set appropriate flags/markings, but it is safe to not initialize
+ * the range given in [from, to).
+ */
+/* This is loosely based on nobh_write_begin */
+static int
+init_page_for_write(struct pnfs_block_layout *bl, struct page *page,
+ unsigned from, unsigned to, sector_t **pages_to_mark)
+{
+ struct buffer_head *bh;
+ int inval, ret = -EIO;
+ struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+ sector_t isect;
+
+ dprintk("%s enter, %p\n", __func__, page);
+ bh = alloc_page_buffers(page, PAGE_CACHE_SIZE, 0);
+ if (!bh) {
+ ret = -ENOMEM;
+ goto cleanup;
+ }
+
+ isect = (sector_t)page->index << (PAGE_CACHE_SHIFT - 9);
+ be = find_get_extent(bl, isect, &cow_read);
+ if (!be)
+ goto cleanup;
+ inval = is_hole(be, isect);
+ dprintk("%s inval=%i, from=%u, to=%u\n", __func__, inval, from, to);
+ if (inval) {
+ if (be->be_state == PNFS_BLOCK_NONE_DATA) {
+ dprintk("%s PANIC - got NONE_DATA extent %p\n",
+ __func__, be);
+ goto cleanup;
+ }
+ map_block(isect, be, bh);
+ unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
+ }
+ if (PageUptodate(page)) {
+ /* Do nothing */
+ } else if (inval && !cow_read) {
+ zero_user_segments(page, 0, from, to, PAGE_CACHE_SIZE);
+ } else if (0 < from || PAGE_CACHE_SIZE > to) {
+ struct pnfs_block_extent *read_extent;
+
+ read_extent = (inval && cow_read) ? cow_read : be;
+ map_block(isect, read_extent, bh);
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_read_nobh;
+ submit_bh(READ, bh);
+ dprintk("%s: Waiting for buffer read\n", __func__);
+ /* XXX Don't really want to hold layout lock here */
+ wait_on_buffer(bh);
+ if (!buffer_uptodate(bh))
+ goto cleanup;
+ }
+ if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+ /* There is a BUG here if is a short copy after write_begin,
+ * but I think this is a generic fs bug. The problem is that
+ * we have marked the page as initialized, but it is possible
+ * that the section not copied may never get copied.
+ */
+ ret = mark_initialized_sectors(be->be_inval, isect,
+ PAGE_CACHE_SECTORS,
+ pages_to_mark);
+ /* Want to preallocate mem so above can't fail */
+ if (ret)
+ goto cleanup;
+ }
+ SetPageMappedToDisk(page);
+ ret = 0;
+
+cleanup:
+ free_buffer_head(bh);
+ put_extent(be);
+ put_extent(cow_read);
+ if (ret) {
+ /* Need to mark layout with bad read...should now
+ * just use nfs4 for reads and writes.
+ */
+ mark_bad_read();
+ }
+ return ret;
+}
+
+static int
+bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
+ unsigned count, struct pnfs_fsdata *fsdata)
+{
+ unsigned from, to;
+ int ret;
+ sector_t *pages_to_mark = NULL;
+ struct pnfs_block_layout *bl = BLK_LSEG2EXT(lseg);
+
+ dprintk("%s enter, %u@%lld\n", __func__, count, pos);
+ print_page(page);
+ /* The following code assumes blocksize >= PAGE_CACHE_SIZE */
+ if (bl->bl_blocksize < (PAGE_CACHE_SIZE >> 9)) {
+ dprintk("%s Can't handle blocksize %llu\n", __func__,
+ (u64)bl->bl_blocksize);
+ fsdata->ok_to_use_pnfs = 0;
+ return 0;
+ }
+ fsdata->ok_to_use_pnfs = 1;
+ if (PageMappedToDisk(page)) {
+ /* Basically, this is a flag that says we have
+ * successfully called write_begin already on this page.
+ */
+ /* NOTE - there are cache consistency issues here.
+ * For example, what if the layout is recalled, then regained?
+ * If the file is closed and reopened, will the page flags
+ * be reset? If not, we'll have to use layout info instead of
+ * the page flag.
+ */
+ return 0;
+ }
+ from = pos & (PAGE_CACHE_SIZE - 1);
+ to = from + count;
+ ret = init_page_for_write(bl, page, from, to, &pages_to_mark);
+ if (ret) {
+ dprintk("%s init page failed with %i", __func__, ret);
+ /* Revert back to plain NFS and just continue on with
+ * write. This assumes there is no request attached, which
+ * should be true if we get here.
+ */
+ BUG_ON(PagePrivate(page));
+ fsdata->ok_to_use_pnfs = 0;
+ kfree(pages_to_mark);
+ ret = 0;
+ } else {
+ fsdata->private = pages_to_mark;
+ }
+ return ret;
+}
+
static ssize_t
bl_get_stripesize(struct pnfs_layout_type *lo)
{
@@ -663,10 +844,24 @@ bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
return 1;
}

+/* This checks if old req will likely use same io method as soon
+ * to be created request, and returns False if they are the same.
+ */
+static int
+bl_do_flush(struct pnfs_layout_segment *lseg, struct nfs_page *req,
+ struct pnfs_fsdata *fsdata)
+{
+ int will_try_pnfs;
+ dprintk("%s enter\n", __func__);
+ will_try_pnfs = fsdata ? (fsdata->ok_to_use_pnfs) : (lseg != NULL);
+ return will_try_pnfs != test_bit(PG_USE_PNFS, &req->wb_flags);
+}
+
static struct layoutdriver_io_operations blocklayout_io_operations = {
.commit = bl_commit,
.read_pagelist = bl_read_pagelist,
.write_pagelist = bl_write_pagelist,
+ .write_begin = bl_write_begin,
.alloc_layout = bl_alloc_layout,
.free_layout = bl_free_layout,
.alloc_lseg = bl_alloc_lseg,
@@ -683,6 +878,7 @@ static struct layoutdriver_policy_operations blocklayout_policy_operations = {
.get_read_threshold = bl_get_io_threshold,
.get_write_threshold = bl_get_io_threshold,
.pg_test = bl_pg_test,
+ .do_flush = bl_do_flush,
};

static struct pnfs_layoutdriver_type blocklayout_type = {
--
1.7.4.1
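
The sector arithmetic in `init_page_for_write`, `map_block` and `bl_write_begin` above can be tried in isolation. The following is a rough user-space sketch, assuming 512-byte sectors as the patch does; the function names and the plain integer parameters are hypothetical stand-ins for the `struct page`, `struct pnfs_block_extent` and `struct buffer_head` fields used in the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, matching the ">> 9" in the patch */

/* First file sector covered by a page: page index times sectors per page. */
static uint64_t page_to_sector(uint64_t page_index, unsigned page_shift)
{
	assert(page_shift >= SECTOR_SHIFT);
	return page_index << (page_shift - SECTOR_SHIFT);
}

/*
 * Translate a file sector into a device block number, as map_block does:
 * take the offset of isect within the extent (isect - f_offset), add the
 * extent's start on the volume (v_offset), then scale from sectors to
 * blocks of size (1 << blkbits).
 */
static uint64_t sector_to_block(uint64_t isect, uint64_t f_offset,
				uint64_t v_offset, unsigned blkbits)
{
	assert(blkbits >= SECTOR_SHIFT && isect >= f_offset);
	return (isect - f_offset + v_offset) >> (blkbits - SECTOR_SHIFT);
}

/* Byte offset of a file position within its page, as in bl_write_begin. */
static unsigned offset_in_page(uint64_t pos, unsigned page_shift)
{
	return (unsigned)(pos & ((1ULL << page_shift) - 1));
}
```

With 4 KiB pages (`page_shift` of 12), page 3 starts at sector 24. If the extent starts at file sector 16 and volume sector 1000, that page maps to device block 126 for a 4 KiB block size.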


2011-06-07 17:32:12

by Jim Rees

Subject: [PATCH 51/88] pnfsblock: expose block_class interface

From: Tao Guo <[email protected]>

Export symbol block_class instead of shost_class for much more
generic block device iterations.

Signed-off-by: Huang Haoi <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
block/genhd.c | 1 +
drivers/scsi/hosts.c | 1 -
2 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 95822ae..6f19558 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1108,6 +1108,7 @@ static void disk_release(struct device *dev)
struct class block_class = {
.name = "block",
};
+EXPORT_SYMBOL(block_class);

static char *block_devnode(struct device *dev, mode_t *mode)
{
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 7d91903..ac24f02 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -54,7 +54,6 @@ struct class shost_class = {
.name = "scsi_host",
.dev_release = scsi_host_cls_release,
};
-EXPORT_SYMBOL(shost_class);

/**
* scsi_host_set_state - Take the given host through the host state model.
--
1.7.4.1


2011-06-07 17:29:13

by Jim Rees

Subject: [PATCH 28/88] pnfsblock: SPLITME: add extent manipulation functions

From: Fred Isaman <[email protected]>

Adds working implementations of various support functions
to handle INVAL extents, needed by writes, such as
mark_initialized_sectors and is_sector_initialized.

SPLIT: this needs to be split into the exported functions, and the
range support functions (which will be replaced eventually.)

[pnfsblock: fix 64-bit compiler warnings for extent manipulation]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 23 +++-
fs/nfs/blocklayout/extents.c | 265 +++++++++++++++++++++++++++++++++++++-
2 files changed, 284 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 493d4d3..c4b7b40 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -37,6 +37,8 @@
#include <linux/nfs4_pnfs.h>
#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/

+#define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)
+
#define PG_pnfserr PG_owner_priv_1
#define PagePnfsErr(page) test_bit(PG_pnfserr, &(page)->flags)
#define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
@@ -111,8 +113,17 @@ enum exstate4 {
PNFS_BLOCK_NONE_DATA = 3 /* unmapped, it's a hole */
};

+#define MY_MAX_TAGS (15) /* tag bitnums used must be less than this */
+
+struct my_tree_t {
+ sector_t mtt_step_size; /* Internal sector alignment */
+ struct list_head mtt_stub; /* Should be a radix tree */
+};
+
struct pnfs_inval_markings {
- /* STUB */
+ spinlock_t im_lock;
+ struct my_tree_t im_tree; /* Sectors that need LAYOUTCOMMIT */
+ sector_t im_block_size; /* Server blocksize in sectors */
};

/* sector_t fields are all in 512-byte sectors */
@@ -131,7 +142,11 @@ struct pnfs_block_extent {
static inline void
INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
{
- /* STUB */
+ spin_lock_init(&marks->im_lock);
+ INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
+ marks->im_block_size = blocksize;
+ marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
+ blocksize);
}

enum extentclass4 {
@@ -211,8 +226,12 @@ void free_block_dev(struct pnfs_block_dev *bdev);
struct pnfs_block_extent *
find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
struct pnfs_block_extent **cow_read);
+int mark_initialized_sectors(struct pnfs_inval_markings *marks,
+ sector_t offset, sector_t length,
+ sector_t **pages);
void put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *alloc_extent(void);
+struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 31fe359..ef8a5b7 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -33,10 +33,263 @@
#include "blocklayout.h"
#define NFSDBG_FACILITY NFSDBG_PNFS_LD

+/* Bit numbers */
+#define EXTENT_INITIALIZED 0
+#define EXTENT_WRITTEN 1
+#define EXTENT_IN_COMMIT 2
+#define INTERNAL_EXISTS MY_MAX_TAGS
+#define INTERNAL_MASK ((1 << INTERNAL_EXISTS) - 1)
+
+struct pnfs_inval_tracking {
+ struct list_head it_link;
+ int it_sector;
+ int it_tags;
+};
+
+/* Returns largest t<=s s.t. t%base==0 */
+static inline sector_t normalize(sector_t s, int base)
+{
+ sector_t tmp = s; /* Since do_div modifies its argument */
+ return s - do_div(tmp, base);
+}
+
+static inline sector_t normalize_up(sector_t s, int base)
+{
+ return normalize(s + base - 1, base);
+}
+
+/* Complete stub using list while determine API wanted */
+
+/* Returns tags, or negative */
+static int32_t _find_entry(struct my_tree_t *tree, u64 s)
+{
+ struct pnfs_inval_tracking *pos;
+
+ dprintk("%s(%llu) enter\n", __func__, s);
+ list_for_each_entry(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector < s)
+ continue;
+ else if (pos->it_sector == s)
+ return pos->it_tags & INTERNAL_MASK;
+ else
+ break;
+ }
+ return -ENOENT;
+}
+
+static inline
+int _has_tag(struct my_tree_t *tree, u64 s, int32_t tag)
+{
+ int32_t tags;
+
+ dprintk("%s(%llu, %i) enter\n", __func__, s, tag);
+ s = normalize(s, tree->mtt_step_size);
+ tags = _find_entry(tree, s);
+ if ((tags < 0) || !(tags & (1 << tag)))
+ return 0;
+ else
+ return 1;
+}
+
+/* Creates entry with tag, or if entry already exists, unions tag to it.
+ * If storage is not NULL, newly created entry will use it.
+ * Returns number of entries added, or negative on error.
+ */
+static int _add_entry(struct my_tree_t *tree, u64 s, int32_t tag,
+ struct pnfs_inval_tracking *storage)
+{
+ int found = 0;
+ struct pnfs_inval_tracking *pos;
+
+ dprintk("%s(%llu, %i, %p) enter\n", __func__, s, tag, storage);
+ list_for_each_entry(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector < s)
+ continue;
+ else if (pos->it_sector == s) {
+ found = 1;
+ break;
+ } else
+ break;
+ }
+ if (found) {
+ pos->it_tags |= (1 << tag);
+ return 0;
+ } else {
+ struct pnfs_inval_tracking *new;
+ if (storage)
+ new = storage;
+ else {
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+ }
+ new->it_sector = s;
+ new->it_tags = (1 << tag);
+ list_add_tail(&new->it_link, &pos->it_link);
+ return 1;
+ }
+}
+
+/* XXXX Really want option to not create */
+/* Over range, unions tag with existing entries, else creates entry with tag */
+static int _set_range(struct my_tree_t *tree, int32_t tag, u64 s, u64 length)
+{
+ u64 i;
+
+ dprintk("%s(%i, %llu, %llu) enter\n", __func__, tag, s, length);
+ for (i = normalize(s, tree->mtt_step_size); i < s + length;
+ i += tree->mtt_step_size)
+ if (_add_entry(tree, i, tag, NULL))
+ return -ENOMEM;
+ return 0;
+}
+
+
+/* Ensure that future operations on given range of tree will not malloc */
+static int _preload_range(struct my_tree_t *tree, u64 offset, u64 length)
+{
+ u64 start, end, s;
+ int count, i, used = 0, status = -ENOMEM;
+ struct pnfs_inval_tracking **storage;
+
+ dprintk("%s(%llu, %llu) enter\n", __func__, offset, length);
+ start = normalize(offset, tree->mtt_step_size);
+ end = normalize_up(offset + length, tree->mtt_step_size);
+ count = (int)(end - start) / (int)tree->mtt_step_size;
+
+ /* Pre-malloc what memory we might need */
+ storage = kmalloc(sizeof(*storage) * count, GFP_KERNEL);
+ if (!storage)
+ return -ENOMEM;
+ for (i = 0; i < count; i++) {
+ storage[i] = kmalloc(sizeof(struct pnfs_inval_tracking),
+ GFP_KERNEL);
+ if (!storage[i])
+ goto out_cleanup;
+ }
+
+ /* Now need lock - HOW??? */
+
+ for (s = start; s < end; s += tree->mtt_step_size)
+ used += _add_entry(tree, s, INTERNAL_EXISTS, storage[used]);
+
+ /* Unlock - HOW??? */
+ status = 0;
+
+ out_cleanup:
+ for (i = used; i < count; i++) {
+ if (!storage[i])
+ break;
+ kfree(storage[i]);
+ }
+ return status;
+}
+
+static void set_needs_init(sector_t *array, sector_t offset)
+{
+ sector_t *p = array;
+
+ dprintk("%s enter\n", __func__);
+ if (!p)
+ return;
+ while (*p < offset)
+ p++;
+ if (*p == offset)
+ return;
+ else if (*p == ~0) {
+ *p++ = offset;
+ *p = ~0;
+ return;
+ } else {
+ sector_t *save = p;
+ dprintk("%s Adding %llu\n", __func__, (u64)offset);
+ while (*p != ~0)
+ p++;
+ p++;
+ memmove(save + 1, save, (char *)p - (char *)save);
+ *save = offset;
+ return;
+ }
+}
+
+/* We are relying on page lock to serialize this */
int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
{
- /* STUB */
- return 0;
+ int rv;
+
+ spin_lock(&marks->im_lock);
+ rv = _has_tag(&marks->im_tree, isect, EXTENT_INITIALIZED);
+ spin_unlock(&marks->im_lock);
+ return rv;
+}
+
+/* Marks sectors in [offset, offset + length) as having been initialized.
+ * All lengths are step-aligned, where step is min(pagesize, blocksize).
+ * Notes where partial block is initialized, and helps prepare it for
+ * complete initialization later.
+ */
+/* Currently assumes offset is page-aligned */
+int mark_initialized_sectors(struct pnfs_inval_markings *marks,
+ sector_t offset, sector_t length,
+ sector_t **pages)
+{
+ sector_t s, start, end;
+ sector_t *array = NULL; /* Pages to mark */
+
+ dprintk("%s(offset=%llu,len=%llu) enter\n",
+ __func__, (u64)offset, (u64)length);
+ s = max((sector_t) 3,
+ 2 * (marks->im_block_size / (PAGE_CACHE_SECTORS)));
+ dprintk("%s set max=%llu\n", __func__, (u64)s);
+ if (pages) {
+ array = kmalloc(s * sizeof(sector_t), GFP_KERNEL);
+ if (!array)
+ goto outerr;
+ array[0] = ~0;
+ }
+
+ start = normalize(offset, marks->im_block_size);
+ end = normalize_up(offset + length, marks->im_block_size);
+ if (_preload_range(&marks->im_tree, start, end - start))
+ goto outerr;
+
+ spin_lock(&marks->im_lock);
+
+ for (s = normalize_up(start, PAGE_CACHE_SECTORS);
+ s < offset; s += PAGE_CACHE_SECTORS) {
+ dprintk("%s pre-area pages\n", __func__);
+ /* Portion of used block is not initialized */
+ if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
+ set_needs_init(array, s);
+ }
+ if (_set_range(&marks->im_tree, EXTENT_INITIALIZED, offset, length))
+ goto out_unlock;
+ for (s = normalize_up(offset + length, PAGE_CACHE_SECTORS);
+ s < end; s += PAGE_CACHE_SECTORS) {
+ dprintk("%s post-area pages\n", __func__);
+ if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
+ set_needs_init(array, s);
+ }
+
+ spin_unlock(&marks->im_lock);
+
+ if (pages) {
+ if (array[0] == ~0) {
+ kfree(array);
+ *pages = NULL;
+ } else
+ *pages = array;
+ }
+ return 0;
+
+ out_unlock:
+ spin_unlock(&marks->im_lock);
+ outerr:
+ if (pages) {
+ kfree(array);
+ *pages = NULL;
+ }
+ return -ENOMEM;
}

static void print_bl_extent(struct pnfs_block_extent *be)
@@ -83,6 +336,14 @@ struct pnfs_block_extent *alloc_extent(void)
return be;
}

+struct pnfs_block_extent *
+get_extent(struct pnfs_block_extent *be)
+{
+ if (be)
+ kref_get(&be->be_refcnt);
+ return be;
+}
+
void print_elist(struct list_head *list)
{
struct pnfs_block_extent *be;
--
1.7.4.1
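
The `normalize` and `normalize_up` helpers in the patch above round a sector number down or up to a multiple of the step size, and `_preload_range` uses them to count how many tracking entries a range needs. A small user-space sketch of that arithmetic follows. It uses the `%` operator instead of the kernel's `do_div`, and `entries_for_range` is a hypothetical helper added only to make the count explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Largest t <= s such that t % base == 0. */
static uint64_t normalize(uint64_t s, uint64_t base)
{
	assert(base != 0);
	return s - (s % base);
}

/* Smallest t >= s such that t % base == 0. */
static uint64_t normalize_up(uint64_t s, uint64_t base)
{
	return normalize(s + base - 1, base);
}

/*
 * Number of step-sized tracking entries needed to cover
 * [offset, offset + length), the same count _preload_range computes
 * before it preallocates storage.
 */
static uint64_t entries_for_range(uint64_t offset, uint64_t length,
				  uint64_t step)
{
	uint64_t start = normalize(offset, step);
	uint64_t end = normalize_up(offset + length, step);

	return (end - start) / step;
}
```

For example, with a step of 8 sectors (one 4 KiB page), the range starting at sector 5 with length 10 spans sectors 0 to 16 after rounding, so two entries are needed.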


2011-06-08 07:15:11

by Peng Tao

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 6/8/11, Jim Rees <[email protected]> wrote:
> Benny Halevy wrote:
>
> NAK.
> This affects all layout types. In particular it is undesired
> for write layouts that extend the file with the objects layout.
> The server can extend the layout segments range
> over what the client requested so why would the client
> ask for artificially large layouts?
>
> This has actually been the subject of some debate over Thursday night
> beers. The problem we're trying to solve is that the client is spending 98%
> of its time in layoutget. This patch gives us something like a 10x
> speedup. But many of us think it's not the right fix. I suggest we discuss
> next week.
>
> But note that this patch doesn't change anything unless you set the sysctl.
There is a default value of 2M. Maybe we can set it to page size by
default so other layouts are not affected, and block layout users can
set it by hand if they care about performance. Does this make
sense?


--
Thanks,
-Bergwolf

2011-06-10 19:23:35

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 10:09, [email protected] wrote:
> Hi, Benny,
>
> -----Original Message-----
> From: Benny Halevy [mailto:[email protected]]
> Sent: Friday, June 10, 2011 8:33 PM
> To: Peng, Tao
> Cc: [email protected]; [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-10 02:00, [email protected] wrote:
>> Hi, Benny,
>>
>> Cheers,
>> -Bergwolf
>>
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
>> Sent: Friday, June 10, 2011 5:23 AM
>> To: Peng Tao
>> Cc: Jim Rees; [email protected]; peter honeyman
>> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>>
>> On 2011-06-09 08:07, Peng Tao wrote:
>>> Hi, Jim and Benny,
>>>
>>> On Thu, Jun 9, 2011 at 9:58 PM, Jim Rees <[email protected]> wrote:
>>>> Benny Halevy wrote:
>>>>
>>>> > My understanding is that layoutget specifies a min and max, and the server
>>>>
>>>> There's a min. What do you consider the max?
>>>> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>>>
>>>> The spec doesn't say max, it says "desired." I guess I assumed the server
>>>> wouldn't normally return more than desired.
>>> In fact server is returning "desired" length. The problem is that we
>>> call pnfs_update_layout in nfs_write_begin, and it will end up setting
>>> both minlength and length to page size. There is no space for client
>>> to collapse layoutget range in nfs_write_begin.
>>>
>>
>> That's a different issue. Waiting with pnfs_update_layout to flush
>> time rather than write_begin if the whole page is written would help
>> sending a more meaningful desired range as well as avoiding needless
>> read-modify-writes in case the application also wrote the whole
>> preallocated block.
>> [PT] It is also the reason why we want to introduce layout prefetching: to get a larger segment than the single page passed in nfs_write_begin.
>>
>
> Peng, I understand what you want to achieve but the proposed way
> just doesn't fly. The server knows better than the client its allocation policies
> and it knows better the combined workload of different client and possible
> conflicts between them therefore it should be making the ultimate decision
> about the actual segment sizes.
> [PT] Yes, you are right. Server should know combined workload of all clients and make its decision based on that.
> And it always has the right to return more than (or less than) specified in loga_length.
>
> That said, the client should indeed do its best to ask for the most appropriate
> segments size for its use and we should be making a better job at that.
> It's just that blindly asking for more is not a good strategy and requiring
> manual admin help to tune the clients is not acceptable.
> [PT] Yeah, determining the most appropriate size is always the hard part. Do you have any suggestions for that?

A simple algorithm I can suggest is:
- on initialization, calculate and save, per layout driver
  - maximum layout size
    - take into account csr_fore_chan_attrs.ca_maxresponsesize and possibly other parameters
    - keep a working copy of the maximum value and the calculated copy.
  - alignment value.
- on miss, see if there's an adjacent layout segment in cache
  - if found, ask for twice the found segment size, up to the maximum value,
    aligned on the alignment value.
  - if the server returns less than the layoutget range, keep note of the returned length
    (but do not adjust the maximum yet, as the server may return a short segment for various
    reasons)
  - if the server is consistent about returning less than was asked, adjust the
    working copy of the maximum length
  - if the maximum was adjusted, try bumping it up after X (TBD) layoutgets or T seconds
    to see if that was just due to high load or conflicts on the server
- on any error returned for LAYOUTGET reset the algorithm parameters
- on session reestablishment recalculate maximums.
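
The outline above could be sketched roughly as follows. This is a user-space illustration only, not code from the patch set: `struct lg_sizer` and every function name are hypothetical, `SHORT_LIMIT` and `RETRY_AFTER` are placeholder values for "consistent" and "X (TBD)", the T-seconds timer is left out, and bumping the maximum back up by doubling is just one possible choice.

```c
#include <assert.h>
#include <stdint.h>

#define SHORT_LIMIT 3	/* assumed: short replies in a row before adjusting */
#define RETRY_AFTER 16	/* assumed: layoutgets before bumping the maximum */

/* Hypothetical per-layout-driver state. */
struct lg_sizer {
	uint64_t calc_max;	/* maximum calculated at initialization */
	uint64_t work_max;	/* working copy, may be lowered at run time */
	uint64_t align;		/* alignment value */
	unsigned short_count;	/* consecutive short replies seen */
	unsigned since_adjust;	/* layoutgets since work_max was lowered */
};

static void lg_reset(struct lg_sizer *s)
{
	s->work_max = s->calc_max;
	s->short_count = 0;
	s->since_adjust = 0;
}

/* On initialization or session reestablishment. */
static void lg_init(struct lg_sizer *s, uint64_t max, uint64_t align)
{
	assert(align != 0);
	s->calc_max = max;
	s->align = align;
	lg_reset(s);
}

static uint64_t align_up(uint64_t v, uint64_t align)
{
	return (v + align - 1) / align * align;
}

/* On a miss: ask for twice the adjacent cached segment, aligned and capped. */
static uint64_t lg_request_len(const struct lg_sizer *s, uint64_t min_len,
			       uint64_t adjacent_len)
{
	uint64_t want = adjacent_len ? 2 * adjacent_len : min_len;

	if (want < min_len)
		want = min_len;
	want = align_up(want, s->align);
	if (want > s->work_max)
		want = s->work_max;
	return want < min_len ? min_len : want;	/* never below what is needed */
}

/* After each LAYOUTGET reply: note short replies, adjust or bump work_max. */
static void lg_reply(struct lg_sizer *s, uint64_t asked, uint64_t got)
{
	if (got < asked) {
		s->short_count++;
		if (s->short_count >= SHORT_LIMIT && got < s->work_max) {
			/* The server is consistently short: lower the cap. */
			s->work_max = got < s->align ? s->align : got;
			s->short_count = 0;
			s->since_adjust = 0;
			return;
		}
	} else {
		s->short_count = 0;
	}
	if (s->work_max < s->calc_max && ++s->since_adjust >= RETRY_AFTER) {
		/* Probe whether the earlier short replies were transient. */
		s->work_max = s->work_max * 2 < s->calc_max ?
			      s->work_max * 2 : s->calc_max;
		s->since_adjust = 0;
	}
}
```

A caller would run `lg_request_len` on each cache miss, `lg_reply` after each successful LAYOUTGET, and `lg_reset` on any LAYOUTGET error.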

Benny

>
> Thanks,
> Tao

2011-06-07 17:30:40

by Jim Rees

Subject: [PATCH 39/88] pnfsblock: debugging dprintks for clist info

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/extents.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index a05ee2a..98452ca 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -349,6 +349,31 @@ int mark_written_sectors(struct pnfs_inval_markings *marks,
return status;
}

+static void print_short_extent(struct pnfs_block_short_extent *be)
+{
+ dprintk("PRINT SHORT EXTENT extent %p\n", be);
+ if (be) {
+ dprintk(" be_f_offset %llu\n", (u64)be->bse_f_offset);
+ dprintk(" be_length %llu\n", (u64)be->bse_length);
+ }
+}
+
+void print_clist(struct list_head *list, unsigned int count)
+{
+ struct pnfs_block_short_extent *be;
+ unsigned int i = 0;
+
+ dprintk("****************\n");
+ dprintk("Extent list looks like:\n");
+ list_for_each_entry(be, list, bse_node) {
+ i++;
+ print_short_extent(be);
+ }
+ if (i != count)
+ dprintk("\n\nExpected %u entries\n\n\n", count);
+ dprintk("****************\n");
+}
+
/* Note: In theory, we should do more checking that devid's match between
* old and new, but if they don't, the lists are too corrupt to salvage anyway.
*/
@@ -360,6 +385,9 @@ static void add_to_commitlist(struct pnfs_block_layout *bl,
struct pnfs_block_short_extent *old, *save;
sector_t end = new->bse_f_offset + new->bse_length;

+ dprintk("%s enter\n", __func__);
+ print_short_extent(new);
+ print_clist(clist, bl->bl_count);
bl->bl_count++;
/* Scan for proper place to insert, extending new to the left
* as much as possible.
@@ -409,6 +437,8 @@ static void add_to_commitlist(struct pnfs_block_layout *bl,
kfree(old);
}
}
+ dprintk("%s: after merging\n", __func__);
+ print_clist(clist, bl->bl_count);
}

/* Note the range described by offset, length is guaranteed to be contained
--
1.7.4.1


2011-06-07 17:27:44

by Jim Rees

Subject: [PATCH 16/88] pnfsblock: select BLK_DEV_DM when PNFS_BLOCK is configured

From: Benny Halevy <[email protected]>

Also, select MD

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/Kconfig | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index d6bfd87..bccd415 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -108,6 +108,8 @@ config PNFS_PANLAYOUT
config PNFS_BLOCK
tristate "Provide a pNFS block client (EXPERIMENTAL)"
depends on NFS_FS && PNFS
+ select MD
+ select BLK_DEV_DM
help
Say M or y here if you want your pNfs client to support the block protocol

--
1.7.4.1


2011-06-07 17:31:00

by Jim Rees

Subject: [PATCH 42/88] SQUASHME: pnfsblock: Fix a memory leak

From: Zhang Jingwang <[email protected]>

The storage array is allocated but never freed; free it.

Signed-off-by: Zhang Jingwang <[email protected]>
Acked-by: J. Bruce Fields <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/extents.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 09a7c5c..288a38a 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -181,6 +181,7 @@ static int _preload_range(struct my_tree_t *tree, u64 offset, u64 length)
break;
kfree(storage[i]);
}
+ kfree(storage);
return status;
}

--
1.7.4.1
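
The leak fixed above is the usual two-level one: each element of an array of pointers is freed, but the array itself is not. A minimal user-space sketch of the corrected pattern follows; alloc_slots() and free_slots() are hypothetical names for this example and are not part of the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate 'count' buffers of 'size' bytes behind one pointer array.
 * Returns NULL on failure, with nothing left allocated.
 */
static void **alloc_slots(size_t count, size_t size)
{
	void **storage;
	size_t i;

	assert(count > 0 && size > 0);
	storage = calloc(count, sizeof(*storage));
	if (!storage)
		return NULL;
	for (i = 0; i < count; i++) {
		storage[i] = malloc(size);
		if (!storage[i])
			goto fail;
	}
	return storage;

fail:
	/* Free only the elements allocated so far, then the array. */
	while (i--)
		free(storage[i]);
	free(storage);
	return NULL;
}

static void free_slots(void **storage, size_t count)
{
	size_t i;

	if (!storage)
		return;
	for (i = 0; i < count; i++)
		free(storage[i]);
	free(storage);	/* the step that was missing: free the array too */
}
```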


2011-06-07 17:29:40

by Jim Rees

Subject: [PATCH 32/88] pnfsblock: bl_write_pagelist support functions

From: Fred Isaman <[email protected]>

[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: fixup blksize alignment in bl_setup_layoutcommit]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 32 +++++++++++++++++++++++++++++++-
1 files changed, 31 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index af26bcc..9c46f5a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -73,6 +73,19 @@ static int is_hole(struct pnfs_block_extent *be, sector_t isect)
return !is_sector_initialized(be->be_inval, isect);
}

+/* Given the be associated with isect, determine if page data can be
+ * written to disk.
+ */
+static int is_writable(struct pnfs_block_extent *be, sector_t isect)
+{
+ if (be->be_state == PNFS_BLOCK_READWRITE_DATA)
+ return 1;
+ else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+ return 0;
+ else
+ return is_sector_initialized(be->be_inval, isect);
+}
+
static int
dont_like_caller(struct nfs_page *req)
{
@@ -441,7 +454,19 @@ static int
bl_setup_layoutcommit(struct pnfs_layout_type *lo,
struct pnfs_layoutcommit_arg *arg)
{
+ struct nfs_server *nfss = PNFS_NFS_SERVER(lo);
+ struct pnfs_layoutcommit_arg *arg = &data->args;
+
dprintk("%s enter\n", __func__);
+ /* Need to ensure commit is block-size aligned */
+ if (nfss->pnfs_blksize) {
+ u64 mask = nfss->pnfs_blksize - 1;
+ u64 offset = arg->lseg.offset & mask;
+
+ arg->lseg.offset -= offset;
+ arg->lseg.length += offset + mask;
+ arg->lseg.length &= ~mask;
+ }
return 0;
}

@@ -916,7 +941,12 @@ bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
dprintk("%s enter\n", __func__);
- return 1;
+ if (pgio->pg_iswrite) {
+ return test_bit(PG_USE_PNFS, &prev->wb_flags) ==
+ test_bit(PG_USE_PNFS, &req->wb_flags);
+ } else {
+ return 1;
+ }
}

/* This checks if old req will likely use same io method as soon
--
1.7.4.1
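
The block-size alignment added to bl_setup_layoutcommit above can be checked in user space. The sketch below repeats the same mask arithmetic on a stand-alone struct; struct range and align_to_block() are hypothetical names, and the code assumes the block size is a power of two, which the mask trick requires.

```c
#include <assert.h>
#include <stdint.h>

struct range {
	uint64_t offset;
	uint64_t length;
};

/* Widen r so that it starts and ends on a blksize boundary.
 * blksize must be a non-zero power of two.
 */
static void align_to_block(struct range *r, uint64_t blksize)
{
	uint64_t mask, head;

	assert(blksize != 0 && (blksize & (blksize - 1)) == 0);
	mask = blksize - 1;
	head = r->offset & mask;	/* bytes past the previous boundary */

	r->offset -= head;		/* round the start down */
	r->length += head + mask;	/* cover the head, then round up */
	r->length &= ~mask;
}
```

For example, with a 4096-byte block size the range (offset 5000, length 100) becomes (4096, 4096), and (4000, 200), which straddles a boundary, becomes (0, 8192).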


2011-06-07 17:35:03

by Jim Rees

Subject: [PATCH 77/88] move include lines out of include file

Signed-off-by: Jim Rees <[email protected]>
[This patch does *not* break the header's independence]
Signed-off-by: Boaz Harrosh <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
include/linux/sunrpc/simple_rpc_pipefs.h | 6 ------
net/sunrpc/simple_rpc_pipefs.c | 5 ++---
2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/include/linux/sunrpc/simple_rpc_pipefs.h b/include/linux/sunrpc/simple_rpc_pipefs.h
index 02e8147..f6a1227 100644
--- a/include/linux/sunrpc/simple_rpc_pipefs.h
+++ b/include/linux/sunrpc/simple_rpc_pipefs.h
@@ -39,14 +39,8 @@
#ifndef _SIMPLE_RPC_PIPEFS_H_
#define _SIMPLE_RPC_PIPEFS_H_

-#include <linux/fs.h>
-#include <linux/list.h>
-#include <linux/mount.h>
-#include <linux/sched.h>
-#include <linux/sunrpc/clnt.h>
#include <linux/sunrpc/rpc_pipe_fs.h>

-
#define payload_of(headerp) ((void *)(headerp + 1))

/*
diff --git a/net/sunrpc/simple_rpc_pipefs.c b/net/sunrpc/simple_rpc_pipefs.c
index c9306aa..24af0a1 100644
--- a/net/sunrpc/simple_rpc_pipefs.c
+++ b/net/sunrpc/simple_rpc_pipefs.c
@@ -38,9 +38,8 @@
* With thanks to CITI's project sponsor and partner, IBM.
*/

-#include <linux/completion.h>
-#include <linux/uaccess.h>
-#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/sunrpc/clnt.h>
#include <linux/sunrpc/simple_rpc_pipefs.h>


--
1.7.4.1


2011-06-07 17:29:06

by Jim Rees

Subject: [PATCH 27/88] pnfsblock: read path error handling

From: Fred Isaman <[email protected]>

Communicate any pnfs failures from nfs_readpages to nfs_readpage, so that
nfs_readpage can immediately fall back to the MDS.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 22ea965..99de9e3 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -151,10 +151,12 @@ static inline void
bl_done_with_rpage(struct page *page, const int ok)
{
if (ok) {
+ ClearPagePnfsErr(page);
SetPageUptodate(page);
} else {
ClearPageUptodate(page);
SetPageError(page);
+ SetPagePnfsErr(page);
}
/* Page is unlocked via rpc_release. Should really be done here. */
}
@@ -227,6 +229,13 @@ bl_read_pagelist(struct pnfs_layout_type *lo,
dprintk("%s dont_like_caller failed\n", __func__);
goto use_mds;
}
+ if ((nr_pages == 1) && PagePnfsErr(rdata->req->wb_page)) {
+ /* We want to fall back to mds in case of read_page
+ * after error on read_pages.
+ */
+ dprintk("%s PG_pnfserr set\n", __func__);
+ goto use_mds;
+ }
par = alloc_parallel(rdata);
if (!par)
goto use_mds;
--
1.7.4.1
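
The fallback above relies on a sticky per-page flag: a failed pNFS read marks the page, and a later single-page read of a marked page goes straight to the MDS. The following is a small user-space model of that logic, not kernel code; struct fake_page, done_with_rpage() and choose_read_path() are hypothetical stand-ins for the page flags and the checks in the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum io_path { IO_PNFS, IO_MDS };

/* Stand-in for the page flags the patch touches. */
struct fake_page {
	bool uptodate;
	bool error;
	bool pnfs_err;
};

/* Mirrors bl_done_with_rpage(): record the outcome of a pNFS read. */
static void done_with_rpage(struct fake_page *p, bool ok)
{
	if (ok) {
		p->pnfs_err = false;
		p->uptodate = true;
	} else {
		p->uptodate = false;
		p->error = true;
		p->pnfs_err = true;
	}
}

/* Mirrors the new check in bl_read_pagelist(): a single-page read of a
 * page that already failed over pNFS is sent to the MDS instead.
 */
static enum io_path choose_read_path(const struct fake_page *pages,
				     unsigned int nr_pages)
{
	assert(nr_pages > 0);
	if (nr_pages == 1 && pages[0].pnfs_err)
		return IO_MDS;
	return IO_PNFS;
}
```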


2011-06-07 17:26:02

by Jim Rees

Subject: [PATCH 01/88] pnfs: add set-clear layoutdriver interface

From: Benny Halevy <[email protected]>

Allow a layout driver to issue getdevicelist at mount time and to clean up
at umount time.

[fixup non NFS_V4_1 set_pnfs_layoutdriver definition]
[pnfs: pass mntfh down the init_pnfs path]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/client.c | 7 ++++---
fs/nfs/pnfs.c | 15 +++++++++++++--
fs/nfs/pnfs.h | 8 ++++++--
3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index b3dc2b8..6bdb7da0 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -906,7 +906,8 @@ error:
/*
* Load up the server record from information gained in an fsinfo record
*/
-static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
+static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fh *mntfh,
+ struct nfs_fsinfo *fsinfo)
{
unsigned long max_rpc_payload;

@@ -936,7 +937,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *
if (server->wsize > NFS_MAX_FILE_IO_SIZE)
server->wsize = NFS_MAX_FILE_IO_SIZE;
server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- set_pnfs_layoutdriver(server, fsinfo->layouttype);
+ set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);

server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);

@@ -982,7 +983,7 @@ static int nfs_probe_fsinfo(struct nfs_server *server, struct nfs_fh *mntfh, str
if (error < 0)
goto out_error;

- nfs_server_set_fsinfo(server, &fsinfo);
+ nfs_server_set_fsinfo(server, mntfh, &fsinfo);

/* Get some general file system info */
if (server->namelen == 0) {
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 963be42..3b182cc 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -75,8 +75,11 @@ find_pnfs_driver(u32 id)
void
unset_pnfs_layoutdriver(struct nfs_server *nfss)
{
- if (nfss->pnfs_curr_ld)
+ if (nfss->pnfs_curr_ld) {
+ if (nfss->pnfs_curr_ld->clear_layoutdriver)
+ nfss->pnfs_curr_ld->clear_layoutdriver(nfss);
module_put(nfss->pnfs_curr_ld->owner);
+ }
nfss->pnfs_curr_ld = NULL;
}

@@ -87,7 +90,8 @@ unset_pnfs_layoutdriver(struct nfs_server *nfss)
* @id layout type. Zero (illegal layout type) indicates pNFS not in use.
*/
void
-set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
+set_pnfs_layoutdriver(struct nfs_server *server, const struct nfs_fh *mntfh,
+ u32 id)
{
struct pnfs_layoutdriver_type *ld_type = NULL;

@@ -114,6 +118,13 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
goto out_no_driver;
}
server->pnfs_curr_ld = ld_type;
+ if (ld_type->set_layoutdriver && ld_type->set_layoutdriver(server, mntfh)) {
+ printk(KERN_ERR
+ "%s: Error initializing mount point for layout driver %u.\n",
+ __func__, id);
+ module_put(ld_type->owner);
+ goto out_no_driver;
+ }

dprintk("%s: pNFS module for %u set\n", __func__, id);
return;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index ba01e59..b8548d8 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -80,6 +80,9 @@ struct pnfs_layoutdriver_type {
struct module *owner;
unsigned flags;

+ int (*set_layoutdriver) (struct nfs_server *, const struct nfs_fh *);
+ int (*clear_layoutdriver) (struct nfs_server *);
+
struct pnfs_layout_hdr * (*alloc_layout_hdr) (struct inode *inode, gfp_t gfp_flags);
void (*free_layout_hdr) (struct pnfs_layout_hdr *);

@@ -169,7 +172,7 @@ struct pnfs_layout_segment *
pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
loff_t pos, u64 count, enum pnfs_iomode access_type,
gfp_t gfp_flags);
-void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
+void set_pnfs_layoutdriver(struct nfs_server *, const struct nfs_fh *, u32);
void unset_pnfs_layoutdriver(struct nfs_server *);
enum pnfs_try_status pnfs_try_to_write_data(struct nfs_write_data *,
const struct rpc_call_ops *, int);
@@ -396,7 +399,8 @@ pnfs_roc_drain(struct inode *ino, u32 *barrier)
return false;
}

-static inline void set_pnfs_layoutdriver(struct nfs_server *s, u32 id)
+static inline void set_pnfs_layoutdriver(struct nfs_server *s,
+ const struct nfs_fh *mntfh, u32 id)
{
}

--
1.7.4.1
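
The set/clear hooks above follow the usual optional-callback pattern: call the hook only if the driver provides one, and undo the reference if setup fails. Here is a stripped-down user-space sketch of that pattern. All names are hypothetical, the mntfh argument is omitted, and a plain counter stands in for the module reference.

```c
#include <assert.h>
#include <stddef.h>

struct server;

struct layout_driver {
	int (*set_layoutdriver)(struct server *);	/* optional */
	int (*clear_layoutdriver)(struct server *);	/* optional */
	int refcount;		/* stands in for the module reference */
};

struct server {
	struct layout_driver *curr_ld;
};

/* Example hooks, used only to exercise the sketch. */
static int ok_hook(struct server *srv)
{
	(void)srv;
	return 0;
}

static int failing_hook(struct server *srv)
{
	(void)srv;
	return -1;
}

/* Take a reference, install the driver, and run its optional set hook.
 * On hook failure, drop the reference and leave no driver installed.
 */
static int set_driver(struct server *srv, struct layout_driver *ld)
{
	assert(srv && ld);
	ld->refcount++;
	srv->curr_ld = ld;
	if (ld->set_layoutdriver && ld->set_layoutdriver(srv)) {
		ld->refcount--;
		srv->curr_ld = NULL;
		return -1;
	}
	return 0;
}

/* Run the optional clear hook before dropping the reference. */
static void unset_driver(struct server *srv)
{
	struct layout_driver *ld = srv->curr_ld;

	if (ld) {
		if (ld->clear_layoutdriver)
			ld->clear_layoutdriver(srv);
		ld->refcount--;
	}
	srv->curr_ld = NULL;
}
```

Drivers that need no mount-time work simply leave both pointers NULL, which is why the callers test each hook before invoking it.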


2011-06-07 17:33:56

by Jim Rees

Subject: [PATCH 66/88] pnfs-block: Remove device creation from kernel

Signed-off-by: Eric Anderle <[email protected]>
Signed-off-by: Jim Rees <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/Makefile | 2 +-
fs/nfs/blocklayout/block-device-discovery-pipe.c | 66 +++
fs/nfs/blocklayout/blocklayout.c | 15 +-
fs/nfs/blocklayout/blocklayout.h | 18 +-
fs/nfs/blocklayout/blocklayoutdev.c | 494 +++-------------------
fs/nfs/blocklayout/blocklayoutdm.c | 297 ++-----------
6 files changed, 181 insertions(+), 711 deletions(-)
create mode 100644 fs/nfs/blocklayout/block-device-discovery-pipe.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 1e7619f..5a4bf3d 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -3,4 +3,4 @@
#
obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
blocklayoutdriver-objs := blocklayout.o blocklayoutdev.o blocklayoutdm.o \
- extents.o
+ extents.o block-device-discovery-pipe.o
diff --git a/fs/nfs/blocklayout/block-device-discovery-pipe.c b/fs/nfs/blocklayout/block-device-discovery-pipe.c
new file mode 100644
index 0000000..e4c199f
--- /dev/null
+++ b/fs/nfs/blocklayout/block-device-discovery-pipe.c
@@ -0,0 +1,66 @@
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/proc_fs.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/ctype.h>
+#include <linux/sched.h>
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+struct pipefs_list bl_device_list;
+struct dentry *bl_device_pipe;
+
+ssize_t bl_pipe_downcall(struct file *filp, const char __user *src, size_t len)
+{
+ int err;
+ struct pipefs_hdr *msg;
+
+ dprintk("Entering %s...\n", __func__);
+
+ msg = pipefs_readmsg(filp, src, len);
+ if (IS_ERR(msg)) {
+ dprintk("ERROR: unable to read pipefs message.\n");
+ return PTR_ERR(msg);
+ }
+
+ /* now assign the result, which wakes the blocked thread */
+ err = pipefs_assign_upcall_reply(msg, &bl_device_list);
+ if (err) {
+ dprintk("ERROR: failed to assign upcall with id %u\n",
+ msg->msgid);
+ kfree(msg);
+ }
+ return len;
+}
+
+static const struct rpc_pipe_ops bl_pipe_ops = {
+ .upcall = pipefs_generic_upcall,
+ .downcall = bl_pipe_downcall,
+ .destroy_msg = pipefs_generic_destroy_msg,
+};
+
+int bl_pipe_init(void)
+{
+ dprintk("%s: block_device pipefs registering...\n", __func__);
+ bl_device_pipe = pipefs_mkpipe("bl_device_pipe", &bl_pipe_ops, 1);
+ if (IS_ERR(bl_device_pipe))
+ dprintk("ERROR, unable to make block_device pipe\n");
+
+ if (!bl_device_pipe)
+ dprintk("bl_device_pipe is NULL!\n");
+ else
+ dprintk("bl_device_pipe created!\n");
+ pipefs_init_list(&bl_device_list);
+ return 0;
+}
+
+void bl_pipe_exit(void)
+{
+ dprintk("%s: block_device pipefs unregistering...\n", __func__);
+ if (IS_ERR(bl_device_pipe))
+ return ;
+ pipefs_closepipe(bl_device_pipe);
+ return;
+}
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index bfcef54..e3cd75f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -732,6 +732,7 @@ nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
dev->pglen = PAGE_SIZE * max_pages;
dev->mincount = 0;

+ dprintk("%s: dev_id: %s\n", __func__, dev->dev_id.data);
rc = pnfs_block_callback_ops->nfs_getdeviceinfo(server, dev);
dprintk("%s getdevice info returns %d\n", __func__, rc);
if (rc)
@@ -760,7 +761,7 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
struct pnfs_devicelist *dlist = NULL;
struct pnfs_block_dev *bdev;
LIST_HEAD(block_disklist);
- int status, i;
+ int status = 0, i;

dprintk("%s enter\n", __func__);

@@ -777,13 +778,6 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
spin_lock_init(&b_mt_id->bm_lock);
INIT_LIST_HEAD(&b_mt_id->bm_devlist);

- /* Construct a list of all visible block disks that have not been
- * claimed.
- */
- status = nfs4_blk_create_block_disk_list(&block_disklist);
- if (status < 0)
- goto out_error;
-
dlist = kmalloc(sizeof(struct pnfs_devicelist), GFP_KERNEL);
if (!dlist)
goto out_error;
@@ -814,10 +808,9 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
}
dprintk("%s SUCCESS\n", __func__);
server->pnfs_ld_data = b_mt_id;
- status = 0;
+
out_return:
kfree(dlist);
- nfs4_blk_destroy_disk_list(&block_disklist);
return status;

out_error:
@@ -1150,6 +1143,7 @@ static int __init nfs4blocklayout_init(void)
dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);

pnfs_block_callback_ops = pnfs_register_layoutdriver(&blocklayout_type);
+ bl_pipe_init();
return 0;
}

@@ -1159,6 +1153,7 @@ static void __exit nfs4blocklayout_exit(void)
__func__);

pnfs_unregister_layoutdriver(&blocklayout_type);
+ bl_pipe_exit();
}

module_init(nfs4blocklayout_init);
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index d316b7f..8931944 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -56,7 +56,6 @@ struct block_mount_id {

struct pnfs_block_dev {
struct list_head bm_node;
- char *bm_mdevname; /* meta device name */
struct pnfs_deviceid bm_mdevid; /* associated devid */
struct block_device *bm_mdev; /* meta device itself */
};
@@ -263,8 +262,6 @@ int nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
int nfs4_blk_create_block_disk_list(struct list_head *);
void nfs4_blk_destroy_disk_list(struct list_head *);
/* blocklayoutdm.c */
-struct pnfs_block_dev *nfs4_blk_init_metadev(struct nfs_server *server,
- struct pnfs_device *dev);
int nfs4_blk_flatten(struct pnfs_blk_volume *, int, struct pnfs_block_dev *);
void free_block_dev(struct pnfs_block_dev *bdev);
/* extents.c */
@@ -288,4 +285,19 @@ int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
int mark_for_commit(struct pnfs_block_extent *be,
sector_t offset, sector_t length);
+
+#include <linux/sunrpc/simple_rpc_pipefs.h>
+
+extern struct pipefs_list bl_device_list;
+extern struct dentry *bl_device_pipe;
+
+int bl_pipe_init(void);
+void bl_pipe_exit(void);
+
+#define BL_DEVICE_UMOUNT 0x0 /* Umount--delete devices */
+#define BL_DEVICE_MOUNT 0x1 /* Mount--create devices*/
+#define BL_DEVICE_REQUEST_INIT 0x0 /* Start request */
+#define BL_DEVICE_REQUEST_PROC 0x1 /* User level process succeeds */
+#define BL_DEVICE_REQUEST_ERR 0x2 /* User level process fails */
+
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 7285d5e..98ec92b3 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -34,13 +34,12 @@

#include <linux/genhd.h>
#include <linux/blkdev.h>
+#include <linux/hash.h>

#include "blocklayout.h"

#define NFSDBG_FACILITY NFSDBG_PNFS_LD

-#define MAX_VOLS 256 /* Maximum number of block disks. Totally arbitrary */
-
uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes)
{
uint32_t *q = p + XDR_QUADLEN(nbytes);
@@ -77,397 +76,6 @@ int nfs4_blkdev_put(struct block_device *bdev)
return blkdev_put(bdev, FMODE_READ);
}

-/* Add a visible, claimed (by us!) block disk to the device list */
-static int alloc_add_disk(struct block_device *blk_dev, struct list_head *dlist)
-{
- struct visible_block_device *vis_dev;
-
- dprintk("%s enter\n", __func__);
- vis_dev = kmalloc(sizeof(struct visible_block_device), GFP_KERNEL);
- if (!vis_dev) {
- dprintk("%s nfs4_get_sig failed\n", __func__);
- return -ENOMEM;
- }
- vis_dev->vi_bdev = blk_dev;
- vis_dev->vi_mapped = 0;
- vis_dev->vi_put_done = 0;
- list_add(&vis_dev->vi_node, dlist);
- return 0;
-}
-
-/* Walk the list of block_devices. Add disks that can be opened and claimed
- * to the device list
- */
-static int
-nfs4_blk_add_block_disk(struct device *cdev,
- int index, struct list_head *dlist)
-{
- static char *claim_ptr = "I belong to pnfs block driver";
- struct block_device *bdev;
- struct gendisk *gd;
- unsigned int major, minor;
- int ret;
- dev_t dev;
-
- dprintk("%s enter \n", __func__);
- if (index >= MAX_VOLS) {
- dprintk("%s MAX_VOLS hit\n", __func__);
- return -ENOSPC;
- }
- gd = dev_to_disk(cdev);
- if (gd == NULL || get_capacity(gd) == 0 ||
- (gd->flags & GENHD_FL_SUPPRESS_PARTITION_INFO)) /* Skip ramdisks */
- goto out;
-
- dev = cdev->devt;
- major = MAJOR(dev);
- minor = MINOR(dev);
- bdev = nfs4_blkdev_get(dev);
- if (!bdev) {
- dprintk("%s: failed to open device %d:%d\n",
- __func__, major, minor);
- goto out;
- }
-
- if (bd_claim(bdev, claim_ptr)) {
- dprintk("%s: failed to claim device %d:%d\n",
- __func__, major, minor);
- blkdev_put(bdev, FMODE_READ);
- goto out;
- }
-
- ret = alloc_add_disk(bdev, dlist);
- if (ret < 0)
- goto out_err;
- index++;
- dprintk("%s ADDED DEVICE %d:%d capacity %ld, bd_block_size %d\n",
- __func__, major, minor,
- (unsigned long)get_capacity(gd),
- bdev->bd_block_size);
-
-out:
- dprintk("%s returns index %d \n", __func__, index);
- return index;
-
-out_err:
- dprintk("%s Can't add disk %d:%d to list. ERROR: %d\n",
- __func__, major, minor, ret);
- nfs4_blkdev_put(bdev);
- return ret;
-}
-
-/* Destroy the temporary block disk list */
-void nfs4_blk_destroy_disk_list(struct list_head *dlist)
-{
- struct visible_block_device *vis_dev;
-
- dprintk("%s enter\n", __func__);
- while (!list_empty(dlist)) {
- vis_dev = list_first_entry(dlist, struct visible_block_device,
- vi_node);
- dprintk("%s removing device %d:%d\n", __func__,
- MAJOR(vis_dev->vi_bdev->bd_dev),
- MINOR(vis_dev->vi_bdev->bd_dev));
- list_del(&vis_dev->vi_node);
- if (!vis_dev->vi_put_done)
- nfs4_blkdev_put(vis_dev->vi_bdev);
- kfree(vis_dev);
- }
-}
-
-struct nfs4_blk_block_disk_list_ctl {
- struct list_head *dlist;
- int index;
-};
-
-static int nfs4_blk_iter_block_disk_list(struct device *cdev, void *data)
-{
- struct nfs4_blk_block_disk_list_ctl *lc = data;
- int ret;
-
- dprintk("%s enter\n", __func__);
- ret = nfs4_blk_add_block_disk(cdev, lc->index, lc->dlist);
- dprintk("%s 1 ret %d\n", __func__, ret);
- if (ret >= 0) {
- lc->index = ret;
- ret = 0;
- }
- return ret;
-}
-
-/*
- * Create a temporary list of all block disks host can see, and that have not
- * yet been claimed.
- * block_class: list of all registered block disks.
- * returns -errno on error, and #of devices found on success.
-*/
-int nfs4_blk_create_block_disk_list(struct list_head *dlist)
-{
- struct nfs4_blk_block_disk_list_ctl lc = {
- .dlist = dlist,
- .index = 0,
- };
-
- dprintk("%s enter\n", __func__);
- return class_for_each_device(&block_class, NULL,
- &lc, nfs4_blk_iter_block_disk_list);
-}
-/* We are given an array of XDR encoded array indices, each of which should
- * refer to a previously decoded device. Translate into a list of pointers
- * to the appropriate pnfs_blk_volume's.
- */
-static int set_vol_array(uint32_t **pp, uint32_t *end,
- struct pnfs_blk_volume *vols, int working)
-{
- int i, index;
- uint32_t *p = *pp;
- struct pnfs_blk_volume **array = vols[working].bv_vols;
- for (i = 0; i < vols[working].bv_vol_n; i++) {
- BLK_READBUF(p, end, 4);
- READ32(index);
- if ((index < 0) || (index >= working)) {
- dprintk("%s Index %i out of expected range\n",
- __func__, index);
- goto out_err;
- }
- array[i] = &vols[index];
- }
- *pp = p;
- return 0;
- out_err:
- return -EIO;
-}
-
-static uint64_t sum_subvolume_sizes(struct pnfs_blk_volume *vol)
-{
- int i;
- uint64_t sum = 0;
- for (i = 0; i < vol->bv_vol_n; i++)
- sum += vol->bv_vols[i]->bv_size;
- return sum;
-}
-
-static int decode_blk_signature(uint32_t **pp, uint32_t *end,
- struct pnfs_blk_sig *sig)
-{
- int i, tmp;
- uint32_t *p = *pp;
-
- BLK_READBUF(p, end, 4);
- READ32(sig->si_num_comps);
- if (sig->si_num_comps == 0) {
- dprintk("%s 0 components in sig\n", __func__);
- goto out_err;
- }
- if (sig->si_num_comps >= PNFS_BLOCK_MAX_SIG_COMP) {
- dprintk("number of sig comps %i >= PNFS_BLOCK_MAX_SIG_COMP\n",
- sig->si_num_comps);
- goto out_err;
- }
- for (i = 0; i < sig->si_num_comps; i++) {
- BLK_READBUF(p, end, 12);
- READ64(sig->si_comps[i].bs_offset);
- READ32(tmp);
- sig->si_comps[i].bs_length = tmp;
- BLK_READBUF(p, end, tmp);
- /* Note we rely here on fact that sig is used immediately
- * for mapping, then thrown away.
- */
- sig->si_comps[i].bs_string = (char *)p;
- p += XDR_QUADLEN(tmp);
- }
- *pp = p;
- return 0;
- out_err:
- return -EIO;
-}
-
-/* Translate a signature component into a block and offset. */
-static void get_sector(struct block_device *bdev,
- struct pnfs_blk_sig_comp *comp,
- sector_t *block,
- uint32_t *offset_in_block)
-{
- int64_t use_offset = comp->bs_offset;
- unsigned int blkshift = blksize_bits(block_size(bdev));
-
- dprintk("%s enter\n", __func__);
- if (use_offset < 0)
- use_offset += (get_capacity(bdev->bd_disk) << 9);
- *block = use_offset >> blkshift;
- *offset_in_block = use_offset - (*block << blkshift);
-
- dprintk("%s block %llu offset_in_block %u\n",
- __func__, (u64)*block, *offset_in_block);
- return;
-}
-
-/*
- * All signatures in sig must be found on bdev for verification.
- * Returns True if sig matches, False otherwise.
- *
- * STUB - signature crossing a block boundary will cause problems.
- */
-static int verify_sig(struct block_device *bdev, struct pnfs_blk_sig *sig)
-{
- sector_t block = 0;
- struct pnfs_blk_sig_comp *comp;
- struct buffer_head *bh = NULL;
- uint32_t offset_in_block = 0;
- char *ptr;
- int i;
-
- dprintk("%s enter. bd_disk->capacity %ld, bd_block_size %d\n",
- __func__, (unsigned long)get_capacity(bdev->bd_disk),
- bdev->bd_block_size);
- for (i = 0; i < sig->si_num_comps; i++) {
- comp = &sig->si_comps[i];
- dprintk("%s comp->bs_offset %lld, length=%d\n", __func__,
- comp->bs_offset, comp->bs_length);
- get_sector(bdev, comp, &block, &offset_in_block);
- bh = __bread(bdev, block, bdev->bd_block_size);
- if (!bh)
- goto out_err;
- ptr = (char *)bh->b_data + offset_in_block;
- if (memcmp(ptr, comp->bs_string, comp->bs_length))
- goto out_err;
- brelse(bh);
- }
- dprintk("%s Complete Match Found\n", __func__);
- return 1;
-
-out_err:
- brelse(bh);
- dprintk("%s No Match\n", __func__);
- return 0;
-}
-
-/*
- * map_sig_to_device()
- * Given a signature, walk the list of visible block disks searching for
- * a match. Returns True if mapping was done, False otherwise.
- *
- * While we're at it, fill in the vol->bv_size.
- */
-/* XXX FRED - use normal 0=success status */
-static int map_sig_to_device(struct pnfs_blk_sig *sig,
- struct pnfs_blk_volume *vol,
- struct list_head *sdlist)
-{
- int mapped = 0;
- struct visible_block_device *vis_dev;
-
- list_for_each_entry(vis_dev, sdlist, vi_node) {
- if (vis_dev->vi_mapped || !vis_dev->vi_bdev->bd_disk)
- continue;
- mapped = verify_sig(vis_dev->vi_bdev, sig);
- if (mapped) {
- vol->bv_dev = vis_dev->vi_bdev->bd_dev;
- vol->bv_size = get_capacity(vis_dev->vi_bdev->bd_disk);
- vis_dev->vi_mapped = 1;
- /* XXX FRED check this */
- /* We no longer need to scan this device, and
- * we need to "put" it before creating metadevice.
- */
- if (!vis_dev->vi_put_done) {
- vis_dev->vi_put_done = 1;
- nfs4_blkdev_put(vis_dev->vi_bdev);
- }
- break;
- }
- }
- return mapped;
-}
-
-/* XDR decodes pnfs_block_volume4 structure */
-static int decode_blk_volume(uint32_t **pp, uint32_t *end,
- struct pnfs_blk_volume *vols, int i,
- struct list_head *sdlist, int *array_cnt)
-{
- int status = 0;
- struct pnfs_blk_sig sig;
- uint32_t *p = *pp;
- uint64_t tmp; /* Used by READ_SECTOR */
- struct pnfs_blk_volume *vol = &vols[i];
- int j;
- u64 tmp_size;
-
- BLK_READBUF(p, end, 4);
- READ32(vol->bv_type);
- dprintk("%s vol->bv_type = %i\n", __func__, vol->bv_type);
- switch (vol->bv_type) {
- case PNFS_BLOCK_VOLUME_SIMPLE:
- *array_cnt = 0;
- status = decode_blk_signature(&p, end, &sig);
- if (status)
- return status;
- status = map_sig_to_device(&sig, vol, sdlist);
- if (!status) {
- dprintk("Could not find disk for device\n");
- return -EIO;
- }
- status = 0;
- dprintk("%s Set Simple vol to dev %d:%d, size %llu\n",
- __func__,
- MAJOR(vol->bv_dev),
- MINOR(vol->bv_dev),
- (u64)vol->bv_size);
- break;
- case PNFS_BLOCK_VOLUME_SLICE:
- BLK_READBUF(p, end, 16);
- READ_SECTOR(vol->bv_offset);
- READ_SECTOR(vol->bv_size);
- *array_cnt = vol->bv_vol_n = 1;
- status = set_vol_array(&p, end, vols, i);
- break;
- case PNFS_BLOCK_VOLUME_STRIPE:
- BLK_READBUF(p, end, 8);
- READ_SECTOR(vol->bv_stripe_unit);
- BLK_READBUF(p, end, 4);
- READ32(vol->bv_vol_n);
- if (!vol->bv_vol_n)
- return -EIO;
- *array_cnt = vol->bv_vol_n;
- status = set_vol_array(&p, end, vols, i);
- if (status)
- return status;
- /* Ensure all subvolumes are the same size */
- for (j = 1; j < vol->bv_vol_n; j++) {
- if (vol->bv_vols[j]->bv_size !=
- vol->bv_vols[0]->bv_size) {
- dprintk("%s varying subvol size\n", __func__);
- return -EIO;
- }
- }
- /* Make sure total size only includes addressable areas */
- tmp_size = vol->bv_vols[0]->bv_size;
- do_div(tmp_size, (u32)vol->bv_stripe_unit);
- vol->bv_size = vol->bv_vol_n * tmp_size * vol->bv_stripe_unit;
- dprintk("%s Set Stripe vol to size %llu\n",
- __func__, (u64)vol->bv_size);
- break;
- case PNFS_BLOCK_VOLUME_CONCAT:
- BLK_READBUF(p, end, 4);
- READ32(vol->bv_vol_n);
- if (!vol->bv_vol_n)
- return -EIO;
- *array_cnt = vol->bv_vol_n;
- status = set_vol_array(&p, end, vols, i);
- if (status)
- return status;
- vol->bv_size = sum_subvolume_sizes(vol);
- dprintk("%s Set Concat vol to size %llu\n",
- __func__, (u64)vol->bv_size);
- break;
- default:
- dprintk("Unknown volume type %i\n", vol->bv_type);
- out_err:
- return -EIO;
- }
- *pp = p;
- return status;
-}
-
/* Decodes pnfs_block_deviceaddr4 (draft-8) which is XDR encoded
* in dev->dev_addr_buf.
*/
@@ -476,65 +84,71 @@ nfs4_blk_decode_device(struct nfs_server *server,
struct pnfs_device *dev,
struct list_head *sdlist)
{
- int num_vols, i, status, count;
- struct pnfs_blk_volume *vols, **arrays, **arrays_ptr;
- uint32_t *p = dev->area;
- uint32_t *end = (uint32_t *) ((char *) p + dev->mincount);
struct pnfs_block_dev *rv = NULL;
- struct visible_block_device *vis_dev;
+ struct block_device *bd = NULL;
+ struct pipefs_hdr *msg = NULL, *reply = NULL;
+ uint32_t major, minor;

dprintk("%s enter\n", __func__);

- READ32(num_vols);
- dprintk("%s num_vols = %i\n", __func__, num_vols);
-
- vols = kmalloc(sizeof(struct pnfs_blk_volume) * num_vols, GFP_KERNEL);
- if (!vols)
+ if (IS_ERR(bl_device_pipe))
return NULL;
- /* Each volume in vols array needs its own array. Save time by
- * allocating them all in one large hunk. Because each volume
- * array can only reference previous volumes, and because once
- * a concat or stripe references a volume, it may never be
- * referenced again, the volume arrays are guaranteed to fit
- * in the suprisingly small space allocated.
- */
- arrays = kmalloc(sizeof(struct pnfs_blk_volume *) * num_vols * 2,
- GFP_KERNEL);
- if (!arrays)
- goto out;
- arrays_ptr = arrays;
+ dprintk("%s CREATING PIPEFS MESSAGE\n", __func__);
+ dprintk("%s: deviceid: %s, mincount: %d\n", __func__, dev->dev_id.data,
+ dev->mincount);
+ msg = pipefs_alloc_init_msg(0, BL_DEVICE_MOUNT, 0, dev->area,
+ dev->mincount);
+ if (IS_ERR(msg)) {
+ dprintk("ERROR: couldn't make pipefs message.\n");
+ goto out_err;
+ }
+ msg->msgid = hash_ptr(&msg, sizeof(msg->msgid) * 8);
+ msg->status = BL_DEVICE_REQUEST_INIT;
+
+ dprintk("%s CALLING USERSPACE DAEMON\n", __func__);
+ reply = pipefs_queue_upcall_waitreply(bl_device_pipe, msg,
+ &bl_device_list, 0, 0);

- list_for_each_entry(vis_dev, sdlist, vi_node) {
- /* Wipe crud left from parsing previous device */
- vis_dev->vi_mapped = 0;
+ if (IS_ERR(reply)) {
+ dprintk("ERROR: upcall_waitreply failed\n");
+ goto out_err;
}
- for (i = 0; i < num_vols; i++) {
- vols[i].bv_vols = arrays_ptr;
- status = decode_blk_volume(&p, end, vols, i, sdlist, &count);
- if (status)
- goto out;
- arrays_ptr += count;
+ if (reply->status != BL_DEVICE_REQUEST_PROC) {
+ dprintk("%s upcall failed, status %d\n",
+ __func__, (int)reply->status);
+ goto out_err;
}
-
- /* Check that we have used up opaque */
- if (p != end) {
- dprintk("Undecoded cruft at end of opaque\n");
- goto out;
+ memcpy(&major, (uint32_t *)(payload_of(reply)), sizeof(uint32_t));
+ memcpy(&minor, (uint32_t *)(payload_of(reply) + sizeof(uint32_t)),
+ sizeof(uint32_t));
+ bd = nfs4_blkdev_get(MKDEV(major, minor));
+ if (IS_ERR(bd)) {
+ dprintk("%s failed to open device : %ld\n",
+ __func__, PTR_ERR(bd));
+ goto out_err;
}

- /* Now use info in vols to create the meta device */
- rv = nfs4_blk_init_metadev(server, dev);
+ rv = kzalloc(sizeof(*rv), GFP_KERNEL);
if (!rv)
- goto out;
- status = nfs4_blk_flatten(vols, num_vols, rv);
- if (status) {
- free_block_dev(rv);
- rv = NULL;
- }
- out:
- kfree(arrays);
- kfree(vols);
+ goto out_err;
+
+ rv->bm_mdev = bd;
+ memcpy(&rv->bm_mdevid, &dev->dev_id, sizeof(struct pnfs_deviceid));
+ dprintk("%s Created device %s with bd_block_size %u\n",
+ __func__,
+ bd->bd_disk->disk_name,
+ bd->bd_block_size);
+ kfree(reply);
+ kfree(msg);
return rv;
+
+out_err:
+ kfree(rv);
+ if (!IS_ERR(reply))
+ kfree(reply);
+ if (!IS_ERR(msg))
+ kfree(msg);
+ return NULL;
}

/* Map deviceid returned by the server to constructed block_device */
diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
index 3d15de0..097dd05 100644
--- a/fs/nfs/blocklayout/blocklayoutdm.c
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -31,6 +31,8 @@
*/

#include <linux/genhd.h> /* gendisk - used in a dprintk*/
+#include <linux/sched.h>
+#include <linux/hash.h>

#include "blocklayout.h"

@@ -45,52 +47,44 @@
#define roundup8(x) (((x)+7) & ~7)
#define sizeof8(x) roundup8(sizeof(x))

-/* Given x>=1, return smallest n such that 2**n >= x */
-static unsigned long find_order(int x)
+static int dev_remove(dev_t dev)
{
- unsigned long rv = 0;
- for (x--; x; x >>= 1)
- rv++;
- return rv;
-}
-
-/* Debugging aid */
-static void print_extent(u64 meta_offset, dev_t disk,
- u64 disk_offset, u64 length)
-{
- dprintk("%lli:, %d:%d %lli, %lli\n", meta_offset, MAJOR(disk),
- MINOR(disk), disk_offset, length);
-}
-static int dev_create(const char *name, dev_t *dev)
-{
- struct dm_ioctl ctrl;
- int rv;
-
- memset(&ctrl, 0, sizeof(ctrl));
- strncpy(ctrl.name, name, DM_NAME_LEN-1);
- rv = dm_dev_create(&ctrl); /* XXX - need to pull data out of ctrl */
- dprintk("Tried to create %s, got %i\n", name, rv);
- if (!rv) {
- *dev = huge_decode_dev(ctrl.dev);
- dprintk("dev = (%i, %i)\n", MAJOR(*dev), MINOR(*dev));
+ int ret = 1;
+ struct pipefs_hdr *msg = NULL, *reply = NULL;
+ uint64_t bl_dev;
+ uint32_t major = MAJOR(dev), minor = MINOR(dev);
+
+ dprintk("Entering %s\n", __func__);
+
+ if (IS_ERR(bl_device_pipe))
+ return ret;
+
+ memcpy((void *)&bl_dev, &major, sizeof(uint32_t));
+ memcpy((void *)&bl_dev + sizeof(uint32_t), &minor, sizeof(uint32_t));
+ msg = pipefs_alloc_init_msg(0, BL_DEVICE_UMOUNT, 0, (void *)&bl_dev,
+ sizeof(uint64_t));
+ if (IS_ERR(msg)) {
+ dprintk("ERROR: couldn't make pipefs message.\n");
+ goto out;
+ }
+ msg->msgid = hash_ptr(&msg, sizeof(msg->msgid) * 8);
+ msg->status = BL_DEVICE_REQUEST_INIT;
+
+ reply = pipefs_queue_upcall_waitreply(bl_device_pipe, msg,
+ &bl_device_list, 0, 0);
+ if (IS_ERR(reply)) {
+ dprintk("ERROR: upcall_waitreply failed\n");
+ goto out;
}
- return rv;
-}
-
-static int dev_remove(const char *name)
-{
- struct dm_ioctl ctrl;
- memset(&ctrl, 0, sizeof(ctrl));
- strncpy(ctrl.name, name, DM_NAME_LEN-1);
- return dm_dev_remove(&ctrl);
-}

-static int dev_resume(const char *name)
-{
- struct dm_ioctl ctrl;
- memset(&ctrl, 0, sizeof(ctrl));
- strncpy(ctrl.name, name, DM_NAME_LEN-1);
- return dm_do_resume(&ctrl);
+ if (reply->status == BL_DEVICE_REQUEST_PROC)
+ ret = 0; /*TODO: what to return*/
+out:
+ if (!IS_ERR(reply))
+ kfree(reply);
+ if (!IS_ERR(msg))
+ kfree(msg);
+ return ret;
}

/*
@@ -100,12 +94,12 @@ static int nfs4_blk_metadev_release(struct pnfs_block_dev *bdev)
{
int rv;

- dprintk("%s Releasing %s\n", __func__, bdev->bm_mdevname);
+ dprintk("%s Releasing\n", __func__);
/* XXX Check return? */
rv = nfs4_blkdev_put(bdev->bm_mdev);
dprintk("%s nfs4_blkdev_put returns %d\n", __func__, rv);

- rv = dev_remove(bdev->bm_mdevname);
+ rv = dev_remove(bdev->bm_mdev->bd_dev);
dprintk("%s Returns %d\n", __func__, rv);
return rv;
}
@@ -114,9 +108,8 @@ void free_block_dev(struct pnfs_block_dev *bdev)
{
if (bdev) {
if (bdev->bm_mdev) {
- dprintk("%s Removing DM device: %s %d:%d\n",
+ dprintk("%s Removing DM device: %d:%d\n",
__func__,
- bdev->bm_mdevname,
MAJOR(bdev->bm_mdev->bd_dev),
MINOR(bdev->bm_mdev->bd_dev));
/* XXX Check status ?? */
@@ -125,213 +118,3 @@ void free_block_dev(struct pnfs_block_dev *bdev)
kfree(bdev);
}
}
-
-/*
- * Create meta device. Keep it open to use for I/O.
- */
-struct pnfs_block_dev *nfs4_blk_init_metadev(struct nfs_server *server,
- struct pnfs_device *dev)
-{
- static uint64_t dev_count; /* STUB used for device names */
- struct block_device *bd;
- dev_t meta_dev;
- struct pnfs_block_dev *rv;
- int status;
-
- dprintk("%s enter\n", __func__);
-
- rv = kmalloc(sizeof(*rv) + 32, GFP_KERNEL);
- if (!rv)
- return NULL;
- rv->bm_mdevname = (char *)rv + sizeof(*rv);
- sprintf(rv->bm_mdevname, "FRED_%llu", dev_count++);
- status = dev_create(rv->bm_mdevname, &meta_dev);
- if (status)
- goto out_err;
- bd = nfs4_blkdev_get(meta_dev);
- if (!bd)
- goto out_err;
- if (bd_claim(bd, server)) {
- dprintk("%s: failed to claim device %d:%d\n",
- __func__,
- MAJOR(meta_dev),
- MINOR(meta_dev));
- blkdev_put(bd, FMODE_READ);
- goto out_err;
- }
-
- rv->bm_mdev = bd;
- memcpy(&rv->bm_mdevid, &dev->dev_id, sizeof(struct pnfs_deviceid));
- dprintk("%s Created device %s named %s with bd_block_size %u\n",
- __func__,
- bd->bd_disk->disk_name,
- rv->bm_mdevname,
- bd->bd_block_size);
- return rv;
-
- out_err:
- kfree(rv);
- return NULL;
-}
-
-/*
- * Given a vol_offset into root, returns the disk and disk_offset it
- * corresponds to, as well as the length of the contiguous segment thereafter.
- * All offsets/lengths are in 512-byte sectors.
- */
-static int nfs4_blk_resolve(int root, struct pnfs_blk_volume *vols,
- u64 vol_offset, dev_t *disk, u64 *disk_offset,
- u64 *length)
-{
- struct pnfs_blk_volume *node;
- u64 node_offset;
-
- /* Walk down device tree until we hit a leaf node (VOLUME_SIMPLE) */
- node = &vols[root];
- node_offset = vol_offset;
- *length = node->bv_size;
- while (1) {
- dprintk("offset=%lli, length=%lli\n",
- node_offset, *length);
- if (node_offset > node->bv_size)
- return -EIO;
- switch (node->bv_type) {
- case PNFS_BLOCK_VOLUME_SIMPLE:
- *disk = node->bv_dev;
- dprintk("%s VOLUME_SIMPLE: node->bv_dev %d:%d\n",
- __func__,
- MAJOR(node->bv_dev),
- MINOR(node->bv_dev));
- *disk_offset = node_offset;
- *length = min(*length, node->bv_size - node_offset);
- return 0;
- case PNFS_BLOCK_VOLUME_SLICE:
- dprintk("%s VOLUME_SLICE:\n", __func__);
- *length = min(*length, node->bv_size - node_offset);
- node_offset += node->bv_offset;
- node = node->bv_vols[0];
- break;
- case PNFS_BLOCK_VOLUME_CONCAT: {
- u64 next = 0, sum = 0;
- int i;
- dprintk("%s VOLUME_CONCAT:\n", __func__);
- for (i = 0; i < node->bv_vol_n; i++) {
- next = sum + node->bv_vols[i]->bv_size;
- if (node_offset < next)
- break;
- sum = next;
- }
- *length = min(*length, next - node_offset);
- node_offset -= sum;
- node = node->bv_vols[i];
- }
- break;
- case PNFS_BLOCK_VOLUME_STRIPE: {
- u64 global_s_no;
- u64 stripe_pos;
- u64 local_s_no;
- u64 disk_number;
-
- dprintk("%s VOLUME_STRIPE:\n", __func__);
- global_s_no = node_offset;
- /* BUG - note this assumes stripe_unit <= 2**32 */
- stripe_pos = (u64) do_div(global_s_no,
- (u32)node->bv_stripe_unit);
- local_s_no = global_s_no;
- disk_number = (u64) do_div(local_s_no,
- (u32) node->bv_vol_n);
- *length = min(*length,
- node->bv_stripe_unit - stripe_pos);
- node_offset = local_s_no * node->bv_stripe_unit +
- stripe_pos;
- node = node->bv_vols[disk_number];
- }
- break;
- default:
- return -EIO;
- }
- }
-}
-
-/*
- * Create an LVM dm device table that represents the volume topology returned
- * by GETDEVICELIST or GETDEVICEINFO.
- *
- * vols: topology with VOLUME_SIMPLEs mapped to visable block disks.
- * size: number of volumes in vols.
- */
-int nfs4_blk_flatten(struct pnfs_blk_volume *vols, int size,
- struct pnfs_block_dev *bdev)
-{
- u64 meta_offset = 0;
- u64 meta_size = vols[size-1].bv_size;
- dev_t disk;
- u64 disk_offset, len;
- int status = 0, count = 0, pages_needed;
- struct dm_ioctl *ctl;
- struct dm_target_spec *spec;
- char *args = NULL;
- unsigned long p;
-
- dprintk("%s enter. mdevname %s number of volumes %d\n", __func__,
- bdev->bm_mdevname, size);
-
- /* We need to reserve memory to store segments, so need to count
- * segments. This means we resolve twice, basically throwing away
- * all info from first run apart from the count. Seems like
- * there should be a better way.
- */
- for (meta_offset = 0; meta_offset < meta_size; meta_offset += len) {
- status = nfs4_blk_resolve(size-1, vols, meta_offset, &disk,
- &disk_offset, &len);
- /* TODO Check status */
- count += 1;
- }
-
- dprintk("%s: Have %i segments\n", __func__, count);
- pages_needed = ((count + SPEC_HEADER_ADJUST) / SPECS_PER_PAGE) + 1;
- dprintk("%s: Need %i pages\n", __func__, pages_needed);
- p = __get_free_pages(GFP_KERNEL, find_order(pages_needed));
- if (!p)
- return -ENOMEM;
- /* A dm_ioctl is placed at the beginning, followed by a series of
- * (dm_target_spec, argument string) pairs.
- */
- ctl = (struct dm_ioctl *) p;
- spec = (struct dm_target_spec *) (p + sizeof8(*ctl));
- memset(ctl, 0, sizeof(*ctl));
- ctl->data_start = (char *) spec - (char *) ctl;
- ctl->target_count = count;
- strncpy(ctl->name, bdev->bm_mdevname, DM_NAME_LEN);
-
- dprintk("%s ctl->name %s\n", __func__, ctl->name);
- for (meta_offset = 0; meta_offset < meta_size; meta_offset += len) {
- status = nfs4_blk_resolve(size-1, vols, meta_offset, &disk,
- &disk_offset, &len);
- if (!len)
- break;
- /* TODO Check status */
- print_extent(meta_offset, disk, disk_offset, len);
- spec->sector_start = meta_offset;
- spec->length = len;
- spec->status = 0;
- strcpy(spec->target_type, "linear");
- args = (char *) (spec + 1);
- sprintf(args, "%i:%i %lli",
- MAJOR(disk), MINOR(disk), disk_offset);
- dprintk("%s args %s\n", __func__, args);
- spec->next = roundup8(sizeof(*spec) + strlen(args) + 1);
- spec = (struct dm_target_spec *) (((char *) spec) + spec->next);
- }
- ctl->data_size = (char *) spec - (char *) ctl;
-
- status = dm_table_load(ctl, ctl->data_size);
- dprintk("%s dm_table_load returns %d\n", __func__, status);
-
- dev_resume(bdev->bm_mdevname);
-
- free_pages(p, find_order(pages_needed));
- dprintk("%s returns %d\n", __func__, status);
- return status;
-}
-
--
1.7.4.1
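
The upcall payloads in the patch above carry a device number as two consecutive 32-bit words: major first, then minor (see the memcpy() pairs in dev_remove() and in the reply handling). A minimal userspace sketch of that layout follows. The helper names bl_pack_devno() and bl_unpack_devno() are hypothetical and do not appear in the patch. The sketch assumes both ends run on the same host, so the words stay in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store major, then minor, in an 8-byte payload,
 * mirroring the two memcpy() calls in dev_remove(). */
static void bl_pack_devno(unsigned char payload[8],
			  uint32_t major, uint32_t minor)
{
	assert(payload != NULL);
	memcpy(payload, &major, sizeof(major));
	memcpy(payload + sizeof(major), &minor, sizeof(minor));
}

/* Hypothetical helper: the inverse, as done on the reply payload. */
static void bl_unpack_devno(const unsigned char payload[8],
			    uint32_t *major, uint32_t *minor)
{
	assert(payload != NULL);
	memcpy(major, payload, sizeof(*major));
	memcpy(minor, payload + sizeof(*major), sizeof(*minor));
}
```

Using an unsigned char buffer avoids the arithmetic on a void pointer that the patch relies on, which is a GNU C extension.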


2011-06-12 14:40:56

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 06/10/2011 07:19 PM, Peng Tao wrote:
> On Sat, Jun 11, 2011 at 7:20 AM, Boaz Harrosh <[email protected]> wrote:
>> On 06/10/2011 12:23 PM, Benny Halevy wrote:
>>> On 2011-06-10 10:09, [email protected] wrote:
>>>
>>> A simple algorithm I can suggest is:
>>> - on initialization, calculate and save, per layout driver
>>>   - maximum layout size
>>>     - take into account csr_fore_chan_attrs.ca_maxresponsesize and possible other parameters
>>>   - keep a working copy of the maximum value and the calculated copy.
>>>   - alignment value.
>>> - on miss, see if there's an adjacent layout segment in cache
>>> - if found, ask for twice the found segment size, up to the maximum value,
>>>   aligned on the alignment value.
>>> - if the server returns less the layoutget range, keep note of the returned length
>>>   (but not adjust maximum yet, as the server may return a short segment for various
>>>    reasons)
>>> - if the server is consistent about returning less than was asked, adjust the
>>>   - working copy of the maximum length
>>> - if the maximum was adjusted try bumping it up after X (TBD) layoutgets or T seconds
>>>   to see if that was just due to high load or conflicts on the server
>>> - on any error returned for LAYOUTGET reset the algorithm parameters
>>> - on session reestablishment recalculate maximums.
>>>
>>> Benny
>>>
>>
>> I completely disagree with all this. NACK!
>>
>> The only proper thing a client can do is ask for what it needs, and only the application
>> can do that, because at the VFS level it is only second guessing, and is completely
>> pointless.
>>
>> The only one that can know about structure, alignments, optimal IO sizes and layouts
>> is the server. The server even have more information to second guess the application
>> from the file size information and it's share and lock disposition. Please see my
>> simple Server side algorithm.
> Well, IMO, client is closer to applications and should have a better
> position at "guessing" application's workload.
>
> A simple example, when a client asks for a layout, server would have
> no idea if client is doing layout prefetch, or if it really need that
> range to complete its work. But the client knows it for sure.
>

What is layout prefetch? We don't do any in the Linux client.

And your example does not make any sense. If a theoretical stupid client does
"layout prefetch", whatever that means, how is it related to the application?

There is not a single piece of information a client has that the server does
not. Look at your patch: it sets an arbitrary value at client setup. That same
arbitrary value can be set up at the server. Look at Benny's algorithm. It can
be done just the same at the server side. (Which does not mean it is a good
algorithm.)

Just take that patch of yours, but instead of the client side, put it at the
server side. You will achieve the exact same results, only you configure one
server instead of every client.

And no way in hell will I let that go into the Linux generic client!

Boaz

>>
>> Because you must understand one most important thing. Any smart decision a client can
>> make will be after it received the layout (stripe_unit, number-of-devices etc..) But
>> at that time it is too late it already sent the layout_get. Only the server knows
>> before hand what is the most optimal size. The client should just be a transparent
>> pipe from application to the server. It should never ever set policy. Only a Server
>> can/should do that.
>>
>> Lets put the efforts and algorithms where they belong, please?
>>
>> Boaz
>>
>
>
>
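
For reference, the adaptive sizing Benny proposes in the quoted list can be sketched in a few lines of C. This is only an illustration of the quoted steps, not code from the patch set. The struct, the function names and the LG_SHORT_LIMIT threshold are all assumptions, and the maxima are assumed to be multiples of the alignment.

```c
#include <assert.h>
#include <stdint.h>

#define LG_SHORT_LIMIT 3	/* assumed: short replies before shrinking */

struct lg_sizer {
	uint64_t calc_max;	/* maximum calculated at initialization */
	uint64_t work_max;	/* working copy, may be lowered */
	uint64_t align;		/* alignment, assumed a power of two */
	unsigned int short_replies;	/* consecutive short replies seen */
};

static uint64_t lg_align_up(uint64_t len, uint64_t align)
{
	return (len + align - 1) & ~(align - 1);
}

/* Desired length on a cache miss; neighbour_len is 0 if there is no
 * adjacent cached segment. Never returns less than min_len. */
static uint64_t lg_desired(const struct lg_sizer *s, uint64_t min_len,
			   uint64_t neighbour_len)
{
	uint64_t want = min_len;

	assert(s->align && (s->align & (s->align - 1)) == 0);
	if (neighbour_len && 2 * neighbour_len > want)
		want = 2 * neighbour_len;
	want = lg_align_up(want, s->align);
	if (want > s->work_max)
		want = s->work_max;
	return want < min_len ? min_len : want;
}

/* Record a reply; lower work_max only after repeated short replies. */
static void lg_note_reply(struct lg_sizer *s, uint64_t asked, uint64_t got)
{
	if (got >= asked) {
		s->short_replies = 0;
		return;
	}
	if (++s->short_replies >= LG_SHORT_LIMIT) {
		uint64_t shrunk = got & ~(s->align - 1);

		s->work_max = shrunk ? shrunk : s->align;
		s->short_replies = 0;
	}
}

/* On a LAYOUTGET error, or when retrying the larger size, start over. */
static void lg_reset(struct lg_sizer *s)
{
	s->work_max = s->calc_max;
	s->short_replies = 0;
}
```

Nothing in the sketch depends on client-only state, so the same bookkeeping could equally sit on the server, which is the point under dispute in this thread.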


2011-06-12 18:46:51

by Peng Tao

[permalink] [raw]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On Sun, Jun 12, 2011 at 10:40 PM, Boaz Harrosh <[email protected]> wrote:
> On 06/10/2011 07:19 PM, Peng Tao wrote:
>> On Sat, Jun 11, 2011 at 7:20 AM, Boaz Harrosh <[email protected]> wrote:
>>> On 06/10/2011 12:23 PM, Benny Halevy wrote:
>>>> On 2011-06-10 10:09, [email protected] wrote:
>>>>
>>>> A simple algorithm I can suggest is:
>>>> - on initialization, calculate and save, per layout driver
>>>>   - maximum layout size
>>>>     - take into account csr_fore_chan_attrs.ca_maxresponsesize and possible other parameters
>>>>   - keep a working copy of the maximum value and the calculated copy.
>>>>   - alignment value.
>>>> - on miss, see if there's an adjacent layout segment in cache
>>>> - if found, ask for twice the found segment size, up to the maximum value,
>>>>   aligned on the alignment value.
>>>> - if the server returns less the layoutget range, keep note of the returned length
>>>>   (but not adjust maximum yet, as the server may return a short segment for various
>>>>    reasons)
>>>> - if the server is consistent about returning less than was asked, adjust the
>>>>   - working copy of the maximum length
>>>> - if the maximum was adjusted try bumping it up after X (TBD) layoutgets or T seconds
>>>>   to see if that was just due to high load or conflicts on the server
>>>> - on any error returned for LAYOUTGET reset the algorithm parameters
>>>> - on session reestablishment recalculate maximums.
>>>>
>>>> Benny
>>>>
>>>
>>> I completely disagree with all this. NACK!
>>>
>>> The only proper thing a client can do is ask for what it needs, and only the application
>>> can do that, because at the VFS level it is only second guessing, and is completely
>>> pointless.
>>>
>>> The only one that can know about structure, alignments, optimal IO sizes and layouts
>>> is the server. The server even have more information to second guess the application
>>> from the file size information and it's share and lock disposition. Please see my
>>> simple Server side algorithm.
>> Well, IMO, client is closer to applications and should have a better
>> position at "guessing" application's workload.
>>
>> A simple example, when a client asks for a layout, server would have
>> no idea if client is doing layout prefetch, or if it really need that
>> range to complete its work. But the client knows it for sure.
>>
>
> What is layout prefetch? We don't do any in the Linux client.
>
> And your example does not make any sense. If a theoretical stupid client does
> "layout prefetch", whatever that means, how is it related to the application?
The example is just showing you that sometimes the server does not know
everything, and the things it does not know can have some impact on
application IO performance. If the client is asking for more than it needs
and the server is returning more than the client asks for, application
performance is impacted for sure.

>
> There is not a single piece of information a client has that the server does
> not. Look at your patch: it sets an arbitrary value at client setup. That same
> arbitrary value can be set up at the server. Look at Benny's algorithm. It can
> be done just the same at the server side. (Which does not mean it is a good
> algorithm.)
True. And both client and server should have some kind of
algorithm implemented. The server can be smart, but that does not prevent
the client from being smart too. There can always be naive clients and
naive servers. We can't prevent users from using them. And when users
are dealing with naive servers, we can do better with the algorithm at
the client side, rather than simply telling them "it's your fault for not
using the smart servers!".

>
> Just take that patch of yours, but instead of the client side, put it at the
> server side. You will achieve the exact same results, only you configure one
> server instead of every client.
>
> And no way in hell will I let that go into the Linux generic client!
While you take your side so firmly, would you mind explaining why this kind
of algorithm cannot be implemented at the client side?

>
> Boaz
>
>>>
>>> Because you must understand one most important thing. Any smart decision a client can
>>> make will be after it received the layout (stripe_unit, number-of-devices etc..) But
>>> at that time it is too late it already sent the layout_get. Only the server knows
>>> before hand what is the most optimal size. The client should just be a transparent
>>> pipe from application to the server. It should never ever set policy. Only a Server
>>> can/should do that.
>>>
>>> Lets put the efforts and algorithms where they belong, please?
>>>
>>> Boaz
>>>
>>
>>
>>
>
>



--
Thanks,
-Bergwolf

2011-06-10 05:36:57

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 87/88] Add configurable prefetch size for layoutget

Hi, Benny,

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
Sent: Friday, June 10, 2011 5:23 AM
To: Jim Rees
Cc: Peng Tao; [email protected]; peter honeyman
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-09 06:58, Jim Rees wrote:
> Benny Halevy wrote:
>
> > My understanding is that layoutget specifies a min and max, and the server
>
> There's a min. What do you consider the max?
> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>
> The spec doesn't say max, it says "desired." I guess I assumed the server
> wouldn't normally return more than desired.

No, the server may freely upgrade the returned layout segment by returning
a layout for a larger byte range or even returning a RW layout where a READ
layout was asked for.
[PT] It is true that the server can upgrade the layout segment freely, but there is always a price to pay. The server has to deal with all kinds of clients.
If the server returns more than was asked for, it may hurt other clients.


2011-06-09 13:58:48

by Jim Rees

[permalink] [raw]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

Benny Halevy wrote:

> My understanding is that layoutget specifies a min and max, and the server

There's a min. What do you consider the max?
Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?

The spec doesn't say max, it says "desired." I guess I assumed the server
wouldn't normally return more than desired.

18.43.3. DESCRIPTION
...

The LAYOUTGET operation returns layout information for the specified
byte-range: a layout. The client actually specifies two ranges, both
starting at the offset in the loga_offset field. The first range is
between loga_offset and loga_offset + loga_length - 1 inclusive.
This range indicates the desired range the client wants the layout to
cover. The second range is between loga_offset and loga_offset +
loga_minlength - 1 inclusive. This range indicates the required
range the client needs the layout to cover. Thus, loga_minlength
MUST be less than or equal to loga_length.
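
The two ranges in the quoted description can be written down directly. The sketch below is illustrative only: the field names come from the quoted text, while the struct and function names are assumptions. It ignores any special length values and arithmetic overflow, and assumes both lengths are nonzero.

```c
#include <assert.h>
#include <stdint.h>

/* Only the three fields discussed above; the struct name is assumed. */
struct lg_args {
	uint64_t loga_offset;
	uint64_t loga_length;		/* desired */
	uint64_t loga_minlength;	/* required */
};

/* loga_minlength MUST be less than or equal to loga_length. */
static int lg_args_valid(const struct lg_args *a)
{
	return a->loga_minlength <= a->loga_length;
}

/* Last byte of the desired range (inclusive). */
static uint64_t lg_desired_last(const struct lg_args *a)
{
	assert(a->loga_length > 0);
	return a->loga_offset + a->loga_length - 1;
}

/* Last byte of the required range (inclusive). */
static uint64_t lg_required_last(const struct lg_args *a)
{
	assert(a->loga_minlength > 0);
	return a->loga_offset + a->loga_minlength - 1;
}

/* Does a returned layout [off, off + len) cover the required range?
 * The server may return more than desired, but never less than this. */
static int lg_covers_required(const struct lg_args *a,
			      uint64_t off, uint64_t len)
{
	return off <= a->loga_offset &&
	       off + len >= a->loga_offset + a->loga_minlength;
}
```

A client that sets loga_minlength equal to loga_length leaves the server no room to return a shorter segment. Setting loga_length larger expresses a preference without making it a requirement.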

2011-06-07 17:30:28

by Jim Rees

[permalink] [raw]
Subject: [PATCH 37/88] pnfsblock: cleanup_layoutcommit

From: Fred Isaman <[email protected]>

In the blocklayout driver, two things happen during
layoutcommit/cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout's rw
extents list, for reads.

In the new system, where the actual xdr encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit and bl_cleanup_layoutcommit.
(The generic layer provides a void-star to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on the above structure.
FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. The commit-extent-list is then freed.
Finally, the allocated structure is freed.

[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <[email protected]>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <[email protected]>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <[email protected]>
[pnfsblock: cleanup_layoutcommit wants a status parameter]
Signed-off-by: Boaz Harrosh <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 2 +
fs/nfs/blocklayout/blocklayout.h | 3 +
fs/nfs/blocklayout/extents.c | 153 +++++++++++++++++++++++++++++++++++++-
3 files changed, 157 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 6132e8e..db008e6 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -661,6 +661,8 @@ bl_cleanup_layoutcommit(struct pnfs_layout_type *lo,
struct pnfs_layoutcommit_arg *arg, int status)
{
dprintk("%s enter\n", __func__);
+ clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), arg, status);
+ kfree(arg->layoutdriver_data);
}

static void free_blk_mountid(struct block_mount_id *mid)
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 1c110e1..ca36e61 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -266,6 +266,9 @@ int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
const struct pnfs_layoutcommit_arg *arg);
+void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+ const struct pnfs_layoutcommit_arg *arg,
+ int status);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
int mark_for_commit(struct pnfs_block_extent *be,
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index b2f8643..1719a67 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -144,7 +144,6 @@ static int _set_range(struct my_tree_t *tree, int32_t tag, u64 s, u64 length)
return 0;
}

-
/* Ensure that future operations on given range of tree will not malloc */
static int _preload_range(struct my_tree_t *tree, u64 offset, u64 length)
{
@@ -681,6 +680,34 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
return ret;
}

+/* Similar to find_get_extent, but called with lock held, and ignores cow */
+static struct pnfs_block_extent *
+find_get_extent_locked(struct pnfs_block_layout *bl, sector_t isect)
+{
+ struct pnfs_block_extent *be, *ret = NULL;
+ int i;
+
+ dprintk("%s enter with isect %llu\n", __func__, (u64)isect);
+ for (i = 0; i < EXTENT_LISTS; i++) {
+ if (ret)
+ break;
+ list_for_each_entry(be, &bl->bl_extents[i], be_node) {
+ if (isect < be->be_f_offset)
+ break;
+ if (isect < be->be_f_offset + be->be_length) {
+ /* We have found an extent */
+ dprintk("%s Get %p (%i)\n", __func__, be,
+ atomic_read(&be->be_refcnt.refcount));
+ kref_get(&be->be_refcnt);
+ ret = be;
+ break;
+ }
+ }
+ }
+ print_bl_extent(ret);
+ return ret;
+}
+
int
encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
@@ -728,3 +755,127 @@ encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
*xdr_start = cpu_to_be32((xdr->p - xdr_start - 1) * 4);
return 0;
}
+
+/* Helper function to set_to_rw that initialize a new extent */
+static void
+_prep_new_extent(struct pnfs_block_extent *new,
+ struct pnfs_block_extent *orig,
+ sector_t offset, sector_t length, int state)
+{
+ kref_init(&new->be_refcnt);
+ /* don't need to INIT_LIST_HEAD(&new->be_node) */
+ memcpy(&new->be_devid, &orig->be_devid, sizeof(struct pnfs_deviceid));
+ new->be_mdev = orig->be_mdev;
+ new->be_f_offset = offset;
+ new->be_length = length;
+ new->be_v_offset = orig->be_v_offset - orig->be_f_offset + offset;
+ new->be_state = state;
+ new->be_inval = orig->be_inval;
+}
+
+static u64
+set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
+{
+ u64 rv = 0;
+ struct pnfs_block_extent *be, *e1, *e2, *e3, *new, *old;
+ struct pnfs_block_extent *children[3];
+ int i = 0, j;
+
+ dprintk("%s(%llu, %llu)\n", __func__, offset, length);
+ /* Create storage for up to three new extents e1, e2, e3 */
+ e1 = kmalloc(sizeof(*e1), GFP_KERNEL);
+ e2 = kmalloc(sizeof(*e2), GFP_KERNEL);
+ e3 = kmalloc(sizeof(*e3), GFP_KERNEL);
+ /* BUG - we are ignoring any failure */
+ if (!e1 || !e2 || !e3)
+ goto out_nosplit;
+
+ spin_lock(&bl->bl_ext_lock);
+ be = find_get_extent_locked(bl, offset);
+ print_bl_extent(be);
+ rv = be->be_f_offset + be->be_length;
+ if (be->be_state != PNFS_BLOCK_INVALID_DATA) {
+ spin_unlock(&bl->bl_ext_lock);
+ goto out_nosplit;
+ }
+ /* Add e* to children, bumping e*'s krefs */
+ if (be->be_f_offset != offset) {
+ _prep_new_extent(e1, be, be->be_f_offset,
+ offset - be->be_f_offset,
+ PNFS_BLOCK_INVALID_DATA);
+ children[i++] = e1;
+ kref_get(&e1->be_refcnt);
+ } else
+ kfree(e1);
+ _prep_new_extent(e2, be, offset,
+ min(length, be->be_f_offset + be->be_length - offset),
+ PNFS_BLOCK_READWRITE_DATA);
+ children[i++] = e2;
+ kref_get(&e2->be_refcnt);
+ if (offset + length < be->be_f_offset + be->be_length) {
+ _prep_new_extent(e3, be, e2->be_f_offset + e2->be_length,
+ be->be_f_offset + be->be_length -
+ offset - length,
+ PNFS_BLOCK_INVALID_DATA);
+ children[i++] = e3;
+ kref_get(&e3->be_refcnt);
+ } else
+ kfree(e3);
+
+ /* Remove be from list, and insert the e* */
+ /* We don't get refs on e*, since this list is the base reference
+ * set when init'ed.
+ */
+ if (i < 3)
+ children[i] = NULL;
+ new = children[0];
+ list_replace(&be->be_node, &new->be_node);
+ put_extent(be);
+ for (j = 1; j < i; j++) {
+ old = new;
+ new = children[j];
+ list_add(&new->be_node, &old->be_node);
+ }
+ spin_unlock(&bl->bl_ext_lock);
+
+ /* Since we removed the base reference above, be is now scheduled for
+ * destruction.
+ */
+ put_extent(be);
+ dprintk("%s returns %llu after split\n", __func__, rv);
+ return rv;
+
+ out_nosplit:
+ kfree(e1);
+ kfree(e2);
+ kfree(e3);
+ dprintk("%s returns %llu without splitting\n", __func__, rv);
+ return rv;
+}
+
+void
+clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+ const struct pnfs_layoutcommit_arg *arg,
+ int status)
+{
+ struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
+ struct pnfs_block_short_extent *lce, *save;
+
+ dprintk("%s status %d\n", __func__, status);
+ list_for_each_entry_safe_reverse(lce, save, &bld->ranges, bse_node) {
+ if (likely(!status)) {
+ u64 offset = lce->bse_f_offset;
+ u64 end = offset + lce->bse_length;
+
+ do {
+ offset = set_to_rw(bl, offset, end - offset);
+ } while (offset < end);
+
+ kfree(lce);
+ } else {
+ spin_lock(&bl->bl_ext_lock);
+ add_to_commitlist(bl, lce);
+ spin_unlock(&bl->bl_ext_lock);
+ }
+ }
+}
--
1.7.4.1
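
The core of set_to_rw() in the patch above is an extent split: an INVALID extent is replaced by up to three children, with the committed range in the middle marked READWRITE. A minimal userspace sketch of just that arithmetic follows, without the locking, refcounting or list handling. The struct ext type, the EXT_* constants and split_extent() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { EXT_INVALID, EXT_READWRITE };

/* Simplified stand-in for struct pnfs_block_extent. */
struct ext {
	uint64_t f_offset;	/* start, in sectors */
	uint64_t length;	/* length, in sectors */
	int state;
};

/*
 * Split 'be' around [offset, offset + length). Writes up to three
 * children to out[] and returns how many. *next is set to the first
 * sector past 'be', which is what set_to_rw() returns so that its
 * caller can continue with the following extent.
 */
static int split_extent(const struct ext *be, uint64_t offset,
			uint64_t length, struct ext out[3], uint64_t *next)
{
	uint64_t be_end = be->f_offset + be->length;
	uint64_t mid_len;
	int n = 0;

	assert(offset >= be->f_offset && offset < be_end);
	*next = be_end;
	if (be->state != EXT_INVALID)
		return 0;	/* nothing to convert, no split */

	/* Leading part stays INVALID. */
	if (offset != be->f_offset)
		out[n++] = (struct ext){ be->f_offset,
					 offset - be->f_offset, EXT_INVALID };

	/* Committed part becomes READWRITE, clipped to this extent. */
	mid_len = length < be_end - offset ? length : be_end - offset;
	out[n++] = (struct ext){ offset, mid_len, EXT_READWRITE };

	/* Trailing part, if any, stays INVALID. */
	if (offset + length < be_end)
		out[n++] = (struct ext){ offset + mid_len,
					 be_end - offset - length, EXT_INVALID };
	return n;
}
```

When the committed range runs past the end of the extent, the middle child is clipped and the caller resumes at *next, as the do/while loop in clean_pnfs_block_layoutupdate() does.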


2011-06-10 06:00:55

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 87/88] Add configurable prefetch size for layoutget

Hi, Benny,

Cheers,
-Bergwolf


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
Sent: Friday, June 10, 2011 5:23 AM
To: Peng Tao
Cc: Jim Rees; [email protected]; peter honeyman
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-09 08:07, Peng Tao wrote:
> Hi, Jim and Benny,
>
> On Thu, Jun 9, 2011 at 9:58 PM, Jim Rees <[email protected]> wrote:
>> Benny Halevy wrote:
>>
>> > My understanding is that layoutget specifies a min and max, and the server
>>
>> There's a min. What do you consider the max?
>> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>
>> The spec doesn't say max, it says "desired." I guess I assumed the server
>> wouldn't normally return more than desired.
> In fact server is returning "desired" length. The problem is that we
> call pnfs_update_layout in nfs_write_begin, and it will end up setting
> both minlength and length to page size. There is no space for client
> to collapse layoutget range in nfs_write_begin.
>

That's a different issue. Waiting with pnfs_update_layout to flush
time rather than write_begin if the whole page is written would help
sending a more meaningful desired range as well as avoiding needless
read-modify-writes in case the application also wrote the whole
preallocated block.
[PT] That is also the reason why we want to introduce layout prefetching: to get a larger segment than the single page passed to nfs_write_begin.

>>
>> 18.43.3. DESCRIPTION
>> ...
>>
>> The LAYOUTGET operation returns layout information for the specified
>> byte-range: a layout. The client actually specifies two ranges, both
>> starting at the offset in the loga_offset field. The first range is
>> between loga_offset and loga_offset + loga_length - 1 inclusive.
>> This range indicates the desired range the client wants the layout to
>> cover. The second range is between loga_offset and loga_offset +
>> loga_minlength - 1 inclusive. This range indicates the required
>> range the client needs the layout to cover. Thus, loga_minlength
>> MUST be less than or equal to loga_length.
>>
>
>
>

--
Benny Halevy
CTO, Tonian Inc.

Tel: +972-54-802-8340
[email protected]

2011-06-07 17:35:14

by Jim Rees

[permalink] [raw]
Subject: [PATCH 79/88] SQUASHME: pnfs-block: Return failure from bl_initialize_mountpoint

if we can't get the device for the disk list.

Signed-off-by: Jim Rees <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 5be912e..5f11fb8 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -677,7 +677,7 @@ static void free_blk_mountid(struct block_mount_id *mid)
}
}

-/* This is mostly copied form the filelayout's get_device_info function.
+/* This is mostly copied from the filelayout's get_device_info function.
* It seems much of this should be at the generic pnfs level.
*/
static struct pnfs_block_dev *
@@ -796,8 +796,10 @@ bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
bdev = nfs4_blk_get_deviceinfo(server, fh,
&dlist->dev_id[i],
&block_disklist);
- if (!bdev)
+ if (!bdev) {
+ status = -ENODEV;
goto out_error;
+ }
spin_lock(&b_mt_id->bm_lock);
list_add(&bdev->bm_node, &b_mt_id->bm_devlist);
spin_unlock(&b_mt_id->bm_lock);
--
1.7.4.1
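
The fix above follows the usual "set status, then goto the error label" pattern. A minimal userspace sketch, assuming a hypothetical lookup_device() that stands in for nfs4_blk_get_deviceinfo() and returns NULL on failure:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the device lookup: NULL when index i is out of range. */
static inline const int *lookup_device(const int *table, size_t n, size_t i)
{
	return i < n ? &table[i] : NULL;
}

/* Look up `wanted` devices; return 0 on success, -ENODEV if any lookup fails. */
static inline int init_devices(const int *table, size_t n, size_t wanted)
{
	int status = 0;
	size_t i;

	for (i = 0; i < wanted; i++) {
		const int *dev = lookup_device(table, n, i);

		if (!dev) {
			/* Without this assignment the error path would return
			 * whatever value status held before the loop. */
			status = -ENODEV;
			goto out_error;
		}
	}
	return 0;

out_error:
	/* Cleanup of partially initialised state would go here. */
	return status;
}
```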


2011-06-07 17:34:03

by Jim Rees

Subject: [PATCH 67/88] SQUASHME: pnfs-block: apply types rename

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 32 ++++++++++++++++----------------
fs/nfs/blocklayout/blocklayout.h | 13 ++++++-------
fs/nfs/blocklayout/blocklayoutdev.c | 16 ++++++++--------
fs/nfs/blocklayout/extents.c | 8 ++++----
4 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e3cd75f..f49c68c 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -515,7 +515,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
/* FIXME - range ignored */
static void
release_extents(struct pnfs_block_layout *bl,
- struct nfs4_pnfs_layout_segment *range)
+ struct pnfs_layout_range *range)
{
int i;
struct pnfs_block_extent *be;
@@ -547,7 +547,7 @@ release_inval_marks(struct pnfs_inval_markings *marks)

/* Note we are relying on caller locking to prevent nasty races. */
static void
-bl_free_layout(struct pnfs_layout_type *lo)
+bl_free_layout(struct pnfs_layout_hdr *lo)
{
struct pnfs_block_layout *bl = BLK_LO2EXT(lo);

@@ -557,7 +557,7 @@ bl_free_layout(struct pnfs_layout_type *lo)
kfree(bl);
}

-static struct pnfs_layout_type *
+static struct pnfs_layout_hdr *
bl_alloc_layout(struct inode *inode)
{
struct pnfs_block_layout *bl;
@@ -590,8 +590,8 @@ bl_free_lseg(struct pnfs_layout_segment *lseg)
* cause lots of unnecessary overlapping LAYOUTGET requests.
*/
static struct pnfs_layout_segment *
-bl_alloc_lseg(struct pnfs_layout_type *lo,
- struct nfs4_pnfs_layoutget_res *lgr)
+bl_alloc_lseg(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutget_res *lgr)
{
struct pnfs_layout_segment *lseg;
int status;
@@ -617,8 +617,8 @@ bl_alloc_lseg(struct pnfs_layout_type *lo,
}

static int
-bl_setup_layoutcommit(struct pnfs_layout_type *lo,
- struct pnfs_layoutcommit_arg *arg)
+bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutcommit_args *arg)
{
struct nfs_server *nfss = PNFS_NFS_SERVER(lo);
struct bl_layoutupdate_data *layoutupdate_data;
@@ -627,11 +627,11 @@ bl_setup_layoutcommit(struct pnfs_layout_type *lo,
/* Need to ensure commit is block-size aligned */
if (nfss->pnfs_blksize) {
u64 mask = nfss->pnfs_blksize - 1;
- u64 offset = arg->lseg.offset & mask;
+ u64 offset = arg->range.offset & mask;

- arg->lseg.offset -= offset;
- arg->lseg.length += offset + mask;
- arg->lseg.length &= ~mask;
+ arg->range.offset -= offset;
+ arg->range.length += offset + mask;
+ arg->range.length &= ~mask;
}

layoutupdate_data = kmalloc(sizeof(struct bl_layoutupdate_data),
@@ -645,16 +645,16 @@ bl_setup_layoutcommit(struct pnfs_layout_type *lo,
}

static void
-bl_encode_layoutcommit(struct pnfs_layout_type *lo, struct xdr_stream *xdr,
- const struct pnfs_layoutcommit_arg *arg)
+bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
+ const struct nfs4_layoutcommit_args *arg)
{
dprintk("%s enter\n", __func__);
encode_pnfs_block_layoutupdate(BLK_LO2EXT(lo), xdr, arg);
}

static void
-bl_cleanup_layoutcommit(struct pnfs_layout_type *lo,
- struct pnfs_layoutcommit_arg *arg, int status)
+bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutcommit_args *arg, int status)
{
dprintk("%s enter\n", __func__);
clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), arg, status);
@@ -1087,7 +1087,7 @@ bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
}

static ssize_t
-bl_get_stripesize(struct pnfs_layout_type *lo)
+bl_get_stripesize(struct pnfs_layout_hdr *lo)
{
dprintk("%s enter\n", __func__);
return 0;
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 8931944..ab61a19 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -33,7 +33,6 @@
#define FS_NFS_NFS4BLOCKLAYOUT_H

#include <linux/nfs_fs.h>
-#include <linux/pnfs_xdr.h> /* Needed by nfs4_pnfs.h */
#include <linux/nfs4_pnfs.h>
#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/

@@ -176,7 +175,7 @@ static inline int choose_list(enum exstate4 state)
}

struct pnfs_block_layout {
- struct pnfs_layout_type bl_layout;
+ struct pnfs_layout_hdr bl_layout;
struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
spinlock_t bl_ext_lock; /* Protects list manipulation */
struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
@@ -195,7 +194,7 @@ struct bl_layoutupdate_data {
#define BLK_ID(lo) ((struct block_mount_id *)(PNFS_NFS_SERVER(lo)->pnfs_ld_data))

static inline struct pnfs_block_layout *
-BLK_LO2EXT(struct pnfs_layout_type *lo)
+BLK_LO2EXT(struct pnfs_layout_hdr *lo)
{
return container_of(lo, struct pnfs_block_layout, bl_layout);
}
@@ -257,8 +256,8 @@ int nfs4_blkdev_put(struct block_device *bdev);
struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
struct pnfs_device *dev,
struct list_head *sdlist);
-int nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
- struct nfs4_pnfs_layoutget_res *lgr);
+int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutget_res *lgr);
int nfs4_blk_create_block_disk_list(struct list_head *);
void nfs4_blk_destroy_disk_list(struct list_head *);
/* blocklayoutdm.c */
@@ -277,9 +276,9 @@ struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
- const struct pnfs_layoutcommit_arg *arg);
+ const struct nfs4_layoutcommit_args *arg);
void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
- const struct pnfs_layoutcommit_arg *arg,
+ const struct nfs4_layoutcommit_args *arg,
int status);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 98ec92b3..e9ea86a 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -152,7 +152,7 @@ out_err:
}

/* Map deviceid returned by the server to constructed block_device */
-static struct block_device *translate_devid(struct pnfs_layout_type *lo,
+static struct block_device *translate_devid(struct pnfs_layout_hdr *lo,
struct pnfs_deviceid *id)
{
struct block_device *rv = NULL;
@@ -231,8 +231,8 @@ static int verify_extent(struct pnfs_block_extent *be,

/* XDR decode pnfs_block_layout4 structure */
int
-nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
- struct nfs4_pnfs_layoutget_res *lgr)
+nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutget_res *lgr)
{
struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
uint32_t *p = (uint32_t *)lgr->layout.buf;
@@ -242,10 +242,10 @@ nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
struct pnfs_block_extent *be = NULL, *save;
uint64_t tmp; /* Used by READSECTOR */
struct layout_verification lv = {
- .mode = lgr->lseg.iomode,
- .start = lgr->lseg.offset >> 9,
- .inval = lgr->lseg.offset >> 9,
- .cowread = lgr->lseg.offset >> 9,
+ .mode = lgr->range.iomode,
+ .start = lgr->range.offset >> 9,
+ .inval = lgr->range.offset >> 9,
+ .cowread = lgr->range.offset >> 9,
};

LIST_HEAD(extents);
@@ -290,7 +290,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
be = NULL;
goto out_err;
}
- if (lgr->lseg.offset + lgr->lseg.length != lv.start << 9) {
+ if (lgr->range.offset + lgr->range.length != lv.start << 9) {
dprintk("%s Final length mismatch\n", __func__);
be = NULL;
goto out_err;
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 6c26cd4..20cc863 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -738,7 +738,7 @@ find_get_extent_locked(struct pnfs_block_layout *bl, sector_t isect)
int
encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
- const struct pnfs_layoutcommit_arg *arg)
+ const struct nfs4_layoutcommit_args *arg)
{
sector_t start, end;
struct pnfs_block_short_extent *lce, *save;
@@ -748,8 +748,8 @@ encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
__be32 *p, *xdr_start;

dprintk("%s enter\n", __func__);
- start = arg->lseg.offset >> 9;
- end = start + (arg->lseg.length >> 9);
+ start = arg->range.offset >> 9;
+ end = start + (arg->range.length >> 9);
dprintk("%s set start=%llu, end=%llu\n",
__func__, (u64)start, (u64)end);

@@ -922,7 +922,7 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)

void
clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
- const struct pnfs_layoutcommit_arg *arg,
+ const struct nfs4_layoutcommit_args *arg,
int status)
{
struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
--
1.7.4.1
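
The block-size alignment in bl_setup_layoutcommit (the arg->range.offset / arg->range.length hunk above) can be tried out in isolation. The sketch below repeats the same mask arithmetic in userspace; struct byte_range and align_range_to_block() are hypothetical names, and the arithmetic assumes the block size is a nonzero power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the offset/length pair in the patch. */
struct byte_range {
	uint64_t offset;
	uint64_t length;
};

/*
 * Round the start of the range down and its end up to block boundaries.
 * blksize must be a nonzero power of two for the mask to be valid.
 */
static inline void align_range_to_block(struct byte_range *r, uint64_t blksize)
{
	uint64_t mask = blksize - 1;
	uint64_t head = r->offset & mask;  /* bytes between block start and offset */

	r->offset -= head;                 /* move start down to the block boundary */
	r->length += head + mask;          /* cover the head, then round up ... */
	r->length &= ~mask;                /* ... to a whole number of blocks */
}
```

For example, with a 4096-byte block size the range (offset 5000, length 100) becomes (offset 4096, length 4096), and (offset 4000, length 200) becomes (offset 0, length 8192) because it straddles a block boundary.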


2011-06-07 17:36:03

by Jim Rees

Subject: [PATCH 86/88] SQUASHME: pnfs: blocklayout: port block layout code

From: Peng Tao <[email protected]>

Make minimal changes to let block layout driver work in current framework.

Signed-off-by: Tang Haiying <[email protected]>
Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
Signed-off-by: Jim Rees <[email protected]>
---
drivers/md/dm-ioctl.c | 24 --------
drivers/scsi/hosts.c | 3 +-
fs/nfs/blocklayout/blocklayout.c | 105 ++++++++++------------------------
fs/nfs/blocklayout/blocklayout.h | 9 +--
fs/nfs/blocklayout/blocklayoutdev.c | 34 ++++++++----
fs/nfs/blocklayout/extents.c | 14 +----
fs/nfs/nfs4proc.c | 1 -
fs/nfs/nfs4xdr.c | 3 +-
fs/nfs/pnfs.c | 8 ++-
fs/nfs/pnfs.h | 1 +
include/linux/nfs_fs_sb.h | 1 +
11 files changed, 69 insertions(+), 134 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index d0d417e..4cacdad 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -713,12 +713,6 @@ static int dev_create(struct dm_ioctl *param, size_t param_size)
return 0;
}

-int dm_dev_create(struct dm_ioctl *param)
-{
- return dev_create(param, sizeof(*param));
-}
-EXPORT_SYMBOL(dm_dev_create);
-
/*
* Always use UUID for lookups if it's present, otherwise use name or dev.
*/
@@ -814,12 +808,6 @@ static int dev_remove(struct dm_ioctl *param, size_t param_size)
return 0;
}

-int dm_dev_remove(struct dm_ioctl *param)
-{
- return dev_remove(param, sizeof(*param));
-}
-EXPORT_SYMBOL(dm_dev_remove);
-
/*
* Check a string doesn't overrun the chunk of
* memory we copied from userland.
@@ -1002,12 +990,6 @@ static int do_resume(struct dm_ioctl *param)
return r;
}

-int dm_do_resume(struct dm_ioctl *param)
-{
- return do_resume(param);
-}
-EXPORT_SYMBOL(dm_do_resume);
-
/*
* Set or unset the suspension state of a device.
* If the device already is in the requested state we just return its status.
@@ -1274,12 +1256,6 @@ out:
return r;
}

-int dm_table_load(struct dm_ioctl *param, size_t param_size)
-{
- return table_load(param, param_size);
-}
-EXPORT_SYMBOL(dm_table_load);
-
static int table_clear(struct dm_ioctl *param, size_t param_size)
{
struct hash_cell *hc;
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 7d91903..4f7a582 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -50,11 +50,10 @@ static void scsi_host_cls_release(struct device *dev)
put_device(&class_to_shost(dev)->shost_gendev);
}

-struct class shost_class = {
+static struct class shost_class = {
.name = "scsi_host",
.dev_release = scsi_host_cls_release,
};
-EXPORT_SYMBOL(shost_class);

/**
* scsi_host_set_state - Take the given host through the host state model.
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 2583b87..d842ec8 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -97,14 +97,6 @@ dont_like_caller(struct nfs_page *req)
}
}

-static enum pnfs_try_status
-bl_commit(struct nfs_write_data *nfs_data,
- int sync)
-{
- dprintk("%s enter\n", __func__);
- return PNFS_NOT_ATTEMPTED;
-}
-
/* The data we are handed might be spread across several bios. We need
* to track when the last one is finished.
*/
@@ -198,7 +190,7 @@ static void bl_read_cleanup(struct work_struct *work)
dprintk("%s enter\n", __func__);
task = container_of(work, struct rpc_task, u.tk_work);
rdata = container_of(task, struct nfs_read_data, task);
- pnfs_read_done(rdata);
+ pnfs_ld_read_done(rdata);
}

static void
@@ -219,8 +211,7 @@ static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
}

static enum pnfs_try_status
-bl_read_pagelist(struct nfs_read_data *rdata,
- unsigned nr_pages)
+bl_read_pagelist(struct nfs_read_data *rdata)
{
int i, hole;
struct bio *bio = NULL;
@@ -233,13 +224,13 @@ bl_read_pagelist(struct nfs_read_data *rdata,
int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;

dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
- nr_pages, f_offset, count);
+ rdata->npages, f_offset, count);

if (dont_like_caller(rdata->req)) {
dprintk("%s dont_like_caller failed\n", __func__);
goto use_mds;
}
- if ((nr_pages == 1) && PagePnfsErr(rdata->req->wb_page)) {
+ if ((rdata->npages == 1) && PagePnfsErr(rdata->req->wb_page)) {
/* We want to fall back to mds in case of read_page
* after error on read_pages.
*/
@@ -249,21 +240,21 @@ bl_read_pagelist(struct nfs_read_data *rdata,
par = alloc_parallel(rdata);
if (!par)
goto use_mds;
- par->call_ops = *rdata->pdata.call_ops;
+ par->call_ops = *rdata->mds_ops;
par->call_ops.rpc_call_done = bl_rpc_do_nothing;
par->pnfs_callback = bl_end_par_io_read;
/* At this point, we can no longer jump to use_mds */

isect = (sector_t) (f_offset >> 9);
/* Code assumes extents are page-aligned */
- for (i = pg_index; i < nr_pages; i++) {
+ for (i = pg_index; i < rdata->npages; i++) {
if (!extent_length) {
/* We've used up the previous extent */
put_extent(be);
put_extent(cow_read);
bio = bl_submit_bio(READ, bio);
/* Get the next one */
- be = find_get_extent(BLK_LSEG2EXT(rdata->pdata.lseg),
+ be = find_get_extent(BLK_LSEG2EXT(rdata->lseg),
isect, &cow_read);
if (!be) {
/* Error out this page */
@@ -293,7 +284,7 @@ bl_read_pagelist(struct nfs_read_data *rdata,
be_read = (hole && cow_read) ? cow_read : be;
for (;;) {
if (!bio) {
- bio = bio_alloc(GFP_NOIO, nr_pages - i);
+ bio = bio_alloc(GFP_NOIO, rdata->npages - i);
if (!bio) {
/* Error out this page */
bl_done_with_rpage(pages[i], 0);
@@ -407,10 +398,10 @@ static void bl_write_cleanup(struct work_struct *work)
/* BUG - this should be called after each bio, not after
* all finish, unless have some way of storing success/failure
*/
- mark_extents_written(BLK_LSEG2EXT(wdata->pdata.lseg),
+ mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
wdata->args.offset, wdata->args.count);
}
- pnfs_writeback_done(wdata);
+ pnfs_ld_write_done(wdata);
}

/* Called when last of bios associated with a bl_write_pagelist call finishes */
@@ -428,7 +419,6 @@ bl_end_par_io_write(void *data)

static enum pnfs_try_status
bl_write_pagelist(struct nfs_write_data *wdata,
- unsigned nr_pages,
int sync)
{
int i;
@@ -442,7 +432,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;

dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
- if (!wdata->req->wb_lseg) {
+ if (!wdata->lseg) {
dprintk("%s no lseg, falling back to MDS\n", __func__);
return PNFS_NOT_ATTEMPTED;
}
@@ -460,19 +450,19 @@ bl_write_pagelist(struct nfs_write_data *wdata,
par = alloc_parallel(wdata);
if (!par)
return PNFS_NOT_ATTEMPTED;
- par->call_ops = *wdata->pdata.call_ops;
+ par->call_ops = *wdata->mds_ops;
par->call_ops.rpc_call_done = bl_rpc_do_nothing;
par->pnfs_callback = bl_end_par_io_write;
/* At this point, have to be more careful with error handling */

isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> 9);
- for (i = pg_index; i < nr_pages; i++) {
+ for (i = pg_index; i < wdata->npages ; i++) {
if (!extent_length) {
/* We've used up the previous extent */
put_extent(be);
bio = bl_submit_bio(WRITE, bio);
/* Get the next one */
- be = find_get_extent(BLK_LSEG2EXT(wdata->pdata.lseg),
+ be = find_get_extent(BLK_LSEG2EXT(wdata->lseg),
isect, NULL);
if (!be || !is_writable(be, isect)) {
/* FIXME */
@@ -484,7 +474,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
}
for (;;) {
if (!bio) {
- bio = bio_alloc(GFP_NOIO, nr_pages - i);
+ bio = bio_alloc(GFP_NOIO, wdata->npages - i);
if (!bio) {
/* Error out this page */
/* FIXME */
@@ -504,7 +494,12 @@ bl_write_pagelist(struct nfs_write_data *wdata,
isect += PAGE_CACHE_SIZE >> 9;
extent_length -= PAGE_CACHE_SIZE >> 9;
}
- wdata->res.count = (isect << 9) - (offset & (long)PAGE_CACHE_MASK);
+ wdata->res.count = (isect << 9) - (offset);
+ if (count < wdata->res.count) {
+ wdata->res.count = count;
+ }
+ /* pnfs_set_layoutcommit needs this */
+ wdata->mds_offset = offset;
put_extent(be);
bl_submit_bio(WRITE, bio);
put_parallel(par);
@@ -557,18 +552,19 @@ bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
}

static struct pnfs_layout_hdr *
-bl_alloc_layout_hdr(struct inode *inode)
+bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
{
struct pnfs_block_layout *bl;

dprintk("%s enter\n", __func__);
- bl = kzalloc(sizeof(*bl), GFP_KERNEL);
+ bl = kzalloc(sizeof(*bl), gfp_flags);
if (!bl)
return NULL;
spin_lock_init(&bl->bl_ext_lock);
INIT_LIST_HEAD(&bl->bl_extents[0]);
INIT_LIST_HEAD(&bl->bl_extents[1]);
INIT_LIST_HEAD(&bl->bl_commit);
+ INIT_LIST_HEAD(&bl->bl_committing);
bl->bl_count = 0;
bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
@@ -590,16 +586,16 @@ bl_free_lseg(struct pnfs_layout_segment *lseg)
*/
static struct pnfs_layout_segment *
bl_alloc_lseg(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutget_res *lgr)
+ struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
{
struct pnfs_layout_segment *lseg;
int status;

dprintk("%s enter\n", __func__);
- lseg = kzalloc(sizeof(*lseg) + 0, GFP_KERNEL);
+ lseg = kzalloc(sizeof(*lseg) + 0, gfp_flags);
if (!lseg)
return NULL;
- status = nfs4_blk_process_layoutget(lo, lgr);
+ status = nfs4_blk_process_layoutget(lo, lgr, gfp_flags);
if (status) {
/* We don't want to call the full-blown bl_free_lseg,
* since on error extents were not touched.
@@ -615,34 +611,6 @@ bl_alloc_lseg(struct pnfs_layout_hdr *lo,
return lseg;
}

-static int
-bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutcommit_args *arg)
-{
- struct nfs_server *nfss = NFS_SERVER(lo->plh_inode);
- struct bl_layoutupdate_data *layoutupdate_data;
-
- dprintk("%s enter\n", __func__);
- /* Need to ensure commit is block-size aligned */
- if (nfss->pnfs_blksize) {
- u64 mask = nfss->pnfs_blksize - 1;
- u64 offset = arg->range.offset & mask;
-
- arg->range.offset -= offset;
- arg->range.length += offset + mask;
- arg->range.length &= ~mask;
- }
-
- layoutupdate_data = kmalloc(sizeof(struct bl_layoutupdate_data),
- GFP_KERNEL);
- if (unlikely(!layoutupdate_data))
- return -ENOMEM;
- INIT_LIST_HEAD(&layoutupdate_data->ranges);
- arg->layoutdriver_data = layoutupdate_data;
-
- return 0;
-}
-
static void
bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
const struct nfs4_layoutcommit_args *arg)
@@ -657,7 +625,6 @@ bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
{
dprintk("%s enter\n", __func__);
clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args, lcdata->res.status);
- kfree(lcdata->args.layoutdriver_data);
}

static void free_blk_mountid(struct block_mount_id *mid)
@@ -1085,25 +1052,16 @@ bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
fsdata->private = NULL;
}

-/* This is called by nfs_can_coalesce_requests via nfs_pageio_do_add_request.
- * Should return False if there is a reason requests can not be coalesced,
- * otherwise, should default to returning True.
- */
-static int
+static bool
bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
- struct nfs_page *req)
+ struct nfs_page *req)
{
- dprintk("%s enter\n", __func__);
- if (pgio->pg_iswrite)
- return prev->wb_lseg == req->wb_lseg;
- else
- return 1;
+ return pnfs_generic_pg_test(pgio, prev, req);
}

static struct pnfs_layoutdriver_type blocklayout_type = {
.id = LAYOUT_BLOCK_VOLUME,
.name = "LAYOUT_BLOCK_VOLUME",
- .commit = bl_commit,
.read_pagelist = bl_read_pagelist,
.write_pagelist = bl_write_pagelist,
.write_begin = bl_write_begin,
@@ -1113,12 +1071,11 @@ static struct pnfs_layoutdriver_type blocklayout_type = {
.free_layout_hdr = bl_free_layout_hdr,
.alloc_lseg = bl_alloc_lseg,
.free_lseg = bl_free_lseg,
- .setup_layoutcommit = bl_setup_layoutcommit,
.encode_layoutcommit = bl_encode_layoutcommit,
.cleanup_layoutcommit = bl_cleanup_layoutcommit,
.set_layoutdriver = bl_set_layoutdriver,
.clear_layoutdriver = bl_clear_layoutdriver,
- .pg_test = bl_pg_test,
+ .pg_test = bl_pg_test,
};

static int __init nfs4blocklayout_init(void)
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index a8198ae..dd596d4 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -33,7 +33,6 @@
#define FS_NFS_NFS4BLOCKLAYOUT_H

#include <linux/nfs_fs.h>
-#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/
#include "../pnfs.h"

#define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)
@@ -43,11 +42,6 @@
#define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
#define ClearPagePnfsErr(page) clear_bit(PG_pnfserr, &(page)->flags)

-extern int dm_dev_create(struct dm_ioctl *param); /* from dm-ioctl.c */
-extern int dm_dev_remove(struct dm_ioctl *param); /* from dm-ioctl.c */
-extern int dm_do_resume(struct dm_ioctl *param);
-extern int dm_table_load(struct dm_ioctl *param, size_t param_size);
-
struct block_mount_id {
spinlock_t bm_lock; /* protects list */
struct list_head bm_devlist; /* holds pnfs_block_dev */
@@ -180,6 +174,7 @@ struct pnfs_block_layout {
spinlock_t bl_ext_lock; /* Protects list manipulation */
struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
struct list_head bl_commit; /* Needs layout commit */
+ struct list_head bl_committing; /* Layout committing */
unsigned int bl_count; /* entries in bl_commit */
sector_t bl_blocksize; /* Server blocksize in sectors */
};
@@ -257,7 +252,7 @@ struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
struct pnfs_device *dev,
struct list_head *sdlist);
int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutget_res *lgr);
+ struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
int nfs4_blk_create_block_disk_list(struct list_head *);
void nfs4_blk_destroy_disk_list(struct list_head *);
/* blocklayoutdm.c */
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 23469e3..a90eb6b 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -231,14 +231,16 @@ static int verify_extent(struct pnfs_block_extent *be,
/* XDR decode pnfs_block_layout4 structure */
int
nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutget_res *lgr)
+ struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
{
struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
- uint32_t *p = (uint32_t *)lgr->layout.buf;
- uint32_t *end = (uint32_t *)((char *)lgr->layout.buf + lgr->layout.len);
int i, status = -EIO;
uint32_t count;
struct pnfs_block_extent *be = NULL, *save;
+ struct xdr_stream stream;
+ struct xdr_buf buf;
+ struct page *scratch;
+ __be32 *p;
uint64_t tmp; /* Used by READSECTOR */
struct layout_verification lv = {
.mode = lgr->range.iomode,
@@ -246,14 +248,27 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
.inval = lgr->range.offset >> 9,
.cowread = lgr->range.offset >> 9,
};
-
LIST_HEAD(extents);

- BLK_READBUF(p, end, 4);
+ dprintk("---> %s\n", __func__);
+
+ scratch = alloc_page(gfp_flags);
+ if (!scratch)
+ return -ENOMEM;
+
+ xdr_init_decode_pages(&stream, &buf, lgr->layoutp->pages, lgr->layoutp->len);
+ xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
+
+ p = xdr_inline_decode(&stream, 4);
+ if (unlikely(!p))
+ goto out_err;
+
READ32(count);

dprintk("%s enter, number of extents %i\n", __func__, count);
- BLK_READBUF(p, end, (28 + NFS4_DEVICEID4_SIZE) * count);
+ p = xdr_inline_decode(&stream, (28 + NFS4_DEVICEID4_SIZE) * count);
+ if (unlikely(!p))
+ goto out_err;

/* Decode individual extents, putting them in temporary
* staging area until whole layout is decoded to make error
@@ -269,6 +284,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
be->be_mdev = translate_devid(lo, &be->be_devid);
if (!be->be_mdev)
goto out_err;
+
/* The next three values are read in as bytes,
* but stored as 512-byte sector lengths
*/
@@ -284,11 +300,6 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
}
list_add_tail(&be->be_node, &extents);
}
- if (p != end) {
- dprintk("%s Undecoded cruft at end of opaque\n", __func__);
- be = NULL;
- goto out_err;
- }
if (lgr->range.offset + lgr->range.length != lv.start << 9) {
dprintk("%s Final length mismatch\n", __func__);
be = NULL;
@@ -319,6 +330,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
spin_unlock(&bl->bl_ext_lock);
status = 0;
out:
+ __free_page(scratch);
dprintk("%s returns %i\n", __func__, status);
return status;

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 40dff82..08413ec 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -232,7 +232,7 @@ _range_has_tag(struct my_tree_t *tree, u64 start, u64 end, int32_t tag)
if ((pos->it_sector == end - tree->mtt_step_size) &&
(pos->it_tags & (1 << tag))) {
expect = pos->it_sector - tree->mtt_step_size;
- if (expect < start)
+ if (pos->it_sector < tree->mtt_step_size || expect < start)
return 1;
continue;
} else {
@@ -740,19 +740,12 @@ encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
const struct nfs4_layoutcommit_args *arg)
{
- sector_t start, end;
struct pnfs_block_short_extent *lce, *save;
unsigned int count = 0;
- struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
- struct list_head *ranges = &bld->ranges;
+ struct list_head *ranges = &bl->bl_committing;
__be32 *p, *xdr_start;

dprintk("%s enter\n", __func__);
- start = arg->range.offset >> 9;
- end = start + (arg->range.length >> 9);
- dprintk("%s set start=%llu, end=%llu\n",
- __func__, (u64)start, (u64)end);
-
/* BUG - creation of bl_commit is buggy - need to wait for
* entire block to be marked WRITTEN before it can be added.
*/
@@ -925,11 +918,10 @@ clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
const struct nfs4_layoutcommit_args *arg,
int status)
{
- struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
struct pnfs_block_short_extent *lce, *save;

dprintk("%s status %d\n", __func__, status);
- list_for_each_entry_safe_reverse(lce, save, &bld->ranges, bse_node) {
+ list_for_each_entry_safe_reverse(lce, save, &bl->bl_committing, bse_node) {
if (likely(!status)) {
u64 offset = lce->bse_f_offset;
u64 end = offset + lce->bse_length;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index a693283..987260c 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5788,7 +5788,6 @@ static int _nfs4_getdevicelist(struct nfs_server *server,

dprintk("--> %s\n", __func__);
status = nfs4_call_sync(server->client, server, &msg, &args.seq_args, &res.seq_res, 0);
- put_rpccred(msg.rpc_cred);
dprintk("<-- %s status=%d\n", __func__, status);
return status;
}
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index e059dc8..73f18f4 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
/* Only whole file layouts */
p = xdr_encode_hyper(p, 0); /* offset */
- p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
+ p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */
*p++ = cpu_to_be32(0); /* reclaim */
p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
*p++ = cpu_to_be32(1); /* newoffset = TRUE */
@@ -5467,7 +5467,6 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
int status;

status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
- res->status = status;
if (status)
return status;

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index c88a8ee..9920bff 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -898,8 +898,6 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
ret = get_lseg(lseg);
break;
}
- if (cmp_layout(range, &lseg->pls_range) > 0)
- break;
}

dprintk("%s:Return lseg %p ref %d\n",
@@ -1252,6 +1250,7 @@ static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
}
}
rv->pls_end_pos = max_pos;
+ dprintk("%s: lseg %p end_pos %llu\n", __func__, rv, rv->pls_end_pos);

return rv;
}
@@ -1261,6 +1260,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
{
struct nfs_inode *nfsi = NFS_I(wdata->inode);
loff_t end_pos = wdata->mds_offset + wdata->res.count;
+ loff_t isize = i_size_read(wdata->inode);
bool mark_as_dirty = false;

spin_lock(&nfsi->vfs_inode.i_lock);
@@ -1274,9 +1274,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
dprintk("%s: Set layoutcommit for inode %lu ",
__func__, wdata->inode->i_ino);
}
+ if (end_pos > isize)
+ end_pos = isize;
if (end_pos > wdata->lseg->pls_end_pos)
wdata->lseg->pls_end_pos = end_pos;
spin_unlock(&nfsi->vfs_inode.i_lock);
+ dprintk("%s: lseg %p end_pos %llu\n",
+ __func__, wdata->lseg, wdata->lseg->pls_end_pos);

/* if pnfs_layoutcommit_inode() runs between inode locks, the next one
* will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index b50cf3a..28d57c9 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -156,6 +156,7 @@ struct pnfs_device {
unsigned int layout_type;
unsigned int mincount;
struct page **pages;
+ void *area;
unsigned int pgbase;
unsigned int pglen;
};
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 3d93ada..79cc4ca 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -143,6 +143,7 @@ struct nfs_server {
filesystem */
struct pnfs_layoutdriver_type *pnfs_curr_ld; /* Active layout driver */
struct rpc_wait_queue roc_rpcwaitq;
+ void *pnfs_ld_data; /* per mount point data */
u32 pnfs_blksize; /* layout_blksize attr */

/* the following fields are protected by nfs_client->cl_lock */
--
1.7.4.1
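
The extra test added to _range_has_tag in extents.c guards against unsigned wraparound. A small sketch of why it matters, assuming the sector values are unsigned 64-bit (as the u64 parameters of _range_has_tag suggest); the function names here are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Guarded form: true when stepping back one step from `sector` would land
 * before `start`, including the case where the subtraction wraps. */
static inline bool step_back_passes_start(uint64_t sector, uint64_t step,
					  uint64_t start)
{
	uint64_t expect = sector - step; /* wraps modulo 2^64 if sector < step */

	return sector < step || expect < start;
}

/* Unguarded form, for comparison: a wrapped result looks like a huge
 * sector number, so the comparison against start gives the wrong answer. */
static inline bool step_back_passes_start_unguarded(uint64_t sector,
						    uint64_t step,
						    uint64_t start)
{
	return sector - step < start;
}
```

With sector 0, step 8 and start 0, the guarded form reports that the start has been passed while the unguarded form does not.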


2011-06-07 17:34:12

by Jim Rees

Subject: [PATCH 69/88] SQUASHME: pnfsblock: remove obsolete include file from blocklayout.h

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index ab61a19..fc0fb23 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -33,8 +33,8 @@
#define FS_NFS_NFS4BLOCKLAYOUT_H

#include <linux/nfs_fs.h>
-#include <linux/nfs4_pnfs.h>
#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/
+#include <../pnfs.h>

#define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)

--
1.7.4.1


2011-06-08 02:18:54

by Jim Rees

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

Benny Halevy wrote:

NAK.
This affects all layout types. In particular it is undesired
for write layouts that extend the file with the objects layout.
The server can extend the layout segments range
over what the client requested so why would the client
ask for artificially large layouts?

This has actually been the subject of some debate over Thursday night
beers. The problem we're trying to solve is that the client is spending 98%
of its time in layoutget. This patch gives us something like a 10x
speedup. But many of us think it's not the right fix. I suggest we discuss
next week.

But note that this patch doesn't change anything unless you set the sysctl.

2011-06-08 07:38:58

by Peng Tao

Subject: Re: [PATCH 86/88] SQUASHME: pnfs: blocklayout: port block layout code

Yes, you are right. It should be pnfs_generic_pg_test. Thanks for catching this!
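
For readers following along, the change agreed on here amounts to pointing the driver's pg_test hook straight at the generic helper instead of going through a one-line wrapper. The following is only a user-space sketch under stated assumptions: every sketch_* name is hypothetical, the types are opaque stand-ins for the kernel's, and the helper body is a placeholder that does not reproduce what pnfs_generic_pg_test actually does.

```c
#include <stdbool.h>

/* Opaque stand-ins for the kernel types; the sketch never looks inside. */
struct sketch_pageio_descriptor;
struct sketch_page;

/*
 * Placeholder for pnfs_generic_pg_test(). The body is an assumption for
 * illustration only and does NOT mirror the real function's logic.
 */
bool sketch_generic_pg_test(struct sketch_pageio_descriptor *pgio,
			    struct sketch_page *prev,
			    struct sketch_page *req)
{
	(void)pgio;
	(void)prev;
	(void)req;
	return true;
}

/* Hypothetical, cut-down version of a layout driver ops table. */
struct sketch_layoutdriver_type {
	const char *name;
	bool (*pg_test)(struct sketch_pageio_descriptor *pgio,
			struct sketch_page *prev,
			struct sketch_page *req);
};

/* No bl_pg_test wrapper: the hook is the generic helper itself. */
struct sketch_layoutdriver_type sketch_blocklayout_type = {
	.name = "LAYOUT_BLOCK_VOLUME",
	.pg_test = sketch_generic_pg_test,
};
```

With the hook assigned this way there is one function fewer to maintain, and any later change to the generic helper reaches the driver without a follow-up patch.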


On 6/8/11, Benny Halevy <[email protected]> wrote:
> On 2011-06-07 13:35, Jim Rees wrote:
>> From: Peng Tao <[email protected]>
>>
>> Make minimal changes to let block layout driver work in current framework.
>>
>> Signed-off-by: Tang Haiying <[email protected]>
>> Signed-off-by: Zhang Jingwang <[email protected]>
>> Signed-off-by: Peng Tao <[email protected]>
>> Signed-off-by: Jim Rees <[email protected]>
>> ---
>> drivers/md/dm-ioctl.c | 24 --------
>> drivers/scsi/hosts.c | 3 +-
>> fs/nfs/blocklayout/blocklayout.c | 105
>> ++++++++++------------------------
>> fs/nfs/blocklayout/blocklayout.h | 9 +--
>> fs/nfs/blocklayout/blocklayoutdev.c | 34 ++++++++----
>> fs/nfs/blocklayout/extents.c | 14 +----
>> fs/nfs/nfs4proc.c | 1 -
>> fs/nfs/nfs4xdr.c | 3 +-
>> fs/nfs/pnfs.c | 8 ++-
>> fs/nfs/pnfs.h | 1 +
>> include/linux/nfs_fs_sb.h | 1 +
>> 11 files changed, 69 insertions(+), 134 deletions(-)
>>
>> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
>> index d0d417e..4cacdad 100644
>> --- a/drivers/md/dm-ioctl.c
>> +++ b/drivers/md/dm-ioctl.c
>> @@ -713,12 +713,6 @@ static int dev_create(struct dm_ioctl *param, size_t
>> param_size)
>> return 0;
>> }
>>
>> -int dm_dev_create(struct dm_ioctl *param)
>> -{
>> - return dev_create(param, sizeof(*param));
>> -}
>> -EXPORT_SYMBOL(dm_dev_create);
>> -
>> /*
>> * Always use UUID for lookups if it's present, otherwise use name or
>> dev.
>> */
>> @@ -814,12 +808,6 @@ static int dev_remove(struct dm_ioctl *param, size_t
>> param_size)
>> return 0;
>> }
>>
>> -int dm_dev_remove(struct dm_ioctl *param)
>> -{
>> - return dev_remove(param, sizeof(*param));
>> -}
>> -EXPORT_SYMBOL(dm_dev_remove);
>> -
>> /*
>> * Check a string doesn't overrun the chunk of
>> * memory we copied from userland.
>> @@ -1002,12 +990,6 @@ static int do_resume(struct dm_ioctl *param)
>> return r;
>> }
>>
>> -int dm_do_resume(struct dm_ioctl *param)
>> -{
>> - return do_resume(param);
>> -}
>> -EXPORT_SYMBOL(dm_do_resume);
>> -
>> /*
>> * Set or unset the suspension state of a device.
>> * If the device already is in the requested state we just return its
>> status.
>> @@ -1274,12 +1256,6 @@ out:
>> return r;
>> }
>>
>> -int dm_table_load(struct dm_ioctl *param, size_t param_size)
>> -{
>> - return table_load(param, param_size);
>> -}
>> -EXPORT_SYMBOL(dm_table_load);
>> -
>> static int table_clear(struct dm_ioctl *param, size_t param_size)
>> {
>> struct hash_cell *hc;
>> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
>> index 7d91903..4f7a582 100644
>> --- a/drivers/scsi/hosts.c
>> +++ b/drivers/scsi/hosts.c
>> @@ -50,11 +50,10 @@ static void scsi_host_cls_release(struct device *dev)
>> put_device(&class_to_shost(dev)->shost_gendev);
>> }
>>
>> -struct class shost_class = {
>> +static struct class shost_class = {
>> .name = "scsi_host",
>> .dev_release = scsi_host_cls_release,
>> };
>> -EXPORT_SYMBOL(shost_class);
>>
>> /**
>> * scsi_host_set_state - Take the given host through the host state
>> model.
>> diff --git a/fs/nfs/blocklayout/blocklayout.c
>> b/fs/nfs/blocklayout/blocklayout.c
>> index 2583b87..d842ec8 100644
>> --- a/fs/nfs/blocklayout/blocklayout.c
>> +++ b/fs/nfs/blocklayout/blocklayout.c
>> @@ -97,14 +97,6 @@ dont_like_caller(struct nfs_page *req)
>> }
>> }
>>
>> -static enum pnfs_try_status
>> -bl_commit(struct nfs_write_data *nfs_data,
>> - int sync)
>> -{
>> - dprintk("%s enter\n", __func__);
>> - return PNFS_NOT_ATTEMPTED;
>> -}
>> -
>> /* The data we are handed might be spread across several bios. We need
>> * to track when the last one is finished.
>> */
>> @@ -198,7 +190,7 @@ static void bl_read_cleanup(struct work_struct *work)
>> dprintk("%s enter\n", __func__);
>> task = container_of(work, struct rpc_task, u.tk_work);
>> rdata = container_of(task, struct nfs_read_data, task);
>> - pnfs_read_done(rdata);
>> + pnfs_ld_read_done(rdata);
>> }
>>
>> static void
>> @@ -219,8 +211,7 @@ static void bl_rpc_do_nothing(struct rpc_task *task,
>> void *calldata)
>> }
>>
>> static enum pnfs_try_status
>> -bl_read_pagelist(struct nfs_read_data *rdata,
>> - unsigned nr_pages)
>> +bl_read_pagelist(struct nfs_read_data *rdata)
>> {
>> int i, hole;
>> struct bio *bio = NULL;
>> @@ -233,13 +224,13 @@ bl_read_pagelist(struct nfs_read_data *rdata,
>> int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
>>
>> dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
>> - nr_pages, f_offset, count);
>> + rdata->npages, f_offset, count);
>>
>> if (dont_like_caller(rdata->req)) {
>> dprintk("%s dont_like_caller failed\n", __func__);
>> goto use_mds;
>> }
>> - if ((nr_pages == 1) && PagePnfsErr(rdata->req->wb_page)) {
>> + if ((rdata->npages == 1) && PagePnfsErr(rdata->req->wb_page)) {
>> /* We want to fall back to mds in case of read_page
>> * after error on read_pages.
>> */
>> @@ -249,21 +240,21 @@ bl_read_pagelist(struct nfs_read_data *rdata,
>> par = alloc_parallel(rdata);
>> if (!par)
>> goto use_mds;
>> - par->call_ops = *rdata->pdata.call_ops;
>> + par->call_ops = *rdata->mds_ops;
>> par->call_ops.rpc_call_done = bl_rpc_do_nothing;
>> par->pnfs_callback = bl_end_par_io_read;
>> /* At this point, we can no longer jump to use_mds */
>>
>> isect = (sector_t) (f_offset >> 9);
>> /* Code assumes extents are page-aligned */
>> - for (i = pg_index; i < nr_pages; i++) {
>> + for (i = pg_index; i < rdata->npages; i++) {
>> if (!extent_length) {
>> /* We've used up the previous extent */
>> put_extent(be);
>> put_extent(cow_read);
>> bio = bl_submit_bio(READ, bio);
>> /* Get the next one */
>> - be = find_get_extent(BLK_LSEG2EXT(rdata->pdata.lseg),
>> + be = find_get_extent(BLK_LSEG2EXT(rdata->lseg),
>> isect, &cow_read);
>> if (!be) {
>> /* Error out this page */
>> @@ -293,7 +284,7 @@ bl_read_pagelist(struct nfs_read_data *rdata,
>> be_read = (hole && cow_read) ? cow_read : be;
>> for (;;) {
>> if (!bio) {
>> - bio = bio_alloc(GFP_NOIO, nr_pages - i);
>> + bio = bio_alloc(GFP_NOIO, rdata->npages - i);
>> if (!bio) {
>> /* Error out this page */
>> bl_done_with_rpage(pages[i], 0);
>> @@ -407,10 +398,10 @@ static void bl_write_cleanup(struct work_struct
>> *work)
>> /* BUG - this should be called after each bio, not after
>> * all finish, unless have some way of storing success/failure
>> */
>> - mark_extents_written(BLK_LSEG2EXT(wdata->pdata.lseg),
>> + mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
>> wdata->args.offset, wdata->args.count);
>> }
>> - pnfs_writeback_done(wdata);
>> + pnfs_ld_write_done(wdata);
>> }
>>
>> /* Called when last of bios associated with a bl_write_pagelist call
>> finishes */
>> @@ -428,7 +419,6 @@ bl_end_par_io_write(void *data)
>>
>> static enum pnfs_try_status
>> bl_write_pagelist(struct nfs_write_data *wdata,
>> - unsigned nr_pages,
>> int sync)
>> {
>> int i;
>> @@ -442,7 +432,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
>> int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
>>
>> dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
>> - if (!wdata->req->wb_lseg) {
>> + if (!wdata->lseg) {
>> dprintk("%s no lseg, falling back to MDS\n", __func__);
>> return PNFS_NOT_ATTEMPTED;
>> }
>> @@ -460,19 +450,19 @@ bl_write_pagelist(struct nfs_write_data *wdata,
>> par = alloc_parallel(wdata);
>> if (!par)
>> return PNFS_NOT_ATTEMPTED;
>> - par->call_ops = *wdata->pdata.call_ops;
>> + par->call_ops = *wdata->mds_ops;
>> par->call_ops.rpc_call_done = bl_rpc_do_nothing;
>> par->pnfs_callback = bl_end_par_io_write;
>> /* At this point, have to be more careful with error handling */
>>
>> isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> 9);
>> - for (i = pg_index; i < nr_pages; i++) {
>> + for (i = pg_index; i < wdata->npages ; i++) {
>> if (!extent_length) {
>> /* We've used up the previous extent */
>> put_extent(be);
>> bio = bl_submit_bio(WRITE, bio);
>> /* Get the next one */
>> - be = find_get_extent(BLK_LSEG2EXT(wdata->pdata.lseg),
>> + be = find_get_extent(BLK_LSEG2EXT(wdata->lseg),
>> isect, NULL);
>> if (!be || !is_writable(be, isect)) {
>> /* FIXME */
>> @@ -484,7 +474,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
>> }
>> for (;;) {
>> if (!bio) {
>> - bio = bio_alloc(GFP_NOIO, nr_pages - i);
>> + bio = bio_alloc(GFP_NOIO, wdata->npages - i);
>> if (!bio) {
>> /* Error out this page */
>> /* FIXME */
>> @@ -504,7 +494,12 @@ bl_write_pagelist(struct nfs_write_data *wdata,
>> isect += PAGE_CACHE_SIZE >> 9;
>> extent_length -= PAGE_CACHE_SIZE >> 9;
>> }
>> - wdata->res.count = (isect << 9) - (offset & (long)PAGE_CACHE_MASK);
>> + wdata->res.count = (isect << 9) - (offset);
>> + if (count < wdata->res.count) {
>> + wdata->res.count = count;
>> + }
>> + /* pnfs_set_layoutcommit needs this */
>> + wdata->mds_offset = offset;
>> put_extent(be);
>> bl_submit_bio(WRITE, bio);
>> put_parallel(par);
>> @@ -557,18 +552,19 @@ bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
>> }
>>
>> static struct pnfs_layout_hdr *
>> -bl_alloc_layout_hdr(struct inode *inode)
>> +bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
>> {
>> struct pnfs_block_layout *bl;
>>
>> dprintk("%s enter\n", __func__);
>> - bl = kzalloc(sizeof(*bl), GFP_KERNEL);
>> + bl = kzalloc(sizeof(*bl), gfp_flags);
>> if (!bl)
>> return NULL;
>> spin_lock_init(&bl->bl_ext_lock);
>> INIT_LIST_HEAD(&bl->bl_extents[0]);
>> INIT_LIST_HEAD(&bl->bl_extents[1]);
>> INIT_LIST_HEAD(&bl->bl_commit);
>> + INIT_LIST_HEAD(&bl->bl_committing);
>> bl->bl_count = 0;
>> bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
>> INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
>> @@ -590,16 +586,16 @@ bl_free_lseg(struct pnfs_layout_segment *lseg)
>> */
>> static struct pnfs_layout_segment *
>> bl_alloc_lseg(struct pnfs_layout_hdr *lo,
>> - struct nfs4_layoutget_res *lgr)
>> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
>> {
>> struct pnfs_layout_segment *lseg;
>> int status;
>>
>> dprintk("%s enter\n", __func__);
>> - lseg = kzalloc(sizeof(*lseg) + 0, GFP_KERNEL);
>> + lseg = kzalloc(sizeof(*lseg) + 0, gfp_flags);
>> if (!lseg)
>> return NULL;
>> - status = nfs4_blk_process_layoutget(lo, lgr);
>> + status = nfs4_blk_process_layoutget(lo, lgr, gfp_flags);
>> if (status) {
>> /* We don't want to call the full-blown bl_free_lseg,
>> * since on error extents were not touched.
>> @@ -615,34 +611,6 @@ bl_alloc_lseg(struct pnfs_layout_hdr *lo,
>> return lseg;
>> }
>>
>> -static int
>> -bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
>> - struct nfs4_layoutcommit_args *arg)
>> -{
>> - struct nfs_server *nfss = NFS_SERVER(lo->plh_inode);
>> - struct bl_layoutupdate_data *layoutupdate_data;
>> -
>> - dprintk("%s enter\n", __func__);
>> - /* Need to ensure commit is block-size aligned */
>> - if (nfss->pnfs_blksize) {
>> - u64 mask = nfss->pnfs_blksize - 1;
>> - u64 offset = arg->range.offset & mask;
>> -
>> - arg->range.offset -= offset;
>> - arg->range.length += offset + mask;
>> - arg->range.length &= ~mask;
>> - }
>> -
>> - layoutupdate_data = kmalloc(sizeof(struct bl_layoutupdate_data),
>> - GFP_KERNEL);
>> - if (unlikely(!layoutupdate_data))
>> - return -ENOMEM;
>> - INIT_LIST_HEAD(&layoutupdate_data->ranges);
>> - arg->layoutdriver_data = layoutupdate_data;
>> -
>> - return 0;
>> -}
>> -
>> static void
>> bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream
>> *xdr,
>> const struct nfs4_layoutcommit_args *arg)
>> @@ -657,7 +625,6 @@ bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
>> {
>> dprintk("%s enter\n", __func__);
>> clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args,
>> lcdata->res.status);
>> - kfree(lcdata->args.layoutdriver_data);
>> }
>>
>> static void free_blk_mountid(struct block_mount_id *mid)
>> @@ -1085,25 +1052,16 @@ bl_write_end_cleanup(struct file *filp, struct
>> pnfs_fsdata *fsdata)
>> fsdata->private = NULL;
>> }
>>
>> -/* This is called by nfs_can_coalesce_requests via
>> nfs_pageio_do_add_request.
>> - * Should return False if there is a reason requests can not be
>> coalesced,
>> - * otherwise, should default to returning True.
>> - */
>> -static int
>> +static bool
>> bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
>> - struct nfs_page *req)
>> + struct nfs_page *req)
>> {
>> - dprintk("%s enter\n", __func__);
>> - if (pgio->pg_iswrite)
>> - return prev->wb_lseg == req->wb_lseg;
>> - else
>> - return 1;
>> + return pnfs_generic_pg_test(pgio, prev, req);
>> }
>>
>> static struct pnfs_layoutdriver_type blocklayout_type = {
>> .id = LAYOUT_BLOCK_VOLUME,
>> .name = "LAYOUT_BLOCK_VOLUME",
>> - .commit = bl_commit,
>> .read_pagelist = bl_read_pagelist,
>> .write_pagelist = bl_write_pagelist,
>> .write_begin = bl_write_begin,
>> @@ -1113,12 +1071,11 @@ static struct pnfs_layoutdriver_type
>> blocklayout_type = {
>> .free_layout_hdr = bl_free_layout_hdr,
>> .alloc_lseg = bl_alloc_lseg,
>> .free_lseg = bl_free_lseg,
>> - .setup_layoutcommit = bl_setup_layoutcommit,
>> .encode_layoutcommit = bl_encode_layoutcommit,
>> .cleanup_layoutcommit = bl_cleanup_layoutcommit,
>> .set_layoutdriver = bl_set_layoutdriver,
>> .clear_layoutdriver = bl_clear_layoutdriver,
>> - .pg_test = bl_pg_test,
>> + .pg_test = bl_pg_test,
>
> Why not just set pg_test to pnfs_generic_pg_test?
>
> Benny
>
>> };
>>
>> static int __init nfs4blocklayout_init(void)
>> diff --git a/fs/nfs/blocklayout/blocklayout.h
>> b/fs/nfs/blocklayout/blocklayout.h
>> index a8198ae..dd596d4 100644
>> --- a/fs/nfs/blocklayout/blocklayout.h
>> +++ b/fs/nfs/blocklayout/blocklayout.h
>> @@ -33,7 +33,6 @@
>> #define FS_NFS_NFS4BLOCKLAYOUT_H
>>
>> #include <linux/nfs_fs.h>
>> -#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/
>> #include "../pnfs.h"
>>
>> #define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)
>> @@ -43,11 +42,6 @@
>> #define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
>> #define ClearPagePnfsErr(page) clear_bit(PG_pnfserr, &(page)->flags)
>>
>> -extern int dm_dev_create(struct dm_ioctl *param); /* from dm-ioctl.c */
>> -extern int dm_dev_remove(struct dm_ioctl *param); /* from dm-ioctl.c */
>> -extern int dm_do_resume(struct dm_ioctl *param);
>> -extern int dm_table_load(struct dm_ioctl *param, size_t param_size);
>> -
>> struct block_mount_id {
>> spinlock_t bm_lock; /* protects list */
>> struct list_head bm_devlist; /* holds pnfs_block_dev */
>> @@ -180,6 +174,7 @@ struct pnfs_block_layout {
>> spinlock_t bl_ext_lock; /* Protects list manipulation */
>> struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
>> struct list_head bl_commit; /* Needs layout commit */
>> + struct list_head bl_committing; /* Layout committing */
>> unsigned int bl_count; /* entries in bl_commit */
>> sector_t bl_blocksize; /* Server blocksize in sectors */
>> };
>> @@ -257,7 +252,7 @@ struct pnfs_block_dev *nfs4_blk_decode_device(struct
>> nfs_server *server,
>> struct pnfs_device *dev,
>> struct list_head *sdlist);
>> int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
>> - struct nfs4_layoutget_res *lgr);
>> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
>> int nfs4_blk_create_block_disk_list(struct list_head *);
>> void nfs4_blk_destroy_disk_list(struct list_head *);
>> /* blocklayoutdm.c */
>> diff --git a/fs/nfs/blocklayout/blocklayoutdev.c
>> b/fs/nfs/blocklayout/blocklayoutdev.c
>> index 23469e3..a90eb6b 100644
>> --- a/fs/nfs/blocklayout/blocklayoutdev.c
>> +++ b/fs/nfs/blocklayout/blocklayoutdev.c
>> @@ -231,14 +231,16 @@ static int verify_extent(struct pnfs_block_extent
>> *be,
>> /* XDR decode pnfs_block_layout4 structure */
>> int
>> nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
>> - struct nfs4_layoutget_res *lgr)
>> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
>> {
>> struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
>> - uint32_t *p = (uint32_t *)lgr->layout.buf;
>> - uint32_t *end = (uint32_t *)((char *)lgr->layout.buf + lgr->layout.len);
>> int i, status = -EIO;
>> uint32_t count;
>> struct pnfs_block_extent *be = NULL, *save;
>> + struct xdr_stream stream;
>> + struct xdr_buf buf;
>> + struct page *scratch;
>> + __be32 *p;
>> uint64_t tmp; /* Used by READSECTOR */
>> struct layout_verification lv = {
>> .mode = lgr->range.iomode,
>> @@ -246,14 +248,27 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr
>> *lo,
>> .inval = lgr->range.offset >> 9,
>> .cowread = lgr->range.offset >> 9,
>> };
>> -
>> LIST_HEAD(extents);
>>
>> - BLK_READBUF(p, end, 4);
>> + dprintk("---> %s\n", __func__);
>> +
>> + scratch = alloc_page(gfp_flags);
>> + if (!scratch)
>> + return -ENOMEM;
>> +
>> + xdr_init_decode_pages(&stream, &buf, lgr->layoutp->pages,
>> lgr->layoutp->len);
>> + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
>> +
>> + p = xdr_inline_decode(&stream, 4);
>> + if (unlikely(!p))
>> + goto out_err;
>> +
>> READ32(count);
>>
>> dprintk("%s enter, number of extents %i\n", __func__, count);
>> - BLK_READBUF(p, end, (28 + NFS4_DEVICEID4_SIZE) * count);
>> + p = xdr_inline_decode(&stream, (28 + NFS4_DEVICEID4_SIZE) * count);
>> + if (unlikely(!p))
>> + goto out_err;
>>
>> /* Decode individual extents, putting them in temporary
>> * staging area until whole layout is decoded to make error
>> @@ -269,6 +284,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
>> be->be_mdev = translate_devid(lo, &be->be_devid);
>> if (!be->be_mdev)
>> goto out_err;
>> +
>> /* The next three values are read in as bytes,
>> * but stored as 512-byte sector lengths
>> */
>> @@ -284,11 +300,6 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr
>> *lo,
>> }
>> list_add_tail(&be->be_node, &extents);
>> }
>> - if (p != end) {
>> - dprintk("%s Undecoded cruft at end of opaque\n", __func__);
>> - be = NULL;
>> - goto out_err;
>> - }
>> if (lgr->range.offset + lgr->range.length != lv.start << 9) {
>> dprintk("%s Final length mismatch\n", __func__);
>> be = NULL;
>> @@ -319,6 +330,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
>> spin_unlock(&bl->bl_ext_lock);
>> status = 0;
>> out:
>> + __free_page(scratch);
>> dprintk("%s returns %i\n", __func__, status);
>> return status;
>>
>> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
>> index 40dff82..08413ec 100644
>> --- a/fs/nfs/blocklayout/extents.c
>> +++ b/fs/nfs/blocklayout/extents.c
>> @@ -232,7 +232,7 @@ _range_has_tag(struct my_tree_t *tree, u64 start, u64
>> end, int32_t tag)
>> if ((pos->it_sector == end - tree->mtt_step_size) &&
>> (pos->it_tags & (1 << tag))) {
>> expect = pos->it_sector - tree->mtt_step_size;
>> - if (expect < start)
>> + if (pos->it_sector < tree->mtt_step_size || expect < start)
>> return 1;
>> continue;
>> } else {
>> @@ -740,19 +740,12 @@ encode_pnfs_block_layoutupdate(struct
>> pnfs_block_layout *bl,
>> struct xdr_stream *xdr,
>> const struct nfs4_layoutcommit_args *arg)
>> {
>> - sector_t start, end;
>> struct pnfs_block_short_extent *lce, *save;
>> unsigned int count = 0;
>> - struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
>> - struct list_head *ranges = &bld->ranges;
>> + struct list_head *ranges = &bl->bl_committing;
>> __be32 *p, *xdr_start;
>>
>> dprintk("%s enter\n", __func__);
>> - start = arg->range.offset >> 9;
>> - end = start + (arg->range.length >> 9);
>> - dprintk("%s set start=%llu, end=%llu\n",
>> - __func__, (u64)start, (u64)end);
>> -
>> /* BUG - creation of bl_commit is buggy - need to wait for
>> * entire block to be marked WRITTEN before it can be added.
>> */
>> @@ -925,11 +918,10 @@ clean_pnfs_block_layoutupdate(struct
>> pnfs_block_layout *bl,
>> const struct nfs4_layoutcommit_args *arg,
>> int status)
>> {
>> - struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
>> struct pnfs_block_short_extent *lce, *save;
>>
>> dprintk("%s status %d\n", __func__, status);
>> - list_for_each_entry_safe_reverse(lce, save, &bld->ranges, bse_node) {
>> + list_for_each_entry_safe_reverse(lce, save, &bl->bl_committing,
>> bse_node) {
>> if (likely(!status)) {
>> u64 offset = lce->bse_f_offset;
>> u64 end = offset + lce->bse_length;
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index a693283..987260c 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -5788,7 +5788,6 @@ static int _nfs4_getdevicelist(struct nfs_server
>> *server,
>>
>> dprintk("--> %s\n", __func__);
>> status = nfs4_call_sync(server->client, server, &msg, &args.seq_args,
>> &res.seq_res, 0);
>> - put_rpccred(msg.rpc_cred);
>> dprintk("<-- %s status=%d\n", __func__, status);
>> return status;
>> }
>> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
>> index e059dc8..73f18f4 100644
>> --- a/fs/nfs/nfs4xdr.c
>> +++ b/fs/nfs/nfs4xdr.c
>> @@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
>> *p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
>> /* Only whole file layouts */
>> p = xdr_encode_hyper(p, 0); /* offset */
>> - p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
>> + p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */
>> *p++ = cpu_to_be32(0); /* reclaim */
>> p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
>> *p++ = cpu_to_be32(1); /* newoffset = TRUE */
>> @@ -5467,7 +5467,6 @@ static int decode_layoutcommit(struct xdr_stream
>> *xdr,
>> int status;
>>
>> status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
>> - res->status = status;
>> if (status)
>> return status;
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index c88a8ee..9920bff 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -898,8 +898,6 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
>> ret = get_lseg(lseg);
>> break;
>> }
>> - if (cmp_layout(range, &lseg->pls_range) > 0)
>> - break;
>> }
>>
>> dprintk("%s:Return lseg %p ref %d\n",
>> @@ -1252,6 +1250,7 @@ static struct pnfs_layout_segment
>> *pnfs_list_write_lseg(struct inode *inode)
>> }
>> }
>> rv->pls_end_pos = max_pos;
>> + dprintk("%s: lseg %p end_pos %llu\n", __func__, rv, rv->pls_end_pos);
>>
>> return rv;
>> }
>> @@ -1261,6 +1260,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>> {
>> struct nfs_inode *nfsi = NFS_I(wdata->inode);
>> loff_t end_pos = wdata->mds_offset + wdata->res.count;
>> + loff_t isize = i_size_read(wdata->inode);
>> bool mark_as_dirty = false;
>>
>> spin_lock(&nfsi->vfs_inode.i_lock);
>> @@ -1274,9 +1274,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
>> dprintk("%s: Set layoutcommit for inode %lu ",
>> __func__, wdata->inode->i_ino);
>> }
>> + if (end_pos > isize)
>> + end_pos = isize;
>> if (end_pos > wdata->lseg->pls_end_pos)
>> wdata->lseg->pls_end_pos = end_pos;
>> spin_unlock(&nfsi->vfs_inode.i_lock);
>> + dprintk("%s: lseg %p end_pos %llu\n",
>> + __func__, wdata->lseg, wdata->lseg->pls_end_pos);
>>
>> /* if pnfs_layoutcommit_inode() runs between inode locks, the next one
>> * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
>> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> index b50cf3a..28d57c9 100644
>> --- a/fs/nfs/pnfs.h
>> +++ b/fs/nfs/pnfs.h
>> @@ -156,6 +156,7 @@ struct pnfs_device {
>> unsigned int layout_type;
>> unsigned int mincount;
>> struct page **pages;
>> + void *area;
>> unsigned int pgbase;
>> unsigned int pglen;
>> };
>> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
>> index 3d93ada..79cc4ca 100644
>> --- a/include/linux/nfs_fs_sb.h
>> +++ b/include/linux/nfs_fs_sb.h
>> @@ -143,6 +143,7 @@ struct nfs_server {
>> filesystem */
>> struct pnfs_layoutdriver_type *pnfs_curr_ld; /* Active layout driver */
>> struct rpc_wait_queue roc_rpcwaitq;
>> + void *pnfs_ld_data; /* per mount point data */
>> u32 pnfs_blksize; /* layout_blksize attr */
>>
>> /* the following fields are protected by nfs_client->cl_lock */


--
Thanks,
-Bergwolf

2011-06-09 21:52:55

by Boaz Harrosh

Subject: Re: [PATCH 00/88] pnfs block layout driver

On 06/07/2011 10:24 AM, [email protected] wrote:
> This patch set adds a block layout driver to the pnfs client.
>
> Benny Halevy (25):
> pnfs: add set-clear layoutdriver interface
> pnfs: xdr support for three word attribute bitmap
> pnfsblock: select BLK_DEV_DM when PNFS_BLOCK is configured
> SQUASHME: pnfs-block: convert APIs pnfs-post-submit
> SQUASHME: pnfsblock: get rid of threshold policy ops
> SQUASHME: pnfs-block: nfs4_blk_add_block_disk ret must be signed
> SQUASHME: pnfs-block: use new alloc/free_layout API
> SQUASHME: pnfs-block: use new commit api
> SQUASHME: pnfs-block: use new read_pagelist api
> SQUASHME: pnfs-block: use new write_pagelist api
> SQUASHME: pnfs-block: apply types rename
> SQUASHME: pnfs-block: Revert "pnfsblock: expose block_class
> interface"
> SQUASHME: pnfsblock: remove obsolete include file from blocklayout.h
> SQUASHME: pnfsblock: use nfs4_deviceid
> SQUASHME: pnfsblock: no callback ops
> SQAUSHME: pnfsblock: no PNFS_NFS_SERVER
> SQUASHME: pnfsblock: no dev_notify_types
> SQUASHME: pnfsblock: use new struct pnfs_layout_hdr
> SQUASHME: pnfs-block: deprecate get_stripesize
> SQUASHME: pnfs-block: use {set,clear}_layoutdriver
> SQUASHME: pnfs-block: fixup setup_layoutcommit arguments
> SQUASHME: pnfs-block: fixup cleanup_layoutcommit arguments
> SQUASHME: pnfs-block: fixup encode_layoutcommit arguments
> SQUASHME: pnfs-block: fixup layoutcommit methods args
> SQUASHME: pnfs-block: use pnfs_layout_hdr field prefix
>
> Boaz Harrosh (1):
> SQUASHME: pnfs-block: remove of CONFIG_PNFS fallout
>
> Fred (1):
> pnfsblock: find_get_extent
>
> Fred Isaman (39):
> pnfs_post_submit: Restore "pnfs: pnfs_do_flush" part 1
> pnfs_post_submit: Restore the pnfs_write_end part of "pnfs: commit
> and pnfs_write_end"
> pnfs: HACK: ask for layout_blksize on mount
> pnfs: HACK: modify write_end_cleanup
> HACK: propagate fsdata into nfs_writepage_setup
> pnfs: HACK: adjust eof handling
> pnfsblock: define PNFS_BLOCK Kconfig option
> pnfsblock: blocklayout stub
> pnfsblock: expose scsi interface
> pnfsblock: scan scsi devices
> pnfsblock: call and parse getdevicelist
> pnfsblock: dm kernel interface
> pnfsblock: create and destroy dm metadevice
> pnfsblock: construct and load md table
> pnfsblock: layout alloc and free
> pnfsblock: basic extent code
> pnfsblock: lseg alloc and free
> pnfsblock: xdr decode pnfs_block_layout4
> pnfsblock: merge extents
> pnfsblock: bl_read_pagelist
> pnfsblock: allow use of PG_owner_priv_1 flag
> pnfsblock: read path error handling
> pnfsblock: SPLITME: add extent manipulation functions
> pnfsblock: write_begin
> pnfsblock: write_end
> pnfsblock: write_end_cleanup
> pnfsblock: bl_write_pagelist support functions
> pnfsblock: bl_write_pagelist
> pnfsblock: note written INVAL areas for layoutcommit
> pnfsblock: bl_setup_layoutcommit
> pnfsblock: encode_layoutcommit
> pnfsblock: cleanup_layoutcommit
> pnfsblock: merge rw extents
> pnfsblock: debugging dprintks for clist info
> SQUASHME: pnfsblock: write_begin adjust for removed fields
> SQUASHME: pnfsblock: write_end adjust for removed ok_to_use_pnfs
> SQUASHME: pnfsblock: write_end_cleanup adjust for removed
> ok_to_use_pnfs
> SQUASHME: pnfsblock: bl_write_pagelist support functions adjust for
> missing PG_USE_PNFS
> SQUASHME: pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS
>
> J. Bruce Fields (1):
> SQUASHME: pnfs-block: fix compile breakage
>
> Jim Rees (5):
> pnfs-block: Add support for simple rpc pipefs
> pnfs-block: Remove device creation from kernel
> move include lines out of include file
> SQUASHME: pnfs-block: Return failure from bl_initialize_mountpoint
> pnfs-block: fix blocklayoutdev.c for new blkdev_get_by_dev()
>
> Mike Sager (1):
> pnfsblock: use the session max response size for getdeviceinfo's
> maxcount
>
> Peng Tao (4):
> pnfs: let layoutcommit code handle multiple segments
> SQUASHME: pnfs: blocklayout: port block layout code
> Add configurable prefetch size for layoutget
> NFS41: do not update isize if inode needs layoutcommit
>
> Steve Dickson (1):
> SQUASHME: pnfsblock: compile error in blocklayout code
>
> Tao Guo (3):
> SQUASHME: pnfsblock: fix bug when decoding block device info.
> pnfsblock: expose block_class interface
> pnfsblock: iterating all local block disks instead of only scsi disks
> when initializing mount point.
>
> Zhang Jingwang (7):
> SQAUSHME: blocklayoutdriver: NULL pointer reference when committing
> too many extents
> SQUASHME: pnfsblock: Fix a memory leak
> SQUASHME: pnfsblock: Wrong extent refcount in block extents list
> SQUASHME: pnfsblock: Implement release_inval_marks
> SQUASHME: pnfsblock: Fix missing extent in commit list
> pnfsblock: Lookup list entry of layouts and tags in reverse order
> SQUASHME: pnfsblock: set pnfs_blksize before calling
> set_pnfs_layoutdriver
>
> fs/nfs/Kconfig | 10 +
> fs/nfs/Makefile | 1 +
> fs/nfs/blocklayout/Makefile | 6 +
> fs/nfs/blocklayout/block-device-discovery-pipe.c | 66 ++
> fs/nfs/blocklayout/blocklayout.c | 1103 ++++++++++++++++++++++
> fs/nfs/blocklayout/blocklayout.h | 297 ++++++
> fs/nfs/blocklayout/blocklayoutdev.c | 346 +++++++
> fs/nfs/blocklayout/blocklayoutdm.c | 120 +++
> fs/nfs/blocklayout/extents.c | 940 ++++++++++++++++++
> fs/nfs/client.c | 8 +-
> fs/nfs/file.c | 26 +-
> fs/nfs/inode.c | 3 +-
> fs/nfs/nfs4_fs.h | 2 +-
> fs/nfs/nfs4proc.c | 6 +-
> fs/nfs/nfs4xdr.c | 104 ++-
> fs/nfs/pnfs.c | 96 ++-
> fs/nfs/pnfs.h | 126 +++-
> fs/nfs/sysctl.c | 10 +
> fs/nfs/write.c | 12 +-
> include/linux/nfs_fs.h | 3 +-
> include/linux/nfs_fs_sb.h | 4 +-
> include/linux/nfs_xdr.h | 3 +-
> include/linux/sunrpc/rpc_pipe_fs.h | 4 +
> include/linux/sunrpc/simple_rpc_pipefs.h | 105 ++
> net/sunrpc/Makefile | 2 +-
> net/sunrpc/simple_rpc_pipefs.c | 423 +++++++++
> 26 files changed, 3778 insertions(+), 48 deletions(-)
> create mode 100644 fs/nfs/blocklayout/Makefile
> create mode 100644 fs/nfs/blocklayout/block-device-discovery-pipe.c
> create mode 100644 fs/nfs/blocklayout/blocklayout.c
> create mode 100644 fs/nfs/blocklayout/blocklayout.h
> create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c
> create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c
> create mode 100644 fs/nfs/blocklayout/extents.c
> create mode 100644 include/linux/sunrpc/simple_rpc_pipefs.h
> create mode 100644 net/sunrpc/simple_rpc_pipefs.c
>

Who is going to squash all the SQUASHMEs and rethink the whole patch
separation, into something that makes a more logical progression
and is easier to review? The way it is now I'm not able to review it,
sorry. I got lost trying to understand which is which.

Thanks
Boaz


2011-06-10 14:17:29

by Peng, Tao

Subject: RE: [PATCH 87/88] Add configurable prefetch size for layoutget


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
Sent: Friday, June 10, 2011 8:36 PM
To: Peng, Tao
Cc: [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 01:36, [email protected] wrote:
> Hi, Benny,
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
> Sent: Friday, June 10, 2011 5:23 AM
> To: Jim Rees
> Cc: Peng Tao; [email protected]; peter honeyman
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-09 06:58, Jim Rees wrote:
>> Benny Halevy wrote:
>>
>> > My understanding is that layoutget specifies a min and max, and the server
>>
>> There's a min. What do you consider the max?
>> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>
>> The spec doesn't say max, it says "desired." I guess I assumed the server
>> wouldn't normally return more than desired.
>
> No, the server may freely upgrade the returned layout segment by returning
> a layout for a larger byte range or even returning a RW layout where a READ
> layout was asked for.
> [PT] It is true that the server can upgrade the layout segment freely. But there is always a price to pay. The server has to deal with all kinds of clients.
> If the server returns more than was asked for, it may hurt other clients.

And if all clients ask for more than they need and the server just
gives it to them, what do you get out of that?
[PT] We cannot avoid this even if the client has an automatic layout prefetch algorithm implemented... think about all clients doing sequential IO...


-Tao

2011-06-09 15:07:57

by Peng Tao

[permalink] [raw]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

Hi, Jim and Benny,

On Thu, Jun 9, 2011 at 9:58 PM, Jim Rees <[email protected]> wrote:
> Benny Halevy wrote:
>
>  > My understanding is that layoutget specifies a min and max, and the server
>
>  There's a min.  What do you consider the max?
>  Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>
> The spec doesn't say max, it says "desired."  I guess I assumed the server
> wouldn't normally return more than desired.
In fact server is returning "desired" length. The problem is that we
call pnfs_update_layout in nfs_write_begin, and it will end up setting
both minlength and length to page size. There is no space for client
to collapse layoutget range in nfs_write_begin.

>
> 18.43.3.  DESCRIPTION
> ...
>
>   The LAYOUTGET operation returns layout information for the specified
>   byte-range: a layout.  The client actually specifies two ranges, both
>   starting at the offset in the loga_offset field.  The first range is
>   between loga_offset and loga_offset + loga_length - 1 inclusive.
>   This range indicates the desired range the client wants the layout to
>   cover.  The second range is between loga_offset and loga_offset +
>   loga_minlength - 1 inclusive.  This range indicates the required
>   range the client needs the layout to cover.  Thus, loga_minlength
>   MUST be less than or equal to loga_length.
>



--
Thanks,
-Bergwolf
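
For readers following the thread, the two ranges in the quoted spec text can be written out as a small userspace sketch. This is not code from the patch set: the struct and function names are hypothetical, the special "to end of file" length value is ignored, and offset + length is assumed not to overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three LAYOUTGET arguments quoted above. */
struct layoutget_range {
	uint64_t loga_offset;
	uint64_t loga_length;    /* desired length */
	uint64_t loga_minlength; /* required length */
};

/* The spec text requires loga_minlength <= loga_length. */
static bool layoutget_range_valid(const struct layoutget_range *r)
{
	return r->loga_minlength <= r->loga_length;
}

/* Last byte (inclusive) of the desired range; length must be non-zero. */
static uint64_t layoutget_desired_end(const struct layoutget_range *r)
{
	assert(r->loga_length > 0);
	return r->loga_offset + r->loga_length - 1;
}

/* Last byte (inclusive) of the required range; minlength must be non-zero. */
static uint64_t layoutget_required_end(const struct layoutget_range *r)
{
	assert(r->loga_minlength > 0);
	return r->loga_offset + r->loga_minlength - 1;
}
```

With both loga_length and loga_minlength set to the page size, as described above for nfs_write_begin, the two ranges coincide and the server is only asked for a single page.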

2011-06-11 02:19:58

by Peng Tao

[permalink] [raw]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On Sat, Jun 11, 2011 at 7:20 AM, Boaz Harrosh <[email protected]> wrote:
> On 06/10/2011 12:23 PM, Benny Halevy wrote:
>> On 2011-06-10 10:09, [email protected] wrote:
>>
>> A simple algorithm I can suggest is:
>> - on initialization, calculate and save, per layout driver
>>   - maximum layout size
>>     - take into account csr_fore_chan_attrs.ca_maxresponsesize and possible other parameters
>>   - keep a working copy of the maximum value and the calculated copy.
>>   - alignment value.
>> - on miss, see if there's an adjacent layout segment in cache
>> - if found, ask for twice the found segment size, up to the maximum value,
>>   aligned on the alignment value.
>> - if the server returns less the layoutget range, keep note of the returned length
>>   (but not adjust maximum yet, as the server may return a short segment for various
>>    reasons)
>> - if the server is consistent about returning less than was asked, adjust the
>>   - working copy of the maximum length
>> - if the maximum was adjusted try bumping it up after X (TBD) layoutgets or T seconds
>>   to see if that was just due to high load or conflicts on the server
>> - on any error returned for LAYOUTGET reset the algorithm parameters
>> - on session reestablishment recalculate maximums.
>>
>> Benny
>>
>
> I completely disagree with all this. NACK!
>
> The only proper thing a client can do is ask for what it needs, and only the application
> can do that, because at the VFS level it is only second guessing, and is completely
> pointless.
>
> The only one that can know about structure, alignments, optimal IO sizes and layouts
> is the server. The server even have more information to second guess the application
> from the file size information and it's share and lock disposition. Please see my
> simple Server side algorithm.
Well, IMO, the client is closer to the applications and should be in a
better position to "guess" the application's workload.

A simple example: when a client asks for a layout, the server has no
idea whether the client is doing layout prefetch or really needs that
range to complete its work. But the client knows it for sure.

>
> Because you must understand one most important thing. Any smart decision a client can
> make will be after it received the layout (stripe_unit, number-of-devices etc..) But
> at that time it is too late it already sent the layout_get. Only the server knows
> before hand what is the most optimal size. The client should just be a transparent
> pipe from application to the server. It should never ever set policy. Only a Server
> can/should do that.
>
> Lets put the efforts and algorithms where they belong, please?
>
> Boaz
>



--
Thanks,
-Bergwolf
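
To make the quoted proposal from Benny easier to discuss, here is a rough userspace sketch of the sizing steps it lists. It is only one reading of the proposal, not code from the patch set: the names are hypothetical, "consistent" is assumed to mean three short replies in a row, and the timed "bump the maximum back up" step is left out.

```c
#include <assert.h>
#include <stdint.h>

#define SHORT_REPLY_THRESHOLD 3	/* assumption: 3 short replies in a row */

struct prefetch_state {
	uint64_t calc_max;	/* calculated maximum layout size */
	uint64_t work_max;	/* working copy, may be lowered */
	uint64_t align;		/* alignment value, must be > 0 */
	unsigned int short_replies;
};

static uint64_t align_down(uint64_t len, uint64_t align)
{
	assert(align > 0);
	return len - (len % align);
}

/* On a LAYOUTGET error or session reestablishment: start over. */
static void prefetch_reset(struct prefetch_state *ps)
{
	ps->work_max = ps->calc_max;
	ps->short_replies = 0;
}

/* On a miss with an adjacent cached segment: ask for twice its size,
 * capped at the working maximum and aligned down, but at least one
 * alignment unit. */
static uint64_t prefetch_next_length(const struct prefetch_state *ps,
				     uint64_t adjacent_len)
{
	uint64_t want = adjacent_len * 2;

	if (want > ps->work_max)
		want = ps->work_max;
	want = align_down(want, ps->align);
	return want ? want : ps->align;
}

/* Record what the server returned; lower the working maximum only
 * after the server has been consistently short. */
static void prefetch_note_reply(struct prefetch_state *ps,
				uint64_t asked, uint64_t got)
{
	if (got >= asked) {
		ps->short_replies = 0;
		return;
	}
	if (++ps->short_replies >= SHORT_REPLY_THRESHOLD) {
		uint64_t lowered = align_down(got, ps->align);

		ps->work_max = lowered ? lowered : ps->align;
		ps->short_replies = 0;
	}
}
```

Note that nothing here uses stripe unit or device count, which is Boaz's objection: those are only known after the first layout has been returned.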

2011-06-07 17:36:16

by Jim Rees

[permalink] [raw]
Subject: [PATCH 88/88] NFS41: do not update isize if inode needs layoutcommit

From: Peng Tao <[email protected]>

Layout commit is supposed to set server file size similar to nfs pages.
We should not update client file size for the same reason.
Otherwise we will lose what we have at hand.

Signed-off-by: Peng Tao <[email protected]>
Signed-off-by: Jim Rees <[email protected]>
---
fs/nfs/inode.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 144f2a3..3f1eb81 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1294,7 +1294,8 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
if (new_isize != cur_isize) {
/* Do we perhaps have any outstanding writes, or has
* the file grown beyond our last write? */
- if (nfsi->npages == 0 || new_isize > cur_isize) {
+ if ((nfsi->npages == 0 && !test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) ||
+ new_isize > cur_isize) {
i_size_write(inode, new_isize);
invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
}
--
1.7.4.1
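
The condition changed by this hunk can be read in isolation as a small predicate. The following is a userspace sketch under assumptions, not kernel code: the struct is a hypothetical stand-in for the fields of struct nfs_inode that the test looks at, and the boolean stands for test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode state the patched test examines. */
struct isize_state {
	unsigned long npages;		/* outstanding write pages */
	bool layoutcommit_pending;	/* NFS_INO_LAYOUTCOMMIT is set */
	int64_t cur_isize;		/* size the client currently believes */
};

/* Should the client adopt the size reported by the server? */
static bool should_accept_server_isize(const struct isize_state *st,
				       int64_t new_isize)
{
	assert(st);
	if (new_isize == st->cur_isize)
		return false;
	/* Growth is always accepted; a smaller size only when nothing
	 * is pending locally, including a layoutcommit. */
	return (st->npages == 0 && !st->layoutcommit_pending) ||
	       new_isize > st->cur_isize;
}
```

So with a layoutcommit pending, a smaller size from the server is ignored, while a larger one is still taken.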


2011-06-07 17:26:35

by Jim Rees

[permalink] [raw]
Subject: [PATCH 06/88] pnfs: HACK: ask for layout_blksize on mount

From: Fred Isaman <[email protected]>

This needs to be changed, as it assumes layout_blksize is a
filesystem attribute, when in fact it is a per-file attribute.
Including it here so I can get some current code into the tree.

Note this requires fattr bitmap to be length 3, instead of 2.

Signed-off-by: Fred Isaman <[email protected]>
[pnfs: replace lease_bitmap to length 3, instead of 2.]
Signed-off-by: Tao Guo <[email protected]>
[pnfsblock: clean up xdr]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/client.c | 1 +
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 5 +++--
fs/nfs/nfs4xdr.c | 42 +++++++++++++++++++++++++++++++++++++-----
include/linux/nfs_fs_sb.h | 1 +
include/linux/nfs_xdr.h | 1 +
6 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6bdb7da0..6c6236b 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -938,6 +938,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fh *mntf
server->wsize = NFS_MAX_FILE_IO_SIZE;
server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);
+ server->pnfs_blksize = fsinfo->blksize;

server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);

diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index d69fe11..d9c82fa 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -315,7 +315,7 @@ extern const struct nfs4_minor_version_ops *nfs_v4_minor_ops[];
extern const u32 nfs4_fattr_bitmap[2];
extern const u32 nfs4_statfs_bitmap[2];
extern const u32 nfs4_pathconf_bitmap[2];
-extern const u32 nfs4_fsinfo_bitmap[2];
+extern const u32 nfs4_fsinfo_bitmap[3];
extern const u32 nfs4_fs_locations_bitmap[2];

/* nfs4renewd.c */
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 9f09aa2..a693283 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -137,12 +137,13 @@ const u32 nfs4_pathconf_bitmap[2] = {
0
};

-const u32 nfs4_fsinfo_bitmap[2] = { FATTR4_WORD0_MAXFILESIZE
+const u32 nfs4_fsinfo_bitmap[3] = { FATTR4_WORD0_MAXFILESIZE
| FATTR4_WORD0_MAXREAD
| FATTR4_WORD0_MAXWRITE
| FATTR4_WORD0_LEASE_TIME,
FATTR4_WORD1_TIME_DELTA
- | FATTR4_WORD1_FS_LAYOUT_TYPES
+ | FATTR4_WORD1_FS_LAYOUT_TYPES,
+ FATTR4_WORD2_LAYOUT_BLKSIZE
};

const u32 nfs4_fs_locations_bitmap[2] = {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 5e4447c..e059dc8 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -113,7 +113,11 @@ static int nfs4_stat_to_errno(int);
#define encode_restorefh_maxsz (op_encode_hdr_maxsz)
#define decode_restorefh_maxsz (op_decode_hdr_maxsz)
#define encode_fsinfo_maxsz (encode_getattr_maxsz)
-#define decode_fsinfo_maxsz (op_decode_hdr_maxsz + 15)
+/* The 5 accounts for the PNFS attributes, and assumes that at most three
+ * layout types will be returned.
+ */
+#define decode_fsinfo_maxsz (op_decode_hdr_maxsz + \
+ nfs4_fattr_bitmap_maxsz + 4 + 8 + 5)
#define encode_renew_maxsz (op_encode_hdr_maxsz + 3)
#define decode_renew_maxsz (op_decode_hdr_maxsz)
#define encode_setclientid_maxsz \
@@ -1132,8 +1136,11 @@ static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct c

static void encode_fsinfo(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
{
- encode_getattr_two(xdr, bitmask[0] & nfs4_fsinfo_bitmap[0],
- bitmask[1] & nfs4_fsinfo_bitmap[1], hdr);
+ encode_getattr_three(xdr,
+ bitmask[0] & nfs4_fsinfo_bitmap[0],
+ bitmask[1] & nfs4_fsinfo_bitmap[1],
+ bitmask[2] & nfs4_fsinfo_bitmap[2],
+ hdr);
}

static void encode_fs_locations(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
@@ -2604,7 +2611,7 @@ static void nfs4_xdr_enc_setclientid_confirm(struct rpc_rqst *req,
struct compound_hdr hdr = {
.nops = 0,
};
- const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
+ const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };

encode_compound_hdr(xdr, req, &hdr);
encode_setclientid_confirm(xdr, arg, &hdr);
@@ -2748,7 +2755,7 @@ static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req,
struct compound_hdr hdr = {
.minorversion = nfs4_xdr_minorversion(&args->la_seq_args),
};
- const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
+ const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };

encode_compound_hdr(xdr, req, &hdr);
encode_sequence(xdr, &args->la_seq_args, &hdr);
@@ -4358,6 +4365,28 @@ static int decode_attr_pnfstype(struct xdr_stream *xdr, uint32_t *bitmap,
return status;
}

+/*
+ * The prefered block size for layout directed io
+ */
+static int decode_attr_layout_blksize(struct xdr_stream *xdr, uint32_t *bitmap,
+ uint32_t *res)
+{
+ __be32 *p;
+
+ dprintk("%s: bitmap is %x\n", __func__, bitmap[2]);
+ *res = 0;
+ if (bitmap[2] & FATTR4_WORD2_LAYOUT_BLKSIZE) {
+ p = xdr_inline_decode(xdr, 4);
+ if (unlikely(!p)) {
+ print_overflow_msg(__func__, xdr);
+ return -EIO;
+ }
+ *res = be32_to_cpup(p);
+ bitmap[2] &= ~FATTR4_WORD2_LAYOUT_BLKSIZE;
+ }
+ return 0;
+}
+
static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
{
__be32 *savep;
@@ -4389,6 +4418,9 @@ static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
status = decode_attr_pnfstype(xdr, bitmap, &fsinfo->layouttype);
if (status != 0)
goto xdr_error;
+ status = decode_attr_layout_blksize(xdr, bitmap, &fsinfo->blksize);
+ if (status)
+ goto xdr_error;

status = verify_attr_len(xdr, savep, attrlen);
xdr_error:
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 61052fc..3d93ada 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -143,6 +143,7 @@ struct nfs_server {
filesystem */
struct pnfs_layoutdriver_type *pnfs_curr_ld; /* Active layout driver */
struct rpc_wait_queue roc_rpcwaitq;
+ u32 pnfs_blksize; /* layout_blksize attr */

/* the following fields are protected by nfs_client->cl_lock */
struct rb_root state_owners;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 6c32c9d..d1f9c27 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -124,6 +124,7 @@ struct nfs_fsinfo {
struct timespec time_delta; /* server time granularity */
__u32 lease_time; /* in seconds */
__u32 layouttype; /* supported pnfs layout driver */
+ __u32 blksize; /* preferred pnfs io block size */
};

struct nfs_fsstat {
--
1.7.4.1
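
As a side note on why the bitmap has to grow to three words: layout_blksize is attribute number 65 in RFC 5661, and each bitmap word carries 32 attributes. The userspace sketch below shows that arithmetic and a decode step shaped like decode_attr_layout_blksize above. It is an illustration only; the helper names and the raw byte buffer are assumptions and do not use the kernel's xdr_stream API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FATTR4_LAYOUT_BLKSIZE 65	/* attribute number per RFC 5661 */

/* Which 32-bit bitmap word holds a given attribute number. */
static unsigned int attr_word(unsigned int attr)
{
	return attr / 32;
}

/* Bit mask of the attribute inside its word. */
static uint32_t attr_mask(unsigned int attr)
{
	return UINT32_C(1) << (attr % 32);
}

/* If the attribute is flagged in word 2, consume one big-endian 32-bit
 * value from [*p, end) and clear the flag. Returns 0 on success, -1 if
 * the buffer is too short. *res is 0 when the attribute is absent. */
static int decode_layout_blksize(const uint8_t **p, const uint8_t *end,
				 uint32_t bitmap[3], uint32_t *res)
{
	const uint32_t mask = attr_mask(FATTR4_LAYOUT_BLKSIZE);

	assert(attr_word(FATTR4_LAYOUT_BLKSIZE) == 2);
	*res = 0;
	if (!(bitmap[2] & mask))
		return 0;
	if ((size_t)(end - *p) < 4)
		return -1;
	*res = ((uint32_t)(*p)[0] << 24) | ((uint32_t)(*p)[1] << 16) |
	       ((uint32_t)(*p)[2] << 8) | (uint32_t)(*p)[3];
	*p += 4;
	bitmap[2] &= ~mask;
	return 0;
}
```

65 / 32 = 2 and 65 % 32 = 1, so the attribute lives in bit 1 of the third word, which a two-word bitmap cannot express.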


2011-06-07 17:31:27

by Jim Rees

[permalink] [raw]
Subject: [PATCH 45/88] SQUASHME: pnfsblock: Implement release_inval_marks

From: Zhang Jingwang <[email protected]>

Leaving it unimplemented will cause a memory leak.

Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 11 ++++++++---
fs/nfs/blocklayout/blocklayout.h | 6 ++++++
fs/nfs/blocklayout/extents.c | 6 ------
3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index db008e6..b0ad836 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -541,10 +541,15 @@ release_extents(struct pnfs_block_layout *bl,
spin_unlock(&bl->bl_ext_lock);
}

-/* STUB */
static void
-release_inval_marks(void)
+release_inval_marks(struct pnfs_inval_markings *marks)
{
+ struct pnfs_inval_tracking *pos, *temp;
+
+ list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
+ list_del(&pos->it_link);
+ kfree(pos);
+ }
return;
}

@@ -556,7 +561,7 @@ bl_free_layout(void *p)

dprintk("%s enter\n", __func__);
release_extents(bl, NULL);
- release_inval_marks();
+ release_inval_marks(&bl->bl_inval);
kfree(bl);
return;
}
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index ca36e61..45939e1 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -126,6 +126,12 @@ struct pnfs_inval_markings {
sector_t im_block_size; /* Server blocksize in sectors */
};

+struct pnfs_inval_tracking {
+ struct list_head it_link;
+ int it_sector;
+ int it_tags;
+};
+
/* sector_t fields are all in 512-byte sectors */
struct pnfs_block_extent {
struct kref be_refcnt;
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 4722899..cf5b3a3 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -40,12 +40,6 @@
#define INTERNAL_EXISTS MY_MAX_TAGS
#define INTERNAL_MASK ((1 << INTERNAL_EXISTS) - 1)

-struct pnfs_inval_tracking {
- struct list_head it_link;
- int it_sector;
- int it_tags;
-};
-
/* Returns largest t<=s s.t. t%base==0 */
static inline sector_t normalize(sector_t s, int base)
{
--
1.7.4.1
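
The point of list_for_each_entry_safe in release_inval_marks is that the next entry is remembered before the current one is freed. A minimal userspace sketch of the same idea follows; it uses a hypothetical singly linked list rather than the kernel's struct list_head, and the names are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct pnfs_inval_tracking. */
struct inval_tracking {
	struct inval_tracking *next;
	int sector;
	int tags;
};

/* Push a new entry at the head; on allocation failure the list is
 * returned unchanged. */
static struct inval_tracking *track_push(struct inval_tracking *head,
					 int sector)
{
	struct inval_tracking *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->next = head;
	n->sector = sector;
	n->tags = 0;
	return n;
}

/* Free every entry and return how many were freed. The next pointer is
 * saved before free(), which is what the _safe iterator does. */
static unsigned int track_release_all(struct inval_tracking **head)
{
	struct inval_tracking *pos = *head, *tmp;
	unsigned int freed = 0;

	assert(head);
	while (pos) {
		tmp = pos->next;
		free(pos);
		pos = tmp;
		freed++;
	}
	*head = NULL;
	return freed;
}
```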


2011-06-07 17:34:06

by Jim Rees

[permalink] [raw]
Subject: [PATCH 68/88] SQUASHME: pnfs-block: Revert "pnfsblock: expose block_class interface"

From: Benny Halevy <[email protected]>

Following 462557a ("pnfs-block: Remove device creation from kernel"),
exporting block_class is no longer needed.

This reverts commit f0004c24a5c923af2eba64506055bb83882bcf0e.

Signed-off-by: Benny Halevy <[email protected]>
---
block/genhd.c | 1 -
drivers/scsi/hosts.c | 1 +
2 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 6f19558..95822ae 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1108,7 +1108,6 @@ static void disk_release(struct device *dev)
struct class block_class = {
.name = "block",
};
-EXPORT_SYMBOL(block_class);

static char *block_devnode(struct device *dev, mode_t *mode)
{
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index ac24f02..7d91903 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -54,6 +54,7 @@ struct class shost_class = {
.name = "scsi_host",
.dev_release = scsi_host_cls_release,
};
+EXPORT_SYMBOL(shost_class);

/**
* scsi_host_set_state - Take the given host through the host state model.
--
1.7.4.1


2011-06-07 17:31:13

by Jim Rees

[permalink] [raw]
Subject: [PATCH 44/88] SQUASHME: pnfsblock: Wrong extent refcount in block extents list

From: Zhang Jingwang <[email protected]>

_prep_new_extent called kref_init to initialize extent's refcount to 1.
After that, we shouldn't call kref_get to increase its refcount, because
this extent is only referenced by the list.

Signed-off-by: Zhang Jingwang <[email protected]>
Acked-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/extents.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 288a38a..4722899 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -868,7 +868,6 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
offset - be->be_f_offset,
PNFS_BLOCK_INVALID_DATA);
children[i++] = e1;
- kref_get(&e1->be_refcnt);
print_bl_extent(e1);
} else
merge1 = e1;
@@ -876,7 +875,6 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
min(length, be->be_f_offset + be->be_length - offset),
PNFS_BLOCK_READWRITE_DATA);
children[i++] = e2;
- kref_get(&e2->be_refcnt);
print_bl_extent(e2);
if (offset + length < be->be_f_offset + be->be_length) {
_prep_new_extent(e3, be, e2->be_f_offset + e2->be_length,
@@ -884,7 +882,6 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
offset - length,
PNFS_BLOCK_INVALID_DATA);
children[i++] = e3;
- kref_get(&e3->be_refcnt);
print_bl_extent(e3);
} else
merge2 = e3;
--
1.7.4.1
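
The refcount reasoning in the changelog can be checked with a toy counter. This is a userspace sketch with hypothetical names, not the kernel kref API: init starts the count at 1 for the creator (here, the extent list), so an extra get leaves the object alive after the list drops its only reference.

```c
#include <assert.h>

/* Toy reference counter mirroring the kref_init/kref_get/kref_put idea. */
struct ref {
	int count;
};

/* The creator holds the first reference. */
static void ref_init(struct ref *r)
{
	r->count = 1;
}

static void ref_get(struct ref *r)
{
	assert(r->count > 0);
	r->count++;
}

/* Returns 1 when the last reference was dropped and the object
 * should be released, 0 otherwise. */
static int ref_put(struct ref *r)
{
	assert(r->count > 0);
	return --r->count == 0;
}
```

With init followed by a single put the object is released; with init, get, put the count stays at 1 and nothing ever frees it, which is the leak this patch removes.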


2011-06-07 17:29:01

by Jim Rees

[permalink] [raw]
Subject: [PATCH 26/88] pnfsblock: allow use of PG_owner_priv_1 flag

From: Fred Isaman <[email protected]>

There is currently no good way for pnfs to communicate problems. For
example - the linux read code first tries to do readahead through
nfs_readpages. Failure there is ignored, and it will later call
nfs_readpage. Failure there is also ignored, except that the lack of
PG_uptodate is communicated back via -EIO.

With pnfs, it would be useful to be able to communicate to
nfs_readpage that direct disk IO failed on readahead, and that it
should failover to using the MDS.

Making the page flag PG_owner_priv_1 available as PG_pnfserr is one
way to do so. (An alternative would be to embed this in the layout,
but then pg_test can't easily access the info.)

This may be better as generic pnfs code, in which case it should be
put in pnfs.h, or even page-flags.h

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 8b06c93..493d4d3 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -37,6 +37,11 @@
#include <linux/nfs4_pnfs.h>
#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/

+#define PG_pnfserr PG_owner_priv_1
+#define PagePnfsErr(page) test_bit(PG_pnfserr, &(page)->flags)
+#define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
+#define ClearPagePnfsErr(page) clear_bit(PG_pnfserr, &(page)->flags)
+
extern struct class shost_class; /* exported from drivers/scsi/hosts.c */
extern int dm_dev_create(struct dm_ioctl *param); /* from dm-ioctl.c */
extern int dm_dev_remove(struct dm_ioctl *param); /* from dm-ioctl.c */
--
1.7.4.1
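
A userspace sketch of how such a per-page error flag might steer the read path is shown below. Everything in it is an assumption for illustration: the struct, the bit number, and the choose_read_path helper are hypothetical, and the plain bit operations here are not atomic, unlike the kernel's test_bit/set_bit/clear_bit.

```c
#include <assert.h>

#define PG_PNFSERR_BIT 3	/* hypothetical bit number */

/* Hypothetical stand-in for struct page, flags word only. */
struct fake_page {
	unsigned long flags;
};

static int page_pnfs_err(const struct fake_page *pg)
{
	return (int)((pg->flags >> PG_PNFSERR_BIT) & 1UL);
}

static void set_page_pnfs_err(struct fake_page *pg)
{
	pg->flags |= 1UL << PG_PNFSERR_BIT;
}

static void clear_page_pnfs_err(struct fake_page *pg)
{
	pg->flags &= ~(1UL << PG_PNFSERR_BIT);
}

enum read_path { READ_VIA_LAYOUT, READ_VIA_MDS };

/* If direct disk I/O already failed for this page, fall back to the MDS. */
static enum read_path choose_read_path(const struct fake_page *pg)
{
	assert(pg);
	return page_pnfs_err(pg) ? READ_VIA_MDS : READ_VIA_LAYOUT;
}
```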


2011-06-07 17:31:07

by Jim Rees

[permalink] [raw]
Subject: [PATCH 43/88] SQUASHME: pnfsblock: fix bug when decoding block device info.

From: Tao Guo <[email protected]>

Skip local block device if its gendisk structure is NULL.

Signed-off-by: Tao Guo <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayoutdev.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index ac5c117..9fc3d46 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -375,7 +375,7 @@ static int map_sig_to_device(struct pnfs_blk_sig *sig,
struct visible_block_device *vis_dev;

list_for_each_entry(vis_dev, sdlist, vi_node) {
- if (vis_dev->vi_mapped)
+ if (vis_dev->vi_mapped || !vis_dev->vi_bdev->bd_disk)
continue;
mapped = verify_sig(vis_dev->vi_bdev, sig);
if (mapped) {
--
1.7.4.1


2011-06-10 03:07:32

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-09 07:54, Peng Tao wrote:
> On Thu, Jun 9, 2011 at 2:06 PM, Benny Halevy <[email protected]> wrote:
>> On 2011-06-08 03:15, Peng Tao wrote:
>>> On 6/8/11, Jim Rees <[email protected]> wrote:
>>>> Benny Halevy wrote:
>>>>
>>>> NAK.
>>>> This affects all layout types. In particular it is undesired
>>>> for write layouts that extend the file with the objects layout.
>>>> The server can extend the layout segments range
>>>> over what the client requested so why would the client
>>>> ask for artificially large layouts?
>>>>
>>>> This has actually been the subject of some debate over Thursday night
>>>> beers. The problem we're trying to solve is that the client is spending 98%
>>>> of its time in layoutget. This patch gives us something like a 10x
>>>> speedup. But many of us think it's not the right fix. I suggest we discuss
>>>> next week.
>>>>
>>
>> Sure.
>>
>>>> But note that this patch doesn't change anything unless you set the sysctl.
>>> there is a default value of 2M. maybe we can set it to page size by
>>> default so other layout are not affected and block layout can let
>>> users set it by hand if they care about performance. does this make
>>> sense?
>>
>> If doing it at all why use a sysctl rather than a mount option?
> The purpose of using a sysctl is to give client the ability to change
> it on the fly. In theory, layout prefetching can benefit all layout
> types. So the patch tries to solve it in the pnfs generic layer.
>

But the need for this varies per-server and many times per application.
Think sequential vs. random I/O. Therefore a mount option would help
tuning the behavior on a per-use basis. Global behavior must be implemented
using a dynamic algorithm that would take both the workload and the server
observed behavior into account.

Benny

>> Or maybe coding the logic for prefetching the layout iff sequential
>> access is detected is the right thing to do.
> Yeah, automatic decision should be a better way.
>
>>
>> Benny
>>
>>>>
>>>
>>>
>>
>
>
>
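
The "prefetch iff sequential access is detected" idea mentioned in this exchange could look roughly like the sketch below. It is only an illustration with hypothetical names and an assumed threshold of two back-to-back requests; nothing like it exists in the posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEQ_RUN_THRESHOLD 2	/* assumption: 2 contiguous requests in a row */

struct seq_detect {
	uint64_t next_offset;	/* where a sequential request would start */
	unsigned int run;	/* length of the current contiguous run */
};

/* Record one I/O request and report whether the stream currently looks
 * sequential, i.e. whether a larger layout would be worth asking for. */
static bool seq_observe(struct seq_detect *sd, uint64_t offset, uint64_t len)
{
	assert(sd);
	if (offset == sd->next_offset)
		sd->run++;
	else
		sd->run = 0;
	sd->next_offset = offset + len;
	return sd->run >= SEQ_RUN_THRESHOLD;
}
```

A random-I/O workload keeps resetting the run and never triggers prefetch, which is the per-application difference Benny points out.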

2011-06-10 14:11:26

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 87/88] Add configurable prefetch size for layoutget

Hi, Benny,

-----Original Message-----
From: Benny Halevy [mailto:[email protected]]
Sent: Friday, June 10, 2011 8:33 PM
To: Peng, Tao
Cc: [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 02:00, [email protected] wrote:
> Hi, Benny,
>
> Cheers,
> -Bergwolf
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
> Sent: Friday, June 10, 2011 5:23 AM
> To: Peng Tao
> Cc: Jim Rees; [email protected]; peter honeyman
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-09 08:07, Peng Tao wrote:
>> Hi, Jim and Benny,
>>
>> On Thu, Jun 9, 2011 at 9:58 PM, Jim Rees <[email protected]> wrote:
>>> Benny Halevy wrote:
>>>
>>> > My understanding is that layoutget specifies a min and max, and the server
>>>
>>> There's a min. What do you consider the max?
>>> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>>
>>> The spec doesn't say max, it says "desired." I guess I assumed the server
>>> wouldn't normally return more than desired.
>> In fact server is returning "desired" length. The problem is that we
>> call pnfs_update_layout in nfs_write_begin, and it will end up setting
>> both minlength and length to page size. There is no space for client
>> to collapse layoutget range in nfs_write_begin.
>>
>
> That's a different issue. Waiting with pnfs_update_layout to flush
> time rather than write_begin if the whole page is written would help
> sending a more meaningful desired range as well as avoiding needless
> read-modify-writes in case the application also wrote the whole
> preallocated block.
> [PT] It is also the reason why we want to introduce layout prefetching, to get more segment than the page passed in nfs_write_begin.
>

Peng, I understand what you want to achieve but the proposed way
just doesn't fly. The server knows better than the client its allocation policies
and it knows better the combined workload of different client and possible
conflicts between them therefore it should be making the ultimate decision
about the actual segment sizes.
[PT] Yes, you are right. Server should know combined workload of all clients and make its decision based on that.
And it always has the right to return more than (or less than) specified in loga_length.

That said, the client should indeed do its best to ask for the most appropriate
segments size for its use and we should be making a better job at that.
It's just that blindly asking for more is not a good strategy and requiring
manual admin help to tune the clients is not acceptable.
[PT] Yeah, determining the most appropriate size is always the hard part. Do you have any suggestions for that?

Thanks,
Tao


2011-06-07 17:27:51

by Jim Rees

[permalink] [raw]
Subject: [PATCH 17/88] pnfsblock: create and destroy dm metadevice

From: Fred Isaman <[email protected]>

Adds functions to create a (tableless) dm device and clean it up.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>

pnfsblock: fix pnfs_deviceid references

the pnfs_deviceid typedef was changed to struct pnfs_deviceid

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 2 +
fs/nfs/blocklayout/blocklayoutdev.c | 4 +-
fs/nfs/blocklayout/blocklayoutdm.c | 84 +++++++++++++++++++++++++++++++++-
3 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 2c6e1fe..b705906 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -132,6 +132,8 @@ uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
} while (0)

/* blocklayoutdev.c */
+struct block_device *nfs4_blkdev_get(dev_t dev);
+int nfs4_blkdev_put(struct block_device *bdev);
struct pnfs_block_dev *nfs4_blk_decode_device(struct super_block *sb,
struct pnfs_device *dev,
struct list_head *sdlist);
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index f1689b9..0ea44aa 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -52,7 +52,7 @@ uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes)
EXPORT_SYMBOL(blk_overflow);

/* Open a block_device by device number. */
-static struct block_device *nfs4_blkdev_get(dev_t dev)
+struct block_device *nfs4_blkdev_get(dev_t dev)
{
struct block_device *bd;

@@ -70,7 +70,7 @@ fail:
/*
* Release the block device
*/
-static int nfs4_blkdev_put(struct block_device *bdev)
+int nfs4_blkdev_put(struct block_device *bdev)
{
dprintk("%s for device %d:%d\n", __func__, MAJOR(bdev->bd_dev),
MINOR(bdev->bd_dev));
diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
index 15eaed2..0e04494 100644
--- a/fs/nfs/blocklayout/blocklayoutdm.c
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -30,14 +30,51 @@
* possibility of such damages.
*/

+#include <linux/genhd.h> /* gendisk - used in a dprintk*/
+
#include "blocklayout.h"

#define NFSDBG_FACILITY NFSDBG_PNFS_LD

-/* Stub */
+static int dev_create(const char *name, dev_t *dev)
+{
+ struct dm_ioctl ctrl;
+ int rv;
+
+ memset(&ctrl, 0, sizeof(ctrl));
+ strncpy(ctrl.name, name, DM_NAME_LEN-1);
+ rv = dm_dev_create(&ctrl); /* XXX - need to pull data out of ctrl */
+ dprintk("Tried to create %s, got %i\n", name, rv);
+ if (!rv) {
+ *dev = huge_decode_dev(ctrl.dev);
+ dprintk("dev = (%i, %i)\n", MAJOR(*dev), MINOR(*dev));
+ }
+ return rv;
+}
+
+static int dev_remove(const char *name)
+{
+ struct dm_ioctl ctrl;
+ memset(&ctrl, 0, sizeof(ctrl));
+ strncpy(ctrl.name, name, DM_NAME_LEN-1);
+ return dm_dev_remove(&ctrl);
+}
+
+/*
+ * Release meta device
+ */
static int nfs4_blk_metadev_release(struct pnfs_block_dev *bdev)
{
- return 0;
+ int rv;
+
+ dprintk("%s Releasing %s\n", __func__, bdev->bm_mdevname);
+ /* XXX Check return? */
+ rv = nfs4_blkdev_put(bdev->bm_mdev);
+ dprintk("%s nfs4_blkdev_put returns %d\n", __func__, rv);
+
+ rv = dev_remove(bdev->bm_mdevname);
+ dprintk("%s Returns %d\n", __func__, rv);
+ return rv;
}

void free_block_dev(struct pnfs_block_dev *bdev)
@@ -56,10 +93,51 @@ void free_block_dev(struct pnfs_block_dev *bdev)
}
}

-/* Stub */
+/*
+ * Create meta device. Keep it open to use for I/O.
+ */
struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
struct pnfs_device *dev)
{
+ static uint64_t dev_count; /* STUB used for device names */
+ struct block_device *bd;
+ dev_t meta_dev;
+ struct pnfs_block_dev *rv;
+ int status;
+
+ dprintk("%s enter\n", __func__);
+
+ rv = kmalloc(sizeof(*rv) + 32, GFP_KERNEL);
+ if (!rv)
+ return NULL;
+ rv->bm_mdevname = (char *)rv + sizeof(*rv);
+ sprintf(rv->bm_mdevname, "FRED_%llu", dev_count++);
+ status = dev_create(rv->bm_mdevname, &meta_dev);
+ if (status)
+ goto out_err;
+ bd = nfs4_blkdev_get(meta_dev);
+ if (!bd)
+ goto out_err;
+ if (bd_claim(bd, sb)) {
+ dprintk("%s: failed to claim device %d:%d\n",
+ __func__,
+ MAJOR(meta_dev),
+ MINOR(meta_dev));
+ blkdev_put(bd, FMODE_READ);
+ goto out_err;
+ }
+
+ rv->bm_mdev = bd;
+ memcpy(&rv->bm_mdevid, &dev->dev_id, sizeof(struct pnfs_deviceid));
+ dprintk("%s Created device %s named %s with bd_block_size %u\n",
+ __func__,
+ bd->bd_disk->disk_name,
+ rv->bm_mdevname,
+ bd->bd_block_size);
+ return rv;
+
+ out_err:
+ kfree(rv);
return NULL;
}

--
1.7.4.1
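
On the 32 extra bytes allocated for bm_mdevname: "FRED_" is 5 characters and a 64-bit counter prints as at most 20 decimal digits, so the longest name is 25 characters plus the terminating NUL. A bounded variant of the formatting step could be sketched in userspace as follows; the helper name and the error convention are hypothetical, and the patch itself uses plain sprintf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MDEV_NAME_MAX 32	/* matches the 32 extra bytes in the patch */

/* Format the metadevice name into buf. Returns the name length, or -1
 * if it would not fit in len bytes including the terminating NUL. */
static int format_mdev_name(char *buf, size_t len, unsigned long long n)
{
	int w;

	assert(buf && len > 0);
	w = snprintf(buf, len, "FRED_%llu", n);
	if (w < 0 || (size_t)w >= len)
		return -1;
	return w;
}
```

Passing MDEV_NAME_MAX as len always succeeds for a 64-bit counter; a smaller buffer makes the function fail instead of overflowing.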


2011-06-07 17:35:10

by Jim Rees

[permalink] [raw]
Subject: [PATCH 78/88] SQUASHME: pnfs-block: use {set,clear}_layoutdriver

From: Benny Halevy <[email protected]>

Methods renamed by Trond as of 2.6.37

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 57a7f04..5be912e 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -752,7 +752,7 @@ nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
* Retrieve the list of available devices for the mountpoint.
*/
static int
-bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
+bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
{
struct block_mount_id *b_mt_id = NULL;
struct pnfs_mount_type *mtype = NULL;
@@ -817,7 +817,7 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
}

static int
-bl_uninitialize_mountpoint(struct nfs_server *server)
+bl_clear_layoutdriver(struct nfs_server *server)
{
struct block_mount_id *b_mt_id = server->pnfs_ld_data;

@@ -1114,8 +1114,8 @@ static struct pnfs_layoutdriver_type blocklayout_type = {
.setup_layoutcommit = bl_setup_layoutcommit,
.encode_layoutcommit = bl_encode_layoutcommit,
.cleanup_layoutcommit = bl_cleanup_layoutcommit,
- .initialize_mountpoint = bl_initialize_mountpoint,
- .uninitialize_mountpoint = bl_uninitialize_mountpoint,
+ .set_layoutdriver = bl_set_layoutdriver,
+ .clear_layoutdriver = bl_clear_layoutdriver,
.pg_test = bl_pg_test,
};

--
1.7.4.1


2011-06-11 01:36:16

by Peng Tao

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On Sat, Jun 11, 2011 at 3:23 AM, Benny Halevy <[email protected]> wrote:
> On 2011-06-10 10:09, [email protected] wrote:
>> Hi, Benny,
>>
>> -----Original Message-----
>> From: Benny Halevy [mailto:[email protected]]
>> Sent: Friday, June 10, 2011 8:33 PM
>> To: Peng, Tao
>> Cc: [email protected]; [email protected]; [email protected]; [email protected]
>> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>>
>> On 2011-06-10 02:00, [email protected] wrote:
>>> Hi, Benny,
>>>
>>> Cheers,
>>> -Bergwolf
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
>>> Sent: Friday, June 10, 2011 5:23 AM
>>> To: Peng Tao
>>> Cc: Jim Rees; [email protected]; peter honeyman
>>> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>>>
>>> On 2011-06-09 08:07, Peng Tao wrote:
>>>> Hi, Jim and Benny,
>>>>
>>>> On Thu, Jun 9, 2011 at 9:58 PM, Jim Rees <[email protected]> wrote:
>>>>> Benny Halevy wrote:
>>>>>
>>>>>  > My understanding is that layoutget specifies a min and max, and the server
>>>>>
>>>>>  There's a min.  What do you consider the max?
>>>>>  Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>>>>
>>>>> The spec doesn't say max, it says "desired."  I guess I assumed the server
>>>>> wouldn't normally return more than desired.
>>>> In fact server is returning "desired" length. The problem is that we
>>>> call pnfs_update_layout in nfs_write_begin, and it will end up setting
>>>> both minlength and length to page size. There is no space for client
>>>> to collapse layoutget range in nfs_write_begin.
>>>>
>>>
>>> That's a different issue.  Waiting with pnfs_update_layout to flush
>>> time rather than write_begin if the whole page is written would help
>>> sending a more meaningful desired range as well as avoiding needless
>>> read-modify-writes in case the application also wrote the whole
>>> preallocated block.
>>> [PT] It is also the reason why we want to introduce layout prefetching, to get more segment than the page passed in nfs_write_begin.
>>>
>>
>> Peng, I understand what you want to achieve but the proposed way
>> just doesn't fly. The server knows better than the client its allocation policies
>> and it knows better the combined workload of different client and possible
>> conflicts between them therefore it should be making the ultimate decision
>> about the actual segment sizes.
>> [PT] Yes, you are right. Server should know combined workload of all clients and make its decision based on that.
>> And it always has the right to return more than (or less than) specified in loga_length.
>>
>> That said, the client should indeed do its best to ask for the most appropriate
>> segments size for its use and we should be making a better job at that.
>> It's just that blindly asking for more is not a good strategy and requiring
>> manual admin help to tune the clients is not acceptable.
>> [PT] Yeah, determining the most appropriate size is always the hard part. Do you have any suggestions for that?
>
> A simple algorithm I can suggest is:
> - on initialization, calculate and save, per layout driver
>  - maximum layout size
>    - take into account csr_fore_chan_attrs.ca_maxresponsesize and possible other parameters
>  - keep a working copy of the maximum value and the calculated copy.
>  - alignment value.
> - on miss, see if there's an adjacent layout segment in cache
Err, that's another issue. The generic layer should really merge adjacent
layout segments when necessary, instead of leaving the lookup code to
work out which segments are adjacent...

> - if found, ask for twice the found segment size, up to the maximum value,
>  aligned on the alignment value.
> - if the server returns less than the layoutget range, keep note of the returned length
>  (but not adjust maximum yet, as the server may return a short segment for various
>   reasons)
> - if the server is consistent about returning less than was asked, adjust the
>  - working copy of the maximum length
If the server is consistent about returning more/less than asked, it is an
indicator that the server is adjusting the range automatically. Then the client
should stop using this algorithm and trust the server's behavior...

> - if the maximum was adjusted try bumping it up after X (TBD) layoutgets or T seconds
>  to see if that was just due to high load or conflicts on the server
> - on any error returned for LAYOUTGET reset the algorithm parameters
> - on session reestablishment recalculate maximums.
>
> Benny
>
>>
>> Thanks,
>> Tao
>



--
Thanks,
-Bergwolf

2011-06-10 20:03:07

by Fred Isaman

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On Fri, Jun 10, 2011 at 3:23 PM, Benny Halevy <[email protected]> wrote:
>
> A simple algorithm I can suggest is:
> - on initialization, calculate and save, per layout driver
>  - maximum layout size

I must be misunderstanding something. Layout size has nothing to do
with io size (other than the obvious fact that you want the layout >
io).

I don't know about the object driver, but for both the file and block
drivers the client wants as much as the server will give it.

Fred

>    - take into account csr_fore_chan_attrs.ca_maxresponsesize and possible other parameters
>  - keep a working copy of the maximum value and the calculated copy.
>  - alignment value.
> - on miss, see if there's an adjacent layout segment in cache
> - if found, ask for twice the found segment size, up to the maximum value,
>  aligned on the alignment value.
> - if the server returns less than the layoutget range, keep note of the returned length
>  (but not adjust maximum yet, as the server may return a short segment for various
>   reasons)
> - if the server is consistent about returning less than was asked, adjust the
>  - working copy of the maximum length
> - if the maximum was adjusted try bumping it up after X (TBD) layoutgets or T seconds
>  to see if that was just due to high load or conflicts on the server
> - on any error returned for LAYOUTGET reset the algorithm parameters
> - on session reestablishment recalculate maximums.
>
> Benny
>
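
The sizing heuristic Benny outlines in the quoted text can be sketched in userspace C as follows. This is only an illustration of the proposal as worded in the thread, not code from the patch set: `struct lg_hint`, `lg_next_request`, `lg_note_reply`, `lg_reset` and the threshold `LG_SHORT_LIMIT` are all hypothetical names, the "X layoutgets or T seconds" bump-up step is left out, and the alignment is assumed to be non-zero.

```c
#include <stdint.h>

/* Hypothetical per-layout-driver state for the heuristic. */
struct lg_hint {
	uint64_t calc_max;      /* maximum computed at initialization */
	uint64_t work_max;      /* working copy, may be lowered later */
	uint64_t align;         /* alignment in bytes, assumed > 0 */
	unsigned int short_cnt; /* consecutive short replies seen */
};

/* Hypothetical: how many short replies count as "consistent". */
#define LG_SHORT_LIMIT 3

/*
 * Pick the length to ask for. adjacent_len is the size of a cached
 * segment next to the miss (0 if none); min_len is what the I/O needs.
 */
uint64_t lg_next_request(const struct lg_hint *h, uint64_t adjacent_len,
			 uint64_t min_len)
{
	uint64_t want = adjacent_len ? 2 * adjacent_len : min_len;
	uint64_t cap = h->work_max / h->align * h->align; /* round down */

	if (want < min_len)
		want = min_len;
	want = (want + h->align - 1) / h->align * h->align; /* round up */
	if (want > cap)
		want = cap;
	if (want < min_len)	/* never ask for less than the I/O needs */
		want = min_len;
	return want;
}

/* Record a reply; lower the working maximum only after repeated short ones. */
void lg_note_reply(struct lg_hint *h, uint64_t asked, uint64_t got)
{
	if (got >= asked) {
		h->short_cnt = 0;
		return;
	}
	if (++h->short_cnt >= LG_SHORT_LIMIT) {
		h->work_max = got;
		h->short_cnt = 0;
	}
}

/* On a LAYOUTGET error or session re-establishment, start over. */
void lg_reset(struct lg_hint *h)
{
	h->work_max = h->calc_max;
	h->short_cnt = 0;
}
```

Note that both replies in the thread question whether the client should run such a heuristic at all; the sketch only makes the proposed steps concrete.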

2011-06-07 17:32:08

by Jim Rees

Subject: [PATCH 50/88] pnfsblock: Lookup list entry of layouts and tags in reverse order

From: Zhang Jingwang <[email protected]>

Optimize for sequential writes. Layout infos and tags are organized by
file offset. When appending data to a file, the whole list is examined,
which causes a notable performance decrease.

Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/extents.c | 126 +++++++++++++++++++++---------------------
1 files changed, 64 insertions(+), 62 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index cf5b3a3..6c26cd4 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -60,8 +60,8 @@ static int32_t _find_entry(struct my_tree_t *tree, u64 s)
struct pnfs_inval_tracking *pos;

dprintk("%s(%llu) enter\n", __func__, s);
- list_for_each_entry(pos, &tree->mtt_stub, it_link) {
- if (pos->it_sector < s)
+ list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector > s)
continue;
else if (pos->it_sector == s)
return pos->it_tags & INTERNAL_MASK;
@@ -96,8 +96,8 @@ static int _add_entry(struct my_tree_t *tree, u64 s, int32_t tag,
struct pnfs_inval_tracking *pos;

dprintk("%s(%llu, %i, %p) enter\n", __func__, s, tag, storage);
- list_for_each_entry(pos, &tree->mtt_stub, it_link) {
- if (pos->it_sector < s)
+ list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector > s)
continue;
else if (pos->it_sector == s) {
found = 1;
@@ -119,7 +119,7 @@ static int _add_entry(struct my_tree_t *tree, u64 s, int32_t tag,
}
new->it_sector = s;
new->it_tags = (1 << tag);
- list_add_tail(&new->it_link, &pos->it_link);
+ list_add(&new->it_link, &pos->it_link);
return 1;
}
}
@@ -225,14 +225,14 @@ _range_has_tag(struct my_tree_t *tree, u64 start, u64 end, int32_t tag)
u64 expect = 0;

dprintk("%s(%llu, %llu, %i) enter\n", __func__, start, end, tag);
- list_for_each_entry(pos, &tree->mtt_stub, it_link) {
- if (pos->it_sector < start)
+ list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector >= end)
continue;
if (!expect) {
- if ((pos->it_sector == start) &&
+ if ((pos->it_sector == end - tree->mtt_step_size) &&
(pos->it_tags & (1 << tag))) {
- expect = start + tree->mtt_step_size;
- if (expect == end)
+ expect = pos->it_sector - tree->mtt_step_size;
+ if (expect < start)
return 1;
continue;
} else {
@@ -241,8 +241,8 @@ _range_has_tag(struct my_tree_t *tree, u64 start, u64 end, int32_t tag)
}
if (pos->it_sector != expect || !(pos->it_tags & (1 << tag)))
return 0;
- expect += tree->mtt_step_size;
- if (expect == end)
+ expect -= tree->mtt_step_size;
+ if (expect < start)
return 1;
}
return 0;
@@ -589,65 +589,67 @@ add_and_merge_extent(struct pnfs_block_layout *bl,
/* Scan for proper place to insert, extending new to the left
* as much as possible.
*/
- list_for_each_entry_safe(be, tmp, list, be_node) {
- if (new->be_f_offset < be->be_f_offset)
+ list_for_each_entry_safe_reverse(be, tmp, list, be_node) {
+ if (new->be_f_offset >= be->be_f_offset + be->be_length)
break;
- if (end <= be->be_f_offset + be->be_length) {
- /* new is a subset of existing be*/
+ if (new->be_f_offset >= be->be_f_offset) {
+ if (end <= be->be_f_offset + be->be_length) {
+ /* new is a subset of existing be*/
+ if (extents_consistent(be, new)) {
+ dprintk("%s: new is subset, ignoring\n",
+ __func__);
+ put_extent(new);
+ return 0;
+ } else {
+ goto out_err;
+ }
+ } else {
+ /* |<-- be -->|
+ * |<-- new -->| */
+ if (extents_consistent(be, new)) {
+ /* extend new to fully replace be */
+ new->be_length += new->be_f_offset -
+ be->be_f_offset;
+ new->be_f_offset = be->be_f_offset;
+ new->be_v_offset = be->be_v_offset;
+ dprintk("%s: removing %p\n", __func__, be);
+ list_del(&be->be_node);
+ put_extent(be);
+ } else {
+ goto out_err;
+ }
+ }
+ } else if (end >= be->be_f_offset + be->be_length) {
+ /* new extent overlap existing be */
if (extents_consistent(be, new)) {
- dprintk("%s: new is subset, ignoring\n",
- __func__);
- put_extent(new);
- return 0;
- } else
+ /* extend new to fully replace be */
+ dprintk("%s: removing %p\n", __func__, be);
+ list_del(&be->be_node);
+ put_extent(be);
+ } else {
goto out_err;
- } else if (new->be_f_offset <=
- be->be_f_offset + be->be_length) {
- /* new overlaps or abuts existing be */
- if (extents_consistent(be, new)) {
+ }
+ } else if (end > be->be_f_offset) {
+ /* |<-- be -->|
+ *|<-- new -->| */
+ if (extents_consistent(new, be)) {
/* extend new to fully replace be */
- new->be_length += new->be_f_offset -
- be->be_f_offset;
- new->be_f_offset = be->be_f_offset;
- new->be_v_offset = be->be_v_offset;
+ new->be_length += be->be_f_offset + be->be_length -
+ new->be_f_offset - new->be_length;
dprintk("%s: removing %p\n", __func__, be);
list_del(&be->be_node);
put_extent(be);
- } else if (new->be_f_offset !=
- be->be_f_offset + be->be_length)
+ } else {
goto out_err;
+ }
}
}
/* Note that if we never hit the above break, be will not point to a
* valid extent. However, in that case &be->be_node==list.
*/
- list_add_tail(&new->be_node, &be->be_node);
+ list_add(&new->be_node, &be->be_node);
dprintk("%s: inserting new\n", __func__);
print_elist(list);
- /* Scan forward for overlaps. If we find any, extend new and
- * remove the overlapped extent.
- */
- be = list_prepare_entry(new, list, be_node);
- list_for_each_entry_safe_continue(be, tmp, list, be_node) {
- if (end < be->be_f_offset)
- break;
- /* new overlaps or abuts existing be */
- if (extents_consistent(be, new)) {
- if (end < be->be_f_offset + be->be_length) {
- /* extend new to fully cover be */
- end = be->be_f_offset + be->be_length;
- new->be_length = end - new->be_f_offset;
- }
- dprintk("%s: removing %p\n", __func__, be);
- list_del(&be->be_node);
- put_extent(be);
- } else if (end != be->be_f_offset) {
- list_del(&new->be_node);
- goto out_err;
- }
- }
- dprintk("%s: after merging\n", __func__);
- print_elist(list);
/* STUB - The per-list consistency checks have all been done,
* should now check cross-list consistency.
*/
@@ -680,10 +682,10 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
if (ret &&
(!cow_read || ret->be_state != PNFS_BLOCK_INVALID_DATA))
break;
- list_for_each_entry(be, &bl->bl_extents[i], be_node) {
- if (isect < be->be_f_offset)
+ list_for_each_entry_reverse(be, &bl->bl_extents[i], be_node) {
+ if (isect >= be->be_f_offset + be->be_length)
break;
- if (isect < be->be_f_offset + be->be_length) {
+ if (isect >= be->be_f_offset) {
/* We have found an extent */
dprintk("%s Get %p (%i)\n", __func__, be,
atomic_read(&be->be_refcnt.refcount));
@@ -716,10 +718,10 @@ find_get_extent_locked(struct pnfs_block_layout *bl, sector_t isect)
for (i = 0; i < EXTENT_LISTS; i++) {
if (ret)
break;
- list_for_each_entry(be, &bl->bl_extents[i], be_node) {
- if (isect < be->be_f_offset)
+ list_for_each_entry_reverse(be, &bl->bl_extents[i], be_node) {
+ if (isect >= be->be_f_offset + be->be_length)
break;
- if (isect < be->be_f_offset + be->be_length) {
+ if (isect >= be->be_f_offset) {
/* We have found an extent */
dprintk("%s Get %p (%i)\n", __func__, be,
atomic_read(&be->be_refcnt.refcount));
--
1.7.4.1
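
The lookup order used by the reworked find_get_extent can be modelled outside the kernel as below. This is a minimal sketch, assuming the extents are kept sorted by file offset and do not overlap; it uses a plain array in place of the kernel list, and `struct extent` and `find_extent_reverse` are hypothetical stand-ins rather than names from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an extent: start offset and length in sectors. */
struct extent {
	uint64_t f_offset;
	uint64_t length;
};

/*
 * Walk the sorted extents from the tail. For an append-heavy workload the
 * sector being looked up is usually in or near the last extent, so the
 * walk ends after a step or two instead of scanning the whole list.
 */
const struct extent *find_extent_reverse(const struct extent *ext, size_t n,
					 uint64_t isect)
{
	size_t i = n;

	while (i-- > 0) {
		const struct extent *be = &ext[i];

		/* isect lies past this extent; earlier ones end even sooner */
		if (isect >= be->f_offset + be->length)
			break;
		/* isect is inside [f_offset, f_offset + length) */
		if (isect >= be->f_offset)
			return be;
	}
	return NULL;
}
```

The two comparisons are the mirror image of the forward scan removed by the patch: the early exit now triggers on the end of an extent rather than on its start.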


2011-06-07 17:36:10

by Jim Rees

Subject: [PATCH 87/88] Add configurable prefetch size for layoutget

From: Peng Tao <[email protected]>

pnfs_layout_prefetch_kb can be modified via sysctl.

Signed-off-by: Peng Tao <[email protected]>
Signed-off-by: Jim Rees <[email protected]>
---
fs/nfs/pnfs.c | 17 +++++++++++++++++
fs/nfs/pnfs.h | 1 +
fs/nfs/sysctl.c | 10 ++++++++++
3 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 9920bff..9c2b569 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -46,6 +46,11 @@ static DEFINE_SPINLOCK(pnfs_spinlock);
*/
static LIST_HEAD(pnfs_modules_tbl);

+/*
+ * layoutget prefetch size
+ */
+unsigned int pnfs_layout_prefetch_kb = 2 << 10;
+
/* Return the registered pnfs layout driver module matching given id */
static struct pnfs_layoutdriver_type *
find_pnfs_driver_locked(u32 id)
@@ -906,6 +911,16 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
}

/*
+ * Set layout prefetch length.
+ */
+static void
+pnfs_set_layout_prefetch(struct pnfs_layout_range *range)
+{
+ if (range->length < (pnfs_layout_prefetch_kb << 10))
+ range->length = pnfs_layout_prefetch_kb << 10;
+}
+
+/*
* Layout segment is retreived from the server if not cached.
* The appropriate layout segment is referenced and returned to the caller.
*/
@@ -956,6 +971,8 @@ pnfs_update_layout(struct inode *ino,

if (pnfs_layoutgets_blocked(lo, NULL, 0))
goto out_unlock;
+
+ pnfs_set_layout_prefetch(&arg);
atomic_inc(&lo->plh_outstanding);

get_layout_hdr(lo);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 28d57c9..563c67b 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -182,6 +182,7 @@ extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp);

/* pnfs.c */
+extern unsigned int pnfs_layout_prefetch_kb;
void get_layout_hdr(struct pnfs_layout_hdr *lo);
void put_lseg(struct pnfs_layout_segment *lseg);
struct pnfs_layout_segment *
diff --git a/fs/nfs/sysctl.c b/fs/nfs/sysctl.c
index 978aaeb..79a5134 100644
--- a/fs/nfs/sysctl.c
+++ b/fs/nfs/sysctl.c
@@ -14,6 +14,7 @@
#include <linux/nfs_fs.h>

#include "callback.h"
+#include "pnfs.h"

#ifdef CONFIG_NFS_V4
static const int nfs_set_port_min = 0;
@@ -42,6 +43,15 @@ static ctl_table nfs_cb_sysctls[] = {
},
#endif /* CONFIG_NFS_USE_NEW_IDMAPPER */
#endif
+#ifdef CONFIG_NFS_V4_1
+ {
+ .procname = "pnfs_layout_prefetch_kb",
+ .data = &pnfs_layout_prefetch_kb,
+ .maxlen = sizeof(pnfs_layout_prefetch_kb),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
{
.procname = "nfs_mountpoint_timeout",
.data = &nfs_mountpoint_expiry_timeout,
--
1.7.4.1
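
The effect of pnfs_set_layout_prefetch can be shown with a small userspace model: the requested length is raised to at least the prefetch size and otherwise left alone. This is a sketch, not the kernel code. `prefetch_length` is a hypothetical name, and unlike the patch it widens the KB value to 64 bits before shifting; the patch shifts an `unsigned int`, which would wrap for settings of 4194304 KB or more.

```c
#include <stdint.h>

/*
 * Return the length to request: at least prefetch_kb kilobytes,
 * or the caller's length if that is already larger.
 */
uint64_t prefetch_length(uint64_t requested, unsigned int prefetch_kb)
{
	/* widen first so that large KB values do not wrap in 32 bits */
	uint64_t floor_bytes = (uint64_t)prefetch_kb << 10;

	return requested < floor_bytes ? floor_bytes : requested;
}
```

With the patch's default of `2 << 10` KB, a one-page (4096 byte) request becomes a 2 MiB request, which is why the default changes behavior for every layout type even when the sysctl is never touched.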


2011-06-08 02:05:45

by Benny Halevy

Subject: Re: [PATCH 88/88] NFS41: do not update isize if inode needs layoutcommit

Better send generic patches separately.
This patch needs to go upstream and to stable 2.6.39

Benny

On 2011-06-07 13:36, Jim Rees wrote:
> From: Peng Tao <[email protected]>
>
> Layout commit is supposed to set server file size similar to nfs pages.
> We should not update client file size for the same reason.
> Otherwise we will lose what we have at hand.
>
> Signed-off-by: Peng Tao <[email protected]>
> Signed-off-by: Jim Rees <[email protected]>
> ---
> fs/nfs/inode.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 144f2a3..3f1eb81 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -1294,7 +1294,8 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
> if (new_isize != cur_isize) {
> /* Do we perhaps have any outstanding writes, or has
> * the file grown beyond our last write? */
> - if (nfsi->npages == 0 || new_isize > cur_isize) {
> + if ((nfsi->npages == 0 && !test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) ||
> + new_isize > cur_isize) {
> i_size_write(inode, new_isize);
> invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
> }

2011-06-07 17:26:09

by Jim Rees

Subject: [PATCH 02/88] pnfs: let layoutcommit code handle multiple segments

From: Peng Tao <[email protected]>

Some layout drivers, such as block, will have multiple segments.
Generic code should be able to handle them.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/pnfs.c | 15 ++++++++++++---
fs/nfs/pnfs.h | 1 +
2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 3b182cc..12b228a 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1206,10 +1206,18 @@ void pnfs_cleanup_layoutcommit(struct inode *inode,
static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
{
struct pnfs_layout_segment *lseg, *rv = NULL;
+ loff_t max_pos = 0;
+
+ list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
+ if (lseg->pls_range.iomode == IOMODE_RW) {
+ if (max_pos < lseg->pls_end_pos)
+ max_pos = lseg->pls_end_pos;
+ if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
+ rv = lseg;
+ }
+ }
+ rv->pls_end_pos = max_pos;

- list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
- if (lseg->pls_range.iomode == IOMODE_RW)
- rv = lseg;
return rv;
}

@@ -1224,6 +1232,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
/* references matched in nfs4_layoutcommit_release */
get_lseg(wdata->lseg);
+ set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags);
wdata->lseg->pls_lc_cred =
get_rpccred(wdata->args.context->state->owner->so_cred);
mark_as_dirty = true;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index b8548d8..b3481e5 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -36,6 +36,7 @@
enum {
NFS_LSEG_VALID = 0, /* cleared when lseg is recalled/returned */
NFS_LSEG_ROC, /* roc bit received from server */
+ NFS_LSEG_LAYOUTCOMMIT, /* layoutcommit bit set for layoutcommit */
};

struct pnfs_layout_segment {
--
1.7.4.1
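
The scan added to pnfs_list_write_lseg can be modelled in userspace as follows. This is a sketch under stated assumptions: `struct lseg`, `pick_commit_lseg` and the local `IOMODE_*` constants are simplified stand-ins, and the plain `needs_commit` flag replaces the NFS_LSEG_LAYOUTCOMMIT bit. The sketch also guards the final store with a NULL check, which the hunk above does not do; whether the kernel caller can ever reach that case is not established here.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the I/O modes used by the layout code. */
enum { SK_IOMODE_READ = 1, SK_IOMODE_RW = 2 };

/* Simplified layout segment. */
struct lseg {
	int iomode;
	int needs_commit;  /* stands in for NFS_LSEG_LAYOUTCOMMIT */
	uint64_t end_pos;  /* last byte position written through it */
};

/*
 * Pick the segment that carries the layoutcommit and give it the
 * largest end position seen across all read-write segments.
 */
struct lseg *pick_commit_lseg(struct lseg *segs, size_t n)
{
	struct lseg *rv = NULL;
	uint64_t max_pos = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (segs[i].iomode != SK_IOMODE_RW)
			continue;
		if (max_pos < segs[i].end_pos)
			max_pos = segs[i].end_pos;
		if (segs[i].needs_commit) {
			segs[i].needs_commit = 0; /* test-and-clear */
			rv = &segs[i];
		}
	}
	if (rv)	/* assumption: no flagged segment is possible */
		rv->end_pos = max_pos;
	return rv;
}
```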


2011-06-09 06:08:29

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-08 03:15, Peng Tao wrote:
> On 6/8/11, Jim Rees <[email protected]> wrote:
>> Benny Halevy wrote:
>>
>> NAK.
>> This affects all layout types. In particular it is undesired
>> for write layouts that extend the file with the objects layout.
>> The server can extend the layout segments range
>> over what the client requested so why would the client
>> ask for artificially large layouts?
>>
>> This has actually been the subject of some debate over Thursday night
>> beers. The problem we're trying to solve is that the client is spending 98%
>> of its time in layoutget. This patch gives us something like a 10x
>> speedup. But many of us think it's not the right fix. I suggest we discuss
>> next week.
>>

Sure.

>> But note that this patch doesn't change anything unless you set the sysctl.
> there is a default value of 2M. maybe we can set it to page size by
> default so other layout are not affected and block layout can let
> users set it by hand if they care about performance. does this make
> sense?

If doing it at all why use a sysctl rather than a mount option?
Or maybe coding the logic for prefetching the layout iff sequential
access is detected is the right thing to do.

Benny


2011-06-09 22:15:30

by Jim Rees

Subject: Re: [PATCH 00/88] pnfs block layout driver

Boaz Harrosh wrote:

Who is going to SQUASH all the SQUASHMEs and rethink the whole patch
separation, into something that makes a more logical progression
and is easier to review? The way it is now I'm not able to review;
sorry, I got lost trying to understand which is which.

I'm open to suggestions and happy to do the work. I agree that 88 patches
is nearly indigestible. However I note that Benny seems to have pulled in
the entire set so I'm not sure how to proceed at this point. Also this code
was in Benny's 2.6.38 and only got dropped when the 3.0 merge came along, so
most of it's already been under review for a year or more.

2011-06-09 11:49:31

by Jim Rees

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

Benny Halevy wrote:

>> But note that this patch doesn't change anything unless you set the sysctl.
> there is a default value of 2M. maybe we can set it to page size by
> default so other layout are not affected and block layout can let
> users set it by hand if they care about performance. does this make
> sense?

If doing it at all why use a sysctl rather than a mount option?
Or maybe coding the logic for prefetching the layout iff sequential
access is detected is the right thing to do.

I would rather see some automatic solution than to add either a sysctl or a
mount option. For now you can just drop that patch, as it's not needed for
basic pnfs block.

My understanding is that layoutget specifies a min and max, and the server
is returning the min. Trond and Fred believe this should be fixed on the
server. Here's the original report of the problem:

From: Bergwolf

From the network trace for pnfs, we can see that the root cause of the slow
performance is too many small layoutgets. Specifically, the client asks for a
layout of only one 4K page (and the server returns 8K due to block size
alignment) each time.

The total IO time is 256/1.68 = 152 seconds.
There are 256*1024/8 = 32768 layoutgets for the 256MB file.
On average, the time spent on each layoutget is 0.00456 seconds according to the
trace.
The total layoutget time is 32768 * 0.00456 = 149 seconds, which takes up about
98% of the total IO time.

So we should optimize layoutget's granularity to get better performance. For
instance, use a configurable prefetch size of 2MB or so.
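
The arithmetic in the report can be reproduced with two small helpers. This is a sketch: `layoutget_count` and `layoutget_seconds` are hypothetical names, and the per-call latency is assumed to stay at the traced 0.00456 seconds regardless of segment size, which may not hold in practice.

```c
#include <stdint.h>

/* Number of LAYOUTGET calls needed to cover file_bytes when each
 * call yields a segment of seg_bytes (rounded up). */
uint64_t layoutget_count(uint64_t file_bytes, uint64_t seg_bytes)
{
	return (file_bytes + seg_bytes - 1) / seg_bytes;
}

/* Total time spent in LAYOUTGET for a fixed per-call latency. */
double layoutget_seconds(uint64_t calls, double per_call_sec)
{
	return (double)calls * per_call_sec;
}
```

For the 256MB file, 8K segments give 32768 calls and roughly 149 seconds, matching the report. Under the same latency assumption, 2MB segments would need only 128 calls, or well under one second in total.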

2011-06-07 17:31:48

by Jim Rees

Subject: [PATCH 47/88] pnfsblock: use the session max response size for getdeviceinfo's maxcount

From: Mike Sager <[email protected]>

Per Trond, no need to try a small maxcount first. Base the maxcount
on the session's max response size.

Same change as for files. Untested for blocks though.

Signed-off-by: Mike Sager <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 62 ++++++++++++++++++++-----------------
1 files changed, 33 insertions(+), 29 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index cf306e9..65cf104 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -694,59 +694,63 @@ nfs4_blk_get_deviceinfo(struct super_block *sb, struct nfs_fh *fh,
{
struct pnfs_device *dev;
struct pnfs_block_dev *rv = NULL;
- int maxpages = NFS4_GETDEVINFO_MAXSIZE >> PAGE_SHIFT;
- struct page *pages[maxpages];
- int alloced_pages = 0, used_pages = 1;
- int j, rc;
+ u32 max_resp_sz;
+ int max_pages;
+ struct page **pages = NULL;
+ int i, rc;
+ struct nfs_server *server = NFS_SB(sb);
+
+ /*
+ * Use the session max response size as the basis for setting
+ * GETDEVICEINFO's maxcount
+ */
+ max_resp_sz = server->nfs_client->cl_session->fc_attrs.max_resp_sz;
+ max_pages = max_resp_sz >> PAGE_SHIFT;
+ dprintk("%s max_resp_sz %u max_pages %d\n",
+ __func__, max_resp_sz, max_pages);

- dprintk("%s enter\n", __func__);
dev = kmalloc(sizeof(*dev), GFP_KERNEL);
if (!dev) {
dprintk("%s kmalloc failed\n", __func__);
return NULL;
}
- retry_once:
- dprintk("%s trying used_pages %d\n", __func__, used_pages);
- for (; alloced_pages < used_pages; alloced_pages++) {
- pages[alloced_pages] = alloc_page(GFP_KERNEL);
- if (!pages[alloced_pages])
- goto out_free;
+
+ pages = kzalloc(max_pages * sizeof(struct page *), GFP_KERNEL);
+ if (pages == NULL) {
+ kfree(dev);
+ return NULL;
}
- /* set dev->area */
- if (used_pages == 1)
- dev->area = page_address(pages[0]);
- else {
- dev->area = vmap(pages, used_pages, VM_MAP, PAGE_KERNEL);
- if (!dev->area)
+ for (i = 0; i < max_pages; i++) {
+ pages[i] = alloc_page(GFP_KERNEL);
+ if (!pages[i])
goto out_free;
}

+ /* set dev->area */
+ dev->area = vmap(pages, max_pages, VM_MAP, PAGE_KERNEL);
+ if (!dev->area)
+ goto out_free;
+
memcpy(&dev->dev_id, d_id, sizeof(*d_id));
dev->layout_type = LAYOUT_BLOCK_VOLUME;
dev->dev_notify_types = 0;
dev->pages = pages;
dev->pgbase = 0;
- dev->pglen = PAGE_SIZE * used_pages;
+ dev->pglen = PAGE_SIZE * max_pages;
dev->mincount = 0;

rc = pnfs_callback_ops->nfs_getdeviceinfo(sb, dev);
- dprintk("%s getdevice info returns %d used_pages %d\n", __func__, rc,
- used_pages);
- if (rc == -ETOOSMALL && used_pages == 1) {
- dev->area = NULL;
- used_pages = (dev->mincount + PAGE_SIZE - 1) >> PAGE_SHIFT;
- if (used_pages > 1 && used_pages <= maxpages)
- goto retry_once;
- }
+ dprintk("%s getdevice info returns %d\n", __func__, rc);
if (rc)
goto out_free;

rv = nfs4_blk_decode_device(sb, dev, sdlist);
out_free:
- if (used_pages > 1 && dev->area != NULL)
+ if (dev->area != NULL)
vunmap(dev->area);
- for (j = 0; j < alloced_pages; j++)
- __free_page(pages[j]);
+ for (i = 0; i < max_pages; i++)
+ __free_page(pages[i]);
+ kfree(pages);
kfree(dev);
return rv;
}
--
1.7.4.1
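
The page count in this patch is derived from the session's maximum response size with a truncating shift. The sketch below contrasts that with a round-up variant; it is illustrative only, assumes 4096-byte pages through the hypothetical `SK_PAGE_SHIFT`, and uses the hypothetical helper names `pages_round_down` and `pages_round_up`.

```c
#include <stdint.h>

#define SK_PAGE_SHIFT 12 /* assumption: 4096-byte pages */

/* What the patch computes: max_resp_sz >> PAGE_SHIFT. */
uint32_t pages_round_down(uint32_t max_resp_sz)
{
	return max_resp_sz >> SK_PAGE_SHIFT;
}

/* Alternative that also covers a partial trailing page. */
uint32_t pages_round_up(uint32_t max_resp_sz)
{
	uint64_t page_size = 1ULL << SK_PAGE_SHIFT;

	/* 64-bit arithmetic so the addition cannot wrap */
	return (uint32_t)(((uint64_t)max_resp_sz + page_size - 1)
			  >> SK_PAGE_SHIFT);
}
```

With truncation, a response size below one page yields zero pages, and any size that is not a page multiple loses its tail. Whether that matters depends on the sizes a session actually negotiates, which this thread does not state.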


2011-06-07 17:27:25

by Jim Rees

Subject: [PATCH 13/88] pnfsblock: scan scsi devices

From: Fred Isaman <[email protected]>

Scan scsi devices available to map against future GETDEVICEINFO output.

[pnfsblock: use class iteration api]
[pnfsblock: fix oops when using multiple devices]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/Makefile | 2 +-
fs/nfs/blocklayout/blocklayout.c | 13 ++-
fs/nfs/blocklayout/blocklayout.h | 53 ++++++++
fs/nfs/blocklayout/blocklayoutdev.c | 231 +++++++++++++++++++++++++++++++++++
4 files changed, 295 insertions(+), 4 deletions(-)
create mode 100644 fs/nfs/blocklayout/blocklayout.h
create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 6bf49cd..36d959f 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
# Makefile for the pNFS block layout driver kernel module
#
obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o
+blocklayoutdriver-objs := blocklayout.o blocklayoutdev.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 1312849..9889f27 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -32,9 +32,7 @@
#include <linux/module.h>
#include <linux/init.h>

-#include <linux/nfs_fs.h>
-#include <linux/pnfs_xdr.h> /* Needed by nfs4_pnfs.h */
-#include <linux/nfs4_pnfs.h>
+#include "blocklayout.h"

#define NFSDBG_FACILITY NFSDBG_PNFS_LD

@@ -135,10 +133,19 @@ bl_cleanup_layoutcommit(struct pnfs_layout_type *lo,
dprintk("%s enter\n", __func__);
}

+/*
+ * This is just a STUB to check the scsi scanning code
+ */
static struct pnfs_mount_type *
bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)
{
+ LIST_HEAD(scsi_disklist);
+
dprintk("%s enter\n", __func__);
+
+ nfs4_blk_create_scsi_disk_list(&scsi_disklist);
+ nfs4_blk_destroy_disk_list(&scsi_disklist);
+
return NULL;
}

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
new file mode 100644
index 0000000..5dbb8f2
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -0,0 +1,53 @@
+/*
+ * linux/fs/nfs/blocklayout/blocklayout.h
+ *
+ * Module for the NFSv4.1 pNFS block layout driver.
+ *
+ * Copyright (c) 2006 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Andy Adamson <[email protected]>
+ * Fred Isaman <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#ifndef FS_NFS_NFS4BLOCKLAYOUT_H
+#define FS_NFS_NFS4BLOCKLAYOUT_H
+
+#include <linux/nfs_fs.h>
+#include <linux/pnfs_xdr.h> /* Needed by nfs4_pnfs.h */
+#include <linux/nfs4_pnfs.h>
+
+extern struct class shost_class; /* exported from drivers/scsi/hosts.c */
+
+/* holds visible disks that can be matched against VOLUME_SIMPLE signatures */
+struct visible_block_device {
+ struct list_head vi_node;
+ struct block_device *vi_bdev;
+ int vi_mapped;
+ int vi_put_done;
+};
+
+/* blocklayoutdev.c */
+int nfs4_blk_create_scsi_disk_list(struct list_head *);
+void nfs4_blk_destroy_disk_list(struct list_head *);
+
+#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
new file mode 100644
index 0000000..b4f52fb
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -0,0 +1,231 @@
+/*
+ * linux/fs/nfs/blocklayout/blocklayoutdev.c
+ *
+ * Device operations for the pnfs nfs4 block layout driver.
+ *
+ * Copyright (c) 2006 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Andy Adamson <[email protected]>
+ * Fred Isaman <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#include <linux/module.h>
+#include <linux/buffer_head.h> /* __bread */
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_host.h>
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+#define MAX_VOLS 256 /* Maximum number of SCSI disks. Totally arbitrary */
+
+uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes)
+{
+ uint32_t *q = p + XDR_QUADLEN(nbytes);
+ if (unlikely(q > end || q < p))
+ return NULL;
+ return p;
+}
+EXPORT_SYMBOL(blk_overflow);
+
+/* Open a block_device by device number. */
+static struct block_device *nfs4_blkdev_get(dev_t dev)
+{
+ struct block_device *bd;
+
+ dprintk("%s enter\n", __func__);
+ bd = open_by_devnum(dev, FMODE_READ);
+ if (IS_ERR(bd))
+ goto fail;
+ return bd;
+fail:
+ dprintk("%s failed to open device : %ld\n",
+ __func__, PTR_ERR(bd));
+ return NULL;
+}
+
+/*
+ * Release the block device
+ */
+static int nfs4_blkdev_put(struct block_device *bdev)
+{
+ dprintk("%s for device %d:%d\n", __func__, MAJOR(bdev->bd_dev),
+ MINOR(bdev->bd_dev));
+ bd_release(bdev);
+ return blkdev_put(bdev, FMODE_READ);
+}
+
+/* Add a visible, claimed (by us!) scsi disk to the device list */
+static int alloc_add_disk(struct block_device *blk_dev, struct list_head *dlist)
+{
+ struct visible_block_device *vis_dev;
+
+ dprintk("%s enter\n", __func__);
+ vis_dev = kmalloc(sizeof(struct visible_block_device), GFP_KERNEL);
+ if (!vis_dev) {
+ dprintk("%s kmalloc failed\n", __func__);
+ return -ENOMEM;
+ }
+ vis_dev->vi_bdev = blk_dev;
+ vis_dev->vi_mapped = 0;
+ vis_dev->vi_put_done = 0;
+ list_add(&vis_dev->vi_node, dlist);
+ return 0;
+}
+
+/* Walk the list of scsi_devices. Add disks that can be opened and claimed
+ * to the device list
+ */
+static int
+nfs4_blk_add_scsi_disk(struct Scsi_Host *shost,
+ int index, struct list_head *dlist)
+{
+ static char *claim_ptr = "I belong to pnfs block driver";
+ struct block_device *bdev;
+ struct gendisk *gd;
+ struct scsi_device *sdev;
+ unsigned int major, minor, ret = 0;
+ dev_t dev;
+
+ dprintk("%s enter \n", __func__);
+ if (index >= MAX_VOLS) {
+ dprintk("%s MAX_VOLS hit\n", __func__);
+ return -ENOSPC;
+ }
+ dprintk("%s 1 \n", __func__);
+ index--;
+ shost_for_each_device(sdev, shost) {
+ dprintk("%s 2\n", __func__);
+ /* Need to do this check before bumping index */
+ if (sdev->type != TYPE_DISK)
+ continue;
+ dprintk("%s 3 index %d \n", __func__, index);
+ if (++index >= MAX_VOLS) {
+ scsi_device_put(sdev);
+ break;
+ }
+ major = (!(index >> 4) ? SCSI_DISK0_MAJOR :
+ SCSI_DISK1_MAJOR-1 + (index >> 4));
+ minor = ((index << 4) & 255);
+
+ dprintk("%s SCSI device %d:%d \n", __func__, major, minor);
+
+ dev = MKDEV(major, minor);
+ bdev = nfs4_blkdev_get(dev);
+ if (!bdev) {
+ dprintk("%s: failed to open device %d:%d\n",
+ __func__, major, minor);
+ continue;
+ }
+ gd = bdev->bd_disk;
+
+ dprintk("%s 4\n", __func__);
+
+ if (bd_claim(bdev, claim_ptr)) {
+ dprintk("%s: failed to claim device %d:%d\n",
+ __func__, gd->major, gd->first_minor);
+ blkdev_put(bdev, FMODE_READ);
+ continue;
+ }
+
+ ret = alloc_add_disk(bdev, dlist);
+ if (ret < 0)
+ goto out_err;
+ dprintk("%s ADDED DEVICE capacity %ld, bd_block_size %d\n",
+ __func__,
+ (unsigned long)get_capacity(gd),
+ bdev->bd_block_size);
+
+ }
+ index++;
+ dprintk("%s returns index %d \n", __func__, index);
+ return index;
+
+out_err:
+ dprintk("%s Can't add disk to list. ERROR: %d\n", __func__, ret);
+ nfs4_blkdev_put(bdev);
+ return ret;
+}
+
+/* Destroy the temporary scsi disk list */
+void nfs4_blk_destroy_disk_list(struct list_head *dlist)
+{
+ struct visible_block_device *vis_dev;
+
+ dprintk("%s enter\n", __func__);
+ while (!list_empty(dlist)) {
+ vis_dev = list_first_entry(dlist, struct visible_block_device,
+ vi_node);
+ dprintk("%s removing device %d:%d\n", __func__,
+ MAJOR(vis_dev->vi_bdev->bd_dev),
+ MINOR(vis_dev->vi_bdev->bd_dev));
+ list_del(&vis_dev->vi_node);
+ if (!vis_dev->vi_put_done)
+ nfs4_blkdev_put(vis_dev->vi_bdev);
+ kfree(vis_dev);
+ }
+}
+
+struct nfs4_blk_scsi_disk_list_ctl {
+ struct list_head *dlist;
+ int index;
+};
+
+static int nfs4_blk_iter_scsi_disk_list(struct device *cdev, void *data)
+{
+ struct Scsi_Host *shost;
+ struct nfs4_blk_scsi_disk_list_ctl *lc = data;
+ int ret;
+
+ dprintk("%s enter\n", __func__);
+ shost = class_to_shost(cdev);
+ ret = nfs4_blk_add_scsi_disk(shost, lc->index, lc->dlist);
+ dprintk("%s 1 ret %d\n", __func__, ret);
+ if (ret >= 0) {
+ lc->index = ret;
+ ret = 0;
+ }
+ return ret;
+}
+
+/*
+ * Create a temporary list of all SCSI disks host can see, and that have not
+ * yet been claimed.
+ * shost_class: list of all registered scsi_hosts
+ * returns -errno on error, and #of devices found on success.
+ * XXX Loosely emulate scsi_host_lookup from drivers/scsi/hosts.c
+*/
+int nfs4_blk_create_scsi_disk_list(struct list_head *dlist)
+{
+ struct nfs4_blk_scsi_disk_list_ctl lc = {
+ .dlist = dlist,
+ .index = 0,
+ };
+
+ dprintk("%s enter\n", __func__);
+ return class_for_each_device(&shost_class, NULL,
+ &lc, nfs4_blk_iter_scsi_disk_list);
+}
--
1.7.4.1
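
Note on the device numbering in nfs4_blk_add_scsi_disk() above: the function derives a dev_t from a running disk index instead of asking the SCSI layer for it. Below is a minimal userspace sketch of that arithmetic, assuming the major numbers from include/linux/major.h (8 and 65) and 16 minors per disk. The helper names disk_major() and disk_minor() are made up for illustration and are not part of the patch.

```c
#include <assert.h>

/* Values as defined in include/linux/major.h */
#define SCSI_DISK0_MAJOR 8
#define SCSI_DISK1_MAJOR 65

/* Hypothetical helper: major number the patch computes for disk "index". */
static unsigned int disk_major(int index)
{
	assert(index >= 0);
	/* The first 16 disks live on major 8, later groups of 16 on 65, 66, ... */
	return !(index >> 4) ? SCSI_DISK0_MAJOR
			     : SCSI_DISK1_MAJOR - 1 + (index >> 4);
}

/* Hypothetical helper: first minor of disk "index" (16 minors per disk). */
static unsigned int disk_minor(int index)
{
	assert(index >= 0);
	return (index << 4) & 255;
}

/*
 * index  0 ->  8:0
 * index  1 ->  8:16
 * index 15 ->  8:240
 * index 16 -> 65:0
 * index 17 -> 65:16
 */
```

Majors 65 through 71 cover only indices 16 to 127, so the formula appears to stop matching sd majors from index 128 on, even though MAX_VOLS is 256. It also assumes disks are numbered in scan order, which the patch itself marks as STUB behaviour.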


2011-06-07 17:35:55

by Jim Rees

Subject: [PATCH 85/88] SQUASHME: pnfs-block: use pnfs_layout_hdr field prefix

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 2 +-
fs/nfs/blocklayout/blocklayout.h | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 161c113..2583b87 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -619,7 +619,7 @@ static int
bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
struct nfs4_layoutcommit_args *arg)
{
- struct nfs_server *nfss = NFS_SERVER(lo->inode);
+ struct nfs_server *nfss = NFS_SERVER(lo->plh_inode);
struct bl_layoutupdate_data *layoutupdate_data;

dprintk("%s enter\n", __func__);
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 9e7bd62..a8198ae 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -191,7 +191,7 @@ struct bl_layoutupdate_data {
struct list_head ranges;
};

-#define BLK_ID(lo) ((struct block_mount_id *)(NFS_SERVER(lo->inode)->pnfs_ld_data))
+#define BLK_ID(lo) ((struct block_mount_id *)(NFS_SERVER(lo->plh_inode)->pnfs_ld_data))

static inline struct pnfs_block_layout *
BLK_LO2EXT(struct pnfs_layout_hdr *lo)
@@ -202,7 +202,7 @@ BLK_LO2EXT(struct pnfs_layout_hdr *lo)
static inline struct pnfs_block_layout *
BLK_LSEG2EXT(struct pnfs_layout_segment *lseg)
{
- return BLK_LO2EXT(lseg->layout);
+ return BLK_LO2EXT(lseg->pls_layout);
}

uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
--
1.7.4.1


2011-06-10 19:07:49

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 10:30, [email protected] wrote:
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
> Sent: Friday, June 10, 2011 8:48 PM
> To: Peng, Tao
> Cc: [email protected]; [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-10 02:02, [email protected] wrote:
>> Hi, Benny,
>>
>> Cheers,
>> -Bergwolf
>>
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
>> Sent: Friday, June 10, 2011 5:30 AM
>> To: Peng Tao
>> Cc: Jim Rees; [email protected]; peter honeyman
>> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>>
>> On 2011-06-09 07:54, Peng Tao wrote:
>>> On Thu, Jun 9, 2011 at 2:06 PM, Benny Halevy <[email protected]> wrote:
>>>> On 2011-06-08 03:15, Peng Tao wrote:
>>>>> On 6/8/11, Jim Rees <[email protected]> wrote:
>>>>>> Benny Halevy wrote:
>>>>>>
>>>>>> NAK.
>>>>>> This affects all layout types. In particular it is undesired
>>>>>> for write layouts that extend the file with the objects layout.
>>>>>> The server can extend the layout segments range
>>>>>> over what the client requested so why would the client
>>>>>> ask for artificially large layouts?
>>>>>>
>>>>>> This has actually been the subject of some debate over Thursday night
>>>>>> beers. The problem we're trying to solve is that the client is spending 98%
>>>>>> of its time in layoutget. This patch gives us something like a 10x
>>>>>> speedup. But many of us think it's not the right fix. I suggest we discuss
>>>>>> next week.
>>>>>>
>>>>
>>>> Sure.
>>>>
>>>>>> But note that this patch doesn't change anything unless you set the sysctl.
>>>>> there is a default value of 2M. maybe we can set it to page size by
>>>>> default so other layout are not affected and block layout can let
>>>>> users set it by hand if they care about performance. does this make
>>>>> sense?
>>>>
>>>> If doing it at all why use a sysctl rather than a mount option?
>>> The purpose of using a sysctl is to give client the ability to change
>>> it on the fly. In theory, layout prefetching can benefit all layout
>>> types. So the patch tries to solve it in the pnfs generic layer.
>>>
>>
>> But the need for this varies per-server and many times per application.
>> Think sequential vs. random I/O. Therefore a mount option would help
>> tuning the behavior on a per-use basis. Global behavior must be implemented
>> using a dynamic algorithm that would take both the workload and the server
>> observed behavior into account.
>> [PT] Indeed. A dynamic algorithm should be able to solve all this, but it often takes longer to design and get accepted. It has to prove better in most scenarios without hurting the rest.
>
> We need to find an acceptable solution to push this driver upstream.
> I understand that developing a dynamic algorithm in the given time frame is
> too big of a challenge, but hacking in yet another client tunable is out of the
> question too. For testing at the Bakeathon I'd consider taking a DEVONLY version
> of this patch that is enabled using a config option and defaults to zero, so it has no effect
> at run time until the sysctl sets it differently.
> But keep in mind this is not suitable for pushing upstream.
> [PT] Thanks for your understanding. We truly want to solve the performance problem and are open to suggestions. And this will be a feature that benefits all layout types, am I right?

Adaptive layout prefetching at the client side is a nice workaround for
naive server implementations but you can't get around implementing
a more sophisticated algorithm on the server side. As for the benefit,
if you implement such an algorithm on the client side, I would like it
to be implemented generically so that all layout types could benefit
from it.

Benny

>
> -Tao
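
For readers following the thread without patch 87 at hand, here is a rough sketch of the behaviour being debated: a tunable that widens the length the client asks for in LAYOUTGET, with zero meaning "leave the request alone" as suggested above. It is an assumption about the idea, not the code from the patch; layoutget_length() and its parameters are hypothetical names.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the length to request in LAYOUTGET.
 * prefetch_bytes == 0 disables widening, so the caller's range is
 * passed through unchanged.
 */
static uint64_t layoutget_length(uint64_t requested_bytes,
				 uint64_t prefetch_bytes)
{
	if (prefetch_bytes != 0 && requested_bytes < prefetch_bytes)
		return prefetch_bytes;
	return requested_bytes;
}
```

With a 2M tunable a 4K write would ask for a 2M layout, which is the "artificially large layout" objected to in the NAK. With the tunable at zero the same write asks for 4K and the server stays free to extend the range itself.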

2011-06-07 17:27:33

by Jim Rees

Subject: [PATCH 14/88] pnfsblock: call and parse getdevicelist

From: Fred Isaman <[email protected]>

Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO
for each device returned.

[pnfsblock: fix pnfs_deviceid references]
Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: fix print format warnings for sector_t and size_t]
[pnfs-block: #include <linux/vmalloc.h>]
Signed-off-by: Benny Halevy <[email protected]>
[pnfsblock: fix bug determining size of striped volume]
[pnfsblock: fix oops when using multiple devices]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/Makefile | 2 +-
fs/nfs/blocklayout/blocklayout.c | 163 +++++++++++++++++-
fs/nfs/blocklayout/blocklayout.h | 89 ++++++++++
fs/nfs/blocklayout/blocklayoutdev.c | 324 +++++++++++++++++++++++++++++++++++
fs/nfs/blocklayout/blocklayoutdm.c | 72 ++++++++
5 files changed, 646 insertions(+), 4 deletions(-)
create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c
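
Note on the parsing half of this patch: the GETDEVICEINFO reply is decoded with the BLK_READBUF, READ64 and READ_SECTOR macros added to blocklayout.h below. The following userspace sketch shows the same steps as one function, assuming a buffer of big-endian 32-bit XDR words. read_sector() is a hypothetical name, and the bounds check compares the remaining word count instead of using the pointer arithmetic in blk_overflow().

```c
#include <arpa/inet.h>	/* ntohl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XDR_QUADLEN(n) (((n) + 3) >> 2)	/* bytes -> 32-bit XDR words */

/*
 * Hypothetical helper: decode one 64-bit byte count at *pp and convert
 * it to 512-byte sectors. Returns 0 and advances *pp on success, or -1
 * (leaving *pp untouched) if the buffer is too short or the value is
 * not a multiple of 512.
 */
static int read_sector(const uint32_t **pp, const uint32_t *end,
		       uint64_t *sector)
{
	const uint32_t *p = *pp;
	uint64_t bytes;

	assert(p <= end);
	if ((size_t)(end - p) < XDR_QUADLEN(8))
		return -1;		/* reply buffer would overflow */
	bytes = (uint64_t)ntohl(*p++) << 32;	/* high word first */
	bytes |= ntohl(*p++);
	if (bytes & 0x1ff)
		return -1;		/* not 512-byte aligned */
	*sector = bytes >> 9;
	*pp = p;
	return 0;
}
```

Leaving *pp untouched on failure is a choice made for the sketch. The macros in the patch jump to out_err instead, so the caller never looks at the cursor again.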

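Note on "fix bug determining size of striped volume": in decode_blk_volume() below, a stripe volume's size counts only whole stripe units on each subvolume, so any tail shorter than one stripe unit is not addressable. Here is a small userspace sketch of that calculation, with all quantities in 512-byte sectors. stripe_volume_sectors() is a hypothetical name, and the zero check is an addition for the sketch that the patch does not have.

```c
#include <stdint.h>

/*
 * Hypothetical helper: usable size of a stripe volume built from
 * nr_subvols equally sized subvolumes.
 */
static uint64_t stripe_volume_sectors(uint64_t subvol_sectors,
				      uint32_t stripe_unit_sectors,
				      uint32_t nr_subvols)
{
	uint64_t whole_units;

	if (stripe_unit_sectors == 0)
		return 0;	/* sketch-only guard against division by zero */
	/* Round each subvolume down to a whole number of stripe units. */
	whole_units = subvol_sectors / stripe_unit_sectors;
	return (uint64_t)nr_subvols * whole_units * stripe_unit_sectors;
}
```

For example, four subvolumes of 1000 sectors with a 128-sector stripe unit give 7 whole units each, so the volume is 4 * 7 * 128 = 3584 sectors rather than 4000.
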
diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 36d959f..2c4c062 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
# Makefile for the pNFS block layout driver kernel module
#
obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o blocklayoutdev.o
+blocklayoutdriver-objs := blocklayout.o blocklayoutdev.o blocklayoutdm.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 9889f27..ebaa48a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -32,6 +32,7 @@
#include <linux/module.h>
#include <linux/init.h>

+#include <linux/vmalloc.h>
#include "blocklayout.h"

#define NFSDBG_FACILITY NFSDBG_PNFS_LD
@@ -133,26 +134,182 @@ bl_cleanup_layoutcommit(struct pnfs_layout_type *lo,
dprintk("%s enter\n", __func__);
}

+static void free_blk_mountid(struct block_mount_id *mid)
+{
+ if (mid) {
+ struct pnfs_block_dev *dev;
+ spin_lock(&mid->bm_lock);
+ while (!list_empty(&mid->bm_devlist)) {
+ dev = list_first_entry(&mid->bm_devlist,
+ struct pnfs_block_dev,
+ bm_node);
+ list_del(&dev->bm_node);
+ free_block_dev(dev);
+ }
+ spin_unlock(&mid->bm_lock);
+ kfree(mid);
+ }
+}
+
+/* This is mostly copied from the filelayout's get_device_info function.
+ * It seems much of this should be at the generic pnfs level.
+ */
+static struct pnfs_block_dev *
+nfs4_blk_get_deviceinfo(struct super_block *sb, struct nfs_fh *fh,
+ struct pnfs_deviceid *d_id,
+ struct list_head *sdlist)
+{
+ struct pnfs_device *dev;
+ struct pnfs_block_dev *rv = NULL;
+ int maxpages = NFS4_GETDEVINFO_MAXSIZE >> PAGE_SHIFT;
+ struct page *pages[maxpages];
+ int alloced_pages = 0, used_pages = 1;
+ int j, rc;
+
+ dprintk("%s enter\n", __func__);
+ dev = kmalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev) {
+ dprintk("%s kmalloc failed\n", __func__);
+ return NULL;
+ }
+ retry_once:
+ dprintk("%s trying used_pages %d\n", __func__, used_pages);
+ for (; alloced_pages < used_pages; alloced_pages++) {
+ pages[alloced_pages] = alloc_page(GFP_KERNEL);
+ if (!pages[alloced_pages])
+ goto out_free;
+ }
+ /* set dev->area */
+ if (used_pages == 1)
+ dev->area = page_address(pages[0]);
+ else {
+ dev->area = vmap(pages, used_pages, VM_MAP, PAGE_KERNEL);
+ if (!dev->area)
+ goto out_free;
+ }
+
+ memcpy(&dev->dev_id, d_id, sizeof(*d_id));
+ dev->layout_type = LAYOUT_BLOCK_VOLUME;
+ dev->dev_notify_types = 0;
+ dev->pages = pages;
+ dev->pgbase = 0;
+ dev->pglen = PAGE_SIZE * used_pages;
+ dev->mincount = 0;
+
+ rc = pnfs_callback_ops->nfs_getdeviceinfo(sb, dev);
+ dprintk("%s getdevice info returns %d used_pages %d\n", __func__, rc,
+ used_pages);
+ if (rc == -ETOOSMALL && used_pages == 1) {
+ dev->area = NULL;
+ used_pages = (dev->mincount + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (used_pages > 1 && used_pages <= maxpages)
+ goto retry_once;
+ }
+ if (rc)
+ goto out_free;
+
+ rv = nfs4_blk_decode_device(sb, dev, sdlist);
+ out_free:
+ if (used_pages > 1 && dev->area != NULL)
+ vunmap(dev->area);
+ for (j = 0; j < alloced_pages; j++)
+ __free_page(pages[j]);
+ kfree(dev);
+ return rv;
+}
+
+
/*
- * This is just a STUB to check the scsi scanning code
+ * Retrieve the list of available devices for the mountpoint.
*/
static struct pnfs_mount_type *
bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)
{
+ struct block_mount_id *b_mt_id = NULL;
+ struct pnfs_mount_type *mtype = NULL;
+ struct pnfs_devicelist *dlist = NULL;
+ struct pnfs_block_dev *bdev;
LIST_HEAD(scsi_disklist);
+ int status, i;

dprintk("%s enter\n", __func__);

- nfs4_blk_create_scsi_disk_list(&scsi_disklist);
+ if (NFS_SB(sb)->pnfs_blksize == 0) {
+ dprintk("%s Server did not return blksize\n", __func__);
+ return NULL;
+ }
+ b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_KERNEL);
+ if (!b_mt_id)
+ goto out_error;
+ /* Initialize nfs4 block layout mount id */
+ b_mt_id->bm_sb = sb; /* back pointer to retrieve nfs_server struct */
+ spin_lock_init(&b_mt_id->bm_lock);
+ INIT_LIST_HEAD(&b_mt_id->bm_devlist);
+ mtype = kzalloc(sizeof(struct pnfs_mount_type), GFP_KERNEL);
+ if (!mtype)
+ goto out_error;
+ mtype->mountid = (void *)b_mt_id;
+
+ /* Construct a list of all visible scsi disks that have not been
+ * claimed.
+ */
+ status = nfs4_blk_create_scsi_disk_list(&scsi_disklist);
+ if (status < 0)
+ goto out_error;
+
+ dlist = kmalloc(sizeof(struct pnfs_devicelist), GFP_KERNEL);
+ if (!dlist)
+ goto out_error;
+ dlist->eof = 0;
+ while (!dlist->eof) {
+ status = pnfs_callback_ops->nfs_getdevicelist(sb, fh, dlist);
+ if (status)
+ goto out_error;
+ dprintk("%s GETDEVICELIST numdevs=%i, eof=%i\n",
+ __func__, dlist->num_devs, dlist->eof);
+ /* For each device returned in dlist, call GETDEVICEINFO, and
+ * decode the opaque topology encoding to create a flat
+ * volume topology, matching VOLUME_SIMPLE disk signatures
+ * to disks in the visible scsi disk list.
+ * Construct an LVM meta device from the flat volume topology.
+ */
+ for (i = 0; i < dlist->num_devs; i++) {
+ bdev = nfs4_blk_get_deviceinfo(sb, fh,
+ &dlist->dev_id[i],
+ &scsi_disklist);
+ if (!bdev)
+ goto out_error;
+ spin_lock(&b_mt_id->bm_lock);
+ list_add(&bdev->bm_node, &b_mt_id->bm_devlist);
+ spin_unlock(&b_mt_id->bm_lock);
+ }
+ }
+ dprintk("%s SUCCESS\n", __func__);
+
+ out_return:
+ kfree(dlist);
nfs4_blk_destroy_disk_list(&scsi_disklist);
+ return mtype;

- return NULL;
+ out_error:
+ free_blk_mountid(b_mt_id);
+ kfree(mtype);
+ mtype = NULL;
+ goto out_return;
}

static int
bl_uninitialize_mountpoint(struct pnfs_mount_type *mtype)
{
+ struct block_mount_id *b_mt_id = NULL;
+
dprintk("%s enter\n", __func__);
+ if (!mtype)
+ return 0;
+ b_mt_id = (struct block_mount_id *)mtype->mountid;
+ free_blk_mountid(b_mt_id);
+ kfree(mtype);
+ dprintk("%s RETURNS\n", __func__);
return 0;
}

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 5dbb8f2..4af6685 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -38,6 +38,19 @@

extern struct class shost_class; /* exported from drivers/scsi/hosts.c */

+struct block_mount_id {
+ struct super_block *bm_sb; /* back pointer */
+ spinlock_t bm_lock; /* protects list */
+ struct list_head bm_devlist; /* holds pnfs_block_dev */
+};
+
+struct pnfs_block_dev {
+ struct list_head bm_node;
+ char *bm_mdevname; /* meta device name */
+ struct pnfs_deviceid bm_mdevid; /* associated devid */
+ struct block_device *bm_mdev; /* meta device itself */
+};
+
/* holds visible disks that can be matched against VOLUME_SIMPLE signatures */
struct visible_block_device {
struct list_head vi_node;
@@ -46,8 +59,84 @@ struct visible_block_device {
int vi_put_done;
};

+enum blk_vol_type {
+ PNFS_BLOCK_VOLUME_SIMPLE = 0, /* maps to a single LU */
+ PNFS_BLOCK_VOLUME_SLICE = 1, /* slice of another volume */
+ PNFS_BLOCK_VOLUME_CONCAT = 2, /* concatenation of multiple volumes */
+ PNFS_BLOCK_VOLUME_STRIPE = 3 /* striped across multiple volumes */
+};
+
+/* All disk offset/lengths are stored in 512-byte sectors */
+struct pnfs_blk_volume {
+ uint32_t bv_type;
+ sector_t bv_size;
+ struct pnfs_blk_volume **bv_vols;
+ int bv_vol_n;
+ union {
+ dev_t bv_dev;
+ sector_t bv_stripe_unit;
+ sector_t bv_offset;
+ };
+};
+
+/* Since components need not be aligned, cannot use sector_t */
+struct pnfs_blk_sig_comp {
+ int64_t bs_offset; /* In bytes */
+ uint32_t bs_length; /* In bytes */
+ char *bs_string;
+};
+
+/* Maximum number of signature components in a simple volume */
+# define PNFS_BLOCK_MAX_SIG_COMP 16
+
+struct pnfs_blk_sig {
+ int si_num_comps;
+ struct pnfs_blk_sig_comp si_comps[PNFS_BLOCK_MAX_SIG_COMP];
+};
+
+uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
+
+#define BLK_READBUF(p, e, nbytes) do { \
+ p = blk_overflow(p, e, nbytes); \
+ if (!p) { \
+ printk(KERN_WARNING \
+ "%s: reply buffer overflowed in line %d.\n", \
+ __func__, __LINE__); \
+ goto out_err; \
+ } \
+} while (0)
+
+#define READ32(x) (x) = ntohl(*p++)
+#define READ64(x) do { \
+ (x) = (uint64_t)ntohl(*p++) << 32; \
+ (x) |= ntohl(*p++); \
+} while (0)
+#define COPYMEM(x, nbytes) do { \
+ memcpy((x), p, nbytes); \
+ p += XDR_QUADLEN(nbytes); \
+} while (0)
+#define READ_DEVID(x) COPYMEM((x)->data, NFS4_PNFS_DEVICEID4_SIZE)
+#define READ_SECTOR(x) do { \
+ READ64(tmp); \
+ if (tmp & 0x1ff) { \
+ printk(KERN_WARNING \
+ "%s Value not 512-byte aligned at line %d\n", \
+ __func__, __LINE__); \
+ goto out_err; \
+ } \
+ (x) = tmp >> 9; \
+} while (0)
+
/* blocklayoutdev.c */
+struct pnfs_block_dev *nfs4_blk_decode_device(struct super_block *sb,
+ struct pnfs_device *dev,
+ struct list_head *sdlist);
int nfs4_blk_create_scsi_disk_list(struct list_head *);
void nfs4_blk_destroy_disk_list(struct list_head *);
+/* blocklayoutdm.c */
+struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
+ struct pnfs_device *dev);
+int nfs4_blk_flatten(struct pnfs_blk_volume *, int, struct pnfs_block_dev *);
+void free_block_dev(struct pnfs_block_dev *bdev);

#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index b4f52fb..f1689b9 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -229,3 +229,327 @@ int nfs4_blk_create_scsi_disk_list(struct list_head *dlist)
return class_for_each_device(&shost_class, NULL,
&lc, nfs4_blk_iter_scsi_disk_list);
}
+/* We are given an array of XDR encoded array indices, each of which should
+ * refer to a previously decoded device. Translate into a list of pointers
+ * to the appropriate pnfs_blk_volume's.
+ */
+static int set_vol_array(uint32_t **pp, uint32_t *end,
+ struct pnfs_blk_volume *vols, int working)
+{
+ int i, index;
+ uint32_t *p = *pp;
+ struct pnfs_blk_volume **array = vols[working].bv_vols;
+ for (i = 0; i < vols[working].bv_vol_n; i++) {
+ BLK_READBUF(p, end, 4);
+ READ32(index);
+ if ((index < 0) || (index >= working)) {
+ dprintk("%s Index %i out of expected range\n",
+ __func__, index);
+ goto out_err;
+ }
+ array[i] = &vols[index];
+ }
+ *pp = p;
+ return 0;
+ out_err:
+ return -EIO;
+}
+
+static uint64_t sum_subvolume_sizes(struct pnfs_blk_volume *vol)
+{
+ int i;
+ uint64_t sum = 0;
+ for (i = 0; i < vol->bv_vol_n; i++)
+ sum += vol->bv_vols[i]->bv_size;
+ return sum;
+}
+
+static int decode_blk_signature(uint32_t **pp, uint32_t *end,
+ struct pnfs_blk_sig *sig)
+{
+ int i, tmp;
+ uint32_t *p = *pp;
+
+ BLK_READBUF(p, end, 4);
+ READ32(sig->si_num_comps);
+ if (sig->si_num_comps == 0) {
+ dprintk("%s 0 components in sig\n", __func__);
+ goto out_err;
+ }
+ if (sig->si_num_comps >= PNFS_BLOCK_MAX_SIG_COMP) {
+ dprintk("number of sig comps %i >= PNFS_BLOCK_MAX_SIG_COMP\n",
+ sig->si_num_comps);
+ goto out_err;
+ }
+ for (i = 0; i < sig->si_num_comps; i++) {
+ BLK_READBUF(p, end, 12);
+ READ64(sig->si_comps[i].bs_offset);
+ READ32(tmp);
+ sig->si_comps[i].bs_length = tmp;
+ BLK_READBUF(p, end, tmp);
+ /* Note we rely here on fact that sig is used immediately
+ * for mapping, then thrown away.
+ */
+ sig->si_comps[i].bs_string = (char *)p;
+ p += XDR_QUADLEN(tmp);
+ }
+ *pp = p;
+ return 0;
+ out_err:
+ return -EIO;
+}
+
+/* Translate a signature component into a block and offset. */
+static void get_sector(struct block_device *bdev,
+ struct pnfs_blk_sig_comp *comp,
+ sector_t *block,
+ uint32_t *offset_in_block)
+{
+ int64_t use_offset = comp->bs_offset;
+ unsigned int blkshift = blksize_bits(block_size(bdev));
+
+ dprintk("%s enter\n", __func__);
+ if (use_offset < 0)
+ use_offset += (get_capacity(bdev->bd_disk) << 9);
+ *block = use_offset >> blkshift;
+ *offset_in_block = use_offset - (*block << blkshift);
+
+ dprintk("%s block %llu offset_in_block %u\n",
+ __func__, (u64)*block, *offset_in_block);
+ return;
+}
+
+/*
+ * All signatures in sig must be found on bdev for verification.
+ * Returns True if sig matches, False otherwise.
+ *
+ * STUB - signature crossing a block boundary will cause problems.
+ */
+static int verify_sig(struct block_device *bdev, struct pnfs_blk_sig *sig)
+{
+ sector_t block = 0;
+ struct pnfs_blk_sig_comp *comp;
+ struct buffer_head *bh = NULL;
+ uint32_t offset_in_block = 0;
+ char *ptr;
+ int i;
+
+ dprintk("%s enter. bd_disk->capacity %ld, bd_block_size %d\n",
+ __func__, (unsigned long)get_capacity(bdev->bd_disk),
+ bdev->bd_block_size);
+ for (i = 0; i < sig->si_num_comps; i++) {
+ comp = &sig->si_comps[i];
+ dprintk("%s comp->bs_offset %lld, length=%d\n", __func__,
+ comp->bs_offset, comp->bs_length);
+ get_sector(bdev, comp, &block, &offset_in_block);
+ bh = __bread(bdev, block, bdev->bd_block_size);
+ if (!bh)
+ goto out_err;
+ ptr = (char *)bh->b_data + offset_in_block;
+ if (memcmp(ptr, comp->bs_string, comp->bs_length))
+ goto out_err;
+ brelse(bh);
+ }
+ dprintk("%s Complete Match Found\n", __func__);
+ return 1;
+
+out_err:
+ brelse(bh);
+ dprintk("%s No Match\n", __func__);
+ return 0;
+}
+
+/*
+ * map_sig_to_device()
+ * Given a signature, walk the list of visible scsi disks searching for
+ * a match. Returns True if mapping was done, False otherwise.
+ *
+ * While we're at it, fill in the vol->bv_size.
+ */
+/* XXX FRED - use normal 0=success status */
+static int map_sig_to_device(struct pnfs_blk_sig *sig,
+ struct pnfs_blk_volume *vol,
+ struct list_head *sdlist)
+{
+ int mapped = 0;
+ struct visible_block_device *vis_dev;
+
+ list_for_each_entry(vis_dev, sdlist, vi_node) {
+ if (vis_dev->vi_mapped)
+ continue;
+ mapped = verify_sig(vis_dev->vi_bdev, sig);
+ if (mapped) {
+ vol->bv_dev = vis_dev->vi_bdev->bd_dev;
+ vol->bv_size = get_capacity(vis_dev->vi_bdev->bd_disk);
+ vis_dev->vi_mapped = 1;
+ /* XXX FRED check this */
+ /* We no longer need to scan this device, and
+ * we need to "put" it before creating metadevice.
+ */
+ if (!vis_dev->vi_put_done) {
+ vis_dev->vi_put_done = 1;
+ nfs4_blkdev_put(vis_dev->vi_bdev);
+ }
+ break;
+ }
+ }
+ return mapped;
+}
+
+/* XDR decodes pnfs_block_volume4 structure */
+static int decode_blk_volume(uint32_t **pp, uint32_t *end,
+ struct pnfs_blk_volume *vols, int i,
+ struct list_head *sdlist, int *array_cnt)
+{
+ int status = 0;
+ struct pnfs_blk_sig sig;
+ uint32_t *p = *pp;
+ uint64_t tmp; /* Used by READ_SECTOR */
+ struct pnfs_blk_volume *vol = &vols[i];
+ int j;
+ u64 tmp_size;
+
+ BLK_READBUF(p, end, 4);
+ READ32(vol->bv_type);
+ dprintk("%s vol->bv_type = %i\n", __func__, vol->bv_type);
+ switch (vol->bv_type) {
+ case PNFS_BLOCK_VOLUME_SIMPLE:
+ *array_cnt = 0;
+ status = decode_blk_signature(&p, end, &sig);
+ if (status)
+ return status;
+ status = map_sig_to_device(&sig, vol, sdlist);
+ if (!status) {
+ dprintk("Could not find disk for device\n");
+ return -EIO;
+ }
+ status = 0;
+ dprintk("%s Set Simple vol to dev %d:%d, size %llu\n",
+ __func__,
+ MAJOR(vol->bv_dev),
+ MINOR(vol->bv_dev),
+ (u64)vol->bv_size);
+ break;
+ case PNFS_BLOCK_VOLUME_SLICE:
+ BLK_READBUF(p, end, 16);
+ READ_SECTOR(vol->bv_offset);
+ READ_SECTOR(vol->bv_size);
+ *array_cnt = vol->bv_vol_n = 1;
+ status = set_vol_array(&p, end, vols, i);
+ break;
+ case PNFS_BLOCK_VOLUME_STRIPE:
+ BLK_READBUF(p, end, 8);
+ READ_SECTOR(vol->bv_stripe_unit);
+ BLK_READBUF(p, end, 4);
+ READ32(vol->bv_vol_n);
+ if (!vol->bv_vol_n)
+ return -EIO;
+ *array_cnt = vol->bv_vol_n;
+ status = set_vol_array(&p, end, vols, i);
+ if (status)
+ return status;
+ /* Ensure all subvolumes are the same size */
+ for (j = 1; j < vol->bv_vol_n; j++) {
+ if (vol->bv_vols[j]->bv_size !=
+ vol->bv_vols[0]->bv_size) {
+ dprintk("%s varying subvol size\n", __func__);
+ return -EIO;
+ }
+ }
+ /* Make sure total size only includes addressable areas */
+ tmp_size = vol->bv_vols[0]->bv_size;
+ do_div(tmp_size, (u32)vol->bv_stripe_unit);
+ vol->bv_size = vol->bv_vol_n * tmp_size * vol->bv_stripe_unit;
+ dprintk("%s Set Stripe vol to size %llu\n",
+ __func__, (u64)vol->bv_size);
+ break;
+ case PNFS_BLOCK_VOLUME_CONCAT:
+ BLK_READBUF(p, end, 4);
+ READ32(vol->bv_vol_n);
+ if (!vol->bv_vol_n)
+ return -EIO;
+ *array_cnt = vol->bv_vol_n;
+ status = set_vol_array(&p, end, vols, i);
+ if (status)
+ return status;
+ vol->bv_size = sum_subvolume_sizes(vol);
+ dprintk("%s Set Concat vol to size %llu\n",
+ __func__, (u64)vol->bv_size);
+ break;
+ default:
+ dprintk("Unknown volume type %i\n", vol->bv_type);
+ out_err:
+ return -EIO;
+ }
+ *pp = p;
+ return status;
+}
+
+/* Decodes pnfs_block_deviceaddr4 (draft-8) which is XDR encoded
+ * in dev->area.
+ */
+struct pnfs_block_dev *
+nfs4_blk_decode_device(struct super_block *sb,
+ struct pnfs_device *dev,
+ struct list_head *sdlist)
+{
+ int num_vols, i, status, count;
+ struct pnfs_blk_volume *vols, **arrays, **arrays_ptr;
+ uint32_t *p = dev->area;
+ uint32_t *end = (uint32_t *) ((char *) p + dev->mincount);
+ struct pnfs_block_dev *rv = NULL;
+ struct visible_block_device *vis_dev;
+
+ dprintk("%s enter\n", __func__);
+
+ READ32(num_vols);
+ dprintk("%s num_vols = %i\n", __func__, num_vols);
+
+ vols = kmalloc(sizeof(struct pnfs_blk_volume) * num_vols, GFP_KERNEL);
+ if (!vols)
+ return NULL;
+ /* Each volume in vols array needs its own array. Save time by
+ * allocating them all in one large hunk. Because each volume
+ * array can only reference previous volumes, and because once
+ * a concat or stripe references a volume, it may never be
+ * referenced again, the volume arrays are guaranteed to fit
+ * in the surprisingly small space allocated.
+ */
+ arrays = kmalloc(sizeof(struct pnfs_blk_volume *) * num_vols * 2,
+ GFP_KERNEL);
+ if (!arrays)
+ goto out;
+ arrays_ptr = arrays;
+
+ list_for_each_entry(vis_dev, sdlist, vi_node) {
+ /* Wipe crud left from parsing previous device */
+ vis_dev->vi_mapped = 0;
+ }
+ for (i = 0; i < num_vols; i++) {
+ vols[i].bv_vols = arrays_ptr;
+ status = decode_blk_volume(&p, end, vols, i, sdlist, &count);
+ if (status)
+ goto out;
+ arrays_ptr += count;
+ }
+
+ /* Check that we have used up opaque */
+ if (p != end) {
+ dprintk("Undecoded cruft at end of opaque\n");
+ goto out;
+ }
+
+ /* Now use info in vols to create the meta device */
+ rv = nfs4_blk_init_metadev(sb, dev);
+ if (!rv)
+ goto out;
+ status = nfs4_blk_flatten(vols, num_vols, rv);
+ if (status) {
+ free_block_dev(rv);
+ rv = NULL;
+ }
+ out:
+ kfree(arrays);
+ kfree(vols);
+ return rv;
+}
diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
new file mode 100644
index 0000000..15eaed2
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -0,0 +1,72 @@
+/*
+ * linux/fs/nfs/blocklayout/blocklayoutdm.c
+ *
+ * Module for the NFSv4.1 pNFS block layout driver.
+ *
+ * Copyright (c) 2007 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Fred Isaman <[email protected]>
+ * Andy Adamson <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+/* Stub */
+static int nfs4_blk_metadev_release(struct pnfs_block_dev *bdev)
+{
+ return 0;
+}
+
+void free_block_dev(struct pnfs_block_dev *bdev)
+{
+ if (bdev) {
+ if (bdev->bm_mdev) {
+ dprintk("%s Removing DM device: %s %d:%d\n",
+ __func__,
+ bdev->bm_mdevname,
+ MAJOR(bdev->bm_mdev->bd_dev),
+ MINOR(bdev->bm_mdev->bd_dev));
+ /* XXX Check status ?? */
+ nfs4_blk_metadev_release(bdev);
+ }
+ kfree(bdev);
+ }
+}
+
+/* Stub */
+struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
+ struct pnfs_device *dev)
+{
+ return NULL;
+}
+
+/* Stub */
+int nfs4_blk_flatten(struct pnfs_blk_volume *vols, int size,
+ struct pnfs_block_dev *bdev)
+{
+ return 0;
+}
+
--
1.7.4.1
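
Editorial sketch, not part of the patch: the decoder above reads a count,
checks bounds before consuming each word of the opaque buffer, and finally
rejects any undecoded remainder. The userspace C below illustrates that
pattern under stated assumptions. struct xdr_cursor, cursor_read_u32() and
decode_index_array() are hypothetical names, not kernel APIs; the patch
itself uses the BLK_READBUF and READ32 macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over an XDR-encoded buffer (not a kernel type). */
struct xdr_cursor {
	const uint8_t *p;	/* next undecoded byte */
	const uint8_t *end;	/* one past the last valid byte */
};

/* Decode one big-endian 32-bit word.
 * Returns 0 on success, -1 if fewer than 4 bytes remain.
 */
static int cursor_read_u32(struct xdr_cursor *c, uint32_t *out)
{
	if ((size_t)(c->end - c->p) < 4)
		return -1;
	*out = ((uint32_t)c->p[0] << 24) | ((uint32_t)c->p[1] << 16) |
	       ((uint32_t)c->p[2] << 8) | (uint32_t)c->p[3];
	c->p += 4;
	return 0;
}

/* Decode a counted array: one count word followed by that many index words.
 * Rejects a zero count, a count larger than max, a short buffer, and
 * trailing bytes left over after the last index.
 */
static int decode_index_array(const uint8_t *buf, size_t len,
			      uint32_t *idx, size_t max, uint32_t *count)
{
	struct xdr_cursor c = { buf, buf + len };
	uint32_t n, i;

	if (cursor_read_u32(&c, &n) < 0 || n == 0 || n > max)
		return -1;
	for (i = 0; i < n; i++)
		if (cursor_read_u32(&c, &idx[i]) < 0)
			return -1;
	if (c.p != c.end)	/* undecoded bytes at end of opaque */
		return -1;
	*count = n;
	return 0;
}
```

Checking the remaining length before every read keeps a malformed or
truncated server reply from walking the cursor past the end of the buffer.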


2011-06-07 17:32:19

by Jim Rees

Subject: [PATCH 52/88] pnfsblock: iterating all local block disks instead of only scsi disks when initializing mount point.

From: Tao Guo <[email protected]>

This lets the block layout driver use virtual block devices such as MD and DM.

Signed-off-by: Huang Haoi <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 12 ++--
fs/nfs/blocklayout/blocklayout.h | 3 +-
fs/nfs/blocklayout/blocklayoutdev.c | 116 +++++++++++++++--------------------
fs/nfs/blocklayout/blocklayoutdm.c | 2 +-
4 files changed, 57 insertions(+), 76 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 918e6d6..688984f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -765,7 +765,7 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
struct pnfs_mount_type *mtype = NULL;
struct pnfs_devicelist *dlist = NULL;
struct pnfs_block_dev *bdev;
- LIST_HEAD(scsi_disklist);
+ LIST_HEAD(block_disklist);
int status, i;

dprintk("%s enter\n", __func__);
@@ -783,10 +783,10 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
spin_lock_init(&b_mt_id->bm_lock);
INIT_LIST_HEAD(&b_mt_id->bm_devlist);

- /* Construct a list of all visible scsi disks that have not been
+ /* Construct a list of all visible block disks that have not been
* claimed.
*/
- status = nfs4_blk_create_scsi_disk_list(&scsi_disklist);
+ status = nfs4_blk_create_block_disk_list(&block_disklist);
if (status < 0)
goto out_error;

@@ -804,13 +804,13 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
/* For each device returned in dlist, call GETDEVICEINFO, and
* decode the opaque topology encoding to create a flat
* volume topology, matching VOLUME_SIMPLE disk signatures
- * to disks in the visible scsi disk list.
+ * to disks in the visible block disk list.
* Construct an LVM meta device from the flat volume topology.
*/
for (i = 0; i < dlist->num_devs; i++) {
bdev = nfs4_blk_get_deviceinfo(server, fh,
&dlist->dev_id[i],
- &scsi_disklist);
+ &block_disklist);
if (!bdev)
goto out_error;
spin_lock(&b_mt_id->bm_lock);
@@ -823,7 +823,7 @@ bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
status = 0;
out_return:
kfree(dlist);
- nfs4_blk_destroy_disk_list(&scsi_disklist);
+ nfs4_blk_destroy_disk_list(&block_disklist);
return status;

out_error:
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 286adc9..0efed8d 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -44,7 +44,6 @@
#define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
#define ClearPagePnfsErr(page) clear_bit(PG_pnfserr, &(page)->flags)

-extern struct class shost_class; /* exported from drivers/scsi/hosts.c */
extern int dm_dev_create(struct dm_ioctl *param); /* from dm-ioctl.c */
extern int dm_dev_remove(struct dm_ioctl *param); /* from dm-ioctl.c */
extern int dm_do_resume(struct dm_ioctl *param);
@@ -250,7 +249,7 @@ struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
struct list_head *sdlist);
int nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
struct nfs4_pnfs_layoutget_res *lgr);
-int nfs4_blk_create_scsi_disk_list(struct list_head *);
+int nfs4_blk_create_block_disk_list(struct list_head *);
void nfs4_blk_destroy_disk_list(struct list_head *);
/* blocklayoutdm.c */
struct pnfs_block_dev *nfs4_blk_init_metadev(struct nfs_server *server,
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 4f45523..ef39c36 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -32,15 +32,14 @@
#include <linux/module.h>
#include <linux/buffer_head.h> /* __bread */

-#include <scsi/scsi.h>
-#include <scsi/scsi_device.h>
-#include <scsi/scsi_host.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>

#include "blocklayout.h"

#define NFSDBG_FACILITY NFSDBG_PNFS_LD

-#define MAX_VOLS 256 /* Maximum number of SCSI disks. Totally arbitrary */
+#define MAX_VOLS 256 /* Maximum number of block disks. Totally arbitrary */

uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes)
{
@@ -78,7 +77,7 @@ int nfs4_blkdev_put(struct block_device *bdev)
return blkdev_put(bdev, FMODE_READ);
}

-/* Add a visible, claimed (by us!) scsi disk to the device list */
+/* Add a visible, claimed (by us!) block disk to the device list */
static int alloc_add_disk(struct block_device *blk_dev, struct list_head *dlist)
{
struct visible_block_device *vis_dev;
@@ -96,17 +95,16 @@ static int alloc_add_disk(struct block_device *blk_dev, struct list_head *dlist)
return 0;
}

-/* Walk the list of scsi_devices. Add disks that can be opened and claimed
+/* Walk the list of block_devices. Add disks that can be opened and claimed
* to the device list
*/
static int
-nfs4_blk_add_scsi_disk(struct Scsi_Host *shost,
+nfs4_blk_add_block_disk(struct device *cdev,
int index, struct list_head *dlist)
{
static char *claim_ptr = "I belong to pnfs block driver";
struct block_device *bdev;
struct gendisk *gd;
- struct scsi_device *sdev;
unsigned int major, minor, ret = 0;
dev_t dev;

@@ -115,62 +113,49 @@ nfs4_blk_add_scsi_disk(struct Scsi_Host *shost,
dprintk("%s MAX_VOLS hit\n", __func__);
return -ENOSPC;
}
- dprintk("%s 1 \n", __func__);
- index--;
- shost_for_each_device(sdev, shost) {
- dprintk("%s 2\n", __func__);
- /* Need to do this check before bumping index */
- if (sdev->type != TYPE_DISK)
- continue;
- dprintk("%s 3 index %d \n", __func__, index);
- if (++index >= MAX_VOLS) {
- scsi_device_put(sdev);
- break;
- }
- major = (!(index >> 4) ? SCSI_DISK0_MAJOR :
- SCSI_DISK1_MAJOR-1 + (index >> 4));
- minor = ((index << 4) & 255);
-
- dprintk("%s SCSI device %d:%d \n", __func__, major, minor);
-
- dev = MKDEV(major, minor);
- bdev = nfs4_blkdev_get(dev);
- if (!bdev) {
- dprintk("%s: failed to open device %d:%d\n",
- __func__, major, minor);
- continue;
- }
- gd = bdev->bd_disk;
-
- dprintk("%s 4\n", __func__);
-
- if (bd_claim(bdev, claim_ptr)) {
- dprintk("%s: failed to claim device %d:%d\n",
- __func__, gd->major, gd->first_minor);
- blkdev_put(bdev, FMODE_READ);
- continue;
- }
+ gd = dev_to_disk(cdev);
+ if (gd == NULL || get_capacity(gd) == 0 ||
+ (gd->flags & GENHD_FL_SUPPRESS_PARTITION_INFO)) /* Skip ramdisks */
+ goto out;

- ret = alloc_add_disk(bdev, dlist);
- if (ret < 0)
- goto out_err;
- dprintk("%s ADDED DEVICE capacity %ld, bd_block_size %d\n",
- __func__,
- (unsigned long)get_capacity(gd),
- bdev->bd_block_size);
+ dev = cdev->devt;
+ major = MAJOR(dev);
+ minor = MINOR(dev);
+ bdev = nfs4_blkdev_get(dev);
+ if (!bdev) {
+ dprintk("%s: failed to open device %d:%d\n",
+ __func__, major, minor);
+ goto out;
+ }

+ if (bd_claim(bdev, claim_ptr)) {
+ dprintk("%s: failed to claim device %d:%d\n",
+ __func__, major, minor);
+ blkdev_put(bdev, FMODE_READ);
+ goto out;
}
+
+ ret = alloc_add_disk(bdev, dlist);
+ if (ret < 0)
+ goto out_err;
index++;
+ dprintk("%s ADDED DEVICE %d:%d capacity %ld, bd_block_size %d\n",
+ __func__, major, minor,
+ (unsigned long)get_capacity(gd),
+ bdev->bd_block_size);
+
+out:
dprintk("%s returns index %d \n", __func__, index);
return index;

out_err:
- dprintk("%s Can't add disk to list. ERROR: %d\n", __func__, ret);
+ dprintk("%s Can't add disk %d:%d to list. ERROR: %d\n",
+ __func__, major, minor, ret);
nfs4_blkdev_put(bdev);
return ret;
}

-/* Destroy the temporary scsi disk list */
+/* Destroy the temporary block disk list */
void nfs4_blk_destroy_disk_list(struct list_head *dlist)
{
struct visible_block_device *vis_dev;
@@ -189,20 +174,18 @@ void nfs4_blk_destroy_disk_list(struct list_head *dlist)
}
}

-struct nfs4_blk_scsi_disk_list_ctl {
+struct nfs4_blk_block_disk_list_ctl {
struct list_head *dlist;
int index;
};

-static int nfs4_blk_iter_scsi_disk_list(struct device *cdev, void *data)
+static int nfs4_blk_iter_block_disk_list(struct device *cdev, void *data)
{
- struct Scsi_Host *shost;
- struct nfs4_blk_scsi_disk_list_ctl *lc = data;
+ struct nfs4_blk_block_disk_list_ctl *lc = data;
int ret;

dprintk("%s enter\n", __func__);
- shost = class_to_shost(cdev);
- ret = nfs4_blk_add_scsi_disk(shost, lc->index, lc->dlist);
+ ret = nfs4_blk_add_block_disk(cdev, lc->index, lc->dlist);
dprintk("%s 1 ret %d\n", __func__, ret);
if (ret >= 0) {
lc->index = ret;
@@ -212,22 +195,21 @@ static int nfs4_blk_iter_scsi_disk_list(struct device *cdev, void *data)
}

/*
- * Create a temporary list of all SCSI disks host can see, and that have not
+ * Create a temporary list of all block disks host can see, and that have not
* yet been claimed.
- * shost_class: list of all registered scsi_hosts
+ * block_class: list of all registered block disks.
* returns -errno on error, and #of devices found on success.
- * XXX Loosely emulate scsi_host_lookup from scsi/host.c
*/
-int nfs4_blk_create_scsi_disk_list(struct list_head *dlist)
+int nfs4_blk_create_block_disk_list(struct list_head *dlist)
{
- struct nfs4_blk_scsi_disk_list_ctl lc = {
+ struct nfs4_blk_block_disk_list_ctl lc = {
.dlist = dlist,
.index = 0,
};

dprintk("%s enter\n", __func__);
- return class_for_each_device(&shost_class, NULL,
- &lc, nfs4_blk_iter_scsi_disk_list);
+ return class_for_each_device(&block_class, NULL,
+ &lc, nfs4_blk_iter_block_disk_list);
}
/* We are given an array of XDR encoded array indices, each of which should
* refer to a previously decoded device. Translate into a list of pointers
@@ -361,7 +343,7 @@ out_err:

/*
* map_sig_to_device()
- * Given a signature, walk the list of visible scsi disks searching for
+ * Given a signature, walk the list of visible block disks searching for
* a match. Returns True if mapping was done, False otherwise.
*
* While we're at it, fill in the vol->bv_size.
diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
index d70f6b2..3d15de0 100644
--- a/fs/nfs/blocklayout/blocklayoutdm.c
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -257,7 +257,7 @@ static int nfs4_blk_resolve(int root, struct pnfs_blk_volume *vols,
* Create an LVM dm device table that represents the volume topology returned
* by GETDEVICELIST or GETDEVICEINFO.
*
- * vols: topology with VOLUME_SIMPLEs mapped to visable scsi disks.
+ * vols: topology with VOLUME_SIMPLEs mapped to visible block disks.
* size: number of volumes in vols.
*/
int nfs4_blk_flatten(struct pnfs_blk_volume *vols, int size,
--
1.7.4.1
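
Editorial sketch, not part of the patch: the change above walks every
registered block device through a callback and threads a small control
struct through the walk to carry the running index. The userspace C below
shows the same shape under stated assumptions. struct fake_disk,
for_each_disk() and count_usable_disk() are hypothetical stand-ins; the
patch uses class_for_each_device() on block_class, and the "hidden" flag
here only stands in for the GENHD_FL_SUPPRESS_PARTITION_INFO test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered block disk. */
struct fake_disk {
	unsigned long capacity;	/* 0 means no usable media */
	int hidden;		/* stand-in for a "skip this disk" flag */
};

/* Control block passed through the walk, like the patch's list_ctl. */
struct disk_list_ctl {
	int index;	/* number of disks accepted so far */
	int max;	/* upper bound, like MAX_VOLS */
};

typedef int (*disk_fn)(const struct fake_disk *disk, void *data);

/* Call fn on each disk; stop at the first non-zero return and pass it on. */
static int for_each_disk(const struct fake_disk *disks, size_t n,
			 disk_fn fn, void *data)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		ret = fn(&disks[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Accept disks that have capacity and are not hidden. */
static int count_usable_disk(const struct fake_disk *disk, void *data)
{
	struct disk_list_ctl *lc = data;

	if (disk->capacity == 0 || disk->hidden)
		return 0;	/* skip this disk, keep walking */
	if (lc->index >= lc->max)
		return -1;	/* list is full, abort the walk */
	lc->index++;
	return 0;
}
```

Keeping the counter in the control struct rather than in a static variable
lets two mounts run the walk concurrently without sharing state.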


2011-06-10 12:33:59

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 02:00, [email protected] wrote:
> Hi, Benny,
>
> Cheers,
> -Bergwolf
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
> Sent: Friday, June 10, 2011 5:23 AM
> To: Peng Tao
> Cc: Jim Rees; [email protected]; peter honeyman
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-09 08:07, Peng Tao wrote:
>> Hi, Jim and Benny,
>>
>> On Thu, Jun 9, 2011 at 9:58 PM, Jim Rees <[email protected]> wrote:
>>> Benny Halevy wrote:
>>>
>>> > My understanding is that layoutget specifies a min and max, and the server
>>>
>>> There's a min. What do you consider the max?
>>> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>>
>>> The spec doesn't say max, it says "desired." I guess I assumed the server
>>> wouldn't normally return more than desired.
>> In fact the server is returning the "desired" length. The problem is that
>> we call pnfs_update_layout in nfs_write_begin, which ends up setting
>> both minlength and length to the page size. That leaves the client no
>> room to collapse the layoutget range in nfs_write_begin.
>>
>
> That's a different issue. Deferring pnfs_update_layout from write_begin
> to flush time when the whole page is written would help send a more
> meaningful desired range, and would avoid needless read-modify-writes
> when the application also wrote the whole preallocated block.
> [PT] That is also why we want to introduce layout prefetching: to get a larger segment than the single page passed in nfs_write_begin.
>

Peng, I understand what you want to achieve, but the proposed approach
just doesn't fly. The server knows its own allocation policies better than
the client does, and it has a better view of the combined workload of the
different clients and of possible conflicts between them. It should
therefore make the ultimate decision about the actual segment sizes.

That said, the client should indeed do its best to ask for the most
appropriate segment size for its use, and we should be doing a better job
of that. It's just that blindly asking for more is not a good strategy,
and requiring manual admin help to tune the clients is not acceptable.

Benny
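
Editorial sketch, not from the thread: the exchange above turns on the
difference between the minimum length a client requires and the length it
desires, with the server free to return more. The C below shows one way a
client could check a returned segment against its minimum. It assumes
"min" means the segment must cover at least minlength bytes starting at
the requested offset; that reading comes from the discussion, not from the
spec text. struct byte_range and covers_minimum() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte range, as carried in a layout request or reply. */
struct byte_range {
	uint64_t offset;
	uint64_t length;
};

/* Return 1 if seg covers [req_offset, req_offset + minlength), else 0.
 * The comparison works on distances so that offset + length never has
 * to be computed and cannot overflow.
 */
static int covers_minimum(struct byte_range seg, uint64_t req_offset,
			  uint64_t minlength)
{
	uint64_t skipped;

	if (seg.offset > req_offset)
		return 0;	/* segment starts after the requested byte */
	skipped = req_offset - seg.offset;
	if (seg.length < skipped)
		return 0;	/* segment ends before the requested byte */
	return seg.length - skipped >= minlength;
}
```

A page-sized request (minimum and desired both 4096) is satisfied equally
by a 4096-byte segment and by a 1 MiB one, so a check of this kind accepts
whatever larger range the server decides to grant.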

2011-06-10 12:36:05

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 01:36, [email protected] wrote:
> Hi, Benny,
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
> Sent: Friday, June 10, 2011 5:23 AM
> To: Jim Rees
> Cc: Peng Tao; [email protected]; peter honeyman
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-09 06:58, Jim Rees wrote:
>> Benny Halevy wrote:
>>
>> > My understanding is that layoutget specifies a min and max, and the server
>>
>> There's a min. What do you consider the max?
>> Whatever gets into csa_fore_chan_attrs.ca_maxresponsesize?
>>
>> The spec doesn't say max, it says "desired." I guess I assumed the server
>> wouldn't normally return more than desired.
>
> No, the server may freely upgrade the returned layout segment by returning
> a layout for a larger byte range or even returning a RW layout where a READ
> layout was asked for.
> [PT] It is true that the server can upgrade the layout segment freely, but there is always a price to pay. The server has to deal with all kinds of clients.
> If the server returns more than was asked for, it may hurt other clients.

And if all clients ask for more than they need and the server just
gives it to them, what do you get out of that?

Benny

2011-06-08 01:27:40

by Benny Halevy

Subject: Re: [PATCH 86/88] SQUASHME: pnfs: blocklayout: port block layout code

On 2011-06-07 13:35, Jim Rees wrote:
> From: Peng Tao <[email protected]>
>
> Make minimal changes to let block layout driver work in current framework.
>
> Signed-off-by: Tang Haiying <[email protected]>
> Signed-off-by: Zhang Jingwang <[email protected]>
> Signed-off-by: Peng Tao <[email protected]>
> Signed-off-by: Jim Rees <[email protected]>
> ---
> drivers/md/dm-ioctl.c | 24 --------
> drivers/scsi/hosts.c | 3 +-
> fs/nfs/blocklayout/blocklayout.c | 105 ++++++++++------------------------
> fs/nfs/blocklayout/blocklayout.h | 9 +--
> fs/nfs/blocklayout/blocklayoutdev.c | 34 ++++++++----
> fs/nfs/blocklayout/extents.c | 14 +----
> fs/nfs/nfs4proc.c | 1 -
> fs/nfs/nfs4xdr.c | 3 +-
> fs/nfs/pnfs.c | 8 ++-
> fs/nfs/pnfs.h | 1 +
> include/linux/nfs_fs_sb.h | 1 +
> 11 files changed, 69 insertions(+), 134 deletions(-)
>
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index d0d417e..4cacdad 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -713,12 +713,6 @@ static int dev_create(struct dm_ioctl *param, size_t param_size)
> return 0;
> }
>
> -int dm_dev_create(struct dm_ioctl *param)
> -{
> - return dev_create(param, sizeof(*param));
> -}
> -EXPORT_SYMBOL(dm_dev_create);
> -
> /*
> * Always use UUID for lookups if it's present, otherwise use name or dev.
> */
> @@ -814,12 +808,6 @@ static int dev_remove(struct dm_ioctl *param, size_t param_size)
> return 0;
> }
>
> -int dm_dev_remove(struct dm_ioctl *param)
> -{
> - return dev_remove(param, sizeof(*param));
> -}
> -EXPORT_SYMBOL(dm_dev_remove);
> -
> /*
> * Check a string doesn't overrun the chunk of
> * memory we copied from userland.
> @@ -1002,12 +990,6 @@ static int do_resume(struct dm_ioctl *param)
> return r;
> }
>
> -int dm_do_resume(struct dm_ioctl *param)
> -{
> - return do_resume(param);
> -}
> -EXPORT_SYMBOL(dm_do_resume);
> -
> /*
> * Set or unset the suspension state of a device.
> * If the device already is in the requested state we just return its status.
> @@ -1274,12 +1256,6 @@ out:
> return r;
> }
>
> -int dm_table_load(struct dm_ioctl *param, size_t param_size)
> -{
> - return table_load(param, param_size);
> -}
> -EXPORT_SYMBOL(dm_table_load);
> -
> static int table_clear(struct dm_ioctl *param, size_t param_size)
> {
> struct hash_cell *hc;
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 7d91903..4f7a582 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -50,11 +50,10 @@ static void scsi_host_cls_release(struct device *dev)
> put_device(&class_to_shost(dev)->shost_gendev);
> }
>
> -struct class shost_class = {
> +static struct class shost_class = {
> .name = "scsi_host",
> .dev_release = scsi_host_cls_release,
> };
> -EXPORT_SYMBOL(shost_class);
>
> /**
> * scsi_host_set_state - Take the given host through the host state model.
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 2583b87..d842ec8 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -97,14 +97,6 @@ dont_like_caller(struct nfs_page *req)
> }
> }
>
> -static enum pnfs_try_status
> -bl_commit(struct nfs_write_data *nfs_data,
> - int sync)
> -{
> - dprintk("%s enter\n", __func__);
> - return PNFS_NOT_ATTEMPTED;
> -}
> -
> /* The data we are handed might be spread across several bios. We need
> * to track when the last one is finished.
> */
> @@ -198,7 +190,7 @@ static void bl_read_cleanup(struct work_struct *work)
> dprintk("%s enter\n", __func__);
> task = container_of(work, struct rpc_task, u.tk_work);
> rdata = container_of(task, struct nfs_read_data, task);
> - pnfs_read_done(rdata);
> + pnfs_ld_read_done(rdata);
> }
>
> static void
> @@ -219,8 +211,7 @@ static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
> }
>
> static enum pnfs_try_status
> -bl_read_pagelist(struct nfs_read_data *rdata,
> - unsigned nr_pages)
> +bl_read_pagelist(struct nfs_read_data *rdata)
> {
> int i, hole;
> struct bio *bio = NULL;
> @@ -233,13 +224,13 @@ bl_read_pagelist(struct nfs_read_data *rdata,
> int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
>
> dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
> - nr_pages, f_offset, count);
> + rdata->npages, f_offset, count);
>
> if (dont_like_caller(rdata->req)) {
> dprintk("%s dont_like_caller failed\n", __func__);
> goto use_mds;
> }
> - if ((nr_pages == 1) && PagePnfsErr(rdata->req->wb_page)) {
> + if ((rdata->npages == 1) && PagePnfsErr(rdata->req->wb_page)) {
> /* We want to fall back to mds in case of read_page
> * after error on read_pages.
> */
> @@ -249,21 +240,21 @@ bl_read_pagelist(struct nfs_read_data *rdata,
> par = alloc_parallel(rdata);
> if (!par)
> goto use_mds;
> - par->call_ops = *rdata->pdata.call_ops;
> + par->call_ops = *rdata->mds_ops;
> par->call_ops.rpc_call_done = bl_rpc_do_nothing;
> par->pnfs_callback = bl_end_par_io_read;
> /* At this point, we can no longer jump to use_mds */
>
> isect = (sector_t) (f_offset >> 9);
> /* Code assumes extents are page-aligned */
> - for (i = pg_index; i < nr_pages; i++) {
> + for (i = pg_index; i < rdata->npages; i++) {
> if (!extent_length) {
> /* We've used up the previous extent */
> put_extent(be);
> put_extent(cow_read);
> bio = bl_submit_bio(READ, bio);
> /* Get the next one */
> - be = find_get_extent(BLK_LSEG2EXT(rdata->pdata.lseg),
> + be = find_get_extent(BLK_LSEG2EXT(rdata->lseg),
> isect, &cow_read);
> if (!be) {
> /* Error out this page */
> @@ -293,7 +284,7 @@ bl_read_pagelist(struct nfs_read_data *rdata,
> be_read = (hole && cow_read) ? cow_read : be;
> for (;;) {
> if (!bio) {
> - bio = bio_alloc(GFP_NOIO, nr_pages - i);
> + bio = bio_alloc(GFP_NOIO, rdata->npages - i);
> if (!bio) {
> /* Error out this page */
> bl_done_with_rpage(pages[i], 0);
> @@ -407,10 +398,10 @@ static void bl_write_cleanup(struct work_struct *work)
> /* BUG - this should be called after each bio, not after
> * all finish, unless have some way of storing success/failure
> */
> - mark_extents_written(BLK_LSEG2EXT(wdata->pdata.lseg),
> + mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
> wdata->args.offset, wdata->args.count);
> }
> - pnfs_writeback_done(wdata);
> + pnfs_ld_write_done(wdata);
> }
>
> /* Called when last of bios associated with a bl_write_pagelist call finishes */
> @@ -428,7 +419,6 @@ bl_end_par_io_write(void *data)
>
> static enum pnfs_try_status
> bl_write_pagelist(struct nfs_write_data *wdata,
> - unsigned nr_pages,
> int sync)
> {
> int i;
> @@ -442,7 +432,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
> int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
>
> dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
> - if (!wdata->req->wb_lseg) {
> + if (!wdata->lseg) {
> dprintk("%s no lseg, falling back to MDS\n", __func__);
> return PNFS_NOT_ATTEMPTED;
> }
> @@ -460,19 +450,19 @@ bl_write_pagelist(struct nfs_write_data *wdata,
> par = alloc_parallel(wdata);
> if (!par)
> return PNFS_NOT_ATTEMPTED;
> - par->call_ops = *wdata->pdata.call_ops;
> + par->call_ops = *wdata->mds_ops;
> par->call_ops.rpc_call_done = bl_rpc_do_nothing;
> par->pnfs_callback = bl_end_par_io_write;
> /* At this point, have to be more careful with error handling */
>
> isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> 9);
> - for (i = pg_index; i < nr_pages; i++) {
> + for (i = pg_index; i < wdata->npages ; i++) {
> if (!extent_length) {
> /* We've used up the previous extent */
> put_extent(be);
> bio = bl_submit_bio(WRITE, bio);
> /* Get the next one */
> - be = find_get_extent(BLK_LSEG2EXT(wdata->pdata.lseg),
> + be = find_get_extent(BLK_LSEG2EXT(wdata->lseg),
> isect, NULL);
> if (!be || !is_writable(be, isect)) {
> /* FIXME */
> @@ -484,7 +474,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
> }
> for (;;) {
> if (!bio) {
> - bio = bio_alloc(GFP_NOIO, nr_pages - i);
> + bio = bio_alloc(GFP_NOIO, wdata->npages - i);
> if (!bio) {
> /* Error out this page */
> /* FIXME */
> @@ -504,7 +494,12 @@ bl_write_pagelist(struct nfs_write_data *wdata,
> isect += PAGE_CACHE_SIZE >> 9;
> extent_length -= PAGE_CACHE_SIZE >> 9;
> }
> - wdata->res.count = (isect << 9) - (offset & (long)PAGE_CACHE_MASK);
> + wdata->res.count = (isect << 9) - (offset);
> + if (count < wdata->res.count) {
> + wdata->res.count = count;
> + }
> + /* pnfs_set_layoutcommit needs this */
> + wdata->mds_offset = offset;
> put_extent(be);
> bl_submit_bio(WRITE, bio);
> put_parallel(par);
> @@ -557,18 +552,19 @@ bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
> }
>
> static struct pnfs_layout_hdr *
> -bl_alloc_layout_hdr(struct inode *inode)
> +bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
> {
> struct pnfs_block_layout *bl;
>
> dprintk("%s enter\n", __func__);
> - bl = kzalloc(sizeof(*bl), GFP_KERNEL);
> + bl = kzalloc(sizeof(*bl), gfp_flags);
> if (!bl)
> return NULL;
> spin_lock_init(&bl->bl_ext_lock);
> INIT_LIST_HEAD(&bl->bl_extents[0]);
> INIT_LIST_HEAD(&bl->bl_extents[1]);
> INIT_LIST_HEAD(&bl->bl_commit);
> + INIT_LIST_HEAD(&bl->bl_committing);
> bl->bl_count = 0;
> bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
> INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
> @@ -590,16 +586,16 @@ bl_free_lseg(struct pnfs_layout_segment *lseg)
> */
> static struct pnfs_layout_segment *
> bl_alloc_lseg(struct pnfs_layout_hdr *lo,
> - struct nfs4_layoutget_res *lgr)
> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
> {
> struct pnfs_layout_segment *lseg;
> int status;
>
> dprintk("%s enter\n", __func__);
> - lseg = kzalloc(sizeof(*lseg) + 0, GFP_KERNEL);
> + lseg = kzalloc(sizeof(*lseg) + 0, gfp_flags);
> if (!lseg)
> return NULL;
> - status = nfs4_blk_process_layoutget(lo, lgr);
> + status = nfs4_blk_process_layoutget(lo, lgr, gfp_flags);
> if (status) {
> /* We don't want to call the full-blown bl_free_lseg,
> * since on error extents were not touched.
> @@ -615,34 +611,6 @@ bl_alloc_lseg(struct pnfs_layout_hdr *lo,
> return lseg;
> }
>
> -static int
> -bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
> - struct nfs4_layoutcommit_args *arg)
> -{
> - struct nfs_server *nfss = NFS_SERVER(lo->plh_inode);
> - struct bl_layoutupdate_data *layoutupdate_data;
> -
> - dprintk("%s enter\n", __func__);
> - /* Need to ensure commit is block-size aligned */
> - if (nfss->pnfs_blksize) {
> - u64 mask = nfss->pnfs_blksize - 1;
> - u64 offset = arg->range.offset & mask;
> -
> - arg->range.offset -= offset;
> - arg->range.length += offset + mask;
> - arg->range.length &= ~mask;
> - }
> -
> - layoutupdate_data = kmalloc(sizeof(struct bl_layoutupdate_data),
> - GFP_KERNEL);
> - if (unlikely(!layoutupdate_data))
> - return -ENOMEM;
> - INIT_LIST_HEAD(&layoutupdate_data->ranges);
> - arg->layoutdriver_data = layoutupdate_data;
> -
> - return 0;
> -}
> -
> static void
> bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
> const struct nfs4_layoutcommit_args *arg)
> @@ -657,7 +625,6 @@ bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
> {
> dprintk("%s enter\n", __func__);
> clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args, lcdata->res.status);
> - kfree(lcdata->args.layoutdriver_data);
> }
>
> static void free_blk_mountid(struct block_mount_id *mid)
> @@ -1085,25 +1052,16 @@ bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
> fsdata->private = NULL;
> }
>
> -/* This is called by nfs_can_coalesce_requests via nfs_pageio_do_add_request.
> - * Should return False if there is a reason requests can not be coalesced,
> - * otherwise, should default to returning True.
> - */
> -static int
> +static bool
> bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
> - struct nfs_page *req)
> + struct nfs_page *req)
> {
> - dprintk("%s enter\n", __func__);
> - if (pgio->pg_iswrite)
> - return prev->wb_lseg == req->wb_lseg;
> - else
> - return 1;
> + return pnfs_generic_pg_test(pgio, prev, req);
> }
>
> static struct pnfs_layoutdriver_type blocklayout_type = {
> .id = LAYOUT_BLOCK_VOLUME,
> .name = "LAYOUT_BLOCK_VOLUME",
> - .commit = bl_commit,
> .read_pagelist = bl_read_pagelist,
> .write_pagelist = bl_write_pagelist,
> .write_begin = bl_write_begin,
> @@ -1113,12 +1071,11 @@ static struct pnfs_layoutdriver_type blocklayout_type = {
> .free_layout_hdr = bl_free_layout_hdr,
> .alloc_lseg = bl_alloc_lseg,
> .free_lseg = bl_free_lseg,
> - .setup_layoutcommit = bl_setup_layoutcommit,
> .encode_layoutcommit = bl_encode_layoutcommit,
> .cleanup_layoutcommit = bl_cleanup_layoutcommit,
> .set_layoutdriver = bl_set_layoutdriver,
> .clear_layoutdriver = bl_clear_layoutdriver,
> - .pg_test = bl_pg_test,
> + .pg_test = bl_pg_test,
> };
>
> static int __init nfs4blocklayout_init(void)
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index a8198ae..dd596d4 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -33,7 +33,6 @@
> #define FS_NFS_NFS4BLOCKLAYOUT_H
>
> #include <linux/nfs_fs.h>
> -#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/
> #include "../pnfs.h"
>
> #define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)
> @@ -43,11 +42,6 @@
> #define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
> #define ClearPagePnfsErr(page) clear_bit(PG_pnfserr, &(page)->flags)
>
> -extern int dm_dev_create(struct dm_ioctl *param); /* from dm-ioctl.c */
> -extern int dm_dev_remove(struct dm_ioctl *param); /* from dm-ioctl.c */
> -extern int dm_do_resume(struct dm_ioctl *param);
> -extern int dm_table_load(struct dm_ioctl *param, size_t param_size);
> -
> struct block_mount_id {
> spinlock_t bm_lock; /* protects list */
> struct list_head bm_devlist; /* holds pnfs_block_dev */
> @@ -180,6 +174,7 @@ struct pnfs_block_layout {
> spinlock_t bl_ext_lock; /* Protects list manipulation */
> struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
> struct list_head bl_commit; /* Needs layout commit */
> + struct list_head bl_committing; /* Layout committing */
> unsigned int bl_count; /* entries in bl_commit */
> sector_t bl_blocksize; /* Server blocksize in sectors */
> };
> @@ -257,7 +252,7 @@ struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
> struct pnfs_device *dev,
> struct list_head *sdlist);
> int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> - struct nfs4_layoutget_res *lgr);
> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
> int nfs4_blk_create_block_disk_list(struct list_head *);
> void nfs4_blk_destroy_disk_list(struct list_head *);
> /* blocklayoutdm.c */
> diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
> index 23469e3..a90eb6b 100644
> --- a/fs/nfs/blocklayout/blocklayoutdev.c
> +++ b/fs/nfs/blocklayout/blocklayoutdev.c
> @@ -231,14 +231,16 @@ static int verify_extent(struct pnfs_block_extent *be,
> /* XDR decode pnfs_block_layout4 structure */
> int
> nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> - struct nfs4_layoutget_res *lgr)
> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
> {
> struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
> - uint32_t *p = (uint32_t *)lgr->layout.buf;
> - uint32_t *end = (uint32_t *)((char *)lgr->layout.buf + lgr->layout.len);
> int i, status = -EIO;
> uint32_t count;
> struct pnfs_block_extent *be = NULL, *save;
> + struct xdr_stream stream;
> + struct xdr_buf buf;
> + struct page *scratch;
> + __be32 *p;
> uint64_t tmp; /* Used by READSECTOR */
> struct layout_verification lv = {
> .mode = lgr->range.iomode,
> @@ -246,14 +248,27 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> .inval = lgr->range.offset >> 9,
> .cowread = lgr->range.offset >> 9,
> };
> -
> LIST_HEAD(extents);
>
> - BLK_READBUF(p, end, 4);
> + dprintk("---> %s\n", __func__);
> +
> + scratch = alloc_page(gfp_flags);
> + if (!scratch)
> + return -ENOMEM;
> +
> + xdr_init_decode_pages(&stream, &buf, lgr->layoutp->pages, lgr->layoutp->len);
> + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
> +
> + p = xdr_inline_decode(&stream, 4);
> + if (unlikely(!p))
> + goto out_err;
> +
> READ32(count);
>
> dprintk("%s enter, number of extents %i\n", __func__, count);
> - BLK_READBUF(p, end, (28 + NFS4_DEVICEID4_SIZE) * count);
> + p = xdr_inline_decode(&stream, (28 + NFS4_DEVICEID4_SIZE) * count);
> + if (unlikely(!p))
> + goto out_err;
>
> /* Decode individual extents, putting them in temporary
> * staging area until whole layout is decoded to make error
> @@ -269,6 +284,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> be->be_mdev = translate_devid(lo, &be->be_devid);
> if (!be->be_mdev)
> goto out_err;
> +
> /* The next three values are read in as bytes,
> * but stored as 512-byte sector lengths
> */
> @@ -284,11 +300,6 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> }
> list_add_tail(&be->be_node, &extents);
> }
> - if (p != end) {
> - dprintk("%s Undecoded cruft at end of opaque\n", __func__);
> - be = NULL;
> - goto out_err;
> - }
> if (lgr->range.offset + lgr->range.length != lv.start << 9) {
> dprintk("%s Final length mismatch\n", __func__);
> be = NULL;
> @@ -319,6 +330,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> spin_unlock(&bl->bl_ext_lock);
> status = 0;
> out:
> + __free_page(scratch);
> dprintk("%s returns %i\n", __func__, status);
> return status;
>
> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> index 40dff82..08413ec 100644
> --- a/fs/nfs/blocklayout/extents.c
> +++ b/fs/nfs/blocklayout/extents.c
> @@ -232,7 +232,7 @@ _range_has_tag(struct my_tree_t *tree, u64 start, u64 end, int32_t tag)
> if ((pos->it_sector == end - tree->mtt_step_size) &&
> (pos->it_tags & (1 << tag))) {
> expect = pos->it_sector - tree->mtt_step_size;
> - if (expect < start)
> + if (pos->it_sector < tree->mtt_step_size || expect < start)
> return 1;
> continue;
> } else {
> @@ -740,19 +740,12 @@ encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
> struct xdr_stream *xdr,
> const struct nfs4_layoutcommit_args *arg)
> {
> - sector_t start, end;
> struct pnfs_block_short_extent *lce, *save;
> unsigned int count = 0;
> - struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
> - struct list_head *ranges = &bld->ranges;
> + struct list_head *ranges = &bl->bl_committing;
> __be32 *p, *xdr_start;
>
> dprintk("%s enter\n", __func__);
> - start = arg->range.offset >> 9;
> - end = start + (arg->range.length >> 9);
> - dprintk("%s set start=%llu, end=%llu\n",
> - __func__, (u64)start, (u64)end);
> -
> /* BUG - creation of bl_commit is buggy - need to wait for
> * entire block to be marked WRITTEN before it can be added.
> */
> @@ -925,11 +918,10 @@ clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
> const struct nfs4_layoutcommit_args *arg,
> int status)
> {
> - struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
> struct pnfs_block_short_extent *lce, *save;
>
> dprintk("%s status %d\n", __func__, status);
> - list_for_each_entry_safe_reverse(lce, save, &bld->ranges, bse_node) {
> + list_for_each_entry_safe_reverse(lce, save, &bl->bl_committing, bse_node) {
> if (likely(!status)) {
> u64 offset = lce->bse_f_offset;
> u64 end = offset + lce->bse_length;
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index a693283..987260c 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5788,7 +5788,6 @@ static int _nfs4_getdevicelist(struct nfs_server *server,
>
> dprintk("--> %s\n", __func__);
> status = nfs4_call_sync(server->client, server, &msg, &args.seq_args, &res.seq_res, 0);
> - put_rpccred(msg.rpc_cred);
> dprintk("<-- %s status=%d\n", __func__, status);
> return status;
> }
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index e059dc8..73f18f4 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
> *p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
> /* Only whole file layouts */
> p = xdr_encode_hyper(p, 0); /* offset */
> - p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
> + p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */
> *p++ = cpu_to_be32(0); /* reclaim */
> p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
> *p++ = cpu_to_be32(1); /* newoffset = TRUE */
> @@ -5467,7 +5467,6 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
> int status;
>
> status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
> - res->status = status;
> if (status)
> return status;
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index c88a8ee..9920bff 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -898,8 +898,6 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
> ret = get_lseg(lseg);
> break;
> }
> - if (cmp_layout(range, &lseg->pls_range) > 0)
> - break;
> }
>
> dprintk("%s:Return lseg %p ref %d\n",
> @@ -1252,6 +1250,7 @@ static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
> }
> }
> rv->pls_end_pos = max_pos;
> + dprintk("%s: lseg %p end_pos %llu\n", __func__, rv, rv->pls_end_pos);
>
> return rv;
> }
> @@ -1261,6 +1260,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
> {
> struct nfs_inode *nfsi = NFS_I(wdata->inode);
> loff_t end_pos = wdata->mds_offset + wdata->res.count;

This needs patch 4b8ee2b, which I'm pulling into pnfs-all-2.6.39.
What base did you use for this patchset?

Benny

> + loff_t isize = i_size_read(wdata->inode);
> bool mark_as_dirty = false;
>
> spin_lock(&nfsi->vfs_inode.i_lock);
> @@ -1274,9 +1274,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
> dprintk("%s: Set layoutcommit for inode %lu ",
> __func__, wdata->inode->i_ino);
> }
> + if (end_pos > isize)
> + end_pos = isize;
> if (end_pos > wdata->lseg->pls_end_pos)
> wdata->lseg->pls_end_pos = end_pos;
> spin_unlock(&nfsi->vfs_inode.i_lock);
> + dprintk("%s: lseg %p end_pos %llu\n",
> + __func__, wdata->lseg, wdata->lseg->pls_end_pos);
>
> /* if pnfs_layoutcommit_inode() runs between inode locks, the next one
> * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index b50cf3a..28d57c9 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -156,6 +156,7 @@ struct pnfs_device {
> unsigned int layout_type;
> unsigned int mincount;
> struct page **pages;
> + void *area;
> unsigned int pgbase;
> unsigned int pglen;
> };
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index 3d93ada..79cc4ca 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -143,6 +143,7 @@ struct nfs_server {
> filesystem */
> struct pnfs_layoutdriver_type *pnfs_curr_ld; /* Active layout driver */
> struct rpc_wait_queue roc_rpcwaitq;
> + void *pnfs_ld_data; /* per mount point data */
> u32 pnfs_blksize; /* layout_blksize attr */
>
> /* the following fields are protected by nfs_client->cl_lock */
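The extents.c hunk quoted above adds a guard to _range_has_tag() so that
subtracting the step size from an unsigned sector number cannot wrap. A
minimal user-space sketch of that guard follows; the helper names are
hypothetical and uint64_t stands in for the kernel's u64 here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: returns true when stepping
 * back one step from `sector` would wrap below zero or land before
 * `start`. Mirrors the guarded test
 *   pos->it_sector < tree->mtt_step_size || expect < start
 */
static bool step_back_leaves_range(uint64_t sector, uint64_t step,
				   uint64_t start)
{
	if (sector < step)	/* sector - step would wrap around */
		return true;
	return sector - step < start;
}

/* The unguarded form, kept only for comparison. With sector < step the
 * subtraction wraps to a huge value and the comparison is false. */
static bool step_back_leaves_range_unguarded(uint64_t sector, uint64_t step,
					     uint64_t start)
{
	return sector - step < start;
}
```

With sector 0, step 8 and start 0, the guarded helper reports that the
range has been left, while the unguarded one does not.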

2011-06-10 16:44:46

by Boaz Harrosh

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 06/10/2011 09:23 AM, Boaz Harrosh wrote:
>
> I disagree. Please make that same exact patch for the server you are using!
>
> Leave the client alone. Don't even consider getting used to this, sticking
> broken stuff in client scripts. If anywhere, it should be in the server
> configuration files.
>
> Boaz
>

BTW, the algorithm on the *server* side need not be dynamic and can be very simple.
Just a simple rule-based one:

1. If the file is new/zero-size, just give out a small layout, say 1-2 stripe_units.
2. If the file is big or growing, give out some BIG maximal optimum IO size:
a size beyond which anything bigger will not increase performance.

3. If the file is opened by a second client, recall the first big layout and
give out a one-to-few stripes layout for shared access.
You will find that this shared-file case is very rare. There are not many
applications that share the same file. If they do, they usually use range
locks. Query your lock manager: if the client has a lock on this range of
the file, give out the full locked range as a layout. If that is not the proper
hint from the application, then what is? It is a much better hint than anything
the nfs-client can ever guess.

4. Make it simple as hell....

Just my $0.017
Boaz
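
The rules above can be sketched roughly as follows. This is only an
illustration under assumptions: struct file_state, struct layout_range,
choose_layout() and the "two stripe units" figure are made up for the
example and are not taken from any server; the recall of the first
client's layout in rule 3 is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types for the sketch. */
struct layout_range {
	uint64_t offset;
	uint64_t length;
};

struct file_state {
	uint64_t size;			/* current file size in bytes */
	unsigned int nr_clients;	/* clients that have the file open */
	bool has_range_lock;		/* requesting client holds a range lock */
	struct layout_range locked;	/* the locked range, if any */
};

/* Pick the layout range to hand out for a request at req_offset.
 * stripe_unit and max_io are server configuration values. */
static struct layout_range choose_layout(const struct file_state *f,
					 uint64_t req_offset,
					 uint64_t stripe_unit,
					 uint64_t max_io)
{
	struct layout_range r = { .offset = req_offset, .length = 0 };

	if (f->nr_clients > 1) {
		/* Rule 3: shared file. Prefer the locked range as a hint,
		 * otherwise hand out only a couple of stripes. */
		if (f->has_range_lock)
			return f->locked;
		r.length = 2 * stripe_unit;
	} else if (f->size == 0) {
		/* Rule 1: new or empty file, small layout. */
		r.length = 2 * stripe_unit;
	} else {
		/* Rule 2: file has data, use the maximal optimum IO size. */
		r.length = max_io;
	}
	return r;
}
```

Nothing here is dynamic: every decision comes from the file size, the
open count and the lock manager, which is the point of rule 4.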

2011-06-07 17:32:01

by Jim Rees

Subject: [PATCH 49/88] SQUASHME: pnfs-block: convert APIs pnfs-post-submit

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 45 ++++++++++++++--------------------
fs/nfs/blocklayout/blocklayout.h | 7 ++---
fs/nfs/blocklayout/blocklayoutdev.c | 8 +++---
fs/nfs/blocklayout/blocklayoutdm.c | 4 +-
4 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 768d8fa..918e6d6 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -564,7 +564,7 @@ bl_free_layout(void *p)
}

static void *
-bl_alloc_layout(struct pnfs_mount_type *mtype, struct inode *inode)
+bl_alloc_layout(struct inode *inode)
{
struct pnfs_block_layout *bl;

@@ -688,7 +688,7 @@ static void free_blk_mountid(struct block_mount_id *mid)
* It seems much of this should be at the generic pnfs level.
*/
static struct pnfs_block_dev *
-nfs4_blk_get_deviceinfo(struct super_block *sb, struct nfs_fh *fh,
+nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
struct pnfs_deviceid *d_id,
struct list_head *sdlist)
{
@@ -698,7 +698,6 @@ nfs4_blk_get_deviceinfo(struct super_block *sb, struct nfs_fh *fh,
int max_pages;
struct page **pages = NULL;
int i, rc;
- struct nfs_server *server = NFS_SB(sb);

/*
* Use the session max response size as the basis for setting
@@ -739,12 +738,12 @@ nfs4_blk_get_deviceinfo(struct super_block *sb, struct nfs_fh *fh,
dev->pglen = PAGE_SIZE * max_pages;
dev->mincount = 0;

- rc = pnfs_block_callback_ops->nfs_getdeviceinfo(sb, dev);
+ rc = pnfs_block_callback_ops->nfs_getdeviceinfo(server, dev);
dprintk("%s getdevice info returns %d\n", __func__, rc);
if (rc)
goto out_free;

- rv = nfs4_blk_decode_device(sb, dev, sdlist);
+ rv = nfs4_blk_decode_device(server, dev, sdlist);
out_free:
if (dev->area != NULL)
vunmap(dev->area);
@@ -759,8 +758,8 @@ nfs4_blk_get_deviceinfo(struct super_block *sb, struct nfs_fh *fh,
/*
* Retrieve the list of available devices for the mountpoint.
*/
-static struct pnfs_mount_type *
-bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)
+static int
+bl_initialize_mountpoint(struct nfs_server *server, const struct nfs_fh *fh)
{
struct block_mount_id *b_mt_id = NULL;
struct pnfs_mount_type *mtype = NULL;
@@ -771,21 +770,18 @@ bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)

dprintk("%s enter\n", __func__);

- if (NFS_SB(sb)->pnfs_blksize == 0) {
+ if (server->pnfs_blksize == 0) {
dprintk("%s Server did not return blksize\n", __func__);
- return NULL;
+ return -EINVAL;
}
b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_KERNEL);
- if (!b_mt_id)
+ if (!b_mt_id) {
+ status = -ENOMEM;
goto out_error;
+ }
/* Initialize nfs4 block layout mount id */
- b_mt_id->bm_sb = sb; /* back pointer to retrieve nfs_server struct */
spin_lock_init(&b_mt_id->bm_lock);
INIT_LIST_HEAD(&b_mt_id->bm_devlist);
- mtype = kzalloc(sizeof(struct pnfs_mount_type), GFP_KERNEL);
- if (!mtype)
- goto out_error;
- mtype->mountid = (void *)b_mt_id;

/* Construct a list of all visible scsi disks that have not been
* claimed.
@@ -799,7 +795,8 @@ bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)
goto out_error;
dlist->eof = 0;
while (!dlist->eof) {
- status = pnfs_block_callback_ops->nfs_getdevicelist(sb, fh, dlist);
+ status = pnfs_block_callback_ops->nfs_getdevicelist(
+ server, fh, dlist);
if (status)
goto out_error;
dprintk("%s GETDEVICELIST numdevs=%i, eof=%i\n",
@@ -811,7 +808,7 @@ bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)
* Construct an LVM meta device from the flat volume topology.
*/
for (i = 0; i < dlist->num_devs; i++) {
- bdev = nfs4_blk_get_deviceinfo(sb, fh,
+ bdev = nfs4_blk_get_deviceinfo(server, fh,
&dlist->dev_id[i],
&scsi_disklist);
if (!bdev)
@@ -822,30 +819,26 @@ bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)
}
}
dprintk("%s SUCCESS\n", __func__);
-
+ server->pnfs_ld_data = b_mt_id;
+ status = 0;
out_return:
kfree(dlist);
nfs4_blk_destroy_disk_list(&scsi_disklist);
- return mtype;
+ return status;

out_error:
free_blk_mountid(b_mt_id);
kfree(mtype);
- mtype = NULL;
goto out_return;
}

static int
-bl_uninitialize_mountpoint(struct pnfs_mount_type *mtype)
+bl_uninitialize_mountpoint(struct nfs_server *server)
{
- struct block_mount_id *b_mt_id = NULL;
+ struct block_mount_id *b_mt_id = server->pnfs_ld_data;

dprintk("%s enter\n", __func__);
- if (!mtype)
- return 0;
- b_mt_id = (struct block_mount_id *)mtype->mountid;
free_blk_mountid(b_mt_id);
- kfree(mtype);
dprintk("%s RETURNS\n", __func__);
return 0;
}
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 45939e1..286adc9 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -51,7 +51,6 @@ extern int dm_do_resume(struct dm_ioctl *param);
extern int dm_table_load(struct dm_ioctl *param, size_t param_size);

struct block_mount_id {
- struct super_block *bm_sb; /* back pointer */
spinlock_t bm_lock; /* protects list */
struct list_head bm_devlist; /* holds pnfs_block_dev */
};
@@ -194,7 +193,7 @@ struct bl_layoutupdate_data {
struct list_head ranges;
};

-#define BLK_ID(lo) ((struct block_mount_id *)(PNFS_MOUNTID(lo)->mountid))
+#define BLK_ID(lo) ((struct block_mount_id *)(PNFS_NFS_SERVER(lo)->pnfs_ld_data))
#define BLK_LSEG2EXT(lseg) ((struct pnfs_block_layout *)lseg->layout->ld_data)
#define BLK_LO2EXT(lo) ((struct pnfs_block_layout *)lo->ld_data)

@@ -246,7 +245,7 @@ uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
/* blocklayoutdev.c */
struct block_device *nfs4_blkdev_get(dev_t dev);
int nfs4_blkdev_put(struct block_device *bdev);
-struct pnfs_block_dev *nfs4_blk_decode_device(struct super_block *sb,
+struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
struct pnfs_device *dev,
struct list_head *sdlist);
int nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
@@ -254,7 +253,7 @@ int nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
int nfs4_blk_create_scsi_disk_list(struct list_head *);
void nfs4_blk_destroy_disk_list(struct list_head *);
/* blocklayoutdm.c */
-struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
+struct pnfs_block_dev *nfs4_blk_init_metadev(struct nfs_server *server,
struct pnfs_device *dev);
int nfs4_blk_flatten(struct pnfs_blk_volume *, int, struct pnfs_block_dev *);
void free_block_dev(struct pnfs_block_dev *bdev);
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 9fc3d46..4f45523 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -489,9 +489,9 @@ static int decode_blk_volume(uint32_t **pp, uint32_t *end,
* in dev->dev_addr_buf.
*/
struct pnfs_block_dev *
-nfs4_blk_decode_device(struct super_block *sb,
- struct pnfs_device *dev,
- struct list_head *sdlist)
+nfs4_blk_decode_device(struct nfs_server *server,
+ struct pnfs_device *dev,
+ struct list_head *sdlist)
{
int num_vols, i, status, count;
struct pnfs_blk_volume *vols, **arrays, **arrays_ptr;
@@ -540,7 +540,7 @@ nfs4_blk_decode_device(struct super_block *sb,
}

/* Now use info in vols to create the meta device */
- rv = nfs4_blk_init_metadev(sb, dev);
+ rv = nfs4_blk_init_metadev(server, dev);
if (!rv)
goto out;
status = nfs4_blk_flatten(vols, num_vols, rv);
diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
index 4bff748..d70f6b2 100644
--- a/fs/nfs/blocklayout/blocklayoutdm.c
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -129,7 +129,7 @@ void free_block_dev(struct pnfs_block_dev *bdev)
/*
* Create meta device. Keep it open to use for I/O.
*/
-struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
+struct pnfs_block_dev *nfs4_blk_init_metadev(struct nfs_server *server,
struct pnfs_device *dev)
{
static uint64_t dev_count; /* STUB used for device names */
@@ -151,7 +151,7 @@ struct pnfs_block_dev *nfs4_blk_init_metadev(struct super_block *sb,
bd = nfs4_blkdev_get(meta_dev);
if (!bd)
goto out_err;
- if (bd_claim(bd, sb)) {
+ if (bd_claim(bd, server)) {
dprintk("%s: failed to claim device %d:%d\n",
__func__,
MAJOR(meta_dev),
--
1.7.4.1
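
The patch above changes bl_initialize_mountpoint() from returning a
struct pnfs_mount_type pointer to returning an int status, with the
driver's private data stored in server->pnfs_ld_data. A minimal
user-space sketch of that convention, with hypothetical stand-in types
(struct server for struct nfs_server, struct mount_private for struct
block_mount_id) and calloc/free in place of kzalloc/kfree:

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for the kernel structures; names are hypothetical. */
struct mount_private {
	int ndevs;
};

struct server {
	unsigned int blksize;	/* like server->pnfs_blksize */
	void *ld_data;		/* like server->pnfs_ld_data */
};

/* Returns 0 on success or a negative errno. On success the private
 * data hangs off the server object instead of being returned. */
static int init_mountpoint(struct server *srv)
{
	struct mount_private *priv;

	if (srv->blksize == 0)
		return -EINVAL;	/* server did not report a block size */

	priv = calloc(1, sizeof(*priv));
	if (!priv)
		return -ENOMEM;

	srv->ld_data = priv;
	return 0;
}

/* The teardown side only needs the server pointer. */
static void uninit_mountpoint(struct server *srv)
{
	free(srv->ld_data);
	srv->ld_data = NULL;
}
```

The caller can then tell a missing block size from an allocation failure,
which a bare NULL return could not express.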


2011-06-07 17:29:31

by Jim Rees

Subject: [PATCH 31/88] pnfsblock: write_end_cleanup

From: Fred Isaman <[email protected]>

Ensure all pages in block are marked for initialization if needed.

[pnfsblock: Update to 2.6.29]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 59 ++++++++++++++++++++++++++++++++++++++
1 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index f4851c1..af26bcc 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -835,6 +835,64 @@ bl_write_end(struct inode *inode, struct page *page, loff_t pos,
return 0;
}

+/* Return any memory allocated to fsdata->private, and take advantage
+ * of no page locks to mark pages noted in write_begin as needing
+ * initialization.
+ */
+static void
+bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
+{
+ struct page *page;
+ pgoff_t index;
+ sector_t *pos;
+ struct address_space *mapping = filp->f_mapping;
+ struct pnfs_fsdata *fake_data;
+
+ if (!fsdata)
+ return;
+ pos = fsdata->private;
+ if (!pos)
+ return;
+ dprintk("%s enter with pos=%llu\n", __func__, (u64)(*pos));
+ for (; *pos != ~0; pos++) {
+ index = *pos >> (PAGE_CACHE_SHIFT - 9);
+ /* XXX How do we properly deal with failures here??? */
+ page = grab_cache_page_write_begin(mapping, index, 0);
+ if (!page) {
+ printk(KERN_ERR "%s BUG BUG BUG NoMem\n", __func__);
+ continue;
+ }
+ dprintk("%s: Examining block page\n", __func__);
+ print_page(page);
+ if (!PageMappedToDisk(page)) {
+ /* XXX How do we properly deal with failures here??? */
+ dprintk("%s Marking block page\n", __func__);
+ init_page_for_write(BLK_LSEG2EXT(fsdata->lseg), page,
+ PAGE_CACHE_SIZE, PAGE_CACHE_SIZE,
+ NULL);
+ print_page(page);
+ fake_data = kzalloc(sizeof(*fake_data), GFP_KERNEL);
+ if (!fake_data) {
+ printk(KERN_ERR "%s BUG BUG BUG NoMem\n",
+ __func__);
+ unlock_page(page);
+ continue;
+ }
+ fake_data->ok_to_use_pnfs = 1;
+ fake_data->bypass_eof = 1;
+ mapping->a_ops->write_end(filp, mapping,
+ index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE,
+ PAGE_CACHE_SIZE,
+ page, fake_data);
+ /* Note fake_data is freed by nfs_write_end */
+ } else
+ unlock_page(page);
+ }
+ kfree(fsdata->private);
+ fsdata->private = NULL;
+}
+
static ssize_t
bl_get_stripesize(struct pnfs_layout_type *lo)
{
@@ -880,6 +938,7 @@ static struct layoutdriver_io_operations blocklayout_io_operations = {
.write_pagelist = bl_write_pagelist,
.write_begin = bl_write_begin,
.write_end = bl_write_end,
+ .write_end_cleanup = bl_write_end_cleanup,
.alloc_layout = bl_alloc_layout,
.free_layout = bl_free_layout,
.alloc_lseg = bl_alloc_lseg,
--
1.7.4.1


2011-06-08 07:29:37

by Peng Tao

Subject: Re: [PATCH 88/88] NFS41: do not update isize if inode needs layoutcommit

sorry, the link should be
http://www.spinics.net/lists/linux-nfs/msg21586.html

On 6/8/11, Peng Tao <[email protected]> wrote:
> On 6/8/11, Benny Halevy <[email protected]> wrote:
>> Better send generic patches separately.
>> This patch needs to go upstream and to stable 2.6.39
> It has been sent separately before. please see
> http://www.spinics.net/list/linux-nfs/msg21586.html
>
>>
>> Benny
>>
>> On 2011-06-07 13:36, Jim Rees wrote:
>>> From: Peng Tao <[email protected]>
>>>
>>> Layout commit is supposed to set server file size similar to nfs pages.
>>> We should not update client file size for the same reason.
>>> Otherwise we will lose what we have at hand.
>>>
>>> Signed-off-by: Peng Tao <[email protected]>
>>> Signed-off-by: Jim Rees <[email protected]>
>>> ---
>>> fs/nfs/inode.c | 3 ++-
>>> 1 files changed, 2 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>>> index 144f2a3..3f1eb81 100644
>>> --- a/fs/nfs/inode.c
>>> +++ b/fs/nfs/inode.c
>>> @@ -1294,7 +1294,8 @@ static int nfs_update_inode(struct inode *inode,
>>> struct nfs_fattr *fattr)
>>> if (new_isize != cur_isize) {
>>> /* Do we perhaps have any outstanding writes, or has
>>> * the file grown beyond our last write? */
>>> - if (nfsi->npages == 0 || new_isize > cur_isize) {
>>> + if ((nfsi->npages == 0 && !test_bit(NFS_INO_LAYOUTCOMMIT,
>>> &nfsi->flags)) ||
>>> + new_isize > cur_isize) {
>>> i_size_write(inode, new_isize);
>>> invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
>>> }
>
>
> --
> Thanks,
> -Bergwolf
>


--
Thanks,
-Bergwolf
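
For reference, the condition changed by the quoted hunk in
nfs_update_inode() can be read as the predicate below. This is a
user-space sketch, not kernel code: should_update_isize() is a
hypothetical name, and the layoutcommit_pending flag stands in for
test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags).

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether the client should overwrite its cached file size with
 * the size reported by the server. */
static bool should_update_isize(uint64_t cur_isize, uint64_t new_isize,
				unsigned long npages,
				bool layoutcommit_pending)
{
	if (new_isize == cur_isize)
		return false;	/* nothing to update */

	/* Accept the server's size if the file grew, or if there are no
	 * outstanding page writes and no pending layoutcommit that would
	 * still set the size on the server. */
	return (npages == 0 && !layoutcommit_pending) ||
	       new_isize > cur_isize;
}
```

With a layoutcommit pending, a smaller size reported by the server is
ignored, so the locally known size is kept.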

2011-06-07 17:30:54

by Jim Rees

Subject: [PATCH 41/88] SQUASHME: pnfs-block: remove of CONFIG_PNFS fallout

From: Boaz Harrosh <[email protected]>

Signed-off-by: Boaz Harrosh <[email protected]>
[depend on NFS_V4_1]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/Kconfig | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index bccd415..c8bd06c 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -107,7 +107,7 @@ config PNFS_PANLAYOUT

config PNFS_BLOCK
tristate "Provide a pNFS block client (EXPERIMENTAL)"
- depends on NFS_FS && PNFS
+ depends on NFS_FS && NFS_V4_1
select MD
select BLK_DEV_DM
help
--
1.7.4.1


2011-06-07 17:31:54

by Jim Rees

Subject: [PATCH 48/88] SQUASHME: pnfs-block: fix compile breakage

From: J. Bruce Fields <[email protected]>

fs/nfs/blocklayout/built-in.o: In function `bl_rpc_do_nothing':
/home/bfields/local/build-2.6/fs/nfs/blocklayout/blocklayout.c:219: multiple definition of `pnfs_callback_ops'
fs/nfs/nfslayoutdriver.o:/home/bfields/local/build-2.6/fs/nfs/nfs4filelayout.c:160: first defined here

The variable in the block case never seems to be used outside the one
file; so change the name and declare it static.

Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 65cf104..768d8fa 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -44,7 +44,7 @@ MODULE_AUTHOR("Andy Adamson <[email protected]>");
MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");

/* Callback operations to the pNFS client */
-struct pnfs_client_operations *pnfs_callback_ops;
+static struct pnfs_client_operations *pnfs_block_callback_ops;

static void print_page(struct page *page)
{
@@ -200,7 +200,7 @@ static void bl_read_cleanup(struct work_struct *work)
dprintk("%s enter\n", __func__);
task = container_of(work, struct rpc_task, u.tk_work);
rdata = container_of(task, struct nfs_read_data, task);
- pnfs_callback_ops->nfs_readlist_complete(rdata);
+ pnfs_block_callback_ops->nfs_readlist_complete(rdata);
}

static void
@@ -414,7 +414,7 @@ static void bl_write_cleanup(struct work_struct *work)
mark_extents_written(BLK_LSEG2EXT(wdata->pdata.lseg),
wdata->args.offset, wdata->args.count);
}
- pnfs_callback_ops->nfs_writelist_complete(wdata);
+ pnfs_block_callback_ops->nfs_writelist_complete(wdata);
}

/* Called when last of bios associated with a bl_write_pagelist call finishes */
@@ -739,7 +739,7 @@ nfs4_blk_get_deviceinfo(struct super_block *sb, struct nfs_fh *fh,
dev->pglen = PAGE_SIZE * max_pages;
dev->mincount = 0;

- rc = pnfs_callback_ops->nfs_getdeviceinfo(sb, dev);
+ rc = pnfs_block_callback_ops->nfs_getdeviceinfo(sb, dev);
dprintk("%s getdevice info returns %d\n", __func__, rc);
if (rc)
goto out_free;
@@ -799,7 +799,7 @@ bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)
goto out_error;
dlist->eof = 0;
while (!dlist->eof) {
- status = pnfs_callback_ops->nfs_getdevicelist(sb, fh, dlist);
+ status = pnfs_block_callback_ops->nfs_getdevicelist(sb, fh, dlist);
if (status)
goto out_error;
dprintk("%s GETDEVICELIST numdevs=%i, eof=%i\n",
@@ -1186,7 +1186,7 @@ static int __init nfs4blocklayout_init(void)
{
dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);

- pnfs_callback_ops = pnfs_register_layoutdriver(&blocklayout_type);
+ pnfs_block_callback_ops = pnfs_register_layoutdriver(&blocklayout_type);
return 0;
}

--
1.7.4.1


2011-06-07 17:33:38

by Jim Rees

Subject: [PATCH 64/88] SQUASHME: pnfs-block: use new write_pagelist api

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 16 +++++++---------
1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 80d25768..bfcef54 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -428,21 +428,19 @@ bl_end_par_io_write(void *data)
}

static enum pnfs_try_status
-bl_write_pagelist(struct pnfs_layout_type *lo,
- struct page **pages,
- unsigned int pgbase,
- unsigned nr_pages,
- loff_t offset,
- size_t count,
- int sync,
- struct nfs_write_data *wdata)
+bl_write_pagelist(struct nfs_write_data *wdata,
+ unsigned nr_pages,
+ int sync)
{
int i;
struct bio *bio = NULL;
struct pnfs_block_extent *be = NULL;
sector_t isect, extent_length = 0;
struct parallel_io *par;
- int pg_index = pgbase >> PAGE_CACHE_SHIFT;
+ loff_t offset = wdata->args.offset;
+ size_t count = wdata->args.count;
+ struct page **pages = wdata->args.pages;
+ int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;

dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
if (!wdata->req->wb_lseg) {
--
1.7.4.1


2011-06-07 17:27:36

by Jim Rees

Subject: [PATCH 15/88] pnfsblock: dm kernel interface

From: Fred Isaman <[email protected]>

We need kernel access to what is currently a user-mode only ioctl interface.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
drivers/md/dm-ioctl.c | 24 ++++++++++++++++++++++++
fs/nfs/blocklayout/blocklayout.h | 4 ++++
2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 4cacdad..d0d417e 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -713,6 +713,12 @@ static int dev_create(struct dm_ioctl *param, size_t param_size)
return 0;
}

+int dm_dev_create(struct dm_ioctl *param)
+{
+ return dev_create(param, sizeof(*param));
+}
+EXPORT_SYMBOL(dm_dev_create);
+
/*
* Always use UUID for lookups if it's present, otherwise use name or dev.
*/
@@ -808,6 +814,12 @@ static int dev_remove(struct dm_ioctl *param, size_t param_size)
return 0;
}

+int dm_dev_remove(struct dm_ioctl *param)
+{
+ return dev_remove(param, sizeof(*param));
+}
+EXPORT_SYMBOL(dm_dev_remove);
+
/*
* Check a string doesn't overrun the chunk of
* memory we copied from userland.
@@ -990,6 +1002,12 @@ static int do_resume(struct dm_ioctl *param)
return r;
}

+int dm_do_resume(struct dm_ioctl *param)
+{
+ return do_resume(param);
+}
+EXPORT_SYMBOL(dm_do_resume);
+
/*
* Set or unset the suspension state of a device.
* If the device already is in the requested state we just return its status.
@@ -1256,6 +1274,12 @@ out:
return r;
}

+int dm_table_load(struct dm_ioctl *param, size_t param_size)
+{
+ return table_load(param, param_size);
+}
+EXPORT_SYMBOL(dm_table_load);
+
static int table_clear(struct dm_ioctl *param, size_t param_size)
{
struct hash_cell *hc;
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 4af6685..2c6e1fe 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,8 +35,12 @@
#include <linux/nfs_fs.h>
#include <linux/pnfs_xdr.h> /* Needed by nfs4_pnfs.h */
#include <linux/nfs4_pnfs.h>
+#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/

extern struct class shost_class; /* exported from drivers/scsi/hosts.c */
+extern int dm_dev_create(struct dm_ioctl *param); /* from dm-ioctl.c */
+extern int dm_dev_remove(struct dm_ioctl *param); /* from dm-ioctl.c */
+

struct block_mount_id {
struct super_block *bm_sb; /* back pointer */
--
1.7.4.1


2011-06-07 17:35:38

by Jim Rees

Subject: [PATCH 83/88] SQUASHME: pnfs-block: fixup layoutcommit methods args

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 11 +++++------
fs/nfs/blocklayout/blocklayout.h | 4 ++--
fs/nfs/blocklayout/extents.c | 4 ++--
3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 39f3896..161c113 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -617,7 +617,7 @@ bl_alloc_lseg(struct pnfs_layout_hdr *lo,

static int
bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutcommit_op_args *arg)
+ struct nfs4_layoutcommit_args *arg)
{
struct nfs_server *nfss = NFS_SERVER(lo->inode);
struct bl_layoutupdate_data *layoutupdate_data;
@@ -645,7 +645,7 @@ bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,

static void
bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
- const struct nfs4_layoutcommit_op_args *arg)
+ const struct nfs4_layoutcommit_args *arg)
{
dprintk("%s enter\n", __func__);
encode_pnfs_block_layoutupdate(BLK_LO2EXT(lo), xdr, arg);
@@ -653,12 +653,11 @@ bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,

static void
bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutcommit_op_args *arg,
- struct nfs4_layoutcommit_op_res *res)
+ struct nfs4_layoutcommit_data *lcdata)
{
dprintk("%s enter\n", __func__);
- clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), arg, res->status);
- kfree(arg->layoutdriver_data);
+ clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args, lcdata->res.status);
+ kfree(lcdata->args.layoutdriver_data);
}

static void free_blk_mountid(struct block_mount_id *mid)
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 058c198..9e7bd62 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -276,9 +276,9 @@ struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
- const struct nfs4_layoutcommit_op_args *arg);
+ const struct nfs4_layoutcommit_args *arg);
void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
- const struct nfs4_layoutcommit_op_args *arg,
+ const struct nfs4_layoutcommit_args *arg,
int status);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index cd32935..40dff82 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -738,7 +738,7 @@ find_get_extent_locked(struct pnfs_block_layout *bl, sector_t isect)
int
encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
- const struct nfs4_layoutcommit_op_args *arg)
+ const struct nfs4_layoutcommit_args *arg)
{
sector_t start, end;
struct pnfs_block_short_extent *lce, *save;
@@ -922,7 +922,7 @@ set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)

void
clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
- const struct nfs4_layoutcommit_op_args *arg,
+ const struct nfs4_layoutcommit_args *arg,
int status)
{
struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
--
1.7.4.1


2011-06-08 02:01:43

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

NAK.
This affects all layout types. In particular it is undesired
for write layouts that extend the file with the objects layout.
The server can extend the layout segments range
over what the client requested so why would the client
ask for artificially large layouts?

Benny

On 2011-06-07 13:36, Jim Rees wrote:
> From: Peng Tao <[email protected]>
>
> pnfs_layout_prefetch_kb can be modified via sysctl.
>
> Signed-off-by: Peng Tao <[email protected]>
> Signed-off-by: Jim Rees <[email protected]>
> ---
> fs/nfs/pnfs.c | 17 +++++++++++++++++
> fs/nfs/pnfs.h | 1 +
> fs/nfs/sysctl.c | 10 ++++++++++
> 3 files changed, 28 insertions(+), 0 deletions(-)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 9920bff..9c2b569 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -46,6 +46,11 @@ static DEFINE_SPINLOCK(pnfs_spinlock);
> */
> static LIST_HEAD(pnfs_modules_tbl);
>
> +/*
> + * layoutget prefetch size
> + */
> +unsigned int pnfs_layout_prefetch_kb = 2 << 10;
> +
> /* Return the registered pnfs layout driver module matching given id */
> static struct pnfs_layoutdriver_type *
> find_pnfs_driver_locked(u32 id)
> @@ -906,6 +911,16 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
> }
>
> /*
> + * Set layout prefetch length.
> + */
> +static void
> +pnfs_set_layout_prefetch(struct pnfs_layout_range *range)
> +{
> + if (range->length < (pnfs_layout_prefetch_kb << 10))
> + range->length = pnfs_layout_prefetch_kb << 10;
> +}
> +
> +/*
> * Layout segment is retreived from the server if not cached.
> * The appropriate layout segment is referenced and returned to the caller.
> */
> @@ -956,6 +971,8 @@ pnfs_update_layout(struct inode *ino,
>
> if (pnfs_layoutgets_blocked(lo, NULL, 0))
> goto out_unlock;
> +
> + pnfs_set_layout_prefetch(&arg);
> atomic_inc(&lo->plh_outstanding);
>
> get_layout_hdr(lo);
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 28d57c9..563c67b 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -182,6 +182,7 @@ extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
> extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp);
>
> /* pnfs.c */
> +extern unsigned int pnfs_layout_prefetch_kb;
> void get_layout_hdr(struct pnfs_layout_hdr *lo);
> void put_lseg(struct pnfs_layout_segment *lseg);
> struct pnfs_layout_segment *
> diff --git a/fs/nfs/sysctl.c b/fs/nfs/sysctl.c
> index 978aaeb..79a5134 100644
> --- a/fs/nfs/sysctl.c
> +++ b/fs/nfs/sysctl.c
> @@ -14,6 +14,7 @@
> #include <linux/nfs_fs.h>
>
> #include "callback.h"
> +#include "pnfs.h"
>
> #ifdef CONFIG_NFS_V4
> static const int nfs_set_port_min = 0;
> @@ -42,6 +43,15 @@ static ctl_table nfs_cb_sysctls[] = {
> },
> #endif /* CONFIG_NFS_USE_NEW_IDMAPPER */
> #endif
> +#ifdef CONFIG_NFS_V4_1
> + {
> + .procname = "pnfs_layout_prefetch_kb",
> + .data = &pnfs_layout_prefetch_kb,
> + .maxlen = sizeof(pnfs_layout_prefetch_kb),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> +#endif
> {
> .procname = "nfs_mountpoint_timeout",
> .data = &nfs_mountpoint_expiry_timeout,
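
As a rough illustration of the arithmetic in the quoted patch: the default of 2 << 10 is 2048 KB, and the second shift by 10 turns that into 2 MiB expressed in bytes. The sketch below is a user-space model, not the kernel code; struct layout_range, layout_prefetch_kb and set_layout_prefetch are hypothetical stand-ins for the names in the patch. It widens to 64 bits before shifting, on the assumption that unsigned int is 32 bits, in which case the patch's own expression would wrap for sysctl values of 4194304 KB (4 GiB) and above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct pnfs_layout_range. */
struct layout_range {
	uint64_t offset;
	uint64_t length;	/* bytes */
};

/* Patch default: 2 << 10 KB = 2048 KB, i.e. 2 MiB once shifted to bytes. */
static unsigned int layout_prefetch_kb = 2 << 10;

/*
 * Same clamp as pnfs_set_layout_prefetch() in the quoted patch, but the
 * KB-to-bytes shift is done in 64 bits so a large tunable cannot wrap.
 */
static void set_layout_prefetch(struct layout_range *range)
{
	uint64_t min_len = (uint64_t)layout_prefetch_kb << 10;

	assert(range != NULL);
	if (range->length < min_len)
		range->length = min_len;
}
```

With the default, a 4096-byte request grows to 2097152 bytes, while an 8 MiB request is left alone. This is the behaviour the NAK objects to: every layout type sees the enlarged request.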

2011-06-10 12:47:38

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 02:02, [email protected] wrote:
> Hi, Benny,
>
> Cheers,
> -Bergwolf
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
> Sent: Friday, June 10, 2011 5:30 AM
> To: Peng Tao
> Cc: Jim Rees; [email protected]; peter honeyman
> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>
> On 2011-06-09 07:54, Peng Tao wrote:
>> On Thu, Jun 9, 2011 at 2:06 PM, Benny Halevy <[email protected]> wrote:
>>> On 2011-06-08 03:15, Peng Tao wrote:
>>>> On 6/8/11, Jim Rees <[email protected]> wrote:
>>>>> Benny Halevy wrote:
>>>>>
>>>>> NAK.
>>>>> This affects all layout types. In particular it is undesired
>>>>> for write layouts that extend the file with the objects layout.
>>>>> The server can extend the layout segments range
>>>>> over what the client requested so why would the client
>>>>> ask for artificially large layouts?
>>>>>
>>>>> This has actually been the subject of some debate over Thursday night
>>>>> beers. The problem we're trying to solve is that the client is spending 98%
>>>>> of its time in layoutget. This patch gives us something like a 10x
>>>>> speedup. But many of us think it's not the right fix. I suggest we discuss
>>>>> next week.
>>>>>
>>>
>>> Sure.
>>>
>>>>> But note that this patch doesn't change anything unless you set the sysctl.
>>>> there is a default value of 2M. maybe we can set it to page size by
>>>> default so other layout are not affected and block layout can let
>>>> users set it by hand if they care about performance. does this make
>>>> sense?
>>>
>>> If doing it at all why use a sysctl rather than a mount option?
>> The purpose of using a sysctl is to give client the ability to change
>> it on the fly. In theory, layout prefetching can benefit all layout
>> types. So the patch tries to solve it in the pnfs generic layer.
>>
>
> But the need for this varies per-server and many times per application.
> Think sequential vs. random I/O. Therefore a mount option would help
> tuning the behavior on a per-use basis. Global behavior must be implemented
> using a dynamic algorithm that would take both the workload and the server
> observed behavior into account.
> [PT] Indeed. A dynamic algorithm is supposed to be able to solve all this, but it often takes longer to design and get accepted. It has to prove to be better in most scenarios and not hurt the rest.

We need to find an acceptable solution to push this driver upstream.
I understand that developing a dynamic algorithm in the given time frame is
too big of a challenge, but hacking yet another client tunable is out of the
question too. For testing in the Bakeathon I'd consider taking a DEVONLY version
of this patch that is enabled using a config option and defaults to zero, so that it has
no effect at run time until the sysctl sets it differently.
But keep in mind this is not suitable for pushing upstream.

Benny

2011-06-07 17:32:42

by Jim Rees

Subject: [PATCH 56/88] SQUASHME: pnfsblock: write_end adjust for removed ok_to_use_pnfs

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 13 ++++---------
1 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index eb5760f..b1df445 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1027,17 +1027,12 @@ bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
/* CAREFUL - what happens if copied < count??? */
static int
bl_write_end(struct inode *inode, struct page *page, loff_t pos,
- unsigned count, unsigned copied, struct pnfs_fsdata *fsdata)
+ unsigned count, unsigned copied, struct pnfs_layout_segment *lseg)
{
- dprintk("%s enter, %u@%lld, %i\n", __func__, count, pos,
- fsdata ? fsdata->ok_to_use_pnfs : -1);
+ dprintk("%s enter, %u@%lld, lseg=%p\n", __func__, count, pos, lseg);
print_page(page);
- if (fsdata) {
- if (fsdata->ok_to_use_pnfs) {
- dprintk("%s using pnfs\n", __func__);
- SetPageUptodate(page);
- }
- }
+ if (lseg)
+ SetPageUptodate(page);
return 0;
}

--
1.7.4.1


2011-06-07 17:29:48

by Jim Rees

Subject: [PATCH 33/88] pnfsblock: bl_write_pagelist

From: Fred Isaman <[email protected]>

Note: When the upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.

When the upper layer resends the read/write request to finish the rest
of the request, pgbase is the position where we should start.

Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 141 ++++++++++++++++++++++++++++++++++++-
1 files changed, 137 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 9c46f5a..df0cfe4 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -335,9 +335,69 @@ bl_read_pagelist(struct pnfs_layout_type *lo,
return PNFS_NOT_ATTEMPTED;
}

-/* FRED - It can indicate bytes written in wdata->res.count.
- * It can indicate error status in wdata->task.tk_status.
+/* STUB - this needs thought */
+static inline void
+bl_done_with_wpage(struct page *page, const int ok)
+{
+ if (!ok) {
+ SetPageError(page);
+ SetPagePnfsErr(page);
+ /* This is an inline copy of nfs_zap_mapping */
+ /* This is oh so fishy, and needs deep thought */
+ if (page->mapping->nrpages != 0) {
+ struct inode *inode = page->mapping->host;
+ spin_lock(&inode->i_lock);
+ NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
+ spin_unlock(&inode->i_lock);
+ }
+ }
+ /* end_page_writeback called in rpc_release. Should be done here. */
+}
+
+/* This is basically copied from mpage_end_io_read */
+static void bl_end_io_write(struct bio *bio, int err)
+{
+ void *data = bio->bi_private;
+ const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+ struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+
+ do {
+ struct page *page = bvec->bv_page;
+
+ if (--bvec >= bio->bi_io_vec)
+ prefetchw(&bvec->bv_page->flags);
+ bl_done_with_wpage(page, uptodate);
+ } while (bvec >= bio->bi_io_vec);
+ bio_put(bio);
+ put_parallel(data);
+}
+
+/* Function scheduled for call during bl_end_par_io_write,
+ * it marks sectors as written and extends the commitlist.
*/
+static void bl_write_cleanup(struct work_struct *work)
+{
+ struct rpc_task *task;
+ struct nfs_write_data *wdata;
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+ wdata = container_of(task, struct nfs_write_data, task);
+ pnfs_callback_ops->nfs_writelist_complete(wdata);
+}
+
+/* Called when last of bios associated with a bl_write_pagelist call finishes */
+static void
+bl_end_par_io_write(void *data)
+{
+ struct nfs_write_data *wdata = data;
+
+ /* STUB - ignoring error handling */
+ wdata->task.tk_status = 0;
+ wdata->verf.committed = NFS_FILE_SYNC;
+ INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
+ schedule_work(&wdata->task.u.tk_work);
+}
+
static enum pnfs_try_status
bl_write_pagelist(struct pnfs_layout_type *lo,
struct page **pages,
@@ -348,8 +408,81 @@ bl_write_pagelist(struct pnfs_layout_type *lo,
int sync,
struct nfs_write_data *wdata)
{
- dprintk("%s enter - just using nfs\n", __func__);
- return PNFS_NOT_ATTEMPTED;
+ int i;
+ struct bio *bio = NULL;
+ struct pnfs_block_extent *be = NULL;
+ sector_t isect, extent_length = 0;
+ struct parallel_io *par;
+ int pg_index = pgbase >> PAGE_CACHE_SHIFT;
+
+ dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
+ if (!test_bit(PG_USE_PNFS, &wdata->req->wb_flags)) {
+ dprintk("PG_USE_PNFS not set\n");
+ return PNFS_NOT_ATTEMPTED;
+ }
+ if (dont_like_caller(wdata->req)) {
+ dprintk("%s dont_like_caller failed\n", __func__);
+ return PNFS_NOT_ATTEMPTED;
+ }
+ /* At this point, wdata->pages is a (sequential) list of nfs_pages.
+ * We want to write each, and if there is an error remove it from
+ * list and call
+ * nfs_retry_request(req) to have it redone using nfs.
+ * QUEST? Do as block or per req? Think have to do per block
+ * as part of end_bio
+ */
+ par = alloc_parallel(wdata);
+ if (!par)
+ return PNFS_NOT_ATTEMPTED;
+ par->call_ops = *wdata->pdata.call_ops;
+ par->call_ops.rpc_call_done = bl_rpc_do_nothing;
+ par->pnfs_callback = bl_end_par_io_write;
+ /* At this point, have to be more careful with error handling */
+
+ isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> 9);
+ for (i = pg_index; i < nr_pages; i++) {
+ if (!extent_length) {
+ /* We've used up the previous extent */
+ put_extent(be);
+ bio = bl_submit_bio(WRITE, bio);
+ /* Get the next one */
+ be = find_get_extent(BLK_LSEG2EXT(wdata->pdata.lseg),
+ isect, NULL);
+ if (!be || !is_writable(be, isect)) {
+ /* FIXME */
+ bl_done_with_wpage(pages[i], 0);
+ break;
+ }
+ extent_length = be->be_length -
+ (isect - be->be_f_offset);
+ }
+ for (;;) {
+ if (!bio) {
+ bio = bio_alloc(GFP_NOIO, nr_pages - i);
+ if (!bio) {
+ /* Error out this page */
+ /* FIXME */
+ bl_done_with_wpage(pages[i], 0);
+ break;
+ }
+ bio->bi_sector = isect - be->be_f_offset +
+ be->be_v_offset;
+ bio->bi_bdev = be->be_mdev;
+ bio->bi_end_io = bl_end_io_write;
+ bio->bi_private = par;
+ }
+ if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
+ break;
+ bio = bl_submit_bio(WRITE, bio);
+ }
+ isect += PAGE_CACHE_SIZE >> 9;
+ extent_length -= PAGE_CACHE_SIZE >> 9;
+ }
+ wdata->res.count = (isect << 9) - (offset & (long)PAGE_CACHE_MASK);
+ put_extent(be);
+ bl_submit_bio(WRITE, bio);
+ put_parallel(par);
+ return PNFS_ATTEMPTED;
}

/* FIXME - range ignored */
--
1.7.4.1
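
The partial-completion accounting described in the commit message can be modelled outside the kernel. The sketch below is only an illustration under stated assumptions: 4 KiB pages, 512-byte sectors, and hypothetical helper names (first_page_index, bytes_completed, example) that do not appear in the patch. It mirrors how the patch derives the starting page from pgbase and how res.count is computed from the sector cursor, measured from the page-aligned start of the request.

```c
#include <assert.h>
#include <stdint.h>

#define PG_SHIFT	12			/* assumption: 4 KiB pages */
#define PG_SIZE		(1ULL << PG_SHIFT)
#define PG_MASK		(~(PG_SIZE - 1))
#define SECT_SHIFT	9			/* 512-byte sectors */

/* Index of the first page to (re)submit, derived from pgbase. */
static unsigned int first_page_index(unsigned int pgbase)
{
	return pgbase >> PG_SHIFT;
}

/*
 * Bytes to report in res.count once pages_done pages have been queued.
 * The sector cursor starts at the page-aligned offset and advances by
 * one page worth of sectors per queued page, as in the patch's loop.
 */
static uint64_t bytes_completed(uint64_t offset, unsigned int pages_done)
{
	uint64_t start = offset & PG_MASK;
	uint64_t isect = start >> SECT_SHIFT;
	unsigned int i;

	for (i = 0; i < pages_done; i++)
		isect += PG_SIZE >> SECT_SHIFT;	/* 8 sectors per page */

	return (isect << SECT_SHIFT) - start;
}

/* Worked example: a request that stops after 3 of its pages. */
static void example(void)
{
	assert(first_page_index(8192) == 2);
	assert(bytes_completed(0x10000, 3) == 3 * 4096);
}
```

If the extent lookup fails part-way through, the reported count covers only the pages queued so far, which lets the upper layer resend the remainder instead of treating the whole request as failed.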


2011-06-07 17:35:48

by Jim Rees

Subject: [PATCH 84/88] pnfs-block: fix blocklayoutdev.c for new blkdev_get_by_dev()

open_by_devnum() has been replaced by blkdev_get_by_dev() in 2.6.38, which now handles
exclusive use too (which we don't use). Fix uses in blocklayout to match.

Signed-off-by: Jim Rees <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayoutdev.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 17bd25a..23469e3 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -55,7 +55,7 @@ struct block_device *nfs4_blkdev_get(dev_t dev)
struct block_device *bd;

dprintk("%s enter\n", __func__);
- bd = open_by_devnum(dev, FMODE_READ);
+ bd = blkdev_get_by_dev(dev, FMODE_READ, NULL);
if (IS_ERR(bd))
goto fail;
return bd;
@@ -72,7 +72,6 @@ int nfs4_blkdev_put(struct block_device *bdev)
{
dprintk("%s for device %d:%d\n", __func__, MAJOR(bdev->bd_dev),
MINOR(bdev->bd_dev));
- bd_release(bdev);
return blkdev_put(bdev, FMODE_READ);
}

--
1.7.4.1


2011-06-07 17:33:31

by Jim Rees

Subject: [PATCH 63/88] SQUASHME: pnfs-block: use new read_pagelist api

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 14 ++++++--------
1 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e2ee90a..80d25768 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -220,20 +220,18 @@ static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
}

static enum pnfs_try_status
-bl_read_pagelist(struct pnfs_layout_type *lo,
- struct page **pages,
- unsigned int pgbase,
- unsigned nr_pages,
- loff_t f_offset,
- size_t count,
- struct nfs_read_data *rdata)
+bl_read_pagelist(struct nfs_read_data *rdata,
+ unsigned nr_pages)
{
int i, hole;
struct bio *bio = NULL;
struct pnfs_block_extent *be = NULL, *cow_read = NULL;
sector_t isect, extent_length = 0;
struct parallel_io *par;
- int pg_index = pgbase >> PAGE_CACHE_SHIFT;
+ loff_t f_offset = rdata->args.offset;
+ size_t count = rdata->args.count;
+ struct page **pages = rdata->args.pages;
+ int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;

dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
nr_pages, f_offset, count);
--
1.7.4.1


2011-06-07 17:29:55

by Jim Rees

Subject: [PATCH 34/88] pnfsblock: note written INVAL areas for layoutcommit

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 37 ++++++++
fs/nfs/blocklayout/blocklayout.h | 13 +++
fs/nfs/blocklayout/extents.c | 173 ++++++++++++++++++++++++++++++++++++++
3 files changed, 223 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index df0cfe4..d4396d6 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -335,6 +335,33 @@ bl_read_pagelist(struct pnfs_layout_type *lo,
return PNFS_NOT_ATTEMPTED;
}

+static void mark_extents_written(struct pnfs_block_layout *bl,
+ __u64 offset, __u32 count)
+{
+ sector_t isect, end;
+ struct pnfs_block_extent *be;
+
+ dprintk("%s(%llu, %u)\n", __func__, offset, count);
+ if (count == 0)
+ return;
+ isect = (offset & (long)(PAGE_CACHE_MASK)) >> 9;
+ end = (offset + count + PAGE_CACHE_SIZE - 1) & (long)(PAGE_CACHE_MASK);
+ end >>= 9;
+ while (isect < end) {
+ be = find_get_extent(bl, isect, NULL);
+ BUG_ON(!be); /* FIXME */
+ if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+ isect += be->be_length;
+ else {
+ sector_t len;
+ len = min(end, be->be_f_offset + be->be_length) - isect;
+ mark_for_commit(be, isect, len); /* What if fails? */
+ isect += len;
+ }
+ put_extent(be);
+ }
+}
+
/* STUB - this needs thought */
static inline void
bl_done_with_wpage(struct page *page, const int ok)
@@ -382,6 +409,14 @@ static void bl_write_cleanup(struct work_struct *work)
dprintk("%s enter\n", __func__);
task = container_of(work, struct rpc_task, u.tk_work);
wdata = container_of(task, struct nfs_write_data, task);
+ if (!wdata->task.tk_status) {
+ /* Marks for LAYOUTCOMMIT */
+ /* BUG - this should be called after each bio, not after
+ * all finish, unless have some way of storing success/failure
+ */
+ mark_extents_written(BLK_LSEG2EXT(wdata->pdata.lseg),
+ wdata->args.offset, wdata->args.count);
+ }
pnfs_callback_ops->nfs_writelist_complete(wdata);
}

@@ -538,6 +573,8 @@ bl_alloc_layout(struct pnfs_mount_type *mtype, struct inode *inode)
spin_lock_init(&bl->bl_ext_lock);
INIT_LIST_HEAD(&bl->bl_extents[0]);
INIT_LIST_HEAD(&bl->bl_extents[1]);
+ INIT_LIST_HEAD(&bl->bl_commit);
+ bl->bl_count = 0;
bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
return bl;
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index c4b7b40..1ec9bff 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -139,6 +139,15 @@ struct pnfs_block_extent {
struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
};

+/* Shortened extent used by LAYOUTCOMMIT */
+struct pnfs_block_short_extent {
+ struct list_head bse_node;
+ struct pnfs_deviceid bse_devid; /* STUB - removable??? */
+ struct block_device *bse_mdev;
+ sector_t bse_f_offset; /* the starting offset in the file */
+ sector_t bse_length; /* the size of the extent */
+};
+
static inline void
INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
{
@@ -167,6 +176,8 @@ struct pnfs_block_layout {
struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
spinlock_t bl_ext_lock; /* Protects list manipulation */
struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
+ struct list_head bl_commit; /* Needs layout commit */
+ unsigned int bl_count; /* entries in bl_commit */
sector_t bl_blocksize; /* Server blocksize in sectors */
};

@@ -235,4 +246,6 @@ struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
+int mark_for_commit(struct pnfs_block_extent *be,
+ sector_t offset, sector_t length);
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index ef8a5b7..d4e4a92 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -223,6 +223,48 @@ int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
return rv;
}

+/* Assume start, end already sector aligned */
+static int
+_range_has_tag(struct my_tree_t *tree, u64 start, u64 end, int32_t tag)
+{
+ struct pnfs_inval_tracking *pos;
+ u64 expect = 0;
+
+ dprintk("%s(%llu, %llu, %i) enter\n", __func__, start, end, tag);
+ list_for_each_entry(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector < start)
+ continue;
+ if (!expect) {
+ if ((pos->it_sector == start) &&
+ (pos->it_tags & (1 << tag))) {
+ expect = start + tree->mtt_step_size;
+ if (expect == end)
+ return 1;
+ continue;
+ } else {
+ return 0;
+ }
+ }
+ if (pos->it_sector != expect || !(pos->it_tags & (1 << tag)))
+ return 0;
+ expect += tree->mtt_step_size;
+ if (expect == end)
+ return 1;
+ }
+ return 0;
+}
+
+static int is_range_written(struct pnfs_inval_markings *marks,
+ sector_t start, sector_t end)
+{
+ int rv;
+
+ spin_lock(&marks->im_lock);
+ rv = _range_has_tag(&marks->im_tree, start, end, EXTENT_WRITTEN);
+ spin_unlock(&marks->im_lock);
+ return rv;
+}
+
/* Marks sectors in [offest, offset_length) as having been initialized.
* All lengths are step-aligned, where step is min(pagesize, blocksize).
* Notes where partial block is initialized, and helps prepare it for
@@ -292,6 +334,137 @@ int mark_initialized_sectors(struct pnfs_inval_markings *marks,
return -ENOMEM;
}

+/* Marks sectors in [offset, offset+length) as having been written to disk.
+ * All lengths should be block aligned.
+ */
+int mark_written_sectors(struct pnfs_inval_markings *marks,
+ sector_t offset, sector_t length)
+{
+ int status;
+
+ dprintk("%s(offset=%llu,len=%llu) enter\n", __func__,
+ (u64)offset, (u64)length);
+ spin_lock(&marks->im_lock);
+ status = _set_range(&marks->im_tree, EXTENT_WRITTEN, offset, length);
+ spin_unlock(&marks->im_lock);
+ return status;
+}
+
+/* Note: In theory, we should do more checking that devid's match between
+ * old and new, but if they don't, the lists are too corrupt to salvage anyway.
+ */
+/* Note this is very similar to add_and_merge_extent */
+static void add_to_commitlist(struct pnfs_block_layout *bl,
+ struct pnfs_block_short_extent *new)
+{
+ struct list_head *clist = &bl->bl_commit;
+ struct pnfs_block_short_extent *old, *save;
+ sector_t end = new->bse_f_offset + new->bse_length;
+
+ bl->bl_count++;
+ /* Scan for proper place to insert, extending new to the left
+ * as much as possible.
+ */
+ list_for_each_entry_safe(old, save, clist, bse_node) {
+ if (new->bse_f_offset < old->bse_f_offset)
+ break;
+ if (end <= old->bse_f_offset + old->bse_length) {
+ /* Range is already in list */
+ bl->bl_count--;
+ kfree(new);
+ return;
+ } else if (new->bse_f_offset <=
+ old->bse_f_offset + old->bse_length) {
+ /* new overlaps or abuts existing be */
+ if (new->bse_mdev == old->bse_mdev) {
+ /* extend new to fully replace old */
+ new->bse_length += new->bse_f_offset -
+ old->bse_f_offset;
+ new->bse_f_offset = old->bse_f_offset;
+ list_del(&old->bse_node);
+ bl->bl_count--;
+ kfree(old);
+ }
+ }
+ }
+ /* Note that if we never hit the above break, old will not point to a
+ * valid extent. However, in that case &old->bse_node==list.
+ */
+ list_add_tail(&new->bse_node, &old->bse_node);
+ /* Scan forward for overlaps. If we find any, extend new and
+ * remove the overlapped extent.
+ */
+ old = list_prepare_entry(new, clist, bse_node);
+ list_for_each_entry_safe_continue(old, save, clist, bse_node) {
+ if (end < old->bse_f_offset)
+ break;
+ /* new overlaps or abuts old */
+ if (new->bse_mdev == old->bse_mdev) {
+ if (end < old->bse_f_offset + old->bse_length) {
+ /* extend new to fully cover old */
+ end = old->bse_f_offset + old->bse_length;
+ new->bse_length = end - new->bse_f_offset;
+ }
+ list_del(&old->bse_node);
+ bl->bl_count--;
+ kfree(old);
+ }
+ }
+}
+
+/* Note the range described by offset, length is guaranteed to be contained
+ * within be.
+ */
+int mark_for_commit(struct pnfs_block_extent *be,
+ sector_t offset, sector_t length)
+{
+ sector_t new_end, end = offset + length;
+ struct pnfs_block_short_extent *new;
+ struct pnfs_block_layout *bl = container_of(be->be_inval,
+ struct pnfs_block_layout,
+ bl_inval);
+
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ mark_written_sectors(be->be_inval, offset, length);
+ /* We want to add the range to commit list, but it must be
+ * block-normalized, and verified that the normalized range has
+ * been entirely written to disk.
+ */
+ new->bse_f_offset = offset;
+ offset = normalize(offset, bl->bl_blocksize);
+ if (offset < new->bse_f_offset) {
+ if (is_range_written(be->be_inval, offset, new->bse_f_offset))
+ new->bse_f_offset = offset;
+ else
+ new->bse_f_offset = offset + bl->bl_blocksize;
+ }
+ new_end = normalize_up(end, bl->bl_blocksize);
+ if (end < new_end) {
+ if (is_range_written(be->be_inval, end, new_end))
+ end = new_end;
+ else
+ end = new_end - bl->bl_blocksize;
+ }
+ if (end <= new->bse_f_offset) {
+ kfree(new);
+ return 0;
+ }
+ new->bse_length = end - new->bse_f_offset;
+ new->bse_devid = be->be_devid;
+ new->bse_mdev = be->be_mdev;
+
+ spin_lock(&bl->bl_ext_lock);
+ /* new will be freed, either by add_to_commitlist if it decides not
+ * to use it, or after LAYOUTCOMMIT uses it in the commitlist.
+ */
+ add_to_commitlist(bl, new);
+ spin_unlock(&bl->bl_ext_lock);
+ return 0;
+}
+
static void print_bl_extent(struct pnfs_block_extent *be)
{
dprintk("PRINT EXTENT extent %p\n", be);
--
1.7.4.1
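
The block-normalization step in mark_for_commit() may be easier to follow as a small user-space model. The sketch below is an approximation under stated assumptions, not the driver's code: it tracks written state per sector in a plain array instead of the tagged tree, assumes normalize() rounds down and normalize_up() rounds up to a block boundary, and uses hypothetical names (written, blk_floor, blk_ceil, range_written, commit_range, example). It records a written range, then widens it to block boundaries only where the gap is already written, and otherwise shrinks it to the nearest whole block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSECT 64			/* size of the toy device, in sectors */

static bool written[NSECT];		/* stand-in for the EXTENT_WRITTEN tags */

static uint64_t blk_floor(uint64_t s, uint64_t bs) { return s - s % bs; }
static uint64_t blk_ceil(uint64_t s, uint64_t bs) { return blk_floor(s + bs - 1, bs); }

static bool range_written(uint64_t start, uint64_t end)
{
	for (uint64_t s = start; s < end; s++)
		if (!written[s])
			return false;
	return true;
}

/*
 * Record [offset, offset+length) as written, then compute the
 * block-aligned range around it that is written in full.
 * Returns false when no whole block is covered yet.
 */
static bool commit_range(uint64_t offset, uint64_t length, uint64_t bs,
			 uint64_t *start, uint64_t *end)
{
	uint64_t tail = offset + length;
	uint64_t lo = blk_floor(offset, bs);
	uint64_t hi = blk_ceil(tail, bs);

	assert(hi <= NSECT);
	for (uint64_t s = offset; s < tail; s++)
		written[s] = true;

	/* Grow to the block boundary only if the gap is already written. */
	*start = (lo == offset || range_written(lo, offset)) ? lo : lo + bs;
	*end = (hi == tail || range_written(tail, hi)) ? hi : hi - bs;
	return *start < *end;
}

/* Worked example with 8-sector blocks. */
static void example(void)
{
	uint64_t s, e;

	assert(!commit_range(4, 8, 8, &s, &e));	/* no whole block yet */
	assert(commit_range(0, 4, 8, &s, &e) && s == 0 && e == 8);
	assert(commit_range(12, 4, 8, &s, &e) && s == 8 && e == 16);
}
```

In the example, the first write straddles two blocks without completing either, so nothing is eligible for LAYOUTCOMMIT. The later writes fill in the missing head and tail, and each completed block becomes eligible in turn.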


2011-06-07 17:35:31

by Jim Rees

Subject: [PATCH 82/88] SQUASHME: pnfs-block: fixup encode_layoutcommit arguments

From: Benny Halevy <[email protected]>

Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 2 +-
fs/nfs/blocklayout/blocklayout.h | 2 +-
fs/nfs/blocklayout/extents.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 1865392..39f3896 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -645,7 +645,7 @@ bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,

static void
bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
- const struct nfs4_layoutcommit_args *arg)
+ const struct nfs4_layoutcommit_op_args *arg)
{
dprintk("%s enter\n", __func__);
encode_pnfs_block_layoutupdate(BLK_LO2EXT(lo), xdr, arg);
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 4266ae7..058c198 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -276,7 +276,7 @@ struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
- const struct nfs4_layoutcommit_args *arg);
+ const struct nfs4_layoutcommit_op_args *arg);
void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
const struct nfs4_layoutcommit_op_args *arg,
int status);
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 0ff65f3..cd32935 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -738,7 +738,7 @@ find_get_extent_locked(struct pnfs_block_layout *bl, sector_t isect)
int
encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
- const struct nfs4_layoutcommit_args *arg)
+ const struct nfs4_layoutcommit_op_args *arg)
{
sector_t start, end;
struct pnfs_block_short_extent *lce, *save;
--
1.7.4.1


2011-06-07 17:27:13

by Jim Rees

Subject: [PATCH 11/88] pnfsblock: blocklayout stub

From: Fred Isaman <[email protected]>

Adds the minimal structure for a pnfs block layout driver,
with all function pointers aimed at stubs.

[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <[email protected]>
[pnfs: move pnfs_layout_type inline in nfs_inode]
Signed-off-by: Benny Halevy <[email protected]>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/Makefile | 4 +-
fs/nfs/blocklayout/blocklayout.c | 224 ++++++++++++++++++++++++++++++++++++++
2 files changed, 226 insertions(+), 2 deletions(-)
create mode 100644 fs/nfs/blocklayout/blocklayout.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index f214c1c..6bf49cd 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -1,5 +1,5 @@
#
# Makefile for the pNFS block layout driver kernel module
#
-obj-$(CONFIG_PNFS_BLOCK) +=
-blocklayoutdriver-objs :=
+obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
+blocklayoutdriver-objs := blocklayout.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
new file mode 100644
index 0000000..1312849
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -0,0 +1,224 @@
+/*
+ * linux/fs/nfs/blocklayout/blocklayout.c
+ *
+ * Module for the NFSv4.1 pNFS block layout driver.
+ *
+ * Copyright (c) 2006 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Andy Adamson <[email protected]>
+ * Fred Isaman <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+
+#include <linux/nfs_fs.h>
+#include <linux/pnfs_xdr.h> /* Needed by nfs4_pnfs.h */
+#include <linux/nfs4_pnfs.h>
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Andy Adamson <[email protected]>");
+MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");
+
+/* Callback operations to the pNFS client */
+struct pnfs_client_operations *pnfs_callback_ops;
+
+static enum pnfs_try_status
+bl_commit(struct pnfs_layout_type *lo,
+ int sync,
+ struct nfs_write_data *nfs_data)
+{
+ dprintk("%s enter\n", __func__);
+ return PNFS_NOT_ATTEMPTED;
+}
+
+static enum pnfs_try_status
+bl_read_pagelist(struct pnfs_layout_type *lo,
+ struct page **pages,
+ unsigned int pgbase,
+ unsigned nr_pages,
+ loff_t offset,
+ size_t count,
+ struct nfs_read_data *nfs_data)
+{
+ dprintk("%s enter\n", __func__);
+ return PNFS_NOT_ATTEMPTED;
+}
+
+/* FRED - It can indicate bytes written in wdata->res.count.
+ * It can indicate error status in wdata->task.tk_status.
+ */
+static enum pnfs_try_status
+bl_write_pagelist(struct pnfs_layout_type *lo,
+ struct page **pages,
+ unsigned int pgbase,
+ unsigned nr_pages,
+ loff_t offset,
+ size_t count,
+ int sync,
+ struct nfs_write_data *wdata)
+{
+ dprintk("%s enter - just using nfs\n", __func__);
+ return PNFS_NOT_ATTEMPTED;
+}
+
+static void
+bl_free_layout(void *p)
+{
+ dprintk("%s enter\n", __func__);
+ return;
+}
+
+static void *
+bl_alloc_layout(struct pnfs_mount_type *mtype, struct inode *inode)
+{
+ dprintk("%s enter\n", __func__);
+ return NULL;
+}
+
+static void
+bl_free_lseg(struct pnfs_layout_segment *lseg)
+{
+ dprintk("%s enter\n", __func__);
+ return;
+}
+
+static struct pnfs_layout_segment *
+bl_alloc_lseg(struct pnfs_layout_type *lo,
+ struct nfs4_pnfs_layoutget_res *lgr)
+{
+ dprintk("%s enter\n", __func__);
+ return NULL;
+}
+
+static int
+bl_setup_layoutcommit(struct pnfs_layout_type *lo,
+ struct pnfs_layoutcommit_arg *arg)
+{
+ dprintk("%s enter\n", __func__);
+ return 0;
+}
+
+static void
+bl_encode_layoutcommit(struct pnfs_layout_type *lo, struct xdr_stream *xdr,
+ const struct pnfs_layoutcommit_arg *arg)
+{
+ dprintk("%s enter\n", __func__);
+}
+
+static void
+bl_cleanup_layoutcommit(struct pnfs_layout_type *lo,
+ struct pnfs_layoutcommit_arg *arg, int status)
+{
+ dprintk("%s enter\n", __func__);
+}
+
+static struct pnfs_mount_type *
+bl_initialize_mountpoint(struct super_block *sb, struct nfs_fh *fh)
+{
+ dprintk("%s enter\n", __func__);
+ return NULL;
+}
+
+static int
+bl_uninitialize_mountpoint(struct pnfs_mount_type *mtype)
+{
+ dprintk("%s enter\n", __func__);
+ return 0;
+}
+
+static ssize_t
+bl_get_stripesize(struct pnfs_layout_type *lo)
+{
+ dprintk("%s enter\n", __func__);
+ return 0;
+}
+
+static ssize_t
+bl_get_io_threshold(struct pnfs_layout_type *lo, struct inode *inode)
+{
+ dprintk("%s enter\n", __func__);
+ return 0;
+}
+
+/* This is called by nfs_can_coalesce_requests via nfs_pageio_do_add_request.
+ * Should return False if there is a reason requests can not be coalesced,
+ * otherwise, should default to returning True.
+ */
+static int
+bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
+ struct nfs_page *req)
+{
+ dprintk("%s enter\n", __func__);
+ return 1;
+}
+
+static struct layoutdriver_io_operations blocklayout_io_operations = {
+ .commit = bl_commit,
+ .read_pagelist = bl_read_pagelist,
+ .write_pagelist = bl_write_pagelist,
+ .alloc_layout = bl_alloc_layout,
+ .free_layout = bl_free_layout,
+ .alloc_lseg = bl_alloc_lseg,
+ .free_lseg = bl_free_lseg,
+ .setup_layoutcommit = bl_setup_layoutcommit,
+ .encode_layoutcommit = bl_encode_layoutcommit,
+ .cleanup_layoutcommit = bl_cleanup_layoutcommit,
+ .initialize_mountpoint = bl_initialize_mountpoint,
+ .uninitialize_mountpoint = bl_uninitialize_mountpoint,
+};
+
+static struct layoutdriver_policy_operations blocklayout_policy_operations = {
+ .get_stripesize = bl_get_stripesize,
+ .get_read_threshold = bl_get_io_threshold,
+ .get_write_threshold = bl_get_io_threshold,
+ .pg_test = bl_pg_test,
+};
+
+static struct pnfs_layoutdriver_type blocklayout_type = {
+ .id = LAYOUT_BLOCK_VOLUME,
+ .name = "LAYOUT_BLOCK_VOLUME",
+ .ld_io_ops = &blocklayout_io_operations,
+ .ld_policy_ops = &blocklayout_policy_operations,
+};
+
+static int __init nfs4blocklayout_init(void)
+{
+ dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);
+
+ pnfs_callback_ops = pnfs_register_layoutdriver(&blocklayout_type);
+ return 0;
+}
+
+static void __exit nfs4blocklayout_exit(void)
+{
+ dprintk("%s: NFSv4 Block Layout Driver Unregistering...\n",
+ __func__);
+
+ pnfs_unregister_layoutdriver(&blocklayout_type);
+}
+
+module_init(nfs4blocklayout_init);
+module_exit(nfs4blocklayout_exit);
--
1.7.4.1


2011-06-07 17:30:08

by Jim Rees

Subject: [PATCH 35/88] pnfsblock: bl_setup_layoutcommit

From: Fred Isaman <[email protected]>

In the blocklayout driver, two things happen during
layoutcommit and its cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout's rw
extents list, for reads.

In the new system, where the actual XDR encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit and bl_cleanup_layoutcommit.
(The generic layer provides a void pointer to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on the above structure.
FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. The commit-extent-list is then freed.
Finally, the allocated structure is freed.
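
The three commit stages described above can be sketched in user-space C
roughly as follows. This is only an illustration, not the driver code:
struct commit_arg, struct layoutupdate_data, setup_commit(),
encode_commit() and cleanup_commit() are hypothetical stand-ins, and the
commit-extent-list is reduced to a counter. The alignment arithmetic
follows the hunk in bl_setup_layoutcommit and assumes the block size is
a power of two.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures named in the patch. */
struct commit_range {
	uint64_t offset;
	uint64_t length;
};

struct layoutupdate_data {	/* plays the role of bl_layoutupdate_data */
	int nranges;		/* stand-in for the commit-extent-list */
};

struct commit_arg {
	struct commit_range range;
	void *driver_data;	/* private pointer offered by the generic layer */
};

/* Stage 1: align the range to the block size, allocate the private state. */
static int setup_commit(struct commit_arg *arg, uint64_t blksize)
{
	struct layoutupdate_data *d;

	/* The mask trick below only works for a power-of-two block size. */
	assert((blksize & (blksize - 1)) == 0);
	if (blksize) {
		uint64_t mask = blksize - 1;
		uint64_t off = arg->range.offset & mask;

		arg->range.offset -= off;		/* round start down */
		arg->range.length += off + mask;	/* round end up */
		arg->range.length &= ~mask;
	}
	d = malloc(sizeof(*d));
	if (!d)
		return -1;
	d->nranges = 0;
	arg->driver_data = d;
	return 0;
}

/* Stage 2: encode. The list stays on the private structure, not freed. */
static void encode_commit(struct commit_arg *arg)
{
	struct layoutupdate_data *d = arg->driver_data;

	d->nranges = 1;		/* pretend one extent was encoded */
}

/* Stage 3: cleanup. Release the list, then the private structure. */
static void cleanup_commit(struct commit_arg *arg)
{
	free(arg->driver_data);
	arg->driver_data = NULL;
}
```

With a 4096-byte block size, a range starting at byte 5000 with length
3000 is widened to start at 4096 with length 4096, so it still covers
the original bytes.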

[pnfsblock: fix 64-bit compiler warnings for setup_layoutcommit]
Signed-off-by: Fred Isaman <[email protected]>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 10 +++++++++-
fs/nfs/blocklayout/blocklayout.h | 19 +++++++++++++++++++
2 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index d4396d6..0277974 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -625,7 +625,7 @@ bl_setup_layoutcommit(struct pnfs_layout_type *lo,
struct pnfs_layoutcommit_arg *arg)
{
struct nfs_server *nfss = PNFS_NFS_SERVER(lo);
- struct pnfs_layoutcommit_arg *arg = &data->args;
+ struct bl_layoutupdate_data *layoutupdate_data;

dprintk("%s enter\n", __func__);
/* Need to ensure commit is block-size aligned */
@@ -637,6 +637,14 @@ bl_setup_layoutcommit(struct pnfs_layout_type *lo,
arg->lseg.length += offset + mask;
arg->lseg.length &= ~mask;
}
+
+ layoutupdate_data = kmalloc(sizeof(struct bl_layoutupdate_data),
+ GFP_KERNEL);
+ if (unlikely(!layoutupdate_data))
+ return -ENOMEM;
+ INIT_LIST_HEAD(&layoutupdate_data->ranges);
+ arg->layoutdriver_data = layoutupdate_data;
+
return 0;
}

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 1ec9bff..780d757 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -181,6 +181,13 @@ struct pnfs_block_layout {
sector_t bl_blocksize; /* Server blocksize in sectors */
};

+/* this struct is communicated between:
+ * bl_setup_layoutcommit && bl_encode_layoutcommit && bl_cleanup_layoutcommit
+ */
+struct bl_layoutupdate_data {
+ struct list_head ranges;
+};
+
#define BLK_ID(lo) ((struct block_mount_id *)(PNFS_MOUNTID(lo)->mountid))
#define BLK_LSEG2EXT(lseg) ((struct pnfs_block_layout *)lseg->layout->ld_data)
#define BLK_LO2EXT(lo) ((struct pnfs_block_layout *)lo->ld_data)
@@ -218,6 +225,18 @@ uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
(x) = tmp >> 9; \
} while (0)

+#define WRITE32(n) do { \
+ *p++ = htonl(n); \
+ } while (0)
+#define WRITE64(n) do { \
+ *p++ = htonl((uint32_t)((n) >> 32)); \
+ *p++ = htonl((uint32_t)(n)); \
+} while (0)
+#define WRITEMEM(ptr, nbytes) do { \
+ p = xdr_encode_opaque_fixed(p, ptr, nbytes); \
+} while (0)
+#define WRITE_DEVID(x) WRITEMEM((x)->data, NFS4_PNFS_DEVICEID4_SIZE)
+
/* blocklayoutdev.c */
struct block_device *nfs4_blkdev_get(dev_t dev);
int nfs4_blkdev_put(struct block_device *bdev);
--
1.7.4.1
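
A note on the WRITE32/WRITE64 macros added to blocklayout.h in the patch
above: they emit XDR integers as big-endian 32-bit words, with a 64-bit
value written as its high word followed by its low word. A minimal
user-space sketch of the same encoding, written as functions instead of
macros, might look like this. The names xdr_put_u32(), xdr_put_u64() and
xdr_get_u64() are made up for this illustration and are not kernel APIs.

```c
#include <arpa/inet.h>	/* htonl, ntohl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Function form of WRITE32: store one big-endian word, advance p. */
static uint32_t *xdr_put_u32(uint32_t *p, uint32_t n)
{
	assert(p != NULL);
	*p++ = htonl(n);
	return p;
}

/* Function form of WRITE64: high 32 bits first, then low 32 bits. */
static uint32_t *xdr_put_u64(uint32_t *p, uint64_t n)
{
	p = xdr_put_u32(p, (uint32_t)(n >> 32));
	return xdr_put_u32(p, (uint32_t)n);
}

/* Inverse of xdr_put_u64, shown only to check the round trip. */
static uint64_t xdr_get_u64(const uint32_t *p)
{
	assert(p != NULL);
	return ((uint64_t)ntohl(p[0]) << 32) | ntohl(p[1]);
}
```

Returning the advanced pointer rather than relying on a variable named p
in the caller's scope is a design choice of the sketch; the macros in
the patch assume such a variable exists.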


2011-06-10 21:15:56

by Benny Halevy

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 2011-06-10 16:03, Fred Isaman wrote:
> On Fri, Jun 10, 2011 at 3:23 PM, Benny Halevy <[email protected]> wrote:
>>
>> A simple algorithm I can suggest is:
>> - on initialization, calculate and save, per layout driver
>> - maximum layout size
>
> I must be misunderstanding something. Layout size has nothing to do
> with io size (other than the obvious fact that you want the layout >
> io).

Exactly, and if the layouts you get from the server are too small,
it's hard to do efficient I/O (modulo layout segment merging
or gathering).

>
> I don't know about the object driver, but for both the file and block
> drivers the client wants as much as the server will give it.
>

For blocks, the message buffer size (as mentioned below) may be
a limiting factor, hence the limit.
Another constraint for blocks, and to some extent for objects, is
provisional allocation: you don't want to ask for arbitrarily
large write layouts, since this may be interpreted as intent to
write to the whole range and result in excessive provisional
allocation of resources.

Bottom line, I completely agree with "the client wants as much as the
server will give it" so it should ask for what it needs and let the
server decide whether to extend or trim the range if need be.

Benny

> Fred
>
>> - take into account csr_fore_chan_attrs.ca_maxresponsesize and possibly other parameters
>> - keep a working copy of the maximum value and the calculated copy.
>> - alignment value.
>> - on miss, see if there's an adjacent layout segment in cache
>> - if found, ask for twice the found segment size, up to the maximum value,
>> aligned on the alignment value.
>> - if the server returns less than the layoutget range, keep note of the returned length
>> (but not adjust maximum yet, as the server may return a short segment for various
>> reasons)
>> - if the server is consistent about returning less than was asked, adjust the
>> working copy of the maximum length
>> - if the maximum was adjusted try bumping it up after X (TBD) layoutgets or T seconds
>> to see if that was just due to high load or conflicts on the server
>> - on any error returned for LAYOUTGET reset the algorithm parameters
>> - on session reestablishment recalculate maximums.
>>
>> Benny
>>
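
The sizing algorithm quoted above can be sketched in user-space C
roughly as follows. This is a sketch under assumptions, not a proposed
patch: struct prefetch_state, the function names, and the threshold
SHORT_REPLY_LIMIT (what counts as "consistent") are all hypothetical,
the alignment is assumed to be a power of two, and the rule that a
request never shrinks below the length actually needed is an addition
of the sketch. The periodic "bump it back up" step is left out.

```c
#include <assert.h>
#include <stdint.h>

#define SHORT_REPLY_LIMIT 3	/* hypothetical: short replies before adjusting */

struct prefetch_state {
	uint64_t calc_max;	/* maximum calculated at initialization */
	uint64_t work_max;	/* working copy, may be lowered */
	uint64_t align;		/* alignment value, power of two */
	unsigned int short_count; /* consecutive short replies seen */
};

static void prefetch_init(struct prefetch_state *s, uint64_t max,
			  uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	assert(max >= align && max % align == 0);
	s->calc_max = s->work_max = max;
	s->align = align;
	s->short_count = 0;
}

/* Length to request on a cache miss; adjacent_len is 0 if none was found. */
static uint64_t prefetch_request_len(const struct prefetch_state *s,
				     uint64_t needed, uint64_t adjacent_len)
{
	uint64_t len = needed;

	if (adjacent_len) {
		uint64_t want = 2 * adjacent_len;	/* twice the neighbour */

		if (want > s->work_max)
			want = s->work_max;		/* up to the maximum */
		if (want > len)
			len = want;
	}
	/* Round up to the alignment value. */
	return (len + s->align - 1) & ~(s->align - 1);
}

/* Record a LAYOUTGET reply; lower the working maximum only when the
 * server has been short SHORT_REPLY_LIMIT times in a row. */
static void prefetch_note_reply(struct prefetch_state *s, uint64_t asked,
				uint64_t got)
{
	if (got >= asked) {
		s->short_count = 0;
		return;
	}
	if (++s->short_count >= SHORT_REPLY_LIMIT) {
		s->work_max = got & ~(s->align - 1);
		if (!s->work_max)
			s->work_max = s->align;
		s->short_count = 0;
	}
}

/* On a LAYOUTGET error, reset the algorithm parameters. */
static void prefetch_reset(struct prefetch_state *s)
{
	s->work_max = s->calc_max;
	s->short_count = 0;
}
```

On session reestablishment the caller would run prefetch_init() again
with freshly calculated maximums.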

2011-06-07 17:28:16

by Jim Rees

Subject: [PATCH 21/88] pnfsblock: lseg alloc and free

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 29 +++++++++++++++++++++++++++--
fs/nfs/blocklayout/blocklayout.h | 3 +++
fs/nfs/blocklayout/blocklayoutdev.c | 9 +++++++++
3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index fb06f3a..f54e9a9 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -145,15 +145,40 @@ static void
bl_free_lseg(struct pnfs_layout_segment *lseg)
{
dprintk("%s enter\n", __func__);
- return;
+ kfree(lseg);
}

+/* Because the generic infrastructure does not correctly merge layouts,
+ * we pretty much ignore lseg, and store all data layout wide, so we
+ * can correctly merge. Eventually we should push some correct merge
+ * behavior up to the generic code, as the current behavior tends to
+ * cause lots of unnecessary overlapping LAYOUTGET requests.
+ */
static struct pnfs_layout_segment *
bl_alloc_lseg(struct pnfs_layout_type *lo,
struct nfs4_pnfs_layoutget_res *lgr)
{
+ struct pnfs_layout_segment *lseg;
+ int status;
+
dprintk("%s enter\n", __func__);
- return NULL;
+ lseg = kzalloc(sizeof(*lseg) + 0, GFP_KERNEL);
+ if (!lseg)
+ return NULL;
+ status = nfs4_blk_process_layoutget(lo, lgr);
+ if (status) {
+ /* We don't want to call the full-blown bl_free_lseg,
+ * since on error extents were not touched.
+ */
+ /* STUB - we really want to distinguish between 2 error
+ * conditions here. This lseg failed, but lo data structures
+ * are OK, or we hosed the lo data structures. The calling
+ * code probably needs to distinguish this too.
+ */
+ kfree(lseg);
+ return ERR_PTR(status);
+ }
+ return lseg;
}

static int
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 4e6f8fc..bcf85be 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -142,6 +142,7 @@ struct pnfs_block_layout {
sector_t bl_blocksize; /* Server blocksize in sectors */
};

+#define BLK_LSEG2EXT(lseg) ((struct pnfs_block_layout *)lseg->layout->ld_data)
#define BLK_LO2EXT(lo) ((struct pnfs_block_layout *)lo->ld_data)

uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
@@ -183,6 +184,8 @@ int nfs4_blkdev_put(struct block_device *bdev);
struct pnfs_block_dev *nfs4_blk_decode_device(struct super_block *sb,
struct pnfs_device *dev,
struct list_head *sdlist);
+int nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
+ struct nfs4_pnfs_layoutget_res *lgr);
int nfs4_blk_create_scsi_disk_list(struct list_head *);
void nfs4_blk_destroy_disk_list(struct list_head *);
/* blocklayoutdm.c */
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 0ea44aa..818cc1c 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -553,3 +553,12 @@ nfs4_blk_decode_device(struct super_block *sb,
kfree(vols);
return rv;
}
+
+/* XDR decode pnfs_block_layout4 structure */
+int
+nfs4_blk_process_layoutget(struct pnfs_layout_type *lo,
+ struct nfs4_pnfs_layoutget_res *lgr)
+{
+ /* STUB */
+ return -EIO;
+}
--
1.7.4.1
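
A note on the error handling in bl_alloc_lseg in the patch above: it
returns NULL when the allocation fails but ERR_PTR(status) when decoding
fails, so a caller has to check for both. The following user-space
sketch imitates that pattern. The ERR_PTR, PTR_ERR and IS_ERR helpers
here are simplified imitations of the kernel ones, and struct lseg,
process_layoutget(), alloc_lseg() and lseg_status() are hypothetical
stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel helpers from <linux/err.h>. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct lseg {
	int refcount;
};

/* Stand-in for nfs4_blk_process_layoutget(): returns 0 or a -errno. */
static int process_layoutget(int fail_with)
{
	return fail_with;
}

/* Same shape as bl_alloc_lseg: NULL on allocation failure, an encoded
 * error pointer if decoding the layout fails. */
static struct lseg *alloc_lseg(int fail_with)
{
	struct lseg *lseg = calloc(1, sizeof(*lseg));
	int status;

	if (!lseg)
		return NULL;
	status = process_layoutget(fail_with);
	if (status) {
		free(lseg);	/* extents were not touched, plain free */
		return ERR_PTR(status);
	}
	return lseg;
}

/* What a caller has to do: treat NULL and error pointers separately. */
static long lseg_status(const struct lseg *lseg)
{
	if (!lseg)
		return -ENOMEM;
	if (IS_ERR(lseg))
		return PTR_ERR(lseg);
	return 0;
}
```

Since the stub nfs4_blk_process_layoutget in this patch always returns
-EIO, the error-pointer path is the only one exercised until the decoder
is filled in by a later patch.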


2011-06-08 02:06:37

by Benny Halevy

Subject: Re: [PATCH 86/88] SQUASHME: pnfs: blocklayout: port block layout code

On 2011-06-07 13:35, Jim Rees wrote:
> From: Peng Tao <[email protected]>
>
> Make minimal changes to let block layout driver work in current framework.
>
> Signed-off-by: Tang Haiying <[email protected]>
> Signed-off-by: Zhang Jingwang <[email protected]>
> Signed-off-by: Peng Tao <[email protected]>
> Signed-off-by: Jim Rees <[email protected]>
> ---
> drivers/md/dm-ioctl.c | 24 --------
> drivers/scsi/hosts.c | 3 +-
> fs/nfs/blocklayout/blocklayout.c | 105 ++++++++++------------------------
> fs/nfs/blocklayout/blocklayout.h | 9 +--
> fs/nfs/blocklayout/blocklayoutdev.c | 34 ++++++++----
> fs/nfs/blocklayout/extents.c | 14 +----
> fs/nfs/nfs4proc.c | 1 -
> fs/nfs/nfs4xdr.c | 3 +-
> fs/nfs/pnfs.c | 8 ++-
> fs/nfs/pnfs.h | 1 +
> include/linux/nfs_fs_sb.h | 1 +
> 11 files changed, 69 insertions(+), 134 deletions(-)
>
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index d0d417e..4cacdad 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -713,12 +713,6 @@ static int dev_create(struct dm_ioctl *param, size_t param_size)
> return 0;
> }
>
> -int dm_dev_create(struct dm_ioctl *param)
> -{
> - return dev_create(param, sizeof(*param));
> -}
> -EXPORT_SYMBOL(dm_dev_create);
> -
> /*
> * Always use UUID for lookups if it's present, otherwise use name or dev.
> */
> @@ -814,12 +808,6 @@ static int dev_remove(struct dm_ioctl *param, size_t param_size)
> return 0;
> }
>
> -int dm_dev_remove(struct dm_ioctl *param)
> -{
> - return dev_remove(param, sizeof(*param));
> -}
> -EXPORT_SYMBOL(dm_dev_remove);
> -
> /*
> * Check a string doesn't overrun the chunk of
> * memory we copied from userland.
> @@ -1002,12 +990,6 @@ static int do_resume(struct dm_ioctl *param)
> return r;
> }
>
> -int dm_do_resume(struct dm_ioctl *param)
> -{
> - return do_resume(param);
> -}
> -EXPORT_SYMBOL(dm_do_resume);
> -
> /*
> * Set or unset the suspension state of a device.
> * If the device already is in the requested state we just return its status.
> @@ -1274,12 +1256,6 @@ out:
> return r;
> }
>
> -int dm_table_load(struct dm_ioctl *param, size_t param_size)
> -{
> - return table_load(param, param_size);
> -}
> -EXPORT_SYMBOL(dm_table_load);
> -
> static int table_clear(struct dm_ioctl *param, size_t param_size)
> {
> struct hash_cell *hc;
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 7d91903..4f7a582 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -50,11 +50,10 @@ static void scsi_host_cls_release(struct device *dev)
> put_device(&class_to_shost(dev)->shost_gendev);
> }
>
> -struct class shost_class = {
> +static struct class shost_class = {
> .name = "scsi_host",
> .dev_release = scsi_host_cls_release,
> };
> -EXPORT_SYMBOL(shost_class);
>
> /**
> * scsi_host_set_state - Take the given host through the host state model.
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 2583b87..d842ec8 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -97,14 +97,6 @@ dont_like_caller(struct nfs_page *req)
> }
> }
>
> -static enum pnfs_try_status
> -bl_commit(struct nfs_write_data *nfs_data,
> - int sync)
> -{
> - dprintk("%s enter\n", __func__);
> - return PNFS_NOT_ATTEMPTED;
> -}
> -
> /* The data we are handed might be spread across several bios. We need
> * to track when the last one is finished.
> */
> @@ -198,7 +190,7 @@ static void bl_read_cleanup(struct work_struct *work)
> dprintk("%s enter\n", __func__);
> task = container_of(work, struct rpc_task, u.tk_work);
> rdata = container_of(task, struct nfs_read_data, task);
> - pnfs_read_done(rdata);
> + pnfs_ld_read_done(rdata);
> }
>
> static void
> @@ -219,8 +211,7 @@ static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
> }
>
> static enum pnfs_try_status
> -bl_read_pagelist(struct nfs_read_data *rdata,
> - unsigned nr_pages)
> +bl_read_pagelist(struct nfs_read_data *rdata)
> {
> int i, hole;
> struct bio *bio = NULL;
> @@ -233,13 +224,13 @@ bl_read_pagelist(struct nfs_read_data *rdata,
> int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
>
> dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
> - nr_pages, f_offset, count);
> + rdata->npages, f_offset, count);
>
> if (dont_like_caller(rdata->req)) {
> dprintk("%s dont_like_caller failed\n", __func__);
> goto use_mds;
> }
> - if ((nr_pages == 1) && PagePnfsErr(rdata->req->wb_page)) {
> + if ((rdata->npages == 1) && PagePnfsErr(rdata->req->wb_page)) {
> /* We want to fall back to mds in case of read_page
> * after error on read_pages.
> */
> @@ -249,21 +240,21 @@ bl_read_pagelist(struct nfs_read_data *rdata,
> par = alloc_parallel(rdata);
> if (!par)
> goto use_mds;
> - par->call_ops = *rdata->pdata.call_ops;
> + par->call_ops = *rdata->mds_ops;
> par->call_ops.rpc_call_done = bl_rpc_do_nothing;
> par->pnfs_callback = bl_end_par_io_read;
> /* At this point, we can no longer jump to use_mds */
>
> isect = (sector_t) (f_offset >> 9);
> /* Code assumes extents are page-aligned */
> - for (i = pg_index; i < nr_pages; i++) {
> + for (i = pg_index; i < rdata->npages; i++) {
> if (!extent_length) {
> /* We've used up the previous extent */
> put_extent(be);
> put_extent(cow_read);
> bio = bl_submit_bio(READ, bio);
> /* Get the next one */
> - be = find_get_extent(BLK_LSEG2EXT(rdata->pdata.lseg),
> + be = find_get_extent(BLK_LSEG2EXT(rdata->lseg),
> isect, &cow_read);
> if (!be) {
> /* Error out this page */
> @@ -293,7 +284,7 @@ bl_read_pagelist(struct nfs_read_data *rdata,
> be_read = (hole && cow_read) ? cow_read : be;
> for (;;) {
> if (!bio) {
> - bio = bio_alloc(GFP_NOIO, nr_pages - i);
> + bio = bio_alloc(GFP_NOIO, rdata->npages - i);
> if (!bio) {
> /* Error out this page */
> bl_done_with_rpage(pages[i], 0);
> @@ -407,10 +398,10 @@ static void bl_write_cleanup(struct work_struct *work)
> /* BUG - this should be called after each bio, not after
> * all finish, unless have some way of storing success/failure
> */
> - mark_extents_written(BLK_LSEG2EXT(wdata->pdata.lseg),
> + mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
> wdata->args.offset, wdata->args.count);
> }
> - pnfs_writeback_done(wdata);
> + pnfs_ld_write_done(wdata);
> }
>
> /* Called when last of bios associated with a bl_write_pagelist call finishes */
> @@ -428,7 +419,6 @@ bl_end_par_io_write(void *data)
>
> static enum pnfs_try_status
> bl_write_pagelist(struct nfs_write_data *wdata,
> - unsigned nr_pages,
> int sync)
> {
> int i;
> @@ -442,7 +432,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
> int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
>
> dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
> - if (!wdata->req->wb_lseg) {
> + if (!wdata->lseg) {
> dprintk("%s no lseg, falling back to MDS\n", __func__);
> return PNFS_NOT_ATTEMPTED;
> }
> @@ -460,19 +450,19 @@ bl_write_pagelist(struct nfs_write_data *wdata,
> par = alloc_parallel(wdata);
> if (!par)
> return PNFS_NOT_ATTEMPTED;
> - par->call_ops = *wdata->pdata.call_ops;
> + par->call_ops = *wdata->mds_ops;
> par->call_ops.rpc_call_done = bl_rpc_do_nothing;
> par->pnfs_callback = bl_end_par_io_write;
> /* At this point, have to be more careful with error handling */
>
> isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> 9);
> - for (i = pg_index; i < nr_pages; i++) {
> + for (i = pg_index; i < wdata->npages ; i++) {
> if (!extent_length) {
> /* We've used up the previous extent */
> put_extent(be);
> bio = bl_submit_bio(WRITE, bio);
> /* Get the next one */
> - be = find_get_extent(BLK_LSEG2EXT(wdata->pdata.lseg),
> + be = find_get_extent(BLK_LSEG2EXT(wdata->lseg),
> isect, NULL);
> if (!be || !is_writable(be, isect)) {
> /* FIXME */
> @@ -484,7 +474,7 @@ bl_write_pagelist(struct nfs_write_data *wdata,
> }
> for (;;) {
> if (!bio) {
> - bio = bio_alloc(GFP_NOIO, nr_pages - i);
> + bio = bio_alloc(GFP_NOIO, wdata->npages - i);
> if (!bio) {
> /* Error out this page */
> /* FIXME */
> @@ -504,7 +494,12 @@ bl_write_pagelist(struct nfs_write_data *wdata,
> isect += PAGE_CACHE_SIZE >> 9;
> extent_length -= PAGE_CACHE_SIZE >> 9;
> }
> - wdata->res.count = (isect << 9) - (offset & (long)PAGE_CACHE_MASK);
> + wdata->res.count = (isect << 9) - (offset);
> + if (count < wdata->res.count) {
> + wdata->res.count = count;
> + }
> + /* pnfs_set_layoutcommit needs this */
> + wdata->mds_offset = offset;
> put_extent(be);
> bl_submit_bio(WRITE, bio);
> put_parallel(par);
> @@ -557,18 +552,19 @@ bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
> }
>
> static struct pnfs_layout_hdr *
> -bl_alloc_layout_hdr(struct inode *inode)
> +bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
> {
> struct pnfs_block_layout *bl;
>
> dprintk("%s enter\n", __func__);
> - bl = kzalloc(sizeof(*bl), GFP_KERNEL);
> + bl = kzalloc(sizeof(*bl), gfp_flags);
> if (!bl)
> return NULL;
> spin_lock_init(&bl->bl_ext_lock);
> INIT_LIST_HEAD(&bl->bl_extents[0]);
> INIT_LIST_HEAD(&bl->bl_extents[1]);
> INIT_LIST_HEAD(&bl->bl_commit);
> + INIT_LIST_HEAD(&bl->bl_committing);
> bl->bl_count = 0;
> bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
> INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
> @@ -590,16 +586,16 @@ bl_free_lseg(struct pnfs_layout_segment *lseg)
> */
> static struct pnfs_layout_segment *
> bl_alloc_lseg(struct pnfs_layout_hdr *lo,
> - struct nfs4_layoutget_res *lgr)
> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
> {
> struct pnfs_layout_segment *lseg;
> int status;
>
> dprintk("%s enter\n", __func__);
> - lseg = kzalloc(sizeof(*lseg) + 0, GFP_KERNEL);
> + lseg = kzalloc(sizeof(*lseg) + 0, gfp_flags);
> if (!lseg)
> return NULL;
> - status = nfs4_blk_process_layoutget(lo, lgr);
> + status = nfs4_blk_process_layoutget(lo, lgr, gfp_flags);
> if (status) {
> /* We don't want to call the full-blown bl_free_lseg,
> * since on error extents were not touched.
> @@ -615,34 +611,6 @@ bl_alloc_lseg(struct pnfs_layout_hdr *lo,
> return lseg;
> }
>
> -static int
> -bl_setup_layoutcommit(struct pnfs_layout_hdr *lo,
> - struct nfs4_layoutcommit_args *arg)
> -{
> - struct nfs_server *nfss = NFS_SERVER(lo->plh_inode);
> - struct bl_layoutupdate_data *layoutupdate_data;
> -
> - dprintk("%s enter\n", __func__);
> - /* Need to ensure commit is block-size aligned */
> - if (nfss->pnfs_blksize) {
> - u64 mask = nfss->pnfs_blksize - 1;
> - u64 offset = arg->range.offset & mask;
> -
> - arg->range.offset -= offset;
> - arg->range.length += offset + mask;
> - arg->range.length &= ~mask;
> - }
> -
> - layoutupdate_data = kmalloc(sizeof(struct bl_layoutupdate_data),
> - GFP_KERNEL);
> - if (unlikely(!layoutupdate_data))
> - return -ENOMEM;
> - INIT_LIST_HEAD(&layoutupdate_data->ranges);
> - arg->layoutdriver_data = layoutupdate_data;
> -
> - return 0;
> -}
> -
> static void
> bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
> const struct nfs4_layoutcommit_args *arg)
> @@ -657,7 +625,6 @@ bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
> {
> dprintk("%s enter\n", __func__);
> clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args, lcdata->res.status);
> - kfree(lcdata->args.layoutdriver_data);
> }
>
> static void free_blk_mountid(struct block_mount_id *mid)
> @@ -1085,25 +1052,16 @@ bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
> fsdata->private = NULL;
> }
>
> -/* This is called by nfs_can_coalesce_requests via nfs_pageio_do_add_request.
> - * Should return False if there is a reason requests can not be coalesced,
> - * otherwise, should default to returning True.
> - */
> -static int
> +static bool
> bl_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
> - struct nfs_page *req)
> + struct nfs_page *req)
> {
> - dprintk("%s enter\n", __func__);
> - if (pgio->pg_iswrite)
> - return prev->wb_lseg == req->wb_lseg;
> - else
> - return 1;
> + return pnfs_generic_pg_test(pgio, prev, req);
> }
>
> static struct pnfs_layoutdriver_type blocklayout_type = {
> .id = LAYOUT_BLOCK_VOLUME,
> .name = "LAYOUT_BLOCK_VOLUME",
> - .commit = bl_commit,
> .read_pagelist = bl_read_pagelist,
> .write_pagelist = bl_write_pagelist,
> .write_begin = bl_write_begin,
> @@ -1113,12 +1071,11 @@ static struct pnfs_layoutdriver_type blocklayout_type = {
> .free_layout_hdr = bl_free_layout_hdr,
> .alloc_lseg = bl_alloc_lseg,
> .free_lseg = bl_free_lseg,
> - .setup_layoutcommit = bl_setup_layoutcommit,
> .encode_layoutcommit = bl_encode_layoutcommit,
> .cleanup_layoutcommit = bl_cleanup_layoutcommit,
> .set_layoutdriver = bl_set_layoutdriver,
> .clear_layoutdriver = bl_clear_layoutdriver,
> - .pg_test = bl_pg_test,
> + .pg_test = bl_pg_test,

Why not just set pg_test to pnfs_generic_pg_test?

Benny

> };
>
> static int __init nfs4blocklayout_init(void)
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index a8198ae..dd596d4 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -33,7 +33,6 @@
> #define FS_NFS_NFS4BLOCKLAYOUT_H
>
> #include <linux/nfs_fs.h>
> -#include <linux/dm-ioctl.h> /* Needed for struct dm_ioctl*/
> #include "../pnfs.h"
>
> #define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)
> @@ -43,11 +42,6 @@
> #define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
> #define ClearPagePnfsErr(page) clear_bit(PG_pnfserr, &(page)->flags)
>
> -extern int dm_dev_create(struct dm_ioctl *param); /* from dm-ioctl.c */
> -extern int dm_dev_remove(struct dm_ioctl *param); /* from dm-ioctl.c */
> -extern int dm_do_resume(struct dm_ioctl *param);
> -extern int dm_table_load(struct dm_ioctl *param, size_t param_size);
> -
> struct block_mount_id {
> spinlock_t bm_lock; /* protects list */
> struct list_head bm_devlist; /* holds pnfs_block_dev */
> @@ -180,6 +174,7 @@ struct pnfs_block_layout {
> spinlock_t bl_ext_lock; /* Protects list manipulation */
> struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
> struct list_head bl_commit; /* Needs layout commit */
> + struct list_head bl_committing; /* Layout committing */
> unsigned int bl_count; /* entries in bl_commit */
> sector_t bl_blocksize; /* Server blocksize in sectors */
> };
> @@ -257,7 +252,7 @@ struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
> struct pnfs_device *dev,
> struct list_head *sdlist);
> int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> - struct nfs4_layoutget_res *lgr);
> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
> int nfs4_blk_create_block_disk_list(struct list_head *);
> void nfs4_blk_destroy_disk_list(struct list_head *);
> /* blocklayoutdm.c */
> diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
> index 23469e3..a90eb6b 100644
> --- a/fs/nfs/blocklayout/blocklayoutdev.c
> +++ b/fs/nfs/blocklayout/blocklayoutdev.c
> @@ -231,14 +231,16 @@ static int verify_extent(struct pnfs_block_extent *be,
> /* XDR decode pnfs_block_layout4 structure */
> int
> nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> - struct nfs4_layoutget_res *lgr)
> + struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
> {
> struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
> - uint32_t *p = (uint32_t *)lgr->layout.buf;
> - uint32_t *end = (uint32_t *)((char *)lgr->layout.buf + lgr->layout.len);
> int i, status = -EIO;
> uint32_t count;
> struct pnfs_block_extent *be = NULL, *save;
> + struct xdr_stream stream;
> + struct xdr_buf buf;
> + struct page *scratch;
> + __be32 *p;
> uint64_t tmp; /* Used by READSECTOR */
> struct layout_verification lv = {
> .mode = lgr->range.iomode,
> @@ -246,14 +248,27 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> .inval = lgr->range.offset >> 9,
> .cowread = lgr->range.offset >> 9,
> };
> -
> LIST_HEAD(extents);
>
> - BLK_READBUF(p, end, 4);
> + dprintk("---> %s\n", __func__);
> +
> + scratch = alloc_page(gfp_flags);
> + if (!scratch)
> + return -ENOMEM;
> +
> + xdr_init_decode_pages(&stream, &buf, lgr->layoutp->pages, lgr->layoutp->len);
> + xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
> +
> + p = xdr_inline_decode(&stream, 4);
> + if (unlikely(!p))
> + goto out_err;
> +
> READ32(count);
>
> dprintk("%s enter, number of extents %i\n", __func__, count);
> - BLK_READBUF(p, end, (28 + NFS4_DEVICEID4_SIZE) * count);
> + p = xdr_inline_decode(&stream, (28 + NFS4_DEVICEID4_SIZE) * count);
> + if (unlikely(!p))
> + goto out_err;
>
> /* Decode individual extents, putting them in temporary
> * staging area until whole layout is decoded to make error
> @@ -269,6 +284,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> be->be_mdev = translate_devid(lo, &be->be_devid);
> if (!be->be_mdev)
> goto out_err;
> +
> /* The next three values are read in as bytes,
> * but stored as 512-byte sector lengths
> */
> @@ -284,11 +300,6 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> }
> list_add_tail(&be->be_node, &extents);
> }
> - if (p != end) {
> - dprintk("%s Undecoded cruft at end of opaque\n", __func__);
> - be = NULL;
> - goto out_err;
> - }
> if (lgr->range.offset + lgr->range.length != lv.start << 9) {
> dprintk("%s Final length mismatch\n", __func__);
> be = NULL;
> @@ -319,6 +330,7 @@ nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
> spin_unlock(&bl->bl_ext_lock);
> status = 0;
> out:
> + __free_page(scratch);
> dprintk("%s returns %i\n", __func__, status);
> return status;
>
> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> index 40dff82..08413ec 100644
> --- a/fs/nfs/blocklayout/extents.c
> +++ b/fs/nfs/blocklayout/extents.c
> @@ -232,7 +232,7 @@ _range_has_tag(struct my_tree_t *tree, u64 start, u64 end, int32_t tag)
> if ((pos->it_sector == end - tree->mtt_step_size) &&
> (pos->it_tags & (1 << tag))) {
> expect = pos->it_sector - tree->mtt_step_size;
> - if (expect < start)
> + if (pos->it_sector < tree->mtt_step_size || expect < start)
> return 1;
> continue;
> } else {
> @@ -740,19 +740,12 @@ encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
> struct xdr_stream *xdr,
> const struct nfs4_layoutcommit_args *arg)
> {
> - sector_t start, end;
> struct pnfs_block_short_extent *lce, *save;
> unsigned int count = 0;
> - struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
> - struct list_head *ranges = &bld->ranges;
> + struct list_head *ranges = &bl->bl_committing;
> __be32 *p, *xdr_start;
>
> dprintk("%s enter\n", __func__);
> - start = arg->range.offset >> 9;
> - end = start + (arg->range.length >> 9);
> - dprintk("%s set start=%llu, end=%llu\n",
> - __func__, (u64)start, (u64)end);
> -
> /* BUG - creation of bl_commit is buggy - need to wait for
> * entire block to be marked WRITTEN before it can be added.
> */
> @@ -925,11 +918,10 @@ clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
> const struct nfs4_layoutcommit_args *arg,
> int status)
> {
> - struct bl_layoutupdate_data *bld = arg->layoutdriver_data;
> struct pnfs_block_short_extent *lce, *save;
>
> dprintk("%s status %d\n", __func__, status);
> - list_for_each_entry_safe_reverse(lce, save, &bld->ranges, bse_node) {
> + list_for_each_entry_safe_reverse(lce, save, &bl->bl_committing, bse_node) {
> if (likely(!status)) {
> u64 offset = lce->bse_f_offset;
> u64 end = offset + lce->bse_length;
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index a693283..987260c 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5788,7 +5788,6 @@ static int _nfs4_getdevicelist(struct nfs_server *server,
>
> dprintk("--> %s\n", __func__);
> status = nfs4_call_sync(server->client, server, &msg, &args.seq_args, &res.seq_res, 0);
> - put_rpccred(msg.rpc_cred);
> dprintk("<-- %s status=%d\n", __func__, status);
> return status;
> }
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index e059dc8..73f18f4 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
> *p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
> /* Only whole file layouts */
> p = xdr_encode_hyper(p, 0); /* offset */
> - p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
> + p = xdr_encode_hyper(p, args->lastbytewritten+1); /* length */
> *p++ = cpu_to_be32(0); /* reclaim */
> p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
> *p++ = cpu_to_be32(1); /* newoffset = TRUE */
> @@ -5467,7 +5467,6 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
> int status;
>
> status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
> - res->status = status;
> if (status)
> return status;
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index c88a8ee..9920bff 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -898,8 +898,6 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
> ret = get_lseg(lseg);
> break;
> }
> - if (cmp_layout(range, &lseg->pls_range) > 0)
> - break;
> }
>
> dprintk("%s:Return lseg %p ref %d\n",
> @@ -1252,6 +1250,7 @@ static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
> }
> }
> rv->pls_end_pos = max_pos;
> + dprintk("%s: lseg %p end_pos %llu\n", __func__, rv, rv->pls_end_pos);
>
> return rv;
> }
> @@ -1261,6 +1260,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
> {
> struct nfs_inode *nfsi = NFS_I(wdata->inode);
> loff_t end_pos = wdata->mds_offset + wdata->res.count;
> + loff_t isize = i_size_read(wdata->inode);
> bool mark_as_dirty = false;
>
> spin_lock(&nfsi->vfs_inode.i_lock);
> @@ -1274,9 +1274,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
> dprintk("%s: Set layoutcommit for inode %lu ",
> __func__, wdata->inode->i_ino);
> }
> + if (end_pos > isize)
> + end_pos = isize;
> if (end_pos > wdata->lseg->pls_end_pos)
> wdata->lseg->pls_end_pos = end_pos;
> spin_unlock(&nfsi->vfs_inode.i_lock);
> + dprintk("%s: lseg %p end_pos %llu\n",
> + __func__, wdata->lseg, wdata->lseg->pls_end_pos);
>
> /* if pnfs_layoutcommit_inode() runs between inode locks, the next one
> * will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index b50cf3a..28d57c9 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -156,6 +156,7 @@ struct pnfs_device {
> unsigned int layout_type;
> unsigned int mincount;
> struct page **pages;
> + void *area;
> unsigned int pgbase;
> unsigned int pglen;
> };
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index 3d93ada..79cc4ca 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -143,6 +143,7 @@ struct nfs_server {
> filesystem */
> struct pnfs_layoutdriver_type *pnfs_curr_ld; /* Active layout driver */
> struct rpc_wait_queue roc_rpcwaitq;
> + void *pnfs_ld_data; /* per mount point data */
> u32 pnfs_blksize; /* layout_blksize attr */
>
> /* the following fields are protected by nfs_client->cl_lock */
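
The one-line change to _range_has_tag() in the quoted extents.c hunk guards against unsigned wraparound. `expect` is computed as `pos->it_sector - tree->mtt_step_size`, and because the sectors are u64, that subtraction wraps to a very large value whenever the sector is smaller than the step, so `expect < start` alone would not fire. A minimal standalone sketch of the pattern follows; `reached_start()` and `reached_start_unguarded()` are hypothetical names used for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: decide whether stepping
 * backwards from `sector` by `step` has reached or passed `start`.
 * Mirrors the shape of the check added in _range_has_tag().
 */
static bool reached_start(uint64_t sector, uint64_t step, uint64_t start)
{
	uint64_t expect;

	/* Assumption of this sketch: a zero step would never make progress. */
	assert(step != 0);

	/* Unsigned subtraction wraps modulo 2^64 instead of going negative. */
	expect = sector - step;

	/* Test for the wrap first; otherwise `expect` looks enormous. */
	if (sector < step)
		return true;
	return expect < start;
}

/* The unguarded form, kept only to show what goes wrong. */
static bool reached_start_unguarded(uint64_t sector, uint64_t step,
				    uint64_t start)
{
	return sector - step < start;
}
```

With `sector = 0`, `step = 8` and `start = 0`, the guarded helper reports that the start has been reached, while the unguarded one does not, because 0 - 8 wraps to 2^64 - 8.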

2011-06-10 16:23:40

by Boaz Harrosh

Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget

On 06/10/2011 05:47 AM, Benny Halevy wrote:
> On 2011-06-10 02:02, [email protected] wrote:
>> Hi, Benny,
>>
>> Cheers,
>> -Bergwolf
>>
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Benny Halevy
>> Sent: Friday, June 10, 2011 5:30 AM
>> To: Peng Tao
>> Cc: Jim Rees; [email protected]; peter honeyman
>> Subject: Re: [PATCH 87/88] Add configurable prefetch size for layoutget
>>
>> On 2011-06-09 07:54, Peng Tao wrote:
>>> On Thu, Jun 9, 2011 at 2:06 PM, Benny Halevy <[email protected]> wrote:
>>>> On 2011-06-08 03:15, Peng Tao wrote:
>>>>> On 6/8/11, Jim Rees <[email protected]> wrote:
>>>>>> Benny Halevy wrote:
>>>>>>
>>>>>> NAK.
>>>>>> This affects all layout types. In particular it is undesired
>>>>>> for write layouts that extend the file with the objects layout.
>>>>>> The server can extend the layout segments range
>>>>>> over what the client requested so why would the client
>>>>>> ask for artificially large layouts?
>>>>>>
>>>>>> This has actually been the subject of some debate over Thursday night
>>>>>> beers. The problem we're trying to solve is that the client is spending 98%
>>>>>> of its time in layoutget. This patch gives us something like a 10x
>>>>>> speedup. But many of us think it's not the right fix. I suggest we discuss
>>>>>> next week.
>>>>>>
>>>>
>>>> Sure.
>>>>
>>>>>> But note that this patch doesn't change anything unless you set the sysctl.
>>>>> There is a default value of 2M. Maybe we can set it to page size by
>>>>> default so other layouts are not affected, and the block layout can let
>>>>> users set it by hand if they care about performance. Does this make
>>>>> sense?
>>>>
>>>> If doing it at all why use a sysctl rather than a mount option?
>>> The purpose of using a sysctl is to give the client the ability to change
>>> it on the fly. In theory, layout prefetching can benefit all layout
>>> types. So the patch tries to solve it in the pnfs generic layer.
>>>
>>
>> But the need for this varies per-server and many times per application.
>> Think sequential vs. random I/O. Therefore a mount option would help
>> tuning the behavior on a per-use basis. Global behavior must be implemented
>> using a dynamic algorithm that would take both the workload and the server
>> observed behavior into account.
>> [PT] Indeed. A dynamic algorithm should be able to solve all of this, but it often takes longer to design and get accepted. It has to prove better in most scenarios and not hurt the rest.
>
> We need to find an acceptable solution to push this driver upstream.
> I understand that developing a dynamic algorithm in the given time frame is
> too big of a challenge, but hacking in yet another client tunable is out of the
> question too. For testing at the Bakeathon I'd consider taking a DEVONLY version
> of this patch that is enabled using a config option and defaults to zero, so it has no effect
> at run time until the sysctl sets it differently.
> But keep in mind this is not suitable for pushing upstream.
>

I disagree. Please make that same exact patch for the server you are using!

Leave the client alone. Don't even consider getting used to this, sticking
broken stuff in client scripts. If anywhere, it should be in the server configuration
files.

Boaz

> Benny
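
The proposal under discussion can be sketched in a few lines: before sending LAYOUTGET, widen the requested byte range to a multiple of a tunable prefetch size, with zero meaning "leave the request alone", as Benny suggests for a DEVONLY build. This is only an illustration of the idea, not the code from patch 87/88. `struct layout_range`, `widen_layoutget_range()` and the `prefetch` parameter are hypothetical names, and the power-of-two requirement is an assumption made to keep the arithmetic simple.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte range for a LAYOUTGET request (illustration only). */
struct layout_range {
	uint64_t offset;
	uint64_t length;
};

/*
 * Widen `req` so it starts and ends on multiples of `prefetch` bytes.
 * prefetch == 0 disables widening, so a zero default has no effect.
 * Assumptions: a non-zero prefetch is a power of two, and the range is
 * far enough from UINT64_MAX that rounding up cannot overflow.
 */
static struct layout_range
widen_layoutget_range(struct layout_range req, uint64_t prefetch)
{
	struct layout_range out = req;
	uint64_t mask, end;

	if (prefetch == 0 || req.length == 0)
		return out;
	assert((prefetch & (prefetch - 1)) == 0);

	mask = prefetch - 1;
	end = req.offset + req.length;		/* first byte past the request */
	out.offset = req.offset & ~mask;	/* round the start down */
	end = (end + mask) & ~mask;		/* round the end up */
	out.length = end - out.offset;
	return out;
}
```

With a 2M prefetch, a 4096-byte request at offset 4096 becomes a request for the first 2M of the file; with a prefetch of zero the request is passed through unchanged. Whether this widening belongs in the client at all, or in the range the server chooses to return as Boaz argues, is the point in dispute.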