2011-06-14 02:32:01

by Jim Rees

Subject: [PATCH 00/33] v2 block layout patches

This version is based on commit 3f585d500f68912fa622749f519ed7df16e417b8.
It fixes the whitespace errors, adds a couple of missing signed-offs, moves
the configurable prefetch to the end and labels it DEVONLY, fixes the
Kconfig, standardizes on "pnfsblock" for patch subject, and makes some other
minor cleanups. I have not yet incorporated Fred's suggestions.

This patch set is also available on the for-benny branch of
git://citi.umich.edu/projects/linux-pnfs-blk.git .

Andy Adamson (1):
pnfs: GETDEVICELIST

Benny Halevy (1):
pnfs: add set-clear layoutdriver interface

Fred (1):
pnfsblock: find_get_extent

Fred Isaman (21):
pnfsblock: define PNFS_BLOCK Kconfig option
pnfsblock: blocklayout stub
pnfsblock: layout alloc and free
pnfsblock: add support for simple rpc pipefs
pnfsblock: basic extent code
pnfsblock: lseg alloc and free
pnfsblock: merge extents
pnfsblock: call and parse getdevicelist
pnfsblock: allow use of PG_owner_priv_1 flag
pnfsblock: xdr decode pnfs_block_layout4
pnfsblock: SPLITME: add extent manipulation functions
pnfsblock: merge rw extents
pnfsblock: encode_layoutcommit
pnfsblock: cleanup_layoutcommit
pnfsblock: bl_read_pagelist
pnfsblock: write_begin
pnfsblock: write_end
pnfsblock: write_end_cleanup
pnfsblock: bl_write_pagelist support functions
pnfsblock: bl_write_pagelist
pnfsblock: note written INVAL areas for layoutcommit

Jim Rees (3):
pnfsblock: add block device discovery pipe
pnfsblock: add device operations
pnfsblock: remove device operations

Peng Tao (5):
pnfs: let layoutcommit code handle multiple segments
pnfs: hook nfs_write_begin/end to allow layout driver manipulation
pnfs: ask for layout_blksize and save it in nfs_server
pnfs: cleanup_layoutcommit
pnfsblock DEVONLY: Add configurable prefetch size for layoutget

Zhang Jingwang (1):
pnfsblock: Implement release_inval_marks

fs/nfs/Kconfig | 10 +
fs/nfs/Makefile | 1 +
fs/nfs/blocklayout/Makefile | 5 +
fs/nfs/blocklayout/block-device-discovery-pipe.c | 66 ++
fs/nfs/blocklayout/blocklayout.c | 1085 ++++++++++++++++++++++
fs/nfs/blocklayout/blocklayout.h | 287 ++++++
fs/nfs/blocklayout/blocklayoutdev.c | 346 +++++++
fs/nfs/blocklayout/blocklayoutdm.c | 120 +++
fs/nfs/blocklayout/extents.c | 941 +++++++++++++++++++
fs/nfs/client.c | 9 +-
fs/nfs/file.c | 26 +-
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 54 +-
fs/nfs/nfs4xdr.c | 232 +++++-
fs/nfs/pnfs.c | 107 ++-
fs/nfs/pnfs.h | 142 +++-
fs/nfs/sysctl.c | 10 +
fs/nfs/write.c | 12 +-
include/linux/nfs4.h | 1 +
include/linux/nfs_fs.h | 3 +-
include/linux/nfs_fs_sb.h | 4 +-
include/linux/nfs_xdr.h | 15 +-
include/linux/sunrpc/simple_rpc_pipefs.h | 105 +++
net/sunrpc/simple_rpc_pipefs.c | 423 +++++++++
24 files changed, 3963 insertions(+), 43 deletions(-)
create mode 100644 fs/nfs/blocklayout/Makefile
create mode 100644 fs/nfs/blocklayout/block-device-discovery-pipe.c
create mode 100644 fs/nfs/blocklayout/blocklayout.c
create mode 100644 fs/nfs/blocklayout/blocklayout.h
create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c
create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c
create mode 100644 fs/nfs/blocklayout/extents.c
create mode 100644 include/linux/sunrpc/simple_rpc_pipefs.h
create mode 100644 net/sunrpc/simple_rpc_pipefs.c

--
1.7.4.1



2011-06-14 02:33:07

by Jim Rees

Subject: [PATCH 25/33] pnfsblock: bl_read_pagelist

From: Fred Isaman <[email protected]>

Note: When the upper layer's read/write request cannot be fulfilled, the
block layout driver shouldn't silently mark the page as an error. It should
do what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.

When the upper layer re-sends the read/write request to finish the
remainder, pgbase is the position at which we should start.
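
The accounting described above can be sketched in user space as follows.
This is only an illustration of the arithmetic at the end of
bl_read_pagelist, not the driver code; struct read_result and
short_read_result() are hypothetical names, and the sketch assumes
f_offset does not exceed either the file size or the end of the handled
range.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the patch */

/* Hypothetical result record; mirrors the res.count / res.eof pair. */
struct read_result {
	uint64_t count;	/* bytes the layout driver actually covered */
	int eof;	/* set when the handled range reaches end of file */
};

/*
 * isect:    first sector NOT handled (one past the last page processed)
 * f_offset: byte offset at which the request started
 * i_size:   file size in bytes
 */
static struct read_result
short_read_result(uint64_t isect, uint64_t f_offset, uint64_t i_size)
{
	struct read_result res = { 0, 0 };
	uint64_t end = isect << SECTOR_SHIFT;

	if (end >= i_size) {
		/* Handled range runs to (or past) EOF: clamp to file size. */
		res.eof = 1;
		res.count = i_size - f_offset;
	} else {
		/* Short read: report only what was covered. */
		res.count = end - f_offset;
	}
	return res;
}
```

A caller that sees count smaller than what it asked for is expected to
resend the remainder, which is where pgbase comes in.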

[pnfsblock: read path error handling]
Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <[email protected]>
[pnfs-block: use new read_pagelist api]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 259 ++++++++++++++++++++++++++++++++++++++
1 files changed, 259 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index f3189d6..d9bcb13 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -31,6 +31,7 @@
*/
#include <linux/module.h>
#include <linux/init.h>
+#include <linux/bio.h> /* struct bio */
#include <linux/vmalloc.h>
#include "blocklayout.h"

@@ -40,9 +41,267 @@ MODULE_LICENSE("GPL");
MODULE_AUTHOR("Andy Adamson <[email protected]>");
MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");

+static void print_page(struct page *page)
+{
+ dprintk("PRINTPAGE page %p\n", page);
+ dprintk(" PagePrivate %d\n", PagePrivate(page));
+ dprintk(" PageUptodate %d\n", PageUptodate(page));
+ dprintk(" PageError %d\n", PageError(page));
+ dprintk(" PageDirty %d\n", PageDirty(page));
+ dprintk(" PageReferenced %d\n", PageReferenced(page));
+ dprintk(" PageLocked %d\n", PageLocked(page));
+ dprintk(" PageWriteback %d\n", PageWriteback(page));
+ dprintk(" PageMappedToDisk %d\n", PageMappedToDisk(page));
+ dprintk("\n");
+}
+
+/* Given the extent (be) associated with isect, determine if page data needs to be
+ * initialized.
+ */
+static int is_hole(struct pnfs_block_extent *be, sector_t isect)
+{
+ if (be->be_state == PNFS_BLOCK_NONE_DATA)
+ return 1;
+ else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+ return 0;
+ else
+ return !is_sector_initialized(be->be_inval, isect);
+}
+
+static int
+dont_like_caller(struct nfs_page *req)
+{
+ if (atomic_read(&req->wb_complete)) {
+ /* Called by _multi */
+ return 1;
+ } else {
+ /* Called by _one */
+ return 0;
+ }
+}
+
+/* The data we are handed might be spread across several bios. We need
+ * to track when the last one is finished.
+ */
+struct parallel_io {
+ struct kref refcnt;
+ struct rpc_call_ops call_ops;
+ void (*pnfs_callback) (void *data);
+ void *data;
+};
+
+static inline struct parallel_io *alloc_parallel(void *data)
+{
+ struct parallel_io *rv;
+
+ rv = kmalloc(sizeof(*rv), GFP_KERNEL);
+ if (rv) {
+ rv->data = data;
+ kref_init(&rv->refcnt);
+ }
+ return rv;
+}
+
+static inline void get_parallel(struct parallel_io *p)
+{
+ kref_get(&p->refcnt);
+}
+
+static void destroy_parallel(struct kref *kref)
+{
+ struct parallel_io *p = container_of(kref, struct parallel_io, refcnt);
+
+ dprintk("%s enter\n", __func__);
+ p->pnfs_callback(p->data);
+ kfree(p);
+}
+
+static inline void put_parallel(struct parallel_io *p)
+{
+ kref_put(&p->refcnt, destroy_parallel);
+}
+
+static struct bio *
+bl_submit_bio(int rw, struct bio *bio)
+{
+ if (bio) {
+ get_parallel(bio->bi_private);
+ dprintk("%s submitting %s bio %u@%llu\n", __func__,
+ rw == READ ? "read" : "write",
+ bio->bi_size, (u64)bio->bi_sector);
+ submit_bio(rw, bio);
+ }
+ return NULL;
+}
+
+static inline void
+bl_done_with_rpage(struct page *page, const int ok)
+{
+ if (ok) {
+ ClearPagePnfsErr(page);
+ SetPageUptodate(page);
+ } else {
+ ClearPageUptodate(page);
+ SetPageError(page);
+ SetPagePnfsErr(page);
+ }
+ /* Page is unlocked via rpc_release. Should really be done here. */
+}
+
+/* This is basically copied from mpage_end_io_read */
+static void bl_end_io_read(struct bio *bio, int err)
+{
+ void *data = bio->bi_private;
+ const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+ struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+
+ do {
+ struct page *page = bvec->bv_page;
+
+ if (--bvec >= bio->bi_io_vec)
+ prefetchw(&bvec->bv_page->flags);
+ bl_done_with_rpage(page, uptodate);
+ } while (bvec >= bio->bi_io_vec);
+ bio_put(bio);
+ put_parallel(data);
+}
+
+static void bl_read_cleanup(struct work_struct *work)
+{
+ struct rpc_task *task;
+ struct nfs_read_data *rdata;
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+ rdata = container_of(task, struct nfs_read_data, task);
+ pnfs_ld_read_done(rdata);
+}
+
+static void
+bl_end_par_io_read(void *data)
+{
+ struct nfs_read_data *rdata = data;
+
+ INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
+ schedule_work(&rdata->task.u.tk_work);
+}
+
+/* We don't want normal .rpc_call_done callback used, so we replace it
+ * with this stub.
+ */
+static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
+{
+ return;
+}
+
static enum pnfs_try_status
bl_read_pagelist(struct nfs_read_data *rdata)
{
+ int i, hole;
+ struct bio *bio = NULL;
+ struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+ sector_t isect, extent_length = 0;
+ struct parallel_io *par;
+ loff_t f_offset = rdata->args.offset;
+ size_t count = rdata->args.count;
+ struct page **pages = rdata->args.pages;
+ int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
+
+ dprintk("%s enter nr_pages %u offset %lld count %Zd\n", __func__,
+ rdata->npages, f_offset, count);
+
+ if (dont_like_caller(rdata->req)) {
+ dprintk("%s dont_like_caller failed\n", __func__);
+ goto use_mds;
+ }
+ if ((rdata->npages == 1) && PagePnfsErr(rdata->req->wb_page)) {
+ /* We want to fall back to mds in case of read_page
+ * after error on read_pages.
+ */
+ dprintk("%s PG_pnfserr set\n", __func__);
+ goto use_mds;
+ }
+ par = alloc_parallel(rdata);
+ if (!par)
+ goto use_mds;
+ par->call_ops = *rdata->mds_ops;
+ par->call_ops.rpc_call_done = bl_rpc_do_nothing;
+ par->pnfs_callback = bl_end_par_io_read;
+ /* At this point, we can no longer jump to use_mds */
+
+ isect = (sector_t) (f_offset >> 9);
+ /* Code assumes extents are page-aligned */
+ for (i = pg_index; i < rdata->npages; i++) {
+ if (!extent_length) {
+ /* We've used up the previous extent */
+ put_extent(be);
+ put_extent(cow_read);
+ bio = bl_submit_bio(READ, bio);
+ /* Get the next one */
+ be = find_get_extent(BLK_LSEG2EXT(rdata->lseg),
+ isect, &cow_read);
+ if (!be) {
+ /* Error out this page */
+ bl_done_with_rpage(pages[i], 0);
+ break;
+ }
+ extent_length = be->be_length -
+ (isect - be->be_f_offset);
+ if (cow_read) {
+ sector_t cow_length = cow_read->be_length -
+ (isect - cow_read->be_f_offset);
+ extent_length = min(extent_length, cow_length);
+ }
+ }
+ hole = is_hole(be, isect);
+ if (hole && !cow_read) {
+ bio = bl_submit_bio(READ, bio);
+ /* Fill hole w/ zeroes w/o accessing device */
+ dprintk("%s Zeroing page for hole\n", __func__);
+ zero_user(pages[i], 0,
+ min_t(int, PAGE_CACHE_SIZE, count));
+ print_page(pages[i]);
+ bl_done_with_rpage(pages[i], 1);
+ } else {
+ struct pnfs_block_extent *be_read;
+
+ be_read = (hole && cow_read) ? cow_read : be;
+ for (;;) {
+ if (!bio) {
+ bio = bio_alloc(GFP_NOIO, rdata->npages - i);
+ if (!bio) {
+ /* Error out this page */
+ bl_done_with_rpage(pages[i], 0);
+ break;
+ }
+ bio->bi_sector = isect -
+ be_read->be_f_offset +
+ be_read->be_v_offset;
+ bio->bi_bdev = be_read->be_mdev;
+ bio->bi_end_io = bl_end_io_read;
+ bio->bi_private = par;
+ }
+ if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
+ break;
+ bio = bl_submit_bio(READ, bio);
+ }
+ }
+ isect += PAGE_CACHE_SIZE >> 9;
+ extent_length -= PAGE_CACHE_SIZE >> 9;
+ }
+ if ((isect << 9) >= rdata->inode->i_size) {
+ rdata->res.eof = 1;
+ rdata->res.count = rdata->inode->i_size - f_offset;
+ } else {
+ rdata->res.count = (isect << 9) - f_offset;
+ }
+ put_extent(be);
+ put_extent(cow_read);
+ bl_submit_bio(READ, bio);
+ put_parallel(par);
+ return PNFS_ATTEMPTED;
+
+ use_mds:
+ dprintk("Giving up and using normal NFS\n");
return PNFS_NOT_ATTEMPTED;
}

--
1.7.4.1
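
For readers following the completion logic: struct parallel_io relies on
the submitter holding one reference while each in-flight bio holds
another, so the callback runs only after both the submission loop and the
last bio are done. Below is a minimal user-space sketch of that pattern.
It is an assumption-laden stand-in, not the patch's code: a plain int
replaces struct kref (so it is not thread-safe), and par_alloc(),
par_get(), par_put() and count_done() are hypothetical names.

```c
#include <stdlib.h>

/* User-space stand-in for struct parallel_io; an int replaces kref. */
struct parallel_io {
	int refcnt;
	void (*done)(void *data);	/* runs once, when refcnt hits zero */
	void *data;
};

static struct parallel_io *par_alloc(void (*done)(void *), void *data)
{
	struct parallel_io *p = malloc(sizeof(*p));

	if (p) {
		p->refcnt = 1;	/* reference held by the submitter */
		p->done = done;
		p->data = data;
	}
	return p;
}

/* Taken once per submitted I/O, before it is handed off. */
static void par_get(struct parallel_io *p)
{
	p->refcnt++;
}

/* Dropped by each I/O completion and, last of all, by the submitter. */
static void par_put(struct parallel_io *p)
{
	if (--p->refcnt == 0) {
		p->done(p->data);
		free(p);
	}
}

/* Example callback: count how many times completion fired. */
static void count_done(void *data)
{
	(*(int *)data)++;
}
```

Because the submitter drops its own reference only after the loop ends,
an I/O that completes early cannot trigger the callback while later pages
are still being queued.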


2011-06-14 02:32:36

by Jim Rees

Subject: [PATCH 13/33] pnfsblock: add device operations

Signed-off-by: Jim Rees <[email protected]>
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/Makefile | 2 +-
fs/nfs/blocklayout/blocklayout.h | 15 ++++
fs/nfs/blocklayout/blocklayoutdev.c | 151 +++++++++++++++++++++++++++++++++++
3 files changed, 167 insertions(+), 1 deletions(-)
create mode 100644 fs/nfs/blocklayout/blocklayoutdev.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index af39d19..bd69aad 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
# Makefile for the pNFS block layout driver kernel module
#
obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o
+blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o blocklayoutdev.o
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index c1825ae..cda7ea1 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,6 +35,12 @@
#include <linux/nfs_fs.h>
#include "../pnfs.h"

+struct pnfs_block_dev {
+ struct list_head bm_node;
+ struct nfs4_deviceid bm_mdevid; /* associated devid */
+ struct block_device *bm_mdev; /* meta device itself */
+};
+
enum exstate4 {
PNFS_BLOCK_READWRITE_DATA = 0,
PNFS_BLOCK_READ_DATA = 1,
@@ -87,6 +93,15 @@ static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
return container_of(lo, struct pnfs_block_layout, bl_layout);
}

+/* blocklayoutdev.c */
+struct block_device *nfs4_blkdev_get(dev_t dev);
+int nfs4_blkdev_put(struct block_device *bdev);
+struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
+ struct pnfs_device *dev,
+ struct list_head *sdlist);
+int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
+
#include <linux/sunrpc/simple_rpc_pipefs.h>

extern struct pipefs_list bl_device_list;
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
new file mode 100644
index 0000000..9a65a66
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -0,0 +1,151 @@
+/*
+ * linux/fs/nfs/blocklayout/blocklayoutdev.c
+ *
+ * Device operations for the pnfs nfs4 block layout driver.
+ *
+ * Copyright (c) 2006 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Andy Adamson <[email protected]>
+ * Fred Isaman <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#include <linux/module.h>
+#include <linux/buffer_head.h> /* __bread */
+
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+#include <linux/hash.h>
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes)
+{
+ uint32_t *q = p + XDR_QUADLEN(nbytes);
+ if (unlikely(q > end || q < p))
+ return NULL;
+ return p;
+}
+EXPORT_SYMBOL(blk_overflow);
+
+/* Open a block_device by device number. */
+struct block_device *nfs4_blkdev_get(dev_t dev)
+{
+ struct block_device *bd;
+
+ dprintk("%s enter\n", __func__);
+ bd = blkdev_get_by_dev(dev, FMODE_READ, NULL);
+ if (IS_ERR(bd))
+ goto fail;
+ return bd;
+fail:
+ dprintk("%s failed to open device : %ld\n",
+ __func__, PTR_ERR(bd));
+ return NULL;
+}
+
+/*
+ * Release the block device
+ */
+int nfs4_blkdev_put(struct block_device *bdev)
+{
+ dprintk("%s for device %d:%d\n", __func__, MAJOR(bdev->bd_dev),
+ MINOR(bdev->bd_dev));
+ return blkdev_put(bdev, FMODE_READ);
+}
+
+/* Decodes pnfs_block_deviceaddr4 (draft-8) which is XDR encoded
+ * in dev->dev_addr_buf.
+ */
+struct pnfs_block_dev *
+nfs4_blk_decode_device(struct nfs_server *server,
+ struct pnfs_device *dev,
+ struct list_head *sdlist)
+{
+ struct pnfs_block_dev *rv = NULL;
+ struct block_device *bd = NULL;
+ struct pipefs_hdr *msg = NULL, *reply = NULL;
+ uint32_t major, minor;
+
+ dprintk("%s enter\n", __func__);
+
+ if (IS_ERR(bl_device_pipe))
+ return NULL;
+ dprintk("%s CREATING PIPEFS MESSAGE\n", __func__);
+ dprintk("%s: deviceid: %s, mincount: %d\n", __func__, dev->dev_id.data,
+ dev->mincount);
+ msg = pipefs_alloc_init_msg(0, BL_DEVICE_MOUNT, 0, dev->area,
+ dev->mincount);
+ if (IS_ERR(msg)) {
+ dprintk("ERROR: couldn't make pipefs message.\n");
+ goto out_err;
+ }
+ msg->msgid = hash_ptr(&msg, sizeof(msg->msgid) * 8);
+ msg->status = BL_DEVICE_REQUEST_INIT;
+
+ dprintk("%s CALLING USERSPACE DAEMON\n", __func__);
+ reply = pipefs_queue_upcall_waitreply(bl_device_pipe, msg,
+ &bl_device_list, 0, 0);
+
+ if (IS_ERR(reply)) {
+ dprintk("ERROR: upcall_waitreply failed\n");
+ goto out_err;
+ }
+ if (reply->status != BL_DEVICE_REQUEST_PROC) {
+ dprintk("%s upcall failed, status %u\n",
+ __func__, (unsigned int)reply->status);
+ goto out_err;
+ }
+ memcpy(&major, (uint32_t *)(payload_of(reply)), sizeof(uint32_t));
+ memcpy(&minor, (uint32_t *)(payload_of(reply) + sizeof(uint32_t)),
+ sizeof(uint32_t));
+ bd = nfs4_blkdev_get(MKDEV(major, minor));
+ if (!bd) {
+ dprintk("%s failed to open device %u:%u\n",
+ __func__, major, minor);
+ goto out_err;
+ }
+
+ rv = kzalloc(sizeof(*rv), GFP_KERNEL);
+ if (!rv)
+ goto out_err;
+
+ rv->bm_mdev = bd;
+ memcpy(&rv->bm_mdevid, &dev->dev_id, sizeof(struct nfs4_deviceid));
+ dprintk("%s Created device %s with bd_block_size %u\n",
+ __func__,
+ bd->bd_disk->disk_name,
+ bd->bd_block_size);
+ kfree(reply);
+ kfree(msg);
+ return rv;
+
+out_err:
+ kfree(rv);
+ if (!IS_ERR(reply))
+ kfree(reply);
+ if (!IS_ERR(msg))
+ kfree(msg);
+ return NULL;
+}
--
1.7.4.1
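
As a side note on blk_overflow(): it checks that nbytes of XDR data,
rounded up to whole 32-bit words, fit between p and end, and returns p
(not the advanced pointer) on success. A user-space sketch of the same
check follows. The names quadlen() and xdr_check() are hypothetical, the
sketch assumes end >= p, and it compares word counts rather than pointers
so that an oversized nbytes cannot wrap around.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit XDR words needed to hold n bytes, rounded up. */
static size_t quadlen(size_t n)
{
	return n / 4 + (n % 4 != 0);
}

/*
 * Return p if nbytes of XDR data starting at p fit before end,
 * otherwise NULL. Assumes p and end point into the same buffer
 * and that end >= p.
 */
static const uint32_t *
xdr_check(const uint32_t *p, const uint32_t *end, size_t nbytes)
{
	size_t avail = (size_t)(end - p);	/* words left in the buffer */

	if (quadlen(nbytes) > avail)
		return NULL;
	return p;
}
```

A decoder would call this before each read and bail out on NULL, then
advance its own cursor by the word count once the field is consumed.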


2011-06-14 02:32:33

by Jim Rees

Subject: [PATCH 12/33] pnfsblock: basic extent code

From: Fred Isaman <[email protected]>

Adds structures and basic create/delete code for extents.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Zhang Jingwang <[email protected]>
---
fs/nfs/blocklayout/Makefile | 2 +-
fs/nfs/blocklayout/blocklayout.c | 20 ++++++--
fs/nfs/blocklayout/blocklayout.h | 1 +
fs/nfs/blocklayout/extents.c | 97 ++++++++++++++++++++++++++++++++++++++
4 files changed, 115 insertions(+), 5 deletions(-)
create mode 100644 fs/nfs/blocklayout/extents.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index d2bcd81..af39d19 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
# Makefile for the pNFS block layout driver kernel module
#
obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o
+blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index bc6a0b2..4ca0838 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -53,12 +53,24 @@ bl_write_pagelist(struct nfs_write_data *wdata,
return PNFS_NOT_ATTEMPTED;
}

-/* STUB */
+/* FIXME - range ignored */
static void
-release_extents(struct pnfs_block_layout *bl,
- struct pnfs_layout_range *range)
-{
- return;
+release_extents(struct pnfs_block_layout *bl, struct pnfs_layout_range *range)
+{
+ int i;
+ struct pnfs_block_extent *be;
+
+ spin_lock(&bl->bl_ext_lock);
+ for (i = 0; i < EXTENT_LISTS; i++) {
+ while (!list_empty(&bl->bl_extents[i])) {
+ be = list_first_entry(&bl->bl_extents[i],
+ struct pnfs_block_extent,
+ be_node);
+ list_del(&be->be_node);
+ put_extent(be);
+ }
+ }
+ spin_unlock(&bl->bl_ext_lock);
}

/* STUB */
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 4b8608c..c1825ae 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -101,4 +101,5 @@ void bl_pipe_exit(void);
#define BL_DEVICE_REQUEST_PROC 0x1 /* User level process succeeds */
#define BL_DEVICE_REQUEST_ERR 0x2 /* User level process fails */

+void put_extent(struct pnfs_block_extent *be);
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
new file mode 100644
index 0000000..1283fa9
--- /dev/null
+++ b/fs/nfs/blocklayout/extents.c
@@ -0,0 +1,97 @@
+/*
+ * linux/fs/nfs/blocklayout/extents.c
+ *
+ * Module for the NFSv4.1 pNFS block layout driver.
+ *
+ * Copyright (c) 2006 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Andy Adamson <[email protected]>
+ * Fred Isaman <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+
+#include "blocklayout.h"
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+static void print_bl_extent(struct pnfs_block_extent *be)
+{
+ dprintk("PRINT EXTENT extent %p\n", be);
+ if (be) {
+ dprintk(" be_f_offset %llu\n", (u64)be->be_f_offset);
+ dprintk(" be_length %llu\n", (u64)be->be_length);
+ dprintk(" be_v_offset %llu\n", (u64)be->be_v_offset);
+ dprintk(" be_state %d\n", be->be_state);
+ }
+}
+
+static void
+destroy_extent(struct kref *kref)
+{
+ struct pnfs_block_extent *be;
+
+ be = container_of(kref, struct pnfs_block_extent, be_refcnt);
+ dprintk("%s be=%p\n", __func__, be);
+ kfree(be);
+}
+
+void
+put_extent(struct pnfs_block_extent *be)
+{
+ if (be) {
+ dprintk("%s enter %p (%i)\n", __func__, be,
+ atomic_read(&be->be_refcnt.refcount));
+ kref_put(&be->be_refcnt, destroy_extent);
+ }
+}
+
+struct pnfs_block_extent *alloc_extent(void)
+{
+ struct pnfs_block_extent *be;
+
+ be = kmalloc(sizeof(struct pnfs_block_extent), GFP_KERNEL);
+ if (!be)
+ return NULL;
+ INIT_LIST_HEAD(&be->be_node);
+ kref_init(&be->be_refcnt);
+ be->be_inval = NULL;
+ return be;
+}
+
+struct pnfs_block_extent *
+get_extent(struct pnfs_block_extent *be)
+{
+ if (be)
+ kref_get(&be->be_refcnt);
+ return be;
+}
+
+void print_elist(struct list_head *list)
+{
+ struct pnfs_block_extent *be;
+ dprintk("****************\n");
+ dprintk("Extent list looks like:\n");
+ list_for_each_entry(be, list, be_node) {
+ print_bl_extent(be);
+ }
+ dprintk("****************\n");
+}
--
1.7.4.1
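
To illustrate how release_extents() interacts with the reference counts:
unlinking an extent from the list drops only the list's reference, so an
extent that someone else still holds survives the drain. The sketch below
is a user-space approximation under stated assumptions: struct node is a
tiny stand-in for struct list_head, an int replaces struct kref, there is
no locking, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular list, a stand-in for the kernel's struct list_head. */
struct node {
	struct node *next, *prev;
};

struct extent {
	struct node link;	/* kept first so a node pointer can be cast back */
	int refcnt;
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

static void list_add_tail_node(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static struct extent *extent_alloc(void)
{
	struct extent *e = malloc(sizeof(*e));

	if (e) {
		list_init(&e->link);
		e->refcnt = 1;	/* initial reference, later owned by the list */
	}
	return e;
}

/* Drop one reference; returns 1 if the extent was freed. */
static int extent_put(struct extent *e)
{
	if (e && --e->refcnt == 0) {
		free(e);
		return 1;
	}
	return 0;
}

/* Unlink and release every extent on the list; returns how many were freed. */
static int drain_extents(struct node *head)
{
	int freed = 0;

	while (head->next != head) {
		struct extent *e = (struct extent *)head->next;

		list_del_node(&e->link);
		freed += extent_put(e);
	}
	return freed;
}
```

Extents with an extra holder are unlinked but stay allocated until that
holder calls extent_put() itself.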


2011-06-14 02:32:19

by Jim Rees

Subject: [PATCH 07/33] pnfsblock: define PNFS_BLOCK Kconfig option

From: Fred Isaman <[email protected]>

Define a configuration variable to enable/disable compilation of the
block driver code.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
[pnfs-block: fix CONFIG_PNFS_BLOCK dependencies]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/Kconfig | 10 ++++++++++
fs/nfs/Makefile | 1 +
fs/nfs/blocklayout/Makefile | 5 +++++
3 files changed, 16 insertions(+), 0 deletions(-)
create mode 100644 fs/nfs/blocklayout/Makefile

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 8151554..b613820 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -97,6 +97,16 @@ config PNFS_OBJLAYOUT

If unsure, say N.

+config PNFS_BLOCK
+ tristate "Provide a pNFS block client (EXPERIMENTAL)"
+ depends on NFS_FS && NFS_V4_1
+ select MD
+ select BLK_DEV_DM
+ help
+ Say M or Y here if you want your pNFS client to support the block protocol.
+
+ If unsure, say N.
+
config ROOT_NFS
bool "Root file system on NFS"
depends on NFS_FS=y && IP_PNP
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6a34f7d..b58613d 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
nfs_layout_nfsv41_files-y := nfs4filelayout.o nfs4filelayoutdev.o

obj-$(CONFIG_PNFS_OBJLAYOUT) += objlayout/
+obj-$(CONFIG_PNFS_BLOCK) += blocklayout/
diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
new file mode 100644
index 0000000..f214c1c
--- /dev/null
+++ b/fs/nfs/blocklayout/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the pNFS block layout driver kernel module
+#
+obj-$(CONFIG_PNFS_BLOCK) +=
+blocklayoutdriver-objs :=
--
1.7.4.1


2011-06-14 02:33:01

by Jim Rees

Subject: [PATCH 23/33] pnfsblock: encode_layoutcommit

From: Fred Isaman <[email protected]>

The blocklayout driver does two things during
layoutcommit/cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout's rw
extents list, for reads.

In the new system, where the actual XDR encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit and bl_cleanup_layoutcommit.
(The generic layer provides a void pointer to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on the above structure.
FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need to XDR-decode
the list. The commit-extent-list is then freed, and
finally the allocated structure is freed.
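
The three stages can be sketched in user space as below. This is a rough
model under several assumptions, not the driver's code: struct commit_ctx
and the commit_*() functions are hypothetical names, the "encoding" is a
flat array of a count followed by (offset, length) pairs rather than the
real XDR wire format, and the set_to_rw() step is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct short_extent {
	uint64_t f_offset;
	uint64_t length;
	struct short_extent *next;
};

/* Context shared by the stages; the generic layer would hold it as void *. */
struct commit_ctx {
	struct short_extent *head;
	struct short_extent **tailp;	/* where the next extent gets linked */
	unsigned int count;
};

/* Stage 1: allocate the structure that the later stages communicate through. */
static void *commit_setup(void)
{
	struct commit_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		ctx->tailp = &ctx->head;
	return ctx;
}

/* Queue one modified extent on the commit list, preserving order. */
static int commit_add(struct commit_ctx *ctx, uint64_t off, uint64_t len)
{
	struct short_extent *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->f_offset = off;
	e->length = len;
	e->next = NULL;
	*ctx->tailp = e;
	ctx->tailp = &e->next;
	ctx->count++;
	return 0;
}

/*
 * Stage 2: encode directly into the caller's buffer. The list is left
 * intact for cleanup. Returns words written, or 0 if out is too small.
 */
static size_t commit_encode(const struct commit_ctx *ctx,
			    uint64_t *out, size_t max)
{
	size_t need = 1 + 2 * (size_t)ctx->count, i = 0;
	const struct short_extent *e;

	if (need > max)
		return 0;
	out[i++] = ctx->count;
	for (e = ctx->head; e; e = e->next) {
		out[i++] = e->f_offset;
		out[i++] = e->length;
	}
	return i;
}

/* Stage 3: free the commit list, then the structure; returns extents freed. */
static unsigned int commit_cleanup(void *priv)
{
	struct commit_ctx *ctx = priv;
	unsigned int n = 0;

	while (ctx->head) {
		struct short_extent *e = ctx->head;

		ctx->head = e->next;
		free(e);
		n++;
	}
	free(ctx);
	return n;
}
```

Keeping the list alive across the encode stage is what lets cleanup work
from the in-memory extents instead of decoding them back out of the
buffer.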

Signed-off-by: Fred Isaman <[email protected]>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <[email protected]>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <[email protected]>
[pnfsblock: prevent commit list corruption]
[pnfsblock: fix layoutcommit with an empty opaque]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 2 +
fs/nfs/blocklayout/blocklayout.h | 12 +++
fs/nfs/blocklayout/extents.c | 175 ++++++++++++++++++++++++++++----------
3 files changed, 145 insertions(+), 44 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 243ce3f..6d0844c 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -153,6 +153,8 @@ static void
bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
const struct nfs4_layoutcommit_args *arg)
{
+ dprintk("%s enter\n", __func__);
+ encode_pnfs_block_layoutupdate(BLK_LO2EXT(lo), xdr, arg);
}

static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 364540a..c3b41f4 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -135,6 +135,15 @@ struct pnfs_block_extent {
struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
};

+/* Shortened extent used by LAYOUTCOMMIT */
+struct pnfs_block_short_extent {
+ struct list_head bse_node;
+ struct nfs4_deviceid bse_devid; /* STUB - removable??? */
+ struct block_device *bse_mdev;
+ sector_t bse_f_offset; /* the starting offset in the file */
+ sector_t bse_length; /* the size of the extent */
+};
+
static inline void
INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
{
@@ -250,6 +259,9 @@ void put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *alloc_extent(void);
struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
+int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+ struct xdr_stream *xdr,
+ const struct nfs4_layoutcommit_args *arg);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 43a3601..e754d32 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -286,6 +286,47 @@ int mark_initialized_sectors(struct pnfs_inval_markings *marks,
return -ENOMEM;
}

+/* Marks sectors in [offset, offset+length) as having been written to disk.
+ * All lengths should be block aligned.
+ */
+int mark_written_sectors(struct pnfs_inval_markings *marks,
+ sector_t offset, sector_t length)
+{
+ int status;
+
+ dprintk("%s(offset=%llu,len=%llu) enter\n", __func__,
+ (u64)offset, (u64)length);
+ spin_lock(&marks->im_lock);
+ status = _set_range(&marks->im_tree, EXTENT_WRITTEN, offset, length);
+ spin_unlock(&marks->im_lock);
+ return status;
+}
+
+static void print_short_extent(struct pnfs_block_short_extent *be)
+{
+ dprintk("PRINT SHORT EXTENT extent %p\n", be);
+ if (be) {
+ dprintk(" be_f_offset %llu\n", (u64)be->bse_f_offset);
+ dprintk(" be_length %llu\n", (u64)be->bse_length);
+ }
+}
+
+void print_clist(struct list_head *list, unsigned int count)
+{
+ struct pnfs_block_short_extent *be;
+ unsigned int i = 0;
+
+ dprintk("****************\n");
+ dprintk("Extent list looks like:\n");
+ list_for_each_entry(be, list, bse_node) {
+ i++;
+ print_short_extent(be);
+ }
+ if (i != count)
+ dprintk("\n\nExpected %u entries\n\n\n", count);
+ dprintk("****************\n");
+}
+
static void print_bl_extent(struct pnfs_block_extent *be)
{
dprintk("PRINT EXTENT extent %p\n", be);
@@ -386,65 +427,67 @@ add_and_merge_extent(struct pnfs_block_layout *bl,
/* Scan for proper place to insert, extending new to the left
* as much as possible.
*/
- list_for_each_entry_safe(be, tmp, list, be_node) {
- if (new->be_f_offset < be->be_f_offset)
+ list_for_each_entry_safe_reverse(be, tmp, list, be_node) {
+ if (new->be_f_offset >= be->be_f_offset + be->be_length)
break;
- if (end <= be->be_f_offset + be->be_length) {
- /* new is a subset of existing be*/
+ if (new->be_f_offset >= be->be_f_offset) {
+ if (end <= be->be_f_offset + be->be_length) {
+ /* new is a subset of existing be*/
+ if (extents_consistent(be, new)) {
+ dprintk("%s: new is subset, ignoring\n",
+ __func__);
+ put_extent(new);
+ return 0;
+ } else {
+ goto out_err;
+ }
+ } else {
+ /* |<-- be -->|
+ * |<-- new -->| */
+ if (extents_consistent(be, new)) {
+ /* extend new to fully replace be */
+ new->be_length += new->be_f_offset -
+ be->be_f_offset;
+ new->be_f_offset = be->be_f_offset;
+ new->be_v_offset = be->be_v_offset;
+ dprintk("%s: removing %p\n", __func__, be);
+ list_del(&be->be_node);
+ put_extent(be);
+ } else {
+ goto out_err;
+ }
+ }
+ } else if (end >= be->be_f_offset + be->be_length) {
+ /* new extent overlaps existing be */
if (extents_consistent(be, new)) {
- dprintk("%s: new is subset, ignoring\n",
- __func__);
- put_extent(new);
- return 0;
- } else
+ /* extend new to fully replace be */
+ dprintk("%s: removing %p\n", __func__, be);
+ list_del(&be->be_node);
+ put_extent(be);
+ } else {
goto out_err;
- } else if (new->be_f_offset <=
- be->be_f_offset + be->be_length) {
- /* new overlaps or abuts existing be */
- if (extents_consistent(be, new)) {
+ }
+ } else if (end > be->be_f_offset) {
+ /* |<-- be -->|
+ *|<-- new -->| */
+ if (extents_consistent(new, be)) {
/* extend new to fully replace be */
- new->be_length += new->be_f_offset -
- be->be_f_offset;
- new->be_f_offset = be->be_f_offset;
- new->be_v_offset = be->be_v_offset;
+ new->be_length += be->be_f_offset + be->be_length -
+ new->be_f_offset - new->be_length;
dprintk("%s: removing %p\n", __func__, be);
list_del(&be->be_node);
put_extent(be);
- } else if (new->be_f_offset !=
- be->be_f_offset + be->be_length)
+ } else {
goto out_err;
+ }
}
}
/* Note that if we never hit the above break, be will not point to a
* valid extent. However, in that case &be->be_node==list.
*/
- list_add_tail(&new->be_node, &be->be_node);
+ list_add(&new->be_node, &be->be_node);
dprintk("%s: inserting new\n", __func__);
print_elist(list);
- /* Scan forward for overlaps. If we find any, extend new and
- * remove the overlapped extent.
- */
- be = list_prepare_entry(new, list, be_node);
- list_for_each_entry_safe_continue(be, tmp, list, be_node) {
- if (end < be->be_f_offset)
- break;
- /* new overlaps or abuts existing be */
- if (extents_consistent(be, new)) {
- if (end < be->be_f_offset + be->be_length) {
- /* extend new to fully cover be */
- end = be->be_f_offset + be->be_length;
- new->be_length = end - new->be_f_offset;
- }
- dprintk("%s: removing %p\n", __func__, be);
- list_del(&be->be_node);
- put_extent(be);
- } else if (end != be->be_f_offset) {
- list_del(&new->be_node);
- goto out_err;
- }
- }
- dprintk("%s: after merging\n", __func__);
- print_elist(list);
/* STUB - The per-list consistency checks have all been done,
* should now check cross-list consistency.
*/
@@ -502,6 +545,50 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
return ret;
}

+int
+encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+ struct xdr_stream *xdr,
+ const struct nfs4_layoutcommit_args *arg)
+{
+ struct pnfs_block_short_extent *lce, *save;
+ unsigned int count = 0;
+ struct list_head *ranges = &bl->bl_committing;
+ __be32 *p, *xdr_start;
+
+ dprintk("%s enter\n", __func__);
+ /* BUG - creation of bl_commit is buggy - need to wait for
+ * entire block to be marked WRITTEN before it can be added.
+ */
+ spin_lock(&bl->bl_ext_lock);
+ /* Want to adjust for possible truncate */
+ /* We now want to adjust argument range */
+
+ /* XDR encode the ranges found */
+ xdr_start = xdr_reserve_space(xdr, 8);
+ if (!xdr_start)
+ goto out;
+ list_for_each_entry_safe(lce, save, &bl->bl_commit, bse_node) {
+ p = xdr_reserve_space(xdr, 7 * 4 + sizeof(lce->bse_devid.data));
+ if (!p)
+ break;
+ WRITE_DEVID(&lce->bse_devid);
+ WRITE64(lce->bse_f_offset << 9);
+ WRITE64(lce->bse_length << 9);
+ WRITE64(0LL);
+ WRITE32(PNFS_BLOCK_READWRITE_DATA);
+ list_del(&lce->bse_node);
+ list_add_tail(&lce->bse_node, ranges);
+ bl->bl_count--;
+ count++;
+ }
+ xdr_start[0] = cpu_to_be32((xdr->p - xdr_start - 1) * 4);
+ xdr_start[1] = cpu_to_be32(count);
+out:
+ spin_unlock(&bl->bl_ext_lock);
+ dprintk("%s found %u ranges\n", __func__, count);
+ return 0;
+}
+
/* Helper function to set_to_rw that initialize a new extent */
static void
_prep_new_extent(struct pnfs_block_extent *new,
--
1.7.4.1
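
A note for readers on the header handling in encode_pnfs_block_layoutupdate() above: it reserves two words up front, encodes the entries, then goes back and fills in the byte length and the entry count. Below is a minimal userspace sketch of that back-patching pattern. The names short_extent, put64 and encode_update are hypothetical stand-ins, not kernel API, and the bounds check replaces xdr_reserve_space().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htonl, ntohl */

#define DEVID_SIZE	16	/* bytes, same size as NFS4_DEVICEID4_SIZE */
#define ENTRY_WORDS	(DEVID_SIZE / 4 + 3 * 2 + 1)	/* 11 words = 44 bytes */

/* Hypothetical stand-in for struct pnfs_block_short_extent. */
struct short_extent {
	unsigned char devid[DEVID_SIZE];
	uint64_t f_offset;	/* in 512-byte sectors */
	uint64_t length;	/* in 512-byte sectors */
};

/* Store a 64-bit value as two big-endian 32-bit words. */
static uint32_t *put64(uint32_t *p, uint64_t n)
{
	*p++ = htonl((uint32_t)(n >> 32));
	*p++ = htonl((uint32_t)n);
	return p;
}

/*
 * Encode up to n entries into buf (nwords 32-bit words long).
 * buf[0] receives the byte length of everything after it,
 * buf[1] receives the number of entries actually encoded.
 */
static unsigned int encode_update(uint32_t *buf, size_t nwords,
				  const struct short_extent *e, unsigned int n)
{
	uint32_t *p;
	unsigned int i, count = 0;

	if (nwords < 2)
		return 0;
	p = buf + 2;			/* skip the header for now */
	for (i = 0; i < n; i++) {
		if ((size_t)(p - buf) + ENTRY_WORDS > nwords)
			break;		/* out of space: stop, keep what fits */
		memcpy(p, e[i].devid, DEVID_SIZE);
		p += DEVID_SIZE / 4;
		p = put64(p, e[i].f_offset << 9);	/* sectors to bytes */
		p = put64(p, e[i].length << 9);
		p = put64(p, 0);			/* storage offset */
		*p++ = htonl(0);			/* READWRITE_DATA */
		count++;
	}
	buf[0] = htonl((uint32_t)((p - buf - 1) * 4));	/* back-patch length */
	buf[1] = htonl(count);				/* back-patch count */
	return count;
}
```

With room for two entries the length word comes out as (1 + 2 * 11) * 4 = 92 bytes. If only one entry fits, the header still describes exactly what was written.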


2011-06-14 02:32:46

by Jim Rees

Subject: [PATCH 17/33] pnfsblock: call and parse getdevicelist

From: Fred Isaman <[email protected]>

Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO
for each device returned.

[pnfsblock: fix pnfs_deviceid references]
Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: fix print format warnings for sector_t and size_t]
[pnfs-block: #include <linux/vmalloc.h>]
[pnfsblock: no PNFS_NFS_SERVER]
Signed-off-by: Benny Halevy <[email protected]>
[pnfsblock: fix bug determining size of striped volume]
[pnfsblock: fix oops when using multiple devices]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 155 +++++++++++++++++++++++++++++++++++++-
fs/nfs/blocklayout/blocklayout.h | 98 ++++++++++++++++++++++++-
2 files changed, 250 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 992fd31..243ce3f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -31,7 +31,7 @@
*/
#include <linux/module.h>
#include <linux/init.h>
-
+#include <linux/vmalloc.h>
#include "blocklayout.h"

#define NFSDBG_FACILITY NFSDBG_PNFS_LD
@@ -161,17 +161,168 @@ bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
{
}

+static void free_blk_mountid(struct block_mount_id *mid)
+{
+ if (mid) {
+ struct pnfs_block_dev *dev;
+ spin_lock(&mid->bm_lock);
+ while (!list_empty(&mid->bm_devlist)) {
+ dev = list_first_entry(&mid->bm_devlist,
+ struct pnfs_block_dev,
+ bm_node);
+ list_del(&dev->bm_node);
+ free_block_dev(dev);
+ }
+ spin_unlock(&mid->bm_lock);
+ kfree(mid);
+ }
+}
+
+/* This is mostly copied from the filelayout's get_device_info function.
+ * It seems much of this should be at the generic pnfs level.
+ */
+static struct pnfs_block_dev *
+nfs4_blk_get_deviceinfo(struct nfs_server *server, const struct nfs_fh *fh,
+ struct nfs4_deviceid *d_id,
+ struct list_head *sdlist)
+{
+ struct pnfs_device *dev;
+ struct pnfs_block_dev *rv = NULL;
+ u32 max_resp_sz;
+ int max_pages;
+ struct page **pages = NULL;
+ int i, rc;
+
+ /*
+ * Use the session max response size as the basis for setting
+ * GETDEVICEINFO's maxcount
+ */
+ max_resp_sz = server->nfs_client->cl_session->fc_attrs.max_resp_sz;
+ max_pages = max_resp_sz >> PAGE_SHIFT;
+ dprintk("%s max_resp_sz %u max_pages %d\n",
+ __func__, max_resp_sz, max_pages);
+
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev) {
+ dprintk("%s kmalloc failed\n", __func__);
+ return NULL;
+ }
+
+ pages = kzalloc(max_pages * sizeof(struct page *), GFP_KERNEL);
+ if (pages == NULL) {
+ kfree(dev);
+ return NULL;
+ }
+ for (i = 0; i < max_pages; i++) {
+ pages[i] = alloc_page(GFP_KERNEL);
+ if (!pages[i])
+ goto out_free;
+ }
+
+ /* set dev->area */
+ dev->area = vmap(pages, max_pages, VM_MAP, PAGE_KERNEL);
+ if (!dev->area)
+ goto out_free;
+
+ memcpy(&dev->dev_id, d_id, sizeof(*d_id));
+ dev->layout_type = LAYOUT_BLOCK_VOLUME;
+ dev->pages = pages;
+ dev->pgbase = 0;
+ dev->pglen = PAGE_SIZE * max_pages;
+ dev->mincount = 0;
+
+ dprintk("%s: dev_id: %s\n", __func__, dev->dev_id.data);
+ rc = nfs4_proc_getdeviceinfo(server, dev);
+ dprintk("%s getdevice info returns %d\n", __func__, rc);
+ if (rc)
+ goto out_free;
+
+ rv = nfs4_blk_decode_device(server, dev, sdlist);
+ out_free:
+ if (dev->area != NULL)
+ vunmap(dev->area);
+ for (i = 0; i < max_pages && pages[i]; i++)
+ __free_page(pages[i]);
+ kfree(pages);
+ kfree(dev);
+ return rv;
+}
+
static int
bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
{
+ struct block_mount_id *b_mt_id = NULL;
+ struct pnfs_mount_type *mtype = NULL;
+ struct pnfs_devicelist *dlist = NULL;
+ struct pnfs_block_dev *bdev;
+ LIST_HEAD(block_disklist);
+ int status = -ENOMEM, i;
+
dprintk("%s enter\n", __func__);
- return 0;
+
+ if (server->pnfs_blksize == 0) {
+ dprintk("%s Server did not return blksize\n", __func__);
+ return -EINVAL;
+ }
+ b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_KERNEL);
+ if (!b_mt_id) {
+ status = -ENOMEM;
+ goto out_error;
+ }
+ /* Initialize nfs4 block layout mount id */
+ spin_lock_init(&b_mt_id->bm_lock);
+ INIT_LIST_HEAD(&b_mt_id->bm_devlist);
+
+ dlist = kmalloc(sizeof(struct pnfs_devicelist), GFP_KERNEL);
+ if (!dlist)
+ goto out_error;
+ dlist->eof = 0;
+ while (!dlist->eof) {
+ status = nfs4_proc_getdevicelist(server, fh, dlist);
+ if (status)
+ goto out_error;
+ dprintk("%s GETDEVICELIST numdevs=%i, eof=%i\n",
+ __func__, dlist->num_devs, dlist->eof);
+ /* For each device returned in dlist, call GETDEVICEINFO, and
+ * decode the opaque topology encoding to create a flat
+ * volume topology, matching VOLUME_SIMPLE disk signatures
+ * to disks in the visible block disk list.
+ * Construct an LVM meta device from the flat volume topology.
+ */
+ for (i = 0; i < dlist->num_devs; i++) {
+ bdev = nfs4_blk_get_deviceinfo(server, fh,
+ &dlist->dev_id[i],
+ &block_disklist);
+ if (!bdev) {
+ status = -ENODEV;
+ goto out_error;
+ }
+ spin_lock(&b_mt_id->bm_lock);
+ list_add(&bdev->bm_node, &b_mt_id->bm_devlist);
+ spin_unlock(&b_mt_id->bm_lock);
+ }
+ }
+ dprintk("%s SUCCESS\n", __func__);
+ server->pnfs_ld_data = b_mt_id;
+
+ out_return:
+ kfree(dlist);
+ return status;
+
+ out_error:
+ free_blk_mountid(b_mt_id);
+ kfree(mtype);
+ goto out_return;
}

static int
bl_clear_layoutdriver(struct nfs_server *server)
{
+ struct block_mount_id *b_mt_id = server->pnfs_ld_data;
+
dprintk("%s enter\n", __func__);
+ free_blk_mountid(b_mt_id);
+ dprintk("%s RETURNS\n", __func__);
return 0;
}

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 6705b10..a596e75 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,12 +35,60 @@
#include <linux/nfs_fs.h>
#include "../pnfs.h"

+struct block_mount_id {
+ spinlock_t bm_lock; /* protects list */
+ struct list_head bm_devlist; /* holds pnfs_block_dev */
+};
+
struct pnfs_block_dev {
struct list_head bm_node;
struct nfs4_deviceid bm_mdevid; /* associated devid */
struct block_device *bm_mdev; /* meta device itself */
};

+/* holds visible disks that can be matched against VOLUME_SIMPLE signatures */
+struct visible_block_device {
+ struct list_head vi_node;
+ struct block_device *vi_bdev;
+ int vi_mapped;
+ int vi_put_done;
+};
+
+enum blk_vol_type {
+ PNFS_BLOCK_VOLUME_SIMPLE = 0, /* maps to a single LU */
+ PNFS_BLOCK_VOLUME_SLICE = 1, /* slice of another volume */
+ PNFS_BLOCK_VOLUME_CONCAT = 2, /* concatenation of multiple volumes */
+ PNFS_BLOCK_VOLUME_STRIPE = 3 /* striped across multiple volumes */
+};
+
+/* All disk offset/lengths are stored in 512-byte sectors */
+struct pnfs_blk_volume {
+ uint32_t bv_type;
+ sector_t bv_size;
+ struct pnfs_blk_volume **bv_vols;
+ int bv_vol_n;
+ union {
+ dev_t bv_dev;
+ sector_t bv_stripe_unit;
+ sector_t bv_offset;
+ };
+};
+
+/* Since components need not be aligned, cannot use sector_t */
+struct pnfs_blk_sig_comp {
+ int64_t bs_offset; /* In bytes */
+ uint32_t bs_length; /* In bytes */
+ char *bs_string;
+};
+
+/* Maximum number of signature components in a simple volume */
+# define PNFS_BLOCK_MAX_SIG_COMP 16
+
+struct pnfs_blk_sig {
+ int si_num_comps;
+ struct pnfs_blk_sig_comp si_comps[PNFS_BLOCK_MAX_SIG_COMP];
+};
+
enum exstate4 {
PNFS_BLOCK_READWRITE_DATA = 0,
PNFS_BLOCK_READ_DATA = 1,
@@ -96,7 +144,10 @@ struct pnfs_block_layout {
sector_t bl_blocksize; /* Server blocksize in sectors */
};

-static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
+#define BLK_ID(lo) ((struct block_mount_id *)(NFS_SERVER(lo->plh_inode)->pnfs_ld_data))
+
+static inline struct pnfs_block_layout *
+BLK_LO2EXT(struct pnfs_layout_hdr *lo)
{
return container_of(lo, struct pnfs_block_layout, bl_layout);
}
@@ -107,6 +158,51 @@ BLK_LSEG2EXT(struct pnfs_layout_segment *lseg)
return BLK_LO2EXT(lseg->pls_layout);
}

+uint32_t *blk_overflow(uint32_t *p, uint32_t *end, size_t nbytes);
+
+#define BLK_READBUF(p, e, nbytes) do { \
+ p = blk_overflow(p, e, nbytes); \
+ if (!p) { \
+ printk(KERN_WARNING \
+ "%s: reply buffer overflowed in line %d.\n", \
+ __func__, __LINE__); \
+ goto out_err; \
+ } \
+} while (0)
+
+#define READ32(x) (x) = ntohl(*p++)
+#define READ64(x) do { \
+ (x) = (uint64_t)ntohl(*p++) << 32; \
+ (x) |= ntohl(*p++); \
+} while (0)
+#define COPYMEM(x, nbytes) do { \
+ memcpy((x), p, nbytes); \
+ p += XDR_QUADLEN(nbytes); \
+} while (0)
+#define READ_DEVID(x) COPYMEM((x)->data, NFS4_DEVICEID4_SIZE)
+#define READ_SECTOR(x) do { \
+ READ64(tmp); \
+ if (tmp & 0x1ff) { \
+ printk(KERN_WARNING \
+ "%s Value not 512-byte aligned at line %d\n", \
+ __func__, __LINE__); \
+ goto out_err; \
+ } \
+ (x) = tmp >> 9; \
+} while (0)
+
+#define WRITE32(n) do { \
+ *p++ = htonl(n); \
+ } while (0)
+#define WRITE64(n) do { \
+ *p++ = htonl((uint32_t)((n) >> 32)); \
+ *p++ = htonl((uint32_t)(n)); \
+} while (0)
+#define WRITEMEM(ptr, nbytes) do { \
+ p = xdr_encode_opaque_fixed(p, ptr, nbytes); \
+} while (0)
+#define WRITE_DEVID(x) WRITEMEM((x)->data, NFS4_DEVICEID4_SIZE)
+
/* blocklayoutdev.c */
struct block_device *nfs4_blkdev_get(dev_t dev);
int nfs4_blkdev_put(struct block_device *bdev);
--
1.7.4.1
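
For anyone reading the READ64 and READ_SECTOR macros above outside the kernel tree, here is a rough userspace sketch of what they do: pull a 64-bit big-endian value out of two XDR words, then reject it unless it is a multiple of 512 bytes. The function names get64 and get_sector are made up for illustration, and returning NULL stands in for the macro's goto out_err.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htonl, ntohl */

/* Decode two big-endian 32-bit words into one 64-bit value. */
static const uint32_t *get64(const uint32_t *p, uint64_t *out)
{
	uint64_t v = (uint64_t)ntohl(*p++) << 32;

	v |= ntohl(*p++);
	*out = v;
	return p;
}

/*
 * Decode a byte count and convert it to 512-byte sectors.
 * Returns the advanced cursor, or NULL if the value is not
 * 512-byte aligned (the macro version jumps to out_err instead).
 */
static const uint32_t *get_sector(const uint32_t *p, uint64_t *sectors)
{
	uint64_t bytes;

	p = get64(p, &bytes);
	if (bytes & 0x1ff)
		return NULL;
	*sectors = bytes >> 9;
	return p;
}
```

Returning the cursor instead of updating a hidden local p makes the data flow explicit, at the cost of being more verbose at each call site than the macros.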


2011-06-14 02:32:58

by Jim Rees

Subject: [PATCH 22/33] pnfsblock: merge rw extents

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/extents.c | 47 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 3d36f66..43a3601 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -501,3 +501,50 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
print_bl_extent(ret);
return ret;
}
+
+/* Helper function to set_to_rw that initialize a new extent */
+static void
+_prep_new_extent(struct pnfs_block_extent *new,
+ struct pnfs_block_extent *orig,
+ sector_t offset, sector_t length, int state)
+{
+ kref_init(&new->be_refcnt);
+ /* don't need to INIT_LIST_HEAD(&new->be_node) */
+ memcpy(&new->be_devid, &orig->be_devid, sizeof(struct nfs4_deviceid));
+ new->be_mdev = orig->be_mdev;
+ new->be_f_offset = offset;
+ new->be_length = length;
+ new->be_v_offset = orig->be_v_offset - orig->be_f_offset + offset;
+ new->be_state = state;
+ new->be_inval = orig->be_inval;
+}
+
+/* Tries to merge be with extent in front of it in list.
+ * Frees storage if not used.
+ */
+static struct pnfs_block_extent *
+_front_merge(struct pnfs_block_extent *be, struct list_head *head,
+ struct pnfs_block_extent *storage)
+{
+ struct pnfs_block_extent *prev;
+
+ if (!storage)
+ goto no_merge;
+ if (&be->be_node == head || be->be_node.prev == head)
+ goto no_merge;
+ prev = list_entry(be->be_node.prev, struct pnfs_block_extent, be_node);
+ if ((prev->be_f_offset + prev->be_length != be->be_f_offset) ||
+ !extents_consistent(prev, be))
+ goto no_merge;
+ _prep_new_extent(storage, prev, prev->be_f_offset,
+ prev->be_length + be->be_length, prev->be_state);
+ list_replace(&prev->be_node, &storage->be_node);
+ put_extent(prev);
+ list_del(&be->be_node);
+ put_extent(be);
+ return storage;
+
+ no_merge:
+ kfree(storage);
+ return be;
+}
--
1.7.4.1
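
The merge condition in _front_merge() above can be tried out in isolation. The sketch below is a userspace approximation only: struct ext and both function names are hypothetical, and ext_consistent() is an assumed stand-in for extents_consistent(), whose body is not part of this patch. It treats two extents as consistent when they share a state and the same file-to-volume offset delta.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cut-down version of struct pnfs_block_extent. */
struct ext {
	uint64_t f_offset;	/* file offset, in sectors */
	uint64_t length;	/* length, in sectors */
	uint64_t v_offset;	/* volume offset, in sectors */
	int state;
};

/* Assumed consistency rule: same state, same file-to-volume delta. */
static bool ext_consistent(const struct ext *a, const struct ext *b)
{
	return a->state == b->state &&
	       a->v_offset - a->f_offset == b->v_offset - b->f_offset;
}

/*
 * If prev ends exactly where be begins and the two are consistent,
 * write the merged extent to *out and return true. Otherwise leave
 * *out alone and return false.
 */
static bool ext_front_merge(const struct ext *prev, const struct ext *be,
			    struct ext *out)
{
	if (prev->f_offset + prev->length != be->f_offset ||
	    !ext_consistent(prev, be))
		return false;
	*out = *prev;		/* keeps prev's offsets and state */
	out->length = prev->length + be->length;
	return true;
}
```

The merged extent keeps prev's f_offset and v_offset, which matches what _prep_new_extent() computes when it is called with orig = prev and offset = prev->be_f_offset.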


2011-06-14 02:32:13

by Jim Rees

Subject: [PATCH 04/33] pnfs: hook nfs_write_begin/end to allow layout driver manipulation

From: Peng Tao <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Reported-by: Alexandros Batsakis <[email protected]>
Signed-off-by: Andy Adamson <[email protected]>
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/file.c | 26 ++++++++++-
fs/nfs/pnfs.c | 41 +++++++++++++++++
fs/nfs/pnfs.h | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/write.c | 12 +++--
include/linux/nfs_fs.h | 3 +-
5 files changed, 189 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 2f093ed..1768762 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -384,12 +384,15 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index = pos >> PAGE_CACHE_SHIFT;
struct page *page;
int once_thru = 0;
+ struct pnfs_layout_segment *lseg;

dfprintk(PAGECACHE, "NFS: write_begin(%s/%s(%ld), %u@%lld)\n",
file->f_path.dentry->d_parent->d_name.name,
file->f_path.dentry->d_name.name,
mapping->host->i_ino, len, (long long) pos);
-
+ lseg = pnfs_update_layout(mapping->host,
+ nfs_file_open_context(file),
+ pos, len, IOMODE_RW, GFP_NOFS);
start:
/*
* Prevent starvation issues if someone is doing a consistency
@@ -409,6 +412,9 @@ start:
if (ret) {
unlock_page(page);
page_cache_release(page);
+ *pagep = NULL;
+ *fsdata = NULL;
+ goto out;
} else if (!once_thru &&
nfs_want_read_modify_write(file, page, pos, len)) {
once_thru = 1;
@@ -417,6 +423,12 @@ start:
if (!ret)
goto start;
}
+ ret = pnfs_write_begin(file, page, pos, len, lseg, fsdata);
+ out:
+ if (ret) {
+ put_lseg(lseg);
+ *fsdata = NULL;
+ }
return ret;
}

@@ -426,6 +438,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
{
unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
int status;
+ struct pnfs_layout_segment *lseg;

dfprintk(PAGECACHE, "NFS: write_end(%s/%s(%ld), %u@%lld)\n",
file->f_path.dentry->d_parent->d_name.name,
@@ -452,10 +465,17 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
zero_user_segment(page, pglen, PAGE_CACHE_SIZE);
}

- status = nfs_updatepage(file, page, offset, copied);
+ lseg = nfs4_pull_lseg_from_fsdata(file, fsdata);
+ status = pnfs_write_end(file, page, pos, len, copied, lseg);
+ if (status)
+ goto out;
+ status = nfs_updatepage(file, page, offset, copied, lseg, fsdata);

+out:
unlock_page(page);
page_cache_release(page);
+ pnfs_write_end_cleanup(file, fsdata);
+ put_lseg(lseg);

if (status < 0)
return status;
@@ -577,7 +597,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

ret = VM_FAULT_LOCKED;
if (nfs_flush_incompatible(filp, page) == 0 &&
- nfs_updatepage(filp, page, 0, pagelen) == 0)
+ nfs_updatepage(filp, page, 0, pagelen, NULL, NULL) == 0)
goto out;

ret = VM_FAULT_SIGBUS;
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e252af1..5373960 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1188,6 +1188,41 @@ pnfs_try_to_write_data(struct nfs_write_data *wdata,
}

/*
+ * This gives the layout driver an opportunity to read in page "around"
+ * the data to be written. It returns 0 on success, otherwise an error code
+ * which will either be passed up to user, or ignored if
+ * some previous part of write succeeded.
+ * Note the range [pos, pos+len-1] is entirely within the page.
+ */
+int _pnfs_write_begin(struct inode *inode, struct page *page,
+ loff_t pos, unsigned len,
+ struct pnfs_layout_segment *lseg,
+ struct pnfs_fsdata **fsdata)
+{
+ struct pnfs_fsdata *data;
+ int status = 0;
+
+ dprintk("--> %s: pos=%llu len=%u\n",
+ __func__, (unsigned long long)pos, len);
+ data = kzalloc(sizeof(struct pnfs_fsdata), GFP_KERNEL);
+ if (!data) {
+ status = -ENOMEM;
+ goto out;
+ }
+ data->lseg = lseg; /* refcount passed into data to be managed there */
+ status = NFS_SERVER(inode)->pnfs_curr_ld->write_begin(
+ lseg, page, pos, len, data);
+ if (status) {
+ kfree(data);
+ data = NULL;
+ }
+out:
+ *fsdata = data;
+ dprintk("<-- %s: status=%d\n", __func__, status);
+ return status;
+}
+
+/*
* Called by non rpc-based layout drivers
*/
int
@@ -1288,6 +1323,12 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
}
EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);

+void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
+{
+ /* lseg refcounting handled directly in nfs_write_end */
+ kfree(fsdata);
+}
+
/*
* For the LAYOUT4_NFSV4_1_FILES layout type, NFS_DATA_SYNC WRITEs and
* NFS_UNSTABLE WRITEs with a COMMIT to data servers must store enough
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 0ac820f..57aefb6 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -54,6 +54,12 @@ enum pnfs_try_status {
PNFS_NOT_ATTEMPTED = 1,
};

+struct pnfs_fsdata {
+ struct pnfs_layout_segment *lseg;
+ int bypass_eof;
+ void *private;
+};
+
#ifdef CONFIG_NFS_V4_1

#define LAYOUT_NFSV4_1_MODULE_PREFIX "nfs-layouttype4"
@@ -107,6 +113,14 @@ struct pnfs_layoutdriver_type {
*/
enum pnfs_try_status (*read_pagelist) (struct nfs_read_data *nfs_data);
enum pnfs_try_status (*write_pagelist) (struct nfs_write_data *nfs_data, int how);
+ int (*write_begin) (struct pnfs_layout_segment *lseg, struct page *page,
+ loff_t pos, unsigned count,
+ struct pnfs_fsdata *fsdata);
+ int (*write_end)(struct inode *inode, struct page *page, loff_t pos,
+ unsigned count, unsigned copied,
+ struct pnfs_layout_segment *lseg);
+ void (*write_end_cleanup)(struct file *filp,
+ struct pnfs_fsdata *fsdata);

void (*free_deviceid_node) (struct nfs4_deviceid_node *);

@@ -178,6 +192,7 @@ enum pnfs_try_status pnfs_try_to_read_data(struct nfs_read_data *,
void pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *, struct nfs_page *);
void pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *, struct nfs_page *);
bool pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, struct nfs_page *req);
+void pnfs_free_fsdata(struct pnfs_fsdata *fsdata);
int pnfs_layout_process(struct nfs4_layoutget *lgp);
void pnfs_free_lseg_list(struct list_head *tmp_list);
void pnfs_destroy_layout(struct nfs_inode *);
@@ -189,6 +204,10 @@ void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
struct pnfs_layout_hdr *lo,
struct nfs4_state *open_state);
+int _pnfs_write_begin(struct inode *inode, struct page *page,
+ loff_t pos, unsigned len,
+ struct pnfs_layout_segment *lseg,
+ struct pnfs_fsdata **fsdata);
int mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
struct pnfs_layout_range *recall_range);
@@ -291,6 +310,13 @@ static inline void pnfs_clear_request_commit(struct nfs_page *req)
put_lseg(req->wb_commit_lseg);
}

+static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
+ struct pnfs_fsdata *fsdata)
+{
+ return !fsdata || ((struct pnfs_layout_segment *)fsdata == lseg) ||
+ !fsdata->bypass_eof;
+}
+
/* Should the pNFS client commit and return the layout upon a setattr */
static inline bool
pnfs_ld_layoutret_on_setattr(struct inode *inode)
@@ -301,6 +327,49 @@ pnfs_ld_layoutret_on_setattr(struct inode *inode)
PNFS_LAYOUTRET_ON_SETATTR;
}

+static inline int pnfs_write_begin(struct file *filp, struct page *page,
+ loff_t pos, unsigned len,
+ struct pnfs_layout_segment *lseg,
+ void **fsdata)
+{
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct nfs_server *nfss = NFS_SERVER(inode);
+ int status = 0;
+
+ *fsdata = lseg;
+ if (lseg && nfss->pnfs_curr_ld->write_begin)
+ status = _pnfs_write_begin(inode, page, pos, len, lseg,
+ (struct pnfs_fsdata **) fsdata);
+ return status;
+}
+
+/* CAREFUL - what happens if copied < len??? */
+static inline int pnfs_write_end(struct file *filp, struct page *page,
+ loff_t pos, unsigned len, unsigned copied,
+ struct pnfs_layout_segment *lseg)
+{
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct nfs_server *nfss = NFS_SERVER(inode);
+
+ if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_end)
+ return nfss->pnfs_curr_ld->write_end(inode, page, pos, len,
+ copied, lseg);
+ else
+ return 0;
+}
+
+static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
+{
+ struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
+
+ if (fsdata && nfss->pnfs_curr_ld) {
+ if (nfss->pnfs_curr_ld->write_end_cleanup)
+ nfss->pnfs_curr_ld->write_end_cleanup(filp, fsdata);
+ if (nfss->pnfs_curr_ld->write_begin)
+ pnfs_free_fsdata(fsdata);
+ }
+}
+
static inline int pnfs_return_layout(struct inode *ino)
{
struct nfs_inode *nfsi = NFS_I(ino);
@@ -312,6 +381,19 @@ static inline int pnfs_return_layout(struct inode *ino)
return 0;
}

+static inline struct pnfs_layout_segment *
+nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
+{
+ if (fsdata) {
+ struct nfs_server *nfss = NFS_SERVER(filp->f_dentry->d_inode);
+
+ if (nfss->pnfs_curr_ld && nfss->pnfs_curr_ld->write_begin)
+ return ((struct pnfs_fsdata *) fsdata)->lseg;
+ return (struct pnfs_layout_segment *)fsdata;
+ }
+ return NULL;
+}
+
#else /* CONFIG_NFS_V4_1 */

static inline void pnfs_destroy_all_layouts(struct nfs_client *clp)
@@ -332,6 +414,12 @@ static inline void put_lseg(struct pnfs_layout_segment *lseg)
{
}

+static inline int pnfs_grow_ok(struct pnfs_layout_segment *lseg,
+ struct pnfs_fsdata *fsdata)
+{
+ return 1;
+}
+
static inline enum pnfs_try_status
pnfs_try_to_read_data(struct nfs_read_data *data,
const struct rpc_call_ops *call_ops)
@@ -351,6 +439,26 @@ static inline int pnfs_return_layout(struct inode *ino)
return 0;
}

+static inline int pnfs_write_begin(struct file *filp, struct page *page,
+ loff_t pos, unsigned len,
+ struct pnfs_layout_segment *lseg,
+ void **fsdata)
+{
+ *fsdata = NULL;
+ return 0;
+}
+
+static inline int pnfs_write_end(struct file *filp, struct page *page,
+ loff_t pos, unsigned len, unsigned copied,
+ struct pnfs_layout_segment *lseg)
+{
+ return 0;
+}
+
+static inline void pnfs_write_end_cleanup(struct file *filp, void *fsdata)
+{
+}
+
static inline bool
pnfs_ld_layoutret_on_setattr(struct inode *inode)
{
@@ -427,6 +535,13 @@ static inline int pnfs_layoutcommit_inode(struct inode *inode, bool sync)
static inline void nfs4_deviceid_purge_client(struct nfs_client *ncl)
{
}
+
+static inline struct pnfs_layout_segment *
+nfs4_pull_lseg_from_fsdata(struct file *filp, void *fsdata)
+{
+ return NULL;
+}
+
#endif /* CONFIG_NFS_V4_1 */

#endif /* FS_NFS_PNFS_H */
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 70f1ef0..91d2acd 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -673,7 +673,9 @@ out:
}

static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
- unsigned int offset, unsigned int count)
+ unsigned int offset, unsigned int count,
+ struct pnfs_layout_segment *lseg, void *fsdata)
+
{
struct nfs_page *req;

@@ -681,7 +683,8 @@ static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
if (IS_ERR(req))
return PTR_ERR(req);
/* Update file length */
- nfs_grow_file(page, offset, count);
+ if (pnfs_grow_ok(lseg, fsdata))
+ nfs_grow_file(page, offset, count);
nfs_mark_uptodate(page, req->wb_pgbase, req->wb_bytes);
nfs_mark_request_dirty(req);
nfs_clear_page_tag_locked(req);
@@ -734,7 +737,8 @@ static int nfs_write_pageuptodate(struct page *page, struct inode *inode)
* things with a page scheduled for an RPC call (e.g. invalidate it).
*/
int nfs_updatepage(struct file *file, struct page *page,
- unsigned int offset, unsigned int count)
+ unsigned int offset, unsigned int count,
+ struct pnfs_layout_segment *lseg, void *fsdata)
{
struct nfs_open_context *ctx = nfs_file_open_context(file);
struct inode *inode = page->mapping->host;
@@ -759,7 +763,7 @@ int nfs_updatepage(struct file *file, struct page *page,
offset = 0;
}

- status = nfs_writepage_setup(ctx, page, offset, count);
+ status = nfs_writepage_setup(ctx, page, offset, count, lseg, fsdata);
if (status < 0)
nfs_set_pageerror(page);

diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 1b93b9c..e459379 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -510,7 +510,8 @@ extern int nfs_congestion_kb;
extern int nfs_writepage(struct page *page, struct writeback_control *wbc);
extern int nfs_writepages(struct address_space *, struct writeback_control *);
extern int nfs_flush_incompatible(struct file *file, struct page *page);
-extern int nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
+extern int nfs_updatepage(struct file *, struct page *, unsigned int,
+ unsigned int, struct pnfs_layout_segment *, void *);
extern void nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);

/*
--
1.7.4.1
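
One thing that is easy to miss in the patch above is that the fsdata cookie has two possible shapes. If the layout driver provides write_begin, the cookie is a struct pnfs_fsdata that wraps the lseg. Otherwise the cookie is the lseg pointer itself. The following userspace sketch mirrors nfs4_pull_lseg_from_fsdata() and pnfs_grow_ok() under that reading. All type and function names here are simplified stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct lseg { int id; };

/* Stand-in for struct pnfs_fsdata. */
struct fsdata {
	struct lseg *lseg;
	int bypass_eof;
	void *private_data;
};

/* Stand-in for the one driver hook that decides the cookie's shape. */
struct driver {
	int (*write_begin)(struct lseg *, struct fsdata *);
};

/* Dummy hook so the "driver has write_begin" case can be exercised. */
static int demo_write_begin(struct lseg *l, struct fsdata *d)
{
	(void)l;
	(void)d;
	return 0;
}

/* Recover the lseg from the cookie, whichever shape it has. */
static struct lseg *pull_lseg(const struct driver *ld, void *cookie)
{
	if (!cookie)
		return NULL;
	if (ld && ld->write_begin)
		return ((struct fsdata *)cookie)->lseg;	/* wrapped */
	return (struct lseg *)cookie;			/* bare lseg */
}

/* File growth is allowed unless a wrapped cookie sets bypass_eof. */
static int grow_ok(const struct lseg *lseg, void *cookie)
{
	return !cookie || cookie == (const void *)lseg ||
	       !((struct fsdata *)cookie)->bypass_eof;
}
```

The pointer comparison in grow_ok() is what keeps the bare-lseg case from being dereferenced as a struct fsdata, so the order of the three tests matters.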


2011-06-14 02:32:09

by Jim Rees

Subject: [PATCH 02/33] pnfs: add set-clear layoutdriver interface

From: Benny Halevy <[email protected]>

Allow the layout driver to issue GETDEVICELIST at mount time and to clean
up at umount time.

[fixup non NFS_V4_1 set_pnfs_layoutdriver definition]
[pnfs: pass mntfh down the init_pnfs path]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/client.c | 8 +++++---
fs/nfs/pnfs.c | 16 ++++++++++++++--
fs/nfs/pnfs.h | 8 ++++++--
3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index b3dc2b8..d630bb7 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -906,7 +906,9 @@ error:
/*
* Load up the server record from information gained in an fsinfo record
*/
-static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
+static void nfs_server_set_fsinfo(struct nfs_server *server,
+ struct nfs_fh *mntfh,
+ struct nfs_fsinfo *fsinfo)
{
unsigned long max_rpc_payload;

@@ -936,7 +938,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *
if (server->wsize > NFS_MAX_FILE_IO_SIZE)
server->wsize = NFS_MAX_FILE_IO_SIZE;
server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- set_pnfs_layoutdriver(server, fsinfo->layouttype);
+ set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);

server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);

@@ -982,7 +984,7 @@ static int nfs_probe_fsinfo(struct nfs_server *server, struct nfs_fh *mntfh, str
if (error < 0)
goto out_error;

- nfs_server_set_fsinfo(server, &fsinfo);
+ nfs_server_set_fsinfo(server, mntfh, &fsinfo);

/* Get some general file system info */
if (server->namelen == 0) {
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index b848a7e..593a9aa 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -75,8 +75,11 @@ find_pnfs_driver(u32 id)
void
unset_pnfs_layoutdriver(struct nfs_server *nfss)
{
- if (nfss->pnfs_curr_ld)
+ if (nfss->pnfs_curr_ld) {
+ if (nfss->pnfs_curr_ld->clear_layoutdriver)
+ nfss->pnfs_curr_ld->clear_layoutdriver(nfss);
module_put(nfss->pnfs_curr_ld->owner);
+ }
nfss->pnfs_curr_ld = NULL;
}

@@ -87,7 +90,8 @@ unset_pnfs_layoutdriver(struct nfs_server *nfss)
* @id layout type. Zero (illegal layout type) indicates pNFS not in use.
*/
void
-set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
+set_pnfs_layoutdriver(struct nfs_server *server, const struct nfs_fh *mntfh,
+ u32 id)
{
struct pnfs_layoutdriver_type *ld_type = NULL;

@@ -114,6 +118,14 @@ set_pnfs_layoutdriver(struct nfs_server *server, u32 id)
goto out_no_driver;
}
server->pnfs_curr_ld = ld_type;
+ if (ld_type->set_layoutdriver
+ && ld_type->set_layoutdriver(server, mntfh)) {
+ printk(KERN_ERR
+ "%s: Error initializing mount point for layout driver %u.\n",
+ __func__, id);
+ module_put(ld_type->owner);
+ goto out_no_driver;
+ }

dprintk("%s: pNFS module for %u set\n", __func__, id);
return;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 9dc950c..f984598 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -80,6 +80,9 @@ struct pnfs_layoutdriver_type {
struct module *owner;
unsigned flags;

+ int (*set_layoutdriver) (struct nfs_server *, const struct nfs_fh *);
+ int (*clear_layoutdriver) (struct nfs_server *);
+
struct pnfs_layout_hdr * (*alloc_layout_hdr) (struct inode *inode, gfp_t gfp_flags);
void (*free_layout_hdr) (struct pnfs_layout_hdr *);

@@ -165,7 +168,7 @@ void put_lseg(struct pnfs_layout_segment *lseg);
bool pnfs_pageio_init_read(struct nfs_pageio_descriptor *, struct inode *);
bool pnfs_pageio_init_write(struct nfs_pageio_descriptor *, struct inode *, int);

-void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
+void set_pnfs_layoutdriver(struct nfs_server *, const struct nfs_fh *, u32);
void unset_pnfs_layoutdriver(struct nfs_server *);
enum pnfs_try_status pnfs_try_to_write_data(struct nfs_write_data *,
const struct rpc_call_ops *, int);
@@ -375,7 +378,8 @@ pnfs_roc_drain(struct inode *ino, u32 *barrier)
return false;
}

-static inline void set_pnfs_layoutdriver(struct nfs_server *s, u32 id)
+static inline void set_pnfs_layoutdriver(struct nfs_server *s,
+ const struct nfs_fh *mntfh, u32 id)
{
}

--
1.7.4.1
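
To illustrate the calling convention this patch introduces, here is a small userspace model of the optional set/clear hooks. It is a sketch under assumptions: the refs counter stands in for the module reference, the mntfh argument is dropped, and clearing curr_ld on failure is my guess at what the out_no_driver label does, since that code is outside the hunk. None of these names are the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct server;

/* Stand-in for struct pnfs_layoutdriver_type; both hooks are optional. */
struct layout_driver {
	int refs;	/* models try_module_get()/module_put() */
	int (*set_layoutdriver)(struct server *);
	int (*clear_layoutdriver)(struct server *);
};

struct server {
	struct layout_driver *curr_ld;
	int mounted;	/* per-mount state owned by the driver */
};

/* Demo hooks used to exercise the success and failure paths. */
static int demo_set_ok(struct server *s) { s->mounted = 1; return 0; }
static int demo_set_fail(struct server *s) { (void)s; return -1; }
static int demo_clear(struct server *s) { s->mounted = 0; return 0; }

static void set_driver(struct server *s, struct layout_driver *ld)
{
	ld->refs++;
	s->curr_ld = ld;
	/* A missing hook means the driver needs no per-mount setup. */
	if (ld->set_layoutdriver && ld->set_layoutdriver(s)) {
		ld->refs--;
		s->curr_ld = NULL;	/* assumed effect of out_no_driver */
	}
}

static void unset_driver(struct server *s)
{
	if (s->curr_ld) {
		if (s->curr_ld->clear_layoutdriver)
			s->curr_ld->clear_layoutdriver(s);
		s->curr_ld->refs--;
	}
	s->curr_ld = NULL;
}
```

The point of the model is the ordering: the hook runs after the reference is taken and is undone if the hook fails, and clear runs before the reference is dropped.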


2011-06-14 02:32:41

by Jim Rees

Subject: [PATCH 15/33] pnfsblock: lseg alloc and free

From: Fred Isaman <[email protected]>

Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Zhang Jingwang <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 39 +++++++++++++++++++++++++++++-----
fs/nfs/blocklayout/blocklayout.h | 6 +++++
fs/nfs/blocklayout/blocklayoutdev.c | 8 +++++++
3 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 4ca0838..992fd31 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -110,16 +110,43 @@ static struct pnfs_layout_hdr *bl_alloc_layout_hdr(struct inode *inode,
return &bl->bl_layout;
}

-static void
-bl_free_lseg(struct pnfs_layout_segment *lseg)
+static void bl_free_lseg(struct pnfs_layout_segment *lseg)
{
+ dprintk("%s enter\n", __func__);
+ kfree(lseg);
}

-static struct pnfs_layout_segment *
-bl_alloc_lseg(struct pnfs_layout_hdr *lo,
- struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
+/* Because the generic infrastructure does not correctly merge layouts,
+ * we pretty much ignore lseg, and store all data layout wide, so we
+ * can correctly merge. Eventually we should push some correct merge
+ * behavior up to the generic code, as the current behavior tends to
+ * cause lots of unnecessary overlapping LAYOUTGET requests.
+ */
+static struct pnfs_layout_segment *bl_alloc_lseg(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutget_res *lgr,
+ gfp_t gfp_flags)
{
- return NULL;
+ struct pnfs_layout_segment *lseg;
+ int status;
+
+ dprintk("%s enter\n", __func__);
+ lseg = kzalloc(sizeof(*lseg) + 0, gfp_flags);
+ if (!lseg)
+ return NULL;
+ status = nfs4_blk_process_layoutget(lo, lgr, gfp_flags);
+ if (status) {
+ /* We don't want to call the full-blown bl_free_lseg,
+ * since on error extents were not touched.
+ */
+ /* STUB - we really want to distinguish between 2 error
+ * conditions here. This lseg failed, but lo data structures
+ * are OK, or we hosed the lo data structures. The calling
+ * code probably needs to distinguish this too.
+ */
+ kfree(lseg);
+ return ERR_PTR(status);
+ }
+ return lseg;
}

static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 839b81d..5c1ccc1 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -93,6 +93,12 @@ static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
return container_of(lo, struct pnfs_block_layout, bl_layout);
}

+static inline struct pnfs_block_layout *
+BLK_LSEG2EXT(struct pnfs_layout_segment *lseg)
+{
+ return BLK_LO2EXT(lseg->pls_layout);
+}
+
/* blocklayoutdev.c */
struct block_device *nfs4_blkdev_get(dev_t dev);
int nfs4_blkdev_put(struct block_device *bdev);
diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 9a65a66..0fedf50 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -149,3 +149,11 @@ out_err:
kfree(msg);
return NULL;
}
+
+int
+nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
+{
+ /* STUB */
+ return -EIO;
+}
--
1.7.4.1
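
The BLK_LO2EXT()/BLK_LSEG2EXT() helpers in the patch above rely on the generic layout header being embedded in the driver-private structure, so container_of() can walk back from the header to the block layout. A minimal userspace sketch of that pattern follows. The structure and function names here (layout_hdr, layout_segment, block_layout, lo_to_block_layout, lseg_to_block_layout) are simplified, hypothetical stand-ins and not the kernel definitions, and the container_of macro is a plain offsetof-based version rather than the kernel's type-checked one.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(): recover a pointer to
 * the enclosing structure from a pointer to one of its members.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct layout_hdr {
	int refcount;
};

struct layout_segment {
	struct layout_hdr *pls_layout;	/* back pointer to the owning header */
};

struct block_layout {
	int bl_count;
	struct layout_hdr bl_layout;	/* generic header embedded in driver data */
};

/* Mirrors the idea of BLK_LO2EXT(): generic header -> driver layout. */
static struct block_layout *lo_to_block_layout(struct layout_hdr *lo)
{
	return container_of(lo, struct block_layout, bl_layout);
}

/* Mirrors the idea of BLK_LSEG2EXT(): segment -> header -> driver layout. */
static struct block_layout *lseg_to_block_layout(struct layout_segment *lseg)
{
	assert(lseg && lseg->pls_layout);
	return lo_to_block_layout(lseg->pls_layout);
}
```

In this sketch the embedded header is deliberately not the first member, to show that the conversion does not depend on member order.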


2011-06-14 02:33:04

by Jim Rees

Subject: [PATCH 24/33] pnfsblock: cleanup_layoutcommit

From: Fred Isaman <[email protected]>

In the blocklayout driver, two things happen during
layoutcommit and its cleanup:
1. The modified extents are encoded.
2. On cleanup, the extents are put back on the layout rw
extents list, for reads.

In the new system, where the actual XDR encoding is done in
encode_layoutcommit() directly into the xdr buffer, these are
the new commit stages:

1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication between
bl_encode_layoutcommit and bl_cleanup_layoutcommit.
(The generic layer provides a void pointer to hang it on.)

2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on the above structure.
FIXME: The code is not yet converted to the new XDR cleanup

3. On cleanup, the commit-extent-list is put back by a call
to set_to_rw() as before, but without the XDR decoding
of the list that was needed before. The commit-extent-list
is then freed. Finally, the allocated structure is freed.

[SQUASHME: pnfs: blocklayout: port block layout code]
Signed-off-by: Peng Tao <[email protected]>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <[email protected]>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <[email protected]>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <[email protected]>
[pnfsblock: cleanup_layoutcommit wants a status parameter]
Signed-off-by: Boaz Harrosh <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 2 +
fs/nfs/blocklayout/blocklayout.h | 3 +
fs/nfs/blocklayout/extents.c | 209 ++++++++++++++++++++++++++++++++++++++
3 files changed, 214 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 6d0844c..f3189d6 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -161,6 +161,8 @@ static void
bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
struct nfs4_layoutcommit_data *lcdata)
{
+ dprintk("%s enter\n", __func__);
+ clean_pnfs_block_layoutupdate(BLK_LO2EXT(lo), &lcdata->args, lcdata->res.status);
}

static void free_blk_mountid(struct block_mount_id *mid)
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index c3b41f4..5a7d0be 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -262,6 +262,9 @@ int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
const struct nfs4_layoutcommit_args *arg);
+void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+ const struct nfs4_layoutcommit_args *arg,
+ int status);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index e754d32..1447bfc 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -327,6 +327,73 @@ void print_clist(struct list_head *list, unsigned int count)
dprintk("****************\n");
}

+/* Note: In theory, we should do more checking that devid's match between
+ * old and new, but if they don't, the lists are too corrupt to salvage anyway.
+ */
+/* Note this is very similar to add_and_merge_extent */
+static void add_to_commitlist(struct pnfs_block_layout *bl,
+ struct pnfs_block_short_extent *new)
+{
+ struct list_head *clist = &bl->bl_commit;
+ struct pnfs_block_short_extent *old, *save;
+ sector_t end = new->bse_f_offset + new->bse_length;
+
+ dprintk("%s enter\n", __func__);
+ print_short_extent(new);
+ print_clist(clist, bl->bl_count);
+ bl->bl_count++;
+ /* Scan for proper place to insert, extending new to the left
+ * as much as possible.
+ */
+ list_for_each_entry_safe(old, save, clist, bse_node) {
+ if (new->bse_f_offset < old->bse_f_offset)
+ break;
+ if (end <= old->bse_f_offset + old->bse_length) {
+ /* Range is already in list */
+ bl->bl_count--;
+ kfree(new);
+ return;
+ } else if (new->bse_f_offset <=
+ old->bse_f_offset + old->bse_length) {
+ /* new overlaps or abuts existing be */
+ if (new->bse_mdev == old->bse_mdev) {
+ /* extend new to fully replace old */
+ new->bse_length += new->bse_f_offset -
+ old->bse_f_offset;
+ new->bse_f_offset = old->bse_f_offset;
+ list_del(&old->bse_node);
+ bl->bl_count--;
+ kfree(old);
+ }
+ }
+ }
+ /* Note that if we never hit the above break, old will not point to a
+ * valid extent. However, in that case &old->bse_node==list.
+ */
+ list_add_tail(&new->bse_node, &old->bse_node);
+ /* Scan forward for overlaps. If we find any, extend new and
+ * remove the overlapped extent.
+ */
+ old = list_prepare_entry(new, clist, bse_node);
+ list_for_each_entry_safe_continue(old, save, clist, bse_node) {
+ if (end < old->bse_f_offset)
+ break;
+ /* new overlaps or abuts old */
+ if (new->bse_mdev == old->bse_mdev) {
+ if (end < old->bse_f_offset + old->bse_length) {
+ /* extend new to fully cover old */
+ end = old->bse_f_offset + old->bse_length;
+ new->bse_length = end - new->bse_f_offset;
+ }
+ list_del(&old->bse_node);
+ bl->bl_count--;
+ kfree(old);
+ }
+ }
+ dprintk("%s: after merging\n", __func__);
+ print_clist(clist, bl->bl_count);
+}
+
static void print_bl_extent(struct pnfs_block_extent *be)
{
dprintk("PRINT EXTENT extent %p\n", be);
@@ -545,6 +612,34 @@ find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
return ret;
}

+/* Similar to find_get_extent, but called with lock held, and ignores cow */
+static struct pnfs_block_extent *
+find_get_extent_locked(struct pnfs_block_layout *bl, sector_t isect)
+{
+ struct pnfs_block_extent *be, *ret = NULL;
+ int i;
+
+ dprintk("%s enter with isect %llu\n", __func__, (u64)isect);
+ for (i = 0; i < EXTENT_LISTS; i++) {
+ if (ret)
+ break;
+ list_for_each_entry_reverse(be, &bl->bl_extents[i], be_node) {
+ if (isect >= be->be_f_offset + be->be_length)
+ break;
+ if (isect >= be->be_f_offset) {
+ /* We have found an extent */
+ dprintk("%s Get %p (%i)\n", __func__, be,
+ atomic_read(&be->be_refcnt.refcount));
+ kref_get(&be->be_refcnt);
+ ret = be;
+ break;
+ }
+ }
+ }
+ print_bl_extent(ret);
+ return ret;
+}
+
int
encode_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
struct xdr_stream *xdr,
@@ -635,3 +730,117 @@ _front_merge(struct pnfs_block_extent *be, struct list_head *head,
kfree(storage);
return be;
}
+
+static u64
+set_to_rw(struct pnfs_block_layout *bl, u64 offset, u64 length)
+{
+ u64 rv = offset + length;
+ struct pnfs_block_extent *be, *e1, *e2, *e3, *new, *old;
+ struct pnfs_block_extent *children[3];
+ struct pnfs_block_extent *merge1 = NULL, *merge2 = NULL;
+ int i = 0, j;
+
+ dprintk("%s(%llu, %llu)\n", __func__, offset, length);
+ /* Create storage for up to three new extents e1, e2, e3 */
+ e1 = kmalloc(sizeof(*e1), GFP_KERNEL);
+ e2 = kmalloc(sizeof(*e2), GFP_KERNEL);
+ e3 = kmalloc(sizeof(*e3), GFP_KERNEL);
+ /* BUG - we are ignoring any failure */
+ if (!e1 || !e2 || !e3)
+ goto out_nosplit;
+
+ spin_lock(&bl->bl_ext_lock);
+ be = find_get_extent_locked(bl, offset);
+ rv = be->be_f_offset + be->be_length;
+ if (be->be_state != PNFS_BLOCK_INVALID_DATA) {
+ spin_unlock(&bl->bl_ext_lock);
+ goto out_nosplit;
+ }
+ /* Add e* to children, bumping e*'s krefs */
+ if (be->be_f_offset != offset) {
+ _prep_new_extent(e1, be, be->be_f_offset,
+ offset - be->be_f_offset,
+ PNFS_BLOCK_INVALID_DATA);
+ children[i++] = e1;
+ print_bl_extent(e1);
+ } else
+ merge1 = e1;
+ _prep_new_extent(e2, be, offset,
+ min(length, be->be_f_offset + be->be_length - offset),
+ PNFS_BLOCK_READWRITE_DATA);
+ children[i++] = e2;
+ print_bl_extent(e2);
+ if (offset + length < be->be_f_offset + be->be_length) {
+ _prep_new_extent(e3, be, e2->be_f_offset + e2->be_length,
+ be->be_f_offset + be->be_length -
+ offset - length,
+ PNFS_BLOCK_INVALID_DATA);
+ children[i++] = e3;
+ print_bl_extent(e3);
+ } else
+ merge2 = e3;
+
+ /* Remove be from list, and insert the e* */
+ /* We don't get refs on e*, since this list is the base reference
+ * set when init'ed.
+ */
+ if (i < 3)
+ children[i] = NULL;
+ new = children[0];
+ list_replace(&be->be_node, &new->be_node);
+ put_extent(be);
+ new = _front_merge(new, &bl->bl_extents[RW_EXTENT], merge1);
+ for (j = 1; j < i; j++) {
+ old = new;
+ new = children[j];
+ list_add(&new->be_node, &old->be_node);
+ }
+ if (merge2) {
+ /* This is a HACK, should just create a _back_merge function */
+ new = list_entry(new->be_node.next,
+ struct pnfs_block_extent, be_node);
+ new = _front_merge(new, &bl->bl_extents[RW_EXTENT], merge2);
+ }
+ spin_unlock(&bl->bl_ext_lock);
+
+ /* Since we removed the base reference above, be is now scheduled for
+ * destruction.
+ */
+ put_extent(be);
+ dprintk("%s returns %llu after split\n", __func__, rv);
+ return rv;
+
+ out_nosplit:
+ kfree(e1);
+ kfree(e2);
+ kfree(e3);
+ dprintk("%s returns %llu without splitting\n", __func__, rv);
+ return rv;
+}
+
+void
+clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
+ const struct nfs4_layoutcommit_args *arg,
+ int status)
+{
+ struct pnfs_block_short_extent *lce, *save;
+
+ dprintk("%s status %d\n", __func__, status);
+ list_for_each_entry_safe_reverse(lce, save, &bl->bl_committing, bse_node) {
+ if (likely(!status)) {
+ u64 offset = lce->bse_f_offset;
+ u64 end = offset + lce->bse_length;
+
+ do {
+ offset = set_to_rw(bl, offset, end - offset);
+ } while (offset < end);
+ list_del(&lce->bse_node);
+
+ kfree(lce);
+ } else {
+ spin_lock(&bl->bl_ext_lock);
+ add_to_commitlist(bl, lce);
+ spin_unlock(&bl->bl_ext_lock);
+ }
+ }
+}
--
1.7.4.1
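
The core of set_to_rw() in the patch above is splitting one INVALID extent into up to three pieces around the committed range: an optional leading INVALID piece, the READWRITE piece, and an optional trailing INVALID piece. It returns the end of the extent it touched so the caller can loop. The following is only a userspace sketch of that arithmetic, assuming plain value structs instead of refcounted list entries and leaving out the locking and merging. The names range, split_to_rw and the state enum are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum state { INVALID_DATA, READWRITE_DATA };

struct range {
	uint64_t off;	/* first sector */
	uint64_t len;	/* number of sectors */
	enum state st;
};

/* Split src around the request [offset, offset + length).
 * Writes up to three pieces into out[] and returns how many were written.
 * *next receives the end of src, which is where a caller would continue
 * if the request extends past this extent.
 * If src is not INVALID_DATA, nothing is split and 0 is returned.
 */
static int split_to_rw(const struct range *src, uint64_t offset,
		       uint64_t length, struct range out[3], uint64_t *next)
{
	uint64_t src_end = src->off + src->len;
	uint64_t req_end = offset + length;
	uint64_t rw_end = req_end < src_end ? req_end : src_end;
	int n = 0;

	/* The request must start inside src. */
	assert(offset >= src->off && offset < src_end);
	*next = src_end;
	if (src->st != INVALID_DATA)
		return 0;
	/* Leading piece stays invalid. */
	if (offset > src->off)
		out[n++] = (struct range){ src->off, offset - src->off,
					   INVALID_DATA };
	/* Middle piece becomes read-write, clipped to the end of src. */
	out[n++] = (struct range){ offset, rw_end - offset, READWRITE_DATA };
	/* Trailing piece stays invalid. */
	if (rw_end < src_end)
		out[n++] = (struct range){ rw_end, src_end - rw_end,
					   INVALID_DATA };
	return n;
}
```

A caller would repeat the split with offset set to *next until it reaches the end of the committed range, which corresponds to the do/while loop in clean_pnfs_block_layoutupdate().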


2011-06-14 02:32:43

by Jim Rees

Subject: [PATCH 16/33] pnfsblock: merge extents

From: Fred Isaman <[email protected]>

Replace a stub, so that extents underlying the layouts are properly
added, merged, or ignored as necessary.

Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: delete the new node before put it]
Signed-off-by: Mingyang Guo <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 14 +++++-
fs/nfs/blocklayout/extents.c | 106 ++++++++++++++++++++++++++++++++++++++
2 files changed, 119 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 5c1ccc1..6705b10 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -77,6 +77,14 @@ enum extentclass4 {
EXTENT_LISTS = 2,
};

+static inline int choose_list(enum exstate4 state)
+{
+ if (state == PNFS_BLOCK_READ_DATA || state == PNFS_BLOCK_NONE_DATA)
+ return RO_EXTENT;
+ else
+ return RW_EXTENT;
+}
+
struct pnfs_block_layout {
struct pnfs_layout_hdr bl_layout;
struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
@@ -109,6 +117,11 @@ int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
/* blocklayoutdm.c */
void free_block_dev(struct pnfs_block_dev *bdev);
+/* extents.c */
+void put_extent(struct pnfs_block_extent *be);
+struct pnfs_block_extent *alloc_extent(void);
+int add_and_merge_extent(struct pnfs_block_layout *bl,
+ struct pnfs_block_extent *new);

#include <linux/sunrpc/simple_rpc_pipefs.h>

@@ -124,5 +137,4 @@ void bl_pipe_exit(void);
#define BL_DEVICE_REQUEST_PROC 0x1 /* User level process succeeds */
#define BL_DEVICE_REQUEST_ERR 0x2 /* User level process fails */

-void put_extent(struct pnfs_block_extent *be);
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 1283fa9..26c263f 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -95,3 +95,109 @@ void print_elist(struct list_head *list)
}
dprintk("****************\n");
}
+
+static inline int
+extents_consistent(struct pnfs_block_extent *old, struct pnfs_block_extent *new)
+{
+ /* Note this assumes new->be_f_offset >= old->be_f_offset */
+ return (new->be_state == old->be_state) &&
+ ((new->be_state == PNFS_BLOCK_NONE_DATA) ||
+ ((new->be_v_offset - old->be_v_offset ==
+ new->be_f_offset - old->be_f_offset) &&
+ new->be_mdev == old->be_mdev));
+}
+
+/* Adds new to appropriate list in bl, modifying new and removing existing
+ * extents as appropriate to deal with overlaps.
+ *
+ * See find_get_extent for list constraints.
+ *
+ * Refcount on new is already set. If end up not using it, or error out,
+ * need to put the reference.
+ *
+ * Lock is held by caller.
+ */
+int
+add_and_merge_extent(struct pnfs_block_layout *bl,
+ struct pnfs_block_extent *new)
+{
+ struct pnfs_block_extent *be, *tmp;
+ sector_t end = new->be_f_offset + new->be_length;
+ struct list_head *list;
+
+ dprintk("%s enter with be=%p\n", __func__, new);
+ print_bl_extent(new);
+ list = &bl->bl_extents[choose_list(new->be_state)];
+ print_elist(list);
+
+ /* Scan for proper place to insert, extending new to the left
+ * as much as possible.
+ */
+ list_for_each_entry_safe(be, tmp, list, be_node) {
+ if (new->be_f_offset < be->be_f_offset)
+ break;
+ if (end <= be->be_f_offset + be->be_length) {
+ /* new is a subset of existing be*/
+ if (extents_consistent(be, new)) {
+ dprintk("%s: new is subset, ignoring\n",
+ __func__);
+ put_extent(new);
+ return 0;
+ } else
+ goto out_err;
+ } else if (new->be_f_offset <=
+ be->be_f_offset + be->be_length) {
+ /* new overlaps or abuts existing be */
+ if (extents_consistent(be, new)) {
+ /* extend new to fully replace be */
+ new->be_length += new->be_f_offset -
+ be->be_f_offset;
+ new->be_f_offset = be->be_f_offset;
+ new->be_v_offset = be->be_v_offset;
+ dprintk("%s: removing %p\n", __func__, be);
+ list_del(&be->be_node);
+ put_extent(be);
+ } else if (new->be_f_offset !=
+ be->be_f_offset + be->be_length)
+ goto out_err;
+ }
+ }
+ /* Note that if we never hit the above break, be will not point to a
+ * valid extent. However, in that case &be->be_node==list.
+ */
+ list_add_tail(&new->be_node, &be->be_node);
+ dprintk("%s: inserting new\n", __func__);
+ print_elist(list);
+ /* Scan forward for overlaps. If we find any, extend new and
+ * remove the overlapped extent.
+ */
+ be = list_prepare_entry(new, list, be_node);
+ list_for_each_entry_safe_continue(be, tmp, list, be_node) {
+ if (end < be->be_f_offset)
+ break;
+ /* new overlaps or abuts existing be */
+ if (extents_consistent(be, new)) {
+ if (end < be->be_f_offset + be->be_length) {
+ /* extend new to fully cover be */
+ end = be->be_f_offset + be->be_length;
+ new->be_length = end - new->be_f_offset;
+ }
+ dprintk("%s: removing %p\n", __func__, be);
+ list_del(&be->be_node);
+ put_extent(be);
+ } else if (end != be->be_f_offset) {
+ list_del(&new->be_node);
+ goto out_err;
+ }
+ }
+ dprintk("%s: after merging\n", __func__);
+ print_elist(list);
+ /* STUB - The per-list consistency checks have all been done,
+ * should now check cross-list consistency.
+ */
+ return 0;
+
+ out_err:
+ put_extent(new);
+ return -EIO;
+}
--
1.7.4.1
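
The merge in add_and_merge_extent() above comes down to inserting a range into a sorted list and absorbing any neighbours that overlap or abut it. A rough userspace sketch of just that interval logic follows, assuming a plain sorted array instead of a kernel list. It deliberately omits the extents_consistent() checks (state, device, and volume offset) and the error path, so every overlapping span is merged unconditionally. The names span and insert_and_merge are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct span {
	uint64_t off;	/* first sector */
	uint64_t len;	/* number of sectors */
};

/* Insert add into v[0..n), which must be sorted by offset and free of
 * overlapping or abutting spans. Every span that overlaps or abuts add is
 * folded into it. The array must have room for n + 1 entries.
 * Returns the new number of spans.
 */
static size_t insert_and_merge(struct span *v, size_t n, struct span add)
{
	uint64_t start = add.off;
	uint64_t end = add.off + add.len;
	size_t lo = 0, hi;

	/* Skip spans that end strictly before add starts. */
	while (lo < n && v[lo].off + v[lo].len < start)
		lo++;
	/* Absorb every span that overlaps or abuts [start, end]. */
	hi = lo;
	while (hi < n && v[hi].off <= end) {
		if (v[hi].off < start)
			start = v[hi].off;
		if (v[hi].off + v[hi].len > end)
			end = v[hi].off + v[hi].len;
		hi++;
	}
	/* Replace v[lo..hi) with the single merged span. */
	memmove(&v[lo + 1], &v[hi], (n - hi) * sizeof(*v));
	v[lo].off = start;
	v[lo].len = end - start;
	return n - (hi - lo) + 1;
}
```

The patch does the same work in two passes over a linked list (extend left, insert, then extend right) and drops references on the extents it absorbs; the array version trades that for a single scan and a memmove().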


2011-06-14 15:52:12

by Benny Halevy

Subject: Re: [PATCH 10/33] pnfsblock: add support for simple rpc pipefs

On 2011-06-13 22:32, Jim Rees wrote:
> From: Fred Isaman <[email protected]>
>
> Signed-off-by: Eric Anderle <[email protected]>
> Signed-off-by: Jim Rees <[email protected]>
> Signed-off-by: Benny Halevy <[email protected]>
> move include lines out of include file
> Signed-off-by: Jim Rees <[email protected]>
> [This patch does *not* break the header's independence]
> Signed-off-by: Boaz Harrosh <[email protected]>
> Signed-off-by: Benny Halevy <[email protected]>
> ---
> include/linux/sunrpc/simple_rpc_pipefs.h | 105 ++++++++
> net/sunrpc/simple_rpc_pipefs.c | 423 ++++++++++++++++++++++++++++++
> 2 files changed, 528 insertions(+), 0 deletions(-)
> create mode 100644 include/linux/sunrpc/simple_rpc_pipefs.h
> create mode 100644 net/sunrpc/simple_rpc_pipefs.c

What happened to this hunk?

diff --git b/net/sunrpc/Makefile a/net/sunrpc/Makefile
index 9d2fca5..e102040 100644
--- b/net/sunrpc/Makefile
+++ a/net/sunrpc/Makefile
@@ -12,7 +12,7 @@ sunrpc-y := clnt.o xprt.o socklib.o xprtsock.o sched.o \
svc.o svcsock.o svcauth.o svcauth_unix.o \
addr.o rpcb_clnt.o timer.o xdr.o \
sunrpc_syms.o cache.o rpc_pipe.o \
- svc_xprt.o
+ svc_xprt.o simple_rpc_pipefs.o
sunrpc-$(CONFIG_NFS_V4_1) += backchannel_rqst.o bc_svc.o
sunrpc-$(CONFIG_PROC_FS) += stats.o
sunrpc-$(CONFIG_SYSCTL) += sysctl.o

Benny

>
> diff --git a/include/linux/sunrpc/simple_rpc_pipefs.h b/include/linux/sunrpc/simple_rpc_pipefs.h
> new file mode 100644
> index 0000000..f6a1227
> --- /dev/null
> +++ b/include/linux/sunrpc/simple_rpc_pipefs.h
> @@ -0,0 +1,105 @@
> +/*
> + * Copyright (c) 2008 The Regents of the University of Michigan.
> + * All rights reserved.
> + *
> + * David M. Richter <[email protected]>
> + *
> + * Drawing on work done by Andy Adamson <[email protected]> and
> + * Marius Eriksen <[email protected]>. Thanks for the help over the
> + * years, guys.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + *
> + * 1. Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + * notice, this list of conditions and the following disclaimer in the
> + * documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of the University nor the names of its
> + * contributors may be used to endorse or promote products derived
> + * from this software without specific prior written permission.
> + *
> + * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
> + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
> + * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
> + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
> + * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
> + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + *
> + * With thanks to CITI's project sponsor and partner, IBM.
> + */
> +
> +#ifndef _SIMPLE_RPC_PIPEFS_H_
> +#define _SIMPLE_RPC_PIPEFS_H_
> +
> +#include <linux/sunrpc/rpc_pipe_fs.h>
> +
> +#define payload_of(headerp) ((void *)(headerp + 1))
> +
> +/*
> + * struct pipefs_hdr -- the generic message format for simple_rpc_pipefs.
> + * Messages may simply be the header itself, although having an optional
> + * data payload follow the header allows much more flexibility.
> + *
> + * Messages are created using pipefs_alloc_init_msg() and
> + * pipefs_alloc_init_msg_padded(), both of which accept a pointer to an
> + * (optional) data payload.
> + *
> + * Given a struct pipefs_hdr *msg that has a struct foo payload, the data
> + * can be accessed using: struct foo *foop = payload_of(msg)
> + */
> +struct pipefs_hdr {
> + u32 msgid;
> + u8 type;
> + u8 flags;
> + u16 totallen; /* length of entire message, including hdr itself */
> + u32 status;
> +};
> +
> +/*
> + * struct pipefs_list -- a type of list used for tracking callers who've made an
> + * upcall and are blocked waiting for a reply.
> + *
> + * See pipefs_queue_upcall_waitreply() and pipefs_assign_upcall_reply().
> + */
> +struct pipefs_list {
> + struct list_head list;
> + spinlock_t list_lock;
> +};
> +
> +
> +/* See net/sunrpc/simple_rpc_pipefs.c for more info on using these functions. */
> +extern struct dentry *pipefs_mkpipe(const char *name,
> + const struct rpc_pipe_ops *ops,
> + int wait_for_open);
> +extern void pipefs_closepipe(struct dentry *pipe);
> +extern void pipefs_init_list(struct pipefs_list *list);
> +extern struct pipefs_hdr *pipefs_alloc_init_msg(u32 msgid, u8 type, u8 flags,
> + void *data, u16 datalen);
> +extern struct pipefs_hdr *pipefs_alloc_init_msg_padded(u32 msgid, u8 type,
> + u8 flags, void *data,
> + u16 datalen, u16 padlen);
> +extern struct pipefs_hdr *pipefs_queue_upcall_waitreply(struct dentry *pipe,
> + struct pipefs_hdr *msg,
> + struct pipefs_list
> + *uplist, u8 upflags,
> + u32 timeout);
> +extern int pipefs_queue_upcall_noreply(struct dentry *pipe,
> + struct pipefs_hdr *msg, u8 upflags);
> +extern int pipefs_assign_upcall_reply(struct pipefs_hdr *reply,
> + struct pipefs_list *uplist);
> +extern struct pipefs_hdr *pipefs_readmsg(struct file *filp,
> + const char __user *src, size_t len);
> +extern ssize_t pipefs_generic_upcall(struct file *filp,
> + struct rpc_pipe_msg *rpcmsg,
> + char __user *dst, size_t buflen);
> +extern void pipefs_generic_destroy_msg(struct rpc_pipe_msg *rpcmsg);
> +
> +#endif /* _SIMPLE_RPC_PIPEFS_H_ */
> diff --git a/net/sunrpc/simple_rpc_pipefs.c b/net/sunrpc/simple_rpc_pipefs.c
> new file mode 100644
> index 0000000..24af0a1
> --- /dev/null
> +++ b/net/sunrpc/simple_rpc_pipefs.c
> @@ -0,0 +1,423 @@
> +/*
> + * net/sunrpc/simple_rpc_pipefs.c
> + *
> + * Copyright (c) 2008 The Regents of the University of Michigan.
> + * All rights reserved.
> + *
> + * David M. Richter <[email protected]>
> + *
> + * Drawing on work done by Andy Adamson <[email protected]> and
> + * Marius Eriksen <[email protected]>. Thanks for the help over the
> + * years, guys.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + *
> + * 1. Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + * notice, this list of conditions and the following disclaimer in the
> + * documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of the University nor the names of its
> + * contributors may be used to endorse or promote products derived
> + * from this software without specific prior written permission.
> + *
> + * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
> + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
> + * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
> + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
> + * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
> + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + *
> + * With thanks to CITI's project sponsor and partner, IBM.
> + */
> +
> +#include <linux/mount.h>
> +#include <linux/sunrpc/clnt.h>
> +#include <linux/sunrpc/simple_rpc_pipefs.h>
> +
> +
> +/*
> + * Make an rpc_pipefs pipe named @name at the root of the mounted rpc_pipefs
> + * filesystem.
> + *
> + * If @wait_for_open is non-zero and an upcall is later queued but the userland
> + * end of the pipe has not yet been opened, the upcall will remain queued until
> + * the pipe is opened; otherwise, the upcall queueing will return with -EPIPE.
> + */
> +struct dentry *pipefs_mkpipe(const char *name, const struct rpc_pipe_ops *ops,
> + int wait_for_open)
> +{
> + struct dentry *dir, *pipe;
> + struct vfsmount *mnt;
> +
> + mnt = rpc_get_mount();
> + if (IS_ERR(mnt)) {
> + pipe = ERR_CAST(mnt);
> + goto out;
> + }
> + dir = mnt->mnt_root;
> + if (!dir) {
> + pipe = ERR_PTR(-ENOENT);
> + goto out;
> + }
> + pipe = rpc_mkpipe(dir, name, NULL, ops,
> + wait_for_open ? RPC_PIPE_WAIT_FOR_OPEN : 0);
> +out:
> + return pipe;
> +}
> +EXPORT_SYMBOL(pipefs_mkpipe);
> +
> +/*
> + * Shutdown a pipe made by pipefs_mkpipe().
> + * XXX: do we need to retain an extra reference on the mount?
> + */
> +void pipefs_closepipe(struct dentry *pipe)
> +{
> + rpc_unlink(pipe);
> + rpc_put_mount();
> +}
> +EXPORT_SYMBOL(pipefs_closepipe);
> +
> +/*
> + * Initialize a struct pipefs_list -- which are a way to keep track of callers
> + * who're blocked having made an upcall and are awaiting a reply.
> + *
> + * See pipefs_queue_upcall_waitreply() and pipefs_find_upcall_msgid() for how
> + * to use them.
> + */
> +inline void pipefs_init_list(struct pipefs_list *list)
> +{
> + INIT_LIST_HEAD(&list->list);
> + spin_lock_init(&list->list_lock);
> +}
> +EXPORT_SYMBOL(pipefs_init_list);
> +
> +/*
> + * Alloc/init a generic pipefs message header and copy into its message body
> + * an arbitrary data payload.
> + *
> + * struct pipefs_hdr's are meant to serve as generic, general-purpose message
> + * headers for easy rpc_pipefs I/O. When an upcall is made, the
> + * struct pipefs_hdr is assigned to a struct rpc_pipe_msg and delivered
> + * therein. --And yes, the naming can seem a little confusing at first:
> + *
> + * When one thinks of an upcall "message", in simple_rpc_pipefs that's a
> + * struct pipefs_hdr (possibly with an attached message body). A
> + * struct rpc_pipe_msg is actually only the -vehicle- by which the "real"
> + * message is delivered and processed.
> + */
> +struct pipefs_hdr *pipefs_alloc_init_msg_padded(u32 msgid, u8 type, u8 flags,
> + void *data, u16 datalen, u16 padlen)
> +{
> + u16 totallen;
> + struct pipefs_hdr *msg = NULL;
> +
> + totallen = sizeof(*msg) + datalen + padlen;
> + if (totallen > PAGE_SIZE) {
> + msg = ERR_PTR(-E2BIG);
> + goto out;
> + }
> +
> + msg = kzalloc(totallen, GFP_KERNEL);
> + if (!msg) {
> + msg = ERR_PTR(-ENOMEM);
> + goto out;
> + }
> +
> + msg->msgid = msgid;
> + msg->type = type;
> + msg->flags = flags;
> + msg->totallen = totallen;
> + memcpy(payload_of(msg), data, datalen);
> +out:
> + return msg;
> +}
> +EXPORT_SYMBOL(pipefs_alloc_init_msg_padded);
> +
> +/*
> + * See the description of pipefs_alloc_init_msg_padded().
> + */
> +struct pipefs_hdr *pipefs_alloc_init_msg(u32 msgid, u8 type, u8 flags,
> + void *data, u16 datalen)
> +{
> + return pipefs_alloc_init_msg_padded(msgid, type, flags, data,
> + datalen, 0);
> +}
> +EXPORT_SYMBOL(pipefs_alloc_init_msg);
> +
> +
> +static void pipefs_init_rpcmsg(struct rpc_pipe_msg *rpcmsg,
> + struct pipefs_hdr *msg, u8 upflags)
> +{
> + memset(rpcmsg, 0, sizeof(*rpcmsg));
> + rpcmsg->data = msg;
> + rpcmsg->len = msg->totallen;
> + rpcmsg->flags = upflags;
> +}
> +
> +static struct rpc_pipe_msg *pipefs_alloc_init_rpcmsg(struct pipefs_hdr *msg,
> + u8 upflags)
> +{
> + struct rpc_pipe_msg *rpcmsg;
> +
> + rpcmsg = kmalloc(sizeof(*rpcmsg), GFP_KERNEL);
> + if (!rpcmsg)
> + return ERR_PTR(-ENOMEM);
> +
> + pipefs_init_rpcmsg(rpcmsg, msg, upflags);
> + return rpcmsg;
> +}
> +
> +
> +/* represents an upcall that'll block and wait for a reply */
> +struct pipefs_upcall {
> + u32 msgid;
> + struct rpc_pipe_msg rpcmsg;
> + struct list_head list;
> + wait_queue_head_t waitq;
> + struct pipefs_hdr *reply;
> +};
> +
> +
> +static void pipefs_init_upcall_waitreply(struct pipefs_upcall *upcall,
> + struct pipefs_hdr *msg, u8 upflags)
> +{
> + upcall->reply = NULL;
> + upcall->msgid = msg->msgid;
> + INIT_LIST_HEAD(&upcall->list);
> + init_waitqueue_head(&upcall->waitq);
> + pipefs_init_rpcmsg(&upcall->rpcmsg, msg, upflags);
> +}
> +
> +static int __pipefs_queue_upcall_waitreply(struct dentry *pipe,
> + struct pipefs_upcall *upcall,
> + struct pipefs_list *uplist,
> + u32 timeout)
> +{
> + int err = 0;
> + DECLARE_WAITQUEUE(wq, current);
> +
> + add_wait_queue(&upcall->waitq, &wq);
> + spin_lock(&uplist->list_lock);
> + list_add(&upcall->list, &uplist->list);
> + spin_unlock(&uplist->list_lock);
> +
> + err = rpc_queue_upcall(pipe->d_inode, &upcall->rpcmsg);
> + if (err < 0)
> + goto out;
> +
> + if (timeout) {
> + /* retval of 0 means timer expired */
> + err = schedule_timeout_uninterruptible(timeout);
> + if (err == 0 && upcall->reply == NULL)
> + err = -ETIMEDOUT;
> + } else {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + schedule();
> + __set_current_state(TASK_RUNNING);
> + }
> +
> +out:
> + spin_lock(&uplist->list_lock);
> + list_del_init(&upcall->list);
> + spin_unlock(&uplist->list_lock);
> + remove_wait_queue(&upcall->waitq, &wq);
> + return err;
> +}
> +
> +/*
> + * Queue a pipefs msg for an upcall to userspace, place the calling thread
> + * on @uplist, and block the thread to wait for a reply. If @timeout is
> + * nonzero, the thread will be blocked for at most @timeout jiffies.
> + *
> + * (To convert time units into jiffies, consider the functions
> + * msecs_to_jiffies(), usecs_to_jiffies(), timeval_to_jiffies(), and
> + * timespec_to_jiffies().)
> + *
> + * Once a reply is received by your downcall handler, call
> + * pipefs_assign_upcall_reply() with @uplist to find the corresponding upcall,
> + * assign the reply, and wake the waiting thread.
> + *
> + * This function's return value pointer may be an error and should be checked
> + * with IS_ERR() before attempting to access the reply message.
> + *
> + * Callers are responsible for freeing @msg, unless pipefs_generic_destroy_msg()
> + * is used as the ->destroy_msg() callback and the PIPEFS_AUTOFREE_UPCALL_MSG
> + * flag is set in @upflags. See also rpc_pipe_fs.h.
> + */
> +struct pipefs_hdr *pipefs_queue_upcall_waitreply(struct dentry *pipe,
> + struct pipefs_hdr *msg,
> + struct pipefs_list *uplist,
> + u8 upflags, u32 timeout)
> +{
> + int err = 0;
> + struct pipefs_upcall upcall;
> +
> + pipefs_init_upcall_waitreply(&upcall, msg, upflags);
> + err = __pipefs_queue_upcall_waitreply(pipe, &upcall, uplist, timeout);
> + if (err < 0) {
> + kfree(upcall.reply);
> + upcall.reply = ERR_PTR(err);
> + }
> +
> + return upcall.reply;
> +}
> +EXPORT_SYMBOL(pipefs_queue_upcall_waitreply);
> +
> +/*
> + * Queue a pipefs msg for an upcall to userspace and immediately return (i.e.,
> + * no reply is expected).
> + *
> + * Callers are responsible for freeing @msg, unless pipefs_generic_destroy_msg()
> + * is used as the ->destroy_msg() callback and the PIPEFS_AUTOFREE_UPCALL_MSG
> + * flag is set in @upflags. See also rpc_pipe_fs.h.
> + */
> +int pipefs_queue_upcall_noreply(struct dentry *pipe, struct pipefs_hdr *msg,
> + u8 upflags)
> +{
> + int err = 0;
> + struct rpc_pipe_msg *rpcmsg;
> +
> + upflags |= PIPEFS_AUTOFREE_RPCMSG;
> + rpcmsg = pipefs_alloc_init_rpcmsg(msg, upflags);
> + if (IS_ERR(rpcmsg)) {
> + err = PTR_ERR(rpcmsg);
> + goto out;
> + }
> + err = rpc_queue_upcall(pipe->d_inode, rpcmsg);
> +out:
> + return err;
> +}
> +EXPORT_SYMBOL(pipefs_queue_upcall_noreply);
> +
> +
> +static struct pipefs_upcall *pipefs_find_upcall_msgid(u32 msgid,
> + struct pipefs_list *uplist)
> +{
> + struct pipefs_upcall *upcall;
> +
> + spin_lock(&uplist->list_lock);
> + list_for_each_entry(upcall, &uplist->list, list)
> + if (upcall->msgid == msgid)
> + goto out;
> + upcall = NULL;
> +out:
> + spin_unlock(&uplist->list_lock);
> + return upcall;
> +}
> +
> +/*
> + * In your rpc_pipe_ops->downcall() handler, once you've read in a downcall
> + * message and have determined that it is a reply to a waiting upcall,
> + * you can use this function to find the appropriate upcall, assign the result,
> + * and wake the upcall thread.
> + *
> + * The reply message must have the same msgid as the original upcall message's.
> + *
> + * See also pipefs_queue_upcall_waitreply() and pipefs_readmsg().
> + */
> +int pipefs_assign_upcall_reply(struct pipefs_hdr *reply,
> + struct pipefs_list *uplist)
> +{
> + int err = 0;
> + struct pipefs_upcall *upcall;
> +
> + upcall = pipefs_find_upcall_msgid(reply->msgid, uplist);
> + if (!upcall) {
> + printk(KERN_ERR "%s: ERROR: have reply but no matching upcall "
> + "for msgid %d\n", __func__, reply->msgid);
> + err = -ENOENT;
> + goto out;
> + }
> + upcall->reply = reply;
> + wake_up(&upcall->waitq);
> +out:
> + return err;
> +}
> +EXPORT_SYMBOL(pipefs_assign_upcall_reply);
> +
> +/*
> + * Generic method to read-in and return a newly-allocated message which begins
> + * with a struct pipefs_hdr.
> + */
> +struct pipefs_hdr *pipefs_readmsg(struct file *filp, const char __user *src,
> + size_t len)
> +{
> + int err = 0, hdrsize;
> + struct pipefs_hdr *msg = NULL;
> +
> + hdrsize = sizeof(*msg);
> + if (len < hdrsize) {
> + printk(KERN_ERR "%s: ERROR: header is too short (%d vs %d)\n",
> + __func__, (int) len, hdrsize);
> + err = -EINVAL;
> + goto out;
> + }
> +
> + msg = kzalloc(len, GFP_KERNEL);
> + if (!msg) {
> + err = -ENOMEM;
> + goto out;
> + }
> + if (copy_from_user(msg, src, len))
> + err = -EFAULT;
> +out:
> + if (err) {
> + kfree(msg);
> + msg = ERR_PTR(err);
> + }
> + return msg;
> +}
> +EXPORT_SYMBOL(pipefs_readmsg);
> +
> +/*
> + * Generic rpc_pipe_ops->upcall() handler implementation.
> + *
> + * Don't call this directly: to make an upcall, use
> + * pipefs_queue_upcall_waitreply() or pipefs_queue_upcall_noreply().
> + */
> +ssize_t pipefs_generic_upcall(struct file *filp, struct rpc_pipe_msg *rpcmsg,
> + char __user *dst, size_t buflen)
> +{
> + char *data;
> + ssize_t len, left;
> +
> + data = (char *)rpcmsg->data + rpcmsg->copied;
> + len = rpcmsg->len - rpcmsg->copied;
> + if (len > buflen)
> + len = buflen;
> +
> + left = copy_to_user(dst, data, len);
> + if (len > 0 && left == len) {
> + rpcmsg->errno = -EFAULT;
> + return -EFAULT;
> + }
> +
> + len -= left;
> + rpcmsg->copied += len;
> + rpcmsg->errno = 0;
> + return len;
> +}
> +EXPORT_SYMBOL(pipefs_generic_upcall);
> +
> +/*
> + * Generic rpc_pipe_ops->destroy_msg() handler implementation.
> + *
> + * Items are only freed if @rpcmsg->flags has been set appropriately.
> + * See pipefs_queue_upcall_noreply() and rpc_pipe_fs.h.
> + */
> +void pipefs_generic_destroy_msg(struct rpc_pipe_msg *rpcmsg)
> +{
> + if (rpcmsg->flags & PIPEFS_AUTOFREE_UPCALL_MSG)
> + kfree(rpcmsg->data);
> + if (rpcmsg->flags & PIPEFS_AUTOFREE_RPCMSG)
> + kfree(rpcmsg);
> +}
> +EXPORT_SYMBOL(pipefs_generic_destroy_msg);
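For readers following the upcall path: the chunked copy done by pipefs_generic_upcall() can be modelled in userspace. This is only a rough sketch, assuming a hypothetical struct demo_msg in place of struct rpc_pipe_msg and memcpy() in place of copy_to_user(); none of the demo_ names appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct rpc_pipe_msg (illustration only). */
struct demo_msg {
	const char *data;	/* message payload */
	size_t len;		/* total payload length */
	size_t copied;		/* bytes already handed to the reader */
};

/*
 * Copy the next chunk of msg into dst, at most buflen bytes.
 * Returns the number of bytes copied; 0 means the message is drained.
 */
static size_t demo_read_chunk(struct demo_msg *msg, char *dst, size_t buflen)
{
	size_t len;

	assert(msg->copied <= msg->len);
	len = msg->len - msg->copied;
	if (len > buflen)
		len = buflen;
	memcpy(dst, msg->data + msg->copied, len);
	msg->copied += len;
	return len;
}
```

A reader with a buffer smaller than the message simply calls this repeatedly until it returns 0, which is the behaviour the copied field exists to support.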

2011-06-14 02:32:08

by Jim Rees

Subject: [PATCH 01/33] pnfs: GETDEVICELIST

From: Andy Adamson <[email protected]>

The block driver uses GETDEVICELIST

Signed-off-by: Andy Adamson <[email protected]>
[pass struct nfs_server * to getdevicelist]
[get machine creds for getdevicelist]
[fix getdevicelist decode sizing]
Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/nfs4proc.c | 48 ++++++++++++++++++
fs/nfs/nfs4xdr.c | 128 +++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/pnfs.h | 12 ++++
include/linux/nfs4.h | 1 +
include/linux/nfs_xdr.h | 11 ++++
5 files changed, 200 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 3ca4497..ec580a8 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5759,6 +5759,54 @@ int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp)
return status;
}

+/*
+ * Retrieve the list of Data Server devices from the MDS.
+ */
+static int _nfs4_getdevicelist(struct nfs_server *server,
+ const struct nfs_fh *fh,
+ struct pnfs_devicelist *devlist)
+{
+ struct nfs4_getdevicelist_args args = {
+ .fh = fh,
+ .layoutclass = server->pnfs_curr_ld->id,
+ };
+ struct nfs4_getdevicelist_res res = {
+ .devlist = devlist,
+ };
+ struct rpc_message msg = {
+ .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_GETDEVICELIST],
+ .rpc_argp = &args,
+ .rpc_resp = &res,
+ };
+ int status;
+
+ dprintk("--> %s\n", __func__);
+ status = nfs4_call_sync(server->client, server, &msg, &args.seq_args,
+ &res.seq_res, 0);
+ dprintk("<-- %s status=%d\n", __func__, status);
+ return status;
+}
+
+int nfs4_proc_getdevicelist(struct nfs_server *server,
+ const struct nfs_fh *fh,
+ struct pnfs_devicelist *devlist)
+{
+ struct nfs4_exception exception = { };
+ int err;
+
+ do {
+ err = nfs4_handle_exception(server,
+ _nfs4_getdevicelist(server, fh, devlist),
+ &exception);
+ } while (exception.retry);
+
+ dprintk("%s: err=%d, num_devs=%u\n", __func__,
+ err, devlist->num_devs);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(nfs4_proc_getdevicelist);
+
static int
_nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
{
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index c4b7d6c..60e9d44 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -314,6 +314,17 @@ static int nfs4_stat_to_errno(int);
XDR_QUADLEN(NFS4_MAX_SESSIONID_LEN) + 5)
#define encode_reclaim_complete_maxsz (op_encode_hdr_maxsz + 4)
#define decode_reclaim_complete_maxsz (op_decode_hdr_maxsz + 4)
+#define encode_getdevicelist_maxsz (op_encode_hdr_maxsz + 4 + \
+ encode_verifier_maxsz)
+#define decode_getdevicelist_maxsz (op_decode_hdr_maxsz + \
+ 2 /* nfs_cookie4 gdlr_cookie */ + \
+ decode_verifier_maxsz \
+ /* verifier4 gdlr_verifier */ + \
+ 1 /* gdlr_deviceid_list count */ + \
+ XDR_QUADLEN(NFS4_PNFS_GETDEVLIST_MAXNUM * \
+ NFS4_DEVICEID4_SIZE) \
+ /* gdlr_deviceid_list */ + \
+ 1 /* bool gdlr_eof */)
#define encode_getdeviceinfo_maxsz (op_encode_hdr_maxsz + 4 + \
XDR_QUADLEN(NFS4_DEVICEID4_SIZE))
#define decode_getdeviceinfo_maxsz (op_decode_hdr_maxsz + \
@@ -740,6 +751,14 @@ static int nfs4_stat_to_errno(int);
#define NFS4_dec_reclaim_complete_sz (compound_decode_hdr_maxsz + \
decode_sequence_maxsz + \
decode_reclaim_complete_maxsz)
+#define NFS4_enc_getdevicelist_sz (compound_encode_hdr_maxsz + \
+ encode_sequence_maxsz + \
+ encode_putfh_maxsz + \
+ encode_getdevicelist_maxsz)
+#define NFS4_dec_getdevicelist_sz (compound_decode_hdr_maxsz + \
+ decode_sequence_maxsz + \
+ decode_putfh_maxsz + \
+ decode_getdevicelist_maxsz)
#define NFS4_enc_getdeviceinfo_sz (compound_encode_hdr_maxsz + \
encode_sequence_maxsz +\
encode_getdeviceinfo_maxsz)
@@ -1827,6 +1846,26 @@ static void encode_sequence(struct xdr_stream *xdr,

#ifdef CONFIG_NFS_V4_1
static void
+encode_getdevicelist(struct xdr_stream *xdr,
+ const struct nfs4_getdevicelist_args *args,
+ struct compound_hdr *hdr)
+{
+ __be32 *p;
+ nfs4_verifier dummy = {
+ .data = "dummmmmy",
+ };
+
+ p = reserve_space(xdr, 20);
+ *p++ = cpu_to_be32(OP_GETDEVICELIST);
+ *p++ = cpu_to_be32(args->layoutclass);
+ *p++ = cpu_to_be32(NFS4_PNFS_GETDEVLIST_MAXNUM);
+ xdr_encode_hyper(p, 0ULL); /* cookie */
+ encode_nfs4_verifier(xdr, &dummy);
+ hdr->nops++;
+ hdr->replen += decode_getdevicelist_maxsz;
+}
+
+static void
encode_getdeviceinfo(struct xdr_stream *xdr,
const struct nfs4_getdeviceinfo_args *args,
struct compound_hdr *hdr)
@@ -2707,6 +2746,24 @@ static void nfs4_xdr_enc_reclaim_complete(struct rpc_rqst *req,
}

/*
+ * Encode GETDEVICELIST request
+ */
+static void nfs4_xdr_enc_getdevicelist(struct rpc_rqst *req,
+ struct xdr_stream *xdr,
+ struct nfs4_getdevicelist_args *args)
+{
+ struct compound_hdr hdr = {
+ .minorversion = nfs4_xdr_minorversion(&args->seq_args),
+ };
+
+ encode_compound_hdr(xdr, req, &hdr);
+ encode_sequence(xdr, &args->seq_args, &hdr);
+ encode_putfh(xdr, args->fh, &hdr);
+ encode_getdevicelist(xdr, args, &hdr);
+ encode_nops(&hdr);
+}
+
+/*
* Encode GETDEVICEINFO request
*/
static void nfs4_xdr_enc_getdeviceinfo(struct rpc_rqst *req,
@@ -5141,6 +5198,50 @@ out_overflow:
}

#if defined(CONFIG_NFS_V4_1)
+/*
+ * TODO: Need to handle case when EOF != true;
+ */
+static int decode_getdevicelist(struct xdr_stream *xdr,
+ struct pnfs_devicelist *res)
+{
+ __be32 *p;
+ int status, i;
+ struct nfs_writeverf verftemp;
+
+ status = decode_op_hdr(xdr, OP_GETDEVICELIST);
+ if (status)
+ return status;
+
+ p = xdr_inline_decode(xdr, 8 + 8 + 4);
+ if (unlikely(!p))
+ goto out_overflow;
+
+ /* TODO: Skip cookie for now */
+ p += 2;
+
+ /* Read verifier */
+ p = xdr_decode_opaque_fixed(p, verftemp.verifier, 8);
+
+ res->num_devs = be32_to_cpup(p);
+
+ dprintk("%s: num_dev %d\n", __func__, res->num_devs);
+
+ if (res->num_devs > NFS4_PNFS_GETDEVLIST_MAXNUM)
+ return -NFS4ERR_REP_TOO_BIG;
+
+ p = xdr_inline_decode(xdr,
+ res->num_devs * NFS4_DEVICEID4_SIZE + 4);
+ if (unlikely(!p))
+ goto out_overflow;
+ for (i = 0; i < res->num_devs; i++)
+ p = xdr_decode_opaque_fixed(p, res->dev_id[i].data,
+ NFS4_DEVICEID4_SIZE);
+ res->eof = be32_to_cpup(p);
+ return 0;
+out_overflow:
+ print_overflow_msg(__func__, xdr);
+ return -EIO;
+}

static int decode_getdeviceinfo(struct xdr_stream *xdr,
struct pnfs_device *pdev)
@@ -6366,6 +6467,32 @@ static int nfs4_xdr_dec_reclaim_complete(struct rpc_rqst *rqstp,
}

/*
+ * Decode GETDEVICELIST response
+ */
+static int nfs4_xdr_dec_getdevicelist(struct rpc_rqst *rqstp,
+ struct xdr_stream *xdr,
+ struct nfs4_getdevicelist_res *res)
+{
+ struct compound_hdr hdr;
+ int status;
+
+ dprintk("decoding getdevicelist!\n");
+
+ status = decode_compound_hdr(xdr, &hdr);
+ if (status != 0)
+ goto out;
+ status = decode_sequence(xdr, &res->seq_res, rqstp);
+ if (status != 0)
+ goto out;
+ status = decode_putfh(xdr);
+ if (status != 0)
+ goto out;
+ status = decode_getdevicelist(xdr, res->devlist);
+out:
+ return status;
+}
+
+/*
* Decode GETDEVINFO response
*/
static int nfs4_xdr_dec_getdeviceinfo(struct rpc_rqst *rqstp,
@@ -6663,6 +6790,7 @@ struct rpc_procinfo nfs4_procedures[] = {
PROC(LAYOUTGET, enc_layoutget, dec_layoutget),
PROC(LAYOUTCOMMIT, enc_layoutcommit, dec_layoutcommit),
PROC(LAYOUTRETURN, enc_layoutreturn, dec_layoutreturn),
+ PROC(GETDEVICELIST, enc_getdevicelist, dec_getdevicelist),
#endif /* CONFIG_NFS_V4_1 */
};

diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index f2b183b..9dc950c 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -133,14 +133,26 @@ struct pnfs_device {
unsigned int layout_type;
unsigned int mincount;
struct page **pages;
+ void *area;
unsigned int pgbase;
unsigned int pglen;
};

+#define NFS4_PNFS_GETDEVLIST_MAXNUM 16
+
+struct pnfs_devicelist {
+ unsigned int eof;
+ unsigned int num_devs;
+ struct nfs4_deviceid dev_id[NFS4_PNFS_GETDEVLIST_MAXNUM];
+};
+
extern int pnfs_register_layoutdriver(struct pnfs_layoutdriver_type *);
extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);

/* nfs4proc.c */
+extern int nfs4_proc_getdevicelist(struct nfs_server *server,
+ const struct nfs_fh *fh,
+ struct pnfs_devicelist *devlist);
extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
struct pnfs_device *dev);
extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 504b289..7915d41 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -560,6 +560,7 @@ enum {
NFSPROC4_CLNT_GET_LEASE_TIME,
NFSPROC4_CLNT_RECLAIM_COMPLETE,
NFSPROC4_CLNT_LAYOUTGET,
+ NFSPROC4_CLNT_GETDEVICELIST,
NFSPROC4_CLNT_GETDEVICEINFO,
NFSPROC4_CLNT_LAYOUTCOMMIT,
NFSPROC4_CLNT_LAYOUTRETURN,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 00848d8..0863228 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -235,6 +235,17 @@ struct nfs4_layoutget {
gfp_t gfp_flags;
};

+struct nfs4_getdevicelist_args {
+ const struct nfs_fh *fh;
+ u32 layoutclass;
+ struct nfs4_sequence_args seq_args;
+};
+
+struct nfs4_getdevicelist_res {
+ struct pnfs_devicelist *devlist;
+ struct nfs4_sequence_res seq_res;
+};
+
struct nfs4_getdeviceinfo_args {
struct pnfs_device *pdev;
struct nfs4_sequence_args seq_args;
--
1.7.4.1
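The reply layout that decode_getdevicelist() walks (8-byte cookie, 8-byte verifier, 4-byte count, count device IDs, 4-byte eof) may be easier to see outside the xdr_stream helpers. The following is a userspace sketch only, not the kernel code: the demo_ names are hypothetical, DEMO_MAXNUM mirrors NFS4_PNFS_GETDEVLIST_MAXNUM from this patch, and DEMO_DEVID_SIZE assumes NFS4_DEVICEID4_SIZE is 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAXNUM      16	/* mirrors NFS4_PNFS_GETDEVLIST_MAXNUM */
#define DEMO_DEVID_SIZE  16	/* assumed value of NFS4_DEVICEID4_SIZE */

struct demo_devicelist {
	uint32_t eof;
	uint32_t num_devs;
	uint8_t dev_id[DEMO_MAXNUM][DEMO_DEVID_SIZE];
};

/* Read one big-endian 32-bit XDR word. */
static uint32_t demo_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode cookie(8) + verifier(8) + count(4) + count * devid + eof(4).
 * Returns 0 on success, -1 if the buffer is short or the count too large.
 */
static int demo_decode_devicelist(const uint8_t *buf, size_t len,
				  struct demo_devicelist *res)
{
	size_t off = 8 + 8;	/* skip cookie and verifier */
	uint32_t i;

	if (len < off + 4)
		return -1;
	res->num_devs = demo_be32(buf + off);
	off += 4;
	if (res->num_devs > DEMO_MAXNUM)
		return -1;
	/* Bounds check before touching the device ID array and eof word. */
	if (len - off < (size_t)res->num_devs * DEMO_DEVID_SIZE + 4)
		return -1;
	for (i = 0; i < res->num_devs; i++) {
		memcpy(res->dev_id[i], buf + off, DEMO_DEVID_SIZE);
		off += DEMO_DEVID_SIZE;
	}
	res->eof = demo_be32(buf + off);
	return 0;
}
```

As in the patch, the count is checked against the fixed array size before any device ID is copied, so an oversized reply is rejected rather than overrunning dev_id[].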


2011-06-14 02:33:15

by Jim Rees

Subject: [PATCH 29/33] pnfsblock: bl_write_pagelist support functions

From: Fred Isaman <[email protected]>

[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: fixup blksize alignment in bl_setup_layoutcommit]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index cbf74d8..01fe089 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -70,6 +70,19 @@ static int is_hole(struct pnfs_block_extent *be, sector_t isect)
return !is_sector_initialized(be->be_inval, isect);
}

+/* Given the be associated with isect, determine if page data can be
+ * written to disk.
+ */
+static int is_writable(struct pnfs_block_extent *be, sector_t isect)
+{
+ if (be->be_state == PNFS_BLOCK_READWRITE_DATA)
+ return 1;
+ else if (be->be_state != PNFS_BLOCK_INVALID_DATA)
+ return 0;
+ else
+ return is_sector_initialized(be->be_inval, isect);
+}
+
static int
dont_like_caller(struct nfs_page *req)
{
--
1.7.4.1
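The decision made by is_writable() reduces to a small truth table over the extent state and the "sector initialized" marking. A minimal sketch, assuming a hypothetical enum demo_exstate that mirrors enum exstate4 (values in RFC 5663 order) and a boolean standing in for is_sector_initialized(be->be_inval, isect):

```c
#include <stdbool.h>

/* Hypothetical mirror of enum exstate4 (illustration only). */
enum demo_exstate {
	DEMO_READWRITE_DATA = 0,
	DEMO_READ_DATA      = 1,
	DEMO_INVALID_DATA   = 2,
	DEMO_NONE_DATA      = 3,
};

/*
 * Page data can go straight to disk for READWRITE extents, and for
 * INVALID extents only once the sector has been marked initialized.
 */
static bool demo_is_writable(enum demo_exstate state, bool initialized)
{
	switch (state) {
	case DEMO_READWRITE_DATA:
		return true;
	case DEMO_INVALID_DATA:
		return initialized;
	default:
		return false;	/* READ_DATA and NONE_DATA are never writable */
	}
}
```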


2011-06-14 02:32:56

by Jim Rees

Subject: [PATCH 21/33] pnfsblock: SPLITME: add extent manipulation functions

From: Fred Isaman <[email protected]>

Adds working implementations of various support functions
to handle INVAL extents, needed by writes, such as
mark_initialized_sectors and is_sector_initialized.

SPLIT: this needs to be split into the exported functions, and the
range support functions (which will be replaced eventually.)

[pnfsblock: fix 64-bit compiler warnings for extent manipulation]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 30 ++++-
fs/nfs/blocklayout/extents.c | 253 ++++++++++++++++++++++++++++++++++++++
2 files changed, 281 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 04eb6de..364540a 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,6 +35,8 @@
#include <linux/nfs_fs.h>
#include "../pnfs.h"

+#define PAGE_CACHE_SECTORS (PAGE_CACHE_SIZE >> 9)
+
#define PG_pnfserr PG_owner_priv_1
#define PagePnfsErr(page) test_bit(PG_pnfserr, &(page)->flags)
#define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
@@ -101,8 +103,23 @@ enum exstate4 {
PNFS_BLOCK_NONE_DATA = 3 /* unmapped, it's a hole */
};

+#define MY_MAX_TAGS (15) /* tag bitnums used must be less than this */
+
+struct my_tree_t {
+ sector_t mtt_step_size; /* Internal sector alignment */
+ struct list_head mtt_stub; /* Should be a radix tree */
+};
+
struct pnfs_inval_markings {
- /* STUB */
+ spinlock_t im_lock;
+ struct my_tree_t im_tree; /* Sectors that need LAYOUTCOMMIT */
+ sector_t im_block_size; /* Server blocksize in sectors */
+};
+
+struct pnfs_inval_tracking {
+ struct list_head it_link;
+ int it_sector;
+ int it_tags;
};

/* sector_t fields are all in 512-byte sectors */
@@ -121,7 +138,11 @@ struct pnfs_block_extent {
static inline void
INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
{
- /* STUB */
+ spin_lock_init(&marks->im_lock);
+ INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
+ marks->im_block_size = blocksize;
+ marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
+ blocksize);
}

enum extentclass4 {
@@ -222,8 +243,13 @@ void free_block_dev(struct pnfs_block_dev *bdev);
struct pnfs_block_extent *
find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
struct pnfs_block_extent **cow_read);
+int mark_initialized_sectors(struct pnfs_inval_markings *marks,
+ sector_t offset, sector_t length,
+ sector_t **pages);
void put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *alloc_extent(void);
+struct pnfs_block_extent *get_extent(struct pnfs_block_extent *be);
+int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index f0b3f13..3d36f66 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -33,6 +33,259 @@
#include "blocklayout.h"
#define NFSDBG_FACILITY NFSDBG_PNFS_LD

+/* Bit numbers */
+#define EXTENT_INITIALIZED 0
+#define EXTENT_WRITTEN 1
+#define EXTENT_IN_COMMIT 2
+#define INTERNAL_EXISTS MY_MAX_TAGS
+#define INTERNAL_MASK ((1 << INTERNAL_EXISTS) - 1)
+
+/* Returns largest t<=s s.t. t%base==0 */
+static inline sector_t normalize(sector_t s, int base)
+{
+ sector_t tmp = s; /* Since do_div modifies its argument */
+ return s - do_div(tmp, base);
+}
+
+static inline sector_t normalize_up(sector_t s, int base)
+{
+ return normalize(s + base - 1, base);
+}
+
+/* Complete stub using a list while we determine the API we want */
+
+/* Returns tags, or negative */
+static int32_t _find_entry(struct my_tree_t *tree, u64 s)
+{
+ struct pnfs_inval_tracking *pos;
+
+ dprintk("%s(%llu) enter\n", __func__, s);
+ list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector > s)
+ continue;
+ else if (pos->it_sector == s)
+ return pos->it_tags & INTERNAL_MASK;
+ else
+ break;
+ }
+ return -ENOENT;
+}
+
+static inline
+int _has_tag(struct my_tree_t *tree, u64 s, int32_t tag)
+{
+ int32_t tags;
+
+ dprintk("%s(%llu, %i) enter\n", __func__, s, tag);
+ s = normalize(s, tree->mtt_step_size);
+ tags = _find_entry(tree, s);
+ if ((tags < 0) || !(tags & (1 << tag)))
+ return 0;
+ else
+ return 1;
+}
+
+/* Creates entry with tag, or if entry already exists, unions tag to it.
+ * If storage is not NULL, newly created entry will use it.
+ * Returns number of entries added, or negative on error.
+ */
+static int _add_entry(struct my_tree_t *tree, u64 s, int32_t tag,
+ struct pnfs_inval_tracking *storage)
+{
+ int found = 0;
+ struct pnfs_inval_tracking *pos;
+
+ dprintk("%s(%llu, %i, %p) enter\n", __func__, s, tag, storage);
+ list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector > s)
+ continue;
+ else if (pos->it_sector == s) {
+ found = 1;
+ break;
+ } else
+ break;
+ }
+ if (found) {
+ pos->it_tags |= (1 << tag);
+ return 0;
+ } else {
+ struct pnfs_inval_tracking *new;
+ if (storage)
+ new = storage;
+ else {
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+ }
+ new->it_sector = s;
+ new->it_tags = (1 << tag);
+ list_add(&new->it_link, &pos->it_link);
+ return 1;
+ }
+}
+
+/* XXXX Really want option to not create */
+/* Over range, unions tag with existing entries, else creates entry with tag */
+static int _set_range(struct my_tree_t *tree, int32_t tag, u64 s, u64 length)
+{
+ u64 i;
+
+ dprintk("%s(%i, %llu, %llu) enter\n", __func__, tag, s, length);
+ for (i = normalize(s, tree->mtt_step_size); i < s + length;
+ i += tree->mtt_step_size)
+ if (_add_entry(tree, i, tag, NULL))
+ return -ENOMEM;
+ return 0;
+}
+
+/* Ensure that future operations on given range of tree will not malloc */
+static int _preload_range(struct my_tree_t *tree, u64 offset, u64 length)
+{
+ u64 start, end, s;
+ int count, i, used = 0, status = -ENOMEM;
+ struct pnfs_inval_tracking **storage;
+
+ dprintk("%s(%llu, %llu) enter\n", __func__, offset, length);
+ start = normalize(offset, tree->mtt_step_size);
+ end = normalize_up(offset + length, tree->mtt_step_size);
+ count = (int)(end - start) / (int)tree->mtt_step_size;
+
+ /* Pre-malloc what memory we might need */
+ storage = kmalloc(sizeof(*storage) * count, GFP_KERNEL);
+ if (!storage)
+ return -ENOMEM;
+ for (i = 0; i < count; i++) {
+ storage[i] = kmalloc(sizeof(struct pnfs_inval_tracking),
+ GFP_KERNEL);
+ if (!storage[i])
+ goto out_cleanup;
+ }
+
+ /* Now need lock - HOW??? */
+
+ for (s = start; s < end; s += tree->mtt_step_size)
+ used += _add_entry(tree, s, INTERNAL_EXISTS, storage[used]);
+
+ /* Unlock - HOW??? */
+ status = 0;
+
+ out_cleanup:
+ for (i = used; i < count; i++) {
+ if (!storage[i])
+ break;
+ kfree(storage[i]);
+ }
+ kfree(storage);
+ return status;
+}
+
+static void set_needs_init(sector_t *array, sector_t offset)
+{
+ sector_t *p = array;
+
+ dprintk("%s enter\n", __func__);
+ if (!p)
+ return;
+ while (*p < offset)
+ p++;
+ if (*p == offset)
+ return;
+ else if (*p == ~0) {
+ *p++ = offset;
+ *p = ~0;
+ return;
+ } else {
+ sector_t *save = p;
+ dprintk("%s Adding %llu\n", __func__, (u64)offset);
+ while (*p != ~0)
+ p++;
+ p++;
+ memmove(save + 1, save, (char *)p - (char *)save);
+ *save = offset;
+ return;
+ }
+}
+
+/* We are relying on page lock to serialize this */
+int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
+{
+ int rv;
+
+ spin_lock(&marks->im_lock);
+ rv = _has_tag(&marks->im_tree, isect, EXTENT_INITIALIZED);
+ spin_unlock(&marks->im_lock);
+ return rv;
+}
+
+/* Marks sectors in [offset, offset + length) as having been initialized.
+ * All lengths are step-aligned, where step is min(pagesize, blocksize).
+ * Notes where partial block is initialized, and helps prepare it for
+ * complete initialization later.
+ */
+/* Currently assumes offset is page-aligned */
+int mark_initialized_sectors(struct pnfs_inval_markings *marks,
+ sector_t offset, sector_t length,
+ sector_t **pages)
+{
+ sector_t s, start, end;
+ sector_t *array = NULL; /* Pages to mark */
+
+ dprintk("%s(offset=%llu,len=%llu) enter\n",
+ __func__, (u64)offset, (u64)length);
+ s = max((sector_t) 3,
+ 2 * (marks->im_block_size / (PAGE_CACHE_SECTORS)));
+ dprintk("%s set max=%llu\n", __func__, (u64)s);
+ if (pages) {
+ array = kmalloc(s * sizeof(sector_t), GFP_KERNEL);
+ if (!array)
+ goto outerr;
+ array[0] = ~0;
+ }
+
+ start = normalize(offset, marks->im_block_size);
+ end = normalize_up(offset + length, marks->im_block_size);
+ if (_preload_range(&marks->im_tree, start, end - start))
+ goto outerr;
+
+ spin_lock(&marks->im_lock);
+
+ for (s = normalize_up(start, PAGE_CACHE_SECTORS);
+ s < offset; s += PAGE_CACHE_SECTORS) {
+ dprintk("%s pre-area pages\n", __func__);
+ /* Portion of used block is not initialized */
+ if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
+ set_needs_init(array, s);
+ }
+ if (_set_range(&marks->im_tree, EXTENT_INITIALIZED, offset, length))
+ goto out_unlock;
+ for (s = normalize_up(offset + length, PAGE_CACHE_SECTORS);
+ s < end; s += PAGE_CACHE_SECTORS) {
+ dprintk("%s post-area pages\n", __func__);
+ if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
+ set_needs_init(array, s);
+ }
+
+ spin_unlock(&marks->im_lock);
+
+ if (pages) {
+ if (array[0] == ~0) {
+ kfree(array);
+ *pages = NULL;
+ } else
+ *pages = array;
+ }
+ return 0;
+
+ out_unlock:
+ spin_unlock(&marks->im_lock);
+ outerr:
+ if (pages) {
+ kfree(array);
+ *pages = NULL;
+ }
+ return -ENOMEM;
+}
+
static void print_bl_extent(struct pnfs_block_extent *be)
{
dprintk("PRINT EXTENT extent %p\n", be);
--
1.7.4.1
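The rounding done by normalize() and normalize_up(), and the entry count that _preload_range() derives from them, can be checked in userspace. A rough sketch, assuming plain 64-bit modulo in place of do_div(); the demo_ names are hypothetical and not part of the patch:

```c
#include <stdint.h>

/* Largest t <= s with t % base == 0 (base must be nonzero). */
static uint64_t demo_normalize(uint64_t s, uint32_t base)
{
	return s - (s % base);
}

/* Smallest t >= s with t % base == 0. */
static uint64_t demo_normalize_up(uint64_t s, uint32_t base)
{
	return demo_normalize(s + base - 1, base);
}

/* Number of step-sized tracking entries covering [offset, offset + length). */
static uint64_t demo_entry_count(uint64_t offset, uint64_t length,
				 uint32_t step)
{
	uint64_t start = demo_normalize(offset, step);
	uint64_t end = demo_normalize_up(offset + length, step);

	return (end - start) / step;
}
```

For example, with a step of 8 sectors a range starting at sector 5 with length 10 spans sectors 0 through 15 once aligned, so two entries would need to be preallocated.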


2011-06-14 02:33:09

by Jim Rees

Subject: [PATCH 26/33] pnfsblock: write_begin

From: Fred Isaman <[email protected]>

Implements bl_write_begin and bl_do_flush, allowing the block driver to read
in the parts of a page "around" the data that is about to be copied to the page.

[pnfsblock: fix 64-bit compiler warnings for write_begin]
[pnfsblock: write_begin adjust for removed fields]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 178 +++++++++++++++++++++++++++++++++++++-
1 files changed, 177 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index d9bcb13..b9b961f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -31,6 +31,8 @@
*/
#include <linux/module.h>
#include <linux/init.h>
+
+#include <linux/buffer_head.h> /* various write calls */
#include <linux/bio.h> /* struct bio */
#include <linux/vmalloc.h>
#include "blocklayout.h"
@@ -589,11 +591,185 @@ bl_clear_layoutdriver(struct nfs_server *server)
return 0;
}

+/* STUB - mark intersection of layout and page as bad, so it is not
+ * used again.
+ */
+static void mark_bad_read(void)
+{
+ return;
+}
+
+/* Copied from buffer.c */
+static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate)
+{
+ if (uptodate) {
+ set_buffer_uptodate(bh);
+ } else {
+ /* This happens, due to failed READA attempts. */
+ clear_buffer_uptodate(bh);
+ }
+ unlock_buffer(bh);
+}
+
+/* Copied from buffer.c */
+static void end_buffer_read_nobh(struct buffer_head *bh, int uptodate)
+{
+ __end_buffer_read_notouch(bh, uptodate);
+}
+
+/*
+ * map_block: map a requested I/O block (isect) into an offset in the LVM
+ * meta block_device
+ */
+static void
+map_block(sector_t isect, struct pnfs_block_extent *be, struct buffer_head *bh)
+{
+ dprintk("%s enter be=%p\n", __func__, be);
+
+ set_buffer_mapped(bh);
+ bh->b_bdev = be->be_mdev;
+ bh->b_blocknr = (isect - be->be_f_offset + be->be_v_offset) >>
+ (be->be_mdev->bd_inode->i_blkbits - 9);
+
+ dprintk("%s isect %ld, bh->b_blocknr %ld, using bsize %Zd\n",
+ __func__, (long)isect,
+ (long)bh->b_blocknr,
+ bh->b_size);
+ return;
+}
+
+/* Given an unmapped page, zero it (or read in page for COW),
+ * and set appropriate flags/markings, but it is safe to not initialize
+ * the range given in [from, to).
+ */
+/* This is loosely based on nobh_write_begin */
+static int
+init_page_for_write(struct pnfs_block_layout *bl, struct page *page,
+ unsigned from, unsigned to, sector_t **pages_to_mark)
+{
+ struct buffer_head *bh;
+ int inval, ret = -EIO;
+ struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+ sector_t isect;
+
+ dprintk("%s enter, %p\n", __func__, page);
+ bh = alloc_page_buffers(page, PAGE_CACHE_SIZE, 0);
+ if (!bh) {
+ ret = -ENOMEM;
+ goto cleanup;
+ }
+
+ isect = (sector_t)page->index << (PAGE_CACHE_SHIFT - 9);
+ be = find_get_extent(bl, isect, &cow_read);
+ if (!be)
+ goto cleanup;
+ inval = is_hole(be, isect);
+ dprintk("%s inval=%i, from=%u, to=%u\n", __func__, inval, from, to);
+ if (inval) {
+ if (be->be_state == PNFS_BLOCK_NONE_DATA) {
+ dprintk("%s PANIC - got NONE_DATA extent %p\n",
+ __func__, be);
+ goto cleanup;
+ }
+ map_block(isect, be, bh);
+ unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
+ }
+ if (PageUptodate(page)) {
+ /* Do nothing */
+ } else if (inval && !cow_read) {
+ zero_user_segments(page, 0, from, to, PAGE_CACHE_SIZE);
+ } else if (0 < from || PAGE_CACHE_SIZE > to) {
+ struct pnfs_block_extent *read_extent;
+
+ read_extent = (inval && cow_read) ? cow_read : be;
+ map_block(isect, read_extent, bh);
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_read_nobh;
+ submit_bh(READ, bh);
+ dprintk("%s: Waiting for buffer read\n", __func__);
+ /* XXX Don't really want to hold layout lock here */
+ wait_on_buffer(bh);
+ if (!buffer_uptodate(bh))
+ goto cleanup;
+ }
+ if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+ /* There is a BUG here if there is a short copy after write_begin,
+ * but I think this is a generic fs bug. The problem is that
+ * we have marked the page as initialized, but it is possible
+ * that the section not copied may never get copied.
+ */
+ ret = mark_initialized_sectors(be->be_inval, isect,
+ PAGE_CACHE_SECTORS,
+ pages_to_mark);
+ /* Want to preallocate mem so above can't fail */
+ if (ret)
+ goto cleanup;
+ }
+ SetPageMappedToDisk(page);
+ ret = 0;
+
+cleanup:
+ free_buffer_head(bh);
+ put_extent(be);
+ put_extent(cow_read);
+ if (ret) {
+ /* Need to mark layout with bad read...should now
+ * just use nfs4 for reads and writes.
+ */
+ mark_bad_read();
+ }
+ return ret;
+}
+
static int
bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
unsigned count, struct pnfs_fsdata *fsdata)
{
- return 0;
+ unsigned from, to;
+ int ret;
+ sector_t *pages_to_mark = NULL;
+ struct pnfs_block_layout *bl = BLK_LSEG2EXT(lseg);
+
+ dprintk("%s enter, %u@%lld\n", __func__, count, pos);
+ print_page(page);
+ /* The following code assumes blocksize >= PAGE_CACHE_SIZE */
+ if (bl->bl_blocksize < (PAGE_CACHE_SIZE >> 9)) {
+ dprintk("%s Can't handle blocksize %llu\n", __func__,
+ (u64)bl->bl_blocksize);
+ put_lseg(fsdata->lseg);
+ fsdata->lseg = NULL;
+ return 0;
+ }
+ if (PageMappedToDisk(page)) {
+ /* Basically, this is a flag that says we have
+ * successfully called write_begin already on this page.
+ */
+ /* NOTE - there are cache consistency issues here.
+ * For example, what if the layout is recalled, then regained?
+ * If the file is closed and reopened, will the page flags
+ * be reset? If not, we'll have to use layout info instead of
+ * the page flag.
+ */
+ return 0;
+ }
+ from = pos & (PAGE_CACHE_SIZE - 1);
+ to = from + count;
+ ret = init_page_for_write(bl, page, from, to, &pages_to_mark);
+ if (ret) {
+ dprintk("%s init page failed with %i\n", __func__, ret);
+ /* Revert back to plain NFS and just continue on with
+ * write. This assumes there is no request attached, which
+ * should be true if we get here.
+ */
+ BUG_ON(PagePrivate(page));
+ put_lseg(fsdata->lseg);
+ fsdata->lseg = NULL;
+ kfree(pages_to_mark);
+ ret = 0;
+ } else {
+ fsdata->private = pages_to_mark;
+ }
+ return ret;
}

static int
--
1.7.4.1


2011-06-14 02:33:21

by Jim Rees

Subject: [PATCH 31/33] pnfsblock: note written INVAL areas for layoutcommit

From: Fred Isaman <[email protected]>

[SQUASHME: pnfs: blocklayout: port block layout code]
Signed-off-by: Peng Tao <[email protected]>
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
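
Note (not part of the commit): the block normalization done by
mark_for_commit() in the diff below can be sketched in userspace as
follows. This is only an illustration under stated assumptions:
block_floor() and block_ceil() are hypothetical stand-ins for normalize()
and normalize_up(), which are assumed here to round a sector number down
or up to a multiple of the block size, and range_written_fn is a
hypothetical stand-in for is_range_written().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for normalize()/normalize_up(): assumed to round a
 * sector number down/up to a multiple of the block size (in sectors).
 */
uint64_t block_floor(uint64_t sect, uint64_t blocksize)
{
	return sect - (sect % blocksize);
}

uint64_t block_ceil(uint64_t sect, uint64_t blocksize)
{
	return block_floor(sect + blocksize - 1, blocksize);
}

/* Hypothetical predicate type: is every sector in [start, end) written? */
typedef bool (*range_written_fn)(uint64_t start, uint64_t end);

/* Example predicates, used only for illustration. */
bool nothing_written(uint64_t start, uint64_t end)
{
	(void)start;
	(void)end;
	return false;
}

bool everything_written(uint64_t start, uint64_t end)
{
	(void)start;
	(void)end;
	return true;
}

/* Adjust [*start, *end) to whole blocks: grow an edge outward only if the
 * padding is already written, otherwise pull it inward to the next block
 * boundary. Returns false if no whole block is left to commit.
 */
bool commit_range(uint64_t *start, uint64_t *end, uint64_t blocksize,
		  range_written_fn is_written)
{
	uint64_t lo = block_floor(*start, blocksize);
	uint64_t hi = block_ceil(*end, blocksize);

	if (lo < *start)
		*start = is_written(lo, *start) ? lo : lo + blocksize;
	if (*end < hi)
		*end = is_written(*end, hi) ? hi : hi - blocksize;
	return *start < *end;
}
```

With an 8-sector block, a write covering sectors [3, 21) whose neighbours
are unwritten shrinks to the single whole block [8, 16), while a write
covering [3, 13) leaves nothing to commit yet.
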
fs/nfs/blocklayout/blocklayout.c | 32 +++++++++++++
fs/nfs/blocklayout/blocklayout.h | 2 +
fs/nfs/blocklayout/extents.c | 95 ++++++++++++++++++++++++++++++++++++++
3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 6fe039c8..6de800f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -320,6 +320,30 @@ bl_read_pagelist(struct nfs_read_data *rdata)
return PNFS_NOT_ATTEMPTED;
}

+static void mark_extents_written(struct pnfs_block_layout *bl,
+ __u64 offset, __u32 count)
+{
+ sector_t isect, end;
+ struct pnfs_block_extent *be;
+
+ dprintk("%s(%llu, %u)\n", __func__, offset, count);
+ if (count == 0)
+ return;
+ isect = (offset & (long)(PAGE_CACHE_MASK)) >> 9;
+ end = (offset + count + PAGE_CACHE_SIZE - 1) & (long)(PAGE_CACHE_MASK);
+ end >>= 9;
+ while (isect < end) {
+ sector_t len;
+ be = find_get_extent(bl, isect, NULL);
+ BUG_ON(!be); /* FIXME */
+ len = min(end, be->be_f_offset + be->be_length) - isect;
+ if (be->be_state == PNFS_BLOCK_INVALID_DATA)
+ mark_for_commit(be, isect, len); /* What if fails? */
+ isect += len;
+ put_extent(be);
+ }
+}
+
/* STUB - this needs thought */
static inline void
bl_done_with_wpage(struct page *page, const int ok)
@@ -367,6 +391,14 @@ static void bl_write_cleanup(struct work_struct *work)
dprintk("%s enter\n", __func__);
task = container_of(work, struct rpc_task, u.tk_work);
wdata = container_of(task, struct nfs_write_data, task);
+ if (!wdata->task.tk_status) {
+ /* Marks for LAYOUTCOMMIT */
+ /* BUG - this should be called after each bio, not after
+ * all finish, unless we have some way of storing success/failure
+ */
+ mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
+ wdata->args.offset, wdata->args.count);
+ }
pnfs_ld_write_done(wdata);
}

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 5a7d0be..6b7718b 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -267,6 +267,8 @@ void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
int status);
int add_and_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
+int mark_for_commit(struct pnfs_block_extent *be,
+ sector_t offset, sector_t length);

#include <linux/sunrpc/simple_rpc_pipefs.h>

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 1447bfc..a62d29f 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -217,6 +217,48 @@ int is_sector_initialized(struct pnfs_inval_markings *marks, sector_t isect)
return rv;
}

+/* Assume start, end already sector aligned */
+static int
+_range_has_tag(struct my_tree_t *tree, u64 start, u64 end, int32_t tag)
+{
+ struct pnfs_inval_tracking *pos;
+ u64 expect = 0;
+
+ dprintk("%s(%llu, %llu, %i) enter\n", __func__, start, end, tag);
+ list_for_each_entry_reverse(pos, &tree->mtt_stub, it_link) {
+ if (pos->it_sector >= end)
+ continue;
+ if (!expect) {
+ if ((pos->it_sector == end - tree->mtt_step_size) &&
+ (pos->it_tags & (1 << tag))) {
+ expect = pos->it_sector - tree->mtt_step_size;
+ if (pos->it_sector < tree->mtt_step_size || expect < start)
+ return 1;
+ continue;
+ } else {
+ return 0;
+ }
+ }
+ if (pos->it_sector != expect || !(pos->it_tags & (1 << tag)))
+ return 0;
+ expect -= tree->mtt_step_size;
+ if (expect < start)
+ return 1;
+ }
+ return 0;
+}
+
+static int is_range_written(struct pnfs_inval_markings *marks,
+ sector_t start, sector_t end)
+{
+ int rv;
+
+ spin_lock(&marks->im_lock);
+ rv = _range_has_tag(&marks->im_tree, start, end, EXTENT_WRITTEN);
+ spin_unlock(&marks->im_lock);
+ return rv;
+}
+
/* Marks sectors in [offest, offset_length) as having been initialized.
* All lengths are step-aligned, where step is min(pagesize, blocksize).
* Notes where partial block is initialized, and helps prepare it for
@@ -394,6 +436,59 @@ static void add_to_commitlist(struct pnfs_block_layout *bl,
print_clist(clist, bl->bl_count);
}

+/* Note the range described by offset, length is guaranteed to be contained
+ * within be.
+ */
+int mark_for_commit(struct pnfs_block_extent *be,
+ sector_t offset, sector_t length)
+{
+ sector_t new_end, end = offset + length;
+ struct pnfs_block_short_extent *new;
+ struct pnfs_block_layout *bl = container_of(be->be_inval,
+ struct pnfs_block_layout,
+ bl_inval);
+
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ mark_written_sectors(be->be_inval, offset, length);
+ /* We want to add the range to the commit list, but it must first
+ * be block-normalized, and the normalized range must be verified
+ * as entirely written to disk.
+ */
+ new->bse_f_offset = offset;
+ offset = normalize(offset, bl->bl_blocksize);
+ if (offset < new->bse_f_offset) {
+ if (is_range_written(be->be_inval, offset, new->bse_f_offset))
+ new->bse_f_offset = offset;
+ else
+ new->bse_f_offset = offset + bl->bl_blocksize;
+ }
+ new_end = normalize_up(end, bl->bl_blocksize);
+ if (end < new_end) {
+ if (is_range_written(be->be_inval, end, new_end))
+ end = new_end;
+ else
+ end = new_end - bl->bl_blocksize;
+ }
+ if (end <= new->bse_f_offset) {
+ kfree(new);
+ return 0;
+ }
+ new->bse_length = end - new->bse_f_offset;
+ new->bse_devid = be->be_devid;
+ new->bse_mdev = be->be_mdev;
+
+ spin_lock(&bl->bl_ext_lock);
+ /* new will be freed, either by add_to_commitlist if it decides not
+ * to use it, or after LAYOUTCOMMIT uses it in the commitlist.
+ */
+ add_to_commitlist(bl, new);
+ spin_unlock(&bl->bl_ext_lock);
+ return 0;
+}
+
static void print_bl_extent(struct pnfs_block_extent *be)
{
dprintk("PRINT EXTENT extent %p\n", be);
--
1.7.4.1


2011-06-14 02:32:17

by Jim Rees

Subject: [PATCH 06/33] pnfs: cleanup_layoutcommit

From: Peng Tao <[email protected]>

This gives the layout driver a chance to clean up any structures it put in.
Also ensure layoutcommit does not commit more than isize, as the block layout
driver may dirty pages beyond EOF.

Signed-off-by: Andy Adamson <[email protected]>
[fixup layout header pointer for layoutcommit]
Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
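
Note (not part of the commit): the isize clamp added to
pnfs_set_layoutcommit() amounts to the following, shown here as a minimal
userspace sketch. struct seg_state and record_write_end() are hypothetical
names for illustration only; the real code updates lseg->pls_end_pos under
the inode lock, which this sketch leaves out.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-segment state. */
struct seg_state {
	int64_t end_pos;	/* highest offset recorded for LAYOUTCOMMIT */
};

/* Record a completed write that ends at end_pos. The value is clamped to
 * the current file size first, so a write that dirtied pages beyond EOF
 * never advances the recorded position past isize.
 */
void record_write_end(struct seg_state *seg, int64_t end_pos, int64_t isize)
{
	if (end_pos > isize)
		end_pos = isize;
	if (end_pos > seg->end_pos)
		seg->end_pos = end_pos;
}
```

For example, a write ending at 8192 on a 5000-byte file records 5000, and
a later, shorter write does not move the recorded position backwards.
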
fs/nfs/nfs4proc.c | 1 +
fs/nfs/nfs4xdr.c | 3 ++-
fs/nfs/pnfs.c | 15 +++++++++++++++
fs/nfs/pnfs.h | 5 +++++
include/linux/nfs_xdr.h | 1 +
5 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 04450bf..75c07f5 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5887,6 +5887,7 @@ static void nfs4_layoutcommit_release(void *calldata)
{
struct nfs4_layoutcommit_data *data = calldata;

+ pnfs_cleanup_layoutcommit(data->args.inode, data);
/* Matched by references in pnfs_set_layoutcommit */
put_lseg(data->lseg);
put_rpccred(data->cred);
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index b8f375e..f884a23 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1963,7 +1963,7 @@ encode_layoutcommit(struct xdr_stream *xdr,
*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
/* Only whole file layouts */
p = xdr_encode_hyper(p, 0); /* offset */
- p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
+ p = xdr_encode_hyper(p, args->lastbytewritten + 1); /* length */
*p++ = cpu_to_be32(0); /* reclaim */
p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
*p++ = cpu_to_be32(1); /* newoffset = TRUE */
@@ -5469,6 +5469,7 @@ static int decode_layoutcommit(struct xdr_stream *xdr,
int status;

status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
+ res->status = status;
if (status)
return status;

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 5373960..7912635 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1299,6 +1299,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
{
struct nfs_inode *nfsi = NFS_I(wdata->inode);
loff_t end_pos = wdata->mds_offset + wdata->res.count;
+ loff_t isize = i_size_read(wdata->inode);
bool mark_as_dirty = false;

spin_lock(&nfsi->vfs_inode.i_lock);
@@ -1312,9 +1313,13 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
dprintk("%s: Set layoutcommit for inode %lu ",
__func__, wdata->inode->i_ino);
}
+ if (end_pos > isize)
+ end_pos = isize;
if (end_pos > wdata->lseg->pls_end_pos)
wdata->lseg->pls_end_pos = end_pos;
spin_unlock(&nfsi->vfs_inode.i_lock);
+ dprintk("%s: lseg %p end_pos %llu\n",
+ __func__, wdata->lseg, wdata->lseg->pls_end_pos);

/* if pnfs_layoutcommit_inode() runs between inode locks, the next one
* will be a noop because NFS_INO_LAYOUTCOMMIT will not be set */
@@ -1323,6 +1328,16 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
}
EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);

+void pnfs_cleanup_layoutcommit(struct inode *inode,
+ struct nfs4_layoutcommit_data *data)
+{
+ struct nfs_server *nfss = NFS_SERVER(inode);
+
+ if (nfss->pnfs_curr_ld->cleanup_layoutcommit)
+ nfss->pnfs_curr_ld->cleanup_layoutcommit(NFS_I(inode)->layout,
+ data);
+}
+
void pnfs_free_fsdata(struct pnfs_fsdata *fsdata)
{
/* lseg refcounting handled directly in nfs_write_end */
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 57aefb6..d7f203b 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -128,6 +128,9 @@ struct pnfs_layoutdriver_type {
struct xdr_stream *xdr,
const struct nfs4_layoutreturn_args *args);

+ void (*cleanup_layoutcommit) (struct pnfs_layout_hdr *layoutid,
+ struct nfs4_layoutcommit_data *data);
+
void (*encode_layoutcommit) (struct pnfs_layout_hdr *layoutid,
struct xdr_stream *xdr,
const struct nfs4_layoutcommit_args *args);
@@ -216,6 +219,8 @@ void pnfs_roc_release(struct inode *ino);
void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
+void pnfs_cleanup_layoutcommit(struct inode *inode,
+ struct nfs4_layoutcommit_data *data);
int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
int _pnfs_return_layout(struct inode *);
int pnfs_ld_write_done(struct nfs_write_data *);
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 1fa6f7a..253b6d4 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -269,6 +269,7 @@ struct nfs4_layoutcommit_res {
struct nfs_fattr *fattr;
const struct nfs_server *server;
struct nfs4_sequence_res seq_res;
+ int status;
};

struct nfs4_layoutcommit_data {
--
1.7.4.1


2011-06-14 02:32:22

by Jim Rees

Subject: [PATCH 08/33] pnfsblock: blocklayout stub

From: Fred Isaman <[email protected]>

Adds the minimal structure for a pnfs block layout driver,
with all function pointers aimed at stubs.

[pnfsblock: SQUASHME: port block layout code]
Signed-off-by: Peng Tao <[email protected]>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <[email protected]>
[pnfs: move pnfs_layout_type inline in nfs_inode]
Signed-off-by: Benny Halevy <[email protected]>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/Makefile | 4 +-
fs/nfs/blocklayout/blocklayout.c | 166 ++++++++++++++++++++++++++++++++++++++
2 files changed, 168 insertions(+), 2 deletions(-)
create mode 100644 fs/nfs/blocklayout/blocklayout.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index f214c1c..6bf49cd 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -1,5 +1,5 @@
#
# Makefile for the pNFS block layout driver kernel module
#
-obj-$(CONFIG_PNFS_BLOCK) +=
-blocklayoutdriver-objs :=
+obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
+blocklayoutdriver-objs := blocklayout.o
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
new file mode 100644
index 0000000..2e0d41a
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -0,0 +1,166 @@
+/*
+ * linux/fs/nfs/blocklayout/blocklayout.c
+ *
+ * Module for the NFSv4.1 pNFS block layout driver.
+ *
+ * Copyright (c) 2006 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Andy Adamson <[email protected]>
+ * Fred Isaman <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+
+#include "../pnfs.h"
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Andy Adamson <[email protected]>");
+MODULE_DESCRIPTION("The NFSv4.1 pNFS Block layout driver");
+
+static enum pnfs_try_status
+bl_read_pagelist(struct nfs_read_data *rdata)
+{
+ return PNFS_NOT_ATTEMPTED;
+}
+
+static enum pnfs_try_status
+bl_write_pagelist(struct nfs_write_data *wdata,
+ int sync)
+{
+ return PNFS_NOT_ATTEMPTED;
+}
+
+static void
+bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
+{
+}
+
+static struct pnfs_layout_hdr *
+bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
+{
+ return NULL;
+}
+
+static void
+bl_free_lseg(struct pnfs_layout_segment *lseg)
+{
+}
+
+static struct pnfs_layout_segment *
+bl_alloc_lseg(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
+{
+ return NULL;
+}
+
+static void
+bl_encode_layoutcommit(struct pnfs_layout_hdr *lo, struct xdr_stream *xdr,
+ const struct nfs4_layoutcommit_args *arg)
+{
+}
+
+static void
+bl_cleanup_layoutcommit(struct pnfs_layout_hdr *lo,
+ struct nfs4_layoutcommit_data *lcdata)
+{
+}
+
+static int
+bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
+{
+ dprintk("%s enter\n", __func__);
+ return 0;
+}
+
+static int
+bl_clear_layoutdriver(struct nfs_server *server)
+{
+ dprintk("%s enter\n", __func__);
+ return 0;
+}
+
+static int
+bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
+ unsigned count, struct pnfs_fsdata *fsdata)
+{
+ return 0;
+}
+
+static int
+bl_write_end(struct inode *inode, struct page *page, loff_t pos,
+ unsigned count, unsigned copied, struct pnfs_layout_segment *lseg)
+{
+ return 0;
+}
+
+/* Return any memory allocated to fsdata->private, and take advantage
+ * of no page locks to mark pages noted in write_begin as needing
+ * initialization.
+ */
+static void
+bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
+{
+}
+
+static struct pnfs_layoutdriver_type blocklayout_type = {
+ .id = LAYOUT_BLOCK_VOLUME,
+ .name = "LAYOUT_BLOCK_VOLUME",
+ .read_pagelist = bl_read_pagelist,
+ .write_pagelist = bl_write_pagelist,
+ .write_begin = bl_write_begin,
+ .write_end = bl_write_end,
+ .write_end_cleanup = bl_write_end_cleanup,
+ .alloc_layout_hdr = bl_alloc_layout_hdr,
+ .free_layout_hdr = bl_free_layout_hdr,
+ .alloc_lseg = bl_alloc_lseg,
+ .free_lseg = bl_free_lseg,
+ .encode_layoutcommit = bl_encode_layoutcommit,
+ .cleanup_layoutcommit = bl_cleanup_layoutcommit,
+ .set_layoutdriver = bl_set_layoutdriver,
+ .clear_layoutdriver = bl_clear_layoutdriver,
+ .pg_test = pnfs_generic_pg_test,
+};
+
+static int __init nfs4blocklayout_init(void)
+{
+ int ret;
+
+ dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);
+
+ ret = pnfs_register_layoutdriver(&blocklayout_type);
+ return ret;
+}
+
+static void __exit nfs4blocklayout_exit(void)
+{
+ dprintk("%s: NFSv4 Block Layout Driver Unregistering...\n",
+ __func__);
+
+ pnfs_unregister_layoutdriver(&blocklayout_type);
+}
+
+module_init(nfs4blocklayout_init);
+module_exit(nfs4blocklayout_exit);
--
1.7.4.1


2011-06-14 02:32:50

by Jim Rees

Subject: [PATCH 19/33] pnfsblock: xdr decode pnfs_block_layout4

From: Fred Isaman <[email protected]>

Decode the XDR block layout payload returned in the LAYOUTGET result and
store the decoded extents in an extent list.

Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
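
Note (not part of the commit): a minimal userspace sketch of decoding one
extent from the wire, assuming the layout implied by the
(28 + NFS4_DEVICEID4_SIZE) size used in the diff below: a 16-byte device
ID, three big-endian 64-bit byte counts, and a big-endian 32-bit state.
struct sketch_extent, get_be32(), get_be64() and decode_extent() are
hypothetical names, and unlike the kernel code this sketch does no
alignment or layout verification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEVID_SIZE	16			/* value of NFS4_DEVICEID4_SIZE */
#define EXTENT_XDR_SIZE	(DEVID_SIZE + 28)	/* 3 hypers + 1 u32 */

struct sketch_extent {
	uint8_t devid[DEVID_SIZE];
	uint64_t f_offset;	/* file offset, in 512-byte sectors */
	uint64_t length;	/* extent length, in 512-byte sectors */
	uint64_t v_offset;	/* volume offset, in 512-byte sectors */
	uint32_t state;
};

uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

uint64_t get_be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Decode one extent. Byte counts on the wire are stored as sector counts.
 * Returns the number of bytes consumed, or 0 if the buffer is too short.
 */
size_t decode_extent(const uint8_t *buf, size_t len, struct sketch_extent *ex)
{
	if (len < EXTENT_XDR_SIZE)
		return 0;
	memcpy(ex->devid, buf, DEVID_SIZE);
	buf += DEVID_SIZE;
	ex->f_offset = get_be64(buf) >> 9;
	ex->length = get_be64(buf + 8) >> 9;
	ex->v_offset = get_be64(buf + 16) >> 9;
	ex->state = get_be32(buf + 24);
	return EXTENT_XDR_SIZE;
}
```

Checking the buffer length up front mirrors the single xdr_inline_decode()
call for all extents in the patch, so the per-field reads cannot run off
the end of the buffer.
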
fs/nfs/blocklayout/blocklayoutdev.c | 191 ++++++++++++++++++++++++++++++++++-
1 files changed, 189 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayoutdev.c b/fs/nfs/blocklayout/blocklayoutdev.c
index 0fedf50..a90eb6b 100644
--- a/fs/nfs/blocklayout/blocklayoutdev.c
+++ b/fs/nfs/blocklayout/blocklayoutdev.c
@@ -150,10 +150,197 @@ out_err:
return NULL;
}

+/* Map deviceid returned by the server to constructed block_device */
+static struct block_device *translate_devid(struct pnfs_layout_hdr *lo,
+ struct nfs4_deviceid *id)
+{
+ struct block_device *rv = NULL;
+ struct block_mount_id *mid;
+ struct pnfs_block_dev *dev;
+
+ dprintk("%s enter, lo=%p, id=%p\n", __func__, lo, id);
+ mid = BLK_ID(lo);
+ spin_lock(&mid->bm_lock);
+ list_for_each_entry(dev, &mid->bm_devlist, bm_node) {
+ if (memcmp(id->data, dev->bm_mdevid.data,
+ NFS4_DEVICEID4_SIZE) == 0) {
+ rv = dev->bm_mdev;
+ goto out;
+ }
+ }
+ out:
+ spin_unlock(&mid->bm_lock);
+ dprintk("%s returning %p\n", __func__, rv);
+ return rv;
+}
+
+/* Tracks info needed to ensure extents in layout obey constraints of spec */
+struct layout_verification {
+ u32 mode; /* R or RW */
+ u64 start; /* Expected start of next non-COW extent */
+ u64 inval; /* Start of INVAL coverage */
+ u64 cowread; /* End of COW read coverage */
+};
+
+/* Verify the extent meets the layout requirements of the pnfs-block draft,
+ * section 2.3.1.
+ */
+static int verify_extent(struct pnfs_block_extent *be,
+ struct layout_verification *lv)
+{
+ if (lv->mode == IOMODE_READ) {
+ if (be->be_state == PNFS_BLOCK_READWRITE_DATA ||
+ be->be_state == PNFS_BLOCK_INVALID_DATA)
+ return -EIO;
+ if (be->be_f_offset != lv->start)
+ return -EIO;
+ lv->start += be->be_length;
+ return 0;
+ }
+ /* lv->mode == IOMODE_RW */
+ if (be->be_state == PNFS_BLOCK_READWRITE_DATA) {
+ if (be->be_f_offset != lv->start)
+ return -EIO;
+ if (lv->cowread > lv->start)
+ return -EIO;
+ lv->start += be->be_length;
+ lv->inval = lv->start;
+ return 0;
+ } else if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+ if (be->be_f_offset != lv->start)
+ return -EIO;
+ lv->start += be->be_length;
+ return 0;
+ } else if (be->be_state == PNFS_BLOCK_READ_DATA) {
+ if (be->be_f_offset > lv->start)
+ return -EIO;
+ if (be->be_f_offset < lv->inval)
+ return -EIO;
+ if (be->be_f_offset < lv->cowread)
+ return -EIO;
+ /* It looks like you might want to min this with lv->start,
+ * but you really don't.
+ */
+ lv->inval = lv->inval + be->be_length;
+ lv->cowread = be->be_f_offset + be->be_length;
+ return 0;
+ } else
+ return -EIO;
+}
+
+/* XDR decode pnfs_block_layout4 structure */
int
nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
struct nfs4_layoutget_res *lgr, gfp_t gfp_flags)
{
- /* STUB */
- return -EIO;
+ struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
+ int i, status = -EIO;
+ uint32_t count;
+ struct pnfs_block_extent *be = NULL, *save;
+ struct xdr_stream stream;
+ struct xdr_buf buf;
+ struct page *scratch;
+ __be32 *p;
+ uint64_t tmp; /* Used by READ_SECTOR */
+ struct layout_verification lv = {
+ .mode = lgr->range.iomode,
+ .start = lgr->range.offset >> 9,
+ .inval = lgr->range.offset >> 9,
+ .cowread = lgr->range.offset >> 9,
+ };
+ LIST_HEAD(extents);
+
+ dprintk("---> %s\n", __func__);
+
+ scratch = alloc_page(gfp_flags);
+ if (!scratch)
+ return -ENOMEM;
+
+ xdr_init_decode_pages(&stream, &buf, lgr->layoutp->pages, lgr->layoutp->len);
+ xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
+
+ p = xdr_inline_decode(&stream, 4);
+ if (unlikely(!p))
+ goto out_err;
+
+ READ32(count);
+
+ dprintk("%s enter, number of extents %u\n", __func__, count);
+ p = xdr_inline_decode(&stream, (28 + NFS4_DEVICEID4_SIZE) * count);
+ if (unlikely(!p))
+ goto out_err;
+
+ /* Decode individual extents, putting them in temporary
+ * staging area until whole layout is decoded to make error
+ * recovery easier.
+ */
+ for (i = 0; i < count; i++) {
+ be = alloc_extent();
+ if (!be) {
+ status = -ENOMEM;
+ goto out_err;
+ }
+ READ_DEVID(&be->be_devid);
+ be->be_mdev = translate_devid(lo, &be->be_devid);
+ if (!be->be_mdev)
+ goto out_err;
+
+ /* The next three values are read in as bytes,
+ * but stored as 512-byte sector lengths
+ */
+ READ_SECTOR(be->be_f_offset);
+ READ_SECTOR(be->be_length);
+ READ_SECTOR(be->be_v_offset);
+ READ32(be->be_state);
+ if (be->be_state == PNFS_BLOCK_INVALID_DATA)
+ be->be_inval = &bl->bl_inval;
+ if (verify_extent(be, &lv)) {
+ dprintk("%s verify failed\n", __func__);
+ goto out_err;
+ }
+ list_add_tail(&be->be_node, &extents);
+ }
+ if (lgr->range.offset + lgr->range.length != lv.start << 9) {
+ dprintk("%s Final length mismatch\n", __func__);
+ be = NULL;
+ goto out_err;
+ }
+ if (lv.start < lv.cowread) {
+ dprintk("%s Final uncovered COW extent\n", __func__);
+ be = NULL;
+ goto out_err;
+ }
+ /* Extents decoded properly, now try to merge them into
+ * existing layout extents.
+ */
+ spin_lock(&bl->bl_ext_lock);
+ list_for_each_entry_safe(be, save, &extents, be_node) {
+ list_del(&be->be_node);
+ status = add_and_merge_extent(bl, be);
+ if (status) {
+ spin_unlock(&bl->bl_ext_lock);
+ /* This is a fairly catastrophic error, as the
+ * entire layout extent lists are now corrupted.
+ * We should have some way to distinguish this.
+ */
+ be = NULL;
+ goto out_err;
+ }
+ }
+ spin_unlock(&bl->bl_ext_lock);
+ status = 0;
+ out:
+ __free_page(scratch);
+ dprintk("%s returns %i\n", __func__, status);
+ return status;
+
+ out_err:
+ put_extent(be);
+ while (!list_empty(&extents)) {
+ be = list_first_entry(&extents, struct pnfs_block_extent,
+ be_node);
+ list_del(&be->be_node);
+ put_extent(be);
+ }
+ goto out;
}
--
1.7.4.1


2011-06-14 02:32:53

by Jim Rees

Subject: [PATCH 20/33] pnfsblock: find_get_extent

From: Fred <[email protected]>

Implement find_get_extent(), one of the core extent manipulation
routines.

[pnfsblock: Lookup list entry of layouts and tags in reverse order]
Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>

pnfsblock: fix print format warnings for sector_t and size_t

gcc spews warnings about these on x86_64, e.g.:
fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’
fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’

Signed-off-by: Benny Halevy <[email protected]>
---
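
Note (not part of the commit): the reverse-order lookup in
find_get_extent() can be sketched in userspace over a sorted array. This
is an illustration only: struct sketch_extent and find_extent() are
hypothetical names, and the sketch covers a single list with no locking,
reference counting, or cow_read handling.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_extent {
	uint64_t f_offset;	/* first sector covered */
	uint64_t length;	/* number of sectors covered */
};

/* exts[] is sorted by f_offset and its entries do not overlap. Walk from
 * the tail and stop as soon as an extent ends at or before isect: every
 * earlier extent ends even sooner, so none of them can match.
 */
const struct sketch_extent *find_extent(const struct sketch_extent *exts,
					size_t n, uint64_t isect)
{
	size_t i = n;

	while (i-- > 0) {
		const struct sketch_extent *be = &exts[i];

		if (isect >= be->f_offset + be->length)
			break;
		if (isect >= be->f_offset)
			return be;
	}
	return NULL;
}
```

Walking from the tail favours lookups near the end of the file, which is
presumably the common case for appending writes; that motivation is a
guess and is not stated in the commit message.
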
fs/nfs/blocklayout/blocklayout.h | 3 ++
fs/nfs/blocklayout/extents.c | 47 ++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 0a54feb..04eb6de 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -219,6 +219,9 @@ int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
/* blocklayoutdm.c */
void free_block_dev(struct pnfs_block_dev *bdev);
/* extents.c */
+struct pnfs_block_extent *
+find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
+ struct pnfs_block_extent **cow_read);
void put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *alloc_extent(void);
int add_and_merge_extent(struct pnfs_block_layout *bl,
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 26c263f..f0b3f13 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -201,3 +201,50 @@ add_and_merge_extent(struct pnfs_block_layout *bl,
put_extent(new);
return -EIO;
}
+
+/* Returns extent, or NULL. If a second READ extent exists, it is returned
+ * in cow_read, if given.
+ *
+ * The extents are kept in two separate ordered lists, one for READ and NONE,
+ * one for READWRITE and INVALID. Within each list, we assume:
+ * 1. Extents are ordered by file offset.
+ * 2. For any given isect, there is at most one extent that matches.
+ */
+struct pnfs_block_extent *
+find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
+ struct pnfs_block_extent **cow_read)
+{
+ struct pnfs_block_extent *be, *cow, *ret;
+ int i;
+
+ dprintk("%s enter with isect %llu\n", __func__, (u64)isect);
+ cow = ret = NULL;
+ spin_lock(&bl->bl_ext_lock);
+ for (i = 0; i < EXTENT_LISTS; i++) {
+ if (ret &&
+ (!cow_read || ret->be_state != PNFS_BLOCK_INVALID_DATA))
+ break;
+ list_for_each_entry_reverse(be, &bl->bl_extents[i], be_node) {
+ if (isect >= be->be_f_offset + be->be_length)
+ break;
+ if (isect >= be->be_f_offset) {
+ /* We have found an extent */
+ dprintk("%s Get %p (%i)\n", __func__, be,
+ atomic_read(&be->be_refcnt.refcount));
+ kref_get(&be->be_refcnt);
+ if (!ret)
+ ret = be;
+ else if (be->be_state != PNFS_BLOCK_READ_DATA)
+ put_extent(be);
+ else
+ cow = be;
+ break;
+ }
+ }
+ }
+ spin_unlock(&bl->bl_ext_lock);
+ if (cow_read)
+ *cow_read = cow;
+ print_bl_extent(ret);
+ return ret;
+}
--
1.7.4.1


2011-06-14 02:32:48

by Jim Rees

Subject: [PATCH 18/33] pnfsblock: allow use of PG_owner_priv_1 flag

From: Fred Isaman <[email protected]>

There is currently no good way for pnfs to communicate problems. For
example, the Linux read code first tries to do readahead through
nfs_readpages. Failure there is ignored, and it will later call
nfs_readpage. Failure there is also ignored, except that the lack of
PG_uptodate is communicated back via -EIO.

With pnfs, it would be useful to be able to communicate to
nfs_readpage that direct disk IO failed on readahead, and that it
should fail over to using the MDS.

Making the page flag PG_owner_priv_1 available as PG_pnfserr is one
way to do so. (An alternative would be to embed this in the layout,
but then pg_test can't easily access the info.)

This may be better as generic pnfs code, in which case it should be
put in pnfs.h, or even page-flags.h.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
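
Note (not part of the commit): a minimal userspace sketch of how a
per-page error bit could steer a later read to the MDS, as described
above. Everything here is hypothetical: struct sketch_page, the bit
number, and choose_read_path() are illustrative names, and the plain bit
operations are not atomic, unlike the kernel's test_bit() and set_bit().

```c
#include <stdbool.h>

#define SKETCH_PG_PNFSERR 0	/* hypothetical bit number */

struct sketch_page {
	unsigned long flags;
};

bool sketch_page_pnfs_err(const struct sketch_page *page)
{
	return (page->flags >> SKETCH_PG_PNFSERR) & 1UL;
}

void sketch_set_page_pnfs_err(struct sketch_page *page)
{
	page->flags |= 1UL << SKETCH_PG_PNFSERR;
}

void sketch_clear_page_pnfs_err(struct sketch_page *page)
{
	page->flags &= ~(1UL << SKETCH_PG_PNFSERR);
}

enum read_path {
	READ_FROM_DISK,	/* direct block IO through the layout */
	READ_FROM_MDS,	/* ordinary NFS read through the server */
};

/* A page whose readahead failed on the direct path is retried via the MDS. */
enum read_path choose_read_path(const struct sketch_page *page)
{
	return sketch_page_pnfs_err(page) ? READ_FROM_MDS : READ_FROM_DISK;
}
```

The readahead path would set the bit on failure, and the single-page read
path would test it before deciding whether to attempt direct disk IO again.
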
fs/nfs/blocklayout/blocklayout.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index a596e75..0a54feb 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -35,6 +35,11 @@
#include <linux/nfs_fs.h>
#include "../pnfs.h"

+#define PG_pnfserr PG_owner_priv_1
+#define PagePnfsErr(page) test_bit(PG_pnfserr, &(page)->flags)
+#define SetPagePnfsErr(page) set_bit(PG_pnfserr, &(page)->flags)
+#define ClearPagePnfsErr(page) clear_bit(PG_pnfserr, &(page)->flags)
+
struct block_mount_id {
spinlock_t bm_lock; /* protects list */
struct list_head bm_devlist; /* holds pnfs_block_dev */
--
1.7.4.1


2011-06-14 02:32:24

by Jim Rees

Subject: [PATCH 09/33] pnfsblock: layout alloc and free

From: Fred Isaman <[email protected]>

Allocate the empty list-heads that will hold all the extent data
for the layout.

Signed-off-by: Fred Isaman <[email protected]>
[pnfs: move pnfs_layout_type inline in nfs_inode]
Signed-off-by: Benny Halevy <[email protected]>
---
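
Note (not part of the commit): BLK_LO2EXT() relies on the generic layout
header being embedded in the driver's own structure, so the driver can
hand out a pointer to the header and later recover the enclosing object.
A minimal userspace sketch of that pattern, with hypothetical names
(sketch_hdr, sketch_block_layout, SKETCH_CONTAINER_OF) standing in for the
kernel types and container_of():

```c
#include <stddef.h>

/* Hypothetical stand-in for the generic layout header. */
struct sketch_hdr {
	int refcount;
};

/* Hypothetical stand-in for the driver-private layout. */
struct sketch_block_layout {
	unsigned int count;
	struct sketch_hdr hdr;	/* embedded, not pointed to */
};

/* Simplified form of the kernel's container_of(), without type checking. */
#define SKETCH_CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the driver-private layout from a pointer to its embedded header. */
struct sketch_block_layout *hdr_to_block_layout(struct sketch_hdr *lo)
{
	return SKETCH_CONTAINER_OF(lo, struct sketch_block_layout, hdr);
}
```

This is why bl_alloc_layout_hdr() can return &bl->bl_layout while
bl_free_layout_hdr() frees the whole pnfs_block_layout.
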
fs/nfs/blocklayout/blocklayout.c | 44 ++++++++++++++++--
fs/nfs/blocklayout/blocklayout.h | 90 ++++++++++++++++++++++++++++++++++++++
2 files changed, 129 insertions(+), 5 deletions(-)
create mode 100644 fs/nfs/blocklayout/blocklayout.h

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 2e0d41a..08458c6 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -32,7 +32,7 @@
#include <linux/module.h>
#include <linux/init.h>

-#include "../pnfs.h"
+#include "blocklayout.h"

#define NFSDBG_FACILITY NFSDBG_PNFS_LD

@@ -53,15 +53,49 @@ bl_write_pagelist(struct nfs_write_data *wdata,
return PNFS_NOT_ATTEMPTED;
}

+/* STUB */
static void
-bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
+release_extents(struct pnfs_block_layout *bl,
+ struct pnfs_layout_range *range)
{
+ return;
}

-static struct pnfs_layout_hdr *
-bl_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
+/* STUB */
+static void
+release_inval_marks(struct pnfs_inval_markings *marks)
{
- return NULL;
+ return;
+}
+
+static void bl_free_layout_hdr(struct pnfs_layout_hdr *lo)
+{
+ struct pnfs_block_layout *bl = BLK_LO2EXT(lo);
+
+ dprintk("%s enter\n", __func__);
+ release_extents(bl, NULL);
+ release_inval_marks(&bl->bl_inval);
+ kfree(bl);
+}
+
+static struct pnfs_layout_hdr *bl_alloc_layout_hdr(struct inode *inode,
+ gfp_t gfp_flags)
+{
+ struct pnfs_block_layout *bl;
+
+ dprintk("%s enter\n", __func__);
+ bl = kzalloc(sizeof(*bl), gfp_flags);
+ if (!bl)
+ return NULL;
+ spin_lock_init(&bl->bl_ext_lock);
+ INIT_LIST_HEAD(&bl->bl_extents[0]);
+ INIT_LIST_HEAD(&bl->bl_extents[1]);
+ INIT_LIST_HEAD(&bl->bl_commit);
+ INIT_LIST_HEAD(&bl->bl_committing);
+ bl->bl_count = 0;
+ bl->bl_blocksize = NFS_SERVER(inode)->pnfs_blksize >> 9;
+ INIT_INVAL_MARKS(&bl->bl_inval, bl->bl_blocksize);
+ return &bl->bl_layout;
}

static void
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
new file mode 100644
index 0000000..49d69c7
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -0,0 +1,90 @@
+/*
+ * linux/fs/nfs/blocklayout/blocklayout.h
+ *
+ * Module for the NFSv4.1 pNFS block layout driver.
+ *
+ * Copyright (c) 2006 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Andy Adamson <[email protected]>
+ * Fred Isaman <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+#ifndef FS_NFS_NFS4BLOCKLAYOUT_H
+#define FS_NFS_NFS4BLOCKLAYOUT_H
+
+#include <linux/nfs_fs.h>
+#include "../pnfs.h"
+
+enum exstate4 {
+ PNFS_BLOCK_READWRITE_DATA = 0,
+ PNFS_BLOCK_READ_DATA = 1,
+ PNFS_BLOCK_INVALID_DATA = 2, /* mapped, but data is invalid */
+ PNFS_BLOCK_NONE_DATA = 3 /* unmapped, it's a hole */
+};
+
+struct pnfs_inval_markings {
+ /* STUB */
+};
+
+/* sector_t fields are all in 512-byte sectors */
+struct pnfs_block_extent {
+ struct kref be_refcnt;
+ struct list_head be_node; /* link into lseg list */
+ struct nfs4_deviceid be_devid; /* STUB - removable??? */
+ struct block_device *be_mdev;
+ sector_t be_f_offset; /* the starting offset in the file */
+ sector_t be_length; /* the size of the extent */
+ sector_t be_v_offset; /* the starting offset in the volume */
+ enum exstate4 be_state; /* the state of this extent */
+ struct pnfs_inval_markings *be_inval; /* tracks INVAL->RW transition */
+};
+
+static inline void
+INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
+{
+ /* STUB */
+}
+
+enum extentclass4 {
+ RW_EXTENT = 0, /* READWRITE and INVAL */
+ RO_EXTENT = 1, /* READ and NONE */
+ EXTENT_LISTS = 2,
+};
+
+struct pnfs_block_layout {
+ struct pnfs_layout_hdr bl_layout;
+ struct pnfs_inval_markings bl_inval; /* tracks INVAL->RW transition */
+ spinlock_t bl_ext_lock; /* Protects list manipulation */
+ struct list_head bl_extents[EXTENT_LISTS]; /* R and RW extents */
+ struct list_head bl_commit; /* Needs layout commit */
+ struct list_head bl_committing; /* Layout committing */
+ unsigned int bl_count; /* entries in bl_commit */
+ sector_t bl_blocksize; /* Server blocksize in sectors */
+};
+
+static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
+{
+ return container_of(lo, struct pnfs_block_layout, bl_layout);
+}
+
+#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
--
1.7.4.1
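
The extent fields declared in blocklayout.h above (be_f_offset, be_length, be_v_offset, all in 512-byte sectors) imply a simple file-to-volume translation. The following is a minimal userspace sketch of that arithmetic, not code from the series: struct extent, bytes_to_sectors(), extent_contains(), extent_map() and extent_remaining() are hypothetical names, and sector_t is assumed to be a 64-bit unsigned type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's sector_t. */
typedef uint64_t sector_t;

/* Only the offset/length fields of struct pnfs_block_extent. */
struct extent {
	sector_t f_offset;	/* starting offset in the file */
	sector_t length;	/* size of the extent */
	sector_t v_offset;	/* starting offset in the volume */
};

/* Convert a byte count to 512-byte sectors (the ">> 9" in the patch). */
static sector_t bytes_to_sectors(uint64_t bytes)
{
	return bytes >> 9;
}

/* True if file sector isect falls inside the extent. */
static bool extent_contains(const struct extent *be, sector_t isect)
{
	return isect >= be->f_offset && isect - be->f_offset < be->length;
}

/* Translate a file sector to the matching volume sector. */
static sector_t extent_map(const struct extent *be, sector_t isect)
{
	assert(extent_contains(be, isect));
	return isect - be->f_offset + be->v_offset;
}

/* Number of sectors left in the extent, counted from isect. */
static sector_t extent_remaining(const struct extent *be, sector_t isect)
{
	assert(extent_contains(be, isect));
	return be->length - (isect - be->f_offset);
}
```

The same two expressions appear later in the series when a bio's starting sector and the remaining extent length are computed; the bounds check here is an addition for the sketch.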


2011-06-14 02:32:28

by Jim Rees

Subject: [PATCH 10/33] pnfsblock: add support for simple rpc pipefs

From: Fred Isaman <[email protected]>

Signed-off-by: Eric Anderle <[email protected]>
Signed-off-by: Jim Rees <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
move include lines out of include file
Signed-off-by: Jim Rees <[email protected]>
[This patch does *not* break the header's independence]
Signed-off-by: Boaz Harrosh <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
include/linux/sunrpc/simple_rpc_pipefs.h | 105 ++++++++
net/sunrpc/simple_rpc_pipefs.c | 423 ++++++++++++++++++++++++++++++
2 files changed, 528 insertions(+), 0 deletions(-)
create mode 100644 include/linux/sunrpc/simple_rpc_pipefs.h
create mode 100644 net/sunrpc/simple_rpc_pipefs.c

diff --git a/include/linux/sunrpc/simple_rpc_pipefs.h b/include/linux/sunrpc/simple_rpc_pipefs.h
new file mode 100644
index 0000000..f6a1227
--- /dev/null
+++ b/include/linux/sunrpc/simple_rpc_pipefs.h
@@ -0,0 +1,105 @@
+/*
+ * Copyright (c) 2008 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * David M. Richter <[email protected]>
+ *
+ * Drawing on work done by Andy Adamson <[email protected]> and
+ * Marius Eriksen <[email protected]>. Thanks for the help over the
+ * years, guys.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * With thanks to CITI's project sponsor and partner, IBM.
+ */
+
+#ifndef _SIMPLE_RPC_PIPEFS_H_
+#define _SIMPLE_RPC_PIPEFS_H_
+
+#include <linux/sunrpc/rpc_pipe_fs.h>
+
+#define payload_of(headerp) ((void *)((headerp) + 1))
+
+/*
+ * struct pipefs_hdr -- the generic message format for simple_rpc_pipefs.
+ * Messages may simply be the header itself, although having an optional
+ * data payload follow the header allows much more flexibility.
+ *
+ * Messages are created using pipefs_alloc_init_msg() and
+ * pipefs_alloc_init_msg_padded(), both of which accept a pointer to an
+ * (optional) data payload.
+ *
+ * Given a struct pipefs_hdr *msg that has a struct foo payload, the data
+ * can be accessed using: struct foo *foop = payload_of(msg)
+ */
+struct pipefs_hdr {
+ u32 msgid;
+ u8 type;
+ u8 flags;
+ u16 totallen; /* length of entire message, including hdr itself */
+ u32 status;
+};
+
+/*
+ * struct pipefs_list -- a type of list used for tracking callers who've made an
+ * upcall and are blocked waiting for a reply.
+ *
+ * See pipefs_queue_upcall_waitreply() and pipefs_assign_upcall_reply().
+ */
+struct pipefs_list {
+ struct list_head list;
+ spinlock_t list_lock;
+};
+
+
+/* See net/sunrpc/simple_rpc_pipefs.c for more info on using these functions. */
+extern struct dentry *pipefs_mkpipe(const char *name,
+ const struct rpc_pipe_ops *ops,
+ int wait_for_open);
+extern void pipefs_closepipe(struct dentry *pipe);
+extern void pipefs_init_list(struct pipefs_list *list);
+extern struct pipefs_hdr *pipefs_alloc_init_msg(u32 msgid, u8 type, u8 flags,
+ void *data, u16 datalen);
+extern struct pipefs_hdr *pipefs_alloc_init_msg_padded(u32 msgid, u8 type,
+ u8 flags, void *data,
+ u16 datalen, u16 padlen);
+extern struct pipefs_hdr *pipefs_queue_upcall_waitreply(struct dentry *pipe,
+ struct pipefs_hdr *msg,
+ struct pipefs_list
+ *uplist, u8 upflags,
+ u32 timeout);
+extern int pipefs_queue_upcall_noreply(struct dentry *pipe,
+ struct pipefs_hdr *msg, u8 upflags);
+extern int pipefs_assign_upcall_reply(struct pipefs_hdr *reply,
+ struct pipefs_list *uplist);
+extern struct pipefs_hdr *pipefs_readmsg(struct file *filp,
+ const char __user *src, size_t len);
+extern ssize_t pipefs_generic_upcall(struct file *filp,
+ struct rpc_pipe_msg *rpcmsg,
+ char __user *dst, size_t buflen);
+extern void pipefs_generic_destroy_msg(struct rpc_pipe_msg *rpcmsg);
+
+#endif /* _SIMPLE_RPC_PIPEFS_H_ */
diff --git a/net/sunrpc/simple_rpc_pipefs.c b/net/sunrpc/simple_rpc_pipefs.c
new file mode 100644
index 0000000..24af0a1
--- /dev/null
+++ b/net/sunrpc/simple_rpc_pipefs.c
@@ -0,0 +1,423 @@
+/*
+ * net/sunrpc/simple_rpc_pipefs.c
+ *
+ * Copyright (c) 2008 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * David M. Richter <[email protected]>
+ *
+ * Drawing on work done by Andy Adamson <[email protected]> and
+ * Marius Eriksen <[email protected]>. Thanks for the help over the
+ * years, guys.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * With thanks to CITI's project sponsor and partner, IBM.
+ */
+
+#include <linux/mount.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/sunrpc/simple_rpc_pipefs.h>
+
+
+/*
+ * Make an rpc_pipefs pipe named @name at the root of the mounted rpc_pipefs
+ * filesystem.
+ *
+ * If @wait_for_open is non-zero and an upcall is later queued but the userland
+ * end of the pipe has not yet been opened, the upcall will remain queued until
+ * the pipe is opened; otherwise, the upcall queueing will return with -EPIPE.
+ */
+struct dentry *pipefs_mkpipe(const char *name, const struct rpc_pipe_ops *ops,
+ int wait_for_open)
+{
+ struct dentry *dir, *pipe;
+ struct vfsmount *mnt;
+
+ mnt = rpc_get_mount();
+ if (IS_ERR(mnt)) {
+ pipe = ERR_CAST(mnt);
+ goto out;
+ }
+ dir = mnt->mnt_root;
+ if (!dir) {
+ pipe = ERR_PTR(-ENOENT);
+ goto out;
+ }
+ pipe = rpc_mkpipe(dir, name, NULL, ops,
+ wait_for_open ? RPC_PIPE_WAIT_FOR_OPEN : 0);
+out:
+ return pipe;
+}
+EXPORT_SYMBOL(pipefs_mkpipe);
+
+/*
+ * Shutdown a pipe made by pipefs_mkpipe().
+ * XXX: do we need to retain an extra reference on the mount?
+ */
+void pipefs_closepipe(struct dentry *pipe)
+{
+ rpc_unlink(pipe);
+ rpc_put_mount();
+}
+EXPORT_SYMBOL(pipefs_closepipe);
+
+/*
+ * Initialize a struct pipefs_list, which keeps track of callers that have
+ * made an upcall and are blocked awaiting a reply.
+ *
+ * See pipefs_queue_upcall_waitreply() and pipefs_assign_upcall_reply() for
+ * how to use them.
+ */
+inline void pipefs_init_list(struct pipefs_list *list)
+{
+ INIT_LIST_HEAD(&list->list);
+ spin_lock_init(&list->list_lock);
+}
+EXPORT_SYMBOL(pipefs_init_list);
+
+/*
+ * Alloc/init a generic pipefs message header and copy into its message body
+ * an arbitrary data payload.
+ *
+ * struct pipefs_hdr's are meant to serve as generic, general-purpose message
+ * headers for easy rpc_pipefs I/O. When an upcall is made, the
+ * struct pipefs_hdr is assigned to a struct rpc_pipe_msg and delivered
+ * therein. --And yes, the naming can seem a little confusing at first:
+ *
+ * When one thinks of an upcall "message", in simple_rpc_pipefs that's a
+ * struct pipefs_hdr (possibly with an attached message body). A
+ * struct rpc_pipe_msg is actually only the -vehicle- by which the "real"
+ * message is delivered and processed.
+ */
+struct pipefs_hdr *pipefs_alloc_init_msg_padded(u32 msgid, u8 type, u8 flags,
+ void *data, u16 datalen, u16 padlen)
+{
+ u16 totallen;
+ struct pipefs_hdr *msg = NULL;
+
+ totallen = sizeof(*msg) + datalen + padlen;
+ if (totallen > PAGE_SIZE) {
+ msg = ERR_PTR(-E2BIG);
+ goto out;
+ }
+
+ msg = kzalloc(totallen, GFP_KERNEL);
+ if (!msg) {
+ msg = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+
+ msg->msgid = msgid;
+ msg->type = type;
+ msg->flags = flags;
+ msg->totallen = totallen;
+ memcpy(payload_of(msg), data, datalen);
+out:
+ return msg;
+}
+EXPORT_SYMBOL(pipefs_alloc_init_msg_padded);
+
+/*
+ * See the description of pipefs_alloc_init_msg_padded().
+ */
+struct pipefs_hdr *pipefs_alloc_init_msg(u32 msgid, u8 type, u8 flags,
+ void *data, u16 datalen)
+{
+ return pipefs_alloc_init_msg_padded(msgid, type, flags, data,
+ datalen, 0);
+}
+EXPORT_SYMBOL(pipefs_alloc_init_msg);
+
+
+static void pipefs_init_rpcmsg(struct rpc_pipe_msg *rpcmsg,
+ struct pipefs_hdr *msg, u8 upflags)
+{
+ memset(rpcmsg, 0, sizeof(*rpcmsg));
+ rpcmsg->data = msg;
+ rpcmsg->len = msg->totallen;
+ rpcmsg->flags = upflags;
+}
+
+static struct rpc_pipe_msg *pipefs_alloc_init_rpcmsg(struct pipefs_hdr *msg,
+ u8 upflags)
+{
+ struct rpc_pipe_msg *rpcmsg;
+
+ rpcmsg = kmalloc(sizeof(*rpcmsg), GFP_KERNEL);
+ if (!rpcmsg)
+ return ERR_PTR(-ENOMEM);
+
+ pipefs_init_rpcmsg(rpcmsg, msg, upflags);
+ return rpcmsg;
+}
+
+
+/* represents an upcall that'll block and wait for a reply */
+struct pipefs_upcall {
+ u32 msgid;
+ struct rpc_pipe_msg rpcmsg;
+ struct list_head list;
+ wait_queue_head_t waitq;
+ struct pipefs_hdr *reply;
+};
+
+
+static void pipefs_init_upcall_waitreply(struct pipefs_upcall *upcall,
+ struct pipefs_hdr *msg, u8 upflags)
+{
+ upcall->reply = NULL;
+ upcall->msgid = msg->msgid;
+ INIT_LIST_HEAD(&upcall->list);
+ init_waitqueue_head(&upcall->waitq);
+ pipefs_init_rpcmsg(&upcall->rpcmsg, msg, upflags);
+}
+
+static int __pipefs_queue_upcall_waitreply(struct dentry *pipe,
+ struct pipefs_upcall *upcall,
+ struct pipefs_list *uplist,
+ u32 timeout)
+{
+ int err = 0;
+ DECLARE_WAITQUEUE(wq, current);
+
+ add_wait_queue(&upcall->waitq, &wq);
+ spin_lock(&uplist->list_lock);
+ list_add(&upcall->list, &uplist->list);
+ spin_unlock(&uplist->list_lock);
+
+ err = rpc_queue_upcall(pipe->d_inode, &upcall->rpcmsg);
+ if (err < 0)
+ goto out;
+
+ if (timeout) {
+ /* retval of 0 means timer expired */
+ err = schedule_timeout_uninterruptible(timeout);
+ if (err == 0 && upcall->reply == NULL)
+ err = -ETIMEDOUT;
+ } else {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule();
+ __set_current_state(TASK_RUNNING);
+ }
+
+out:
+ spin_lock(&uplist->list_lock);
+ list_del_init(&upcall->list);
+ spin_unlock(&uplist->list_lock);
+ remove_wait_queue(&upcall->waitq, &wq);
+ return err;
+}
+
+/*
+ * Queue a pipefs msg for an upcall to userspace, place the calling thread
+ * on @uplist, and block the thread to wait for a reply. If @timeout is
+ * nonzero, the thread will be blocked for at most @timeout jiffies.
+ *
+ * (To convert time units into jiffies, consider the functions
+ * msecs_to_jiffies(), usecs_to_jiffies(), timeval_to_jiffies(), and
+ * timespec_to_jiffies().)
+ *
+ * Once a reply is received by your downcall handler, call
+ * pipefs_assign_upcall_reply() with @uplist to find the corresponding upcall,
+ * assign the reply, and wake the waiting thread.
+ *
+ * This function's return value pointer may be an error and should be checked
+ * with IS_ERR() before attempting to access the reply message.
+ *
+ * Callers are responsible for freeing @msg, unless pipefs_generic_destroy_msg()
+ * is used as the ->destroy_msg() callback and the PIPEFS_AUTOFREE_UPCALL_MSG
+ * flag is set in @upflags. See also rpc_pipe_fs.h.
+ */
+struct pipefs_hdr *pipefs_queue_upcall_waitreply(struct dentry *pipe,
+ struct pipefs_hdr *msg,
+ struct pipefs_list *uplist,
+ u8 upflags, u32 timeout)
+{
+ int err = 0;
+ struct pipefs_upcall upcall;
+
+ pipefs_init_upcall_waitreply(&upcall, msg, upflags);
+ err = __pipefs_queue_upcall_waitreply(pipe, &upcall, uplist, timeout);
+ if (err < 0) {
+ kfree(upcall.reply);
+ upcall.reply = ERR_PTR(err);
+ }
+
+ return upcall.reply;
+}
+EXPORT_SYMBOL(pipefs_queue_upcall_waitreply);
+
+/*
+ * Queue a pipefs msg for an upcall to userspace and immediately return (i.e.,
+ * no reply is expected).
+ *
+ * Callers are responsible for freeing @msg, unless pipefs_generic_destroy_msg()
+ * is used as the ->destroy_msg() callback and the PIPEFS_AUTOFREE_UPCALL_MSG
+ * flag is set in @upflags. See also rpc_pipe_fs.h.
+ */
+int pipefs_queue_upcall_noreply(struct dentry *pipe, struct pipefs_hdr *msg,
+ u8 upflags)
+{
+ int err = 0;
+ struct rpc_pipe_msg *rpcmsg;
+
+ upflags |= PIPEFS_AUTOFREE_RPCMSG;
+ rpcmsg = pipefs_alloc_init_rpcmsg(msg, upflags);
+ if (IS_ERR(rpcmsg)) {
+ err = PTR_ERR(rpcmsg);
+ goto out;
+ }
+ err = rpc_queue_upcall(pipe->d_inode, rpcmsg);
+out:
+ return err;
+}
+EXPORT_SYMBOL(pipefs_queue_upcall_noreply);
+
+
+static struct pipefs_upcall *pipefs_find_upcall_msgid(u32 msgid,
+ struct pipefs_list *uplist)
+{
+ struct pipefs_upcall *upcall;
+
+ spin_lock(&uplist->list_lock);
+ list_for_each_entry(upcall, &uplist->list, list)
+ if (upcall->msgid == msgid)
+ goto out;
+ upcall = NULL;
+out:
+ spin_unlock(&uplist->list_lock);
+ return upcall;
+}
+
+/*
+ * In your rpc_pipe_ops->downcall() handler, once you've read in a downcall
+ * message and have determined that it is a reply to a waiting upcall,
+ * you can use this function to find the appropriate upcall, assign the result,
+ * and wake the upcall thread.
+ *
+ * The reply message must have the same msgid as the original upcall message's.
+ *
+ * See also pipefs_queue_upcall_waitreply() and pipefs_readmsg().
+ */
+int pipefs_assign_upcall_reply(struct pipefs_hdr *reply,
+ struct pipefs_list *uplist)
+{
+ int err = 0;
+ struct pipefs_upcall *upcall;
+
+ upcall = pipefs_find_upcall_msgid(reply->msgid, uplist);
+ if (!upcall) {
+ printk(KERN_ERR "%s: ERROR: have reply but no matching upcall "
+ "for msgid %d\n", __func__, reply->msgid);
+ err = -ENOENT;
+ goto out;
+ }
+ upcall->reply = reply;
+ wake_up(&upcall->waitq);
+out:
+ return err;
+}
+EXPORT_SYMBOL(pipefs_assign_upcall_reply);
+
+/*
+ * Generic method to read-in and return a newly-allocated message which begins
+ * with a struct pipefs_hdr.
+ */
+struct pipefs_hdr *pipefs_readmsg(struct file *filp, const char __user *src,
+ size_t len)
+{
+ int err = 0, hdrsize;
+ struct pipefs_hdr *msg = NULL;
+
+ hdrsize = sizeof(*msg);
+ if (len < hdrsize) {
+ printk(KERN_ERR "%s: ERROR: header is too short (%d vs %d)\n",
+ __func__, (int) len, hdrsize);
+ err = -EINVAL;
+ goto out;
+ }
+
+ msg = kzalloc(len, GFP_KERNEL);
+ if (!msg) {
+ err = -ENOMEM;
+ goto out;
+ }
+ if (copy_from_user(msg, src, len))
+ err = -EFAULT;
+out:
+ if (err) {
+ kfree(msg);
+ msg = ERR_PTR(err);
+ }
+ return msg;
+}
+EXPORT_SYMBOL(pipefs_readmsg);
+
+/*
+ * Generic rpc_pipe_ops->upcall() handler implementation.
+ *
+ * Don't call this directly: to make an upcall, use
+ * pipefs_queue_upcall_waitreply() or pipefs_queue_upcall_noreply().
+ */
+ssize_t pipefs_generic_upcall(struct file *filp, struct rpc_pipe_msg *rpcmsg,
+ char __user *dst, size_t buflen)
+{
+ char *data;
+ ssize_t len, left;
+
+ data = (char *)rpcmsg->data + rpcmsg->copied;
+ len = rpcmsg->len - rpcmsg->copied;
+ if (len > buflen)
+ len = buflen;
+
+ left = copy_to_user(dst, data, len);
+ if (len > 0 && left == len) {
+ rpcmsg->errno = -EFAULT;
+ return -EFAULT;
+ }
+
+ len -= left;
+ rpcmsg->copied += len;
+ rpcmsg->errno = 0;
+ return len;
+}
+EXPORT_SYMBOL(pipefs_generic_upcall);
+
+/*
+ * Generic rpc_pipe_ops->destroy_msg() handler implementation.
+ *
+ * Items are only freed if @rpcmsg->flags has been set appropriately.
+ * See pipefs_queue_upcall_noreply() and rpc_pipe_fs.h.
+ */
+void pipefs_generic_destroy_msg(struct rpc_pipe_msg *rpcmsg)
+{
+ if (rpcmsg->flags & PIPEFS_AUTOFREE_UPCALL_MSG)
+ kfree(rpcmsg->data);
+ if (rpcmsg->flags & PIPEFS_AUTOFREE_RPCMSG)
+ kfree(rpcmsg);
+}
+EXPORT_SYMBOL(pipefs_generic_destroy_msg);
--
1.7.4.1
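
The message layout described in simple_rpc_pipefs.h (a fixed header followed by an optional payload reached through payload_of()) can be tried out in userspace. This is only a sketch under stated assumptions: msg_alloc() is a hypothetical analogue of pipefs_alloc_init_msg_padded(), MSG_MAX stands in for PAGE_SIZE, failure is reported as NULL rather than ERR_PTR(), and the lengths are summed in size_t before the bound check, which is a choice made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MSG_MAX 4096	/* assumed stand-in for PAGE_SIZE */

/* Userspace mirror of struct pipefs_hdr from the patch. */
struct pipefs_hdr {
	uint32_t msgid;
	uint8_t type;
	uint8_t flags;
	uint16_t totallen;	/* length of entire message, including header */
	uint32_t status;
};

/* The payload starts immediately after the header. */
#define payload_of(headerp) ((void *)((headerp) + 1))

/*
 * Hypothetical userspace analogue of pipefs_alloc_init_msg_padded().
 * Returns NULL on failure instead of an ERR_PTR().
 */
static struct pipefs_hdr *msg_alloc(uint32_t msgid, uint8_t type,
				    uint8_t flags, const void *data,
				    uint16_t datalen, uint16_t padlen)
{
	/* Sum in size_t so the bound check cannot wrap at 16 bits. */
	size_t totallen = sizeof(struct pipefs_hdr) + datalen + padlen;
	struct pipefs_hdr *msg;

	if (totallen > MSG_MAX)
		return NULL;
	msg = calloc(1, totallen);
	if (!msg)
		return NULL;
	msg->msgid = msgid;
	msg->type = type;
	msg->flags = flags;
	msg->totallen = (uint16_t)totallen;
	if (data && datalen)
		memcpy(payload_of(msg), data, datalen);
	return msg;
}
```

A caller would read the payload back with payload_of(msg) and release the message with free(); the padding bytes stay zeroed because the buffer comes from calloc().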


2011-06-14 02:32:30

by Jim Rees

Subject: [PATCH 11/33] pnfsblock: add block device discovery pipe

Signed-off-by: Eric Anderle <[email protected]>
Signed-off-by: Jim Rees <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/Makefile | 2 +-
fs/nfs/blocklayout/block-device-discovery-pipe.c | 66 ++++++++++++++++++++++
fs/nfs/blocklayout/blocklayout.c | 3 +
fs/nfs/blocklayout/blocklayout.h | 14 +++++
4 files changed, 84 insertions(+), 1 deletions(-)
create mode 100644 fs/nfs/blocklayout/block-device-discovery-pipe.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index 6bf49cd..d2bcd81 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
# Makefile for the pNFS block layout driver kernel module
#
obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o
+blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o
diff --git a/fs/nfs/blocklayout/block-device-discovery-pipe.c b/fs/nfs/blocklayout/block-device-discovery-pipe.c
new file mode 100644
index 0000000..e4c199f
--- /dev/null
+++ b/fs/nfs/blocklayout/block-device-discovery-pipe.c
@@ -0,0 +1,66 @@
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/proc_fs.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/ctype.h>
+#include <linux/sched.h>
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+struct pipefs_list bl_device_list;
+struct dentry *bl_device_pipe;
+
+ssize_t bl_pipe_downcall(struct file *filp, const char __user *src, size_t len)
+{
+ int err;
+ struct pipefs_hdr *msg;
+
+ dprintk("Entering %s...\n", __func__);
+
+ msg = pipefs_readmsg(filp, src, len);
+ if (IS_ERR(msg)) {
+ dprintk("ERROR: unable to read pipefs message.\n");
+ return PTR_ERR(msg);
+ }
+
+ /* now assign the result, which wakes the blocked thread */
+ err = pipefs_assign_upcall_reply(msg, &bl_device_list);
+ if (err) {
+ dprintk("ERROR: failed to assign upcall with id %u\n",
+ msg->msgid);
+ kfree(msg);
+ }
+ return len;
+}
+
+static const struct rpc_pipe_ops bl_pipe_ops = {
+ .upcall = pipefs_generic_upcall,
+ .downcall = bl_pipe_downcall,
+ .destroy_msg = pipefs_generic_destroy_msg,
+};
+
+int bl_pipe_init(void)
+{
+ dprintk("%s: block_device pipefs registering...\n", __func__);
+ bl_device_pipe = pipefs_mkpipe("bl_device_pipe", &bl_pipe_ops, 1);
+ if (IS_ERR(bl_device_pipe))
+ dprintk("ERROR, unable to make block_device pipe\n");
+
+ if (!bl_device_pipe)
+ dprintk("bl_device_pipe is NULL!\n");
+ else
+ dprintk("bl_device_pipe created!\n");
+ pipefs_init_list(&bl_device_list);
+ return 0;
+}
+
+void bl_pipe_exit(void)
+{
+ dprintk("%s: block_device pipefs unregistering...\n", __func__);
+ if (IS_ERR(bl_device_pipe))
+ return;
+ pipefs_closepipe(bl_device_pipe);
+ return;
+}
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 08458c6..bc6a0b2 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -185,6 +185,8 @@ static int __init nfs4blocklayout_init(void)
dprintk("%s: NFSv4 Block Layout Driver Registering...\n", __func__);

ret = pnfs_register_layoutdriver(&blocklayout_type);
+ if (!ret)
+ bl_pipe_init();
return ret;
}

@@ -194,6 +196,7 @@ static void __exit nfs4blocklayout_exit(void)
__func__);

pnfs_unregister_layoutdriver(&blocklayout_type);
+ bl_pipe_exit();
}

module_init(nfs4blocklayout_init);
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 49d69c7..4b8608c 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -87,4 +87,18 @@ static inline struct pnfs_block_layout *BLK_LO2EXT(struct pnfs_layout_hdr *lo)
return container_of(lo, struct pnfs_block_layout, bl_layout);
}

+#include <linux/sunrpc/simple_rpc_pipefs.h>
+
+extern struct pipefs_list bl_device_list;
+extern struct dentry *bl_device_pipe;
+
+int bl_pipe_init(void);
+void bl_pipe_exit(void);
+
+#define BL_DEVICE_UMOUNT 0x0 /* Umount--delete devices */
+#define BL_DEVICE_MOUNT 0x1 /* Mount--create devices */
+#define BL_DEVICE_REQUEST_INIT 0x0 /* Start request */
+#define BL_DEVICE_REQUEST_PROC 0x1 /* User level process succeeds */
+#define BL_DEVICE_REQUEST_ERR 0x2 /* User level process fails */
+
#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
--
1.7.4.1
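
The downcall path in this patch relies on matching a reply to a waiting upcall by msgid. A minimal userspace sketch of that matching step follows. It deliberately omits the spinlock and the wait-queue wake-up used in the kernel code, and struct reply, struct pending, pending_add() and assign_reply() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal reply type: only the msgid matters for matching. */
struct reply {
	uint32_t msgid;
};

/* One blocked caller; a stand-in for struct pipefs_upcall. */
struct pending {
	uint32_t msgid;
	struct reply *reply;	/* NULL until a reply arrives */
	struct pending *next;
};

/* Add an upcall to the front of the pending list. */
static void pending_add(struct pending **head, struct pending *up)
{
	up->reply = NULL;
	up->next = *head;
	*head = up;
}

/*
 * Attach reply to the pending upcall with the same msgid.
 * Returns 0 on success or -ENOENT if nobody is waiting for it;
 * in the latter case the caller still owns the reply.
 */
static int assign_reply(struct pending *head, struct reply *reply)
{
	struct pending *up;

	for (up = head; up; up = up->next) {
		if (up->msgid == reply->msgid) {
			up->reply = reply;
			return 0;
		}
	}
	return -ENOENT;
}
```

The ownership rule in the comment mirrors bl_pipe_downcall(), which frees the message itself when pipefs_assign_upcall_reply() reports an error.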


2011-06-14 02:33:13

by Jim Rees

Subject: [PATCH 28/33] pnfsblock: write_end_cleanup

From: Fred Isaman <[email protected]>

Ensure all pages in block are marked for initialization if needed.

[pnfsblock: Update to 2.6.29]
[pnfsblock: write_end_cleanup adjust for removed ok_to_use_pnfs]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 54 ++++++++++++++++++++++++++++++++++++++
1 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index dff5e69..cbf74d8 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -791,6 +791,60 @@ bl_write_end(struct inode *inode, struct page *page, loff_t pos,
static void
bl_write_end_cleanup(struct file *filp, struct pnfs_fsdata *fsdata)
{
+ struct page *page;
+ pgoff_t index;
+ sector_t *pos;
+ struct address_space *mapping = filp->f_mapping;
+ struct pnfs_fsdata *fake_data;
+ struct pnfs_layout_segment *lseg;
+
+ if (!fsdata)
+ return;
+ lseg = fsdata->lseg;
+ if (!lseg)
+ return;
+ pos = fsdata->private;
+ if (!pos)
+ return;
+ dprintk("%s enter with pos=%llu\n", __func__, (u64)(*pos));
+ for (; *pos != ~0; pos++) {
+ index = *pos >> (PAGE_CACHE_SHIFT - 9);
+ /* XXX How do we properly deal with failures here??? */
+ page = grab_cache_page_write_begin(mapping, index, 0);
+ if (!page) {
+ printk(KERN_ERR "%s BUG BUG BUG NoMem\n", __func__);
+ continue;
+ }
+ dprintk("%s: Examining block page\n", __func__);
+ print_page(page);
+ if (!PageMappedToDisk(page)) {
+ /* XXX How do we properly deal with failures here??? */
+ dprintk("%s Marking block page\n", __func__);
+ init_page_for_write(BLK_LSEG2EXT(fsdata->lseg), page,
+ PAGE_CACHE_SIZE, PAGE_CACHE_SIZE,
+ NULL);
+ print_page(page);
+ fake_data = kzalloc(sizeof(*fake_data), GFP_KERNEL);
+ if (!fake_data) {
+ printk(KERN_ERR "%s BUG BUG BUG NoMem\n",
+ __func__);
+ unlock_page(page);
+ continue;
+ }
+ get_lseg(lseg);
+ fake_data->lseg = lseg;
+ fake_data->bypass_eof = 1;
+ mapping->a_ops->write_end(filp, mapping,
+ index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE,
+ PAGE_CACHE_SIZE,
+ page, fake_data);
+ /* Note fake_data is freed by nfs_write_end */
+ } else
+ unlock_page(page);
+ }
+ kfree(fsdata->private);
+ fsdata->private = NULL;
}

static struct pnfs_layoutdriver_type blocklayout_type = {
--
1.7.4.1
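
bl_write_end_cleanup() walks a list of sectors terminated by an all-ones value and turns each sector into a page-cache index. The sketch below shows just that walk in userspace. It assumes 4 KiB pages and a 64-bit sector_t; PAGE_SHIFT_ASSUMED, SECTOR_LIST_END, sector_to_page_index() and collect_page_indices() are hypothetical names, and the output buffer bound is an addition for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;		/* assumed userspace stand-in */

#define PAGE_SHIFT_ASSUMED 12		/* 4 KiB pages, an assumption */
#define SECTOR_SHIFT 9			/* 512-byte sectors */
#define SECTOR_LIST_END (~(sector_t)0)	/* list terminator */

/* Page index that holds the given 512-byte sector. */
static uint64_t sector_to_page_index(sector_t sect)
{
	return sect >> (PAGE_SHIFT_ASSUMED - SECTOR_SHIFT);
}

/*
 * Walk a terminator-ended sector list and store each page index in out.
 * Returns the number of entries written, at most max.
 */
static size_t collect_page_indices(const sector_t *pos, uint64_t *out,
				   size_t max)
{
	size_t n = 0;

	for (; *pos != SECTOR_LIST_END && n < max; pos++)
		out[n++] = sector_to_page_index(*pos);
	return n;
}
```

With 4 KiB pages there are eight sectors per page, so sectors 0 through 7 share page index 0.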


2011-06-14 02:33:11

by Jim Rees

Subject: [PATCH 27/33] pnfsblock: write_end

From: Fred Isaman <[email protected]>

Implements bl_write_end, which basically just calls SetPageUptodate.

[pnfsblock: write_end adjust for removed ok_to_use_pnfs]
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index b9b961f..dff5e69 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -772,10 +772,15 @@ bl_write_begin(struct pnfs_layout_segment *lseg, struct page *page, loff_t pos,
return ret;
}

+/* CAREFUL - what happens if copied < count??? */
static int
bl_write_end(struct inode *inode, struct page *page, loff_t pos,
unsigned count, unsigned copied, struct pnfs_layout_segment *lseg)
{
+ dprintk("%s enter, %u@%lld, lseg=%p\n", __func__, count, pos, lseg);
+ print_page(page);
+ if (lseg)
+ SetPageUptodate(page);
return 0;
}

--
1.7.4.1


2011-06-14 02:33:18

by Jim Rees

Subject: [PATCH 30/33] pnfsblock: bl_write_pagelist

From: Fred Isaman <[email protected]>

Note: When the upper layer's read/write request cannot be fulfilled, the
block layout driver shouldn't silently mark the page as an error. It should
do what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.

When the upper layer resends the read/write request to finish the rest
of the request, pgbase is the position where we should start.

[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <[email protected]>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <[email protected]>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 146 +++++++++++++++++++++++++++++++++++++-
1 files changed, 145 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 01fe089..6fe039c8 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -320,11 +320,155 @@ bl_read_pagelist(struct nfs_read_data *rdata)
return PNFS_NOT_ATTEMPTED;
}

+/* STUB - this needs thought */
+static inline void
+bl_done_with_wpage(struct page *page, const int ok)
+{
+ if (!ok) {
+ SetPageError(page);
+ SetPagePnfsErr(page);
+ /* This is an inline copy of nfs_zap_mapping */
+ /* This is oh so fishy, and needs deep thought */
+ if (page->mapping->nrpages != 0) {
+ struct inode *inode = page->mapping->host;
+ spin_lock(&inode->i_lock);
+ NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
+ spin_unlock(&inode->i_lock);
+ }
+ }
+ /* end_page_writeback called in rpc_release. Should be done here. */
+}
+
+/* This is basically copied from mpage_end_io_read */
+static void bl_end_io_write(struct bio *bio, int err)
+{
+ void *data = bio->bi_private;
+ const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+ struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+
+ do {
+ struct page *page = bvec->bv_page;
+
+ if (--bvec >= bio->bi_io_vec)
+ prefetchw(&bvec->bv_page->flags);
+ bl_done_with_wpage(page, uptodate);
+ } while (bvec >= bio->bi_io_vec);
+ bio_put(bio);
+ put_parallel(data);
+}
+
+/* Work function scheduled by bl_end_par_io_write. It marks sectors as
+ * written and extends the commit list.
+ */
+static void bl_write_cleanup(struct work_struct *work)
+{
+ struct rpc_task *task;
+ struct nfs_write_data *wdata;
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+ wdata = container_of(task, struct nfs_write_data, task);
+ pnfs_ld_write_done(wdata);
+}
+
+/* Called when last of bios associated with a bl_write_pagelist call finishes */
+static void
+bl_end_par_io_write(void *data)
+{
+ struct nfs_write_data *wdata = data;
+
+ /* STUB - ignoring error handling */
+ wdata->task.tk_status = 0;
+ wdata->verf.committed = NFS_FILE_SYNC;
+ INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
+ schedule_work(&wdata->task.u.tk_work);
+}
+
static enum pnfs_try_status
bl_write_pagelist(struct nfs_write_data *wdata,
int sync)
{
- return PNFS_NOT_ATTEMPTED;
+ int i;
+ struct bio *bio = NULL;
+ struct pnfs_block_extent *be = NULL;
+ sector_t isect, extent_length = 0;
+ struct parallel_io *par;
+ loff_t offset = wdata->args.offset;
+ size_t count = wdata->args.count;
+ struct page **pages = wdata->args.pages;
+ int pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
+
+ dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
+ if (!wdata->lseg) {
+ dprintk("%s no lseg, falling back to MDS\n", __func__);
+ return PNFS_NOT_ATTEMPTED;
+ }
+ if (dont_like_caller(wdata->req)) {
+ dprintk("%s dont_like_caller failed\n", __func__);
+ return PNFS_NOT_ATTEMPTED;
+ }
+ /* At this point, wdata->pages is a (sequential) list of nfs_pages.
+ * We want to write each one and, if there is an error, remove it
+ * from the list and call nfs_retry_request(req) to have it redone
+ * using NFS.
+ * QUESTION: do this per block or per request? Probably per block,
+ * as part of end_bio.
+ */
+ par = alloc_parallel(wdata);
+ if (!par)
+ return PNFS_NOT_ATTEMPTED;
+ par->call_ops = *wdata->mds_ops;
+ par->call_ops.rpc_call_done = bl_rpc_do_nothing;
+ par->pnfs_callback = bl_end_par_io_write;
+ /* At this point, have to be more careful with error handling */
+
+ isect = (sector_t) ((offset & (long)PAGE_CACHE_MASK) >> 9);
+ for (i = pg_index; i < wdata->npages ; i++) {
+ if (!extent_length) {
+ /* We've used up the previous extent */
+ put_extent(be);
+ bio = bl_submit_bio(WRITE, bio);
+ /* Get the next one */
+ be = find_get_extent(BLK_LSEG2EXT(wdata->lseg),
+ isect, NULL);
+ if (!be || !is_writable(be, isect)) {
+ /* FIXME */
+ bl_done_with_wpage(pages[i], 0);
+ break;
+ }
+ extent_length = be->be_length -
+ (isect - be->be_f_offset);
+ }
+ for (;;) {
+ if (!bio) {
+ bio = bio_alloc(GFP_NOIO, wdata->npages - i);
+ if (!bio) {
+ /* Error out this page */
+ /* FIXME */
+ bl_done_with_wpage(pages[i], 0);
+ break;
+ }
+ bio->bi_sector = isect - be->be_f_offset +
+ be->be_v_offset;
+ bio->bi_bdev = be->be_mdev;
+ bio->bi_end_io = bl_end_io_write;
+ bio->bi_private = par;
+ }
+ if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
+ break;
+ bio = bl_submit_bio(WRITE, bio);
+ }
+ isect += PAGE_CACHE_SIZE >> 9;
+ extent_length -= PAGE_CACHE_SIZE >> 9;
+ }
+ wdata->res.count = (isect << 9) - (offset);
+ if (count < wdata->res.count)
+ wdata->res.count = count;
+ /* pnfs_set_layoutcommit needs this */
+ wdata->mds_offset = offset;
+ put_extent(be);
+ bl_submit_bio(WRITE, bio);
+ put_parallel(par);
+ return PNFS_ATTEMPTED;
}

/* FIXME - range ignored */
--
1.7.4.1
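
A note on the sector arithmetic in bl_write_pagelist(): the write starts at the first 512-byte sector of the page containing `offset`, advances one page worth of sectors per loop iteration, and finally reports `(isect << 9) - offset` bytes, clamped to `count`. The following is a minimal userspace sketch of that arithmetic only, not kernel code. The `SKETCH_*` constants, the 4 KiB page size, and the function names are assumptions made for illustration; the guard that returns 0 when no page was queued is also an addition that the patch does not have.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12u                     /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE    (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK    (~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_SECTOR_SHIFT 9u                      /* 512-byte sectors */

/* First sector of the page that contains byte offset `offset`. */
uint64_t first_sector(uint64_t offset)
{
    return (offset & SKETCH_PAGE_MASK) >> SKETCH_SECTOR_SHIFT;
}

/* Number of sectors the loop advances per page. */
uint64_t sectors_per_page(void)
{
    return SKETCH_PAGE_SIZE >> SKETCH_SECTOR_SHIFT;
}

/*
 * Byte count to report after `pages_done` whole pages were queued,
 * clamped to the requested `count`.
 */
uint64_t bytes_reported(uint64_t offset, uint64_t count, uint64_t pages_done)
{
    uint64_t isect, end;

    /* Keep the multiplication below from wrapping in this sketch. */
    assert(pages_done <= (UINT64_MAX >> SKETCH_PAGE_SHIFT));

    isect = first_sector(offset) + pages_done * sectors_per_page();
    end = isect << SKETCH_SECTOR_SHIFT;

    /* Added guard: with an unaligned offset and no pages queued,
     * end - offset would wrap around. */
    if (end <= offset)
        return 0;
    return (end - offset) < count ? (end - offset) : count;
}
```

For example, with `offset` 4196 the first sector is 8, and after two pages the covered range ends at byte 12288, so up to 8092 bytes are reported.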


2011-06-14 02:33:23

by Jim Rees

Subject: [PATCH 32/33] pnfsblock: Implement release_inval_marks

From: Zhang Jingwang <[email protected]>

Leaving release_inval_marks() as a stub leaks the pnfs_inval_tracking
entries still linked on the list.

Signed-off-by: Zhang Jingwang <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 6de800f..3187c4f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -523,10 +523,15 @@ release_extents(struct pnfs_block_layout *bl, struct pnfs_layout_range *range)
spin_unlock(&bl->bl_ext_lock);
}

-/* STUB */
static void
release_inval_marks(struct pnfs_inval_markings *marks)
{
+ struct pnfs_inval_tracking *pos, *temp;
+
+ list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
+ list_del(&pos->it_link);
+ kfree(pos);
+ }
return;
}

--
1.7.4.1
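
For readers unfamiliar with the "safe" list iterator used above: the loop frees each entry while walking the list, so the successor has to be saved before the current entry is released. Here is a minimal userspace sketch of that pattern, assuming a plain singly linked list. `struct tracking`, `tracking_push()` and `tracking_release()` are hypothetical stand-ins, not the kernel's list.h API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pnfs_inval_tracking. */
struct tracking {
    struct tracking *next;
    unsigned long long offset;
};

/* Push a new node on the list; returns 0 on success, -1 on allocation failure. */
int tracking_push(struct tracking **head, unsigned long long offset)
{
    struct tracking *n;

    assert(head != NULL);
    n = malloc(sizeof(*n));
    if (!n)
        return -1;
    n->offset = offset;
    n->next = *head;
    *head = n;
    return 0;
}

/* Free every node and empty the list; returns the number of nodes freed. */
size_t tracking_release(struct tracking **head)
{
    struct tracking *pos, *tmp;
    size_t freed = 0;

    assert(head != NULL);
    for (pos = *head; pos; pos = tmp) {
        tmp = pos->next;    /* save the successor before pos is freed */
        free(pos);
        freed++;
    }
    *head = NULL;
    return freed;
}
```

Reading `pos->next` after `free(pos)` would be a use-after-free, which is the problem the saved `tmp` pointer avoids.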


2011-06-14 02:33:25

by Jim Rees

Subject: [PATCH 33/33] pnfsblock DEVONLY: Add configurable prefetch size for layoutget

From: Peng Tao <[email protected]>

Do not send upstream. This is only for dealing with servers that return
small layouts.

pnfs_layout_prefetch_kb can be modified via sysctl. It defaults to 0, so
it has no effect unless it is set via sysctl.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/pnfs.c | 17 +++++++++++++++++
fs/nfs/pnfs.h | 1 +
fs/nfs/sysctl.c | 10 ++++++++++
3 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 7912635..4f63ebd 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -46,6 +46,11 @@ static DEFINE_SPINLOCK(pnfs_spinlock);
*/
static LIST_HEAD(pnfs_modules_tbl);

+/*
+ * layoutget prefetch size
+ */
+unsigned int pnfs_layout_prefetch_kb;
+
/* Return the registered pnfs layout driver module matching given id */
static struct pnfs_layoutdriver_type *
find_pnfs_driver_locked(u32 id)
@@ -909,6 +914,16 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
}

/*
+ * Set layout prefetch length.
+ */
+static void
+pnfs_set_layout_prefetch(struct pnfs_layout_range *range)
+{
+ if (range->length < (pnfs_layout_prefetch_kb << 10))
+ range->length = pnfs_layout_prefetch_kb << 10;
+}
+
+/*
* Layout segment is retreived from the server if not cached.
* The appropriate layout segment is referenced and returned to the caller.
*/
@@ -959,6 +974,8 @@ pnfs_update_layout(struct inode *ino,

if (pnfs_layoutgets_blocked(lo, NULL, 0))
goto out_unlock;
+
+ pnfs_set_layout_prefetch(&arg);
atomic_inc(&lo->plh_outstanding);

get_layout_hdr(lo);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index d7f203b..6718053 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -180,6 +180,7 @@ extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp);

/* pnfs.c */
+extern unsigned int pnfs_layout_prefetch_kb;
void get_layout_hdr(struct pnfs_layout_hdr *lo);
void put_lseg(struct pnfs_layout_segment *lseg);

diff --git a/fs/nfs/sysctl.c b/fs/nfs/sysctl.c
index 978aaeb..79a5134 100644
--- a/fs/nfs/sysctl.c
+++ b/fs/nfs/sysctl.c
@@ -14,6 +14,7 @@
#include <linux/nfs_fs.h>

#include "callback.h"
+#include "pnfs.h"

#ifdef CONFIG_NFS_V4
static const int nfs_set_port_min = 0;
@@ -42,6 +43,15 @@ static ctl_table nfs_cb_sysctls[] = {
},
#endif /* CONFIG_NFS_USE_NEW_IDMAPPER */
#endif
+#ifdef CONFIG_NFS_V4_1
+ {
+ .procname = "pnfs_layout_prefetch_kb",
+ .data = &pnfs_layout_prefetch_kb,
+ .maxlen = sizeof(pnfs_layout_prefetch_kb),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
{
.procname = "nfs_mountpoint_timeout",
.data = &nfs_mountpoint_expiry_timeout,
--
1.7.4.1


2011-06-14 02:32:09

by Jim Rees

Subject: [PATCH 03/33] pnfs: let layoutcommit code handle multiple segments

From: Peng Tao <[email protected]>

Some layout drivers, such as the block layout driver, can have multiple
layout segments. The generic layoutcommit code should be able to handle that.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/pnfs.c | 18 ++++++++++++++----
fs/nfs/pnfs.h | 1 +
2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 593a9aa..e252af1 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -893,7 +893,7 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
dprintk("%s:Begin\n", __func__);

assert_spin_locked(&lo->plh_inode->i_lock);
- list_for_each_entry(lseg, &lo->plh_segs, pls_list) {
+ list_for_each_entry_reverse(lseg, &lo->plh_segs, pls_list) {
if (test_bit(NFS_LSEG_VALID, &lseg->pls_flags) &&
is_matching_lseg(&lseg->pls_range, range)) {
ret = get_lseg(lseg);
@@ -1243,10 +1243,19 @@ pnfs_try_to_read_data(struct nfs_read_data *rdata,
static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
{
struct pnfs_layout_segment *lseg, *rv = NULL;
+ loff_t max_pos = 0;
+
+ list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
+ if (lseg->pls_range.iomode == IOMODE_RW) {
+ if (max_pos < lseg->pls_end_pos)
+ max_pos = lseg->pls_end_pos;
+ if (test_and_clear_bit
+ (NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
+ rv = lseg;
+ }
+ }
+ rv->pls_end_pos = max_pos;

- list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
- if (lseg->pls_range.iomode == IOMODE_RW)
- rv = lseg;
return rv;
}

@@ -1261,6 +1270,7 @@ pnfs_set_layoutcommit(struct nfs_write_data *wdata)
if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
/* references matched in nfs4_layoutcommit_release */
get_lseg(wdata->lseg);
+ set_bit(NFS_LSEG_LAYOUTCOMMIT, &wdata->lseg->pls_flags);
wdata->lseg->pls_lc_cred =
get_rpccred(wdata->args.context->state->owner->so_cred);
mark_as_dirty = true;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index f984598..0ac820f 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -36,6 +36,7 @@
enum {
NFS_LSEG_VALID = 0, /* cleared when lseg is recalled/returned */
NFS_LSEG_ROC, /* roc bit received from server */
+ NFS_LSEG_LAYOUTCOMMIT, /* layoutcommit bit set for layoutcommit */
};

struct pnfs_layout_segment {
--
1.7.4.1
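
To make the new selection logic in pnfs_list_write_lseg() easier to follow: it walks all segments, tracks the largest end position among the RW segments, test-and-clears the layoutcommit flag, and returns the last flagged RW segment with its end position raised to that maximum. Below is a minimal userspace sketch over an array instead of a kernel list. All `*_sketch` names are hypothetical. The NULL check before the final store is an addition here; the patch stores through `rv` unconditionally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum iomode_sketch { SKETCH_IOMODE_READ = 1, SKETCH_IOMODE_RW = 2 };

/* Hypothetical stand-in for struct pnfs_layout_segment. */
struct lseg_sketch {
    enum iomode_sketch iomode;
    uint64_t end_pos;
    bool layoutcommit;      /* stands in for the NFS_LSEG_LAYOUTCOMMIT bit */
};

/*
 * Return the last RW segment that was flagged for layoutcommit, with its
 * end_pos raised to the maximum end_pos of all RW segments, or NULL if no
 * RW segment was flagged.
 */
struct lseg_sketch *list_write_lseg(struct lseg_sketch *segs, size_t n)
{
    struct lseg_sketch *rv = NULL;
    uint64_t max_pos = 0;
    size_t i;

    assert(segs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (segs[i].iomode != SKETCH_IOMODE_RW)
            continue;
        if (max_pos < segs[i].end_pos)
            max_pos = segs[i].end_pos;
        if (segs[i].layoutcommit) {
            segs[i].layoutcommit = false;   /* test-and-clear */
            rv = &segs[i];
        }
    }
    if (rv)     /* added guard, not in the patch */
        rv->end_pos = max_pos;
    return rv;
}
```

Because the flag is cleared as a side effect, a second call on the same segments returns NULL until some write sets the flag again.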


2011-06-14 02:32:38

by Jim Rees

Subject: [PATCH 14/33] pnfsblock: remove device operations

Signed-off-by: Jim Rees <[email protected]>
Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
---
fs/nfs/blocklayout/Makefile | 2 +-
fs/nfs/blocklayout/blocklayout.h | 2 +
fs/nfs/blocklayout/blocklayoutdm.c | 120 ++++++++++++++++++++++++++++++++++++
3 files changed, 123 insertions(+), 1 deletions(-)
create mode 100644 fs/nfs/blocklayout/blocklayoutdm.c

diff --git a/fs/nfs/blocklayout/Makefile b/fs/nfs/blocklayout/Makefile
index bd69aad..bdbf180 100644
--- a/fs/nfs/blocklayout/Makefile
+++ b/fs/nfs/blocklayout/Makefile
@@ -2,4 +2,4 @@
# Makefile for the pNFS block layout driver kernel module
#
obj-$(CONFIG_PNFS_BLOCK) += blocklayoutdriver.o
-blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o blocklayoutdev.o
+blocklayoutdriver-objs := blocklayout.o block-device-discovery-pipe.o extents.o blocklayoutdev.o blocklayoutdm.o
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index cda7ea1..839b81d 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -101,6 +101,8 @@ struct pnfs_block_dev *nfs4_blk_decode_device(struct nfs_server *server,
struct list_head *sdlist);
int nfs4_blk_process_layoutget(struct pnfs_layout_hdr *lo,
struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
+/* blocklayoutdm.c */
+void free_block_dev(struct pnfs_block_dev *bdev);

#include <linux/sunrpc/simple_rpc_pipefs.h>

diff --git a/fs/nfs/blocklayout/blocklayoutdm.c b/fs/nfs/blocklayout/blocklayoutdm.c
new file mode 100644
index 0000000..097dd05
--- /dev/null
+++ b/fs/nfs/blocklayout/blocklayoutdm.c
@@ -0,0 +1,120 @@
+/*
+ * linux/fs/nfs/blocklayout/blocklayoutdm.c
+ *
+ * Module for the NFSv4.1 pNFS block layout driver.
+ *
+ * Copyright (c) 2007 The Regents of the University of Michigan.
+ * All rights reserved.
+ *
+ * Fred Isaman <[email protected]>
+ * Andy Adamson <[email protected]>
+ *
+ * permission is granted to use, copy, create derivative works and
+ * redistribute this software and such derivative works for any purpose,
+ * so long as the name of the university of michigan is not used in
+ * any advertising or publicity pertaining to the use or distribution
+ * of this software without specific, written prior authorization. if
+ * the above copyright notice or any other identification of the
+ * university of michigan is included in any copy of any portion of
+ * this software, then the disclaimer below must also be included.
+ *
+ * this software is provided as is, without representation from the
+ * university of michigan as to its fitness for any purpose, and without
+ * warranty by the university of michigan of any kind, either express
+ * or implied, including without limitation the implied warranties of
+ * merchantability and fitness for a particular purpose. the regents
+ * of the university of michigan shall not be liable for any damages,
+ * including special, indirect, incidental, or consequential damages,
+ * with respect to any claim arising out or in connection with the use
+ * of the software, even if it has been or is hereafter advised of the
+ * possibility of such damages.
+ */
+
+#include <linux/genhd.h> /* gendisk - used in a dprintk*/
+#include <linux/sched.h>
+#include <linux/hash.h>
+
+#include "blocklayout.h"
+
+#define NFSDBG_FACILITY NFSDBG_PNFS_LD
+
+/* Defines used for calculating memory usage in nfs4_blk_flatten() */
+#define ARGSIZE 24 /* Max bytes needed for linear target arg string */
+#define SPECSIZE (sizeof8(struct dm_target_spec) + ARGSIZE)
+#define SPECS_PER_PAGE (PAGE_SIZE / SPECSIZE)
+#define SPEC_HEADER_ADJUST (SPECS_PER_PAGE - \
+ (PAGE_SIZE - sizeof8(struct dm_ioctl)) / SPECSIZE)
+#define roundup8(x) (((x)+7) & ~7)
+#define sizeof8(x) roundup8(sizeof(x))
+
+static int dev_remove(dev_t dev)
+{
+ int ret = 1;
+ struct pipefs_hdr *msg = NULL, *reply = NULL;
+ uint64_t bl_dev;
+ uint32_t major = MAJOR(dev), minor = MINOR(dev);
+
+ dprintk("Entering %s\n", __func__);
+
+ if (IS_ERR(bl_device_pipe))
+ return ret;
+
+ memcpy((void *)&bl_dev, &major, sizeof(uint32_t));
+ memcpy((void *)&bl_dev + sizeof(uint32_t), &minor, sizeof(uint32_t));
+ msg = pipefs_alloc_init_msg(0, BL_DEVICE_UMOUNT, 0, (void *)&bl_dev,
+ sizeof(uint64_t));
+ if (IS_ERR(msg)) {
+ dprintk("ERROR: couldn't make pipefs message.\n");
+ goto out;
+ }
+ msg->msgid = hash_ptr(&msg, sizeof(msg->msgid) * 8);
+ msg->status = BL_DEVICE_REQUEST_INIT;
+
+ reply = pipefs_queue_upcall_waitreply(bl_device_pipe, msg,
+ &bl_device_list, 0, 0);
+ if (IS_ERR(reply)) {
+ dprintk("ERROR: upcall_waitreply failed\n");
+ goto out;
+ }
+
+ if (reply->status == BL_DEVICE_REQUEST_PROC)
+ ret = 0; /*TODO: what to return*/
+out:
+ if (!IS_ERR(reply))
+ kfree(reply);
+ if (!IS_ERR(msg))
+ kfree(msg);
+ return ret;
+}
+
+/*
+ * Release meta device
+ */
+static int nfs4_blk_metadev_release(struct pnfs_block_dev *bdev)
+{
+ int rv;
+
+ dprintk("%s Releasing\n", __func__);
+ /* XXX Check return? */
+ rv = nfs4_blkdev_put(bdev->bm_mdev);
+ dprintk("%s nfs4_blkdev_put returns %d\n", __func__, rv);
+
+ rv = dev_remove(bdev->bm_mdev->bd_dev);
+ dprintk("%s Returns %d\n", __func__, rv);
+ return rv;
+}
+
+void free_block_dev(struct pnfs_block_dev *bdev)
+{
+ if (bdev) {
+ if (bdev->bm_mdev) {
+ dprintk("%s Removing DM device: %d:%d\n",
+ __func__,
+ MAJOR(bdev->bm_mdev->bd_dev),
+ MINOR(bdev->bm_mdev->bd_dev));
+ /* XXX Check status ?? */
+ nfs4_blk_metadev_release(bdev);
+ }
+ kfree(bdev);
+ }
+}
--
1.7.4.1
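
A note on the upcall payload built in dev_remove(): the major number is copied into the first four bytes of an 8-byte buffer and the minor number into the next four, both in host byte order. The following is a minimal userspace sketch of that packing and the matching unpacking a daemon on the same host could do. `pack_dev()` and `unpack_dev()` are hypothetical names, and the sketch uses an `unsigned char` buffer rather than arithmetic on a `void *`, which is a GNU C extension.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack major into bytes 0..3 and minor into bytes 4..7, host byte order. */
void pack_dev(unsigned char buf[8], uint32_t major, uint32_t minor)
{
    assert(buf != NULL);
    memcpy(buf, &major, sizeof(major));
    memcpy(buf + sizeof(major), &minor, sizeof(minor));
}

/* Reverse of pack_dev(); only valid on a host with the same byte order. */
void unpack_dev(const unsigned char buf[8], uint32_t *major, uint32_t *minor)
{
    assert(buf != NULL && major != NULL && minor != NULL);
    memcpy(major, buf, sizeof(*major));
    memcpy(minor, buf + sizeof(*major), sizeof(*minor));
}
```

Since kernel and daemon run on the same machine, host byte order is sufficient here; the layout would not be portable across hosts.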


2011-06-14 02:32:15

by Jim Rees

Subject: [PATCH 05/33] pnfs: ask for layout_blksize and save it in nfs_server

From: Peng Tao <[email protected]>

The block layout driver needs it to determine I/O size.

Signed-off-by: Fred Isaman <[email protected]>
Signed-off-by: Tao Guo <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/client.c | 1 +
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 5 +-
fs/nfs/nfs4xdr.c | 101 +++++++++++++++++++++++++++++++++++++--------
include/linux/nfs_fs_sb.h | 4 +-
include/linux/nfs_xdr.h | 3 +-
6 files changed, 93 insertions(+), 23 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index d630bb7..3b75943 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -938,6 +938,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server,
if (server->wsize > NFS_MAX_FILE_IO_SIZE)
server->wsize = NFS_MAX_FILE_IO_SIZE;
server->wpages = (server->wsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ server->pnfs_blksize = fsinfo->blksize;
set_pnfs_layoutdriver(server, mntfh, fsinfo->layouttype);

server->wtmult = nfs_block_bits(fsinfo->wtmult, NULL);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index c4a6983..5725a7e 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -315,7 +315,7 @@ extern const struct nfs4_minor_version_ops *nfs_v4_minor_ops[];
extern const u32 nfs4_fattr_bitmap[2];
extern const u32 nfs4_statfs_bitmap[2];
extern const u32 nfs4_pathconf_bitmap[2];
-extern const u32 nfs4_fsinfo_bitmap[2];
+extern const u32 nfs4_fsinfo_bitmap[3];
extern const u32 nfs4_fs_locations_bitmap[2];

/* nfs4renewd.c */
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ec580a8..04450bf 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -137,12 +137,13 @@ const u32 nfs4_pathconf_bitmap[2] = {
0
};

-const u32 nfs4_fsinfo_bitmap[2] = { FATTR4_WORD0_MAXFILESIZE
+const u32 nfs4_fsinfo_bitmap[3] = { FATTR4_WORD0_MAXFILESIZE
| FATTR4_WORD0_MAXREAD
| FATTR4_WORD0_MAXWRITE
| FATTR4_WORD0_LEASE_TIME,
FATTR4_WORD1_TIME_DELTA
- | FATTR4_WORD1_FS_LAYOUT_TYPES
+ | FATTR4_WORD1_FS_LAYOUT_TYPES,
+ FATTR4_WORD2_LAYOUT_BLKSIZE
};

const u32 nfs4_fs_locations_bitmap[2] = {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 60e9d44..b8f375e 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -91,7 +91,7 @@ static int nfs4_stat_to_errno(int);
#define encode_getfh_maxsz (op_encode_hdr_maxsz)
#define decode_getfh_maxsz (op_decode_hdr_maxsz + 1 + \
((3+NFS4_FHSIZE) >> 2))
-#define nfs4_fattr_bitmap_maxsz 3
+#define nfs4_fattr_bitmap_maxsz 4
#define encode_getattr_maxsz (op_encode_hdr_maxsz + nfs4_fattr_bitmap_maxsz)
#define nfs4_name_maxsz (1 + ((3 + NFS4_MAXNAMLEN) >> 2))
#define nfs4_path_maxsz (1 + ((3 + NFS4_MAXPATHLEN) >> 2))
@@ -113,7 +113,11 @@ static int nfs4_stat_to_errno(int);
#define encode_restorefh_maxsz (op_encode_hdr_maxsz)
#define decode_restorefh_maxsz (op_decode_hdr_maxsz)
#define encode_fsinfo_maxsz (encode_getattr_maxsz)
-#define decode_fsinfo_maxsz (op_decode_hdr_maxsz + 15)
+/* The 5 accounts for the PNFS attributes, and assumes that at most three
+ * layout types will be returned.
+ */
+#define decode_fsinfo_maxsz (op_decode_hdr_maxsz + \
+ nfs4_fattr_bitmap_maxsz + 4 + 8 + 5)
#define encode_renew_maxsz (op_encode_hdr_maxsz + 3)
#define decode_renew_maxsz (op_decode_hdr_maxsz)
#define encode_setclientid_maxsz \
@@ -1095,6 +1099,35 @@ static void encode_getattr_two(struct xdr_stream *xdr, uint32_t bm0, uint32_t bm
hdr->replen += decode_getattr_maxsz;
}

+static void
+encode_getattr_three(struct xdr_stream *xdr,
+ uint32_t bm0, uint32_t bm1, uint32_t bm2,
+ struct compound_hdr *hdr)
+{
+ __be32 *p;
+
+ p = reserve_space(xdr, 4);
+ *p = cpu_to_be32(OP_GETATTR);
+ if (bm2) {
+ p = reserve_space(xdr, 16);
+ *p++ = cpu_to_be32(3);
+ *p++ = cpu_to_be32(bm0);
+ *p++ = cpu_to_be32(bm1);
+ *p = cpu_to_be32(bm2);
+ } else if (bm1) {
+ p = reserve_space(xdr, 12);
+ *p++ = cpu_to_be32(2);
+ *p++ = cpu_to_be32(bm0);
+ *p = cpu_to_be32(bm1);
+ } else {
+ p = reserve_space(xdr, 8);
+ *p++ = cpu_to_be32(1);
+ *p = cpu_to_be32(bm0);
+ }
+ hdr->nops++;
+ hdr->replen += decode_getattr_maxsz;
+}
+
static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
{
encode_getattr_two(xdr, bitmask[0] & nfs4_fattr_bitmap[0],
@@ -1103,8 +1136,11 @@ static void encode_getfattr(struct xdr_stream *xdr, const u32* bitmask, struct c

static void encode_fsinfo(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
{
- encode_getattr_two(xdr, bitmask[0] & nfs4_fsinfo_bitmap[0],
- bitmask[1] & nfs4_fsinfo_bitmap[1], hdr);
+ encode_getattr_three(xdr,
+ bitmask[0] & nfs4_fsinfo_bitmap[0],
+ bitmask[1] & nfs4_fsinfo_bitmap[1],
+ bitmask[2] & nfs4_fsinfo_bitmap[2],
+ hdr);
}

static void encode_fs_locations(struct xdr_stream *xdr, const u32* bitmask, struct compound_hdr *hdr)
@@ -2575,7 +2611,7 @@ static void nfs4_xdr_enc_setclientid_confirm(struct rpc_rqst *req,
struct compound_hdr hdr = {
.nops = 0,
};
- const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
+ const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };

encode_compound_hdr(xdr, req, &hdr);
encode_setclientid_confirm(xdr, arg, &hdr);
@@ -2719,7 +2755,7 @@ static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req,
struct compound_hdr hdr = {
.minorversion = nfs4_xdr_minorversion(&args->la_seq_args),
};
- const u32 lease_bitmap[2] = { FATTR4_WORD0_LEASE_TIME, 0 };
+ const u32 lease_bitmap[3] = { FATTR4_WORD0_LEASE_TIME, 0, 0 };

encode_compound_hdr(xdr, req, &hdr);
encode_sequence(xdr, &args->la_seq_args, &hdr);
@@ -2947,14 +2983,17 @@ static int decode_attr_bitmap(struct xdr_stream *xdr, uint32_t *bitmap)
goto out_overflow;
bmlen = be32_to_cpup(p);

- bitmap[0] = bitmap[1] = 0;
+ bitmap[0] = bitmap[1] = bitmap[2] = 0;
p = xdr_inline_decode(xdr, (bmlen << 2));
if (unlikely(!p))
goto out_overflow;
if (bmlen > 0) {
bitmap[0] = be32_to_cpup(p++);
- if (bmlen > 1)
- bitmap[1] = be32_to_cpup(p);
+ if (bmlen > 1) {
+ bitmap[1] = be32_to_cpup(p++);
+ if (bmlen > 2)
+ bitmap[2] = be32_to_cpup(p);
+ }
}
return 0;
out_overflow:
@@ -2986,8 +3025,9 @@ static int decode_attr_supported(struct xdr_stream *xdr, uint32_t *bitmap, uint3
return ret;
bitmap[0] &= ~FATTR4_WORD0_SUPPORTED_ATTRS;
} else
- bitmask[0] = bitmask[1] = 0;
- dprintk("%s: bitmask=%08x:%08x\n", __func__, bitmask[0], bitmask[1]);
+ bitmask[0] = bitmask[1] = bitmask[2] = 0;
+ dprintk("%s: bitmask=%08x:%08x:%08x\n", __func__,
+ bitmask[0], bitmask[1], bitmask[2]);
return 0;
}

@@ -4041,7 +4081,7 @@ out_overflow:
static int decode_server_caps(struct xdr_stream *xdr, struct nfs4_server_caps_res *res)
{
__be32 *savep;
- uint32_t attrlen, bitmap[2] = {0};
+ uint32_t attrlen, bitmap[3] = {0};
int status;

if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4067,7 +4107,7 @@ xdr_error:
static int decode_statfs(struct xdr_stream *xdr, struct nfs_fsstat *fsstat)
{
__be32 *savep;
- uint32_t attrlen, bitmap[2] = {0};
+ uint32_t attrlen, bitmap[3] = {0};
int status;

if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4099,7 +4139,7 @@ xdr_error:
static int decode_pathconf(struct xdr_stream *xdr, struct nfs_pathconf *pathconf)
{
__be32 *savep;
- uint32_t attrlen, bitmap[2] = {0};
+ uint32_t attrlen, bitmap[3] = {0};
int status;

if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4239,7 +4279,7 @@ static int decode_getfattr_generic(struct xdr_stream *xdr, struct nfs_fattr *fat
{
__be32 *savep;
uint32_t attrlen,
- bitmap[2] = {0};
+ bitmap[3] = {0};
int status;

status = decode_op_hdr(xdr, OP_GETATTR);
@@ -4325,10 +4365,32 @@ static int decode_attr_pnfstype(struct xdr_stream *xdr, uint32_t *bitmap,
return status;
}

+/*
+ * The preferred block size for layout-directed I/O
+ */
+static int decode_attr_layout_blksize(struct xdr_stream *xdr, uint32_t *bitmap,
+ uint32_t *res)
+{
+ __be32 *p;
+
+ dprintk("%s: bitmap is %x\n", __func__, bitmap[2]);
+ *res = 0;
+ if (bitmap[2] & FATTR4_WORD2_LAYOUT_BLKSIZE) {
+ p = xdr_inline_decode(xdr, 4);
+ if (unlikely(!p)) {
+ print_overflow_msg(__func__, xdr);
+ return -EIO;
+ }
+ *res = be32_to_cpup(p);
+ bitmap[2] &= ~FATTR4_WORD2_LAYOUT_BLKSIZE;
+ }
+ return 0;
+}
+
static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
{
__be32 *savep;
- uint32_t attrlen, bitmap[2];
+ uint32_t attrlen, bitmap[3];
int status;

if ((status = decode_op_hdr(xdr, OP_GETATTR)) != 0)
@@ -4356,6 +4418,9 @@ static int decode_fsinfo(struct xdr_stream *xdr, struct nfs_fsinfo *fsinfo)
status = decode_attr_pnfstype(xdr, bitmap, &fsinfo->layouttype);
if (status != 0)
goto xdr_error;
+ status = decode_attr_layout_blksize(xdr, bitmap, &fsinfo->blksize);
+ if (status)
+ goto xdr_error;

status = verify_attr_len(xdr, savep, attrlen);
xdr_error:
@@ -4775,7 +4840,7 @@ static int decode_getacl(struct xdr_stream *xdr, struct rpc_rqst *req,
{
__be32 *savep;
uint32_t attrlen,
- bitmap[2] = {0};
+ bitmap[3] = {0};
struct kvec *iov = req->rq_rcv_buf.head;
int status;

@@ -6607,7 +6672,7 @@ out:
int nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry,
int plus)
{
- uint32_t bitmap[2] = {0};
+ uint32_t bitmap[3] = {0};
uint32_t len;
__be32 *p = xdr_inline_decode(xdr, 4);
if (unlikely(!p))
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 87694ca..79cc4ca 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -130,7 +130,7 @@ struct nfs_server {
#endif

#ifdef CONFIG_NFS_V4
- u32 attr_bitmask[2];/* V4 bitmask representing the set
+ u32 attr_bitmask[3];/* V4 bitmask representing the set
of attributes supported on this
filesystem */
u32 cache_consistency_bitmask[2];
@@ -143,6 +143,8 @@ struct nfs_server {
filesystem */
struct pnfs_layoutdriver_type *pnfs_curr_ld; /* Active layout driver */
struct rpc_wait_queue roc_rpcwaitq;
+ void *pnfs_ld_data; /* per mount point data */
+ u32 pnfs_blksize; /* layout_blksize attr */

/* the following fields are protected by nfs_client->cl_lock */
struct rb_root state_owners;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 0863228..1fa6f7a 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -122,6 +122,7 @@ struct nfs_fsinfo {
struct timespec time_delta; /* server time granularity */
__u32 lease_time; /* in seconds */
__u32 layouttype; /* supported pnfs layout driver */
+ __u32 blksize; /* preferred pnfs io block size */
};

struct nfs_fsstat {
@@ -953,7 +954,7 @@ struct nfs4_server_caps_arg {
};

struct nfs4_server_caps_res {
- u32 attr_bitmask[2];
+ u32 attr_bitmask[3];
u32 acl_bitmask;
u32 has_links;
u32 has_symlinks;
--
1.7.4.1
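
For reference, the wire format handled by encode_getattr_three() and the extended decode_attr_bitmap() is a big-endian word count followed by that many 32-bit bitmap words, with trailing zero words omitted on encode and missing words treated as zero on decode. Here is a minimal userspace sketch of both directions over a flat byte buffer. `encode_bitmap()`, `decode_bitmap()` and the helpers are hypothetical names and do not use the kernel's xdr_stream API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Load a 32-bit big-endian value. */
static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Encode a count word plus 1 to 3 bitmap words, dropping trailing zero
 * words. buf must hold at least 16 bytes. Returns the bytes written.
 */
size_t encode_bitmap(unsigned char *buf, uint32_t bm0, uint32_t bm1,
                     uint32_t bm2)
{
    const uint32_t words[3] = { bm0, bm1, bm2 };
    uint32_t n = bm2 ? 3 : (bm1 ? 2 : 1);
    uint32_t i;

    assert(buf != NULL);
    put_be32(buf, n);
    for (i = 0; i < n; i++)
        put_be32(buf + 4 + 4 * (size_t)i, words[i]);
    return 4 + 4 * (size_t)n;
}

/*
 * Decode into bitmap[3], zeroing words the sender omitted and skipping
 * any words beyond the third. Returns bytes consumed, or 0 if truncated.
 */
size_t decode_bitmap(const unsigned char *buf, size_t len, uint32_t bitmap[3])
{
    uint32_t bmlen, i;

    assert(buf != NULL && bitmap != NULL);
    bitmap[0] = bitmap[1] = bitmap[2] = 0;
    if (len < 4)
        return 0;
    bmlen = get_be32(buf);
    if (bmlen > (len - 4) / 4)
        return 0;
    for (i = 0; i < bmlen && i < 3; i++)
        bitmap[i] = get_be32(buf + 4 + 4 * (size_t)i);
    return 4 + 4 * (size_t)bmlen;
}
```

Note that a request with only a word-2 bit set still sends three words, because the zero words in front of it cannot be dropped.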