This series adds support for the pNFS operations in NFS v4.1, as well
as a block layout driver that can export block based filesystems that
implement a few additional export operations. Support for XFS is
provided in this series, but other filesystems could be added easily.
The core pNFS code of course owes its heritage to the existing Linux
pNFS server prototype, but except for a few bits and pieces in the
XDR path nothing is left from it.
The design of this new pNFS server is fairly different from the old
one: while the old one implemented very little semantics in nfsd
and left almost everything to the filesystems, my implementation puts
as much as possible into common nfsd code, then dispatches to a layout
driver that is still part of nfsd, and only then calls into the
filesystem, thus keeping it free from intimate pNFS knowledge.
More details are documented in the individual patch descriptions and
code comments.
This code is also available from:
git://git.infradead.org/users/hch/pnfs.git pnfsd-for-3.20
This gives us a nice upper bound for later use in nfsd.
Signed-off-by: Christoph Hellwig <[email protected]>
---
include/linux/nfs4.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 022b761..8a3589c 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -516,6 +516,7 @@ enum pnfs_layouttype {
LAYOUT_NFSV4_1_FILES = 1,
LAYOUT_OSD2_OBJECTS = 2,
LAYOUT_BLOCK_VOLUME = 3,
+ LAYOUT_TYPE_MAX
};
/* used for both layout return and recall */
--
1.9.1
This (ab-)uses the file locking code to allow filesystems to recall
outstanding pNFS layouts on a file. This new lease type is similar but
not quite the same as FL_DELEG. A FL_LAYOUT lease can always be granted,
and a per-filesystem lock (the XFS iolock for the initial implementation)
ensures no FL_LAYOUT leases are granted when we would need to recall them.
Also included are changes that allow multiple outstanding read
leases of different types on the same file as long as they have a
different owner. This wasn't a problem until now as nfsd never set
FL_LEASE leases, and no one else used FL_DELEG leases, but given that
nfsd will also issue FL_LAYOUT leases we will have to handle it now.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/locks.c | 28 +++++++++++++++++++---------
fs/nfsd/nfs4state.c | 2 +-
include/linux/fs.h | 16 ++++++++++++++++
3 files changed, 36 insertions(+), 10 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index 735b8d3..6cf41f8 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -137,7 +137,7 @@
#define IS_POSIX(fl) (fl->fl_flags & FL_POSIX)
#define IS_FLOCK(fl) (fl->fl_flags & FL_FLOCK)
-#define IS_LEASE(fl) (fl->fl_flags & (FL_LEASE|FL_DELEG))
+#define IS_LEASE(fl) (fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
#define IS_OFDLCK(fl) (fl->fl_flags & FL_OFDLCK)
static bool lease_breaking(struct file_lock *fl)
@@ -1348,6 +1348,8 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
{
+ if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT))
+ return false;
if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
return false;
return locks_conflict(breaker, lease);
@@ -1560,11 +1562,14 @@ int fcntl_getlease(struct file *filp)
* conflict with the lease we're trying to set.
*/
static int
-check_conflicting_open(const struct dentry *dentry, const long arg)
+check_conflicting_open(const struct dentry *dentry, const long arg, int flags)
{
int ret = 0;
struct inode *inode = dentry->d_inode;
+ if (flags & FL_LAYOUT)
+ return 0;
+
if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
return -EAGAIN;
@@ -1608,7 +1613,7 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
spin_lock(&inode->i_lock);
time_out_leases(inode, &dispose);
- error = check_conflicting_open(dentry, arg);
+ error = check_conflicting_open(dentry, arg, lease->fl_flags);
if (error)
goto out;
@@ -1624,10 +1629,13 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
for (before = &inode->i_flock;
((fl = *before) != NULL) && IS_LEASE(fl);
before = &fl->fl_next) {
- if (fl->fl_file == filp) {
+
+ if (fl->fl_file == filp &&
+ fl->fl_owner == lease->fl_owner) {
my_before = before;
continue;
}
+
/*
* No exclusive leases if someone else has a lease on
* this file:
@@ -1665,7 +1673,7 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
* precedes these checks.
*/
smp_mb();
- error = check_conflicting_open(dentry, arg);
+ error = check_conflicting_open(dentry, arg, lease->fl_flags);
if (error)
goto out_unlink;
@@ -1685,7 +1693,7 @@ out_unlink:
goto out;
}
-static int generic_delete_lease(struct file *filp)
+static int generic_delete_lease(struct file *filp, void *priv)
{
int error = -EAGAIN;
struct file_lock *fl, **before;
@@ -1698,7 +1706,8 @@ static int generic_delete_lease(struct file *filp)
for (before = &inode->i_flock;
((fl = *before) != NULL) && IS_LEASE(fl);
before = &fl->fl_next) {
- if (fl->fl_file == filp)
+ if (fl->fl_file == filp &&
+ priv == fl->fl_owner)
break;
}
trace_generic_delete_lease(inode, fl);
@@ -1737,13 +1746,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp,
switch (arg) {
case F_UNLCK:
- return generic_delete_lease(filp);
+ return generic_delete_lease(filp, *priv);
case F_RDLCK:
case F_WRLCK:
if (!(*flp)->fl_lmops->lm_break) {
WARN_ON_ONCE(1);
return -ENOLCK;
}
+
return generic_add_lease(filp, arg, flp, priv);
default:
return -EINVAL;
@@ -1816,7 +1826,7 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
{
if (arg == F_UNLCK)
- return vfs_setlease(filp, F_UNLCK, NULL, NULL);
+ return vfs_setlease(filp, F_UNLCK, NULL, (void **)&filp);
return do_fcntl_add_lease(fd, filp, arg);
}
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 277f8b8..2505b68 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -693,7 +693,7 @@ static void nfs4_put_deleg_lease(struct nfs4_file *fp)
spin_unlock(&fp->fi_lock);
if (filp) {
- vfs_setlease(filp, F_UNLCK, NULL, NULL);
+ vfs_setlease(filp, F_UNLCK, NULL, (void **)&fp);
fput(filp);
}
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f90c028..204cf91 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -875,6 +875,7 @@ static inline struct file *get_file(struct file *f)
#define FL_DOWNGRADE_PENDING 256 /* Lease is being downgraded */
#define FL_UNLOCK_PENDING 512 /* Lease is being broken */
#define FL_OFDLCK 1024 /* lock is "owned" by struct file */
+#define FL_LAYOUT 2048 /* outstanding pNFS layout */
/*
* Special return value from posix_lock_file() and vfs_lock_file() for
@@ -2017,6 +2018,16 @@ static inline int break_deleg_wait(struct inode **delegated_inode)
return ret;
}
+static inline int break_layout(struct inode *inode, bool wait)
+{
+ smp_mb();
+ if (inode->i_flock)
+ return __break_lease(inode,
+ wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
+ FL_LAYOUT);
+ return 0;
+}
+
#else /* !CONFIG_FILE_LOCKING */
static inline int locks_mandatory_locked(struct file *file)
{
@@ -2072,6 +2083,11 @@ static inline int break_deleg_wait(struct inode **delegated_inode)
return 0;
}
+static inline int break_layout(struct inode *inode, bool wait)
+{
+ return 0;
+}
+
#endif /* CONFIG_FILE_LOCKING */
/* fs/open.c */
--
1.9.1
On Tue, 6 Jan 2015 17:28:25 +0100
Christoph Hellwig <[email protected]> wrote:
> This (ab-)uses the file locking code to allow filesystems to recall
> outstanding pNFS layouts on a file. This new lease type is similar but
> not quite the same as FL_DELEG. A FL_LAYOUT lease can always be granted,
> and a per-filesystem lock (the XFS iolock for the initial implementation)
> ensures no FL_LAYOUT leases are granted when we would need to recall them.
> 
> Also included are changes that allow multiple outstanding read
> leases of different types on the same file as long as they have a
> different owner. This wasn't a problem until now as nfsd never set
> FL_LEASE leases, and no one else used FL_DELEG leases, but given that
> nfsd will also issue FL_LAYOUT leases we will have to handle it now.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
So with the current code, layouts are always whole-file?
Tracking layouts as a lease-like object seems reasonable, but I'm not
100% thrilled with overloading all of the lease code with this. Perhaps
it should be its own sort of object with a separate API to manage them?
That would also make it easier to support layouts that are not for the
entire file.
To that end, it might be nice to hold off on taking this until we
deprecate the i_flock list as we can then give layouts their own
list_head in the file_lock_context. It would also make it easier to use
a new sort of object to represent layouts.
I just cleaned up that patchset last week, and will re-post it soon
once I give it a bit of testing this week.
--
Jeff Layton <[email protected]>
On Tue, Jan 06, 2015 at 10:46:52AM -0800, Jeff Layton wrote:
> So with the current code, layouts are always whole-file?
Layouts aren't whole-file, but layout recalls are.
> Tracking layouts as a lease-like object seems reasonable, but I'm not
> 100% thrilled with overloading all of the lease code with this. Perhaps
> it should be its own sort of object with a separate API to manage them?
> That would also make it easier to support layouts that are not for the
> entire file.
>
> To that end, it might be nice to hold off on taking this until we
> deprecate the i_flock list as we can then give layouts their own
> list_head in the file_lock_context. It would also make it easier to use
> a new sort of object to represent layouts.
>
> I just cleaned up that patchset last week, and will re-post it soon
> once I give it a bit of testing this week.
I'm happy to add support for this to your reworked locks/leases/etc.
handling. As for which one gets merged first, I'd say whichever is
in mergeable shape earlier. If you're confident you can get your
rework in ASAP I'm happy to rebase on top of it, otherwise doing it
the other way around sounds easier.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/nfs4xdr.c | 39 ++++++++++++++++++++++-----------------
1 file changed, 22 insertions(+), 17 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 15f7b73..fe31178 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -235,6 +235,22 @@ static char *savemem(struct nfsd4_compoundargs *argp, __be32 *p, int nbytes)
}
static __be32
+nfsd4_decode_time(struct nfsd4_compoundargs *argp, struct timespec *tv)
+{
+ DECODE_HEAD;
+ u64 sec;
+
+ READ_BUF(12);
+ p = xdr_decode_hyper(p, &sec);
+ tv->tv_sec = sec;
+ tv->tv_nsec = be32_to_cpup(p++);
+ if (tv->tv_nsec >= (u32)1000000000)
+ return nfserr_inval;
+
+ DECODE_TAIL;
+}
+
+static __be32
nfsd4_decode_bitmap(struct nfsd4_compoundargs *argp, u32 *bmval)
{
u32 bmlen;
@@ -267,7 +283,6 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval,
{
int expected_len, len = 0;
u32 dummy32;
- u64 sec;
char *buf;
DECODE_HEAD;
@@ -358,15 +373,10 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval,
dummy32 = be32_to_cpup(p++);
switch (dummy32) {
case NFS4_SET_TO_CLIENT_TIME:
- /* We require the high 32 bits of 'seconds' to be 0, and we ignore
- all 32 bits of 'nseconds'. */
- READ_BUF(12);
len += 12;
- p = xdr_decode_hyper(p, &sec);
- iattr->ia_atime.tv_sec = (time_t)sec;
- iattr->ia_atime.tv_nsec = be32_to_cpup(p++);
- if (iattr->ia_atime.tv_nsec >= (u32)1000000000)
- return nfserr_inval;
+ status = nfsd4_decode_time(argp, &iattr->ia_atime);
+ if (status)
+ return status;
iattr->ia_valid |= (ATTR_ATIME | ATTR_ATIME_SET);
break;
case NFS4_SET_TO_SERVER_TIME:
@@ -382,15 +392,10 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval,
dummy32 = be32_to_cpup(p++);
switch (dummy32) {
case NFS4_SET_TO_CLIENT_TIME:
- /* We require the high 32 bits of 'seconds' to be 0, and we ignore
- all 32 bits of 'nseconds'. */
- READ_BUF(12);
len += 12;
- p = xdr_decode_hyper(p, &sec);
- iattr->ia_mtime.tv_sec = sec;
- iattr->ia_mtime.tv_nsec = be32_to_cpup(p++);
- if (iattr->ia_mtime.tv_nsec >= (u32)1000000000)
- return nfserr_inval;
+ status = nfsd4_decode_time(argp, &iattr->ia_mtime);
+ if (status)
+ return status;
iattr->ia_valid |= (ATTR_MTIME | ATTR_MTIME_SET);
break;
case NFS4_SET_TO_SERVER_TIME:
--
1.9.1
On Tue, Jan 06, 2015 at 05:28:26PM +0100, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/nfsd/nfs4xdr.c | 39 ++++++++++++++++++++++-----------------
> 1 file changed, 22 insertions(+), 17 deletions(-)
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 15f7b73..fe31178 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -235,6 +235,22 @@ static char *savemem(struct nfsd4_compoundargs *argp, __be32 *p, int nbytes)
> }
>
> static __be32
> +nfsd4_decode_time(struct nfsd4_compoundargs *argp, struct timespec *tv)
> +{
> + DECODE_HEAD;
> + u64 sec;
> +
> + READ_BUF(12);
> + p = xdr_decode_hyper(p, &sec);
> + tv->tv_sec = sec;
> + tv->tv_nsec = be32_to_cpup(p++);
> + if (tv->tv_nsec >= (u32)1000000000)
> + return nfserr_inval;
> +
> + DECODE_TAIL;
> +}
> +
> +static __be32
> nfsd4_decode_bitmap(struct nfsd4_compoundargs *argp, u32 *bmval)
> {
> u32 bmlen;
> @@ -267,7 +283,6 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval,
> {
> int expected_len, len = 0;
> u32 dummy32;
> - u64 sec;
> char *buf;
>
> DECODE_HEAD;
> @@ -358,15 +373,10 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval,
> dummy32 = be32_to_cpup(p++);
> switch (dummy32) {
> case NFS4_SET_TO_CLIENT_TIME:
> - /* We require the high 32 bits of 'seconds' to be 0, and we ignore
> - all 32 bits of 'nseconds'. */
Have you done away with these requirements?
> - READ_BUF(12);
> len += 12;
I think this code makes it clear that the magic number 12 is the
same on both lines. With the change, that gets lost.
Do I think that the 12 will ever change? No.
Do I think this becomes more "magic"? Yes.
On Fri, Jan 09, 2015 at 03:02:02PM -0800, Tom Haynes wrote:
> > DECODE_HEAD;
> > @@ -358,15 +373,10 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval,
> > dummy32 = be32_to_cpup(p++);
> > switch (dummy32) {
> > case NFS4_SET_TO_CLIENT_TIME:
> > - /* We require the high 32 bits of 'seconds' to be 0, and we ignore
> > - all 32 bits of 'nseconds'. */
>
> Have you done away with these requirements?
No, the comment just got lost, I'll add it back.
>
> > - READ_BUF(12);
> > len += 12;
>
> I think this code makes it clear that the magic number 12 is the
> same on both lines. With the change, that gets lost.
>
> Do I think that the 12 will ever change? No.
>
> Do I think this becomes more "magic"? Yes.
Sure, but the whole counting of the number of bytes to be decoded in
setattr is magic to start with. I guess we could replace it with some
magic pointer arithmetic on argp->p, but is that really worth it? It
should be a separate patch for sure.
On Sun, Jan 11, 2015 at 12:42:42PM +0100, Christoph Hellwig wrote:
> On Fri, Jan 09, 2015 at 03:02:02PM -0800, Tom Haynes wrote:
> >
> > > - READ_BUF(12);
> > > len += 12;
> >
> > I think this code makes it clear that the magic number 12 is the
> > same on both lines. With the change, that gets lost.
> >
> > Do I think that the 12 will ever change? No.
> >
> > Do I think this becomes more "magic"? Yes.
>
> Sure, but the whole counting the number to be decoded in setattr
> is magic to start with.
Agreed.
> I guess we could replace it with some magic
> pointer arithmetic on argp->p, but is that really worth it?
Which is why I asked the leading questions. I see both sides,
but ultimately it is a nit considering the rest of the abuse.
I'm fine with you deciding it is still magic overall.
> Should
> be a separate patch for sure.
The pnfs code will need it too. Also remove the nfsd_ prefix to match the
other filehandle helpers in that file.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/nfs4state.c | 12 ++----------
fs/nfsd/nfsfh.h | 9 +++++++++
2 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 2505b68..aaa3f8e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -408,14 +408,6 @@ static unsigned int file_hashval(struct knfsd_fh *fh)
return nfsd_fh_hashval(fh) & (FILE_HASH_SIZE - 1);
}
-static bool nfsd_fh_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
-{
- return fh1->fh_size == fh2->fh_size &&
- !memcmp(fh1->fh_base.fh_pad,
- fh2->fh_base.fh_pad,
- fh1->fh_size);
-}
-
static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
static void
@@ -3300,7 +3292,7 @@ find_file_locked(struct knfsd_fh *fh, unsigned int hashval)
struct nfs4_file *fp;
hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash) {
- if (nfsd_fh_match(&fp->fi_fhandle, fh)) {
+ if (fh_match(&fp->fi_fhandle, fh)) {
if (atomic_inc_not_zero(&fp->fi_ref))
return fp;
}
@@ -4294,7 +4286,7 @@ laundromat_main(struct work_struct *laundry)
static inline __be32 nfs4_check_fh(struct svc_fh *fhp, struct nfs4_ol_stateid *stp)
{
- if (!nfsd_fh_match(&fhp->fh_handle, &stp->st_stid.sc_file->fi_fhandle))
+ if (!fh_match(&fhp->fh_handle, &stp->st_stid.sc_file->fi_fhandle))
return nfserr_bad_stateid;
return nfs_ok;
}
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 08236d7..e24d954 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -187,6 +187,15 @@ fh_init(struct svc_fh *fhp, int maxsize)
return fhp;
}
+static inline bool fh_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
+{
+ if (fh1->fh_size != fh2->fh_size)
+ return false;
+ if (memcmp(fh1->fh_base.fh_pad, fh2->fh_base.fh_pad, fh1->fh_size) != 0)
+ return false;
+ return true;
+}
+
#ifdef CONFIG_NFSD_V3
/*
* The wcc data stored in current_fh should be cleared
--
1.9.1
Add a helper to check that the fsid parts of two file handles match.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/nfsfh.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index e24d954..84cae20 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -196,6 +196,15 @@ static inline bool fh_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
return true;
}
+static inline bool fh_fsid_match(struct knfsd_fh *fh1, struct knfsd_fh *fh2)
+{
+ if (fh1->fh_fsid_type != fh2->fh_fsid_type)
+ return false;
+ if (memcmp(fh1->fh_fsid, fh2->fh_fsid, key_len(fh1->fh_fsid_type)) != 0)
+ return false;
+ return true;
+}
+
#ifdef CONFIG_NFSD_V3
/*
* The wcc data stored in current_fh should be cleared
--
1.9.1
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/nfs4state.c | 8 ++++----
fs/nfsd/state.h | 6 ++++++
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index aaa3f8e..e804e9b 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -486,7 +486,7 @@ static void nfs4_file_put_access(struct nfs4_file *fp, u32 access)
__nfs4_file_put_access(fp, O_RDONLY);
}
-static struct nfs4_stid *nfs4_alloc_stid(struct nfs4_client *cl,
+struct nfs4_stid *nfs4_alloc_stid(struct nfs4_client *cl,
struct kmem_cache *slab)
{
struct nfs4_stid *stid;
@@ -690,7 +690,7 @@ static void nfs4_put_deleg_lease(struct nfs4_file *fp)
}
}
-static void unhash_stid(struct nfs4_stid *s)
+void nfs4_unhash_stid(struct nfs4_stid *s)
{
s->sc_type = 0;
}
@@ -998,7 +998,7 @@ static void unhash_lock_stateid(struct nfs4_ol_stateid *stp)
list_del_init(&stp->st_locks);
unhash_ol_stateid(stp);
- unhash_stid(&stp->st_stid);
+ nfs4_unhash_stid(&stp->st_stid);
}
static void release_lock_stateid(struct nfs4_ol_stateid *stp)
@@ -4437,7 +4437,7 @@ out_unlock:
return status;
}
-static __be32
+__be32
nfsd4_lookup_stateid(struct nfsd4_compound_state *cstate,
stateid_t *stateid, unsigned char typemask,
struct nfs4_stid **s, struct nfsd_net *nn)
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index dab6553..55a3ece 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -545,6 +545,12 @@ struct nfsd_net;
extern __be32 nfs4_preprocess_stateid_op(struct net *net,
struct nfsd4_compound_state *cstate,
stateid_t *stateid, int flags, struct file **filp);
+__be32 nfsd4_lookup_stateid(struct nfsd4_compound_state *cstate,
+ stateid_t *stateid, unsigned char typemask,
+ struct nfs4_stid **s, struct nfsd_net *nn);
+struct nfs4_stid *nfs4_alloc_stid(struct nfs4_client *cl,
+ struct kmem_cache *slab);
+void nfs4_unhash_stid(struct nfs4_stid *s);
void nfs4_put_stid(struct nfs4_stid *s);
void nfs4_remove_reclaim_record(struct nfs4_client_reclaim *, struct nfsd_net *);
extern void nfs4_release_reclaim(struct nfsd_net *);
--
1.9.1
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/nfs4state.c | 10 ++--------
fs/nfsd/state.h | 7 +++++++
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index e804e9b..b2054d4 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -282,7 +282,7 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
kmem_cache_free(file_slab, fp);
}
-static inline void
+void
put_nfs4_file(struct nfs4_file *fi)
{
might_lock(&state_lock);
@@ -295,12 +295,6 @@ put_nfs4_file(struct nfs4_file *fi)
}
}
-static inline void
-get_nfs4_file(struct nfs4_file *fi)
-{
- atomic_inc(&fi->fi_ref);
-}
-
static struct file *
__nfs4_get_fd(struct nfs4_file *f, int oflag)
{
@@ -3300,7 +3294,7 @@ find_file_locked(struct knfsd_fh *fh, unsigned int hashval)
return NULL;
}
-static struct nfs4_file *
+struct nfs4_file *
find_file(struct knfsd_fh *fh)
{
struct nfs4_file *fp;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 55a3ece..8bc961e 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -573,6 +573,13 @@ extern struct nfs4_client_reclaim *nfs4_client_to_reclaim(const char *name,
struct nfsd_net *nn);
extern bool nfs4_has_reclaimed_state(const char *name, struct nfsd_net *nn);
+struct nfs4_file *find_file(struct knfsd_fh *fh);
+void put_nfs4_file(struct nfs4_file *fi);
+static inline void get_nfs4_file(struct nfs4_file *fi)
+{
+ atomic_inc(&fi->fi_ref);
+}
+
/* grace period management */
void nfsd4_end_grace(struct nfsd_net *nn);
--
1.9.1
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/nfs4state.c | 2 +-
fs/nfsd/state.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index b2054d4..9f6a075 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -352,7 +352,7 @@ find_readable_file(struct nfs4_file *f)
return ret;
}
-static struct file *
+struct file *
find_any_file(struct nfs4_file *f)
{
struct file *ret;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 8bc961e..38ebb12 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -579,6 +579,7 @@ static inline void get_nfs4_file(struct nfs4_file *fi)
{
atomic_inc(&fi->fi_ref);
}
+struct file *find_any_file(struct nfs4_file *f);
/* grace period management */
void nfsd4_end_grace(struct nfsd_net *nn);
--
1.9.1
Add support for the GETDEVICEINFO, LAYOUTGET, LAYOUTCOMMIT and
LAYOUTRETURN NFSv4.1 operations, as well as backing code to manage
outstanding layouts and devices.
Layout management is very straightforward, with a nfs4_layout_stateid
structure that extends nfs4_stid to manage layout stateids as the
top-level structure. It is linked into the nfs4_file and nfs4_client
structures like the other stateids, and contains a linked list of
layouts that hang off the stateid. The actual layout operations are
implemented in layout drivers that are not part of this commit, but
will be added later.
The worst part of this commit is the management of the pNFS device IDs,
which suffers from a specification that is not sanely implementable due
to the fact that the device IDs are global and not bound to an export,
are too small to store the fsid portion of a file handle, and must
never be reused. As we still need to perform all export authentication
and validation checks on a device ID passed to GETDEVICEINFO we are
caught between a rock and a hard place. To work around this issue we
add a new hash that maps from a 64-bit integer to a fsid so that we can
look up the export to authenticate against it, a 32-bit integer used as
a generation that we can bump when changing the device, and a currently
unused 32-bit integer that could be used in the future to handle more
than a single device per export. Entries in this hash table are never
deleted as we can't reuse the IDs anyway, and deleting them would cause
a severe lifetime problem, as Linux export structures are temporary and
can go away under load.
Parts of the XDR data, structures and marshaling/unmarshaling code, as
well as many concepts are derived from the old pNFS server implementation
from Andy Adamson, Benny Halevy, Dean Hildebrand, Marc Eshel, Fred Isaman,
Mike Sager, Ricardo Labiaga and many others.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/Kconfig | 10 +
fs/nfsd/Makefile | 1 +
fs/nfsd/export.c | 8 +
fs/nfsd/export.h | 2 +
fs/nfsd/nfs4layouts.c | 486 ++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfs4proc.c | 266 ++++++++++++++++++++++
fs/nfsd/nfs4state.c | 16 +-
fs/nfsd/nfs4xdr.c | 306 +++++++++++++++++++++++++
fs/nfsd/nfsctl.c | 9 +-
fs/nfsd/nfsd.h | 16 +-
fs/nfsd/pnfs.h | 80 +++++++
fs/nfsd/state.h | 21 ++
fs/nfsd/xdr4.h | 60 +++++
include/linux/nfs4.h | 1 +
include/uapi/linux/nfsd/debug.h | 1 +
15 files changed, 1279 insertions(+), 4 deletions(-)
create mode 100644 fs/nfsd/nfs4layouts.c
create mode 100644 fs/nfsd/pnfs.h
diff --git a/fs/nfsd/Kconfig b/fs/nfsd/Kconfig
index 7339515..683bf71 100644
--- a/fs/nfsd/Kconfig
+++ b/fs/nfsd/Kconfig
@@ -82,6 +82,16 @@ config NFSD_V4
If unsure, say N.
+config NFSD_PNFS
+ bool "NFSv4.1 server support for Parallel NFS (pNFS)"
+ depends on NFSD_V4
+ help
+ This option enables support for the parallel NFS features of the
+ minor version 1 of the NFSv4 protocol (RFC5661) in the kernel's NFS
+ server.
+
+ If unsure, say N.
+
config NFSD_V4_SECURITY_LABEL
bool "Provide Security Label support for NFSv4 server"
depends on NFSD_V4 && SECURITY
diff --git a/fs/nfsd/Makefile b/fs/nfsd/Makefile
index af32ef0..5806270 100644
--- a/fs/nfsd/Makefile
+++ b/fs/nfsd/Makefile
@@ -12,3 +12,4 @@ nfsd-$(CONFIG_NFSD_V3) += nfs3proc.o nfs3xdr.o
nfsd-$(CONFIG_NFSD_V3_ACL) += nfs3acl.o
nfsd-$(CONFIG_NFSD_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4idmap.o \
nfs4acl.o nfs4callback.o nfs4recover.o
+nfsd-$(CONFIG_NFSD_PNFS) += nfs4layouts.o
diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 30a739d..c3e3b6e 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -20,6 +20,7 @@
#include "nfsd.h"
#include "nfsfh.h"
#include "netns.h"
+#include "pnfs.h"
#define NFSDDBG_FACILITY NFSDDBG_EXPORT
@@ -545,6 +546,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
exp.ex_client = dom;
exp.cd = cd;
+ exp.ex_devid_map = NULL;
/* expiry */
err = -EINVAL;
@@ -621,6 +623,8 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
if (!gid_valid(exp.ex_anon_gid))
goto out4;
err = 0;
+
+ nfsd4_setup_layout_type(&exp);
}
expp = svc_export_lookup(&exp);
@@ -703,6 +707,7 @@ static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
new->ex_fslocs.locations = NULL;
new->ex_fslocs.locations_count = 0;
new->ex_fslocs.migrated = 0;
+ new->ex_layout_type = 0;
new->ex_uuid = NULL;
new->cd = item->cd;
}
@@ -717,6 +722,8 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
new->ex_anon_uid = item->ex_anon_uid;
new->ex_anon_gid = item->ex_anon_gid;
new->ex_fsid = item->ex_fsid;
+ new->ex_devid_map = item->ex_devid_map;
+ item->ex_devid_map = NULL;
new->ex_uuid = item->ex_uuid;
item->ex_uuid = NULL;
new->ex_fslocs.locations = item->ex_fslocs.locations;
@@ -725,6 +732,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
item->ex_fslocs.locations_count = 0;
new->ex_fslocs.migrated = item->ex_fslocs.migrated;
item->ex_fslocs.migrated = 0;
+ new->ex_layout_type = item->ex_layout_type;
new->ex_nflavors = item->ex_nflavors;
for (i = 0; i < MAX_SECINFO_LIST; i++) {
new->ex_flavors[i] = item->ex_flavors[i];
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index 04dc8c1..1f52bfc 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -56,6 +56,8 @@ struct svc_export {
struct nfsd4_fs_locations ex_fslocs;
uint32_t ex_nflavors;
struct exp_flavor_info ex_flavors[MAX_SECINFO_LIST];
+ enum pnfs_layouttype ex_layout_type;
+ struct nfsd4_deviceid_map *ex_devid_map;
struct cache_detail *cd;
};
diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
new file mode 100644
index 0000000..0753ed8
--- /dev/null
+++ b/fs/nfsd/nfs4layouts.c
@@ -0,0 +1,486 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include <linux/jhash.h>
+#include <linux/sched.h>
+
+#include "pnfs.h"
+#include "netns.h"
+
+#define NFSDDBG_FACILITY NFSDDBG_PNFS
+
+struct nfs4_layout {
+ struct list_head lo_perstate;
+ struct nfs4_layout_stateid *lo_state;
+ struct nfsd4_layout_seg lo_seg;
+};
+
+static struct kmem_cache *nfs4_layout_cache;
+static struct kmem_cache *nfs4_layout_stateid_cache;
+
+const struct nfsd4_layout_ops *nfsd4_layout_ops[LAYOUT_TYPE_MAX] = {
+};
+
+/* pNFS device ID to export fsid mapping */
+#define DEVID_HASH_BITS 8
+#define DEVID_HASH_SIZE (1 << DEVID_HASH_BITS)
+#define DEVID_HASH_MASK (DEVID_HASH_SIZE - 1)
+static u64 nfsd_devid_seq = 1;
+static struct list_head nfsd_devid_hash[DEVID_HASH_SIZE];
+static DEFINE_SPINLOCK(nfsd_devid_lock);
+
+static inline u32 devid_hashfn(u64 idx)
+{
+ return jhash_2words(idx, idx >> 32, 0) & DEVID_HASH_MASK;
+}
+
+static void
+nfsd4_alloc_devid_map(const struct svc_fh *fhp)
+{
+ const struct knfsd_fh *fh = &fhp->fh_handle;
+ size_t fsid_len = key_len(fh->fh_fsid_type);
+ struct nfsd4_deviceid_map *map, *old;
+ int i;
+
+ map = kzalloc(sizeof(*map) + fsid_len, GFP_KERNEL);
+ if (!map)
+ return;
+
+ map->fsid_type = fh->fh_fsid_type;
+ memcpy(&map->fsid, fh->fh_fsid, fsid_len);
+
+ spin_lock(&nfsd_devid_lock);
+ if (fhp->fh_export->ex_devid_map)
+ goto out_unlock;
+
+ for (i = 0; i < DEVID_HASH_SIZE; i++) {
+ list_for_each_entry(old, &nfsd_devid_hash[i], hash) {
+ if (old->fsid_type != fh->fh_fsid_type)
+ continue;
+ if (memcmp(old->fsid, fh->fh_fsid,
+ key_len(old->fsid_type)))
+ continue;
+
+ fhp->fh_export->ex_devid_map = old;
+ goto out_unlock;
+ }
+ }
+
+ map->idx = nfsd_devid_seq++;
+ list_add_tail_rcu(&map->hash, &nfsd_devid_hash[devid_hashfn(map->idx)]);
+ fhp->fh_export->ex_devid_map = map;
+ map = NULL;
+
+out_unlock:
+ spin_unlock(&nfsd_devid_lock);
+ if (map)
+ kfree(map);
+}
+
+struct nfsd4_deviceid_map *
+nfsd4_find_devid_map(int idx)
+{
+ struct nfsd4_deviceid_map *map, *ret = NULL;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(map, &nfsd_devid_hash[devid_hashfn(idx)], hash)
+ if (map->idx == idx)
+ ret = map;
+ rcu_read_unlock();
+
+ return ret;
+}
+
+int
+nfsd4_set_deviceid(struct nfsd4_deviceid *id, const struct svc_fh *fhp,
+ u32 device_generation)
+{
+ if (!fhp->fh_export->ex_devid_map) {
+ nfsd4_alloc_devid_map(fhp);
+ if (!fhp->fh_export->ex_devid_map)
+ return -ENOMEM;
+ }
+
+ id->fsid_idx = fhp->fh_export->ex_devid_map->idx;
+ id->generation = device_generation;
+ id->pad = 0;
+ return 0;
+}
+
+void nfsd4_setup_layout_type(struct svc_export *exp)
+{
+}
+
+static void
+nfsd4_free_layout_stateid(struct nfs4_stid *stid)
+{
+ struct nfs4_layout_stateid *ls = layoutstateid(stid);
+ struct nfs4_client *clp = ls->ls_stid.sc_client;
+ struct nfs4_file *fp = ls->ls_stid.sc_file;
+
+ spin_lock(&clp->cl_lock);
+ list_del_init(&ls->ls_perclnt);
+ spin_unlock(&clp->cl_lock);
+
+ spin_lock(&fp->fi_lock);
+ list_del_init(&ls->ls_perfile);
+ spin_unlock(&fp->fi_lock);
+
+ kmem_cache_free(nfs4_layout_stateid_cache, ls);
+}
+
+static struct nfs4_layout_stateid *
+nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
+ struct nfs4_stid *parent, u32 layout_type)
+{
+ struct nfs4_client *clp = cstate->clp;
+ struct nfs4_file *fp = parent->sc_file;
+ struct nfs4_layout_stateid *ls;
+ struct nfs4_stid *stp;
+
+ stp = nfs4_alloc_stid(cstate->clp, nfs4_layout_stateid_cache);
+ if (!stp)
+ return NULL;
+ stp->sc_free = nfsd4_free_layout_stateid;
+ get_nfs4_file(fp);
+ stp->sc_file = fp;
+
+ ls = layoutstateid(stp);
+ INIT_LIST_HEAD(&ls->ls_perclnt);
+ INIT_LIST_HEAD(&ls->ls_perfile);
+ spin_lock_init(&ls->ls_lock);
+ INIT_LIST_HEAD(&ls->ls_layouts);
+ ls->ls_layout_type = layout_type;
+
+ spin_lock(&clp->cl_lock);
+ stp->sc_type = NFS4_LAYOUT_STID;
+ list_add(&ls->ls_perclnt, &clp->cl_lo_states);
+ spin_unlock(&clp->cl_lock);
+
+ spin_lock(&fp->fi_lock);
+ list_add(&ls->ls_perfile, &fp->fi_lo_states);
+ spin_unlock(&fp->fi_lock);
+
+ return ls;
+}
+
+__be32
+nfsd4_preprocess_layout_stateid(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate, stateid_t *stateid,
+ bool create, u32 layout_type, struct nfs4_layout_stateid **lsp)
+{
+ struct nfs4_layout_stateid *ls;
+ struct nfs4_stid *stid;
+ unsigned char typemask = NFS4_LAYOUT_STID;
+ __be32 status;
+
+ if (create)
+ typemask |= (NFS4_OPEN_STID | NFS4_LOCK_STID | NFS4_DELEG_STID);
+
+ status = nfsd4_lookup_stateid(cstate, stateid, typemask, &stid,
+ net_generic(SVC_NET(rqstp), nfsd_net_id));
+ if (status)
+ goto out;
+
+ if (!fh_match(&cstate->current_fh.fh_handle,
+ &stid->sc_file->fi_fhandle)) {
+ status = nfserr_bad_stateid;
+ goto out_put_stid;
+ }
+
+ if (stid->sc_type != NFS4_LAYOUT_STID) {
+ ls = nfsd4_alloc_layout_stateid(cstate, stid, layout_type);
+ nfs4_put_stid(stid);
+
+ status = nfserr_jukebox;
+ if (!ls)
+ goto out;
+ } else {
+ ls = container_of(stid, struct nfs4_layout_stateid, ls_stid);
+
+ status = nfserr_bad_stateid;
+ if (stateid->si_generation > stid->sc_stateid.si_generation)
+ goto out_put_stid;
+ if (layout_type != ls->ls_layout_type)
+ goto out_put_stid;
+ }
+
+ *lsp = ls;
+ return 0;
+
+out_put_stid:
+ nfs4_put_stid(stid);
+out:
+ return status;
+}
+
+static inline u64
+layout_end(struct nfsd4_layout_seg *seg)
+{
+ u64 end = seg->offset + seg->length;
+ return end >= seg->offset ? end : NFS4_MAX_UINT64;
+}
+
+static void
+layout_update_len(struct nfsd4_layout_seg *lo, u64 end)
+{
+ if (end == NFS4_MAX_UINT64)
+ lo->length = NFS4_MAX_UINT64;
+ else
+ lo->length = end - lo->offset;
+}
+
+static bool
+layouts_overlapping(struct nfs4_layout *lo, struct nfsd4_layout_seg *s)
+{
+ if (s->iomode != IOMODE_ANY && s->iomode != lo->lo_seg.iomode)
+ return false;
+ if (layout_end(&lo->lo_seg) <= s->offset)
+ return false;
+ if (layout_end(s) <= lo->lo_seg.offset)
+ return false;
+ return true;
+}
+
+static bool
+layouts_try_merge(struct nfsd4_layout_seg *lo, struct nfsd4_layout_seg *new)
+{
+ if (lo->iomode != new->iomode)
+ return false;
+ if (layout_end(new) < lo->offset)
+ return false;
+ if (layout_end(lo) < new->offset)
+ return false;
+
+ lo->offset = min(lo->offset, new->offset);
+ layout_update_len(lo, max(layout_end(lo), layout_end(new)));
+ return true;
+}
+
+__be32
+nfsd4_insert_layout(struct nfsd4_layoutget *lgp, struct nfs4_layout_stateid *ls)
+{
+ struct nfsd4_layout_seg *seg = &lgp->lg_seg;
+ struct nfs4_layout *lp, *new = NULL;
+
+ spin_lock(&ls->ls_lock);
+ list_for_each_entry(lp, &ls->ls_layouts, lo_perstate) {
+ if (layouts_try_merge(&lp->lo_seg, seg))
+ goto done;
+ }
+ spin_unlock(&ls->ls_lock);
+
+ new = kmem_cache_alloc(nfs4_layout_cache, GFP_KERNEL);
+ if (!new)
+ return nfserr_jukebox;
+ memcpy(&new->lo_seg, seg, sizeof(lp->lo_seg));
+ new->lo_state = ls;
+
+ spin_lock(&ls->ls_lock);
+ list_for_each_entry(lp, &ls->ls_layouts, lo_perstate) {
+ if (layouts_try_merge(&lp->lo_seg, seg))
+ goto done;
+ }
+
+ atomic_inc(&ls->ls_stid.sc_count);
+ list_add_tail(&new->lo_perstate, &ls->ls_layouts);
+ new = NULL;
+done:
+ update_stateid(&ls->ls_stid.sc_stateid);
+ memcpy(&lgp->lg_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
+ spin_unlock(&ls->ls_lock);
+ if (new)
+ kmem_cache_free(nfs4_layout_cache, new);
+ return nfs_ok;
+}
+
+static void
+nfsd4_free_layouts(struct list_head *reaplist)
+{
+ while (!list_empty(reaplist)) {
+ struct nfs4_layout *lp = list_first_entry(reaplist,
+ struct nfs4_layout, lo_perstate);
+
+ list_del(&lp->lo_perstate);
+ nfs4_put_stid(&lp->lo_state->ls_stid);
+ kmem_cache_free(nfs4_layout_cache, lp);
+ }
+}
+
+static void
+nfsd4_return_file_layout(struct nfs4_layout *lp, struct nfsd4_layout_seg *seg,
+ struct list_head *reaplist)
+{
+ struct nfsd4_layout_seg *lo = &lp->lo_seg;
+ u64 end = layout_end(lo);
+
+ if (seg->offset <= lo->offset) {
+ if (layout_end(seg) >= end) {
+ list_move_tail(&lp->lo_perstate, reaplist);
+ return;
+ }
+ end = seg->offset;
+ } else {
+ /* retain the whole layout segment on a split. */
+ if (layout_end(seg) < end) {
+ dprintk("%s: split not supported\n", __func__);
+ return;
+ }
+
+ lo->offset = layout_end(seg);
+ }
+
+ layout_update_len(lo, end);
+}
+
+__be32
+nfsd4_return_file_layouts(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate,
+ struct nfsd4_layoutreturn *lrp)
+{
+ struct nfs4_layout_stateid *ls;
+ struct nfs4_layout *lp, *n;
+ LIST_HEAD(reaplist);
+ __be32 nfserr;
+ int found = 0;
+
+ nfserr = nfsd4_preprocess_layout_stateid(rqstp, cstate, &lrp->lr_sid,
+ false, lrp->lr_layout_type,
+ &ls);
+ if (nfserr)
+ return nfserr;
+
+ spin_lock(&ls->ls_lock);
+ list_for_each_entry_safe(lp, n, &ls->ls_layouts, lo_perstate) {
+ if (layouts_overlapping(lp, &lrp->lr_seg)) {
+ nfsd4_return_file_layout(lp, &lrp->lr_seg, &reaplist);
+ found++;
+ }
+ }
+ if (!list_empty(&ls->ls_layouts)) {
+ if (found) {
+ update_stateid(&ls->ls_stid.sc_stateid);
+ memcpy(&lrp->lr_sid, &ls->ls_stid.sc_stateid,
+ sizeof(stateid_t));
+ }
+ lrp->lrs_present = 1;
+ } else {
+ nfs4_unhash_stid(&ls->ls_stid);
+ lrp->lrs_present = 0;
+ }
+ spin_unlock(&ls->ls_lock);
+
+ nfs4_put_stid(&ls->ls_stid);
+ nfsd4_free_layouts(&reaplist);
+ return nfs_ok;
+}
+
+__be32
+nfsd4_return_client_layouts(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate,
+ struct nfsd4_layoutreturn *lrp)
+{
+ struct nfs4_layout_stateid *ls, *n;
+ struct nfs4_client *clp = cstate->clp;
+ struct nfs4_layout *lp, *t;
+ LIST_HEAD(reaplist);
+
+ lrp->lrs_present = 0;
+
+ spin_lock(&clp->cl_lock);
+ list_for_each_entry_safe(ls, n, &clp->cl_lo_states, ls_perclnt) {
+ if (lrp->lr_return_type == RETURN_FSID &&
+ !fh_fsid_match(&ls->ls_stid.sc_file->fi_fhandle,
+ &cstate->current_fh.fh_handle))
+ continue;
+
+ spin_lock(&ls->ls_lock);
+ list_for_each_entry_safe(lp, t, &ls->ls_layouts, lo_perstate) {
+ if (lrp->lr_seg.iomode == IOMODE_ANY ||
+ lrp->lr_seg.iomode == lp->lo_seg.iomode)
+ list_move_tail(&lp->lo_perstate, &reaplist);
+ }
+ spin_unlock(&ls->ls_lock);
+ }
+ spin_unlock(&clp->cl_lock);
+
+ nfsd4_free_layouts(&reaplist);
+ return 0;
+}
+
+static void
+nfsd4_return_all_layouts(struct nfs4_layout_stateid *ls,
+ struct list_head *reaplist)
+{
+ spin_lock(&ls->ls_lock);
+ list_splice_init(&ls->ls_layouts, reaplist);
+ spin_unlock(&ls->ls_lock);
+}
+
+void
+nfsd4_return_all_client_layouts(struct nfs4_client *clp)
+{
+ struct nfs4_layout_stateid *ls, *n;
+ LIST_HEAD(reaplist);
+
+ spin_lock(&clp->cl_lock);
+ list_for_each_entry_safe(ls, n, &clp->cl_lo_states, ls_perclnt)
+ nfsd4_return_all_layouts(ls, &reaplist);
+ spin_unlock(&clp->cl_lock);
+
+ nfsd4_free_layouts(&reaplist);
+}
+
+void
+nfsd4_return_all_file_layouts(struct nfs4_client *clp, struct nfs4_file *fp)
+{
+ struct nfs4_layout_stateid *ls, *n;
+ LIST_HEAD(reaplist);
+
+ spin_lock(&fp->fi_lock);
+ list_for_each_entry_safe(ls, n, &fp->fi_lo_states, ls_perfile) {
+ if (ls->ls_stid.sc_client == clp)
+ nfsd4_return_all_layouts(ls, &reaplist);
+ }
+ spin_unlock(&fp->fi_lock);
+
+ nfsd4_free_layouts(&reaplist);
+}
+
+int
+nfsd4_init_pnfs(void)
+{
+ int i;
+
+ for (i = 0; i < DEVID_HASH_SIZE; i++)
+ INIT_LIST_HEAD(&nfsd_devid_hash[i]);
+
+ nfs4_layout_cache = kmem_cache_create("nfs4_layout",
+ sizeof(struct nfs4_layout), 0, 0, NULL);
+ if (!nfs4_layout_cache)
+ return -ENOMEM;
+
+ nfs4_layout_stateid_cache = kmem_cache_create("nfs4_layout_stateid",
+ sizeof(struct nfs4_layout_stateid), 0, 0, NULL);
+ if (!nfs4_layout_stateid_cache) {
+ kmem_cache_destroy(nfs4_layout_cache);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void
+nfsd4_exit_pnfs(void)
+{
+ int i;
+
+ kmem_cache_destroy(nfs4_layout_cache);
+ kmem_cache_destroy(nfs4_layout_stateid_cache);
+
+ for (i = 0; i < DEVID_HASH_SIZE; i++) {
+ struct nfsd4_deviceid_map *map, *n;
+
+ list_for_each_entry_safe(map, n, &nfsd_devid_hash[i], hash)
+ kfree(map);
+ }
+}
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index ac71d13..b813913 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -43,6 +43,7 @@
#include "current_stateid.h"
#include "netns.h"
#include "acl.h"
+#include "pnfs.h"
#ifdef CONFIG_NFSD_V4_SECURITY_LABEL
#include <linux/security.h>
@@ -1178,6 +1179,252 @@ nfsd4_verify(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
return status == nfserr_same ? nfs_ok : status;
}
+#ifdef CONFIG_NFSD_PNFS
+static const struct nfsd4_layout_ops *
+nfsd4_layout_verify(struct svc_export *exp, unsigned int layout_type)
+{
+ if (!exp->ex_layout_type) {
+ dprintk("%s: export does not support pNFS\n", __func__);
+ return NULL;
+ }
+
+ if (exp->ex_layout_type != layout_type) {
+ dprintk("%s: layout type %d not supported\n",
+ __func__, layout_type);
+ return NULL;
+ }
+
+ return nfsd4_layout_ops[layout_type];
+}
+
+static __be32
+nfsd4_getdeviceinfo(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate,
+ struct nfsd4_getdeviceinfo *gdp)
+{
+ const struct nfsd4_layout_ops *ops;
+ struct nfsd4_deviceid_map *map;
+ struct svc_export *exp;
+ __be32 nfserr;
+
+ dprintk("%s: layout_type %u dev_id [0x%llx:0x%x] maxcnt %u\n",
+ __func__,
+ gdp->gd_layout_type,
+ gdp->gd_devid.fsid_idx, gdp->gd_devid.generation,
+ gdp->gd_maxcount);
+
+ map = nfsd4_find_devid_map(gdp->gd_devid.fsid_idx);
+ if (!map) {
+ dprintk("%s: couldn't find device ID to export mapping!\n",
+ __func__);
+ return nfserr_noent;
+ }
+
+ exp = rqst_exp_find(rqstp, map->fsid_type, map->fsid);
+ if (IS_ERR(exp)) {
+ dprintk("%s: could not find device id\n", __func__);
+ return nfserr_noent;
+ }
+
+ nfserr = nfserr_layoutunavailable;
+ ops = nfsd4_layout_verify(exp, gdp->gd_layout_type);
+ if (!ops)
+ goto out;
+
+ nfserr = nfs_ok;
+ if (gdp->gd_maxcount != 0)
+ nfserr = ops->proc_getdeviceinfo(exp->ex_path.mnt->mnt_sb, gdp);
+
+ gdp->gd_notify_types &= ops->notify_types;
+ exp_put(exp);
+out:
+ return nfserr;
+}
+
+static __be32
+nfsd4_layoutget(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate,
+ struct nfsd4_layoutget *lgp)
+{
+ struct svc_fh *current_fh = &cstate->current_fh;
+ const struct nfsd4_layout_ops *ops;
+ struct nfs4_layout_stateid *ls;
+ __be32 nfserr;
+ int accmode;
+
+ switch (lgp->lg_seg.iomode) {
+ case IOMODE_READ:
+ accmode = NFSD_MAY_READ;
+ break;
+ case IOMODE_RW:
+ accmode = NFSD_MAY_READ | NFSD_MAY_WRITE;
+ break;
+ default:
+ dprintk("%s: invalid iomode %d\n",
+ __func__, lgp->lg_seg.iomode);
+ nfserr = nfserr_badiomode;
+ goto out;
+ }
+
+ nfserr = fh_verify(rqstp, current_fh, 0, accmode);
+ if (nfserr)
+ goto out;
+
+ nfserr = nfserr_layoutunavailable;
+ ops = nfsd4_layout_verify(current_fh->fh_export, lgp->lg_layout_type);
+ if (!ops)
+ goto out;
+
+ /*
+ * Verify minlength and range as per RFC5661:
+ * o If loga_length is less than loga_minlength,
+ * the metadata server MUST return NFS4ERR_INVAL.
+ * o If the sum of loga_offset and loga_minlength exceeds
+ * NFS4_UINT64_MAX, and loga_minlength is not
+ * NFS4_UINT64_MAX, the error NFS4ERR_INVAL MUST result.
+ * o If the sum of loga_offset and loga_length exceeds
+ * NFS4_UINT64_MAX, and loga_length is not NFS4_UINT64_MAX,
+ * the error NFS4ERR_INVAL MUST result.
+ */
+ nfserr = nfserr_inval;
+ if (lgp->lg_seg.length < lgp->lg_minlength ||
+ (lgp->lg_minlength != NFS4_MAX_UINT64 &&
+ lgp->lg_minlength > NFS4_MAX_UINT64 - lgp->lg_seg.offset) ||
+ (lgp->lg_seg.length != NFS4_MAX_UINT64 &&
+ lgp->lg_seg.length > NFS4_MAX_UINT64 - lgp->lg_seg.offset))
+ goto out;
+ if (lgp->lg_seg.length == 0)
+ goto out;
+
+ nfserr = nfsd4_preprocess_layout_stateid(rqstp, cstate, &lgp->lg_sid,
+ true, lgp->lg_layout_type, &ls);
+ if (nfserr)
+ goto out;
+
+ nfserr = ops->proc_layoutget(current_fh->fh_dentry->d_inode,
+ current_fh, lgp);
+ if (nfserr)
+ goto out_put_stid;
+
+ nfserr = nfsd4_insert_layout(lgp, ls);
+
+out_put_stid:
+ nfs4_put_stid(&ls->ls_stid);
+out:
+ return nfserr;
+}
+
+static __be32
+nfsd4_layoutcommit(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate,
+ struct nfsd4_layoutcommit *lcp)
+{
+ const struct nfsd4_layout_seg *seg = &lcp->lc_seg;
+ struct svc_fh *current_fh = &cstate->current_fh;
+ const struct nfsd4_layout_ops *ops;
+ loff_t new_size = lcp->lc_last_wr + 1;
+ struct inode *inode;
+ struct nfs4_layout_stateid *ls;
+ __be32 nfserr;
+
+ nfserr = fh_verify(rqstp, current_fh, 0, NFSD_MAY_WRITE);
+ if (nfserr)
+ goto out;
+
+ nfserr = nfserr_layoutunavailable;
+ ops = nfsd4_layout_verify(current_fh->fh_export, lcp->lc_layout_type);
+ if (!ops)
+ goto out;
+ inode = current_fh->fh_dentry->d_inode;
+
+ nfserr = nfserr_inval;
+ if (new_size <= seg->offset) {
+ dprintk("pnfsd: last write before layout segment\n");
+ goto out;
+ }
+ if (new_size > seg->offset + seg->length) {
+ dprintk("pnfsd: last write beyond layout segment\n");
+ goto out;
+ }
+ if (!lcp->lc_newoffset && new_size > i_size_read(inode)) {
+ dprintk("pnfsd: layoutcommit beyond EOF\n");
+ goto out;
+ }
+
+ nfserr = nfsd4_preprocess_layout_stateid(rqstp, cstate, &lcp->lc_sid,
+ false, lcp->lc_layout_type,
+ &ls);
+ if (nfserr) {
+ /* fixup error code as per RFC5661 */
+ if (nfserr == nfserr_bad_stateid)
+ nfserr = nfserr_badlayout;
+ goto out;
+ }
+
+ nfserr = ops->proc_layoutcommit(inode, lcp);
+ if (nfserr)
+ goto out_put_stid;
+
+ if (new_size > i_size_read(inode)) {
+ lcp->lc_size_chg = 1;
+ lcp->lc_newsize = new_size;
+ } else {
+ lcp->lc_size_chg = 0;
+ }
+
+out_put_stid:
+ nfs4_put_stid(&ls->ls_stid);
+out:
+ return nfserr;
+}
+
+static __be32
+nfsd4_layoutreturn(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate,
+ struct nfsd4_layoutreturn *lrp)
+{
+ struct svc_fh *current_fh = &cstate->current_fh;
+ __be32 nfserr;
+
+ nfserr = fh_verify(rqstp, current_fh, 0, NFSD_MAY_NOP);
+ if (nfserr)
+ goto out;
+
+ nfserr = nfserr_layoutunavailable;
+ if (!nfsd4_layout_verify(current_fh->fh_export, lrp->lr_layout_type))
+ goto out;
+
+ switch (lrp->lr_seg.iomode) {
+ case IOMODE_READ:
+ case IOMODE_RW:
+ case IOMODE_ANY:
+ break;
+ default:
+ dprintk("%s: invalid iomode %d\n", __func__,
+ lrp->lr_seg.iomode);
+ nfserr = nfserr_inval;
+ goto out;
+ }
+
+ switch (lrp->lr_return_type) {
+ case RETURN_FILE:
+ nfserr = nfsd4_return_file_layouts(rqstp, cstate, lrp);
+ break;
+ case RETURN_FSID:
+ case RETURN_ALL:
+ nfserr = nfsd4_return_client_layouts(rqstp, cstate, lrp);
+ break;
+ default:
+ dprintk("%s: invalid return_type %d\n", __func__,
+ lrp->lr_return_type);
+ nfserr = nfserr_inval;
+ break;
+ }
+out:
+ return nfserr;
+}
+#endif /* CONFIG_NFSD_PNFS */
+
/*
* NULL call.
*/
@@ -1966,6 +2213,25 @@ static struct nfsd4_operation nfsd4_ops[] = {
.op_get_currentstateid = (stateid_getter)nfsd4_get_freestateid,
.op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
},
+#ifdef CONFIG_NFSD_PNFS
+ [OP_GETDEVICEINFO] = {
+ .op_func = (nfsd4op_func)nfsd4_getdeviceinfo,
+ .op_flags = ALLOWED_WITHOUT_FH,
+ .op_name = "OP_GETDEVICEINFO",
+ },
+ [OP_LAYOUTGET] = {
+ .op_func = (nfsd4op_func)nfsd4_layoutget,
+ .op_name = "OP_LAYOUTGET",
+ },
+ [OP_LAYOUTCOMMIT] = {
+ .op_func = (nfsd4op_func)nfsd4_layoutcommit,
+ .op_name = "OP_LAYOUTCOMMIT",
+ },
+ [OP_LAYOUTRETURN] = {
+ .op_func = (nfsd4op_func)nfsd4_layoutreturn,
+ .op_name = "OP_LAYOUTRETURN",
+ },
+#endif /* CONFIG_NFSD_PNFS */
/* NFSv4.2 operations */
[OP_ALLOCATE] = {
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 9f6a075..eb972e6 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -48,6 +48,7 @@
#include "current_stateid.h"
#include "netns.h"
+#include "pnfs.h"
#define NFSDDBG_FACILITY NFSDDBG_PROC
@@ -1544,6 +1545,9 @@ static struct nfs4_client *alloc_client(struct xdr_netobj name)
INIT_LIST_HEAD(&clp->cl_lru);
INIT_LIST_HEAD(&clp->cl_callbacks);
INIT_LIST_HEAD(&clp->cl_revoked);
+#ifdef CONFIG_NFSD_PNFS
+ INIT_LIST_HEAD(&clp->cl_lo_states);
+#endif
spin_lock_init(&clp->cl_lock);
rpc_init_wait_queue(&clp->cl_cb_waitq, "Backchannel slot table");
return clp;
@@ -1648,6 +1652,7 @@ __destroy_client(struct nfs4_client *clp)
nfs4_get_stateowner(&oo->oo_owner);
release_openowner(oo);
}
+ nfsd4_return_all_client_layouts(clp);
nfsd4_shutdown_callback(clp);
if (clp->cl_cb_conn.cb_xprt)
svc_xprt_put(clp->cl_cb_conn.cb_xprt);
@@ -2131,8 +2136,11 @@ nfsd4_replay_cache_entry(struct nfsd4_compoundres *resp,
static void
nfsd4_set_ex_flags(struct nfs4_client *new, struct nfsd4_exchange_id *clid)
{
- /* pNFS is not supported */
+#ifdef CONFIG_NFSD_PNFS
+ new->cl_exchange_flags |= EXCHGID4_FLAG_USE_PNFS_MDS;
+#else
new->cl_exchange_flags |= EXCHGID4_FLAG_USE_NON_PNFS;
+#endif
/* Referrals are supported, Migration is not. */
new->cl_exchange_flags |= EXCHGID4_FLAG_SUPP_MOVED_REFER;
@@ -3060,6 +3068,9 @@ static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
fp->fi_share_deny = 0;
memset(fp->fi_fds, 0, sizeof(fp->fi_fds));
memset(fp->fi_access, 0, sizeof(fp->fi_access));
+#ifdef CONFIG_NFSD_PNFS
+ INIT_LIST_HEAD(&fp->fi_lo_states);
+#endif
hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
}
@@ -4845,6 +4856,9 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
update_stateid(&stp->st_stid.sc_stateid);
memcpy(&close->cl_stateid, &stp->st_stid.sc_stateid, sizeof(stateid_t));
+ nfsd4_return_all_file_layouts(stp->st_stateowner->so_client,
+ stp->st_stid.sc_file);
+
nfsd4_close_open_stateid(stp);
/* put reference from nfs4_preprocess_seqid_op */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index fe31178..161cc37 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -47,6 +47,7 @@
#include "state.h"
#include "cache.h"
#include "netns.h"
+#include "pnfs.h"
#ifdef CONFIG_NFSD_V4_SECURITY_LABEL
#include <linux/security.h>
@@ -1518,6 +1519,127 @@ static __be32 nfsd4_decode_reclaim_complete(struct nfsd4_compoundargs *argp, str
DECODE_TAIL;
}
+#ifdef CONFIG_NFSD_PNFS
+static __be32
+nfsd4_decode_getdeviceinfo(struct nfsd4_compoundargs *argp,
+ struct nfsd4_getdeviceinfo *gdev)
+{
+ DECODE_HEAD;
+ u32 num, i;
+
+ READ_BUF(sizeof(struct nfsd4_deviceid) + 3 * 4);
+ COPYMEM(&gdev->gd_devid, sizeof(struct nfsd4_deviceid));
+ gdev->gd_layout_type = be32_to_cpup(p++);
+ gdev->gd_maxcount = be32_to_cpup(p++);
+ num = be32_to_cpup(p++);
+ if (num) {
+ READ_BUF(4 * num);
+ gdev->gd_notify_types = be32_to_cpup(p++);
+ for (i = 1; i < num; i++) {
+ if (be32_to_cpup(p++)) {
+ status = nfserr_inval;
+ goto out;
+ }
+ }
+ }
+ DECODE_TAIL;
+}
+
+static __be32
+nfsd4_decode_layoutget(struct nfsd4_compoundargs *argp,
+ struct nfsd4_layoutget *lgp)
+{
+ DECODE_HEAD;
+
+ READ_BUF(36);
+ lgp->lg_signal = be32_to_cpup(p++);
+ lgp->lg_layout_type = be32_to_cpup(p++);
+ lgp->lg_seg.iomode = be32_to_cpup(p++);
+ p = xdr_decode_hyper(p, &lgp->lg_seg.offset);
+ p = xdr_decode_hyper(p, &lgp->lg_seg.length);
+ p = xdr_decode_hyper(p, &lgp->lg_minlength);
+ nfsd4_decode_stateid(argp, &lgp->lg_sid);
+ READ_BUF(4);
+ lgp->lg_maxcount = be32_to_cpup(p++);
+
+ DECODE_TAIL;
+}
+
+static __be32
+nfsd4_decode_layoutcommit(struct nfsd4_compoundargs *argp,
+ struct nfsd4_layoutcommit *lcp)
+{
+ DECODE_HEAD;
+ u32 timechange;
+
+ READ_BUF(20);
+ p = xdr_decode_hyper(p, &lcp->lc_seg.offset);
+ p = xdr_decode_hyper(p, &lcp->lc_seg.length);
+ lcp->lc_reclaim = be32_to_cpup(p++);
+ nfsd4_decode_stateid(argp, &lcp->lc_sid);
+ READ_BUF(4);
+ lcp->lc_newoffset = be32_to_cpup(p++);
+ if (lcp->lc_newoffset) {
+ READ_BUF(8);
+ p = xdr_decode_hyper(p, &lcp->lc_last_wr);
+ } else
+ lcp->lc_last_wr = 0;
+ READ_BUF(4);
+ timechange = be32_to_cpup(p++);
+ if (timechange) {
+ status = nfsd4_decode_time(argp, &lcp->lc_mtime);
+ if (status)
+ return status;
+ } else {
+ lcp->lc_mtime.tv_nsec = UTIME_NOW;
+ }
+ READ_BUF(8);
+ lcp->lc_layout_type = be32_to_cpup(p++);
+
+ /*
+ * Save the layout update in XDR format and let the layout driver deal
+ * with it later.
+ */
+ lcp->lc_up_len = be32_to_cpup(p++);
+ if (lcp->lc_up_len > 0) {
+ READ_BUF(lcp->lc_up_len);
+ READMEM(lcp->lc_up_layout, lcp->lc_up_len);
+ }
+
+ DECODE_TAIL;
+}
+
+static __be32
+nfsd4_decode_layoutreturn(struct nfsd4_compoundargs *argp,
+ struct nfsd4_layoutreturn *lrp)
+{
+ DECODE_HEAD;
+
+ READ_BUF(16);
+ lrp->lr_reclaim = be32_to_cpup(p++);
+ lrp->lr_layout_type = be32_to_cpup(p++);
+ lrp->lr_seg.iomode = be32_to_cpup(p++);
+ lrp->lr_return_type = be32_to_cpup(p++);
+ if (lrp->lr_return_type == RETURN_FILE) {
+ READ_BUF(16);
+ p = xdr_decode_hyper(p, &lrp->lr_seg.offset);
+ p = xdr_decode_hyper(p, &lrp->lr_seg.length);
+ nfsd4_decode_stateid(argp, &lrp->lr_sid);
+ READ_BUF(4);
+ lrp->lrf_body_len = be32_to_cpup(p++);
+ if (lrp->lrf_body_len > 0) {
+ READ_BUF(lrp->lrf_body_len);
+ READMEM(lrp->lrf_body, lrp->lrf_body_len);
+ }
+ } else {
+ lrp->lr_seg.offset = 0;
+ lrp->lr_seg.length = NFS4_MAX_UINT64;
+ }
+
+ DECODE_TAIL;
+}
+#endif /* CONFIG_NFSD_PNFS */
+
static __be32
nfsd4_decode_fallocate(struct nfsd4_compoundargs *argp,
struct nfsd4_fallocate *fallocate)
@@ -1612,11 +1734,19 @@ static nfsd4_dec nfsd4_dec_ops[] = {
[OP_DESTROY_SESSION] = (nfsd4_dec)nfsd4_decode_destroy_session,
[OP_FREE_STATEID] = (nfsd4_dec)nfsd4_decode_free_stateid,
[OP_GET_DIR_DELEGATION] = (nfsd4_dec)nfsd4_decode_notsupp,
+#ifdef CONFIG_NFSD_PNFS
+ [OP_GETDEVICEINFO] = (nfsd4_dec)nfsd4_decode_getdeviceinfo,
+ [OP_GETDEVICELIST] = (nfsd4_dec)nfsd4_decode_notsupp,
+ [OP_LAYOUTCOMMIT] = (nfsd4_dec)nfsd4_decode_layoutcommit,
+ [OP_LAYOUTGET] = (nfsd4_dec)nfsd4_decode_layoutget,
+ [OP_LAYOUTRETURN] = (nfsd4_dec)nfsd4_decode_layoutreturn,
+#else
[OP_GETDEVICEINFO] = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_GETDEVICELIST] = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_LAYOUTCOMMIT] = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_LAYOUTGET] = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_LAYOUTRETURN] = (nfsd4_dec)nfsd4_decode_notsupp,
+#endif
[OP_SECINFO_NO_NAME] = (nfsd4_dec)nfsd4_decode_secinfo_no_name,
[OP_SEQUENCE] = (nfsd4_dec)nfsd4_decode_sequence,
[OP_SET_SSV] = (nfsd4_dec)nfsd4_decode_notsupp,
@@ -2544,6 +2674,30 @@ out_acl:
get_parent_attributes(exp, &stat);
p = xdr_encode_hyper(p, stat.ino);
}
+#ifdef CONFIG_NFSD_PNFS
+ if ((bmval1 & FATTR4_WORD1_FS_LAYOUT_TYPES) ||
+ (bmval2 & FATTR4_WORD2_LAYOUT_TYPES)) {
+ if (exp->ex_layout_type) {
+ p = xdr_reserve_space(xdr, 8);
+ if (!p)
+ goto out_resource;
+ *p++ = cpu_to_be32(1);
+ *p++ = cpu_to_be32(exp->ex_layout_type);
+ } else {
+ p = xdr_reserve_space(xdr, 4);
+ if (!p)
+ goto out_resource;
+ *p++ = cpu_to_be32(0);
+ }
+ }
+
+ if (bmval2 & FATTR4_WORD2_LAYOUT_BLKSIZE) {
+ p = xdr_reserve_space(xdr, 4);
+ if (!p)
+ goto out_resource;
+ *p++ = cpu_to_be32(stat.blksize);
+ }
+#endif /* CONFIG_NFSD_PNFS */
if (bmval2 & FATTR4_WORD2_SECURITY_LABEL) {
status = nfsd4_encode_security_label(xdr, rqstp, context,
contextlen);
@@ -3819,6 +3973,150 @@ nfsd4_encode_test_stateid(struct nfsd4_compoundres *resp, __be32 nfserr,
return nfserr;
}
+#ifdef CONFIG_NFSD_PNFS
+static __be32
+nfsd4_encode_getdeviceinfo(struct nfsd4_compoundres *resp, __be32 nfserr,
+ struct nfsd4_getdeviceinfo *gdev)
+{
+ struct xdr_stream *xdr = &resp->xdr;
+ const struct nfsd4_layout_ops *ops =
+ nfsd4_layout_ops[gdev->gd_layout_type];
+ u32 starting_len = xdr->buf->len, needed_len;
+ __be32 *p;
+
+ dprintk("%s: err %d\n", __func__, nfserr);
+ if (nfserr)
+ goto out;
+
+ p = xdr_reserve_space(xdr, 4);
+ if (!p)
+ return nfserr_resource;
+ *p++ = cpu_to_be32(gdev->gd_layout_type);
+
+ /* If maxcount is 0 then just update notifications */
+ if (gdev->gd_maxcount != 0) {
+ nfserr = ops->encode_getdeviceinfo(xdr, gdev);
+ if (nfserr) {
+ /*
+ * We don't bother to burden the layout drivers with
+ * enforcing gd_maxcount, just tell the client to
+ * come back with a bigger buffer if it's not enough.
+ */
+ if (xdr->buf->len + 4 > gdev->gd_maxcount)
+ goto toosmall;
+ goto out;
+ }
+ }
+
+ if (gdev->gd_notify_types) {
+ p = xdr_reserve_space(xdr, 4 + 4);
+ if (!p)
+ return nfserr_resource;
+ *p++ = cpu_to_be32(1); /* bitmap length */
+ *p++ = cpu_to_be32(gdev->gd_notify_types);
+ } else {
+ p = xdr_reserve_space(xdr, 4);
+ if (!p)
+ return nfserr_resource;
+ *p++ = 0;
+ }
+
+out:
+ kfree(gdev->gd_device);
+ dprintk("%s: done: %d\n", __func__, be32_to_cpu(nfserr));
+ return nfserr;
+
+toosmall:
+ dprintk("%s: maxcount too small\n", __func__);
+ needed_len = xdr->buf->len + 4 /* notifications */;
+ xdr_truncate_encode(xdr, starting_len);
+ p = xdr_reserve_space(xdr, 4);
+ if (!p)
+ return nfserr_resource;
+ *p++ = cpu_to_be32(needed_len);
+ nfserr = nfserr_toosmall;
+ goto out;
+}
+
+static __be32
+nfsd4_encode_layoutget(struct nfsd4_compoundres *resp, __be32 nfserr,
+ struct nfsd4_layoutget *lgp)
+{
+ struct xdr_stream *xdr = &resp->xdr;
+ const struct nfsd4_layout_ops *ops =
+ nfsd4_layout_ops[lgp->lg_layout_type];
+ __be32 *p;
+
+ dprintk("%s: err %d\n", __func__, nfserr);
+ if (nfserr)
+ goto out;
+
+ nfserr = nfserr_resource;
+ p = xdr_reserve_space(xdr, 36 + sizeof(stateid_opaque_t));
+ if (!p)
+ goto out;
+
+ *p++ = cpu_to_be32(lgp->lg_roc);
+ *p++ = cpu_to_be32(lgp->lg_sid.si_generation);
+ p = xdr_encode_opaque_fixed(p, &lgp->lg_sid.si_opaque,
+ sizeof(stateid_opaque_t));
+
+ *p++ = cpu_to_be32(1); /* we always return a single layout */
+ p = xdr_encode_hyper(p, lgp->lg_seg.offset);
+ p = xdr_encode_hyper(p, lgp->lg_seg.length);
+ *p++ = cpu_to_be32(lgp->lg_seg.iomode);
+ *p++ = cpu_to_be32(lgp->lg_layout_type);
+
+ nfserr = ops->encode_layoutget(xdr, lgp);
+out:
+ kfree(lgp->lg_content);
+ return nfserr;
+}
+
+static __be32
+nfsd4_encode_layoutcommit(struct nfsd4_compoundres *resp, __be32 nfserr,
+ struct nfsd4_layoutcommit *lcp)
+{
+ struct xdr_stream *xdr = &resp->xdr;
+ __be32 *p;
+
+ if (nfserr)
+ return nfserr;
+
+ p = xdr_reserve_space(xdr, 4);
+ if (!p)
+ return nfserr_resource;
+ *p++ = cpu_to_be32(lcp->lc_size_chg);
+ if (lcp->lc_size_chg) {
+ p = xdr_reserve_space(xdr, 8);
+ if (!p)
+ return nfserr_resource;
+ p = xdr_encode_hyper(p, lcp->lc_newsize);
+ }
+
+ return nfs_ok;
+}
+
+static __be32
+nfsd4_encode_layoutreturn(struct nfsd4_compoundres *resp, __be32 nfserr,
+ struct nfsd4_layoutreturn *lrp)
+{
+ struct xdr_stream *xdr = &resp->xdr;
+ __be32 *p;
+
+ if (nfserr)
+ return nfserr;
+
+ p = xdr_reserve_space(xdr, 4);
+ if (!p)
+ return nfserr_resource;
+ *p++ = cpu_to_be32(lrp->lrs_present);
+ if (lrp->lrs_present)
+ nfsd4_encode_stateid(xdr, &lrp->lr_sid);
+ return nfs_ok;
+}
+#endif /* CONFIG_NFSD_PNFS */
+
static __be32
nfsd4_encode_seek(struct nfsd4_compoundres *resp, __be32 nfserr,
struct nfsd4_seek *seek)
@@ -3895,11 +4193,19 @@ static nfsd4_enc nfsd4_enc_ops[] = {
[OP_DESTROY_SESSION] = (nfsd4_enc)nfsd4_encode_noop,
[OP_FREE_STATEID] = (nfsd4_enc)nfsd4_encode_noop,
[OP_GET_DIR_DELEGATION] = (nfsd4_enc)nfsd4_encode_noop,
+#ifdef CONFIG_NFSD_PNFS
+ [OP_GETDEVICEINFO] = (nfsd4_enc)nfsd4_encode_getdeviceinfo,
+ [OP_GETDEVICELIST] = (nfsd4_enc)nfsd4_encode_noop,
+ [OP_LAYOUTCOMMIT] = (nfsd4_enc)nfsd4_encode_layoutcommit,
+ [OP_LAYOUTGET] = (nfsd4_enc)nfsd4_encode_layoutget,
+ [OP_LAYOUTRETURN] = (nfsd4_enc)nfsd4_encode_layoutreturn,
+#else
[OP_GETDEVICEINFO] = (nfsd4_enc)nfsd4_encode_noop,
[OP_GETDEVICELIST] = (nfsd4_enc)nfsd4_encode_noop,
[OP_LAYOUTCOMMIT] = (nfsd4_enc)nfsd4_encode_noop,
[OP_LAYOUTGET] = (nfsd4_enc)nfsd4_encode_noop,
[OP_LAYOUTRETURN] = (nfsd4_enc)nfsd4_encode_noop,
+#endif
[OP_SECINFO_NO_NAME] = (nfsd4_enc)nfsd4_encode_secinfo_no_name,
[OP_SEQUENCE] = (nfsd4_enc)nfsd4_encode_sequence,
[OP_SET_SSV] = (nfsd4_enc)nfsd4_encode_noop,
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 19ace74..aa47d75 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -21,6 +21,7 @@
#include "cache.h"
#include "state.h"
#include "netns.h"
+#include "pnfs.h"
/*
* We have a single directory with several nodes in it.
@@ -1258,9 +1259,12 @@ static int __init init_nfsd(void)
retval = nfsd4_init_slabs();
if (retval)
goto out_unregister_pernet;
- retval = nfsd_fault_inject_init(); /* nfsd fault injection controls */
+ retval = nfsd4_init_pnfs();
if (retval)
goto out_free_slabs;
+ retval = nfsd_fault_inject_init(); /* nfsd fault injection controls */
+ if (retval)
+ goto out_exit_pnfs;
nfsd_stat_init(); /* Statistics */
retval = nfsd_reply_cache_init();
if (retval)
@@ -1282,6 +1286,8 @@ out_free_lockd:
out_free_stat:
nfsd_stat_shutdown();
nfsd_fault_inject_cleanup();
+out_exit_pnfs:
+ nfsd4_exit_pnfs();
out_free_slabs:
nfsd4_free_slabs();
out_unregister_pernet:
@@ -1299,6 +1305,7 @@ static void __exit exit_nfsd(void)
nfsd_stat_shutdown();
nfsd_lockd_shutdown();
nfsd4_free_slabs();
+ nfsd4_exit_pnfs();
nfsd_fault_inject_cleanup();
unregister_filesystem(&nfsd_fs_type);
unregister_pernet_subsys(&nfsd_net_ops);
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 33a46a8..565c4da 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -325,15 +325,27 @@ void nfsd_lockd_shutdown(void);
#define NFSD4_SUPPORTED_ATTRS_WORD2 0
+/* 4.1 */
+#ifdef CONFIG_NFSD_PNFS
+#define PNFSD_SUPPORTED_ATTRS_WORD1 FATTR4_WORD1_FS_LAYOUT_TYPES
+#define PNFSD_SUPPORTED_ATTRS_WORD2 \
+(FATTR4_WORD2_LAYOUT_BLKSIZE | FATTR4_WORD2_LAYOUT_TYPES)
+#else
+#define PNFSD_SUPPORTED_ATTRS_WORD1 0
+#define PNFSD_SUPPORTED_ATTRS_WORD2 0
+#endif /* CONFIG_NFSD_PNFS */
+
#define NFSD4_1_SUPPORTED_ATTRS_WORD0 \
NFSD4_SUPPORTED_ATTRS_WORD0
#define NFSD4_1_SUPPORTED_ATTRS_WORD1 \
- NFSD4_SUPPORTED_ATTRS_WORD1
+ (NFSD4_SUPPORTED_ATTRS_WORD1 | PNFSD_SUPPORTED_ATTRS_WORD1)
#define NFSD4_1_SUPPORTED_ATTRS_WORD2 \
- (NFSD4_SUPPORTED_ATTRS_WORD2 | FATTR4_WORD2_SUPPATTR_EXCLCREAT)
+ (NFSD4_SUPPORTED_ATTRS_WORD2 | PNFSD_SUPPORTED_ATTRS_WORD2 | \
+ FATTR4_WORD2_SUPPATTR_EXCLCREAT)
+/* 4.2 */
#ifdef CONFIG_NFSD_V4_SECURITY_LABEL
#define NFSD4_2_SECURITY_ATTRS FATTR4_WORD2_SECURITY_LABEL
#else
diff --git a/fs/nfsd/pnfs.h b/fs/nfsd/pnfs.h
new file mode 100644
index 0000000..fa37117
--- /dev/null
+++ b/fs/nfsd/pnfs.h
@@ -0,0 +1,80 @@
+#ifndef _FS_NFSD_PNFSD_H
+#define _FS_NFSD_PNFSD_H 1
+
+#include <linux/exportfs.h>
+#include <linux/nfsd/export.h>
+
+#include "state.h"
+#include "xdr4.h"
+
+struct xdr_stream;
+
+struct nfsd4_deviceid_map {
+ struct list_head hash;
+ u64 idx;
+ int fsid_type;
+ u32 fsid[];
+};
+
+struct nfsd4_layout_ops {
+ u32 notify_types;
+
+ __be32 (*proc_getdeviceinfo)(struct super_block *sb,
+ struct nfsd4_getdeviceinfo *gdevp);
+ __be32 (*encode_getdeviceinfo)(struct xdr_stream *xdr,
+ struct nfsd4_getdeviceinfo *gdevp);
+
+ __be32 (*proc_layoutget)(struct inode *, const struct svc_fh *fhp,
+ struct nfsd4_layoutget *lgp);
+ __be32 (*encode_layoutget)(struct xdr_stream *,
+ struct nfsd4_layoutget *lgp);
+
+ __be32 (*proc_layoutcommit)(struct inode *inode,
+ struct nfsd4_layoutcommit *lcp);
+};
+
+extern const struct nfsd4_layout_ops *nfsd4_layout_ops[];
+
+__be32 nfsd4_preprocess_layout_stateid(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate, stateid_t *stateid,
+ bool create, u32 layout_type, struct nfs4_layout_stateid **lsp);
+__be32 nfsd4_insert_layout(struct nfsd4_layoutget *lgp,
+ struct nfs4_layout_stateid *ls);
+__be32 nfsd4_return_file_layouts(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate,
+ struct nfsd4_layoutreturn *lrp);
+__be32 nfsd4_return_client_layouts(struct svc_rqst *rqstp,
+ struct nfsd4_compound_state *cstate,
+ struct nfsd4_layoutreturn *lrp);
+int nfsd4_set_deviceid(struct nfsd4_deviceid *id, const struct svc_fh *fhp,
+ u32 device_generation);
+struct nfsd4_deviceid_map *nfsd4_find_devid_map(int idx);
+
+#ifdef CONFIG_NFSD_PNFS
+void nfsd4_setup_layout_type(struct svc_export *exp);
+void nfsd4_return_all_client_layouts(struct nfs4_client *);
+void nfsd4_return_all_file_layouts(struct nfs4_client *clp,
+ struct nfs4_file *fp);
+int nfsd4_init_pnfs(void);
+void nfsd4_exit_pnfs(void);
+#else
+static inline void nfsd4_setup_layout_type(struct svc_export *exp)
+{
+}
+
+static inline void nfsd4_return_all_client_layouts(struct nfs4_client *clp)
+{
+}
+static inline void nfsd4_return_all_file_layouts(struct nfs4_client *clp,
+ struct nfs4_file *fp)
+{
+}
+static inline void nfsd4_exit_pnfs(void)
+{
+}
+static inline int nfsd4_init_pnfs(void)
+{
+ return 0;
+}
+#endif /* CONFIG_NFSD_PNFS */
+#endif /* _FS_NFSD_PNFSD_H */
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 38ebb12..5f66b7f 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -92,6 +92,7 @@ struct nfs4_stid {
/* For a deleg stateid kept around only to process free_stateid's: */
#define NFS4_REVOKED_DELEG_STID 16
#define NFS4_CLOSED_DELEG_STID 32
+#define NFS4_LAYOUT_STID 64
unsigned char sc_type;
stateid_t sc_stateid;
struct nfs4_client *sc_client;
@@ -297,6 +298,9 @@ struct nfs4_client {
struct list_head cl_delegations;
struct list_head cl_revoked; /* unacknowledged, revoked 4.1 state */
struct list_head cl_lru; /* tail queue */
+#ifdef CONFIG_NFSD_PNFS
+ struct list_head cl_lo_states; /* outstanding layout states */
+#endif
struct xdr_netobj cl_name; /* id generated by client */
nfs4_verifier cl_verifier; /* generated by client */
time_t cl_time; /* time of last lease renewal */
@@ -496,6 +500,9 @@ struct nfs4_file {
int fi_delegees;
struct knfsd_fh fi_fhandle;
bool fi_had_conflict;
+#ifdef CONFIG_NFSD_PNFS
+ struct list_head fi_lo_states;
+#endif
};
/*
@@ -528,6 +535,20 @@ static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
return container_of(s, struct nfs4_ol_stateid, st_stid);
}
+struct nfs4_layout_stateid {
+ struct nfs4_stid ls_stid;
+ struct list_head ls_perclnt;
+ struct list_head ls_perfile;
+ spinlock_t ls_lock;
+ struct list_head ls_layouts;
+ u32 ls_layout_type;
+};
+
+static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
+{
+ return container_of(s, struct nfs4_layout_stateid, ls_stid);
+}
+
/* flags for preprocess_seqid_op() */
#define RD_STATE 0x00000010
#define WR_STATE 0x00000020
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 90a5925..4ac81cb 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -428,6 +428,62 @@ struct nfsd4_reclaim_complete {
u32 rca_one_fs;
};
+struct nfsd4_deviceid {
+ u64 fsid_idx;
+ u32 generation;
+ u32 pad;
+};
+
+struct nfsd4_layout_seg {
+ u32 iomode;
+ u64 offset;
+ u64 length;
+};
+
+struct nfsd4_getdeviceinfo {
+ struct nfsd4_deviceid gd_devid; /* request */
+ u32 gd_layout_type; /* request */
+ u32 gd_maxcount; /* request */
+ u32 gd_notify_types;/* request - response */
+ void *gd_device; /* response */
+};
+
+struct nfsd4_layoutget {
+ u64 lg_minlength; /* request */
+ u32 lg_signal; /* request */
+ u32 lg_layout_type; /* request */
+ u32 lg_maxcount; /* request */
+ stateid_t lg_sid; /* request/response */
+ struct nfsd4_layout_seg lg_seg; /* request/response */
+ u32 lg_roc; /* response */
+ void *lg_content; /* response */
+};
+
+struct nfsd4_layoutcommit {
+ stateid_t lc_sid; /* request */
+ struct nfsd4_layout_seg lc_seg; /* request */
+ u32 lc_reclaim; /* request */
+ u32 lc_newoffset; /* request */
+ u64 lc_last_wr; /* request */
+ struct timespec lc_mtime; /* request */
+ u32 lc_layout_type; /* request */
+ u32 lc_up_len; /* layout length */
+ void *lc_up_layout; /* decoded by callback */
+ u32 lc_size_chg; /* boolean for response */
+ u64 lc_newsize; /* response */
+};
+
+struct nfsd4_layoutreturn {
+ u32 lr_return_type; /* request */
+ u32 lr_layout_type; /* request */
+ struct nfsd4_layout_seg lr_seg; /* request */
+ u32 lr_reclaim; /* request */
+ u32 lrf_body_len; /* request */
+ void *lrf_body; /* request */
+ stateid_t lr_sid; /* request/response */
+ u32 lrs_present; /* response */
+};
+
struct nfsd4_fallocate {
/* request */
stateid_t falloc_stateid;
@@ -491,6 +547,10 @@ struct nfsd4_op {
struct nfsd4_reclaim_complete reclaim_complete;
struct nfsd4_test_stateid test_stateid;
struct nfsd4_free_stateid free_stateid;
+ struct nfsd4_getdeviceinfo getdeviceinfo;
+ struct nfsd4_layoutget layoutget;
+ struct nfsd4_layoutcommit layoutcommit;
+ struct nfsd4_layoutreturn layoutreturn;
/* NFSv4.2 */
struct nfsd4_fallocate allocate;
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 8a3589c..bc10d68 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -411,6 +411,7 @@ enum lock_type4 {
#define FATTR4_WORD1_TIME_MODIFY_SET (1UL << 22)
#define FATTR4_WORD1_MOUNTED_ON_FILEID (1UL << 23)
#define FATTR4_WORD1_FS_LAYOUT_TYPES (1UL << 30)
+#define FATTR4_WORD2_LAYOUT_TYPES (1UL << 0)
#define FATTR4_WORD2_LAYOUT_BLKSIZE (1UL << 1)
#define FATTR4_WORD2_MDSTHRESHOLD (1UL << 4)
#define FATTR4_WORD2_SECURITY_LABEL (1UL << 16)
diff --git a/include/uapi/linux/nfsd/debug.h b/include/uapi/linux/nfsd/debug.h
index 1fdc95b..0bf130a 100644
--- a/include/uapi/linux/nfsd/debug.h
+++ b/include/uapi/linux/nfsd/debug.h
@@ -32,6 +32,7 @@
#define NFSDDBG_REPCACHE 0x0080
#define NFSDDBG_XDR 0x0100
#define NFSDDBG_LOCKD 0x0200
+#define NFSDDBG_PNFS 0x0400
#define NFSDDBG_ALL 0x7FFF
#define NFSDDBG_NOCHANGE 0xFFFF
--
1.9.1
On Tue, 6 Jan 2015 17:28:32 +0100
Christoph Hellwig <[email protected]> wrote:
> Add support for the GETDEVICEINFO, LAYOUTGET, LAYOUTCOMMIT and
> LAYOUTRETURN NFSv4.1 operations, as well as backing code to manage
> outstanding layouts and devices.
>
> Layout management is very straightforward, with an nfs4_layout_stateid
> structure that extends nfs4_stid to manage layout stateids as the
> top-level structure. It is linked into the nfs4_file and nfs4_client
> structures like the other stateids, and contains a linked list of
> layouts that hang off the stateid. The actual layout operations are
> implemented in layout drivers that are not part of this commit, but
> will be added later.
>
> The worst part of this commit is the management of the pNFS device IDs,
> which suffers from a specification that is not sanely implementable due
> to the fact that the device-IDs are global and not bound to an export,
> and are small enough that we can't store the fsid portion of
> a file handle, and must never be reused. As we still need to perform all
> export authentication and validation checks on a device ID passed to
> GETDEVICEINFO we are caught between a rock and a hard place. To work
> around this issue we add a new hash that maps from a 64-bit integer to a
> fsid so that we can look up the export to authenticate against it,
> a 32-bit integer as a generation that we can bump when changing the device,
> and a currently unused 32-bit integer that could be used in the future
> to handle more than a single device per export. Entries in this hash
> table are never deleted as we can't reuse the ids anyway, and they would
> have a severe lifetime problem as Linux export structures are temporary
> and can go away under load.
>
> Parts of the XDR data, structures and marshaling/unmarshaling code, as
> well as many concepts are derived from the old pNFS server implementation
> from Andy Adamson, Benny Halevy, Dean Hildebrand, Marc Eshel, Fred Isaman,
> Mike Sager, Ricardo Labiaga and many others.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/nfsd/Kconfig | 10 +
> fs/nfsd/Makefile | 1 +
> fs/nfsd/export.c | 8 +
> fs/nfsd/export.h | 2 +
> fs/nfsd/nfs4layouts.c | 486 ++++++++++++++++++++++++++++++++++++++++
> fs/nfsd/nfs4proc.c | 266 ++++++++++++++++++++++
> fs/nfsd/nfs4state.c | 16 +-
> fs/nfsd/nfs4xdr.c | 306 +++++++++++++++++++++++++
> fs/nfsd/nfsctl.c | 9 +-
> fs/nfsd/nfsd.h | 16 +-
> fs/nfsd/pnfs.h | 80 +++++++
> fs/nfsd/state.h | 21 ++
> fs/nfsd/xdr4.h | 60 +++++
> include/linux/nfs4.h | 1 +
> include/uapi/linux/nfsd/debug.h | 1 +
> 15 files changed, 1279 insertions(+), 4 deletions(-)
> create mode 100644 fs/nfsd/nfs4layouts.c
> create mode 100644 fs/nfsd/pnfs.h
>
[...]
> @@ -4845,6 +4856,9 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> update_stateid(&stp->st_stid.sc_stateid);
> memcpy(&close->cl_stateid, &stp->st_stid.sc_stateid, sizeof(stateid_t));
>
> + nfsd4_return_all_file_layouts(stp->st_stateowner->so_client,
> + stp->st_stid.sc_file);
> +
Shouldn't the above be conditional on whether the lg_roc was true?
> nfsd4_close_open_stateid(stp);
>
> /* put reference from nfs4_preprocess_seqid_op */
--
Jeff Layton <[email protected]>
On Thu, Jan 08, 2015 at 04:48:51PM -0800, Jeff Layton wrote:
> > + nfsd4_return_all_file_layouts(stp->st_stateowner->so_client,
> > + stp->st_stid.sc_file);
> > +
>
> Shouldn't the above be conditional on whether the lg_roc was true?
There is no support for non-lg_roc layouts at the moment.
In general I've avoided adding code that isn't used, as it can't
be tested and thus most likely won't work.
On Fri, 9 Jan 2015 11:05:51 +0100
Christoph Hellwig <[email protected]> wrote:
> On Thu, Jan 08, 2015 at 04:48:51PM -0800, Jeff Layton wrote:
> > > + nfsd4_return_all_file_layouts(stp->st_stateowner->so_client,
> > > + stp->st_stid.sc_file);
> > > +
> >
> > Shouldn't the above be conditional on whether the lg_roc was true?
>
> There is no support for non-lg_roc layouts at the moment.
>
> In general I've avoided adding code that isn't used, as it can't
> be tested and thus most likely won't work.
Ok, it'd be good to document that in some comments then for the sake of
posterity (maybe it is later in the set -- I haven't gotten to the end
yet).
Now, that said...I think that your ROC semantics are wrong here. You
also have to take delegations into account. [1]
Basically the semantics that you want are that nfsd should do all of
the ROC stuff on last close iff there are no outstanding delegations or
on delegreturn iff there are no opens.
What we ended up doing in the unreleased code we have was to create a
new per-client and per-file object (that we creatively called an
"odstate"). An open stateid and a delegation stateid would hold a
reference to this object which is put when those stateids are freed.
When its refcount goes to zero, then we'd free any outstanding layouts
on the file for that client and free the object.
You probably want to do something similar here.
[1]: Tom and Trond mentioned that there's a RFC5661 errata pending for
this, but I don't see it right offhand.
--
Jeff Layton <[email protected]>
On Fri, Jan 09, 2015 at 08:51:30AM -0800, Jeff Layton wrote:
> Ok, it'd be good to document that in some comments then for the sake of
> posterity (maybe it is later in the set -- I haven't gotten to the end
> yet).
What kinds of comments do you expect? Not implementing unused features
of a protocol should be the default for anything in Linux.
> Now, that said...I think that your ROC semantics are wrong here. You
> also have to take delegations into account. [1]
>
> Basically the semantics that you want are that nfsd should do all of
> the ROC stuff on last close iff there are no outstanding delegations or
> on delegreturn iff there are no opens.
>
> What we ended up doing in the unreleased code we have was to create a
> new per-client and per-file object (that we creatively called an
> "odstate"). An open stateid and a delegation stateid would hold a
> reference to this object which is put when those stateids are freed.
> When its refcount goes to zero, then we'd free any outstanding layouts
> on the file for that client and free the object.
>
> You probably want to do something similar here.
>
> [1]: Tom and Trond mentioned that there's a RFC5661 errata pending for
> this, but I don't see it right offhand.
It would be good to look at the errata. While the idea of keeping
layouts around longer makes sense, I would only expect to do this
if the layout state was created based on a delegation stateid, not
a lock or open stateid. In that case having the layouts hang off
the "parent" stateid might be another option.
On Fri, 9 Jan 2015 18:16:41 +0100
Christoph Hellwig <[email protected]> wrote:
> On Fri, Jan 09, 2015 at 08:51:30AM -0800, Jeff Layton wrote:
> > Ok, it'd be good to document that in some comments then for the sake of
> > posterity (maybe it is later in the set -- I haven't gotten to the end
> > yet).
>
> What kinds of comments do you expect? Not implementing unused features
> of a protocol should be the default for anything in Linux.
>
I was thinking just a comment saying that ROC is always true in this
implementation, or maybe consider eliminating the lg_roc field in
struct nfsd4_layoutget altogether since it's currently always "1".
It's a little confusing now since the encoder can handle the case where
lg_roc is 0, but the rest of the code can't.
> > Now, that said...I think that your ROC semantics are wrong here. You
> > also have to take delegations into account. [1]
> >
> > Basically the semantics that you want are that nfsd should do all of
> > the ROC stuff on last close iff there are no outstanding delegations or
> > on delegreturn iff there are no opens.
> >
> > What we ended up doing in the unreleased code we have was to create a
> > new per-client and per-file object (that we creatively called an
> > "odstate"). An open stateid and a delegation stateid would hold a
> > reference to this object which is put when those stateids are freed.
> > When its refcount goes to zero, then we'd free any outstanding layouts
> > on the file for that client and free the object.
> >
> > You probably want to do something similar here.
> >
> > [1]: Tom and Trond mentioned that there's a RFC5661 errata pending for
> > this, but I don't see it right offhand.
>
> It would be good to look at the errata. While the idea of keeping
> layouts around longer makes sense, I would only expect to do this
> if the layout state was created based on a delegation stateid, not
> a lock or open stateid. In that case having the layouts hang off
> the "parent" stateid might be another option.
I found it:
http://www.rfc-editor.org/errata_search.php?rfc=5661&eid=3226
--
Jeff Layton <[email protected]>
On Fri, 9 Jan 2015 09:28:35 -0800
Jeff Layton <[email protected]> wrote:
> On Fri, 9 Jan 2015 18:16:41 +0100
> Christoph Hellwig <[email protected]> wrote:
>
> > On Fri, Jan 09, 2015 at 08:51:30AM -0800, Jeff Layton wrote:
> > > Ok, it'd be good to document that in some comments then for the sake of
> > > posterity (maybe it is later in the set -- I haven't gotten to the end
> > > yet).
> >
> > What kinds of comments do you expect? Not implementing unused features
> > of a protocol should be the default for anything in Linux.
> >
>
> I was thinking just a comment saying that ROC is always true in this
> implementation, or maybe consider eliminating the lg_roc field in
> struct nfsd4_layoutget altogether since it's currently always "1".
>
> It's a little confusing now since the encoder can handle the case where
> lg_roc is 0, but the rest of the code can't.
>
> > > Now, that said...I think that your ROC semantics are wrong here. You
> > > also have to take delegations into account. [1]
> > >
> > > Basically the semantics that you want are that nfsd should do all of
> > > the ROC stuff on last close iff there are no outstanding delegations or
> > > on delegreturn iff there are no opens.
> > >
> > > What we ended up doing in the unreleased code we have was to create a
> > > new per-client and per-file object (that we creatively called an
> > > "odstate"). An open stateid and a delegation stateid would hold a
> > > reference to this object which is put when those stateids are freed.
> > > When its refcount goes to zero, then we'd free any outstanding layouts
> > > on the file for that client and free the object.
> > >
> > > You probably want to do something similar here.
> > >
> > > [1]: Tom and Trond mentioned that there's a RFC5661 errata pending for
> > > this, but I don't see it right offhand.
> >
> > It would be good to look at the errata. While the idea of keeping
> > layouts around longer makes sense, I would only expect to do this
> > if the layout state was created based on a delegation stateid, not
> > a lock or open stateid. In that case having the layouts hang off
> > the "parent" stateid might be another option.
>
> I found it:
>
> http://www.rfc-editor.org/errata_search.php?rfc=5661&eid=3226
>
Oh, hmm...except that doesn't seem to have been updated according to
the discussion from around a year ago. See the thread entitled:
[nfsv4] NFSv4.1 errata id 3226 (the return of return-on-close layouts)
...on thee [email protected] mailing list. Trond mentions it there.
Perhaps we need to revise that errata?
--
Jeff Layton <[email protected]>
On Fri, Jan 9, 2015 at 9:33 AM, Jeff Layton <[email protected]> wrote:
> On Fri, 9 Jan 2015 09:28:35 -0800
> Jeff Layton <[email protected]> wrote:
>
>> On Fri, 9 Jan 2015 18:16:41 +0100
>> Christoph Hellwig <[email protected]> wrote:
>>
>> > On Fri, Jan 09, 2015 at 08:51:30AM -0800, Jeff Layton wrote:
>> > > Ok, it'd be good to document that in some comments then for the sake of
>> > > posterity (maybe it is later in the set -- I haven't gotten to the end
>> > > yet).
>> >
>> > What kinds of comments do you expect? Not implementing unused features
>> > of a protocol should be the default for anything in Linux.
>> >
>>
>> I was thinking just a comment saying that ROC is always true in this
>> implementation, or maybe consider eliminating the lg_roc field in
>> struct nfsd4_layoutget altogether since it's currently always "1".
>>
>> It's a little confusing now since the encoder can handle the case where
>> lg_roc is 0, but the rest of the code can't.
>>
>> > > Now, that said...I think that your ROC semantics are wrong here. You
>> > > also have to take delegations into account. [1]
>> > >
>> > > Basically the semantics that you want are that nfsd should do all of
>> > > the ROC stuff on last close iff there are no outstanding delegations or
>> > > on delegreturn iff there are no opens.
>> > >
>> > > What we ended up doing in the unreleased code we have was to create a
>> > > new per-client and per-file object (that we creatively called an
>> > > "odstate"). An open stateid and a delegation stateid would hold a
>> > > reference to this object which is put when those stateids are freed.
>> > > When its refcount goes to zero, then we'd free any outstanding layouts
>> > > on the file for that client and free the object.
>> > >
>> > > You probably want to do something similar here.
>> > >
>> > > [1]: Tom and Trond mentioned that there's a RFC5661 errata pending for
>> > > this, but I don't see it right offhand.
>> >
>> > It would be good to look at the errata. While the idea of keeping
>> > layouts around longer makes sense, I would only expect to do this
>> > if the layout state was created based on a delegation stateid, not
>> > a lock or open stateid. In that case having the layouts hang off
>> > the "parent" stateid might be another option.
>>
>> I found it:
>>
>> http://www.rfc-editor.org/errata_search.php?rfc=5661&eid=3226
>>
>
> Oh, hmm...except that doesn't seem to have been updated according to
> the discussion from around a year ago. See the thread entitled:
>
> [nfsv4] NFSv4.1 errata id 3226 (the return of return-on-close layouts)
>
> ...on the [email protected] mailing list. Trond mentions it there.
> Perhaps we need to revise that errata?
>
Please see:
http://www.rfc-editor.org/errata_search.php?eid=3901
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Fri, Jan 09, 2015 at 09:43:41AM -0800, Trond Myklebust wrote:
> > [nfsv4] NFSv4.1 errata id 3226 (the return of return-on-close layouts)
> >
> > ...on the [email protected] mailing list. Trond mentions it there.
> > Perhaps we need to revise that errata?
> >
>
> Please see:
> http://www.rfc-editor.org/errata_search.php?eid=3901
I think the language in this errata is very confusing, especially:
"After the client has closed all open stateids and returned the
delegation stateids for a file for which logr_return_on_close
was set to TRUE, the server MUST invalidate all layout segments
that were issued to the client for that file."
While the idea that return on close layouts should be valid as
long as the "parent" stateid is around makes a lot of sense, this
requirement to track all open / delegation stateids per file/client
combination seems insane.
The only logical way to extend the original text is to require
layouts to be implicitly returned when:
- the file is closed for a layout stateid that is created based on
the open or lock stateid
- the delegation is returned for a layout stateid that is created
based on the delegation stateid.
That is, only keep layouts opened based on the delegation stateid
alive over a close if they are hanging off that delegation stateid.
On Fri, Jan 09, 2015 at 07:49:43PM +0100, Christoph Hellwig wrote:
> On Fri, Jan 09, 2015 at 09:43:41AM -0800, Trond Myklebust wrote:
> > > [nfsv4] NFSv4.1 errata id 3226 (the return of return-on-close layouts)
> > >
> > > ...on the [email protected] mailing list. Trond mentions it there.
> > > Perhaps we need to revise that errata?
> > >
> >
> > Please see:
> > http://www.rfc-editor.org/errata_search.php?eid=3901
>
> I think the language in this errata is very confusing, especially:
>
> "After the client has closed all open stateids and returned the
> delegation stateids for a file for which logr_return_on_close
> was set to TRUE, the server MUST invalidate all layout segments
> that were issued to the client for that file."
>
> While the idea that return on close layouts should be valid as
> long as the "parent" stateid is around makes a lot of sense, this
> requirement to track all open / delegation stateids per file/client
> combination seems insane.
Christoph,
I don't understand this concern. Section 8.2.1:
open stateids: OPEN state for a given client ID/open-owner/filehandle triple
delegation stateids: A stateid represents a single delegation held by a client for a
particular filehandle.
By definition, open/delegation stateids are tracked per
file/client combination.
>
> The only logical way to extend the original text is to require
> layouts to be implicitly returned when:
>
> - the file is closed for a layout stateid that is created based on
> the open or lock stateid
I read you as saying that on the first CLOSE, the layout must
be returned. This seems very burdensome in that the client is aware
that other OPENs may have occured and expects to be able to still
utilize the layout. Under this new model, it would need to get
the layout again.
I.e., Trond's original concern with Section 18.43.3 is just that,
the text states that:
The logr_return_on_close result field is a directive to return the
layout before closing the file.
Paraphrasing the errata, this could be rewritten as
The logr_return_on_close result field is a directive to return the
layout before the last close of the file.
> - the delegation is returned for a layout stateid that is created
> based on the delegation stateid.
This agrees with the paragraph above.
>
> That is, only keep layouts opened based on the delegation stateid
> alive over a close if they are hanging off that delegation stateid.
Thanks,
Tom
On Wed, Jan 14, 2015 at 11:16:27AM -0800, Tom Haynes wrote:
> Christoph,
>
> I don't understand this concern. Section 8.2.1:
>
> open stateids: OPEN state for a given client ID/open-owner/filehandle triple
>
> delegation stateids: A stateid represents a single delegation held by a client for a
> particular filehandle.
>
> By definition, open/delegation stateids are tracked per
> file/client combination.
My concern is that the language in errata 3901 says the server should
invalidate all layouts after all open stateids are closed, and all
delegation stateids are returned for a given file, which means the
server needs to add another object just to track the layouts. If
on the other hand we say the lifetime of the layouts is tied to
the open stateid or the delegation stateid used to create the
layout stateid we can track the outstanding layouts in those
open/delegation stateids (and the lock stateids as well).
On Thu, Jan 15, 2015 at 11:26 AM, Christoph Hellwig <[email protected]> wrote:
> On Wed, Jan 14, 2015 at 11:16:27AM -0800, Tom Haynes wrote:
>> Christoph,
>>
>> I don't understand this concern. Section 8.2.1:
>>
>> open stateids: OPEN state for a given client ID/open-owner/filehandle triple
>>
>> delegation stateids: A stateid represents a single delegation held by a client for a
>> particular filehandle.
>>
>> By definition, open/delegation stateids are tracked per
>> file/client combination.
>
> My concern is that the language in errata 3901 says the server should
> invalidate all layouts after all open stateids are closed, and all
> delegation stateids are returned for a given file, which means the
> server needs to add another object just to track the layouts. If
> on the other hand we say the lifetime of the layouts is tied to
> the open stateid or the delegation stateid used to create the
> layout stateid we can track the outstanding layouts in those
> open/delegation stateids (and the lock stateids as well).
>
The problem, in my mind, is that this defeats the main purpose of
return-on-close, which is to reduce the on-the-wire chattiness of the
protocol. How one implements that internally on the server is not
really a concern that should be reflected in the protocol.
Cheers
Trond
On Tue, Jan 06, 2015 at 05:28:32PM +0100, Christoph Hellwig wrote:
> +#ifdef CONFIG_NFSD_PNFS
> +static __be32
> +nfsd4_encode_getdeviceinfo(struct nfsd4_compoundres *resp, __be32 nfserr,
> + struct nfsd4_getdeviceinfo *gdev)
> +{
> + struct xdr_stream *xdr = &resp->xdr;
> + const struct nfsd4_layout_ops *ops =
> + nfsd4_layout_ops[gdev->gd_layout_type];
> + u32 starting_len = xdr->buf->len, needed_len;
> + __be32 *p;
> +
> + dprintk("%s: err %d\n", __func__, nfserr);
> + if (nfserr)
> + goto out;
In nfsd4_block_get_device_info_simple(), gdp->gd_device might have
been allocated, but sb->s_export_op->get_uuid() might have returned
an error, which would cause a leak here.
> +
> + p = xdr_reserve_space(xdr, 4);
> + if (!p)
> + return nfserr_resource;
gdp->gd_device can be leaked here.
> + *p++ = cpu_to_be32(gdev->gd_layout_type);
> +
> + /* If maxcount is 0 then just update notifications */
> + if (gdev->gd_maxcount != 0) {
> + nfserr = ops->encode_getdeviceinfo(xdr, gdev);
> + if (nfserr) {
> + /*
> + * We don't bother to burden the layout drivers with
> + * enforcing gd_maxcount, just tell the client to
> + * come back with a bigger buffer if it's not enough.
> + */
> + if (xdr->buf->len + 4 > gdev->gd_maxcount)
> + goto toosmall;
> + goto out;
> + }
> + }
> +
> + if (gdev->gd_notify_types) {
> + p = xdr_reserve_space(xdr, 4 + 4);
> + if (!p)
> + return nfserr_resource;
gdp->gd_device can be leaked here.
> + *p++ = cpu_to_be32(1); /* bitmap length */
> + *p++ = cpu_to_be32(gdev->gd_notify_types);
> + } else {
> + p = xdr_reserve_space(xdr, 4);
> + if (!p)
> + return nfserr_resource;
gdp->gd_device can be leaked here.
> + *p++ = 0;
> + }
> +
> +out:
> + kfree(gdev->gd_device);
> + dprintk("%s: done: %d\n", __func__, be32_to_cpu(nfserr));
> + return nfserr;
> +
> +toosmall:
> + dprintk("%s: maxcount too small\n", __func__);
> + needed_len = xdr->buf->len + 4 /* notifications */;
> + xdr_truncate_encode(xdr, starting_len);
> + p = xdr_reserve_space(xdr, 4);
> + if (!p)
> + return nfserr_resource;
> + *p++ = cpu_to_be32(needed_len);
> + nfserr = nfserr_toosmall;
> + goto out;
> +}
> +
Add support to issue layout recalls to clients. For now we only support
full-file recalls to get a simple and stable implementation. This allows
us to embed an nfsd4_callback structure in the layout_state and thus avoid
any memory allocations under spinlocks during a recall. For normal
use cases that do not intend to share a single file between multiple
clients this implementation is fully sufficient.
To ensure layouts are recalled on local filesystem access each layout
state registers a new FL_LAYOUT lease with the kernel file locking code,
which filesystems that support pNFS exports that require recalls need
to break on conflicting access patterns.
The XDR code is based on the old pNFS server implementation by
Andy Adamson, Benny Halevy, Boaz Harrosh, Dean Hildebrand, Fred Isaman,
Marc Eshel, Mike Sager and Ricardo Labiaga.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/nfs4callback.c | 99 +++++++++++++++++++++++
fs/nfsd/nfs4layouts.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/nfsd/nfs4proc.c | 4 +
fs/nfsd/nfs4state.c | 1 +
fs/nfsd/state.h | 6 ++
fs/nfsd/xdr4cb.h | 7 ++
6 files changed, 330 insertions(+), 1 deletion(-)
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 7cbdf1b..5827785 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -546,6 +546,102 @@ out:
return status;
}
+#ifdef CONFIG_NFSD_PNFS
+/*
+ * CB_LAYOUTRECALL4args
+ *
+ * struct layoutrecall_file4 {
+ * nfs_fh4 lor_fh;
+ * offset4 lor_offset;
+ * length4 lor_length;
+ * stateid4 lor_stateid;
+ * };
+ *
+ * union layoutrecall4 switch(layoutrecall_type4 lor_recalltype) {
+ * case LAYOUTRECALL4_FILE:
+ * layoutrecall_file4 lor_layout;
+ * case LAYOUTRECALL4_FSID:
+ * fsid4 lor_fsid;
+ * case LAYOUTRECALL4_ALL:
+ * void;
+ * };
+ *
+ * struct CB_LAYOUTRECALL4args {
+ * layouttype4 clora_type;
+ * layoutiomode4 clora_iomode;
+ * bool clora_changed;
+ * layoutrecall4 clora_recall;
+ * };
+ */
+static void encode_cb_layout4args(struct xdr_stream *xdr,
+ const struct nfs4_layout_stateid *ls,
+ struct nfs4_cb_compound_hdr *hdr)
+{
+ __be32 *p;
+
+ BUG_ON(hdr->minorversion == 0);
+
+ p = xdr_reserve_space(xdr, 5 * 4);
+ *p++ = cpu_to_be32(OP_CB_LAYOUTRECALL);
+ *p++ = cpu_to_be32(ls->ls_layout_type);
+ *p++ = cpu_to_be32(IOMODE_ANY);
+ *p++ = cpu_to_be32(1);
+ *p = cpu_to_be32(RETURN_FILE);
+
+ encode_nfs_fh4(xdr, &ls->ls_stid.sc_file->fi_fhandle);
+
+ p = xdr_reserve_space(xdr, 2 * 8);
+ p = xdr_encode_hyper(p, 0);
+ xdr_encode_hyper(p, NFS4_MAX_UINT64);
+
+ encode_stateid4(xdr, &ls->ls_recall_sid);
+
+ hdr->nops++;
+}
+
+static void nfs4_xdr_enc_cb_layout(struct rpc_rqst *req,
+ struct xdr_stream *xdr,
+ const struct nfsd4_callback *cb)
+{
+ const struct nfs4_layout_stateid *ls =
+ container_of(cb, struct nfs4_layout_stateid, ls_recall);
+ struct nfs4_cb_compound_hdr hdr = {
+ .ident = 0,
+ .minorversion = cb->cb_minorversion,
+ };
+
+ encode_cb_compound4args(xdr, &hdr);
+ encode_cb_sequence4args(xdr, cb, &hdr);
+ encode_cb_layout4args(xdr, ls, &hdr);
+ encode_cb_nops(&hdr);
+}
+
+static int nfs4_xdr_dec_cb_layout(struct rpc_rqst *rqstp,
+ struct xdr_stream *xdr,
+ struct nfsd4_callback *cb)
+{
+ struct nfs4_cb_compound_hdr hdr;
+ enum nfsstat4 nfserr;
+ int status;
+
+ status = decode_cb_compound4res(xdr, &hdr);
+ if (unlikely(status))
+ goto out;
+ if (cb) {
+ status = decode_cb_sequence4res(xdr, cb);
+ if (unlikely(status))
+ goto out;
+ }
+ status = decode_cb_op_status(xdr, OP_CB_LAYOUTRECALL, &nfserr);
+ if (unlikely(status))
+ goto out;
+ if (unlikely(nfserr != NFS4_OK))
+ status = nfs_cb_stat_to_errno(nfserr);
+out:
+ return status;
+}
+#endif /* CONFIG_NFSD_PNFS */
+
/*
* RPC procedure tables
*/
@@ -563,6 +659,9 @@ out:
static struct rpc_procinfo nfs4_cb_procedures[] = {
PROC(CB_NULL, NULL, cb_null, cb_null),
PROC(CB_RECALL, COMPOUND, cb_recall, cb_recall),
+#ifdef CONFIG_NFSD_PNFS
+ PROC(CB_LAYOUT, COMPOUND, cb_layout, cb_layout),
+#endif
};
static struct rpc_version nfs_cb_version4 = {
diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index 0753ed8..72a12ca 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -1,8 +1,11 @@
/*
* Copyright (c) 2014 Christoph Hellwig.
*/
+#include <linux/kmod.h>
+#include <linux/file.h>
#include <linux/jhash.h>
#include <linux/sched.h>
+#include <linux/sunrpc/addr.h>
#include "pnfs.h"
#include "netns.h"
@@ -18,6 +21,9 @@ struct nfs4_layout {
static struct kmem_cache *nfs4_layout_cache;
static struct kmem_cache *nfs4_layout_stateid_cache;
+static struct nfsd4_callback_ops nfsd4_cb_layout_ops;
+static const struct lock_manager_operations nfsd4_layouts_lm_ops;
+
const struct nfsd4_layout_ops *nfsd4_layout_ops[LAYOUT_TYPE_MAX] = {
};
@@ -126,9 +132,42 @@ nfsd4_free_layout_stateid(struct nfs4_stid *stid)
list_del_init(&ls->ls_perfile);
spin_unlock(&fp->fi_lock);
+ vfs_setlease(ls->ls_file, F_UNLCK, NULL, (void **)&ls);
+ fput(ls->ls_file);
+
+ if (ls->ls_recalled)
+ atomic_dec(&ls->ls_stid.sc_file->fi_lo_recalls);
+
kmem_cache_free(nfs4_layout_stateid_cache, ls);
}
+static int
+nfsd4_layout_setlease(struct nfs4_layout_stateid *ls)
+{
+ struct file_lock *fl;
+ int status;
+
+ fl = locks_alloc_lock();
+ if (!fl)
+ return -ENOMEM;
+ locks_init_lock(fl);
+ fl->fl_lmops = &nfsd4_layouts_lm_ops;
+ fl->fl_flags = FL_LAYOUT;
+ fl->fl_type = F_RDLCK;
+ fl->fl_end = OFFSET_MAX;
+ fl->fl_owner = ls;
+ fl->fl_pid = current->tgid;
+ fl->fl_file = ls->ls_file;
+
+ status = vfs_setlease(fl->fl_file, fl->fl_type, &fl, NULL);
+ if (status) {
+ locks_free_lock(fl);
+ return status;
+ }
+ BUG_ON(fl != NULL);
+ return 0;
+}
+
static struct nfs4_layout_stateid *
nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
struct nfs4_stid *parent, u32 layout_type)
@@ -151,6 +190,20 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
spin_lock_init(&ls->ls_lock);
INIT_LIST_HEAD(&ls->ls_layouts);
ls->ls_layout_type = layout_type;
+ nfsd4_init_cb(&ls->ls_recall, clp, &nfsd4_cb_layout_ops,
+ NFSPROC4_CLNT_CB_LAYOUT);
+
+ if (parent->sc_type == NFS4_DELEG_STID)
+ ls->ls_file = get_file(fp->fi_deleg_file);
+ else
+ ls->ls_file = find_any_file(fp);
+ BUG_ON(!ls->ls_file);
+
+ if (nfsd4_layout_setlease(ls)) {
+ put_nfs4_file(fp);
+ kmem_cache_free(nfs4_layout_stateid_cache, ls);
+ return NULL;
+ }
spin_lock(&clp->cl_lock);
stp->sc_type = NFS4_LAYOUT_STID;
@@ -214,6 +267,27 @@ out:
return status;
}
+static void
+nfsd4_recall_file_layout(struct nfs4_layout_stateid *ls)
+{
+ spin_lock(&ls->ls_lock);
+ if (ls->ls_recalled)
+ goto out_unlock;
+
+ ls->ls_recalled = true;
+ atomic_inc(&ls->ls_stid.sc_file->fi_lo_recalls);
+ if (list_empty(&ls->ls_layouts))
+ goto out_unlock;
+
+ atomic_inc(&ls->ls_stid.sc_count);
+ update_stateid(&ls->ls_stid.sc_stateid);
+ memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
+ nfsd4_run_cb(&ls->ls_recall);
+
+out_unlock:
+ spin_unlock(&ls->ls_lock);
+}
+
static inline u64
layout_end(struct nfsd4_layout_seg *seg)
{
@@ -257,18 +331,44 @@ layouts_try_merge(struct nfsd4_layout_seg *lo, struct nfsd4_layout_seg *new)
return true;
}
+static __be32
+nfsd4_recall_conflict(struct nfs4_layout_stateid *ls)
+{
+ struct nfs4_file *fp = ls->ls_stid.sc_file;
+ struct nfs4_layout_stateid *l, *n;
+ __be32 nfserr = nfs_ok;
+
+ assert_spin_locked(&fp->fi_lock);
+
+ list_for_each_entry_safe(l, n, &fp->fi_lo_states, ls_perfile) {
+ if (l != ls) {
+ nfsd4_recall_file_layout(l);
+ nfserr = nfserr_recallconflict;
+ }
+ }
+
+ return nfserr;
+}
+
__be32
nfsd4_insert_layout(struct nfsd4_layoutget *lgp, struct nfs4_layout_stateid *ls)
{
struct nfsd4_layout_seg *seg = &lgp->lg_seg;
+ struct nfs4_file *fp = ls->ls_stid.sc_file;
struct nfs4_layout *lp, *new = NULL;
+ __be32 nfserr;
+ spin_lock(&fp->fi_lock);
+ nfserr = nfsd4_recall_conflict(ls);
+ if (nfserr)
+ goto out;
spin_lock(&ls->ls_lock);
list_for_each_entry(lp, &ls->ls_layouts, lo_perstate) {
if (layouts_try_merge(&lp->lo_seg, seg))
goto done;
}
spin_unlock(&ls->ls_lock);
+ spin_unlock(&fp->fi_lock);
new = kmem_cache_alloc(nfs4_layout_cache, GFP_KERNEL);
if (!new)
@@ -276,6 +376,10 @@ nfsd4_insert_layout(struct nfsd4_layoutget *lgp, struct nfs4_layout_stateid *ls)
memcpy(&new->lo_seg, seg, sizeof(lp->lo_seg));
new->lo_state = ls;
+ spin_lock(&fp->fi_lock);
+ nfserr = nfsd4_recall_conflict(ls);
+ if (nfserr)
+ goto out;
spin_lock(&ls->ls_lock);
list_for_each_entry(lp, &ls->ls_layouts, lo_perstate) {
if (layouts_try_merge(&lp->lo_seg, seg))
@@ -289,9 +393,11 @@ done:
update_stateid(&ls->ls_stid.sc_stateid);
memcpy(&lgp->lg_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
spin_unlock(&ls->ls_lock);
+out:
+ spin_unlock(&fp->fi_lock);
if (new)
kmem_cache_free(nfs4_layout_cache, new);
- return nfs_ok;
+ return nfserr;
}
static void
@@ -447,6 +553,112 @@ nfsd4_return_all_file_layouts(struct nfs4_client *clp, struct nfs4_file *fp)
nfsd4_free_layouts(&reaplist);
}
+static void
+nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
+{
+ struct nfs4_client *clp = ls->ls_stid.sc_client;
+ char addr_str[INET6_ADDRSTRLEN];
+ static char *envp[] = {
+ "HOME=/",
+ "TERM=linux",
+ "PATH=/sbin:/usr/sbin:/bin:/usr/bin",
+ NULL
+ };
+ char *argv[8];
+ int error;
+
+ rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));
+
+ printk(KERN_WARNING
+ "nfsd: client %s failed to respond to layout recall. "
+ " Fencing..\n", addr_str);
+
+ argv[0] = "/sbin/nfsd-recall-failed";
+ argv[1] = addr_str;
+ argv[2] = ls->ls_file->f_path.mnt->mnt_sb->s_id;
+ argv[3] = NULL;
+
+ error = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
+ if (error) {
+ printk(KERN_ERR "nfsd: fence failed for client %s: %d!\n",
+ addr_str, error);
+ }
+}
+
+static int
+nfsd4_cb_layout_done(struct nfsd4_callback *cb, struct rpc_task *task)
+{
+ struct nfs4_layout_stateid *ls =
+ container_of(cb, struct nfs4_layout_stateid, ls_recall);
+ LIST_HEAD(reaplist);
+
+ switch (task->tk_status) {
+ case 0:
+ return 1;
+ case -NFS4ERR_NOMATCHING_LAYOUT:
+ task->tk_status = 0;
+ return 1;
+ case -NFS4ERR_DELAY:
+ /* Poll the client until it's done with the layout */
+ /* FIXME: cap the number of retries.
+ * The pNFS standard states that we should only expire
+ * the client after at least "lease time", e.g. lease-time * 2,
+ * when failing to communicate a recall.
+ */
+ rpc_delay(task, HZ/100); /* 10 milliseconds */
+ return 0;
+ default:
+ /*
+ * Unknown error or non-responding client, we'll need to fence.
+ */
+ nfsd4_cb_layout_fail(ls);
+ return -1;
+ }
+}
+
+static void
+nfsd4_cb_layout_release(struct nfsd4_callback *cb)
+{
+ struct nfs4_layout_stateid *ls =
+ container_of(cb, struct nfs4_layout_stateid, ls_recall);
+ LIST_HEAD(reaplist);
+
+ nfsd4_return_all_layouts(ls, &reaplist);
+ nfsd4_free_layouts(&reaplist);
+ nfs4_put_stid(&ls->ls_stid);
+}
+
+static struct nfsd4_callback_ops nfsd4_cb_layout_ops = {
+ .done = nfsd4_cb_layout_done,
+ .release = nfsd4_cb_layout_release,
+};
+
+static bool
+nfsd4_layout_lm_break(struct file_lock *fl)
+{
+ /*
+ * We don't want the locks code to timeout the lease for us;
+ * we'll remove it ourself if a layout isn't returned
+ * in time:
+ */
+ fl->fl_break_time = 0;
+ nfsd4_recall_file_layout(fl->fl_owner);
+ return false;
+}
+
+static int
+nfsd4_layout_lm_change(struct file_lock **onlist, int arg,
+ struct list_head *dispose)
+{
+ BUG_ON(!(arg & F_UNLCK));
+ return lease_modify(onlist, arg, dispose);
+}
+
+static const struct lock_manager_operations nfsd4_layouts_lm_ops = {
+ .lm_break = nfsd4_layout_lm_break,
+ .lm_change = nfsd4_layout_lm_change,
+};
+
int
nfsd4_init_pnfs(void)
{
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index b813913..c051d5b 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1301,6 +1301,10 @@ nfsd4_layoutget(struct svc_rqst *rqstp,
if (nfserr)
goto out;
+ nfserr = nfserr_recallconflict;
+ if (atomic_read(&ls->ls_stid.sc_file->fi_lo_recalls))
+ goto out_put_stid;
+
nfserr = ops->proc_layoutget(current_fh->fh_dentry->d_inode,
current_fh, lgp);
if (nfserr)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index eb972e6..1450b57 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3070,6 +3070,7 @@ static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
memset(fp->fi_access, 0, sizeof(fp->fi_access));
#ifdef CONFIG_NFSD_PNFS
INIT_LIST_HEAD(&fp->fi_lo_states);
+ atomic_set(&fp->fi_lo_recalls, 0);
#endif
hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
}
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 5f66b7f..4f3bfeb 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -502,6 +502,7 @@ struct nfs4_file {
bool fi_had_conflict;
#ifdef CONFIG_NFSD_PNFS
struct list_head fi_lo_states;
+ atomic_t fi_lo_recalls;
#endif
};
@@ -542,6 +543,10 @@ struct nfs4_layout_stateid {
spinlock_t ls_lock;
struct list_head ls_layouts;
u32 ls_layout_type;
+ struct file *ls_file;
+ struct nfsd4_callback ls_recall;
+ stateid_t ls_recall_sid;
+ bool ls_recalled;
};
static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
@@ -556,6 +561,7 @@ static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
enum nfsd4_cb_op {
NFSPROC4_CLNT_CB_NULL = 0,
NFSPROC4_CLNT_CB_RECALL,
+ NFSPROC4_CLNT_CB_LAYOUT,
NFSPROC4_CLNT_CB_SEQUENCE,
};
diff --git a/fs/nfsd/xdr4cb.h b/fs/nfsd/xdr4cb.h
index c5c55df..c47f6fd 100644
--- a/fs/nfsd/xdr4cb.h
+++ b/fs/nfsd/xdr4cb.h
@@ -21,3 +21,10 @@
#define NFS4_dec_cb_recall_sz (cb_compound_dec_hdr_sz + \
cb_sequence_dec_sz + \
op_dec_sz)
+#define NFS4_enc_cb_layout_sz (cb_compound_enc_hdr_sz + \
+ cb_sequence_enc_sz + \
+ 1 + 3 + \
+ enc_nfs4_fh_sz + 4)
+#define NFS4_dec_cb_layout_sz (cb_compound_dec_hdr_sz + \
+ cb_sequence_dec_sz + \
+ op_dec_sz)
--
1.9.1
On Tue, Jan 06, 2015 at 05:28:33PM +0100, Christoph Hellwig wrote:
> Add support to issue layout recalls to clients. For now we only support
> full-file recalls to get a simple and stable implementation. This allows
> us to embed an nfsd4_callback structure in the layout_state and thus avoid
> any memory allocations under spinlocks during a recall. For normal
> use cases that do not intend to share a single file between multiple
> clients this implementation is fully sufficient.
>
> To ensure layouts are recalled on local filesystem access each layout
> state registers a new FL_LAYOUT lease with the kernel file locking code,
> which filesystems that support pNFS exports that require recalls need
> to break on conflicting access patterns.
>
> The XDR code is based on the old pNFS server implementation by
> Andy Adamson, Benny Halevy, Boaz Harrosh, Dean Hildebrand, Fred Isaman,
> Marc Eshel, Mike Sager and Ricardo Labiaga.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/nfsd/nfs4callback.c | 99 +++++++++++++++++++++++
> fs/nfsd/nfs4layouts.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/nfsd/nfs4proc.c | 4 +
> fs/nfsd/nfs4state.c | 1 +
> fs/nfsd/state.h | 6 ++
> fs/nfsd/xdr4cb.h | 7 ++
> 6 files changed, 330 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index 7cbdf1b..5827785 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -546,6 +546,102 @@ out:
> return status;
> }
>
> +#ifdef CONFIG_NFSD_PNFS
> +/*
> + * CB_LAYOUTRECALL4args
> + *
> + * struct layoutrecall_file4 {
> + * nfs_fh4 lor_fh;
> + * offset4 lor_offset;
> + * length4 lor_length;
> + * stateid4 lor_stateid;
> + * };
> + *
> + * union layoutrecall4 switch(layoutrecall_type4 lor_recalltype) {
> + * case LAYOUTRECALL4_FILE:
> + * layoutrecall_file4 lor_layout;
> + * case LAYOUTRECALL4_FSID:
> + * fsid4 lor_fsid;
> + * case LAYOUTRECALL4_ALL:
> + * void;
> + * };
> + *
> + * struct CB_LAYOUTRECALL4args {
> + * layouttype4 clora_type;
> + * layoutiomode4 clora_iomode;
> + * bool clora_changed;
> + * layoutrecall4 clora_recall;
> + * };
> + */
> +static void encode_cb_layout4args(struct xdr_stream *xdr,
> + const struct nfs4_layout_stateid *ls,
> + struct nfs4_cb_compound_hdr *hdr)
> +{
> + __be32 *p;
> +
> + BUG_ON(hdr->minorversion == 0);
> +
> + p = xdr_reserve_space(xdr, 5 * 4);
> + *p++ = cpu_to_be32(OP_CB_LAYOUTRECALL);
> + *p++ = cpu_to_be32(ls->ls_layout_type);
> + *p++ = cpu_to_be32(IOMODE_ANY);
> + *p++ = cpu_to_be32(1);
> + *p = cpu_to_be32(RETURN_FILE);
> +
> + encode_nfs_fh4(xdr, &ls->ls_stid.sc_file->fi_fhandle);
> +
> + p = xdr_reserve_space(xdr, 2 * 8);
> + p = xdr_encode_hyper(p, 0);
> + xdr_encode_hyper(p, NFS4_MAX_UINT64);
> +
> + encode_stateid4(xdr, &ls->ls_recall_sid);
> +
> + hdr->nops++;
> +}
> +
> +static void nfs4_xdr_enc_cb_layout(struct rpc_rqst *req,
> + struct xdr_stream *xdr,
> + const struct nfsd4_callback *cb)
> +{
> + const struct nfs4_layout_stateid *ls =
> + container_of(cb, struct nfs4_layout_stateid, ls_recall);
> + struct nfs4_cb_compound_hdr hdr = {
> + .ident = 0,
> + .minorversion = cb->cb_minorversion,
> + };
> +
> + encode_cb_compound4args(xdr, &hdr);
> + encode_cb_sequence4args(xdr, cb, &hdr);
> + encode_cb_layout4args(xdr, ls, &hdr);
> + encode_cb_nops(&hdr);
> +}
> +
> +static int nfs4_xdr_dec_cb_layout(struct rpc_rqst *rqstp,
> + struct xdr_stream *xdr,
> + struct nfsd4_callback *cb)
> +{
> + struct nfs4_cb_compound_hdr hdr;
> + enum nfsstat4 nfserr;
> + int status;
> +
> + status = decode_cb_compound4res(xdr, &hdr);
> + if (unlikely(status))
> + goto out;
> + if (cb) {
> + status = decode_cb_sequence4res(xdr, cb);
> + if (unlikely(status))
> + goto out;
> + }
> + status = decode_cb_op_status(xdr, OP_CB_LAYOUTRECALL, &nfserr);
> + if (unlikely(status))
> + goto out;
> + if (unlikely(nfserr != NFS4_OK))
> + status = nfs_cb_stat_to_errno(nfserr);
> +out:
> + return status;
> +}
> +#endif /* CONFIG_NFSD_PNFS */
> +
> /*
> * RPC procedure tables
> */
> @@ -563,6 +659,9 @@ out:
> static struct rpc_procinfo nfs4_cb_procedures[] = {
> PROC(CB_NULL, NULL, cb_null, cb_null),
> PROC(CB_RECALL, COMPOUND, cb_recall, cb_recall),
> +#ifdef CONFIG_NFSD_PNFS
> + PROC(CB_LAYOUT, COMPOUND, cb_layout, cb_layout),
> +#endif
> };
>
> static struct rpc_version nfs_cb_version4 = {
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index 0753ed8..72a12ca 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -1,8 +1,11 @@
> /*
> * Copyright (c) 2014 Christoph Hellwig.
> */
> +#include <linux/kmod.h>
> +#include <linux/file.h>
> #include <linux/jhash.h>
> #include <linux/sched.h>
> +#include <linux/sunrpc/addr.h>
>
> #include "pnfs.h"
> #include "netns.h"
> @@ -18,6 +21,9 @@ struct nfs4_layout {
> static struct kmem_cache *nfs4_layout_cache;
> static struct kmem_cache *nfs4_layout_stateid_cache;
>
> +static struct nfsd4_callback_ops nfsd4_cb_layout_ops;
> +static const struct lock_manager_operations nfsd4_layouts_lm_ops;
> +
> const struct nfsd4_layout_ops *nfsd4_layout_ops[LAYOUT_TYPE_MAX] = {
> };
>
> @@ -126,9 +132,42 @@ nfsd4_free_layout_stateid(struct nfs4_stid *stid)
> list_del_init(&ls->ls_perfile);
> spin_unlock(&fp->fi_lock);
>
> + vfs_setlease(ls->ls_file, F_UNLCK, NULL, (void **)&ls);
> + fput(ls->ls_file);
> +
> + if (ls->ls_recalled)
> + atomic_dec(&ls->ls_stid.sc_file->fi_lo_recalls);
> +
> kmem_cache_free(nfs4_layout_stateid_cache, ls);
> }
>
> +static int
> +nfsd4_layout_setlease(struct nfs4_layout_stateid *ls)
> +{
> + struct file_lock *fl;
> + int status;
> +
> + fl = locks_alloc_lock();
> + if (!fl)
> + return -ENOMEM;
> + locks_init_lock(fl);
> + fl->fl_lmops = &nfsd4_layouts_lm_ops;
> + fl->fl_flags = FL_LAYOUT;
> + fl->fl_type = F_RDLCK;
> + fl->fl_end = OFFSET_MAX;
> + fl->fl_owner = ls;
> + fl->fl_pid = current->tgid;
> + fl->fl_file = ls->ls_file;
> +
> + status = vfs_setlease(fl->fl_file, fl->fl_type, &fl, NULL);
> + if (status) {
> + locks_free_lock(fl);
> + return status;
> + }
> + BUG_ON(fl != NULL);
> + return 0;
> +}
> +
> static struct nfs4_layout_stateid *
> nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
> struct nfs4_stid *parent, u32 layout_type)
> @@ -151,6 +190,20 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
> spin_lock_init(&ls->ls_lock);
> INIT_LIST_HEAD(&ls->ls_layouts);
> ls->ls_layout_type = layout_type;
> + nfsd4_init_cb(&ls->ls_recall, clp, &nfsd4_cb_layout_ops,
> + NFSPROC4_CLNT_CB_LAYOUT);
> +
> + if (parent->sc_type == NFS4_DELEG_STID)
> + ls->ls_file = get_file(fp->fi_deleg_file);
> + else
> + ls->ls_file = find_any_file(fp);
> + BUG_ON(!ls->ls_file);
> +
> + if (nfsd4_layout_setlease(ls)) {
> + put_nfs4_file(fp);
> + kmem_cache_free(nfs4_layout_stateid_cache, ls);
> + return NULL;
> + }
>
> spin_lock(&clp->cl_lock);
> stp->sc_type = NFS4_LAYOUT_STID;
> @@ -214,6 +267,27 @@ out:
> return status;
> }
>
> +static void
> +nfsd4_recall_file_layout(struct nfs4_layout_stateid *ls)
> +{
> + spin_lock(&ls->ls_lock);
> + if (ls->ls_recalled)
> + goto out_unlock;
> +
> + ls->ls_recalled = true;
> + atomic_inc(&ls->ls_stid.sc_file->fi_lo_recalls);
> + if (list_empty(&ls->ls_layouts))
> + goto out_unlock;
> +
> + atomic_inc(&ls->ls_stid.sc_count);
> + update_stateid(&ls->ls_stid.sc_stateid);
> + memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
> + nfsd4_run_cb(&ls->ls_recall);
> +
> +out_unlock:
> + spin_unlock(&ls->ls_lock);
> +}
> +
> static inline u64
> layout_end(struct nfsd4_layout_seg *seg)
> {
> @@ -257,18 +331,44 @@ layouts_try_merge(struct nfsd4_layout_seg *lo, struct nfsd4_layout_seg *new)
> return true;
> }
>
> +static __be32
> +nfsd4_recall_conflict(struct nfs4_layout_stateid *ls)
> +{
> + struct nfs4_file *fp = ls->ls_stid.sc_file;
> + struct nfs4_layout_stateid *l, *n;
> + __be32 nfserr = nfs_ok;
> +
> + assert_spin_locked(&fp->fi_lock);
> +
> + list_for_each_entry_safe(l, n, &fp->fi_lo_states, ls_perfile) {
> + if (l != ls) {
> + nfsd4_recall_file_layout(l);
> + nfserr = nfserr_recallconflict;
> + }
> + }
> +
> + return nfserr;
> +}
> +
> __be32
> nfsd4_insert_layout(struct nfsd4_layoutget *lgp, struct nfs4_layout_stateid *ls)
> {
> struct nfsd4_layout_seg *seg = &lgp->lg_seg;
> + struct nfs4_file *fp = ls->ls_stid.sc_file;
> struct nfs4_layout *lp, *new = NULL;
> + __be32 nfserr;
>
> + spin_lock(&fp->fi_lock);
> + nfserr = nfsd4_recall_conflict(ls);
> + if (nfserr)
> + goto out;
> spin_lock(&ls->ls_lock);
> list_for_each_entry(lp, &ls->ls_layouts, lo_perstate) {
> if (layouts_try_merge(&lp->lo_seg, seg))
> goto done;
> }
> spin_unlock(&ls->ls_lock);
> + spin_unlock(&fp->fi_lock);
>
> new = kmem_cache_alloc(nfs4_layout_cache, GFP_KERNEL);
> if (!new)
> @@ -276,6 +376,10 @@ nfsd4_insert_layout(struct nfsd4_layoutget *lgp, struct nfs4_layout_stateid *ls)
> memcpy(&new->lo_seg, seg, sizeof(lp->lo_seg));
> new->lo_state = ls;
>
> + spin_lock(&fp->fi_lock);
> + nfserr = nfsd4_recall_conflict(ls);
> + if (nfserr)
> + goto out;
> spin_lock(&ls->ls_lock);
> list_for_each_entry(lp, &ls->ls_layouts, lo_perstate) {
> if (layouts_try_merge(&lp->lo_seg, seg))
> @@ -289,9 +393,11 @@ done:
> update_stateid(&ls->ls_stid.sc_stateid);
> memcpy(&lgp->lg_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
> spin_unlock(&ls->ls_lock);
> +out:
> + spin_unlock(&fp->fi_lock);
> if (new)
> kmem_cache_free(nfs4_layout_cache, new);
> - return nfs_ok;
> + return nfserr;
> }
>
> static void
> @@ -447,6 +553,112 @@ nfsd4_return_all_file_layouts(struct nfs4_client *clp, struct nfs4_file *fp)
> nfsd4_free_layouts(&reaplist);
> }
>
> +static void
> +nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> +{
> + struct nfs4_client *clp = ls->ls_stid.sc_client;
> + char addr_str[INET6_ADDRSTRLEN];
> + static char *envp[] = {
> + "HOME=/",
> + "TERM=linux",
> + "PATH=/sbin:/usr/sbin:/bin:/usr/bin",
> + NULL
> + };
> + char *argv[8];
> + int error;
> +
> + rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));
This bothers me a little: cl_addr is just the address that the
exchange_id came from. In theory there's no one-to-one relationship
between NFSv4 clients and IP addresses. Is it likely the iscsi traffic
could use a different interface than the MDS traffic?
If this is the best we can do, then maybe this should at least be
documented.
--b.
> +
> + printk(KERN_WARNING
> + "nfsd: client %s failed to respond to layout recall. "
> + " Fencing..\n", addr_str);
> +
> + argv[0] = "/sbin/nfsd-recall-failed";
> + argv[1] = addr_str;
> + argv[2] = ls->ls_file->f_path.mnt->mnt_sb->s_id;
> + argv[3] = NULL;
> +
> + error = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
> + if (error) {
> + printk(KERN_ERR "nfsd: fence failed for client %s: %d!\n",
> + addr_str, error);
> + }
> +}
> +
> +static int
> +nfsd4_cb_layout_done(struct nfsd4_callback *cb, struct rpc_task *task)
> +{
> + struct nfs4_layout_stateid *ls =
> + container_of(cb, struct nfs4_layout_stateid, ls_recall);
> + LIST_HEAD(reaplist);
> +
> + switch (task->tk_status) {
> + case 0:
> + return 1;
> + case -NFS4ERR_NOMATCHING_LAYOUT:
> + task->tk_status = 0;
> + return 1;
> + case -NFS4ERR_DELAY:
> + /* Poll the client until it's done with the layout */
> + /* FIXME: cap the number of retries.
> + * The pNFS standard states that we should only expire
> + * the client after at least "lease time", e.g. lease-time * 2,
> + * when failing to communicate a recall.
> + */
> + rpc_delay(task, HZ/100); /* 10 milliseconds */
> + return 0;
> + default:
> + /*
> + * Unknown error or non-responding client, we'll need to fence.
> + */
> + nfsd4_cb_layout_fail(ls);
> + return -1;
> + }
> +}
> +
> +static void
> +nfsd4_cb_layout_release(struct nfsd4_callback *cb)
> +{
> + struct nfs4_layout_stateid *ls =
> + container_of(cb, struct nfs4_layout_stateid, ls_recall);
> + LIST_HEAD(reaplist);
> +
> + nfsd4_return_all_layouts(ls, &reaplist);
> + nfsd4_free_layouts(&reaplist);
> + nfs4_put_stid(&ls->ls_stid);
> +}
> +
> +static struct nfsd4_callback_ops nfsd4_cb_layout_ops = {
> + .done = nfsd4_cb_layout_done,
> + .release = nfsd4_cb_layout_release,
> +};
> +
> +static bool
> +nfsd4_layout_lm_break(struct file_lock *fl)
> +{
> + /*
> + * We don't want the locks code to timeout the lease for us;
> + * we'll remove it ourself if a layout isn't returned
> + * in time:
> + */
> + fl->fl_break_time = 0;
> + nfsd4_recall_file_layout(fl->fl_owner);
> + return false;
> +}
> +
> +static int
> +nfsd4_layout_lm_change(struct file_lock **onlist, int arg,
> + struct list_head *dispose)
> +{
> + BUG_ON(!(arg & F_UNLCK));
> + return lease_modify(onlist, arg, dispose);
> +}
> +
> +static const struct lock_manager_operations nfsd4_layouts_lm_ops = {
> + .lm_break = nfsd4_layout_lm_break,
> + .lm_change = nfsd4_layout_lm_change,
> +};
> +
> int
> nfsd4_init_pnfs(void)
> {
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index b813913..c051d5b 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -1301,6 +1301,10 @@ nfsd4_layoutget(struct svc_rqst *rqstp,
> if (nfserr)
> goto out;
>
> + nfserr = nfserr_recallconflict;
> + if (atomic_read(&ls->ls_stid.sc_file->fi_lo_recalls))
> + goto out_put_stid;
> +
> nfserr = ops->proc_layoutget(current_fh->fh_dentry->d_inode,
> current_fh, lgp);
> if (nfserr)
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index eb972e6..1450b57 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -3070,6 +3070,7 @@ static void nfsd4_init_file(struct knfsd_fh *fh, unsigned int hashval,
> memset(fp->fi_access, 0, sizeof(fp->fi_access));
> #ifdef CONFIG_NFSD_PNFS
> INIT_LIST_HEAD(&fp->fi_lo_states);
> + atomic_set(&fp->fi_lo_recalls, 0);
> #endif
> hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
> }
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 5f66b7f..4f3bfeb 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -502,6 +502,7 @@ struct nfs4_file {
> bool fi_had_conflict;
> #ifdef CONFIG_NFSD_PNFS
> struct list_head fi_lo_states;
> + atomic_t fi_lo_recalls;
> #endif
> };
>
> @@ -542,6 +543,10 @@ struct nfs4_layout_stateid {
> spinlock_t ls_lock;
> struct list_head ls_layouts;
> u32 ls_layout_type;
> + struct file *ls_file;
> + struct nfsd4_callback ls_recall;
> + stateid_t ls_recall_sid;
> + bool ls_recalled;
> };
>
> static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
> @@ -556,6 +561,7 @@ static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
> enum nfsd4_cb_op {
> NFSPROC4_CLNT_CB_NULL = 0,
> NFSPROC4_CLNT_CB_RECALL,
> + NFSPROC4_CLNT_CB_LAYOUT,
> NFSPROC4_CLNT_CB_SEQUENCE,
> };
>
> diff --git a/fs/nfsd/xdr4cb.h b/fs/nfsd/xdr4cb.h
> index c5c55df..c47f6fd 100644
> --- a/fs/nfsd/xdr4cb.h
> +++ b/fs/nfsd/xdr4cb.h
> @@ -21,3 +21,10 @@
> #define NFS4_dec_cb_recall_sz (cb_compound_dec_hdr_sz + \
> cb_sequence_dec_sz + \
> op_dec_sz)
> +#define NFS4_enc_cb_layout_sz (cb_compound_enc_hdr_sz + \
> + cb_sequence_enc_sz + \
> + 1 + 3 + \
> + enc_nfs4_fh_sz + 4)
> +#define NFS4_dec_cb_layout_sz (cb_compound_dec_hdr_sz + \
> + cb_sequence_dec_sz + \
> + op_dec_sz)
> --
> 1.9.1
>
> This bothers me a little: cl_addr is just the address that the
> exchange_id came from. In theory there's no one-to-one relationship
> between NFSv4 clients and IP addresses. Is it likely the iscsi traffic
> could use a different interface than the MDS traffic?
>
> If this is the best we can do, then maybe this should at least be
> documented.
The pNFS block fencing protocol bothers me a lot, it seems like very
little thought went into that part of the standard.
I proposed a new SCSI layout type that fixes those issues on the
NFSv4 WG list, but there's been zero interest in it:
http://www.ietf.org/mail-archive/web/nfsv4/current/msg13469.html
On Tue, Jan 06, 2015 at 06:42:14PM +0100, Christoph Hellwig wrote:
> > This bothers me a little: cl_addr is just the address that the
> > exchange_id came from. In theory there's no one-to-one relationship
> > between NFSv4 clients and IP addresses. Is it likely the iscsi traffic
> > could use a different interface than the MDS traffic?
> >
> > If this is the best we can do, then maybe this should at least be
> > documented.
>
> The pNFS block fencing protocol bothers me a lot, it seems like very
> little thought went into that part of the standard.
>
> I proposed a new SCSI layout type that fixes those issues on the
> NFSv4 WG list, but there's been zero interest in it:
>
> http://www.ietf.org/mail-archive/web/nfsv4/current/msg13469.html
>
I don't know if I would call it zero interest so much as the normal apathy
on the NFSv4 WG list toward replying outside of the IETF meeting venue.
I'd certainly like to see it go forward.
Signed-off-by: Christoph Hellwig <[email protected]>
---
Documentation/filesystems/nfs/nfs41-server.txt | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt
index c49cd7e..682a59f 100644
--- a/Documentation/filesystems/nfs/nfs41-server.txt
+++ b/Documentation/filesystems/nfs/nfs41-server.txt
@@ -24,11 +24,6 @@ focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
"exactly once" semantics and better control and throttling of the
resources allocated for each client.
-Other NFSv4.1 features, Parallel NFS operations in particular,
-are still under development out of tree.
-See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design
-for more information.
-
The table below, taken from the NFSv4.1 document, lists
the operations that are mandatory to implement (REQ), optional
(OPT), and NFSv4.0 operations that are required not to implement (MNI)
@@ -43,9 +38,7 @@ The OPTIONAL features identified and their abbreviations are as follows:
The following abbreviations indicate the linux server implementation status.
I Implemented NFSv4.1 operations.
NS Not Supported.
- NS* unimplemented optional feature.
- P pNFS features implemented out of tree.
- PNS pNFS features that are not supported yet (out of tree).
+ NS* Unimplemented optional feature.
Operations
@@ -70,13 +63,13 @@ I | DESTROY_SESSION | REQ | | Section 18.37 |
I | EXCHANGE_ID | REQ | | Section 18.35 |
I | FREE_STATEID | REQ | | Section 18.38 |
| GETATTR | REQ | | Section 18.7 |
-P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
-P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
+I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
+NS*| GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
| GETFH | REQ | | Section 18.8 |
NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
-P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
-P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
-P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
+I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
+I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
+I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
| LINK | OPT | | Section 18.9 |
| LOCK | REQ | | Section 18.10 |
| LOCKT | REQ | | Section 18.11 |
@@ -122,9 +115,9 @@ Callback Operations
| | MNI | or OPT) | |
+-------------------------+-----------+-------------+---------------+
| CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
-P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
+I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
-P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
+NS*| CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 |
NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
| CB_RECALL | OPT | FDELG, | Section 20.2 |
--
1.9.1
For now, just a few simple events to trace the layout stateid lifetime, but
these were already enough to find several bugs in the Linux client's layout
stateid handling.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/Makefile | 7 ++++++-
fs/nfsd/nfs4layouts.c | 16 ++++++++++++++-
fs/nfsd/nfs4proc.c | 6 +++++-
fs/nfsd/trace.c | 5 +++++
fs/nfsd/trace.h | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 85 insertions(+), 3 deletions(-)
create mode 100644 fs/nfsd/trace.c
create mode 100644 fs/nfsd/trace.h
diff --git a/fs/nfsd/Makefile b/fs/nfsd/Makefile
index 5806270..6cba933 100644
--- a/fs/nfsd/Makefile
+++ b/fs/nfsd/Makefile
@@ -2,9 +2,14 @@
# Makefile for the Linux nfs server
#
+ccflags-y += -I$(src) # needed for trace events
+
obj-$(CONFIG_NFSD) += nfsd.o
-nfsd-y := nfssvc.o nfsctl.o nfsproc.o nfsfh.o vfs.o \
+# this one should be compiled first, as the tracing macros can easily blow up
+nfsd-y += trace.o
+
+nfsd-y += nfssvc.o nfsctl.o nfsproc.o nfsfh.o vfs.o \
export.o auth.o lockd.o nfscache.o nfsxdr.o stats.o
nfsd-$(CONFIG_NFSD_FAULT_INJECTION) += fault_inject.o
nfsd-$(CONFIG_NFSD_V2_ACL) += nfs2acl.o
diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index 72a12ca..bb91981 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -9,6 +9,7 @@
#include "pnfs.h"
#include "netns.h"
+#include "trace.h"
#define NFSDDBG_FACILITY NFSDDBG_PNFS
@@ -124,6 +125,8 @@ nfsd4_free_layout_stateid(struct nfs4_stid *stid)
struct nfs4_client *clp = ls->ls_stid.sc_client;
struct nfs4_file *fp = ls->ls_stid.sc_file;
+ trace_layoutstate_free(&ls->ls_stid.sc_stateid);
+
spin_lock(&clp->cl_lock);
list_del_init(&ls->ls_perclnt);
spin_unlock(&clp->cl_lock);
@@ -214,6 +217,7 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
list_add(&ls->ls_perfile, &fp->fi_lo_states);
spin_unlock(&fp->fi_lock);
+ trace_layoutstate_alloc(&ls->ls_stid.sc_stateid);
return ls;
}
@@ -279,6 +283,8 @@ nfsd4_recall_file_layout(struct nfs4_layout_stateid *ls)
if (list_empty(&ls->ls_layouts))
goto out_unlock;
+ trace_layout_recall(&ls->ls_stid.sc_stateid);
+
atomic_inc(&ls->ls_stid.sc_count);
update_stateid(&ls->ls_stid.sc_stateid);
memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
@@ -453,8 +459,10 @@ nfsd4_return_file_layouts(struct svc_rqst *rqstp,
nfserr = nfsd4_preprocess_layout_stateid(rqstp, cstate, &lrp->lr_sid,
false, lrp->lr_layout_type,
&ls);
- if (nfserr)
+ if (nfserr) {
+ trace_layout_return_lookup_fail(&lrp->lr_sid);
return nfserr;
+ }
spin_lock(&ls->ls_lock);
list_for_each_entry_safe(lp, n, &ls->ls_layouts, lo_perstate) {
@@ -471,6 +479,7 @@ nfsd4_return_file_layouts(struct svc_rqst *rqstp,
}
lrp->lrs_present = 1;
} else {
+ trace_layoutstate_unhash(&ls->ls_stid.sc_stateid);
nfs4_unhash_stid(&ls->ls_stid);
lrp->lrs_present = 0;
}
@@ -569,6 +578,8 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));
+ trace_layout_recall_fail(&ls->ls_stid.sc_stateid);
+
printk(KERN_WARNING
"nfsd: client %s failed to respond to layout recall. "
" Fencing..\n", addr_str);
@@ -596,6 +607,7 @@ nfsd4_cb_layout_done(struct nfsd4_callback *cb, struct rpc_task *task)
case 0:
return 1;
case -NFS4ERR_NOMATCHING_LAYOUT:
+ trace_layout_recall_done(&ls->ls_stid.sc_stateid);
task->tk_status = 0;
return 1;
case -NFS4ERR_DELAY:
@@ -623,6 +635,8 @@ nfsd4_cb_layout_release(struct nfsd4_callback *cb)
container_of(cb, struct nfs4_layout_stateid, ls_recall);
LIST_HEAD(reaplist);
+ trace_layout_recall_release(&ls->ls_stid.sc_stateid);
+
nfsd4_return_all_layouts(ls, &reaplist);
nfsd4_free_layouts(&reaplist);
nfs4_put_stid(&ls->ls_stid);
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index c051d5b..28e3927 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -44,6 +44,7 @@
#include "netns.h"
#include "acl.h"
#include "pnfs.h"
+#include "trace.h"
#ifdef CONFIG_NFSD_V4_SECURITY_LABEL
#include <linux/security.h>
@@ -1298,8 +1299,10 @@ nfsd4_layoutget(struct svc_rqst *rqstp,
nfserr = nfsd4_preprocess_layout_stateid(rqstp, cstate, &lgp->lg_sid,
true, lgp->lg_layout_type, &ls);
- if (nfserr)
+ if (nfserr) {
+ trace_layout_get_lookup_fail(&lgp->lg_sid);
goto out;
+ }
nfserr = nfserr_recallconflict;
if (atomic_read(&ls->ls_stid.sc_file->fi_lo_recalls))
@@ -1359,6 +1362,7 @@ nfsd4_layoutcommit(struct svc_rqst *rqstp,
false, lcp->lc_layout_type,
&ls);
if (nfserr) {
+ trace_layout_commit_lookup_fail(&lcp->lc_sid);
/* fixup error code as per RFC5661 */
if (nfserr == nfserr_bad_stateid)
nfserr = nfserr_badlayout;
diff --git a/fs/nfsd/trace.c b/fs/nfsd/trace.c
new file mode 100644
index 0000000..82f8907
--- /dev/null
+++ b/fs/nfsd/trace.c
@@ -0,0 +1,5 @@
+
+#include "state.h"
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
new file mode 100644
index 0000000..c668520
--- /dev/null
+++ b/fs/nfsd/trace.h
@@ -0,0 +1,54 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM nfsd
+
+#if !defined(_NFSD_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _NFSD_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(nfsd_stateid_class,
+ TP_PROTO(stateid_t *stp),
+ TP_ARGS(stp),
+ TP_STRUCT__entry(
+ __field(u32, cl_boot)
+ __field(u32, cl_id)
+ __field(u32, si_id)
+ __field(u32, si_generation)
+ ),
+ TP_fast_assign(
+ __entry->cl_boot = stp->si_opaque.so_clid.cl_boot;
+ __entry->cl_id = stp->si_opaque.so_clid.cl_id;
+ __entry->si_id = stp->si_opaque.so_id;
+ __entry->si_generation = stp->si_generation;
+ ),
+ TP_printk("client %08x:%08x stateid %08x:%08x",
+ __entry->cl_boot,
+ __entry->cl_id,
+ __entry->si_id,
+ __entry->si_generation)
+)
+
+#define DEFINE_STATEID_EVENT(name) \
+DEFINE_EVENT(nfsd_stateid_class, name, \
+ TP_PROTO(stateid_t *stp), \
+ TP_ARGS(stp))
+DEFINE_STATEID_EVENT(layoutstate_alloc);
+DEFINE_STATEID_EVENT(layoutstate_unhash);
+DEFINE_STATEID_EVENT(layoutstate_free);
+DEFINE_STATEID_EVENT(layout_get_lookup_fail);
+DEFINE_STATEID_EVENT(layout_commit_lookup_fail);
+DEFINE_STATEID_EVENT(layout_return_lookup_fail);
+DEFINE_STATEID_EVENT(layout_recall);
+DEFINE_STATEID_EVENT(layout_recall_done);
+DEFINE_STATEID_EVENT(layout_recall_fail);
+DEFINE_STATEID_EVENT(layout_recall_release);
+
+#endif /* _NFSD_TRACE_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace
+#include <trace/define_trace.h>
--
1.9.1
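As a rough userspace model of what the event class above records (the struct and function names here are illustrative assumptions, not kernel code), the four stateid fields can be captured and formatted exactly as the TP_printk() string does:

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified model of the stateid pieces the event class records. */
struct model_stateid {
	uint32_t cl_boot;       /* server boot time, part of the client id */
	uint32_t cl_id;         /* per-boot client counter */
	uint32_t si_id;         /* per-client stateid counter */
	uint32_t si_generation; /* bumped on every update_stateid() */
};

/* Mirrors the TP_printk() format string from the event class. */
static void format_stateid(const struct model_stateid *s,
			   char *buf, size_t len)
{
	snprintf(buf, len, "client %08x:%08x stateid %08x:%08x",
		 s->cl_boot, s->cl_id, s->si_id, s->si_generation);
}
```

The same decomposition (client id, then stateid "other" and generation) is what shows up in the trace buffer for every layoutstate_* and layout_recall* event.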
Add three methods to allow exporting pnfs block layout volumes:
- get_uuid: get a filesystem unique signature exposed to clients
- map_blocks: map and, if necessary, allocate blocks for a layout
- commit_blocks: commit blocks in a layout once the client is done with them
For now we stick the external pnfs block layout interfaces into s_export_op to
avoid mixing them up with the internal interface between the NFS server and
the layout drivers. Once we've fully internalized the latter interface we
can revisit whether these methods should stay in s_export_op.
Signed-off-by: Christoph Hellwig <[email protected]>
---
include/linux/exportfs.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 41b223a..ff46bf7 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -4,6 +4,7 @@
#include <linux/types.h>
struct dentry;
+struct iattr;
struct inode;
struct super_block;
struct vfsmount;
@@ -180,6 +181,19 @@ struct fid {
* get_name is not (which is possibly inconsistent)
*/
+/* types of block ranges for multipage write mappings. */
+#define IOMAP_HOLE 0x01 /* no blocks allocated, need allocation */
+#define IOMAP_DELALLOC 0x02 /* delayed allocation blocks */
+#define IOMAP_MAPPED 0x03 /* blocks allocated @blkno */
+#define IOMAP_UNWRITTEN 0x04 /* blocks allocated @blkno in unwritten state */
+
+struct iomap {
+ sector_t blkno; /* first sector of mapping */
+ loff_t offset; /* file offset of mapping, bytes */
+ u64 length; /* length of mapping, bytes */
+ int type; /* type of mapping */
+};
+
struct export_operations {
int (*encode_fh)(struct inode *inode, __u32 *fh, int *max_len,
struct inode *parent);
@@ -191,6 +205,13 @@ struct export_operations {
struct dentry *child);
struct dentry * (*get_parent)(struct dentry *child);
int (*commit_metadata)(struct inode *inode);
+
+ int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
+ int (*map_blocks)(struct inode *inode, loff_t offset,
+ u64 len, struct iomap *iomap,
+ bool write, u32 *device_generation);
+ int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
+ int nr_iomaps, struct iattr *iattr);
};
extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
--
1.9.1
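A userspace sketch of the mapping contract the new ->map_blocks method implies (the struct mirrors the iomap added to exportfs.h above, but the helper is a toy stand-in, not the kernel API — a real filesystem consults its extent tree and allocates blocks for writes):

```c
#include <stdint.h>

/* Mirrors the iomap structure the patch adds to exportfs.h. */
#define IOMAP_HOLE   0x01	/* no blocks allocated, need allocation */
#define IOMAP_MAPPED 0x03	/* blocks allocated @blkno */

struct model_iomap {
	uint64_t blkno;  /* first sector of mapping */
	int64_t  offset; /* file offset of mapping, bytes */
	uint64_t length; /* length of mapping, bytes */
	int      type;   /* IOMAP_* type of mapping */
};

/*
 * Toy map_blocks: pretend the whole file is one contiguous extent
 * starting at sector 2048.  The contract is that the returned mapping
 * covers at least the start of the requested [offset, offset + len)
 * range and reports where (and whether) those blocks live on disk.
 */
static int model_map_blocks(int64_t offset, uint64_t len,
			    struct model_iomap *iomap, int write)
{
	(void)write;	/* a real fs would allocate here for writes */
	iomap->blkno = 2048 + (offset >> 9);
	iomap->offset = offset;
	iomap->length = len;
	iomap->type = IOMAP_MAPPED;
	return 0;
}
```

commit_blocks is then the inverse half of the contract: once the client is done writing, the server hands the same byte ranges back so the filesystem can convert unwritten extents and update size/timestamps.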
Add a small shim between core nfsd and filesystems to translate the
somewhat cumbersome pNFS data structures and semantics to something
more palatable for Linux filesystems.
Signed-off-by: Christoph Hellwig <[email protected]>
---
.../filesystems/nfs/pnfs-block-server.txt | 40 +++++
fs/nfsd/Makefile | 2 +-
fs/nfsd/blocklayout.c | 194 +++++++++++++++++++++
fs/nfsd/blocklayoutxdr.c | 157 +++++++++++++++++
fs/nfsd/blocklayoutxdr.h | 62 +++++++
fs/nfsd/nfs4layouts.c | 7 +
fs/nfsd/pnfs.h | 1 +
7 files changed, 462 insertions(+), 1 deletion(-)
create mode 100644 Documentation/filesystems/nfs/pnfs-block-server.txt
create mode 100644 fs/nfsd/blocklayout.c
create mode 100644 fs/nfsd/blocklayoutxdr.c
create mode 100644 fs/nfsd/blocklayoutxdr.h
diff --git a/Documentation/filesystems/nfs/pnfs-block-server.txt b/Documentation/filesystems/nfs/pnfs-block-server.txt
new file mode 100644
index 0000000..f45d399
--- /dev/null
+++ b/Documentation/filesystems/nfs/pnfs-block-server.txt
@@ -0,0 +1,40 @@
+pNFS block layout server user guide
+
+The Linux NFS server now supports the pNFS block layout extension. In this
+case the NFS server acts as Metadata Server (MDS) for pNFS, which in addition
+to handling all the metadata access to the NFS export also hands out layouts
+to the clients to directly access the underlying block devices that is
+shared with the client. Note that there are no Data Servers (DSs) in the
+block layout flavor of pNFS.
+
+To use pNFS block layouts with with the Linux NFS server the exported file
+system needs to support the pNFS block layouts (current just XFS), and the
+file system must sit on shared storage (typically iSCSI) that is accessible
+to the clients as well as the server. The file system needs to either sit
+directly on the exported volume, or on a RAID 0 using the MD software RAID
+driver with the version 1 superblock format. If the filesystem uses sits
+on a RAID 0 device the clients will automatically stripe their I/O over
+multiple LUNs.
+
+On the server pNFS block volume support is automatically if the file system
+support its. On the client make sure the kernel has the CONFIG_PNFS_BLOCK
+option enabled, the blkmapd daemon from nfs-utils is running, and the
+file system, is mounted using the NFSv4.1 protocol version (mount -o vers=4.1).
+
+If the nfsd server needs to fence a non-responding client it calls
+/sbin/nfsd-recall-failed with the first argument set to the IP address of
+the client, and the second argument set to the device node without the /dev
+prefix for the filesystem to be fenced. Below is an example file that show
+how to translate the device into a serial number from SCSI EVPD 0x80:
+
+cat > /sbin/nfsd-recall-failed << EOF
+#!/bin/sh
+
+CLIENT="$1"
+DEV="/dev/$2"
+EVPD=`sg_inq --page=0x80 ${DEV} | \
+ grep "Unit serial number:" | \
+ awk -F ': ' '{print $2}'`
+
+echo "fencing client ${CLIENT} serial ${EVPD}" >> /var/log/pnfsd-fence.log
+EOF
diff --git a/fs/nfsd/Makefile b/fs/nfsd/Makefile
index 6cba933..9a6028e 100644
--- a/fs/nfsd/Makefile
+++ b/fs/nfsd/Makefile
@@ -17,4 +17,4 @@ nfsd-$(CONFIG_NFSD_V3) += nfs3proc.o nfs3xdr.o
nfsd-$(CONFIG_NFSD_V3_ACL) += nfs3acl.o
nfsd-$(CONFIG_NFSD_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4idmap.o \
nfs4acl.o nfs4callback.o nfs4recover.o
-nfsd-$(CONFIG_NFSD_PNFS) += nfs4layouts.o
+nfsd-$(CONFIG_NFSD_PNFS) += nfs4layouts.o blocklayout.o blocklayoutxdr.o
diff --git a/fs/nfsd/blocklayout.c b/fs/nfsd/blocklayout.c
new file mode 100644
index 0000000..a14e358
--- /dev/null
+++ b/fs/nfsd/blocklayout.c
@@ -0,0 +1,194 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include <linux/exportfs.h>
+#include <linux/genhd.h>
+#include <linux/slab.h>
+#include <linux/raid_class.h>
+
+#include <linux/nfsd/debug.h>
+
+#include "blocklayoutxdr.h"
+#include "pnfs.h"
+
+#define NFSDDBG_FACILITY NFSDDBG_PNFS
+
+
+static int
+nfsd4_block_get_device_info_simple(struct super_block *sb,
+ struct nfsd4_getdeviceinfo *gdp)
+{
+ struct pnfs_block_deviceaddr *dev;
+ struct pnfs_block_volume *b;
+
+ dev = kzalloc(sizeof(struct pnfs_block_deviceaddr) +
+ sizeof(struct pnfs_block_volume), GFP_KERNEL);
+ if (!dev)
+ return -ENOMEM;
+ gdp->gd_device = dev;
+
+ dev->nr_volumes = 1;
+ b = &dev->volumes[0];
+
+ b->type = PNFS_BLOCK_VOLUME_SIMPLE;
+ b->simple.sig_len = PNFS_BLOCK_UUID_LEN;
+ return sb->s_export_op->get_uuid(sb, b->simple.sig, &b->simple.sig_len,
+ &b->simple.offset);
+}
+
+static __be32
+nfsd4_block_proc_getdeviceinfo(struct super_block *sb,
+ struct nfsd4_getdeviceinfo *gdp)
+{
+ if (sb->s_bdev != sb->s_bdev->bd_contains)
+ return nfserr_inval;
+ return nfserrno(nfsd4_block_get_device_info_simple(sb, gdp));
+}
+
+static __be32
+nfsd4_block_proc_layoutget(struct inode *inode, const struct svc_fh *fhp,
+ struct nfsd4_layoutget *args)
+{
+ struct nfsd4_layout_seg *seg = &args->lg_seg;
+ struct super_block *sb = inode->i_sb;
+ u32 block_size = (1 << inode->i_blkbits);
+ struct pnfs_block_extent *bex;
+ struct iomap iomap;
+ u32 device_generation = 0;
+ int error;
+
+ /*
+ * We do not attempt to support I/O smaller than the fs block size,
+ * or not aligned to it.
+ */
+ if (args->lg_minlength < block_size) {
+ dprintk("pnfsd: I/O too small\n");
+ goto out_layoutunavailable;
+ }
+ if (seg->offset & (block_size - 1)) {
+ dprintk("pnfsd: I/O misaligned\n");
+ goto out_layoutunavailable;
+ }
+
+ /*
+ * Some clients barf on non-zero block numbers for NONE or INVALID
+ * layouts, so make sure to zero the whole structure.
+ */
+ error = -ENOMEM;
+ bex = kzalloc(sizeof(*bex), GFP_KERNEL);
+ if (!bex)
+ goto out_error;
+ args->lg_content = bex;
+
+ error = sb->s_export_op->map_blocks(inode, seg->offset, seg->length,
+ &iomap, seg->iomode != IOMODE_READ,
+ &device_generation);
+ if (error) {
+ if (error == -ENXIO)
+ goto out_layoutunavailable;
+ goto out_error;
+ }
+
+ if (iomap.length < args->lg_minlength) {
+ dprintk("pnfsd: extent smaller than minlength\n");
+ goto out_layoutunavailable;
+ }
+
+ switch (iomap.type) {
+ case IOMAP_MAPPED:
+ if (seg->iomode == IOMODE_READ)
+ bex->es = PNFS_BLOCK_READ_DATA;
+ else
+ bex->es = PNFS_BLOCK_READWRITE_DATA;
+ bex->soff = (iomap.blkno << 9);
+ break;
+ case IOMAP_UNWRITTEN:
+ if (seg->iomode & IOMODE_RW) {
+ /*
+ * Crack monkey special case from section 2.3.1.
+ */
+ if (args->lg_minlength == 0) {
+ dprintk("pnfsd: no soup for you!\n");
+ goto out_layoutunavailable;
+ }
+
+ bex->es = PNFS_BLOCK_INVALID_DATA;
+ bex->soff = (iomap.blkno << 9);
+ break;
+ }
+ /*FALLTHRU*/
+ case IOMAP_HOLE:
+ if (seg->iomode == IOMODE_READ) {
+ bex->es = PNFS_BLOCK_NONE_DATA;
+ break;
+ }
+ /*FALLTHRU*/
+ case IOMAP_DELALLOC:
+ default:
+ WARN(1, "pnfsd: filesystem returned %d extent\n", iomap.type);
+ goto out_layoutunavailable;
+ }
+
+ error = nfsd4_set_deviceid(&bex->vol_id, fhp, device_generation);
+ if (error)
+ goto out_error;
+ bex->foff = iomap.offset;
+ bex->len = iomap.length;
+
+ seg->offset = iomap.offset;
+ seg->length = iomap.length;
+
+ args->lg_roc = 1;
+
+ dprintk("GET: %lld:%lld %d\n", bex->foff, bex->len, bex->es);
+ return 0;
+
+out_error:
+ seg->length = 0;
+ return nfserrno(error);
+out_layoutunavailable:
+ seg->length = 0;
+ return nfserr_layoutunavailable;
+}
+
+static __be32
+nfsd4_block_proc_layoutcommit(struct inode *inode,
+ struct nfsd4_layoutcommit *lcp)
+{
+ loff_t new_size = lcp->lc_last_wr + 1;
+ struct iattr iattr = { .ia_valid = 0 };
+ struct iomap *iomaps;
+ int nr_iomaps;
+ int error;
+
+ nr_iomaps = nfsd4_block_decode_layoutupdate(lcp->lc_up_layout,
+ lcp->lc_up_len, &iomaps, 1 << inode->i_blkbits);
+ if (nr_iomaps < 0)
+ return nfserrno(nr_iomaps);
+
+ if (lcp->lc_mtime.tv_nsec == UTIME_NOW)
+ lcp->lc_mtime = current_fs_time(inode->i_sb);
+ if (timespec_compare(&lcp->lc_mtime, &inode->i_mtime) > 0) {
+ iattr.ia_valid |= ATTR_ATIME | ATTR_CTIME | ATTR_MTIME;
+ iattr.ia_atime = iattr.ia_ctime = iattr.ia_mtime =
+ lcp->lc_mtime;
+ }
+
+ if (new_size > i_size_read(inode)) {
+ iattr.ia_valid |= ATTR_SIZE;
+ iattr.ia_size = new_size;
+ }
+
+ error = inode->i_sb->s_export_op->commit_blocks(inode, iomaps,
+ nr_iomaps, &iattr);
+ kfree(iomaps);
+ return nfserrno(error);
+}
+
+const struct nfsd4_layout_ops bl_layout_ops = {
+ .proc_getdeviceinfo = nfsd4_block_proc_getdeviceinfo,
+ .encode_getdeviceinfo = nfsd4_block_encode_getdeviceinfo,
+ .proc_layoutget = nfsd4_block_proc_layoutget,
+ .encode_layoutget = nfsd4_block_encode_layoutget,
+ .proc_layoutcommit = nfsd4_block_proc_layoutcommit,
+};
diff --git a/fs/nfsd/blocklayoutxdr.c b/fs/nfsd/blocklayoutxdr.c
new file mode 100644
index 0000000..9da89fd
--- /dev/null
+++ b/fs/nfsd/blocklayoutxdr.c
@@ -0,0 +1,157 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include <linux/sunrpc/svc.h>
+#include <linux/exportfs.h>
+#include <linux/nfs4.h>
+
+#include "nfsd.h"
+#include "blocklayoutxdr.h"
+
+#define NFSDDBG_FACILITY NFSDDBG_PNFS
+
+
+__be32
+nfsd4_block_encode_layoutget(struct xdr_stream *xdr,
+ struct nfsd4_layoutget *lgp)
+{
+ struct pnfs_block_extent *b = lgp->lg_content;
+ int len = sizeof(__be32) + 5 * sizeof(__be64) + sizeof(__be32);
+ __be32 *p;
+
+ p = xdr_reserve_space(xdr, sizeof(__be32) + len);
+ if (!p)
+ return nfserr_toosmall;
+
+ *p++ = cpu_to_be32(len);
+ *p++ = cpu_to_be32(1); /* we always return a single extent */
+
+ p = xdr_encode_opaque_fixed(p, &b->vol_id,
+ sizeof(struct nfsd4_deviceid));
+ p = xdr_encode_hyper(p, b->foff);
+ p = xdr_encode_hyper(p, b->len);
+ p = xdr_encode_hyper(p, b->soff);
+ *p++ = cpu_to_be32(b->es);
+ return 0;
+}
+
+static int
+nfsd4_block_encode_volume(struct xdr_stream *xdr, struct pnfs_block_volume *b)
+{
+ __be32 *p;
+ int len;
+
+ switch (b->type) {
+ case PNFS_BLOCK_VOLUME_SIMPLE:
+ len = 4 + 4 + 8 + 4 + b->simple.sig_len;
+ p = xdr_reserve_space(xdr, len);
+ if (!p)
+ return -ETOOSMALL;
+
+ *p++ = cpu_to_be32(b->type);
+ *p++ = cpu_to_be32(1); /* single signature */
+ p = xdr_encode_hyper(p, b->simple.offset);
+ p = xdr_encode_opaque(p, b->simple.sig, b->simple.sig_len);
+ break;
+ default:
+ return -ENOTSUPP;
+ }
+
+ return len;
+}
+
+__be32
+nfsd4_block_encode_getdeviceinfo(struct xdr_stream *xdr,
+ struct nfsd4_getdeviceinfo *gdp)
+{
+ struct pnfs_block_deviceaddr *dev = gdp->gd_device;
+ int len = sizeof(__be32), ret, i;
+ __be32 *p;
+
+ p = xdr_reserve_space(xdr, len + sizeof(__be32));
+ if (!p)
+ return nfserr_resource;
+
+ for (i = 0; i < dev->nr_volumes; i++) {
+ ret = nfsd4_block_encode_volume(xdr, &dev->volumes[i]);
+ if (ret < 0)
+ return nfserrno(ret);
+ len += ret;
+ }
+
+ /*
+ * Fill in the overall length and number of volumes at the beginning
+ * of the layout.
+ */
+ *p++ = cpu_to_be32(len);
+ *p++ = cpu_to_be32(dev->nr_volumes);
+ return 0;
+}
+
+int
+nfsd4_block_decode_layoutupdate(__be32 *p, u32 len, struct iomap **iomapp,
+ u32 block_size)
+{
+ struct iomap *iomaps;
+ u32 nr_iomaps, expected, i;
+
+ if (len < sizeof(u32)) {
+ dprintk("%s: extent array too small: %u\n", __func__, len);
+ return -EINVAL;
+ }
+
+ nr_iomaps = be32_to_cpup(p++);
+ expected = sizeof(__be32) + nr_iomaps * NFS4_BLOCK_EXTENT_SIZE;
+ if (len != expected) {
+ dprintk("%s: extent array size mismatch: %u/%u\n",
+ __func__, len, expected);
+ return -EINVAL;
+ }
+
+ iomaps = kcalloc(nr_iomaps, sizeof(*iomaps), GFP_KERNEL);
+ if (!iomaps) {
+ dprintk("%s: failed to allocate extent array\n", __func__);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < nr_iomaps; i++) {
+ struct pnfs_block_extent bex;
+
+ memcpy(&bex.vol_id, p, sizeof(struct nfsd4_deviceid));
+ p += XDR_QUADLEN(sizeof(struct nfsd4_deviceid));
+
+ p = xdr_decode_hyper(p, &bex.foff);
+ if (bex.foff & (block_size - 1)) {
+ dprintk("%s: unaligned offset %lld\n",
+ __func__, bex.foff);
+ goto fail;
+ }
+ p = xdr_decode_hyper(p, &bex.len);
+ if (bex.len & (block_size - 1)) {
+ dprintk("%s: unaligned length %lld\n",
+ __func__, bex.len);
+ goto fail;
+ }
+ p = xdr_decode_hyper(p, &bex.soff);
+ if (bex.soff & (block_size - 1)) {
+ dprintk("%s: unaligned disk offset %lld\n",
+ __func__, bex.soff);
+ goto fail;
+ }
+ bex.es = be32_to_cpup(p++);
+ if (bex.es != PNFS_BLOCK_READWRITE_DATA) {
+ dprintk("%s: incorrect extent state %d\n",
+ __func__, bex.es);
+ goto fail;
+ }
+
+ iomaps[i].offset = bex.foff;
+ iomaps[i].length = bex.len;
+ }
+
+ *iomapp = iomaps;
+ return nr_iomaps;
+fail:
+ kfree(iomaps);
+ return -EINVAL;
+}
diff --git a/fs/nfsd/blocklayoutxdr.h b/fs/nfsd/blocklayoutxdr.h
new file mode 100644
index 0000000..fdc7903
--- /dev/null
+++ b/fs/nfsd/blocklayoutxdr.h
@@ -0,0 +1,62 @@
+#ifndef _NFSD_BLOCKLAYOUTXDR_H
+#define _NFSD_BLOCKLAYOUTXDR_H 1
+
+#include <linux/blkdev.h>
+#include "xdr4.h"
+
+struct iomap;
+struct xdr_stream;
+
+enum pnfs_block_extent_state {
+ PNFS_BLOCK_READWRITE_DATA = 0,
+ PNFS_BLOCK_READ_DATA = 1,
+ PNFS_BLOCK_INVALID_DATA = 2,
+ PNFS_BLOCK_NONE_DATA = 3,
+};
+
+struct pnfs_block_extent {
+ struct nfsd4_deviceid vol_id;
+ u64 foff;
+ u64 len;
+ u64 soff;
+ enum pnfs_block_extent_state es;
+};
+#define NFS4_BLOCK_EXTENT_SIZE 44
+
+enum pnfs_block_volume_type {
+ PNFS_BLOCK_VOLUME_SIMPLE = 0,
+ PNFS_BLOCK_VOLUME_SLICE = 1,
+ PNFS_BLOCK_VOLUME_CONCAT = 2,
+ PNFS_BLOCK_VOLUME_STRIPE = 3,
+};
+
+/*
+ * Random upper cap for the uuid length to avoid unbounded allocation.
+ * Not actually limited by the protocol.
+ */
+#define PNFS_BLOCK_UUID_LEN 128
+
+struct pnfs_block_volume {
+ enum pnfs_block_volume_type type;
+ union {
+ struct {
+ u64 offset;
+ u32 sig_len;
+ u8 sig[PNFS_BLOCK_UUID_LEN];
+ } simple;
+ };
+};
+
+struct pnfs_block_deviceaddr {
+ u32 nr_volumes;
+ struct pnfs_block_volume volumes[];
+};
+
+__be32 nfsd4_block_encode_getdeviceinfo(struct xdr_stream *xdr,
+ struct nfsd4_getdeviceinfo *gdp);
+__be32 nfsd4_block_encode_layoutget(struct xdr_stream *xdr,
+ struct nfsd4_layoutget *lgp);
+int nfsd4_block_decode_layoutupdate(__be32 *p, u32 len, struct iomap **iomapp,
+ u32 block_size);
+
+#endif /* _NFSD_BLOCKLAYOUTXDR_H */
diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index bb91981..8353b7a 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -26,6 +26,7 @@ static struct nfsd4_callback_ops nfsd4_cb_layout_ops;
static const struct lock_manager_operations nfsd4_layouts_lm_ops;
const struct nfsd4_layout_ops *nfsd4_layout_ops[LAYOUT_TYPE_MAX] = {
+ [LAYOUT_BLOCK_VOLUME] = &bl_layout_ops,
};
/* pNFS device ID to export fsid mapping */
@@ -116,6 +117,12 @@ nfsd4_set_deviceid(struct nfsd4_deviceid *id, const struct svc_fh *fhp,
void nfsd4_setup_layout_type(struct svc_export *exp)
{
+ struct super_block *sb = exp->ex_path.mnt->mnt_sb;
+
+ if (sb->s_export_op->get_uuid &&
+ sb->s_export_op->map_blocks &&
+ sb->s_export_op->commit_blocks)
+ exp->ex_layout_type = LAYOUT_BLOCK_VOLUME;
}
static void
diff --git a/fs/nfsd/pnfs.h b/fs/nfsd/pnfs.h
index fa37117..d6d94e1 100644
--- a/fs/nfsd/pnfs.h
+++ b/fs/nfsd/pnfs.h
@@ -34,6 +34,7 @@ struct nfsd4_layout_ops {
};
extern const struct nfsd4_layout_ops *nfsd4_layout_ops[];
+extern const struct nfsd4_layout_ops bl_layout_ops;
__be32 nfsd4_preprocess_layout_stateid(struct svc_rqst *rqstp,
struct nfsd4_compound_state *cstate, stateid_t *stateid,
--
1.9.1
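The iomap-type to extent-state translation at the heart of nfsd4_block_proc_layoutget above can be distilled into a small pure function. This is a simplified model for illustration (the minlength special case, device id handling, and error plumbing are elided, and the names are prefixed to mark them as non-kernel):

```c
/* iomap types from the new exportfs.h definitions. */
#define IOMAP_HOLE      0x01
#define IOMAP_DELALLOC  0x02
#define IOMAP_MAPPED    0x03
#define IOMAP_UNWRITTEN 0x04

/* pNFS block extent states from blocklayoutxdr.h, plus an error marker. */
enum model_extent_state {
	MODEL_BLOCK_READWRITE_DATA = 0,
	MODEL_BLOCK_READ_DATA = 1,
	MODEL_BLOCK_INVALID_DATA = 2,
	MODEL_BLOCK_NONE_DATA = 3,
	MODEL_BLOCK_UNAVAILABLE = -1,	/* stands in for layoutunavailable */
};

/*
 * Translate the mapping type returned by ->map_blocks into the extent
 * state handed to the client.  read_only is true for IOMODE_READ
 * layout requests.
 */
static enum model_extent_state extent_state(int iomap_type, int read_only)
{
	switch (iomap_type) {
	case IOMAP_MAPPED:
		return read_only ? MODEL_BLOCK_READ_DATA :
				   MODEL_BLOCK_READWRITE_DATA;
	case IOMAP_UNWRITTEN:
		/* Writable layouts expose unwritten blocks as INVALID. */
		if (!read_only)
			return MODEL_BLOCK_INVALID_DATA;
		/* Readers see unwritten extents as holes. */
		return MODEL_BLOCK_NONE_DATA;
	case IOMAP_HOLE:
		if (read_only)
			return MODEL_BLOCK_NONE_DATA;
		return MODEL_BLOCK_UNAVAILABLE;
	default:	/* IOMAP_DELALLOC and anything unexpected */
		return MODEL_BLOCK_UNAVAILABLE;
	}
}
```

Note how delayed-allocation extents are never handed out: a client writing through a layout must have real (or at least unwritten) blocks behind it, which is why the shim maps IOMAP_DELALLOC to the unavailable case.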
On Tue, Jan 06, 2015 at 05:28:37PM +0100, Christoph Hellwig wrote:
> Add a small shim between core nfsd and filesystems to translate the
> somewhat cumbersome pNFS data structures and semantics to something
> more palatable for Linux filesystems.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> .../filesystems/nfs/pnfs-block-server.txt | 40 +++++
> fs/nfsd/Makefile | 2 +-
> fs/nfsd/blocklayout.c | 194 +++++++++++++++++++++
> fs/nfsd/blocklayoutxdr.c | 157 +++++++++++++++++
> fs/nfsd/blocklayoutxdr.h | 62 +++++++
> fs/nfsd/nfs4layouts.c | 7 +
> fs/nfsd/pnfs.h | 1 +
> 7 files changed, 462 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/filesystems/nfs/pnfs-block-server.txt
> create mode 100644 fs/nfsd/blocklayout.c
> create mode 100644 fs/nfsd/blocklayoutxdr.c
> create mode 100644 fs/nfsd/blocklayoutxdr.h
>
> diff --git a/Documentation/filesystems/nfs/pnfs-block-server.txt b/Documentation/filesystems/nfs/pnfs-block-server.txt
> new file mode 100644
> index 0000000..f45d399
> --- /dev/null
> +++ b/Documentation/filesystems/nfs/pnfs-block-server.txt
> @@ -0,0 +1,40 @@
> +pNFS block layout server user guide
> +
> +The Linux NFS server now supports the pNFS block layout extension. In this
> +case the NFS server acts as Metadata Server (MDS) for pNFS, which in addition
> +to handling all the metadata access to the NFS export also hands out layouts
> +to the clients to directly access the underlying block devices that is
s/is/are/.
> +shared with the client. Note that there are no Data Servers (DSs) in the
> +block layout flavor of pNFS.
> +
> +To use pNFS block layouts with with the Linux NFS server the exported file
> +system needs to support the pNFS block layouts (current just XFS), and the
> +file system must sit on shared storage (typically iSCSI) that is accessible
> +to the clients as well as the server. The file system needs to either sit
> +directly on the exported volume, or on a RAID 0 using the MD software RAID
> +driver with the version 1 superblock format. If the filesystem uses sits
> +on a RAID 0 device the clients will automatically stripe their I/O over
> +multiple LUNs.
> +
> +On the server pNFS block volume support is automatically if the file system
s/automatically/automatically enabled/.
So there's no server-side configuration required at all?
--b.
> +support its. On the client make sure the kernel has the CONFIG_PNFS_BLOCK
> +option enabled, the blkmapd daemon from nfs-utils is running, and the
> +file system, is mounted using the NFSv4.1 protocol version (mount -o vers=4.1).
> +
> +If the nfsd server needs to fence a non-responding client it calls
> +/sbin/nfsd-recall-failed with the first argument set to the IP address of
> +the client, and the second argument set to the device node without the /dev
> +prefix for the filesystem to be fenced. Below is an example file that show
> +how to translate the device into a serial number from SCSI EVPD 0x80:
...
On Tue, Jan 06, 2015 at 12:16:58PM -0500, J. Bruce Fields wrote:
> > +file system must sit on shared storage (typically iSCSI) that is accessible
> > +to the clients as well as the server. The file system needs to either sit
> > +directly on the exported volume, or on a RAID 0 using the MD software RAID
> > +driver with the version 1 superblock format. If the filesystem uses sits
> > +on a RAID 0 device the clients will automatically stripe their I/O over
> > +multiple LUNs.
> > +
> > +On the server pNFS block volume support is automatically if the file system
>
> s/automatically/automatically enabled/.
>
> So there's no server-side configuration required at all?
The only required configuration is the fencing helper script if you
want to be able to fence a non-responding client. For simple test setups
everything will just work out of the box.
On Tue, Jan 06, 2015 at 06:39:57PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 06, 2015 at 12:16:58PM -0500, J. Bruce Fields wrote:
> > > +file system must sit on shared storage (typically iSCSI) that is accessible
> > > +to the clients as well as the server. The file system needs to either sit
> > > +directly on the exported volume, or on a RAID 0 using the MD software RAID
> > > +driver with the version 1 superblock format. If the filesystem uses sits
> > > +on a RAID 0 device the clients will automatically stripe their I/O over
> > > +multiple LUNs.
> > > +
> > > +On the server pNFS block volume support is automatically if the file system
> >
> > s/automatically/automatically enabled/.
> >
> > So there's no server-side configuration required at all?
>
> The only required configuration is the fencing helper script if you
> want to be able to fence a non-responding client. For simple test setups
> everything will just work out of the box.
I think we want at a minimum some kind of server-side "off" switch.
If nothing else it'd be handy for troubleshooting. ("Server crashing?
Could you turn off pnfs blocks and try again?")
--b.
On Tue, 6 Jan 2015 14:39:49 -0500
"J. Bruce Fields" <[email protected]> wrote:
> On Tue, Jan 06, 2015 at 06:39:57PM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 06, 2015 at 12:16:58PM -0500, J. Bruce Fields wrote:
> > > > +file system must sit on shared storage (typically iSCSI) that is accessible
> > > > +to the clients as well as the server. The file system needs to either sit
> > > > +directly on the exported volume, or on a RAID 0 using the MD software RAID
> > > > +driver with the version 1 superblock format. If the filesystem uses sits
> > > > +on a RAID 0 device the clients will automatically stripe their I/O over
> > > > +multiple LUNs.
> > > > +
> > > > +On the server pNFS block volume support is automatically if the file system
> > >
> > > s/automatically/automatically enabled/.
> > >
> > > So there's no server-side configuration required at all?
> >
> > The only required configuration is the fencing helper script if you
> > want to be able to fence a non-responding client. For simple test setups
> > everything will just work out of the box.
>
> I think we want at a minimum some kind of server-side "off" switch.
>
> If nothing else it'd be handy for troubleshooting. ("Server crashing?
> Could you turn off pnfs blocks and try again?")
>
> --b.
Or maybe an "on" switch?
We have some patches (not posted currently) that add a "pnfs" export
option. Maybe we should add that and only enable pnfs on exports that
have that option present?
--
Jeff Layton <[email protected]>
On Tue, Jan 06, 2015 at 11:42:05AM -0800, Jeff Layton wrote:
> Or maybe an "on" switch?
>
> We have some patches (not posted currently) that add a "pnfs" export
> option. Maybe we should add that and only enable pnfs on exports that
> have that option present?
I would definitely prefer the off switch. I can add one if people want
it, but export options are a little annoying as they require support
not only in the kernel but also in nfs-utils.
On Wed, 7 Jan 2015 11:28:02 +0100
Christoph Hellwig <[email protected]> wrote:
> On Tue, Jan 06, 2015 at 11:42:05AM -0800, Jeff Layton wrote:
> > Or maybe an "on" switch?
> >
> > We have some patches (not posted currently) that add a "pnfs" export
> > option. Maybe we should add that and only enable pnfs on exports that
> > have that option present?
>
> I would definitely prefer the off switch. I can add one if people want
> it, but export options are a little annoying as they require support
> not only in the kernel but also in nfs-utils.
True, it is a pain, but I think it's realistic to expect someone who
wants to do pnfs to have an updated nfs-utils. It wouldn't take too
long for it to trickle out to the various distros and adding new export
options is fairly simple to do.
If we do want to go that route, it might be nice to do the option with
a list of layout types. For example:
pnfs=block:file:flexfiles
...so we could potentially support more than one layout type per
export.
--
Jeff Layton <[email protected]>
On Thu, Jan 08, 2015 at 12:41:31PM -0800, Jeff Layton wrote:
> On Wed, 7 Jan 2015 11:28:02 +0100
> Christoph Hellwig <[email protected]> wrote:
>
> > On Tue, Jan 06, 2015 at 11:42:05AM -0800, Jeff Layton wrote:
> > > Or maybe an "on" switch?
> > >
> > > We have some patches (not posted currently) that add a "pnfs" export
> > > option. Maybe we should add that and only enable pnfs on exports that
> > > have that option present?
> >
> > I would definitely prefer the off switch. I can add one if people want
> > it, but export options are a little annoying as they require support
> > not only in the kernel but also in nfs-utils.
>
> True, it is a pain, but I think it's realistic to expect someone who
> wants to do pnfs to have an updated nfs-utils. It wouldn't take too
> long for it to trickle out to the various distros and adding new export
> options is fairly simple to do.
>
> If we do want to go that route, it might be nice to do the option with
> a list of layout types. For example:
>
> pnfs=block:file:flexfiles
>
> ...so we could potentially support more than one layout type per
> export.
I like the goal of making this as close to zero-configuration as
possible, and I'd rather wait for a demonstrated need before adding
per-export or multiple-layout-type configuration. A global off switch
sounds OK to me.
--b.
On Tue, Jan 06, 2015 at 05:28:37PM +0100, Christoph Hellwig wrote:
> Add a small shim between core nfsd and filesystems to translate the
> somewhat cumbersome pNFS data structures and semantics to something
> more palatable for Linux filesystems.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> .../filesystems/nfs/pnfs-block-server.txt | 40 +++++
> fs/nfsd/Makefile | 2 +-
> fs/nfsd/blocklayout.c | 194 +++++++++++++++++++++
> fs/nfsd/blocklayoutxdr.c | 157 +++++++++++++++++
> fs/nfsd/blocklayoutxdr.h | 62 +++++++
> fs/nfsd/nfs4layouts.c | 7 +
> fs/nfsd/pnfs.h | 1 +
> 7 files changed, 462 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/filesystems/nfs/pnfs-block-server.txt
> create mode 100644 fs/nfsd/blocklayout.c
> create mode 100644 fs/nfsd/blocklayoutxdr.c
> create mode 100644 fs/nfsd/blocklayoutxdr.h
Could you follow the client code convention by putting
each layout type in a directory?
lacker:linux loghyr$ ls -la fs/nfs/blocklayout/
total 80
On Sun, Jan 11, 2015 at 11:56:06PM -0500, Tom Haynes wrote:
> Could you follow the client code convention by putting
> each layout type in a directory?
I have to say I hate that convention on the client side, so I'd
be happier to keep it as-is.
On Tue, Jan 06, 2015 at 05:28:37PM +0100, Christoph Hellwig wrote:
> Add a small shim between core nfsd and filesystems to translate the
> somewhat cumbersome pNFS data structures and semantics to something
> more palatable for Linux filesystems.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> .../filesystems/nfs/pnfs-block-server.txt | 40 +++++
> fs/nfsd/Makefile | 2 +-
> fs/nfsd/blocklayout.c | 194 +++++++++++++++++++++
> fs/nfsd/blocklayoutxdr.c | 157 +++++++++++++++++
> fs/nfsd/blocklayoutxdr.h | 62 +++++++
> fs/nfsd/nfs4layouts.c | 7 +
> fs/nfsd/pnfs.h | 1 +
> 7 files changed, 462 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/filesystems/nfs/pnfs-block-server.txt
> create mode 100644 fs/nfsd/blocklayout.c
> create mode 100644 fs/nfsd/blocklayoutxdr.c
> create mode 100644 fs/nfsd/blocklayoutxdr.h
>
> diff --git a/Documentation/filesystems/nfs/pnfs-block-server.txt b/Documentation/filesystems/nfs/pnfs-block-server.txt
> new file mode 100644
> index 0000000..f45d399
> --- /dev/null
> +++ b/Documentation/filesystems/nfs/pnfs-block-server.txt
> @@ -0,0 +1,40 @@
> +pNFS block layout server user guide
> +
> +The Linux NFS server now supports the pNFS block layout extension. In this
> +case the NFS server acts as Metadata Server (MDS) for pNFS, which in addition
> +to handling all the metadata access to the NFS export also hands out layouts
> +to the clients to directly access the underlying block devices that is
to the clients. The layout allows the client to directly access the underlying block devices that (are)
> +shared with the client. Note that there are no Data Servers (DSs) in the
> +block layout flavor of pNFS.
Which is why the spec calls them storage devices.
> +
> +To use pNFS block layouts with with the Linux NFS server the exported file
> +system needs to support the pNFS block layouts (current just XFS), and the
currently
> +file system must sit on shared storage (typically iSCSI) that is accessible
> +to the clients as well as the server. The file system needs to either sit
> +directly on the exported volume, or on a RAID 0 using the MD software RAID
a RAID 0 what?
> +driver with the version 1 superblock format. If the filesystem uses sits
In general, /filesystem/file system/
/filesystem uses/file system it uses/
> +on a RAID 0 device the clients will automatically stripe their I/O over
> +multiple LUNs.
> +
> +On the server pNFS block volume support is automatically if the file system
> +support its. On the client make sure the kernel has the CONFIG_PNFS_BLOCK
/its/it/
> +option enabled, the blkmapd daemon from nfs-utils is running, and the
> +file system, is mounted using the NFSv4.1 protocol version (mount -o vers=4.1).
/system, is/system is/
> +
> +If the nfsd server needs to fence a non-responding client it calls
> +/sbin/nfsd-recall-failed with the first argument set to the IP address of
> +the client, and the second argument set to the device node without the /dev
> +prefix for the filesystem to be fenced. Below is an example file that show
/show/shows/
> +how to translate the device into a serial number from SCSI EVPD 0x80:
> +
> +cat > /sbin/nfsd-recall-failed << EOF
> +#!/bin/sh
> +
> +CLIENT="$1"
> +DEV="/dev/$2"
> +EVPD=`sg_inq --page=0x80 ${DEV} | \
> + grep "Unit serial number:" | \
> + awk -F ': ' '{print $2}'`
> +
> +echo "fencing client ${CLIENT} serial ${EVPD}" >> /var/log/pnfsd-fence.log
> +EOF
> diff --git a/fs/nfsd/Makefile b/fs/nfsd/Makefile
> index 6cba933..9a6028e 100644
> --- a/fs/nfsd/Makefile
> +++ b/fs/nfsd/Makefile
> @@ -17,4 +17,4 @@ nfsd-$(CONFIG_NFSD_V3) += nfs3proc.o nfs3xdr.o
> nfsd-$(CONFIG_NFSD_V3_ACL) += nfs3acl.o
> nfsd-$(CONFIG_NFSD_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4idmap.o \
> nfs4acl.o nfs4callback.o nfs4recover.o
> -nfsd-$(CONFIG_NFSD_PNFS) += nfs4layouts.o
> +nfsd-$(CONFIG_NFSD_PNFS) += nfs4layouts.o blocklayout.o blocklayoutxdr.o
> diff --git a/fs/nfsd/blocklayout.c b/fs/nfsd/blocklayout.c
> new file mode 100644
> index 0000000..a14e358
> --- /dev/null
> +++ b/fs/nfsd/blocklayout.c
> @@ -0,0 +1,194 @@
> +/*
> + * Copyright (c) 2014 Christoph Hellwig.
> + */
> +#include <linux/exportfs.h>
> +#include <linux/genhd.h>
> +#include <linux/slab.h>
> +#include <linux/raid_class.h>
> +
> +#include <linux/nfsd/debug.h>
> +
> +#include "blocklayoutxdr.h"
> +#include "pnfs.h"
> +
> +#define NFSDDBG_FACILITY NFSDDBG_PNFS
> +
> +
> +static int
> +nfsd4_block_get_device_info_simple(struct super_block *sb,
> + struct nfsd4_getdeviceinfo *gdp)
> +{
> + struct pnfs_block_deviceaddr *dev;
> + struct pnfs_block_volume *b;
> +
> + dev = kzalloc(sizeof(struct pnfs_block_deviceaddr) +
> + sizeof(struct pnfs_block_volume), GFP_KERNEL);
> + if (!dev)
> + return -ENOMEM;
> + gdp->gd_device = dev;
> +
> + dev->nr_volumes = 1;
> + b = &dev->volumes[0];
> +
> + b->type = PNFS_BLOCK_VOLUME_SIMPLE;
> + b->simple.sig_len = PNFS_BLOCK_UUID_LEN;
> + return sb->s_export_op->get_uuid(sb, b->simple.sig, &b->simple.sig_len,
> + &b->simple.offset);
> +}
> +
> +static __be32
> +nfsd4_block_proc_getdeviceinfo(struct super_block *sb,
> + struct nfsd4_getdeviceinfo *gdp)
> +{
> + if (sb->s_bdev != sb->s_bdev->bd_contains)
> + return nfserr_inval;
> + return nfserrno(nfsd4_block_get_device_info_simple(sb, gdp));
> +}
> +
> +static __be32
> +nfsd4_block_proc_layoutget(struct inode *inode, const struct svc_fh *fhp,
> + struct nfsd4_layoutget *args)
> +{
> + struct nfsd4_layout_seg *seg = &args->lg_seg;
> + struct super_block *sb = inode->i_sb;
> + u32 block_size = (1 << inode->i_blkbits);
> + struct pnfs_block_extent *bex;
> + struct iomap iomap;
> + u32 device_generation = 0;
> + int error;
> +
> + /*
> + * We do not attempt to support I/O smaller than the fs block size,
> + * or not aligned to it.
> + */
> + if (args->lg_minlength < block_size) {
> + dprintk("pnfsd: I/O too small\n");
> + goto out_layoutunavailable;
> + }
> + if (seg->offset & (block_size - 1)) {
> + dprintk("pnfsd: I/O misaligned\n");
> + goto out_layoutunavailable;
> + }
> +
> + /*
> + * Some clients barf on non-zero block numbers for NONE or INVALID
> + * layouts, so make sure to zero the whole structure.
> + */
> + error = -ENOMEM;
> + bex = kzalloc(sizeof(*bex), GFP_KERNEL);
> + if (!bex)
> + goto out_error;
bex is allocated.
> + args->lg_content = bex;
> +
> + error = sb->s_export_op->map_blocks(inode, seg->offset, seg->length,
> + &iomap, seg->iomode != IOMODE_READ,
> + &device_generation);
> + if (error) {
> + if (error == -ENXIO)
> + goto out_layoutunavailable;
> + goto out_error;
> + }
> +
> + if (iomap.length < args->lg_minlength) {
> + dprintk("pnfsd: extent smaller than minlength\n");
> + goto out_layoutunavailable;
> + }
> +
> + switch (iomap.type) {
> + case IOMAP_MAPPED:
> + if (seg->iomode == IOMODE_READ)
> + bex->es = PNFS_BLOCK_READ_DATA;
> + else
> + bex->es = PNFS_BLOCK_READWRITE_DATA;
> + bex->soff = (iomap.blkno << 9);
> + break;
> + case IOMAP_UNWRITTEN:
> + if (seg->iomode & IOMODE_RW) {
> + /*
> + * Crack monkey special case from section 2.3.1.
> + */
> + if (args->lg_minlength == 0) {
> + dprintk("pnfsd: no soup for you!\n");
> + goto out_layoutunavailable;
> + }
> +
> + bex->es = PNFS_BLOCK_INVALID_DATA;
> + bex->soff = (iomap.blkno << 9);
> + break;
> + }
> + /*FALLTHRU*/
> + case IOMAP_HOLE:
> + if (seg->iomode == IOMODE_READ) {
> + bex->es = PNFS_BLOCK_NONE_DATA;
> + break;
> + }
> + /*FALLTHRU*/
> + case IOMAP_DELALLOC:
> + default:
> + WARN(1, "pnfsd: filesystem returned %d extent\n", iomap.type);
> + goto out_layoutunavailable;
> + }
> +
> + error = nfsd4_set_deviceid(&bex->vol_id, fhp, device_generation);
> + if (error)
> + goto out_error;
> + bex->foff = iomap.offset;
> + bex->len = iomap.length;
> +
> + seg->offset = iomap.offset;
> + seg->length = iomap.length;
> +
> + args->lg_roc = 1;
> +
> + dprintk("GET: %lld:%lld %d\n", bex->foff, bex->len, bex->es);
> + return 0;
> +
> +out_error:
> + seg->length = 0;
> + return nfserrno(error);
> +out_layoutunavailable:
> + seg->length = 0;
> + return nfserr_layoutunavailable;
What reclaims bex in both error cases??
The call flow seems to be:
nfsd4_proc_compound -> nfsd4_layoutget -> nfsd4_block_proc_layoutget
lg_content gets freed in nfsd4_encode_layoutget() in all paths.
nfsd4_encode_operation() calls nfsd4_encode_layoutget().
But nfsd4_encode_layoutget() is not called in all paths:
p = xdr_reserve_space(xdr, 8);
if (!p) {
WARN_ON_ONCE(1);
return; // leak
}
...
if (op->opnum == OP_ILLEGAL)
goto status; // Not really a leak, if we hit this, bigger issues apply.
So bex is correctly accounted for, but in general
nfsd4_encode_operation() can leak any operation-specific memory.
> +}
> +
> +static __be32
> +nfsd4_block_proc_layoutcommit(struct inode *inode,
> + struct nfsd4_layoutcommit *lcp)
> +{
> + loff_t new_size = lcp->lc_last_wr + 1;
> + struct iattr iattr = { .ia_valid = 0 };
> + struct iomap *iomaps;
> + int nr_iomaps;
> + int error;
> +
> + nr_iomaps = nfsd4_block_decode_layoutupdate(lcp->lc_up_layout,
> + lcp->lc_up_len, &iomaps, 1 << inode->i_blkbits);
> + if (nr_iomaps < 0)
> + return nfserrno(nr_iomaps);
> +
> + if (lcp->lc_mtime.tv_nsec == UTIME_NOW)
> + lcp->lc_mtime = current_fs_time(inode->i_sb);
> + if (timespec_compare(&lcp->lc_mtime, &inode->i_mtime) > 0) {
> + iattr.ia_valid |= ATTR_ATIME | ATTR_CTIME | ATTR_MTIME;
> + iattr.ia_atime = iattr.ia_ctime = iattr.ia_mtime =
> + lcp->lc_mtime;
> + }
> +
> + if (new_size > i_size_read(inode)) {
> + iattr.ia_valid |= ATTR_SIZE;
> + iattr.ia_size = new_size;
> + }
> +
> + error = inode->i_sb->s_export_op->commit_blocks(inode, iomaps,
> + nr_iomaps, &iattr);
> + kfree(iomaps);
> + return nfserrno(error);
> +}
> +
> +const struct nfsd4_layout_ops bl_layout_ops = {
> + .proc_getdeviceinfo = nfsd4_block_proc_getdeviceinfo,
> + .encode_getdeviceinfo = nfsd4_block_encode_getdeviceinfo,
> + .proc_layoutget = nfsd4_block_proc_layoutget,
> + .encode_layoutget = nfsd4_block_encode_layoutget,
> + .proc_layoutcommit = nfsd4_block_proc_layoutcommit,
> +};
> diff --git a/fs/nfsd/blocklayoutxdr.c b/fs/nfsd/blocklayoutxdr.c
> new file mode 100644
> index 0000000..9da89fd
> --- /dev/null
> +++ b/fs/nfsd/blocklayoutxdr.c
> @@ -0,0 +1,157 @@
> +/*
> + * Copyright (c) 2014 Christoph Hellwig.
> + */
> +#include <linux/sunrpc/svc.h>
> +#include <linux/exportfs.h>
> +#include <linux/nfs4.h>
> +
> +#include "nfsd.h"
> +#include "blocklayoutxdr.h"
> +
> +#define NFSDDBG_FACILITY NFSDDBG_PNFS
> +
> +
> +__be32
> +nfsd4_block_encode_layoutget(struct xdr_stream *xdr,
> + struct nfsd4_layoutget *lgp)
> +{
> + struct pnfs_block_extent *b = lgp->lg_content;
> + int len = sizeof(__be32) + 5 * sizeof(__be64) + sizeof(__be32);
> + __be32 *p;
> +
> + p = xdr_reserve_space(xdr, sizeof(__be32) + len);
> + if (!p)
> + return nfserr_toosmall;
> +
> + *p++ = cpu_to_be32(len);
> + *p++ = cpu_to_be32(1); /* we always return a single extent */
> +
> + p = xdr_encode_opaque_fixed(p, &b->vol_id,
> + sizeof(struct nfsd4_deviceid));
> + p = xdr_encode_hyper(p, b->foff);
> + p = xdr_encode_hyper(p, b->len);
> + p = xdr_encode_hyper(p, b->soff);
> + *p++ = cpu_to_be32(b->es);
> + return 0;
> +}
> +
> +static int
> +nfsd4_block_encode_volume(struct xdr_stream *xdr, struct pnfs_block_volume *b)
> +{
> + __be32 *p;
> + int len;
> +
> + switch (b->type) {
> + case PNFS_BLOCK_VOLUME_SIMPLE:
> + len = 4 + 4 + 8 + 4 + b->simple.sig_len;
> + p = xdr_reserve_space(xdr, len);
> + if (!p)
> + return -ETOOSMALL;
> +
> + *p++ = cpu_to_be32(b->type);
> + *p++ = cpu_to_be32(1); /* single signature */
> + p = xdr_encode_hyper(p, b->simple.offset);
> + p = xdr_encode_opaque(p, b->simple.sig, b->simple.sig_len);
> + break;
> + default:
> + return -ENOTSUPP;
> + }
> +
> + return len;
> +}
> +
> +__be32
> +nfsd4_block_encode_getdeviceinfo(struct xdr_stream *xdr,
> + struct nfsd4_getdeviceinfo *gdp)
> +{
> + struct pnfs_block_deviceaddr *dev = gdp->gd_device;
> + int len = sizeof(__be32), ret, i;
> + __be32 *p;
> +
> + p = xdr_reserve_space(xdr, len + sizeof(__be32));
> + if (!p)
> + return nfserr_resource;
> +
> + for (i = 0; i < dev->nr_volumes; i++) {
> + ret = nfsd4_block_encode_volume(xdr, &dev->volumes[i]);
> + if (ret < 0)
> + return nfserrno(ret);
> + len += ret;
> + }
> +
> + /*
> + * Fill in the overall length and number of volumes at the beginning
> + * of the layout.
> + */
> + *p++ = cpu_to_be32(len);
> + *p++ = cpu_to_be32(dev->nr_volumes);
> + return 0;
> +}
> +
> +int
> +nfsd4_block_decode_layoutupdate(__be32 *p, u32 len, struct iomap **iomapp,
> + u32 block_size)
> +{
> + struct iomap *iomaps;
> + u32 nr_iomaps, expected, i;
> +
> + if (len < sizeof(u32)) {
> + dprintk("%s: extent array too small: %u\n", __func__, len);
> + return -EINVAL;
> + }
> +
> + nr_iomaps = be32_to_cpup(p++);
> + expected = sizeof(__be32) + nr_iomaps * NFS4_BLOCK_EXTENT_SIZE;
> + if (len != expected) {
> + dprintk("%s: extent array size mismatch: %u/%u\n",
> + __func__, len, expected);
> + return -EINVAL;
> + }
> +
> + iomaps = kcalloc(nr_iomaps, sizeof(*iomaps), GFP_KERNEL);
> + if (!iomaps) {
> + dprintk("%s: failed to allocate extent array\n", __func__);
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < nr_iomaps; i++) {
> + struct pnfs_block_extent bex;
> +
> + memcpy(&bex.vol_id, p, sizeof(struct nfsd4_deviceid));
> + p += XDR_QUADLEN(sizeof(struct nfsd4_deviceid));
> +
> + p = xdr_decode_hyper(p, &bex.foff);
> + if (bex.foff & (block_size - 1)) {
> + dprintk("%s: unaligned offset %lld\n",
> + __func__, bex.foff);
> + goto fail;
> + }
> + p = xdr_decode_hyper(p, &bex.len);
> + if (bex.len & (block_size - 1)) {
> + dprintk("%s: unaligned length %lld\n",
> + __func__, bex.foff);
> + goto fail;
> + }
> + p = xdr_decode_hyper(p, &bex.soff);
> + if (bex.soff & (block_size - 1)) {
> + dprintk("%s: unaligned disk offset %lld\n",
> + __func__, bex.soff);
> + goto fail;
> + }
> + bex.es = be32_to_cpup(p++);
> + if (bex.es != PNFS_BLOCK_READWRITE_DATA) {
> + dprintk("%s: incorrect extent state %d\n",
> + __func__, bex.es);
> + goto fail;
> + }
> +
> + iomaps[i].offset = bex.foff;
> + iomaps[i].length = bex.len;
> + }
> +
> + *iomapp = iomaps;
> + return nr_iomaps;
> +fail:
> + kfree(iomaps);
> + return -EINVAL;
> +}
> diff --git a/fs/nfsd/blocklayoutxdr.h b/fs/nfsd/blocklayoutxdr.h
> new file mode 100644
> index 0000000..fdc7903
> --- /dev/null
> +++ b/fs/nfsd/blocklayoutxdr.h
> @@ -0,0 +1,62 @@
> +#ifndef _NFSD_BLOCKLAYOUTXDR_H
> +#define _NFSD_BLOCKLAYOUTXDR_H 1
> +
> +#include <linux/blkdev.h>
> +#include "xdr4.h"
> +
> +struct iomap;
> +struct xdr_stream;
> +
> +enum pnfs_block_extent_state {
> + PNFS_BLOCK_READWRITE_DATA = 0,
> + PNFS_BLOCK_READ_DATA = 1,
> + PNFS_BLOCK_INVALID_DATA = 2,
> + PNFS_BLOCK_NONE_DATA = 3,
> +};
> +
> +struct pnfs_block_extent {
> + struct nfsd4_deviceid vol_id;
> + u64 foff;
> + u64 len;
> + u64 soff;
> + enum pnfs_block_extent_state es;
> +};
> +#define NFS4_BLOCK_EXTENT_SIZE 44
> +
> +enum pnfs_block_volume_type {
> + PNFS_BLOCK_VOLUME_SIMPLE = 0,
> + PNFS_BLOCK_VOLUME_SLICE = 1,
> + PNFS_BLOCK_VOLUME_CONCAT = 2,
> + PNFS_BLOCK_VOLUME_STRIPE = 3,
> +};
> +
> +/*
> + * Random upper cap for the uuid length to avoid unbounded allocation.
> + * Not actually limited by the protocol.
> + */
> +#define PNFS_BLOCK_UUID_LEN 128
> +
> +struct pnfs_block_volume {
> + enum pnfs_block_volume_type type;
> + union {
> + struct {
> + u64 offset;
> + u32 sig_len;
> + u8 sig[PNFS_BLOCK_UUID_LEN];
> + } simple;
> + };
> +};
> +
> +struct pnfs_block_deviceaddr {
> + u32 nr_volumes;
> + struct pnfs_block_volume volumes[];
> +};
> +
> +__be32 nfsd4_block_encode_getdeviceinfo(struct xdr_stream *xdr,
> + struct nfsd4_getdeviceinfo *gdp);
> +__be32 nfsd4_block_encode_layoutget(struct xdr_stream *xdr,
> + struct nfsd4_layoutget *lgp);
> +int nfsd4_block_decode_layoutupdate(__be32 *p, u32 len, struct iomap **iomapp,
> + u32 block_size);
> +
> +#endif /* _NFSD_BLOCKLAYOUTXDR_H */
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index bb91981..8353b7a 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -26,6 +26,7 @@ static struct nfsd4_callback_ops nfsd4_cb_layout_ops;
> static const struct lock_manager_operations nfsd4_layouts_lm_ops;
>
> const struct nfsd4_layout_ops *nfsd4_layout_ops[LAYOUT_TYPE_MAX] = {
> + [LAYOUT_BLOCK_VOLUME] = &bl_layout_ops,
> };
>
> /* pNFS device ID to export fsid mapping */
> @@ -116,6 +117,12 @@ nfsd4_set_deviceid(struct nfsd4_deviceid *id, const struct svc_fh *fhp,
>
> void nfsd4_setup_layout_type(struct svc_export *exp)
> {
> + struct super_block *sb = exp->ex_path.mnt->mnt_sb;
> +
> + if (sb->s_export_op->get_uuid &&
> + sb->s_export_op->map_blocks &&
> + sb->s_export_op->commit_blocks)
> + exp->ex_layout_type = LAYOUT_BLOCK_VOLUME;
> }
>
> static void
> diff --git a/fs/nfsd/pnfs.h b/fs/nfsd/pnfs.h
> index fa37117..d6d94e1 100644
> --- a/fs/nfsd/pnfs.h
> +++ b/fs/nfsd/pnfs.h
> @@ -34,6 +34,7 @@ struct nfsd4_layout_ops {
> };
>
> extern const struct nfsd4_layout_ops *nfsd4_layout_ops[];
> +extern const struct nfsd4_layout_ops bl_layout_ops;
>
> __be32 nfsd4_preprocess_layout_stateid(struct svc_rqst *rqstp,
> struct nfsd4_compound_state *cstate, stateid_t *stateid,
> --
> 1.9.1
>
On Mon, Jan 12, 2015 at 01:14:19AM -0500, Tom Haynes wrote:
>
> > +file system must sit on shared storage (typically iSCSI) that is accessible
> > +to the clients as well as the server. The file system needs to either sit
> > +directly on the exported volume, or on a RAID 0 using the MD software RAID
>
> a RAID 0 what?
I don't quite understand that comment, but I'll have to revise that
text anyway, as the RAID 0/1 support isn't part of this submission yet;
it needs a little more work and involves two more subsystems.
For those who are curious about the md support, it is available at:
git://git.infradead.org/users/hch/pnfs.git pnfsd-block-md-support
> What reclaims bex in both error cases??
>
> The call flow seems to be:
>
> nfsd4_proc_compound -> nfsd4_layoutget -> nfsd4_block_proc_layoutget
>
> lg_content gets freed in nfsd4_encode_layoutget() in all paths.
>
> nfsd4_encode_operation() calls nfsd4_encode_layoutget().
>
> But nfsd4_encode_layoutget() is not called in all paths:
>
> p = xdr_reserve_space(xdr, 8);
> if (!p) {
> WARN_ON_ONCE(1);
> return; // leak
> }
> ...
> if (op->opnum == OP_ILLEGAL)
> goto status; // Not really a leak, if we hit this, bigger issues apply.
>
> So bex is correctly accounted for, but in general
> nfsd4_encode_operation() can leak any operation
> specific memory.
I guess we need to fix this properly in the nfsd core eventually, for
example by adding a new method, called on both successful and error
completions, that can free all resources.
The code is already ready for it, and the pnfs layout commit code expects
to be able to pass an argument larger than 32 bits.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_iomap.c | 2 +-
fs/xfs/xfs_iomap.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c980e2a..ccb1dd0 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -802,7 +802,7 @@ int
xfs_iomap_write_unwritten(
xfs_inode_t *ip,
xfs_off_t offset,
- size_t count)
+ xfs_off_t count)
{
xfs_mount_t *mp = ip->i_mount;
xfs_fileoff_t offset_fsb;
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 411fbb8..8688e66 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -27,6 +27,6 @@ int xfs_iomap_write_delay(struct xfs_inode *, xfs_off_t, size_t,
struct xfs_bmbt_irec *);
int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t,
struct xfs_bmbt_irec *);
-int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, size_t);
+int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
#endif /* __XFS_IOMAP_H__*/
--
1.9.1
Currently xfs_bmapi_write always allocates blocks when it encounters a
hole. But for unwritten extent conversions we do not have the proper
transaction reservations to do that, and should error out instead.
Currently this doesn't matter too much because the writeback path
ensures that all blocks are properly allocated, but the pNFS block
server code will accept unwritten extent conversions from clients,
and in case of recovery from a crashed server we might get conversion
requests for blocks whose allocation transaction hasn't made it to
disk before the crash. Also in general it is a good idea to be
defensive here, especially for client-initiated requests.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 15 +++++++++++++++
fs/xfs/xfs_iomap.c | 17 ++++++-----------
2 files changed, 21 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b5eb474..be08671 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4580,6 +4580,20 @@ xfs_bmapi_write(
* that we found, if any.
*/
if (inhole || wasdelay) {
+ if ((flags & (XFS_BMAPI_CONVERT|XFS_BMAPI_PREALLOC)) ==
+ XFS_BMAPI_CONVERT) {
+ xfs_filblks_t count;
+
+ if (eof)
+ bma.got.br_startoff = end;
+
+ count = XFS_FILBLKS_MIN(len,
+ bma.got.br_startoff - bno);
+ bno += count;
+ len -= count;
+ goto next;
+ }
+
bma.eof = eof;
bma.conv = !!(flags & XFS_BMAPI_CONVERT);
bma.wasdel = wasdelay;
@@ -4621,6 +4635,7 @@ xfs_bmapi_write(
/* update the extent map to return */
xfs_bmapi_update_map(&mval, &bno, &len, obno, end, &n, flags);
+next:
/*
* If we're done, stop now. Stop when we've allocated
* XFS_BMAP_MAX_NMAP extents no matter what. Otherwise
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ccb1dd0..4b139f2 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -807,7 +807,6 @@ xfs_iomap_write_unwritten(
xfs_mount_t *mp = ip->i_mount;
xfs_fileoff_t offset_fsb;
xfs_filblks_t count_fsb;
- xfs_filblks_t numblks_fsb;
xfs_fsblock_t firstfsb;
int nimaps;
xfs_trans_t *tp;
@@ -896,19 +895,15 @@ xfs_iomap_write_unwritten(
if (error)
return error;
+ if (!nimaps)
+ break;
+
if (!(imap.br_startblock || XFS_IS_REALTIME_INODE(ip)))
return xfs_alert_fsblock_zero(ip, &imap);
- if ((numblks_fsb = imap.br_blockcount) == 0) {
- /*
- * The numblks_fsb value should always get
- * smaller, otherwise the loop is stuck.
- */
- ASSERT(imap.br_blockcount);
- break;
- }
- offset_fsb += numblks_fsb;
- count_fsb -= numblks_fsb;
+ ASSERT(imap.br_blockcount);
+ offset_fsb += imap.br_blockcount;
+ count_fsb -= imap.br_blockcount;
} while (count_fsb > 0);
return 0;
--
1.9.1
On Tue, Jan 06, 2015 at 05:28:39PM +0100, Christoph Hellwig wrote:
> Current xfs_bmapi_write always allocates blocks when it encounters a
> hole. But for unwritten extent conversions we do not have the proper
> transaction reservations to do that, and should error out instead.
>
> Currently this doesn't matter too much because the writeback path
> ensures that all blocks are properly allocated, but the pNFS block
> server code will accept unwritten extent conversions from clients,
> and in case of recovery from a crashed server we might get conversion
> requests for blocks whose allocation transaction hasn't made it to
> disk before the crash. Also in general it is a good idea to be
> defensive here, especially for client initiated requests.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 15 +++++++++++++++
> fs/xfs/xfs_iomap.c | 17 ++++++-----------
> 2 files changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index b5eb474..be08671 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4580,6 +4580,20 @@ xfs_bmapi_write(
> * that we found, if any.
> */
> if (inhole || wasdelay) {
> + if ((flags & (XFS_BMAPI_CONVERT|XFS_BMAPI_PREALLOC)) ==
> + XFS_BMAPI_CONVERT) {
> + xfs_filblks_t count;
> +
> + if (eof)
> + bma.got.br_startoff = end;
> +
> + count = XFS_FILBLKS_MIN(len,
> + bma.got.br_startoff - bno);
> + bno += count;
> + len -= count;
> + goto next;
> + }
Please add a comment to the code explaining why this check is needed.
Cheers,
Dave.
--
Dave Chinner
[email protected]
Add operations to export pNFS block layouts from an XFS filesystem. See
the previous commit adding the operations for an explanation of them.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/xfs/Makefile | 1 +
fs/xfs/xfs_export.c | 6 ++
fs/xfs/xfs_fsops.c | 2 +
fs/xfs/xfs_iops.c | 2 +-
fs/xfs/xfs_iops.h | 1 +
fs/xfs/xfs_mount.h | 2 +
fs/xfs/xfs_pnfs.c | 270 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_pnfs.h | 11 +++
8 files changed, 294 insertions(+), 1 deletion(-)
create mode 100644 fs/xfs/xfs_pnfs.c
create mode 100644 fs/xfs/xfs_pnfs.h
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d617999..df68285 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -121,3 +121,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL) += xfs_acl.o
xfs-$(CONFIG_PROC_FS) += xfs_stats.o
xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o
xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o
+xfs-$(CONFIG_NFSD_PNFS) += xfs_pnfs.o
diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
index 5eb4a14..b97359b 100644
--- a/fs/xfs/xfs_export.c
+++ b/fs/xfs/xfs_export.c
@@ -30,6 +30,7 @@
#include "xfs_trace.h"
#include "xfs_icache.h"
#include "xfs_log.h"
+#include "xfs_pnfs.h"
/*
* Note that we only accept fileids which are long enough rather than allow
@@ -245,4 +246,9 @@ const struct export_operations xfs_export_operations = {
.fh_to_parent = xfs_fs_fh_to_parent,
.get_parent = xfs_fs_get_parent,
.commit_metadata = xfs_fs_nfs_commit_metadata,
+#ifdef CONFIG_NFSD_PNFS
+ .get_uuid = xfs_fs_get_uuid,
+ .map_blocks = xfs_fs_map_blocks,
+ .commit_blocks = xfs_fs_commit_blocks,
+#endif
};
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index fdc6422..2b86be8 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -601,6 +601,8 @@ xfs_growfs_data(
if (!mutex_trylock(&mp->m_growlock))
return -EWOULDBLOCK;
error = xfs_growfs_data_private(mp, in);
+ if (!error)
+ mp->m_generation++;
mutex_unlock(&mp->m_growlock);
return error;
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index c50311c..6ff84e8 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -496,7 +496,7 @@ xfs_setattr_mode(
inode->i_mode |= mode & ~S_IFMT;
}
-static void
+void
xfs_setattr_time(
struct xfs_inode *ip,
struct iattr *iattr)
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 1c34e43..ea7a98e 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -32,6 +32,7 @@ extern void xfs_setup_inode(struct xfs_inode *);
*/
#define XFS_ATTR_NOACL 0x01 /* Don't call posix_acl_chmod */
+extern void xfs_setattr_time(struct xfs_inode *ip, struct iattr *iattr);
extern int xfs_setattr_nonsize(struct xfs_inode *ip, struct iattr *vap,
int flags);
extern int xfs_setattr_size(struct xfs_inode *ip, struct iattr *vap);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 22ccf69..aba26c8 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -175,6 +175,8 @@ typedef struct xfs_mount {
struct workqueue_struct *m_reclaim_workqueue;
struct workqueue_struct *m_log_workqueue;
struct workqueue_struct *m_eofblocks_workqueue;
+
+ __uint32_t m_generation; /* incremented on each growfs */
} xfs_mount_t;
/*
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
new file mode 100644
index 0000000..d95f596
--- /dev/null
+++ b/fs/xfs/xfs_pnfs.c
@@ -0,0 +1,270 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include "xfs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_log.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_error.h"
+#include "xfs_iomap.h"
+#include "xfs_shared.h"
+#include "xfs_pnfs.h"
+
+int
+xfs_fs_get_uuid(
+ struct super_block *sb,
+ u8 *buf,
+ u32 *len,
+ u64 *offset)
+{
+ struct xfs_mount *mp = XFS_M(sb);
+
+ if (*len < sizeof(uuid_t))
+ return -EINVAL;
+
+ memcpy(buf, &mp->m_sb.sb_uuid, sizeof(uuid_t));
+ *len = sizeof(uuid_t);
+ *offset = offsetof(struct xfs_dsb, sb_uuid);
+ return 0;
+}
+
+static void
+xfs_map_iomap(
+ struct xfs_inode *ip,
+ struct iomap *iomap,
+ struct xfs_bmbt_irec *imap,
+ xfs_off_t offset)
+{
+ struct xfs_mount *mp = ip->i_mount;
+
+ iomap->blkno = -1;
+ if (imap->br_startblock == HOLESTARTBLOCK)
+ iomap->type = IOMAP_HOLE;
+ else if (imap->br_startblock == DELAYSTARTBLOCK)
+ iomap->type = IOMAP_DELALLOC;
+ else {
+ /*
+ * the block number in the iomap must match the start offset we
+ * place in the iomap.
+ */
+ iomap->blkno = xfs_fsb_to_db(ip, imap->br_startblock);
+ ASSERT(iomap->blkno || XFS_IS_REALTIME_INODE(ip));
+ if (imap->br_state == XFS_EXT_UNWRITTEN)
+ iomap->type = IOMAP_UNWRITTEN;
+ else
+ iomap->type = IOMAP_MAPPED;
+ }
+ iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
+ iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
+}
+
+static int
+xfs_fs_update_flags(
+ struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_trans *tp;
+ int error;
+
+ /*
+ * Update the mode, and prealloc flag bits.
+ */
+ tp = xfs_trans_alloc(mp, XFS_TRANS_WRITEID);
+ error = xfs_trans_reserve(tp, &M_RES(mp)->tr_writeid, 0, 0);
+ if (error) {
+ xfs_trans_cancel(tp, 0);
+ return error;
+ }
+
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+ ip->i_d.di_mode &= ~S_ISUID;
+ if (ip->i_d.di_mode & S_IXGRP)
+ ip->i_d.di_mode &= ~S_ISGID;
+
+ ip->i_d.di_flags |= XFS_DIFLAG_PREALLOC;
+
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ return xfs_trans_commit(tp, 0);
+}
+
+/*
+ * Get a layout for the pNFS client.
+ *
+ * Note that in the allocation case we do force out the transaction here.
+ * There is no metadata update that is required to be stable for NFS
+ * semantics, and layouts are not valid over a server crash. Instead
+ * we'll have to be careful in the commit routine as it might pass us
+ * blocks for an allocation that never made it to disk in the recovery
+ * case.
+ */
+int
+xfs_fs_map_blocks(
+ struct inode *inode,
+ loff_t offset,
+ u64 length,
+ struct iomap *iomap,
+ bool write,
+ u32 *device_generation)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_bmbt_irec imap;
+ xfs_fileoff_t offset_fsb, end_fsb;
+ loff_t limit;
+ int bmapi_flags = XFS_BMAPI_ENTIRE;
+ int nimaps = 1;
+ uint lock_flags;
+ int error = 0;
+
+ if (XFS_FORCED_SHUTDOWN(mp))
+ return -EIO;
+ if (XFS_IS_REALTIME_INODE(ip))
+ return -ENXIO;
+
+ xfs_ilock(ip, XFS_IOLOCK_EXCL);
+ if (!write) {
+ limit = max(round_up(i_size_read(inode),
+ inode->i_sb->s_blocksize),
+ mp->m_super->s_maxbytes);
+ } else {
+ limit = mp->m_super->s_maxbytes;
+ }
+
+ error = -EINVAL;
+ if (offset > limit)
+ goto out_unlock;
+ if (offset + length > mp->m_super->s_maxbytes)
+ length = limit - offset;
+
+ /*
+ * Flush data and truncate the pagecache. pNFS block clients just
+ * like direct I/O access the disk directly.
+ */
+ error = filemap_write_and_wait(inode->i_mapping);
+ if (error)
+ goto out_unlock;
+ invalidate_inode_pages2(inode->i_mapping);
+
+ end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + length);
+ offset_fsb = XFS_B_TO_FSBT(mp, offset);
+
+ lock_flags = xfs_ilock_data_map_shared(ip);
+ error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+ &imap, &nimaps, bmapi_flags);
+ xfs_iunlock(ip, lock_flags);
+
+ if (error)
+ goto out_unlock;
+
+ if (write) {
+ if (!nimaps || imap.br_startblock == HOLESTARTBLOCK) {
+ error = xfs_iomap_write_direct(ip, offset, length,
+ &imap, nimaps);
+ if (error)
+ goto out_unlock;
+ }
+
+ error = xfs_fs_update_flags(ip);
+ if (error)
+ goto out_unlock;
+ }
+ xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+
+ xfs_map_iomap(ip, iomap, &imap, offset);
+ *device_generation = mp->m_generation;
+ return error;
+out_unlock:
+ xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ return error;
+}
+
+/*
+ * Make sure the blocks described by maps are stable on disk. This includes
+ * converting any unwritten extents, flushing the disk cache and updating the
+ * time stamps.
+ *
+ * Note that we rely on the caller to always send us a timestamp update so that
+ * we always commit a transaction here. If that stops being true we will have
+ * to manually flush the cache here similar to what the fsync code path does
+ * for datasyncs on files that have no dirty metadata.
+ *
+ * In the reclaim case we might get called for blocks that were only allocated
+ * in memory and not on disk. We rely on the fact that unwritten extent
+ * conversions handle this properly.
+ */
+int
+xfs_fs_commit_blocks(
+ struct inode *inode,
+ struct iomap *maps,
+ int nr_maps,
+ struct iattr *iattr)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_trans *tp;
+ int error, i;
+ loff_t size;
+
+ xfs_ilock(ip, XFS_IOLOCK_EXCL);
+
+ size = i_size_read(inode);
+ if ((iattr->ia_valid & ATTR_SIZE) && iattr->ia_size > size)
+ size = iattr->ia_size;
+
+ for (i = 0; i < nr_maps; i++) {
+ u64 start, length, end;
+
+ start = maps[i].offset;
+ if (start > size)
+ continue;
+
+ end = start + maps[i].length;
+ if (end > size)
+ end = size;
+
+ length = end - start;
+ if (!length)
+ continue;
+
+ error = xfs_iomap_write_unwritten(ip, start, length);
+ if (error)
+ goto out_drop_iolock;
+ }
+
+ /*
+ * Make sure reads through the pagecache see the new data.
+ */
+ invalidate_inode_pages2(inode->i_mapping);
+
+ tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_NOT_SIZE);
+ error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ichange, 0, 0);
+ if (error)
+ goto out_drop_iolock;
+
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+ xfs_setattr_time(ip, iattr);
+ if (iattr->ia_valid & ATTR_SIZE) {
+ if (iattr->ia_size > i_size_read(inode)) {
+ i_size_write(inode, iattr->ia_size);
+ ip->i_d.di_size = iattr->ia_size;
+ }
+ }
+
+ xfs_trans_set_sync(tp);
+ error = xfs_trans_commit(tp, 0);
+
+out_drop_iolock:
+ xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ return error;
+}
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
new file mode 100644
index 0000000..0d91255
--- /dev/null
+++ b/fs/xfs/xfs_pnfs.h
@@ -0,0 +1,11 @@
+#ifndef _XFS_PNFS_H
+#define _XFS_PNFS_H 1
+
+#ifdef CONFIG_NFSD_PNFS
+int xfs_fs_get_uuid(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
+int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
+ struct iomap *iomap, bool write, u32 *device_generation);
+int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
+ struct iattr *iattr);
+#endif /* CONFIG_NFSD_PNFS */
+#endif /* _XFS_PNFS_H */
--
1.9.1
On Tue, Jan 06, 2015 at 05:28:40PM +0100, Christoph Hellwig wrote:
> Add operations to export pNFS block layouts from an XFS filesystem. See
> the previous commit adding the operations for an explanation of them.
.....
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index fdc6422..2b86be8 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -601,6 +601,8 @@ xfs_growfs_data(
> if (!mutex_trylock(&mp->m_growlock))
> return -EWOULDBLOCK;
> error = xfs_growfs_data_private(mp, in);
> + if (!error)
> + mp->m_generation++;
> mutex_unlock(&mp->m_growlock);
> return error;
> }
I couldn't find an explanation of what this generation number is
for. What are its semantics w.r.t. server crashes?
> +xfs_fs_get_uuid(
> + struct super_block *sb,
> + u8 *buf,
> + u32 *len,
> + u64 *offset)
> +{
> + struct xfs_mount *mp = XFS_M(sb);
> +
> + if (*len < sizeof(uuid_t))
> + return -EINVAL;
> +
> + memcpy(buf, &mp->m_sb.sb_uuid, sizeof(uuid_t));
uuid_copy()?
> + *len = sizeof(uuid_t);
> + *offset = offsetof(struct xfs_dsb, sb_uuid);
> + return 0;
> +}
> +
> +static void
> +xfs_map_iomap(
> + struct xfs_inode *ip,
> + struct iomap *iomap,
> + struct xfs_bmbt_irec *imap,
> + xfs_off_t offset)
xfs_bmbt_to_iomap()?
> +{
> + struct xfs_mount *mp = ip->i_mount;
> +
> + iomap->blkno = -1;
> + if (imap->br_startblock == HOLESTARTBLOCK)
> + iomap->type = IOMAP_HOLE;
> + else if (imap->br_startblock == DELAYSTARTBLOCK)
> + iomap->type = IOMAP_DELALLOC;
> + else {
> + /*
> + * the block number in the iomap must match the start offset we
> + * place in the iomap.
> + */
> + iomap->blkno = xfs_fsb_to_db(ip, imap->br_startblock);
> + ASSERT(iomap->blkno || XFS_IS_REALTIME_INODE(ip));
> + if (imap->br_state == XFS_EXT_UNWRITTEN)
> + iomap->type = IOMAP_UNWRITTEN;
> + else
> + iomap->type = IOMAP_MAPPED;
> + }
> + iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
> + iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
> +}
Why does this function get passed an offset that is not actually used?
> +static int
> +xfs_fs_update_flags(
> + struct xfs_inode *ip)
> +{
> + struct xfs_mount *mp = ip->i_mount;
> + struct xfs_trans *tp;
> + int error;
> +
> + /*
> + * Update the mode, and prealloc flag bits.
> + */
> + tp = xfs_trans_alloc(mp, XFS_TRANS_WRITEID);
> + error = xfs_trans_reserve(tp, &M_RES(mp)->tr_writeid, 0, 0);
> + if (error) {
> + xfs_trans_cancel(tp, 0);
> + return error;
> + }
> +
> + xfs_ilock(ip, XFS_ILOCK_EXCL);
> + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> + ip->i_d.di_mode &= ~S_ISUID;
> + if (ip->i_d.di_mode & S_IXGRP)
> + ip->i_d.di_mode &= ~S_ISGID;
> +
> + ip->i_d.di_flags |= XFS_DIFLAG_PREALLOC;
> +
> + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> + return xfs_trans_commit(tp, 0);
> +}
That needs timestamp changes as well. i.e.:
xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
and at that point, it's basically the same code as in
xfs_file_fallocate() and xfs_ioc_space(), so should probably be
factored into a common operation.
> +
> +/*
> + * Get a layout for the pNFS client.
> + *
> + * Note that in the allocation case we do force out the transaction here.
> + * There is no metadata update that is required to be stable for NFS
> + * semantics, and layouts are not valid over a server crash. Instead
> + * we'll have to be careful in the commit routine as it might pass us
> + * blocks for an allocation that never made it to disk in the recovery
> + * case.
I think you are saying that because block allocation is an async
transaction, then we have to deal with the possibility that we crash
before the transaction hits the disk.
How often do we have to allocate
new blocks like this? Do we need to use async transactions for this
case, or should we simply do the brute force thing (by making the
allocation transaction synchronous) initially and then, if
performance problems arise, optimise from there?
> + */
> +int
> +xfs_fs_map_blocks(
> + struct inode *inode,
> + loff_t offset,
> + u64 length,
> + struct iomap *iomap,
> + bool write,
> + u32 *device_generation)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> + struct xfs_mount *mp = ip->i_mount;
> + struct xfs_bmbt_irec imap;
> + xfs_fileoff_t offset_fsb, end_fsb;
> + loff_t limit;
> + int bmapi_flags = XFS_BMAPI_ENTIRE;
> + int nimaps = 1;
> + uint lock_flags;
> + int error = 0;
> +
> + if (XFS_FORCED_SHUTDOWN(mp))
> + return -EIO;
> + if (XFS_IS_REALTIME_INODE(ip))
> + return -ENXIO;
> +
> + xfs_ilock(ip, XFS_IOLOCK_EXCL);
Why are we locking out IO just to read the block map (needs a
comment)?
> + if (!write) {
> + limit = max(round_up(i_size_read(inode),
> + inode->i_sb->s_blocksize),
> + mp->m_super->s_maxbytes);
> + } else {
> + limit = mp->m_super->s_maxbytes;
> + }
limit = mp->m_super->s_maxbytes;
if (!write)
limit = max(limit, round_up(i_size_read(inode),
inode->i_sb->s_blocksize));
> +
> + error = -EINVAL;
> + if (offset > limit)
> + goto out_unlock;
> + if (offset + length > mp->m_super->s_maxbytes)
> + length = limit - offset;
Need to catch a wrap through zero...
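For what it's worth, a check of that form can be written so the clamp itself cannot overflow. Userspace sketch below, not the actual XFS code; the helper name and error policy are made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the range validation in xfs_fs_map_blocks().
 * "offset + length" can wrap through zero for huge lengths, so the
 * clamp compares against "limit - offset" and avoids the addition.
 */
static bool validate_map_range(uint64_t offset, uint64_t *length,
			       uint64_t limit)
{
	if (offset > limit)
		return false;		/* would be -EINVAL in the kernel */
	if (*length > limit - offset)	/* cannot overflow: offset <= limit */
		*length = limit - offset;
	return *length > 0;
}
```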
> + /*
> + * Flush data and truncate the pagecache. pNFS block clients just
> + * like direct I/O access the disk directly.
> + */
> + error = filemap_write_and_wait(inode->i_mapping);
> + if (error)
> + goto out_unlock;
> + invalidate_inode_pages2(inode->i_mapping);
invalidate_inode_pages2() can fail with EBUSY....
> +
> + end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + length);
> + offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +
> + lock_flags = xfs_ilock_data_map_shared(ip);
> + error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> + &imap, &nimaps, bmapi_flags);
> + xfs_iunlock(ip, lock_flags);
> +
> + if (error)
> + goto out_unlock;
> +
> + if (write) {
ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> + if (!nimaps || imap.br_startblock == HOLESTARTBLOCK) {
> + error = xfs_iomap_write_direct(ip, offset, length,
> + &imap, nimaps);
> + if (error)
> + goto out_unlock;
> + }
> +
> + error = xfs_fs_update_flags(ip);
> + if (error)
> + goto out_unlock;
> + }
> + xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> +
> + xfs_map_iomap(ip, iomap, &imap, offset);
> + *device_generation = mp->m_generation;
So whenever the server first starts up the generation number in a
map is going to be zero - what purpose does this actually serve?
> + return error;
> +out_unlock:
> + xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> + return error;
> +}
> +
> +/*
> + * Make sure the blocks described by maps are stable on disk. This includes
> + * converting any unwritten extents, flushing the disk cache and updating the
> + * time stamps.
> + *
> + * Note that we rely on the caller to always send us a timestamp update so that
> + * we always commit a transaction here. If that stops being true we will have
> + * to manually flush the cache here similar to what the fsync code path does
> + * for datasyncs on files that have no dirty metadata.
Needs an assert.
> + *
> + * In the reclaim case we might get called for blocks that were only allocated
> + * in memory and not on disk. We rely on the fact that unwritten extent
> + * conversions handle this properly.
> + */
Making allocation transactions synchronous as well will make this
wart go away.
> +int
> +xfs_fs_commit_blocks(
> + struct inode *inode,
> + struct iomap *maps,
> + int nr_maps,
> + struct iattr *iattr)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> + struct xfs_mount *mp = ip->i_mount;
> + struct xfs_trans *tp;
> + int error, i;
> + loff_t size;
> +
> + xfs_ilock(ip, XFS_IOLOCK_EXCL);
> +
> + size = i_size_read(inode);
> + if ((iattr->ia_valid & ATTR_SIZE) && iattr->ia_size > size)
> + size = iattr->ia_size;
> +
> + for (i = 0; i < nr_maps; i++) {
> + u64 start, length, end;
> +
> + start = maps[i].offset;
> + if (start > size)
> + continue;
> +
> + end = start + maps[i].length;
> + if (end > size)
> + end = size;
> +
> + length = end - start;
> + if (!length)
> + continue;
> +
> + error = xfs_iomap_write_unwritten(ip, start, length);
> + if (error)
> + goto out_drop_iolock;
> + }
> +
> + /*
> + * Make sure reads through the pagecache see the new data.
> + */
> + invalidate_inode_pages2(inode->i_mapping);
Probably should do that first. Also, what happens if there is local
dirty data on the file at this point? Doesn't this just toss them
away?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Jan 07, 2015 at 11:24:34AM +1100, Dave Chinner wrote:
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index fdc6422..2b86be8 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -601,6 +601,8 @@ xfs_growfs_data(
> > if (!mutex_trylock(&mp->m_growlock))
> > return -EWOULDBLOCK;
> > error = xfs_growfs_data_private(mp, in);
> > + if (!error)
> > + mp->m_generation++;
> > mutex_unlock(&mp->m_growlock);
> > return error;
> > }
>
> I couldn't find an explanation of what this generation number is
> for. What are its semantics w.r.t. server crashes?
The generation is incremented when we grow the filesystem, so that
a new layout (block mapping) returned to the client refers to the
new NFS device ID, which will make the client aware of the new size.
The device IDs aren't persistent, so after a server crash / reboot
we'll start at zero again.
I'll add comments explaining this to the code.
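To make the interplay concrete, here is a userspace toy model (the struct layout and names are illustrative, not the actual nfsd wire format) of how embedding the volatile generation in the otherwise opaque device ID lets the server reject stale IDs after a grow, and also why a reboot that restarts at zero can make an old generation-0 ID look valid again:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative device ID: an fsid plus the volatile per-mount
 * generation.  The client treats the whole thing as opaque and
 * simply echoes it back to the server.
 */
struct pnfs_deviceid {
	uint64_t fsid;
	uint32_t generation;
	uint32_t pad;
};

struct mount_state {
	uint64_t fsid;
	uint32_t generation;	/* reset to 0 on every server boot */
};

static void make_deviceid(const struct mount_state *mp,
			  struct pnfs_deviceid *id)
{
	memset(id, 0, sizeof(*id));
	id->fsid = mp->fsid;
	id->generation = mp->generation;
}

/* Server-side check: a stale generation forces the client to refetch. */
static bool deviceid_valid(const struct mount_state *mp,
			   const struct pnfs_deviceid *id)
{
	return id->fsid == mp->fsid && id->generation == mp->generation;
}
```

The last assertion in the usage below illustrates the collision window after a crash, which is what a random initial generation would close.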
> Why does this function get passed an offset that is not actually used?
Historic reasons..
> > +static int
> > +xfs_fs_update_flags(
> > + struct xfs_inode *ip)
> > +{
> > + struct xfs_mount *mp = ip->i_mount;
> > + struct xfs_trans *tp;
> > + int error;
> > +
> > + /*
> > + * Update the mode, and prealloc flag bits.
> > + */
> > + tp = xfs_trans_alloc(mp, XFS_TRANS_WRITEID);
> > + error = xfs_trans_reserve(tp, &M_RES(mp)->tr_writeid, 0, 0);
> > + if (error) {
> > + xfs_trans_cancel(tp, 0);
> > + return error;
> > + }
> > +
> > + xfs_ilock(ip, XFS_ILOCK_EXCL);
> > + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > + ip->i_d.di_mode &= ~S_ISUID;
> > + if (ip->i_d.di_mode & S_IXGRP)
> > + ip->i_d.di_mode &= ~S_ISGID;
> > +
> > + ip->i_d.di_flags |= XFS_DIFLAG_PREALLOC;
> > +
> > + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > + return xfs_trans_commit(tp, 0);
> > +}
>
> That needs timestamp changes as well. i.e.:
>
> xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
The time stamps are only updated when we actually commit the data.
Updating them here might be harmless, but I'll have to dig into the
protocol specification and tests a bit more to check if doing the
additional timestamp update would be harmless.
> > +
> > +/*
> > + * Get a layout for the pNFS client.
> > + *
> > + * Note that in the allocation case we do force out the transaction here.
> > + * There is no metadata update that is required to be stable for NFS
> > + * semantics, and layouts are not valid over a server crash. Instead
> > + * we'll have to be careful in the commit routine as it might pass us
> > + * blocks for an allocation that never made it to disk in the recovery
> > + * case.
>
> I think you are saying that because block allocation is an async
> transaction, then we have to deal with the possibility that we crash
> before the transaction hits the disk.
>
> How often do we have to allocate
> new blocks like this? Do we need to use async transactions for this
> case, or should we simply do the brute force thing (by making the
> allocation transaction synchronous) initially and then, if
> performance problems arise, optimise from there?
Every block allocation from a pNFS client goes through this path, so
yes it is performance critical.
> > + xfs_map_iomap(ip, iomap, &imap, offset);
> > + *device_generation = mp->m_generation;
>
> So whenever the server first starts up the generation number in a
> map is going to be zero - what purpose does this actually serve?
So that we can communicate to the client that a device was grown; the
client in this case needs to re-read the device information.
> > + if (!length)
> > + continue;
> > +
> > + error = xfs_iomap_write_unwritten(ip, start, length);
> > + if (error)
> > + goto out_drop_iolock;
> > + }
> > +
> > + /*
> > + * Make sure reads through the pagecache see the new data.
> > + */
> > + invalidate_inode_pages2(inode->i_mapping);
>
> Probably should do that first. Also, what happens if there is local
> dirty data on the file at this point? Doesn't this just toss them
> away?
If there was local data it will be tossed. For regular writes this can't
happen because we recall outstanding layouts in the write path. For
mmap we for now ignore this problem, as a pNFS server should generally
not be used locally.
On Wed, Jan 07, 2015 at 11:40:10AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 07, 2015 at 11:24:34AM +1100, Dave Chinner wrote:
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index fdc6422..2b86be8 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -601,6 +601,8 @@ xfs_growfs_data(
> > > if (!mutex_trylock(&mp->m_growlock))
> > > return -EWOULDBLOCK;
> > > error = xfs_growfs_data_private(mp, in);
> > > + if (!error)
> > > + mp->m_generation++;
> > > mutex_unlock(&mp->m_growlock);
> > > return error;
> > > }
> >
> > I couldn't find an explanation of what this generation number is
> > for. What are its semantics w.r.t. server crashes?
>
> The generation is incremented when we grow the filesystem, so that
> a new layout (block mapping) returned to the client refers to the
> new NFS device ID, which will make the client aware of the new size.
>
> The device IDs aren't persistent, so after a server crash / reboot
> we'll start at zero again.
So what happens if a grow occurs, then the server crashes, and the
client on reboot sees the same generation as before the grow
occurred?
Perhaps it would be better to just initialise the generation with a
random number?
> I'll add comments explaining this to the code.
>
> > Why does this function get passed an offset that is not actually used?
>
> Historic reasons..
>
> > > +static int
> > > +xfs_fs_update_flags(
> > > + struct xfs_inode *ip)
> > > +{
> > > + struct xfs_mount *mp = ip->i_mount;
> > > + struct xfs_trans *tp;
> > > + int error;
> > > +
> > > + /*
> > > + * Update the mode, and prealloc flag bits.
> > > + */
> > > + tp = xfs_trans_alloc(mp, XFS_TRANS_WRITEID);
> > > + error = xfs_trans_reserve(tp, &M_RES(mp)->tr_writeid, 0, 0);
> > > + if (error) {
> > > + xfs_trans_cancel(tp, 0);
> > > + return error;
> > > + }
> > > +
> > > + xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > + ip->i_d.di_mode &= ~S_ISUID;
> > > + if (ip->i_d.di_mode & S_IXGRP)
> > > + ip->i_d.di_mode &= ~S_ISGID;
> > > +
> > > + ip->i_d.di_flags |= XFS_DIFLAG_PREALLOC;
> > > +
> > > + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > + return xfs_trans_commit(tp, 0);
> > > +}
> >
> > That needs timestamp changes as well. i.e.:
> >
> > xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
>
> The time stamps are only updated when we actually commit the data.
> Updating them here might be harmless, but I'll have to dig into the
> protocol specification and tests a bit more to check if doing the
> additional timestamp update would be harmless.
>
> > > +
> > > +/*
> > > + * Get a layout for the pNFS client.
> > > + *
> > > + * Note that in the allocation case we do force out the transaction here.
> > > + * There is no metadata update that is required to be stable for NFS
> > > + * semantics, and layouts are not valid over a server crash. Instead
> > > + * we'll have to be careful in the commit routine as it might pass us
> > > + * blocks for an allocation that never made it to disk in the recovery
> > > + * case.
> >
> > I think you are saying that because block allocation is an async
> > transaction, then we have to deal with the possibility that we crash
> > before the transaction hits the disk.
> >
> > How often do we have to allocate
> > new blocks like this? Do we need to use async transactions for this
> > case, or should we simply do the brute force thing (by making the
> > allocation transaction synchronous) initially and then, if
> > performance problems arise, optimise from there?
>
> Every block allocation from a pNFS client goes through this path, so
> yes it is performance critical.
Sure, but how many allocations per second are we expecting to have
to support? We can do tens of thousands of synchronous transactions
per second on luns with non-volatile write caches, so I'm really
wondering how much of a limitation this is going to be in the real
world. Do you have any numbers?
> > So whenever the server first starts up the generation number in a
> > map is going to be zero - what purpose does this actually serve?
>
> So that we can communicate to the client that a device was grown; the
> client in this case needs to re-read the device information.
Why does it need to reread the device information? The layouts that
are handed to it are still going to be valid from the server POV...
> > > + /*
> > > + * Make sure reads through the pagecache see the new data.
> > > + */
> > > + invalidate_inode_pages2(inode->i_mapping);
> >
> > Probably should do that first. Also, what happens if there is local
> > dirty data on the file at this point? Doesn't this just toss them
> > away?
>
> If there was local data it will be tossed. For regular writes this can't
> happen because we recall outstanding layouts in the write path. For
> mmap we for now ignore this problem, as a pNFS server should generally
> not be used locally.
Comments, please. ;)
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Jan 08, 2015 at 08:11:40AM +1100, Dave Chinner wrote:
> So what happens if a grow occurs, then the server crashes, and the
> client on reboot sees the same generation as before the grow
> occurred?
The client doesn't really see the generation. It's part of the deviceid,
which is opaque to the client.
If the client sends the opaque device ID that contains the generation
after the grow to a server that had crashed / restarted, the server
will reject it as the server starts at zero. That causes the client
to get a new, valid device ID from the server.
Unlike the NFS file handles which are persistent the device IDs are volatile
handles that can go away (and have really horrible lifetime rules..).
> > Every block allocation from a pNFS client goes through this path, so
> > yes it is performance critical.
>
> Sure, but how many allocations per second are we expecting to have
> to support? We can do tens of thousands of synchronous transactions
> per second on luns with non-volatile write caches, so I'm really
> wondering how much of a limitation this is going to be in the real
> world. Do you have any numbers?
I don't have numbers right now without running specific benchmarks,
but the rate will be about the same as for local XFS use on the same
workload.
>
> > > So whenever the server first starts up the generation number in a
> > > map is going to be zero - what purpose does this actually serve?
> >
> > So that we can communicate to the client that a device was grown; the
> > client in this case needs to re-read the device information.
>
> Why does it need to reread the device information? The layouts that
> are handed to it are still going to be valid from the server POV...
The existing layouts are still valid. But any new layout can reference the
added size, so any new layout needs to point to the new device ID.
Once the client sees the new device ID it needs to get the information for
it, which causes it to re-read the device information.
On Thu, Jan 08, 2015 at 01:43:27PM +0100, Christoph Hellwig wrote:
> On Thu, Jan 08, 2015 at 08:11:40AM +1100, Dave Chinner wrote:
> > So what happens if a grow occurs, then the server crashes, and the
> > client on reboot sees the same generation as before the grow
> > occurred?
>
> The client doesn't really see the generation. It's part of the deviceid,
> which is opaque to the client.
>
> If the client sends the opaque device ID that contains the generation
> after the grow to a server that had crashed / restarted, the server
> will reject it as the server starts at zero. That causes the client
> to get a new, valid device ID from the server.
But if the server fs has a generation number of zero when it
crashes, how does the client tell that it needs a new device ID from
the server?
> Unlike the NFS file handles which are persistent the device IDs are volatile
> handles that can go away (and have really horrible lifetime rules..).
Right. How the clients detect that "going away" when the device
generation is zero both before and after a server crash is the
question I'm asking....
> > > So that we can communicate to the client that a device was grown; the
> > > client in this case needs to re-read the device information.
> >
> > Why does it need to reread the device information? The layouts that
> > are handed to it are still going to be valid from the server POV...
>
> The existing layouts are still valid. But any new layout can reference the
> added size, so any new layout needs to point to the new device ID.
>
> Once the client sees the new device ID it needs to get the information for
> it, which causes it to re-read the device information.
OK.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Jan 09, 2015 at 08:04:05AM +1100, Dave Chinner wrote:
> > If the client sends the opaque device ID that contains the generation
> > after the grow to a server that had crashed / restarted, the server
> > will reject it as the server starts at zero. That causes the client
> > to get a new, valid device ID from the server.
>
> But if the server fs has a generation number of zero when it
> crashes, how does the client tell that it needs a new device ID from
> the server?
>
> > Unlike the NFS file handles which are persistent the device IDs are volatile
> > handles that can go away (and have really horrible lifetime rules..).
>
> Right. How the clients detect that "going away" when the device
> generation is zero both before and after a server crash is the
> question I'm asking....
The server tells the client by rejecting the operation using the
device ID.
On Fri, Jan 09, 2015 at 12:41:59PM +0100, Christoph Hellwig wrote:
> On Fri, Jan 09, 2015 at 08:04:05AM +1100, Dave Chinner wrote:
> > > If the client sends the opaque device ID that contains the generation
> > > after the grow to a server that had crashed / restarted, the server
> > > will reject it as the server starts at zero. That causes the client
> > > to get a new, valid device ID from the server.
> >
> > But if the server fs has a generation number of zero when it
> > crashes, how does the client tell that it needs a new device ID from
> > the server?
> >
> > > Unlike the NFS file handles which are persistent the device IDs are volatile
> > > handles that can go away (and have really horrible lifetime rules..).
> >
> > Right. How the clients detect that "going away" when the device
> > generation is zero both before and after a server crash is the
> > question I'm asking....
>
> The server tells the client by rejecting the operation using the
> device ID.
Ok, so:
client server
get layout
dev id == 0
grow
gen++ (=1)
crash
....
gen = 0 (initialised after boot)
commit layout
dev id == 0
server executes op, even though
device has changed....
What prevents this? Shouldn't the server be rejecting the commit
layout operation as there was a grow operation between the client
operations?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jan 12, 2015 at 02:04:01PM +1100, Dave Chinner wrote:
> Ok, so:
>
> client server
>
> get layout
> dev id == 0
> grow
> gen++ (=1)
> crash
> ....
> gen = 0 (initialised after boot)
>
> commit layout
> dev id == 0
> server executes op, even though
> device has changed....
>
> What prevents this? Shouldn't the server be rejecting the commit
> layout operation as there was a grow operation between the client
> operations?
There is no need to reject the commit. Grows for the block layout
driver never invalidate existing layouts, as they are purely grow
operations. The only reason to bother with the generation is that
new layouts might point into areas the client didn't previously
know about. So the interesting variation of your scenario above is:
client server
grow
gen++ (=1)
get layout
dev id == (x, 1)
crash
....
gen = 0 (initialised after boot)
commit layout
dev id == (x, 1)
Which will be rejected, and the client either chooses to get a
new layout / device ID, or just writes the data back through normal
I/O.
Now one interesting case would be a resize that completed in
memory, gets a layout referring to it sent out, but is not committed to
disk, and then another resize to a smaller size before the commit. Not
really practical, but if it happened we could get writes beyond the
end of the filesystem.
I didn't assume this was possible as I assumed growfs to be synchronous,
but it turns out that while we do various synchronous buffer writes, the
transaction isn't actually committed synchronously.
I think we should just make growfs commit the transaction
synchronously to avoid both the pNFS problem and the problem of
growfs potentially updating the secondary superblocks before the
transaction hits the disk.
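The generation-based rejection discussed above can be sketched as a small
userspace simulation (the struct and function names here are illustrative
only, not the actual nfsd code): the device ID is an opaque
(device, generation) pair, the generation re-initialises to zero on every
server boot, and the server rejects any ID whose generation does not match
its current one, forcing the client to fetch a fresh device ID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device ID: opaque (device, generation) pair. */
struct pnfs_dev_id {
	uint64_t device;
	uint32_t generation;
};

/* Hypothetical server state; generation starts at 0 on every boot. */
struct pnfs_server {
	uint32_t generation;
};

/* Reject any operation whose device ID carries a stale generation. */
static bool server_validate_dev_id(const struct pnfs_server *srv,
				   const struct pnfs_dev_id *id)
{
	return id->generation == srv->generation;
}

/* A grow bumps the generation, as new layouts may point into new areas. */
static void server_grow(struct pnfs_server *srv)
{
	srv->generation++;
}
```

In the crash scenario above, an ID handed out after a grow (generation 1)
fails validation against a freshly booted server (generation 0), so the
client falls back to a new layout / device ID or to normal I/O.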
Recall all outstanding pNFS layouts before truncates, writes and similar
extent list modifying operations.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_file.c | 14 ++++++++++++--
fs/xfs/xfs_ioctl.c | 9 +++++++--
fs/xfs/xfs_iops.c | 11 ++++++++---
fs/xfs/xfs_pnfs.c | 17 +++++++++++++++++
fs/xfs/xfs_pnfs.h | 7 +++++++
5 files changed, 51 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 13e974e..cb7464c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -36,6 +36,7 @@
#include "xfs_trace.h"
#include "xfs_log.h"
#include "xfs_icache.h"
+#include "xfs_pnfs.h"
#include <linux/aio.h>
#include <linux/dcache.h>
@@ -518,6 +519,10 @@ restart:
if (error)
return error;
+ error = xfs_break_layouts(inode, iolock);
+ if (error)
+ return error;
+
/*
* If the offset is beyond the size of the file, we need to zero any
* blocks that fall between the existing EOF and the start of this
@@ -786,6 +791,7 @@ xfs_file_fallocate(
struct xfs_inode *ip = XFS_I(inode);
struct xfs_trans *tp;
long error;
+ uint iolock = XFS_IOLOCK_EXCL;
loff_t new_size = 0;
if (!S_ISREG(inode->i_mode))
@@ -794,7 +800,11 @@ xfs_file_fallocate(
FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;
- xfs_ilock(ip, XFS_IOLOCK_EXCL);
+ xfs_ilock(ip, iolock);
+ error = xfs_break_layouts(inode, &iolock);
+ if (error)
+ goto out_unlock;
+
if (mode & FALLOC_FL_PUNCH_HOLE) {
error = xfs_free_file_space(ip, offset, len);
if (error)
@@ -874,7 +884,7 @@ xfs_file_fallocate(
}
out_unlock:
- xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ xfs_iunlock(ip, iolock);
return error;
}
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index a183198..d9f3937 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -39,6 +39,7 @@
#include "xfs_icache.h"
#include "xfs_symlink.h"
#include "xfs_trans.h"
+#include "xfs_pnfs.h"
#include <linux/capability.h>
#include <linux/dcache.h>
@@ -611,6 +612,7 @@ xfs_ioc_space(
struct iattr iattr;
bool setprealloc = false;
bool clrprealloc = false;
+ uint iolock = XFS_IOLOCK_EXCL;
int error;
/*
@@ -634,7 +636,10 @@ xfs_ioc_space(
if (error)
return error;
- xfs_ilock(ip, XFS_IOLOCK_EXCL);
+ xfs_ilock(ip, iolock);
+ error = xfs_break_layouts(inode, &iolock);
+ if (error)
+ goto out_unlock;
switch (bf->l_whence) {
case 0: /*SEEK_SET*/
@@ -751,7 +756,7 @@ xfs_ioc_space(
error = xfs_trans_commit(tp, 0);
out_unlock:
- xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ xfs_iunlock(ip, iolock);
mnt_drop_write_file(filp);
return error;
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 6ff84e8..b1e849a 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -37,6 +37,7 @@
#include "xfs_da_btree.h"
#include "xfs_dir2.h"
#include "xfs_trans_space.h"
+#include "xfs_pnfs.h"
#include <linux/capability.h>
#include <linux/xattr.h>
@@ -970,9 +971,13 @@ xfs_vn_setattr(
int error;
if (iattr->ia_valid & ATTR_SIZE) {
- xfs_ilock(ip, XFS_IOLOCK_EXCL);
- error = xfs_setattr_size(ip, iattr);
- xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ uint iolock = XFS_IOLOCK_EXCL;
+
+ xfs_ilock(ip, iolock);
+ error = xfs_break_layouts(dentry->d_inode, &iolock);
+ if (!error)
+ error = xfs_setattr_size(ip, iattr);
+ xfs_iunlock(ip, iolock);
} else {
error = xfs_setattr_nonsize(ip, iattr, 0);
}
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index d95f596..130516a 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -18,6 +18,23 @@
#include "xfs_pnfs.h"
int
+xfs_break_layouts(
+ struct inode *inode,
+ uint *iolock)
+{
+ int error;
+
+ while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
+ xfs_iunlock(XFS_I(inode), *iolock);
+ error = break_layout(inode, true);
+ *iolock = XFS_IOLOCK_EXCL;
+ xfs_ilock(XFS_I(inode), *iolock);
+ }
+
+ return error;
+}
+
+int
xfs_fs_get_uuid(
struct super_block *sb,
u8 *buf,
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index 0d91255..b7fbfce 100644
--- a/fs/xfs/xfs_pnfs.h
+++ b/fs/xfs/xfs_pnfs.h
@@ -7,5 +7,12 @@ int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
struct iomap *iomap, bool write, u32 *device_generation);
int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
struct iattr *iattr);
+
+int xfs_break_layouts(struct inode *inode, uint *iolock);
+#else
+static inline int xfs_break_layouts(struct inode *inode, uint *iolock)
+{
+ return 0;
+}
#endif /* CONFIG_NFSD_PNFS */
#endif /* _XFS_PNFS_H */
--
1.9.1
On Tue, Jan 06, 2015 at 05:28:41PM +0100, Christoph Hellwig wrote:
> Recall all outstanding pNFS layouts before truncates, writes and similar
> extent list modifying operations.
This is not sufficient to isolate extent manipulations. mmap writes
can trigger allocation through ->page_mkwrite, and can also trigger
extent conversion at IO completion without first needing allocation.
Maybe I'm missing something - this patchset needs some comments
documenting the locking used in XFS to co-ordinate layout coherency
at the client side with IO that is in progress for clients with
overlapping block maps, as well as against server side application
IO.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Jan 07, 2015 at 10:18:46AM +1100, Dave Chinner wrote:
> On Tue, Jan 06, 2015 at 05:28:41PM +0100, Christoph Hellwig wrote:
> > Recall all outstanding pNFS layouts before truncates, writes and similar
> > extent list modifying operations.
>
> This is not sufficient to isolate extent manipulations. mmap writes
> can trigger allocation through ->page_mkwrite, and can also trigger
> extent conversion at IO completion without first needing allocation.
>
> Maybe I'm missing something - this patchset needs some comments
> documenting the locking used in XFS to co-ordinate layout coherency
> at the client side with IO that is in progress for clients with
> overlapping block maps, as well as against server side application
> IO.
Yes, the description was a little too dense. We only care about extent list
manipulations that remove or change existing block mappings. Newly
allocated blocks don't concern the pNFS operation.
I'll take care of better documentation.
On Tue, Jan 06, 2015 at 05:28:23PM +0100, Christoph Hellwig wrote:
> This series adds support for the pNFS operations in NFS v4.1, as well
> as a block layout driver that can export block based filesystems that
> implement a few additional export operations. Support for XFS is
> provided in this series, but other filesystems could be added easily.
>
> The core pNFS code of course owns its heritage to the existing Linux
> pNFS server prototype, but except for a few bits and pieces in the
> XDR path nothing is left from it.
>
> The design of this new pNFS server is fairly different from the old
> one - while the old one implemented very little semantics in nfsd
> and left almost everything to filesystems my implementation implements
> as much as possible in common nfsd code, then dispatches to a layout
> driver that still is part of nfsd and only then calls into the
> filesystem, thus keeping it free from intimate pNFS knowledge.
>
> More details are document in the individual patch descriptions and
> code comments.
>
> This code is also available from:
>
> git://git.infradead.org/users/hch/pnfs.git pnfsd-for-3.20
Neat, thanks! I'll look at it.
Some naive questions:
- do we have evidence that this is useful in its current form?
- any advice on testing? Is there some simple virtual setup
that would allow any loser with no special hardware (e.g., me)
to check whether they've broken the block server?
- any debugging advice? E.g., have you checked if current
wireshark can handle the MDS traffic?
--b.
On Tue, Jan 06, 2015 at 12:32:22PM -0500, J. Bruce Fields wrote:
> - do we have evidence that this is useful in its current form?
What is your threshold for usefulness? It passes xfstests fine, and
shows linear scalability with multiple clients that each have 10GB
links.
> - any advice on testing? Is there some simple virtual setup
> that would allow any loser with no special hardware (e.g., me)
> to check whether they've broken the block server?
Run two kvm VMs that share the same disk. Create an XFS filesystem
on the MDS, and export it. If the client has blkmapd running (on Debian
it needs to be started manually) it will use pNFS for accessing the
filesystem. Verify that using the per-operation counters in
/proc/self/mountstats. Repeat with additional clients as necessary.
Alternatively set up a simple iSCSI target using tgt or lio and
connect to it from multiple clients.
> - any debugging advice? E.g., have you checked if current
> wireshark can handle the MDS traffic?
The wireshark version I've used decoded the generic pNFS operations
fine, but just dumps the layout specifics as hex data.
Enable the trace points added in this series, they track all stateid
interactions in the server. Additionally the pnfs debug printks on
client and server dump a lot of information.
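To confirm that a client really issued pNFS operations, the per-operation
counters mentioned above can be checked programmatically; here is a small
parser sketch for one counter line. A sample line is inlined since the real
/proc/self/mountstats requires a live NFS v4.1 mount, and the column layout
(operation name, then the number of operations sent as the first counter) is
an assumption based on the nfs-utils mountstats conventions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one per-op counter line as found under "per-op statistics" in
 * /proc/self/mountstats.  Assumed format:
 *   "OPNAME: sent received timeouts bytes_sent bytes_recv ..."
 * Returns the number of operations sent, or -1 if the line doesn't
 * match the requested operation.
 */
static long ops_sent(const char *line, const char *op)
{
	char name[64];
	long sent;

	if (sscanf(line, " %63[A-Z0-9_]: %ld", name, &sent) != 2)
		return -1;
	if (strcmp(name, op) != 0)
		return -1;
	return sent;	/* nonzero means the client issued this pNFS op */
}
```

On a live client the equivalent quick check is just grepping for LAYOUTGET
in /proc/self/mountstats and looking at the first counter.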
On Tue, 6 Jan 2015 18:56:11 +0100
Christoph Hellwig <[email protected]> wrote:
> On Tue, Jan 06, 2015 at 12:32:22PM -0500, J. Bruce Fields wrote:
> > - do we have evidence that this is useful in its current form?
>
> What is your threshold for usefulness? It passes xfstests fine, and
> shows linear scalability with multiple clients that each have 10GB
> links.
>
> > - any advice on testing? Is there some simple virtual setup
> > that would allow any loser with no special hardware (e.g., me)
> > to check whether they've broken the block server?
>
> Run two kvm VMs that share the same disk. Create an XFS filesystem
> on the MDS, and export it. If the client has blkmapd running (on Debian
> it needs to be started manually) it will use pNFS for accessing the
> filesystem. Verify that using the per-operation counters in
> /proc/self/mountstats. Repeat with additional clients as necessary.
>
> Alternatively set up a simple iSCSI target using tgt or lio and
> connect to it from multiple clients.
>
> > - any debugging advice? E.g., have you checked if current
> > wireshark can handle the MDS traffic?
>
> The wireshark version I've used decoded the generic pNFS operations
> fine, but just dumps the layout specifics as hex data.
>
> Enable the trace points added in this series, they track all stateid
> interactions in the server. Additionally the pnfs debug printks on
> client and server dump a lot of information.
The wireshark decoder really only handles files layouts right now. Dros
has some patches to add flexfiles support too (once the spec is a bit
more finalized) and at that point it shouldn't be too hard to fix it to
handle block layout as well.
--
Jeff Layton <[email protected]>
> On Jan 6, 2015, at 1:37 PM, Jeff Layton <[email protected]> wrote:
>
> On Tue, 6 Jan 2015 18:56:11 +0100
> Christoph Hellwig <[email protected]> wrote:
>
>> On Tue, Jan 06, 2015 at 12:32:22PM -0500, J. Bruce Fields wrote:
>>> - do we have evidence that this is useful in its current form?
>>
>> What is your threshold for usefulness? It passes xfstests fine, and
>> shows linear scalability with multiple clients that each have 10GB
>> links.
>>
>>> - any advice on testing? Is there some simple virtual setup
>>> that would allow any loser with no special hardware (e.g., me)
>>> to check whether they've broken the block server?
>>
>> Run two kvm VMs that share the same disk. Create an XFS filesystem
>> on the MDS, and export it. If the client has blkmapd running (on Debian
>> it needs to be started manually) it will use pNFS for accessing the
>> filesystem. Verify that using the per-operation counters in
>> /proc/self/mountstats. Repeat with additional clients as necessary.
>>
>> Alternatively set up a simple iSCSI target using tgt or lio and
>> connect to it from multiple clients.
>>
>>> - any debugging advice? E.g., have you checked if current
>>> wireshark can handle the MDS traffic?
>>
>> The wireshark version I've used decoded the generic pNFS operations
>> fine, but just dumps the layout specifics as hex data.
>>
>> Enable the trace points added in this series, they track all stateid
>> interactions in the server. Additionally the pnfs debug printks on
>> client and server dump a lot of information.
>
> The wireshark decoder really only handles files layouts right now. Dros
> has some patches to add flexfiles support too (once the spec is a bit
> more finalized) and at that point it shouldn't be too hard to fix it to
> handle block layout as well.
I should be publishing these patches soon. The only holdup is waiting
on the IANA layout type assignment.
-dros
On Tue, Jan 06, 2015 at 06:56:11PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 06, 2015 at 12:32:22PM -0500, J. Bruce Fields wrote:
> > - do we have evidence that this is useful in its current form?
>
> What is your threshold for usefulness? It passes xfstests fine, and
> shows linear scalability with multiple clients that each have 10GB
> links.
Sounds good. It'd be interesting to see details if they can be posted.
--b.
> > - any advice on testing? Is there some simple virtual setup
> > that would allow any loser with no special hardware (e.g., me)
> > to check whether they've broken the block server?
>
> Run two kvm VMs that share the same disk. Create an XFS filesystem
> on the MDS, and export it. If the client has blkmapd running (on Debian
> it needs to be started manually) it will use pNFS for accessing the
> filesystem. Verify that using the per-operation counters in
> /proc/self/mountstats. Repeat with additional clients as necessary.
>
> Alternatively set up a simple iSCSI target using tgt or lio and
> connect to it from multiple clients.
>
> > - any debugging advice? E.g., have you checked if current
> > wireshark can handle the MDS traffic?
>
> The wireshark version I've used decoded the generic pNFS operations
> fine, but just dumps the layout specifics as hex data.
>
> Enable the trace points added in this series, they track all stateid
> interactions in the server. Additionally the pnfs debug printks on
> client and server dump a lot of information.