Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756375AbZCaSJH (ORCPT ); Tue, 31 Mar 2009 14:09:07 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751436AbZCaSIx (ORCPT ); Tue, 31 Mar 2009 14:08:53 -0400 Received: from mx2.netapp.com ([216.240.18.37]:17979 "EHLO mx2.netapp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751252AbZCaSIr (ORCPT ); Tue, 31 Mar 2009 14:08:47 -0400 X-IronPort-AV: E=Sophos;i="4.39,303,1235980800"; d="scan'208";a="148350581" Subject: [GIT PULL] Please pull the first batch of NFS client changes (and cachefs merge)... From: Trond Myklebust To: Linus Torvalds Cc: linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org, dhowells@redhat.com Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: NetApp Date: Tue, 31 Mar 2009 14:07:04 -0400 Message-Id: <1238522824.6577.5.camel@heimdal.trondhjem.org> Mime-Version: 1.0 X-Mailer: Evolution 2.26.0 X-OriginalArrivalTime: 31 Mar 2009 18:07:09.0255 (UTC) FILETIME=[82156D70:01C9B22B] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 130080 Lines: 3057 Hi Linus, Please pull from the "for-linus" branch of the repository at git pull git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git for-linus This will update the following files through the appended changesets. Cheers, Trond ---- Documentation/filesystems/caching/backend-api.txt | 658 ++++++++++++++++ Documentation/filesystems/caching/cachefiles.txt | 501 ++++++++++++ Documentation/filesystems/caching/fscache.txt | 333 ++++++++ Documentation/filesystems/caching/netfs-api.txt | 800 +++++++++++++++++++ Documentation/filesystems/caching/object.txt | 313 ++++++++ Documentation/filesystems/caching/operations.txt | 213 +++++ Documentation/slow-work.txt | 174 +++++ fs/Kconfig | 7 + fs/Makefile | 2 + fs/afs/Kconfig | 8 + fs/afs/Makefile | 3 + fs/afs/cache.c | 503 ++++++++----- fs/afs/cache.h | 15 +- fs/afs/cell.c | 16 +- fs/afs/file.c | 220 ++++-- fs/afs/inode.c | 31 +- fs/afs/internal.h | 53 +- fs/afs/main.c | 27 +- fs/afs/mntpt.c | 4 +- fs/afs/vlocation.c | 25 +- fs/afs/volume.c | 14 +- fs/afs/write.c | 21 + fs/cachefiles/Kconfig | 39 + fs/cachefiles/Makefile | 18 + fs/cachefiles/cf-bind.c | 286 +++++++ fs/cachefiles/cf-daemon.c | 754 ++++++++++++++++++ fs/cachefiles/cf-interface.c | 449 +++++++++++ fs/cachefiles/cf-internal.h | 360 +++++++++ fs/cachefiles/cf-key.c | 159 ++++ fs/cachefiles/cf-main.c | 106 +++ fs/cachefiles/cf-namei.c | 772 +++++++++++++++++++ fs/cachefiles/cf-proc.c | 134 ++++ fs/cachefiles/cf-rdwr.c | 853 +++++++++++++++++++++ fs/cachefiles/cf-security.c | 116 +++ fs/cachefiles/cf-xattr.c | 291 +++++++ fs/ext2/inode.c | 2 + fs/ext3/inode.c | 9 +- fs/fscache/Kconfig | 56 ++ fs/fscache/Makefile | 19 + fs/fscache/fsc-cache.c | 415 ++++++++++ fs/fscache/fsc-cookie.c | 498 ++++++++++++ fs/fscache/fsc-fsdef.c | 144 ++++ fs/fscache/fsc-histogram.c | 109 +++ fs/fscache/fsc-internal.h | 380 +++++++++ fs/fscache/fsc-main.c | 124 +++ fs/fscache/fsc-netfs.c | 103 +++ fs/fscache/fsc-object.c | 810 +++++++++++++++++++ fs/fscache/fsc-operation.c | 459 +++++++++++ fs/fscache/fsc-page.c | 771 +++++++++++++++++++ fs/fscache/fsc-proc.c | 68 ++ fs/fscache/fsc-stats.c | 212 +++++ fs/lockd/clntlock.c | 51 +-- fs/lockd/mon.c | 8 +- fs/lockd/svc.c | 42 +- fs/nfs/Kconfig | 8 + fs/nfs/Makefile | 1 + fs/nfs/callback.c | 31 +- fs/nfs/callback.h | 1 + fs/nfs/client.c | 130 ++-- fs/nfs/dir.c | 9 +- fs/nfs/file.c | 69 ++- fs/nfs/fscache-index.c | 337 ++++++++ fs/nfs/fscache.c | 521 +++++++++++++ fs/nfs/fscache.h | 208 +++++ fs/nfs/getroot.c | 4 +- fs/nfs/inode.c | 323 ++++++--- fs/nfs/internal.h | 8 + fs/nfs/iostat.h | 18 + fs/nfs/nfs2xdr.c | 9 +- fs/nfs/nfs3proc.c | 1 + fs/nfs/nfs3xdr.c | 37 +- fs/nfs/nfs4proc.c | 47 +- fs/nfs/nfs4state.c | 10 +- fs/nfs/nfs4xdr.c | 213 ++++-- fs/nfs/pagelist.c | 11 - fs/nfs/proc.c | 1 + fs/nfs/read.c | 27 +- fs/nfs/super.c | 49 ++- fs/nfs/write.c | 53 +- fs/nfsd/nfsctl.c | 6 +- fs/nfsd/nfssvc.c | 5 +- fs/splice.c | 3 +- fs/super.c | 1 + include/linux/fs.h | 7 + include/linux/fscache-cache.h | 504 ++++++++++++ include/linux/fscache.h | 592 ++++++++++++++ include/linux/nfs_fs.h | 17 +- include/linux/nfs_fs_sb.h | 16 + include/linux/nfs_iostat.h | 12 + include/linux/nfs_xdr.h | 59 ++- include/linux/page-flags.h | 43 +- include/linux/pagemap.h | 21 + include/linux/slow-work.h | 95 +++ include/linux/sunrpc/svc.h | 9 +- include/linux/sunrpc/svc_xprt.h | 52 +- include/linux/sunrpc/xprt.h | 2 + init/Kconfig | 12 + kernel/Makefile | 1 + kernel/slow-work.c | 640 ++++++++++++++++ kernel/sysctl.c | 9 + mm/filemap.c | 99 +++ mm/migrate.c | 10 +- mm/readahead.c | 40 +- mm/swap.c | 4 +- mm/truncate.c | 10 +- mm/vmscan.c | 6 +- net/sunrpc/Kconfig | 22 - net/sunrpc/clnt.c | 48 +- net/sunrpc/rpcb_clnt.c | 103 ++- net/sunrpc/svc.c | 158 ++-- net/sunrpc/svc_xprt.c | 31 +- net/sunrpc/svcsock.c | 40 +- net/sunrpc/xprt.c | 89 ++- net/sunrpc/xprtrdma/rpc_rdma.c | 26 +- net/sunrpc/xprtrdma/svc_rdma_sendto.c | 8 +- net/sunrpc/xprtsock.c | 363 ++++++---- security/security.c | 2 + 117 files changed, 16611 insertions(+), 1238 deletions(-) commit e13a5357ab5961844e64ec4ade6e4e13bfc33355 Author: Trond Myklebust Date: Mon Mar 30 18:59:17 2009 -0400 SUNRPC: Ensure IPV6_V6ONLY is set on the socket before binding to a port Also ensure that we use the protocol family instead of the address family when calling sock_create_kern(). Signed-off-by: Trond Myklebust commit 199c2bcb07969dbbd2c5479bd2a0a2382836332e Author: Mans Rullgard Date: Sat Mar 28 19:55:20 2009 +0000 NSM: Fix unaligned accesses in nsm_init_private() This fixes unaligned accesses in nsm_init_private() when creating nlm_reboot keys. Signed-off-by: Mans Rullgard Reviewed-by: Chuck Lever Signed-off-by: Trond Myklebust commit 3c8c45dfab78a1919f6f8a3ea46998c487eb7e12 Author: Chuck Lever Date: Wed Mar 18 20:48:14 2009 -0400 NFS: Simplify logic to compare socket addresses in client.c Callback requests from IPv4 servers are now always guaranteed to be AF_INET, and never mapped IPv4 AF_INET6 addresses. Both nfs_match_client() and nfs_find_client() can now share the same address comparison logic, so fold them together. We can also dispense with of most of the conditional compilation in here. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit f738f5170367b367e38b2d75a413e7b3c52d46a5 Author: Chuck Lever Date: Wed Mar 18 20:48:06 2009 -0400 NFS: Start PF_INET6 callback listener only if IPv6 support is available Apparently a lot of people need to disable IPv6 completely on their distributor-built systems, which have CONFIG_IPV6_MODULE enabled at build time. They do this by blacklisting the ipv6.ko module. This causes the creation of the NFSv4 callback service listener to fail if CONFIG_IPV6_MODULE is set, but the module cannot be loaded. Now that the kernel's PF_INET6 RPC listeners are completely separate from PF_INET listeners, we can always start PF_INET. Then the NFS client can try to start a PF_INET6 listener, but it isn't required to be available. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit eb16e907781a9da7f272a3e8284c26bc4e4aeb9d Author: Chuck Lever Date: Wed Mar 18 20:47:59 2009 -0400 lockd: Start PF_INET6 listener only if IPv6 support is available Apparently a lot of people need to disable IPv6 completely on their distributor-built systems, which have CONFIG_IPV6_MODULE enabled at build time. They do this by blacklisting the ipv6.ko module. This causes the creation of the lockd service listener to fail if CONFIG_IPV6_MODULE is set, but the module cannot be loaded. Now that the kernel's PF_INET6 RPC listeners are completely separate from PF_INET listeners, we can always start PF_INET. Then lockd can try to start PF_INET6, but it isn't required to be available. Note this has the added benefit that NLM callbacks from AF_INET6 servers will never come from AF_INET remotes. We no longer have to worry about matching mapped IPv4 addresses to AF_INET when comparing addresses. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 9355982830ad67dca35e0f3d43319f3d438f82b4 Author: Chuck Lever Date: Wed Mar 18 20:47:51 2009 -0400 SUNRPC: Remove CONFIG_SUNRPC_REGISTER_V4 We just augmented the kernel's RPC service registration code so that it automatically adjusts to what is supported in user space. Thus we no longer need the kernel configuration option to enable registering RPC services with v4 -- it's all done automatically. This patch is part of a series that addresses http://bugzilla.kernel.org/show_bug.cgi?id=12256 Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 363f724cdd3d2ae554e261be995abdeb15f7bdd9 Author: Chuck Lever Date: Wed Mar 18 20:47:44 2009 -0400 SUNRPC: rpcb_register() should handle errors silently Move error reporting for RPC registration to rpcb_register's caller. This way the caller can choose to recover silently from certain errors, but report errors it does not recognize. Error reporting for kernel RPC service registration is now handled in one place. This patch is part of a series that addresses http://bugzilla.kernel.org/show_bug.cgi?id=12256 Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit cadc0fa534e51e20fdffe1623913c163a18d71b1 Author: Chuck Lever Date: Wed Mar 18 20:47:36 2009 -0400 SUNRPC: Simplify kernel RPC service registration The kernel registers RPC services with the local portmapper with an rpcbind SET upcall to the local portmapper. Traditionally, this used rpcbind v2 (PMAP), but registering RPC services that support IPv6 requires rpcbind v3 or v4. Since we now want separate PF_INET and PF_INET6 listeners for each kernel RPC service, svc_register() will do only one of those registrations at a time. For PF_INET, it tries an rpcb v4 SET upcall first; if that fails, it does a legacy portmap SET. This makes it entirely backwards compatible with legacy user space, but allows a proper v4 SET to be used if rpcbind is available. For PF_INET6, it does an rpcb v4 SET upcall. If that fails, it fails the registration, and thus the transport creation. This let's the kernel detect if user space is able to support IPv6 RPC services, and thus whether it should maintain a PF_INET6 listener for each service at all. This provides complete backwards compatibilty with legacy user space that only supports rpcbind v2. The only down-side is that registering a new kernel RPC service may take an extra exchange with the local portmapper on legacy systems, but this is an infrequent operation and is done over UDP (no lingering sockets in TIMEWAIT), so it shouldn't be consequential. This patch is part of a series that addresses http://bugzilla.kernel.org/show_bug.cgi?id=12256 Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit d5a8620f7c8a5bcade730e2fa1224191f289fb00 Author: Chuck Lever Date: Wed Mar 18 20:47:29 2009 -0400 SUNRPC: Simplify svc_unregister() Our initial implementation of svc_unregister() assumed that PMAP_UNSET cleared all rpcbind registrations for a [program, version] tuple. However, we now have evidence that PMAP_UNSET clears only "inet" entries, and not "inet6" entries, in the rpcbind database. For backwards compatibility with the legacy portmapper, the svc_unregister() function also must work if user space doesn't support rpcbind version 4 at all. Thus we'll send an rpcbind v4 UNSET, and if that fails, we'll send a PMAP_UNSET. This simplifies the code in svc_unregister() and provides better backwards compatibility with legacy user space that does not support rpcbind version 4. We can get rid of the conditional compilation in here as well. This patch is part of a series that addresses http://bugzilla.kernel.org/show_bug.cgi?id=12256 Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 1673d0de40ab46cac3b456ad50e1c8d6a31bfd66 Author: Chuck Lever Date: Wed Mar 18 20:47:21 2009 -0400 SUNRPC: Allow callers to pass rpcb_v4_register a NULL address The user space TI-RPC library uses an empty string for the universal address when unregistering all target addresses for [program, version]. The kernel's rpcb client should behave the same way. Here, we are switching between several registration methods based on the protocol family of the incoming address. Rename the other rpcbind v4 registration functions to make it clear that they, as well, are switched on protocol family. In /etc/netconfig, this is either "inet" or "inet6". NB: The loopback protocol families are not supported in the kernel. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 126e4bc3b3b446482696377f67a634c76eaf2e9c Author: Chuck Lever Date: Wed Mar 18 20:47:14 2009 -0400 SUNRPC: rpcbind actually interprets r_owner string RFC 1833 has little to say about the contents of r_owner; it only specifies that it is a string, and states that it is used to control who can UNSET an entry. Our port of rpcbind (from Sun) assumes this string contains a numeric UID value, not alphabetical or symbolic characters, but checks this value only for AF_LOCAL RPCB_SET or RPCB_UNSET requests. In all other cases, rpcbind ignores the contents of the r_owner string. The reference user space implementation of rpcb_set(3) uses a numeric UID for all SET/UNSET requests (even via the network) and an empty string for all other requests. We emulate that behavior here to maintain bug-for-bug compatibility. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 3aba45536fe8f92aa07bcdfd2fb1cf17eec7d786 Author: Chuck Lever Date: Wed Mar 18 20:47:06 2009 -0400 SUNRPC: Clean up address type casts in rpcb_v4_register() Clean up: Simplify rpcb_v4_register() and its helpers by moving the details of sockaddr type casting to rpcb_v4_register()'s helper functions. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit ba5c35e0c7e30b095636cd58b0854fdbd3c32947 Author: Chuck Lever Date: Wed Mar 18 20:46:59 2009 -0400 SUNRPC: Don't return EPROTONOSUPPORT in svc_register()'s helpers The RPC client returns -EPROTONOSUPPORT if there is a protocol version mismatch (ie the remote RPC server doesn't support the RPC protocol version sent by the client). Helpers for the svc_register() function return -EPROTONOSUPPORT if they don't recognize the passed-in IPPROTO_ value. These are two entirely different failure modes. Have the helpers return -ENOPROTOOPT instead of -EPROTONOSUPPORT. This will allow callers to determine more precisely what the underlying problem is, and decide to report or recover appropriately. This patch is part of a series that addresses http://bugzilla.kernel.org/show_bug.cgi?id=12256 Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit fc28decdc93633a65d54e42498e9e819d466329c Author: Chuck Lever Date: Wed Mar 18 20:46:51 2009 -0400 SUNRPC: Use IPv4 loopback for registering AF_INET6 kernel RPC services The kernel uses an IPv6 loopback address when registering its AF_INET6 RPC services so that it can tell whether the local portmapper is actually IPv6-enabled. Since the legacy portmapper doesn't listen on IPv6, however, this causes a long timeout on older systems if the kernel happens to try creating and registering an AF_INET6 RPC service. Originally I wanted to use a connected transport (either TCP or connected UDP) so that the upcall would fail immediately if the portmapper wasn't listening on IPv6, but we never agreed on what transport to use. In the end, it's of little consequence to the kernel whether the local portmapper is listening on IPv6. It's only important whether the portmapper supports rpcbind v4. And the kernel can't tell that at all if it is sending requests via IPv6 -- the portmapper will just ignore them. So, send both rpcbind v2 and v4 SET/UNSET requests via IPv4 loopback to maintain better backwards compatibility between new kernels and legacy user space, and prevent multi-second hangs in some cases when the kernel attempts to register RPC services. This patch is part of a series that addresses http://bugzilla.kernel.org/show_bug.cgi?id=12256 Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 7d21c0f9845f0ce4e81baac3519fbb2c6c2cc908 Author: Chuck Lever Date: Wed Mar 18 20:46:44 2009 -0400 SUNRPC: Set IPV6ONLY flag on PF_INET6 RPC listener sockets We are about to convert to using separate RPC listener sockets for PF_INET and PF_INET6. This echoes the way IPv6 is handled in user space by TI-RPC, and eliminates the need for ULPs to worry about mapped IPv4 AF_INET6 addresses when doing address comparisons. Start by setting the IPV6ONLY flag on PF_INET6 RPC listener sockets. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 26298caacac3e4754194b13aef377706d5de6cf6 Author: Chuck Lever Date: Wed Mar 18 20:46:36 2009 -0400 NFS: Revert creation of IPv6 listeners for lockd and NFSv4 callbacks We're about to convert over to using separate PF_INET and PF_INET6 listeners, instead of a single PF_INET6 listener that also receives AF_INET requests and maps them to AF_INET6. Clear the way by removing the logic in lockd and the NFSv4 callback server that creates an AF_INET6 service listener. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 49a9072f29a1039f142ec98b44a72d7173651c02 Author: Chuck Lever Date: Wed Mar 18 20:46:29 2009 -0400 SUNRPC: Remove @family argument from svc_create() and svc_create_pooled() Since an RPC service listener's protocol family is specified now via svc_create_xprt(), it no longer needs to be passed to svc_create() or svc_create_pooled(). Remove that argument from the synopsis of those functions, and remove the sv_family field from the svc_serv struct. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 9652ada3fb5914a67d8422114e8a76388330fa79 Author: Chuck Lever Date: Wed Mar 18 20:46:21 2009 -0400 SUNRPC: Change svc_create_xprt() to take a @family argument The sv_family field is going away. Pass a protocol family argument to svc_create_xprt() instead of extracting the family from the passed-in svc_serv struct. Again, as this is a listener socket and not an address, we make this new argument an "int" protocol family, instead of an "sa_family_t." Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit baf01caf09e87579c2d157e5ee29975db8551522 Author: Chuck Lever Date: Wed Mar 18 20:46:13 2009 -0400 SUNRPC: svc_setup_socket() gets protocol family from socket Since the sv_family field is going away, modify svc_setup_socket() to extract the protocol family from the passed-in socket instead of from the passed-in svc_serv struct. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 4b62e58cccff9c5e7ffc7023f7ec24c75fbd549b Author: Chuck Lever Date: Wed Mar 18 20:46:06 2009 -0400 SUNRPC: Pass a family argument to svc_register() The sv_family field is going away. Instead of using sv_family, have the svc_register() function take a protocol family argument. Since this argument represents a protocol family, and not an address family, this argument takes an int, as this is what is passed to sock_create_kern(). Also make sure svc_register's helpers are checking for PF_FOO instead of AF_FOO. The value of [AP]F_FOO are equivalent; this is simply a symbolic change to reflect the semantics of the value stored in that variable. sock_create_kern() should return EPFNOSUPPORT if the passed-in protocol family isn't supported, but it uses EAFNOSUPPORT for this case. We will stick with that tradition here, as svc_register() is called by the RPC server in the same path as sock_create_kern(). Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 156e62094a74cf43f02f56ef96b6cda567501357 Author: Chuck Lever Date: Wed Mar 18 20:45:58 2009 -0400 SUNRPC: Clean up svc_find_xprt() calling sequence Clean up: add documentating comment and use appropriate data types for svc_find_xprt()'s arguments. This also eliminates a mixed sign comparison: @port was an int, while the return value of svc_xprt_local_port() is an unsigned short. Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit adbbe929569e6eec8ff9feca23f1f2b40b42853d Author: Chuck Lever Date: Wed Mar 18 20:45:51 2009 -0400 NFSD: If port value written to /proc/fs/nfsd/portlist is invalid, return EINVAL Make sure port value read from user space by write_ports is valid before passing it to svc_find_xprt(). If it wasn't, the writer would get ENOENT instead of EINVAL. Noticed-by: J. Bruce Fields Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit efb3288b423d7e3533a68dccecaa05a56a281a4e Author: Chuck Lever Date: Wed Mar 18 20:45:43 2009 -0400 SUNRPC: Clean up static inline functions in svc_xprt.h Clean up: Enable the use of const arguments in higher level svc_ APIs by adding const to the arguments of the helper functions in svc_xprt.h Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 776bd5c7a207de546918f805090bfc823d2660c8 Author: Chuck Lever Date: Wed Mar 18 20:45:28 2009 -0400 SUNRPC: Don't flag empty RPCB_GETADDR reply as bogus In 2007, commit e65fe3976f594603ed7b1b4a99d3e9b867f573ea added additional sanity checking to rpcb_decode_getaddr() to make sure we were getting a reply that was long enough to be an actual universal address. If the uaddr string isn't long enough, the XDR decoder returns EIO. However, an empty string is a valid RPCB_GETADDR response if the requested service isn't registered. Moreover, "::.n.m" is also a valid RPCB_GETADDR response for IPv6 addresses that is shorter than rpcb_decode_getaddr()'s lower limit of 11. So this sanity check introduced a regression for rpcbind requests against IPv6 remotes. So revert the lower bound check added by commit e65fe3976f594603ed7b1b4a99d3e9b867f573ea, and add an explicit check for an empty uaddr string, similar to libtirpc's rpcb_getaddr(3). Pointed-out-by: Jeff Layton Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 7fe5c398fc2186ed586db11106a6692d871d0d58 Author: Trond Myklebust Date: Thu Mar 19 15:35:50 2009 -0400 NFS: Optimise NFS close() Close-to-open cache consistency rules really only require us to flush out writes on calls to close(), and require us to revalidate attributes on the very last close of the file. Currently we appear to be doing a lot of extra attribute revalidation and cache flushes. Signed-off-by: Trond Myklebust commit b1e4adf4ea41bb8b5a7bfc1a7001f137e65495df Author: Trond Myklebust Date: Thu Mar 19 15:35:49 2009 -0400 NFS: Fix the notifications when renaming onto an existing file NFS appears to be returning an unnecessary "delete" notification when we're doing an atomic rename. See http://bugzilla.gnome.org/show_bug.cgi?id=575684 The fix is to get rid of the redundant call to d_delete(). Signed-off-by: Trond Myklebust commit 47c62564200609b6de60f535f61f0c73dd10c7c9 Author: Trond Myklebust Date: Mon Mar 16 08:13:41 2009 -0400 NFS: Fix up a mismerged patch Move the definition of nfs_need_commit() into the #ifdef CONFIG_NFS_V3 section as originally intended in the patch "NFS: cleanup - remove struct nfs_inode->ncommit" Signed-off-by: Trond Myklebust commit 2e3c230bc7149a6af65d26a0c312e230e0c33cc3 Author: Tom Talpey Date: Thu Mar 12 22:21:21 2009 -0400 SVCRDMA: fix recent printk format warnings. printk formats in prior commit were reversed/incorrect. Compiled without warning on x86 and x86_64, but detected on ppc. Signed-off-by: Tom Talpey Signed-off-by: Trond Myklebust commit 55420c24a0d4d1fce70ca713f84aa00b6b74a70e Author: Trond Myklebust Date: Wed Mar 11 15:29:24 2009 -0400 SUNRPC: Ensure we close the socket on EPIPE errors too... As long as one task is holding the socket lock, then calls to xprt_force_disconnect(xprt) will not succeed in shutting down the socket. In particular, this would mean that a server initiated shutdown will not succeed until the lock is relinquished. In order to avoid the deadlock, we should ensure that xs_tcp_send_request() closes the socket on EPIPE errors too. Signed-off-by: Trond Myklebust commit b61d59fffd3e5b6037c92b4c840605831de8a251 Author: Trond Myklebust Date: Wed Mar 11 14:38:04 2009 -0400 SUNRPC: xs_tcp_connect_worker{4,6}: merge common code Signed-off-by: Trond Myklebust commit 25fe6142a57c720452c5e9ddbc1f32309c1e5c19 Author: Trond Myklebust Date: Wed Mar 11 14:38:03 2009 -0400 SUNRPC: Add a sysctl to control the duration of the socket linger timeout Signed-off-by: Trond Myklebust commit 7d1e8255cf959fba7ee2317550dfde39f0b936ae Author: Trond Myklebust Date: Wed Mar 11 14:38:03 2009 -0400 SUNRPC: Add the equivalent of the linger and linger2 timeouts to RPC sockets This fixes a regression against FreeBSD servers as reported by Tomas Kasparek. Apparently when using RPC over a TCP socket, the FreeBSD servers don't ever react to the client closing the socket, and so commit e06799f958bf7f9f8fae15f0c6f519953fb0257c (SUNRPC: Use shutdown() instead of close() when disconnecting a TCP socket) causes the setup to hang forever whenever the client attempts to close and then reconnect. We break the deadlock by adding a 'linger2' style timeout to the socket, after which, the client will abort the connection using a TCP 'RST'. The default timeout is set to 15 seconds. A subsequent patch will put it under user control by means of a systctl. Signed-off-by: Trond Myklebust commit 5e3771ce2d6a69e10fcc870cdf226d121d868491 Author: Trond Myklebust Date: Wed Mar 11 14:38:01 2009 -0400 SUNRPC: Ensure that xs_nospace return values are propagated If xs_nospace() finds that the socket has disconnected, it attempts to return ENOTCONN, however that value is then squashed by the callers. Signed-off-by: Trond Myklebust commit 8a2cec295f4499cc9d4452e9b02d4ed071bb42d3 Author: Trond Myklebust Date: Wed Mar 11 14:38:01 2009 -0400 SUNRPC: Delay, then retry on connection errors. Enforce the comment in xs_tcp_connect_worker4/xs_tcp_connect_worker6 that we should delay, then retry on certain connection errors. Signed-off-by: Trond Myklebust commit 2a4919919a97911b0aa4b9f5ac1eab90ba87652b Author: Trond Myklebust Date: Wed Mar 11 14:38:00 2009 -0400 SUNRPC: Return EAGAIN instead of ENOTCONN when waking up xprt->pending While we should definitely return socket errors to the task that is currently trying to send data, there is no need to propagate the same error to all the other tasks on xprt->pending. Doing so actually slows down recovery, since it causes more than one tasks to attempt socket recovery. Signed-off-by: Trond Myklebust commit 482f32e65d31cbf88d08306fa5d397cc945c3c26 Author: Trond Myklebust Date: Wed Mar 11 14:38:00 2009 -0400 SUNRPC: Handle socket errors correctly Ensure that we pick up and handle socket errors as they occur. Signed-off-by: Trond Myklebust commit c8485e4d634f6df155040293928707f127f0d06d Author: Trond Myklebust Date: Wed Mar 11 14:37:59 2009 -0400 SUNRPC: Handle ECONNREFUSED correctly in xprt_transmit() If we get an ECONNREFUSED error, we currently go to sleep on the 'xprt->sending' wait queue. The problem is that no timeout is set there, and there is nothing else that will wake the task up later. We should deal with ECONNREFUSED in call_status, given that is where we also deal with -EHOSTDOWN, and friends. Signed-off-by: Trond Myklebust commit 40d2549db5f515e415894def98b49db7d4c56714 Author: Trond Myklebust Date: Wed Mar 11 14:37:58 2009 -0400 SUNRPC: Don't disconnect if a connection is still in progress. Signed-off-by: Trond Myklebust commit 670f94573104b4a25525d3fcdcd6496c678df172 Author: Trond Myklebust Date: Wed Mar 11 14:37:58 2009 -0400 SUNRPC: Ensure we set XPRT_CLOSING only after we've sent a tcp FIN... ...so that we can distinguish between when we need to shutdown and when we don't. Also remove the call to xs_tcp_shutdown() from xs_tcp_connect(), since xprt_connect() makes the same test. Signed-off-by: Trond Myklebust commit 15f081ca8ddfe150fb639c591b18944a539da0fc Author: Trond Myklebust Date: Wed Mar 11 14:37:57 2009 -0400 SUNRPC: Avoid an unnecessary task reschedule on ENOTCONN If the socket is unconnected, and xprt_transmit() returns ENOTCONN, we currently give up the lock on the transport channel. Doing so means that the lock automatically gets assigned to the next task in the xprt->sending queue, and so that task needs to be woken up to do the actual connect. The following patch aims to avoid that unnecessary task switch. Signed-off-by: Trond Myklebust commit a67d18f89f5782806135aad4ee012ff78d45aae7 Author: Tom Talpey Date: Wed Mar 11 14:37:56 2009 -0400 NFS: load the rpc/rdma transport module automatically When mounting an NFS/RDMA server with the "-o proto=rdma" or "-o rdma" options, attempt to dynamically load the necessary "xprtrdma" client transport module. Doing so improves usability, while avoiding a static module dependency and any unnecesary resources. Signed-off-by: Tom Talpey Cc: Chuck Lever Signed-off-by: Trond Myklebust commit 441e3e242903f9b190d5764bed73edb58f977413 Author: Tom Talpey Date: Wed Mar 11 14:37:56 2009 -0400 SUNRPC: dynamically load RPC transport modules on-demand Provide an api to attempt to load any necessary kernel RPC client transport module automatically. By convention, the desired module name is "xprt"+"transport name". For example, when NFS mounting with "-o proto=rdma", attempt to load the "xprtrdma" module. Signed-off-by: Tom Talpey Cc: Chuck Lever Signed-off-by: Trond Myklebust commit b38ab40ad58c1fc43ea590d6342f6a6763ac8fb6 Author: Tom Talpey Date: Wed Mar 11 14:37:55 2009 -0400 XPRTRDMA: correct an rpc/rdma inline send marshaling error Certain client rpc's which contain both lengthy page-contained metadata and a non-empty xdr_tail buffer require careful handling to avoid overlapped memory copying. Rearranging of existing rpcrdma marshaling code avoids it; this fixes an NFSv4 symlink creation error detected with connectathon basic/test8 to multiple servers. Signed-off-by: Tom Talpey Signed-off-by: Trond Myklebust commit b1e1e158779f1d99c2cc18e466f6bf9099fc0853 Author: Tom Talpey Date: Wed Mar 11 14:37:55 2009 -0400 SVCRDMA: remove faulty assertions in rpc/rdma chunk validation. Certain client-provided RPCRDMA chunk alignments result in an additional scatter/gather entry, which triggered nfs/rdma server assertions incorrectly. OpenSolaris nfs/rdma client connectathon testing was blocked by these in the special/locking section. Signed-off-by: Tom Talpey Cc: Tom Tucker Signed-off-by: Trond Myklebust commit e1ebfd33be068ec933f8954060a499bd22ad6f69 Author: Trond Myklebust Date: Wed Mar 11 14:37:54 2009 -0400 NFS: Kill the "defined but not used" compile error on nommu machines Bryan Wu reports that when compiling NFS on nommu machines he gets a "defined but not used" error on nfs_file_mmap(). The easiest fix is simply to get rid of the special casing in NFS, and just always call generic_file_mmap() to set up the file. Signed-off-by: Trond Myklebust commit 72cb77f4a5ace37b12dcb47a0e8637a2c28ad881 Author: Trond Myklebust Date: Wed Mar 11 14:10:30 2009 -0400 NFS: Throttle page dirtying while we're flushing to disk The following patch is a combination of a patch by myself and Peter Staubach. Trond: If we allow other processes to dirty pages while a process is doing a consistency sync to disk, we can end up never making progress. Peter: Attached is a patch which addresses a continuing problem with the NFS client generating out of order WRITE requests. While this is compliant with all of the current protocol specifications, there are servers in the market which can not handle out of order WRITE requests very well. Also, this may lead to sub-optimal block allocations in the underlying file system on the server. This may cause the read throughputs to be reduced when reading the file from the server. Peter: There has been a lot of work recently done to address out of order issues on a systemic level. However, the NFS client is still susceptible to the problem. Out of order WRITE requests can occur when pdflush is in the middle of writing out pages while the process dirtying the pages calls generic_file_buffered_write which calls generic_perform_write which calls balance_dirty_pages_rate_limited which ends up calling writeback_inodes which ends up calling back into the NFS client to writes out dirty pages for the same file that pdflush happens to be working with. Signed-off-by: Peter Staubach [modification by Trond to merge the two similar patches] Signed-off-by: Trond Myklebust commit fb8a1f11b64e213d94dfa1cebb2a42a7b8c115c4 Author: Trond Myklebust Date: Wed Mar 11 14:10:29 2009 -0400 NFS: cleanup - remove struct nfs_inode->ncommit Signed-off-by: Trond Myklebust commit a65318bf3afc93ce49227e849d213799b072c5fd Author: Trond Myklebust Date: Wed Mar 11 14:10:28 2009 -0400 NFSv4: Simplify some cache consistency post-op GETATTRs Certain asynchronous operations such as write() do not expect (or care) that other metadata such as the file owner, mode, acls, ... change. All they want to do is update and/or check the change attribute, ctime, and mtime. By skipping the file owner and group update, we also avoid having to do a potential idmapper upcall for these asynchronous RPC calls. Signed-off-by: Trond Myklebust commit 69aaaae18f7027d9594bce100378f102926cc0be Author: Trond Myklebust Date: Wed Mar 11 14:10:28 2009 -0400 NFSv4: A referral is assumed to always point to a directory. Fix a bug whereby we would fail to create a mount point for a referral. Signed-off-by: Trond Myklebust commit 409924e4c943072a63c43bb6b77576bf12f1896b Author: Trond Myklebust Date: Wed Mar 11 14:10:27 2009 -0400 NFSv4: Make decode_getfattr() set fattr->valid to reflect what was decoded Signed-off-by: Trond Myklebust commit f26c7a78876ccd6c9b477ab4ca127aa1a4ef68c7 Author: Trond Myklebust Date: Wed Mar 11 14:10:26 2009 -0400 NFSv4: Clean up decode_getfattr() Signed-off-by: Trond Myklebust commit bca794785c2c12ecddeb09e70165b8ff80baa6ae Author: Trond Myklebust Date: Wed Mar 11 14:10:26 2009 -0400 NFS: Fix the type of struct nfs_fattr->mode There is no point in using anything other than umode_t, since we copy the content pretty much directly into inode->i_mode. Signed-off-by: Trond Myklebust commit 1ca277d88dafdbc3c5a69d32590e7184b9af6371 Author: Trond Myklebust Date: Wed Mar 11 14:10:25 2009 -0400 NFS: Shrink the struct nfs_fattr We don't need the bitmap[] field anymore, since the 'valid' field tells us all we need to know about which attributes were filled in... Also move the pre-op attributes in order to improve the structure packing. Signed-off-by: Trond Myklebust commit 9e6e70f8d8b6698e0017c56b86525aabe9c7cd4c Author: Trond Myklebust Date: Wed Mar 11 14:10:24 2009 -0400 NFSv4: Support NFSv4 optional attributes in the struct nfs_fattr Currently, filling struct nfs_fattr is more or less an all or nothing operation, since NFSv2 and NFSv3 have only mandatory attributes. In NFSv4, some attributes are optional, and so we may simply not be able to fill in those fields. Furthermore, NFSv4 allows you to specify which attributes you are interested in retrieving, thus permitting you to optimise away retrieval of attributes that you know will no change... Signed-off-by: Trond Myklebust commit 78f945f88ef83dcc7c962614a080e0a9a2db5889 Author: Trond Myklebust Date: Wed Mar 11 14:10:23 2009 -0400 NFSv4: Ignore errors on the post-op attributes in SETATTR calls There is no need to fail or retry a SETATTR call just because the post-op GETATTR failed. Signed-off-by: Trond Myklebust commit 37d9d76d8b3a2ac5817e1fa3263cfe0fdb439e51 Author: NeilBrown Date: Wed Mar 11 14:10:23 2009 -0400 NFS: flush cached directory information slightly more readily. If cached directory contents becomes incorrect, there is no way to flush the contents. This contrasts with files where file locking is the recommended way to ensure cache consistency between multiple applications (a read-lock always flushes the cache). Also while changes to files often change the size of the file (thus triggering a cache flush), changes to directories often do not change the apparent size (as the size is often rounded to a block size). So it is particularly important with directories to avoid the possibility of an incorrect cache wherever possible. When the link count on a directory changes it implies a change in the number of child directories, and so a change in the contents of this directory. So use that as a trigger to flush cached contents. When the ctime changes but the mtime does not, there are two possible reasons. 1/ The owner/mode information has been changed. 2/ utimes has been used to set the mtime backwards. In the first case, a data-cache flush is not required. In the second case it is. So on the basis that correctness trumps performance, flush the directory contents cache in this case also. Signed-off-by: NeilBrown Signed-off-by: Trond Myklebust commit 2b57dc6cf9bf31edc0df430ea18dd1dbd3028975 Author: Suresh Jayaraman Date: Wed Mar 11 14:10:22 2009 -0400 NFS: Minor __nfs_revalidate_inode cleanup Remove redundant NFS_STALE() check, a leftover due to the commit 691beb13cdc88358334ef0ba867c080a247a760f Signed-off-by: Suresh Jayaraman Signed-off-by: Trond Myklebust commit fe315e76fc3a3f9f7e1581dc22fec7e7719f0896 Author: Chuck Lever Date: Wed Mar 11 14:10:21 2009 -0400 SUNRPC: Avoid spurious wake-up during UDP connect processing To clear out old state, the UDP connect workers unconditionally invoke xs_close() before proceeding with a new connect. Nowadays this causes a spurious wake-up of the task waiting for the connect to complete. This is a little racey, but usually harmless. The waiting task immediately retries the connect via a call_bind/call_connect sequence, which usually finds the transport already in the connected state because the connect worker has finished in the background. To avoid a spurious wake-up, factor the xs_close() logic that resets the underlying socket into a helper, and have the UDP connect workers call that helper instead of xs_close(). Signed-off-by: Chuck Lever Signed-off-by: Trond Myklebust commit 8b823e2e21e197bab497272278da0d9cdb48d5ec Author: David Howells Date: Fri Feb 6 13:11:27 2009 +0000 NFS: Add mount options to enable local caching on NFS Add NFS mount options to allow the local caching support to be enabled. The attached patch makes it possible for the NFS filesystem to be told to make use of the network filesystem local caching service (FS-Cache). To be able to use this, a recent nfsutils package is required. There are three variant NFS mount options that can be added to a mount command to control caching for a mount. Only the last one specified takes effect: (*) Adding "fsc" will request caching. (*) Adding "fsc=" will request caching and also specify a uniquifier. (*) Adding "nofsc" will disable caching. For example: mount warthog:/ /a -o fsc The cache of a particular superblock (NFS FSID) will be shared between all mounts of that volume, provided they have the same connection parameters and are not marked 'nosharecache'. Where it is otherwise impossible to distinguish superblocks because all the parameters are identical, but the 'nosharecache' option is supplied, a uniquifying string must be supplied, else only the first mount will be permitted to use the cache. If there's a key collision, then the second mount will disable caching and give a warning into the kernel log. Signed-off-by: David Howells commit 8fe6d6ba0759eafc5a6a52ab1f3f08dc7e9142a0 Author: David Howells Date: Fri Feb 6 13:11:27 2009 +0000 NFS: Display local caching state Display the local caching state in /proc/fs/nfsfs/volumes. Signed-off-by: David Howells commit d8fedcfd8d752e596ef9ab3c7903f4650a1c6466 Author: David Howells Date: Fri Feb 6 13:11:27 2009 +0000 NFS: Store pages from an NFS inode into a local cache Store pages from an NFS inode into the cache data storage object associated with that inode. Signed-off-by: David Howells commit d6b69cdbcbd82a4b337cfee45206057d8c81b308 Author: David Howells Date: Fri Feb 6 13:11:27 2009 +0000 NFS: Read pages from FS-Cache into an NFS inode Read pages from an FS-Cache data storage object representing an inode into an NFS inode. Signed-off-by: David Howells commit fa15948a5d315ebfe84a49dacfcedd64e85af90c Author: David Howells Date: Fri Feb 6 13:11:27 2009 +0000 NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching nfs_readpage_async() needs to be non-static so that it can be used as a fallback for the local on-disk caching should an EIO crop up when reading the cache. Signed-off-by: David Howells commit d45a3d2ebed6e1792efef8509e9cd462f21c7c94 Author: David Howells Date: Fri Feb 6 13:11:26 2009 +0000 NFS: Add read context retention for FS-Cache to call back with Add read context retention so that FS-Cache can call back into NFS when a read operation on the cache fails EIO rather than reading data. This permits NFS to then fetch the data from the server instead using the appropriate security context. Signed-off-by: David Howells commit 2206e37915fa20be2434fe70219b24f6a77ea9f1 Author: David Howells Date: Fri Feb 6 13:11:26 2009 +0000 NFS: FS-Cache page management FS-Cache page management for NFS. This includes hooking the releasing and invalidation of pages marked with PG_fscache (aka PG_private_2) and waiting for completion of the write-to-cache flag (PG_fscache_write aka PG_owner_priv_2). Signed-off-by: David Howells commit 976cc35b133f5243c04ea0e8588476fe208e5d1b Author: David Howells Date: Fri Feb 6 13:11:26 2009 +0000 NFS: Add some new I/O counters for FS-Cache doing things for NFS Add some new NFS I/O counters for FS-Cache doing things for NFS. A new line is emitted into /proc/pid/mountstats if caching is enabled that looks like: fsc: Where is the number of pages read successfully from the cache, is the number of failed page reads against the cache, is the number of successful page writes to the cache, is the number of failed page writes to the cache, and is the number of NFS pages that have been disconnected from the cache. Signed-off-by: David Howells commit 20e6393664aa14623349f4000ef9f62b1a85f7fe Author: David Howells Date: Fri Feb 6 13:11:25 2009 +0000 NFS: Invalidate FsCache page flags when cache removed Invalidate the FsCache page flags on the pages belonging to an inode when the cache backing that NFS inode is removed. This allows a live cache to be withdrawn. Signed-off-by: David Howells commit 1c65015a3a26cdf57695547092cc65439dbc6440 Author: David Howells Date: Fri Feb 6 13:11:25 2009 +0000 NFS: Use local disk inode cache Bind data storage objects in the local cache to NFS inodes. Signed-off-by: David Howells commit aea9f35128da21d35c2176ee7871238153494931 Author: David Howells Date: Fri Feb 6 13:11:25 2009 +0000 NFS: Define and create inode-level cache objects Define and create inode-level cache data storage objects (as managed by nfs_inode structs). Each inode-level object is created in a superblock-level index object and is itself a data storage object into which pages from the inode are stored. The inode object key is the NFS file handle for the inode. The inode object is given coherency data to carry in the auxiliary data permitted by the cache. This is a sequence made up of: (1) i_mtime from the NFS inode. (2) i_ctime from the NFS inode. (3) i_size from the NFS inode. (4) change_attr from the NFSv4 attribute data. As the cache is a persistent cache, the auxiliary data is checked when a new NFS in-memory inode is set up that matches an already existing data storage object in the cache. If the coherency data is the same, the on-disk object is retained and used; if not, it is scrapped and a new one created. Signed-off-by: David Howells commit 59aae6bb7177d819c5ebe67bba6cb740b94c6e19 Author: David Howells Date: Fri Feb 6 13:11:25 2009 +0000 NFS: Define and create superblock-level objects Define and create superblock-level cache index objects (as managed by nfs_server structs). Each superblock object is created in a server level index object and is itself an index into which inode-level objects are inserted. Ideally there would be one superblock-level object per server, and the former would be folded into the latter; however, since the "nosharecache" option exists this isn't possible. The superblock object key is a sequence consisting of: (1) Certain superblock s_flags. (2) Various connection parameters that serve to distinguish superblocks for sget(). (3) The volume FSID. (4) The security flavour. (5) The uniquifier length. (6) The uniquifier text. This is normally an empty string, unless the fsc=xyz mount option was used to explicitly specify a uniquifier. The key blob is of variable length, depending on the length of (6). The superblock object is given no coherency data to carry in the auxiliary data permitted by the cache. It is assumed that the superblock is always coherent. This patch also adds uniquification handling such that two otherwise identical superblocks, at least one of which is marked "nosharecache", won't end up trying to share the on-disk cache. It will be possible to manually provide a uniquifier through a mount option with a later patch to avoid the error otherwise produced. Signed-off-by: David Howells commit cd4d738fb46391e47fb70f3e71c96a11230dff17 Author: David Howells Date: Fri Feb 6 13:11:25 2009 +0000 NFS: Define and create server-level objects Define and create server-level cache index objects (as managed by nfs_client structs). Each server object is created in the NFS top-level index object and is itself an index into which superblock-level objects are inserted. Ideally there would be one superblock-level object per server, and the former would be folded into the latter; however, since the "nosharecache" option exists this isn't possible. The server object key is a sequence consisting of: (1) NFS version (2) Server address family (eg: AF_INET or AF_INET6) (3) Server port. (4) Server IP address. The key blob is of variable length, depending on the length of (4). The server object is given no coherency data to carry in the auxiliary data permitted by the cache. Signed-off-by: David Howells commit 12d25ea488c3480b8014a2457c410062574406e2 Author: David Howells Date: Fri Feb 6 13:11:25 2009 +0000 NFS: Register NFS for caching and retrieve the top-level index Register NFS for caching and retrieve the top-level cache index object cookie. Signed-off-by: David Howells commit b7aac6f5e916a59e1314d964e5dba279a66f2199 Author: David Howells Date: Fri Feb 6 13:11:24 2009 +0000 NFS: Permit local filesystem caching to be enabled for NFS Permit local filesystem caching to be enabled for NFS in the kernel configuration. Signed-off-by: David Howells commit a44f0e333217b76f993cfb9da1c532c87b96b576 Author: David Howells Date: Fri Feb 6 13:11:24 2009 +0000 NFS: Add FS-Cache option bit and debug bit Add FS-Cache option bit to nfs_server struct. This is set to indicate local on-disk caching is enabled for a particular superblock. Also add debug bit for local caching operations. Signed-off-by: David Howells commit e858675049961c24d12a9fb66aae6638ff30abb8 Author: David Howells Date: Fri Feb 6 13:11:24 2009 +0000 NFS: Add comment banners to some NFS functions Add comment banners to some NFS functions so that they can be modified by the NFS fscache patches for further information. Signed-off-by: David Howells commit 1a2ebad25e4597c328b5a23823af9590411f4e7f Author: David Howells Date: Fri Feb 6 13:11:24 2009 +0000 FS-Cache: Make kAFS use FS-Cache The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and through it any attached caches. The kAFS filesystem will use caching automatically if it's available. Signed-off-by: David Howells commit 96a16ef5cd64d79f9706b9dccdb180d248c46866 Author: David Howells Date: Fri Feb 6 13:11:24 2009 +0000 CacheFiles: A cache that backs onto a mounted filesystem Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a backing store for the cache. CacheFiles uses a userspace daemon to do some of the cache management - such as reaping stale nodes and culling. This is called cachefilesd and lives in /sbin. The source for the daemon can be downloaded from: http://people.redhat.com/~dhowells/cachefs/cachefilesd.c And an example configuration from: http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf The filesystem and data integrity of the cache are only as good as those of the filesystem providing the backing services. Note that CacheFiles does not attempt to journal anything since the journalling interfaces of the various filesystems are very specific in nature. CacheFiles creates a misc character device - "/dev/cachefiles" - that is used to communication with the daemon. Only one thing may have this open at once, and whilst it is open, a cache is at least partially in existence. The daemon opens this and sends commands down it to control the cache. CacheFiles is currently limited to a single cache. CacheFiles attempts to maintain at least a certain percentage of free space on the filesystem, shrinking the cache by culling the objects it contains to make space if necessary - see the "Cache Culling" section. This means it can be placed on the same medium as a live set of data, and will expand to make use of spare space and automatically contract when the set of data requires more space. ============ REQUIREMENTS ============ The use of CacheFiles and its daemon requires the following features to be available in the system and in the cache filesystem: - dnotify. - extended attributes (xattrs). - openat() and friends. - bmap() support on files in the filesystem (FIBMAP ioctl). - The use of bmap() to detect a partial page at the end of the file. It is strongly recommended that the "dir_index" option is enabled on Ext3 filesystems being used as a cache. ============= CONFIGURATION ============= The cache is configured by a script in /etc/cachefilesd.conf. These commands set up cache ready for use. The following script commands are available: (*) brun % (*) bcull % (*) bstop % (*) frun % (*) fcull % (*) fstop % Configure the culling limits. Optional. See the section on culling The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. The commands beginning with a 'b' are file space (block) limits, those beginning with an 'f' are file count limits. (*) dir Specify the directory containing the root of the cache. Mandatory. (*) tag Specify a tag to FS-Cache to use in distinguishing multiple caches. Optional. The default is "CacheFiles". (*) debug Specify a numeric bitmask to control debugging in the kernel module. Optional. The default is zero (all off). The following values can be OR'd into the mask to collect various information: 1 Turn on trace of function entry (_enter() macros) 2 Turn on trace of function exit (_leave() macros) 4 Turn on trace of internal debug points (_debug()) This mask can also be set through sysfs, eg: echo 5 >/sys/modules/cachefiles/parameters/debug ================== STARTING THE CACHE ================== The cache is started by running the daemon. The daemon opens the cache device, configures the cache and tells it to begin caching. At that point the cache binds to fscache and the cache becomes live. The daemon is run as follows: /sbin/cachefilesd [-d]* [-s] [-n] [-f ] The flags are: (*) -d Increase the debugging level. This can be specified multiple times and is cumulative with itself. (*) -s Send messages to stderr instead of syslog. (*) -n Don't daemonise and go into background. (*) -f Use an alternative configuration file rather than the default one. =============== THINGS TO AVOID =============== Do not mount other things within the cache as this will cause problems. The kernel module contains its own very cut-down path walking facility that ignores mountpoints, but the daemon can't avoid them. Do not create, rename or unlink files and directories in the cache whilst the cache is active, as this may cause the state to become uncertain. Renaming files in the cache might make objects appear to be other objects (the filename is part of the lookup key). Do not change or remove the extended attributes attached to cache files by the cache as this will cause the cache state management to get confused. Do not create files or directories in the cache, lest the cache get confused or serve incorrect data. Do not chmod files in the cache. The module creates things with minimal permissions to prevent random users being able to access them directly. ============= CACHE CULLING ============= The cache may need culling occasionally to make space. This involves discarding objects from the cache that have been used less recently than anything else. Culling is based on the access time of data objects. Empty directories are culled if not in use. Cache culling is done on the basis of the percentage of blocks and the percentage of files available in the underlying filesystem. There are six "limits": (*) brun (*) frun If the amount of free space and the number of available files in the cache rises above both these limits, then culling is turned off. (*) bcull (*) fcull If the amount of available space or the number of available files in the cache falls below either of these limits, then culling is started. (*) bstop (*) fstop If the amount of available space or the number of available files in the cache falls below either of these limits, then no further allocation of disk space or files is permitted until culling has raised things above these limits again. These must be configured thusly: 0 <= bstop < bcull < brun < 100 0 <= fstop < fcull < frun < 100 Note that these are percentages of available space and available files, and do _not_ appear as 100 minus the percentage displayed by the "df" program. The userspace daemon scans the cache to build up a table of cullable objects. These are then culled in least recently used order. A new scan of the cache is started as soon as space is made in the table. Objects will be skipped if their atimes have changed or if the kernel module says it is still using them. =============== CACHE STRUCTURE =============== The CacheFiles module will create two directories in the directory it was given: (*) cache/ (*) graveyard/ The active cache objects all reside in the first directory. The CacheFiles kernel module moves any retired or culled objects that it can't simply unlink to the graveyard from which the daemon will actually delete them. The daemon uses dnotify to monitor the graveyard directory, and will delete anything that appears therein. The module represents index objects as directories with the filename "I..." or "J...". Note that the "cache/" directory is itself a special index. Data objects are represented as files if they have no children, or directories if they do. Their filenames all begin "D..." or "E...". If represented as a directory, data objects will have a file in the directory called "data" that actually holds the data. Special objects are similar to data objects, except their filenames begin "S..." or "T...". If an object has children, then it will be represented as a directory. Immediately in the representative directory are a collection of directories named for hash values of the child object keys with an '@' prepended. Into this directory, if possible, will be placed the representations of the child objects: INDEX INDEX INDEX DATA FILES ========= ========== ================================= ================ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry If the key is so long that it exceeds NAME_MAX with the decorations added on to it, then it will be cut into pieces, the first few of which will be used to make a nest of directories, and the last one of which will be the objects inside the last directory. The names of the intermediate directories will have '+' prepended: J1223/@23/+xy...z/+kl...m/Epqr Note that keys are raw data, and not only may they exceed NAME_MAX in size, they may also contain things like '/' and NUL characters, and so they may not be suitable for turning directly into a filename. To handle this, CacheFiles will use a suitably printable filename directly and "base-64" encode ones that aren't directly suitable. The two versions of object filenames indicate the encoding: OBJECT TYPE PRINTABLE ENCODED =============== =============== =============== Index "I..." "J..." Data "D..." "E..." Special "S..." "T..." Intermediate directories are always "@" or "+" as appropriate. Each object in the cache has an extended attribute label that holds the object type ID (required to distinguish special objects) and the auxiliary data from the netfs. The latter is used to detect stale objects in the cache and update or retire them. Note that CacheFiles will erase from the cache any file it doesn't recognise or any file of an incorrect type (such as a FIFO file or a device file). ========================== SECURITY MODEL AND SELINUX ========================== CacheFiles is implemented to deal properly with the LSM security features of the Linux kernel and the SELinux facility. One of the problems that CacheFiles faces is that it is generally acting on behalf of a process, and running in that process's context, and that includes a security context that is not appropriate for accessing the cache - either because the files in the cache are inaccessible to that process, or because if the process creates a file in the cache, that file may be inaccessible to other processes. The way CacheFiles works is to temporarily change the security context (fsuid, fsgid and actor security label) that the process acts as - without changing the security context of the process when it the target of an operation performed by some other process (so signalling and suchlike still work correctly). When the CacheFiles module is asked to bind to its cache, it: (1) Finds the security label attached to the root cache directory and uses that as the security label with which it will create files. By default, this is: cachefiles_var_t (2) Finds the security label of the process which issued the bind request (presumed to be the cachefilesd daemon), which by default will be: cachefilesd_t and asks LSM to supply a security ID as which it should act given the daemon's label. By default, this will be: cachefiles_kernel_t SELinux transitions the daemon's security ID to the module's security ID based on a rule of this form in the policy. type_transition kernel_t : process ; For instance: type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; The module's security ID gives it permission to create, move and remove files and directories in the cache, to find and access directories and files in the cache, to set and access extended attributes on cache objects, and to read and write files in the cache. The daemon's security ID gives it only a very restricted set of permissions: it may scan directories, stat files and erase files and directories. It may not read or write files in the cache, and so it is precluded from accessing the data cached therein; nor is it permitted to create new files in the cache. There are policy source files available in: http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 and later versions. In that tarball, see the files: cachefilesd.te cachefilesd.fc cachefilesd.if They are built and installed directly by the RPM. If a non-RPM based system is being used, then copy the above files to their own directory and run: make -f /usr/share/selinux/devel/Makefile semodule -i cachefilesd.pp You will need checkpolicy and selinux-policy-devel installed prior to the build. By default, the cache is located in /var/fscache, but if it is desirable that it should be elsewhere, than either the above policy files must be altered, or an auxiliary policy must be installed to label the alternate location of the cache. For instructions on how to add an auxiliary policy to enable the cache to be located elsewhere when SELinux is in enforcing mode, please see: /usr/share/doc/cachefilesd-*/move-cache.txt When the cachefilesd rpm is installed; alternatively, the document can be found in the sources. ================== A NOTE ON SECURITY ================== CacheFiles makes use of the split security in the task_struct. It allocates its own task_security structure, and redirects current->act_as to point to it when it acts on behalf of another process, in that process's context. The reason it does this is that it calls vfs_mkdir() and suchlike rather than bypassing security and calling inode ops directly. Therefore the VFS and LSM may deny the CacheFiles access to the cache data because under some circumstances the caching code is running in the security context of whatever process issued the original syscall on the netfs. Furthermore, should CacheFiles create a file or directory, the security parameters with that object is created (UID, GID, security label) would be derived from that process that issued the system call, thus potentially preventing other processes from accessing the cache - including CacheFiles's cache management daemon (cachefilesd). What is required is to temporarily override the security of the process that issued the system call. We can't, however, just do an in-place change of the security data as that affects the process as an object, not just as a subject. This means it may lose signals or ptrace events for example, and affects what the process looks like in /proc. So CacheFiles makes use of a logical split in the security between the objective security (task->sec) and the subjective security (task->act_as). The objective security holds the intrinsic security properties of a process and is never overridden. This is what appears in /proc, and is what is used when a process is the target of an operation by some other process (SIGKILL for example). The subjective security holds the active security properties of a process, and may be overridden. This is not seen externally, and is used whan a process acts upon another object, for example SIGKILLing another process or opening a file. LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request for CacheFiles to run in a context of a specific security label, or to create files and directories with another security label. This documentation is added by the patch to: Documentation/filesystems/caching/cachefiles.txt Signed-Off-By: David Howells commit 26e7056dcd81e3cf15652224d09b31916dd7731d Author: David Howells Date: Fri Feb 6 13:11:24 2009 +0000 CacheFiles: Export things for CacheFiles Export a number of functions for CacheFiles's use. Signed-off-by: David Howells commit 22bb2f31f060692b84ca0792835f92e60548ec80 Author: David Howells Date: Fri Feb 6 13:11:24 2009 +0000 CacheFiles: Permit the page lock state to be monitored Add a function to install a monitor on the page lock waitqueue for a particular page, thus allowing the page being unlocked to be detected. This is used by CacheFiles to detect read completion on a page in the backing filesystem so that it can then copy the data to the waiting netfs page. Signed-off-by: David Howells commit d45eed81be69809255ea6ef3b350f68f59651545 Author: David Howells Date: Fri Feb 6 13:11:24 2009 +0000 CacheFiles: Add a hook to write a single page of data to an inode Add an address space operation to write one single page of data to an inode at a page-aligned location (thus permitting the implementation to be highly optimised). The data source is a single page. This is used by CacheFiles to store the contents of netfs pages into their backing file pages. Supply a generic implementation for this that uses the write_begin() and write_end() address_space operations to bind a copy directly into the page cache. Hook the Ext2 and Ext3 operations to the generic implementation. Signed-off-by: David Howells commit a14f0b2c18cfbf0233ac9103dcfee8a3c507252c Author: David Howells Date: Fri Feb 6 13:11:23 2009 +0000 CacheFiles: Be consistent about the use of mapping vs file->f_mapping in Ext3 Change all the usages of file->f_mapping in ext3_*write_end() functions to use the mapping argument directly. This has two consequences: (*) Consistency. Without this patch sometimes one is used and sometimes the other is. (*) A NULL file pointer can be passed. This feature is then made use of by the generic hook in the next patch, which is used by CacheFiles to write pages to a file without setting up a file struct. Signed-off-by: David Howells commit af7f26be82dd796add846df2b317abbb75dba422 Author: David Howells Date: Fri Feb 6 13:11:23 2009 +0000 FS-Cache: Implement data I/O part of netfs API Implement the data I/O part of the FS-Cache netfs API. The documentation and API header file were added in a previous patch. This patch implements the following functions for the netfs to call: (*) fscache_attr_changed(). Indicate that the object has changed its attributes. The only attribute currently recorded is the file size. Only pages within the set file size will be stored in the cache. This operation is submitted for asynchronous processing, and will return immediately. It will return -ENOMEM if an out of memory error is encountered, -ENOBUFS if the object is not actually cached, or 0 if the operation is successfully queued. (*) fscache_read_or_alloc_page(). (*) fscache_read_or_alloc_pages(). Request data be fetched from the disk, and allocate internal metadata to track the netfs pages and reserve disk space for unknown pages. These operations perform semi-asynchronous data reads. Upon returning they will indicate which pages they think can be retrieved from disk, and will have set in progress attempts to retrieve those pages. These will return, in order of preference, -ENOMEM on memory allocation error, -ERESTARTSYS if a signal interrupted proceedings, -ENODATA if one or more requested pages are not yet cached, -ENOBUFS if the object is not actually cached or if there isn't space for future pages to be cached on this object, or 0 if successful. In the case of the multipage function, the pages for which reads are set in progress will be removed from the list and the page count decreased appropriately. If any read operations should fail, the completion function will be given an error, and will also be passed contextual information to allow the netfs to fall back to querying the server for the absent pages. For each successful read, the page completion function will also be called. Any pages subsequently tracked by the cache will have PG_fscache set upon them on return. fscache_uncache_page() must be called for such pages. If supplied by the netfs, the mark_pages_cached() cookie op will be invoked for any pages now tracked. (*) fscache_alloc_page(). Allocate internal metadata to track a netfs page and reserve disk space. This will return -ENOMEM on memory allocation error, -ERESTARTSYS on signal, -ENOBUFS if the object isn't cached, or there isn't enough space in the cache, or 0 if successful. Any pages subsequently tracked by the cache will have PG_fscache set upon them on return. fscache_uncache_page() must be called for such pages. If supplied by the netfs, the mark_pages_cached() cookie op will be invoked for any pages now tracked. (*) fscache_write_page(). Request data be stored to disk. This may only be called on pages that have been read or alloc'd by the above three functions and have not yet been uncached. This will return -ENOMEM on memory allocation error, -ERESTARTSYS on signal, -ENOBUFS if the object isn't cached, or there isn't immediately enough space in the cache, or 0 if successful. On a successful return, this operation will have queued the page for asynchronous writing to the cache. The page will be returned with PG_fscache_write set until the write completes one way or another. The caller will not be notified if the write fails due to an I/O error. If that happens, the object will become available and all pending writes will be aborted. Note that the cache may batch up page writes, and so it may take a while to get around to writing them out. The caller must assume that until PG_fscache_write is cleared the page is use by the cache. Any changes made to the page may be reflected on disk. The page may even be under DMA. (*) fscache_uncache_page(). Indicate that the cache should stop tracking a page previously read or alloc'd from the cache. If the page was alloc'd only, but unwritten, it will not appear on disk. Signed-off-by: David Howells commit 9ee509a134cc784a71b5954cfd9b13e2f5012a29 Author: David Howells Date: Fri Feb 6 13:11:23 2009 +0000 FS-Cache: Add and document asynchronous operation handling Add and document asynchronous operation handling for use by FS-Cache's data storage and retrieval routines. The following documentation is added to: Documentation/filesystems/caching/operations.txt ================================ ASYNCHRONOUS OPERATIONS HANDLING ================================ ======== OVERVIEW ======== FS-Cache has an asynchronous operations handling facility that it uses for its data storage and retrieval routines. Its operations are represented by fscache_operation structs, though these are usually embedded into some other structure. This facility is available to and expected to be be used by the cache backends, and FS-Cache will create operations and pass them off to the appropriate cache backend for completion. To make use of this facility, should be #included. =============================== OPERATION RECORD INITIALISATION =============================== An operation is recorded in an fscache_operation struct: struct fscache_operation { union { struct work_struct fast_work; struct slow_work slow_work; }; unsigned long flags; fscache_operation_processor_t processor; ... }; Someone wanting to issue an operation should allocate something with this struct embedded in it. They should initialise it by calling: void fscache_operation_init(struct fscache_operation *op, fscache_operation_release_t release); with the operation to be initialised and the release function to use. The op->flags parameter should be set to indicate the CPU time provision and the exclusivity (see the Parameters section). The op->fast_work, op->slow_work and op->processor flags should be set as appropriate for the CPU time provision (see the Parameters section). FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the operation and waited for afterwards. ========== PARAMETERS ========== There are a number of parameters that can be set in the operation record's flag parameter. There are three options for the provision of CPU time in these operations: (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread may decide it wants to handle an operation itself without deferring it to another thread. This is, for example, used in read operations for calling readpages() on the backing filesystem in CacheFiles. Although readpages() does an asynchronous data fetch, the determination of whether pages exist is done synchronously - and the netfs does not proceed until this has been determined. If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags before submitting the operation, and the operating thread must wait for it to be cleared before proceeding: wait_on_bit(&op->flags, FSCACHE_OP_WAITING, fscache_wait_bit, TASK_UNINTERRUPTIBLE); (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it will be given to keventd to process. Such an operation is not permitted to sleep on I/O. This is, for example, used by CacheFiles to copy data from a backing fs page to a netfs page after the backing fs has read the page in. If this option is used, op->fast_work and op->processor must be initialised before submitting the operation: INIT_WORK(&op->fast_work, do_some_work); (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it will be given to the slow work facility to process. Such an operation is permitted to sleep on I/O. This is, for example, used by FS-Cache to handle background writes of pages that have just been fetched from a remote server. If this option is used, op->slow_work and op->processor must be initialised before submitting the operation: fscache_operation_init_slow(op, processor) Furthermore, operations may be one of two types: (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in conjunction with any other operation on the object being operated upon. An example of this is the attribute change operation, in which the file being written to may need truncation. (2) Shareable. Operations of this type may be running simultaneously. It's up to the operation implementation to prevent interference between other operations running at the same time. ========= PROCEDURE ========= Operations are used through the following procedure: (1) The submitting thread must allocate the operation and initialise it itself. Normally this would be part of a more specific structure with the generic op embedded within. (2) The submitting thread must then submit the operation for processing using one of the following two functions: int fscache_submit_op(struct fscache_object *object, struct fscache_operation *op); int fscache_submit_exclusive_op(struct fscache_object *object, struct fscache_operation *op); The first function should be used to submit non-exclusive ops and the second to submit exclusive ones. The caller must still set the FSCACHE_OP_EXCLUSIVE flag. If successful, both functions will assign the operation to the specified object and return 0. -ENOBUFS will be returned if the object specified is permanently unavailable. The operation manager will defer operations on an object that is still undergoing lookup or creation. The operation will also be deferred if an operation of conflicting exclusivity is in progress on the object. If the operation is asynchronous, the manager will retain a reference to it, so the caller should put their reference to it by passing it to: void fscache_put_operation(struct fscache_operation *op); (3) If the submitting thread wants to do the work itself, and has marked the operation with FSCACHE_OP_MYTHREAD, then it should monitor FSCACHE_OP_WAITING as described above and check the state of the object if necessary (the object might have died whilst the thread was waiting). When it has finished doing its processing, it should call fscache_put_operation() on it. (4) The operation holds an effective lock upon the object, preventing other exclusive ops conflicting until it is released. The operation can be enqueued for further immediate asynchronous processing by adjusting the CPU time provisioning option if necessary, eg: op->flags &= ~FSCACHE_OP_TYPE; op->flags |= ~FSCACHE_OP_FAST; and calling: void fscache_enqueue_operation(struct fscache_operation *op) This can be used to allow other things to have use of the worker thread pools. ===================== ASYNCHRONOUS CALLBACK ===================== When used in asynchronous mode, the worker thread pool will invoke the processor method with a pointer to the operation. This should then get at the container struct by using container_of(): static void fscache_write_op(struct fscache_operation *_op) { struct fscache_storage *op = container_of(_op, struct fscache_storage, op); ... } The caller holds a reference on the operation, and will invoke fscache_put_operation() when the processor function returns. The processor function is at liberty to call fscache_enqueue_operation() or to take extra references. Signed-off-by: David Howells commit efb5a586097b3f3cc0ae0782be026aedac104e19 Author: David Howells Date: Fri Feb 6 13:11:23 2009 +0000 FS-Cache: Implement the cookie management part of the netfs API Implement the cookie management part of the FS-Cache netfs client API. The documentation and API header file were added in a previous patch. This patch implements the following three functions: (1) fscache_acquire_cookie(). Acquire a cookie to represent an object to the netfs. If the object in question is a non-index object, then that object and its parent indices will be created on disk at this point if they don't already exist. Index creation is deferred because an index may reside in multiple caches. (2) fscache_relinquish_cookie(). Retire or release a cookie previously acquired. At this point, the object on disk may be destroyed. (3) fscache_update_cookie(). Update the in-cache representation of a cookie. This is used to update the auxiliary data for coherency management purposes. With this patch it is possible to have a netfs instruct a cache backend to look up, validate and create metadata on disk and to destroy it again. The ability to actually store and retrieve data in the objects so created is added in later patches. Note that these functions will never return an error. _All_ errors are handled internally to FS-Cache. The worst that can happen is that fscache_acquire_cookie() may return a NULL pointer - which is considered a negative cookie pointer and can be passed back to any function that takes a cookie without harm. A negative cookie pointer merely suppresses caching at that level. The stub in linux/fscache.h will detect inline the negative cookie pointer and abort the operation as fast as possible. This means that the compiler doesn't have to set up for a call in that case. See the documentation in Documentation/filesystems/caching/netfs-api.txt for more information. Signed-off-by: David Howells commit 331c344ebbb9c19b2008359a0d3e64c5ae0e4965 Author: David Howells Date: Fri Feb 6 13:11:23 2009 +0000 FS-Cache: Object management state machine Implement the cache object management state machine. The following documentation is added to illuminate the working of this state machine. It will also be added as: Documentation/filesystems/caching/object.txt ==================================================== IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT ==================================================== ============== REPRESENTATION ============== FS-Cache maintains an in-kernel representation of each object that a netfs is currently interested in. Such objects are represented by the fscache_cookie struct and are referred to as cookies. FS-Cache also maintains a separate in-kernel representation of the objects that a cache backend is currently actively caching. Such objects are represented by the fscache_object struct. The cache backends allocate these upon request, and are expected to embed them in their own representations. These are referred to as objects. There is a 1:N relationship between cookies and objects. A cookie may be represented by multiple objects - an index may exist in more than one cache - or even by no objects (it may not be cached). Furthermore, both cookies and objects are hierarchical. The two hierarchies correspond, but the cookies tree is a superset of the union of the object trees of multiple caches: NETFS INDEX TREE : CACHE 1 : CACHE 2 : : : +-----------+ : +----------->| IObject | : +-----------+ | : +-----------+ : | ICookie |-------+ : | : +-----------+ | : | : +-----------+ | +------------------------------>| IObject | | : | : +-----------+ | : V : | | : +-----------+ : | V +----------->| IObject | : | +-----------+ | : +-----------+ : | | ICookie |-------+ : | : V +-----------+ | : | : +-----------+ | +------------------------------>| IObject | +-----+-----+ : | : +-----------+ | | : | : | V | : V : | +-----------+ | : +-----------+ : | | ICookie |------------------------->| IObject | : | +-----------+ | : +-----------+ : | | V : | : V | +-----------+ : | : +-----------+ | | ICookie |-------------------------------->| IObject | | +-----------+ : | : +-----------+ V | : V : | +-----------+ | : +-----------+ : | | DCookie |------------------------->| DObject | : | +-----------+ | : +-----------+ : | | : : | +-------+-------+ : : | | | : : | V V : : V +-----------+ +-----------+ : : +-----------+ | DCookie | | DCookie |------------------------>| DObject | +-----------+ +-----------+ : : +-----------+ : : In the above illustration, ICookie and IObject represent indices and DCookie and DObject represent data storage objects. Indices may have representation in multiple caches, but currently, non-index objects may not. Objects of any type may also be entirely unrepresented. As far as the netfs API goes, the netfs is only actually permitted to see pointers to the cookies. The cookies themselves and any objects attached to those cookies are hidden from it. =============================== OBJECT MANAGEMENT STATE MACHINE =============================== Within FS-Cache, each active object is managed by its own individual state machine. The state for an object is kept in the fscache_object struct, in object->state. A cookie may point to a set of objects that are in different states. Each state has an action associated with it that is invoked when the machine wakes up in that state. There are four logical sets of states: (1) Preparation: states that wait for the parent objects to become ready. The representations are hierarchical, and it is expected that an object must be created or accessed with respect to its parent object. (2) Initialisation: states that perform lookups in the cache and validate what's found and that create on disk any missing metadata. (3) Normal running: states that allow netfs operations on objects to proceed and that update the state of objects. (4) Termination: states that detach objects from their netfs cookies, that delete objects from disk, that handle disk and system errors and that free up in-memory resources. In most cases, transitioning between states is in response to signalled events. When a state has finished processing, it will usually set the mask of events in which it is interested (object->event_mask) and relinquish the worker thread. Then when an event is raised (by calling fscache_raise_event()), if the event is not masked, the object will be queued for processing (by calling fscache_enqueue_object()). PROVISION OF CPU TIME --------------------- The work to be done by the various states is given CPU time by the threads of the slow work facility (see Documentation/slow-work.txt). This is used in preference to the workqueue facility because: (1) Threads may be completely occupied for very long periods of time by a particular work item. These state actions may be doing sequences of synchronous, journalled disk accesses (lookup, mkdir, create, setxattr, getxattr, truncate, unlink, rmdir, rename). (2) Threads may do little actual work, but may rather spend a lot of time sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded workqueues don't necessarily have the right numbers of threads. LOCKING SIMPLIFICATION ---------------------- Because only one worker thread may be operating on any particular object's state machine at once, this simplifies the locking, particularly with respect to disconnecting the netfs's representation of a cache object (fscache_cookie) from the cache backend's representation (fscache_object) - which may be requested from either end. ================= THE SET OF STATES ================= The object state machine has a set of states that it can be in. There are preparation states in which the object sets itself up and waits for its parent object to transit to a state that allows access to its children: (1) State FSCACHE_OBJECT_INIT. Initialise the object and wait for the parent object to become active. In the cache, it is expected that it will not be possible to look an object up from the parent object, until that parent object itself has been looked up. There are initialisation states in which the object sets itself up and accesses disk for the object metadata: (2) State FSCACHE_OBJECT_LOOKING_UP. Look up the object on disk, using the parent as a starting point. FS-Cache expects the cache backend to probe the cache to see whether this object is represented there, and if it is, to see if it's valid (coherency management). The cache should call fscache_object_lookup_negative() to indicate lookup failure for whatever reason, and should call fscache_obtained_object() to indicate success. At the completion of lookup, FS-Cache will let the netfs go ahead with read operations, no matter whether the file is yet cached. If not yet cached, read operations will be immediately rejected with ENODATA until the first known page is uncached - as to that point there can be no data to be read out of the cache for that file that isn't currently also held in the pagecache. (3) State FSCACHE_OBJECT_CREATING. Create an object on disk, using the parent as a starting point. This happens if the lookup failed to find the object, or if the object's coherency data indicated what's on disk is out of date. In this state, FS-Cache expects the cache to create The cache should call fscache_obtained_object() if creation completes successfully, fscache_object_lookup_negative() otherwise. At the completion of creation, FS-Cache will start processing write operations the netfs has queued for an object. If creation failed, the write ops will be transparently discarded, and nothing recorded in the cache. There are some normal running states in which the object spends its time servicing netfs requests: (4) State FSCACHE_OBJECT_AVAILABLE. A transient state in which pending operations are started, child objects are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary lookup data is freed. (5) State FSCACHE_OBJECT_ACTIVE. The normal running state. In this state, requests the netfs makes will be passed on to the cache. (6) State FSCACHE_OBJECT_UPDATING. The state machine comes here to update the object in the cache from the netfs's records. This involves updating the auxiliary data that is used to maintain coherency. And there are terminal states in which an object cleans itself up, deallocates memory and potentially deletes stuff from disk: (7) State FSCACHE_OBJECT_LC_DYING. The object comes here if it is dying because of a lookup or creation error. This would be due to a disk error or system error of some sort. Temporary data is cleaned up, and the parent is released. (8) State FSCACHE_OBJECT_DYING. The object comes here if it is dying due to an error, because its parent cookie has been relinquished by the netfs or because the cache is being withdrawn. Any child objects waiting on this one are given CPU time so that they too can destroy themselves. This object waits for all its children to go away before advancing to the next state. (9) State FSCACHE_OBJECT_ABORT_INIT. The object comes to this state if it was waiting on its parent in FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself so that the parent may proceed from the FSCACHE_OBJECT_DYING state. (10) State FSCACHE_OBJECT_RELEASING. (11) State FSCACHE_OBJECT_RECYCLING. The object comes to one of these two states when dying once it is rid of all its children, if it is dying because the netfs relinquished its cookie. In the first state, the cached data is expected to persist, and in the second it will be deleted. (12) State FSCACHE_OBJECT_WITHDRAWING. The object transits to this state if the cache decides it wants to withdraw the object from service, perhaps to make space, but also due to error or just because the whole cache is being withdrawn. (13) State FSCACHE_OBJECT_DEAD. The object transits to this state when the in-memory object record is ready to be deleted. The object processor shouldn't ever see an object in this state. THE SET OF EVENTS ----------------- There are a number of events that can be raised to an object state machine: (*) FSCACHE_OBJECT_EV_UPDATE The netfs requested that an object be updated. The state machine will ask the cache backend to update the object, and the cache backend will ask the netfs for details of the change through its cookie definition ops. (*) FSCACHE_OBJECT_EV_CLEARED This is signalled in two circumstances: (a) when an object's last child object is dropped and (b) when the last operation outstanding on an object is completed. This is used to proceed from the dying state. (*) FSCACHE_OBJECT_EV_ERROR This is signalled when an I/O error occurs during the processing of some object. (*) FSCACHE_OBJECT_EV_RELEASE (*) FSCACHE_OBJECT_EV_RETIRE These are signalled when the netfs relinquishes a cookie it was using. The event selected depends on whether the netfs asks for the backing object to be retired (deleted) or retained. (*) FSCACHE_OBJECT_EV_WITHDRAW This is signalled when the cache backend wants to withdraw an object. This means that the object will have to be detached from the netfs's cookie. Because the withdrawing releasing/retiring events are all handled by the object state machine, it doesn't matter if there's a collision with both ends trying to sever the connection at the same time. The state machine can just pick which one it wants to honour, and that effects the other. Signed-off-by: David Howells commit 16cc469e0ab07f2200a4a6e02aa775848265d7b2 Author: David Howells Date: Fri Feb 6 13:11:23 2009 +0000 FS-Cache: Bit waiting helpers Add helpers for use with wait_on_bit(). Signed-off-by: David Howells commit 251e241772680e22b547e6e9a4a4a3fdc8d55cd7 Author: David Howells Date: Fri Feb 6 13:11:23 2009 +0000 FS-Cache: Add netfs registration Add functions to register and unregister a network filesystem or other client of the FS-Cache service. This allocates and releases the cookie representing the top-level index for a netfs, and makes it available to the netfs. If the FS-Cache facility is disabled, then the calls are optimised away at compile time. Note that whilst this patch may appear to work with FS-Cache enabled and a netfs attempting to use it, it will leak the cookie it allocates for the netfs as fscache_relinquish_cookie() is implemented in a later patch. This will cause the slab code to emit a warning when the module is removed. Signed-off-by: David Howells commit 0e4ab7dcd20057c249b4d9256ac51b2725c33c1a Author: David Howells Date: Fri Feb 6 13:11:22 2009 +0000 FS-Cache: Provide a slab for cookie allocation Provide a slab from which can be allocated the FS-Cache cookies that will be presented to the netfs. Also provide a slab constructor and a function to recursively discard a cookie and its ancestor chain. Signed-off-by: David Howells commit 01581a7254818ce4c8cc67a3b7c019dd0dfbaa0e Author: David Howells Date: Fri Feb 6 13:11:22 2009 +0000 FS-Cache: Add cache management Implement the entry points by which a cache backend may initialise, add, declare an error upon and withdraw a cache. Further, an object is created in sysfs under which each cache added will get an object created: /sys/fs/fscache// All of this is described in Documentation/filesystems/caching/backend-api.txt added by a previous patch. Signed-off-by: David Howells commit 8d7a391681e02ac3d04a46c17bea1bef3115d387 Author: David Howells Date: Fri Feb 6 13:11:22 2009 +0000 FS-Cache: Add cache tag handling Implement two features of FS-Cache: (1) The ability to request and release cache tags - names by which a cache may be known to a netfs, and thus selected for use. (2) An internal function by which a cache is selected by consulting the netfs, if the netfs wishes to be consulted. Signed-off-by: David Howells commit 5b9e063241416dcaffa59d4d25e9ab586e145f46 Author: David Howells Date: Fri Feb 6 13:11:22 2009 +0000 FS-Cache: Root index definition Add a description of the root index of the cache for later patches to make use of. The root index is owned by FS-Cache itself. When a netfs requests caching facilities, FS-Cache will, if one doesn't already exist, create an entry in the root index with the key being the name of the netfs ("AFS" for example), and the auxiliary data holding the index structure version supplied by the netfs: FSDEF | +-----------+ | | NFS AFS [v=1] [v=1] If an entry with the appropriate name does already exist, the version is compared. If the version is different, the entire subtree from that entry will be discarded and a new entry created. The new entry will be an index, and a cookie referring to it will be passed to the netfs. This is then the root handle by which the netfs accesses the cache. It can create whatever objects it likes in that index, including further indices. Signed-off-by: David Howells commit d28112f1a4d6607c48a23fff83eb14fd4bb9bd7b Author: David Howells Date: Fri Feb 6 13:11:22 2009 +0000 FS-Cache: Add use of /proc and presentation of statistics Make FS-Cache create its /proc interface and present various statistical information through it. Also provide the functions for updating this information. These features are enabled by: CONFIG_FSCACHE_PROC CONFIG_FSCACHE_STATS CONFIG_FSCACHE_HISTOGRAM The /proc directory for FS-Cache is also exported so that caching modules can add their own statistics there too. The FS-Cache module is loadable at this point, and the statistics files can be examined by userspace: cat /proc/fs/fscache/stats cat /proc/fs/fscache/histogram Signed-off-by: David Howells commit 81a4588b03f5047289eee85ff5e7bcce6d6f42c3 Author: David Howells Date: Fri Feb 6 13:11:22 2009 +0000 FS-Cache: Add main configuration option, module entry points and debugging Add the main configuration option, allowing FS-Cache to be selected; the module entry and exit functions and the debugging stuff used by these patches. The two configuration options added are: CONFIG_FSCACHE CONFIG_FSCACHE_DEBUG The first enables the facility, and the second makes the debugging statements enableable through the "debug" module parameter. The value of this parameter is a bitmask as described in: Documentation/filesystems/caching/fscache.txt The module can be loaded at this point, but all it will do at this point in the patch series is to start up the slow work facility and shut it down again. Signed-off-by: David Howells commit 430c2ab90579048387d14caed3780e9ffffc6b36 Author: David Howells Date: Fri Feb 6 13:11:22 2009 +0000 FS-Cache: Add the FS-Cache cache backend API and documentation Add the API for a generic facility (FS-Cache) by which caches may declare them selves open for business, and may obtain work to be done from network filesystems. The header file is included by: #include Documentation for the API is also added to: Documentation/filesystems/caching/backend-api.txt This API is not usable without the implementation of the utility functions which will be added in further patches. Signed-off-by: David Howells commit 6551cf67d5443df733a8176caf8db40f0fa4c451 Author: David Howells Date: Fri Feb 6 13:11:21 2009 +0000 FS-Cache: Add the FS-Cache netfs API and documentation Add the API for a generic facility (FS-Cache) by which filesystems (such as AFS or NFS) may call on local caching capabilities without having to know anything about how the cache works, or even if there is a cache: +---------+ | | +--------------+ | NFS |--+ | | | | | +-->| CacheFS | +---------+ | +----------+ | | /dev/hda5 | | | | | +--------------+ +---------+ +-->| | | | | | |--+ | AFS |----->| FS-Cache | | | | |--+ +---------+ +-->| | | | | | | +--------------+ +---------+ | +----------+ | | | | | | +-->| CacheFiles | | ISOFS |--+ | /var/cache | | | +--------------+ +---------+ General documentation and documentation of the netfs specific API are provided in addition to the header files. As this patch stands, it is possible to build a filesystem against the facility and attempt to use it. All that will happen is that all requests will be immediately denied as if no cache is present. Further patches will implement the core of the facility. The facility will transfer requests from networking filesystems to appropriate caches if possible, or else gracefully deny them. If this facility is disabled in the kernel configuration, then all its operations will trivially reduce to nothing during compilation. WHY NOT I_MAPPING? ================== I have added my own API to implement caching rather than using i_mapping to do this for a number of reasons. These have been discussed a lot on the LKML and CacheFS mailing lists, but to summarise the basics: (1) Most filesystems don't do hole reportage. Holes in files are treated as blocks of zeros and can't be distinguished otherwise, making it difficult to distinguish blocks that have been read from the network and cached from those that haven't. (2) The backing inode must be fully populated before being exposed to userspace through the main inode because the VM/VFS goes directly to the backing inode and does not interrogate the front inode's VM ops. Therefore: (a) The backing inode must fit entirely within the cache. (b) All backed files currently open must fit entirely within the cache at the same time. (c) A working set of files in total larger than the cache may not be cached. (d) A file may not grow larger than the available space in the cache. (e) A file that's open and cached, and remotely grows larger than the cache is potentially stuffed. (3) Writes go to the backing filesystem, and can only be transferred to the network when the file is closed. (4) There's no record of what changes have been made, so the whole file must be written back. (5) The pages belong to the backing filesystem, and all metadata associated with that page are relevant only to the backing filesystem, and not anything stacked atop it. OVERVIEW ======== FS-Cache provides (or will provide) the following facilities: (1) Caches can be added / removed at any time, even whilst in use. (2) Adds a facility by which tags can be used to refer to caches, even if they're not available yet. (3) More than one cache can be used at once. Caches can be selected explicitly by use of tags. (4) The netfs is provided with an interface that allows either party to withdraw caching facilities from a file (required for (1)). (5) A netfs may annotate cache objects that belongs to it. This permits the storage of coherency maintenance data. (6) Cache objects will be pinnable and space reservations will be possible. (7) The interface to the netfs returns as few errors as possible, preferring rather to let the netfs remain oblivious. (8) Cookies are used to represent indices, files and other objects to the netfs. The simplest cookie is just a NULL pointer - indicating nothing cached there. (9) The netfs is allowed to propose - dynamically - any index hierarchy it desires, though it must be aware that the index search function is recursive, stack space is limited, and indices can only be children of indices. (10) Indices can be used to group files together to reduce key size and to make group invalidation easier. The use of indices may make lookup quicker, but that's cache dependent. (11) Data I/O is effectively done directly to and from the netfs's pages. The netfs indicates that page A is at index B of the data-file represented by cookie C, and that it should be read or written. The cache backend may or may not start I/O on that page, but if it does, a netfs callback will be invoked to indicate completion. The I/O may be either synchronous or asynchronous. (12) Cookies can be "retired" upon release. At this point FS-Cache will mark them as obsolete and the index hierarchy rooted at that point will get recycled. (13) The netfs provides a "match" function for index searches. In addition to saying whether a match was made or not, this can also specify that an entry should be updated or deleted. FS-Cache maintains a virtual index tree in which all indices, files, objects and pages are kept. Bits of this tree may actually reside in one or more caches. FSDEF | +------------------------------------+ | | NFS AFS | | +--------------------------+ +-----------+ | | | | homedir mirror afs.org redhat.com | | | +------------+ +---------------+ +----------+ | | | | | | 00001 00002 00007 00125 vol00001 vol00002 | | | | | +---+---+ +-----+ +---+ +------+------+ +-----+----+ | | | | | | | | | | | | | PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak | | PG0 +-------+ | | 00001 00003 | +---+---+ | | | PG0 PG1 PG2 In the example above, two netfs's can be seen to be backed: NFS and AFS. These have different index hierarchies: (*) The NFS primary index will probably contain per-server indices. Each server index is indexed by NFS file handles to get data file objects. Each data file objects can have an array of pages, but may also have further child objects, such as extended attributes and directory entries. Extended attribute objects themselves have page-array contents. (*) The AFS primary index contains per-cell indices. Each cell index contains per-logical-volume indices. Each of volume index contains up to three indices for the read-write, read-only and backup mirrors of those volumes. Each of these contains vnode data file objects, each of which contains an array of pages. The very top index is the FS-Cache master index in which individual netfs's have entries. Any index object may reside in more than one cache, provided it only has index children. Any index with non-index object children will be assumed to only reside in one cache. The FS-Cache overview can be found in: Documentation/filesystems/caching/fscache.txt The netfs API to FS-Cache can be found in: Documentation/filesystems/caching/netfs-api.txt Signed-off-by: David Howells commit 9cbd0c554b9af1b3944a7004eec069ce2f3d39af Author: David Howells Date: Fri Feb 6 13:11:21 2009 +0000 FS-Cache: Recruit a couple of page flags for cache management Recruit a couple of page flags to aid in cache management. The following extra flags are defined: (1) PG_fscache (PG_private_2) The marked page is backed by a local cache and is pinning resources in the cache driver. (2) PG_fscache_write (PG_owner_priv_2) The marked page is being written to the local cache. The page may not be modified whilst this is in progress. If PG_fscache is set, then things that checked for PG_private will now also check for that. This includes things like truncation and page invalidation. The function page_has_private() had been added to make the checks for both PG_private and PG_private_2 at the same time. Signed-off-by: David Howells commit 5a91e26a389a1b76972c988c3bbf1d2e2bcddaf4 Author: David Howells Date: Fri Feb 6 13:11:21 2009 +0000 FS-Cache: Release page->private after failed readahead The attached patch causes read_cache_pages() to release page-private data on a page for which add_to_page_cache() fails or the filler function fails. This permits pages with caching references associated with them to be cleaned up. The invalidatepage() address space op is called (indirectly) to do the honours. Signed-off-by: David Howells commit 88fc9dd71de93bc44a8455997afcd38544906172 Author: David Howells Date: Fri Feb 6 13:11:21 2009 +0000 Document the slow work thread pool Document the slow work thread pool. Signed-off-by: David Howells commit 5fe1e49bc97b6b0780f230c92b3d3cd73101747a Author: David Howells Date: Fri Feb 6 13:11:21 2009 +0000 Make the slow work pool configurable Make the slow work pool configurable through /proc/sys/kernel/slow-work. (*) /proc/sys/kernel/slow-work/min-threads The minimum number of threads that should be in the pool as long as it is in use. This may be anywhere between 2 and max-threads. (*) /proc/sys/kernel/slow-work/max-threads The maximum number of threads that should in the pool. This may be anywhere between min-threads and 255 or NR_CPUS * 2, whichever is greater. (*) /proc/sys/kernel/slow-work/vslow-percentage The percentage of active threads in the pool that may be used to execute very slow work items. This may be between 1 and 99. The resultant number is bounded to between 1 and one fewer than the number of active threads. This ensures there is always at least one thread that can process very slow work items, and always at least one thread that won't. Signed-off-by: David Howells Acked-by: Serge Hallyn commit 2d951cbb6f901da5926d983c928ae79e00538870 Author: David Howells Date: Fri Feb 6 13:11:21 2009 +0000 Make slow-work thread pool actually dynamic Make the slow-work thread pool actually dynamic in the number of threads it contains. With this patch, it will both create additional threads when it has extra work to do, and cull excess threads that aren't doing anything. Signed-off-by: David Howells Acked-by: Serge Hallyn commit 8a3923ac2bfcba7a98724c2546a146aaa7300fed Author: David Howells Date: Fri Feb 6 13:11:21 2009 +0000 Create a dynamically sized pool of threads for doing very slow work items Create a dynamically sized pool of threads for doing very slow work items, such as invoking mkdir() or rmdir() - things that may take a long time and may sleep, holding mutexes/semaphores and hogging a thread, and are thus unsuitable for workqueues. The number of threads is always at least a settable minimum, but more are started when there's more work to do, up to a limit. Because of the nature of the load, it's not suitable for a 1-thread-per-CPU type pool. A system with one CPU may well want several threads. This is used by FS-Cache to do slow caching operations in the background, such as looking up, creating or deleting cache objects. Signed-off-by: David Howells Acked-by: Serge Hallyn -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/