From: Chuck Lever
Subject: Re: [NFS] Re: [PATCH][RFC] NFS: Improving the access cache
Date: Wed, 03 May 2006 00:42:51 -0400
Message-ID: <445834CB.4050408@citi.umich.edu>
References: <444EC96B.80400@RedHat.com> <17486.64825.942642.594218@cse.unsw.edu.au> <444F88EF.5090105@RedHat.com> <17487.62730.16297.979429@cse.unsw.edu.au> <44572B33.4070100@RedHat.com>
Reply-To: cel@citi.umich.edu
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Cc: nfs@lists.sourceforge.net, linux-fsdevel@vger.kernel.org
Return-path:
To: Steve Dickson
In-Reply-To: <44572B33.4070100@RedHat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Steve Dickson wrote:
> Talking with Trond, he would like to do something slightly different,
> which I'll outline here to make sure we are all on the same page....
>
> Basically we would maintain one global hlist (i.e. linked list) that
> would contain all of the cached entries; then each nfs_inode would
> have its own LRU hlist that would contain entries that are associated
> with that nfs_inode.  So each entry would be on two lists: the
> global hlist and the hlist in the nfs_inode.
>
> We would govern memory consumption by only allowing 30 entries
> on any one hlist in the nfs_inode, and by registering the global
> hlist with the VFS shrinker, which will cause the list to be pruned
> when memory is needed.  So this means that when the 31st entry was
> added to the hlist in the nfs_inode, the least recently used entry
> would be removed.
>
> Locking might be a bit tricky, but doable...  To make this scalable,
> I would think we would need a global read/write spin_lock.  The
> read_lock() would be taken when the hlist in the inode was searched,
> and the write_lock() would be taken when the hlist in the inode was
> changed and when the global list was pruned.

For the sake of discussion, let me propose some design alternatives.

1.  We already have cache shrinkage built in: when an inode is purged
due to cache shrinkage, the access cache for that inode is purged as
well.  In other words, there is already a mechanism for external memory
pressure to shrink this cache.  I don't see a strong need to complicate
matters by adding more cache shrinkage than already exists with normal
inode and dentry cache shrinkage.  Now you won't need to hold a global
lock to serialize normal accesses with purging and cache garbage
collection.  Eliminating global serialization is a Good Thing (tm).

2.  Use a radix tree per inode.  The radix tree key is a uid or gid,
and each node in the tree stores the access mask for that {inode, uid}
tuple.  This seems a lot simpler to implement than a dual hlist, and
will scale automatically with a large number of uids accessing the same
inode.  The nodes are small, and you don't need to allocate a big chunk
of contiguous memory for a hash table.  (A rough sketch follows after
point 3.)

3.  Instead of serializing by spinning, you should use a semaphore.
The reason is that when multiple processes owned by the same uid access
the same inode concurrently, only the first process should be allowed
to generate a real ACCESS request; otherwise they will race, and
potentially all of them could end up generating the same ACCESS request
at the same time.  You will need to serialize on-the-wire requests with
accesses to the cache, and such wire requests will need the waiting
processes to sleep, not spin.
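To make (2) and (3) a little more concrete, here is a minimal sketch of
what I have in mind.  Everything in it is illustrative: the
nfs_access_cache and nfs_access_entry structures, the nfs_access_rpc()
stand-in for the over-the-wire ACCESS call, and the use of a mutex
rather than a plain semaphore are all assumptions, not a finished
design.

#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/radix-tree.h>
#include <linux/slab.h>
#include <linux/time.h>

/* One of these would hang off each nfs_inode. */
struct nfs_access_cache {
        struct radix_tree_root masks;   /* keyed by uid */
        struct mutex serialize;         /* one ACCESS on the wire at a time */
        struct timespec cached_ctime;   /* ctime when the cache was last known valid */
};

struct nfs_access_entry {
        uid_t uid;      /* key, kept here so the entry can be deleted later */
        int mask;       /* cached access bits for this {inode, uid} */
};

/* Stand-in for the real over-the-wire ACCESS call. */
extern int nfs_access_rpc(struct inode *inode, uid_t uid, int *mask);

/*
 * Look up the access mask for @uid, going to the server only on a
 * cache miss.  Because the mutex is held across the wire call, other
 * processes with the same uid sleep here and then find the entry the
 * first caller inserted, instead of each sending its own ACCESS.
 */
static int nfs_access_get(struct inode *inode, struct nfs_access_cache *cache,
                          uid_t uid, int *mask)
{
        struct nfs_access_entry *entry;
        int error = 0;

        mutex_lock(&cache->serialize);
        entry = radix_tree_lookup(&cache->masks, uid);
        if (entry == NULL) {
                entry = kmalloc(sizeof(*entry), GFP_KERNEL);
                if (entry == NULL) {
                        error = -ENOMEM;
                        goto out;
                }
                entry->uid = uid;
                error = nfs_access_rpc(inode, uid, &entry->mask);
                if (error == 0)
                        error = radix_tree_insert(&cache->masks, uid, entry);
                if (error) {
                        kfree(entry);
                        goto out;
                }
        }
        *mask = entry->mask;
out:
        mutex_unlock(&cache->serialize);
        return error;
}

The per-inode cache would be set up in nfs_alloc_inode() with
INIT_RADIX_TREE(..., GFP_KERNEL) and mutex_init(), and torn down when
the inode itself is purged, which is what makes (1) work without any
additional shrinker.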
4.  You will need some mechanism for ensuring that the contents of the
access cache are "up to date", and some way of deciding when to
revalidate each {inode, uid} tuple.  Based on what Peter said, I think
you are going to check the inode's ctime and purge the whole access
cache for an inode if its ctime changes.  But you may need something
like an nfs_revalidate_inode() before you proceed to examine an inode's
access cache.  It might be more efficient to generate just an ACCESS
request instead of a GETATTR followed by an ACCESS, but I don't see an
easy way to do that given the current inode revalidation architecture
of the client.  (A rough sketch of the ctime check is at the end of
this message.)

5.  You need to handle ESTALE.  Often, ->permission is the first thing
the VFS will do before a lookup or open, and that is when the NFS
client first notices that a cached file handle is stale.  Should ESTALE
returned on an ACCESS request always mean "permission denied", or
should it mean "purge the access cache and grant access", so that the
next VFS step sees the ESTALE and can recover appropriately?
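For (4), using the structures from the sketch above, the ctime check
might look something like this.  nfs_revalidate_inode(), NFS_SERVER(),
timespec_equal() and radix_tree_gang_lookup() are real kernel
interfaces; the rest is again only illustration.

#include <linux/nfs_fs.h>       /* in addition to the earlier includes */

/*
 * Drop every cached {uid, mask} pair for this inode.  Caller holds
 * cache->serialize.
 */
static void nfs_access_purge(struct nfs_access_cache *cache)
{
        struct nfs_access_entry *batch[16];
        unsigned int i, n;

        do {
                n = radix_tree_gang_lookup(&cache->masks, (void **)batch,
                                           0, ARRAY_SIZE(batch));
                for (i = 0; i < n; i++) {
                        radix_tree_delete(&cache->masks, batch[i]->uid);
                        kfree(batch[i]);
                }
        } while (n > 0);
}

/*
 * Make sure the cached attributes are usable, then throw the access
 * cache away if the ctime has changed since the cache was filled.
 */
static int nfs_access_revalidate(struct inode *inode,
                                 struct nfs_access_cache *cache)
{
        int error;

        error = nfs_revalidate_inode(NFS_SERVER(inode), inode);
        if (error)
                return error;

        mutex_lock(&cache->serialize);
        if (!timespec_equal(&cache->cached_ctime, &inode->i_ctime)) {
                nfs_access_purge(cache);
                cache->cached_ctime = inode->i_ctime;
        }
        mutex_unlock(&cache->serialize);
        return 0;
}

Note that this sketch simply passes an error from
nfs_revalidate_inode() back to the caller; whether an ESTALE at that
point (or on a later ACCESS) should fail the permission check or purge
the cache and grant access is exactly the open question in (5).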