Make sure to properly invalidate the pagecache before performing direct I/O,
so that no stale pages are left around. This matches what the generic
direct I/O code does. Also take the i_mutex over the direct write submission
to avoid the livelock vs. truncate waiting for i_dio_count to decrease, and
to avoid having the pagecache easily repopulated while direct I/O is in
progress. Again, this matches the generic direct I/O code.
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfs/direct.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 6cc7fe1..2b778fc 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -939,9 +939,12 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
struct inode *inode = mapping->host;
struct nfs_direct_req *dreq;
struct nfs_lock_context *l_ctx;
+ loff_t end;
size_t count;
count = iov_length(iov, nr_segs);
+ end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+
nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count);
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
@@ -958,16 +961,25 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
if (!count)
goto out;
+ mutex_lock(&inode->i_mutex);
+
result = nfs_sync_mapping(mapping);
if (result)
- goto out;
+ goto out_unlock;
+
+ if (mapping->nrpages) {
+ result = invalidate_inode_pages2_range(mapping,
+ pos >> PAGE_CACHE_SHIFT, end);
+ if (result)
+ goto out_unlock;
+ }
task_io_account_write(count);
result = -ENOMEM;
dreq = nfs_direct_req_alloc();
if (!dreq)
- goto out;
+ goto out_unlock;
dreq->inode = inode;
dreq->bytes_left = count;
@@ -982,6 +994,14 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
dreq->iocb = iocb;
result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, uio);
+
+ if (mapping->nrpages) {
+ invalidate_inode_pages2_range(mapping,
+ pos >> PAGE_CACHE_SHIFT, end);
+ }
+
+ mutex_unlock(&inode->i_mutex);
+
if (!result) {
result = nfs_direct_wait(dreq);
if (result > 0) {
@@ -994,8 +1014,13 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
spin_unlock(&inode->i_lock);
}
}
+ nfs_direct_req_release(dreq);
+ return result;
+
out_release:
nfs_direct_req_release(dreq);
+out_unlock:
+ mutex_unlock(&inode->i_mutex);
out:
return result;
}
--
1.7.10.4
On Fri, Nov 15, 2013 at 09:52:41AM -0500, Jeff Layton wrote:
> Do you have these patches in a git tree someplace? If so, I wouldn't
> mind running this reproducer against it to see if it helps. It's a bit
> of a longshot, but what the heck...
While I do have a local git tree I don't really have anywhere to push
it to. But applying these patches shouldn't be all that hard.
On Fri, 15 Nov 2013 07:02:04 -0800
Christoph Hellwig <[email protected]> wrote:
> On Fri, Nov 15, 2013 at 09:52:41AM -0500, Jeff Layton wrote:
> > Do you have these patches in a git tree someplace? If so, I wouldn't
> > mind running this reproducer against it to see if it helps. It's a bit
> > of a longshot, but what the heck...
>
> While I do have a local git tree I don't really have anywhere to push
> it to. But applying these patches shouldn't be all that hard.
>
It's not -- I'm just lazy...
FWIW, I tried this set and it didn't make any difference on the bug, so
I'll just keep soldiering on to track it down...
Thanks,
--
Jeff Layton <[email protected]>
On Thu, Nov 14, 2013 at 01:35:51PM -0500, Jeff Layton wrote:
> Hrm... I started chasing down a bug reported by our QA group last week
> that's showing up when you mix DIO writes and buffered reads
> (basically, diotest3 in the LTP suite is failing). The bug is marked
> private for dumb reasons but I'll see if I can make it public. I'll
> also plan to give this series a spin to see if it helps fix that bug...
>
> In any case, the DIO write code calls nfs_zap_mapping after it gets the
> WRITE reply. That sets NFS_INO_INVALID_DATA and should prevent buffered
> read() calls from getting data out of the cache after the write reply
> comes in.
>
> Why is that not sufficient here?
Sounds like it should actually be fine, although I had similar testcases
fail. I didn't even notice we were doing the invalidation, but delaying
it. Can't see how that helps when bringing mmap into the game, although
that was always a best-effort-and-pray-that-it-works scenario.
On Fri, 15 Nov 2013 06:28:47 -0800
Christoph Hellwig <[email protected]> wrote:
> On Thu, Nov 14, 2013 at 01:35:51PM -0500, Jeff Layton wrote:
> > Hrm... I started chasing down a bug reported by our QA group last week
> > that's showing up when you mix DIO writes and buffered reads
> > (basically, diotest3 in the LTP suite is failing). The bug is marked
> > private for dumb reasons but I'll see if I can make it public. I'll
> > also plan to give this series a spin to see if it helps fix that bug...
> >
> > In any case, the DIO write code calls nfs_zap_mapping after it gets the
> > WRITE reply. That sets NFS_INO_INVALID_DATA and should prevent buffered
> > read() calls from getting data out of the cache after the write reply
> > comes in.
> >
> > Why is that not sufficient here?
>
> Sounds like it should actually be fine, although I had similar testcases
> fail. I didn't even notice we were doing the invalidation, but delaying
> it. Can't see how that helps when bringing mmap into the game, although
> that was always a best-effort-and-pray-that-it-works scenario.
>
Ok, cool. The bug that I've been looking at with Trond's help is here:
https://bugzilla.redhat.com/show_bug.cgi?id=919382
Do you have these patches in a git tree someplace? If so, I wouldn't
mind running this reproducer against it to see if it helps. It's a bit
of a longshot, but what the heck...
--
Jeff Layton <[email protected]>
On Thu, 14 Nov 2013 08:50:34 -0800
Christoph Hellwig <[email protected]> wrote:
> Make sure to properly invalidate the pagecache before performing direct I/O,
> so that no stale pages are left around. This matches what the generic
> direct I/O code does. Also take the i_mutex over the direct write submission
> to avoid the livelock vs. truncate waiting for i_dio_count to decrease, and
> to avoid having the pagecache easily repopulated while direct I/O is in
> progress. Again, this matches the generic direct I/O code.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/nfs/direct.c | 29 +++++++++++++++++++++++++++--
> 1 file changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 6cc7fe1..2b778fc 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -939,9 +939,12 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> struct inode *inode = mapping->host;
> struct nfs_direct_req *dreq;
> struct nfs_lock_context *l_ctx;
> + loff_t end;
> size_t count;
>
> count = iov_length(iov, nr_segs);
> + end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> +
> nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count);
>
> dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
> @@ -958,16 +961,25 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> if (!count)
> goto out;
>
> + mutex_lock(&inode->i_mutex);
> +
> result = nfs_sync_mapping(mapping);
> if (result)
> - goto out;
> + goto out_unlock;
> +
> + if (mapping->nrpages) {
> + result = invalidate_inode_pages2_range(mapping,
> + pos >> PAGE_CACHE_SHIFT, end);
> + if (result)
> + goto out_unlock;
> + }
>
> task_io_account_write(count);
>
> result = -ENOMEM;
> dreq = nfs_direct_req_alloc();
> if (!dreq)
> - goto out;
> + goto out_unlock;
>
> dreq->inode = inode;
> dreq->bytes_left = count;
> @@ -982,6 +994,14 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> dreq->iocb = iocb;
>
> result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, uio);
> +
> + if (mapping->nrpages) {
> + invalidate_inode_pages2_range(mapping,
> + pos >> PAGE_CACHE_SHIFT, end);
> + }
> +
> + mutex_unlock(&inode->i_mutex);
> +
> if (!result) {
> result = nfs_direct_wait(dreq);
> if (result > 0) {
> @@ -994,8 +1014,13 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> spin_unlock(&inode->i_lock);
> }
> }
> + nfs_direct_req_release(dreq);
> + return result;
> +
> out_release:
> nfs_direct_req_release(dreq);
> +out_unlock:
> + mutex_unlock(&inode->i_mutex);
> out:
> return result;
> }
Hrm... I started chasing down a bug reported by our QA group last week
that's showing up when you mix DIO writes and buffered reads
(basically, diotest3 in the LTP suite is failing). The bug is marked
private for dumb reasons but I'll see if I can make it public. I'll
also plan to give this series a spin to see if it helps fix that bug...
In any case, the DIO write code calls nfs_zap_mapping after it gets the
WRITE reply. That sets NFS_INO_INVALID_DATA and should prevent buffered
read() calls from getting data out of the cache after the write reply
comes in.
Why is that not sufficient here?
--
Jeff Layton <[email protected]>
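For context, the failing scenario is easy to model outside of LTP: alternate
page-aligned O_DIRECT writes with buffered reads of the same range and check
that the reads never return stale data. A minimal, hypothetical reproducer in
that spirit (not the actual diotest3 source; the path and iteration count are
arbitrary) might look like:

---------------8<-----------------
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 4096		/* one page, keeps the DIO aligned */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	char *wbuf, rbuf[BUFSZ];
	int dfd, bfd, i;

	if (posix_memalign((void **)&wbuf, BUFSZ, BUFSZ))
		return 1;

	dfd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
	bfd = open(path, O_RDONLY);		/* buffered reader */
	if (dfd < 0 || bfd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 1000; i++) {
		memset(wbuf, i & 0xff, BUFSZ);
		if (pwrite(dfd, wbuf, BUFSZ, 0) != BUFSZ) {
			perror("pwrite");
			return 1;
		}
		/* the buffered read must observe the O_DIRECT write;
		 * stale pagecache shows up as a mismatch here */
		if (pread(bfd, rbuf, BUFSZ, 0) != BUFSZ) {
			perror("pread");
			return 1;
		}
		if (memcmp(wbuf, rbuf, BUFSZ)) {
			fprintf(stderr, "stale read on pass %d\n", i);
			return 1;
		}
	}
	printf("ok\n");
	return 0;
}
---------------8<-----------------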
On Fri, 24 Jan 2014 10:11:11 -0700
Trond Myklebust <[email protected]> wrote:
>
> On Jan 24, 2014, at 8:52, Jeff Layton <[email protected]> wrote:
>
> > On Wed, 22 Jan 2014 07:04:09 -0500
> > Jeff Layton <[email protected]> wrote:
> >
> >> On Wed, 22 Jan 2014 00:24:14 -0800
> >> Christoph Hellwig <[email protected]> wrote:
> >>
> >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> >>>> In any case, this helps but it's a little odd. With this patch, you add
> >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> >>>>
> >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> >>>> mark the mapping for invalidation again when the write completes. Was
> >>>> that intentional?
> >>>>
> >>>> It seems a little excessive and might hurt performance in some cases.
> >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> >>>> this approach seems to give better cache coherency.
> >>>
> >>> This follows the model implemented and documented in
> >>> generic_file_direct_write().
> >>>
> >>
> >> Ok, thanks. That makes sense, and the problem described in those
> >> comments is almost exactly the one I've seen in practice.
> >>
> >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> >> flag is handled, but that really has nothing to do with this patchset.
> >>
> >> You can add my Tested-by to the set if you like...
> >>
> >
> > (re-sending with Trond's address fixed)
> >
> > I may have spoken too soon...
> >
> > This patchset didn't fix the problem once I cranked up the concurrency
> > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > and helps narrow the race window some, but the way that
> > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> >
> > The following patch does seem to fix it however. It's a combination of
> > a test patch that Trond gave me a while back and another change to
> > serialize the nfs_invalidate_mapping ops.
> >
> > I think it's a reasonable approach to deal with the problem, but we
> > likely have some other areas that will need similar treatment since
> > they also check NFS_INO_INVALID_DATA:
> >
> > nfs_write_pageuptodate
> > nfs_readdir_search_for_cookie
> > nfs_update_inode
> >
> > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > opinion on the basic approach, or whether you have an idea of how
> > to better handle the races here:
>
> I think that it is reasonable for nfs_revalidate_mapping, but I don't see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> Readdir already has its own locking at the VFS level, so we shouldn't need to care there.
>
nfs_write_pageuptodate does this:
---------------8<-----------------
if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
return false;
out:
return PageUptodate(page) != 0;
---------------8<-----------------
With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
only later would the page be invalidated. So, there's a race window in
there where the bit could be cleared but the page flag is still set,
even though it's on its way out of the cache. So, I think we'd need to do
some similar sort of locking in there to make sure that doesn't happen.
nfs_update_inode just does this:
if (invalid & NFS_INO_INVALID_DATA)
nfs_fscache_invalidate(inode);
...again, since we clear the bit first with this patch, I think we have
a potential race window there too. We might not see it set in a
situation where we would have before. That case is a bit more
problematic since we can't sleep to wait on the bitlock there.
It might be best to just get rid of that call altogether and move it
into nfs_invalidate_mapping. It seems to me that we ought to just
handle fscache the same way we do the pagecache when it comes to
invalidation.
As far as the readdir code goes, I haven't looked as closely at that
yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
settle the other two cases, I'll give that closer scrutiny.
Thanks,
--
Jeff Layton <[email protected]>
On Jan 24, 2014, at 18:05, Trond Myklebust <[email protected]> wrote:
>
> On Jan 24, 2014, at 17:54, Jeff Layton <[email protected]> wrote:
>
>> On Fri, 24 Jan 2014 17:39:45 -0700
>> Trond Myklebust <[email protected]> wrote:
>>
>>>
>>> On Jan 24, 2014, at 14:21, Jeff Layton <[email protected]> wrote:
>>>
>>>> On Fri, 24 Jan 2014 11:46:41 -0700
>>>> Trond Myklebust <[email protected]> wrote:
>>>>>
>>>>> Convert your patch to use wait_on_bit(), and then to call
>>>>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>>>>>
>>>>
>>>> I think that too would be racy...
>>>>
>>>> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
>>>> can't wait_on_bit_lock() under that. So (pseudocode):
>>>>
>>>> wait_on_bit
>>>> take i_lock
>>>> check and clear NFS_INO_INVALID_DATA
>>>> drop i_lock
>>>> wait_on_bit_lock
>>>>
>>>> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
>>>> where another task could check the flag and find it clear.
>>>
>>>
>>> for(;;) {
>>> wait_on_bit(NFS_INO_INVALIDATING)
> >>> /* Optimisation: don't lock NFS_INO_INVALIDATING
>>> * if NFS_INO_INVALID_DATA was cleared while we waited.
>>> */
>>> if (!test_bit(NFS_INO_INVALID_DATA))
>>> return;
>>> if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
>>> break;
>>> }
>>> spin_lock(inode->i_lock);
>>> if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
>>> spin_unlock(inode->i_lock);
>>> goto out_raced;
>>> }
>>> ?.
>>> out_raced:
>>> clear_bit(NFS_INO_INVALIDATING)
>>> wake_up_bit(NFS_INO_INVALIDATING)
>>>
>>>
>>> --
>>> Trond Myklebust
>>> Linux NFS client maintainer
>>>
>>
>> Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
>> the spinlock? I'll ponder it over the weekend and give it a harder
>> look on Monday.
>>
>
> The NFS_I(inode)->cache_validity doesn't use bitops, so the correct behaviour is to put NFS_INO_INVALIDATING inside NFS_I(inode)->flags (which is an atomic bit op field), and then continue to use the spin lock for NFS_INO_INVALID_DATA.
In other words, please replace the atomic test_bit(NFS_INO_INVALID_DATA) and test_and_clear_bit(NFS_INO_INVALID_DATA) in the above pseudocode with the appropriate tests and clears of NFS_I(inode)->cache_validity.
--
Trond Myklebust
Linux NFS client maintainer
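Putting that correction together with the earlier pseudocode, the
serialization being discussed would look roughly like the sketch below
(illustrative only, not the patch that was eventually merged; the helper name
is made up, and error handling and tracepoints are omitted). NFS_INO_INVALIDATING
lives in nfsi->flags and is manipulated with atomic bitops, while
NFS_INO_INVALID_DATA stays in nfsi->cache_validity under inode->i_lock:

---------------8<-----------------
static int nfs_invalidate_mapping_serialized(struct inode *inode,
					     struct address_space *mapping)
{
	struct nfs_inode *nfsi = NFS_I(inode);
	int ret;

	for (;;) {
		ret = wait_on_bit(&nfsi->flags, NFS_INO_INVALIDATING,
				  nfs_wait_bit_killable, TASK_KILLABLE);
		if (ret)
			return ret;
		spin_lock(&inode->i_lock);
		if (!(nfsi->cache_validity & NFS_INO_INVALID_DATA)) {
			/* cleared while we waited -- nothing to do */
			spin_unlock(&inode->i_lock);
			return 0;
		}
		spin_unlock(&inode->i_lock);
		if (!test_and_set_bit_lock(NFS_INO_INVALIDATING, &nfsi->flags))
			break;
	}

	/* we now own NFS_INO_INVALIDATING; recheck and clear under i_lock */
	spin_lock(&inode->i_lock);
	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
		spin_unlock(&inode->i_lock);
		ret = nfs_invalidate_mapping(inode, mapping);
	} else {
		/* raced: someone else already did the invalidation */
		spin_unlock(&inode->i_lock);
		ret = 0;
	}

	clear_bit_unlock(NFS_INO_INVALIDATING, &nfsi->flags);
	smp_mb__after_clear_bit();
	wake_up_bit(&nfsi->flags, NFS_INO_INVALIDATING);
	return ret;
}
---------------8<-----------------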
On Fri, 15 Nov 2013 10:33:24 -0500
Jeff Layton <[email protected]> wrote:
> On Fri, 15 Nov 2013 07:02:04 -0800
> Christoph Hellwig <[email protected]> wrote:
>
> > On Fri, Nov 15, 2013 at 09:52:41AM -0500, Jeff Layton wrote:
> > > Do you have these patches in a git tree someplace? If so, I wouldn't
> > > mind running this reproducer against it to see if it helps. It's a bit
> > > of a longshot, but what the heck...
> >
> > While I do have a local git tree I don't really have anywhere to push
> > it to. But applying these patches shouldn't be all that hard.
> >
>
> It's not -- I'm just lazy...
>
> FWIW, I tried this set and it didn't make any difference on the bug, so
> I'll just keep soldiering on to track it down...
>
I just tried this set again, and it *did* seem to help that bug.
I think the reason it didn't before was that I had applied this set on
top of a tree that held a different patch that introduced a race in the
nfs_revalidate_mapping() code.
In any case, this helps but it's a little odd. With this patch, you add
an invalidate_inode_pages2 call prior to doing the DIO. But, you've
also left in the call to nfs_zap_mapping in the completion codepath.
So now, we shoot down the mapping prior to doing a DIO write, and then
mark the mapping for invalidation again when the write completes. Was
that intentional?
It seems a little excessive and might hurt performance in some cases.
OTOH, if you mix buffered and DIO you're asking for trouble anyway and
this approach seems to give better cache coherency.
--
Jeff Layton <[email protected]>
On Jan 24, 2014, at 14:21, Jeff Layton <[email protected]> wrote:
> On Fri, 24 Jan 2014 11:46:41 -0700
> Trond Myklebust <[email protected]> wrote:
>>
>> Convert your patch to use wait_on_bit(), and then to call
>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>>
>
> I think that too would be racy...
>
> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
> can't wait_on_bit_lock() under that. So (pseudocode):
>
> wait_on_bit
> take i_lock
> check and clear NFS_INO_INVALID_DATA
> drop i_lock
> wait_on_bit_lock
>
> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
> where another task could check the flag and find it clear.
for(;;) {
wait_on_bit(NFS_INO_INVALIDATING)
/* Optimisation: don't lock NFS_INO_INVALIDATING
* if NFS_INO_INVALID_DATA was cleared while we waited.
*/
if (!test_bit(NFS_INO_INVALID_DATA))
return;
if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
break;
}
spin_lock(inode->i_lock);
if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
spin_unlock(inode->i_lock);
goto out_raced;
}
...
out_raced:
clear_bit(NFS_INO_INVALIDATING)
wake_up_bit(NFS_INO_INVALIDATING)
--
Trond Myklebust
Linux NFS client maintainer
On Fri, 2014-01-24 at 13:00 -0500, Jeff Layton wrote:
> On Fri, 24 Jan 2014 10:40:06 -0700
> Trond Myklebust <[email protected]> wrote:
>
> > On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > > On Fri, 24 Jan 2014 10:11:11 -0700
> > > Trond Myklebust <[email protected]> wrote:
> > >
> > > >
> > > > On Jan 24, 2014, at 8:52, Jeff Layton <[email protected]> wrote:
> > > >
> > > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > > Jeff Layton <[email protected]> wrote:
> > > > >
> > > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > > >> Christoph Hellwig <[email protected]> wrote:
> > > > >>
> > > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > > >>>>
> > > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > > >>>> that intentional?
> > > > >>>>
> > > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > > >>>> this approach seems to give better cache coherency.
> > > > >>>
> > > > >>> This follows the model implemented and documented in
> > > > >>> generic_file_direct_write().
> > > > >>>
> > > > >>
> > > > >> Ok, thanks. That makes sense, and the problem described in those
> > > > >> comments is almost exactly the one I've seen in practice.
> > > > >>
> > > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > > >> flag is handled, but that really has nothing to do with this patchset.
> > > > >>
> > > > >> You can add my Tested-by to the set if you like...
> > > > >>
> > > > >
> > > > > (re-sending with Trond's address fixed)
> > > > >
> > > > > I may have spoken too soon...
> > > > >
> > > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > > and helps narrow the race window some, but the way that
> > > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > >
> > > > > The following patch does seem to fix it however. It's a combination of
> > > > > a test patch that Trond gave me a while back and another change to
> > > > > serialize the nfs_invalidate_mapping ops.
> > > > >
> > > > > I think it's a reasonable approach to deal with the problem, but we
> > > > > likely have some other areas that will need similar treatment since
> > > > > they also check NFS_INO_INVALID_DATA:
> > > > >
> > > > > nfs_write_pageuptodate
> > > > > nfs_readdir_search_for_cookie
> > > > > nfs_update_inode
> > > > >
> > > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > > opinion on the basic approach, or whether you have an idea of how
> > > > > to better handle the races here:
> > > >
> > > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > >
> > >
> > >
> > > nfs_write_pageuptodate does this:
> > >
> > > ---------------8<-----------------
> > > if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> > > return false;
> > > out:
> > > return PageUptodate(page) != 0;
> > > ---------------8<-----------------
> > >
> > > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > > only later would the page be invalidated. So, there's a race window in
> > > there where the bit could be cleared but the page flag is still set,
> > > even though it's on its way out the cache. So, I think we'd need to do
> > > some similar sort of locking in there to make sure that doesn't happen.
> >
> > We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> > end up deadlocking with invalidate_inode_pages2().
> >
> > If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> > the optimisation in that case, but I'd like to understand what the race
> > would be: don't forget that the page is marked as PageUptodate(), which
> > means that either invalidate_inode_pages2() has not yet reached this
> > page, or that a read of the page succeeded after the invalidation was
> > made.
> >
>
> Right. The first situation seems wrong to me. We've marked the file as
> INVALID and then cleared the bit to start the process of invalidating
> the actual pages. It seems like nfs_write_pageuptodate ought not return
> true even if PageUptodate() is still set at that point.
>
> We could check NFS_INO_INVALIDATING, but we might miss that
> optimization in a lot of cases just because something happens to be
> in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
> sufficient and we need some other mechanism. I'm not sure what that
> should be though.
Convert your patch to use wait_on_bit(), and then to call
wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
> > > nfs_update_inode just does this:
> > >
> > > if (invalid & NFS_INO_INVALID_DATA)
> > > nfs_fscache_invalidate(inode);
> > >
> > > ...again, since we clear the bit first with this patch, I think we have
> > > a potential race window there too. We might not see it set in a
> > > situation where we would have before. That case is a bit more
> > > problematic since we can't sleep to wait on the bitlock there.
> >
> > Umm... That test in nfs_update_inode() is there because we might just
> > have _set_ the NFS_INO_INVALID_DATA bit.
> >
>
> Correct. But do we need to force a fscache invalidation at that point,
> or can it wait until we're going to invalidate the mapping too?
That's a question for David. My assumption is that, since invalidation is
handled asynchronously by the fscache layer itself, we need to let
it start that process as soon as possible, but perhaps these races are
an indication that we should actually do it at the time when we call
invalidate_inode_pages2() (or at the latest, when we're evicting the
inode from the icache)...
--
Trond Myklebust
Linux NFS client maintainer
On Jan 24, 2014, at 17:54, Jeff Layton <[email protected]> wrote:
> On Fri, 24 Jan 2014 17:39:45 -0700
> Trond Myklebust <[email protected]> wrote:
>
>>
>> On Jan 24, 2014, at 14:21, Jeff Layton <[email protected]> wrote:
>>
>>> On Fri, 24 Jan 2014 11:46:41 -0700
>>> Trond Myklebust <[email protected]> wrote:
>>>>
>>>> Convert your patch to use wait_on_bit(), and then to call
>>>> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>>>>
>>>
>>> I think that too would be racy...
>>>
>>> We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
>>> can't wait_on_bit_lock() under that. So (pseudocode):
>>>
>>> wait_on_bit
>>> take i_lock
>>> check and clear NFS_INO_INVALID_DATA
>>> drop i_lock
>>> wait_on_bit_lock
>>>
>>> ...so between dropping the i_lock and wait_on_bit_lock, we have a place
>>> where another task could check the flag and find it clear.
>>
>>
>> for(;;) {
>> wait_on_bit(NFS_INO_INVALIDATING)
>> /* Optimisation: don't lock NFS_INO_INVALIDATING
>> * if NFS_INO_INVALID_DATA was cleared while we waited.
>> */
>> if (!test_bit(NFS_INO_INVALID_DATA))
>> return;
>> if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
>> break;
>> }
>> spin_lock(inode->i_lock);
>> if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
>> spin_unlock(inode->i_lock);
>> goto out_raced;
>> }
>> ...
>> out_raced:
>> clear_bit(NFS_INO_INVALIDATING)
>> wake_up_bit(NFS_INO_INVALIDATING)
>>
>>
>> --
>> Trond Myklebust
>> Linux NFS client maintainer
>>
>
> Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
> the spinlock? I'll ponder it over the weekend and give it a harder
> look on Monday.
>
The NFS_I(inode)->cache_validity doesn't use bitops, so the correct behaviour is to put NFS_INO_INVALIDATING inside NFS_I(inode)->flags (which is an atomic bit op field), and then continue to use the spin lock for NFS_INO_INVALID_DATA.
--
Trond Myklebust
Linux NFS client maintainer
On Wed, 22 Jan 2014 07:04:09 -0500
Jeff Layton <[email protected]> wrote:
> On Wed, 22 Jan 2014 00:24:14 -0800
> Christoph Hellwig <[email protected]> wrote:
>
> > On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > In any case, this helps but it's a little odd. With this patch, you add
> > > an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > also left in the call to nfs_zap_mapping in the completion codepath.
> > >
> > > So now, we shoot down the mapping prior to doing a DIO write, and then
> > > mark the mapping for invalidation again when the write completes. Was
> > > that intentional?
> > >
> > > It seems a little excessive and might hurt performance in some cases.
> > > OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > this approach seems to give better cache coherency.
> >
> > This follows the model implemented and documented in
> > generic_file_direct_write().
> >
>
> Ok, thanks. That makes sense, and the problem described in those
> comments is almost exactly the one I've seen in practice.
>
> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> flag is handled, but that really has nothing to do with this patchset.
>
> You can add my Tested-by to the set if you like...
>
I may have spoken too soon...
This patchset didn't fix the problem once I cranked up the concurrency
from 100 child tasks to 1000. I think that HCH's patchset makes sense
and helps narrow the race window some, but the way that
nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
The following patch does seem to fix it however. It's a combination of
a test patch that Trond gave me a while back and another change to
serialize the nfs_invalidate_mapping ops.
I think it's a reasonable approach to deal with the problem, but we
likely have some other areas that will need similar treatment since
they also check NFS_INO_INVALID_DATA:
nfs_write_pageuptodate
nfs_readdir_search_for_cookie
nfs_update_inode
Trond, thoughts? It's not quite ready for merge, but I'd like to get an
opinion on the basic approach, or whether you have an idea of how
to better handle the races here:
------------------8<--------------------
NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping
There is a possible race in how the nfs_invalidate_mapping is handled.
Currently, we go and invalidate the pages in the file and then clear
NFS_INO_INVALID_DATA.
The problem is that it's possible for a stale page to creep into the
mapping after the page was invalidated (i.e., via readahead). If another
writer comes along and sets the flag after that happens but before
invalidate_inode_pages2 returns then we could clear the flag
without the cache having been properly invalidated.
So, we must clear the flag first and then invalidate the pages. This,
however, opens another race:
It's possible to have two concurrent read() calls that end up in
nfs_revalidate_mapping at the same time. The first one clears the
NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.
Just before calling that though, the other task races in, checks the
flag and finds it cleared. At that point, it sees that the mapping is
good and gets the lock on the page, allowing the read() to be satisfied
from the cache even though the data is no longer valid.
This effect is easily manifested by running diotest3 from the LTP test
suite on NFS. That program does a series of DIO writes and buffered
reads. The operations are serialized and page-aligned but the existing
code fails the test since it occasionally allows a read to come out of
the cache instead of being done on the wire when it should. While mixing
direct and buffered I/O isn't recommended, I believe it's possible to
hit this in other ways that just use buffered I/O, even though that
makes it harder to reproduce.
The problem is that the checking/clearing of that flag and the
invalidation of the mapping need to be as a unit. Fix this by
serializing concurrent invalidations with a bitlock.
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/inode.c | 32 +++++++++++++++++++++++++++-----
include/linux/nfs_fs.h | 1 +
2 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 00ad1c2..6fa07e1 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -977,11 +977,11 @@ static int nfs_invalidate_mapping(struct inode *inode, struct address_space *map
if (ret < 0)
return ret;
}
- spin_lock(&inode->i_lock);
- nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
- if (S_ISDIR(inode->i_mode))
+ if (S_ISDIR(inode->i_mode)) {
+ spin_lock(&inode->i_lock);
memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
- spin_unlock(&inode->i_lock);
+ spin_unlock(&inode->i_lock);
+ }
nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
nfs_fscache_wait_on_invalidate(inode);
@@ -1007,6 +1007,7 @@ static bool nfs_mapping_need_revalidate_inode(struct inode *inode)
int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
{
struct nfs_inode *nfsi = NFS_I(inode);
+ unsigned long *bitlock = &NFS_I(inode)->flags;
int ret = 0;
/* swapfiles are not supposed to be shared. */
@@ -1018,12 +1019,33 @@ int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
if (ret < 0)
goto out;
}
+
+ /*
+ * We must clear NFS_INO_INVALID_DATA first to ensure that
+ * invalidations that come in while we're shooting down the mappings
+ * are respected. But, that leaves a race window where one revalidator
+ * can clear the flag, and then another checks it before the mapping
+ * gets invalidated. Fix that by serializing access to this part of
+ * the function.
+ */
+ ret = wait_on_bit_lock(bitlock, NFS_INO_INVALIDATING,
+ nfs_wait_bit_killable, TASK_KILLABLE);
+ if (ret)
+ goto out;
+
+ spin_lock(&inode->i_lock);
if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+ nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+ spin_unlock(&inode->i_lock);
trace_nfs_invalidate_mapping_enter(inode);
ret = nfs_invalidate_mapping(inode, mapping);
trace_nfs_invalidate_mapping_exit(inode, ret);
- }
+ } else
+ spin_unlock(&inode->i_lock);
+ clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
+ smp_mb__after_clear_bit();
+ wake_up_bit(bitlock, NFS_INO_INVALIDATING);
out:
return ret;
}
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 4899737..18fb16f 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -215,6 +215,7 @@ struct nfs_inode {
#define NFS_INO_ADVISE_RDPLUS (0) /* advise readdirplus */
#define NFS_INO_STALE (1) /* possible stale inode */
#define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
+#define NFS_INO_INVALIDATING (3) /* inode is being invalidated */
#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
#define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
#define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
--
1.8.5.3
On Fri, 24 Jan 2014 10:40:06 -0700
Trond Myklebust <[email protected]> wrote:
> On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > On Fri, 24 Jan 2014 10:11:11 -0700
> > Trond Myklebust <[email protected]> wrote:
> >
> > >
> > > On Jan 24, 2014, at 8:52, Jeff Layton <[email protected]> wrote:
> > >
> > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > Jeff Layton <[email protected]> wrote:
> > > >
> > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > >> Christoph Hellwig <[email protected]> wrote:
> > > >>
> > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > >>>>
> > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > >>>> that intentional?
> > > >>>>
> > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > >>>> this approach seems to give better cache coherency.
> > > >>>
> > > > >>> This follows the model implemented and documented in
> > > >>> generic_file_direct_write().
> > > >>>
> > > >>
> > > >> Ok, thanks. That makes sense, and the problem described in those
> > > >> comments is almost exactly the one I've seen in practice.
> > > >>
> > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > >> flag is handled, but that really has nothing to do with this patchset.
> > > >>
> > > >> You can add my Tested-by to the set if you like...
> > > >>
> > > >
> > > > (re-sending with Trond's address fixed)
> > > >
> > > > I may have spoken too soon...
> > > >
> > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > and helps narrow the race window some, but the way that
> > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > >
> > > > The following patch does seem to fix it however. It's a combination of
> > > > a test patch that Trond gave me a while back and another change to
> > > > serialize the nfs_invalidate_mapping ops.
> > > >
> > > > I think it's a reasonable approach to deal with the problem, but we
> > > > likely have some other areas that will need similar treatment since
> > > > they also check NFS_INO_INVALID_DATA:
> > > >
> > > > nfs_write_pageuptodate
> > > > nfs_readdir_search_for_cookie
> > > > nfs_update_inode
> > > >
> > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > opinion on the basic approach, or whether you have an idea of how
> > > > to better handle the races here:
> > >
> > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > >
> >
> >
> > nfs_write_pageuptodate does this:
> >
> > ---------------8<-----------------
> > if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> > return false;
> > out:
> > return PageUptodate(page) != 0;
> > ---------------8<-----------------
> >
> > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > only later would the page be invalidated. So, there's a race window in
> > there where the bit could be cleared but the page flag is still set,
> > even though it's on its way out the cache. So, I think we'd need to do
> > some similar sort of locking in there to make sure that doesn't happen.
>
> We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> end up deadlocking with invalidate_inode_pages2().
>
> If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> the optimisation in that case, but I'd like to understand what the race
> would be: don't forget that the page is marked as PageUptodate(), which
> means that either invalidate_inode_pages2() has not yet reached this
> page, or that a read of the page succeeded after the invalidation was
> made.
>
Right. The first situation seems wrong to me. We've marked the file as
INVALID and then cleared the bit to start the process of invalidating
the actual pages. It seems like nfs_write_pageuptodate ought not return
true even if PageUptodate() is still set at that point.
We could check NFS_INO_INVALIDATING, but we might miss that
optimization in a lot of cases just because something happens to be
in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
sufficient and we need some other mechanism. I'm not sure what that
should be though.
> > nfs_update_inode just does this:
> >
> > if (invalid & NFS_INO_INVALID_DATA)
> > nfs_fscache_invalidate(inode);
> >
> > ...again, since we clear the bit first with this patch, I think we have
> > a potential race window there too. We might not see it set in a
> > situation where we would have before. That case is a bit more
> > problematic since we can't sleep to wait on the bitlock there.
>
> Umm... That test in nfs_update_inode() is there because we might just
> have _set_ the NFS_INO_INVALID_DATA bit.
>
Correct. But do we need to force a fscache invalidation at that point,
or can it wait until we're going to invalidate the mapping too?
> >
> > It might be best to just get rid of that call altogether and move it
> > into nfs_invalidate_mapping. It seems to me that we ought to just
> > handle fscache the same way we do the pagecache when it comes to
> > invalidation.
> >
> > As far as the readdir code goes, I haven't looked as closely at that
> > yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
> > settle the other two cases, I'll give that closer scrutiny.
> >
--
Jeff Layton <[email protected]>
On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> In any case, this helps but it's a little odd. With this patch, you add
> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> also left in the call to nfs_zap_mapping in the completion codepath.
>
> So now, we shoot down the mapping prior to doing a DIO write, and then
> mark the mapping for invalidation again when the write completes. Was
> that intentional?
>
> It seems a little excessive and might hurt performance in some cases.
> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> this approach seems to give better cache coherency.
This follows the model implemented and documented in
generic_file_direct_write().
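For reference, the ordering that generic_file_direct_write() documents boils
down to something like the sketch below (heavily simplified, not the actual
mm/filemap.c code; issue_dio() is a stand-in for the real ->direct_IO
submission):

---------------8<-----------------
/* flush, invalidate, do the direct write, then invalidate again in
 * case the range was repopulated (readahead, mmap faults) meanwhile */
static ssize_t dio_write_pattern(struct address_space *mapping,
				 const void *buf, size_t count, loff_t pos)
{
	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
	ssize_t written;
	int err;

	/* 1) push out any dirty cached data over the range */
	err = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
	if (err)
		return err;

	/* 2) drop cached pages so readers can't keep serving old data */
	if (mapping->nrpages) {
		err = invalidate_inode_pages2_range(mapping,
				pos >> PAGE_CACHE_SHIFT, end);
		if (err)
			return err;
	}

	written = issue_dio(mapping, buf, count, pos);	/* placeholder */

	/* 3) invalidate again after the DIO completes */
	if (written > 0 && mapping->nrpages)
		invalidate_inode_pages2_range(mapping,
				pos >> PAGE_CACHE_SHIFT, end);

	return written;
}
---------------8<-----------------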
On Fri, 24 Jan 2014 17:39:45 -0700
Trond Myklebust <[email protected]> wrote:
>
> On Jan 24, 2014, at 14:21, Jeff Layton <[email protected]> wrote:
>
> > On Fri, 24 Jan 2014 11:46:41 -0700
> > Trond Myklebust <[email protected]> wrote:
> >>
> >> Convert your patch to use wait_on_bit(), and then to call
> >> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
> >>
> >
> > I think that too would be racy...
> >
> > We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
> > can't wait_on_bit_lock() under that. So (pseudocode):
> >
> > wait_on_bit
> > take i_lock
> > check and clear NFS_INO_INVALID_DATA
> > drop i_lock
> > wait_on_bit_lock
> >
> > ...so between dropping the i_lock and wait_on_bit_lock, we have a place
> > where another task could check the flag and find it clear.
>
>
> for(;;) {
> wait_on_bit(NFS_INO_INVALIDATING)
> /* Optimisation: don't lock NFS_INO_INVALIDATING
> * if NFS_INO_INVALID_DATA was cleared while we waited.
> */
> if (!test_bit(NFS_INO_INVALID_DATA))
> return;
> if (!test_and_set_bit_lock(NFS_INO_INVALIDATING))
> break;
> }
> spin_lock(inode->i_lock);
> if (!test_and_clear_bit(NFS_INO_INVALID_DATA)) {
> spin_unlock(inode->i_lock);
> goto out_raced;
> }
> ...
> out_raced:
> clear_bit(NFS_INO_INVALIDATING)
> wake_up_bit(NFS_INO_INVALIDATING)
>
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
Hmm maybe. OTOH, if we're using atomic bitops do we need to deal with
the spinlock? I'll ponder it over the weekend and give it a harder
look on Monday.
Thanks for the thoughts so far...
--
Jeff Layton <[email protected]>
On Wed, 22 Jan 2014 00:24:14 -0800
Christoph Hellwig <[email protected]> wrote:
> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > In any case, this helps but it's a little odd. With this patch, you add
> > an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > also left in the call to nfs_zap_mapping in the completion codepath.
> >
> > So now, we shoot down the mapping prior to doing a DIO write, and then
> > mark the mapping for invalidation again when the write completes. Was
> > that intentional?
> >
> > It seems a little excessive and might hurt performance in some cases.
> > OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > this approach seems to give better cache coherency.
>
> This follows the model implemented and documented in
> generic_file_direct_write().
>
Ok, thanks. That makes sense, and the problem described in those
comments is almost exactly the one I've seen in practice.
I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
flag is handled, but that really has nothing to do with this patchset.
You can add my Tested-by to the set if you like...
Cheers,
--
Jeff Layton <[email protected]>
On Jan 24, 2014, at 8:52, Jeff Layton <[email protected]> wrote:
> On Wed, 22 Jan 2014 07:04:09 -0500
> Jeff Layton <[email protected]> wrote:
>
>> On Wed, 22 Jan 2014 00:24:14 -0800
>> Christoph Hellwig <[email protected]> wrote:
>>
>>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
>>>> In any case, this helps but it's a little odd. With this patch, you add
>>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
>>>> also left in the call to nfs_zap_mapping in the completion codepath.
>>>>
>>>> So now, we shoot down the mapping prior to doing a DIO write, and then
>>>> mark the mapping for invalidation again when the write completes. Was
>>>> that intentional?
>>>>
>>>> It seems a little excessive and might hurt performance in some cases.
>>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
>>>> this approach seems to give better cache coherency.
>>>
>>> This follows the model implemented and documented in
>>> generic_file_direct_write().
>>>
>>
>> Ok, thanks. That makes sense, and the problem described in those
>> comments is almost exactly the one I've seen in practice.
>>
>> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
>> flag is handled, but that really has nothing to do with this patchset.
>>
>> You can add my Tested-by to the set if you like...
>>
>
> (re-sending with Trond's address fixed)
>
> I may have spoken too soon...
>
> This patchset didn't fix the problem once I cranked up the concurrency
> from 100 child tasks to 1000. I think that HCH's patchset makes sense
> and helps narrow the race window some, but the way that
> nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
>
> The following patch does seem to fix it however. It's a combination of
> a test patch that Trond gave me a while back and another change to
> serialize the nfs_invalidate_mapping ops.
>
> I think it's a reasonable approach to deal with the problem, but we
> likely have some other areas that will need similar treatment since
> they also check NFS_INO_INVALID_DATA:
>
> nfs_write_pageuptodate
> nfs_readdir_search_for_cookie
> nfs_update_inode
>
> Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> opinion on the basic approach, or whether you have an idea of how
> to better handle the races here:
I think that it is reasonable for nfs_revalidate_mapping, but I don't see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
Readdir already has its own locking at the VFS level, so we shouldn't need to care there.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
On Fri, 24 Jan 2014 11:46:41 -0700
Trond Myklebust <[email protected]> wrote:
> On Fri, 2014-01-24 at 13:00 -0500, Jeff Layton wrote:
> > On Fri, 24 Jan 2014 10:40:06 -0700
> > Trond Myklebust <[email protected]> wrote:
> >
> > > On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > > > On Fri, 24 Jan 2014 10:11:11 -0700
> > > > Trond Myklebust <[email protected]> wrote:
> > > >
> > > > >
> > > > > On Jan 24, 2014, at 8:52, Jeff Layton <[email protected]> wrote:
> > > > >
> > > > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > > > Jeff Layton <[email protected]> wrote:
> > > > > >
> > > > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > > > >> Christoph Hellwig <[email protected]> wrote:
> > > > > >>
> > > > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > > > >>>>
> > > > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > > > >>>> that intentional?
> > > > > >>>>
> > > > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > > > >>>> this approach seems to give better cache coherency.
> > > > > >>>
> > > > > >>> This follows the model implemented and documented in
> > > > > >>> generic_file_direct_write().
> > > > > >>>
> > > > > >>
> > > > > >> Ok, thanks. That makes sense, and the problem described in those
> > > > > >> comments is almost exactly the one I've seen in practice.
> > > > > >>
> > > > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > > > >> flag is handled, but that really has nothing to do with this patchset.
> > > > > >>
> > > > > >> You can add my Tested-by to the set if you like...
> > > > > >>
> > > > > >
> > > > > > (re-sending with Trond's address fixed)
> > > > > >
> > > > > > I may have spoken too soon...
> > > > > >
> > > > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > > > and helps narrow the race window some, but the way that
> > > > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > > >
> > > > > > The following patch does seem to fix it however. It's a combination of
> > > > > > a test patch that Trond gave me a while back and another change to
> > > > > > serialize the nfs_invalidate_mapping ops.
> > > > > >
> > > > > > I think it's a reasonable approach to deal with the problem, but we
> > > > > > likely have some other areas that will need similar treatment since
> > > > > > they also check NFS_INO_INVALID_DATA:
> > > > > >
> > > > > > nfs_write_pageuptodate
> > > > > > nfs_readdir_search_for_cookie
> > > > > > nfs_update_inode
> > > > > >
> > > > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > > > opinion on the basic approach, or whether you have an idea of how
> > > > > > to better handle the races here:
> > > > >
> > > > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > > >
> > > >
> > > >
> > > > nfs_write_pageuptodate does this:
> > > >
> > > > ---------------8<-----------------
> > > > if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> > > > return false;
> > > > out:
> > > > return PageUptodate(page) != 0;
> > > > ---------------8<-----------------
> > > >
> > > > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > > > only later would the page be invalidated. So, there's a race window in
> > > > there where the bit could be cleared but the page flag is still set,
> > > > even though it's on its way out the cache. So, I think we'd need to do
> > > > some similar sort of locking in there to make sure that doesn't happen.
> > >
> > > We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> > > end up deadlocking with invalidate_inode_pages2().
> > >
> > > If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> > > the optimisation in that case, but I'd like to understand what the race
> > > would be: don't forget that the page is marked as PageUptodate(), which
> > > means that either invalidate_inode_pages2() has not yet reached this
> > > page, or that a read of the page succeeded after the invalidation was
> > > made.
> > >
> >
> > Right. The first situation seems wrong to me. We've marked the file as
> > INVALID and then cleared the bit to start the process of invalidating
> > the actual pages. It seems like nfs_write_pageuptodate ought not return
> > true even if PageUptodate() is still set at that point.
> >
> > We could check NFS_INO_INVALIDATING, but we might miss that
> > optimization in a lot of cases just because something happens to be
> > in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
> > sufficient and we need some other mechanism. I'm not sure what that
> > should be though.
>
> Convert your patch to use wait_on_bit(), and then to call
> wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
>
I think that too would be racy...
We have to clear NFS_INO_INVALID_DATA while holding the i_lock, but we
can't wait_on_bit_lock() under that. So (pseudocode):
wait_on_bit
take i_lock
check and clear NFS_INO_INVALID_DATA
drop i_lock
wait_on_bit_lock
...so between dropping the i_lock and wait_on_bit_lock, we have a place
where another task could check the flag and find it clear.
I think the upshot here is that a bit_lock may not be the appropriate
thing to use to handle this. I'll have to ponder what might be better...
> > > > nfs_update_inode just does this:
> > > >
> > > > if (invalid & NFS_INO_INVALID_DATA)
> > > > nfs_fscache_invalidate(inode);
> > > >
> > > > ...again, since we clear the bit first with this patch, I think we have
> > > > a potential race window there too. We might not see it set in a
> > > > situation where we would have before. That case is a bit more
> > > > problematic since we can't sleep to wait on the bitlock there.
> > >
> > > Umm... That test in nfs_update_inode() is there because we might just
> > > have _set_ the NFS_INO_INVALID_DATA bit.
> > >
> >
> > Correct. But do we need to force a fscache invalidation at that point,
> > or can it wait until we're going to invalidate the mapping too?
>
> That's a question for David. My assumption is that, since invalidation is
> handled asynchronously by the fscache layer itself, we need to let
> it start that process as soon as possible, but perhaps these races are
> an indication that we should actually do it at the time when we call
> invalidate_inode_pages2() (or at the latest, when we're evicting the
> inode from the icache)...
>
>
Ok, looks like it just sets a flag, so if we can handle this somehow
w/o sleeping then it may not matter. Again, I'll have to ponder what
may be better than a bit_lock.
Thanks,
--
Jeff Layton <[email protected]>
On Wed, 22 Jan 2014 07:04:09 -0500
Jeff Layton <[email protected]> wrote:
> On Wed, 22 Jan 2014 00:24:14 -0800
> Christoph Hellwig <[email protected]> wrote:
>
> > On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > In any case, this helps but it's a little odd. With this patch, you add
> > > an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > also left in the call to nfs_zap_mapping in the completion codepath.
> > >
> > > So now, we shoot down the mapping prior to doing a DIO write, and then
> > > mark the mapping for invalidation again when the write completes. Was
> > > that intentional?
> > >
> > > It seems a little excessive and might hurt performance in some cases.
> > > OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > this approach seems to give better cache coherency.
> >
> > This follows the model implemented and documented in
> > generic_file_direct_write().
> >
>
> Ok, thanks. That makes sense, and the problem described in those
> comments is almost exactly the one I've seen in practice.
>
> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> flag is handled, but that really has nothing to do with this patchset.
>
> You can add my Tested-by to the set if you like...
>
(re-sending with Trond's address fixed)
I may have spoken too soon...
This patchset didn't fix the problem once I cranked up the concurrency
from 100 child tasks to 1000. I think that HCH's patchset makes sense
and helps narrow the race window some, but the way that
nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
The following patch does seem to fix it however. It's a combination of
a test patch that Trond gave me a while back and another change to
serialize the nfs_invalidate_mapping ops.
I think it's a reasonable approach to deal with the problem, but we
likely have some other areas that will need similar treatment since
they also check NFS_INO_INVALID_DATA:
nfs_write_pageuptodate
nfs_readdir_search_for_cookie
nfs_update_inode
Trond, thoughts? It's not quite ready for merge, but I'd like to get an
opinion on the basic approach, or whether you have an idea of how
to better handle the races here:
------------------8<--------------------
NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping
There is a possible race in how the nfs_invalidate_mapping is handled.
Currently, we go and invalidate the pages in the file and then clear
NFS_INO_INVALID_DATA.
The problem is that it's possible for a stale page to creep into the
mapping after the page was invalidated (i.e., via readahead). If another
writer comes along and sets the flag after that happens but before
invalidate_inode_pages2 returns then we could clear the flag
without the cache having been properly invalidated.
So, we must clear the flag first and then invalidate the pages. This,
however, opens another race:
It's possible to have two concurrent read() calls that end up in
nfs_revalidate_mapping at the same time. The first one clears the
NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.
Just before calling that though, the other task races in, checks the
flag and finds it cleared. At that point, it sees that the mapping is
good and gets the lock on the page, allowing the read() to be satisfied
from the cache even though the data is no longer valid.
This effect is easily manifested by running diotest3 from the LTP test
suite on NFS. That program does a series of DIO writes and buffered
reads. The operations are serialized and page-aligned but the existing
code fails the test since it occasionally allows a read to come out of
the cache instead of being done on the wire when it should. While mixing
direct and buffered I/O isn't recommended, I believe it's possible to
hit this in other ways that just use buffered I/O, even though that
makes it harder to reproduce.
The problem is that the checking/clearing of that flag and the
invalidation of the mapping need to be done as a single unit. Fix this by
serializing concurrent invalidations with a bitlock.
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/inode.c | 32 +++++++++++++++++++++++++++-----
include/linux/nfs_fs.h | 1 +
2 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 00ad1c2..6fa07e1 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -977,11 +977,11 @@ static int nfs_invalidate_mapping(struct inode *inode, struct address_space *map
if (ret < 0)
return ret;
}
- spin_lock(&inode->i_lock);
- nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
- if (S_ISDIR(inode->i_mode))
+ if (S_ISDIR(inode->i_mode)) {
+ spin_lock(&inode->i_lock);
memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
- spin_unlock(&inode->i_lock);
+ spin_unlock(&inode->i_lock);
+ }
nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
nfs_fscache_wait_on_invalidate(inode);
@@ -1007,6 +1007,7 @@ static bool nfs_mapping_need_revalidate_inode(struct inode *inode)
int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
{
struct nfs_inode *nfsi = NFS_I(inode);
+ unsigned long *bitlock = &NFS_I(inode)->flags;
int ret = 0;
/* swapfiles are not supposed to be shared. */
@@ -1018,12 +1019,33 @@ int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
if (ret < 0)
goto out;
}
+
+ /*
+ * We must clear NFS_INO_INVALID_DATA first to ensure that
+ * invalidations that come in while we're shooting down the mappings
+ * are respected. But, that leaves a race window where one revalidator
+ * can clear the flag, and then another checks it before the mapping
+ * gets invalidated. Fix that by serializing access to this part of
+ * the function.
+ */
+ ret = wait_on_bit_lock(bitlock, NFS_INO_INVALIDATING,
+ nfs_wait_bit_killable, TASK_KILLABLE);
+ if (ret)
+ goto out;
+
+ spin_lock(&inode->i_lock);
if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+ nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+ spin_unlock(&inode->i_lock);
trace_nfs_invalidate_mapping_enter(inode);
ret = nfs_invalidate_mapping(inode, mapping);
trace_nfs_invalidate_mapping_exit(inode, ret);
- }
+ } else
+ spin_unlock(&inode->i_lock);
+ clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
+ smp_mb__after_clear_bit();
+ wake_up_bit(bitlock, NFS_INO_INVALIDATING);
out:
return ret;
}
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 4899737..18fb16f 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -215,6 +215,7 @@ struct nfs_inode {
#define NFS_INO_ADVISE_RDPLUS (0) /* advise readdirplus */
#define NFS_INO_STALE (1) /* possible stale inode */
#define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
+#define NFS_INO_INVALIDATING (3) /* inode is being invalidated */
#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
#define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
#define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
--
1.8.5.3
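For reference, a minimal userspace sketch (hypothetical, not the actual LTP
diotest3 source) of the serialized, page-aligned O_DIRECT write / buffered
read pattern that the commit message above describes:
---------------8<-----------------
/* dio-vs-buffered.c: write a page with O_DIRECT, read it back through the
 * pagecache, and complain if the buffered read returns stale data. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 4096	/* one page; O_DIRECT needs aligned buffer/offset/length */

int main(int argc, char **argv)
{
	char *wbuf, rbuf[BUFSZ];
	int dfd, bfd, i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file on an NFS mount>\n", argv[0]);
		return 1;
	}
	if (posix_memalign((void **)&wbuf, BUFSZ, BUFSZ))
		return 1;

	dfd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
	bfd = open(argv[1], O_RDONLY);
	if (dfd < 0 || bfd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 1000; i++) {
		memset(wbuf, 'a' + (i % 26), BUFSZ);

		/* page-aligned direct write at offset 0 */
		if (pwrite(dfd, wbuf, BUFSZ, 0) != BUFSZ) {
			perror("pwrite");
			return 1;
		}
		/* buffered read of the same page must observe the new data */
		if (pread(bfd, rbuf, BUFSZ, 0) != BUFSZ) {
			perror("pread");
			return 1;
		}
		if (memcmp(wbuf, rbuf, BUFSZ)) {
			fprintf(stderr, "stale read on iteration %d\n", i);
			return 1;
		}
	}
	printf("no stale reads observed\n");
	return 0;
}
---------------8<-----------------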
On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> On Fri, 24 Jan 2014 10:11:11 -0700
> Trond Myklebust <[email protected]> wrote:
>
> >
> > On Jan 24, 2014, at 8:52, Jeff Layton <[email protected]> wrote:
> >
> > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > Jeff Layton <[email protected]> wrote:
> > >
> > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > >> Christoph Hellwig <[email protected]> wrote:
> > >>
> > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > >>>>
> > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > >>>> mark the mapping for invalidation again when the write completes. Was
> > >>>> that intentional?
> > >>>>
> > >>>> It seems a little excessive and might hurt performance in some cases.
> > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > >>>> this approach seems to give better cache coherency.
> > >>>
> > >>> This follows the model implemented and documented in
> > >>> generic_file_direct_write().
> > >>>
> > >>
> > >> Ok, thanks. That makes sense, and the problem described in those
> > >> comments is almost exactly the one I've seen in practice.
> > >>
> > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > >> flag is handled, but that really has nothing to do with this patchset.
> > >>
> > >> You can add my Tested-by to the set if you like...
> > >>
> > >
> > > (re-sending with Trond's address fixed)
> > >
> > > I may have spoken too soon...
> > >
> > > This patchset didn't fix the problem once I cranked up the concurrency
> > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > and helps narrow the race window some, but the way that
> > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > >
> > > The following patch does seem to fix it however. It's a combination of
> > > a test patch that Trond gave me a while back and another change to
> > > serialize the nfs_invalidate_mapping ops.
> > >
> > > I think it's a reasonable approach to deal with the problem, but we
> > > likely have some other areas that will need similar treatment since
> > > they also check NFS_INO_INVALID_DATA:
> > >
> > > nfs_write_pageuptodate
> > > nfs_readdir_search_for_cookie
> > > nfs_update_inode
> > >
> > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > opinion on the basic approach, or whether you have an idea of how to
> > > better handle the races here:
> >
> > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> >
>
>
> nfs_write_pageuptodate does this:
>
> ---------------8<-----------------
> if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> return false;
> out:
> return PageUptodate(page) != 0;
> ---------------8<-----------------
>
> With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> only later would the page be invalidated. So, there's a race window in
> there where the bit could be cleared but the page flag is still set,
> even though it's on its way out the cache. So, I think we'd need to do
> some similar sort of locking in there to make sure that doesn't happen.
We _cannot_ lock against nfs_revalidate_mapping() here, because we could
end up deadlocking with invalidate_inode_pages2().
If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
the optimisation in that case, but I'd like to understand what the race
would be: don't forget that the page is marked as PageUptodate(), which
means that either invalidate_inode_pages2() has not yet reached this
page, or that a read of the page succeeded after the invalidation was
made.
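For illustration, a minimal sketch of the test being suggested (the real
nfs_write_pageuptodate() has more checks than shown here, and the test_bit()
line is only the proposed addition, not merged code):
---------------8<-----------------
static bool nfs_write_pageuptodate(struct page *page, struct inode *inode)
{
	struct nfs_inode *nfsi = NFS_I(inode);

	if (nfsi->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
		return false;
	/* proposed: don't trust PageUptodate() while another task is
	 * shooting down the mapping */
	if (test_bit(NFS_INO_INVALIDATING, &nfsi->flags))
		return false;
	return PageUptodate(page) != 0;
}
---------------8<-----------------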
> nfs_update_inode just does this:
>
> if (invalid & NFS_INO_INVALID_DATA)
> nfs_fscache_invalidate(inode);
>
> ...again, since we clear the bit first with this patch, I think we have
> a potential race window there too. We might not see it set in a
> situation where we would have before. That case is a bit more
> problematic since we can't sleep to wait on the bitlock there.
Umm... That test in nfs_update_inode() is there because we might just
have _set_ the NFS_INO_INVALID_DATA bit.
>
> It might be best to just get rid of that call altogether and move it
> into nfs_invalidate_mapping. It seems to me that we ought to just
> handle fscache the same way we do the pagecache when it comes to
> invalidation.
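A minimal sketch of what that suggestion might look like (abbreviated, and
the exact placement of the fscache calls is an assumption rather than a
tested patch):
---------------8<-----------------
static int nfs_invalidate_mapping(struct inode *inode,
				  struct address_space *mapping)
{
	int ret = invalidate_inode_pages2(mapping);

	if (ret < 0)
		return ret;
	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
	/* moved here from nfs_update_inode() so FS-Cache is invalidated
	 * under the same NFS_INO_INVALIDATING serialization as the
	 * pagecache */
	nfs_fscache_invalidate(inode);
	nfs_fscache_wait_on_invalidate(inode);
	return 0;
}
---------------8<-----------------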
>
> As far as the readdir code goes, I haven't looked as closely at that
> yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
> settle the other two cases, I'll give that closer scrutiny.
>
> Thanks,
--
Trond Myklebust
Linux NFS client maintainer