hi all-
it appears that changes in or around invalidate_inode_pages that went into
2.5.32/3 have broken certain cache behaviors that the NFS client depends
on. when the NFS client calls invalidate_inode_pages to purge directory
data from the page cache, sometimes it works as before, and sometimes it
doesn't, leaving stale pages in the page cache.
i don't know much about the MM, but i can reliably reproduce the problem.
what more information can i provide? please copy me, as i'm not a member
of the linux-kernel mailing list.
- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
Chuck Lever wrote:
>
> hi all-
>
> it appears that changes in or around invalidate_inode_pages that went into
> 2.5.32/3 have broken certain cache behaviors that the NFS client depends
> on. when the NFS client calls invalidate_inode_pages to purge directory
> data from the page cache, sometimes it works as before, and sometimes it
> doesn't, leaving stale pages in the page cache.
>
> i don't know much about the MM, but i can reliably reproduce the problem.
> what more information can i provide? please copy me, as i'm not a member
> of the linux-kernel mailing list.
The locking became finer-grained.
Up to 2.5.31 or thereabouts, invalidate_inode_pages was grabbing
the global pagemap_lru_lock and walking all the inode's pages inside
that lock. This basically shuts down the entire VM for the whole
operation.
In 2.5.33, that lock is per-zone, and invalidate takes it on a
much more fine-grained basis.
That all assumes SMP/preempt. If you're seeing these problems on
uniproc/non-preempt then something fishy may be happening.
But be aware that invalidate_inode_pages has always been best-effort.
If someone is reading, or writing one of those pages then it
certainly will not be removed. If you need assurances that the
pagecache has been taken down then we'll need something stronger
in there.
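For illustration, the per-page test is roughly of this shape (a simplified
sketch, not the actual mm/ code; helper names are the usual 2.5-era ones):

/* Illustration only: anything locked or otherwise referenced is quietly
 * skipped rather than waited for -- that is what "best-effort" means here. */
static int can_invalidate(struct page *page)
{
	if (TestSetPageLocked(page))
		return 0;		/* a reader or writer holds the page lock */
	if (page_count(page) != 1) {
		unlock_page(page);
		return 0;		/* someone else holds a temporary reference */
	}
	return 1;			/* caller may remove it from the pagecache */
}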
On Thu, 5 Sep 2002, Andrew Morton wrote:
> That all assumes SMP/preempt. If you're seeing these problems on
> uniproc/non-preempt then something fishy may be happening.
sorry, forgot to mention: the system is UP, non-preemptible, high mem.
invalidate_inode_pages isn't freeing these pages because the page count is
two. perhaps the page count semantics of one of the page cache helper
functions has changed slightly. i'm still diagnosing.
fortunately the problem is deterministically reproducible. basic test6,
the readdir test of the 2002 connectathon test suite, fails -- either a
duplicate file entry or a missing file entry appears after some standard
file creation and removal processing in that directory. the incorrect
entries occur because the NFS client zaps the directory's page cache to
force the next reader to re-read the directory from the server. but
invalidate_inode_pages decides to leave the pages in the cache, so the
next reader gets stale cached data instead.
> But be aware that invalidate_inode_pages has always been best-effort.
> If someone is reading, or writing one of those pages then it
> certainly will not be removed. If you need assurances that the
> pagecache has been taken down then we'll need something stronger
> in there.
right, i've always wondered why the NFS client doesn't use
truncate_inode_pages, or something like it, instead. that can wait for
another day, though. :-)
- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
Chuck Lever wrote:
>
> On Thu, 5 Sep 2002, Andrew Morton wrote:
>
> > That all assumes SMP/preempt. If you're seeing these problems on
> > uniproc/non-preempt then something fishy may be happening.
>
> sorry, forgot to mention: the system is UP, non-preemptible, high mem.
>
> invalidate_inode_pages isn't freeing these pages because the page count is
> two. perhaps the page count semantics of one of the page cache helper
> functions has changed slightly. i'm still diagnosing.
OK, thanks. I can't immediately think of anything which would have
altered the refcounting in there except for page reclaim.
What you could do is to check whether the `page_count(page) != 1'
pages are on the LRU. If they have !PageLRU(page) then the extra
ref was taken by shrink_cache(). But that would be pretty rare,
especially on UP.
You may have more success using the stronger invalidate_inode_pages2().
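For example, a throwaway debug helper along these lines, called from the
invalidate loop (placement and printk format are just a suggestion):

/* Debug-only sketch: report pages whose extra reference did not come
 * from the LRU, i.e. not from page reclaim. */
static void report_extra_ref(struct page *page)
{
	if (page_count(page) != 1 && !PageLRU(page))
		printk(KERN_DEBUG "invalidate: page %p count=%d flags=%08lx not on LRU\n",
			page, page_count(page), page->flags);
}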
>>>>> " " == Andrew Morton <[email protected]> writes:
> You may have more success using the stronger
> invalidate_inode_pages2().
Shouldn't make any difference. Chuck is seeing this on readdir() pages
which, of course, don't suffer from problems of dirtiness etc on NFS.
I've noticed that the code that used to clear page->flags when we
added a page to the page_cache has disappeared. Is it possible that
pages are being re-added with screwy values for page->flags?
Cheers,
Trond
Trond Myklebust wrote:
>
> >>>>> " " == Andrew Morton <[email protected]> writes:
>
> > You may have more success using the stronger
> > invalidate_inode_pages2().
>
> Shouldn't make any difference. Chuck is seeing this on readdir() pages
> which, of course, don't suffer from problems of dirtiness etc on NFS.
Well the VM will take a ref on the page during reclaim even if it
is clean.
With what sort of frequency does this happen? If it's easily reproducible
then dunno. If it's once-an-hour then it may be page reclaim, conceivably.
The PageLRU debug test in there will tell us.
> I've noticed that the code that used to clear page->flags when we
> added a page to the page_cache has disappeared. Is it possible that
> pages are being re-added with screwy values for page->flags?
It's possible - those flags were getting set all over the place,
and for add_to_swap(), the nonatomic rmw was an outright bug.
The intent now is to set the initial page state in prep_new_page()
and to then modify it atomically from there on in ways which make
sense, rather than "because that's what the code used to do".
Something may have got screwed up in there. Suggest you print
out page->flags from in there and we can take a look.
It's a bit worrisome if NFS is dependent upon successful pagecache
takedown in invalidate_inode_pages.
>>>>> " " == Andrew Morton <[email protected]> writes:
> It's a bit worrisome if NFS is dependent upon successful
> pagecache takedown in invalidate_inode_pages.
Why? We don't use all that buffer crap, so in principle
invalidate_inode_page() is only supposed to fail for us if
- page is locked (i.e. new read in progress or something like that)
- page is refcounted (by something like mmap()).
neither of which should be the case here.
Cheers,
Trond
Trond Myklebust wrote:
>
> >>>>> " " == Andrew Morton <[email protected]> writes:
>
> > It's a bit worrisome if NFS is dependent upon successful
> > pagecache takedown in invalidate_inode_pages.
>
> Why? We don't use all that buffer crap, so in principle
> invalidate_inode_page() is only supposed to fail for us if
>
> - page is locked (i.e. new read in progress or something like that)
> - page is refcounted (by something like mmap()).
- someone took a temporary ref on the page.
Possibly it is the deferred LRU addition code. Try running
lru_add_drain() before invalidate_inode_pages().
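Something like this in the zap path, that is (sketch only, not actual
fs/nfs code; the invalidate_inode_pages() signature is the inode-based
one used elsewhere in this thread):

/* Sketch only: drain the per-CPU deferred-LRU queue first so its extra
 * page references don't defeat the best-effort invalidate. */
static void zap_directory_pages(struct inode *inode)
{
	lru_add_drain();
	invalidate_inode_pages(inode);
}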
> neither of which should be the case here.
Probably, it worked OK with the global locking because nobody was taking
a temp ref against those pages.
Please tell me exactly what semantics NFS needs in there. Does
truncate_inode_pages() do the wrong thing?
>>>>> " " == Andrew Morton <[email protected]> writes:
> Probably, it worked OK with the global locking because nobody
> was taking a temp ref against those pages.
But we still have global locking in that readdir by means of the
BKL. There should be no nasty races...
> Please tell me exactly what semantics NFS needs in there. Does
> truncate_inode_pages() do the wrong thing?
Definitely. In most cases we *cannot* wait on I/O completion because
nfs_zap_caches() can be called directly from the 'rpciod' process by
means of a callback. Since 'rpciod' is responsible for completing all
asynchronous I/O, then truncate_inode_pages() would deadlock.
As I said: I don't believe the problem here has anything to do with
invalidate_inode_pages vs. truncate_inode_pages:
- Pages should only be locked if they are actually being read from
the server.
- They should only be refcounted and/or cleared while the BKL is
held...
There is no reason why code which worked fine under 2.2.x and 2.4.x
shouldn't work under 2.5.x.
Cheers,
Trond
Trond Myklebust wrote:
>
> >>>>> " " == Andrew Morton <[email protected]> writes:
>
> > Probably, it worked OK with the global locking because nobody
> > was taking a temp ref against those pages.
>
> But we still have global locking in that readdir by means of the
> BKL. There should be no nasty races...
>
> > Please tell me exactly what semantics NFS needs in there. Does
> > truncate_inode_pages() do the wrong thing?
>
> Definitely. In most cases we *cannot* wait on I/O completion because
> nfs_zap_caches() can be called directly from the 'rpciod' process by
> means of a callback. Since 'rpciod' is responsible for completing all
> asynchronous I/O, then truncate_inode_pages() would deadlock.
But if there are such pages, invalidate_inode_pages() would not
have removed them anyway?
> As I said: I don't believe the problem here has anything to do with
> invalidate_inode_pages vs. truncate_inode_pages:
> - Pages should only be locked if they are actually being read from
> the server.
> - They should only be refcounted and/or cleared while the BKL is
> held...
> There is no reason why code which worked fine under 2.2.x and 2.4.x
> shouldn't work under 2.5.x.
Trond, there are very good reasons why it broke. Those pages are
visible to the whole world via global data structures - both the
page LRUs and via the superblocks->inodes walk. Those things exist
for legitimate purposes, and it is legitimate for async threads
of control to take a reference on those pages while playing with them.
It just "happened to work" in earlier kernels.
I suspect we can just remove the page_count() test from invalidate
and that will fix everything up. That will give stronger invalidate
and anything which doesn't like that is probably buggy wrt truncate anyway.
Could you test that?
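In terms of the per-page decision, roughly this (a sketch only -- the real
loop, the locking and the pagevec handling are elided, and the helper names
are the usual 2.5-era ones):

/* Sketch of the per-page decision with the refcount test dropped.  A page
 * that is locked, dirty or under writeback is still left alone; a mere
 * extra reference no longer saves it. */
static void try_invalidate_page(struct address_space *mapping, struct page *page)
{
	if (TestSetPageLocked(page))
		return;				/* in use: best-effort, skip it */
	if (page->mapping != mapping || PageDirty(page) || PageWriteback(page)) {
		unlock_page(page);
		return;				/* racing truncate, or data we must keep */
	}
	/* was: if (page_count(page) != 1) { unlock_page(page); return; } */
	remove_from_page_cache(page);
	unlock_page(page);
	page_cache_release(page);		/* drop the pagecache's reference */
}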
>>>>> " " == Andrew Morton <[email protected]> writes:
>> 'rpciod' process by means of a callback. Since 'rpciod' is
>> responsible for completing all asynchronous I/O, then
>> truncate_inode_pages() would deadlock.
> But if there are such pages, invalidate_inode_pages() would not
> have removed them anyway?
It should not have to. If the page is locked because it is being paged
in, then we're guaranteed that the data is fresh from the server. If
it is locked because we are writing, then ditto.
> Trond, there are very good reasons why it broke. Those pages
> are visible to the whole world via global data structures -
> both the page LRUs and via the superblocks->inodes walk. Those
> things exist for legitimate purposes, and it is legitimate for
> async threads of control to take a reference on those pages
> while playing with them.
I don't buy that explanation w.r.t. readdir: those pages should always
be clean. The global data structures might walk them in order to throw
them out, but that should be a bonus here.
In any case it seems a bit far fetched that they should be pinning
*all* the readdir pages...
> I suspect we can just remove the page_count() test from
> invalidate and that will fix everything up. That will give
> stronger invalidate and anything which doesn't like that is
> probably buggy wrt truncate anyway.
I never liked the page_count() test, but Linus overruled me because of
the concern for consistency w.r.t. shared mmap(). Is the latter case
resolved now?
I notice for instance that you've added mapping->releasepage().
What should we be doing for that?
Cheers,
Trond
Andrew Morton wrote:
> Trond Myklebust wrote:
>
>> >>>>> " " == Andrew Morton <[email protected]> writes:
>>
>> > You may have more success using the stronger
>> > invalidate_inode_pages2().
>>
>>Shouldn't make any difference. Chuck is seeing this on readdir() pages
>>which, of course, don't suffer from problems of dirtiness etc on NFS.
>>
>
> Well the VM will take a ref on the page during reclaim even if it
> is clean.
>
> With what sort of frequency does this happen? If it's easily reproducible
> then dunno. If it's once-an-hour then it may be page reclaim, conceivably.
> The PageLRU debug test in there will tell us.
this happens every time i run test6 on 2.5.32 or 2.5.33. the pages
are not on the LRU, and are not locked, when the NFS client calls
invalidate_inode_pages.
how do these pages get into the page cache? answer:
nfs_readdir_filler is invoked by read_cache_page to fill a page.
when nfs_readdir_filler is invoked in 2.5.31, the page count on the
page to be filled is always 2. when nfs_readdir_filler is invoked
in 2.5.32+, the page count is very often 3, but occasionally it is 2.
- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
Andrew Morton wrote:
> Trond, there are very good reasons why it broke. Those pages are
> visible to the whole world via global data structures - both the
> page LRUs and via the superblocks->inodes walk. Those things exist
> for legitimate purposes, and it is legitimate for async threads
> of control to take a reference on those pages while playing with them.
>
> It just "happened to work" in earlier kernels.
>
> I suspect we can just remove the page_count() test from invalidate
> and that will fix everything up. That will give stronger invalidate
> and anything which doesn't like that is probably buggy wrt truncate anyway.
>
> Could you test that?
removing that test from invalidate_inode_pages allows test6 to run to
completion.
however, i don't see why the reference counts should be higher in
2.5.32 than they were in 2.5.31. is there a good way to test that
these pages do not become orphaned?
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
Trond Myklebust wrote:
>
> ...
> > Trond, there are very good reasons why it broke. Those pages
> > are visible to the whole world via global data structures -
> > both the page LRUs and via the superblocks->inodes walk. Those
> > things exist for legitimate purposes, and it is legitimate for
> > async threads of control to take a reference on those pages
> > while playing with them.
>
> I don't buy that explanation w.r.t. readdir: those pages should always
> be clean. The global data structures might walk them in order to throw
> them out, but that should be a bonus here.
It's not related to cleanliness - the VM evicts clean pages and
to do that it needs to manipulate them. To manipulate them the
VM needs to pin them down: via spinlocks, via PageLocked or via
page_count().
And a 2.4 kernel right now will set PG_locked while looking at
a clean NFS directory page at the tail of the LRU. A concurrent
invalidate_inode_page will not invalidate that page, so there
is exposure in 2.4 SMP. Except everything is serialised under
the LRU lock in 2.4...
> In any case it seems a bit far fetched that they should be pinning
> *all* the readdir pages...
>
> > I suspect we can just remove the page_count() test from
> > invalidate and that will fix everything up. That will give
> > stronger invalidate and anything which doesn't like that is
> > probably buggy wrt truncate anyway.
>
> I never liked the page_count() test, but Linus overruled me because of
> the concern for consistency w.r.t. shared mmap(). Is the latter case
> resolved now?
Oh, I see.
I'm not sure what semantics we really want for this. If we were to
"invalidate" a mapped page then it would become anonymous, which
makes some sense.
But if we want to keep the current "don't detach mapped pages from
pagecache" semantics then we should test page->pte.direct rather than
page_count(page). Making that 100% reliable would be tricky because
of the need for locking wrt concurrent page faults.
This only really matters (I think) for invalidate_inode_pages2(),
where we take down pagecache after O_DIRECT IO. In that case I
suspect we _should_ anonymise mapped cache, because we've gone
and changed the data on-disk.
> I notice for instance that you've added mapping->releasepage().
> What should we be doing for that?
page->private is "anonymous data which only the address_space knows
about".
If the mapping has some data at page->private it must set PG_private
inside lock_page and increment page->count.
If the VM wants to reclaim a page, and it has PG_private set then
the vm will run mapping->releasepage() against the page. The mapping's
releasepage must try to clear away whatever is held at ->private. If
that was successful then releasepage() must clear PG_private, decrement
page->count and return non-zero. If the info at ->private is not
freeable, releasepage returns zero. ->releasepage() may not sleep in
2.5.
So. NFS can put anything it likes at page->private. If you're not
doing that then you don't need a releasepage. If you are doing that
then you must have a releasepage().
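To make the protocol concrete, a ->releasepage() would be shaped something
like this (the fs-private structure and its "freeable" test are invented,
and the hook signature and page-flag macro names are from memory, so treat
it purely as a sketch of the protocol, not a reference implementation):

struct example_private {		/* invented per-page fs state */
	int io_pending;
};

static int example_releasepage(struct page *page, int gfp_mask)
{
	struct example_private *priv = (struct example_private *)page->private;

	if (priv && priv->io_pending)
		return 0;		/* ->private not freeable: tell the VM to back off */

	kfree(priv);
	page->private = 0;
	ClearPagePrivate(page);		/* clear PG_private ... */
	page_cache_release(page);	/* ... and drop the ref taken when it was set */
	return 1;
}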
Chuck Lever wrote:
>
> Andrew Morton wrote:
>
> > Trond, there are very good reasons why it broke. Those pages are
> > visible to the whole world via global data structures - both the
> > page LRUs and via the superblocks->inodes walk. Those things exist
> > for legitimate purposes, and it is legitimate for async threads
> > of control to take a reference on those pages while playing with them.
> >
> > It just "happened to work" in earlier kernels.
> >
> > I suspect we can just remove the page_count() test from invalidate
> > and that will fix everything up. That will give stronger invalidate
> > and anything which doesn't like that is probably buggy wrt truncate anyway.
> >
> > Could you test that?
>
> removing that test from invalidate_inode_pages allows test6 to run to
> completion.
OK, thanks. I bet adding an lru_add_drain() fixes it too.
> however, i don't see why the reference counts should be higher in
> 2.5.32 than they were in 2.5.31.
Those pages are sitting in the cpu-local "to be added to the LRU soon"
queue, with their refcount elevated.
That queue really only makes sense for SMP, but it's enabled on UP so
we pick up any weird effects. Voila.
> is there a good way to test that
> these pages do not become orphaned?
Not that I know of - just run the test for ages and see if the box
runs out of memory.
>>>>> " " == Andrew Morton <[email protected]> writes:
> I'm not sure what semantics we really want for this. If we
> were to "invalidate" a mapped page then it would become
> anonymous, which makes some sense.
> But if we want to keep the current "don't detach mapped pages
> from pagecache" semantics then we should test page->pte.direct
> rather than page_count(page). Making that 100% reliable would
> be tricky because of the need for locking wrt concurrent page
> faults.
I believe that Linus is very much in favour of this latter
approach. We had the 'anonymize page' semantics under 2.2.x, but they
were changed in the 2.4.0-pre series.
The problem is that NFS can clear your page cache at any
moment. Things like out-of-order RPC calls etc. can trigger it. If
your knee-jerk response is to anonymize all those pages that are
referenced, you might end up with page aliasing (well - in practice
not, since we do protect against that but Linus wasn't convinced).
> The mapping's releasepage must try to clear away whatever is
> held at ->private. If that was successful then releasepage()
> must clear PG_private, decrement page->count and return non-zero.
> If the info at ->private is not freeable, releasepage returns zero.
> ->releasepage() may not sleep in 2.5.
Interesting. Our 'private data' in this case would be whether or not
there is some pending I/O operation on the page. We don't keep it in
page->private, but I assume that's less of a problem ;-)
It's unrelated to the topic we're discussing, but having looked at it
I was thinking that we might want to use it in order to replace the
NFS 'flushd' daemon. Currently the latter is needed mainly in order
to ensure that readaheads actually do complete in a timely fashion
(i.e. before we run out of memory). Since try_to_release_page() is
called in shrink_cache(), I was thinking that we might pass that buck
on to releasepage()
(btw: there's currently a bug w.r.t. that. If I understand you
correctly, the releasepage() thing is unrelated to page->buffers, but
the call in shrink_cache() is masked by an 'if (page->buffers)')
Extending that idea, we might also be able to get rid of
nfs_try_to_free_pages(), if we also make releasepage() call the
necessary routines to flush dirty pages too...
Cheers,
Trond
Trond Myklebust wrote:
>
> >>>>> " " == Andrew Morton <[email protected]> writes:
>
> > I'm not sure what semantics we really want for this. If we
> > were to "invalidate" a mapped page then it would become
> > anonymous, which makes some sense.
>
> > But if we want to keep the current "don't detach mapped pages
> > from pagecache" semantics then we should test page->pte.direct
> > rather than page_count(page). Making that 100% reliable would
> > be tricky because of the need for locking wrt concurrent page
> > faults.
>
> I believe that Linus is very much in favour of this latter
> approach. We had the 'anonymize page' semantics under 2.2.x, but they
> were changed in the 2.4.0-pre series.
hm.
> The problem is that NFS can clear your page cache at any
> moment. Things like out-of-order RPC calls etc. can trigger it. If
> your knee-jerk response is to anonymize all those pages that are
> referenced, you might end up with page aliasing (well - in practice
> not, since we do protect against that but Linus wasn't convinced).
Oh. I thought this was a "purely theoretical" discussion because
it only pertains to directory data. You seem to be saying that
NFS is doing this for S_ISREG pagecache also?
Again: what do you _want_ to do? Having potentially incorrect pagecache
mapped into process memory space is probably not the answer to that ;)
Should we be forcibly unmapping the pages from pagetables? That would
result in them being faulted in again, and re-read.
> > The mapping's releasepage must try to clear away whatever is
> > held at ->private. If that was successful then releasepage()
> > must clear PG_private, decrement page->count and return non-zero.
> > If the info at ->private is not freeable, releasepage returns zero.
> > ->releasepage() may not sleep in 2.5.
>
> Interesting. Our 'private data' in this case would be whether or not
> there is some pending I/O operation on the page. We don't keep it in
> page->private, but I assume that's less of a problem ;-)
> It's unrelated to the topic we're discussing, but having looked at it
> I was thinking that we might want to use it in order to replace the
> NFS 'flushd' daemon. Currently the latter is needed mainly in order
> to ensure that readaheads actually do complete in a timely fashion
> (i.e. before we run out of memory). Since try_to_release_page() is
> called in shrink_cache(), I was thinking that we might pass that buck
> on to releasepage()
That might work. It's a bit of a hassle that ->releasepage() must
be nonblocking, but generally it just wants to grab locks, decrement
refcounts and free things.
> (btw: there's currently a bug w.r.t. that. If I understand you
> correctly, the releasepage() thing is unrelated to page->buffers, but
> the call in shrink_cache() is masked by an 'if (page->buffers)')
That would be in a 2.4 kernel? In 2.4, page->buffers can only
contain buffers. If it contains anything else the kernel will
die.
> Extending that idea, we might also be able to get rid of
> nfs_try_to_free_pages(), if we also make releasepage() call the
> necessary routines to flush dirty pages too...
->releasepage() should never succeed against a dirty page. In fact
the VM shouldn't even bother calling it - there's a minor efficiency
bug there.
If your mapping has old, dirty pages then the VM will call your
->vm_writeback to write some of them back. Or it will repeatedly
call ->writepage if you don't define ->vm_writeback. That's the
place to clean the pages.
>>>>> " " == Andrew Morton <[email protected]> writes:
> Oh. I thought this was a "purely theoretical" discussion
> because it only pertains to directory data. You seem to be
> saying that NFS is doing this for S_ISREG pagecache also?
Yes. Chuck's particular example pertains to directory data, but if
we're to expand the discussion to what *should* we be doing, then we
need to consider the S_ISREG case too.
> Again: what do you _want_ to do? Having potentially incorrect
> pagecache mapped into process memory space is probably not the
> answer to that ;)
> Should we be forcibly unmapping the pages from pagetables?
> That would result in them being faulted in again, and re-read.
If we could do that for unlocked pages, and at the same time flush out
any pending writes, then that would be ideal. As long as the
page_count() thing is there, we *do* unfortunately have a potential
for cache corruption.
>> (btw: there's currently a bug w.r.t. that. If I understand you
>> correctly, the releasepage() thing is unrelated to
>> page->buffers, but the call in shrink_cache() is masked by an
>> 'if (page->buffers)')
> That would be in a 2.4 kernel? In 2.4, page->buffers can only
> contain buffers. If it contains anything else the kernel will
> die.
2.5.33
> If your mapping has old, dirty pages then the VM will call your
> ->vm_writeback to write some of them back. Or it will repeatedly
> call ->writepage if you don't define ->vm_writeback. That's
> the place to clean the pages.
OK. I'll think about that. I'm off to France on a fortnight's vacation
now. Should leave me with a bit of free time, so once I finish the VFS
credential code, I'll try to find some time to catch up on your recent
changes...
Cheers,
Trond
Chuck Lever wrote:
>
> On Thu, 5 Sep 2002, Andrew Morton wrote:
>
> > That all assumes SMP/preempt. If you're seeing these problems on
> > uniproc/non-preempt then something fishy may be happening.
>
> sorry, forgot to mention: the system is UP, non-preemptible, high mem.
>
> invalidate_inode_pages isn't freeing these pages because the page count is
> two. perhaps the page count semantics of one of the page cache helper
> functions has changed slightly. i'm still diagnosing.
>
> fortunately the problem is deterministically reproducible. basic test6,
> the readdir test of the 2002 connectathon test suite, fails -- either a
> duplicate file entry or a missing file entry appears after some standard
> file creation and removal processing in that directory. the incorrect
> entries occur because the NFS client zaps the directory's page cache to
> force the next reader to re-read the directory from the server. but
> invalidate_inode_pages decides to leave the pages in the cache, so the
> next reader gets stale cached data instead.
Perhaps this explains my nfs problem. (2.5.32/33 UP, preempt, no
highmem)
Sometimes, when editing a file on nfs, the file disappears
from the server. The client believes it is there
until an umount+mount sequence. It doesn't happen
for all files, but it is 100% reproducible for those affected.
Editing changes the directory when the editor makes a backup
copy; the old directory data is wrongly kept around, so the
save into the existing file silently fails because stale
directory information from the cache is used?
Helge Hafting
On Fri, 6 Sep 2002, Helge Hafting wrote:
> Chuck Lever wrote:
> Perhaps this explains my nfs problem. (2.5.32/33 UP, preempt, no
> highmem)
> Sometimes, when editing a file on nfs, the file disappears
> from the server. The client believes it is there
> until an umount+mount sequence. It doesn't happen
> for all files, but it is 100% reproducible for those affected.
>
> Editing changes the directory when the editor makes a backup
> copy; the old directory data is wrongly kept around, so the
> save into the existing file silently fails because stale
> directory information from the cache is used?
that sounds very similar, it is probably the same bug.
- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
On Thursday 05 September 2002 20:27, Andrew Morton wrote:
> But be aware that invalidate_inode_pages has always been best-effort.
> If someone is reading, or writing one of those pages then it
> certainly will not be removed. If you need assurances that the
> pagecache has been taken down then we'll need something stronger
> in there.
But what is stopping us now from removing a page from the page cache
even while IO is in progress? (Practical issue: the page lock, but
that's a self-fulfilling prophecy.)
--
Daniel
On Friday 06 September 2002 00:19, Andrew Morton wrote:
> I'm not sure what semantics we really want for this. If we were to
> "invalidate" a mapped page then it would become anonymous, which
> makes some sense.
There's no need to leave the page mapped, you can easily walk the rmap list
and remove the references.
> If the VM wants to reclaim a page, and it has PG_private set then
> the vm will run mapping->releasepage() against the page. The mapping's
> releasepage must try to clear away whatever is held at ->private. If
> that was successful then releasepage() must clear PG_private, decrement
> page->count and return non-zero. If the info at ->private is not
> freeable, releasepage returns zero. ->releasepage() may not sleep in
> 2.5.
>
> So. NFS can put anything it likes at page->private. If you're not
> doing that then you don't need a releasepage. If you are doing that
> then you must have a releasepage().
Right now, there are no filesystems actually doing anything filesystem
specific here, are there? I really wonder if making this field, formerly
known as buffers, opaque to the vfs is the right idea.
--
Daniel
On Saturday 07 September 2002 10:01, Daniel Phillips wrote:
> On Thursday 05 September 2002 20:27, Andrew Morton wrote:
> > But be aware that invalidate_inode_pages has always been best-effort.
> > If someone is reading, or writing one of those pages then it
> > certainly will not be removed. If you need assurances that the
> > pagecache has been taken down then we'll need something stronger
> > in there.
>
> But what is stopping us now from removing a page from the page cache
> even while IO is in progress? (Practical issue: the page lock, but
> that's a self-fulfilling prophecy.)
Never mind, I can see that the main function of the page lock here is to
allow the filesystem to know there's no IO in progress on a block and so the
block can be recovered and used for something else. Leaving the page in the
page cache during IO is a simple means of keeping track of this.
All the same, I have deep misgivings about the logic of the vfs truncate path
in general, and given all the trouble it's caused historically, there's good
reason to. I don't know, it may be perfect the way it is, but history would
suggest otherwise.
--
Daniel
On Friday 06 September 2002 03:08, Andrew Morton wrote:
> Should we be forcibly unmapping the pages from pagetables? That would
> result in them being faulted in again, and re-read.
There's no conceptual difficulty if it's a truncate. For NFS, maybe
the thing to do is hold ->sem long enough to do whatever dark magic
NFS needs?
> That might work. It's a bit of a hassle that ->releasepage() must
> be nonblocking, but generally it just wants to grab locks, decrement
> refcounts and free things.
Err. I really really wonder why the vfs is not supposed to know
about these goings on, to the extent of being able to take care of
them itself.
--
Daniel
Daniel Phillips wrote:
>
> > That might work. It's a bit of a hassle that ->releasepage() must
> > be nonblocking, but generally it just wants to grab locks, decrement
> > refcounts and free things.
>
> Err. I really really wonder why the vfs is not supposed to know
> about these goings on, to the extent of being able to take care of
> them itself.
->releasepage() was originally added for ext3, which has additional
ordering constraints, and has additional references to the page's
blocks.
reiserfs has a releasepage in 2.5. XFS has a ->releasepage.
Daniel Phillips wrote:
>
> On Friday 06 September 2002 00:19, Andrew Morton wrote:
> > I'm not sure what semantics we really want for this. If we were to
> > "invalidate" a mapped page then it would become anonymous, which
> > makes some sense.
>
> There's no need to leave the page mapped, you can easily walk the rmap list
> and remove the references.
Yes, unmapping and forcing a subsequent major fault would make
sense. It would require additional locking in the nopage path
to make this 100% accurate. I doubt if it's worth doing that,
so the unmap-and-refault-for-invalidate feature would probably
be best-effort. But more accurate than what we have now.
> > If the VM wants to reclaim a page, and it has PG_private set then
> > the vm will run mapping->releasepage() against the page. The mapping's
> > releasepage must try to clear away whatever is held at ->private. If
> > that was successful then releasepage() must clear PG_private, decrement
> > page->count and return non-zero. If the info at ->private is not
> > freeable, releasepage returns zero. ->releasepage() may not sleep in
> > 2.5.
> >
> > So. NFS can put anything it likes at page->private. If you're not
> > doing that then you don't need a releasepage. If you are doing that
> > then you must have a releasepage().
>
> Right now, there are no filesystems actually doing anything filesystem
> specific here, are there? I really wonder if making this field, formerly
> known as buffers, opaque to the vfs is the right idea.
That's right - it is only used for buffers at present. I was using
page->private in the delayed-allocate code for directly holding the
disk mapping information. There was some talk of using it for <mumble>
in XFS. Also it may be used in the NFS server for storing credential
information. Also it could be used for MAP_SHARED pages for credential
information - to fix the problem wherein kswapd (ie: root) is the
one who instantiates the page's blocks, thus allowing non-root programs
to consume the root-only reserved ext2/ext3 blocks.
Andrew Morton wrote:
>
> XFS has a ->releasepage.
And it looks like it wants to be able to sleep.
I lied about ->releasepage being nonblocking. try_to_free_buffers()
is nonblocking, and for a while, ->releasepage had to be nonblocking.
But we can relax that now. I'll put the gfp_flags back into
page reclaim's call to try_to_release_page(). That might help XFS
to play nice with the VM.
On Sat, 7 Sep 2002, Daniel Phillips wrote:
> On Friday 06 September 2002 00:19, Andrew Morton wrote:
> > I'm not sure what semantics we really want for this. If we were to
> > "invalidate" a mapped page then it would become anonymous, which
> > makes some sense.
>
> There's no need to leave the page mapped, you can easily walk the rmap list
> and remove the references.
A pagefaulting task can have claimed a reference to the page
and only be waiting on the lock we're currently holding.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Spamtraps of the month: [email protected] [email protected]
Rik van Riel wrote:
>
> On Sat, 7 Sep 2002, Daniel Phillips wrote:
> > On Friday 06 September 2002 00:19, Andrew Morton wrote:
> > > I'm not sure what semantics we really want for this. If we were to
> > > "invalidate" a mapped page then it would become anonymous, which
> > > makes some sense.
> >
> > There's no need to leave the page mapped, you can easily walk the rmap list
> > and remove the references.
>
> A pagefaulting task can have claimed a reference to the page
> and only be waiting on the lock we're currently holding.
Yup. In which case it comes away with an anonymous page.
I'm very unkeen about using the inaccurate invalidate_inode_pages
for anything which matters, really. And the consistency of pagecache
data matters.
NFS should be using something stronger. And that's basically
vmtruncate() without the i_size manipulation. Hold i_sem,
vmtruncate_list() for assured pagetable takedown, proper page
locking to take the pages out of pagecache, etc.
Sure, we could replace the page_count() heuristic with a
page->pte.direct heuristic. Which would work just as well. Or
better. Or worse. Who knows?
Guys, can we sort out the NFS locking so that it is possible to
take the correct locks to get the 100% behaviour?
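For concreteness, the shape I have in mind is something like this (sketch
only: it omits the vmtruncate_list() pagetable takedown and all of the NFS
locking questions, which are exactly the part that needs sorting out):

/* Sketch of a "strong" invalidate: like vmtruncate() from offset 0 but
 * without touching i_size.  truncate_inode_pages() sleeps on locked pages,
 * so this could only ever be called from process context, never rpciod. */
static void strong_invalidate(struct inode *inode)
{
	down(&inode->i_sem);			/* serialise with other users of the file */
	truncate_inode_pages(inode->i_mapping, 0);
	up(&inode->i_sem);
}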
On Saturday 07 September 2002 18:06, Andrew Morton wrote:
> Daniel Phillips wrote:
> > > If the VM wants to reclaim a page, and it has PG_private set then
> > > the vm will run mapping->releasepage() against the page. The mapping's
> > > releasepage must try to clear away whatever is held at ->private. If
> > > that was successful then releasepage() must clear PG_private, decrement
> > > page->count and return non-zero. If the info at ->private is not
> > > freeable, releasepage returns zero. ->releasepage() may not sleep in
> > > 2.5.
> > >
> > > So. NFS can put anything it likes at page->private. If you're not
> > > doing that then you don't need a releasepage. If you are doing that
> > > then you must have a releasepage().
> >
> > Right now, there are no filesystems actually doing anything filesystem
> > specific here, are there? I really wonder if making this field, formerly
> > known as buffers, opaque to the vfs is the right idea.
>
> That's right - it is only used for buffers at present. I was using
> page->private in the delayed-allocate code for directly holding the
> disk mapping information.
It's worth taking a deep breath at this point and considering whether that
part of delalloc can be written generically, supposing that page->buffers
were restored to its former non-opaque status.
> There was some talk of using it for <mumble> in XFS.
Christoph would be familiar with that issue, if there is one.
> Also it may be used in the NFS server for storing credential
> information.
The NFS server is still a deep, black hole in the kernel from my point of
view and I'd like that situation to end as soon as possible, so it might
as well start ending now. Can you provide me a pointer to go start
digging at that specific question?
(And strongly agreed about the invalidate_inode_pages(2) issue: at some
point it would behoove VM and NFS developers to reach a mutual
understanding of what that interface is supposed to do, because it is
growing new warts and tentacles at an alarming rate, and still seems to
be, at best, a heuristic. I currently have the impression that the
intent is to make files sort-of coherent between clients, for some
slippery definition of sort-of.)
> Also it could be used for MAP_SHARED pages for credential
> information - to fix the problem wherein kswapd (ie: root) is the
> one who instantiates the page's blocks, thus allowing non-root programs
> to consume the root-only reserved ext2/ext3 blocks.
OK, well I have four action items:
- Listen to Christoph wax poetic about how XFS handles buffers and
whether he needs them to be opaque to the vfs.
- Think about delalloc with the goal of providing the mechanism at
the vfs level, subject to somebody proving it actually has some
benefit (maybe XFS already proves this? Come to think of it, if
XFS already supports delayed allocation and doesn't rely on
page->private...)
- Find out what NFS really thinks invalidate_inode_pages(2) is
supposed to do.
- Start to sniff at the credential issue, with a view to
ascertaining whether that subsystem ultimately relies on the
buffer field being opaque. Intuitively I think 'no', the
credential mechanism is supposed to be a generic vfs mechanism,
and so for NFS or anyone else to bury credential state in the
field formerly known as buffers is fundamentally wrong.
In general, I think we'd be better off if page->buffers was not opaque,
and that it should remain non-opaque until we are definitely ready to
get rid of them. Doing otherwise will just allow various applications
to start growing tendrils into the field, making it that much harder
to get rid of when the time comes.
So the question is, does anyone *really* need (void *) page->private
instead of page->buffers?
--
Daniel
Daniel Phillips wrote:
>
> On Saturday 07 September 2002 18:06, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > > > If the VM wants to reclaim a page, and it has PG_private set then
> > > > the vm will run mapping->releasepage() against the page. The mapping's
> > > > releasepage must try to clear away whatever is held at ->private. If
> > > > that was successful then releasepage() must clear PG_private, decrement
> > > > page->count and return non-zero. If the info at ->private is not
> > > > freeable, releasepage returns zero. ->releasepage() may not sleep in
> > > > 2.5.
> > > >
> > > > So. NFS can put anything it likes at page->private. If you're not
> > > > doing that then you don't need a releasepage. If you are doing that
> > > > then you must have a releasepage().
> > >
> > > Right now, there are no filesystems actually doing anything filesystem
> > > specific here, are there? I really wonder if making this field, formerly
> > > known as buffers, opaque to the vfs is the right idea.
> >
> > That's right - it is only used for buffers at present. I was using
> > page->private in the delayed-allocate code for directly holding the
> > disk mapping information.
>
> It's worth taking a deep breath at this point and considering whether that
> part of delalloc can be written generically, supposing that page->buffers
> were restored to its former non-opaque status.
The main motivation for that code was to fix the computational cost
of the buffer layer by doing a complete end-around. That problem
was later solved by fixing the buffer layer.
Yes, there are still reasons for delayed allocation, the space reservation
API, etc. But they are not compelling. They are certainly not compelling
when writeback continues to use the (randomly-ordered) mapping->dirty_pages
walk.
With radix-tree enhancements to permit a pgoff_t-order walk of the
dirty pages then yeah, order-of-magnitude gains in the tiobench
random write pass.
> > Also it may be used in the NFS server for storing credential
> > information.
>
> The NFS server is still a deep, black hole in the kernel from my point of
> view and I'd like that situation to end as soon as possible, so it might
> as well start ending now. Can you provide me a pointer to go start
> digging at that specific question?
NFS client. Me too.
> (And strongly agreed about the invalidate_inode_pages(2) issue: at some
> point it would behoove VM and NFS developers to reach a mutual
> understanding of what that interface is supposed to do, because it is
> growing new warts and tentacles at an alarming rate, and still seems to
> be, at best, a heuristic. I currently have the impression that the
> intent is to make files sort-of coherent between clients, for some
> slippery definition of sort-of.)
I've been discussing that with Chuck. I'd prefer that the NFS client
use a flavour of vmtruncate(), with its strong guarantees. But we
won't know how horrid that is from a locking perspective until Trond
returns.
> ...
> In general, I think we'd be better off if page->buffers was not opaque,
Disagree. There is zero computational cost to the current setup,
and it's a little cleaner, and it permits things such as the removal
of ->private from the pageframes, and hashing for it.
And there is plenty of precedent for putting fs-private hooks into
core VFS data structures.
> and that it should remain non-opaque until we are definitely ready to
> get rid of them.
There is nothing wrong with buffers, except the name. They no longer
buffer anything.
They _used_ to be the buffering entity, and an IO container, and
the abstraction of a disk block.
They are now just the abstraction of a disk block. s/buffer_head/block/g
should make things clearer.
And there is no way in which the kernel can get along without
some structure which represents a disk block. It does one thing,
and it does it well.
The page is the buffering entity.
The buffer_head is a disk block.
The BIO is the IO container.
Sometimes, for efficiency, we bypass the "block" part and go direct
page-into-BIO. That's a conceptually-wrong performance hack.
Yes, one could try to graft the "block" abstraction up into struct
page, or down into struct BIO. But one would be mistaken, I expect.
> Doing otherwise will just allow various applications
> to start growing tendrils into the field, making it that much harder
> to get rid of when the time comes.
>
> So the question is, does anyone *really* need (void *) page->private
> instead of page->buffers?
Don't know. But leaving it as-is tells the world that this is
per-fs metadata which the VM/VFS supports. This has no cost.
> I'm very unkeen about using the inaccurate invalidate_inode_pages
> for anything which matters, really. And the consistency of pagecache
> data matters.
>
> NFS should be using something stronger. And that's basically
> vmtruncate() without the i_size manipulation.
Yes, that looks good. Semantics are basically "and don't come back
until every damn page is gone" which is enforced by the requirement
that we hold the mapping->page_lock through one entire scan of the
truncated region. (Yes, I remember sweating this one out a year
or two ago so it doesn't eat 100% CPU on regular occasions.)
So, specifically, we want:
void invalidate_inode_pages(struct inode *inode)
{
truncate_inode_pages(inode->i_mapping, 0);
}
Is it any harder than that?
By the way, now that we're all happy with the radix tree, we might
as well just go traverse that instead of all the mapping->*_pages.
(Not that I'm seriously suggesting rocking the boat that way just
now, but it might yield some interesting de-crufting possibilities.)
> Hold i_sem,
> vmtruncate_list() for assured pagetable takedown, proper page
> locking to take the pages out of pagecache, etc.
>
> Sure, we could replace the page_count() heuristic with a
> page->pte.direct heuristic. Which would work just as well. Or
> better. Or worse. Who knows?
>
> Guys, can we sort out the NFS locking so that it is possible to
> take the correct locks to get the 100% behaviour?
Trond, will the above work?
Now, what is this invalidate_inode_pages2 seepage about? Called from
one place. Sheesh.
--
Daniel
Daniel Phillips wrote:
>
> > I'm very unkeen about using the inaccurate invalidate_inode_pages
> > for anything which matters, really. And the consistency of pagecache
> > data matters.
> >
> > NFS should be using something stronger. And that's basically
> > vmtruncate() without the i_size manipulation.
>
> Yes, that looks good. Semantics are basically "and don't come back
> until every damn page is gone" which is enforced by the requirement
> that we hold the mapping->page_lock through one entire scan of the
> truncated region. (Yes, I remember sweating this one out a year
> or two ago so it doesn't eat 100% CPU on regular occasions.)
>
> So, specifically, we want:
>
> void invalidate_inode_pages(struct inode *inode)
> {
> truncate_inode_pages(inode->i_mapping, 0);
> }
>
> Is it any harder than that?
Pretty much - need to leave i_size where it was. But there are
apparently reasons why NFS cannot sleepingly lock pages in this particular
context.
> By the way, now that we're all happy with the radix tree, we might
> as well just go traverse that instead of all the mapping->*_pages.
> (Not that I'm seriously suggesting rocking the boat that way just
> now, but it might yield some interesting de-crufting possibilities.)
Oh absolutely.
unsigned long radix_tree_gang_lookup(void **pointers,
unsigned long starting_from_here, unsigned long this_many);
could be used nicely in readahead, drop_behind, truncate, invalidate
and invalidate2. But to use it in writeback (desirable) we would need
additional metadata in radix_tree_node. One bit per page, which means
"this page is dirty" or "this subtree has dirty pages".
I keep saying this in the hope that someone will take pity and write it.
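For what it's worth, the extra metadata could be as small as this (field
names invented; the real lib/radix-tree.c node carries only the count and
the slots):

/* Sketch: one dirty bit per slot, propagated upwards so a writeback walker
 * can skip entire clean subtrees.  Not the actual radix-tree structure. */
struct tagged_radix_tree_node {
	unsigned int	count;
	void		*slots[RADIX_TREE_MAP_SIZE];
	unsigned long	dirty[(RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG];
};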
> ...
> Now, what is this invalidate_inode_pages2 seepage about? Called from
> one place. Sheesh.
heh. We still do have some O_DIRECT/pagecache coherency problems.
On Monday 09 September 2002 23:36, Andrew Morton wrote:
> Daniel Phillips wrote:
> Yes, there are still reasons for delayed allocation, the space reservation
> API, etc. But they are not compelling. They are certainly not compelling
> when writeback continues to use the (randomly-ordered) mapping->dirty_pages
> walk.
>
> With radix-tree enhancements to permit a pgoff_t-order walk of the
> dirty pages then yeah, order-of-magnitude gains in the tiobench
> random write pass.
Right. Coincidentally, our emails, both drawing the same conclusion, just
passed each other on the way through vger.
> > ...
> > In general, I think we'd be better off if page->buffers was not opaque,
>
> Disagree. There is zero computational cost to the current setup,
It's not a computational cost, at least not directly, it's an
organizational cost.
> and it's a little cleaner, and it permits things such as the removal
> of ->private from the pageframes, and hashing for it.
That's not really an argument, you can do that with ->buffers too,
and there is zero cost for doing it with both, separately.
> And there is plenty of precedent for putting fs-private hooks into
> core VFS data structures.
But you have to have a clear argument why. I don't see one yet, and I
do see why the vfs wants to know about them. None of your reasons for
wanting ->private have anything to do with buffers, they are all about
other fancy features that somebody might want to add.
> > and that it should remain non-opaque until we are definitely ready to
> > get rid of them.
>
> There is nothing wrong with buffers, except the name. They no longer
> buffer anything.
Understood. However, the spelling patch to correct that would touch
the kernel in a few hundred places and break an indeterminate-but-large
number of pending patches.
> They _used_ to be the buffering entity, and an IO container, and
> the abstraction of a disk block.
>
> They are now just the abstraction of a disk block. s/buffer_head/block/g
> should make things clearer.
Oooh, yes, that would be nice, but see above spelling patch objection.
> And there is no way in which the kernel can get along without
> some structure which represents a disk block. It does one thing,
> and it does it well.
>
> The page is the buffering entity.
>
> The buffer_head is a disk block.
>
> The BIO is the IO container.
>
> Sometimes, for efficiency, we bypass the "block" part and go direct
> page-into-BIO. That's a conceptually-wrong performance hack.
Right, and what I *want* to do is introduce sub-page struct pages,
along with generalizing struct page via mapping->page_size_shift, so
that we can use struct page as a handle for a block, solving a number
of nasty structural problems. The patch to do this will be less
intrusive than you'd think at first blush, especially after
Christoph has been in there for a while longer, dunging out filemap.c.
> Yes, one could try to graft the "block" abstraction up into struct
> page, or down into struct BIO.
Err, right ;-)
> But one would be mistaken, I expect.
I expect not, but it falls to me to prove that, doesn't it? I was
definitely not contemplating putting this forward for halloween, but
instead to undertake the grunt work when the new stable series opens
(and while *you* are busy chasing all the new bugs you made, *cackle*
*cackle*).
> > Doing otherwise will just allow various applications
> > to start growing tendrils into the field, making it that much harder
> > to get rid of when the time comes.
> >
> > So the question is, does anyone *really* need (void *) page->private
> > instead of page->buffers?
>
> Don't know. But leaving it as-is tells the world that this is
> per-fs metadata which the VM/VFS supports. This has no cost.
It has a creeping-rot cost, because filesystems will inevitably come up
with large numbers of mostly-bogus reasons for using it. I think the
solution to this is to introduce a hashed ->private_frob_me structure
and restore ->buffers to visibility by the vfs, which needs all the
help it can get to survive the current misfit between page and block
size.
On second thought, your name change sounds really attractive, how
about:
- struct buffer_head *buffers;
+ struct block *blocks;
Unfortunately, I can't think of any nice macro way to ease the pain
of the patch.
--
Daniel
On Tuesday 10 September 2002 00:03, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > > I'm very unkeen about using the inaccurate invalidate_inode_pages
> > > for anything which matters, really. And the consistency of pagecache
> > > data matters.
> > >
> > > NFS should be using something stronger. And that's basically
> > > vmtruncate() without the i_size manipulation.
> >
> > Yes, that looks good. Semantics are basically "and don't come back
> > until every damn page is gone" which is enforced by the requirement
> > that we hold the mapping->page_lock through one entire scan of the
> > truncated region. (Yes, I remember sweating this one out a year
> > or two ago so it doesn't eat 100% CPU on regular occasions.)
> >
> > So, specifically, we want:
> >
> > void invalidate_inode_pages(struct inode *inode)
> > {
> > truncate_inode_pages(inode->i_mapping, 0);
> > }
> >
> > Is it any harder than that?
>
> Pretty much - need to leave i_size where it was.
This doesn't touch i_size.
> But there are
> apparently reasons why NFS cannot sleepingly lock pages in this particular
> context.
If only we knew what those were. It's hard to keep the word 'bogosity'
from popping into my head.
--
Daniel
Daniel Phillips wrote:
>
> > > void invalidate_inode_pages(struct inode *inode)
> > > {
> > > truncate_inode_pages(inode->i_mapping, 0);
> > > }
> > >
> > > Is it any harder than that?
> >
> > Pretty much - need to leave i_size where it was.
>
> This doesn't touch i_size.
Sorry - I was thinking vmtruncate(). truncate_inode_pages() would
result in all the mmapped pages becoming out-of-date anonymous
memory. NFS needs to take down the pagetables so that processes
which are mmapping the file which changed on the server will take
a major fault and read a fresh copy. I believe.
On Tue, 10 Sep 2002, Daniel Phillips wrote:
> On Tuesday 10 September 2002 00:03, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > >
> > > > I'm very unkeen about using the inaccurate invalidate_inode_pages
> > > > for anything which matters, really. And the consistency of pagecache
> > > > data matters.
> > > >
> > > > NFS should be using something stronger. And that's basically
> > > > vmtruncate() without the i_size manipulation.
> > >
> > > Yes, that looks good. Semantics are basically "and don't come back
> > > until every damn page is gone" which is enforced by the requirement
> > > that we hold the mapping->page_lock through one entire scan of the
> > > truncated region. (Yes, I remember sweating this one out a year
> > > or two ago so it doesn't eat 100% CPU on regular occasions.)
> > >
> > > So, specifically, we want:
> > >
> > > void invalidate_inode_pages(struct inode *inode)
> > > {
> > > truncate_inode_pages(inode->i_mapping, 0);
> > > }
> > >
> > > Is it any harder than that?
> >
> > Pretty much - need to leave i_size where it was.
>
> This doesn't touch i_size.
>
> > But there are
> > apparently reasons why NFS cannot sleepingly lock pages in this particular
> > context.
>
> If only we knew what those were. It's hard to keep the word 'bogosity'
> from popping into my head.
rpciod must never call a function that sleeps. if this happens, the whole
NFS client stops working until the function wakes up again. this is not
really bogus -- it is similar to restrictions placed on socket callbacks.
async RPC tasks (i.e., the rpciod process) invoke invalidate_inode_pages
during normal, everyday processing, so it must not sleep. that's why it
ignores locked pages today.
thus:
1. whatever function purges a file's cached data must not sleep when
invoked from an async RPC task. likewise, such a function must not
sleep if the caller holds the file's i_sem.
2. access to the file must be serialized somehow with in-flight I/O
(locked pages). we don't want to start new reads before all dirty
pages have been flushed back to the server. dirty pages that have
not yet been scheduled must be dealt with correctly.
3. mmap'd pages must behave reasonably when a file's cache is purged.
clean pages should be faulted back in. what to do with dirty mmap'd
pages?
- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
On Tuesday 10 September 2002 01:51, Chuck Lever wrote:
> On Tue, 10 Sep 2002, Daniel Phillips wrote:
> > On Tuesday 10 September 2002 00:03, Andrew Morton wrote:
> > > But there are
> > > apparently reasons why NFS cannot sleepingly lock pages in this
> > > particular context.
> >
> > If only we knew what those were. It's hard to keep the word 'bogosity'
> > from popping into my head.
>
> rpciod must never call a function that sleeps. if this happens, the whole
> NFS client stops working until the function wakes up again. this is not
> really bogus -- it is similar to restrictions placed on socket callbacks.
Ah, a warm body with answers :-)
It *sounds* bogus: why should we be satisfied with a function that doesn't
do its job reliably (invalidate_inode_pages) in order to avoid coming up
with a way of keeping the client daemon from blocking? How about having
invalidate_inode_pages come back with "sorry boss, I couldn't complete the
job so I started as much IO as I could and I'm back now, try again later"?
> async RPC tasks (i.e., the rpciod process) invoke invalidate_inode_pages
> during normal, everyday processing, so it must not sleep. that's why it
> ignores locked pages today.
OK, that's half the job, the other half is to know that something's been
ignored, and get back to it later. Either there is a valid reason for
getting rid of these pages or there isn't, and if there is a valid reason,
then getting rid of only some of them must surely leave the door wide
open to strange misbehaviour.
> thus:
>
> 1. whatever function purges a file's cached data must not sleep when
> invoked from an async RPC task.
[otherwise other tasks using the client will stall needlessly]
> ...likewise, such a function must not
> sleep if the caller holds the file's i_sem.
>
> 2. access to the file must be serialized somehow with in-flight I/O
> (locked pages). we don't want to start new reads before all dirty
> pages have been flushed back to the server. dirty pages that have
> not yet been scheduled must be dealt with correctly.
State machine!
> 3. mmap'd pages must behave reasonably when a file's cache is purged.
> clean pages should be faulted back in. what to do with dirty mmap'd
> pages?
I don't know, sorry. What?
You've probably been through this before, but could you please explain
the ground rules behind the cache purging strategy?
--
Daniel
On Tue, 10 Sep 2002, Daniel Phillips wrote:
> On Tuesday 10 September 2002 01:51, Chuck Lever wrote:
> > rpciod must never call a function that sleeps. if this happens, the whole
> > NFS client stops working until the function wakes up again. this is not
> > really bogus -- it is similar to restrictions placed on socket callbacks.
>
> Ah, a warm body with answers :-)
>
> It *sounds* bogus: why should we be satisfied with a function that doesn't
> do its job reliably (invalidate_inode_pages) in order to avoid coming up
> with a way of keeping the client daemon from blocking? How about having
> invalidate_inode_pages come back with "sorry boss, I couldn't complete the
> job so I started as much IO as I could and I'm back now, try again later"?
i'm not suggesting that invalidate_inode_pages behaves properly, i'm
simply trying to document why it works the way it does. i agree that it
leaves a window open for broken behavior today. what is *not* bogus is
the requirement for functions invoked by async RPC tasks not to sleep;
that's reasonable, i feel.
> > thus:
> >
> > 1. whatever function purges a file's cached data must not sleep when
> > invoked from an async RPC task.
>
> [otherwise other tasks using the client will stall needlessly]
correct.
> > 3. mmap'd pages must behave reasonably when a file's cache is purged.
> > clean pages should be faulted back in. what to do with dirty mmap'd
> > pages?
>
> I don't know, sorry. What?
'twas a rhetorical question. i'm trying to understand this myself. the
case of what to do with dirty mmap'd pages is somewhat sticky.
> You've probably been through this before, but could you please explain
> the ground rules behind the cache purging strategy?
i can answer the question "when does the NFS client purge a file's cached
data?"
there are four major categories:
a. directory changes require any cached readdir results be purged. this
forces the readdir results to be re-read from the server the next time
the client needs them. this is what broke with the recent changes in
2.5.32/3 that triggered this thread.
b. when the client discovers a file on the server was changed by some
other client, all pages in the page cache for that file are purged
(except the ones that can't be because they are locked, etc). this
is the case that is hit most often and in async RPC tasks, and is
on many critical performance paths.
c. when a file is locked or unlocked via lockf/fcntl, all pending writes
are pushed back to the server and any cached data in the page cache is
purged before the lock/unlock call returns. applications sometimes
depend on this behavior to checkpoint the client's cache.
d. some error occurred while reading a directory, or the object on the
server has changed type (like, a file becomes a directory but the
file handle is still the same -- a protocol error, but the client
checks for this just in case).
so let's talk about b. before and after many operations, the NFS client
attempts to revalidate an inode. this means it does a GETATTR operation,
or uses the attr results returned in many NFS requests, to compare the
file's size and mtime on the server with the same values it has cached
locally. this revalidation can occur during XDR processing while the RPC
layer marshals and unmarshals the results of an NFS request.
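(roughly, the check amounts to something like this -- the macro names
here are from memory and only illustrative:)
	/* does the server's view still match what we have cached? */
	if (fattr->size != NFS_CACHE_ISIZE(inode) ||
	    fattr->mtime != NFS_CACHE_MTIME(inode)) {
		/* the file changed on the server: cached pages are
		 * suspect, so a purge needs to be scheduled */
	}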
i don't want to speculate too much without Trond around to keep me honest.
however, i think what we want here is behavior that is closer to category
c., with as few negative performance implications as possible.
i think one way to accomplish this is to create two separate revalidation
functions -- one that can be used by synchronous code in the NFS client
that uses the 100% bug killer, and one that can be used from async RPC
tasks that simply marks that a purge is necessary, and next time through
the sync one, the purge actually occurs.
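something like this, perhaps (a sketch with an invented NFS_INO_ZAP_DATA
flag and invented helper names -- the point is only that the async side
never sleeps):
	/* async side, safe to call from rpciod: just record the request */
	static void nfs_mark_for_purge(struct inode *inode)
	{
		NFS_FLAGS(inode) |= NFS_INO_ZAP_DATA;
	}

	/* sync side, process context with i_sem held, allowed to sleep */
	static void nfs_purge_if_needed(struct inode *inode)
	{
		if (NFS_FLAGS(inode) & NFS_INO_ZAP_DATA) {
			truncate_inode_pages(inode->i_mapping, 0);
			NFS_FLAGS(inode) &= ~NFS_INO_ZAP_DATA;
		}
	}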
the only outstanding issue then is how to handle pages that are dirtied
via mmap'd files, since they are touched without going through the NFS
client. also, i'd really like to hear from maintainers of other network
file systems about how they manage cache coherency.
- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
On Tuesday 10 September 2002 17:09, Chuck Lever wrote:
> On Tue, 10 Sep 2002, Daniel Phillips wrote:
> > On Tuesday 10 September 2002 01:51, Chuck Lever wrote:
> > > rpciod must never call a function that sleeps. if this happens, the whole
> > > NFS client stops working until the function wakes up again. this is not
> > > really bogus -- it is similar to restrictions placed on socket callbacks.
> >
> > Ah, a warm body with answers :-)
> >
> > It *sounds* bogus: why should we be satisfied with a function that doesn't
> > do its job reliably (invalidate_inode_pages) in order to avoid coming up
> > with a way of keeping the client daemon from blocking? How about having
> > invalidate_inode_pages come back with "sorry boss, I couldn't complete the
> > job so I started as much IO as I could and I'm back now, try again later"?
>
> i'm not suggesting that invalidate_inode_pages behaves properly, i'm
> simply trying to document why it works the way it does.
And nicely too, thanks.
> > > 3. mmap'd pages must behave reasonably when a file's cache is purged.
> > > clean pages should be faulted back in. what to do with dirty mmap'd
> > > pages?
> >
> > I don't know, sorry. What?
>
> 'twas a rhetorical question.
A rhetorical answer as well ;-)
> i'm trying to understand this myself. the
> case of what to do with dirty mmap'd pages is somewhat sticky.
What I meant was, could you please explain the problem with dirty mmaped
pages. I see you explained it below: you mean that writes to mmaps bypass
the client, but the client really needs to know about them (and is
largely ignorant of them at present).
> > You've probably been through this before, but could you please explain
> > the ground rules behind the cache purging strategy?
>
> i can answer the question "when does the NFS client purge a file's cached
> data?"
>
> there are four major categories:
>
> a. directory changes require any cached readdir results be purged.
That is, the client changes the directory itself? I suppose an NFS
server is incapable of reporting directory changes caused by other
clients, because of being stateless.
> ...this
> forces the readdir results to be re-read from the server the next time
> the client needs them. this is what broke with the recent changes in
> 2.5.32/3 that triggered this thread.
>
> b. when the client discovers a file on the server was changed by some
> other client, all pages in the page cache for that file are purged
> (except the ones that can't be because they are locked, etc). this
> is the case that is hit most often and in async RPC tasks, and is
> on many critical performance paths.
>
> c. when a file is locked or unlocked via lockf/fcntl, all pending writes
> are pushed back to the server and any cached data in the page cache is
> purged before the lock/unlock call returns.
Do you mean, the client locks/unlocks the file, or some other client?
I'm trying to fit this into my model of how the server must work. It
must be that the locked/unlocked state is recorded at the server, in
the underlying file, and that the server reports the locked/unlocked
state of the file to every client via attr results. So now, why purge
at *both* lock and unlock time?
> ...applications sometimes
> depend on this behavior to checkpoint the client's cache.
>
> d. some error occurred while reading a directory, or the object on the
> server has changed type (like, a file becomes a directory but the
> file handle is still the same -- a protocol error, but the client
> checks for this just in case).
>
> so let's talk about b. before and after many operations, the NFS client
> attempts to revalidate an inode. this means it does a GETATTR operation,
> or uses the attr results returned in many NFS requests, to compare the
> file's size and mtime on the server with the same values it has cached
> locally. this revalidation can occur during XDR processing while the RPC
> layer marshals and unmarshals the results of an NFS request.
OK, so if this revalidation fails the client does the purge, as you
described in b.
> i don't want to speculate too much without Trond around to keep me honest.
> however, i think what we want here is behavior that is closer to category
> c., with as few negative performance implications as possible.
Actually, this is really, really useful and gives me lots of pointers I
can follow for more details.
> i think one way to accomplish this is to create two separate revalidation
> functions -- one that can be used by synchronous code in the NFS client
> that uses the 100% bug killer, and one that can be used from async RPC
> tasks that simply marks that a purge is necessary, and next time through
> the sync one, the purge actually occurs.
That would certainly be easy from the VM side, then we could simply
use a derivative of vmtruncate that leaves the file size alone, as
Andrew suggested.
If this method isn't satisfactory, then with considerably more effort
we (you guys) could build a state machine in the client that relies
on an (as yet unwritten but pretty straightforward) atomic purger with
the ability to report the fact that it was unable to complete the
purge atomically.
Your suggestion is oh-so-much-easier. I hope it works out.
> the only outstanding issue then is how to handle pages that are dirtied
> via mmap'd files, since they are touched without going through the NFS
> client.
Hmm. And what do you want? Would a function that walks through the
radix tree and returns the OR of page_dirty for every page in it be
useful? That would be easy, but efficiency is another question. I
suppose that even if you had such a function, the need to poll
mmaped files constantly would be a stopper.
Would it be satisfactory to know within a second or two that the mmap
was dirtied? If so, the dirty scan could possibly be rolled into the
regular refill_inactive/shrink_cache scan, though at some cost to
structural cleanliness.
Could the client mprotect the mmap, and receive a signal the first
time somebody writes it? Jeff Dike does things like that with UML
and they seem to work pretty well. Of these approaches, this is the
one that sounds most satisfactory from the performance and
correctness point of view, and it is a proven technique, however
scary it may sound.
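(in userspace terms the trick looks roughly like this -- purely
illustrative, not UML's actual code:)
	#include <stddef.h>
	#include <signal.h>
	#include <sys/mman.h>

	static void *map_base;			/* set up elsewhere by mmap() */
	static size_t map_len;
	static volatile sig_atomic_t map_dirtied;

	static void wp_handler(int sig, siginfo_t *si, void *uc)
	{
		map_dirtied = 1;		/* remember that somebody wrote */
		/* re-enable writes so the faulting store can proceed */
		mprotect(map_base, map_len, PROT_READ | PROT_WRITE);
	}

	static void arm_write_detection(void)
	{
		struct sigaction sa;

		sa.sa_sigaction = wp_handler;
		sa.sa_flags = SA_SIGINFO;
		sigemptyset(&sa.sa_mask);
		sigaction(SIGSEGV, &sa, NULL);
		/* the next write to the mapping faults into wp_handler */
		mprotect(map_base, map_len, PROT_READ);
	}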
You want to know about the dirty pages only so you can send them
to the server, correct? Not because the client needs to purge
anything.
> also, i'd really like to hear from maintainers of other network
> file systems about how they manage cache coherency.
Yes, unfortunately, if we break Samba, Tridge knows where I live ;-)
--
Daniel
On Tuesday 10 September 2002 00:32, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > > > void invalidate_inode_pages(struct inode *inode)
> > > > {
> > > > truncate_inode_pages(mapping, 0);
> > > > }
> > > >
> > > > Is it any harder than that?
> > >
> > > Pretty much - need to leave i_size where it was.
> >
> > This doesn't touch i_size.
>
> Sorry - I was thinking vmtruncate(). truncate_inode_pages() would
> result in all the mmapped pages becoming out-of-date anonymous
> memory. NFS needs to take down the pagetables so that processes
> which are mmapping the file which changed on the server will take
> a major fault and read a fresh copy. I believe.
Oh, um. Yes, we need the additional pte zapping behaviour of
vmtruncate_list. It doesn't look particularly hard to produce a
variant of vmtruncate that does (or rather, doesn't do) what you suggest.
Let's see how the discussion goes with the NFS crowd.
--
Daniel
On Tue, 10 Sep 2002, Daniel Phillips wrote:
> On Tuesday 10 September 2002 17:09, Chuck Lever wrote:
> > i can answer the question "when does the NFS client purge a file's cached
> > data?"
> >
> > there are four major categories:
> >
> > a. directory changes require any cached readdir results be purged.
>
> That is, the client changes the directory itself? I suppose an NFS
> server is incapable of reporting directory changes caused by other
> clients, because of being stateless.
when a client itself changes a directory, it must purge its own readdir
cache. this is because the server is really in control of the contents,
which are treated as more or less opaque by the client. the next time any
application on the client wants to read the directory, the client will go
back to the server to get the updated contents.
the NFS server doesn't report changes; the client must infer them from
changes in a file's size and mtime. such changes are detected this way
for directories as well as files. Trond may have added some code recently
that also "flushes" the dentry cache for a directory when such a change is
detected for a directory.
> > c. when a file is locked or unlocked via lockf/fcntl, all pending writes
> > are pushed back to the server and any cached data in the page cache is
> > purged before the lock/unlock call returns.
>
> Do you mean, the client locks/unlocks the file, or some other client?
when a client locks or unlocks a file, it flushes pending writes and
purges its read cache.
> I'm trying to fit this into my model of how the server must work. It
> must be that the locked/unlocked state is recorded at the server, in
> the underlying file, and that the server reports the locked/unlocked
> state of the file to every client via attr results.
no, lock management is handled by an entirely separate protocol. the
server maintains lock state separately from the actual files, as i
understand it.
> So now, why purge at *both* lock and unlock time?
this is a little long.
clients use what's called "close-to-open" cache consistency to try to
maintain a single-system image of the file. this means when a file is
opened, the client checks with the server to get the latest version of the
file data and metadata (or simply to verify that the client can continue
using whatever it has cached). when a file is closed, the client always
flushes any pending writes to the file back to the server.
in this way, when client A closes a file and subsequently client B opens
the same file, client B will see all changes made by client A before it
closed the file. note that while A still has the file open, B may not see
all changes made by A, unless an application on A explicitly flushes the
file. this is a compromise between tight cache coherency and good
performance.
when locking or unlocking a file, the idea is to make sure that other
clients can see all changes to the file that were made while it was
locked. locking and unlocking provide tighter cache coherency than simple
everyday close-to-open; after all, that's why applications go to the trouble
of locking a file -- they expect to share the contents of the file with
other applications on other clients.
when a file is locked, the client wants to be certain it has the latest
version of the file for an application to play with. the cache is purged
to cause the client to read any recent changes from the server. when a
file is unlocked, the client wants to share its changes with other clients
so it flushes all pending writes before allowing the unlocking application
to proceed.
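(in outline, the checkpoint around lockf/fcntl is just this -- helper
names invented:)
	/* before granting the lock to the application, and again at unlock */
	nfs_flush_dirty_pages(inode);	/* push all pending writes to the server */
	nfs_purge_read_cache(inode);	/* force the next read to go to the server */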
> > i don't want to speculate too much without Trond around to keep me honest.
> > however, i think what we want here is behavior that is closer to category
> > c., with as few negative performance implications as possible.
>
> Actually, this is really, really useful and gives me lots of pointers I
> can follow for more details.
i'm very happy to be able to include more brains in the dialog!
> > i think one way to accomplish this is to create two separate revalidation
> > functions -- one that can be used by synchronous code in the NFS client
> > that uses the 100% bug killer, and one that can be used from async RPC
> > tasks that simply marks that a purge is necessary, and next time through
> > the sync one, the purge actually occurs.
>
> That would certainly be easy from the VM side, then we could simply
> use a derivative of vmtruncate that leaves the file size alone, as
> Andrew suggested.
the idea is that only the NFS client would have to worry about getting
this right, and would invoke the proper VM hammer when it is safe to do
so. that way, NFS-specific weirdness can be kept in fs/nfs/*.c, and not
ooze into the VM layer.
> > the only outstanding issue then is how to handle pages that are dirtied
> > via mmap'd files, since they are touched without going through the NFS
> > client.
[ ... various ideas snipped ... ]
> You want to know about the dirty pages only so you can send them
> to the server, correct? Not because the client needs to purge
> anything.
i'm thinking only at the level of what behavior we want, not how to
implement it, simply because i'm not familiar enough with how it works
today. i've researched how this behaves (more or less) on a reference
implementation of an NFS client by asking someone who worked on the
Solaris client.
the thinking is that if applications running on two separate clients have
a file mmap'd, the application designers already know well enough that
dirtying the same regions of the file on separate clients will have
nondeterministic results. thus the only good reason that two or more
clients would mmap the same file and dirty some pages would be to modify
different regions of the file, or there is some application-level
serialization scheme in place to keep writes to the file in a
deterministic order.
thus, when a client "revalidates" an mmap'd file and discovers that the
file was changed on the server by another client, the reference
implementation says "go ahead and flush all the writes you know about,
then purge the read cache."
so the only problem is finding the dirty pages so that the client can
schedule the writes. i think this needs to happen after a server change
is detected but before the client schedules any new I/O against the file.
today, dirty mmap'd pages are passed to the NFS client via the writepage
address space operation. what more needs to be done here? is there a
mechanism today to tell the VM layer to "call writepage for all dirty
mmap'd pages?"
- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>
On Tuesday 10 September 2002 21:04, Chuck Lever wrote:
> today, dirty mmap'd pages are passed to the NFS client via the writepage
> address space operation. what more needs to be done here? is there a
> mechanism today to tell the VM layer to "call writepage for all dirty
> mmap'd pages?"
Well, Andrew has cooked up something that seems to be headed in that
direction, mpage_writepages and various associated superstructure, but
it does not apparently walk all the ptes to check the dirty bits before
it flushes the dirty pages, which is what you want, I think. It would
not be hard to teach it that trick, it's another thing made easy by
rmap.
Then, after that's done, what kind of semantics have we got? Perhaps
it's worth being able to guarantee that when a program says 'get all
the dirty memory here out onto disk' it actually happens, even though
there is no built-in way to be sure that some unsynchronized task
won't come along and dirty the mmap again immediately. You could look
at the need for synchronization there as an application issue. On the
other hand, if the NFS client is taking the liberty of flushing the
dirty memory on behalf of the mmap users, what guarantee is provided?
IMHO, none, so is it really worth it for the NFS client to do this?
It does make sense that fsync should really get all the dirty pages
onto disk (err, or onto the server) and not come back until it's done,
and that 'dirty pages' should include dirty ptes, not just pages that
happen to have been scanned and had their dirty bits moved from the
pte to the struct page.
Andrew, did I miss something, or does the current code really ignore
the pte dirty bits?
--
Daniel
Daniel Phillips wrote:
>
> ...
> Andrew, did I miss something, or does the current code really ignore
> the pte dirty bits?
Sure. pte_dirty -> PageDirty propagation happens in page reclaim,
and in msync.
We _could_ walk the pte chain in writeback. But that would involve
visiting every page in the mapping, basically. That could hurt.
But if a page is already dirty, and we're going to write it anyway,
it makes tons of sense to run around and clean all the ptes which
point at it.
It especially makes sense for filemap_sync() to do that. (quickly
edits the todo file).
I'm not sure that MAP_SHARED is a good way of performing file writing,
really. And not many things seem to use it for that. It's more there
as a way for unrelated processes to find a chunk of common memory via
the usual namespace. Not sure about that, but I am sure that it's a
damn pest.
On Wednesday 11 September 2002 02:07, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > Andrew, did I miss something, or does the current code really ignore
> > the pte dirty bits?
>
> Sure. pte_dirty -> PageDirty propagation happens in page reclaim,
> and in msync.
>
> We _could_ walk the pte chain in writeback. But that would involve
> visiting every page in the mapping, basically. That could hurt.
I wasn't suggesting that, I was confirming you're not going looking at
the ptes somewhere I didn't notice.
> But if a page is already dirty, and we're going to write it anyway,
> it makes tons of sense to run around and clean all the ptes which
> point at it.
Really. Since that is what the PG_dirty bit is supposed to be telling us.
> It especially makes sense for filemap_sync() to do that. (quickly
> edits the todo file).
>
> I'm not sure that MAP_SHARED is a good way of performing file writing,
> really.
It sure isn't just now, and it's likely to remain that way for quite some
time.
> And not many things seem to use it for that. It's more there
> as a way for unrelated processes to find a chunk of common memory via
> the usual namespace. Not sure about that, but I am sure that it's a
> damn pest.
One day, in a perfect world, you will dirty a mmap, fsync it, and the data
will appear in the blink of an eye, somewhere else. Till then it would be
nice just to get mmaps of NFS files doing something reasonable. We do get
around to walking the ptes at file close I believe. Is that not driven by
zap_page_range, which moves any orphaned pte dirty bits down into the struct
page?
--
Daniel
Daniel Phillips wrote:
>
> > ...
> We do get
> around to walking the ptes at file close I believe. Is that not driven by
> zap_page_range, which moves any orphaned pte dirty bits down into the struct
> page?
Nope, close will just leave all the pages pte-dirty or PageDirty in
memory. truncate will nuke all the ptes and then the pagecache.
But the normal way in which pte-dirty pages find their way to the
backing file is:
- page reclaim runs try_to_unmap or
- user runs msync(). (Which will only clean that mm's ptes!)
These will run set_page_dirty(), making the page visible to
one of the many things which run writeback.
On Wednesday 11 September 2002 02:38, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > > ...
> > We do get
> > around to walking the ptes at file close I believe. Is that not driven by
> > zap_page_range, which moves any orphaned pte dirty bits down into the struct
> > page?
>
> Nope, close will just leave all the pages pte-dirty or PageDirty in
> memory. truncate will nuke all the ptes and then the pagecache.
>
> But the normal way in which pte-dirty pages find their way to the
> backing file is:
>
> - page reclaim runs try_to_unmap or
>
> - user runs msync(). (Which will only clean that mm's ptes!)
>
> These will run set_page_dirty(), making the page visible to
> one of the many things which run writeback.
So we just quietly drop any dirty memory mapped to a file if the user doesn't
run msync? Is that correct behaviour? It sure sounds wrong.
--
Daniel
Daniel Phillips wrote:
>
> On Wednesday 11 September 2002 02:38, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > >
> > > > ...
> > > We do get
> > > around to walking the ptes at file close I believe. Is that not driven by
> > > zap_page_range, which moves any orphaned pte dirty bits down into the struct
> > > page?
> >
> > Nope, close will just leave all the pages pte-dirty or PageDirty in
> > memory. truncate will nuke all the ptes and then the pagecache.
> >
> > But the normal way in which pte-dirty pages find their way to the
> > backing file is:
> >
> > - page reclaim runs try_to_unmap or
> >
> > - user runs msync(). (Which will only clean that mm's ptes!)
> >
> > These will run set_page_dirty(), making the page visible to
> > one of the many things which run writeback.
>
> So we just quietly drop any dirty memory mapped to a file if the user doesn't
> run msync? Is that correct behaviour? It sure sounds wrong.
>
If the page is truncated then yes. (indirectly: the pte dirty
state gets flushed into PG_dirty which is then invalidated).
Otherwise, no. The pte dirtiness is propagated into PG_dirty when
the pte is detached from the page. See try_to_unmap_one() - it is
quite straightforward.
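Roughly this, simplified (quoting the 2.5 rmap code from memory, so the
details may be slightly off):
	/* nuke the pte, then fold its dirty bit into the struct page */
	pte = ptep_get_and_clear(ptep);
	flush_tlb_page(vma, address);
	if (pte_dirty(pte))
		set_page_dirty(page);
	mm->rss--;
	page_cache_release(page);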
On Wednesday 11 September 2002 03:49, Andrew Morton wrote:
> Daniel Phillips wrote:
> > So we just quietly drop any dirty memory mapped to a file if the user doesn't
> > run msync? Is that correct behaviour? It sure sounds wrong.
>
> If the page is truncated then yes. (indirectly: the pte dirty
> state gets flushed into PG_dirty which is then invalidated).
>
> Otherwise, no. The pte dirtiness is propagated into PG_dirty when
> the pte is detached from the page. See try_to_unmap_one() - it is
> quite straightforward.
Yep, that's what I meant the first time round. Had me worried for a moment
there...
The specific path I was referring to is:
sys_munmap->
unmap_region->
unmap_page_range->
zap_pmd_range->
zap_pte_range->
if (pte_dirty(pte))
set_page_dirty(page);
Which is the definitive way of getting the data onto disk. It's weird that
msync has to be different from fsync, or that fsync does not imply msync,
but that's the posix folks for ya, I'm sure they threw a dart blindfolded
at a list of reasons and came up with one.
--
Daniel
On Tue, 10 Sep 2002, Andrew Morton wrote:
> We _could_ walk the pte chain in writeback. But that would involve
> visiting every page in the mapping, basically. That could hurt.
>
> But if a page is already dirty, and we're going to write it anyway,
> it makes tons of sense to run around and clean all the ptes which
> point at it.
Walking ptes probably doesn't hurt as much as doing extra
disk IO, so I guess you're right ;)
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Spamtraps of the month: [email protected] [email protected]
On Wednesday 11 September 2002 18:18, Rik van Riel wrote:
> On Tue, 10 Sep 2002, Andrew Morton wrote:
>
> > We _could_ walk the pte chain in writeback. But that would involve
> > visiting every page in the mapping, basically. That could hurt.
> >
> > But if a page is already dirty, and we're going to write it anyway,
> > it makes tons of sense to run around and clean all the ptes which
> > point at it.
>
> Walking ptes probably doesn't hurt as much as doing extra
> disk IO, so I guess you're right ;)
Well, but the thing I'm worried about is that we have failed to
implement sys_msync at this point; please reassure me on that, if
possible.
--
Daniel
>>>>> " " == Chuck Lever <[email protected]> writes:
> rpciod must never call a function that sleeps. if this
> happens, the whole NFS client stops working until the function
> wakes up again. this is not really bogus -- it is similar to
> restrictions placed on socket callbacks.
I'm in France at the moment, and am therefore not really able to
follow up on this thread for the moment. I'll try to clarify the above
though:
2 reasons why rpciod cannot block:
1) Doing so will slow down I/O for *all* NFS users.
2) There's a minefield of possible deadlock situations: waiting on a
locked page is the main no-no since rpciod itself is the process
that needs to complete the read I/O and unlock the page.
Cheers,
Trond
Trond Myklebust wrote:
>
> >>>>> " " == Chuck Lever <[email protected]> writes:
>
> > rpciod must never call a function that sleeps. if this
> > happens, the whole NFS client stops working until the function
> > wakes up again. this is not really bogus -- it is similar to
> > restrictions placed on socket callbacks.
>
> I'm in France at the moment, and am therefore not really able to
> follow up on this thread for the moment. I'll try to clarify the above
> though:
>
> 2 reasons why rpciod cannot block:
>
> 1) Doing so will slow down I/O for *all* NFS users.
> 2) There's a minefield of possible deadlock situations: waiting on a
> locked page is the main no-no since rpciod itself is the process
> that needs to complete the read I/O and unlock the page.
>
Yes. Both of these would indicate that rpciod is the wrong process
to be performing the invalidation.
Is it not possible to co-opt a user process to perform the
invalidation? Just
inode->is_kaput = 1;
in rpciod?
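ie, something like (a sketch; "is_kaput" is just a strawman name and
the check on the user-process side is invented):
	/* rpciod side: never sleeps, only records that a purge is wanted */
	inode->is_kaput = 1;

	/* user process side: runs under i_sem, free to sleep */
	if (inode->is_kaput) {
		truncate_inode_pages(inode->i_mapping, 0);
		inode->is_kaput = 0;
	}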
On Tuesday 10 September 2002 21:04, Chuck Lever wrote:
> when locking or unlocking a file, the idea is to make sure that other
> clients can see all changes to the file that were made while it was
> locked. locking and unlocking provide tighter cache coherency than simple
> everyday close-to-open; after all, that's why applications go to the trouble
> of locking a file -- they expect to share the contents of the file with
> other applications on other clients.
>
> when a file is locked, the client wants to be certain it has the latest
> version of the file for an application to play with. the cache is purged
> to cause the client to read any recent changes from the server. when a
> file is unlocked, the client wants to share its changes with other clients
> so it flushes all pending writes before allowing the unlocking application
> to proceed.
This is clear and easy to understand, thanks.
--
Daniel
Daniel Phillips wrote:
>
> ...
> > Is it not possible to co-opt a user process to perform the
> > invalidation? Just
> >
> > inode->is_kaput = 1;
> >
> > in rpciod?
>
> There must be a way. The key thing the VM needs to provide, and doesn't
> now, is a function callable by the rpciod that will report to the caller
> whether it was able to complete the invalidation without blocking. (I
> think I'm just rephrasing someone's earlier suggestion here.)
>
> I'm now thinking in general terms about how to concoct a mechanism
> that lets rpciod retry the invalidation later, for all those that turn
> out to be blocking. For example, rpciod could just keep a list of
> all pending invalidates and retry each inode on the list every time
> it has nothing to do. This is crude and n-squarish, but it would
> work. Maybe it's efficient enough for the time being. At least it's
> correct, which would be a step forward.
rpciod is the wrong process to be performing this operation. I'd suggest
the userspace process which wants to read the directory be used for this.
> Did you have some specific mechanism in mind?
>
Testing mapping->nrpages will tell you if the invalidation was successful.
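ie, roughly (sketch):
	invalidate_inode_pages(inode);
	if (inode->i_mapping->nrpages)
		/* some pages survived - remember to try again later */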
On Thursday 12 September 2002 20:21, Andrew Morton wrote:
> Trond Myklebust wrote:
> >
> > >>>>> " " == Chuck Lever <[email protected]> writes:
> >
> > > rpciod must never call a function that sleeps. if this
> > > happens, the whole NFS client stops working until the function
> > > wakes up again. this is not really bogus -- it is similar to
> > > restrictions placed on socket callbacks.
> >
> > I'm in France at the moment, and am therefore not really able to
> > follow up on this thread for the moment. I'll try to clarify the above
> > though:
> >
> > 2 reasons why rpciod cannot block:
> >
> > 1) Doing so will slow down I/O for *all* NFS users.
> > 2) There's a minefield of possible deadlock situations: waiting on a
> > locked page is the main no-no since rpciod itself is the process
> > that needs to complete the read I/O and unlock the page.
> >
>
> Yes. Both of these would indicate that rpciod is the wrong process
> to be performing the invalidation.
>
> Is it not possible to co-opt a user process to perform the
> invalidation? Just
>
> inode->is_kaput = 1;
>
> in rpciod?
There must be a way. The key thing the VM needs to provide, and doesn't
now, is a function callable by the rpciod that will report to the caller
whether it was able to complete the invalidation without blocking. (I
think I'm just rephrasing someone's earlier suggestion here.)
I'm now thinking in general terms about how to concoct a mechanism
that lets rpciod retry the invalidation later, for all those that turn
out to be blocking. For example, rpciod could just keep a list of
all pending invalidates and retry each inode on the list every time
it has nothing to do. This is crude and n-squarish, but it would
work. Maybe it's efficient enough for the time being. At least it's
correct, which would be a step forward.
Did you have some specific mechanism in mind?
--
Daniel
On Thursday 12 September 2002 23:38, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > > Is it not possible to co-opt a user process to perform the
> > > invalidation? Just
> > >
> > > inode->is_kaput = 1;
> > >
> > > in rpciod?
> >
> > There must be a way. The key thing the VM needs to provide, and doesn't
> > now, is a function callable by the rpciod that will report to the caller
> > whether it was able to complete the invalidation without blocking. (I
> > think I'm just rephrasing someone's earlier suggestion here.)
> >
> > I'm now thinking in general terms about how to concoct a mechanism
> > that lets rpciod retry the invalidation later, for all those that turn
> > out to be blocking. For example, rpciod could just keep a list of
> > all pending invalidates and retry each inode on the list every time
> > it has nothing to do. This is crude and n-squarish, but it would
> > work. Maybe it's efficient enough for the time being. At least it's
> > correct, which would be a step forward.
>
> rpciod is the wrong process to be performing this operation. I'd suggest
> the userspace process which wants to read the directory be used for this.
It's not just directories as I understand it, it's also any file that's
locked or unlocked.
If we make userspace do it, we need an interface. Is there
> > Did you have some specific mechanism in mind?
>
> Testing mapping->nrpages will tell you if the invalidation was successful.
That's nice, so maybe there's no need to write a new flavor of
invalidate_inode_pages, but what I was really talking about is what
should be done after discovering it failed.
Another thing: we only care about pages that were in the mapping at the
time of the original invalidate attempt. Coming back and invalidating
a bunch of pages that were already properly re-read from the server,
just to get those few we missed on the first attempt would be, in a
word, horrible.
This one is harder than it first seems. It's worth putting in the
effort to do the job correctly though. The current arrangement is,
err, unreliable.
--
Daniel
On Tue, 10 Sep 2002, Chuck Lever wrote:
> client. also, i'd really like to hear from maintainers of other network
> file systems about how they manage cache coherency.
I have a feeling that smbfs was built around an assumption that if it has
a file open, no one else can change it. I don't think that is valid
even with older smb protocols.
From reading (selected parts of) this thread I believe the requirements
from smbfs are much like the nfs ones. An incomplete summary of what is
currently done:
1. If the attributes of a file change (size, mtime) the cached data is
purged.
2. Directory changes do not modify the mtime of the dir on windows
servers, so a timeout is used.
3. Cached file data is currently not dropped when a file is closed. But
any dirty pages are written. I guess the idea is that (1) would still
catch any changes even if the file isn't open.
smbfs in 2.5 will support oplocks, where the server can tell the client
that someone else wants to open the same file, called an "oplock break".
The client should then purge any cached file data and respond when it is
done.
This needs to be a nonblocking purge for similar reasons as mentioned for
rpciod. The suggested best-effort-try-again-later invalidate sounds ok for
that.
There exists smb documentation stating that caching is not allowed without
a held oplock. But how to deal with mmap wasn't really in that author's
mind.
If I understood Andrew right, the inode flag won't purge anything
until the next userspace access. I don't think that is what smbfs wants,
since the response to the oplock break should happen fairly soon ...
/Urban
Urban Widmark wrote:
>
> ...
> If I understood Andrew right with the inode flag that won't purge anything
> until the next userspace access. I don't think that is what smbfs wants
> since the response to the oplock break should happen fairly soon ...
Well the lazy invalidation would be OK - defer that to the next
userspace access, or just let the pages die a natural VM death,
maybe poke `ksmbinvalidated'.
But if you want to perform writeback within the local IO daemon
then hm. That's a problem which NFS seems to not have? I guess
you'd have to leave i_sem held and poke `ksmbwritebackd'. Or teach
pdflush about queued work items (a tq_struct would do it).
But it's the same story: the requirements of
a) non blocking local IO daemon and
b) assured pagecache takedown
are conflicting. You need at least one more thread, and locking
against userspace activity.
On Thu, 12 Sep 2002, Andrew Morton wrote:
> Well the lazy invalidation would be OK - defer that to the next
> userspace access,
I think I have an idea on how to do that, here's some pseudocode:
invalidate_page(struct page * page) {
SetPageInvalidated(page);
rmap_lock(page);
for_each_pte(pte, page) {
make pte PROT_NONE;
flush TLBs for this virtual address;
}
rmap_unlock(page);
}
And in the page fault path:
if (pte_protection(pte) == PROT_NONE && PageInvalidated(pte_page(pte))) {
clear_pte(ptep);
page_cache_release(page);
mm->rss--;
}
What do you think, is this simple enough that it would work ? ;)
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Spamtraps of the month: [email protected] [email protected]
Rik van Riel wrote:
>
> On Thu, 12 Sep 2002, Andrew Morton wrote:
>
> > Well the lazy invalidation would be OK - defer that to the next
> > userspace access,
>
> I think I have an idea on how to do that, here's some pseudocode:
>
> invalidate_page(struct page * page) {
> SetPageInvalidated(page);
> rmap_lock(page);
> for_each_pte(pte, page) {
> make pte PROT_NONE;
> flush TLBs for this virtual address;
> }
> rmap_unlock(page);
> }
>
> And in the page fault path:
>
> if (pte_protection(pte) == PROT_NONE && PageInvalidated(pte_page(pte))) {
> clear_pte(ptep);
> page_cache_release(page);
> mm->rss--;
> }
>
That's the bottom-up approach. The top-down (vmtruncate) approach
would also work, if the locking is suitable.
Look, idunnoigiveup. Like scsi and USB, NFS is a black hole
where akpms fear to tread. I think I'll sulk until someone
explains why this work has to be performed in the context of
a process which cannot do it.
On Friday 13 September 2002 00:30, Rik van Riel wrote:
> On Thu, 12 Sep 2002, Andrew Morton wrote:
>
> > Well the lazy invalidation would be OK - defer that to the next
> > userspace access,
>
> I think I have an idea on how to do that, here's some pseudocode:
>
> invalidate_page(struct page * page) {
> SetPageInvalidated(page);
> rmap_lock(page);
> for_each_pte(pte, page) {
> make pte PROT_NONE;
> flush TLBs for this virtual address;
> }
> rmap_unlock(page);
> }
>
> And in the page fault path:
>
> if (pte_protection(pte) == PROT_NONE && PageInvalidated(pte_page(pte))) {
> clear_pte(ptep);
> page_cache_release(page);
> mm->rss--;
> }
>
> What do you think, is this simple enough that it would work ? ;)
This is very promising. Actually, we'd only need to do that for
pages that are currently skipped in invalidate_inode_pages. Have
to think about possible interactions now...
--
Daniel
On Thu, 12 Sep 2002, Andrew Morton wrote:
| Look, idunnoigiveup. Like scsi and USB, NFS is a black hole
| where akpms fear to tread.
hehe.
s/NFS/VM/
s/akpms/me et al/
Fortunately we have a large developer community with different
(complementary) points of view and interests.
--
~Randy
"Linux is not a research project. Never was, never will be."
-- Linus, 2002-09-02
On Thu, 12 Sep 2002, Andrew Morton wrote:
> Rik van Riel wrote:
> > invalidate_page(struct page * page) {
> That's the bottom-up approach. The top-down (vmtruncate) approach
> would also work, if the locking is suitable.
The top-down approach will almost certainly be most efficient when
invalidating a large chunk of a file (truncate, large file locks)
while the bottom-up approach is probably more efficient when the
system invalidates very few pages (small file lock, cluster file
system mmap() support).
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Spamtraps of the month: [email protected] [email protected]
On Friday 13 September 2002 01:23, Rik van Riel wrote:
> On Thu, 12 Sep 2002, Andrew Morton wrote:
> > Rik van Riel wrote:
>
> > > invalidate_page(struct page * page) {
>
> > That's the bottom-up approach. The top-down (vmtruncate) approach
> > would also work, if the locking is suitable.
>
> The top-down approach will almost certainly be most efficient when
> invalidating a large chunk of a file (truncate, large file locks)
> while the bottom-up approach is probably more efficient when the
> system invalidates very few pages (small file lock, cluster file
> system mmap() support).
The bottom-up approach is the one we want to use when we'd otherwise
skip a page in invalidate_inode_pages. This is the rare case. On the
face of it, this works out very well. Just have to think about
interactions now - I don't see any, but I haven't really gone hunting
yet.
--
Daniel
On Friday 13 September 2002 00:30, Rik van Riel wrote:
> On Thu, 12 Sep 2002, Andrew Morton wrote:
>
> > Well the lazy invalidation would be OK - defer that to the next
> > userspace access,
>
> I think I have an idea on how to do that, here's some pseudocode:
>
> invalidate_page(struct page * page) {
> SetPageInvalidated(page);
> rmap_lock(page);
> for_each_pte(pte, page) {
> make pte PROT_NONE;
> flush TLBs for this virtual address;
> }
> rmap_unlock(page);
> }
>
> And in the page fault path:
>
> if (pte_protection(pte) == PROT_NONE && PageInvalidated(pte_page(pte))) {
> clear_pte(ptep);
> page_cache_release(page);
> mm->rss--;
> }
>
> What do you think, is this simple enough that it would work ? ;)
Too simple to work, unfortunately. We have to at least 1) lock the
page and 2) remove it from the page cache. Can we remove a page from
the page cache while it still has pte references? I suppose we can;
it turns into an anonymous page. But isn't that the reason we didn't
do it in invalidate_inode_pages in the first place? However, since
we now, in addition, mark the page invalidated, it doesn't matter any
more what kind of page it is, so I suppose that's ok.
And we need a locked_page_cache_release->free_locked_page.
--
Daniel
On Friday 13 September 2002 06:19, I wrote:
> And we need a locked_page_cache_release->free_locked_page.
Hmm, maybe not. We should be able to unlock the page just after
removing it from the page cache.
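That is, roughly (names approximate):
	lock_page(page);
	remove_from_page_cache(page);	/* page is now effectively anonymous */
	unlock_page(page);		/* no need to keep it locked any longer */
	page_cache_release(page);	/* drop the reference the pagecache held */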
By the way, the wait_on_page in lock_page is where we finally get
around for waiting on those locked pages we couldn't get rid of
in invalidate_inode_pages; what we have done here is shifted the
wait from rpciod to normal users, which is exactly what we want.
--
Daniel
On Thu, 12 Sep 2002, Andrew Morton wrote:
> But it's the same story: the requirements of
>
> a) non blocking local IO daemon and
>
> b) assured pagecache takedown
>
> are conflicting. You need at least one more thread, and locking
> against userspace activity.
I see no problem with adding another thread to handle the breaks.
Only the cost of an extra thread and the fact that smbiod was originally
created to handle the break (with a thought to eventually make it do the
IO as it does now) makes me want to put it in smbiod.
/Urban
>>>>> " " == Andrew Morton <[email protected]> writes:
> Look, idunnoigiveup. Like scsi and USB, NFS is a black hole
> where akpms fear to tread. I think I'll sulk until someone
> explains why this work has to be performed in the context of a
> process which cannot do it.
I'd be happy to move that work out of the RPC callbacks if you could
point out which other processes actually can do it.
The main problem is that the VFS/MM has no way of relabelling pages as
being invalid or no longer up to date: I once proposed simply clearing
PG_uptodate on those pages which cannot be cleared by
invalidate_inode_pages(), but this was not to Linus' taste.
Cheers,
Trond
Trond Myklebust wrote:
>
> >>>>> " " == Andrew Morton <[email protected]> writes:
>
> > Look, idunnoigiveup. Like scsi and USB, NFS is a black hole
> > where akpms fear to tread. I think I'll sulk until someone
> > explains why this work has to be performed in the context of a
> > process which cannot do it.
>
> I'd be happy to move that work out of the RPC callbacks if you could
> point out which other processes actually can do it.
Well it has to be a new thread, or user processes.
Would it be possible to mark the inode as "needs invalidation",
and make user processes check that flag once they have i_sem?
> The main problem is that the VFS/MM has no way of relabelling pages as
> being invalid or no longer up to date: I once proposed simply clearing
> PG_uptodate on those pages which cannot be cleared by
> invalidate_inode_pages(), but this was not to Linus' taste.
Yes, clearing PageUptodate without holding the page lock is
pretty scary.
Do we really need to invalidate individual pages, or is it real-life
acceptable to invalidate the whole mapping?
Doing an NFS-special invalidate function is not a problem, btw - my
current invalidate_inode_pages() is just 25 lines. It's merely a
matter of working out what it should do ;)
On Monday 23 September 2002 18:38, Trond Myklebust wrote:
> >>>>> " " == Andrew Morton <[email protected]> writes:
>
> > Look, idunnoigiveup. Like scsi and USB, NFS is a black hole
> > where akpms fear to tread. I think I'll sulk until someone
> > explains why this work has to be performed in the context of a
> > process which cannot do it.
>
> I'd be happy to move that work out of the RPC callbacks if you could
> point out which other processes actually can do it.
>
> The main problem is that the VFS/MM has no way of relabelling pages as
> being invalid or no longer up to date: I once proposed simply clearing
> PG_uptodate on those pages which cannot be cleared by
> invalidate_inode_pages(), but this was not to Linus' taste.
Could you please provide a subject line for that original discussion?
--
Daniel
On Monday 23 September 2002 18:38, Trond Myklebust wrote:
> >>>>> " " == Andrew Morton <[email protected]> writes:
>
> > Look, idunnoigiveup. Like scsi and USB, NFS is a black hole
> > where akpms fear to tread. I think I'll sulk until someone
> > explains why this work has to be performed in the context of a
> > process which cannot do it.
>
> I'd be happy to move that work out of the RPC callbacks if you could
> point out which other processes actually can do it.
>
> The main problem is that the VFS/MM has no way of relabelling pages as
> being invalid or no longer up to date: I once proposed simply clearing
> PG_uptodate on those pages which cannot be cleared by
> invalidate_inode_pages(), but this was not to Linus' taste.
I'll take a run at analyzing this.
First, it's clear why we can't just set the page !uptodate: if we fail to
lock the page we can't change the state of the uptodate bit because we
would violate the locking rules, iow, we would race with the vfs (see
block_read/write_full_page).
Note that even if we succeed in the TryLock and set !uptodate, we still
have to walk the rmap list and unmap the page or it won't get refaulted
and the uptodate bit will be ignored.
For any page we can't lock without blocking, the cases are:
1) Dirty: we don't need to invalidate it because it's going to get
written back to the server anyway
2) Locked, clean: the page could be locked for any number of reasons.
Probably, it's locked for reading though. We *obviously* need to
kill this page at some point or we have a nasty heisenbug. E.g.,
somebody, somewhere, will get a file handed back to them from some
other client that rewrote the whole thing, complete and correct
except for a stale page or two.
For pages that we can lock, we have:
3) Elevated count, clean: we could arguably ignore the use count
and just yank the page out of the inode list, as Andrew's patch
does. Getting it out of the mapping is harder, perhaps much
harder.
4) Clean, has buffers, can't get rid of the buffers: we can't know
why. HTree puts pages in this state for directory access, Ext3
probably does it for a variety of reasons. Same situation as
above.
Given the obviously broken case (2) above and the two probably broken
cases (3) and (4), I don't see any way to ignore this problem and still
implement the NFS semantics Chuck described earlier.
I see Rik's suggestion of marking the problem pages invalid, and walking
the ptes to protect them as the cleanest fix. Unlike invalidate_inode_pages,
the fault path can block perfectly happily while the problem conditions
sort themselves out.
--
Daniel
>>>>> " " == Andrew Morton <[email protected]> writes:
> Well it has to be a new thread, or user processes.
> Would it be possible to mark the inode as "needs invalidation",
> and make user processes check that flag once they have i_sem?
Not good enough unless you add those checks at the VFS/MM level. Think
for instance about mmap() where the filesystem is usually not involved
once a pagein has occurred.
> Do we really need to invalidate individual pages, or is it
> real-life acceptable to invalidate the whole mapping?
Invalidating the mapping is certainly a good alternative if it can be
done cleanly.
Note that in doing so, we do not want to invalidate any reads or
writes that may have been already scheduled. The existing mapping
still would need to hang around long enough to permit them to
complete.
> Doing an NFS-special invalidate function is not a problem, btw
> - my current invalidate_inode_pages() is just 25 lines. It's
> merely a matter of working out what it should do ;)
invalidate_inode_pages() *was* NFS-specific until you guys hijacked it
for other purposes in 2.5.x ;-)
Cheers,
Trond
On Monday 23 September 2002 22:41, Trond Myklebust wrote:
> >>>>> " " == Andrew Morton <[email protected]> writes:
>
> > Would it be possible to mark the inode as "needs invalidation",
> > and make user processes check that flag once they have i_sem?
>
> Not good enough unless you add those checks at the VFS/MM level.
Please see Rik's suggestion and my followups where we are talking about
handling this in the MM.
> Think
> for instance about mmap() where the filesystem is usually not involved
> once a pagein has occurred.
>
> > Do we really need to invalidate individual pages, or is it
> > real-life acceptable to invalidate the whole mapping?
>
> Invalidating the mapping is certainly a good alternative if it can be
> done cleanly.
But invalidate_inode_pages is (usually) just trying to do exactly that.
It doesn't work. Anyway, what if you have a 2 gig file with 1 meg
mmaped/locked/whatever by a database?
> Note that in doing so, we do not want to invalidate any reads or
> writes that may have been already scheduled. The existing mapping
> still would need to hang around long enough to permit them to
> complete.
With the mechanism I described above, that would just work. The fault
path would do lock_page, thus waiting for the IO to complete.
--
Daniel
>>>>> " " == Daniel Phillips <[email protected]> writes:
>> Note that in doing so, we do not want to invalidate any reads
>> or writes that may have been already scheduled. The existing
>> mapping still would need to hang around long enough to permit
>> them to complete.
> With the mechanism I described above, that would just work.
> The fault path would do lock_page, thus waiting for the IO to
> complete.
NFS writes do not hold the page lock until completion. How would you
expect to be able to coalesce writes to the same page if they did?
Cheers,
Trond
On Tuesday 24 September 2002 00:43, Trond Myklebust wrote:
> >>>>> " " == Daniel Phillips <[email protected]> writes:
>
> >> Note that in doing so, we do not want to invalidate any reads
> >> or writes that may have been already scheduled. The existing
> >> mapping still would need to hang around long enough to permit
> >> them to complete.
>
> > With the mechanism I described above, that would just work.
> > The fault path would do lock_page, thus waiting for the IO to
> > complete.
>
> NFS writes do not hold the page lock until completion. How would you
> expect to be able to coalesce writes to the same page if they did?
Coalesce before initiating writeout? I don't see why NFS should be special
in this regard, or why it should not leave a page locked until IO has
completed, like other filesystems. Could you please explain?
--
Daniel
On Tuesday 24 September 2002 07:09, Daniel Phillips wrote:
> Coalesce before initiating writeout? I don't see why NFS should be special
> in this regard, or why it should not leave a page locked until IO has
> completed, like other filesystems. Could you please explain?
It does that for reads.
For writes, however, I see no point in keeping the page lock beyond what is
required in order to safely copy data from userland. To do so would give
disastrous performance if somebody was trying to write < page-sized records.
I believe this is true for all filesystems.
For some reason or another, the buffer stuff needs to re-take the page lock
when it actually performs the physical commit to disk. NFS doesn't need to do
this in order to perform an RPC call, so we don't...
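In outline (not the exact code paths, just the ordering):
	lock_page(page);
	/* copy the user's data into the page */
	nfs_updatepage(file, page, offset, count);	/* queue/coalesce the write */
	unlock_page(page);
	/* the WRITE RPC itself goes out later, without the page lock held */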
Cheers,
Trond