LinuxLists.cc - Questions - Re: [PATCH] documentation for mm.h

2001-03-08 10:11:48

Subject: Questions - Re: [PATCH] documentation for mm.h

At 22:33 07/03/2001, Rik van Riel wrote:
[snip]
> typedef struct page {
>+ struct list_head list; /* ->mapping has some page lists. */
>+ struct address_space *mapping; /* The inode (or ...) we belong to. */
>+ unsigned long index; /* Our offset within mapping. */

Assuming index is in bytes (it looks like it is): Shouldn't index of type
unsigned long long or __u64? Otherwise, AFAICS using the page cache
automatically results in an artificial 4Gib limit on file size, which is
not very good, even by todays standards.

[snip]
>+ * During disk I/O, PG_locked is used. This bit is set before I/O
>+ * and reset when I/O completes. page->wait is a wait queue of all
>+ * tasks waiting for the I/O on this page to complete.

Is this physical I/O only or does it include a driver writing/reading the page?

Thanks,

Anton

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2001-03-08 10:52:21

by Ingo Oeser

[permalink] [raw]

Subject: Re: Questions - Re: [PATCH] documentation for mm.h

On Thu, Mar 08, 2001 at 10:11:50AM +0000, Anton Altaparmakov wrote:
> At 22:33 07/03/2001, Rik van Riel wrote:
> [snip]
> > typedef struct page {
> >+ struct list_head list; /* ->mapping has some page lists. */
> >+ struct address_space *mapping; /* The inode (or ...) we belong to. */
> >+ unsigned long index; /* Our offset within mapping. */
>
> Assuming index is in bytes (it looks like it is):

isn't. To get the byte offset, you have to multiply it by PAGE_{CACHE_,}SIZE.

> [snip]
> >+ * During disk I/O, PG_locked is used. This bit is set before I/O
> >+ * and reset when I/O completes. page->wait is a wait queue of all
> >+ * tasks waiting for the I/O on this page to complete.
>
> Is this physical I/O only or does it include a driver writing/reading the page?

Depends on the method of the driver, that is getting called.
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2001-03-08 11:40:22

by Anton Altaparmakov

[permalink] [raw]

Subject: Re: Questions - Re: [PATCH] documentation for mm.h

At 10:51 08/03/01, Ingo Oeser wrote:
>On Thu, Mar 08, 2001 at 10:11:50AM +0000, Anton Altaparmakov wrote:
> > At 22:33 07/03/2001, Rik van Riel wrote:
> > [snip]
> > > typedef struct page {
> > >+ struct list_head list; /* ->mapping has some page
> lists. */
> > >+ struct address_space *mapping; /* The inode (or ...) we
> belong to. */
> > >+ unsigned long index; /* Our offset within mapping. */
> >
> > Assuming index is in bytes (it looks like it is):
>
>isn't. To get the byte offset, you have to multiply it by PAGE_{CACHE_,}SIZE.

Hi, first of all, thanks for the reply!

How do you reconcile that statement with the following comment from mm.h?

> * A page may belong to an inode's memory mapping. In this case,
> * page->mapping is the pointer to the inode, and page->offset is the
> * file offset of the page (not necessarily a multiple of PAGE_SIZE).

Surely, if you have to multiply index by PAGE_{CACHE_}SIZE, page->offset
would be a multiple of PAGE_{CACHE_}SIZE?

And even if it really is PAGE_{CACHE_}SIZE units, this still doesn't solve
the problem, it just defers it to 16Tib (on ia32 arch with 4kib
PAGE_{CACHE_}SIZE). With NTFS 3.0's use of sparse files, for the usn
journal for example, even this will be overflowed at some point on a
busy/large server. The only proper solution AFAICS is to allow the full
64-bits.

> > [snip]
> > >+ * During disk I/O, PG_locked is used. This bit is set before I/O
> > >+ * and reset when I/O completes. page->wait is a wait queue of all
> > >+ * tasks waiting for the I/O on this page to complete.
> >
> > Is this physical I/O only or does it include a driver writing/reading
> the page?
>
>Depends on the method of the driver, that is getting called.

Sorry, I should have been more detailed in my question, so let me try
again: When the NTFS file system driver needs to modify the meta data,
which will be in the page cache (meta data is stored in normal files on
NTFS, hence the page cache is very well suited to storing it with it's
page->mapping and page->offset fields), does the NTFS driver need to set
PG_locked while writing to the page?

And what about reading for that matter? What is the access serialization here?

Obviously I can have several readers on the same metadata at the same time,
and that's fine, but if someone is writing, then allowing anyone to read
the data at the same time would result in corrupt meta data being read
(this is because I am only going to use the page cache, i.e. there will be
no copying of the data at all, except for: user space <-> page cache <->
disk). I am thinking that a read/write semaphore would be the perfect
solution for this here, but it would be nice, if this could be handled on a
per page basis rather than a per file basis, at the very least so for meta
data files.

Thanks,

Anton

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2001-03-08 12:26:12

by Rik van Riel

[permalink] [raw]

Subject: Re: Questions - Re: [PATCH] documentation for mm.h

On Thu, 8 Mar 2001, Anton Altaparmakov wrote:
> At 22:33 07/03/2001, Rik van Riel wrote:
> [snip]
> > typedef struct page {
> >+ struct list_head list; /* ->mapping has some page lists. */
> >+ struct address_space *mapping; /* The inode (or ...) we belong to. */
> >+ unsigned long index; /* Our offset within mapping. */
>
> Assuming index is in bytes (it looks like it is): Shouldn't index of type

It's in units of PAGE_CACHE_SIZE. I've corrected the documentation.

> [snip]
> >+ * During disk I/O, PG_locked is used. This bit is set before I/O
> >+ * and reset when I/O completes. page->wait is a wait queue of all
> >+ * tasks waiting for the I/O on this page to complete.
>
> Is this physical I/O only or does it include a driver
> writing/reading the page?

I'm not sure ... anyone ?

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-03-08 12:28:32

by Rik van Riel

[permalink] [raw]

Subject: Re: Questions - Re: [PATCH] documentation for mm.h

On Thu, 8 Mar 2001, Anton Altaparmakov wrote:
> At 10:51 08/03/01, Ingo Oeser wrote:
> >On Thu, Mar 08, 2001 at 10:11:50AM +0000, Anton Altaparmakov wrote:
> > > At 22:33 07/03/2001, Rik van Riel wrote:

> > * A page may belong to an inode's memory mapping. In this case,
> > * page->mapping is the pointer to the inode, and page->offset is the
> > * file offset of the page (not necessarily a multiple of PAGE_SIZE).
>
> Surely, if you have to multiply index by PAGE_{CACHE_}SIZE,
> page->offset would be a multiple of PAGE_{CACHE_}SIZE?

Whooops, indeed. This was a piece of old documentation which was
200 lines down for inexplicable reasons. It's been corrected now.

thanks,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-03-08 12:42:02

by Ingo Oeser

[permalink] [raw]

Subject: Re: Questions - Re: [PATCH] documentation for mm.h

Hi,

On Thu, Mar 08, 2001 at 11:39:27AM +0000, Anton Altaparmakov wrote:
> > > >+ unsigned long index; /* Our offset within mapping. */
> > > Assuming index is in bytes (it looks like it is):
> >isn't. To get the byte offset, you have to multiply it by PAGE_{CACHE_,}SIZE.
>
> How do you reconcile that statement with the following comment from mm.h?
>
> > * A page may belong to an inode's memory mapping. In this case,
> > * page->mapping is the pointer to the inode, and page->offset is the
> > * file offset of the page (not necessarily a multiple of PAGE_SIZE).

page->index << PAGE_CACHE_SHIFT is the byte offset of this page
in the mapping.

The comment is crap, that should be removed (page->offset does
not exist anymore).

Rik did this in his documentation patch or will do it, if he
reads this ;-)

> And even if it really is PAGE_{CACHE_}SIZE units, this still doesn't solve
> the problem, it just defers it to 16Tib (on ia32 arch with 4kib
> PAGE_{CACHE_}SIZE). With NTFS 3.0's use of sparse files, for the usn
> journal for example, even this will be overflowed at some point on a
> busy/large server. The only proper solution AFAICS is to allow the full
> 64-bits.

Which is discussed already in another thread at lkml. I hope that
we'll get a 64bit blocklayer one day in 2.5/2.6 development.
Since this is only around 2 years and might be backported to 2.4
if needed, I don't see a big problem of deferring this.

Let's leave some market niche for commercial solutions for a while ;-)

> > > [snip]
> > > >+ * During disk I/O, PG_locked is used. This bit is set before I/O
> > > >+ * and reset when I/O completes. page->wait is a wait queue of all
> > > >+ * tasks waiting for the I/O on this page to complete.
> > >
> > > Is this physical I/O only or does it include a driver writing/reading
> > the page?
> >
> >Depends on the method of the driver, that is getting called.
>
> Sorry, I should have been more detailed in my question, so let me try
> again: When the NTFS file system driver needs to modify the meta data,
> which will be in the page cache (meta data is stored in normal files on
> NTFS, hence the page cache is very well suited to storing it with it's
> page->mapping and page->offset fields), does the NTFS driver need to set
> PG_locked while writing to the page?

Ahh. I thought you've meant DEVICE drivers. If I talk about
drivers, I usally mean that.

May be you should raise these issues again under a separate
subject and CC [email protected] for this.

I think it is worth it, because Linus want all fs to use page
cache for meta data and let buffer cache die slowly.

Basically the rules go like this:

The VM will wait for PG_locked pages, before it accesses them or
ignore them, if it cannot wait.

It will also try to read in pages, that are not PG_uptodate.

But user space will never see metadata pages anyway, so you
should be the only one, who cares about them. Just be prepared to
writepage() and readpage() and the like.

Just use lock_page() if you can sleep and TryLockPage() + EFAULT
(or similar) if you cannot.

Then just check Page_Uptodate() before you read and do
ClearUptodate() if you start writing to the metadata.

Since these operations are atomic bit operations, it should
suffice for your purpose.

But as stated above, I'm not very sure that I understand all the
code and know of all the races.

Multiple readers are AFAICS not yet possible with the current page
cache.

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2001-03-08 13:27:40

by Anton Altaparmakov

[permalink] [raw]

Subject: Question: fs meta data, page cache and locking

It was suggested to repost the below as a new thread and to cc: linux-fsdevel.

Any comments would be appreciated.

TIA,
Anton

So here goes:

At 12:41 08/03/01, Ingo Oeser wrote:
> > On Thu, Mar 08, 2001 at 11:39:27AM +0000, Anton Altaparmakov wrote:
>[snip, attributions lost]
> > > > >+ * During disk I/O, PG_locked is used. This bit is set before I/O
> > > > >+ * and reset when I/O completes. page->wait is a wait queue of all
> > > > >+ * tasks waiting for the I/O on this page to complete.
>[snip]
> > When the NTFS file system driver needs to modify the meta data,
> > which will be in the page cache (meta data is stored in normal files on
> > NTFS, hence the page cache is very well suited to storing it with it's
> > page->mapping and page->index fields), does the NTFS driver need to set
> > PG_locked while writing to the page?
>[snip]
>May be you should raise these issues again under a separate
>subject and CC [email protected] for this.
>
>I think it is worth it, because Linus want all fs to use page
>cache for meta data and let buffer cache die slowly.
>
>Basically the rules go like this:
>
>The VM will wait for PG_locked pages, before it accesses them or
>ignore them, if it cannot wait.
>
>It will also try to read in pages, that are not PG_uptodate.
>
>But user space will never see metadata pages anyway, so you
>should be the only one, who cares about them. Just be prepared to
>writepage() and readpage() and the like.
>
>Just use lock_page() if you can sleep and TryLockPage() + EFAULT
>(or similar) if you cannot.
>
>Then just check Page_Uptodate() before you read and do
>ClearUptodate() if you start writing to the metadata.
>
>Since these operations are atomic bit operations, it should
>suffice for your purpose.
>
>But as stated above, I'm not very sure that I understand all the
>code and know of all the races.
>
>Multiple readers are AFAICS not yet possible with the current page
>cache.
>
>Regards
>
>Ingo Oeser
>--
>10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
> <<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2001-03-08 15:00:29

by Alexander Viro

[permalink] [raw]

Subject: Re: Question: fs meta data, page cache and locking

On Thu, 8 Mar 2001, Anton Altaparmakov reposted:

> >But user space will never see metadata pages anyway, so you
> >should be the only one, who cares about them. Just be prepared to
> >writepage() and readpage() and the like.

ITYM ->prepare_write()/->commit_write().

See ftp.math.psu.edu/pub/viro/ext2-dir-patch-S2.gz for example of
metadata in pagecache. For deeper metadata (== stuff that can
be needed to access with some pages locked, in case of ext2 that
would be indirect blocks, inode/block bitmaps and group descriptors)
you need to set ->gfp_mask of address_space to prohibit IO on
allocation. See drivers/block/loop.c - it has to do the same to
->i_mapping of underlying file.
Cheers,
Al