2002-01-20 09:03:39

by Shawn Starr

[permalink] [raw]
Subject: Possible Idea with filesystem buffering.


I've noticed that XFS's filesystem has a separate pagebuf_daemon to handle
caching/buffering.

Why not make a kernel page/caching daemon for other filesystems to use
(kpagebufd) so that each filesystem can use a kernel daemon interface to
handle buffering and caching?

I found that XFS's buffering/caching significantly reduced I/O load on the
system (with riel's rmap11b + rml's preempt patches and Andre's IDE
patch).

But I've not been able to achieve the same speed results with ReiserFS :-(

Just as we have a filesystem (VFS) layer, why not have a buffering/caching
layer for the filesystems to use in conjunction with the VM?

Comments, suggestions, flames welcome ;)

Shawn.


2002-01-20 11:35:54

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

In version 4 of reiserfs, our plan is to implement writepage such that
it does not write the page but instead pressures the reiser4 cache and
marks the page as recently accessed. This is Linus's preferred method
of doing that.

Personally, I think that makes writepage the wrong name for that
function, but I must admit it gets the job done, and it leaves writepage
as the right name for all filesystems that don't manage their own cache,
which is most of them.
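
As a purely illustrative sketch of the idea (this is not reiser4 or kernel
code; the structures and names below are invented), the hook converts the
request into pressure on the filesystem's own cache and marks the page
referenced instead of writing it:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of a filesystem-managed subcache. */
struct subcache {
    long nr_pages;   /* pages currently held by this cache   */
    long pressure;   /* accumulated aging pressure from "VM" */
};

struct page_model {
    struct subcache *owner;
    bool dirty;
    bool referenced;
};

/*
 * Model of the reiser4-style writepage: do NOT write the page.
 * Record one unit of pressure against the owning subcache and mark
 * the page as recently accessed so the "VM" skips it for now.
 */
static int writepage_as_pressure(struct page_model *page)
{
    page->owner->pressure++;
    page->referenced = true;
    return 0;               /* page stays dirty; the fs flushes it later */
}

int main(void)
{
    struct subcache r4 = { .nr_pages = 100 };
    struct page_model p = { .owner = &r4, .dirty = true };

    writepage_as_pressure(&p);
    printf("pressure=%ld dirty=%d referenced=%d\n",
           r4.pressure, p.dirty, p.referenced);
    return 0;
}

The page stays dirty and gets flushed later on the filesystem's own
schedule, which is exactly why writepage feels like the wrong name for it.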

Hans

Shawn wrote:

>I've noticed that XFS's filesystem has a separate pagebuf_daemon to handle
>caching/buffering.
>
>Why not make a kernel page/caching daemon for other filesystems to use
>(kpagebufd) so that each filesystem can use a kernel daemon interface to
>handle buffering and caching.
>
>I found that XFS's buffering/caching significantly reduced I/O load on the
>system (with riel's rmap11b + rml's preempt patches and Andre's IDE
>patch).
>
>But I've not been able to achieve the same speed results with ReiserFS :-(
>
>Just as we have a filesystem (VFS) layer, why not have a buffering/caching
>layer for the filesystems to use in conjunction with the VM?
>
There is hostility to this from one of the VM maintainers. He is
concerned that separate caches are what they had before, and that they
behaved badly. I think that they simply coded them wrong the time before:
the pressure on the subcaches was uneven, with some caches only getting
pressure if the other caches couldn't free anything, so of course it
behaved badly.

>
>
>Comments, suggestions, flames welcome ;)
>
>Shawn.
>



2002-01-20 13:56:44

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Sun, 20 Jan 2002, Hans Reiser wrote:

> In version 4 of reiserfs, our plan is to implement writepage such that
> it does not write the page but instead pressures the reiser4 cache and
> marks the page as recently accessed.

What is this supposed to achieve ?

> Personally, I think that makes writepage the wrong name for that
> function, but I must admit it gets the job done,

And what job would that be ?

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-20 14:25:18

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Write clustering is one thing it achieves. When we flush a slum, the
cost of the seek so far outweighs the transfer cost that we should
transfer FLUSH_SIZE (where FLUSH_SIZE is imagined to be something like
64 or 16, or at least 8) adjacent (in the tree order) nodes to disk at
the same time. There are many ways in which LRU is only an approximation
to the optimum; this is one of many.

Flushing everything involved in a transaction is another thing it
achieves: the buffers are pinned in RAM until the transaction commits
(so that they don't have to be reread from disk when it does), and
flushing them all together is what lets them be unpinned.
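
A toy, self-contained C sketch of the clustering idea (this is not reiser4
code; FLUSH_SIZE, the node array and the flush routine are all invented for
illustration): when one node has to go out, gather up to FLUSH_SIZE
tree-adjacent dirty neighbours and submit them as a single run.

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES   32
#define FLUSH_SIZE 8     /* hypothetical cluster size */

/* Dirty flags indexed by position in tree order. */
static bool dirty[NR_NODES];

/* Stand-in for submitting one contiguous run of nodes to disk. */
static void submit_run(int first, int count)
{
    printf("writing nodes %d..%d in one request\n",
           first, first + count - 1);
    for (int i = first; i < first + count; i++)
        dirty[i] = false;
}

/*
 * Asked to write node 'victim': instead of a single-page write, extend
 * left and right over tree-adjacent dirty neighbours, up to FLUSH_SIZE
 * nodes, and flush the whole run with one seek.
 */
static void flush_clustered(int victim)
{
    int first = victim, last = victim;

    while (first > 0 && dirty[first - 1] &&
           last - first + 1 < FLUSH_SIZE)
        first--;
    while (last < NR_NODES - 1 && dirty[last + 1] &&
           last - first + 1 < FLUSH_SIZE)
        last++;

    submit_run(first, last - first + 1);
}

int main(void)
{
    for (int i = 10; i <= 20; i++)
        dirty[i] = true;     /* a "slum" of adjacent dirty nodes */

    flush_clustered(14);     /* asked for one node, write a whole run */
    return 0;
}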

Hans

2002-01-20 15:14:25

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Sun, 20 Jan 2002, Hans Reiser wrote:

> Write clustering is one thing it achieves.
>
> Flushing everything involved in a transaction ... is another thing.

Agreed on these points, but you really HAVE TO work towards
flushing the page ->writepage() gets called for.

Think about your typical PC, with memory in ZONE_DMA,
ZONE_NORMAL and ZONE_HIGHMEM. If we are short on DMA pages
we will end up calling ->writepage() on a DMA page.

If the filesystem ends up writing completely unrelated pages
and marking the DMA page in question referenced the VM will
go in a loop until the filesystem finally gets around to
making a page in the (small) DMA zone freeable ...
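
A standalone C model of that situation (the zone names mirror the kernel's;
everything else is invented for illustration): if the filesystem answers
writepage by cleaning pages in some other zone, the reclaim loop for the
small zone makes no progress.

#include <stdbool.h>
#include <stdio.h>

enum zone { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

struct page_model {
    enum zone zone;
    bool freeable;
};

/* A filesystem that "helps" by cleaning some unrelated highmem page. */
static void fs_writepage_unrelated(struct page_model *pages, int n)
{
    for (int i = 0; i < n; i++) {
        if (pages[i].zone == ZONE_HIGHMEM && !pages[i].freeable) {
            pages[i].freeable = true;   /* wrong zone: useless here */
            return;
        }
    }
}

/* VM needs one freeable page in 'want'; count the passes it takes. */
static int vm_reclaim(struct page_model *pages, int n,
                      enum zone want, int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        for (int i = 0; i < n; i++)
            if (pages[i].zone == want && pages[i].freeable)
                return pass;
        fs_writepage_unrelated(pages, n);  /* fs ignores the target zone */
    }
    return -1;   /* the VM loops without making progress */
}

int main(void)
{
    struct page_model pages[] = {
        { ZONE_DMA, false }, { ZONE_HIGHMEM, false }, { ZONE_HIGHMEM, false },
    };
    int passes = vm_reclaim(pages, 3, ZONE_DMA, 10);

    if (passes < 0)
        printf("no progress freeing a ZONE_DMA page\n");
    else
        printf("ZONE_DMA page freeable after %d passes\n", passes);
    return 0;
}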

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-20 15:49:21

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

At 11:31 20/01/02, Hans Reiser wrote:
>In version 4 of reiserfs, our plan is to implement writepage such that it
>does not write the page but instead pressures the reiser4 cache and marks
>the page as recently accessed. This is Linus's preferred method of doing that.

But why do you want to do your own cache? Any individual fs driver is in no
position to know the overall demands on the VMM from the currently running
kernel/user programs/etc. As such it is IMHO inefficient, and I think it
won't actually work: the VMM needs to free specific memory and hence calls
writepage on that specific memory so it can throw the pages away afterwards,
but in your concept writepage won't result in the page being marked clean,
so the VM has made no progress and you have just created a whole load of
headaches for the VMM which it can't solve...

The VMM should be the ONLY thing in the kernel that has full control of all
caches in the system, and certainly all fs caches. Why you are putting a
second cache layer underneath the VMM is beyond me. It would be much better
to fix/expand the capabilities of the existing VMM which would have the
benefit that all fs could benefit not just ReiserFS.

>Personally, I think that makes writepage the wrong name for that function,
>but I must admit it gets the job done, and it leaves writepage as the
>right name for all filesystems that don't manage their own cache, which is
>most of them.

Yes, it does make it the wrong name, but not only that, it also breaks the
existing VMM, if I understand anything about the VMM (which may of course
not be the case...).

Just a thought.

Best regards,

Anton


>Hans
>
>Shawn wrote:
>
>>I've noticed that XFS's filesystem has a separate pagebuf_daemon to handle
>>caching/buffering.
>>
>>Why not make a kernel page/caching daemon for other filesystems to use
>>(kpagebufd) so that each filesystem can use a kernel daemon interface to
>>handle buffering and caching.
>>
>>I found that XFS's buffering/caching significantly reduced I/O load on the
>>system (with riel's rmap11b + rml's preempt patches and Andre's IDE
>>patch).
>>
>>But I've not been able to achieve the same speed results with ReiserFS :-(
>>
>>Just as we have a filesystem (VFS) layer, why not have a buffering/caching
>>layer for the filesystems to use in conjunction with the VM?
>There is hostility to this from one of the VM maintainers. He is
>concerned that separate caches were what they had before and they behaved
>badly. I think that they simply coded them wrong the time before. The
>time before, the pressure on the subcaches was uneven, with some caches
>only getting pressure if the other caches couldn't free anything, so of
>course it behaved badly.
>
>>
>>
>>Comments, suggestions, flames welcome ;)
>>
>>Shawn.
>>

--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2002-01-20 17:50:36

by Mark Hahn

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Sun, 20 Jan 2002, Hans Reiser wrote:
> Write clustering is one thing it achieves. When we flush a slum, the

sure, that's fine. when the VM tells you to write a page,
you're free to write *more*, but you certainly must give back
that particular page. afaicr, this was the conclusion
of the long-ago thread that you're referring to.

regards, mark hahn.

2002-01-20 21:19:41

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Sun, 20 Jan 2002, Hans Reiser wrote:
>
>>Write clustering is one thing it achieves.
>>
>>Flushing everything involved in a transaction ... is another thing.
>>
>
>Agreed on these points, but you really HAVE TO work towards
>flushing the page ->writepage() gets called for.
>
>Think about your typical PC, with memory in ZONE_DMA,
>ZONE_NORMAL and ZONE_HIGHMEM. If we are short on DMA pages
>we will end up calling ->writepage() on a DMA page.
>
>If the filesystem ends up writing completely unrelated pages
>and marking the DMA page in question referenced the VM will
>go in a loop until the filesystem finally gets around to
>making a page in the (small) DMA zone freeable ...
>

This is a bug in VM design, yes? It should signal that it needs the
particular page written, which probably means that it should use
writepage only when it needs that particular page written, and should
otherwise check to see if the filesystem supports something like
pressure_fs_cache(), yes?

>
>
>regards,
>
>Rik
>



2002-01-20 21:25:31

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Anton Altaparmakov wrote:

> At 11:31 20/01/02, Hans Reiser wrote:
>
>> In version 4 of reiserfs, our plan is to implement writepage such
>> that it does not write the page but instead pressures the reiser4
>> cache and marks the page as recently accessed. This is Linus's
>> preferred method of doing that.
>
>
> But why do you want to do your own cache? Any individual fs driver is
> in no position to know the overall demands on the VMM of the currently
> running kernel/user programs/etc.


So the VM system should inform it. The way to do that is to convey a
sense of cache pressure that is in proportion to the size of cache used
by that cache submanager, and then the cache submanager has to react
proportionally.

If every writepage call is considered a pressure increment, and if the page
is marked accessed, then proportional pressure is achieved.
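
A small self-contained C model of that proportionality argument (the
structures are invented, this is not kernel code): if the VM scans pages
uniformly and charges one unit of pressure to the subcache that owns each
scanned page, a subcache automatically receives pressure in proportion to
its share of memory.

#include <stdio.h>
#include <stdlib.h>

struct subcache {
    const char *name;
    long nr_pages;   /* how many pages this cache currently holds */
    long pressure;   /* units of aging pressure received          */
};

/*
 * The "VM" picks pages uniformly at random and charges one unit of
 * pressure to the owning subcache, as a writepage-style callback would.
 * Bigger caches own more pages, so they are hit more often: pressure
 * ends up proportional to size.
 */
static void apply_pressure(struct subcache *caches, int n, long scans)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += caches[i].nr_pages;

    for (long s = 0; s < scans; s++) {
        long pick = rand() % total;
        for (int i = 0; i < n; i++) {
            if (pick < caches[i].nr_pages) {
                caches[i].pressure++;
                break;
            }
            pick -= caches[i].nr_pages;
        }
    }
}

int main(void)
{
    struct subcache caches[] = {
        { "reiser4", 8000, 0 },
        { "ext3",    2000, 0 },
    };

    apply_pressure(caches, 2, 100000);
    for (int i = 0; i < 2; i++)
        printf("%-8s pages=%ld pressure=%ld\n",
               caches[i].name, caches[i].nr_pages, caches[i].pressure);
    return 0;
}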

> As such it is IMHO inefficient and I think it won't actually work due
> to VMM requiring to free specific memory and hence calling writepage
> on that specific memory so it can throw the pages away afterwards but
> in your concept writepage won't result in the page being marked clean
> and the vm has made no progress and you have just created a whole load
> of headaches for the VMM which it can't solve...
>
> The VMM should be the ONLY thing in the kernel that has full control
> of all caches in the system, and certainly all fs caches. Why you are
> putting a second cache layer underneath the VMM is beyond me. It would
> be much better to fix/expand the capabilities of the existing VMM
> which would have the benefit that all fs could benefit not just ReiserFS.
>
I agree, except that using writepage is what Linus wants, and except for
the DMA bug Rik mentions, it should work. It would be nice if the VM
maintainers were to comment writepage so that other filesystems could
know how to use it (and fix the DMA bug Rik mentions).

Hans



2002-01-20 21:26:11

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:
> Rik van Riel wrote:
> >On Sun, 20 Jan 2002, Hans Reiser wrote:

> >Agreed on these points, but you really HAVE TO work towards
> >flushing the page ->writepage() gets called for.
> >
> >Think about your typical PC, with memory in ZONE_DMA,
> >ZONE_NORMAL and ZONE_HIGHMEM. If we are short on DMA pages
> >we will end up calling ->writepage() on a DMA page.
> >
> >If the filesystem ends up writing completely unrelated pages
> >and marking the DMA page in question referenced the VM will
> >go in a loop until the filesystem finally gets around to
> >making a page in the (small) DMA zone freeable ...
>
> This is a bug in VM design, yes? It should signal that it needs the
> particular page written, which probably means that it should use
> writepage only when it needs that particular page written,

That is exactly what the VM does.

> and should otherwise check to see if the filesystem supports something
> like pressure_fs_cache(), yes?

That's incompatible with the concept of memory zones.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-20 21:29:01

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Mark Hahn wrote:

>On Sun, 20 Jan 2002, Hans Reiser wrote:
>
>>Write clustering is one thing it achieves. When we flush a slum, the
>>
>
>sure, that's fine. when the VM tells you to write a page,
>you're free to write *more*, but you certainly must give back
>that particular page. afaicr, this was the conclusion
>of the long-ago thread that you're referring to.
>
>regards, mark hahn.
>
>
>
This is bad for use with internal nodes. It simplifies version 4 a
bunch to assume that if a node is in cache, its parent is also. Not
sure what to do about it, maybe we need to copy the node. Surely we
don't want to copy it unless it is a DMA related page cleaning.

Hans


2002-01-20 21:32:41

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:
> Mark Hahn wrote:
> >On Sun, 20 Jan 2002, Hans Reiser wrote:
> >
> >>Write clustering is one thing it achieves. When we flush a slum, the
> >
> >sure, that's fine. when the VM tells you to write a page,
> >you're free to write *more*, but you certainly must give back
> >that particular page. afaicr, this was the conclusion
> >of the long-ago thread that you're referring to.
>
> This is bad for use with internal nodes. It simplifies version 4 a
> bunch to assume that if a node is in cache, its parent is also. Not
> sure what to do about it, maybe we need to copy the node. Surely we
> don't want to copy it unless it is a DMA related page cleaning.

DMA isn't a special case, this thing can happen with ANY
memory zone.

Unless of course you decide to make reiserfs unsupported
for NUMA machines...

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-20 21:35:01

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Mon, 21 Jan 2002, Hans Reiser wrote:
>
>>Rik van Riel wrote:
>>
>>>On Sun, 20 Jan 2002, Hans Reiser wrote:
>>>
>
>>>Agreed on these points, but you really HAVE TO work towards
>>>flushing the page ->writepage() gets called for.
>>>
>>>Think about your typical PC, with memory in ZONE_DMA,
>>>ZONE_NORMAL and ZONE_HIGHMEM. If we are short on DMA pages
>>>we will end up calling ->writepage() on a DMA page.
>>>
>>>If the filesystem ends up writing completely unrelated pages
>>>and marking the DMA page in question referenced the VM will
>>>go in a loop until the filesystem finally gets around to
>>>making a page in the (small) DMA zone freeable ...
>>>
>>This is a bug in VM design, yes? It should signal that it needs the
>>particular page written, which probably means that it should use
>>writepage only when it needs that particular page written,
>>
>
>That is exactly what the VM does.
>
So basically you continue to believe that one cache manager shall rule
them all, and in the darkness as to their needs, bind them.

>
>
>>and should otherwise check to see if the filesystem supports something
>>like pressure_fs_cache(), yes?
>>
>
>That's incompatible with the concept of memory zones.
>
Care to explain more?

>
>
>regards,
>
>Rik
>



2002-01-20 21:41:22

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:

> >>and should otherwise check to see if the filesystem supports something
> >>like pressure_fs_cache(), yes?
> >
> >That's incompatible with the concept of memory zones.
>
> Care to explain more?

On basically any machine we'll have multiple memory zones.

Each of those memory zones has its own free list and each
of the zones can get low on free pages independently of the
other zones.

This means that if the VM asks to get a particular page
freed, at the very minimum you need to make a page from the
same zone freeable.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-20 21:53:42

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Mon, 21 Jan 2002, Hans Reiser wrote:
>
>>>>and should otherwise check to see if the filesystem supports something
>>>>like pressure_fs_cache(), yes?
>>>>
>>>That's incompatible with the concept of memory zones.
>>>
>>Care to explain more?
>>
>
>On basically any machine we'll have multiple memory zones.
>
>Each of those memory zones has its own free list and each
>of the zones can get low on free pages independently of the
>other zones.
>
>This means that if the VM asks to get a particular page
>freed, at the very minimum you need to make a page from the
>same zone freeable.
>
>regards,
>
>Rik
>

I'll discuss with Josh tomorrow how we might implement support for that.
A clean and simple mechanism does not come to my mind immediately.

Hans

2002-01-20 22:01:25

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:

> >This means that if the VM asks to get a particular page
> >freed, at the very minimum you need to make a page from the
> >same zone freeable.
>
> I'll discuss with Josh tomorrow how we might implement support for that.
> A clean and simple mechanism does not come to my mind immediately.

Note that in order to support more reliable allocation of
contiguous memory areas (eg. for loading modules) we may
also want to add some simple form of defragmentation to
the VM.

If you really want to make life easy for the VM, ->writepage()
should work towards making the page it is called for freeable.

You probably want to do this since an easy VM is good for
performance and it would be embarrassing if reiserfs had the
worst performance under load simply due to bad interaction
with other subsystems...

kind regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-20 22:45:02

by Shawn Starr

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.


But why should each filesystem have to have a different method of
buffering/caching? That just doesn't fit the layered model of the kernel,
IMHO.

Shawn.

On Sun, 20 Jan 2002, Hans Reiser wrote:

> In version 4 of reiserfs, our plan is to implement writepage such that
> it does not write the page but instead pressures the reiser4 cache and
> marks the page as recently accessed. This is Linus's preferred method
> of doing that.
>
> Personally, I think that makes writepage the wrong name for that
> function, but I must admit it gets the job done, and it leaves writepage
> as the right name for all filesystems that don't manage their own cache,
> which is most of them.
>
> Hans
>
> Shawn wrote:
>
> >I've noticed that XFS's filesystem has a separate pagebuf_daemon to handle
> >caching/buffering.
> >
> >Why not make a kernel page/caching daemon for other filesystems to use
> >(kpagebufd) so that each filesystem can use a kernel daemon interface to
> >handle buffering and caching.
> >
> >I found that XFS's buffering/caching significantly reduced I/O load on the
> >system (with riel's rmap11b + rml's preempt patches and Andre's IDE
> >patch).
> >
> >But I've not been able to achieve the same speed results with ReiserFS :-(
> >
> >Just as we have a filesystem (VFS) layer, why not have a buffering/caching
> >layer for the filesystems to use in conjunction with the VM?
> >
> There is hostility to this from one of the VM maintainers. He is
> concerned that separate caches were what they had before and they
> behaved badly. I think that they simply coded them wrong the time
> before. The time before, the pressure on the subcaches was uneven, with
> some caches only getting pressure if the other caches couldn't free
> anything, so of course it behaved badly.
>
> >
> >
> >Comments, suggestions, flames welcome ;)
> >
> >Shawn.
> >
> >
> >
>
>
>
>
>

2002-01-20 23:12:07

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Sun, 20 Jan 2002, Shawn Starr wrote:

> But why should each filesystem have to have a different method of
> buffering/caching? that just doesn't fit the layered model of the
> kernel IMHO.

I think Hans will give up the idea once he realises the
performance implications. ;)

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-20 23:40:27

by Shawn Starr

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.


My worry is this. If we have different filesystems having their own page
buffer/caching daemons we'll definitely introduce race conditions.

Say I have 2 hard drives with ReiserFS and EXT3 and I'm copying data between
the two, and each of them has its own daemon; it's going to get pretty
messy, no?


On Sun, 20 Jan 2002, Rik van Riel wrote:

> On Sun, 20 Jan 2002, Shawn Starr wrote:
>
> > But why should each filesystem have to have a different method of
> > buffering/caching? that just doesn't fit the layered model of the
> > kernel IMHO.
>
> I think Hans will give up the idea once he realises the
> performance implications. ;)
>
> Rik
> --
> "Linux holds advantages over the single-vendor commercial OS"
> -- Microsoft's "Competing with Linux" document
>
> http://www.surriel.com/ http://distro.conectiva.com/
>
>
>

2002-01-20 23:49:37

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Sun, 20 Jan 2002, Shawn Starr wrote:

> My worry is this. If we have different filesystems having their own page
> buffer/caching daemons we'll definitely introduce race conditions.
>
> Say have 2 hard drives with ReiserFS and EXT3 and I'm copying data between
> the two and each of them has their own daemons its going to get pretty
> messy no?

Each of the "cache daemons" will react differently to VM
pressure, meaning the system will most definitely get out
of balance.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 00:10:38

by Matt

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, Jan 21, 2002 at 12:49:27AM +0300, Hans Reiser wrote:
> Rik van Riel wrote:

[snip snip]

>> On basically any machine we'll have multiple memory zones.

>> Each of those memory zones has its own free list and each of the
>> zones can get low on free pages independently of the other zones.

>> This means that if the VM asks to get a particular page freed, at
>> the very minimum you need to make a page from the same zone
>> freeable.

>> regards,

>> Rik


> I'll discuss with Josh tomorrow how we might implement support for that.
> A clean and simple mechanism does not come to my mind immediately.

> Hans

I know this sounds semi-evil, but can't you just drop another non-dirty
page and do a copy if you need the page you have been asked to write out?
Because if you have no non-dirty pages around you'd probably have to drop
the page anyway at some stage...
matt

2002-01-21 00:32:41

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Sun, 20 Jan 2002, Shawn Starr wrote:
>
>>But why should each filesystem have to have a different method of
>>buffering/caching? that just doesn't fit the layered model of the
>>kernel IMHO.
>>
>
>I think Hans will give up the idea once he realises the
>performance implications. ;)
>
>Rik
>

Rik, what reiser4 does is take a slum (a slum is a contiguous, in the
tree order, set of dirty buffers), and just before flushing it to disk
we squeeze the entire slum as far to the left as we can, encrypt any
parts of it that we need to encrypt, and assign block numbers to it.

Tree balancing normally has a tradeoff between memory copies performed
on average per insertion, and tightness in packing nodes. Squeezing in
response to memory pressure greatly optimizes the number of nodes we are
packed into while performing only one memory copy, just before flush
time, for that optimization. It is MUCH more efficient. Block allocation
a la XFS can be much more optimal if done just before flushing.
Encryption just before flushing rather than with every modification to a
file is also much more efficient. Committing transactions also has a
complex need to be memory pressure driven (complex enough that I won't
describe it here).

So, really, memory pressure needs to push a whole set of events in a well
designed filesystem. Thinking that you can just pick a page and write it
and write no other pages, all without understanding the optimizations of
the filesystem you write to, is simplistic.
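
To make that concrete, here is a toy, self-contained C sketch of doing the
expensive work at flush time (this is not reiser4's algorithm; the
structures are invented): items are dirtied cheaply as they arrive, and only
when pressure forces a flush are they packed to the left and given
contiguous block numbers in one pass.

#include <stdio.h>

#define MAX_ITEMS 16

/* Toy slum: a run of dirty items, possibly with holes from deletions. */
struct slum {
    int used[MAX_ITEMS];     /* 1 = slot holds a dirty item, 0 = hole */
    int blocknr[MAX_ITEMS];
};

/* Cheap insertion: mark a slot dirty, no packing, no block allocation. */
static void dirty_item(struct slum *s, int slot)
{
    s->used[slot] = 1;
}

/*
 * Flush time (driven by memory pressure): squeeze items left to remove
 * holes, then hand out contiguous block numbers in a single pass.
 */
static void flush_slum(struct slum *s, int first_block)
{
    int dst = 0;

    for (int src = 0; src < MAX_ITEMS; src++)
        if (s->used[src])
            s->used[dst++] = 1;             /* pack left */
    for (int i = dst; i < MAX_ITEMS; i++)
        s->used[i] = 0;

    for (int i = 0; i < dst; i++) {
        s->blocknr[i] = first_block + i;    /* contiguous allocation */
        printf("item %d -> block %d\n", i, s->blocknr[i]);
    }
}

int main(void)
{
    struct slum s = { { 0 } };

    dirty_item(&s, 1);
    dirty_item(&s, 4);
    dirty_item(&s, 5);
    flush_slum(&s, 1000);    /* one squeeze + allocation, just before I/O */
    return 0;
}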

Suppose we do what you ask, and always write the page (as well as some
other pages) to disk. This will result in the filesystem cache as a whole
receiving more pressure than other caches that only write one page in
response to pressure. This is unbalanced, leads to some caches having
shorter average page lifetimes than others, and it is therefore
suboptimal. Yes?



Hans

2002-01-21 00:49:01

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:

> Suppose we do what you ask, and always write the page (as well as some
> other pages) to disk. This will result in the filesystem cache as a
> whole receiving more pressure than other caches that only write one
> page in response to pressure. This is unbalanced, leads to some
> caches having shorter average page lifetimes than others, and it is
> therefore suboptimal. Yes?

If your ->writepage() writes pages to disk it just means
that reiserfs will be able to clean its pages faster than
the other filesystems.

This means the VM will not call reiserfs ->writepage() as
often as for the other filesystems, since more of the
pages it finds will already be clean and freeable.

I guess the only way to unbalance the caches is by actually
freeing pages in ->writepage, but I don't see any real reason
why you'd want to do that...

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 00:48:31

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Sun, 20 Jan 2002, Shawn Starr wrote:
>
>>My worry is this. If we have different filesystems having their own page
>>buffer/caching daemons we'll definitely introduce race conditions.
>>
>>Say have 2 hard drives with ReiserFS and EXT3 and I'm copying data between
>>the two and each of them has their own daemons its going to get pretty
>>messy no?
>>
>
>Each of the "cache daemons" will react differently to VM
>pressure, meaning the system will most definitely get out
>of balance.
>
>regards,
>
>Rik
>
Not if you provide a proper design of a master cache manager. Really, all
you have to do is have the subcache managers designed to free the same
number of pages on average in response to pressure, and to pressure them in
proportion to their size, and it is pretty simple for the VM.

Now of course, we can talk about all sorts of possible refinements of this,
such as: perhaps for some caches pressure in proportion to the square of
their size is appropriate, or perhaps for some caches their pressure should
be some multiple of some other cache's pressure (suppose the cost of
fetching a page from disk is different from fetching a page over a network,
and you have two different caches of pages, one from a disk backing store
and one of pages from a network device backing store; then it IS optimal to
keep the pages from the slower device longer). I would suggest that such
refinements go in later though.

Right now, we just want a simple interface for implementing the pressure
response for Reiser4. More complex things can wait until after we ship 4.0,
and can luxuriate in multitudinous benchmarks.

Hans


2002-01-21 00:54:01

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:

> Not if you provide a proper design of a master cache manager.
> Really, all you have to do is have the subcache managers designed to
> free the same number of pages on average in response to pressure, and
> to pressure them in proportion to their size, and it is pretty simple
> for VM.

I take it you're volunteering to bring ext3, XFS, JFS,
JFFS2, NFS, the inode & dentry cache and smbfs into
shape so reiserfs won't get unbalanced ?

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 01:01:52

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Matt wrote:

>On Mon, Jan 21, 2002 at 12:49:27AM +0300, Hans Reiser wrote:
>
>>Rik van Riel wrote:
>>
>
>[snip snip]
>
>>>On basically any machine we'll have multiple memory zones.
>>>
>
>>>Each of those memory zones has its own free list and each of the
>>>zones can get low on free pages independently of the other zones.
>>>
>
>>>This means that if the VM asks to get a particular page freed, at
>>>the very minimum you need to make a page from the same zone
>>>freeable.
>>>
>
>>>regards,
>>>
>
>>>Rik
>>>
>
>
>>I'll discuss with Josh tomorrow how we might implement support for that.
>> A clean and simple mechanism does not come to my mind immediately.
>>
>
>>Hans
>>
>
>i know this sounds semi-evil, but can't you just drop another non
>dirty page and do a copy if you need the page you have been asked to
>write out? because if you have no non dirty pages around you'd
>probably have to drop the page anyway at some stage..
>
> matt
>
Yes, but it is seriously suboptimal to do copies if not really needed.
So, if we really must, then yes, but must we? Would be best if VM told us
if we really must write that page.

Hans

2002-01-21 01:05:45

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Mon, 21 Jan 2002, Hans Reiser wrote:
>
>>Suppose we do what you ask, and always write the page (as well as some
>>other pages) to disk. This will result in the filesystem cache as a
>>whole receiving more pressure than other caches that only write one
>>page in response to pressure. This is unbalanced, leads to some
>>caches having shorter average page lifetimes than others, and it is
>>therefore suboptimal. Yes?
>>
>
>If your ->writepage() writes pages to disk it just means
>that reiserfs will be able to clean its pages faster than
>the other filesystems.
>
The logical extreme of this is that no write caching should be done at
all, only read caching?

>
>
>This means the VM will not call reiserfs ->writepage() as
>often as for the other filesystems, since more of the
>pages it finds will already be clean and freeable.
>
>I guess the only way to unbalance the caches is by actually
>freeing pages in ->writepage, but I don't see any real reason
>why you'd want to do that...
>
>regards,
>
>Rik
>
It would unbalance the write cache, not the read cache.

Hans

2002-01-21 01:12:42

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Mon, 21 Jan 2002, Hans Reiser wrote:
>
>>Not if you provide a proper design of a master cache manager.
>>Really, all you have to do is have the subcache managers designed to
>>free the same number of pages on average in response to pressure, and
>>to pressure them in proportion to their size, and it is pretty simple
>>for VM.
>>
>
>I take it you're volunteering to bring ext3, XFS, JFS,
>JFFS2, NFS, the inode & dentry cache and smbfs into
>shape so reiserfs won't get unbalanced ?
>
>regards,
>
>Rik
>
If they use writepage(), then the job of balancing cache cleaning is done;
we just use writepage as their pressuring mechanism. Any FS that wants to
optimize cleaning can implement a VFS method, and any FS that wants to
optimize freeing can implement a VFS method, and all others can use the
current generic VM mechanisms.

Hans

2002-01-21 01:22:24

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:
> Rik van Riel wrote:

> >If your ->writepage() writes pages to disk it just means
> >that reiserfs will be able to clean its pages faster than
> >the other filesystems.
>
> the logical extreme of this is that no write caching should be done at
> all, only read caching?

You know that's bad for write clustering ;)))

> >This means the VM will not call reiserfs ->writepage() as
> >often as for the other filesystems, since more of the
> >pages it finds will already be clean and freeable.
> >
> >I guess the only way to unbalance the caches is by actually
> >freeing pages in ->writepage, but I don't see any real reason
> >why you'd want to do that...
>
> It would unbalance the write cache, not the read cache.

Many workloads tend to read pages again after they've written
them, so throwing away pages immediately doesn't seem like a
good idea.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 01:26:14

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

[snip]
At 00:57 21/01/02, Hans Reiser wrote:
[snip]
> Would be best if VM told us if we really must write that page.

In theory the VM should never call writepage unless the page must be written
out...

But I agree with you that it would be good to be able to distinguish the
two cases. I have been thinking about this a bit in the context of NTFS TNG
but I think that it would be better to have a generic solution rather than
every fs does their own copy of the same thing. I envisage that there is a
flush daemon which just walks around writing pages to disk in the
background (there could be one per fs, or a generic one which filesystems
register with; at their option they could have their own of course) in order
to keep the number of dirty pages low and in order to minimize data loss in
the event of system/power failure.

This daemon requires several interfaces though, with regard to journalling
fs. The daemon should have an interface where the fs can say "commit pages
in this list NOW and do not return before done", also a barrier operation
would be required in journalling context. A transactions interface would be
ideal, where the fs can submit whole transactions consisting of writing out
a list of pages and optional write barriers; e.g. write journal pages x, y,
z, barrier, write metadata, perhaps barrier, finally write data pages a, b,
c. Simple file systems could just not bother at all and rely on the flush
daemon calling the fs to write the pages.
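
Sketched as a strawman, such an interface could look roughly like this in C
(none of these types or names exist anywhere; they are only meant to make
the "list of pages plus optional barriers" idea concrete):

#include <stddef.h>

struct page;                       /* opaque in this sketch */

enum flush_op_type {
    FLUSH_WRITE_PAGES,             /* write this list of pages        */
    FLUSH_BARRIER,                 /* all earlier ops complete first  */
};

struct flush_op {
    enum flush_op_type type;
    struct page **pages;           /* used when type == FLUSH_WRITE_PAGES */
    size_t nr_pages;
};

/* Hypothetical daemon entry points a journalling fs might call. */
struct flush_daemon_ops {
    /* "commit pages in this list NOW and do not return before done" */
    int (*commit_sync)(struct page **pages, size_t nr_pages);

    /* submit a whole transaction: journal pages, barrier, metadata, data */
    int (*submit_transaction)(const struct flush_op *ops, size_t nr_ops);
};

/* Example transaction: journal x, y, z -> barrier -> metadata -> data. */
static int example_commit(const struct flush_daemon_ops *d,
                          struct page **journal, size_t nj,
                          struct page **meta, size_t nm,
                          struct page **data, size_t nd)
{
    struct flush_op ops[] = {
        { FLUSH_WRITE_PAGES, journal, nj },
        { FLUSH_BARRIER,     NULL,    0  },
        { FLUSH_WRITE_PAGES, meta,    nm },
        { FLUSH_BARRIER,     NULL,    0  },
        { FLUSH_WRITE_PAGES, data,    nd },
    };

    return d->submit_transaction(ops, sizeof(ops) / sizeof(ops[0]));
}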

Obviously when this daemon writes pages the pages will continue being
there. OTOH, if the VM calls write page because it needs to free memory
then writepage must write and clean the page.

So, yes, a parameter to writepage would be great in this context.
Alternatively we could have ->writepage and ->flushpage (or pick your
favourite two names), one being an optional writeout and one a forced
writeout... I like the parameter-to-writepage idea better, but in the end it
doesn't really matter that much, I would suspect...
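
Either variant is easy to write down; a hypothetical sketch of both forms
(the names just spell out the suggestion above and are not meant to match
the kernel's actual operations):

struct page;   /* opaque for this sketch */

/* Variant 1: one entry point, with the caller's intent as a parameter. */
enum writepage_reason {
    WP_BACKGROUND,   /* flush daemon keeping the dirty count low    */
    WP_RECLAIM,      /* VM must free memory: clean this exact page  */
};
typedef int (*writepage_fn)(struct page *page, enum writepage_reason why);

/* Variant 2: two distinct methods, optional versus forced writeout. */
struct cache_page_ops {
    int (*writepage)(struct page *page);   /* may defer, opportunistic */
    int (*flushpage)(struct page *page);   /* must write and clean     */
};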

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2002-01-21 01:30:24

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Mon, 21 Jan 2002, Hans Reiser wrote:
>
>>Rik van Riel wrote:
>>
>
>>>If your ->writepage() writes pages to disk it just means
>>>that reiserfs will be able to clean its pages faster than
>>>the other filesystems.
>>>
>>the logical extreme of this is that no write caching should be done at
>>all, only read caching?
>>
>
>You know that's bad for write clustering ;)))
>
>>>This means the VM will not call reiserfs ->writepage() as
>>>often as for the other filesystems, since more of the
>>>pages it finds will already be clean and freeable.
>>>
>>>I guess the only way to unbalance the caches is by actually
>>>freeing pages in ->writepage, but I don't see any real reason
>>>why you'd want to do that...
>>>
>>It would unbalance the write cache, not the read cache.
>>
>
>Many workloads tend to read pages again after they've written
>them, so throwing away pages immediately doesn't seem like a
>good idea.
>

I think I must have said free when I meant clean, and this naturally
confused you.

writepage() cleans pages, which is sometimes necessary for freeing them,
but it does not free them itself.

The one place where we would free them is when we repack slums before
writing them. In this case, an empty node is not going to get accessed
again, so it should be freed.

Hans


2002-01-21 01:40:36

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:
> Rik van Riel wrote:

> >I take it you're volunteering to bring ext3, XFS, JFS,
> >JFFS2, NFS, the inode & dentry cache and smbfs into
> >shape so reiserfs won't get unbalanced ?

> If they use writepage(), then the job of balancing cache cleaning is
> done, we just use writepage as their pressuring mechanism.
> Any FS that wants to optimize cleaning can implement a VFS method, and
> any FS that wants to optimize freeing can implement a VFS method, and
> all others can use their generic VM current mechanisms.

It seems you're still assuming that different filesystems will
all see the same kind of load.

Freeing cache (or at least, applying pressure) really is a job
for the VM because none of the filesystems will have any idea
exactly how busy the other filesystems are.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 01:41:36

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:

> I think I must have said free when I meant clean, and this naturally
> confused you.
>
> writepage() cleans pages, which is sometimes necessary for freeing them,
> but it does not free them itself.
>
> The one place where we would free them is when we repack slums before
> writing them. In this case, an empty node is not going to get accessed
> again, so it should be freed.

Agreed.

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 02:28:30

by Shawn Starr

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.


On Mon, 21 Jan 2002, Anton Altaparmakov wrote:

> [snip]
> At 00:57 21/01/02, Hans Reiser wrote:
> [snip]
> > Would be best if VM told us if we really must write that page.
>
> In theory the VM should never call writepage unless the page must be written
> out...
>
> But I agree with you that it would be good to be able to distinguish the
> two cases. I have been thinking about this a bit in the context of NTFS TNG
> but I think that it would be better to have a generic solution rather than
> every fs does their own copy of the same thing. I envisage that there is a
> flush daemon which just walks around writing pages to disk in the
> background (there could be one per fs, or a generic one which fs register
> with, at their option they could have their own of course) in order to keep
> the number of dirty pages low and in order to minimize data loss on the
> event of system/power failure.
>
> This daemon requires several interfaces though, with regards to journalling
> fs. The daemon should have an interface where the fs can say "commit pages
> in this list NOW and do not return before done", also a barrier operation
> would be required in journalling context. A transactions interface would be
> ideal, where the fs can submit whole transactions consisting of writing out
> a list of pages and optional write barriers; e.g. write journal pages x, y,
> z, barrier, write metadata, perhaps barrier, finally write data pages a, b,
> c. Simple file systems could just not bother at all and rely on the flush
> daemon calling the fs to write the pages.
>
> Obviously when this daemon writes pages the pages will continue being
> there. OTOH, if the VM calls write page because it needs to free memory
> then writepage must write and clean the page.
>

If they are dirty and written immediately to the disk they can be cleaned
from the queue. It would be nice if there was some way to have a checksum
verify the data was written back, then wipe it from the queue.

As an example: 5 operations requested, 2 already in queue.

In queue) DIRTY write to disk (this task has been in the queue for a
while)

In queue) not 'old' memory but must be written to disk

pending queue:

1) read operation
2) read operation
3) write operation
4) write operation

The daemon should re-sort by priority: write the dirty pages to disk, then
write any other pages that are left in the queue, then get to the read pages.


Notes:

If there is only one operation in the queue (say a write) and nothing else
comes along, then the daemon should force-write the data back to disk
after a timeout period (the memory in the slot becomes dirty).

If there are too many tasks in the queue and another one requires more
memory than what's left in the buffer/cache, the daemon could store the
request in swap memory and put it in the queue; if the request is a write
request it would still have more priority than any read requests and get
completed quickly, allowing the remaining queue events to complete.

Example:

ReiserFS:
Operation A. Write (10K)
Operation B. Read (200K)
Operation C. Write (160K)


XFS:
Operation A. Read (63K)
Operation B. Read (3k)
Operation C. Write (10K)


EXT3:
Operation A. Write (290K)
Operation B. Write (90K)
Operation C. Read (3k)

The kpagebuf daemon (or whatever name) would get all these requests and sort
out what needs to be done first. As long as there's buffer/cache memory
free, the write operations would be done as fast as possible, verified by
some checksum and purged from the queue. If there's no cache/buffer memory
free, then all write queues, regardless of being in swap or cache/buffer,
need to be written to disk.

So:
kpagebuf queue (total available buffer/cache memory is say 512K)

EXT3 Write (290K)
ReiserFS Write (160K)
ReiserFS Write (10K)
XFS Write (10K)
EXT3 Write (90K) - Goes in swap because total > 512K (Dirty x2 state)
ReiserFS Read (200K) - Swap (dirty x2)
XFS Read (63K) - Swap (dirty x2)
XFS Read (3K) - Swap (dirty x2)
EXT3 Read (3K) - Swap (dirty x2)

* The daemon would check, in order of filesystem registration, whose
requests should be in the read queue first.

* The daemon should maximize the amount of memory stored in buffer/cache to
try to prevent write requests from having to go into swap.

In the above queue, we have a lot of read operations and one write
operation in swap. Clean out the write operations since they are now dirty
(because there's no room for more operations in the buffer/cache). Move
the swapped write operation to the top of the queue and get rid of it.
Move the read operations from swap to queue since there is room again. **
NOTE ** because those read requests are now dirty they MUST be dealt with
or they'll get stuck in the queue with more write requests overtaking
them.

Maybe I've lost it but that's how I see it ;)

Shawn.

2002-01-21 09:14:06

by Horst von Brand

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Hans Reiser <[email protected]> said:
> Rik van Riel wrote:

[...]

> >On basically any machine we'll have multiple memory zones.
> >
> >Each of those memory zones has its own free list and each
> >of the zones can get low on free pages independently of the
> >other zones.
> >
> >This means that if the VM asks to get a particular page
> >freed, at the very minimum you need to make a page from the
> >same zone freeable.

> I'll discuss with Josh tomorrow how we might implement support for that.
> A clean and simple mechanism does not come to my mind immediately.

Free the page you were asked to free, optionally free anything else you
might want to. Anything else sounds like a gross violation of layering to
me.

The other way would be for the VM to say "Free at least <n> pages of this
<list>", but that gives a complicated API.
--
Horst von Brand http://counter.li.org # 22616

2002-01-21 09:21:56

by Horst von Brand

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Matt <[email protected]> said:

[...]

> i know this sounds semi-evil, but can't you just drop another non
> dirty page and do a copy if you need the page you have been asked to
> write out? because if you have no non dirty pages around you'd
> probably have to drop the page anyway at some stage..

Better not. "Get rid of A", OK, copied to B. "Get rid of B", OK, copied to
C. Lather. Rinse. Repeat.
--
Horst von Brand http://counter.li.org # 22616

2002-01-21 11:14:37

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Mon, 21 Jan 2002, Hans Reiser wrote:
>
>>Rik van Riel wrote:
>>
>
>>>I take it you're volunteering to bring ext3, XFS, JFS,
>>>JFFS2, NFS, the inode & dentry cache and smbfs into
>>>shape so reiserfs won't get unbalanced ?
>>>
>
>>If they use writepage(), then the job of balancing cache cleaning is
>>done, we just use writepage as their pressuring mechanism.
>>Any FS that wants to optimize cleaning can implement a VFS method, and
>>any FS that wants to optimize freeing can implement a VFS method, and
>>all others can use their generic VM current mechanisms.
>>
>
>It seems you're still assuming that different filesystems will
>all see the same kind of load.
>

I don't understand this comment.

>
>
>Freeing cache (or at least, applying pressure) really is a job
>for the VM because none of the filesystems will have any idea
>exactly how busy the other filesystems are.
>

I fully agree, and it is the point I have been making (poorly, since it has
not been communicated) for as long as I have been discussing it with you.
The VM should apply pressure to the caches. It should define an interface
that subcache managers act in response to. The larger a subcache is, the
greater the percentage of total memory pressure it should receive. The
amount of memory pressure per unit of time should be determined by the VM.

Note that there are two kinds of pressure, cleaning pressure and freeing
pressure. I think that the structure appropriate for delegating them is the
same, but someone may correct me.

Also note that a unit of pressure is a unit of aging, not a unit of
freeing/cleaning. The application of pressure does not necessarily free a
page; it merely ages the subcache, which might or might not free a page
depending on how much use is being made of what is in the subcache.

Thus, a subcache receives pressure to grow from somewhere (things like
write() in the case of ReiserFS), and pressure to shrink from the VM, and
the VM exerts however much total pressure on all the subcaches is required
to not run out of memory.

The mechanism of going through pages, seeing what subcache they belong to,
and pressuring that subcache, is a decent one (if a bit CPU cache expensive)
for obtaining linearly proportional cache pressure. Since code inertia
favors it, let's use it for now.
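
A self-contained C toy of the aging point (clock-style; everything here is
invented for illustration): each unit of pressure ages one entry, and only
entries that were not referenced since the last pass are actually given
back, so the pressure received is not the same as the pages yielded.

#include <stdbool.h>
#include <stdio.h>

#define CACHE_PAGES 8

/* Toy subcache entry: a unit of pressure ages it, it doesn't free it. */
struct entry {
    bool present;
    bool referenced;   /* set again whenever the fs touches the page */
};

/*
 * Apply 'units' of pressure: each unit ages one entry.  Recently
 * referenced entries just lose their referenced bit and survive; only
 * cold entries are given back.  The number of pages yielded therefore
 * depends on how hot the subcache is, not on the pressure alone.
 */
static int apply_aging_pressure(struct entry *cache, int n, int units)
{
    static int hand;
    int freed = 0;

    for (int u = 0; u < units; u++) {
        struct entry *e = &cache[hand];

        hand = (hand + 1) % n;
        if (!e->present)
            continue;
        if (e->referenced)
            e->referenced = false;   /* aged, kept */
        else {
            e->present = false;      /* cold, yielded */
            freed++;
        }
    }
    return freed;
}

int main(void)
{
    struct entry cache[CACHE_PAGES];

    for (int i = 0; i < CACHE_PAGES; i++)
        cache[i] = (struct entry){ true, i < 4 };   /* half hot, half cold */

    printf("freed %d of %d after %d units of pressure\n",
           apply_aging_pressure(cache, CACHE_PAGES, CACHE_PAGES),
           CACHE_PAGES, CACHE_PAGES);
    return 0;
}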

Hans



2002-01-21 12:12:36

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:

> >It seems you're still assuming that different filesystems will
> >all see the same kind of load.
>
> I don't understand this comment.

[snip]

> The VM should apply pressure to the caches. It should define an
> interface that subcache managers act in response to. The larger a
> subcache is, the more percentage of total memory pressure it should
> receive.

Wrong. If one filesystem is actively being used (eg. kernel
compile) and the other filesystem's cache isn't being used
(this one held the tarball of the kernel source) then the
cache which is being used actively should receive less
pressure than the cache which doesn't hold any active pages.

We really want to evict the kernel tarball from memory while
keeping the kernel source and object files resident.

This is exactly the reason why each filesystem cannot manage
its own cache ... it doesn't know anything about what the
system as a whole is doing.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 13:46:20

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Mon, 21 Jan 2002, Hans Reiser wrote:
>
>>>It seems you're still assuming that different filesystems will
>>>all see the same kind of load.
>>>
>>I don't understand this comment.
>>
>
>[snip]
>
>>The VM should apply pressure to the caches. It should define an
>>interface that subcache managers act in response to. The larger a
>>subcache is, the more percentage of total memory pressure it should
>>receive.
>>
>
>Wrong. If one filesystem is actively being used (eg. kernel
>compile) and the other filesystem's cache isn't being used
>(this one held the tarball of the kernel source) then the
>cache which is being used actively should receive less
>pressure than the cache which doesn't hold any active pages.
>

Pressure received is not equal to pages yielded. Think of pressure as a
request to age on average one page. Not a request to free on average
one page. The pressure received should be in proportion to the
percentage of total memory pages in use by the subcache. The number of
pages yielded should depend on the interplay of pressure received and
accesses made.

Does this make more sense now?

>
>
>We really want to evict the kernel tarball from memory while
>keeping the kernel source and object files resident.
>
If your example is based on untarring a kernel tarball from one
filesystem to another, it is doomed, because you probably want to
drop-behind the tarball contents.

I think I know what you mean though, so let's use an example of one
filesystem containing the files of a user who logs in once a week mostly
to check his email that he doesn't get very often, and the other
contains the files of a programmer who recompiles every 5 minutes. Is
this what you intend? If so, I think the mechanism described above
handles it.

Perhaps writepage isn't the cleanest way to implement it though, maybe
the page aging mechanism is where the call to the subcache belongs.

>
>
>This is exactly the reason why each filesystem cannot manage
>its own cache ... it doesn't know anything about what the
>system as a whole is doing.
>
Each filesystem can be told how much aging pressure to exert on itself.
The VM tracks what the system as a whole is doing, and the filesystem
tracks what its subcache is doing, and the filesystem listens to the VM
and acts accordingly.

Hans


2002-01-21 13:55:00

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Hans Reiser wrote:

> Pressure received is not equal to pages yielded. ... The number of
> pages yielded should depend on the interplay of pressure received and
> accesses made.
>
> Does this make more sense now?

Nice recipe for total chaos. You _know_ each filesystem will
behave differently in this respect, it'll be impossible to get
the VM balanced in this way...

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 14:11:43

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Mon, 21 Jan 2002, Hans Reiser wrote:
>
>>Pressure received is not equal to pages yielded. ... The number of
>>pages yielded should depend on the interplay of pressure received and
>>accesses made.
>>
>>Does this make more sense now?
>>
>
>Nice recipe for total chaos. You _know_ each filesystem will
>behave differently in this respect, it'll be impossible to get
>the VM balanced in this way...
>
>Rik
>
No, I don't _know_ that. Just because it got screwed up previously
doesn't mean that no one can ever get it right.

I think there should be well commented code with well commented
templates and examples, and persons who abuse the interface should be
handled like persons who abuse all the other interfaces.

Optimal is optimal, and if VM's default is seriously suboptimal for a
particular backing store then it simply shouldn't be used for that
backing store. Write clustering, slum squeezing, block allocating,
encrypting, committing transactions, all of these are serious things
that should be pushed by memory pressure from a VM that delegates. This
issue is no different from a human boss who refuses to delegate because he
doesn't want to lose control, and doesn't have the managerial skill that
gives him the confidence that he can delegate well; so nothing gets done
well, because he doesn't have the time to optimize all of the subordinates
working for him as well as they could optimize themselves. Rik, your plan
won't scale. Sure, you have the time needed to create one example template,
but you cannot possibly create a single VM well optimized for every cache in
the kernel. They each have different needs, different properties, different
filesystem layouts.

Hans


2002-01-21 15:33:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Hans Reiser <[email protected]> writes:

> >
> >That is exactly what the VM does.
> >
> So basically you continue to believe that one cache manager shall rule them all,
>
> and in the darkness as to their needs, bind them.

Hans any other case generally sucks, and at best works well until the
VM changes and then breaks. The worst VM's I have seen are the home
spun cache management routines for compressing filesystems. So
trying for a generic solution is very good.

I suspect it is easier to work out the semantics needed for reiserfs and
xfs to do delayed writes in the page cache than to work out the
semantics needed for having two competing VMs...

Eric




2002-01-21 15:40:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Hans Reiser <[email protected]> writes:

> Mark Hahn wrote:
>
> >On Sun, 20 Jan 2002, Hans Reiser wrote:
> >
> >> Write clustering is one thing it achieves. When we flush a slum, the
> >
> >sure, that's fine. when the VM tells you to write a page,
> >you're free to write *more*, but you certainly must give back
> > that particular page. afaicr, this was the conclusion of the long-ago thread
> > that you're referring to.
> >
> >regards, mark hahn.
> >
> >
> >
> This is bad for use with internal nodes. It simplifies version 4 a bunch to
> assume that if a node is in cache, its parent is also. Not sure what to do
> about it, maybe we need to copy the node. Surely we don't want to copy it
> unless it is a DMA related page cleaning.

Increment the count on the parent page, and don't decrement it until
the child goes away. This might need a notification from
page_cache_release so you can decrement the count at the
appropriate time. But internal nodes are ``meta'' data which has
always had special freeing rules.
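
Roughly, the pinning scheme could look like the toy code below; the structs
and helpers are invented for illustration and are not reiserfs or kernel
interfaces, the point is only that the child holds a reference on its parent
for its whole lifetime.

    /* Toy illustration only; not kernel or reiserfs code. */
    struct cache_node {
            int count;                      /* > 0 means "cannot be reclaimed" */
            struct cache_node *parent;      /* internal node above us, or NULL */
    };

    static void node_get(struct cache_node *n) { n->count++; }
    static void node_put(struct cache_node *n) { n->count--; }

    /* Called when a child node enters the cache under 'parent'. */
    static void pin_parent(struct cache_node *child, struct cache_node *parent)
    {
            child->parent = parent;
            if (parent)
                    node_get(parent);       /* parent now outlives the child */
    }

    /* Called when the child is finally released (page_cache_release time). */
    static void unpin_parent(struct cache_node *child)
    {
            if (child->parent)
                    node_put(child->parent);
            child->parent = NULL;
    }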

Eric

2002-01-21 17:23:34

by Chris Mason

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.



On Monday, January 21, 2002 05:07:30 PM +0300 Hans Reiser
<[email protected]> wrote:

> Rik van Riel wrote:
>
>> On Mon, 21 Jan 2002, Hans Reiser wrote:
>>
>>> Pressure received is not equal to pages yielded. ... The number of
>>> pages yielded should depend on the interplay of pressure received and
>>> accesses made.
>>>

Ah, once the FS starts counting accesses, we get in trouble. The FS should
strive to know only these 3 things:

How to read useful data into a page
How to flush a dirty page
How to free a pinned page
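
(In 2.4 terms those three map loosely onto the per-mapping operations
readpage, writepage and releasepage. A toy table, with invented names and
prototypes rather than the real address_space_operations ones, just to show
how small the FS-facing surface would stay:)

    /* Illustrative only; invented types, not the real 2.4 structures. */
    struct toy_page;

    struct toy_fs_page_ops {
            int (*readpage)(struct toy_page *page);    /* read useful data into a page */
            int (*writepage)(struct toy_page *page);   /* flush a dirty page           */
            int (*releasepage)(struct toy_page *page); /* drop a pinned page's private
                                                          state so the VM can free it  */
    };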

The VM records everything else, including how often a page is accessed, and
which pages should be freed in response to memory pressure. Of course, the
FS might have details on many more things such as write clustering, delayed
allocations, or which pinned pages require tons of extra work to write out.
This fools us into thinking the FS might be the best place to decide how to
react under memory pressure, leading to a little VM in each FS.

Everything gets cleaner if we push this info up to the VM in a generic
fashion, instead of trying to push bits of the VM down into each
filesystem.
The FS should have no idea of what memory pressure is, down that path lies
pain, suffering, and deadlocks against the journal ;-)

If the VM is telling the FS to write a pinned page when there are unpinned
pages that can be written with less cost, then we need to give the VM
better hints about the actual cost of writing the pinned page.

For periodic group flushes (delayed allocation, journal commits, etc), we
need better throttling on dirty pages instead of just dirty buffers like we
do now.

I'm not delusional enough to think this will make all the vm<->journal
nastiness go away, but it hopefully should be less painful than adding
extra VM intelligence into each FS.

-chris

2002-01-21 17:51:25

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Chris Mason wrote:

>
>On Monday, January 21, 2002 05:07:30 PM +0300 Hans Reiser
><[email protected]> wrote:
>
>>Rik van Riel wrote:
>>
>>>On Mon, 21 Jan 2002, Hans Reiser wrote:
>>>
>>>>Pressure received is not equal to pages yielded. ... The number of
>>>>pages yielded should depend on the interplay of pressure received and
>>>>accesses made.
>>>>
>
>Ah, once the FS starts counting accesses, we get in trouble. The FS should
>strive to know only these 3 things:
>
>How to read useful data into a page
>How to flush a dirty page
>How to free a pinned page
>
You say this with all the dogma of someone working with code that
currently does things a particular way. You provide no reasons though.

>
>
>The VM records everything else, including how often a page is accessed, and
>which pages should be freed in response to memory pressure. Of course, the
>FS might have details on many more things such as write clustering, delayed
>allocations, or which pinned pages require tons of extra work to write out.
>This fools us into thinking the FS might be the best place to decide how to
>react under memory pressure, leading to a little VM in each FS.
>
>Everything gets cleaner if we push this info up to the VM in a generic
>fashion, instead of trying to push bits of the VM down into each
>filesystem.
>The FS should have no idea of what memory pressure is, down that path lies
>pain, suffering, and deadlocks against the journal ;-)
>
>If the VM is telling the FS to write a pinned page when there are unpinned
>pages that can be written with less cost, then we need to give the VM
>better hints about the actual cost of writing the pinned page.
>

Oh, this means a much more complicated interface, and it means that the
VM must take into account the optimizations of each and every
filesystem. Are you sure this isn't an unmaintainable centralized hell?
In practice, will it really mean that optimizations specific to a
particular filesystem will get ignored, because there will be too many
of them to keep up with, and they will clutter each other up if
implemented in one piece of code? Will programmers really be able to
experiment?

>
>
>For periodic group flushes (delayed allocation, journal commits, etc), we
>need better throttling on dirty pages instead of just dirty buffers like we
>do now.
>
>I'm not delusional enough to think this will make all the vm<->journal
>nastiness go away, but it hopefully should be less painful than adding
>extra VM intelligence into each FS.
>
>-chris
>
>
>
Say more about what you mean by better throttling on dirty pages, and
how that meets the needs of slum squeezing, transaction committing,
write clustering, etc. Last I remember, the generic write clustering
code in VM didn't even understand packing localities.;-)

Hans


2002-01-21 19:13:32

by Shawn Starr

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Nobody wants to comment on this? :(

Shawn.

On Sun, 2002-01-20 at 21:29, Shawn Starr wrote:
>
> On Mon, 21 Jan 2002, Anton Altaparmakov wrote:
>
> > [snip]
> > At 00:57 21/01/02, Hans Reiser wrote:
> > [snip]
> > > Would be best if VM told us if we really must write that page.
> >
> > In theory the VM should never call writepage unless the page must be writen
> > out...
> >
> > But I agree with you that it would be good to be able to distinguish the
> > two cases. I have been thinking about this a bit in the context of NTFS TNG
> > but I think that it would be better to have a generic solution rather than
> > every fs does their own copy of the same thing. I envisage that there is a
> > flush daemon which just walks around writing pages to disk in the
> > background (there could be one per fs, or a generic one which fs register
> > with, at their option they could have their own of course) in order to keep
> > the number of dirty pages low and in order to minimize data loss on the
> > event of system/power failure.
> >
> > This demon requires several interfaces though, with regards to journalling
> > fs. The daemon should have an interface where the fs can say "commit pages
> > in this list NOW and do not return before done", also a barrier operation
> > would be required in journalling context. A transactions interface would be
> > ideal, where the fs can submit whole transactions consisting of writing out
> > a list of pages and optional write barriers; e.g. write journal pages x, y,
> > z, barrier, write metadata, perhaps barrier, finally write data pages a, b,
> > c. Simple file systems could just not bother at all and rely on the flush
> > daemon calling the fs to write the pages.
> >
> > Obviously when this daemon writes pages the pages will continue being
> > there. OTOH, if the VM calls write page because it needs to free memory
> > then writepage must write and clean the page.
> >
>
> if they are dirty and written immediately to the disk they can be cleaned
> from the queue. It would be nice if there was some way to have a checksum
> verify the data was written back then wipe it from the queue.
>
> As an example: 5 operations requested, 2 already in queue.
>
> In queue) DIRTY write to disk (this task has been in the queue for a
> while)
>
> In queue) not 'old' memory but must be written to disk
>
> pending queue:
>
> 1) read operation
> 2) read operation
> 3) Write operation
> 4) write operation
>
> The daemon should re-sort by priority: write dirty pages to disk, then write
> any other pages that are left in the queue, then get to the read pages.
>
>
> Notes:
>
> If there is only one operation in the queue (say write) and nothing else
> comes along, then the daemon should force-write the data back to disk
> after a period of timeout (the memory in the slot becomes dirty)
>
> If there are too many tasks in the queue and another one requires more
> memory than what's left in the buffer/cache, the daemon could
> store the request in swap memory and put it in the queue; if the request
> is a write request it would still have more priority than any read requests
> and get completed quickly, allowing the remaining queue events to
> complete.
>
> Example:
>
> ReiserFS:
> Operation A. Write (10K)
> Operation B. Read (200K)
> Operation C. Write (160K)
>
>
> XFS:
> Operation A. Read (63K)
> Operation B. Read (3k)
> Operation C. Write (10K)
>
>
> EXT3:
> Operation A. Write (290K)
> Operation B. Write (90K)
> Operation C. Read (3k)
>
> The kpagebuf (or whatever name) would get all these requests and sort out
> what needs to be done first. As long as there's buffer/cache memory free,
> the write operations would be done as fast as possible, verified by some
> checksum and purged from the queue. If there's no cache/buffer memory
> free then all write queues, regardless of being in swap or cache/buffer, need to be
> written to disk.
>
> So:
> kpagebuf queue (total available buffer/cache memory is say 512K)
>
> EXT3 Write (290K)
> ReiserFS Write (160K)
> ReiserFS Write (10K)
> XFS Write (10K)
> EXT3 Write (90K) - Goes in swap because total > 512K (Dirty x2 state)
> ReiserFS Read (200K) - Swap (dirty x2)
> XFS Read (63K) - Swap (dirty x2)
> XFS Read (3K) - Swap (dirty x2)
> EXT3 Read (3K) - Swap (dirty x2)
>
> * The daemon would check, in order of filesystem registration, whose
> requests should be in the read queue first.
>
> * The daemon should maximize the amount of memory stored in buffer/cache to
> try to prevent write requests having to go into swap.
>
> In the above queue, we have a lot of read operations and one write
> operation in swap. Clean out the write operations since they are now dirty
> (because there's no room for more operations in the buffer/cache). Move
> the swapped write operation to the top of the queue and get rid of it.
> Move the read operations from swap to queue since there is room again. **
> NOTE ** because those read requests are now dirty they MUST be dealt with
> or they'll get stuck in the queue with more write requests overtaking
> them.
>
> Maybe I've lost it but that's how I see it ;)
>
> Shawn.
>
>


2002-01-21 19:45:36

by Chris Mason

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.



On Monday, January 21, 2002 08:47:00 PM +0300 Hans Reiser
<[email protected]> wrote:

> Chris Mason wrote:
>
>>
>> On Monday, January 21, 2002 05:07:30 PM +0300 Hans Reiser
>> <[email protected]> wrote:
>>
>>> Rik van Riel wrote:
>>>
>>>> On Mon, 21 Jan 2002, Hans Reiser wrote:
>>>>
>>>>> Pressure received is not equal to pages yielded. ... The number of
>>>>> pages yielded should depend on the interplay of pressure received and
>>>>> accesses made.
>>>>>
>>
>> Ah, once the FS starts counting accesses, we get in trouble. The FS
>> should strive to know only these 3 things:
>>
>> How to read useful data into a page
>> How to flush a dirty page
>> How to free a pinned page
>>
> You say this with the all the dogma of someone working with code that
> currently does things a particular way. You provide no reasons though.

;-) In general, every bit of the VM we modify and copy into the FS will:

A) break later on as the rest of the VM evolves
B) perform poorly on hardware we don't have (numa).
C) make odd, hard to trigger bugs due to strange interactions on large
machines and certain work loads.
D) require almost constant maintenance.

And that is how it works right now. The journal is a subcache that does
not respond to memory pressure the same way on all the journaled
filesystems, and none of them are optimal.

>>
>> Everything gets cleaner if we push this info up to the VM in a generic
>> fashion, instead of trying to push bits of the VM down into each
>> filesystem.
>> The FS should have no idea of what memory pressure is, down that path
>> lies pain, suffering, and deadlocks against the journal ;-)
>>
>> If the VM is telling the FS to write a pinned page when there are
>> unpinned pages that can be written with less cost, then we need to give
>> the VM better hints about the actual cost of writing the pinned page.
>>
>
> Oh, this means a much more complicated interface,

Grin, we can't really compare interface complexity until both are written
and working.

> and it means that the
> VM must take into account the optimizations of each and every filesystem.
> Are you sure this isn't an unmaintainable centralized hell?

Decentralization in this case seems much more risky. The VM needs well
defined repeatable behaviour.

> In practice,
> will it really mean that optimizations specific to a particular
> filesystem will get ignored, because there will be too many of them to
> keep up with, and they will clutter each other up if implemented in one
> piece of code? Will programmers really be able to experiment?

The idea is to find the basic interface required to do this for us.
Internally, the FS needs an interface to give hints to its own subcache, so
it must be possible to give hints to a VM. I'm not pretending it will be
easy to generalize, but all the filesystems need a very similar set of
tools here, so it should be worth the effort.

>>
>>
>> For periodic group flushes (delayed allocation, journal commits, etc), we
>> need better throttling on dirty pages instead of just dirty buffers like
>> we do now.
>>
>> I'm not delusional enough to think this will make all the vm<->journal
>> nastiness go away, but it hopefully should be less painful than adding
>> extra VM intelligence into each FS.
>>
> Say more about what you mean by better throttling on dirty pages, and how
> that meets the needs of slum squeezing, transaction committing, write
> clustering, etc. Last I remember, the generic write clustering code in
> VM didn't even understand packing localities.;-)

Most write throttling is done by bdflush right now, because most dirty
things that need to hit disk have dirty buffers. For pinned pages, delayed
allocation etc, we probably want a rate limiter unrelated to buffers at
all, and one that can trigger complex actions from the FS instead of just a
simple write-one-page.

I'm not saying we should teach the VM how to do these complex operations,
but I do think it should be in charge of deciding when they happen as much
as possible. In other words, the journal would only trigger a commit on
its own when the transaction was full. The other cases (too old, low ram,
too many dirty pages) would be triggered by the VM.

For write clustering, we could add an int clusterpage(struct page *p)
address space op that allows the FS to find pages close to p, or the FS
could choose to cluster in its own writepage func.
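
To make that concrete, here is a rough standalone sketch of what such a hook
could look like; all the names and types below are invented for illustration,
not a proposed patch. The VM still picks the target page, and the FS just
gets the chance to write cheap neighbours in the same pass:

    /* Toy types only; not kernel structures. */
    struct toy_page {
            unsigned long index;            /* page's offset within the mapping */
            int dirty;
    };

    struct toy_mapping {
            struct toy_page *pages;               /* indexed by page index     */
            unsigned long nrpages;
            int (*write_one)(struct toy_page *);  /* stand-in for ->writepage  */
    };

    /*
     * Hypothetical clusterpage(): write the VM-chosen page p, then
     * opportunistically write dirty neighbours in a small window, since
     * they are likely to be contiguous on disk.  Returns pages written.
     */
    static int toy_clusterpage(struct toy_mapping *m, struct toy_page *p,
                               unsigned long window)
    {
            unsigned long lo = p->index > window ? p->index - window : 0;
            unsigned long hi = p->index + window;
            unsigned long i;
            int written = 0;

            for (i = lo; i <= hi && i < m->nrpages; i++) {
                    struct toy_page *q = &m->pages[i];

                    if (q == p || q->dirty) {
                            m->write_one(q);
                            q->dirty = 0;
                            written++;
                    }
            }
            return written;
    }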

-chris

2002-01-21 20:46:02

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Chris Mason wrote:

>
>On Monday, January 21, 2002 08:47:00 PM +0300 Hans Reiser
><[email protected]> wrote:
>
>>Chris Mason wrote:
>>
>>>On Monday, January 21, 2002 05:07:30 PM +0300 Hans Reiser
>>><[email protected]> wrote:
>>>
>>>>Rik van Riel wrote:
>>>>
>>>>>On Mon, 21 Jan 2002, Hans Reiser wrote:
>>>>>
>>>>>>Pressure received is not equal to pages yielded. ... The number of
>>>>>>pages yielded should depend on the interplay of pressure received and
>>>>>>accesses made.
>>>>>>
>>>Ah, once the FS starts counting accesses, we get in trouble. The FS
>>>should strive to know only these 3 things:
>>>
>>>How to read useful data into a page
>>>How to flush a dirty page
>>>How to free a pinned page
>>>
>>You say this with the all the dogma of someone working with code that
>>currently does things a particular way. You provide no reasons though.
>>
>
>;-) In general, every bit of the VM we modify and copy into the FS will:
>
>A) break later on as the rest of the VM evolves
>B) perform poorly on hardware we don't have (numa).
>C) make odd, hard to trigger bugs due to strange interactions on large
>machines and certain work loads.
>D) require almost constant maintenance.
>
>And that is how it works right now. The journal is a subcache that does
>not respond to memory pressure the same way on all the journaled
>filesystems, and none of them are optimal.
>

This is because you didn't want to disturb VM enough to create a proper
interface. You were right to have this attitude during code freeze.
Code freeze is over.

>
>
>>>Everything gets cleaner if we push this info up to the VM in a generic
>>>fashion, instead of trying to push bits of the VM down into each
>>>filesystem.
>>>The FS should have no idea of what memory pressure is, down that path
>>>lies pain, suffering, and deadlocks against the journal ;-)
>>>
>>>If the VM is telling the FS to write a pinned page when there are
>>>unpinned pages that can be written with less cost, then we need to give
>>>the VM better hints about the actual cost of writing the pinned page.
>>>
>>Oh, this means a much more complicated interface,
>>
>
>Grin, we can't really compare interface complexity until both are written
>and working.
>

Yah, yah, as the Germans taught me to say.;-)

>
>
>>and it means that the
>>VM must take into account the optimizations of each and every filesystem.
>>Are you sure this isn't an unmaintainable centralized hell?
>>
>
>Decentralization in this case seems much more risky. The VM needs well
>defined repeatable behaviour.
>

Decentralization always seems more risky. It is why we have so many
centralized economies, errh, .....

>
>
>>In practice,
>>will it really mean that optimizations specific to a particular
>>filesystem will get ignored, because there will be too many of them to
>>keep up with, and they will clutter each other up if implemented in one
>>piece of code? Will programmers really be able to experiment?
>>
>
>The idea is to find the basic interface required to do this for us.
>Internally, the FS needs an interface to give hints to its own subcache, so
>

Uh, the hints are called slums and balanced trees and unallocated
extents and distinctions between overwrite sets and relocate sets and
the difference between internal and leaf nodes and five different mount
options for how to allocate blocks and....

I think that asking VM to understand this is simply awful.

>
>it must be possible to give hints to a VM. I'm not pretending it will be
>easy to generalize, but all the filesystems need a very similar set of
>tools here, so it should be worth the effort.
>

I prefer the approach used in VFS, in which templates of generic FS code
are supplied, and people can use as much or as little of the generic
code as they want. This allows people who just want to create a
filesystem that can read a particular format to do so without unique
optimizations for that FS, and people who want to write a seriously
optimized filesystem that understands how to optimize for a particular
layout to do so.

I think that what you and Saveliev did made sense for 2.4 where we were
struggling against a code freeze (well, at least there was supposed to
be a code freeze on VM/VFS, but that is history we should not
revisit.....), but it is not appropriate for when there is no code freeze.

>
>
>>>
>>>For periodic group flushes (delayed allocation, journal commits, etc), we
>>>need better throttling on dirty pages instead of just dirty buffers like
>>>we do now.
>>>
>>>I'm not delusional enough to think this will make all the vm<->journal
>>>nastiness go away, but it hopefully should be less painful than adding
>>>extra VM intelligence into each FS.
>>>
>>Say more about what you mean by better throttling on dirty pages, and how
>>that meets the needs of slum squeezing, transaction committing, write
>>clustering, etc. Last I remember, the generic write clustering code in
>>VM didn't even understand packing localities.;-)
>>
>
>Most write throttling is done by bdflush right now, because most dirty
>things that need to hit disk have dirty buffers. For pinned pages, delayed
>allocation etc, we probably want a rate limiter unrelated to buffers at
>all, and one that can trigger complex actions from the FS instead of just a
>simple write-one-page.
>
>I'm not saying we should teach the VM how to do these complex operations,
>but I do think it should be in charge of deciding when they happen as much
>as possible. In other words, the journal would only trigger a commit on
>its own when the transaction was full. The other cases (too old, low ram,
>too many dirty pages) would be triggered by the VM.
>
I read this and it sounds like you are agreeing with me, which is
confusing ;-). Help me to understand what you mean by triggered. Do you
mean VM sends pressure to the FS? Do you mean that VM understands what
a transaction is? Is this that generic journaling layer trying to come
alive as a piece of the VM? I am definitely confused.

I think what I need to understand, is do you see the VM as telling the
FS when it has (too many dirty pages or too many clean pages) and
letting the FS choose to commit a transaction if it wants to as its way
of cleaning pages, or do you see the VM as telling the FS to commit a
transaction?

If you think that VM should tell the FS when it has too many pages, does
that mean that the VM understands that a particular page in the subcache
has not been accessed recently enough? Is that the pivot point of our
disagreement?

>
>
>For write clustering, we could add an int clusterpage(struct page *p)
>address space op that allow the FS to find pages close to p, or the FS
>could choose to cluster in its own writepage func.
>
What you are proposing is not consistent with how Marcello is doing
write clustering as part of the VM, you understand that, yes? What
Marcello is doing is fine for ReiserFS V3 but won't work well for v4, do
you agree?

Hans

2002-01-21 21:54:34

by Chris Mason

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.



On Monday, January 21, 2002 11:41:44 PM +0300 Hans Reiser
<[email protected]> wrote:

> I read this and it sounds like you are agreeing with me, which is
> confusing;-),

No, no, you're agreeing with me ;-)

> help me to understand what you mean by triggered. Do you
> mean VM sends pressure to the FS? Do you mean that VM understands what a
> transaction is? Is this that generic journaling layer trying to come
> alive as a piece of the VM? I am definitely confused.
>
The vm doesn't know what a transaction is. But, the vm might know that
a) this block is pinned by the FS for write ordering reasons
b) the cost of writing this block is X
c) calling page->somefunc will trigger writes on those blocks.

The cost could be in order of magnitude, the idea would be to give the FS
the chance to say 'on a scale of 1 to 10, writing this block will hurt
this much'. Some blocks might have negative costs, meaning they don't
depend on anything and help free others.

The same system can be used for transactions and delayed allocation,
without telling the VM about any specifics.
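
Something like the toy code below is one way to picture it (every name here
is made up, it is not a real interface): the FS attaches a small write-cost
callback to its pages, and the VM consults it before choosing which dirty
page to push.

    /* Toy illustration of per-page write-cost hints; not a real interface. */
    struct toy_page {
            int dirty;
            int pinned;                             /* held for write-ordering reasons   */
            int (*write_cost)(struct toy_page *);   /* FS-supplied: -1 (frees others)
                                                       up to 10 (journal commit needed)  */
            int (*flush)(struct toy_page *);        /* the "page->somefunc" above        */
    };

    /* VM side: among candidate dirty pages, clean the cheapest one first. */
    static struct toy_page *pick_cheapest(struct toy_page **cand, int n)
    {
            struct toy_page *best = 0;
            int best_cost = 0, i;

            for (i = 0; i < n; i++) {
                    if (!cand[i]->dirty)
                            continue;
                    if (!best || cand[i]->write_cost(cand[i]) < best_cost) {
                            best = cand[i];
                            best_cost = best->write_cost(best);
                    }
            }
            return best;    /* caller then does best->flush(best) */
    }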

> I think what I need to understand, is do you see the VM as telling the FS
> when it has (too many dirty pages or too many clean pages) and letting
> the FS choose to commit a transaction if it wants to as its way of
> cleaning pages, or do you see the VM as telling the FS to commit a
> transaction?

I see the VM calling page->somefunc to flush that page, triggering whatever
events the FS feels are necessary. We might want some way to differentiate
between periodic writes and memory pressure, so the FS has the option of
doing fancier things during write throttling.

>
> If you think that VM should tell the FS when it has too many pages, does
> that mean that the VM understands that a particular page in the subcache
> has not been accessed recently enough? Is that the pivot point of our
> disagreement?

Pretty much. I don't think the VM should say 'you have too many pages', I
think it should say 'free this page'.

>>
>>
>> For write clustering, we could add an int clusterpage(struct page *p)
>> address space op that allow the FS to find pages close to p, or the FS
>> could choose to cluster in its own writepage func.
>>
> What you are proposing is not consistent with how Marcello is doing write
> clustering as part of the VM, you understand that, yes? What Marcello is
> doing is fine for ReiserFS V3 but won't work well for v4, do you agree?

Well, my only point is that it is possible to make an interface for write
clustering that gives the FS the freedom to do what it needs, but still
keep the intelligence about which pages need freeing first in the VM.

-chris

2002-01-22 06:04:34

by Andreas Dilger

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Jan 21, 2002 16:53 -0500, Chris Mason wrote:
> On Monday, January 21, 2002 11:41:44 PM +0300 Hans Reiser wrote:
> > help me to understand what you mean by triggered. Do you
> > mean VM sends pressure to the FS? Do you mean that VM understands what a
> > transaction is? Is this that generic journaling layer trying to come
> > alive as a piece of the VM? I am definitely confused.
>
> The vm doesn't know what a transaction is. But, the vm might know that
> a) this block is pinned by the FS for write ordering reasons
> b) the cost of writing this block is X
> c) calling page->somefunc will trigger writes on those blocks.
>
> The cost could be in order of magnitude, the idea would be to give the FS
> the chance to say 'one a scale of 1 to 10, writing this block will hurt
> this much'. Some blocks might have negative costs, meaning they don't
> depend on anything and help free others.
>
> The same system can be used for transactions and delayed allocation,
> without telling the VM about any specifics.
>
> > I think what I need to understand, is do you see the VM as telling the FS
> > when it has (too many dirty pages or too many clean pages) and letting
> > the FS choose to commit a transaction if it wants to as its way of
> > cleaning pages, or do you see the VM as telling the FS to commit a
> > transaction?
>
> I see the VM calling page->somefunc to flush that page, triggering whatever
> events the FS feels are necessary. We might want some way to differentiate
> between periodic writes and memory pressure, so the FS has the option of
> doing fancier things during write throttling.

The ext3 developers have also been wanting things like this for a long time,
both having a "memory pressure" notification, and a differentiation between
"write this now" and "this is a periodic sync, write some stuff". I've
CC'd them in case they want to contribute.

There are also other non-core caches in the kernel which could benefit
from having a generic "memory pressure" notification. Having a generic
memory pressure notification helps reduce (but not eliminate) the need
to call "write this page now" into the filesystem.

My guess would be that having calls into the FS with "priorities", just
like shrink_dcache_memory() does, would allow the FS to make more
intelligent decisions about what to write/free _before_ you get to the
stage where the VM is in a panic and is telling you _specifically_ what
to write/free/etc.
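
For flavour, the sort of hook I mean might look like the toy code below;
all the names are invented, only the shrink_dcache_memory() analogy is real,
and a smaller priority number meaning more urgency mirrors the existing
convention as far as I recall.

    /* Hypothetical per-filesystem pressure hook; illustrative only. */
    struct toy_subcache {
            unsigned long nr_pages;         /* pages currently held by this FS */
            /* Free roughly nr_pages >> priority of the least useful pages;
               return how many pages were actually given back to the VM. */
            unsigned long (*shrink)(struct toy_subcache *sc, int priority);
    };

    /* VM side: gentle nudges first, harder pressure only if still short. */
    static unsigned long apply_pressure(struct toy_subcache *sc,
                                        unsigned long wanted)
    {
            unsigned long freed = 0;
            int priority;

            for (priority = 6; priority >= 0 && freed < wanted; priority--)
                    freed += sc->shrink(sc, priority);

            return freed;
    }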

> > If you think that VM should tell the FS when it has too many pages, does
> > that mean that the VM understands that a particular page in the subcache
> > has not been accessed recently enough? Is that the pivot point of our
> > disagreement?
>
> Pretty much. I don't think the VM should say 'you have too many pages', I
> think it should say 'free this page'.

As above, it should have the capability to do both, depending on the
circumstances. The FS can obviously make better judgements locally about
what to write under normal circumstances, so it should be given the best
chance to do so.

The VM can make better _specific_ judgements when it needs to (e.g. free
a DMA page or another specific page to allow a larger contiguous chunk of
memory to be allocated), but in the cases where it just wants _some_ page(s)
to be freed, it should allow the FS to decide which one(s), if it cares.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-01-22 10:13:27

by Tommi Kyntola

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Mon, 21 Jan 2002, Andreas Dilger wrote:
> On Jan 21, 2002 16:53 -0500, Chris Mason wrote:
> > On Monday, January 21, 2002 11:41:44 PM +0300 Hans Reiser wrote:
> > > If you think that VM should tell the FS when it has too many pages, does
> > > that mean that the VM understands that a particular page in the subcache
> > > has not been accessed recently enough? Is that the pivot point of our
> > > disagreement?
> >
> > Pretty much. I don't think the VM should say 'you have too many pages', I
> > think it should say 'free this page'.
>
> As above, it should have the capability to do both, depending on the
> circumstances. The FS can obviously make better judgements locally about
> what to write under normal circumstances, so it should be given the best
> chance to do so.
>
> The VM can make better _specific_ judgements when it needs to (e.g. free
> a DMA page or another specific page to allow a larger contiguous chunk
> of memory to be allocated), but in the cases where it just wants _some_
> page(s) to be freed, it should allow the FS to decide which one(s), if
> it cares.

Which is pretty close to what Anton said. It seems obvious that the VM
also needs to use a (hopefully rare-case) write_page where the FS
should comply, whether it's suboptimal or not for that particular FS.

But wouldn't Anton's suggestion of a separate (hopefully more
common case) write_some_page give some leash to FS developers to
optimize their page releasing based on their own demands?

It'd at least allow a centralized VM while keeping the other filesystems
intact.

--
Tommi "Kynde" Kyntola
/* A man alone in the forest talking to himself and
no women around to hear him. Is he still wrong? */


2002-01-22 11:43:24

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

So is there a consensus view that we need 2 calls, one to write a
particular page, and one to exert memory pressure, and the call to write
a particular page should only be used when we really need to write that
particular page?

Are we sure this meets the needs of memory zones, which I need to learn
more about the architecture of?
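
If that is the consensus, the whole VM-to-FS surface could be as small as
the toy sketch below (hypothetical names, not a patch): one hard call for a
specific page, one soft call that only communicates pressure.

    /* Hypothetical two-call interface; purely illustrative. */
    struct toy_page;

    struct toy_vm_fs_ops {
            /*
             * Hard request: this particular page must be written and cleaned
             * now, e.g. because the VM needs this exact page (zone/DMA needs).
             */
            int (*write_this_page)(struct toy_page *page);

            /*
             * Soft request: memory pressure only.  The FS may squeeze a slum,
             * commit a transaction, cluster writes, etc., and chooses which
             * pages to yield; it reports how many it actually freed.
             */
            unsigned long (*apply_pressure)(unsigned long pages_wanted);
    };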

Hans

Andreas Dilger wrote:

>On Jan 21, 2002 16:53 -0500, Chris Mason wrote:
>
>>On Monday, January 21, 2002 11:41:44 PM +0300 Hans Reiser wrote:
>>
>>>help me to understand what you mean by triggered. Do you
>>>mean VM sends pressure to the FS? Do you mean that VM understands what a
>>>transaction is? Is this that generic journaling layer trying to come
>>>alive as a piece of the VM? I am definitely confused.
>>>
>>The vm doesn't know what a transaction is. But, the vm might know that
>>a) this block is pinned by the FS for write ordering reasons
>>b) the cost of writing this block is X
>>c) calling page->somefunc will trigger writes on those blocks.
>>
>>The cost could be in order of magnitude, the idea would be to give the FS
>>the chance to say 'one a scale of 1 to 10, writing this block will hurt
>>this much'. Some blocks might have negative costs, meaning they don't
>>depend on anything and help free others.
>>
>>The same system can be used for transactions and delayed allocation,
>>without telling the VM about any specifics.
>>
>>>I think what I need to understand, is do you see the VM as telling the FS
>>>when it has (too many dirty pages or too many clean pages) and letting
>>>the FS choose to commit a transaction if it wants to as its way of
>>>cleaning pages, or do you see the VM as telling the FS to commit a
>>>transaction?
>>>
>>I see the VM calling page->somefunc to flush that page, triggering whatever
>>events the FS feels are necessary. We might want some way to differentiate
>>between periodic writes and memory pressure, so the FS has the option of
>>doing fancier things during write throttling.
>>
>
>The ext3 developers have also been wanting things like this for a long time,
>both having a "memory pressure" notification, and a differentiation between
>"write this now" and "this is a periodic sync, write some stuff". I've
>CC'd them in case they want to contribute.
>
>There are also other non-core caches in the kernel which could benefit
>from having a generic "memory pressure" notification. Having a generic
>memory pressure notification helps reduce (but not eliminate) the need
>to call "write this page now" into the filesystem.
>
>My guess would be that having calls into the FS with "priorities", just
>like shrink_dcache_memory() does, would allow the FS to make more
>intelligent decisions about what to write/free _before_ you get to the
>stage where the VM is in a panic and is telling you _specifically_ what
>to write/free/etc.
>
>>>If you think that VM should tell the FS when it has too many pages, does
>>>that mean that the VM understands that a particular page in the subcache
>>>has not been accessed recently enough? Is that the pivot point of our
>>>disagreement?
>>>
>>Pretty much. I don't think the VM should say 'you have too many pages', I
>>think it should say 'free this page'.
>>
>
>As above, it should have the capability to do both, depending on the
>circumstances. The FS can obviously make better judgements locally about
>what to write under normal circumstances, so it should be given the best
>chance to do so.
>
>The VM can make better _specific_ judgements when it needs to (e.g. free
>a DMA page or another specific page to allow a larger contiguous chunk of
>memory to be allocated), but in the cases where it just wants _some_ page(s)
>to be freed, it should allow the FS to decide which one(s), if it cares.
>
>Cheers, Andreas
>--
>Andreas Dilger
>http://sourceforge.net/projects/ext2resize/
>http://www-mddsp.enel.ucalgary.ca/People/adilger/
>
>
>



2002-01-22 14:04:52

by Chris Mason

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.



On Monday, January 21, 2002 11:02:49 PM -0700 Andreas Dilger
<[email protected]> wrote:

[ snip ]

It seems like the basic features we are suggesting are very close, I'll try
one last time to make a case against the 'free_some_pages' call ;-)

>
> The VM can make better _specific_ judgements when it needs to (e.g. free
> a DMA page or another specific page to allow a larger contiguous chunk of
> memory to be allocated), but in the cases where it just wants _some_
> page(s) to be freed, it should allow the FS to decide which one(s), if it
> cares.

I'd rather see the VM trigger a flush on a specific page, but tell the FS
it's OK to do broader actions if it wants to. In the case of write
throttling, the FS doesn't know which page has been dirty the longest,
unless it starts maintaining its own lists. The VM has all that
information, so it kicks the throttle or periodic write off with one
buffer, and lets the FS trigger other events because we aren't under huge
memory load.

The FS doesn't know how long a page has been dirty, or how often it gets
used, or anything other than this page is pinned and waiting for X event to
take place. If we really can't get this info to the VM in a useful
fashion, that's one thing. But if we can clue the VM in a little and put
the decision making there, I think the end result will be more likely to
clean the right page. That does affect performance even when we're not
under heavy memory pressure.

-chris

2002-01-22 14:39:49

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Tue, 22 Jan 2002, Chris Mason wrote:

> It seems like the basic features we are suggesting are very close, I'll try
> one last time to make a case against the 'free_some_pages' call ;-)

> The FS doesn't know how long a page has been dirty, or how often it
> gets used,

In an efficient system, the FS will never get to know this, either.

The whole idea behind the VFS and the VM is that calls to the FS
are avoided as much as possible, in order to keep the system fast.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-22 18:48:50

by Andrew Morton

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Hans Reiser wrote:
>
> So is there a consensus view that we need 2 calls, one to write a
> particular page, and one to exert memory pressure, and the call to write
> a particular page should only be used when we really need to write that
> particular page?
>

Note that writepage() doesn't get used much. Most VM-initiated
filesystem writeback activity is via try_to_release_page(), which
has somewhat more vague and flexible semantics.

And by bdflush, which I suspect tends to conflict with sync_page_buffers()
under pressure. But that's a different problem.

-

2002-01-22 18:50:31

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Tue, 22 Jan 2002, Chris Mason wrote:
>
>>It seems like the basic features we are suggesting are very close, I'll try
>>one last time to make a case against the 'free_some_pages' call ;-)
>>
>
>>The FS doesn't know how long a page has been dirty, or how often it
>>gets used,
>>
>
>In an efficient system, the FS will never get to know this, either.
>

I don't understand this statement. If dereferencing a vfs op for every
page aging is too expensive, then ask it to age more than one page at a
time. Or do I miss your meaning?

>
>
>The whole idea behind the VFS and the VM is that calls to the FS
>are avoided as much as possible, in order to keep the system fast.
>
In other words, you write the core of our filesystem for us, and we
write the parts that don't interest you?

Maybe this is the real meat of the issue?

Hans



2002-01-22 19:04:20

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Tue, 22 Jan 2002, Andrew Morton wrote:
> Hans Reiser wrote:
> >
> > So is there a consensus view that we need 2 calls, one to write a
> > particular page, and one to exert memory pressure, and the call to write
> > a particular page should only be used when we really need to write that
> > particular page?
>
> Note that writepage() doesn't get used much. Most VM-initiated
> filesystem writeback activity is via try_to_release_page(), which
> has somewhat more vague and flexible semantics.

We may want to change this though, or at the very least get
rid of the horrible interplay between ->writepage and
try_to_release_page() ...

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-22 19:21:10

by Chris Mason

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.



On Tuesday, January 22, 2002 09:46:07 PM +0300 Hans Reiser
<[email protected]> wrote:

> Rik van Riel wrote:
>>> The FS doesn't know how long a page has been dirty, or how often it
>>> gets used,
>> In an efficient system, the FS will never get to know this, either.
>
> I don't understand this statement. If dereferencing a vfs op for every
> page aging is too expensive, then ask it to age more than one page at a
> time. Or do I miss your meaning?

It's not about the cost of a function call, it's what the FS does to make
that call useful. Pretend for a second the VM tells the FS everything it
needs to know to age a page (whatever scheme the FS wants to use).

Then pretend the VM decides there's memory pressure, and tells the FS
subcache to start freeing ram. So, the FS goes through its list of pages
and finds the most suitable one for flushing, but it has no idea how
suitable that page is in comparison with the pages that don't belong to
that FS (or even other pages from different mount points of the same FS
flavor).

Since each subcache has its own aging scheme, you can't look at a page from
subcache A and compare it with a page from subcache B.

All the filesystem can do is flush its own pages, which might be the least
suitable pages on the entire box. The VM has no way of knowing, and
neither does the FS, and that's why it's inefficient.

Please let me know if I misunderstood the original plan ;-)

-chris

2002-01-22 20:14:45

by Steve Lord

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Tue, 2002-01-22 at 13:19, Chris Mason wrote:
>
>
> On Tuesday, January 22, 2002 09:46:07 PM +0300 Hans Reiser
> <[email protected]> wrote:
>
> > Rik van Riel wrote:
> >>> The FS doesn't know how long a page has been dirty, or how often it
> >>> gets used,
> >> In an efficient system, the FS will never get to know this, either.
> >
> > I don't understand this statement. If dereferencing a vfs op for every
> > page aging is too expensive, then ask it to age more than one page at a
> > time. Or do I miss your meaning?
>
> Its not about the cost of a function call, it's what the FS does to make
> that call useful. Pretend for a second the VM tells the FS everything it
> needs to know to age a page (whatever scheme the FS wants to use).
>
> Then pretend the VM decides there's memory pressure, and tells the FS
> subcache to start freeing ram. So, the FS goes through its list of pages
> and finds the most suitable one for flushing, but it has no idea how
> suitable that page is in comparison with the pages that don't belong to
> that FS (or even other pages from different mount points of the same FS
> flavor).
>
> Since each subcache has its own aging scheme, you can't look at a page from
> subcache A and compare it with a page from subcache B.
>
> All the filesystem can do is flush its own pages, which might be the least
> suitable pages on the entire box. The VM has no way of knowing, and
> neither does the FS, and that's why its inefficient.
>
> Please let me know if I misunderstood the original plan ;-)
>
Looks like I've been missing an interesting thread here ....

Surely flushing pages (and hence cleaning them) is not a bad thing to
do, provided you do not suck up all the available I/O bandwidth in the
process. The filesystem decides to clean the pages as it is efficient
from an I/O point of view. The vm is then free to reuse lots of pages
it could not before, but it still gets to make the decision about the
pages being good ones to reuse.

The xfs kernel changes add a call to writepage into the buffer flushing
path when the data is delayed allocate. We then end up issuing I/O on
surrounding pages which end up being contiguous on disk and are not
currently locked by some other thread.


Steve

2002-01-22 20:21:36

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Tue, 22 Jan 2002, Hans Reiser wrote:
> Rik van Riel wrote:
> >On Tue, 22 Jan 2002, Chris Mason wrote:

> >>The FS doesn't know how long a page has been dirty, or how often it
> >>gets used,
> >
> >In an efficient system, the FS will never get to know this, either.
>
> I don't understand this statement. If dereferencing a vfs op for
> every page aging is too expensive, then ask it to age more than one
> page at a time. Or do I miss your meaning?

Please repeat after me:

"THE FS DOES NOT SEE THE MMU ACCESSED BITS"

Also, if a piece of data is in the page cache, it is accessed
without calling the filesystem code.


This means the filesystem doesn't know how often pages are or
are not used, hence it cannot make the decisions the VM make.

Or do you want to have your own ReiserVM and ReiserPageCache ?

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-22 20:24:15

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Andrew Morton wrote:

>Hans Reiser wrote:
>
>>So is there a consensus view that we need 2 calls, one to write a
>>particular page, and one to exert memory pressure, and the call to write
>>a particular page should only be used when we really need to write that
>>particular page?
>>
>
>Note that writepage() doesn't get used much. Most VM-initiated
>filesystem writeback activity is via try_to_release_page(), which
>has somewhat more vague and flexible semantics.
>
>And by bdflush, which I suspect tends to conflict with sync_page_buffers()
>under pressure. But that's a different problem.
>
>-
>
>
So the problem is that there is no coherently architected VM-to-FS
interface that has been articulated, and we need one.

So far we can identify that we need something to pressure the FS, and
something to ask for a particular page.

It might be desirable to pressure the FS more than one page aging at a
time for reasons of performance as Rik pointed out.

Any other design considerations?


2002-01-22 20:36:36

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Chris Mason wrote:

>
>On Tuesday, January 22, 2002 09:46:07 PM +0300 Hans Reiser
><[email protected]> wrote:
>
>>Rik van Riel wrote:
>>
>>>>The FS doesn't know how long a page has been dirty, or how often it
>>>>gets used,
>>>>
>>>In an efficient system, the FS will never get to know this, either.
>>>
>>I don't understand this statement. If dereferencing a vfs op for every
>>page aging is too expensive, then ask it to age more than one page at a
>>time. Or do I miss your meaning?
>>
>
>Its not about the cost of a function call, it's what the FS does to make
>that call useful. Pretend for a second the VM tells the FS everything it
>needs to know to age a page (whatever scheme the FS wants to use).
>
>Then pretend the VM decides there's memory pressure, and tells the FS
>subcache to start freeing ram. So, the FS goes through its list of pages
>and finds the most suitable one for flushing, but it has no idea how
>suitable that page is in comparison with the pages that don't belong to
>that FS (or even other pages from different mount points of the same FS
>flavor).
>

Why does it need to know how suitable it is compared to the other
subcaches? It just ages X pages, and depends on the VM to determine how
large X is. The VM pressures subcaches in proportion to their size, it
doesn't need to know how suitable one page is compared to another, it
just has a notion of push on everyone in proportion to their size.
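
As a toy sketch (invented names, not a patch), the VM side of that delegation
could be as dumb as the code below: it only decides how hard to push each
subcache, in proportion to its size, and each subcache does its own aging in
response.

    /* Illustrative only; structures and callbacks are hypothetical. */
    struct toy_cache {
            unsigned long nr_pages;                                    /* current size */
            void (*age_pages)(struct toy_cache *c, unsigned long nr);  /* FS-side aging */
    };

    /* VM side: spread a target amount of aging across subcaches by size. */
    static void pressure_caches(struct toy_cache **caches, int n,
                                unsigned long target)
    {
            unsigned long total = 0;
            int i;

            for (i = 0; i < n; i++)
                    total += caches[i]->nr_pages;
            if (!total)
                    return;

            for (i = 0; i < n; i++) {
                    unsigned long share = target * caches[i]->nr_pages / total;

                    if (share)
                            caches[i]->age_pages(caches[i], share);
            }
    }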

>
>
>Since each subcache has its own aging scheme, you can't look at a page from
>subcache A and compare it with a page from subcache B.
>

Chris, the VM doesn't compare one page to another within a unified
cache, so why should it compare one page to another within the delegated
cache management scheme? The VM ages until it gets what it wants, in
the current scheme. In the scheme I propose it requests aging from the
subcaches until it gets what it wants, instead of doing aging until it
gets what it wants.

Note that there is some slight inaccuracy in this, in that the current
scheme has ordered lists, but my point remains valid, especially if we
move to aging based on usage minus age counts, which I think Rik may be
supportive of (it makes it easier to give less staying power to a
page that is read only once, and I would say it was Rik's idea except
that I have probably distorted it in repeating it).

>
>
>All the filesystem can do is flush its own pages, which might be the least
>suitable pages on the entire box. The VM has no way of knowing, and
>neither does the FS, and that's why its inefficient.
>
>Please let me know if I misunderstood the original plan ;-)
>
Thanks for pointing out what needed to be articulated. Is it more clear
now?

Hans


2002-01-22 20:51:07

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Tue, 22 Jan 2002, Hans Reiser wrote:

> So the problem is that there is no coherently architected VM-to-FS
> interface that has been articulated, and we need one.

Absolutely agreed. One of the main design elements for such an
interface would be doing all filesystem things in the filesystem
and all VM things in the VM so we don't get frankenstein monsters
on either side of the fence.

> So far we can identify that we need something to pressure the FS, and
> something to ask for a particular page.
>
> It might be desirable to pressure the FS more than one page aging at a
> time for reasons of performance as Rik pointed out.

> Any other design considerations?

One of the things we really want to do in the VM is pre-clean
data and just reclaim clean pages later on.

This means it would be easiest/best if the filesystem took
care of _just_ writing out data and if freeing the data later
on would be left to the VM.

I understand this is not always possible due to stuff like
metadata repacking, but I guess we can ignore this case for
now since the metadata is hopefully small and won't unbalance
the VM.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-22 21:03:17

by Rolf Lear

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Excuse me for being a kernel newbie (and a list lurker), and for simplifying what is obviously a complex issue... but ... the underlying issue really is simple. Further, you are both suggesting that a change in design is not out-of-the-question.

The VM is responsible for making sure that Mem is used efficiently. The FS is responsible for making sure that the disks (both space and speed) are used efficiently.

Now, I have followed this thread for days, and I agree with Rik that the VM should be able to tell (command) the FS to free a page. I agree with Hans that "Ideally", the VM should be capable of identifying the best page to free (in terms of cost to the FS). In this ideal world, it is the responsibility of an intelligent FS to inform an intelligent VM what it can do quickly, and what will take time.

What I propose is either:
a) An indication on each dirty page of the cost required to clean it.
b) An FS function which can be called to indicate the cost of a clean.

This cost should be measured in terms of something relevant like approximate IO time. FS's which do not support this system should have stubs which cost all pages equally.

The system would work as follows:
The VM needs to free some memory, and not enough clean pages can be freed.
The VM identifies those dirty pages which are cheap to flush/clean, and does it. If the VM needs to flush an expensive page, it can still do it, but it knows the price ahead of time (double bonus).

To identify the cheap pages, the VM can ask the FS the price, and as an added bonus, the FS can tell the VM how many other pages will get freed in the process.
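
As a toy example of option (b) (names invented, not a real interface): the VM
asks the FS what cleaning a page would cost, in rough I/O-time units, and is
also told how many other pages the same flush would free; a simple FS just
plugs in a stub that prices every page the same.

    /* Toy illustration of the costing idea; not a real interface. */
    struct toy_page;

    struct toy_clean_estimate {
            unsigned int io_time_ms;        /* approximate cost of cleaning this page */
            unsigned int pages_also_freed;  /* other pages the same flush releases    */
    };

    struct toy_fs_cost_ops {
            /* Option (b): the VM asks the FS the price of cleaning this page. */
            void (*clean_cost)(struct toy_page *page,
                               struct toy_clean_estimate *out);
    };

    /* Stub for filesystems without costing: all pages cost the same. */
    static void default_clean_cost(struct toy_page *page,
                                   struct toy_clean_estimate *out)
    {
            (void)page;
            out->io_time_ms = 1;
            out->pages_also_freed = 0;
    }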

In my world of client-server / databases / etc, this just makes sense.

If this intelligent VM has a basic FS, it loses nothing. If it has an intelligent FS, it has more information to make better decisions.

Rolf

2002-01-22 21:10:09

by Chris Mason

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.



On Tuesday, January 22, 2002 11:32:09 PM +0300 Hans Reiser
<[email protected]> wrote:

>> Its not about the cost of a function call, it's what the FS does to make
>> that call useful. Pretend for a second the VM tells the FS everything it
>> needs to know to age a page (whatever scheme the FS wants to use).
>>
>> Then pretend the VM decides there's memory pressure, and tells the FS
>> subcache to start freeing ram. So, the FS goes through its list of pages
>> and finds the most suitable one for flushing, but it has no idea how
>> suitable that page is in comparison with the pages that don't belong to
>> that FS (or even other pages from different mount points of the same FS
>> flavor).
>>
>
> Why does it need to know how suitable it is compared to the other
> subcaches? It just ages X pages, and depends on the VM to determine how
> large X is. The VM pressures subcaches in proportion to their size, it
> doesn't need to know how suitable one page is compared to another, it
> just has a notion of push on everyone in proportion to their size.

If subcache A has 1000 pages that are very very active, and subcache B has
500 pages that never ever get used, should A get twice as much memory
pressure? That's what we want to avoid, and I don't see how subcaches
let us avoid it.

-chris


2002-01-22 21:12:39

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Tue, 22 Jan 2002, Hans Reiser wrote:

> Why does it need to know how suitable it is compared to the other
> subcaches? It just ages X pages,

How the hell is the filesystem supposed to age pages ?

The filesystem DOES NOT KNOW how often pages are used,
so it cannot age the pages.

End of thread.

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-22 21:24:20

by Chris Mason

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.



On Tuesday, January 22, 2002 02:13:18 PM -0600 Steve Lord <[email protected]>
wrote:

> Looks like I've been missing an interesting thread here ....

Hi Steve ;-)

>
> Surely flushing pages (and hence cleaning them) is not a bad thing to
> do, provided you do not suck up all the available I/O bandwidth in the
> process. The filesystem decides to clean the pages as it is efficient
> from an I/O point of view. The vm is then free to reuse lots of pages
> it could not before, but it still gets to make the decision about the
> pages being good ones to reuse.

Very true, there are a few different workloads to consider.

1) The box really needs ram right now, and we should do the minimum amount
of work to get it done. This is usually done by kswapd or a process doing
an allocation. It should help if the FS gives the VM enough details to
skip pages that require extra allocations (like commit blocks) in favor of
less expensive ones.

2) There's lots of dirty pages around, it would be a good idea to flush
some, regardless of how many pages might be freeable afterwards. This is
where we want most of the i/o to actually happen, and where we want to give
the FS the most freedom in regards to which pages get written.

>
> The xfs kernel changes add a call to writepage into the buffer flushing
> path when the data is delayed allocate. We then end up issuing I/O on
> surrounding pages which end up being contiguous on disk and are not
> currently locked by some other thread.

This probably helps in both situations listed, assuming things like HIGHMEM
bounce buffers don't come into play.

-chris

2002-01-22 21:27:29

by Shawn Starr

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

I've started on writing a pagebuf daemon (experimenting with ramfs). It
will have the VM manage the allocating/freeing of pages. The filesystem
should not have to know when a page needs to be freed or allocated. It
just needs pages. The pagebuf is supposed to age pages, not the
filesystem.

Shawn.

On Tue, 2002-01-22 at 16:12, Rik van Riel wrote:
> On Tue, 22 Jan 2002, Hans Reiser wrote:
>
> > Why does it need to know how suitable it is compared to the other
> > subcaches? It just ages X pages,
>
> How the hell is the filesystem supposed to age pages ?
>
> The filesystem DOES NOT KNOW how often pages are used,
> so it cannot age the pages.
>
> End of thread.
>
> Rik
> --
> "Linux holds advantages over the single-vendor commercial OS"
> -- Microsoft's "Competing with Linux" document
>
> http://www.surriel.com/ http://distro.conectiva.com/
>
>


2002-01-22 21:31:49

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On 22 Jan 2002, Shawn Starr wrote:

> I've started on writing a pagebuf daemon (experimenting with ramfs).
> It will have the VM manage the allocating/freeing of pages. The
> filesystem should not have to know when a page needs to be freed or
> allocated. It just need pages. The pagebuf is supposed to age pages
> not the filesystem.

Last I looked it was try_to_free_pages() which does the aging
of pages. What functionality would a pagebuf daemon add in
this regard ?

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-22 22:06:31

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Shawn, I didn't respond to this because it seems like you are mixing
issues relating to the elevator code into this, and so I don't really
understand you.

Hans

Shawn Starr wrote:

>Nobody wants to comment on this? :(
>
>Shawn.
>
>On Sun, 2002-01-20 at 21:29, Shawn Starr wrote:
>
>>On Mon, 21 Jan 2002, Anton Altaparmakov wrote:
>>
>>>[snip]
>>>At 00:57 21/01/02, Hans Reiser wrote:
>>>[snip]
>>> > Would be best if VM told us if we really must write that page.
>>>
>>>In theory the VM should never call writepage unless the page must be written
>>>out...
>>>
>>>But I agree with you that it would be good to be able to distinguish the
>>>two cases. I have been thinking about this a bit in the context of NTFS TNG
>>>but I think that it would be better to have a generic solution rather than
>>>every fs does their own copy of the same thing. I envisage that there is a
>>>flush daemon which just walks around writing pages to disk in the
>>>background (there could be one per fs, or a generic one which fs register
>>>with, at their option they could have their own of course) in order to keep
>>>the number of dirty pages low and in order to minimize data loss on the
>>>event of system/power failure.
>>>
>>>This demon requires several interfaces though, with regards to journalling
>>>fs. The daemon should have an interface where the fs can say "commit pages
>>>in this list NOW and do not return before done", also a barrier operation
>>>would be required in journalling context. A transactions interface would be
>>>ideal, where the fs can submit whole transactions consisting of writing out
>>>a list of pages and optional write barriers; e.g. write journal pages x, y,
>>>z, barrier, write metadata, perhaps barrier, finally write data pages a, b,
>>>c. Simple file systems could just not bother at all and rely on the flush
>>>daemon calling the fs to write the pages.
>>>
>>>Obviously when this daemon writes pages the pages will continue being
>>>there. OTOH, if the VM calls write page because it needs to free memory
>>>then writepage must write and clean the page.
>>>
>>If they are dirty and written immediately to the disk they can be cleaned
>>from the queue. It would be nice if there were some way to have a checksum
>>verify the data was written back, then wipe it from the queue.
>>
>>As an example: 5 operations requested, 2 already in queue.
>>
>>In queue) DIRTY write to disk (this task has been in the queue for a
>>while)
>>
>>In queue) not 'old' memory but must be written to disk
>>
>>pending queue:
>>
>>1) read operation
>>2) read operation
>>3) Write operation
>>4) write operation
>>
>>The daemon should re-sort by priority: write dirty pages to disk, then write
>>any other pages that are left in the queue, then get to the read requests.
>>
>>
>>Notes:
>>
>>If there is only one operation in the queue (say write) and nothing else
>>comes along, then the daemon should force-write the data back to disk
>>after a period of timeout (the memory in the slot becomes dirty)
>>
>>If there are too many tasks in the queue and another one requires more
>>memory than what's left in the buffer/cache, the daemon could request to
>>store the request in swap memory and put it in the queue. If the request
>>is a write request it would still have priority over any read requests
>>and get completed quickly, allowing the remaining queue events to
>>complete.
>>
>>Example:
>>
>>ReiserFS:
>> Operation A. Write (10K)
>> Operation B. Read (200K)
>> Operation C. Write (160K)
>>
>>
>>XFS:
>> Operation A. Read (63K)
>> Operation B. Read (3k)
>> Operation C. Write (10K)
>>
>>
>>EXT3:
>> Operation A. Write (290K)
>> Operation B. Write (90K)
>> Operation C. Read (3k)
>>
>>The kpagebuf daemon (or whatever name) would get all these requests and sort
>>out what needs to be done first. As long as there's buffer/cache memory free,
>>the write operations would be done as fast as possible, verified by some
>>checksum and purged from the queue. If there's no cache/buffer memory
>>free, then all write queues, regardless of being in swap or cache/buffer, need
>>to be written to disk.
>>
>>So:
>>kpagebuf queue (total available buffer/cache memory is say 512K)
>>
>> EXT3 Write (290K)
>> ReiserFS Write (160K)
>> ReiserFS Write (10K)
>> XFS Write (10K)
>> EXT3 Write (90K) - Goes in swap because total > 512K (Dirty x2 state)
>> ReiserFS Read (200K) - Swap (dirty x2)
>> XFS Read (63K) - Swap (dirty x2)
>> XFS Read (3K) - Swap (dirty x2)
>> EXT3 Read (3K) - Swap (dirty x2)
>>
>>* The daemon would check, in order of filesystem registration, whose
>>requests should be in the read queue first.
>>
>>* The daemon should maximize the amount of memory stored in buffer/cache to
>>try to prevent write requests having to go into swap.
>>
>>In the above queue, we have a lot of read operations and one write
>>operation in swap. Clean out the write operations since they are now dirty
>>(because there's no room for more operations in the buffer/cache). Move
>>the swapped write operation to the top of the queue and get rid of it.
>>Move the read operations from swap to the queue since there is room again. **
>>NOTE ** because those read requests are now dirty they MUST be dealt with
>>or they'll get stuck in the queue with more write requests overtaking
>>them.
>>
>>Maybe I've lost it but that's how I see it ;)
>>
>>Shawn.
>>
>>
>
>
>
>



2002-01-22 22:10:01

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Chris Mason wrote:

>
>On Tuesday, January 22, 2002 11:32:09 PM +0300 Hans Reiser
><[email protected]> wrote:
>
>>>Its not about the cost of a function call, it's what the FS does to make
>>>that call useful. Pretend for a second the VM tells the FS everything it
>>>needs to know to age a page (whatever scheme the FS wants to use).
>>>
>>>Then pretend the VM decides there's memory pressure, and tells the FS
>>>subcache to start freeing ram. So, the FS goes through its list of pages
>>>and finds the most suitable one for flushing, but it has no idea how
>>>suitable that page is in comparison with the pages that don't belong to
>>>that FS (or even other pages from different mount points of the same FS
>>>flavor).
>>>
>>Why does it need to know how suitable it is compared to the other
>>subcaches? It just ages X pages, and depends on the VM to determine how
>>large X is. The VM pressures subcaches in proportion to their size, it
>>doesn't need to know how suitable one page is compared to another, it
>>just has a notion of push on everyone in proportion to their size.
>>
>
>If subcache A has 1000 pages that are very very active, and subcache B has
>500 pages that never ever get used, should A get twice as much memory
>pressure? That's what we want to avoid, and I don't see how subcaches
>allow it.
>
>-chris
>
>
>
>
Yes, it should get twice as much pressure, but that does not mean it
should free twice as many pages; it means it should age twice as many
pages, and then the accesses will un-age them.

Make more sense now?

Hans
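
To spell out the arithmetic: the master VM hands each subcache an aging quota
proportional to its size, and the subcache applies it as aging, not freeing.
A toy sketch in plain C (the struct and function names are invented for
illustration, this is not existing kernel code):

struct subcache {
        unsigned long nr_pages;                           /* current size of this subcache */
        void (*age)(struct subcache *, unsigned long);    /* age N pages' worth of objects */
};

/* Push on everyone in proportion to their size. */
static void apply_pressure(struct subcache **caches, int n,
                           unsigned long total_aging)
{
        unsigned long total_pages = 0;
        int i;

        for (i = 0; i < n; i++)
                total_pages += caches[i]->nr_pages;
        if (!total_pages)
                return;

        for (i = 0; i < n; i++) {
                unsigned long quota =
                        total_aging * caches[i]->nr_pages / total_pages;
                /*
                 * A busy subcache gets more aging, but accesses un-age
                 * its hot objects, so it does not end up freeing more.
                 */
                caches[i]->age(caches[i], quota);
        }
}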

2002-01-22 22:10:52

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.


What's wrong with having the file-system call a VM function
to free some buffer once it's been written and hasn't been
accessed recently? Isn't that what's being done already?

That keeps FS in the FS and VM in VM.

The file-system is the only thing that "knows" or should know about
file-system activity.

The only problem I see with the current implementation is that it
"seems as though" the file-system keeps old data too long. Therefore,
RAM gets short.

The actual buffer(s) that get written and then released
should be based upon "least-recently-used". Buffers should
be written until some target of free memory is reached. Presently
it doesn't seem as though we have such a target. Therefore, we
eventually run out of RAM and try to find some magic algorithm
to use. As a last resort, we kill processes. This is NotGood(tm).

We need a free-RAM target, possibly based upon a percentage of
available RAM. The lack of such a target is what causes the
out-of-RAM condition we have been experiencing. Somebody thought
that "free RAM is wasted RAM" and the VM has been based upon
that theory. That theory has been proven incorrect. You need
free RAM, just like you need "excess horsepower" to make
automobiles drivable. That free RAM is the needed "rubber-band"
to absorb the dynamics of real-world systems.

That free-RAM target can be attacked both by the file-system(s)
and the VM system. The file-system gives LRU buffers until
it has obtained the free-RAM target, without regard for the
fact that VM may immediately use those pages for process expansion.

VM will also give up LRU pages until it has reached the same target.
These targets occur at different times, which is the exact mechanism
necessary to load-balance available RAM. VM can write to swap if
it needs, to satisfy its free-RAM target but writing to swap
has to go directly to the device or you will oscillate if the
swap-write doesn't free its buffers. In other words, you don't
free cache-RAM by writing to a cached file-system. You will
eventually settle into the time-constant which causes oscillation.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (797.90 BogoMips).

I was going to compile a list of innovations that could be
attributed to Microsoft. Once I realized that Ctrl-Alt-Del
was handled in the BIOS, I found that there aren't any.
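
For illustration, the free-RAM target Dick asks for boils down to something
like the following toy sketch; the percentages and names are made up, and this
is not the actual 2.4 balancing code:

/* Keep free RAM between a low and a high watermark. */
static unsigned long pages_to_reclaim(unsigned long total_pages,
                                      unsigned long free_pages,
                                      unsigned int low_pct,
                                      unsigned int high_pct)
{
        unsigned long low  = total_pages * low_pct  / 100;
        unsigned long high = total_pages * high_pct / 100;

        if (free_pages >= low)
                return 0;               /* enough "rubber-band" left */
        /*
         * Below the low watermark: the FS and the VM give up LRU
         * buffers/pages until free RAM is back above the high mark.
         */
        return high - free_pages;
}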


2002-01-22 22:22:41

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Wed, 23 Jan 2002, Hans Reiser wrote:

> Yes, it should get twice as much pressure, but that does not mean it
> should free twice as many pages, it means it should age twice as many
> pages, and then the accesses will un-age them.
>
> Make more sense now?

So basically you are saying that each filesystem should
implement the code to age all pages equally and react
equally to memory pressure ...

... essentially duplicating what the current VM already
does!

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-22 22:35:52

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Let's try a non-reiserfs sub-cache example. Suppose you have a cache of
objects that are smaller than a page. These might be dcache entries,
these might be struct inodes, these might be all sorts of things in the
kernel.

Suppose that there is absolutely no correlation in access between the
objects that are on the same page. Suppose that this subcache has
methods for freeing however many of them it wants to free, and it can
squeeze them together into fewer pages whenever it wants to. Suppose it
can track accesses to the objects, and it could age them also, if we
wrote the code to do it.

If we age with page granularity as you ask us to, we are doing
fundamentally the wrong thing. Aging with page granularity means that
we keep in the cache every object that happens to land on the same page
with a frequently accessed object even if those objects are never
accessed again ever.

Another wrong way: Ok, so suppose we have methods for shrinking the
cache a la the old 2.2 dcache shrinking code. Suppose we invoke those
whenever the cache gets "too large", or the other caches are failing to
free pages because things have gotten SO pathologically imbalanced that
they have nothing they can free. This is also bad. It results in
unbalanced caches, and makes our VM maintainer think that subcaches are
inherently bad.

If we don't have a master VM pushing proportionally to their size on all
subcaches, and telling them how many pages worth of aging to apply, we
either have unused objects staying in memory because they happen to land
on a page with a frequently used object, or we have unbalanced caches
that know what to free but not how much to free.

We need a master VM that says how much aging pressure to apply, and
subcaches that respond to that. We need a VM that doesn't just
delegate, but delegates skillfully enough that the subcaches know what
they need to know to act on it.
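
A minimal sketch of what such a subcache interface could look like (the names
are invented for illustration; this is not an existing kernel API):

struct subcache_ops {
        /* Age the equivalent of nr_pages worth of objects; accesses un-age them. */
        void (*age)(void *cache, unsigned long nr_pages);
        /* Free up to nr_pages pages, squeezing live objects onto fewer pages. */
        unsigned long (*shrink)(void *cache, unsigned long nr_pages);
        /* Current size in pages, so the master VM can apply proportional pressure. */
        unsigned long (*size)(void *cache);
};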

Hans


Rik van Riel wrote:

>On Tue, 22 Jan 2002, Hans Reiser wrote:
>
>>Rik van Riel wrote:
>>
>>>On Tue, 22 Jan 2002, Chris Mason wrote:
>>>
>
>>>>The FS doesn't know how long a page has been dirty, or how often it
>>>>gets used,
>>>>
>>>In an efficient system, the FS will never get to know this, either.
>>>
>>I don't understand this statement. If dereferencing a vfs op for
>>every page aging is too expensive, then ask it to age more than one
>>page at a time. Or do I miss your meaning?
>>
>
>Please repeat after me:
>
> "THE FS DOES NOT SEE THE MMU ACCESSED BITS"
>
We can't borrow whatever pair of glasses the master VM is using?

>
>
>Also, if a piece of data is in the page cache, it is accessed
>without calling the filesystem code.
>
>
>This means the filesystem doesn't know how often pages are or
>are not used, hence it cannot make the decisions the VM make.
>
>Or do you want to have your own ReiserVM and ReiserPageCache ?
>
>regards,
>
>Rik
>



2002-01-22 23:30:34

by Shawn Starr

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

The only functionality added to the kernel would be an interface for
filesystems to share; it would basically create kpagebuf_* functions.

Shawn.


On Tue, 2002-01-22 at 17:08, Rik van Riel wrote:
> On 22 Jan 2002, Shawn Starr wrote:
>
> > the pagebuf daemon would use try_to_free_pages() periodically in its
> > queue.
>
> So it wouldn't add any functionality to the kernel ?
>
> Rik
> --
> "Linux holds advantages over the single-vendor commercial OS"
> -- Microsoft's "Competing with Linux" document
>
> http://www.surriel.com/ http://distro.conectiva.com/
>
>


2002-01-22 23:35:46

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Wed, 23 Jan 2002, Hans Reiser wrote:

> Let's try a non-reiserfs sub-cache example. Suppose you have a cache
> of objects that are smaller than a page.

> Suppose that there is absolutely no correlation in access between the
> objects that are on the same page. Suppose that this subcache has
> methods for freeing however many of them it wants to free, and it can
> squeeze them together into fewer pages whenever it wants to.

In this case I absolutely agree with you.

In this case it is also _possible_ because all access
to these data structures goes through the filesystem
code, so the filesystem knows exactly which object is
a candidate for freeing and which isn't.

I think the last messages from the thread were a
miscommunication between us -- I was under the impression
that you wanted per-filesystem freeing decisions for things
like page cache pages.

kind regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-22 23:38:14

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On 22 Jan 2002, Shawn Starr wrote:

> The only functionality added to the kernel would be a a interface for
> filesystems to share it would basically create kpagebuf_* functions.

What would these things achieve ?

It would be nice if you could give us a quick explanation of
what exactly kpagebufd is supposed to do, if only so I can
keep that in mind while working on the VM ;)

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-23 00:21:00

by Hans Reiser

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Rik van Riel wrote:

>On Wed, 23 Jan 2002, Hans Reiser wrote:
>
>>Yes, it should get twice as much pressure, but that does not mean it
>>should free twice as many pages, it means it should age twice as many
>>pages, and then the accesses will un-age them.
>>
>>Make more sense now?
>>
>
>So basically you are saying that each filesystem should
>implement the code to age all pages equally and react
>equally to memory pressure ...
>
>... essentially duplicating what the current VM already
>does!
>
>regads,
>
>Rik
>
If the object appropriate for the subcache is either larger (reiser4
slums), or smaller (have to reread that code to remember whether
dentries can reasonably be coded to be squeezed over to other pages, I
think so, if yes then they are an example of smaller, maybe someone can
say something on this) than a page, then you ought to age objects with a
granularity other than that of a page. You can express the aging in
units of pages (and the subcache can convert the units), but the aging
should be applied in units of the object being cached.

Just to confuse things, there are middle ground solutions as well. For
instance, reiser4 slums are variable size, and can even have maximums if
we want it. If we are lazy coders (and we might be), we could even
choose to track aging at page granularity, and be just like the generic
VM code, except for the final flush moment when we will consider
flushing 64 nodes to disk to count as 64 agings that our cache yielded
up as its fair share. With regards to that last sentence, I need more
time to think about whether that is really reasonably optimal to do and
simpler to code.

Consider an analogy with reiser4 plugins. One of my constant battles is
that my programmers want to take all the code that they think most
plugins will have to do, and force all plugin authors to do it that way
by not making the mostly common code part of the generic plugin
templates. The right way to do it is to create generic templates, let
the plugin authors add their couple of function calls that are unique to
their plugin to the generic template code, and get them to use the
generic template for reasons of convenience not compulsion. I am asking
you to create a cache plugin architecture for VM. It will be cool,
people will use it for all sorts of weird and useful optimizations of
obscure but important to someone caches (maybe even dcache if nothing
prevents relocating dcache entries, wish I could remember), trust me.:)
It is probably more important to caches other than ReiserFS that there
be this kind of architecture (we could survive the reduction in
optimality from flushing more than our fair share, it wouldn't kill us,
but I like to ask for the right design on principle, and I think that
for other caches it really will matter. It is also possible that some
future ReiserFS I don't yet imagine will more significantly benefit from
such a right design.)

Ok, so it seems we are much less far apart now than we were
previously. :)

I remain curious about what dinner cooked by you using fresh Brazilian
ingredients tastes like. The tantalizing thought still lurks in the
back of my mind where you planted it.:) I MUST generate a business
requirement for going to Brazil.....:-)

Hans


2002-01-23 01:14:53

by Stuart Young

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

At 05:10 PM 22/01/02 -0500, Richard B. Johnson wrote:
>We need a free-RAM target, possibly based upon a percentage of
>available RAM. The lack of such a target is what causes the
>out-of-RAM condition we have been experiencing. Somebody thought
>that "free RAM is wasted RAM" and the VM has been based upon
>that theory. That theory has been proven incorrect. You need
>free RAM, just like you need "excess horsepower" to make
>automobiles drivable. That free RAM is the needed "rubber-band"
>to absorb the dynamics of real-world systems.

It'd be nice if this cache high/low watermark was adjustable, preferably
through say the sysctl interface, on a running kernel. This would mean that
a competent system administrator could tune the system to their needs. A
decent runscript for a particular program (I'm assuming run as root here)
could adjust the value to absorb the dynamics of a particular program.


Stuart Young - [email protected]
(aka Cefiar) - [email protected]

[All opinions expressed in the above message are my]
[own and not necessarily the views of my employer..]

2002-01-23 05:25:27

by Shawn Starr

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

The VM is busy with other tasks, so why not have a daemon handle pages
delegated from the VM? Having a pagebuf daemon would allow for delayed
writes and perhaps read-ahead buffering of data; having these would take
some pressure off of the VM, no?


On Tue, 2002-01-22 at 18:37, Rik van Riel wrote:
> On 22 Jan 2002, Shawn Starr wrote:
>
> > The only functionality added to the kernel would be a a interface for
> > filesystems to share it would basically create kpagebuf_* functions.
>
> What would these things achieve ?
>
> It would be nice if you could give us a quick explanation of
> what exactly kpagebufd is supposed to do, if only so I can
> keep that in mind while working on the VM ;)
>
> Rik
> --
> "Linux holds advantages over the single-vendor commercial OS"
> -- Microsoft's "Competing with Linux" document
>
> http://www.surriel.com/ http://distro.conectiva.com/
>
>


2002-01-23 09:43:52

by Martin.Knoblauch

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

> Re: Possible Idea with filesystem buffering.
>
> From: Richard B. Johnson ([email protected])
> Date: Tue Jan 22 2002 - 17:10:27 EST
>
>
> We need a free-RAM target, possibly based upon a percentage of
> available RAM. The lack of such a target is what causes the
> out-of-RAM condition we have been experiencing. Somebody thought
> that "free RAM is wasted RAM" and the VM has been based upon
> that theory. That theory has been proven incorrect. You need

Now, I think the theory itself is OK. The problem is that the stuff in
buffer/caches is too sticky. It does not go away when "more important"
uses for memory come up. Or at least it does not go away fast enough.

> free RAM, just like you need "excess horsepower" to make
> automobiles drivable. That free RAM is the needed "rubber-band"
> to absorb the dynamics of real-world systems.
>

Correct. The free target would help to avoid the panic/frenzy that
breaks out when we run out of free memory.

Question: what about just setting a maximum limit on the cache/buffer
size? Either absolute, or as a fraction of total available memory? Sure,
it may be a waste of memory in most situations, but sometimes the
administrator/user of a system simply "knows better" than the FVM (F ==
Fine ? :-)

While we are at it, one could also add a "guaranteed minimum" limit for
the cache/buffer size, thus preventing a complete meltdown of IO
performance. Tru64 has such limits. They are usually at 100% (max) and
I think 20% (min), giving the cache access to all memory. But there were
situations where a max of 10% was just the right thing to do.

I know, the tuning-knob approach is frowned upon. But sometimes there
are workloads where even the best VM may not know how to react
correctly.

Martin
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759

2002-01-23 11:53:07

by Helge Hafting

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Martin Knoblauch wrote:
>
> > Re: Possible Idea with filesystem buffering.
> >
> > From: Richard B. Johnson ([email protected])
> > Date: Tue Jan 22 2002 - 17:10:27 EST
> >
> >
> > We need a free-RAM target, possibly based upon a percentage of
> > available RAM. The lack of such a target is what causes the
> > out-of-RAM condition we have been experiencing. Somebody thought
> > that "free RAM is wasted RAM" and the VM has been based upon
> > that theory. That theory has been proven incorrect. You need
>
As far as I know, there is a free target. The kernel will try to get
rid of old pages (swapout program memory, toss cache pages)
when there's too little free memory around. This keeps memory
around so future allocations and IO requests may start
immediately. Maybe the current target is too small, but it is there.
Without it, _every_ allocation or file operation would block
waiting for a swapout/cache flush in order to get free pages. Linux
isn't nearly _that_ bad.

> Now, I think the theory itself is OK. The problem is that the stuff in
> buffer/caches is to sticky. It does not go away when "more important"
> uses for memory come up. Or at least it does not go away fast enough.
>
Then we need a larger free target to cope with the slow cache freeing.

> > free RAM, just like you need "excess horsepower" to make
> > automobiles drivable. That free RAM is the needed "rubber-band"
> > to absorb the dynamics of real-world systems.
>
> Question: what about just setting a maximum limit to the cache/buffer
> size. Either absolute, or as a fraction of total available memory? Sure,
> it maybe a waste of memory in most situations, but sometimes the
> administrator/user of a system simply "knows better" than the FVM (F ==
> Fine ? :-)
[...]
> I know, the tuning-knob approach is frowned upon. But sometimes there
> are workloads where even the best VM may not know how to react
> correctly.

Wasting memory "in most situations" isn't really an option. But I
see nothing wrong with "knobs" as long as they are automatic by
default. Those who want to optimize for a corner case can
go and turn off the autopilot.

Helge Hafting

2002-01-23 12:02:58

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On Wed, 23 Jan 2002, Helge Hafting wrote:

[free memory is wasted memory]

> > Now, I think the theory itself is OK. The problem is that the stuff in
> > buffer/caches is to sticky. It does not go away when "more important"
> > uses for memory come up. Or at least it does not go away fast enough.
>
> Then we need a larger free target to cope with the slow cache freeing.

Or we make the cache freeing faster. ;)

If you have the time, you might want to try -rmap some
day and see about the cache freeing...

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-23 12:11:58

by Martin.Knoblauch

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Helge Hafting wrote:
>
> Martin Knoblauch wrote:
> >
> > > Re: Possible Idea with filesystem buffering.
> > >
> > > From: Richard B. Johnson ([email protected])
> > > Date: Tue Jan 22 2002 - 17:10:27 EST
> > >
> > >
> > > We need a free-RAM target, possibly based upon a percentage of
> > > available RAM. The lack of such a target is what causes the
> > > out-of-RAM condition we have been experiencing. Somebody thought
> > > that "free RAM is wasted RAM" and the VM has been based upon
> > > that theory. That theory has been proven incorrect. You need
> >
> As far as I know, there is a free target. The kernel will try to get
> rid of old pages (swapout program memory, toss cache pages)
> when there's too little free memory around. This keeps memory
> around so future allocations and IO request may start
> immediately. Maybe the current target is too small, but it is there.
> Without it, _every_ allocation or file operation would block
> waiting for a swapout/cache flush in order to get free pages. Linux
> isn't nearly _that_ bad.
>

Nobody said it is _that_ bad. There are just some [maybe rare]
situations where it falls over and does not recover gracefully.

> > Now, I think the theory itself is OK. The problem is that the stuff in
> > buffer/caches is to sticky. It does not go away when "more important"
> > uses for memory come up. Or at least it does not go away fast enough.
> >
> Then we need a larger free target to cope with the slow cache freeing.
>

And as Rik said, we need to make freeing cache faster. All of this will
help the 98+% cases that the VM can be optimized for. But I doubt that
you can make it 100% and keep it simple at the same time.

> > > free RAM, just like you need "excess horsepower" to make
> > > automobiles drivable. That free RAM is the needed "rubber-band"
> > > to absorb the dynamics of real-world systems.
> >
> > Question: what about just setting a maximum limit to the cache/buffer
> > size. Either absolute, or as a fraction of total available memory? Sure,
> > it maybe a waste of memory in most situations, but sometimes the
> > administrator/user of a system simply "knows better" than the FVM (F ==
> > Fine ? :-)
> [...]
> > I know, the tuning-knob approach is frowned upon. But sometimes there
> > are workloads where even the best VM may not know how to react
> > correctly.
>
> Wasting memory "in most situations" isn't really an option. But I
> see nothing wrong with "knobs" as long as they are automatic by
> default. Those who want to optimize for a corner case can
> go and turn off the autopilot.
>

Definitely. The defaults need to be set for the general case.

Martin
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759

2002-01-23 17:13:49

by Daniel Phillips

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

On January 22, 2002 10:08 pm, Chris Mason wrote:
> On Tuesday, January 22, 2002 11:32:09 PM +0300 Hans Reiser wrote:
> >> Its not about the cost of a function call, it's what the FS does to make
> >> that call useful. Pretend for a second the VM tells the FS everything it
> >> needs to know to age a page (whatever scheme the FS wants to use).
> >>
> >> Then pretend the VM decides there's memory pressure, and tells the FS
> >> subcache to start freeing ram. So, the FS goes through its list of pages
> >> and finds the most suitable one for flushing, but it has no idea how
> >> suitable that page is in comparison with the pages that don't belong to
> >> that FS (or even other pages from different mount points of the same FS
> >> flavor).
> >
> > Why does it need to know how suitable it is compared to the other
> > subcaches? It just ages X pages, and depends on the VM to determine how
> > large X is. The VM pressures subcaches in proportion to their size, it
> > doesn't need to know how suitable one page is compared to another, it
> > just has a notion of push on everyone in proportion to their size.
>
> If subcache A has 1000 pages that are very very active, and subcache B has
> 500 pages that never ever get used, should A get twice as much memory
> pressure? That's what we want to avoid, and I don't see how subcaches
> allow it.

This question at least is not difficult. Pressure (for writeout) should be
applied to each subcache in proportion to its portion of all inactive, dirty
pages in the system.

--
Daniel
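
In code form, Daniel's rule is just the following (a sketch with invented
names):

/*
 * Writeout pressure per subcache, proportional to its share of all
 * inactive, dirty pages in the system.
 */
static unsigned long writeout_quota(unsigned long total_pressure,
                                    unsigned long cache_inactive_dirty,
                                    unsigned long all_inactive_dirty)
{
        if (!all_inactive_dirty)
                return 0;
        return total_pressure * cache_inactive_dirty / all_inactive_dirty;
}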

2002-01-23 17:18:10

by Josh MacDonald

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Quoting Rik van Riel ([email protected]):
> On Tue, 22 Jan 2002, Hans Reiser wrote:
> > Rik van Riel wrote:
> > >On Tue, 22 Jan 2002, Chris Mason wrote:
>
> > >>The FS doesn't know how long a page has been dirty, or how often it
> > >>gets used,
> > >
> > >In an efficient system, the FS will never get to know this, either.
> >
> > I don't understand this statement. If dereferencing a vfs op for
> > every page aging is too expensive, then ask it to age more than one
> > page at a time. Or do I miss your meaning?
>
> Please repeat after me:
>
> "THE FS DOES NOT SEE THE MMU ACCESSED BITS"
>
> Also, if a piece of data is in the page cache, it is accessed
> without calling the filesystem code.
>
>
> This means the filesystem doesn't know how often pages are or
> are not used, hence it cannot make the decisions the VM make.
>
> Or do you want to have your own ReiserVM and ReiserPageCache ?

Rik,

We think there are good reasons for the FS to know when and how
its data is accessed, although this issue is less significant than
the semantics of writepage() being discussed in this thread.

Referring to the transaction design document I posted several months
ago:

http://marc.theaimsgroup.com/?l=linux-kernel&m=100510090926874&w=2

Our intention is for the file system to be capable of tracking
read and write data-dependencies so that it can safely defer
writing batches of data and still guarantee consistent crash
recovery from the application's point of view. The interaction
between transactions and mmaped regions may leave something to
be desired, and we may not need to know how often pages are or
are not used, but we would like to know which pages are read and
written by whom, even for the case of a page cache hit.

-josh

--
PRCS version control system http://sourceforge.net/projects/prcs
Xdelta storage & transport http://sourceforge.net/projects/xdelta
Need a concurrent skip list? http://sourceforge.net/projects/skiplist

2002-01-23 20:36:04

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: Possible Idea with filesystem buffering.

Hi,

On Tue, Jan 22, 2002 at 05:03:02PM -0200, Rik van Riel wrote:
> On Tue, 22 Jan 2002, Andrew Morton wrote:
> > Hans Reiser wrote:
> >
> > Note that writepage() doesn't get used much. Most VM-initiated
> > filesystem writeback activity is via try_to_release_page(), which
> > has somewhat more vague and flexible semantics.
>
> We may want to change this though, or at the very least get
> rid of the horrible interplay between ->writepage and
> try_to_release_page() ...

This is actually really important --- writepage on its own cannot
distinguish between requests to flush something to disk (eg. msync or
fsync), and requests to evict dirty data from memory.

This is really important for ext3's data journaling mode --- syncing
to disk only requires flushing as far as the journal, but evicting
dirty pages requires a full writeback too. That's one place where our
traditional VM notion of writepage just isn't quite fine-grained
enough.

Cheers,
Stephen

2002-01-23 20:52:58

by Hans Reiser

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: Possible Idea with filesystem buffering.

Stephen C. Tweedie wrote:

>Hi,
>
>On Tue, Jan 22, 2002 at 05:03:02PM -0200, Rik van Riel wrote:
>
>>On Tue, 22 Jan 2002, Andrew Morton wrote:
>>
>>>Hans Reiser wrote:
>>>
>>>Note that writepage() doesn't get used much. Most VM-initiated
>>>filesystem writeback activity is via try_to_release_page(), which
>>>has somewhat more vague and flexible semantics.
>>>
>>We may want to change this though, or at the very least get
>>rid of the horrible interplay between ->writepage and
>>try_to_release_page() ...
>>
>
>This is actually really important --- writepage on its own cannot
>distinguish between requests to flush something to disk (eg. msync or
>fsync), and requests to evict dirty data from memory.
>
>This is really important for ext3's data journaling mode --- syncing
>to disk only requires flushing as far as the journal, but evicting
>dirty pages requires a full writeback too. That's one place where our
>traditional VM notion of writepage just isn't quite fine-grained
>enough.
>
>Cheers,
> Stephen
>
>
I think this is a good point Stephen is making.

So we have:

* write this particular page at this particular memory address (for DMA
setup or other reasons).

* write the data on this page

* apply X units of aging pressure to the subcache if it is distinct from
the general cache and supports a pressure operation.

as the three distinct needs we need to serve in the design of the
interface.
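
A sketch of an ops table serving those three needs might look like this
(invented names, not a proposal for the actual address_space operations):

struct cache_pressure_ops {
        /* 1: write this particular page at this particular address, and clean it. */
        int (*evict_page)(struct page *page);
        /* 2: write the data on this page (e.g. only as far as the journal). */
        int (*sync_page_data)(struct page *page);
        /* 3: apply X pages' worth of aging to a self-managed subcache. */
        void (*apply_pressure)(void *subcache, unsigned long pages);
};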


Rik, are you comfortable now with the cache plugin approach I am
advocating, now that I have explained it is motivated by the need to
handle objects that are not flushed in pages? You have had another day
to think about it, and you didn't quite say yes (though it did seem you
no longer think me crazy).

Hans

2002-01-23 21:03:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: Possible Idea with filesystem buffering.

"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Tue, Jan 22, 2002 at 05:03:02PM -0200, Rik van Riel wrote:
> > On Tue, 22 Jan 2002, Andrew Morton wrote:
> > > Hans Reiser wrote:
> > >
> > > Note that writepage() doesn't get used much. Most VM-initiated
> > > filesystem writeback activity is via try_to_release_page(), which
> > > has somewhat more vague and flexible semantics.
> >
> > We may want to change this though, or at the very least get
> > rid of the horrible interplay between ->writepage and
> > try_to_release_page() ...
>
> This is actually really important --- writepage on its own cannot
> distinguish between requests to flush something to disk (eg. msync or
> fsync), and requests to evict dirty data from memory.
>
> This is really important for ext3's data journaling mode --- syncing
> to disk only requires flushing as far as the journal, but evicting
> dirty pages requires a full writeback too. That's one place where our
> traditional VM notion of writepage just isn't quite fine-grained
> enough.

And we currently use PF_MEMALLOC to work out which context
we're being called from. Sigh.

I wish I'd taken better notes of all the square pegs which
ext3 had to push into the kernel's round holes. But there
were so many :)

-
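
The PF_MEMALLOC test Andrew mentions amounts to roughly this inside a
filesystem's writepage (a sketch; the surrounding function is invented):

#include <linux/mm.h>
#include <linux/sched.h>

static int example_writepage(struct page *page)
{
        if (current->flags & PF_MEMALLOC) {
                /* called from memory reclaim: really evict, page must end up clean */
        } else {
                /* called from msync/fsync style writeback */
        }
        return 0;
}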

2002-01-23 23:52:24

by Hugh Dickins

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: Possible Idea with filesystem buffering.

On Wed, 23 Jan 2002, Stephen C. Tweedie wrote:
>
> This is actually really important --- writepage on its own cannot
> distinguish between requests to flush something to disk (eg. msync or
> fsync), and requests to evict dirty data from memory.

Actually, that much can now be distinguished:
PageLaunder(page) when evicting from memory,
!PageLaunder(page) when msync or fsync.

Hugh
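
In other words, the same writepage sketch as above, but keyed on 2.4's
PG_launder bit rather than PF_MEMALLOC (sketch only, fs-specific parts
omitted):

#include <linux/mm.h>

static int example_writepage(struct page *page)
{
        if (PageLaunder(page)) {
                /* VM is evicting: full writeback, the page must end up clean */
        } else {
                /* msync or fsync: flushing as far as the journal would do */
        }
        return 0;
}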

2002-01-24 00:02:36

by Jeff Garzik

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: Possible Idea with filesystem buffering.

Hugh Dickins wrote:
>
> On Wed, 23 Jan 2002, Stephen C. Tweedie wrote:
> >
> > This is actually really important --- writepage on its own cannot
> > distinguish between requests to flush something to disk (eg. msync or
> > fsync), and requests to evict dirty data from memory.
>
> Actually, that much can now be distinguished:
> PageLaunder(page) when evicting from memory,
> !PageLaunder(page) when msync or fsync.

Nifty! Thanks for pointing this out.

--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com

2002-01-24 14:00:31

by Martin.Knoblauch

[permalink] [raw]
Subject: Re: Possible Idea with filesystem buffering.

Helge Hafting wrote:
>
> Martin Knoblauch wrote:
>
> > you are correct in stating that it is [still] true that free memory is
> > wasted memory. The problem is that the "pool of trivially-freeable
> > pages" is under certain circumstances apprently not trivially-freeable
> > enough. And the pool has the tendency to push out processes into swap.
> > OK, most times these processes have been incative for quite some time,
> > but - and this is my opinion based on quite a few years in this field -
> > it should never do this. Task memory is "more valuable" than
> ^^^^^
> > buffer/cache memory. At least I want (demand :-) a switch to make the VM
> > behave that way.
>
> More valuable perhaps, but infinitely more valuable?

It depends on the situation :-) That's why I want to be able to tell the
VM that it should keep its greedy fingers off task memory.

> Do you want swapping to happen _only_ if the process memory alone
> overflows available memory? Note that you'll get really unuseable
> fs performance if the page cache _never_ may push anything into
> swap. That means you have _no_ cache left as soon as process
> memory fills RAM completely. No cache at all.
>

There are border cases where I may live better with only very few cache
pages left. Most likely not on a web server or similar. But there are
applications in the HPTC field where FS caching is useless, unless you
can pack everything in cache.

In a former life I benchmarked an out-of-core FEM solver on
Alpha/Tru64. Sure, when we could stick the whole scratch dataset in
cache the performance was awesome. Unfortunately the dataset was about
40 GB and we only had 16 GB available (several constraints, one of them
the price the customer was willing to pay :-(. The [very very well
tuned] IO pattern on the scratch dataset resulted in optimal performance
when the cache was turned off. The optimal system would have been the
next-higher-class box with 48 GB of memory and a 40 GB ramdisk. Of
course, we couldn't propose that :-((

That is why I want to be able to set maximum *and* minimum cache size.
The maximum setting helps me tune application performance (setting it
to 100% just means the current behaviour) and setting the minimum guarantees
at least some minimal FS performance.

> The balance between cache and other memory may need tweaking,
> but don't bother going too far.
>

As I said, it depends on the situation. I am happy when 98+% of the
systems can run happily with the defaults. But for the last 2% the
tuning-knobs come in handy. And sure - use at your own risk. All warranties
void.

> > And yes, quite a few smart people are working on it. But the progress
> > in the 2.4.x series is pretty slow and the direction still seems to be
> > unclear.
>
> There are both aa patches and Rik's rmap patch. Hard to say who "wins",
> but you can influence the choice by testing one or both and post
> your findings.
>

Some of us try. Up to the point where we become annoying :-) Personally
I do not think that one of them needs to win. There is very
useful/successful stuff in both of them - not to forget that -aa is much
more than just VM work. I see it this way: -aa is an approach to fix the
obvious bugs and make the current system behave great/better/acceptable.
rmap is more on the infrastructure side - enabling new stuff to be done.
Similar things could be said about preempt vs. ll in some other
much-too-long thread.

Martin
PS: Putting lkml back
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759