LinuxLists.cc - [RFC] Optimization for use-once pages

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001 [email protected] wrote:
> On Tue, Jul 24, 2001 at 05:47:30AM +0200, Daniel Phillips wrote:
>
> > So I decided to look for a new way of approaching the use-once problem
> > that would easily integrated with our current approach. What I came
> > up with is pretty simple: instead of starting a newly allocated page on
> > the active ring, I start it on the inactive queue with an age of zero.
> > Then, any time generic_file_read or write references a page, if its
> > age is zero, set its age to START_AGE and mark it as unreferenced.
>
> This is wonderfully simple and ellegant.

It's a tad incorrect, though.

If a page gets 2 1-byte reads in a microsecond, with this
patch it would get promoted to the active list, even though
it's really only used once.

Preferably we'd want to have a "new" list of, say, 5% maximum
of RAM size, where all references to a page are ignored. Any
references made after the page falls off that list would be
counted for page replacement purposes.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 17:08:12

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

[ CC: to linux-kernel added back since the topic isn't flammable ;) ]

On Tue, 24 Jul 2001, Stuart MacDonald wrote:
> From: "Rik van Riel" <[email protected]>
> > If a page gets 2 1-byte reads in a microsecond, with this
> > patch it would get promoted to the active list, even though
> > it's really only used once.
>
> I hate to bother you directly, but I don't wish to start a flame
> war on lkml. How exactly would you explain two accesses as
> being "used once"?

Because they occur in a very short interval, an interval MUCH
shorter than the time scale in which the VM subsystem looks at
referenced bits, etc...

Generally a CPU doesn't read more than one cache line at a time,
so I guess all "single references" are in reality multiple
accesses very shortly following each other.

This seems to be generally accepted theory and practice in the
field of page replacement.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 17:39:02

by Stuart MacDonald

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

From: "Rik van Riel" <[email protected]>
> Because they occur in a very short interval, an interval MUCH
> shorter than the time scale in which the VM subsystem looks at
> referenced bits, etc...

So what you're sayng is that any number of accesses, as long
as they all occur within the interval < VM subsystem scanning
interval, are all counted as one?

> This seems to be generally accepted theory and practice in the
> field of page replacement.

Okay, just seems odd to me, but IANAVMGuru.

..Stu

2001-07-24 17:40:52

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Linus Torvalds wrote:

> Hey, this looks _really_ nice. I never liked the special-cases
> that you removed (drop_behind in particular), and I have to say
> that the new code looks a lot saner, even without your extensive
> description and timing analysis.

Fully agreed, drop_behind is an ugly hack. The sooner
it dies the happier I am ;)

> Please people, test this out extensively - I'd love to integrate
> it, but while it looks very sane I'd really like to hear of
> different peoples reactions to it under different loads.

The one thing which has always come up in LRU/k and 2Q
papers is that the "first reference" can really be a
series of references in a very short time.

Counting only the very first reference will fail if we
do eg. sequential IO with non-page aligned read() calls,
which doesn't look like it's too uncommon.

In order to prevent this from happening, either the system
counts all first references in a short timespan (complex to
implement) or it has the new pages on a special - small fixed
size - page list and all references to the page while on that
list are ignored.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 17:45:12

by Mike Castle

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, Jul 24, 2001 at 02:07:32PM -0300, Rik van Riel wrote:
> > I hate to bother you directly, but I don't wish to start a flame
> > war on lkml. How exactly would you explain two accesses as
> > being "used once"?
>
> Because they occur in a very short interval, an interval MUCH
> shorter than the time scale in which the VM subsystem looks at
> referenced bits, etc...

Would mmap() then a for(;;) loop over the range be an example of such a
use?

mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

2001-07-24 17:51:42

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Stuart MacDonald wrote:
> From: "Rik van Riel" <[email protected]>
> > Because they occur in a very short interval, an interval MUCH
> > shorter than the time scale in which the VM subsystem looks at
> > referenced bits, etc...
>
> So what you're sayng is that any number of accesses, as long
> as they all occur within the interval < VM subsystem scanning
> interval, are all counted as one?

Actually, the length of this interval could be even smaller
and is often a point of furious debating.

The 2Q algorithm seems to have solved this problem by not
using an interval, but a FIFO queue of small, fixed length.

> > This seems to be generally accepted theory and practice in the
> > field of page replacement.
>
> Okay, just seems odd to me, but IANAVMGuru.

Seems odd at first glance, true.

Let me give you an example:

- sequential access of a file
- script reads the file in 80-byte segments
(parsing some arcane data structure)
- these segments are accessed in rapid succession
- each 80-byte segment is accessed ONCE

In this case, even though the data is accessed only
once, each page is touched PAGE_SIZE/80 times, with
one 80-byte read() each time.

I hope this example gives you some holding point to
make it easier and grasp this - somewhat counter
intuitive - concept.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 17:53:22

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Mike Castle wrote:
> On Tue, Jul 24, 2001 at 02:07:32PM -0300, Rik van Riel wrote:
> > > I hate to bother you directly, but I don't wish to start a flame
> > > war on lkml. How exactly would you explain two accesses as
> > > being "used once"?
> >
> > Because they occur in a very short interval, an interval MUCH
> > shorter than the time scale in which the VM subsystem looks at
> > referenced bits, etc...
>
> Would mmap() then a for(;;) loop over the range be an example of
> such a use?

Yes. The problem here, however, would be that we'll only find
the referenced bit in the page table some time later.

This is another reason for having a "new" queue like 2Q's A1in
queue. It means we can both count all accesses in the first short
period as one access AND we can avoid actually doing anything
special as these accesses happen....

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 18:06:22

by Stuart MacDonald

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

From: "Rik van Riel" <[email protected]>
> Actually, the length of this interval could be even smaller
> and is often a point of furious debating.

Which is why I was avoiding flames earlier; I sorta
figured this might be a hot issue.

> Let me give you an example:
>
> - sequential access of a file
> - script reads the file in 80-byte segments
> (parsing some arcane data structure)
> - these segments are accessed in rapid succession
> - each 80-byte segment is accessed ONCE
>
> In this case, even though the data is accessed only
> once, each page is touched PAGE_SIZE/80 times, with
> one 80-byte read() each time.

I'd figured that sort of scenario was the basis for counting
them all as one. What about

- random access of same file
- script reads one arcane 80 byte struct
- script updates that struct say PAGE_SIZE/80
times, with one 80-byte write() each time

If they're all counted as one, would the page age
correctly, if the script happens to take a few seconds
break between another flurry of all-as-one updating?

..Stu

2001-07-24 18:10:22

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tuesday 24 July 2001 19:04, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Linus Torvalds wrote:
> > Hey, this looks _really_ nice. I never liked the special-cases
> > that you removed (drop_behind in particular), and I have to say
> > that the new code looks a lot saner, even without your extensive
> > description and timing analysis.
>
> Fully agreed, drop_behind is an ugly hack. The sooner
> it dies the happier I am ;)
>
> > Please people, test this out extensively - I'd love to integrate
> > it, but while it looks very sane I'd really like to hear of
> > different peoples reactions to it under different loads.
>
> The one thing which has always come up in LRU/k and 2Q
> papers is that the "first reference" can really be a
> series of references in a very short time.

Yes, I thought about that but decided to try to demonstrate the
concept in its simplest form, and if things worked out, go ahead
and try to refine it.

Memory-mapped files have to be handled too. One possible way to
go at it is to do the test not against the current page being
handled by generic_* but against the page already on the head of
the inactive_dirty list, at the time the *next* page is queued.
This introduces a slight delay, time enough for several programmed
IO operations to complete. It will also work for mmap. As a
bonus, the code might even get cleaner because all the use_once
tests are gathered together into a single place.

> Counting only the very first reference will fail if we
> do eg. sequential IO with non-page aligned read() calls,
> which doesn't look like it's too uncommon.

Yep. We should also look at some statistics. So far I've just
used a single number: execution time.

> In order to prevent this from happening, either the system
> counts all first references in a short timespan (complex to
> implement) or it has the new pages on a special - small fixed
> size - page list and all references to the page while on that
> list are ignored.

Yes, those are both possibilities.

--
Daniel

2001-07-24 18:15:22

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Daniel Phillips wrote:
> On Tuesday 24 July 2001 19:04, Rik van Riel wrote:

> Memory-mapped files have to be handled too.

> > In order to prevent this from happening, either the system
> > counts all first references in a short timespan (complex to
> > implement) or it has the new pages on a special - small fixed
> > size - page list and all references to the page while on that
> > list are ignored.
>
> Yes, those are both possibilities.

The special "new pages" list handles the mmap() situation
automatically.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 18:15:52

by Mike Castle

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, Jul 24, 2001 at 02:51:03PM -0300, Rik van Riel wrote:
> Let me give you an example:
>
> - sequential access of a file
> - script reads the file in 80-byte segments
> (parsing some arcane data structure)
> - these segments are accessed in rapid succession
> - each 80-byte segment is accessed ONCE
>
> In this case, even though the data is accessed only
> once, each page is touched PAGE_SIZE/80 times, with
> one 80-byte read() each time.

Of course, such a program is poorly written anyway. That many individual
system calls! At least rewrite to use fread(). ;->

Change the example to using mmap(), and one feels less likely throttle the
programmer.

mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

2001-07-24 18:22:02

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Mike Castle wrote:
> On Tue, Jul 24, 2001 at 02:51:03PM -0300, Rik van Riel wrote:
> > Let me give you an example:
>
> Of course, such a program is poorly written anyway. That many
> individual system calls!

That's not the point I was trying to make ;))

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 18:45:01

by Dr. Kelsey Hudson

[permalink] [raw]

Subject: adaptec aha152x.o stops working with AVA-1505 on 2.4.7

I compiled aha152x as a module and usually have it load at boot. However,
with kernel 2.4.7, the machine hangs in some in-between state when the
module is loaded. No oops or panic is printed -- the machine simply hangs
and does nothing. The keyboard is responsive however but it appears as
though nothing gets through to the kernel.

Furthermore, nothing is printed in the logs about any such error. I'm
wondering if this is somehow related to the recent SCSI layer cleanups?

The machine is an SMP Pentium III with 640 megabytes of RAM. If you need
any more details, e.g. .config, lspci, etc. let me know and I'll provide
them. For the moment, I have removed the controller card from the machine
simply because it was pissing me off to no extent and I didn't want to
deal with it any longer. Fortunately, the only device connected to it is
an optical disk drive that gets very infrequent use, so I can live without
it, for a while.

LMK what you come up with. TIA,

Kelsey Hudson [email protected]
Software Engineer
Compendium Technologies, Inc (619) 725-0771
---------------------------------------------------------------------------

2001-07-24 18:57:11

by Martin Devera

[permalink] [raw]

Subject: Patch for better ip_dynaddr for 2.2.19

I'm sorry that it is for old 2.2.x but I have 2.2.19 router
as MASQ box for our microwave line. I setup automatic backup
via ISDN but I find that masq connections stops working when
ISDN gets dialed.
It is because routing is changed and maddr field is bad in
ip_masq struct.
There is existing solution: sysctl_ip_dynaddr. But it replaces
maddr only on connections where no reply was recieved from outside.
It is true for machine connected via diald for example but when
changing LIVE route it simply doesn't work.
I patched 3 files with very short patch. It allows you to add
4 to sysctl_ip_dynaddr flag and it skips no-reply test.
It works well now.
Patched version is bacward compatible and there is no speed
penalty (well almost).
What about 2.2.20 inclusion ? It this correct place to send patch to ?

devik

--------
diff -ru linux/net/ipv4/ip_masq.c gate2/net/ipv4/ip_masq.c
--- linux/net/ipv4/ip_masq.c Sun Mar 25 18:37:41 2001
+++ gate2/net/ipv4/ip_masq.c Tue Jul 24 20:05:21 2001
@@ -50,6 +50,7 @@
* Kai Bankett : do not toss other IP protos in proto_doff()
* Dan Kegel : pointed correct NAT behavior for UDP streams
* Julian Anastasov : use daddr and dport as hash keys
+ * Martin Devera : extended sysctl_ip_dynaddr functionality
*
*/

@@ -1236,9 +1237,9 @@
* in this tunnel and routing iface address has changed...
* "You are welcome, diald".
*/
- if ( sysctl_ip_dynaddr && ms->flags & IP_MASQ_F_NO_REPLY && maddr != ms->maddr) {
-
- if (sysctl_ip_dynaddr > 1) {
+ if ( sysctl_ip_dynaddr && (sysctl_ip_dynaddr & 4 ||
+ ms->flags & IP_MASQ_F_NO_REPLY) && maddr != ms->maddr) {
+ if ((sysctl_ip_dynaddr & 3) > 1) {
IP_MASQ_INFO( "ip_fw_masquerade(): change masq.addr from %d.%d.%d.%d to %d.%d.%d.%d\n",
NIPQUAD(ms->maddr),NIPQUAD(maddr));
}
@@ -1527,9 +1528,9 @@
* in this tunnel and routing iface address has changed...
* "You are welcome, diald".
*/
- if ( sysctl_ip_dynaddr && ms->flags & IP_MASQ_F_NO_REPLY && maddr != ms->maddr) {
-
- if (sysctl_ip_dynaddr > 1) {
+ if ( sysctl_ip_dynaddr && (sysctl_ip_dynaddr & 4 ||
+ ms->flags & IP_MASQ_F_NO_REPLY) && maddr != ms->maddr) {
+ if ((sysctl_ip_dynaddr & 3) > 1) {
IP_MASQ_INFO( "ip_fw_masq_icmp(): change masq.addr %d.%d.%d.%d to %d.%d.%d.%d",
NIPQUAD(ms->maddr), NIPQUAD(maddr));
}
diff -ru linux/net/ipv4/tcp_ipv4.c gate2/net/ipv4/tcp_ipv4.c
--- linux/net/ipv4/tcp_ipv4.c Sun Mar 25 18:37:41 2001
+++ gate2/net/ipv4/tcp_ipv4.c Tue Jul 24 20:12:06 2001
@@ -1889,7 +1889,7 @@
}

if (new_saddr != sk->saddr) {
- if (sysctl_ip_dynaddr > 1) {
+ if ((sysctl_ip_dynaddr & 3) > 1) {
printk(KERN_INFO "tcp_v4_rebuild_header(): shifting sk->saddr "
"from %d.%d.%d.%d to %d.%d.%d.%d\n",
NIPQUAD(sk->saddr),
--- linux/Documentation/networking/ip_dynaddr.txt Sun Mar 25 18:31:57 2001
+++ gate2/Documentation/networking/ip_dynaddr.txt Tue Jul 24 20:44:42 2001
@@ -23,6 +23,12 @@
# echo 2 > /proc/sys/net/ipv4/ip_dynaddr
To disable (default)
# echo 0 > /proc/sys/net/ipv4/ip_dynaddr
+ To always rewrite MASQ connections instead only when no reply
+ packet is recieved (needed to be able to change routing, for
+ example when doing backup-line) add 4 to flag, like:
+ # echo 5 > /proc/sys/net/ipv4/ip_dynaddr
+ or
+ # echo 6 > /proc/sys/net/ipv4/ip_dynaddr

Enjoy!

---------------

2001-07-24 19:33:14

by J.C. Wren

[permalink] [raw]

Subject: Question about termios parameters

This may not be the best place to ask this, but I've done the research,
can't come up with an answer, and don't know a better group of people to
ask.

I have an embedded Linux device (2.2.12 kernel, BlueCat distro) that uses a
serial port for the console. When the box comes up, rc.sysinit starts the
application as a detached process (my_program&). When the program spits out
periodic status reports, the \n is not being mapped to CR-LF (i.e., I'm
getting only linefeeds).

Once you log on to the box (via login, into bash), the output becomes
correctly cooked.

I've tried twiddling termios parameters for OPOST and ONLCR, but it has no
effect. Trying to have the application run "stty -a" via a system() call
reports an error regarding it can't get the parameters for stdin.

What parameters are required to be set for a detached process started via
init to correctly have it's output mapped from \n to CR-NL?

--John

2001-07-24 19:41:34

by David E. Weekly

[permalink] [raw]

Subject: Is /dev/epoll scheduled for official inclusion anytime soon?

Hello all.

I've been playing around with Davide Libenzi's "/dev/epoll" patch and have
been very impressed by the performance figures (see his paper at
http://www.xmailserver.org/linux-patches/nio-improve.html for some
numbers) - it seems to exhibit really excellent scaling.

Given the sheer utility of using /dev/epoll in a largescale server, are
there any plans to roll it into the mainline kernel at any point? If not,
why / is there an equivalent scheduled for inclusion?

Clearly, IMHO, Linux needs something that can scale as well as BSD's kqueue
and Solaris's /dev/poll. /dev/epoll seems to be an excellent answer.

Sincerely,
David E. Weekly

PS: Cheers to Anton Altaparmikov's work on NTFS: In 2.4.7, we now have
rather robust read/write support of NT filesystems!

2001-07-24 20:06:21

[permalink] [raw]

Subject: Re: Is /dev/epoll scheduled for official inclusion anytime soon?

On Tue, 24 Jul 2001, David E. Weekly wrote:

> I've been playing around with Davide Libenzi's "/dev/epoll"

> Given the sheer utility of using /dev/epoll in a largescale
> server, are there any plans to roll it into the mainline kernel
> at any point?

So why don't you ask Davide if he has any plans to submit
it for inclusion into the kernel? ;)

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 20:26:50

by hiufungeric.tse

[permalink] [raw]

Subject: linux partitioning in IA64

Hei,

I am trying to install seawolf(linux IA64) on an IA 64 machine. For
partitioning I went through several trails without success. (total # disk space
is 10G). After all the initial settings, I restarted the machine to EFI (boot
manager). Can't boot to linux and the EFI can't recognize files the
installation CD initially saved on the hard disk. Are there special ways or
tricks to partition the disk and to make it work?

thanks in advance
hiu fung

2001-07-24 20:28:20

by Patrick Dreker

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Hello...

Am Dienstag, 24. Juli 2001 18:48 schrieb Linus Torvalds:
> Please people, test this out extensively - I'd love to integrate it, but
> while it looks very sane I'd really like to hear of different peoples
> reactions to it under different loads.
I just decided to give this patch a try, as I have written a little
application which does some statistics over traces dumped by another program
by mmap()ing a large file and reading it sequentially. The trace file to be
analyzed is about 240 megs in size and consists of records each 249 bytes
long. The analysis program opens and the mmap()s the trace file doing some
small calculations on the data (basically it adds up fields from the records
to get overall values).

I have tested this on my Athlon 600 with 128 Megs of RAM, and it does not
make any difference whether I use plain 2.4.7 or 2.4.5-use-once. The program
always takes about 1 minute 6 seconds (+- 2 seconds) to complete, and the
machine starts swapping out stuff (thus I have omitted further stats like
vmstat output). I have just taken another look into my program to verify it
does not do something silly, like keeping old data around, but the program
cycle is always the same: copy a record from the mmap into a struct, perform
analysis, and copy next record. The struct is always reused for the next
struct (so there is only one struct at any time).

I can do further tests, if someone asks me to. I could even modify the
analysis program to check changes in behaviour...

> Linus

Patrick
--

2001-07-24 20:32:40

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Patrick Dreker wrote:

[snip program with mmap()]

> I have tested this on my Athlon 600 with 128 Megs of RAM, and it
> does not make any difference whether I use plain 2.4.7 or
> 2.4.5-use-once.

As expected. Only programs using generic_file_{read,write}()
will be impacted at the moment.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 20:35:10

by Linus Torvalds

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Patrick Dreker wrote:
>
> I just decided to give this patch a try, as I have written a little
> application which does some statistics over traces dumped by another program
> by mmap()ing a large file and reading it sequentially. The trace file to be
> analyzed is about 240 megs in size and consists of records each 249 bytes
> long. The analysis program opens and the mmap()s the trace file doing some
> small calculations on the data (basically it adds up fields from the records
> to get overall values).

Note that the patch won't do much anything for the mmap() case - the VM
doesn't consider that "use once" anyway, and trying to make it do so would
be hard, I suspect. It's just very nasty to try to figure out whether the
user has touched the page a single time or millions of times...

We do have the "accessed" bit, but in order to get any access patterns
we'd have to scan the page tables a _lot_ more than we do now. Right now
we do it only under memory pressure, and only about once per memory cycle,
which is not really even remotely enough to get anything more than "yes,
the user did touch the page"..

In order for mmap() to make a difference with the new code, we could do
something like look at the adjacent VM pages on page-in. That, together
with VM hints, might well be able to do similar things for mmap. But at a
noticeably higher cost than what the current code has.

Linus

2001-07-24 21:58:00

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Daniel Phillips wrote:

> Today's patch tackles the use-once problem, that is, the problem of
> how to identify and discard pages containing data likely to be used
> only once in a long time, while retaining pages that are used more
> often.
>
> I'll try to explain not only what I did, but the process I went
> through to arrive at this particular approach. This requires some
> background.

Well, as I see the patch should remove the problem where drop_behind()
deactivates pages of a readahead window even if some of those pages are
not "used-once" pages, right ?

I just want to make sure the performance improvements you're seeing caused
by the fix of this _particular_ problem.

If we knew the amount of non-used-once pages which drop_behind() is
deactivating under _your_ tests, we could make absolute sure about the
problem.

2001-07-24 22:06:10

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Daniel Phillips wrote:
>
> > Today's patch tackles the use-once problem, that is, the problem of
>
> Well, as I see the patch should remove the problem where
> drop_behind() deactivates pages of a readahead window even if
> some of those pages are not "used-once" pages, right ?
>
> I just want to make sure the performance improvements you're
> seeing caused by the fix of this _particular_ problem.

Fully agreed.

Especially since it was a one-liner change from worse
performance to better performance (IIRC) it would be
nice to see exactly WHY the system behaves the way it
does. ;)

Reading a bunch of 2Q, LRU/k, ... papers and thinking
about the problem very carefully should help us a bit
in this. Lots of researches have already looked into
this particular problem in quite a lot of detail.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 22:20:36

by Patrick Dreker

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Am Dienstag, 24. Juli 2001 22:32 schrieb Rik van Riel:
> On Tue, 24 Jul 2001, Patrick Dreker wrote:
>
> [snip program with mmap()]
>
> > I have tested this on my Athlon 600 with 128 Megs of RAM, and it
> > does not make any difference whether I use plain 2.4.7 or
> > 2.4.5-use-once.
>
> As expected. Only programs using generic_file_{read,write}()
> will be impacted at the moment.
D'oh... got some thing wrong there it seems :-(
Anyways, I still have some old routines in the program doing the same via
read(). I have tried that (although the program is pretty silly, as it reads
the 240megs in 4 byte chunks.... the call overhead probably is the bigger
problem there...) and it improved the runtime by aproximately 20% using the
use_once patch.

linux-2.4.7:
22.300u 135.310s 2:41.38 97.6% 0+0k 0+0io 110pf+0w

linux-2.4.5-use_once:
14.980u 108.870s 2:09.79 95.4% 0+0k 0+0io 200pf+0w

Both measurements taken after reboot and while running KDE2.2
As stated: I am still willing to do further experiments... (read()ing larger
chunks at once?)

--
Patrick Dreker

2001-07-24 22:25:26

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Wed, 25 Jul 2001, Patrick Dreker wrote:

> Anyways, I still have some old routines in the program doing the
> same via read(). I have tried that (although the program is
> pretty silly, as it reads the 240megs in 4 byte chunks.... the
> call overhead probably is the bigger problem there...) and it
> improved the runtime by aproximately 20% using the use_once
> patch.
>
> linux-2.4.7:
> 22.300u 135.310s 2:41.38 97.6% 0+0k 0+0io 110pf+0w
>
> linux-2.4.5-use_once:
> 14.980u 108.870s 2:09.79 95.4% 0+0k 0+0io 200pf+0w

COOL ....

> Both measurements taken after reboot and while running KDE2.2 As
> stated: I am still willing to do further experiments...
> (read()ing larger chunks at once?)

Could you try reading 4kB chunks at once ?

That way you should truly only touch each page once.

(using smaller chunks, or chunks which aren't a
multiple of 4kB should break the current code)

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 22:24:05

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Rik van Riel wrote:

> On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> >
> > > Today's patch tackles the use-once problem, that is, the problem of
> >
> > Well, as I see the patch should remove the problem where
> > drop_behind() deactivates pages of a readahead window even if
> > some of those pages are not "used-once" pages, right ?
> >
> > I just want to make sure the performance improvements you're
> > seeing caused by the fix of this _particular_ problem.
>
> Fully agreed.
>
> Especially since it was a one-liner change from worse
> performance to better performance (IIRC) it would be
> nice to see exactly WHY the system behaves the way it
> does. ;)

Yes.

Daniel's patch adds "drop behind" (that is, adding swapcache
pages to the inactive dirty) behaviour to swapcache pages.

This is a _new_ thing, and I would like to know how that is changing the
whole VM behaviour..

2001-07-24 22:27:36

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Marcelo Tosatti wrote:

> Daniel's patch adds "drop behind" (that is, adding swapcache
> pages to the inactive dirty) behaviour to swapcache pages.
>
> This is a _new_ thing, and I would like to know how that is
> changing the whole VM behaviour..

It means that swap cache pages get about 1 second
to be used (in theory, under load) and after that
they can get evicted.

Should show up nicely with your VM stats patch...

Lets try to get the VM statistics patch merged,
after that we can talk about this stuff without
having to use our phantasy as much as we use it
now ;))

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-24 23:08:31

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tuesday 24 July 2001 22:27, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > Today's patch tackles the use-once problem, that is, the problem of
> > how to identify and discard pages containing data likely to be used
> > only once in a long time, while retaining pages that are used more
> > often.
> >
> > I'll try to explain not only what I did, but the process I went
> > through to arrive at this particular approach. This requires some
> > background.
>
> Well, as I see the patch should remove the problem where
> drop_behind() deactivates pages of a readahead window even if some of
> those pages are not "used-once" pages, right ?

Yes, that was a specific goal.

> I just want to make sure the performance improvements you're seeing
> caused by the fix of this _particular_ problem.

Of course I chose the test load to show the benefit clearly, but I
also tried to avoid disturbing any of the existing cases that are
working well. I did the verification and audit in a hurry, so I
wouldn't be surprised at all if I missed something, but I feel
equally that picking these up is more a matter of adjustment than
change in viewpoint.

> If we knew the amount of non-used-once pages which drop_behind() is
> deactivating under _your_ tests, we could make absolute sure about
> the problem.

Yes, for sure, it's time to use the statistics weapon. I guess we
need to add in a couple more sensor points. An obvious place is in
check_used_once.

--
Daniel

2001-07-24 23:08:50

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Rik van Riel wrote:
> > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > Today's patch tackles the use-once problem, that is, the
> > > > problem of
> > >
> > > Well, as I see the patch should remove the problem where
> > > drop_behind() deactivates pages of a readahead window even if
> > > some of those pages are not "used-once" pages, right ?
> > >
> > > I just want to make sure the performance improvements you're
> > > seeing caused by the fix of this _particular_ problem.
> >
> > Fully agreed.
> >
> > Especially since it was a one-liner change from worse
> > performance to better performance (IIRC) it would be
> > nice to see exactly WHY the system behaves the way it
> > does. ;)
>
> Yes.
>
> Daniel's patch adds "drop behind" (that is, adding swapcache
> pages to the inactive dirty) behaviour to swapcache pages.
>
> This is a _new_ thing, and I would like to know how that is changing
> the whole VM behaviour..

Yes, absolutely. I knew I was doing that but I also thought it wouldn't
hurt. Rather it's part of a transition towards a full unification of
the file and swap paths.

Basically, I just left that part of it hanging. If you check my
detailed timings you'll see all my test runs have swaps=0, basically
because I didn't really want to hear about it just then ;-)

I was pretty sure it could be fixed if it broke.

--
Daniel

2001-07-24 23:08:50

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Wednesday 25 July 2001 00:05, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > Today's patch tackles the use-once problem, that is, the problem
> > > of
> >
> > Well, as I see the patch should remove the problem where
> > drop_behind() deactivates pages of a readahead window even if
> > some of those pages are not "used-once" pages, right ?
> >
> > I just want to make sure the performance improvements you're
> > seeing caused by the fix of this _particular_ problem.
>
> Fully agreed.
>
> Especially since it was a one-liner change from worse
> performance to better performance (IIRC) it would be
> nice to see exactly WHY the system behaves the way it
> does. ;)

Oh, it wasn't an accident, I knew what I was trying to achieve.
It's just that I didn't immediately understand all the curlicues in
the page life cycles just from staring at the code. I had to see
the machine moving first. It was very instructive to see what
happened on reverting to a fifo strategy. It might even be a good
idea to put a proc hook in there to allow on-the-fly dumbing down
of the lru machinery. That way we can measure system behaviour
against a baseline without having to go through the
compile-reboot-bench cycle every time, which eats a major amount
of time.

> Reading a bunch of 2Q, LRU/k, ... papers and thinking
> about the problem very carefully should help us a bit
> in this. Lots of researches have already looked into
> this particular problem in quite a lot of detail.

Yep, unfortunately the subject of memory management in operating
systems seems to have received a lot less attention than memory
management in databases.

--
Daniel

2001-07-25 00:00:13

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tuesday 24 July 2001 18:56, Rik van Riel wrote:
> On Tue, 24 Jul 2001 [email protected] wrote:
> > On Tue, Jul 24, 2001 at 05:47:30AM +0200, Daniel Phillips wrote:
> > > So I decided to look for a new way of approaching the use-once
> > > problem that would easily integrated with our current approach.
> > > What I came up with is pretty simple: instead of starting a newly
> > > allocated page on the active ring, I start it on the inactive
> > > queue with an age of zero. Then, any time generic_file_read or
> > > write references a page, if its age is zero, set its age to
> > > START_AGE and mark it as unreferenced.
> >
> > This is wonderfully simple and ellegant.
>
> It's a tad incorrect, though.
>
> If a page gets 2 1-byte reads in a microsecond, with this
> patch it would get promoted to the active list, even though
> it's really only used once.

Yes, I expected that to be a relatively rare case however - block
aligned transfers are much more common, and when we have multiple
blocks per page we may well want the page to go onto the active list
because there may be quite a delay between successive block accesses.
>From there it will be aged normally, not such a horrible thing.

To fix this properly I suspect that just a few microseconds is enough
delay to pick up the temporal groupings you're talking about. That's
not hard to achieve.

> Preferably we'd want to have a "new" list of, say, 5% maximum
> of RAM size, where all references to a page are ignored. Any
> references made after the page falls off that list would be
> counted for page replacement purposes.

At that size you'd run a real risk of missing the tell-tale multiple
references that mark a page as frequently used. Think about metadata
here (right now, that usually just means directory pages, but later...
who knows). Once somebody has looked at two files in a directory -
while the page sits on the "ignore" list - chances are very good
they'll look at a third, but perhaps not before it drops off the south
end of the inactive queue.

Well, theorizing can only get us so far before we need to take actual
measurements.

--
Daniel

2001-07-25 00:43:11

by Anton Altaparmakov

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

At 01:04 25/07/2001, Daniel Phillips wrote:
>At that size you'd run a real risk of missing the tell-tale multiple
>references that mark a page as frequently used. Think about metadata
>here (right now, that usually just means directory pages, but later...
>who knows).

This is not actually implemented yet, but NTFS TNG will use the page cache
to hold both the LCN (physical clusters) and MFT (on disk inodes)
allocation bitmaps in addition to file and directory pages. (Admittedly the
LCN case folds into the file pages in page cache one as the LCN bitmap is
just stored inside the usual data of a file called $Bitmap, but the MFT
case is more complex as it is in an additional attribute inside the file
$MFT so the normal file read/write functions definitely cannot be used. The
usual data here is the actual on disk inodes...)

Just FYI.

Anton

--
"Nothing succeeds like success." - Alexandre Dumas
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

2001-07-25 00:29:40

by Linus Torvalds

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Rik van Riel wrote:
>
> (using smaller chunks, or chunks which aren't a
> multiple of 4kB should break the current code)

Maybe Patrick is using stdio? In that case, the small chunks will be
coalesced in the library layer anyway, which might explain the lack of
breakage.

Of course, if it improved performance even when "broken", that would be
even better. I like those kind sof algorithms.

Linus

2001-07-25 01:21:28

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tuesday 24 July 2001 22:33, Linus Torvalds wrote:
> On Tue, 24 Jul 2001, Patrick Dreker wrote:
> > I just decided to give this patch a try, as I have written a little
> > application which does some statistics over traces dumped by
> > another program by mmap()ing a large file and reading it
> > sequentially. The trace file to be analyzed is about 240 megs in
> > size and consists of records each 249 bytes long. The analysis
> > program opens and the mmap()s the trace file doing some small
> > calculations on the data (basically it adds up fields from the
> > records to get overall values).
>
> Note that the patch won't do much anything for the mmap() case - the
> VM doesn't consider that "use once" anyway, and trying to make it do
> so would be hard, I suspect. It's just very nasty to try to figure
> out whether the user has touched the page a single time or millions
> of times...
>
> We do have the "accessed" bit, but in order to get any access
> patterns we'd have to scan the page tables a _lot_ more than we do
> now. Right now we do it only under memory pressure, and only about
> once per memory cycle, which is not really even remotely enough to
> get anything more than "yes, the user did touch the page"..
>
> In order for mmap() to make a difference with the new code, we could
> do something like look at the adjacent VM pages on page-in. That,
> together with VM hints, might well be able to do similar things for
> mmap. But at a noticeably higher cost than what the current code has.

OK, the truth is, I decided to punt that problem on the assumption I'd
be able to come up with something workable when the time came. Now
having thought about it a little longer I see it's not so easy. It's
time for some free-form thinking.

Maybe there's a relatively simple way. We'll see for either memory
read or write we'll see the fault. At that point the page starts its
death march down the inactive queue and shortly after that the actual
memory operation takes place. Another short time later we clear the
hardware referenced bit by some as-yet-undetermined means. At the
south end of the queue we check the hardware referenced bit and
reactivate or continue with the eviction process as with the current
patch.

Now I sense there are gotchas here that I'm not entirely on top of -
the swap side of the memory manager being a little less than crystal
clear to me at this point. Let me start by guessing that it's not so
easy to know which pte to look in for the referenced bit. However, we
could cheat and store the pte address in the struct page, or perhaps in
the buffers, which happen to be still attached at the time.

Hmm, just a first idea, and admittedly ugly. I think what I really
have to do is some more code study.

--
Daniel

2001-07-25 01:26:27

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Wednesday 25 July 2001 02:43, Anton Altaparmakov wrote:
> At 01:04 25/07/2001, Daniel Phillips wrote:
> >At that size you'd run a real risk of missing the tell-tale multiple
> >references that mark a page as frequently used. Think about
> > metadata here (right now, that usually just means directory pages,
> > but later... who knows).
>
> This is not actually implemented yet, but NTFS TNG will use the page
> cache to hold both the LCN (physical clusters) and MFT (on disk
> inodes) allocation bitmaps in addition to file and directory pages.
> (Admittedly the LCN case folds into the file pages in page cache one
> as the LCN bitmap is just stored inside the usual data of a file
> called $Bitmap, but the MFT case is more complex as it is in an
> additional attribute inside the file $MFT so the normal file
> read/write functions definitely cannot be used. The usual data here
> is the actual on disk inodes...)
>
> Just FYI.

Yes, not a surprise. Plus, I was working on an experimental patch to
put Ext2 index blocks into the page cache just before I got off on this
use-once tangent. Time to go back to that...

--
Daniel

2001-07-25 01:21:28

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tuesday 24 July 2001 22:24, Patrick Dreker wrote:
> I just decided to give this patch a try, as I have written a little
> application which does some statistics over traces dumped by another
> program by mmap()ing a large file and reading it sequentially. The
> trace file to be analyzed is about 240 megs in size and consists of
> records each 249 bytes long. The analysis program opens and the
> mmap()s the trace file doing some small calculations on the data
> (basically it adds up fields from the records to get overall values).
>
> I have tested this on my Athlon 600 with 128 Megs of RAM, and it does
> not make any difference whether I use plain 2.4.7 or 2.4.5-use-once.
> The program always takes about 1 minute 6 seconds (+- 2 seconds) to
> complete, and the machine starts swapping out stuff

In this case that's an excellent result:

- The optimization doesn't include mmap's (yet)
- It doesn't break swap. (Good, I didn't check that myself)

This is a case of "no news is good news".

> (thus I have
> omitted further stats like vmstat output). I have just taken another
> look into my program to verify it does not do something silly, like
> keeping old data around, but the program cycle is always the same:
> copy a record from the mmap into a struct, perform analysis, and copy
> next record. The struct is always reused for the next struct (so
> there is only one struct at any time).
>
> I can do further tests, if someone asks me to. I could even modify
> the analysis program to check changes in behaviour...

(Already read your mail where you picked up the 20% improvement,
thanks, it warms my heart:-)

--
Daniel

2001-07-25 04:02:07

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Daniel Phillips wrote:

> --- ../2.4.5.clean/mm/vmscan.c Sat May 26 02:00:18 2001
> +++ ./mm/vmscan.c Mon Jul 23 17:25:27 2001
> @@ -359,10 +359,10 @@
> }
>
> /* Page is or was in use? Move it to the active list. */
> - if (PageReferenced(page) || page->age > 0 ||
> - (!page->buffers && page_count(page) > 1)) {
> + if (PageReferenced(page) || (!page->buffers && page_count(page) > 1)) {
> del_page_from_inactive_clean_list(page);
> add_page_to_active_list(page);
> + page->age = PAGE_AGE_START;
> continue;
> }
>
> @@ -462,11 +462,11 @@
> }
>
> /* Page is or was in use? Move it to the active list. */
> - if (PageReferenced(page) || page->age > 0 ||
> - (!page->buffers && page_count(page) > 1) ||
> + if (PageReferenced(page) || (!page->buffers && page_count(page) > 1) ||
> page_ramdisk(page)) {
> del_page_from_inactive_dirty_list(page);
> add_page_to_active_list(page);
> + page->age = PAGE_AGE_START;
> continue;
> }

That change will cause problems: we indicate that a page is young due to a
young mapped pte by increasing its age (at swap_out()).

With the current way of doing things, if you don't move higher aged pages
from the inactive lists to the active list you'll have severe problems
because of that (inactive list full of unfreeable pages due to mapped
pte's).

I suggest you to change

if (!page->buffers && page_count(page) > 1)
to
if ((page_count(page) - !!page->buffers) > 1)

at page_launder/reclaim_page to fix that problem.

This way we end up moving all pages with additional references ignoring
the buffers to the active list.

We have to unmap all references to a page before being able to make it
reclaimable clean cache anyway, so there is no real point keeping those
pages at the inactive list.

2001-07-25 04:38:01

by Rob Landley

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tuesday 24 July 2001 19:09, Daniel Phillips wrote:
> On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Rik van Riel wrote:
> > > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > > Today's patch tackles the use-once problem, that is, the
> > > > > problem of
> > > >
> > > > Well, as I see the patch should remove the problem where
> > > > drop_behind() deactivates pages of a readahead window even if
> > > > some of those pages are not "used-once" pages, right ?
> > > >
> > > > I just want to make sure the performance improvements you're
> > > > seeing caused by the fix of this _particular_ problem.
> > >
> > > Fully agreed.
> > >
> > > Especially since it was a one-liner change from worse
> > > performance to better performance (IIRC) it would be
> > > nice to see exactly WHY the system behaves the way it
> > > does. ;)
> >
> > Yes.
> >
> > Daniel's patch adds "drop behind" (that is, adding swapcache
> > pages to the inactive dirty) behaviour to swapcache pages.
> >
> > This is a _new_ thing, and I would like to know how that is changing
> > the whole VM behaviour..
>
> Yes, absolutely. I knew I was doing that but I also thought it wouldn't
> hurt. Rather it's part of a transition towards a full unification of
> the file and swap paths.

Stupid question time:

So basically (lemme get this straight):

All VM allocation in 2.4 has become some variant of mmap. Either we're
mmaping in the executable and libraries when we exec, we're mmaping in a
file, or we're mmaping in the swap file/block device which is how we do
anonymous page allocations.

And this is why 2.4 keeps wanting to allocate swap space for pages that are
still in core. And why the 2.4 VM keeps going ballistic on boxes that have
more physical DRAM than they have swap space. (I.E. the 2X swap actually
being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram
box actually hurting matters by confusing the VM if swap is enabled AT ALL,
since it clashes conceptually.)

So getting swap to exclude in-core pages (so they don't take up space in the
swap file) would therefore be conceptually similar to implementing "files
with holes in them" support like EXT2 has for these swap file mmaps.

And the argument that doing so might "fragment the swap file", while true, is
actually a bit of a red herriing because the real objection is that it
unnecessarily complicates an otherwise clean and straightforward concept.

Am I even close here? (That was the stupid question I mentioned earlier,
knew I'd get around to it...) Reading the code's a bit harder when you don't
have any sort of conceptual understanding of what the heck it's actually
trying to do.

> Basically, I just left that part of it hanging. If you check my
> detailed timings you'll see all my test runs have swaps=0, basically
> because I didn't really want to hear about it just then ;-)
>
> I was pretty sure it could be fixed if it broke.

Just thought I'd say that personally, I think your greatest contribution in
this whole thread is that somebody finally clearly explained how the new VM
works, using small words. The actual improvements are just gravy. Yes I'm
biased. :)

I don't suppose we could get some variant of your initial post into
/Documentation/vm/HowItActuallyWorks.txt? (I take it the biggest "detail"
you glossed over was the seperation of memory into zones?)

> --
> Daniel

Rob

2001-07-25 05:06:41

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Marcelo Tosatti wrote:
>
> Daniel's patch adds "drop behind" (that is, adding swapcache
> pages to the inactive dirty) behaviour to swapcache pages.

In some *brief* testing here, it appears that the use-once changes
make an improvement for light-medium workloads. With swap-intensive
workloads, the (possibly accidental) changes to swapcache aging
in fact improve things a lot, and use-once makes things a little worse.

This is a modified Galbraith test: 64 megs of RAM, `make -j12
bzImage', dual CPU:

2.4.7: 6:54
2.4.7+Daniel's patch 6:06
2.4.7+the below patch 5:56

--- mm/swap.c 2001/01/23 08:37:48 1.3
+++ mm/swap.c 2001/07/25 04:08:59
@@ -234,8 +234,8 @@
DEBUG_ADD_PAGE
add_page_to_active_list(page);
/* This should be relatively rare */
- if (!page->age)
- deactivate_page_nolock(page);
+ deactivate_page_nolock(page);
+ page->age = 0;
spin_unlock(&pagemap_lru_lock);
}

This change to lru_cache_add() is the only change made to 2.4.7,
and it provides the 17% speedup for this swap-intensive load.

With the same setup, running a `grep -r /usr/src' in parallel
with a `make -j3 bzImage', the `make -j3' takes:

2.4.7: 5:13
2.4.7+Daniel: 5:03
2.4.7+the above patch: 5:16

With the same setup, running a `grep -r /usr/src' in parallel
with a `make -j1 bzImage', the `make -j1' takes:

2.4.7: 9:25
2.4.7+Daniel: 8:55
2.4.7+the above patch: 9:35

So with lighter loads, use-once is starting to provide benefit, and the
deactivation is too aggressive.

> This is a _new_ thing, and I would like to know how that is changing the
> whole VM behaviour..

Sure. Daniel's patch radically changes the aging of swapcache
and other pages, and with some workloads it appears that it is
this change which brings about the performance increase, rather
than the intended use-once stuff.

I suspect the right balance here is to take use-once, but *not*
take its changes to lru_cache_add(). That's a separate thing.

Seems that lru_cache_add() is making decisions at a too-low level, and
they are sometimes wrong. The decision as to what age to give the page
and whether it should be activated needs to be made at a higher level.

-

2001-07-25 07:41:30

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tue, 24 Jul 2001, Rob Landley wrote:

> On Tuesday 24 July 2001 19:09, Daniel Phillips wrote:
> > On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> > > On Tue, 24 Jul 2001, Rik van Riel wrote:
> > > > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > > > Today's patch tackles the use-once problem, that is, the
> > > > > > problem of
> > > > >
> > > > > Well, as I see the patch should remove the problem where
> > > > > drop_behind() deactivates pages of a readahead window even if
> > > > > some of those pages are not "used-once" pages, right ?
> > > > >
> > > > > I just want to make sure the performance improvements you're
> > > > > seeing caused by the fix of this _particular_ problem.
> > > >
> > > > Fully agreed.
> > > >
> > > > Especially since it was a one-liner change from worse
> > > > performance to better performance (IIRC) it would be
> > > > nice to see exactly WHY the system behaves the way it
> > > > does. ;)
> > >
> > > Yes.
> > >
> > > Daniel's patch adds "drop behind" (that is, adding swapcache
> > > pages to the inactive dirty) behaviour to swapcache pages.
> > >
> > > This is a _new_ thing, and I would like to know how that is changing
> > > the whole VM behaviour..
> >
> > Yes, absolutely. I knew I was doing that but I also thought it wouldn't
> > hurt. Rather it's part of a transition towards a full unification of
> > the file and swap paths.
>
> Stupid question time:
>
> So basically (lemme get this straight):
>
> All VM allocation in 2.4 has become some variant of mmap.
>
> Either we're mmaping in the executable and libraries when we exec,
> we're mmaping in a file, or we're mmaping in the swap file/block
> device which is how we do anonymous page allocations.
>
>
> And this is why 2.4 keeps wanting to allocate swap space for pages that are
> still in core. And why the 2.4 VM keeps going ballistic on boxes that have
> more physical DRAM than they have swap space. (I.E. the 2X swap actually
> being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram
> box actually hurting matters by confusing the VM if swap is enabled AT ALL,
> since it clashes conceptually.)

2.4 keeps wanting to allocate swap space for pages that are in core to be
able to add them to the LRU lists. That way we can get more detailed
information about the VM usage on the system.

When we run out of memory, we start looking linearly in all processes ptes
to find ptes which have not been used recently. When we find one pte
which has not been used recently, we unmap it from the physical page.

For anonymous data, we have to unmap the pte and point it to some
"identifier" which can be used later (at page fault time) to find the
data. (wheter its still on memory or already swapped out to disk)

Currently this "identifier" is a value in the swap map. The swap map is a
_direct_ mapping to backing storage (swap space).

That means we cannot unmap a pte which points to anon data without
allocating swap space for the page first.

We have added an aging scheme which gets detailed information about the
page usage, great! The problem is that we cannot get detailed anon page
usage info without having to allocate swap space. Result: 2xRAM swap rule
created.

Have you got the idea here ?

That is not going to change on 2.4, I think. Sorry.

> So getting swap to exclude in-core pages (so they don't take up space in the
> swap file) would therefore be conceptually similar to implementing "files
> with holes in them" support like EXT2 has for these swap file mmaps.

Not really. To exclude in-core pages (in-core meaning all pte's which
point to the in-mem page are mapped and valid) from swap we just need to
remove the page from swapcache. Its as simple as that, really.

However that will cause swap fragmentation. And lots of it.

> And the argument that doing so might "fragment the swap file", while true, is
> actually a bit of a red herriing because the real objection is that it
> unnecessarily complicates an otherwise clean and straightforward concept.

Nope. The "objections" for not deallocating in-core pages from swap (on
2.4) are:

- The code would need to avoid fragmentation. It would have to "unmap"
contiguous chunks of swap mappings as soon as swap started to get full.

Take into account that the swap clustering code we are using now was wrote
at the time we used to allocate swap space only when actually writing out
swap data: there was no need for the code to be fast.

- 2.4 is out. Fixing the problem in a performance-decent way would have to
change quite some code.

> Am I even close here? (That was the stupid question I mentioned earlier,
> knew I'd get around to it...) Reading the code's a bit harder when you don't
> have any sort of conceptual understanding of what the heck it's actually
> trying to do.

Well, I think so.

>
> > Basically, I just left that part of it hanging. If you check my
> > detailed timings you'll see all my test runs have swaps=0, basically
> > because I didn't really want to hear about it just then ;-)
> >
> > I was pretty sure it could be fixed if it broke.
>
> Just thought I'd say that personally, I think your greatest contribution in
> this whole thread is that somebody finally clearly explained how the new VM
> works, using small words. The actual improvements are just gravy. Yes I'm
> biased. :)
>
> I don't suppose we could get some variant of your initial post into
> /Documentation/vm/HowItActuallyWorks.txt? (I take it the biggest "detail"
> you glossed over was the seperation of memory into zones?)

Last comment:

Who the heck cares about good swapping performance when tasks are
getting killed because there is no more swap space available due to this
swap preallocation ? No one.

It makes sense to fuck up swap clustering to avoid the OOM killer.

In case someone is having this problem with the OOM killer in practice
(real workloads), please tell us.

2001-07-25 08:03:52

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Wed, 25 Jul 2001, Andrew Morton wrote:

> Marcelo Tosatti wrote:
> >
> > Daniel's patch adds "drop behind" (that is, adding swapcache
> > pages to the inactive dirty) behaviour to swapcache pages.
>
> In some *brief* testing here, it appears that the use-once changes
> make an improvement for light-medium workloads. With swap-intensive
> workloads, the (possibly accidental) changes to swapcache aging
> in fact improve things a lot, and use-once makes things a little worse.
>
> This is a modified Galbraith test: 64 megs of RAM, `make -j12
> bzImage', dual CPU:
>
> 2.4.7: 6:54
> 2.4.7+Daniel's patch 6:06
> 2.4.7+the below patch 5:56
>
> --- mm/swap.c 2001/01/23 08:37:48 1.3
> +++ mm/swap.c 2001/07/25 04:08:59
> @@ -234,8 +234,8 @@
> DEBUG_ADD_PAGE
> add_page_to_active_list(page);
> /* This should be relatively rare */
> - if (!page->age)
> - deactivate_page_nolock(page);
> + deactivate_page_nolock(page);
> + page->age = 0;
> spin_unlock(&pagemap_lru_lock);
> }
>
> This change to lru_cache_add() is the only change made to 2.4.7,
> and it provides the 17% speedup for this swap-intensive load.
>
> With the same setup, running a `grep -r /usr/src' in parallel
> with a `make -j3 bzImage', the `make -j3' takes:
>
> 2.4.7: 5:13
> 2.4.7+Daniel: 5:03
> 2.4.7+the above patch: 5:16
>
> With the same setup, running a `grep -r /usr/src' in parallel
> with a `make -j1 bzImage', the `make -j1' takes:
>
> 2.4.7: 9:25
> 2.4.7+Daniel: 8:55
> 2.4.7+the above patch: 9:35

Ok, now we know what improves the swapping performance.

I wonder _why_ it improves swapping performance.

> So with lighter loads, use-once is starting to provide benefit, and the
> deactivation is too aggressive.
>
> > This is a _new_ thing, and I would like to know how that is changing the
> > whole VM behaviour..
>
> Sure. Daniel's patch radically changes the aging of swapcache
> and other pages, and with some workloads it appears that it is
> this change which brings about the performance increase, rather
> than the intended use-once stuff.
>
> I suspect the right balance here is to take use-once, but *not*
> take its changes to lru_cache_add(). That's a separate thing.

Yep.

I perfectly agree with Daniel's use-once idea for _file cache_ pages. Its
really nice.

Now I'm not sure why directly adding swapcache pages to the inactive dirty
lits with 0 zero age improves things.

It _sounds_ like we avoid moving those swapcache pages to the active list,
then later move them to the inactive list and then writing them out.

That can result in less effective writeout clustering (its just a
guess) and also CPU wastage, as Daniel mentioned previously.

I'll hopefully going to profile his code tomorrow with the VM stats stuff,
if I have time.

> Seems that lru_cache_add() is making decisions at a too-low level, and
> they are sometimes wrong. The decision as to what age to give the page
> and whether it should be activated needs to be made at a higher level.

Agreed.

2001-07-25 08:24:56

by Patrick Dreker

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Am Mittwoch, 25. Juli 2001 02:27 schrieb Linus Torvalds:
> On Tue, 24 Jul 2001, Rik van Riel wrote:
> > (using smaller chunks, or chunks which aren't a
> > multiple of 4kB should break the current code)
>
> Maybe Patrick is using stdio? In that case, the small chunks will be
> coalesced in the library layer anyway, which might explain the lack of
> breakage.
I am using normal read() calls to read the file after open()ing it on program
start and each call reads only 4 bytes. So if I am reading this right I am
seeing an improvement where I should not see one, or at least not such a big
one, right?

I did a few more test runs on each of the kernels to check if the results are
reproducible:
2.4.7-plain:
17.320u 115.100s 2:17.36 96.4% 0+0k 0+0io 110pf+0w
17.200u 94.170s 1:53.14 98.4% 0+0k 0+0io 110pf+0w
17.490u 113.730s 2:13.48 98.3% 0+0k 0+0io 110pf+0w

2.4.5-use_once:
14.730u 108.310s 2:09.57 94.9% 0+0k 0+0io 203pf+0w
13.880u 79.410s 1:38.64 94.5% 0+0k 0+0io 226pf+0w
14.840u 78.680s 1:37.86 95.5% 0+0k 0+0io 238pf+0w

The time under 2.4.5-use_once stays constant from the second run on (I tried
3 more times), while 2.4.7 shows pretty varying performance but I have never
seen it getting better than the 1:53.14 from the second run above. I had
stopped all services which I knew to cause periodic activity (exim,
cron/anacron) which could disturb the tests.

I also noticed, that under 2.4.5 after the 3 test runs the KDE Taskbasr got
swapped out, while under 2.4.7 this was not the case.

> Of course, if it improved performance even when "broken", that would be
> even better. I like those kind sof algorithms.
Who doesn't? :)

>
> Linus

--
Patrick Dreker
---------------------------------------------------------------------
> Is there anything else I can contribute?
The latitude and longtitude of the bios writers current position, and
a ballistic missile.
Alan Cox on [email protected]

2001-07-25 08:38:58

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Rob Landley <[email protected]> writes:

> Stupid question time:
>
> So basically (lemme get this straight):
>
> All VM allocation in 2.4 has become some variant of mmap. Either we're
> mmaping in the executable and libraries when we exec, we're mmaping in a
> file, or we're mmaping in the swap file/block device which is how we do
> anonymous page allocations.

You are getting warm. In 2.4 most caching is moving exclusively to
the page cache. Page cache write support was added between 2.2 and
2.4 allowing things to stop using the block cache. However everything
you are mentioning used the page cache in 2.2. The page cache is
superior to the block cache as it caches the data we want to use not
some intermediate form of it.

Having one general purpose cache layer, increases the reliability
of the system, as we only have to have that layer debugged to know
everything is working. It is by know means the entire VM but it is
the majority and it is where we most pages that can be freed on demand
so the page cache receives most of the attention.

> And this is why 2.4 keeps wanting to allocate swap space for pages that are
> still in core. And why the 2.4 VM keeps going ballistic on boxes that have
> more physical DRAM than they have swap space. (I.E. the 2X swap actually
> being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram
> box actually hurting matters by confusing the VM if swap is enabled AT ALL,
> since it clashes conceptually.)

Nope. There are several canidates for this but none of the is simply
using the page cache.

The description, and characterization of the 2.4 VM with respect to
swapping is wrong. I have seen to much overgeneralization on this
topic already, and refuse to treat it in the abstract. If you have a
reproducible problem case report it specifically.

The biggest 2.4 swapping bug was failure free pages in the swap cache
that were not connected to any process. This should now be fixed.

The biggest issue is that the 2.4 VM has been heavily modified since
2.2 and the bugs are still being worked out. If you have a an
application that needs noticeably more swap than 2.2 did report it as
a bug. Any other treatment at this point would be over confidence in
the correctness of the VM in corner cases.

> So getting swap to exclude in-core pages (so they don't take up space in the
> swap file) would therefore be conceptually similar to implementing "files
> with holes in them" support like EXT2 has for these swap file mmaps.

Not really. There are some similiarities.

> And the argument that doing so might "fragment the swap file", while true, is
> actually a bit of a red herriing because the real objection is that it
> unnecessarily complicates an otherwise clean and straightforward concept.

Fragmenting the swap file is a read herring.

I have seen no real objections, just no real code. The thing to watch
out for is that you need to make certain that you actually have the
resources to swap a file out to. Consider having a swapfile on tmpfs.
Having a deadlock in the swapping code would be bad.

> Am I even close here? (That was the stupid question I mentioned earlier,
> knew I'd get around to it...) Reading the code's a bit harder when you don't
> have any sort of conceptual understanding of what the heck it's actually
> trying to do.

See my earlier answers for closenes.

> I don't suppose we could get some variant of your initial post into
> /Documentation/vm/HowItActuallyWorks.txt? (I take it the biggest "detail"
> you glossed over was the seperation of memory into zones?)

Becareful here, what is maintainable is a concepts behind how it
works. The page cache, multpile memory zones, etc.

You have the code for a detailed how it works document, and it is
continually changing. The driving forces, tools, and ideas behind
what is happening should be static enough to document however.

Eric

2001-07-25 09:29:55

by Mike A. Harris

[permalink] [raw]

Subject: Re: linux partitioning in IA64

On Tue, 24 Jul 2001 [email protected] wrote:

>Date: Tue, 24 Jul 2001 16:26:12 -0400 (EDT)
>From: [email protected]
>To: [email protected]
>Content-Type: text/plain; charset=US-ASCII
>Subject: linux partitioning in IA64
>
>Hei,
>
>I am trying to install seawolf(linux IA64) on an IA 64 machine.
>For partitioning I went through several trails without success.
>(total # disk space is 10G). After all the initial settings, I
>restarted the machine to EFI (boot manager). Can't boot to
>linux and the EFI can't recognize files the installation CD
>initially saved on the hard disk. Are there special ways or
>tricks to partition the disk and to make it work?

Much better to send this type of question to [email protected],
as it is mostly OT here...

Hope this helps.

----------------------------------------------------------------------
Mike A. Harris - Linux advocate - Open Source advocate
Opinions and viewpoints expressed are solely my own.
----------------------------------------------------------------------
Definition: MCSE - Must Consult Someone Experienced

2001-07-25 12:56:36

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Tuesday 24 July 2001 21:35, Rob Landley wrote:
> I don't suppose we could get some variant of your initial post into
> /Documentation/vm/HowItActuallyWorks.txt? (I take it the biggest
> "detail" you glossed over was the seperation of memory into zones?)

I glossed over a lot of big details:

- zones
- type of pages: anonymous, swap cache, file, high, buffer, ramdisk
- interaction with page cache
- various flavors of swap-in and swap-out paths
- shared memory and the swap cache
- locking strategy
- aging strategy
- scanning policy
- deadlock and livelock avoidance measures
- unloaded vs loaded behaviour
- effect of load changes
- out of memory handling
- clustering (or lack of it)
- IO throttling

Each of these is a topic all by itself. You'll find all of them
discussed extensively here on lkml.

--
Daniel

2001-07-25 12:55:35

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Wednesday 25 July 2001 10:32, Eric W. Biederman wrote:
> Consider having a swapfile on tmpfs.

Ooh, a truly twisted thought.

We'd know we're making progress when we can prove it doesn't deadlock.

--
Daniel

2001-07-25 12:57:46

by Martin Devera

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Hello,

I read whole this thread and it started to interest me.

If I understood the code correctly, the used bit on ptes
is tested once per 64 second (for particular page) from
kswapd or more often under mem pressure and only for
pages on active list. Correct ?

Would it be possible to do cost-based eviction of pages ?
Like to compute cost for each page periodically (with variable
period - resort page to finner-period list when u bit was set for
too long) and resort page into hash bucket with key equal to
the cost (cost would be in moderate integer range).

Then you can simply evict pages from lowest hash bucket list
and move them to free list (after cleaning them).

It would allow developers to modify and test various cost
schemas including LRU,LRU/2,aging ... with simple change
of cost function.

I tried to create statistical model of page eviction and after
some simplify it seems that cost c = (Tw-t)^2+(Tr-t)^2
where Tr/w is expected time of next page reference/dirtying
based on EWMA averaging could minimize disc activity during
page eviction.

I wanted to try it but I'd like to know opinions of others.
Probably someone have got similar idea ..
Maybe it would be so expensive ..
fill other "maybe" here .. :-)

devik

2001-07-25 14:12:32

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Wednesday 25 July 2001 10:20, Patrick Dreker wrote:
> I did a few more test runs on each of the kernels to check if the
> results are reproducible:
> 2.4.7-plain:
> 17.320u 115.100s 2:17.36 96.4% 0+0k 0+0io 110pf+0w
> 17.200u 94.170s 1:53.14 98.4% 0+0k 0+0io 110pf+0w
> 17.490u 113.730s 2:13.48 98.3% 0+0k 0+0io 110pf+0w
>
> 2.4.5-use_once:
> 14.730u 108.310s 2:09.57 94.9% 0+0k 0+0io 203pf+0w
> 13.880u 79.410s 1:38.64 94.5% 0+0k 0+0io 226pf+0w
> 14.840u 78.680s 1:37.86 95.5% 0+0k 0+0io 238pf+0w

Look at the CPU dropping along with the times. Usually it goes the
other way.

> The time under 2.4.5-use_once stays constant from the second run on
> (I tried 3 more times), while 2.4.7 shows pretty varying performance
> but I have never seen it getting better than the 1:53.14 from the
> second run above. I had stopped all services which I knew to cause
> periodic activity (exim, cron/anacron) which could disturb the tests.
>
> I also noticed, that under 2.4.5 after the 3 test runs the KDE
> Taskbasr got swapped out, while under 2.4.7 this was not the case.

Not swapping out the task bar is a different problem, only loosely
related. The use-once thing is a step in the right direction because
it makes relatively more file IO pages available for deactivation
versus swap pages, and the task bar has a better chance of surviving.
However, it's not a really firm connection to the problem.

An additional line of attack is to look at the aging policy. I have a
strong sense we can do it better. Right now we're aging everything
down to a uniform zero and that really obviously throws away a lot of
information.

In the 2.5 timeframe, better unification of the page cache and swap
paths will make it much easier to focus on the taskbar problem.

--
Daniel

2001-07-25 16:12:09

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Daniel Phillips <[email protected]> writes:

> On Wednesday 25 July 2001 10:32, Eric W. Biederman wrote:
> > Consider having a swapfile on tmpfs.
>
> Ooh, a truly twisted thought.
>
> We'd know we're making progress when we can prove it doesn't deadlock.

Given all of that I think I going to go try to hack up a version that
works.

The ensuring we don't deadlock sounds like it will require some shared
mechanisms with journaling filesystems. The whole pinned ram
problem. So to some extent it looks like a 2.4 problem. Though I'd
rather solve it in 2.5.x first.

Eric

2001-07-25 16:42:04

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Wed, 25 Jul 2001, Daniel Phillips wrote:
> On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > Now I'm not sure why directly adding swapcache pages to the inactive
> > dirty lits with 0 zero age improves things.
>
> Because it moves the page rapidly down the inactive queue towards the
> ->writepage instead of leaving it floating around on the active ring
> waiting to be noticed. We already know we want to evict that page,

We don't.

The page gets unmapped and added to the swap cache the first
time it wasn't referenced by the process.

This is before any page aging is done.

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-25 17:41:58

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Wednesday 25 July 2001 18:41, Rik van Riel wrote:
> On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > Now I'm not sure why directly adding swapcache pages to the
> > > inactive dirty lits with 0 zero age improves things.
> >
> > Because it moves the page rapidly down the inactive queue towards
> > the ->writepage instead of leaving it floating around on the active
> > ring waiting to be noticed. We already know we want to evict that
> > page,
>
> We don't.
>
> The page gets unmapped and added to the swap cache the first
> time it wasn't referenced by the process.
>
> This is before any page aging is done.

True, it's more accurate to say that we already know we want to *try*
evicting that page. A wrong guess should not make it all the way down
the inactive queue.

--
Daniel

2001-07-25 21:18:50

by Steve Lord

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

> On Wednesday 25 July 2001 02:43, Anton Altaparmakov wrote:
> > At 01:04 25/07/2001, Daniel Phillips wrote:
> > >At that size you'd run a real risk of missing the tell-tale multiple
> > >references that mark a page as frequently used. Think about
> > > metadata here (right now, that usually just means directory pages,
> > > but later... who knows).
> >
> > This is not actually implemented yet, but NTFS TNG will use the page
> > cache to hold both the LCN (physical clusters) and MFT (on disk
> > inodes) allocation bitmaps in addition to file and directory pages.
> > (Admittedly the LCN case folds into the file pages in page cache one
> > as the LCN bitmap is just stored inside the usual data of a file
> > called $Bitmap, but the MFT case is more complex as it is in an
> > additional attribute inside the file $MFT so the normal file
> > read/write functions definitely cannot be used. The usual data here
> > is the actual on disk inodes...)
> >
> > Just FYI.
>
> Yes, not a surprise. Plus, I was working on an experimental patch to
> put Ext2 index blocks into the page cache just before I got off on this
> use-once tangent. Time to go back to that...
>
> --
> Daniel
> -

So here is a datapoint from a filesystem which uses pages for all of its
data and metadata, XFS. These are for 2.4.7 with and without the patch.
XFS metadata is held in a special address space, when not in use by XFS
the vm is free to reclaim it via the usual page reclaim methods, so XFS
is probably fairly heavily influenced by a change like this.

All runs on a 2 cpu 450MHz PIII with 128M of memory, single scsi partition.

1. dbench runs at various loads back to back, starting with
the highest load, 3 runs of each load):

Vanilla 2.4.7 xfs

64
=====
Throughput 6.77363 MB/sec (NB=8.46704 MB/sec 67.7363 MBit/sec)
Throughput 7.03715 MB/sec (NB=8.79644 MB/sec 70.3715 MBit/sec)
Throughput 7.11706 MB/sec (NB=8.89632 MB/sec 71.1706 MBit/sec)
48
=====
Throughput 8.50251 MB/sec (NB=10.6281 MB/sec 85.0251 MBit/sec)
Throughput 8.38426 MB/sec (NB=10.4803 MB/sec 83.8426 MBit/sec)
Throughput 8.58152 MB/sec (NB=10.7269 MB/sec 85.8152 MBit/sec)
32
=====
Throughput 11.6378 MB/sec (NB=14.5473 MB/sec 116.378 MBit/sec)
Throughput 11.7586 MB/sec (NB=14.6983 MB/sec 117.586 MBit/sec)
Throughput 12.0686 MB/sec (NB=15.0858 MB/sec 120.686 MBit/sec)
16
=====
Throughput 16.9877 MB/sec (NB=21.2346 MB/sec 169.877 MBit/sec)
Throughput 16.9502 MB/sec (NB=21.1877 MB/sec 169.502 MBit/sec)
Throughput 16.8703 MB/sec (NB=21.0878 MB/sec 168.703 MBit/sec)
8
=====
Throughput 36.085 MB/sec (NB=45.1062 MB/sec 360.85 MBit/sec)
Throughput 31.8114 MB/sec (NB=39.7642 MB/sec 318.114 MBit/sec)
Throughput 36.0801 MB/sec (NB=45.1001 MB/sec 360.801 MBit/sec)
4
=====
Throughput 42.7741 MB/sec (NB=53.4677 MB/sec 427.741 MBit/sec)
Throughput 46.2591 MB/sec (NB=57.8239 MB/sec 462.591 MBit/sec)
Throughput 42.8487 MB/sec (NB=53.5608 MB/sec 428.487 MBit/sec)
2
=====
Throughput 48.776 MB/sec (NB=60.97 MB/sec 487.76 MBit/sec)
Throughput 48.4417 MB/sec (NB=60.5522 MB/sec 484.417 MBit/sec)
Throughput 47.0938 MB/sec (NB=58.8673 MB/sec 470.938 MBit/sec)
1
=====
Throughput 32.0957 MB/sec (NB=40.1196 MB/sec 320.957 MBit/sec)
Throughput 32.1901 MB/sec (NB=40.2376 MB/sec 321.901 MBit/sec)
Throughput 31.3972 MB/sec (NB=39.2465 MB/sec 313.972 MBit/sec)

2.4.7 xfs + patch

64
=====
Throughput 8.99448 MB/sec (NB=11.2431 MB/sec 89.9448 MBit/sec)
Throughput 9.07137 MB/sec (NB=11.3392 MB/sec 90.7137 MBit/sec)
Throughput 9.09924 MB/sec (NB=11.3741 MB/sec 90.9924 MBit/sec)
48
=====
Throughput 10.7328 MB/sec (NB=13.416 MB/sec 107.328 MBit/sec)
Throughput 10.791 MB/sec (NB=13.4887 MB/sec 107.91 MBit/sec)
Throughput 10.9185 MB/sec (NB=13.6482 MB/sec 109.185 MBit/sec)
32
=====
Throughput 12.6625 MB/sec (NB=15.8282 MB/sec 126.625 MBit/sec)
Throughput 12.7601 MB/sec (NB=15.9501 MB/sec 127.601 MBit/sec)
Throughput 12.6732 MB/sec (NB=15.8415 MB/sec 126.732 MBit/sec)
16
=====
Throughput 19.5692 MB/sec (NB=24.4614 MB/sec 195.692 MBit/sec)
Throughput 19.6163 MB/sec (NB=24.5204 MB/sec 196.163 MBit/sec)
Throughput 19.6773 MB/sec (NB=24.5966 MB/sec 196.773 MBit/sec)
8
=====
Throughput 36.8323 MB/sec (NB=46.0404 MB/sec 368.323 MBit/sec)
Throughput 36.9346 MB/sec (NB=46.1683 MB/sec 369.346 MBit/sec)
Throughput 36.4622 MB/sec (NB=45.5778 MB/sec 364.622 MBit/sec)
4
=====
Throughput 42.883 MB/sec (NB=53.6037 MB/sec 428.83 MBit/sec)
Throughput 42.1596 MB/sec (NB=52.6995 MB/sec 421.596 MBit/sec)
Throughput 42.9602 MB/sec (NB=53.7002 MB/sec 429.602 MBit/sec)
2
=====
Throughput 44.2798 MB/sec (NB=55.3498 MB/sec 442.798 MBit/sec)
Throughput 39.5441 MB/sec (NB=49.4301 MB/sec 395.441 MBit/sec)
Throughput 46.8569 MB/sec (NB=58.5711 MB/sec 468.569 MBit/sec)
1
=====
Throughput 32.0682 MB/sec (NB=40.0852 MB/sec 320.682 MBit/sec)
Throughput 32.1923 MB/sec (NB=40.2403 MB/sec 321.923 MBit/sec)
Throughput 32.1677 MB/sec (NB=40.2097 MB/sec 321.677 MBit/sec)

A definite improvement under high loads, and the lower loads appear
about the same. I reran the dbench 2 numbers and got results in line
with the non patched kernel.

2. Kernel builds

Unmodified kernel:

time make -j3 bzImage
948.880u 70.660s 8:45.88 193.8% 0+0k 0+0io 721072pf+0w
948.360u 69.950s 8:43.68 194.4% 0+0k 0+0io 721072pf+0w

Patched kernel:
942.660u 54.320s 8:29.49 195.6% 0+0k 0+0io 721044pf+0w
942.150u 54.900s 8:29.35 195.7% 0+0k 0+0io 721041pf+0w

Looks good to me, 22% less system time for the kernel build
is pretty good, also looks like my cpus are slower than most
people's nowadays :-(

Steve

2001-07-26 03:29:01

by Ed Tomlinson

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Hi,

Here are some more figures.

2.4.7 + ide patch + ide dma timeout fix + lvm beta8 + reiserfs fs

Throughput 19.3959 MB/sec (NB=24.2449 MB/sec 193.959 MBit/sec)
dbench 20 24.88s user 70.89s system 69% cpu 2:17.23 total

Throughput 19.0447 MB/sec (NB=23.8059 MB/sec 190.447 MBit/sec)
dbench 20 24.37s user 69.24s system 67% cpu 2:19.63 total

tob -f - -full misc > misc-0725.tob 235.80s user 321.73s system 42% cpu 21:52.49 total

2.4.7 + ide patch + ide dma timeout fix + lvm beta8 + reiserfs fs

Throughput 19.3395 MB/sec (NB=24.1744 MB/sec 193.395 MBit/sec)
dbench 20 23.70s user 70.68s system 68% cpu 2:17.52 total

Throughput 19.1872 MB/sec (NB=23.9841 MB/sec 191.872 MBit/sec)
dbench 20 24.30s user 69.42s system 67% cpu 2:18.60 total

tob -f - -full misc > /back/misc-0725.tob 229.64s user 326.26s system 43% cpu 21:11.77 total

Basicily no big changes here - which is good news.

Ed Tomlinson

2001-07-26 08:42:50

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

Rik van Riel <[email protected]> writes:

> On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > Now I'm not sure why directly adding swapcache pages to the inactive
> > > dirty lits with 0 zero age improves things.
> >
> > Because it moves the page rapidly down the inactive queue towards the
> > ->writepage instead of leaving it floating around on the active ring
> > waiting to be noticed. We already know we want to evict that page,
>
> We don't.

Agreed. The kinds of ``aging'' don't match up so we can't tell if
it meets our usual criteria for aging.

> The page gets unmapped and added to the swap cache the first
> time it wasn't referenced by the process.
>
> This is before any page aging is done.

Actually there has been aging done. Unless you completely disable
testing for pte_young. It is a different kind of aging but it is
aging.

Eric

2001-07-26 12:02:47

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Thursday 26 July 2001 10:36, Eric W. Biederman wrote:
> Rik van Riel <[email protected]> writes:
> > On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > > Now I'm not sure why directly adding swapcache pages to the
> > > > inactive dirty lits with 0 zero age improves things.
> > >
> > > Because it moves the page rapidly down the inactive queue towards
> > > the ->writepage instead of leaving it floating around on the
> > > active ring waiting to be noticed. We already know we want to
> > > evict that page,
> >
> > We don't.
>
> Agreed. The kinds of ``aging'' don't match up so we can't tell if
> it meets our usual criteria for aging.

Well, in the absence of of benchmark evidence its hard to tell how
valuable the usual criteria really are. That's another topic though,
because the question here is not how the aging matches up, but how the
inactive queue handling matches up, see below.

> > The page gets unmapped and added to the swap cache the first
> > time it wasn't referenced by the process.
> >
> > This is before any page aging is done.
>
> Actually there has been aging done. Unless you completely disable
> testing for pte_young. It is a different kind of aging but it is
> aging.

In the case of a process page and in the case of a file IO page the
situation is the same: we have decided to put the page on trial. Once
we have arrived at that decision its previous history doesn't matter,
so it makes sense to set its age to a known state. In this case it's
0, meaning "on trial".

Consider that the process page will only ever be down-aged after being
unmapped. So if we put it on the active queue, it just acts like a
big, slow, inefficient timer. Why not put it on a fifo queue instead?
It's the same thing, just more efficient. But we already have a fifo
queue, the inactive_dirty_queue, which we are using for everything
else, so why not use it for this too, then at least we know its fair.

In other words, there is no obvious need to treat a process page
differently from a file IO page once decide to put it on trial.

There does seem to be a dangling thread here though - when a process
page is unmapped and added to swap cache in try_to_swap_out then later
faulted back in, I don't see where we "rescue" the page from the
inactive queue. Maybe I'm just not looking hard enough.

--
Daniel

2001-07-26 12:08:37

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Thu, 26 Jul 2001, Daniel Phillips wrote:

> There does seem to be a dangling thread here though - when a process
> page is unmapped and added to swap cache in try_to_swap_out then later
> faulted back in, I don't see where we "rescue" the page from the
> inactive queue. Maybe I'm just not looking hard enough.

do_swap_page()->lookup_swap_cache()->__find_page_nolock()->SetPageReferenced().

The referenced bit will make page_launder/reclaim_page move the page to
the active list.

2001-07-26 12:13:47

[permalink] [raw]

Subject: Re: [RFC] Optimization for use-once pages

On Thursday 26 July 2001 12:38, Marcelo Tosatti wrote:
> On Thu, 26 Jul 2001, Daniel Phillips wrote:
> > There does seem to be a dangling thread here though - when a
> > process page is unmapped and added to swap cache in try_to_swap_out
> > then later faulted back in, I don't see where we "rescue" the page
> > from the inactive queue. Maybe I'm just not looking hard enough.
>
> do_swap_page()->lookup_swap_cache()->__find_page_nolock()->SetPageRef
>erenced().
>
> The referenced bit will make page_launder/reclaim_page move the page
> to the active list.

Yes. And there I set ->age to a known state (START_AGE), so everything
seems to be in order.

--
Daniel

2001-07-30 19:39:44