2002-01-13 21:45:03

by Alan Cox

Subject: Linux 2.4.18pre3-ac1

People keep bugging me about the -ac tree stuff, so this is what's in my
current internal diff, with the ll patch and the IDE changes excluded.

Much of this is stuff just waiting to go to Marcelo, but it has the 32bit
uid quota that some folks consider pretty critical and the rmap-11b VM,
which I consider pretty essential.

(Marcelo I'll be sending you stuff I've done from this anyway, if there
is other stuff you want extracting just ask)

Linux 2.4.18pre3-ac1

o 32bit uid quota
o rmap-11b VM (Rik van Riel,
William Irwin etc)
o Make scsi printer visible (Stefan Wieseckel)
o Report Hercules Fortissimo card (Minya Sorakinu)
o Fix O_NDELAY close mishandling on the following (me)
sound cards: cmpci, cs46xx, es1370, es1371,
esssolo1, sonicvibes
o tdfx pixclock handling fix (Jurriaan)
o Fix mishandling of file system size limiting (Andrea Arcangeli)
o generic_serial cleanups (Rasmus Andersen)
o	serial.c locking fixes for SMP - move from cli too (Kees)
o Truncate fixes from old -ac tree (Andrew Morton)
o Hopefully fix the i2o oops (me)
| Not the right fix but it'll do till I rewrite this
o Fix non blocking tty blocking bug (Peter Benie)
o IRQ routing workaround for problem HP laptops (Cory Bell)
o Fix the rcpci driver (Pete Popov)
o Fix documentation of aedsp location (Adrian Bunk)
o	Fix the worst of the "APM ate my CPU" problems (Andreas Steinmetz)
o Correct icmp documentation (Pierre Lombard)
o Multiple mxser crash on boot fix (Stephan von Krawczynski)
o ldm header fix (Anton Altaparmakov)
o Fix unchecked kmalloc in i2o_proc (Ragnar Hojland Espinosa)
o Fix unchecked kmalloc in airo_cs (Ragnar Hojland Espinosa)
o Fix unchecked kmalloc in btaudio (Ragnar Hojland Espinosa)
o Fix unchecked kmalloc in qnx4/inode.c (Ragnar Hojland Espinosa)
o Disable DRM4.1 GMX2000 driver (4.0 required) (me)
o Fix sb16 lower speed limit bug (Jori Liesenborgs)
o Fix compilation of orinoco driver (Ben Herrenschmidt)
o ISAPnP init fix (Chris Rankin)
o Export release_console_sem (Andrew Morton)
o Output nat crash fix (Rusty Russell)
o Fix PLIP (Tim Waugh)
o Natsemi driver hang fix (Manfred Spraul)
o Add mono/stereo reporting to gemtek pci radio (Jonathan Hudson)


2002-01-13 22:13:05

by Thiago Rondon

Subject: Re: Linux 2.4.18pre3-ac1


[maluco@freak maluco]$ finger @kernel.org
[kernel.org]
The latest stable version of the Linux kernel is: 2.4.17
The latest prepatch for the stable Linux kernel tree is: 2.4.18-pre3
The latest beta version of the Linux kernel is: 2.5.1
The latest prepatch for the beta Linux kernel tree is: 2.5.2-pre11
The latest -ac patch to the stable Linux kernels is: 2.4.13-ac8

Is that message maintained by anyone? The -ac tree entry isn't updated.

On Sun, 13 Jan 2002, Alan Cox wrote:

> People keep bugging me about the -ac tree stuff so this is whats in my
> current internal diff with the ll patch and the ide changes excluded.
>
> Much of this is stuff just waiting to go to Marcelo but it has the 32bit
> uid quota that some folks consider pretty critical and the rmap-11b VM
> which I consider pretty essential
>
> (Marcelo I'll be sending you stuff I've done from this anyway, if there
> is other stuff you want extracting just ask)
<snip>

2002-01-13 22:53:09

by Ville Herva

Subject: Re: Linux 2.4.18pre3-ac1

On Sun, Jan 13, 2002 at 04:44:46PM -0500, you [Alan Cox] claimed:
> People keep bugging me about the -ac tree stuff so this is whats in my
> current internal diff with the ll patch and the ide changes excluded.

Any big reason why you aren't including those two? I'm pretty sure a lot of
people will eventually bug Marcelo (and you) about merging ide into 2.4
proper (or -ac)... :)

> Linux 2.4.18pre3-ac1
>
> o rmap-11b VM (Rik van Riel,
> William Irwin etc)

So I gather you find this better than AA vm, even the -aa version?

> o Fix O_NDELAY close mishandling on the following (me)
> sound cards: cmpci, cs46xx, es1370, es1371,
> esssolo1, sonicvibes

With 17rc1, es1370 once or twice got into a state where it kept accepting
data _very_ slowly and seemingly nothing came out of the speakers. Actually,
I'm not sure it actually ate any data; echo > /dev/dsp blocked, but some
audio apps _seemed_ to make some progress.

rmmod es1370; insmod es1370 succeeded, but didn't help - I had to reboot.

2.4.10ac10 (which is what I ran before 17rc1) never showed this. I wasn't
able to reproduce it on purpose.

I guess this is not the fix for that?


-- v --

[email protected]

2002-01-13 22:57:49

by Alan Cox

Subject: Re: Linux 2.4.18pre3-ac1

> On Sun, Jan 13, 2002 at 04:44:46PM -0500, you [Alan Cox] claimed:
> > People keep bugging me about the -ac tree stuff so this is whats in my
> > current internal diff with the ll patch and the ide changes excluded.
>
> Any big reason why you aren't including those two? I'm pretty sure a lot of
> people will eventual bug Marcelo (and you) about merging ide to 2.4
> proper (or -ac)... :)

So I can tell which patch causes problems, if any.

> 2.4.10ac10 (which is what I ran before 17rc1) never showed this. I wan't
> able to reproduce it on purpose.
>
> I guess this is not the fix for that?

That's the first I've heard of the other problem.

2002-01-14 00:15:46

by Adam Kropelin

Subject: Re: Linux 2.4.18pre3-ac1

----- Original Message -----
From: "Alan Cox" <[email protected]>
To: <[email protected]>
Sent: Sunday, January 13, 2002 4:44 PM
Subject: Linux 2.4.18pre3-ac1


> People keep bugging me about the -ac tree stuff so this is whats in my
> current internal diff with the ll patch and the ide changes excluded.

<snip>

For the sake of completeness I ran my large inbound FTP transfer test (details
in the "Writeout in recent kernels..." thread) on this release. Performance and
observed writeout behavior were essentially the same as for 2.4.17, both stock
and with -rmap11a. Transfer time was 6:56 and writeout was uneven. 2.4.13-ac7 is
still the winner by a significant margin.

Hmmm...

--Adam


2002-01-14 00:38:20

by Alan

Subject: Re: Linux 2.4.18pre3-ac1

> in the "Writeout in recent kernels..." thread) on this release. Performance and
> observed writeout behavior was essentially the same as for 2.4.17, both stock
> and with -rmap11a. Transfer time was 6:56 and writeout was uneven. 2.4.13-ac7 is
> still the winner by a significant margin.

That is very useful information, actually. It does rather imply that some
of the performance hit came from the block I/O elevator differences in the
old -ac tree (the ones Linus hated ;)). Now the question (and part of the
reason Linus didn't like them) is: why?

Alan

2002-01-14 02:13:54

by Benjamin LaHaise

Subject: Re: Linux 2.4.18pre3-ac1

On Mon, Jan 14, 2002 at 12:47:54AM +0000, Alan Cox wrote:
> That is very useful information actually. That does rather imply that some
> of the performance hit came from the block I/O elevator differences in the
> old ac tree (the ones Linus hated ;)). Now the question (and part of the
> reason Linus didnt like them) - is why ?

IIRC, Linus just didn't like the low/high watermarks for starting and
stopping io. Personally, I liked the mechanism and wanted to use it for
deciding when to submit additional blocks from the buffer cache for the
device (it provides a nice means of encouraging batching). The problem that
started this whole mess was a combination of the missing wake_up in the
block layer that I found, plus the horrendous io latency we hit with a long
io queue and no priorities. The critical pages for swap-in and program
loading, as well as background writeouts, need a priority boost so that the
interactive feel is better. Of course, with quite a few improvements
between 2.4.7 and 2.4.17 in when we wait on ios going into the vm, we don't
wait as indiscriminately on io as we did back then. But writeout latency
can still hurt us.

In effect, it is a latency vs. throughput tradeoff.

-ben

2002-01-14 02:55:16

by Rik van Riel

Subject: Re: Linux 2.4.18pre3-ac1

On Sun, 13 Jan 2002, Adam Kropelin wrote:

> From: "Alan Cox" <[email protected]>
>
> > People keep bugging me about the -ac tree stuff so this is whats in my
> > current internal diff with the ll patch and the ide changes excluded.

> For the sake of completeness I ran my large inbound FTP transfer test
> (details in the "Writeout in recent kernels..." thread) on this
> release. Performance and observed writeout behavior was essentially
> the same as for 2.4.17, both stock and with -rmap11a. Transfer time
> was 6:56 and writeout was uneven. 2.4.13-ac7 is still the winner by a
> significant margin.

I'm looking into this bug; I just finished the first large
dbench test set on 2.4.17-rmap11b with 512 MB RAM, and tomorrow
I'll run it with 128 and 32 MB of RAM.

Luckily you have already shown the other recent kernels to
have the same performance, so I only have to do half a day
of testing. I'll try to track down this bug and get it fixed.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-14 05:51:09

by Eric W. Biederman

Subject: Re: Linux 2.4.18pre3-ac1

Rik van Riel <[email protected]> writes:

> On Sun, 13 Jan 2002, Adam Kropelin wrote:
>
> > From: "Alan Cox" <[email protected]>
> >
> > > People keep bugging me about the -ac tree stuff so this is whats in my
> > > current internal diff with the ll patch and the ide changes excluded.
>
> > For the sake of completeness I ran my large inbound FTP transfer test
> > (details in the "Writeout in recent kernels..." thread) on this
> > release. Performance and observed writeout behavior was essentially
> > the same as for 2.4.17, both stock and with -rmap11a. Transfer time
> > was 6:56 and writeout was uneven. 2.4.13-ac7 is still the winner by a
> > significant margin.
>
> I'm looking into this bug, I just finished the first large
> dbench test set on 2.4.17-rmap11b with 512 MB RAM, tomorrow
> I'll run them with 128 and 32 MB of RAM.
>
> Luckily you have already shown the other recent kernels to
> have the same performance, so I only have to do half a day
> of testing. I'll try to track down this bug and get it fixed.

Rik, while you are looking at your reverse mapping code, I would like
to call to your attention the at least tripling of fork times. I
wouldn't be surprised if the reason your rmap vm handles things like
gcc -j better than the stock kernel is simply the reduced number of
processes, due to slower forking.

Just my 2 cents, so we don't forget the caveats of the reverse map
approach.

Eric

2002-01-14 06:18:09

by Rik van Riel

Subject: Re: Linux 2.4.18pre3-ac1

On 13 Jan 2002, Eric W. Biederman wrote:
> Rik van Riel <[email protected]> writes:

> Rik while you are looking at your reverse mapping code, I would like
> to call to your attention the at least trippling of times for fork.

Dave McCracken has measured this on his system; it seems to vary
from 10% for bash to 400% for a process with 10 MB of memory.

This is a problem that will need to be solved. A number of designs
for how to deal with it are ready; the implementation still needs to be done.

> I wouldn't be surprised if the reason your rmap vm handles things like
> gcc -j better than the stock kernel is simply the reduced number of
> processes, due to slower forking.

I really doubt this, since gcc spends so much more time doing
real work than forking that the time used in fork can be ignored,
even if it gets 3 times slower.

> Just my 2 cents so we don't forget the caveats of the reverse map
> approach.

The main way we can speed up fork easily is by not copying the
page tables at all at fork time but filling them in later, at page
fault time. While this might look like just moving the overhead
from one place to another, for the typical fork()+exec() case it
means (1) we don't copy the page tables at fork time, (2) we don't
need to free them at exec time, and (3) after the exec, the parent can
just take back the complete page tables without having to take COW
faults on all its pages.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-14 07:28:37

by Eric W. Biederman

Subject: Re: Linux 2.4.18pre3-ac1

Rik van Riel <[email protected]> writes:

> On 13 Jan 2002, Eric W. Biederman wrote:
> > Rik van Riel <[email protected]> writes:
>
> > Rik while you are looking at your reverse mapping code, I would like
> > to call to your attention the at least trippling of times for fork.
>
> Dave McCracken has measured this on his system, it seems to vary
> from between 10% for bash to 400% for a process with 10 MB of memory.

O.k. That sounds about like what I was expecting.

> This is a problem which will need to be solved, a number of designs
> on how to deal with this are ready, implementation needs to be done.


> > I wouldn't be surprised if the reason your rmap vm handles things like
> > gcc -j better than the stock kernel is simply the reduced number of
> > processes, due to slower forking.
>
> I really doubt this, since gcc spends so much more time doing
> real work than forking that the time used in fork can be ignored,
> even if it gets 3 times slower.

But for make -j the forking is done by make, and it is nearly a
fork bomb; there is simply a linear increase in the number of processes
instead of an exponential one. So I will at least hold this as a candidate
for the make -j kernel fixes.

> > Just my 2 cents so we don't forget the caveats of the reverse map
> > approach.
>
> The main way we can speed up fork easily is by not copying the
> page tables at all at fork time but filling them in later at page
> fault time. While this might look like it's just moving the overhead
> from one place to another, but for the typical fork()+exec() case it
> means (1) we don't copy the page tables at fork time (2) we don't
> need to free them at exec time (3) after the exec, the parent can
> just take back the complete page tables without having to take COW
> faults on all its pages.

Which is definitely a win. Perhaps we could even have paged page tables
at that point.

There is a second piece that should make things faster as well: adopt
a BSD-style page table allocation where we do an order-1 allocation
and allocate both the page table and the reverse page table in the same
chunk of memory. That means you can jump from one to the other with pointer
arithmetic, so you can lose one element of your reverse page table chain
structure.

Eric

2002-01-14 09:30:23

by David Miller

Subject: Re: Linux 2.4.18pre3-ac1

From: [email protected] (Eric W. Biederman)
Date: 14 Jan 2002 00:25:16 -0700

But for make -j the forking is done by make and it is nearly a
fork bomb

Someone has probably mentioned this, but it is important to recognize
that make uses vfork().

2002-01-14 12:06:31

by Rik van Riel

Subject: Re: Linux 2.4.18pre3-ac1

On Mon, 14 Jan 2002, David S. Miller wrote:
> From: [email protected] (Eric W. Biederman)
>
> But for make -j the forking is done by make and it is nearly a
> fork bomb
>
> Someone has probably mentioned this, but it is important to recognize
> that make uses vfork().

Indeed. In the beginning I was also afraid I'd hit the fork()
problem Eric mentions, but after running lots of tests I can't
really say it has shown up in the profiles anywhere.

I'm sure you could make a benchmark to clearly show it, but for
most common workloads it doesn't seem to be much of an issue.
A possible exception to this is apache, I need to look into that
a bit more.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 03:42:02

by Daniel Phillips

Subject: Re: Linux 2.4.18pre3-ac1

On January 14, 2002 08:25 am, Eric W. Biederman wrote:
> Rik van Riel <[email protected]> writes:
> > On 13 Jan 2002, Eric W. Biederman wrote:
> > > Rik while you are looking at your reverse mapping code, I would like
> > > to call to your attention the at least trippling of times for fork.
> >
> > Dave McCracken has measured this on his system, it seems to vary
> > from between 10% for bash to 400% for a process with 10 MB of memory.
>
> O.k. That sounds about like what I was expecting.
>
[...]
> > > Just my 2 cents so we don't forget the caveats of the reverse map
> > > approach.
> >
> > The main way we can speed up fork easily is by not copying the
> > page tables at all at fork time but filling them in later at page
> > fault time. While this might look like it's just moving the overhead
> > from one place to another, but for the typical fork()+exec() case it
> > means (1) we don't copy the page tables at fork time (2) we don't
> > need to free them at exec time (3) after the exec, the parent can
> > just take back the complete page tables without having to take COW
> > faults on all its pages.
>
> Which is definitely a win. Perhaps we could even have paged page tables
> at that point.

Yes, it's possible but it's of secondary importance. The first, essential
goal has to be to eliminate the rmap fork overhead so that rmap becomes
a 'never worse and often better' solution. It's for this reason that I
developed an algorithm a few weeks ago to do lazy page table instantiation
efficiently, which is what Rik is referring to. I'm not quite ready to
post details yet, since I haven't tried it, and frankly, I'm learning about
Unix memory management as I go, so there may well be a gaping hole I've
missed. Hopefully we'll know in a few days, and I'll post the full
writeup.

The way I see it, the purpose of lazy page table instantiation is to
overcome objections to the reverse pte mapping vm technique that have
been expressed in the past, namely the slowdown in dup_mmap inside fork.
I.e., if rmap slows down fork then Linus and Davem are going to
veto it, as they've done in the past, because they feel that the
as-yet-unproven advantages of physically-based vm scanning don't
outweigh the easily measurable fork overhead. Personally, I think
that's debatable, but by eliminating the overhead we eliminate the
objection, and as far as I know, it's the only serious objection.

--
Daniel

2002-01-21 05:30:56

by Richard Gooch

Subject: Re: Linux 2.4.18pre3-ac1

Daniel Phillips writes:
> The way I see it, the purpose of lazy page table instantiation is to
> overcome objections to the reverse pte mapping vm technique that
> have been expressed in the past, namely the slowdown in dup_mmap
> inside fork. I.e., if rmap slows down fork then Linus and Davem are
> going to veto it, as they've done in the past, because they feel
> that the as-yet-unproven advantages of physically-based vm scanning
> doesn't outweigh the easily measurable fork overhead. Personally, I
> think that's debatable, but by eliminating the overhead we eliminate
> the objection, and as far as I know, it's the only serious
> objection.

Will lazy page table instantiation speed up fork(2) without rmap?
If so, then you've got a problem, because rmap will still be slower
than non-rmap. Linus will happily grab any speedup and make that the
new baseline against which new schemes are compared :-)

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2002-01-21 05:35:36

by Rik van Riel

Subject: Re: Linux 2.4.18pre3-ac1

On Sun, 20 Jan 2002, Richard Gooch wrote:

> Will lazy page table instantiation speed up fork(2) without rmap?
> If so, then you've got a problem, because rmap will still be slower
> than non-rmap. Linus will happily grab any speedup and make that the
> new baseline against which new schemes are compared :-)

I guess the difference here is "optimised for lmbench"
vs. "optimised to be stable in real workloads" ;)

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 07:05:28

by Eric W. Biederman

Subject: Re: Linux 2.4.18pre3-ac1

Rik van Riel <[email protected]> writes:

> On Sun, 20 Jan 2002, Richard Gooch wrote:
>
> > Will lazy page table instantiation speed up fork(2) without rmap?
> > If so, then you've got a problem, because rmap will still be slower
> > than non-rmap. Linus will happily grab any speedup and make that the
> > new baseline against which new schemes are compared :-)

But the differences will go down to the noise level. Your average fork
shouldn't need to copy more than one page. So the amount of work is
near constant.

> I guess the difference here is "optimised for lmbench"
> vs. "optimised to be stable in real workloads" ;)

Currently the rmap patch triples the size of the page tables, which is
also an issue. Though it is relatively straightforward to reduce
that to simply double the page table size with an order-1 allocation,
so we can remove one pointer.

Unless I am mistaken, an everyday shell script is a fairly fork/exec/exit
intensive operation. And there are probably more shell scripts for
Unix than every other kind of program put together.

An additional possible strike against rmap is that walking through
page tables in virtual address order is fairly cache friendly, while a
random walk has more of a cache penalty.

One more case that is difficult for rmap is the highly mapped case of
something like glibc. You can easily get to a thousand entries or
more for a single page, in which case a doubly linked list may be
more appropriate than a singly linked list (for add/insert), but this
again triples or quadruples the page table size. And none of it
solves having to walk very long lists in some circumstances. The
best you can do is periodically unmap pages, and then you only
have very long lists for highly active pages.

And to be fair, rmap has some advantages over the current system. VM
algorithms are somewhat simpler to code when you can code them however
you want to, instead of being constrained by other parts of the
implementation.

To the true sceptic what remains to be shown is

Eric

2002-01-21 12:03:15

by Rik van Riel

Subject: Re: Linux 2.4.18pre3-ac1

On 21 Jan 2002, Eric W. Biederman wrote:

> Currently the rmap patch triples the size of the page tables which is
> also an issue. Though it is relatively straight forward to reduce
> that to simply double the page table size with a order(1) allocation,
> so we can remove one pointer.

Actually most processes seem to be much smaller than 4 MB
_and_ have their pages spread out over their address space.

This means the page tables are sparsely populated and the
pte_chain mechanism should use less memory than doubling
the size of the page tables.

> Unless I am mistaken an every day shell script is fairly fork/exec/exit
> intensive operation. And there are probably more shell scripts for
> unix than every other kind of program put together.

Bash and gcc seem to use vfork, not sure about make...

> An additional possible strike against rmap is that walking through
> page tables in virtual address order is fairly cache friendly, while a
> random walk has more of a cache penalty.

In theory. In practice, however, kswapd seems to use less CPU
with the -rmap VM; most notably, it doesn't seem to get lost
in the worst-case behaviour of the normal VM, where it scans
hundreds of megabytes of normal-zone memory because it has a DMA
zone shortage...

> One more case that is difficult for rmap is the highly mapped case of
> something like glibc. You can easily get to a thousand entries or
> more for a single page. In which case a doubly linked list may be
> more appropriate then a singly linked list (for add/insert), but this
> again tripples or quadruples the page table size. And none of it
> solves having to walk very long lists in some circumstances. The
> best you can do is periodically unmapping pages, and then you only
> have very long lists for highly active pages.

I admit this could be an issue. It would be interesting to see
if it is an issue in practice though...

> And to be fair rmap has some advantages over the current system. VM
> algorithms are some simpler to code when you can code them however
> you want to, instead of being constrained by other parts of the
> implementation.
>
> To the true sceptic what remains to be shown is

Well, you could download the patch and look for yourself ;)

http://surriel.com/patches/
http://linuxvm.bkbits.net/

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-21 13:17:54

by Daniel Phillips

Subject: Re: Linux 2.4.18pre3-ac1

On January 21, 2002 06:30 am, Richard Gooch wrote:
> Daniel Phillips writes:
> > The way I see it, the purpose of lazy page table instantiation is to
> > overcome objections to the reverse pte mapping vm technique that
> > have been expressed in the past, namely the slowdown in dup_mmap
> > inside fork. I.e., if rmap slows down fork then Linus and Davem are
> > going to veto it, as they've done in the past, because they feel
> > that the as-yet-unproven advantages of physically-based vm scanning
> > doesn't outweigh the easily measurable fork overhead. Personally, I
> > think that's debatable, but by eliminating the overhead we eliminate
> > the objection, and as far as I know, it's the only serious
> > objection.
>
> Will lazy page table instantiation speed up fork(2) without rmap?

Yes.

> If so, then you've got a problem, because rmap will still be slower
> than non-rmap. Linus will happily grab any speedup and make that the
> new baseline against which new schemes are compared :-)

Fortunately, rmap and non-rmap will fork at the same speed since in
each case the work will consist of copying just the page directory and
incrementing the use counts of up to 1024 page tables.

Page table instantiation, which happens at fault time, will be slower
for rmap than non-rmap. However there are offsetting factors that
suggest the bottom line performance will be very similar in unloaded
cases, and will favor rmap under heavy load.

--
Daniel

2002-01-21 13:58:10

by Daniel Phillips

Subject: Re: Linux 2.4.18pre3-ac1

On January 21, 2002 08:01 am, Eric W. Biederman wrote:
> Rik van Riel <[email protected]> writes:
> > On Sun, 20 Jan 2002, Richard Gooch wrote:
> >
> > > Will lazy page table instantiation speed up fork(2) without rmap?
> > > If so, then you've got a problem, because rmap will still be slower
> > > than non-rmap. Linus will happily grab any speedup and make that the
> > > new baseline against which new schemes are compared :-)
>
> But the differences will go down to the noise level. Your average fork
> shouldn't need to copy more than one page. So the amount of work is
> near constant.

In fact there's no difference at all at fork time, since the instantiation
work is deferred to page fault time.

> > I guess the difference here is "optimised for lmbench"
> > vs. "optimised to be stable in real workloads" ;)
>
> Currently the rmap patch triples the size of the page tables which is
> also an issue. Though it is relatively straight forward to reduce
> that to simply double the page table size with a order(1) allocation,
> so we can remove one pointer.

As Rik pointed out, the overhead isn't per pte; it's 8 bytes per mapped page
plus 4 bytes per physical page.

This can be reduced to just 4 bytes per physical page in the case of
nonshared pages, and in the shared case the 8 bytes per mapped page can be
reduced by various strategies. Even as it stands it's not too bad.

> Unless I am mistaken an every day shell script is fairly fork/exec/exit
> intensive operation. And there are probably more shell scripts for
> unix than every other kind of program put together.
>
> An additional possible strike against rmap is that walking through
> page tables in virtual address order is fairly cache friendly, while a
> random walk has more of a cache penalty.

Yes. I've proposed a small optimization where each pte_chain link points to
several ptes, reducing the cache penalty of the pte chain walk. Improving
the locality of the pte accesses themselves is not as easy, since that would
require the lru list to be in non-random order with respect to the ptes,
and I don't know any simple way to do that. I also think it doesn't
matter much, since this overhead is incurred only when we are doing heavy
scanning of the ptes, and we only do that under heavy memory pressure.
In theory, the cost of the extra cache misses will be drowned out by the
savings from improved page replacement decisions. Of course, this remains
to be seen.

> One more case that is difficult for rmap is the highly mapped case of
> something like glibc. You can easily get to a thousand entries or
> more for a single page. In which case a doubly linked list may be
> more appropriate then a singly linked list (for add/insert), but this
> again tripples or quadruples the page table size. And none of it
> solves having to walk very long lists in some circumstances. The
> best you can do is periodically unmapping pages, and then you only
> have very long lists for highly active pages.

It's likely that many of the page tables referring to glibc can be shared as
well. This is a somewhat different problem from lazy instantiation, but it
looks tractable to me. Without such sharing there are various things that
can be done to reduce the list maintenance overhead. Such long lists can be
special-cased, for example, so that you would go to a doubly linked list or a
tree only when sharing exceeds some threshold.

> And to be fair rmap has some advantages over the current system. VM
> algorithms are some simpler to code when you can code them however
> you want to, instead of being constrained by other parts of the
> implementation.

Virtual scanning has a fundamental disadvantage compared to reverse
mapping: there is a large and unpredictable lag between the time a pte's
accessed bit is transferred to a physical page and the time it is queried
during the physical scan. Thus, virtual scanning does not scale well,
because as memory size increases, the age of the accessed information in
the physical page becomes increasingly random. There is no simple way to
partition the virtual scan by physical region to reduce this lag. Though
this is far from the only problem with virtual scanning, in the long run
it's the killer.

--
Daniel