I would like to have feedback about this VM update. If nobody can find
any serious issue I'd try to push vm-28 into mainline during 2.4.19pre.
Please test OOM conditions as well.
Thanks!
URL:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1.gz
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/
Only in 2.4.18rc4aa1: 00_block-highmem-all-18b-3.gz
Only in 2.4.19pre1aa1: 00_block-highmem-all-18b-4.gz
Fix leftover setting.
Only in 2.4.18rc4aa1: 00_hpfs-oops-1
Only in 2.4.18rc4aa1: 30_get_request-starvation-1
Only in 2.4.18rc4aa1: 00_init-blk-freelist-1
Now in mainline.
Only in 2.4.19pre1aa1: 00_lcall_trace-1
Call gate entry point speciality.
Only in 2.4.18rc4aa1: 00_prepare-write-fixes-1
Only in 2.4.19pre1aa1: 00_prepare-write-fixes-2
Avoid false positives (agreed Andrew?).
Only in 2.4.18rc4aa1: 10_rawio-vary-io-2
Only in 2.4.19pre1aa1: 10_rawio-vary-io-3
Rediffed.
Only in 2.4.18rc4aa1: 10_vm-27
Only in 2.4.19pre1aa1: 10_vm-28
Further updates. As soon as I get confirmation that this goes well in all
the benchmarks, I think it should go into mainline.
Only in 2.4.18rc4aa1: 70_xfs-1.gz
Only in 2.4.19pre1aa1: 70_xfs-2.gz
Drop PG_launder: it never really existed in -aa; wait_IO does a
better job (not only for dirty bh submitted by the vm) and wait_IO is
just supported by xfs.
Andrea
On Wed, 27 Feb 2002, Andrea Arcangeli wrote:
> I would like to have feedback about this VM update, if nobody can find
> any serious issue I'd try to push vm-28 into mainline during 2.4.19pre.
> Please test oom conditions as well.
I have enjoyed using your -aa patches (and run child first) for some time,
and Rik's rmap patches as well. However, I find that for some machines
your stuff works clearly better, particularly larger memory machines, and
for some rmap is clearly more responsive, particularly for small machines
under heavy memory pressure.
The point is that choice is good, and having two solutions to address
various machines is a good thing, even if the convenience isn't all that
great. That being said, I fear that if your solution gets pushed into
mainline it will preempt other solutions. And my testing tells me
that there is no one solution here, even with all the tuning in your VM,
using the hints you gave me.
I would rather see both systems continue to be available, until there is a
clear winner (ie. no common cases where one is clearly worse than the
other), or until they somehow merge, or even become config options (I
don't really favor that). I suggested that VM would be nice as a module,
but it doesn't seem possible.
If others share the thought that it's too early for a preemptive choice
please speak up. And if everyone feels that this is good I will not beat a
dead horse on this one.
I assume you meant "serious issues" with failures, rather than
semi-political timing and choice issues.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Thu, Feb 28, 2002 at 05:11:25PM -0500, Bill Davidsen wrote:
> On Wed, 27 Feb 2002, Andrea Arcangeli wrote:
>
> > I would like to have feedback about this VM update, if nobody can find
> > any serious issue I'd try to push vm-28 into mainline during 2.4.19pre.
> > Please test oom conditions as well.
>
> I have enjoyed using your -aa patches (and run child first) for some time,
> and Rik's rmap patches as well. However, I find that for some machines
> your stuff works clearly better, particularly larger memory machines, and
> for some rmap is clearly more responsive, particularly for small machines
> under heavy memory pressure.
>
> The point is that choice is good, and having two solutions two address
> various machines is a good thing, even if the convenience isn't all that
> great. That being said, I fear that if your solution gets pushed into
> mainline that it will preempt other solutions. And my testing tells me
> that there is no one solution here, even with all the tuning in your VM,
> using the hints you gave me.
>
The problem here is that currently the mainline kernel makes some bad
decisions in the VM, and -aa is the solution in this case. When -aa is
merged, you will still have both solutions: one in mainline, one as a patch
(rmap).
Linus has already changed the VM once in 2.4, and I don't really see another
large VM change (rmap in 2.4) happening again.
Rmap looks promising for a 2.5 merge after several issues are overcome
(pte-highmem, etc).
Mike
On Thu, 28 Feb 2002, Mike Fedyk wrote:
> The problem here is that currently the mainline kernel makes some bad
> dicesions in the VM, and -aa is the solution in this case. When -aa is
> merged, you will still have both solutions; one in mainline, one as a patch
> (rmap).
>
> Linus has already changed the VM once in 2.4, and I don't really see another
> large VM change (rmap in 2.4) happening again.
>
> Rmap looks promising for a 2.5 merge after several issues are overcome
> (pte-highmem, etc).
I do understand what happens in the VM currently... And as noted I run
both -aa kernels and rmap on different machines. But -aa runs better on
large machines and rmap better on small machines with memory pressure (my
experience), so blessing one and making the other "only a patch" troubles
me somewhat. I hate to say "compete" as VM solutions, but they both solve
the same problem with more success in one field or another.
If either is adopted, the pressure will be off to improve in the areas
where one or the other is weak. Once the decision is made that won't
happen. And if rmap is a large VM change, what then is Andrea's code?
Large isn't just the size of the patch, it is to some extent the size of
the behaviour change.
For me it makes little difference, I like to play with kernels, and I'm
hoping for the source which needs only numbers in /proc/sys to tune,
rather than patches. But there are a lot more small machines (which I feel
are better served by rmap) than large. I would like to leave the jury out
a little longer on this.
I was looking for opinions; thank you for sharing yours!
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Thu, Feb 28, 2002 at 10:26:48PM -0500, Bill Davidsen wrote:
> experience), so blessing one and making the other "only a patch" troubles
> me somewhat. I hate to say "compete" as VM solution, but they both solve
> the same problem with more success in one field or another.
>
> If either is adopted the pressure will be off to improve in the areas
> where one or the other is weak, Once the decision is made that won't
> happen,
I sincerely doubt that Rik will slow down at all when parts of -aa are in
the mainline kernel. There is 2.5 to work toward, and 2.4 isn't a lost
cause...
Also, one has already been blessed, way back in 2.4.10-pre11 by Linus. I
don't see any chance of rmap getting into 2.4 before 2.4.27+. Marcelo has
said he wants to see rmap in production in -ac for a while before he
thinks about merging rmap, and that's good IMHO.
>And if rmap is a large VM change, what then is Ardrea's code?
> Large isn't just the size of the patch, it is to some extent the size of
> the behavior change.
>
True, and by that token, rmap would be the larger change in behavior (not
swapping on disk accesses, etc ;).
> For me it makes little difference, I like to play with kernels, and I'm
> hoping for the source which needs only numbers in /proc/sys to tune,
> rather than patches. But there are a lot more small machines (which I feel
> are better served by rmap) than large. I would like to leave the jury out
> a little longer on this.
>
Look at it another way: by forcing Andrea to send it
in as small chunks with descriptions, we may finally get a documented -aa
VM. ;) So, let's watch and see that happen.
I don't see anyone benefiting from having *both* of the VM enhancements as
external patches.
> I was looking for opinions, thak you for sharing yours.!
>
You will certainly find that here. ;)
On Thu, 28 Feb 2002, Bill Davidsen wrote:
> On Thu, 28 Feb 2002, Mike Fedyk wrote:
>
> > The problem here is that currently the mainline kernel makes some bad
> > dicesions in the VM, and -aa is the solution in this case. When -aa is
> > merged, you will still have both solutions; one in mainline, one as a patch
> > (rmap).
> >
> > Linus has already changed the VM once in 2.4, and I don't really see another
> > large VM change (rmap in 2.4) happening again.
> >
> > Rmap looks promising for a 2.5 merge after several issues are overcome
> > (pte-highmem, etc).
>
> I do understand what happens in the VM currently... And as noted I run
> both -aa kernels and rmap on different machines. But -aa runs better on
> large machines and rmap better on small machines with memory pressure (my
> experience), so blessing one and making the other "only a patch" troubles
> me somewhat. I hate to say "compete" as VM solution, but they both solve
> the same problem with more success in one field or another.
2.4 VM is Andrea's. There's no competition. I see current -aa VM patches
just as maintenance, which is performed outside the mainline for good
reasons. As soon as Andrea is satisfied with testing, -aa will be
integrated into Marcelo's 2.4. This is just part of VM (which admittedly
was quite "young" when it was included) maintenance/evolution.
OTOH, Red Hat 2.4 kernels are still based on Rik's, AFAIK. I bet they'll
be running 2.4-rmap sooner or later. Red Hat has a long history of running
kernels with non standard features (RAID 0.90 comes to mind). So maybe
there *is* competition, but on the vendor side only. I do hope vanilla
2.4 VM will be -aa forever (but I'll be running RH provided kernels most
of the time - I like them).
.TM.
> OTOH, Red Hat 2.4 kernels are still based on Rik's, AFAIK. I bet they'll
The RH 2.4.7-9 kernels are based on the stuff Rik wanted to try in 2.4 that
Linus played with, mixed with chunks he used once and then ignored. Think of
it as 2.4.Rik VM, but not rmap.
For the future we'll evaluate all sorts of options for our customers to see
what is best to deliver - that's our job.
Alan
On Thu, 28 Feb 2002, Mike Fedyk wrote:
> Look at it another way, by forcing Andrea to send it in as small
> chunks with descriptions, we may finally get a documented -aa VM. ;)
> So, lets watch and see that happen.
That would be the preferred way. There must be some good stuff
hidden in -aa, but it won't turn into maintainable code just by
merging stuff into the kernel.
It'll turn into maintainable code by having it merged in small,
documented pieces.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Fri, Mar 01, 2002 at 09:51:54AM -0300, Rik van Riel wrote:
> On Thu, 28 Feb 2002, Mike Fedyk wrote:
>
> > Look at it another way, by forcing Andrea to send it in as small
> > chunks with descriptions, we may finally get a documented -aa VM. ;)
> > So, lets watch and see that happen.
>
> That would be the preferred way. There must be some good stuff
> hidden in -aa, but it won't turn into maintainable code just by
> merging stuff into the kernel.
>
> It'll turn into maintainable code by having it merged in small,
> documented pieces.
>
Let me see... "small chunks with descriptions". I think we're saying the
same thing. What we need is to have those descriptions in the patches sent
to Marcelo, that way the docs are in the code...
Mike
On Thu, Feb 28, 2002 at 10:26:48PM -0500, Bill Davidsen wrote:
> rather than patches. But there are a lot more small machines (which I feel
> are better served by rmap) than large. I would like to leave the jury out
I think there's quite some confusion going on from the rmap users, let's
clarify the facts.
The rmap design in the VM is all about decreasing the complexity of
swap_out on the huge boxes (so it's all about saving CPU), by slowing
down a lot of fast common paths like page faults and by paying with
some memory too. See the lmbench numbers posted by Randy after applying
rmap to see what I mean.
On a very lowmem machine the rmap design shouldn't really make a sensible
difference, the smaller the amount of mapped VM, the less rmap can make
differences, period.
So I wouldn't really worry about the low mem machines. I guess what
makes the difference for you (the responsiveness part) is things like
read-latency2, included at least in some variant of the rmap patch, but
they're completely orthogonal to the VM. They're included in the rmap
patch just incidentally: the rmap patch isn't just about the rmap
design, it's lots of other stuff too. Please don't mistake this for a
blame, I would prefer it if they were kept separated so people wouldn't
be confused into thinking rmap gives the responsiveness on the lowmem
boxes, but I'm also not perfect at maintaining patches sometimes; see
vm-28, it does more than just one thing, even if they're at least all
vm related things.
Note that I'm listening to the rmap design too, and Rik's implementation
should be better than the last one I saw last year from Dave, but I
really am not going to slow down page faults and other paths just to
save CPU during heavy swapout in 2.4; all my machines are mostly idle
during heavy swapout/pageout anyways.
For 2.5 it would be easy to integrate just the rmap design from Rik's
patch on top of my vm-28; as far as the design is concerned, that's
orthogonal to all the other changes I'm doing. But the very visible
lmbench slowdowns for lots of the important common paths haven't made it
appealing to me yet (first somebody has to show me the total wastage of
cpu during swapout with my current patch applied, I mean the last column
on the right of vmstat).
So in short you may want to try 2.4.19pre1 + vm-28 + read-latency2 (or
even more simply 2.4.19pre1aa1 + read-latency2) and see if it makes the
system as responsive as rmap for you on the lowmem boxes. Let us know if
it helps, thanks!
IMHO vm-28 should be somehow included into mainline ASAP (before 2.4.19
is released), then again IMHO we can forget about the 2.4 VM and it will
be definitely finished.
Andrea
> On a very lowmem machine the rmap design shouldn't really make a sensible
> difference, the smaller the amount of mapped VM, the less rmap can make
> differences, period.
It makes a big big difference on a low memory box. Try running xfce on
a 24Mb box with the base 2.4.18, 2.4.18 + rmap12f and 2.4.18+aa. That's
a case where aa definitely loses, and without other I/O patches being
applied. It's an X11 based workload with a -lot- of shared pages. Both
rmap and aa materially outperform 2.4.18 base on this workload (and 2.4.17
blew up with out of memory errors)
> IMHO vm-28 should be somehow included into mainline ASAP (before 2.4.19
> is released), then again IMHO we can forget about the 2.4 VM and it will
> be definitely finished.
With luck 8) VM is never finished 8(
Alan
On Sat, Mar 02, 2002 at 02:28:20AM +0000, Alan Cox wrote:
> > On a very lowmem machine the rmap design shouldn't really make a sensible
> > difference, the smaller the amount of mapped VM, the less rmap can make
> > differences, period.
>
> It makes a big big difference on a low memory box. Try running xfce on
> a 24Mb box with the base 2.4.18, 2.4.18 + rmap12f and 2.4.18+aa. Thats
> a case where aa definitely loses and without other I/O patches being
hmm, to fully evaluate this I'd need to have access to the exact two kernel
source tarballs that you compared (a diff against a known vanilla kernel
tree would be fine) and to know how you measured the difference between
them while xfce was running (nominal performance/responsiveness/whatever?).
Andrea
On Sat, Mar 02, 2002 at 09:57:49PM -0200, Denis Vlasenko wrote:
> If rmap is really better than current VM, it will be merged into head
> development branch (2.5). There is no anti-rmap conspiracy :-)
Indeed.
Andrea
On Sat, 2002-03-02 at 15:47, Andrea Arcangeli wrote:
> On Sat, Mar 02, 2002 at 09:57:49PM -0200, Denis Vlasenko wrote:
>
> > If rmap is really better than current VM, it will be merged into head
> > development branch (2.5). There is no anti-rmap conspiracy :-)
>
> Indeed.
Of note: I don't think anyone "loses" if one VM is merged or not. A
reverse mapping VM is a significant redesign of our current VM approach
and if it proves better, yes, I suspect (and hope) it will be merged
into 2.5.
But that doesn't mean the 2.4 VM is worse, per se.
Robert Love
On March 2, 2002 03:06 am, Andrea Arcangeli wrote:
> On Thu, Feb 28, 2002 at 10:26:48PM -0500, Bill Davidsen wrote:
> > rather than patches. But there are a lot more small machines (which I feel
> > are better served by rmap) than large. I would like to leave the jury out
>
> I think there's quite some confusion going on from the rmap users, let's
> clarify the facts.
>
> The rmap design in the VM is all about decreasing the complexity of
> swap_out on the huge boxes (so it's all about saving CPU), by slowing
> down a big lots of fast common paths like page faults and by paying with
> some memory too. See the lmbench numbers posted by Randy after applying
> rmap to see what I mean.
Do you know any reason why rmap must slow down the page fault fast path, or
are you just thinking about Rik's current implementation? Yes, rmap has to
add a pte_chain entry there, but it can be a direct pointer in the unshared
case, and the spinlock looks like it can be avoided in the common case as
well.
--
Daniel
On Sun, Mar 03, 2002 at 10:38:34PM +0100, Daniel Phillips wrote:
> On March 2, 2002 03:06 am, Andrea Arcangeli wrote:
> > On Thu, Feb 28, 2002 at 10:26:48PM -0500, Bill Davidsen wrote:
> > > rather than patches. But there are a lot more small machines (which I feel
> > > are better served by rmap) than large. I would like to leave the jury out
> >
> > I think there's quite some confusion going on from the rmap users, let's
> > clarify the facts.
> >
> > The rmap design in the VM is all about decreasing the complexity of
> > swap_out on the huge boxes (so it's all about saving CPU), by slowing
> > down a big lots of fast common paths like page faults and by paying with
> > some memory too. See the lmbench numbers posted by Randy after applying
> > rmap to see what I mean.
>
> Do you know any reason why rmap must slow down the page fault fast, or are
> you just thinking about Rik's current implementation? Yes, rmap has to add
> a pte_chain entry there, but it can be a direct pointer in the unshared case
> and the spinlock looks like it can be avoided in the common case as well.
unshared isn't the very common case (shm, and file mappings like
executables, are all going to be shared, not unshared).
So unless you first share all the pagetables as well (like Ben once said
years ago), it's not going to be a direct pointer in the very common
case. And there's no guarantee you can share the pagetable (even
assuming the kernel supports that to the maximum possible degree across
execve and at random mmaps too) if you map those pages at different
virtual addresses.
Andrea
On March 4, 2002 01:49 am, Andrea Arcangeli wrote:
> On Sun, Mar 03, 2002 at 10:38:34PM +0100, Daniel Phillips wrote:
> > On March 2, 2002 03:06 am, Andrea Arcangeli wrote:
> > > On Thu, Feb 28, 2002 at 10:26:48PM -0500, Bill Davidsen wrote:
> > > > rather than patches. But there are a lot more small machines (which I feel
> > > > are better served by rmap) than large. I would like to leave the jury out
> > >
> > > I think there's quite some confusion going on from the rmap users, let's
> > > clarify the facts.
> > >
> > > The rmap design in the VM is all about decreasing the complexity of
> > > swap_out on the huge boxes (so it's all about saving CPU), by slowing
> > > down a big lots of fast common paths like page faults and by paying with
> > > some memory too. See the lmbench numbers posted by Randy after applying
> > > rmap to see what I mean.
> >
> > Do you know any reason why rmap must slow down the page fault fast, or are
> > you just thinking about Rik's current implementation? Yes, rmap has to add
> > a pte_chain entry there, but it can be a direct pointer in the unshared case
> > and the spinlock looks like it can be avoided in the common case as well.
>
> unshared isn't the very common case (shm, and file mappings like
> executables are all going to be shared, not unshared).
As soon as you have shared pages you start to benefit from rmap's ability
to unmap in one step, so the cost of creating the link is recovered by not
having to scan two page tables to unmap it. In theory. Do you see a hole
in that?
> So unless you first share all the pagetables as well (like Ben once said
> years ago), it's not going to be a direct pointer in the very common
> case. And there's no guarantee you can share the pagetable (even
> assuming the kernels supports that at the maximum possible degree across
> execve and at random mmaps too) if you map those pages at different
> virtual addresses.
The virtual alignment just needs to be the same modulo 4 MB. There are
other requirements as well, but being able to share seems to be the common
case.
--
Daniel
On Mon, Mar 04, 2002 at 02:46:22AM +0100, Daniel Phillips wrote:
> On March 4, 2002 01:49 am, Andrea Arcangeli wrote:
> > On Sun, Mar 03, 2002 at 10:38:34PM +0100, Daniel Phillips wrote:
> > > On March 2, 2002 03:06 am, Andrea Arcangeli wrote:
> > > > On Thu, Feb 28, 2002 at 10:26:48PM -0500, Bill Davidsen wrote:
> > > > > rather than patches. But there are a lot more small machines (which I feel
> > > > > are better served by rmap) than large. I would like to leave the jury out
> > > >
> > > > I think there's quite some confusion going on from the rmap users, let's
> > > > clarify the facts.
> > > >
> > > > The rmap design in the VM is all about decreasing the complexity of
> > > > swap_out on the huge boxes (so it's all about saving CPU), by slowing
> > > > down a big lots of fast common paths like page faults and by paying with
> > > > some memory too. See the lmbench numbers posted by Randy after applying
> > > > rmap to see what I mean.
> > >
> > > Do you know any reason why rmap must slow down the page fault fast, or are
> > > you just thinking about Rik's current implementation? Yes, rmap has to add
> > > a pte_chain entry there, but it can be a direct pointer in the unshared case
> > > and the spinlock looks like it can be avoided in the common case as well.
> >
> > unshared isn't the very common case (shm, and file mappings like
> > executables are all going to be shared, not unshared).
>
> As soon as you have shared pages you start to benefit from rmap's ability
> to unmap in one step, so the cost of creating the link is recovered by not
we'd benefit also with unshared pages.
BTW, for the map-shared mappings we just collect the rmap information;
we need it for vmtruncate, but it's not laid out for efficient
browsing, it's only meant to make vmtruncate work.
> having to scan two page tables to unmap it. In theory. Do you see a hole
> in that?
Just the fact that you never need the reverse lookup during lots of
important production usages (the first that comes to mind is when you have
enough ram to do your job, all number crunching/fileserving, and most
servers are set up that way). This is the whole point. Note that this
has nothing to do with the "cache" part, this is only about the
pageout/swapout stage; only a few servers really need heavy swapout.
The background swapout that avoids unused services staying in ram forever
doesn't matter with or without the rmap design.
And in the other case (heavy swapout/pageouts like in some hard DBMS
usage, simulations and laptops or legacy desktops) we would mostly save
CPU and reduce complexity, but I really don't see system load during
heavy pageouts/swapouts yet, so I don't see an obvious need to save cpu
there either.
Probably the main difference visible in numbers would in fact be
following a perfect lru, but really, giving mapped pages a higher chance
is beneficial. Another bit of the current design, the round-robin cycling
over the whole VM clearing the accessed bitflag and activating physical
pages if needed, can also be seen as a feature in some ways. It is
much better at providing a kind of "clock based" aging of the accessed
bit information, while an rmap-aware lru pass wouldn't really be fair
to all the virtual pages the same way we are now.
> > So unless you first share all the pagetables as well (like Ben once said
> > years ago), it's not going to be a direct pointer in the very common
> > case. And there's no guarantee you can share the pagetable (even
> > assuming the kernels supports that at the maximum possible degree across
> > execve and at random mmaps too) if you map those pages at different
> > virtual addresses.
>
> The virtual alignment just needs to be the same modulo 4 MB. There are
> other requirements as well, but being able to share seems to be the common
> case.
Yep on x86 w/o PAE. With PAE enabled (or an x86-64 kernel) it needs to be
the same layout of phys pages in a naturally aligned 2M chunk. I trust
that will match often in theory, but still tracking it down over execve
and on random mmaps doesn't look that easy; I think for tracking that down
we'd really need the rmap information for everything (not just map
shared like right now). And also doing all the checks and walking the
reverse maps won't be zero cost, but I can see the benefit of the full
pte sharing (starting from cpu cache utilization across tlb flushes).
In fact maybe rmap will be more useful for things like enabling the full
pagetable sharing you're suggesting above, rather than for replacing the
swap_out round robin cycle over the VM. So it might be used only for MM
internals rather than for VM internals.
Andrea
On March 4, 2002 03:25 am, Andrea Arcangeli wrote:
> On Mon, Mar 04, 2002 at 02:46:22AM +0100, Daniel Phillips wrote:
> > As soon as you have shared pages you start to benefit from rmap's ability
> > to unmap in one step, so the cost of creating the link is recovered by not
>
> we'd benefit also with unshared pages.
>
> BTW, for the map shared mappings we just collect the rmap information,
> we need it for vmtruncate, but it's not layed out for efficient
> browsing, it's only meant to make vmtruncate work.
Sorry, transmission error, what did you mean?
> > having to scan two page tables to unmap it. In theory. Do you see a hole
> > in that?
>
> Just the fact you never need the reverse lookup during lots of
> important production usages (first that cames to mind is when you have
> enough ram to do your job, all number crunching/fileserving, and most
> servers are setup that way). This is the whole point. Note that this
> has nothing to do with the "cache" part, this is only about the
> pageout/swapout stage, only a few servers really needs heavy swapout.
You always have to unmap the page at some point, so you win back the cost
of creating the pte_chain there, hopefully. You could argue that paying
the cost up front makes latency a little worse. You might have trouble
measuring that though.
> ...Another bit in the current design of round robin cycling
> over the whole VM clearing the accessed bitflag and activating physical
> pages if needed, can also be see also as a feature in some ways. It is
> much better at providing a kind of "clock based" aging to the accessed
> bit information, while the lru pass rmap aware, wouldn't really be fair
> with all the virtual pages the same way as we do now.
You get a perfectly good clock by scanning the lru list. It's not
totally fair because a page newly promoted from the cold end to the hot
end of the list will get scanned again after a much shorter delta-t,
but it's hard to see why that's bad.
--
Daniel
On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> > having to scan two page tables to unmap it. In theory. Do you see a hole
> > in that?
>
> Just the fact you never need the reverse lookup during lots of
> important production usages (first that cames to mind is when you have
> enough ram to do your job, all number crunching/fileserving, and most
> servers are setup that way). This is the whole point. Note that this
> has nothing to do with the "cache" part, this is only about the
> pageout/swapout stage, only a few servers really needs heavy swapout.
Ahhh, but it's not necessarily about making this common case
better. It's about making sure Linux doesn't die horribly in
some worst cases.
The case of "system has more than enough memory" won't suffer
with -rmap anyway since the amount of activity in the VM part
of the system will be relatively low.
> And on the other case (heavy swapout/pageouts like in some hard DBMS
> usage, simualtions and laptops or legacy desktops) we would mostly save
> CPU and reduce complexity, but I really don't see system load during
> heavy pageouts/swapouts yet, so I don't see an obvious need of save cpu
> there either.
The thing here is that -rmap is able to easily balance the
reclaiming of cache with the swapout of anonymous pages.
Even though you tried to get rid of the magic numbers in
the old VM when you introduced your changes, you're already
back up to 4 magic numbers for the cache/swapout balancing.
This is not your fault, being difficult to balance is just
a fundamental property of the partially physical, partially
virtual scanning.
> Infact it maybe rmap will be more useful for things like enabling the full
> pagetable sharing you're suggesting above, rather than for replacing the
> swap_out round robing cycle over the VM. so it might be used only for MM
> internals rather than for VM internals.
Sharing is quite a can of worms; it might be easier to just use
4MB (or 2MB) pages for database shared memory segments and VMAs
where programs want large pages. That will get rid of both the
page tables (and associated locking) and the alignment constraints.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, Mar 04, 2002 at 09:41:40AM -0300, Rik van Riel wrote:
> On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
>
> > > having to scan two page tables to unmap it. In theory. Do you see a hole
> > > in that?
> >
> > Just the fact you never need the reverse lookup during lots of
> > important production usages (first that cames to mind is when you have
> > enough ram to do your job, all number crunching/fileserving, and most
> > servers are setup that way). This is the whole point. Note that this
> > has nothing to do with the "cache" part, this is only about the
> > pageout/swapout stage, only a few servers really needs heavy swapout.
>
> Ahhh, but it's not necessarily about making this common case
> better. It's about making sure Linux doesn't die horribly in
> some worst cases.
rmap is only about making pageout/swapout activities more efficient;
there's no stability issue to solve as far as I can tell.
> The case of "system has more than enough memory" won't suffer
> with -rmap anyway since the amount of activity in the VM part
> of the system will be relatively low.
I don't see anything significant to save in that area. During heavy
paging the system load is something like 1/2% of the cpu.
> > And on the other case (heavy swapout/pageouts like in some hard DBMS
> > usage, simualtions and laptops or legacy desktops) we would mostly save
> > CPU and reduce complexity, but I really don't see system load during
> > heavy pageouts/swapouts yet, so I don't see an obvious need of save cpu
> > there either.
>
> The thing here is that -rmap is able to easily balance the
> reclaiming of cache with the swapout of anonymous pages.
>
> Even though you tried to get rid of the magic numbers in
> the old VM when you introduced your changes, you're already
> back up to 4 magic numbers for the cache/swapout balancing.
>
> This is not your fault, being difficult to balance is just
> a fundamental property of the partially physical, partially
> virtual scanning.
Those numbers also control how aggressive the swap_out pass is. That is
partly a feature, I think. Do you plan to unmap anonymous pages and put
them into the swapcache when you reach them in the inactive lru, even
though you may have 99% of ram in freeable cache? I think you'll still
need some number/heuristic to know when the lru pass should start
aggressively unmapping and paging out stuff. So I believe this "number
thing" issue is quite unrelated to the complexity reduction of the
paging algorithm with the removal of the swap_out pass.
Andrea
On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> On Mon, Mar 04, 2002 at 09:41:40AM -0300, Rik van Riel wrote:
> > On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> >
> > > has nothing to do with the "cache" part, this is only about the
> > > pageout/swapout stage, only a few servers really needs heavy swapout.
> >
> > Ahhh, but it's not necessarily about making this common case
> > better. It's about making sure Linux doesn't die horribly in
> > some worst cases.
>
> rmap is only about making pagout/swapout activities more efficient,
> there's no stability issue to solve as far I can tell.
Not stability per se, but you have to admit the VM tends to
behave badly when there's a shortage in just one memory zone.
I believe NUMA will only make this situation worse.
It helps a lot when the VM can just free pages from those
zones where it has a memory shortage and skip scanning the
others.
> > The case of "system has more than enough memory" won't suffer
> > with -rmap anyway since the amount of activity in the VM part
> > of the system will be relatively low.
>
> I don't see anything significant to save in that area. During heavy
> paging the system load is something like 1/2% of the cpu.
During heavy paging you don't really care about how much system
time the VM takes (within reasonable limits, of course), instead
you care about how well the VM chooses which pages to swap out
and which pages to keep in RAM.
> > > And on the other case (heavy swapout/pageouts like in some hard DBMS
> > > usage, simualtions and laptops or legacy desktops) we would mostly save
> > > CPU and reduce complexity, but I really don't see system load during
> > > heavy pageouts/swapouts yet, so I don't see an obvious need of save cpu
> > > there either.
> >
> > The thing here is that -rmap is able to easily balance the
> > reclaiming of cache with the swapout of anonymous pages.
> >
> > Even though you tried to get rid of the magic numbers in
> > the old VM when you introduced your changes, you're already
> > back up to 4 magic numbers for the cache/swapout balancing.
> >
> > This is not your fault, being difficult to balance is just
> > a fundamental property of the partially physical, partially
> > virtual scanning.
>
> Those numbers also control how aggressive is the swap_out pass. That is
> partly a feature I think. Do you plan to unmap and put anonymous pages
> into the swapcache when you reach them in the inactive lru, despite you
> may have 99% of ram into freeable cache? I think you'll still need some
> number/heuristic to know when the lru pass should start to be aggressive
> unmapping and pagingout stuff. So I believe this issue about the "number
> thing" is quite unrelated to the complexity reduction of the paging
> algorithm with the removal of the swap_out pass.
It's harder to balance a combined virtual/physical scanning VM
than it is to balance a pure physical scanning VM.
I do have some tunables planned for -rmap, but those will be
more along the lines of a switch called "defer_swapout" which
the user can switch on or off. No need for the user to know
how the VM works internally, the VM has enough info to work
out the details by itself.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, Mar 04, 2002 at 11:23:57AM -0300, Rik van Riel wrote:
> On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> > On Mon, Mar 04, 2002 at 09:41:40AM -0300, Rik van Riel wrote:
> > > On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> > >
> > > > has nothing to do with the "cache" part, this is only about the
> > > > pageout/swapout stage, only a few servers really needs heavy swapout.
> > >
> > > Ahhh, but it's not necessarily about making this common case
> > > better. It's about making sure Linux doesn't die horribly in
> > > some worst cases.
> >
> > rmap is only about making pagout/swapout activities more efficient,
> > there's no stability issue to solve as far I can tell.
>
> Not stability per se, but you have to admit the VM tends to
> behave badly when there's a shortage in just one memory zone.
I don't think it behaves badly, and I don't see how rmap can help there
except by saving some cpu. The major O(N) complexity when working on
the lower zones is in passing over/ignoring the pages in the higher
zones, and that's at the page layer, so rmap will make no difference
there and the complexity will remain O(N) (where N is the number of
higher-zone pages).
> It helps a lot when the VM can just free pages from those
> zones where it has a memory shortage and skip scanning the
> others.
I'm not scanning the other/unrelated pagetables just now.
> > > The case of "system has more than enough memory" won't suffer
> > > with -rmap anyway since the amount of activity in the VM part
> > > of the system will be relatively low.
> >
> > I don't see anything significant to save in that area. During heavy
> > paging the system load is something like 1/2% of the cpu.
>
> During heavy paging you don't really care about how much system
> time the VM takes (within reasonable limits, of course), instead
Yes, and this is why rmap doesn't make a noticeable difference in the
heavy swap case either.
> you care about how well the VM chooses which pages to swap out
> and which pages to keep in RAM.
and for that the fair aging scan of the accessed bitflag has a chance to
be better than the unfair accessed-bit handling in rmap, which can fail
to evaluate correctly the accessed-virtual-age of the pages. Also,
treating mapped pages in a special manner is beneficial.
> > > > And on the other case (heavy swapout/pageouts like in some hard DBMS
> > > > usage, simualtions and laptops or legacy desktops) we would mostly save
> > > > CPU and reduce complexity, but I really don't see system load during
> > > > heavy pageouts/swapouts yet, so I don't see an obvious need of save cpu
> > > > there either.
> > >
> > > The thing here is that -rmap is able to easily balance the
> > > reclaiming of cache with the swapout of anonymous pages.
> > >
> > > Even though you tried to get rid of the magic numbers in
> > > the old VM when you introduced your changes, you're already
> > > back up to 4 magic numbers for the cache/swapout balancing.
> > >
> > > This is not your fault, being difficult to balance is just
> > > a fundamental property of the partially physical, partially
> > > virtual scanning.
> >
> > Those numbers also control how aggressive is the swap_out pass. That is
> > partly a feature I think. Do you plan to unmap and put anonymous pages
> > into the swapcache when you reach them in the inactive lru, despite you
> > may have 99% of ram into freeable cache? I think you'll still need some
> > number/heuristic to know when the lru pass should start to be aggressive
> > unmapping and pagingout stuff. So I believe this issue about the "number
> > thing" is quite unrelated to the complexity reduction of the paging
> > algorithm with the removal of the swap_out pass.
>
> It's harder to balance a combined virtual/physical scanning VM
> than it is to balance a pure physical scanning VM.
Depends; you may have to do similar things to balance rmap in a similar
manner. The point again is "when to start unmapping stuff". Once you
tune it right and choose "ok, go ahead and unmap" at the right time, it
basically doesn't matter if you do that by calling swap_out or if you
try to unmap the current page in the lru. Plus, swap_out will be fair,
and mapped pages will automatically get a longer lifetime than unmapped
pages like plain fs cache; both things sound like positives.
Plus, rmap hurts the common fast paths, i.e. when no heavy swapout is
needed, as on most servers out there.
Andrea
On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> > you care about how well the VM chooses which pages to swap out
> > and which pages to keep in RAM.
>
> and for that the aging fair scan for the acessed bitflag has a chance to
> be better than the unfair accessed bit handling in rmap that can lead to
> not evaluating correctly the accessed-virtual-age of the pages.
Ummm, what do you mean by this ?
> Also threating mapped pages in a special manner is beneficial.
Note that -rmap already does this.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
> Not stability per se, but you have to admit the VM tends to
> behave badly when there's a shortage in just one memory zone.
> I believe NUMA will only make this situation worse.
rmap would seem to buy us (at least) two major things for NUMA:
1) We can balance between zones easier by "swapping out"
pages to another zone.
2) We can do local per-node scanning - no need to bounce
information to and fro across the interconnect just to see what's
worth swapping out.
I suspect that the performance of NUMA under memory pressure
without the rmap stuff will be truly horrific, as we descend into
a cache-thrashing page transfer war.
I can't see any way to fix this without some sort of rmap - any
other suggestions as to how this might be done?
Thanks,
Martin.
On Mon, Mar 04, 2002 at 08:59:10AM -0800, Martin J. Bligh wrote:
> > Not stability per se, but you have to admit the VM tends to
> > behave badly when there's a shortage in just one memory zone.
> > I believe NUMA will only make this situation worse.
>
> rmap would seem to buy us (at least) two major things for NUMA:
>
> 1) We can balance between zones easier by "swapping out"
> pages to another zone.
Yes, operations like "now migrate and bind this task to a certain
cpu/mem pair" pretty much need rmap, or they will get the same
complexity as swapout, which may be very, very slow with lots of vm
address space mapped. But this has nothing to do with the swap_out pass
we were talking about previously.
I just considered those cases (like also supporting pagetable sharing at
the maximum possible level, even across random
mmaps/mremaps/mprotect/mlocks/execve), and this is why I said rmap may
be more useful for mm internals than for replacing the swap_out pass
(hmm, in this case the migration of the pagecache may be considered more
a vm thing too, though).
> 2) We can do local per-node scanning - no need to bounce
> information to and fro across the interconnect just to see what's
> worth swapping out.
The lru lists are global at the moment, so for the normal swapout
activity rmap won't allow you to do what you mention above (furthermore,
rmap only gives you the pointer to the pte chain, but there's no
guarantee the pte is in the same node as the physical page; even
assuming we'll have per-node inactive/active lists, you'll fall into
the bouncing scenario anyway, rmap or not. Only the cpu usage will be
lower, and as a side effect you'll bounce less, but you're not avoiding
the interconnect overhead with the per-node scanning).
That said, I definitely agree the potential pageout/swapout scalability
with rmap may be better on a very huge system with several hundred
gigabytes of ram (despite the accessed-bit aging being less fair
etc.). So yes, I also of course agree that there will be benefits in
killing the swap_out loop on some currently-corner-case hardware, and
maybe long term; if we'll ever need to pageout heavily on a 256G ram
box, it may be the only sane way to do that, really, no matter if it's
numa or not. (I think on a 256G box it will be only a matter of paging
out the dirty shared mappings and dropping the clean mappings, I don't
see any need to swapout there, but still, to do the pageout efficiently
on such a machine we'll need rmap.)
Also note that on the modern numa (the thing I mostly care about), in
misc load (like a desktop), without special usages (like user bindings),
striping virtual pages and pagecache over all the nodes will be better
than restricting one task to use only the bandwidth of one bank of ram,
which would significantly decrease the potential bandwidth of the global
machine. Interconnects are much faster than what ram will ever provide;
it's not the legacy dinosaur numa. I understand old hardware with a huge
penalty while crossing the interconnects has different needs, though.
They're both called cc-numa but they're completely different beasts. So
I don't worry much about walking ptes on remote nodes; it may in fact be
faster than walking ptes of the same node, and usually the dinosaurs
have so much ram that they will hardly need to swapout heavily. On
similar lines, the alpha cpus (despite the fact that I'd put them in the
"new numa" class) don't even provide the accessed bit in the pte; you
can only use minor page faults to know that.
The numa point for the new hardware is that if we have N nodes, and
we have apps loading each node at 100% and using 100% of the mem
bandwidth of each node in a local manner without passing through the
interconnects (like we can do with cpu bindings and a migration+bind
API), then the performance will be better than if we stripe globally.
This is why the OS needs to be aware of numa: to optimize those cases,
so that with a certain workload you can get 100% of the performance out
of the hardware. But on a misc load without a dedicated design for the
machine it is running on (so if we're not able to use all 4 nodes fully
in a local manner), striping will be better (you'll get 2/3 of the
performance out of the hardware, rather than 1/4 of it, because
otherwise you're only using 1/4 of the global bandwidth). Same goes
for shm and pagecache; page (or cacheline) striping is better there too.
note: I invented the above numbers to make the example clearer; they
have no relation to any existing real hardware at all.
> I suspect that the performance of NUMA under memory pressure
> without the rmap stuff will be truly horrific, as we decend into
> a cache-trashing page transfer war.
Depends on what kind of numa system, I think. I worry more about the
complexity with lots of ram. As said above, on a 64bit 512G system
with hundreds of gigabytes of vm globally mapped at the same time,
paging out hard because of some terabyte mapping marked dirty during
page faults will quite certainly need rmap to pageout such dirty
mappings efficiently, really no matter if it's cc-numa or not; it's
mostly a complexity problem.
I really don't see it as a 2.4 need :). I never said no-way rmap in 2.5.
Maybe I won't agree on the implementation, but on the design I can
agree: if we'll ever need to get the above workloads fast, plus
pagecache migration for the numa bindings, we'll definitely need rmap
for all kinds of user pages, not just for map-shared pages like we have
just now in 2.4 and in all previous kernels (I hope this also answers
Daniel's question; otherwise please ask again). So I appreciate the work
done on rmap, but I currently don't see it as a 2.4 item.
> I can't see any way to fix this without some sort of rmap - any
> other suggestions as to how this might be done?
>
> Thanks,
>
> Martin.
Andrea
On Mon, 04 Mar 2002 08:59:10 -0800
"Martin J. Bligh" <[email protected]> wrote:
> 2) We can do local per-node scanning - no need to bounce
> information to and fro across the interconnect just to see what's
> worth swapping out.
Well, you can achieve this by "attaching" the nodes' local memory (zone) to its cpu and let the vm work preferably only on these attached zones (regarding the list scanning and the like). This way you have no interconnect traffic generated. But this is in no way related to rmap.
> I suspect that the performance of NUMA under memory pressure
> without the rmap stuff will be truly horrific, as we decend into
> a cache-trashing page transfer war.
I guess you are right for the current implementation, but I doubt rmap will be a _real_ solution to your problem.
> I can't see any way to fix this without some sort of rmap - any
> other suggestions as to how this might be done?
As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster).
UP=every zone is one or more preferred zone(s)
SMP=every zone is one or more preferred zone(s)
NUMA=every cpu has one or more preferred zone(s), but can walk the whole zone-list if necessary.
cluster=every cpu has one or more preferred zone(s), but cannot walk the whole zone-list.
Preference is implemented as a simple list of cpu-ids attached to every memory zone. This is for being able to see the whole picture. Every cpu has a private list of (preferred) zones which is used by the vm for the scanning jobs (swap et al). This way there is no need to touch the interconnect. If you are really in a bad situation you can always go back to the global list and do whatever is needed.
This sounds pretty scalable and runtime-configurable. And not related to rmap...
Beat me,
Stephan
PS: Drop clusters from the discussion, I know this would become weird.
>> 2) We can do local per-node scanning - no need to bounce
>> information to and fro across the interconnect just to see what's
>> worth swapping out.
>
> Well, you can achieve this by "attaching" the nodes' local memory
> (zone) to its cpu and let the vm work preferably only on these attached
> zones (regarding the list scanning and the like). This way you have no
> interconnect traffic generated. But this is in no way related to rmap.
>
>> I can't see any way to fix this without some sort of rmap - any
>> other suggestions as to how this might be done?
>
> As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster).
> UP=every zone is one or more preferred zone(s)
> SMP=every zone is one or more preferred zone(s)
> NUMA=every cpu has one or more preferred zone(s), but can walk the whole zone-list if necessary.
>
> Preference is implemented as simple list of cpu-ids attached to every
> memory zone. This is for being able to see the whole picture. Every
> cpu has a private list of (preferred) zones which is used by vm for the
> scanning jobs (swap et al). This way there is no need to touch interconnection.
> If you are really in a bad situation you can alway go back to the global
> list and do whatever is needed.
As I understand the current code (ie this may be totally wrong ;-) ), I
think we already pretty much have what you're suggesting. There's one
(or more) zone per node chained off the pgdata_t, and during memory
allocation we try to scan through the zones attached to the local node
first. The problem seems to me to be that the way we do current swap-out
scanning is virtual, not physical, and thus cannot be per zone => per
node.
Am I totally missing your point here?
Thanks,
Martin.
On Mon, 4 Mar 2002 19:18:04 +0100
Stephan von Krawczynski <[email protected]> wrote:
> As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster).
> UP=every zone is one or more preferred zone(s)
correct: UP=all zones are preferred zones for the single CPU
> SMP=every zone is one or more preferred zone(s)
correct: SMP=all zones are preferred zones for all CPUs
Regards,
Stephan
>> 1) We can balance between zones easier by "swapping out"
>> pages to another zone.
>
> Yes, operations like "now migrate and bind this task to a certain
> cpu/mem pair" pretty much needs rmap or it will get the same complexity
> of swapout, that may be very very slow with lots of vm address space
> mapped. But this has nothing to do with the swap_out pass we were
> talking about previously.
If we're out of memory on one node, and have free memory on another,
during the swap-out pass it would be quicker to transfer the page to
another node, ie "swap out the page to another zone" rather than swap
it out to disk. This is what I mean by the above comment (though you're
right, it helps with the more esoteric case of deliberate page migration too),
though I probably phrased it badly enough to make it incomprehensible ;-)
I guess this could help with non-NUMA architectures too - if ZONE_NORMAL
is full, and ZONE_HIGHMEM has free pages, it would be nice to be able
to scan ZONE_NORMAL and transfer pages to ZONE_HIGHMEM. In
reality, I suspect this won't be so useful, as there shouldn't be
HIGHMEM-capable page data sitting in ZONE_NORMAL unless ZONE_HIGHMEM
had been full at some point in the past? And I'm not sure if we keep a
bit to say where the page could have been allocated from or not?
M.
PS. The rest of your email re: striping twisted my brain out of shape - I'll
have to think about it some more.
On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> > 2) We can do local per-node scanning - no need to bounce
> > information to and fro across the interconnect just to see what's
> > worth swapping out.
>
> the lru lists are global at the moment, so for the normal swapout
> activitiy rmap won't allow you to do what you mention above
Actually, the lru lists are per zone and have been for a while.
The thing which was lacking up to now is a pagecache_lru_lock
per zone, because this clashes with truncate(). Arjan came up
with a creative solution to fix this problem and I'll integrate
it into -rmap soon...
> (furthmore rmap gives you only the pointer to the pte chain, but there's
> no guarantee the pte is in the same node as the physical page, even
> assuming we'll have per-node inactive/active list, so you'll fall into
> the bouncing scenario anyways rmap or not, only the cpu usage will be
> lower and as side effect you'll bounce less, but you're not avoiding the
> interconnet overhead with the per-node scanning).
Well, if we need to free memory from node A, we will need to
do that anyway. If we don't scan the page tables from node B,
maybe we'll never be able to free memory from node A.
The only thing -rmap does is make sure we only scan the page
tables belonging to the physical pages in node A, instead of
having to scan the page tables of all processes in all nodes.
> Also note that on the modern numa (the thing I mostly care about) in
> misc load (like a desktop), without special usages (like user bindings),
> striping virtual pages and pagecache over all the nodes will be better
> than restricting one task to use only the bandwith of one bank of ram,
> so decreasing significantly the potential bandwith of the global
> machine.
This is an interesting point and suggests we want to start
the zone fallback chains from different places for each CPU,
this both balances the allocation and can avoid the CPUs
looking at "each other's" zone and bouncing cachelines around.
> depends on what kind of numa systems I think. I worry more about the
> complexity with lots of ram. As said above on a 64bit 512G system
> with hundred gigabytes of vm globally mapped at the same time, paging
> out hard beacuse of some terabyte mapping marked dirty during page
> faults, will quite certainly need rmap to pageout such dirty mappings
> efficiently, really no matter if it's cc-numa or not, it's mostly a
> complexity problem.
Indeed.
> I really don't see it as a 2.4 need :). I never said no-way rmap in 2.5.
> It maybe I won't agree on the implementation,
I'd appreciate it if you could look at the implementation and
look for areas to optimise. However, note that I don't believe
-rmap is already at the stage where optimisation is appropriate.
Or rather, now is the time for macro optimisations, not for
micro optimisations.
regards,
Rik
--
Will hack the VM for food.
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, 4 Mar 2002, Stephan von Krawczynski wrote:
> On Mon, 04 Mar 2002 08:59:10 -0800
> "Martin J. Bligh" <[email protected]> wrote:
>
> > 2) We can do local per-node scanning - no need to bounce
> > information to and fro across the interconnect just to see what's
> > worth swapping out.
>
> Well, you can achieve this by "attaching" the nodes' local memory (zone)
> to its cpu and let the vm work preferably only on these attached zones
> (regarding the list scanning and the like). This way you have no
> interconnect traffic generated. But this is in no way related to rmap.
But it is. Without -rmap you don't know which processes from
which nodes could have mapped memory on your node, so you end
up scanning the page tables of all processes on all nodes.
regards,
Rik
--
Will hack the VM for food.
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, Mar 04, 2002 at 10:46:54AM -0800, Martin J. Bligh wrote:
> >> 2) We can do local per-node scanning - no need to bounce
> >> information to and fro across the interconnect just to see what's
> >> worth swapping out.
> >
> > Well, you can achieve this by "attaching" the nodes' local memory
> > (zone) to its cpu and let the vm work preferably only on these attached
> > zones (regarding the list scanning and the like). This way you have no
> > interconnect traffic generated. But this is in no way related to rmap.
> >
> >> I can't see any way to fix this without some sort of rmap - any
> >> other suggestions as to how this might be done?
> >
> > As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster).
> > UP=every zone is one or more preferred zone(s)
> > SMP=every zone is one or more preferred zone(s)
> > NUMA=every cpu has one or more preferred zone(s), but can walk the whole zone-list if necessary.
> >
> > Preference is implemented as simple list of cpu-ids attached to every
> > memory zone. This is for being able to see the whole picture. Every
> > cpu has a private list of (preferred) zones which is used by vm for the
> > scanning jobs (swap et al). This way there is no need to touch interconnection.
> > If you are really in a bad situation you can alway go back to the global
> > list and do whatever is needed.
>
> As I understand the current code (ie this may be totally wrong ;-) ) I think
> we already pretty much have what you're suggesting. There's one (or more)
> zone per node chained off the pgdata_t, and during memory allocation we
> try to scan through the zones attatched to the local node first. The problem
Yes, also make sure to keep this patch from SGI applied; it's very
important to avoid memory balancing if there's still free memory in the
other zones:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/20_numa-mm-1
It should apply cleanly on top of my vm-28.
> seems to me to be that the way we do current swap-out scanning is virtual,
> not physical, and thus cannot be per zone => per node.
Actually, if you do process bindings the ptes should all be allocated
local to the node if numa is enabled, and if there's no binding, no
matter whether you have rmap or not, the ptes can be spread across the
whole system (just like the physical pages in the inactive/active lrus,
because they're not per-node).
Andrea
On Mon, Mar 04, 2002 at 10:56:11AM -0800, Martin J. Bligh wrote:
> >> 1) We can balance between zones easier by "swapping out"
> >> pages to another zone.
> >
> > Yes, operations like "now migrate and bind this task to a certain
> > cpu/mem pair" pretty much needs rmap or it will get the same complexity
> > of swapout, that may be very very slow with lots of vm address space
> > mapped. But this has nothing to do with the swap_out pass we were
> > talking about previously.
>
> If we're out of memory on one node, and have free memory on another,
> during the swap-out pass it would be quicker to transfer the page to
> another node, ie "swap out the page to another zone" rather than swap
> it out to disk. This is what I mean by the above comment (though you're
I think that unless we're sure we need to split the system into parts,
and so there's some explicit cpu binding (like in the example I made
above), it isn't worth doing migrations just because one zone is low on
memory. The migration has a cost, and without bindings the scheduler is
free to reschedule the task away in the next timeslice anyway, and then
it's better to keep it there for cpu cache locality reasons. So I
believe it's better to make sure to use all available ram in all nodes
instead of doing migrations when the local node is low on mem. But this
again depends on the kind of numa system; I'm considering the new numas,
not the old ones with the huge penalty on the remote memory.
> right, it helps with the more esoteric case of deliberate page migration too),
> though I probably phrased it badly enough to make it incomprehensible ;-)
>
> I guess could this help with non-NUMA architectures too - if ZONE_NORMAL
> is full, and ZONE_HIGHMEM has free pages, it would be nice to be able
> to scan ZONE_NORMAL, and transfer pages to ZONE_HIGHMEM. In
> reality, I suspect this won't be so useful, as there shouldn't be HIGHEM
> capable page data sitting in ZONE_NORMAL unless ZONE_HIGHMEM
> had been full at some point in the past? And I'm not sure if we keep a bit
Exactly, this is just what the per-zone point-of-view watermarks do in
my tree, and this is why, even if we're not able to migrate all the
highmem-capable pages from lowmem to highmem (like anon memory when
there's no swap, or mlocked memory), we still don't run into imbalances.
btw, to migrate anon memory without swap, we wouldn't really be forced
to use rmap; we could just use anonymous swapcache, and then we could
migrate the swapcache atomically with the pagecache_lock acquired, just
like we would do with rmap. But I think the main problem of migration is
"when" should we trigger it. Currently we don't need to answer this
question, and the watermarks make sure we've enough lowmem resources not
to mandatorily need migration. When I did the watermarks fix, I also
considered the migration through anon swapcache, but it wasn't a black
and white thing; the watermarks are the better solution, for 2.4 at
least, I think :).
> to say where the page could have been allocated from or not ?
>
> M.
>
> PS. The rest of your email re: striping twisted my brain out of shape - I'll
> have to think about it some more.
Andrea
On March 4, 2002 07:56 pm, Martin J. Bligh wrote:
> >> 1) We can balance between zones easier by "swapping out"
> >> pages to another zone.
> >
> > Yes, operations like "now migrate and bind this task to a certain
> > cpu/mem pair" pretty much needs rmap or it will get the same complexity
> > of swapout, that may be very very slow with lots of vm address space
> > mapped. But this has nothing to do with the swap_out pass we were
> > talking about previously.
>
> If we're out of memory on one node, and have free memory on another,
> during the swap-out pass it would be quicker to transfer the page to
> another node, ie "swap out the page to another zone" rather than swap
> it out to disk. This is what I mean by the above comment (though you're
> right, it helps with the more esoteric case of deliberate page migration too),
> though I probably phrased it badly enough to make it incomprehensible ;-)
>
> I guess could this help with non-NUMA architectures too - if ZONE_NORMAL
> is full, and ZONE_HIGHMEM has free pages, it would be nice to be able
> to scan ZONE_NORMAL, and transfer pages to ZONE_HIGHMEM. In
> reality, I suspect this won't be so useful, as there shouldn't be HIGHEM
> capable page data sitting in ZONE_NORMAL unless ZONE_HIGHMEM
> had been full at some point in the past?
That's the normal case when the cache is loaded up.
> And I'm not sure if we keep a bit
> to say where the page could have been allocated from or not ?
No, we don't record the gfp_mask or, in the case of discontigmem, the
zonelist. Perhaps this information could be recovered from the mapping, or
lack of it. I don't know how you'd deduce that a page was required to be
in zone_dma, for example, without specifically remembering that at
page_alloc time.
--
Daniel
On Mon, Mar 04, 2002 at 06:36:47PM -0300, Rik van Riel wrote:
> On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
>
> > > 2) We can do local per-node scanning - no need to bounce
> > > information to and fro across the interconnect just to see what's
> > > worth swapping out.
> >
> > the lru lists are global at the moment, so for the normal swapout
> > activity rmap won't allow you to do what you mention above
>
> Actually, the lru lists are per zone and have been for a while.
They're not in my tree, and for very good reasons; Ben made that mistake
the first time at some point during 2.3. There's a big downside to the
per-zone information: all normal machines (with, say, 64M or 2G of ram),
where the theoretical O(N) complexity is perfectly fine for lowmem
dma/normal allocations, will be hurt very badly by the per-zone lrus.
You're the one saying that the system load is very low and that it's
better to make more accurate page replacement decisions.
I think they may be worthwhile on a hundred-gigabyte machine only, but
the whole point is that in such a box you'll have only one zone per node
anyway, and so per-zone in that case will match per-node :).
So I think they should be at least per-node in 2.5 to make 99% of the
userbase happy. And again, it depends on the kind of numa whether they
have to be global or per-node, so it would probably be much better to have
them per-node or global depending on a compile-time configuration #define.
> The thing which was lacking up to now is a pagecache_lru_lock
> per zone, because this clashes with truncate(). Arjan came up
> with a creative solution to fix this problem and I'll integrate
> it into -rmap soon...
Making it a per-lru spinlock is a natural scalability optimization, but
anyways pagemap_lru_lock isn't a very critical spinlock. Before
worrying about pagemap_lru_lock I'd worry about the pagecache_lock I
think (even the pagecache_lock doesn't matter much in most usages). Of
course it also depends on the workload, but the important workloads will
hit the pagecache_lock first.
> > (furthermore rmap gives you only the pointer to the pte chain, but there's
> > no guarantee the pte is in the same node as the physical page, even
> > assuming we'll have per-node inactive/active lists, so you'll fall into
> > the bouncing scenario anyway, rmap or not; only the cpu usage will be
> > lower and as a side effect you'll bounce less, but you're not avoiding the
> > interconnect overhead with the per-node scanning).
>
> Well, if we need to free memory from node A, we will need to
> do that anyway. If we don't scan the page tables from node B,
> maybe we'll never be able to free memory from node A.
>
> The only thing -rmap does is make sure we only scan the page
> tables belonging to the physical pages in node A, instead of
> having to scan the page tables of all processes in all nodes.
Correct. And as said this is a scalability optimization: the more ptes
you have, the more you want to skip the ones belonging to pages in
node B, or you may end up wasting too much system time on a 512G system etc...
> I'd appreciate it if you could look at the implementation and
> look for areas to optimise. However, note that I don't believe
I haven't had time to look into it much yet (I only gave it a short
review so far), but I will certainly do that when I have more time,
looking at it from a 2.5 long-term perspective. I didn't much like that you
resurrected some of the old code that I don't think pays off. I would
have preferred rmap on top of my vm patch, without reintroducing
the older logic. I still don't see the need for inactive_dirty, and the
fact that you dropped classzone and put in the unreliable "plenty stuff"
reintroduces design bugs that will make kswapd go crazy again. But ok, I
don't worry too much about that; the rmap bits that maintain the
additional information are orthogonal to the other changes, and that's
the interesting part of the patch after all.
Andrea
On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> On Mon, Mar 04, 2002 at 10:46:54AM -0800, Martin J. Bligh wrote:
> > >> 2) We can do local per-node scanning - no need to bounce
> > >> information to and fro across the interconnect just to see what's
> > >> worth swapping out.
> > >
> > > Well, you can achieve this by "attaching" the nodes' local memory
> > > (zone) to its cpu and let the vm work preferably only on these attached
> > > zones (regarding the list scanning and the like). This way you have no
> > > interconnect traffic generated. But this is in no way related to rmap.
> > >
> > >> I can't see any way to fix this without some sort of rmap - any
> > >> other suggestions as to how this might be done?
> > >
> > > As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster).
> > > UP=every zone is one or more preferred zone(s)
> > > SMP=every zone is one or more preferred zone(s)
> > > NUMA=every cpu has one or more preferred zone(s), but can walk the whole zone-list if necessary.
> > >
> > > Preference is implemented as simple list of cpu-ids attached to every
> > > memory zone. This is for being able to see the whole picture. Every
> > > cpu has a private list of (preferred) zones which is used by vm for the
> > > scanning jobs (swap et al). This way there is no need to touch interconnection.
> > > If you are really in a bad situation you can alway go back to the global
> > > list and do whatever is needed.
> >
> > As I understand the current code (i.e. this may be totally wrong ;-) ) I think
> > we already pretty much have what you're suggesting. There's one (or more)
> > zone per node chained off the pgdata_t, and during memory allocation we
> > try to scan through the zones attached to the local node first. The problem
>
> yes, also make sure to keep this patch from SGI applied, it's very
> important to avoid memory balancing if there's still free memory in the
> other zones:
>
> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/20_numa-mm-1
This patch is included (in a slightly different form) in the 2.4.17
discontig patch (http://sourceforge.net/projects/discontig).
But Martin may need another patch to apply. With the current
implementation of __alloc_pages, we have 2 problems:
1) A node is not emptied before moving on to the following node.
2) If none of the zones on a node has more free pages than min (defined as
min += z->pages_low), we start looking at the following node, instead of
trying harder on the same node.
I have a patch that tries to fix these problems. Of course this patch
makes sense only with either the discontig patch or the SGI patch Andrea
mentioned applied. I'd appreciate your feedback on this piece of code.
This patch is against 2.4.19-pre2:
--- linux-2.4.19-pre2/mm/page_alloc.c Mon Mar 4 14:35:27 2002
+++ linux-2.4.19-pre2-sam/mm/page_alloc.c Mon Mar 4 14:38:53 2002
@@ -339,68 +339,110 @@
*/
struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
{
- unsigned long min;
- zone_t **zone, * classzone;
+ unsigned long min_low, min_min;
+ zone_t **zone, **current_zone, * classzone, *z;
struct page * page;
int freed;
-
+ struct pglist_data* current_node;
+
zone = zonelist->zones;
- classzone = *zone;
- min = 1UL << order;
- for (;;) {
- zone_t *z = *(zone++);
- if (!z)
+ z = *zone;
+ for(;;){
+ /*
+ * This loops scans all the zones
+ */
+ min_low = 1UL << order;
+ current_node = z->zone_pgdat;
+ current_zone = zone;
+ classzone = z;
+ do{
+ /*
+ * This loops scans all the zones of
+ * the current node.
+ */
+ min_low += z->pages_low;
+ if (z->free_pages > min_low) {
+ page = rmqueue(z, order);
+ if (page)
+ return page;
+ }
+ z = *(++zone);
+ }while(z && (z->zone_pgdat == current_node));
+ /*
+ * The node is low on memory.
+ * If this is the last node, then the
+ * swap daemon is awaken.
+ */
+
+ classzone->need_balance = 1;
+ mb();
+ if (!z && waitqueue_active(&kswapd_wait))
+ wake_up_interruptible(&kswapd_wait);
+
+ min_min = 1UL << order;
+
+ /*
+ * We want to try again in the current node.
+ */
+ zone = current_zone;
+ z = *zone;
+ do{
+ unsigned long local_min;
+ local_min = z->pages_min;
+ if (!(gfp_mask & __GFP_WAIT))
+ local_min >>= 2;
+ min_min += local_min;
+ if (z->free_pages > min_min) {
+ page = rmqueue(z, order);
+ if (page)
+ return page;
+ }
+ z = *(++zone);
+ }while(z && (z->zone_pgdat == current_node));
+
+ /*
+ * If we are on the last node, and the current
+ * process has not the correct flags, then it is
+ * not allowed to empty the machine.
+ */
+ if(!z && !(current->flags & (PF_MEMALLOC | PF_MEMDIE)))
break;
- min += z->pages_low;
- if (z->free_pages > min) {
+ zone = current_zone;
+ z = *zone;
+ do{
page = rmqueue(z, order);
if (page)
return page;
- }
- }
-
- classzone->need_balance = 1;
- mb();
- if (waitqueue_active(&kswapd_wait))
- wake_up_interruptible(&kswapd_wait);
-
- zone = zonelist->zones;
- min = 1UL << order;
- for (;;) {
- unsigned long local_min;
- zone_t *z = *(zone++);
- if (!z)
+ z = *(++zone);
+ }while(z && (z->zone_pgdat == current_node));
+
+ if(!z)
break;
-
- local_min = z->pages_min;
- if (!(gfp_mask & __GFP_WAIT))
- local_min >>= 2;
- min += local_min;
- if (z->free_pages > min) {
- page = rmqueue(z, order);
- if (page)
- return page;
- }
}
-
- /* here we're in the low on memory slow path */
-
+
rebalance:
+ /*
+ * We were not able to find enough memory.
+ * Since the swap daemon has been waken up,
+ * we might be able to find some pages.
+ * If not, we need to balance the entire memory.
+ */
+ classzone = *zonelist->zones;
if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
zone = zonelist->zones;
for (;;) {
zone_t *z = *(zone++);
if (!z)
break;
-
+
page = rmqueue(z, order);
if (page)
return page;
}
return NULL;
}
-
+
/* Atomic allocations - we can't balance anything */
if (!(gfp_mask & __GFP_WAIT))
return NULL;
@@ -410,14 +452,14 @@
return page;
zone = zonelist->zones;
- min = 1UL << order;
+ min_min = 1UL << order;
for (;;) {
zone_t *z = *(zone++);
if (!z)
break;
- min += z->pages_min;
- if (z->free_pages > min) {
+ min_min += z->pages_min;
+ if (z->free_pages > min_min) {
page = rmqueue(z, order);
if (page)
return page;
In message <[email protected]>, Andrea Arcangeli writes:
> it's better to make sure to use all available ram in all nodes instead
> of doing migrations when the local node is low on mem. But this again
> depends on the kind of numa system, I'm considering the new numas, not
> the old ones with the huge penality on the remote memory.
Andrea, don't forget that the "old" NUMAs will soon be the "new" NUMAs
again. The internal bus and clock speeds are still quite likely to
increase faster than the speeds of most interconnects. And even quite
a few "big SMP" machines today are really somewhat NUMA-like with a
2 to 1 - remote to local memory latency (e.g. the Corollary interconnect
used on a lot of >4-way IA32 boxes is not as fast as the two local
busses).
So, designing for the "new" NUMAs is fine if your code goes into
production this year. But if it is going into production in two to
three years, you might want to be thinking about some greater memory
latency ratios for the upcoming hardware configurations...
gerrit
On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> On Mon, Mar 04, 2002 at 06:36:47PM -0300, Rik van Riel wrote:
> > On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> >
> > > > 2) We can do local per-node scanning - no need to bounce
> > > > information to and fro across the interconnect just to see what's
> > > > worth swapping out.
> > >
> > > the lru lists are global at the moment, so for the normal swapout
> > > activity rmap won't allow you to do what you mention above
> >
> > Actually, the lru lists are per zone and have been for a while.
>
> They're not in my tree
Yeah, but you shouldn't judge rmap by what's in your tree ;))
Balancing is quite simple, too.
> > The thing which was lacking up to now is a pagecache_lru_lock
> > per zone, because this clashes with truncate(). Arjan came up
> > with a creative solution to fix this problem and I'll integrate
> > it into -rmap soon...
>
> Making it a per-lru spinlock is a natural scalability optimization, but
> anyways pagemap_lru_lock isn't a very critical spinlock.
That's what I used to think, too. The folks at IBM showed
me I was wrong and the pagemap_lru_lock is critical.
> > I'd appreciate it if you could look at the implementation and
> > look for areas to optimise. However, note that I don't believe
>
> I haven't had time to look into it much yet (I only gave it a short
> review so far), but I will certainly do that when I have more time,
> looking at it from a 2.5 long-term perspective. I didn't much like that you
> resurrected some of the old code that I don't think pays off. I would
> have preferred rmap on top of my vm patch, without reintroducing
> the older logic. I still don't see the need for inactive_dirty, and the
> fact that you dropped classzone and put in the unreliable "plenty stuff"
> reintroduces design bugs that will make kswapd go crazy again. But ok, I
> don't worry too much about that; the rmap bits that maintain the
> additional information are orthogonal to the other changes, and that's
> the interesting part of the patch after all.
OK, let's try to put classzone on top of a Hammer "NUMA" system.
You'll have one CPU starting to allocate from zone A, falling
back to zone B and then further down.
Another CPU starts allocating at zone B, falling back to A
and then further down.
How would you express this in classzone ? I've looked at it
for quite a while and haven't found a clean way to get this
situation right with classzone, which is why I have removed
it.
As for kswapd going crazy, that is nicely fixed by having
per zone lru lists... ;)
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, Mar 04, 2002 at 08:11:21PM -0300, Rik van Riel wrote:
> You'll have one CPU starting to allocate from zone A, falling
> back to zone B and then further down.
What are zones A/B? I guess you mean nodes A/B etc. Zones are called
NORMAL/DMA/HIGHMEM, so I'm confused.
> Another CPU starts allocating at zone B, falling back to A
> and then further down.
>
> How would you express this in classzone ? I've looked at it
I don't see the problem you're raising. classzone is information that
you pass to the memory balancing code, telling it "what kind of ram you
need". That's all. This ensures it does the right work and that it puts
the result into the per-process local_pages structure, so the result
isn't stolen before we can notice it (fairness). That's completely
unrelated to NUMA; I think I've said that many times. classzone and
numa are disconnected concepts.
> As for kswapd going crazy, that is nicely fixed by having
> per zone lru lists... ;)
I don't see how per-zone lru lists are related to the kswapd deadlock.
As soon as ZONE_DMA is filled with file descriptors or with
pagetables (or whatever non-pageable/shrinkable kernel data structure you
prefer), kswapd will go mad without classzone, period.
Check l-k and see how many kswapd-crazy reports there have been since
classzone was introduced into the kernel, and incidentally we just
saw a new kswapd report for the rmap patch without swap (I know it's
harder to trigger with swap: without swap every single page of anonymous
ram becomes unpageable just like the kernel data, so such behaviour
happens trivially; but the very same kswapd-crazy problem would happen
if swap were there too, it would only take more time to reproduce, like
in the 2.4.x series with x < 10). It's the same problem you told me about
at the kernel summit, remember? classzone has the advantage of being very
low cost, and it also increases the fairness of allocations, compared to
a system where you may end up working for others rather than for
yourself, like with the "plenty" stuff.
It doesn't only fix kswapd.
Andrea
On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> On Mon, Mar 04, 2002 at 08:11:21PM -0300, Rik van Riel wrote:
> > You'll have one CPU starting to allocate from zone A, falling
> > back to zone B and then further down.
>
> what is zone A/B, I guess you mean node A/B etc.. Zones are called
> NORMAL/DMA/HIGHMEM so I'm confused.
OK, now think about a NUMA-with-small-n system like AMD Hammer.
One of the CPUs will want to allocate from HIGHMEM zone A while
another CPU will start allocating at HIGHMEM zone B. Of course,
with memory access time between the "nodes" being not too different
you'll want to fall back to the "other" HIGHMEM zone before falling
back to the (single) NORMAL and DMA zones.
This could be expressed as:
"node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA
"node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA
How would you express this situation in classzone ?
> > As for kswapd going crazy, that is nicely fixed by having
> > per zone lru lists... ;)
>
> I don't see how per-zone lru lists are related to the kswapd deadlock.
> As soon as ZONE_DMA is filled with file descriptors or with
> pagetables (or whatever non-pageable/shrinkable kernel data structure you
> prefer), kswapd will go mad without classzone, period.
So why would kswapd not go mad _with_ classzone ?
I bet the workaround for that problem has very little
to do with classzones...
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> On Mon, Mar 04, 2002 at 10:46:54AM -0800, Martin J. Bligh wrote:
> > seems to me to be that the way we do current swap-out scanning is virtual,
> > not physical, and thus cannot be per zone => per node.
>
> actually if you do process bindings the ptes should all be allocated
> local to the node if numa is enabled, and if there's no binding, no
> matter whether you have rmap or not, the ptes can be spread across the whole
> system (just like the physical pages in the inactive/active lrus,
> because they're not per-node).
Think shared pages.
With -rmap you'll scan all the page table entries mapping
the pages on the current node, regardless of which node
the page tables live.
Without -rmap you'll need to scan all page table entries
in the system, not just the ones mapping pages on the
current node.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, Mar 04, 2002 at 03:09:51PM -0800, Gerrit Huizenga wrote:
>
> In message <[email protected]>, Andrea Arcangeli writes:
> > it's better to make sure to use all available ram in all nodes instead
> > of doing migrations when the local node is low on mem. But this again
> > depends on the kind of numa system, I'm considering the new numas, not
> > the old ones with the huge penality on the remote memory.
>
> Andrea, don't forget that the "old" NUMAs will soon be the "new" NUMAs
> again. The internal bus and clock speeds are still quite likely to
> increase faster than the speeds of most interconnects. And even quite
For various reasons I think we'll never go back to "old" NUMA in the
long run.
> a few "big SMP" machines today are really somewhat NUMA-like with a
> 2 to 1 - remote to local memory latency (e.g. the Corollary interconnect
> used on a lot of >4-way IA32 boxes is not as fast as the two local
> busses).
there's a reason for that.
> So, desiging for the "new" NUMAs is fine if your code goes into
> production this year. But if it is going into production in two to
> three years, you might want to be thinking about some greater memory
> latency ratios for the upcoming hardware configurations...
Disagree, but don't get me wrong, I'm not really suggesting designing
for new numa only. I think linux should support both equally well, so
some heuristics, like in the scheduler, will be mostly the same, but they
will need different heuristics in some other places. For example
"migrate less frequently used ram instead of taking advantage of free
memory in the other nodes first" falls into the old-numa category.
Andrea
On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote:
> This could be expressed as:
>
> "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA
> "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA
Highmem? Let's assume you mean "normal" and "dma" only, of course.
And that's not always the right zonelist layout. If an allocation asks for
ram from a certain node, as with the ram bindings, we should use the
current layout of the numa zonelist. If node A is the preferred one, then we
should allocate from node A first; other logic (see the point-of-view
watermarks in my tree) will make sure you fall back into node B if we
risk becoming unbalanced across the zones. However, the layout you mentioned
above is sometimes the right one; for example, for allocations with no
"preference" about the node to allocate from, your layout would make
perfect sense. But at the moment we lack an API to choose whether the node
allocation should be strict or not.
Said that, see below to see how to implement the zonelist layout you
suggested on top of the current vm (regardless if it's the best generic
layout or not).
>
> How would you express this situation in classzone ?
Check the 20_numa-mm-1 patch in my tree: to implement your above layout,
you need to make a 10-line change to build_zonelists so that it fills
the zonelist array with normal B before dma (and the other way around for
the normal classzone zonelist on node B).
The memory balancing in my tree will just do the right thing after that;
check the memclass based on zone_idx (that was needed for the old numa
too, in fact).
In short it fits beautifully into it.
> So why would kswapd not go mad _with_ classzone ?
because nobody asks for GFP_DMA and nobody cares about the state of the
DMA classzone. And if somebody does, it is right that kswapd tries
to make some progress, but if nobody asks there's no good reason to
waste CPU.
The scsi pool being allocated from DMA is not a problem, that never
happens at runtime; if it happens before production, kswapd will stop a
few seconds after a failed try.
> I bet the workaround for that problem has very little
> to do with classzones...
That is not a workaround: the memory balancing knows what classzone it
has to work on, so it doesn't fall into the senseless trap of trying to
free a classzone that nobody cares about.
My current VM code is very advanced in knowing every detail; it's not
a guess of "let's look at which zones have plenty of memory". And it
supports NUMA layouts like the one you mentioned just fine (even if
you want to add highmem or any other zones you want). Note that this is
all unrelated to rmap; we can just put rmap on top of my VM bits without
any problem, that's completely orthogonal to the other bits.
Andrea
On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote:
> > This could be expressed as:
> >
> > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA
> > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA
>
> Highmem? Let's assume you speak about "normal" and "dma" only of course.
>
> And that's not always the right zonelist layout. If an allocation asks for
> ram from a certain node, like during the ram bindings, we should use the
> current layout of the numa zonelist. If node A is the preferred, than we
> should allocate from node A first,
You're forgetting about the fact that this NUMA box only
has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple
HIGHMEM zones...
This makes the fallback pattern somewhat more complex.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote:
> On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote:
> > > This could be expressed as:
> > >
> > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA
> > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA
> >
> > Highmem? Let's assume you speak about "normal" and "dma" only of course.
> >
> > And that's not always the right zonelist layout. If an allocation asks for
> > ram from a certain node, like during the ram bindings, we should use the
> > current layout of the numa zonelist. If node A is the preferred, than we
> > should allocate from node A first,
>
> You're forgetting about the fact that this NUMA box only
the example you made doesn't have highmem at all.
> has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple
> HIGHMEM zones...
it has multiple normal zones and only one dma zone. I'm not forgetting
that.
> This makes the fallback pattern somewhat more complex.
It's not more complex than the current way; it's just different and it's
not strict, but it's the best one for allocations that don't "prefer"
memory from a certain node. OTOH we don't have an API to define
'weak' or 'strict' allocation behaviour, so the default had better be
the 'strict' one like in oldnuma. In fact in the future we may want
also a way to define a "very strict" allocation, meaning it
won't fall back into the other nodes at all, even if there's plenty of
memory free on them. An API needs to be built with some bitflag
specifying the "strength" of the numa affinity required. Your layout
provides the 'weakest' approach, which is perfectly fine for some kinds of
non-numa-aware allocations, just as "very strict" will be necessary
for the relocation bindings (if we cannot relocate to the right node
there's no point relocating to another node; let's ignore complex
topologies for now :).
Andrea
> It's not more complex than the current way; it's just different and it's
> not strict, but it's the best one for allocations that don't "prefer"
> memory from a certain node. OTOH we don't have an API to define
> 'weak' or 'strict' allocation behaviour, so the default had better be
> the 'strict' one like in oldnuma. In fact in the future we may want
> also a way to define a "very strict" allocation, meaning it
> won't fall back into the other nodes at all, even if there's plenty of
> memory free on them. An API needs to be built with some bitflag
> specifying the "strength" of the numa affinity required. Your layout
> provides the 'weakest' approach, which is perfectly fine for some kinds of
> non-numa-aware allocations, just as "very strict" will be necessary
> for the relocation bindings (if we cannot relocate to the right node
> there's no point relocating to another node; let's ignore complex
> topologies for now :).
Actually, we (IBM) do have a simple API to do this that Matt Dobson
has been working on that's nearing readiness (& publication). I've
been coding up a patch to _alloc_pages today that has both a strict
and a non-strict binding in it. It first goes through your "preferred" set of
nodes (defined on a per-process basis), then again looking for any
node that you've not strictly banned from the list - I hope that's
sufficient for what you're discussing? I'll try to publish my part tomorrow,
definitely this week - it'll be easy to see how it works in conjunction with
the API, though the rest of the API might take a little longer to arrive ....
Martin.
In message <[email protected]>, Andrea Arcangeli writes:
> On Mon, Mar 04, 2002 at 03:09:51PM -0800, Gerrit Huizenga wrote:
> >
> > In message <[email protected]>, Andrea Arcangeli writes:
> > > it's better to make sure to use all available ram in all nodes instead
> > > of doing migrations when the local node is low on mem. But this again
> > > depends on the kind of numa system, I'm considering the new numas, not
> > > the old ones with the huge penality on the remote memory.
> >
> > Andrea, don't forget that the "old" NUMAs will soon be the "new" NUMAs
> > again. The internal bus and clock speeds are still quite likely to
> > increase faster than the speeds of most interconnects. And even quite
>
> For various reasons I think we'll never go back to "old" NUMA in the
> long run.
Do those reasons involve new advances in physics? How close can you
put, say, 4 CPUs? How physically close together can you put, say
64 CPUs? How fast can you arbitrate sharing/cache coherency on an
interconnect? How fast does, say, Intel, increase the clock rate of
a processor? How fast does the bus rate for the same chip increase?
How fast does the interconnect speed increase? How fast is the L1
cache? L2? L3? L4?
Basically, the trend seems to be hierarchies of latency and bandwidth,
and the more loads arbitrating at a given level of the hierarchy, the
greater the latency. In part, the physics and the cost
of technologies seem to force a hierarchical approach.
I'm not sure why you think Physics won't dictate a return to the
previous differences in latency, especially since several vendors
are already working in that space...
> > a few "big SMP" machines today are really somewhat NUMA-like with a
> > 2 to 1 - remote to local memory latency (e.g. the Corollary interconnect
> > used on a lot of >4-way IA32 boxes is not as fast as the two local
> > busses).
>
> there's a reason for that.
>
> > So, desiging for the "new" NUMAs is fine if your code goes into
> > production this year. But if it is going into production in two to
> > three years, you might want to be thinking about some greater memory
> > latency ratios for the upcoming hardware configurations...
>
> Disagree, but don't take me wrong, I'm not really suggesting to design
> for new numa only. I think linux should support both equally well, so
> some heuristic like in the scheduler will be mostly the same, but they
> will need different heuristics in some other place. For example the
> "less frequently used ram migration instead of taking advantage of free
> memory in the other nodes first" should fall in the old numa category.
This is where I think some of the topology representation work will
help (lse and the sourceforge large system foundry). Various systems
will have various types of hierarchies in memory access, latency and
bandwidth. I agree that heuristics may need to be tuned per arch type,
but look well at the history of hardware development and be aware that
a past trend has been that local and remote bus speeds and memory access
latencies have tended to stair step - with local busses stepping up much
more quickly and interconnect stepping up much more slowly. And with
some architectures using three and four levels of hierarchy, the differences
between local and really, really remote will typically increase over a
five year (or so) window.
gerrit
On Mon, 4 Mar 2002, Rik van Riel wrote:
> On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote:
> > > This could be expressed as:
> > >
> > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA
> > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA
> >
> > Highmem? Let's assume you speak about "normal" and "dma" only of course.
> >
> > And that's not always the right zonelist layout. If an allocation asks for
> > ram from a certain node, like during the ram bindings, we should use the
> > current layout of the numa zonelist. If node A is the preferred, than we
> > should allocate from node A first,
>
> You're forgetting about the fact that this NUMA box only
> has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple
> HIGHMEM zones...
>
> This makes the fallback pattern somewhat more complex.
Both HIMEM (on CPU) and NUMA nodes remind me somewhat of the days when
"band switched" memory was supposed to be the answer to limited addressing
space. The trick was to have things in the right place and not eat up the
capacity with moving data. I think you're right that the problem is not as
simple as several posters have suggested. I'm afraid someone will have to do
some clever adaptive work here, as the speed of the connects to the memory
will change as both evolve. And it's easier to make a node smaller than to
put them closer together.
I'm awaiting the IBM paper(s) on this, I don't find my Hypercube and PVM
experience to fit well anymore :-(
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Mon, 4 Mar 2002, Martin J. Bligh wrote:
> > it's not more complex than the current way, it's just different and it's
> > not strict, but it's the best one for allocations that don't "prefer"
> > memory from a certain node, but OTOH we don't have an API to define
> > 'weak' or 'strict' allocation behaviour so the default would better be
> > the 'strict' one like in oldnuma. In fact in the future we may want to
> > have also a way to define a "very strict" allocation, that means it
> > won't fall back into the other nodes at all, even if there's plenty of
> > memory free on them. An API needs to be built with some bitflag
> > specifying the "strength" of the numa affinity required. Your layout
> > provides the 'weakest' approach, that is perfectly fine for some kind of
> > non-numa-aware allocations, just like "very strict" will be necessary
> > for the relocation bindings (if we cannot relocate in the right node
> > there's no point to relocate in another node, let's ignore complex
> > topologies for now :).
>
> Actually, we (IBM) do have a simple API to do this that Matt Dobson
> has been working on that's nearing readiness (& publication). I've
> been coding up a patch to _alloc_pages today that has both a strict
> and non-strict binding in it. It first goes through your "preferred" set of
> nodes (defined on a per-process basis), then again looking for any
> node that you've not strictly banned from the list - I hope that's
> sufficient for what you're discussing? I'll try to publish my part tomorrow,
> definitely this week - it'll be easy to see how it works in conjunction with
> the API, though the rest of the API might be a little longer before arrival ....
SGI's CpuMemSets is supposed to do that as well. We are now able to bind a
process to a set of memories, and soon we will be able to specify how
strict the allocation can be. Right now, if a process is allowed to
allocate memory from node 0, 2, and 3, it won't look outside of this set.
The memory set granularity is smaller though, because it depends on the
process, and the cpu (and thus the node) this process is running on.
The CpuMemSets have been tested and are available on the Linux Scalability
Effort sourceforge page, if you want to give it a try...
Samuel.
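The two-pass allocation Martin describes (preferred nodes first, then any node not strictly banned) can be sketched as a user-space model. The structures and names below are hypothetical simplifications, not the actual IBM or SGI API under discussion:

```c
#define MAX_NODES 8

/* Hypothetical per-process policy, modelled as bitmaps:
 * 'preferred' nodes are tried first, 'banned' nodes are never used. */
struct numa_policy {
    unsigned preferred;
    unsigned banned;
};

/* free_pages[n] stands in for the real per-node free counters. */
static int node_has_room(const int *free_pages, int node, int order)
{
    return free_pages[node] >= (1 << order);
}

/* Pass 1: the preferred set.  Pass 2: anything not strictly banned.
 * Returns the chosen node, or -1 if the request cannot be satisfied. */
int alloc_two_pass(const struct numa_policy *pol,
                   const int *free_pages, int order)
{
    int node;

    for (node = 0; node < MAX_NODES; node++)
        if ((pol->preferred & (1u << node)) &&
            node_has_room(free_pages, node, order))
            return node;

    for (node = 0; node < MAX_NODES; node++)
        if (!(pol->banned & (1u << node)) &&
            node_has_room(free_pages, node, order))
            return node;

    return -1;
}
```

The "very strict" behaviour Andrea asks for falls out of the same loop structure: ban every node except the wanted one and the second pass has nothing extra to try.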
> They're not in my tree and for very good reasons, Ben made that mistake
> the first time at some point during 2.3. You've a big downside with the
> per-zone information: all normal machines (like with 64M of ram or 2G of
> ram), where theoretical O(N) complexity is perfectly fine for lowmem
> dma/normal allocations, will be hurt very much by the per-node lrus.
I'm not sure why it has to be a big impact for the "common desktop"
machine - they should only have one zone anyway. ZONE_DMA should
shrivel up and die in a lonely corner. Yeah, OK, keep it as a
back-compatibility option for those museum pieces that need it,
but personally I'd make ISA DMA support a config option defaulting
to off ... maybe it's possible to do dynamically (just stick no
pages in it, though I suspect it's too late by the time we know).
Hardly any common desktop will need HIGHMEM support, and those
that do will probably get enough kickback from per-zone things to
pay for the cost.
To me, per-node would probably be about as good, but I don't think
per-zone is as bad as you think.
> making it a per-lru spinlock is natural scalability optimization,
> but anyways pagemap_lru_lock isn't a very critical spinlock.
see my other email - it's worse in rmap.
M.
> SGI's CpuMemSets is supposed to do that as well. We are now able to bind a
> process to a set of memories, and soon we will be able to specify how
> strict the allocation can be. Right now, if a process is allowed to
> allocate memory from node 0, 2, and 3, it won't look outside of this set.
> The memory set granularity is smaller though, because it depends on the
> process, and the cpu (and thus the node) this process is running on.
> The CpuMemSets have been tested and are available on the Linux Scalability
> Effort sourceforge page, if you want to give it a try...
The problem with CpuMemSets is that it's mind-bogglingly
complex - I think we need something simpler ... at least
to start with.
M.
>> seems to me to be that the way we do current swap-out scanning is
>> virtual, not physical, and thus cannot be per zone => per node.
>
> actually if you do process bindings the pte should be all allocated
> local to the node if numa is enabled, and if there's no binding, no
> matter if you have rmap or not, the ptes can be spread across the whole
> system (just like the physical pages in the inactive/active lrus,
> because they're not per-node).
Why does it matter if the ptes are spread across the system?
I get the feeling I'm missing some magic trick here ...
In reality we're not going to hard-bind every process,
though we'll try to keep most of the allocations local.
Imagine I have eight nodes (0..7), each with one zone (0..7).
I need to free memory from zone 5 ... with the virtual scan,
it seems to me that all I can do is blunder through the whole
process list looking for something that happens to have pages
on zone 5 that aren't being used much? Is this not expensive?
Won't I end up with a whole bunch of cross-node mem transfers?
M.
On Mon, 4 Mar 2002, Martin J. Bligh wrote:
> > SGI's CpuMemSets is supposed to do that as well. We are now able to bind a
> > process to a set of memories, and soon we will be able to specify how
> > strict the allocation can be. Right now, if a process is allowed to
> > allocate memory from node 0, 2, and 3, it won't look outside of this set.
> > The memory set granularity is smaller though, because it depends on the
> > process, and the cpu (and thus the node) this process is running on.
> > The CpuMemSets have been tested and are available on the Linux Scalability
> > Effort sourceforge page, if you want to give it a try...
>
> The problem with CpuMemSets is that it's mind-bogglingly
> complex - I think we need something simpler ... at least
> to start with.
Yes, I agree with the fact that it is complex. Right now, you need
to get a good understanding of them in order for them to be useful.
However I think this is the price to pay for something that covers a large
range of cases, from the simplest ones to very complex ones. The simpler
implementation you are talking about will become useless as soon as you
need to cover more complex cases.
A good thing would be to define an API on top of CpuMemSets to allow
interested people to use them quickly for those simple cases.
Samuel.
1G x86 machines are becoming fairly common and they either need to waste
ram or turn on himem.
David Lang
On Mon, 4 Mar 2002, Martin J. Bligh wrote:
> > They're not in my tree and for very good reasons, Ben made that mistake
> > the first time at some point during 2.3. You've a big downside with the
> > per-zone information: all normal machines (like with 64M of ram or 2G of
> > ram), where theoretical O(N) complexity is perfectly fine for lowmem
> > dma/normal allocations, will be hurt very much by the per-node lrus.
>
> I'm not sure why it has to be a big impact for the "common desktop"
> machine - they should only have one zone anyway. ZONE_DMA should
> shrivel up and die in a lonely corner. Yeah, OK, keep it as a
> back-compatibility option for those museum pieces that need it,
> but personally I'd make ISA DMA support a config option defaulting
> to off ... maybe it's possible to do dynamically (just stick no
> pages in it, though I suspect it's too late by the time we know).
> Hardly any common desktop will need HIGHMEM support, and those
> that do will probably get enough kickback from per-zone things to
> pay for the cost.
>
> To me, per-node would probably be about as good, but I don't think
> per-zone is as bad as you think.
>
> > making it a per-lru spinlock is natural scalability optimization,
> > but anyways pagemap_lru_lock isn't a very critical spinlock.
>
> see my other email - it's worse in rmap.
>
> M.
In article <[email protected]> you wrote:
> I don't see how per-zone lru lists are related to the kswapd deadlock.
> as soon as the ZONE_DMA will be filled with filedescriptors or with
> pagetables (or whatever non pageable/shrinkable kernel datastructure you
> prefer) kswapd will go mad without classzone, period.
So it does with classzone on a scsi system....
On Mon, 4 Mar 2002 15:03:19 -0800 (PST)
Samuel Ortiz <[email protected]> wrote:
> On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> > yes, also make sure to keep this patch from SGI applied, it's very
> > important to avoid memory balancing if there's still free memory in the
> > other zones:
> >
> > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/20_numa-mm-1
> This patch is included (in a slightly different form) in the 2.4.17
> discontig patch (http://sourceforge.net/projects/discontig).
> But martin may need another patch to apply. With the current
> implementation of __alloc_pages, we have 2 problems :
> 1) A node is not emptied before moving to the following node
> 2) If none of the zones on a node have more freepages than min(defined as
> min+= z->pages_low), we start looking on the following node, instead of
> trying harder on the same node.
Forgive my ignorance, but aren't these two problems completely identical in a
UP or even SMP setup? I mean what is the negative drawback in your proposed
solution, if there simply is no other node? If it is not harmful to the
"standard" setups it may as well be included in the mainline, or not?
Regards,
Stephan
On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote:
> > On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> > > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote:
> > > > This could be expressed as:
> > > >
> > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA
> > > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA
> the example you made doesn't have highmem at all.
>
> > has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple
> > HIGHMEM zones...
>
> it has multiple zone normal and only one zone dma. I'm not forgetting
> that.
Your reality doesn't seem to correspond well with NUMA-Q
reality.
Rik
--
Will hack the VM for food.
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, 5 Mar 2002 [email protected] wrote:
> In article <[email protected]> you wrote:
>
> > I don't see how per-zone lru lists are related to the kswapd deadlock.
> > as soon as the ZONE_DMA will be filled with filedescriptors or with
> > pagetables (or whatever non pageable/shrinkable kernel datastructure you
> > prefer) kswapd will go mad without classzone, period.
>
> So does it with class zone on a scsi system....
Furthermore, there is another problem which is present in
both 2.4 vanilla, -aa and -rmap.
Suppose that (1) we are low on memory in ZONE_NORMAL and
(2) we have enough free memory in ZONE_HIGHMEM and (3) the
memory in ZONE_NORMAL is for a large part taken by buffer
heads belonging to pages in ZONE_HIGHMEM.
In that case, none of the VMs will bother freeing the buffer
heads associated with the highmem pages and kswapd will have
to work hard trying to free something else in ZONE_NORMAL.
Now before you say this is a strange theoretical situation,
I've seen it here when using highmem emulation. Low memory
was limited to 30 MB (16 MB ZONE_DMA, 14 MB ZONE_NORMAL)
and the rest of the machine was HIGHMEM. Buffer heads were
taking up 8 MB of low memory, dcache and inode cache were a
good second with 2 MB and 5 MB respectively.
How to efficiently fix this case ? I wouldn't know right now...
However, I guess we might want to come up with a fix because it's
a quite embarrassing scenario ;)
regards,
Rik
--
Will hack the VM for food.
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, Mar 05, 2002 at 08:35:51AM +0000, [email protected] wrote:
> In article <[email protected]> you wrote:
>
> > I don't see how per-zone lru lists are related to the kswapd deadlock.
> > as soon as the ZONE_DMA will be filled with filedescriptors or with
> > pagetables (or whatever non pageable/shrinkable kernel datastructure you
> > prefer) kswapd will go mad without classzone, period.
>
> So does it with class zone on a scsi system....
as said in another message such pool isn't refilled in a flood.
Andrea
On Tue, Mar 05, 2002 at 09:22:25AM -0300, Rik van Riel wrote:
> On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> > On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote:
> > > On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> > > > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote:
> > > > > This could be expressed as:
> > > > >
> > > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA
> > > > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA
>
> > the example you made doesn't have highmem at all.
> >
> > > has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple
> > > HIGHMEM zones...
> >
> > it has multiple zone normal and only one zone dma. I'm not forgetting
> > that.
>
> Your reality doesn't seem to correspond well with NUMA-Q
> reality.
Not sure I understand your point; the current code should be fine for all
the classic numas, and for the case you were making too. Anyways
whatever is wrong for NUMA-Q it's not a problem introduced with the
classzone design, because that's completely orthogonal to whatever numa
heuristics in the allocator and memory balancing.
Andrea
On Tue, Mar 05, 2002 at 09:41:56AM -0300, Rik van Riel wrote:
> On Tue, 5 Mar 2002 [email protected] wrote:
> > In article <[email protected]> you wrote:
> >
> > > I don't see how per-zone lru lists are related to the kswapd deadlock.
> > > as soon as the ZONE_DMA will be filled with filedescriptors or with
> > > pagetables (or whatever non pageable/shrinkable kernel datastructure you
> > > prefer) kswapd will go mad without classzone, period.
> >
> > So does it with class zone on a scsi system....
>
> Furthermore, there is another problem which is present in
> both 2.4 vanilla, -aa and -rmap.
Please check the code. scsi_resize_dma_pool is called when you insmod a
module. It doesn't really matter if kswapd runs for 2 seconds during
insmod. And anyways if there were some buggy code allocating dma in
a flood by mistake on a high end machine, then I could fix it completely by
tracking down when somebody freed dma pages over some watermark, but
that would add additional accounting that I don't feel is needed, simply
because if you don't need the DMA zone you shouldn't use GFP_DMA. I feel
fixing scsi is the right thing if something is wrong there (but again, I
don't see any flood allocation during production with scsi).
> Suppose that (1) we are low on memory in ZONE_NORMAL and
> (2) we have enough free memory in ZONE_HIGHMEM and (3) the
> memory in ZONE_NORMAL is for a large part taken by buffer
> heads belonging to pages in ZONE_HIGHMEM.
>
> In that case, none of the VMs will bother freeing the buffer
> heads associated with the highmem pages and kswapd will have
wrong, classzone will do that, both for NORMAL and HIGHMEM allocations.
You won't free the buffer heads only if you do DMA allocations and
by luck there are no buffer heads in the DMA zone; otherwise it
will free the bh during DMA allocations too. Remember the highmem classzone
means all the ram in the machine, not just the highmem zone.
> to work hard trying to free something else in ZONE_NORMAL.
>
> Now before you say this is a strange theoretical situation,
> I've seen it here when using highmem emulation. Low memory
> was limited to 30 MB (16 MB ZONE_DMA, 14 MB ZONE_NORMAL)
> and the rest of the machine was HIGHMEM. Buffer heads were
> taking up 8 MB of low memory, dcache and inode cache were a
> good second with 2 MB and 5 MB respectively.
>
>
> How to efficiently fix this case ? I wouldn't know right now...
I don't see anything to fix, that should be just handled flawlessly.
> However, I guess we might want to come up with a fix because it's
> a quite embarassing scenario ;)
>
> regards,
>
> Rik
> --
> Will hack the VM for food.
>
> http://www.surriel.com/ http://distro.conectiva.com/
Andrea
--On Tuesday, March 05, 2002 9:22 AM -0300 Rik van Riel
<[email protected]> wrote:
> On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
>> On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote:
>> > On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
>> > > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote:
>> > > > This could be expressed as:
>> > > >
>> > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA
>> > > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA
>
>> the example you made doesn't have highmem at all.
>>
>> > has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple
>> > HIGHMEM zones...
>>
>> it has multiple zone normal and only one zone dma. I'm not forgetting
>> that.
>
> Your reality doesn't seem to correspond well with NUMA-Q
> reality.
I think the difference is that he has a 64 bit vaddr space,
and I don't ;-) Thus all mem to him is ZONE_NORMAL (not sure
why he still has a ZONE_DMA, unless he reused it for the 4Gb
boundary). Andrea, is my assumption correct?
On a 32 bit arch (eg ia32) everything above 896Mb (by default)
is ZONE_HIGHMEM. Thus if I have > 896Mb in the first node,
I will have one ZONE_NORMAL in node 0, and a ZONE_HIGHMEM
in every node. If I have < 896Mb in the first node, then
I have a ZONE_NORMAL in every node up to and including the
896 breakpoint, and a ZONE_HIGHMEM in every node from the
breakpoint up (including the breakpoint node). Thus the number
of zones = number of nodes + 1.
M.
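Martin's zone count can be checked with a toy model: nodes are filled in address order, memory above the 896MB boundary is HIGHMEM, and the one node the boundary falls inside contributes two zones. This is a user-space sketch of his arithmetic, not kernel code:

```c
/* 32-bit layout as described above: memory above 896MB is HIGHMEM.
 * Given per-node sizes in MB (nodes filled in address order), count
 * the zones the machine ends up with.  Illustrative model only. */
#define NORMAL_LIMIT_MB 896

int count_zones(const int *node_mb, int nodes)
{
    int zones = 0, base = 0, i;

    for (i = 0; i < nodes; i++) {
        int lo = base, hi = base + node_mb[i];
        if (lo < NORMAL_LIMIT_MB)   /* node holds some NORMAL (or DMA) */
            zones++;
        if (hi > NORMAL_LIMIT_MB)   /* node holds some HIGHMEM */
            zones++;
        base = hi;
    }
    return zones;
}
```

Four 1GB nodes give 5 zones and four 512MB nodes also give 5: nodes + 1 in both cases, as long as the boundary falls strictly inside some node.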
On Tue, 5 Mar 2002, Andrea Arcangeli wrote:
> > Suppose that (1) we are low on memory in ZONE_NORMAL and
> > (2) we have enough free memory in ZONE_HIGHMEM and (3) the
> > memory in ZONE_NORMAL is for a large part taken by buffer
> > heads belonging to pages in ZONE_HIGHMEM.
> >
> > In that case, none of the VMs will bother freeing the buffer
> > heads associated with the highmem pages and kswapd will have
>
> wrong, classzone will do that, both for NORMAL and HIGHMEM allocations.
Let me explain it to you again:
1) ZONE_NORMAL + ZONE_DMA is low on free memory
2) the memory is taken by buffer heads, these
buffer heads belong to pagecache pages that
live in highmem
3) the highmem zone has enough free memory
As you probably know, shrink_caches() has the following line
of code to make sure it won't try to free highmem pages:
if (!memclass(page->zone, classzone))
continue;
Of course, this line of code also means it will not take
away the buffer heads from highmem pages, so the ZONE_NORMAL
and ZONE_DMA memory USED BY THE BUFFER HEADS will not be
freed.
regards,
Rik
--
Will hack the VM for food.
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, 5 Mar 2002, Stephan von Krawczynski wrote:
> On Mon, 4 Mar 2002 15:03:19 -0800 (PST)
> Samuel Ortiz <[email protected]> wrote:
>
> > On Mon, 4 Mar 2002, Andrea Arcangeli wrote:
> > > yes, also make sure to keep this patch from SGI applied, it's very
> > > important to avoid memory balancing if there's still free memory in the
> > > other zones:
> > >
> > > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/20_numa-mm-1
> > This patch is included (in a slightly different form) in the 2.4.17
> > discontig patch (http://sourceforge.net/projects/discontig).
> > But martin may need another patch to apply. With the current
> > implementation of __alloc_pages, we have 2 problems :
> > 1) A node is not emptied before moving to the following node
> > 2) If none of the zones on a node have more freepages than min(defined as
> > min+= z->pages_low), we start looking on the following node, instead of
> > trying harder on the same node.
>
> Forgive my ignorance, but aren't these two problems completely identical in a
> UP or even SMP setup? I mean what is the negative drawback in your proposed
> solution, if there simply is no other node? If it is not harmful to the
> "standard" setups it may as well be included in the mainline, or not?
You're right. It is not harmful to the standard UMA boxes. However, the
current __alloc_pages does just what it is supposed to do on those boxes.
That's why very few people have been bothered by this bug. I was just
waiting for Andrea or Rik's feedback before trying to push it to Marcelo.
Maybe they'll find some time to review the patch soon...
Cheers,
Samuel.
On Tue, Mar 05, 2002 at 01:57:13PM -0300, Rik van Riel wrote:
> Let me explain it to you again:
>
> 1) ZONE_NORMAL + ZONE_DMA is low on free memory
>
> 2) the memory is taken by buffer heads, these
> buffer heads belong to pagecache pages that
> live in highmem
>
> 3) the highmem zone has enough free memory
>
>
> As you probably know, shrink_caches() has the following line
> of code to make sure it won't try to free highmem pages:
>
> if (!memclass(page->zone, classzone))
> continue;
>
> Of course, this line of code also means it will not take
> away the buffer heads from highmem pages, so the ZONE_NORMAL
> and ZONE_DMA memory USED BY THE BUFFER HEADS will not be
> freed.
I'm very sorry for not understanding your previous email; many thanks
for explaining this again, since I understood perfectly this time :).
Right you are. I don't see this as a showstopper, but I think it would
be nice to do something about it in 2.4 too.
I think the best fix is to define a memclass_related() that checks
page->buffers to see if there's any lowmem bh queued on top of such a
page. The check cannot be embedded into memclass; we need this
additional check because we shouldn't consider the freeing of a lowmem
bh a "classzone normal progress", and furthermore we should only get
into the path of the bh-freeing, not the path of the page-freeing, to
avoid throwing away highmem pagecache due to a lowmem shortage. And if we
freed something significant but we think we failed (because we cannot
account the memclass_related as a progress), the .high watermark will
let us go ahead with the allocation later in page_alloc.c.
The above again fits beautifully into the classzone logic and it makes
100% sure not to waste a single page of highmem due to a lowmem shortage.
It's nearly impossible that classzone collides with anything good
because classzone is the natural thing to do.
btw, I think you've the very same problem with the "plenty" logic: the
highmem zone will look as "plenty of ram free" and you won't balance it,
even though you should, because otherwise the bh wouldn't be released.
The memclass_related will just make the VM accurate on those bh, and
later it can be extended to other metadata too if necessary, so we'll
always do the right thing.
Another approach would be to add the pages backing the bh into the lru
too, but then we'd need to mess with the slab and new bitflags, new
methods and so I don't think it's the best solution. The only good
reason for putting new kind of entries in the lru would be to age them
too the same way as the other pages, but we don't need that with the bh
(they're just in, and we mostly care only about the page age, not the bh
age).
Andrea
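A minimal user-space sketch of the memclass_related() idea Andrea outlines above. The structures are toy stand-ins for the 2.4 ones (zones ordered DMA=0, NORMAL=1, HIGHMEM=2), and the helper is an assumption about the proposal, not merged code:

```c
#include <stddef.h>

/* Toy model of the structures involved; names mirror 2.4 but the
 * layout here is hypothetical and heavily simplified. */
struct zone { int classzone_idx; };
struct bh   { struct zone *zone; struct bh *next; };
struct page { struct zone *zone; struct bh *buffers; };

/* memclass(): a zone is inside the allocation's classzone if its
 * index is <= the classzone index (DMA=0, NORMAL=1, HIGHMEM=2). */
static int memclass(const struct zone *z, const struct zone *classzone)
{
    return z->classzone_idx <= classzone->classzone_idx;
}

/* memclass_related(): a highmem page is still interesting to a
 * lowmem scan if any buffer_head hanging off it lives in a zone
 * the scan cares about -- freeing those bh's releases lowmem even
 * though the page itself is out of reach. */
int memclass_related(const struct page *page, const struct zone *classzone)
{
    const struct bh *bh;

    if (memclass(page->zone, classzone))
        return 1;
    for (bh = page->buffers; bh != NULL; bh = bh->next)
        if (memclass(bh->zone, classzone))
            return 1;
    return 0;
}
```

The extra loop is what lets a lowmem scan spot a highmem page whose buffer heads pin ZONE_NORMAL memory, which is exactly the case Rik reported.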
On Tue, Mar 05, 2002 at 07:26:04PM +0100, Andrea Arcangeli wrote:
> Another approch would be to add the pages backing the bh into the lru
> too, but then we'd need to mess with the slab and new bitflags, new
> methods and so I don't think it's the best solution. The only good
> reason for putting new kind of entries in the lru would be to age them
> too the same way as the other pages, but we don't need that with the bh
> (they're just in, and we mostly care only about the page age, not the bh
> age).
For 2.5 I kind of like this idea. There is one issue though: to make
this work really well we'd probably need a ->prepareforfreepage()
or similar page op (which for page cache pages can be equal to
writepage()) which the vm can use to prepare this page for freeing.
Arjan van de Ven wrote:
>
> On Tue, Mar 05, 2002 at 07:26:04PM +0100, Andrea Arcangeli wrote:
>
> > Another approch would be to add the pages backing the bh into the lru
> > too, but then we'd need to mess with the slab and new bitflags, new
> > methods and so I don't think it's the best solution. The only good
> > reason for putting new kind of entries in the lru would be to age them
> > too the same way as the other pages, but we don't need that with the bh
> > (they're just in, and we mostly care only about the page age, not the bh
> > age).
>
> For 2.5 I kind of like this idea. There is one issue though: to make
> this work really well we'd probably need a ->prepareforfreepage()
> or similar page op (which for page cache pages can be equal to writepage()
> ) which the vm can use to prepare this page for freeing.
If we stop using buffer_heads for pagecache I/O, we don't have this problem.
I'm showing a 20% reduction in CPU load for large reads. Which is a *lot*,
given that read load is dominated by copy_to_user.
2.5 is significantly less efficient than 2.4 at this time. Some of that
seems to be due to worsened I-cache footprint, and a lot of it is due
to the way buffer_heads now have a BIO wrapper layer.
Take a look at submit_bh(). The writing is on the wall, guys.
-
On 2 Mar 2002, Robert Love wrote:
> On Sat, 2002-03-02 at 15:47, Andrea Arcangeli wrote:
>
> > On Sat, Mar 02, 2002 at 09:57:49PM -0200, Denis Vlasenko wrote:
> >
> > > If rmap is really better than current VM, it will be merged into head
> > > development branch (2.5). There is no anti-rmap conspiracy :-)
> >
> > Indeed.
>
> Of note: I don't think anyone "loses" if one VM is merged or not. A
> reverse mapping VM is a significant redesign of our current VM approach
> and if it proves better, yes, I suspect (and hope) it will be merged
> into 2.5.
As noted, I do use both flavors of VM. But in practical terms the delay
getting the "performance" changes, rmap, preempt, scheduler, into a stable
kernel will be 18-24 months by my guess, 12-18 months to 2.6 and six
months before Linus opens 2.7 and lets things gel. So to the extent that
people who would be using those kernels get less performance, or less
responsiveness, I guess they are the only ones who lose.
Feel free to tell me it won't be that long or that 2.5 will be stable
enough for production use, but be prepared to have people post release
dates from 1.2 to 2.0, 2.0 to 2.2, 2.2 to 2.4, and just laugh about
stability. There are a lot of neat new things in 2.5, and they will take
relatively a long time to be stable. No one wants to limit the
development of 2.5, or at least the posts I read are in favor of more
change rather than less.
In any case, I agree there are no "losers" in that sense.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Tue, Mar 05, 2002 at 11:12:46AM -0800, Andrew Morton wrote:
> Arjan van de Ven wrote:
> >
> > On Tue, Mar 05, 2002 at 07:26:04PM +0100, Andrea Arcangeli wrote:
> >
> > > Another approch would be to add the pages backing the bh into the lru
> > > too, but then we'd need to mess with the slab and new bitflags, new
> > > methods and so I don't think it's the best solution. The only good
> > > reason for putting new kind of entries in the lru would be to age them
> > > too the same way as the other pages, but we don't need that with the bh
> > > (they're just in, and we mostly care only about the page age, not the bh
> > > age).
> >
> > For 2.5 I kind of like this idea. There is one issue though: to make
> > this work really well we'd probably need a ->prepareforfreepage()
> > or similar page op (which for page cache pages can be equal to writepage()
> > ) which the vm can use to prepare this page for freeing.
>
> If we stop using buffer_heads for pagecache I/O, we don't have this problem.
>
> I'm showing a 20% reduction in CPU load for large reads. Which is a *lot*,
> given that read load is dominated by copy_to_user.
>
> 2.5 is significantly less efficient than 2.4 at this time. Some of that
> seems to be due to worsened I-cache footprint, and a lot of it is due
> to the way buffer_heads now have a BIO wrapper layer.
Indeed, at the moment bio is making the thing more expensive in CPU
terms, even if OTOH it makes rawio fly.
> Take a look at submit_bh(). The writing is on the wall, guys.
>
> -
Andrea
On Wed, Mar 06, 2002 at 12:03:14AM +0100, Andrea Arcangeli wrote:
> On Tue, Mar 05, 2002 at 11:12:46AM -0800, Andrew Morton wrote:
> > Arjan van de Ven wrote:
> > >
> > > On Tue, Mar 05, 2002 at 07:26:04PM +0100, Andrea Arcangeli wrote:
> > >
> > > > Another approch would be to add the pages backing the bh into the lru
> > > > too, but then we'd need to mess with the slab and new bitflags, new
> > > > methods and so I don't think it's the best solution. The only good
> > > > reason for putting new kind of entries in the lru would be to age them
> > > > too the same way as the other pages, but we don't need that with the bh
> > > > (they're just in, and we mostly care only about the page age, not the bh
> > > > age).
> > >
> > > For 2.5 I kind of like this idea. There is one issue though: to make
> > > this work really well we'd probably need a ->prepareforfreepage()
> > > or similar page op (which for page cache pages can be equal to writepage()
> > > ) which the vm can use to prepare this page for freeing.
> >
> > If we stop using buffer_heads for pagecache I/O, we don't have this problem.
> >
> > I'm showing a 20% reduction in CPU load for large reads. Which is a *lot*,
> > given that read load is dominated by copy_to_user.
BTW, I noticed one of my last emails was a private reply, so I'll
answer here too for the buffer_head pagecache I/O part:
Having persistence on the physical I/O information is a good thing, so
you don't need to resolve logical to physical block at every I/O, and bio
has a cost to setup too. The information we carry on the bh isn't
superfluous, it's needed for the I/O, so even if you don't use the
buffer_head you will still need some other memory to hold such
information, or alternatively you need to call get_block (and serialize
in the fs) at every I/O even if you've plenty of ram free. So I don't
think the current setup is that stupid; the current bh only sucks for
rawio and that's fixed by bio.
Andrea
Andrea Arcangeli wrote:
>
> BTW, I noticed one of my last my email was a private reply so I'll
> answer here too for the buffer_head pagecache I/O part:
Heh. Me too.
> Having persistence on the physical I/O information is a good thing, so
> you don't need to resolve logical to physical blocks at every I/O, and
> bio has a setup cost too. The information we carry in the bh isn't
> superfluous, it's needed for the I/O, so even if you don't use the
> buffer_head you will still need some other memory to hold that
> information, or alternatively you need to call get_block (and serialize
> in the fs) at every I/O even if you have plenty of RAM free. So I don't
> think the current setup is that stupid; the current bh only sucks for
> rawio, and that's fixed by bio.
The small benefit of caching the get_block result in the buffers
just isn't worth it.
At present, a one-megabyte write to disk requires the allocation
and freeing and manipulation and locking of 256 buffer_heads and
256 BIOs. lru_list_lock, hash_table_lock, icache/dcache
thrashing, etc, etc. It's an *enormous* amount of work.
I'm doing the same amount of work with as few as two (yes, 2) BIOs.
This is not something theoretical. I have numbers, and code.
20% speedup on a 2-way with a workload which is dominated
by copy_*_user. It'll be more significant on larger machines,
on machines with higher core/main memory speed ratios, on
machines with higher I/O bandwidth. (OK, that bit was theoretical).
-
On Tue, Mar 05, 2002 at 03:24:49PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > BTW, I noticed one of my last emails was a private reply, so I'll
> > answer here too for the buffer_head pagecache I/O part:
>
> Heh. Me too.
>
> > Having persistence on the physical I/O information is a good thing, so
> > you don't need to resolve logical to physical blocks at every I/O, and
> > bio has a setup cost too. The information we carry in the bh isn't
> > superfluous, it's needed for the I/O, so even if you don't use the
> > buffer_head you will still need some other memory to hold that
> > information, or alternatively you need to call get_block (and serialize
> > in the fs) at every I/O even if you have plenty of RAM free. So I don't
> > think the current setup is that stupid; the current bh only sucks for
> > rawio, and that's fixed by bio.
>
> The small benefit of caching the get_block result in the buffers
> just isn't worth it.
>
> At present, a one-megabyte write to disk requires the allocation
> and freeing and manipulation and locking of 256 buffer_heads and
> 256 BIOs. lru_list_lock, hash_table_lock, icache/dcache
> thrashing, etc, etc. It's an *enormous* amount of work.
>
> I'm doing the same amount of work with as few as two (yes, 2) BIOs.
>
> This is not something theoretical. I have numbers, and code.
> 20% speedup on a 2-way with a workload which is dominated
> by copy_*_user. It'll be more significant on larger machines,
> on machines with higher core/main memory speed ratios, on
> machines with higher I/O bandwidth. (OK, that bit was theoretical).
then let's cut and paste this part as well :)
It depends what you're doing. If you do `cp /dev/zero .` and the fs is
lucky enough to have free contiguous space I can definitely see the
improvement of high-level merging, but that's not always what you're
doing with the fs; for example it's not the case for kernel compiles
and small files, where you'll always be fragmented, where the bio will
hold at most 4k, and where you keep rewriting into cache. Each time you
enter get_block you take a fs lock rather than staying at the per-page
lock; it's not additional locking, the bh on the pagecache doesn't need
any additional locking. So for a kernel compile the current situation is
an obvious advantage in performance and scalability (fs code definitely
doesn't scale at the moment).
But OK, globally it will probably be better to drop the bh, since we
have to work with the bio somehow anyway, and at the very least we
don't want to be slowed down by the bio logic in the physically
contiguous pagecache flood case.
I just meant the bh isn't totally pointless and it could be shrunk as
Arjan said in a private email.
Andrea
Andrea Arcangeli wrote:
>
>
> It depends what you're doing. If you do `cp /dev/zero .` and the fs is
> lucky enough to have free contiguous space I can definitely see the
> improvement of high-level merging, but that's not always what you're
> doing with the fs; for example it's not the case for kernel compiles
> and small files, where you'll always be fragmented, where the bio will
> hold at most 4k, and where you keep rewriting into cache.
Cache effects. We touch the buffers at prepare_write. We touch them
again at commit_write(). And at writeout time. And at page reclaim
time. I think it's this general white-noise cost which is causing
the funny profiles which I'm seeing. (For example, with no-buffers,
the cost of the IDE driver setup and interrupt handler has nosedived).
> Each time you enter
> get_block you take a fs lock rather than staying at the per-page
> lock; it's not additional locking, the bh on the pagecache doesn't need
> any additional locking.
For writes, we have the lru list insertion, and the hashtable lock (twice).
> So for a kernel compile the current situation is
> an obvious advantage in performance and scalability (fs code definitely
> doesn't scale at the moment).
mm.. Delayed allocation means that the short-lived files never get
a disk mapping at all.
And yes, if all files are 100% fragmented then the BIO aggregation
doesn't help as much.
> But OK, globally it will probably be better to drop the bh, since we
> have to work with the bio somehow anyway, and at the very least we
> don't want to be slowed down by the bio logic in the physically
> contiguous pagecache flood case.
>
> I just meant the bh isn't totally pointless and it could be shrunk as
> Arjan said in a private email.
bh represents a disk block. It's a wrapper around a section of the
block device's pagecache pages. We'll always need a representation
of disk blocks. For filesystem metadata.
-
On March 5, 2002 01:41 pm, Rik van Riel wrote:
> On Tue, 5 Mar 2002 [email protected] wrote:
> > In article <[email protected]> you wrote:
> >
> > > I don't see how per-zone lru lists are related to the kswapd deadlock.
> > > As soon as ZONE_DMA is filled with file descriptors or with
> > > pagetables (or whatever non-pageable/shrinkable kernel data structure
> > > you prefer), kswapd will go mad without classzone, period.
> >
> > So does it with class zone on a scsi system....
>
> Furthermore, there is another problem which is present in
> both 2.4 vanilla, -aa and -rmap.
>
> Suppose that (1) we are low on memory in ZONE_NORMAL and
> (2) we have enough free memory in ZONE_HIGHMEM and (3) the
> memory in ZONE_NORMAL is for a large part taken by buffer
> heads belonging to pages in ZONE_HIGHMEM.
>
> In that case, none of the VMs will bother freeing the buffer
> heads associated with the highmem pages and kswapd will have
> to work hard trying to free something else in ZONE_NORMAL.
>
> Now before you say this is a strange theoretical situation,
> I've seen it here when using highmem emulation. Low memory
> was limited to 30 MB (16 MB ZONE_DMA, 14 MB ZONE_NORMAL)
> and the rest of the machine was HIGHMEM. Buffer heads were
> taking up 8 MB of low memory, dcache and inode cache were a
> good second with 2 MB and 5 MB respectively.
>
>
> How to efficiently fix this case ? I wouldn't know right now...
> However, I guess we might want to come up with a fix because it's
> a quite embarrassing scenario ;)
There's the short term fix - hack the vm - and the long term fix:
get rid of buffers. A buffer does three jobs at the moment:
1) cache the physical block number
2) io handle for a file block
3) data handle for a file block, including locking
The physical block number could be moved either into the struct
page - which is undesirable since it wastes space for pages that don't
have physical blocks - or, my preferred solution, into the
page cache radix tree.
For (2) we have a whole flock of solutions on the way. I guess
bio does the job quite nicely as Andrew Morton demonstrated last
week.
For (3), my idea is to generalize the size of the object referred
to by struct page so that it can match the filesystem block size.
This is still in the research stage, and there are a few issues I'm
looking at, but the more I look the more practical it seems. How
nice it would be to get rid of the page->buffers->page tangle, for
one thing.
--
Daniel