Please forgive the following for deliberate simplifications and accidental misunderstandings.
If you flame gently, *maybe* we can pull something useful out of the ashen remains ;-)
This is an evolution of my original plan to do per-cpu persistent kmap pools a
couple of months ago - thanks to Rik for pointing out that this was not going to work.
OK, so currently we divide the virtual address space into user space (u-space) and kernel
space (k-space) at the PAGE_OFFSET boundary, with two fundamental differences that I'm
interested in for the sake of this argument.
1. u-space has different protections from k-space (i.e. user code can't read/write it directly)
2. u-space is per task. k-space is common across all tasks.
Imagine we create a hybrid "u-k-space" with the protections of k-space, but the locality
of u-space .... either by making part of the current k-space per task or by making part of
the current u-space protected like k-space ... not sure which would be easier.
This u-k-space would be a good area for at least two things (and probably others):
1. A good place to put the process pagetables. We only use up the amount of virtual
address space (vaddr space) for one task's pagetables - if we map them into ZONE_NORMAL
(as current mainline) we use up vaddr space for *all* tasks' pagetables - if we map them
through kmap (atomic or persistent), we pay dearly in tlbflushes.
2. A good place to make a per-task kmap area. This would be on a pool system similar to
the current persistent kmap. We would potentially do only a local cpu tlb_flush_all when
this pool ran out (though if we're clever, we can use the context switch's tlb_flush to
do this for us). This would make copy_to_user stuff that's currently done under kmap
cheaper.
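To make item 2 a bit more concrete, here's a rough sketch of the kind of
pool I have in mind (every name below is made up for illustration, none
of this is existing code, and the local flush is just whatever the arch
provides):

/*
 * Hedged sketch of a per-task kmap pool.  The slot count, the pool
 * structure and set_task_kmap_pte() are invented for this example;
 * they do not exist in the tree.
 */
#include <linux/mm.h>
#include <asm/pgtable.h>

#define TASK_KMAP_SLOTS	32		/* size of the per-task pool */

struct task_kmap_pool {
	unsigned long base;		/* start of this task's kmap window */
	int next;			/* next slot to hand out */
};

/* Hypothetical helper: install a pte in the current task's pagetables. */
extern void set_task_kmap_pte(unsigned long vaddr, struct page *page);

void *task_kmap(struct task_kmap_pool *pool, struct page *page)
{
	unsigned long vaddr;

	if (pool->next == TASK_KMAP_SLOTS) {
		/*
		 * Pool exhausted: recycle all the slots at once.  Because
		 * the mappings are per-task, only the local CPU's TLB needs
		 * flushing - or, with some accounting, the flush already
		 * done at the next context switch can stand in for it.
		 */
		__flush_tlb();		/* the arch's local, non-global flush */
		pool->next = 0;
	}

	vaddr = pool->base + (pool->next++ << PAGE_SHIFT);
	set_task_kmap_pte(vaddr, page);
	return (void *)vaddr;
}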
This, unfortunately, isn't a total solution - we may sometimes need to modify the task's
pagetables from outside the process context, eg. swapout (thanks to dmc for pointing
this out to me ;-)). For this, we'd just use the existing kmap mechanism to create another
mapping to use temporarily, and we're no worse off than before. But on the whole I think
it wins us enough to be worthwhile.
Opinions?
Martin.
[Any chance to make your mailer wrap lines after 76 columns?
That would make reading a lot easier..]
On Wed, Mar 20, 2002 at 11:09:05AM -0800, Martin J. Bligh wrote:
> Imagine we create a hybrid "u-k-space" with the protections of k-space, but the locality
> of u-space .... either by making part of the current k-space per task or by making part of
> the current u-space protected like k-space ... not sure which would be easier.
>
> This u-k-space would be a good area for at least two things (and probably others):
That has been implemented in Caldera OpenUnix over the last few years.
There was a nice overview paper on this by Steve Baumel and Rohit Chawla,
called "Managing More Physical With Less Virtual", which I think appeared
in some Y2000 Byte issue.
Christoph
On Wed, 20 Mar 2002, Martin J. Bligh wrote:
> This, unfortunately, isn't a total solution - we may sometimes need to
> modify the task's pagetables from outside the process context, eg.
> swapout (thanks to dmc for pointing this out to me ;-)). For this, we'd
> just use the existing kmap mechanism to create another mapping to use
> temporarily, and we're no worse off than before. But on the whole I
> think it wins us enough to be worthwhile.
There is absolutely no problem mapping the page tables of
another process into our own kmap space. It's just like
what the kernel does now, except that it'll be scalable
because each process has its own kmap array.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, 20 Mar 2002, Martin J. Bligh wrote:
> This u-k-space would be a good area for at least two things (and probably others):
> 1. A good place to put the process pagetables. ....
> 2. A good place to make a per-task kmap area. ....
I'm unsure about it, but I do think it's an idea worth pursuing (in 2.5).
Rik seems to be suggesting the same in your "Scalability problem" thread.
Hugh
On Wed, Mar 20, 2002 at 11:09:05AM -0800, Martin J. Bligh wrote:
> 1. A good place to put the process pagetables. We only use up the amount of virtual
> address space (vaddr space) for one task's pagetables - if we map them into ZONE_NORMAL
we need to walk pagetables not just from the current task, and mapping
pagetables there would decrease the user address space too much.
> (as current mainline) we use up vaddr space for *all* task's pagetables - if we map them
I think you're missing the problem with mainline. There is no shortage
of virtual address space; there is a shortage of physical RAM in
ZONE_NORMAL. So we cannot keep them in ZONE_NORMAL (and there's no such
thing as "mapping into zone_normal"). Maybe I misunderstood what you were
saying.
> through kmap (atomic or persistent), we pay dearly in tlbflushes.
>
> 2. A good place to make a per-task kmap area. This would be on a pool system similar to
> the current persistent kmap. We would potentially do only a local cpu tlb_flush_all when
that would not be similar. There would be only 1 entry per "series", so
there would be 1 virtual page for the pagecache and 1 virtual page for
the pagetables, only two pages in total per process. It would not be a
real "pool", just two entries, and there would not be a page->virtual
cache, because page->virtual has to be global. Even better, those
persistent kmaps couldn't block, so I wouldn't need to do the
_under_lock thing for pte_alloc.
The only difference between this and my scalable kmap outlined in the
previous emails is that you won't need to pin the task, because the
mapping will migrate along with the userspace (we must avoid enabling
lazy-tlb from the kernel if we need to use kmaps, though). Plus there
won't be the risk of stalling due to running out of entries (so it
couldn't block).
At the top of the email I said "we need to walk pagetables not just from
the current task", and in fact this single virtual entry reserved for
pagetable handling will go and map the pagetables of any task in the
system, as we do right now for /proc etc.
So the idea of those 2 virtual pages per task sounds nice compared to my
scalable per-cpu kmap idea: no scheduler hacks necessary and no risk of
stalling.
The main con (probably the reason I didn't even consider the possibility
of using the user address space) is the significant breakage it will
cause in the arch code (e.g. page faults will have to stop at
PAGE_OFFSET-NR_SERIES*PAGE_SIZE, the stack has to start two pages
earlier, and certainly some more details), but for 2.5 it may be
worthwhile.
Also, to avoid walking the user pagetables at every kmap, we'd need a
pointer to the pte entry in the task structure, and that will also need
to be maintained somehow. It can certainly work and, as said, it has
some advantages compared to the scalable kmap.
The concern about the out-of-context kmaps (from kernel threads etc.) is
nothing to worry about either: they can all be atomic. copy-user is the
only reason we need the persistence, and copy-user needs a user context
to run, so it would be fine for that too. With this design we'd have a
kind of atomic-but-persistent kmap.
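To make the layout concrete, here is a hedged sketch of those two
entries (only NR_SERIES and PAGE_OFFSET come from the discussion above;
the other names are invented for illustration):

/*
 * One fixed virtual page per "series", carved out of the top of the
 * user address space: page faults stop at
 * PAGE_OFFSET - NR_SERIES*PAGE_SIZE and the user stack starts just
 * below that.
 */
#include <asm/page.h>

enum task_kmap_series {
	KM_SERIES_PAGECACHE,	/* target pages for copy-user */
	KM_SERIES_PAGETABLE,	/* pte page of the mm being walked */
	NR_SERIES
};

static inline unsigned long task_kmap_addr(enum task_kmap_series series)
{
	return PAGE_OFFSET - ((NR_SERIES - series) << PAGE_SHIFT);
}

/*
 * Mapping a page into a series simply overwrites whatever was mapped
 * there before: no pool, and it can never block.
 */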
The page->virtual cache is still missing, so it remains inferior to the
current persistent-kmap cache/pool on UP for example, but it is
certainly the best fit for your scalability needs on NUMA-Q.
Andrea
On Wed, Mar 20, 2002 at 04:36:48PM -0300, Rik van Riel wrote:
> On Wed, 20 Mar 2002, Martin J. Bligh wrote:
>
> > This, unfortunately, isn't a total solution - we may sometimes need to
> > modify the task's pagetables from outside the process context, eg.
> > swapout (thanks to dmc for pointing this out to me ;-)). For this, we'd
> > just use the existing kmap mechanism to create another mapping to use
> > temporarily, and we're no worse off than before. But on the whole I
> > think it wins us enough to be worthwhile.
>
> There is absolutely no problem mapping the page tables of
> another process into our own kmap space. It's just like
I thought he was talking about kswapd and friends; they should all keep
using the atomic kmaps for that. No problem there, because we'll never
run copy-user from kswapd: kswapd doesn't have a userspace to copy to :).
Andrea
On Wed, Mar 20, 2002 at 09:23:41PM +0100, Andrea Arcangeli wrote:
> we need to walk pagetables not just from the current task and mapping
> pagetables there would decrease the user address space too much.
Who says it should be taken from user address space?
For example, OpenUnix takes a small (I think 4MB) part of the normal KVA
to be per-process mapped.
> I think you're missing the problem with mainline. There is no shortage
> of virtual address space, there is a shortage of physical ram in the
> zone normal. So we cannot keep them in zone normal (and there's no such
> thing as "mapping in zone_normal"). Maybe I misunderstood what you were
> saying.
The problem is not the 4GB ZONE_NORMAL but the ~1GB KVA space.
UnixWare/OpenUnix had huge problems getting all the kernel structs for
managing 16GB virtual into that - on the other hand their struct page is
more than twice as big as ours.
On Wed, 20 Mar 2002, Christoph Hellwig wrote:
> On Wed, Mar 20, 2002 at 09:23:41PM +0100, Andrea Arcangeli wrote:
> > we need to walk pagetables not just from the current task and mapping
> > pagetables there would decrease the user address space too much.
>
> Who says it should be taken from user address space?
> For example openunix takes a small (I think 4MB) part of the normal KVA
> to be per-process mapped.
Linux would want it to come from the user address space because it has
no precedent for per-process addressing in the kernel address space,
and it's much simpler to keep it that way.
> > I think you're missing the problem with mainline. There is no shortage
> > of virtual address space, there is a shortage of physical ram in the
> > zone normal. So we cannot keep them in zone normal (and there's no such
> > thing as "mapping in zone_normal"). Maybe I misunderstood what you were
> > saying.
>
> The problem is not the 4GB ZONE_NORMAL but the ~1GB KVA space.
> UnixWare/OpenUnix had huge problems getting all kernel structs for managing
> 16GB virtual into that - on the other hand their struct page is more
> than twice as big as ours.
Which is all the more reason to put user-address-space-specific mappings
(pagetable mappings; the file-page mapping case is less obvious) in
the user address space (but of course not accessible to the user) -
probably above the user stack, since that's already of indefinite size.
Hugh
On Wed, Mar 20, 2002 at 08:35:20PM +0000, Christoph Hellwig wrote:
> On Wed, Mar 20, 2002 at 09:23:41PM +0100, Andrea Arcangeli wrote:
> > we need to walk pagetables not just from the current task and mapping
> > pagetables there would decrease the user address space too much.
>
> Who says it should be taken from user address space?
> For example openunix takes a small (I think 4MB) part of the normal KVA
> to be per-process mapped.
The only difference is that taking it from kernel space would require a
whole naturally aligned block of 4k*512 of virtual space with PAE, or
4k*1024 without PAE. So taking it from userspace saves a few megabytes
of kernel virtual address space. Also, the higher-level pagetables there
won't generate extra overhead, because they're needed for the adjacent
user stack anyway.
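Spelling out the arithmetic behind that alignment requirement (i386
numbers; the macro names below are made up for the example):

/*
 * A per-process window carved out of kernel space would have to own a
 * whole entry of the page-table level above it, so the window can be
 * made per-process by repointing that single entry at a per-process
 * pte page.  Carving it out above the user stack instead reuses
 * upper-level entries that are already per-process.
 */
#define PTE_ENTRIES_PAE		512	/* 8-byte ptes in a 4k page */
#define PTE_ENTRIES_NOPAE	1024	/* 4-byte ptes in a 4k page */

#define WINDOW_PAE	(4096UL * PTE_ENTRIES_PAE)	/* 2MB, pmd-entry aligned */
#define WINDOW_NOPAE	(4096UL * PTE_ENTRIES_NOPAE)	/* 4MB, pgd-entry aligned */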
>
> > I think you're missing the problem with mainline. There is no shortage
> > of virtual address space, there is a shortage of physical ram in the
> > zone normal. So we cannot keep them in zone normal (and there's no such
> > thing as "mapping in zone_normal"). Maybe I misunderstood what you were
> > saying.
>
> The problem is not the 4GB ZONE_NORMAL but the ~1GB KVA space.
Then you misunderstood what ZONE_NORMAL is: the zone normal is ~800M
in size, not 4GB. The 1GB of KVA is what constrains the size of the
zone normal to 800M. We're talking about the same thing, just looking at
it from different points of view.
> UnixWare/OpenUnix had huge problems getting all kernel structs for managing
> 16GB virtual into that - on the other hand their struct page is more
> than twice as big as ours.
We do pretty well with pte-highmem; there are some other bits that would
be worth optimizing, but nothing major.
Andrea
On Wed, Mar 20, 2002 at 09:17:31PM +0000, Hugh Dickins wrote:
> probably above user stack, since that's already of indefinite size.
yes, that's the only place where it would be sane to put it.
Andrea
On Wed, Mar 20, 2002 at 10:34:25PM +0100, Andrea Arcangeli wrote:
> > The problem is not the 4GB ZONE_NORMAL but the ~1GB KVA space.
>
> Then you misunderstood what's the zone-normal, the zone normal is 800M
> in size not 4GB.
No, it was a braino while writing.
> The 1GB of KVA is what constrains the size of the zone
> normal to 800M. We're talking about the same thing, just looking at it
> from different point of views.
Okay, agreed now after the 'reminder'.
> > UnixWare/OpenUnix had huge problems getting all kernel structs for managing
> > 16GB virtual into that - on the other hand their struct page is more
> > than twice as big as ours.
>
> We do pretty well with pte-highmem, there is some other bit that will be
> better to optimize, but nothing major.
One major area to optimize is the kernel stacks, I think.
On Wed, Mar 20, 2002 at 09:46:07PM +0000, Christoph Hellwig wrote:
> On Wed, Mar 20, 2002 at 10:34:25PM +0100, Andrea Arcangeli wrote:
> > > The problem is not the 4GB ZONE_NORMAL but the ~1GB KVA space.
> >
> > Then you misunderstood what's the zone-normal, the zone normal is 800M
> > in size not 4GB.
>
> No, it was braino when writing.
never mind.
>
> > The 1GB of KVA is what constrains the size of the zone
> > normal to 800M. We're talking about the same thing, just looking at it
> > from different point of views.
>
> Okay agreed now after the 'reminder'.
:)
>
> > > UnixWare/OpenUnix had huge problems getting all kernel structs for managing
> > > 16GB virtual into that - on the other hand their struct page is more
> > > than twice as big as ours.
> >
> > We do pretty well with pte-highmem, there is some other bit that will be
> > better to optimize, but nothing major.
>
> One major area to optimize is the kernel stacks, I think.
That's another bit, yes, but we'd need 200000 tasks to overflow lowmem
(ignoring the fact that lowmem is also shared with the other lowmem
data structures), and there's the PID limit of 64k tasks. So I don't see
it as a major thing. Anyway, if we really wanted to put the stack [and
the task structure of course] in highmem, we could do that with two
additional entries after the user stack, together with the two entries
for the pagecache and pagetable persistent kmaps. I think we could
officially call that area the "userfixmap" or "per-process fixmap" (no
matter whether it's in user or kernel space). But it is much faster to
keep the kernel stack always in 4M global TLB entries, so I don't think
we need to change that in 2.5. (Also, USB used to do DMA to the kernel
stack; not sure if they changed that recently.)
Andrea
--On Wednesday, March 20, 2002 21:23:41 +0100 Andrea Arcangeli <[email protected]> wrote:
> On Wed, Mar 20, 2002 at 11:09:05AM -0800, Martin J. Bligh wrote:
>> 1. A good place to put the process pagetables. We only use up the amount
>> of virtual address space (vaddr space) for one task's pagetables - if we map
>> them into ZONE_NORMAL
>
> we need to walk pagetables not just from the current task and mapping
> pagetables there would decrease the user address space too much.
How much? By the calculations I've heard, 3MB or 6MB, depending on whether
64GB support is on or not. That doesn't seem like a lot to me. And as I said
in my original email, we could steal this from either user space or kernel space.
> I think you're missing the problem with mainline. There is no shortage
> of virtual address space, there is a shortage of physical ram in the
> zone normal. So we cannot keep them in zone normal (and there's no such
> thing as "mapping in zone_normal"). Maybe I misunderstood what you
> were saying.
The top 128MB of kernel virtual space is the obvious choice if we're taking
it from kernel space; I wouldn't steal it from the ZONE_NORMAL area (though
I don't think shifting the 896MB barrier down by 6MB would kill anyone).
>> through kmap (atomic or persistent), we pay dearly in tlbflushes.
>>
>> 2. A good place to make a per-task kmap area. This would be on a pool system similar to
>> the current persistent kmap. We would potentially do only a local cpu tlb_flush_all when
>
> that would not be similar. There would be only 1 entry per "series", so
> there would be 1 virtual page for the pagecache and 1 virtual page for
> the pagetables, two pages only in total per-process. It would not be a
> real "pool", just two entries and there would not be a page->virtual
> cache because the page->virtual has to be global. Plus even better,
> those persistent kmaps couldn't block, so I wouldn't need to do the
> _under_lock thing for pte_alloc.
Not sure I grok the above - you mean like the atomic_kmap stuff? The problem
with that is that you have to do a tlb_flush_one per access - if we have a
pool we can do a *local* tlb_flush_all per N accesses (where N is the size
of the pool). And as we do a local tlb_flush_all per context switch anyway,
we can probably avoid doing *any* tlbflush in all but the heaviest pool usage
if we're clever about accounting.
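One way that accounting could work, as a sketch only (the epoch counter
and every name below are invented):

/*
 * Each CPU keeps a counter that gets bumped whenever its TLB is fully
 * flushed locally (e.g. by the cr3 reload at context switch -
 * assumption).  A pool slot records the counter value when it was last
 * mapped; it can be reused without an explicit flush only if the
 * counter has moved on since then, i.e. a full local flush already
 * happened in between.
 */
#include <linux/threads.h>
#include <linux/smp.h>

extern unsigned long tlb_flush_epoch[NR_CPUS];	/* hypothetical */

struct task_kmap_slot {
	unsigned long vaddr;
	unsigned long epoch;	/* tlb_flush_epoch value at last use */
};

static int slot_reusable(struct task_kmap_slot *slot)
{
	/* Safe to reuse only if a local flush happened after the last use. */
	return slot->epoch != tlb_flush_epoch[smp_processor_id()];
}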
> The only difference between this and my scalable kmap outlined in the
> previous emails, is that you won't need to pin the task because the
> mapping will be migrated with the userspace (we must avoid to enable
> lazy-tlb from kernel if we need to use kmaps though). Plus there won't
> be the risk of stalling due running out of entries (so it couldn't
> block).
That seems like a good difference to me. Rik pointed this problem out to
me a couple of months ago, which is why I threw that concept away.
Martin.
> That has been implemented in Caldera OpenUnix in the last years.
V7 unix had it. That's where the "uarea", aka u., comes in. It's one of
the killer problems with Linux-8086 - on the PDP-11 they could put the
kernel stack, file handles and other process-local crap into a swappable
segment that could also be swapped out from the kernel address space. On
the 8086 that's trickier.
On Thursday 21 March 2002 01:39 am, Alan Cox wrote:
> > That has been implemented in Caldera OpenUnix in the last years.
>
> V7 unix had it. That's where the "uarea", aka u., comes in. It's one of
> the killer problems with Linux-8086 - on the PDP-11 they could put the
> kernel stack, file handles and other process-local crap into a swappable
> segment that could also be swapped out from the kernel address space. On
> the 8086 that's trickier.
Some 20 years ago I knew almost everything about BSD-4.x on a VAX.
The user area was just above the user stack. Actually it was part of the
user space, accessible RO from user mode and RW for the kernel.
It was always mapped at a fixed address that was just below the 2G
marker.
The process table contained whatever was needed to swap-in
and access the user area, some scheduling parameters and
signal mask/pending bits (I'm sure I missed something).
This arrangement might save some physical memory because
this area was swapped with the process (actually that was the last
thing to swap out/first to swap in because the page table for the
rest of the process was in there).
I think this arrangement made things like shared memory (and libs),
ptrace and other IPC more complicated. Then there's the memory management
stuff... Remember that the base system architecture only knew how
to swap whole processes in and out. Paging was implemented
on top of it with a global clock (LRU-like) algorithm.
Truly, I thought that putting everything in "current" in Linux was
more of a design decision than something derived from
the '86 architecture.
-- Itai
> Truly, I thought that putting everything in "current" in Linux was
> more of a design decision than something derived from
> the '86 architecture.
It is - I said Linux 8086, not Linux 80386.
Alan
--On Wednesday, March 20, 2002 7:45 PM +0000 Christoph Hellwig <[email protected]> wrote:
> [Any chance to make your mailer wrap lines after 76 columns?
> That would make reading a lot easier..]
>
> On Wed, Mar 20, 2002 at 11:09:05AM -0800, Martin J. Bligh wrote:
>> Imagine we create a hybrid "u-k-space" with the protections of k-space, but the locality
>> of u-space .... either by making part of the current k-space per task or by making part of
>> the current u-space protected like k-space ... not sure which would be easier.
>>
>> This u-k-space would be a good area for at least two things (and probably others):
>
> That has been implemented in Caldera OpenUnix in the last years.
> There was a nice overview paper by Steve Baumel and Rohit Chawla on this,
> called "Managing More Physical With Less Virtual" which I think appeared
> in some Y2000 Byte issue.
The only reference that I could find was this:
http://www.informatik.uni-trier.de/~ley/db/journals/spe/spe30.html
and I can't find the actual paper online anywhere ... is it available?
Thanks,
Martin.
On Thursday 21 March 2002 16:34, Martin J. Bligh wrote:
> The only reference that I could find was this:
>
> http://www.informatik.uni-trier.de/~ley/db/journals/spe/spe30.html
>
> and I can't find the actual paper online anywhere ... is it available?
http://www3.interscience.wiley.com/cgi-bin/issuetoc?ID=71007581
The abstract is free but the full text requires registration (no, I don't have access).
-- Itai
On Wed, Mar 20, 2002 at 11:00:02PM +0100, Andrea Arcangeli wrote:
> But it is much faster to keep the kernel stack always in 4M global
> tlbs, thus I don't think we need to change that in 2.5. (also USB was
> used to do dma in the kernel stack, not sure if they changed it
> recently)
Hopefully all instances of the USB code doing that have now been fixed.
If anyone sees any USB code that uses the kernel stack for USB
transfers, please let me know so it can be fixed.
We (the USB group) have always known that this is a bug in our code, so
don't feel you can't change things just because the USB code might be
broken by it :)
thanks,
greg k-h
On Thu, Mar 21, 2002 at 11:54:51AM -0800, Greg KH wrote:
> On Wed, Mar 20, 2002 at 11:00:02PM +0100, Andrea Arcangeli wrote:
> > But it is much faster to keep the kernel stack always in 4M global
> > tlbs, thus I don't think we need to change that in 2.5. (also USB was
> > used to do dma in the kernel stack, not sure if they changed it
> > recently)
>
> Hopefully all instances of the USB code doing that have now been fixed.
> If anyone sees any USB code that uses the kernel stack for USB
> transfers, please let me know so it can be fixed.
>
> We (the USB group) have always known that this is a bug in our code, so
> don't feel you can't change things just because the USB code might be
> broken by it :)
Glad to hear, thanks!
Andrea
On Wed, Mar 20, 2002 at 11:00:02PM +0100, Andrea Arcangeli wrote:
> That's another bit yes, but we'll need 200000 tasks to overflow the
> lowmem (ignoring the fact the lowmem is shared also for the other lowmem
> data structures) and there's the PID limit of 64k tasks. So I don't see
> it as a major thing. Anyways if we really wanted to put the stack [and
> task structure of course] in highmem, we could do that in two additional
> entries after the user stack together with the two entries for the
> pagecache and pagetable persistent kmaps. I think we can officially call
> that area the "userfixmap" or "per-process-fixmap" (no matter if it's in
> user or kernel space). But it is much faster to keep the kernel stack
> always in 4M global tlbs, thus I don't think we need to change that in
> 2.5. (also USB was used to do dma in the kernel stack, not sure if they
> changed it recently)
Another (perhaps obvious) pitfall is stack-allocated storage used for
components of globally-mapped structures. The premier example of this
is probably waitqueues. To keep them working, dynamic allocation of
globally-mapped storage or per-task static allocation thereof is
required as a substitute.
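For reference, the pattern in question looks roughly like this (a
minimal, hypothetical example; wq and done are made up, the waitqueue
calls themselves are the standard ones):

/*
 * The wait_queue_t lives on the sleeper's kernel stack, yet wake_up()
 * walks the list and dereferences it from the waker's context.  If
 * kernel stacks were no longer globally mapped, that dereference would
 * hit an address only the sleeper can see, so the storage would have
 * to move to globally-mapped (or per-task static) memory instead.
 */
#include <linux/sched.h>
#include <linux/wait.h>

void wait_for_thing(wait_queue_head_t *wq, volatile int *done)
{
	DECLARE_WAITQUEUE(wait, current);	/* on this task's stack */

	add_wait_queue(wq, &wait);		/* now on a globally visible list */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (*done)
			break;
		schedule();
	}
	set_current_state(TASK_RUNNING);
	remove_wait_queue(wq, &wait);
}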
Cheers,
Bill
On Thu, Mar 21, 2002 at 03:49:43PM -0800, William Lee Irwin III wrote:
> On Wed, Mar 20, 2002 at 11:00:02PM +0100, Andrea Arcangeli wrote:
> > That's another bit yes, but we'll need 200000 tasks to overflow the
> > lowmem (ignoring the fact the lowmem is shared also for the other lowmem
> > data structures) and there's the PID limit of 64k tasks. So I don't see
> > it as a major thing. Anyways if we really wanted to put the stack [and
> > task structure of course] in highmem, we could do that in two additional
> > entries after the user stack together with the two entries for the
> > pagecache and pagetable persistent kmaps. I think we can officially call
> > that area the "userfixmap" or "per-process-fixmap" (no matter if it's in
> > user or kernel space). But it is much faster to keep the kernel stack
> > always in 4M global tlbs, thus I don't think we need to change that in
> > 2.5. (also USB was used to do dma in the kernel stack, not sure if they
> > changed it recently)
>
> Another (perhaps obvious) pitfall is stack-allocated storage used for
> components of globally-mapped structures. The premier example of this
> is probably waitqueues. To keep them working, dynamic allocation of
> globally-mapped storage or per-task static allocation thereof is
> required as a substitute.
Agreed, good point (in theory the task struct could go there, I'm not
suggesting that of course, but the stack has to remain visible to the
whole MM for exactly this reason).
Andrea