After weeks of remote debugging I found one problem triggered by people
with real highmem64 hardware who really need that much memory mapped
from lots of tasks at the same time.
They were running out of pagetables: mapping 1G per task (shm for
example) will overflow the lowmem zone with PAE with some hundred tasks
in the system. They were pointing the finger at the VM, but the VM was
just doing the right thing.
This patch in short will move pagetables into highmem, obviously it
breaks all the archs out there. This should fix the problem completely,
allowing Linux to effectively support all the 64G possibly available in
the ia32 boxes (currently, without this patch, you risk being able to
use only a few gigabytes).
-aa kernels will obviously support this feature from the very next
release as well. This patch is against 2.4.18pre2aa2; I recommend
testing it on top of 2.4.18pre2aa2 and telling me whether it works well
or not. If you don't have highmem you can still test it by configuring
the kernel with highmem emulation enabled.
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.18pre2aa2/pte-highmem-5
This will of course have to be forward-ported to 2.5 as well. I
developed it on top of 2.4 because I must support this in 2.4-aa too,
since HIGHMEM64G is basically useless for the people who need it most
without this additional change.
I'm too lazy to describe how it works in words; you have the source, it
seems not too dirty (it may still need some cleanup, suggestions are
welcome of course), and if you have questions I'll be glad to answer as
usual. The patch is also attached because it's not too big and I know
some people are annoyed by ftp downloads (me too sometimes :). thanks,
Andrea
On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
>
> This patch in short will move pagetables into highmem, obviously it
> breaks all the archs out there.
Hmm.. Looks ok, although I miss the "obviously". Archs have their own page
table allocator functions, so by allocating lowmem (and most non-x86 won't
care) the change _should_ have zero impact on them simply because they
don't need to unmap. No?
Linus
Btw, some suggestions:
- why have those ugly special cases for bootup defines? kmap() is a no-op
on 1:1 pages, so all those "arch/i386/mm/init.c" games are totally
unnecessary if you just have the pages in low memory. That gets rid of
much of it.
- do the highmem pte bouncing only for CONFIG_X86_PAE: with less than 4GB
of RAM this doesn't seem to matter all that much (rule of thumb: we
need about 1% of memory for page tables). Again, this will simplify
things, as it will make other architectures not have to worry about the
change.
(And it's really CONFIG_X86_PAE that makes page tables twice as big, so
PAE increases the lowmem pressure for two independent reasons: twice as
much memory in page tables _and_ larger amounts of memory to map).
- please don't make that "pte_offset_atomic_irq()" have special support in
the header files: it is _not_ a generic operation, and it is only used
by the x86 page fault logic. For that reason, I would suggest moving
all that logic out of the header files, and into i386/mm/fault.c,
something along the lines of
pte = pte_offset_nokmap(..)
addr = kmap_atomic(pte, KM_VMFAULT);
instead of having special magic logic in the header files.
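As a minimal sketch of how that suggestion could look in i386/mm/fault.c
(pte_offset_nokmap() and KM_VMFAULT are only the names proposed in this
mail, not existing interfaces; kmap_atomic()/kunmap_atomic() and
__pte_offset() are the existing 2.4 ones):

	/*
	 * Hypothetical sketch only: the generic lookup returns the
	 * struct page of the pte table, and fault.c maps it by hand
	 * with a per-cpu atomic kmap.
	 */
	struct page *pte_page;
	pte_t *pte_base, *pte;

	pte_page = pte_offset_nokmap(pmd, address);
	pte_base = (pte_t *) kmap_atomic(pte_page, KM_VMFAULT);
	pte = pte_base + __pte_offset(address);	/* entry for this address */

	/* ... examine and update *pte ... */

	kunmap_atomic(pte_base, KM_VMFAULT);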
Other than that it looks fairly straightforward, I think.
Linus
On Wed, Jan 16, 2002 at 10:04:37AM -0800, Linus Torvalds wrote:
>
> On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
> >
> > This patch in short will move pagetables into highmem, obviously it
> > breaks all the archs out there.
>
> Hmm.. Looks ok, although I miss the "obviously". Archs have their own page
> table allocator functions, so by allocating lowmem (and most non-x86 won't
> care) the change _should_ have zero impact on them simply because they
> don't need to unmap. No?
the problem is, as usual, that we need to work with pages or pfns
somewhere (see pte_alloc); we cannot work with virtual addresses or
we'll overflow... the pte_t * virtual address is always the result of
the kmap, so the changes required are as minimal as possible in most
places, and I made pte_kunmap smart enough to be able to unmap any kind
of kmap so the common code changes are as simple as possible, but
pmd_populate still needs to know the whole thing (the struct page *
actually), and probably some other bits of the same kind break with the
pmd too.
Andrea
On Wed, Jan 16, 2002 at 10:19:56AM -0800, Linus Torvalds wrote:
>
> Btw, some suggestions:
>
> - why have those ugly special cases for bootup defines? kmap() is a no-op
> on 1:1 pages, so all those "arch/i386/mm/init.c" games are totally
> unnecessary if you just have the pages in low memory. That gets rid of
> much of it.
agreed.
> - do the highmem pte bouncing only for CONFIG_X86_PAE: with less than 4GB
> of RAM this doesn't seem to matter all that much (rule of thumb: we
> need about 1% of memory for page tables). Again, this will simplify
> things, as it will make other architectures not have to worry about the
> change.
I'm not sure how much simpler it will make things, and I'd prefer to
keep the pte in highmem also with 4G to be more generic; think of 1000
threads mapping the same 1G of shm each. OK, that may sound a bit
insane, but it's not that far from some of the setups that triggered
the pte problem.
>
> (And it's really CONFIG_X86_PAE that makes page tables twice as big, so
> PAE increases the lowmem pressure for two independent reasons: twice as
> much memory in page tables _and_ larger amounts of memory to map).
yes.
> - please don't make that "pte_offset_atomic_irq()" have special support in
> the header files: it is _not_ a generic operation, and it is only used
> by the x86 page fault logic. For that reason, I would suggest moving
> all that logic out of the header files, and into i386/mm/fault.c,
> something along the lines of
>
> pte = pte_offset_nokmap(..)
> addr = kmap_atomic(pte, KM_VMFAULT);
>
> instead of having special magic logic in the header files.
the pte_offset_nokmap will actually have to return a page; I was hiding
the "page" stuff in the header, that's it, but I'm fine with adding a
nokmap variant that returns a page (not a pte) and calling kmap_atomic
by hand to get the pte_t *. Also there are non-obvious implications if
the pte table is not as large as a page, but I'm not aware of any arch
where a pte table is not as large as a page... in the common code only
pte_alloc plays with pages as ptes so far.
> Other than that it looks fairly straightforward, I think.
I'm glad to hear it :). btw, I don't expect a big difference for 2.5,
that area should be pretty much the same. I was a bit sick in the last
few days; I'll try to make the above cleanups soon and then port to
2.5. thanks,
Forgot to mention, another bit to clean up is the pte_quicklist, which
has to be converted to a list_head; doing a kmap_atomic to queue pages
into the quicklist is very stupid at the moment. I just wanted to make
it work first and fix this bit later; I'll clean it up in one of the
next versions (but in the meantime it is usable as is).
Andrea
On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
>
> > - do the highmem pte bouncing only for CONFIG_X86_PAE: with less than 4GB
> > of RAM this doesn't seem to matter all that much (rule of thumb: we
> > need about 1% of memory for page tables). Again, this will simplify
> > things, as it will make other architectures not have to worry about the
> > change.
>
> I'm not sure how much simpler it will make things, and I'd prefer to
> keep the pte in highmem also with 4G to be more generic
I agree that this will not make things simpler, it will just streamline
and speed up some of the VM paths for people who just don't need it.
A lot of HIGHMEM users have just 1GB and use HIGHMEM to get at the last
few megs, so..
It might make sense to make it an independent config option, and possibly
just split up the highmem question a bit more (ie make the choices be
something like "none", "4G", "4G+pte-kmap" or "64G(with pte-kmap)").
Linus
On Wed, Jan 16, 2002 at 11:11:48AM -0800, Linus Torvalds wrote:
>
> On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
> >
> > > - do the highmem pte bouncing only for CONFIG_X86_PAE: with less than 4GB
> > > of RAM this doesn't seem to matter all that much (rule of thumb: we
> > > need about 1% of memory for page tables). Again, this will simplify
> > > things, as it will make other architectures not have to worry about the
> > > change.
> >
> > I'm not sure how much simpler it will make things, and I'd prefer to
> > keep the pte in highmem also with 4G to be more generic
>
> I agree that this will not make things simpler, it will just streamline
> and speed up some of the VM paths for people who just don't need it.
speed-wise for such code paths, indeed, but see below.
> A lot of HIGHMEM users have just 1GB and use HIGHMEM to get at the last
> few megs, so..
I don't see a big difference with the pages: if pagetables won't go into
highmem, it will be pages that go there. In fact it may be much better
with the pagetables than with the pages. I mean, it may be much better
to have the few megs filled with pagetables rather than with pages that
we may have to read(2)/write(2) all the time. We may end up generating a
bigger kmap overhead if we put pages in there (of course it also depends
on the workload). It's not really obvious to me that 4G is an
improvement compared to 4G+pte-kmap on a 1G box. So I'm still not
convinced that the differentiation of configuration options below is
worthwhile. Comments?
> It might make sense to make it an independent config option, and possibly
> just split up the highmem question a bit more (ie make the choices be
> something like "none", "4G", "4G+pte-kmap" or "64G(with pte-kmap)").
Andrea
On Wed, Jan 16, 2002 at 10:19:56AM -0800, Linus Torvalds wrote:
> - please don't make that "pte_offset_atomic_irq()" have special support in
> the header files: it is _not_ a generic operation, and it is only used
> by the x86 page fault logic. For that reason, I would suggest moving
> all that logic out of the header files, and into i386/mm/fault.c,
> something along the lines of
>
> pte = pte_offset_nokmap(..)
> addr = kmap_atomic(pte, KM_VMFAULT);
>
> instead of having special magic logic in the header files.
Ah, here's where I come in and say that kmap_atomic stinks and needs to be
replaced. ;-) If you take a look at code in various places making use of
atomic kmaps, some of the more interesting cases (like bio) have to disable
irqs during memory copies in order to avoid races on reuse of an atomic
kmap. I think that's a sign of an interface that needs redesign. My
proposal: make kmap_atomic more like kmap in that it allocates from a pool,
but make the pool per cpu with ~4 entries reserved per context. The only
concern I have is that we might not be restricting the depth of irq entry
currently, but I'm not familiar with that code. Time to code up a patch...
> Other than that it looks fairly straightforward, I think.
Agreed.
-ben
On Wed, 16 Jan 2002, Linus Torvalds wrote:
> - do the highmem pte bouncing only for CONFIG_X86_PAE: with less than 4GB
> of RAM this doesn't seem to matter all that much (rule of thumb: we
> need about 1% of memory for page tables).
Actually, with 100 processes mapping a 1 GB area of shared
memory, you'll need about 100 MB of page tables or 10% of
the SHM segment. This rises even further with PAE and/or
more processes.
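As a back-of-the-envelope check of those figures (my arithmetic, not
part of Rik's mail), assuming 4 KB pages and 4-byte ptes, or 8-byte
ptes with PAE:

	#include <stdio.h>

	int main(void)
	{
		unsigned long long nproc = 100;
		unsigned long long shm   = 1ULL << 30;	/* 1 GB mapped per process */
		unsigned long long ptes  = shm / 4096;	/* 256K ptes per process */

		printf("non-PAE: %llu MB\n", nproc * ptes * 4 >> 20);	/* ~100 MB */
		printf("PAE:     %llu MB\n", nproc * ptes * 8 >> 20);	/* ~200 MB */
		return 0;
	}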
The page table cost for heavily shared memory segments is
really getting out of hand, IMHO ... it might be worth it
to set aside a separate memory zone and use 4 MB pages for
the SHM segment.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, Jan 16, 2002 at 02:30:39PM -0500, Benjamin LaHaise wrote:
> On Wed, Jan 16, 2002 at 10:19:56AM -0800, Linus Torvalds wrote:
> > - please don't make that "pte_offset_atomic_irq()" have special support in
> > the header files: it is _not_ a generic operation, and it is only used
> > by the x86 page fault logic. For that reason, I would suggest moving
> > all that logic out of the header files, and into i386/mm/fault.c,
> > something along the lines of
> >
> > pte = pte_offset_nokmap(..)
> > addr = kmap_atomic(pte, KM_VMFAULT);
> >
> > instead of having special magic logic in the header files.
>
> Ah, here's where I come in and say that kmap_atomic stinks and needs to be
> replaced. ;-) If you take a look at code in various places making use of
> atomic kmaps, some of the more interesting cases (like bio) have to disable
> irqs during memory copies in order to avoid races on reuse of an atomic
> kmap. I think that's a sign of an interface that needs redesign. My
> proposal: make kmap_atomic more like kmap in that it allocates from a pool,
> but make the pool per cpu with ~4 entries reserved per context. The only
> concern I have is that we might not be restricting the depth of irq entry
> currently, but I'm not familiar with that code. Time to code up a patch...
you said it, there's no limit on the depth of irq entry (the only
restriction right now is the stack, and if you overflow it you notice
the hard way :). I think the current way of doing the atomic kmaps is OK
even if I see the irq latency issue with the cli; we could probably make
one irq entry non-cli driven by protecting it with a per-cpu counter
etc... (so only the nested irq will have to take the cli), that could be
an improvement for irq latency, feel free to implement it. The pool
approach with the pool as large as the max number of reentrant irqs
doesn't sound like a good approach to me, though. If you really want to
make something like a pool, then I'd just extend the same logic so that
even the first nested irq doesn't need to take the cli, and only the
second nested one does. There could be a #define that tells which is the
last level of nested irq that needs to take the cli instead of having a
free kmap.
btw, in the pte_offset_atomic_irq case I was also too lazy to cli only
if it was a highmem page, because that's an extremely slow path anyway,
so a cli there doesn't really matter; at least with the bio stuff I made
sure we take the cli only with highmem pages, by hiding it inside.
>
> > Other than that it looks fairly straightforward, I think.
>
> Agreed.
Good to hear! thanks,
Andrea
Hi Andrea,
On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
> They were running out of pagetables: mapping 1G per task (shm for
> example) will overflow the lowmem zone with PAE with some hundred
> tasks in the system. They were pointing the finger at the VM, but the
> VM was just doing the right thing.
This makes me think about putting the swap vector of shmem into highmem
also. These can get big on high-end servers, and exactly these servers
tend to use a lot of shared mem.
What do you think?
Christoph
On Thu, 17 Jan 2002, Christoph Rohland wrote:
> On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
> > They were running out of pagetables: mapping 1G per task (shm for
> > example) will overflow the lowmem zone with PAE with some hundred
> > tasks in the system. They were pointing the finger at the VM, but the
> > VM was just doing the right thing.
>
> This makes me think about putting the swap vector of shmem into highmem
> also. These can get big on high-end servers, and exactly these servers
> tend to use a lot of shared mem.
Yes, good idea.
But people should realize that moving page tables and other such
structures into not-always-mapped highmem will make them harder to
access with kernel debuggers - I think those will want some changes.
Hugh
On Thu, Jan 17, 2002 at 09:31:12AM +0100, Christoph Rohland wrote:
> Hi Andrea,
>
> On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
> > They were running out of pagetables: mapping 1G per task (shm for
> > example) will overflow the lowmem zone with PAE with some hundred
> > tasks in the system. They were pointing the finger at the VM, but the
> > VM was just doing the right thing.
>
> This makes me think about putting the swap vector of shmem into highmem
> also. These can get big on high-end servers, and exactly these servers
> tend to use a lot of shared mem.
>
> What do you think?
How much allocated shm do you need to fill 800 MB of normal zone with
the shm swap vector?
(btw, I suspect allocating one page at offset 4G in every shmfs file
could make the overhead per page of shm increase)
But in real life I really don't expect problems here: one leaf page of
the vector holds 1024 swap entries, so the overhead is on the order of
1/1000. Off the top of my head (without any precise calculation), 64G of
shm would generate something on the order of some hundred mbytes of ram
(please correct me if I'm wrong of course). The malicious case of all
files with one page mapped at the very end doesn't sound like a
realistic scenario either. Comments?
Andrea
On Thu, Jan 17, 2002 at 12:14:13PM +0000, Hugh Dickins wrote:
> On Thu, 17 Jan 2002, Christoph Rohland wrote:
> > On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
> > > They were running out of pagetables: mapping 1G per task (shm for
> > > example) will overflow the lowmem zone with PAE with some hundred
> > > tasks in the system. They were pointing the finger at the VM, but the
> > > VM was just doing the right thing.
> >
> > This makes me think about putting the swap vector of shmem into highmem
> > also. These can get big on high-end servers, and exactly these servers
> > tend to use a lot of shared mem.
>
> Yes, good idea.
>
> But people should realize that moving page tables and other such
> structures into not-always-mapped highmem will make them harder to
> access with kernel debuggers - I think those will want some changes.
the debugging perspective is probably the one I care about least (you
can always drop the __GFP_HIGHMEM from the right alloc_pages to get the
memory direct mapped); that's most of the time a one-liner in just one
place. Here I'd care more about whether it's really needed in real life
or not; if kmaps were zero cost it would probably be OK anyway (modulo
the increase in code complexity), but kmaps, while lightweight, are not
zero cost :).
Andrea
On Thu, 17 Jan 2002, Andrea Arcangeli wrote:
> On Thu, Jan 17, 2002 at 12:14:13PM +0000, Hugh Dickins wrote:
> >
> > But people should realize that moving page tables and other such
> > structures into not-always-mapped highmem will make them harder to
> > access with kernel debuggers - I think those will want some changes.
>
> the debugging perspective is probably the one I care about least (you
> can always drop the __GFP_HIGHMEM from the right alloc_pages to get the
> memory direct mapped); that's most of the time a one-liner in just one place.
I was thinking there of peering at a crashed kernel with some debugger,
trying to see the page tables (or whatever). In most cases, if they're
relevant to the problem, they will already be kmapped (or perhaps the
problem would just be that they're not). The maintainers of debuggers
will probably need to add something to help find the right mapping.
But you're right, Linux is not primarily designed as a platform for
kernel debuggers, and that should not stop progress: sooner or later
we have to extend use of highmem, you've found good reason to extend
it to page tables (and Christoph's shmem index is a very similar case).
Hugh
Hi Andrea,
On Thu, 17 Jan 2002, Andrea Arcangeli wrote:
> (btw, I suspect allocating one page at offset 4G in every shmfs file
> could make the overhead per page of shm increase)
Nearly: a sparse file with its only page at 4G is the worst case: you
need three extra pages to hold the swap entry. The ratio goes down as
soon as you add more pages somewhere.
> > But in real life I really don't expect problems here: one leaf page
> > of the vector holds 1024 swap entries, so the overhead is on the
> > order of 1/1000. Off the top of my head (without any precise
> > calculation), 64G of shm would generate something on the order of
> > some hundred mbytes of ram
OK, 64GB of shm allocates roughly 64MB of swap entries, so this case
should not bother us too much. I was still thinking of the 390x case
where we have 512 entries per page. But they do not need highmem.
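(Checking the arithmetic under the usual i386 assumptions of 4 KB pages
and 4-byte swap entries, as an illustration rather than something from
the mail: 64 GB / 4 KB = 16M pages, one entry each, so 16M x 4 bytes =
64 MB of index pages, i.e. roughly 1/1000 of the data, matching the
estimate above.)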
Another case is smaller machines with big tmpfs instances: they get
killed by the swap entries. But you cannot prevent that without
swapping the swap entries themselves.
Greetings
Christoph
On Thu, Jan 17, 2002 at 05:11:48PM +0100, Christoph Rohland wrote:
> Hi Andrea,
>
> On Thu, 17 Jan 2002, Andrea Arcangeli wrote:
> > (btw, I suspect allocating one page at offset 4G in every shmfs file
> > could make the overhead per page of shm increase)
>
> Nearly: a sparse file with its only page at 4G is the worst case: you
> need three extra pages to hold the swap entry. The ratio goes down as
> soon as you add more pages somewhere.
Agreed.
>
> > But in real life I really don't expect problems here: one leaf page
> > of the vector holds 1024 swap entries, so the overhead is on the
> > order of 1/1000. Off the top of my head (without any precise
> > calculation), 64G of shm would generate something on the order of
> > some hundred mbytes of ram
>
> OK, 64GB of shm allocates roughly 64MB of swap entries, so this case
> should not bother us too much. I was still thinking of the 390x case
> where we have 512 entries per page. But they do not need highmem.
agreed.
> Another case is smaller machines with big tmpfs instances: they get
> killed by the swap entries. But you cannot prevent that without
> swapping the swap entries themselves.
yes, the same can happen with pagetables if you have a huge amount of
swap and only a few megabytes of ram; that's another kind of problem
that we have always ignored so far (mostly because it is possible to
solve it very efficiently by throwing money at some more ram :).
Andrea
On Thu, 17 Jan 2002, Christoph Rohland wrote:
> Another case is smaller machines with big tmpfs instances: they get
> killed by the swap entries. But you cannot prevent that without
> swapping the swap entries themselves.
I wonder how hard this would be to do ... ;)
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
>
> This patch in short will move pagetables into highmem, obviously it
> breaks all the archs out there. This should fix the problem completely,
> allowing Linux to effectively support all the 64G possibly available in
> the ia32 boxes (currently, without this patch, you risk being able to
> use only a few gigabytes).
Several random points on the patch, I've not studied it as long as
I'd like: so may well be making a fool of myself on some of these,
and you may quickly say so!
1. Yes, this has to come sooner or later, but it is a significant step,
as I've said in other mail - may unmap some useful debugging info.
2. More complicated than I'd like: too many pte_offset variants!
I'd prefer it without the different SERIEs, I don't understand why
those. I assume it's to prevent kmaps of data flushing away "more
valuable" kmaps of page tables, but wouldn't it be better to keep
just one "serie" of kmap for now, add cleverer decision on what
and when to throw away later on, localized within mm/highmem.c?
3. KM_SERIE_PAGETABLE2 kmap_pagetable2 pmd_page2 pte_offset2 all just
for copy_page_range src? Why distinguished from KM_SERIE_PAGETABLE?
Often it will already be KM_SERIE_PAGETABLE. I can imagine you might
want an atomic at that point (holding spinlock), but I don't see what
the PAGETABLE2 distinction gives you.
4. You've lifted the PAE restriction to LAST_PKMAP 512 in i386/highmem.h,
and use pkmap_page_table as one long array in mm/highmem.c, but I
don't see where you enforce the contiguity of page table pages in
i386/mm/init.c. (I do already have a patch for lifting the 1024,512
kmaps limit, simplifying i386/mm/init.c, we've been using for months:
I can update that from 2.4.9 if you'd like it.)
5. Shouldn't mm/vmscan.c be in the patch?
Hugh
On Thu, Jan 17, 2002 at 05:57:15PM +0000, Hugh Dickins wrote:
> On Wed, 16 Jan 2002, Andrea Arcangeli wrote:
> >
> > This patch in short will move pagetables into highmem, obviously it
> > breaks all the archs out there. This should fix the problem completely,
> > allowing Linux to effectively support all the 64G possibly available in
> > the ia32 boxes (currently, without this patch, you risk being able to
> > use only a few gigabytes).
>
> Several random points on the patch, I've not studied it as long as
> I'd like: so may well be making a fool of myself on some of these,
> and you may quickly say so!
>
> 1. Yes, this has to come sooner or later, but it is a significant step,
> as I've said in other mail - may unmap some useful debugging info.
>
> 2. More complicated than I'd like: too many pte_offset variants!
> I'd prefer it without the different SERIEs, I don't understand why
> those. I assume it's to prevent kmaps of data flushing away "more
> valuable" kmaps of page tables, but wouldn't it be better to keep
> just one "serie" of kmap for now, add cleverer decision on what
> and when to throw away later on, localized within mm/highmem.c?
we need more than one series, there's no way around it, or it can
deadlock: the default series is for pages, the other series are for
pagetables.
>
> 3. KM_SERIE_PAGETABLE2 kmap_pagetable2 pmd_page2 pte_offset2 all just
> for copy_page_range src? Why distinguished from KM_SERIE_PAGETABLE?
> Often it will already be KM_SERIE_PAGETABLE. I can imagine you might
> want an atomic at that point (holding spinlock), but I don't see what
> the PAGETABLE2 distinction gives you.
deadlock avoidance. It's the same reason as with the mempool: you need
two mempools if you need two objects from the same task or it can
deadlock (yes, it would be hard to deadlock it, but it is possible).
in fork the pte_offset kmap could be an atomic one, but atomic kmaps are
more costly with the invlpg I believe; to do it in a row, the 2 variant
with a different series should be faster for fork(2).
> 4. You've lifted the PAE restriction to LAST_PKMAP 512 in i386/highmem.h,
> and use pkmap_page_table as one long array in mm/highmem.c, but I
> don't see where you enforce the contiguity of page table pages in
> i386/mm/init.c. (I do already have a patch for lifting the 1024,512
> kmaps limit, simplifying i386/mm/init.c, we've been using for months:
> I can update that from 2.4.9 if you'd like it.)
correct, currently it works because the bootmem tends to return
physically contiguous pages, but it is not enforced and it may trigger
with a weird e820 layout. If you have a patch, feel free to post it!! :)
thanks for the review.
>
> 5. Shouldn't mm/vmscan.c be in the patch?
can you elaborate?
Andrea
On Thu, 17 Jan 2002, Andrea Arcangeli wrote:
> On Thu, Jan 17, 2002 at 05:57:15PM +0000, Hugh Dickins wrote:
>
> we need more than one series, there's no way around it, or it can
> deadlock: the default series is for pages, the other series are for
> pagetables.
>
> > 3. KM_SERIE_PAGETABLE2 kmap_pagetable2 pmd_page2 pte_offset2 all just
> > for copy_page_range src? Why distinguished from KM_SERIE_PAGETABLE?
> > Often it will already be KM_SERIE_PAGETABLE. I can imagine you might
> > want an atomic at that point (holding spinlock), but I don't see what
> > the PAGETABLE2 distinction gives you.
>
> deadlock avoidance. It's the same reason as with the mempool: you need
> two mempools if you need two objects from the same task or it can
> deadlock (yes, it would be hard to deadlock it, but it is possible).
Sorry, I'm still struggling: I can imagine deadlocks from having too
few kmaps altogether, and I can see how separating data from pagetables
protects pagetable code from bugs (missing kunmaps) in other code, but
I don't see how separating the three series makes a crucial difference.
> in fork the pte_offset kmap could be an atomic one, but atomic kmaps are
> more costly with the invlpg I believe; to do it in a row, the 2 variant
> with a different series should be faster for fork(2).
Yes, maybe, maybe not.
> > 4. You've lifted the PAE restriction to LAST_PKMAP 512 in i386/highmem.h,
> > and use pkmap_page_table as one long array in mm/highmem.c, but I
> > don't see where you enforce the contiguity of page table pages in
> > i386/mm/init.c. (I do already have a patch for lifting the 1024,512
> > kmaps limit, simplifying i386/mm/init.c, we've been using for months:
> > I can update that from 2.4.9 if you'd like it.)
>
> correct, currently it works because the bootmem tends to return
> physically contiguous pages, but it is not enforced and it may trigger
> with a weird e820 layout. If you have a patch, feel free to post it!! :)
> thanks for the review.
Okay, I've attached the existing patch against 2.4.9 below. It originated
from my Large PAGE_SIZE patch, where it became clear that the fixmaps are
appropriate for MMUPAGE_SIZE maps but the kmaps for PAGE_SIZE maps (later,
with Ben's work on Large PAGE_CACHE_SIZE, it became clear that actually
kmaps should be PAGE_CACHE_SIZE, but I've not added in that complication).
So the atomic kmaps are no longer done by fixmap.h, but come after the
vmallocs, before the regular kmaps, in highmem.h. It also doesn't waste
so much virtual space either end of the vmalloc area. The mm/init.c diff
looks like a lot, mainly because the result is a lot simpler. If it would
help you for me to update to a release of your choice, let me know
(perhaps the next aa, with your mods in).
> > 5. Shouldn't mm/vmscan.c be in the patch?
>
> can you elaborate?
swap_out_pmd() uses pte_offset() on user page tables,
so doesn't it need to pte_kunmap() afterwards?
Hugh
diff -urN 249/arch/i386/mm/init.c 249-morekmaps/arch/i386/mm/init.c
--- 249/arch/i386/mm/init.c Sat Apr 21 00:15:20 2001
+++ 249-morekmaps/arch/i386/mm/init.c Sat Jan 12 21:26:28 2002
@@ -36,10 +36,15 @@
#include <asm/e820.h>
#include <asm/apic.h>
-unsigned long highstart_pfn, highend_pfn;
static unsigned long totalram_pages;
static unsigned long totalhigh_pages;
+#ifdef CONFIG_HIGHMEM
+unsigned long highstart_pfn, highend_pfn;
+pgprot_t kmap_prot;
+pte_t *kmap_pte;
+#endif
+
int do_check_pgt_cache(int low, int high)
{
int freed = 0;
@@ -62,31 +67,6 @@
return freed;
}
-/*
- * NOTE: pagetable_init alloc all the fixmap pagetables contiguous on the
- * physical space so we can cache the place of the first one and move
- * around without checking the pgd every time.
- */
-
-#if CONFIG_HIGHMEM
-pte_t *kmap_pte;
-pgprot_t kmap_prot;
-
-#define kmap_get_fixmap_pte(vaddr) \
- pte_offset(pmd_offset(pgd_offset_k(vaddr), (vaddr)), (vaddr))
-
-void __init kmap_init(void)
-{
- unsigned long kmap_vstart;
-
- /* cache the first kmap pte */
- kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN);
- kmap_pte = kmap_get_fixmap_pte(kmap_vstart);
-
- kmap_prot = PAGE_KERNEL;
-}
-#endif /* CONFIG_HIGHMEM */
-
void show_mem(void)
{
int i, total = 0, reserved = 0;
@@ -164,141 +144,92 @@
set_pte_phys(address, phys, flags);
}
-static void __init fixrange_init (unsigned long start, unsigned long end, pgd_t *pgd_base)
-{
- pgd_t *pgd;
- pmd_t *pmd;
- pte_t *pte;
- int i, j;
- unsigned long vaddr;
-
- vaddr = start;
- i = __pgd_offset(vaddr);
- j = __pmd_offset(vaddr);
- pgd = pgd_base + i;
-
- for ( ; (i < PTRS_PER_PGD) && (vaddr != end); pgd++, i++) {
-#if CONFIG_X86_PAE
- if (pgd_none(*pgd)) {
- pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
- set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
- if (pmd != pmd_offset(pgd, 0))
- printk("PAE BUG #02!\n");
- }
- pmd = pmd_offset(pgd, vaddr);
-#else
- pmd = (pmd_t *)pgd;
-#endif
- for (; (j < PTRS_PER_PMD) && (vaddr != end); pmd++, j++) {
- if (pmd_none(*pmd)) {
- pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
- set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
- if (pte != pte_offset(pmd, 0))
- BUG();
- }
- vaddr += PMD_SIZE;
- }
- j = 0;
- }
-}
-
static void __init pagetable_init (void)
{
- unsigned long vaddr, end;
- pgd_t *pgd, *pgd_base;
- int i, j, k;
+ unsigned long entry, end;
pmd_t *pmd;
- pte_t *pte, *pte_base;
+ pte_t *pte;
+ int i;
/*
- * This can be zero as well - no problem, in that case we exit
- * the loops anyway due to the PTRS_PER_* conditions.
+ * It's generally assumed that the user/kernel boundary
+ * coincides with a pgd entry boundary (4MB, or 1GB if PAE):
+ * the restriction could be lifted but why bother? just check.
*/
- end = (unsigned long)__va(max_low_pfn*PAGE_SIZE);
-
- pgd_base = swapper_pg_dir;
-#if CONFIG_X86_PAE
- for (i = 0; i < PTRS_PER_PGD; i++)
- set_pgd(pgd_base + i, __pgd(1 + __pa(empty_zero_page)));
+#if (__PAGE_OFFSET & (PGDIR_SIZE - 1))
+#error PAGE_OFFSET should be a multiple of PGDIR_SIZE!
#endif
- i = __pgd_offset(PAGE_OFFSET);
- pgd = pgd_base + i;
- for (; i < PTRS_PER_PGD; pgd++, i++) {
- vaddr = i*PGDIR_SIZE;
- if (end && (vaddr >= end))
- break;
#if CONFIG_X86_PAE
- pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
- set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
-#else
- pmd = (pmd_t *)pgd;
+ /*
+ * Usually only one page is needed here: if PAGE_OFFSET lowered,
+ * maybe three pages: need not be contiguous, but might as well.
+ */
+ pmd = (pmd_t *)alloc_bootmem_low_pages(KERNEL_PGD_PTRS*PAGE_SIZE);
+ for (i = 1; i < USER_PGD_PTRS; i++)
+ set_pgd(swapper_pg_dir+i, __pgd(1 + __pa(empty_zero_page)));
+ for (; i < PTRS_PER_PGD; i++, pmd += PTRS_PER_PMD)
+ set_pgd(swapper_pg_dir+i, __pgd(1 + __pa(pmd)));
+ /*
+ * Add low memory identity-mappings - SMP needs it when
+ * starting up on an AP from real-mode. In the non-PAE
+ * case we already have these mappings through head.S.
+ * All user-space mappings are explicitly cleared after
+ * SMP startup.
+ */
+ swapper_pg_dir[0] = swapper_pg_dir[USER_PGD_PTRS];
#endif
- if (pmd != pmd_offset(pgd, 0))
- BUG();
- for (j = 0; j < PTRS_PER_PMD; pmd++, j++) {
- vaddr = i*PGDIR_SIZE + j*PMD_SIZE;
- if (end && (vaddr >= end))
- break;
- if (cpu_has_pse) {
- unsigned long __pe;
-
- set_in_cr4(X86_CR4_PSE);
- boot_cpu_data.wp_works_ok = 1;
- __pe = _KERNPG_TABLE + _PAGE_PSE + __pa(vaddr);
- /* Make it "global" too if supported */
- if (cpu_has_pge) {
- set_in_cr4(X86_CR4_PGE);
- __pe += _PAGE_GLOBAL;
- }
- set_pmd(pmd, __pmd(__pe));
- continue;
- }
- pte_base = pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
-
- for (k = 0; k < PTRS_PER_PTE; pte++, k++) {
- vaddr = i*PGDIR_SIZE + j*PMD_SIZE + k*PAGE_SIZE;
- if (end && (vaddr >= end))
- break;
- *pte = mk_pte_phys(__pa(vaddr), PAGE_KERNEL);
- }
- set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte_base)));
- if (pte_base != pte_offset(pmd, 0))
- BUG();
+ /*
+ * Map in all the low memory pages: using PSE if available,
+ * or by allocating and populating page tables if no PSE.
+ */
+ pmd = pmd_offset(pgd_offset_k(PAGE_OFFSET), PAGE_OFFSET);
+ end = max_low_pfn << PAGE_SHIFT;
+ if (cpu_has_pse) {
+ set_in_cr4(X86_CR4_PSE);
+ boot_cpu_data.wp_works_ok = 1;
+ entry = _KERNPG_TABLE | _PAGE_PSE;
+ /* Make it "global" too if supported */
+ if (cpu_has_pge) {
+ entry |= _PAGE_GLOBAL;
+ set_in_cr4(X86_CR4_PGE);
}
+ for (; entry < end; pmd++, entry += PTRS_PER_PTE*PAGE_SIZE)
+ set_pmd(pmd, __pmd(entry));
+ }
+ else for (entry = 0; entry < end; pmd++) {
+ pte = (pte_t *)alloc_bootmem_low_pages(PAGE_SIZE);
+ for (i = 0; i < PTRS_PER_PTE; i++, pte++, entry += PAGE_SIZE) {
+ if (entry >= end)
+ break;
+ *pte = mk_pte_phys(entry, PAGE_KERNEL);
+ }
+ set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte - i)));
}
/*
- * Fixed mappings, only the page table structure has to be
- * created - mappings will be set by set_fixmap():
- */
- vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
- fixrange_init(vaddr, 0, pgd_base);
-
-#if CONFIG_HIGHMEM
- /*
- * Permanent kmaps:
+ * Leave vmalloc() to create its own page tables as needed,
+ * but create the page tables at top of virtual memory, to be
+ * populated by kmap_atomic(), kmap_high() and set_fixmap().
+ * kmap_high() assumes pkmap_page_table contiguous throughout.
*/
- vaddr = PKMAP_BASE;
- fixrange_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);
-
- pgd = swapper_pg_dir + __pgd_offset(vaddr);
- pmd = pmd_offset(pgd, vaddr);
- pte = pte_offset(pmd, vaddr);
- pkmap_page_table = pte;
-#endif
+ pmd = pmd_offset(pgd_offset_k(VMALLOC_END), VMALLOC_END);
+ i = (0UL - (VMALLOC_END & PMD_MASK)) >> PMD_SHIFT;
+ pte = (pte_t *)alloc_bootmem_low_pages(i*PAGE_SIZE);
+ for (; --i >= 0; pmd++, pte += PTRS_PER_PTE)
+ set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
-#if CONFIG_X86_PAE
+#ifdef CONFIG_HIGHMEM
/*
- * Add low memory identity-mappings - SMP needs it when
- * starting up on an AP from real-mode. In the non-PAE
- * case we already have these mappings through head.S.
- * All user-space mappings are explicitly cleared after
- * SMP startup.
+ * For asm/highmem.h kmap_atomic() and mm/highmem.c kmap_high().
*/
- pgd_base[0] = pgd_base[USER_PTRS_PER_PGD];
+ kmap_prot = PAGE_KERNEL;
+ kmap_pte = pte_offset(pmd_offset(pgd_offset_k(
+ KMAP_BASE), KMAP_BASE), KMAP_BASE);
+ pkmap_page_table = kmap_pte +
+ ((PKMAP_BASE - KMAP_BASE) >> PAGE_SHIFT);
#endif
}
@@ -344,24 +275,18 @@
__flush_tlb_all();
-#ifdef CONFIG_HIGHMEM
- kmap_init();
-#endif
{
unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
- unsigned int max_dma, high, low;
-
- max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
- low = max_low_pfn;
- high = highend_pfn;
+ unsigned int max_dma_pfn;
- if (low < max_dma)
- zones_size[ZONE_DMA] = low;
+ max_dma_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+ if (max_low_pfn < max_dma_pfn)
+ zones_size[ZONE_DMA] = max_low_pfn;
else {
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = low - max_dma;
+ zones_size[ZONE_DMA] = max_dma_pfn;
+ zones_size[ZONE_NORMAL] = max_low_pfn - max_dma_pfn;
#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = high - low;
+ zones_size[ZONE_HIGHMEM] = highend_pfn - max_low_pfn;
#endif
}
free_area_init(zones_size);
diff -urN 249/include/asm-i386/fixmap.h 249-morekmaps/include/asm-i386/fixmap.h
--- 249/include/asm-i386/fixmap.h Wed Aug 15 22:21:11 2001
+++ 249-morekmaps/include/asm-i386/fixmap.h Sat Jan 12 21:26:28 2002
@@ -17,10 +17,6 @@
#include <linux/kernel.h>
#include <asm/apicdef.h>
#include <asm/page.h>
-#ifdef CONFIG_HIGHMEM
-#include <linux/threads.h>
-#include <asm/kmap_types.h>
-#endif
/*
* Here we define all the compile-time 'special' virtual
@@ -40,13 +36,6 @@
* TLB entries of such buffers will not be flushed across
* task switches.
*/
-
-/*
- * on UP currently we will have no trace of the fixmap mechanizm,
- * no page table allocations, etc. This might change in the
- * future, say framebuffers for the console driver(s) could be
- * fix-mapped?
- */
enum fixed_addresses {
#ifdef CONFIG_X86_LOCAL_APIC
FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */
@@ -60,10 +49,6 @@
FIX_CO_APIC, /* Cobalt APIC Redirection Table */
FIX_LI_PCIA, /* Lithium PCI Bridge A */
FIX_LI_PCIB, /* Lithium PCI Bridge B */
-#endif
-#ifdef CONFIG_HIGHMEM
- FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
- FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
#endif
__end_of_fixed_addresses
};
diff -urN 249/include/asm-i386/highmem.h 249-morekmaps/include/asm-i386/highmem.h
--- 249/include/asm-i386/highmem.h Wed Aug 15 22:21:21 2001
+++ 249-morekmaps/include/asm-i386/highmem.h Sat Jan 12 21:26:28 2002
@@ -23,11 +23,12 @@
#include <linux/config.h>
#include <linux/init.h>
#include <linux/interrupt.h>
+#include <linux/threads.h>
#include <asm/kmap_types.h>
#include <asm/pgtable.h>
/* undef for production */
-#define HIGHMEM_DEBUG 1
+#define HIGHMEM_DEBUG 0
/* declarations for highmem.c */
extern unsigned long highstart_pfn, highend_pfn;
@@ -36,21 +37,14 @@
extern pgprot_t kmap_prot;
extern pte_t *pkmap_page_table;
-extern void kmap_init(void) __init;
-
/*
- * Right now we initialize only a single pte table. It can be extended
- * easily, subsequent pte tables have to be allocated in one physical
- * chunk of RAM.
+ * Allow for somewhat less than 2048 kmaps, ending at FIXADDR_START.
+ * First the per-cpu atomic kmaps, then the persistent shared kmaps.
*/
-#define PKMAP_BASE (0xfe000000UL)
-#ifdef CONFIG_X86_PAE
-#define LAST_PKMAP 512
-#else
-#define LAST_PKMAP 1024
-#endif
-#define LAST_PKMAP_MASK (LAST_PKMAP-1)
-#define PKMAP_NR(virt) ((virt-PKMAP_BASE) >> PAGE_SHIFT)
+#define KMAP_BASE (0UL - (2048 << PAGE_SHIFT))
+#define PKMAP_BASE (KMAP_BASE + (KM_TYPE_NR*NR_CPUS*PAGE_SIZE))
+#define LAST_PKMAP ((FIXADDR_START - PKMAP_BASE) >> PAGE_SHIFT)
+#define PKMAP_NR(virt) (((virt) - PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT))
extern void * FASTCALL(kmap_high(struct page *page));
@@ -58,8 +52,6 @@
static inline void *kmap(struct page *page)
{
- if (in_interrupt())
- BUG();
if (page < highmem_start_page)
return page_address(page);
return kmap_high(page);
@@ -67,8 +59,6 @@
static inline void kunmap(struct page *page)
{
- if (in_interrupt())
- BUG();
if (page < highmem_start_page)
return;
kunmap_high(page);
@@ -77,24 +67,23 @@
/*
* The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap
* gives a more generic (and caching) interface. But kmap_atomic can
- * be used in IRQ contexts, so in some (very limited) cases we need
- * it.
+ * be used in IRQ contexts, so in some (very limited) cases we need it.
*/
static inline void *kmap_atomic(struct page *page, enum km_type type)
{
- enum fixed_addresses idx;
unsigned long vaddr;
+ int idx;
if (page < highmem_start_page)
return page_address(page);
idx = type + KM_TYPE_NR*smp_processor_id();
- vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
+ vaddr = KMAP_BASE + (idx << PAGE_SHIFT);
#if HIGHMEM_DEBUG
- if (!pte_none(*(kmap_pte-idx)))
+ if (!pte_none(*(kmap_pte+idx)))
BUG();
#endif
- set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+ set_pte(kmap_pte+idx, mk_pte(page, __pgprot(__PAGE_KERNEL)));
__flush_tlb_one(vaddr);
return (void*) vaddr;
@@ -103,20 +92,22 @@
static inline void kunmap_atomic(void *kvaddr, enum km_type type)
{
#if HIGHMEM_DEBUG
- unsigned long vaddr = (unsigned long) kvaddr;
- enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id();
+ unsigned long vaddr;
+ int idx;
- if (vaddr < FIXADDR_START) // FIXME
+ if (kvaddr < high_memory)
return;
- if (vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx))
+ vaddr = (unsigned long) kvaddr;
+ idx = type + KM_TYPE_NR*smp_processor_id();
+ if (vaddr != KMAP_BASE + (idx << PAGE_SHIFT))
BUG();
/*
* force other mappings to Oops if they'll try to access
* this pte without first remap it
*/
- pte_clear(kmap_pte-idx);
+ pte_clear(kmap_pte+idx);
__flush_tlb_one(vaddr);
#endif
}
diff -urN 249/include/asm-i386/pgtable.h 249-morekmaps/include/asm-i386/pgtable.h
--- 249/include/asm-i386/pgtable.h Wed Aug 15 22:21:11 2001
+++ 249-morekmaps/include/asm-i386/pgtable.h Sat Jan 12 21:26:28 2002
@@ -15,7 +15,6 @@
#ifndef __ASSEMBLY__
#include <asm/processor.h>
#include <asm/fixmap.h>
-#include <linux/threads.h>
#ifndef _I386_BITOPS_H
#include <asm/bitops.h>
@@ -129,21 +128,16 @@
#ifndef __ASSEMBLY__
-/* Just any arbitrary offset to the start of the vmalloc VM area: the
- * current 8MB value just means that there will be a 8MB "hole" after the
- * physical memory until the kernel virtual memory starts. That means that
- * any out-of-bounds memory accesses will hopefully be caught.
- * The vmalloc() routines leaves a hole of 4kB between each vmalloced
- * area for the same reason. ;)
+/*
+ * Maximize the virtual memory range available to vmalloc(),
+ * but start at a new page table since it cannot use a PSE entry.
*/
-#define VMALLOC_OFFSET (8*1024*1024)
-#define VMALLOC_START (((unsigned long) high_memory + 2*VMALLOC_OFFSET-1) & \
- ~(VMALLOC_OFFSET-1))
+#define VMALLOC_START (((unsigned long) high_memory + ~PMD_MASK) & PMD_MASK)
#define VMALLOC_VMADDR(x) ((unsigned long)(x))
#if CONFIG_HIGHMEM
-# define VMALLOC_END (PKMAP_BASE-2*PAGE_SIZE)
+# define VMALLOC_END KMAP_BASE
#else
-# define VMALLOC_END (FIXADDR_START-2*PAGE_SIZE)
+# define VMALLOC_END FIXADDR_START
#endif
/*
diff -urN 249/mm/highmem.c 249-morekmaps/mm/highmem.c
--- 249/mm/highmem.c Thu Aug 16 17:42:45 2001
+++ 249-morekmaps/mm/highmem.c Sat Jan 12 21:26:28 2002
@@ -21,6 +21,7 @@
#include <linux/highmem.h>
#include <linux/swap.h>
#include <linux/slab.h>
+#include <linux/interrupt.h>
/*
* Virtual_count is not a pure "count".
@@ -85,8 +86,8 @@
count = LAST_PKMAP;
/* Find an empty entry */
for (;;) {
- last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
- if (!last_pkmap_nr) {
+ if (++last_pkmap_nr >= LAST_PKMAP) {
+ last_pkmap_nr = 0;
flush_all_zero_pkmaps();
count = LAST_PKMAP;
}
@@ -135,6 +136,9 @@
*
* We cannot call this from interrupts, as it may block
*/
+ if (in_interrupt())
+ BUG();
+
spin_lock(&kmap_lock);
vaddr = (unsigned long) page->virtual;
if (!vaddr)
@@ -151,6 +155,9 @@
unsigned long vaddr;
unsigned long nr;
int need_wakeup;
+
+ if (in_interrupt())
+ BUG();
spin_lock(&kmap_lock);
vaddr = (unsigned long) page->virtual;
On Thu, Jan 17, 2002 at 07:02:38PM +0000, Hugh Dickins wrote:
> On Thu, 17 Jan 2002, Andrea Arcangeli wrote:
> > On Thu, Jan 17, 2002 at 05:57:15PM +0000, Hugh Dickins wrote:
> >
> > we need more than one series, there's no way around it, or it can
> > deadlock: the default series is for pages, the other series are for
> > pagetables.
> >
> > > 3. KM_SERIE_PAGETABLE2 kmap_pagetable2 pmd_page2 pte_offset2 all just
> > > for copy_page_range src? Why distinguished from KM_SERIE_PAGETABLE?
> > > Often it will already be KM_SERIE_PAGETABLE. I can imagine you might
> > > want an atomic at that point (holding spinlock), but I don't see what
> > > the PAGETABLE2 distinction gives you.
> >
> > deadlock avoidance. It's the same reason as with the mempool: you need
> > two mempools if you need two objects from the same task or it can
> > deadlock (yes, it would be hard to deadlock it, but it is possible).
>
> Sorry, I'm still struggling: I can imagine deadlocks from having too
> few kmaps altogether, and I can see how separating data from pagetables
> protects pagetable code from bugs (missing kunmaps) in other code, but
> I don't see how separating the three series makes a crucial difference.
It really makes the crucial difference (deadlock avoidance is the only
reason there is the series thing in the kmaps). See the thread with Ingo
about mempool, it's exactly the same problem.
in short: the same task cannot get two entries from the same series
(/mempool) or the system can deadlock.
think of this: assume 3 entries in the pool and a single pool; all tasks
are running fork(2), so each needs two kmaps:
task1 task2 task3 task4
kmap kmap kmap
kmap <- (hangs)
(hangs)->kmap kmap kmap kmap
deadlock
with the kmap series, instead, we cannot deadlock:
task1 task2 task3 task4
kmap kmap kmap
kmap (hangs)
kmap2 kmap2 kmap2
kunmap .....
kunmap2......
kmap2
kunmap
kunmap2
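For illustration, roughly how the fork path then looks with the two
series; this is reconstructed from the discussion, not copied from
pte-highmem-5, and only uses the pte_alloc()/pte_offset2()/pte_kunmap()
names mentioned in this thread:

	/*
	 * copy_page_range(): the dst pte page is kmapped through the
	 * PAGETABLE series inside pte_alloc(), and the src pte page
	 * through the separate PAGETABLE2 series, so a task never
	 * waits for a slot of a series it is itself pinning.
	 */
	dst_pte = pte_alloc(dst, dst_pmd, address);	/* KM_SERIE_PAGETABLE */
	if (!dst_pte)
		goto nomem;
	src_pte = pte_offset2(src_pmd, address);	/* KM_SERIE_PAGETABLE2 */

	/* ... copy the ptes ... */

	pte_kunmap(src_pte);
	pte_kunmap(dst_pte);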
> > in fork the pte_offset kmap could be an atomic one, but atomic kmaps are
> > more costly with the invlpg I believe; to do it in a row, the 2 variant
> > with a different series should be faster for fork(2).
>
> Yes, maybe, maybe not.
I guess it depends on the pagetable overhead and on the arch details
(invlpg overhead). If there are fewer than 1024 pagetable kmaps to copy,
the kmap series is going to run faster than kmap_atomic because it may
not need any tlb flush at all, but if there are more we may need to do a
full global tlb flush, so it is not obvious which is better. I usually
try to avoid the atomic kmaps unless we can't sleep; that's the main
reason for that choice. But I'm open to suggestions; the best thing
would be to benchmark it first with a few pagetables, and then with a
huge number of pagetables, to get some numbers out of it. At least it is
something that we can change freely anytime once we have some numbers,
and in the meantime it is not a critical thing to sort out.
>
> > > 4. You've lifted the PAE restriction to LAST_PKMAP 512 in i386/highmem.h,
> > > and use pkmap_page_table as one long array in mm/highmem.c, but I
> > > don't see where you enforce the contiguity of page table pages in
> > > i386/mm/init.c. (I do already have a patch for lifting the 1024,512
> > > kmaps limit, simplifying i386/mm/init.c, we've been using for months:
> > > I can update that from 2.4.9 if you'd like it.)
> >
> > correct, currently it works because the bootmem tends to return
> > physically contiguous pages, but it is not enforced and it may trigger
> > with a weird e820 layout. If you have a patch, feel free to post it!! :)
> > thanks for the review.
>
> Okay, I've attached the existing patch against 2.4.9 below. It originated
> from my Large PAGE_SIZE patch, where it became clear that the fixmaps are
> appropriate for MMUPAGE_SIZE maps but the kmaps for PAGE_SIZE maps (later,
> with Ben's work on Large PAGE_CACHE_SIZE, it became clear that actually
> kmaps should be PAGE_CACHE_SIZE, but I've not added in that complication).
>
> So the atomic kmaps are no longer done by fixmap.h, but come after the
> vmallocs, before the regular kmaps, in highmem.h. It also doesn't waste
> so much virtual space either end of the vmalloc area. The mm/init.c diff
> looks like a lot, mainly because the result is a lot simpler. If it would
> help you for me to update to a release of your choice, let me know
> (perhaps the next aa, with your mods in).
the patch below looks like overkill, I don't want to change all that
code; I'd just need the few-liner to have an array of pte pointers as a
function of PTRS_PER_PTE and LAST_KMAP. Either that, or even better,
just allocate all the pagetables at once with a single bootmem call;
that is certainly a safe approach too: just count the number of pte
pages needed for the persistent kmaps and then do a single bootmem call
to allocate all of them at once. This second approach sounds like the
most efficient one. If you want to make a patch against pte-highmem-5
feel free to go ahead. It should be a few-liner. I didn't check what
other good things are getting changed by the big patch, but I'd prefer
to have only the necessary changes, at least for the 2.4 version of the
patch.
> > > 5. Shouldn't mm/vmscan.c be in the patch?
> >
> > can you elaborate?
>
> swap_out_pmd() uses pte_offset() on user page tables,
> so doesn't it need to pte_kunmap() afterwards?
Ouch, I completely overlooked vmscan.c!! So this definitely explains
hangs during swapout activity. Many thanks, that was an important one
to catch!
Andrea
On Fri, 18 Jan 2002, Andrea Arcangeli wrote:
>
> It really makes the crucial difference (deadlock avoidance is the only
> reason there is the series thing in the kmaps). See the thread with Ingo
> about mempool, it's exactly the same problem.
>
> in short: the same task cannot get two entries from the same series
> (/mempool) or the system can deadlock.
Many thanks for your patient illustrations, Andrea.
I certainly don't dispute that an indefinite number of tasks,
competing for multiple limited resources, are liable to deadlock.
I hope your heart won't sink too heavily when I say I'm still not
convinced that your SERIEs will solve that. I really value your
time, so please don't spend it on detailed reply to me (never
mind my ?s): I'm trying to help, but in danger of hindering you.
Nobody should interpret your silence as accepting my points.
I still believe that KM_SERIE_PAGETABLE2 kmap_pagetable2 pmd_page2
pte_offset2 should just be removed. The weaker reason is that they
are used only in copy_page_range, and there's no block between the
pte_offset2 and the pte_kunmaps, so progress can be made if all cpus
have done the kmap in pte_alloc, and there's at least one kmap spare
(or freeable) to use for the pte_offset (replacing the pte_offset2);
that puts a limit NR_CPUS+1 on the kmaps needed here, so just add
NR_CPUS+1 for safety to the poolsize already thought necessary (??),
instead of holding a separate pool. But the stronger reason is that
pte_offset2 is being called with dst->page_table_lock held, yet it
may have to block while all its kmaps are in use: elsewhere you have
consistently used pte_offset_atomic when spinlock held, why not here?
You might argue that there would always be a spare KM_SERIE_PAGETABLE2
kmap (after the usual flush_all_zero_pkmaps), could never block there.
But if that's so, it's not at all obvious: because of the way the
SERIEs are not entirely disjoint. When trying to kmap for a particular
serie, kmap_high accepts the existing page->virtual, no matter what
serie it belongs to. So I think it is possible that though PAGETABLE2s
are only assigned in copy_page_range, they could linger indefinitely
being used as PAGETABLEs and even as DEFAULTs (since kmaps independent
of page freeing). I suspect that this blurring of the distinction
between SERIEs invalidates (in a strict sense) the deadlock avoidance
argument for separate SERIEs.
And doesn't that argument assume an ordering, a rule that a task
would necessarily allocate DEFAULT before PAGETABLE (or vice versa)?
But I think in some contexts it's one way round (DEFAULT before
PAGETABLE when read's file_read_actor's __copy_to_user faults),
and in other contexts the other way round (PAGETABLE before DEFAULT
when do_no_page's block_read_full_page clears hole, or nfs readpage,
or ramfs readpage, or shmem_nopage's clear_highpage). I'm willing
to believe that the DEFAULT,PAGETABLE distinction reduces the chance
of kmap exhaustion deadlock, but it looks difficult to carry through.
Eek! Am I reading that right? pte_alloc_one uses clear_highpage
which uses kmap i.e. DEFAULT, so with the persistence of page->virtual,
page tables will usually start out as DEFAULT and not PAGETABLE? If
you retain the separate SERIEs, then I think you will need to add a
clear_pagetable for use by pte_alloc_one.
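A minimal sketch of such a helper, assuming the kmap_pagetable() and
pte_kunmap() names used elsewhere in this thread; this is only a guess
at the shape, not code from the patch:

	static inline void clear_pagetable(struct page *page)
	{
		/*
		 * Like 2.4's clear_highpage(), but mapping through the
		 * pagetable kmap series instead of the default one, so
		 * page->virtual does not start out in KM_SERIE_DEFAULT.
		 */
		pte_t *pte = (pte_t *) kmap_pagetable(page);

		clear_page(pte);
		pte_kunmap(pte);
	}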
Regarding swap_out_pmd: I expect you've now made it pte_offset_atomic
because the spinlock is held; but want to highlight that that one for
sure must not use the same pool as pte_alloc, otherwise you could
deadlock with swap_out needing the pool to free pages, but waiting
page allocators holding all kmaps from that pool.
I am worried by the way those pagetable kmaps are held across
handle_pte_fault: under memory load with many tasks, the system will
reach the point where all pagetable kmaps are in use, waiting for
pages to be allocated to satisfy the faults. Perhaps that's just
a fact of life, perhaps it will bring its own problems. It also
seems wrong how often we are reduced to using the atomic kmaps: I'm
not against them, I don't think "invlpg" is itself expensive, but after
going to the trouble to set up a cache of mappings, it's sad to have
to bypass it so often. I think we're going to find a better design
later on, but right now I don't have a constructive suggestion
(aah, 65536 kmaps, that would help!).
A final point. I don't have a conclusive argument (beyond simplicity),
but I believe it will prove to be a mistake to give highmem pagetables
to the kernel's vmalloc area. I think you should define pte_alloc_k
for vmalloc, then you can avoid its pte_kunmaps, and remove traps.c
fault.c ioremap.c drm_scatter.h drm_vm.h from the patch (drm_proc.h
would have to stay; but personally, I'd just delete that #if 0 block
and be done with it - it's shown up too often in my pte greps!); and
save you from having to add all those video drivers using "rvmalloc"
into the patch, currently they're missing.
(On the contiguity of the pagetables for kmaps: yes, you're right,
the patch I offered goes rather beyond what you'd want to put in
right now. I'll try very hard to cut it down to what's necessary
- but I'm sure you know just how hard it is to resist cleaning up!)
Hugh
On Sat, Jan 19, 2002 at 08:56:53PM +0000, Hugh Dickins wrote:
> On Fri, 18 Jan 2002, Andrea Arcangeli wrote:
> >
> > It really makes the crucial difference (deadlock avoidance is the only
> > reason there is the series thing in the kmaps). See the thread with Ingo
> > about mempool, it's exactly the same problem.
> >
> > in short: the same task cannot get two entries from the same series
> > (/mempool) or the system can deadlock.
>
> Many thanks for your patient illustrations, Andrea.
I'm the one who should thank you for such helpful reviews, thanks so
much.
> I certainly don't dispute that an indefinite number of tasks,
> competing for multiple limited resources, are liable to deadlock.
> I hope your heart won't sink too heavily when I say I'm still not
> convinced that your SERIEs will solve that. I really value your
> time, so please don't spend it on detailed reply to me (never
> mind my ?s): I'm trying to help, but in danger of hindering you.
> Nobody should interpret your silence as accepting my points.
>
> I still believe that KM_SERIE_PAGETABLE2 kmap_pagetable2 pmd_page2
> pte_offset2 should just be removed. The weaker reason is that they
> are used only in copy_page_range, and there's no block between the
> pte_offset2 and the pte_kunmaps, so progress can be made if all cpus
the only issue is that the pte_offset2 could deadlock if it were a
pte_offset instead of a pte_offset2.
> have done the kmap in pte_alloc, and there's at least one kmap spare
> (or freeable) to use for the pte_offset (replacing the pte_offset2);
there isn't a spare kmap: all 1024 kmaps are being used by 1024 tasks all
running pte_alloc, and later all blocking in the pte_offset (replacing
the pte_offset2), so all 1024 tasks block in pte_offset, waiting for a
spare kmap that will never arrive. KM_SERIE_PAGETABLE2 fixes this.
It is exactly like the bio allocation: you can't allocate two bios
from the same mempool in the same task or you can deadlock.
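To put the fork-path pattern in code, here is a rough sketch in
copy_page_range terms (only a sketch using the interfaces discussed in this
thread, not the literal patch code):
	dst_pte = pte_alloc(dst, dst_pmd, address); /* kmaps the dst pagetable,
						       PAGETABLE serie */
	src_pte = pte_offset2(src_pmd, address);    /* second mapping, from the
						       PAGETABLE2 serie: if it came
						       from the same pool, the 1024
						       tasks parked here would wait
						       on each other forever */
	/* ... copy the ptes ... */
	pte_kunmap(src_pte);
	pte_kunmap(dst_pte);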
> that puts a limit NR_CPUS+1 on the kmaps needed here, so just add
> NR_CPUS+1 for safety to the poolsize already thought necessary (??),
the number of cpus has nothing to do with this, the issue is only the
number of tasks; the problem is the same on a UP machine or on a 1024-way
machine.
> instead of holding a separate pool. But the stronger reason is that
> pte_offset2 is being called with dst->page_table_lock held, yet it
pte_offset2 is called without the pagetable lock held, otherwise it would
have to be a pte_offset_atomic; we cannot call pte_offset with a spinlock
held because pte_offset can schedule.
> may have to block while all its kmaps are in use: elsewhere you have
> consistently used pte_offset_atomic when spinlock held, why not here?
because the spinlock isn't held there :)
> You might argue that there would always be a spare KM_SERIE_PAGETABLE2
> kmap (after the usual flush_all_zero_pkmaps), could never block there.
> But if that's so, it's not at all obvious: because of the way the
> SERIEs are not entirely disjoint. When trying to kmap for a particular
> serie, kmap_high accepts the existing page->virtual, no matter what
> serie it belongs to. So I think it is possible that though PAGETABLE2s
> are only assigned in copy_page_range, they could linger indefinitely
> being used as PAGETABLEs and even as DEFAULTs (since kmaps independent
> of page freeing). I suspect that this blurring of the distinction
> between SERIEs invalidates (in a strict sense) the deadlock avoidance
> argument for separate SERIEs.
the mixture between the PAGETABLE2 and PAGETABLE series, due to the
reentrancy of kmap, is more difficult to prove correct. The reason it
cannot deadlock is that the previous pte_alloc (from the PAGETABLE
serie) is on a certainly newly created kmap, so it cannot be the one
that keeps a PAGETABLE2 slot pinned. As long as you do pte_offset2 while
you are sure you're not keeping a PAGETABLE2 slot pinned in the same
path (possibly with a pte_offset/pte_alloc), you're safe. The
fork() path is obviously correct because the pte_alloc will map
something new.
The same is true between KM_SERIE_PAGETABLE and KM_SERIE_DEFAULT; the reason
it is obviously correct here is that the mixture cannot happen in
the first place, because pages will pass through the freelist before the
two can be mixed, and so they will be kunmapped first.
So it still can't deadlock and it's fine to use pte_offset2.
kunmap_vaddr will take care of unmapping anything, wherever it comes from.
The decision whether to use pte_offset_atomic or pte_offset2 should be based
solely on performance factors (which still have to be measured).
> And doesn't that argument assume an ordering, a rule that a task
> would necessarily allocate DEFAULT before PAGETABLE (or vice versa)?
The ordering doesn't matter. You just must not keep a DEFAULT pinned
while mapping another DEFAULT, and likewise for pagetables.
> But I think in some contexts it's one way round (DEFAULT before
> PAGETABLE when read's file_read_actor's __copy_to_user faults),
> and in other contexts the other way round (PAGETABLE before DEFAULT
> when do_no_page's block_read_full_page clears hole, or nfs readpage,
> or ramfs readpage, or shmem_nopage's clear_highpage). I'm willing
> to believe that the DEFAULT,PAGETABLE distinction reduces the chance
> of kmap exhaustion deadlock, but it looks difficult to carry through.
I don't think there's any possible conflict between DEFAULT and
PAGETABLE.
> Eek! Am I reading that right? pte_alloc_one uses clear_highpage
> which uses kmap i.e. DEFAULT, so with the persistence of page->virtual,
> page tables will usually start out as DEFAULT and not PAGETABLE? If
> you retain the separate SERIEs, then I think you will need to add a
> clear_pagetable for use by pte_alloc_one.
agreed, thanks.
> Regarding swap_out_pmd: I expect you've now made it pte_offset_atomic
> because the spinlock is held; but want to highlight that that one for
correct, this was the missing chunk:
--- 2.4.18pre2aa2/mm/vmscan.c.~1~ Wed Jan 16 17:52:20 2002
+++ 2.4.18pre2aa2/mm/vmscan.c Fri Jan 18 03:51:45 2002
@@ -164,7 +164,7 @@
/* mm->page_table_lock is held. mmap_sem is not held */
static inline int swap_out_pmd(struct mm_struct * mm, struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long end, int count, zone_t * classzone)
{
- pte_t * pte;
+ pte_t * pte, * pte_orig;
unsigned long pmd_end;
if (pmd_none(*dir))
@@ -175,7 +175,7 @@
return count;
}
- pte = pte_offset(dir, address);
+ pte_orig = pte = pte_offset_atomic(dir, address);
pmd_end = (address + PMD_SIZE) & PMD_MASK;
if (end > pmd_end)
@@ -196,6 +196,7 @@
address += PAGE_SIZE;
pte++;
} while (address && (address < end));
+ pte_kunmap(pte_orig);
mm->swap_address = address;
return count;
}
> sure must not use the same pool as pte_alloc, otherwise you could
> deadlock with swap_out needing the pool to free pages, but waiting
> page allocators holding all kmaps from that pool.
>
> I am worried by the way those pagetable kmaps are held across
> handle_pte_fault: under memory load with many tasks, the system will
> reach the point where all pagetable kmaps are in use, waiting for
> pages to be allocated to satisfy the faults. Perhaps that's just
> a fact of life, perhaps it will bring its own problems. It also
> seems wrong how often we are reduced to using the atomic kmaps: I'm
> not against them, I don't think "invlpg" is itself expensive, but after
> going to the trouble to set up a cache of mappings, it's sad to have
> to bypass it so often. I think we're going to find a better design
> later on, but right now I don't have a constructive suggestion
> (aah, 65536 kmaps, that would help!).
1024 tasks all taking a pagefault at the same time before the 1025th has
to block sounds like a reasonable limit at the moment. Enlarging the
pool should be easy, by just changing the LAST_PKMAP define.
> A final point. I don't have a conclusive argument (beyond simplicity),
> but I believe it will prove to be a mistake to give highmem pagetables
> to the kernel's vmalloc area. I think you should define pte_alloc_k
> for vmalloc, then you can avoid its pte_kunmaps, and remove traps.c
> fault.c ioremap.c drm_scatter.h drm_vm.h from the patch (drm_proc.h
> would have to stay; but personally, I'd just delete that #if 0 block
> and be done with it - it's shown up too often in my pte greps!); and
> save you from having to add all those video drivers using "rvmalloc"
> into the patch, currently they're missing.
That's an option, but it's so easy to update those few drivers that I'm
not sure it's worth asking whether the ptes have been allocated by
vmalloc, and putting hacks into vmalloc.c doesn't look very nice either;
I believe it is more effort than updating the drivers.
> (On the contiguity of the pagetables for kmaps: yes, you're right,
> the patch I offered goes rather beyond what you'd want to put in
> right now. I'll try very hard to cut it down to what's necessary
> - but I'm sure you know just how hard it is to resist cleaning up!)
oh yes, I know :)
Andrea
On Mon, 21 Jan 2002, Andrea Arcangeli wrote:
> On Sat, Jan 19, 2002 at 08:56:53PM +0000, Hugh Dickins wrote:
> > ....
> > that puts a limit NR_CPUS+1 on the kmaps needed here, so just add
> > NR_CPUS+1 for safety to the poolsize already thought necessary (??),
>
> the number of cpus have nothing to do with this, the issue is only the
> number of tasks, the problem is the same in a UP or in a 1024way
> machine.
If copy_page_range were some low-level function called from all over
the place, or if it blocked after its second kmap, I'd agree with you.
But I'm afraid your PAGETABLE2 justifications have only made me firmer
in the opinion that it's unnecessary. Never mind, that disagreement
is irrelevant because...
> > instead of holding a separate pool. But the stronger reason is that
> > pte_offset2 is being called with dst->page_table_lock held, yet it
>
> pte_offset2 is called without the pagetable lock held, otherwise it had
> to be a pte_offset_atomic, we cannot recall pte_offset with a spinlock
> held because pte_offset can schedule.
>
> > may have to block while all its kmaps are in use: elsewhere you have
> > consistently used pte_offset_atomic when spinlock held, why not here?
>
> because the spinlock isn't held there :)
Please look again. src->page_table_lock is not held across pte_offset2,
but dst->page_table_lock _is_ being held across pte_offset2. See comment
just above copy_page_range. See how the immediately preceding pte_alloc
expects dst->page_table_lock to be held. So, pte_offset_atomic is
appropriate there, and pte_offset2 etc can go.
Actually, I'm coming to think that most of the pte_offset_atomic
calls should be replaced by a pte_offset_nowait: which gives the usual
pte_offset kmap if it can without blocking, or an atomic if not - your
pte_kunmap seems nicely designed for that very possibility. Whether
pte_offset_nowait should ever flush_all_zero_pkmaps, or just return
an atomic in the end-of-pkmap case, I'm not sure.
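In code the idea would be roughly this (just a sketch; pte_offset_nonblock
is a made-up helper meaning "persistent kmap only if it needs no sleep",
not something in the patch):
	pte_t *pte_offset_nowait(pmd_t *dir, unsigned long address)
	{
		pte_t *pte = pte_offset_nonblock(dir, address);
		if (!pte)
			pte = pte_offset_atomic(dir, address);
		return pte;	/* either way, release with pte_kunmap(pte) */
	}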
> the mixture between PAGETABLE2 and PAGETABLE series due the
> reentrance of kmap is more difficult to proof as correct. The reason it
> cannot deadlock is that the previous pte_alloc (from the PAGETABLE
> serie) is on a certainly newly created kmap, so it cannot be the one
> that keeps a PAGETABLE2 slot pinned. As far as you do pte_offset2 while
> you are sure you're not keeping pinned a PAGETABLE2 slot in the same
> path (possibly with a pte_offset/pte_alloc) you're safe. The case of the
> fork() path is obviously correct because the pte_alloc will map
> something new.
No, the phrase "newly created kmap" is misleading. It is a new use
for a kmap, but the mapping being used may simply be that left over
from the previous use of that page. To get the SERIEs working as you
intend, you need to re-kmap whenever serie wanted differs from serie
already assigned (but PAGETABLE/PAGETABLE2 distinction a nuisance,
PAGETABLE2 when it's forking parent, PAGETABLE at other times).
And what happens when a page is wanted for two purposes at once? I'm
thinking there of mapping /dev/mem, kmapping as DEFAULT while it's in
use as PAGETABLE. At present /dev/mem does not supply highmem, but if
page tables are in highmem, I believe we shall need to give something
like LKCD's lcrash access to it.
> Same is true between KM_SERIE_PAGETABLE and KM_SERIE_DEFAULT, the reason
> it is obviously correct here is because the mixture cannot happen since
> the first place, because pages will pass through the freelist before the
> two can be mixed, and so they will be kunmapped first.
I think you're mistaken there: passing through the
freelist doesn't affect a page's kmapping at all, does it?
We could make changes to do so, but that wouldn't be enough.
> > ....
> > And doesn't that argument assume an ordering, a rule that a task
> > would necessarily allocate DEFAULT before PAGETABLE (or vice versa)?
>
> The ordering doesn't matter. You just don't need to keep pinned a
> DEFAULT while mapping a default, and the other way around for pagetable.
>
> > But I think in some contexts it's one way round (DEFAULT before
> > PAGETABLE when read's file_read_actor's __copy_to_user faults),
> > and in other contexts the other way round (PAGETABLE before DEFAULT
> > when do_no_page's block_read_full_page clears hole, or nfs readpage,
> > or ramfs readpage, or shmem_nopage's clear_highpage). I'm willing
> > to believe that the DEFAULT,PAGETABLE distinction reduces the chance
> > of kmap exhaustion deadlock, but it looks difficult to carry through.
>
> I don't think there's any possible conflict between DEFAULT and
> PAGETABLE.
But I'm suggesting that in some cases a DEFAULT kmap is pinned while
a PAGETABLE kmap is being acquired, and in other cases a PAGETABLE
kmap is pinned while a DEFAULT kmap is being acquired; which might
be happening concurrently in an unlimited (well, <32768) number of
tasks. Is that not so? And couldn't that deadlock, even if the
DEFAULT/PAGETABLE distinction were strictly maintained? Of course,
the all-one-pool case can deadlock there too: perhaps more easily
(though it would be sensible to give that one pool as many entries
as all your pools combined).
Well, I've had my say: I believe the SERIEs, as currently implemented,
give _an illusion of_ security from kmap deadlock. But it feels like
I'm failing to dislodge you from your belief in them, and they're
not actively harmful, so I'll shut up - until provoked!
> > A final point. I don't have a conclusive argument (beyond simplicity),
> > but I believe it will prove to be a mistake to give highmem pagetables
> > to the kernel's vmalloc area. I think you should define pte_alloc_k
> > for vmalloc, then you can avoid its pte_kunmaps, and remove traps.c
> > fault.c ioremap.c drm_scatter.h drm_vm.h from the patch (drm_proc.h
> > would have to stay; but personally, I'd just delete that #if 0 block
> > and be done with it - it's shown up too often in my pte greps!); and
> > save you from having to add all those video drivers using "rvmalloc"
> > into the patch, currently they're missing.
>
> That's an option, but it's so easy to update those few drivers that I'm
> not sure if it worth to ask yourself if the pte are been allocated by
> vmalloc, and also putting hacks into vmalloc.c doesn't look very nice,
> it is more effort than to update the drivers I believe.
Well, you've done that work now in pre4aa1, and you've fixed the only
actual problem I saw with it overnight and was hoping to hit you with
today (the vmalloc fault in interrupt context). I still don't like it
(just now I was checking my pagetable_init patch, wanted to check the
vmalloc pagetable, quite difficult), but never mind.
Hugh
On Tue, Jan 22, 2002 at 06:01:43PM +0000, Hugh Dickins wrote:
> On Mon, 21 Jan 2002, Andrea Arcangeli wrote:
> > On Sat, Jan 19, 2002 at 08:56:53PM +0000, Hugh Dickins wrote:
> > > ....
> > > that puts a limit NR_CPUS+1 on the kmaps needed here, so just add
> > > NR_CPUS+1 for safety to the poolsize already thought necessary (??),
> >
> > the number of cpus have nothing to do with this, the issue is only the
> > number of tasks, the problem is the same in a UP or in a 1024way
> > machine.
>
> If copy_page_range were some low-level function called from all over
> the place, or if it blocked after its second kmap, I'd agree with you.
> But I'm afraid your PAGETABLE2 justifications have only made me firmer
> in the opinion that it's unnecessary. Never mind, that disagreement
> is irrelevant because...
>
> > > instead of holding a separate pool. But the stronger reason is that
> > > pte_offset2 is being called with dst->page_table_lock held, yet it
> >
> > pte_offset2 is called without the pagetable lock held, otherwise it had
> > to be a pte_offset_atomic, we cannot recall pte_offset with a spinlock
> > held because pte_offset can schedule.
> >
> > > may have to block while all its kmaps are in use: elsewhere you have
> > > consistently used pte_offset_atomic when spinlock held, why not here?
> >
> > because the spinlock isn't held there :)
>
> Please look again. src->page_table_lock is not held across pte_offset2,
> but dst->page_table_lock _is_ being held across pte_offset2. See comment
Oops, I overlooked one spin_lock held by the caller :). Many thanks for
spotting this. At worst it will only deadlock (if the PAGETABLE2
serie runs short).
> just above copy_page_range. See how the immediately preceding pte_alloc
> expects dst->page_table_lock to be held. So, pte_offset_atomic is
> appropriate there, and pte_offset2 etc can go.
Either that or pte_offset_under_lock2.
> Actually, I'm coming to think that most of the pte_offset_atomic
> calls should be replaced by a pte_offset_nowait: which gives the usual
> pte_offset kmap if it can without blocking, or an atomic if not - your
> pte_kunmap seems nicely designed for that very possibility. Whether
Possible optimization. Before doing so we should really measure how much
faster the kmaps are compared to kmap_atomic, I believe. This could be a
performance optimization. But it could also be that the kmap_atomics are
faster and it isn't worth paying the kmap overhead when
persistence is not necessary.
> pte_offset_nowait should ever flush_all_zero_pkmaps, or just return
> an atomic in the end-of-pkmap case, I'm not sure.
>
> > the mixture between PAGETABLE2 and PAGETABLE series due the
> > reentrance of kmap is more difficult to proof as correct. The reason it
> > cannot deadlock is that the previous pte_alloc (from the PAGETABLE
> > serie) is on a certainly newly created kmap, so it cannot be the one
> > that keeps a PAGETABLE2 slot pinned. As far as you do pte_offset2 while
> > you are sure you're not keeping pinned a PAGETABLE2 slot in the same
> > path (possibly with a pte_offset/pte_alloc) you're safe. The case of the
> > fork() path is obviously correct because the pte_alloc will map
> > something new.
>
> No, the phrase "newly created kmap" is misleading. It is a new use
> for a kmap, but the mapping being used may simply be that left over
> from the previous use of that page. To get the SERIEs working as you
> intend, you need to re-kmap whenever serie wanted differs from serie
> already assigned (but PAGETABLE/PAGETABLE2 distinction a nuisance,
> PAGETABLE2 when it's forking parent, PAGETABLE at other times).
pagetable2 is needed every single time you need to access two pagetables
at the same time with a static kmap. In fact, now that I notice it, mremap
needs it too. With mremap both kmaps could return
entries in the PAGETABLE serie, no way around it.
So to solve all this mess with the mixture of PAGETABLE2 and PAGETABLE
(and even with DEFAULT, if /dev/kmem will ever mess with highmem), I will
just drop the recursive behaviour of kmap. That is not needed for
coherency, it was only a kind of performance optimization and a way to
maintain page->virtual, so I will just kill it and then we're done, no
mixture mess any longer. I will leave it only for the DEFAULT serie,
because of page->virtual; all other series will not care about
page->virtual any longer. page->virtual was a bad idea in the first
place, it would have been better to keep the information on the current
stack, like all the other series will do.
> And what happens when a page is wanted for two purposes at once? I'm
> thinking there of mapping /dev/mem, kmapping as DEFAULT while it's in
> use as PAGETABLE. At present /dev/mem does not supply highmem, but if
> page tables are in highmem, I believe we shall need to give something
> like LKCD's lcrash access to it.
see above.
>
> > Same is true between KM_SERIE_PAGETABLE and KM_SERIE_DEFAULT, the reason
> > it is obviously correct here is because the mixture cannot happen since
> > the first place, because pages will pass through the freelist before the
> > two can be mixed, and so they will be kunmapped first.
>
> I think you're mistaken there: passing through the
> freelist doesn't affect a page's kmapping at all, does it?
> We could make changes to do so, but that wouldn't be enough.
Doing:
page = alloc_page();
kmap(page);
__free_page(page)
is a bug (it will leak kmap entries eventually); this is what I meant
by "passing through the freelist" as a serializing point between the
PAGETABLE and DEFAULT series.
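For contrast, the intended ordering with the standard kmap interface
(mirroring the snippet above) is:
	page = alloc_page(GFP_KERNEL);
	vaddr = kmap(page);
	/* ... use the mapping ... */
	kunmap(page);		/* drop the kmap before freeing */
	__free_page(page);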
>
> > > ....
> > > And doesn't that argument assume an ordering, a rule that a task
> > > would necessarily allocate DEFAULT before PAGETABLE (or vice versa)?
> >
> > The ordering doesn't matter. You just don't need to keep pinned a
> > DEFAULT while mapping a default, and the other way around for pagetable.
> >
> > > But I think in some contexts it's one way round (DEFAULT before
> > > PAGETABLE when read's file_read_actor's __copy_to_user faults),
> > > and in other contexts the other way round (PAGETABLE before DEFAULT
> > > when do_no_page's block_read_full_page clears hole, or nfs readpage,
> > > or ramfs readpage, or shmem_nopage's clear_highpage). I'm willing
> > > to believe that the DEFAULT,PAGETABLE distinction reduces the chance
> > > of kmap exhaustion deadlock, but it looks difficult to carry through.
> >
> > I don't think there's any possible conflict between DEFAULT and
> > PAGETABLE.
>
> But I'm suggesting that in some cases a DEFAULT kmap is pinned while
> a PAGETABLE kmap is being acquired, and in other cases a PAGETABLE
> kmap is pinned while a DEFAULT kmap is being acquired; which might
> be happening concurrently in an unlimited (well, <32768) number of
> tasks. Is that not so? And couldn't that deadlock, even if the
> DEFAULT/PAGETABLE distinction were strictly maintained? Of course,
> the all-one-pool case can deadlock there too: perhaps more easily
> (though it would be sensible to give that one pool as many entries
> as all your pools combined).
>
> Well, I've had my say: I believe the SERIEs, as currently implemented,
> give _an illusion of_ security from kmap deadlock. But it feels like
> I'm failing to dislodge you from your belief in them, and they're
> not actively harmful, so I'll shut up - until provoked!
You shouldn't really give up if you think it's not right yet! :) Also I
now agree the ordering matters. So I'd define an ordering which is
PAGETABLE2 first, PAGETABLE second, and DEFAULT third. This, in
combination with the changes I will make to the kmap (only the DEFAULT kmap
reentrant), should fix all the deadlocks, right? The ordering between
DEFAULT and PAGETABLE should always be enforced: if we get a kmap on a
page it will be after we touch its respective pagetable, right?
I will try to get a new version out in the next few days. In short we
found that the current version can deadlock in mremap (double PAGETABLE kmap
in a row) or in fork (overlooked spinlock); both deadlocks are not
reproducible in real life for me, in normal usage at least, and the next
version should not deadlock there any longer.
BTW, did you check my fix for the contigous_pte in fixrange_init?
many thanks for the help,
Andrea
Andrea,
You've understandably gone ahead with pre4aa1, not waited for my
contiguous pkmap_page_table patch. I've now finished testing it,
and append below, since I prefer it to what you have in pre4aa1.
Two calls to fixrange_init, the second needing contiguous page
table address space, yet sharing the page table chosen by the
first call (FIXADDR_START is not aligned), would that really
work in general? And your fixrange_init is even more complicated
than it was before!
It didn't boot in my 1GB user virtual, 2GB physical memory,
PAE testing (yes, PAE with <= 4GB physical memory is not optimal,
but should work for testing). I was confused when I tried that,
I was thinking I had 1GB physical, and was trying to demonstrate
that pagetable_init then left pgd[2] empty_zero, corrupted by
vmalloc. But it must have been something else, I didn't
investigate and returned to testing my own patch.
arch/i386/config.in | 5 +
arch/i386/mm/init.c | 119 +++++++++++----------------------------------
include/asm-i386/highmem.h | 5 -
include/asm-i386/pgtable.h | 2
4 files changed, 36 insertions, 95 deletions
config.in: you're not offering user address space size configuration
when CONFIG_HIGHMEM64G, perhaps because you know of the boot failure
I got, but it's useful then too: I've added choice (without 3.5GB).
init.c: well, I resisted many of the changes I made earlier,
but had to simplify pagetable_init for the pgd[2] empty_zero issue,
it then made sense to simplify its #ifdefs and remove fixrange_init.
highmem.h: just removed the comment about "single pte table".
pgtable.h: parentheses around ptep in pte_kunmap(ptep) macro
Hugh
diff -urN 1804aa1/arch/i386/config.in 1804AA1/arch/i386/config.in
--- 1804aa1/arch/i386/config.in Tue Jan 22 12:08:35 2002
+++ 1804AA1/arch/i386/config.in Tue Jan 22 13:56:58 2002
@@ -168,7 +168,10 @@
fi
if [ "$CONFIG_HIGHMEM64G" = "y" ]; then
define_bool CONFIG_X86_PAE y
- define_bool CONFIG_1GB y
+ choice 'User address space size' \
+ "3GB CONFIG_1GB \
+ 2GB CONFIG_2GB \
+ 1GB CONFIG_3GB" 3GB
else
choice 'User address space size' \
"3GB CONFIG_1GB \
diff -urN 1804aa1/arch/i386/mm/init.c 1804AA1/arch/i386/mm/init.c
--- 1804aa1/arch/i386/mm/init.c Tue Jan 22 12:08:36 2002
+++ 1804AA1/arch/i386/mm/init.c Tue Jan 22 14:28:42 2002
@@ -167,65 +167,6 @@
set_pte_phys(address, phys, flags);
}
-static void __init fixrange_init (unsigned long start, unsigned long end, pgd_t *pgd_base, int contigous_pte)
-{
- pgd_t *pgd;
- pmd_t *pmd;
- pte_t *pte;
- int i, j;
- unsigned long vaddr;
- int nr_pte;
- void * pte_array;
-
- vaddr = start;
- i = __pgd_offset(vaddr);
- j = __pmd_offset(vaddr);
- pgd = pgd_base + i;
-
- if (contigous_pte) {
- if (start >= end)
- BUG();
- nr_pte = (end - start + PMD_SIZE - 1) >> PMD_SHIFT;
- if (j + nr_pte > PTRS_PER_PMD)
- BUG();
- pte_array = alloc_bootmem_low_pages(PAGE_SIZE * nr_pte);
- }
- for ( ; (i < PTRS_PER_PGD) && (vaddr != end); pgd++, i++) {
-#if CONFIG_X86_PAE
- if (pgd_none(*pgd)) {
- pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
- set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
- if (pmd != pmd_offset(pgd, 0))
- printk("PAE BUG #02!\n");
- }
- pmd = pmd_offset(pgd, vaddr);
-#else
- pmd = (pmd_t *)pgd;
-#endif
- for (; (j < PTRS_PER_PMD) && (vaddr != end); pmd++, j++) {
- if (pmd_none(*pmd)) {
- if (contigous_pte) {
- pte = (pte_t *) pte_array;
- pte_array += PAGE_SIZE;
- nr_pte--;
- } else
- pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
- set_pmd(pmd, mk_pmd_phys(__pa(pte), __pgprot(_KERNPG_TABLE)));
- if (pte != pte_offset_lowmem(pmd, 0))
- BUG();
- }
- vaddr += PMD_SIZE;
- }
- j = 0;
- }
- if (contigous_pte) {
- if (nr_pte < 0)
- BUG();
- if (nr_pte > 0)
- free_bootmem((unsigned long) pte_array, nr_pte * PAGE_SIZE);
- }
-}
-
static void __init pagetable_init (void)
{
unsigned long vaddr, end;
@@ -242,8 +183,24 @@
pgd_base = swapper_pg_dir;
#if CONFIG_X86_PAE
- for (i = 0; i < PTRS_PER_PGD; i++)
+ /*
+ * First set all four entries of the pgd.
+ * Usually only one page is needed here: if PAGE_OFFSET lowered,
+ * maybe three pages: need not be contiguous, but might as well.
+ */
+ pmd = (pmd_t *)alloc_bootmem_low_pages(KERNEL_PGD_PTRS*PAGE_SIZE);
+ for (i = 1; i < USER_PGD_PTRS; i++)
set_pgd(pgd_base + i, __pgd(1 + __pa(empty_zero_page)));
+ for (; i < PTRS_PER_PGD; i++, pmd += PTRS_PER_PMD)
+ set_pgd(pgd_base + i, __pgd(1 + __pa(pmd)));
+ /*
+ * Add low memory identity-mappings - SMP needs it when
+ * starting up on an AP from real-mode. In the non-PAE
+ * case we already have these mappings through head.S.
+ * All user-space mappings are explicitly cleared after
+ * SMP startup.
+ */
+ pgd_base[0] = pgd_base[USER_PGD_PTRS];
#endif
i = __pgd_offset(PAGE_OFFSET);
pgd = pgd_base + i;
@@ -252,14 +209,7 @@
vaddr = i*PGDIR_SIZE;
if (end && (vaddr >= end))
break;
-#if CONFIG_X86_PAE
- pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
- set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
-#else
- pmd = (pmd_t *)pgd;
-#endif
- if (pmd != pmd_offset(pgd, 0))
- BUG();
+ pmd = pmd_offset(pgd, 0);
for (j = 0; j < PTRS_PER_PMD; pmd++, j++) {
vaddr = i*PGDIR_SIZE + j*PMD_SIZE;
if (end && (vaddr >= end))
@@ -295,34 +245,27 @@
}
/*
- * Fixed mappings, only the page table structure has to be
- * created - mappings will be set by set_fixmap():
+ * Leave vmalloc() to create its own page tables as needed,
+ * but create the page tables at top of virtual memory, to be
+ * populated by kmap_high(), kmap_atomic(), and set_fixmap().
+ * kmap_high() assumes pkmap_page_table contiguous throughout.
*/
- vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
- fixrange_init(vaddr, 0, pgd_base, 0);
-
#if CONFIG_HIGHMEM
- /*
- * Permanent kmaps:
- */
vaddr = PKMAP_BASE;
- fixrange_init(vaddr, vaddr + PKMAP_SIZE, pgd_base, 1);
+#else
+ vaddr = FIXADDR_START;
+#endif
+ pmd = pmd_offset(pgd_offset_k(vaddr), vaddr);
+ i = (0UL - (vaddr & PMD_MASK)) >> PMD_SHIFT;
+ pte = (pte_t *)alloc_bootmem_low_pages(i*PAGE_SIZE);
+ for (; --i >= 0; pmd++, pte += PTRS_PER_PTE)
+ set_pmd(pmd, mk_pmd_phys(__pa(pte), __pgprot(_KERNPG_TABLE)));
+#if CONFIG_HIGHMEM
pgd = swapper_pg_dir + __pgd_offset(vaddr);
pmd = pmd_offset(pgd, vaddr);
pte = pte_offset_lowmem(pmd, vaddr);
pkmap_page_table = pte;
-#endif
-
-#if CONFIG_X86_PAE
- /*
- * Add low memory identity-mappings - SMP needs it when
- * starting up on an AP from real-mode. In the non-PAE
- * case we already have these mappings through head.S.
- * All user-space mappings are explicitly cleared after
- * SMP startup.
- */
- pgd_base[0] = pgd_base[USER_PTRS_PER_PGD];
#endif
}
diff -urN 1804aa1/include/asm-i386/highmem.h 1804AA1/include/asm-i386/highmem.h
--- 1804aa1/include/asm-i386/highmem.h Tue Jan 22 12:08:39 2002
+++ 1804AA1/include/asm-i386/highmem.h Tue Jan 22 13:53:21 2002
@@ -48,11 +48,6 @@
KM_NR_SERIES,
};
-/*
- * Right now we initialize only a single pte table. It can be extended
- * easily, subsequent pte tables have to be allocated in one physical
- * chunk of RAM.
- */
#define LAST_PKMAP 1024
#define LAST_PKMAP_MASK (LAST_PKMAP-1)
#define PKMAP_SIZE ((LAST_PKMAP*KM_NR_SERIES) << PAGE_SHIFT)
diff -urN 1804aa1/include/asm-i386/pgtable.h 1804AA1/include/asm-i386/pgtable.h
--- 1804aa1/include/asm-i386/pgtable.h Tue Jan 22 12:08:39 2002
+++ 1804AA1/include/asm-i386/pgtable.h Tue Jan 22 13:52:37 2002
@@ -382,7 +382,7 @@
__pte_offset(address))
#define pte_offset_lowmem(dir, address) ((pte_t *) pmd_page_lowmem(*(dir)) + \
__pte_offset(address))
-#define pte_kunmap(ptep) kunmap_vaddr((unsigned long) ptep & PAGE_MASK)
+#define pte_kunmap(ptep) kunmap_vaddr((unsigned long)(ptep) & PAGE_MASK)
/*
* The i386 doesn't have any external MMU info: the kernel page
On Tue, 22 Jan 2002, Andrea Arcangeli wrote:
> On Tue, Jan 22, 2002 at 06:01:43PM +0000, Hugh Dickins wrote:
>
> > Actually, I'm coming to think that most of the pte_offset_atomic
> > calls should be replaced by a pte_offset_nowait: which gives the usual
> > pte_offset kmap if it can without blocking, or an atomic if not - your
> > pte_kunmap seems nicely designed for that very possibility. Whether
>
> Possible optimization. Before doing so we should really measure how the
> kmaps are faster compared to the kmap_atomic I believe. This could be a
> performance optimization. But it could also be that the kmap_atomic are
> faster and it doesn't worth to get into the kmap overhead when
> persistence is not necessary.
Optimization, yes, can just avoid deadlocks with atomics for now.
I don't expect the kmap_atomics to be slow, but it's a shame to have
to keep on using them in so many places where there's a likelihood
that the pagetable in question is already mapped, no chance of blocking.
Particularly with the pagetable cache: kmapped on exit, ready for reuse.
> pagetable2 is needed every single time you need to access two pagetables
> at the same time with a static kmap. Infact now that I notice it also
> mremap needs it. With mremap both the two kmaps could return
> entries in the PAGETABLE serie, no-way around it.
Yes, I was politely turning a blind eye to move_one_page for now.
It's not a high-priority, and I'm sure it's easily fixed one way or
another. I hope we don't need PAGETABLE2 for it, I'd imagined one
pagetable kmapped from pool and the other nowaited (atomic if none
in pool). Definitely needs care yes, but as in copy_page_range,
the situation here is well defined (cannot be in a page fault,
doesn't involve filesystem, task cannot have other kmaps pinned),
so no great problem.
> So to solve all this mess between mixture of PAGETABLE2 and PAGETABLE
> (and even with DEFAULT if /dev/kmem will ever mess with highmem), I will
> just drop the recursive behaviour of kmap. That is not needed for
> coherency, that was only a kind of performance optimization and to
> maintain a page->virtual, so I will just kill it and then we're done, no
> mixture mess any longer. I will left it only for the DEFAULT serie,
> because of the page->virtual, all other series will not care about
> page->virtual any longer, page->virtual was a bad idea since the first
> place, it would been better to keep the information on the current
> stack, like all the other series will do.
As ever you think too fast for me: I need to ponder more about it.
At the moment I feel you're now over-reacting, I'd prefer to keep
pagetable page->virtual if we can.
> > > Same is true between KM_SERIE_PAGETABLE and KM_SERIE_DEFAULT, the reason
> > > it is obviously correct here is because the mixture cannot happen since
> > > the first place, because pages will pass through the freelist before the
> > > two can be mixed, and so they will be kunmapped first.
> >
> > I think you're mistaken there: passing through the
> > freelist doesn't affect a page's kmapping at all, does it?
> > We could make changes to do so, but that wouldn't be enough.
>
> Doing:
>
> page = alloc_page();
> kmap(page);
> __free_page(page)
>
> is a bug (it will leak kmap entries eventually), this is what i meant
> with "passing through the freelist" as serializing point between
> PAGETABLE and DEFAULT series.
Oh sure, it's wrong to kmap then free like that. What I meant is that
the pkmap_count is always kept one higher for persistence, so even when
you kunmap(page) before the __free_page(page) above, the page->virtual
mapping (appropriate to a particular serie) remains, and will be used
by the next user of the page (unless a flush_all_zero_pkmaps intervenes).
I'm a bit confused by your explanation here, unsure whether I'm just
retelling you what you already well know (but don't see how you could
have called passing through the freelist a serializing point between
PAGETABLE and DEFAULT if you do).
> > Well, I've had my say: I believe the SERIEs, as currently implemented,
> > give _an illusion of_ security from kmap deadlock. But it feels like
> > I'm failing to dislodge you from your belief in them, and they're
> > not actively harmful, so I'll shut up - until provoked!
>
> You shouldn't really giveup if you think it's not right yet! :) Also I
> now agree the ordering matters. So I'd define an ordering which is
> PAGETABLE2 first, PAGETABLE second, and DEFAULT third. This in
> combination of the changes I will make to the kmap (only DEFAULT kmap
> reentrant), should fix all the deadlocks, right? The ordering between
> DEFAULT and PAGETABLE should be always enforced, if we get a kmap on a
> page it will be after we touch its respective pagetable, right?
Again, I need to think more about this, feel you're rushing.
> I will try to get a new version out in the next few days. In short we
> found that current version can deadlock in mremap (double PAGETABLE kmap
> in a raw) or in fork (overlooked the spinlock), both deadlocks are not
> reproducible in real life for me in normal usage at least, next version
> should not deadlock there any longer.
Yes, neither of those potential deadlocks is high-priority to fix,
and there are plenty of others lurking in the ordering issue.
> BTW, did you checked my fix for the contiogous_pte in fixinit_range?
Yes, not keen on it! You'll have my separate mail with patch by now.
> many thanks for the help,
Glad to, and glad we're now more or less "on the same page"!
Hugh
On Tue, Jan 22, 2002 at 09:41:16PM +0000, Hugh Dickins wrote:
> On Tue, 22 Jan 2002, Andrea Arcangeli wrote:
> > On Tue, Jan 22, 2002 at 06:01:43PM +0000, Hugh Dickins wrote:
> >
> > > Actually, I'm coming to think that most of the pte_offset_atomic
> > > calls should be replaced by a pte_offset_nowait: which gives the usual
> > > pte_offset kmap if it can without blocking, or an atomic if not - your
> > > pte_kunmap seems nicely designed for that very possibility. Whether
> >
> > Possible optimization. Before doing so we should really measure how the
> > kmaps are faster compared to the kmap_atomic I believe. This could be a
> > performance optimization. But it could also be that the kmap_atomic are
> > faster and it doesn't worth to get into the kmap overhead when
> > persistence is not necessary.
>
> Optimization, yes, can just avoid deadlocks with atomics for now.
Let's speak about this later. We should ring a bell as soon as we know
what's really faster (invlpg at every kmap or global tlb flush once
every 1024 kmaps?).
> Yes, I was politely turning a blind eye to move_one_page for now.
> It's not a high-priority, and I'm sure it's easily fixed one way or
> another. I hope we don't need PAGETABLE2 for it, I'd imagined one
> pagetable kmapped from pool and the other nowaited (atomic if none
> in pool). Definitely needs care yes, but as in copy_page_range,
> the situation here is well defined (cannot be in a page fault,
> doesn't involve filesystem, task cannot have other kmaps pinned),
> so no great problem.
It could also be solved without PAGETABLE2 by reworking some code,
yes, but the problem between PAGETABLE and DEFAULT still remains.
>
> > So to solve all this mess between mixture of PAGETABLE2 and PAGETABLE
> > (and even with DEFAULT if /dev/kmem will ever mess with highmem), I will
> > just drop the recursive behaviour of kmap. That is not needed for
> > coherency, that was only a kind of performance optimization and to
> > maintain a page->virtual, so I will just kill it and then we're done, no
> > mixture mess any longer. I will left it only for the DEFAULT serie,
> > because of the page->virtual, all other series will not care about
> > page->virtual any longer, page->virtual was a bad idea since the first
> > place, it would been better to keep the information on the current
> > stack, like all the other series will do.
>
> As ever you think too fast for me: I need to ponder more about it.
> At the moment I feel you're now over-reacting, I'd prefer to keep
> pagetable page->virtual if we can.
page->virtual will remain for all the DEFAULT serie, to avoid breaking
the regular kmap pagecache users. But to keep a page->virtual for each
serie we'd need a page->virtual[KM_NR_SERIES] array, which is very
costly in terms of ram, so I prefer to manage the virtual address on the
stack, just like we always have to do with the atomic kmaps. Recursion
on the same page->virtual (at the second simultaneous kmap of the same
page) will be allowed only for the DEFAULT serie; all other series will
allocate a new kmap entry even if the page is already mapped in another
entry of the same serie. With pagetable mappings we never use
page->virtual: consider that pte_kunmap is the same regardless of the
kind of kmap that happened (atomic or not), and we never make use of
page_address on the pagetables. So it's going to work very easily and
we'll save tons of static ram.
> > > > Same is true between KM_SERIE_PAGETABLE and KM_SERIE_DEFAULT, the reason
> > > > it is obviously correct here is because the mixture cannot happen since
> > > > the first place, because pages will pass through the freelist before the
> > > > two can be mixed, and so they will be kunmapped first.
> > >
> > > I think you're mistaken there: passing through the
> > > freelist doesn't affect a page's kmapping at all, does it?
> > > We could make changes to do so, but that wouldn't be enough.
> >
> > Doing:
> >
> > page = alloc_page();
> > kmap(page);
> > __free_page(page)
> >
> > is a bug (it will leak kmap entries eventually), this is what i meant
> > with "passing through the freelist" as serializing point between
> > PAGETABLE and DEFAULT series.
>
> Oh sure, it's wrong to kmap then free like that. What I meant is that
> the pkmap_count is always kept one higher for persistence, so even when
> you kunmap(page) before the __free_page(page) above, the page->virtual
> mapping (appropriate to a particular serie) remains, and will be used
> by the next user of the page (unless a flush_all_zero_pkmaps intervenes).
correct. I'm convinced the mixture problem completely invalidates the
deadlock avoidance using the series, so the only way to fix the
deadlocks is to avoid the mixture between the series.
>
> I'm a bit confused by your explanation here, unsure whether I'm just
> retelling you what you already well know (but don't see how you could
> have called passing through the freelist a serializing point between
> PAGETABLE and DEFAULT if you do).
>
> > > Well, I've had my say: I believe the SERIEs, as currently implemented,
> > > give _an illusion of_ security from kmap deadlock. But it feels like
> > > I'm failing to dislodge you from your belief in them, and they're
> > > not actively harmful, so I'll shut up - until provoked!
> >
> > You shouldn't really giveup if you think it's not right yet! :) Also I
> > now agree the ordering matters. So I'd define an ordering which is
> > PAGETABLE2 first, PAGETABLE second, and DEFAULT third. This in
> > combination of the changes I will make to the kmap (only DEFAULT kmap
> > reentrant), should fix all the deadlocks, right? The ordering between
> > DEFAULT and PAGETABLE should be always enforced, if we get a kmap on a
> > page it will be after we touch its respective pagetable, right?
>
> Again, I need to think more about this, feel you're rushing.
Well, if you see any problem let me know.
> > I will try to get a new version out in the next few days. In short we
> > found that current version can deadlock in mremap (double PAGETABLE kmap
> > in a raw) or in fork (overlooked the spinlock), both deadlocks are not
> > reproducible in real life for me in normal usage at least, next version
> > should not deadlock there any longer.
>
> Yes, neither of those potential deadlocks is high-priority to fix,
> and there are plenty of others lurking in the ordering issue.
The ordering thing is really simple I think. There are very few places
where we kmap and kmap_pagetable at the same time, and I don't see how
we could ever kmap before kmap_pagetable, so that part looks fine to
me. The other ordering constraint, between kmap_pagetable2 and
kmap_pagetable, is under our full control and we will have to check it in
only two places: mremap and fork, so the ordering sounds like a real no
brainer. The mixture-avoidance of the series is the more difficult part
I believe (with page->virtual still retained for the DEFAULT serie).
>
> > BTW, did you checked my fix for the contiogous_pte in fixinit_range?
>
> Yes, not keen on it! You'll have my separate mail with patch by now.
thanks, my version is a few-line change to fixrange_init, yours is
bigger, but I will try to merge yours as soon as I have checked it in
detail (I trust you that it is better than my dirty fix :). OTOH, while the
"best" version should go into 2.5, for 2.4 I would have lived just fine
with the shortest possible fix that I did yesterday too :)
> Glad to, and glad we're now more or less "on the same page"!
yep :)
Andrea
Andrea Arcangeli writes:
> Let's speak about this later. We should ring a bell as soon as we know
> what's really faster (invlpg at every kmap or global tlb flush once
> every 1024 kmaps?).
It would be really good if we could have whatever hooks are necessary
for each architecture to decide this issue in its own way. On PPC for
example a global tlb flush is really expensive, it would be quicker
for us to either flush each kmap page individually or to flush the
range of addresses used for kmaps when we wrap. At the very least I
would like to have a flush_tlb_kernel_range function which would get
called at the end of flush_all_zero_pkmaps instead of flush_tlb_all.
Regards,
Paul.
On Wed, Jan 23, 2002 at 11:56:03AM +1100, Paul Mackerras wrote:
> Andrea Arcangeli writes:
>
> > Let's speak about this later. We should ring a bell as soon as we know
> > what's really faster (invlpg at every kmap or global tlb flush once
> > every 1024 kmaps?).
>
> It would be really good if we could have whatever hooks are necessary
> for each architecture to decide this issue in its own way. On PPC for
you could implement pte_offset_nowait as you prefer (it has to
be implemented in include/asm like pte_offset), so yes, if we implement
nowait, you should be free to choose what's best to do on ppc
automatically while fixing the compilation failures.
At the moment there is either "persistent across schedule" or "atomic
local" (not the optimized "_nowait" suggested by Hugh as a replacement for
"_atomic") and most of the time we depend on the requirements of the
caller about nonblocking due to spinlocks, or persistence across schedules.
So there's not much to choose; only a few places could use either pte_offset
or pte_offset_atomic (in the long run we could turn those, and the other
_atomic calls, into _nowait ones per Hugh's suggestion, and that would give
you the "hook").
> example a global tlb flush is really expensive, it would be quicker
> for us to either flush each kmap page individually or to flush the
> range of addresses used for kmaps when we wrap. At the very least I
> would like to have a flush_tlb_kernel_range function which would get
> called at the end of flush_all_zero_pkmaps instead of flush_tlb_all.
That's certainly doable; on x86 it will just default to the
global tlb flush because we can't flush it more efficiently than that.
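For illustration, the hook could look roughly like this
(flush_tlb_kernel_range is Paul's proposed name, not an existing function;
PKMAP_BASE/PKMAP_SIZE as in this patch):
	/* at the end of flush_all_zero_pkmaps(), instead of flush_tlb_all(): */
	flush_tlb_kernel_range(PKMAP_BASE, PKMAP_BASE + PKMAP_SIZE);
	/* and on x86 it would simply fall back to the global flush: */
	#define flush_tlb_kernel_range(start, end)	flush_tlb_all()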
Andrea
On Wed, 23 Jan 2002, Andrea Arcangeli wrote:
>
> page->virtual will remain for all the DEFAULT serie, to avoid breaking
> the regular kmap pagecache users. But to keep a page->virtual for each
> serie we'd need a page->virtual[KM_NR_SERIES] array, which is very
> costly in terms of ram, ....
Agreed, not an option we'd want to use.
> correct. I'm convinced the mixture problem invalidates completly the
> deadlock avoidance using the series, so the only way to fix the
> deadlocks is to avoid the mixture between the series.
First half agreed, second half not sure. Maybe no series at all.
Could it be worked with just the one serie, and count in task_struct
of kmaps "raised" by task, only task with count >=1 allowed to take
the last N kmaps? I suspect something like that would work if not
scheduling otherwise, but no good held across allocating another
resource e.g. memory in fault. Probably rubbish, not thought out
fully, just mentioned in case it gives you an idea. Another such
half-baked idea I've played with a little is using one or two ptes
of the user address space (e.g. at top of stack) as per-task kmaps.
> The ordering thing is really simple I think. There are very few places
> where we kmap and kmap_pagetable at the same time. And I don't see how
> can could ever kmap before kmap_pagetable. so that part looks fine to me.
Nice if that's so, but I think you're sadly deluded ;-)
Imagine sys_read going to file_read_actor (kmap, __copy_to_user, kunmap),
imagine the __copy_to_user faulting (needs to kmap the pagetable),
imagine the place faulted is hole in ???fs file (kmap in clear_highpage),
imagine low on memory schedules all over.
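Spelled out as a rough call chain (2.4 names, details vary by filesystem):
	sys_read()
	  generic_file_read() -> file_read_actor()
	    kmap(page)                            /* DEFAULT kmap pinned */
	    __copy_to_user() faults
	      handle_mm_fault()/handle_pte_fault() /* PAGETABLE kmap taken */
	        do_no_page() -> ->nopage() -> readpage()/clear_highpage()
	          kmap(new page)                   /* a second DEFAULT kmap */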
There's more I want to consider in your mail, these were
just a few points I felt I could say something on quickly.
Hugh
On Tue, Jan 22, 2002 at 07:29:08PM +0000, Hugh Dickins wrote:
>
> It didn't boot in my 1GB user virtual, 2GB physical memory,
> PAE testing (yes, PAE with <= 4GB physical memory is not optimal,
> but should work for testing). I was confused when I tried that,
What I found is with 2.4.18pre4aa1 vanilla, when CONFIG_2G=y and
CONFIG_HIGHMEM=y, my system stops booting with:
Uncompressing Linux... Ok, booting the kernel.
With Hugh's patch, 2.4.18pre4aa1 will boot with CONFIG_2G=y
and CONFIG_HIGHMEM=y, which is good.
In my previous testing with the 1-2-3-gb patch, I always
had CONFIG_NOHIGHMEM=y.
Based on these findings, I think inclusion of Hugh's patch
is a good thing.
--
Randy Hron
On January 23, 2002 06:38 am, Hugh Dickins wrote:
> First half agreed, second half not sure. Maybe no series at all.
> Could it be worked with just the one serie,
Pardon me, but what is a serie? It's not an english word:
http://www.m-w.com/dictionary.htm
--
Daniel
On Wed, 23 Jan 2002, Daniel Phillips wrote:
> On January 23, 2002 06:38 am, Hugh Dickins wrote:
> > First half agreed, second half not sure. Maybe no series at all.
> > Could it be worked with just the one serie,
>
> Pardon me, but what is a serie? It's not an english word:
>
> http://www.m-w.com/dictionary.htm
It's a technical term meaning "element of series", recently coined
by our very own Mr. Andrea Arcangeli, whose "pte-highmem-5 patch"
will be cited in future editions of the Oxford English Dictionary.
Hugh
On Wed, Jan 23, 2002 at 05:38:47AM +0000, Hugh Dickins wrote:
> On Wed, 23 Jan 2002, Andrea Arcangeli wrote:
> >
> > page->virtual will remain for all the DEFAULT serie, to avoid breaking
> > the regular kmap pagecache users. But to keep a page->virtual for each
> > serie we'd need a page->virtual[KM_NR_SERIES] array, which is very
> > costly in terms of ram, ....
>
> Agreed, not an option we'd want to use.
>
> > correct. I'm convinced the mixture problem invalidates completly the
> > deadlock avoidance using the series, so the only way to fix the
> > deadlocks is to avoid the mixture between the series.
>
> First half agreed, second half not sure. Maybe no series at all.
> Could it be worked with just the one serie, and count in task_struct
> of kmaps "raised" by task, only task with count >=1 allowed to take
> the last N kmaps? I suspect something like that would work if not
> scheduling otherwise, but no good held across allocating another
> resource e.g. memory in fault. Probably rubbish, not thought out
I can imagine an alternate design to avoid the deadlock without the
series (it doesn't sound exactly like what you had in mind with count >= 1,
but it's along the same lines of using the task_struct to keep some per-task
information about the kmaps). It has some more overhead though, but it
has the very nice property of also removing the ordering
requirements.
The only new requirement would be the max number of kmaps run by a single
task in sequence (which is not really a new requirement: we had the very
same requirement before too, in the form of NR_KMAP_SERIES).
The first kmap run by a task will try to reserve MAX_NR_TASK_KMAPS
slots (we can keep track of whether it's the first kmap with some task_struct
field, along the lines of what you suggested); if it fails to reserve all
of them, it will release the ones it allocated in the meantime
and it will go to sleep waiting for the resources to be released by
somebody else. If it succeeds it will use the first reserved entry to
satisfy the kmap. The later kmaps will use the other two reserved kmap
slots preallocated at the first kmap. If the kernel tries to allocate
one more kmap entry over MAX_NR_TASK_KMAPS we can BUG().
In short this makes sure that if a kmap has to sleep, it will always be the
first one. This ensures deadlock avoidance.
This would not only solve the deadlock, it also drops the ordering
requirements, plus it solves the mixture thing as well
(optimizations are possible: if the first kmap maps a page that is already
mapped we'd need to reserve only MAX_NR_TASK_KMAPS-1 entries, simply by doing
the reservation + first kmap atomically, which will be natural). We can
define MAX_NR_TASK_KMAPS (suggestions for a better define name are
welcome) as 3: one kmap for the pagecache, one for the first
pagetable, and one for the second pagetable map (mremap).
Comments? Now I tend to believe this way is simpler after all, mostly
because it doesn't create special cases with special series, and it
makes life simpler for the kmap users; in short, it reduces the
anti-deadlock requirements dramatically.
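A hypothetical sketch of what the first-kmap reservation could look like in
kmap_high (reserve_pkmap_slots, map_into_reserved_slot and the task_struct
counters are made-up names for illustration; reserve_pkmap_slots is assumed
to drop any partial reservation and return 0 on failure):
	#define MAX_NR_TASK_KMAPS 3	/* pagecache + pagetable + second pagetable */
	void *kmap_high(struct page *page)
	{
		if (!current->kmaps_reserved) {
			/* first kmap of this task: the only one allowed to
			 * sleep, it must secure everything the task may need */
			while (!reserve_pkmap_slots(MAX_NR_TASK_KMAPS))
				sleep_on(&pkmap_map_wait);
			current->kmaps_reserved = MAX_NR_TASK_KMAPS;
		}
		if (current->kmaps_used >= MAX_NR_TASK_KMAPS)
			BUG();	/* over the agreed per-task maximum */
		current->kmaps_used++;
		return map_into_reserved_slot(page);	/* never sleeps */
	}
The last kunmap of the task would symmetrically drop the reservation and
wake up the sleepers.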
> fully, just mentioned in case it gives you an idea. Another such
> half-baked idea I've played with a little is using one or two ptes
> of the user address space (e.g. at top of stack) as per-task kmaps.
>
> > The ordering thing is really simple I think. There are very few places
> > where we kmap and kmap_pagetable at the same time. And I don't see how
> > can could ever kmap before kmap_pagetable. so that part looks fine to me.
>
> Nice if that's so, but I think you're sadly deluded ;-)
> Imagine sys_read going to file_read_actor (kmap, __copy_to_user, kunmap),
> imagine the __copy_to_user faulting (needs to kmap the pagetable),
I said the reverse, but this is the path I meant, of course. I
didn't see it happening the other way around anywhere, mostly because it
would be a bug if we ever kmapped during pagefaults, since we could deadlock
right now in such a case.
> imagine the place faulted is hole in ???fs file (kmap in
> clear_highpage),
we can't call clear_highpage during page faults (hey, if we ever
did, that would just be a deadlock condition right now in 2.4.17 too,
without pte-highmem applied :).
> imagine low on memory schedules all over.
schedules really shouldn't matter at all here.
thanks again for the helpful feedback,
Andrea
Let me reorder the mail (to avoid deadlock:) commenting on last first...
On Thu, 24 Jan 2002, Andrea Arcangeli wrote:
> On Wed, Jan 23, 2002 at 05:38:47AM +0000, Hugh Dickins wrote:
> > On Wed, 23 Jan 2002, Andrea Arcangeli wrote:
> > > The ordering thing is really simple I think. There are very few places
> > > where we kmap and kmap_pagetable at the same time. And I don't see how
> > > can could ever kmap before kmap_pagetable. so that part looks fine to me.
> >
> > Nice if that's so, but I think you're sadly deluded ;-)
> > Imagine sys_read going to file_read_actor (kmap, __copy_to_user, kunmap),
> > imagine the __copy_to_user faulting (needs to kmap the pagetable),
>
> I said the reverse, but this is the right path I meant of course. I
> didn't see the other way happening anywhere, mostly because it would be
> a bug if we would ever kmap during pagefaults because we could deadlock
> just now in such a case.
>
> > imagine the place faulted is hole in ???fs file (kmap in
> > clear_highpage),
>
> we can't call clear_highpage during page faults (hey, if we would ever
> do, that would be just a deadlock condition right now in 2.4.17 too,
> without pte-highmem applied :).
Does your ";)" mean that you know very well that it's done in
several places? I may be mistaken, but block_read_full_page,
nfs_readpage_sync, ramfs_readpage and shmem_getpage_locked (latter
a clear_highpage), all look like they use kmap in a place which could
deadlock already. I bet there are other instances. So far as I know,
we've not actually seen such deadlocks, the current LAST_PKMAP 1024
appears in practice to be enough to make them very unlikely (so far;
and I feel much safer now you've lifted the 512 limit on HIGHMEM64G).
I've CC'ed Ben: I think he atomicized some kmapping copy macros a
few months back, may have a view on this.
And now to the first part:
> I can imagine an alternate design to avoid the deadlock without the
> series (doesn't sound exactly what you had in mind with count >= 1, but
> it's on the same lines about using the task_struct to keep some per-task
> information about the kmaps), it has some more overhead though, but it
> has the very nice goodness of also invalidating the ordering
> requirements.
>
> The only new requirement would become the max number of kmap run by a single
> task in sequence (which is not a new requirement, we had the very
> requirement before too which was the NR_KMAP_SERIES).
>
> The first kmap run by a task will try to reserve the MAX_NR_TASK_KMAPS
> slots (we can keep track if it's the first kmap with some task_structure
> field, on the lines of what you suggested), if it fails to reserve all
> of them, it will release the ones it tried to allocate in the meantime
> and it will go to sleep waiting for the resourced to be released by
> somebody else. If it succeed it will use the first reserved entry to
> succeed the kmap. The later kmaps will use the other two reserved kmap
> slots preallocated at the first kmap. If the kernel tries to allocate
> one more kmap entry over MAX_NR_TASK_KMAPS we can BUG().
>
> In short this makes sure if a kmap has to sleep, it will always be the
> first one. This ensures the deadlock avoidance.
>
> This would solve not only the deadlock but it also drops the ordering
> requirements, plus it will solve the mixture thing as well
> (optimizations are possible, if the first kmap maps a page just mapped
> we'd need to reserve only MAX_NR_TASK_KMAPS-1 entries, simply doing the
> reservation + first kmap atomically, which will be natural). We can
> define MAX_NR_TASK_KMAPS (suggestions for a better define name are
> welcome) to 3, one for the kmap for the pagecache, one for the first
> pagetable, and one for the second pagetable map (mremap).
>
> Comments? Now I tend to believe this way is simpler after all, mostly
> because it doesn't create special cases with special series, and it
> makes life simpler for the kmap users, in short it reduces the
> anti-deadlock requirement dramatically.
At first I was very enthusiastic about this proposal: it gives solid,
convincing protection against kmap deadlocks. But I have two doubts.
The first, minor doubt is that I don't believe you can BUG() beyond
MAX_NR_TASK_KMAPS: in 2.5 perhaps, or a 2.4-pre, but I think we have
to live with the fact that there are or may already be potential kmap
deadlocks, and in a release it would be more harmful to BUG on every
entry to that potential, than to hang if it actually deadlocks (but
we must make LAST_PKMAP large enough to make that unlikely enough).
I think your MAX would be 2 not 3, unless you are allowing for the
danger kmaps such as I list above. Neither mremap nor copy_page_range
is done during a fault, there's no underlying kmap, they can use both.
But what about the loop driver? Can that raise the max? indefinitely?
The second, major doubt is: I don't see how this could be implemented,
without making kmaps so much more heavyweight than they are at present,
that we'd avoid them wherever possible (going for the atomic alternative
in many places). Every kmap_high which at present finds page->virtual
set and increments pkmap_count, would in the new scheme have to go and
find MAX_NR_TASK_KMAPS-1 free slots, and increment their pkmap_counts;
yet what proportion of these tasks would go on to use the additional
kmaps? If you can implement it with little overhead, it'll be great!
I've been thinking less ambitiously. Not worrying about the existing
potential deadlocks, just making sure that pagetable kmaps won't
increase the possibility of deadlock at all. I'm sure just doubling
LAST_PKMAP would not be sufficient: the way a pagetable kmap is held
over the page allocation, the system under memory pressure could
degrade to a point where all the kmaps were held by faulting
pagetables. I'm thinking of limiting the total number of pagetable
kmaps at any one time, and using pte_offset_nowait everywhere(?)
i.e. regetting the mapping, hopefully still cached in page->virtual,
after possibly waiting to allocate page.
Hugh