I have spent some time working on AIX, which pages its kernel memory.
It pins the interrupt handler functions, and any data that they access,
but does not pin the other code.
I'm looking for links as to why (unless I'm mistaken) Linux doesn't do
this, so I can better understand the system.
Thanks, and sorry for the broadcast message. My web search turned up
nothing.
Russ Lewis
On Fri, 26 Jul 2002, Russell Lewis wrote:
> I have spent some time working on AIX, which pages its kernel memory.
> It pins the interrupt handler functions, and any data that they access,
> but does not pin the other code.
>
> I'm looking for links as to why (unless I'm mistaken) Linux doesn't do
> this, so I can better understand the system.
>
> Thanks, and sorry for the broadcast message. My web search turned up
> nothing.
>
> Russ Lewis
You'll probably get a zillion replies on this.
Paging is expensive; the fastest kernel is one that is not paged.
Also, the kernel is very small, so you would gain only a few pages,
maybe 80 to 90, at the expense of spending CPU cycles on paging.
85 * 4096 = 348,160 bytes, about 1/3 of a megabyte gained; hardly
worth the cost.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.
On 26 Jul 2002, Robert Love wrote:
> On Fri, 2002-07-26 at 12:10, Rik van Riel wrote:
> > On 26 Jul 2002, Robert Love wrote:
> >
> > > Better question is, why would we have page-able kernel memory?
> >
> > We don't want to have generic page-able kernel memory.
> >
> > However, it might be useful to be able to reclaim or page
> > out data structures that might otherwise gobble up all of
> > RAM and crash the machine, say page tables.
>
> I agree a better solution than high-pte is probably needed. Are shared
> page tables and/or large pages insufficient?
Large pages and/or shared page tables should be more than
sufficient to handle all 'benign' real workloads.
However, 'malicious' workloads can easily generate the
need for more pagetables than what will fit into physical
RAM; at that point you just _have_ to throw some of these
page tables out of RAM. If the data can be reconstructed
from the VMA and the page cache, we can just blow the page
table away. If it can't, we have to come up with another
solution (maybe as simple as killing the application).
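To make the distinction concrete, here is a rough userspace illustration
(not kernel code, and the file name is made up). Pages of a shared,
file-backed mapping can always be rebuilt from the VMA plus the page
cache and the file on disk, so both the pages and the page tables
describing them could be dropped; dirty anonymous pages have no backing
store other than swap, so their page tables cannot simply be blown away:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 16 * 4096;

    /* File-backed, shared mapping: if the kernel throws away these
     * PTEs (or the pages), it can rebuild them later from the VMA,
     * the page cache and the file itself. */
    int fd = open("data.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, len) < 0)
        return 1;
    char *file_mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

    /* Anonymous mapping: once dirtied, these pages live nowhere but
     * RAM (or swap), so their page tables can't just be discarded
     * without losing data. */
    char *anon_mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (file_mem == MAP_FAILED || anon_mem == MAP_FAILED)
        return 1;

    memset(file_mem, 0xaa, len);    /* reconstructable */
    memset(anon_mem, 0x55, len);    /* not reconstructable */

    munmap(file_mem, len);
    munmap(anon_mem, len);
    close(fd);
    return 0;
}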
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Fri, 2002-07-26 at 12:10, Rik van Riel wrote:
> On 26 Jul 2002, Robert Love wrote:
>
> > Better question is, why would we have page-able kernel memory?
>
> We don't want to have generic page-able kernel memory.
>
> However, it might be useful to be able to reclaim or page
> out data structures that might otherwise gobble up all of
> RAM and crash the machine, say page tables.
I agree a better solution than high-pte is probably needed. Are shared
page tables and/or large pages insufficient?
Robert Love
On 26 Jul 2002, Robert Love wrote:
> On Fri, 2002-07-26 at 10:59, Russell Lewis wrote:
>
> > I have spent some time working on AIX, which pages its kernel memory.
> > It pins the interrupt handler functions, and any data that they access,
> > but does not pin the other code.
> >
> > I'm looking for links as to why (unless I'm mistaken) Linux doesn't do
> > this, so I can better understand the system.
>
> Better question is, why would we have page-able kernel memory?
We don't want to have generic page-able kernel memory.
However, it might be useful to be able to reclaim or page
out data structures that might otherwise gobble up all of
RAM and crash the machine, say page tables.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Fri, 2002-07-26 at 10:59, Russell Lewis wrote:
> I have spent some time working on AIX, which pages its kernel memory.
> It pins the interrupt handler functions, and any data that they access,
> but does not pin the other code.
>
> I'm looking for links as to why (unless I'm mistaken) Linux doesn't do
> this, so I can better understand the system.
Better question is, why would we have page-able kernel memory?
It would complicate kernel space drastically for little gain. It is not
that we cannot, or that there is some specific technical reason why not;
it is just an issue of taste. And lack of drugs.
Robert Love
On Fri, 2002-07-26 at 18:59, Russell Lewis wrote:
> I have spent some time working on AIX, which pages its kernel memory.
> It pins the interrupt handler functions, and any data that they access,
> but does not pin the other code.
>
> I'm looking for links as to why (unless I'm mistaken) Linux doesn't do
> this, so I can better understand the system.
Memory is relatively cheap, and the complexity of such a paging kernel
is huge (you have to pin down disk driver and I/O paths for example).
Linux prefers to try to keep simple debuggable approaches to things.
On Jul 26, 2002 10:59 -0700, Russell Lewis wrote:
> I have spent some time working on AIX, which pages its kernel memory.
> It pins the interrupt handler functions, and any data that they access,
> but does not pin the other code.
>
> I'm looking for links as to why (unless I'm mistaken) Linux doesn't do
> this, so I can better understand the system.
Because it is complex. Linus would rather the kernel stay small and
non-pageable than grow large enough to need paging (which would in
itself add even more size to the kernel).
I'm sure I've read postings on the subject on l-k, but they would be
hard to find.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
[email protected] said:
> Memory is relatively cheap, and the complexity of such a paging
> kernel is huge (you have to pin down disk driver and I/O paths for
> example). Linux prefers to try to keep simple debuggable approaches to
> things.
You could do it. Start with kmalloc_pageable (probably actually
vmalloc_pageable) and introduce new sections for pageable data and text,
which can be marked just as init sections are currently. Introduce it
slowly, adding it a little at a time like we did SMP, and like we _should_
have done preemption.
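Purely as a sketch of what I mean by new sections (none of these names
exist; they are invented here, modelled on the existing __init/__initdata
markers), it might look something like:

/* Hypothetical sketch: __pageable/__pageabledata don't exist, they're
 * modelled on __init/__initdata, which drop code and data into
 * dedicated linker sections. */
#define __pageable      __attribute__((__section__(".text.pageable")))
#define __pageabledata  __attribute__((__section__(".data.pageable")))

/* Rarely used code and data that could live in a pageable region. */
static int tuning_table[256] __pageabledata;

static int __pageable rarely_used_helper(int cmd)
{
    /* must never be reachable from an interrupt handler or from the
     * path that services page faults */
    return tuning_table[cmd & 0xff];
}

Anything placed in those sections would then have to be faulted back in
through code that is itself pinned, of course.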
It's debatable what kind of benefit it would give you over and above just
fixing specific cases like page tables, though. Most of the systems where
I've _really_ cared about RAM to that extent have been systems without any
local storage which could sanely be used for swap.
--
dwmw2
On Sat, 27 Jul 2002, David Woodhouse wrote:
> [email protected] said:
> > Memory is relatively cheap, and the complexity of such a paging
> > kernel is huge (you have to pin down disk driver and I/O paths for
> > example). Linux prefers to try to keep simple debuggable approaches to
> > things.
>
> You could do it. Start with kmalloc_pageable ...
Funny things are bound to happen when code gets preempted because
of page faults...
> It's debatable what kind of benefit it would give you over and above
> just fixing specific cases like page tables, though.
In all extreme cases you'll find that 90% of kernel memory is
tied up in just a few data structures.
Making a generic infrastructure just to deal with these specific
cases is almost certainly overkill.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Friday 26 July 2002 19:59, Russell Lewis wrote:
> I have spent some time working on AIX, which pages its kernel memory.
> It pins the interrupt handler functions, and any data that they access,
> but does not pin the other code.
>
> I'm looking for links as to why (unless I'm mistaken) Linux doesn't do
> this, so I can better understand the system.
You could say that most of the memory the kernel manages is already paged,
because it consists of disk cache and pages that are handed out to be mapped
into process memory.
Slab caches - the kernel's working memory - are not paged at all. Instead,
certain 'well-known' caches (inode, dentry, quota) are scanned for
inactive/unused objects, which are evicted on the theory that they can be
readily reconstructed from the file store if needed again. Buffer heads are
treated similarly, but with a different mechanism.
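As a loose userspace analogy (nothing below is kernel code, and all the
names and the example path are invented): an object that is merely a
cached copy of something in the file store can be dropped whenever memory
is wanted and rebuilt on the next lookup, which is exactly what makes
evicting it safe:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

struct cached_size {
    char  path[256];
    off_t size;
    int   valid;
};

static struct cached_size cache[64];    /* the "slab" */

static off_t lookup_size(const char *path)
{
    unsigned int slot = 0;
    struct stat st;
    const char *p;

    for (p = path; *p; p++)             /* trivial hash */
        slot = (slot * 31 + *p) % 64;

    if (cache[slot].valid && strcmp(cache[slot].path, path) == 0)
        return cache[slot].size;        /* cache hit */

    if (stat(path, &st) < 0)            /* rebuild from the file store */
        return -1;

    snprintf(cache[slot].path, sizeof(cache[slot].path), "%s", path);
    cache[slot].size = st.st_size;
    cache[slot].valid = 1;
    return st.st_size;
}

/* "Memory pressure": throwing everything away loses no information,
 * because every entry can be rebuilt from disk on demand. */
static void shrink_cache(void)
{
    memset(cache, 0, sizeof(cache));
}

int main(void)
{
    printf("%ld\n", (long)lookup_size("/etc/hostname"));
    shrink_cache();
    printf("%ld\n", (long)lookup_size("/etc/hostname"));
    return 0;
}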
That leaves quite a few bits and pieces of slab cache in the 'misc' category
(including all kmalloc'd memory) and I guess we just cross our fingers, try
not to be too wasteful with it, and hope for the best.
There are two elephants in the bathtub: the mem_map array, which holds a few
bytes of state information for each physical page in the system, and page
tables, neither of which are swapped or pruned in any way. We are now
beginning to suffer pretty badly because of this, on certain high end
configurations. The problem is, these structures have to keep track of much
more than just the kernel memory. The former has to have entries for all of
the high memory pages (not addressable within the kernel's normal virtual
address space) and the latter has to keep track of pages mapped into every
task in the system, in other words, a virtually unlimited amount of memory
(pun intended). Solutions are being pursued. Paging page tables to swap is
one of the solutions being considered, though nobody has gone so far as to
try it yet. An easier solution is to place page tables in high memory, and a
patch for this exists. There is also work being done on page table sharing.
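To put rough numbers on the first elephant (back-of-the-envelope only;
the 48 bytes per struct page below is an assumption, the real size
depends on kernel version and configuration):

/* Back-of-the-envelope only; 48 bytes per struct page is an assumed,
 * roughly 2.4-era figure, not a measurement. */
#include <stdio.h>

int main(void)
{
    unsigned long long ram         = 16ULL << 30;   /* a 16 GB box */
    unsigned long long page_size   = 4096;
    unsigned long long struct_page = 48;            /* assumed */

    unsigned long long pages   = ram / page_size;
    unsigned long long mem_map = pages * struct_page;

    /* prints: 4194304 pages, mem_map about 192 MB */
    printf("%llu pages, mem_map about %llu MB\n", pages, mem_map >> 20);
    return 0;
}

That mem_map is pinned, and on i386 it has to live in the kernel's
roughly 900 MB of low memory alongside everything else the kernel keeps
there, which is why the big highmem boxes hurt first.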
Hmm, that was more than I intended to write, but you have to be aware of all
of this to be able to think seriously about the question of why kernel memory
isn't paged. Besides the complexity, the real reason is performance. It
would be slower to take faults on all the different flavors of pages the
kernel deals with than to check explicitly for the presence of objects the
kernel needs to work with, such as file cache and inodes. This would also
conflict with the large (4 meg) pages used to map the kernel itself. There
would be extra costs for memory cache reloading (some architectures) and TLB
shootdowns (all architectures). Finally, on 32-bit processors, where will
you get all the virtual memory space you need to map hundreds or thousands of
cached files?
So there you are, a once-over-lightly of a simple question that has a
not-so-simple answer. We sort-of page some kernel memory, but not with the
hardware faulting mechanism. Some kernel memory isn't paged or pruned and
perhaps needs to be. We wave our hands at the rest as small change.
--
Daniel
On Saturday 27 July 2002 18:24, David Woodhouse wrote:
> ...Introduce it
> slowly, adding it a little at a time like we did SMP, and like we _should_
> have done preemption.
I'll bite. How should we have done preemption?
--
Daniel
On Sun, Jul 28, 2002 at 02:40:05AM +0200, Daniel Phillips wrote:
> There are two elephants in the bathtub: the mem_map array, which holds a few
> bytes of state information for each physical page in the system, and page
> tables, neither of which are swapped or pruned in any way. We are now
> beginning to suffer pretty badly because of this, on certain high end
> configurations. The problem is, these structures have to keep track of much
> more than just the kernel memory. The former has to have entries for all of
> the high memory pages (not addressable within the kernel's normal virtual
> address space) and the latter has to keep track of pages mapped into every
> task in the system, in other words, a virtually unlimited amount of memory
> (pun intended). Solutions are being pursued. Paging page tables to swap is
> one of the solutions being considered, though nobody has gone so far as to
> try it yet. An easier solution is to place page tables in high memory, and a
> patch for this exists. There is also work being done on page table sharing.
sizeof(mem_map) is a crippling issue for 32-bit machines. Something needs
to be done, and fast, but it looks like most of the programmer resources
that would otherwise be there to attack the issue are tied up with even
more severe problems preventing even smaller machines from working well.
Hopefully those can be dealt with swiftly enough before Halloween.
Cheers,
Bill
On Fri, Jul 26, 2002 at 04:18:56PM -0300, Rik van Riel wrote:
> Large pages and/or shared page tables should be more than
> sufficient to handle all 'benign' real workloads.
> However, 'malicious' workloads can easily generate the
> need for more pagetables than what will fit into physical
> RAM; at that point you just _have_ to throw some of these
> page tables out of RAM. If the data can be reconstructed
> from the VMA and the page cache, we can just blow the page
> table away. If it can't, we have to come up with another
> solution (maybe as simple as killing the application).
If I can poke at the malice & sufficiency bits, the workloads
triggering pagetable memory exhaustion seem to be:
(1) forking server
(2) memory-sharing constellation of processes
(3) university workload, i.e. [tens of] thousands of idle /bin/sh's etc.
(4) someone mapping a large object (64-bit)
(5) large address space coverage over time with mmap/mremap/etc. (64-bit)
(4) and (5) both involve only a single task, largely innocent of trying
to do anything industrial-strength. Yes. It's that easy to do.
Databases on 32-bit appear to be hybrids of (1) and (2), and on 64-bit,
combinations of (1), (2), (4), and (5).
Of these workloads, only (2) is feasible through pagetable sharing and
cooperation from userspace, and only (4) is feasible through large pages
and cooperation from userspace. All of these exceed physical memory
bounds, not just virtual, and hence none are solved by pte-highmem.
64-bit doesn't have highmem so (4) and (5) are immune to it anyway. The
failure mode is typically deadlock, but SCSI panics were often seen in
2.4.x, along with other nondeterministic failures.
Feasible database workloads on 32-bit machines running mainline kernels
seem to run with between 50% and 90% of physical memory consumed by
process pagetables and severe restrictions on the number of clients
that attempt to connect. When larger proportions of memory are consumed
by process pagetables, kernel deadlock often ensues.
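For a feel of where numbers like that come from (illustrative figures
only; the segment size, client count and RAM size below are assumptions,
with i386 4 KB pages and 4-byte non-PAE PTEs), remember that each client
carries its own private pagetables for a shared segment even though the
segment's data is shared:

/* Illustrative figures only; segment size, client count and RAM size
 * are assumptions.  i386, 4 KB pages, 4-byte non-PAE PTEs. */
#include <stdio.h>

int main(void)
{
    unsigned long long ram      = 16ULL << 30;  /* 16 GB machine       */
    unsigned long long shm      = 2ULL << 30;   /* 2 GB shared segment */
    unsigned long long clients  = 4000;         /* connected processes */
    unsigned long long pte_size = 4;

    /* each client carries its own PTEs for the whole segment:
     * (2 GB / 4 KB pages) * 4 bytes = 2 MB per client */
    unsigned long long per_client = shm / 4096 * pte_size;
    unsigned long long total      = per_client * clients;

    /* prints: 2 MB per client, 8000 MB total, ~48% of RAM */
    printf("%llu MB per client, %llu MB total, ~%llu%% of RAM\n",
           per_client >> 20, total >> 20, total * 100 / ram);
    return 0;
}

Shared pagetables would make that cost per-segment instead of
per-client, and 4 MB large pages cut the number of entries needed by a
factor of 1024, which is why those two approaches keep coming up.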
Cheers,
Bill
On Sat, 27 Jul 2002, William Lee Irwin III wrote:
> Feasible database workloads on 32-bit machines running mainline kernels
> seem to run with between 50% and 90% of physical memory consumed by
> process pagetables and severe restrictions on the number of clients
> that attempt to connect. When larger proportions of memory are consumed
> by process pagetables, kernel deadlock often ensues.
Even with 50% of memory in pagetables, I wouldn't be happy.
If I fork out the money for a machine with 16 GB of RAM, I'd
expect the thing to be able to cache at least 12 GB of my
database. Wasting all that memory on page tables just isn't
all right ;)
Gerrit told me some people within IBM are working on large
page support for shared memory segments and mmap()d areas;
I hope it'll be good enough to get accepted into 2.5 soon...
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Sun, 2002-07-28 at 01:59, Daniel Phillips wrote:
> On Saturday 27 July 2002 18:24, David Woodhouse wrote:
> > ...Introduce it
> > slowly, adding it a little at a time like we did SMP, and like we _should_
> > have done preemption.
>
> I'll bite. How should we have done preemption?
By defaulting pre-emption to off globally and working from the inside,
adding pre-empt enable/disable pairs around blocks of code as they were
checked. Like the SMP locking work, but inside out.
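Something like this, purely as a sketch (the helper names are invented;
the current preempt patch works the other way around, disabling
preemption across regions known to be unsafe):

/* Hypothetical sketch of the inside-out approach; allow_preempt() and
 * forbid_preempt() don't exist, and are defined as no-ops here just so
 * the example stands alone.  Preemption would default to off, and only
 * blocks that have been audited would switch it on. */
#define allow_preempt()   do { } while (0)
#define forbid_preempt()  do { } while (0)

static long sum_audited(const long *v, long n)
{
    long i, total = 0;

    allow_preempt();        /* audited: touches no per-CPU or   */
    for (i = 0; i < n; i++) /* unlocked shared state, so being  */
        total += v[i];      /* preempted in here is safe        */
    forbid_preempt();

    return total;
}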
That way e.g. the NE2000 driver might still work properly now, and keep
working until it's fixed.