LinuxLists.cc - RFC: pageable kernel-segments

2001-04-17 15:57:42

Subject: RFC: pageable kernel-segments

Would anyone be interested (besides me) in a kernel which can page
out certain parts of itself? The kernel should be in some kind of
vmlinux-ish (as in: uncompressed) format on disk for on-demand
re-loading of pages which are discarded.
Certain parts of drivers could get a __pageable prefix or so
(like the __init parts of drivers which get removed) to let
the paging code know that they can be discarded if memory pressure
demands it.
__pageable code would then be things like (e.g.!) the code which
handles the open()/close() of a device. Most of the time a device
spends more time doing read/write/ioctl than open/close. Also,
hopefully there's no interrupt-sensitive code in these routines.
I would think this is usable (for example) on my 8MB RAM laptop.
Any thoughts on this?

Folkert van Heusden

[ http://www.vanheusden.com/Linux/kernel_patches.php3 ]
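
The proposal above boils down to "discard tagged code under memory pressure, reload it from the on-disk image on first use". A minimal toy model of that behaviour, in Python rather than kernel C, might look like the sketch below. The class name `KernelImage`, its methods, and the section names are all hypothetical; nothing here corresponds to an existing kernel interface.

```python
class KernelImage:
    """Toy model: an uncompressed image on 'disk' backs discardable code sections."""

    def __init__(self, sections):
        # sections maps a name to (code bytes, pageable flag).
        self.disk = {name: data for name, (data, _) in sections.items()}
        self.pageable = {name for name, (_, flag) in sections.items() if flag}
        self.resident = dict(self.disk)  # everything is resident after boot
        self.page_ins = 0

    def discard_under_pressure(self):
        """Drop every resident pageable section; return the bytes freed."""
        freed = 0
        for name in sorted(self.pageable):
            data = self.resident.pop(name, None)
            if data is not None:
                freed += len(data)
        return freed

    def call(self, name):
        """Use a section, reloading it from the image if it was discarded."""
        if name not in self.resident:
            # In a real kernel this reload would need disk I/O and could sleep.
            self.resident[name] = self.disk[name]
            self.page_ins += 1
        return self.resident[name]
```

Only sections flagged as pageable are ever dropped, so a read/write path left unflagged stays resident, which is the split between open()/close() and read/write/ioctl suggested in the message.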

2001-04-17 16:08:22

by Disconnect

Subject: Re: RFC: pageable kernel-segments

On Tue, 17 Apr 2001, Heusden, Folkert van did have cause to say:

> I would think this is usable (for example) on my 8MB RAM laptop.
> Any thoughts on this?

I'm not a kernel hacker, but I've got some thoughts on this:

1> Modules (with the autoloader) can do that for anything not necessary to
boot. (Although even modules could lose a few pages after they
load/init/etc. Hardware setup tends to only happen once..)

2> It'd be great for embedded systems. But you'd need a "scale" -
something along the lines of "Page this out, compress it, step on it,
forget it, we'll never need it in a hurry" up through "page this out if
you -absolutely- have to, but make it easily accessible as fast as
possible".

3> It would involve a major kernel rewrite before it was anything more
than a slowdown to a few drivers supporting it. And there would probably
need to be some /proc method of forbidding paging on certain
(modules/segments/etc) so that, for example, people who hit the
least-likely-path (most-likely-to-page-out) on a regular basis can disable
paging of that section/module/driver/whatnot.

-----BEGIN GEEK CODE BLOCK-----
Version: 3.1 [http://www.ebb.org/ungeek]
GIT/CC/CM/AT d--(-)@ s+:-- a-->? C++++$ ULBS*++++$ P+>+++ L++++>+++++
E--- W+++ N+@ o+>$ K? w--->+++++ O- M V-- PS+() PE Y+@ PGP++() t 5---
X-- R tv+@ b++++>$ DI++++ D++(+++) G++ e* h(-)* r++ y++
------END GEEK CODE BLOCK------

2001-04-17 19:22:11

by H. Peter Anvin

Subject: Re: RFC: pageable kernel-segments

Followup to: <[email protected]>
By author: "Heusden, Folkert van" <[email protected]>
In newsgroup: linux.dev.kernel
>
> Would anyone be interested (besides me) in a kernel which can page
> out certain parts of itself? The kernel should be in some kind of
> vmlinux-ish (as in: uncompressed) format on disk for on-demand
> re-loading of pages which are discarded.
> Certain parts of drivers could get a __pageable prefix or so
> (like the __init parts of drivers which get removed) to let
> the paging code know that they can be discarded if memory pressure
> demands it.
> __pageable code would then be things like (e.g.!) the code which
> handles the open()/close() of a device. Most of the time a device
> spends more time doing read/write/ioctl than open/close. Also,
> hopefully there's no interrupt-sensitive code in these routines.
> I would think this is usable (for example) on my 8MB RAM laptop.
> Any thoughts on this?
>

VMS does this. It at least used to have a great tendency to crash
itself, because it swapped out something that was called from a driver
that was called by the swapper -- resulting in deadlock. You need
iron discipline for this to work right in all circumstances.

Second, it makes it quite hard to know what operations can cause a
task to sleep, since any reference to paged-out memory can require a
page-in and the associated schedule. You almost need pointer
annotation in order for this to be safe.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
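
The deadlock described above (pageable code reachable from the path that services page-ins) can in principle be caught by a reachability check over the call graph. The sketch below is only an illustration of that check: the function name `pageable_on_swap_path`, the call graph, and the symbol names are made up, and a real kernel would also have to deal with indirect calls that such a static graph cannot see.

```python
def pageable_on_swap_path(call_graph, swap_entry, pageable):
    """Return the pageable functions reachable from the swap-in entry point.

    call_graph maps a function name to the names it may call.
    Any non-empty result is a potential deadlock: paging the function
    back in would itself need the function to be resident.
    """
    seen = set()
    stack = [swap_entry]
    offenders = []
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        if fn in pageable:
            offenders.append(fn)
        stack.extend(call_graph.get(fn, ()))
    return sorted(offenders)
```

A check like this covers only the first point in the message. The second point, that any reference to paged-out memory may sleep, concerns data references as well as calls and is not captured by a call graph alone.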

2001-04-17 23:58:28

by Albert D. Cahalan

Subject: Re: RFC: pageable kernel-segments

H. Peter Anvin writes:
> By author: "Heusden, Folkert van" <[email protected]>

>> Would anyone be interested (besides me) in a kernel which can page
...
>> Certain parts of drivers could get the __pageable prefix or so

> VMS does this. It at least used to have a great tendency to crash
> itself, because it swapped out something that was called from a driver
> that was called by the swapper -- resulting in deadlock. You need
> iron discipline for this to work right in all circumstances.
>
> Second, it makes it quite hard to know what operations can cause a
> task to sleep, since any reference to paged-out memory can require a
> page-in and the associated schedule. You almost need pointer
> annotation in order for this to be safe.

It wouldn't be nearly so dangerous to page from compressed
data in memory. The memory could be ROM.

2001-04-20 13:14:32

by Stephen C. Tweedie

Subject: Re: RFC: pageable kernel-segments

Hi,

On Tue, Apr 17, 2001 at 12:21:17PM -0700, H. Peter Anvin wrote:

> > Certain parts of drivers could get a __pageable prefix or so
> > (like the __init parts of drivers which get removed) to let
> > the paging code know that they can be discarded if memory pressure
> > demands it.
>
> VMS does this. It at least used to have a great tendency to crash
> itself, because it swapped out something that was called from a driver
> that was called by the swapper -- resulting in deadlock. You need
> iron discipline for this to work right in all circumstances.

Actually, VMS doesn't do this, precisely because it is so hard to get
right. VMS has both paged and non-paged pools for dynamically
allocated kernel memory, but the kernel code itself is non-pageable.

The big problem with such pageable memory isn't really device driver
deadlocks --- the easy rule which makes that work is simply never to
use paged pool from a driver which might be involved in swapping. :)
Even more tricky is the handling of kernel locking --- you cannot
access any paged memory with a spinlock held unless you have pinned
the pages in core beforehand.

--Stephen
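
The locking rule above (no access to paged memory under a spinlock unless the pages were pinned first) can be modelled as a small state machine. This is a toy sketch, not kernel code: `PagedPool`, `WouldSleepUnderSpinlock`, and the method names are invented for illustration, and the model assumes that any page-in may sleep.

```python
class WouldSleepUnderSpinlock(Exception):
    """Raised when code could page-fault (and so sleep) with a spinlock held."""


class PagedPool:
    def __init__(self):
        self.resident = set()    # pages currently in core
        self.pinned = set()      # pages that may not be evicted
        self.spinlocks_held = 0

    def spin_lock(self):
        self.spinlocks_held += 1

    def spin_unlock(self):
        self.spinlocks_held -= 1

    def _page_in(self, page):
        # Paging in needs I/O and may sleep, which is illegal under a spinlock.
        if self.spinlocks_held:
            raise WouldSleepUnderSpinlock(page)
        self.resident.add(page)

    def pin(self, page):
        if page not in self.resident:
            self._page_in(page)
        self.pinned.add(page)

    def unpin(self, page):
        self.pinned.discard(page)

    def evict(self, page):
        """Try to evict a page; pinned pages stay in core."""
        if page in self.pinned:
            return False
        self.resident.discard(page)
        return True

    def access(self, page):
        # Under a spinlock, only pinned pages are safe: an unpinned page
        # might be evicted at any moment even if it is resident right now.
        if self.spinlocks_held and page not in self.pinned:
            raise WouldSleepUnderSpinlock(page)
        if page not in self.resident:
            self._page_in(page)
        return page
```

The required ordering is therefore pin, lock, access, unlock, unpin. Taking the lock first and touching an unpinned page raises in this model, mirroring the bug class described in the message.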

2001-04-20 13:41:24

by Disconnect

Subject: Re: RFC: pageable kernel-segments

On Tue, 17 Apr 2001, Oliver Neukum did have cause to say:

> > load/init/etc. Hardware setup tends to only happen once..)
>
> No they can't. Modules can't be finegrained enough to do this without wasting
> more memory due to fragmentation than you'd gain.

Actually, don't they do this -already-? I thought I saw somewhere on here
recently that there was a class of functions you could use in a module for
'one-off' activities. I suspect that covers 90% of what could be paged
out (the remainder being mostly the unloading process, for non-hotswap
modules). But IANAKG. (..not a kernel guru).

> Actually not that great. Support for different types of kernel code is there
> to support __init and __initdata. You'd use a fixup scheme like the one used
> in copy_[to|from]_user to trigger paging in. Page out could be handled by the
> conventional mm.

I mis-typed - by 'major rewrite' I meant more an analysis and tagging
process, which would have to touch most of the kernel before it was
useful. But again, IANAKG so the existing swap code may already handle
that, at least in a way that it could be a ruleset (with override tags?)
instead of having to put a new set of tags everywhere.
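
The fixup scheme mentioned in the quoted text works, for copy_[to|from]_user, by searching a table of known faulting addresses when the kernel takes a page fault. A hypothetical adaptation for pageable code would search a table of pageable address ranges instead. The sketch below models only that lookup; `PageableRegions`, `handle_kernel_fault`, the addresses, and the "retry"/"oops" return values are assumptions made for illustration.

```python
import bisect


class PageableRegions:
    """Sorted table of pageable address ranges, searched on a kernel fault."""

    def __init__(self, regions):
        # regions: iterable of (start, end, name); end is exclusive and
        # the ranges are assumed not to overlap.
        self._regions = sorted(regions)
        self._starts = [start for start, _, _ in self._regions]

    def lookup(self, fault_addr):
        """Return the name of the region containing fault_addr, or None."""
        i = bisect.bisect_right(self._starts, fault_addr) - 1
        if i >= 0:
            start, end, name = self._regions[i]
            if start <= fault_addr < end:
                return name
        return None


def handle_kernel_fault(table, fault_addr, page_in):
    """Page the region in and retry if the fault hit pageable code."""
    name = table.lookup(fault_addr)
    if name is None:
        return "oops"    # a genuine kernel bug, not a paged-out section
    page_in(name)        # would sleep in a real kernel
    return "retry"
```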

2001-04-20 14:22:13

by Venkatesh Ramamurthy

Subject: Re: RFC: pageable kernel-segments

> > VMS does this. It at least used to have a great tendency to crash
> > itself, because it swapped out something that was called from a driver
> > that was called by the swapper -- resulting in deadlock. You need
> > iron discipline for this to work right in all circumstances.
>
> Actually, VMS doesn't do this, precisely because it is so hard to get
> right. VMS has both paged and non-paged pools for dynamically
> allocated kernel memory, but the kernel code itself is non-pageable.

[Venkat] This [pageable drivers] has been a nightmare for NT (derived from
VMS) driver programmers. It almost divides the set of kernel APIs into two
halves, one which can be called at any IRQL and the other only at elevated
IRQL. The benefits of having pageable kernel pages are minimal when
compared to the complexity that gets added to the kernel. We can keep the
kernel simpler (and faster) without having parts of drivers pageable. But one
more issue is having the page tables pageable...

2001-04-20 14:48:45

by Alan

Subject: Re: RFC: pageable kernel-segments

> compared to the complexity that gets added to the kernel. We can keep the
> kernel simpler (and faster) without having parts of drivers pageable. But one
> more issue is having the page tables pageable...

At the moment we can almost go a stage further - when we are short of memory
we can victimise apparently idle page tables by simply deleting them. What
stops us from doing this right now is handling anonymous pages where the
page table really is needed to find the swap entries.

There is a proposal (several it seems) to make 2.5 replace the conventional
unix swap with a filesystem of backing store for anonymous objects. That will
mean each object has its own vm area and inode and thus we can start blowing
away all user mode page tables when we want.

The primary reason for it however is to simplify all the code paths that deal
with swap. All the readahead becomes common code. Swap files become loopback
mounts. We can support multiple swap implementations (just pick your swap fs).
It also lays the groundwork for doing swap using spare disk space.
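
The first paragraph above distinguishes page-table entries that can be thrown away and rebuilt from those that cannot. A toy classification, with an invented entry format and a hypothetical function name `drop_page_table`, might look like this:

```python
def drop_page_table(page_table):
    """Split entries into rebuildable ones and ones that would be lost.

    page_table maps a virtual address to either
      ("file", inode, offset)  -- can be refaulted from the mapping itself
      ("anon", swap_slot)      -- the entry is the only record of the slot
    """
    rebuildable, lost = {}, {}
    for vaddr, entry in page_table.items():
        if entry[0] == "file":
            rebuildable[vaddr] = entry
        else:
            lost[vaddr] = entry
    return rebuildable, lost
```

In this model a table can be deleted safely only when `lost` is empty. Giving anonymous objects their own inode, as the proposal describes, would in effect turn every "anon" entry into a "file" entry.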

2001-04-20 15:38:31

by Venkatesh Ramamurthy

Subject: Re: RFC: pageable kernel-segments

> There is a proposal (several it seems) to make 2.5 replace the conventional

Who is doing it? Any links to where I can find these proposals?

2001-04-20 19:06:05

by Stephen C. Tweedie

Subject: Re: RFC: pageable kernel-segments

Hi,

On Fri, Apr 20, 2001 at 03:49:30PM +0100, Alan Cox wrote:

> There is a proposal (several it seems) to make 2.5 replace the conventional
> unix swap with a filesystem of backing store for anonymous objects. That will
> mean each object has its own vm area and inode and thus we can start blowing
> away all user mode page tables when we want.

Not without major VM overhaul.

The problem is MAP_PRIVATE, where a single vma can contain both normal
file-backed pages and anonymous pages at the same time. You don't
even know whose anonymous page it is --- a process with anon pages can
fork, so that later on some of the child's anon pages actually come
from the parent's anon space instead of the child's.

Right now all of the magic that makes this work is in the page tables.
To remove page tables we'd need additional structures all through the
VM to track anonymous pages, and that's exactly where the FreeBSD VM
starts to get extremely messy compared to ours.

--Stephen
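
The MAP_PRIVATE behaviour described above can be observed from user space on a Unix system. The sketch below maps a two-page file privately, dirties the first page (which becomes anonymous through copy-on-write) and then forks. The child initially sees the parent's anonymous page, while the second page is still backed by the file. The function name `private_mapping_demo` and the marker bytes are arbitrary choices; this only demonstrates the semantics and says nothing about how the kernel tracks the pages internally.

```python
import mmap
import os
import tempfile


def private_mapping_demo():
    page = mmap.PAGESIZE
    with tempfile.TemporaryFile() as f:
        f.write(b"F" * (2 * page))
        f.flush()
        m = mmap.mmap(f.fileno(), 2 * page,
                      flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        m[0:1] = b"A"  # copy-on-write: page 0 is now anonymous, page 1 is not
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: page 0 comes from the parent's anonymous page,
            # page 1 still comes from the file.
            os.close(r)
            seen = m[0:1] + m[page:page + 1]
            m[0:1] = b"C"  # the child now gets its own private copy
            os.write(w, seen + m[0:1])
            os._exit(0)
        os.close(w)
        child_report = os.read(r, 3)
        os.close(r)
        os.waitpid(pid, 0)
        parent_view = m[0:1]   # unaffected by the child's write
        f.seek(0)
        on_disk = f.read(1)    # private writes never reach the file
        m.close()
        return child_report, parent_view, on_disk
```

So a single mapping ends up holding a file-backed page, a page inherited from the parent's anonymous memory, and a page private to the child, which is the bookkeeping the message says currently lives in the page tables.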

2001-04-21 15:53:57

by Rik van Riel

Subject: Re: RFC: pageable kernel-segments

On Fri, 20 Apr 2001, Stephen C. Tweedie wrote:
> On Fri, Apr 20, 2001 at 03:49:30PM +0100, Alan Cox wrote:
>
> > There is a proposal (several it seems) to make 2.5 replace the conventional
> > unix swap with a filesystem of backing store for anonymous objects. That will
> > mean each object has its own vm area and inode and thus we can start blowing
> > away all user mode page tables when we want.
>
> Not without major VM overhaul.
>
> The problem is MAP_PRIVATE, where a single vma can contain both normal
> file-backed pages and anonymous pages at the same time. You don't
> even know whose anonymous page it is --- a process with anon pages can
> fork, so that later on some of the child's anon pages actually come
> from the parent's anon space instead of the child's.

Whoooops indeed. I forgot about this mess...

> Right now all of the magic that makes this work is in the page tables.
> To remove page tables we'd need additional structures all through the
> VM to track anonymous pages, and that's exactly where the FreeBSD VM
> starts to get extremely messy compared to ours.

That's because they still seem to use Mach's object chaining.

There's bound to be a much cleaner solution than whatever it
is they copied over from Mach ;)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/