2002-09-09 09:16:11

by David Woodhouse

Subject: [RFC] On paging of kernel VM.

I think I'd like to introduce 'real' VMAs into kernel space, so that areas
in the vmalloc range can have 'real' vm_ops and more to the point a real
nopage function.
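
Something like this, perhaps -- a minimal sketch against the 2.4-style
nopage signature, where flash_vm_nopage and flash_vm_ops are names I'm
making up for illustration:

#include <linux/mm.h>

/* Hypothetical fault handler for a flash-backed region of the
 * vmalloc range; the body is sketched further down. */
static struct page *flash_vm_nopage(struct vm_area_struct *area,
                                    unsigned long address, int unused);

static struct vm_operations_struct flash_vm_ops = {
        .nopage = flash_vm_nopage,
};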

Unfortunately, AFAICT this would involve changing the fault handler on every
platform, so I'm debating whether it's really worth it -- whether anyone else
could use it, and whether I could get round my problem any other way.

The problem is flash chips. These basically behave as ROM, but you write to
them by writing magic values to magic addresses, and during a write
operation the _whole_ chip returns status bits instead of data.
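
For the Intel/Sharp-style command set, for instance, a single word
program goes roughly like this -- illustrative only, the details vary
per chip, and a real driver would go through the proper MTD accessors:

#include <linux/types.h>
#include <asm/processor.h>      /* cpu_relax() */

static void flash_program_word(volatile u16 *addr, u16 datum)
{
        *addr = 0x0040;                 /* Word Program command       */
        *addr = datum;                  /* the data to program        */
        while (!(*addr & 0x0080))       /* reads now return status;   */
                cpu_relax();            /* poll SR.7 until ready      */
        *addr = 0x00FF;                 /* back to Read Array mode    */
}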

To avoid taking up precious RAM with copies of data which are already in
flash, we can map pages of flash directly into userspace. On taking a
fault, we wait for any pending write to complete, mark the chip as busy,
then set up the page tables appropriately so that userspace can read from
it. To start a write operation, you invalidate all currently-visible
pages before talking to the chip.
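
In code, something along these lines; struct flash_chip and all the
helpers here are hypothetical, just to show the shape of it:

/* Runs on the first read access to an unmapped flash page. */
static struct page *flash_vm_nopage(struct vm_area_struct *area,
                                    unsigned long address, int unused)
{
        struct flash_chip *chip = area->vm_private_data;
        unsigned long ofs = address - area->vm_start;

        flash_wait_for_write(chip);     /* pending write completes    */
        flash_mark_busy(chip);          /* chip pinned in read mode   */
        return flash_page_at(chip, ofs);
}

/* Runs before we start talking to the chip for a write. */
static void flash_begin_write(struct flash_chip *chip)
{
        flash_invalidate_mappings(chip);  /* zap every visible pte;   */
                                          /* the next access faults   */
                                          /* and waits for us instead */
        flash_send_write_command(chip);
}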

There are cases in the kernel where we'd really like the same setup --
mounting a JFFS2 file system, for example, is a slow operation because it's
entirely log-structured and we have to read every log entry on the file
system. The current method of reading into a RAM buffer under a lock and
then dealing with stuff in RAM is entirely suboptimal, and proof-of-concept
hacks to just use a pointer into the flash chip have been observed to
improve mount time by about a factor of 4.
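
Schematically -- scan_node() and flash_virt_base below are stand-ins,
not real JFFS2 internals, and error handling is elided:

#include <linux/mtd/mtd.h>
#include <linux/slab.h>

extern void scan_node(const u_char *node);      /* hypothetical */
extern u_char *flash_virt_base;                 /* hypothetical */

/* Today: copy each node into RAM before looking at it. */
static int scan_one_node(struct mtd_info *mtd, loff_t ofs, size_t len)
{
        size_t retlen;
        u_char *buf = kmalloc(len, GFP_KERNEL);

        if (!buf)
                return -ENOMEM;
        mtd->read(mtd, ofs, len, &retlen, buf);   /* under the chip lock */
        scan_node(buf);
        kfree(buf);
        return 0;
}

/* Proof-of-concept hack: no copy, no allocation, just point at it. */
static int scan_one_node_direct(loff_t ofs)
{
        scan_node(flash_virt_base + ofs);
        return 0;
}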

The locking is a problem though. Flash chips may be divided into multiple
partitions and other code may want to write to its partition while a mount
is in progress. The naïve approach of just locking the chip into read mode
on giving out a pointer to it, and unlocking it when the mount is complete,
is going to suck royally. Hence, it would be very nice if we could play the
same trick as we do for userspace: giving out a pointer which is always
going to be valid; you just might have to wait for it.

But as I said, this means screwing with every fault handler. It doesn't
have to affect the fast path -- we can go looking for these vmas only in
the case where we've already tried looking for the appropriate pte in
init_mm and haven't found it. But it's still an intrusive change that would
need to be done on every architecture.
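
Roughly, the fallback would look like this in each arch's
do_page_fault(); find_kernel_vma() and install_kernel_pte() don't
exist, they're just there to show where the cost lands:

#include <linux/mm.h>

extern struct vm_area_struct *find_kernel_vma(unsigned long address);
extern void install_kernel_pte(unsigned long addr, struct page *page);

/* Called only after the usual pte lookup in init_mm has failed;
 * the fast path is untouched.  Returns 0 if the fault was handled. */
static int vmalloc_nopage_fault(unsigned long address)
{
        struct vm_area_struct *vma = find_kernel_vma(address);
        struct page *page;

        if (!vma || !vma->vm_ops || !vma->vm_ops->nopage)
                return -EFAULT;             /* genuinely bad access */

        page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
        if (!page)
                return -EFAULT;

        install_kernel_pte(address, page);  /* wire it into init_mm */
        return 0;
}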

I'm wondering what else could use this if it were implemented. Is there any
need for something like vmalloc_pageable(), for example? Anything else?
Rusty and I have wittered about marking certain kernel functions and data as
__pageable, to go into a special section of their own, but I'm wondering if
that conversation was slightly Guinness-influenced :)

Or is there another way to solve my original problem that I've overlooked?

Answers on a postcard to...

--
dwmw2



2002-09-09 11:09:07

by Stephen C. Tweedie

Subject: Re: [RFC] On paging of kernel VM.

Hi,

On Mon, Sep 09, 2002 at 10:20:53AM +0100, David Woodhouse wrote:
> I think I'd like to introduce 'real' VMAs into kernel space, so that areas
> in the vmalloc range can have 'real' vm_ops and more to the point a real
> nopage function.

The alternative is a kmap-style mechanism for temporarily mapping
pages beyond physical memory on demand. That would avoid the space
limits we have on vmalloc etc.; there are only a few tens of MB of
address space we can use for mmap tricks in kernel space, so
persistent maps are seriously constrained if you've got a lot of flash
you want to map.

And with a kmap interface, your locking problems are much simpler --
you can trap accesses at source and you don't have to go hunting ptes
to invalidate.
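
Something shaped like kmap_atomic(), say; flash_kmap() and friends
below are invented names, not an existing interface:

#include <linux/mtd/mtd.h>

/* Map one flash page into a small reserved window of kernel address
 * space, for the duration of the access only.  Serialisation against
 * chip writes happens here, at the point of access. */
void *flash_kmap(struct mtd_info *mtd, loff_t ofs)
{
        flash_wait_for_write(mtd);      /* chip must be in read mode  */
        flash_get_reader(mtd);          /* hold off writers meanwhile */
        return flash_map_window(mtd, ofs);  /* fixmap-style slot      */
}

void flash_kunmap(struct mtd_info *mtd, void *addr)
{
        flash_unmap_window(addr);
        flash_put_reader(mtd);          /* last reader unblocks writers */
}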

Cheers,
Stephen

2002-09-09 11:19:58

by David Woodhouse

Subject: Re: [RFC] On paging of kernel VM.


[email protected] said:
> The alternative is a kmap-style mechanism for temporarily mapping
> pages beyond physical memory on demand.

That's a possibility I'd considered, but in this case there are problems
with explicitly mapping and unmapping the pages. The locking of the chip is
a detail I was hoping to avoid exposing to the users of the device.

With mapping and unmapping done explicitly, an active mapping prevents
all other users from writing to the same device -- hence the need for a
'cond_temporarily_unmap()' kind of function -- and a user of the device
deadlocks if they try to write while they have a mapping of their own
active. The answer "don't do that then" is workable but not preferable.

Given that all the logic to mark pages present on read and then invalidate
them on write access is going to have to be there for userspace _anyway_,
being able to keep the API nice and simple by using that in kernelspace too
would be far better, if we can justify the change to the slow path of the
vmalloc fault case.

But yes, what you suggest is the current API for the flash stuff, sans the
'cond_temporarily_unmap_if_people_are_waiting()' bit. And that's why I've
avoided actually _using_ it, preferring to put up with the overhead of
reading into a RAM buffer until we can fix it.

--
dwmw2


2002-09-10 00:21:57

by Daniel Phillips

Subject: Re: [RFC] On paging of kernel VM.

On Monday 09 September 2002 11:20, David Woodhouse wrote:
> But as I said, this means screwing with every fault handler. It doesn't
> have to affect the fast path -- we can go looking for these vmas only in
> the case where we've already tried looking for the appropriate pte in
> init_mm and haven't found it. But it's still an intrusive change that would
> need to be done on every architecture.

Why can't you go per-architecture and fall back to the slow way of doing it
for architectures that don't have the new functionality yet?

--
Daniel

2002-09-10 06:03:44

by David Woodhouse

Subject: Re: [RFC] On paging of kernel VM.


[email protected] said:
> Why can't you go per-architecture and fall back to the slow way of
> doing it for architectures that don't have the new functionality yet?

No. We can't make this kind of change to the way the vmalloc region works on
some architectures only. It has to remain uniform.

Either it's worth doing for all, or it's not. It's a fairly trivial change
in the slow path, after all. I suspect it's worth it -- I'll ask the same
question again with a patch attached as soon as I get time, in order to
elicit more responses.

--
dwmw2