2002-03-03 21:09:54

by Jeff Dike

Subject: [RFC] Arch option to touch newly allocated pages

What I'd like is for the arch to have __alloc_pages touch all pages that it
has allocated and is about to return.

The reason for this is that for UML, those pages are backed by host memory,
which may or may not be available when they are finally touched at some
arbitrary place in the kernel. I hit this by tmpfs running out of room
because my UMLs have their memory backed by tmpfs mounted on /tmp. So, I
want to be able to dirty those pages before they are seen by any other code.

My first guess at what I want in the code is for all the places that
__alloc_pages says this:

        if (page)
                return page;

to change to this:

        if (page)
                return arch_validate(page);

arch_validate would be defined as basically empty somewhere in an
include/linux/*.h header unless the arch has defined one already. And I may
want to add order to the arg list if it can't be inferred from the page
alignment.

My arch_validate would look something like this:

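struct page *arch_validate(struct page *page, int order)
{
        unsigned long zero = 0;
        unsigned long addr = (unsigned long) page_address(page);
        int i;

        /* order is passed in here, per the note above about the arg list */
        set_fs(USER_DS);
        for(i = 0; i < (1 << order); i++){
                /* write one word per page so the host has to back it now */
                if(copy_to_user((void *) (addr + i * PAGE_SIZE), &zero,
                                sizeof(zero))){
                        set_fs(KERNEL_DS);
                        free_pages(addr, order);
                        return(NULL);
                }
        }
        set_fs(KERNEL_DS);
        return(page);
}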

The use of set_fs/copy_to_user is somewhat hokey, but that's exactly the
effect that I want. Is there a better way of doing that?

So, is this a reasonable thing to do, and is the above the right way of
getting it?

Jeff


2002-03-03 21:47:10

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> The reason for this is that for UML, those pages are backed by host memory,
> which may or may not be available when they are finally touched at some
> arbitrary place in the kernel. I hit this by tmpfs running out of room
> because my UMLs have their memory backed by tmpfs mounted on /tmp. So, I
> want to be able to dirty those pages before they are seen by any other code.

No - you think you want to dirty the pages - you want to account the address
space. What you want to do is run 2.4.18ac3 and do

echo "2" > /proc/sys/vm/overcommit_memory

which on a good day will give you overcommit protection. Your map requests
will fail without the pages being dirtied and the extra swap that would
cause. It knows about tmpfs too but not ramfs, ramdisk or ptrace yet

Alan

2002-03-03 23:24:53

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> No - you think you want to dirty the pages - you want to account the
> address space. What you want to do is run 2.4.18ac3 and do
> echo "2" > /proc/sys/vm/overcommit_memory
> which on a good day will give you overcommit protection. Your map
> requests will fail without the pages being dirtied and the extra swap
> that would cause.

That doesn't sound right to me.

I don't have individual little map requests going on here. I have a single
large map happening at boot time which creates the UML "physical" memory
area.

So, say I have a 128M UML which is only ever going to use 32M of that. If
there isn't 128M of address space, but there is 32M, this UML will never
get off the ground, even though it really deserved to.

About the swap allocation, I'd bet essentially all the time when a page
is allocated, its dirtiness is imminent anyway. So, I'm not adding anything
to swap. It'll be there a usec later anyway. What I want is for the dirtying
to happen in a controlled place where something sane can be done if the page
isn't really there.

Jeff

2002-03-03 23:34:45

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> I don't have individual little map requests going on here. I have a single
> large map happening at boot time which creates the UML "physical" memory
> area.

Doesn't matter

> So, say I have a 128M UML which is only ever going to use 32M of that. If
> there isn't 128M of address space, but there is 32M, this UML will never
> get off the ground, even though it really deserved to.

Well thats up to you on how you implement it. mmap will tell you the truth
in overcommit mode 2 or 3. Nothing will get killed off when you try and
mmap too much or dirty pages you have.

> About the swap allocation, I'd bet essentially all the time when a page
> is allocated, its dirtiness is imminent anyway. So, I'm not adding anything

Nothing of the sort. Sitting in a gnome desktop I'm showing a 41200Kb worst
case swap requirement, but it appears under half of that is used.

> to swap. It'll be there a usec later anyway. What I want is for the dirtying
> to happen in a controlled place where something sane can be done if the page
> isn't really there.

Like randomly killing another process off ? If you want to dirty the pages
pray and catch the sigbus then see memset(3). If you want to be told "sorry
you can't have that" and write a simple loop to pick a good memory size,
you need the address space accounting.


Alan

2002-03-04 03:14:10

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Like randomly killing another process off ? If you want to dirty the
> pages pray and catch the sigbus then see memset(3). If you want to be
> told "sorry you can't have that" and write a simple loop to pick a
> good memory size, you need the address space accounting.

OK, this sounds right if the machine is short of memory. Random
hacks to do something reasonable if a SIGBUS manages to get through aren't
the way to go when random process deaths are what happens if it doesn't.

However, the host wasn't under a global memory shortage. The UML hit the
tmpfs size limit.

Does address space accounting enforce tmpfs limits (and other limits, like
RSS, when it happens)? Or is it enforcing a global limit?

When the host isn't in a memory shortage and UML is running under a sub-limit
(as with tmpfs), either of those gives me worse behavior than I get by being
able to trap the SIGBUS. It will arrive reliably without accompanying process
deaths. The first case means that the UML won't get off the ground even
though it would be able to deal semi-gracefully with tmpfs running out of room.
The second means that the mmap will succeed and I'm back to SIGBUS anyway.

> Nothing of the sort. Sitting in a gnome desktop I'm showing a 41200Kb
> worst case swap requirement, but it appears under half of that is
> used.

This I don't get. I'm assuming that the vast majority of the time when a
set of pages is returned by __alloc_pages, they all are going to be written
pretty soon. This being the case, how can it possibly affect anything to
touch them at the end of __alloc_pages?

Jeff

2002-03-04 03:21:00

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> Does address space accounting enforce tmpfs limits (and other limits, like
> RSS, when it happens)? Or is it enforcing a global limit?

It ensures that the total number of anonymous and/or tmpfs (eg anon shared)
pages that are mappable will fit in swap (or in mode 2 swap + 0.5*ram). You
never get a SIGBUS. Writes to tmpfs for new blocks will fail if that would
place the system in a potential overcommit situation.

> > Nothing of the sort. Sitting in a gnome desktop I'm showing a 41200Kb
> > worst case swap requirement, but it appears under half of that is
> > used.
>
> This I don't get. I'm assuming that the vast majority of the time when a
> set of pages is returned by __alloc_pages, they all are going to be written
> pretty soon. This being the case, how can it possibly affect anything to
> touch them at the end of __alloc_pages?

It isnt the alloc pages that is the problem.

You mmap - no pages are allocated. You use them , pages get allocated. If
you look at the actual maps you'll find a lot of people allocate an area
of address space but don't use it all. Without the address overcommit
management nothing guarantees that when you touch those pages you won't
fault. Furthermore unless you are very careful you may fault again on
the stack push for the SIGBUS and if that faults - SIGKILL->OOM time


Alan

2002-03-04 05:02:22

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> You never get a SIGBUS. Writes to tmpfs for new blocks will fail if
> that would place the system in a potential overcommit situation.

How will writes to tmpfs fail if we're not in an overcommit situation, but
tmpfs is full? Unless tmpfs is changed, it looks to me like you get a SIGBUS.

> It isnt the alloc pages that is the problem.

We are somehow failing to communicate...

> You mmap - no pages are allocated.

I understand this.

> You use them , pages get allocated.

This too.

> If you look at the actual maps you'll find a lot of people allocate an
> area of address space but don't use it all.

Yes.

> Without the address
> overcommit management nothing guarantees that when you touch those
> pages you won't fault.

Even with address overcommit management, I can fault if I touch pages when
tmpfs is full but the system is not near overcommit.

> Furthermore unless you are very careful you may
> fault again on the stack push for the SIGBUS and if that faults -
> SIGKILL->OOM time

We are talking about UML kernel stacks. If they have been allocated the way
I'm proposing with the UML __alloc_pages touching each page on the way out,
they are allocated on the host, and therefore can't fault.

This seems to me to be sufficiently careful.

One of us is missing something, who is it?

Jeff

2002-03-04 14:55:20

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> Even with address overcommit management, I can fault if I touch pages when
> tmpfs is full but the system is not near overcommit.

That is what mmap defines for a file based mapping yes. Thats a case where
there isnt much else you can do

2002-03-04 17:43:53

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> That is what mmap defines for a file based mapping yes. Thats a case
> where there isnt much else you can do

Except the whole point of me starting this thread is that there is something
sane that UML can do *if* it can trap those bus errors in a controlled way.

If UML can detect pages which tmpfs can't back as they leave the allocator,
then it can prevent the rest of the UML kernel from getting randomly SIGBUSed
as it touches those pages.

To recap in case it got lost in the confusion, I want __alloc_pages to call
an arch hook before it returns memory, turning every instance of

        if (page)
                return page;

into

        if (page)
                return arch_validate(page);

Unless the arch defines its own arch_validate(), a generic header would
define it as

        static inline struct page *arch_validate(struct page *page){ return page; }

or the equivalent macro.

On the other hand, UML would define it to touch each page in the allocation,
trapping SIGBUS there. If any do SIGBUS, then my original proposal was to
free the block back to the allocator and return NULL. This would cause a
flurry of allocation failures to things that weren't willing to sleep, and
if that causes trouble, then the caller needed fixing anyway.

A more interesting idea is to hang on to the block and maybe unmap it.
Unmapping would free any backed pages in the block back to tmpfs, giving it
(and the other UMLs, if any) some breathing room. Even if the entire block
was unbacked, the UML would lose it as being allocatable and would eventually
be restricted to handing out pages that it had managed to touch before tmpfs
ran out.

This is way more sane than the current get-a-SIGBUS-someplace-random-and-panic
situation I have now.

Given that we are talking about tmpfs running out of space, the host still
has plenty of free memory, and UML kernel stacks can receive the SIGBUS
(because they've been allocated with this mechanism), is this still
objectionable?

Jeff

2002-03-04 17:49:52

by H. Peter Anvin

Subject: Re: [RFC] Arch option to touch newly allocated pages

Followup to: <[email protected]>
By author: Jeff Dike <[email protected]>
In newsgroup: linux.dev.kernel
>
> Even with address overcommit management, I can fault if I touch pages when
> tmpfs is full but the system is not near overcommit.
>
> > Furthermore unless you are very careful you may
> > fault again on the stack push for the SIGBUS and if that faults -
> > SIGKILL->OOM time
>
> We are talking about UML kernel stacks. If they have been allocated the way
> I'm proposing with the UML __alloc_pages touching each page on the way out,
> they are allocated on the host, and therefore can't fault.
>
> This seems to me to be sufficiently careful.
>
> One of us is missing something, who is it?
>

I think it's you -- you seem to suffer from the "my application is the
only one that counts" syndrome. If you want the pages dirtied, then
dirty them using memset() or similar.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-03-04 18:15:14

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> If UML can detect pages which tmpfs can't back as they leave the allocator,
> then it can prevent the rest of the UML kernel from getting randomly SIGBUSed
> as it touches those pages.

Yes I follow this. I don't understand how it is related to your intended
solution.

> To recap in case it got lost in the confusion, I want __alloc_pages to call
> an arch hook before it returns memory, turning every instance of

alloc_pages is only called at the time the backing page is created - by
then it doesnt matter - its too late. You'd need to hack up the same code
areas that are used for mlockall(MCL_FUTURE), not alloc_pages

> Given that we are talking about tmpfs running out of space, the host still
> has plenty of free memory, and UML kernel stacks can receive the SIGBUS
> (because they've been allocated with this mechanism), is this still
> objectionable?

With the vm no overcommit code the tmpfs cannot run out of space filling
in pages, only when you make a tmpfs file larger. The code guarantees there
are swap pages available to back between offset 0 and the file size. A
write extending a tmpfs file may fail reporting the disk full.

The code guarantees (modulo bugs of course!) that the total number of
pages that could be created by touching addresses that have already been
mapped including accounting for tmpfs on the basis above never exceeds the
number of pages available.

The bugs at the moment being
1. ptrace isnt accounted for its special weirdnesses
2. MAP_NORESERVE isnt forcibly accounted in these modes as required

2002-03-04 18:32:35

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> I think it's you -- you seem to suffer from the "my application is the
> only one that counts" syndrome. If you want the pages dirtied, then
> dirty them using memset() or similar.

I think you and Alan think I want the host kernel to do the dirtying. Not so,
I want no changes on the host. I want a hook that UML can use to make sure
that all pages that it allocates are backed.

And memset or something similar is exactly what I have in mind.

Jeff

2002-03-04 18:35:35

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> > alloc_pages is only called at the time the backing page is created -
> > by then it doesnt matter - its too late.
>
> *My* (i.e. the one inside UML) alloc_pages, not the host's would do the
> dirtying. That's the whole point. The UML alloc_pages would make sure
> that the pages it hands out are backed on the host before they are handed
> out to the rest of UML.

Ok got you - so its merely grossly inefficient and downright rude to
other users of the system ?

2002-03-04 18:35:26

by H. Peter Anvin

Subject: Re: [RFC] Arch option to touch newly allocated pages

Jeff Dike wrote:

> [email protected] said:
>
>>I think it's you -- you seem to suffer from the "my application is the
>>only one that counts" syndrome. If you want the pages dirtied, then
>>dirty them using memset() or similar.
>>
>
> I think you and Alan think I want the host kernel to do the dirtying. Not so,
> I want no changes on the host. I want a hook that UML can use to make sure
> that all pages that it allocates are backed.
>
> And memset or something similar is exactly what I have in mind.
>


So why, then, phrase this as a feature request???

-hpa


2002-03-04 18:35:25

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> alloc_pages is only called at the time the backing page is created -
> by then it doesnt matter - its too late.

*My* (i.e. the one inside UML) alloc_pages, not the host's would do the
dirtying. That's the whole point. The UML alloc_pages would make sure
that the pages it hands out are backed on the host before they are handed
out to the rest of UML.

Jeff

2002-03-04 20:34:25

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> So why, then, phrase this as a feature request???

Because it requires a hook in the generic kernel allocator that UML can
use to make sure that all allocated pages are backed on the host.

Jeff

2002-03-04 20:44:10

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Ok got you -

Good, if that's not being sarcastic...

> so its merely grossly inefficient and downright rude to
> other users of the system ?

OK, when something calls alloc_pages and gets back some pages, it's almost
always going to modify them immediately, right?

If this is true, then what I'm proposing would force the host to find backing
memory for those pages a tiny bit earlier than it would have had to otherwise.

This is the only possibility for inefficiency and rudeness that I can see.
If I'm totally missing what you are referring to, please be a little bit more
specific.

Jeff

2002-03-04 22:35:46

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> OK, when something calls alloc_pages and gets back some pages, it's almost
> always going to modify them immediately, right?

Yes. Which is why we don't allocate them when you map an object or create
a shmem fs file

> If this is true, then what I'm proposing would force the host to find backing
> memory for those pages a tiny bit earlier than it would have had to otherwise.

In the normal case about half of the pages that are mapped are never
allocated. In other words no alloc_pages was ever done for them or will ever
be needed.

2002-03-04 22:37:16

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> [email protected] said:
> > So why, then, phrase this as a feature request???
>
> Because it requires a hook in the generic kernel allocator that UML can
> use to make sure that all allocated pages are backed on the host.

At the point you actually allocate pages they are being allocated. No hook
is needed. You seem to misunderstand the way the allocation works - we
allocate address space not memory in things like mmap. We allocate pages
on demand when referenced. The page allocator is only called after a page
is referenced

2002-03-05 04:14:31

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> You seem to misunderstand the way the allocation works - we allocate
> address space not memory in things like mmap. We allocate pages on
> demand when referenced. The page allocator is only called after a page
> is referenced

I understand perfectly well how it works.

You still don't understand what I'm talking about. To make this a bit more
concrete, the patch below implements what I want (plus a couple of bug fixes
needed to make it work).

If you want to run it, apply the 2.4.18-2 UML patch (available at
http://prdownloads.sourceforge.net/user-mode-linux/uml-patch-2.4.18-2.bz2) to
a stock 2.4.18 pool. Copy the pool, apply the patch below to one of them,
and build both.

Mount a 64M tmpfs on /tmp, boot up two 64M UMLs without the patch, run a -j 2
kernel build in each and watch them hang (see http://user-mode-linux.sf.net
for lots of docs, filesystem images, etc if you haven't run UML before). If
you have gdb running on them, you will see that they're stuck at some random
place in the kernel taking an infinite stream of SIGBUSes on a page that tmpfs
can't back. If you apply the relay_signal piece of the patch to this pool,
you will get panics instead of hangs.

Now do the same with two 64M UMLs with the patch. You will see the build die
like this, but the UMLs stay up and they're fairly healthy:

gcc -D__KERNEL__ -I/kernel/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i686 -c -o dma.o dma.c
cpp: output pipe has been closed
gcc: Internal compiler error: program cc1 got fatal signal 11
make[2]: *** [dma.o] Error 1

Note the following:
  - the host is not short of memory, so address space accounting and the
    possibility of random process deaths do not come into play
  - you did not build or reboot the host kernel - all this is strictly
    inside UML
  - the code added to mm.h is a no-op for every arch but UML

So, does this make things at all clearer? Without the patch I get random
UML deaths when tmpfs can't back a page. With it, tmpfs is forced to back
newly allocated pages when they're allocated, and the allocation returns NULL
if it can't. The result being I get no UML deaths and fairly reasonable
behavior.

Jeff


diff -Naur um/arch/um/kernel/exec_kern.c back/arch/um/kernel/exec_kern.c
--- um/arch/um/kernel/exec_kern.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/exec_kern.c Mon Mar 4 17:20:52 2002
@@ -38,6 +38,12 @@
int new_pid;

stack = alloc_stack();
+ if(stack == 0){
+ printk(KERN_ERR
+ "flush_thread : failed to allocate temporary stack\n");
+ do_exit(SIGKILL);
+ }
+
new_pid = start_fork_tramp((void *) current->thread.kernel_stack,
stack, 0, exec_tramp);
if(new_pid < 0){
diff -Naur um/arch/um/kernel/mem.c back/arch/um/kernel/mem.c
--- um/arch/um/kernel/mem.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/mem.c Mon Mar 4 16:04:01 2002
@@ -212,6 +212,32 @@
" just be swapped out.\n Example: mem=64M\n\n"
);

+struct page *arch_validate(struct page *page, int order)
+{
+ unsigned long addr, zero = 0;
+ int i;
+
+ addr = (unsigned long) page_address(page);
+ for(i = 0; i < (1 << order); i++){
+ current->thread.fault_addr = (void *) addr;
+ if(__do_copy_to_user((void *) addr, &zero, sizeof(zero),
+ &current->thread.fault_addr,
+ &current->thread.fault_catcher))
+ return(NULL);
+ addr += PAGE_SIZE;
+ }
+ return(page);
+}
+
+extern void relay_signal(int sig, void *sc, int usermode);
+
+void bus_handler(int sig, void *sc, int usermode)
+{
+ if(current->thread.fault_catcher != NULL)
+ do_longjmp(current->thread.fault_catcher);
+ else relay_signal(sig, sc, usermode);
+}
+
/*
* Overrides for Emacs so that we follow Linus's tabbing style.
* Emacs will notice this stuff at the end of the file and automatically
diff -Naur um/arch/um/kernel/process_kern.c back/arch/um/kernel/process_kern.c
--- um/arch/um/kernel/process_kern.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/process_kern.c Mon Mar 4 17:19:00 2002
@@ -141,7 +141,7 @@
unsigned long page;

if((page = __get_free_page(GFP_KERNEL)) == 0)
- panic("Couldn't allocate new stack");
+ return(0);
stack_protections(page);
return(page);
}
@@ -318,6 +318,11 @@
panic("copy_thread : pipe failed");
if(current->thread.forking){
stack = alloc_stack();
+ if(stack == 0){
+ printk(KERN_ERR "copy_thread : failed to allocate "
+ "temporary stack\n");
+ return(-ENOMEM);
+ }
clone_vm = (p->mm == current->mm);
p->thread.temp_stack = stack;
new_pid = start_fork_tramp((void *) p->thread.kernel_stack,
diff -Naur um/arch/um/kernel/trap_kern.c back/arch/um/kernel/trap_kern.c
--- um/arch/um/kernel/trap_kern.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/trap_kern.c Mon Mar 4 17:22:26 2002
@@ -30,6 +30,7 @@
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct siginfo si;
+ void *catcher;
pgd_t *pgd;
pmd_t *pmd;
pte_t *pte;
@@ -40,6 +41,7 @@
return(0);
}
if(mm == NULL) panic("Segfault with no mm");
+ catcher = current->thread.fault_catcher;
si.si_code = SEGV_MAPERR;
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -84,10 +86,10 @@
up_read(&mm->mmap_sem);
return(0);
bad:
- if (current->thread.fault_catcher != NULL) {
+ if(catcher != NULL) {
current->thread.fault_addr = (void *) address;
up_read(&mm->mmap_sem);
- do_longjmp(current->thread.fault_catcher);
+ do_longjmp(catcher);
}
else if(current->thread.fault_addr != NULL){
panic("fault_addr set but no fault catcher");
@@ -120,6 +122,7 @@

void relay_signal(int sig, void *sc, int usermode)
{
+ if(!usermode) panic("Kernel mode signal %d", sig);
force_sig(sig, current);
}

diff -Naur um/arch/um/kernel/trap_user.c back/arch/um/kernel/trap_user.c
--- um/arch/um/kernel/trap_user.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/trap_user.c Mon Mar 4 15:45:58 2002
@@ -420,11 +420,13 @@

extern int timer_ready, timer_on;

+extern void bus_handler(int sig, void *sc, int usermode);
+
static void (*handlers[])(int, void *, int) = {
[ SIGTRAP ] relay_signal,
[ SIGFPE ] relay_signal,
[ SIGILL ] relay_signal,
- [ SIGBUS ] relay_signal,
+ [ SIGBUS ] bus_handler,
[ SIGSEGV] segv_handler,
[ SIGIO ] sigio_handler,
[ SIGVTALRM ] timer_handler,
diff -Naur um/include/asm-um/page.h back/include/asm-um/page.h
--- um/include/asm-um/page.h Mon Mar 4 17:27:34 2002
+++ back/include/asm-um/page.h Mon Mar 4 15:45:46 2002
@@ -42,4 +42,7 @@
#define virt_to_page(kaddr) (mem_map + (__pa(kaddr) >> PAGE_SHIFT))
#define VALID_PAGE(page) ((page - mem_map) < max_mapnr)

+extern struct page *arch_validate(struct page *page, int order);
+#define HAVE_ARCH_VALIDATE
+
#endif
diff -Naur um/include/linux/mm.h back/include/linux/mm.h
--- um/include/linux/mm.h Mon Mar 4 16:16:44 2002
+++ back/include/linux/mm.h Mon Mar 4 16:43:26 2002
@@ -358,6 +358,13 @@
extern struct page * FASTCALL(__alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist));
extern struct page * alloc_pages_node(int nid, unsigned int gfp_mask, unsigned int order);

+#ifndef HAVE_ARCH_VALIDATE
+static inline struct page *arch_validate(struct page *page, int order)
+{
+ return(page);
+}
+#endif
+
static inline struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
{
/*
@@ -365,7 +372,7 @@
*/
if (order >= MAX_ORDER)
return NULL;
- return _alloc_pages(gfp_mask, order);
+ return arch_validate(_alloc_pages(gfp_mask, order), order);
}

#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
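
The compile-time hook pattern used in the mm.h part of the patch can be tried outside the kernel. The following is a userspace mock, not kernel code: struct page, mock_alloc, mock_free and alloc_pages_hooked are stand-ins invented for illustration. One difference is a choice of this sketch only: it checks for NULL before calling the hook, whereas the patch passes the result of _alloc_pages() straight to arch_validate().

```c
#include <stddef.h>
#include <stdlib.h>

#define MOCK_PAGE_SIZE 4096

struct page { void *mem; };	/* stand-in for the kernel's struct page */

/*
 * An "arch" that wants the hook defines HAVE_ARCH_VALIDATE and supplies
 * its own arch_validate() before this point; everyone else gets a no-op
 * that the compiler can inline away.
 */
#ifndef HAVE_ARCH_VALIDATE
static inline struct page *arch_validate(struct page *page, int order)
{
	(void) order;
	return page;
}
#endif

/* Stand-in for _alloc_pages(): a zeroed block of 2^order mock pages. */
static struct page *mock_alloc(int order)
{
	struct page *page = malloc(sizeof(*page));

	if (page == NULL)
		return NULL;
	page->mem = calloc((size_t) 1 << order, MOCK_PAGE_SIZE);
	if (page->mem == NULL) {
		free(page);
		return NULL;
	}
	return page;
}

static void mock_free(struct page *page)
{
	free(page->mem);
	free(page);
}

/* Every successful allocation passes through the hook on its way out. */
static struct page *alloc_pages_hooked(int order)
{
	struct page *page = mock_alloc(order);

	return page ? arch_validate(page, order) : NULL;
}
```

With the default no-op, callers see exactly what the allocator returned; an overriding arch_validate() that returns NULL turns an unbacked block into an ordinary allocation failure that callers already have to handle.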

2002-03-05 04:28:34

by Benjamin LaHaise

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Mon, Mar 04, 2002 at 11:15:56PM -0500, Jeff Dike wrote:
> So, does this make things at all clearer? Without the patch I get random
> UML deaths when tmpfs can't back a page. With it, tmpfs is forced to back
> newly allocated pages when they're allocated, and the allocation returns NULL
> if it can't. The result being I get no UML deaths and fairly reasonable
> behavior.

From your explanation of things, you only need to do the memsets once at
startup of UML where the ram is allocated -> a uml booted with 64MB of
ram would write into every page of the backing store file before even
running the kernel. Doesn't that accomplish the same thing?

-ben
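
For comparison, the "write into every page of the backing store file at startup" alternative might be sketched as below. This is an illustration only: the function name create_prefaulted and the /tmp/backing-XXXXXX template are assumptions, and the real UML startup code is not shown in this thread. If the filesystem cannot back a page, the memset() here dies with SIGBUS at startup rather than at some later point.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unlinked temporary file of `size` bytes,
 * map it shared, and dirty every page immediately.
 * Returns the mapping, or NULL if the file or mapping could not be set up.
 */
static void *create_prefaulted(size_t size)
{
	char path[] = "/tmp/backing-XXXXXX";
	void *mem;
	int fd = mkstemp(path);

	if (fd < 0)
		return NULL;
	unlink(path);		/* the file now lives only while it is in use */

	if (ftruncate(fd, (off_t) size) != 0) {
		close(fd);
		return NULL;
	}
	mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping keeps the file alive */
	if (mem == MAP_FAILED)
		return NULL;

	memset(mem, 0, size);	/* force backing for every page now */
	return mem;
}
```

The trade-off is the one argued over in the replies that follow: all of the memory is claimed from the backing filesystem up front, whether or not it is ever used.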

2002-03-05 04:38:54

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> From your explanation of things, you only need to do the memsets once
> at startup of UML where the ram is allocated -> a uml booted with
> 64MB of ram would write into every page of the backing store file
> before even running the kernel. Doesn't that accomplish the same
> thing?

Sort of, but it's very heavy-handed. The UML will force memory to be
allocated on the host long before it will ever be needed, and it may never
be needed. This patch doesn't waste memory like that.

Jeff

2002-03-05 05:34:45

by H. Peter Anvin

Subject: Re: [RFC] Arch option to touch newly allocated pages

Jeff Dike wrote:

> [email protected] said:
>
>>From your explanation of things, you only need to do the memsets once
>>at startup of UML where the ram is allocated -> a uml booted with
>>64MB of ram would write into every page of the backing store file
>>before even running the kernel. Doesn't that accomplish the same
>>thing?
>>
>
> Sort of, but it's very heavy-handed. The UML will force memory to be
> allocated on the host long before it will ever be needed, and it may never
> be needed. This patch doesn't waste memory like that.
>


This is not necessarily a bad thing, however. If the user hadn't set up
enough swap, they're probably better off getting the error message early.

-hpa



2002-03-05 14:42:45

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> This is not necessarily a bad thing, however. If the user hadn't set
> up enough swap, they're probably better off getting the error message
> early.

This is not a situation in which a lack of swap or a lack of RAM is a problem.

The problem is a tmpfs filling up.

You think that UML refusing to run if it can't get every bit of memory it
might ever need is preferable to UML running fine in somewhat less memory?

Jeff

2002-03-05 14:42:55

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> you only need to do the memsets once at startup of UML where the ram
> is allocated -> a uml booted with 64MB of ram would write into every
> page of the backing store file before even running the kernel.
> Doesn't that accomplish the same thing?

The other reason I don't like this is that, at some point, I'd like to
start thinking about userspace cooperating with the kernel on memory
management. UML looks like a perfect place to start since it's essentially
identical to the host making it easier for the two to bargain over memory.

Having UML react sanely to unbacked pages is a step in that direction; having
UML preemptively grab all the memory it could ever use isn't.

Jeff

2002-03-05 16:35:20

by H. Peter Anvin

Subject: Re: [RFC] Arch option to touch newly allocated pages

Jeff Dike wrote:
> [email protected] said:
>
>>This is not necessarily a bad thing, however. If the user hadn't set
>>up enough swap, they're probably better off getting the error message
>>early.
>>
>
> This is not a situation in which a lack of swap or a lack of RAM is a problem.
>
> The problem is a tmpfs filling up.
>
> You think that UML refusing to run if it can't get every bit of memory it
> might ever need is preferable to UML running fine in somewhat less memory?
>

Actually, yes, esp. since the only case you have been able to bring up is
one of the sysadmin being a moron.

-hpa


2002-03-05 16:57:31

by Wayne Whitney

Subject: Re: [RFC] Arch option to touch newly allocated pages

H. Peter Anvin wrote:

> Jeff Dike wrote:
>
> > You think that UML refusing to run if it can't get every bit of memory it
> > might ever need is preferable to UML running fine in somewhat less memory?
>
> Actually, yes, esp. since the only case you have been able to bring up is
> one of the sysadmin being a moron.

I could easily imagine it being useful to run multiple UMLs on one
machine (to simulate a network, say), and that one's application
causes each UML to occasionally spike in its memory requirements.
Then it would be disappointing for the number of UMLs one could run to
be determined by this maximum memory requirement, rather than by the
average memory requirement (minus some leeway for a few spiking UMLs).

The hook Jeff asks for seems harmless enough. If there is some
disagreement about how UML interacts with the host kernel on memory
allocation, the two different modes could be a configuration option of
UML. The "touch it all at startup" option could be the default, as it
does make a lot of sense for the single UML case.

Cheers,
Wayne
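
For reference, the "touch it all at startup" mode discussed above can be
sketched in a few lines of userspace C. This is only an illustration, not
UML code: touch_all_pages() is a hypothetical helper, and the page size is
passed in by the caller. It dirties one byte per page so the host has to
supply backing for the whole range immediately.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write one byte into every page of [base, base + len) so that the host
 * must back each page now rather than at some later, arbitrary point.
 * Returns the number of pages touched.
 *
 * Hypothetical helper, not taken from UML. It overwrites the first byte
 * of each page, so it is only suitable before the memory is in use. If
 * the backing store (for example a full tmpfs) cannot supply a page, the
 * write will typically raise SIGBUS, which a real implementation would
 * have to catch.
 */
size_t touch_all_pages(void *base, size_t len, size_t page_size)
{
    volatile unsigned char *p = base;
    size_t touched = 0;

    assert(page_size > 0);
    for (size_t off = 0; off < len; off += page_size) {
        p[off] = 0;        /* dirty the page */
        touched++;
    }
    return touched;
}
```

Calling this once over the whole mapped "physical" memory at boot gives the
fail-early behaviour; skipping the call gives the on-demand behaviour, so
the two modes could plausibly sit behind a single switch.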

2002-03-05 16:59:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

Followup to: <[email protected]>
By author: Jeff Dike <[email protected]>
In newsgroup: linux.dev.kernel
>
> The other reason I don't like this is that, at some point, I'd like to
> start thinking about userspace cooperating with the kernel on memory
> management. UML looks like a perfect place to start since it's essentially
> identical to the host making it easier for the two to bargain over memory.
>
> Having UML react sanely to unbacked pages is a step in that direction, having
> UML preemptively grab all the memory it could ever use isn't.
>

Until you can come up with a sane application for it, this is just
featuritis.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-03-05 17:30:25

by Jan Harkes

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On Tue, Mar 05, 2002 at 09:43:39AM -0500, Jeff Dike wrote:
> [email protected] said:
> > you only need to do the memsets once at startup of UML where the ram
> > is allocated -> a uml booted with 64MB of ram would write into every
> > page of the backing store file before even running the kernel.
> > Doesn't that accomplish the same thing?
>
> The other reason I don't like this is that, at some point, I'd like to
> start thinking about userspace cooperating with the kernel on memory
> management. UML looks like a perfect place to start since it's essentially
> identical to the host making it easier for the two to bargain over memory.

I could use the same thing in Coda, we have large private memory
mappings that are backed by a file which isn't always up-to-date. But we
can make it so by applying the logged modifications. If there is some
'memory pressure' signal we could apply the log and remap the memory to
reduce swap usage.

On the other hand, applying the logged modifications generates a lot of
write activity which could push the system over the edge, so the current
method of having a large amount of swap available is probably more
reliable. Otherwise we'll get the whole OOM killer debate again (the
pre-OOM signaller?).

Jan

2002-03-05 18:11:34

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Actually, yes, esp. since the only case you have been able to bring up
> is one of the sysadmin being a moron.

Really? And you're unconcerned about the impact on the rest of the system
of a UML grabbing (say) 128M of memory when it starts up? Especially if it
may never use it?

And I don't see anything wrong with starting a bunch of UMLs with a total
maximum memory exceeding the available tmpfs as long as they don't all need
all that memory at once. And, if they do, the patch I just posted will let
them deal fairly sanely with the situation.

Jeff

2002-03-05 18:13:54

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Until you can come up with a sane application for it, this is just
> featuritis.

Having the system better manage its memory is "featuritis"?

Jeff

2002-03-05 18:31:16

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On Tue, Mar 05, 2002 at 01:12:19PM -0500, Jeff Dike wrote:
> Really? And you're unconcerned about the impact on the rest of the system
> of a UML grabbing (say) 128M of memory when it starts up? Especially if it
> may never use it?

Honestly, I think that most people want to know if the system they've set up
is overcommitted at as early a point as possible: a UML failing at startup
with out of memory is better than random segvs at some later point when the
system is under load. Refer to the principle of least surprise. And if the
user truly wants to disable that, well, you can give them a command line
option to shoot themselves in the foot with.

-ben
--
"A man with a bass just walked in,
and he's putting it down
on the floor."

2002-03-05 18:46:13

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

Jeff Dike wrote:

> [email protected] said:
>
>>Until you can come up with a sane application for it, this is just
>>featuritis.
>>
>
> Having the system better manage its memory is "featuritis"?
>


s/better/insanely/

Your proposed application is, quite frankly, bullshit.

-hpa


2002-03-05 18:47:42

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

Jeff Dike wrote:

>
> Really? And you're unconcerned about the impact on the rest of the system
> of a UML grabbing (say) 128M of memory when it starts up? Especially if it
> may never use it?

>

It doesn't grab memory, it grabs backing store. The kernel will swap it
out as necessary.

>
> And I don't see anything wrong with starting a bunch of UMLs with a total
> maximum memory exceeding the available tmpfs as long as they don't all need
> all that memory at once. And, if they do, the patch I just posted will let
> them deal fairly sanely with the situation.


Bullshit. It means you have moved your system into an insane corner
case, and you would have been better off denying access in the first place.

-hpa


2002-03-06 01:16:18

by Alan

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

> maximum memory exceeding the available tmpfs as long as they don't all need
> all that memory at once. And, if they do, the patch I just posted will let
> them deal fairly sanely with the situation.

And the address space management stuff in the -ac tree will do all that and
more without force allocating pages and regardless of what other apps do
including without allowing your rude app to kill them.

You are using an axe to batter down a door. Worse than that I fitted a
perfectly good door handle.

2002-03-06 10:50:54

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages


[email protected] said:
> And I don't see anything wrong with starting a bunch of UMLs with a
> total maximum memory exceeding the available tmpfs as long as they
> don't all need all that memory at once. And, if they do, the patch I
> just posted will let them deal fairly sanely with the situation.

Going off at a slight tangent...

You say 'at once'. Does UML somehow give pages back to the host when they're
freed, so the pages that are no longer used by UML can be discarded by the
host instead of getting swapped?

--
dwmw2


2002-03-06 14:26:00

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Does UML somehow give pages back to the host when
> they're freed, so the pages that are no longer used by UML can be
> discarded by the host instead of getting swapped?

No, but it could. Given another hook (in free_pages this time) I could unmap
pages as they're freed, allowing them to be discarded on the host.

Jeff

2002-03-06 15:05:34

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 5, 2002 07:30 pm, Benjamin LaHaise wrote:
> On Tue, Mar 05, 2002 at 01:12:19PM -0500, Jeff Dike wrote:
> > Really? And you're unconcerned about the impact on the rest of the system
> > of a UML grabbing (say) 128M of memory when it starts up? Especially if it
> > may never use it?
>
> Honestly, I think that most people want to know if the system they've setup
> is overcommited at as early a point as possible: a UML failing at startup
> with out of memory is better than random segvs at some later point when the
> system is under load. Refer to the principle of least surprise. And if the
> user truely wants to disable that, well, you can give them a command line
> option to shoot themselves in the foot with.

Suppose you have 512 MB memory and an equal amount of swap. You start 8
umls with 64 MB each. With your and Peter's suggestion, the system always
goes into swap. Whereas if the memory is only allocated on demand it
probably doesn't.

--
Daniel

2002-03-06 15:24:57

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On Wed, Mar 06, 2002 at 03:59:22PM +0100, Daniel Phillips wrote:
> Suppose you have 512 MB memory and an equal amount of swap. You start 8
> umls with 64 MB each. With your and Peter's suggestion, the system always
> goes into swap. Whereas if the memory is only allocated on demand it
> probably doesn't.

As I said previously, going into swap is preferable over randomly killing
new tasks under heavy load.

-ben
--
"A man with a bass just walked in,
and he's putting it down
on the floor."

2002-03-06 15:29:59

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 6, 2002 04:24 pm, Benjamin LaHaise wrote:
> On Wed, Mar 06, 2002 at 03:59:22PM +0100, Daniel Phillips wrote:
> > Suppose you have 512 MB memory and an equal amount of swap. You start 8
> > umls with 64 MB each. With your and Peter's suggestion, the system always
> > goes into swap. Whereas if the memory is only allocated on demand it
> > probably doesn't.
>
> As I said previously, going into swap is preferable over randomly killing
> new tasks under heavy load.

Huh? In the example I gave, you will never oom but with your suggestion, you
will always go needlessly into swap. I'm surprised that you and Peter are
arguing in favor of wasting resources.

--
Daniel

2002-03-06 16:04:23

by Jesse Pollard

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

Daniel Phillips <[email protected]>:
> On March 5, 2002 07:30 pm, Benjamin LaHaise wrote:
> > On Tue, Mar 05, 2002 at 01:12:19PM -0500, Jeff Dike wrote:
> > > Really? And you're unconcerned about the impact on the rest of the system
> > > of a UML grabbing (say) 128M of memory when it starts up? Especially if it
> > > may never use it?
> >
> > Honestly, I think that most people want to know if the system they've setup
> > is overcommited at as early a point as possible: a UML failing at startup
> > with out of memory is better than random segvs at some later point when the
> > system is under load. Refer to the principle of least surprise. And if the
> > user truely wants to disable that, well, you can give them a command line
> > option to shoot themselves in the foot with.
>
> Suppose you have 512 MB memory and an equal amount of swap. You start 8
> umls with 64 MB each. With your and Peter's suggestion, the system always
> goes into swap. Whereas if the memory is only allocated on demand it
> probably doesn't.

Not unless the VM is really bad... All that is called for is that the
virtual space be available. Each umls gets 64 MB, but the rest is guaranteed
available via swap. Nothing has to swap until all processes have expanded
to use all available ram. Currently the only way to ensure that the memory
IS available is to modify every page at startup. Yes it will swap the modified
pages.

But it should only do so once, until the pages are really needed.

Otherwise the umls run until the system goes OOM - then somebody gets killed.
Much nicer to have it die at the beginning instead of after 4-5 hours of
operation when it needs just "one more page" only to find out that the system
lied when it said it was available.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2002-03-06 16:36:40

by Alan

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

> You say 'at once'. Does UML somehow give pages back to the host when they're
> freed, so the pages that are no longer used by UML can be discarded by the
> host instead of getting swapped?

Doesn't seem to but it looks like madvise might be enough to make that
happen. That BTW is an issue for more than UML - it has a bearing on
running lots of Linux instances on any supervisor/virtualising system
like S/390

2002-03-06 16:37:00

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On Wed, Mar 06, 2002 at 04:24:17PM +0100, Daniel Phillips wrote:
> On March 6, 2002 04:24 pm, Benjamin LaHaise wrote:
> > On Wed, Mar 06, 2002 at 03:59:22PM +0100, Daniel Phillips wrote:
> > > Suppose you have 512 MB memory and an equal amount of swap. You start 8
> > > umls with 64 MB each. With your and Peter's suggestion, the system always
> > > goes into swap. Whereas if the memory is only allocated on demand it
> > > probably doesn't.
> >
> > As I said previously, going into swap is preferable over randomly killing
> > new tasks under heavy load.
>
> Huh? In the example I gave, you will never oom but with your suggestion, you
> will always go needlessly go into swap. I'm suprised that you and Peter are
> aguing in favor of wasting resources.

I'm arguing in favour of predictable behaviour. Stability and reliability
are more important than a bit of swap space.

-ben
--
"A man with a bass just walked in,
and he's putting it down
on the floor."

2002-03-06 17:07:02

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Currently the only way to ensure that the memory IS available is to
> modify every page at startup. Yes it will swap the modified pages.

Currently, yes.

But Alan says his address space accounting will prevent mmaps from
succeeding if populating them would OOM the system, which gives you what
you want and which sounds like the right thing. The 8 64M UMLs will run
without needing to touch all their pages at bootup and without fear of being
killed later. If the 9th UML would be in danger of random death, then it
will never get off the ground.

Note that this doesn't help when the UMLs are under a smaller limit than
RAM + .5 * swap or whatever as happens when they are mmapping from tmpfs.
That's the situation that I'm concerned about.

Jeff

2002-03-06 17:19:22

by Alan

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

> Note that this doesn't help when the UMLs are under a smaller limit than
> RAM + .5 * swap or whatever as happens when they are mmapping from tmpfs.
> That's the situation that I'm concerned about.

Making tmpfs enforce the policy in those modes both checking the global
overcommit and also enforcing a "must be able to fill in the pages between
start and end of file" for the tmpfs file size itself is not hard from
inspection. If it's needed I can add that next update to the address
accounting.

2002-03-06 20:24:32

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Doesn't seem to but it looks like madvise might be enough to make that
> happen.

Yeah, MADV_DONTNEED looks right. UML and Linux/s390 (assuming VM has the
equivalent of MADV_DONTNEED) would need a hook in free_pages to make that
happen.

> That BTW is an issue for more than UML - it has a bearing on running
> lots of Linux instances on any supervisor/virtualising system like S/390

On a side note, the "unused memory is wasted memory" behavior that UML and
Linux/s390 inherit is also less than optimal for the host.

Jeff
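
As a rough illustration of the MADV_DONTNEED behaviour being discussed,
here is a small userspace sketch. It assumes a Linux host and uses an
anonymous private mapping, which is not how UML maps its memory;
dontneed_demo() is a made-up name. On Linux, an anonymous private page
handed back this way reads as zeroes when it is next touched.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map an anonymous private region, dirty its first byte, give the range
 * back to the kernel with MADV_DONTNEED, then read the byte again.
 * Returns the value read after madvise(), or -1 on error.
 */
int dontneed_demo(size_t len)
{
    volatile unsigned char *p;
    int value;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 0xAA;                                   /* page is now dirty */
    if (madvise((void *)p, len, MADV_DONTNEED) != 0) {
        munmap((void *)p, len);
        return -1;
    }
    value = p[0];                                  /* refaults a fresh page */
    munmap((void *)p, len);
    return value;
}
```

For a shared mapping of a file, such as a file on tmpfs, the same call
drops the pages from the mapping but leaves the file's data in place, so
this alone would probably not hand tmpfs space back to the host.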

2002-03-06 20:40:43

by Alan

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

> Yeah, MADV_DONTNEED looks right. UML and Linux/s390 (assuming VM has the
> equivalent of MADV_DONTNEED) would need a hook in free_pages to make that
> happen.

VM allows you to give it back a page and if you use it again you get a
clean copy. What it seems to lack is the more ideal "here have this page
and if I reuse it trap if you did throw it out" semantic.

> > That BTW is an issue for more than UML - it has a bearing on running
> > lots of Linux instances on any supervisor/virtualising system like S/390
>
> On a side note, the "unused memory is wasted memory" behavior that UML and
> Linux/s390 inherit is also less than optimal for the host.

Yes. I believe IBM folks are studying that

2002-03-06 21:24:56

by Malcolm Beattie

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

Alan Cox writes:
> > Yeah, MADV_DONTNEED looks right. UML and Linux/s390 (assuming VM has the
> > equivalent of MADV_DONTNEED) would need a hook in free_pages to make that
> > happen.
>
> VM allows you to give it back a page and if you use it again you get a
> clean copy.

Yep, clean as in a page of zeroes when you touch it. (DIAGNOSE X'10' as
documented in the "CP Programming Services" manual, to be precise).

> What it seems to lack is the more ideal "here have this page
> and if I reuse it trap if you did throw it out" semantic.

We're looking at ways of having fancier memory management information
pass between Linux and CP (it's safer to say CP (the "kernel" part of
VM/ESA and z/VM) than VM, given the ambiguous and confusing dual
meaning of "VM" otherwise :-).

> > > That BTW is an issue for more than UML - it has a bearing on running
> > > lots of Linux instances on any supervisor/virtualising system like S/390
> >
> > On a side note, the "unused memory is wasted memory" behavior that UML and
> > Linux/s390 inherit is also less than optimal for the host.
>
> Yes. I believe IBM folks are studying that

Indeed. A "quick hack" that turns out to have rather useful, fun
properties is to have a little device driver (can be a module) which
stores "negative pages" in the page cache by allocating page cache
pages for the device's inode and then invoking the CP "release page"
call mentioned above. Linux thinks the page is "useful" and so keeps
it around until memory pressure kicks it out whereas the underlying
CP knows it's a hole making the resident size and working set of the
Linux image reduce. Add in a bit of feedback to get Linux re-reading
the "device" into cache proportionally to how much CP wants to kick
*out* resident pages from the image. Fun... However, closer
integration with the main mm system is the "proper" way to do it
(but depends on stuff like the latency, overheads and information
shared with CP so is a little more than an afternoon hack.)

--Malcolm

--
Malcolm Beattie <[email protected]>
Linux Technical Consultant
IBM EMEA Enterprise Server Group...
...from home, speaking only for myself

2002-03-06 21:28:27

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages


[email protected] said:
> Yeah, MADV_DONTNEED looks right. UML and Linux/s390 (assuming VM has
> the equivalent of MADV_DONTNEED) would need a hook in free_pages to
> make that happen.

MADV_DONTNEED
Do not expect access in the near future. (For the
time being, the application is finished with the
given range, so the kernel can free resources
associated with it.)

It's not clear from that that the host kernel is actually permitted to
discard the data.

[email protected] said:
> VM allows you to give it back a page and if you use it again you get
> a clean copy. What it seems to lack is the more ideal "here have this
> page and if I reuse it trap if you did throw it out" semantic.

I've wittered on occasion about other situations where such semantics might
be useful -- essentially 'drop these pages if you need to as if they were
clean, and tell me when I next touch them so I can recreate their data'.

UML might want that kind of thing for its (clean) page cache pages or
something, but for pages allocated for kernel stack and task struct we
really want the opposite -- we want to make sure they're present when we
allocate them, and explicitly discard them when we're done.

--
dwmw2


2002-03-06 22:30:48

by Joseph Malicki

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

>
> [email protected] said:
> > Yeah, MADV_DONTNEED looks right. UML and Linux/s390 (assuming VM has
> > the equivalent of MADV_DONTNEED) would need a hook in free_pages to
> > make that happen.
>
> MADV_DONTNEED
> Do not expect access in the near future. (For the
> time being, the application is finished with the
> given range, so the kernel can free resources
> associated with it.)
>
> It's not clear from that that the host kernel is actually permitted to
> discard the data.

Solaris has MADV_FREE to say that the data can be discarded...

-joe

2002-03-06 23:20:25

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 6, 2002 05:36 pm, Benjamin LaHaise wrote:
> On Wed, Mar 06, 2002 at 04:24:17PM +0100, Daniel Phillips wrote:
> > On March 6, 2002 04:24 pm, Benjamin LaHaise wrote:
> > > On Wed, Mar 06, 2002 at 03:59:22PM +0100, Daniel Phillips wrote:
> > > > Suppose you have 512 MB memory and an equal amount of swap. You start 8
> > > > umls with 64 MB each. With your and Peter's suggestion, the system always
> > > > goes into swap. Whereas if the memory is only allocated on demand it
> > > > probably doesn't.
> > >
> > > As I said previously, going into swap is preferable over randomly killing
> > > new tasks under heavy load.
> >
> > Huh? In the example I gave, you will never oom but with your suggestion, you
> > will always go needlessly go into swap. I'm suprised that you and Peter are
> > aguing in favor of wasting resources.
>
> I'm arguing in favour of predictable behaviour. Stability and reliability
> are more important than a bit of swap space.

That's the same argument that says memory overcommit should not be allowed.

--
Daniel

2002-03-06 23:20:55

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, Mar 07, 2002 at 12:14:15AM +0100, Daniel Phillips wrote:
> On March 6, 2002 05:36 pm, Benjamin LaHaise wrote:
> > On Wed, Mar 06, 2002 at 04:24:17PM +0100, Daniel Phillips wrote:
> > > On March 6, 2002 04:24 pm, Benjamin LaHaise wrote:
> > > > On Wed, Mar 06, 2002 at 03:59:22PM +0100, Daniel Phillips wrote:
> > > > > Suppose you have 512 MB memory and an equal amount of swap. You start 8
> > > > > umls with 64 MB each. With your and Peter's suggestion, the system always
> > > > > goes into swap. Whereas if the memory is only allocated on demand it
> > > > > probably doesn't.
> > > >
> > > > As I said previously, going into swap is preferable over randomly killing
> > > > new tasks under heavy load.
> > >
> > > Huh? In the example I gave, you will never oom but with your suggestion, you
> > > will always go needlessly go into swap. I'm suprised that you and Peter are
> > > aguing in favor of wasting resources.
> >
> > I'm arguing in favour of predictable behaviour. Stability and reliability
> > are more important than a bit of swap space.
>
> That's the same argument that says memory overcommit should not be allowed.

Go back in the thread: I suggested making it an option that the user has to
turn on to allow his foot to be shot. Remember: the common case in the kernel
is to be using all memory.

-ben
--
"A man with a bass just walked in,
and he's putting it down
on the floor."

2002-03-06 23:25:15

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> A "quick hack" that turns out to have rather useful, fun properties is
> to have a little device driver (can be a module) which stores
> "negative pages" in the page cache by allocating page cache pages for
> the device's inode and then invoking the CP "release page" call
> mentioned above.

Yeah, I was thinking about something like that. It's unclear how it should
figure out how much memory to grab, though. You'd have to get some idea
how desperate the host is for memory and balance that off against how
desperate the VM is.

And you want to avoid doing things that just aggravate the host's situation,
i.e. if it is swapping its brains out, you want the VM to just drop some
clean pages and you definitely don't want it swapping dirty ones and add
to the host's IO load.

> However, closer
> integration with the main mm system is the "proper" way to do it (but
> depends on stuff like the latency, overheads and information shared
> with CP so is a little more than an afternoon hack.)

Yup.

Is any of your (you or IBM in general) thinking on this written down publicly
anywhere?

Jeff

2002-03-06 23:31:45

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 12:20 am, Benjamin LaHaise wrote:
> On Thu, Mar 07, 2002 at 12:14:15AM +0100, Daniel Phillips wrote:
> > On March 6, 2002 05:36 pm, Benjamin LaHaise wrote:
> > > On Wed, Mar 06, 2002 at 04:24:17PM +0100, Daniel Phillips wrote:
> > > > On March 6, 2002 04:24 pm, Benjamin LaHaise wrote:
> > > > > On Wed, Mar 06, 2002 at 03:59:22PM +0100, Daniel Phillips wrote:
> > > > > > Suppose you have 512 MB memory and an equal amount of swap. You start 8
> > > > > > umls with 64 MB each. With your and Peter's suggestion, the system always
> > > > > > goes into swap. Whereas if the memory is only allocated on demand it
> > > > > > probably doesn't.
> > > > >
> > > > > As I said previously, going into swap is preferable over randomly killing
> > > > > new tasks under heavy load.
> > > >
> > > > Huh? In the example I gave, you will never oom but with your suggestion, you
> > > > will always go needlessly go into swap. I'm suprised that you and Peter are
> > > > aguing in favor of wasting resources.
> > >
> > > I'm arguing in favour of predictable behaviour. Stability and reliability
> > > are more important than a bit of swap space.
> >
> > That's the same argument that says memory overcommit should not be allowed.
>
> Go back in the thread: I suggested making it an option that the user has to
> turn on to allow his foot to be shot. Remember: the common case in the kernel
> is to be using all memory.

OK, now suppose the user has turned on that option (I think it should be on by
default, like memory overcommit). How is Jeff going to support it? That's his
whole point as I understand it.

Instead of providing constructive suggestions on how to solve the problem so that
memory overcommit works properly in this case, I see people telling Jeff there is
no problem. I think Jeff has a little more of a clue than that.

--
Daniel

2002-03-06 23:34:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

Daniel Phillips wrote:

>
> Instead of providing constructive suggestions on how to solve the problem so that
> memory overcommit works properly in this case, I see people telling Jeff there is
> no problem. I think Jeff has a little more of a clue than that.
>


Jeff has clue, but you, Daniel, quite frankly could take a cue. You seem
to be jumping into arguments just for the sake of them, but without ever
contributing anything useful.

Please do us all a favour and shut up for once.

-hpa


2002-03-07 00:04:51

by Richard Gooch

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

David Woodhouse writes:
>
> [email protected] said:
> > Yeah, MADV_DONTNEED looks right. UML and Linux/s390 (assuming VM has
> > the equivalent of MADV_DONTNEED) would need a hook in free_pages to
> > make that happen.
>
> MADV_DONTNEED
> Do not expect access in the near future. (For the
> time being, the application is finished with the
> given range, so the kernel can free resources
> associated with it.)
>
> It's not clear from that that the host kernel is actually permitted to
> discard the data.
>
> [email protected] said:
> > VM allows you to give it back a page and if you use it again you get
> > a clean copy. What it seems to lack is the more ideal "here have this
> > page and if I reuse it trap if you did throw it out" semantic.
>
> I've wittered on occasion about other situations where such
> semantics might be useful -- essentially 'drop these pages if you
> need to as if they were clean, and tell me when I next touch them so
> I can recreate their data'.

Indeed. I'd love such a feature. It's got applications in
numerical/scientific code, not just UML.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2002-03-07 00:13:23

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 12:33 am, H. Peter Anvin wrote:
> Daniel Phillips wrote:
>
> > Instead of providing constructive suggestions on how to solve the problem so that
> > memory overcommit works properly in this case, I see people telling Jeff there is
> > no problem. I think Jeff has a little more of a clue than that.
>
> Jeff has clue, but you, Daniel, quite frankly could take a cue. You nseem
> to be jumping into arguments just for the sake of them, but without ever
> contribute anything useful.

The useful contribution is to stop you and Ben from beating up on Jeff. Thank you,
I think I've accomplished that purpose. Feel free to attack me for that if you feel
the need.

(objectionable comment removed)

--
Daniel

2002-03-07 00:27:43

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> and also enforcing a "must be able to fill in the pages between start
> and end of file" for the tmpfs file size itself is not hard from
> inspection.

So if I mapped a single page from file offset 65M on a 64M tmpfs, that would
fail?

I'd prefer maps to fail when they make the total maps exceed the tmpfs limit.

Then I can map in smaller chunks, PAGE_SIZE if necessary. That has the
disadvantage that the vmas in the host would be even uglier than they are
now because we don't have vma merging any more.

UML would still need that page_alloc hook, except it would map the allocated
pages instead of touching them.

Jeff
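
The "fail the map when the total would exceed the tmpfs limit" policy can
be sketched as simple bookkeeping. This is a toy model, not the -ac
address space accounting or any tmpfs code: struct map_budget,
budget_map() and budget_unmap() are hypothetical names. Only the running
total matters, so a single page mapped from offset 65M of a 64M
filesystem would still be accepted as long as the total stays within the
limit.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-filesystem accounting: bytes allowed vs. bytes mapped. */
struct map_budget {
    size_t limit;    /* size limit of the tmpfs instance, in bytes */
    size_t mapped;   /* bytes currently mapped; never exceeds limit */
};

/* Charge a new mapping; refuse it if the total would exceed the limit. */
int budget_map(struct map_budget *b, size_t len)
{
    if (len > b->limit - b->mapped)
        return -ENOMEM;
    b->mapped += len;
    return 0;
}

/* Return the charge when a mapping goes away. */
void budget_unmap(struct map_budget *b, size_t len)
{
    b->mapped = (len > b->mapped) ? 0 : b->mapped - len;
}
```

With this kind of check the caller can map in small chunks, down to
PAGE_SIZE, and learns about exhaustion at map time rather than when a
page is first touched.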

2002-03-07 00:28:13

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> MADV_DONTNEED
> Do not expect access in the near future. (For the
> time being, the application is finished with the
> given range, so the kernel can free resources
> associated with it.)
> It's not clear from that that the host kernel is actually permitted to
> discard the data.

Hmmm, you have better man pages than me. I don't have an madvise man page
on either of my boxes (RH 6.2 and 7.2 :-)

From that description, you're right. The code is very clear on what happens,
as is the comment above sys_madvise:

* MADV_DONTNEED - the application is finished with the given range,
* so the kernel can free resources associated with it.

> UML might want that kind of thing for its (clean) page cache pages or
> something, but for pages allocated for kernel stack and task struct we
> really want the opposite -- we want to make sure they're present when
> we allocate them, and explicitly discard them when we're done.

Yeah, that's a decent idea. If you were going to make it fancier, you could
cover the case that the UML's clean pages are all busy but it has lots of
old dirty pages lying around. But then you'd need some way for the host to
tell the UML that I/O would be a really bad idea and it should just dump
clean pages.

Jeff

2002-03-07 00:30:23

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> I'd prefer maps to fail when they make the total maps exceed the tmpfs limit.

That makes more sense and can be done yes. Probably it wants to be a tmpfs
option

2002-03-07 00:31:23

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> Hmmm, you have better man pages than me. I don't have an madvise man page
> on either of my boxes (RH 6.2 and 7.2 :-)

Curious. I have one on my 7.2 box 8)

2002-03-07 01:26:45

by Jeff Dike

Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Go back in the thread: I suggested making it an option that the user
> has to turn on to allow his foot to be shot.

OK, this seems to be the relevant quote (and you seem to be referring to the
kernel build segfaults - correct me if I'm wrong):

[email protected] said:
> a UML failing at startup with out of memory is better than random
> segvs at some later point when the system is under load.

I showed the kernel build segfaulting as an improvement over UML hanging,
which is the alternative behavior.

The segfaults were caused by me implementing the simplest possible response
to alloc_pages returning unbacked pages, which is to return NULL to the
caller. This is actually wrong because in this failure case, it effectively
changes the semantics of GFP_USER, GFP_KERNEL, and the other blocking GFP_*
allocations to GFP_ATOMIC. And that's what forced UML to segfault the
compilations.

A slightly fancier recovery would loop calling alloc_pages until it got a set
of already-backed pages (with some possible sleeping in alloc_pages in there).
That would preserve the blocking semantics of GFP_USER, GFP_KERNEL, et al,
and would have allowed the UML userspace (the kernel build) to continue working
as it should.
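
A toy model of that loop, as a sketch only. `struct pool_sim`, `sim_alloc` and `alloc_backed` are hypothetical stand-ins for the page allocator, and the assumption that unbacked pages are set aside (so they are not handed out again) is mine, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_PAGES 8

struct page_sim {
	bool backed;	/* the host has memory behind this page */
	bool in_use;	/* handed out, or set aside as unbacked */
};

struct pool_sim {
	struct page_sim pages[POOL_PAGES];
	int sleeps;	/* how often the caller had to wait */
};

/* Stand-in for alloc_pages(): first free page index, or -1. */
int sim_alloc(struct pool_sim *pool)
{
	for (int i = 0; i < POOL_PAGES; i++) {
		if (!pool->pages[i].in_use) {
			pool->pages[i].in_use = true;
			return i;
		}
	}
	return -1;
}

/* Keep allocating until a backed page turns up.  Unbacked pages stay
 * marked in_use so they are not returned a second time.  Gives up with
 * -1 only after waiting max_sleeps times with nothing left to try. */
int alloc_backed(struct pool_sim *pool, int max_sleeps)
{
	assert(max_sleeps >= 0);
	for (;;) {
		int idx = sim_alloc(pool);

		if (idx >= 0) {
			if (pool->pages[idx].backed)
				return idx;
			continue;	/* unbacked: set aside, try the next */
		}
		if (pool->sleeps >= max_sleeps)
			return -1;
		pool->sleeps++;		/* a real kernel would sleep here */
	}
}
```

The caller never sees an unbacked page; it either gets a usable one or blocks, which is the behaviour a GFP_KERNEL caller expects.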

So, a slightly improved version of the patch (which I can write up if you're
interested in seeing it) would have allowed UML and its userspace to continue
running fine (albeit in less memory than it expected) in the presence of an
overcommitted tmpfs.

Jeff

2002-03-07 01:52:49

by Benjamin LaHaise

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Wed, Mar 06, 2002 at 08:27:51PM -0500, Jeff Dike wrote:
> I showed the kernel build segfaulting as an improvement over UML hanging,
> which is the alternative behavior.

Versus fully allocating the backing store, which would neither hang nor
cause segfaults. This is the behaviour that one expects by default, and
should be the first line of defense before going to the overcommit model.
Get that aspect of reliability in place, then add the overcommit support.
What is better: having uml fail before attempting to boot with an "unable
to allocate backing store" message, or a random oops during early kernel
init? As I see it, supporting the safe mode of operation first makes more
sense before adding yet another arch hook.
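
One way to get that safe mode from user space is to reserve the whole memory file before booting. The sketch below assumes POSIX `posix_fallocate()`, which returns an error such as ENOSPC up front instead of leaving holes to fault later; `create_backed_memory` and the idea of unlinking the file immediately are illustrative choices, not what UML actually does.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* for mkstemp() and posix_fallocate() on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Create the file that will back guest memory and reserve all of it
 * now.  path_template must end in "XXXXXX" as mkstemp() requires.
 * Returns an open descriptor, or -1 if the store cannot be reserved. */
int create_backed_memory(char *path_template, off_t size)
{
	int fd, err;

	assert(size > 0);
	fd = mkstemp(path_template);
	if (fd < 0)
		return -1;
	unlink(path_template);	/* the file lives only as long as fd */

	err = posix_fallocate(fd, 0, size);	/* returns an errno value */
	if (err != 0) {
		fprintf(stderr, "unable to allocate backing store: %s\n",
			strerror(err));
		close(fd);
		return -1;
	}
	return fd;
}
```

If the filesystem is too small, this fails before anything is booted, with a message, rather than at some arbitrary later page touch.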

-ben
--
"A man with a bass just walked in,
and he's putting it down
on the floor."

2002-03-07 10:20:56

by Pavel Machek

Subject: Re: [RFC] Arch option to touch newly allocated pages

Hi!

> > You say 'at once'. Does UML somehow give pages back to the host when they're
> > freed, so the pages that are no longer used by UML can be discarded by the
> > host instead of getting swapped?
>
> Doesn't seem to but it looks like madvise might be enough to make that
> happen. That BTW is an issue for more than UML - it has a bearing on
> running lots of Linux instances on any supervisor/virtualising system
> like S/390

I just imagined hardware which supports freeing memory -- just do not
refresh it any more to conserve power ;-))).

Granted, it would probably only make sense in big chunks, like 2MB or
so... It might make sense for a PDA...
Pavel

--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-03-07 11:31:37

by Dave Jones

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Wed, Mar 06, 2002 at 11:21:50PM +0100, Pavel Machek wrote:
> I just imagined hardware which supports freeing memory -- just do not
> refresh it any more to conserve power ;-))).
> Granted, it would probably only make sense in big chunks, like 2MB or
> so... It might make sense for a PDA...

ISTR reading about one handheld that did something like this (possibly psion)
The hardware has the ability to migrate data from one memory bank to
another and power down the least used one.

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-03-07 13:36:06

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> caller. This is actually wrong because in this failure case, it effectively
> changes the semantics of GFP_USER, GFP_KERNEL, and the other blocking GFP_*
> allocations to GFP_ATOMIC. And that's what forced UML to segfault the
> compilations.

GFP_KERNEL will sometimes return NULL.

2002-03-07 13:41:57

by Daniel Phillips

Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 02:49 pm, Alan Cox wrote:
> Jeff Dike Apparently wrote
> > caller. This is actually wrong because in this failure case, it effectively
> > changes the semantics of GFP_USER, GFP_KERNEL, and the other blocking GFP_*
> > allocations to GFP_ATOMIC. And that's what forced UML to segfault the
> > compilations.
>
> GFP_KERNEL will sometimes return NULL.

Sad but true. IMHO we are on track to fix that in this kernel cycle, with
better locked/dirty accounting and rmap to forcibly unmap pages when necessary.

--
Daniel

2002-03-07 14:04:59

by Victor Yodaiken

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, Mar 07, 2002 at 02:36:08PM +0100, Daniel Phillips wrote:
> On March 7, 2002 02:49 pm, Alan Cox wrote:
> > Jeff Dike Apparently wrote
> > > caller. This is actually wrong because in this failure case, it effectively
> > > changes the semantics of GFP_USER, GFP_KERNEL, and the other blocking GFP_*
> > > allocations to GFP_ATOMIC. And that's what forced UML to segfault the
> > > compilations.
> >
> > GFP_KERNEL will sometimes return NULL.
>
> Sad but true. IMHO we are on track to fix that in this kernel cycle, with
> better locked/dirty accounting and rmap to forcibly unmap pages when necessary.

Why is that a fix? And how can it work?


--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-03-07 14:27:35

by Daniel Phillips

Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 03:04 pm, [email protected] wrote:
> On Thu, Mar 07, 2002 at 02:36:08PM +0100, Daniel Phillips wrote:
> > On March 7, 2002 02:49 pm, Alan Cox wrote:
> > > Jeff Dike Apparently wrote
> > > > caller. This is actually wrong because in this failure case, it effectively
> > > > changes the semantics of GFP_USER, GFP_KERNEL, and the other blocking GFP_*
> > > > allocations to GFP_ATOMIC. And that's what forced UML to segfault the
> > > > compilations.
> > >
> > > GFP_KERNEL will sometimes return NULL.
> >
> > Sad but true. IMHO we are on track to fix that in this kernel cycle, with
> > better locked/dirty accounting and rmap to forcibly unmap pages when necessary.
>
> Why is that a fix? And how can it work?

Since there is always at least one freeable page in the system (or we're oom) then
we just have to find it and we know we can forcibly unmap it. We do need to know
the total of pinned pages, I should have said locked/dirty/pinned.

Since GFP_KERNEL includes __GFP_WAIT, we are even allowed to wait for dirty page
writeout.

--
Daniel

2002-03-07 14:29:26

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> Since there is always at least one freeable page in the system (or we're oom) then
> we just have to find it and we know we can forcibly unmap it. We do need to know
> the total of pinned pages, I should have said locked/dirty/pinned.

What if I did a 4 page allocation ?

And if we are OOM - we want to return NULL

2002-03-07 14:38:28

by Victor Yodaiken

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, Mar 07, 2002 at 03:21:24PM +0100, Daniel Phillips wrote:
> On March 7, 2002 03:04 pm, [email protected] wrote:
> > On Thu, Mar 07, 2002 at 02:36:08PM +0100, Daniel Phillips wrote:
> > > On March 7, 2002 02:49 pm, Alan Cox wrote:
> > > > Jeff Dike Apparently wrote
> > > > > caller. This is actually wrong because in this failure case, it effectively
> > > > > changes the semantics of GFP_USER, GFP_KERNEL, and the other blocking GFP_*
> > > > > allocations to GFP_ATOMIC. And that's what forced UML to segfault the
> > > > > compilations.
> > > >
> > > > GFP_KERNEL will sometimes return NULL.
> > >
> > > Sad but true. IMHO we are on track to fix that in this kernel cycle, with
> > > better locked/dirty accounting and rmap to forcibly unmap pages when necessary.
> >
> > Why is that a fix? And how can it work?
>
> Since there is always at least one freeable page in the system (or we're oom) then
> we just have to find it and we know we can forcibly unmap it. We do need to know
> the total of pinned pages, I should have said locked/dirty/pinned.


What if we are oom?
What if we are on our way to deadlock?
What if the caller of kmalloc will make less good use of the page
than the current owner of the page?

	page_t *x, *p;
	for (i = 0; i < SOME_MADE_UP_NUMBER_THAT_SEEMS_GOOD; i++)
		if ((p = kmalloc(..))) {
			copyfromuser(x++, p);
			dispatch_to_output(p);
		}
		else { // do the rest later
			...
		}





>
> Since GFP_KERNEL includes __GFP_WAIT, we are even allowed to wait for dirty page
> writeout.
>
> --
> Daniel

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-03-07 15:37:28

by Daniel Phillips

Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 03:38 pm, [email protected] wrote:
> On Thu, Mar 07, 2002 at 03:21:24PM +0100, Daniel Phillips wrote:
> > On March 7, 2002 03:04 pm, [email protected] wrote:
> > > On Thu, Mar 07, 2002 at 02:36:08PM +0100, Daniel Phillips wrote:
> > > > On March 7, 2002 02:49 pm, Alan Cox wrote:
> > > > > Jeff Dike Apparently wrote
> > > > > > caller. This is actually wrong because in this failure case, it effectively
> > > > > > changes the semantics of GFP_USER, GFP_KERNEL, and the other blocking GFP_*
> > > > > > allocations to GFP_ATOMIC. And that's what forced UML to segfault the
> > > > > > compilations.
> > > > >
> > > > > GFP_KERNEL will sometimes return NULL.
> > > >
> > > > Sad but true. IMHO we are on track to fix that in this kernel cycle, with
> > > > better locked/dirty accounting and rmap to forcibly unmap pages when necessary.
> > >
> > > Why is that a fix? And how can it work?
> >
> > Since there is always at least one freeable page in the system (or we're oom) then
> > we just have to find it and we know we can forcibly unmap it. We do need to know
> > the total of pinned pages, I should have said locked/dirty/pinned.
>
>
> What if we are oom?

This problem didn't get any worse, we still have to deal with it. We can wait, so
we deal with it in the standard way (i.e., we puke, have to do something about that.)

> What if we are on our way to deadlock?

huh??

> What if the caller of kmalloc will make less good use of the page
> than the current owner of the page?

That's life, that's what lrus are for.

> page_t *x,*p;
> for(i = 0; i < SOME_MADE_UP_NUMBER_THAT_SEEMS_GOOD;i++)
> if( p = kmalloc(..)){
> copyfromuser(x++,p);
> dispatch_to_output(p);
> }
> else {//do the rest later
> ...
> }

Please put your thinking cap on and come up with a less borked interface
for doing that ;-)

You won't find one if you don't look for it.

--
Daniel

2002-03-07 15:38:48

by Daniel Phillips

Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 03:43 pm, Alan Cox wrote:
> > Since there is always at least one freeable page in the system (or we're oom) then
> > we just have to find it and we know we can forcibly unmap it. We do need to know
> > the total of pinned pages, I should have said locked/dirty/pinned.
>
> What if I did a 4 page allocation ?

Higher order allocation - imho we can fix that too, eventually, however it's a lot
more work. First we have to have reliable physical defragmentation.

> And if we are OOM - we want to return NULL

What good does that do?

--
Daniel

2002-03-07 15:39:58

by Daniel Phillips

Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 03:43 pm, Alan Cox wrote:
> And if we are OOM - we want to return NULL

Oh, right, it lets an allocator that didn't 100% need the page use a
fallback strategy, but for that we probably want a different interface
anyway, such as a GFP flag that says 'fail if this looks hard to get'.
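
As a rough sketch of what such a flag would change, here is a toy allocator. `SIM_WAIT`, `SIM_OK_TO_FAIL`, `struct vm_sim` and `sim_alloc_page` are hypothetical names for illustration, not real GFP bits or kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags for the sketch. */
#define SIM_WAIT	0x1u	/* caller may block while pages are reclaimed */
#define SIM_OK_TO_FAIL	0x2u	/* caller has a fallback: fail rather than work hard */

struct vm_sim {
	int free_pages;		/* available immediately */
	int reclaimable_pages;	/* available only after writeout or unmapping */
};

/* Returns true if a page was handed out. */
bool sim_alloc_page(struct vm_sim *vm, unsigned int flags)
{
	assert(vm->free_pages >= 0 && vm->reclaimable_pages >= 0);

	if (vm->free_pages > 0) {
		vm->free_pages--;
		return true;
	}
	/* Nothing free: this is the "looks hard to get" case. */
	if (flags & SIM_OK_TO_FAIL)
		return false;
	if ((flags & SIM_WAIT) && vm->reclaimable_pages > 0) {
		vm->reclaimable_pages--;	/* stands in for waiting on reclaim */
		return true;
	}
	return false;	/* atomic caller, or genuinely out of memory */
}
```

The point of the split is that a NULL return then means either "you said you could cope" or "we really are out of memory", instead of covering both cases with one ambiguous failure.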

--
Daniel

2002-03-07 16:06:06

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> Higher order allocation - imho we can fix that too, eventually, however it's a lot
> more work. First we have to have reliable physical defragmentation.
>
> > And if we are OOM - we want to return NULL
>
> What good does that do?

It allows us to continue. It avoids the deadlocks. It lets the caller
make an intelligent decision.

2002-03-07 16:51:09

by Victor Yodaiken

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, Mar 07, 2002 at 04:31:10PM +0100, Daniel Phillips wrote:
> > > > Why is that a fix? And how can it work?
> > >
> > > Since there is always at least one freeable page in the system (or we're oom) then
> > > we just have to find it and we know we can forcibly unmap it. We do need to know
> > > the total of pinned pages, I should have said locked/dirty/pinned.
> >
> >
> > What if we are oom?
>
> This problem didn't get any worse, we still have to deal with it. We can wait, so
> we deal with it in the standard way (i.e., we puke, have to do something about that.)

So it can return NULL?

>
> > What if we are on our way to deadlock?
>
> huh??

Process A needs 4 pages, Process B needs 4 pages, each grabs 3.
One easy, traditional unix algorithm for dealing with this is
	for (i = 0; i < 4; i++)
		if (!(p[i] = kmalloc(...)))
			free all that we have so far
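
Spelled out as compilable user-space C, the back-off pattern looks something like this. `page_alloc`, `page_free` and the `pages_left` counter are hypothetical stand-ins for kmalloc/kfree and for system memory, so that the failure path can actually be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

int pages_left = 3;	/* hypothetical: pages the "system" still has */

/* Stand-in for kmalloc(): fails once the budget is used up. */
void *page_alloc(size_t size)
{
	void *p;

	if (pages_left == 0)
		return NULL;
	p = malloc(size);
	if (p)
		pages_left--;
	return p;
}

/* Stand-in for kfree(). */
void page_free(void *p)
{
	if (p) {
		free(p);
		pages_left++;
	}
}

/* Take all n buffers or none.  On failure everything grabbed so far is
 * released, so two callers that each need n cannot sit forever holding
 * n - 1 apiece. */
int alloc_all_or_nothing(void *p[], int n, size_t size)
{
	assert(n > 0);
	for (int i = 0; i < n; i++) {
		p[i] = page_alloc(size);
		if (!p[i]) {
			while (i-- > 0) {
				page_free(p[i]);
				p[i] = NULL;
			}
			return -1;
		}
	}
	return 0;
}
```

This only works because the allocator is allowed to return NULL; an allocator that blocks indefinitely never gives the caller the chance to let go.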


> > What if the caller of kmalloc will make less good use of the page
> > than the current owner of the page?
>
> That's life, that's what lrus are for.

Really? I thought LRUs were to approximate working sets. Obviously
if a program is kmallocing its working set is changing but that
does not tell us anything about whether it is a correct decision to
rip a page from the working set of another process.

>
> > page_t *x,*p;
> > for(i = 0; i < SOME_MADE_UP_NUMBER_THAT_SEEMS_GOOD;i++)
> > if( p = kmalloc(..)){
> > copyfromuser(x++,p);
> > dispatch_to_output(p);
> > }
> > else {//do the rest later
> > ...
> > }
>
> Please put your thinking cap on and come up with a less borked interface
> for doing that ;-)
>
> You won't find one if you don't look for it.

I'm too dumb to come up with a solution here, but you are the one
changing the interface, so surely you have a couple of "less borked"
solutions in mind - right?







--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-03-07 17:59:58

by Daniel Phillips

Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 05:19 pm, Alan Cox wrote:
> > Higher order allocation - imho we can fix that too, eventually, however it's a lot
> > more work. First we have to have reliable physical defragmentation.
> >
> > > And if we are OOM - we want to return NULL
> >
> > What good does that do?
>
> It allows us to continue. It avoids the deadlocks.

Could you describe the deadlock, please?

> It lets the caller make an intelligent decision.

I maintain it's the wrong interface, we're mixing two concepts together there:

- VM can't find blocks that are freeable, so fails and dumps the problem
on the caller, which has to busy wait. This sucks.

- The VM is under heavy load and the caller doesn't really need the memory
that badly because it has a fallback, the VM somehow knows this, so fails
the allocation and everybody is happy.

These should be separated, and we should fix the former.

--
Daniel

2002-03-07 18:13:19

by Daniel Phillips

Subject: Re: [RFC] Arch option to touch newly allocated pages

On March 7, 2002 05:50 pm, [email protected] wrote:
> On Thu, Mar 07, 2002 at 04:31:10PM +0100, Daniel Phillips wrote:
> > > > > Why is that a fix? And how can it work?
> > > >
> > > > Since there is always at least one freeable page in the system (or we're oom) then
> > > > we just have to find it and we know we can forcibly unmap it. We do need to know
> > > > the total of pinned pages, I should have said locked/dirty/pinned.
> > >
> > > What if we are oom?
> >
> > This problem didn't get any worse, we still have to deal with it. We can wait, so
> > we deal with it in the standard way (i.e., we puke, have to do something about that.)
>
> So it can return NULL?

Returning null here won't help if the caller doesn't have a fallback, or if the fallback
is unacceptable, such as losing a filesystem transaction.

> > > What if we are on our way to deadlock?
> >
> > huh??
>
> Process A needs 4 pages, Process B needs 4 pages, each grabs 3.

This is no new deadlock. Supposing each has successfully grabbed 3, what
good does it do if the process is too clueless to release the pages it's
already grabbed, because the 4th page alloc fails? (The first 3 may have
been alloced in a completely different part of the program.) And if the
process does know how to do this, it should tell the VM that *then* the VM
should feel free to fail it.

> One easy, traditional unix algorithm for dealing with this is
> for(i=0; i < 4; i++)if !(p[i]=kmalloc(...))
> free all that we have so far

Just OR in GFP_ok_to_fail there.

> > > What if the caller of kmalloc will make less good use of the page
> > > than the current owner of the page?
> >
> > That's life, that's what lrus are for.
>
> Really? I thought LRUs were to approximate working sets. Obviously
> if a program is kmallocing its working set is changing but that
> does not tell us anything about whether it is a correct decision to
> rip a page from the working set of another process.

We're getting way far from the original question here. Our lru has no
concept of working set, it's completely global. That's not so great and
it's another problem to tackle. Sometime.

> > You won't find one if you don't look for it.
>
> I'm too dumb to come up with a solution here, but you are the one
> changing the interface, so surely you have a couple of "less borked"
> solutions in mind - right?

Yes. Well, I'm not alone here, ping Marcelo on that if you like. This is
known borkness that's been deferred while more pressing borkness is dealt
with.

--
Daniel

2002-03-07 18:15:59

by Victor Yodaiken

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, Mar 07, 2002 at 07:07:23PM +0100, Daniel Phillips wrote:
> > Really? I thought LRUs were to approximate working sets. Obviously
> > if a program is kmallocing its working set is changing but that
> > does not tell us anything about whether it is a correct decision to
> > rip a page from the working set of another process.
>
> We're getting way far from the original question here. Our lru has no
> concept of working set, it's completely global. That's not so great and
> it's another problem to tackle. Sometime.

Global lru is an approximation of per-task working set. That's why it
works. But it's not perfect.

>
> > > You won't find one if you don't look for it.
> >
> > I'm too dumb to come up with a solution here, but you are the one
> > changing the interface, so surely you have a couple of "less borked"
> > solutions in mind - right?
>
> Yes. Well, I'm not alone here, ping Marcelo on that if you like. This is
> known borkness that's been deferred while more pressing borkness is dealt
> with.

So you and Marcelo are planning on making changes to the semantics
of primitive memory allocation modules in the production kernel?

Can that be true? I hope not.



--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2002-03-07 18:22:29

by H. Peter Anvin

Subject: Re: [RFC] Arch option to touch newly allocated pages

Pavel Machek wrote:

> Hi!
>
> > > You say 'at once'. Does UML somehow give pages back to the host when they're
> > > freed, so the pages that are no longer used by UML can be discarded by the
> > > host instead of getting swapped?
> >
> > Doesn't seem to but it looks like madvise might be enough to make that
> > happen. That BTW is an issue for more than UML - it has a bearing on
> > running lots of Linux instances on any supervisor/virtualising system
> > like S/390
>
> I just imagined hardware which supports freeing memory -- just do not
> refresh it any more to conserve power ;-))).
>
> Granted, it would probably only make sense in big chunks, like 2MB or
> so... It might make sense for a PDA...
> Pavel


Unlikely. Also, if you're using ECC, then that really screws with you.

However, if it is an issue for more than UML (I still consider the
particular UML case "in case you have a UML on a tmpfs set up by an
idiot admin" completely bogus) then it's another issue. The S/390 issue
is real.

-hpa


2002-03-07 19:09:02

by Alan

Subject: Re: [RFC] Arch option to touch newly allocated pages

> > So it can return NULL?
>
> Returning null here won't help if the caller doesn't have a fallback, or if the fallback
> is unacceptable, such as losing a filesystem transaction.

Not having a fallback is unacceptable. That's the real problem. You can't
go around pandering to sloppy coders who can't work a memory allocator

2002-03-07 19:21:15

by Andrew Morton

Subject: Re: [RFC] Arch option to touch newly allocated pages

Daniel Phillips wrote:
>
> a GFP flag that says 'fail if this looks hard to get'.

Something like that would provide a solution to the
readahead thrashing problem.

-

2002-03-07 20:11:09

by Rik van Riel

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, 7 Mar 2002, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > a GFP flag that says 'fail if this looks hard to get'.
>
> Something like that would provide a solution to the
> readahead thrashing problem.

Nope. Readahead pages are clean and very easy to evict, so
it's still trivial to evict all the pages from another readahead
window because everybody's readahead window is too large.

regards,

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-07 20:59:02

by Andrew Morton

Subject: Re: [RFC] Arch option to touch newly allocated pages

Rik van Riel wrote:
>
> On Thu, 7 Mar 2002, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > >
> > > a GFP flag that says 'fail if this looks hard to get'.
> >
> > Something like that would provide a solution to the
> > readahead thrashing problem.
>
> Nope. Readahead pages are clean and very easy to evict, so
> it's still trivial to evict all the pages from another readahead
> window because everybody's readahead window is too large.
>

I was thinking an explicit GFP_READAHEAD and PG_readahead.
Where a GFP_READAHEAD allocation would fail if it can't
find any non-readahead pages. And it would fail if it
had to perform I/O.
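
A toy version of that victim search, to make the two failure conditions concrete. `SIM_PG_READAHEAD`, `SIM_PG_DIRTY` and `pick_victim_for_readahead` are hypothetical names, and the LRU is reduced to an array of flag words with the coldest page at index 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page flags for the sketch. */
#define SIM_PG_READAHEAD 0x1u	/* read ahead, not yet used by anyone */
#define SIM_PG_DIRTY	 0x2u	/* would need I/O before it can be reused */

/* Walk the LRU from the coldest page and pick a page that a readahead
 * allocation may steal.  Returns its index, or -1, in which case the
 * readahead allocation simply fails. */
int pick_victim_for_readahead(const unsigned int *lru, size_t n)
{
	assert(lru != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (lru[i] & SIM_PG_READAHEAD)
			continue;	/* never steal unused readahead */
		if (lru[i] & SIM_PG_DIRTY)
			continue;	/* never start I/O for readahead */
		return (int)i;
	}
	return -1;
}
```

The loop may have to skip many pages before it finds a victim or gives up, which is the walk cost in question.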

That's not nice - it'd result in large LRU walks. But it'd
be better than the 10x slowdown which readahead thrashing
causes.

Any clever ideas?

-

2002-03-07 21:24:13

by Rik van Riel

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, 7 Mar 2002, Andrew Morton wrote:

> > Nope. Readahead pages are clean and very easy to evict, so
> > it's still trivial to evict all the pages from another readahead
> > window because everybody's readahead window is too large.

> Any clever ideas?

1) keep track of which pages we are reading ahead
... the readahead code already does this

2) at read() or fault time, see if the page
(a) is resident
(b) is in the current readahead window,
ie. already read ahead

3) if the page is in the current readahead window
but NOT resident, the page was read in and
evicted before we got around to using it, so
readahead window thrashing is going on
... in that case, collapse the size of the
readahead window TCP-style

4) slowly growing the readahead window when there is
enough memory available, in order to minimise the
number of disk seeks

5) the shrinking in (3) and growing in (4) mean that
the readahead size of all streaming IO in the system
gets automatically balanced against each other and
against other memory demand in the system
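
The steps above can be sketched as a small state machine. Halving on thrash and growing by one page at a time are assumptions picked to mimic TCP's multiplicative decrease and additive increase; `struct ra_window` and `ra_on_read` are hypothetical names, not the existing readahead code.

```c
#include <assert.h>
#include <stdbool.h>

struct ra_window {
	unsigned int pages;	/* current readahead window size */
	unsigned int min_pages;
	unsigned int max_pages;
};

/* Called at read() or fault time for the page being accessed.
 *   in_window:        the page was already read ahead (step 2b)
 *   resident:         the page is still in memory (step 2a)
 *   memory_available: the system has room to spare (step 4) */
void ra_on_read(struct ra_window *w, bool in_window, bool resident,
		bool memory_available)
{
	assert(w->min_pages <= w->max_pages);

	if (in_window && !resident) {
		/* Step 3: read ahead but evicted before use, so we are
		 * thrashing.  Collapse the window. */
		w->pages /= 2;
		if (w->pages < w->min_pages)
			w->pages = w->min_pages;
	} else if (memory_available && w->pages < w->max_pages) {
		/* Step 4: grow slowly while there is memory. */
		w->pages++;
	}
}
```

Each stream runs this independently; the balancing in step 5 falls out of every stream backing off when its own pages get evicted.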

regards,

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-07 22:04:18

by Andrew Morton

Subject: Re: [RFC] Arch option to touch newly allocated pages

Rik van Riel wrote:
>
> On Thu, 7 Mar 2002, Andrew Morton wrote:
>
> > > Nope. Readahead pages are clean and very easy to evict, so
> > > it's still trivial to evict all the pages from another readahead
> > > window because everybody's readahead window is too large.
>
> > Any clever ideas?
>
> 1) keep track of which pages we are reading ahead
> ... the readahead code already does this
>
> 2) at read() or fault time, see if the page
> (a) is resident
> (b) is in the current readahead window,
> ie. already read ahead
>
> 3) if the page is in the current readahead window
> but NOT resident, the page was read in and
> evicted before we got around to using it, so
> readahead window thrashing is going on
> ... in that case, collapse the size of the
> readahead window TCP-style

I have all that. See handle_ra_thrashing() in
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre2/dallocbase-10-readahead.patch

> 4) slowly growing the readahead window when there is
> enough memory available, in order to minimise the
> number of disk seeks
>
> 5) the shrinking in (3) and growing in (4) mean that
> the readahead size of all streaming IO in the system
> gets automatically balanced against each other and
> against other memory demand in the system

Doesn't work.

Ah, this is hard to describe.

umm.

a) Suppose that we're getting readahead thrashing. readahead
pages are getting dropped. So we keep seeking to each
file to get new data, so we do a ton of seeking.

b) Suppose that we nicely detect thrashing and reduce the readahead
window. Well, we *still* need to seek to each file to read
some blocks.

See? They're equivalent. In case a) we're doing more (pointless)
I/O, but the cost of that is vanishingly small because it's just
one request.

So what *is* a solution. Well, there's only so much memory available.
In either case a) or case b) we're "fairly" distributing that memory
between all files. And that's the problem. *All* the files have too
small a readahead window. Which points one at: we need to stop being
fair. We need to give some files a good readahead window and others
not. The "soft pinning" which I propose with GFP_READAHEAD and
PG_readahead might have that effect, I think.

I'll try it, see how it feels.

-

2002-03-07 22:11:48

by Rik van Riel

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, 7 Mar 2002, Andrew Morton wrote:

> > 5) the shrinking in (3) and growing in (4) mean that
> > the readahead size of all streaming IO in the system
> > gets automatically balanced against each other and
> > against other memory demand in the system
>
> Doesn't work.
>
> Ah, this is hard to describe.
>
> umm.
>
> a) Suppose that we're getting readahead thrashing. readahead
> pages are getting dropped. So we keep seeking to each
> file to get new data, so we do a ton of seeking.
>
> b) Suppose that we nicely detect thrashing and reduce the readahead
> window. Well, we *still* need to seek to each file to read
> some blocks.
>
> See? They're equivalent. In case a) we're doing more (pointless)
> I/O, but the cost of that is vanishingly small because it's just
> one request.
>
> So what *is* a solution. Well, there's only so much memory available.
> In either case a) or case b) we're "fairly" distributing that memory
> between all files. And that's the problem. *All* the files have too
> small a readahead window. Which points one at: we need to stop being
> fair. We need to give some files a good readahead window and others
> not. The "soft pinning" which I propose with GFP_READAHEAD and
> PG_readahead might have that effect, I think.

Actually, it could boil down to something more:

use-once reduces the VM to FIFO order, which suffers from
belady's anomaly so it doesn't matter much how much memory
you throw at it
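
For anyone who has not seen Belady's anomaly in action, a small FIFO simulation shows it. The reference string in the usage note is the classic textbook one, not a trace from any real workload, and `fifo_faults` is just an illustrative helper.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 16

/* Count page faults for pure FIFO replacement with `frames` frames. */
int fifo_faults(const int *refs, size_t n, int frames)
{
	int slot[MAX_FRAMES];
	int used = 0;	/* frames filled so far */
	int next = 0;	/* oldest frame, i.e. the next one to replace */
	int faults = 0;

	assert(frames > 0 && frames <= MAX_FRAMES);
	for (size_t i = 0; i < n; i++) {
		int hit = 0;

		for (int j = 0; j < used; j++) {
			if (slot[j] == refs[i]) {
				hit = 1;
				break;
			}
		}
		if (hit)
			continue;	/* FIFO: a hit does not reorder anything */

		faults++;
		if (used < frames) {
			slot[used++] = refs[i];
		} else {
			slot[next] = refs[i];
			next = (next + 1) % frames;
		}
	}
	return faults;
}
```

For the reference string 1 2 3 4 1 2 5 1 2 3 4 5 this gives 9 faults with 3 frames and 10 faults with 4 frames: more memory, more faults.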

drop-behind will suffer the same problem once the readahead
memory is too large to keep in the system, but at least the
already-used pages won't kick out readahead pages

regards,

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-07 22:25:30

by Andrew Morton

Subject: Re: [RFC] Arch option to touch newly allocated pages

Rik van Riel wrote:
>
> > So what *is* a solution. Well, there's only so much memory available.
> > In either case a) or case b) we're "fairly" distributing that memory
> > between all files. And that's the problem. *All* the files have too
> > small a readahead window. Which points one at: we need to stop being
> > fair. We need to give some files a good readahead window and others
> > not. The "soft pinning" which I propose with GFP_READAHEAD and
> > PG_readahead might have that effect, I think.
>
> Actually, it could boil down to something more:
>
> use-once reduces the VM to FIFO order, which suffers from
> belady's anomaly so it doesn't matter much how much memory
> you throw at it
>
> drop-behind will suffer the same problem once the readahead
> memory is too large to keep in the system, but at least the
> already-used pages won't kick out readahead pages

err.. Was there a fix in there somewhere, or are we stuck?

-

2002-03-07 22:29:30

by Rik van Riel

Subject: Re: [RFC] Arch option to touch newly allocated pages

On Thu, 7 Mar 2002, Andrew Morton wrote:

> > use-once reduces the VM to FIFO order, which suffers from
> > belady's anomaly so it doesn't matter much how much memory
> > you throw at it
> >
> > drop-behind will suffer the same problem once the readahead
> > memory is too large to keep in the system, but at least the
> > already-used pages won't kick out readahead pages
>
> err.. Was there a fix in there somewhere, or are we stuck?

Imagine how TCP backoff would work if it kept old packets
around and would drop random packets because of too many
old packets in the buffers.

I suspect that the readahead window resizing might work
when we throw away the already-used streaming IO pages
before we start throwing away any pages we're about to
use.

regards,

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-07 22:44:49

by Andrew Morton

Subject: Re: [RFC] Arch option to touch newly allocated pages

Rik van Riel wrote:
>
> On Thu, 7 Mar 2002, Andrew Morton wrote:
>
> > > use-once reduces the VM to FIFO order, which suffers from
> > > Belady's anomaly so it doesn't matter much how much memory
> > > you throw at it
> > >
> > > drop-behind will suffer the same problem once the readahead
> > > memory is too large to keep in the system, but at least the
> > > already-used pages won't kick out readahead pages
> >
> > err.. Was there a fix in there somewhere, or are we stuck?
>
> Imagine how TCP backoff would work if it kept old packets
> around and would drop random packets because of too many
> old packets in the buffers.
>
> I suspect that the readahead window resizing might work
> when we throw away the already-used streaming IO pages
> before we start throwing away any pages we're about to
> use.

ewww.. You seem to be implying that when the readahead
code goes to get a new page, it's reclaiming unused
readahead pages *in preference to* already-used pages.

That would be awful, wouldn't it?

Perhaps an algorithm would be:

a) Call mark_page_accessed once against readahead pages.

b) If thrashing is detected, call mark_page_accessed
twice against readahead pages, to move them onto the
active list.

The intent being to say "this page is important. Throw
something else away".

Seems this would delay the onset of the problem significantly?
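
A rough sketch of steps a) and b), as a userspace toy. It assumes the
2.4-style behaviour of mark_page_accessed, where the first call sets the
referenced bit and a second call on a referenced inactive page moves it to
the active list. The types and the toy_ names are hypothetical stand-ins,
not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

enum toy_list { TOY_INACTIVE, TOY_ACTIVE };

/* Hypothetical stand-in for the per-page state involved. */
struct ra_page {
	int referenced;
	enum toy_list list;
};

/*
 * Assumed model of mark_page_accessed(): a referenced page on the
 * inactive list is promoted, anything else just gets marked referenced.
 */
void toy_mark_page_accessed(struct ra_page *page)
{
	assert(page != NULL);
	if (page->list == TOY_INACTIVE && page->referenced) {
		page->list = TOY_ACTIVE;
		page->referenced = 0;
		return;
	}
	page->referenced = 1;
}

/* Bring a fresh readahead page in, applying step a) and optionally b). */
void toy_readahead_page(struct ra_page *page, int thrashing)
{
	assert(page != NULL);
	page->referenced = 0;
	page->list = TOY_INACTIVE;
	toy_mark_page_accessed(page);		/* a) always */
	if (thrashing)
		toy_mark_page_accessed(page);	/* b) only when thrashing */
}
```

How thrashing would be detected is left open here, as it is in the text
above.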

-

2002-03-07 22:43:08

by David Lang

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

In addition, by reducing the amount of readahead you do for each file, you
can stabilize into a mode where you are doing _some_ readahead without
thrashing, which will reduce your seeks.
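
A small sketch of what such a per-file adjustment could look like,
borrowing the TCP-style back-off from the quoted message: halve the window
when readahead pages were lost before use, grow it by one page otherwise.
The function name and the limits are assumptions made for the example, not
values taken from the kernel.

```c
#include <assert.h>

#define RA_MIN_PAGES 1U		/* assumed lower bound */
#define RA_MAX_PAGES 32U	/* assumed upper bound */

/*
 * Compute the next readahead window for one file.
 * thrashed is nonzero if readahead pages were evicted before being read.
 */
unsigned int ra_next_window(unsigned int window, int thrashed)
{
	assert(window >= RA_MIN_PAGES && window <= RA_MAX_PAGES);
	if (thrashed) {
		/* Back off multiplicatively, but keep some readahead. */
		window /= 2;
		return window < RA_MIN_PAGES ? RA_MIN_PAGES : window;
	}
	/* Probe upwards additively while things go well. */
	return window < RA_MAX_PAGES ? window + 1 : RA_MAX_PAGES;
}
```

The window never drops to zero, so every file keeps doing some readahead
even under pressure.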

David Lang



On Thu, 7 Mar 2002, Rik van Riel wrote:

> Date: Thu, 7 Mar 2002 19:27:49 -0300 (BRT)
> From: Rik van Riel <[email protected]>
> To: Andrew Morton <[email protected]>
> Cc: [email protected]
> Subject: Re: [RFC] Arch option to touch newly allocated pages
>
> On Thu, 7 Mar 2002, Andrew Morton wrote:
>
> > > use-once reduces the VM to FIFO order, which suffers from
> > > Belady's anomaly so it doesn't matter much how much memory
> > > you throw at it
> > >
> > > drop-behind will suffer the same problem once the readahead
> > > memory is too large to keep in the system, but at least the
> > > already-used pages won't kick out readahead pages
> >
> > err.. Was there a fix in there somewhere, or are we stuck?
>
> Imagine how TCP backoff would work if it kept old packets
> around and would drop random packets because of too many
> old packets in the buffers.
>
> I suspect that the readahead window resizing might work
> when we throw away the already-used streaming IO pages
> before we start throwing away any pages we're about to
> use.
>
> regards,
>
> Rik
> --
> <insert bitkeeper endorsement here>
>
> http://www.surriel.com/ http://distro.conectiva.com/

2002-03-07 22:44:51

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages


[email protected] said:
> Not having a fallback is unacceptable. That's the real problem. You
> can't go around pandering to sloppy coders who can't work a memory
> allocator

OTOH there is perhaps some justification for distinguishing between 'If you
fail this I'll tell the user -ENOMEM and continue happily on my way'
allocations and 'If you fail this I lose track of hardware state and all is
fucked till we reboot' ones.

--
dwmw2


2002-03-07 22:56:21

by Alan

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

> [email protected] said:
> > Not having a fallback is unacceptable. That's the real problem. You
> > can't go around pandering to sloppy coders who can't work a memory
> > allocator
>
> OTOH there is perhaps some justification for distinguishing between 'If you
> fail this I'll tell the user -ENOMEM and continue happily on my way'
> allocations and 'If you fail this I lose track of hardware state and all is
> fucked till we reboot' ones.

None at all. If you needed the memory before you committed to an operation
you should have reserved it before you started. See "sloppy coders".
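
As an illustration of the reserve-first pattern, here is a minimal
userspace sketch. The device structure, the allocator callback and all
toy_ names are hypothetical; the point is only that every allocation
happens before any state is changed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook, so the failure path can be exercised. */
typedef void *(*toy_alloc_fn)(size_t len);

/* Hypothetical device state. */
struct toy_dev {
	int programmed;
	void *buf;
};

/* Allocator that always fails, for demonstrating the early exit. */
void *toy_alloc_never(size_t len)
{
	(void) len;
	return NULL;
}

int toy_start_transfer(struct toy_dev *dev, size_t len, toy_alloc_fn alloc)
{
	void *buf;

	assert(dev != NULL && alloc != NULL);

	/* Reserve first: failing here leaves the device untouched. */
	buf = alloc(len);
	if (buf == NULL)
		return -ENOMEM;

	/* Committed: nothing below needs memory any more. */
	dev->buf = buf;
	dev->programmed = 1;
	return 0;
}
```

Called with toy_alloc_never the function returns -ENOMEM and the device
state is unchanged; called with malloc it commits.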

2002-03-07 22:58:08

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages


[email protected] said:
> None at all. If you needed the memory before you committed to an
> operation you should have reserved it before you started. See "sloppy
> coders"

This is true. I must admit I was having trouble trying to think of a real
case where the latter applied in _sane_ code.

--
dwmw2


2002-03-08 19:16:59

by Jeff Dike

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

[email protected] said:
> Versus fully allocating the backing store, which would neither hang
> nor cause segfaults. This is the behaviour that one expects by
> default, and should be the first line of defense before going to the
> overcommit model. Get that aspect of reliability in place, then add
> the overcommit support.

OK, the patch below (against UML 2.4.18-2) implements reliable overcommit
for UML.

The test was the same as before -
	64M tmpfs on /tmp
	two 64M UMLs
	one -j 2 kernel build running in each

tmpfs was exhausted nearly immediately. Both builds ran to completion.
At the end, the 64M tmpfs was divided roughly 30M/35M between the two UMLs.

The first chunk of the patch (mm.h) is the hook that I started this thread
talking about. It's a no-op for all arches except UML (and s390, if they
decide they can use it).

The next two (asm/page.h and mem.c) implement the hook for UML. I believe
it correctly preserves the failure semantics of alloc_pages. Please let me
know if I missed something.

It tests for unbacked pages by writing to them and catching the resulting
SIGBUS. On a host with address space accounting, it would instead map the
page and catch the map failures.
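
The write-and-catch idea can be tried from ordinary userspace. The sketch
below is not the UML code: it uses sigsetjmp/siglongjmp in place of UML's
fault catcher, and it provokes SIGBUS by mapping a page beyond the end of
an empty temporary file, which is an assumption chosen to make the effect
easy to reproduce on Linux. The names probe_page and probe_tmpfile_page
are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;

static void probe_sigbus(int sig)
{
	(void) sig;
	siglongjmp(probe_env, 1);	/* unwind out of the faulting write */
}

/* Return 0 if a write to addr succeeds, -1 if it fails or raises SIGBUS. */
int probe_page(volatile unsigned long *addr)
{
	struct sigaction sa, old;
	int ret;

	assert(addr != NULL);
	sa.sa_handler = probe_sigbus;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGBUS, &sa, &old) < 0)
		return -1;
	if (sigsetjmp(probe_env, 1) == 0) {
		*addr = 0;		/* dirty the page */
		ret = 0;
	} else {
		ret = -1;		/* arrived here via the handler */
	}
	sigaction(SIGBUS, &old, NULL);
	return ret;
}

/*
 * Map one page of a temporary file and probe it. If backed is zero the
 * file stays empty, so the page has nothing behind it.
 * Returns the probe_page() result, or -2 if the setup itself failed.
 */
int probe_tmpfile_page(int backed)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	FILE *f = tmpfile();
	void *p;
	int ret;

	if (f == NULL)
		return -2;
	if (pagesz <= 0 || (backed && ftruncate(fileno(f), pagesz) < 0)) {
		fclose(f);
		return -2;
	}
	p = mmap(NULL, (size_t) pagesz, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fileno(f), 0);
	if (p == MAP_FAILED) {
		fclose(f);
		return -2;
	}
	ret = probe_page(p);
	munmap(p, (size_t) pagesz);
	fclose(f);
	return ret;
}
```

The handler is installed only around the probing write and the previous
disposition is restored afterwards, so a SIGBUS from anywhere else is
still delivered normally.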

The rest of the patch is UML bug fixes which you're only interested in if
you want to boot it up.

One bug - if alloc_pages returns a combination of backed and unbacked pages
for an order > 0 allocation, the backed pages will effectively be leaked.

TBD -
	a corresponding arch hook in free_pages which UML can use for
	MADV_DONTNEED

	some way of poking at unbacked pages to see if they are now backed
	and can be released back to free_pages

These two items would go some way to allowing multiple UMLs to pass host
memory back and forth as needed when it gets scarce.

Jeff

diff -Naur um/include/linux/mm.h back/include/linux/mm.h
--- um/include/linux/mm.h Thu Mar 7 11:56:36 2002
+++ back/include/linux/mm.h Thu Mar 7 11:57:31 2002
@@ -358,6 +358,13 @@
extern struct page * FASTCALL(__alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist));
extern struct page * alloc_pages_node(int nid, unsigned int gfp_mask, unsigned int order);

+#ifndef HAVE_ARCH_VALIDATE
+static inline struct page *arch_validate(struct page *page, unsigned int gfp_mask, int order)
+{
+ return(page);
+}
+#endif
+
static inline struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
{
/*
@@ -365,7 +372,7 @@
*/
if (order >= MAX_ORDER)
return NULL;
- return _alloc_pages(gfp_mask, order);
+ return arch_validate(_alloc_pages(gfp_mask, order), gfp_mask, order);
}

#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
diff -Naur um/include/asm-um/page.h back/include/asm-um/page.h
--- um/include/asm-um/page.h Mon Mar 4 17:27:34 2002
+++ back/include/asm-um/page.h Thu Mar 7 11:57:01 2002
@@ -42,4 +42,7 @@
#define virt_to_page(kaddr) (mem_map + (__pa(kaddr) >> PAGE_SHIFT))
#define VALID_PAGE(page) ((page - mem_map) < max_mapnr)

+extern struct page *arch_validate(struct page *page, int mask, int order);
+#define HAVE_ARCH_VALIDATE
+
#endif
diff -Naur um/arch/um/kernel/mem.c back/arch/um/kernel/mem.c
--- um/arch/um/kernel/mem.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/mem.c Thu Mar 7 11:57:17 2002
@@ -212,6 +212,39 @@
" just be swapped out.\n Example: mem=64M\n\n"
);

+struct page *arch_validate(struct page *page, int mask, int order)
+{
+ unsigned long addr, zero = 0;
+ int i;
+
+ again:
+ if(page == NULL) return(page);
+ addr = (unsigned long) page_address(page);
+ for(i = 0; i < (1 << order); i++){
+ current->thread.fault_addr = (void *) addr;
+ if(__do_copy_to_user((void *) addr, &zero,
+ sizeof(zero),
+ &current->thread.fault_addr,
+ &current->thread.fault_catcher)){
+ if(!(mask & __GFP_WAIT)) return(NULL);
+ else break;
+ }
+ addr += PAGE_SIZE;
+ }
+ if(i == (1 << order)) return(page);
+ page = _alloc_pages(mask, order);
+ goto again;
+}
+
+extern void relay_signal(int sig, void *sc, int usermode);
+
+void bus_handler(int sig, void *sc, int usermode)
+{
+ if(current->thread.fault_catcher != NULL)
+ do_longjmp(current->thread.fault_catcher);
+ else relay_signal(sig, sc, usermode);
+}
+
/*
* Overrides for Emacs so that we follow Linus's tabbing style.
* Emacs will notice this stuff at the end of the file and automatically
diff -Naur um/arch/um/kernel/exec_kern.c back/arch/um/kernel/exec_kern.c
--- um/arch/um/kernel/exec_kern.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/exec_kern.c Mon Mar 4 18:05:20 2002
@@ -38,6 +38,12 @@
int new_pid;

stack = alloc_stack();
+ if(stack == 0){
+ printk(KERN_ERR
+ "flush_thread : failed to allocate temporary stack\n");
+ do_exit(SIGKILL);
+ }
+
new_pid = start_fork_tramp((void *) current->thread.kernel_stack,
stack, 0, exec_tramp);
if(new_pid < 0){
diff -Naur um/arch/um/kernel/process_kern.c back/arch/um/kernel/process_kern.c
--- um/arch/um/kernel/process_kern.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/process_kern.c Mon Mar 4 18:05:20 2002
@@ -141,7 +141,7 @@
unsigned long page;

if((page = __get_free_page(GFP_KERNEL)) == 0)
- panic("Couldn't allocate new stack");
+ return(0);
stack_protections(page);
return(page);
}
@@ -318,6 +318,11 @@
panic("copy_thread : pipe failed");
if(current->thread.forking){
stack = alloc_stack();
+ if(stack == 0){
+ printk(KERN_ERR "copy_thread : failed to allocate "
+ "temporary stack\n");
+ return(-ENOMEM);
+ }
clone_vm = (p->mm == current->mm);
p->thread.temp_stack = stack;
new_pid = start_fork_tramp((void *) p->thread.kernel_stack,
diff -Naur um/arch/um/kernel/trap_kern.c back/arch/um/kernel/trap_kern.c
--- um/arch/um/kernel/trap_kern.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/trap_kern.c Mon Mar 4 18:05:20 2002
@@ -30,6 +30,7 @@
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct siginfo si;
+ void *catcher;
pgd_t *pgd;
pmd_t *pmd;
pte_t *pte;
@@ -40,6 +41,7 @@
return(0);
}
if(mm == NULL) panic("Segfault with no mm");
+ catcher = current->thread.fault_catcher;
si.si_code = SEGV_MAPERR;
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -84,10 +86,10 @@
up_read(&mm->mmap_sem);
return(0);
bad:
- if (current->thread.fault_catcher != NULL) {
+ if(catcher != NULL) {
current->thread.fault_addr = (void *) address;
up_read(&mm->mmap_sem);
- do_longjmp(current->thread.fault_catcher);
+ do_longjmp(catcher);
}
else if(current->thread.fault_addr != NULL){
panic("fault_addr set but no fault catcher");
@@ -120,6 +122,7 @@

void relay_signal(int sig, void *sc, int usermode)
{
+ if(!usermode) panic("Kernel mode signal %d", sig);
force_sig(sig, current);
}

diff -Naur um/arch/um/kernel/trap_user.c back/arch/um/kernel/trap_user.c
--- um/arch/um/kernel/trap_user.c Mon Mar 4 17:27:34 2002
+++ back/arch/um/kernel/trap_user.c Mon Mar 4 18:05:20 2002
@@ -420,11 +420,13 @@

extern int timer_ready, timer_on;

+extern void bus_handler(int sig, void *sc, int usermode);
+
static void (*handlers[])(int, void *, int) = {
[ SIGTRAP ] relay_signal,
[ SIGFPE ] relay_signal,
[ SIGILL ] relay_signal,
- [ SIGBUS ] relay_signal,
+ [ SIGBUS ] bus_handler,
[ SIGSEGV] segv_handler,
[ SIGIO ] sigio_handler,
[ SIGVTALRM ] timer_handler,

2002-03-08 21:23:17

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [RFC] Arch option to touch newly allocated pages

On Fri, Mar 08, 2002 at 02:17:53PM -0500, Jeff Dike wrote:
> OK, the patch below (against UML 2.4.18-2) implements reliable overcommit
> for UML.

Well, I still dislike it, but I guess it'll have to do. The only nits I see
about the patch are: could you make the inline function a #define for the
no-arch_validate case? Also, the format of if statements is a bit abnormal:
please add line breaks as appropriate. Aside from that, go ahead.
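
For reference, a sketch of what the #define variant of the fallback might
look like. It is written as a standalone fragment with a stub struct page
so that it compiles outside the kernel; toy_alloc_pages is a hypothetical
caller, and the exact macro form is a guess at what is being asked for.

```c
#include <assert.h>
#include <stddef.h>

struct page { int dummy; };	/* stub, stands in for the kernel type */

#ifndef HAVE_ARCH_VALIDATE
/*
 * Fallback for arches without their own arch_validate: evaluate each
 * argument once and hand the page back unchanged.
 */
#define arch_validate(page, gfp_mask, order) \
	((void) (gfp_mask), (void) (order), (page))
#endif

/* Hypothetical caller showing where the hook would sit. */
struct page *toy_alloc_pages(struct page *candidate, unsigned int gfp_mask,
			     unsigned int order)
{
	assert(order < 10);	/* arbitrary limit for the toy */
	return arch_validate(candidate, gfp_mask, order);
}
```

An arch that defines HAVE_ARCH_VALIDATE and supplies a real function
overrides the macro without any change to the caller.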

-ben