2000-11-15 13:10:01

by Martin Schwidefsky

[permalink] [raw]
Subject: Memory management bug



I think I spotted a problem in the memory management of some (all?)
architectures in 2.4.0-test10. At the moment I am fighting with the 64bit
backend for the new S/390 machines. I experienced infinite loops in
do_check_pgt_cache because pgtable_cache_size indicated that a lot of pages
are in the quicklists but the pgd/pmd/pte quicklists have been empty (NULL
pointers). After some trickery with some special hardware feature (storage
keys) I found out that empty_bad_pmd_table and empty_bad_pte_table have
been put to the page table quicklists multiple(!) times. It is already a
bug that these two arrays are inserted into the quicklist at all but the
second insertation destroys the quicklists. I solved this problem by
inserting checks for the special entries in the free_xxx_fast routines,
here is a sample for the i386 free_pte_fast:

diff -u -r1.5 pgalloc.h
--- include/asm-i386/pgalloc.h 2000/11/02 10:14:51 1.5
+++ include/asm-i386/pgalloc.h 2000/11/15 12:27:58
@@ -80,8 +80,11 @@
return (pte_t *)ret;
}

+extern pte_t empty_bad_pte_table[];
extern __inline__ void free_pte_fast(pte_t *pte)
{
+ if (pte == empty_bad_pte_table)
+ return;
*(unsigned long *)pte = (unsigned long) pte_quicklist;
pte_quicklist = (unsigned long *) pte;
pgtable_cache_size++;

I still get the "__alloc_pages: 2-order allocation failed." error messages
but at least the machine doesn't go into infinite loops anymore. Could
someone with more experience with the other architectures verify that my
observation is true?

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]



2000-11-15 13:49:39

by Andi Kleen

[permalink] [raw]
Subject: Re: Memory management bug

On Wed, Nov 15, 2000 at 01:39:13PM +0100, [email protected] wrote:
> +extern pte_t empty_bad_pte_table[];
> extern __inline__ void free_pte_fast(pte_t *pte)
> {
> + if (pte == empty_bad_pte_table)
> + return;

I guess that should be BUG() instead of return, so that the callers can be
fixed.


-Andi

2000-11-15 13:55:20

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: Memory management bug



>> +extern pte_t empty_bad_pte_table[];
>> extern __inline__ void free_pte_fast(pte_t *pte)
>> {
>> + if (pte == empty_bad_pte_table)
>> + return;
>
>I guess that should be BUG() instead of return, so that the callers can be
>fixed.
Not really. pte_free and pmd_free are called from the common mm code but
the concept of empty_bad_{pte,pmd}_table is architecture dependent. The
trouble starts in arch/???/mm/init.c where these special arrays are
inserted into the paging tables. So the solution to the problem should be
in architecture dependent files too.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2000-11-15 17:16:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: Memory management bug

In article <[email protected]>,
> After some trickery with some special hardware feature (storage
>keys) I found out that empty_bad_pmd_table and empty_bad_pte_table have
>been put to the page table quicklists multiple(!) times.

This is definitely bad, and means that something else really bad is
going on.

In fact, I have this fairly strong suspicion that we should just get rid
of the "bad" page tables altogether, and make the stuff that now uses
them BUG() instead.

The whole concept of "bad" page tables comes from very early on in
Linux, when the way the page fault handler worked was that if it ran out
of memory or something else really bad happened, it would insert a dummy
page table entry that was guaranteed to let the CPU continue. That way
the page fault handler was always "successful" from a hardware
standpoint, even if it ended up trying to kill the process.

This used to be required simply because a page fault in kernel space
originally needed to let the process unwind sanely and cleanly.

These days, the requirement that page faults always "succeed" is long
long gone. The exception handling mechanism handles the cases where we
validly can take a page fault, and in other cases we will just kill the
process outright. As such, the bad page tables should no longer be
needed, and are apparently just hiding some nasty bugs.

What happens if you just replace all places that would use a bad page
table with a BUG()? (Ie do _not_ add the bug to the place where you
added the test: by that time it's too late. I'm talking about the
places where the bad page tables are used, like in the error cases of
"get_pte_kernel_slow()" etc.

Linus

2000-11-16 17:10:42

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: Memory management bug



>What happens if you just replace all places that would use a bad page
>table with a BUG()? (Ie do _not_ add the bug to the place where you
>added the test: by that time it's too late. I'm talking about the
>places where the bad page tables are used, like in the error cases of
>"get_pte_kernel_slow()" etc.

Ok, the BUG() hit in get_pmd_slow:

pmd_t *
get_pmd_slow(pgd_t *pgd, unsigned long offset)
{
pmd_t *pmd;
int i;

pmd = (pmd_t *) __get_free_pages(GFP_KERNEL,2);
if (pgd_none(*pgd)) {
if (pmd) {
for (i = 0; i < PTRS_PER_PMD; i++)
pmd_clear(pmd+i);
pgd_set(pgd, pmd);
return pmd + offset;
}
BUG(); /* <--- this one hit */
pmd = (pmd_t *) get_bad_pmd_table();
pgd_set(pgd, pmd);
return NULL;
}
free_pages((unsigned long)pmd,2);
if (pgd_bad(*pgd))
BUG();
return (pmd_t *) pgd_page(*pgd) + offset;
}

The allocation of 4 consecutive pages for the page middle directory failed.
This caused empty_bad_pmd_table to be used and clear_page_tables inserted
it to the pmd quicklist. The important question is: why did
__get_free_pages fail?

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2000-11-16 17:31:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: Memory management bug



On Thu, 16 Nov 2000 [email protected] wrote:
>
> Ok, the BUG() hit in get_pmd_slow:
>
> pmd_t *
> get_pmd_slow(pgd_t *pgd, unsigned long offset)
> {
> pmd_t *pmd;
> int i;
>
> pmd = (pmd_t *) __get_free_pages(GFP_KERNEL,2);

You really need 4 pages?

There's no way to reliably get 4 consecutive pages when you're even close
to being low on memory. I would suggest just failing with a NULL return
here.

What is the architecture setup for this machine? I have no clue about
S/390 memory management. Maybe you can modify the pmd layout?

One potential fix for this is to just make the page size bigger. Make
"Linux pages" be _two_ hardware pages, and make a Linux pte contain two
"hardware pte's". That way the pmd would be an order-1 allocation instead
of an order-2 one. Which is statistically _much_ more likely to be around
(exponential distribution).

Linus


2000-11-16 18:16:08

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Memory management bug

On Thu, Nov 16, 2000 at 09:01:07AM -0800, Linus Torvalds wrote:
> "Linux pages" be _two_ hardware pages, and make a Linux pte contain two

If they absolutely needs 4 pages for pmd pagetables due hardware constraints
I'd recommend to use _four_ hardware pages for each softpage, not two.

The issue is that failing allocation at task creation (due 8k [or more] kernel
stack) is trivial to handle, just have the syscall returning -ENOMEM and
userspace will handle the allocation faliure gracefully. Also the parent
of the servers will never fail that allocation after it's up and running
(and it can try to fork childs later on).

Failing allocation of a pagetable in some case can be solved only looping
(deadlock prone) or killing the task hard without giving a chance to userspace
to trap the fault (even SIGKILL signal handler may need that pmd pagetable to
run). So being guaranteed to be able to allocate pagetables unless
the machine is truly out of memory is quite necessary "feature" IMHO.

We faced similar issues while thinking at possible ways for x86-64 pagetables,
and we preferred not having to depend on the softpagesize framework in 2.4.x
because it's very intrusive.

Andrea

2000-11-16 18:38:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: Memory management bug



On Thu, 16 Nov 2000, Andrea Arcangeli wrote:
>
> If they absolutely needs 4 pages for pmd pagetables due hardware constraints
> I'd recommend to use _four_ hardware pages for each softpage, not two.

Yes.

However, it definitely is an issue of making trade-offs. Most 64-bit MMU
models tend to have some flexibility in how you set up the page tables,
and it may be possible to just move bits around too (ie making both the
pmd and the pgd twice as large, and getting the expansion of 4 by doing
two expand-by-two's, for example, if the hardware has support for doing
things like that).

Linus

2000-11-17 11:40:18

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: Memory management bug



>>
>> If they absolutely needs 4 pages for pmd pagetables due hardware
constraints
>> I'd recommend to use _four_ hardware pages for each softpage, not two.
>
>Yes.
>
>However, it definitely is an issue of making trade-offs. Most 64-bit MMU
>models tend to have some flexibility in how you set up the page tables,
>and it may be possible to just move bits around too (ie making both the
>pmd and the pgd twice as large, and getting the expansion of 4 by doing
>two expand-by-two's, for example, if the hardware has support for doing
>things like that).

Unluckly we don't have any flexibility. The segment index (pmd) has 11
bits,
pointers are 8 byte. That makes 16K segment table. I have understood that
this is a problem if the system is really low on memory. But low on memory
does mean low on real memory + swap space, doesn't it ? The system has
enough swap space but it isn't using any of it when the BUG hits. I think
the "if (!order)" statements before the "goto try_again" in __alloc_pages
have something to do with it. To test this assumption I removed the ifs and

I didn't see any "__alloc_pages: %lu-order allocation failed." message
before I hit yet another BUG in swap_state.c:60.
Whats the reasoning behind these ifs ?

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2000-11-17 16:15:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Memory management bug

On Fri, Nov 17, 2000 at 11:41:58AM +0100, [email protected] wrote:
> [..] But low on memory
> does mean low on real memory + swap space, doesn't it ? [..]

No. Low on memory here means that `grep MemFree </proc/meminfo' says
you still have only a few mbytes.

> enough swap space but it isn't using any of it when the BUG hits. I think

This is normal.

> the "if (!order)" statements before the "goto try_again" in __alloc_pages
> have something to do with it. To test this assumption I removed the ifs and

The right way to make allocation of order > 0 to work (when not impossible) is
to pass the "order" information from allocator to memory balancing code so you
don't waste resources by freeing and swapping pages that aren't physically
consecutive. We'll need to teach the memory balancing about freeing only
physically consecutive worthwhile pages.

Actually memory balancing in 2.4.x doesn't get any information, not even the
information about which _classzone_ where to free the memory (NOTE: both 2.2.x
and 2.0.x _always_ got the classzone where to free memory at least). This
classzone missing information causes resources wastage indeed and I just fixed
it several times, BTW.

> I didn't see any "__alloc_pages: %lu-order allocation failed." message

So you probably didn't triggered the out-of-order-2-multipages problem, and
the bug you triggered is going to be another one. But still the above order > 1
thoguths applies to both 2.2.x and 2.4.x since once you'll fix the other
problem, you'll sure run into the failed order 2 allocations.

> before I hit yet another BUG in swap_state.c:60.

The bug in swap_state:60 shows a kernel bug in the VM or random memory
corruption. Make sure you can reproduce on x86 to be sure it's not a s390
that is randomly corrupting memory. If you read the oops after the BUG message
with asm at hand you will see in the registers the value of page->mapping and
you can guess if it's random memory corruption or bug in VM this way (for
example if `reg & 3 != 0' it's memory corruption for sure, you should also
if it's pointing to a suitable kernel-heap address).

> Whats the reasoning behind these ifs ?

To catch memory corruption or things running out of control in the kernel.

Andrea

2000-11-17 17:05:20

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: Memory management bug



>> before I hit yet another BUG in swap_state.c:60.
>
>The bug in swap_state:60 shows a kernel bug in the VM or random memory
>corruption. Make sure you can reproduce on x86 to be sure it's not a s390
>that is randomly corrupting memory. If you read the oops after the BUG
message
>with asm at hand you will see in the registers the value of page->mapping
and
>you can guess if it's random memory corruption or bug in VM this way (for
>example if `reg & 3 != 0' it's memory corruption for sure, you should also
>if it's pointing to a suitable kernel-heap address).
I did a little closer investigation. The BUG was triggered by a page with
page->mapping pointing to an address space of a mapped ext2 file
(page->mapping->a_ops == &ext2_aops). The page had PG_locked, PG_uptodate,
PG_active and PG_swap_cache set. The stack backstrace showed that kswapd
called do_try_to_free_pages, refill_inactive, swap_out, swap_out_mm,
swap_out_vma, try_to_swap_out and add_to_swap_cache where BUG hit.
The registers look good, the struct page looks good. I don't think that
this
was a random memory corruption.

>> Whats the reasoning behind these ifs ?
>
>To catch memory corruption or things running out of control in the kernel.
I was refering to the "if (!order) goto try_again" ifs in alloc_pages, not
the "if (something) BUG()" ifs.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]


2000-11-17 17:13:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: Memory management bug



On Fri, 17 Nov 2000 [email protected] wrote:
>
> >> Whats the reasoning behind these ifs ?
> >
> >To catch memory corruption or things running out of control in the kernel.
> I was refering to the "if (!order) goto try_again" ifs in alloc_pages, not
> the "if (something) BUG()" ifs.

Basically, if you try to wait for orders > 0, you may have to wait for a
LOONG time.

It actually works reasonably well on machines with big memories, because a
buddy allocator _will_ try to coalesce memory allocations as much as
possible. But it has nasty cases where you can be really unlucky. Feel
free to run simulations to see, but basically if you have reasonably
random allocation and free patterns and you want to get an order-X
contiguous allocation, you may have to free up a noticeable portion of
your memory before it succeeds.

Sure, you could do "directed freeing", where you actually try to look at
which pages would be worth freeing to find a large free area, but the
complexity is not insignificant, and quite frankly the proper approach has
always been "don't do that then". Don't rely on big contiguous chunks of
memory. Having an mm that can guarantee contiguous chunks of physical
memory would be cool, but I suspect strongly that it would have some
serious downsides.

Linus

2000-11-17 18:42:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Memory management bug

On Fri, Nov 17, 2000 at 05:35:53PM +0100, [email protected] wrote:
> I did a little closer investigation. The BUG was triggered by a page with
> page->mapping pointing to an address space of a mapped ext2 file
> (page->mapping->a_ops == &ext2_aops). The page had PG_locked, PG_uptodate,
> PG_active and PG_swap_cache set. The stack backstrace showed that kswapd
> called do_try_to_free_pages, refill_inactive, swap_out, swap_out_mm,
> swap_out_vma, try_to_swap_out and add_to_swap_cache where BUG hit. The
> registers look good, the struct page looks good. I don't think that this was
> a random memory corruption.

Agreed, that's almost sure _not_ random memory corruption of the page
structure. It looks like a VM bug (if you can reproduce trivially I'd give a
try to test8 too since test8 is rock solid for me while test10 lockups in VM
core at the second bonnie if using emulated highmem).

> I was refering to the "if (!order) goto try_again" ifs in alloc_pages, not
> the "if (something) BUG()" ifs.

Ah ok :), see Linus's answer: in your case the "don't do that" means to
implement the:

#define SOFT_PAGE_SIZE (PAGE_SIZE<<2)

thing we were talking about yesterday of course.

Plus I add that the "if (!order) goto try_again" is an obvious deadlock prone
bug introduce in test9 that should be removed.

Andrea

2000-11-17 19:43:40

by Rik van Riel

[permalink] [raw]
Subject: Re: Memory management bug

On Fri, 17 Nov 2000, Andrea Arcangeli wrote:

> Actually memory balancing in 2.4.x doesn't get any information,
> not even the information about which _classzone_ where to free
> the memory (NOTE: both 2.2.x and 2.0.x _always_ got the
> classzone where to free memory at least). This classzone missing
> information causes resources wastage indeed and I just fixed it
> several times, BTW.

Interesting, I can't remember you sending me any
patches...

Also, the 2.4 VM (unlike the other VMs) doesn't actually
FREE memory wrongly (with the exception of buffer cache
pages from page_launder()) but just moves it to the
inactive_clean list, from where it will be re-used by one
of those 99% user level allocations that happen on a typical
Linux system.

But, as I said in Ottawa, I wouldn't mind any classzone
stuff in the new VM, as long as it won't complicate the
integration of _other_ memory organisations (like NUMA).

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

2000-11-17 19:47:20

by Rik van Riel

[permalink] [raw]
Subject: Re: Memory management bug

On Fri, 17 Nov 2000, Andrea Arcangeli wrote:

> Plus I add that the "if (!order) goto try_again" is an obvious
> deadlock prone bug introduce in test9 that should be removed.

1) how would this cause deadlocks?
2) how would this somehow be worse than the
unconditional 'goto try_again' we had before?

This goto is ok because we have the OOM killer, which will select
a process to kill when we run out of memory. Also, the goto will
make sure that OTHER processes will survive while the "guilty"
process will be killed.

The guilty process will never get to the goto because it will
have PF_MEMALLOC set.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

2000-11-21 20:23:36

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: Memory management bug



>Agreed, that's almost sure _not_ random memory corruption of the page
>structure. It looks like a VM bug (if you can reproduce trivially I'd give
a
>try to test8 too since test8 is rock solid for me while test10 lockups in
VM
>core at the second bonnie if using emulated highmem).
I was lucky. Somehow I managed to f**k up my disk in a way that the
filesystem
check triggers the bug in a reproducible way and always with the same page!
I setup a "trace store into" to the page structure and logged who is
changing
the "struct page". Here is the log starting after page->mapping was set:

address changed function
5c13a mapping add_to_page_cache_unique
count=2, flags=PG_locked, age=2
5b14a next_hash __add_page_to_hash_queue
5b178 buffers __add_page_to_hash_queue
68440 flags lru_cache_add
flags=PG_active|PG_locked
6846a lru lru_cache_add
68470 lru lru_cache_add
78fc6 virtual create_empty_buffers
78fda count create_empty_buffers
count=3
6d9ce count __free_pages
count=2
5c122 list __add_page_to_hash_queue
68464 lru lru_cache_add
77b16 flags end_buffer_io_async
flags=PG_active|PG_uptodate|PG_locked
77b52 flags end_buffer_io_async
flags=PG_active|PG_uptodate|PG_locked
77bc4 flags end_buffer_io_async
flags=PG_active|PG_uptodate
67792 age age_page_up
age=5
5c88c count __find_get_page
count=3
559be count copy_page_range
count=4
559be count copy_page_rage
count=5
6d9ce count __free_pages
count=4
6b55e lru refill_inactive_scan
6b4ac flags refill_inactive_scan
flags=PG_active|PG_uptodate
6770c age age_page_down_ageonly
age=2
6b570 lru refill_inactive_scan
6b576 lru refill_inactive_scan
6b56a lru refill_inactive_scan
6b55e lru refill_inactive_scan
6b4ac flags refill_inactive_scan
flags=PG_active|PG_uptodate
6770c age age_page_down_ageonly
age=1
6b570 lru refill_inactive_scan
6b576 lru refill_inactive_scan
6b56a lru refill_inactive_scan
6b55e lru refill_inactive_scan
6b4ac flags refill_inactive_scan
flags=PG_active|PG_uptodate
6770c age age_page_down_ageonly
age=0
6b570 lru refill_inactive_scan
6b576 lru refill_inactive_scan
6b56a lru refill_inactive_scan

program check at 6e1e0 because of BUG() in line 60 of swap_state.c.
Stack backtrace from there:
6e1e0 add_to_swap_cache
6900a try_to_swap_out
69408 swap_out_vma
69578 swap_out_mm
69838 swap_out
6b90a refill_inactive
6bab4 do_try_to_free_pages
6bbba kswapd

age_page_down_ageonly was always called from refill_inactive_scan. So
refill_inactive_scan lowers the age of the pages but does not deactivate
the
page when it reached age==0 (page->count to big). try_to_swap_out doesn't
check for page->mapping and tries to swap out the page because the age is
0. Bang!

blue skies,
Martin

P.S. by the way this test was done on linux-2.4.0-test11

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Sch?naicherstr. 220, D-71032 B?blingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]