Hi,
I have a repeatable deadlock when SMP is enabled on my UP box.
>>EIP; c021e29a <stext_lock+1556/677b> <=====
Trace; c012dc58 <swap_out+b0/c8>
Trace; c012ebe2 <refill_inactive+72/98>
Trace; c012ec51 <do_try_to_free_pages+49/7c>
Trace; c012eceb <kswapd+67/f4>
Trace; c01074c4 <kernel_thread+28/38>
Will try to chase it down.
ac20+2.4.2-ac20-rwmmap_sem3 does not deadlock doing the same
churn/burn via make -j30 bzImage.
(I get darn funny looking time numbers though..
real 9m45.641s
user 14m55.710s
sys 1m25.010s)
-Mike
On Wed, 21 Mar 2001, Mike Galbraith wrote:
> I have a repeatable deadlock when SMP is enabled on my UP box.
Linus' version of do_anonymous_page() is racy too...
I know the one in my patch is uglier, but at least it doesn't
leak memory or lose data ;)
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
On Wed, 21 Mar 2001, Rik van Riel wrote:
> On Wed, 21 Mar 2001, Mike Galbraith wrote:
>
> > I have a repeatable deadlock when SMP is enabled on my UP box.
>
> Linus' version of do_anonymous_page() is racy too...
Umm, forget that, I was reading too much code at once and
missed a few lines ... ;)
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
On Wed, 21 Mar 2001, Mike Galbraith wrote:
>
> I have a repeatable deadlock when SMP is enabled on my UP box.
>
> >>EIP; c021e29a <stext_lock+1556/677b> <=====
When you see something like this, please do

	gdb vmlinux
	(gdb) x/10i 0xc021e29a

and it will basically show you where the code jumps back to.
It's almost certainly the beginning of swap_out_mm() where we get the
page_table_lock, but it would still be good to verify.
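(For the record, the reason the EIP lands in <stext_lock+...> at all is that
the 2.4 x86 spin_lock() emits its contention loop out of line into the
.text.lock section. Quoting include/asm-i386/spinlock.h from memory - so
treat this as a sketch, not gospel - the slow path is roughly:

	#define spin_lock_string \
		"\n1:\t" \
		"lock ; decb %0\n\t" \
		"js 2f\n" \
		".section .text.lock,\"ax\"\n" \
		"2:\t" \
		"cmpb $0,%0\n\t" \
		"rep;nop\n\t" \
		"jle 2b\n\t" \
		"jmp 1b\n" \
		".previous"

A CPU stuck spinning on the lock byte at label 2: therefore shows up as an
address inside stext_lock, and the "jmp 1b" that x/10i reveals points back
at the spin_lock() call that is waiting.)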
The deadlock implies that somebody scheduled with page_table_lock held.
Which would be really bad. You should be able to do something like
	if (current->mm && spin_is_locked(&current->mm->page_table_lock))
		BUG();
in the scheduler to see if it triggers (this only works on UP hardware
with an SMP kernel - on a real SMP machine it's entirely legal to hold the
lock during a schedule, as the lock may be held by any of the _other_
CPUs, of course, and the above assert would be the wrong thing to do in
general).
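A minimal sketch of that, assuming a 2.4-style tree (the helper name is made
up; it would be called at the top of schedule() in kernel/sched.c):

	#include <linux/sched.h>
	#include <linux/spinlock.h>

	/*
	 * Only a valid assertion on a UP box running an SMP kernel: on real
	 * SMP another CPU may legitimately hold current->mm's page_table_lock
	 * while this CPU schedules.
	 */
	static inline void assert_page_table_lock_not_held(void)
	{
		if (current->mm && spin_is_locked(&current->mm->page_table_lock))
			BUG();
	}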
Of course, it might not be somebody scheduling with a spinlock, it might
just be a recursive lock bug, but that sounds really unlikely.
> ac20+2.4.2-ac20-rwmmap_sem3 does not deadlock doing the same
> churn/burn via make -j30 bzImage.
It won't do the page table locking for page table allocations, though, so
it will have other bugs.
Linus
On Wed, 21 Mar 2001, Linus Torvalds wrote:
>
> The deadlock implies that somebody scheduled with page_table_lock held.
> Which would be really bad.
..and it is probably do_swap_page().
Despite the name, "lookup_swap_cache()" does more than a lookup - it will
wait for the page that it looked up. And we call it with the
page_table_lock held in do_swap_page().
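For reference, the sleeping path is roughly (a sketch; the actual trace shows
up further down the thread):

	do_swap_page()			[ mm->page_table_lock held ]
	  lookup_swap_cache()
	    __find_lock_page()
	      lock_page()
	        __lock_page()		[ sleeps until the page's I/O completes ]

while kswapd sits in swap_out() spinning on the same page_table_lock. With
one CPU and no kernel preemption, the sleeping task never gets to run again,
so the spin never ends.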
Ho humm. Does the appended patch fix it for you? Looks obvious enough, but
this bug is actually hidden on true SMP, and I'm too lazy to test with
"num_cpus=1" or something..
Linus
-----
diff -u --recursive --new-file pre6/linux/mm/memory.c linux/mm/memory.c
--- pre6/linux/mm/memory.c Tue Mar 20 23:13:03 2001
+++ linux/mm/memory.c Wed Mar 21 22:21:27 2001
@@ -1031,18 +1031,20 @@
struct vm_area_struct * vma, unsigned long address,
pte_t * page_table, swp_entry_t entry, int write_access)
{
- struct page *page = lookup_swap_cache(entry);
+ struct page *page;
pte_t pte;
+ spin_unlock(&mm->page_table_lock);
+ page = lookup_swap_cache(entry);
if (!page) {
- spin_unlock(&mm->page_table_lock);
lock_kernel();
swapin_readahead(entry);
page = read_swap_cache(entry);
unlock_kernel();
- spin_lock(&mm->page_table_lock);
- if (!page)
+ if (!page) {
+ spin_lock(&mm->page_table_lock);
return -1;
+ }
flush_page_to_ram(page);
flush_icache_page(vma, page);
@@ -1053,13 +1055,13 @@
* Must lock page before transferring our swap count to already
* obtained page count.
*/
- spin_unlock(&mm->page_table_lock);
lock_page(page);
- spin_lock(&mm->page_table_lock);
/*
- * Back out if somebody else faulted in this pte while we slept.
+ * Back out if somebody else faulted in this pte while we
+ * released the page table lock.
*/
+ spin_lock(&mm->page_table_lock);
if (pte_present(*page_table)) {
UnlockPage(page);
page_cache_release(page);
On Wed, 21 Mar 2001, Linus Torvalds wrote:
> On Wed, 21 Mar 2001, Linus Torvalds wrote:
> >
> > The deadlock implies that somebody scheduled with page_table_lock held.
> > Which would be really bad.
>
> ..and it is probably do_swap_page().
>
> Despite the name, "lookup_swap_cache()" does more than a lookup - it will
> wait for the page that it looked up. And we call it with the
> page_table_lock held in do_swap_page().
Darn, you're too quick. (just figured it out and was about to report:)
Trace; c012785b <__lock_page+83/ac>
Trace; c012789b <lock_page+17/1c>
Trace; c01279a1 <__find_lock_page+81/f0>
Trace; c013008b <lookup_swap_cache+4b/164>
Trace; c0125362 <do_swap_page+12/1cc>
Trace; c012571f <handle_mm_fault+77/c4>
Trace; c01148b4 <do_page_fault+0/426>
Trace; c0114a17 <do_page_fault+163/426>
Trace; c01148b4 <do_page_fault+0/426>
Trace; c011581e <schedule+3e6/5ec>
Trace; c010908c <error_code+34/3c>
> Ho humm. Does the appended patch fix it for you? Looks obvious enough, but
> this bug is actually hidden on true SMP, and I'm too lazy to test with
> "num_cpus=1" or something..
I'm sure it will, but will be back in a few with confirmation.
>
> Linus
>
> -----
> diff -u --recursive --new-file pre6/linux/mm/memory.c linux/mm/memory.c
> --- pre6/linux/mm/memory.c Tue Mar 20 23:13:03 2001
> +++ linux/mm/memory.c Wed Mar 21 22:21:27 2001
> @@ -1031,18 +1031,20 @@
> struct vm_area_struct * vma, unsigned long address,
> pte_t * page_table, swp_entry_t entry, int write_access)
> {
> - struct page *page = lookup_swap_cache(entry);
> + struct page *page;
> pte_t pte;
>
> + spin_unlock(&mm->page_table_lock);
> + page = lookup_swap_cache(entry);
> if (!page) {
> - spin_unlock(&mm->page_table_lock);
> lock_kernel();
> swapin_readahead(entry);
> page = read_swap_cache(entry);
> unlock_kernel();
> - spin_lock(&mm->page_table_lock);
> - if (!page)
> + if (!page) {
> + spin_lock(&mm->page_table_lock);
> return -1;
> + }
>
> flush_page_to_ram(page);
> flush_icache_page(vma, page);
> @@ -1053,13 +1055,13 @@
> * Must lock page before transferring our swap count to already
> * obtained page count.
> */
> - spin_unlock(&mm->page_table_lock);
> lock_page(page);
> - spin_lock(&mm->page_table_lock);
>
> /*
> - * Back out if somebody else faulted in this pte while we slept.
> + * Back out if somebody else faulted in this pte while we
> + * released the page table lock.
> */
> + spin_lock(&mm->page_table_lock);
> if (pte_present(*page_table)) {
> UnlockPage(page);
> page_cache_release(page);
(oh well, I had the right idea at least)
On Wed, 21 Mar 2001, Linus Torvalds wrote:
> Ho humm. Does the appended patch fix it for you? Looks obvious enough, but
Confirmed.
-Mike
On Wed, 21 Mar 2001, Linus Torvalds wrote:
> diff -u --recursive --new-file pre6/linux/mm/memory.c linux/mm/memory.c
> --- pre6/linux/mm/memory.c Tue Mar 20 23:13:03 2001
> +++ linux/mm/memory.c Wed Mar 21 22:21:27 2001
> @@ -1031,18 +1031,20 @@
> struct vm_area_struct * vma, unsigned long address,
> pte_t * page_table, swp_entry_t entry, int write_access)
> {
> - struct page *page = lookup_swap_cache(entry);
> + struct page *page;
> pte_t pte;
>
> + spin_unlock(&mm->page_table_lock);
> + page = lookup_swap_cache(entry);
> if (!page) {
> - spin_unlock(&mm->page_table_lock);
> lock_kernel();
> swapin_readahead(entry);
> page = read_swap_cache(entry);
> unlock_kernel();
> - spin_lock(&mm->page_table_lock);
> - if (!page)
> + if (!page) {
> + spin_lock(&mm->page_table_lock);
> return -1;
> + }
>
> flush_page_to_ram(page);
> flush_icache_page(vma, page);
Are you sure calling flush_page_to_ram()/flush_icache_page() without the
page_table_lock held is OK for all archs?
Looking at arch/mips/mm/r2300.c:
static void r3k_flush_cache_page(struct vm_area_struct *vma,
				 unsigned long page)
{
	struct mm_struct *mm = vma->vm_mm;
	...
	if ((physpage = get_phys_page(page, vma->vm_mm)))	<--------
		r3k_flush_icache_range(physpage, PAGE_SIZE);
	...
}

static inline unsigned long get_phys_page (unsigned long addr,
					    struct mm_struct *mm)
{
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long physpage;

	pgd = pgd_offset(mm, addr);
	pmd = pmd_offset(pgd, addr);
	pte = pte_offset(pmd, addr);

	if((physpage = pte_val(*pte)) & _PAGE_VALID)
		return KSEG1ADDR(physpage & PAGE_MASK);
	else
		return 0;
	...
}
Marcelo Tosatti writes:
> Are you sure flush_page_to_ram()/flush_icache_page() without
> page_table_lock held is ok for all archs?
This is actually a sticky area. For example, I remember that a long
time ago I noticed that it wasn't necessarily guaranteed that even
all the flush_tlb_{page,mm,range}() stuff was done under the page
table lock.
Maybe things have changed since then, but...
Furthermore, I did not specify that flush_page_to_ram()/flush_icache_page()
may assume this (that the page_table_lock is held); see
Documentation/cachetlb.txt.
I have no problem adding this invariant to some of the interfaces, but
we have to audit things.
> Looking at arch/mips/mm/r2300.c:
The r2300 port is UP-only btw...
Later,
David S. Miller
[email protected]