Hi,
I have a repeatable deadlock when SMP is enabled on my UP box.
>>EIP; c021e29a <stext_lock+1556/677b> <=====
Trace; c012dc58 <swap_out+b0/c8>
Trace; c012ebe2 <refill_inactive+72/98>
Trace; c012ec51 <do_try_to_free_pages+49/7c>
Trace; c012eceb <kswapd+67/f4>
Trace; c01074c4 <kernel_thread+28/38>
Will try to chase it down.
ac20+2.4.2-ac20-rwmmap_sem3 does not deadlock doing the same
churn/burn via make -j30 bzImage.
(I get darn funny looking time numbers though..
real 9m45.641s
user 14m55.710s
sys 1m25.010s)
-Mike
On Wed, 21 Mar 2001, Mike Galbraith wrote:
> I have a repeatable deadlock when SMP is enabled on my UP box.
Linus' version of do_anonymous_page() is racy too...
I know the one in my patch is uglier, but at least it doesn't
leak memory or lose data ;)
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
On Wed, 21 Mar 2001, Rik van Riel wrote:
> On Wed, 21 Mar 2001, Mike Galbraith wrote:
>
> > I have a repeatable deadlock when SMP is enabled on my UP box.
>
> Linus' version of do_anonymous_page() is racy too...
Umm, forget that, I was reading too much code at once and
missed a few lines ... ;)
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
On Wed, 21 Mar 2001, Mike Galbraith wrote:
>
> I have a repeatable deadlock when SMP is enabled on my UP box.
>
> >>EIP; c021e29a <stext_lock+1556/677b> <=====
When you see something like this, please do

	gdb vmlinux
	(gdb) x/10i 0xc021e29a

and it will basically show you where the code jumps back to.
It's almost certainly the beginning of swap_out_mm() where we get the
page_table_lock, but it would still be good to verify.
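(For the record, the reason the EIP lands in <stext_lock+...> at all is that
the 2.4 x86 spin_lock() emits its contention loop out of line into the
.text.lock section. Quoting include/asm-i386/spinlock.h from memory - so
treat this as a sketch, not gospel - the slow path is roughly:

	#define spin_lock_string \
		"\n1:\t" \
		"lock ; decb %0\n\t" \
		"js 2f\n" \
		".section .text.lock,\"ax\"\n" \
		"2:\t" \
		"cmpb $0,%0\n\t" \
		"rep;nop\n\t" \
		"jle 2b\n\t" \
		"jmp 1b\n" \
		".previous"

A CPU stuck spinning on the lock byte at label 2: therefore shows up as an
address inside stext_lock, and the "jmp 1b" that x/10i reveals points back
at the spin_lock() call that is waiting.)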
The deadlock implies that somebody scheduled with page_table_lock held.
Which would be really bad. You should be able to do something like
	if (current->mm && spin_is_locked(&current->mm->page_table_lock))
		BUG();
in the scheduler to see if it triggers (this only works on UP hardware
with an SMP kernel - on a real SMP machine it's entirely legal to hold the
lock during a schedule, as the lock may be held by any of the _other_
CPUs, of course, and the above assert would be the wrong thing to do in
general).
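A minimal sketch of that, assuming a 2.4-style tree (the helper name is made
up; it would be called at the top of schedule() in kernel/sched.c):

	#include <linux/sched.h>
	#include <linux/spinlock.h>

	/*
	 * Only a valid assertion on a UP box running an SMP kernel: on real
	 * SMP another CPU may legitimately hold current->mm's page_table_lock
	 * while this CPU schedules.
	 */
	static inline void assert_page_table_lock_not_held(void)
	{
		if (current->mm && spin_is_locked(&current->mm->page_table_lock))
			BUG();
	}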
Of course, it might not be somebody scheduling with a spinlock, it might
just be a recursive lock bug, but that sounds really unlikely.
> ac20+2.4.2-ac20-rwmmap_sem3 does not deadlock doing the same
> churn/burn via make -j30 bzImage.
It won't do the page table locking for page table allocations, though, so
it will have other bugs.
Linus
On Wed, 21 Mar 2001, Linus Torvalds wrote:
>
> The deadlock implies that somebody scheduled with page_table_lock held.
> Which would be really bad.
..and it is probably do_swap_page().
Despite the name, "lookup_swap_cache()" does more than a lookup - it will
wait for the page that it looked up. And we call it with the
page_table_lock held in do_swap_page().
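For reference, the sleeping path is roughly (a sketch; the actual trace shows
up further down the thread):

	do_swap_page()			[ mm->page_table_lock held ]
	  lookup_swap_cache()
	    __find_lock_page()
	      lock_page()
	        __lock_page()		[ sleeps until the page's I/O completes ]

while kswapd sits in swap_out() spinning on the same page_table_lock. With
one CPU and no kernel preemption, the sleeping task never gets to run again,
so the spin never ends.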
Ho humm. Does the appended patch fix it for you? Looks obvious enough, but
this bug is actually hidden on true SMP, and I'm too lazy to test with
"num_cpus=1" or something..
Linus
-----
diff -u --recursive --new-file pre6/linux/mm/memory.c linux/mm/memory.c
--- pre6/linux/mm/memory.c Tue Mar 20 23:13:03 2001
+++ linux/mm/memory.c Wed Mar 21 22:21:27 2001
@@ -1031,18 +1031,20 @@
struct vm_area_struct * vma, unsigned long address,
pte_t * page_table, swp_entry_t entry, int write_access)
{
- struct page *page = lookup_swap_cache(entry);
+ struct page *page;
pte_t pte;
+ spin_unlock(&mm->page_table_lock);
+ page = lookup_swap_cache(entry);
if (!page) {
- spin_unlock(&mm->page_table_lock);
lock_kernel();
swapin_readahead(entry);
page = read_swap_cache(entry);
unlock_kernel();
- spin_lock(&mm->page_table_lock);
- if (!page)
+ if (!page) {
+ spin_lock(&mm->page_table_lock);
return -1;
+ }
flush_page_to_ram(page);
flush_icache_page(vma, page);
@@ -1053,13 +1055,13 @@
* Must lock page before transferring our swap count to already
* obtained page count.
*/
- spin_unlock(&mm->page_table_lock);
lock_page(page);
- spin_lock(&mm->page_table_lock);
/*
- * Back out if somebody else faulted in this pte while we slept.
+ * Back out if somebody else faulted in this pte while we
+ * released the page table lock.
*/
+ spin_lock(&mm->page_table_lock);
if (pte_present(*page_table)) {
UnlockPage(page);
page_cache_release(page);
On Wed, 21 Mar 2001, Linus Torvalds wrote:
> On Wed, 21 Mar 2001, Linus Torvalds wrote:
> >
> > The deadlock implies that somebody scheduled with page_table_lock held.
> > Which would be really bad.
>
> ..and it is probably do_swap_page().
>
> Despite the name, "lookup_swap_cache()" does more than a lookup - it will
> wait for the page that it looked up. And we call it with the
> page_table_lock held in do_swap_page().
Darn, you're too quick. (just figured it out and was about to report:)
Trace; c012785b <__lock_page+83/ac>
Trace; c012789b <lock_page+17/1c>
Trace; c01279a1 <__find_lock_page+81/f0>
Trace; c013008b <lookup_swap_cache+4b/164>
Trace; c0125362 <do_swap_page+12/1cc>
Trace; c012571f <handle_mm_fault+77/c4>
Trace; c01148b4 <do_page_fault+0/426>
Trace; c0114a17 <do_page_fault+163/426>
Trace; c01148b4 <do_page_fault+0/426>
Trace; c011581e <schedule+3e6/5ec>
Trace; c010908c <error_code+34/3c>
> Ho humm. Does the appended patch fix it for you? Looks obvious enough, but
> this bug is actually hidden on true SMP, and I'm too lazy to test with
> "num_cpus=1" or something..
I'm sure it will, but will be back in a few with confirmation.
>
> Linus
>
> -----
> diff -u --recursive --new-file pre6/linux/mm/memory.c linux/mm/memory.c
> --- pre6/linux/mm/memory.c Tue Mar 20 23:13:03 2001
> +++ linux/mm/memory.c Wed Mar 21 22:21:27 2001
> @@ -1031,18 +1031,20 @@
> struct vm_area_struct * vma, unsigned long address,
> pte_t * page_table, swp_entry_t entry, int write_access)
> {
> - struct page *page = lookup_swap_cache(entry);
> + struct page *page;
> pte_t pte;
>
> + spin_unlock(&mm->page_table_lock);
> + page = lookup_swap_cache(entry);
> if (!page) {
> - spin_unlock(&mm->page_table_lock);
> lock_kernel();
> swapin_readahead(entry);
> page = read_swap_cache(entry);
> unlock_kernel();
> - spin_lock(&mm->page_table_lock);
> - if (!page)
> + if (!page) {
> + spin_lock(&mm->page_table_lock);
> return -1;
> + }
>
> flush_page_to_ram(page);
> flush_icache_page(vma, page);
> @@ -1053,13 +1055,13 @@
> * Must lock page before transferring our swap count to already
> * obtained page count.
> */
> - spin_unlock(&mm->page_table_lock);
> lock_page(page);
> - spin_lock(&mm->page_table_lock);
>
> /*
> - * Back out if somebody else faulted in this pte while we slept.
> + * Back out if somebody else faulted in this pte while we
> + * released the page table lock.
> */
> + spin_lock(&mm->page_table_lock);
> if (pte_present(*page_table)) {
> UnlockPage(page);
> page_cache_release(page);
(oh well, I had the right idea at least)
On Wed, 21 Mar 2001, Linus Torvalds wrote:
> Ho humm. Does the appended patch fix it for you? Looks obvious enough, but
Confirmed.
-Mike
On Wed, 21 Mar 2001, Linus Torvalds wrote:
> diff -u --recursive --new-file pre6/linux/mm/memory.c linux/mm/memory.c
> --- pre6/linux/mm/memory.c Tue Mar 20 23:13:03 2001
> +++ linux/mm/memory.c Wed Mar 21 22:21:27 2001
> @@ -1031,18 +1031,20 @@
> struct vm_area_struct * vma, unsigned long address,
> pte_t * page_table, swp_entry_t entry, int write_access)
> {
> - struct page *page = lookup_swap_cache(entry);
> + struct page *page;
> pte_t pte;
>
> + spin_unlock(&mm->page_table_lock);
> + page = lookup_swap_cache(entry);
> if (!page) {
> - spin_unlock(&mm->page_table_lock);
> lock_kernel();
> swapin_readahead(entry);
> page = read_swap_cache(entry);
> unlock_kernel();
> - spin_lock(&mm->page_table_lock);
> - if (!page)
> + if (!page) {
> + spin_lock(&mm->page_table_lock);
> return -1;
> + }
>
> flush_page_to_ram(page);
> flush_icache_page(vma, page);
Are you sure calling flush_page_to_ram()/flush_icache_page() without the
page_table_lock held is OK for all archs?
Looking at arch/mips/mm/r2300.c:
static void r3k_flush_cache_page(struct vm_area_struct *vma,
				 unsigned long page)
{
	struct mm_struct *mm = vma->vm_mm;
	...
	if ((physpage = get_phys_page(page, vma->vm_mm)))	<--------
		r3k_flush_icache_range(physpage, PAGE_SIZE);
	...
}

static inline unsigned long get_phys_page (unsigned long addr,
					    struct mm_struct *mm)
{
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long physpage;

	pgd = pgd_offset(mm, addr);
	pmd = pmd_offset(pgd, addr);
	pte = pte_offset(pmd, addr);

	if((physpage = pte_val(*pte)) & _PAGE_VALID)
		return KSEG1ADDR(physpage & PAGE_MASK);
	else
		return 0;
	...
}
Marcelo Tosatti writes:
> Are you sure flush_page_to_ram()/flush_icache_page() without
> page_table_lock held is ok for all archs?
This is actually a sticky area. For example, I remember that a long
time ago I noticed that it wasn't necessarily guaranteed that even
all the flush_tlb_{page,mm,range}() stuff was done under the page
table lock.
Maybe things have changed since then, but...
Furthermore, I did not specify that flush_page_to_ram()/flush_icache_page()
may assume this (that the page_table_lock is held); see
Documentation/cachetlb.txt.
I have no problem adding this invariant to some of the interfaces, but
we have to audit things.
> Looking at arch/mips/mm/r2300.c:
The r2300 port is UP-only btw...
Later,
David S. Miller
[email protected]