Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753464Ab1FBQqJ (ORCPT ); Thu, 2 Jun 2011 12:46:09 -0400 Received: from mx1.redhat.com ([209.132.183.28]:43374 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751902Ab1FBQqH (ORCPT ); Thu, 2 Jun 2011 12:46:07 -0400 Date: Thu, 2 Jun 2011 18:44:58 +0200 From: Andrea Arcangeli To: Chris Wright Cc: CAI Qian , Andrea Righi , Hugh Dickins , Rik van Riel , Mel Gorman , Izik Eidus , KAMEZAWA Hiroyuki , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton Subject: Re: [BUG 3.0.0-rc1] ksm: NULL pointer dereference in ksm_do_scan() Message-ID: <20110602164458.GG19505@random.random> References: <20110601222032.GA2858@thinkpad> <2144269697.363041.1306998593180.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <20110602143143.GI23047@sequoia.sous-sol.org> <20110602143622.GE19505@random.random> <20110602153641.GJ23047@sequoia.sous-sol.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20110602153641.GJ23047@sequoia.sous-sol.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5038 Lines: 150 On Thu, Jun 02, 2011 at 08:36:41AM -0700, Chris Wright wrote: > * Andrea Arcangeli (aarcange@redhat.com) wrote: > > On Thu, Jun 02, 2011 at 07:31:43AM -0700, Chris Wright wrote: > > > * CAI Qian (caiqian@redhat.com) wrote: > > > > madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0 > > > > --- SIGSEGV (Segmentation fault) @ 0 (0) --- > > > > > > Right, that's just what the program is trying to do, segfault. > > > > > > > +++ killed by SIGSEGV (core dumped) +++ > > > > Segmentation fault (core dumped) > > > > > > > > Did I miss anything? > > > > > > I found it works but not 100% of the time. > > > > > > So I just run the bug in a loop. > > > > echo 0 >scan_millisecs helps. > > BTW, here's my stack trace (I dropped back to 2.6.39 just to see if it > happened to be recent regression). It looks like mm_slot is off the list: > > R10: dead000000200200 R11: dead000000100100 Yes it had to be use after free. I cooked this patch, still untested but it builds. Will test it soon. === Subject: ksm: fix __ksm_exit vs ksm scan SMP race From: Andrea Arcangeli If the KSM scan releases the ksm_mmlist_lock after the mm_users already dropped to zero but before __ksm_exit had a chance runs, both the KSM scan and __ksm_exit will free the slot. This fixes the SMP race condition by using test_and_bit_set in __ksm_exit to see if __ksm_exit arrived before the KSM scan or not. Signed-off-by: Andrea Arcangeli --- diff --git a/mm/ksm.c b/mm/ksm.c index d708b3e..47ef4c1 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -645,10 +645,16 @@ static int unmerge_and_remove_all_rmap_items(void) if (ksm_test_exit(mm)) { hlist_del(&mm_slot->link); list_del(&mm_slot->mm_list); + /* + * After releasing ksm_mmlist_lock __ksm_exit + * can run and we already changed mm_slot, so + * notify it with MMF_VM_MERGEABLE not to free + * this again. + */ + clear_bit(MMF_VM_MERGEABLE, &mm->flags); spin_unlock(&ksm_mmlist_lock); free_mm_slot(mm_slot); - clear_bit(MMF_VM_MERGEABLE, &mm->flags); up_read(&mm->mmap_sem); mmdrop(mm); } else { @@ -1377,10 +1383,15 @@ next_mm: */ hlist_del(&slot->link); list_del(&slot->mm_list); + /* + * After releasing ksm_mmlist_lock __ksm_exit can run + * and we already changed mm_slot, so notify it with + * MMF_VM_MERGEABLE not to free this again. + */ + clear_bit(MMF_VM_MERGEABLE, &mm->flags); spin_unlock(&ksm_mmlist_lock); free_mm_slot(slot); - clear_bit(MMF_VM_MERGEABLE, &mm->flags); up_read(&mm->mmap_sem); mmdrop(mm); } else { @@ -1463,6 +1474,11 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start, VM_NONLINEAR | VM_MIXEDMAP | VM_SAO)) return 0; /* just ignore the advice */ + /* + * It should be safe to test_bit instead of + * test_and_bit_set because the madvise generic caller + * holds the mmap_sem write mode. + */ if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) { err = __ksm_enter(mm); if (err) @@ -1511,6 +1527,10 @@ int __ksm_enter(struct mm_struct *mm) list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); spin_unlock(&ksm_mmlist_lock); + /* + * It should be safe to set it outside ksm_mmlist_lock because + * we hold a mm_user pin on the mm so __ksm_exit can't run. + */ set_bit(MMF_VM_MERGEABLE, &mm->flags); atomic_inc(&mm->mm_count); @@ -1538,9 +1558,28 @@ void __ksm_exit(struct mm_struct *mm) mm_slot = get_mm_slot(mm); if (mm_slot && ksm_scan.mm_slot != mm_slot) { if (!mm_slot->rmap_list) { - hlist_del(&mm_slot->link); - list_del(&mm_slot->mm_list); - easy_to_free = 1; + /* + * If MMF_VM_MERGEABLE isn't set it was freed + * by the scan immediately after mm_count + * reached zero (visible by the scan) but + * before __ksm_exit() run, so we don't need + * to do anything here. We don't even need to + * wait for the KSM scan to release the + * mmap_sem as it's not working on the mm + * anymore but it's just releasing it, and it + * probably already did and dropped its + * mm_count too (it would however be safe to + * take mmap_sem here even if MMF_VM_MERGEABLE + * is already clear, as the actual mm can't be + * freed until we return and we run mmdrop + * too, but it's unnecessary). + */ + if (test_and_clear_bit(MMF_VM_MERGEABLE, &mm->flags)) { + hlist_del(&mm_slot->link); + list_del(&mm_slot->mm_list); + easy_to_free = 1; + } else + mm_slot = NULL; } else { list_move(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list); @@ -1550,7 +1589,6 @@ void __ksm_exit(struct mm_struct *mm) if (easy_to_free) { free_mm_slot(mm_slot); - clear_bit(MMF_VM_MERGEABLE, &mm->flags); mmdrop(mm); } else if (mm_slot) { down_write(&mm->mmap_sem); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/