Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932275Ab1FBIeL (ORCPT ); Thu, 2 Jun 2011 04:34:11 -0400 Received: from mail-qw0-f46.google.com ([209.85.216.46]:33653 "EHLO mail-qw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759381Ab1FBIeF convert rfc822-to-8bit (ORCPT ); Thu, 2 Jun 2011 04:34:05 -0400 DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=fn96uh98HtUGws+tHic8E2aZPklPuypPiepi64k5L7DTtb0Untsmu1wpPb6Zh0W7PC XuLl3MSCJCYZ2Yclvhn0ADxY9PJxtkMU69i5zAPsg5Z0wWC3+QA8XlWn1P81wtnGTGgC FPqiuPx7bFDTceunwCAe1fUQZSLmQUfqkJQUc= MIME-Version: 1.0 In-Reply-To: <20110602010352.GD7306@suse.de> References: <20110530175334.GI19505@random.random> <20110531121620.GA3490@barrios-laptop> <20110531122437.GJ19505@random.random> <20110531133340.GB3490@barrios-laptop> <20110531141402.GK19505@random.random> <20110601005747.GC7019@csn.ul.ie> <20110601175809.GB7306@suse.de> <20110601191529.GY19505@random.random> <20110601214018.GC7306@suse.de> <20110601233036.GZ19505@random.random> <20110602010352.GD7306@suse.de> Date: Thu, 2 Jun 2011 17:34:03 +0900 Message-ID: Subject: Re: [PATCH] mm: compaction: Abort compaction if too many pages are isolated and caller is asynchronous From: Minchan Kim To: Mel Gorman Cc: Andrea Arcangeli , akpm@linux-foundation.org, Ury Stankevich , KOSAKI Motohiro , linux-kernel@vger.kernel.org, linux-mm@kvack.org, stable@kernel.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7540 Lines: 162 On Thu, Jun 2, 2011 at 10:03 AM, Mel Gorman wrote: > On Thu, Jun 02, 2011 at 01:30:36AM +0200, Andrea Arcangeli wrote: >> Hi Mel, >> >> On Wed, Jun 01, 2011 at 10:40:18PM +0100, Mel Gorman wrote: >> > On Wed, Jun 01, 2011 at 09:15:29PM +0200, Andrea Arcangeli wrote: >> > > On Wed, Jun 01, 2011 at 06:58:09PM +0100, Mel Gorman wrote: >> > > > Umm, HIGHMEM4G implies a two-level pagetable layout so where are >> > > > things like _PAGE_BIT_SPLITTING being set when THP is enabled? >> > > >> > > They should be set on the pgd, pud_offset/pgd_offset will just bypass. >> > > The splitting bit shouldn't be special about it, the present bit >> > > should work the same. >> > >> > This comment is misleading at best then. >> > >> > #define _PAGE_BIT_SPLITTING     _PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */ >> >> From common code point of view it's set in the pmd, the comment can be >> extended to specify it's actually the pgd in case of 32bit noPAE but I >> didn't think it was too misleading as we think in common code terms >> all over the code, the fact it's a bypass is pretty clear across the >> whole archs. >> > > Fair point. > >> > At the PGD level, it can have PSE set obviously but it's not a >> > PMD. I confess I haven't checked the manual to see if it's safe to >> > use _PAGE_BIT_UNUSED1 like this so am taking your word for it. I >> >> To be sure I re-checked on 253668.pdf page 113/114 noPAE and page 122 >> PAE, on x86 32bit/64 all ptes/pmd/pgd (32bit/64bit PAE/noPAE) have bit >> 9-11 "Avail" to software. So I think we should be safe here. >> > > Good stuff. I was reasonably sure this was the case but as this was > already "impossible", it needed to be ruled out. > >> > found that the bug is far harder to reproduce with 3 pagetable levels >> > than with 2 but that is just timing. So far it has proven impossible >> > on x86-64 at least within 27 hours so that has me looking at how >> > pagetable management between x86 and x86-64 differ. >> >> Weird. >> >> However I could see it screwing the nr_inactive/active_* stats, but >> the nr_isolated should never go below zero, and especially not anon >> even if split_huge_page does the accounting wrong (and >> migrate/compaction won't mess with THP), or at least I'd expect things >> to fall apart in other ways and not with just a fairly innocuous and >> not-memory corrupting nr_isolated_ counter going off just by one. >> > > Again, agreed. I found it hard to come up with a reason why file would > get messed up particularly as PageSwapBacked does not get cleared in the > ordinary case until the page is freed. If we were using pages after > being freed due to bad refcounting, it would show up in all sorts of bad > ways. > >> The khugepaged nr_isolated_anon increment couldn't affect the file one >> and we hold mmap_sem write mode there to prevent the pte to change >> from under us, in addition to the PT and anon_vma lock. Anon_vma lock >> being wrong sounds unlikely too, and even if it was it should screw >> the nr_isolated_anon counter, impossible to screw the nr_isolated_file >> with khugepaged. >> > > After reviewing, I still could not find a problem with the locking that > might explain this. I thought last night anon_vma might be bust in some > way but today I couldn't find a problem. > >> Where did you put your bugcheck? It looked like you put it in the < 0 >> reader, can you add it to all _inc/dec/mod (even _inc just in case) so >> we may get a stack trace including the culprit? (not guaranteed but >> better chance) >> > > Did that, didn't really help other than showing the corruption happens > early in the process lifetime while huge PMDs are being faulted. This > made me think the problem might be on or near fork. > >> > Barriers are a big different between how 32-bit !SMP and X86-64 but >> > don't know yet which one is relevant or if this is even the right >> > direction. >> >> The difference is we need xchg on SMP to avoid losing the dirty >> bit. Otherwise if we do pmd_t pmd = *pmdp; *pmdp = 0; the dirty bit >> may have been set in between the two by another thread running in >> userland in a different CPU, while the pmd was still "present". As >> long as interrupts don't write to read-write userland memory with the >> pte dirty bit clear, we shouldn't need xchg on !SMP. >> > > Yep. > >> On PAE we also need to write 0 into pmd_low before worrying about >> pmd_high so the present bit is cleared before clearing the high part >> of the 32bit PAE pte, and we relay on xchg implicit lock to avoid a >> smp_wmb() in between the two writes. >> > > Yep. > >> I'm unsure if any of this could be relevant to our problem, also there > > I concluded after a while that it wasn't. Partially from reasoning about > it and part by testing forcing the use of the SMP versions and finding > the bug was still reproducible. > >> can't be more than one writer at once in the pmd, as nobody can modify >> it without the page_table_lock held. xchg there is just to be safe for >> the dirty bit (or we'd corrupt memory with threads running in userland >> and writing to memory on other cpus while we ptep_clear_flush). >> >> I've been wondering about the lack of "lock" on the bus in atomic.h >> too, but I can't see how it can possibly matter on !SMP, vmstat >> modifications should execute only 1 asm insn so preempt or irq can't >> interrupt it. > > To be honest, I haven't fully figured out yet why it makes such > a difference on !SMP. I have a vague notion that it's because > the page table page and the data is visible before the bit set by > SetPageSwapBacked on the struct page is visible but haven't reasoned > it out yet. If this was the case, it might allow an "anon" page to > be treated as a file by compaction for accounting purposes and push > the counter negative but you'd think then the anon isolation would > be positive so it's something else. > > As I thought fork() be an issue, I looked closer at what we do > there. We are calling pmd_alloc at copy_pmd_range which is a no-op > when PMD is folded and copy_huge_pmd() is calling pte_alloc_one() > which also has no barrier. I haven't checked this fully (it's very > late again as I wasn't able to work on this during most of the day) > but I wonder if it's then possible the PMD setup is not visible before > insertion into the page table leading to weirdness? Why it matters to > SMP is unclear unless this is a preemption thing I'm not thinking of. > > On a similar vein, during collapse_huge_page(), we use a barrier > to ensure the data copy is visible before the PMD insertion but in > __do_huge_pmd_anonymous_page(), we assume the "spinlocking to take the > lru_lock inside page_add_new_anon_rmap() acts as a full memory". Thing > is, it's calling lru_cache_add_lru() adding the page to a pagevec > which is not necessarily taking the LRU lock and !SMP is leaving a > big enough race before the pagevec gets drained to cause a problem. > Of course, maybe it *is* happening on SMP but the negative counters > are being reported as zero :) Yes. although we have atomic_inc of in get_page, it doesn't imply full memory barrier. So we need explicit memory barrier. I think you're right. But I can't think of that it's related to this problem(UP, preemption). -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/