Date: Wed, 1 Oct 2014 09:18:25 -0700
Subject: Re: pipe/page fault oddness.
From: Linus Torvalds
To: Hugh Dickins
Cc: Dave Jones, Al Viro, Linux Kernel, Rik van Riel, Ingo Molnar,
    Michel Lespinasse, "Kirill A. Shutemov", Mel Gorman, Sasha Levin
References: <20140930033327.GA14558@redhat.com>
    <20140930043309.GA16196@redhat.com>
    <20140930160510.GA15903@redhat.com>
    <20140930162201.GC15903@redhat.com>
    <20140930164047.GA18354@redhat.com>
    <20140930182059.GA24431@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Oct 1, 2014 at 9:01 AM, Linus Torvalds wrote:
>
> We need to get rid of it, and just make it the same as pte_protnone().
> And then the real protnone is in the vma flags, and if you actually
> ever get to a pte that is marked protnone, you know it's a numa page.

So I'd really suggest we do exactly that. Get rid of "pte_numa()"
entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
"pte_protnone()" helper to check for the "protnone" case (which on x86
is testing the _PAGE_PROTNONE bit, and on most other architectures is
just testing that the page has no access rights).

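To make the rule above concrete, here is a rough userspace sketch, not
kernel code. Everything prefixed toy_/TOY_ is hypothetical, including the
bit positions; only the idea comes from the mail: a pte marked protnone
inside a vma that does allow access can only be a NUMA hinting pte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these names and bit positions are made up for
 * illustration and are NOT the kernel's real definitions. */
#define TOY_PAGE_PRESENT  (1u << 0)
#define TOY_PAGE_PROTNONE (1u << 8)

#define TOY_VM_READ  (1u << 0)
#define TOY_VM_WRITE (1u << 1)
#define TOY_VM_EXEC  (1u << 2)

typedef struct { uint32_t bits; } toy_pte_t;
typedef struct { uint32_t vm_flags; } toy_vma_t;

/* Stand-in for the proposed pte_protnone(): on x86 this would test
 * the protnone bit in the pte. */
static inline bool toy_pte_protnone(toy_pte_t pte)
{
    return (pte.bits & TOY_PAGE_PROTNONE) != 0;
}

/* The "real" protnone lives in the vma flags: a vma with no access
 * bits at all is genuinely PROT_NONE. */
static inline bool toy_vma_accessible(const toy_vma_t *vma)
{
    assert(vma != NULL);
    return (vma->vm_flags & (TOY_VM_READ | TOY_VM_WRITE | TOY_VM_EXEC)) != 0;
}

/* A protnone pte inside an accessible vma must be a NUMA hinting pte;
 * a protnone pte inside a PROT_NONE vma is just a plain protection. */
static inline bool toy_is_numa_hinting_pte(const toy_vma_t *vma, toy_pte_t pte)
{
    return toy_pte_protnone(pte) && toy_vma_accessible(vma);
}
```

The point of the sketch is that no separate "numa" bit is needed: the
pte carries one protnone bit, and the vma decides what it means.
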
Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
because they are brainless sh*t, and we just use

    ptent = ptep_modify_prot_start(mm, addr, pte);
    ptent = pte_modify(ptent, newprot);
    ptep_modify_prot_commit(mm, addr, pte, ptent);

reliably instead (where for the mknuma case "newprot" is PROT_NONE,
and for mknonnuma() it is vma->vm_page_prot). Yes, that means that you
have to pass in the vma to those functions, but that just makes sense
anyway.

And if that means that we lose the numa flag on mprotect etc, nobody
sane cares.

Seriously, why can't we just do this, and throw away all the crap that
is "numa special case".

This would make all the random games in change_pte_range() just go
away entirely, because the whole NUMA thing really wouldn't be a
special case for the pte AT ALL any more. All it would be is that a
pte could be marked PROT_NONE even if the vma->vm_flags aren't.

Please, please, please?

The current _PAGE_NUMA really is a horrible horrible thing, and may
well be the source of this bug. The fact that it took DaveJ a long
time to trigger his lockup would be entirely consistent with "you have
to split a PROTNONE large page due to memory pressure", so the problem
with our current pte_mknuma() that Hugh points out looks entirely
possible to me.

Now, there may be some reason why it can't happen, but even in the
absence of this bug, I really think that _PAGE_NUMA has been a huge
mistake from day one.

                  Linus