Subject: Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
From: Vlastimil Babka
To: "Kirill A. Shutemov", Andrea Arcangeli, Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Wed, 12 Apr 2017 15:33:35 +0200
In-Reply-To: <20170302151034.27829-3-kirill.shutemov@linux.intel.com>
References: <20170302151034.27829-1-kirill.shutemov@linux.intel.com>
 <20170302151034.27829-3-kirill.shutemov@linux.intel.com>

On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> In case prot_numa, we are under down_read(mmap_sem). It's critical
> to not clear pmd intermittently to avoid race with MADV_DONTNEED
> which is also under down_read(mmap_sem):
>
> 	CPU0:				CPU1:
> 				change_huge_pmd(prot_numa=1)
> 				 pmdp_huge_get_and_clear_notify()
> madvise_dontneed()
>  zap_pmd_range()
>   pmd_trans_huge(*pmd) == 0 (without ptl)
>   // skip the pmd
> 				 set_pmd_at();
> 				 // pmd is re-established
>
> The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> which may break userspace.
>
> Found by code analysis, never saw triggered.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e7ce73b2b208..bb2b3646bd78 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> 	if (prot_numa && pmd_protnone(*pmd))
> 		goto unlock;
>
> -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> +	/*
> +	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
> +	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> +	 * which is also under down_read(mmap_sem):
> +	 *
> +	 *	CPU0:				CPU1:
> +	 *				change_huge_pmd(prot_numa=1)
> +	 *				 pmdp_huge_get_and_clear_notify()
> +	 * madvise_dontneed()
> +	 *  zap_pmd_range()
> +	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
> +	 *   // skip the pmd
> +	 *				 set_pmd_at();
> +	 *				 // pmd is re-established
> +	 *
> +	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> +	 * which may break userspace.
> +	 *
> +	 * pmdp_invalidate() is required to make sure we don't miss
> +	 * dirty/young flags set by hardware.
> +	 */
> +	entry = *pmd;
> +	pmdp_invalidate(vma, addr, pmd);
> +
> +	/*
> +	 * Recover dirty/young flags. It relies on pmdp_invalidate to not
> +	 * corrupt them.
> +	 */

pmdp_invalidate() does:

	pmd_t entry = *pmdp;
	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));

so it's not atomic, and if the CPU sets the dirty or accessed bit in
the middle of this, the update will be lost?
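To spell out the window I mean (a sketch, assuming x86-style hardware
A/D bit updates):

	pmd_t entry = *pmdp;	/* snapshot of the pmd taken here ... */
	/* ... hardware can set dirty/accessed in *pmdp right here ... */
	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
	/* ... and the stale snapshot overwrites it, losing the bit */

pmdp_huge_get_and_clear() doesn't have this window AFAICS, because on
x86 (SMP at least) it boils down to an atomic xchg(): any hardware
update either lands before the swap and is returned in the old value,
or faults afterwards on the cleared pmd.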
But I don't see how the other pmdp_invalidate() caller,
__split_huge_pmd_locked(), deals with this either. Andrea, any idea?

Vlastimil

> +	if (pmd_dirty(*pmd))
> +		entry = pmd_mkdirty(entry);
> +	if (pmd_young(*pmd))
> +		entry = pmd_mkyoung(entry);
> +
> 	entry = pmd_modify(entry, newprot);
> 	if (preserve_write)
> 		entry = pmd_mk_savedwrite(entry);
>
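PS: would an xchg()-based invalidate that hands the old entry back to
the caller close the window? A rough, untested sketch (hypothetical
helper name, x86-only; NOT an existing API):

	static inline pmd_t pmdp_invalidate_get(struct vm_area_struct *vma,
			unsigned long address, pmd_t *pmdp)
	{
		/*
		 * Atomic swap: a hardware dirty/young update either lands
		 * before the xchg() and is captured in the returned value,
		 * or the access faults on the now-not-present pmd.
		 */
		pmd_t old = native_make_pmd(xchg(&pmdp->pmd,
				pmd_val(pmd_mknotpresent(*pmdp))));

		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return old;
	}

Then change_huge_pmd() could recover the flags from the returned 'old'
instead of re-reading *pmd after the fact.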