From: Rafael Aquini
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, mgorman@techsingularity.net, akpm@linux-foundation.org, mhocko@suse.com, vbabka@suse.cz
Subject: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
Date: Sun, 16 Feb 2020 14:18:00 -0500
Message-Id: <20200216191800.22423-1-aquini@redhat.com>

From: Mel Gorman

A user reported a bug against a distribution kernel while running a
proprietary workload described as "memory intensive that is not swapping"
that is expected to apply to mainline kernels. The workload is
read/write/modifying ranges of memory and checking the contents. They
reported that within a few hours a bad PMD would be reported followed
by a memory corruption where expected data was all zeros. A partial report
of the bad PMD looked like

  [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
  [ 5195.341184] ------------[ cut here ]------------
  [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
  ....
  [ 5195.410033] Call Trace:
  [ 5195.410471] [] change_protection_range+0x7dd/0x930
  [ 5195.410716] [] change_prot_numa+0x18/0x30
  [ 5195.410918] [] task_numa_work+0x1fe/0x310
  [ 5195.411200] [] task_work_run+0x72/0x90
  [ 5195.411246] [] exit_to_usermode_loop+0x91/0xc2
  [ 5195.411494] [] prepare_exit_to_usermode+0x31/0x40
  [ 5195.411739] [] retint_user+0x8/0x10

Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
was a false detection.
The bug does not trigger if automatic NUMA balancing or transparent
huge pages is disabled.

The bug is due to a race in change_pmd_range between a pmd_trans_huge
and a pmd_none_or_clear_bad check without any locks held. During the
pmd_trans_huge check, a parallel protection update under lock can have
cleared the PMD and filled it with a prot_numa entry between the
transhuge check and the pmd_none_or_clear_bad check.

While this could be fixed with heavy locking, it's only necessary to
make a copy of the PMD on the stack during change_pmd_range and avoid
races. A new helper is created for this as the check is quite subtle and
the existing similar helper is not suitable. This passed 154 hours of
testing (usually triggers between 20 minutes and 24 hours) without
detecting bad PMDs or corruption. A basic test of an autonuma-intensive
workload showed no significant change in behaviour.

Although Mel withdrew the patch in the face of LKML comment
https://lkml.org/lkml/2017/4/10/922, the aforementioned race window is
still open, and we have reports of Linpack tests reporting bad residuals
after the bad PMD warning is observed. In addition to that, bad
rss-counter and non-zero pgtables assertions are triggered on mm
teardown for the task hitting the bad PMD.

  host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
  ....
  host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
  host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

The issue is observed on a v4.18-based distribution kernel, but the race
window is expected to be applicable to mainline kernels, as well.
Signed-off-by: Mel Gorman
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Aquini
---
 mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7a8e84f86831..9ea8cc0ab2fd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return pages;
 }
 
+/*
+ * Used when setting automatic NUMA hinting protection where it is
+ * critical that a numa hinting PMD is not confused with a bad PMD.
+ */
+static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+
+	if (pmd_none(pmdval))
+		return 1;
+	if (pmd_trans_huge(pmdval))
+		return 0;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+
+	return 0;
+}
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
-				&& pmd_none_or_clear_bad(pmd))
+
+		/*
+		 * Automatic NUMA balancing walks the tables with mmap_sem
+		 * held for read. It's possible for a parallel update to occur
+		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
+		 * check, leading to a false positive and clearing.
+		 * Hence, it's necessary to atomically read the PMD value
+		 * for all the checks.
+		 */
+		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */
-- 
2.24.1