From: Gerald Schaefer <gerald.schaefer@de.ibm.com>
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Naoya Horiguchi, "Kirill A. Shutemov", Konstantin Khlebnikov,
    Michal Hocko, Vlastimil Babka, Jerome Marchand, Johannes Weiner,
    Dave Hansen, Mel Gorman, Dan Williams, Martin Schwidefsky,
    Heiko Carstens, Michael Holzheu, Gerald Schaefer,
    "# v4.3+" <stable@vger.kernel.org>
Subject: [PATCH] numa: fix /proc/<pid>/numa_maps for THP
Date: Mon, 4 Apr 2016 17:33:52 +0200
Message-Id: <1459784032-15152-1-git-send-email-gerald.schaefer@de.ibm.com>
X-Mailer: git-send-email 2.6.6
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture. On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.

On other architectures pte_present() and pte_dirty() may work by
chance, but there may be an issue with direct-access (dax) mappings
w/o underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available. In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this
will always return false and the VM_PFNMAP | VM_MIXEDMAP checking
will be skipped. On dax mappings w/o struct pages, an invalid struct
page pointer would then be returned that can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page()
functions.
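For illustration only (a sketch, not part of the patch; pmd is the
pmd_t pointer passed to gather_pte_stats()): on architectures where
pte and pmd layouts differ, the pte helpers applied to the cast value
test the wrong bits, while the pmd helpers read the bits that are
actually defined for pmds:

	/* before: pmd_t reinterpreted as pte_t, layout-dependent */
	pte_t huge_pte = *(pte_t *)pmd;
	if (pte_present(huge_pte) && pte_dirty(huge_pte))
		/* may test the wrong bits, e.g. on s390 */;

	/* after: pmd accessors applied to the pmd itself */
	if (pmd_present(*pmd) && pmd_dirty(*pmd))
		/* tests the present/dirty bits of the pmd */;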
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: <stable@vger.kernel.org> # v4.3+
---
 fs/proc/task_mmu.c | 29 ++++++++++++++++++++++++++---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9df4316..a5fb353 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1518,6 +1518,30 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
 	return page;
 }
 
+static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
+					      struct vm_area_struct *vma,
+					      unsigned long addr)
+{
+	struct page *page;
+	int nid;
+
+	if (!pmd_present(pmd))
+		return NULL;
+
+	page = vm_normal_page_pmd(vma, addr, pmd);
+	if (!page)
+		return NULL;
+
+	if (PageReserved(page))
+		return NULL;
+
+	nid = page_to_nid(page);
+	if (!node_isset(nid, node_states[N_MEMORY]))
+		return NULL;
+
+	return page;
+}
+
 static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
@@ -1529,12 +1553,11 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		pte_t huge_pte = *(pte_t *)pmd;
 		struct page *page;
 
-		page = can_gather_numa_stats(huge_pte, vma, addr);
+		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
 		if (page)
-			gather_stats(page, md, pte_dirty(huge_pte),
+			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(ptl);
 		return 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6bff79a..c5b8efc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1121,6 +1121,8 @@ struct zap_details {
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		pte_t pte);
+struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
+		pmd_t pmd);
 
 int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size);
diff --git a/mm/memory.c b/mm/memory.c
index 288a508..61460e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -789,6 +789,44 @@ out:
 	return pfn_to_page(pfn);
 }
 
+struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
+		pmd_t pmd)
+{
+	unsigned long pfn = pmd_pfn(pmd);
+
+	/*
+	 * There is no pmd_special() but there may be special pmds, e.g.
+	 * in a direct-access (dax) mapping, so let's just replicate the
+	 * !HAVE_PTE_SPECIAL case from vm_normal_page() here.
+	 */
+	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+		if (vma->vm_flags & VM_MIXEDMAP) {
+			if (!pfn_valid(pfn))
+				return NULL;
+			goto out;
+		} else {
+			unsigned long off;
+			off = (addr - vma->vm_start) >> PAGE_SHIFT;
+			if (pfn == vma->vm_pgoff + off)
+				return NULL;
+			if (!is_cow_mapping(vma->vm_flags))
+				return NULL;
+		}
+	}
+
+	if (is_zero_pfn(pfn))
+		return NULL;
+	if (unlikely(pfn > highest_memmap_pfn))
+		return NULL;
+
+	/*
+	 * NOTE! We still have PageReserved() pages in the page tables.
+	 * eg. VDSO mappings can cause them to exist.
+	 */
+out:
+	return pfn_to_page(pfn);
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
-- 
2.6.6
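For reference, a minimal user space sketch (hypothetical, not part of
the patch) that exercises the fixed path: it maps an anonymous region,
requests THP with madvise(MADV_HUGEPAGE), dirties it, and dumps its
own numa_maps, which is what invokes gather_pte_stats(). Before this
fix, the dirty= and N<node>= counts reported for such a huge mapping
could be wrong on s390:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 16UL << 20;	/* 16 MB anonymous mapping */
		char cmd[64];
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		madvise(p, len, MADV_HUGEPAGE);	/* request THP for the range */
		memset(p, 0x5a, len);		/* fault in and dirty the pages */

		/* reading numa_maps walks the page tables via gather_pte_stats() */
		snprintf(cmd, sizeof(cmd), "cat /proc/%d/numa_maps", getpid());
		return system(cmd);
	}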