From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Yang Shi, Jan Stancek,
 Will Deacon, Peter Zijlstra, Nick Piggin, "Aneesh Kumar K.V", Nadav Amit,
 Minchan Kim, Mel Gorman, Andrew Morton, Linus Torvalds
Subject: [PATCH 5.1 94/98] mm: mmu_gather: remove __tlb_reset_range() for force flush
Date: Thu, 20 Jun 2019 19:58:01 +0200
Message-Id: <20190620174354.117984188@linuxfoundation.org>
In-Reply-To: <20190620174349.443386789@linuxfoundation.org>
References: <20190620174349.443386789@linuxfoundation.org>

From: Yang Shi

commit 7a30df49f63ad92318ddf1f7498d1129a77dd4bd upstream.

A few new fields were added to mmu_gather to make TLB flushing smarter
for huge pages, by telling which level of the page table has changed.
__tlb_reset_range() resets all of this page-table state to "unchanged";
it is called by the TLB flush path when mapping changes happen in
parallel on the same range under a non-exclusive lock (i.e. read
mmap_sem).

Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
munmap"), the syscalls that may update PTEs in parallel (e.g.
MADV_DONTNEED, MADV_FREE) did not remove page tables.  But the
aforementioned commit may run munmap() under read mmap_sem and free
page tables.  This can result in a program hang on aarch64, reported by
Jan Stancek.  The problem can be reproduced with his test program,
slightly modified below.

---8<---
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>

static int map_size = 4096;
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size,
			PROT_WRITE | PROT_READ,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program can produce a parallel execution like the one below:

        t1                                      t2
  munmap(map_address)
    downgrade_write(&mm->mmap_sem);
    unmap_region()
      tlb_gather_mmu()
        inc_tlb_flush_pending(tlb->mm);
      free_pgtables()
        tlb->freed_tables = 1
        tlb->cleared_pmds = 1
                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

      tlb_finish_mmu()
        if (mm_tlb_flush_nested(tlb->mm))
          __tlb_reset_range()

__tlb_reset_range() resets the freed_tables and cleared_* bits, but this
creates an inconsistency for munmap(), which does free page tables.  As
a result, some architectures, e.g. aarch64, may not flush the TLB as
completely as expected and may be left with stale TLB entries.

Use a fullmm flush instead, since it yields much better performance on
aarch64 and non-fullmm doesn't yield a significant difference on x86.

The originally proposed fix came from Jan Stancek, who did most of the
debugging of this issue; I just wrapped everything up together.
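For reference, this is roughly what __tlb_reset_range() looks like around
v5.1 (paraphrased from include/asm-generic/tlb.h of that era; the exact
field list is quoted from memory and may differ slightly).  It shows why
the nested flush loses exactly the state that munmap() relies on:

static inline void __tlb_reset_range(struct mmu_gather *tlb)
{
	if (tlb->fullmm) {
		/* fullmm: the flush covers the whole address space */
		tlb->start = tlb->end = ~0;
	} else {
		tlb->start = TASK_SIZE;
		tlb->end = 0;
	}

	/*
	 * These are the bits the nested flush forgets in the race above:
	 * after the reset, the fact that munmap() freed page tables
	 * (freed_tables, cleared_pmds, ...) is gone, so architectures that
	 * use this information, e.g. aarch64, may skip invalidating the
	 * intermediate walk-cache entries.
	 */
	tlb->freed_tables = 0;
	tlb->cleared_ptes = 0;
	tlb->cleared_pmds = 0;
	tlb->cleared_puds = 0;
	tlb->cleared_p4ds = 0;
}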
Jan's testing results:

v5.2-rc2-24-gbec7550cca10
--------------------------
         mean     stddev
real    37.382    2.780
user     1.420    0.078
sys     54.658    1.855

v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
-----------------------------------------------------------------------------------------
         mean     stddev
real    37.119    2.105
user     1.548    0.087
sys     55.698    1.357

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Signed-off-by: Yang Shi
Signed-off-by: Jan Stancek
Reported-by: Jan Stancek
Tested-by: Jan Stancek
Suggested-by: Will Deacon
Tested-by: Will Deacon
Acked-by: Will Deacon
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: "Aneesh Kumar K.V"
Cc: Nadav Amit
Cc: Minchan Kim
Cc: Mel Gorman
Cc: [4.20+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

---
 mm/mmu_gather.c |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -93,8 +93,17 @@ void arch_tlb_finish_mmu(struct mmu_gath
 	struct mmu_gather_batch *batch, *next;
 
 	if (force) {
+		/*
+		 * The aarch64 yields better performance with fullmm by
+		 * avoiding multiple CPUs spamming TLBI messages at the
+		 * same time.
+		 *
+		 * On x86 non-fullmm doesn't yield significant difference
+		 * against fullmm.
+		 */
+		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+		tlb->freed_tables = 1;
 	}
 
 	tlb_flush_mmu(tlb);
@@ -249,10 +258,15 @@ void tlb_finish_mmu(struct mmu_gather *t
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries. So flush TLB forcefully
+	 * if we detect parallel PTE batching threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables, this
+	 * needs force flush everything in the given range. Otherwise this
+	 * may result in having stale TLB entries for some architectures,
+	 * e.g. aarch64, that could specify flush what level TLB.
 	 */
 	bool force = mm_tlb_flush_nested(tlb->mm);
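For readability, this is how the force path in arch_tlb_finish_mmu() reads
with the first hunk applied (reassembled from the diff above; the short
trailing comments are editorial annotations, not part of the patch):

	if (force) {
		/*
		 * The aarch64 yields better performance with fullmm by
		 * avoiding multiple CPUs spamming TLBI messages at the
		 * same time.
		 *
		 * On x86 non-fullmm doesn't yield significant difference
		 * against fullmm.
		 */
		tlb->fullmm = 1;	/* widen the flush to the whole mm */
		__tlb_reset_range(tlb);	/* with fullmm set, start/end now cover everything */
		tlb->freed_tables = 1;	/* restore what the reset just cleared */
	}

	tlb_flush_mmu(tlb);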