From: john.hubbard@gmail.com
To: Matthew Wilcox, Michal Hocko, Christopher Lameter, Jason Gunthorpe,
    Dan Williams, Jan Kara
Cc: linux-mm@kvack.org, LKML, linux-rdma, John Hubbard
Subject: [PATCH 2/2] mm: set PG_dma_pinned on get_user_pages*()
Date: Sat, 16 Jun 2018 18:25:10 -0700
Message-Id: <20180617012510.20139-3-jhubbard@nvidia.com>
In-Reply-To: <20180617012510.20139-1-jhubbard@nvidia.com>
References: <20180617012510.20139-1-jhubbard@nvidia.com>

From: John Hubbard

This fixes a few problems that come up when using devices (NICs and
GPUs, for example) that want direct access to a chunk of system (CPU)
memory, so that they can DMA to and from that memory. Problems [1]
come up if that memory is backed by persistent storage, for example an
ext4 file system. I've been working on several customer bugs that hit
this case, and this patchset fixes those bugs.
The bugs happen via:

1) get_user_pages() on some ext4-backed pages

2) device does DMA for a while to/from those pages

   a) Somewhere in here, some of the pages get disconnected from the
      file system, via try_to_unmap() and eventually drop_buffers()

3) device is all done; device driver calls set_page_dirty_lock(), then
   put_page()

And then at some point, we see this BUG():

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!

backtrace:

    ext4_writepage
    __writepage
    write_cache_pages
    ext4_writepages
    do_writepages
    __writeback_single_inode
    writeback_sb_inodes
    __writeback_inodes_wb
    wb_writeback
    wb_workfn
    process_one_work
    worker_thread
    kthread
    ret_from_fork

...which is due to the file system asserting that there are still
buffer heads attached:

    ({                                                      \
            BUG_ON(!PagePrivate(page));                     \
            ((struct buffer_head *)page_private(page));     \
    })

How to fix this:
----------------
Introduce a new page flag, PG_dma_pinned, and set it on all pages
returned by the get_user_pages*() family of functions. Leave it set
nearly forever: until the page is freed. Then, check this flag before
attempting to unmap pages. This causes a very early return from
try_to_unmap_one(), and avoids doing things such as, notably, removing
page buffers via drop_buffers().

This uses a new struct page flag, but only on 64-bit systems.

Obviously, this is heavy-handed. But given the long, broken history of
get_user_pages() in combination with file-backed memory, and given the
problems with alternative designs, it's a reasonable fix for now:
small, simple, and easy to revert if and when a more comprehensive
design solution is chosen.

Some alternatives, and why they were not taken:

1. It would be better, if possible, to clear PG_dma_pinned once all
get_user_pages() callers have returned the page (via something more
specific than put_page()), but that would significantly change the
usage for get_user_pages() callers. That's too intrusive for such a
widely used and old API, so let's leave it alone.
Also, such a design would require a new counter associated with each
page. There's no room in struct page, so it would require separate
tracking, which is not acceptable for general page management.

2. There are other, more complicated approaches [2], but these depend
on trying to solve very specific call paths that are, in the end, just
downstream effects of the root cause. And so these did not actually
fix the customer bugs that I was working on.

References:

[1] https://lwn.net/Articles/753027/ : "The trouble with get_user_pages()"
[2] https://marc.info/?l=linux-mm&m=<20180521143830.GA25109@bombadil.infradead.org>
    (Matthew Wilcox listed two ideas here)

Signed-off-by: John Hubbard
---
 include/linux/page-flags.h     |  9 +++++++++
 include/trace/events/mmflags.h |  9 ++++++++-
 mm/gup.c                       | 11 +++++++++--
 mm/page_alloc.c                |  1 +
 mm/rmap.c                      |  2 ++
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 901943e4754b..ad65a2af069a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -100,6 +100,9 @@ enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
+#endif
+#if defined(CONFIG_64BIT)
+	PG_dma_pinned,
 #endif
 	__NR_PAGEFLAGS,
 
@@ -381,6 +384,12 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#if defined(CONFIG_64BIT)
+PAGEFLAG(DmaPinned, dma_pinned, PF_ANY)
+#else
+PAGEFLAG_FALSE(DmaPinned)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..f62fd150b0d4 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#if defined(CONFIG_64BIT)
+#define IF_HAVE_PG_DMA_PINNED(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_DMA_PINNED(flag,string)
+#endif
+
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
 	{1UL << PG_waiters,		"waiters"	},		\
@@ -104,7 +110,8 @@ IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	) \
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	) \
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	) \
 IF_HAVE_PG_IDLE(PG_young,		"young"		) \
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		) \
+IF_HAVE_PG_DMA_PINNED(PG_dma_pinned,	"dma_pinned"	)
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/gup.c b/mm/gup.c
index 73f0b3316fa7..fd6c77f33c16 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -659,7 +659,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
-	long i = 0;
+	long i = 0, j;
 	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
@@ -764,6 +764,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	} while (nr_pages);
 
 out:
+	if (pages)
+		for (j = 0; j < i; j++)
+			SetPageDmaPinned(pages[j]);
+
 	return i ? i : err;
 }
 
@@ -1843,7 +1847,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
 	unsigned long addr, len, end;
-	int nr = 0, ret = 0;
+	int nr = 0, ret = 0, i;
 
 	start &= PAGE_MASK;
 	addr = start;
@@ -1864,6 +1868,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		ret = nr;
 	}
 
+	for (i = 0; i < nr; i++)
+		SetPageDmaPinned(pages[i]);
+
 	if (nr < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
 		start += nr << PAGE_SHIFT;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..a96a7b20037c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1898,6 +1898,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 {
 	set_page_private(page, 0);
 	set_page_refcounted(page);
+	ClearPageDmaPinned(page);
 
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6db729dc4c50..37576f0a4645 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1360,6 +1360,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 					     flags & TTU_SPLIT_FREEZE, page);
 	}
 
+	if (PageDmaPinned(page))
+		return false;
 	/*
 	 * We have to assume the worse case ie pmd for invalidation. Note that
 	 * the page can not be free in this function as call of try_to_unmap()
-- 
2.17.1