Hi,
this series backports the CVE-2019-11487 fixes (page refcount overflow) to
4.4 stable. It differs from Ajay's series [1] in the following ways:
- The arch-specific gup.c variants of fast gup for x86 and s390 are fixed too.
I have not fixed sparc, mips and sh. The known overflow scenario based on FUSE
needs 140GB of RAM, so it is unlikely to be a problem on those architectures,
and I don't feel confident enough to patch them. I've sent the same fixup for
4.9 [3].
- There are some differences in the backport adaptations, hopefully not
important. My version is taken from our 4.4 based kernel, which was simpler for
me than adding the missing parts to Ajay's version.
- The last patch fixes another problem in the fast gup implementation on x86,
which I've previously posted and which got merged to 4.9 stable [2].
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
Linus Torvalds (3):
mm: make page ref count overflow check tighter and more explicit
mm: add 'try_get_page()' helper function
mm: prevent get_user_pages() from overflowing page refcount
Matthew Wilcox (1):
fs: prevent page refcount overflow in pipe_buf_get
Miklos Szeredi (1):
pipe: add pipe_buf_get() helper
Punit Agrawal (1):
mm, gup: ensure real head page is ref-counted when using hugepages
Vlastimil Babka (1):
x86, mm, gup: prevent get_page() race with munmap in paravirt guest
Will Deacon (1):
mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages
arch/s390/mm/gup.c | 6 +++--
arch/x86/mm/gup.c | 23 ++++++++++++++++++-
fs/fuse/dev.c | 12 +++++-----
fs/pipe.c | 4 ++--
fs/splice.c | 12 ++++++++--
include/linux/mm.h | 26 ++++++++++++++++++++-
include/linux/pipe_fs_i.h | 17 ++++++++++++--
kernel/trace/trace.c | 6 ++++-
mm/gup.c | 48 +++++++++++++++++++++++++++------------
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 18 +++++++++++++--
mm/internal.h | 17 ++++++++++----
12 files changed, 152 insertions(+), 39 deletions(-)
--
2.23.0
From: Punit Agrawal <[email protected]>
commit d63206ee32b6e64b0e12d46e5d6004afd9913713 upstream.
When speculatively taking references to a hugepage using
page_cache_add_speculative() in gup_huge_pmd(), it is assumed that the
page returned by pmd_page() is the head page. Although normally true,
this assumption doesn't hold when the hugepage consists of successive
page table entries such as when using contiguous bit on arm64 at PTE or
PMD levels.
This can be addressed by ensuring that the page passed to
page_cache_add_speculative() is the real head or by de-referencing the
head page within the function.
We take the first approach to keep the usage pattern aligned with
page_cache_get_speculative() where users already pass the appropriate
page, i.e., the de-referenced head.
Apply the same logic to fix gup_huge_[pud|pgd]() as well.
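For illustration only (not part of the patch), here is a minimal userspace
sketch of why the references have to land on the real head. The struct and
function names (fake_page, fake_compound_head, fake_add_refs) are made up; the
model only assumes that every tail page carries a pointer to its head.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page: tails point at their head. */
struct fake_page {
    struct fake_page *head;   /* NULL for a head or an order-0 page */
    int refcount;
};

/* Models compound_head(): resolve a possible tail to its head. */
static struct fake_page *fake_compound_head(struct fake_page *page)
{
    return page->head ? page->head : page;
}

/*
 * Models the fixed flow: the page found through the page table entry may
 * be a tail (e.g. with contiguous entries), so resolve it before adding
 * the references.
 */
static void fake_add_refs(struct fake_page *entry_page, int refs)
{
    fake_compound_head(entry_page)->refcount += refs;
}
```

With a four-page compound where the entry happens to point at the third page,
all references are still accounted on the first one.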
[[email protected]: fix arm64 ltp failure]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Punit Agrawal <[email protected]>
Acked-by: Steve Capper <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/gup.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 6f9088cb8ebe..71e9d0093a35 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1130,8 +1130,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
refs = 0;
- head = pmd_page(orig);
- page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
tail = page;
do {
pages[*nr] = page;
@@ -1140,6 +1139,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
+ head = compound_head(pmd_page(orig));
if (!page_cache_add_speculative(head, refs)) {
*nr -= refs;
return 0;
@@ -1176,8 +1176,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;
refs = 0;
- head = pud_page(orig);
- page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+ page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
tail = page;
do {
pages[*nr] = page;
@@ -1186,6 +1185,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
+ head = compound_head(pud_page(orig));
if (!page_cache_add_speculative(head, refs)) {
*nr -= refs;
return 0;
@@ -1218,8 +1218,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
return 0;
refs = 0;
- head = pgd_page(orig);
- page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
+ page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
tail = page;
do {
pages[*nr] = page;
@@ -1228,6 +1227,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
+ head = compound_head(pgd_page(orig));
if (!page_cache_add_speculative(head, refs)) {
*nr -= refs;
return 0;
--
2.23.0
From: Will Deacon <[email protected]>
commit a3e328556d41bb61c55f9dfcc62d6a826ea97b85 upstream.
When operating on hugepages with DEBUG_VM enabled, the GUP code checks
the compound head for each tail page prior to calling
page_cache_add_speculative. This is broken, because on the fast-GUP
path (where we don't hold any page table locks) we can be racing with a
concurrent invocation of split_huge_page_to_list.
split_huge_page_to_list deals with this race by using page_ref_freeze to
freeze the page and force concurrent GUPs to fail whilst the component
pages are modified. This modification includes clearing the
compound_head field for the tail pages, so checking this prior to a
successful call to page_cache_add_speculative can lead to false
positives: In fact, page_cache_add_speculative *already* has this check
once the page refcount has been successfully updated, so we can simply
remove the broken calls to VM_BUG_ON_PAGE.
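As a rough illustration (not kernel code), the interaction between freezing
and a speculative reference can be modelled in userspace with C11 atomics. The
fake_* names are hypothetical; the sketch assumes that freezing means
"atomically replace the expected count with zero" and that a speculative get
means "add unless the count is zero".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Freeze: succeeds only if the count is exactly what the caller expects. */
static bool fake_ref_freeze(atomic_int *ref, int expected)
{
    return atomic_compare_exchange_strong(ref, &expected, 0);
}

/* Unfreeze: restore the count once the page has been modified. */
static void fake_ref_unfreeze(atomic_int *ref, int count)
{
    atomic_store(ref, count);
}

/* Speculative get: refuses to touch a frozen (zero) count. */
static bool fake_add_speculative(atomic_int *ref, int refs)
{
    int old = atomic_load(ref);

    do {
        if (old == 0)
            return false;
    } while (!atomic_compare_exchange_weak(ref, &old, old + refs));
    return true;
}
```

In this model any check of the tail pages is only meaningful after
fake_add_speculative() has returned true, which matches the reasoning above.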
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Punit Agrawal <[email protected]>
Acked-by: Steve Capper <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/gup.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 2cd3b31e3666..6f9088cb8ebe 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1134,7 +1134,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
tail = page;
do {
- VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
(*nr)++;
page++;
@@ -1181,7 +1180,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
tail = page;
do {
- VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
(*nr)++;
page++;
@@ -1224,7 +1222,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
tail = page;
do {
- VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
(*nr)++;
page++;
--
2.23.0
From: Matthew Wilcox <[email protected]>
commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb upstream.
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers are converted to handle a failure.
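A simplified userspace sketch of the resulting caller pattern follows. It is
an illustration under assumptions, not the kernel code: fake_buffer_ref,
fake_buffer_get and fake_dup_refs are invented names, and the INT_MAX/2 limit
is borrowed from the trace.c hunk of this patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted pipe buffer. */
struct fake_buffer_ref {
    int ref;
};

/* A ->get() that can fail instead of letting the count run away. */
static bool fake_buffer_get(struct fake_buffer_ref *r)
{
    if (r->ref > INT_MAX / 2)
        return false;
    r->ref++;
    return true;
}

/*
 * Take up to n references. Like the splice loops, stop at the first
 * failure and report an error only if nothing was taken at all.
 */
static int fake_dup_refs(struct fake_buffer_ref *r, int n)
{
    int taken = 0;

    while (taken < n) {
        if (!fake_buffer_get(r)) {
            if (taken == 0)
                return -EFAULT;
            break;
        }
        taken++;
    }
    return taken;
}
```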
Reported-by: Jann Horn <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
fs/fuse/dev.c | 12 ++++++------
fs/pipe.c | 4 ++--
fs/splice.c | 12 ++++++++++--
include/linux/pipe_fs_i.h | 10 ++++++----
kernel/trace/trace.c | 6 +++++-
5 files changed, 29 insertions(+), 15 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 36a5df92eb9c..16891f5364af 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2031,10 +2031,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
rem += pipe->bufs[(pipe->curbuf + idx) & (pipe->buffers - 1)].len;
ret = -EINVAL;
- if (rem < len) {
- pipe_unlock(pipe);
- goto out;
- }
+ if (rem < len)
+ goto out_free;
rem = len;
while (rem) {
@@ -2052,7 +2050,9 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
pipe->nrbufs--;
} else {
- pipe_buf_get(pipe, ibuf);
+ if (!pipe_buf_get(pipe, ibuf))
+ goto out_free;
+
*obuf = *ibuf;
obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
obuf->len = rem;
@@ -2075,13 +2075,13 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
ret = fuse_dev_do_write(fud, &cs, len);
pipe_lock(pipe);
+out_free:
for (idx = 0; idx < nbuf; idx++) {
struct pipe_buffer *buf = &bufs[idx];
buf->ops->release(pipe, buf);
}
pipe_unlock(pipe);
-out:
kfree(bufs);
return ret;
}
diff --git a/fs/pipe.c b/fs/pipe.c
index 1e7263bb837a..6534470a6c19 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -178,9 +178,9 @@ EXPORT_SYMBOL(generic_pipe_buf_steal);
* in the tee() system call, when we duplicate the buffers in one
* pipe into another.
*/
-void generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
+bool generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
{
- page_cache_get(buf->page);
+ return try_get_page(buf->page);
}
EXPORT_SYMBOL(generic_pipe_buf_get);
diff --git a/fs/splice.c b/fs/splice.c
index fde126369966..57ccc583a172 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1876,7 +1876,11 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
* Get a reference to this pipe buffer,
* so we can copy the contents over.
*/
- pipe_buf_get(ipipe, ibuf);
+ if (!pipe_buf_get(ipipe, ibuf)) {
+ if (ret == 0)
+ ret = -EFAULT;
+ break;
+ }
*obuf = *ibuf;
/*
@@ -1948,7 +1952,11 @@ static int link_pipe(struct pipe_inode_info *ipipe,
* Get a reference to this pipe buffer,
* so we can copy the contents over.
*/
- pipe_buf_get(ipipe, ibuf);
+ if (!pipe_buf_get(ipipe, ibuf)) {
+ if (ret == 0)
+ ret = -EFAULT;
+ break;
+ }
obuf = opipe->bufs + nbuf;
*obuf = *ibuf;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 10876f3cb3da..0b28b65c12fb 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -112,18 +112,20 @@ struct pipe_buf_operations {
/*
* Get a reference to the pipe buffer.
*/
- void (*get)(struct pipe_inode_info *, struct pipe_buffer *);
+ bool (*get)(struct pipe_inode_info *, struct pipe_buffer *);
};
/**
* pipe_buf_get - get a reference to a pipe_buffer
* @pipe: the pipe that the buffer belongs to
* @buf: the buffer to get a reference to
+ *
+ * Return: %true if the reference was successfully obtained.
*/
-static inline void pipe_buf_get(struct pipe_inode_info *pipe,
+static inline __must_check bool pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- buf->ops->get(pipe, buf);
+ return buf->ops->get(pipe, buf);
}
/* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
@@ -148,7 +150,7 @@ struct pipe_inode_info *alloc_pipe_info(void);
void free_pipe_info(struct pipe_inode_info *);
/* Generic pipe buffer ops functions */
-void generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
+bool generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c6e4e3e7f685..32cc4ea93ad6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5748,12 +5748,16 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
buf->private = 0;
}
-static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
+static bool buffer_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
struct buffer_ref *ref = (struct buffer_ref *)buf->private;
+ if (ref->ref > INT_MAX/2)
+ return false;
+
ref->ref++;
+ return true;
}
/* Pipe buffer operations for a buffer. */
--
2.23.0
The x86 version of get_user_pages_fast() relies on disabled interrupts to
synchronize gup_pte_range() between gup_get_pte(ptep); and get_page() against
a parallel munmap. The munmap side nulls the pte, then flushes TLBs, then
releases the page. As the TLB flush is done synchronously via IPI, disabling
interrupts blocks the page release, and get_page(), which assumes an existing
reference on the page, is thus safe.
However, when the TLB flush is done by a hypercall, e.g. in a Xen PV guest,
disabled interrupts do not block the page release, and get_page() can succeed
on a page that was already freed or even reused.
We have recently seen this happen with our 4.4 and 4.12 based kernels, with
userspace (java) that exits a thread, where mm_release() performs a futex_wake()
on tsk->clear_child_tid, and another thread in parallel unmaps the page that
tsk->clear_child_tid points to. The spurious get_page() succeeds, but futex code
immediately releases the page again, while it's already on a freelist. Symptoms
include a bad page state warning, general protection faults accessing a poisoned
list prev/next pointer in the freelist, or free page pcplists of two cpus joined
together in a single list. Oscar has also reproduced this scenario, with a
patch inserting delays before the get_page() to make the race window larger.
Fix this by removing the dependency on TLB flush interrupts the same way as the
generic get_user_pages_fast() code by using page_cache_add_speculative() and
revalidating the PTE contents after pinning the page. Mainline is safe since
4.13 where the x86 gup code was removed in favor of the common code. Accessing
the page table itself safely also relies on disabled interrupts and TLB flush
IPIs that don't happen with hypercalls, which was acknowledged in commit
9e52fc2b50de ("x86/mm: Enable RCU based page table freeing
(CONFIG_HAVE_RCU_TABLE_FREE=y)"). That commit with followups should also be
backported for full safety, although our reproducer didn't hit a problem
without that backport.
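To illustrate the "pin, then revalidate" idea outside the kernel, here is a
minimal userspace model using C11 atomics. It is only a sketch under stated
assumptions: fake_page, fake_get_speculative and fake_pin_from_slot are
invented names, and an atomic pointer slot stands in for the PTE.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct fake_page {
    atomic_int refcount;
};

/* Take a reference only if the count is positive (page not freed). */
static bool fake_get_speculative(struct fake_page *page)
{
    int old = atomic_load(&page->refcount);

    while (old > 0) {
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Read the slot, pin the page, then check the slot still points at it. */
static struct fake_page *fake_pin_from_slot(_Atomic(struct fake_page *) *slot)
{
    struct fake_page *page = atomic_load(slot);

    if (!page || !fake_get_speculative(page))
        return NULL;

    /* A concurrent "munmap" may have cleared the slot in the meantime. */
    if (atomic_load(slot) != page) {
        atomic_fetch_sub(&page->refcount, 1);
        return NULL;
    }
    return page;
}
```

The point of the second load is that correctness no longer depends on the
unmapping side being held off by disabled interrupts.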
Reproduced-by: Oscar Salvador <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
arch/x86/mm/gup.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 6612d532e42e..6379a4883c0a 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -9,6 +9,7 @@
#include <linux/vmstat.h>
#include <linux/highmem.h>
#include <linux/swap.h>
+#include <linux/pagemap.h>
#include <asm/pgtable.h>
@@ -95,10 +96,23 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
}
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
- if (unlikely(!try_get_page(page))) {
+
+ if (WARN_ON_ONCE(page_ref_count(page) < 0)) {
+ pte_unmap(ptep);
+ return 0;
+ }
+
+ if (!page_cache_get_speculative(page)) {
pte_unmap(ptep);
return 0;
}
+
+ if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ put_page(page);
+ pte_unmap(ptep);
+ return 0;
+ }
+
SetPageReferenced(page);
pages[*nr] = page;
(*nr)++;
--
2.23.0
From: Linus Torvalds <[email protected]>
commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64 upstream.
[ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks
in there, enabled by a new parameter, which is false where
upstream patch doesn't replace get_page() with try_get_page()
(the THP and hugetlb callers).
In gup_pte_range(), we don't expect tail pages, so just check
page ref count instead of try_get_compound_head()
Also patch arch-specific variants of gup.c for x86 and s390,
leaving mips, sh, sparc alone ]
If the page refcount wraps around past zero, it will be freed while
there are still four billion references to it. One of the possible
avenues for an attacker to try to make this happen is by doing direct IO
on a page multiple times. This patch makes get_user_pages() refuse to
take a new page reference if there are already more than two billion
references to the page.
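The arithmetic behind "four billion" and "two billion" can be checked with a
small userspace sketch. The helper names are made up, and the counter is
modelled as an unsigned 32-bit value so that the wraparound is well defined in
C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counter value after taking `extra` more references, modulo 2^32. */
static uint32_t fake_refs_after(uint32_t start, uint64_t extra)
{
    return (uint32_t)(start + extra);
}

/* True once the value would read as negative through a signed int. */
static bool fake_sign_bit_set(uint32_t value)
{
    return (value >> 31) != 0;
}
```

Starting from one reference, 2^32 - 1 further references bring the counter
back to zero, while the sign bit is already set after 2^31 - 1 of them. That
leaves roughly half of the counter range as a margin between the refusal
point and the actual wrap.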
Reported-by: Jann Horn <[email protected]>
Acked-by: Matthew Wilcox <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
arch/s390/mm/gup.c | 6 ++++--
arch/x86/mm/gup.c | 9 ++++++++-
mm/gup.c | 39 +++++++++++++++++++++++++++++++--------
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 18 ++++++++++++++++--
mm/internal.h | 12 +++++++++---
6 files changed, 69 insertions(+), 17 deletions(-)
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 7ad41be8b373..bdaa5f7b652c 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -37,7 +37,8 @@ static inline int gup_pte_range(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
return 0;
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
- if (!page_cache_get_speculative(page))
+ if (unlikely(WARN_ON_ONCE(page_ref_count(page) < 0)
+ || !page_cache_get_speculative(page)))
return 0;
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
put_page(page);
@@ -76,7 +77,8 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
- if (!page_cache_add_speculative(head, refs)) {
+ if (unlikely(WARN_ON_ONCE(page_ref_count(head) < 0)
+ || !page_cache_add_speculative(head, refs))) {
*nr -= refs;
return 0;
}
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 7d2542ad346a..6612d532e42e 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -95,7 +95,10 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
}
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
- get_page(page);
+ if (unlikely(!try_get_page(page))) {
+ pte_unmap(ptep);
+ return 0;
+ }
SetPageReferenced(page);
pages[*nr] = page;
(*nr)++;
@@ -132,6 +135,8 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
refs = 0;
head = pmd_page(pmd);
+ if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+ return 0;
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
@@ -208,6 +213,8 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
refs = 0;
head = pud_page(pud);
+ if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+ return 0;
page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
diff --git a/mm/gup.c b/mm/gup.c
index 71e9d0093a35..fc8e2dca99fc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -127,7 +127,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
if (flags & FOLL_GET)
- get_page_foll(page);
+ if (!get_page_foll(page, true)) {
+ page = ERR_PTR(-ENOMEM);
+ goto out;
+ }
if (flags & FOLL_TOUCH) {
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
@@ -289,7 +292,10 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
goto unmap;
*page = pte_page(*pte);
}
- get_page(*page);
+ if (unlikely(!try_get_page(*page))) {
+ ret = -ENOMEM;
+ goto unmap;
+ }
out:
ret = 0;
unmap:
@@ -1053,6 +1059,20 @@ struct page *get_dump_page(unsigned long addr)
*/
#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
+/*
+ * Return the compund head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+ struct page *head = compound_head(page);
+ if (WARN_ON_ONCE(page_ref_count(head) < 0))
+ return NULL;
+ if (unlikely(!page_cache_add_speculative(head, refs)))
+ return NULL;
+ return head;
+}
+
#ifdef __HAVE_ARCH_PTE_SPECIAL
static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
@@ -1083,6 +1103,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
+ if (WARN_ON_ONCE(page_ref_count(page) < 0))
+ goto pte_unmap;
+
if (!page_cache_get_speculative(page))
goto pte_unmap;
@@ -1139,8 +1162,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
- head = compound_head(pmd_page(orig));
- if (!page_cache_add_speculative(head, refs)) {
+ head = try_get_compound_head(pmd_page(orig), refs);
+ if (!head) {
*nr -= refs;
return 0;
}
@@ -1185,8 +1208,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
- head = compound_head(pud_page(orig));
- if (!page_cache_add_speculative(head, refs)) {
+ head = try_get_compound_head(pud_page(orig), refs);
+ if (!head) {
*nr -= refs;
return 0;
}
@@ -1227,8 +1250,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
- head = compound_head(pgd_page(orig));
- if (!page_cache_add_speculative(head, refs)) {
+ head = try_get_compound_head(pgd_page(orig), refs);
+ if (!head) {
*nr -= refs;
return 0;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 465786cd6490..6087277981a6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,7 +1322,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page), page);
if (flags & FOLL_GET)
- get_page_foll(page);
+ get_page_foll(page, false);
out:
return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fd932e7a25dd..b4a8a18fa3a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3886,6 +3886,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long vaddr = *position;
unsigned long remainder = *nr_pages;
struct hstate *h = hstate_vma(vma);
+ int err = -EFAULT;
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
@@ -3957,10 +3958,23 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
page = pte_page(huge_ptep_get(pte));
+
+ /*
+ * Instead of doing 'try_get_page()' below in the same_page
+ * loop, just check the count once here.
+ */
+ if (unlikely(page_count(page) <= 0)) {
+ if (pages) {
+ spin_unlock(ptl);
+ remainder = 0;
+ err = -ENOMEM;
+ break;
+ }
+ }
same_page:
if (pages) {
pages[i] = mem_map_offset(page, pfn_offset);
- get_page_foll(pages[i]);
+ get_page_foll(pages[i], false);
}
if (vmas)
@@ -3983,7 +3997,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*nr_pages = remainder;
*position = vaddr;
- return i ? i : -EFAULT;
+ return i ? i : err;
}
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
diff --git a/mm/internal.h b/mm/internal.h
index a6639c72780a..b52041969d06 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -93,23 +93,29 @@ static inline void __get_page_tail_foll(struct page *page,
* follow_page() and it must be called while holding the proper PT
* lock while the pte (or pmd_trans_huge) is still mapping the page.
*/
-static inline void get_page_foll(struct page *page)
+static inline bool get_page_foll(struct page *page, bool check)
{
- if (unlikely(PageTail(page)))
+ if (unlikely(PageTail(page))) {
/*
* This is safe only because
* __split_huge_page_refcount() can't run under
* get_page_foll() because we hold the proper PT lock.
*/
+ if (check && WARN_ON_ONCE(
+ page_ref_count(compound_head(page)) <= 0))
+ return false;
__get_page_tail_foll(page, true);
- else {
+ } else {
/*
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
*/
VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
+ if (check && WARN_ON_ONCE(page_ref_count(page) <= 0))
+ return false;
atomic_inc(&page->_count);
}
+ return true;
}
extern unsigned long highest_memmap_pfn;
--
2.23.0
From: Miklos Szeredi <[email protected]>
commit 7bf2d1df80822ec056363627e2014990f068f7aa upstream.
Signed-off-by: Miklos Szeredi <[email protected]>
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
fs/fuse/dev.c | 2 +-
fs/splice.c | 4 ++--
include/linux/pipe_fs_i.h | 11 +++++++++++
3 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index f5d2d2340b44..36a5df92eb9c 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2052,7 +2052,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
pipe->nrbufs--;
} else {
- ibuf->ops->get(pipe, ibuf);
+ pipe_buf_get(pipe, ibuf);
*obuf = *ibuf;
obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
obuf->len = rem;
diff --git a/fs/splice.c b/fs/splice.c
index 8398974e1538..fde126369966 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1876,7 +1876,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
* Get a reference to this pipe buffer,
* so we can copy the contents over.
*/
- ibuf->ops->get(ipipe, ibuf);
+ pipe_buf_get(ipipe, ibuf);
*obuf = *ibuf;
/*
@@ -1948,7 +1948,7 @@ static int link_pipe(struct pipe_inode_info *ipipe,
* Get a reference to this pipe buffer,
* so we can copy the contents over.
*/
- ibuf->ops->get(ipipe, ibuf);
+ pipe_buf_get(ipipe, ibuf);
obuf = opipe->bufs + nbuf;
*obuf = *ibuf;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 24f5470d3944..10876f3cb3da 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -115,6 +115,17 @@ struct pipe_buf_operations {
void (*get)(struct pipe_inode_info *, struct pipe_buffer *);
};
+/**
+ * pipe_buf_get - get a reference to a pipe_buffer
+ * @pipe: the pipe that the buffer belongs to
+ * @buf: the buffer to get a reference to
+ */
+static inline void pipe_buf_get(struct pipe_inode_info *pipe,
+ struct pipe_buffer *buf)
+{
+ buf->ops->get(pipe, buf);
+}
+
/* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
memory allocation, whereas PIPE_BUF makes atomicity guarantees. */
#define PIPE_SIZE PAGE_SIZE
--
2.23.0
From: Linus Torvalds <[email protected]>
commit f958d7b528b1b40c44cfda5eabe2d82760d868c3 upstream.
[ 4.4 backport: page_ref_count() doesn't exist, introduce it to reduce churn.
Change also two similar checks in mm/internal.h ]
We have a VM_BUG_ON() to check that the page reference count doesn't
underflow (or get close to overflow) by checking the sign of the count.
That's all fine, but we actually want to allow people to use a "get page
ref unless it's already very high" helper function, and we want that one
to use the sign of the page ref (without triggering this VM_BUG_ON).
Change the VM_BUG_ON to only check for small underflows (or _very_ close
to overflowing), and ignore overflows which have strayed into negative
territory.
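For reference, the range accepted by the new check can be tried out in
userspace. This sketch copies the expression from the patch into a plain
function; the function names are invented for the example.

```c
#include <limits.h>
#include <stdbool.h>

/* The old check: any non-positive count triggers. */
static bool fake_old_ref_check(int count)
{
    return count <= 0;
}

/*
 * The new check: adding 127 in unsigned arithmetic maps the counts
 * -127..0 onto 0..127, so only that small window triggers.
 */
static bool fake_ref_zero_or_close_to_overflow(int count)
{
    return (unsigned int)count + 127u <= 127u;
}
```

So a count of 0 or -127 still triggers, while -128 or INT_MIN (an overflow
that has strayed far into negative territory) does not.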
Acked-by: Matthew Wilcox <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/mm.h | 11 ++++++++++-
mm/internal.h | 5 +++--
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ed653ba47c46..997edfcb0a30 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -488,6 +488,15 @@ static inline void get_huge_page_tail(struct page *page)
extern bool __get_page_tail(struct page *page);
+static inline int page_ref_count(struct page *page)
+{
+ return atomic_read(&page->_count);
+}
+
+/* 127: arbitrary random number, small enough to assemble well */
+#define page_ref_zero_or_close_to_overflow(page) \
+ ((unsigned int) page_ref_count(page) + 127u <= 127u)
+
static inline void get_page(struct page *page)
{
if (unlikely(PageTail(page)))
@@ -497,7 +506,7 @@ static inline void get_page(struct page *page)
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
*/
- VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
atomic_inc(&page->_count);
}
diff --git a/mm/internal.h b/mm/internal.h
index f63f4393d633..a6639c72780a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -81,7 +81,8 @@ static inline void __get_page_tail_foll(struct page *page,
* speculative page access (like in
* page_cache_get_speculative()) on tail pages.
*/
- VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
+ VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(compound_head(page)),
+ page);
if (get_page_head)
atomic_inc(&compound_head(page)->_count);
get_huge_page_tail(page);
@@ -106,7 +107,7 @@ static inline void get_page_foll(struct page *page)
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
*/
- VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
atomic_inc(&page->_count);
}
}
--
2.23.0
From: Linus Torvalds <[email protected]>
commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 upstream.
[ 4.4 backport: get_page() is more complicated due to special handling
of tail pages via __get_page_tail(). But in all cases, eventually the
compound head page's refcount is incremented. So try_get_page() just
checks compound head's refcount for overflow and then simply calls
get_page(). ]
This is the same as the traditional 'get_page()' function, but instead
of unconditionally incrementing the reference count of the page, it only
does so if the count was "safe". It returns whether the reference count
was incremented (and is marked __must_check, since the caller obviously
has to be aware of it).
Also like 'get_page()', you can't use this function unless you already
had a reference to the page. The intent is that you can use this
exactly like get_page(), but in situations where you want to limit the
maximum reference count.
The code currently does an unconditional WARN_ON_ONCE() if we ever hit
the reference count issues (either zero or negative), as a notification
that the conditional non-increment actually happened.
NOTE! The count access for the "safety" check is inherently racy, but
that doesn't matter since the buffer we use is basically half the range
of the reference count (ie we look at the sign of the count).
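A minimal userspace model of the helper's logic, for illustration only:
fake_page and fake_try_get_page are invented names, the compound page handling
of the 4.4 backport is left out, and no warning is emitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page. */
struct fake_page {
    atomic_int refcount;
};

static bool fake_try_get_page(struct fake_page *page)
{
    /*
     * The load and the increment are separate steps, so the check is
     * racy. Refusing at the sign bit leaves half of the counter range
     * as a margin, which is why the race is tolerable.
     */
    if (atomic_load(&page->refcount) <= 0)
        return false;

    atomic_fetch_add(&page->refcount, 1);
    return true;
}
```

A caller is expected to test the return value and back out on failure, just
as it would for any other failed allocation-like step.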
Acked-by: Matthew Wilcox <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/mm.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 997edfcb0a30..78358aeb7732 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -510,6 +510,21 @@ static inline void get_page(struct page *page)
atomic_inc(&page->_count);
}
+static inline __must_check bool try_get_page(struct page *page)
+{
+ struct page *head = compound_head(page);
+
+ /*
+ * get_page() always increases the head page's refcount, either directly or
+ * via __get_page_tail() for a tail page, so that is the count we check
+ */
+ if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+ return false;
+
+ get_page(page);
+ return true;
+}
+
static inline struct page *virt_to_head_page(const void *x)
{
struct page *page = virt_to_page(x);
--
2.23.0
On 08/11/19, 3:08 PM, "Vlastimil Babka" <[email protected]> wrote:
> From: Linus Torvalds <[email protected]>
>
> commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64 upstream.
>
> [ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks
> in there, enabled by a new parameter, which is false where
> upstream patch doesn't replace get_page() with try_get_page()
> (the THP and hugetlb callers).
Could we have try_get_page_foll(), as in:
https://lore.kernel.org/stable/[email protected]/
+ Code will be in sync as we have try_get_page()
+ No need to add extra argument to try_get_page()
+ No need to modify the callers of try_get_page()
> In gup_pte_range(), we don't expect tail pages, so just check
> page ref count instead of try_get_compound_head()
Technically it's fine. If you want to keep the code of the stable versions in
sync with the latest versions, this could be done in the following way (without
any modification to the upstream patch for gup_pte_range()):
Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying
8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here:
https://lore.kernel.org/stable/[email protected]/
> Also patch arch-specific variants of gup.c for x86 and s390,
> leaving mips, sh, sparc alone ]
>
> ---
> arch/s390/mm/gup.c | 6 ++++--
> arch/x86/mm/gup.c | 9 ++++++++-
> mm/gup.c | 39 +++++++++++++++++++++++++++++++--------
> mm/huge_memory.c | 2 +-
> mm/hugetlb.c | 18 ++++++++++++++++--
> mm/internal.h | 12 +++++++++---
> 6 files changed, 69 insertions(+), 17 deletions(-)
>
> #ifdef __HAVE_ARCH_PTE_SPECIAL
> static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> int write, struct page **pages, int *nr)
> @@ -1083,6 +1103,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> page = pte_page(pte);
>
> + if (WARN_ON_ONCE(page_ref_count(page) < 0))
> + goto pte_unmap;
> +
> if (!page_cache_get_speculative(page))
> goto pte_unmap;
> diff --git a/mm/internal.h b/mm/internal.h
> index a6639c72780a..b52041969d06 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -93,23 +93,29 @@ static inline void __get_page_tail_foll(struct page *page,
> * follow_page() and it must be called while holding the proper PT
> * lock while the pte (or pmd_trans_huge) is still mapping the page.
> */
> -static inline void get_page_foll(struct page *page)
> +static inline bool get_page_foll(struct page *page, bool check)
> {
> - if (unlikely(PageTail(page)))
> + if (unlikely(PageTail(page))) {
> /*
> * This is safe only because
> * __split_huge_page_refcount() can't run under
> * get_page_foll() because we hold the proper PT lock.
> */
> + if (check && WARN_ON_ONCE(
> + page_ref_count(compound_head(page)) <= 0))
> + return false;
> __get_page_tail_foll(page, true);
> - else {
> + } else {
> /*
> * Getting a normal page or the head of a compound page
> * requires to already have an elevated page->_count.
> */
> VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
> + if (check && WARN_ON_ONCE(page_ref_count(page) <= 0))
> + return false;
> atomic_inc(&page->_count);
> }
> + return true;
> }
On 12/3/19 1:25 PM, Ajay Kaher wrote:
>
>
> On 08/11/19, 3:08 PM, "Vlastimil Babka" <[email protected]> wrote:
>
>> From: Linus Torvalds <[email protected]>
>>
>> commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64 upstream.
>>
>> [ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks
>> in there, enabled by a new parameter, which is false where
>> upstream patch doesn't replace get_page() with try_get_page()
>> (the THP and hugetlb callers).
>
> Could we have try_get_page_foll(), as in:
> https://lore.kernel.org/stable/[email protected]/
>
> + Code will be in sync as we have try_get_page()
> + No need to add extra argument to try_get_page()
> + No need to modify the callers of try_get_page()
>
>> In gup_pte_range(), we don't expect tail pages, so just check
>> page ref count instead of try_get_compound_head()
>
> Technically it's fine. If you want to keep the code of the stable versions in
> sync with the latest versions, this could be done in the following way (without
> any modification to the upstream patch for gup_pte_range()):
>
> Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying
> 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here:
> https://lore.kernel.org/stable/[email protected]/
Yup, I have considered that, and deliberately didn't add that commit
7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup
implementaiton") as it's part of a large THP refcount rework. In 4.4 we
don't expect to GUP tail pages so I wanted to keep it that way -
minimally, the compound_head() operation is an unnecessary added cost,
although it would also work.
On 03/12/19, 6:28 PM, "Vlastimil Babka" <[email protected]> wrote:
>>>
>>> [ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks
>>> in there, enabled by a new parameter, which is false where
>>> upstream patch doesn't replace get_page() with try_get_page()
>>> (the THP and hugetlb callers).
>>
>> Could we have try_get_page_foll(), as in:
>> https://lore.kernel.org/stable/1570581863-12090-3-git-send-email-akaher@vmware.com/
>>
>> + Code will be in sync as we have try_get_page()
>> + No need to add extra argument to try_get_page()
>> + No need to modify the callers of try_get_page()
Any reason for not using try_get_page_foll()?
>>> In gup_pte_range(), we don't expect tail pages, so just check
>>> page ref count instead of try_get_compound_head()
>>
>> Technically it's fine. If you want to keep the code of the stable versions in
>> sync with the latest versions, this could be done in the following way (without
>> any modification to the upstream patch for gup_pte_range()):
>>
>> Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying
>> 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here:
>> https://lore.kernel.org/stable/1570581863-12090-4-git-send-email-akaher@vmware.com/
> Yup, I have considered that, and deliberately didn't add that commit
> 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup
> implementaiton") as it's part of a large THP refcount rework. In 4.4 we
> don't expect to GUP tail pages so I wanted to keep it that way -
> minimally, the compound_head() operation is an unnecessary added cost,
> although it would also work.
On 12/6/19 5:15 AM, Ajay Kaher wrote:
>
>
> On 03/12/19, 6:28 PM, "Vlastimil Babka" <[email protected]> wrote:
>>>>
>>>> [ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks
>>>> in there, enabled by a new parameter, which is false where
>>>> upstream patch doesn't replace get_page() with try_get_page()
>>>> (the THP and hugetlb callers).
>>>
>>> Could we have try_get_page_foll(), as in:
>>> https://lore.kernel.org/stable/1570581863-12090-3-git-send-email-akaher@vmware.com/
>>>
>>> + Code will be in sync as we have try_get_page()
>>> + No need to add extra argument to try_get_page()
>>> + No need to modify the callers of try_get_page()
>
> Any reason for not using try_get_page_foll()?
Ah, sorry, I missed that previously. It's certainly possible to do it
that way, I just didn't care strongly enough to rewrite the existing SLES
patch. It's a stable backport for a rather old LTS, not a codebase for
further development.
>>>> In gup_pte_range(), we don't expect tail pages, so just check
>>>> page ref count instead of try_get_compound_head()
>>>
>>> Technically it's fine. If you want to keep the code of the stable versions in
>>> sync with the latest versions, this could be done in the following way (without
>>> any modification to the upstream patch for gup_pte_range()):
>>>
>>> Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying
>>> 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here:
>>> https://lore.kernel.org/stable/1570581863-12090-4-git-send-email-akaher@vmware.com/
>
>> Yup, I have considered that, and deliberately didn't add that commit
>> 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup
>> implementaiton") as it's part of a large THP refcount rework. In 4.4 we
>> don't expect to GUP tail pages so I wanted to keep it that way -
>> minimally, the compound_head() operation is an unnecessary added cost,
>> although it would also work.
>
>
On 06/12/19, 8:02 PM, "Vlastimil Babka" <[email protected]> wrote:
> On 12/6/19 5:15 AM, Ajay Kaher wrote:
>>
>>
>> On 03/12/19, 6:28 PM, "Vlastimil Babka" <[email protected]> wrote:
>>>>>
>>>>> [ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks
>>>>> in there, enabled by a new parameter, which is false where
>>>>> upstream patch doesn't replace get_page() with try_get_page()
>>>>> (the THP and hugetlb callers).
>>>>
>>>> Could we have try_get_page_foll(), as in:
>>>> https://lore.kernel.org/stable/1570581863-12090-3-git-send-email-akaher@vmware.com/
>>>>
>>>> + Code will be in sync as we have try_get_page()
>>>> + No need to add extra argument to try_get_page()
>>>> + No need to modify the callers of try_get_page()
>>
>> Any reason for not using try_get_page_foll()?
>
> Ah, sorry, I missed that previously. It's certainly possible to do it
> that way, I just didn't care strongly enough to rewrite the existing SLES
> patch. It's a stable backport for a rather old LTS, not a codebase for
> further development.
Thanks for your response.
I would appreciate it if you could include try_get_page_foll()
and resend this patch series.
Greg may also require an Acked-by from me, so if that's fine with you,
you can add it, or I will add it once you post the series again.
Let me know if there is anything else I can do here.
>>>>> In gup_pte_range(), we don't expect tail pages, so just check
>>>>> page ref count instead of try_get_compound_head()
>>>>
>>>> Technically it's fine. If you want to keep the code of the stable versions in
>>>> sync with the latest versions, this could be done in the following way (without
>>>> any modification to the upstream patch for gup_pte_range()):
>>>>
>>>> Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying
>>>> 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here:
>>>> https://lore.kernel.org/stable/1570581863-12090-4-git-send-email-akaher@vmware.com/
>>
>>> Yup, I have considered that, and deliberately didn't add that commit
>>> 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup
>>> implementaiton") as it's part of a large THP refcount rework. In 4.4 we
>>> don't expect to GUP tail pages so I wanted to keep it that way -
>>> minimally, the compound_head() operation is an unnecessary added cost,
>>> although it would also work.
>>
Thanks for the above explanation.
On 12/9/19 9:54 AM, Ajay Kaher wrote:
>
>
> On 06/12/19, 8:02 PM, "Vlastimil Babka" <[email protected]> wrote:
>
>> On 12/6/19 5:15 AM, Ajay Kaher wrote:
>>>
>>>
>>> On 03/12/19, 6:28 PM, "Vlastimil Babka" <[email protected]> wrote:
>>>>>>
>>>>>> [ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks
>>>>>> in there, enabled by a new parameter, which is false where
>>>>>> upstream patch doesn't replace get_page() with try_get_page()
>>>>>> (the THP and hugetlb callers).
>>>>>
>>>>> Could we have try_get_page_foll(), as in:
>>>>> https://lore.kernel.org/stable/1570581863-12090-3-git-send-email-akaher@vmware.com/
>>>>>
>>>>> + Code will be in sync as we have try_get_page()
>>>>> + No need to add extra argument to try_get_page()
>>>>> + No need to modify the callers of try_get_page()
>>>
>>> Any reason for not using try_get_page_foll()?
>>
>> Ah, sorry, I missed that previously. It's certainly possible to do it
>> that way, I just didn't care strongly enough to rewrite the existing SLES
>> patch. It's a stable backport for a rather old LTS, not a codebase for
>> further development.
>
> Thanks for your response.
>
> I would appreciate it if you could include try_get_page_foll()
> and resend this patch series.
I won't have time for that now, but I don't mind if you do that, or
resend your version with the missing x86 and s390 gup.c parts and
preferably without 7aef4172c795.