Hi Christoph, David,
Since Christoph asked nicely[1] ;-), here are three patches that go on top
of the similar patches for bio structs now in the block tree. They make the
old block direct-IO code use iov_iter_extract_pages() and page pinning:
(1) Make page pinning not add or remove a pin to/from the ZERO_PAGE,
thereby allowing the dio code to insert zero pages in the middle of
dealing with pinned pages.
(2) Provide a function to allow an additional pin to be taken on a page we
already have pinned (and do nothing for the zero page).
(3) Switch direct-io.c over to using page pinning and to use
iov_iter_extract_pages() so that pages from non-user-backed iterators
aren't pinned.
Note that I haven't managed to test this yet, as SELinux is refusing to let
me mount things like ext2 filesystems on account of them not having xattrs :-/
I've also pushed the patches here:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-old-dio
David
Link: https://lore.kernel.org/r/ZGxfrOLZ4aN9/[email protected]/ [1]
David Howells (3):
mm: Don't pin ZERO_PAGE in pin_user_pages()
mm: Provide a function to get an additional pin on a page
block: Use iov_iter_extract_pages() and page pinning in direct-io.c
fs/direct-io.c | 68 ++++++++++++++++++++++++++++------------------
include/linux/mm.h | 1 +
mm/gup.c | 54 +++++++++++++++++++++++++++++++++++-
3 files changed, 95 insertions(+), 28 deletions(-)
Provide a function to get an additional pin on a page that we already have
a pin on. This will be used in fs/direct-io.c when dispatching multiple
bios to a page we've extracted from a user-backed iter rather than redoing
the extraction.
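
As a rough illustration of the accounting described above, here is a minimal userspace sketch. It is a model only, not the kernel code: struct model_folio, MODEL_PIN_BIAS and the model_*() helpers are hypothetical names, and MODEL_PIN_BIAS is assumed to mirror GUP_PIN_COUNTING_BIAS. A small folio carries each pin as a bias added to its refcount, a large folio carries it as one refcount plus one pincount, and the zero page is left untouched.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors GUP_PIN_COUNTING_BIAS; the name is hypothetical. */
#define MODEL_PIN_BIAS 1024

/* Hypothetical stand-in for the fields of struct folio that matter here. */
struct model_folio {
	int refcount;	/* models the folio refcount */
	int pincount;	/* models folio->_pincount (large folios only) */
	bool large;	/* models folio_test_large() */
	bool is_zero;	/* models "this is the zero page" */
};

/* First pin, as taken when the page is extracted from the iterator. */
static inline void model_pin(struct model_folio *f)
{
	if (f->is_zero)
		return;			/* never pin the zero page */
	if (f->large) {
		f->refcount += 1;
		f->pincount += 1;
	} else {
		f->refcount += MODEL_PIN_BIAS;
	}
}

/* Extra pin on a folio the caller has already pinned. */
static inline void model_get_additional_pin(struct model_folio *f)
{
	if (f->is_zero)
		return;
	if (f->large) {
		assert(f->pincount >= 1);	/* must already be pinned */
		f->refcount += 1;
		f->pincount += 1;
	} else {
		assert(f->refcount >= MODEL_PIN_BIAS);
		f->refcount += MODEL_PIN_BIAS;
	}
}

/* Drop one pin; every pin taken above needs exactly one of these. */
static inline void model_unpin(struct model_folio *f)
{
	if (f->is_zero)
		return;
	if (f->large) {
		f->pincount -= 1;
		f->refcount -= 1;
	} else {
		f->refcount -= MODEL_PIN_BIAS;
	}
}
```

In this model, a page that is handed to two bios would be pinned once by model_pin(), given a second pin by model_get_additional_pin(), and unpinned once per bio completion, leaving the refcount where it started.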
Signed-off-by: David Howells <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: Andrew Morton <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Al Viro <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Jan Kara <[email protected]>
cc: Jeff Layton <[email protected]>
cc: Jason Gunthorpe <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: Hillf Danton <[email protected]>
cc: Christian Brauner <[email protected]>
cc: Linus Torvalds <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
include/linux/mm.h | 1 +
mm/gup.c | 29 +++++++++++++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..931b75dae7ff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2383,6 +2383,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
int pin_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
+void page_get_additional_pin(struct page *page);
int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
diff --git a/mm/gup.c b/mm/gup.c
index d2662aa8cf01..b1e55847ca13 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -275,6 +275,35 @@ void unpin_user_page(struct page *page)
}
EXPORT_SYMBOL(unpin_user_page);
+/**
+ * page_get_additional_pin - Get an additional pin on a pinned page
+ * @page: The page to be pinned
+ *
+ * Get an additional pin on a page we already have a pin on.  Makes no change
+ * if the page is the zero_page.
+ */
+void page_get_additional_pin(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	if (page == ZERO_PAGE(0))
+		return;
+
+	/*
+	 * Similar to try_grab_folio(): be sure to *also* increment the normal
+	 * page refcount field at least once, so that the page really is
+	 * pinned.
+	 */
+	if (folio_test_large(folio)) {
+		WARN_ON_ONCE(atomic_read(&folio->_pincount) < 1);
+		folio_ref_add(folio, 1);
+		atomic_add(1, &folio->_pincount);
+	} else {
+		WARN_ON_ONCE(folio_ref_count(folio) < GUP_PIN_COUNTING_BIAS);
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+}
+
static inline struct folio *gup_folio_range_next(struct page *start,
unsigned long npages, unsigned long i, unsigned int *ntails)
{
Make pin_user_pages*() leave the ZERO_PAGE unpinned if it extracts a
pointer to it from the page tables and make unpin_user_page*()
correspondingly ignore the ZERO_PAGE when unpinning. We don't want to risk
overrunning the zero page's refcount as we're only allowed ~2 million pins
on it - something that userspace can conceivably trigger.
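
For reference, the "~2 million" figure can be reproduced with a little arithmetic. The sketch below assumes that each pin on a small page adds GUP_PIN_COUNTING_BIAS (taken here to be 1U << 10) to a signed 32-bit refcount; MODEL_PIN_BIAS and model_max_pins() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors GUP_PIN_COUNTING_BIAS (1U << 10) from include/linux/mm.h. */
#define MODEL_PIN_BIAS (1U << 10)

/*
 * Hypothetical helper: how many pins of size @bias fit into a signed
 * refcount whose largest positive value is @refcount_max.
 */
static inline uint32_t model_max_pins(int32_t refcount_max, uint32_t bias)
{
	assert(refcount_max > 0 && bias > 0);
	return (uint32_t)refcount_max / bias;
}
```

Under those assumptions, model_max_pins(INT32_MAX, MODEL_PIN_BIAS) evaluates to 2097151, which is the roughly 2 million pins mentioned above.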
Signed-off-by: David Howells <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: Andrew Morton <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Al Viro <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Jan Kara <[email protected]>
cc: Jeff Layton <[email protected]>
cc: Jason Gunthorpe <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: Hillf Danton <[email protected]>
cc: Christian Brauner <[email protected]>
cc: Linus Torvalds <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
mm/gup.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..d2662aa8cf01 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
struct page *page = *pages;
struct folio *folio = page_folio(page);
- if (!folio_test_anon(folio))
+ if (page == ZERO_PAGE(0) ||
+ !folio_test_anon(folio))
continue;
if (!folio_test_large(folio) || folio_test_hugetlb(folio))
VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
@@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
else if (flags & FOLL_PIN) {
struct folio *folio;
+ /*
+ * Don't take a pin on the zero page - it's not going anywhere
+ * and it is used in a *lot* of places.
+ */
+ if (page == ZERO_PAGE(0))
+ return page_folio(ZERO_PAGE(0));
+
/*
* Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
* right zone, so fail and let the caller fall back to the slow
@@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
if (flags & FOLL_PIN) {
+ if (folio == page_folio(ZERO_PAGE(0)))
+ return;
node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
if (folio_test_large(folio))
atomic_sub(refs, &folio->_pincount);
@@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
if (flags & FOLL_GET)
folio_ref_inc(folio);
else if (flags & FOLL_PIN) {
+ /*
+ * Don't take a pin on the zero page - it's not going anywhere
+ * and it is used in a *lot* of places.
+ */
+ if (page == ZERO_PAGE(0))
+ return 0;
+
/*
* Similar to try_grab_folio(): be sure to *also*
* increment the normal page refcount field at least once,
@@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
*
* FOLL_PIN means that the pages must be released via unpin_user_page(). Please
* see Documentation/core-api/pin_user_pages.rst for further details.
+ *
+ * Note that if the zero_page is amongst the returned pages, it will not have
+ * pins on it and unpin_user_page() will not remove pins from it.
*/
int pin_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)
@@ -3161,6 +3181,9 @@ EXPORT_SYMBOL(pin_user_pages);
* pin_user_pages_unlocked() is the FOLL_PIN variant of
* get_user_pages_unlocked(). Behavior is the same, except that this one sets
* FOLL_PIN and rejects FOLL_GET.
+ *
+ * Note that if the zero_page is amongst the returned pages, it will not have
+ * pins on it and unpin_user_page() will not remove pins from it.
*/
long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags)
On 25.05.23 17:51, David Howells wrote:
> Make pin_user_pages*() leave the ZERO_PAGE unpinned if it extracts a
> pointer to it from the page tables and make unpin_user_page*()
> correspondingly ignore the ZERO_PAGE when unpinning. We don't want to risk
> overrunning the zero page's refcount as we're only allowed ~2 million pins
> on it - something that userspace can conceivably trigger.
>
As Linus raised, the ZERO_PAGE(0) checks should probably be
is_zero_pfn(page_to_pfn(page)).
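
To illustrate why the two checks can differ, here is a small userspace model. It assumes an architecture that keeps several zero pages in a contiguous pfn range (my understanding is that some do this for cache colouring); the constants and the model_*() names are hypothetical and are not the kernel's definitions. A comparison against a single page matches only the first of them, whereas a range check on the pfn matches all of them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: 8 zero pages starting at pfn 0x1000. */
#define MODEL_NR_ZERO_PAGES 8UL
#define MODEL_ZERO_PFN      0x1000UL

static_assert(MODEL_NR_ZERO_PAGES > 0, "need at least one zero page");

/* Models a "page == ZERO_PAGE(0)" style check: one specific page only. */
static inline bool model_is_zero_page0(unsigned long pfn)
{
	return pfn == MODEL_ZERO_PFN;
}

/* Models a range check on the pfn: any page in the zero-page range. */
static inline bool model_is_zero_pfn(unsigned long pfn)
{
	/* Unsigned subtraction wraps for pfns below the range, so they fail. */
	unsigned long offset = pfn - MODEL_ZERO_PFN;

	return offset < MODEL_NR_ZERO_PAGES;
}
```

In this model pfn 0x1003 is a zero page according to the range check but not according to the single-page comparison, which is the case the suggested change is meant to cover.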
> Signed-off-by: David Howells <[email protected]>
> cc: Christoph Hellwig <[email protected]>
> cc: David Hildenbrand <[email protected]>
> cc: Andrew Morton <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Al Viro <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: Jan Kara <[email protected]>
> cc: Jeff Layton <[email protected]>
> cc: Jason Gunthorpe <[email protected]>
> cc: Logan Gunthorpe <[email protected]>
> cc: Hillf Danton <[email protected]>
> cc: Christian Brauner <[email protected]>
> cc: Linus Torvalds <[email protected]>
> cc: [email protected]
> cc: [email protected]
> cc: [email protected]
> cc: [email protected]
> ---
> mm/gup.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..d2662aa8cf01 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
> struct page *page = *pages;
> struct folio *folio = page_folio(page);
>
> - if (!folio_test_anon(folio))
> + if (page == ZERO_PAGE(0) ||
> + !folio_test_anon(folio))
> continue;
> if (!folio_test_large(folio) || folio_test_hugetlb(folio))
> VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> @@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> else if (flags & FOLL_PIN) {
> struct folio *folio;
>
> + /*
> + * Don't take a pin on the zero page - it's not going anywhere
> + * and it is used in a *lot* of places.
> + */
> + if (page == ZERO_PAGE(0))
> + return page_folio(ZERO_PAGE(0));
With the fixed check, this should be
return page_folio(page);
I guess.
--
Thanks,
David / dhildenb