2023-01-30 11:15:08

by David Howells

[permalink] [raw]
Subject: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

Hi Jens,

Could you consider pulling this patchset into the block tree? I think that
Al's fears wrt to pinned pages being removed from page tables causing deadlock
have been answered. Granted, there is still the issue of how to handle
vmsplice and a bunch of other places to fix, not least skbuff handling.

I also have patches to fix cifs in a separate branch that I would also like to
push in this merge window - and that requires the first two patches from this
series also, so would it be possible for you to merge at least those two
rather than manually applying them?

Thanks,
David
-~-
Here are patches to provide support for extracting pages from an iov_iter
and to use this in the extraction functions in the block layer bio code.

The patches make the following changes:

(1) Add a function, iov_iter_extract_pages() to replace
iov_iter_get_pages*() that gets refs, pins or just lists the pages as
appropriate to the iterator type.

Add a function, iov_iter_extract_will_pin() that will indicate from
the iterator type how the cleanup is to be performed, returning true
if the pages will need unpinning, false otherwise.

(2) Make the bio struct carry a pair of flags to indicate the cleanup
mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (indicating
FOLL_GET was used) and BIO_PAGE_PINNED (indicating FOLL_PIN was used)
is added.

BIO_PAGE_REFFED will go away, but at the moment fs/direct-io.c sets it
and this series does not fully address that file.

(4) Add a function, bio_release_page(), to release a page appropriately to
the cleanup mode indicated by the BIO_PAGE_* flags.

(5) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
pages appropriately and clean them up later.

(6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

Changes:
========
ver #12) (unposted)
- Added the missing __bitwise on the iov_iter_extraction_t typedef.

ver #11)
- Fix iov_iter_extract_kvec_pages() to include the offset into the page in
the returned starting offset.
- Use __bitwise for the extraction flags

ver #10)
- Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
- Drop bio_set_cleanup_mode(), open coding it instead.

ver #9)
- It's now not permitted to use FOLL_PIN outside of mm/, so:
- Change iov_iter_extract_mode() into iov_iter_extract_will_pin() and
return true/false instead of FOLL_PIN/0.
- Drop of folio_put_unpin() and page_put_unpin() and instead call
unpin_user_page() (and put_page()) directly as necessary.
- Make __bio_release_pages() call bio_release_page() instead of
unpin_user_page() as there's no BIO_* -> FOLL_* translation to do.
- Drop the FOLL_* renumbering patch.
- Change extract_flags to extraction_flags.

ver #8)
- Import Christoph Hellwig's changes.
- Split the conversion-to-extraction patch.
- Drop the extract_flags arg from iov_iter_extract_mode().
- Don't default bios to BIO_PAGE_REFFED, but set explicitly.
- Switch FOLL_PIN and FOLL_GET when renumbering so PIN is at bit 0.
- Switch BIO_PAGE_PINNED and BIO_PAGE_REFFED so PINNED is at bit 0.
- We should always be using FOLL_PIN (not FOLL_GET) for DIO, so adjust the
patches for that.

ver #7)
- For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
as it turned out to be a lot more complicated, with places not setting
IOCB_WRITE when they should, for example.
- Drop all the patches that changed things other then the block layer's
bio handling. The netfslib and cifs changes can go into a separate
patchset.
- Add support for extracting pages from KVEC-type iterators.
- When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
- Fix write() syscall and co. not setting IOCB_WRITE.
- Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
- Use op_is_write() in bio_copy_user_iov().
- Drop the iterator direction checks from smbd_recv().
- Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
gup_flags to iov_iter_get/extract_pages*().
- Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
- Add back the function to indicate the cleanup mode.
- Drop the cleanup_mode return arg to iov_iter_extract_pages().
- Provide a helper to clean up a page.
- Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
the same numerical values, enforced with an assertion.
- Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
NFS.
- Added in the patches to make CIFS do top-to-bottom iterators and use
various of the added extraction functions.
- Added a pair of work-in-progess patches to make sk_buff fragments store
FOLL_GET and FOLL_PIN.

ver #5)
- Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
- Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
- Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
- Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
no longer referenced by linux/uio.h.
- Add ITER_SOURCE/DEST cleanup patches.
- Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
- Allow additional gup_flags to be passed into iov_iter_extract_pages().
- Add struct bio patch.

ver #3)
- Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
to get/pin_user_pages_fast()[1].

ver #2)
- Rolled the extraction cleanup mode query function into the extraction
function, returning the indication through the argument list.
- Fixed patch 4 (extract to scatterlist) to actually use the new
extraction API.

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6
Link: https://lore.kernel.org/r/[email protected]/ # v7
Link: https://lore.kernel.org/r/[email protected]/ # v8
Link: https://lore.kernel.org/r/[email protected]/ # v9
Link: https://lore.kernel.org/r/[email protected]/ # v10
Link: https://lore.kernel.org/r/[email protected]/ # v11

---
The following changes since commit 2241ab53cbb5cdb08a6b2d4688feb13971058f65:

Linux 6.2-rc5 (2023-01-21 16:27:01 -0800)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/iov-extract-20230130

for you to fetch changes up to fd20d0c1852ebb3f37ec7101feb0cdd8695f32a5:

block: convert bio_map_user_iov to use iov_iter_extract_pages (2023-01-27 22:13:21 +0000)

----------------------------------------------------------------
Make block-bio use pinning

----------------------------------------------------------------
Christoph Hellwig (1):
block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic

David Howells (7):
iov_iter: Define flags to qualify page extraction.
iov_iter: Add a function to extract a page list from an iterator
iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
block: Fix bio_flagged() so that gcc can better optimise it
block: Add BIO_PAGE_PINNED and associated infrastructure
block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
block: convert bio_map_user_iov to use iov_iter_extract_pages

block/bio.c | 33 ++---
block/blk-map.c | 26 ++--
block/blk.h | 12 ++
fs/direct-io.c | 2 +
fs/iomap/direct-io.c | 1 -
include/linux/bio.h | 5 +-
include/linux/blk_types.h | 3 +-
include/linux/uio.h | 35 ++++-
lib/iov_iter.c | 335 +++++++++++++++++++++++++++++++++++++++++++++-
9 files changed, 411 insertions(+), 41 deletions(-)



2023-01-30 21:33:39

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/30/23 4:14 AM, David Howells wrote:
> Hi Jens,
>
> Could you consider pulling this patchset into the block tree? I think that
> Al's fears wrt to pinned pages being removed from page tables causing deadlock
> have been answered. Granted, there is still the issue of how to handle
> vmsplice and a bunch of other places to fix, not least skbuff handling.
>
> I also have patches to fix cifs in a separate branch that I would also like to
> push in this merge window - and that requires the first two patches from this
> series also, so would it be possible for you to merge at least those two
> rather than manually applying them?

I've pulled this into a separate branch, but based on the block branch,
for-6.3/iov-extract. It's added to for-next as well.

--
Jens Axboe



2023-01-30 21:53:50

by David Howells

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

Jens Axboe <[email protected]> wrote:

> > Hi Jens,
> >
> > Could you consider pulling this patchset into the block tree? I think that
> > Al's fears wrt to pinned pages being removed from page tables causing deadlock
> > have been answered. Granted, there is still the issue of how to handle
> > vmsplice and a bunch of other places to fix, not least skbuff handling.
> >
> > I also have patches to fix cifs in a separate branch that I would also like to
> > push in this merge window - and that requires the first two patches from this
> > series also, so would it be possible for you to merge at least those two
> > rather than manually applying them?
>
> I've pulled this into a separate branch, but based on the block branch,
> for-6.3/iov-extract. It's added to for-next as well.

Many thanks!

David


2023-01-30 21:56:13

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/30/23 2:33 PM, Jens Axboe wrote:
> On 1/30/23 4:14 AM, David Howells wrote:
>> Hi Jens,
>>
>> Could you consider pulling this patchset into the block tree? I think that
>> Al's fears wrt to pinned pages being removed from page tables causing deadlock
>> have been answered. Granted, there is still the issue of how to handle
>> vmsplice and a bunch of other places to fix, not least skbuff handling.
>>
>> I also have patches to fix cifs in a separate branch that I would also like to
>> push in this merge window - and that requires the first two patches from this
>> series also, so would it be possible for you to merge at least those two
>> rather than manually applying them?
>
> I've pulled this into a separate branch, but based on the block branch,
> for-6.3/iov-extract. It's added to for-next as well.

This does cause about a 2.7% regression for me, using O_DIRECT on a raw
block device. Looking at a perf diff, here's the top:

+2.71% [kernel.vmlinux] [k] mod_node_page_state
+2.22% [kernel.vmlinux] [k] iov_iter_extract_pages

and these two are gone:

2.14% [kernel.vmlinux] [k] __iov_iter_get_pages_alloc
1.53% [kernel.vmlinux] [k] iov_iter_get_pages

rest is mostly in the noise, but mod_node_page_state() sticks out like
a sore thumb. They seem to be caused by the node stat accounting done
in gup.c for FOLL_PIN.

Hmm?

--
Jens Axboe



2023-01-30 21:57:21

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/30/23 2:55 PM, Jens Axboe wrote:
> On 1/30/23 2:33 PM, Jens Axboe wrote:
>> On 1/30/23 4:14 AM, David Howells wrote:
>>> Hi Jens,
>>>
>>> Could you consider pulling this patchset into the block tree? I think that
>>> Al's fears wrt to pinned pages being removed from page tables causing deadlock
>>> have been answered. Granted, there is still the issue of how to handle
>>> vmsplice and a bunch of other places to fix, not least skbuff handling.
>>>
>>> I also have patches to fix cifs in a separate branch that I would also like to
>>> push in this merge window - and that requires the first two patches from this
>>> series also, so would it be possible for you to merge at least those two
>>> rather than manually applying them?
>>
>> I've pulled this into a separate branch, but based on the block branch,
>> for-6.3/iov-extract. It's added to for-next as well.
>
> This does cause about a 2.7% regression for me, using O_DIRECT on a raw
> block device. Looking at a perf diff, here's the top:
>
> +2.71% [kernel.vmlinux] [k] mod_node_page_state
> +2.22% [kernel.vmlinux] [k] iov_iter_extract_pages
>
> and these two are gone:
>
> 2.14% [kernel.vmlinux] [k] __iov_iter_get_pages_alloc
> 1.53% [kernel.vmlinux] [k] iov_iter_get_pages
>
> rest is mostly in the noise, but mod_node_page_state() sticks out like
> a sore thumb. They seem to be caused by the node stat accounting done
> in gup.c for FOLL_PIN.

Confirmed just disabling the node_stat bits in mm/gup.c and now the
performance is back to the same levels as before.

An almost 3% regression is a bit hard to swallow...

--
Jens Axboe



2023-01-30 22:03:12

by John Hubbard

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/30/23 13:57, Jens Axboe wrote:
>> This does cause about a 2.7% regression for me, using O_DIRECT on a raw
>> block device. Looking at a perf diff, here's the top:
>>
>> +2.71% [kernel.vmlinux] [k] mod_node_page_state
>> +2.22% [kernel.vmlinux] [k] iov_iter_extract_pages
>>
>> and these two are gone:
>>
>> 2.14% [kernel.vmlinux] [k] __iov_iter_get_pages_alloc
>> 1.53% [kernel.vmlinux] [k] iov_iter_get_pages
>>
>> rest is mostly in the noise, but mod_node_page_state() sticks out like
>> a sore thumb. They seem to be caused by the node stat accounting done
>> in gup.c for FOLL_PIN.
>
> Confirmed just disabling the node_stat bits in mm/gup.c and now the
> performance is back to the same levels as before.
>
> An almost 3% regression is a bit hard to swallow...

This is something that we say when adding pin_user_pages_fast(),
yes. I doubt that I can quickly find the email thread, but we
measured it and weren't immediately able to come up with a way
to make it faster.

At this point, it's a good time to consider if there is any
way to speed it up. But I wanted to confirm that you're absolutely
right: the measurement sounds about right, and that's also the
hotspot that we say, too.


thanks,
--
John Hubbard
NVIDIA

2023-01-30 22:11:51

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/30/23 3:02?PM, John Hubbard wrote:
> On 1/30/23 13:57, Jens Axboe wrote:
>>> This does cause about a 2.7% regression for me, using O_DIRECT on a raw
>>> block device. Looking at a perf diff, here's the top:
>>>
>>> +2.71% [kernel.vmlinux] [k] mod_node_page_state
>>> +2.22% [kernel.vmlinux] [k] iov_iter_extract_pages
>>>
>>> and these two are gone:
>>>
>>> 2.14% [kernel.vmlinux] [k] __iov_iter_get_pages_alloc
>>> 1.53% [kernel.vmlinux] [k] iov_iter_get_pages
>>>
>>> rest is mostly in the noise, but mod_node_page_state() sticks out like
>>> a sore thumb. They seem to be caused by the node stat accounting done
>>> in gup.c for FOLL_PIN.
>>
>> Confirmed just disabling the node_stat bits in mm/gup.c and now the
>> performance is back to the same levels as before.
>>
>> An almost 3% regression is a bit hard to swallow...
>
> This is something that we say when adding pin_user_pages_fast(),
> yes. I doubt that I can quickly find the email thread, but we
> measured it and weren't immediately able to come up with a way
> to make it faster.
>
> At this point, it's a good time to consider if there is any
> way to speed it up. But I wanted to confirm that you're absolutely
> right: the measurement sounds about right, and that's also the
> hotspot that we say, too.

From spending all of 5 minutes on this, it must be due to exceeding the
pcp stat_threashold, as we then end up doing two atomic_long_adds().
Looking at proc, looks like it's 108. And with this test, then we're
hitting that slow path ~80k/second. Uhm...

--
Jens Axboe


2023-01-30 22:13:36

by David Howells

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

John Hubbard <[email protected]> wrote:

> This is something that we say when adding pin_user_pages_fast(),
> yes. I doubt that I can quickly find the email thread, but we
> measured it and weren't immediately able to come up with a way
> to make it faster.

percpu counters maybe - add them up at the point of viewing?

David


2023-01-30 22:15:48

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/30/23 3:12 PM, David Howells wrote:
> John Hubbard <[email protected]> wrote:
>
>> This is something that we say when adding pin_user_pages_fast(),
>> yes. I doubt that I can quickly find the email thread, but we
>> measured it and weren't immediately able to come up with a way
>> to make it faster.
>
> percpu counters maybe - add them up at the point of viewing?

They are percpu, see my last email. But for every 108 changes (on
my system), they will do two atomic_long_adds(). So not very
useful for anything but low frequency modifications.

--
Jens Axboe



2023-01-31 08:33:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 30.01.23 23:15, Jens Axboe wrote:
> On 1/30/23 3:12 PM, David Howells wrote:
>> John Hubbard <[email protected]> wrote:
>>
>>> This is something that we say when adding pin_user_pages_fast(),
>>> yes. I doubt that I can quickly find the email thread, but we
>>> measured it and weren't immediately able to come up with a way
>>> to make it faster.
>>
>> percpu counters maybe - add them up at the point of viewing?
>
> They are percpu, see my last email. But for every 108 changes (on
> my system), they will do two atomic_long_adds(). So not very
> useful for anything but low frequency modifications.
>

Can we just treat the whole acquired/released accounting as a debug
mechanism to detect missing releases and do it only for debug kernels?


The pcpu counter is an s8, so we have to flush on a regular basis and
cannot really defer it any longer ... but I'm curious if it would be of
any help to only have a single PINNED counter that goes into both
directions (inc/dec on pin/release), to reduce the flushing.

Of course, once we pin/release more than ~108 pages in one go or we
switch CPUs frequently it won't be that much of a help ...

--
Thanks,

David / dhildenb


2023-01-31 12:29:03

by Jan Kara

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On Tue 31-01-23 09:32:27, David Hildenbrand wrote:
> On 30.01.23 23:15, Jens Axboe wrote:
> > On 1/30/23 3:12 PM, David Howells wrote:
> > > John Hubbard <[email protected]> wrote:
> > >
> > > > This is something that we say when adding pin_user_pages_fast(),
> > > > yes. I doubt that I can quickly find the email thread, but we
> > > > measured it and weren't immediately able to come up with a way
> > > > to make it faster.
> > >
> > > percpu counters maybe - add them up at the point of viewing?
> >
> > They are percpu, see my last email. But for every 108 changes (on
> > my system), they will do two atomic_long_adds(). So not very
> > useful for anything but low frequency modifications.
> >
>
> Can we just treat the whole acquired/released accounting as a debug
> mechanism to detect missing releases and do it only for debug kernels?

Yes, AFAIK it is just a debug mechanism for helping to find out issues with
page pinning conversions. So I think we can put this behind some debugging
ifdef. John?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-31 13:42:40

by David Howells

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

David Hildenbrand <[email protected]> wrote:

> >> percpu counters maybe - add them up at the point of viewing?
> > They are percpu, see my last email. But for every 108 changes (on
> > my system), they will do two atomic_long_adds(). So not very
> > useful for anything but low frequency modifications.
> >
>
> Can we just treat the whole acquired/released accounting as a debug mechanism
> to detect missing releases and do it only for debug kernels?
>
>
> The pcpu counter is an s8, so we have to flush on a regular basis and cannot
> really defer it any longer ... but I'm curious if it would be of any help to
> only have a single PINNED counter that goes into both directions (inc/dec on
> pin/release), to reduce the flushing.
>
> Of course, once we pin/release more than ~108 pages in one go or we switch
> CPUs frequently it won't be that much of a help ...

What are the stats actually used for? Is it just debugging, or do we actually
have users for them (control groups spring to mind)?

David


2023-01-31 13:49:38

by David Hildenbrand

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 31.01.23 14:41, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>>>> percpu counters maybe - add them up at the point of viewing?
>>> They are percpu, see my last email. But for every 108 changes (on
>>> my system), they will do two atomic_long_adds(). So not very
>>> useful for anything but low frequency modifications.
>>>
>>
>> Can we just treat the whole acquired/released accounting as a debug mechanism
>> to detect missing releases and do it only for debug kernels?
>>
>>
>> The pcpu counter is an s8, so we have to flush on a regular basis and cannot
>> really defer it any longer ... but I'm curious if it would be of any help to
>> only have a single PINNED counter that goes into both directions (inc/dec on
>> pin/release), to reduce the flushing.
>>
>> Of course, once we pin/release more than ~108 pages in one go or we switch
>> CPUs frequently it won't be that much of a help ...
>
> What are the stats actually used for? Is it just debugging, or do we actually
> have users for them (control groups spring to mind)?

As it's really just "how many pinning events" vs. "how many unpinning
events", I assume it's only for debugging.

For example, if you pin the same page twice it would not get accounted
as "a single page is pinned".

--
Thanks,

David / dhildenb


2023-01-31 14:50:54

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/31/23 6:48?AM, David Hildenbrand wrote:
> On 31.01.23 14:41, David Howells wrote:
>> David Hildenbrand <[email protected]> wrote:
>>
>>>>> percpu counters maybe - add them up at the point of viewing?
>>>> They are percpu, see my last email. But for every 108 changes (on
>>>> my system), they will do two atomic_long_adds(). So not very
>>>> useful for anything but low frequency modifications.
>>>>
>>>
>>> Can we just treat the whole acquired/released accounting as a debug mechanism
>>> to detect missing releases and do it only for debug kernels?
>>>
>>>
>>> The pcpu counter is an s8, so we have to flush on a regular basis and cannot
>>> really defer it any longer ... but I'm curious if it would be of any help to
>>> only have a single PINNED counter that goes into both directions (inc/dec on
>>> pin/release), to reduce the flushing.
>>>
>>> Of course, once we pin/release more than ~108 pages in one go or we switch
>>> CPUs frequently it won't be that much of a help ...
>>
>> What are the stats actually used for? Is it just debugging, or do we actually
>> have users for them (control groups spring to mind)?
>
> As it's really just "how many pinning events" vs. "how many unpinning
> events", I assume it's only for debugging.
>
> For example, if you pin the same page twice it would not get accounted
> as "a single page is pinned".

How about something like the below then? I can send it out as a real
patch, will run a sanity check on it first but would be surprised if
this doesn't fix it.


diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..41abb16286ec 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -168,7 +168,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
*/
smp_mb__after_atomic();

+#ifdef CONFIG_DEBUG_VM
node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
+#endif

return folio;
}
@@ -180,7 +182,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
if (flags & FOLL_PIN) {
+#ifdef CONFIG_DEBUG_VM
node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
+#endif
if (folio_test_large(folio))
atomic_sub(refs, folio_pincount_ptr(folio));
else
@@ -236,8 +240,9 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
} else {
folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
}
-
+#ifdef CONFIG_DEBUG_VM
node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+#endif
}

return 0;

--
Jens Axboe


2023-01-31 15:14:10

by David Hildenbrand

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 31.01.23 16:04, Jens Axboe wrote:
> On 1/31/23 8:02?AM, David Hildenbrand wrote:
>> On 31.01.23 15:50, Jens Axboe wrote:
>>> On 1/31/23 6:48?AM, David Hildenbrand wrote:
>>>> On 31.01.23 14:41, David Howells wrote:
>>>>> David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>>>>> percpu counters maybe - add them up at the point of viewing?
>>>>>>> They are percpu, see my last email. But for every 108 changes (on
>>>>>>> my system), they will do two atomic_long_adds(). So not very
>>>>>>> useful for anything but low frequency modifications.
>>>>>>>
>>>>>>
>>>>>> Can we just treat the whole acquired/released accounting as a debug mechanism
>>>>>> to detect missing releases and do it only for debug kernels?
>>>>>>
>>>>>>
>>>>>> The pcpu counter is an s8, so we have to flush on a regular basis and cannot
>>>>>> really defer it any longer ... but I'm curious if it would be of any help to
>>>>>> only have a single PINNED counter that goes into both directions (inc/dec on
>>>>>> pin/release), to reduce the flushing.
>>>>>>
>>>>>> Of course, once we pin/release more than ~108 pages in one go or we switch
>>>>>> CPUs frequently it won't be that much of a help ...
>>>>>
>>>>> What are the stats actually used for? Is it just debugging, or do we actually
>>>>> have users for them (control groups spring to mind)?
>>>>
>>>> As it's really just "how many pinning events" vs. "how many unpinning
>>>> events", I assume it's only for debugging.
>>>>
>>>> For example, if you pin the same page twice it would not get accounted
>>>> as "a single page is pinned".
>>>
>>> How about something like the below then? I can send it out as a real
>>> patch, will run a sanity check on it first but would be surprised if
>>> this doesn't fix it.
>>>
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index f45a3a5be53a..41abb16286ec 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -168,7 +168,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>> */
>>> smp_mb__after_atomic();
>>> +#ifdef CONFIG_DEBUG_VM
>>> node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
>>> +#endif
>>> return folio;
>>> }
>>> @@ -180,7 +182,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>> static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>>> {
>>> if (flags & FOLL_PIN) {
>>> +#ifdef CONFIG_DEBUG_VM
>>> node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>>> +#endif
>>> if (folio_test_large(folio))
>>> atomic_sub(refs, folio_pincount_ptr(folio));
>>> else
>>> @@ -236,8 +240,9 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>>> } else {
>>> folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
>>> }
>>> -
>>> +#ifdef CONFIG_DEBUG_VM
>>> node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
>>> +#endif
>>> }
>>> return 0;
>>>
>>
>> We might want to hide the counters completely by defining them only
>> with CONFIG_DEBUG_VM.
>
> Are all of them debug aids only? If so, yes we should just have
> node_stat_* under CONFIG_DEBUG_VM.
>

Rather only these 2. Smth like:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 815c7c2edf45..a526964b65ce 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -196,8 +196,10 @@ enum node_stat_item {
NR_WRITTEN, /* page writings since bootup */
NR_THROTTLED_WRITTEN, /* NR_WRITTEN while reclaim throttled */
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
+#ifdef CONFIG_DEBUG_VM
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
+#endif
NR_KERNEL_STACK_KB, /* measured in KiB */
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
NR_KERNEL_SCS_KB, /* measured in KiB */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..5cbd9a1924bf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1227,8 +1227,10 @@ const char * const vmstat_text[] = {
"nr_written",
"nr_throttled_written",
"nr_kernel_misc_reclaimable",
+#ifdef CONFIG_DEBUG_VM
"nr_foll_pin_acquired",
"nr_foll_pin_released",
+#endif
"nr_kernel_stack",
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
"nr_shadow_call_stack",

--
Thanks,

David / dhildenb


2023-01-31 15:17:37

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/31/23 8:10 AM, David Hildenbrand wrote:
> On 31.01.23 16:04, Jens Axboe wrote:
>> On 1/31/23 8:02?AM, David Hildenbrand wrote:
>>> On 31.01.23 15:50, Jens Axboe wrote:
>>>> On 1/31/23 6:48?AM, David Hildenbrand wrote:
>>>>> On 31.01.23 14:41, David Howells wrote:
>>>>>> David Hildenbrand <[email protected]> wrote:
>>>>>>
>>>>>>>>> percpu counters maybe - add them up at the point of viewing?
>>>>>>>> They are percpu, see my last email. But for every 108 changes (on
>>>>>>>> my system), they will do two atomic_long_adds(). So not very
>>>>>>>> useful for anything but low frequency modifications.
>>>>>>>>
>>>>>>>
>>>>>>> Can we just treat the whole acquired/released accounting as a debug mechanism
>>>>>>> to detect missing releases and do it only for debug kernels?
>>>>>>>
>>>>>>>
>>>>>>> The pcpu counter is an s8, so we have to flush on a regular basis and cannot
>>>>>>> really defer it any longer ... but I'm curious if it would be of any help to
>>>>>>> only have a single PINNED counter that goes into both directions (inc/dec on
>>>>>>> pin/release), to reduce the flushing.
>>>>>>>
>>>>>>> Of course, once we pin/release more than ~108 pages in one go or we switch
>>>>>>> CPUs frequently it won't be that much of a help ...
>>>>>>
>>>>>> What are the stats actually used for?  Is it just debugging, or do we actually
>>>>>> have users for them (control groups spring to mind)?
>>>>>
>>>>> As it's really just "how many pinning events" vs. "how many unpinning
>>>>> events", I assume it's only for debugging.
>>>>>
>>>>> For example, if you pin the same page twice it would not get accounted
>>>>> as "a single page is pinned".
>>>>
>>>> How about something like the below then? I can send it out as a real
>>>> patch, will run a sanity check on it first but would be surprised if
>>>> this doesn't fix it.
>>>>
>>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index f45a3a5be53a..41abb16286ec 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -168,7 +168,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>             */
>>>>            smp_mb__after_atomic();
>>>>    +#ifdef CONFIG_DEBUG_VM
>>>>            node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
>>>> +#endif
>>>>              return folio;
>>>>        }
>>>> @@ -180,7 +182,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>    static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>>>>    {
>>>>        if (flags & FOLL_PIN) {
>>>> +#ifdef CONFIG_DEBUG_VM
>>>>            node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>>>> +#endif
>>>>            if (folio_test_large(folio))
>>>>                atomic_sub(refs, folio_pincount_ptr(folio));
>>>>            else
>>>> @@ -236,8 +240,9 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>>>>            } else {
>>>>                folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
>>>>            }
>>>> -
>>>> +#ifdef CONFIG_DEBUG_VM
>>>>            node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
>>>> +#endif
>>>>        }
>>>>          return 0;
>>>>
>>>
>>> We might want to hide the counters completely by defining them only
>>> with CONFIG_DEBUG_VM.
>>
>> Are all of them debug aids only? If so, yes we should just have
>> node_stat_* under CONFIG_DEBUG_VM.
>>
>
> Rather only these 2. Smth like:

Ah gotcha, that makes more sense to me. Will update the patch and
send it out.

--
Jens Axboe



2023-01-31 15:20:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 31.01.23 15:50, Jens Axboe wrote:
> On 1/31/23 6:48?AM, David Hildenbrand wrote:
>> On 31.01.23 14:41, David Howells wrote:
>>> David Hildenbrand <[email protected]> wrote:
>>>
>>>>>> percpu counters maybe - add them up at the point of viewing?
>>>>> They are percpu, see my last email. But for every 108 changes (on
>>>>> my system), they will do two atomic_long_adds(). So not very
>>>>> useful for anything but low frequency modifications.
>>>>>
>>>>
>>>> Can we just treat the whole acquired/released accounting as a debug mechanism
>>>> to detect missing releases and do it only for debug kernels?
>>>>
>>>>
>>>> The pcpu counter is an s8, so we have to flush on a regular basis and cannot
>>>> really defer it any longer ... but I'm curious if it would be of any help to
>>>> only have a single PINNED counter that goes into both directions (inc/dec on
>>>> pin/release), to reduce the flushing.
>>>>
>>>> Of course, once we pin/release more than ~108 pages in one go or we switch
>>>> CPUs frequently it won't be that much of a help ...
>>>
>>> What are the stats actually used for? Is it just debugging, or do we actually
>>> have users for them (control groups spring to mind)?
>>
>> As it's really just "how many pinning events" vs. "how many unpinning
>> events", I assume it's only for debugging.
>>
>> For example, if you pin the same page twice it would not get accounted
>> as "a single page is pinned".
>
> How about something like the below then? I can send it out as a real
> patch, will run a sanity check on it first but would be surprised if
> this doesn't fix it.
>
>
> diff --git a/mm/gup.c b/mm/gup.c
> index f45a3a5be53a..41abb16286ec 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -168,7 +168,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> */
> smp_mb__after_atomic();
>
> +#ifdef CONFIG_DEBUG_VM
> node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
> +#endif
>
> return folio;
> }
> @@ -180,7 +182,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> {
> if (flags & FOLL_PIN) {
> +#ifdef CONFIG_DEBUG_VM
> node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
> +#endif
> if (folio_test_large(folio))
> atomic_sub(refs, folio_pincount_ptr(folio));
> else
> @@ -236,8 +240,9 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
> } else {
> folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
> }
> -
> +#ifdef CONFIG_DEBUG_VM
> node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
> +#endif
> }
>
> return 0;
>

We might want to hide the counters completely by defining them only with
CONFIG_DEBUG_VM.

--
Thanks,

David / dhildenb


2023-01-31 15:23:12

by Jens Axboe

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/31/23 8:02?AM, David Hildenbrand wrote:
> On 31.01.23 15:50, Jens Axboe wrote:
>> On 1/31/23 6:48?AM, David Hildenbrand wrote:
>>> On 31.01.23 14:41, David Howells wrote:
>>>> David Hildenbrand <[email protected]> wrote:
>>>>
>>>>>>> percpu counters maybe - add them up at the point of viewing?
>>>>>> They are percpu, see my last email. But for every 108 changes (on
>>>>>> my system), they will do two atomic_long_adds(). So not very
>>>>>> useful for anything but low frequency modifications.
>>>>>>
>>>>>
>>>>> Can we just treat the whole acquired/released accounting as a debug mechanism
>>>>> to detect missing releases and do it only for debug kernels?
>>>>>
>>>>>
>>>>> The pcpu counter is an s8, so we have to flush on a regular basis and cannot
>>>>> really defer it any longer ... but I'm curious if it would be of any help to
>>>>> only have a single PINNED counter that goes into both directions (inc/dec on
>>>>> pin/release), to reduce the flushing.
>>>>>
>>>>> Of course, once we pin/release more than ~108 pages in one go or we switch
>>>>> CPUs frequently it won't be that much of a help ...
>>>>
>>>> What are the stats actually used for? Is it just debugging, or do we actually
>>>> have users for them (control groups spring to mind)?
>>>
>>> As it's really just "how many pinning events" vs. "how many unpinning
>>> events", I assume it's only for debugging.
>>>
>>> For example, if you pin the same page twice it would not get accounted
>>> as "a single page is pinned".
>>
>> How about something like the below then? I can send it out as a real
>> patch, will run a sanity check on it first but would be surprised if
>> this doesn't fix it.
>>
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index f45a3a5be53a..41abb16286ec 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -168,7 +168,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>> */
>> smp_mb__after_atomic();
>> +#ifdef CONFIG_DEBUG_VM
>> node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
>> +#endif
>> return folio;
>> }
>> @@ -180,7 +182,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>> static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>> {
>> if (flags & FOLL_PIN) {
>> +#ifdef CONFIG_DEBUG_VM
>> node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>> +#endif
>> if (folio_test_large(folio))
>> atomic_sub(refs, folio_pincount_ptr(folio));
>> else
>> @@ -236,8 +240,9 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>> } else {
>> folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
>> }
>> -
>> +#ifdef CONFIG_DEBUG_VM
>> node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
>> +#endif
>> }
>> return 0;
>>
>
> We might want to hide the counters completely by defining them only
> with CONFIG_DEBUG_VM.

Are all of them debug aids only? If so, yes we should just have
node_stat_* under CONFIG_DEBUG_VM.

--
Jens Axboe


2023-01-31 17:55:00

by John Hubbard

[permalink] [raw]
Subject: Re: [GIT PULL] iov_iter: Improve page extraction (pin or just list)

On 1/31/23 04:28, Jan Kara wrote:
> On Tue 31-01-23 09:32:27, David Hildenbrand wrote:
>> On 30.01.23 23:15, Jens Axboe wrote:
>>> On 1/30/23 3:12 PM, David Howells wrote:
>>>> John Hubbard <[email protected]> wrote:
>>>>
>>>>> This is something that we say when adding pin_user_pages_fast(),
>>>>> yes. I doubt that I can quickly find the email thread, but we
>>>>> measured it and weren't immediately able to come up with a way
>>>>> to make it faster.
>>>>
>>>> percpu counters maybe - add them up at the point of viewing?
>>>
>>> They are percpu, see my last email. But for every 108 changes (on
>>> my system), they will do two atomic_long_adds(). So not very
>>> useful for anything but low frequency modifications.
>>>
>>
>> Can we just treat the whole acquired/released accounting as a debug
>> mechanism to detect missing releases and do it only for debug kernels?
>
> Yes, AFAIK it is just a debug mechanism for helping to find out issues with
> page pinning conversions. So I think we can put this behind some debugging
> ifdef. John?
>

Yes, just for debugging. I wrote a little note just now in response to
the patch about how we ended up here: "yes, it's time to hide these
behind an ifdef".

thanks,
--
John Hubbard
NVIDIA