This limit has not been updated since 2008, when it was increased to 64
KiB at the request of GnuPG. Until recently, the main use-cases for this
feature were (1) preventing sensitive memory from being swapped, as in
GnuPG's use-case; and (2) real-time use-cases. In the first case, little
memory is called for, and in the second case, the user is generally in a
position to increase it if they need more.
The introduction of IOURING_REGISTER_BUFFERS adds a third use-case:
preparing fixed buffers for high-performance I/O. This use-case will
take as much of this memory as it can get, but is still limited to 64
KiB by default, which is very little. This increases the limit to 8 MB,
which was chosen fairly arbitrarily as a more generous, but still
conservative, default value.
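(For illustration only, not part of the patch: a minimal liburing sketch
of this use-case. The buffer size and error handling are arbitrary.
io_uring_register_buffers() is what pins the memory and charges it to
RLIMIT_MEMLOCK, so under the old 64 KiB default even one modest buffer
fails to register. Build with -luring.)

#include <liburing.h>
#include <sys/uio.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        struct io_uring ring;
        struct iovec iov;
        int ret;

        /* One 1 MiB fixed buffer: far more than the old 64 KiB default */
        iov.iov_len = 1024 * 1024;
        iov.iov_base = malloc(iov.iov_len);
        if (!iov.iov_base || io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

        /* Pins the pages long-term and charges them to RLIMIT_MEMLOCK */
        ret = io_uring_register_buffers(&ring, &iov, 1);
        if (ret < 0)
                fprintf(stderr, "register_buffers failed: %d\n", ret);

        io_uring_queue_exit(&ring);
        return ret < 0;
}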
---
It is also possible to raise this limit in userspace. This is easily
done, for example, in the use-case of a network daemon: systemd, for
instance, provides for this via LimitMEMLOCK in the service file; OpenRC
via the rc_ulimit variables. However, there is no established userspace
facility for configuring this outside of daemons: end-user applications
do not presently have access to a convenient means of raising their
limits.
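(For reference, the kind of per-service configuration meant here; the
service names are made up and the values purely illustrative. systemd
takes the usual size suffixes, OpenRC passes plain ulimit(1) arguments.)

# /etc/systemd/system/mydaemon.service.d/override.conf
[Service]
LimitMEMLOCK=64M

# /etc/conf.d/mydaemon (OpenRC); ulimit -l takes KiB
rc_ulimit="-l 65536"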
The buck, as it were, stops with the kernel. It's much easier to address
it here than it is to bring it to hundreds of distributions, and it can
only realistically be relied upon to be high-enough by end-user software
if it is more-or-less ubiquitous. Most distros don't change this
particular rlimit from the kernel-supplied default value, so a change
here will easily provide that ubiquity.
include/uapi/linux/resource.h | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/resource.h b/include/uapi/linux/resource.h
index 74ef57b38f9f..c858c3c85fae 100644
--- a/include/uapi/linux/resource.h
+++ b/include/uapi/linux/resource.h
@@ -66,10 +66,17 @@ struct rlimit64 {
#define _STK_LIM (8*1024*1024)
/*
- * GPG2 wants 64kB of mlocked memory, to make sure pass phrases
- * and other sensitive information are never written to disk.
+ * Limit the amount of locked memory by some sane default:
+ * root can always increase this limit if needed.
+ *
+ * The main use-cases are (1) preventing sensitive memory
+ * from being swapped; (2) real-time operations; (3) via
+ * IOURING_REGISTER_BUFFERS.
+ *
+ * The first two don't need much. The latter will take as
+ * much as it can get. 8MB is a reasonably sane default.
*/
-#define MLOCK_LIMIT ((PAGE_SIZE > 64*1024) ? PAGE_SIZE : 64*1024)
+#define MLOCK_LIMIT ((PAGE_SIZE > 8*1024*1024) ? PAGE_SIZE : 8*1024*1024)
/*
* Due to binary compatibility, the actual resource numbers
--
2.33.1
On 10/28/21 2:08 AM, Drew DeVault wrote:
> [...]
Agree with raising this limit, it is ridiculously low and we often get
reports from people that can't even do basic rings with it. Particularly
when bpf is involved as well, as it also dips into this pool.
On the production side at facebook, we do raise this limit as well.
Acked-by: Jens Axboe <[email protected]>
--
Jens Axboe
Hi!
> > [...]
>
> Agree with raising this limit, it is ridiculously low and we often get
> reports from people that can't even do basic rings with it. Particularly
> when bpf is involved as well, as it also dips into this pool.
>
> On the production side at facebook, we do raise this limit as well.
We are raising this limit to 2MB for LTP testcases as well, otherwise we
get failures when we run a few bpf tests in quick succession.
Acked-by: Cyril Hrubis <[email protected]>
--
Cyril Hrubis
[email protected]
On 11/4/21 8:27 AM, Cyril Hrubis wrote:
> Hi!
>>> [...]
>>
>> Agree with raising this limit, it is ridiculously low and we often get
>> reports from people that can't even do basic rings with it. Particularly
>> when bpf is involved as well, as it also dips into this pool.
>>
>> On the production side at facebook, we do raise this limit as well.
>
> We are raising this limit to 2MB for LTP testcases as well, otherwise we
> get failures when we run a few bpf tests in quick succession.
>
> Acked-by: Cyril Hrubis <[email protected]>
Andrew, care to pick this one up for 5.16?
--
Jens Axboe
On Thu, Oct 28, 2021 at 3:08 PM Drew DeVault wrote:
>
> [...]
>
> include/uapi/linux/resource.h | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
Forgot to add Signed-off-by tag from the author?
--
Ammar Faizi
Should I send a v2 or is this email sufficient:
Signed-off-by: Drew DeVault <[email protected]>
On 11/6/21 2:05 PM, Drew DeVault wrote:
> Should I send a v2 or is this email sufficient:
>
> Signed-off-by: Drew DeVault <[email protected]>
Oops, I missed akpm from the CC list. Added Andrew.
Cc: Andrew Morton <[email protected]>
Ref: https://lore.kernel.org/io-uring/CFII8LNSW5XH.3OTIVFYX8P65Y@taiga/
--
Ammar Faizi
On Sat, 6 Nov 2021 14:12:45 +0700 Ammar Faizi <[email protected]> wrote:
> On 11/6/21 2:05 PM, Drew DeVault wrote:
> > Should I send a v2 or is this email sufficient:
> >
> > Signed-off-by: Drew DeVault <[email protected]>
>
> Oops, I missed akpm from the CC list. Added Andrew.
>
> Cc: Andrew Morton <[email protected]>
> Ref: https://lore.kernel.org/io-uring/CFII8LNSW5XH.3OTIVFYX8P65Y@taiga/
Let's cc linux-mm as well.
Unfortunately I didn't know about this until Nov 4, which was formally
too late for 5.16. I guess I could try to sneak it past Linus if
someone were to send me some sufficiently convincing words explaining
the urgency.
I'd also be interested in seeing feedback from the MM developers.
And a question: rather than messing around with a constant which will
need to be increased again in a couple of years, can we solve this once
and for all? For example, permit root to set the system-wide
per-process max mlock size and depend upon initscripts to do this
appropriately.
From: Drew DeVault <[email protected]>
Subject: Increase default MLOCK_LIMIT to 8 MiB
This limit has not been updated since 2008, when it was increased to 64
KiB at the request of GnuPG. Until recently, the main use-cases for this
feature were (1) preventing sensitive memory from being swapped, as in
GnuPG's use-case; and (2) real-time use-cases. In the first case, little
memory is called for, and in the second case, the user is generally in a
position to increase it if they need more.
The introduction of IOURING_REGISTER_BUFFERS adds a third use-case:
preparing fixed buffers for high-performance I/O. This use-case will take
as much of this memory as it can get, but is still limited to 64 KiB by
default, which is very little. This increases the limit to 8 MB, which
was chosen fairly arbitrarily as a more generous, but still conservative,
default value.
It is also possible to raise this limit in userspace. This is easily
done, for example, in the use-case of a network daemon: systemd, for
instance, provides for this via LimitMEMLOCK in the service file; OpenRC
via the rc_ulimit variables. However, there is no established userspace
facility for configuring this outside of daemons: end-user applications do
not presently have access to a convenient means of raising their limits.
The buck, as it were, stops with the kernel. It's much easier to address
it here than it is to bring it to hundreds of distributions, and it can
only realistically be relied upon to be high-enough by end-user software
if it is more-or-less ubiquitous. Most distros don't change this
particular rlimit from the kernel-supplied default value, so a change here
will easily provide that ubiquity.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Drew DeVault <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Acked-by: Cyril Hrubis <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/uapi/linux/resource.h | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
--- a/include/uapi/linux/resource.h~increase-default-mlock_limit-to-8-mib
+++ a/include/uapi/linux/resource.h
@@ -66,10 +66,17 @@ struct rlimit64 {
#define _STK_LIM (8*1024*1024)
/*
- * GPG2 wants 64kB of mlocked memory, to make sure pass phrases
- * and other sensitive information are never written to disk.
+ * Limit the amount of locked memory by some sane default:
+ * root can always increase this limit if needed.
+ *
+ * The main use-cases are (1) preventing sensitive memory
+ * from being swapped; (2) real-time operations; (3) via
+ * IOURING_REGISTER_BUFFERS.
+ *
+ * The first two don't need much. The latter will take as
+ * much as it can get. 8MB is a reasonably sane default.
*/
-#define MLOCK_LIMIT ((PAGE_SIZE > 64*1024) ? PAGE_SIZE : 64*1024)
+#define MLOCK_LIMIT ((PAGE_SIZE > 8*1024*1024) ? PAGE_SIZE : 8*1024*1024)
/*
* Due to binary compatibility, the actual resource numbers
_
On Tue Nov 16, 2021 at 5:35 AM CET, Andrew Morton wrote:
> Unfortunately I didn't know about this until Nov 4, which was formally
> too late for 5.16. I guess I could try to sneak it past Linus if
> someone were to send me some sufficiently convincing words explaining
> the urgency.
I don't think it's that urgent, but I also wouldn't protest if someone
wants to usher it in sooner rather than later.
> And a question: rather than messing around with a constant which will
> need to be increased again in a couple of years, can we solve this once
> and for all? For example, permit root to set the system-wide
> per-process max mlock size and depend upon initscripts to do this
> appropriately.
Not sure I understand. Root and init scripts can already manage this
number - the goal of this patch is just to provide a saner default.
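(For completeness, the usual knobs, with purely illustrative values: a
pam_limits entry for login sessions, or a one-off prlimit(1) call for a
running process whose PID here is made up.)

# /etc/security/limits.conf -- memlock is in KiB
*       soft    memlock 8192
*       hard    memlock 8192

# or, for an already-running process:
$ prlimit --memlock=8388608:8388608 --pid 1234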
On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote:
> I'd also be interested in seeing feedback from the MM developers.
[...]
> Subject: Increase default MLOCK_LIMIT to 8 MiB
On the one hand, processes can already allocate at least this much
memory that is non-swappable, just by doing things like opening a lot of
files (allocating struct file & fdtable), using a lot of address space
(allocating page tables), so I don't have a problem with it per se.
On the other hand, 64kB is available on anything larger than an IBM XT.
Linux will still boot on machines with 4MB of RAM (eg routers). For
someone with a machine with only, say, 32MB of memory, this allows a
process to make a quarter of the memory unswappable, and maybe that's
not a good idea. So perhaps this should scale over a certain range?
Is 8MB a generally useful amount of memory for an iouring user anyway?
If you're just playing with it, sure, but if you have, oh i don't know,
a database, don't you want to pin the entire cache and allow IO to the
whole thing?
On Tue Nov 16, 2021 at 7:36 PM CET, Matthew Wilcox wrote:
> On the one hand, processes can already allocate at least this much
> memory that is non-swappable, just by doing things like opening a lot of
> files (allocating struct file & fdtable), using a lot of address space
> (allocating page tables), so I don't have a problem with it per se.
>
> On the other hand, 64kB is available on anything larger than an IBM XT.
> Linux will still boot on machines with 4MB of RAM (eg routers). For
> someone with a machine with only, say, 32MB of memory, this allows a
> process to make a quarter of the memory unswappable, and maybe that's
> not a good idea. So perhaps this should scale over a certain range?
I feel like most of the uber-small machines which are still relevant are
not running arbitrary user code, so, something about an airtight hatch
goes here. On the other hand, consider your other hand: you can probably
find a way to allocate this much stuff anyway.
> Is 8MB a generally useful amount of memory for an iouring user anyway?
> If you're just playing with it, sure, but if you have, oh i don't know,
> a database, don't you want to pin the entire cache and allow IO to the
> whole thing?
If you're a database, you're probably running as a daemon with some
integration with the service manager, most of which have provisions for
tuning the ulimits as necessary.
The purpose of this change is to provide an amount which is more useful
for end-user programs, which usually cannot adjust their ulimits by any
similarly convenient means. 8 MiB is not a lot, but it is enough to
allocate a modest handful of read/write buffers for a video game, mail
client, or something else along those lines, perhaps focused
specifically on the few I/O paths which demand the most performance.
We could certainly go higher and find an even more useful (but still
realistic) value, but I felt it best to err on the side of a more
conservative improvement. Honestly, this number could go as high as we
want it to and applications would happily take it.
On 11/16/21 11:36 AM, Matthew Wilcox wrote:
> On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote:
>> I'd also be interested in seeing feedback from the MM developers.
> [...]
>> Subject: Increase default MLOCK_LIMIT to 8 MiB
>
> On the one hand, processes can already allocate at least this much
> memory that is non-swappable, just by doing things like opening a lot of
> files (allocating struct file & fdtable), using a lot of address space
> (allocating page tables), so I don't have a problem with it per se.
>
> On the other hand, 64kB is available on anything larger than an IBM XT.
> Linux will still boot on machines with 4MB of RAM (eg routers). For
> someone with a machine with only, say, 32MB of memory, this allows a
> process to make a quarter of the memory unswappable, and maybe that's
> not a good idea. So perhaps this should scale over a certain range?
>
> Is 8MB a generally useful amount of memory for an iouring user anyway?
> If you're just playing with it, sure, but if you have, oh i don't know,
> a database, don't you want to pin the entire cache and allow IO to the
> whole thing?
8MB is plenty for most casual use cases, which are exactly the ones that
we want to "just work" without requiring weird system level
modifications to increase the memlock limit.
For db etc server setups, you're going to be mucking with the setup
anyway, and then I don't see it as a big problem that you'll need to
increase it further. Because yes, that won't fit within 8MB if you start
doing registered buffers.
--
Jens Axboe
On Tue Nov 16, 2021 at 8:21 PM CET, Vito Caputo wrote:
> Considering a single fullscreen 32bpp 4K-resolution framebuffer is
> ~32MiB, I'm not convinced this is really correct in nearly 2022.
Can you name a practical use-case where you'll be doing I/O with
uncompressed 4K framebuffers? The kind of I/O which is supported by
io_uring, to be specific, not, say, handing it off to libdrm.
On Tue, Nov 16, 2021 at 11:55:41AM -0700, Jens Axboe wrote:
> On 11/16/21 11:36 AM, Matthew Wilcox wrote:
> > [...]
>
> 8MB is plenty for most casual use cases, which are exactly the ones that
> we want to "just work" without requiring weird system level
> modifications to increase the memlock limit.
>
Considering a single fullscreen 32bpp 4K-resolution framebuffer is
~32MiB, I'm not convinced this is really correct in nearly 2022.
If we're going to bump the default at the kernel, I'm with Matthew on
making it autoscale within a sane range, depending on available
memory.
As an upper bound I'd probably look at the highest anticipated
consumer resolutions, and handle a couple fullscreen 32bpp instances
being pinned.
Regards,
Vito Caputo
On 11/16/21 12:21 PM, Vito Caputo wrote:
> On Tue, Nov 16, 2021 at 11:55:41AM -0700, Jens Axboe wrote:
>> On 11/16/21 11:36 AM, Matthew Wilcox wrote:
>>> [...]
>>
>> 8MB is plenty for most casual use cases, which are exactly the ones that
>> we want to "just work" without requiring weird system level
>> modifications to increase the memlock limit.
>>
>
> Considering a single fullscreen 32bpp 4K-resolution framebuffer is
> ~32MiB, I'm not convinced this is really correct in nearly 2022.
You don't need to register any buffers, and I don't expect any basic
use cases to do so. Which means that the 8MB just needs to cover the
ring itself, and you can fit a _lot_ of rings into 8MB. The memlock
limit only applies to buffers if you register them, not for any "normal"
use cases where you just pass buffers for read/write or O_DIRECT
read/write.
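To illustrate the distinction with a minimal sketch (the file path is
arbitrary; build with -luring): a plain, non-registered read like the one
below passes an ordinary buffer, and nothing in it is pinned long-term or
charged the way io_uring_register_buffers() is.

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];
        int fd = open("/etc/os-release", O_RDONLY);

        if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

        sqe = io_uring_get_sqe(&ring);
        /* Ordinary buffer: no registration, no long-term pin */
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                printf("read %d bytes\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
}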
> If we're going to bump the default at the kernel, I'm with Matthew on
> making it autoscale within a sane range, depending on available
> memory.
I just don't want to turn this into a bikeshedding conversation. I'm
fine with making it autoscale obviously, but who's going to do the work?
> As an upper bound I'd probably look at the highest anticipated
> consumer resolutions, and handle a couple fullscreen 32bpp instances
> being pinned.
Not sure I see the relevance here.
--
Jens Axboe
On Tue, Nov 16, 2021 at 08:25:33PM +0100, Drew DeVault wrote:
> On Tue Nov 16, 2021 at 8:21 PM CET, Vito Caputo wrote:
> > Considering a single fullscreen 32bpp 4K-resolution framebuffer is
> > ~32MiB, I'm not convinced this is really correct in nearly 2022.
>
> Can you name a practical use-case where you'll be doing I/O with
> uncompressed 4K framebuffers? The kind of I/O which is supported by
> io_uring, to be specific, not, say, handing it off to libdrm.
Obviously video/image editing software tends to operate on raw frames,
and sometimes even persists them via filesystems.
I haven't given it a lot of thought, but a framebuffer is a commonly
used unit of memory allocation in the CPU-side code I've written over
the years. Being able to pin those for something like io_uring
(or some other DMA-possible interface) seems like an obvious
memory-hungry thing to consider here if we're talking default upper
bounds.
Regards,
Vito Caputo
On Tue, 16 Nov 2021 07:32:33 +0100 "Drew DeVault" <[email protected]> wrote:
> > And a question: rather than messing around with a constant which will
> > need to be increased again in a couple of years, can we solve this once
> > and for all? For example, permit root to set the system-wide
> > per-process max mlock size and depend upon initscripts to do this
> > appropriately.
>
> Not sure I understand. Root and init scripts can already manage this
> number - the goal of this patch is just to provide a saner default.
Well, why change the default? Surely anyone who cares is altering it
at runtime anyway. And if they are not, we should encourage them to do
so?
On Tue Nov 16, 2021 at 8:47 PM CET, Andrew Morton wrote:
> Well, why change the default? Surely anyone who cares is altering it
> at runtime anyway. And if they are not, we should encourage them to do
> so?
I addressed this question in the original patch's commit message.
On Tue, 16 Nov 2021 20:48:48 +0100 "Drew DeVault" <[email protected]> wrote:
> On Tue Nov 16, 2021 at 8:47 PM CET, Andrew Morton wrote:
> > Well, why change the default? Surely anyone who cares is altering it
> > at runtime anyway. And if they are not, we should encourage them to do
> > so?
>
> I addressed this question in the original patch's commit message.
Kinda.
We're never going to get this right, are we? The only person who can
decide on a system's appropriate setting is the operator of that
system. Haphazardly increasing the limit every few years mainly
reduces incentive for people to get this right.
And people who test their software on 5.17 kernels will later find that
it doesn't work on 5.16 and earlier, so they still need to tell their
users to configure their systems appropriately. Until 5.16 is
obsolete, by which time we're looking at increasing the default again.
I don't see how this change gets us closer to the desired state:
getting distros and their users to configure their systems
appropriately.
On Tue Nov 16, 2021 at 10:37 PM CET, Andrew Morton wrote:
> We're never going to get this right, are we? The only person who can
> decide on a system's appropriate setting is the operator of that
> system. Haphazardly increasing the limit every few years mainly
> reduces incentive for people to get this right.
>
> And people who test their software on 5.17 kernels will later find that
> it doesn't work on 5.16 and earlier, so they still need to tell their
> users to configure their systems appropriately. Until 5.16 is
> obsolete, by which time we're looking at increasing the default again.
>
> I don't see how this change gets us closer to the desired state:
> getting distros and their users to configure their systems
> appropriately.
Perfect is the enemy of good. This is a very simple change we can make
to improve the status quo, and I think that's worth doing. I do not have
time to develop a more sophisticated solution which steers the defaults
based on memory present, and I definitely don't have the time to
petition every distro to configure a better default for their particular
needs. This is the easiest way to get broad adoption for a better
default.
On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote:
> [...]
>
> I'd also be interested in seeing feedback from the MM developers.
>
> And a question: rather than messing around with a constant which will
> need to be increased again in a couple of years, can we solve this once
> and for all? For example, permit root to set the system-wide
> per-process max mlock size and depend upon initscripts to do this
> appropriately.
My take is that as long as the kernel sets some limit per default on
this at all, it should be one that works for common workloads. Today
this isn't the case.
We've recently switched our initscripts at FB to set the default to
0.1% of total RAM. The impetus for this was a subtle but widespread
issue where we failed to mmap the PERF_COUNT_SW_TASK_CLOCK event
counter (perf event mmap also uses RLIMIT_MEMLOCK!) and silently fell
back to the much less efficient clock_gettime() syscall.
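(A hedged sketch of the kind of mapping involved; the event choice and
single-page size are illustrative, and the failure in practice comes from
the per-user budget being consumed across many such mappings, not from
this one call.)

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        struct perf_event_attr attr;
        long page = sysconf(_SC_PAGESIZE);
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_SOFTWARE;
        attr.config = PERF_COUNT_SW_TASK_CLOCK;
        attr.exclude_kernel = 1;

        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        /* The mmap'ed pages are charged to a per-user budget
         * (perf_event_mlock_kb plus RLIMIT_MEMLOCK); once enough events
         * across the workload have used it up, this mmap fails and the
         * reader silently falls back to clock_gettime(). */
        if (mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0) == MAP_FAILED)
                perror("mmap");
        return 0;
}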
Because the failure mode was subtle and annoying, we didn't just want
to raise the limit, but raise it so that no reasonable application
would run into it, and only buggy or malicious ones would.
And IMO, that's really what rlimits should be doing: catching clearly
bogus requests, not trying to do fine-grained resource control. For
more reasonable overuse that ends up causing memory pressure, the OOM
killer will do the right thing since the pages still belong to tasks.
So 0.1% of the machine seemed like a good default formula for
that. And it would be a bit more future proof too.
On my 32G desktop machine, that would be 32M. For comparison, the
default process rlimit on that machine is ~120k, which comes out to
~2G worth of kernel stack, which also isn't reclaimable without OOM...
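(For the record, a userspace back-of-the-envelope of that formula; the
0.1% figure and the 64 KiB floor are the numbers from this thread, not
from any posted patch.)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        long pages = sysconf(_SC_PHYS_PAGES);
        unsigned long long ram = (unsigned long long)page * pages;
        unsigned long long limit = ram / 1000;  /* 0.1% of RAM */

        if (limit < 64ULL * 1024)               /* keep the old 64 KiB floor */
                limit = 64ULL * 1024;

        printf("RAM: %llu MiB -> suggested memlock default: %llu KiB\n",
               ram >> 20, limit >> 10);
        return 0;
}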
> From: Drew DeVault <[email protected]>
> Subject: Increase default MLOCK_LIMIT to 8 MiB
>
> [...]
>
> Link: https://lkml.kernel.org/r/[email protected]
> Signed-off-by: Drew DeVault <[email protected]>
> Acked-by: Jens Axboe <[email protected]>
> Acked-by: Cyril Hrubis <[email protected]>
> Cc: Pavel Begunkov <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
As per above, I think basing it off of RAM size would be better, but
this increase is overdue given all the new users beyond mlock(), and
8M is much better than the current value.
On 11/17/21 3:26 PM, Johannes Weiner wrote:
>> Link: https://lkml.kernel.org/r/[email protected]
>> Signed-off-by: Drew DeVault <[email protected]>
>> Acked-by: Jens Axboe <[email protected]>
>> Acked-by: Cyril Hrubis <[email protected]>
>> Cc: Pavel Begunkov <[email protected]>
>> Signed-off-by: Andrew Morton <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>
>
> As per above, I think basing it off of RAM size would be better, but
> this increase is overdue given all the new users beyond mlock(), and
> 8M is much better than the current value.
That's basically my reasoning too. Let's just get something going that
will at least unblock some valid use cases, and not get bogged down with
aiming for perfection. The latter can happen in parallel, but it should
not hold it up imho.
--
Jens Axboe
On Wed, 17 Nov 2021 16:17:26 -0700 Jens Axboe <[email protected]> wrote:
> On 11/17/21 3:26 PM, Johannes Weiner wrote:
> >> Link: https://lkml.kernel.org/r/[email protected]
> >> Signed-off-by: Drew DeVault <[email protected]>
> >> Acked-by: Jens Axboe <[email protected]>
> >> Acked-by: Cyril Hrubis <[email protected]>
> >> Cc: Pavel Begunkov <[email protected]>
> >> Signed-off-by: Andrew Morton <[email protected]>
> >
> > Acked-by: Johannes Weiner <[email protected]>
> >
> > As per above, I think basing it off of RAM size would be better, but
> > this increase is overdue given all the new users beyond mlock(), and
> > 8M is much better than the current value.
>
> That's basically my reasoning too. Let's just get something going that
> will at least unblock some valid use cases, and not get bogged down with
> aiming for perfection. The latter can happen in parallel, but it should
> not hold it up imho.
Nobody's aiming for perfection. We're discussing aiming for "better".
What we should have done on day one was to set the default MLOCK_LIMIT
to zero bytes. Then everyone would have infrastructure to tune it from
userspace and we wouldn't ever have this discussion.
On Thu Nov 18, 2021 at 10:58 PM CET, Andrew Morton wrote:
> Nobody's aiming for perfection. We're discussing aiming for "better".
>
> What we should have done on day one was to set the default MLOCK_LIMIT
> to zero bytes. Then everyone would have infrastructure to tune it from
> userspace and we wouldn't ever have this discussion.
Setting aside perfection or not, what you're aiming for is about 1000×
more work. I'm not prepared to do that work. I'm not going to paint this
same bikeshed 100 times for each Linux distro we have to convince to
adopt a more sophisticated solution.
On 16.11.21 22:37, Andrew Morton wrote:
> On Tue, 16 Nov 2021 20:48:48 +0100 "Drew DeVault" <[email protected]> wrote:
>
>> On Tue Nov 16, 2021 at 8:47 PM CET, Andrew Morton wrote:
>>> Well, why change the default? Surely anyone who cares is altering it
>>> at runtime anyway. And if they are not, we should encourage them to do
>>> so?
>>
>> I addressed this question in the original patch's commit message.
>
> Kinda.
>
> We're never going to get this right, are we? The only person who can
> decide on a system's appropriate setting is the operator of that
> system. Haphazardly increasing the limit every few years mainly
> reduces incentive for people to get this right.
>
> And people who test their software on 5.17 kernels will later find that
> it doesn't work on 5.16 and earlier, so they still need to tell their
> users to configure their systems appropriately. Until 5.16 is
> obsolete, by which time we're looking at increasing the default again.
>
> I don't see how this change gets us closer to the desired state:
> getting distros and their users to configure their systems
> appropriately.
>
My 2 cents: while we should actually try to avoid new FOLL_LONGTERM
users where possible, we introduce more (IOURING_REGISTER_BUFFERS) to be
consumed by ordinary, unprivileged users. These new features, *when
used* require us to raise the MLOCK_LIMIT. Secretmem is similar, but for
now it rather "replaces" old mlock usage and IIRC has similarly small
memory demands; that might change in the future, though.
Why is FOLL_LONGTERM bad? Not only does it prevent swapping like mlock
does, it also leaves the pages unmovable, so they cannot be migrated,
for example, for memory compaction.
Well, I'm not too mad about IOURING_REGISTER_BUFFERS, it actually helped
me to write a simple reproducer for the COW issues we have in upstream
mm, and can be quite beneficial in some setups. Still, I think it should
be used with care depending on the actual environment.
So, just because a new feature is around that could be used, does it
mean that we should adjust our kernel default? I'd say in this case,
rather not. Distributions, or much better, the responsible admin, should
make such decisions, knowing the environment and the effect this could have.
(I know that we can similarly trigger allocation of a lot of unmovable
memory using other means by malicious user space; but that is rather
something to limit or handle in the future IMHO)
--
Thanks,
David / dhildenb
Forgive me for jumping in to an already overburdened thread. But can
someone pushing back on this clearly explain the issue with applying
this patch?
The only concerns I've heard are that it doesn't go far enough. That
another strategy (that everyone seems to agree would be a fair bit more
effort) could potentially achieve the same goal and then some. Isn't
that exactly what's meant by "don't let perfection be the enemy of the
good"? The saying is not talking about literal perfection -- the idea is
that you make progress where you can, and that incremental progress and
broader changes are not necessarily in conflict.
This tiny patch could be a step in the right direction. Why does this
thread need dozens of replies?
Thanks,
Andrew
--
We all do better when we all do better. -Paul Wellstone
On 22.11.21 18:55, Andrew Dona-Couch wrote:
> Forgive me for jumping in to an already overburdened thread. But can
> someone pushing back on this clearly explain the issue with applying
> this patch?
It will allow unprivileged users to easily and even "accidentally"
allocate more unmovable memory than it should in some environments. Such
limits exist for a reason. And there are ways for admins/distros to
tweak these limits if they know what they are doing.
>
> The only concerns I've heard are that it doesn't go far enough. That
> another strategy (that everyone seems to agree would be a fair bit more
> effort) could potentially achieve the same goal and then some. Isn't
> that exactly what's meant by "don't let perfection be the enemy of the
> good"? The saying is not talking about literal perfection -- the idea is
> that you make progress where you can, and that incremental progress and
> broader changes are not necessarily in conflict.
>
> This tiny patch could be a step in the right direction. Why does this
> thread need dozens of replies?
Because it does something controversial. Send controversial patches,
receive many opinions, it's that simple.
This is not a step into the right direction. This is all just trying to
hide the fact that we're exposing FOLL_LONGTERM usage to random
unprivileged users.
Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
memlock limit in io_uring altogether, for example, by using mmu
notifiers. But I'm no expert on the io_uring code.
--
Thanks,
David / dhildenb
On 11/22/21 11:26 AM, David Hildenbrand wrote:
> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>> Forgive me for jumping in to an already overburdened thread. But can
>> someone pushing back on this clearly explain the issue with applying
>> this patch?
>
> It will allow unprivileged users to easily and even "accidentally"
> allocate more unmovable memory than it should in some environments. Such
> limits exist for a reason. And there are ways for admins/distros to
> tweak these limits if they know what they are doing.
But that's entirely the point, the cases where this change is needed are
already screwed by a distro and the user is the administrator. This is
_exactly_ the case where things should just work out of the box. If
you're managing farms of servers, yeah you have competent administration
and you can be expected to tweak settings to get the best experience and
performance, but the kernel should provide a sane default. 64K isn't a
sane default.
> This is not a step into the right direction. This is all just trying to
> hide the fact that we're exposing FOLL_LONGTERM usage to random
> unprivileged users.
>
> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
> memlock limit in io_uring altogether, for example, by using mmu
> notifiers. But I'm no expert on the io_uring code.
You can't use mmu notifiers without impacting the fast path. This isn't
just about io_uring, there are other users of memlock right now (like
bpf) which just makes it even worse.
We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
like what was suggested, if that will help move things forward. IMHO the
32MB machine is mostly a theoretical case, but whatever .
--
Jens Axboe
On Mon, Nov 22, 2021 at 12:53:31PM -0700, Jens Axboe wrote:
> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
> like what was suggested, if that will help move things forward. IMHO the
> 32MB machine is mostly a theoretical case, but whatever .
I think you mean max(0.1% ram, 64KB). with that change, I agree.
On 11/22/21 1:03 PM, Matthew Wilcox wrote:
> On Mon, Nov 22, 2021 at 12:53:31PM -0700, Jens Axboe wrote:
>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>> like what was suggested, if that will help move things forward. IMHO the
>> 32MB machine is mostly a theoretical case, but whatever .
>
> I think you mean max(0.1% ram, 64KB). with that change, I agree.
Heh yes, that is indeed what I meant :-)
--
Jens Axboe
On 22.11.21 20:53, Jens Axboe wrote:
> On 11/22/21 11:26 AM, David Hildenbrand wrote:
>> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>>> Forgive me for jumping in to an already overburdened thread. But can
>>> someone pushing back on this clearly explain the issue with applying
>>> this patch?
>>
>> It will allow unprivileged users to easily and even "accidentally"
>> allocate more unmovable memory than it should in some environments. Such
>> limits exist for a reason. And there are ways for admins/distros to
>> tweak these limits if they know what they are doing.
>
> But that's entirely the point, the cases where this change is needed are
> already screwed by a distro and the user is the administrator. This is
> _exactly_ the case where things should just work out of the box. If
> you're managing farms of servers, yeah you have competent administration
> and you can be expected to tweak settings to get the best experience and
> performance, but the kernel should provide a sane default. 64K isn't a
> sane default.
0.1% of RAM isn't either.
>
>> This is not a step into the right direction. This is all just trying to
>> hide the fact that we're exposing FOLL_LONGTERM usage to random
>> unprivileged users.
>>
>> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
>> memlock limit in io_uring altogether, for example, by using mmu
>> notifiers. But I'm no expert on the io_uring code.
>
> You can't use mmu notifiers without impacting the fast path. This isn't
> just about io_uring, there are other users of memlock right now (like
> bpf) which just makes it even worse.
1) Do we have a performance evaluation? Did someone try and come up with
a conclusion how bad it would be?
2) Could be provide a mmu variant to ordinary users that's just good
enough but maybe not as fast as what we have today? And limit
FOLL_LONGTERM to special, privileged users?
3) Just because there are other memlock users is not an excuse. For
example, VFIO/VDPA have to use it for a reason, because there is no way
not do use FOLL_LONGTERM.
>
> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
> like what was suggested, if that will help move things forward. IMHO the
> 32MB machine is mostly a theoretical case, but whatever .
1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
where FOLL_LONGTERM cannot be used, as that memory is not available.
2) With 0.1% RAM it's sufficient to start 1000 processes to break any
system completely and deeply mess up the MM. Oh my.
No, I don't like this, absolutely not. I neither like raising the
memlock limit as default to such high values nor using FOLL_LONGTERM in
cases where it could be avoided for random, unprivileged users.
But I assume this is mostly for the records, because I assume nobody
cares about my opinion here.
--
Thanks,
David / dhildenb
On 11/22/21 1:08 PM, David Hildenbrand wrote:
> On 22.11.21 20:53, Jens Axboe wrote:
>> On 11/22/21 11:26 AM, David Hildenbrand wrote:
>>> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>>>> Forgive me for jumping in to an already overburdened thread. But can
>>>> someone pushing back on this clearly explain the issue with applying
>>>> this patch?
>>>
>>> It will allow unprivileged users to easily and even "accidentally"
>>> allocate more unmovable memory than it should in some environments. Such
>>> limits exist for a reason. And there are ways for admins/distros to
>>> tweak these limits if they know what they are doing.
>>
>> But that's entirely the point, the cases where this change is needed are
>> already screwed by a distro and the user is the administrator. This is
>> _exactly_ the case where things should just work out of the box. If
>> you're managing farms of servers, yeah you have competent administration
>> and you can be expected to tweak settings to get the best experience and
>> performance, but the kernel should provide a sane default. 64K isn't a
>> sane default.
>
> 0.1% of RAM isn't either.
No default is perfect, but 0.1% will solve 99% of the problem. And most
likely solve 100% of the problems for the important case, which is where
you want things to Just Work on your distro without doing any
administration. If you're aiming for perfection, it doesn't exist.
>>> This is not a step into the right direction. This is all just trying to
>>> hide the fact that we're exposing FOLL_LONGTERM usage to random
>>> unprivileged users.
>>>
>>> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
>>> memlock limit in io_uring altogether, for example, by using mmu
>>> notifiers. But I'm no expert on the io_uring code.
>>
>> You can't use mmu notifiers without impacting the fast path. This isn't
>> just about io_uring, there are other users of memlock right now (like
>> bpf) which just makes it even worse.
>
> 1) Do we have a performance evaluation? Did someone try and come up with
> a conclusion how bad it would be?
I honestly don't remember the details, I took a look at it about a year
ago due to some unrelated reasons. These days it just pertains to
registered buffers, so it's less of an issue than back then when it
dealt with the rings as well. Hence it might be feasible, I'm certainly not
against anyone looking into it. Easy enough to review and test for
performance concerns.
> 2) Could be provide a mmu variant to ordinary users that's just good
> enough but maybe not as fast as what we have today? And limit
> FOLL_LONGTERM to special, privileged users?
If it's not as fast, then it's most likely not good enough though...
> 3) Just because there are other memlock users is not an excuse. For
> example, VFIO/VDPA have to use it for a reason, because there is no way
> not do use FOLL_LONGTERM.
It's not an excuse, the statement merely means that the problem is
_worse_ as there are other memlock users.
>>
>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>> like what was suggested, if that will help move things forward. IMHO the
>> 32MB machine is mostly a theoretical case, but whatever .
>
> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
> where FOLL_LONGTERM cannot be used, as that memory is not available.
>
> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
> system completely and deeply mess up the MM. Oh my.
We're talking per-user limits here. But if you want to talk hyperbole,
then 64K multiplied by some other random number will also allow
everything to be pinned, potentially.
--
Jens Axboe
On 22.11.21 21:44, Jens Axboe wrote:
> On 11/22/21 1:08 PM, David Hildenbrand wrote:
>> On 22.11.21 20:53, Jens Axboe wrote:
>>> On 11/22/21 11:26 AM, David Hildenbrand wrote:
>>>> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>>>>> Forgive me for jumping in to an already overburdened thread. But can
>>>>> someone pushing back on this clearly explain the issue with applying
>>>>> this patch?
>>>>
>>>> It will allow unprivileged users to easily and even "accidentally"
>>>> allocate more unmovable memory than it should in some environments. Such
>>>> limits exist for a reason. And there are ways for admins/distros to
>>>> tweak these limits if they know what they are doing.
>>>
>>> But that's entirely the point, the cases where this change is needed are
>>> already screwed by a distro and the user is the administrator. This is
>>> _exactly_ the case where things should just work out of the box. If
>>> you're managing farms of servers, yeah you have competent administration
>>> and you can be expected to tweak settings to get the best experience and
>>> performance, but the kernel should provide a sane default. 64K isn't a
>>> sane default.
>>
>> 0.1% of RAM isn't either.
>
> No default is perfect, but 0.1% will solve 99% of the problem. And most
> likely solve 100% of the problems for the important case, which is where
> you want things to Just Work on your distro without doing any
> administration. If you're aiming for perfection, it doesn't exist.
... and my Fedora is already at 16 MiB *sigh*.
And I'm not aiming for perfection, I'm aiming for as little
FOLL_LONGTERM users as possible ;)
>
>>>> This is not a step into the right direction. This is all just trying to
>>>> hide the fact that we're exposing FOLL_LONGTERM usage to random
>>>> unprivileged users.
>>>>
>>>> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
>>>> memlock limit in io_uring altogether, for example, by using mmu
>>>> notifiers. But I'm no expert on the io_uring code.
>>>
>>> You can't use mmu notifiers without impacting the fast path. This isn't
>>> just about io_uring, there are other users of memlock right now (like
>>> bpf) which just makes it even worse.
>>
>> 1) Do we have a performance evaluation? Did someone try and come up with
>> a conclusion how bad it would be?
>
> I honestly don't remember the details, I took a look at it about a year
> ago due to some unrelated reasons. These days it just pertains to
> registered buffers, so it's less of an issue than back then when it
> dealt with the rings as well. Hence might be feasible, I'm certainly not
> against anyone looking into it. Easy enough to review and test for
> performance concerns.
That at least sounds promising.
>
>> 2) Could be provide a mmu variant to ordinary users that's just good
>> enough but maybe not as fast as what we have today? And limit
>> FOLL_LONGTERM to special, privileged users?
>
> If it's not as fast, then it's most likely not good enough though...
There is always a compromise of course.
See, FOLL_LONGTERM is *the worst* kind of memory allocation thingy you
could possibly do to your MM subsystem. It's absolutely the worst thing
you can do to swap and compaction.
I really don't want random feature X to be next and say "well, io_uring
uses it, so I can just use it for max performance and we'll adjust the
memlock limit, who cares!".
>
>> 3) Just because there are other memlock users is not an excuse. For
>> example, VFIO/VDPA have to use it for a reason, because there is no way
>> not do use FOLL_LONGTERM.
>
> It's not an excuse, the statement merely means that the problem is
> _worse_ as there are other memlock users.
Yes, and it will keep getting worse every time we introduce more
FOLL_LONGTERM users that really shouldn't be FOLL_LONGTERM users unless
really required. Again, VFIO/VDPA/RDMA are prime examples, because the
HW forces us to do it. And these are privileged features either way.
>
>>>
>>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>>> like what was suggested, if that will help move things forward. IMHO the
>>> 32MB machine is mostly a theoretical case, but whatever .
>>
>> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
>> where FOLL_LONGTERM cannot be used, as that memory is not available.
>>
>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>> system completely and deeply mess up the MM. Oh my.
>
> We're talking per-user limits here. But if you want to talk hyperbole,
> then 64K multiplied by some other random number will also allow
> everything to be pinned, potentially.
>
Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
the worst case.
--
Thanks,
David / dhildenb
>>>>
>>>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>>>> like what was suggested, if that will help move things forward. IMHO the
>>>> 32MB machine is mostly a theoretical case, but whatever .
>>>
>>> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
>>> where FOLL_LONGTERM cannot be used, as that memory is not available.
>>>
>>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>>> system completely and deeply mess up the MM. Oh my.
>>
>> We're talking per-user limits here. But if you want to talk hyperbole,
>> then 64K multiplied by some other random number will also allow
>> everything to be pinned, potentially.
>>
>
> Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
> the worst case.
>
To make it clear why I keep complaining about FOLL_LONGTERM for
unprivileged users even if we're talking about "only" 0.1% of RAM ...
On x86-64 a 2 MiB THP (IOW pageblock) has 512 sub-pages. If we manage to
FOLL_LONGTERM a single sub-page, we can make the THP unavailable to the
system, meaning we cannot form a THP by
compaction/swapping/migration/whatever at that physical memory area
until we unpin that single page. We essentially "block" a THP from
forming at that physical memory area.
So with a single 4k page we can block one 2 MiB THP. With 0.1% of RAM
pinned as individual 4k pages, each blocking a 512-page THP, we can
therefore block 0.1% * 512 = 51.2% of all THP. Theoretically, of course,
if the stars align.
... or if we're malicious or unlucky. I wrote a reproducer this morning
that tries blocking as many THP as it can:
https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c
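For readers who don't want to open the link, here is a rough sketch of
the basic idea (illustrative only -- the real io_uring_thp.c senses
PAGE_SIZE, THP size, RLIMIT_MEMLOCK and IORING_MAX_REG_BUFFERS at
runtime, makes sure the compound pages are actually split before
pinning, and reports progress as in the output below):

/*
 * Rough sketch only; assumes CONFIG_TRANSPARENT_HUGEPAGE and an
 * RLIMIT_MEMLOCK of at least NR_BUFS * 4 KiB.
 * Build with: gcc -O2 -o thp_block thp_block.c -luring
 */
#include <liburing.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SZ	(2UL * 1024 * 1024)
#define PG_SZ	4096UL
#define NR_BUFS	1024		/* number of THP areas we try to block */

int main(void)
{
	static struct iovec iov[NR_BUFS];
	struct io_uring ring;

	if (io_uring_queue_init(8, &ring, 0))
		return 1;

	for (unsigned i = 0; i < NR_BUFS; i++) {
		/* over-allocate so we can pick a 2 MiB aligned start */
		unsigned char *p = mmap(NULL, 2 * THP_SZ, PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		p = (unsigned char *)(((unsigned long)p + THP_SZ - 1) & ~(THP_SZ - 1));

		/* ask for a THP and fault the whole range in */
		madvise(p, THP_SZ, MADV_HUGEPAGE);
		memset(p, 0xab, THP_SZ);

		/* zap all but the first 4 KiB; the PMD is split and the
		 * compound page is queued for (deferred) splitting */
		madvise(p + PG_SZ, THP_SZ - PG_SZ, MADV_DONTNEED);

		iov[i].iov_base = p;
		iov[i].iov_len = PG_SZ;
	}

	/* FOLL_LONGTERM-pin one 4 KiB page per (former) THP area */
	if (io_uring_register_buffers(&ring, iov, NR_BUFS))
		return 1;

	pause();	/* keep the pins alive while we inspect the system */
	return 0;
}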
------------------------------------------------------------------------
Example on my 16 GiB (8192 THP "in theory") notebook with some
applications running in the background.
$ uname -a
Linux t480s 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021
x86_64 x86_64 x86_64 GNU/Linux
$ ./io_uring_thp
PAGE size: 4096 bytes (sensed)
THP size: 2097152 bytes (sensed)
RLIMIT_MEMLOCK: 16777216 bytes (sensed)
IORING_MAX_REG_BUFFERS: 16384 (guess)
Pages per THP: 512
User can block 4096 THP (8589934592 bytes)
Process can block 4096 THP (8589934592 bytes)
Blocking 1 THP
Blocking 2 THP
...
Blocking 3438 THP
Blocking 3439 THP
Blocking 3440 THP
Blocking 3441 THP
Blocking 3442 THP
... and after a while
Blocking 4093 THP
Blocking 4094 THP
Blocking 4095 THP
Blocking 4096 THP
$ cat /proc/`pgrep io_uring_thp`/status
Name: io_uring_thp
Umask: 0002
State: S (sleeping)
[...]
VmPeak: 6496 kB
VmSize: 6496 kB
VmLck: 0 kB
VmPin: 16384 kB
VmHWM: 3628 kB
VmRSS: 1580 kB
RssAnon: 160 kB
RssFile: 1420 kB
RssShmem: 0 kB
VmData: 4304 kB
VmStk: 136 kB
VmExe: 8 kB
VmLib: 1488 kB
VmPTE: 48 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
$ cat /proc/meminfo
MemTotal: 16250920 kB
MemFree: 11648016 kB
MemAvailable: 11972196 kB
Buffers: 50480 kB
Cached: 1156768 kB
SwapCached: 54680 kB
Active: 704788 kB
Inactive: 3477576 kB
Active(anon): 427716 kB
Inactive(anon): 3207604 kB
Active(file): 277072 kB
Inactive(file): 269972 kB
...
Mlocked: 5692 kB
SwapTotal: 8200188 kB
SwapFree: 7742716 kB
...
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
Let's see how many contiguous 2M pages we can still get as root:
$ echo 1 > /proc/sys/vm/compact_memory
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
0
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
537
... keep retrying a couple of times
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
583
Let's kill the io_uring process and try again:
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4766
... keep retrying a couple of times
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4823
------------------------------------------------------------------------
I'm going to leave judgment how bad this is or isn't to the educated
reader, and I'll stop spending time on this as I have more important
things to work on.
To summarize my humble opinion:
1) I am not against raising the default memlock limit if it's for a sane
use case. While mlock itself can be somewhat bad for swap, FOLL_LONGTERM,
which is also accounted against the memlock limit here, is the real
issue. This patch
explicitly states the "IOURING_REGISTER_BUFFERS" use case, though, and
that makes me nervous.
2) Exposing FOLL_LONGTERM to unprivileged users should be avoided as best
we can; in an ideal world, we wouldn't have it at all; in a sub-optimal
world we'd have it only for use cases that really require it due to HW
limitations. Ideally we'd even have yet another limit for this, because
mlock != FOLL_LONGTERM.
3) IOURING_REGISTER_BUFFERS shouldn't use FOLL_LONGTERM for use by
unprivileged users. We should provide a variant that doesn't rely on
FOLL_LONGTERM or even rely on the memlock limit.
Sorry to the patch author for bringing it up in response to the patch.
After all, this patch just does what some distros already do (many distros
even provide higher limits than 8 MiB!). I would be curious why some
distros already have such high values ... and if it's already because of
IOURING_REGISTER_BUFFERS after all.
--
Thanks,
David / dhildenb
On Mon, Nov 22, 2021 at 09:08:47PM +0100, David Hildenbrand wrote:
> > You can't use mmu notifiers without impacting the fast path. This isn't
> > just about io_uring, there are other users of memlock right now (like
> > bpf) which just makes it even worse.
>
> 1) Do we have a performance evaluation? Did someone try and come up with
> a conclusion how bad it would be?
It needs additional locking between page readers and the mmu notifier.
One of the virtio drivers does this: it uses RCU on the page readers
and a synchronize_rcu() in an mmu notifier - which I think is pretty
bad.
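To make the trade-off concrete, here is a hedged sketch of that pattern
(loosely modelled on what vhost/vdpa-style code does; struct tracked_map
and the function names are made up for illustration): readers stay
lockless, but the notifier side has to pay a synchronize_rcu() on every
invalidation.

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>
#include <linux/string.h>

struct tracked_map {
	struct page __rcu *page;	/* page currently backing the mapping */
};

/* fast path: look up and use the page without taking any lock */
static void reader_copy(struct tracked_map *m, void *dst, size_t len)
{
	struct page *p;

	rcu_read_lock();
	p = rcu_dereference(m->page);
	if (p) {
		void *src = kmap_local_page(p);

		memcpy(dst, src, len);
		kunmap_local(src);
	}
	rcu_read_unlock();
}

/* mmu notifier side: called via ->invalidate_range_start() */
static void tracked_map_invalidate(struct tracked_map *m)
{
	rcu_assign_pointer(m->page, NULL);
	/* wait for every reader that might still see the old page */
	synchronize_rcu();	/* the expensive part being criticized above */
}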
> 2) Could we provide a mmu variant to ordinary users that's just good
> enough but maybe not as fast as what we have today? And limit
> FOLL_LONGTERM to special, privileged users?
rdma has never been privileged
Jason
>
>> 2) Could we provide a mmu variant to ordinary users that's just good
>> enough but maybe not as fast as what we have today? And limit
>> FOLL_LONGTERM to special, privileged users?
>
> rdma has never been privileged
Feel free to correct me if I'm wrong: it requires special networking
hardware and the admin/kernel has to prepare the system in a way such
that it can be used.
--
Thanks,
David / dhildenb
On Tue, Nov 23, 2021 at 02:39:19PM +0100, David Hildenbrand wrote:
> >
> >> 2) Could we provide a mmu variant to ordinary users that's just good
> >> enough but maybe not as fast as what we have today? And limit
> >> FOLL_LONGTERM to special, privileged users?
> >
> > rdma has never been privileged
>
> Feel free to correct me if I'm wrong: it requires special networking
> hardware and the admin/kernel has to prepare the system in a way such
> that it can be used.
Not really, plug in the right PCI card and it works
"special" is a bit of a reach since almost every NIC sold in the > 100GB
segment supports some RDMA.
Jason
On 23.11.21 15:07, Jason Gunthorpe wrote:
> On Tue, Nov 23, 2021 at 02:39:19PM +0100, David Hildenbrand wrote:
>>>
>>>> 2) Could we provide a mmu variant to ordinary users that's just good
>>>> enough but maybe not as fast as what we have today? And limit
>>>> FOLL_LONGTERM to special, privileged users?
>>>
>>> rdma has never been privileged
>>
>> Feel free to correct me if I'm wrong: it requires special networking
>> hardware and the admin/kernel has to prepare the system in a way such
>> that it can be used.
>
> Not really, plug in the right PCI card and it works
Naive me would have assumed that the right modules have to be loaded
(and not blacklisted), that there has to be an rdma service installed
and running, that the NIC has to be configured in some way, and that
there is some kind of access control over which user can actually use
which NIC. For example, I would have assumed that from inside a
container it usually wouldn't just work.
But I am absolutely not a networking and RDMA expert, so I have to
believe what you say and I trust your experience :) So could as well be
that on such a "special" (or not so special) systems there should be a
way to restrict it to privileged users only.
>
> "special" is a bit of a reach since almost every NIC sold in the > 100GB
> segment supports some RDMA.
--
Thanks,
David / dhildenb
On Tue, Nov 23, 2021 at 03:44:03PM +0100, David Hildenbrand wrote:
> On 23.11.21 15:07, Jason Gunthorpe wrote:
> > On Tue, Nov 23, 2021 at 02:39:19PM +0100, David Hildenbrand wrote:
> >>>
> >>>> 2) Could we provide a mmu variant to ordinary users that's just good
> >>>> enough but maybe not as fast as what we have today? And limit
> >>>> FOLL_LONGTERM to special, privileged users?
> >>>
> >>> rdma has never been privileged
> >>
> >> Feel free to correct me if I'm wrong: it requires special networking
> >> hardware and the admin/kernel has to prepare the system in a way such
> >> that it can be used.
> >
> > Not really, plug in the right PCI card and it works
>
> Naive me would have assumed that the right modules have to be loaded
> (and not blacklisted), that there has to be an rdma service installed
> and running, that the NIC has to be configured in some way, and that
> there is some kind of access control which user can actually use which
> NIC.
Not really, we've worked hard that it works as well as any other HW
device. Plug it in and it works.
There is no systemd service, or special mandatory configuration, for
instance.
> For example, I would have assume from inside a container it usually
> wouldn't just work.
Nope, RDMA follows the net namespaces of its ethernet port, so it just
works in containers too.
> believe what you say and I trust your experience :) So could as well be
> that on such a "special" (or not so special) systems there should be a
> way to restrict it to privileged users only.
At this point RDMA is about as "special" as people running large
ZONE_MOVABLE systems, and the two are going to start colliding
heavily. The RDMA VFIO migration driver should be merged soon which
makes VMs using this stuff finally practical.
Jason
On 23.11.21 18:00, Jason Gunthorpe wrote:
> On Tue, Nov 23, 2021 at 03:44:03PM +0100, David Hildenbrand wrote:
>> On 23.11.21 15:07, Jason Gunthorpe wrote:
>>> On Tue, Nov 23, 2021 at 02:39:19PM +0100, David Hildenbrand wrote:
>>>>>
>>>>>> 2) Could we provide a mmu variant to ordinary users that's just good
>>>>>> enough but maybe not as fast as what we have today? And limit
>>>>>> FOLL_LONGTERM to special, privileged users?
>>>>>
>>>>> rdma has never been privileged
>>>>
>>>> Feel free to correct me if I'm wrong: it requires special networking
>>>> hardware and the admin/kernel has to prepare the system in a way such
>>>> that it can be used.
>>>
>>> Not really, plug in the right PCI card and it works
>>
>> Naive me would have assumed that the right modules have to be loaded
>> (and not blacklisted), that there has to be an rdma service installed
>> and running, that the NIC has to be configured in some way, and that
>> there is some kind of access control which user can actually use which
>> NIC.
>
> Not really, we've worked hard that it works as well as any other HW
> device. Plug it in and it works.
>
> There is no systemd service, or special mandatory configuration, for
> instance.
>
>> For example, I would have assume from inside a container it usually
>> wouldn't just work.
>
> Nope, RDMA follows the net namespaces of its ethernet port, so it just
> works in containers too.
>
>> believe what you say and I trust your experience :) So could as well be
>> that on such a "special" (or not so special) systems there should be a
>> way to restrict it to privileged users only.
>
> At this point RDMA is about as "special" as people running large
> ZONE_MOVABLE systems, and the two are going to start colliding
> heavily. The RDMA VFIO migration driver should be merged soon which
> makes VMs using this stuff finally practical.
Sounds like fun. At least we documented it already ;)
https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html#zone-movable-sizing-considerations
--
Thanks,
David / dhildenb
On 11/23/21 18:00, Jason Gunthorpe wrote:
>
>> believe what you say and I trust your experience :) So could as well be
>> that on such a "special" (or not so special) systems there should be a
>> way to restrict it to privileged users only.
>
> At this point RDMA is about as "special" as people running large
> ZONE_MOVABLE systems, and the two are going to start colliding
> heavily. The RDMA VFIO migration driver should be merged soon which
> makes VMs using this stuff finally practical.
How does that work, I see the word migration, so does it cause pages to
be migrated out of ZONE_MOVABLE before they are pinned?
Similarly for io-uring we could be migrating pages to be pinned so that
they end up consolidated close together, and prevent pathological
situations like in David's reproducer. IIRC that was an idea to do for
long-term pins in general.
> Jason
>
On Tue, Nov 23, 2021 at 11:04:04PM +0100, Vlastimil Babka wrote:
> On 11/23/21 18:00, Jason Gunthorpe wrote:
> >
> >> believe what you say and I trust your experience :) So could as well be
> >> that on such a "special" (or not so special) systems there should be a
> >> way to restrict it to privileged users only.
> >
> > At this point RDMA is about as "special" as people running large
> > ZONE_MOVABLE systems, and the two are going to start colliding
> > heavily. The RDMA VFIO migration driver should be merged soon which
> > makes VMs using this stuff finally practical.
>
> How does that work, I see the word migration, so does it cause pages to
Sorry I mean what is often called "VM live migration". Typically that
cannot be done if a PCI device is assigned to the VM, as suspending and
then migrating a PCI device to another server is complicated. With
forthcoming hardware mlx5 can do this and thus the entire RDMA stack
becomes practically usable and performant within a VM.
> be migrated out of ZONE_MOVABLE before they are pinned?
GUP already does this automatically for FOLL_LONGTERM.
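For reference, a very condensed sketch of the FOLL_LONGTERM flow being
referred to (simplified from the idea in mm/gup.c, not a literal copy;
all_pages_pinnable() and unpin_and_migrate() are placeholder helpers
invented for brevity):

#include <linux/mm.h>

/* placeholder helpers, named for illustration only */
extern bool all_pages_pinnable(struct page **pages, long nr);
extern void unpin_and_migrate(struct page **pages, long nr);

/* pin first, then migrate anything that is not allowed to stay where it
 * is (ZONE_MOVABLE, MIGRATE_CMA), and retry until everything sticks */
static long pin_longterm_sketch(unsigned long start, unsigned long nr,
				unsigned int gup_flags, struct page **pages)
{
	long pinned;

	for (;;) {
		pinned = pin_user_pages(start, nr, gup_flags, pages, NULL);
		if (pinned <= 0)
			return pinned;
		if (all_pages_pinnable(pages, pinned))	/* is_pinnable_page() on each */
			return pinned;
		unpin_and_migrate(pages, pinned);	/* migrate, unpin, retry */
	}
}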
> Similarly for io-uring we could be migrating pages to be pinned so that
> the end up consolidated close together, and prevent pathologic
> situations like in David's reproducer.
It is an interesting idea to have GUP do some kind of THP preserving
migration.
Jason
On 24.11.21 00:59, Jason Gunthorpe wrote:
> On Tue, Nov 23, 2021 at 11:04:04PM +0100, Vlastimil Babka wrote:
>> On 11/23/21 18:00, Jason Gunthorpe wrote:
>>>
>>>> believe what you say and I trust your experience :) So could as well be
>>>> that on such a "special" (or not so special) systems there should be a
>>>> way to restrict it to privileged users only.
>>>
>>> At this point RDMA is about as "special" as people running large
>>> ZONE_MOVABLE systems, and the two are going to start colliding
>>> heavily. The RDMA VFIO migration driver should be merged soon which
>>> makes VMs using this stuff finally practical.
>>
>> How does that work, I see the word migration, so does it cause pages to
>
> Sorry I mean what is often called "VM live migration". Typically that
> cannot be done if a PCI device is assigned to the VM as suspending and
> the migrating a PCI device to another server is complicated. With
> forthcoming hardware mlx5 can do this and thus the entire RDMA stack
> becomes practically usable and performant within a VM.
>
>> be migrated out of ZONE_MOVABLE before they are pinned?
>
> GUP already does this automatically for FOLL_LONGTERM.
>
>> Similarly for io-uring we could be migrating pages to be pinned so that
>> the end up consolidated close together, and prevent pathologic
>> situations like in David's reproducer.
>
> It is an interesting idea to have GUP do some kind of THP preserving
> migration.
Unfortunately it will only be a band aid AFAIU. I can rewrite my
reproducer fairly easily to pin the whole 2M range first, pin a second
time only a single page, and then unpin the 2M range, resulting in the
very same way to block THP. (I can block some THP less because I always
need the possibility to memlock 2M first, though).
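For illustration, a rough sketch of that three-step sequence (purely
illustrative; it glosses over THP alignment and assumes an
RLIMIT_MEMLOCK of at least 2 MiB for step 1, which is the "need the
possibility to memlock 2M first" caveat above; two rings are used so
the 2 MiB registration can be dropped independently of the single-page
one):

/* Build with: gcc -o pin_trick pin_trick.c -luring */
#include <liburing.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	struct io_uring big, small;
	struct iovec iov;
	unsigned char *p;

	if (io_uring_queue_init(8, &big, 0) || io_uring_queue_init(8, &small, 0))
		return 1;

	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 0, 2UL << 20);		/* fault it in */

	/* 1) pin the whole 2 MiB range */
	iov.iov_base = p;
	iov.iov_len = 2UL << 20;
	if (io_uring_register_buffers(&big, &iov, 1))
		return 1;

	/* 2) pin a single 4 KiB page inside it on the second ring */
	iov.iov_len = 4096;
	if (io_uring_register_buffers(&small, &iov, 1))
		return 1;

	/* 3) drop the 2 MiB pin again; only the 4 KiB pin remains */
	io_uring_unregister_buffers(&big);

	pause();
	return 0;
}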
--
Thanks,
David / dhildenb
On 24.11.21 14:23, Jason Gunthorpe wrote:
> On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote:
>
>> Unfortunately it will only be a band aid AFAIU. I can rewrite my
>> reproducer fairly easily to pin the whole 2M range first, pin a second
>> time only a single page, and then unpin the 2M range, resulting in the
>> very same way to block THP. (I can block some THP less because I always
>> need the possibility to memlock 2M first, though).
>
> Oh!
>
> The issue is GUP always pins an entire compound, no matter how little
> the user requests.
That's a different issue. I make sure to split the compound page before
pinning anything :)
--
Thanks,
David / dhildenb
On 24.11.21 14:28, Jason Gunthorpe wrote:
> On Wed, Nov 24, 2021 at 02:25:09PM +0100, David Hildenbrand wrote:
>> On 24.11.21 14:23, Jason Gunthorpe wrote:
>>> On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote:
>>>
>>>> Unfortunately it will only be a band aid AFAIU. I can rewrite my
>>>> reproducer fairly easily to pin the whole 2M range first, pin a second
>>>> time only a single page, and then unpin the 2M range, resulting in the
>>>> very same way to block THP. (I can block some THP less because I always
>>>> need the possibility to memlock 2M first, though).
>>>
>>> Oh!
>>>
>>> The issue is GUP always pins an entire compound, no matter how little
>>> the user requests.
>>
>> That's a different issue. I make sure to split the compound page before
>> pinning anything :)
>
> ?? Where is that done in GUP?
It's done in my reproducer manually.
--
Thanks,
David / dhildenb
On 24.11.21 14:48, Jason Gunthorpe wrote:
> On Wed, Nov 24, 2021 at 02:29:38PM +0100, David Hildenbrand wrote:
>> On 24.11.21 14:28, Jason Gunthorpe wrote:
>>> On Wed, Nov 24, 2021 at 02:25:09PM +0100, David Hildenbrand wrote:
>>>> On 24.11.21 14:23, Jason Gunthorpe wrote:
>>>>> On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote:
>>>>>
>>>>>> Unfortunately it will only be a band aid AFAIU. I can rewrite my
>>>>>> reproducer fairly easily to pin the whole 2M range first, pin a second
>>>>>> time only a single page, and then unpin the 2M range, resulting in the
>>>>>> very same way to block THP. (I can block some THP less because I always
>>>>>> need the possibility to memlock 2M first, though).
>>>>>
>>>>> Oh!
>>>>>
>>>>> The issue is GUP always pins an entire compound, no matter how little
>>>>> the user requests.
>>>>
>>>> That's a different issue. I make sure to split the compound page before
>>>> pinning anything :)
>>>
>>> ?? Where is that done in GUP?
>>
>> It's done in my reproducer manually.
>
> Aren't there many ways for hostile unpriv userspace to cause memory
> fragmentation? You are picking on pinning here, but any approach that
> forces the kernel to make a kalloc on a THP subpage would do just as
> well.
I'm not aware of any where you can fragment 50% of all pageblocks in the
system as an unprivileged user essentially consuming almost no memory
and essentially staying inside well-defined memlock limits. But sure if
there are "many" people will be able to come up with at least one
comparable thing. I'll be happy to learn.
>
> Arguably if we want to point to an issue here it is in MADV_FREE/etc
> that is the direct culprit in allowing userspace to break up THPs and
> then trigger fragmentation.
>
> If the objective is to prevent DOS of THP then MADV_FREE should
> conserve the THP and migrate the subpages to non-THP
> memory.
>
> FOLL_LONGTERM is not the issue here.
Thanks Jason for the discussion, but this is where I'll opt out for now
because we seem to strongly disagree, and as I said:
"I'm going to leave judgment how bad this is or isn't to the educated
reader, and I'll stop spending time on this as I have more important
things to work on."
But I'm going to leave one last comment to eventually give you a
different perspective: after MADV_DONTNEED the compound page sits on the
deferred split queue and will get split either way soon. People are
right now discussing upstream whether to even split synchronously, which
would
move MADV_FREE out of the picture completely.
My position that FOLL_LONGTERM for unprivileged users is a strong no-go
stands as it is. Not MADV_FREE speeding up the compound page split in my
reproducer. Not MADV_DONTNEED allowing us to zap parts of a THP (I could
even have just used munmap or even mmap(MAP_FIXED)).
--
Thanks,
David / dhildenb
On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote:
> Unfortunately it will only be a band aid AFAIU. I can rewrite my
> reproducer fairly easily to pin the whole 2M range first, pin a second
> time only a single page, and then unpin the 2M range, resulting in the
> very same way to block THP. (I can block some THP less because I always
> need the possibility to memlock 2M first, though).
Oh!
The issue is GUP always pins an entire compound, no matter how little
the user requests.
However, when all the GUP callers do mlock accounting they have no
idea how much memory GUP actually pinned and only account mlock on 4K
chunks.
This is the bug your test is showing - using this accounting error the
user can significantly blow past their mlock limit by having GUP pin
2M chunks and then mlock accounting for only 4k chunks.
It is a super obnoxious bug to fix, but still just a bug and not some
inherent defect in FOLL_LONGTERM.
It also says the MLOCK_LIMIT really needs to always be > 1 THP
otherwise even a single 4K page may be unpinnable with correct
accounting.
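A hypothetical sketch of what charging the whole compound could look
like (illustrative only, not kernel code; account_pinned_compound() and
its limit handling are made-up names for the idea described above):

#include <linux/mm.h>
#include <linux/sched/user.h>

/* charge the entire compound page against the memlock limit, not just
 * the single 4K subpage the caller asked to pin */
static int account_pinned_compound(struct page *page,
				   struct user_struct *user,
				   long limit_pages)
{
	struct page *head = compound_head(page);
	long nr = compound_nr(head);	/* 512 for a 2 MiB THP */

	if (atomic_long_add_return(nr, &user->locked_vm) > limit_pages) {
		atomic_long_sub(nr, &user->locked_vm);
		return -ENOMEM;
	}
	return 0;
}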
Jason
On Wed, Nov 24, 2021 at 02:25:09PM +0100, David Hildenbrand wrote:
> On 24.11.21 14:23, Jason Gunthorpe wrote:
> > On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote:
> >
> >> Unfortunately it will only be a band aid AFAIU. I can rewrite my
> >> reproducer fairly easily to pin the whole 2M range first, pin a second
> >> time only a single page, and then unpin the 2M range, resulting in the
> >> very same way to block THP. (I can block some THP less because I always
> >> need the possibility to memlock 2M first, though).
> >
> > Oh!
> >
> > The issue is GUP always pins an entire compound, no matter how little
> > the user requests.
>
> That's a different issue. I make sure to split the compound page before
> pinning anything :)
?? Where is that done in GUP?
Jason
On Wed, Nov 24, 2021 at 02:29:38PM +0100, David Hildenbrand wrote:
> On 24.11.21 14:28, Jason Gunthorpe wrote:
> > On Wed, Nov 24, 2021 at 02:25:09PM +0100, David Hildenbrand wrote:
> >> On 24.11.21 14:23, Jason Gunthorpe wrote:
> >>> On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote:
> >>>
> >>>> Unfortunately it will only be a band aid AFAIU. I can rewrite my
> >>>> reproducer fairly easily to pin the whole 2M range first, pin a second
> >>>> time only a single page, and then unpin the 2M range, resulting in the
> >>>> very same way to block THP. (I can block some THP less because I always
> >>>> need the possibility to memlock 2M first, though).
> >>>
> >>> Oh!
> >>>
> >>> The issue is GUP always pins an entire compound, no matter how little
> >>> the user requests.
> >>
> >> That's a different issue. I make sure to split the compound page before
> >> pinning anything :)
> >
> > ?? Where is that done in GUP?
>
> It's done in my reproducer manually.
Aren't there many ways for hostile unpriv userspace to cause memory
fragmentation? You are picking on pinning here, but any approach that
forces the kernel to make a kalloc on a THP subpage would do just as
well.
Arguably if we want to point to an issue here it is in MADV_FREE/etc
that is the direct culprit in allowing userspace to break up THPs and
then trigger fragmentation.
If the objective is to prevent DOS of THP then MADV_FREE should
conserve the THP and migrate the subpages to non-THP
memory.
FOLL_LONGTERM is not the issue here.
Jason
On 11/24/21 09:57, David Hildenbrand wrote:
> On 24.11.21 00:59, Jason Gunthorpe wrote:
>>> Similarly for io-uring we could be migrating pages to be pinned so that
>>> the end up consolidated close together, and prevent pathologic
>>> situations like in David's reproducer.
>>
>> It is an interesting idea to have GUP do some kind of THP preserving
>> migration.
>
>
> Unfortunately it will only be a band aid AFAIU. I can rewrite my
> reproducer fairly easily to pin the whole 2M range first, pin a second
> time only a single page, and then unpin the 2M range, resulting in the
> very same way to block THP. (I can block some THP less because I always
> need the possibility to memlock 2M first, though).
Hm I see, then we could also consider making it possible to migrate the
pinned pages - of course io-uring would have to be cooperative here,
similarly to anything that supports PageMovable. Mlocked pages can also be
migrated (I think there's some sysctl to prevent that, if the guarantee of
not even a minor pagefault is considered more important).
On 24.11.21 15:37, Vlastimil Babka wrote:
> On 11/24/21 09:57, David Hildenbrand wrote:
>> On 24.11.21 00:59, Jason Gunthorpe wrote:
>>>> Similarly for io-uring we could be migrating pages to be pinned so that
>>>> the end up consolidated close together, and prevent pathologic
>>>> situations like in David's reproducer.
>>>
>>> It is an interesting idea to have GUP do some kind of THP preserving
>>> migration.
>>
>>
>> Unfortunately it will only be a band aid AFAIU. I can rewrite my
>> reproducer fairly easily to pin the whole 2M range first, pin a second
>> time only a single page, and then unpin the 2M range, resulting in the
>> very same way to block THP. (I can block some THP less because I always
>> need the possibility to memlock 2M first, though).
>
> Hm I see, then we could also consider making it possible to migrate the
> pinned pages - of course io-uring would have to be cooperative here,
> similarly to anything that supports PageMovable.
I might be wrong but that would then essentially be an actual mlock+mmu
notifier mechanism, and no longer FOLL_LONGTERM. And the mlock could
actually be done by user space and would be optional.
I'd be very happy to see something like that instead ... but so far
people don't even agree that it's an issue worth fixing.
So ...
--
Thanks,
David / dhildenb
On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote:
> I'm not aware of any where you can fragment 50% of all pageblocks in the
> system as an unprivileged user essentially consuming almost no memory
> and essentially staying inside well-defined memlock limits. But sure if
> there are "many" people will be able to come up with at least one
> comparable thing. I'll be happy to learn.
If the concern is that THP's can be DOS'd then any avenue that renders
the system out of THPs is a DOS attack vector. Including all the
normal workloads that people run and already complain that THPs get
exhausted.
A hostile userspace can only quicken this process.
> My position that FOLL_LONGTERM for unprivileged users is a strong no-go
> stands as it is.
As this basically excludes long standing pre-existing things like
RDMA, XDP, io_uring, and more I don't think this can be the general
answer for mm, sorry.
Sure, lets stop now since I don't think we can agree.
Jason
On 24.11.21 16:34, Jason Gunthorpe wrote:
> On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote:
>
>> I'm not aware of any where you can fragment 50% of all pageblocks in the
>> system as an unprivileged user essentially consuming almost no memory
>> and essentially staying inside well-defined memlock limits. But sure if
>> there are "many" people will be able to come up with at least one
>> comparable thing. I'll be happy to learn.
>
> If the concern is that THP's can be DOS'd then any avenue that renders
> the system out of THPs is a DOS attack vector. Including all the
> normal workloads that people run and already complain that THPs get
> exhausted.
>
> A hostile userspace can only quicken this process.
We can not only fragment THP but also easily smaller compound pages,
with less impact though (well, as long as people want more than 0.1% per
user ...).
We want to make more extensive use of THP; the whole folio work is about
using THP. Some people are even working on increasing the MAX_ORDER and
introduce gigantic THP.
And here we are having mechanisms available to unprivileged users to
just sabotage the very thing at its core extremely easily. Personally, I
think this is very bad, but that's just my humble opinion.
>
>> My position that FOLL_LONGTERM for unprivileged users is a strong no-go
>> stands as it is.
>
> As this basically excludes long standing pre-existing things like
> RDMA, XDP, io_uring, and more I don't think this can be the general
> answer for mm, sorry.
Let's think about options to restrict FOLL_LONGTERM usage:
One option would be to add toggle(s) (e.g., kernel cmdline options) to
make relevant mechanisms (or even FOLL_LONGTERM itself) privileged. The
admin can opt in if unprivileged users should have that capability. A
distro might overwrite the default and set it to "on". I'm not
completely happy about that.
Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK,
but instead as something that explicitly matches the differing
semantics. We could have a limit for privileged and one for unprivileged
users. The default in the kernel could be 0 but an admin/system can
overwrite it to opt in and a distro might apply different rules. Yes,
we're back to the original question about limits, but now with the
thought that FOLL_LONGTERM really is different than mlock and
potentially more dangerous.
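A purely hypothetical sketch of that second option (RLIMIT_PINNEDMEM and
pinned_vm_limit_ok() are made-up names, just to make the idea concrete):
FOLL_LONGTERM pins would be charged against their own limit, separate
from RLIMIT_MEMLOCK, with a default an admin can raise.

#include <linux/mm_types.h>
#include <linux/sched/signal.h>

/* hypothetical new rlimit index, would have to be added properly */
#define RLIMIT_PINNEDMEM	16

static bool pinned_vm_limit_ok(struct mm_struct *mm, unsigned long nr_pages)
{
	unsigned long limit = rlimit(RLIMIT_PINNEDMEM) >> PAGE_SHIFT;

	/* mm->pinned_vm already tracks long-term pins today */
	return atomic64_read(&mm->pinned_vm) + nr_pages <= limit;
}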
At the same time, eventually work on proper alternatives with mmu
notifiers (and possibly without any such limits) where possible and
required. (I assume it's hardly possible for RDMA because of the way the
hardware works)
Just some ideas, open for alternatives. I know that for the cases where
we want it to "just work" for unprivileged users but cannot even have
alternative implementations, this is bad.
>
> Sure, lets stop now since I don't think we can agree.
Don't get me wrong, I really should be working on other stuff, so I have
limited brain capacity and time :) OTOH I'm willing to help at least
discuss alternatives.
Let's think about realistic alternatives to keep FOLL_LONGTERM for any
user working (that would tackle the extreme fragmentation issue at
least, ignoring e.g., other fragmentation we can trigger with
FOLL_LONGTERM or ZONE_MOVABLE/MIGRATE_CMA):
The nasty thing really is splitting a compound page and then pinning
some pages, even if it's pinning the complete compound range. Ideally,
we'd defer any action to the time we actually FOLL_LONGTERM pin a page.
a) I think we cannot migrate pages when splitting the PMD (e.g., unmap,
MADV_DONTNEED, swap?, page compaction?). User space can just pin the
compound page to block migration.
b) We might migrate pages when splitting the compound page. In
split_huge_page_to_list() we know that we have nobody pinning the page.
I did not check if it's possible. There might be cases where it's not
immediately clear if it's possible (e.g., inside shrink_page_list())
It would mean that we would migrate pages essentially any time we split
a compound page because there could be someone FOLL_LONGTERM pinning the
page later. Usually we'd expect page compaction to fix this up on actual
demand. I'd call this sub-optimal.
c) We migrate any time someone FOLL_LONGTERM pins a page and the page is
not pinned yet -- because it might have been a split compound page. I
think we can agree that that's not an option :)
d) We remember if a page was part of a compound page and was not freed
yet. If we FOLL_LONGTERM such a page, we migrate it. Unfortunately,
we're short on pageflags for anon pages I think.
Hm, alternatives?
--
Thanks,
David / dhildenb
On Wed, Nov 24, 2021 at 05:43:58PM +0100, David Hildenbrand wrote:
> On 24.11.21 16:34, Jason Gunthorpe wrote:
> > On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote:
> >
> >> I'm not aware of any where you can fragment 50% of all pageblocks in the
> >> system as an unprivileged user essentially consuming almost no memory
> >> and essentially staying inside well-defined memlock limits. But sure if
> >> there are "many" people will be able to come up with at least one
> >> comparable thing. I'll be happy to learn.
> >
> > If the concern is that THP's can be DOS'd then any avenue that renders
> > the system out of THPs is a DOS attack vector. Including all the
> > normal workloads that people run and already complain that THPs get
> > exhausted.
> >
> > A hostile userspace can only quicken this process.
>
> We can not only fragment THP but also easily smaller compound pages,
> with less impact though (well, as long as people want more than 0.1% per
> user ...).
My point is as long as userspace can drive this fragmentation, by any
means, we can never have DOS-proof higher-order pages, so let's not
worry so much about one of many ways to create fragmentation.
> >> My position that FOLL_LONGTERM for unprivileged users is a strong no-go
> >> stands as it is.
> >
> > As this basically excludes long standing pre-existing things like
> > RDMA, XDP, io_uring, and more I don't think this can be the general
> > answer for mm, sorry.
>
> Let's think about options to restrict FOLL_LONGTERM usage:
Which gives me the view that we should be talking about how to make
high order pages completely DOS proof, not about FOLL_LONGTERM.
To me that is exactly what ZONE_MOVABLE strives to achieve, and I
think anyone who cares about QOS around THP must include ZONE_MOVABLE
in their solution.
In all of this I am thinking back to the discussion about the 1GB THP
proposal which was resoundly shot down on the grounds that 2MB THP
*doesn't work* today due to the existing fragmentation problems.
> Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK,
> but instead as something that explicitly matches the differing
> semantics.
Also a good idea, someone who cares about this should really put
pinned pages into the cgroup machinery (with correct accounting!)
> At the same time, eventually work on proper alternatives with mmu
> notifiers (and possibly without the any such limits) where possible
> and required.
mmu_notifiers is also bad, it just offends a different group of MM
concerns :)
Something like io_uring is registering a bulk amount of memory and then
doing some potentially long operations against it.
So to use a mmu_notifier scheme you'd have to block the mmu_notifier
invalidate_range_start until all the operations touching the memory
finish (and suspend new operations at the same time!).
Blocking the notifier like this locks up the migration/etc threads
completely, and is destructive to the OOM reclaim.
At least with a pinned page those threads don't even try to touch it
instead of getting stuck up.
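A hedged sketch of what that would mean for a registered-buffer table
(all of struct buf_table and its fields are made up; only the
mmu_notifier callback shape is real): the invalidate callback must stop
new I/O against the buffers and then wait out every in-flight operation
before the mapping may change.

#include <linux/mmu_notifier.h>
#include <linux/wait.h>

#define BT_INVALIDATING	0	/* bit in buf_table.flags */

struct buf_table {
	struct mmu_notifier	mn;
	unsigned long		flags;
	atomic_t		inflight;	/* ops currently using the buffers */
	wait_queue_head_t	quiesce_wq;
};

static int buf_table_invalidate_start(struct mmu_notifier *mn,
				      const struct mmu_notifier_range *range)
{
	struct buf_table *bt = container_of(mn, struct buf_table, mn);

	/* stop new submissions that would use the registered buffers */
	set_bit(BT_INVALIDATING, &bt->flags);

	/* ... and wait for every already-submitted op touching them, which
	 * can take as long as the slowest outstanding I/O */
	wait_event(bt->quiesce_wq, atomic_read(&bt->inflight) == 0);
	return 0;
}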
> Don't get me wrong, I really should be working on other stuff, so I have
> limited brain capacity and time :) OTOH I'm willing to help at least
> discuss alternatives.
Haha, me too..
Jason
On 24.11.21 17:43, David Hildenbrand wrote:
> On 24.11.21 16:34, Jason Gunthorpe wrote:
>> On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote:
>>
>>> I'm not aware of any where you can fragment 50% of all pageblocks in the
>>> system as an unprivileged user essentially consuming almost no memory
>>> and essentially staying inside well-defined memlock limits. But sure if
>>> there are "many" people will be able to come up with at least one
>>> comparable thing. I'll be happy to learn.
>>
>> If the concern is that THP's can be DOS'd then any avenue that renders
>> the system out of THPs is a DOS attack vector. Including all the
>> normal workloads that people run and already complain that THPs get
>> exhausted.
>>
>> A hostile userspace can only quicken this process.
>
> We can not only fragment THP but also easily smaller compound pages,
> with less impact though (well, as long as people want more than 0.1% per
> user ...).
>
> We want to make more excessive use of THP; the whole folio work is about
> using THP. Some people are even working on increasing the MAX_ORDER and
> introduce gigantic THP.
>
> And here we are having mechanisms available to unprivileged users to
> just sabotage the very thing at its core extremely easily. Personally, I
> think this is very bad, but that's just my humble opinion.
>
>>
>>> My position that FOLL_LONGTERM for unprivileged users is a strong no-go
>>> stands as it is.
>>
>> As this basically excludes long standing pre-existing things like
>> RDMA, XDP, io_uring, and more I don't think this can be the general
>> answer for mm, sorry.
>
> Let's think about options to restrict FOLL_LONGTERM usage:
>
> One option would be to add toggle(s) (e.g., kernel cmdline options) to
> make relevant mechanisms (or even FOLL_LONGTERM itself) privileged. The
> admin can opt in if unprivileged users should have that capability. A
> distro might overwrite the default and set it to "on". I'm not
> completely happy about that.
>
> Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK,
> but instead as something that explicitly matches the differing
> semantics. We could have a limit for privileged and one for unprivileged
> users. The default in the kernel could be 0 but an admin/system can
> overwrite it to opt in and a distro might apply different rules. Yes,
> we're back to the original question about limits, but now with the
> thought that FOLL_LONGTERM really is different than mlock and
> potentially more dangerous.
>
> At the same time, eventually work on proper alternatives with mmu
> notifiers (and possibly without the any such limits) where possible and
> required. (I assume it's hardly possible for RDMA because of the way the
> hardware works)
>
> Just some ideas, open for alternatives. I know that for the cases where
> we want it to "just work" for unprivileged users but cannot even have
> alternative implementations, this is bad.
>
>>
>> Sure, lets stop now since I don't think we can agree.
>
> Don't get me wrong, I really should be working on other stuff, so I have
> limited brain capacity and time :) OTOH I'm willing to help at least
> discuss alternatives.
>
>
> Let's think about realistic alternatives to keep FOLL_LONGTERM for any
> user working (that would tackle the extreme fragmentation issue at
> least, ignoring e.g., other fragmentation we can trigger with
> FOLL_LONGTERM or ZONE_MOVABLE/MIGRATE_CMA):
>
> The nasty thing really is splitting a compound page and then pinning
> some pages, even if it's pinning the complete compound range. Ideally,
> we'd defer any action to the time we actually FOLL_LONGTERM pin a page.
>
>
> a) I think we cannot migrate pages when splitting the PMD (e.g., unmap,
> MADV_DONTNEED, swap?, page compaction?). User space can just pin the
> compound page to block migration.
>
> b) We might migrate pages when splitting the compound page. In
> split_huge_page_to_list() we know that we have nobody pinning the page.
> I did not check if it's possible. There might be cases where it's not
> immediately clear if it's possible (e.g., inside shrink_page_list())
>
> It would mean that we would migrate pages essentially any time we split
> a compound page because there could be someone FOLL_LONGTERM pinning the
> page later. Usually we'd expect page compaction to fix this up on actual
> demand. I'd call this sub-optimal.
>
> c) We migrate any time someone FOLL_LONGTERM pins a page and the page is
> not pinned yet -- because it might have been a split compound page. I
> think we can agree that that's not an option :)
>
> d) We remember if a page was part of a compound page and was not freed
> yet. If we FOLL_LONGTERM such a page, we migrate it. Unfortunately,
> we're short on pageflags for anon pages I think.
>
> Hm, alternatives?
>
And while thinking about the other "non malicious" fragmentation issues
-- especially triggering what my reproducer triggered by pure luck
simply because our pages we're pinning are scattered all over pageblocks
-- and remembering what Vlastimil said regarding grouping, I finally
understood why FOLL_LONGTERM is slightly wrong as is.
We allocate user pages with GFP_HIGHUSER_MOVABLE:
"``GFP_HIGHUSER_MOVABLE`` does not require that allocated memory will
be directly accessible by the kernel and implies that the data is
movable."
This holds true for mlock (except when migration is disabled for RT
systems, which is a special case already).
This does not hold true once FOLL_LONGTERM turns the pages essentially
unmovable. We tried to fix some of that fallout by migrating such pages
when they are residing on MIGRATE_CMA and ZONE_MOVABLE before
FOLL_LONGTERM. It's only part of the story, because we're fragmenting
each and ever !MIGRATE_UNMOVABLE pageblock with unmovable data. If we're
unlucky 50% (with the 0.1% RAM rule) of all MIGRATE_MOVABLE pageblocks
in the system.
I had the exact same discussion ("allocating unmovable data for user
space") with Mike regarding "secretmem", resulting in us using
GFP_HIGHUSER for allocation instead, and with Intel folks just now
regarding unmovable fd-based guest mappings.
a) "unmovable memory for user space" really is different from mlock
(RLIMIT_MEMLOCK). I think we should have much better control over the
amount of unmovable memory we directly let user space allocate for
mapping into the process page tables. A separate RLIMIT sounds
reasonable, and I think I discussed that with Mike already briefly
during secretmem discussions, and we decided to defer introducing
something like that.
b) To avoid fragmenting !MIGRATE_UNMOVABLE pageblocks, we would have to
migrate pages into MIGRATE_UNMOVABLE pageblocks before FOLL_LONGTERM.
If we're pinning a complete pageblock, we're essentially converting
that pageblock to MIGRATE_UNMOVABLE.
c) When unpinning, theoretically we would want to migrate the now again
movable page out of a MIGRATE_UNMOVABLE pageblock if there is
sufficient memory.
We have right now the following, which highlights the issue:
/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
static inline bool is_pinnable_page(struct page *page)
{
	return !(is_zone_movable_page(page) || is_migrate_cma_page(page)) ||
		is_zero_pfn(page_to_pfn(page));
}
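For comparison, a hedged sketch of the direction a) - c) above point at
(illustrative only, not a proposed patch; longterm_pinnable_in_place()
is a made-up name): in addition to ZONE_MOVABLE and MIGRATE_CMA, a page
sitting in a movable pageblock would also be migrated (towards
MIGRATE_UNMOVABLE pageblocks) before it may be FOLL_LONGTERM-pinned.

#include <linux/mm.h>
#include <linux/mmzone.h>

/* sketch: may this page stay where it is for a long-term pin? */
static inline bool longterm_pinnable_in_place(struct page *page)
{
	if (is_zero_pfn(page_to_pfn(page)))
		return true;
	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
		return false;
	/* new part: don't pin down a movable pageblock, migrate first */
	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
		return false;
	return true;
}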
These things would at least make FOLL_LONGTERM do something reasonable
with respect to pageblocks in most cases and would let an admin have
control over how many "unmovable allocations" user space can make.
It sure comes with a price when FOLL_LONGTERM-pinning a page.
Further, it would let an admin have better control over how much
unmovable data user space can allocate directly for mapping into its
address space
(excluding other unmovable allocations user space can trigger, of course
-- but to me there is a difference between "I am malicious and want to
hurt the kernel by allocating a lot of unmovable memory" and "I am a
sane user and want to make use of features that are a performance
improvement").
It wouldn't cover the extended malicious case I presented (first pin
the whole pageblock, then pin a single page, then unpin the whole
pageblock); an attacker would simply turn 50% of all pageblocks
MIGRATE_UNMOVABLE and only have a single page pinned in each.
And it wouldn't solve the issue of "how much unmovable memory to be
mapped into processes do we as a kernel think is reasonable for a single
user on the system".
With something along above lines (and the malicious cases fixed) I think
I could sleep better at night with FOLL_LONGTERM being allowed for
unprivileged users.
--
Thanks,
David / dhildenb
On 24.11.21 19:35, Jason Gunthorpe wrote:
> On Wed, Nov 24, 2021 at 05:43:58PM +0100, David Hildenbrand wrote:
>> On 24.11.21 16:34, Jason Gunthorpe wrote:
>>> On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote:
>>>
>>>> I'm not aware of any where you can fragment 50% of all pageblocks in the
>>>> system as an unprivileged user essentially consuming almost no memory
>>>> and essentially staying inside well-defined memlock limits. But sure if
>>>> there are "many" people will be able to come up with at least one
>>>> comparable thing. I'll be happy to learn.
>>>
>>> If the concern is that THP's can be DOS'd then any avenue that renders
>>> the system out of THPs is a DOS attack vector. Including all the
>>> normal workloads that people run and already complain that THPs get
>>> exhausted.
>>>
>>> A hostile userspace can only quicken this process.
>>
>> We can not only fragment THP but also easily smaller compound pages,
>> with less impact though (well, as long as people want more than 0.1% per
>> user ...).
>
> My point is as long as userspace can drive this fragmentation, by any
> means, we can never have DOS proof higher order pages, so lets not
> worry so much about one of many ways to create fragmentation.
>
That would be giving up on compound pages (hugetlbfs, THP, ...) on any
current Linux system that does not use ZONE_MOVABLE -- which is not
something I am willing to buy into, just like our customers ;)
See my other mail, the upstream version of my reproducer essentially
shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And
at least to me that's an interesting insight :)
I agree that the more extreme scenarios I can construct are a secondary
concern. But my upstream reproducer just highlights what can easily
happen in reality.
>>>> My position that FOLL_LONGTERM for unprivileged users is a strong no-go
>>>> stands as it is.
>>>
>>> As this basically excludes long standing pre-existing things like
>>> RDMA, XDP, io_uring, and more I don't think this can be the general
>>> answer for mm, sorry.
>>
>> Let's think about options to restrict FOLL_LONGTERM usage:
>
> Which gives me the view that we should be talking about how to make
> high order pages completely DOS proof, not about FOLL_LONGTERM.
Sure, one step at a time ;)
>
> To me that is exactly what ZONE_MOVABLE strives to achieve, and I
> think anyone who cares about QOS around THP must include ZONE_MOVABLE
> in their solution.
For 100% yes.
>
> In all of this I am thinking back to the discussion about the 1GB THP
> proposal which was resoundly shot down on the grounds that 2MB THP
> *doesn't work* today due to the existing fragmentation problems.
The point that "2MB THP" doesn't work is just wrong. pageblocks do their
job very well, but we can end up in corner case situations where more
and more pageblocks are getting fragmented. And people are constantly
improving these corner cases (e.g. proactive compaction).
Usually you have to allocate *a lot* of memory and put the system under
extreme memory pressure, such that unmovable allocations spill into
movable pageblocks and the other way around.
The thing about my reproducer is that it does that without any memory
pressure, and that is the BIG difference to everything else we have in
that regard. You can have an idle 1TiB system running my reproducer and
it will fragment half of all pageblocks in the system while mlocking
~ 1 GiB (262,144 pinned 4 KiB pages, one per 2 MiB pageblock).
The 1 GB THP project is still going on BTW.
>
>> Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK,
>> but instead as something that explicitly matches the differing
>> semantics.
>
> Also a good idea, someone who cares about this should really put
> pinned pages into the cgroup machinery (with correct accounting!)
>
>> At the same time, eventually work on proper alternatives with mmu
>> notifiers (and possibly without the any such limits) where possible
>> and required.
>
> mmu_notifiers is also bad, it just offends a different group of MM
> concerns :)
Yeah, I know, locking nightmare.
>
> Something like io_ring is registering a bulk amount of memory and then
> doing some potentially long operations against it.
The individual operations it performs are comparable to O_DIRECT I think
-- but no expert.
>
> So to use a mmu_notifier scheme you'd have to block the mmu_notifier
> invalidate_range_start until all the operations touching the memory
> finish (and suspend new operations at the same time!).
>
> Blocking the notifier like this locks up the migration/etc threads
> completely, and is destructive to the OOM reclaim.
>
> At least with a pinned page those threads don't even try to touch it
> instead of getting stuck up.
Yes, if only we'd be pinning for a restricted amount of time ...
--
Thanks,
David / dhildenb
On Wed, Nov 24, 2021 at 08:09:42PM +0100, David Hildenbrand wrote:
> That would be giving up on compound pages (hugetlbfs, THP, ...) on any
> current Linux system that does not use ZONE_MOVABLE -- which is not
> something I am willing to buy into, just like our customers ;)
So we have ZONE_MOVABLE but users won't use it?
Then why is the solution to push the same kinds of restrictions as
ZONE_MOVABLE on to ZONE_NORMAL?
> See my other mail, the upstream version of my reproducer essentially
> shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And
> at least to me that's an interesting insight :)
Hmm. To your reproducer it would be nice if we could cgroup control
the # of page blocks a cgroup has pinned. Focusing on # pages pinned
is clearly the wrong metric, I suggested the whole compound earlier,
but your point about the entire page block being ruined makes sense
too.
It means pinned pages will have to be migrated to already ruined page
blocks the cgroup owns, which is a more controlled version of the
FOLL_LONGTERM migration you have been thinking about.
This would effectively limit the fragmentation a hostile process group
can create. If we further treated unmovable cgroup charged kernel
allocations as 'pinned' and routed them to the pinned page blocks it
starts to look really interesting. Kill the cgroup, get all your THPs
back? Fragmentation cannot extend past the cgroup?
ie there are lots of batch workloads that could be interesting there -
wrap the batch in a cgroup, run it, then kill everything and since the
cgroup gives some lifetime clustering to the allocator you get a lot
less fragmentation when the batch is finished, so the next batch gets
more THPs, etc.
There is also sort of an interesting optimization opportunity - many
FOLL_LONGTERM users would be happy to spend more time pinning to get
nice contiguous memory ranges. Might help convince people that the
extra pin time for migrations is worthwhile.
> > Something like io_ring is registering a bulk amount of memory and then
> > doing some potentially long operations against it.
>
> The individual operations it performs are comparable to O_DIRECT I think
Yes, and O_DIRECT can take 10s of seconds in troubled cases with IO
timeouts and things.
Plus io_uring is worse as the buffer is potentially shared by many
in-flight ops and you'd have to block new ops on the buffer and flush all
running ops before any mapping change can happen, all while holding up
a mmu notifier.
Not only is it bad for mm subsystem operations, but it would
significantly harm io_uring performance if a migration hits.
So, I really don't like abusing mmu notifiers for stuff like this. I
didn't like it in virtio either :)
Jason
(sorry, was busy working on other stuff)
>> That would be giving up on compound pages (hugetlbfs, THP, ...) on any
>> current Linux system that does not use ZONE_MOVABLE -- which is not
>> something I am willing to buy into, just like our customers ;)
>
> So we have ZONE_MOVABLE but users won't use it?
It's mostly used in the memory hot(un)plug context and we'll see growing
usage there in the near future (mostly due to dax/kmem, virtio-mem).
One has to be very careful how to size ZONE_MOVABLE, though, and it's
incompatible with various use cases (even huge pages on some
architectures are not movable and cannot be placed on ZONE_MOVABLE ...).
That's why we barely see it getting used automatically outside of memory
hot(un)plug context or when explicitly setup by the admin for a well
fine-tuned system.
>
> Then why is the solution to push the same kinds of restrictions as
> ZONE_MOVABLE on to ZONE_NORMAL?
On any zone except ZONE_DEVICE to be precise. Defragmentation is one of
the main reasons we have pageblocks after all -- besides CMA and page
isolation. If we don't care about de-fragmentation we could just squash
MIGRATE_MOVABLE, MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE into a single
type. But after all that's the only thing that provides us with THP in
most setups out there.
Note that some people (IIRC Mel) even proposed to remove ZONE_MOVABLE
and instead have "sticky" MIGRATE_MOVABLE pageblocks, meaning
MIGRATE_MOVABLE pageblocks that cannot be converted to a different type
or stolen from -- which would mimic the same thing as the pageblocks we
essentially have in ZONE_MOVABLE.
>
>> See my other mail, the upstream version of my reproducer essentially
>> shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And
>> at least to me that's an interesting insight :)
>
> Hmm. To your reproducer it would be nice if we could cgroup control
> the # of page blocks a cgroup has pinned. Focusing on # pages pinned
> is clearly the wrong metric, I suggested the whole compound earlier,
> but your point about the entire page block being ruined makes sense
> too.
# pages pinned is part of the story, but yes, "pinned something inside a
pageblocks" is a better metric.
I would think that this might be complicated to track, though ...
especially once we have multiple cgroups pinning inside a single
pageblock. Hm ...
>
> It means pinned pages will have to be migrated to already ruined page
> blocks the cgroup owns, which is a more controlled version of the
> FOLL_LONGTERM migration you have been thinking about.
MIGRATE_UNMOVABLE pageblocks are already ruined. But we'd need some way
to manage/charge pageblocks per cgroup I guess? That sounds very
interesting.
>
> This would effectively limit the fragmentation a hostile process group
> can create. If we further treated unmovable cgroup charged kernel
> allocations as 'pinned' and routed them to the pinned page blocks it
> start to look really interesting. Kill the cgroup, get all your THPs
> back? Fragmentation cannot extend past the cgroup?
So essentially any accounted unmovable kernel allocation (e.g., page
tables, secretmem, ... ) would try to be placed on a MIGRATE_UNMOVABLE
pageblock "charged" to the respective cgroup?
>
> ie there are lots of batch workloads that could be interesting there -
> wrap the batch in a cgroup, run it, then kill everything and since the
> cgroup gives some lifetime clustering to the allocator you get a lot
> less fragmentation when the batch is finished, so the next batch gets
> more THPs, etc.
>
> There is also sort of an interesting optimization opportunity - many
> FOLL_LONGTERM users would be happy to spend more time pinning to get
> nice contiguous memory ranges. Might help convince people that the
> extra pin time for migrations is worthwhile.
Indeed. And fortunately, huge page users (heavily used in vfio context
and for VMs) wouldn't be affected because they only pin huge pages and
there is nothing to migrate then (well, excluding the MIGRATE_CMA and
ZONE_MOVABLE handling we have already, of course).
>
>>> Something like io_ring is registering a bulk amount of memory and then
>>> doing some potentially long operations against it.
>>
>> The individual operations it performs are comparable to O_DIRECT I think
>
> Yes, and O_DIRECT can take 10s's of seconds in troubled cases with IO
> timeouts and things.
>
I might be wrong about O_DIRECT semantics, though. Staring at
fs/io_uring.c I don't really have a clue how the registered buffers
are getting used. I assume they are used for DMA directly.
--
Thanks,
David / dhildenb