2015-02-02 19:26:56

by Konstantin Khlebnikov

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On 30.01.2015 19:07, Tejun Heo wrote:
> Hey, again.
>
> On Fri, Jan 30, 2015 at 01:27:37AM -0500, Tejun Heo wrote:
>> The previous behavior was pretty unpredictable in terms of shared file
>> ownership too. I wonder whether the better thing to do here is either
>> charging cases like this to the common ancestor or splitting the
>> charge equally among the accessors, which might be doable for ro
>> files.
>
> I've been thinking more about this. It's true that doing per-page
> association allows for avoiding confronting the worst side effects of
> inode sharing head-on, but it is a tradeoff with fairly weak
> justifications. The only thing we're gaining is side-stepping the
> brunt of the problem in an awkward manner, and the loss of clarity in
> taking this compromised position has nasty ramifications when we try
> to connect it with the rest of the world.
>
> I could be missing something major but the more I think about it, it
> looks to me that the right thing to do here is accounting per-inode
> and charging shared inodes to the nearest common ancestor. The
> resulting behavior would be way more logical and predictable than the
> current one, which would make it straightforward to integrate memcg
> with blkcg and writeback.
>
> One of the problems that I can think of off the top of my head is that
> it'd involve more regular use of charge moving; however, this is an
> operation which is per-inode rather than per-page and still gonna be
> fairly infrequent. Another one is that if we move memcg over to this
> behavior, it's likely to affect the behavior on the traditional
> hierarchies too as we sure as hell don't want to switch between the
> two major behaviors dynamically but given that behaviors on inode
> sharing aren't very well supported yet, this can be an acceptable
> change.
>
> Thanks.
>

Well... that might work.

Per-inode/anon_vma memcg will be much more predictable, for sure.

In some cases the memory cgroup for an inode might be assigned
statically. For example, database files might be pinned to a special
cgroup and protected with a low limit (soft guarantee or whatever
it's called nowadays).

For overlay-fs-like containers it might be reasonable to keep the
shared template area in a separate memory cgroup (keep a cgroup mark
at the bind-mounted vfsmount?).

Removing memcg pointer from struct page might be tricky.
It's not clear what to do with truncated pages: either link them
with lru differently or remove from lru right at truncate.
Swap cache pages have the same problem.

The process of moving inodes from memcg to memcg is more or less
doable. Possible solution: keep two pointers to memcg at the inode,
"old" and "new". Each page will be accounted (and linked into the
corresponding lru) to one of them. Separation into "old" and "new"
pages could be done by a flag on struct page or by a border page
index stored in the inode: pages where index < border are accounted
to the new memcg, the rest to the old.
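
Roughly, in made-up pseudo-kernel C (none of these names or fields
exist today; this is just to illustrate the border scheme):

struct inode_memcg_move {
        struct mem_cgroup *old;         /* memcg pages migrate away from */
        struct mem_cgroup *new;         /* memcg pages migrate into */
        pgoff_t border;                 /* index < border means "new" */
};

/* pick the memcg a page at @index is accounted to */
static struct mem_cgroup *page_memcg_for(struct inode_memcg_move *im,
                                         pgoff_t index)
{
        return index < im->border ? im->new : im->old;
}

A background mover would advance the border page by page,
re-accounting each page and relinking it between lrus; once the
border passes the end of the file the "old" pointer can be dropped.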


Keeping shared inodes in the common ancestor is reasonable.
We could schedule asynchronous moving when somebody opens or mmaps
an inode from outside of its current cgroup. But it's not clear when
an inode should be moved in the opposite direction: when should an
inode become private, and how to detect that it's no longer shared?

For example each inode could keep yet another pointer to a memcg
where it will track the subtree of cgroups where it was accessed in
the past 5 minutes or so. From time to time that information goes to
the moving thread.

Actually I don't see other options except that time-based estimation:
tracking all cgroups for each inode is too expensive, and moving pages
from one lru to another is expensive too. So, moving inodes back and
forth at each access from the outside world is not an option.
That should be a rare operation which runs in the background or in
the reclaimer.
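
As a sketch (made-up names again; mem_cgroup_common_ancestor() does
not exist), the per-inode tracking could be as simple as:

struct inode_access_track {
        struct mem_cgroup *recent_root; /* subtree covering recent users */
        unsigned long since;            /* when this window started */
};

static void inode_note_access(struct inode_access_track *t,
                              struct mem_cgroup *memcg)
{
        if (time_after(jiffies, t->since + 5 * 60 * HZ)) {
                /* window expired: restart classification from scratch */
                t->recent_root = memcg;
                t->since = jiffies;
        } else if (t->recent_root != memcg) {
                /* widen the subtree to cover the new accessor */
                t->recent_root =
                        mem_cgroup_common_ancestor(t->recent_root, memcg);
        }
}

The moving thread would then compare recent_root with the inode's
current owner and schedule migration when they diverge.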

--
Konstantin


2015-02-02 19:46:45

by Tejun Heo

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hey,

On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
> Removing memcg pointer from struct page might be tricky.
> It's not clear what to do with truncated pages: either link them
> with lru differently or remove from lru right at truncate.
> Swap cache pages have the same problem.

Hmmm... idk, maybe play another trick with low bits of page->mapping
and make it point to the cgroup after truncation? Do we even care
tho? Can't we just push them to the root and forget about them? They
are pretty transient after all, no?
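
If we did want the trick, something like this (PAGE_MAPPING_MEMCG is
a made-up flag and assumes a free low bit in ->mapping):

#define PAGE_MAPPING_MEMCG      0x4UL   /* assumed-free low bit */

/* after truncation, repoint ->mapping at the memcg, tagged */
static void page_set_memcg_mapping(struct page *page,
                                   struct mem_cgroup *memcg)
{
        page->mapping = (struct address_space *)
                ((unsigned long)memcg | PAGE_MAPPING_MEMCG);
}

static struct mem_cgroup *page_memcg_from_mapping(struct page *page)
{
        unsigned long v = (unsigned long)page->mapping;

        if (!(v & PAGE_MAPPING_MEMCG))
                return NULL;
        return (struct mem_cgroup *)(v & ~PAGE_MAPPING_MEMCG);
}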

> The process of moving inodes from memcg to memcg is more or less
> doable. Possible solution: keep two pointers to memcg at the inode,
> "old" and "new". Each page will be accounted (and linked into the
> corresponding lru) to one of them. Separation into "old" and "new"
> pages could be done by a flag on struct page or by a border page
> index stored in the inode: pages where index < border are accounted
> to the new memcg, the rest to the old.

Yeah, pretty much the same scheme that the per-page cgroup writeback
is using with lower bits of page->mem_cgroup should work with the bits
moved to page->flags.

> Keeping shared inodes in the common ancestor is reasonable.
> We could schedule asynchronous moving when somebody opens or mmaps
> an inode from outside of its current cgroup. But it's not clear when
> an inode should be moved in the opposite direction: when should an
> inode become private, and how to detect that it's no longer shared?
>
> For example each inode could keep yet another pointer to a memcg
> where it will track the subtree of cgroups where it was accessed in
> the past 5 minutes or so. From time to time that information goes to
> the moving thread.
>
> Actually I don't see other options except that time-based estimation:
> tracking all cgroups for each inode is too expensive, and moving pages
> from one lru to another is expensive too. So, moving inodes back and
> forth at each access from the outside world is not an option.
> That should be a rare operation which runs in the background or in
> the reclaimer.

Right, what strategy to use for migration is up for debate, even for
moving to the common ancestor. e.g. should we do that on the first
access? In the other direction, it gets more interesting. Let's say
if we decide to move back an inode to a descendant, what if that
triggers an OOM condition? Do we still go through it and cause OOM in
the target? Do we even want automatic moving in this direction?

For explicit cases, userland can do FADV_DONTNEED, I suppose.

Thanks.

--
tejun

2015-02-03 23:30:58

by Greg Thelen

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo <[email protected]> wrote:
> Hey,
>
> On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
>
>> Keeping shared inodes in the common ancestor is reasonable.
>> We could schedule asynchronous moving when somebody opens or mmaps
>> an inode from outside of its current cgroup. But it's not clear when
>> an inode should be moved in the opposite direction: when should an
>> inode become private, and how to detect that it's no longer shared?
>>
>> For example each inode could keep yet another pointer to a memcg
>> where it will track the subtree of cgroups where it was accessed in
>> the past 5 minutes or so. From time to time that information goes to
>> the moving thread.
>>
>> Actually I don't see other options except that time-based estimation:
>> tracking all cgroups for each inode is too expensive, and moving pages
>> from one lru to another is expensive too. So, moving inodes back and
>> forth at each access from the outside world is not an option.
>> That should be a rare operation which runs in the background or in
>> the reclaimer.
>
> Right, what strategy to use for migration is up for debate, even for
> moving to the common ancestor. e.g. should we do that on the first
> access? In the other direction, it gets more interesting. Let's say
> if we decide to move back an inode to a descendant, what if that
> triggers an OOM condition? Do we still go through it and cause OOM in
> the target? Do we even want automatic moving in this direction?
>
> For explicit cases, userland can do FADV_DONTNEED, I suppose.
>
> Thanks.
>
> --
> tejun

I don't have any killer objections, most of my worries are isolation concerns.

If a machine has several top level memcg trying to get some form of
isolation (using low, min, soft limit) then a shared libc will be
moved to the root memcg where it's not protected from global memory
pressure. At least with the current per page accounting such shared
pages often land into some protected memcg.

If two cgroups collude they can use more memory than their limit and
oom the entire machine. Admittedly the current per-page system isn't
perfect because deleting a memcg which contains mlocked memory
(referenced by a remote memcg) moves the mlocked memory to root
resulting in the same issue. But I'd argue this is more likely with
the RFC because it doesn't involve the cgroup deletion/reparenting. A
possible tweak to shore up the current system is to move such mlocked
pages to the memcg of the surviving locker. When the machine is oom
it's often nice to examine memcg state to determine which container is
using the memory. Tracking down who's contributing to a shared
container is non-trivial.

I actually have a set of patches which add a memcg=M mount option to
memory backed file systems. I was planning on proposing them,
regardless of this RFC, and this discussion makes them even more
appealing. If we go in this direction, then we'd need a similar
notion for disk based filesystems. As Konstantin suggested, it'd be
really nice to specify charge policy on a per file, or directory, or
bind mount basis. This allows shared files to be deterministically
charged to a known container. We'd need to flesh out the policies:
e.g. if two bind mounts each specify different charge targets for the
same inode, I guess we just pick one. Though the nature of this
catch-all shared container is strange. Presumably a machine manager
would need to create it as an unlimited container (or at least as big
as the sum of all shared files) so that any app which decided it wants
to mlock all shared files has a way to without ooming the shared
container. In the current per-page approach it's possible to lock
shared libs. But the machine manager would need to decide how much
system ram to set aside for this catch-all shared container.

When there's large incidental sharing, then things get sticky. A
periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
a small container would pull all pages to the root memcg where they
are exposed to root pressure which breaks isolation. This is
concerning. Perhaps such accesses could be decorated with
(O_NO_MOVEMEM).

So this RFC change will introduce significant change to user space
machine managers and perturb isolation. Is the resulting system
better? It's not clear; it's the devil known vs the devil unknown. Maybe
it'd be easier if the memcg's I'm talking about were not allowed to
share page cache (aka copy-on-read) even for files which are jointly
visible. That would provide today's interface while avoiding the
problematic sharing.

2015-02-04 10:49:19

by Konstantin Khlebnikov

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On 04.02.2015 02:30, Greg Thelen wrote:
> On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo <[email protected]> wrote:
>> Hey,
>>
>> On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
>>
>>> Keeping shared inodes in the common ancestor is reasonable.
>>> We could schedule asynchronous moving when somebody opens or mmaps
>>> an inode from outside of its current cgroup. But it's not clear when
>>> an inode should be moved in the opposite direction: when should an
>>> inode become private, and how to detect that it's no longer shared?
>>>
>>> For example each inode could keep yet another pointer to a memcg
>>> where it will track the subtree of cgroups where it was accessed in
>>> the past 5 minutes or so. From time to time that information goes to
>>> the moving thread.
>>>
>>> Actually I don't see other options except that time-based estimation:
>>> tracking all cgroups for each inode is too expensive, and moving pages
>>> from one lru to another is expensive too. So, moving inodes back and
>>> forth at each access from the outside world is not an option.
>>> That should be a rare operation which runs in the background or in
>>> the reclaimer.
>>
>> Right, what strategy to use for migration is up for debate, even for
>> moving to the common ancestor. e.g. should we do that on the first
>> access? In the other direction, it gets more interesting. Let's say
>> if we decide to move back an inode to a descendant, what if that
>> triggers an OOM condition? Do we still go through it and cause OOM in
>> the target? Do we even want automatic moving in this direction?
>>
>> For explicit cases, userland can do FADV_DONTNEED, I suppose.
>>
>> Thanks.
>>
>> --
>> tejun
>
> I don't have any killer objections, most of my worries are isolation concerns.
>
> If a machine has several top level memcg trying to get some form of
> isolation (using low, min, soft limit) then a shared libc will be
> moved to the root memcg where it's not protected from global memory
> pressure. At least with the current per page accounting such shared
> pages often land into some protected memcg.
>
> If two cgroups collude they can use more memory than their limit and
> oom the entire machine. Admittedly the current per-page system isn't
> perfect because deleting a memcg which contains mlocked memory
> (referenced by a remote memcg) moves the mlocked memory to root
> resulting in the same issue. But I'd argue this is more likely with
> the RFC because it doesn't involve the cgroup deletion/reparenting. A
> possible tweak to shore up the current system is to move such mlocked
> pages to the memcg of the surviving locker. When the machine is oom
> it's often nice to examine memcg state to determine which container is
> using the memory. Tracking down who's contributing to a shared
> container is non-trivial.
>
> I actually have a set of patches which add a memcg=M mount option to
> memory backed file systems. I was planning on proposing them,
> regardless of this RFC, and this discussion makes them even more
> appealing. If we go in this direction, then we'd need a similar
> notion for disk based filesystems. As Konstantin suggested, it'd be
> really nice to specify charge policy on a per file, or directory, or
> bind mount basis. This allows shared files to be deterministically
> charged to a known container. We'd need to flesh out the policies:
> e.g. if two bind mounts each specify different charge targets for the
> same inode, I guess we just pick one. Though the nature of this
> catch-all shared container is strange. Presumably a machine manager
> would need to create it as an unlimited container (or at least as big
> as the sum of all shared files) so that any app which decided it wants
> to mlock all shared files has a way to without ooming the shared
> container. In the current per-page approach it's possible to lock
> shared libs. But the machine manager would need to decide how much
> system ram to set aside for this catch-all shared container.
>
> When there's large incidental sharing, then things get sticky. A
> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
> a small container would pull all pages to the root memcg where they
> are exposed to root pressure which breaks isolation. This is
> concerning. Perhaps such accesses could be decorated with
> (O_NO_MOVEMEM).
>
> So this RFC change will introduce significant change to user space
> machine managers and perturb isolation. Is the resulting system
> better? It's not clear; it's the devil known vs the devil unknown. Maybe
> it'd be easier if the memcg's I'm talking about were not allowed to
> share page cache (aka copy-on-read) even for files which are jointly
> visible. That would provide today's interface while avoiding the
> problematic sharing.
>

I think important shared data must be handled and protected explicitly.
That 'catch-all' shared container could be separated into several
memory cgroups depending on importance of files: glibc protected
with soft guarantee, less important stuff is placed into another
cgroup and cannot push top-priority libraries out of ram.

If shared files are free for use then that 'shared' container must be
ready to keep them in memory. Otherwise this needs to be fixed on the
container side: we could ignore mlock for shared inodes, or the
amount of such vmas might be limited on a per-container basis.

But sharing responsibility for a shared file is a vague concept: the
memory usage and limit of a container must depend only on its own
behavior, not on its neighbors on the same machine.


Generally incidental sharing could be handled as temporary sharing:
after some time the default policy (if the inode isn't pinned to a
memory cgroup) should detect that the inode is no longer shared and
migrate it back into the original cgroup. Of course a task could
provide a hint (O_NO_MOVEMEM), or even the whole memory cgroup where
it runs could be marked as a "scanner" which shouldn't disturb
memory classification.

BTW, the same algorithm which determines who has used an inode
recently could tell who has used a shared inode even if it's pinned
to the shared container.

Another cool option which could fix false sharing after scanning is
FADV_NOREUSE, which would keep the page-cache pages used for reading
and writing via this file descriptor out of the lru and remove them
from the inode when the file descriptor closes. Something like a
private per-struct-file page cache. Probably somebody has already
tried that?


I've missed an obvious solution for controlling the memory cgroup for
files: the project id. This is a persistent integer id stored in the
file system. For now it's implemented only for xfs and used for
quota, which is orthogonal to user/group quotas. We could map some
project ids to memory cgroups. That is more flexible than a
per-superblock mark and has no conflicts like a mark on a bind-mount.

--
Konstantin

2015-02-04 17:07:06

by Tejun Heo

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello,

On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote:
> If a machine has several top level memcg trying to get some form of
> isolation (using low, min, soft limit) then a shared libc will be
> moved to the root memcg where it's not protected from global memory
> pressure. At least with the current per page accounting such shared
> pages often land into some protected memcg.

Yes, it becomes interesting with the low limit as the pressure
direction is reversed but at the same time overcommitting low limits
doesn't lead to a sane setup to begin with as it's asking for global
OOMs anyway, which means that things like libc would end up competing
at least fairly with other pages for global pressure and should stay
in memory under most circumstances, which may or may not be
sufficient.

Hmm.... need to think more about it but this only becomes a problem
with the root cgroup because it doesn't have min setting which is
expected to be inclusive of all descendants, right? Maybe the right
thing to do here is treating the inodes which get pushed to the root
as a special case and we can implement a mechanism where the root is
effectively borrowing from the mins of its children which doesn't have
to be completely correct - e.g. just charge it against all children
repeatedly and if any has min protection, put it under min protection.
IOW, make it the baseload for all of them.

> If two cgroups collude they can use more memory than their limit and
> oom the entire machine. Admittedly the current per-page system isn't
> perfect because deleting a memcg which contains mlocked memory
> (referenced by a remote memcg) moves the mlocked memory to root
> resulting in the same issue. But I'd argue this is more likely with

Hmmm... why does it do that? Can you point me to where it's
happening?

> the RFC because it doesn't involve the cgroup deletion/reparenting. A

One approach could be expanding on the aforementioned scheme and
making all sharing cgroups get charged for the shared inodes they're
using, which should render such collusions entirely pointless.
e.g. let's say we start with the following.

A (usage=48M)
+-B (usage=16M)
\-C (usage=32M)

And let's say, C starts accessing an inode which is 8M and currently
associated with B.

A (usage=48M, hosted= 8M)
+-B (usage= 8M, shared= 8M)
\-C (usage=32M, shared= 8M)

The only extra charging that we'd be doing is charging C with an
extra 8M. Let's say another cgroup D gets created and uses 4M.

A (usage=56M, hosted= 8M)
+-B (usage= 8M, shared= 8M)
+-C (usage=32M, shared= 8M)
\-D (usage= 8M)

and it also accesses the inode.

A (usage=56M, hosted= 8M)
+-B (usage= 8M, shared= 8M)
+-C (usage=32M, shared= 8M)
\-D (usage= 8M, shared= 8M)

We'd need to track the shared charges separately as they should count
only once in the parent but that shouldn't be too hard. The problem
here is that we'd need to track which inodes are being accessed by
which children, which can get painful for things like libc. Maybe we
can limit it to be level-by-level - track sharing only from the
immediate children and always move a shared inode at one level at a
time. That would lose some ability to track the sharing beyond the
immediate children but it should be enough to solve the root case and
allow us to adapt to changing usage pattern over time. Given that
sharing is mostly a corner case, this could be good enough.

Now, say D accesses a 4M area of the inode which hasn't been accessed
by others yet. We'd want it to look like the following.

A (usage=64M, hosted=16M)
+-B (usage= 8M, shared=16M)
+-C (usage=32M, shared=16M)
\-D (usage= 8M, shared=16M)

But charging it to B, C at the same time prolly wouldn't be
particularly convenient. We can prolly just do D -> A charging and
let B and C sort themselves out later. Note that such charging would
still maintain the overall integrity of memory limits. The only thing
which may overflow is the pseudo shared charges to keep sharing in
check and dealing with them later when B and C try to create further
charges should be completely fine.

Note that we can also try to split the shared charge across the users;
however, charging the full amount seems like the better approach to
me. We don't have any way to tell how the usage is distributed
anyway. For use cases where this sort of sharing is expected, I think
it's perfectly reasonable to provision the sharing children to have
enough to accommodate the possible full size of the shared resource.
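
In toy C (nothing here is a real kernel interface; it just shows the
arithmetic: every sharer is charged the full size, the host counts
the pages once):

struct toy_memcg {
        struct toy_memcg *parent;
        unsigned long usage;    /* private pages charged here */
        unsigned long shared;   /* full size of shared inodes we use */
        unsigned long hosted;   /* shared pages we host for children */
};

/* what gets checked against this cgroup's limit */
static unsigned long footprint(struct toy_memcg *cg)
{
        return cg->usage + cg->shared;
}

/* an inode of @size moves up to be hosted by @host */
static void host_inode(struct toy_memcg *host, unsigned long size)
{
        host->hosted += size;   /* the pages live here, counted once */
}

/* @child becomes one of the sharers of that inode */
static void add_sharer(struct toy_memcg *child, unsigned long size)
{
        child->shared += size;  /* charged in full to every sharer */
}

In the example above, B, C and D each end up with shared=8M while A
carries hosted=8M once, so colluding on a shared inode gains nothing.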

> possible tweak to shore up the current system is to move such mlocked
> pages to the memcg of the surviving locker. When the machine is oom
> it's often nice to examine memcg state to determine which container is
> using the memory. Tracking down who's contributing to a shared
> container is non-trivial.
>
> I actually have a set of patches which add a memcg=M mount option to
> memory backed file systems. I was planning on proposing them,
> regardless of this RFC, and this discussion makes them even more
> appealing. If we go in this direction, then we'd need a similar
> notion for disk based filesystems. As Konstantin suggested, it'd be
> really nice to specify charge policy on a per file, or directory, or
> bind mount basis. This allows shared files to be deterministically

I'm not too sure about that. We might add that later if absolutely
justifiable but designing assuming that level of intervention from
userland may not be such a good idea.

> When there's large incidental sharing, then things get sticky. A
> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
> a small container would pull all pages to the root memcg where they
> are exposed to root pressure which breaks isolation. This is
> concerning. Perhaps such accesses could be decorated with
> (O_NO_MOVEMEM).

If such thing is really necessary, FADV_NOREUSE would be a better
indicator; however, yes, such incidental sharing is easier to handle
with per-page scheme as such scanner can be limited in the number of
pages it can carry throughout its operation regardless of which cgroup
it's looking at. It still has the nasty corner case where random
target cgroups can latch onto pages faulted in by the scanner and
keep accessing them tho, so, even now, FADV_NOREUSE would be a good
idea. Note that such scanning, if repeated on cgroups under high
memory pressure, is *likely* to accumulate residue escaped pages and
if such a management cgroup is transient, those escaped pages will
accumulate over time outside any limit in a way which is unpredictable
and invisible.

> So this RFC change will introduce significant change to user space
> machine managers and perturb isolation. Is the resulting system
> better? It's not clear; it's the devil known vs the devil unknown. Maybe
> it'd be easier if the memcg's I'm talking about were not allowed to
> share page cache (aka copy-on-read) even for files which are jointly
> visible. That would provide today's interface while avoiding the
> problematic sharing.

Yeah, compatibility would be the stickiest part.

Thanks.

--
tejun

2015-02-04 17:15:24

by Tejun Heo

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello,

On Wed, Feb 04, 2015 at 01:49:08PM +0300, Konstantin Khlebnikov wrote:
> I think important shared data must be handled and protected explicitly.
> That 'catch-all' shared container could be separated into several

I kinda disagree. That'd be a major pain in the ass to use and you
wouldn't know when you got something wrong unless it actually goes
wrong and you know enough about the inner workings to look for that.
Doesn't sound like a sound design to me.

> memory cgroups depending on importance of files: glibc protected
> with soft guarantee, less important stuff is placed into another
> cgroup and cannot push top-priority libraries out of ram.

That sounds extremely painful.

> If shared files are free for use then that 'shared' container must be
> ready to keep them in memory. Otherwise this needs to be fixed on the
> container side: we could ignore mlock for shared inodes, or the
> amount of such vmas might be limited on a per-container basis.
>
> But sharing responsibility for a shared file is a vague concept: the
> memory usage and limit of a container must depend only on its own
> behavior, not on its neighbors on the same machine.
>
>
> Generally incidental sharing could be handled as temporary sharing:
> after some time the default policy (if the inode isn't pinned to a
> memory cgroup) should detect that the inode is no longer shared and
> migrate it back into the original cgroup. Of course a task could
> provide a hint (O_NO_MOVEMEM), or even the whole memory cgroup where
> it runs could be marked as a "scanner" which shouldn't disturb
> memory classification.

Ditto for annotating each file individually. Let's please try to stay
away from things like that. That's mostly a cop-out which is unlikely
to actually benefit the majority of users.

> I've missed an obvious solution for controlling the memory cgroup for
> files: the project id. This is a persistent integer id stored in the
> file system. For now it's implemented only for xfs and used for
> quota, which is orthogonal to user/group quotas. We could map some
> project ids to memory cgroups. That is more flexible than a
> per-superblock mark and has no conflicts like a mark on a bind-mount.

Again, hell, no.

Thanks.

--
tejun

2015-02-04 17:58:29

by Konstantin Khlebnikov

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On 04.02.2015 20:15, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 04, 2015 at 01:49:08PM +0300, Konstantin Khlebnikov wrote:
>> I think important shared data must be handled and protected explicitly.
>> That 'catch-all' shared container could be separated into several
>
> I kinda disagree. That'd be a major pain in the ass to use and you
> wouldn't know when you got something wrong unless it actually goes
> wrong and you know enough about the inner workings to look for that.
> Doesn't sound like a sound design to me.
>
>> memory cgroups depending on importance of files: glibc protected
>> with soft guarantee, less important stuff is placed into another
>> cgroup and cannot push top-priority libraries out of ram.
>
> That sounds extremely painful.

I mean this thing _could_ be controlled more precisely. Even if the
default policy works for 99% of users, manual override is still
required for the other 1%, or for when something goes wrong.

>
>> If shared files are free for use then that 'shared' container must be
>> ready to keep them in memory. Otherwise this needs to be fixed on the
>> container side: we could ignore mlock for shared inodes, or the
>> amount of such vmas might be limited on a per-container basis.
>>
>> But sharing responsibility for a shared file is a vague concept: the
>> memory usage and limit of a container must depend only on its own
>> behavior, not on its neighbors on the same machine.
>>
>>
>> Generally incidental sharing could be handled as temporary sharing:
>> after some time the default policy (if the inode isn't pinned to a
>> memory cgroup) should detect that the inode is no longer shared and
>> migrate it back into the original cgroup. Of course a task could
>> provide a hint (O_NO_MOVEMEM), or even the whole memory cgroup where
>> it runs could be marked as a "scanner" which shouldn't disturb
>> memory classification.
>
> Ditto for annotating each file individually. Let's please try to stay
> away from things like that. That's mostly a cop-out which is unlikely
> to actually benefit the majority of users.

A process which scans all files once isn't such a rare use case.
Linux still sometimes cannot handle this pattern.

>
>> I've missed an obvious solution for controlling the memory cgroup for
>> files: the project id. This is a persistent integer id stored in the
>> file system. For now it's implemented only for xfs and used for
>> quota, which is orthogonal to user/group quotas. We could map some
>> project ids to memory cgroups. That is more flexible than a
>> per-superblock mark and has no conflicts like a mark on a bind-mount.
>
> Again, hell, no.
>
> Thanks.
>

--
Konstantin

2015-02-04 18:28:17

by Tejun Heo

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Wed, Feb 04, 2015 at 08:58:21PM +0300, Konstantin Khlebnikov wrote:
> >>Generally incidental sharing could be handled as temporary sharing:
> >>after some time the default policy (if the inode isn't pinned to a
> >>memory cgroup) should detect that the inode is no longer shared and
> >>migrate it back into the original cgroup. Of course a task could
> >>provide a hint (O_NO_MOVEMEM), or even the whole memory cgroup where
> >>it runs could be marked as a "scanner" which shouldn't disturb
> >>memory classification.
> >
> >Ditto for annotating each file individually. Let's please try to stay
> >away from things like that. That's mostly a cop-out which is unlikely
> >to actually benefit the majority of users.
>
> A process which scans all files once isn't such a rare use case.
> Linux still sometimes cannot handle this pattern.

Yeah, sure, tagging usages with m/fadvise's is fine. We can just look
at the policy and ignore them for the purpose of determining who's
using the inode, but let's stay away from tagging the files on the
filesystem if at all possible.

Thanks.

--
tejun

2015-02-04 23:51:07

by Greg Thelen

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma


On Wed, Feb 04 2015, Tejun Heo wrote:

> Hello,
>
> On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote:
>> If a machine has several top level memcg trying to get some form of
>> isolation (using low, min, soft limit) then a shared libc will be
>> moved to the root memcg where it's not protected from global memory
>> pressure. At least with the current per page accounting such shared
>> pages often land into some protected memcg.
>
> Yes, it becomes interesting with the low limit as the pressure
> direction is reversed but at the same time overcommitting low limits
> doesn't lead to a sane setup to begin with as it's asking for global
> OOMs anyway, which means that things like libc would end up competing
> at least fairly with other pages for global pressure and should stay
> in memory under most circumstances, which may or may not be
> sufficient.

I agree. Clarification... I don't plan to overcommit low or min limits.
On machines without overcommited min limits the existing system offers
some protection for shared libs from global reclaim. Pushing them to
root doesn't.

> Hmm.... need to think more about it but this only becomes a problem
> with the root cgroup because it doesn't have min setting which is
> expected to be inclusive of all descendants, right? Maybe the right
> thing to do here is treating the inodes which get pushed to the root
> as a special case and we can implement a mechanism where the root is
> effectively borrowing from the mins of its children which doesn't have
> to be completely correct - e.g. just charge it against all children
> repeatedly and if any has min protection, put it under min protection.
> IOW, make it the baseload for all of them.

I think the linux-next low (and the TBD min) limits also have the
problem for more than just the root memcg. I'm thinking of a 2M file
shared between C and D below. The file will be charged to common parent
B.

A
+-B (usage=2M lim=3M min=2M)
+-C (usage=0 lim=2M min=1M shared_usage=2M)
+-D (usage=0 lim=2M min=1M shared_usage=2M)
\-E (usage=0 lim=2M min=0)

The problem arises if A/B/E allocates more than 1M of private
reclaimable file data. This pushes A/B into reclaim which will reclaim
both the shared file from A/B and private file from A/B/E. In contrast,
the current per-page memcg would've protected the shared file in either
C or D leaving A/B reclaim to only attack A/B/E.

Pinning the shared file to either C or D, using TBD policy such as mount
option, would solve this for tightly shared files. But for wide fanout
file (libc) the admin would need to assign a global bucket and this
would be a pain to size due to various job requirements.

>> If two cgroups collude they can use more memory than their limit and
>> oom the entire machine. Admittedly the current per-page system isn't
>> perfect because deleting a memcg which contains mlocked memory
>> (referenced by a remote memcg) moves the mlocked memory to root
>> resulting in the same issue. But I'd argue this is more likely with
>
> Hmmm... why does it do that? Can you point me to where it's
> happening?

My mistake, I was thinking of older kernels which reparent memory.
Though I can't say v3.19-rc7 handles this collusion any better. Instead
of reparenting the mlocked memory, it's left in an invisible (offline)
memcg. Unlike older kernels the memory doesn't appear in
root/memory.stat[unevictable]; instead it's buried in
root/memory.stat[total_unevictable] which includes mlocked memory in
visible (online) and invisible (offline) children.

>> the RFC because it doesn't involve the cgroup deletion/reparenting. A
>
> One approach could be expanding on the aforementioned scheme and
> making all sharing cgroups get charged for the shared inodes they're
> using, which should render such collusions entirely pointless.
> e.g. let's say we start with the following.
>
> A (usage=48M)
> +-B (usage=16M)
> \-C (usage=32M)
>
> And let's say, C starts accessing an inode which is 8M and currently
> associated with B.
>
> A (usage=48M, hosted= 8M)
> +-B (usage= 8M, shared= 8M)
> \-C (usage=32M, shared= 8M)
>
> The only extra charging that we'd be doing is charging C with an
> extra 8M. Let's say another cgroup D gets created and uses 4M.
>
> A (usage=56M, hosted= 8M)
> +-B (usage= 8M, shared= 8M)
> +-C (usage=32M, shared= 8M)
> \-D (usage= 8M)
>
> and it also accesses the inode.
>
> A (usage=56M, hosted= 8M)
> +-B (usage= 8M, shared= 8M)
> +-C (usage=32M, shared= 8M)
> \-D (usage= 8M, shared= 8M)
>
> We'd need to track the shared charges separately as they should count
> only once in the parent but that shouldn't be too hard. The problem
> here is that we'd need to track which inodes are being accessed by
> which children, which can get painful for things like libc. Maybe we
> can limit it to be level-by-level - track sharing only from the
> immediate children and always move a shared inode at one level at a
> time. That would lose some ability to track the sharing beyond the
> immediate children but it should be enough to solve the root case and
> allow us to adapt to changing usage pattern over time. Given that
> sharing is mostly a corner case, this could be good enough.
>
> Now, say D accesses a 4M area of the inode which hasn't been accessed
> by others yet. We'd want it to look like the following.
>
> A (usage=64M, hosted=16M)
> +-B (usage= 8M, shared=16M)
> +-C (usage=32M, shared=16M)
> \-D (usage= 8M, shared=16M)
>
> But charging it to B, C at the same time prolly wouldn't be
> particularly convenient. We can prolly just do D -> A charging and
> let B and C sort themselves out later. Note that such charging would
> still maintain the overall integrity of memory limits. The only thing
> which may overflow is the pseudo shared charges to keep sharing in
> check and dealing with them later when B and C try to create further
> charges should be completely fine.
>
> Note that we can also try to split the shared charge across the users;
> however, charging the full amount seems like the better approach to
> me. We don't have any way to tell how the usage is distributed
> anyway. For use cases where this sort of sharing is expected, I think
> it's perfectly reasonable to provision the sharing children to have
> enough to accommodate the possible full size of the shared resource.
>
>> possible tweak to shore up the current system is to move such mlocked
>> pages to the memcg of the surviving locker. When the machine is oom
>> it's often nice to examine memcg state to determine which container is
>> using the memory. Tracking down who's contributing to a shared
>> container is non-trivial.
>>
>> I actually have a set of patches which add a memcg=M mount option to
>> memory backed file systems. I was planning on proposing them,
>> regardless of this RFC, and this discussion makes them even more
>> appealing. If we go in this direction, then we'd need a similar
>> notion for disk based filesystems. As Konstantin suggested, it'd be
>> really nice to specify charge policy on a per file, or directory, or
>> bind mount basis. This allows shared files to be deterministically
>
> I'm not too sure about that. We might add that later if absolutely
> justifiable but designing assuming that level of intervention from
> userland may not be such a good idea.
>
>> When there's large incidental sharing, then things get sticky. A
>> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
>> a small container would pull all pages to the root memcg where they
>> are exposed to root pressure which breaks isolation. This is
>> concerning. Perhaps such accesses could be decorated with
>> (O_NO_MOVEMEM).
>
> If such thing is really necessary, FADV_NOREUSE would be a better
> indicator; however, yes, such incidental sharing is easier to handle
> with per-page scheme as such scanner can be limited in the number of
> pages it can carry throughout its operation regardless of which cgroup
> it's looking at. It still has the nasty corner case where random
> target cgroups can latch onto pages faulted in by the scanner and
> keep accessing them tho, so, even now, FADV_NOREUSE would be a good
> idea. Note that such scanning, if repeated on cgroups under high
> memory pressure, is *likely* to accumulate residue escaped pages and
> if such a management cgroup is transient, those escaped pages will
> accumulate over time outside any limit in a way which is unpredictable
> and invisible.
>
>> So this RFC change will introduce significant change to user space
>> machine managers and perturb isolation. Is the resulting system
>> better? It's not clear; it's the devil known vs the devil unknown. Maybe
>> it'd be easier if the memcg's I'm talking about were not allowed to
>> share page cache (aka copy-on-read) even for files which are jointly
>> visible. That would provide today's interface while avoiding the
>> problematic sharing.
>
> Yeah, compatibility would be the stickiest part.
>
> Thanks.

2015-02-05 13:15:20

by Tejun Heo

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello, Greg.

On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote:
> I think the linux-next low (and the TBD min) limits also have the
> problem for more than just the root memcg. I'm thinking of a 2M file
> shared between C and D below. The file will be charged to common parent
> B.
>
> A
> +-B (usage=2M lim=3M min=2M)
> +-C (usage=0 lim=2M min=1M shared_usage=2M)
> +-D (usage=0 lim=2M min=1M shared_usage=2M)
> \-E (usage=0 lim=2M min=0)
>
> The problem arises if A/B/E allocates more than 1M of private
> reclaimable file data. This pushes A/B into reclaim which will reclaim
> both the shared file from A/B and private file from A/B/E. In contrast,
> the current per-page memcg would've protected the shared file in either
> C or D leaving A/B reclaim to only attack A/B/E.
>
> Pinning the shared file to either C or D, using TBD policy such as mount
> option, would solve this for tightly shared files. But for wide fanout
> file (libc) the admin would need to assign a global bucket and this
> would be a pain to size due to various job requirements.

Shouldn't we be able to handle it the same way as I proposed for
handling sharing? The above would look like

A
+-B (usage=2M lim=3M min=2M hosted_usage=2M)
+-C (usage=0 lim=2M min=1M shared_usage=2M)
+-D (usage=0 lim=2M min=1M shared_usage=2M)
\-E (usage=0 lim=2M min=0)

Now, we don't wanna use B's min verbatim on the hosted inodes shared
by children but we're unconditionally charging the shared amount to
all sharing children, which means that we're eating into the min
settings of all participating children, so, we should be able to use
sum of all sharing children's min-covered amount as the inode's min,
which of course is to be contained inside the min of the parent.

Above, we're charging 2M to C and D, each of which has 1M min which is
being consumed by the shared charge (the shared part won't get
reclaimed from the internal pressure of children, so we're really
taking that part away from it). Summing them up, the shared inode
would have 2M protection which is honored as long as B as a whole is
under its 3M limit. This is similar to creating a dedicated child for
each shared resource for low limits. The downside is that we end up
guarding the shared inodes more than non-shared ones, but, after all,
we're charging it to everybody who's using it.
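
IOW, something like this (toy code, not a real interface):

struct toy_memcg {
        unsigned long min;      /* configured min protection */
        unsigned long shared;   /* shared charge against this cgroup */
};

/*
 * Min protection of a hosted inode: the sum of the min-covered part
 * of each sharer's shared charge, clamped by the host's own min.
 */
static unsigned long hosted_inode_min(struct toy_memcg *host,
                                      struct toy_memcg **sharer, int n)
{
        unsigned long protected = 0;
        int i;

        for (i = 0; i < n; i++) {
                struct toy_memcg *c = sharer[i];

                /* only the min-covered part of the shared charge counts */
                protected += c->shared < c->min ? c->shared : c->min;
        }
        return protected < host->min ? protected : host->min;
}

Above, C and D each contribute min(2M, 1M) = 1M, so the inode gets 2M
of protection, which fits inside B's 2M min.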

Would something like this work?

Thanks.

--
tejun

2015-02-05 22:05:28

by Greg Thelen

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma


On Thu, Feb 05 2015, Tejun Heo wrote:

> Hello, Greg.
>
> On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote:
>> I think the linux-next low (and the TBD min) limits also have the
>> problem for more than just the root memcg. I'm thinking of a 2M file
>> shared between C and D below. The file will be charged to common parent
>> B.
>>
>> A
>> +-B (usage=2M lim=3M min=2M)
>> +-C (usage=0 lim=2M min=1M shared_usage=2M)
>> +-D (usage=0 lim=2M min=1M shared_usage=2M)
>> \-E (usage=0 lim=2M min=0)
>>
>> The problem arises if A/B/E allocates more than 1M of private
>> reclaimable file data. This pushes A/B into reclaim which will reclaim
>> both the shared file from A/B and private file from A/B/E. In contrast,
>> the current per-page memcg would've protected the shared file in either
>> C or D leaving A/B reclaim to only attack A/B/E.
>>
>> Pinning the shared file to either C or D, using TBD policy such as mount
>> option, would solve this for tightly shared files. But for wide fanout
>> file (libc) the admin would need to assign a global bucket and this
>> would be a pain to size due to various job requirements.
>
> Shouldn't we be able to handle it the same way as I proposed for
> handling sharing? The above would look like
>
> A
> +-B (usage=2M lim=3M min=2M hosted_usage=2M)
> +-C (usage=0 lim=2M min=1M shared_usage=2M)
> +-D (usage=0 lim=2M min=1M shared_usage=2M)
> \-E (usage=0 lim=2M min=0)
>
> Now, we don't wanna use B's min verbatim on the hosted inodes shared
> by children but we're unconditionally charging the shared amount to
> all sharing children, which means that we're eating into the min
> settings of all participating children, so, we should be able to use
> sum of all sharing children's min-covered amount as the inode's min,
> which of course is to be contained inside the min of the parent.
>
> Above, we're charging 2M to C and D, each of which has 1M min which is
> being consumed by the shared charge (the shared part won't get
> reclaimed from the internal pressure of children, so we're really
> taking that part away from it). Summing them up, the shared inode
> would have 2M protection which is honored as long as B as a whole is
> under its 3M limit. This is similar to creating a dedicated child for
> each shared resource for low limits. The downside is that we end up
> guarding the shared inodes more than non-shared ones, but, after all,
> we're charging it to everybody who's using it.
>
> Would something like this work?

Maybe, but I want to understand more about how pressure works in the
child. As C (or D) allocates non shared memory does it perform reclaim
to ensure that its (C.usage + C.shared_usage < C.lim). Given C's
shared_usage is linked into B.LRU it wouldn't be naturally reclaimable
by C. Are you thinking that charge failures on cgroups with non zero
shared_usage would, as needed, induce reclaim of parent's hosted_usage?

2015-02-05 22:25:28

by Tejun Heo

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hey,

On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote:
> > A
> > +-B (usage=2M lim=3M min=2M hosted_usage=2M)
> > +-C (usage=0 lim=2M min=1M shared_usage=2M)
> > +-D (usage=0 lim=2M min=1M shared_usage=2M)
> > \-E (usage=0 lim=2M min=0)
...
> Maybe, but I want to understand more about how pressure works in the
> child. As C (or D) allocates non shared memory does it perform reclaim
> to ensure that its (C.usage + C.shared_usage < C.lim). Given C's

Yes.

> shared_usage is linked into B.LRU it wouldn't be naturally reclaimable
> by C. Are you thinking that charge failures on cgroups with non zero
> shared_usage would, as needed, induce reclaim of parent's hosted_usage?

Hmmm.... I'm not really sure but why not? If we properly account for
the low protection when pushing inodes to the parent, I don't think
it'd break anything. IOW, allow the amount beyond the sum of low
limits to be reclaimed when one of the sharers is under pressure.
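
With the toy hosted_inode_min() from my earlier mail (still nothing
real, and assuming toy_memcg also carries the ->hosted total):

/* what child pressure may reclaim from the host's shared inodes */
static unsigned long hosted_reclaimable(struct toy_memcg *host,
                                        struct toy_memcg **sharer, int n)
{
        return host->hosted - hosted_inode_min(host, sharer, n);
}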

Thanks.

--
tejun

2015-02-06 00:03:41

by Greg Thelen

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma


On Thu, Feb 05 2015, Tejun Heo wrote:

> Hey,
>
> On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote:
>> > A
>> > +-B (usage=2M lim=3M min=2M hosted_usage=2M)
>> > +-C (usage=0 lim=2M min=1M shared_usage=2M)
>> > +-D (usage=0 lim=2M min=1M shared_usage=2M)
>> > \-E (usage=0 lim=2M min=0)
> ...
>> Maybe, but I want to understand more about how pressure works in the
>> child. As C (or D) allocates non shared memory does it perform reclaim
>> to ensure that its (C.usage + C.shared_usage < C.lim). Given C's
>
> Yes.
>
>> shared_usage is linked into B.LRU it wouldn't be naturally reclaimable
>> by C. Are you thinking that charge failures on cgroups with non zero
>> shared_usage would, as needed, induce reclaim of parent's hosted_usage?
>
> Hmmm.... I'm not really sure but why not? If we properly account for
> the low protection when pushing inodes to the parent, I don't think
> it'd break anything. IOW, allow the amount beyond the sum of low
> limits to be reclaimed when one of the sharers is under pressure.
>
> Thanks.

I'm not saying that it'd break anything. I think it's required that
children perform reclaim on shared data hosted in the parent. The child
is limited by shared_usage, so it needs the ability to reclaim it. So
I think we're in agreement: the child will reclaim the parent's
hosted_usage when the child is charged for shared_usage. Ideally the
only parental memory reclaimed in this situation would be shared. But
I think (though I can't claim to have followed the new memcg
philosophy discussions) that internal nodes in the cgroup tree
(i.e. parents) do not have any resources charged directly to them.
All resources are charged to leaf cgroups which linger until resources
are uncharged. Thus the LRUs of the parent will only contain hosted
(shared) memory. This thankfully focuses parental reclaim on shared
pages. Child pressure will, unfortunately, reclaim shared pages used
by any container. But if shared pages were charged to all sharing
containers, then it will help relieve pressure in the caller.

So this is a system which charges all cgroups using a shared inode
(recharge on read) for all resident pages of that shared inode. There's
only one copy of the page in memory on just one LRU, but the page may be
charged to multiple containers' (shared_)usage.

Perhaps I missed it, but what happens when a child's limit is
insufficient to accept all pages shared by its siblings? Example
starting with 2M cached of a shared file:

A
+-B (usage=2M lim=3M hosted_usage=2M)
+-C (usage=0 lim=2M shared_usage=2M)
+-D (usage=0 lim=2M shared_usage=2M)
\-E (usage=0 lim=1M shared_usage=0)

If E faults in a new 4K page within the shared file, then E is a sharing
participant, so it'd be charged the 2M+4K, which pushes E over its
limit.

2015-02-06 14:17:54

by Tejun Heo

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello, Greg.

On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote:
> So this is a system which charges all cgroups using a shared inode
> (recharge on read) for all resident pages of that shared inode. There's
> only one copy of the page in memory on just one LRU, but the page may be
> charged to multiple containers' (shared_)usage.

Yeap.

> Perhaps I missed it, but what happens when a child's limit is
> insufficient to accept all pages shared by its siblings? Example
> starting with 2M cached of a shared file:
>
> A
> +-B (usage=2M lim=3M hosted_usage=2M)
> +-C (usage=0 lim=2M shared_usage=2M)
> +-D (usage=0 lim=2M shared_usage=2M)
> \-E (usage=0 lim=1M shared_usage=0)
>
> If E faults in a new 4K page within the shared file, then E is a sharing
>> participant, so it'd be charged the 2M+4K, which pushes E over its
> limit.

OOM? It shouldn't be participating in sharing of an inode if it can't
match others' protection on the inode, I think. What we're doing now
w/ page based charging is kinda unfair because in the situations like
above the one under pressure can end up siphoning off of the larger
cgroups' protection if they actually use overlapping areas; however,
for disjoint areas, per-page charging would behave correctly.

So, this part comes down to the same question - whether multiple
cgroups accessing disjoint areas of a single inode is an important
enough use case. If we say yes to that, we better make writeback
support that too.

Thanks.

--
tejun

2015-02-06 23:43:34

by Greg Thelen

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Fri, Feb 6, 2015 at 6:17 AM, Tejun Heo <[email protected]> wrote:
> Hello, Greg.
>
> On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote:
>> So this is a system which charges all cgroups using a shared inode
>> (recharge on read) for all resident pages of that shared inode. There's
>> only one copy of the page in memory on just one LRU, but the page may be
>> charged to multiple containers' (shared_)usage.
>
> Yeap.
>
>> Perhaps I missed it, but what happens when a child's limit is
>> insufficient to accept all pages shared by its siblings? Example
>> starting with 2M cached of a shared file:
>>
>> A
>> +-B (usage=2M lim=3M hosted_usage=2M)
>> +-C (usage=0 lim=2M shared_usage=2M)
>> +-D (usage=0 lim=2M shared_usage=2M)
>> \-E (usage=0 lim=1M shared_usage=0)
>>
>> If E faults in a new 4K page within the shared file, then E is a sharing
>> participant, so it'd be charged the 2M+4K, which pushes E over its
>> limit.
>
> OOM? It shouldn't be participating in sharing of an inode if it can't
> match others' protection on the inode, I think. What we're doing now
> w/ page based charging is kinda unfair because in the situations like
> above the one under pressure can end up siphoning off of the larger
> cgroups' protection if they actually use overlapping areas; however,
> for disjoint areas, per-page charging would behave correctly.
>
> So, this part comes down to the same question - whether multiple
> cgroups accessing disjoint areas of a single inode is an important
> enough use case. If we say yes to that, we better make writeback
> support that too.

If cgroups are about isolation then writing to shared files should be
rare, so I'm willing to say that we don't need to handle shared
writers well. Shared readers seem like a more valuable use case
(thin provisioning). I'm getting overwhelmed with the thought
exercise of automatically moving inodes to common ancestors and back
charging the sharers for shared_usage. I haven't wrapped my head
around how these shared data pages will get protected. It seems like
they'd no longer be protected by child min watermarks.

So I know this thread opened with the claim "both memcg and blkcg must
be looking at the same picture. Deviating them is highly likely to
lead to long-term issues forcing us to look at this again anyway, only
with far more baggage." But I'm still wondering if the following is
simpler:
(1) leave memcg as a per page controller.
(2) maintain a per inode i_memcg which is set to the common dirtying
ancestor. If not shared then it'll point to the memcg that the page
was charged to.
(3) when memcg dirtying page pressure is seen, walk up the cgroup tree
writing dirty inodes, this will write shared inodes using blkcg
priority of the respective levels.
(4) background limit wb_check_background_flush() and time based
wb_check_old_data_flush() can feel free to attack shared inodes to
hopefully restore them to non-shared state.
For non-shared inodes, this should behave the same. For shared inodes
it should only affect those in the hierarchy which is sharing.
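
For point (2), a sketch with assumed names (the i_memcg field and
mem_cgroup_common_ancestor() don't exist; this is just the idea):

/* called whenever a page of @inode is dirtied by a task in @memcg */
static void inode_update_dirtier(struct inode *inode,
                                 struct mem_cgroup *memcg)
{
        struct mem_cgroup *cur = inode->i_memcg;        /* assumed field */

        if (!cur)
                inode->i_memcg = memcg;         /* first dirtier owns it */
        else if (cur != memcg)
                /* shared: walk up to the common dirtying ancestor */
                inode->i_memcg = mem_cgroup_common_ancestor(cur, memcg);
}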

2015-02-07 14:38:46

by Tejun Heo

Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello, Greg.

On Fri, Feb 06, 2015 at 03:43:11PM -0800, Greg Thelen wrote:
> If cgroups are about isolation then writing to shared files should be
> rare, so I'm willing to say that we don't need to handle shared
> writers well. Shared readers seem like a more valuable use case
> (thin provisioning). I'm getting overwhelmed with the thought
> exercise of automatically moving inodes to common ancestors and back
> charging the sharers for shared_usage. I haven't wrapped my head
> around how these shared data pages will get protected. It seems like
> they'd no longer be protected by child min watermarks.

Yes, this is challenging. My current thought is around taking the
maximum of the low settings of the sharing children, but I need to
think more about it. One problem is that the shared inodes will
preemptively take away the amount shared from the children's low
protection. They won't compete fairly with other inodes or anons but
they can't really as they don't really belong to any single sharer.

> So I know this thread opened with the claim "both memcg and blkcg must
> be looking at the same picture. Deviating them is highly likely to
> lead to long-term issues forcing us to look at this again anyway, only
> with far more baggage." But I'm still wondering if the following is
> simpler:
> (1) leave memcg as a per-page controller.
> (2) maintain a per-inode i_memcg which is set to the common dirtying
> ancestor. If not shared, it'll point to the memcg that the page
> was charged to.
> (3) when memcg dirty page pressure is seen, walk up the cgroup tree
> writing dirty inodes; this will write shared inodes using the blkcg
> priority of the respective levels.
> (4) the background-limit wb_check_background_flush() and the time-based
> wb_check_old_data_flush() can feel free to attack shared inodes to
> hopefully restore them to a non-shared state.
> For non-shared inodes, this should behave the same. For shared inodes
> it should only affect those in the hierarchy which is sharing.

The thing which breaks when you decouple what memcg sees from the
rest of the stack is that the amount of memory which may be available
to a given cgroup, and how much of it is dirty, is the main linkage
propagating IO pressure to the actual dirtying tasks. If you decouple
the two worldviews, you lose the ability to propagate IO pressure to
dirtiers in a controlled manner, and that's why anything inside a
memcg currently always triggers the direct reclaim path instead of
being properly dirty-throttled.

You can argue that an inode being actively dirtied from multiple
cgroups is a rare case which we can sweep under the rug, and that
*might* be the case, but I have a nagging feeling that that would be
a decision made merely out of immediate convenience. I would much
prefer having a well-defined model of sharing inodes and anons across
cgroups, so that the behaviors shown in those cases aren't mere
accidental consequences without any innate meaning.

If we can argue that memcg and blkcg having different views is
meaningful and characterize and justify the behaviors stemming from
the deviation, sure, that'd be fine, but I don't think we have that as
of now.

Thanks.

--
tejun

2015-02-11 02:19:12

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello, again.

On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
> If we can argue that memcg and blkcg having different views is
> meaningful and characterize and justify the behaviors stemming from
> the deviation, sure, that'd be fine, but I don't think we have that as
> of now.

If we assume that memcg and blkcg having different views is something
which represents an acceptable compromise considering the use cases
and implementation convenience - IOW, if we assume that read-sharing
is something which can happen regularly while write sharing is a
corner case and that while not completely correct the existing
self-corrective behavior from tracking ownership per-page at the point
of instantiation is good enough (as a memcg under pressure is likely
to give up shared pages to be re-instantiated by another sharer w/
more budget), we need to do the impedance matching between memcg and
blkcg at the writeback layer.

The main issue there is that the last link in the chain of IO
pressure propagation is realized by making individual dirtying tasks
converge on a common target dirty ratio point, which naturally
depends on those tasks seeing the same picture in terms of the
current write bandwidth, the available memory, and how much of it is
dirty. Tasks dirtying pages belonging to the same memcg while some
of those pages are mostly being written out by a different blkcg
would wreck the mechanism. It wouldn't be difficult for one subset
to make the other consider itself under severe IO pressure when there
actually isn't any in that group, possibly stalling and starving
those tasks unduly. At a more basic level, it's just wrong for one
group to be writing out a significant amount for another.
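
To spell the linkage out, here is a toy model of the convergence; the
names below are illustrative assumptions, not the actual
balance_dirty_pages() internals. Every dirtier in a domain throttles
against one (available, dirty, write bandwidth) triple, and if the
dirtier and the writer disagree on the domain, the throttled side's
"dirty" never drains:

/* Toy model of dirty-ratio convergence. */
struct dirty_domain {
        unsigned long available;        /* free + reclaimable pages */
        unsigned long dirty;            /* dirty pages in the domain */
        unsigned long write_bw;         /* pages/sec being written out */
};

/* Time a dirtier should stall: the excess over the limit, paid off at
 * the domain's writeback bandwidth.  Zero when under the limit. */
static unsigned long dirty_pause(const struct dirty_domain *dom,
                                 unsigned int dirty_ratio)
{
        unsigned long limit = dom->available * dirty_ratio / 100;

        if (dom->dirty <= limit || !dom->write_bw)
                return 0;
        return (dom->dirty - limit) / dom->write_bw;
}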

These issues can persist indefinitely if we follow the same
instantiator-owns rule for inode writebacks. Even if we reset the
ownership when an inode becomes clean, it wouldn't work, as the inode
can be dirtied over and over again while under writeback, and when
things like this happen, the behavior may become extremely difficult
to understand or characterize. We don't have visibility into how
individual pages of an inode get distributed across multiple cgroups,
who's currently responsible for writing back a specific inode, or how
the dirty ratio mechanism behaves in the face of the unexpected
combination of parameters.

Even if we assume that write sharing is a fringe case, we need
something better than a first-whatever rule when choosing which blkcg
is responsible for writing a shared inode out. There needs to be a
constant corrective pressure so that incidental and temporary sharing
doesn't end up screwing up the mechanism for an extended period of
time.

Greg mentioned choosing the closest ancestor of the sharers, which
basically pushes the inode sharing policy implementation down from
memcg to writeback. This could work, but we end up with the same
collusion problem as when this is used for memcg, and it's even more
difficult to solve at the writeback layer - we'd have to communicate
the shared state all the way down to the block layer and then
implement a mechanism there to take corrective measures, and even
after that we're likely to end up with a prolonged state where dirty
ratio propagation is essentially broken, as the dirtier and the
writer would be seeing different pictures.

So, based on the assumption that write sharing is mostly incidental
and temporary (ie. we're basically declaring that we don't support
persistent write sharing), how about something like the following?

1. memcg continues per-page tracking.

2. Each inode is associated with a single blkcg at a given time and
written out by that blkcg.

3. While writing back, if the number of pages from foreign memcgs is
higher than a certain ratio of the total written pages, the inode is
marked as disowned and the writeback instance is optionally
terminated early. E.g. if the ratio of foreign pages is over 50%
after writing out the number of pages matching 5s worth of write
bandwidth for the bdi, mark the inode as disowned.

4. On the following dirtying of the inode, the inode is associated
with the matching blkcg of the dirtied page. Note that this could
be the next cycle as the inode could already have been marked dirty
by the time the above condition triggered. In that case, the
following writeback would be terminated early too.

This should provide sufficient corrective pressure so that incidental
and temporary sharing of an inode doesn't become a persistent issue
while keeping the complexity necessary for implementing such pressure
fairly minimal and self-contained. Also, the changes necessary for
individual filesystems would be minimal.
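
A minimal sketch of the corrective rule in points 3 and 4; the
structs, field names and thresholds below are illustrative
assumptions, not actual kernel interfaces:

#include <stdbool.h>

struct memcg;                           /* opaque for the sketch */

/* Hypothetical per-inode state for the disowning rule. */
struct wb_inode {
        struct memcg *i_wb_memcg;       /* cgroup currently writing it */
        bool disowned;
};

/* Per-writeback-pass counters. */
struct wb_pass {
        unsigned long written;          /* pages written this pass */
        unsigned long foreign;          /* of those, foreign-charged */
        unsigned long min_pages;        /* e.g. 5s of write bandwidth */
};

/* Point 3: account one written page; return true to terminate the
 * writeback instance early because the inode is mostly foreign. */
static bool wb_account_page(struct wb_inode *inode, struct wb_pass *pass,
                            struct memcg *page_memcg)
{
        pass->written++;
        if (page_memcg != inode->i_wb_memcg)
                pass->foreign++;

        if (pass->written >= pass->min_pages &&
            pass->foreign * 2 > pass->written) {        /* >50% foreign */
                inode->disowned = true;
                return true;
        }
        return false;
}

/* Point 4: on the next dirtying, a disowned inode follows the
 * dirtier's cgroup. */
static void wb_note_dirty(struct wb_inode *inode, struct memcg *m)
{
        if (inode->disowned) {
                inode->i_wb_memcg = m;
                inode->disowned = false;
        }
}

With something like this, a one-off foreign writer trips the ratio for
at most one pass, while a persistent foreign dirtier takes the inode
over on its next dirtying.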

I think this should work well enough as long as the aforementioned
assumptions hold - IOW, if we maintain that write sharing is
unsupported.

What do you think?

Thanks.

--
tejun

2015-02-11 07:32:35

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello Tejun,

On Tue 10-02-15 21:19:06, Tejun Heo wrote:
> On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
> > If we can argue that memcg and blkcg having different views is
> > meaningful and characterize and justify the behaviors stemming from
> > the deviation, sure, that'd be fine, but I don't think we have that as
> > of now.
...
> So, based on the assumption that write sharing is mostly incidental
> and temporary (ie. we're basically declaring that we don't support
> persistent write sharing), how about something like the following?
>
> 1. memcg continues per-page tracking.
>
> 2. Each inode is associated with a single blkcg at a given time and
> written out by that blkcg.
>
> 3. While writing back, if the number of pages from foreign memcgs is
> higher than a certain ratio of the total written pages, the inode is
> marked as disowned and the writeback instance is optionally
> terminated early. E.g. if the ratio of foreign pages is over 50%
> after writing out the number of pages matching 5s worth of write
> bandwidth for the bdi, mark the inode as disowned.
>
> 4. On the following dirtying of the inode, the inode is associated
> with the matching blkcg of the dirtied page. Note that this could
> be the next cycle as the inode could already have been marked dirty
> by the time the above condition triggered. In that case, the
> following writeback would be terminated early too.
>
> This should provide sufficient corrective pressure so that incidental
> and temporary sharing of an inode doesn't become a persistent issue
> while keeping the complexity necessary for implementing such pressure
> fairly minimal and self-contained. Also, the changes necessary for
> individual filesystems would be minimal.
I like this proposal. It looks simple enough, and when inodes aren't
permanently write-shared it converges to the blkcg that is currently
writing to the inode. So ack from me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-02-11 18:29:09

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Tue, Feb 10, 2015 at 6:19 PM, Tejun Heo <[email protected]> wrote:
> Hello, again.
>
> On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
>> If we can argue that memcg and blkcg having different views is
>> meaningful and characterize and justify the behaviors stemming from
>> the deviation, sure, that'd be fine, but I don't think we have that as
>> of now.
>
> If we assume that memcg and blkcg having different views is something
> which represents an acceptable compromise considering the use cases
> and implementation convenience - IOW, if we assume that read-sharing
> is something which can happen regularly while write sharing is a
> corner case and that while not completely correct the existing
> self-corrective behavior from tracking ownership per-page at the point
> of instantiation is good enough (as a memcg under pressure is likely
> to give up shared pages to be re-instantiated by another sharer w/
> more budget), we need to do the impedance matching between memcg and
> blkcg at the writeback layer.
>
> The main issue there is that the last link in the chain of IO
> pressure propagation is realized by making individual dirtying tasks
> converge on a common target dirty ratio point, which naturally
> depends on those tasks seeing the same picture in terms of the
> current write bandwidth, the available memory, and how much of it is
> dirty. Tasks dirtying pages belonging to the same memcg while some
> of those pages are mostly being written out by a different blkcg
> would wreck the mechanism. It wouldn't be difficult for one subset
> to make the other consider itself under severe IO pressure when there
> actually isn't any in that group, possibly stalling and starving
> those tasks unduly. At a more basic level, it's just wrong for one
> group to be writing out a significant amount for another.
>
> These issues can persist indefinitely if we follow the same
> instantiator-owns rule for inode writebacks. Even if we reset the
> ownership when an inode becomes clean, it wouldn't work, as the inode
> can be dirtied over and over again while under writeback, and when
> things like this happen, the behavior may become extremely difficult
> to understand or characterize. We don't have visibility into how
> individual pages of an inode get distributed across multiple cgroups,
> who's currently responsible for writing back a specific inode, or how
> the dirty ratio mechanism behaves in the face of the unexpected
> combination of parameters.
>
> Even if we assume that write sharing is a fringe case, we need
> something better than a first-whatever rule when choosing which blkcg
> is responsible for writing a shared inode out. There needs to be a
> constant corrective pressure so that incidental and temporary sharing
> doesn't end up screwing up the mechanism for an extended period of
> time.
>
> Greg mentioned choosing the closest ancestor of the sharers, which
> basically pushes the inode sharing policy implementation down from
> memcg to writeback. This could work, but we end up with the same
> collusion problem as when this is used for memcg, and it's even more
> difficult to solve at the writeback layer - we'd have to communicate
> the shared state all the way down to the block layer and then
> implement a mechanism there to take corrective measures, and even
> after that we're likely to end up with a prolonged state where dirty
> ratio propagation is essentially broken, as the dirtier and the
> writer would be seeing different pictures.
>
> So, based on the assumption that write sharing is mostly incidental
> and temporary (ie. we're basically declaring that we don't support
> persistent write sharing), how about something like the following?
>
> 1. memcg continues per-page tracking.
>
> 2. Each inode is associated with a single blkcg at a given time and
> written out by that blkcg.
>
> 3. While writing back, if the number of pages from foreign memcgs is
> higher than a certain ratio of the total written pages, the inode is
> marked as disowned and the writeback instance is optionally
> terminated early. E.g. if the ratio of foreign pages is over 50%
> after writing out the number of pages matching 5s worth of write
> bandwidth for the bdi, mark the inode as disowned.
>
> 4. On the following dirtying of the inode, the inode is associated
> with the matching blkcg of the dirtied page. Note that this could
> be the next cycle as the inode could already have been marked dirty
> by the time the above condition triggered. In that case, the
> following writeback would be terminated early too.
>
> This should provide sufficient corrective pressure so that incidental
> and temporary sharing of an inode doesn't become a persistent issue
> while keeping the complexity necessary for implementing such pressure
> fairly minimal and self-contained. Also, the changes necessary for
> individual filesystems would be minimal.
>
> I think this should work well enough as long as the aforementioned
> assumptions hold - IOW, if we maintain that write sharing is
> unsupported.
>
> What do you think?
>
> Thanks.
>
> --
> tejun

This seems good. I assume that blkcg writeback would query
corresponding memcg for dirty page count to determine if over
background limit. And balance_dirty_pages() would query memcg's dirty
page count to throttle based on blkcg's bandwidth. Note: memcg
doesn't yet have dirty page counts, but several of us have made
attempts at adding the counters. And it shouldn't be hard to get them
merged.
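
For context, a toy model of what such counters would feed; the names
and structure below are assumptions for illustration, not the posted
patches:

/* Toy per-memcg dirty counter; the actual patches hook the kernel's
 * page-dirtying and page-cleaning paths (presumably with percpu
 * counters in practice, a plain long here). */
struct memcg_stats {
        long nr_dirty;
};

static void memcg_page_dirtied(struct memcg_stats *s) { s->nr_dirty++; }
static void memcg_page_cleaned(struct memcg_stats *s) { s->nr_dirty--; }

/* Background writeback for the memcg kicks in when its dirty pages
 * exceed its background threshold, mirroring the global check. */
static int memcg_over_bg_thresh(const struct memcg_stats *s,
                                unsigned long available,
                                unsigned int bg_ratio)
{
        return s->nr_dirty > (long)(available * bg_ratio / 100);
}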

2015-02-11 20:34:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello, Greg.

On Wed, Feb 11, 2015 at 10:28:44AM -0800, Greg Thelen wrote:
> This seems good. I assume that blkcg writeback would query
> corresponding memcg for dirty page count to determine if over
> background limit. And balance_dirty_pages() would query memcg's dirty

Yeah, available memory to the matching memcg and the number of dirty
pages in it. It's gonna work the same way as the global case just
scoped to the cgroup.

> page count to throttle based on blkcg's bandwidth. Note: memcg
> doesn't yet have dirty page counts, but several of us have made
> attempts at adding the counters. And it shouldn't be hard to get them
> merged.

Can you please post those?

So, cool, we're in agreement. Working on it. It shouldn't take too
long, hopefully.

Thanks.

--
tejun

2015-02-11 21:22:38

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Wed, Feb 11, 2015 at 11:33 PM, Tejun Heo <[email protected]> wrote:
> Hello, Greg.
>
> On Wed, Feb 11, 2015 at 10:28:44AM -0800, Greg Thelen wrote:
>> This seems good. I assume that blkcg writeback would query
>> corresponding memcg for dirty page count to determine if over
>> background limit. And balance_dirty_pages() would query memcg's dirty
>
> Yeah, available memory to the matching memcg and the number of dirty
> pages in it. It's gonna work the same way as the global case just
> scoped to the cgroup.

That might be a problem: all dirty pages accounted to a cgroup must be
reachable by its own writeback, or balance-dirty-pages will be
unable to satisfy memcg dirty memory thresholds. I've done accounting
for the per-inode owner, but there is another option: shared inodes might be
handled differently and be made available to all (or related) cgroup
writebacks.

Another issue is that the reclaimer now (mostly?) never triggers pageout.
The memcg reclaimer should do something if it finds a shared dirty page:
either move it into the right cgroup or make that inode reachable for
memcg writeback. I've sent a patch which marks shared dirty inodes
with a flag, I_DIRTY_SHARED or so.
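
A rough sketch of that marking; I_DIRTY_SHARED is the proposed flag,
and the hook and structs below are simplified assumptions:

#define I_DIRTY_SHARED  (1u << 0)       /* proposed flag, illustrative bit */

struct memcg;                           /* opaque for the sketch */

struct toy_inode {
        unsigned int i_state;
        struct memcg *i_wb_memcg;       /* cgroup whose wb owns the inode */
};

/* Hook on the set-page-dirty path: if the dirtier's memcg differs
 * from the inode's writeback owner, flag the inode so the memcg
 * reclaimer can find it and either re-own it or write it out. */
static void note_shared_dirty(struct toy_inode *inode, struct memcg *dirtier)
{
        if (inode->i_wb_memcg && inode->i_wb_memcg != dirtier)
                inode->i_state |= I_DIRTY_SHARED;
}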

>
>> page count to throttle based on blkcg's bandwidth. Note: memcg
>> doesn't yet have dirty page counts, but several of us have made
>> attempts at adding the counters. And it shouldn't be hard to get them
>> merged.
>
> Can you please post those?
>
> So, cool, we're in agreement. Working on it. It shouldn't take too
> long, hopefully.

Good. As I see it, this design is almost identical to my proposal,
except maybe for that dumb first-owns-all-until-the-end rule.

>
> Thanks.
>
> --
> tejun
>

2015-02-11 21:46:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello,

On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
> > Yeah, available memory to the matching memcg and the number of dirty
> > pages in it. It's gonna work the same way as the global case just
> > scoped to the cgroup.
>
> That might be a problem: all dirty pages accounted to a cgroup must be
> reachable by its own writeback, or balance-dirty-pages will be
> unable to satisfy memcg dirty memory thresholds. I've done accounting

Yeah, it would. Why wouldn't it?

> for the per-inode owner, but there is another option: shared inodes might be
> handled differently and be made available to all (or related) cgroup
> writebacks.

I'm not following you at all. The only reason this scheme can work is
because we exclude persistent shared write cases. As the whole thing
is based on that assumption, special casing shared inodes doesn't make
any sense. Doing things like allowing all cgroups to write shared
inodes without getting memcg on-board almost immediately breaks
pressure propagation while making shared writes a lot more attractive
and increasing implementation complexity substantially. Am I missing
something?

> Another issue is that the reclaimer now (mostly?) never triggers pageout.
> The memcg reclaimer should do something if it finds a shared dirty page:
> either move it into the right cgroup or make that inode reachable for
> memcg writeback. I've sent a patch which marks shared dirty inodes
> with a flag, I_DIRTY_SHARED or so.

It *might* make sense for memcg to drop pages being dirtied which
don't match the currently associated blkcg of the inode; however,
again, as we're basically declaring that shared writes aren't
supported, I'm skeptical about the usefulness.

Thanks.

--
tejun

2015-02-11 21:57:08

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
>> > Yeah, available memory to the matching memcg and the number of dirty
>> > pages in it. It's gonna work the same way as the global case just
>> > scoped to the cgroup.
>>
>> That might be a problem: all dirty pages accounted to a cgroup must be
>> reachable by its own writeback, or balance-dirty-pages will be
>> unable to satisfy memcg dirty memory thresholds. I've done accounting
>
> Yeah, it would. Why wouldn't it?

How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages?
Or are you thinking only about separating the writeback flow into blkio
cgroups without actual inode filtering? I mean delaying inode writeback and
keeping dirty pages as long as possible if their cgroups are far from their
thresholds.

>
>> for the per-inode owner, but there is another option: shared inodes might be
>> handled differently and be made available to all (or related) cgroup
>> writebacks.
>
> I'm not following you at all. The only reason this scheme can work is
> because we exclude persistent shared write cases. As the whole thing
> is based on that assumption, special casing shared inodes doesn't make
> any sense. Doing things like allowing all cgroups to write shared
> inodes without getting memcg on-board almost immediately breaks
> pressure propagation while making shared writes a lot more attractive
> and increasing implementation complexity substantially. Am I missing
> something?
>
>> Another issue is that the reclaimer now (mostly?) never triggers pageout.
>> The memcg reclaimer should do something if it finds a shared dirty page:
>> either move it into the right cgroup or make that inode reachable for
>> memcg writeback. I've sent a patch which marks shared dirty inodes
>> with a flag, I_DIRTY_SHARED or so.
>
> It *might* make sense for memcg to drop pages being dirtied which
> don't match the currently associated blkcg of the inode; however,
> again, as we're basically declaring that shared writes aren't
> supported, I'm skeptical about the usefulness.
>
> Thanks.
>
> --
> tejun

2015-02-11 22:05:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Thu, Feb 12, 2015 at 01:57:04AM +0400, Konstantin Khlebnikov wrote:
> On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <[email protected]> wrote:
> > Hello,
> >
> > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
> >> > Yeah, available memory to the matching memcg and the number of dirty
> >> > pages in it. It's gonna work the same way as the global case just
> >> > scoped to the cgroup.
> >>
> >> That might be a problem: all dirty pages accounted to a cgroup must be
> >> reachable by its own writeback, or balance-dirty-pages will be
> >> unable to satisfy memcg dirty memory thresholds. I've done accounting
> >
> > Yeah, it would. Why wouldn't it?
>
> How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages?
> Or are you thinking only about separating the writeback flow into blkio
> cgroups without actual inode filtering? I mean delaying inode writeback and
> keeping dirty pages as long as possible if their cgroups are far from their
> thresholds.

What? The code was already in the previous patchset. I'm just gonna
rip out the code that handles an inode being dirtied on multiple wb's.

--
tejun

2015-02-11 22:15:33

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Thu, Feb 12, 2015 at 1:05 AM, Tejun Heo <[email protected]> wrote:
> On Thu, Feb 12, 2015 at 01:57:04AM +0400, Konstantin Khlebnikov wrote:
>> On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <[email protected]> wrote:
>> > Hello,
>> >
>> > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
>> >> > Yeah, available memory to the matching memcg and the number of dirty
>> >> > pages in it. It's gonna work the same way as the global case just
>> >> > scoped to the cgroup.
>> >>
>> >> That might be a problem: all dirty pages accounted to a cgroup must be
>> >> reachable by its own writeback, or balance-dirty-pages will be
>> >> unable to satisfy memcg dirty memory thresholds. I've done accounting
>> >
>> > Yeah, it would. Why wouldn't it?
>>
>> How do you plan to do per-memcg/blkcg writeback for balance-dirty-pages?
>> Or are you thinking only about separating the writeback flow into blkio
>> cgroups without actual inode filtering? I mean delaying inode writeback and
>> keeping dirty pages as long as possible if their cgroups are far from their
>> thresholds.
>
> What? The code was already in the previous patchset. I'm just gonna
> rip out the code that handles an inode being dirtied on multiple wb's.

Well, ok. Even if shared writes are rare, they should be handled somehow
without relying on kupdate-like writeback. If a memcg has a lot of dirty pages
but their inodes accidentally belong to the wrong wb queues, tasks in
that memcg shouldn't get stuck in balance-dirty-pages until somebody outside
accidentally writes this data. That's all I wanted to say.

>
> --
> tejun

2015-02-11 22:30:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

Hello,

On Thu, Feb 12, 2015 at 02:15:29AM +0400, Konstantin Khlebnikov wrote:
> Well, ok. Even if shared writes are rare, they should be handled somehow
> without relying on kupdate-like writeback. If a memcg has a lot of dirty pages

This only works iff we consider those cases to be marginal enough to
handle them in a pretty ghetto way.

> but their inodes accidentally belong to the wrong wb queues, tasks in
> that memcg shouldn't get stuck in balance-dirty-pages until somebody outside
> accidentally writes this data. That's all I wanted to say.

But, right, yeah, corner cases around this could be nasty if writeout
interval is set really high. I don't think it matters for the default
5s interval at all. Maybe what we need is queueing a delayed per-wb
work w/ the default writeout interval when dirtying a foreign inode.
I'll think more about it.
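
Something along these lines, as a toy model; the structure and helper
names below are illustrative, and the kernel's actual delayed-work
plumbing differs:

/* Sketch: when a foreign memcg dirties an inode owned by another wb,
 * guarantee that wb runs within one writeout interval instead of
 * waiting for unrelated writeback to reach the inode. */
struct toy_wb {
        unsigned long next_wakeup;      /* 0 if nothing scheduled */
};

static void wb_wakeup_within(struct toy_wb *wb, unsigned long now,
                             unsigned long interval)
{
        unsigned long when = now + interval;

        /* Only ever pull the wakeup earlier, never push it back. */
        if (!wb->next_wakeup || when < wb->next_wakeup)
                wb->next_wakeup = when;
}

/* Called from the dirtying path when the page's memcg doesn't match
 * the inode's current wb owner.  @now and the interval share units;
 * 5 here stands for the default 5s writeout interval. */
static void note_foreign_dirtying(struct toy_wb *owner_wb, unsigned long now)
{
        wb_wakeup_within(owner_wb, now, 5);
}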

Thanks.

--
tejun

2015-02-12 02:10:44

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma

On Wed, Feb 11, 2015 at 12:33 PM, Tejun Heo <[email protected]> wrote:
[...]
>> page count to throttle based on blkcg's bandwidth. Note: memcg
>> doesn't yet have dirty page counts, but several of us have made
>> attempts at adding the counters. And it shouldn't be hard to get them
>> merged.
>
> Can you please post those?

Will do. Rebasing and testing needed, so it won't be today.