MIME-Version: 1.0
References: <20211120045011.3074840-1-almasrymina@google.com> <YZvppKvUPTIytM/c@cmpxchg.org>
In-Reply-To: <YZvppKvUPTIytM/c@cmpxchg.org>
From:   Shakeel Butt <shakeelb@google.com>
Date:   Sun, 28 Nov 2021 22:00:13 -0800
Message-ID: <CALvZod6ZEU8upWWdYwZ3KVbEK0eM7o0CM2GXy8Sn4SYcGo8jHQ@mail.gmail.com>
Subject: Re: [PATCH v4 0/4] Deterministic charging of shared memory
To:     Johannes Weiner <hannes@cmpxchg.org>
Cc:     Mina Almasry <almasrymina@google.com>,
        Jonathan Corbet <corbet@lwn.net>,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        Andrew Morton <akpm@linux-foundation.org>,
        Michal Hocko <mhocko@kernel.org>,
        Vladimir Davydov <vdavydov.dev@gmail.com>,
        Hugh Dickins <hughd@google.com>, Shuah Khan <shuah@kernel.org>,
        Greg Thelen <gthelen@google.com>,
        Dave Chinner <david@fromorbit.com>,
        Matthew Wilcox <willy@infradead.org>,
        Roman Gushchin <guro@fb.com>, "Theodore Ts'o" <tytso@mit.edu>,
        linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
        linux-mm@kvack.org
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

Hi Johannes,

On Mon, Nov 22, 2021 at 11:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
[...]
> Here is one idea:
>
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
>
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
>
> With a tree like this:
>
>         root - job group - job
>                         `- job
>             `- job group - job
>                         `- job
>
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
>
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
>
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
>
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.
>
> This would be a generic implementation of resource sharing semantics:
> independent of data source and filesystems, contained inside the
> cgroup interface, and reusing the existing hierarchies of accounting
> and control domains to also represent levels of common property.
>
> Thoughts?

Before commenting on your proposal, I would like to clarify that the
use-cases given are not specific to us but are more general. Though I
think you are arguing that the implementation is not general purpose
which I kind of agree with.

Let me take a stab again at describing these use-cases which I think
can be partitioned based on the relationship of the entities
sharing/accessing the memory among them. (Sorry for repeating these
because I think we should keep these in mind while discussing the
possible solutions).

1) Mutually trusted entities sharing memory for collaborative work.
One example is a file-system shared between sub-tasks of a meta-job.
(Mina's second use-case).

2) Independent entities sharing memory to reduce cost. Examples
include shared libraries, packages or tool chains. (Mina's third
use-case).

3) One entity observing or monitoring another entity. Examples include
gdb, ptrace, uprobes, VM or process migration and checkpointing.

4) Server-Client relationship. (Mina's first use-case.

Let me put (3) out of the way first as these operations have special
interfaces and the target entity is a process (not a cgroup). Remote
charging works for these and no new oom corner cases are introduced.

For (1) and (2), I think your proposal aligns pretty well with them
but one important property is still missing which we are very adamant
about i.e. 'deterministic charge'. To explain with an example, suppose
two instances of the same job are running on two different systems. On
one system, it is sharing a shared library with an unrelated job and
the second instance is using that library alone. The owner will see
different memory usage for both instances which can mess with their
resource planning.

However I think this can be solved very easily with an opt-in add-on.
The node controller knows upfront the libraries/packages which can be
shared between the jobs and is responsible for creating the cgroup
hierarchy (at least the top level) for the jobs. It can create a
common ancestor for all such jobs and let the kernel know that if any
descendant accesses these libraries, charge to this specific ancestor.
If someone out of this sub-hierarchy accesses the memory, follow the
proposal i.e. common ancestor. With this specific opt-in add-on, all
job owners will see their job usage more consistent.

[I am putting this as a brainstorming discussion] Regarding (4), for
our use-case, the server wants the cost of the memory needed to serve
a client to be paid by the corresponding client. Please note that the
memory is not necessarily accessed by the client.

Now we can argue that this use-case can be served similar to (3) i.e.
through a special interface/syscall. I think that would be challenging
particularly when the lifetime of a client 'process' is independent of
the memory needed to serve that client.

Another way is to disable the accounting of that specific memory
needed to serve the clients (I think Roman suggested a similar notion
as disabling accounting of a tmpfs). Any other ideas?

thanks,
Shakeel