Hello
This patch series is creating a new misc cgroup controller for limiting
and tracking of resources which are not abstract like other cgroup
controllers.
This controller was initially proposed as encryption_id but after
the feedbacks, it is now changed to misc cgroup.
https://lore.kernel.org/lkml/[email protected]/
Changes in RFC v3:
1. Changed implementation to support 64 bit counters.
2. Print kernel logs only once per resource per cgroup.
3. Capacity can be set less than the current usage.
Changes in RFC v2:
1. Documentation fixes.
2. Added kernel log messages.
3. Changed charge API to treat misc_cg as input parameter.
4. Added helper APIs to get and release references on the cgroup.
[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/[email protected]/
Vipin Sharma (2):
cgroup: sev: Add misc cgroup controller
cgroup: sev: Miscellaneous cgroup documentation.
Documentation/admin-guide/cgroup-v1/index.rst | 1 +
Documentation/admin-guide/cgroup-v1/misc.rst | 4 +
Documentation/admin-guide/cgroup-v2.rst | 69 ++-
arch/x86/kvm/svm/sev.c | 65 ++-
arch/x86/kvm/svm/svm.h | 1 +
include/linux/cgroup_subsys.h | 4 +
include/linux/misc_cgroup.h | 130 ++++++
init/Kconfig | 14 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/misc.c | 402 ++++++++++++++++++
10 files changed, 679 insertions(+), 12 deletions(-)
create mode 100644 Documentation/admin-guide/cgroup-v1/misc.rst
create mode 100644 include/linux/misc_cgroup.h
create mode 100644 kernel/cgroup/misc.c
--
2.30.1.766.gb4fecdf3b7-goog
Hello,
On Thu, Mar 04, 2021 at 03:19:44PM -0800, Vipin Sharma wrote:
> This patch series is creating a new misc cgroup controller for limiting
> and tracking of resources which are not abstract like other cgroup
> controllers.
>
> This controller was initially proposed as encryption_id but after
> the feedbacks, it is now changed to misc cgroup.
> https://lore.kernel.org/lkml/[email protected]/
Vipin, thank you very much for your persistence and patience. The patchset
looks good to me. Michal, as you've been reviewing the series, can you
please take another look and ack them if you don't find anything
objectionable?
--
tejun
Hello.
On Sun, Mar 07, 2021 at 07:48:40AM -0500, Tejun Heo <[email protected]> wrote:
> Vipin, thank you very much for your persistence and patience.
Yes, and thanks for taking my remarks into account.
> Michal, as you've been reviewing the series, can you please take
> another look and ack them if you don't find anything objectionable?
Honestly, I'm still sitting on the fence whether this needs a new
controller and whether the miscontroller (:-p) is a good approach in the
long term [1].
I admit, I didn't follow the past dicussions completely, however,
(Vipin) could it be in the cover letter/commit messages shortly
summarized why cgroups and a controller were chosen to implement
restrictions of these resources, what were the alternatives any why were
they rejected?
In the previous discussion, I saw the reasoning for the list of the
resources to be hardwired in the controller itself in order to get some
scrutiny of possible changes. That makes sense to me. But with that, is
it necessary to commit to the new controller API via EXPORT_SYMBOL? (I
don't mean this as a licensing question but what the external API should
be (if any).)
Besides the generic remarks above, I'd still suggest some slight
implementation changes, posted inline to the patch.
Thanks,
Michal
[1] Currently, only one thing comes to my mind -- the delegation via
cgroup.subtree_control. The miscontroller may add possibly further
resources whose delegation granularity is bunched up under one entry.
Hello,
On Thu, Mar 11, 2021 at 07:58:19PM +0100, Michal Koutn? wrote:
> > Michal, as you've been reviewing the series, can you please take
> > another look and ack them if you don't find anything objectionable?
> Honestly, I'm still sitting on the fence whether this needs a new
> controller and whether the miscontroller (:-p) is a good approach in the
> long term [1].
Yeah, it's a bit of cop-out. My take is that the underlying hardware feature
isn't mature enough to have reasonable abstraction built on top of them.
Given time, maybe future iterations will get there or maybe it's a passing
fad and people will mostly forget about these.
In the meantime, keeping them out of cgroup is one direction, a relatively
high friction one but still viable. Or we can provide something of a halfway
house so that people who have immediate needs can still leverage the
existing infrastructure while controlling the amount of time, energy and
future lock-ins they take. So, that's misc controller.
I'm somewhat ambivalent but we've had multiple of these things popping up in
the past several years and containment seems to be a reasonable approach at
this point.
> [1] Currently, only one thing comes to my mind -- the delegation via
> cgroup.subtree_control. The miscontroller may add possibly further
> resources whose delegation granularity is bunched up under one entry.
Controller enabling and delegation in themselves aren't supposed to have
resource or security implications, so I don't think it's a practical
problem.
Thanks.
--
tejun
On Thu, Mar 11, 2021 at 07:58:19PM +0100, Michal Koutn? wrote:
> I admit, I didn't follow the past dicussions completely, however,
> (Vipin) could it be in the cover letter/commit messages shortly
> summarized why cgroups and a controller were chosen to implement
> restrictions of these resources, what were the alternatives any why were
> they rejected?
I will add some more information in the cover letter of the next version.
Basically, SEV will mostly be used by cloud providers for providing
confidential VMs. Since they are limited we need a good way to schedule
these jobs in cloud infrastructure. To achieve this we either come up
with some ioctl for "/dev/sev" to know about its usage, availability,
etc. This requires existing scheduling mechanism in the cloud to have an
extension for this interaction. Now same thing needs to be done for TDX.
IBM SEID doesn't have scarcity of this resource but they are also
interested in tracking and limiting the usage. Each one coming up with
their own interaction is a duplicate effort when they all need similar
thing. One can say that abstraction should be at KVM level but these
resources can be used outside VM as well.
Most of the cloud infrastructure use cgroups for knowing the host state,
track the resources usage, enforce limits on them, etc. They use this
info to optimize work allocation in the fleet and make sure no rogue job
consumes more than it needs and starves other. Adding these resources
to cgroup is a natural choice with least friction. Cgroup itself says it
is a mechanism to distribute system resources along the hierarchy in a
controlled mechanism and configurable manner. Most of the resources in
cgroups are abstracted enough but their are still resources which are
not abstract but have limited availability or have specific use cases.
>
> In the previous discussion, I saw the reasoning for the list of the
> resources to be hardwired in the controller itself in order to get some
> scrutiny of possible changes. That makes sense to me. But with that, is
> it necessary to commit to the new controller API via EXPORT_SYMBOL? (I
> don't mean this as a licensing question but what the external API should
> be (if any).)
As per my understanding this is the only for way for loadable modules
(kvm-amd in this case) to access Kernel APIs. Let me know if there is a
better way to do it.
>
> Besides the generic remarks above, I'd still suggest some slight
> implementation changes, posted inline to the patch.
I will work on them.
I appreciate you guys taking out time and helping me out with this patch
series.
Thanks
Vipin
On Fri, Mar 12, 2021 at 09:49:26AM -0800, Vipin Sharma <[email protected]> wrote:
> I will add some more information in the cover letter of the next version.
Thanks.
> Each one coming up with their own interaction is a duplicate effort
> when they all need similar thing.
Could this be expressed as a new BPF hook (when allocating/freeing such
a resource unit)?
The decision could be made based on the configured limit or even some
other predicate.
(I saw this proposed already but I haven't seen some more reasoning
whether it's worse/better. IMO, BPF hooks are "cheaper" than full-blown
controllers, though it's still new user API.)
> As per my understanding this is the only for way for loadable modules
> (kvm-amd in this case) to access Kernel APIs. Let me know if there is a
> better way to do it.
I understood the symbols are exported for such modularized builds.
However, making them non-GPL exposes them to any out-of-tree modules,
although, the resource types are supposed to stay hardcoded in the misc
controller. So my point was to make them EXPORT_SYMBOL_GPL to mark
they're just a means of implementing the modularized builds and not an
API. (But they'd remain API for out-of-tree GPL modules anyway, so take
this reasoning of mine with a grain of salt.)
Michal
On Mon, Mar 15, 2021 at 08:10:09PM +0100, Michal Koutn? wrote:
> On Fri, Mar 12, 2021 at 09:49:26AM -0800, Vipin Sharma <[email protected]> wrote:
> > I will add some more information in the cover letter of the next version.
> Thanks.
>
> > Each one coming up with their own interaction is a duplicate effort
> > when they all need similar thing.
> Could this be expressed as a new BPF hook (when allocating/freeing such
> a resource unit)?
>
> The decision could be made based on the configured limit or even some
> other predicate.
>
> (I saw this proposed already but I haven't seen some more reasoning
> whether it's worse/better. IMO, BPF hooks are "cheaper" than full-blown
> controllers, though it's still new user API.)
I am not much knowledgeable with BPF, so, I might be wrong here.
There are couple of things which might not be addressed with BPF:
1. Which controller to use in v1 case? These are not abstract resources
so in v1 where each controller have their own hierarchy it might not
be easy to identify the best controller.
2. It seems to me that we won't be able to abstract out a single BPF
program which can help with all of the resources types we are
planning to use, again, because it is not an abstract type like
network packets, and there will be different places in the source
code to use these resources.
To me a cgroup tends to give much easier and well integrated solution when it
comes to scheduling and limiting a resource with existing tools in a
cloud infrastructure.
>
>
> > As per my understanding this is the only for way for loadable modules
> > (kvm-amd in this case) to access Kernel APIs. Let me know if there is a
> > better way to do it.
> I understood the symbols are exported for such modularized builds.
> However, making them non-GPL exposes them to any out-of-tree modules,
> although, the resource types are supposed to stay hardcoded in the misc
> controller. So my point was to make them EXPORT_SYMBOL_GPL to mark
> they're just a means of implementing the modularized builds and not an
> API. (But they'd remain API for out-of-tree GPL modules anyway, so take
> this reasoning of mine with a grain of salt.)
>
I see, I will change it to GPL.
Thanks
Vipin