2021-07-15 17:29:29

by Shakeel Butt

Subject: Re: [PATCH v4 00/16] memcg accounting from OpenVZ

On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <[email protected]> wrote:
>
> OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
> Initially we used our own accounting subsystem, then partially contributed
> it upstream, and a few years ago switched to cgroups v1.
> Now we're rebasing again, revising our old patches and trying to push
> them upstream.
>
> We try to protect the host system from any misuse of kernel memory
> allocation triggered by untrusted users inside the containers.
>
> This patch set is addressed mostly to the cgroups maintainers and the cgroups@
> mailing list, though I would be very grateful for any comments from maintainers
> of affected subsystems or other people added in cc.
>
> Compared to the upstream, we additionally account the following kernel objects:
> - network devices and their Tx/Rx queues
> - ipv4/v6 addresses and routing-related objects
> - inet_bind_bucket cache objects
> - VLAN group arrays
> - ipv6/sit: ip_tunnel_prl
> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
> - nsproxy and namespace objects themselves
> - IPC objects: semaphores, message queues and shared memory segments
> - mounts
> - pollfd and select bits arrays
> - signals and posix timers
> - file lock
> - fasync_struct used by the file lease code and driver's fasync queues
> - tty objects
> - per-mm LDT
>
> We have incorrect/incomplete/obsolete accounting for a few other kernel
> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
> They require rework and will probably be dropped altogether.
>
> We are also going to add accounting for nft, but it is not ready yet.
>
> We have not tested performance on upstream. However, our performance team
> compared our current RHEL7-based production kernel with the corresponding
> original RHEL7 kernel and reports that it is at least no worse.
>

Hi Vasily,

What's the status of this series? I see a couple patches did get
acked/reviewed. Can you please re-send the series with updated ack
tags?

thanks,
Shakeel


2021-07-16 04:12:54

by Vasily Averin

Subject: Re: [PATCH v4 00/16] memcg accounting from OpenVZ

On 7/15/21 8:11 PM, Shakeel Butt wrote:
> Hi Vasily,
>
> What's the status of this series? I see a couple patches did get
> acked/reviewed. Can you please re-send the series with updated ack
> tags?

Technically my patches do not have any NAKs. Practically they are still not merged.
I expected Michal to push them, but he advised me to push the subsystem maintainers.
I've asked Tejun to pick up the whole patch set and I'm waiting for his feedback right now.

I can resend the patch set once again, with the collected approvals and rebased to v5.14-rc1.
However, I do not understand how that helps to push the patches if they should be processed through
the subsystem maintainers. As far as I understand, I'll need to split this patch set into
per-subsystem pieces and send them to the corresponding maintainers.

Thank you,
Vasily Averin.

2021-07-16 12:57:17

by Shakeel Butt

Subject: Re: [PATCH v4 00/16] memcg accounting from OpenVZ

On Thu, Jul 15, 2021 at 9:11 PM Vasily Averin <[email protected]> wrote:
>
> On 7/15/21 8:11 PM, Shakeel Butt wrote:
> > Hi Vasily,
> >
> > What's the status of this series? I see a couple patches did get
> > acked/reviewed. Can you please re-send the series with updated ack
> > tags?
>
> Technically my patches do not have any NAKs. Practically they are still not merged.
> I expected Michal to push them, but he advised me to push the subsystem maintainers.
> I've asked Tejun to pick up the whole patch set and I'm waiting for his feedback right now.
>
> I can resend the patch set once again, with the collected approvals and rebased to v5.14-rc1.
> However, I do not understand how that helps to push the patches if they should be processed through
> the subsystem maintainers. As far as I understand, I'll need to split this patch set into
> per-subsystem pieces and send them to the corresponding maintainers.
>

Usually these kinds of patches (adding memcg accounting) go through the mm
tree, but if there are no dependencies between the patches and there is
consensus that each subsystem maintainer picks up the corresponding patch,
then that is fine too.

2021-07-19 10:45:32

by Vasily Averin

Subject: [PATCH v5 00/16] memcg accounting from OpenVZ

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially contributed
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc.

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We are also going to add accounting for nft, but it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it is at least no worse.

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
with old patch 2 "accounting for fib6_nodes cache", which uses this kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
memcg: enable accounting for net_device and Tx/Rx queues
memcg: enable accounting for IP address and routing-related objects
memcg: enable accounting for inet_bin_bucket cache
memcg: enable accounting for VLAN group array
memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
allocation
memcg: enable accounting for scm_fp_list objects
memcg: enable accounting for mnt_cache entries
memcg: enable accounting for pollfd and select bits arrays
memcg: enable accounting for file lock caches
memcg: enable accounting for fasync_cache
memcg: enable accounting for new namesapces and struct nsproxy
memcg: enable accounting of ipc resources
memcg: enable accounting for signals
memcg: enable accounting for posix_timers_cache slab
memcg: enable accounting for tty-related objects
memcg: enable accounting for ldt_struct objects

arch/x86/kernel/ldt.c | 6 +++---
drivers/tty/tty_io.c | 4 ++--
fs/fcntl.c | 3 ++-
fs/locks.c | 6 ++++--
fs/namespace.c | 7 ++++---
fs/select.c | 4 ++--
ipc/msg.c | 2 +-
ipc/namespace.c | 2 +-
ipc/sem.c | 9 +++++----
ipc/shm.c | 2 +-
kernel/cgroup/namespace.c | 2 +-
kernel/nsproxy.c | 2 +-
kernel/pid_namespace.c | 2 +-
kernel/signal.c | 2 +-
kernel/time/namespace.c | 4 ++--
kernel/time/posix-timers.c | 4 ++--
kernel/user_namespace.c | 2 +-
mm/memcontrol.c | 2 +-
net/8021q/vlan.c | 2 +-
net/core/dev.c | 6 +++---
net/core/fib_rules.c | 4 ++--
net/core/scm.c | 4 ++--
net/dccp/proto.c | 2 +-
net/ipv4/devinet.c | 2 +-
net/ipv4/fib_trie.c | 4 ++--
net/ipv4/tcp.c | 4 +++-
net/ipv6/addrconf.c | 2 +-
net/ipv6/ip6_fib.c | 4 ++--
net/ipv6/route.c | 2 +-
net/ipv6/sit.c | 5 +++--
30 files changed, 57 insertions(+), 49 deletions(-)

--
1.8.3.1

2021-07-26 19:01:40

by Vasily Averin

Subject: [PATCH v6 00/16] memcg accounting from

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially contributed
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc.

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We are also going to add accounting for nft, but it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it is at least no worse.

v6:
- improved description of "memcg: enable accounting for signals"
according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
with old patch 2 "accounting for fib6_nodes cache", which uses this kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
memcg: enable accounting for net_device and Tx/Rx queues
memcg: enable accounting for IP address and routing-related objects
memcg: enable accounting for inet_bin_bucket cache
memcg: enable accounting for VLAN group array
memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
allocation
memcg: enable accounting for scm_fp_list objects
memcg: enable accounting for mnt_cache entries
memcg: enable accounting for pollfd and select bits arrays
memcg: enable accounting for file lock caches
memcg: enable accounting for fasync_cache
memcg: enable accounting for new namesapces and struct nsproxy
memcg: enable accounting of ipc resources
memcg: enable accounting for signals
memcg: enable accounting for posix_timers_cache slab
memcg: enable accounting for tty-related objects
memcg: enable accounting for ldt_struct objects

arch/x86/kernel/ldt.c | 6 +++---
drivers/tty/tty_io.c | 4 ++--
fs/fcntl.c | 3 ++-
fs/locks.c | 6 ++++--
fs/namespace.c | 7 ++++---
fs/select.c | 4 ++--
ipc/msg.c | 2 +-
ipc/namespace.c | 2 +-
ipc/sem.c | 9 +++++----
ipc/shm.c | 2 +-
kernel/cgroup/namespace.c | 2 +-
kernel/nsproxy.c | 2 +-
kernel/pid_namespace.c | 2 +-
kernel/signal.c | 2 +-
kernel/time/namespace.c | 4 ++--
kernel/time/posix-timers.c | 4 ++--
kernel/user_namespace.c | 2 +-
mm/memcontrol.c | 2 +-
net/8021q/vlan.c | 2 +-
net/core/dev.c | 6 +++---
net/core/fib_rules.c | 4 ++--
net/core/scm.c | 4 ++--
net/dccp/proto.c | 2 +-
net/ipv4/devinet.c | 2 +-
net/ipv4/fib_trie.c | 4 ++--
net/ipv4/tcp.c | 4 +++-
net/ipv6/addrconf.c | 2 +-
net/ipv6/ip6_fib.c | 4 ++--
net/ipv6/route.c | 2 +-
net/ipv6/sit.c | 5 +++--
30 files changed, 57 insertions(+), 49 deletions(-)

--
1.8.3.1

2021-07-26 22:01:05

by David Miller

Subject: Re: [PATCH v6 00/16] memcg accounting from


This series does not apply cleanly to net-next, please respin.

Thank you.

2021-07-27 04:47:14

by Vasily Averin

Subject: Re: [PATCH v6 00/16] memcg accounting from OpenVZ

On 7/27/21 12:59 AM, David Miller wrote:
>
> This series does not apply cleanly to net-next, please respin.

Dear David,
I found that you have already approved the net-related patches of this series and included them in net-next.
So I'll respin v7 without these patches.

Thank you,
Vasily Averin

2021-07-27 05:34:07

by Vasily Averin

Subject: [PATCH v7 00/10] memcg accounting from OpenVZ

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially contributed
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc.

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We are also going to add accounting for nft, but it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it is at least no worse.

v7:
- net-related patches were approved and included in the net-next git tree
- rebased to v5.14-rc3
- added Acked-by tag from Kirill Tkhai on "memcg: enable accounting for
new namesapces and struct nsproxy"

v6:
- improved description of "memcg: enable accounting for signals"
according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
with old patch 2 "accounting for fib6_nodes cache", which uses this kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (10):
memcg: enable accounting for mnt_cache entries
memcg: enable accounting for pollfd and select bits arrays
memcg: enable accounting for file lock caches
memcg: enable accounting for fasync_cache
memcg: enable accounting for new namesapces and struct nsproxy
memcg: enable accounting of ipc resources
memcg: enable accounting for signals
memcg: enable accounting for posix_timers_cache slab
memcg: enable accounting for tty-related objects
memcg: enable accounting for ldt_struct objects

arch/x86/kernel/ldt.c | 6 +++---
drivers/tty/tty_io.c | 4 ++--
fs/fcntl.c | 3 ++-
fs/locks.c | 6 ++++--
fs/namespace.c | 7 ++++---
fs/select.c | 4 ++--
ipc/msg.c | 2 +-
ipc/namespace.c | 2 +-
ipc/sem.c | 9 +++++----
ipc/shm.c | 2 +-
kernel/cgroup/namespace.c | 2 +-
kernel/nsproxy.c | 2 +-
kernel/pid_namespace.c | 2 +-
kernel/signal.c | 2 +-
kernel/time/namespace.c | 4 ++--
kernel/time/posix-timers.c | 4 ++--
kernel/user_namespace.c | 2 +-
17 files changed, 34 insertions(+), 29 deletions(-)

--
1.8.3.1