LinuxLists.cc - Re: [PATCH v2 00/28] The new cgroup slab memory controller

2020-01-30 02:07:48

by Bharata B Rao

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> The existing cgroup slab memory controller is based on the idea of
> replicating slab allocator internals for each memory cgroup.
> This approach promises a low memory overhead (one pointer per page),
> and isn't adding too much code on hot allocation and release paths.
> But is has a very serious flaw: it leads to a low slab utilization.
>
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
>
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.
>
> Arguably it wasn't an issue back to days when the kmem controller was
> introduced and was an opt-in feature, which had to be turned on
> individually for each memory cgroup. But now it's turned on by default
> on both cgroup v1 and v2. And modern systemd-based systems tend to
> create a large number of cgroups.
>
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
>
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
>
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and
> changing memcg pointer to the parent.
>
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are
> destroyed with root counterparts. It allows to remove a good amount of code
> and make things generally simpler.
>
> The patchset* has been tested on a number of different workloads in our
> production. In all cases it saved significant amount of memory, measured
> from high hundreds of MBs to single GBs per host. On average, the size
> of slab memory has been reduced by 35-45%.

Here are some numbers from multiple runs of sysbench and kernel compilation
with this patchset on a 10 core POWER8 host:

==========================================================================
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
of a mem cgroup (Sampling every 5s)
--------------------------------------------------------------------------
5.5.0-rc7-mm1 +slab patch %reduction
--------------------------------------------------------------------------
memory.kmem.usage_in_bytes 15859712 4456448 72
memory.usage_in_bytes 337510400 335806464 .5
Slab: (kB) 814336 607296 25

memory.kmem.usage_in_bytes 16187392 4653056 71
memory.usage_in_bytes 318832640 300154880 5
Slab: (kB) 789888 559744 29
--------------------------------------------------------------------------

Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
meminfo:Slab for kernel compilation (make -s -j64) Compilation was
done from bash that is in a memory cgroup. (Sampling every 5s)
--------------------------------------------------------------------------
5.5.0-rc7-mm1 +slab patch %reduction
--------------------------------------------------------------------------
memory.kmem.usage_in_bytes 338493440 231931904 31
memory.usage_in_bytes 7368015872 6275923968 15
Slab: (kB) 1139072 785408 31

memory.kmem.usage_in_bytes 341835776 236453888 30
memory.usage_in_bytes 6540427264 6072893440 7
Slab: (kB) 1074304 761280 29

memory.kmem.usage_in_bytes 340525056 233570304 31
memory.usage_in_bytes 6406209536 6177357824 3
Slab: (kB) 1244288 739712 40
--------------------------------------------------------------------------

Slab consumption right after boot
--------------------------------------------------------------------------
5.5.0-rc7-mm1 +slab patch %reduction
--------------------------------------------------------------------------
Slab: (kB) 821888 583424 29
==========================================================================

Summary:

With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
around 70% and 30% reduction consistently.

Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
kernel compilation.

Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
same is seen right after boot too.

Regards,
Bharata.

2020-01-30 02:44:21

by Roman Gushchin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > The existing cgroup slab memory controller is based on the idea of
> > replicating slab allocator internals for each memory cgroup.
> > This approach promises a low memory overhead (one pointer per page),
> > and isn't adding too much code on hot allocation and release paths.
> > But is has a very serious flaw: it leads to a low slab utilization.
> >
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> >
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
> >
> > Arguably it wasn't an issue back to days when the kmem controller was
> > introduced and was an opt-in feature, which had to be turned on
> > individually for each memory cgroup. But now it's turned on by default
> > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > create a large number of cgroups.
> >
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> >
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> >
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and
> > changing memcg pointer to the parent.
> >
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are
> > destroyed with root counterparts. It allows to remove a good amount of code
> > and make things generally simpler.
> >
> > The patchset* has been tested on a number of different workloads in our
> > production. In all cases it saved significant amount of memory, measured
> > from high hundreds of MBs to single GBs per host. On average, the size
> > of slab memory has been reduced by 35-45%.
>
> Here are some numbers from multiple runs of sysbench and kernel compilation
> with this patchset on a 10 core POWER8 host:
>
> ==========================================================================
> Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> of a mem cgroup (Sampling every 5s)
> --------------------------------------------------------------------------
> 5.5.0-rc7-mm1 +slab patch %reduction
> --------------------------------------------------------------------------
> memory.kmem.usage_in_bytes 15859712 4456448 72
> memory.usage_in_bytes 337510400 335806464 .5
> Slab: (kB) 814336 607296 25
>
> memory.kmem.usage_in_bytes 16187392 4653056 71
> memory.usage_in_bytes 318832640 300154880 5
> Slab: (kB) 789888 559744 29
> --------------------------------------------------------------------------
>
>
> Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> done from bash that is in a memory cgroup. (Sampling every 5s)
> --------------------------------------------------------------------------
> 5.5.0-rc7-mm1 +slab patch %reduction
> --------------------------------------------------------------------------
> memory.kmem.usage_in_bytes 338493440 231931904 31
> memory.usage_in_bytes 7368015872 6275923968 15
> Slab: (kB) 1139072 785408 31
>
> memory.kmem.usage_in_bytes 341835776 236453888 30
> memory.usage_in_bytes 6540427264 6072893440 7
> Slab: (kB) 1074304 761280 29
>
> memory.kmem.usage_in_bytes 340525056 233570304 31
> memory.usage_in_bytes 6406209536 6177357824 3
> Slab: (kB) 1244288 739712 40
> --------------------------------------------------------------------------
>
> Slab consumption right after boot
> --------------------------------------------------------------------------
> 5.5.0-rc7-mm1 +slab patch %reduction
> --------------------------------------------------------------------------
> Slab: (kB) 821888 583424 29
> ==========================================================================
>
> Summary:
>
> With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
> around 70% and 30% reduction consistently.
>
> Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> kernel compilation.
>
> Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> same is seen right after boot too.

That's just perfect!

memory.usage_in_bytes was most likely the same because the freed space
was taken by pagecache.

Thank you very much for testing!

Roman

2020-08-12 23:18:09

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

Guys,

There is a convoluted deadlock that I just root caused, and that is
fixed by this work (at least based on my code inspection it appears to
be fixed); but the deadlock exists in older and stable kernels, and I
am not sure whether to create a separate patch for it, or backport
this whole thing.

Thread #1: Hot-removes memory
device_offline
memory_subsys_offline
offline_pages
__offline_pages
mem_hotplug_lock <- write access
waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
migrate it.

Thread #2: ccs killer kthread
css_killed_work_fn
cgroup_mutex <- Grab this Mutex
mem_cgroup_css_offline
memcg_offline_kmem.part
memcg_deactivate_kmem_caches
get_online_mems
mem_hotplug_lock <- waits for Thread#1 to get read access

Thread #3: crashing userland program
do_coredump
elf_core_dump
get_dump_page() -> get page with pfn#9e5113, and increment refcnt
dump_emit
__kernel_write
__vfs_write
new_sync_write
pipe_write
pipe_wait -> waits for Thread #4 systemd-coredump to
read the pipe

Thread #4: systemd-coredump
ksys_read
vfs_read
__vfs_read
seq_read
proc_single_show
proc_cgroup_show
cgroup_mutex -> waits from Thread #2 for this lock.

In Summary:
Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
waits for Thread#1 for mem_hotplug_lock rwlock.

This work appears to fix this deadlock because cgroup_mutex is not
called anymore before mem_hotplug_lock (unless I am missing it), as it
removes memcg_deactivate_kmem_caches.

Thank you,
Pasha

On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <[email protected]> wrote:
>
> On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > The existing cgroup slab memory controller is based on the idea of
> > > replicating slab allocator internals for each memory cgroup.
> > > This approach promises a low memory overhead (one pointer per page),
> > > and isn't adding too much code on hot allocation and release paths.
> > > But is has a very serious flaw: it leads to a low slab utilization.
> > >
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> > >
> > > The real reason why the existing design leads to a low slab utilization
> > > is simple: slab pages are used exclusively by one memory cgroup.
> > > If there are only few allocations of certain size made by a cgroup,
> > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > deleted, or the cgroup contains a single-threaded application which is
> > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > in all these cases the resulting slab utilization is very low.
> > > If kmem accounting is off, the kernel is able to use free space
> > > on slab pages for other allocations.
> > >
> > > Arguably it wasn't an issue back to days when the kmem controller was
> > > introduced and was an opt-in feature, which had to be turned on
> > > individually for each memory cgroup. But now it's turned on by default
> > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > create a large number of cgroups.
> > >
> > > This patchset provides a new implementation of the slab memory controller,
> > > which aims to reach a much better slab utilization by sharing slab pages
> > > between multiple memory cgroups. Below is the short description of the new
> > > design (more details in commit messages).
> > >
> > > Accounting is performed per-object instead of per-page. Slab-related
> > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > with rounding up and remembering leftovers.
> > >
> > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > working, instead of saving a pointer to the memory cgroup directly an
> > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > easily changed to the parent) with a built-in reference counter. This scheme
> > > allows to reparent all allocated objects without walking them over and
> > > changing memcg pointer to the parent.
> > >
> > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > two global sets are used: the root set for non-accounted and root-cgroup
> > > allocations and the second set for all other allocations. This allows to
> > > simplify the lifetime management of individual kmem_caches: they are
> > > destroyed with root counterparts. It allows to remove a good amount of code
> > > and make things generally simpler.
> > >
> > > The patchset* has been tested on a number of different workloads in our
> > > production. In all cases it saved significant amount of memory, measured
> > > from high hundreds of MBs to single GBs per host. On average, the size
> > > of slab memory has been reduced by 35-45%.
> >
> > Here are some numbers from multiple runs of sysbench and kernel compilation
> > with this patchset on a 10 core POWER8 host:
> >
> > ==========================================================================
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > of a mem cgroup (Sampling every 5s)
> > --------------------------------------------------------------------------
> > 5.5.0-rc7-mm1 +slab patch %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes 15859712 4456448 72
> > memory.usage_in_bytes 337510400 335806464 .5
> > Slab: (kB) 814336 607296 25
> >
> > memory.kmem.usage_in_bytes 16187392 4653056 71
> > memory.usage_in_bytes 318832640 300154880 5
> > Slab: (kB) 789888 559744 29
> > --------------------------------------------------------------------------
> >
> >
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > done from bash that is in a memory cgroup. (Sampling every 5s)
> > --------------------------------------------------------------------------
> > 5.5.0-rc7-mm1 +slab patch %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes 338493440 231931904 31
> > memory.usage_in_bytes 7368015872 6275923968 15
> > Slab: (kB) 1139072 785408 31
> >
> > memory.kmem.usage_in_bytes 341835776 236453888 30
> > memory.usage_in_bytes 6540427264 6072893440 7
> > Slab: (kB) 1074304 761280 29
> >
> > memory.kmem.usage_in_bytes 340525056 233570304 31
> > memory.usage_in_bytes 6406209536 6177357824 3
> > Slab: (kB) 1244288 739712 40
> > --------------------------------------------------------------------------
> >
> > Slab consumption right after boot
> > --------------------------------------------------------------------------
> > 5.5.0-rc7-mm1 +slab patch %reduction
> > --------------------------------------------------------------------------
> > Slab: (kB) 821888 583424 29
> > ==========================================================================
> >
> > Summary:
> >
> > With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
> > around 70% and 30% reduction consistently.
> >
> > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > kernel compilation.
> >
> > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > same is seen right after boot too.
>
> That's just perfect!
>
> memory.usage_in_bytes was most likely the same because the freed space
> was taken by pagecache.
>
> Thank you very much for testing!
>
> Roman

2020-08-12 23:22:37

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

BTW, I replied to a wrong version of this work. I intended to reply to
version 7:
https://lore.kernel.org/lkml/[email protected]/

Nevertheless, the problem is the same.

Thank you,
Pasha

On Wed, Aug 12, 2020 at 7:16 PM Pavel Tatashin
<[email protected]> wrote:
>
> Guys,
>
> There is a convoluted deadlock that I just root caused, and that is
> fixed by this work (at least based on my code inspection it appears to
> be fixed); but the deadlock exists in older and stable kernels, and I
> am not sure whether to create a separate patch for it, or backport
> this whole thing.
>
> Thread #1: Hot-removes memory
> device_offline
> memory_subsys_offline
> offline_pages
> __offline_pages
> mem_hotplug_lock <- write access
> waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> migrate it.
>
> Thread #2: ccs killer kthread
> css_killed_work_fn
> cgroup_mutex <- Grab this Mutex
> mem_cgroup_css_offline
> memcg_offline_kmem.part
> memcg_deactivate_kmem_caches
> get_online_mems
> mem_hotplug_lock <- waits for Thread#1 to get read access
>
> Thread #3: crashing userland program
> do_coredump
> elf_core_dump
> get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> dump_emit
> __kernel_write
> __vfs_write
> new_sync_write
> pipe_write
> pipe_wait -> waits for Thread #4 systemd-coredump to
> read the pipe
>
> Thread #4: systemd-coredump
> ksys_read
> vfs_read
> __vfs_read
> seq_read
> proc_single_show
> proc_cgroup_show
> cgroup_mutex -> waits from Thread #2 for this lock.
>
> In Summary:
> Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> waits for Thread#1 for mem_hotplug_lock rwlock.
>
> This work appears to fix this deadlock because cgroup_mutex is not
> called anymore before mem_hotplug_lock (unless I am missing it), as it
> removes memcg_deactivate_kmem_caches.
>
> Thank you,
> Pasha
>
> On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > The existing cgroup slab memory controller is based on the idea of
> > > > replicating slab allocator internals for each memory cgroup.
> > > > This approach promises a low memory overhead (one pointer per page),
> > > > and isn't adding too much code on hot allocation and release paths.
> > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > >
> > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > a number of machines running different production workloads. In most
> > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > it brings back 30-50% of slab memory. It means that the real price
> > > > of the existing slab memory controller is way bigger than a pointer
> > > > per page.
> > > >
> > > > The real reason why the existing design leads to a low slab utilization
> > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > If there are only few allocations of certain size made by a cgroup,
> > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > deleted, or the cgroup contains a single-threaded application which is
> > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > in all these cases the resulting slab utilization is very low.
> > > > If kmem accounting is off, the kernel is able to use free space
> > > > on slab pages for other allocations.
> > > >
> > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > introduced and was an opt-in feature, which had to be turned on
> > > > individually for each memory cgroup. But now it's turned on by default
> > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > create a large number of cgroups.
> > > >
> > > > This patchset provides a new implementation of the slab memory controller,
> > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > between multiple memory cgroups. Below is the short description of the new
> > > > design (more details in commit messages).
> > > >
> > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > with rounding up and remembering leftovers.
> > > >
> > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > allows to reparent all allocated objects without walking them over and
> > > > changing memcg pointer to the parent.
> > > >
> > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > allocations and the second set for all other allocations. This allows to
> > > > simplify the lifetime management of individual kmem_caches: they are
> > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > and make things generally simpler.
> > > >
> > > > The patchset* has been tested on a number of different workloads in our
> > > > production. In all cases it saved significant amount of memory, measured
> > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > of slab memory has been reduced by 35-45%.
> > >
> > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > with this patchset on a 10 core POWER8 host:
> > >
> > > ==========================================================================
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > of a mem cgroup (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes 15859712 4456448 72
> > > memory.usage_in_bytes 337510400 335806464 .5
> > > Slab: (kB) 814336 607296 25
> > >
> > > memory.kmem.usage_in_bytes 16187392 4653056 71
> > > memory.usage_in_bytes 318832640 300154880 5
> > > Slab: (kB) 789888 559744 29
> > > --------------------------------------------------------------------------
> > >
> > >
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes 338493440 231931904 31
> > > memory.usage_in_bytes 7368015872 6275923968 15
> > > Slab: (kB) 1139072 785408 31
> > >
> > > memory.kmem.usage_in_bytes 341835776 236453888 30
> > > memory.usage_in_bytes 6540427264 6072893440 7
> > > Slab: (kB) 1074304 761280 29
> > >
> > > memory.kmem.usage_in_bytes 340525056 233570304 31
> > > memory.usage_in_bytes 6406209536 6177357824 3
> > > Slab: (kB) 1244288 739712 40
> > > --------------------------------------------------------------------------
> > >
> > > Slab consumption right after boot
> > > --------------------------------------------------------------------------
> > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > --------------------------------------------------------------------------
> > > Slab: (kB) 821888 583424 29
> > > ==========================================================================
> > >
> > > Summary:
> > >
> > > With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
> > > around 70% and 30% reduction consistently.
> > >
> > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > kernel compilation.
> > >
> > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > same is seen right after boot too.
> >
> > That's just perfect!
> >
> > memory.usage_in_bytes was most likely the same because the freed space
> > was taken by pagecache.
> >
> > Thank you very much for testing!
> >
> > Roman

2020-08-13 00:07:52

by Roman Gushchin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> Guys,
>
> There is a convoluted deadlock that I just root caused, and that is
> fixed by this work (at least based on my code inspection it appears to
> be fixed); but the deadlock exists in older and stable kernels, and I
> am not sure whether to create a separate patch for it, or backport
> this whole thing.

Hi Pavel,

wow, it's a quite complicated deadlock. Thank you for providing
a perfect analysis!

Unfortunately, backporting the whole new slab controller isn't an option:
it's way too big and invasive.
Do you already have a standalone fix?

Thanks!

>
> Thread #1: Hot-removes memory
> device_offline
> memory_subsys_offline
> offline_pages
> __offline_pages
> mem_hotplug_lock <- write access
> waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> migrate it.
>
> Thread #2: ccs killer kthread
> css_killed_work_fn
> cgroup_mutex <- Grab this Mutex
> mem_cgroup_css_offline
> memcg_offline_kmem.part
> memcg_deactivate_kmem_caches
> get_online_mems
> mem_hotplug_lock <- waits for Thread#1 to get read access
>
> Thread #3: crashing userland program
> do_coredump
> elf_core_dump
> get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> dump_emit
> __kernel_write
> __vfs_write
> new_sync_write
> pipe_write
> pipe_wait -> waits for Thread #4 systemd-coredump to
> read the pipe
>
> Thread #4: systemd-coredump
> ksys_read
> vfs_read
> __vfs_read
> seq_read
> proc_single_show
> proc_cgroup_show
> cgroup_mutex -> waits from Thread #2 for this lock.

>
> In Summary:
> Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> waits for Thread#1 for mem_hotplug_lock rwlock.
>
> This work appears to fix this deadlock because cgroup_mutex is not
> called anymore before mem_hotplug_lock (unless I am missing it), as it
> removes memcg_deactivate_kmem_caches.
>
> Thank you,
> Pasha
>
> On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > The existing cgroup slab memory controller is based on the idea of
> > > > replicating slab allocator internals for each memory cgroup.
> > > > This approach promises a low memory overhead (one pointer per page),
> > > > and isn't adding too much code on hot allocation and release paths.
> > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > >
> > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > a number of machines running different production workloads. In most
> > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > it brings back 30-50% of slab memory. It means that the real price
> > > > of the existing slab memory controller is way bigger than a pointer
> > > > per page.
> > > >
> > > > The real reason why the existing design leads to a low slab utilization
> > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > If there are only few allocations of certain size made by a cgroup,
> > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > deleted, or the cgroup contains a single-threaded application which is
> > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > in all these cases the resulting slab utilization is very low.
> > > > If kmem accounting is off, the kernel is able to use free space
> > > > on slab pages for other allocations.
> > > >
> > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > introduced and was an opt-in feature, which had to be turned on
> > > > individually for each memory cgroup. But now it's turned on by default
> > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > create a large number of cgroups.
> > > >
> > > > This patchset provides a new implementation of the slab memory controller,
> > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > between multiple memory cgroups. Below is the short description of the new
> > > > design (more details in commit messages).
> > > >
> > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > with rounding up and remembering leftovers.
> > > >
> > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > allows to reparent all allocated objects without walking them over and
> > > > changing memcg pointer to the parent.
> > > >
> > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > allocations and the second set for all other allocations. This allows to
> > > > simplify the lifetime management of individual kmem_caches: they are
> > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > and make things generally simpler.
> > > >
> > > > The patchset* has been tested on a number of different workloads in our
> > > > production. In all cases it saved significant amount of memory, measured
> > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > of slab memory has been reduced by 35-45%.
> > >
> > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > with this patchset on a 10 core POWER8 host:
> > >
> > > ==========================================================================
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > of a mem cgroup (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes 15859712 4456448 72
> > > memory.usage_in_bytes 337510400 335806464 .5
> > > Slab: (kB) 814336 607296 25
> > >
> > > memory.kmem.usage_in_bytes 16187392 4653056 71
> > > memory.usage_in_bytes 318832640 300154880 5
> > > Slab: (kB) 789888 559744 29
> > > --------------------------------------------------------------------------
> > >
> > >
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes 338493440 231931904 31
> > > memory.usage_in_bytes 7368015872 6275923968 15
> > > Slab: (kB) 1139072 785408 31
> > >
> > > memory.kmem.usage_in_bytes 341835776 236453888 30
> > > memory.usage_in_bytes 6540427264 6072893440 7
> > > Slab: (kB) 1074304 761280 29
> > >
> > > memory.kmem.usage_in_bytes 340525056 233570304 31
> > > memory.usage_in_bytes 6406209536 6177357824 3
> > > Slab: (kB) 1244288 739712 40
> > > --------------------------------------------------------------------------
> > >
> > > Slab consumption right after boot
> > > --------------------------------------------------------------------------
> > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > --------------------------------------------------------------------------
> > > Slab: (kB) 821888 583424 29
> > > ==========================================================================
> > >
> > > Summary:
> > >
> > > With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
> > > around 70% and 30% reduction consistently.
> > >
> > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > kernel compilation.
> > >
> > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > same is seen right after boot too.
> >
> > That's just perfect!
> >
> > memory.usage_in_bytes was most likely the same because the freed space
> > was taken by pagecache.
> >
> > Thank you very much for testing!
> >
> > Roman

2020-08-13 00:32:43

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <[email protected]> wrote:
>
> On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> > Guys,
> >
> > There is a convoluted deadlock that I just root caused, and that is
> > fixed by this work (at least based on my code inspection it appears to
> > be fixed); but the deadlock exists in older and stable kernels, and I
> > am not sure whether to create a separate patch for it, or backport
> > this whole thing.
>

Hi Roman,

> Hi Pavel,
>
> wow, it's a quite complicated deadlock. Thank you for providing
> a perfect analysis!

Thank you, it indeed took me a while to fully grasp the deadlock.

>
> Unfortunately, backporting the whole new slab controller isn't an option:
> it's way too big and invasive.

This is what I thought as well, this is why I want to figure out what
is the best way forward.

> Do you already have a standalone fix?

Not yet, I do not have a standalone fix. I suspect the best fix would
be to address fix css_killed_work_fn() stack so we never have:
cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
the order would work. If you have suggestions since you worked on this
code recently, please let me know.

Thank you,
Pasha

>
> Thanks!
>
>
> >
> > Thread #1: Hot-removes memory
> > device_offline
> > memory_subsys_offline
> > offline_pages
> > __offline_pages
> > mem_hotplug_lock <- write access
> > waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> > migrate it.
> >
> > Thread #2: ccs killer kthread
> > css_killed_work_fn
> > cgroup_mutex <- Grab this Mutex
> > mem_cgroup_css_offline
> > memcg_offline_kmem.part
> > memcg_deactivate_kmem_caches
> > get_online_mems
> > mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > Thread #3: crashing userland program
> > do_coredump
> > elf_core_dump
> > get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> > dump_emit
> > __kernel_write
> > __vfs_write
> > new_sync_write
> > pipe_write
> > pipe_wait -> waits for Thread #4 systemd-coredump to
> > read the pipe
> >
> > Thread #4: systemd-coredump
> > ksys_read
> > vfs_read
> > __vfs_read
> > seq_read
> > proc_single_show
> > proc_cgroup_show
> > cgroup_mutex -> waits from Thread #2 for this lock.
>
> >
> > In Summary:
> > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> > waits for Thread#1 for mem_hotplug_lock rwlock.
> >
> > This work appears to fix this deadlock because cgroup_mutex is not
> > called anymore before mem_hotplug_lock (unless I am missing it), as it
> > removes memcg_deactivate_kmem_caches.
> >
> > Thank you,
> > Pasha
> >
> > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <[email protected]> wrote:
> > >
> > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > > The existing cgroup slab memory controller is based on the idea of
> > > > > replicating slab allocator internals for each memory cgroup.
> > > > > This approach promises a low memory overhead (one pointer per page),
> > > > > and isn't adding too much code on hot allocation and release paths.
> > > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > > >
> > > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > > a number of machines running different production workloads. In most
> > > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > > it brings back 30-50% of slab memory. It means that the real price
> > > > > of the existing slab memory controller is way bigger than a pointer
> > > > > per page.
> > > > >
> > > > > The real reason why the existing design leads to a low slab utilization
> > > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > > If there are only few allocations of certain size made by a cgroup,
> > > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > > deleted, or the cgroup contains a single-threaded application which is
> > > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > > in all these cases the resulting slab utilization is very low.
> > > > > If kmem accounting is off, the kernel is able to use free space
> > > > > on slab pages for other allocations.
> > > > >
> > > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > > introduced and was an opt-in feature, which had to be turned on
> > > > > individually for each memory cgroup. But now it's turned on by default
> > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > > create a large number of cgroups.
> > > > >
> > > > > This patchset provides a new implementation of the slab memory controller,
> > > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > > between multiple memory cgroups. Below is the short description of the new
> > > > > design (more details in commit messages).
> > > > >
> > > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > > with rounding up and remembering leftovers.
> > > > >
> > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > > allows to reparent all allocated objects without walking them over and
> > > > > changing memcg pointer to the parent.
> > > > >
> > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > > allocations and the second set for all other allocations. This allows to
> > > > > simplify the lifetime management of individual kmem_caches: they are
> > > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > > and make things generally simpler.
> > > > >
> > > > > The patchset* has been tested on a number of different workloads in our
> > > > > production. In all cases it saved significant amount of memory, measured
> > > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > > of slab memory has been reduced by 35-45%.
> > > >
> > > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > > with this patchset on a 10 core POWER8 host:
> > > >
> > > > ==========================================================================
> > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > > of a mem cgroup (Sampling every 5s)
> > > > --------------------------------------------------------------------------
> > > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > > --------------------------------------------------------------------------
> > > > memory.kmem.usage_in_bytes 15859712 4456448 72
> > > > memory.usage_in_bytes 337510400 335806464 .5
> > > > Slab: (kB) 814336 607296 25
> > > >
> > > > memory.kmem.usage_in_bytes 16187392 4653056 71
> > > > memory.usage_in_bytes 318832640 300154880 5
> > > > Slab: (kB) 789888 559744 29
> > > > --------------------------------------------------------------------------
> > > >
> > > >
> > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > > --------------------------------------------------------------------------
> > > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > > --------------------------------------------------------------------------
> > > > memory.kmem.usage_in_bytes 338493440 231931904 31
> > > > memory.usage_in_bytes 7368015872 6275923968 15
> > > > Slab: (kB) 1139072 785408 31
> > > >
> > > > memory.kmem.usage_in_bytes 341835776 236453888 30
> > > > memory.usage_in_bytes 6540427264 6072893440 7
> > > > Slab: (kB) 1074304 761280 29
> > > >
> > > > memory.kmem.usage_in_bytes 340525056 233570304 31
> > > > memory.usage_in_bytes 6406209536 6177357824 3
> > > > Slab: (kB) 1244288 739712 40
> > > > --------------------------------------------------------------------------
> > > >
> > > > Slab consumption right after boot
> > > > --------------------------------------------------------------------------
> > > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > > --------------------------------------------------------------------------
> > > > Slab: (kB) 821888 583424 29
> > > > ==========================================================================
> > > >
> > > > Summary:
> > > >
> > > > With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
> > > > around 70% and 30% reduction consistently.
> > > >
> > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > > kernel compilation.
> > > >
> > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > > same is seen right after boot too.
> > >
> > > That's just perfect!
> > >
> > > memory.usage_in_bytes was most likely the same because the freed space
> > > was taken by pagecache.
> > >
> > > Thank you very much for testing!
> > >
> > > Roman

2020-08-28 16:49:09

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

There appears to be another problem that is related to the
cgroup_mutex -> mem_hotplug_lock deadlock described above.

In the original deadlock that I described, the workaround is to
replace crash dump from piping to Linux traditional save to files
method. However, after trying this workaround, I still observed
hardware watchdog resets during machine shutdown.

The new problem occurs for the following reason: upon shutdown systemd
calls a service that hot-removes memory, and if hot-removing fails for
some reason systemd kills that service after timeout. However, systemd
is never able to kill the service, and we get hardware reset caused by
watchdog or a hang during shutdown:

Thread #1: memory hot-remove systemd service
Loops indefinitely, because if there is something still to be migrated
this loop never terminates. However, this loop can be terminated via
signal from systemd after timeout.
__offline_pages()
do {
pfn = scan_movable_pages(pfn, end_pfn);
# Returns 0, meaning there is nothing available to
# migrate, no page is PageLRU(page)
...
ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
NULL, check_pages_isolated_cb);
# Returns -EBUSY, meaning there is at least one PFN that
# still has to be migrated.
} while (ret);

Thread #2: ccs killer kthread
css_killed_work_fn
cgroup_mutex <- Grab this Mutex
mem_cgroup_css_offline
memcg_offline_kmem.part
memcg_deactivate_kmem_caches
get_online_mems
mem_hotplug_lock <- waits for Thread#1 to get read access

Thread #3: systemd
ksys_read
vfs_read
__vfs_read
seq_read
proc_single_show
proc_cgroup_show
mutex_lock -> wait for cgroup_mutex that is owned by Thread #2

Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
to thread #1.

The proper fix for both of the problems is to avoid cgroup_mutex ->
mem_hotplug_lock ordering that was recently fixed in the mainline but
still present in all stable branches. Unfortunately, I do not see a
simple fix in how to remove mem_hotplug_lock from
memcg_deactivate_kmem_caches without using Roman's series that is too
big for stable.

Thanks,
Pasha

On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin
<[email protected]> wrote:
>
> On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> > > Guys,
> > >
> > > There is a convoluted deadlock that I just root caused, and that is
> > > fixed by this work (at least based on my code inspection it appears to
> > > be fixed); but the deadlock exists in older and stable kernels, and I
> > > am not sure whether to create a separate patch for it, or backport
> > > this whole thing.
> >
>
> Hi Roman,
>
> > Hi Pavel,
> >
> > wow, it's a quite complicated deadlock. Thank you for providing
> > a perfect analysis!
>
> Thank you, it indeed took me a while to fully grasp the deadlock.
>
> >
> > Unfortunately, backporting the whole new slab controller isn't an option:
> > it's way too big and invasive.
>
> This is what I thought as well, this is why I want to figure out what
> is the best way forward.
>
> > Do you already have a standalone fix?
>
> Not yet, I do not have a standalone fix. I suspect the best fix would
> be to address fix css_killed_work_fn() stack so we never have:
> cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
> the order would work. If you have suggestions since you worked on this
> code recently, please let me know.
>
> Thank you,
> Pasha
>
> >
> > Thanks!
> >
> >
> > >
> > > Thread #1: Hot-removes memory
> > > device_offline
> > > memory_subsys_offline
> > > offline_pages
> > > __offline_pages
> > > mem_hotplug_lock <- write access
> > > waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> > > migrate it.
> > >
> > > Thread #2: ccs killer kthread
> > > css_killed_work_fn
> > > cgroup_mutex <- Grab this Mutex
> > > mem_cgroup_css_offline
> > > memcg_offline_kmem.part
> > > memcg_deactivate_kmem_caches
> > > get_online_mems
> > > mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: crashing userland program
> > > do_coredump
> > > elf_core_dump
> > > get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> > > dump_emit
> > > __kernel_write
> > > __vfs_write
> > > new_sync_write
> > > pipe_write
> > > pipe_wait -> waits for Thread #4 systemd-coredump to
> > > read the pipe
> > >
> > > Thread #4: systemd-coredump
> > > ksys_read
> > > vfs_read
> > > __vfs_read
> > > seq_read
> > > proc_single_show
> > > proc_cgroup_show
> > > cgroup_mutex -> waits from Thread #2 for this lock.
> >
> > >
> > > In Summary:
> > > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> > > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> > > waits for Thread#1 for mem_hotplug_lock rwlock.
> > >
> > > This work appears to fix this deadlock because cgroup_mutex is not
> > > called anymore before mem_hotplug_lock (unless I am missing it), as it
> > > removes memcg_deactivate_kmem_caches.
> > >
> > > Thank you,
> > > Pasha
> > >
> > > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <[email protected]> wrote:
> > > >
> > > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > > > The existing cgroup slab memory controller is based on the idea of
> > > > > > replicating slab allocator internals for each memory cgroup.
> > > > > > This approach promises a low memory overhead (one pointer per page),
> > > > > > and isn't adding too much code on hot allocation and release paths.
> > > > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > > > >
> > > > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > > > a number of machines running different production workloads. In most
> > > > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > > > it brings back 30-50% of slab memory. It means that the real price
> > > > > > of the existing slab memory controller is way bigger than a pointer
> > > > > > per page.
> > > > > >
> > > > > > The real reason why the existing design leads to a low slab utilization
> > > > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > > > If there are only few allocations of certain size made by a cgroup,
> > > > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > > > deleted, or the cgroup contains a single-threaded application which is
> > > > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > > > in all these cases the resulting slab utilization is very low.
> > > > > > If kmem accounting is off, the kernel is able to use free space
> > > > > > on slab pages for other allocations.
> > > > > >
> > > > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > > > introduced and was an opt-in feature, which had to be turned on
> > > > > > individually for each memory cgroup. But now it's turned on by default
> > > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > > > create a large number of cgroups.
> > > > > >
> > > > > > This patchset provides a new implementation of the slab memory controller,
> > > > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > > > between multiple memory cgroups. Below is the short description of the new
> > > > > > design (more details in commit messages).
> > > > > >
> > > > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > > > with rounding up and remembering leftovers.
> > > > > >
> > > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > > > allows to reparent all allocated objects without walking them over and
> > > > > > changing memcg pointer to the parent.
> > > > > >
> > > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > > > allocations and the second set for all other allocations. This allows to
> > > > > > simplify the lifetime management of individual kmem_caches: they are
> > > > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > > > and make things generally simpler.
> > > > > >
> > > > > > The patchset* has been tested on a number of different workloads in our
> > > > > > production. In all cases it saved significant amount of memory, measured
> > > > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > > > of slab memory has been reduced by 35-45%.
> > > > >
> > > > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > > > with this patchset on a 10 core POWER8 host:
> > > > >
> > > > > ==========================================================================
> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > > > of a mem cgroup (Sampling every 5s)
> > > > > --------------------------------------------------------------------------
> > > > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > > > --------------------------------------------------------------------------
> > > > > memory.kmem.usage_in_bytes 15859712 4456448 72
> > > > > memory.usage_in_bytes 337510400 335806464 .5
> > > > > Slab: (kB) 814336 607296 25
> > > > >
> > > > > memory.kmem.usage_in_bytes 16187392 4653056 71
> > > > > memory.usage_in_bytes 318832640 300154880 5
> > > > > Slab: (kB) 789888 559744 29
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > >
> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > > > --------------------------------------------------------------------------
> > > > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > > > --------------------------------------------------------------------------
> > > > > memory.kmem.usage_in_bytes 338493440 231931904 31
> > > > > memory.usage_in_bytes 7368015872 6275923968 15
> > > > > Slab: (kB) 1139072 785408 31
> > > > >
> > > > > memory.kmem.usage_in_bytes 341835776 236453888 30
> > > > > memory.usage_in_bytes 6540427264 6072893440 7
> > > > > Slab: (kB) 1074304 761280 29
> > > > >
> > > > > memory.kmem.usage_in_bytes 340525056 233570304 31
> > > > > memory.usage_in_bytes 6406209536 6177357824 3
> > > > > Slab: (kB) 1244288 739712 40
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > > Slab consumption right after boot
> > > > > --------------------------------------------------------------------------
> > > > > 5.5.0-rc7-mm1 +slab patch %reduction
> > > > > --------------------------------------------------------------------------
> > > > > Slab: (kB) 821888 583424 29
> > > > > ==========================================================================
> > > > >
> > > > > Summary:
> > > > >
> > > > > With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
> > > > > around 70% and 30% reduction consistently.
> > > > >
> > > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > > > kernel compilation.
> > > > >
> > > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > > > same is seen right after boot too.
> > > >
> > > > That's just perfect!
> > > >
> > > > memory.usage_in_bytes was most likely the same because the freed space
> > > > was taken by pagecache.
> > > >
> > > > Thank you very much for testing!
> > > >
> > > > Roman

2020-09-01 05:29:57

by Bharata B Rao

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
>
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine shutdown.
>
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for
> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
>
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
> do {
> pfn = scan_movable_pages(pfn, end_pfn);
> # Returns 0, meaning there is nothing available to
> # migrate, no page is PageLRU(page)
> ...
> ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> NULL, check_pages_isolated_cb);
> # Returns -EBUSY, meaning there is at least one PFN that
> # still has to be migrated.
> } while (ret);
>
> Thread #2: ccs killer kthread
> css_killed_work_fn
> cgroup_mutex <- Grab this Mutex
> mem_cgroup_css_offline
> memcg_offline_kmem.part
> memcg_deactivate_kmem_caches
> get_online_mems
> mem_hotplug_lock <- waits for Thread#1 to get read access
>
> Thread #3: systemd
> ksys_read
> vfs_read
> __vfs_read
> seq_read
> proc_single_show
> proc_cgroup_show
> mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
>
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
>
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.

We too are seeing this on Power systems when stress-testing memory
hotplug, but with the following call trace (from hung task timer)
instead of Thread #2 above:

__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
get_online_mems
memcg_create_kmem_cache
memcg_kmem_cache_create_func
process_one_work
worker_thread
kthread
ret_from_kernel_thread

While I understand that Roman's new slab controller patchset will fix
this, I also wonder if infinitely looping in the memory unplug path
with mem_hotplug_lock held is the right thing to do? Earlier we had
a few other exit possibilities in this path (like max retries etc)
but those were removed by commits:

72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory

Or, is the user-space test is expected to induce a signal back-off when
unplug doesn't complete within a reasonable amount of time?

Regards,
Bharata.

2020-09-01 12:54:13

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <[email protected]> wrote:
>
> On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine shutdown.
> >
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> >
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> > do {
> > pfn = scan_movable_pages(pfn, end_pfn);
> > # Returns 0, meaning there is nothing available to
> > # migrate, no page is PageLRU(page)
> > ...
> > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > NULL, check_pages_isolated_cb);
> > # Returns -EBUSY, meaning there is at least one PFN that
> > # still has to be migrated.
> > } while (ret);
> >
> > Thread #2: ccs killer kthread
> > css_killed_work_fn
> > cgroup_mutex <- Grab this Mutex
> > mem_cgroup_css_offline
> > memcg_offline_kmem.part
> > memcg_deactivate_kmem_caches
> > get_online_mems
> > mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > Thread #3: systemd
> > ksys_read
> > vfs_read
> > __vfs_read
> > seq_read
> > proc_single_show
> > proc_cgroup_show
> > mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> >
> > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > to thread #1.
> >
> > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > still present in all stable branches. Unfortunately, I do not see a
> > simple fix in how to remove mem_hotplug_lock from
> > memcg_deactivate_kmem_caches without using Roman's series that is too
> > big for stable.
>
> We too are seeing this on Power systems when stress-testing memory
> hotplug, but with the following call trace (from hung task timer)
> instead of Thread #2 above:
>
> __switch_to
> __schedule
> schedule
> percpu_rwsem_wait
> __percpu_down_read
> get_online_mems
> memcg_create_kmem_cache
> memcg_kmem_cache_create_func
> process_one_work
> worker_thread
> kthread
> ret_from_kernel_thread
>
> While I understand that Roman's new slab controller patchset will fix
> this, I also wonder if infinitely looping in the memory unplug path
> with mem_hotplug_lock held is the right thing to do? Earlier we had
> a few other exit possibilities in this path (like max retries etc)
> but those were removed by commits:
>
> 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
>
> Or, is the user-space test is expected to induce a signal back-off when
> unplug doesn't complete within a reasonable amount of time?

Hi Bharata,

Thank you for your input, it looks like you are experiencing the same
problems that I observed.

What I found is that the reason why our machines did not complete
hot-remove within the given time is because of this bug:
https://lore.kernel.org/linux-mm/[email protected]

Could you please try it and see if that helps for your case?

Thank you,
Pasha

2020-09-02 06:24:38

by Bharata B Rao

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Tue, Sep 01, 2020 at 08:52:05AM -0400, Pavel Tatashin wrote:
> On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <[email protected]> wrote:
> >
> > On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > > There appears to be another problem that is related to the
> > > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >
> > > In the original deadlock that I described, the workaround is to
> > > replace crash dump from piping to Linux traditional save to files
> > > method. However, after trying this workaround, I still observed
> > > hardware watchdog resets during machine shutdown.
> > >
> > > The new problem occurs for the following reason: upon shutdown systemd
> > > calls a service that hot-removes memory, and if hot-removing fails for
> > > some reason systemd kills that service after timeout. However, systemd
> > > is never able to kill the service, and we get hardware reset caused by
> > > watchdog or a hang during shutdown:
> > >
> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > > do {
> > > pfn = scan_movable_pages(pfn, end_pfn);
> > > # Returns 0, meaning there is nothing available to
> > > # migrate, no page is PageLRU(page)
> > > ...
> > > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > > NULL, check_pages_isolated_cb);
> > > # Returns -EBUSY, meaning there is at least one PFN that
> > > # still has to be migrated.
> > > } while (ret);
> > >
> > > Thread #2: ccs killer kthread
> > > css_killed_work_fn
> > > cgroup_mutex <- Grab this Mutex
> > > mem_cgroup_css_offline
> > > memcg_offline_kmem.part
> > > memcg_deactivate_kmem_caches
> > > get_online_mems
> > > mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: systemd
> > > ksys_read
> > > vfs_read
> > > __vfs_read
> > > seq_read
> > > proc_single_show
> > > proc_cgroup_show
> > > mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> > >
> > > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > > to thread #1.
> > >
> > > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > > still present in all stable branches. Unfortunately, I do not see a
> > > simple fix in how to remove mem_hotplug_lock from
> > > memcg_deactivate_kmem_caches without using Roman's series that is too
> > > big for stable.
> >
> > We too are seeing this on Power systems when stress-testing memory
> > hotplug, but with the following call trace (from hung task timer)
> > instead of Thread #2 above:
> >
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > __percpu_down_read
> > get_online_mems
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> > ret_from_kernel_thread
> >
> > While I understand that Roman's new slab controller patchset will fix
> > this, I also wonder if infinitely looping in the memory unplug path
> > with mem_hotplug_lock held is the right thing to do? Earlier we had
> > a few other exit possibilities in this path (like max retries etc)
> > but those were removed by commits:
> >
> > 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> > ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
> >
> > Or, is the user-space test is expected to induce a signal back-off when
> > unplug doesn't complete within a reasonable amount of time?
>
> Hi Bharata,
>
> Thank you for your input, it looks like you are experiencing the same
> problems that I observed.
>
> What I found is that the reason why our machines did not complete
> hot-remove within the given time is because of this bug:
> https://lore.kernel.org/linux-mm/[email protected]
>
> Could you please try it and see if that helps for your case?

I am on an old codebase that already has the fix that you are proposing,
so I might be seeing someother issue which I will debug further.

So looks like the loop in __offline_pages() had a call to
drain_all_pages() before it was removed by

c52e75935f8d: mm: remove extra drain pages on pcp list

Regards,
Bharata.

2020-09-02 09:54:07

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
>
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine shutdown.
>
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for

Why is that hotremove even needed if we're shutting down? Are there any
(virtualization?) platforms where it makes some difference over plain
shutdown/restart?

> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
>
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
> do {
> pfn = scan_movable_pages(pfn, end_pfn);
> # Returns 0, meaning there is nothing available to
> # migrate, no page is PageLRU(page)
> ...
> ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> NULL, check_pages_isolated_cb);
> # Returns -EBUSY, meaning there is at least one PFN that
> # still has to be migrated.
> } while (ret);
>
> Thread #2: ccs killer kthread
> css_killed_work_fn
> cgroup_mutex <- Grab this Mutex
> mem_cgroup_css_offline
> memcg_offline_kmem.part
> memcg_deactivate_kmem_caches
> get_online_mems
> mem_hotplug_lock <- waits for Thread#1 to get read access
>
> Thread #3: systemd
> ksys_read
> vfs_read
> __vfs_read
> seq_read
> proc_single_show
> proc_cgroup_show
> mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
>
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
>
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.
>
> Thanks,
> Pasha
>
> On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin
> <[email protected]> wrote:
>>
>> On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <[email protected]> wrote:
>> >
>> > On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
>> > > Guys,
>> > >
>> > > There is a convoluted deadlock that I just root caused, and that is
>> > > fixed by this work (at least based on my code inspection it appears to
>> > > be fixed); but the deadlock exists in older and stable kernels, and I
>> > > am not sure whether to create a separate patch for it, or backport
>> > > this whole thing.
>> >
>>
>> Hi Roman,
>>
>> > Hi Pavel,
>> >
>> > wow, it's a quite complicated deadlock. Thank you for providing
>> > a perfect analysis!
>>
>> Thank you, it indeed took me a while to fully grasp the deadlock.
>>
>> >
>> > Unfortunately, backporting the whole new slab controller isn't an option:
>> > it's way too big and invasive.
>>
>> This is what I thought as well, this is why I want to figure out what
>> is the best way forward.
>>
>> > Do you already have a standalone fix?
>>
>> Not yet, I do not have a standalone fix. I suspect the best fix would
>> be to address fix css_killed_work_fn() stack so we never have:
>> cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
>> the order would work. If you have suggestions since you worked on this
>> code recently, please let me know.
>>
>> Thank you,
>> Pasha
>>
>> >
>> > Thanks!
>> >
>> >
>> > >
>> > > Thread #1: Hot-removes memory
>> > > device_offline
>> > > memory_subsys_offline
>> > > offline_pages
>> > > __offline_pages
>> > > mem_hotplug_lock <- write access
>> > > waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
>> > > migrate it.
>> > >
>> > > Thread #2: ccs killer kthread
>> > > css_killed_work_fn
>> > > cgroup_mutex <- Grab this Mutex
>> > > mem_cgroup_css_offline
>> > > memcg_offline_kmem.part
>> > > memcg_deactivate_kmem_caches
>> > > get_online_mems
>> > > mem_hotplug_lock <- waits for Thread#1 to get read access
>> > >
>> > > Thread #3: crashing userland program
>> > > do_coredump
>> > > elf_core_dump
>> > > get_dump_page() -> get page with pfn#9e5113, and increment refcnt
>> > > dump_emit
>> > > __kernel_write
>> > > __vfs_write
>> > > new_sync_write
>> > > pipe_write
>> > > pipe_wait -> waits for Thread #4 systemd-coredump to
>> > > read the pipe
>> > >
>> > > Thread #4: systemd-coredump
>> > > ksys_read
>> > > vfs_read
>> > > __vfs_read
>> > > seq_read
>> > > proc_single_show
>> > > proc_cgroup_show
>> > > cgroup_mutex -> waits from Thread #2 for this lock.
>> >
>> > >
>> > > In Summary:
>> > > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
>> > > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
>> > > waits for Thread#1 for mem_hotplug_lock rwlock.
>> > >
>> > > This work appears to fix this deadlock because cgroup_mutex is not
>> > > called anymore before mem_hotplug_lock (unless I am missing it), as it
>> > > removes memcg_deactivate_kmem_caches.
>> > >
>> > > Thank you,
>> > > Pasha
>> > >
>> > > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <[email protected]> wrote:
>> > > >
>> > > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
>> > > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
>> > > > > > The existing cgroup slab memory controller is based on the idea of
>> > > > > > replicating slab allocator internals for each memory cgroup.
>> > > > > > This approach promises a low memory overhead (one pointer per page),
>> > > > > > and isn't adding too much code on hot allocation and release paths.
>> > > > > > But is has a very serious flaw: it leads to a low slab utilization.
>> > > > > >
>> > > > > > Using a drgn* script I've got an estimation of slab utilization on
>> > > > > > a number of machines running different production workloads. In most
>> > > > > > cases it was between 45% and 65%, and the best number I've seen was
>> > > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
>> > > > > > it brings back 30-50% of slab memory. It means that the real price
>> > > > > > of the existing slab memory controller is way bigger than a pointer
>> > > > > > per page.
>> > > > > >
>> > > > > > The real reason why the existing design leads to a low slab utilization
>> > > > > > is simple: slab pages are used exclusively by one memory cgroup.
>> > > > > > If there are only few allocations of certain size made by a cgroup,
>> > > > > > or if some active objects (e.g. dentries) are left after the cgroup is
>> > > > > > deleted, or the cgroup contains a single-threaded application which is
>> > > > > > barely allocating any kernel objects, but does it every time on a new CPU:
>> > > > > > in all these cases the resulting slab utilization is very low.
>> > > > > > If kmem accounting is off, the kernel is able to use free space
>> > > > > > on slab pages for other allocations.
>> > > > > >
>> > > > > > Arguably it wasn't an issue back to days when the kmem controller was
>> > > > > > introduced and was an opt-in feature, which had to be turned on
>> > > > > > individually for each memory cgroup. But now it's turned on by default
>> > > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
>> > > > > > create a large number of cgroups.
>> > > > > >
>> > > > > > This patchset provides a new implementation of the slab memory controller,
>> > > > > > which aims to reach a much better slab utilization by sharing slab pages
>> > > > > > between multiple memory cgroups. Below is the short description of the new
>> > > > > > design (more details in commit messages).
>> > > > > >
>> > > > > > Accounting is performed per-object instead of per-page. Slab-related
>> > > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
>> > > > > > with rounding up and remembering leftovers.
>> > > > > >
>> > > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
>> > > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
>> > > > > > working, instead of saving a pointer to the memory cgroup directly an
>> > > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
>> > > > > > easily changed to the parent) with a built-in reference counter. This scheme
>> > > > > > allows to reparent all allocated objects without walking them over and
>> > > > > > changing memcg pointer to the parent.
>> > > > > >
>> > > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
>> > > > > > two global sets are used: the root set for non-accounted and root-cgroup
>> > > > > > allocations and the second set for all other allocations. This allows to
>> > > > > > simplify the lifetime management of individual kmem_caches: they are
>> > > > > > destroyed with root counterparts. It allows to remove a good amount of code
>> > > > > > and make things generally simpler.
>> > > > > >
>> > > > > > The patchset* has been tested on a number of different workloads in our
>> > > > > > production. In all cases it saved significant amount of memory, measured
>> > > > > > from high hundreds of MBs to single GBs per host. On average, the size
>> > > > > > of slab memory has been reduced by 35-45%.
>> > > > >
>> > > > > Here are some numbers from multiple runs of sysbench and kernel compilation
>> > > > > with this patchset on a 10 core POWER8 host:
>> > > > >
>> > > > > ==========================================================================
>> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
>> > > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
>> > > > > of a mem cgroup (Sampling every 5s)
>> > > > > --------------------------------------------------------------------------
>> > > > > 5.5.0-rc7-mm1 +slab patch %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > memory.kmem.usage_in_bytes 15859712 4456448 72
>> > > > > memory.usage_in_bytes 337510400 335806464 .5
>> > > > > Slab: (kB) 814336 607296 25
>> > > > >
>> > > > > memory.kmem.usage_in_bytes 16187392 4653056 71
>> > > > > memory.usage_in_bytes 318832640 300154880 5
>> > > > > Slab: (kB) 789888 559744 29
>> > > > > --------------------------------------------------------------------------
>> > > > >
>> > > > >
>> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
>> > > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
>> > > > > done from bash that is in a memory cgroup. (Sampling every 5s)
>> > > > > --------------------------------------------------------------------------
>> > > > > 5.5.0-rc7-mm1 +slab patch %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > memory.kmem.usage_in_bytes 338493440 231931904 31
>> > > > > memory.usage_in_bytes 7368015872 6275923968 15
>> > > > > Slab: (kB) 1139072 785408 31
>> > > > >
>> > > > > memory.kmem.usage_in_bytes 341835776 236453888 30
>> > > > > memory.usage_in_bytes 6540427264 6072893440 7
>> > > > > Slab: (kB) 1074304 761280 29
>> > > > >
>> > > > > memory.kmem.usage_in_bytes 340525056 233570304 31
>> > > > > memory.usage_in_bytes 6406209536 6177357824 3
>> > > > > Slab: (kB) 1244288 739712 40
>> > > > > --------------------------------------------------------------------------
>> > > > >
>> > > > > Slab consumption right after boot
>> > > > > --------------------------------------------------------------------------
>> > > > > 5.5.0-rc7-mm1 +slab patch %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > Slab: (kB) 821888 583424 29
>> > > > > ==========================================================================
>> > > > >
>> > > > > Summary:
>> > > > >
>> > > > > With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows
>> > > > > around 70% and 30% reduction consistently.
>> > > > >
>> > > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
>> > > > > kernel compilation.
>> > > > >
>> > > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
>> > > > > same is seen right after boot too.
>> > > >
>> > > > That's just perfect!
>> > > >
>> > > > memory.usage_in_bytes was most likely the same because the freed space
>> > > > was taken by pagecache.
>> > > >
>> > > > Thank you very much for testing!
>> > > >
>> > > > Roman
>

2020-09-02 10:40:32

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

> Am 02.09.2020 um 11:53 schrieb Vlastimil Babka <[email protected]>:
>
> On 8/28/20 6:47 PM, Pavel Tatashin wrote:
>> There appears to be another problem that is related to the
>> cgroup_mutex -> mem_hotplug_lock deadlock described above.
>>
>> In the original deadlock that I described, the workaround is to
>> replace crash dump from piping to Linux traditional save to files
>> method. However, after trying this workaround, I still observed
>> hardware watchdog resets during machine shutdown.
>>
>> The new problem occurs for the following reason: upon shutdown systemd
>> calls a service that hot-removes memory, and if hot-removing fails for
>
> Why is that hotremove even needed if we're shutting down? Are there any
> (virtualization?) platforms where it makes some difference over plain
> shutdown/restart?

If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)

2020-09-02 11:27:29

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine shutdown.
> >
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
>
> Why is that hotremove even needed if we're shutting down? Are there any
> (virtualization?) platforms where it makes some difference over plain
> shutdown/restart?

Yes this sounds quite dubious.

> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> >
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> > do {
> > pfn = scan_movable_pages(pfn, end_pfn);
> > # Returns 0, meaning there is nothing available to
> > # migrate, no page is PageLRU(page)
> > ...
> > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > NULL, check_pages_isolated_cb);
> > # Returns -EBUSY, meaning there is at least one PFN that
> > # still has to be migrated.
> > } while (ret);

This shouldn't really happen. What does prevent from this to proceed?
Did you manage to catch the specific pfn and what is it used for?
start_isolate_page_range and scan_movable_pages should fail if there is
any memory that cannot be migrated permanently. This is something that
we should focus on when debugging.
--
Michal Hocko
SUSE Labs

2020-09-02 11:33:13

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> >> > > Thread #2: ccs killer kthread
> >> > > css_killed_work_fn
> >> > > cgroup_mutex <- Grab this Mutex
> >> > > mem_cgroup_css_offline
> >> > > memcg_offline_kmem.part
> >> > > memcg_deactivate_kmem_caches
> >> > > get_online_mems
> >> > > mem_hotplug_lock <- waits for Thread#1 to get read access

And one more thing. THis has been brought up several times already.
Maybe I have forgoten but why do we take hotplug locks in this path in
the first place? Memory hotplug notifier takes slab_mutex so this
shouldn't be really needed.
--
Michal Hocko
SUSE Labs

2020-09-02 12:36:29

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

> I am on an old codebase that already has the fix that you are proposing,
> so I might be seeing someother issue which I will debug further.
>
> So looks like the loop in __offline_pages() had a call to
> drain_all_pages() before it was removed by
>
> c52e75935f8d: mm: remove extra drain pages on pcp list

I see, thanks. There is a reason to have the second drain, my fix is a
little better as it is performed only on a rare occasion when it is
needed, but I should add a FIXES tag. I have not checked
alloc_contig_range race.

2020-09-02 12:45:02

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

> > Am 02.09.2020 um 11:53 schrieb Vlastimil Babka <[email protected]>:
> >
> > On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> >> There appears to be another problem that is related to the
> >> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >>
> >> In the original deadlock that I described, the workaround is to
> >> replace crash dump from piping to Linux traditional save to files
> >> method. However, after trying this workaround, I still observed
> >> hardware watchdog resets during machine shutdown.
> >>
> >> The new problem occurs for the following reason: upon shutdown systemd
> >> calls a service that hot-removes memory, and if hot-removing fails for
> >
> > Why is that hotremove even needed if we're shutting down? Are there any
> > (virtualization?) platforms where it makes some difference over plain
> > shutdown/restart?
>
> If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)

Hi David,

This is how we are using it at Microsoft: there is a very large
number of small memory machines (8G each) with low downtime
requirements (reboot must be under a second). There is also a large
state ~2G of memory that we need to transfer during reboot, otherwise
it is very expensive to recreate the state. We have 2G of system
memory memory reserved as a pmem in the device tree, and use it to
pass information across reboots. Once the information is not needed we
hot-add that memory and use it during runtime, before shutdown we
hot-remove the 2G, save the program state on it, and do the reboot.

Pasha

2020-09-02 12:55:09

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > > do {
> > > pfn = scan_movable_pages(pfn, end_pfn);
> > > # Returns 0, meaning there is nothing available to
> > > # migrate, no page is PageLRU(page)
> > > ...
> > > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > > NULL, check_pages_isolated_cb);
> > > # Returns -EBUSY, meaning there is at least one PFN that
> > > # still has to be migrated.
> > > } while (ret);
>

Hi Micahl,

> This shouldn't really happen. What does prevent from this to proceed?
> Did you manage to catch the specific pfn and what is it used for?

I did.

> start_isolate_page_range and scan_movable_pages should fail if there is
> any memory that cannot be migrated permanently. This is something that
> we should focus on when debugging.

I was hitting this issue:
mm/memory_hotplug: drain per-cpu pages again during memory offline
https://lore.kernel.org/lkml/[email protected]

Once the pcp drain race is fixed, this particular deadlock becomes irrelavent.

The lock ordering, however, cgroup_mutex -> mem_hotplug_lock is bad,
and the first race condition that I was hitting and described above is
still present. For now I added a temporary workaround by using save to
file instead of piping the core during shutdown. I am glad the
mainline is fixed, but stables should also have some kind of fix for
this problem.

Pasha

2020-09-02 12:56:26

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> > >> > > Thread #2: ccs killer kthread
> > >> > > css_killed_work_fn
> > >> > > cgroup_mutex <- Grab this Mutex
> > >> > > mem_cgroup_css_offline
> > >> > > memcg_offline_kmem.part
> > >> > > memcg_deactivate_kmem_caches
> > >> > > get_online_mems
> > >> > > mem_hotplug_lock <- waits for Thread#1 to get read access
>
> And one more thing. THis has been brought up several times already.
> Maybe I have forgoten but why do we take hotplug locks in this path in
> the first place? Memory hotplug notifier takes slab_mutex so this
> shouldn't be really needed.

Good point, it seems this lock can be completely removed from
memcg_deactivate_kmem_caches

Pasha

> --
> Michal Hocko
> SUSE Labs

2020-09-02 14:10:57

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Wed 02-09-20 08:42:13, Pavel Tatashin wrote:
> > > Am 02.09.2020 um 11:53 schrieb Vlastimil Babka <[email protected]>:
> > >
> > > On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> > >> There appears to be another problem that is related to the
> > >> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >>
> > >> In the original deadlock that I described, the workaround is to
> > >> replace crash dump from piping to Linux traditional save to files
> > >> method. However, after trying this workaround, I still observed
> > >> hardware watchdog resets during machine shutdown.
> > >>
> > >> The new problem occurs for the following reason: upon shutdown systemd
> > >> calls a service that hot-removes memory, and if hot-removing fails for
> > >
> > > Why is that hotremove even needed if we're shutting down? Are there any
> > > (virtualization?) platforms where it makes some difference over plain
> > > shutdown/restart?
> >
> > If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)
>
> Hi David,
>
> This is how we are using it at Microsoft: there is a very large
> number of small memory machines (8G each) with low downtime
> requirements (reboot must be under a second). There is also a large
> state ~2G of memory that we need to transfer during reboot, otherwise
> it is very expensive to recreate the state. We have 2G of system
> memory memory reserved as a pmem in the device tree, and use it to
> pass information across reboots. Once the information is not needed we
> hot-add that memory and use it during runtime, before shutdown we
> hot-remove the 2G, save the program state on it, and do the reboot.

I still do not get it. So what does guarantee that the memory is
offlineable in the first place? Also what is the difference between
offlining and simply shutting the system down so that the memory is not
used in the first place. In other words what kind of difference
hotremove makes?
--
Michal Hocko
SUSE Labs

2020-09-02 14:11:30

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Wed 02-09-20 08:51:06, Pavel Tatashin wrote:
> > > > Thread #1: memory hot-remove systemd service
> > > > Loops indefinitely, because if there is something still to be migrated
> > > > this loop never terminates. However, this loop can be terminated via
> > > > signal from systemd after timeout.
> > > > __offline_pages()
> > > > do {
> > > > pfn = scan_movable_pages(pfn, end_pfn);
> > > > # Returns 0, meaning there is nothing available to
> > > > # migrate, no page is PageLRU(page)
> > > > ...
> > > > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > > > NULL, check_pages_isolated_cb);
> > > > # Returns -EBUSY, meaning there is at least one PFN that
> > > > # still has to be migrated.
> > > > } while (ret);
> >
>
> Hi Micahl,
>
> > This shouldn't really happen. What does prevent from this to proceed?
> > Did you manage to catch the specific pfn and what is it used for?
>
> I did.
>
> > start_isolate_page_range and scan_movable_pages should fail if there is
> > any memory that cannot be migrated permanently. This is something that
> > we should focus on when debugging.
>
> I was hitting this issue:
> mm/memory_hotplug: drain per-cpu pages again during memory offline
> https://lore.kernel.org/lkml/[email protected]

I have noticed the patch but didn't have time to think it through (have
been few days off and catching up with emails). Will give it a higher
priority.

--
Michal Hocko
SUSE Labs

2020-09-02 14:29:44

by Pasha Tatashin

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

> > This is how we are using it at Microsoft: there is a very large
> > number of small memory machines (8G each) with low downtime
> > requirements (reboot must be under a second). There is also a large
> > state ~2G of memory that we need to transfer during reboot, otherwise
> > it is very expensive to recreate the state. We have 2G of system
> > memory memory reserved as a pmem in the device tree, and use it to
> > pass information across reboots. Once the information is not needed we
> > hot-add that memory and use it during runtime, before shutdown we
> > hot-remove the 2G, save the program state on it, and do the reboot.
>
> I still do not get it. So what does guarantee that the memory is
> offlineable in the first place?

It is in a movable zone, and we have more than 2G of free memory for
successful migrations.

> Also what is the difference between
> offlining and simply shutting the system down so that the memory is not
> used in the first place. In other words what kind of difference
> hotremove makes?

For performance reasons during system updates/reboots we do not erase
memory content. The memory content is erased only on power cycle,
which we do not do in production.

Once we hot-remove the memory, we convert it back into DAXFS PMEM
device, format it into EXT4, mount it as DAX file system, and allow
programs to serialize their states to it so they can read it back
after the reboot.

During startup we mount pmem, programs read the state back, and after
that we hotplug the PMEM DAX as a movable zone. This way during normal
runtime we have 8G available to programs.

Pasha

2020-09-02 15:00:52

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

On Wed 02-09-20 08:53:49, Pavel Tatashin wrote:
> On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> > > >> > > Thread #2: ccs killer kthread
> > > >> > > css_killed_work_fn
> > > >> > > cgroup_mutex <- Grab this Mutex
> > > >> > > mem_cgroup_css_offline
> > > >> > > memcg_offline_kmem.part
> > > >> > > memcg_deactivate_kmem_caches
> > > >> > > get_online_mems
> > > >> > > mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > And one more thing. THis has been brought up several times already.
> > Maybe I have forgoten but why do we take hotplug locks in this path in
> > the first place? Memory hotplug notifier takes slab_mutex so this
> > shouldn't be really needed.
>
> Good point, it seems this lock can be completely removed from
> memcg_deactivate_kmem_caches

I am pretty sure we have discussed that in the past. But I do not
remember the outcome. Either we have concluded that this is indeed the
case but nobody came up with a patch or we have hit some obscure
issue... Maybe David/Roman rememeber more than I do.

--
Michal Hocko
SUSE Labs

2020-09-03 18:10:59

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller

> For performance reasons during system updates/reboots we do not erase
> memory content. The memory content is erased only on power cycle,
> which we do not do in production.
>
> Once we hot-remove the memory, we convert it back into DAXFS PMEM
> device, format it into EXT4, mount it as DAX file system, and allow
> programs to serialize their states to it so they can read it back
> after the reboot.
>
> During startup we mount pmem, programs read the state back, and after
> that we hotplug the PMEM DAX as a movable zone. This way during normal
> runtime we have 8G available to programs.
>

Thanks for sharing the workflow - while it sounds somewhat sub-optimal,
I guess it gets the job done using existing tools / mechanisms.

(I remember the persistent tmpfs over kexec RFC, which tries to tackle
it by introducing something new)

--
Thanks,

David / dhildenb