2022-06-24 08:47:38

by Christian König

[permalink] [raw]
Subject: [RFC] Per file OOM-badness / RSS once more

Hello everyone,

To summarize the issue I'm trying to address here: Processes can allocate
resources through a file descriptor without being held responsible for it.

I'm not explaining all the details again. See here for a more deeply
description of the problem: https://lwn.net/ml/linux-kernel/[email protected]/

With this iteration I'm trying to address a bunch of the comments Michal Hocko
(thanks a lot for that) gave as well as giving some new ideas.

Changes made so far:
1. Renamed the callback into file_rss(). This is at least a start to better
describe what this is all about. I've been going back and forth over the
naming here, if you have any better idea please speak up.

2. Cleanups, e.g. now providing a helper function in the fs layer to sum up
all the pages allocated by the files in a file descriptor table.

3. Using the actual number of allocated pages for the shmem implementation
instead of just the size. I also tried to ignore shmem files which are part
of tmpfs, cause that has a separate accounting/limitation approach.

4. The OOM killer now prints the memory of the killed process including the per
file pages which makes the whole things much more comprehensible.

5. I've added the per file pages to the different reports in RSS in procfs.
This has the interesting effect that tools like top suddenly give a much
more accurate overview of the memory use as well. This of course increases
the overhead of gathering those information quite a bit and I'm not sure how
feasible that is for up-streaming. On the other hand this once more clearly
shows that we need to do something about this issue.

Another rather interesting observation is that multiple subsystems (shmem,
tmpfs, ttm) came up with the same workaround of limiting the memory which can
be allocated through them to 50% of the whole system memory. Unfortunately
that isn't the same 50% and it doesn't apply everywhere, so you can still
easily crash the box.

Ideas and/or comments are really welcome.

Thanks,
Christian.



2022-06-24 08:47:57

by Christian König

[permalink] [raw]
Subject: [PATCH 09/14] drm/radeon: use drm_oom_badness

This allows the OOM killer to make a better decision which process to reap.

Signed-off-by: Christian König <[email protected]>
---
drivers/gpu/drm/radeon/radeon_drv.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 956c72b5aa33..11d310cdd2e8 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -550,6 +550,7 @@ static const struct file_operations radeon_driver_kms_fops = {
#ifdef CONFIG_COMPAT
.compat_ioctl = radeon_kms_compat_ioctl,
#endif
+ .file_rss = drm_file_rss,
};

static const struct drm_ioctl_desc radeon_ioctls_kms[] = {
--
2.25.1

2022-06-24 08:49:13

by Christian König

[permalink] [raw]
Subject: [PATCH 10/14] drm/i915: use drm_file_rss

This allows the OOM killer to make a better decision which process to reap.

Signed-off-by: Christian König <[email protected]>
---
drivers/gpu/drm/i915/i915_driver.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index 90b0ce5051af..fc269055a07c 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1741,6 +1741,7 @@ static const struct file_operations i915_driver_fops = {
#ifdef CONFIG_PROC_FS
.show_fdinfo = i915_drm_client_fdinfo,
#endif
+ .file_rss = drm_file_rss,
};

static int
--
2.25.1

2022-06-24 08:49:58

by Christian König

[permalink] [raw]
Subject: [PATCH 02/14] oom: take per file RSS into account

From: Andrey Grodzovsky <[email protected]>

Try to make better decisions which process to kill based on
per file RSS.

Signed-off-by: Andrey Grodzovsky <[email protected]>
---
mm/oom_kill.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3c6cf9e3cd66..76a5ea73eb6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@
* kernel subsystems and hints as to where to find out what things do.
*/

+#include <linux/fdtable.h>
#include <linux/oom.h>
#include <linux/mm.h>
#include <linux/err.h>
@@ -228,7 +229,8 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
* task's rss, pagetable and swap space use.
*/
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
- mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+ files_rss(p->files) + mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+
task_unlock(p);

/* Normalize to oom_score_adj units */
@@ -401,8 +403,8 @@ static int dump_task(struct task_struct *p, void *arg)

pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
- task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
- mm_pgtables_bytes(task->mm),
+ task->tgid, task->mm->total_vm, get_mm_rss(task->mm) +
+ files_rss(task->files), mm_pgtables_bytes(task->mm),
get_mm_counter(task->mm, MM_SWAPENTS),
task->signal->oom_score_adj, task->comm);
task_unlock(task);
@@ -594,7 +596,8 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,
K(get_mm_counter(mm, MM_ANONPAGES)),
- K(get_mm_counter(mm, MM_FILEPAGES)),
+ K(get_mm_counter(mm, MM_FILEPAGES) +
+ files_rss(tsk->files)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
out_finish:
trace_finish_task_reaping(tsk->pid);
@@ -950,7 +953,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
K(get_mm_counter(mm, MM_ANONPAGES)),
- K(get_mm_counter(mm, MM_FILEPAGES)),
+ K(get_mm_counter(mm, MM_FILEPAGES) + files_rss(victim->files)),
K(get_mm_counter(mm, MM_SHMEMPAGES)),
from_kuid(&init_user_ns, task_uid(victim)),
mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
--
2.25.1

2022-06-24 10:20:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] Per file OOM-badness / RSS once more

On Fri 24-06-22 10:04:30, Christian K?nig wrote:
> Hello everyone,
>
> To summarize the issue I'm trying to address here: Processes can allocate
> resources through a file descriptor without being held responsible for it.
>
> I'm not explaining all the details again. See here for a more deeply
> description of the problem: https://lwn.net/ml/linux-kernel/[email protected]/
>
> With this iteration I'm trying to address a bunch of the comments Michal Hocko
> (thanks a lot for that) gave as well as giving some new ideas.
>
> Changes made so far:
> 1. Renamed the callback into file_rss(). This is at least a start to better
> describe what this is all about. I've been going back and forth over the
> naming here, if you have any better idea please speak up.
>
> 2. Cleanups, e.g. now providing a helper function in the fs layer to sum up
> all the pages allocated by the files in a file descriptor table.
>
> 3. Using the actual number of allocated pages for the shmem implementation
> instead of just the size. I also tried to ignore shmem files which are part
> of tmpfs, cause that has a separate accounting/limitation approach.

OK, this is better than the original approach there are still holes
there though I am afraid. I am not sure your i_count hack is correct
but that would be mostly an implementation detail. The scheme will
over-account memory mapped files (including memfd). How much that
matters will really differ.

For the global OOM situations it is very likely that there will be
barely any disk based page cache as it would be reclaimed by the time
the oom killer is invoked. So this should be OK. Swap backed page cache
(shmem and its users) is more tricky. It is swap bound and processes
which map it will get "charged" in the form of swap entries while those
which rely on read/write will just escape from the sight of the oom
killer no matter how much memory they own via their shmem backed fd.
This sounds rather serious to me and I hope I haven't missed anything
subtle here that would keep those pages somehow visible. Anyway
something to very carefully document.

For the memcg OOM this gets even more tricky. Files can be shared among
tasks accross memcgs. Something that is not really straightforward from
the userspace POV because this is not strictly deterministic as
first-one-first-charged logic is applied so a lot might depend on timing.
This could also easily mean that a large part of the in memory state of
the file is outside of the reclaim and therefore OOM scope of the memcg
which is hitting the hard limit. This could result in tasks being killed
just because they (co)operate on a large file outside of their memcg
domain. To be honest I am not sure how big of a problem this would be in
practice and the existing behavior has its own cons so to me it sounds
like changing one set of deficiency with other.

As we have discussed previously, there is unlikely a great solution but
you a) need to document most prominent downsides so that people can at
least see this is understood and documented behavior and b) think of the
runaway situation wrt non mapped shmems memtioned above and see whether
there is something we can do about that.
--
Michal Hocko
SUSE Labs

2022-06-28 10:48:14

by Michel Dänzer

[permalink] [raw]
Subject: Re: [PATCH 09/14] drm/radeon: use drm_oom_badness

On 2022-06-24 10:04, Christian König wrote:
> This allows the OOM killer to make a better decision which process to reap.
>
> Signed-off-by: Christian König <[email protected]>
> ---
> drivers/gpu/drm/radeon/radeon_drv.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
> index 956c72b5aa33..11d310cdd2e8 100644
> --- a/drivers/gpu/drm/radeon/radeon_drv.c
> +++ b/drivers/gpu/drm/radeon/radeon_drv.c
> @@ -550,6 +550,7 @@ static const struct file_operations radeon_driver_kms_fops = {
> #ifdef CONFIG_COMPAT
> .compat_ioctl = radeon_kms_compat_ioctl,
> #endif
> + .file_rss = drm_file_rss,
> };
>
> static const struct drm_ioctl_desc radeon_ioctls_kms[] = {

Shortlog should now say "use drm_file_rss", right?


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer