From: "Steven Rostedt (VMware)" <[email protected]>
As si_mem_available() can report that there is enough memory even though
the available memory is not usable by the ring buffer, it is best not to
kill innocent applications because the ring buffer is taking up all the
memory while trying to allocate a great deal of it.
If the allocating task is in user space (kernel threads can also increase
the size of the kernel ring buffer on boot up), then after
si_mem_available() says there is enough memory, set the OOM killer to
kill the current task if an OOM triggers during the allocation.
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
---
kernel/trace/ring_buffer.c | 48 ++++++++++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 10 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 966128f02121..c9cb9767d49b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -22,6 +22,7 @@
#include <linux/hash.h>
#include <linux/list.h>
#include <linux/cpu.h>
+#include <linux/oom.h>
#include <asm/local.h>
@@ -1162,35 +1163,60 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
{
struct buffer_page *bpage, *tmp;
+ bool user_thread = current->mm != NULL;
+ gfp_t mflags;
long i;
- /* Check if the available memory is there first */
+ /*
+ * Check if the available memory is there first.
+ * Note, si_mem_available() only gives us a rough estimate of available
+ * memory. It may not be accurate. But we don't care, we just want
+ * to prevent doing any allocation when it is obvious that it is
+ * not going to succeed.
+ */
i = si_mem_available();
if (i < nr_pages)
return -ENOMEM;
+ /*
+ * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
+ * gracefully without invoking oom-killer and the system is not
+ * destabilized.
+ */
+ mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+
+ /*
+ * If a user thread allocates too much, and si_mem_available()
+ * reports there's enough memory even though there is not,
+ * make sure the OOM killer kills this thread. This can happen
+ * even with RETRY_MAYFAIL because another task may be doing
+ * an allocation after this task has taken all memory.
+ * This is the task the OOM killer needs to take out during this
+ * loop, even if it was triggered by an allocation somewhere else.
+ */
+ if (user_thread)
+ set_current_oom_origin();
for (i = 0; i < nr_pages; i++) {
struct page *page;
- /*
- * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
- * gracefully without invoking oom-killer and the system is not
- * destabilized.
- */
+
bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
- GFP_KERNEL | __GFP_RETRY_MAYFAIL,
- cpu_to_node(cpu));
+ mflags, cpu_to_node(cpu));
if (!bpage)
goto free_pages;
list_add(&bpage->list, pages);
- page = alloc_pages_node(cpu_to_node(cpu),
- GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
+ page = alloc_pages_node(cpu_to_node(cpu), mflags, 0);
if (!page)
goto free_pages;
bpage->page = page_address(page);
rb_init_page(bpage->page);
+
+ if (user_thread && fatal_signal_pending(current))
+ goto free_pages;
}
+ if (user_thread)
+ clear_current_oom_origin();
return 0;
@@ -1199,6 +1225,8 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
list_del_init(&bpage->list);
free_buffer_page(bpage);
}
+ if (user_thread)
+ clear_current_oom_origin();
return -ENOMEM;
}
--
2.13.6
On Wed, 4 Apr 2018 11:53:10 -0400
Steven Rostedt <[email protected]> wrote:
> @@ -1162,35 +1163,60 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
> static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
> {
> struct buffer_page *bpage, *tmp;
> + bool user_thread = current->mm != NULL;
> + gfp_t mflags;
> long i;
>
> - /* Check if the available memory is there first */
> + /*
> + * Check if the available memory is there first.
> + * Note, si_mem_available() only gives us a rough estimate of available
> + * memory. It may not be accurate. But we don't care, we just want
> + * to prevent doing any allocation when it is obvious that it is
> + * not going to succeed.
> + */
In case you are wondering how I tested this, I simply added:
#if 0
> i = si_mem_available();
> if (i < nr_pages)
> return -ENOMEM;
#endif
for the tests. Note, without this, I tried to allocate all memory
(bisecting between allocation sizes that failed and sizes that
succeeded), and couldn't trigger an OOM :-/
Of course, this was on x86_64, where I'm sure I could allocate
any memory, and probably would have had more luck with a 32bit kernel
using highmem.
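(The bisection amounted to something like the sketch below. This is a
reconstruction of the idea only, not the actual test; the helper and the
size bounds are made up, and the debugfs path may differ on your system,
e.g. /d/tracing:)

/* Bisect for the largest buffer_size_kb write that still succeeds. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int try_resize(long kb)
{
	char buf[32];
	int fd = open("/sys/kernel/debug/tracing/buffer_size_kb", O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	snprintf(buf, sizeof(buf), "%ld", kb);
	ret = write(fd, buf, strlen(buf));	/* fails with ENOMEM when too big */
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	long lo = 0, hi = 16L * 1024 * 1024;	/* upper bound in kb, made up */

	while (hi - lo > 1024) {
		long mid = lo + (hi - lo) / 2;

		if (try_resize(mid))
			hi = mid;	/* failed: try a smaller size */
		else
			lo = mid;	/* succeeded: try a bigger size */
	}
	printf("largest successful size: %ld kb\n", lo);
	return 0;
}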
-- Steve
On Wed, Apr 4, 2018 at 9:00 AM, Steven Rostedt <[email protected]> wrote:
> [..]
>
> In case you are wondering how I tested this, I simply added:
>
> #if 0
>> i = si_mem_available();
>> if (i < nr_pages)
>> return -ENOMEM;
> #endif
>
> for the tests. Note, without this, I tried to allocate all memory
> (bisecting it with allocations that failed and allocations that
> succeeded), and couldn't trigger an OOM :-/
I guess you need to have something *else*, other than the write to
buffer_size_kb, doing GFP_KERNEL allocations, which then unfortunately
gets OOM killed?
Also, I agree with the new patch; it's a nice idea to do that.
thanks,
- Joel
On Wed, 4 Apr 2018 09:03:47 -0700
Joel Fernandes <[email protected]> wrote:
> > [..]
>
> I guess you need to have something *else*, other than the write to
> buffer_size_kb, doing GFP_KERNEL allocations, which then unfortunately
> gets OOM killed?
Yeah, for some reason, my test box seems to always have something doing
that, because I trigger an OOM about 2 out of every 3 tries.
Here are the tasks that trigger it:
lvmetad, crond, systemd-journal, abrt-dump-journ, chronyd, ...
I guess my system is rather busy even when idle :-/
>
> Also, I agree with the new patch; it's a nice idea to do that.
Thanks, want to give it a test too?
-- Steve
On Wed, Apr 4, 2018 at 9:13 AM, Steven Rostedt <[email protected]> wrote:
[..]
>>
>> Also, I agree with the new patch; it's a nice idea to do that.
>
> Thanks, want to give it a test too?
Sure, I'll try it in a few hours. I am thinking of trying some of the
memory pressure tools we have. Likely by noon or so once I look at
some day job things :-D
thanks,
- Joel
Hi Steve,
On Wed, Apr 4, 2018 at 9:18 AM, Joel Fernandes <[email protected]> wrote:
> On Wed, Apr 4, 2018 at 9:13 AM, Steven Rostedt <[email protected]> wrote:
> [..]
>>>
>>> Also, I agree with the new patch; it's a nice idea to do that.
>>
>> Thanks, want to give it a test too?
With the latest tree and the below diff, I can still get a victim
process OOM-killed by doing a large buffer_size_kb write:
I pulled your ftrace/core and added this:
+ /*
i = si_mem_available();
if (i < nr_pages)
return -ENOMEM;
+ */
Here's a run in Qemu with 4 cores and 1GB total memory:
bash-4.3# ./m -m 1M &
[1] 1056
bash-4.3#
bash-4.3#
bash-4.3#
bash-4.3# echo 10000000 > /d/tracing/buffer_size_kb
[ 33.213988] Out of memory: Kill process 1042 (bash) score
1712050900 or sacrifice child
[ 33.215349] Killed process 1056 (m) total-vm:9220kB,
anon-rss:7564kB, file-rss:4kB, shmem-rss:640kB
bash: echo: write error: Cannot allocate memory
[1]+ Killed ./m -m 1M
bash-4.3#
--
As you can see, the OOM killer triggers and kills "m", which is my busy
memory allocator (it allocates and frees lots of memory, and does that
in a loop).
Here's the m program, sorry if it looks too ugly:
https://pastebin.com/raw/aG6Qw37Z
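For what it's worth, the gist of it is just this (a from-memory sketch
of what such a hog does, not the actual contents of the pastebin; the
chunk size is made up):

/* Busy memory allocator: grab a chunk, touch it, free it, repeat. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t chunk = 1024 * 1024;	/* e.g. the "-m 1M" case */

	for (;;) {
		void *p = malloc(chunk);

		if (!p) {
			sleep(1);	/* back off and retry */
			continue;
		}
		memset(p, 0xaa, chunk);	/* touch the pages so they are really allocated */
		free(p);
	}
}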
Happy to try anything else. BTW, with the si_mem_available() check
enabled, this doesn't happen and the buffer_size_kb write fails
normally without hurting anything else.
- Joel
On Wed, 4 Apr 2018 16:59:18 -0700
Joel Fernandes <[email protected]> wrote:
> Happy to try anything else. BTW, with the si_mem_available() check
> enabled, this doesn't happen and the buffer_size_kb write fails
> normally without hurting anything else.
Can you remove the RETRY_MAYFAIL and try again? It may be that we
should just remove it, and if si_mem_available() is wrong, it will kill
the process :-/ My original code would only add MAYFAIL if it was a
kernel thread (which is why I created the mflags variable).
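That is, something along these lines (reconstructed from the
description above, not the code that was actually posted):

	mflags = GFP_KERNEL;
	/*
	 * Only kernel threads get the graceful-failure flag; a user
	 * thread resizing the buffer stays eligible for the OOM killer.
	 */
	if (!user_thread)
		mflags |= __GFP_RETRY_MAYFAIL;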
-- Steve
On Wed 04-04-18 16:59:18, Joel Fernandes wrote:
> [..]
> Here's a run in Qemu with 4 cores and 1GB total memory:
>
> bash-4.3# ./m -m 1M &
> [1] 1056
> bash-4.3#
> bash-4.3#
> bash-4.3#
> bash-4.3# echo 10000000 > /d/tracing/buffer_size_kb
> [ 33.213988] Out of memory: Kill process 1042 (bash) score
> 1712050900 or sacrifice child
> [ 33.215349] Killed process 1056 (m) total-vm:9220kB,
> anon-rss:7564kB, file-rss:4kB, shmem-rss:640kB
OK, so the reason your memory hog is killed is that your echo is a
shell built-in, so we properly select bash as an oom_origin, but then
another clever heuristic jumps in and tries to reduce the damage by
sacrificing a child process. And your memory hog runs as a child of the
same bash session.
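For reference, the heuristic in question sits in oom_kill_process(); in
outline it does something like this (a paraphrased sketch, not the
verbatim mm/oom_kill.c code):

	/*
	 * Before killing the chosen victim p, prefer its highest-scoring
	 * child that has its own mm, to lose the least amount of work
	 * ("sacrifice child").
	 */
	read_lock(&tasklist_lock);
	for_each_thread(p, t) {
		list_for_each_entry(child, &t->children, sibling) {
			unsigned int child_points;

			if (process_shares_mm(child, p->mm))
				continue;	/* shares p's mm, killing it frees nothing */
			/* oom_badness() returns 0 if the task is unkillable */
			child_points = oom_badness(child, oc->memcg,
						   oc->nodemask, oc->totalpages);
			if (child_points > victim_points) {
				put_task_struct(victim);
				victim = child;
				victim_points = child_points;
				get_task_struct(victim);
			}
		}
	}
	read_unlock(&tasklist_lock);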
I cannot say I would love this heuristic. In fact I would really love to
dig it deep under the ground. But this is a harder sell than it might
seem. Anyway, is your testing scenario really representative enough to
care? Does the buffer_size_kb updater run in the same process as any
large memory process?
--
Michal Hocko
SUSE Labs
On Thu, Apr 5, 2018 at 7:51 AM, Michal Hocko <[email protected]> wrote:
> [..]
>
> OK, so the reason your memory hog is killed is that your echo is a
> shell built-in, so we properly select bash as an oom_origin, but then
> another clever heuristic jumps in and tries to reduce the damage by
> sacrificing a child process. And your memory hog runs as a child of the
> same bash session.
Oh, ok. Makes sense.
>
> I cannot say I would love this heuristic. In fact I would really love to
> dig it deep under the ground. But this is a harder sell than it might
> seem. Anyway, is your testing scenario really representative enough to
No, honestly, I don't care much for this heuristic; I was just helping
test it. The scenario is not something I care about, but it seems like
if I hit it then other users will too. Maybe Zhaoyang can try his use
case again with ftrace/core and si_mem_available() commented out?
IOW I was just helping test the new patch with the si_mem_available()
check commented out.
> care? Does the buffer_size_kb updater run in the same process as any
> large memory process?
In this Qemu run it's just the cat process. At work I use trace-cmd or
atrace, neither of which I believe has a large memory footprint (AFAIK).
Thanks,
- Joel
Hi Steve,
On Thu, Apr 5, 2018 at 6:43 AM, Steven Rostedt <[email protected]> wrote:
> [..]
>
> Can you remove the RETRY_MAYFAIL and try again? It may be that we
> should just remove it, and if si_mem_available() is wrong, it will kill
> the process :-/ My original code would only add MAYFAIL if it was a
> kernel thread (which is why I created the mflags variable).
Tried this. Dropping RETRY_MAYFAIL and the si_mem_available() check
destabilized the system and brought it down (along with OOM killing
the victim).
The system hung for several seconds and then both the memory hog and
bash got killed.
thanks,
- Joel
Hi Steve,
On Thu, Apr 5, 2018 at 12:57 PM, Joel Fernandes <[email protected]> wrote:
> [..]
>
> Tried this. Dropping RETRY_MAYFAIL and the si_mem_available() check
> destabilized the system and brought it down (along with OOM killing
> the victim).
>
> The system hung for several seconds and then both the memory hog and
> bash got killed.
I think it's still OK to keep the OOM patch as a safeguard even though
it's hard to test, and the si_mem_available() check on its own seems
sufficient. What do you think?
thanks,
- Joel
On Fri, Apr 6, 2018 at 7:36 AM, Joel Fernandes <[email protected]> wrote:
> [..]
>
> I think it's still OK to keep the OOM patch as a safeguard even though
> it's hard to test, and the si_mem_available() check on its own seems
> sufficient. What do you think?
I also tested the patch on my system, and it works fine for the
previous script.
PS: The script I mentioned is the CTS test case POC 16_12 on Android 8.1.