Don't choose a process with adj == OOM_SCORE_ADJ_MIN as the OOM victim
when it is over-allocating pages for ring buffers.
Signed-off-by: Zhaoyang Huang <[email protected]>
---
kernel/trace/ring_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5f38398..1005d73 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1135,7 +1135,7 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
{
struct buffer_page *bpage, *tmp;
- bool user_thread = current->mm != NULL;
+ bool user_thread = (current->mm != NULL && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN);
gfp_t mflags;
long i;
--
1.9.1
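For context, here is a condensed sketch of what the user_thread flag gates in
__rb_allocate_pages() after Steve's earlier change (abridged from the full
patch quoted later in this thread; only the shape matters here):

	/*
	 * Abridged sketch of __rb_allocate_pages(): user_thread decides
	 * whether this task volunteers itself as the preferred OOM victim
	 * while the ring buffer pages are being allocated.
	 */
	if (user_thread)
		set_current_oom_origin();
	for (i = 0; i < nr_pages; i++) {
		/* kzalloc_node()/alloc_pages_node() with __GFP_RETRY_MAYFAIL */
		if (user_thread && fatal_signal_pending(current))
			goto free_pages;
	}
	if (user_thread)
		clear_current_oom_origin();

So with the one-line change above, an OOM_SCORE_ADJ_MIN task that resizes the
buffer would no longer mark itself with oom_origin.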
On Sun, 8 Apr 2018 10:16:23 +0800
Zhaoyang Huang <[email protected]> wrote:
> Don't choose the process with adj == OOM_SCORE_ADJ_MIN which
> over-allocating pages for ring buffers.
Why?
-- Steve
On Sun, Apr 8, 2018 at 11:48 AM, Steven Rostedt <[email protected]> wrote:
> On Sun, 8 Apr 2018 10:16:23 +0800
> Zhaoyang Huang <[email protected]> wrote:
>
>> Don't choose the process with adj == OOM_SCORE_ADJ_MIN which
>> over-allocating pages for ring buffers.
>
> Why?
>
> -- Steve
because in oom_evaluate_task, a process with adj == OOM_SCORE_ADJ_MIN will
be suppressed by oom_badness, but after applying your latest patch such a
process will be selected via oom_task_origin:
if (oom_task_origin(task)) {
points = ULONG_MAX;
goto select;
}
points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
if (!points || points < oc->chosen_points)
goto next;
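For reference, the reason OOM_SCORE_ADJ_MIN normally protects a task is that
oom_badness() scores it as unkillable; the oom_task_origin() branch above runs
first and therefore bypasses that protection. A simplified sketch of that
shape (not the exact kernel source):

	/*
	 * Simplified sketch, not the actual oom_badness() implementation:
	 * an OOM_SCORE_ADJ_MIN task scores 0, so the "if (!points ...)"
	 * check above skips it -- unless oom_task_origin() already
	 * short-circuited the evaluation.
	 */
	static unsigned long oom_badness_sketch(struct task_struct *p)
	{
		long adj = (long)p->signal->oom_score_adj;

		if (adj == OOM_SCORE_ADJ_MIN)
			return 0;	/* honor the "never kill me" contract */

		/* otherwise: score by rss, page tables and swap, scaled by adj */
		return 1;		/* placeholder for the real score */
	}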
[ Removing kernel-patch-test, because of annoying "moderator" messages ]
On Sun, 8 Apr 2018 13:54:59 +0800
Zhaoyang Huang <[email protected]> wrote:
> On Sun, Apr 8, 2018 at 11:48 AM, Steven Rostedt <[email protected]> wrote:
> > On Sun, 8 Apr 2018 10:16:23 +0800
> > Zhaoyang Huang <[email protected]> wrote:
> >
> >> Don't choose the process with adj == OOM_SCORE_ADJ_MIN which
> >> over-allocating pages for ring buffers.
> >
> > Why?
> >
> > -- Steve
> because in oom_evaluate_task, the process with adj == OOM_SCORE_ADJ_MIN will
> be suppressed by oom_badness, but with applying your latest patch,
> such process will
> be selected by oom_task_origin
>
> if (oom_task_origin(task)) {
> points = ULONG_MAX;
> goto select;
> }
>
> points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
> if (!points || points < oc->chosen_points)
> goto next;
And what's wrong with that?
-- Steve
On Sun, Apr 8, 2018 at 8:47 PM, Steven Rostedt <[email protected]> wrote:
> [ Removing kernel-patch-test, because of annoying "moderator" messages ]
>
> On Sun, 8 Apr 2018 13:54:59 +0800
> Zhaoyang Huang <[email protected]> wrote:
>
>> On Sun, Apr 8, 2018 at 11:48 AM, Steven Rostedt <[email protected]> wrote:
>> > On Sun, 8 Apr 2018 10:16:23 +0800
>> > Zhaoyang Huang <[email protected]> wrote:
>> >
>> >> Don't choose the process with adj == OOM_SCORE_ADJ_MIN which
>> >> over-allocating pages for ring buffers.
>> >
>> > Why?
>> >
>> > -- Steve
>> because in oom_evaluate_task, the process with adj == OOM_SCORE_ADJ_MIN will
>> be suppressed by oom_badness, but with applying your latest patch,
>> such process will
>> be selected by oom_task_origin
>>
>> if (oom_task_origin(task)) {
>> points = ULONG_MAX;
>> goto select;
>> }
>>
>> points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
>> if (!points || points < oc->chosen_points)
>> goto next;
>
> And what's wrong with that?
>
> -- Steve
I think the original idea of OOM is that OOM_SCORE_ADJ_MIN is most likely
set by the process itself via its proc file; only if the process changes
that itself should OOM be able to select it as the victim. Otherwise OOM is
reluctant to choose such a critical process to be killed, so I suggest we
not override a flag as heavy as OOM_SCORE_ADJ_MIN on behalf of a -1000
process.
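For readers not familiar with that interface, a minimal userspace sketch of a
process opting itself out via proc (illustrative only; writing -1000 requires
CAP_SYS_RESOURCE):

	#include <stdio.h>

	/*
	 * Minimal illustration: a process marks itself unkillable by the OOM
	 * killer through /proc/self/oom_score_adj. Lowering the value to
	 * -1000 (OOM_SCORE_ADJ_MIN) requires CAP_SYS_RESOURCE.
	 */
	int main(void)
	{
		FILE *f = fopen("/proc/self/oom_score_adj", "w");

		if (!f)
			return 1;
		fprintf(f, "-1000\n");
		fclose(f);
		/* from here on the process relies on never being OOM-killed */
		return 0;
	}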
On Mon, 9 Apr 2018 08:56:01 +0800
Zhaoyang Huang <[email protected]> wrote:
> >>
> >> if (oom_task_origin(task)) {
> >> points = ULONG_MAX;
> >> goto select;
> >> }
> >>
> >> points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
> >> if (!points || points < oc->chosen_points)
> >> goto next;
> >
> > And what's wrong with that?
> >
> > -- Steve
> I think the original thought of OOM is the flag 'OOM_SCORE_ADJ_MIN' is
> most likely to be set by process himself via accessing the proc file,
> if it does so, OOM can select it as the victim. except, it is
> reluctant to choose the critical process to be killed, so I suggest
> not to set such heavy flag as OOM_SCORE_ADJ_MIN on behalf of -1000
> process.
Really, I don't think tasks that set OOM_SCORE_ADJ_MIN should be
allocating a lot of memory in the kernel (via the ring buffer). It sounds
like a good way to wreak havoc on the system.
It's basically saying, "I'm going to take up all memory, but don't kill
me, just kill some random user on the system".
-- Steve
On Mon, Apr 9, 2018 at 9:49 PM, Steven Rostedt <[email protected]> wrote:
> On Mon, 9 Apr 2018 08:56:01 +0800
> Zhaoyang Huang <[email protected]> wrote:
>
>> >>
>> >> if (oom_task_origin(task)) {
>> >> points = ULONG_MAX;
>> >> goto select;
>> >> }
>> >>
>> >> points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
>> >> if (!points || points < oc->chosen_points)
>> >> goto next;
>> >
>> > And what's wrong with that?
>> >
>> > -- Steve
>> I think the original thought of OOM is the flag 'OOM_SCORE_ADJ_MIN' is
>> most likely to be set by process himself via accessing the proc file,
>> if it does so, OOM can select it as the victim. except, it is
>> reluctant to choose the critical process to be killed, so I suggest
>> not to set such heavy flag as OOM_SCORE_ADJ_MIN on behalf of -1000
>> process.
>
> Really, I don't think tasks that are setting OOM_CORE_ADJ_MIN should be
> allocating a lot of memory in the kernel (via ring buffer). It sounds
> like a good way to wreck havoc on the system.
>
> It's basically saying, "I'm going to take up all memory, but don't kill
> me, just kill some random user on the system".
>
> -- Steve
Sure, but the memory status is dynamic; the process could still exceed the
limit at that moment even if it checked the available memory beforehand. We
have to add protection for that kind of risk. It could also happen that the
critical process is preempted by another process allocating huge amounts of
memory, which may leave insufficient memory by the time it is scheduled
back.
On Tue, Apr 10, 2018 at 8:32 AM, Zhaoyang Huang <[email protected]> wrote:
> On Mon, Apr 9, 2018 at 9:49 PM, Steven Rostedt <[email protected]> wrote:
>> On Mon, 9 Apr 2018 08:56:01 +0800
>> Zhaoyang Huang <[email protected]> wrote:
>>
>>> >>
>>> >> if (oom_task_origin(task)) {
>>> >> points = ULONG_MAX;
>>> >> goto select;
>>> >> }
>>> >>
>>> >> points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
>>> >> if (!points || points < oc->chosen_points)
>>> >> goto next;
>>> >
>>> > And what's wrong with that?
>>> >
>>> > -- Steve
>>> I think the original thought of OOM is the flag 'OOM_SCORE_ADJ_MIN' is
>>> most likely to be set by process himself via accessing the proc file,
>>> if it does so, OOM can select it as the victim. except, it is
>>> reluctant to choose the critical process to be killed, so I suggest
>>> not to set such heavy flag as OOM_SCORE_ADJ_MIN on behalf of -1000
>>> process.
>>
>> Really, I don't think tasks that are setting OOM_CORE_ADJ_MIN should be
>> allocating a lot of memory in the kernel (via ring buffer). It sounds
>> like a good way to wreck havoc on the system.
>>
>> It's basically saying, "I'm going to take up all memory, but don't kill
>> me, just kill some random user on the system".
>>
>> -- Steve
> Sure, but the memory status is dynamic, the process could also exceed the limit
> at the moment even it check the available memory before. We have to
> add protection
> for such kind of risk. It could also happen that the critical process
> be preempted by
> another huge memory allocating process, which may cause insufficient memory when
> it schedule back.
In the following scenario, process A has no intention of exhausting
memory, but is likely to be selected by OOM because we set oom_origin for
it.
process A (-1000)                          process B

i = si_mem_available();
if (i < nr_pages)
        return -ENOMEM;
schedule
              --------------->
                                           allocate huge memory
              <---------------
if (user_thread)
        set_current_oom_origin();

for (i = 0; i < nr_pages; i++) {
        bpage = kzalloc_node
On Tue, 10 Apr 2018 10:32:36 +0800
Zhaoyang Huang <[email protected]> wrote:
> For bellowing scenario, process A have no intension to exhaust the
> memory, but will be likely to be selected by OOM for we set
> OOM_CORE_ADJ_MIN for it.
> process A(-1000) process B
>
> i = si_mem_available();
> if (i < nr_pages)
> return -ENOMEM;
> schedule
> --------------->
> allocate huge memory
> <-------------
> if (user_thread)
> set_current_oom_origin();
>
> for (i = 0; i < nr_pages; i++) {
> bpage = kzalloc_node
Is this really an issue though?
Seriously, do you think you will ever hit this?
How often do you increase the size of the ftrace ring buffer? For this
to be an issue, the system has to trigger an OOM at the exact moment
you decide to increase the size of the ring buffer. That would be an
impressive attack, with little to gain.
Ask the memory management people. If they think this could be a
problem, then I'll be happy to take your patch.
-- Steve
On Tue, Apr 10, 2018 at 11:12 AM, Steven Rostedt <[email protected]> wrote:
> On Tue, 10 Apr 2018 10:32:36 +0800
> Zhaoyang Huang <[email protected]> wrote:
>
>> For bellowing scenario, process A have no intension to exhaust the
>> memory, but will be likely to be selected by OOM for we set
>> OOM_CORE_ADJ_MIN for it.
>> process A(-1000) process B
>>
>> i = si_mem_available();
>> if (i < nr_pages)
>> return -ENOMEM;
>> schedule
>> --------------->
>> allocate huge memory
>> <-------------
>> if (user_thread)
>> set_current_oom_origin();
>>
>> for (i = 0; i < nr_pages; i++) {
>> bpage = kzalloc_node
>
> Is this really an issue though?
>
> Seriously, do you think you will ever hit this?
>
> How often do you increase the size of the ftrace ring buffer? For this
> to be an issue, the system has to trigger an OOM at the exact moment
> you decide to increase the size of the ring buffer. That would be an
> impressive attack, with little to gain.
>
> Ask the memory management people. If they think this could be a
> problem, then I'll be happy to take your patch.
>
> -- Steve
Adding Michal for review.
Hi Michal,
I would like to suggest that Steve NOT set OOM_CORE_ADJ_MIN for a process
with adj = -1000 when marking a user space process as a potential OOM
victim. Steve doubts that this scenario is likely. In my opinion, we should
NOT break the original concept of OOM, that is, OOM will not select a -1000
process unless the process configures that itself. Regarding the
likelihood, on memory-starved systems such as Android phones there are many
kinds of user behavior and test scripts that stress or verify the stability
of the system, so I suggest we keep every corner case safe. Would you
please comment on that? Thanks
On Tue 10-04-18 11:41:44, Zhaoyang Huang wrote:
> On Tue, Apr 10, 2018 at 11:12 AM, Steven Rostedt <[email protected]> wrote:
> > On Tue, 10 Apr 2018 10:32:36 +0800
> > Zhaoyang Huang <[email protected]> wrote:
> >
> >> For bellowing scenario, process A have no intension to exhaust the
> >> memory, but will be likely to be selected by OOM for we set
> >> OOM_CORE_ADJ_MIN for it.
> >> process A(-1000) process B
> >>
> >> i = si_mem_available();
> >> if (i < nr_pages)
> >> return -ENOMEM;
> >> schedule
> >> --------------->
> >> allocate huge memory
> >> <-------------
> >> if (user_thread)
> >> set_current_oom_origin();
> >>
> >> for (i = 0; i < nr_pages; i++) {
> >> bpage = kzalloc_node
> >
> > Is this really an issue though?
> >
> > Seriously, do you think you will ever hit this?
> >
> > How often do you increase the size of the ftrace ring buffer? For this
> > to be an issue, the system has to trigger an OOM at the exact moment
> > you decide to increase the size of the ring buffer. That would be an
> > impressive attack, with little to gain.
> >
> > Ask the memory management people. If they think this could be a
> > problem, then I'll be happy to take your patch.
> >
> > -- Steve
> add Michael for review.
> Hi Michael,
> I would like suggest Steve NOT to set OOM_CORE_ADJ_MIN for the process
> with adj = -1000 when setting the user space process as potential
> victim of OOM.
OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
So what exactly do you want to achieve here? Because from the above it
sounds like opposite things. /me confused...
--
Michal Hocko
SUSE Labs
On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
> On Tue 10-04-18 11:41:44, Zhaoyang Huang wrote:
>> On Tue, Apr 10, 2018 at 11:12 AM, Steven Rostedt <[email protected]> wrote:
>> > On Tue, 10 Apr 2018 10:32:36 +0800
>> > Zhaoyang Huang <[email protected]> wrote:
>> >
>> >> For bellowing scenario, process A have no intension to exhaust the
>> >> memory, but will be likely to be selected by OOM for we set
>> >> OOM_CORE_ADJ_MIN for it.
>> >> process A(-1000) process B
>> >>
>> >> i = si_mem_available();
>> >> if (i < nr_pages)
>> >> return -ENOMEM;
>> >> schedule
>> >> --------------->
>> >> allocate huge memory
>> >> <-------------
>> >> if (user_thread)
>> >> set_current_oom_origin();
>> >>
>> >> for (i = 0; i < nr_pages; i++) {
>> >> bpage = kzalloc_node
>> >
>> > Is this really an issue though?
>> >
>> > Seriously, do you think you will ever hit this?
>> >
>> > How often do you increase the size of the ftrace ring buffer? For this
>> > to be an issue, the system has to trigger an OOM at the exact moment
>> > you decide to increase the size of the ring buffer. That would be an
>> > impressive attack, with little to gain.
>> >
>> > Ask the memory management people. If they think this could be a
>> > problem, then I'll be happy to take your patch.
>> >
>> > -- Steve
>> add Michael for review.
>> Hi Michael,
>> I would like suggest Steve NOT to set OOM_CORE_ADJ_MIN for the process
>> with adj = -1000 when setting the user space process as potential
>> victim of OOM.
>
> OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
> So what exactly do you want to achieve here? Because from the above it
> sounds like opposite things. /me confused...
>
> --
> Michal Hocko
> SUSE Labs
Steve's patch intends to make the process an OOM victim while it
over-allocates pages for the ring buffer. I added a patch on top to keep
processes with OOM_SCORE_ADJ_MIN from doing that, because otherwise such a
process would be selected by the OOM killer's current selection order
(OOM_FLAG_ORIGIN is considered before the adj).
On Tue 10-04-18 14:39:35, Zhaoyang Huang wrote:
> On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
> > On Tue 10-04-18 11:41:44, Zhaoyang Huang wrote:
> >> On Tue, Apr 10, 2018 at 11:12 AM, Steven Rostedt <[email protected]> wrote:
> >> > On Tue, 10 Apr 2018 10:32:36 +0800
> >> > Zhaoyang Huang <[email protected]> wrote:
> >> >
> >> >> For bellowing scenario, process A have no intension to exhaust the
> >> >> memory, but will be likely to be selected by OOM for we set
> >> >> OOM_CORE_ADJ_MIN for it.
> >> >> process A(-1000) process B
> >> >>
> >> >> i = si_mem_available();
> >> >> if (i < nr_pages)
> >> >> return -ENOMEM;
> >> >> schedule
> >> >> --------------->
> >> >> allocate huge memory
> >> >> <-------------
> >> >> if (user_thread)
> >> >> set_current_oom_origin();
> >> >>
> >> >> for (i = 0; i < nr_pages; i++) {
> >> >> bpage = kzalloc_node
> >> >
> >> > Is this really an issue though?
> >> >
> >> > Seriously, do you think you will ever hit this?
> >> >
> >> > How often do you increase the size of the ftrace ring buffer? For this
> >> > to be an issue, the system has to trigger an OOM at the exact moment
> >> > you decide to increase the size of the ring buffer. That would be an
> >> > impressive attack, with little to gain.
> >> >
> >> > Ask the memory management people. If they think this could be a
> >> > problem, then I'll be happy to take your patch.
> >> >
> >> > -- Steve
> >> add Michael for review.
> >> Hi Michael,
> >> I would like suggest Steve NOT to set OOM_CORE_ADJ_MIN for the process
> >> with adj = -1000 when setting the user space process as potential
> >> victim of OOM.
> >
> > OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
> > So what exactly do you want to achieve here? Because from the above it
> > sounds like opposite things. /me confused...
> >
> Steve's patch intend to have the process be OOM's victim when it
> over-allocating pages for ring buffer. I amend a patch over to protect
> process with OOM_SCORE_ADJ_MIN from doing so. Because it will make
> such process to be selected by current OOM's way of
> selecting.(consider OOM_FLAG_ORIGIN first before the adj)
I just wouldn't really care unless there is an existing and reasonable
usecase for an application which updates the ring buffer size _and_ it
is OOM disabled at the same time.
--
Michal Hocko
SUSE Labs
On Tue, Apr 10, 2018 at 3:49 PM, Michal Hocko <[email protected]> wrote:
> On Tue 10-04-18 14:39:35, Zhaoyang Huang wrote:
>> On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
>> > On Tue 10-04-18 11:41:44, Zhaoyang Huang wrote:
>> >> On Tue, Apr 10, 2018 at 11:12 AM, Steven Rostedt <[email protected]> wrote:
>> >> > On Tue, 10 Apr 2018 10:32:36 +0800
>> >> > Zhaoyang Huang <[email protected]> wrote:
>> >> >
>> >> >> For bellowing scenario, process A have no intension to exhaust the
>> >> >> memory, but will be likely to be selected by OOM for we set
>> >> >> OOM_CORE_ADJ_MIN for it.
>> >> >> process A(-1000) process B
>> >> >>
>> >> >> i = si_mem_available();
>> >> >> if (i < nr_pages)
>> >> >> return -ENOMEM;
>> >> >> schedule
>> >> >> --------------->
>> >> >> allocate huge memory
>> >> >> <-------------
>> >> >> if (user_thread)
>> >> >> set_current_oom_origin();
>> >> >>
>> >> >> for (i = 0; i < nr_pages; i++) {
>> >> >> bpage = kzalloc_node
>> >> >
>> >> > Is this really an issue though?
>> >> >
>> >> > Seriously, do you think you will ever hit this?
>> >> >
>> >> > How often do you increase the size of the ftrace ring buffer? For this
>> >> > to be an issue, the system has to trigger an OOM at the exact moment
>> >> > you decide to increase the size of the ring buffer. That would be an
>> >> > impressive attack, with little to gain.
>> >> >
>> >> > Ask the memory management people. If they think this could be a
>> >> > problem, then I'll be happy to take your patch.
>> >> >
>> >> > -- Steve
>> >> add Michael for review.
>> >> Hi Michael,
>> >> I would like suggest Steve NOT to set OOM_CORE_ADJ_MIN for the process
>> >> with adj = -1000 when setting the user space process as potential
>> >> victim of OOM.
>> >
>> > OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
>> > So what exactly do you want to achieve here? Because from the above it
>> > sounds like opposite things. /me confused...
>> >
>> Steve's patch intend to have the process be OOM's victim when it
>> over-allocating pages for ring buffer. I amend a patch over to protect
>> process with OOM_SCORE_ADJ_MIN from doing so. Because it will make
>> such process to be selected by current OOM's way of
>> selecting.(consider OOM_FLAG_ORIGIN first before the adj)
>
> I just wouldn't really care unless there is an existing and reasonable
> usecase for an application which updates the ring buffer size _and_ it
> is OOM disabled at the same time.
> --
> Michal Hocko
> SUSE Labs
There are indeed such test cases on my Android system, known as CTS,
Monkey, etc. Furthermore, I think we should make the patch as safe as
possible. Why leave a potential risk here? My patch has no side effects.
On Tue 10-04-18 16:04:40, Zhaoyang Huang wrote:
> On Tue, Apr 10, 2018 at 3:49 PM, Michal Hocko <[email protected]> wrote:
> > On Tue 10-04-18 14:39:35, Zhaoyang Huang wrote:
> >> On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
[...]
> >> > OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
> >> > So what exactly do you want to achieve here? Because from the above it
> >> > sounds like opposite things. /me confused...
> >> >
> >> Steve's patch intend to have the process be OOM's victim when it
> >> over-allocating pages for ring buffer. I amend a patch over to protect
> >> process with OOM_SCORE_ADJ_MIN from doing so. Because it will make
> >> such process to be selected by current OOM's way of
> >> selecting.(consider OOM_FLAG_ORIGIN first before the adj)
> >
> > I just wouldn't really care unless there is an existing and reasonable
> > usecase for an application which updates the ring buffer size _and_ it
> > is OOM disabled at the same time.
> There is indeed such kind of test case on my android system, which is
> known as CTS and Monkey etc.
Does the test simulate a real workload? I mean we have two things here: an
oom-disabled task, and an updater that grows the ftrace ring buffer to a
potentially large size. The second can be completely isolated to a
different context, no? So why do they run in a single user process
context?
> Furthermore, I think we should make the
> patch to be as safest as possible. Why do we leave a potential risk
> here? There is no side effect for my patch.
I do not have the full context. Could you point me to your patch?
--
Michal Hocko
SUSE Labs
On Tue, Apr 10, 2018 at 4:12 PM, Michal Hocko <[email protected]> wrote:
> On Tue 10-04-18 16:04:40, Zhaoyang Huang wrote:
>> On Tue, Apr 10, 2018 at 3:49 PM, Michal Hocko <[email protected]> wrote:
>> > On Tue 10-04-18 14:39:35, Zhaoyang Huang wrote:
>> >> On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
> [...]
>> >> > OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
>> >> > So what exactly do you want to achieve here? Because from the above it
>> >> > sounds like opposite things. /me confused...
>> >> >
>> >> Steve's patch intend to have the process be OOM's victim when it
>> >> over-allocating pages for ring buffer. I amend a patch over to protect
>> >> process with OOM_SCORE_ADJ_MIN from doing so. Because it will make
>> >> such process to be selected by current OOM's way of
>> >> selecting.(consider OOM_FLAG_ORIGIN first before the adj)
>> >
>> > I just wouldn't really care unless there is an existing and reasonable
>> > usecase for an application which updates the ring buffer size _and_ it
>> > is OOM disabled at the same time.
>> There is indeed such kind of test case on my android system, which is
>> known as CTS and Monkey etc.
>
> Does the test simulate a real workload? I mean we have two things here
>
> oom disabled task and an updater of the ftrace ring buffer to a
> potentially large size. The second can be completely isolated to a
> different context, no? So why do they run in the single user process
> context?
OK, I think there are some misunderstandings here. Let me try to explain
further. There is just one thing here: the updater is originally an
oom-disabled task with adj == OOM_SCORE_ADJ_MIN. With Steven's patch it
periodically becomes an oom-killable task, because set_current_oom_origin()
is called for the user process that is enlarging the ring buffer. What I am
doing here is limiting that to user processes with adj > -1000.
>
>> Furthermore, I think we should make the
>> patch to be as safest as possible. Why do we leave a potential risk
>> here? There is no side effect for my patch.
>
> I do not have the full context. Could you point me to your patch?
here are Steven and my patches
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5f38398..1005d73 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1135,7 +1135,7 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
{
struct buffer_page *bpage, *tmp;
- bool user_thread = current->mm != NULL;
+ bool user_thread = (current->mm != NULL && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN); //by zhaoyang
gfp_t mflags;
long i;
-----------------------------------------------------------------------------------------------------
{
struct buffer_page *bpage, *tmp;
+ bool user_thread = current->mm != NULL;
+ gfp_t mflags;
long i;
- /* Check if the available memory is there first */
+ /*
+ * Check if the available memory is there first.
+ * Note, si_mem_available() only gives us a rough estimate of available
+ * memory. It may not be accurate. But we don't care, we just want
+ * to prevent doing any allocation when it is obvious that it is
+ * not going to succeed.
+ */
i = si_mem_available();
if (i < nr_pages)
return -ENOMEM;
+ /*
+ * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
+ * gracefully without invoking oom-killer and the system is not
+ * destabilized.
+ */
+ mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+
+ /*
+ * If a user thread allocates too much, and si_mem_available()
+ * reports there's enough memory, even though there is not.
+ * Make sure the OOM killer kills this thread. This can happen
+ * even with RETRY_MAYFAIL because another task may be doing
+ * an allocation after this task has taken all memory.
+ * This is the task the OOM killer needs to take out during this
+ * loop, even if it was triggered by an allocation somewhere else.
+ */
+ if (user_thread)
+ set_current_oom_origin();
for (i = 0; i < nr_pages; i++) {
struct page *page;
- /*
- * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
- * gracefully without invoking oom-killer and the system is not
- * destabilized.
- */
+
bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
- GFP_KERNEL | __GFP_RETRY_MAYFAIL,
- cpu_to_node(cpu));
+ mflags, cpu_to_node(cpu));
if (!bpage)
goto free_pages;
list_add(&bpage->list, pages);
- page = alloc_pages_node(cpu_to_node(cpu),
- GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
+ page = alloc_pages_node(cpu_to_node(cpu), mflags, 0);
if (!page)
goto free_pages;
bpage->page = page_address(page);
rb_init_page(bpage->page);
+
+ if (user_thread && fatal_signal_pending(current))
+ goto free_pages;
}
+ if (user_thread)
+ clear_current_oom_origin();
return 0;
@@ -1199,6 +1225,8 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
list_del_init(&bpage->list);
free_buffer_page(bpage);
}
+ if (user_thread)
+ clear_current_oom_origin();
return -ENOMEM;
}
> --
> Michal Hocko
> SUSE Labs
On Tue 10-04-18 16:38:32, Zhaoyang Huang wrote:
> On Tue, Apr 10, 2018 at 4:12 PM, Michal Hocko <[email protected]> wrote:
> > On Tue 10-04-18 16:04:40, Zhaoyang Huang wrote:
> >> On Tue, Apr 10, 2018 at 3:49 PM, Michal Hocko <[email protected]> wrote:
> >> > On Tue 10-04-18 14:39:35, Zhaoyang Huang wrote:
> >> >> On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
> > [...]
> >> >> > OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
> >> >> > So what exactly do you want to achieve here? Because from the above it
> >> >> > sounds like opposite things. /me confused...
> >> >> >
> >> >> Steve's patch intend to have the process be OOM's victim when it
> >> >> over-allocating pages for ring buffer. I amend a patch over to protect
> >> >> process with OOM_SCORE_ADJ_MIN from doing so. Because it will make
> >> >> such process to be selected by current OOM's way of
> >> >> selecting.(consider OOM_FLAG_ORIGIN first before the adj)
> >> >
> >> > I just wouldn't really care unless there is an existing and reasonable
> >> > usecase for an application which updates the ring buffer size _and_ it
> >> > is OOM disabled at the same time.
> >> There is indeed such kind of test case on my android system, which is
> >> known as CTS and Monkey etc.
> >
> > Does the test simulate a real workload? I mean we have two things here
> >
> > oom disabled task and an updater of the ftrace ring buffer to a
> > potentially large size. The second can be completely isolated to a
> > different context, no? So why do they run in the single user process
> > context?
> ok. I think there are some misunderstandings here. Let me try to
> explain more by my poor English. There is just one thing here. The
> updater is originally a oom disabled task with adj=OOM_SCORE_ADJ_MIN.
> With Steven's patch, it will periodically become a oom killable task
> by calling set_current_oom_origin() for user process which is
> enlarging the ring buffer. What I am doing here is limit the user
> process to the ones that adj > -1000.
I've understood that part. And I am questioning whether this is really
such an important case that it justifies further tricks. Wouldn't it be
much simpler to move the updater out to a separate process? OOM-disabled
processes shouldn't really do unexpectedly large allocations. Full stop.
Otherwise you risk large system disruptions.
--
Michal Hocko
SUSE Labs
On Tue, Apr 10, 2018 at 5:01 PM, Michal Hocko <[email protected]> wrote:
> On Tue 10-04-18 16:38:32, Zhaoyang Huang wrote:
>> On Tue, Apr 10, 2018 at 4:12 PM, Michal Hocko <[email protected]> wrote:
>> > On Tue 10-04-18 16:04:40, Zhaoyang Huang wrote:
>> >> On Tue, Apr 10, 2018 at 3:49 PM, Michal Hocko <[email protected]> wrote:
>> >> > On Tue 10-04-18 14:39:35, Zhaoyang Huang wrote:
>> >> >> On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
>> > [...]
>> >> >> > OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
>> >> >> > So what exactly do you want to achieve here? Because from the above it
>> >> >> > sounds like opposite things. /me confused...
>> >> >> >
>> >> >> Steve's patch intend to have the process be OOM's victim when it
>> >> >> over-allocating pages for ring buffer. I amend a patch over to protect
>> >> >> process with OOM_SCORE_ADJ_MIN from doing so. Because it will make
>> >> >> such process to be selected by current OOM's way of
>> >> >> selecting.(consider OOM_FLAG_ORIGIN first before the adj)
>> >> >
>> >> > I just wouldn't really care unless there is an existing and reasonable
>> >> > usecase for an application which updates the ring buffer size _and_ it
>> >> > is OOM disabled at the same time.
>> >> There is indeed such kind of test case on my android system, which is
>> >> known as CTS and Monkey etc.
>> >
>> > Does the test simulate a real workload? I mean we have two things here
>> >
>> > oom disabled task and an updater of the ftrace ring buffer to a
>> > potentially large size. The second can be completely isolated to a
>> > different context, no? So why do they run in the single user process
>> > context?
>> ok. I think there are some misunderstandings here. Let me try to
>> explain more by my poor English. There is just one thing here. The
>> updater is originally a oom disabled task with adj=OOM_SCORE_ADJ_MIN.
>> With Steven's patch, it will periodically become a oom killable task
>> by calling set_current_oom_origin() for user process which is
>> enlarging the ring buffer. What I am doing here is limit the user
>> process to the ones that adj > -1000.
>
> I've understood that part. And I am arguing whether this is really such
> an important case to play further tricks. Wouldn't it be much simpler to
> put the updater out to a separate process? OOM disabled processes
> shouldn't really do unexpectedly large allocations. Full stop. Otherwise
> you risk a large system disruptions.
> --
It is a real problem (my Android system hung while running the test case
because an innocent key process was killed by OOM). However, the problem is
that we cannot dictate userspace's behavior as you suggest. What Steven's
patch does here is keep the system stable by having the updater take the
responsibility itself. My patch lets the OOM-disabled processes keep their
unkillable status.
> Michal Hocko
> SUSE Labs
On Tue, Apr 10, 2018 at 5:32 PM, Zhaoyang Huang <[email protected]> wrote:
> On Tue, Apr 10, 2018 at 5:01 PM, Michal Hocko <[email protected]> wrote:
>> On Tue 10-04-18 16:38:32, Zhaoyang Huang wrote:
>>> On Tue, Apr 10, 2018 at 4:12 PM, Michal Hocko <[email protected]> wrote:
>>> > On Tue 10-04-18 16:04:40, Zhaoyang Huang wrote:
>>> >> On Tue, Apr 10, 2018 at 3:49 PM, Michal Hocko <[email protected]> wrote:
>>> >> > On Tue 10-04-18 14:39:35, Zhaoyang Huang wrote:
>>> >> >> On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
>>> > [...]
>>> >> >> > OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
>>> >> >> > So what exactly do you want to achieve here? Because from the above it
>>> >> >> > sounds like opposite things. /me confused...
>>> >> >> >
>>> >> >> Steve's patch intend to have the process be OOM's victim when it
>>> >> >> over-allocating pages for ring buffer. I amend a patch over to protect
>>> >> >> process with OOM_SCORE_ADJ_MIN from doing so. Because it will make
>>> >> >> such process to be selected by current OOM's way of
>>> >> >> selecting.(consider OOM_FLAG_ORIGIN first before the adj)
>>> >> >
>>> >> > I just wouldn't really care unless there is an existing and reasonable
>>> >> > usecase for an application which updates the ring buffer size _and_ it
>>> >> > is OOM disabled at the same time.
>>> >> There is indeed such kind of test case on my android system, which is
>>> >> known as CTS and Monkey etc.
>>> >
>>> > Does the test simulate a real workload? I mean we have two things here
>>> >
>>> > oom disabled task and an updater of the ftrace ring buffer to a
>>> > potentially large size. The second can be completely isolated to a
>>> > different context, no? So why do they run in the single user process
>>> > context?
>>> ok. I think there are some misunderstandings here. Let me try to
>>> explain more by my poor English. There is just one thing here. The
>>> updater is originally a oom disabled task with adj=OOM_SCORE_ADJ_MIN.
>>> With Steven's patch, it will periodically become a oom killable task
>>> by calling set_current_oom_origin() for user process which is
>>> enlarging the ring buffer. What I am doing here is limit the user
>>> process to the ones that adj > -1000.
>>
>> I've understood that part. And I am arguing whether this is really such
>> an important case to play further tricks. Wouldn't it be much simpler to
>> put the updater out to a separate process? OOM disabled processes
>> shouldn't really do unexpectedly large allocations. Full stop. Otherwise
>> you risk a large system disruptions.
>> --
> It is a real problem(my android system just hung there while running
> the test case for the innocent key process killed by OOM), however,
> the problem is we can not define the userspace's behavior as you
> suggested. What Steven's patch doing here is to keep the system to be
> stable by having the updater to take the responsbility itself. My
> patch is to let the OOM disabled processes remain the unkillable
> status.
>
>> Michal Hocko
>> SUSE Labs
To summarize the patch set: 'let the updater take the responsibility
itself, don't harm the innocent, but spare the critical process'.
On Tue 10-04-18 17:32:44, Zhaoyang Huang wrote:
> On Tue, Apr 10, 2018 at 5:01 PM, Michal Hocko <[email protected]> wrote:
> > On Tue 10-04-18 16:38:32, Zhaoyang Huang wrote:
> >> On Tue, Apr 10, 2018 at 4:12 PM, Michal Hocko <[email protected]> wrote:
> >> > On Tue 10-04-18 16:04:40, Zhaoyang Huang wrote:
> >> >> On Tue, Apr 10, 2018 at 3:49 PM, Michal Hocko <[email protected]> wrote:
> >> >> > On Tue 10-04-18 14:39:35, Zhaoyang Huang wrote:
> >> >> >> On Tue, Apr 10, 2018 at 2:14 PM, Michal Hocko <[email protected]> wrote:
> >> > [...]
> >> >> >> > OOM_SCORE_ADJ_MIN means "hide the process from the OOM killer completely".
> >> >> >> > So what exactly do you want to achieve here? Because from the above it
> >> >> >> > sounds like opposite things. /me confused...
> >> >> >> >
> >> >> >> Steve's patch intend to have the process be OOM's victim when it
> >> >> >> over-allocating pages for ring buffer. I amend a patch over to protect
> >> >> >> process with OOM_SCORE_ADJ_MIN from doing so. Because it will make
> >> >> >> such process to be selected by current OOM's way of
> >> >> >> selecting.(consider OOM_FLAG_ORIGIN first before the adj)
> >> >> >
> >> >> > I just wouldn't really care unless there is an existing and reasonable
> >> >> > usecase for an application which updates the ring buffer size _and_ it
> >> >> > is OOM disabled at the same time.
> >> >> There is indeed such kind of test case on my android system, which is
> >> >> known as CTS and Monkey etc.
> >> >
> >> > Does the test simulate a real workload? I mean we have two things here
> >> >
> >> > oom disabled task and an updater of the ftrace ring buffer to a
> >> > potentially large size. The second can be completely isolated to a
> >> > different context, no? So why do they run in the single user process
> >> > context?
> >> ok. I think there are some misunderstandings here. Let me try to
> >> explain more by my poor English. There is just one thing here. The
> >> updater is originally a oom disabled task with adj=OOM_SCORE_ADJ_MIN.
> >> With Steven's patch, it will periodically become a oom killable task
> >> by calling set_current_oom_origin() for user process which is
> >> enlarging the ring buffer. What I am doing here is limit the user
> >> process to the ones that adj > -1000.
> >
> > I've understood that part. And I am arguing whether this is really such
> > an important case to play further tricks. Wouldn't it be much simpler to
> > put the updater out to a separate process? OOM disabled processes
> > shouldn't really do unexpectedly large allocations. Full stop. Otherwise
> > you risk a large system disruptions.
> > --
> It is a real problem(my android system just hung there while running
> the test case for the innocent key process killed by OOM), however,
> the problem is we can not define the userspace's behavior as you
> suggested. What Steven's patch doing here is to keep the system to be
> stable by having the updater to take the responsbility itself. My
> patch is to let the OOM disabled processes remain the unkillable
> status.
But you do realize that what you are proposing is by no means any safer,
don't you? The memory allocated for the ring buffer is _not_ accounted
to any process and as such it is not considered by the oom killer when
picking up an oom victim so you are quite likely to pick up an innocent
process to be killed. So basically you are risking an allocation runaway
completely hidden from the OOM killer. Now, the downside of the patch is
that the OOM_SCORE_ADJ_MIN task might get killed which is something that
shouldn't happen because it is a contract. I would call this an
unsolvable problem and an inherently broken design of the oom disabled
task. So far I haven't heard a single _argument_ why supporting such a
weird cornercase is desirable when your application can trivially do
fork(); set_oom_score_adj(); exec("echo $VAR > $RINGBUFFER_FILE")
--
Michal Hocko
SUSE Labs
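For completeness, a minimal sketch of the pattern Michal describes: the
OOM-disabled parent keeps its -1000 score, and only a short-lived, killable
child performs the resize. The tracefs path and the child's adj value are
illustrative assumptions, not part of any patch in this thread.

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	/*
	 * Sketch only: fork(); raise the child's oom_score_adj; let the child
	 * write the new size, so the task doing the potentially huge
	 * allocation is one the OOM killer is allowed to target.
	 */
	static int resize_ring_buffer_kb(const char *kb)
	{
		pid_t pid = fork();

		if (pid < 0)
			return -1;
		if (pid == 0) {
			FILE *adj = fopen("/proc/self/oom_score_adj", "w");
			FILE *buf;

			if (adj) {
				fprintf(adj, "0\n"); /* child is OOM-killable again */
				fclose(adj);
			}
			/* path is an assumption; tracefs may also be at /sys/kernel/tracing */
			buf = fopen("/sys/kernel/debug/tracing/buffer_size_kb", "w");
			if (buf) {
				fprintf(buf, "%s\n", kb);
				fclose(buf);
			}
			_exit(0);
		}
		waitpid(pid, NULL, 0);
		return 0;
	}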
On Tue, 10 Apr 2018 12:49:02 +0200
Michal Hocko <[email protected]> wrote:
> But you do realize that what you are proposing is by no means any safer,
> don't you? The memory allocated for the ring buffer is _not_ accounted
> to any process and as such it is not considered by the oom killer when
> picking up an oom victim so you are quite likely to pick up an innocent
> process to be killed. So basically you are risking an allocation runaway
> completely hidden from the OOM killer. Now, the downside of the patch is
> that the OOM_SCORE_ADJ_MIN task might get killed which is something that
> shouldn't happen because it is a contract. I would call this an
> unsolvable problem and a inherent broken design of the oom disabled
> task. So far I haven't heard a single _argument_ why supporting such a
> weird cornercase is desirable when your application can trivial do
>
> fork(); set_oom_score_adj(); exec("echo $VAR > $RINGBUFFER_FILE")
We could do this as a compromise:
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c9cb9767d49b..40c2e0a56c51 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1185,6 +1185,12 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
*/
mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+ /* If we can't OOM this task, then only allocate without reclaim */
+ if (unlikely(current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)) {
+ mflags = GFP_KERNEL | __GFP_NORETRY;
+ user_thread = false; /* do not set oom_origin */
+ }
+
/*
* If a user thread allocates too much, and si_mem_available()
* reports there's enough memory, even though there is not.
This way, if one sets OOM_SCORE_ADJ_MIN, then we won't set oom_origin
for the task, but we also won't try hard to allocate memory if there is
nothing immediately available.
-- Steve
On Tue 10-04-18 08:23:16, Steven Rostedt wrote:
> On Tue, 10 Apr 2018 12:49:02 +0200
> Michal Hocko <[email protected]> wrote:
>
> > But you do realize that what you are proposing is by no means any safer,
> > don't you? The memory allocated for the ring buffer is _not_ accounted
> > to any process and as such it is not considered by the oom killer when
> > picking up an oom victim so you are quite likely to pick up an innocent
> > process to be killed. So basically you are risking an allocation runaway
> > completely hidden from the OOM killer. Now, the downside of the patch is
> > that the OOM_SCORE_ADJ_MIN task might get killed which is something that
> > shouldn't happen because it is a contract. I would call this an
> > unsolvable problem and a inherent broken design of the oom disabled
> > task. So far I haven't heard a single _argument_ why supporting such a
> > weird cornercase is desirable when your application can trivial do
> >
> > fork(); set_oom_score_adj(); exec("echo $VAR > $RINGBUFFER_FILE")
>
> We could do this as a compromise:
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index c9cb9767d49b..40c2e0a56c51 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1185,6 +1185,12 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
> */
> mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
>
> + /* If we can't OOM this task, then only allocate without reclaim */
> + if (unlikely(current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)) {
> + mflags = GFP_KERNEL | __GFP_NORETRY;
> + user_thread = false; /* do not set oom_origin */
> + }
> +
> /*
> * If a user thread allocates too much, and si_mem_available()
> * reports there's enough memory, even though there is not.
>
> This way, if one sets OOM_SCORE_ADJ_MIN, then we wont set oom_origin
> for the task, but we also wont try hard to allocate memory if there is
> nothing immediately available.
I would rather that the code outside of MM not touch implementation
details like OOM_SCORE_ADJ_MIN. It is really hard to get rid of abusers
whenever you try to change something in MM then. Especially when the
usecase is quite dubious.
--
Michal Hocko
SUSE Labs
On Tue, 10 Apr 2018 14:27:06 +0200
Michal Hocko <[email protected]> wrote:
> I would rather that the code outside of MM not touch implementation
> details like OOM_SCORE_ADJ_MIN. It is really hard to get rid of abusers
> whenever you try to change something in MM then. Especially when the
> usecase is quite dubious.
Fair enough. I was reluctant to use the OOM_SCORE_ADJ_MIN in this case.
Perhaps I can create an option that lets users decide how they want to
allocate.
For people like Joel, it will try hard (by default) and set oom_origin,
but the user could also set an option "no_mem_reclaim", where it will
not set oom_origin and will also use NORETRY, so it won't
trigger OOM and will not be the designated victim of OOM. But it will
likely not allocate memory if the system is under heavy use. Then for
Zhaoyang's tests, all he has to do is to set that option (or clear it,
if the option is "mem_reclaim", which is what I'll probably call it).
-- Steve
On Tue, 10 Apr 2018 08:36:25 -0400
Steven Rostedt <[email protected]> wrote:
> On Tue, 10 Apr 2018 14:27:06 +0200
> Michal Hocko <[email protected]> wrote:
>
> > I would rather that the code outside of MM not touch implementation
> > details like OOM_SCORE_ADJ_MIN. It is really hard to get rid of abusers
> > whenever you try to change something in MM then. Especially when the
> > usecase is quite dubious.
>
> Fair enough. I was reluctant to use the OOM_SCORE_ADJ_MIN in this case.
>
> Perhaps I can create an option that lets users decide how they want to
> allocate.
>
> For people like Joel, it will try hard (by default) and set oom_origin,
> but the user could also set an option "no_mem_reclaim", where it will
> not set oom_origin, but will also use NORETRY as well, where it wont
> trigger OOM and will not be the designated victim of OOM. But it will
> likely not allocate memory if the system is under heavy use. Then for
> Zhaoyang's tests, all he has to do is to set that option (or clear it,
> if the option is "mem_reclaim", which is what I'll probably call it).
>
Something like this:
-- Steve
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index a0233edc0718..807e2bcb21b3 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -106,7 +106,8 @@ __poll_t ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
void ring_buffer_free(struct ring_buffer *buffer);
-int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, int cpu);
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
+ int cpu, int rbflags);
void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
@@ -201,6 +202,7 @@ int ring_buffer_print_page_header(struct trace_seq *s);
enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
+ RB_FL_NO_RECLAIM = 1 << 1,
};
#ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c9cb9767d49b..b7fd541ca025 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1160,7 +1160,8 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
return 0;
}
-static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
+static int __rb_allocate_pages(long nr_pages, struct list_head *pages,
+ int cpu, int rbflags)
{
struct buffer_page *bpage, *tmp;
bool user_thread = current->mm != NULL;
@@ -1185,6 +1186,12 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
*/
mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+ /* Do not try hard to get memory, and do not trigger OOM */
+ if (rbflags & RB_FL_NO_RECLAIM) {
+ mflags = GFP_KERNEL | __GFP_NORETRY;
+ user_thread = false;
+ }
+
/*
* If a user thread allocates too much, and si_mem_available()
* reports there's enough memory, even though there is not.
@@ -1232,13 +1239,13 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
}
static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
- unsigned long nr_pages)
+ unsigned long nr_pages, int rbflags)
{
LIST_HEAD(pages);
WARN_ON(!nr_pages);
- if (__rb_allocate_pages(nr_pages, &pages, cpu_buffer->cpu))
+ if (__rb_allocate_pages(nr_pages, &pages, cpu_buffer->cpu, rbflags))
return -ENOMEM;
/*
@@ -1297,7 +1304,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, long nr_pages, int cpu)
INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
INIT_LIST_HEAD(&cpu_buffer->new_pages);
- ret = rb_allocate_pages(cpu_buffer, nr_pages);
+ ret = rb_allocate_pages(cpu_buffer, nr_pages, buffer->flags);
if (ret < 0)
goto fail_free_reader;
@@ -1685,7 +1692,7 @@ static void update_pages_handler(struct work_struct *work)
* Returns 0 on success and < 0 on failure.
*/
int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
- int cpu_id)
+ int cpu_id, int rbflags)
{
struct ring_buffer_per_cpu *cpu_buffer;
unsigned long nr_pages;
@@ -1739,7 +1746,7 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
*/
INIT_LIST_HEAD(&cpu_buffer->new_pages);
if (__rb_allocate_pages(cpu_buffer->nr_pages_to_update,
- &cpu_buffer->new_pages, cpu)) {
+ &cpu_buffer->new_pages, cpu, rbflags)) {
/* not enough memory for new pages */
err = -ENOMEM;
goto out_err;
@@ -1795,7 +1802,7 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
INIT_LIST_HEAD(&cpu_buffer->new_pages);
if (cpu_buffer->nr_pages_to_update > 0 &&
__rb_allocate_pages(cpu_buffer->nr_pages_to_update,
- &cpu_buffer->new_pages, cpu_id)) {
+ &cpu_buffer->new_pages, cpu_id, rbflags)) {
err = -ENOMEM;
goto out_err;
}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e18e69183c9a..ff0164affe0a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -250,7 +250,8 @@ unsigned long long ns2usecs(u64 nsec)
TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK | \
TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | \
TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE | \
- TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS)
+ TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS | \
+ TRACE_ITER_MEM_RECLAIM)
/* trace_options that are only supported by global_trace */
#define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER_PRINTK | \
@@ -974,7 +975,7 @@ static void free_snapshot(struct trace_array *tr)
* The max_tr ring buffer has some state (e.g. ring->clock) and
* we want preserve it.
*/
- ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
+ ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS, 0);
set_buffer_entries(&tr->max_buffer, 1);
tracing_reset_online_cpus(&tr->max_buffer);
tr->allocated_snapshot = false;
@@ -1494,7 +1495,9 @@ static int run_tracer_selftest(struct tracer *type)
/* If we expanded the buffers, make sure the max is expanded too */
if (ring_buffer_expanded)
ring_buffer_resize(tr->max_buffer.buffer, trace_buf_size,
- RING_BUFFER_ALL_CPUS);
+ RING_BUFFER_ALL_CPUS,
+ tr->trace_flags & TRACE_ITER_MEM_RECLAIM ?
+ 0 : RB_FL_NO_RECLAIM);
tr->allocated_snapshot = true;
}
#endif
@@ -1520,7 +1523,9 @@ static int run_tracer_selftest(struct tracer *type)
/* Shrink the max buffer again */
if (ring_buffer_expanded)
ring_buffer_resize(tr->max_buffer.buffer, 1,
- RING_BUFFER_ALL_CPUS);
+ RING_BUFFER_ALL_CPUS,
+ tr->trace_flags & TRACE_ITER_MEM_RECLAIM ?
+ 0 : RB_FL_NO_RECLAIM);
}
#endif
@@ -5168,7 +5173,7 @@ static int resize_buffer_duplicate_size(struct trace_buffer *trace_buf,
if (cpu_id == RING_BUFFER_ALL_CPUS) {
for_each_tracing_cpu(cpu) {
ret = ring_buffer_resize(trace_buf->buffer,
- per_cpu_ptr(size_buf->data, cpu)->entries, cpu);
+ per_cpu_ptr(size_buf->data, cpu)->entries, cpu, 0);
if (ret < 0)
break;
per_cpu_ptr(trace_buf->data, cpu)->entries =
@@ -5176,7 +5181,7 @@ static int resize_buffer_duplicate_size(struct trace_buffer *trace_buf,
}
} else {
ret = ring_buffer_resize(trace_buf->buffer,
- per_cpu_ptr(size_buf->data, cpu_id)->entries, cpu_id);
+ per_cpu_ptr(size_buf->data, cpu_id)->entries, cpu_id, 0);
if (ret == 0)
per_cpu_ptr(trace_buf->data, cpu_id)->entries =
per_cpu_ptr(size_buf->data, cpu_id)->entries;
@@ -5202,7 +5207,9 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->trace_buffer.buffer)
return 0;
- ret = ring_buffer_resize(tr->trace_buffer.buffer, size, cpu);
+ ret = ring_buffer_resize(tr->trace_buffer.buffer, size, cpu,
+ tr->trace_flags & TRACE_ITER_MEM_RECLAIM ?
+ 0 : RB_FL_NO_RECLAIM);
if (ret < 0)
return ret;
@@ -5211,7 +5218,9 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
!tr->current_trace->use_max_tr)
goto out;
- ret = ring_buffer_resize(tr->max_buffer.buffer, size, cpu);
+ ret = ring_buffer_resize(tr->max_buffer.buffer, size, cpu,
+ tr->trace_flags & TRACE_ITER_MEM_RECLAIM ?
+ 0 : RB_FL_NO_RECLAIM);
if (ret < 0) {
int r = resize_buffer_duplicate_size(&tr->trace_buffer,
&tr->trace_buffer, cpu);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 6fb46a06c9dc..2c901f52baf2 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1143,6 +1143,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
C(IRQ_INFO, "irq-info"), \
C(MARKERS, "markers"), \
C(EVENT_FORK, "event-fork"), \
+ C(MEM_RECLAIM, "mem-reclaim"), \
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
On Tue, 10 Apr 2018 09:13:11 -0400
Steven Rostedt <[email protected]> wrote:
> Something like this:
Zhaoyang, would this work for you?
Then all you need to do is to:
echo 0 > /d/tracing/options/mem-reclaim
and then you have the old behavior before Joel's patch. It will use
NORETRY and also not set up the OOM to kill your task.
-- Steve
Hi Steve,
On Tue, Apr 10, 2018 at 6:13 AM, Steven Rostedt <[email protected]> wrote:
> On Tue, 10 Apr 2018 08:36:25 -0400
> Steven Rostedt <[email protected]> wrote:
>
>> On Tue, 10 Apr 2018 14:27:06 +0200
>> Michal Hocko <[email protected]> wrote:
>>
>> > I would rather that the code outside of MM not touch implementation
>> > details like OOM_SCORE_ADJ_MIN. It is really hard to get rid of abusers
>> > whenever you try to change something in MM then. Especially when the
>> > usecase is quite dubious.
>>
>> Fair enough. I was reluctant to use the OOM_SCORE_ADJ_MIN in this case.
>>
>> Perhaps I can create an option that lets users decide how they want to
>> allocate.
>>
>> For people like Joel, it will try hard (by default) and set oom_origin,
>> but the user could also set an option "no_mem_reclaim", where it will
>> not set oom_origin, but will also use NORETRY as well, where it wont
>> trigger OOM and will not be the designated victim of OOM. But it will
>> likely not allocate memory if the system is under heavy use. Then for
>> Zhaoyang's tests, all he has to do is to set that option (or clear it,
>> if the option is "mem_reclaim", which is what I'll probably call it).
>>
>
>
> Something like this:
>
> -- Steve
>
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index a0233edc0718..807e2bcb21b3 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -106,7 +106,8 @@ __poll_t ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
>
> void ring_buffer_free(struct ring_buffer *buffer);
>
> -int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, int cpu);
> +int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
> + int cpu, int rbflags);
>
> void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
>
> @@ -201,6 +202,7 @@ int ring_buffer_print_page_header(struct trace_seq *s);
>
> enum ring_buffer_flags {
> RB_FL_OVERWRITE = 1 << 0,
> + RB_FL_NO_RECLAIM = 1 << 1,
But the thing is, set_oom_origin doesn't seem to be doing the
desirable thing every time anyway as per my tests last week [1] and
the si_mem_available check alone seems to be working fine for me (and
also Zhaoyang as he mentioned).
Since the problem Zhaoyang is now referring to is caused by calling
set_oom_origin in the first place, can we not just drop that patch and
avoid adding more complexity?
IMHO I feel like for things like RB memory allocation, we shouldn't
add a knob if we don't need to.
Also I think Zhaoyang is developing for Android too since he mentioned
he ran CTS tests so we both have the same "usecase" but he can feel
free to correct me if that's not the case ;)
thanks,
- Joel
[1] https://www.spinics.net/lists/linux-mm/msg149227.html
On Tue, 10 Apr 2018 09:45:54 -0700
Joel Fernandes <[email protected]> wrote:
> > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> > index a0233edc0718..807e2bcb21b3 100644
> > --- a/include/linux/ring_buffer.h
> > +++ b/include/linux/ring_buffer.h
> > @@ -106,7 +106,8 @@ __poll_t ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
> >
> > void ring_buffer_free(struct ring_buffer *buffer);
> >
> > -int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, int cpu);
> > +int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
> > + int cpu, int rbflags);
> >
> > void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
> >
> > @@ -201,6 +202,7 @@ int ring_buffer_print_page_header(struct trace_seq *s);
> >
> > enum ring_buffer_flags {
> > RB_FL_OVERWRITE = 1 << 0,
> > + RB_FL_NO_RECLAIM = 1 << 1,
>
> But the thing is, set_oom_origin doesn't seem to be doing the
> desirable thing every time anyway as per my tests last week [1] and
> the si_mem_available check alone seems to be working fine for me (and
> also Zhaoyang as he mentioned).
But did you try it with just plain GFP_KERNEL, and not RETRY_MAYFAIL?
My tests would always take out the allocating task without
RETRY_MAYFAIL, but with RETRY_MAYFAIL it would sometimes take out other
tasks.
>
> Since the problem Zhaoyang is now referring to is caused because of
> calling set_oom_origin in the first place, can we not just drop that
> patch and avoid adding more complexity?
Actually, I'm thinking of dropping the MAYFAIL part. The task extending
the ring buffer really should be the one targeted.
I could add two loops. One that does NORETRY without the oom origin,
and if it succeeds, it's fine. But if it requires reclaim, it will then
set oom_origin and go harder (where it should be the one targeted).
But that may be pointless, because if NORETRY succeeds, there's not
really any likelihood of oom triggering in the first place.
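Concretely, that two-pass idea would look roughly like this (a sketch of the suggestion only, not a posted patch; size and cpu stand in for the real allocation arguments):

	/* Pass 1: opportunistic, fail fast instead of starting reclaim */
	bpage = kzalloc_node(size, GFP_KERNEL | __GFP_NORETRY, cpu_to_node(cpu));
	if (!bpage) {
		/*
		 * Pass 2: reclaim is needed. Mark this task as the preferred
		 * OOM victim so that, if the OOM killer does fire, it takes
		 * out the task doing the large allocation and not a bystander.
		 */
		set_current_oom_origin();
		bpage = kzalloc_node(size, GFP_KERNEL, cpu_to_node(cpu));
		clear_current_oom_origin();
	}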
>
> IMHO I feel like for things like RB memory allocation, we shouldn't
> add a knob if we don't need to.
It was just a suggestion.
>
> Also I think Zhaoyang is developing for Android too since he mentioned
> he ran CTS tests so we both have the same "usecase" but he can feel
> free to correct me if that's not the case ;)
I think if you are really worried about the task being killed by oom,
then I agree with Michal and just fork a process to do the allocation
for you.
-- Steve
Hi Steve,
On Tue, Apr 10, 2018 at 11:00 AM, Steven Rostedt <[email protected]> wrote:
> On Tue, 10 Apr 2018 09:45:54 -0700
> Joel Fernandes <[email protected]> wrote:
>
>> > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
>> > index a0233edc0718..807e2bcb21b3 100644
>> > --- a/include/linux/ring_buffer.h
>> > +++ b/include/linux/ring_buffer.h
>> > @@ -106,7 +106,8 @@ __poll_t ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
>> >
>> > void ring_buffer_free(struct ring_buffer *buffer);
>> >
>> > -int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, int cpu);
>> > +int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
>> > + int cpu, int rbflags);
>> >
>> > void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
>> >
>> > @@ -201,6 +202,7 @@ int ring_buffer_print_page_header(struct trace_seq *s);
>> >
>> > enum ring_buffer_flags {
>> > RB_FL_OVERWRITE = 1 << 0,
>> > + RB_FL_NO_RECLAIM = 1 << 1,
>>
>> But the thing is, set_oom_origin doesn't seem to be doing the
>> desirable thing every time anyway as per my tests last week [1] and
>> the si_mem_available check alone seems to be working fine for me (and
>> also Zhaoyang as he mentioned).
>
> But did you try it with just plain GFP_KERNEL, and not RETRY_MAYFAIL.
Yes I tried it with just GFP_KERNEL as well. What I did based on your
suggestion for testing the OOM hint is:
1. Comment out the si_mem_available check
2. Do only GFP_KERNEL
The system gets destabilized with this combination even with the OOM
hint. These threads are here:
https://lkml.org/lkml/2018/4/5/720
> My tests would always trigger the allocating task without the
> RETRY_MAYFAIL, but with RETRY_MAYFAIL it would sometimes take out other
> tasks.
>
>>
>> Since the problem Zhaoyang is now referring to is caused because of
>> calling set_oom_origin in the first place, can we not just drop that
>> patch and avoid adding more complexity?
>
> Actually, I'm thinking of dropping the MAYFAIL part. It really should
> be the one targeted if you are extending the ring buffer.
This then sounds like it should be fixed in -mm code? If we're giving
the hint and it's not getting killed there, then that's an -mm issue.
> I could add two loops. One that does NORETRY without the oom origin,
> and if it succeeds, its fine. But if it requires reclaim, it will then
> set oom_origin and go harder (where it should be the one targeted).
>
> But that may be pointless, because if NORETRY succeeds, there's not
> really any likelihood of oom triggering in the first place.
Yes.
>
>>
>> IMHO I feel like for things like RB memory allocation, we shouldn't
>> add a knob if we don't need to.
>
> It was just a suggestion.
Cool, I understand.
>>
>> Also I think Zhaoyang is developing for Android too since he mentioned
>> he ran CTS tests so we both have the same "usecase" but he can feel
>> free to correct me if that's not the case ;)
>
> I think if you are really worried with the task being killed by oom,
> then I agree with Michal and just fork a process to do the allocation
> for you.
Yes, I agree. So let's just do that and no additional patches are
needed then. Let me know if there's anything else I missed?
Also, I got a bit confused, so I reread all the threads. Zhaoyang's
current issue is that the OOM hint *IS* working, which is what
triggered your patch to toggle the behavior through an option. Whereas
in this message we are discussing that the OOM hint doesn't always
work, which is not Zhaoyang's current issue. Let me know if I missed
something? Sorry if I did.
thanks,
- Joel
On Tue, 10 Apr 2018 11:39:24 -0700
Joel Fernandes <[email protected]> wrote:
> Yes, I agree. So let's just do that and no additional patches are
> needed then. Let me know if there's anything else I missed?
Yeah, I think there's really no other issue. I'm not going to apply
more patches.
>
> Also, I got a bit confused, so I reread all the threads. Zhaoyang's
> current issue is that the OOM hint *IS* working, which is what
> triggered your patch to toggle the behavior through an option. Whereas
> in this message we are discussing that the OOM hint doesn't always
> work, which is not Zhaoyang's current issue. Let me know if I missed
> something? Sorry if I did.
Right, it's that if someone does extend the ring buffer size while
the system is about to go into OOM, that task may be picked and killed
even though it's not the culprit of the memory problems. But again,
this is really pushing what happens in the real world. That's the
problem with stress test cases: they trigger situations that will never
actually happen in normal environments, and the fix just complicates
everything for really no benefit.
-- Steve
On Wed, Apr 11, 2018 at 2:39 AM, Joel Fernandes <[email protected]> wrote:
> Hi Steve,
>
> On Tue, Apr 10, 2018 at 11:00 AM, Steven Rostedt <[email protected]> wrote:
>> On Tue, 10 Apr 2018 09:45:54 -0700
>> Joel Fernandes <[email protected]> wrote:
>>
>>> > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
>>> > index a0233edc0718..807e2bcb21b3 100644
>>> > --- a/include/linux/ring_buffer.h
>>> > +++ b/include/linux/ring_buffer.h
>>> > @@ -106,7 +106,8 @@ __poll_t ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
>>> >
>>> > void ring_buffer_free(struct ring_buffer *buffer);
>>> >
>>> > -int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, int cpu);
>>> > +int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
>>> > + int cpu, int rbflags);
>>> >
>>> > void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
>>> >
>>> > @@ -201,6 +202,7 @@ int ring_buffer_print_page_header(struct trace_seq *s);
>>> >
>>> > enum ring_buffer_flags {
>>> > RB_FL_OVERWRITE = 1 << 0,
>>> > + RB_FL_NO_RECLAIM = 1 << 1,
>>>
>>> But the thing is, set_oom_origin doesn't seem to be doing the
>>> desirable thing every time anyway as per my tests last week [1] and
>>> the si_mem_available check alone seems to be working fine for me (and
>>> also Zhaoyang as he mentioned).
>>
>> But did you try it with just plain GFP_KERNEL, and not RETRY_MAYFAIL.
>
> Yes I tried it with just GFP_KERNEL as well. What I did based on your
> suggestion for testing the OOM hint is:
> 1. Comment the si_mem_available check
> 2. Do only GFP_KERNEL
>
> The system gets destabilized with this combination even with the OOM
> hint. These threads are here:
> https://lkml.org/lkml/2018/4/5/720
>
>> My tests would always trigger the allocating task without the
>> RETRY_MAYFAIL, but with RETRY_MAYFAIL it would sometimes take out other
>> tasks.
>>
>>>
>>> Since the problem Zhaoyang is now referring to is caused because of
>>> calling set_oom_origin in the first place, can we not just drop that
>>> patch and avoid adding more complexity?
>>
>> Actually, I'm thinking of dropping the MAYFAIL part. It really should
>> be the one targeted if you are extending the ring buffer.
>
> This then sounds like it should be fixed in -mm code? If we're giving
> the hint and its not getting killed there then that's an -mm issue.
>
>> I could add two loops. One that does NORETRY without the oom origin,
>> and if it succeeds, its fine. But if it requires reclaim, it will then
>> set oom_origin and go harder (where it should be the one targeted).
>>
>> But that may be pointless, because if NORETRY succeeds, there's not
>> really any likelihood of oom triggering in the first place.
>
> Yes.
>
>>
>>>
>>> IMHO I feel like for things like RB memory allocation, we shouldn't
>>> add a knob if we don't need to.
>>
>> It was just a suggestion.
>
> Cool, I understand.
>
>>>
>>> Also I think Zhaoyang is developing for Android too since he mentioned
>>> he ran CTS tests so we both have the same "usecase" but he can feel
>>> free to correct me if that's not the case ;)
>>
>> I think if you are really worried with the task being killed by oom,
>> then I agree with Michal and just fork a process to do the allocation
>> for you.
>
> Yes, I agree. So let's just do that and no additional patches are
> needed then. Let me know if there's anything else I missed?
>
> Also, I got a bit confused, so I reread all the threads. Zhaoyang's
> current issue is that the OOM hint *IS* working, which is what
> triggered your patch to toggle the behavior through an option. Whereas
> in this message we are discussing that the OOM hint doesn't always
> work, which is not Zhaoyang's current issue. Let me know if I missed
> something? Sorry if I did.
>
> thanks,
>
> - Joel
Hi Joel, you are right. My issue is to make Steven's patch safer by
keeping -1000 (OOM_SCORE_ADJ_MIN) processes out of OOM. I think it is
ok either way: we can rely on the si_mem_available check alone, or
apply set/clear_current_oom_origin while exempting -1000 processes.
The CTS case failed because system_server was killed as an innocent
victim. If Steven thinks this is a rare corner case, I am ok with that.