2012-11-21 19:12:51

by azurIt

Subject: memory-cgroup bug

Hi,

i'm using memory cgroups for limiting our users and i'm having a really strange problem when a cgroup runs out of its memory limit. It's very strange because it happens only sometimes (about once per week, on a random user); out of memory is usually handled ok. This happens when the problem occurs:
- no new processes can be started in this cgroup
- existing processes are frozen and taking 100% of CPU
- when i try to 'strace' any of the processes, the whole strace freezes until the process is killed (strace cannot be terminated by CTRL-C)
- the problem can be resolved by raising the memory limit for the cgroup or by killing a few processes inside the cgroup so some memory is freed

I also grabbed the content of /proc/<pid>/stack of a frozen process:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
[<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810ec54e>] do_wp_page+0x14e/0x800
[<ffffffff810eda34>] handle_pte_fault+0x264/0x940
[<ffffffff810ee248>] handle_mm_fault+0x138/0x260
[<ffffffff810270ed>] do_page_fault+0x13d/0x460
[<ffffffff815b53ff>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

I'm currently using kernel 3.2.34 but i've been having this problem since 2.6.32.

Any ideas? Thnx.

azurIt


2012-11-22 18:27:53

by azurIt

Subject: Re: memory-cgroup bug

>> i'm using memory cgroups for limiting our users and i'm having a really
>> strange problem when a cgroup runs out of its memory limit. It's very
>> strange because it happens only sometimes (about once per week, on a
>> random user); out of memory is usually handled ok.
>
>What is your memcg configuration? Do you use deeper hierarchies, is
>use_hierarchy enabled? Is the memcg oom (aka memory.oom_control)
>enabled? Do you use soft limit for those groups? Is memcg swap
>accounting enabled and memsw limits in place?
>Is the machine under global memory pressure as well?
>Could you post sysrq+t or sysrq+w?


My cgroups hierarchy:
/cgroups/<user_id>/uid/

where '<user_id>' is the system user id and 'uid' is just the word 'uid'.

Memory limits are set in /cgroups/<user_id>/ and use_hierarchy is enabled. Processes are inside /cgroups/<user_id>/uid/. I'm using hard limits for memory and swap BUT the system has no swap at all (it has 'only' 16 GB of real RAM). memory.oom_control is set to 'oom_kill_disable 0'. The server has enough free memory when the problem occurs.
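A sketch of such a setup, so it's clear what is meant (the user id 1234 and the 150 MB value here are made-up examples, not the real configuration):

```shell
# Made-up example of the per-user setup described above.
LIMIT=$((150 * 1024 * 1024))   # a 150 MB hard limit
echo "$LIMIT"                  # prints 157286400
# The actual writes need root and a mounted cgroup fs:
#   mkdir -p /cgroups/1234/uid
#   echo "$LIMIT" > /cgroups/1234/memory.limit_in_bytes
#   echo "$LIMIT" > /cgroups/1234/memory.memsw.limit_in_bytes   # memory+swap
#   echo 1 > /cgroups/1234/memory.use_hierarchy
```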




>> This happens when the problem occurs:
>> - no new processes can be started in this cgroup
>> - existing processes are frozen and taking 100% of CPU
>> - when i try to 'strace' any of the processes, the whole strace
>> freezes until the process is killed (strace cannot be terminated by
>> CTRL-C)
>> - the problem can be resolved by raising the memory limit for the cgroup
>> or by killing a few processes inside the cgroup so some memory is freed
>>
>> I also grabbed the content of /proc/<pid>/stack of a frozen process:
>> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>
>Hmm what is this?


i really don't know, i will get the stacks of all frozen processes next time so we can compare them.



>> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
>> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
>> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
>> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
>> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
>> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
>> [<ffffffff815b53ff>] page_fault+0x1f/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>
>How many tasks are hung in mem_cgroup_handle_oom? If there were many
>of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
>make oom_lock 0 and 1 based rather than counter) and its follow up fix
>23751be00940 (memcg: fix hierarchical oom locking) but you are saying
>that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
>make more sense.


Usually a maximum of several tens of processes, but i will check it next time. I was having much worse problems with 2.6.32 - when the freezing happened, the whole server was affected (i wasn't able to do anything and had to wait until my scripts took care of it and killed apache, so i don't have any detailed info). In 3.2 only the target cgroup is affected.




>> I'm currently using kernel 3.2.34 but i've been having this problem since 2.6.32.
>
>I guess this is a clean vanilla (stable) kernel, right? Are you able to
>reproduce with the latest Linus tree?


Well, no. I'm using, for example, the newest stable grsecurity patch. I'm also using a few of Andrea Righi's cgroup subsystems but i don't believe these are causing the problems:
- cgroup-uid, which moves processes into cgroups based on UID
- cgroup-task, which can limit the number of tasks in a cgroup (i already tried disabling this one, it didn't help)
http://www.develer.com/~arighi/linux/patches/

Unfortunately i cannot just install a new and untested kernel version because i'm not able to reproduce this problem on demand (it happens randomly in the production environment).

Could it be that OOM cannot start and kill processes because there's no free memory left in the cgroup?



Thank you!

azur

2012-11-22 19:29:03

by Kamezawa Hiroyuki

Subject: Re: memory-cgroup bug

(2012/11/22 4:02), azurIt wrote:
> Hi,
>
> i'm using memory cgroups for limiting our users and i'm having a really strange problem when a cgroup runs out of its memory limit. It's very strange because it happens only sometimes (about once per week, on a random user); out of memory is usually handled ok. This happens when the problem occurs:
> - no new processes can be started in this cgroup
> - existing processes are frozen and taking 100% of CPU
> - when i try to 'strace' any of the processes, the whole strace freezes until the process is killed (strace cannot be terminated by CTRL-C)
> - the problem can be resolved by raising the memory limit for the cgroup or by killing a few processes inside the cgroup so some memory is freed
>
> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
> [<ffffffff815b53ff>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I'm currently using kernel 3.2.34 but i've been having this problem since 2.6.32.
>
> Any ideas? Thnx.
>

Under OOM in memcg, only one process is allowed to work, because processes tend to use up
CPU during a memory shortage; other processes are frozen.


Then, the problem here is the one process which uses CPU. IIUC, the 'frozen' threads are
asleep and never use CPU. It's expected that the oom-killer or memory reclaim can solve the problem.

What is your memcg's memory.oom_control value?

And the processes' oom_adj values? (/proc/<pid>/oom_adj, /proc/<pid>/oom_score_adj)
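These can be collected in one go, along these lines (a sketch: on the server you would loop over the PIDs listed in the cgroup's tasks file, here it is demonstrated on the current shell only):

```shell
# Dump the OOM tunables for a set of PIDs. On the affected box, replace
# "$$" with e.g. $(cat /cgroups/<user_id>/uid/tasks).
for p in $$; do
  adj=$(cat "/proc/$p/oom_adj" 2>/dev/null || echo 'n/a')          # deprecated knob
  score_adj=$(cat "/proc/$p/oom_score_adj" 2>/dev/null || echo 'n/a')
  printf 'pid=%s oom_adj=%s oom_score_adj=%s\n' "$p" "$adj" "$score_adj"
done
```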

Thanks,
-Kame






> azurIt
>

2012-11-22 19:35:03

by Michal Hocko

Subject: Re: memory-cgroup bug

On Wed 21-11-12 20:02:07, azurIt wrote:
> Hi,
>
> i'm using memory cgroups for limiting our users and i'm having a really
> strange problem when a cgroup runs out of its memory limit. It's very
> strange because it happens only sometimes (about once per week, on a
> random user); out of memory is usually handled ok.

What is your memcg configuration? Do you use deeper hierarchies, is
use_hierarchy enabled? Is the memcg oom (aka memory.oom_control)
enabled? Do you use soft limit for those groups? Is memcg swap
accounting enabled and memsw limits in place?
Is the machine under global memory pressure as well?
Could you post sysrq+t or sysrq+w?

> This happens when the problem occurs:
> - no new processes can be started in this cgroup
> - existing processes are frozen and taking 100% of CPU
> - when i try to 'strace' any of the processes, the whole strace
> freezes until the process is killed (strace cannot be terminated by
> CTRL-C)
> - the problem can be resolved by raising the memory limit for the cgroup
> or by killing a few processes inside the cgroup so some memory is freed
>
> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0

Hmm what is this?

> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
> [<ffffffff815b53ff>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>

How many tasks are hung in mem_cgroup_handle_oom? If there were many
of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
make oom_lock 0 and 1 based rather than counter) and its follow up fix
23751be00940 (memcg: fix hierarchical oom locking) but you are saying
that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
make more sense.

> I'm currently using kernel 3.2.34 but i've been having this problem since 2.6.32.

I guess this is a clean vanilla (stable) kernel, right? Are you able to
reproduce with the latest Linus tree?

--
Michal Hocko
SUSE Labs



2012-11-22 21:42:59

by Michal Hocko

Subject: Re: memory-cgroup bug

On Thu 22-11-12 19:05:26, azurIt wrote:
[...]
> My cgroups hierarchy:
> /cgroups/<user_id>/uid/
>
> where '<user_id>' is the system user id and 'uid' is just the word 'uid'.
>
> Memory limits are set in /cgroups/<user_id>/ and use_hierarchy is
> enabled. Processes are inside /cgroups/<user_id>/uid/. I'm using
> hard limits for memory and swap BUT the system has no swap at all
> (it has 'only' 16 GB of real RAM). memory.oom_control is set to
> 'oom_kill_disable 0'. The server has enough free memory when the
> problem occurs.

OK, so the global reclaim shouldn't be active. This is definitely
good to know.

> >> This happens when the problem occurs:
> >> - no new processes can be started in this cgroup
> >> - existing processes are frozen and taking 100% of CPU
> >> - when i try to 'strace' any of the processes, the whole strace
> >> freezes until the process is killed (strace cannot be terminated by
> >> CTRL-C)
> >> - the problem can be resolved by raising the memory limit for the cgroup
> >> or by killing a few processes inside the cgroup so some memory is freed
> >>
> >> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> >> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> >> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> >
> >Hmm what is this?
>
> i really don't know, i will get the stacks of all frozen processes next
> time so we can compare them.
>
> >> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> >> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> >> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
> >> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
> >> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
> >> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
> >> [<ffffffff815b53ff>] page_fault+0x1f/0x30
> >> [<ffffffffffffffff>] 0xffffffffffffffff

Btw. is this stack stable or is the task bouncing in some loop?
And finally could you post the disassembly of your version of
mem_cgroup_handle_oom, please?

> >How many tasks are hung in mem_cgroup_handle_oom? If there were many
> >of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
> >make oom_lock 0 and 1 based rather than counter) and its follow up fix
> >23751be00940 (memcg: fix hierarchical oom locking) but you are saying
> >that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
> >make more sense.
>
>
> Usually a maximum of several tens of processes, but i will check it next
> time. I was having much worse problems with 2.6.32 - when the freezing
> happened, the whole server was affected (i wasn't able to do anything
> and had to wait until my scripts took care of it and killed apache,
> so i don't have any detailed info).

Hmm, maybe the issue fixed by 1d65f86d (mm: preallocate page before
lock_page() at filemap COW) which was merged in 3.1.

> In 3.2 only the target cgroup is affected.
>
> >> I'm currently using kernel 3.2.34 but i've been having this problem since 2.6.32.
> >
> >I guess this is a clean vanilla (stable) kernel, right? Are you able to
> >reproduce with the latest Linus tree?
>
>
> Well, no. I'm using, for example, the newest stable grsecurity patch.

That shouldn't be related

> I'm also using a few of Andrea Righi's cgroup subsystems but i don't
> believe these are causing the problems:
> - cgroup-uid, which moves processes into cgroups based on UID
> - cgroup-task, which can limit the number of tasks in a cgroup (i already
> tried disabling this one, it didn't help)
> http://www.develer.com/~arighi/linux/patches/

I am not familiar with those patches but I will double check.

> Unfortunately i cannot just install a new and untested kernel version
> because i'm not able to reproduce this problem on demand (it happens
> randomly in the production environment).

This will make it a bit harder to debug but let's see maybe the new
traces would help...

> Could it be that OOM cannot start and kill processes because there's
> no free memory left in the cgroup?

That shouldn't happen.

--
Michal Hocko
SUSE Labs

2012-11-22 21:45:33

by Michal Hocko

Subject: Re: memory-cgroup bug

On Thu 22-11-12 10:36:18, azurIt wrote:
[...]
> I can also look at the data of a 'frozen' process if you need it but i
> will have to wait until the problem occurs again.
>
> The main problem is that when this problem happens, it's NOT resolved
> automatically by kernel/OOM and the user of the cgroup where it happened
> has non-working services until i kill his processes by hand. I'm sure
> that all the 'frozen' processes are taking a lot of CPU because the
> server load also goes really high - next time i will make a screenshot
> of htop. I really wonder why OOM is __sometimes__ not resolving this
> (it usually is, only sometimes not).

What does your kernel log say while this is happening? Are there any
memcg OOM messages showing up?
--
Michal Hocko
SUSE Labs

2012-11-22 22:34:40

by azurIt

Subject: Re: memory-cgroup bug

>Btw. is this stack stable or is the task bouncing in some loop?


Not sure, will check it next time.



>And finally could you post the disassembly of your version of
>mem_cgroup_handle_oom, please?


How can i do this?



>What does your kernel log say while this is happening? Are there any
>memcg OOM messages showing up?


I will get the logs next time.


Thank you!

azur

2012-11-23 06:13:39

by azurIt

Subject: Re: memory-cgroup bug

______________________________________________________________
> From: "Kamezawa Hiroyuki" <[email protected]>
> To: azurIt <[email protected]>
> Date: 22.11.2012 01:27
> Subject: Re: memory-cgroup bug
>
> CC: [email protected], "linux-mm" <[email protected]>
>(2012/11/22 4:02), azurIt wrote:
>> Hi,
>>
>> i'm using memory cgroups for limiting our users and i'm having a really strange problem when a cgroup runs out of its memory limit. It's very strange because it happens only sometimes (about once per week, on a random user); out of memory is usually handled ok. This happens when the problem occurs:
>> - no new processes can be started in this cgroup
>> - existing processes are frozen and taking 100% of CPU
>> - when i try to 'strace' any of the processes, the whole strace freezes until the process is killed (strace cannot be terminated by CTRL-C)
>> - the problem can be resolved by raising the memory limit for the cgroup or by killing a few processes inside the cgroup so some memory is freed
>>
>> I also grabbed the content of /proc/<pid>/stack of a frozen process:
>> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
>> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
>> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
>> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
>> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
>> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
>> [<ffffffff815b53ff>] page_fault+0x1f/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> I'm currently using kernel 3.2.34 but i've been having this problem since 2.6.32.
>>
>> Any ideas? Thnx.
>>
>
>Under OOM in memcg, only one process is allowed to work, because processes tend to use up
>CPU during a memory shortage; other processes are frozen.
>
>
>Then, the problem here is the one process which uses CPU. IIUC, the 'frozen' threads are
>asleep and never use CPU. It's expected that the oom-killer or memory reclaim can solve the problem.
>
>What is your memcg's memory.oom_control value ?



oom_kill_disable 0



>and process's oom_adj values ? (/proc/<pid>/oom_adj, /proc/<pid>/oom_score_adj)


when i look at a random user's PID (Apache web server):
oom_adj = 0
oom_score_adj = 0

I can also look at the data of a 'frozen' process if you need it but i will have to wait until the problem occurs again.

The main problem is that when this problem happens, it's NOT resolved automatically by kernel/OOM and the user of the cgroup where it happened has non-working services until i kill his processes by hand. I'm sure that all the 'frozen' processes are taking a lot of CPU because the server load also goes really high - next time i will make a screenshot of htop. I really wonder why OOM is __sometimes__ not resolving this (it usually is, only sometimes not).


Thank you!

azur

2012-11-23 07:40:30

by Michal Hocko

Subject: Re: memory-cgroup bug

On Thu 22-11-12 23:34:34, azurIt wrote:
[...]
> >And finally could you post the disassembly of your version of
> >mem_cgroup_handle_oom, please?
>
> How can i do this?

Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
function.
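For the objdump route, an awk range pattern can carve out the single function (a sketch: the listing below is a fabricated stand-in so the filter can be shown end-to-end; on the real machine you would pipe `objdump -d vmlinux` into the same `extract` function):

```shell
# Print one function's disassembly: everything from its symbol line
# down to the next blank line.
extract() { awk "/<$1>:/,/^\$/"; }

# Fake two-function listing standing in for `objdump -d vmlinux`:
printf '%s\n' \
  'ffffffff8110a780 <mem_cgroup_handle_oom>:' \
  'ffffffff8110a780:  55   push %rbp' \
  '' \
  'ffffffff8110ab40 <some_other_function>:' \
  'ffffffff8110ab40:  c3   ret' \
| extract mem_cgroup_handle_oom
```

Only the mem_cgroup_handle_oom lines come out; some_other_function is filtered away.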
--
Michal Hocko
SUSE Labs

2012-11-23 09:21:44

by azurIt

Subject: Re: memory-cgroup bug

>Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>function.
If 'YOUR_VMLINUX' is supposed to be my kernel image:

# gdb vmlinuz-3.2.34-grsec-1
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
"/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized


# objdump -d vmlinuz-3.2.34-grsec-1
objdump: vmlinuz-3.2.34-grsec-1: File format not recognized


# file vmlinuz-3.2.34-grsec-1
vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA

I'm probably doing something wrong :)



Luckily, it happened again, so i have more info.

- there weren't any OOM log messages in the kernel for that cgroup
- there were 16 processes in the cgroup
- processes in the cgroup were together taking 100% of CPU (it was allowed to use only one core, so 100% of that core)
- memory.failcnt was growing fast
- oom_control:
oom_kill_disable 0
under_oom 0 (this was flipping between 0 and 1)
- limit_in_bytes was set to 157286400
- content of stat (as you can see, the whole memory limit was used):
cache 0
rss 0
mapped_file 0
pgpgin 0
pgpgout 0
swap 0
pgfault 0
pgmajfault 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 157286400
hierarchical_memsw_limit 157286400
total_cache 0
total_rss 157286400
total_mapped_file 0
total_pgpgin 10326454
total_pgpgout 10288054
total_swap 0
total_pgfault 12939677
total_pgmajfault 4283
total_inactive_anon 0
total_active_anon 157286400
total_inactive_file 0
total_active_file 0
total_unevictable 0
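The dump can be checked mechanically; a small sketch that reads such a memory.stat snapshot and confirms the group is pinned at its hard limit by anonymous memory alone (here fed a trimmed copy of the numbers above):

```shell
# Check a saved memory.stat: is the group sitting exactly at its hard
# limit with pure anonymous memory (nothing cached to reclaim, no swap),
# i.e. the situation where only the OOM killer can make progress?
check_stat() {
  awk '
    /^hierarchical_memory_limit/ { limit = $2 }
    /^total_rss /                { rss   = $2 }
    /^total_cache/               { cache = $2 }
    END {
      if (rss == limit && cache == 0)
        print "at hard limit, all anonymous: only OOM can make progress"
      else
        print "reclaimable memory remains"
    }'
}

# Trimmed copy of the dump above:
check_stat <<'EOF'
total_cache 0
total_rss 157286400
hierarchical_memory_limit 157286400
EOF
```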


i also grabbed oom_adj, oom_score_adj and the stack of all processes, here it is:
http://www.watchdog.sk/lkml/memcg-bug.tar

Notice that the stack is different for a few processes. The stacks of all processes were NOT changing over time and stayed the same.

Btw, don't know if it matters but i have several cgroup subsystems mounted and i'm also using them (i was not activating the freezer in this case; don't know if it can be activated automatically by the kernel or whatever; i didn't check whether the cgroup was frozen but i suppose it wasn't):
none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Thank you.

azur

2012-11-23 09:28:36

by Michal Hocko

Subject: Re: memory-cgroup bug

On Fri 23-11-12 10:21:37, azurIt wrote:
> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
> >function.
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>
> # gdb vmlinuz-3.2.34-grsec-1
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>
>
> # objdump -d vmlinuz-3.2.34-grsec-1

You need vmlinux not vmlinuz...
--
Michal Hocko
SUSE Labs

2012-11-23 09:35:18

by Glauber Costa

Subject: Re: memory-cgroup bug

On 11/23/2012 01:21 PM, azurIt wrote:
>> Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> function.
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>
> # gdb vmlinuz-3.2.34-grsec-1

this is vmlinuz, not vmlinux. This is the compressed image.

>
> # file vmlinuz-3.2.34-grsec-1
> vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA
>
> I'm probably doing something wrong :)

You need this:

[glauber@straightjacket linux-glommer]$ file vmlinux
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically
linked, BuildID[sha1]=0xba936ee6b6096f9bc4c663f2a2ee0c2d2481c408, not
stripped

instead of bzImage.

2012-11-23 09:44:28

by azurIt

Subject: Re: memory-cgroup bug

> CC: [email protected], [email protected], "cgroups mailinglist" <[email protected]>
>On Fri 23-11-12 10:21:37, azurIt wrote:
>> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> >function.
>> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>>
>> # gdb vmlinuz-3.2.34-grsec-1
>> GNU gdb (GDB) 7.0.1-debian
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>...
>> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>>
>>
>> # objdump -d vmlinuz-3.2.34-grsec-1
>
>You need vmlinux not vmlinuz...




ok, got it but still no luck:

# gdb vmlinux
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
(gdb) disassemble mem_cgroup_handle_oom
No symbol table is loaded. Use the "file" command.



# objdump -d vmlinux | grep mem_cgroup_handle_oom
<no output>


i can recompile the kernel if anything needs to be added into it.


azur

2012-11-23 10:04:47

by Michal Hocko

Subject: Re: memory-cgroup bug

On Fri 23-11-12 10:21:37, azurIt wrote:
[...]
> Luckily, it happened again, so i have more info.
>
> - there weren't any OOM log messages in the kernel for that cgroup
> - there were 16 processes in the cgroup
> - processes in the cgroup were together taking 100% of CPU (it
> was allowed to use only one core, so 100% of that core)
> - memory.failcnt was growing fast
> - oom_control:
> oom_kill_disable 0
> under_oom 0 (this was flipping between 0 and 1)

So there was an OOM going on but no messages in the log? Really strange.
Kame already asked about oom_score_adj of the processes in the group but
it didn't look like all the processes would have oom disabled, right?

> - limit_in_bytes was set to 157286400
> - content of stat (as you can see, the whole memory limit was used):
> cache 0
> rss 0

This looks like a top-level group for your user.

> mapped_file 0
> pgpgin 0
> pgpgout 0
> swap 0
> pgfault 0
> pgmajfault 0
> inactive_anon 0
> active_anon 0
> inactive_file 0
> active_file 0
> unevictable 0
> hierarchical_memory_limit 157286400
> hierarchical_memsw_limit 157286400
> total_cache 0
> total_rss 157286400

OK, so all the memory is anonymous and you have no swap so the oom is
the only thing to do.

> total_mapped_file 0
> total_pgpgin 10326454
> total_pgpgout 10288054
> total_swap 0
> total_pgfault 12939677
> total_pgmajfault 4283
> total_inactive_anon 0
> total_active_anon 157286400
> total_inactive_file 0
> total_active_file 0
> total_unevictable 0
>
>
> i also grabbed oom_adj, oom_score_adj and the stack of all processes,
> here it is:
> http://www.watchdog.sk/lkml/memcg-bug.tar

Hmm, all processes waiting for oom are stuck at the very same place:
$ grep mem_cgroup_handle_oom -r [0-9]*
30858/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30859/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30860/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30892/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30898/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
31588/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
32044/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
32358/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
6031/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
6534/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
7020/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0

We are taking memcg_oom_lock spinlock twice in that function + we can
schedule. As none of the tasks is scheduled this would suggest that you
are blocked at the first lock. But who got the lock then?
This is really strange.
Btw. is sysrq+t resp. sysrq+w showing the same traces as
/proc/<pid>/stack?

> Notice that the stack is different for a few processes.

Yes, the others are in VFS resp. ext3. ext3_write_begin looks a bit dangerous
but it grabs the page before it really starts a transaction.

> The stacks of all processes were NOT changing over time and stayed the same.

Could you take a few snapshots over time?

> Btw, don't know if it matters but i have several cgroup subsystems
> mounted and i'm also using them (i was not activating the freezer in this
> case; don't know if it can be activated automatically by the kernel or whatever;

No

> i didn't check whether the cgroup was frozen but i suppose it wasn't):
> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Do you see the same issue when only the memory controller is mounted (resp.
cpuset, which you seem to use as well, judging from your description)?

I know you said booting into a vanilla kernel would be problematic but
could you at least rule out the cgroup patches that you have mentioned?
If you need to move a task to a group based on its uid you can use the
cgrules daemon (libcgroup1 package) for that as well.
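For illustration, the rule file could look roughly like this (a sketch: the group name 'customers' and the '%U' uid placeholder are assumptions to be checked against the cgrules.conf(5) man page):

```
# /etc/cgrules.conf
# <user or @group>   <controllers>   <destination under the mount point>
@customers           memory          %U/uid/
```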
--
Michal Hocko
SUSE Labs

2012-11-23 10:10:41

by Michal Hocko

Subject: Re: memory-cgroup bug

On Fri 23-11-12 10:44:23, azurIt wrote:
[...]
> # gdb vmlinux
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
> (gdb) disassemble mem_cgroup_handle_oom
> No symbol table is loaded. Use the "file" command.
>
>
>
> # objdump -d vmlinux | grep mem_cgroup_handle_oom
> <no output>

Hmm, strange so the function is on the stack but it has been inlined?
Doesn't make much sense to me.

> i can recompile the kernel if anything needs to be added into it.

If you could instrument mem_cgroup_handle_oom with some printks (before
we take the memcg_oom_lock, before we schedule and into
mem_cgroup_out_of_memory)
--
Michal Hocko
SUSE Labs

2012-11-23 14:59:10

by azurIt

Subject: Re: memory-cgroup bug

>If you could instrument mem_cgroup_handle_oom with some printks (before
>we take the memcg_oom_lock, before we schedule and into
>mem_cgroup_out_of_memory)


If you send me a patch i can apply it. I'm, unfortunately, not able to code it myself.



>> Luckily, it happened again, so i have more info.
>>
>> - there weren't any OOM log messages in the kernel for that cgroup
>> - there were 16 processes in the cgroup
>> - processes in the cgroup were together taking 100% of CPU (it
>> was allowed to use only one core, so 100% of that core)
>> - memory.failcnt was growing fast
>> - oom_control:
>> oom_kill_disable 0
>> under_oom 0 (this was flipping between 0 and 1)
>
>So there was an OOM going on but no messages in the log? Really strange.
>Kame already asked about oom_score_adj of the processes in the group but
>it didn't look like all the processes would have oom disabled, right?


There were no messages saying that any processes had been killed because of OOM.


>> - limit_in_bytes was set to 157286400
>> - content of stat (as you can see, the whole memory limit was used):
>> cache 0
>> rss 0
>
>This looks like a top-level group for your user.


Yes, it was from /cgroup/<user-id>/


>> mapped_file 0
>> pgpgin 0
>> pgpgout 0
>> swap 0
>> pgfault 0
>> pgmajfault 0
>> inactive_anon 0
>> active_anon 0
>> inactive_file 0
>> active_file 0
>> unevictable 0
>> hierarchical_memory_limit 157286400
>> hierarchical_memsw_limit 157286400
>> total_cache 0
>> total_rss 157286400
>
>OK, so all the memory is anonymous and you have no swap so the oom is
>the only thing to do.


What would happen if the same situation occurred globally? No swap, every bit of memory used. Would the kernel be able to start the OOM killer? Maybe the same thing is happening in the cgroup - there's simply no space left to run the OOM killer. And maybe that's why it happens rarely - usually there are still at least a few KBs of memory left to start the OOM killer.


>Hmm, all processes waiting for oom are stuck at the very same place:
>$ grep mem_cgroup_handle_oom -r [0-9]*
>30858/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30859/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30860/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30892/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30898/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>31588/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>32044/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>32358/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>6031/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>6534/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>7020/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>
>We are taking memcg_oom_lock spinlock twice in that function + we can
>schedule. As none of the tasks is scheduled this would suggest that you
>are blocked at the first lock. But who got the lock then?
>This is really strange.
>Btw. is sysrq+t resp. sysrq+w showing the same traces as
>/proc/<pid>/stat?


Unfortunately I'm only connected to the servers remotely (SSH).


>> Notice that stack is different for few processes.
>
>Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous
>but it grabs the page before it really starts a transaction.


Maybe these processes were throttled by cgroup-blkio at the same time and are still holding the lock? So the problem occurs when the cgroup is low on memory and doing IO beyond its limits. Just guessing and thinking aloud.


>> Stacks of all processes were NOT changing and stayed the same.
>
>Could you take few snapshots over time?


Will do next time, but I can't keep services frozen for long or customers will get angry.


>> didn't check if the cgroup was frozen but I suppose it wasn't):
>> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
>
>Do you see the same issue if only memory controller was mounted (resp.
>cpuset which you seem to use as well from your description).


Uh, we are using all the mounted subsystems :( I will be able to umount only freezer and maybe blkio for some time. Will that help?


>I know you said booting into a vanilla kernel would be problematic but
>could you at least rule out the cgroup patches that you have mentioned?
>If you need to move a task to a group based by an uid you can use
>cgrules daemon (libcgroup1 package) for that as well.


We are using cgroup-uid because it's MUCH more effective and better. For example, I don't believe that cgroup-task will work with that daemon. What will happen if cgrules isn't able to add a process into a cgroup because of the task limit? The process will probably continue and run outside of any cgroup, which is wrong. With cgroup-task + cgroup-uid, such processes cannot even be started (and this is what we need).

2012-11-25 00:10:51

by azurIt

[permalink] [raw]
Subject: Re: memory-cgroup bug

>Could you take few snapshots over time?


Here it is, now from a different server; a snapshot was taken every second for 10 minutes (hope it's enough):
http://www.watchdog.sk/lkml/memcg-bug-2.tar.gz

2012-11-25 10:17:14

by Michal Hocko

[permalink] [raw]
Subject: Re: memory-cgroup bug

On Fri 23-11-12 15:59:04, azurIt wrote:
> >If you could instrument mem_cgroup_handle_oom with some printks (before
> >we take the memcg_oom_lock, before we schedule and into
> >mem_cgroup_out_of_memory)
>
>
> If you send me a patch I can apply it. I'm, unfortunately, not able to code it myself.

Inlined at the end of the email. Please note I have only compile-tested
it. It might produce a lot of output.

> >> It, luckily, happened again so i have more info.
> >>
> >> - there weren't any OOM logs in the kernel for that cgroup
> >> - there were 16 processes in the cgroup
> >> - processes in the cgroup were taking together 100% of CPU (it
> >> was allowed to use only one core, so 100% of that core)
> >> - memory.failcnt was growing fast
> >> - oom_control:
> >> oom_kill_disable 0
> >> under_oom 0 (this was looping from 0 to 1)
> >
> >So there was an OOM going on but no messages in the log? Really strange.
> >Kame already asked about oom_score_adj of the processes in the group but
> >it didn't look like all the processes would have oom disabled, right?
>
>
> There were no messages telling that some processes were killed because of OOM.

dmesg | grep "Out of memory"
doesn't tell anything, right?

> >> - limit_in_bytes was set to 157286400
> >> - content of stat (as you can see, the whole memory limit was used):
> >> cache 0
> >> rss 0
> >
> >This looks like a top-level group for your user.
>
>
> Yes, it was from /cgroup/<user-id>/
>
>
> >> mapped_file 0
> >> pgpgin 0
> >> pgpgout 0
> >> swap 0
> >> pgfault 0
> >> pgmajfault 0
> >> inactive_anon 0
> >> active_anon 0
> >> inactive_file 0
> >> active_file 0
> >> unevictable 0
> >> hierarchical_memory_limit 157286400
> >> hierarchical_memsw_limit 157286400
> >> total_cache 0
> >> total_rss 157286400
> >
> >OK, so all the memory is anonymous and you have no swap so the oom is
> >the only thing to do.
>
>
> What will happen if the same situation occurs globally? No swap, every
> bit of memory used. Will kernel be able to start OOM killer?

The OOM killer is not a task. It doesn't allocate any memory. It just
walks the process list and picks the task with the highest score. If the
global OOM killer is not able to find any such task (e.g. because all of
them have OOM disabled) then the system panics.

> Maybe the same thing is happening in cgroup

The memcg OOM differs only in that the system doesn't panic if there is
no suitable task to kill.

[...]
> >> Notice that stack is different for few processes.
> >
> >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous
> >but it grabs the page before it really starts a transaction.
>
>
> Maybe these processes were throttled by cgroup-blkio at the same time
> and are still holding the lock?

If you are thinking about memcg_oom_lock then this is not possible,
because that lock is held only for short periods. There is no other lock
that the memcg OOM path holds.

> So the problem occurs when the cgroup is low on memory and doing IO
> beyond its limits. Just guessing and thinking aloud.

The lockup (if this is what happens) still might be related to the IO
controller if the killed task cannot finish due to pending IO, though.

[...]
> >> didn't check if the cgroup was frozen but I suppose it wasn't):
> >> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
> >
> >Do you see the same issue if only memory controller was mounted (resp.
> >cpuset which you seem to use as well from your description).
>
>
> Uh, we are using all mounted subsystems :( I will be able to umount
> only freezer and maybe blkio for some time. Will it help?

Not sure about that without further data.

> >I know you said booting into a vanilla kernel would be problematic but
> >could you at least rule out the cgroup patches that you have mentioned?
> >If you need to move a task to a group based by an uid you can use
> >cgrules daemon (libcgroup1 package) for that as well.
>
>
> We are using cgroup-uid because it's MUCH more effective and better.
> For example, I don't believe that cgroup-task will work with that
> daemon. What will happen if cgrules isn't able to add a process into a
> cgroup because of the task limit? The process will probably continue
> and run outside of any cgroup, which is wrong. With cgroup-task +
> cgroup-uid, such processes cannot even be started (and this is what we
> need).

I am not familiar with cgroup-task controller so I cannot comment on
that.

---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..7f26ec8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1863,6 +1863,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
{
struct oom_wait_info owait;
bool locked, need_to_kill;
+ int ret = false;

owait.mem = memcg;
owait.wait.flags = 0;
@@ -1873,6 +1874,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
mem_cgroup_mark_under_oom(memcg);

/* At first, try to OOM lock hierarchy under memcg.*/
+ printk("XXX: %d waiting for memcg_oom_lock\n", current->pid);
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
/*
@@ -1887,12 +1889,14 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
mem_cgroup_oom_notify(memcg);
spin_unlock(&memcg_oom_lock);

+ printk("XXX: %d need_to_kill:%d locked:%d\n", current->pid, need_to_kill, locked);
if (need_to_kill) {
finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask);
} else {
schedule();
finish_wait(&memcg_oom_waitq, &owait.wait);
+ printk("XXX: %d woken up\n", current->pid);
}
spin_lock(&memcg_oom_lock);
if (locked)
@@ -1903,10 +1907,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
mem_cgroup_unmark_under_oom(memcg);

if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
- return false;
+ goto out;
/* Give chance to dying process */
schedule_timeout_uninterruptible(1);
- return true;
+ ret = true;
+out:
+ printk("XXX: %d done with %d\n", current->pid, ret);
+ return ret;
}

/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..a7db813 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -568,6 +568,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
*/
if (fatal_signal_pending(current)) {
set_thread_flag(TIF_MEMDIE);
+ printk("XXX: %d skipping task with fatal signal pending\n", current->pid);
return;
}

@@ -576,8 +577,10 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
read_lock(&tasklist_lock);
retry:
p = select_bad_process(&points, limit, mem, NULL);
- if (!p || PTR_ERR(p) == -1UL)
+ if (!p || PTR_ERR(p) == -1UL) {
+ printk("XXX: %d nothing to kill\n", current->pid);
goto out;
+ }

if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
"Memory cgroup out of memory"))

--
Michal Hocko
SUSE Labs

2012-11-25 12:05:30

by Michal Hocko

[permalink] [raw]
Subject: Re: memory-cgroup bug

[Adding Kamezawa into CC]

On Sun 25-11-12 01:10:47, azurIt wrote:
> >Could you take few snapshots over time?
>
>
> Here it is, now from different server, snapshot was taken every second
> for 10 minutes (hope it's enough):
> http://www.watchdog.sk/lkml/memcg-bug-2.tar.gz

Hmm, interesting:
$ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
min:16281 max:224048 avg:18818.943119

So there are a lot of allocation attempts failing, every second!
Will get to that later.

The number of tasks in the group is stable (20):
$ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
546 20

And no task has been killed or spawned:
$ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
24495
24762
24774
24796
24798
24805
24813
24827
24831
24841
24842
24863
24892
24924
24931
25130
25131
25192
25193
25243

$ for stack in [0-9]*/[0-9]*
do
head -n1 $stack/stack
done | sort | uniq -c
9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
546 [<ffffffff811109b8>] do_truncate+0x58/0xa0
533 [<ffffffffffffffff>] 0xffffffffffffffff

Tells us that the stacks are pretty much stable.
$ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
546 24495

So 24495 is stuck in do_truncate
[<ffffffff811109b8>] do_truncate+0x58/0xa0
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

I suspect it is waiting for i_mutex. Who is holding that lock?
The other tasks blocked in mem_cgroup_handle_oom are coming either from
the page fault path, so i_mutex can be excluded for them, or from
vfs_write (24796), and that one is interesting:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes &inode->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This smells like a deadlock, though a strange one. The rapidly
increasing failcnt suggests that somebody still tries to allocate, but
who, when all of them hang in mem_cgroup_handle_oom? This can be
explained though.
The memcg OOM killer lets only one process (the one which manages to
lock the hierarchy via mem_cgroup_oom_lock) call
mem_cgroup_out_of_memory and kill a process, while the others wait on
the wait queue. Once the killer is done it calls memcg_wakeup_oom, which
wakes up the other tasks waiting on the queue. Those retry the charge in
the hope that some memory was freed in the meantime; that hasn't
happened, so they get into OOM again (and again and again).
This all usually works out, except in this particular case I would bet
my hat that the OOM-selected task is pid 24495, which is blocked on the
mutex held by one of the tasks in the OOM path, so it cannot finish and
thus cannot free any memory.

It seems that the current Linus' tree is affected as well.

I will have to think about a solution but it sounds really tricky. It is
not just ext3 that is affected.

I guess we need to tell mem_cgroup_cache_charge that it should never
reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
me. On the other hand it is really weird that an excessive writer might
trigger a memcg OOM killer.
--
Michal Hocko
SUSE Labs

2012-11-25 12:36:07

by azurIt

[permalink] [raw]
Subject: Re: memory-cgroup bug

>So there are a lot of allocation attempts failing, every second!


Yes, as I said, the cgroup was taking 100% of its (allocated) CPU core(s). Not sure if all processes were using the CPU, but a _few_ of them (not only one) were for sure.

2012-11-25 12:39:57

by azurIt

[permalink] [raw]
Subject: Re: memory-cgroup bug

>Inlined at the end of the email. Please note I have compile tested
>it. It might produce a lot of output.


Thank you very much, I will install it ASAP (probably tonight).


>dmesg | grep "Out of memory"
>doesn't tell anything, right?


Only messages for other cgroups, but none for the frozen one (neither before nor after the freeze).


azur

2012-11-25 13:02:13

by Michal Hocko

[permalink] [raw]
Subject: Re: memory-cgroup bug

On Sun 25-11-12 13:39:53, azurIt wrote:
> >Inlined at the end of the email. Please note I have compile tested
> >it. It might produce a lot of output.
>
>
> Thank you very much, i will install it ASAP (probably this night).

Please don't. If my analysis is correct, and I am almost 100% sure it
is, then it would cause excessive logging. I am sorry I cannot come up
with something else in the meantime.
--
Michal Hocko
SUSE Labs

2012-11-25 13:27:13

by azurIt

[permalink] [raw]
Subject: Re: memory-cgroup bug

>> Thank you very much, i will install it ASAP (probably this night).
>
>Please don't. If my analysis is correct which I am almost 100% sure it
>is then it would cause excessive logging. I am sorry I cannot come up
>with something else in the mean time.


Ok then. I will, meanwhile, try to contact Andrea Righi (author of cgroup-task etc.) and ask for his opinion on the relation between the freezes and his patches. Maybe it's some kind of bug in memcg which doesn't appear in current vanilla code and is triggered by conditions created by, for example, cgroup-task. I noticed that the number of frozen processes always exactly matches the task limit set by cgroup-task (I already tried to raise this limit AFTER the cgroup froze; it didn't change anything). I'm sure it's not a problem with cgroup-task alone; it's definitely also related to memcg (but maybe the combination of both is needed).

Thank you so far for your time!

azur

2012-11-25 13:44:44

by Michal Hocko

[permalink] [raw]
Subject: Re: memory-cgroup bug

On Sun 25-11-12 14:27:09, azurIt wrote:
> >> Thank you very much, i will install it ASAP (probably this night).
> >
> >Please don't. If my analysis is correct which I am almost 100% sure it
> >is then it would cause excessive logging. I am sorry I cannot come up
> >with something else in the mean time.
>
>
> Ok then. I will, meanwhile, try to contact Andrea Righi (author of
> cgroup-task etc.) and ask him to send here his opinion about relation
> between freezes and his patches.

As I described in the other email, this seems to be a deadlock in the
memcg OOM path, so I do not think the other patches influence it.

--
Michal Hocko
SUSE Labs

2012-11-25 13:55:46

by Michal Hocko

[permalink] [raw]
Subject: Re: memory-cgroup bug

On Sun 25-11-12 13:05:24, Michal Hocko wrote:
> [Adding Kamezawa into CC]
>
> On Sun 25-11-12 01:10:47, azurIt wrote:
> > >Could you take few snapshots over time?
> >
> >
> > Here it is, now from different server, snapshot was taken every second
> > for 10 minutes (hope it's enough):
> > http://www.watchdog.sk/lkml/memcg-bug-2.tar.gz
[...]
> I guess we need to tell mem_cgroup_cache_charge that it should never
> reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
> me. On the other hand it is really weird that an excessive writer might
> trigger a memcg OOM killer.

This is hackish but it should help you in this case. Kamezawa, what do
you think about that? Should we generalize this and prepare something
like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
automatically and use the function whenever we are in a locked context?
To be honest I do not like this very much but nothing more sensible
(without touching non-memcg paths) comes to my mind.
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..da50c83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(PageSwapBacked(page));

error = mem_cgroup_cache_charge(page, current->mm,
- gfp_mask & GFP_RECLAIM_MASK);
+ (gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK);
if (error)
goto out;

--
Michal Hocko
SUSE Labs

2012-11-26 00:39:00

by azurIt

[permalink] [raw]
Subject: Re: memory-cgroup bug

>This is hackish but it should help you in this case. Kamezawa, what do
>you think about that? Should we generalize this and prepare something
>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
>automatically and use the function whenever we are in a locked context?
>To be honest I do not like this very much but nothing more sensible
>(without touching non-memcg paths) comes to my mind.


I installed the kernel with this patch; I will report back if the problem occurs again OR in a few weeks if everything is ok. Thank you!

Btw, will this patch be backported to 3.2?

azur

2012-11-26 07:57:11

by Michal Hocko

[permalink] [raw]
Subject: Re: memory-cgroup bug

On Mon 26-11-12 01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
>
>
> I installed kernel with this patch, will report back if problem occurs
> again OR in few weeks if everything will be ok. Thank you!

Thanks!

> Btw, will this patch be backported to 3.2?

Once we agree on a proper solution it will be backported to the stable
trees.

--
Michal Hocko
SUSE Labs

2012-11-26 13:18:41

by Michal Hocko

[permalink] [raw]
Subject: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

[CCing also Johannes - the thread started here:
https://lkml.org/lkml/2012/11/21/497]

On Mon 26-11-12 01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
>
>
> I installed kernel with this patch, will report back if problem occurs
> again OR in few weeks if everything will be ok. Thank you!

Now that I am looking at the patch closer, it will not work because it
depends on another patch which is not merged yet, and even that one
wouldn't help on its own because __GFP_NORETRY doesn't break the charge
loop. Sorry I missed that...

The patch below should help though (it is based on top of the current
-mm tree, but I will send a backport to 3.2 in a reply as well).
---
>From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though, because the administrator can still
intervene and increase the limit on the group, which helps the writer
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask,
which then tells mem_cgroup_charge_common that OOM is not allowed for
the charge. Forbidding OOM from this path, besides fixing the bug, also
makes some sense, as we really do not want to cause an OOM because of
page cache usage.
As a possibly visible result, add_to_page_cache_lru might fail more
often with ENOMEM, but this is to be expected if the limit is set and it
is preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg-specific flag because it has
already been used to prevent OOM (since the not-merged-yet "memcg:
reclaim when more than one page needed"). The only difference is that
the flag doesn't prevent reclaim anymore, which kind of makes sense
because the global memory allocator triggers reclaim as well. Retrying
without any reclaim on __GFP_NORETRY wouldn't make much sense anyway,
because that would effectively be a busy loop with OOM allowed in this
path.

Reported-by: azurIt <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/gfp.h | 3 +++
include/linux/memcontrol.h | 12 ++++++++++++
mm/filemap.c | 8 +++++++-
mm/memcontrol.c | 5 +----
4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 10e667f..aac9b21 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -152,6 +152,9 @@ struct vm_area_struct;
/* 4GB DMA on some platforms */
#define GFP_DMA32 __GFP_DMA32

+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM __GFP_NORETRY
+
/* Convert GFP flags to their corresponding migrate type */
static inline int allocflags_to_migratetype(gfp_t gfp_flags)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..1ad4bc6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);

+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);

@@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
return 0;
}

+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ return 0;
+}
+
static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef14351 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

- error = mem_cgroup_cache_charge(page, current->mm,
+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that normally
+ * because we might be called from a locked context and that
+ * could lead to deadlocks if the killed process is waiting for
+ * the same lock.
+ */
+ error = mem_cgroup_cache_charge_no_oom(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..b4754ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (!(gfp_mask & __GFP_WAIT))
return CHARGE_WOULDBLOCK;

- if (gfp_mask & __GFP_NORETRY)
- return CHARGE_NOMEM;
-
ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
return CHARGE_RETRY;
@@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
{
struct mem_cgroup *memcg = NULL;
unsigned int nr_pages = 1;
- bool oom = true;
+ bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
int ret;

if (PageTransHuge(page)) {
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2012-11-26 13:21:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

Here we go with the patch for 3.2.34. Could you test with this one,
please?
---
>From 0d2d915c16f93918051b7ab8039d30b5a922049c Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg OOM killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page, the
allocation hits the memcg hard limit, there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache pages
have been reclaimed already), and the process selected by the memcg OOM
killer is blocked on the i_mutex of the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper
function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which
then tells mem_cgroup_charge_common that OOM is not allowed for the
charge. Forbidding OOM from this path, apart from fixing the bug, also makes
some sense as we really do not want to cause an OOM because of page cache
usage.
As a possibly visible result, add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected if the limit is set, and it is
preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg specific flag because no user
accounted allocations use this flag except for THP, which has memcg oom
disabled already.

Reported-by: azurIt <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/gfp.h | 3 +++
include/linux/memcontrol.h | 13 +++++++++++++
mm/filemap.c | 8 +++++++-
mm/memcontrol.c | 2 +-
4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
/* 4GB DMA on some platforms */
#define GFP_DMA32 __GFP_DMA32

+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM __GFP_NORETRY
+
/* Convert GFP flags to their corresponding migrate type */
static inline int allocflags_to_migratetype(gfp_t gfp_flags)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
return 0;
}

+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ return 0;
+}
+
static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

- error = mem_cgroup_cache_charge(page, current->mm,
+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that normally
+ * because we might be called from a locked context and that
+ * could lead to deadlocks if the killed process is waiting for
+ * the same lock.
+ */
+ error = mem_cgroup_cache_charge_no_oom(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..1dbbe7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2703,7 +2703,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
struct mem_cgroup *memcg = NULL;
unsigned int nr_pages = 1;
struct page_cgroup *pc;
- bool oom = true;
+ bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
int ret;

if (PageTransHuge(page)) {
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2012-11-26 17:46:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> [CCing also Johannes - the thread started here:
> https://lkml.org/lkml/2012/11/21/497]
>
> On Mon 26-11-12 01:38:55, azurIt wrote:
> > >This is hackish but it should help you in this case. Kamezawa, what do
> > >you think about that? Should we generalize this and prepare something
> > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> > >automatically and use the function whenever we are in a locked context?
> > >To be honest I do not like this very much but nothing more sensible
> > >(without touching non-memcg paths) comes to my mind.
> >
> >
> > I installed kernel with this patch, will report back if problem occurs
> > again OR in few weeks if everything will be ok. Thank you!
>
> Now that I am looking at the patch closer, it will not work because it
> depends on another patch which is not merged yet, and even that one would
> not help on its own because __GFP_NORETRY doesn't break the charge loop.
> Sorry, I have missed that...
>
> The patch below should help though (it is based on top of the current
> -mm tree, but I will send a backport to 3.2 in the reply as well).
> ---
> >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 26 Nov 2012 11:47:57 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>
> memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents other task to
> terminate because it is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncate it).
>
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff

So process B manages to lock the hierarchy, calls
mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
for task A to die. All while it holds the i_mutex, preventing task A
from dying, right?

I think global oom already handles this in a much better way: invoke
the OOM killer, sleep for a second, then return to userspace to
relinquish all kernel resources and locks. The only reason why we
can't simply change from an endless retry loop is because we don't
want to return VM_FAULT_OOM and invoke the global OOM killer. But
maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and just
restart the pagefault. Return -ENOMEM to the buffered IO syscall
respectively. This way, the memcg OOM killer is invoked as it should
but nobody gets stuck anywhere livelocking with the exiting task.

2012-11-26 18:04:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> > [CCing also Johannes - the thread started here:
> > https://lkml.org/lkml/2012/11/21/497]
> > [...]
>
> So process B manages to lock the hierarchy, calls
> mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
> for task A to die. All while it holds the i_mutex, preventing task A
> from dying, right?

Right.

> I think global oom already handles this in a much better way: invoke
> the OOM killer, sleep for a second, then return to userspace to
> relinquish all kernel resources and locks. The only reason why we
> can't simply change from an endless retry loop is because we don't
> want to return VM_FAULT_OOM and invoke the global OOM killer.

Exactly.

> But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> just restart the pagefault. Return -ENOMEM to the buffered IO syscall
> respectively. This way, the memcg OOM killer is invoked as it should
> but nobody gets stuck anywhere livelocking with the exiting task.

Hmm, we would still have a problem with oom disabled (aka a user space OOM
killer), right? All processes except those in mem_cgroup_handle_oom are
at risk of being killed.
Another POV might be: why should we trigger the OOM killer from those paths
in the first place? Write or read (or even readahead) are all calls that
should rather fail than cause an OOM kill, in my opinion.
--
Michal Hocko
SUSE Labs

2012-11-26 18:24:33

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> > > [...]
> >
> > So process B manages to lock the hierarchy, calls
> > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
> > for task A to die. All while it holds the i_mutex, preventing task A
> > from dying, right?
>
> Right.
>
> > I think global oom already handles this in a much better way: invoke
> > the OOM killer, sleep for a second, then return to userspace to
> > relinquish all kernel resources and locks. The only reason why we
> > can't simply change from an endless retry loop is because we don't
> > want to return VM_FAULT_OOM and invoke the global OOM killer.
>
> Exactly.
>
> > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > just restart the pagefault. Return -ENOMEM to the buffered IO syscall
> > respectively. This way, the memcg OOM killer is invoked as it should
> > but nobody gets stuck anywhere livelocking with the exiting task.
>
> Hmm, we would still have a problem with oom disabled (aka user space OOM
> killer), right? All processes but those in mem_cgroup_handle_oom are
> risky to be killed.

Could we still let everybody get stuck in there when the OOM killer is
disabled and let userspace take care of it?

> Other POV might be, why we should trigger an OOM killer from those paths
> in the first place. Write or read (or even readahead) are all calls that
> should rather fail than cause an OOM killer in my opinion.

Readahead is arguable, but we kill globally for read() and write() and
I think we should do the same for memcg.

The OOM killer is there to resolve a problem that comes from
overcommitting the machine but the overuse does not have to be from
the application that pushes the machine over the edge, that's why we
don't just kill the allocating task but actually go look for the best
candidate. If you have one memory hog that overuses the resources,
attempted memory consumption in a different program should invoke the
OOM killer. It does not matter if this is a page fault (would still
happen with your patch) or a buffered read/write (would no longer
happen).

2012-11-26 19:03:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
[...]
> > > I think global oom already handles this in a much better way: invoke
> > > the OOM killer, sleep for a second, then return to userspace to
> > > relinquish all kernel resources and locks. The only reason why we
> > > can't simply change from an endless retry loop is because we don't
> > > want to return VM_FAULT_OOM and invoke the global OOM killer.
> >
> > Exactly.
> >
> > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall
> > > respectively. This way, the memcg OOM killer is invoked as it should
> > > but nobody gets stuck anywhere livelocking with the exiting task.
> >
> > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > killer), right? All processes but those in mem_cgroup_handle_oom are
> > risky to be killed.
>
> Could we still let everybody get stuck in there when the OOM killer is
> disabled and let userspace take care of it?

I am not sure what exactly you mean by "userspace take care of it", but
if those processes are stuck and holding the lock then it is usually
hard to find that out. Well, if somebody is familiar with the internals
then it is doable, but this makes the interface really unusable for
regular usage.

> > Other POV might be, why we should trigger an OOM killer from those paths
> > in the first place. Write or read (or even readahead) are all calls that
> > should rather fail than cause an OOM killer in my opinion.
>
> Readahead is arguable, but we kill globally for read() and write() and
> I think we should do the same for memcg.

Fair point, but the global case is a little bit easier than memcg here
because nobody can hook into the OOM killer and provide a userspace
implementation for it, which is one of the cooler features of memcg...
I am open to any suggestions, but we should somehow fix this (and
backport it to stable trees, as it has been there for quite some time and
the current report shows that the problem is not that hard to trigger).

> The OOM killer is there to resolve a problem that comes from
> overcommitting the machine but the overuse does not have to be from
> the application that pushes the machine over the edge, that's why we
> don't just kill the allocating task but actually go look for the best
> candidate. If you have one memory hog that overuses the resources,
> attempted memory consumption in a different program should invoke the
> OOM killer.

> It does not matter if this is a page fault (would still happen with
> your patch) or a buffered read/write (would no longer happen).

True, and it is sad that mmap then behaves slightly differently than
read/write, which I should have mentioned in the changelog. As I said, I am
open to other suggestions.

Thanks
--
Michal Hocko
SUSE Labs

2012-11-26 19:29:53

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote:
> On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> [...]
> > > > I think global oom already handles this in a much better way: invoke
> > > > the OOM killer, sleep for a second, then return to userspace to
> > > > relinquish all kernel resources and locks. The only reason why we
> > > > can't simply change from an endless retry loop is because we don't
> > > > want to return VM_FAULT_OOM and invoke the global OOM killer.
> > >
> > > Exactly.
> > >
> > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall
> > > > respectively. This way, the memcg OOM killer is invoked as it should
> > > > but nobody gets stuck anywhere livelocking with the exiting task.
> > >
> > > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > > killer), right? All processes but those in mem_cgroup_handle_oom are
> > > risky to be killed.
> >
> > Could we still let everybody get stuck in there when the OOM killer is
> > disabled and let userspace take care of it?
>
> I am not sure what exactly you mean by "userspace take care of it" but
> if those processes are stuck and holding the lock then it is usually
> hard to find that out. Well if somebody is familiar with internal then
> it is doable but this makes the interface really unusable for regular
> usage.

If oom_kill_disable is set, then all processes get stuck all the way
down in the charge stack. Whatever resource they pin, you may
deadlock on if you try to touch it while handling the problem from
userspace. I don't see how this is a new problem...? Or do you mean
something else?

> > > Other POV might be, why we should trigger an OOM killer from those paths
> > > in the first place. Write or read (or even readahead) are all calls that
> > > should rather fail than cause an OOM killer in my opinion.
> >
> > Readahead is arguable, but we kill globally for read() and write() and
> > I think we should do the same for memcg.
>
> Fair point but the global case is little bit easier than memcg in this
> case because nobody can hook on OOM killer and provide a userspace
> implementation for it which is one of the cooler feature of memcg...
> I am all open to any suggestions but we should somehow fix this (and
> backport it to stable trees as this is there for quite some time. The
> current report shows that the problem is not that hard to trigger).

As per above, the userspace OOM handling is risky as hell anyway.
What happens when an anonymous fault waits in memcg userspace OOM
while holding the mmap_sem, and a writer lines up behind it? Your
userspace OOM handler had better not look at any of the /proc files of
the stuck task that require the mmap_sem.

At the same token, it probably shouldn't touch the same files a memcg
task is stuck trying to read/write.

2012-11-26 20:08:56

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon 26-11-12 14:29:41, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote:
> > On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> > [...]
> > > > [...]
> > > >
> > > > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > > > killer), right? All processes but those in mem_cgroup_handle_oom are
> > > > risky to be killed.
> > >
> > > Could we still let everybody get stuck in there when the OOM killer is
> > > disabled and let userspace take care of it?
> >
> > I am not sure what exactly you mean by "userspace take care of it" but
> > if those processes are stuck and holding the lock then it is usually
> > hard to find that out. Well if somebody is familiar with internal then
> > it is doable but this makes the interface really unusable for regular
> > usage.
>
> If oom_kill_disable is set, then all processes get stuck all the way
> down in the charge stack. Whatever resource they pin, you may
> deadlock on if you try to touch it while handling the problem from
> userspace.

OK, I guess I am getting what you are trying to say. So what you are
suggesting is to just let mem_cgroup_out_of_memory send the signal and
move on without retrying (or with a few charge retries without further OOM
killing), and fail the charge with your new FAULT_OOM_HANDLED (or
something like FAULT_RETRY) error code, or with ENOMEM, depending on the
caller. The OOM disabled case would be "you are on your own" because it
has been dangerous anyway. Correct?
I do agree that the current endless retry loop is far from being ideal
and can see some updates but I am quite nervous about any potential
regressions in this area (e.g. too aggressive OOM etc...). I have to
think about it some more.
Anyway if you have some more specific ideas I would be happy to review
patches.

[...]

Thanks
--
Michal Hocko
SUSE Labs

2012-11-26 20:19:28

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote:
> On Mon 26-11-12 14:29:41, Johannes Weiner wrote:
> > On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote:
> > > On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> > > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> > > [...]
> > > > > [...]
> > > > >
> > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > > > > killer), right? All processes but those in mem_cgroup_handle_oom are
> > > > > risky to be killed.
> > > >
> > > > Could we still let everybody get stuck in there when the OOM killer is
> > > > disabled and let userspace take care of it?
> > >
> > > I am not sure what exactly you mean by "userspace take care of it" but
> > > if those processes are stuck and holding the lock then it is usually
> > > hard to find that out. Well if somebody is familiar with internal then
> > > it is doable but this makes the interface really unusable for regular
> > > usage.
> >
> > If oom_kill_disable is set, then all processes get stuck all the way
> > down in the charge stack. Whatever resource they pin, you may
> > deadlock on if you try to touch it while handling the problem from
> > userspace.
>
> OK, I guess I am getting what you are trying to say. So what you are
> suggesting is to just let mem_cgroup_out_of_memory send the signal and
> move on without retry (or with few charge retries without further OOM
> killing) and fail the charge with your new FAULT_OOM_HANDLED (resp.
> something like FAULT_RETRY) error code resp. ENOMEM depending on the
> caller. OOM disabled case would be "you are on your own" because this
> has been dangerous anyway. Correct?

Yes.

> I do agree that the current endless retry loop is far from being ideal
> and can see some updates but I am quite nervous about any potential
> regressions in this area (e.g. too aggressive OOM etc...). I have to
> think about it some more.

Agreed on all points. Maybe we can keep a couple of the oom retry
iterations or something like that, which is still much more than what
global does and I don't think the global OOM killer is overly eager.

Testing will show more.

> Anyway if you have some more specific ideas I would be happy to review
> patches.

Okay, I just wanted to check back with you before going down this
path. What are we going to do short term, though? Do you want to
push the disable-oom-for-pagecache for now or should we put the
VM_FAULT_OOM_HANDLED fix in the next version and do stable backports?

This issue has been around for a while so frankly I don't think it's
urgent enough to rush things.

2012-11-26 20:46:43

by azurIt

Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

>This issue has been around for a while so frankly I don't think it's
>urgent enough to rush things.


Well, it's quite urgent at least for us :( I haven't reported this so far because I wasn't sure it was a kernel thing. I will be really happy and thankful if a fix for this can go into 3.2 in the near future. Thank you very much!

azur

2012-11-26 20:53:59

by Johannes Weiner

Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon, Nov 26, 2012 at 09:46:38PM +0100, azurIt wrote:
> >This issue has been around for a while so frankly I don't think it's
> >urgent enough to rush things.
>
>
> Well, it's quite urgent at least for us :( I haven't reported this so
> far because I wasn't sure it was a kernel thing. I will be really happy
> and thankful if a fix for this can go into 3.2 in the near
> future. Thank you very much!

I understand and of course it's important that we get it fixed as soon
as possible. All I meant was that this problem has not exactly been
introduced in 3.7 and the fix is non-trivial so we should not be
rushing a change like this into 3.7 just days before its release.

2012-11-26 21:28:30

by azurIt

Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>Here we go with the patch for 3.2.34. Could you test with this one,
>please?

Michal, regarding your conversation with Johannes Weiner, should I try this patch or not?

azur

2012-11-26 22:06:45

by Michal Hocko

Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Mon 26-11-12 15:19:18, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote:
[...]
> > OK, I guess I am getting what you are trying to say. So what you are
> > suggesting is to just let mem_cgroup_out_of_memory send the signal and
> > move on without retrying (or with a few charge retries without further
> > OOM killing) and fail the charge with your new FAULT_OOM_HANDLED (or
> > something like FAULT_RETRY) error code, or ENOMEM, depending on the
> > caller. The OOM disabled case would be "you are on your own" because
> > that has been dangerous anyway. Correct?
>
> Yes.
>
> > I do agree that the current endless retry loop is far from being ideal
> > and can see some updates but I am quite nervous about any potential
> > regressions in this area (e.g. too aggressive OOM etc...). I have to
> > think about it some more.
>
> Agreed on all points. Maybe we can keep a couple of the oom retry
> iterations or something like that, which is still much more than what
> global does and I don't think the global OOM killer is overly eager.

Yes, we can offer less blood and more comfort.

>
> Testing will show more.
>
> > Anyway if you have some more specific ideas I would be happy to review
> > patches.
>
> Okay, I just wanted to check back with you before going down this
> path. What are we going to do short term, though? Do you want to
> push the disable-oom-for-pagecache for now or should we put the
> VM_FAULT_OOM_HANDLED fix in the next version and do stable backports?
>
> This issue has been around for a while so frankly I don't think it's
> urgent enough to rush things.

Yes, but now we have a real use case where this hurts, AFAIU. Unless
we come up with a fix or a reasonable workaround I would rather go with
something simpler for starters and something more sophisticated later.

I have to double check other places where we do charging, but the last
time I checked we don't hold page locks on already visible pages (we
do precharge in __do_fault, for example), mmap_sem taken for reading in
the page fault path is also safe (with oom enabled), and I guess that
tmpfs is ok as well. That leaves the page cache, which should be covered
by my patch. So we should be fine.

But I like your idea long term.

Thanks!
--
Michal Hocko
SUSE Labs

2012-11-27 00:06:12

by Kamezawa Hiroyuki

Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

(2012/11/26 22:18), Michal Hocko wrote:
> [CCing also Johannes - the thread started here:
> https://lkml.org/lkml/2012/11/21/497]
>
> On Mon 26-11-12 01:38:55, azurIt wrote:
>>> This is hackish but it should help you in this case. Kamezawa, what do
>>> you think about that? Should we generalize this and prepare something
>>> like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
>>> automatically and use the function whenever we are in a locked context?
>>> To be honest I do not like this very much but nothing more sensible
>>> (without touching non-memcg paths) comes to my mind.
>>
>>
>> I installed a kernel with this patch, will report back if the problem
>> occurs again OR in a few weeks if everything is ok. Thank you!
>
> Now that I am looking at the patch closer it will not work because it
> depends on another patch which is not merged yet, and even that one
> wouldn't help on its own because __GFP_NORETRY doesn't break the charge
> loop. Sorry I have missed that...
>
> The patch below should help though. (it is based on top of the current
> -mm tree but I will send a backport to 3.2 in the reply as well)
> ---
> From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 26 Nov 2012 11:47:57 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>
> memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents another task from
> terminating because that task is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or the swap limit is hit as well and all cache
> pages have been reclaimed already) and the process selected by the memcg
> OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).
>
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> This is not a hard deadlock though because administrator can still
> intervene and increase the limit on the group which helps the writer to
> finish the allocation and release the lock.
>
> This patch heals the problem by forbidding OOM from page cache charges
> (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
> helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask,
> which then tells mem_cgroup_charge_common that OOM is not allowed for
> the charge. Forbidding OOM from this path, apart from fixing the bug,
> also makes some sense as we really do not want to cause an OOM because
> of page cache usage.
> As a possibly visible result add_to_page_cache_lru might fail more often
> with ENOMEM, but this is to be expected if the limit is set and it is
> preferable to the OOM killer IMO.
>
> __GFP_NORETRY is abused for this memcg specific flag because it has
> been used to prevent OOM already (since the not-merged-yet "memcg:
> reclaim when more than one page needed"). The only difference is that
> the flag doesn't prevent reclaim anymore, which kind of makes sense
> because the global memory allocator triggers reclaim as well. Retrying
> without any reclaim on __GFP_NORETRY wouldn't make much sense anyway
> because it would effectively be a busy loop with OOM allowed in this
> path.
>
> Reported-by: azurIt <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

As a short term fix, I think this patch will work well enough and seems simple enough.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>

Reading the discussion between you and Johannes, I understand that for
a long term fix the memcg needs to return "RETRY" in order to release
locks. Thinking about it a little, it should be simple to return "RETRY"
to all processes waiting on the oom kill queue of a memcg, and it can be
done with a small fix to memory.c.

Thank you.
-Kame

> ---
> include/linux/gfp.h | 3 +++
> include/linux/memcontrol.h | 12 ++++++++++++
> mm/filemap.c | 8 +++++++-
> mm/memcontrol.c | 5 +----
> 4 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 10e667f..aac9b21 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -152,6 +152,9 @@ struct vm_area_struct;
> /* 4GB DMA on some platforms */
> #define GFP_DMA32 __GFP_DMA32
>
> +/* memcg oom killer is not allowed */
> +#define GFP_MEMCG_NO_OOM __GFP_NORETRY
> +
> /* Convert GFP flags to their corresponding migrate type */
> static inline int allocflags_to_migratetype(gfp_t gfp_flags)
> {
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 095d2b4..1ad4bc6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
> extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
>
> +static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask)
> +{
> + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
> +}
> +
> struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>
> @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
> return 0;
> }
>
> +static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
> + struct mm_struct *mm, gfp_t gfp_mask)
> +{
> + return 0;
> +}
> +
> static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
> {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..ef14351 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> VM_BUG_ON(!PageLocked(page));
> VM_BUG_ON(PageSwapBacked(page));
>
> - error = mem_cgroup_cache_charge(page, current->mm,
> + /*
> + * Cannot trigger OOM even if gfp_mask would allow that normally
> + * because we might be called from a locked context and that
> + * could lead to deadlocks if the killed process is waiting for
> + * the same lock.
> + */
> + error = mem_cgroup_cache_charge_no_oom(page, current->mm,
> gfp_mask & GFP_RECLAIM_MASK);
> if (error)
> goto out;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 02ee2f7..b4754ba 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (!(gfp_mask & __GFP_WAIT))
> return CHARGE_WOULDBLOCK;
>
> - if (gfp_mask & __GFP_NORETRY)
> - return CHARGE_NOMEM;
> -
> ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> return CHARGE_RETRY;
> @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> {
> struct mem_cgroup *memcg = NULL;
> unsigned int nr_pages = 1;
> - bool oom = true;
> + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
> int ret;
>
> if (PageTransHuge(page)) {
>

2012-11-27 09:54:57

by Michal Hocko

Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Tue 27-11-12 09:05:30, KAMEZAWA Hiroyuki wrote:
[...]
> As a short term fix, I think this patch will work enough and seems simple enough.
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>

Thanks!
If Johannes is also ok with this for now I will resubmit the patch to
Andrew after I hear back from the reporter.

> Reading the discussion between you and Johannes, I understand that for
> a long term fix the memcg needs to return "RETRY" in order to release
> locks. Thinking about it a little, it should be simple to return "RETRY"
> to all processes waiting on the oom kill queue of a memcg, and it can be
> done with a small fix to memory.c.

I wouldn't call it simple but it is doable.
--
Michal Hocko
SUSE Labs

2012-11-27 19:48:29

by Johannes Weiner

Subject: Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Tue, Nov 27, 2012 at 09:05:30AM +0900, Kamezawa Hiroyuki wrote:
> (2012/11/26 22:18), Michal Hocko wrote:
> >[CCing also Johannes - the thread started here:
> >https://lkml.org/lkml/2012/11/21/497]
> >
> >On Mon 26-11-12 01:38:55, azurIt wrote:
> >>>This is hackish but it should help you in this case. Kamezawa, what do
> >>>you think about that? Should we generalize this and prepare something
> >>>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >>>automatically and use the function whenever we are in a locked context?
> >>>To be honest I do not like this very much but nothing more sensible
> >>>(without touching non-memcg paths) comes to my mind.
> >>
> >>
> >>I installed a kernel with this patch, will report back if the problem
> >>occurs again OR in a few weeks if everything is ok. Thank you!
> >
> >Now that I am looking at the patch closer it will not work because it
> >depends on another patch which is not merged yet, and even that one
> >wouldn't help on its own because __GFP_NORETRY doesn't break the charge
> >loop. Sorry I have missed that...
> >
> >The patch below should help though. (it is based on top of the current
> >-mm tree but I will send a backport to 3.2 in the reply as well)
> >---
> > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> >From: Michal Hocko <[email protected]>
> >Date: Mon, 26 Nov 2012 11:47:57 +0100
> >Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >
> >memcg oom killer might deadlock if the process which falls down to
> >mem_cgroup_handle_oom holds a lock which prevents another task from
> >terminating because that task is blocked on the very same lock.
> >This can happen when a write system call needs to allocate a page but
> >the allocation hits the memcg hard limit and there is nothing to reclaim
> >(e.g. there is no swap or the swap limit is hit as well and all cache
> >pages have been reclaimed already) and the process selected by the memcg
> >OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).
> >
> >Process A
> >[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> >[<ffffffff81121c90>] do_last+0x250/0xa30
> >[<ffffffff81122547>] path_openat+0xd7/0x440
> >[<ffffffff811229c9>] do_filp_open+0x49/0xa0
> >[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> >[<ffffffff8110f950>] sys_open+0x20/0x30
> >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> >[<ffffffffffffffff>] 0xffffffffffffffff
> >
> >Process B
> >[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> >[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> >[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> >[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> >[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> >[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> >[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> >[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> >[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> >[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> >[<ffffffff8111156a>] do_sync_write+0xea/0x130
> >[<ffffffff81112183>] vfs_write+0xf3/0x1f0
> >[<ffffffff81112381>] sys_write+0x51/0x90
> >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> >[<ffffffffffffffff>] 0xffffffffffffffff
> >
> >This is not a hard deadlock though because administrator can still
> >intervene and increase the limit on the group which helps the writer to
> >finish the allocation and release the lock.
> >
> >This patch heals the problem by forbidding OOM from page cache charges
> >(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
> >helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask,
> >which then tells mem_cgroup_charge_common that OOM is not allowed for
> >the charge. Forbidding OOM from this path, apart from fixing the bug,
> >also makes some sense as we really do not want to cause an OOM because
> >of page cache usage.
> >As a possibly visible result add_to_page_cache_lru might fail more often
> >with ENOMEM, but this is to be expected if the limit is set and it is
> >preferable to the OOM killer IMO.
> >
> >__GFP_NORETRY is abused for this memcg specific flag because it has
> >been used to prevent OOM already (since the not-merged-yet "memcg:
> >reclaim when more than one page needed"). The only difference is that
> >the flag doesn't prevent reclaim anymore, which kind of makes sense
> >because the global memory allocator triggers reclaim as well. Retrying
> >without any reclaim on __GFP_NORETRY wouldn't make much sense anyway
> >because it would effectively be a busy loop with OOM allowed in this
> >path.
> >
> >Reported-by: azurIt <[email protected]>
> >Signed-off-by: Michal Hocko <[email protected]>
>
> As a short term fix, I think this patch will work enough and seems simple enough.
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>

Yes, let's do this for now.

> >diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >index 10e667f..aac9b21 100644
> >--- a/include/linux/gfp.h
> >+++ b/include/linux/gfp.h
> >@@ -152,6 +152,9 @@ struct vm_area_struct;
> > /* 4GB DMA on some platforms */
> > #define GFP_DMA32 __GFP_DMA32
> >
> >+/* memcg oom killer is not allowed */
> >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY

Could we leave this within memcg, please? An extra flag to
mem_cgroup_cache_charge() or the like. GFP flags are about
controlling the page allocator, this seems abusive. We have an oom
flag down in try_charge, maybe just propagate this up the stack?

> >diff --git a/mm/filemap.c b/mm/filemap.c
> >index 83efee7..ef14351 100644
> >--- a/mm/filemap.c
> >+++ b/mm/filemap.c
> >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > VM_BUG_ON(!PageLocked(page));
> > VM_BUG_ON(PageSwapBacked(page));
> >
> >- error = mem_cgroup_cache_charge(page, current->mm,
> >+ /*
> >+ * Cannot trigger OOM even if gfp_mask would allow that normally
> >+ * because we might be called from a locked context and that
> >+ * could lead to deadlocks if the killed process is waiting for
> >+ * the same lock.
> >+ */
> >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm,
> > gfp_mask & GFP_RECLAIM_MASK);
> > if (error)
> > goto out;

Shmem does not use this function but also charges under the i_mutex in
the write path and the fallocate path, at least.

2012-11-27 20:54:41

by Michal Hocko

Subject: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Tue 27-11-12 14:48:13, Johannes Weiner wrote:
[...]
> > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > >index 10e667f..aac9b21 100644
> > >--- a/include/linux/gfp.h
> > >+++ b/include/linux/gfp.h
> > >@@ -152,6 +152,9 @@ struct vm_area_struct;
> > > /* 4GB DMA on some platforms */
> > > #define GFP_DMA32 __GFP_DMA32
> > >
> > >+/* memcg oom killer is not allowed */
> > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY
>
> Could we leave this within memcg, please? An extra flag to
> mem_cgroup_cache_charge() or the like. GFP flags are about
> controlling the page allocator, this seems abusive. We have an oom
> flag down in try_charge, maybe just propagate this up the stack?

OK, what about the patch below?
I have dropped Kame's Acked-by because it has been reworked. The patch
is the same in principle.

> > >diff --git a/mm/filemap.c b/mm/filemap.c
> > >index 83efee7..ef14351 100644
> > >--- a/mm/filemap.c
> > >+++ b/mm/filemap.c
> > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > > VM_BUG_ON(!PageLocked(page));
> > > VM_BUG_ON(PageSwapBacked(page));
> > >
> > >- error = mem_cgroup_cache_charge(page, current->mm,
> > >+ /*
> > >+ * Cannot trigger OOM even if gfp_mask would allow that normally
> > >+ * because we might be called from a locked context and that
> > >+ * could lead to deadlocks if the killed process is waiting for
> > >+ * the same lock.
> > >+ */
> > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm,
> > > gfp_mask & GFP_RECLAIM_MASK);
> > > if (error)
> > > goto out;
>
> Shmem does not use this function but also charges under the i_mutex in
> the write path and the fallocate path, at least.

Right you are
---
>From 60cc8a184490d277eb24fca551b114f1e2234ce0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use the oom parameter directly as per
Johannes
- handle also shmem write faults and fallocate properly as per Johannes

Reported-by: azurIt <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 5 +++--
mm/filemap.c | 9 +++++++--
mm/memcontrol.c | 9 ++++-----
mm/shmem.c | 14 +++++++++++---
4 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask);
+ gfp_t gfp_mask, bool oom);

struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
}

static inline int mem_cgroup_cache_charge(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
+ struct mm_struct *mm, gfp_t gfp_mask,
+ bool oom)
{
return 0;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

- error = mem_cgroup_cache_charge(page, current->mm,
- gfp_mask & GFP_RECLAIM_MASK);
+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that normally
+ * because we might be called from a locked context and that
+ * could lead to deadlocks if the killed process is waiting for
+ * the same lock.
+ */
+ error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
if (error)
goto out;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..26690d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
* < 0 if the cgroup is over its limit
*/
static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype)
+ gfp_t gfp_mask, enum charge_type ctype, bool oom)
{
struct mem_cgroup *memcg = NULL;
unsigned int nr_pages = 1;
- bool oom = true;
int ret;

if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
VM_BUG_ON(page->mapping && !PageAnon(page));
VM_BUG_ON(!mm);
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_ANON);
+ MEM_CGROUP_CHARGE_TYPE_ANON, true);
}

/*
@@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
}

int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, bool oom)
{
struct mem_cgroup *memcg = NULL;
enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
return 0;

if (!PageSwapCache(page))
- ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+ ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
else { /* page is swapcache/shmem */
ret = __mem_cgroup_try_charge_swapin(mm, page,
gfp_mask, &memcg);
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..cef63b5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
* the shmem_swaplist_mutex which might hold up shmem_writepage().
* Charged back to the user (not to caller) when swap account is used.
*/
- error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+ error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
if (error)
goto out;
/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,16 @@ repeat:
goto failed;
}

+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that
+ * normally because we might be called from a locked
+ * context (i_mutex held) if this is a write lock or
+ * fallocate and that could lead to deadlocks if the
+ * killed process is waiting for the same lock.
+ */
error = mem_cgroup_cache_charge(page, current->mm,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK,
+ sgp < SGP_WRITE);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1217,7 @@ repeat:
SetPageSwapBacked(page);
__set_page_locked(page);
error = mem_cgroup_cache_charge(page, current->mm,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK, true);
if (error)
goto decused;
error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
--
1.7.10.4


--
Michal Hocko
SUSE Labs

2012-11-27 20:59:49

by Michal Hocko

Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

Sorry, I forgot about one shmem charge:
---
>From 7ae29927d24471c1b1a6ceb021219c592c1ef518 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 27 Nov 2012 21:53:13 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use the oom parameter directly, as per
Johannes
- also handle shmem write faults and fallocate properly, as per Johannes

Reported-by: azurIt <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 5 +++--
mm/filemap.c | 9 +++++++--
mm/memcontrol.c | 9 ++++-----
mm/shmem.c | 15 ++++++++++++---
4 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask);
+ gfp_t gfp_mask, bool oom);

struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
}

static inline int mem_cgroup_cache_charge(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
+ struct mm_struct *mm, gfp_t gfp_mask,
+ bool oom)
{
return 0;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

- error = mem_cgroup_cache_charge(page, current->mm,
- gfp_mask & GFP_RECLAIM_MASK);
+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that normally
+ * because we might be called from a locked context and that
+ * could lead to deadlocks if the killed process is waiting for
+ * the same lock.
+ */
+ error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
if (error)
goto out;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..26690d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
* < 0 if the cgroup is over its limit
*/
static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype)
+ gfp_t gfp_mask, enum charge_type ctype, bool oom)
{
struct mem_cgroup *memcg = NULL;
unsigned int nr_pages = 1;
- bool oom = true;
int ret;

if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
VM_BUG_ON(page->mapping && !PageAnon(page));
VM_BUG_ON(!mm);
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_ANON);
+ MEM_CGROUP_CHARGE_TYPE_ANON, true);
}

/*
@@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
}

int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, bool oom)
{
struct mem_cgroup *memcg = NULL;
enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
return 0;

if (!PageSwapCache(page))
- ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+ ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
else { /* page is swapcache/shmem */
ret = __mem_cgroup_try_charge_swapin(mm, page,
gfp_mask, &memcg);
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..ba59cfa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
* the shmem_swaplist_mutex which might hold up shmem_writepage().
* Charged back to the user (not to caller) when swap account is used.
*/
- error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+ error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
if (error)
goto out;
/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,16 @@ repeat:
goto failed;
}

+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that
+ * normally because we might be called from a locked
+ * context (i_mutex held) if this is a write lock or
+ * fallocate and that could lead to deadlocks if the
+ * killed process is waiting for the same lock.
+ */
error = mem_cgroup_cache_charge(page, current->mm,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK,
+ sgp < SGP_WRITE);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1217,8 @@ repeat:
SetPageSwapBacked(page);
__set_page_locked(page);
error = mem_cgroup_cache_charge(page, current->mm,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK,
+ sgp < SGP_WRITE);
if (error)
goto decused;
error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2012-11-28 15:26:48

by Johannes Weiner

Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote:
> @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> return 0;
>
> if (!PageSwapCache(page))
> - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
> else { /* page is swapcache/shmem */
> ret = __mem_cgroup_try_charge_swapin(mm, page,
> gfp_mask, &memcg);

I think you need to pass it down the swapcache path too, as that is
what happens when the shmem page written to is in swap and has been
read into swapcache by the time of charging.
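Johannes' point can be illustrated with a toy userspace model of the dispatch in mem_cgroup_cache_charge. Everything below is a stub written for illustration; only the control flow mirrors the kernel code. In v2 of the fix the swapcache branch still charged with oom effectively hardcoded to true, so a caller asking for oom == false could still trigger the memcg OOM killer:

```c
#include <assert.h>
#include <stdbool.h>

/* Records which oom value actually reached the charge core. */
static bool oom_seen;

static int charge_common(bool oom)    { oom_seen = oom; return 0; }
static int swapin_charge_v2(void)     { oom_seen = true; return 0; } /* hardcoded */
static int swapin_charge_v3(bool oom) { oom_seen = oom; return 0; }  /* forwarded */

/* Toy model of mem_cgroup_cache_charge(): the oom flag must reach
 * BOTH the plain page cache branch and the swapcache/shmem branch. */
static int cache_charge(bool page_is_swapcache, bool oom, bool fixed)
{
	if (!page_is_swapcache)
		return charge_common(oom);
	return fixed ? swapin_charge_v3(oom) : swapin_charge_v2();
}
```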

> @@ -1152,8 +1152,16 @@ repeat:
> goto failed;
> }
>
> + /*
> + * Cannot trigger OOM even if gfp_mask would allow that
> + * normally because we might be called from a locked
> + * context (i_mutex held) if this is a write lock or
> + * fallocate and that could lead to deadlocks if the
> + * killed process is waiting for the same lock.
> + */

Indentation broken?

> error = mem_cgroup_cache_charge(page, current->mm,
> - gfp & GFP_RECLAIM_MASK);
> + gfp & GFP_RECLAIM_MASK,
> + sgp < SGP_WRITE);

The code tests for read-only paths a bunch of times using

sgp != SGP_WRITE && sgp != SGP_FALLOC

Would probably be more consistent and more robust to use this here as
well?
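The difference between the two tests is purely about robustness against enum reordering. A minimal sketch, assuming the sgp_type ordering of that era (SGP_READ, SGP_CACHE, SGP_DIRTY, SGP_WRITE, SGP_FALLOC): the short form is equivalent only because SGP_FALLOC happens to sort after SGP_WRITE.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed mirror of shmem's sgp_type ordering at the time. */
enum sgp_type { SGP_READ, SGP_CACHE, SGP_DIRTY, SGP_WRITE, SGP_FALLOC };

/* Short form used in the patch: relies on the enum ordering. */
static bool read_only_short(enum sgp_type sgp)
{
	return sgp < SGP_WRITE;
}

/* Explicit form suggested in review: survives reordering of the enum. */
static bool read_only_explicit(enum sgp_type sgp)
{
	return sgp != SGP_WRITE && sgp != SGP_FALLOC;
}
```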

> @@ -1209,7 +1217,8 @@ repeat:
> SetPageSwapBacked(page);
> __set_page_locked(page);
> error = mem_cgroup_cache_charge(page, current->mm,
> - gfp & GFP_RECLAIM_MASK);
> + gfp & GFP_RECLAIM_MASK,
> + sgp < SGP_WRITE);

Same.

Otherwise, the patch looks good to me, thanks for persisting :)

2012-11-28 16:04:58

by Michal Hocko

Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Wed 28-11-12 10:26:31, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote:
> > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> > return 0;
> >
> > if (!PageSwapCache(page))
> > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
> > else { /* page is swapcache/shmem */
> > ret = __mem_cgroup_try_charge_swapin(mm, page,
> > gfp_mask, &memcg);
>
> I think you need to pass it down the swapcache path too, as that is
> what happens when the shmem page written to is in swap and has been
> read into swapcache by the time of charging.

You are right, of course. I shouldn't send patches late in the evening
after staring at a crashdump for a good part of the day. /me ashamed.

> > @@ -1152,8 +1152,16 @@ repeat:
> > goto failed;
> > }
> >
> > + /*
> > + * Cannot trigger OOM even if gfp_mask would allow that
> > + * normally because we might be called from a locked
> > + * context (i_mutex held) if this is a write lock or
> > + * fallocate and that could lead to deadlocks if the
> > + * killed process is waiting for the same lock.
> > + */
>
> Indentation broken?

c&p

> > error = mem_cgroup_cache_charge(page, current->mm,
> > - gfp & GFP_RECLAIM_MASK);
> > + gfp & GFP_RECLAIM_MASK,
> > + sgp < SGP_WRITE);
>
> The code tests for read-only paths a bunch of times using
>
> sgp != SGP_WRITE && sgp != SGP_FALLOC
>
> Would probably be more consistent and more robust to use this here as
> well?

Yes, my laziness. I was considering that, but it was really long so I
chose the simpler way. But you are right that consistency is probably
better here.

> > @@ -1209,7 +1217,8 @@ repeat:
> > SetPageSwapBacked(page);
> > __set_page_locked(page);
> > error = mem_cgroup_cache_charge(page, current->mm,
> > - gfp & GFP_RECLAIM_MASK);
> > + gfp & GFP_RECLAIM_MASK,
> > + sgp < SGP_WRITE);
>
> Same.
>
> Otherwise, the patch looks good to me, thanks for persisting :)

Thanks for the thorough review.
Here we go with the fixed version.
---
From 5000bf32c9c02fcd31d18e615300d8e7e7ef94a5 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 28 Nov 2012 16:49:46 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg OOM killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page, the
allocation hits the memcg hard limit, and there is nothing left to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache
pages have already been reclaimed), while the process selected by the
memcg OOM killer is blocked on the i_mutex of the same inode (e.g. while
truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though, because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result, add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected if the limit is set, and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use the oom parameter directly, as per
Johannes
- also handle shmem write faults and fallocate properly, as per Johannes

Reported-by: azurIt <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 11 +++++++----
mm/filemap.c | 9 +++++++--
mm/memcontrol.c | 25 +++++++++++++------------
mm/memory.c | 2 +-
mm/shmem.c | 17 ++++++++++++++---
mm/swapfile.c | 2 +-
6 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..5abe441 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
- struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
+ struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
+ bool oom);
extern void mem_cgroup_commit_charge_swapin(struct page *page,
struct mem_cgroup *memcg);
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask);
+ gfp_t gfp_mask, bool oom);

struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,13 +211,15 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
}

static inline int mem_cgroup_cache_charge(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
+ struct mm_struct *mm, gfp_t gfp_mask,
+ bool oom)
{
return 0;
}

static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
- struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
+ struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp,
+ bool oom)
{
return 0;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

- error = mem_cgroup_cache_charge(page, current->mm,
- gfp_mask & GFP_RECLAIM_MASK);
+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that normally
+ * because we might be called from a locked context and that
+ * could lead to deadlocks if the killed process is waiting for
+ * the same lock.
+ */
+ error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
if (error)
goto out;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..02a6d70 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
* < 0 if the cgroup is over its limit
*/
static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype)
+ gfp_t gfp_mask, enum charge_type ctype, bool oom)
{
struct mem_cgroup *memcg = NULL;
unsigned int nr_pages = 1;
- bool oom = true;
int ret;

if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
VM_BUG_ON(page->mapping && !PageAnon(page));
VM_BUG_ON(!mm);
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_ANON);
+ MEM_CGROUP_CHARGE_TYPE_ANON, true);
}

/*
@@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page,
gfp_t mask,
- struct mem_cgroup **memcgp)
+ struct mem_cgroup **memcgp,
+ bool oom)
{
struct mem_cgroup *memcg;
struct page_cgroup *pc;
@@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
if (!memcg)
goto charge_cur_mm;
*memcgp = memcg;
- ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+ ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
css_put(&memcg->css);
if (ret == -EINTR)
ret = 0;
return ret;
charge_cur_mm:
- ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
+ ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
if (ret == -EINTR)
ret = 0;
return ret;
}

int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
- gfp_t gfp_mask, struct mem_cgroup **memcgp)
+ gfp_t gfp_mask, struct mem_cgroup **memcgp,
+ bool oom)
{
*memcgp = NULL;
if (mem_cgroup_disabled())
@@ -3803,12 +3804,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
if (!PageSwapCache(page)) {
int ret;

- ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true);
+ ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom);
if (ret == -EINTR)
ret = 0;
return ret;
}
- return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
+ return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom);
}

void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
@@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
}

int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, bool oom)
{
struct mem_cgroup *memcg = NULL;
enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
return 0;

if (!PageSwapCache(page))
- ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+ ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
else { /* page is swapcache/shmem */
ret = __mem_cgroup_try_charge_swapin(mm, page,
- gfp_mask, &memcg);
+ gfp_mask, &memcg, oom);
if (!ret)
__mem_cgroup_commit_charge_swapin(page, memcg, type);
}
diff --git a/mm/memory.c b/mm/memory.c
index 6891d3b..afad903 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
}

- if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
+ if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) {
ret = VM_FAULT_OOM;
goto out_page;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..3b27db4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
* the shmem_swaplist_mutex which might hold up shmem_writepage().
* Charged back to the user (not to caller) when swap account is used.
*/
- error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+ error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
if (error)
goto out;
/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,17 @@ repeat:
goto failed;
}

+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that
+ * normally because we might be called from a locked
+ * context (i_mutex held) if this is a write lock or
+ * fallocate and that could lead to deadlocks if the
+ * killed process is waiting for the same lock.
+ */
error = mem_cgroup_cache_charge(page, current->mm,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK,
+ sgp != SGP_WRITE &&
+ sgp != SGP_FALLOC);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1218,9 @@ repeat:
SetPageSwapBacked(page);
__set_page_locked(page);
error = mem_cgroup_cache_charge(page, current->mm,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK,
+ sgp != SGP_WRITE &&
+ sgp != SGP_FALLOC);
if (error)
goto decused;
error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2f8e429..8ec511e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
int ret = 1;

if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
- GFP_KERNEL, &memcg)) {
+ GFP_KERNEL, &memcg, true)) {
ret = -ENOMEM;
goto out_nolock;
}
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2012-11-28 16:38:00

by Johannes Weiner

Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 095d2b4..5abe441 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> /* for swap handling */
> extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> - struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> + struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> + bool oom);

Ok, now I feel almost bad for asking, but why the public interface,
too? You only ever pass "true" in there and this is unlikely to
change anytime soon, no?

> @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
> static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> struct page *page,
> gfp_t mask,
> - struct mem_cgroup **memcgp)
> + struct mem_cgroup **memcgp,
> + bool oom)
> {
> struct mem_cgroup *memcg;
> struct page_cgroup *pc;
> @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> if (!memcg)
> goto charge_cur_mm;
> *memcgp = memcg;
> - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
> + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
> css_put(&memcg->css);
> if (ret == -EINTR)
> ret = 0;
> return ret;
> charge_cur_mm:
> - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
> + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
> if (ret == -EINTR)
> ret = 0;
> return ret;
> }

Only this one is needed...

> @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
> }
>
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask)
> + gfp_t gfp_mask, bool oom)
> {
> struct mem_cgroup *memcg = NULL;
> enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
> @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> return 0;
>
> if (!PageSwapCache(page))
> - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
> else { /* page is swapcache/shmem */
> ret = __mem_cgroup_try_charge_swapin(mm, page,
> - gfp_mask, &memcg);
> + gfp_mask, &memcg, oom);
> if (!ret)
> __mem_cgroup_commit_charge_swapin(page, memcg, type);
> }

...for this site.

> diff --git a/mm/memory.c b/mm/memory.c
> index 6891d3b..afad903 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> }
> }
>
> - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
> + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) {
> ret = VM_FAULT_OOM;
> goto out_page;
> }

Can not happen for shmem, the fault handler uses vma->vm_ops->fault.

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2f8e429..8ec511e 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> int ret = 1;
>
> if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
> - GFP_KERNEL, &memcg)) {
> + GFP_KERNEL, &memcg, true)) {
> ret = -ENOMEM;
> goto out_nolock;
> }

Can not happen for shmem, uses shmem_unuse() instead.

2012-11-28 16:46:45

by Michal Hocko

Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Wed 28-11-12 11:37:36, Johannes Weiner wrote:
> On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 095d2b4..5abe441 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> > gfp_t gfp_mask);
> > /* for swap handling */
> > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> > + bool oom);
>
> Ok, now I feel almost bad for asking, but why the public interface,
> too?

Would it work out if I told you it was to double check that your review
quality has not decreased after that many revisions? :P

Incremental update and the full patch in the reply
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5abe441..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,8 +57,7 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
- struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
- bool oom);
+ struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
extern void mem_cgroup_commit_charge_swapin(struct page *page,
struct mem_cgroup *memcg);
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
@@ -218,8 +217,7 @@ static inline int mem_cgroup_cache_charge(struct page *page,
}

static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
- struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp,
- bool oom)
+ struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
{
return 0;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02a6d70..3c9b1c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3789,8 +3789,7 @@ charge_cur_mm:
}

int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
- gfp_t gfp_mask, struct mem_cgroup **memcgp,
- bool oom)
+ gfp_t gfp_mask, struct mem_cgroup **memcgp)
{
*memcgp = NULL;
if (mem_cgroup_disabled())
@@ -3804,12 +3803,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
if (!PageSwapCache(page)) {
int ret;

- ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom);
+ ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true);
if (ret == -EINTR)
ret = 0;
return ret;
}
- return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom);
+ return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
}

void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
diff --git a/mm/memory.c b/mm/memory.c
index afad903..6891d3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
}

- if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) {
+ if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
ret = VM_FAULT_OOM;
goto out_page;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8ec511e..2f8e429 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
int ret = 1;

if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
- GFP_KERNEL, &memcg, true)) {
+ GFP_KERNEL, &memcg)) {
ret = -ENOMEM;
goto out_nolock;
}
--
Michal Hocko
SUSE Labs

2012-11-28 16:48:28

by Michal Hocko

Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Wed 28-11-12 17:46:40, Michal Hocko wrote:
> On Wed 28-11-12 11:37:36, Johannes Weiner wrote:
> > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 095d2b4..5abe441 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> > > gfp_t gfp_mask);
> > > /* for swap handling */
> > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> > > + bool oom);
> >
> > Ok, now I feel almost bad for asking, but why the public interface,
> > too?
>
> Would it work out if I tell it was to double check that your review
> quality is not decreased after that many revisions? :P
>
> Incremental update and the full patch in the reply
---
From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 28 Nov 2012 17:46:32 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg OOM killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page, the
allocation hits the memcg hard limit, and there is nothing left to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache
pages have already been reclaimed), while the process selected by the
memcg OOM killer is blocked on the i_mutex of the same inode (e.g. while
truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though, because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result, add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected if the limit is set, and it is
preferable to the OOM killer IMO.
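With the patch applied, a writer inside a limited group can see write(2) fail with ENOMEM where it previously risked the memcg OOM killer. Below is a minimal sketch of how userspace might surface that failure mode; `write_all()` is a hypothetical helper written for illustration, not an existing API:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: loop over short writes and report ENOMEM --
 * the error a memcg-limited page cache charge can now surface --
 * distinctly from other failures. */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			if (errno == ENOMEM)
				fprintf(stderr, "memcg limit hit, nothing to reclaim\n");
			return -1;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```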

Changes since v1
- do not abuse gfp_flags and rather use the oom parameter directly, as per
Johannes
- also handle shmem write faults and fallocate properly, as per Johannes

Reported-by: azurIt <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 5 +++--
mm/filemap.c | 9 +++++++--
mm/memcontrol.c | 20 ++++++++++----------
mm/shmem.c | 17 ++++++++++++++---
4 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);

extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask);
+ gfp_t gfp_mask, bool oom);

struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
}

static inline int mem_cgroup_cache_charge(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
+ struct mm_struct *mm, gfp_t gfp_mask,
+ bool oom)
{
return 0;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

- error = mem_cgroup_cache_charge(page, current->mm,
- gfp_mask & GFP_RECLAIM_MASK);
+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that normally
+ * because we might be called from a locked context and that
+ * could lead to deadlocks if the killed process is waiting for
+ * the same lock.
+ */
+ error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
if (error)
goto out;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..3c9b1c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
* < 0 if the cgroup is over its limit
*/
static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype)
+ gfp_t gfp_mask, enum charge_type ctype, bool oom)
{
struct mem_cgroup *memcg = NULL;
unsigned int nr_pages = 1;
- bool oom = true;
int ret;

if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
VM_BUG_ON(page->mapping && !PageAnon(page));
VM_BUG_ON(!mm);
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_ANON);
+ MEM_CGROUP_CHARGE_TYPE_ANON, true);
}

/*
@@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page,
gfp_t mask,
- struct mem_cgroup **memcgp)
+ struct mem_cgroup **memcgp,
+ bool oom)
{
struct mem_cgroup *memcg;
struct page_cgroup *pc;
@@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
if (!memcg)
goto charge_cur_mm;
*memcgp = memcg;
- ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+ ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
css_put(&memcg->css);
if (ret == -EINTR)
ret = 0;
return ret;
charge_cur_mm:
- ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
+ ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
if (ret == -EINTR)
ret = 0;
return ret;
@@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
ret = 0;
return ret;
}
- return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
+ return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
}

void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
@@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
}

int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, bool oom)
{
struct mem_cgroup *memcg = NULL;
enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
return 0;

if (!PageSwapCache(page))
- ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+ ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
else { /* page is swapcache/shmem */
ret = __mem_cgroup_try_charge_swapin(mm, page,
- gfp_mask, &memcg);
+ gfp_mask, &memcg, oom);
if (!ret)
__mem_cgroup_commit_charge_swapin(page, memcg, type);
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..3b27db4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
* the shmem_swaplist_mutex which might hold up shmem_writepage().
* Charged back to the user (not to caller) when swap account is used.
*/
- error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+ error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
if (error)
goto out;
/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,17 @@ repeat:
goto failed;
}

+ /*
+ * Cannot trigger OOM even if gfp_mask would allow that
+ * normally because we might be called from a locked
+ * context (i_mutex held) if this is a write lock or
+ * fallocate and that could lead to deadlocks if the
+ * killed process is waiting for the same lock.
+ */
error = mem_cgroup_cache_charge(page, current->mm,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK,
+ sgp != SGP_WRITE &&
+ sgp != SGP_FALLOC);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1218,9 @@ repeat:
SetPageSwapBacked(page);
__set_page_locked(page);
error = mem_cgroup_cache_charge(page, current->mm,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK,
+ sgp != SGP_WRITE &&
+ sgp != SGP_FALLOC);
if (error)
goto decused;
error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2012-11-28 18:44:49

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Wed, Nov 28, 2012 at 05:48:24PM +0100, Michal Hocko wrote:
> On Wed 28-11-12 17:46:40, Michal Hocko wrote:
> > On Wed 28-11-12 11:37:36, Johannes Weiner wrote:
> > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index 095d2b4..5abe441 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> > > > gfp_t gfp_mask);
> > > > /* for swap handling */
> > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> > > > + bool oom);
> > >
> > > Ok, now I feel almost bad for asking, but why the public interface,
> > > too?
> >
> > Would it work out if I told you it was to double-check that your review
> > quality has not decreased after that many revisions? :P

Deal.

> >From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 28 Nov 2012 17:46:32 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>
> memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents another task from
> terminating because it is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncating it).
>
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> This is not a hard deadlock though because the administrator can still
> intervene and increase the limit on the group which helps the writer to
> finish the allocation and release the lock.
>
> This patch heals the problem by forbidding OOM from page cache charges
> (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
> argument which is pushed down the call chain.
>
> As a possibly visible result add_to_page_cache_lru might fail more often
> with ENOMEM but this is to be expected if the limit is set and it is
> preferable to the OOM killer IMO.
>
> Changes since v1
> - do not abuse gfp_flags and rather use oom parameter directly as per
> Johannes
> - handle also shmem write faults resp. fallocate properly as per Johannes
>
> Reported-by: azurIt <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

Thanks, Michal!

2012-11-28 20:20:45

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Wed, 28 Nov 2012, Michal Hocko wrote:
> From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 28 Nov 2012 17:46:32 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>
> memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents another task from
> terminating because it is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncating it).
>
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> This is not a hard deadlock though because the administrator can still
> intervene and increase the limit on the group which helps the writer to
> finish the allocation and release the lock.
>
> This patch heals the problem by forbidding OOM from page cache charges
> (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
> argument which is pushed down the call chain.
>
> As a possibly visible result add_to_page_cache_lru might fail more often
> with ENOMEM but this is to be expected if the limit is set and it is
> preferable to the OOM killer IMO.
>
> Changes since v1
> - do not abuse gfp_flags and rather use oom parameter directly as per
> Johannes
> - handle also shmem write faults resp. fallocate properly as per Johannes
>
> Reported-by: azurIt <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

Sorry, Michal, you've laboured hard on this: but I dislike it so much
that I'm here overcoming my dread of entering an OOM-killer discussion,
and the resultant deluge of unwelcome CCs for eternity afterwards.

I had been relying on Johannes to repeat his "This issue has been
around for a while so frankly I don't think it's urgent enough to
rush things", but it looks like I have to be the one to repeat it.

Your analysis of azurIt's traces may well be correct, and this patch
may indeed ameliorate the situation, and it's fine as something for
azurIt to try and report on and keep in his tree; but I hope that
it does not go upstream and to stable.

Why do I dislike it so much? I suppose because it's both too general
and too limited at the same time.

Too general in that it changes the behaviour on OOM for a large set
of memcg charges, all those that go through add_to_page_cache_locked(),
when only a subset of those have the i_mutex issue.

If you're going to be that general, why not go further? Leave the
mem_cgroup_cache_charge() interface as is, make it not-OOM internally,
no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other
filesystem gets the benefit of those distinctions: isn't it better to
keep it simple? (And I can see a partial truncation case where shmem
uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour
is a non-issue, since swapoff invites itself to be killed anyway.)

Too limited in that i_mutex is just the held resource which azurIt's
traces have led you to, but it's a general problem that the OOM-killed
task might be waiting for a resource that the OOM-killing task holds.

I suspect that if we try hard enough (I admit I have not), we can find
an example of such a potential deadlock for almost every memcg charge
site. mmap_sem? not as easy to invent a case with that as I thought,
since it needs a down_write, and the typical page allocations happen
with down_read, and I can't think of a process which does down_write
on another's mm.

But i_mutex is always good, once you remember the case of write to
file from userspace page which got paged out, so the fault path has
to read it back in, while i_mutex is still held at the outer level.
An unusual case? Well, normally yes, but we're considering
out-of-memory conditions, which may converge upon cases like this.

Wouldn't it be nice if I could be constructive? But I'm sceptical
even of Johannes's faith in what the global OOM killer would do:
how does __alloc_pages_slowpath() get out of its "goto restart"
loop, excepting the trivial case when the killer is the killed?

I wonder why this issue has hit azurIt and no other reporter?
No swap plays a part in it, but that's not so unusual.

Yours glOOMily,
Hugh

> ---
> include/linux/memcontrol.h | 5 +++--
> mm/filemap.c | 9 +++++++--
> mm/memcontrol.c | 20 ++++++++++----------
> mm/shmem.c | 17 ++++++++++++++---
> 4 files changed, 34 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 095d2b4..8f48d5e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
> extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>
> extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask);
> + gfp_t gfp_mask, bool oom);
>
> struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
> }
>
> static inline int mem_cgroup_cache_charge(struct page *page,
> - struct mm_struct *mm, gfp_t gfp_mask)
> + struct mm_struct *mm, gfp_t gfp_mask,
> + bool oom)
> {
> return 0;
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..ef8fbd5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> VM_BUG_ON(!PageLocked(page));
> VM_BUG_ON(PageSwapBacked(page));
>
> - error = mem_cgroup_cache_charge(page, current->mm,
> - gfp_mask & GFP_RECLAIM_MASK);
> + /*
> + * Cannot trigger OOM even if gfp_mask would allow that normally
> + * because we might be called from a locked context and that
> + * could lead to deadlocks if the killed process is waiting for
> + * the same lock.
> + */
> + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
> if (error)
> goto out;
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 02ee2f7..3c9b1c5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3709,11 +3709,10 @@ out:
> * < 0 if the cgroup is over its limit
> */
> static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask, enum charge_type ctype)
> + gfp_t gfp_mask, enum charge_type ctype, bool oom)
> {
> struct mem_cgroup *memcg = NULL;
> unsigned int nr_pages = 1;
> - bool oom = true;
> int ret;
>
> if (PageTransHuge(page)) {
> @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
> VM_BUG_ON(page->mapping && !PageAnon(page));
> VM_BUG_ON(!mm);
> return mem_cgroup_charge_common(page, mm, gfp_mask,
> - MEM_CGROUP_CHARGE_TYPE_ANON);
> + MEM_CGROUP_CHARGE_TYPE_ANON, true);
> }
>
> /*
> @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
> static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> struct page *page,
> gfp_t mask,
> - struct mem_cgroup **memcgp)
> + struct mem_cgroup **memcgp,
> + bool oom)
> {
> struct mem_cgroup *memcg;
> struct page_cgroup *pc;
> @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> if (!memcg)
> goto charge_cur_mm;
> *memcgp = memcg;
> - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
> + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
> css_put(&memcg->css);
> if (ret == -EINTR)
> ret = 0;
> return ret;
> charge_cur_mm:
> - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
> + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
> if (ret == -EINTR)
> ret = 0;
> return ret;
> @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
> ret = 0;
> return ret;
> }
> - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
> + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
> }
>
> void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
> @@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
> }
>
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask)
> + gfp_t gfp_mask, bool oom)
> {
> struct mem_cgroup *memcg = NULL;
> enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
> @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> return 0;
>
> if (!PageSwapCache(page))
> - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
> else { /* page is swapcache/shmem */
> ret = __mem_cgroup_try_charge_swapin(mm, page,
> - gfp_mask, &memcg);
> + gfp_mask, &memcg, oom);
> if (!ret)
> __mem_cgroup_commit_charge_swapin(page, memcg, type);
> }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 55054a7..3b27db4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
> * the shmem_swaplist_mutex which might hold up shmem_writepage().
> * Charged back to the user (not to caller) when swap account is used.
> */
> - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
> if (error)
> goto out;
> /* No radix_tree_preload: swap entry keeps a place for page in tree */
> @@ -1152,8 +1152,17 @@ repeat:
> goto failed;
> }
>
> + /*
> + * Cannot trigger OOM even if gfp_mask would allow that
> + * normally because we might be called from a locked
> + * context (i_mutex held) if this is a write lock or
> + * fallocate and that could lead to deadlocks if the
> + * killed process is waiting for the same lock.
> + */
> error = mem_cgroup_cache_charge(page, current->mm,
> - gfp & GFP_RECLAIM_MASK);
> + gfp & GFP_RECLAIM_MASK,
> + sgp != SGP_WRITE &&
> + sgp != SGP_FALLOC);
> if (!error) {
> error = shmem_add_to_page_cache(page, mapping, index,
> gfp, swp_to_radix_entry(swap));
> @@ -1209,7 +1218,9 @@ repeat:
> SetPageSwapBacked(page);
> __set_page_locked(page);
> error = mem_cgroup_cache_charge(page, current->mm,
> - gfp & GFP_RECLAIM_MASK);
> + gfp & GFP_RECLAIM_MASK,
> + sgp != SGP_WRITE &&
> + sgp != SGP_FALLOC);
> if (error)
> goto decused;
> error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
> --
> 1.7.10.4
>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]
>

2012-11-29 14:05:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked

On Wed 28-11-12 12:20:44, Hugh Dickins wrote:
[...]
> Sorry, Michal, you've laboured hard on this: but I dislike it so much
> that I'm here overcoming my dread of entering an OOM-killer discussion,
> and the resultant deluge of unwelcome CCs for eternity afterwards.
>
> I had been relying on Johannes to repeat his "This issue has been
> around for a while so frankly I don't think it's urgent enough to
> rush things", but it looks like I have to be the one to repeat it.

Well, the idea was to use this only as a temporary fix and come up with a
better solution without any hurry.

> Your analysis of azurIt's traces may well be correct, and this patch
> may indeed ameliorate the situation, and it's fine as something for
> azurIt to try and report on and keep in his tree; but I hope that
> it does not go upstream and to stable.
>
> Why do I dislike it so much? I suppose because it's both too general
> and too limited at the same time.
>
> Too general in that it changes the behaviour on OOM for a large set
> of memcg charges, all those that go through add_to_page_cache_locked(),
> when only a subset of those have the i_mutex issue.

This is a fair point but the real fix which we were discussing with
Johannes would be even more risky for stable.

> If you're going to be that general, why not go further? Leave the
> mem_cgroup_cache_charge() interface as is, make it not-OOM internally,
> no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other
> filesystem gets the benefit of those distinctions: isn't it better to
> keep it simple? (And I can see a partial truncation case where shmem
> uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour
> is a non-issue, since swapoff invites itself to be killed anyway.)
>
> Too limited in that i_mutex is just the held resource which azurIt's
> traces have led you to, but it's a general problem that the OOM-killed
> task might be waiting for a resource that the OOM-killing task holds.
>
> I suspect that if we try hard enough (I admit I have not), we can find
> an example of such a potential deadlock for almost every memcg charge
> site. mmap_sem? not as easy to invent a case with that as I thought,
> since it needs a down_write, and the typical page allocations happen
> with down_read, and I can't think of a process which does down_write
> on another's mm.
>
> But i_mutex is always good, once you remember the case of write to
> file from userspace page which got paged out, so the fault path has
> to read it back in, while i_mutex is still held at the outer level.
> An unusual case? Well, normally yes, but we're considering
> out-of-memory conditions, which may converge upon cases like this.
>
> Wouldn't it be nice if I could be constructive? But I'm sceptical
> even of Johannes's faith in what the global OOM killer would do:
> how does __alloc_pages_slowpath() get out of its "goto restart"
> loop, excepting the trivial case when the killer is the killed?

I am not sure I am following you here, but Johannes's idea was to
break out of the charge after a signal has been sent and the charge
still fails, and either retry the fault or fail the allocation. I think
this should work but I am afraid that this needs some tuning (number of
retries, for example) to prevent too aggressive OOM or too many failures.

Do we have any other possibilities to solve this issue? Or do you think
we should ignore the problem just because nobody complained for such a
long time?
Dunno, I think we should fix this with something less risky for now and
come up with a real fix after it sees sufficient testing.

> I wonder why this issue has hit azurIt and no other reporter?
> No swap plays a part in it, but that's not so unusual.
>
> Yours glOOMily,
> Hugh

[...]
--
Michal Hocko
SUSE Labs

2012-11-30 01:45:18

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>Here we go with the patch for 3.2.34. Could you test with this one,
>please?


I installed a kernel with this patch and will report back if the problem occurs again, or in a few weeks if everything is OK. Thank you!

azurIt

2012-11-30 02:29:23

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>Here we go with the patch for 3.2.34. Could you test with this one,
>please?


Michal, unfortunately i had to boot into another kernel because the one with this patch kept killing my MySQL server :( it was probably doing it on OOM in any cgroup - it looks like the OOM killer was not choosing processes only from the cgroup which ran out of memory. Here is the log from syslog: http://www.watchdog.sk/lkml/oom_mysqld

Maybe i should mention that the MySQL server has its own cgroup (called 'mysql') but with no limits on any resources.

azurIt

2012-11-30 12:45:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

On Fri 30-11-12 03:29:18, azurIt wrote:
> >Here we go with the patch for 3.2.34. Could you test with this one,
> >please?
>
>
> Michal, unfortunately i had to boot to another kernel because the one
> with this patch keeps killing my MySQL server :( it was, probably,
> doing it on OOM in any cgroup - looks like OOM was not choosing
> processes only from cgroup which is out of memory. Here is the log
> from syslog: http://www.watchdog.sk/lkml/oom_mysqld

You are also seeing global OOM:
Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [ 818.233470] [<ffffffff810cc90e>] dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.233600] [<ffffffff810cc80f>] ? find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [ 818.233721] [<ffffffff810ccdd5>] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [ 818.233842] [<ffffffff810cd485>] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [ 818.233963] [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [ 818.234082] [<ffffffff810cd65d>] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [ 818.234204] [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [ 818.235886] [<ffffffff8102739e>] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [ 818.236006] [<ffffffff810f3057>] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [ 818.236124] [<ffffffff810f35d7>] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [ 818.236244] [<ffffffff812c9a92>] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.236367] [<ffffffff815b547f>] page_fault+0x1f/0x30
[...]
Nov 30 02:53:56 server01 kernel: [ 818.356297] Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child
Nov 30 02:53:56 server01 kernel: [ 818.356493] Killed process 2188 (mysqld) total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB

Then you also have memcg oom killer:
Nov 30 02:53:56 server01 kernel: [ 818.375717] Task in /1037/uid killed as a result of limit of /1037
Nov 30 02:53:56 server01 kernel: [ 818.375886] memory: usage 102400kB, limit 102400kB, failcnt 736
Nov 30 02:53:56 server01 kernel: [ 818.376008] memory+swap: usage 102400kB, limit 102400kB, failcnt 0

The messages are intermixed and I guess rate limiting jumped in as
well, because I cannot associate all the oom messages to a specific OOM
event.

Anyway your system is under both global and local memory pressure. You
didn't see apache going down previously because it was probably the one
which was stuck and could be killed.
Anyway you need to set up your system more carefully.

> Maybe i should mention that the MySQL server has its own cgroup (called
> 'mysql') but with no limits on any resources.

Where is that group in the hierarchy?
>
> azurIt
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Michal Hocko
SUSE Labs

2012-11-30 12:53:34

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to set up your system more carefully.


No, it wasn't, i'm 1000% sure (i was on SSH). Here is the memory usage graph from that system at that time:
http://www.watchdog.sk/lkml/memory.png

The blank part is rebooting into the new kernel. The MySQL server was killed several times, then i rebooted into the previous kernel and the problem was gone (not a single MySQL kill). You can see two MySQL kills there at 03:54 and 03:04:30.


>
>> Maybe i should mention that the MySQL server has its own cgroup (called
>> 'mysql') but with no limits on any resources.
>
>Where is that group in the hierarchy?



In root.

2012-11-30 13:44:32

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to set up your system more carefully.


There is also evidence that the system has enough memory! :) Just take the 'rss' column from the process list in the OOM message and sum it - you will get 2489911. It's probably in KB so it's about 2.4 GB. The system has 14 GB of RAM so this also matches the data on my graph - 2.4 is about 17% of 14.

azur

2012-11-30 14:44:34

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

On Fri 30-11-12 14:44:27, azurIt wrote:
> >Anyway your system is under both global and local memory pressure. You
> >didn't see apache going down previously because it was probably the one
> >which was stuck and could be killed.
> >Anyway you need to set up your system more carefully.
>
>
> There is also evidence that the system has enough memory! :) Just
> take the 'rss' column from the process list in the OOM message and sum
> it - you will get 2489911. It's probably in KB so it's about 2.4 GB.
> The system has 14 GB of RAM so this also matches the data on my graph
> - 2.4 is about 17% of 14.

Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone
is hardly touched:
Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
of the memory above 4G, or you have a NUMA machine and the rest of the
memory is at another node. Could you post your memory map printed during
boot? (the e820: BIOS-provided physical RAM map: line and following lines)

There is also ZONE_NORMAL, which is not used much either
Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

You have mentioned that you are co-mounting with cpuset. If this happens
to be a NUMA machine, have you made access to all nodes available?
Also, what does /proc/sys/vm/zone_reclaim_mode say?
--
Michal Hocko
SUSE Labs

2012-11-30 15:03:51

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

On Fri 30-11-12 15:44:31, Michal Hocko wrote:
> On Fri 30-11-12 14:44:27, azurIt wrote:
> > >Anyway your system is under both global and local memory pressure. You
> > >didn't see apache going down previously because it was probably the one
> > >which was stuck and could be killed.
> > >Anyway you need to set up your system more carefully.
> >
> >
> > There is also evidence that the system has enough memory! :) Just
> > take the 'rss' column from the process list in the OOM message and
> > sum it - you will get 2489911. It's probably in KB, so it's about
> > 2.4 GB. The system has 14 GB of RAM, so this also matches the data on
> > my graph - 2.4 is about 17% of 14.
>
> Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone
> is hardly touched:
> Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>
> The DMA32 zone usually covers the first 4G unless your HW remaps the rest
> of the memory above 4G or you have a NUMA machine and the rest of the
> memory is on another node. Could you post your memory map printed during
> the boot? (e820: BIOS-provided physical RAM map: and following lines)
>
> There is also ZONE_NORMAL, which is not used much either:
> Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>
> You have mentioned that you are co-mounting with cpuset. If this happens
> to be a NUMA machine, have you made access to all nodes available?

And now that I am looking at the oom message more closely I can see
Nov 30 02:53:56 server01 kernel: [ 818.232812] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Nov 30 02:53:56 server01 kernel: [ 818.233029] apache2 cpuset=uid mems_allowed=0
Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [ 818.233470] [<ffffffff810cc90e>] dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.233600] [<ffffffff810cc80f>] ? find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [ 818.233721] [<ffffffff810ccdd5>] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [ 818.233842] [<ffffffff810cd485>] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [ 818.233963] [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [ 818.234082] [<ffffffff810cd65d>] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [ 818.234204] [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [ 818.235886] [<ffffffff8102739e>] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [ 818.236006] [<ffffffff810f3057>] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [ 818.236124] [<ffffffff810f35d7>] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [ 818.236244] [<ffffffff812c9a92>] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.236367] [<ffffffff815b547f>] page_fault+0x1f/0x30

This is interesting from two perspectives. Only the first node (Node-0)
is allowed, which would suggest that the cpuset controller is not
configured for all nodes. It is still surprising that Node 0 wouldn't
have any memory (I would expect ZONE_DMA32 to be sitting there).

Anyway, the more interesting thing is the gfp_mask: a GFP_NOWAIT
allocation from the page fault? Huh, this shouldn't happen - ever.
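As an aside, a gfp_mask can be decoded by hand against the 3.2-era flag bits; a mask of 0x0 is literally GFP_NOWAIT, since GFP_NOWAIT = GFP_ATOMIC & ~__GFP_HIGH = 0. A sketch, with the bit values taken from 3.2's include/linux/gfp.h:

```shell
#!/bin/sh
# Decode the low gfp_mask bits as defined in 3.2's include/linux/gfp.h:
# __GFP_WAIT=0x10, __GFP_HIGH=0x20, __GFP_IO=0x40, __GFP_FS=0x80.
decode_gfp() {
    mask=$1
    [ $((mask & 0x10)) -ne 0 ] && echo __GFP_WAIT
    [ $((mask & 0x20)) -ne 0 ] && echo __GFP_HIGH
    [ $((mask & 0x40)) -ne 0 ] && echo __GFP_IO
    [ $((mask & 0x80)) -ne 0 ] && echo __GFP_FS
    [ $((mask)) -eq 0 ] && echo "GFP_NOWAIT (no flags set)"
    return 0
}

decode_gfp 0x0     # the mask from the oom report
decode_gfp 0xd0    # GFP_KERNEL = __GFP_WAIT | __GFP_IO | __GFP_FS
```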

> Also, what does /proc/sys/vm/zone_reclaim_mode say?
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs

2012-11-30 15:08:16

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>The DMA32 zone usually covers the first 4G unless your HW remaps the rest
>of the memory above 4G or you have a NUMA machine and the rest of the
>memory is on another node. Could you post your memory map printed during
>the boot? (e820: BIOS-provided physical RAM map: and following lines)


Here is the full boot log:
http://www.watchdog.sk/lkml/kern.log


>You have mentioned that you are co-mounting with cpuset. If this happens
>to be a NUMA machine, have you made access to all nodes available?
>Also, what does /proc/sys/vm/zone_reclaim_mode say?


I don't really know what NUMA means or which nodes you are talking about, sorry :(

# cat /proc/sys/vm/zone_reclaim_mode
cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory



azur

2012-11-30 15:37:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

On Fri 30-11-12 16:03:47, Michal Hocko wrote:
[...]
> Anyway, the more interesting thing is the gfp_mask: a GFP_NOWAIT
> allocation from the page fault? Huh, this shouldn't happen - ever.

OK, it starts making sense now. The message came from
pagefault_out_of_memory, which no longer has the gfp or the required
node information. This suggests that VM_FAULT_OOM has been returned by
the fault handler, so this hasn't been triggered by the page fault
allocator.
I am wondering whether this could be caused by the patch, but the
effect of that one should be limited to the write path (unlike the
later version for the -mm tree, which hooks into shmem as well).

Will have to think about it some more.
--
Michal Hocko
SUSE Labs

2012-11-30 15:39:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

On Fri 30-11-12 16:08:11, azurIt wrote:
> >The DMA32 zone usually covers the first 4G unless your HW remaps the rest
> >of the memory above 4G or you have a NUMA machine and the rest of the
> >memory is on another node. Could you post your memory map printed during
> >the boot? (e820: BIOS-provided physical RAM map: and following lines)
>
>
> Here is the full boot log:
> http://www.watchdog.sk/lkml/kern.log

The log is not complete. Could you paste the complete dmesg output? Or
even better, do you have logs from the previous run?

> >You have mentioned that you are co-mounting with cpuset. If this happens
> >to be a NUMA machine, have you made access to all nodes available?
> >Also, what does /proc/sys/vm/zone_reclaim_mode say?
>
>
> I don't really know what NUMA means or which nodes you are talking
> about, sorry :(

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

> # cat /proc/sys/vm/zone_reclaim_mode
> cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory

OK, so NUMA is not enabled.
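For completeness, how many nodes the kernel sees can also be checked directly in sysfs. A sketch assuming the standard /sys/devices/system/node layout:

```shell
#!/bin/sh
# Count the NUMA nodes the kernel exposes; on a non-NUMA (or !CONFIG_NUMA)
# kernel this is at most one, and zone_reclaim_mode does not exist.
numa_nodes() {
    ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l
}

n=$(numa_nodes)
if [ "$n" -le 1 ]; then
    echo "single memory node - zone_reclaim_mode is not relevant"
else
    echo "$n nodes - check cpuset mems_allowed and zone_reclaim_mode"
fi
```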
--
Michal Hocko
SUSE Labs

2012-11-30 15:59:41

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>> Here is the full boot log:
>> http://www.watchdog.sk/lkml/kern.log
>
>The log is not complete. Could you paste the complete dmesg output? Or
>even better, do you have logs from the previous run?


What is missing there? All kernel messages are logged into /var/log/kern.log (it's the same as dmesg); dmesg itself was already overwritten by other messages. I think that is everything the kernel printed.

2012-11-30 16:19:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

On Fri 30-11-12 16:59:37, azurIt wrote:
> >> Here is the full boot log:
> >> http://www.watchdog.sk/lkml/kern.log
> >
> >The log is not complete. Could you paste the complete dmesg output? Or
> >even better, do you have logs from the previous run?
>
>
> What is missing there? All kernel messages are logged into
> /var/log/kern.log (it's the same as dmesg); dmesg itself was already
> overwritten by other messages. I think that is everything the kernel printed.

Early boot messages are missing - exactly the BIOS memory map I was
asking for. As NUMA has been excluded, it is probably not that
relevant anymore.
The important question is why you see VM_FAULT_OOM and whether a memcg
charging failure can trigger that. I do not see how this could happen
right now because __GFP_NORETRY is not used for user pages (except for
THP, which disables memcg OOM already), and file-backed page faults (aka
__do_fault) use mem_cgroup_newpage_charge, which doesn't disable OOM.
This is a real head scratcher.

Could you also post your complete containers configuration, maybe there
is something strange in there (basically grep . -r YOUR_CGROUP_MNT
except for tasks files which are of no use right now).
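In script form it could look like this ("/cgroups" is a placeholder for the actual mount point):

```shell
#!/bin/sh
# Dump every cgroup control file and its value, skipping the bulky
# per-group 'tasks' files. "/cgroups" is a placeholder mount point.
dump_cgroup_config() {
    grep -r --exclude=tasks . "$1" 2>/dev/null
}

dump_cgroup_config /cgroups | gzip > cgroups.gz
```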
--
Michal Hocko
SUSE Labs

2012-11-30 16:27:03

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>Could you also post your complete containers configuration, maybe there
>is something strange in there (basically grep . -r YOUR_CGROUP_MNT
>except for tasks files which are of no use right now).


Here it is:
http://www.watchdog.sk/lkml/cgroups.gz

2012-11-30 16:53:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

On Fri 30-11-12 17:26:51, azurIt wrote:
> >Could you also post your complete containers configuration, maybe there
> >is something strange in there (basically grep . -r YOUR_CGROUP_MNT
> >except for tasks files which are of no use right now).
>
>
> Here it is:
> http://www.watchdog.sk/lkml/cgroups.gz

The only strange thing I noticed is that some groups have a 0 limit. Is
this intentional?
grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
3 memory.limit_in_bytes:0
254 memory.limit_in_bytes:104857600
107 memory.limit_in_bytes:157286400
68 memory.limit_in_bytes:209715200
10 memory.limit_in_bytes:262144000
28 memory.limit_in_bytes:314572800
1 memory.limit_in_bytes:346030080
1 memory.limit_in_bytes:524288000
2 memory.limit_in_bytes:9223372036854775807
--
Michal Hocko
SUSE Labs

2012-11-30 20:43:18

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

>The only strange thing I noticed is that some groups have 0 limit. Is
>this intentional?
>grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
> 3 memory.limit_in_bytes:0


These are users who are not allowed to run anything.
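For context, such a group can be created like this (a sketch only; it requires root and assumes a v1 memory controller mounted at "/cgroups" - both placeholders):

```shell
# Sketch - needs root; "/cgroups" and "blockeduser" are placeholder names.
mkdir -p /cgroups/blockeduser
echo 0 > /cgroups/blockeduser/memory.limit_in_bytes
# Any task attached here fails its first charge, so nothing can run:
#   echo <pid> > /cgroups/blockeduser/tasks
```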


azur