Not a lot of changes since -rc8. Most notably perhaps some late nfs
and btrfs work, and a mips update. Along with some more vfs RCU lookup
fallout (which would only be noticeable with the filesystem exported
with nfsd, which is why nobody ever noticed).
And the usual driver updates, mostly media and GPU, but some
networking too. The appended shortlog is for the changes since -rc8,
and gives some feel for it. Nothing really too exciting, I think.
As to the "big picture", ie all the changes since 2.6.37, my personal
favorite remains the VFS name lookup changes. They did end up causing
some breakage, and Al has made it clear that he wants more cleanups,
but on the whole I think it was surprisingly smooth. I think we had
more problems with random other components (nasty memory corruption in
networking etc) than with the rather fundamental path lookup change.
So I'm hoping this ends up being a fairly calm release despite some
really deep changes like that.
Linus
---
Abhilash K V (1):
ASoC: AM3517: Update codec name after multi-component update
Al Viro (12):
minimal fix for do_filp_open() race
unfuck proc_sysctl ->d_compare()
nd->inode is not set on the second attempt in path_walk()
/proc/self is never going to be invalidated...
reiserfs xattr ->d_revalidate() shouldn't care about RCU
ceph: fix d_revalidate oopsen on NFS exports
fuse: fix d_revalidate oopsen on NFS exports
gfs2: fix d_revalidate oopsen on NFS exports
ocfs2: fix d_revalidate oopsen on NFS exports
jfs: fix d_revalidate oopsen on NFS exports
fat: fix d_revalidate oopsen on NFS exports
compat breakage in preadv() and pwritev()
Andrea Arcangeli (2):
x86/mm: Fix pgd_lock deadlock
thp: fix page_referenced to modify mapcount/vm_flags only if page is found
Andrey Vagin (1):
x86/mm: Handle mm_fault_error() in kernel space
Andy Adamson (2):
NFSv4: remove duplicate clientid in struct nfs_client
NFSv4.1 reclaim complete must wait for completion
Andy Walls (2):
[media] cx23885: Revert "Check for slave nack on all transactions"
[media] cx23885: Remove unused 'err:' labels to quiet compiler warning
Anoop P A (2):
MIPS: Select R4K timer lib for all MSP platforms
MIPS: MSP: Fix MSP71xx bpci interrupt handler return value
Antony Pavlov (2):
mtd: jedec_probe: Change variable name from cfi_p to cfi
mtd: jedec_probe: initialise make sector erase command variable
Antti Seppälä (1):
[media] Fix sysfs rc protocol lookup for rc-5-sz
Arnaldo Carvalho de Melo (1):
perf symbols: Fix vmlinux path when not using --symfs
Arnaud Patard (1):
[media] mantis_pci: remove asm/pgtable.h include
Axel Lin (3):
mtd: add "platform:" prefix for platform modalias
gpio: add MODULE_DEVICE_TABLE
watchdog: hpwdt: eliminate section mismatch warning
Balbir Singh (1):
sched: Fix sched rt group scheduling when hierachy is enabled
Ben Hutchings (1):
sunrpc: Propagate errors from xs_bind() through xs_create_sock()
Benjamin Herrenschmidt (2):
powerpc/iseries: Fix early init access to lppaca
powerpc/pseries: Disable VPNH feature
Benny Halevy (1):
NFSD: fix decode_cb_sequence4resok
Chris Mason (4):
Btrfs: fix regressions in copy_from_user handling
Btrfs: deal with short returns from copy_from_user
Btrfs: make sure not to return overlapping extents to fiemap
Btrfs: break out of shrink_delalloc earlier
Chuck Lever (1):
NFS: NFSROOT should default to "proto=udp"
Cliff Wickman (1):
x86, UV: Initialize the broadcast assist unit base destination
node id properly
Dan Carpenter (1):
watchdog: sch311x_wdt: fix printk condition
Daniel J Blueman (2):
x86, build: Make sure mkpiggy fails on read error
btrfs: fix dip leak
Daniel Turull (1):
pktgen: fix errata in show results
Dave Airlie (3):
drm/radeon: add pageflip hooks for fusion
drm/radeon: fix page flipping hangs on r300/r400
drm/radeon: fix problem with changing active VRAM size. (v2)
David Daney (5):
MIPS: Add an unreachable return statement to satisfy buggy GCCs.
MIPS: Fix GCC-4.6 'set but not used' warning in signal*.c
MIPS: Remove unused code from arch/mips/kernel/syscall.c
MIPS: Fix GCC-4.6 'set but not used' warning in ieee754int.h
MIPS: Fix GCC-4.6 'set but not used' warning in arch/mips/mm/init.c
David Howells (2):
MN10300: The SMP_ICACHE_INV_FLUSH_RANGE IPI command does not exist
MN10300: atomic_read() should ensure it emits a load
David S. Miller (2):
ipv4: Fix erroneous uses of ifa_address.
ipv6: Don't create clones of host routes.
Deng-Cheng Zhu (5):
MIPS, Perf-events: Work with irq_work
MIPS, Perf-events: Work with the new PMU interface
MIPS, Perf-events: Fix event check in validate_event()
MIPS, Perf-events: Work with the new callchain interface
MIPS, Perf-events: Use unsigned delta for right shift in event update
Devin Heitmueller (2):
[media] au0828: fix VBI handling when in V4L2 streaming mode
[media] cx18: Add support for Hauppauge HVR-1600 models with s5h1411
Dmitry Kravkov (4):
bnx2x: fix non-pmf device load flow
bnx2x: fix link notification
bnx2x: (NPAR) prevent HW access in D3 state
bnx2x: fix MaxBW configuration
Doe, YiCheng (1):
ipmi: Fix IPMI errors due to timing problems
Florian Fainelli (3):
r6040: bump to version 0.27 and date 23Feb2011
MIPS: MTX-1: Make au1000_eth probe all PHY addresses
MIPS: Alchemy: Fix reset for MTX-1 and XXS1500
Frank Filz (1):
(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause
problems if bit 31 or 63 are set in fileid
Grant Likely (1):
i2c-ocores: Fix pointer type mismatch error
Göran Weinholt (1):
net/smsc911x.c: Set the VLAN1 register to fix VLAN MTU problem
Hans de Goede (2):
hwmon/f71882fg: Fix a typo in a comment
hwmon/f71882fg: Set platform drvdata to NULL later
Huang Weiyi (1):
nfs4: remove duplicated #include
Hugh Dickins (1):
thp+memcg-numa: fix BUG at include/linux/mm.h:370!
J. Bruce Fields (2):
nfsd4: fix bad pointer on failure to find delegation
fs/dcache: allow d_obtain_alias() to return unhashed dentries
Jarod Wilson (3):
[media] nuvoton-cir: fix wake from suspend
[media] mceusb: don't claim multifunction device non-IR parts
[media] tda829x: fix regression in probe functions
Jeff Layton (1):
nfs: close NFSv4 COMMIT vs. CLOSE race
Jesper Juhl (1):
SUNRPC: Remove resource leak in svc_rdma_send_error()
Jiri Slaby (1):
watchdog: sbc_fitpc2_wdt, fix crash on systems without DMI_BOARD_NAME
Joakim Tjernlund (1):
mtd: fix race in cfi_cmdset_0001 driver
Jon Mason (1):
vxge: update MAINTAINERS
Jovi Zhang (1):
nfs: fix compilation warning
Lin Ming (1):
perf symbols: Avoid resolving [kernel.kallsyms] to real path for
buildid cache
Linus Torvalds (2):
Revert "oom: oom_kill_process: fix the child_points logic"
Linux 2.6.38
Lukas Czerner (1):
block: fix mis-synchronisation in blkdev_issue_zeroout()
Maksim Rayskiy (1):
MIPS: Move idle task creation to work queue
Malcolm Priestley (1):
[media] DM04/QQBOX memcpy to const char fix
Marco Stornelli (1):
Check for immutable/append flag in fallocate path
Mark Brown (4):
ASoC: Fix broken bitfield definitions in WM8978
ASoC: Use the correct DAPM context when cleaning up final widget set
ASoC: Fix typo in late revision WM8994 DAC2R name
ASoC: Ensure WM8958 gets all WM8994 late revision widgets
Matt Turner (1):
alpha: fix compile error from IRQ clean up
Mauro Carvalho Chehab (1):
[media] ir-raw: Properly initialize the IR event (BZ#27202)
Maurus Cuelenaere (1):
MIPS: Jz4740: Add HAVE_CLK
Maxim Levitsky (1):
mtd: mtd_blkdevs: fix double free on error path
Miao Xie (1):
btrfs: fix not enough reserved space
Michael (1):
[media] ivtv: Fix corrective action taken upon DMA ERR interrupt
to avoid hang
Michal Marek (1):
kbuild: Fix computing srcversion for modules
Naga Chumbalkar (2):
x86: Don't check for BIOS corruption in first 64K when there's no need to
[CPUFREQ] pcc-cpufreq: don't load driver if get_freq fails during init.
Neil Horman (1):
rds: prevent BUG_ON triggering on congestion map updates
Nicholas Bellinger (1):
[SCSI] target: Fix t_transport_aborted handling in LUN_RESET +
active I/O shutdown
Nicolas Kaiser (1):
drivers/net/macvtap: fix error check
Nils Carlson (2):
bonding 802.3ad: Fix the state machine locking v2
bonding 802.3ad: Rename rx_machine_lock to state_machine_lock
Ohad Ben-Cohen (1):
mmc: fix CONFIG_MMC_UNSAFE_RESUME regression
Oleg Nesterov (1):
oom: oom_kill_process: fix the child_points logic
Olivier Grenie (1):
[media] DiB7000M: add pid filtering
Pawel Osciak (1):
[media] Fix double free of video_device in mem2mem_testdev
Rainer Weikusat (1):
net: fix multithreaded signal handling in unix recv routines
Rajendra Nayak (1):
i2c-omap: Program I2C_WE on OMAP4 to enable i2c wakeup
Randy Dunlap (1):
net: bridge builtin vs. ipv6 modular
Ricardo Labiaga (1):
NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY
Robert Millan (1):
MIPS: Loongson: Remove ad-hoc cmdline default
Sebastian Andrzej Siewior (1):
x86: ce4100: Set pci ops via callback instead of module init
Shawn Lin (1):
r6040: fix multicast operations
Stanislav Fomichev (1):
nfs: add kmalloc return value check in decode_and_add_ds
Stanislaw Gruszka (1):
mtd: amd76xrom: fix oops at boot when resources are not available
Stefan Oberhumer (1):
MIPS: Clear the correct flag in sysmips(MIPS_FIXADE, ...).
Stefan Weil (1):
MIPS: Loongson: Fix potentially wrong string handling
Stephen Rothwell (2):
sysctl: the include of rcupdate.h is only needed in the kernel
sysctl: the include of rcupdate.h is only needed in the kernel
Sven Barth (1):
[media] cx25840: fix probing of cx2583x chips
Takashi Iwai (1):
drm/i915: Revive combination mode for backlight control
Thomas Gleixner (1):
MIPS: Replace deprecated spinlock initialization
Thomas Graf (1):
net: Enter net/ipv6/ even if CONFIG_IPV6=n
Timo Warns (1):
Fix corrupted OSF partition table parsing
Tkhai Kirill (1):
MN10300: Proper use of macros get_user() in the case of
incremented pointers
Trond Myklebust (5):
SUNRPC: Close a race in __rpc_wait_for_completion_task()
NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses
NFSv4.1: Fix the handling of the SEQUENCE status bits
NFSv4: Fix the setlk error handler
NFSv4: nfs4_state_mark_reclaim_nograce() should be static
Vasiliy Kulikov (1):
net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules
Wim Van Sebroeck (3):
watchdog: cpwd: Fix buffer-overflow
watchdog: sch311x_wdt: Fix LDN active check
watchdog: w83697ug_wdt: Fix set bit 0 to activate GPIO2
Wolfram Sang (1):
i2c-eg20t: include slab.h for memory allocations
Wu Zhangjin (5):
MIPS, Tracing: Speed up function graph tracer
MIPS, Tracing: Substitute in_kernel_space() for in_module()
MIPS, Tracing: Clean up prepare_ftrace_return()
MIPS, Tracing: Clean up ftrace_make_nop()
MIPS, Tracing: Fix set_graph_function of function graph tracer
Yinghai Lu (1):
x86, numa: Fix numa_emulation code with memory-less node0
Yoichi Yuasa (1):
MIPS: Fix always CONFIG_LOONGSON_UART_BASE=y
[email protected] (1):
ariadne: remove redundant NULL check
roel (1):
nfsd: wrong index used in inner loop
sensoray-dev (1):
[media] s2255drv: firmware re-loading changes
stephen hemminger (1):
ip6ip6: autoload ip6 tunnel
This kernel includes a broken commit that was merged a couple of hours
before release:
commit dc1b83ab08f1954335692cdcd499f78c94f4c42a
Author: Oleg Nesterov <[email protected]>
Date: Mon Mar 14 20:05:30 2011 +0100
oom: oom_kill_process: fix the child_points logic
oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.
Also, "children has a different mm" doesn't match the reality, we should
check child->mm != t->mm. This check is not exactly correct if t->mm ==
NULL but this doesn't really matter, oom_kill_task() will kill them
anyway.
Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
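For illustration, the selection loop this commit touches looks roughly like
the following (a sketch of the code as the commit left it, reconstructed from
the revert hunk at the end of this mail, not the verbatim tree code). With
victim_points starting at 0, the first killable child examined always beats
the parent:

	struct task_struct *victim = p;
	unsigned int victim_points = 0;	/* the problematic start value */
	...
	do {
		list_for_each_entry(child, &t->children, sibling) {
			unsigned int child_points;

			/* oom_badness() returns 0 if the thread is unkillable */
			child_points = oom_badness(child, mem, nodemask, totalpages);
			if (child_points > victim_points) {	/* 0 loses to any child */
				victim = child;
				victim_points = child_points;
			}
		}
	} while_each_thread(p, t);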
As a result of this change, the oom killer will no longer attempt to
sacrifice a child of the selected process in favor of the parent unless
its memory usage exceeds the parent (and this will be an unreachable state
once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged
from -mm).
This means systems running a webserver, for example, will kill the
webserver itself in oom conditions and not one of its threads serving a
connection; simply forking too many client connections in this scenario
would lead to an oom condition that would kill the server instead of one
of its threads.
Admins who find this behavior to cause disruptions in service should apply
the following revert.
Signed-off-by: David Rientjes <[email protected]>
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -458,10 +458,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
- struct task_struct *victim;
+ struct task_struct *victim = p;
struct task_struct *child;
- struct task_struct *t;
- unsigned int victim_points;
+ struct task_struct *t = p;
+ unsigned int victim_points = 0;
if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -487,15 +487,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
- victim_points = oom_badness(p, mem, nodemask, totalpages);
- victim = p;
- t = p;
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
- if (child->mm == t->mm)
- continue;
/*
* oom_badness() returns 0 if the thread is unkillable
*/
On Mon, Mar 14, 2011 at 06:49:37PM -0700, Linus Torvalds wrote:
>
> So I'm hoping this ends up being a fairly calm release despite some
> really deep changes like that.
It's so calm, it's like it's not even there.
-- Steve
On Mon, Mar 14, 2011 at 08:13:38PM -0700, David Rientjes wrote:
> This kernel includes a broken commit that was merged a couple of hours
> before release:
>
> commit dc1b83ab08f1954335692cdcd499f78c94f4c42a
> Author: Oleg Nesterov <[email protected]>
> Date: Mon Mar 14 20:05:30 2011 +0100
>
> oom: oom_kill_process: fix the child_points logic
>
Don't worry. If you download the patch for 2.6.38, you'll see that the
revert was in the final release.
Woo hoo, ketchup is useful again!
-- Steve
On Mon, Mar 14, 2011 at 8:14 PM, Steven Rostedt <[email protected]> wrote:
> On Mon, Mar 14, 2011 at 06:49:37PM -0700, Linus Torvalds wrote:
>>
>> So I'm hoping this ends up being a fairly calm release despite some
>> really deep changes like that.
>
> It's so calm, it's like it's not even there.
Yes, it's a very Zen release.
I'd uploaded the patch and tar-ball, but forgot to actually push out.
Usually it's the other way around.
Linus
On Mon, Mar 14, 2011 at 8:13 PM, David Rientjes <[email protected]> wrote:
> This kernel includes a broken commit that was merged a couple of hours
> before release:
Actually, it doesn't. It got reverted before the release because of
the worries about it.
Linus
On Mon, 14 Mar 2011 20:13:38 -0700 (PDT) David Rientjes <[email protected]> wrote:
> once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged
> from -mm
Please (re)send the patches which you believe should be merged into
2.6.38 to address the problems which Oleg found, and any other critical
problems. Not in a huge rush - let's get this right.
On Mon, 14 Mar 2011, Andrew Morton wrote:
> > once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged
> > from -mm
>
> Please (re)send the patches which you believe should be merged into
> 2.6.38 to address the problems which Oleg found, and any other critical
> problems. Not in a huge rush - let's get this right.
>
In my testing, Oleg's three test cases that he sent to the security list
and cc'd us on get appropriately oom killed once swap is exhausted or
swapoff -a is used on mmotm-2011-03-10-16-42 because of these two patches:
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
oom-skip-zombies-when-iterating-tasklist.patch
He also presented a test case on linux-mm that caused the oom killer to
avoid acting if a thread is ptracing a thread in the exit path with
PTRACE_O_TRACEEXIT. That should be fixed with
http://marc.info/?l=linux-mm&m=129997893430351
that has yet to see -mm. There are no other test cases that have been
presented that cause undesired behavior.
That said, my approach to doing this has been to avoid arbitrary
heuristics for special cases and address known issues by adding the
appropriate logic in the oom killer. For example, the ptrace problem that
Oleg presented showed that the oom killer logic incorrectly deferred doing
anything when an eligible thread was PF_EXITING. It had done that
believing that nothing would stop the thread from exiting or current
would be given access to memory reserves itself and that assumption was
broken for PTRACE_O_TRACEEXIT. My patch above, in combination with
Andrey's patch that only considers threads with valid mm's, fixes that
issue because we'll now only defer if you still have an attached mm, are
PF_EXITING, and are not being traced.
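Concretely, the deferral rule with those patches applied ends up looking
roughly like this (a condensed sketch of the combined changes to
select_bad_process(), not the literal hunks):

	if (!p->mm)		/* already detached its mm: never defer on it */
		continue;
	...
	if ((p->flags & PF_EXITING) && p != current) {
		/*
		 * Wait for the exiting task to release its memory, unless a
		 * ptracer has stopped it at PTRACE_EVENT_EXIT and it may
		 * never make progress on its own.
		 */
		if (!(task_ptrace(p->group_leader) & PT_TRACE_EXIT))
			return ERR_PTR(-1UL);	/* defer the oom killer */
	}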
If, at some point, there is another gap in the exit code where a thread
may hold PF_EXITING with a valid mm for an indefinite period, we'll need
to address that in the oom killer as well. We use PF_EXITING specifically
in the oom killer to identify tasks that are going to exit soon and need
handling for any case where that isn't guaranteed. Anything else results
in needlessly killing other tasks or, in the worst case, panicking when
there is nothing left that is eligible.
On Mon, 14 Mar 2011, Linus Torvalds wrote:
> > This kernel includes a broken commit that was merged a couple of hours
> > before release:
>
> Actually, it doesn't. It got reverted before the release because of
> the worries about it.
>
Looks good, thanks!
On Mon, 14 Mar 2011 21:50:24 -0700 (PDT) David Rientjes <[email protected]> wrote:
> On Mon, 14 Mar 2011, Andrew Morton wrote:
>
> > > once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged
> > > from -mm
> >
> > Please (re)send the patches which you believe should be merged into
> > 2.6.38 to address the problems which Oleg found, and any other critical
> > problems. Not in a huge rush - let's get this right.
> >
>
> In my testing, Oleg's three test cases that he sent to the security list
> and cc'd us on get appropriately oom killed once swap is exhausted or
> swapoff -a is used on mmotm-2011-03-10-16-42 because of these two patches:
>
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> oom-skip-zombies-when-iterating-tasklist.patch
>
> He also presented a test case on linux-mm that caused the oom killer to
> avoid acting if a thread is ptracing a thread in the exit path with
> PTRACE_O_TRACEEXIT. That should be fixed with
>
> http://marc.info/?l=linux-mm&m=129997893430351
>
> that has yet to see -mm. There are no other test cases that have been
> presented that cause undesired behavior.
>
> That said, my approach to doing this has been to avoid arbitrary
> heuristics for special cases and address known issues by adding the
> appropriate logic in the oom killer. For example, the ptrace problem that
> Oleg presented showed that the oom killer logic incorrectly deferred doing
> anything when an eligible thread was PF_EXITING. It had done that
> believing that nothing would stop the thread from exiting or current
> would be given access to memory reserves itself and that assumption was
> broken for PTRACE_O_TRACEEXIT. My patch above, in combination with
> Andrey's patch that only considers threads with valid mm's, fixes that
> issue because we'll now only defer if you still have an attached mm, are
> PF_EXITING, and are not being traced.
>
> If, at some point, there is another gap in the exit code where a thread
> may hold PF_EXITING with a valid mm for an indefinite period, we'll need
> to address that in the oom killer as well. We use PF_EXITING specifically
> in the oom killer to identify tasks that are going to exit soon and need
> handling for any case where that isn't guaranteed. Anything else results
> in needlessly killing other tasks or, in the worst case, panicking when
> there is nothing left that is eligible.
So we're talking about three patches:
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
oom-skip-zombies-when-iterating-tasklist.patch
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch
all appended below.
About all of which Oleg had serious complaints, some of which haven't
yet been addressed.
And that's OK. As I said, please let's work through it and get it right.
From: David Rientjes <[email protected]>
This patch prevents unnecessary oom kills or kernel panics by reverting
two commits:
495789a5 (oom: make oom_score to per-process value)
cef1d352 (oom: multi threaded process coredump don't make deadlock)
First, 495789a5 (oom: make oom_score to per-process value) ignores the
fact that all threads in a thread group do not necessarily exit at the
same time.
It is imperative that select_bad_process() detect threads that are in the
exit path, specifically those with PF_EXITING set, to prevent needlessly
killing additional tasks. If a process is oom killed and the thread group
leader exits, select_bad_process() cannot detect the other threads that
are PF_EXITING by iterating over only processes. Thus, it currently
chooses another task unnecessarily for oom kill or panics the machine when
nothing else is eligible.
By iterating over threads instead, it is possible to detect threads that
are exiting and nominate them for oom kill so they get access to memory
reserves.
Second, cef1d352 (oom: multi threaded process coredump don't make
deadlock) erroneously avoids making the oom killer a no-op when an
eligible thread other than current is found to be exiting. We want to
detect this situation so that we may allow that exiting thread time to
exit and free its memory; if it is able to exit on its own, that should
free memory so current is no longer oom. If it is not able to exit on its
own, the oom killer will nominate it for oom kill which, in this case,
only means it will get access to memory reserves.
Without this change, it is easy for the oom killer to unnecessarily target
tasks when all threads of a victim don't exit before the thread group
leader or, in the worst case, panic the machine.
Signed-off-by: David Rientjes <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrey Vagin <[email protected]>
Cc: <[email protected]> [2.6.38.x]
Signed-off-by: Andrew Morton <[email protected]>
---
mm/oom_kill.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff -puN mm/oom_kill.c~oom-prevent-unnecessary-oom-kills-or-kernel-panics mm/oom_kill.c
--- a/mm/oom_kill.c~oom-prevent-unnecessary-oom-kills-or-kernel-panics
+++ a/mm/oom_kill.c
@@ -292,11 +292,11 @@ static struct task_struct *select_bad_pr
unsigned long totalpages, struct mem_cgroup *mem,
const nodemask_t *nodemask)
{
- struct task_struct *p;
+ struct task_struct *g, *p;
struct task_struct *chosen = NULL;
*ppoints = 0;
- for_each_process(p) {
+ do_each_thread(g, p) {
unsigned int points;
if (oom_unkillable_task(p, mem, nodemask))
@@ -324,7 +324,7 @@ static struct task_struct *select_bad_pr
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
+ if ((p->flags & PF_EXITING) && p->mm) {
if (p != current)
return ERR_PTR(-1UL);
@@ -337,7 +337,7 @@ static struct task_struct *select_bad_pr
chosen = p;
*ppoints = points;
}
- }
+ } while_each_thread(g, p);
return chosen;
}
_
From: Andrey Vagin <[email protected]>
We shouldn't defer oom killing if a thread has already detached its ->mm
and still has TIF_MEMDIE set. Memory needs to be freed, so kill other
threads that pin the same ->mm or find another task to kill.
Signed-off-by: Andrey Vagin <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: <[email protected]> [2.6.38.x]
Signed-off-by: Andrew Morton <[email protected]>
---
mm/oom_kill.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff -puN mm/oom_kill.c~oom-skip-zombies-when-iterating-tasklist mm/oom_kill.c
--- a/mm/oom_kill.c~oom-skip-zombies-when-iterating-tasklist
+++ a/mm/oom_kill.c
@@ -299,6 +299,8 @@ static struct task_struct *select_bad_pr
do_each_thread(g, p) {
unsigned int points;
+ if (!p->mm)
+ continue;
if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -324,7 +326,7 @@ static struct task_struct *select_bad_pr
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if ((p->flags & PF_EXITING) && p->mm) {
+ if (p->flags & PF_EXITING) {
if (p != current)
return ERR_PTR(-1UL);
_
From: David Rientjes <[email protected]>
The oom killer naturally defers killing anything if it finds an eligible
task that is already exiting and has yet to detach its ->mm. This avoids
unnecessarily killing tasks when one is already in the exit path and may
free enough memory that the oom killer is no longer needed. This is
detected by PF_EXITING since threads that have already detached their ->mm
are no longer considered at all.
The problem with always deferring when a thread is PF_EXITING, however, is
that it may never actually exit when being traced, specifically if another
task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want
to defer in this case since there is no guarantee that thread will ever
exit without intervention.
This patch will now only defer the oom killer when a thread is PF_EXITING
and no ptracer has stopped its progress in the exit path. It also ensures
that a child is sacrificed for the chosen parent only if it has a
different ->mm as the comment implies: this ensures that the thread group
leader is always targeted appropriately.
Signed-off-by: David Rientjes <[email protected]>
Reported-by: Oleg Nesterov <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrey Vagin <[email protected]>
Cc: <[email protected]> [2.6.38.x]
Signed-off-by: Andrew Morton <[email protected]>
---
mm/oom_kill.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff -puN mm/oom_kill.c~oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced mm/oom_kill.c
--- a/mm/oom_kill.c~oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced
+++ a/mm/oom_kill.c
@@ -31,6 +31,7 @@
#include <linux/memcontrol.h>
#include <linux/mempolicy.h>
#include <linux/security.h>
+#include <linux/ptrace.h>
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
@@ -316,22 +317,29 @@ static struct task_struct *select_bad_pr
if (test_tsk_thread_flag(p, TIF_MEMDIE))
return ERR_PTR(-1UL);
- /*
- * This is in the process of releasing memory so wait for it
- * to finish before killing some other task by mistake.
- *
- * However, if p is the current task, we allow the 'kill' to
- * go ahead if it is exiting: this will simply set TIF_MEMDIE,
- * which will allow it to gain access to memory reserves in
- * the process of exiting and releasing its resources.
- * Otherwise we could get an easy OOM deadlock.
- */
if (p->flags & PF_EXITING) {
- if (p != current)
- return ERR_PTR(-1UL);
-
- chosen = p;
- *ppoints = 1000;
+ /*
+ * If p is the current task and is in the process of
+ * releasing memory, we allow the "kill" to set
+ * TIF_MEMDIE, which will allow it to gain access to
+ * memory reserves. Otherwise, it may stall forever.
+ *
+ * The loop isn't broken here, however, in case other
+ * threads are found to have already been oom killed.
+ */
+ if (p == current) {
+ chosen = p;
+ *ppoints = 1000;
+ } else {
+ /*
+ * If this task is not being ptraced on exit,
+ * then wait for it to finish before killing
+ * some other task unnecessarily.
+ */
+ if (!(task_ptrace(p->group_leader) &
+ PT_TRACE_EXIT))
+ return ERR_PTR(-1UL);
+ }
}
points = oom_badness(p, mem, nodemask, totalpages);
@@ -493,6 +501,8 @@ static int oom_kill_process(struct task_
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
+ if (child->mm == p->mm)
+ continue;
/*
* oom_badness() returns 0 if the thread is unkillable
*/
_
On 03/14, David Rientjes wrote:
>
> He also presented a test case on linux-mm that caused the oom killer to
> avoid acting if a thread is ptracing a thread in the exit path with
> PTRACE_O_TRACEEXIT. That should be fixed with
>
> http://marc.info/?l=linux-mm&m=129997893430351
I don't think it can fix this. I didn't verify this, but the slightly
different test-case below should have the same effect.
But this doesn't matter. We can fix this particular case, and we have
the problems with the coredump anyway.
What I can't understand is what exactly the first patch tries to fix.
When I ask you, you tell me that for_each_process() can miss the group
leader because it can exit before sub-threads. This must not happen,
or we have some serious bug triggered by your workload.
So, once again. Could you please explain the original problem and how
this patch helps?
Oleg.
#include <unistd.h>
#include <signal.h>
#include <pthread.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <assert.h>
#include <stdio.h>
/* The thread started with arg != NULL traces itself, stops, and (once the
 * tracer resumes it) sends SIGQUIT to the other thread, so the thread
 * group starts exiting while this thread is still being ptraced. */
void *tfunc(void *arg)
{
	if (arg) {
		ptrace(PTRACE_TRACEME, 0, 0, 0);
		raise(SIGSTOP);
		pthread_kill(*(pthread_t *)arg, SIGQUIT);
	}
	pause();
	return NULL;
}
int main(void)
{
int pid;
if (!fork()) {
pthread_t thread1, thread2;
pthread_create(&thread1, NULL, tfunc, NULL);
pthread_create(&thread2, NULL, tfunc, &thread1);
pause();
return 0;
}
assert((pid = waitpid(-1, NULL, __WALL)) > 0);
assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEEXIT) == 0);
assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
wait(NULL);
pause();
return 0;
}
[Yesterday's earthquake was announced as magnitude 6.5, but an M6 quake is
no longer treated as significant news in this country. We are living
in a slightly floating mood.]
> So we're talking about three patches:
>
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> oom-skip-zombies-when-iterating-tasklist.patch
> oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch
>
> all appended below.
>
> About all of which Oleg had serious complaints, some of which haven't
> yet been addressed.
>
> And that's OK. As I said, please let's work through it and get it right.
I don't understand what "OK" means here or what you want to talk about;
probably the reason is my language skill, or that I haven't caught up with
the discussion between Oleg and David. Instead, I'll post the current state
of my debugging.
o vmscan.c#all_unreclaimable() can return a false negative and wrongly
prevent the oom-killer (see the sketch after this list). Why?
zone->pages_scanned is not protected by a lock; in other words, it is an
unstable value. On the other hand, x86 ZONE_DMA has only a very little
memory, so once it becomes all_unreclaimable=yes it usually never recovers
to all_unreclaimable=no. So if the zone state becomes mismatched
(eg pages_scanned=0 and all_unreclaimable=yes), it can never be corrected.
With this I could reproduce the issue Andrey reported.
o oom_kill.c#boost_dying_task_prio() makes the kernel hang if the user
is using cpu cgroups, because the cpu cgroup has an inadequate default
RT rt_runtime_us (0 by default; 0 means RT tasks can't run at all).
o The oom_kill.c#TIF_MEMDIE check makes the kernel hang. I haven't caught
the exact reason why an oom-killed process gets stuck even though the zone
has enough memory.
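For reference, this is roughly the check that goes wrong; the
zone_reclaimable() body below is my reconstruction from 2.6.38-era
mm/vmscan.c, so treat it as an assumption (all_unreclaimable() in direct
reclaim simply loops over the zonelist calling it):

	static bool zone_reclaimable(struct zone *zone)
	{
		/*
		 * pages_scanned and all_unreclaimable are updated without a
		 * lock, so a racing reset can leave pages_scanned == 0 while
		 * all_unreclaimable == 1. This check then keeps reporting the
		 * zone as reclaimable and the oom-killer is never invoked.
		 */
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}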
What I dislike is that many people on the list enjoy the flamewar, but
almost nobody except a few developers runs the real code or joins in
debugging the real, reported issue. In fact, Andrey made a testcase,
reported his test environment, and helped us build a reproduction
environment.
I also dislike it when developers say they haven't seen the oom livelock
case yet. It indicates they haven't tested a stress-workload oom scenario.
How can I trust an untested patch, or people who haven't tested? Every
developer should test until they have seen the oom livelock.
I know oom debugging is very painful and takes a lot of time: many false
positives, many unfixable livelocks, a million resets. But I don't think
that is a good reason to accept untested patches.
Right now I only have access to a three year old PC, so there is no reason
anyone else can't debug the issue.
On my i915 using Acer TravelMate 5735z I could run kernel 2.6.38-rc8
with KMS. On 2.6.38 I get a black screen instead. In case anyone
cares, just tell me what information you need. (bisect result on
request, but I assume the experts know which patch caused it.)
m.
Ccing some relevant people (you should have done this).
On 03/16/2011 06:30 PM, Melchior FRANZ wrote:
> On my i915 using Acer TravelMate 5735z I could run kernel 2.6.38-rc8
> with KMS. On 2.6.38 I get a black screen instead. In case anyone
> cares, just tell me what information you need. (bisect result on
> request, but I assume the experts know which patch caused it.)
--
js
On Wed, 16 Mar 2011 18:30:51 +0100, Melchior FRANZ <[email protected]> wrote:
> On my i915 using Acer TravelMate 5735z I could run kernel 2.6.38-rc8
> with KMS. On 2.6.38 I get a black screen instead. In case anyone
> cares, just tell me what information you need. (bisect result on
> request, but I assume the experts know which patch caused it.)
There's only one patch directly related to i915, so you could begin there.
Useful information to include is a dmesg (particularly one with
drm.debug=0xe kernel parameters) and lspci (though google says you have a
gm45).
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
* Chris Wilson -- Wednesday 16 March 2011:
> There's only one patch directly related to i915, so you could begin there.
I'll try later. Was just too obvious for now. :-}
> Useful information to include is a dmesg (particularly one with
> drm.debug=0xe kernel parameters) and lspci
OK, thanks for that info. Attached.
> (though google says you have a gm45).
I didn't look it up before, I just trusted the kernel with its "i915".
But you are probably right, this Acer TravelMate 5735Z-452G32Mnss is
supposed to have an "Intel Graphics Media Accelerator 4500MHD".
m.
On Wednesday, 16 March 2011 at 18:30:51, Melchior FRANZ wrote:
> On my i915 using Acer TravelMate 5735z I could run kernel 2.6.38-rc8
> with KMS. On 2.6.38 I get a black screen instead. In case anyone
> cares, just tell me what information you need. (bisect result on
> request, but I assume the experts know which patch caused it.)
>
> m.
I created a Bugzilla entry at
https://bugzilla.kernel.org/show_bug.cgi?id=31522
for your bug report, please add your address to the CC list in there, thanks!
--
Maciej Rutecki
http://www.maciek.unixy.pl
Hi
I'm back. Andrey's (attached) fork bomb testcase effectively kills
my machine when swap is disabled, so I've made a few anti-Andrey
test patches.
These patches only avoid the kernel livelock; they don't wipe out
fork bombs. Kamezawa-san is working on that effort.
Comments are welcome.
The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.
2006 Sep 25; commit 408d8544; oom: use unreclaimable info
It has been through a strange history since then. First, the following
commit broke the logic unintentionally.
2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
costly-order allocations
Two years later, I found an obviously meaningless code fragment and
restored the original intention with the following commit.
2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
return value when priority==0
But the logic didn't work when a 32bit highmem system goes into
hibernation, and Minchan slightly changed the algorithm and fixed it.
2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
in direct reclaim path
But recently, Andrey Vagin found a new corner case. Look,
struct zone {
..
int all_unreclaimable;
..
unsigned long pages_scanned;
..
}
zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock. Therefore a zone can end up in a state
with zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
the current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.
Is this an ignorable minor issue? No. Unfortunately, x86 has a very
small dma zone and it becomes zone->all_unreclaimable=1 easily. And
once it becomes all_unreclaimable, it never returns to all_unreclaimable=0
because it typically has no reclaimable pages.
Eventually, the oom-killer never works on such systems. Let's remove
this problematic logic completely.
Reported-by: Andrey Vagin <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 36 +-----------------------------------
1 files changed, 1 insertions(+), 35 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..254aada 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
}
/*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
-static bool all_unreclaimable(struct zonelist *zonelist,
- struct scan_control *sc)
-{
- struct zoneref *z;
- struct zone *zone;
- bool all_unreclaimable = true;
-
- for_each_zone_zonelist_nodemask(zone, z, zonelist,
- gfp_zone(sc->gfp_mask), sc->nodemask) {
- if (!populated_zone(zone))
- continue;
- if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
- continue;
- if (zone_reclaimable(zone)) {
- all_unreclaimable = false;
- break;
- }
- }
-
- return all_unreclaimable;
-}
-
-/*
* This is the main entry point to direct page reclaim.
*
* If a full scan of the inactive list fails to free enough memory then we
@@ -2105,14 +2078,7 @@ out:
delayacct_freepages_end();
put_mems_allowed();
- if (sc->nr_reclaimed)
- return sc->nr_reclaimed;
-
- /* top priority shrink_zones still had more to do? don't OOM, then */
- if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
- return 1;
-
- return 0;
+ return sc->nr_reclaimed;
}
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
--
1.6.5.2
When plenty of processes (eg a fork bomb) are running, the TIF_MEMDIE task
never exits; at least, to a human it feels like never. Therefore the kernel
appears to hang.
"perf sched" gives us a hint.
------------------------------------------------------------------------------
Task | Runtime ms | Average delay ms | Maximum delay ms |
------------------------------------------------------------------------------
python:1754 | 0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
python:1843 | 0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
python:1715 | 0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
python:1818 | 2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
...
...
The process flood causes crazy scheduler delays, and then the victim process
can't get enough CPU time to make progress. Grr. What should we do?
Fortunately, we already have an anti process flood framework: autogroup!
This patch reuses that framework to avoid the kernel livelock.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/oom.h | 1 +
include/linux/sched.h | 4 ++++
init/main.c | 2 ++
kernel/sched_autogroup.c | 4 ++--
mm/oom_kill.c | 23 +++++++++++++++++++++++
5 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..86bcea3 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,6 +67,7 @@ extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long uptime);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
+extern void oom_init(void);
/* sysctls */
extern int sysctl_oom_dump_tasks;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 98fc7ed..bdaad3f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1947,6 +1947,8 @@ int sched_rt_handler(struct ctl_table *table, int write,
#ifdef CONFIG_SCHED_AUTOGROUP
extern unsigned int sysctl_sched_autogroup_enabled;
+extern struct autogroup *autogroup_create(void);
+extern void autogroup_move_group(struct task_struct *p, struct autogroup *ag);
extern void sched_autogroup_create_attach(struct task_struct *p);
extern void sched_autogroup_detach(struct task_struct *p);
extern void sched_autogroup_fork(struct signal_struct *sig);
@@ -1956,6 +1958,8 @@ extern void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_fil
extern int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice);
#endif
#else
+extern struct autogroup *autogroup_create(void) { return NULL; }
+extern void autogroup_move_group(struct task_struct *p, struct autogroup *ag) {}
static inline void sched_autogroup_create_attach(struct task_struct *p) { }
static inline void sched_autogroup_detach(struct task_struct *p) { }
static inline void sched_autogroup_fork(struct signal_struct *sig) { }
diff --git a/init/main.c b/init/main.c
index 4a9479e..2c6e8da 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
#include <linux/shmem_fs.h>
#include <linux/slab.h>
#include <linux/perf_event.h>
+#include <linux/oom.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -549,6 +550,7 @@ asmlinkage void __init start_kernel(void)
gfp_allowed_mask = __GFP_BITS_MASK;
kmem_cache_init_late();
+ oom_init();
/*
* HACK ALERT! This is early. We're enabling the console before
diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
index 5946ac5..6a1a2c4 100644
--- a/kernel/sched_autogroup.c
+++ b/kernel/sched_autogroup.c
@@ -63,7 +63,7 @@ static inline struct autogroup *autogroup_task_get(struct task_struct *p)
static void free_rt_sched_group(struct task_group *tg);
#endif
-static inline struct autogroup *autogroup_create(void)
+struct autogroup *autogroup_create(void)
{
struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
struct task_group *tg;
@@ -143,7 +143,7 @@ autogroup_task_group(struct task_struct *p, struct task_group *tg)
return tg;
}
-static void
+void
autogroup_move_group(struct task_struct *p, struct autogroup *ag)
{
struct autogroup *prev;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 739dee4..2519e6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,6 +38,28 @@ int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
static DEFINE_SPINLOCK(zone_scan_lock);
+#ifdef CONFIG_SCHED_AUTOGROUP
+struct autogroup *oom_ag;
+
+void __init oom_init(void)
+{
+ oom_ag = autogroup_create();
+}
+
+static void oom_move_oom_ag(struct task_struct *p)
+{
+ autogroup_move_group(p, oom_ag);
+}
+#else
+void __init oom_init(void)
+{
+}
+
+static void oom_move_oom_ag(struct task_struct *p)
+{
+}
+#endif
+
#ifdef CONFIG_NUMA
/**
* has_intersects_mems_allowed() - check task eligiblity for kill
@@ -432,6 +454,7 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
}
set_tsk_thread_flag(p, TIF_MEMDIE);
+ oom_move_oom_ag(p);
force_sig(SIGKILL, p);
return 0;
--
1.6.5.2
commit 2687a356 (Add lock_page_killable) introduced a killable
lock_page(). Similarly, this patch introduces a killable
wait_on_page_locked().
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/pagemap.h | 9 +++++++++
mm/filemap.c | 11 +++++++++++
2 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e407601..49f9315 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -369,6 +369,15 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
*/
extern void wait_on_page_bit(struct page *page, int bit_nr);
+extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+
+static inline int wait_on_page_locked_killable(struct page *page)
+{
+ if (PageLocked(page))
+ return wait_on_page_bit_killable(page, PG_locked);
+ return 0;
+}
+
/*
* Wait for a page to be unlocked.
*
diff --git a/mm/filemap.c b/mm/filemap.c
index a6cfecf..f5f9ac2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -608,6 +608,17 @@ void wait_on_page_bit(struct page *page, int bit_nr)
}
EXPORT_SYMBOL(wait_on_page_bit);
+int wait_on_page_bit_killable(struct page *page, int bit_nr)
+{
+ DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+
+ if (!test_bit(bit_nr, &page->flags))
+ return 0;
+
+ return __wait_on_bit(page_waitqueue(page), &wait, sync_page_killable,
+ TASK_KILLABLE);
+}
+
/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
* @page: Page defining the wait queue of interest
--
1.6.5.2
When the oom killer fires, most processes get stuck at one of the following
two points.
1) __alloc_pages_nodemask
2) __lock_page_or_retry
1) is not much of a problem because TIF_MEMDIE makes the allocation
fail and the task gets out of the page allocator. 2) is more problematic. In
an OOM situation, zones typically have no page cache at all, and memory
starvation can reduce IO performance severely. When a fork bomb occurs, the
TIF_MEMDIE task doesn't die quickly, which means the fork bomb can create
new processes faster than the oom-killer can kill them. Then the system
may livelock.
This patch makes the page fault interruptible by SIGKILL.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
arch/x86/mm/fault.c | 9 +++++++++
include/linux/mm.h | 1 +
mm/filemap.c | 22 +++++++++++++++++-----
3 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 20e3f87..797c7d0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+ flags |= FAULT_FLAG_KILLABLE;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1138,6 +1139,14 @@ good_area:
}
/*
+ * Pagefault was interrupted by SIGKILL. We have no reason to
+ * continue pagefault.
+ */
+ if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
+ fatal_signal_pending(current))
+ return;
+
+ /*
* Major/minor page fault accounting is only done on the
* initial attempt. If we go through a retry, it is extremely
* likely that the page will be found in page cache at that point.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0716517..9e7c567 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -152,6 +152,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_MKWRITE 0x04 /* Fault was mkwrite of existing pte */
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
+#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/mm/filemap.c b/mm/filemap.c
index f5f9ac2..affba94 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -719,15 +719,27 @@ void __lock_page_nosync(struct page *page)
int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags)
{
- if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
- __lock_page(page);
- return 1;
- } else {
+ int ret;
+
+ if (flags & FAULT_FLAG_ALLOW_RETRY) {
if (!(flags & FAULT_FLAG_RETRY_NOWAIT)) {
up_read(&mm->mmap_sem);
- wait_on_page_locked(page);
+ if (flags & FAULT_FLAG_KILLABLE)
+ wait_on_page_locked_killable(page);
+ else
+ wait_on_page_locked(page);
}
return 0;
+ } else {
+ if (flags & FAULT_FLAG_KILLABLE) {
+ ret = __lock_page_killable(page);
+ if (ret) {
+ up_read(&mm->mmap_sem);
+ return 0;
+ }
+ } else
+ __lock_page(page);
+ return 1;
}
}
--
1.6.5.2
Hi Kosaki,
On Tue, Mar 22, 2011 at 08:05:55PM +0900, KOSAKI Motohiro wrote:
> all_unreclaimable check in direct reclaim has been introduced at 2.6.19
> by following commit.
>
> 2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> And it went through strange history. firstly, following commit broke
> the logic unintentionally.
>
> 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> costly-order allocations
>
> Two years later, I've found obvious meaningless code fragment and
> restored original intention by following commit.
>
> 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> return value when priority==0
>
> But, the logic didn't works when 32bit highmem system goes hibernation
> and Minchan slightly changed the algorithm and fixed it .
>
> 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> in direct reclaim path
>
> But, recently, Andrey Vagin found the new corner case. Look,
>
> struct zone {
> ..
> int all_unreclaimable;
> ..
> unsigned long pages_scanned;
> ..
> }
>
> zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> variables nor protected by lock. Therefore a zone can become a state
> of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
Possible although it's very rare.
> current all_unreclaimable() return false even though
> zone->all_unreclaimabe=1.
The case is very rare since we reset zone->all_unreclaimable to zero
right before resetting zone->pages_scanned to zero.
But I admit it's possible.
CPU 0                              CPU 1
free_pcppages_bulk                 balance_pgdat
zone->all_unreclaimable = 0
                                   zone->all_unreclaimable = 1
zone->pages_scanned = 0
>
> Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> small dma zone and it become zone->all_unreclamble=1 easily. and
> if it becase all_unreclaimable, it never return all_unreclaimable=0
^^^^^ it's very important verb. ^^^^^ return? reset?
I can't understand your point due to the typo. Please correct the typo.
> beucase it typicall don't have reclaimable pages.
If the DMA zone has very few reclaimable pages, or zero reclaimable pages,
zone_reclaimable() can easily return false, so all_unreclaimable() could return
true. Eventually the oom-killer might work.
In my test I saw the livelock too, so apparently we have a problem.
I couldn't dig into it recently because of other urgent work.
I think you know the root cause, but the description in this patch isn't
enough to persuade me.
Could you explain the root cause in detail?
>
> Eventually, oom-killer never works on such systems. Let's remove
> this problematic logic completely.
>
> Reported-by: Andrey Vagin <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 36 +-----------------------------------
> 1 files changed, 1 insertions(+), 35 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 060e4c1..254aada 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
> }
>
> /*
> - * As hibernation is going on, kswapd is freezed so that it can't mark
> - * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> - * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> - */
> -static bool all_unreclaimable(struct zonelist *zonelist,
> - struct scan_control *sc)
> -{
> - struct zoneref *z;
> - struct zone *zone;
> - bool all_unreclaimable = true;
> -
> - for_each_zone_zonelist_nodemask(zone, z, zonelist,
> - gfp_zone(sc->gfp_mask), sc->nodemask) {
> - if (!populated_zone(zone))
> - continue;
> - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> - continue;
> - if (zone_reclaimable(zone)) {
> - all_unreclaimable = false;
> - break;
> - }
> - }
> -
> - return all_unreclaimable;
> -}
> -
> -/*
> * This is the main entry point to direct page reclaim.
> *
> * If a full scan of the inactive list fails to free enough memory then we
> @@ -2105,14 +2078,7 @@ out:
> delayacct_freepages_end();
> put_mems_allowed();
>
> - if (sc->nr_reclaimed)
> - return sc->nr_reclaimed;
> -
> - /* top priority shrink_zones still had more to do? don't OOM, then */
> - if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> - return 1;
> -
> - return 0;
> + return sc->nr_reclaimed;
> }
>
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> --
> 1.6.5.2
>
>
>
--
Kind regards,
Minchan Kim
On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
<[email protected]> wrote:
> When plenty processes (eg fork bomb) are running, the TIF_MEMDIE task
> never exit, at least, human feel it's never. therefore kernel become
> hang-up.
>
> "perf sched" tell us a hint.
>
> ------------------------------------------------------------------------------
> Task | Runtime ms | Average delay ms | Maximum delay ms |
> ------------------------------------------------------------------------------
> python:1754 | 0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
> python:1843 | 0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
> python:1715 | 0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
> python:1818 | 2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
> ...
> ...
>
> Processes flood makes crazy scheduler delay. and then the victim process
> can't run enough. Grr. Should we do?
>
> Fortunately, we already have anti process flood framework, autogroup!
> This patch reuse this framework and avoid kernel live lock.
That's a cool idea, but I have a concern.
You remove the priority boosting in [2/5] and move victim tasks into an autogroup.
If I understand autogroup correctly, the victim process and the threads in
that process get less scheduling time than they do now.
Could that lead to unnecessary killing of other tasks?
I am not sure; just out of curiosity.
Thanks for the nice work, Kosaki.
--
Kind regards,
Minchan Kim
> On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> > When plenty processes (eg fork bomb) are running, the TIF_MEMDIE task
> > never exit, at least, human feel it's never. therefore kernel become
> > hang-up.
> >
> > "perf sched" tell us a hint.
> >
> > ------------------------------------------------------------------------------
> > Task | Runtime ms | Average delay ms | Maximum delay ms |
> > ------------------------------------------------------------------------------
> > python:1754 | 0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
> > python:1843 | 0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
> > python:1715 | 0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
> > python:1818 | 2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
> > ...
> > ...
> >
> > Processes flood makes crazy scheduler delay. and then the victim process
> > can't run enough. Grr. Should we do?
> >
> > Fortunately, we already have anti process flood framework, autogroup!
> > This patch reuse this framework and avoid kernel live lock.
>
> That's cool idea but I have a concern.
>
> You remove boosting priority in [2/5] and move victim tasks into autogroup.
> If I understand autogroup right, victim process and threads in the
> process take less schedule chance than now.
Right. The icky cpu-cgroup rt-runtime default forced me to look for another solution.
Today I got a private mail from Luis, and he seems to have another improvement
idea, so I might drop this patch if his works fine.
> Could it make unnecessary killing of other tasks?
> I am not sure. Just out of curiosity.
If you are talking about OOM serialization, it isn't affected. I don't change
the OOM serialization stuff, at least for now.
If you are talking about scheduler fairness, both the current code and this
patch are unfair. But that's ok. 1) In an OOM situation, scheduling
fairness is already broken by the heavy memory reclaim effort. 2) autogroup
is meant to change the scheduling grouping *automatically*; the patch then
changes it again to escape memory starvation.
>
> Thanks for nice work, Kosaki.
On Wed, 2011-03-23 at 10:27 +0900, KOSAKI Motohiro wrote:
> > On Tue, Mar 22, 2011 at 8:08 PM, KOSAKI Motohiro
> > <[email protected]> wrote:
> > > When plenty processes (eg fork bomb) are running, the TIF_MEMDIE task
> > > never exit, at least, human feel it's never. therefore kernel become
> > > hang-up.
> > >
> > > "perf sched" tell us a hint.
> > >
> > > ------------------------------------------------------------------------------
> > > Task | Runtime ms | Average delay ms | Maximum delay ms |
> > > ------------------------------------------------------------------------------
> > > python:1754 | 0.197 ms | avg: 1731.727 ms | max: 3433.805 ms |
> > > python:1843 | 0.489 ms | avg: 1707.433 ms | max: 3622.955 ms |
> > > python:1715 | 0.220 ms | avg: 1707.125 ms | max: 3623.246 ms |
> > > python:1818 | 2.127 ms | avg: 1527.331 ms | max: 3622.553 ms |
> > > ...
> > > ...
> > >
> > > Processes flood makes crazy scheduler delay. and then the victim process
> > > can't run enough. Grr. Should we do?
> > >
> > > Fortunately, we already have anti process flood framework, autogroup!
> > > This patch reuse this framework and avoid kernel live lock.
> >
> > That's cool idea but I have a concern.
> >
> > You remove boosting priority in [2/5] and move victim tasks into autogroup.
> > If I understand autogroup right, victim process and threads in the
> > process take less schedule chance than now.
>
> Right. Icky cpu-cgroup rt-runtime default enforce me to seek another solution.
I was going to mention rt, and there's s/fork/clone as well.
> Today, I got private mail from Luis and he seems to have another improvement
> idea. so, I might drop this patch if his one works fine.
Perhaps if TIF_MEMDIE threads need special treatment, the preemption tests
could take that into account? (though I don't like touching the fast path
for oddball cases)
-Mike
Hi Minchan,
> > zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> > variables nor protected by lock. Therefore a zone can become a state
> > of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
>
> Possible although it's very rare.
Can you test Andrey's case yourself on an x86 box? It seems
reproducible.
> > current all_unreclaimable() return false even though
> > zone->all_unreclaimabe=1.
>
> The case is very rare since we reset zone->all_unreclaimabe to zero
> right before resetting zone->page_scanned to zero.
> But I admit it's possible.
Please apply this patch and run the oom-killer. You may see the following
pages_scanned:0 and all_unreclaimable:yes combination, like below
(but you may need >30 min).
Node 0 DMA free:4024kB min:40kB low:48kB high:60kB active_anon:11804kB
inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:0kB kernel_stack:0kB pagetables:68kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>
> CPU 0 CPU 1
> free_pcppages_bulk balance_pgdat
> zone->all_unreclaimabe = 0
> zone->all_unreclaimabe = 1
> zone->pages_scanned = 0
> >
> > Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> > small dma zone and it become zone->all_unreclamble=1 easily. and
> > if it becase all_unreclaimable, it never return all_unreclaimable=0
> ^^^^^ it's very important verb. ^^^^^ return? reset?
>
> I can't understand your point due to the typo. Please correct the typo.
>
> > beucase it typicall don't have reclaimable pages.
>
> If DMA zone have very small reclaimable pages or zero reclaimable pages,
> zone_reclaimable() can return false easily so all_unreclaimable() could return
> true. Eventually oom-killer might works.
The point is that vmscan has the following all_unreclaimable check in several places.
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
But if the zone has only a few LRU pages, get_scan_count(DEF_PRIORITY) returns
a {0, 0, 0, 0} array. That means the zone will never scan LRU pages again, so the
falsely small pages_scanned can never be corrected.
Then the false negative from all_unreclaimable() can never be corrected either.
By the way, why does get_scan_count() return 0 instead of 1? Why don't we round up?
The git log says it is intentional.
commit e0f79b8f1f3394bb344b7b83d6f121ac2af327de
Author: Johannes Weiner <[email protected]>
Date: Sat Oct 18 20:26:55 2008 -0700
vmscan: don't accumulate scan pressure on unrelated lists
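To make the shift arithmetic above concrete, here is a minimal user-space sketch
(not the kernel code; the LRU counts are made-up numbers) of why a nearly empty
zone yields a {0, 0, 0, 0} scan array at DEF_PRIORITY:

#include <stdio.h>

#define DEF_PRIORITY 12

int main(void)
{
	/* hypothetical per-LRU page counts for a nearly empty DMA zone */
	unsigned long lru[4] = { 1, 0, 3, 0 };
	unsigned long scan;
	int i;

	for (i = 0; i < 4; i++) {
		scan = lru[i] >> DEF_PRIORITY;	/* the scan >>= priority step */
		printf("lru[%d]=%lu -> scan=%lu\n", i, lru[i], scan);
	}
	/* every result is 0, so nothing is ever added to zone->pages_scanned
	 * and the falsely small value is never corrected */
	return 0;
}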
>
> In my test, I saw the livelock, too so apparently we have a problem.
> I couldn't dig in it recently by another urgent my work.
> I think you know root cause but the description in this patch isn't enough
> for me to be persuaded.
>
> Could you explain the root cause in detail?
If you have another idea for a fix, please let me know. :)
On Wed, Mar 23, 2011 at 2:21 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Hi Minchan,
>
>> > zone->all_unreclaimable and zone->pages_scanned are neigher atomic
>> > variables nor protected by lock. Therefore a zone can become a state
>> > of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
>>
>> Possible although it's very rare.
>
> Can you test by yourself andrey's case on x86 box? It seems
> reprodusable.
>
>> > current all_unreclaimable() return false even though
>> > zone->all_unreclaimabe=1.
>>
>> The case is very rare since we reset zone->all_unreclaimabe to zero
>> right before resetting zone->page_scanned to zero.
>> But I admit it's possible.
>
> Please apply this patch and run oom-killer. You may see following
> pages_scanned:0 and all_unreclaimable:yes combination. likes below.
> (but you may need >30min)
>
> Node 0 DMA free:4024kB min:40kB low:48kB high:60kB active_anon:11804kB
> inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:68kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>
>
>>
>> CPU 0 CPU 1
>> free_pcppages_bulk balance_pgdat
>> zone->all_unreclaimabe = 0
>> zone->all_unreclaimabe = 1
>> zone->pages_scanned = 0
>> >
>> > Is this ignorable minor issue? No. Unfortunatelly, x86 has very
>> > small dma zone and it become zone->all_unreclamble=1 easily. and
>> > if it becase all_unreclaimable, it never return all_unreclaimable=0
>> ^^^^^ it's very important verb. ^^^^^ return? reset?
>>
>> I can't understand your point due to the typo. Please correct the typo.
>>
>> > beucase it typicall don't have reclaimable pages.
>>
>> If DMA zone have very small reclaimable pages or zero reclaimable pages,
>> zone_reclaimable() can return false easily so all_unreclaimable() could return
>> true. Eventually oom-killer might works.
>
> The point is, vmscan has following all_unreclaimable check in several place.
>
> if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> continue;
>
> But, if the zone has only a few lru pages, get_scan_count(DEF_PRIORITY) return
> {0, 0, 0, 0} array. It mean zone will never scan lru pages anymore. therefore
> false negative smaller pages_scanned can't be corrected.
>
> Then, false negative all_unreclaimable() also can't be corrected.
>
>
> btw, Why get_scan_count() return 0 instead 1? Why don't we round up?
> Git log says it is intentionally.
>
> commit e0f79b8f1f3394bb344b7b83d6f121ac2af327de
> Author: Johannes Weiner <[email protected]>
> Date: Sat Oct 18 20:26:55 2008 -0700
>
> vmscan: don't accumulate scan pressure on unrelated lists
>
>>
>> In my test, I saw the livelock, too so apparently we have a problem.
>> I couldn't dig in it recently by another urgent my work.
>> I think you know root cause but the description in this patch isn't enough
>> for me to be persuaded.
>>
>> Could you explain the root cause in detail?
>
> If you have an another fixing idea, please let me know. :)
>
>
>
>
Okay. I got it.
The problem is as follows.
Through the race between free_pcppages_bulk and balance_pgdat, it is possible to get
zone->all_unreclaimable = 1 and zone->pages_scanned = 0.
A DMA zone has few LRU pages, and in the case of no swap and big memory
pressure there could be just one page on the inactive file list, like your
example (anon LRU pages aren't important on a non-swap system).
In such a case, shrink_zones doesn't scan the page at all until priority
becomes 0, as get_scan_count does scan >>= priority (which is mostly zero).
And even once priority becomes 0, nr_scan_try_batch returns zero until
the saved pages reach 32. So to scan the page we need, at least, 32
iterations of priority 12..0. If the system has a fork bomb, that is
almost a livelock.
If this is right, how about this?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..34983e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1973,6 +1973,9 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 static bool zone_reclaimable(struct zone *zone)
 {
+	if (zone->all_unreclaimable)
+		return false;
+
 	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
 }
--
Kind regards,
Minchan Kim
> Okay. I got it.
>
> The problem is following as.
> By the race the free_pcppages_bulk and balance_pgdat, it is possible
> zone->all_unreclaimable = 1 and zone->pages_scanned = 0.
> DMA zone have few LRU pages and in case of no-swap and big memory
> pressure, there could be a just a page in inactive file list like your
> example. (anon lru pages isn't important in case of non-swap system)
> In such case, shrink_zones doesn't scan the page at all until priority
> become 0 as get_scan_count does scan >>= priority(it's mostly zero).
Nope.
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
These two lines mean that all_unreclaimable prevents priority-0 reclaim.
> And although priority become 0, nr_scan_try_batch returns zero until
> saved pages become 32. So for scanning the page, at least, we need 32
> times iteration of priority 12..0. If system has fork-bomb, it is
> almost livelock.
Therefore, 1000 calls to get_scan_count(DEF_PRIORITY) are 1000 no-ops.
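As an illustration of that claim, a small stand-alone sketch (a simplified model,
not vmscan.c itself; the zone size is a made-up number) of one whole priority
12..0 cycle when all_unreclaimable is already set:

#include <stdbool.h>
#include <stdio.h>

#define DEF_PRIORITY 12

int main(void)
{
	bool all_unreclaimable = true;	/* assume the race already set it */
	unsigned long lru_pages = 3;	/* hypothetical tiny DMA zone */
	unsigned long scanned = 0;
	int priority;

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		/* the two lines quoted above: skip unless DEF_PRIORITY */
		if (all_unreclaimable && priority != DEF_PRIORITY)
			continue;
		/* at DEF_PRIORITY the shift gives a zero scan target anyway */
		scanned += lru_pages >> priority;
	}
	printf("pages scanned in one full cycle: %lu\n", scanned);	/* prints 0 */
	return 0;
}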
>
> If is is right, how about this?
Boo.
You seem to have forgotten why you introduced the current all_unreclaimable() function.
During hibernation, we can't trust all_unreclaimable.
That's the reason why I proposed the following patch when you introduced
all_unreclaimable().
---
mm/vmscan.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c391c32..1919d8a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/oom.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1931,7 +1932,7 @@ out:
return sc->nr_reclaimed;
/* top priority shrink_zones still had more to do? don't OOM, then */
- if (scanning_global_lru(sc) && !all_unreclaimable)
+ if (scanning_global_lru(sc) && !all_unreclaimable && !oom_killer_disabled)
return 1;
return 0;
--
1.6.5.2
On Tue, 22 Mar 2011 20:05:55 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:
> all_unreclaimable check in direct reclaim has been introduced at 2.6.19
> by following commit.
>
> 2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> And it went through strange history. firstly, following commit broke
> the logic unintentionally.
>
> 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> costly-order allocations
>
> Two years later, I've found obvious meaningless code fragment and
> restored original intention by following commit.
>
> 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> return value when priority==0
>
> But, the logic didn't works when 32bit highmem system goes hibernation
> and Minchan slightly changed the algorithm and fixed it .
>
> 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> in direct reclaim path
>
> But, recently, Andrey Vagin found the new corner case. Look,
>
> struct zone {
> ..
> int all_unreclaimable;
> ..
> unsigned long pages_scanned;
> ..
> }
>
> zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> variables nor protected by lock. Therefore a zone can become a state
> of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
> current all_unreclaimable() return false even though
> zone->all_unreclaimabe=1.
>
> Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> small dma zone and it become zone->all_unreclamble=1 easily. and
> if it becase all_unreclaimable, it never return all_unreclaimable=0
> beucase it typicall don't have reclaimable pages.
>
> Eventually, oom-killer never works on such systems. Let's remove
> this problematic logic completely.
>
> Reported-by: Andrey Vagin <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
IIUC, I saw the phenomenon you pointed out, as
- all zone->all_unreclaimable = yes
- zone_reclaimable() returns true
- no pgscan proceeds
on a swapless system. So I'd like to vote for this patch.
But hmm... what happens when all pages are isolated or locked and are currently being freed?
I think we should have an alternative safe-guard to avoid calling the
oom-killer. Hmm.
Thanks,
-Kame
On Tue, 22 Mar 2011 20:08:41 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:
> commit 2687a356 (Add lock_page_killable) introduced a killable
> lock_page(). Similarly, this patch introduces a killable
> wait_on_page_locked().
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> > Reported-by: Andrey Vagin <[email protected]>
> > Cc: Nick Piggin <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: KAMEZAWA Hiroyuki <[email protected]>
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> IIUC, I saw the pehnomenon which you pointed out, as
> - all zone->all_unreclaimable = yes
> - zone_reclaimable() returns true
> - no pgscan proceeds.
>
> on a swapless system. So, I'd like to vote for this patch.
>
> But hmm...what happens all of pages are isolated or locked and now under freeing ?
> I think we should have alternative safe-guard logic for avoiding to call
> oom-killer. Hmm.
Yes, this patch carries a small risk. But 1) this logic didn't work for about two
years (see the changelog), and 2) memcg has never used this logic and I haven't gotten
any bug report from the memcg developers. Therefore I decided to take the most
simple way.
Of course, I'll add another protection if I get any regression report.
On Tue, 22 Mar 2011 20:09:29 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:
> When oom killer occured, almost processes are getting stuck following
> two points.
>
> 1) __alloc_pages_nodemask
> 2) __lock_page_or_retry
>
> 1) is not much problematic because TIF_MEMDIE lead to make allocation
> failure and get out from page allocator. 2) is more problematic. When
> OOM situation, Zones typically don't have page cache at all and Memory
> starvation might lead to reduce IO performance largely. When fork bomb
> occur, TIF_MEMDIE task don't die quickly mean fork bomb may create
> new process quickly rather than oom-killer kill it. Then, the system
> may become livelock.
>
> This patch makes pagefault interruptible by SIGKILL.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> arch/x86/mm/fault.c | 9 +++++++++
> include/linux/mm.h | 1 +
> mm/filemap.c | 22 +++++++++++++++++-----
> 3 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 20e3f87..797c7d0 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> if (user_mode_vm(regs)) {
> local_irq_enable();
> error_code |= PF_USER;
> + flags |= FAULT_FLAG_KILLABLE;
> } else {
> if (regs->flags & X86_EFLAGS_IF)
> local_irq_enable();
> @@ -1138,6 +1139,14 @@ good_area:
> }
>
> /*
> + * Pagefault was interrupted by SIGKILL. We have no reason to
> + * continue pagefault.
> + */
> + if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
> + fatal_signal_pending(current))
> + return;
> +
Hmm? up_read(&mm->mmap_sem) ?
Thanks,
-Kame
> On Tue, 22 Mar 2011 20:09:29 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> > When oom killer occured, almost processes are getting stuck following
> > two points.
> >
> > 1) __alloc_pages_nodemask
> > 2) __lock_page_or_retry
> >
> > 1) is not much problematic because TIF_MEMDIE lead to make allocation
> > failure and get out from page allocator. 2) is more problematic. When
> > OOM situation, Zones typically don't have page cache at all and Memory
> > starvation might lead to reduce IO performance largely. When fork bomb
> > occur, TIF_MEMDIE task don't die quickly mean fork bomb may create
> > new process quickly rather than oom-killer kill it. Then, the system
> > may become livelock.
> >
> > This patch makes pagefault interruptible by SIGKILL.
> >
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > ---
> > arch/x86/mm/fault.c | 9 +++++++++
> > include/linux/mm.h | 1 +
> > mm/filemap.c | 22 +++++++++++++++++-----
> > 3 files changed, 27 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 20e3f87..797c7d0 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > if (user_mode_vm(regs)) {
> > local_irq_enable();
> > error_code |= PF_USER;
> > + flags |= FAULT_FLAG_KILLABLE;
> > } else {
> > if (regs->flags & X86_EFLAGS_IF)
> > local_irq_enable();
> > @@ -1138,6 +1139,14 @@ good_area:
> > }
> >
> > /*
> > + * Pagefault was interrupted by SIGKILL. We have no reason to
> > + * continue pagefault.
> > + */
> > + if ((flags & FAULT_FLAG_KILLABLE) && (fault & VM_FAULT_RETRY) &&
> > + fatal_signal_pending(current))
> > + return;
> > +
>
> Hmm? up_read(&mm->mmap_sem) ?
When __lock_page_or_retry() returns 0, it calls up_read(mmap_sem) inside that
function.
I agree this is strange (or ugly), but I don't want to change this behaviour at
this time.
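For readers following along, here is a user-space model of the convention being
described; it is a simplified sketch, not mm/filemap.c itself, and the helper
name only mirrors the real one:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

/* Returns 1 if the "page" was locked with mmap_sem still held,
 * or 0 if it gave up and already released mmap_sem itself. */
static int lock_page_or_retry_model(bool contended)
{
	if (!contended)
		return 1;
	pthread_rwlock_unlock(&mmap_sem);	/* the helper drops the lock */
	return 0;
}

static void fault_model(bool contended)
{
	pthread_rwlock_rdlock(&mmap_sem);
	if (!lock_page_or_retry_model(contended)) {
		/* retry path: mmap_sem is already released, so the fault
		 * handler must return without touching it again */
		printf("retry: mmap_sem already dropped by the helper\n");
		return;
	}
	pthread_rwlock_unlock(&mmap_sem);	/* normal path unlocks here */
	printf("fault handled, caller released mmap_sem\n");
}

int main(void)
{
	fault_model(false);
	fault_model(true);
	return 0;
}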
On Wed, Mar 23, 2011 at 04:13:21PM +0900, KOSAKI Motohiro wrote:
> > Okay. I got it.
> >
> > The problem is following as.
> > By the race the free_pcppages_bulk and balance_pgdat, it is possible
> > zone->all_unreclaimable = 1 and zone->pages_scanned = 0.
> > DMA zone have few LRU pages and in case of no-swap and big memory
> > pressure, there could be a just a page in inactive file list like your
> > example. (anon lru pages isn't important in case of non-swap system)
> > In such case, shrink_zones doesn't scan the page at all until priority
> > become 0 as get_scan_count does scan >>= priority(it's mostly zero).
>
> Nope.
>
> if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> continue;
>
> This tow lines mean, all_unreclaimable prevent priority 0 reclaim.
>
Yes. I missed it. Thanks.
>
> > And although priority become 0, nr_scan_try_batch returns zero until
> > saved pages become 32. So for scanning the page, at least, we need 32
> > times iteration of priority 12..0. If system has fork-bomb, it is
> > almost livelock.
>
> Therefore, 1000 times get_scan_count(DEF_PRIORITY) takes 1000 times no-op.
>
> >
> > If is is right, how about this?
>
> Boo.
> You seems forgot why you introduced current all_unreclaimable() function.
> While hibernation, we can't trust all_unreclaimable.
Hmm. AFAIR, the reason we added all_unreclaimable is that while hibernation is going on,
kswapd is frozen, so it can't mark zone->all_unreclaimable.
So I think hibernation can't be a problem.
Am I missing something?
--
Kind regards,
Minchan Kim
> > Boo.
> > You seems forgot why you introduced current all_unreclaimable() function.
> > While hibernation, we can't trust all_unreclaimable.
>
> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
> kswapd is freezed so it can't mark the zone->all_unreclaimable.
> So I think hibernation can't be a problem.
> Am I miss something?
Ahh, I missed that. Thanks for correcting me. Okay, I see that both mine and yours work.
Can you please explain why you like yours better than mine?
By the way, yours is very similar to Andrey's initial patch. If yours is
better, I'd like to ack Andrey's instead.
On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> > Boo.
>> > You seems forgot why you introduced current all_unreclaimable() function.
>> > While hibernation, we can't trust all_unreclaimable.
>>
>> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
>> kswapd is freezed so it can't mark the zone->all_unreclaimable.
>> So I think hibernation can't be a problem.
>> Am I miss something?
>
> Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
> Can you please explain why do you like your one than mine?
Just _simple_ :)
I don't want to change many lines when we can do it simply and very clearly.
>
> btw, Your one is very similar andrey's initial patch. If your one is
> better, I'd like to ack with andrey instead.
When Andrey sent a patch, I thought zone_reclaimable() was the right
place for the check rather than outside zone_reclaimable. The reason I didn't ack is
that Andrey couldn't explain the root cause, but you did, so you persuaded me.
I don't mind whether Andrey moves the check into zone_reclaimable and resends,
or I resend with a concrete description.
Anyway, the most important thing is a good description that shows the root cause.
That applies to your patch, too.
You should have written down the root cause in the description.
--
Kind regards,
Minchan Kim
On Wed, Mar 23, 2011 at 1:09 AM, KOSAKI Motohiro
<[email protected]> wrote:
>
> When __lock_page_or_retry() return 0, It call up_read(mmap_sem) in this
> function.
Indeed.
> I agree this is strange (or ugly). but I don't want change this spec in
> this time.
I agree that it is strange, and I don't like functions that touch
locks that they didn't take themselves, but since the original point
of the whole thing was to wait for the page without holding the
mmap_sem lock, that function has to do the up_read() early.
Linus
> On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> > Boo.
> >> > You seems forgot why you introduced current all_unreclaimable() function.
> >> > While hibernation, we can't trust all_unreclaimable.
> >>
> >> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
> >> kswapd is freezed so it can't mark the zone->all_unreclaimable.
> >> So I think hibernation can't be a problem.
> >> Am I miss something?
> >
> > Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
> > Can you please explain why do you like your one than mine?
>
> Just _simple_ :)
> I don't want to change many lines although we can do it simple and very clear.
>
> >
> > btw, Your one is very similar andrey's initial patch. If your one is
> > better, I'd like to ack with andrey instead.
>
> When Andrey sent a patch, I though this as zone_reclaimable() is right
> place to check it than out of zone_reclaimable. Why I didn't ack is
> that Andrey can't explain root cause but you did so you persuade me.
>
> I don't mind if Andrey move the check in zone_reclaimable and resend
> or I resend with concrete description.
>
> Anyway, most important thing is good description to show the root cause.
> It is applied to your patch, too.
> You should have written down root cause in description.
Honestly, I really dislike mixing zone->pages_scanned and
zone->all_unreclaimable, because I think it's not simple. I don't
think it's good taste, nor easy to review. Even you, a VM
expert, didn't understand this issue at once; that's a smell of too
messy code.
Therefore, I prefer to take either 1) just removing the function, or
2) only checking zone->all_unreclaimable and oom_killer_disabled
instead of zone->pages_scanned.
And I agree I need to rewrite the description.
How's this?
==================================================
From 216bcf3fb0476b453080debf8999c74c635ed72f Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Sun, 8 May 2011 17:39:44 +0900
Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.
2006 Sep 25; commit 408d8544; oom: use unreclaimable info
And it went through a strange history. First, the following commit broke
the logic unintentionally.
2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
costly-order allocations
Two years later, I found an obviously meaningless code fragment and
restored the original intention with the following commit.
2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
return value when priority==0
But the logic didn't work when a 32-bit highmem system went into hibernation,
and Minchan slightly changed the algorithm and fixed it.
2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
in direct reclaim path
But recently Andrey Vagin found a new corner case. Look:
struct zone {
..
int all_unreclaimable;
..
unsigned long pages_scanned;
..
}
zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock. Therefore a zone can reach a state
of zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case, the
current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.
Is this an ignorable minor issue? No. Unfortunately, x86 has a very
small DMA zone and it becomes zone->all_unreclaimable=1 easily, and
once it becomes all_unreclaimable=1, it never goes back to all_unreclaimable=0.
Why? If all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim, and
a-few-lru-pages>>DEF_PRIORITY always makes 0. That means no page scan
at all!
Eventually, the oom-killer never works on such systems. Let's remove
this problematic logic completely.
Reported-by: Andrey Vagin <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 36 +-----------------------------------
1 files changed, 1 insertions(+), 35 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..254aada 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1989,33 +1989,6 @@ static bool zone_reclaimable(struct zone *zone)
}
/*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
-static bool all_unreclaimable(struct zonelist *zonelist,
- struct scan_control *sc)
-{
- struct zoneref *z;
- struct zone *zone;
- bool all_unreclaimable = true;
-
- for_each_zone_zonelist_nodemask(zone, z, zonelist,
- gfp_zone(sc->gfp_mask), sc->nodemask) {
- if (!populated_zone(zone))
- continue;
- if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
- continue;
- if (zone_reclaimable(zone)) {
- all_unreclaimable = false;
- break;
- }
- }
-
- return all_unreclaimable;
-}
-
-/*
* This is the main entry point to direct page reclaim.
*
* If a full scan of the inactive list fails to free enough memory then we
@@ -2105,14 +2078,7 @@ out:
delayacct_freepages_end();
put_mems_allowed();
- if (sc->nr_reclaimed)
- return sc->nr_reclaimed;
-
- /* top priority shrink_zones still had more to do? don't OOM, then */
- if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
- return 1;
-
- return 0;
+ return sc->nr_reclaimed;
}
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
--
1.6.5.2
On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
> Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
zone.all_unreclaimable is there to prevent reclaim from wasting CPU
cycles scanning a zone which has no reclaimable pages. When originally
implemented it did this very well.
That you guys keep breaking it, or don't feel like improving it is not a
reason to remove it!
If the code is unneeded and the kernel now reliably solves this problem
by other means then this should have been fully explained in the
changelog, but it was not even mentioned.
> On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
>
> > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
>
> zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> cycles scanning a zone which has no reclaimable pages. When originally
> implemented it did this very well.
>
> That you guys keep breaking it, or don't feel like improving it is not a
> reason to remove it!
>
> If the code is unneeded and the kernel now reliably solves this problem
> by other means then this should have been fully explained in the
> changelog, but it was not even mentioned.
The changelog says the logic was removed in 2008, three years ago,
even though that was unintentional. Then Minchan and I tried to resurrect
the broken logic and resurrected a bug in the logic too. Now we
are discussing whether it should live or die.
Which part is hard to understand for you?
On Thu, 24 Mar 2011 11:48:19 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
> > On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
> >
> > > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
> >
> > zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> > cycles scanning a zone which has no reclaimable pages. When originally
> > implemented it did this very well.
> >
> > That you guys keep breaking it, or don't feel like improving it is not a
> > reason to remove it!
> >
> > If the code is unneeded and the kernel now reliably solves this problem
> > by other means then this should have been fully explained in the
> > changelog, but it was not even mentioned.
>
> The changelog says, the logic was removed at 2008. three years ago.
> even though it's unintentionally. and I and minchan tried to resurrect
> the broken logic and resurrected a bug in the logic too. then, we
> are discussed it should die or alive.
>
> Which part is hard to understand for you?
>
The part which isn't there: how does the kernel now address the problem
which that code fixed?
On Thu, Mar 24, 2011 at 11:11 AM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Wed, Mar 23, 2011 at 5:44 PM, KOSAKI Motohiro
>> <[email protected]> wrote:
>> >> > Boo.
>> >> > You seems forgot why you introduced current all_unreclaimable() function.
>> >> > While hibernation, we can't trust all_unreclaimable.
>> >>
>> >> Hmm. AFAIR, the why we add all_unreclaimable is when the hibernation is going on,
>> >> kswapd is freezed so it can't mark the zone->all_unreclaimable.
>> >> So I think hibernation can't be a problem.
>> >> Am I miss something?
>> >
>> > Ahh, I missed. thans correct me. Okay, I recognized both mine and your works.
>> > Can you please explain why do you like your one than mine?
>>
>> Just _simple_ :)
>> I don't want to change many lines although we can do it simple and very clear.
>>
>> >
>> > btw, Your one is very similar andrey's initial patch. If your one is
>> > better, I'd like to ack with andrey instead.
>>
>> When Andrey sent a patch, I though this as zone_reclaimable() is right
>> place to check it than out of zone_reclaimable. Why I didn't ack is
>> that Andrey can't explain root cause but you did so you persuade me.
>>
>> I don't mind if Andrey move the check in zone_reclaimable and resend
>> or I resend with concrete description.
>>
>> Anyway, most important thing is good description to show the root cause.
>> It is applied to your patch, too.
>> You should have written down root cause in description.
>
> honestly, I really dislike to use mixing zone->pages_scanned and
> zone->all_unreclaimable. because I think it's no simple. I don't
> think it's good taste nor easy to review. Even though you who VM
> expert didn't understand this issue at once, it's smell of too
> mess code.
>
> therefore, I prefore to take either 1) just remove the function or
> 2) just only check zone->all_unreclaimable and oom_killer_disabled
> instead zone->pages_scanned.
Nick's original goal was to prevent OOM killing until all the zones we're
interested in are unreclaimable, and whether a zone is reclaimable or not
depends on kswapd. Nick's original solution just peeked at
zone->all_unreclaimable, but I made it dirty when we were considering the
kswapd freeze during hibernation. So I think we still need it to handle the
kswapd freeze problem, and we should add back the original behavior we missed at
that time, like below.
static bool zone_reclaimable(struct zone *zone)
{
	if (zone->all_unreclaimable)
		return false;

	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}
If you remove the logic, the problem Nick addressed would show
up again. How about addressing that problem in your patch? If you
remove the logic, __alloc_pages_direct_reclaim loses its chance to call
drain_all_pages. Of course, that was a side effect, but we should handle
it.
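For reference, a rough stand-alone paraphrase of the side effect mentioned here
(reconstructed from memory of the page allocator of that era, so treat the exact
flow as an assumption): when try_to_free_pages() reports zero progress, the
direct-reclaim slow path returns before the drain_all_pages() retry is ever reached.

#include <stdbool.h>
#include <stddef.h>

struct page { int dummy; };

/* stand-ins so the sketch compiles; the real functions live in mm/ */
static unsigned long try_to_free_pages_model(void) { return 0; }
static struct page *get_page_from_freelist_model(void) { return NULL; }
static void drain_all_pages_model(void) { }

static struct page *direct_reclaim_model(unsigned long *did_some_progress)
{
	struct page *page;
	bool drained = false;

	*did_some_progress = try_to_free_pages_model();
	if (!*did_some_progress)
		return NULL;	/* bail out: the drain below is never reached */

	for (;;) {
		page = get_page_from_freelist_model();
		if (!page && !drained) {
			drain_all_pages_model();	/* the chance that is lost */
			drained = true;
			continue;
		}
		return page;
	}
}

int main(void)
{
	unsigned long progress;

	direct_reclaim_model(&progress);
	return 0;
}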
And my last concern is: are we going the right way?
I think the fundamental cause of this problem is that pages_scanned and
all_unreclaimable race, so isn't fixing that race the right
approach?
If that is hard or very costly, your approach and mine are the fallback.
--
Kind regards,
Minchan Kim
> On Thu, 24 Mar 2011 11:48:19 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
>
> > > On Thu, 24 Mar 2011 11:11:46 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
> > >
> > > > Subject: [PATCH] vmscan: remove all_unreclaimable check from direct reclaim path completely
> > >
> > > zone.all_unreclaimable is there to prevent reclaim from wasting CPU
> > > cycles scanning a zone which has no reclaimable pages. When originally
> > > implemented it did this very well.
> > >
> > > That you guys keep breaking it, or don't feel like improving it is not a
> > > reason to remove it!
> > >
> > > If the code is unneeded and the kernel now reliably solves this problem
> > > by other means then this should have been fully explained in the
> > > changelog, but it was not even mentioned.
> >
> > The changelog says, the logic was removed at 2008. three years ago.
> > even though it's unintentionally. and I and minchan tried to resurrect
> > the broken logic and resurrected a bug in the logic too. then, we
> > are discussed it should die or alive.
> >
> > Which part is hard to understand for you?
>
> The part which isn't there: how does the kernel now address the problem
> which that code fixed?
Ah, got it.
The history says the problem hasn't occurred for three years. Thus I
meant
past: the code exists, but it is broken and hasn't worked for three years.
new: the code is removed.
What's the difference? But Minchan's last mail pointed out that the recent
drain_all_pages() stuff depends on the return value of try_to_free_pages.
Thus I've made a new patch and sent it. Please take a look at it.
Hi Minchan,
> Nick's original goal is to prevent OOM killing until all zone we're
> interested in are unreclaimable and whether zone is reclaimable or not
> depends on kswapd. And Nick's original solution is just peeking
> zone->all_unreclaimable but I made it dirty when we are considering
> kswapd freeze in hibernation. So I think we still need it to handle
> kswapd freeze problem and we should add original behavior we missed at
> that time like below.
>
> static bool zone_reclaimable(struct zone *zone)
> {
> if (zone->all_unreclaimable)
> return false;
>
> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> }
>
> If you remove the logic, the problem Nick addressed would be showed
> up, again. How about addressing the problem in your patch? If you
> remove the logic, __alloc_pages_direct_reclaim lose the chance calling
> dran_all_pages. Of course, it was a side effect but we should handle
> it.
Ok, you have successfully persuaded me. Losing the drain_all_pages() chance is
a risk.
> And my last concern is we are going on right way?
> I think fundamental cause of this problem is page_scanned and
> all_unreclaimable is race so isn't the approach fixing the race right
> way?
Hmm..
If we can avoid a lock, we should, I think; that's a performance reason.
Therefore I'd like to contain the issue in do_try_to_free_pages(). It's a
slow path.
Is the following patch acceptable to you? It
o rewrites the description
o avoids mixing zone->all_unreclaimable and zone->pages_scanned
o avoids reintroducing the hibernation issue
o doesn't touch the fast path
> If it is hard or very costly, your and my approach will be fallback.
-----------------------------------------------------------------
From f3d277057ad3a092aa1c94244f0ed0d3ebe5411c Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Sat, 14 May 2011 05:07:48 +0900
Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as the name
The all_unreclaimable check in direct reclaim was introduced in 2.6.19
by the following commit.
2006 Sep 25; commit 408d8544; oom: use unreclaimable info
And it went through a strange history. First, the following commit broke
the logic unintentionally.
2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
costly-order allocations
Two years later, I found an obviously meaningless code fragment and
restored the original intention with the following commit.
2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
return value when priority==0
But the logic didn't work when a 32-bit highmem system went into hibernation,
and Minchan slightly changed the algorithm and fixed it.
2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
in direct reclaim path
But recently Andrey Vagin found a new corner case. Look:
struct zone {
..
int all_unreclaimable;
..
unsigned long pages_scanned;
..
}
zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock. Therefore a zone can reach a state
of zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case, the
current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.
Is this an ignorable minor issue? No. Unfortunately, x86 has a very
small DMA zone and it becomes zone->all_unreclaimable=1 easily, and
once it becomes all_unreclaimable=1, it never goes back to all_unreclaimable=0.
Why? If all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim, and
a-few-lru-pages>>DEF_PRIORITY always makes 0. That means no page scan
at all!
Eventually, the oom-killer never works on such systems. That said, we
can't use zone->pages_scanned for this purpose. This patch restores
all_unreclaimable()'s use of zone->all_unreclaimable as of old, and in addition
adds an oom_killer_disabled check to avoid reintroducing the issue of
commit d1908362.
Reported-by: Andrey Vagin <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 24 +++++++++++++-----------
1 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..54ac548 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -41,6 +41,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/oom.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1988,17 +1989,12 @@ static bool zone_reclaimable(struct zone *zone)
return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}
-/*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
+/* All zones in zonelist are unreclaimable? */
static bool all_unreclaimable(struct zonelist *zonelist,
struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
- bool all_unreclaimable = true;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2006,13 +2002,11 @@ static bool all_unreclaimable(struct zonelist *zonelist,
continue;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (zone_reclaimable(zone)) {
- all_unreclaimable = false;
- break;
- }
+ if (!zone->all_unreclaimable)
+ return false;
}
- return all_unreclaimable;
+ return true;
}
/*
@@ -2108,6 +2102,14 @@ out:
if (sc->nr_reclaimed)
return sc->nr_reclaimed;
+ /*
+ * As hibernation is going on, kswapd is freezed so that it can't mark
+ * the zone into all_unreclaimable. Thus bypassing all_unreclaimable
+ * check.
+ */
+ if (oom_killer_disabled)
+ return 0;
+
/* top priority shrink_zones still had more to do? don't OOM, then */
if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
return 1;
--
1.6.5.2
Hi Kosaki,
On Thu, Mar 24, 2011 at 2:35 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Hi Minchan,
>
>> Nick's original goal is to prevent OOM killing until all zone we're
>> interested in are unreclaimable and whether zone is reclaimable or not
>> depends on kswapd. And Nick's original solution is just peeking
>> zone->all_unreclaimable but I made it dirty when we are considering
>> kswapd freeze in hibernation. So I think we still need it to handle
>> kswapd freeze problem and we should add original behavior we missed at
>> that time like below.
>>
>> static bool zone_reclaimable(struct zone *zone)
>> {
>> if (zone->all_unreclaimable)
>> return false;
>>
>> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> }
>>
>> If you remove the logic, the problem Nick addressed would be showed
>> up, again. How about addressing the problem in your patch? If you
>> remove the logic, __alloc_pages_direct_reclaim lose the chance calling
>> dran_all_pages. Of course, it was a side effect but we should handle
>> it.
>
> Ok, you are successfull to persuade me. lost drain_all_pages() chance has
> a risk.
>
>> And my last concern is we are going on right way?
>
>
>> I think fundamental cause of this problem is page_scanned and
>> all_unreclaimable is race so isn't the approach fixing the race right
>> way?
>
> Hmm..
> If we can avoid lock, we should. I think. that's performance reason.
> therefore I'd like to cap the issue in do_try_to_free_pages(). it's
> slow path.
>
> Is the following patch acceptable to you? it is
> o rewrote the description
> o avoid mix to use zone->all_unreclaimable and zone->pages_scanned
> o avoid to reintroduce hibernation issue
> o don't touch fast path
>
>
>> If it is hard or very costly, your and my approach will be fallback.
>
> -----------------------------------------------------------------
> From f3d277057ad3a092aa1c94244f0ed0d3ebe5411c Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Sat, 14 May 2011 05:07:48 +0900
> Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as the name
>
> all_unreclaimable check in direct reclaim has been introduced at 2.6.19
> by following commit.
>
> 2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> And it went through strange history. firstly, following commit broke
> the logic unintentionally.
>
> 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> costly-order allocations
>
> Two years later, I've found obvious meaningless code fragment and
> restored original intention by following commit.
>
> 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> return value when priority==0
>
> But, the logic didn't works when 32bit highmem system goes hibernation
> and Minchan slightly changed the algorithm and fixed it .
>
> 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> in direct reclaim path
>
> But, recently, Andrey Vagin found the new corner case. Look,
>
> struct zone {
> ..
> int all_unreclaimable;
> ..
> unsigned long pages_scanned;
> ..
> }
>
> zone->all_unreclaimable and zone->pages_scanned are neigher atomic
> variables nor protected by lock. Therefore zones can become a state
> of zone->page_scanned=0 and zone->all_unreclaimable=1. In this case,
> current all_unreclaimable() return false even though
> zone->all_unreclaimabe=1.
>
> Is this ignorable minor issue? No. Unfortunatelly, x86 has very
> small dma zone and it become zone->all_unreclamble=1 easily. and
> if it become all_unreclaimable=1, it never restore all_unreclaimable=0.
> Why? if all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
> a-few-lru-pages>>DEF_PRIORITY always makes 0. that mean no page scan
> at all!
>
> Eventually, oom-killer never works on such systems. That said, we
> can't use zone->pages_scanned for this purpose. This patch restore
> all_unreclaimable() use zone->all_unreclaimable as old. and in addition,
> to add oom_killer_disabled check to avoid reintroduce the issue of
> commit d1908362.
>
> Reported-by: Andrey Vagin <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 24 +++++++++++++-----------
> 1 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 060e4c1..54ac548 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -41,6 +41,7 @@
> #include <linux/memcontrol.h>
> #include <linux/delayacct.h>
> #include <linux/sysctl.h>
> +#include <linux/oom.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -1988,17 +1989,12 @@ static bool zone_reclaimable(struct zone *zone)
> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> }
>
> -/*
> - * As hibernation is going on, kswapd is freezed so that it can't mark
> - * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> - * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> - */
> +/* All zones in zonelist are unreclaimable? */
> static bool all_unreclaimable(struct zonelist *zonelist,
> struct scan_control *sc)
> {
> struct zoneref *z;
> struct zone *zone;
> - bool all_unreclaimable = true;
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> gfp_zone(sc->gfp_mask), sc->nodemask) {
> @@ -2006,13 +2002,11 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> continue;
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> continue;
> - if (zone_reclaimable(zone)) {
> - all_unreclaimable = false;
> - break;
> - }
> + if (!zone->all_unreclaimable)
> + return false;
> }
>
> - return all_unreclaimable;
> + return true;
> }
>
> /*
> @@ -2108,6 +2102,14 @@ out:
> if (sc->nr_reclaimed)
> return sc->nr_reclaimed;
>
> + /*
> + * As hibernation is going on, kswapd is freezed so that it can't mark
> + * the zone into all_unreclaimable. Thus bypassing all_unreclaimable
> + * check.
> + */
> + if (oom_killer_disabled)
> + return 0;
> +
> /* top priority shrink_zones still had more to do? don't OOM, then */
> if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> return 1;
> --
> 1.6.5.2
>
Thanks for your effort, Kosaki.
But I still doubt this patch is good.
This patch makes OOM killing in hibernation happen earlier, as it skips the
all_unreclaimable check.
Normally hibernation needs a lot of memory, so the page reclaim pressure
would be big on a small-memory system. So I don't like giving up early.
Do you think my patch has a problem? Personally, I think it's very
simple and clear. :)
--
Kind regards,
Minchan Kim
Hi
> Thanks for your effort, Kosaki.
> But I still doubt this patch is good.
>
> This patch makes early oom killing in hibernation as it skip
> all_unreclaimable check.
> Normally, hibernation needs many memory so page_reclaim pressure
> would be big in small memory system. So I don't like early give up.
Wait. When does big pressure occur? The hibernation reclaim target
(sc->nr_to_reclaim) scales with physical memory size, therefore
the pressure doesn't seem to depend on how small the system is.
> Do you think my patch has a problem? Personally, I think it's very
> simple and clear. :)
To be honest, I dislike the following part. It's madness on madness.
static bool zone_reclaimable(struct zone *zone)
{
	if (zone->all_unreclaimable)
		return false;

	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}
The function requires a reviewer to know that
o pages_scanned and all_unreclaimable are racy
o at hibernation, zone->all_unreclaimable can be a false negative,
but can't be a false positive.
And the function comment of all_unreclaimable() says
/*
* As hibernation is going on, kswapd is freezed so that it can't mark
* the zone into all_unreclaimable. It can't handle OOM during hibernation.
* So let's check zone's unreclaimable in direct reclaim as well as kswapd.
*/
But now it is no longer a copy of the kswapd algorithm.
If you still strongly prefer this idea even after hearing the above explanation,
please consider adding many more comments. I can't say your
current patch is readable/reviewable enough.
Thanks.
On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Hi
>
>> Thanks for your effort, Kosaki.
>> But I still doubt this patch is good.
>>
>> This patch makes early oom killing in hibernation as it skip
>> all_unreclaimable check.
>> Normally, hibernation needs many memory so page_reclaim pressure
>> would be big in small memory system. So I don't like early give up.
>
> Wait. When occur big pressure? hibernation reclaim pressure
> (sc->nr_to_recliam) depend on physical memory size. therefore
> a pressure seems to don't depend on the size.
It depends on physical memory size and /sys/power/image_size.
If you want to tune image size bigger, reclaim pressure would be big.
>
>
>> Do you think my patch has a problem? Personally, I think it's very
>> simple and clear. :)
>
> To be honest, I dislike following parts. It's madness on madness.
>
> static bool zone_reclaimable(struct zone *zone)
> {
> if (zone->all_unreclaimable)
> return false;
>
> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> }
>
>
> The function require a reviewer know
>
> o pages_scanned and all_unreclaimable are racy
Yes. That part should be written down in a comment.
> o at hibernation, zone->all_unreclaimable can be false negative,
> but can't be false positive.
The comment of all_unreclaimable already does explain it well, I think.
>
> And, a function comment of all_unreclaimable() says
>
> /*
> * As hibernation is going on, kswapd is freezed so that it can't mark
> * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> */
>
> But, now it is no longer copy of kswapd algorithm.
The comment doesn't say it should be a copy of kswapd.
>
> If you strongly prefer this idea even if you hear above explanation,
> please consider to add much and much comments. I can't say
> current your patch is enough readable/reviewable.
My patch isn't a formal patch for merging but just a concept to show.
If you agree with the idea, of course I will add a more concrete comment
when I send the formal patch.
Before that, I would like to get your agreement. :)
If you solve my concern (giving up early during hibernation) in your patch, I
won't insist on my patch, either.
Thanks for the comment, Kosaki.
>
> Thanks.
>
>
>
--
Kind regards,
Minchan Kim
> On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> > Hi
> >
> >> Thanks for your effort, Kosaki.
> >> But I still doubt this patch is good.
> >>
> >> This patch makes early oom killing in hibernation as it skip
> >> all_unreclaimable check.
> >> Normally, hibernation needs many memory so page_reclaim pressure
> >> would be big in small memory system. So I don't like early give up.
> >
> > Wait. When occur big pressure? hibernation reclaim pressure
> > (sc->nr_to_recliam) depend on physical memory size. therefore
> > a pressure seems to don't depend on the size.
>
> It depends on physical memory size and /sys/power/image_size.
> If you want to tune image size bigger, reclaim pressure would be big.
Ok, _If_ I want.
However, I haven't seen desktop people customize it.
> >> Do you think my patch has a problem? Personally, I think it's very
> >> simple and clear. :)
> >
> > To be honest, I dislike following parts. It's madness on madness.
> >
> > static bool zone_reclaimable(struct zone *zone)
> > {
> > if (zone->all_unreclaimable)
> > return false;
> >
> > return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> > }
> >
> >
> > The function require a reviewer know
> >
> > o pages_scanned and all_unreclaimable are racy
>
> Yes. That part should be written down of comment.
>
> > o at hibernation, zone->all_unreclaimable can be false negative,
> > but can't be false positive.
>
> The comment of all_unreclaimable already does explain it well, I think.
Where is?
> > And, a function comment of all_unreclaimable() says
> >
> > /*
> > * As hibernation is going on, kswapd is freezed so that it can't mark
> > * the zone into all_unreclaimable. It can't handle OOM during hibernation.
> > * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
> > */
> >
> > But, now it is no longer copy of kswapd algorithm.
>
> The comment don't say it should be a copy of kswapd.
I meant that the comment says
* So let's check zone's unreclaimable in direct reclaim as well as kswapd.
but now it isn't "as well as kswapd".
I think this is critically important. If people can't understand why the
algorithm was chosen, someone will break the code again sooner or later.
> > If you strongly prefer this idea even if you hear above explanation,
> > please consider to add much and much comments. I can't say
> > current your patch is enough readable/reviewable.
>
> My patch isn't a formal patch for merge but just a concept to show.
> If you agree the idea, of course, I will add more concrete comment
> when I send formal patch.
>
> Before, I would like to get a your agreement. :)
> If you solve my concern(early give up in hibernation) in your patch, I
> don't insist on my patch, either.
Ok. Let's try.
Please consider why priority=0 is not enough. zone_reclaimable_pages(zone) * 6
is a conservative value that worries about multi-thread races. While one task
is reclaiming, others can allocate/free memory concurrently; therefore,
even after priority=0, we have a chance of getting reclaimable pages on the LRU.
But in the hibernation case, almost all tasks were frozen before hibernation
called shrink_all_memory(). Therefore there is no race, and a priority=0 reclaim
can cover all LRU pages.
Is this enough explanation for you?
>
> Thanks for the comment, Kosaki.
On Thu, Mar 24, 2011 at 4:03 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Thu, Mar 24, 2011 at 3:16 PM, KOSAKI Motohiro
>> <[email protected]> wrote:
>> > Hi
>> >
>> >> Thanks for your effort, Kosaki.
>> >> But I still doubt this patch is good.
>> >>
>> >> This patch makes early oom killing in hibernation as it skip
>> >> all_unreclaimable check.
>> >> Normally, hibernation needs many memory so page_reclaim pressure
>> >> would be big in small memory system. So I don't like early give up.
>> >
>> > Wait. When occur big pressure? hibernation reclaim pressure
>> > (sc->nr_to_recliam) depend on physical memory size. therefore
>> > a pressure seems to don't depend on the size.
>>
>> It depends on physical memory size and /sys/power/image_size.
>> If you want to tune image size bigger, reclaim pressure would be big.
>
> Ok, _If_ I want.
> However, I haven't seen desktop people customize it.
>
>
>> >> Do you think my patch has a problem? Personally, I think it's very
>> >> simple and clear. :)
>> >
>> > To be honest, I dislike following parts. It's madness on madness.
>> >
>> > static bool zone_reclaimable(struct zone *zone)
>> > {
>> > if (zone->all_unreclaimable)
>> > return false;
>> >
>> > return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> > }
>> >
>> >
>> > The function require a reviewer know
>> >
>> > o pages_scanned and all_unreclaimable are racy
>>
>> Yes. That part should be written down of comment.
>>
>> > o at hibernation, zone->all_unreclaimable can be false negative,
>> > but can't be false positive.
>>
>> The comment of all_unreclaimable already does explain it well, I think.
>
> Where is?
>
>
>> > And, a function comment of all_unreclaimable() says
>> >
>> > /*
>> > * As hibernation is going on, kswapd is freezed so that it can't mark
>> > * the zone into all_unreclaimable. It can't handle OOM during hibernation.
>> > * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
>> > */
>> >
>> > But, now it is no longer copy of kswapd algorithm.
>>
>> The comment don't say it should be a copy of kswapd.
>
> I meant the comments says
>
> * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
>
> but now it isn't aswell as kswapd.
>
> I think it's critical important. If people can't understand why the
> algorithm was choosed, anyone will break the code again sooner or later.
>
>
>> > If you strongly prefer this idea even if you hear above explanation,
>> > please consider to add much and much comments. I can't say
>> > current your patch is enough readable/reviewable.
>>
>> My patch isn't a formal patch for merge but just a concept to show.
>> If you agree the idea, of course, I will add more concrete comment
>> when I send formal patch.
>>
>> Before, I would like to get a your agreement. :)
>> If you solve my concern(early give up in hibernation) in your patch, I
>> don't insist on my patch, either.
>
> Ok. Let's try.
>
> Please concern why priority=0 is not enough. zone_reclaimable_pages(zone) * 6
> is a conservative value of worry about multi thread race. While one task
> is reclaiming, others can allocate/free memory concurrently. Therefore,
> even after priority=0, there is still a chance of finding reclaimable pages
> on the LRU. But in the hibernation case, almost all tasks were frozen before
> hibernation calls shrink_all_memory(), so there is no such race and
> priority=0 reclaim can cover all LRU pages.
>
> Is this enough explanation for you?
For example, in a 4G desktop system:
32M of full scanning fails to reclaim a page.
It's under 1% coverage.
Is it enough to give up?
I am not sure.
--
Kind regards,
Minchan Kim
> For example, in a 4G desktop system:
> 32M of full scanning fails to reclaim a page.
> It's under 1% coverage.
?? I'm sorry, I haven't caught that.
Where does 32M come from?
> Is it enough to give up?
> I am not sure.
On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> For example, in a 4G desktop system:
>> 32M of full scanning fails to reclaim a page.
>> It's under 1% coverage.
>
> ?? I'm sorry, I haven't caught that.
> Where does 32M come from?
(1<<12) * 4K + (1<<11) * 4K + .. + (1 << 0) * 4K in shrink_zones.
--
Kind regards,
Minchan Kim
On Thu, Mar 24, 2011 at 4:34 PM, Minchan Kim <[email protected]> wrote:
> On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> For example, in a 4G desktop system:
>>> 32M of full scanning fails to reclaim a page.
>>> It's under 1% coverage.
>>
>> ?? I'm sorry, I haven't caught that.
>> Where does 32M come from?
>
> (1<<12) * 4K + (1<<11) * 4K + .. + (1 << 0) * 4K in shrink_zones.
Stupid me.
Sorry, my calculation is totally wrong.
Your explanation was perfect.
Okay. I don't have any objection to your solution.
--
Kind regards,
Minchan Kim
> On Thu, Mar 24, 2011 at 4:28 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> For example, in a 4G desktop system:
> >> 32M of full scanning fails to reclaim a page.
> >> It's under 1% coverage.
> >
> > ?? I'm sorry, I haven't caught that.
> > Where does 32M come from?
>
> (1<<12) * 4K + (1<<11) * 4K + .. + (1 << 0) * 4K in shrink_zones.
(lru-pages>>12) + (lru-pages>>11) + (lru-pages>>10) ... =~ 2 * lru-page
?
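A quick user-space check of that estimate (a throwaway illustration, not
kernel code): summing lru_pages >> priority over the whole priority loop
comes out just under 2 * lru_pages, whatever the LRU size.

	#include <stdio.h>

	int main(void)
	{
		unsigned long lru_pages = 1000000UL;	/* arbitrary example size */
		unsigned long total = 0;
		int priority;

		/* same walk as the direct-reclaim priority loop: 12 .. 0 */
		for (priority = 12; priority >= 0; priority--)
			total += lru_pages >> priority;

		printf("scanned %lu pages, %.3fx the LRU\n",
		       total, (double)total / lru_pages);
		return 0;
	}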
On Thu, Mar 24, 2011 at 2:35 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Hi Minchan,
>
>> Nick's original goal is to prevent OOM killing until all zones we're
>> interested in are unreclaimable, and whether a zone is reclaimable or not
>> depends on kswapd. Nick's original solution just peeks at
>> zone->all_unreclaimable, but I made it dirty when we started considering
>> the kswapd freeze in hibernation. So I think we still need it to handle
>> the kswapd-freeze problem, and we should add the original behavior we
>> missed at that time, like below.
>>
>> static bool zone_reclaimable(struct zone *zone)
>> {
>> if (zone->all_unreclaimable)
>> return false;
>>
>> return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
>> }
>>
>> If you remove the logic, the problem Nick addressed would show up
>> again. How about addressing the problem in your patch? If you
>> remove the logic, __alloc_pages_direct_reclaim loses the chance to call
>> drain_all_pages. Of course, it was a side effect, but we should handle
>> it.
>
> Ok, you have persuaded me. Losing the drain_all_pages() chance has
> a risk.
>
>> And my last concern is: are we going the right way?
>
>
>> I think the fundamental cause of this problem is that pages_scanned and
>> all_unreclaimable race with each other, so isn't fixing the race the
>> right way?
>
> Hmm..
> If we can avoid a lock, we should, I think, for performance reasons.
> Therefore I'd like to contain the issue in do_try_to_free_pages(); it's a
> slow path.
>
> Is the following patch acceptable to you? It:
> o rewrites the description
> o avoids mixing zone->all_unreclaimable and zone->pages_scanned
> o avoids reintroducing the hibernation issue
> o doesn't touch the fast path
>
>
>> If that is hard or very costly, your approach and mine can be the fallback.
>
> -----------------------------------------------------------------
> From f3d277057ad3a092aa1c94244f0ed0d3ebe5411c Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Sat, 14 May 2011 05:07:48 +0900
> Subject: [PATCH] vmscan: all_unreclaimable() use zone->all_unreclaimable as the name
>
> The all_unreclaimable check in direct reclaim was introduced in 2.6.19
> by the following commit.
>
> 2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> It then went through a strange history. First, the following commit broke
> the logic unintentionally.
>
> 2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
> costly-order allocations
>
> Two years later, I found the obviously meaningless code fragment and
> restored the original intention with the following commit.
>
> 2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
> return value when priority==0
>
> But the logic didn't work when a 32bit highmem system goes into
> hibernation, and Minchan slightly changed the algorithm and fixed it.
>
> 2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
> in direct reclaim path
>
> Recently, however, Andrey Vagin found a new corner case. Look,
>
> struct zone {
> ..
> int all_unreclaimable;
> ..
> unsigned long pages_scanned;
> ..
> }
>
> zone->all_unreclaimable and zone->pages_scanned are neither atomic
> variables nor protected by a lock. Therefore a zone can end up in a state
> with zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
> the current all_unreclaimable() returns false even though
> zone->all_unreclaimable=1.
>
> Is this an ignorable minor issue? No. Unfortunately, x86 has a very
> small DMA zone and it becomes zone->all_unreclaimable=1 easily, and
> once it becomes all_unreclaimable=1, it never goes back to
> all_unreclaimable=0. Why? If all_unreclaimable=1, vmscan only tries
> DEF_PRIORITY reclaim, and a-few-lru-pages>>DEF_PRIORITY is always 0.
> That means no page scan at all!
>
> Eventually, the oom-killer never works on such systems. That said, we
> can't use zone->pages_scanned for this purpose. This patch restores
> all_unreclaimable() to using zone->all_unreclaimable as before, and in
> addition adds an oom_killer_disabled check to avoid reintroducing the
> issue of commit d1908362.
>
> Reported-by: Andrey Vagin <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Thanks for the good discussion, Kosaki.
--
Kind regards,
Minchan Kim
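The diff itself isn't quoted in this thread, but from the changelog above the
resulting check is shaped roughly like the sketch below; this is only an
illustration of the described approach (trust zone->all_unreclaimable again,
and bail out while the OOM killer is disabled during hibernation), not
necessarily the exact committed code.

	/* All zones in the zonelist unreclaimable?  Sketch of the described fix. */
	static bool all_unreclaimable(struct zonelist *zonelist,
				      struct scan_control *sc)
	{
		struct zoneref *z;
		struct zone *zone;

		for_each_zone_zonelist_nodemask(zone, z, zonelist,
				gfp_zone(sc->gfp_mask), sc->nodemask) {
			if (!populated_zone(zone))
				continue;
			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
				continue;
			if (!zone->all_unreclaimable)
				return false;
		}
		return true;
	}

	/* ... and at the end of do_try_to_free_pages(): */

		/* kswapd is frozen in hibernation and OOM is disabled anyway */
		if (oom_killer_disabled)
			return 0;

		/* top priority shrink_zones still had more to do? don't OOM, then */
		if (!all_unreclaimable(zonelist, sc))
			return 1;

		return 0;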
On Tue, Mar 22, 2011 at 08:08:41PM +0900, KOSAKI Motohiro wrote:
> commit 2687a356 (Add lock_page_killable) introduced killable
> lock_page(). Similarly, this patch introduces a killable
> wait_on_page_locked().
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim
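The patch body isn't quoted here; by analogy with lock_page_killable() the
new helper presumably ends up looking something like the sketch below
(wait_on_page_bit_killable() is assumed to be added alongside it by the same
series):

	/*
	 * Sketch only: wait for PG_locked to clear, but let a fatal signal
	 * (SIGKILL) interrupt the wait, mirroring lock_page_killable().
	 */
	static inline int wait_on_page_locked_killable(struct page *page)
	{
		if (PageLocked(page))
			return wait_on_page_bit_killable(page, PG_locked);
		return 0;
	}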
On Tue, Mar 22, 2011 at 08:09:29PM +0900, KOSAKI Motohiro wrote:
> When the oom killer fires, most processes get stuck at one of the
> following two points.
>
> 1) __alloc_pages_nodemask
> 2) __lock_page_or_retry
>
> 1) is not very problematic because TIF_MEMDIE makes the allocation fail
> and gets the task out of the page allocator. 2) is more problematic: in an
> OOM situation, zones typically have no page cache at all, and memory
> starvation can reduce IO performance heavily. When a fork bomb occurs,
> the TIF_MEMDIE task not dying quickly means the fork bomb may create
> new processes faster than the oom-killer can kill them. Then the system
> may livelock.
>
> This patch makes pagefault interruptible by SIGKILL.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Looks like a cool idea.
--
Kind regards,
Minchan Kim
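The mechanics aren't shown in the quoted hunk below, but the idea is that
__lock_page_or_retry() can then wait killably when the new flag is set,
roughly as sketched here (an illustration of the approach, not the exact
patch):

	/*
	 * Sketch: with FAULT_FLAG_ALLOW_RETRY the fault handler drops
	 * mmap_sem and waits for the page lock; with FAULT_FLAG_KILLABLE
	 * that wait can be interrupted by SIGKILL, so an OOM victim stops
	 * blocking on the page lock and the fault returns VM_FAULT_RETRY.
	 */
	int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
				 unsigned int flags)
	{
		if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
			__lock_page(page);
			return 1;
		}

		up_read(&mm->mmap_sem);
		if (flags & FAULT_FLAG_KILLABLE)
			wait_on_page_locked_killable(page);
		else
			wait_on_page_locked(page);
		return 0;
	}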
On 03/22, KOSAKI Motohiro wrote:
>
> This patch makes pagefault interruptible by SIGKILL.
Not a comment, but the question...
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1035,6 +1035,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> if (user_mode_vm(regs)) {
> local_irq_enable();
> error_code |= PF_USER;
> + flags |= FAULT_FLAG_KILLABLE;
OK, this is clear.
I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
but check PF_USER when we get VM_FAULT_RETRY? I mean,
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
if (!(error_code & PF_USER))
no_context(...);
return;
}
Probably not... but I can't find any example of in-kernel fault which
can be broken by -EFAULT if current was killed.
mm_release()->put_user(clear_child_tid) should be fine...
Just curious, I feel I missed something obvious.
Oleg.
On Thu, Mar 24, 2011 at 10:13 AM, Oleg Nesterov <[email protected]> wrote:
>
> I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
> but check PF_USER when we get VM_FAULT_RETRY? I mean,
>
>         if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
>                 if (!(error_code & PF_USER))
>                         no_context(...);
>                 return;
>         }
I agree, we should do this.
> Probably not... but I can't find any example of in-kernel fault which
> can be broken by -EFAULT if current was killed.
There's no way that can validly break anything, since any such
codepath has to be able to handle -EFAULT for other reasons anyway.
The only issue is whether we're ok with a regular write() system call
(for example) not being atomic in the presence of a fatal signal. So
it does change semantics, but I think it changes it in a good way
(technically POSIX requires atomicity, but on the other hand,
technically POSIX also doesn't talk about the process being killed,
and writes would still be atomic for the case where they actually
return. Not to mention NFS etc where writes have never been atomic
anyway, so a program that relies on strict "all or nothing" write
behavior is fundamentally broken to begin with).
Linus
> On Thu, Mar 24, 2011 at 10:13 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > I am wondering, can't we set FAULT_FLAG_KILLABLE unconditionally
> > but check PF_USER when we get VM_FAULT_RETRY? I mean,
> >
> >         if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> >                 if (!(error_code & PF_USER))
> >                         no_context(...);
> >                 return;
> >         }
>
> I agree, we should do this.
>
> > Probably not... but I can't find any example of in-kernel fault which
> > can be broken by -EFAULT if current was killed.
>
> There's no way that can validly break anything, since any such
> codepath has to be able to handle -EFAULT for other reasons anyway.
>
> The only issue is whether we're ok with a regular write() system call
> (for example) not being atomic in the presence of a fatal signal. So
> it does change semantics, but I think it changes it in a good way
> (technically POSIX requires atomicity, but on the other hand,
> technically POSIX also doesn't talk about the process being killed,
> and writes would still be atomic for the case where they actually
> return. Not to mention NFS etc where writes have never been atomic
> anyway, so a program that relies on strict "all or nothing" write
> behavior is fundamentally broken to begin with).
Ok, I wasn't brave enough. Will do.
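What that "will do" presumably amounts to is Oleg's suggestion placed in the
x86 fault handler, sketched below; this is illustrative only, and the
no_context() arguments follow the 2.6.38-era signature as an assumption.

	/* Inside do_page_fault(), after handle_mm_fault() -- sketch only. */

		/* FAULT_FLAG_KILLABLE is now set for kernel faults too. */
		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
			if (!(error_code & PF_USER))
				no_context(regs, error_code, address);
			return;
		}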