LinuxLists.cc - Linux 2.6.38

2011-03-15 01:50:04

Subject: Linux 2.6.38

Not a lot of changes since -rc8. Most notably perhaps some late nfs
and btrfs work, and a mips update. Along with some more vfs RCU lookup
fallout (which would only be noticeable with the filesystem exported
with nfsd, which is why nobody ever noticed).

And the usual driver updates, mostly media and GPU, but some
networking too. The appended shortlog is for the changes since -rc8,
and gives some feel for it. Nothing really too exciting, I think.

As to the "big picture", ie all the changes since 2.6.37, my personal
favorite remains the VFS name lookup changes. They did end up causing
some breakage, and Al has made it clear that he wants more cleanups,
but on the whole I think it was surprisingly smooth. I think we had
more problems with random other components (nasty memory corruption in
networking etc) than with the rather fundamental path lookup change.

So I'm hoping this ends up being a fairly calm release despite some
really deep changes like that.

Linus

---

Abhilash K V (1):
ASoC: AM3517: Update codec name after multi-component update

Al Viro (12):
minimal fix for do_filp_open() race
unfuck proc_sysctl ->d_compare()
nd->inode is not set on the second attempt in path_walk()
/proc/self is never going to be invalidated...
reiserfs xattr ->d_revalidate() shouldn't care about RCU
ceph: fix d_revalidate oopsen on NFS exports
fuse: fix d_revalidate oopsen on NFS exports
gfs2: fix d_revalidate oopsen on NFS exports
ocfs2: fix d_revalidate oopsen on NFS exports
jfs: fix d_revalidate oopsen on NFS exports
fat: fix d_revalidate oopsen on NFS exports
compat breakage in preadv() and pwritev()

Andrea Arcangeli (2):
x86/mm: Fix pgd_lock deadlock
thp: fix page_referenced to modify mapcount/vm_flags only if page is found

Andrey Vagin (1):
x86/mm: Handle mm_fault_error() in kernel space

Andy Adamson (2):
NFSv4: remove duplicate clientid in struct nfs_client
NFSv4.1 reclaim complete must wait for completion

Andy Walls (2):
[media] cx23885: Revert "Check for slave nack on all transactions"
[media] cx23885: Remove unused 'err:' labels to quiet compiler warning

Anoop P A (2):
MIPS: Select R4K timer lib for all MSP platforms
MIPS: MSP: Fix MSP71xx bpci interrupt handler return value

Antony Pavlov (2):
mtd: jedec_probe: Change variable name from cfi_p to cfi
mtd: jedec_probe: initialise make sector erase command variable

Antti Sepp?l? (1):
[media] Fix sysfs rc protocol lookup for rc-5-sz

Arnaldo Carvalho de Melo (1):
perf symbols: Fix vmlinux path when not using --symfs

Arnaud Patard (1):
[media] mantis_pci: remove asm/pgtable.h include

Axel Lin (3):
mtd: add "platform:" prefix for platform modalias
gpio: add MODULE_DEVICE_TABLE
watchdog: hpwdt: eliminate section mismatch warning

Balbir Singh (1):
sched: Fix sched rt group scheduling when hierachy is enabled

Ben Hutchings (1):
sunrpc: Propagate errors from xs_bind() through xs_create_sock()

Benjamin Herrenschmidt (2):
powerpc/iseries: Fix early init access to lppaca
powerpc/pseries: Disable VPNH feature

Benny Halevy (1):
NFSD: fix decode_cb_sequence4resok

Chris Mason (4):
Btrfs: fix regressions in copy_from_user handling
Btrfs: deal with short returns from copy_from_user
Btrfs: make sure not to return overlapping extents to fiemap
Btrfs: break out of shrink_delalloc earlier

Chuck Lever (1):
NFS: NFSROOT should default to "proto=udp"

Cliff Wickman (1):
x86, UV: Initialize the broadcast assist unit base destination
node id properly

Dan Carpenter (1):
watchdog: sch311x_wdt: fix printk condition

Daniel J Blueman (2):
x86, build: Make sure mkpiggy fails on read error
btrfs: fix dip leak

Daniel Turull (1):
pktgen: fix errata in show results

Dave Airlie (3):
drm/radeon: add pageflip hooks for fusion
drm/radeon: fix page flipping hangs on r300/r400
drm/radeon: fix problem with changing active VRAM size. (v2)

David Daney (5):
MIPS: Add an unreachable return statement to satisfy buggy GCCs.
MIPS: Fix GCC-4.6 'set but not used' warning in signal*.c
MIPS: Remove unused code from arch/mips/kernel/syscall.c
MIPS: Fix GCC-4.6 'set but not used' warning in ieee754int.h
MIPS: Fix GCC-4.6 'set but not used' warning in arch/mips/mm/init.c

David Howells (2):
MN10300: The SMP_ICACHE_INV_FLUSH_RANGE IPI command does not exist
MN10300: atomic_read() should ensure it emits a load

David S. Miller (2):
ipv4: Fix erroneous uses of ifa_address.
ipv6: Don't create clones of host routes.

Deng-Cheng Zhu (5):
MIPS, Perf-events: Work with irq_work
MIPS, Perf-events: Work with the new PMU interface
MIPS, Perf-events: Fix event check in validate_event()
MIPS, Perf-events: Work with the new callchain interface
MIPS, Perf-events: Use unsigned delta for right shift in event update

Devin Heitmueller (2):
[media] au0828: fix VBI handling when in V4L2 streaming mode
[media] cx18: Add support for Hauppauge HVR-1600 models with s5h1411

Dmitry Kravkov (4):
bnx2x: fix non-pmf device load flow
bnx2x: fix link notification
bnx2x: (NPAR) prevent HW access in D3 state
bnx2x: fix MaxBW configuration

Doe, YiCheng (1):
ipmi: Fix IPMI errors due to timing problems

Florian Fainelli (3):
r6040: bump to version 0.27 and date 23Feb2011
MIPS: MTX-1: Make au1000_eth probe all PHY addresses
MIPS: Alchemy: Fix reset for MTX-1 and XXS1500

Frank Filz (1):
(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause
problems if bit 31 or 63 are set in fileid

Grant Likely (1):
i2c-ocores: Fix pointer type mismatch error

G?ran Weinholt (1):
net/smsc911x.c: Set the VLAN1 register to fix VLAN MTU problem

Hans de Goede (2):
hwmon/f71882fg: Fix a typo in a comment
hwmon/f71882fg: Set platform drvdata to NULL later

Huang Weiyi (1):
nfs4: remove duplicated #include

Hugh Dickins (1):
thp+memcg-numa: fix BUG at include/linux/mm.h:370!

J. Bruce Fields (2):
nfsd4: fix bad pointer on failure to find delegation
fs/dcache: allow d_obtain_alias() to return unhashed dentries

Jarod Wilson (3):
[media] nuvoton-cir: fix wake from suspend
[media] mceusb: don't claim multifunction device non-IR parts
[media] tda829x: fix regression in probe functions

Jeff Layton (1):
nfs: close NFSv4 COMMIT vs. CLOSE race

Jesper Juhl (1):
SUNRPC: Remove resource leak in svc_rdma_send_error()

Jiri Slaby (1):
watchdog: sbc_fitpc2_wdt, fix crash on systems without DMI_BOARD_NAME

Joakim Tjernlund (1):
mtd: fix race in cfi_cmdset_0001 driver

Jon Mason (1):
vxge: update MAINTAINERS

Jovi Zhang (1):
nfs: fix compilation warning

Lin Ming (1):
perf symbols: Avoid resolving [kernel.kallsyms] to real path for
buildid cache

Linus Torvalds (2):
Revert "oom: oom_kill_process: fix the child_points logic"
Linux 2.6.38

Lukas Czerner (1):
block: fix mis-synchronisation in blkdev_issue_zeroout()

Maksim Rayskiy (1):
MIPS: Move idle task creation to work queue

Malcolm Priestley (1):
[media] DM04/QQBOX memcpy to const char fix

Marco Stornelli (1):
Check for immutable/append flag in fallocate path

Mark Brown (4):
ASoC: Fix broken bitfield definitions in WM8978
ASoC: Use the correct DAPM context when cleaning up final widget set
ASoC: Fix typo in late revision WM8994 DAC2R name
ASoC: Ensure WM8958 gets all WM8994 late revision widgets

Matt Turner (1):
alpha: fix compile error from IRQ clean up

Mauro Carvalho Chehab (1):
[media] ir-raw: Properly initialize the IR event (BZ#27202)

Maurus Cuelenaere (1):
MIPS: Jz4740: Add HAVE_CLK

Maxim Levitsky (1):
mtd: mtd_blkdevs: fix double free on error path

Miao Xie (1):
btrfs: fix not enough reserved space

Michael (1):
[media] ivtv: Fix corrective action taken upon DMA ERR interrupt
to avoid hang

Michal Marek (1):
kbuild: Fix computing srcversion for modules

Naga Chumbalkar (2):
x86: Don't check for BIOS corruption in first 64K when there's no need to
[CPUFREQ] pcc-cpufreq: don't load driver if get_freq fails during init.

Neil Horman (1):
rds: prevent BUG_ON triggering on congestion map updates

Nicholas Bellinger (1):
[SCSI] target: Fix t_transport_aborted handling in LUN_RESET +
active I/O shutdown

Nicolas Kaiser (1):
drivers/net/macvtap: fix error check

Nils Carlson (2):
bonding 802.3ad: Fix the state machine locking v2
bonding 802.3ad: Rename rx_machine_lock to state_machine_lock

Ohad Ben-Cohen (1):
mmc: fix CONFIG_MMC_UNSAFE_RESUME regression

Oleg Nesterov (1):
oom: oom_kill_process: fix the child_points logic

Olivier Grenie (1):
[media] DiB7000M: add pid filtering

Pawel Osciak (1):
[media] Fix double free of video_device in mem2mem_testdev

Rainer Weikusat (1):
net: fix multithreaded signal handling in unix recv routines

Rajendra Nayak (1):
i2c-omap: Program I2C_WE on OMAP4 to enable i2c wakeup

Randy Dunlap (1):
net: bridge builtin vs. ipv6 modular

Ricardo Labiaga (1):
NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY

Robert Millan (1):
MIPS: Loongson: Remove ad-hoc cmdline default

Sebastian Andrzej Siewior (1):
x86: ce4100: Set pci ops via callback instead of module init

Shawn Lin (1):
r6040: fix multicast operations

Stanislav Fomichev (1):
nfs: add kmalloc return value check in decode_and_add_ds

Stanislaw Gruszka (1):
mtd: amd76xrom: fix oops at boot when resources are not available

Stefan Oberhumer (1):
MIPS: Clear the correct flag in sysmips(MIPS_FIXADE, ...).

Stefan Weil (1):
MIPS: Loongson: Fix potentially wrong string handling

Stephen Rothwell (2):
sysctl: the include of rcupdate.h is only needed in the kernel
sysctl: the include of rcupdate.h is only needed in the kernel

Sven Barth (1):
[media] cx25840: fix probing of cx2583x chips

Takashi Iwai (1):
drm/i915: Revive combination mode for backlight control

Thomas Gleixner (1):
MIPS: Replace deprecated spinlock initialization

Thomas Graf (1):
net: Enter net/ipv6/ even if CONFIG_IPV6=n

Timo Warns (1):
Fix corrupted OSF partition table parsing

Tkhai Kirill (1):
MN10300: Proper use of macros get_user() in the case of
incremented pointers

Trond Myklebust (5):
SUNRPC: Close a race in __rpc_wait_for_completion_task()
NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses
NFSv4.1: Fix the handling of the SEQUENCE status bits
NFSv4: Fix the setlk error handler
NFSv4: nfs4_state_mark_reclaim_nograce() should be static

Vasiliy Kulikov (1):
net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules

Wim Van Sebroeck (3):
watchdog: cpwd: Fix buffer-overflow
watchdog: sch311x_wdt: Fix LDN active check
watchdog: w83697ug_wdt: Fix set bit 0 to activate GPIO2

Wolfram Sang (1):
i2c-eg20t: include slab.h for memory allocations

Wu Zhangjin (5):
MIPS, Tracing: Speed up function graph tracer
MIPS, Tracing: Substitute in_kernel_space() for in_module()
MIPS, Tracing: Clean up prepare_ftrace_return()
MIPS, Tracing: Clean up ftrace_make_nop()
MIPS, Tracing: Fix set_graph_function of function graph tracer

Yinghai Lu (1):
x86, numa: Fix numa_emulation code with memory-less node0

Yoichi Yuasa (1):
MIPS: Fix always CONFIG_LOONGSON_UART_BASE=y

[email protected] (1):
ariadne: remove redundant NULL check

roel (1):
nfsd: wrong index used in inner loop

sensoray-dev (1):
[media] s2255drv: firmware re-loading changes

stephen hemminger (1):
ip6ip6: autoload ip6 tunnel

2011-03-15 03:13:49

by David Rientjes

[permalink] [raw]

Subject: Re: Linux 2.6.38

This kernel includes a broken commit that was merged a couple of hours
before release:

commit dc1b83ab08f1954335692cdcd499f78c94f4c42a
Author: Oleg Nesterov <[email protected]>
Date: Mon Mar 14 20:05:30 2011 +0100

oom: oom_kill_process: fix the child_points logic

oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.

Also, "children has a different mm" doesn't match the reality, we should
check child->mm != t->mm. This check is not exactly correct if t->mm ==
NULL but this doesn't really matter, oom_kill_task() will kill them
anyway.

Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

As a result of this change, the oom killer will no longer attempt to
sacrifice a child of the selected process in favor of the parent unless
its memory usage exceeds the parent (and this will be an unreachable state
once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged
from -mm).

This means systems running a webserver, for example, will kill the
webserver itself in oom conditions and not one of its threads serving a
connection; simply forking too many client connections in this scenario
would lead to an oom condition that would kill the server instead of one
of its threads.

Admins who find this behavior to cause disruptions in service should apply
the following revert.

Signed-off-by: David Rientjes <[email protected]>
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -458,10 +458,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
- struct task_struct *victim;
+ struct task_struct *victim = p;
struct task_struct *child;
- struct task_struct *t;
- unsigned int victim_points;
+ struct task_struct *t = p;
+ unsigned int victim_points = 0;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -487,15 +487,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
- victim_points = oom_badness(p, mem, nodemask, totalpages);
- victim = p;
- t = p;
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;

- if (child->mm == t->mm)
- continue;
/*
* oom_badness() returns 0 if the thread is unkillable
*/

2011-03-15 03:14:16

by Steven Rostedt

[permalink] [raw]

Subject: Re: Linux 2.6.38

On Mon, Mar 14, 2011 at 06:49:37PM -0700, Linus Torvalds wrote:
>
> So I'm hoping this ends up being a fairly calm release despite some
> really deep changes like that.

It's so calm, it's like it's not even there.

-- Steve

2011-03-15 04:06:14

by Steven Rostedt

[permalink] [raw]

Subject: Re: Linux 2.6.38

On Mon, Mar 14, 2011 at 08:13:38PM -0700, David Rientjes wrote:
> This kernel includes a broken commit that was merged a couple of hours
> before release:
>
> commit dc1b83ab08f1954335692cdcd499f78c94f4c42a
> Author: Oleg Nesterov <[email protected]>
> Date: Mon Mar 14 20:05:30 2011 +0100
>
> oom: oom_kill_process: fix the child_points logic
>

Don't worry. If you download the patch for 2.6.38, you'll see that the
revert was in the final release.

Woo hoo, ketchup is useful again!

-- Steve

2011-03-15 04:15:36

by Linus Torvalds

[permalink] [raw]

Subject: Re: Linux 2.6.38

On Mon, Mar 14, 2011 at 8:14 PM, Steven Rostedt <[email protected]> wrote:
> On Mon, Mar 14, 2011 at 06:49:37PM -0700, Linus Torvalds wrote:
>>
>> So I'm hoping this ends up being a fairly calm release despite some
>> really deep changes like that.
>
> It's so calm, it's like it's not even there.

Yes, it's a very Zen release.

I'd uploaded the patch and tar-ball, but forgot to actually push out.
Usually it's the other way around.

Linus

2011-03-15 04:17:07

by Linus Torvalds

[permalink] [raw]

Subject: Re: Linux 2.6.38

On Mon, Mar 14, 2011 at 8:13 PM, David Rientjes <[email protected]> wrote:
> This kernel includes a broken commit that was merged a couple of hours
> before release:

Actually, it doesn't. It got reverted before the release because of
the worries about it.

Linus

2011-03-15 04:36:26

by Andrew Morton

[permalink] [raw]

Subject: Re: Linux 2.6.38

On Mon, 14 Mar 2011 20:13:38 -0700 (PDT) David Rientjes <[email protected]> wrote:

> once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged
> from -mm

Please (re)send the patches which you believe should be merged into
2.6.38 to address the problems which Oleg found, and any other critical
problems. Not in a huge rush - let's get this right.

2011-03-15 04:50:40

by David Rientjes

[permalink] [raw]

Subject: Re: Linux 2.6.38

On Mon, 14 Mar 2011, Andrew Morton wrote:

> > once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged
> > from -mm
>
> Please (re)send the patches which you believe should be merged into
> 2.6.38 to address the problems which Oleg found, and any other critical
> problems. Not in a huge rush - let's get this right.
>

In my testing, Oleg's three test cases that he sent to the security list
and cc'd us on get appropriately oom killed once swap is exhausted or
swapoff -a is used on mmotm-2011-03-10-16-42 because of these two patches:

oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
oom-skip-zombies-when-iterating-tasklist.patch

He also presented a test case on linux-mm that caused the oom killer to
avoid acting if a thread is ptracing a thread in the exit path with
PTRACE_O_TRACEEXIT. That should be fixed with

http://marc.info/?l=linux-mm&m=129997893430351

that has yet to see -mm. There are no other test cases that have been
presented that cause undesired behavior.

That said, my approach to doing this has been to avoid arbitrary
heuristics for special cases and address known issues by adding the
appropriate logic in the oom killer. For example, the ptrace problem that
Oleg presented showed that the oom killer logic incorrectly deferred doing
anything when an eligible thread was PF_EXITING. It had done that
believing that nothing would stop the thread from exiting or current
would be given access to memory reserves itself and that assumption was
broken for PTRACE_O_TRACEEXIT. My patch above, in combination with
Andrey's patch that only considers threads with valid mm's, fixes that
issue because we'll now only defer if you still have an attached mm, are
PF_EXITING, and are not being traced.

If, at some point, there is another gap in the exit code where a thread
may hold PF_EXITING with a valid mm for an indefinite period, we'll need
to address that in the oom killer as well. We use PF_EXITING specifically
in the oom killer to identify tasks that are going to exit soon and need
handling for any case where that isn't guaranteed. Anything else results
in needlessly killing other tasks or, in the worst case, panicking when
there is nothing left that is eligible.

2011-03-15 05:02:31

by David Rientjes

[permalink] [raw]

Subject: Re: Linux 2.6.38

On Mon, 14 Mar 2011, Linus Torvalds wrote:

> > This kernel includes a broken commit that was merged a couple of hours
> > before release:
>
> Actually, it doesn't. It got reverted before the release because of
> the worries about it.
>

Looks good, thanks!

2011-03-15 06:24:20

by Andrew Morton

[permalink] [raw]

Subject: Re: Linux 2.6.38

On Mon, 14 Mar 2011 21:50:24 -0700 (PDT) David Rientjes <[email protected]> wrote:

> On Mon, 14 Mar 2011, Andrew Morton wrote:
>
> > > once oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch is merged
> > > from -mm
> >
> > Please (re)send the patches which you believe should be merged into
> > 2.6.38 to address the problems which Oleg found, and any other critical
> > problems. Not in a huge rush - let's get this right.
> >
>
> In my testing, Oleg's three test cases that he sent to the security list
> and cc'd us on get appropriately oom killed once swap is exhausted or
> swapoff -a is used on mmotm-2011-03-10-16-42 because of these two patches:
>
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> oom-skip-zombies-when-iterating-tasklist.patch
>
> He also presented a test case on linux-mm that caused the oom killer to
> avoid acting if a thread is ptracing a thread in the exit path with
> PTRACE_O_TRACEEXIT. That should be fixed with
>
> http://marc.info/?l=linux-mm&m=129997893430351
>
> that has yet to see -mm. There are no other test cases that have been
> presented that cause undesired behavior.
>
> That said, my approach to doing this has been to avoid arbitrary
> heuristics for special cases and address known issues by adding the
> appropriate logic in the oom killer. For example, the ptrace problem that
> Oleg presented showed that the oom killer logic incorrectly deferred doing
> anything when an eligible thread was PF_EXITING. It had done that
> believing that nothing would stop the thread from exiting or current
> would be given access to memory reserves itself and that assumption was
> broken for PTRACE_O_TRACEEXIT. My patch above, in combination with
> Andrey's patch that only considers threads with valid mm's, fixes that
> issue because we'll now only defer if you still have an attached mm, are
> PF_EXITING, and are not being traced.
>
> If, at some point, there is another gap in the exit code where a thread
> may hold PF_EXITING with a valid mm for an indefinite period, we'll need
> to address that in the oom killer as well. We use PF_EXITING specifically
> in the oom killer to identify tasks that are going to exit soon and need
> handling for any case where that isn't guaranteed. Anything else results
> in needlessly killing other tasks or, in the worst case, panicking when
> there is nothing left that is eligible.

So we're talking about three patches:

oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
oom-skip-zombies-when-iterating-tasklist.patch
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch

all appended below.

About all of which Oleg had serious complaints, some of which haven't
yet been addressed.

And that's OK. As I said, please let's work through it and get it right.

From: David Rientjes <[email protected]>

This patch prevents unnecessary oom kills or kernel panics by reverting
two commits:

495789a5 (oom: make oom_score to per-process value)
cef1d352 (oom: multi threaded process coredump don't make deadlock)

First, 495789a5 (oom: make oom_score to per-process value) ignores the
fact that all threads in a thread group do not necessarily exit at the
same time.

It is imperative that select_bad_process() detect threads that are in the
exit path, specifically those with PF_EXITING set, to prevent needlessly
killing additional tasks. If a process is oom killed and the thread group
leader exits, select_bad_process() cannot detect the other threads that
are PF_EXITING by iterating over only processes. Thus, it currently
chooses another task unnecessarily for oom kill or panics the machine when
nothing else is eligible.

By iterating over threads instead, it is possible to detect threads that
are exiting and nominate them for oom kill so they get access to memory
reserves.

Second, cef1d352 (oom: multi threaded process coredump don't make
deadlock) erroneously avoids making the oom killer a no-op when an
eligible thread other than current isfound to be exiting. We want to
detect this situation so that we may allow that exiting thread time to
exit and free its memory; if it is able to exit on its own, that should
free memory so current is no loner oom. If it is not able to exit on its
own, the oom killer will nominate it for oom kill which, in this case,
only means it will get access to memory reserves.

Without this change, it is easy for the oom killer to unnecessarily target
tasks when all threads of a victim don't exit before the thread group
leader or, in the worst case, panic the machine.

Signed-off-by: David Rientjes <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrey Vagin <[email protected]>
Cc: <[email protected]> [2.6.38.x]
Signed-off-by: Andrew Morton <[email protected]>
---

mm/oom_kill.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff -puN mm/oom_kill.c~oom-prevent-unnecessary-oom-kills-or-kernel-panics mm/oom_kill.c
--- a/mm/oom_kill.c~oom-prevent-unnecessary-oom-kills-or-kernel-panics
+++ a/mm/oom_kill.c
@@ -292,11 +292,11 @@ static struct task_struct *select_bad_pr
unsigned long totalpages, struct mem_cgroup *mem,
const nodemask_t *nodemask)
{
- struct task_struct *p;
+ struct task_struct *g, *p;
struct task_struct *chosen = NULL;
*ppoints = 0;

- for_each_process(p) {
+ do_each_thread(g, p) {
unsigned int points;

if (oom_unkillable_task(p, mem, nodemask))
@@ -324,7 +324,7 @@ static struct task_struct *select_bad_pr
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
+ if ((p->flags & PF_EXITING) && p->mm) {
if (p != current)
return ERR_PTR(-1UL);

@@ -337,7 +337,7 @@ static struct task_struct *select_bad_pr
chosen = p;
*ppoints = points;
}
- }
+ } while_each_thread(g, p);

return chosen;
}
_

From: Andrey Vagin <[email protected]>

We shouldn't defer oom killing if a thread has already detached its ->mm
and still has TIF_MEMDIE set. Memory needs to be freed, so find kill
other threads that pin the same ->mm or find another task to kill.

Signed-off-by: Andrey Vagin <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: <[email protected]> [2.6.38.x]
Signed-off-by: Andrew Morton <[email protected]>
---

mm/oom_kill.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff -puN mm/oom_kill.c~oom-skip-zombies-when-iterating-tasklist mm/oom_kill.c
--- a/mm/oom_kill.c~oom-skip-zombies-when-iterating-tasklist
+++ a/mm/oom_kill.c
@@ -299,6 +299,8 @@ static struct task_struct *select_bad_pr
do_each_thread(g, p) {
unsigned int points;

+ if (!p->mm)
+ continue;
if (oom_unkillable_task(p, mem, nodemask))
continue;

@@ -324,7 +326,7 @@ static struct task_struct *select_bad_pr
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if ((p->flags & PF_EXITING) && p->mm) {
+ if (p->flags & PF_EXITING) {
if (p != current)
return ERR_PTR(-1UL);

_

From: David Rientjes <[email protected]>

The oom killer naturally defers killing anything if it finds an eligible
task that is already exiting and has yet to detach its ->mm. This avoids
unnecessarily killing tasks when one is already in the exit path and may
free enough memory that the oom killer is no longer needed. This is
detected by PF_EXITING since threads that have already detached its ->mm
are no longer considered at all.

The problem with always deferring when a thread is PF_EXITING, however, is
that it may never actually exit when being traced, specifically if another
task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want
to defer in this case since there is no guarantee that thread will ever
exit without intervention.

This patch will now only defer the oom killer when a thread is PF_EXITING
and no ptracer has stopped its progress in the exit path. It also ensures
that a child is sacrificed for the chosen parent only if it has a
different ->mm as the comment implies: this ensures that the thread group
leader is always targeted appropriately.

Signed-off-by: David Rientjes <[email protected]>
Reported-by: Oleg Nesterov <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrey Vagin <[email protected]>
Cc: <[email protected]> [2.6.38.x]
Signed-off-by: Andrew Morton <[email protected]>
---

mm/oom_kill.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)

diff -puN mm/oom_kill.c~oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced mm/oom_kill.c
--- a/mm/oom_kill.c~oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced
+++ a/mm/oom_kill.c
@@ -31,6 +31,7 @@
#include <linux/memcontrol.h>
#include <linux/mempolicy.h>
#include <linux/security.h>
+#include <linux/ptrace.h>

int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
@@ -316,22 +317,29 @@ static struct task_struct *select_bad_pr
if (test_tsk_thread_flag(p, TIF_MEMDIE))
return ERR_PTR(-1UL);

- /*
- * This is in the process of releasing memory so wait for it
- * to finish before killing some other task by mistake.
- *
- * However, if p is the current task, we allow the 'kill' to
- * go ahead if it is exiting: this will simply set TIF_MEMDIE,
- * which will allow it to gain access to memory reserves in
- * the process of exiting and releasing its resources.
- * Otherwise we could get an easy OOM deadlock.
- */
if (p->flags & PF_EXITING) {
- if (p != current)
- return ERR_PTR(-1UL);
-
- chosen = p;
- *ppoints = 1000;
+ /*
+ * If p is the current task and is in the process of
+ * releasing memory, we allow the "kill" to set
+ * TIF_MEMDIE, which will allow it to gain access to
+ * memory reserves. Otherwise, it may stall forever.
+ *
+ * The loop isn't broken here, however, in case other
+ * threads are found to have already been oom killed.
+ */
+ if (p == current) {
+ chosen = p;
+ *ppoints = 1000;
+ } else {
+ /*
+ * If this task is not being ptraced on exit,
+ * then wait for it to finish before killing
+ * some other task unnecessarily.
+ */
+ if (!(task_ptrace(p->group_leader) &
+ PT_TRACE_EXIT))
+ return ERR_PTR(-1UL);
+ }
}

points = oom_badness(p, mem, nodemask, totalpages);
@@ -493,6 +501,8 @@ static int oom_kill_process(struct task_
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;

+ if (child->mm == p->mm)
+ continue;
/*
* oom_badness() returns 0 if the thread is unkillable
*/
_

2011-03-15 21:19:13

by Oleg Nesterov

[permalink] [raw]

Subject: Re: Linux 2.6.38

On 03/14, David Rientjes wrote:
>
> He also presented a test case on linux-mm that caused the oom killer to
> avoid acting if a thread is ptracing a thread in the exit path with
> PTRACE_O_TRACEEXIT. That should be fixed with
>
> http://marc.info/?l=linux-mm&m=129997893430351

I don't think it can fix this. I didn't verify this, but the slightly
different test-case below should have the same effect.

But this doesn't matter. We can fix this particular case, and we have
the problems with the coredump anyway.

What I can't understand is what exactly the first patch tries to fix.
When I ask you, you tell me that for_each_process() can miss the group
leader because it can exit before sub-threads. This must not happen,
or we have some serious bug triggered by your workload.

So, once again. Could you please explain the original problem and how
this patch helps?

Oleg.

#include <unistd.h>
#include <signal.h>
#include <pthread.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <assert.h>
#include <stdio.h>

void *tfunc(void* arg)
{
if (arg) {
ptrace(PTRACE_TRACEME, 0,0,0);
raise(SIGSTOP);
pthread_kill(*(pthread_t*)arg, SIGQUIT);
}
pause();
}

int main(void)
{
int pid;

if (!fork()) {
pthread_t thread1, thread2;
pthread_create(&thread1, NULL, tfunc, NULL);
pthread_create(&thread2, NULL, tfunc, &thread1);
pause();
return 0;
}

assert((pid = waitpid(-1, NULL, __WALL)) > 0);
assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEEXIT) == 0);
assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
wait(NULL);

pause();
return 0;
}

2011-03-16 09:10:06

Subject: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: Re: Linux 2.6.38

Subject: i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38)

Subject: Re: i915/kms regression after 2.6.38-rc8

Subject: Re: i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38)

Subject: Re: i915/kms regression after 2.6.38-rc8

Attachments:

Subject: Re: i915/kms regression after 2.6.38-rc8 (was: Re: Linux 2.6.38)

Subject: [patch 0/5] oom: a few anti fork bomb patches

Attachments:

Subject: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: [PATCH 3/5] oom: create oom autogroup

Subject: [PATCH 4/5] mm: introduce wait_on_page_locked_killable

Subject: [PATCH 5/5] x86,mm: make pagefault killable

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 3/5] oom: create oom autogroup

Subject: Re: [PATCH 3/5] oom: create oom autogroup

Subject: Re: [PATCH 3/5] oom: create oom autogroup

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 4/5] mm: introduce wait_on_page_locked_killable

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 5/5] x86,mm: make pagefault killable

Subject: Re: [PATCH 5/5] x86,mm: make pagefault killable

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 5/5] x86,mm: make pagefault killable

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 1/5] vmscan: remove all_unreclaimable check from direct reclaim path completely

Subject: Re: [PATCH 4/5] mm: introduce wait_on_page_locked_killable

Subject: Re: [PATCH 5/5] x86,mm: make pagefault killable

Subject: Re: [PATCH 5/5] x86,mm: make pagefault killable

Subject: Re: [PATCH 5/5] x86,mm: make pagefault killable

Subject: Re: [PATCH 5/5] x86,mm: make pagefault killable