2009-10-30 06:32:15

by Norbert Preining

Subject: OOM killer, page fault

Dear all,

(please Cc)

With 2.6.32-rc5 I got that one:
[13832.210068] Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
[13832.210073] Pid: 11220, comm: Xorg Not tainted 2.6.32-rc5 #2
[13832.210075] Call Trace:
[13832.210081] [<ffffffff8134120a>] ? _spin_unlock+0x23/0x2f
[13832.210085] [<ffffffff8107cf46>] ? oom_kill_process+0x78/0x236
[13832.210088] [<ffffffff8107d5ba>] ? __out_of_memory+0x12f/0x146
[13832.210091] [<ffffffff8107d6be>] ? pagefault_out_of_memory+0x54/0x82
[13832.210094] [<ffffffff81341177>] ? _spin_unlock_irqrestore+0x25/0x31
[13832.210098] [<ffffffff8102644d>] ? mm_fault_error+0x39/0xe6
[13832.210101] [<ffffffff810af3ea>] ? do_vfs_ioctl+0x443/0x47b
[13832.210103] [<ffffffff81026759>] ? do_page_fault+0x25f/0x27b
[13832.210106] [<ffffffff8134161f>] ? page_fault+0x1f/0x30
[13832.210108] Mem-Info:
[13832.210109] DMA per-cpu:
[13832.210111] CPU 0: hi: 0, btch: 1 usd: 0
[13832.210113] CPU 1: hi: 0, btch: 1 usd: 0
[13832.210114] DMA32 per-cpu:
[13832.210116] CPU 0: hi: 186, btch: 31 usd: 165
[13832.210117] CPU 1: hi: 186, btch: 31 usd: 177
[13832.210119] Normal per-cpu:
[13832.210120] CPU 0: hi: 186, btch: 31 usd: 143
[13832.210122] CPU 1: hi: 186, btch: 31 usd: 159
[13832.210128] active_anon:465239 inactive_anon:178856 isolated_anon:96
[13832.210129] active_file:120044 inactive_file:120889 isolated_file:34
[13832.210130] unevictable:32076 dirty:136955 writeback:1178 unstable:0 buffer:32965
[13832.210131] free:6932 slab_reclaimable:23740 slab_unreclaimable:11776
[13832.210132] mapped:41869 shmem:127673 pagetables:7320 bounce:0
[13832.210138] DMA free:15784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:24kB inactive_file:132kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:60kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:16kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[13832.210142] lowmem_reserve[]: 0 2931 3941 3941
[13832.210150] DMA32 free:9928kB min:5960kB low:7448kB high:8940kB active_anon:1527548kB inactive_anon:382016kB active_file:345724kB inactive_file:348528kB unevictable:127864kB isolated(anon):256kB isolated(file):0kB present:3001852kB mlocked:127864kB dirty:389520kB writeback:3192kB mapped:119544kB shmem:301556kB slab_reclaimable:62476kB slab_unreclaimable:22472kB kernel_stack:320kB pagetables:6692kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:448 all_unreclaimable? no
[13832.210155] lowmem_reserve[]: 0 0 1010 1010
[13832.210161] Normal free:2016kB min:2052kB low:2564kB high:3076kB active_anon:333408kB inactive_anon:333408kB active_file:134428kB inactive_file:134896kB unevictable:440kB isolated(anon):128kB isolated(file):136kB present:1034240kB mlocked:440kB dirty:158240kB writeback:1520kB mapped:47932kB shmem:209136kB slab_reclaimable:32468kB slab_unreclaimable:24632kB kernel_stack:2072kB pagetables:22588kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:192 all_unreclaimable? no
[13832.210166] lowmem_reserve[]: 0 0 0 0
[13832.210169] DMA: 2*4kB 2*8kB 1*16kB 2*32kB 1*64kB 2*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 2*4096kB = 15784kB
[13832.210177] DMA32: 624*4kB 1*8kB 11*16kB 6*32kB 1*64kB 8*128kB 5*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 9848kB
[13832.210184] Normal: 504*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2016kB
[13832.210191] 374966 total pagecache pages
[13832.210192] 6328 pages in swap cache
[13832.210194] Swap cache stats: add 147686, delete 141358, find 119392/120966
[13832.210195] Free swap = 8661548kB
[13832.210197] Total swap = 8851804kB
[13832.225488] 1048576 pages RAM
[13832.225491] 73094 pages reserved
[13832.225492] 695291 pages shared
[13832.225493] 352255 pages non-shared
[13832.225496] Out of memory: kill process 11292 (gnome-session) score 500953 or a child
[13832.225498] Killed process 11569 (xscreensaver)


After that I managed to get my system running normally again by restarting X;
everything has run fine since then.

Is that something I should be nervous about?

Thanks a lot and all the best

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology [email protected]
Vienna University of Technology [email protected]
Debian Developer (Debian TeX Task Force) [email protected]
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
ARTHUR What is an Algolian Zylatburger anyway?
FORD They're a kind of meatburger made from the most unpleasant parts
of a creature well known for its total lack of any pleasant
parts.
ARTHUR So you mean that the Universe does actually end not with a bang
but with a Wimpy?
--- Cut dialogue from Fit the Fifth.
--- Douglas Adams, The Hitchhikers Guide to the Galaxy


2009-11-02 04:24:13

by KOSAKI Motohiro

Subject: Re: OOM killer, page fault

Hi,

(Cc to linux-mm)

Wow, this is a very strange log.

> Dear all,
>
> (please Cc)
>
> With 2.6.32-rc5 I got that one:
> [13832.210068] Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0

order = 0

> [13832.210073] Pid: 11220, comm: Xorg Not tainted 2.6.32-rc5 #2
> [13832.210075] Call Trace:
> [13832.210081] [<ffffffff8134120a>] ? _spin_unlock+0x23/0x2f
> [13832.210085] [<ffffffff8107cf46>] ? oom_kill_process+0x78/0x236
> [13832.210088] [<ffffffff8107d5ba>] ? __out_of_memory+0x12f/0x146
> [13832.210091] [<ffffffff8107d6be>] ? pagefault_out_of_memory+0x54/0x82
> [13832.210094] [<ffffffff81341177>] ? _spin_unlock_irqrestore+0x25/0x31
> [13832.210098] [<ffffffff8102644d>] ? mm_fault_error+0x39/0xe6
> [13832.210101] [<ffffffff810af3ea>] ? do_vfs_ioctl+0x443/0x47b
> [13832.210103] [<ffffffff81026759>] ? do_page_fault+0x25f/0x27b
> [13832.210106] [<ffffffff8134161f>] ? page_fault+0x1f/0x30
> [13832.210108] Mem-Info:
> [13832.210109] DMA per-cpu:
> [13832.210111] CPU 0: hi: 0, btch: 1 usd: 0
> [13832.210113] CPU 1: hi: 0, btch: 1 usd: 0
> [13832.210114] DMA32 per-cpu:
> [13832.210116] CPU 0: hi: 186, btch: 31 usd: 165
> [13832.210117] CPU 1: hi: 186, btch: 31 usd: 177
> [13832.210119] Normal per-cpu:
> [13832.210120] CPU 0: hi: 186, btch: 31 usd: 143
> [13832.210122] CPU 1: hi: 186, btch: 31 usd: 159
> [13832.210128] active_anon:465239 inactive_anon:178856 isolated_anon:96
> [13832.210129] active_file:120044 inactive_file:120889 isolated_file:34

But the system has plenty of droppable cache.

Umm, is this reproducible?
Typically such a strange log is caused by corrupted RAM. Can you please
check your memory for errors?
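
For scale, the page counts in the Mem-Info dump can be converted to MiB. The sketch below is a minimal user-space illustration, assuming 4 KiB pages (usual on x86-64, but an assumption here). The helper names pages_to_mib and file_cache_pages are made up for this example and are not kernel functions.

```c
#include <stdio.h>

/* Assumption: 4 KiB pages, as is usual on x86-64. */
#define PAGE_KIB 4UL

/* Convert a page count from the Mem-Info dump to whole MiB (rounded down). */
unsigned long pages_to_mib(unsigned long pages)
{
    return pages * PAGE_KIB / 1024UL;
}

/* File-backed LRU pages: the part of memory that reclaim can usually shrink. */
unsigned long file_cache_pages(unsigned long active_file,
                               unsigned long inactive_file)
{
    return active_file + inactive_file;
}

/* Print the figures quoted in the report in a more readable unit. */
void print_cache_summary(void)
{
    unsigned long file_pages = file_cache_pages(120044UL, 120889UL);

    printf("file cache: %lu pages (~%lu MiB)\n",
           file_pages, pages_to_mib(file_pages));
    printf("free:       %lu pages (~%lu MiB)\n",
           6932UL, pages_to_mib(6932UL));
}
```

With the numbers from the log, the file LRU holds 240933 pages, roughly 941 MiB, against about 27 MiB free. Not all of that is immediately droppable (the dump also shows dirty:136955, which needs writeback first), but it is far from an empty cache.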


> [13832.210130] unevictable:32076 dirty:136955 writeback:1178 unstable:0 buffer:32965
> [13832.210131] free:6932 slab_reclaimable:23740 slab_unreclaimable:11776
> [13832.210132] mapped:41869 shmem:127673 pagetables:7320 bounce:0
> [13832.210138] DMA free:15784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:24kB inactive_file:132kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:60kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:16kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [13832.210142] lowmem_reserve[]: 0 2931 3941 3941
> [13832.210150] DMA32 free:9928kB min:5960kB low:7448kB high:8940kB active_anon:1527548kB inactive_anon:382016kB active_file:345724kB inactive_file:348528kB unevictable:127864kB isolated(anon):256kB isolated(file):0kB present:3001852kB mlocked:127864kB dirty:389520kB writeback:3192kB mapped:119544kB shmem:301556kB slab_reclaimable:62476kB slab_unreclaimable:22472kB kernel_stack:320kB pagetables:6692kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:448 all_unreclaimable? no
> [13832.210155] lowmem_reserve[]: 0 0 1010 1010
> [13832.210161] Normal free:2016kB min:2052kB low:2564kB high:3076kB active_anon:333408kB inactive_anon:333408kB active_file:134428kB inactive_file:134896kB unevictable:440kB isolated(anon):128kB isolated(file):136kB present:1034240kB mlocked:440kB dirty:158240kB writeback:1520kB mapped:47932kB shmem:209136kB slab_reclaimable:32468kB slab_unreclaimable:24632kB kernel_stack:2072kB pagetables:22588kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:192 all_unreclaimable? no
> [13832.210166] lowmem_reserve[]: 0 0 0 0
> [13832.210169] DMA: 2*4kB 2*8kB 1*16kB 2*32kB 1*64kB 2*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 2*4096kB = 15784kB
> [13832.210177] DMA32: 624*4kB 1*8kB 11*16kB 6*32kB 1*64kB 8*128kB 5*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 9848kB
> [13832.210184] Normal: 504*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2016kB
> [13832.210191] 374966 total pagecache pages
> [13832.210192] 6328 pages in swap cache
> [13832.210194] Swap cache stats: add 147686, delete 141358, find 119392/120966
> [13832.210195] Free swap = 8661548kB
> [13832.210197] Total swap = 8851804kB
> [13832.225488] 1048576 pages RAM
> [13832.225491] 73094 pages reserved
> [13832.225492] 695291 pages shared
> [13832.225493] 352255 pages non-shared
> [13832.225496] Out of memory: kill process 11292 (gnome-session) score 500953 or a child
> [13832.225498] Killed process 11569 (xscreensaver)
>
>
> After that I managed to get my system runing normally on, restarting X,
> all runs since then quite fine.
>
> Is that something I should be nervous about?

This obviously indicates either a kernel bug or hardware corruption. I'm not sure which ;)



> Thanks a lot and all the best
>
> Norbert


2009-11-02 04:59:14

by Minchan Kim

Subject: Re: OOM killer, page fault

On Mon, 2 Nov 2009 13:24:06 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Hi,
>
> (Cc to linux-mm)
>
> Wow, this is very strange log.
>
> > Dear all,
> >
> > (please Cc)
> >
> > With 2.6.32-rc5 I got that one:
> > [13832.210068] Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
>
> order = 0

I think this problem results from 'gfp_mask = 0x0'.
Is that possible?

If it isn't a H/W problem, who passes a gfp_mask of 0x0?
That is the culprit.

Could you add BUG_ON(gfp_mask == 0x0) at the head of __alloc_pages_nodemask?

---

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                        struct zonelist *zonelist, nodemask_t *nodemask)
{
        enum zone_type high_zoneidx = gfp_zone(gfp_mask);
        struct zone *preferred_zone;
        struct page *page;
        int migratetype = allocflags_to_migratetype(gfp_mask);

+       BUG_ON(gfp_mask == 0x0);
        gfp_mask &= gfp_allowed_mask;

        lockdep_trace_alloc(gfp_mask);

        might_sleep_if(gfp_mask & __GFP_WAIT);

        if (should_fail_alloc_page(gfp_mask, order))
                return NULL;


--
Kind regards,
Minchan Kim

2009-11-02 05:05:48

by Kamezawa Hiroyuki

Subject: Re: OOM killer, page fault

On Mon, 2 Nov 2009 13:56:40 +0900
Minchan Kim <[email protected]> wrote:

> On Mon, 2 Nov 2009 13:24:06 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> > Hi,
> >
> > (Cc to linux-mm)
> >
> > Wow, this is very strange log.
> >
> > > Dear all,
> > >
> > > (please Cc)
> > >
> > > With 2.6.32-rc5 I got that one:
> > > [13832.210068] Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
> >
> > order = 0
>
> I think this problem results from 'gfp_mask = 0x0'.
> Is it possible?
>
> If it isn't H/W problem, Who passes gfp_mask with 0x0?
> It's culpit.
>
> Could you add BUG_ON(gfp_mask == 0x0) in __alloc_pages_nodemask's head?
>

Maybe some code returns VM_FAULT_OOM by mistake and pagefault_out_of_memory()
is called. Digging into mm/memory.c is necessary...

I wonder why this code is the way it is now:
===
static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                unsigned long address, pte_t *page_table, pmd_t *pmd,
                unsigned int flags, pte_t orig_pte)
{
        pgoff_t pgoff;

        flags |= FAULT_FLAG_NONLINEAR;

        if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
                return 0;

        if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
                /*
                 * Page table corrupted: show pte and kill process.
                 */
                print_bad_pte(vma, address, orig_pte, NULL);
                return VM_FAULT_OOM;
        }

        pgoff = pte_to_pgoff(orig_pte);
        return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
==
So this reports OOM... but is it really an OOM?

Thanks,
-Kame

2009-11-02 06:03:52

by Minchan Kim

Subject: Re: OOM killer, page fault

On Mon, 2 Nov 2009 14:02:16 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 2 Nov 2009 13:56:40 +0900
> Minchan Kim <[email protected]> wrote:
>
> > On Mon, 2 Nov 2009 13:24:06 +0900 (JST)
> > KOSAKI Motohiro <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > (Cc to linux-mm)
> > >
> > > Wow, this is very strange log.
> > >
> > > > Dear all,
> > > >
> > > > (please Cc)
> > > >
> > > > With 2.6.32-rc5 I got that one:
> > > > [13832.210068] Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
> > >
> > > order = 0
> >
> > I think this problem results from 'gfp_mask = 0x0'.
> > Is it possible?
> >
> > If it isn't H/W problem, Who passes gfp_mask with 0x0?
> > It's culpit.
> >
> > Could you add BUG_ON(gfp_mask == 0x0) in __alloc_pages_nodemask's head?
> >
>
> Maybe some code returns VM_FAULT_OOM by mistake and pagefault_oom_killer()
> is called. digging mm/memory.c is necessary...
>
> I wonder why...now is this code
> ===
> static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, pte_t *page_table, pmd_t *pmd,
> unsigned int flags, pte_t orig_pte)
> {
> pgoff_t pgoff;
>
> flags |= FAULT_FLAG_NONLINEAR;
>
> if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> return 0;
>
> if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
> /*
> * Page table corrupted: show pte and kill process.
> */
> print_bad_pte(vma, address, orig_pte, NULL);
> return VM_FAULT_OOM;
> }
>
> pgoff = pte_to_pgoff(orig_pte);
> return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
> }
> ==
> Then, OOM...is this really OOM ?

It seems that the goal is to kill the process via an OOM trick, as the comment says.

I found that it comes from Hugh's commit 65500d234e74fc4e8f18e1a429bc24e51e75de4a.
I think it's not a real OOM.

BTW, if that is the culprit in this case, print_bad_pte should have left something in the log. :)

>
> Thanks,
> -Kame
>


--
Kind regards,
Minchan Kim

2009-11-02 06:37:37

by Minchan Kim

Subject: Re: OOM killer, page fault

On Mon, 2 Nov 2009 14:02:16 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 2 Nov 2009 13:56:40 +0900
> Minchan Kim <[email protected]> wrote:
>
> > On Mon, 2 Nov 2009 13:24:06 +0900 (JST)
> > KOSAKI Motohiro <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > (Cc to linux-mm)
> > >
> > > Wow, this is very strange log.
> > >
> > > > Dear all,
> > > >
> > > > (please Cc)
> > > >
> > > > With 2.6.32-rc5 I got that one:
> > > > [13832.210068] Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
> > >
> > > order = 0
> >
> > I think this problem results from 'gfp_mask = 0x0'.
> > Is it possible?
> >
> > If it isn't H/W problem, Who passes gfp_mask with 0x0?
> > It's culpit.
> >
> > Could you add BUG_ON(gfp_mask == 0x0) in __alloc_pages_nodemask's head?
> >
>
> Maybe some code returns VM_FAULT_OOM by mistake and pagefault_oom_killer()
> is called. digging mm/memory.c is necessary...

I suspect the GPU drivers related to X.
It seems many of them return VM_FAULT_OOM.

If it happens in a file map fault, the following debug patch can show the culprit.

Norbert, could you apply this patch and test again?
If you can get the address, you can find the function symbol with System.map.


diff --git a/mm/memory.c b/mm/memory.c
index 7e91b5f..47e4b15 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2713,7 +2713,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
         vmf.page = NULL;

         ret = vma->vm_ops->fault(vma, &vmf);
-        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+                printk(KERN_DEBUG "vma->vm_ops->fault : 0x%lx\n", vma->vm_ops->fault);
+                WARN_ON(1);
+
+        }
                 return ret;

         if (unlikely(PageHWPoison(vmf.page))) {
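
Looking up a printed address in System.map amounts to finding the symbol with the highest start address that is not above that address. The following is a rough sketch of that lookup, assuming the usual "address type name" line layout. lookup_symbol is a hypothetical helper written for this note, not an existing tool, and it scans an in-memory array of lines rather than the real file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan System.map-style lines ("<hex address> <type> <name>") and copy the
 * name of the closest symbol at or below addr into name_out.
 * Returns 1 if a symbol was found, 0 otherwise.
 * Hypothetical helper for illustration only.
 */
int lookup_symbol(const char *lines[], size_t n, unsigned long long addr,
                  char *name_out, size_t name_len)
{
    unsigned long long best = 0;
    int found = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long long sym_addr;
        char type;
        char sym[128];

        /* Skip lines that do not match the expected layout. */
        if (sscanf(lines[i], "%llx %c %127s", &sym_addr, &type, sym) != 3)
            continue;

        /* Keep the highest start address that does not exceed addr. */
        if (sym_addr <= addr && (!found || sym_addr >= best)) {
            best = sym_addr;
            snprintf(name_out, name_len, "%s", sym);
            found = 1;
        }
    }
    return found;
}
```

As a usage example, take two lines whose addresses are derived from the call trace in the report by subtracting the printed offsets (they are not copied from a real System.map, and the type letters are illustrative): "ffffffff81026414 t mm_fault_error" and "ffffffff810264fa T do_page_fault". Looking up 0xffffffff81026759 in those lines yields do_page_fault, which matches the do_page_fault+0x25f entry in the trace.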





--
Kind regards,
Minchan Kim

2009-11-02 06:59:27

by KOSAKI Motohiro

Subject: Re: OOM killer, page fault

> On Mon, 2 Nov 2009 13:24:06 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> > Hi,
> >
> > (Cc to linux-mm)
> >
> > Wow, this is very strange log.
> >
> > > Dear all,
> > >
> > > (please Cc)
> > >
> > > With 2.6.32-rc5 I got that one:
> > > [13832.210068] Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
> >
> > order = 0
>
> I think this problem results from 'gfp_mask = 0x0'.
> Is it possible?
>
> If it isn't H/W problem, Who passes gfp_mask with 0x0?
> It's culpit.
>
> Could you add BUG_ON(gfp_mask == 0x0) in __alloc_pages_nodemask's head?

No.
In the page fault case, gfp_mask shows a meaningless value. Please ignore it.
pagefault_out_of_memory() always passes gfp_mask == 0 to the OOM killer.


mm/oom_kill.c
====================================
void pagefault_out_of_memory(void)
{
        unsigned long freed = 0;

        blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
        if (freed > 0)
                /* Got some memory back in the last second. */
                return;

        /*
         * If this is from memcg, oom-killer is already invoked.
         * and not worth to go system-wide-oom.
         */
        if (mem_cgroup_oom_called(current))
                goto rest_and_return;

        if (sysctl_panic_on_oom)
                panic("out of memory from page fault. panic_on_oom is selected.\n");

        read_lock(&tasklist_lock);
        __out_of_memory(0, 0);       <---- here!
        read_unlock(&tasklist_lock);
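
The point above can be pictured with a small user-space model. This is only a sketch under stated assumptions: the gfp_t typedef, struct oom_report and the model_* functions are hypothetical stand-ins invented for this note, not kernel code. It shows why the log line reads gfp_mask=0x0, order=0 whenever the OOM killer is reached through the page-fault path: the fault handler only sees VM_FAULT_OOM, so the original allocation context is no longer available and zeros are passed instead.

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's gfp_t (user-space model only). */
typedef unsigned int gfp_t;

/* Records what the modelled OOM path was handed, so callers can inspect it. */
struct oom_report {
    gfp_t gfp_mask;
    int order;
    int invoked;
};

/* Models the common OOM entry point that prints the "invoked oom-killer" line. */
void model_out_of_memory(struct oom_report *r, gfp_t gfp_mask, int order)
{
    r->gfp_mask = gfp_mask;
    r->order = order;
    r->invoked = 1;
    printf("invoked oom-killer: gfp_mask=0x%x, order=%d\n", gfp_mask, order);
}

/* Allocator path: the failing allocation's mask and order are still known. */
void model_alloc_failure(struct oom_report *r, gfp_t gfp_mask, int order)
{
    model_out_of_memory(r, gfp_mask, order);
}

/* Page-fault path: only an OOM fault code arrived, so zeros are passed. */
void model_pagefault_oom(struct oom_report *r)
{
    model_out_of_memory(r, 0, 0);
}
```

In this model, model_pagefault_oom() always reports gfp_mask=0x0 and order=0 regardless of which allocation, if any, actually failed, so the zero mask in the report says nothing about the caller.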

2009-11-02 07:00:54

by Minchan Kim

Subject: Re: OOM killer, page fault

On Mon, Nov 2, 2009 at 3:59 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Mon, 2 Nov 2009 13:24:06 +0900 (JST)
>> KOSAKI Motohiro <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > (Cc to linux-mm)
>> >
>> > Wow, this is very strange log.
>> >
>> > > Dear all,
>> > >
>> > > (please Cc)
>> > >
>> > > With 2.6.32-rc5 I got that one:
>> > > [13832.210068] Xorg invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0
>> >
>> > order = 0
>>
>> I think this problem results from 'gfp_mask = 0x0'.
>> Is it possible?
>>
>> If it isn't H/W problem, Who passes gfp_mask with 0x0?
>> It's culpit.
>>
>> Could you add BUG_ON(gfp_mask == 0x0) in __alloc_pages_nodemask's head?
>
> No.
> In page fault case, gfp_mask show meaningless value. Please ignore it.
> pagefault_out_of_memory() always pass gfp_mask==0 to oom.
>
>
> mm/oom_kill.c
> ====================================
> void pagefault_out_of_memory(void)
> {
>         unsigned long freed = 0;
>
>         blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>         if (freed > 0)
>                 /* Got some memory back in the last second. */
>                 return;
>
>         /*
>          * If this is from memcg, oom-killer is already invoked.
>          * and not worth to go system-wide-oom.
>          */
>         if (mem_cgroup_oom_called(current))
>                 goto rest_and_return;
>
>         if (sysctl_panic_on_oom)
>                 panic("out of memory from page fault. panic_on_oom is selected.\n");
>
>         read_lock(&tasklist_lock);
>         __out_of_memory(0, 0);       <---- here!
>         read_unlock(&tasklist_lock);
>
>

Yep. Kame already noticed it. :)
Thanks for pointing it out to me again.

I already suggested another patch.
What do you think about it?


--
Kind regards,
Minchan Kim

2009-11-02 14:19:17

by Norbert Preining

Subject: Re: OOM killer, page fault

Hi all,

Wow, many messages ... In the end I lost track of which patch I should try.

BTW, that happened only once, and whatever I do I cannot reproduce it.

I will anyway include any patch you send me and hope that it happens again.

Thanks

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology [email protected]
Vienna University of Technology [email protected]
Debian Developer (Debian TeX Task Force) [email protected]
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
BAUMBER
A fitted elasticated bottom sheet which turns your mattress
bananashaped.
--- Douglas Adams, The Meaning of Liff

2009-11-02 14:40:40

by Minchan Kim

Subject: Re: OOM killer, page fault

Hi.

On Mon, Nov 2, 2009 at 11:19 PM, Norbert Preining <[email protected]> wrote:
> Hi all,
>
> wow, many messages ... At the end I lost track of which patch I should try?
>
> BTW, that happened only once, and whatever I do I cannot reproduce that.
>
> I will anyway include any patch you send me and hope that it happens again.

Please forget my previous patch.
Could you test the following patch?

diff --git a/mm/memory.c b/mm/memory.c
index 7e91b5f..47e4b15 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2713,7 +2713,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
         vmf.page = NULL;

         ret = vma->vm_ops->fault(vma, &vmf);
-        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+                printk(KERN_DEBUG "vma->vm_ops->fault : 0x%lx\n", vma->vm_ops->fault);
+                WARN_ON(1);
+
+        }
                 return ret;

         if (unlikely(PageHWPoison(vmf.page))) {





--
Kind regards,
Minchan Kim

2009-11-02 16:27:00

by Hugh Dickins

Subject: Re: OOM killer, page fault

On Mon, 2 Nov 2009, Minchan Kim wrote:
> On Mon, 2 Nov 2009 14:02:16 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > Maybe some code returns VM_FAULT_OOM by mistake and pagefault_oom_killer()
> > is called. digging mm/memory.c is necessary...
> >
> > I wonder why...now is this code
> > ===
> > static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long address, pte_t *page_table, pmd_t *pmd,
> > unsigned int flags, pte_t orig_pte)
> > {
> > pgoff_t pgoff;
> >
> > flags |= FAULT_FLAG_NONLINEAR;
> >
> > if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> > return 0;
> >
> > if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
> > /*
> > * Page table corrupted: show pte and kill process.
> > */
> > print_bad_pte(vma, address, orig_pte, NULL);
> > return VM_FAULT_OOM;
> > }
> >
> > pgoff = pte_to_pgoff(orig_pte);
> > return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
> > }
> > ==
> > Then, OOM...is this really OOM ?
>
> It seems that the goal is to kill process by OOM trick as comment said.
>
> I found It results from Hugh's commit 65500d234e74fc4e8f18e1a429bc24e51e75de4a.
> I think it's not a real OOM.
>
> BTW, If it is culpit in this case, print_bad_pte should have remained any log. :)

Yes, the chances are that this is not related to Norbert's problem.
But thank you for reminding me of that not-very-nice hack of mine.

It was kind-of valid at the time that I wrote it (2.6.15), when
VM_FAULT_OOM did kill the faulting process. But since then the fault
path has rightly been changed (in x86 at least, I didn't check the rest)
to let the OOM killer decide who to kill: so now there's a danger that
a pagetable corruption there will instead kill some unrelated process.

Being lazy, I'm inclined simply to change that to VM_FAULT_SIGBUS now:
which doesn't actually guarantee that the process will be killed, but
should be better than just repeatedly re-faulting on the entry. (I
don't much want to SIGKILL current since mm might not be current's.)

That aberrant use of VM_FAULT_OOM has recently been copied into
do_swap_page() (the first instance; the second instance is right -
hmm, well, the second instance is normally right, but I guess it
also covers pagetable corruption cases which we can't distinguish
there; oh well) and should be corrected there too.

Does VM_FAULT_SIGBUS sound good enough to you?

Hugh
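
The difference between the two return codes discussed above can be sketched in user space. This is a simplified model under explicit assumptions: enum fault_result, struct task, pick_oom_victim and task_hit_by_fault are hypothetical names invented for this note. The model picks the task with the highest score and ignores the "or a child" selection seen in the log. It also treats SIGBUS as always reaching the faulting task, although the message above notes that SIGBUS does not guarantee the process is killed.

```c
#include <stddef.h>

/* Hypothetical, simplified fault outcomes (not the kernel's VM_FAULT_* bits). */
enum fault_result { FAULT_OOM, FAULT_SIGBUS };

/* Minimal task record: a name and an OOM badness score. */
struct task {
    const char *name;
    unsigned long score;
};

/* Model of the OOM killer's choice: the task with the highest score. */
const struct task *pick_oom_victim(const struct task *tasks, size_t n)
{
    const struct task *victim = NULL;

    for (size_t i = 0; i < n; i++)
        if (victim == NULL || tasks[i].score > victim->score)
            victim = &tasks[i];
    return victim;
}

/* Which task is affected when the faulting task's handler returns res? */
const struct task *task_hit_by_fault(enum fault_result res,
                                     const struct task *faulting,
                                     const struct task *tasks, size_t n)
{
    if (res == FAULT_OOM)
        return pick_oom_victim(tasks, n); /* may be an unrelated task */
    return faulting;                      /* SIGBUS goes to the faulter */
}
```

With a faulting task such as Xorg (given an arbitrary low score here) and gnome-session at the score 500953 from the log, the FAULT_OOM branch selects gnome-session, while FAULT_SIGBUS stays with the faulting task. That is the "kill some unrelated process" danger in miniature.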

2009-11-02 23:31:25

by Minchan Kim

Subject: Re: OOM killer, page fault

Hi, Hugh.

On Mon, 2 Nov 2009 16:26:56 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:

> On Mon, 2 Nov 2009, Minchan Kim wrote:
> > On Mon, 2 Nov 2009 14:02:16 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > >
> > > Maybe some code returns VM_FAULT_OOM by mistake and pagefault_oom_killer()
> > > is called. digging mm/memory.c is necessary...
> > >
> > > I wonder why...now is this code
> > > ===
> > > static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > unsigned long address, pte_t *page_table, pmd_t *pmd,
> > > unsigned int flags, pte_t orig_pte)
> > > {
> > > pgoff_t pgoff;
> > >
> > > flags |= FAULT_FLAG_NONLINEAR;
> > >
> > > if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> > > return 0;
> > >
> > > if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
> > > /*
> > > * Page table corrupted: show pte and kill process.
> > > */
> > > print_bad_pte(vma, address, orig_pte, NULL);
> > > return VM_FAULT_OOM;
> > > }
> > >
> > > pgoff = pte_to_pgoff(orig_pte);
> > > return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
> > > }
> > > ==
> > > Then, OOM...is this really OOM ?
> >
> > It seems that the goal is to kill process by OOM trick as comment said.
> >
> > I found It results from Hugh's commit 65500d234e74fc4e8f18e1a429bc24e51e75de4a.
> > I think it's not a real OOM.
> >
> > BTW, If it is culpit in this case, print_bad_pte should have remained any log. :)
>
> Yes, the chances are that this is not related to Norbert's problem.
> But thank you for reminding me of that not-very-nice hack of mine.
>
> It was kind-of valid at the time that I wrote it (2.6.15), when
> VM_FAULT_OOM did kill the faulting process. But since then the fault
> path has rightly been changed (in x86 at least, I didn't check the rest)
> to let the OOM killer decide who to kill: so now there's a danger that
> a pagetable corruption there will instead kill some unrelated process.
>
> Being lazy, I'm inclined simply to change that to VM_FAULT_SIGBUS now:
> which doesn't actually guarantee that the process will be killed, but
> should be better than just repeatedly re-faulting on the entry. (I
> don't much want to SIGKILL current since mm might not be current's.)
>
> That aberrant use of VM_FAULT_OOM has recently been copied into
> do_swap_page() (the first instance; the second instance is right -
> hmm, well, the second instance is normally right, but I guess it
> also covers pagetable corruption cases which we can't distinguish
> there; oh well) and should be corrected there too.
>
> Does VM_FAULT_SIGBUS sound good enough to you?

I am okay with that.
First of all, we have to prevent killing an innocent process.
Second, although it raises SIGBUS, we can distinguish it from a normal SIGBUS
by the bad pte log.
Third, we want to avoid adding a new VM_FAULT_XXX if at all possible. :)


>
> Hugh


--
Kind regards,
Minchan Kim

2009-11-05 13:21:10

by Norbert Preining

Subject: Re: OOM killer, page fault

Hi Kim, hi all,

(still please Cc)

Sorry for the late reply. I have two pieces of news, one good and one bad. The
good news is that I can reproduce the bug by running VirtualBox with some W7
within. However, I don't have a trace or better debug output because of the bad
news: both 2.6.32-rc5 and 2.6.32-rc6 do *not* boot with the patch below.
Please don't ask me why, and I don't have a serial/net console, so I cannot
tell you more, but booting hangs badly at:
[ 6.657492] usb 4-1: Product: Globetrotter HSDPA Modem
[ 6.657494] usb 4-1: Manufacturer: Option N.V.
[ 6.657496] usb 4-1: SerialNumber: Serial Number
[ 6.657558] usb 4-1: configuration #1 chosen from 1 choice
[ 6.837364] input: PS/2 Mouse as /devices/platform/i8042/serio2/input/input6
[ 6.853693] input: AlpsPS/2 ALPS GlidePoint as /devices/platform/i8042/serio2/input/input7

Normally it continues like that, but with the patch below it hangs here
and does not continue. I need to Sysrq-s/u/b out of it.

[ 6.904119] usb 8-2: new full speed USB device using uhci_hcd and address 2
[ 7.075524] usb 8-2: New USB device found, idVendor=044e, idProduct=3017

> diff --git a/mm/memory.c b/mm/memory.c
> index 7e91b5f..47e4b15 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2713,7 +2713,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>          vmf.page = NULL;
>
>          ret = vma->vm_ops->fault(vma, &vmf);
> -        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> +        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> +                printk(KERN_DEBUG "vma->vm_ops->fault : 0x%lx\n", vma->vm_ops->fault);
> +                WARN_ON(1);
> +
> +        }
>                  return ret;
>
>          if (unlikely(PageHWPoison(vmf.page))) {

I know it sounds completely crazy; as far as I can see the patch only does
harmless things. But I tried it several times. rc6+patch never booted, while
rc5 without the patch did boot. Then I patched it into -rc5, recompiled, and
boom, no boot. After booting into .31.5 and recompiling rc6 and rc5 without
that patch, rc6 suddenly boots (and I am sure rc5 does, too).

Sorry that I cannot give more infos, please let me know what else I can
do.

Ah yes, I can reproduce the original strange bug with oom killer!

Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology [email protected]
Vienna University of Technology [email protected]
Debian Developer (Debian TeX Task Force) [email protected]
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
MELTON CONSTABLE (n.)
A patent anti-wrinkle cream which policemen wear to keep themselves
looking young.
--- Douglas Adams, The Meaning of Liff

2009-11-05 15:18:57

by Minchan Kim

Subject: Re: OOM killer, page fault

Hi.

On Thu, Nov 5, 2009 at 10:21 PM, Norbert Preining <[email protected]> wrote:
> Hi Kim, hi all,
>
> (still please Cc)
>
> sorry for the late reply. I have two news, one good and one bad: The good
> being that I can reproduce the bug by running VirtualBox with some W7

W7 means "Windows 7"?

> within. Anyway, I don't have a trace or better debug due to the bad news:
> Both 2.6.32-rc5 and 2.6.32-rc6 do *not* boot with the patch below.
> Don't ask me why, please, and I don't have a serial/net console so that
> I can tell you more, but the booting hangs badly at:
> [    6.657492] usb 4-1: Product: Globetrotter HSDPA Modem
> [    6.657494] usb 4-1: Manufacturer: Option N.V.
> [    6.657496] usb 4-1: SerialNumber: Serial Number
> [    6.657558] usb 4-1: configuration #1 chosen from 1 choice
> [    6.837364] input: PS/2 Mouse as /devices/platform/i8042/serio2/input/input6
> [    6.853693] input: AlpsPS/2 ALPS GlidePoint as /devices/platform/i8042/serio2/input/input7
>
> Normally it continues like that, but with the patch below it hangs here
> and does not continue. I need to Sysrq-s/u/b out of it.
>
> [    6.904119] usb 8-2: new full speed USB device using uhci_hcd and address 2
> [    7.075524] usb 8-2: New USB device found, idVendor=044e, idProduct=3017
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7e91b5f..47e4b15 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2713,7 +2713,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>         vmf.page = NULL;
>>
>>         ret = vma->vm_ops->fault(vma, &vmf);
>> -       if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
>> +       if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
>> +               printk(KERN_DEBUG "vma->vm_ops->fault : 0x%lx\n", vma->vm_ops->fault);
>> +               WARN_ON(1);
>> +
>> +       }
>>                 return ret;
>>
>>         if (unlikely(PageHWPoison(vmf.page))) {
>
> I know it sounds completely crazy, the patch only does harmless things
> afais. But I tried it. Several times. rc6+patch never did boot, while
> rc5 without path did boot. Then I patched it into -rc5, recompiled, and
> boom, no boot. booting into .31.5, recompiling rc6 and rc5 without
> that patch and suddenly rc6 boots (and I am sure rc5, too).

Hmm, that is beyond my knowledge.
Perhaps it is because of the WARN_ON?
Could you try again with the WARN_ON omitted?

>
> Sorry that I cannot give more infos, please let me know what else I can
> do.

Thanks for your time :)

> Ah yes, I can reproduce the original strange bug with oom killer!

Sounds good to me.
Could you tell me your test scenario, your system info (CPU, RAM) and
config?
I want to reproduce it on my machine so that I do not have to bother you. :)


>
> Best wishes
>
> Norbert
>



--
Kind regards,
Minchan Kim

2009-11-05 15:26:27

by Norbert Preining

Subject: Re: OOM killer, page fault

Hi Kim,

> > sorry for the late reply. I have two news, one good and one bad: The good
> > being that I can reproduce the bug by running VirtualBox with some W7
>
> W7 means "Windows 7"?

Yes, sorry for the shorthand.

> > I know it sounds completely crazy, the patch only does harmless things
> > afais. But I tried it. Several times. rc6+patch never did boot, while
> > rc5 without path did boot. Then I patched it into -rc5, recompiled, and
> > boom, no boot. booting into .31.5, recompiling rc6 and rc5 without
> > that patch and suddenly rc6 boots (and I am sure rc5, too).
>
> Hmm, that is beyond my knowledge.
> Perhaps it is because of the WARN_ON?
> Could you try again with the WARN_ON omitted?

Will do that.

> > Ah yes, I can reproduce the original strange bug with oom killer!
>
> Sounds good to me.
> Could you tell me your test scenario, your system info (CPU, RAM) and
> config?
> I want to reproduce it on my machine so that I do not have to bother you. :)

Puhh, well, I meant "I could reproduce it", but not "I have a clear
idea what steps to be taken to reproduce it" ;-) Well here is what I can
tell you:
actual hardware:
Intel(R) Core(TM)2 Duo CPU P9500
Memory 2G
Config of my kernel attached.

Virtual Machine (VirtualBox, not the OSE variant, I need USB 2.0 support
for GPS stuff):
VirtualBox 3.0.10
memory for the machine: 1G (50%)
ACPI and IO/APIC turned on
1 processor with PAE/NX
VT-x and Nested Paging activated
Display 128M
(need more details?)

I will remove the WARN_ON, reboot, and see if that works. If so, I will try
to recreate the problem.

Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology [email protected]
Vienna University of Technology [email protected]
Debian Developer (Debian TeX Task Force) [email protected]
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
BROMSGROVE
Any urban environment containing a small amount of dogturd and about
forty-five tons of bent steel pylon or a lump of concrete with holes
claiming to be sculpture. 'Oh, come my dear, and come with me. And
wander 'neath the bromsgrove tree' - Betjeman.
--- Douglas Adams, The Meaning of Liff

2009-11-05 16:16:25

by Norbert Preining

Subject: Re: OOM killer, page fault

On Thu, 05 Nov 2009, preining wrote:
> > Hmm, that is beyond my knowledge.
> > Perhaps it is because of the WARN_ON?
> > Could you try again with the WARN_ON omitted?
>
> Will do that.

No change, it still hangs. But at least I now see that it does not hang
at an arbitrary position: it never starts the init process. It
stops right before the "Calling init" message or similar.

BTW, this time the config is really attached.

Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology [email protected]
Vienna University of Technology [email protected]
Debian Developer (Debian TeX Task Force) [email protected]
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
WEEM (n.)
The tools with which a dentist can inflict the greatest
pain. Formerly, which tool this was dependent upon the imagination and
skill of the individual dentist, though now, with technological
advances, weems can be bought specially.
--- Douglas Adams, The Meaning of Liff


Attachments:
(No filename) (1.25 kB)
config-2.6.32-rc6 (61.14 kB)

2009-11-05 22:18:19

by Minchan Kim

Subject: Re: OOM killer, page fault

On Fri, Nov 6, 2009 at 5:37 AM, Jody Belka <[email protected]> wrote:
> Norbert Preining <preining <at> logic.at> writes:
>> Don't ask me why, please, and I don't have a serial/net console so that
>> I can tell you more, but the booting hangs badly at:
>
> <snip>
>
>>
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 7e91b5f..47e4b15 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -2713,7 +2713,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>> >         vmf.page = NULL;
>> >
>> >         ret = vma->vm_ops->fault(vma, &vmf);
>> > -       if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
>> > +       if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
>> > +               printk(KERN_DEBUG "vma->vm_ops->fault : 0x%lx\n", vma->vm_ops->fault);
>> > +               WARN_ON(1);
>> > +
>> > +       }
>> >                 return ret;
>> >
>> >         if (unlikely(PageHWPoison(vmf.page))) {
>>
>
> Erm, could it not be due to the "return ret;" line being moved outside of the
> if(), so that it always executes?

Right. Sorry, it's my fault.
I must have been blind.
'return ret' should have been inside the braces of the debug code.
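
To make the control-flow problem concrete, here is a minimal user-space sketch in C. It is not kernel code: the flag values and the three function names are assumptions chosen for illustration (the real definitions live in include/linux/mm.h), and each function only reports whether the rest of the fault path would have been reached.

```c
#include <stdio.h>

/* Illustrative flag values for this sketch only (assumption); the real
 * definitions are in include/linux/mm.h and may differ. */
#define VM_FAULT_OOM    0x0001
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_NOPAGE 0x0100
#define VM_FAULT_ERROR  (VM_FAULT_OOM | VM_FAULT_SIGBUS)

/* Unpatched flow: return early only on error or NOPAGE.
 * Returns 1 if the rest of the fault path would run, 0 otherwise. */
int original_flow(int ret)
{
	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))
		return 0;
	return 1;
}

/* First debug patch: the new braces close before "return", so the
 * return no longer belongs to the if and runs on every call. */
int broken_debug_flow(int ret)
{
	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)) {
		fprintf(stderr, "handler returned 0x%x\n", ret);
	}
		return 0;	/* indentation is misleading: unconditional */

	return 1;		/* dead code: never reached */
}

/* Corrected shape: the return stays inside the braces, and only
 * VM_FAULT_OOM is logged. */
int fixed_debug_flow(int ret)
{
	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)) {
		if (ret & VM_FAULT_OOM)
			fprintf(stderr, "handler returned OOM (0x%x)\n", ret);
		return 0;
	}
	return 1;
}
```

Calling the three functions with 0 (a successful fault) shows the difference: original_flow and fixed_debug_flow continue, while broken_debug_flow returns early even on success, which is consistent with the boot hang reported above.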

>
>
> J
>
> ps, sending this through gmane, don't know if it'll keep cc's or not, so
> apologies if not. please cc me on any replies
>



--
Kind regards,
Minchan Kim

2009-11-06 00:01:11

by Norbert Preining

Subject: Re: OOM killer, page fault

On Fri, 06 Nov 2009, Minchan Kim wrote:
> > Erm, could it not be due to the "return ret;" line being moved outside of the
> > if(), so that it always executes?
>
> Right. Sorry, it's my fault.
> I must have been blind.
> 'return ret' should have been inside the braces of the debug code.


Bummer, I'm blind, too, that was in fact obvious, since the codeflow
was changed. Could have seen that myself, sorry.

Recompiling already and trying to recreate the oom-killer boom.

Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology [email protected]
Vienna University of Technology [email protected]
Debian Developer (Debian TeX Task Force) [email protected]
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
I'm going to have a look.'
He glanced round at the others.
`Is no one going to say, "No you can't possibly, let me go
instead"?'
They all shook their heads.
`Oh well.'
--- Ford attempting to be heroic whilst being seiged by
--- Shooty and Bangbang.
--- Douglas Adams, The Hitchhikers Guide to the Galaxy

2009-11-06 13:38:31

by Norbert Preining

Subject: Re: OOM killer, page fault

Hi Kim,

On Fri, 06 Nov 2009, preining wrote:
> Recompiling already and trying to recreate the oom-killer boom.

Well, after rebooting into that kernel I get *loads*, every few seconds,
of warnings in the log. Hard to sort out what is real. Is that expected?

Excerpt from the log:
[ 2077.753841] vma->vm_ops->fault : 0xffffffff811df4bd
[ 2077.753842] ------------[ cut here ]------------
[ 2077.753845] WARNING: at mm/memory.c:2722 __do_fault+0x89/0x382()
[ 2077.753847] Hardware name: VGN-Z11VN_B
...
[ 2077.753880] Pid: 4892, comm: Xorg Tainted: G W 2.6.32-rc6 #5
[ 2077.753881] Call Trace:
[ 2077.753884] [<ffffffff8108c6cc>] ? __do_fault+0x89/0x382
[ 2077.753887] [<ffffffff8108c6cc>] ? __do_fault+0x89/0x382
[ 2077.753889] [<ffffffff8103ae54>] ? warn_slowpath_common+0x77/0xa3
[ 2077.753892] [<ffffffff8108c6cc>] ? __do_fault+0x89/0x382
[ 2077.753895] [<ffffffff81341a82>] ? _spin_unlock+0x23/0x2f
[ 2077.753898] [<ffffffff8108e5d0>] ? handle_mm_fault+0x2b9/0x608
[ 2077.753900] [<ffffffff810af792>] ? do_vfs_ioctl+0x443/0x47b
[ 2077.753903] [<ffffffff81026759>] ? do_page_fault+0x25f/0x27b
[ 2077.753906] [<ffffffff81341e8f>] ? page_fault+0x1f/0x30
[ 2077.753908] ---[ end trace d3324ef5061f0136 ]---

hundreds/thousands of them.

And even without starting anything else. Is that what you want?
My syslog file has grown to some hundred megabytes ...


Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology [email protected]
Vienna University of Technology [email protected]
Debian Developer (Debian TeX Task Force) [email protected]
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
LARGOWARD (n.)
Motorists' name for the kind of pedestrian who stands beside a main
road and waves on the traffic, as if it's their right of way.
--- Douglas Adams, The Meaning of Liff

2009-11-06 15:14:22

by Minchan Kim

Subject: Re: OOM killer, page fault

On Fri, Nov 6, 2009 at 10:38 PM, Norbert Preining <[email protected]> wrote:
> Hi Kim,
>
> On Fr, 06 Nov 2009, preining wrote:
>> Recompiling already and trying to recreate the oom-killer boom.
>
> Well, after rebooting into that kernel I get *loads*, every few seconds,
> of warnings in the log. Hard to sort out what is real. Is that expected?

I guess it is VM_FAULT_NOPAGE from i915_gem or something similar.
That is not what we care about; we care about VM_FAULT_OOM.
I did not expect that, so let's change the debug patch as follows.

The most important question is who returns VM_FAULT_OOM.
If a fault handler returns VM_FAULT_OOM, the OOM killer will kill a process
with a high score. In your case, it was X.

If you did not see this before 2.6.32-rc5, it should be a regression somewhere.
If we can find out where, we can pass the problem on to its maintainer.

Could you try again with the patch below?
If you reproduce it, you can match the function address in the log against
the function addresses in your System.map. Please let me know what you find. :)

diff --git a/mm/memory.c b/mm/memory.c
index 7e91b5f..97a6fcb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2713,8 +2713,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
         vmf.page = NULL;

         ret = vma->vm_ops->fault(vma, &vmf);
-        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+        if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+                if (ret & VM_FAULT_OOM) {
+                        printk(KERN_DEBUG "fault handler : 0x%lx\n", vma->vm_ops->fault);
+
+                }
                 return ret;
+        }

         if (unlikely(PageHWPoison(vmf.page))) {
                 if (ret & VM_FAULT_LOCKED)
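
The System.map lookup mentioned above can be sketched in user-space C as follows. This is only an illustration under stated assumptions: lookup_symbol is a hypothetical helper name, and the input is assumed to use the usual "<hex address> <type> <name>" line format. It picks the symbol with the highest start address that is not above the logged address.

```c
#include <stdio.h>

/*
 * Find the symbol that covers `addr` in System.map-style text, where
 * each line is "<hex address> <type> <name>". Lines need not be sorted;
 * the symbol with the highest start address <= addr wins.
 * Returns 1 and fills name/offset on success, 0 if nothing matches.
 * (Hypothetical helper for illustration, not part of any kernel tool.)
 */
int lookup_symbol(FILE *map, unsigned long long addr,
		  char *name, size_t name_len, unsigned long long *offset)
{
	char line[512], sym[256];
	unsigned long long start, best = 0;
	int found = 0;

	while (fgets(line, sizeof(line), map)) {
		/* %*c skips the one-letter symbol type column */
		if (sscanf(line, "%llx %*c %255s", &start, sym) != 2)
			continue;
		if (start <= addr && (!found || start > best)) {
			best = start;
			snprintf(name, name_len, "%s", sym);
			found = 1;
		}
	}
	if (found)
		*offset = addr - best;
	return found;
}
```

With the System.map of the running kernel opened via fopen(), the address printed by the debug patch (for example the 0xffffffff811df4bd value seen in the earlier log) would be passed as addr; the result names the fault handler and the offset into it.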



>
> Excerpt from the log:
> [ 2077.753841] vma->vm_ops->fault : 0xffffffff811df4bd
> [ 2077.753842] ------------[ cut here ]------------
> [ 2077.753845] WARNING: at mm/memory.c:2722 __do_fault+0x89/0x382()
> [ 2077.753847] Hardware name: VGN-Z11VN_B
> ...
> [ 2077.753880] Pid: 4892, comm: Xorg Tainted: G        W  2.6.32-rc6 #5
> [ 2077.753881] Call Trace:
> [ 2077.753884]  [<ffffffff8108c6cc>] ? __do_fault+0x89/0x382
> [ 2077.753887]  [<ffffffff8108c6cc>] ? __do_fault+0x89/0x382
> [ 2077.753889]  [<ffffffff8103ae54>] ? warn_slowpath_common+0x77/0xa3
> [ 2077.753892]  [<ffffffff8108c6cc>] ? __do_fault+0x89/0x382
> [ 2077.753895]  [<ffffffff81341a82>] ? _spin_unlock+0x23/0x2f
> [ 2077.753898]  [<ffffffff8108e5d0>] ? handle_mm_fault+0x2b9/0x608
> [ 2077.753900]  [<ffffffff810af792>] ? do_vfs_ioctl+0x443/0x47b
> [ 2077.753903]  [<ffffffff81026759>] ? do_page_fault+0x25f/0x27b
> [ 2077.753906]  [<ffffffff81341e8f>] ? page_fault+0x1f/0x30
> [ 2077.753908] ---[ end trace d3324ef5061f0136 ]---
>
> hundreds/thousands of them.
>
> And even without starting anything else. Is that what you want?
> My syslog file has grown to some hundred megabytes ...
>
>
> Best wishes
>
> Norbert
>



--
Kind regards,
Minchan Kim

2009-11-06 15:18:49

by Norbert Preining

Subject: Re: OOM killer, page fault

recompiling and retrying ...

On Sat, 07 Nov 2009, Minchan Kim wrote:
> + printk(KERN_DEBUG "fault handler : 0x%lx\n", vma->vm_ops->fault);

BTW:
mm/memory.c:2722: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 2 has type ‘int (* const)(struct vm_area_struct *, struct vm_fault *)’

Best wishes

Norbert

-------------------------------------------------------------------------------
Dr. Norbert Preining Associate Professor
JAIST Japan Advanced Institute of Science and Technology [email protected]
Vienna University of Technology [email protected]
Debian Developer (Debian TeX Task Force) [email protected]
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
LOWTHER (vb.)
(Of a large group of people who have been to the cinema together.) To
stand aimlessly about on the pavement and argue about whatever to go
and eat either a Chinese meal nearby or an Indian meal at a restaurant
which somebody says is very good but isn't certain where it is, or
have a drink and think about it, or just go home, or have a Chinese
meal nearby - until by the time agreement is reached everything is
shut.
--- Douglas Adams, The Meaning of Liff