LinuxLists.cc - [oom]: [0/4] fix OOM deadlock running OAST

2004-06-23 21:12:49

by William Lee Irwin III

[permalink] [raw]

Subject: [oom]: [0/4] fix OOM deadlock running OAST

While running OAST to test 2.6's maximum client capacity, the kernel
deadlocked instead of properly OOM'ing. The obvious cause was the
line if (nr_swap_pages > 0) in out_of_memory(), which fails to account
for pinned allocations. This can't simply be removed. The following
patches attempt to give the kernel the ability to discriminate between
pinned and unpinned allocations in order to determine whether this
check is appropriate. They furthermore also add reporting of wired
memory on a global and per-zone basis.

-- wli

2004-06-23 21:10:12

by William Lee Irwin III

[permalink] [raw]

Subject: [oom]: [4/4] check __GFP_WIRED in out_of_memory()

Index: linux-2.6.7/include/linux/swap.h
===================================================================
--- linux-2.6.7.orig/include/linux/swap.h 2004-06-16 05:18:55.000000000 +0000
+++ linux-2.6.7/include/linux/swap.h 2004-06-23 16:41:20.000000000 +0000
@@ -148,7 +148,7 @@
#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)

/* linux/mm/oom_kill.c */
-extern void out_of_memory(void);
+void out_of_memory(int gfp_mask);

/* linux/mm/memory.c */
extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
Index: linux-2.6.7/mm/oom_kill.c
===================================================================
--- linux-2.6.7.orig/mm/oom_kill.c 2004-06-16 05:19:29.000000000 +0000
+++ linux-2.6.7/mm/oom_kill.c 2004-06-23 16:41:58.000000000 +0000
@@ -220,7 +220,7 @@
/**
* out_of_memory - is the system out of memory?
*/
-void out_of_memory(void)
+void out_of_memory(int gfp_mask)
{
/*
* oom_lock protects out_of_memory()'s static variables.
@@ -233,7 +233,7 @@
/*
* Enough swap space left? Not OOM.
*/
- if (nr_swap_pages > 0)
+ if (!(gfp_mask & __GFP_WIRED) && nr_swap_pages > 0)
return;

spin_lock(&oom_lock);
Index: linux-2.6.7/mm/vmscan.c
===================================================================
--- linux-2.6.7.orig/mm/vmscan.c 2004-06-16 05:18:58.000000000 +0000
+++ linux-2.6.7/mm/vmscan.c 2004-06-23 16:42:12.000000000 +0000
@@ -944,7 +944,7 @@
blk_congestion_wait(WRITE, HZ/10);
}
if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
- out_of_memory();
+ out_of_memory(gfp_mask);
out:
for (i = 0; zones[i] != 0; i++)
zones[i]->prev_priority = zones[i]->temp_priority;

2004-06-23 21:12:43

by William Lee Irwin III

[permalink] [raw]

Subject: [oom]: [3/4] track wired pages on a per-zone basis

Index: linux-2.6.7/include/linux/mmzone.h
===================================================================
--- linux-2.6.7.orig/include/linux/mmzone.h 2004-06-16 05:19:36.000000000 +0000
+++ linux-2.6.7/include/linux/mmzone.h 2004-06-23 18:58:13.000000000 +0000
@@ -170,6 +170,7 @@
ZONE_PADDING(_pad3_)

struct per_cpu_pageset pageset[NR_CPUS];
+ unsigned long nr_wired[NR_CPUS];

/*
* Discontig memory support fields.
Index: linux-2.6.7/include/linux/page-flags.h
===================================================================
--- linux-2.6.7.orig/include/linux/page-flags.h 2004-06-23 18:57:13.000000000 +0000
+++ linux-2.6.7/include/linux/page-flags.h 2004-06-23 18:58:21.000000000 +0000
@@ -322,6 +322,8 @@
int __clear_page_dirty(struct page *page);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
+void set_page_wired(struct page *);
+void clear_page_wired(struct page *);

static inline void clear_page_dirty(struct page *page)
{
@@ -333,16 +335,4 @@
test_set_page_writeback(page);
}

-static inline void set_page_wired(struct page *page)
-{
- SetPageWired(page);
- inc_page_state(nr_wired);
-}
-
-static inline void clear_page_wired(struct page *page)
-{
- ClearPageWired(page);
- dec_page_state(nr_wired);
-}
-
#endif /* PAGE_FLAGS_H */
Index: linux-2.6.7/include/linux/mm_inline.h
===================================================================
--- linux-2.6.7.orig/include/linux/mm_inline.h 2004-06-16 05:20:26.000000000 +0000
+++ linux-2.6.7/include/linux/mm_inline.h 2004-06-23 18:58:13.000000000 +0000
@@ -30,11 +30,13 @@
static inline void
del_page_from_lru(struct zone *zone, struct page *page)
{
- list_del(&page->lru);
- if (PageActive(page)) {
- ClearPageActive(page);
- zone->nr_active--;
- } else {
- zone->nr_inactive--;
+ if (!PageWired(page)) {
+ list_del(&page->lru);
+ if (!PageActive(page))
+ zone->nr_inactive--;
+ else {
+ ClearPageActive(page);
+ zone->nr_active--;
+ }
}
}
Index: linux-2.6.7/mm/swap.c
===================================================================
--- linux-2.6.7.orig/mm/swap.c 2004-06-16 05:19:13.000000000 +0000
+++ linux-2.6.7/mm/swap.c 2004-06-23 18:58:50.000000000 +0000
@@ -54,6 +54,34 @@
EXPORT_SYMBOL(put_page);
#endif

+void set_page_wired(struct page *page)
+{
+ unsigned long flags;
+ int cpu;
+ struct zone *zone = page_zone(page);
+
+ SetPageWired(page);
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ per_cpu(page_states, cpu).nr_wired++;
+ zone->nr_wired[cpu]++;
+ local_irq_restore(flags);
+}
+
+void clear_page_wired(struct page *page)
+{
+ unsigned long flags;
+ int cpu;
+ struct zone *zone = page_zone(page);
+
+ ClearPageWired(page);
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ per_cpu(page_states, cpu).nr_wired--;
+ zone->nr_wired[cpu]--;
+ local_irq_restore(flags);
+}
+
/*
* Writeback is about to end against a page which has been marked for immediate
* reclaim. If it still appears to be reclaimable, move it to the tail of the
@@ -101,8 +129,11 @@
*/
void fastcall activate_page(struct page *page)
{
- struct zone *zone = page_zone(page);
+ struct zone *zone;

+ if (PageWired(page))
+ return;
+ zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && !PageActive(page)) {
del_page_from_inactive_list(zone, page);
@@ -122,12 +153,13 @@
*/
void fastcall mark_page_accessed(struct page *page)
{
+ if (PageWired(page))
+ return;
if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
activate_page(page);
ClearPageReferenced(page);
- } else if (!PageReferenced(page)) {
+ } else if (!PageReferenced(page))
SetPageReferenced(page);
- }
}

EXPORT_SYMBOL(mark_page_accessed);
@@ -144,7 +176,7 @@
struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);

page_cache_get(page);
- if (!pagevec_add(pvec, page))
+ if (!PageWired(page) && !pagevec_add(pvec, page))
__pagevec_lru_add(pvec);
put_cpu_var(lru_add_pvecs);
}
@@ -154,7 +186,7 @@
struct pagevec *pvec = &get_cpu_var(lru_add_active_pvecs);

page_cache_get(page);
- if (!pagevec_add(pvec, page))
+ if (!PageWired(page) && !pagevec_add(pvec, page))
__pagevec_lru_add_active(pvec);
put_cpu_var(lru_add_active_pvecs);
}
Index: linux-2.6.7/mm/page_alloc.c
===================================================================
--- linux-2.6.7.orig/mm/page_alloc.c 2004-06-23 18:57:50.000000000 +0000
+++ linux-2.6.7/mm/page_alloc.c 2004-06-23 18:58:13.000000000 +0000
@@ -1066,9 +1066,7 @@
{
struct page_state ps;
int cpu, temperature;
- unsigned long active;
- unsigned long inactive;
- unsigned long free;
+ unsigned long active, inactive, free, wired;
struct zone *zone;

for_each_zone(zone) {
@@ -1122,6 +1120,8 @@
int i;

show_node(zone);
+ for (wired = cpu = 0; cpu < NR_CPUS; ++cpu)
+ wired += zone->nr_wired[cpu];
printk("%s"
" free:%lukB"
" min:%lukB"
@@ -1130,6 +1130,7 @@
" active:%lukB"
" inactive:%lukB"
" present:%lukB"
+ " wired:%lukB"
"\n",
zone->name,
K(zone->free_pages),
@@ -1138,7 +1139,8 @@
K(zone->pages_high),
K(zone->nr_active),
K(zone->nr_inactive),
- K(zone->present_pages)
+ K(zone->present_pages),
+ K(wired)
);
printk("protections[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
@@ -1662,8 +1664,8 @@
pg_data_t *pgdat = (pg_data_t *)arg;
struct zone *zone;
struct zone *node_zones = pgdat->node_zones;
- unsigned long flags;
- int order;
+ unsigned long wired, flags;
+ int order, cpu;

for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!zone->present_pages)
@@ -1681,6 +1683,11 @@
}
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
+ for (wired = cpu = 0; cpu < NR_CPUS; ++cpu)
+ wired += zone->nr_wired[cpu];
+ wired >>= PAGE_SHIFT - 10;
+ seq_printf(m, "Node %d, zone %8s wired: %lu kB\n",
+ pgdat->node_id, zone->name, wired);
}
return 0;
}

2004-06-23 21:12:49

by William Lee Irwin III

[permalink] [raw]

Subject: [oom]: [2/4] add nr_wired to page_state

Index: linux-2.6.7/include/linux/page-flags.h
===================================================================
--- linux-2.6.7.orig/include/linux/page-flags.h 2004-06-16 05:19:42.000000000 +0000
+++ linux-2.6.7/include/linux/page-flags.h 2004-06-23 18:57:13.000000000 +0000
@@ -77,6 +77,7 @@
#define PG_compound 19 /* Part of a compound page */

#define PG_anon 20 /* Anonymous: anon_vma in mapping */
+#define PG_wired 21 /* wired: can't be evicted */

/*
@@ -90,7 +91,8 @@
unsigned long nr_page_table_pages;/* Pages used for pagetables */
unsigned long nr_mapped; /* mapped into pagetables */
unsigned long nr_slab; /* In slab */
-#define GET_PAGE_STATE_LAST nr_slab
+ unsigned long nr_wired; /* pinned pages */
+#define GET_PAGE_STATE_LAST nr_wired

/*
* The below are zeroed by get_page_state(). Use get_full_page_state()
@@ -302,6 +304,10 @@
#define SetPageAnon(page) set_bit(PG_anon, &(page)->flags)
#define ClearPageAnon(page) clear_bit(PG_anon, &(page)->flags)

+#define PageWired(page) test_bit(PG_wired, &(page)->flags)
+#define SetPageWired(page) set_bit(PG_wired, &(page)->flags)
+#define ClearPageWired(page) clear_bit(PG_wired, &(page)->flags)
+
#ifdef CONFIG_SWAP
#define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags)
#define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags)
@@ -327,4 +333,16 @@
test_set_page_writeback(page);
}

+static inline void set_page_wired(struct page *page)
+{
+ SetPageWired(page);
+ inc_page_state(nr_wired);
+}
+
+static inline void clear_page_wired(struct page *page)
+{
+ ClearPageWired(page);
+ dec_page_state(nr_wired);
+}
+
#endif /* PAGE_FLAGS_H */
Index: linux-2.6.7/mm/page_alloc.c
===================================================================
--- linux-2.6.7.orig/mm/page_alloc.c 2004-06-16 05:18:57.000000000 +0000
+++ linux-2.6.7/mm/page_alloc.c 2004-06-23 18:57:50.000000000 +0000
@@ -235,6 +235,8 @@
bad_page(function, page);
if (PageDirty(page))
ClearPageDirty(page);
+ if (PageWired(page))
+ clear_page_wired(page);
}

/*
@@ -563,6 +565,8 @@
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
}
+ if (gfp_flags & __GFP_WIRED)
+ set_page_wired(page);
return page;
}

@@ -1695,6 +1699,7 @@
"nr_page_table_pages",
"nr_mapped",
"nr_slab",
+ "nr_wired",

"pgpgin",
"pgpgout",
Index: linux-2.6.7/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.7.orig/fs/proc/proc_misc.c 2004-06-16 05:18:58.000000000 +0000
+++ linux-2.6.7/fs/proc/proc_misc.c 2004-06-23 18:57:33.000000000 +0000
@@ -204,6 +204,7 @@
"Slab: %8lu kB\n"
"Committed_AS: %8u kB\n"
"PageTables: %8lu kB\n"
+ "Wired: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
"VmallocChunk: %8lu kB\n",
@@ -226,6 +227,7 @@
K(ps.nr_slab),
K(committed),
K(ps.nr_page_table_pages),
+ K(ps.nr_wired),
vmtot,
vmi.used,
vmi.largest_chunk

2004-06-23 21:12:50

by William Lee Irwin III

[permalink] [raw]

Subject: [oom]: [1/4] add __GFP_WIRED to pinned allocations

Index: linux-2.6.7/include/linux/gfp.h
===================================================================
--- linux-2.6.7.orig/include/linux/gfp.h 2004-06-16 05:19:02.000000000 +0000
+++ linux-2.6.7/include/linux/gfp.h 2004-06-23 16:06:08.000000000 +0000
@@ -37,20 +37,21 @@
#define __GFP_NORETRY 0x1000 /* Do not retry. Might fail */
#define __GFP_NO_GROW 0x2000 /* Slab internal usage */
#define __GFP_COMP 0x4000 /* Add compound page metadata */
+#define __GFP_WIRED 0x8000 /* pinned */

#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)

/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
- __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
- __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP)
+ __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
+ __GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP|__GFP_WIRED)

-#define GFP_ATOMIC (__GFP_HIGH)
-#define GFP_NOIO (__GFP_WAIT)
-#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
-#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
+#define GFP_ATOMIC (__GFP_HIGH | __GFP_WIRED)
+#define GFP_NOIO (__GFP_WAIT | __GFP_WIRED)
+#define GFP_NOFS (__GFP_WAIT | __GFP_IO | __GFP_WIRED)
+#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_WIRED)
+#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_WIRED)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)

/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
Index: linux-2.6.7/mm/slab.c
===================================================================
--- linux-2.6.7.orig/mm/slab.c 2004-06-16 05:19:44.000000000 +0000
+++ linux-2.6.7/mm/slab.c 2004-06-23 16:06:08.000000000 +0000
@@ -1709,7 +1709,7 @@
return 0;

ctor_flags = SLAB_CTOR_CONSTRUCTOR;
- local_flags = (flags & SLAB_LEVEL_MASK);
+ local_flags = (flags & SLAB_LEVEL_MASK) | __GFP_WIRED;
if (!(local_flags & __GFP_WAIT))
/*
* Not allowed to sleep. Need to tell a constructor about
Index: linux-2.6.7/fs/block_dev.c
===================================================================
--- linux-2.6.7.orig/fs/block_dev.c 2004-06-16 05:20:26.000000000 +0000
+++ linux-2.6.7/fs/block_dev.c 2004-06-23 16:06:08.000000000 +0000
@@ -365,7 +365,7 @@
inode->i_rdev = dev;
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
- mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+ mapping_set_gfp_mask(&inode->i_data, GFP_USER & ~__GFP_WIRED);
inode->i_data.backing_dev_info = &default_backing_dev_info;
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
Index: linux-2.6.7/fs/ramfs/inode.c
===================================================================
--- linux-2.6.7.orig/fs/ramfs/inode.c 2004-06-16 05:19:11.000000000 +0000
+++ linux-2.6.7/fs/ramfs/inode.c 2004-06-23 17:33:10.000000000 +0000
@@ -61,6 +61,8 @@
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ mapping_set_gfp_mask(inode->i_mapping,
+ GFP_HIGHUSER | __GFP_WIRED);
switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);

2004-06-23 21:38:32

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [3/4] track wired pages on a per-zone basis

On Wed, Jun 23, 2004 at 02:07:48PM -0700, William Lee Irwin III wrote:
> + for (wired = cpu = 0; cpu < NR_CPUS; ++cpu)
> + wired += zone->nr_wired[cpu];
> + wired >>= PAGE_SHIFT - 10;
> + seq_printf(m, "Node %d, zone %8s wired: %lu kB\n",
> + pgdat->node_id, zone->name, wired);

Should be:
> + wired <<= PAGE_SHIFT - 10;

-- wli

2004-06-23 22:05:17

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [1/4] add __GFP_WIRED to pinned allocations

William Lee Irwin III <[email protected]> wrote:
>
> +#define __GFP_WIRED 0x8000 /* pinned */

This would be a nice thing to keep track of.

Isn't it the case that reclaimable slab pages (dentry, inode, mbcache,
dquot) should not be accounted as wired memory? Could perhaps use
SLAB_RECLAIM_ACCOUNT for that.

It would need to be overridden for, say, sysfs inodes and dentries, but
they're about to become reclaimable anyway so no prob.

It would need to be overridden for, say, ramfs dentries and inodes though.

rd.c's blockdev pagecache pages are wired.

2004-06-23 22:15:12

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [3/4] track wired pages on a per-zone basis

William Lee Irwin III <[email protected]> wrote:
>
> Index: linux-2.6.7/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.7.orig/include/linux/mmzone.h 2004-06-16 05:19:36.000000000 +0000
> +++ linux-2.6.7/include/linux/mmzone.h 2004-06-23 18:58:13.000000000 +0000
> @@ -170,6 +170,7 @@
> ZONE_PADDING(_pad3_)
>
> struct per_cpu_pageset pageset[NR_CPUS];
> + unsigned long nr_wired[NR_CPUS];

These will share cachelines of course, so the percpuification won't be very
effective. I wonder if there's some way in which the nr_wired accounting
can be batched up and then dumped into a single per-zone counter when we
have the zone->lru_lock.

How come there are all those PageWired() tests in the LRU manipulation
functions?

2004-06-23 22:17:09

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>
> While running OAST to test 2.6's maximum client capacity, the kernel
> deadlocked instead of properly OOM'ing. The obvious cause was the
> line if (nr_swap_pages > 0) in out_of_memory(), which fails to account
> for pinned allocations. This can't simply be removed.

It all seems like rather a lot of fuss.

It should be the case that zone->all_unreclaimable is set by the time this
happens. Did you consider feeding that into the oom-killing decision
instead?

2004-06-23 22:26:45

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [1/4] add __GFP_WIRED to pinned allocations

William Lee Irwin III <[email protected]> wrote:
>> +#define __GFP_WIRED 0x8000 /* pinned */

On Wed, Jun 23, 2004 at 03:05:46PM -0700, Andrew Morton wrote:
> This would be a nice thing to keep track of.
> Isn't it the case that reclaimable slab pages (dentry, inode, mbcache,
> dquot) should not be accounted as wired memory? Could perhaps use
> SLAB_RECLAIM_ACCOUNT for that.
> It would need to be overridden for, say, sysfs inodes and dentries, but
> they're about to become reclaimable anyway so no prob.
> It would need to be overridden for, say, ramfs dentries and inodes though.
> rd.c's blockdev pagecache pages are wired.

It's difficult to come up with comprehensible semantic refinements.
The trick with slab is whole slabs are frequently pinned by active
references, so there's quite a bit of squishiness there even after
SLAB_RECLAIM_ACCOUNT is figured out. A counter in the slab header for
references is plausibly precise, but appears to too invasive to ever
put into place. A less invasive refinement may be to clear_page_wired()
for empty slab pages.

Thanks for spotting rd.c's blkdev pagecache; that's a bogon in my
patches.

Also, I should mention this concept is based on code seen in RHEL3,
which uses a __GFP_WIRED flag, albeit for a different purpose (it
appears to be centered around removing dirty ramfs pagecache from
the LRU lists(s) as opposed to accounting for OOM-related reasons).

-- wli

Index: linux-2.6.7/drivers/block/rd.c
===================================================================
--- linux-2.6.7.orig/drivers/block/rd.c 2004-06-16 05:19:37.000000000 +0000
+++ linux-2.6.7/drivers/block/rd.c 2004-06-23 22:16:45.000000000 +0000
@@ -378,7 +378,7 @@
*/
gfp_mask = mapping_gfp_mask(mapping);
gfp_mask &= ~(__GFP_FS|__GFP_IO);
- gfp_mask |= __GFP_HIGH;
+ gfp_mask |= __GFP_HIGH|__GFP_WIRED;
mapping_set_gfp_mask(mapping, gfp_mask);
}

2004-06-23 22:31:30

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [3/4] track wired pages on a per-zone basis

William Lee Irwin III <[email protected]> wrote:
>> struct per_cpu_pageset pageset[NR_CPUS];
>> + unsigned long nr_wired[NR_CPUS];

On Wed, Jun 23, 2004 at 03:15:36PM -0700, Andrew Morton wrote:
> These will share cachelines of course, so the percpuification won't be very
> effective. I wonder if there's some way in which the nr_wired accounting
> can be batched up and then dumped into a single per-zone counter when we
> have the zone->lru_lock.

It's difficult to anticipate the number of zones required for a per-cpu
data structure to be periodically resynched with the zones. The
counters, both global and per-zone are purely for reporting purposes
and have no impact on functionality in this series.

On Wed, Jun 23, 2004 at 03:15:36PM -0700, Andrew Morton wrote:
> How come there are all those PageWired() tests in the LRU manipulation
> functions?

Largely for the benefit of ramfs. As you pointed out, rd.c's blkdev
pagecache requires similar treatment. The net effect of these is that
wired pagecache pages don't appear on the LRU at all. This makes the
assumption that all wired pagecache is memory-backed and never needs
to be written to its backing store, which AFAICT is true in all cases.

-- wli

2004-06-23 22:34:54

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>> While running OAST to test 2.6's maximum client capacity, the kernel
>> deadlocked instead of properly OOM'ing. The obvious cause was the
>> line if (nr_swap_pages > 0) in out_of_memory(), which fails to account
>> for pinned allocations. This can't simply be removed.

On Wed, Jun 23, 2004 at 03:16:59PM -0700, Andrew Morton wrote:
> It all seems like rather a lot of fuss.
> It should be the case that zone->all_unreclaimable is set by the time this
> happens. Did you consider feeding that into the oom-killing decision
> instead?

The vast majority of all this are the couhters for reporting, which have
no effect on functionality. The actual functional effect is achieved by
two aspects: (a) passing __GFP_WIRED to __alloc_pages() and (b) passing
__GFP_WIRED to out_of_memory(), which informs it not to perform the test
if (nr_swap_pages > 0). I also made the small addition of removing wired
pagecache from the LRU lists, which is performance, not correctness.

I'll resend with all reporting/counters and LRU bits ripped out if needed.

-- wli

2004-06-23 22:41:58

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>
> William Lee Irwin III <[email protected]> wrote:
> >> While running OAST to test 2.6's maximum client capacity, the kernel
> >> deadlocked instead of properly OOM'ing. The obvious cause was the
> >> line if (nr_swap_pages > 0) in out_of_memory(), which fails to account
> >> for pinned allocations. This can't simply be removed.
>
> On Wed, Jun 23, 2004 at 03:16:59PM -0700, Andrew Morton wrote:
> > It all seems like rather a lot of fuss.
> > It should be the case that zone->all_unreclaimable is set by the time this
> > happens. Did you consider feeding that into the oom-killing decision
> > instead?
>
> The vast majority of all this are the couhters for reporting, which have
> no effect on functionality. The actual functional effect is achieved by
> two aspects: (a) passing __GFP_WIRED to __alloc_pages() and (b) passing
> __GFP_WIRED to out_of_memory(), which informs it not to perform the test
> if (nr_swap_pages > 0). I also made the small addition of removing wired
> pagecache from the LRU lists, which is performance, not correctness.
>
> I'll resend with all reporting/counters and LRU bits ripped out if needed.
>

What about zone->all_unreclaimable?

2004-06-23 22:41:57

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [1/4] add __GFP_WIRED to pinned allocations

William Lee Irwin III <[email protected]> wrote:
>
> Also, I should mention this concept is based on code seen in RHEL3,

What kernel is that?

> which uses a __GFP_WIRED flag, albeit for a different purpose (it
> appears to be centered around removing dirty ramfs pagecache from
> the LRU lists(s) as opposed to accounting for OOM-related reasons).

yes, putting ramfs, ramdisk and sysfs pages on the LRU is a bit silly.

However I think that taking a peek into mapping->backing_dev_info in
add_to_page_cache_lru(), __grab_cache_page() and wherever else would be a
tidier way of doing it.

2004-06-23 23:07:34

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Wed, Jun 23, 2004 at 03:37:58PM -0700, Andrew Morton wrote:
> What about zone->all_unreclaimable?

It's unclear which zones must be checked for this to be of use. In
general, suppose one determines all possible zones from which a given
allocation may be satisfied (this still assumes out_of_memory() is
passed a gfp_mask, and then discovers ->all_unreclaimable. It is still
not possible to determine whether nr_swap_pages is relevant.

Now suppose the check for nr_swap_pages > 0 is removed in tandem with
the passing of the gfp_mask to out_of_memory() and checking
->all_unreclaimable. This will then risk triggering OOM falsely for
unpinned pagecache allocations in the presence of high scan rates,
though it may yet be safe.

There are larger semantic questions also. For instance, the runs that
deadlocked were with non-overcommit enabled (that is, I left the sysctl
/proc/sys/vm/overcommit_memory == 0 after boot). The intention was to
provide best effort unfailability to userspace allocations by detecting
insufficient resources up-front and returning MAP_FAILED from mmap() and
so on. There is a current hole in this, which is that pinned kernel
memory allocations aren't subtracted from the pool of memory considered
to be available to userspace. This would add the additional caveat that
pure userspace memory allocations may result in OOM kills where they
didn't before, as without checking __GFP_WIRED, it's not possible to
determine whether nr_swap_pages > 0 is relevant, where it would suffice
for pure userspace allocations with non-overcommit.

__GFP_WIRED is faithful to the current timeout-based semantics and
so minimizes the risk associated with discarding the nr_swap_pages > 0
check. I'm willing to entertain deeper semantic changes such as the
above removal of nr_swap_pages > 0 in tandem with passing the gfp_mask
to out_of_memory() and checking for ->all_unreclaimable for all the
members of the gfp_mask's zonelist if you still want them given all this.

It also occurs to me that it's possible to detect overcommitment via a
combination of kernel and user allocations by means of summing the
per_cpu nr_wired counters along with vm_committed_memory, so there may
yet be methods of strengthening strict non-overcommit semantics.

-- wli

2004-06-23 23:35:40

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>
> On Wed, Jun 23, 2004 at 03:37:58PM -0700, Andrew Morton wrote:
> > What about zone->all_unreclaimable?
>
> It's unclear which zones must be checked for this to be of use. In
> general, suppose one determines all possible zones from which a given
> allocation may be satisfied (this still assumes out_of_memory() is
> passed a gfp_mask, and then discovers ->all_unreclaimable. It is still
> not possible to determine whether nr_swap_pages is relevant.

I don't think it _is_ relevant. Wev'e scanned the crap out of all the
eligible zones, found nothing swappable outable or otherwise reclaimable.
That's as good a definition of oom as you're likely to get. It takes care
of mlocked user memory too.

> ...
> There are larger semantic questions also. For instance, the runs that
> deadlocked were with non-overcommit enabled (that is, I left the sysctl
> /proc/sys/vm/overcommit_memory == 0 after boot). The intention was to
> provide best effort unfailability to userspace allocations by detecting
> insufficient resources up-front and returning MAP_FAILED from mmap() and
> so on. There is a current hole in this, which is that pinned kernel
> memory allocations aren't subtracted from the pool of memory considered
> to be available to userspace.

The overcommit logic doesn't need to care about pinned kernel allocations.
All it cares about is reclaimable and free pages. Its calculation of what
is reclaimable is a bit wonky because of ramfs/sysfs/ramdisk pages and
metadata, and because of mlock.

But for this problem, I do suspect that going oom if all the eligible zones
are pegged out should suffice. It is certainly accurate.

It could be that the ->all_unreclaimable logic has rotted over the ages and
may ned some attention - I haven't explicitly checked it in a year or more.
Also there's some problem at present wherein the oomkiller takes 30 or
more seconds to kick in and do its thing, which I need to look at. This
happens when there's no swap online. It might also happen when swapspace
is full - I didn't check.

2004-06-24 00:03:16

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>> It's unclear which zones must be checked for this to be of use. In
>> general, suppose one determines all possible zones from which a given
>> allocation may be satisfied (this still assumes out_of_memory() is
>> passed a gfp_mask, and then discovers ->all_unreclaimable. It is still
>> not possible to determine whether nr_swap_pages is relevant.

On Wed, Jun 23, 2004 at 04:38:19PM -0700, Andrew Morton wrote:
> I don't think it _is_ relevant. Wev'e scanned the crap out of all the
> eligible zones, found nothing swappable outable or otherwise reclaimable.
> That's as good a definition of oom as you're likely to get. It takes care
> of mlocked user memory too.

The actual net effect of all this is blowing away if (nr_swap_pages > 0)
for __GFP_WIRED allocations. Removing those 2 lines (and the one line of
whitespace next to it) will pass the test I observed failure in. It's a
judgment call as to whether it's beneficial in general, as it does
insulate userspace somewhat from needing to wait for slow IO being the
ostensible cause of the allocation failure. RedHat vendor kernels have
removed the check entirely, which may be considered to be evidence of
the approach's viability.

William Lee Irwin III <[email protected]> wrote:
>> There are larger semantic questions also. For instance, the runs that
>> deadlocked were with non-overcommit enabled (that is, I left the sysctl
>> /proc/sys/vm/overcommit_memory == 0 after boot). The intention was to
>> provide best effort unfailability to userspace allocations by detecting
>> insufficient resources up-front and returning MAP_FAILED from mmap() and
>> so on. There is a current hole in this, which is that pinned kernel
>> memory allocations aren't subtracted from the pool of memory considered
>> to be available to userspace.

On Wed, Jun 23, 2004 at 04:38:19PM -0700, Andrew Morton wrote:
> The overcommit logic doesn't need to care about pinned kernel allocations.
> All it cares about is reclaimable and free pages. Its calculation of what
> is reclaimable is a bit wonky because of ramfs/sysfs/ramdisk pages and
> metadata, and because of mlock.

Hmm. That sort of thing is what I had in mind by saying "pinned kernel
allocations". What did it sound like I had in mind?

On Wed, Jun 23, 2004 at 04:38:19PM -0700, Andrew Morton wrote:
> But for this problem, I do suspect that going oom if all the eligible zones
> are pegged out should suffice. It is certainly accurate.
> It could be that the ->all_unreclaimable logic has rotted over the ages and
> may ned some attention - I haven't explicitly checked it in a year or more.
> Also there's some problem at present wherein the oomkiller takes 30 or
> more seconds to kick in and do its thing, which I need to look at. This
> happens when there's no swap online. It might also happen when swapspace
> is full - I didn't check.

The timeout semantics are crusty. 10 failures 1s or more apart must be
seen, but with no delay of more than 5s between failures. If the average
delay between attempts more than 1s from the last successful attempt
that incremented the counter was 3s, that would make for a 30s delay
before OOM killing. In the presence of e.g. delays waiting for IO and
so on preventing retries from happening rapidly, 30s is plausible. Also,
it's rather likely that the tasks targeted by the OOM killer are in
state TASK_UNINTERRUPTIBLE and so will never respond to the OOM kill
until whatever they were waiting for happens, which may require (you
guessed it) memory to occur.

I would prefer to migrate away from timeout-based semantics and eliminate
indefinite retry logic and out-of-context reclamation (which is the source
of reclamation failures having no context to which to return errors).
However, this also requires some method of recovering from or insulating
administrators from indefinite retry loops originating from userspace, and
better contexts to which to return failure than syslog daemons.

-- wli

2004-06-24 00:15:28

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>
> On Wed, Jun 23, 2004 at 04:38:19PM -0700, Andrew Morton wrote:
> > I don't think it _is_ relevant. Wev'e scanned the crap out of all the
> > eligible zones, found nothing swappable outable or otherwise reclaimable.
> > That's as good a definition of oom as you're likely to get. It takes care
> > of mlocked user memory too.
>
> The actual net effect of all this is blowing away if (nr_swap_pages > 0)
> for __GFP_WIRED allocations. Removing those 2 lines (and the one line of
> whitespace next to it) will pass the test I observed failure in.

OK.

> It's a
> judgment call as to whether it's beneficial in general, as it does
> insulate userspace somewhat from needing to wait for slow IO being the
> ostensible cause of the allocation failure.

mm... I can only see that happening if the IO system is retiring write
requests at much less than 10/sec, which seems unlikely. Still, that can
be tuned around.

> RedHat vendor kernels have removed the check entirely

When telling us this sort of thing, please always specify the kernel version.

I assume you're referring to a 2.6 kernel? If so, some thwapping might be
in order.

2004-06-24 00:26:56

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>> It's a
>> judgment call as to whether it's beneficial in general, as it does
>> insulate userspace somewhat from needing to wait for slow IO being the
>> ostensible cause of the allocation failure.

On Wed, Jun 23, 2004 at 05:18:18PM -0700, Andrew Morton wrote:
> mm... I can only see that happening if the IO system is retiring write
> requests at much less than 10/sec, which seems unlikely. Still, that can
> be tuned around.

Then it sounds like the smaller fix below may be better for you.

William Lee Irwin III <[email protected]> wrote:
>> RedHat vendor kernels have removed the check entirely

On Wed, Jun 23, 2004 at 05:18:18PM -0700, Andrew Morton wrote:
> When telling us this sort of thing, please always specify the kernel version.
> I assume you're referring to a 2.6 kernel? If so, some thwapping might be
> in order.

No, RHEL3. I'm not aware of any mm/oom_kill.c changes in any of the
Fedora snapshots.

-- wli

During stress testing at Oracle to determine the maximum number of
clients 2.6 can service, it was discovered that the failure mode of
excessive numbers of clients was kernel deadlock. The following patch
removes the check if (nr_swap_pages > 0) from out_of_memory() as this
heuristic fails to detect memory exhaustion due to pinned allocations,
directly causing the aforementioned deadlock.

===== mm/oom_kill.c 1.26 vs edited =====
--- 1.26/mm/oom_kill.c Thu Jun 3 01:46:39 2004
+++ edited/mm/oom_kill.c Wed Jun 23 17:22:22 2004
@@ -230,12 +230,6 @@
static unsigned long first, last, count, lastkill;
unsigned long now, since;

- /*
- * Enough swap space left? Not OOM.
- */
- if (nr_swap_pages > 0)
- return;
-
spin_lock(&oom_lock);
now = jiffies;
since = now - last;

2004-06-24 00:32:52

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Wed, Jun 23, 2004 at 05:26:51PM -0700, William Lee Irwin III wrote:
> Then it sounds like the smaller fix below may be better for you.

Also, as we're fixing this a different way, could you clarify for me
which of the pieces of the original fix or related things (e.g. the
zone->all_unreclaimable stuff, yanking PG_wired stuff off the LRU,
maybe more) you wanted me to rework and send back in later?

I'm heading out for an hour or so, so I'll be slightly delayed getting
those things back to you.

-- wli

2004-06-24 01:08:23

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>
> On Wed, Jun 23, 2004 at 05:26:51PM -0700, William Lee Irwin III wrote:
> > Then it sounds like the smaller fix below may be better for you.
>
> Also, as we're fixing this a different way, could you clarify for me
> which of the pieces of the original fix or related things (e.g. the
> zone->all_unreclaimable stuff, yanking PG_wired stuff off the LRU,
> maybe more) you wanted me to rework and send back in later?
>

Well none, really. This problem is now fixed, is it not?

It would be nice to fix up the unrelated issue of putting unbacked pages
onto the LRU. Could do that by adding backing_dev_info.not_on_lru, check
that in the various places where we add pages to the LRU.

Or, conceivably, do it lazily: take these pages off the LRU if we encounter
them in page reclaim. This might be a net win - do extra work for the rare
case, less work for the common case.

Something like this:

--- 25/mm/vmscan.c~a 2004-06-23 18:05:19.738648736 -0700
+++ 25-akpm/mm/vmscan.c 2004-06-23 18:06:00.386469328 -0700
@@ -367,6 +367,17 @@ static int shrink_list(struct list_head
if (page_mapped(page) || PageSwapCache(page))
sc->nr_scanned++;

+ mapping = page_mapping(page);
+
+ if (mapping && mapping->backing_dev_info->not_on_lru) {
+ /*
+ * comment goes here
+ */
+ unlock_page(page);
+ page_cache_release(page);
+ continue;
+ }
+
page_map_lock(page);
referenced = page_referenced(page);
if (referenced && page_mapping_inuse(page)) {
@@ -386,11 +397,11 @@ static int shrink_list(struct list_head
page_map_unlock(page);
if (!add_to_swap(page))
goto activate_locked;
+ mapping = page_mapping(page);
page_map_lock(page);
}
#endif /* CONFIG_SWAP */

- mapping = page_mapping(page);
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

_

2004-06-24 01:24:39

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>> Also, as we're fixing this a different way, could you clarify for me
>> which of the pieces of the original fix or related things (e.g. the
>> zone->all_unreclaimable stuff, yanking PG_wired stuff off the LRU,
>> maybe more) you wanted me to rework and send back in later?

On Wed, Jun 23, 2004 at 06:07:22PM -0700, Andrew Morton wrote:
> Well none, really. This problem is now fixed, is it not?
> It would be nice to fix up the unrelated issue of putting unbacked pages
> onto the LRU. Could do that by adding backing_dev_info.not_on_lru, check
> that in the various places where we add pages to the LRU.
> Or, conceivably, do it lazily: take these pages off the LRU if we encounter
> them in page reclaim. This might be a net win - do extra work for the rare
> case, less work for the common case.
> Something like this:

To me, this resembles reworking the LRU removal bits. I'll flesh out
the example you posted in the way you've suggested (via backing_dev_info).

Thanks.

-- wli

2004-06-24 01:52:51

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Wed, Jun 23, 2004 at 06:07:22PM -0700, Andrew Morton wrote:
>> Well none, really. This problem is now fixed, is it not?
>> It would be nice to fix up the unrelated issue of putting unbacked pages
>> onto the LRU. Could do that by adding backing_dev_info.not_on_lru, check
>> that in the various places where we add pages to the LRU.
>> Or, conceivably, do it lazily: take these pages off the LRU if we encounter
>> them in page reclaim. This might be a net win - do extra work for the rare
>> case, less work for the common case.
>> Something like this:

On Wed, Jun 23, 2004 at 06:24:35PM -0700, William Lee Irwin III wrote:
> To me, this resembles reworking the LRU removal bits. I'll flesh out
> the example you posted in the way you've suggested (via backing_dev_info).
> Thanks.

Also, you mentioned at one point extending committed memory accounting
to account for unreclaimable pages (the term you suggested). Would you
also like that to be looked into? It might take longer than overnight
to brew up, mostly due to testing turnaround.

-- wli

2004-06-24 02:02:49

by Andrew Morton

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

William Lee Irwin III <[email protected]> wrote:
>
> Also, you mentioned at one point extending committed memory accounting
> to account for unreclaimable pages (the term you suggested). Would you
> also like that to be looked into? It might take longer than overnight
> to brew up, mostly due to testing turnaround.

Well let's see how the patch ends up looking.

I have bad feelings about the overcommit logic - several times we have
accidentally noticed (and fixed) gross inaccuracies in it, and I am sure
others remain.

I am not aware of anyone getting down and explicitly testing it in lots of
different scenarios (including mlock), and perhaps it gets inaccurate when
zone fallbacks are involved.

If you're prepared to undertake that level of thinking and coverage testing
and fix up the fallout, that would certainly be good. If you think it's
worth the effort, and, again, depending upon the performance and ickiness
impact of the patches.

2004-06-24 02:15:31

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Wed, Jun 23, 2004 at 07:01:50PM -0700, Andrew Morton wrote:
> Well let's see how the patch ends up looking.
> I have bad feelings about the overcommit logic - several times we have
> accidentally noticed (and fixed) gross inaccuracies in it, and I am sure
> others remain.
> I am not aware of anyone getting down and explicitly testing it in lots of
> different scenarios (including mlock), and perhaps it gets inaccurate when
> zone fallbacks are involved.
> If you're prepared to undertake that level of thinking and coverage testing
> and fix up the fallout, that would certainly be good. If you think it's
> worth the effort, and, again, depending upon the performance and ickiness
> impact of the patches.

This has some importance to database users, as orderly shutdown of
userspace is required for data integrity there in a similar fashion as
it is for the kernel to undergo orderly shutdown so as not to corrupt
filesystems (in both cases, some pains are taken to recover from these
kinds of situations as best as possible, however, neither's recovery is
infallible). So OOM kills during client overloads are highly undesirable.
I'll take this as a more general directive to audit and clean up the
non-overcommit accounting, which I don't have an issue with taking on,
before moving on to refining the semantics.

I'll do this up as a series of small fixes and so on. mlock coverage is
a certainty given database usage patterns and my current position.

-- wli

2004-06-24 14:51:05

by Marcelo Tosatti

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Wed, Jun 23, 2004 at 05:26:51PM -0700, William Lee Irwin III wrote:
> William Lee Irwin III <[email protected]> wrote:
> >> It's a
> >> judgment call as to whether it's beneficial in general, as it does
> >> insulate userspace somewhat from needing to wait for slow IO being the
> >> ostensible cause of the allocation failure.
>
> On Wed, Jun 23, 2004 at 05:18:18PM -0700, Andrew Morton wrote:
> > mm... I can only see that happening if the IO system is retiring write
> > requests at much less than 10/sec, which seems unlikely. Still, that can
> > be tuned around.
>
> Then it sounds like the smaller fix below may be better for you.
>
>
> William Lee Irwin III <[email protected]> wrote:
> >> RedHat vendor kernels have removed the check entirely
>
> On Wed, Jun 23, 2004 at 05:18:18PM -0700, Andrew Morton wrote:
> > When telling us this sort of thing, please always specify the kernel version.
> > I assume you're referring to a 2.6 kernel? If so, some thwapping might be
> > in order.
>
> No, RHEL3. I'm not aware of any mm/oom_kill.c changes in any of the
> Fedora snapshots.
>
>
> -- wli
>
> During stress testing at Oracle to determine the maximum number of
> clients 2.6 can service, it was discovered that the failure mode of
> excessive numbers of clients was kernel deadlock. The following patch
> removes the check if (nr_swap_pages > 0) from out_of_memory() as this
> heuristic fails to detect memory exhaustion due to pinned allocations,
> directly causing the aforementioned deadlock.
>
>
> ===== mm/oom_kill.c 1.26 vs edited =====
> --- 1.26/mm/oom_kill.c Thu Jun 3 01:46:39 2004
> +++ edited/mm/oom_kill.c Wed Jun 23 17:22:22 2004
> @@ -230,12 +230,6 @@
> static unsigned long first, last, count, lastkill;
> unsigned long now, since;
>
> - /*
> - * Enough swap space left? Not OOM.
> - */
> - if (nr_swap_pages > 0)
> - return;
> -
> spin_lock(&oom_lock);
> now = jiffies;
> since = now - last;

Removing the check on v2.4 based kernels will trigger the OOM killer
too soon for a lot of cases, I'm pretty sure.

2004-06-24 15:18:44

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Wed, Jun 23, 2004 at 05:26:51PM -0700, William Lee Irwin III wrote:
>> During stress testing at Oracle to determine the maximum number of
>> clients 2.6 can service, it was discovered that the failure mode of
>> excessive numbers of clients was kernel deadlock. The following patch
>> removes the check if (nr_swap_pages > 0) from out_of_memory() as this
>> heuristic fails to detect memory exhaustion due to pinned allocations,
>> directly causing the aforementioned deadlock.

On Thu, Jun 24, 2004 at 11:16:37AM -0300, Marcelo Tosatti wrote:
> Removing the check on v2.4 based kernels will trigger the OOM killer
> too soon for a lot of cases, I'm pretty sure.

Hmm. 2.4 appears to still be lacking some of the fixes (unrelated to
the nr_swap_pages check causing deadlocks) for functional issues.

-- wli

This patch by nature corrects two apparent bugs which are really one
bug. p->mm can become NULL while traversing the tasklist. The two
effects are first that kernel threads appear to be killed. The second
is that the OOM killing process fails to actually shoot down all threads
of the chosen process, and so fails to reclaim the memory it intended to.
oom_kill_task() consists primarily of the expansion of the 2.6 inline
function get_task_mm().

Index: linux-2.4/mm/oom_kill.c
===================================================================
--- linux-2.4.orig/mm/oom_kill.c 2004-06-23 19:30:21.000000000 -0700
+++ linux-2.4/mm/oom_kill.c 2004-06-23 19:52:25.000000000 -0700
@@ -141,7 +141,7 @@
* CAP_SYS_RAW_IO set, send SIGTERM instead (but it's unlikely that
* we select a process with CAP_SYS_RAW_IO set).
*/
-void oom_kill_task(struct task_struct *p)
+static void __oom_kill_task(struct task_struct *p)
{
printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", p->pid, p->comm);

@@ -161,6 +161,26 @@
}
}

+static struct mm_struct *oom_kill_task(struct task_struct *p)
+{
+ struct mm_struct *mm;
+
+ task_lock(p);
+ mm = p->mm;
+ if (mm) {
+ spin_lock(&mmlist_lock);
+ if (atomic_read(&mm->mm_users))
+ atomic_inc(&mm->mm_users);
+ else
+ mm = NULL;
+ spin_unlock(&mmlist_lock);
+ }
+ task_unlock(p);
+ if (mm)
+ __oom_kill_task(p);
+ return mm;
+}
+
/**
* oom_kill - kill the "best" process when we run out of memory
*
@@ -172,21 +192,27 @@
static void oom_kill(void)
{
struct task_struct *p, *q;
+ struct mm_struct *mm;

+retry:
read_lock(&tasklist_lock);
p = select_bad_process();

/* Found nothing?!?! Either we hang forever, or we panic. */
if (p == NULL)
panic("Out of memory and no killable processes...\n");
-
+ mm = oom_kill_task(p);
+ if (!mm) {
+ read_unlock(&tasklist_lock);
+ goto retry;
+ }
/* kill all processes that share the ->mm (i.e. all threads) */
for_each_task(q) {
- if (q->mm == p->mm)
- oom_kill_task(q);
+ if (q->mm == mm)
+ __oom_kill_task(q);
}
read_unlock(&tasklist_lock);
-
+ mmput(mm);
/*
* Make kswapd go out of the way, so "p" has a good chance of
* killing itself before someone else gets the chance to ask

2004-06-24 15:19:57

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Thu, Jun 24, 2004 at 08:18:33AM -0700, William Lee Irwin III wrote:
> Hmm. 2.4 appears to still be lacking some of the fixes (unrelated to
> the nr_swap_pages check causing deadlocks) for functional issues.

Hmm, looks like I forgot the jiffies wrap fix when I sent oom_lock, too.

-- wli

out_of_memory() attempts to determine whether one jiffies-valued
variable refers to a point in time preceding another jiffies-valued
variable, but does not do so in a jiffies wrap -safe fashion. The
following patch corrects this by using the expansion of the 2.6
macro time_after() to check this condition.

Index: linux-2.4/mm/oom_kill.c
===================================================================
--- linux-2.4.orig/mm/oom_kill.c 2004-06-23 19:41:08.000000000 -0700
+++ linux-2.4/mm/oom_kill.c 2004-06-23 19:50:59.000000000 -0700
@@ -283,7 +283,7 @@
spin_lock(&oom_lock);

reset:
- if (first < now)
+ if ((long)first - (long)now < 0)
first = now;
count = 0;

2004-06-24 15:23:16

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Thu, Jun 24, 2004 at 11:16:37AM -0300, Marcelo Tosatti wrote:
> Removing the check on v2.4 based kernels will trigger the OOM killer
> too soon for a lot of cases, I'm pretty sure.

Other 2.4 oom_kill.c issues aside, since this is a relatively blatant
deadlock. What form would you like a fix to take, if you care to have
a fix for this at all?

-- wli

2004-06-24 17:28:14

by Marcelo Tosatti

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Thu, Jun 24, 2004 at 08:23:01AM -0700, William Lee Irwin III wrote:
> On Thu, Jun 24, 2004 at 11:16:37AM -0300, Marcelo Tosatti wrote:
> > Removing the check on v2.4 based kernels will trigger the OOM killer
> > too soon for a lot of cases, I'm pretty sure.
>
> Other 2.4 oom_kill.c issues aside, since this is a relatively blatant
> deadlock. What form would you like a fix to take, if you care to have
> a fix for this at all?

Lets hold this on 2.4.28-pre1, as discussed privately.

Thanks again.

2004-06-25 15:18:48

by Rik van Riel

[permalink] [raw]

Subject: Re: [oom]: [0/4] fix OOM deadlock running OAST

On Wed, 23 Jun 2004, William Lee Irwin III wrote:
> On Wed, Jun 23, 2004 at 03:37:58PM -0700, Andrew Morton wrote:
> > What about zone->all_unreclaimable?
>
> It's unclear which zones must be checked for this to be of use.

It's perfectly obvious, try_to_free_pages() gets a zone list
as one of its arguments (at least, it did last time I looked).

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan