2002-01-01 07:18:34

by Ed Tomlinson

Subject: [BUG] in 2.4.17 after 10 days uptime

Hi,

I started getting these bugs after about 10 days uptime. There is a patch set for reiserfs applied
along with a few minor patches (ide-tape, disk stats for up to hdg). The kernel is tainted by:

oscar# taint
filename: /lib/modules/2.4.17/kernel/net/khttpd/khttpd.o
filename: /lib/modules/2.4.17/kernel/net/netlink/netlink_dev.o
oscar# alias | grep taint
taint='modinfo `modprobe -l` | sed -ne "/^filename/h; /^license.*none/{g;p;}"'

Hardware has been stable and went over 30 days with 2.4.16 without problems. The system is
based on debian unstable (and it's living up to its name with kde apps vs libpng3 vs libqt2 problems).

Happy New Year!
Ed Tomlinson


Attachments:
bug.log (7.63 kB)

2002-01-01 19:56:37

by Benjamin LaHaise

Subject: Re: [BUG] in 2.4.17 after 10 days uptime

On Tue, Jan 01, 2002 at 02:18:01AM -0500, Ed Tomlinson wrote:
> I started getting these bugs after about 10 days uptime. There is a patch
> set for reiserfs applied along with a few minor patches (ide-tape, disk
> stats for up to hdg). The kernel is tainted by:

Expected BUG. Here's the fix. Marcelo, this is what we discussed previously:
parts of the kernel that grab a temporary reference to a page will frequently
not use page_cache_release as the page may never have been part of the page
cache. This shows up with the network stack in sendpage() as well as many
other paths. Please apply.

:r ~/patches/v2.4.17-pglru.diff
diff -urN v2.4.17/include/linux/pagemap.h v2.4.17-pglru/include/linux/pagemap.h
--- v2.4.17/include/linux/pagemap.h	Thu Dec 20 19:30:25 2001
+++ v2.4.17-pglru/include/linux/pagemap.h	Tue Jan  1 14:46:04 2002
@@ -29,7 +29,7 @@
 #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
 
 #define page_cache_get(x)	get_page(x)
-extern void FASTCALL(page_cache_release(struct page *));
+#define page_cache_release(x)	__free_page(x)
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
diff -urN v2.4.17/kernel/ksyms.c v2.4.17-pglru/kernel/ksyms.c
--- v2.4.17/kernel/ksyms.c	Tue Jan  1 14:09:35 2002
+++ v2.4.17-pglru/kernel/ksyms.c	Tue Jan  1 14:46:55 2002
@@ -95,7 +95,6 @@
 EXPORT_SYMBOL(alloc_pages_node);
 EXPORT_SYMBOL(__get_free_pages);
 EXPORT_SYMBOL(get_zeroed_page);
-EXPORT_SYMBOL(page_cache_release);
 EXPORT_SYMBOL(__free_pages);
 EXPORT_SYMBOL(free_pages);
 EXPORT_SYMBOL(num_physpages);
diff -urN v2.4.17/mm/page_alloc.c v2.4.17-pglru/mm/page_alloc.c
--- v2.4.17/mm/page_alloc.c	Mon Nov 26 23:43:08 2001
+++ v2.4.17-pglru/mm/page_alloc.c	Tue Jan  1 14:44:59 2002
@@ -70,6 +70,12 @@
 	struct page *base;
 	zone_t *zone;
 
+	/* Yes, think what happens when other parts of the kernel take
+	 * a reference to a page in order to pin it for io. -ben
+	 */
+	if (PageLRU(page))
+		lru_cache_del(page);
+
 	if (page->buffers)
 		BUG();
 	if (page->mapping)
@@ -426,15 +432,6 @@
 	return 0;
 }
 
-void page_cache_release(struct page *page)
-{
-	if (!PageReserved(page) && put_page_testzero(page)) {
-		if (PageLRU(page))
-			lru_cache_del(page);
-		__free_pages_ok(page, 0);
-	}
-}
-
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (!PageReserved(page) && put_page_testzero(page))

2002-01-07 19:42:29

by Marcelo Tosatti

Subject: Re: [BUG] in 2.4.17 after 10 days uptime



On Tue, 1 Jan 2002, Benjamin LaHaise wrote:

> On Tue, Jan 01, 2002 at 02:18:01AM -0500, Ed Tomlinson wrote:
> > I started getting these bugs after about 10 days uptime. There is a patch
> > set for reiserfs applied along with a few minor patches (ide-tape, disk
> > stats for up to hdg). The kernel is tainted by:
>
> Expected BUG. Here's the fix. Marcelo, this is what we discussed previously:
> parts of the kernel that grab a temporary reference to a page will frequently
> not use page_cache_release as the page may never have been part of the page
> cache. This shows up with the network stack in sendpage() as well as many
> other paths. Please apply.

Ben,

I suppose you're talking about the following case:

pagecache code has LRU page

nonpagecache code does page_cache_get()

pagecache code does page_cache_release()

nonpagecache code does __free_pages_ok()
on LRU page: BOOM.

Is my thinking correct ?

If so, I don't see why Ed's trace BUGs at rmqueue first: It should bug at
__free_pages_ok() PageLRU check.


2002-01-08 02:25:17

by Benjamin LaHaise

Subject: Re: [BUG] in 2.4.17 after 10 days uptime

On Mon, Jan 07, 2002 at 04:28:12PM -0200, Marcelo Tosatti wrote:
> Is my thinking correct ?

Yes, that's the case I was thinking of. sendfile() and tux are potential
triggers of this.

> If so, I don't see why Ed's trace BUGs at rmqueue first: It should bug at
> __free_pages_ok() PageLRU check.

Hmm, as we've discussed on irc, there are some other nasty implications of
the __free_pages code interacting with shrink_cache without this patch. I'm
not certain that explains it, but it could. Ed, have you seen this oops
again? What kind of load is the machine under?

-ben
--
Fish.

2002-01-08 03:39:36

by Ed Tomlinson

Subject: Re: [BUG] in 2.4.17 after 10 days uptime

On January 7, 2002 09:24 pm, Benjamin LaHaise wrote:
> On Mon, Jan 07, 2002 at 04:28:12PM -0200, Marcelo Tosatti wrote:
> > Is my thinking correct ?
>
> Yes, that's the case I was thinking of. sendfile() and tux are potential
> triggers of this.
>
> > If so, I don't see why Ed's trace BUGs at rmqueue first: It should bug at
> > __free_pages_ok() PageLRU check.
>
> Hmm, as we've discussed on irc, there are some other nasty implications of
> the __free_pages code interacting with shrink_cache without this patch.
> I'm not certain that explains it, but it could. Ed, have you seen this
> oops again? What kind of load is the machine under?

After applying your patch I ran for another couple of days on 18pre1 without
seeing any problems. The system is fairly lightly loaded: it runs a caching
news server and java apps, and acts as a masq gateway/squid cache for the rest
of the boxes here (home network). It is also the box I use...

Ed