Sorry, no solution, but maybe another oops in __free_pages_ok might help?
Hugh
--- 2.4.18-pre8/mm/page_alloc.c Tue Feb 5 12:55:36 2002
+++ linux/mm/page_alloc.c Wed Feb 6 18:31:07 2002
@@ -73,9 +73,11 @@
/* Yes, think what happens when other parts of the kernel take
* a reference to a page in order to pin it for io. -ben
*/
- if (PageLRU(page))
+ if (PageLRU(page)) {
+ if (in_interrupt())
+ BUG();
lru_cache_del(page);
-
+ }
if (page->buffers)
BUG();
if (page->mapping)
Hugh Dickins wrote:
>
> Sorry, no solution, but maybe another oops in __free_pages_ok might help?
What problem are you trying to solve?
>
> --- 2.4.18-pre8/mm/page_alloc.c Tue Feb 5 12:55:36 2002
> +++ linux/mm/page_alloc.c Wed Feb 6 18:31:07 2002
> @@ -73,9 +73,11 @@
> /* Yes, think what happens when other parts of the kernel take
> * a reference to a page in order to pin it for io. -ben
> */
> - if (PageLRU(page))
> + if (PageLRU(page)) {
> + if (in_interrupt())
> + BUG();
> lru_cache_del(page);
> -
> + }
Yes. lru_cache_del() in interrupt context will deadlock on SMP
or wreck the LRU list on UP. I can't see how we can ever perform
the final release of a PageLRU page in interrupt context but I agree
this test is a good one.
-
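(For context: 2.4's lru_cache_del takes pagemap_lru_lock with a plain
spin_lock, roughly as in the paraphrase below, written from memory of
mm/swap.c and not a verbatim copy. On SMP an interrupt-time caller can spin
forever against a holder on the same CPU; on UP the lock compiles away
entirely, so an interrupt-time list_del can corrupt the LRU under a
process-context walker.)

void lru_cache_del(struct page * page)
{
	spin_lock(&pagemap_lru_lock);	/* plain spin_lock, not _irqsave */
	__lru_cache_del(page);		/* list_del from active/inactive list */
	spin_unlock(&pagemap_lru_lock);
}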
On Wed, 6 Feb 2002, Andrew Morton wrote:
> Hugh Dickins wrote:
> >
> > Sorry, no solution, but maybe another oops in __free_pages_ok might help?
>
> What problem are you trying to solve?
Amidst all the prune_dcache and other kswapd oopses reported
(which I'd love to solve, but still can't work out), there have
been a couple in shrink_cache itself, where the page from the
inactive_list is not marked as on LRU, or is marked as Active;
and also I think a couple in rmqueue, where the free page is
found to be on LRU.
Some of those may have been memtest86ed out of contention since,
and some may have been on SMP and so not candidates; but it did
just occur to me that we'd like to be sure nothing is messing
with the LRU at interrupt time, hence the patch. Which of
course solves nothing, but might shed some light.
Sorry for being mysterious!
Hugh
Hugh Dickins wrote:
>
> On Wed, 6 Feb 2002, Andrew Morton wrote:
> > Hugh Dickins wrote:
> > >
> > > Sorry, no solution, but maybe another oops in __free_pages_ok might help?
> >
> > What problem are you trying to solve?
>
> Amidst all the prune_dcache and other kswapd oopses reported
> (which I'd love to solve, but still can't work out), there have
> been a couple in shrink_cache itself, where the page from the
> inactive_list is not marked as on LRU, or is marked as Active;
> and also I think a couple in rmqueue, where the free page is
> found to be on LRU.
You noticed too, hey :)
I've been collecting these reports for five or six weeks,
also getting things like .config, machine usage patterns,
machine history, etc.
It's like grabbing shadows, really. A significant number
of the reporters are using netfilter, and that's basically
the only thing I have to go on at this time. And a lot of
people use netfilter, so it's probably coincidental.
A number of the reports were confirmed to be against flakey
hardware. A few more were on cranky old P150's and such,
which I'm tending to dismiss. In fact the great majority
of reports are likely to be hardware failures.
But I don't recall seeing this volume of reports against
2.2.x.
And we have things like zeus.kernel.org's death yesterday.
Peter is quite certain that it was a software failure.
It certainly looks like random memory corruption. Quite
frequently the faulting address is just "data". Examples
from my growing vm-oopses folder include:
364d0a11
16a1842f
5f33f59b
410a0d26
d70f589b
6964656e
3562726b
0017e980
65726198
008209dc
Many more are null-pointer derefs.
> Some of those may have been memtest86ed out of contention since,
> and some may have been on SMP and so not candidates; but it did
> just occur to me that we'd like to be sure nothing is messing
> with the LRU at interrupt time, hence the patch. Which of
> course solves nothing, but might shed some light.
Sure. I can't think of any way of chasing this down (if it
exists) apart from putting special-purpose debug code into
the mainstream kernel.
Al suggests a `honey pot' kernel thread which ticks over,
allocating, validating and releasing memory, waiting for
it to get stomped on. If it gets corrupted we can dump
lots of memory and a task list, I guess.
We could also re-enable slab debugging.
Also we can add some magic numbers to inodes and dentries,
validate addresses and memory locations as we walk the lists,
mainly on the shrink_cache path. If corruption is detected
then we dump out lots of memory and look through it for
suspicious kernel addresses.
Any other ideas?
-
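(A minimal sketch of the `honey pot' thread Andrew mentions, using 2.4-era
APIs; honeypot_thread and HONEY_PATTERN are invented names, and the real
thing would presumably dump much more state on a mismatch.)

/* needs linux/sched.h, linux/mm.h, linux/string.h, linux/kernel.h */
#define HONEY_PATTERN	0x5a

static int honeypot_thread(void *unused)
{
	unsigned long addr = __get_free_page(GFP_KERNEL);
	int i;

	if (!addr)
		return -ENOMEM;
	memset((void *) addr, HONEY_PATTERN, PAGE_SIZE);

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(HZ);
		for (i = 0; i < PAGE_SIZE; i++) {
			if (((unsigned char *) addr)[i] != HONEY_PATTERN) {
				printk(KERN_ERR "honeypot: page %08lx stomped "
					"at offset %d: 0x%02x\n", addr, i,
					((unsigned char *) addr)[i]);
				/* dump memory, task list, BUG(), ... */
			}
		}
	}
	return 0;
}

/* started once at boot with something like:
 *	kernel_thread(honeypot_thread, NULL, CLONE_FS | CLONE_FILES);
 */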
On Wed, Feb 06, 2002 at 07:06:01PM +0000, Hugh Dickins wrote:
> Sorry, no solution, but maybe another oops in __free_pages_ok might help?
I haven't seen the original oops, but if this patch goes in, it
reintroduces the problem where network drivers / others release
pages from irq context causing a BUG(). See sendpage.
-ben
Benjamin LaHaise wrote:
>
> On Wed, Feb 06, 2002 at 07:06:01PM +0000, Hugh Dickins wrote:
> > Sorry, no solution, but maybe another oops in __free_pages_ok might help?
>
> I haven't seen the original oops, but if this patch goes in, it
> reintroduces the problem where network drivers / others release
> pages from irq context causing a BUG(). See sendpage.
>
Calling lru_cache_del from interrupt is a bug, so Hugh's patch
is valid.
Are you sure that the pages are being released from interrupt context?
-
From: Andrew Morton <[email protected]>
Date: Wed, 06 Feb 2002 21:47:28 -0800
   Are you sure that the pages are being released from interrupt context?
BH context... that's where SKB frees happen.
hmmm...
"David S. Miller" wrote:
>
> From: Andrew Morton <[email protected]>
> Date: Wed, 06 Feb 2002 21:47:28 -0800
>
> Are you sure that the pages are being released from interrupt context?
>
> BH context... that's where SKB frees happen.
>
> hmmm...
It's only a problem if this is the final put_page(). In the
case of sendfile(), process-context code can be taught to take
a temporary reference on the page, and only release it after the network
stack is known to have finished with the page. sendfile is synchronous, yes?
And in the case of all other skb frees, the underlying page
won't be on the LRU. I hope.
-
From: Andrew Morton <[email protected]>
Date: Wed, 06 Feb 2002 22:19:32 -0800
   It's only a problem if this is the final put_page(). In the
   case of sendfile(), process-context code can be taught to take
   a temporary reference on the page, and only release it after the network
   stack is known to have finished with the page. sendfile is synchronous, yes?
Userspace can return long before the SKBs are free'd up. That SKB
free doesn't happen until the ACKs come back from the receiver.
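(For anyone not following the zero-copy path: the page reference travels
with the skb fragments, so the matching put_page() runs from the BH-context
skb free once the data has been acked. Roughly, paraphrased from memory of
2.4 net/core/skbuff.c, not a verbatim copy:)

static void skb_release_data(struct sk_buff *skb)
{
	if (!skb->cloned ||
	    atomic_dec_and_test(&skb_shinfo(skb)->dataref)) {
		int i;

		/* the sendpage path did get_page() when it attached the
		 * page to a fragment; this is the matching release */
		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
			put_page(skb_shinfo(skb)->frags[i].page);

		kfree(skb->head);
	}
}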
"David S. Miller" wrote:
>
> From: Andrew Morton <[email protected]>
> Date: Wed, 06 Feb 2002 22:19:32 -0800
>
> It's only a problem if this is the final put_page(). In the
> case of sendfile(), process-context code can be taught to take
> a temporary reference on the page, and only release it after the network
> stack is known to have finished with the page. sendfile is synchronous, yes?
>
> Userspace can return long before the SKBs are free'd up. That SKB
> free doesn't happen until the ACKs come back from the receiver.
I feel that presence on the lru list should contribute to
page->count. It seems a bit weird and kludgy that this
is not so.
If we were to do this then would this not fix networking's
problem? The skb free wouldn't release the page - it would
be left on the LRU with ->count == 1 and kswapd would reap it.
(Says me, hoping that Hugh will code it :))
-
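(A sketch of how that could look; hypothetical, not code from any tree:
lru_cache_add takes a reference on behalf of the LRU, lru_cache_del drops
it, and a final put_page() from BH context then merely leaves the page on
the LRU with count == 1 for process context to reap.)

void lru_cache_add(struct page * page)
{
	if (!TestSetPageLRU(page)) {
		get_page(page);			/* the LRU's own reference */
		spin_lock(&pagemap_lru_lock);
		add_page_to_inactive_list(page);
		spin_unlock(&pagemap_lru_lock);
	}
}

void lru_cache_del(struct page * page)
{
	spin_lock(&pagemap_lru_lock);
	__lru_cache_del(page);
	spin_unlock(&pagemap_lru_lock);
	put_page(page);				/* drop the LRU's reference */
}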
On Wed, Feb 06, 2002 at 10:19:32PM -0800, Andrew Morton wrote:
> It's only a problem if this is the final put_page(). In the
> case of sendfile(), process-context code can be taught to take
> a temporary reference on the page, and only release it after the network
> stack is known to have finished with the page. sendfile is synchronous, yes?
>
> And in the case of all other skb frees, the underlying page
> won't be on the LRU. I hope.
sendfile isn't synchronous, nor is aio, which also relies on freeing pages
acquired from user mappings in irq/bh/whatever context. Restricting where
pages may be freed really ties our hands on what is possible for zero copy
io. (ie O_DIRECT, aio, sendfile, TUX all get hosed by this).
-ben (who is really on vacation)
On Wed, 6 Feb 2002, Andrew Morton wrote:
>
> I feel that presence on the lru list should contribute to
> page->count. It seems a bit weird and kludgy that this
> is not so.
You're right, and that's the way page->buffers is treated.
But I've been growing rather tired of !!page->buffers all
over the place, and was wondering in the reverse direction,
whether we need to include page->buffers in page->count.
I'm now inclined to argue that holds which are obvious from
the page structure itself need not be included in the count:
whatever makes the totality of the code simplest, i.e. minimizes
the number of special tests. There's no point in taking
!!page->buffers out of assorted places if even more places or
hot paths then need additionally to check for page->buffers.
> If we were to do this then would this not fix networking's
> problem? The skb free wouldn't release the page - it would
> be left on the LRU with ->count == 1 and kswapd would reap it.
Leaving kswapd to reap it is an excellent idea.
> (Says me, hoping that Hugh will code it :))
Thanks for the vote of confidence. But adding PageLRU into
page->count is not work I could confidently thrust upon Marcelo
at this stage of 2.4.18, too many tricky tests to update (and he
has more sense than to take it). However, I think we can do it
more simply and safely than that.
shrink_cache already allows for the case of !page_count(page).
I think we should revert to the way Linus and 2.4.17 had it,
page_cache_release doing the lru_cache_del for the common case,
and remove Ben's lru_cache_del from __free_pages_ok; but if
PageLRU is found there, it's not a BUG(), just a page that
can't be reclaimable until shrink_cache gets to delru it.
This I'll try coding up and testing today. Don't worry, despite
comments above, I won't be changing how page->buffers is counted!
Hugh
On Wed, 6 Feb 2002, Andrew Morton wrote:
> I feel that presence on the lru list should contribute to
> page->count. It seems a bit weird and kludgy that this
> is not so.
>
> If we were to do this then would this not fix networking's
> problem? The skb free wouldn't release the page - it would
> be left on the LRU with ->count == 1 and kswapd would reap it.
Actually, at this point we _know_ page->list.{prev,next} are
NULL.
We can use this to add the pages to a special list, from where
__alloc_pages() and kswapd can move them to the free list, in
process context.
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
From: Rik van Riel <[email protected]>
Date: Thu, 7 Feb 2002 10:34:20 -0200 (BRST)
   Actually, at this point we _know_ page->list.{prev,next} are
   NULL.
   We can use this to add the pages to a special list, from where
   __alloc_pages() and kswapd can move them to the free list, in
   process context.
I don't think there should be any special logic on how to free a page
outside of the page allocator itself. Certainly this kind of stuff
doesn't belong in the networking.
Pages can be freed from arbitrary contexts, and the page allocator
should be the part that knows how to deal with it.
Maybe I don't understand and you're really suggesting something else.
On Thu, 7 Feb 2002, David S. Miller wrote:
> From: Rik van Riel <[email protected]>
> Date: Thu, 7 Feb 2002 10:34:20 -0200 (BRST)
>
> Actually, at this point we _know_ page->list.{prev,next} are
> NULL.
>
> We can use this to add the pages to a special list, from where
> __alloc_pages() and kswapd can move them to the free list, in
> process context.
>
> I don't think there should be any special logic on how to free a page
> outside of the page allocator itself. Certainly this kind of stuff
> doesn't belong in the networking.
>
> Pages can be freed from arbitrary contexts, and the page allocator
> should be the part that knows how to deal with it.
>
> Maybe I don't understand and you're really suggesting something else.
The mechanism to do what I described above should of course be
in __free_pages_ok().
if (PageLRU(page)) {
if (in_interrupt()) {
add_page_to_special_list(page);
return;
} else
lru_cache_del(page);
}
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
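(For concreteness, one way the special list could look; the names
add_page_to_special_list and drain_special_list are invented, the lock has
to be irq-safe because the add happens in interrupt context, and the drain
would be called from __alloc_pages() or kswapd.)

static LIST_HEAD(deferred_free_list);
static spinlock_t deferred_free_lock = SPIN_LOCK_UNLOCKED;

static void add_page_to_special_list(struct page * page)
{
	unsigned long flags;

	spin_lock_irqsave(&deferred_free_lock, flags);
	list_add(&page->list, &deferred_free_list);	/* page->list is unused here */
	spin_unlock_irqrestore(&deferred_free_lock, flags);
}

static void drain_special_list(void)
{
	unsigned long flags;
	struct page * page;

	spin_lock_irqsave(&deferred_free_lock, flags);
	while (!list_empty(&deferred_free_list)) {
		page = list_entry(deferred_free_list.next, struct page, list);
		list_del(&page->list);
		spin_unlock_irqrestore(&deferred_free_lock, flags);

		lru_cache_del(page);		/* safe: process context */
		__free_pages_ok(page, 0);	/* finish the deferred free */

		spin_lock_irqsave(&deferred_free_lock, flags);
	}
	spin_unlock_irqrestore(&deferred_free_lock, flags);
}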
On Thu, 7 Feb 2002, Rik van Riel wrote:
>
> The mechanism to do what I described above should of course be
> in __free_pages_ok().
>
> if (PageLRU(page)) {
> if (in_interrupt()) {
> add_page_to_special_list(page);
> return;
> } else
> lru_cache_del(page);
> }
If this were a common case where many pages end up, yes, we'd
need a separate special list; but it's a very rare case, so I
think it's more appropriate to let shrink_cache do it when it
eventually reaches them on the inactive_list.
I was proposing we revert to distinguishing page_cache_release
from put_page, page_cache_release doing the lru_cache_del; and
I'd like to add my in_interrupt() BUG() there for now, just as
a sanity check. You are proposing that we keep the current,
post-Ben, structure of doing it in __free_pages_ok if possible.
I think I prefer mine, in_interrupt() as a sanity check which
could be removed when we feel safer, to yours where it's
deciding the behaviour of __free_pages_ok. Any strong feelings?
Hugh
On Thu, 7 Feb 2002, Hugh Dickins wrote:
> > if (PageLRU(page)) {
> > if (in_interrupt()) {
> > add_page_to_special_list(page);
> > return;
> > } else
> > lru_cache_del(page);
> > }
>
> If this were a common case where many pages end up, yes, we'd
> need a separate special list; but it's a very rare case
Think of a web or ftp server doing nothing but sendfile()
> I was proposing we revert to distinguishing page_cache_release
> from put_page, page_cache_release doing the lru_cache_del; and
> I'd like to add my in_interrupt() BUG() there for now, just as
> a sanity check. You are proposing that we keep the current,
> post-Ben, structure of doing it in __free_pages_ok if possible.
So how exactly would pages be freed ?
You still need to do the check of whether the page can
be freed somewhere.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On February 7, 2002 02:27 pm, Rik van Riel wrote:
> On Thu, 7 Feb 2002, Hugh Dickins wrote:
>
> > > if (PageLRU(page)) {
> > > if (in_interrupt()) {
> > > add_page_to_special_list(page);
> > > return;
> > > } else
> > > lru_cache_del(page);
> > > }
> >
> > If this were a common case where many pages end up, yes, we'd
> > need a separate special list; but it's a very rare case
>
> Think of a web or ftp server doing nothing but sendfile()
But still, you're
> > I was proposing we revert to distinguishing page_cache_release
> > from put_page, page_cache_release doing the lru_cache_del; and
> > I'd like to add my in_interrupt() BUG() there for now, just as
> > a sanity check. You are proposing that we keep the current,
> > post-Ben, structure of doing it in __free_pages_ok if possible.
>
> So how exactly would pages be freed ?
> You still need to do the check of whether the page can
> be freed somewhere.
He suggested letting shrink_caches find it. But since we already know the
page is free there's no sense scanning for it, so on balance I think your
approach is better. An atomic_put_page, used from any context that could end
up in an interrupt, would be better yet, just because it imposes the extra
check only on users that require it. Otoh, I did note davem's objection
above.
--
Daniel
On Thu, 7 Feb 2002, Rik van Riel wrote:
> On Thu, 7 Feb 2002, Hugh Dickins wrote:
> >
> > If this were a common case where many pages end up, yes, we'd
> > need a separate special list; but it's a very rare case
>
> Think of a web or ftp server doing nothing but sendfile()
Aren't the sendfile() pages in the page cache, and normally taken
off LRU at the same time as removed from page cache, in shrink_cache?
The exception being when the file is truncated while it is being sent,
and buffers busy, so left behind on LRU by truncate_complete_page?
That's not a common case, nor were __free_pages_ok PageLRU BUG
reports on 2.4.14 to 2.4.17 common: most definitely a case that
needs to be handled correctly, but not common.
But I know very little of sendfile(), please correct me.
> > I was proposing we revert to distinguishing page_cache_release
> > from put_page, page_cache_release doing the lru_cache_del; and
> > I'd like to add my in_interrupt() BUG() there for now, just as
> > a sanity check. You are proposing that we keep the current,
> > post-Ben, structure of doing it in __free_pages_ok if possible.
>
> So how exactly would pages be freed ?
>
> You still need to do the check of whether the page can
> be freed somewhere.
I imagined (not yet tried) in shrink_cache, something like:
/*
* In exceptional cases, a page may still be on an
* LRU when it is "freed"; and it would not be possible
* to remove it from LRU at interrupt time: clean up here.
*/
if (unlikely(!page_count(page))) {
page_cache_get(page);
__lru_cache_del(page);
page_cache_release(page);
continue;
}
(Using page_cache_get+page_cache_release instead of get_page+put_page
merely because page_cache_whatever is the local dialect in this module.
But over in unmap_kiobuf I ought to revert from page_cache_release to
put_page, in case it can be called at interrupt time, as Ben implied.)
Hugh
On Thu, 7 Feb 2002, Hugh Dickins wrote:
> On Thu, 7 Feb 2002, Rik van Riel wrote:
> > On Thu, 7 Feb 2002, Hugh Dickins wrote:
> > >
> > > If this were a common case where many pages end up, yes, we'd
> > > need a separate special list; but it's a very rare case
> >
> > Think of a web or ftp server doing nothing but sendfile()
>
> Aren't the sendfile() pages in the page cache, and normally taken
> off LRU at the same time as removed from page cache, in shrink_cache?
You're right.
> I imagined (not yet tried) in shrink_cache, something like:
> if (unlikely(!page_count(page))) {
> page_cache_get(page);
> __lru_cache_del(page);
> page_cache_release(page);
> continue;
> }
I guess this should work.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
Please help, someone!
Below is the plausible-looking but fatally flawed patch I
had in mind, to help us focus. Basically, revert to 2.4.17,
but don't treat PageLRU as a BUG in __free_pages_ok, instead
let shrink_cache free it. Notice that page_cache_release is
not quite as in 2.4.17: it has to handle a race, and this
at least puts it in the same situation as a final put_page
when PageLRU is set.
The problem is, if SMP, imagine final put_page on one CPU
racing with shrink_cache on another: once put_page_testzero
on one CPU succeeds, shrink_cache on other may __lru_cache_del,
and both CPUs go through __free_pages_ok with PageLRU clear,
page doubly freed. Which is even worse than the current
hang-on-SMP or corrupt-LRU-on-UP.
I think Rik's separate-list method would suffer just the same.
I was naive to imagine that working with count 0 would be easy.
I'm pretty sure Andrew's add-LRU-into-page-count would deal
with this just fine, but I'm still reluctant to implement that
at this stage. Another idea I'd be reluctant to pursue right
now, but might be good in the longer term, is just to remove
the PageLRU and PageActive BUGs from __free_pages_ok, rmqueue
and balance_classzone, let a few rare pages remain on the LRU
while free and reused; but shrink_cache's TryLockPage would not play
well with __add_to_page_cache's page->flags = flags | (1 << PG_locked).
I think we don't want _irqsaves on pagemap_lru_lock.
Maybe there's a simple answer I'm not seeing,
maybe there's an awkward combination of locking which will do it,
maybe it's mathematically provable that Andrew's is the only way.
Any suggestions?
Hugh
diff -urN 2.4.18-pre8/include/linux/pagemap.h linux/include/linux/pagemap.h
--- 2.4.18-pre8/include/linux/pagemap.h Thu Feb 7 14:38:13 2002
+++ linux/include/linux/pagemap.h Thu Feb 7 15:31:05 2002
@@ -29,7 +29,7 @@
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
#define page_cache_get(x) get_page(x)
-#define page_cache_release(x) __free_page(x)
+extern void FASTCALL(page_cache_release(struct page *));
static inline struct page *page_cache_alloc(struct address_space *x)
{
diff -urN 2.4.18-pre8/kernel/ksyms.c linux/kernel/ksyms.c
--- 2.4.18-pre8/kernel/ksyms.c Thu Feb 7 14:38:13 2002
+++ linux/kernel/ksyms.c Thu Feb 7 14:52:50 2002
@@ -95,6 +95,7 @@
EXPORT_SYMBOL(alloc_pages_node);
EXPORT_SYMBOL(__get_free_pages);
EXPORT_SYMBOL(get_zeroed_page);
+EXPORT_SYMBOL(page_cache_release);
EXPORT_SYMBOL(__free_pages);
EXPORT_SYMBOL(free_pages);
EXPORT_SYMBOL(num_physpages);
diff -urN 2.4.18-pre8/mm/memory.c linux/mm/memory.c
--- 2.4.18-pre8/mm/memory.c Mon Jan 7 13:03:04 2002
+++ linux/mm/memory.c Thu Feb 7 15:05:57 2002
@@ -487,7 +487,7 @@
* depending on the type of the found page
*/
if (pages[i])
- page_cache_get(pages[i]);
+ get_page(pages[i]);
}
if (vmas)
vmas[i] = vma;
@@ -601,7 +601,7 @@
/* FIXME: cache flush missing for rw==READ
* FIXME: call the correct reference counting function
*/
- page_cache_release(map);
+ put_page(map);
}
}
diff -urN 2.4.18-pre8/mm/page_alloc.c linux/mm/page_alloc.c
--- 2.4.18-pre8/mm/page_alloc.c Thu Feb 7 14:38:13 2002
+++ linux/mm/page_alloc.c Thu Feb 7 15:59:17 2002
@@ -70,12 +70,6 @@
struct page *base;
zone_t *zone;
- /* Yes, think what happens when other parts of the kernel take
- * a reference to a page in order to pin it for io. -ben
- */
- if (PageLRU(page))
- lru_cache_del(page);
-
if (page->buffers)
BUG();
if (page->mapping)
@@ -86,8 +80,15 @@
BUG();
if (PageLocked(page))
BUG();
+ /*
+ * Other parts of the kernel may get_page in order to pin it
+ * for I/O, then put_page after the last page_cache_release:
+ * perhaps at interrupt time when it would be unsafe to deLRU.
+ * So leave page on the LRU, let shrink_cache free it later on.
+ */
if (PageLRU(page))
- BUG();
+ return;
+
if (PageActive(page))
BUG();
page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));
@@ -430,6 +431,24 @@
return (unsigned long) address;
}
return 0;
+}
+
+void page_cache_release(struct page *page)
+{
+ /*
+ * Yes, there is a race here, two threads might release the
+ * same page at the same time, each see page_count as 2, and
+ * neither delete it from LRU: okay, shrink_cache will clean
+ * that up later. What we must avoid is calling __free_pages_ok
+ * on page of page_count 0 while shrink_cache is doing the same.
+ */
+ if (PageLRU(page) && page_count(page) == 1) {
+ if (in_interrupt())
+ BUG();
+ lru_cache_del(page);
+ }
+ if (!PageReserved(page) && put_page_testzero(page))
+ __free_pages_ok(page, 0);
}
void __free_pages(struct page *page, unsigned int order)
diff -urN 2.4.18-pre8/mm/vmscan.c linux/mm/vmscan.c
--- 2.4.18-pre8/mm/vmscan.c Thu Feb 7 14:38:13 2002
+++ linux/mm/vmscan.c Thu Feb 7 15:43:32 2002
@@ -363,11 +363,16 @@
list_add(entry, &inactive_list);
/*
- * Zero page counts can happen because we unlink the pages
- * _after_ decrementing the usage count..
+ * In exceptional cases, a page may still be on an LRU
+ * when it arrives at __free_pages_ok, and it cannot be
+ * removed from LRU at interrupt time: so clean up here.
*/
- if (unlikely(!page_count(page)))
+ if (unlikely(!page_count(page))) {
+ page_cache_get(page);
+ __lru_cache_del(page);
+ page_cache_release(page);
continue;
+ }
if (!memclass(page->zone, classzone))
continue;
// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 4
// SUBLEVEL = 18
// EXTRAVERSION = -pre8
--- 2.4/drivers/block/ll_rw_blk.c Tue Feb 5 18:48:20 2002
+++ build-2.4/drivers/block/ll_rw_blk.c Thu Feb 7 19:37:35 2002
@@ -348,6 +348,8 @@
}
memset(rq, 0, sizeof(struct request));
rq->rq_status = RQ_INACTIVE;
+ poison_obj(&rq->elevator_sequence, sizeof(struct request)
+ -offsetof(struct request,elevator_sequence));
list_add(&rq->queue, &q->rq[i&1].free);
q->rq[i&1].count++;
}
@@ -428,8 +430,12 @@
if (!list_empty(&rl->free)) {
rq = blkdev_free_rq(&rl->free);
list_del(&rq->queue);
+ /* FIXME: clear or not clear? */
+ check_poison(&rq->elevator_sequence, sizeof(struct request)
+ - offsetof(struct request,elevator_sequence));
rl->count--;
rq->rq_status = RQ_ACTIVE;
+ rq->cmd = rw;
rq->special = NULL;
rq->q = q;
}
@@ -560,6 +566,8 @@
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
+ poison_obj(&req->elevator_sequence, sizeof(struct request)
+ -offsetof(struct request,elevator_sequence));
if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
wake_up(&q->wait_for_request);
}
--- 2.4/fs/file_table.c Sun Sep 30 16:25:45 2001
+++ build-2.4/fs/file_table.c Thu Feb 7 19:37:35 2002
@@ -39,6 +39,8 @@
used_one:
f = list_entry(free_list.next, struct file, f_list);
list_del(&f->f_list);
+ check_poison (&f->f_dentry, sizeof(struct file) -
+ offsetof(struct file, f_dentry));
files_stat.nr_free_files--;
new_one:
memset(f, 0, sizeof(*f));
@@ -118,6 +120,8 @@
file->f_dentry = NULL;
file->f_vfsmnt = NULL;
list_del(&file->f_list);
+ poison_obj(&file->f_dentry, sizeof(struct file) -
+ offsetof(struct file, f_dentry));
list_add(&file->f_list, &free_list);
files_stat.nr_free_files++;
file_list_unlock();
@@ -147,6 +151,8 @@
file_list_lock();
list_del(&file->f_list);
list_add(&file->f_list, &free_list);
+ poison_obj(&file->f_dentry, sizeof(struct file) -
+ offsetof(struct file, f_dentry));
files_stat.nr_free_files++;
file_list_unlock();
}
--- 2.4/include/linux/slab.h Tue Dec 25 17:12:07 2001
+++ build-2.4/include/linux/slab.h Thu Feb 7 20:45:12 2002
@@ -78,6 +78,39 @@
extern kmem_cache_t *fs_cachep;
extern kmem_cache_t *sigact_cachep;
+
+#ifdef CONFIG_DEBUG_SLAB
+extern void __check_poison(void *obj, int size, char *file, int line);
+
+#define check_and_clear_poison(obj, size) \
+ do { \
+ __check_poison(obj, size, __FILE__, __LINE__); \
+ memset(obj, 0, size); \
+ } while(0)
+
+#define check_poison(obj, size) \
+ do { \
+ __check_poison(obj, size, __FILE__, __LINE__); \
+ memset(obj, 0x2C, size); \
+ } while(0)
+
+#define poison_obj(obj, size) \
+ memset(obj, 0xe3, size);
+
+#else
+static inline void check_and_clear_poison(void *obj, int size)
+{
+ memset(obj, 0, size);
+}
+static inline void check_poison(void *obj, int size)
+{
+ /* nop */
+}
+static inline void poison_obj(void *obj, int size)
+{
+ /* nop */
+}
+#endif
#endif /* __KERNEL__ */
#endif /* _LINUX_SLAB_H */
--- 2.4/mm/slab.c Tue Dec 25 17:12:07 2001
+++ build-2.4/mm/slab.c Thu Feb 7 19:37:35 2002
@@ -1196,8 +1196,11 @@
if (objnr >= cachep->num)
BUG();
- if (objp != slabp->s_mem + objnr*cachep->objsize)
+ if (objp != slabp->s_mem + objnr*cachep->objsize) {
+ printk("cache %s: got objp %p, objnr %d, s_mem %ph.\n",
+ cachep->name, objp, objnr, slabp->s_mem);
BUG();
+ }
/* Check slab's freelist to see if this obj is there. */
for (i = slabp->free; i != BUFCTL_END; i = slab_bufctl(slabp)[i]) {
@@ -1210,6 +1213,9 @@
static inline void kmem_cache_alloc_head(kmem_cache_t *cachep, int flags)
{
+#if DEBUG
+ if (in_interrupt() && (flags & SLAB_LEVEL_MASK) != SLAB_ATOMIC)
+ BUG();
if (flags & SLAB_DMA) {
if (!(cachep->gfpflags & GFP_DMA))
BUG();
@@ -1217,6 +1223,7 @@
if (cachep->gfpflags & GFP_DMA)
BUG();
}
+#endif
}
static inline void * kmem_cache_alloc_one_tail (kmem_cache_t *cachep,
@@ -1347,6 +1354,15 @@
objp = kmem_cache_alloc_one(cachep);
#endif
local_irq_restore(save_flags);
+#if DEBUG
+ if (cachep->flags & SLAB_RED_ZONE) {
+ kmem_extra_free_checks(cachep, GET_PAGE_SLAB(virt_to_page(objp)),
+ objp-BYTES_PER_WORD);
+ } else {
+ kmem_extra_free_checks(cachep, GET_PAGE_SLAB(virt_to_page(objp)),
+ objp);
+ }
+#endif
return objp;
alloc_new_slab:
#ifdef CONFIG_SMP
@@ -1475,6 +1491,15 @@
#ifdef CONFIG_SMP
cpucache_t *cc = cc_data(cachep);
+#if DEBUG
+ if (cachep->flags & SLAB_RED_ZONE) {
+ kmem_extra_free_checks(cachep, GET_PAGE_SLAB(virt_to_page(objp)),
+ objp-BYTES_PER_WORD);
+ } else {
+ kmem_extra_free_checks(cachep, GET_PAGE_SLAB(virt_to_page(objp)),
+ objp);
+ }
+#endif
CHECK_PAGE(virt_to_page(objp));
if (cc) {
int batchcount;
@@ -2039,3 +2064,17 @@
#endif
}
#endif
+
+#ifdef CONFIG_DEBUG_SLAB
+void __check_poison(void *obj, int size, char *file, int line)
+{
+ int i;
+ for (i=0;i<size;i++) {
+ if (((unsigned char*)obj)[i] != 0xe3) {
+ printk(KERN_INFO "poison error in obj %p, len %d, file %s, line %d, offset %d is: 0x%x\n",
+ obj, size, file, line, i, ((unsigned char*)obj)[i]);
+ }
+ }
+}
+#endif
+
--- 2.4/mm/page_alloc.c Tue Feb 5 18:48:23 2002
+++ build-2.4/mm/page_alloc.c Thu Feb 7 19:50:22 2002
@@ -76,6 +76,8 @@
if (PageLRU(page))
lru_cache_del(page);
+ if (!PageHighMem(page))
+ poison_obj(page_address(page), (1<<order)*PAGE_SIZE);
if (page->buffers)
BUG();
if (page->mapping)
@@ -95,7 +97,16 @@
if (current->flags & PF_FREE_PAGES)
goto local_freelist;
back_local_freelist:
-
+#ifdef CONFIG_DEBUG_SLAB
+ page->mapping = (void*)0xdeadbeef;
+ page->index = 0xbaadc0de;
+ page->next_hash = (void*)0xbeefdead;
+ atomic_set(&page->count,1);
+ page->lru.next = (void*)0xbaadf00d;
+ page->lru.prev = (void*)0xf00dbaad;
+ page->pprev_hash = (void*)0xdeadbeef;
+ page->buffers = (void*)0xdeadbeef;
+#endif
zone = page->zone;
mask = (~0UL) << order;
@@ -206,6 +217,26 @@
page = expand(zone, page, index, order, curr_order, area);
spin_unlock_irqrestore(&zone->lock, flags);
+#ifdef CONFIG_DEBUG_SLAB
+ if (page->mapping != (void*)0xdeadbeef)
+ printk(KERN_ERR"got mapping %p.\n", page->mapping);
+ if (page->index != 0xbaadc0de)
+ printk(KERN_ERR "got index %lxh.\n", page->index);
+ if (page->next_hash != (void*)0xbeefdead)
+ printk(KERN_ERR "got next_hash %p.\n", page->next_hash);
+ if (atomic_read(&page->count) != 1)
+ printk(KERN_ERR "bad page count %d.\n", atomic_read(&page->count));
+ if (page->lru.next != (void*)0xbaadf00d)
+ printk(KERN_ERR" bad lru_next %p.\n", page->lru.next);
+ if (page->lru.prev != (void*)0xf00dbaad)
+ printk(KERN_ERR" bad lru_prev %p.\n", page->lru.prev);
+ if (page->pprev_hash != (void*)0xdeadbeef)
+ printk(KERN_ERR" bad pprev_hash %p.\n", page->pprev_hash);
+ if (page->buffers != (void*)0xdeadbeef)
+ printk(KERN_ERR" bad page->buffers %p.\n", page->buffers);
+ page->mapping = NULL;
+ page->buffers = NULL;
+#endif
set_page_count(page, 1);
if (BAD_RANGE(zone,page))
BUG();
@@ -226,8 +257,12 @@
#ifndef CONFIG_DISCONTIGMEM
struct page *_alloc_pages(unsigned int gfp_mask, unsigned int order)
{
- return __alloc_pages(gfp_mask, order,
+ struct page * pg;
+ pg = __alloc_pages(gfp_mask, order,
contig_page_data.node_zonelists+(gfp_mask & GFP_ZONEMASK));
+ if (pg && !PageHighMem(pg))
+ check_poison(page_address(pg), (1<<order)*PAGE_SIZE);
+ return pg;
}
#endif
--- 2.4/fs/buffer.c Tue Feb 5 18:48:22 2002
+++ build-2.4/fs/buffer.c Thu Feb 7 19:37:35 2002
@@ -1207,6 +1207,7 @@
bh->b_dev = B_FREE;
bh->b_blocknr = -1;
bh->b_this_page = NULL;
+ bh->b_size = 0xbaad;
nr_unused_buffer_heads++;
bh->b_next_free = unused_list;
@@ -1237,6 +1238,8 @@
unused_list = bh->b_next_free;
nr_unused_buffer_heads--;
spin_unlock(&unused_list_lock);
+ if (bh->b_size != 0xbaad)
+ printk(KERN_ERR "wrong size %lxh.\n", bh->b_size);
return bh;
}
spin_unlock(&unused_list_lock);
@@ -1261,6 +1264,8 @@
unused_list = bh->b_next_free;
nr_unused_buffer_heads--;
spin_unlock(&unused_list_lock);
+ if (bh->b_size != 0xbaad)
+ printk(KERN_ERR "wrong size %lxh.\n", bh->b_size);
return bh;
}
spin_unlock(&unused_list_lock);
On Thu, Feb 07, 2002 at 02:28:44PM +0000, Hugh Dickins wrote:
> On Thu, 7 Feb 2002, Rik van Riel wrote:
> > On Thu, 7 Feb 2002, Hugh Dickins wrote:
> > >
> > > If this were a common case where many pages end up, yes, we'd
> > > need a separate special list; but it's a very rare case
> >
> > Think of a web or ftp server doing nothing but sendfile()
>
> Aren't the sendfile() pages in the page cache, and normally taken
> off LRU at the same time as removed from page cache, in shrink_cache?
> The exception being when the file is truncated while it is being sent,
> and buffers busy, so left behind on LRU by truncate_complete_page?
the buffers will hold a reference on the page. So the pagecache is
either in the LRU with refcount > 1, or the refcount is 1 and the page
is not in the lru.
In short Ben's patch was useless but it was faster and cleaner than what
we had before with the special page_cache_release, and so it was good.
Said it in another manner: we'll never effectively free a page that is
in the LRU, unless it's an anonymous page (no brainer for
sendpage/sendfile).
Hugh's patch is definitely valid and it's a nice bugcheck to have, it
should be merged IMHO (it's in a slow path), but there's no bug to fix I
think, the bugcheck is paranoid-in-slow-path kind of thing.
Andrea
Andrea Arcangeli wrote:
>
> On Thu, Feb 07, 2002 at 02:28:44PM +0000, Hugh Dickins wrote:
> > On Thu, 7 Feb 2002, Rik van Riel wrote:
> > > On Thu, 7 Feb 2002, Hugh Dickins wrote:
> > > >
> > > > If this were a common case where many pages end up, yes, we'd
> > > > need a separate special list; but it's a very rare case
> > >
> > > Think of a web or ftp server doing nothing but sendfile()
> >
> > Aren't the sendfile() pages in the page cache, and normally taken
> > off LRU at the same time as removed from page cache, in shrink_cache?
> > The exception being when the file is truncated while it is being sent,
> > and buffers busy, so left behind on LRU by truncate_complete_page?
>
> the buffers will hold a reference on the page. So the pagecache is
> either in the LRU with refcount > 1, or the refcount is 1 and the page
> is not in the lru.
>
> In short Ben's patch was useless but it was faster and cleaner than what
> we had before with the special page_cache_release, and so it was good.
>
> Said it in another manner: we'll never effectively free a page that is
> in the LRU, unless it's an anonymous page (no brainer for
> sendpage/sendfile).
Good to hear. But what about the weird corner-case in truncate_complete_page(),
where a mapped page is not successfully released, and is converted into
an anon buffercache page? It seems that a combination of sendfile
and truncate could result in one of those pages being subject to
final release in BH context?
1: try_to_release_page() fails. It becomes a buffercache page.
2: vm runs try_to_release_page() again. This time it succeeds.
3: sendfile completes.
> Hugh's patch is definitely valid and it's a nice bugcheck to have, it
> should be merged IMHO (it's in a slow path), but there's no bug to fix I
> think, the bugcheck is paranoid-in-slow-path kind of thing.
It's looking more and more like we need that test.
-
On Thu, Feb 07, 2002 at 01:09:25PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Thu, Feb 07, 2002 at 02:28:44PM +0000, Hugh Dickins wrote:
> > > On Thu, 7 Feb 2002, Rik van Riel wrote:
> > > > On Thu, 7 Feb 2002, Hugh Dickins wrote:
> > > > >
> > > > > If this were a common case where many pages end up, yes, we'd
> > > > > need a separate special list; but it's a very rare case
> > > >
> > > > Think of a web or ftp server doing nothing but sendfile()
> > >
> > > Aren't the sendfile() pages in the page cache, and normally taken
> > > off LRU at the same time as removed from page cache, in shrink_cache?
> > > The exception being when the file is truncated while it is being sent,
> > > and buffers busy, so left behind on LRU by truncate_complete_page?
> >
> > the buffers will hold a reference on the page. So the pagecache is
> > either in the LRU with refcount > 1, or the refcount is 1 and the page
> > is not in the lru.
> >
> > In short Ben's patch was useless but it was faster and cleaner than what
> > we had before with the special page_cache_release, and so it was good.
> >
> > Said it in another manner: we'll never effectively free a page that is
> > in the LRU, unless it's an anonymous page (no brainer for
> > sendpage/sendfile).
>
> Good to hear. But what about the weird corner-case in truncate_complete_page(),
> where a mapped page is not successfully released, and is converted into
> an anon buffercache page? It seems that a combination of sendfile
> and truncate could result in one of those pages being subject to
> final release in BH context?
Such a page is not in the lru so it doesn't matter.
As said in the previous email, from another point of view, the only
thing that can be still in the lru during __free_pages_ok is an
anonymous page. truncate_complete_page cannot run on an anonymous page.
Anonymous pages cannot be truncated.
>
> 1: try_to_release_page() fails. It becomes a buffercache page.
> 2: vm runs try_to_release_page() again. This time it succeeds.
> 3: sendfile completes.
>
>
> > Hugh's patch is definitely valid and it's a nice bugcheck to have, it
> > should be merged IMHO (it's in a slow path), but there's no bug to fix I
> > think, the bugcheck is paranoid-in-slow-path kind of thing.
>
> It's looking more and more like we need that test.
"needed" is the wrong word, but we do want it for long-term paranoid
safety (actually it is more interesting for 2.5, in fact).
Andrea
Andrea Arcangeli wrote:
>
> > Good to hear. But what about the weird corner-case in truncate_complete_page(),
> > where a mapped page is not successfully released, and is converted into
> > an anon buffercache page? It seems that a combination of sendfile
> > and truncate could result in one of those pages being subject to
> > final release in BH context?
>
> Such a page is not in the lru so it doesn't matter.
static void truncate_complete_page(struct page *page)
{
/* Leave it on the LRU if it gets converted into anonymous buffers */
if (!page->buffers || do_flushpage(page, 0))
lru_cache_del(page);
If the page has buffers, and do_flushpage() fails, what happens?
> As said in the previous email, from another point of view, the only
> thing that can be still in the lru during __free_pages_ok is an
> anonymous page. truncate_complete_page cannot run on an anonymous page.
> Anonymous pages cannot be truncated.
truncate_complete_page() can, in rare circumstances, take a page
which was in both the pagecache and on LRU, and leave it purely
on LRU. And because that page *used* to be in pagecache, it
could be undergoing sendfile.
Or I'm missing something. Did something change?
-
On Thu, Feb 07, 2002 at 02:31:33PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > > Good to hear. But what about the weird corner-case in truncate_complete_page(),
> > > where a mapped page is not successfully released, and is converted into
> > > an anon buffercache page? It seems that a combination of sendfile
> > > and truncate could result in one of those pages being subject to
> > > final release in BH context?
> >
> > Such a page is not in the lru so it doesn't matter.
>
> static void truncate_complete_page(struct page *page)
> {
> /* Leave it on the LRU if it gets converted into anonymous buffers */
> if (!page->buffers || do_flushpage(page, 0))
> lru_cache_del(page);
>
> If the page has buffers, and do_flushpage() fails, what happens?
>
> > As said in the previous email, from another point of view, the only
> > thing that can be still in the lru during __free_pages_ok is an
> > anonymous page. truncate_complete_page cannot run on an anonymous page.
> > Anonymous pages cannot be truncated.
>
> truncate_complete_page() can, in rare circumstances, take a page
> which was in both the pagecache and on LRU, and leave it purely
> on LRU. And because that page *used* to be in pagecache, it
> could be undergoing sendfile.
>
> Or I'm missing something. Did something change?
as said in the previous email, that becomes buffercache mapped in
userspace, with refcount > 1 (unfreeable from the vm side), so the
__free_page from irq will do nothing (it will only decrease the refcount
by one, and then the page will be released by the vm). If the
refcount was 1 instead, then it means the page wasn't in the lru in the
first place, and so the check won't trigger either.
Only truly anonymous pages (not ex-pagecache pages that later became
buffercache) can trigger such a PageLRU check in __free_pages_ok. In fact,
if that weren't the case, all the kernels before 2.4.1x would have been
broken.
Andrea
Andrea Arcangeli wrote:
>
> > Or I'm missing something. Did something change?
>
> as said in the previous email that becomes a buffercache mapped in
> userspace, with refcount > 1 (unfreeable from the vm side), so the
> __free_page from irq will do nothing (it will only decrease the refcount
> of 1 unit, and then the page will be released by the vm). If the
> refcount was 1 instead, then it means the page wasn't in the lru in the
> first place and so the check won't trigger either.
Ah.
If the second try_to_release_page() against the truncated page
(in shrink_cache()) succeeds, the page is removed from the LRU.
OK, I agree the weird case won't trigger the bug. So I think we
agree that we need to run with Hugh's BUG check and do nothing else.
-
On Thu, 7 Feb 2002, Andrew Morton wrote:
>
> OK, I agree the weird case won't trigger the bug. So I think we
> agree that we need to run with Hugh's BUG check and do nothing else.
Thank you, Andrew and Andrea, for exploring and exploding those myths.
I've checked back on the BUG which Ben submitted his 1st Jan patch for,
and it was actually a PageLRU(page) in rmqueue, not in __free_pages_ok,
so his patch would not have solved it.
But I cannot yet agree that Marcelo should take my interrupt BUG patch.
I've also checked back on the BUG which I submitted my 15th Nov patch
for (making unmap_kiobuf use page_cache_release instead of put_page),
and Gerd Knorr's report (extracts below) implies that his bttv driver
was calling unmap_kiobuf at interrupt time. Is that right, Gerd?
If that's so, then my proposed in_interrupt check before lru_cache_del
will just give him a BUG again (and my 15th Nov patch was mistaken to
encourage him to unmap at interrupt time). Now maybe Gerd's code is
wrong anyway: a quick look suggests it may also vfree there, which
would be wrong at interrupt time. But whether his code is right or
wrong, unmap_kiobuf used to be safe at interrupt time and now is not
(in some rare circumstances): are we right to have made that change?
Ben, you probably have an AIO opinion here. Is there a circumstance
in which AIO can unpin a user page at interrupt time, after the
calling task has (exited or) unmapped the page?
Hugh
Subject: [PATCH] Re: kiobuf / vm bug
On Thu, 15 Nov 2001, Gerd Knorr wrote:
>
> I think I have found a kiobuf-related bug in the VM of recent linux
> kernels. 2.4.13 is fine, 2.4.14-pre1 doesn't boot my machine,
> 2.4.14-pre2 + newer kernels are broken.
>
> /me runs a kernel with a few v4l-related patches and my current 0.8.x
> bttv version (available from http://bytesex.org/patches/ +
> http://bytesex.org/bttv/).
>
> With this kernel I can trigger the following BUG():
> ksymoops 2.4.3 on i686 2.4.15-pre4. Options used
> kernel BUG at page_alloc.c:84!
> >>EIP; c0129e5a <__free_pages_ok+aa/29c> <=====
> Trace; c012a6f2 <__free_pages+1a/1c>
> Trace; c0121120 <unmap_kiobuf+34/48>
>
> The Oops seems to be triggered by the following actions:
>
> (1) the application maps /dev/video0. bttv 0.8.x simply returns some
> shared anonymous memory as the mapping.
> (2) the application asks the driver to capture a frame. bttv will lock
> down the anonymous memory using kiobufs for I/O and prepare
> everything for DMA xfer.
> (3) The applications exits for some reason, i.e. the anonymous memory
> will be unmapped while the DMA transfer is active and the pages are
> locked down for I/O.
> (4) The DMA xfer is done and bttv's irq handler cleans up everything.
> This includes calling unlock_kiovec+unmap_kiobuf for the locked
> pages. The unmap_kiobuf call at this point triggers the Oops
> listed above ...
Is it possible to modify your patch from:
if (in_interrupt())
BUG();
to
if (unlikely(in_interrupt()))
BUG();
-- Amol
On Sat, 9 Feb 2002 [email protected] wrote:
>
> Is it possible to modify your patch from:
>
> if (in_interrupt())
> BUG();
>
> to
>
> if (unlikely(in_interrupt()))
> BUG();
Unlikely!
But seriously, that function is so full of checks for the improbable,
that it would seem a bit odd for me to add one just for this instance:
unless you've noticed that spectacularly bad code is generated here?
I think I'd prefer to blend in with the surroundings for now, and
leave it for, say, the ACME Janitorial Services to come along and
put BUG_ON()s throughout.
Hugh
On Fri, Feb 08, 2002 at 05:46:56PM +0000, Hugh Dickins wrote:
> Ben, you probably have an AIO opinion here. Is there a circumstance
> in which AIO can unpin a user page at interrupt time, after the
> calling task has (exited or) unmapped the page?
If the user unmaps the page, then aio is left holding the last reference
to the page and will unmap it from irq or bh context (potentially task
context too). With networked aio, pages from userspace (anonymous or
page cache pages) will be released by the network stack from bh context.
Even now, I'm guessing that should be possible with the zero copy flag...
-ben
> and Gerd Knorr's report (extracts below) implies that his bttv driver
> was calling unmap_kiobuf at interrupt time. Is that right, Gerd?
Yes.
> encourage him to unmap at interrupt time). Now maybe Gerd's code is
> wrong anyway: a quick look suggests it may also vfree there, which
> would be wrong at interrupt time.
vfree() isn't allowed? I know vmalloc() isn't, because it might sleep
while waiting for the VM to get a few free pages. Why isn't vfree
allowed? I can't see why freeing resources is a problem ...
Is somewhere a list of things which are allowed from irq context and
which are not?
Gerd
--
#define ENOCLUE 125 /* userland programmer induced race condition */
In article <[email protected]> you wrote:
> vfree() isn't allowed? I know vmalloc() isn't because it might sleep
> while waiting for the VM getting a few free pages. Why vfree isn't
> allowed? I can't see why freeing ressources is a problem ...
vfree() needs the semaphore vmalloc also uses. It's a semaphore because
vmalloc sleeps....
vmalloc/vfree are intended for slow-path only so it ought to not be a
problem....
On Sat, 9 Feb 2002, Benjamin LaHaise wrote:
> On Fri, Feb 08, 2002 at 05:46:56PM +0000, Hugh Dickins wrote:
> > Ben, you probably have an AIO opinion here. Is there a circumstance
> > in which AIO can unpin a user page at interrupt time, after the
> > calling task has (exited or) unmapped the page?
>
> If the user unmaps the page, then aio is left holding the last reference
> to the page and will unmap it from irq or bh context (potentially task
> context too). With networked aio, pages from userspace (anonymous or
> page cache pages) will be released by the network stack from bh context.
> Even now, I'm guessing that should be possible with the zero copy flag...
Thanks for the feedback, Ben; and thanks for your feedback too, Gerd.
Yes, even in asking the question, the answer seemed unavoidable: with
AIO we shall have to permit an LRU page in __free_pages_ok at interrupt
time; as already happens, rightly or wrongly, with Gerd's bttv.
So my proposal to shove an "if (in_interrupt()) BUG();" there cannot
go forward: please back it out, Dave. I've abandoned that patch, and
reworked my "fatally flawed" racy patch. Andrea, Rik and Ben all seem
to prefer page_cache_release same as put_page same as __free_page, so
I've also abandoned reinstating distinct LRU-aware page_cache_release.
I don't like the idea of deciding what to do with a PageLRU page in
__free_pages_ok, based on whether in_interrupt() or not: feels wrong;
but would make a smaller patch, if you prefer for 2.4.18, Marcelo.
Instead, lru_cache_del when page_count 1 (and cannot be duplicated)
before page_cache_release, in three places where it's often needed -
free_page_and_swap_cache, free_swap_and_cache and (less obviously)
do_wp_page. I know PageLRU pages can get leftover by try_to_unuse,
that's okay, good for testing; probably other places.
This way also has the advantage that __free_pages_ok *never* needs to
take pagemap_lru_lock: although an earlier audit suggested that was
not a real issue at present, it's going to be easier if we don't ever
have to worry about such a locking order issue.
New patch against 2.4.18-pre9 below: uses the PageLocked bit to avoid
the race in the earlier patch. Please, would someone adept at memory
barriers check whether the page_count 0 PageLocked interaction between
__free_pages_ok and shrink_cache should have a barrier too? Thanks!
Hugh
--- 2.4.18-pre9/mm/memory.c Mon Jan 7 13:03:04 2002
+++ linux/mm/memory.c Tue Feb 12 18:20:21 2002
@@ -961,6 +961,8 @@
}
spin_unlock(&mm->page_table_lock);
page_cache_release(new_page);
+ if (page_count(old_page) == 1)
+ lru_cache_del(old_page);
page_cache_release(old_page);
return 1; /* Minor fault */
--- 2.4.18-pre9/mm/page_alloc.c Mon Feb 11 18:08:45 2002
+++ linux/mm/page_alloc.c Tue Feb 12 18:28:27 2002
@@ -70,12 +70,6 @@
struct page *base;
zone_t *zone;
- /* Yes, think what happens when other parts of the kernel take
- * a reference to a page in order to pin it for io. -ben
- */
- if (PageLRU(page))
- lru_cache_del(page);
-
if (page->buffers)
BUG();
if (page->mapping)
@@ -86,8 +80,19 @@
BUG();
if (PageLocked(page))
BUG();
- if (PageLRU(page))
- BUG();
+ /*
+ * Exceptionally, a page may arrive here while still on LRU e.g.
+ * asynchronous methods can take a reference to a page in order
+ * to pin it for I/O, then free it after its vma was unmapped:
+ * perhaps in IRQ or BH context when it would be unsafe to take
+ * pagemap_lru_lock. So leave such a page on the LRU, use page
+ * lock to flag it, and let shrink_cache free it in due course.
+ */
+ if (PageLRU(page)) {
+ if (TryLockPage(page))
+ BUG();
+ return;
+ }
if (PageActive(page))
BUG();
page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));
--- 2.4.18-pre9/mm/swap_state.c Wed Oct 31 23:31:03 2001
+++ linux/mm/swap_state.c Tue Feb 12 18:20:21 2002
@@ -148,6 +148,13 @@
remove_exclusive_swap_page(page);
UnlockPage(page);
}
+ /*
+ * Take last reference to an anonymous page off its LRU list:
+ * page_table_lock is held so count 1 cannot be duplicated at
+ * this point; and a cached page would have count >= 2 here.
+ */
+ if (page_count(page) == 1)
+ lru_cache_del(page);
page_cache_release(page);
}
--- 2.4.18-pre9/mm/swapfile.c Mon Feb 11 18:08:45 2002
+++ linux/mm/swapfile.c Tue Feb 12 18:20:21 2002
@@ -346,6 +346,8 @@
/* Only cache user (+us), or swap space full? Free it! */
if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) {
delete_from_swap_cache(page);
+ if (page_count(page) == 1)
+ lru_cache_del(page);
SetPageDirty(page);
}
UnlockPage(page);
--- 2.4.18-pre9/mm/vmscan.c Mon Feb 11 18:08:45 2002
+++ linux/mm/vmscan.c Tue Feb 12 18:43:13 2002
@@ -363,11 +363,26 @@
list_add(entry, &inactive_list);
/*
- * Zero page counts can happen because we unlink the pages
- * _after_ decrementing the usage count..
+ * Exceptionally, a page may still be on an LRU when it
+ * arrives at __free_pages_ok, maybe at interrupt time
+ * when it could not safely be removed: so free it here.
+ * The PageLocked bit is used to flag this case, to
+ * avoid race where __free_pages on another CPU brings
+ * page_count to 0, we see count 0 so __lru_cache_del
+ * and __free_pages_ok, other CPU reaches PageLRU test
+ * after we already cleared it, then page doubly freed.
*/
- if (unlikely(!page_count(page)))
+ if (unlikely(!page_count(page))) {
+ if (PageLocked(page)) {
+ page_cache_get(page);
+ __lru_cache_del(page);
+ UnlockPage(page);
+ page_cache_release(page);
+ if (!--nr_pages)
+ break;
+ }
continue;
+ }
if (!memclass(page->zone, classzone))
continue;
On Tue, 12 Feb 2002, Hugh Dickins wrote:
> On Sat, 9 Feb 2002, Benjamin LaHaise wrote:
> > On Fri, Feb 08, 2002 at 05:46:56PM +0000, Hugh Dickins wrote:
> > > Ben, you probably have an AIO opinion here. Is there a circumstance
> > > in which AIO can unpin a user page at interrupt time, after the
> > > calling task has (exited or) unmapped the page?
> >
> > If the user unmaps the page, then aio is left holding the last reference
> > to the page and will unmap it from irq or bh context (potentially task
> > context too). With networked aio, pages from userspace (anonymous or
> > page cache pages) will be released by the network stack from bh context.
> > Even now, I'm guessing that should be possible with the zero copy flag...
Hugh. Are you sure we can free a page from IRQ/BH context with _current_
2.4 ?
Please make sure so.
On Wed, 13 Feb 2002, Marcelo Tosatti wrote:
>
> Hugh. Are you sure we can free a page from IRQ/BH context with _current_
> 2.4 ?
I'll take it you mean "free a page _still on LRU_ from IRQ/BH ..."
Gerd's full ksymoops trace of 2.4.15-pre4 on 15 Nov 2001 shows it,
and he confirmed that was the case when I asked a few days ago:
> kernel BUG at page_alloc.c:84!
> ....
> >>EIP; c0129e5a <__free_pages_ok+aa/29c> <=====
> Trace; c012a6f2 <__free_pages+1a/1c>
> Trace; c0121120 <unmap_kiobuf+34/48>
> Trace; f94cdfa0 <END_OF_CODE+156e2/????>
> Trace; f94bd6d8 <.bss.end+4e1a/????>
> Trace; f94c6b4e <END_OF_CODE+e290/????>
> Trace; f94cdfa0 <END_OF_CODE+156e2/????>
> Trace; f94c3b3c <END_OF_CODE+b27e/????>
> Trace; f94cdfa0 <END_OF_CODE+156e2/????>
> Trace; f94cdfa0 <END_OF_CODE+156e2/????>
> Trace; f94c3c7c <END_OF_CODE+b3be/????>
> Trace; f94cdfa0 <END_OF_CODE+156e2/????>
> Trace; c0107f9c <handle_IRQ_event+30/5c>
> Trace; f94cdfa0 <END_OF_CODE+156e2/????>
> Trace; c0108106 <do_IRQ+6a/a8>
> ....
> <0>Kernel panic: Aiee, killing interrupt handler!
However: that is the only unambiguous example I've seen, and you
may argue that his bttv 0.8 driver is not in the current 2.4 tree,
is experimental, and even wrong in that area (we now know it also
vfrees there). I would counter that doing unmap_kiobuf in IRQ/BH
context was safe until anonymous pages went on LRU in 2.4.14-pre5,
and both attempts at PageLRU BUG fixes which have gone in since
(mine in 2.4.15-pre and Ben's in 2.4.18-pre) have overlooked
the danger of spin_lock(&pagemap_lru_lock) in IRQ/BH context.
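The danger is the usual one for a lock that is not irq-safe. A minimal
sketch (the lock below is a stand-in for the real pagemap_lru_lock, and
both functions are invented for illustration):

#include <linux/spinlock.h>

static spinlock_t lru_lock = SPIN_LOCK_UNLOCKED;	/* stand-in */

/* process context: shrink_cache, lru_cache_add, lru_cache_del, ... */
static void lru_work_in_process_context(void)
{
	spin_lock(&lru_lock);		/* interrupts stay enabled */
	/* ... walking and relinking the LRU lists ... */
	spin_unlock(&lru_lock);
}

/* irq/bh context: the last put of a PageLRU page would end up here */
static void lru_del_from_interrupt(void)
{
	/*
	 * If this interrupt arrived while the same CPU was inside the
	 * critical section above, spin_lock() spins forever on SMP:
	 * the lock holder cannot run again until the handler returns,
	 * and the handler never returns.  On UP the spinlock compiles
	 * away entirely, so instead of hanging we silently stomp on a
	 * list the interrupted code was in the middle of updating.
	 */
	spin_lock(&lru_lock);
	/* __lru_cache_del(page); */
	spin_unlock(&lru_lock);
}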
The ambiguous evidence is this collection of lkml oops reports:
> Date: Sun, 2 Dec 2001 20:16:13 -0500
> From: Andrew Ferguson <[email protected]>
> ksymoops 2.4.3 on i686 2.5.1-pre1. Options used
> EIP: 0010:[shrink_cache+142/720] Not tainted
> Date: Wed, 12 Dec 2001 16:40:33 -0800
> From: "Adam McKenna" <[email protected]>
> I replaced the kernel with stock 2.4.14 and get similar errors.
> EIP: 0010:[rmqueue+405/472] Not tainted
> Date: Tue, 1 Jan 2002 02:18:01 -0500
> From: Ed Tomlinson <[email protected]>
> ksymoops 2.4.3 on i586 2.4.17. Options used
> kernel BUG at page_alloc.c:207!
> EIP: 0010:[rmqueue+474/544] Tainted: P
> Date: Wed, 30 Jan 2002 21:44:27 +0100
> From: Luca Montecchiani <[email protected]>
> ksymoops 2.4.1 on i586 2.4.18-pre7. Options used
> EIP; c012b819 <rmqueue+17d/1ac> <=====
> Date: 03 Feb 2002 14:05:03 +0100
> From: <[email protected]>
> I encounter the following kernel error using stock kernel.org 2.4.17:
> kernel BUG at vmscan.c:360!
> EIP: 0010:[shrink_cache+206/764] Not tainted
I may be mistaken (hurriedly collected these to answer your mail,
don't think I studied the first two in depth), but I believe these
could be explained by LRU linkage corruption, such as would be seen
on UP if an IRQ/BH lru_cache_del interrupted at the wrong moment
(but on SMP would hang instead). However, they might also be
explained by bad memory, or by more general corruption. If we
knew what's causing all the prune_dcache crashes, very likely it
could explain these too (but I do not see any way lru_cache_del
at interrupt time could explain those crashes).
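Concretely, the 2.4 list primitives are just two unlocked pointer stores,
so on UP an lru_cache_del() from an interrupt can run between them. The
list_del body below is essentially the 2.4 one; the scenario in the
comment is the speculation:

/* essentially __list_del() from include/linux/list.h */
static inline void sketch_list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	/*
	 * On UP nothing protects this point.  An interrupt landing
	 * here whose handler does lru_cache_del()/list_add() on a
	 * neighbouring page walks links that are already half
	 * rewritten.  Once both operations complete, the inactive
	 * list can skip entries or point at a page that has been
	 * freed, which is what the shrink_cache and rmqueue BUG()s
	 * above would then trip over.
	 */
	entry->prev->next = entry->next;
}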
I haven't submitted my patch as a "this will solve our problems",
just as a "let's get this uncertainty out of the way". Whether
for 2.4.18 or later, that's your judgment. One thing that is
now certain: my simple original "if (in_interrupt()) BUG();"
is not good (for Gerd's bttv) and will not be good (for Ben's AIO).
Hugh
> However: that is the only unambiguous example I've seen, and you
> may argue that his bttv 0.8 driver is not in the current 2.4 tree,
> is experimental, and even wrong in that area (we now know it also
> vfrees there).
I've recently changed the code to make it *not* call unmap_kiobuf/vfree
from irq context. Instead bttv 0.8.x doesn't allow you to close the
device with DMA xfers in flight. If you try this the release() fops
handler will block until the transfer is done, then unmap_kiobuf from
process context, then return.
Gerd
--
#define ENOCLUE 125 /* userland programmer induced race condition */
On Thu, Feb 14, 2002 at 12:10:37PM +0100, Gerd Knorr wrote:
> > However: that is the only unambiguous example I've seen, and you
> > may argue that his bttv 0.8 driver is not in the current 2.4 tree,
> > is experimental, and even wrong in that area (we now know it also
> > vfrees there).
>
> I've recently changed the code to make it *not* call unmap_kiobuf/vfree
> from irq context. Instead bttv 0.8.x doesn't allow you to close the
> device with DMA xfers in flight. If you try this the release() fops
> handler will block until the transfer is done, then unmap_kiobuf from
> process context, then return.
perfect, that's the right fix for 2.4 (waiting for DMA to complete at
->release also looks much saner). unmap_kiobuf wasn't supposed to be run
from irq handlers. Everything dealing with userspace mappings cannot run
from irq handlers, tlb flushes, VM, swapping etc... everything must run
from normal kernel context. If you obey this rule, my previous email to
this thread will still apply. I wasn't aware of bttv running
unmap_kiobuf from irq.
With aio in 2.5 we may want to change this property for the unpinning
stage that would be better run asynchronously from irq handlers, but I
wouldn't change that for 2.4 (at least until we're forced to ship aio in
production on top of 2.4, which cannot happen until a final user<->kernel API is
registered somewhere).
I think the fundamental design mistake that leads __free_pages to
fail from irq is that we allow an anonymous page to reach count 0 while it
is still on the LRU (the count == 0 check in shrink_cache is the other side
of the hack too). That's the real BUG, it subtly breaks the freelist
semantics, and then we need horrible hacks like Hugh's latest
patch to work around such a magic case (or, even worse, Rik's proposal for a
spin_lock_irq'd list that would hurt all the vm fast paths). As far
as clean design and orthogonality of subsystems are concerned, the right
fix for 2.5 is to bump page->count by the time the anonymous page is
added to the lru (think about why we already do that for the
pagecache, and why the pagecache is therefore obviously safe even for aio and
unmap_kiobuf from irq). Then we need to take it into account during
COWs (page count == 2 for an anonymous page will mean "exclusive")
etc. The MM will need to be changed a little more heavily than with
the hack approach, but I think that's the clean design in the long run.
No special cases for those magic anonymous pages: everything goes in
pagecache, with the difference that the anonymous pages aren't hashed (until
they become swapcache at least). The semantics of __free_pages will
remain that if you own a page you can __free_pages it anytime you want
without running into BUG()s. If the page is also owned by some other
subsystem (the VM), that subsystem will need to take care of bumping the
reference count and freeing the page later, lazily. No collisions.
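A hedged sketch of what that convention could look like (the helper names
are invented, this is not a patch): the LRU holds its own reference,
exactly as the pagecache already does, so a count-0 page can never still
be PageLRU and __free_pages() never needs to touch the lists.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

/* an anonymous page enters the LRU: the LRU takes its own reference */
static void add_anon_page_to_lru(struct page *page)
{
	page_cache_get(page);
	lru_cache_add(page);
}

/*
 * COW test in do_wp_page(): "am I the only user?" shifts by one,
 * one reference for the pte plus one for the LRU (swapcache and
 * buffers are ignored for the purpose of this sketch).
 */
static int anon_page_is_exclusive(struct page *page)
{
	return page_count(page) == 2;
}

/* only the VM, in process context under pagemap_lru_lock, drops the
 * LRU's reference; the final free then never sees PageLRU set */
static void vm_drop_from_lru(struct page *page)
{
	__lru_cache_del(page);
	page_cache_release(page);
}

The cost is also visible in the sketch: every place that tests or adjusts
the count of an anonymous page has to agree on the extra reference.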
As said, this should be a matter only for 2.5, now that Gerd calls
unmap_kiobuf from normal kernel context.
Andrea
On Thu, 14 Feb 2002, Andrea Arcangeli wrote:
> On Thu, Feb 14, 2002 at 12:10:37PM +0100, Gerd Knorr wrote:
> >
> > I've recently changed the code to make it *not* call unmap_kiobuf/vfree
> > from irq context. Instead bttv 0.8.x doesn't allow you to close the
> > device with DMA xfers in flight. If you try this the release() fops
> > handler will block until the transfer is done, then unmap_kiobuf from
> > process context, then return.
>
> perfect, that's the right fix for 2.4 (waiting for DMA to complete at
> ->release also looks much saner). unmap_kiobuf wasn't supposed to be run
> from irq handlers. Everything dealing with userspace mappings cannot run
> from irq handlers, tlb flushes, VM, swapping etc... everything must run
> from normal kernel context. If you obey this rule, my previous email to
> this thread will still apply. I wasn't aware of bttv running
> unmap_kiobuf from irq.
It's good that Gerd has made his change, but we don't know who else
might have been doing something similar. unmap_kiobuf does not involve
unmapping virtual address space: it used to be safe to run from irq
handlers, but now it is not.
We don't have to make a change for 2.4.18, but we really should add some
kind of safety check there in 2.4.soon: either of the "it's a BUG()"
kind I first suggested (which may embarrass us by firing too often),
or of the "we can handle that" kind which I last suggested.
I don't disagree with Andrew's and your count-LRU approach,
but it does have slight drawbacks, as you noted.
> As said, this should be a matter only for 2.5, now that Gerd calls
> unmap_kiobuf from normal kernel context.
That's just a hope: you may be right, we simply don't know.
Hugh
On Thu, Feb 14, 2002 at 02:01:36PM +0000, Hugh Dickins wrote:
> On Thu, 14 Feb 2002, Andrea Arcangeli wrote:
> > On Thu, Feb 14, 2002 at 12:10:37PM +0100, Gerd Knorr wrote:
> > >
> > > I've recently changed the code to make it *not* call unmap_kiobuf/vfree
> > > from irq context. Instead bttv 0.8.x doesn't allow you to close the
> > > device with DMA xfers in flight. If you try this the release() fops
> > > handler will block until the transfer is done, then unmap_kiobuf from
> > > process context, then return.
> >
> > perfect, that's the right fix for 2.4 (waiting for DMA to complete at
> > ->release also looks much saner). unmap_kiobuf wasn't supposed to be run
> > from irq handlers. Everything dealing with userspace mappings cannot run
> > from irq handlers, tlb flushes, VM, swapping etc... everything must run
> > from normal kernel context. If you obey this rule, my previous email to
> > this thread will still apply. I wasn't aware of bttv running
> > unmap_kiobuf from irq.
>
> It's good that Gerd has made his change, but we don't know who else
> might have been doing something similar. unmap_kiobuf does not involve
> unmapping virtual address space: it used to be safe to run from irq
> handlers, but now it is not.
>
> We don't have to make a change for 2.4.18, but we really should add some
> kind of safety check there in 2.4.soon: either of the "it's a BUG()"
> kind I first suggested (which may embarrass us by firing too often),
> or of the "we can handle that" kind which I last suggested.
I think your BUG patch is good and it should go in.
Also note that such a BUG() cannot trigger in mainline, and this is not a
hope, we know it for sure. It cannot happen even with bttv, because
the bttv driver in mainline uses remap_page_range and doesn't
allow read/write zerocopy. Such a BUG can trigger only if you patch
mainline or if you use drivers that are not part of the mainline kernel, so as
usual it's up to the driver maintainers to provide a driver that
functions correctly on top of a certain mainline kernel. So I really
wouldn't worry about changing this for 2.4 (as said, unless/until we're
forced to ship aio on top of 2.4).
Andrea
On Thu, 14 Feb 2002, Andrea Arcangeli wrote:
> >
> > I've recently changed the code to make it *not* call unmap_kiobuf/vfree
> > from irq context. Instead bttv 0.8.x doesn't allow you to close the
> > device with DMA xfers in flight. If you try this the release() fops
> > handler will block until the transfer is done, then unmap_kiobuf from
> > process context, then return.
>
> perfect, that's the right fix for 2.4 (waiting for DMA to complete at
> ->release also looks much saner).
I think it's the right fix regardless. We should do as little as possible
from irq contexts, and that holds _doubly_ true if it mucks with things
like the page cache.
> With aio in 2.5 we may want to change this property for the unpinning
> stage that would be better run asynchronously from irq handlers, but I
> wouldn't change that for 2.4 (at least until we're forced to ship aio in
> production on top of 2.4, which cannot happen until a final user<->kernel API is
> registered somewhere).
I'd really really hate to have the IO make pages go away from irq context
in 2.5.x too. I think async IO should always be started and cleaned up in
a user context - there simply isn't any reason not to (the notion of doing
an exit() or execve() with IO still pending to now-dead-memory is rather
horrible in itself).
> I think the fundamental design mistake that leads __free_pages to
> fail from irq is that we allow an anonymous page to reach count 0 while it
> is still on the LRU (the count == 0 check in shrink_cache is the other side
> of the hack too). That's the real BUG, it subtly breaks the freelist
> semantics
Agreed. We should NEVER free the pages from the irq.
Linus
On Thu, Feb 14, 2002 at 08:27:36AM -0800, Linus Torvalds wrote:
> I'd really really hate to have the IO make pages go away from irq context
> in 2.5.x too. I think async IO should always be started and cleaned up in
> a user context - there simply isn't any reason not to (the notion of doing
> an exit() or execve() with IO still pending to now-dead-memory is rather
> horrible in itself).
I disagree: requiring aio to execute completion in user context means
that we can no longer have quick completion directly from an interrupt
handler to a busy server executing in userland.
That said, it is possible to do the same partial completion as is done
with file descriptors from interrupt context for pages, but it'll be
*really* gross. Freeing pages should be possible from any context IMO.
> > I think the fundamental design mistake that leads __free_pages to
> > fail from irq is that we allow an anonymous page to reach count 0 while it
> > is still on the LRU (the count == 0 check in shrink_cache is the other side
> > of the hack too). That's the real BUG, it subtly breaks the freelist
> > semantics
>
> Agreed. We should NEVER free the pages from the irq.
Uhm, what about the network stack?
-ben
On Mon, 25 Feb 2002, Benjamin LaHaise wrote:
>
> I disagree: requiring aio to execute completion in user context means
> that we can no longer have quick completion directly from an interrupt
> handler to a busy server executing in userland.
Note that the IO _completion_ can happen in user land. That does not mean
that the IO page tear-down can (or should) happen at the same time.
There are two issues here:
- the data structures supporting the IO, built up and torn down from
process space.
- the IO itself.
One is synchronous, the other is not.
Linus
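A minimal sketch of that split (everything below is invented for
illustration, it is not the aio code): the interrupt handler only records
completion and wakes the issuer, while the kiobuf teardown, the part that
may touch the LRU, stays in process context.

#include <linux/wait.h>
#include <linux/iobuf.h>
#include <linux/sched.h>

struct io_req {
	struct kiobuf *iobuf;		/* pages pinned at submission time */
	wait_queue_head_t wait;
	volatile int done;
};

/* irq/bh context: cheap completion notification, no VM work here */
static void io_req_complete(struct io_req *req)
{
	req->done = 1;
	wake_up(&req->wait);		/* wake_up() is safe from irq context */
}

/* process context: the issuer (or its release() path) reaps and tears
 * down, so unmap_kiobuf never drops the last reference under an irq */
static void io_req_reap(struct io_req *req)
{
	wait_event(req->wait, req->done);
	unmap_kiobuf(req->iobuf);
}

In effect this is what Gerd's bttv change already does: block in release()
until the DMA is done, then unmap_kiobuf from process context.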