This box has Oopsed twice but is still running. Both Oopses followed a
BUG() report (same in both cases):
kernel BUG at page_alloc.c:76!
Which maps to:
if (page->mapping)
BUG();
...in __free_pages_ok() in mm/page_alloc.c.
This box has the same setup as I've reported for previous Oopses -- dual
PIII 800 CPUs, 512MB ECC SDRAM, etc.
Decoded oopses for backtraces:
ksymoops 0.7c on i686 2.4.15-pre1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.15-pre1/ (default)
-m /boot/System.map (specified)
Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says c0203460, System.map says c0152f70. Ignoring ksyms_base entry
Reading Oops report from the terminal
invalid operand: 0000
CPU: 1
EIP: 0010:[<c012bf34>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: 0000001f ebx: c108b040 ecx: c02c9040 edx: 000044ae
esi: c108b040 edi: 00000000 ebp: 00000000 esp: c1829f1c
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=c1829000)
Stack: c0270d09 0000004c c108b040 c108b040 00000000 0000001d c0135b3a c108b05c
c108b040 00000000 c012b066 c012c855 c108b05c c012b757 00000020 000001d0
00000020 00000006 c1828000 00000100 00002263 000001d0 c02ca268 c012ba10
Call Trace: [<c0135b3a>] [<c012b066>] [<c012c855>] [<c012b757>] [<c012ba10>]
[<c012ba5d>] [<c012baf3>] [<c012bb4e>] [<c012bc5d>] [<c01055b4>]
Code: 0f 0b 83 c4 08 8d b4 26 00 00 00 00 89 f0 2b 05 6c 99 34 c0
>>EIP; c012bf34 <__free_pages_ok+34/2b8> <=====
Trace; c0135b3a <try_to_free_buffers+12e/18c>
Trace; c012b066 <lru_cache_del+12/1c>
Trace; c012c855 <page_cache_release+2d/30>
Trace; c012b757 <shrink_cache+26f/3a8>
Trace; c012ba10 <shrink_caches+5c/8c>
Trace; c012ba5d <try_to_free_pages+1d/3c>
Trace; c012baf3 <kswapd_balance_pgdat+43/8c>
Trace; c012bb4e <kswapd_balance+12/28>
Trace; c012bc5d <kswapd+99/bc>
Trace; c01055b4 <kernel_thread+28/38>
Code; c012bf34 <__free_pages_ok+34/2b8>
00000000 <_EIP>:
Code; c012bf34 <__free_pages_ok+34/2b8> <=====
0: 0f 0b ud2a <=====
Code; c012bf36 <__free_pages_ok+36/2b8>
2: 83 c4 08 add $0x8,%esp
Code; c012bf39 <__free_pages_ok+39/2b8>
5: 8d b4 26 00 00 00 00 lea 0x0(%esi,1),%esi
Code; c012bf40 <__free_pages_ok+40/2b8>
c: 89 f0 mov %esi,%eax
Code; c012bf42 <__free_pages_ok+42/2b8>
e: 2b 05 6c 99 34 c0 sub 0xc034996c,%eax
1 warning issued. Results may not be reliable.
Second Oops:
ksymoops 0.7c on i686 2.4.15-pre1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.15-pre1/ (default)
-m /boot/System.map (specified)
Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says c0203460, System.map says c0152f70. Ignoring ksyms_base entry
Reading Oops report from the terminal
invalid operand: 0000
CPU: 1
EIP: 0010:[<c012bf34>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: 0000001f ebx: c108b040 ecx: c02c9040 edx: 000047e6
esi: c108b040 edi: dbc21330 ebp: 00000000 esp: d1487f00
ds: 0018 es: 0018 ss: 0018
Process getdomainpasswo (pid: 26743, stackpage=d1487000)
Stack: c0270d09 0000004c c108b040 00000073 dbc21330 00000000 c0125d0a 00000010
00010203 c108b040 c1885020 c012c855 c108b040 c0125acb d1487f8c c108b040
00000000 00000073 00000000 40016000 00000000 00001000 00000073 00000001
Call Trace: [<c0125d0a>] [<c012c855>] [<c0125acb>] [<c0125da5>] [<c0125cdc>]
[<c01321ab>] [<c0106dbb>]
Code: 0f 0b 83 c4 08 8d b4 26 00 00 00 00 89 f0 2b 05 6c 99 34 c0
>>EIP; c012bf34 <__free_pages_ok+34/2b8> <=====
Trace; c0125d0a <file_read_actor+2e/58>
Trace; c012c855 <page_cache_release+2d/30>
Trace; c0125acb <do_generic_file_read+23f/450>
Trace; c0125da5 <generic_file_read+71/8c>
Trace; c0125cdc <file_read_actor+0/58>
Trace; c01321ab <sys_read+8f/c4>
Trace; c0106dbb <system_call+33/38>
Code; c012bf34 <__free_pages_ok+34/2b8>
00000000 <_EIP>:
Code; c012bf34 <__free_pages_ok+34/2b8> <=====
0: 0f 0b ud2a <=====
Code; c012bf36 <__free_pages_ok+36/2b8>
2: 83 c4 08 add $0x8,%esp
Code; c012bf39 <__free_pages_ok+39/2b8>
5: 8d b4 26 00 00 00 00 lea 0x0(%esi,1),%esi
Code; c012bf40 <__free_pages_ok+40/2b8>
c: 89 f0 mov %esi,%eax
Code; c012bf42 <__free_pages_ok+42/2b8>
e: 2b 05 6c 99 34 c0 sub 0xc034996c,%eax
1 warning issued. Results may not be reliable.
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
On Fri, Nov 16, 2001 at 02:23:44PM -0800, Simon Kirby wrote:
> This box has Oopsed twice but is still running. Both Oopses followed a
> BUG() report (same in both cases):
>
> kernel BUG at page_alloc.c:76!
>
> Which maps to:
>
> if (page->mapping)
> BUG();
>
> ...in __free_pages_ok() in mm/page_alloc.c.
I think this one liner (diffed against 2.4.14) could fix this Oops:
--- mm/vmscan.c.vanilla Sat Nov 17 23:37:01 2001
+++ mm/vmscan.c Sat Nov 17 23:39:04 2001
@@ -414,10 +414,9 @@
* the page as well.
*/
if (page->buffers) {
- spin_unlock(&pagemap_lru_lock);
-
/* avoid to free a locked page */
page_cache_get(page);
+ spin_unlock(&pagemap_lru_lock);
if (try_to_free_buffers(page, gfp_mask)) {
if (!page->mapping) {
regards Christian
--
THAT'S ALL FOLKS!
In article <[email protected]> you write:
>
>I think this one liner (diffed against 2.4.14) could fix this Oops:
It really shouldn't matter - at that point we have the page locked, and
we know the page has buffers, so the page cannot go away from under us:
we can delay the "increment page count" simply because we know somebody
else (the buffers) hold on to the page.
Which is not to say that I disagree with the patch itself: it tends to
be good practice to not depend on quite-so-subtle locking rules. It just
really shouldn't make any difference to the problem.
Linus
On Sat, Nov 17, 2001 at 07:12:51PM -0800, Linus Torvalds wrote:
> In article <[email protected]> you write:
> >
> >I think this one liner (diffed against 2.4.14) could fix this Oops:
>
> It really shouldn't matter - at that point we have the page locked, and
I also agree the patch shouldn't matter, but one suspect thing is the
fact add_to_swap_cache goes to clobber in a non atomic manner the page
lock. so yes, we hold the page lock both in swap_out and in
shrink_cache, but swap_out can drop it for a moment and then later
pretend to be the onwer again without a real trylock.
Andrea
On Sun, 18 Nov 2001, Andrea Arcangeli wrote:
>
> I also agree the patch shouldn't matter, but one suspect thing is the
> fact add_to_swap_cache goes to clobber in a non atomic manner the page
> lock.
.. you mean __add_to_page_cache(), not add_to_swap_cache().
And nope, not really. It does use plain stores to page->flags, and I agree
that it is ugly, but if the page was locked before calling it, all the
stores will be with the PG_lock bit set - and even plain stores _are_
documented to be atomic on x86 (and on all other reasonable architectures
too).
> so yes, we hold the page lock both in swap_out and in
> shrink_cache, but swap_out can drop it for a moment and then later
> pretend to be the onwer again without a real trylock.
No, it doesn't get dropped for a moment. The bit is always set, and
somebody else who tries to lock the page will never see it clear, and can
never succeed in locking it.
Is the __add_to_page_cache() playing with the page flags ugly? It sure is.
I'd _almost_ call it buggy, but not because of PG_locked, but because of
all the other bits it does horrible things to. It's one of those
borderline cases, but I don't think it's borderline wrt the lock bit.
Linus
On Sat, Nov 17, 2001 at 10:24:44PM -0800, Linus Torvalds wrote:
>
> On Sun, 18 Nov 2001, Andrea Arcangeli wrote:
> >
> > I also agree the patch shouldn't matter, but one suspect thing is the
> > fact add_to_swap_cache goes to clobber in a non atomic manner the page
> > lock.
>
> .. you mean __add_to_page_cache(), not add_to_swap_cache().
>
> And nope, not really. It does use plain stores to page->flags, and I agree
> that it is ugly, but if the page was locked before calling it, all the
> stores will be with the PG_lock bit set - and even plain stores _are_
> documented to be atomic on x86 (and on all other reasonable architectures
> too).
I know all is right if GCC just overwrites the page->flags with data
that keeps PG_locked set. But GCC doesn't guarantee that. GCC can as
well do:
flags = page->flags;
page->flags = 0;
change flags here
page->flags = flags
probably gcc doesn't, but that's still a kernel bug.
Andrea
In article <[email protected]>,
Andrea Arcangeli <[email protected]> wrote:
>
>I know all is right if GCC just overwrites the page->flags with data
>that keeps PG_locked set. But GCC doesn't guarantee that. GCC can as
>well do:
>
> flags = page->flags;
> page->flags = 0;
>
> change flags here
>
> page->flags = flags
Sure.
>From a C language lawyer standpoing a C compiler can do pretty much
whatever it damn well chooses to do, including temporarily changing
"page->flags" even if the C source doesn't have any reference to
"page->flags" at _all_. The compiler might decide that it temporarily
wants to use that memory for something else, and since "Strictly
conforming ANSI C" does not have a notion of threads etc interesting
issues, you can probably argue that just about _anything_ falls under
"gcc doesn't guarantee that".
>probably gcc doesn't, but that's still a kernel bug.
No. It would be a _gcc_ bug if gcc did things to "page->flags" that the
code did not ask it to do. And that is _regardless_ of any notions of
"strictly conforming C code". The fact is, that if gcc were to clear a
bit that the code never clears, that is a HUGE AND GAPING GCC BUG.
Not kernel bug.
The fact is, if we write code that leaves a certain bit unmodified, gcc
MUST NOT modify that bit. If gcc generated code that temporarily
modifies the bit, I can show user-level code that would break with
signals. See "sig_atomic_t" and friends - the compiler simply _has_ to
guarantee that the semantics you write in C code are actually upheld.
Linus
On Sat, Nov 17, 2001 at 11:31:15PM -0800, Linus Torvalds wrote:
> No. It would be a _gcc_ bug if gcc did things to "page->flags" that the
> code did not ask it to do. And that is _regardless_ of any notions of
> "strictly conforming C code". The fact is, that if gcc were to clear a
> bit that the code never clears, that is a HUGE AND GAPING GCC BUG.
I see what you mean of course (the usual problem is that we never know
if gcc could make such decision for whatever subtle optimization, asm
optimizations are all but intuitional). But I just giveup also about the
other thing of reading from C variables that can change under us. So I'm ok
assuming gcc does what we expect here too, even if I'd prefer not to.
> The fact is, if we write code that leaves a certain bit unmodified, gcc
> MUST NOT modify that bit. If gcc generated code that temporarily
> modifies the bit, I can show user-level code that would break with
> signals. See "sig_atomic_t" and friends - the compiler simply _has_ to
> guarantee that the semantics you write in C code are actually upheld.
There should be proper macros to handle those userspace sig_atomic_t
because of that. Anyways I certainly believe there is code playing with
those types from C by hand :).
Andrea
> No. It would be a _gcc_ bug if gcc did things to "page->flags" that the
> code did not ask it to do. And that is _regardless_ of any notions of
> "strictly conforming C code". The fact is, that if gcc were to clear a
> bit that the code never clears, that is a HUGE AND GAPING GCC BUG.
Why is it a compiler bug. You've not declared that variable to be volatile
therefore it is only touched in the code flow the compiler is analysing.
> signals. See "sig_atomic_t" and friends - the compiler simply _has_ to
> guarantee that the semantics you write in C code are actually upheld.
Most programmers get signal handling wrong, they call stdio functions in
the handlers and far far worse. Nothing new there, even the BSD mail program
was broken.
Alan
Linus Torvalds <[email protected]> said:
[...]
> And nope, not really. It does use plain stores to page->flags, and I agree
> that it is ugly, but if the page was locked before calling it, all the
> stores will be with the PG_lock bit set - and even plain stores _are_
> documented to be atomic on x86 (and on all other reasonable architectures
> too).
Even unaligned stores?
--
Horst von Brand [email protected]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616
On Sun, 18 Nov 2001, Horst von Brand wrote:
> Linus Torvalds <[email protected]> said:
> > And nope, not really. It does use plain stores to page->flags, and I agree
> > that it is ugly, but if the page was locked before calling it, all the
> > stores will be with the PG_lock bit set - and even plain stores _are_
> > documented to be atomic on x86 (and on all other reasonable architectures
> > too).
>
> Even unaligned stores?
Actually, even unaligned stores (which page->flags is NOT) are atomic,
even if Intel strongly discourages them (for performance reasons if no
others) and there tends to be documentation that doesn't guarantee it.
Linus
On Sun, 18 Nov 2001, Alan Cox wrote:
>
> Why is it a compiler bug. You've not declared that variable to be volatile
> therefore it is only touched in the code flow the compiler is analysing.
Even without volatile, the compiler is very arguably buggy if it writes
values to your variables that were never supposed to be there.
Take this, for example:
sig_atomic_t value = 1;
int fn()
{
value = 2;
}
And a signal comes in. Even without the volatile, if gcc has written
_anything_ else than 1 or 2 into the variable, gcc is BROKEN.
There's no point being a language lawyer and saying that gcc "could write
anything to value before it writes the final 2". Tha's not true. A compile
rthat does that is
(a) buggy as hell from a POSIX standpoint
(b) even apart from POSIX, from a Q-of-I standpoint complete and utter
crap.
And yes, the argument for "page->flags" is _exactly_ the same. Writing
intermediate values to "page->flags" that were never referenced by the
programmer is clearly such a violation of QoI that it is a bug even if
"sig_atomic_t" happens to be "int", and "page->flags" happens to be
"unsigned int".
Linus
On Sun, Nov 18, 2001 at 06:04:16PM -0800, Linus Torvalds wrote:
>
> On Sun, 18 Nov 2001, Horst von Brand wrote:
> > Linus Torvalds <[email protected]> said:
> > > And nope, not really. It does use plain stores to page->flags, and I agree
> > > that it is ugly, but if the page was locked before calling it, all the
> > > stores will be with the PG_lock bit set - and even plain stores _are_
> > > documented to be atomic on x86 (and on all other reasonable architectures
> > > too).
> >
> > Even unaligned stores?
>
> Actually, even unaligned stores (which page->flags is NOT) are atomic,
> even if Intel strongly discourages them (for performance reasons if no
> others) and there tends to be documentation that doesn't guarantee it.
>
> Linus
>
This is true. They Generate what's called a "split lock" bus transaction
where the bus will hold LOCK# low across the several clock cycles to
complete the write. They are **VERY** heavy, BTW, and really cause
nasty performance hits.
Jeff
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On Sun, 18 Nov 2001, Linus Torvalds wrote:
>
> And a signal comes in. Even without the volatile, if gcc has written
> _anything_ else than 1 or 2 into the variable, gcc is BROKEN.
Side note: the Linux kernel depends on these kinds of quality-of-
implementation issues in several places. Thge "page->flags" thing is just
one small (and rather localized) case.
If you look at all the stuff that enforces memory ordering, they all
absolutely _require_ that gcc write exactly the value that we specify and
no other. This includes things like "policy" and "has_cpu" in the process
data structures, and "dumpable" in the "mm" structure.
If the compiler were to write random internal values to these variables
before writing the one we ask for, you'd get kernel crashes (has_cpu) or
strange and subtle security races (dumpable).
Oh, and I bet TCP would break horribly if gcc wrote internal temporary
values to the socket sequence numbers.
Linus
[email protected] said:
> This is true. They Generate what's called a "split lock" bus
> transaction where the bus will hold LOCK# low across the several clock
> cycles to complete the write. They are **VERY** heavy, BTW, and
> really cause nasty performance hits.
Is it worth making put_unaligned and get_unaligned on x86 avoid this by
loading/storing the two halves of the required datum separately, then?
--
dwmw2
> There's no point being a language lawyer and saying that gcc "could write
> anything to value before it writes the final 2". Tha's not true. A compile
> rthat does that is
>
> (a) buggy as hell from a POSIX standpoint
It would mean someone misdefined sig_atomic_t if it broke for assignment
> And yes, the argument for "page->flags" is _exactly_ the same. Writing
> intermediate values to "page->flags" that were never referenced by the
> programmer is clearly such a violation of QoI that it is a bug even if
> "sig_atomic_t" happens to be "int", and "page->flags" happens to be
> "unsigned int".
If the code execution path is faster why shouldnt the compiler generate
those patterns. You haven't told the variable that its volatile. Therefore
the compiler is supposed to be optimising it.
You can in theory get even stranger effects. Consider
if(!(flag&4))
{
blahblah++;
flag|=4;
}
The compiler is completely at liberty to turn that into
test bit 2
jz 1f
inc blahblah
add #4, flag
1f:
because it knows the bit was clear and it knows the variable is not
volatile.
Having your own personal custom language dialect might be tempting but it is
normally something only the lisp community do.
Alan
On Mon, 19 Nov 2001, Alan Cox wrote:
>
> You can in theory get even stranger effects. Consider
>
> if(!(flag&4))
> {
> blahblah++;
> flag|=4;
> }
>
> The compiler is completely at liberty to turn that into
>
> test bit 2
> jz 1f
> inc blahblah
> add #4, flag
> 1f:
That's fine - if you have two threads modifying the same variable at the
same time, you need to lock it.
That's not the case under discussion.
The case under discussion is gcc writing back values to a variable that
NEVER HAD ANY VALIDITY, even in the single-threaded case. And it _is_
single-threaded at that point, you only have other users that test the
value, not change it.
That's not an optimization, that's just plain broken. It breaks even
user-level applications that use sigatomic_t.
And note how gcc doesn't actually do it. I'm not saying that gcc is broken
- I' saying that gcc is NOT broken, and we depend on it being not broken.
Linus
On Mon, 19 Nov 2001, David Woodhouse wrote:
>
> Is it worth making put_unaligned and get_unaligned on x86 avoid this by
> loading/storing the two halves of the required datum separately, then?
No, modern CPU's do it well enough - starting from the Pentium, Intel does
all locking internally in the caches, and depends on the cache coherency
protocol to show the atomicity to the rest of the world. Only an i486 will
actually show the locked cycles on the bus, if I remember correctly.
In fact, I think the CPU will do a unaligned non-cache-crossing operation
as fast as a aligned store. The cacheline-crossing case is noticeably
slower, at least on a PPro (the Pentium had some optimizations where it
would pair the two cacheline accesses, and could do two cacheline accesses
in one cycle - so the cacheline crosser could execute at full speed, but
it would hurt pairing with _other_ memory instructions).
Testing shows:
- PPro core:
single-cycle stores, whether aligned or not, within a
cacheline.
8 cycles for cacheline crossing stores
- Athlon:
single cycle for unaligned, whether cache-line croesser or not.
(And as mentioned, I think Pentiums act the same as athlons).
In short, unaligned integer ops are not affected very much at all. They do
take more resources internally (ie they use two write-ports to the cache
when cache-crossing), so even when they run at the "same" speed, it pairs
etc better if aligned, but x86 is very very good at unaligned handling.
One of the advantages of a legacy of crap: x86 never had the choice to be
designed for "the good case". In order to run fast, an x86 has to run fast
even on bad code.
Because in real life, it doesn't matter how well you do on spec benchmarks
with good compilers.
Linus
So, uh, any idea why the server is hitting the page->mapping BUG() thing
in the first place? :)
The server is still up, and has printed the BUG() line 71 times (up 5
days). In all 71 Oopses/stack dumps, eax, ebx, ecx, esi, and edi are the
same.
It looks like one page is broken and is continually hitting the BUG().
Shouldn't it have been freed after the first BUG(), though?
Is there some way to figure out if this page is special in some way or
track down how it broke?
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
On Mon, 19 Nov 2001, Simon Kirby wrote:
>
> So, uh, any idea why the server is hitting the page->mapping BUG() thing
> in the first place? :)
No.
I suspect that your earlier oopses left something in a stale state - this
is the same machine that you've reported others oopses for, no?
> Is there some way to figure out if this page is special in some way or
> track down how it broke?
It looks like it's a bog-standard page, that was just free'd (probably
because of page->count corruption) while it was still in the page cache.
Now, how that page->count corruption actually happened, I have no idea,
which is why I suspect you had other earlier oopses that left the machine
in an inconsistent state.
There _is_ a known race in 2.4.15pre1, where we simply test a condition
that isn't true any more and that can cause spurious oopses (not this one,
though) under the right circumstances. Such an oops might have left
something in the VM in a half-way state...
Can you reproduce this on pre6, for example? And if so, what's the load?
Linus
Linus Torvalds <[email protected]> writes:
> That's fine - if you have two threads modifying the same variable at the
> same time, you need to lock it.
>
> That's not the case under discussion.
>
> The case under discussion is gcc writing back values to a variable that
> NEVER HAD ANY VALIDITY, even in the single-threaded case. And it _is_
> single-threaded at that point, you only have other users that test the
> value, not change it.
>
> That's not an optimization, that's just plain broken. It breaks even
> user-level applications that use sigatomic_t.
>
> And note how gcc doesn't actually do it. I'm not saying that gcc is broken
> - I' saying that gcc is NOT broken, and we depend on it being not broken.
Linus I agree that gcc works. And even if page->flags is written
to, with two separate write operations page->flags & PG_locked should
still be true.
However this case seems to violate code clarity. If you can have
other users testing PG_locked it is non-intuitive that you can still
normally assign to page->flags.
Would it make sense to add a set_bits macro that is a just an
assignment except on extremely weird architectures or to work
around compiler bugs? I'm just thinking it would make sense
to document that we depend on the compiler not writing some
strange intermediate values into the variable.
I can't imagine why a compiler ever would but it is remotely possible
a compiler but generate an instruction sequence like:
xorl flags, $0xFFFFFFFF
xorl flags, $0xFFFFFFFe
To flip the low bit. I would be terribly surprised, and it would
certainly break sigatomic_t if it was a plain typedef, but stranger
things have happened.
Eric
On Mon, Nov 19, 2001 at 10:03:34AM -0800, Linus Torvalds wrote:
> I suspect that your earlier oopses left something in a stale state - this
> is the same machine that you've reported others oopses for, no?
I have in the past reported one Oops for this machine, I think, yes.
I think it was explained by previous kernel bugs (it was running 2.4.12).
On this kernel version, we've only seen the single BUG() message
regarding page->mapping, and the associated forced Oops/backtrace thing.
Every BUG() and backtrace has been the same except for a few registers,
including the first backtrace.
> It looks like it's a bog-standard page, that was just free'd (probably
> because of page->count corruption) while it was still in the page cache.
> Now, how that page->count corruption actually happened, I have no idea,
> which is why I suspect you had other earlier oopses that left the machine
> in an inconsistent state.
>
> There _is_ a known race in 2.4.15pre1, where we simply test a condition
> that isn't true any more and that can cause spurious oopses (not this one,
> though) under the right circumstances. Such an oops might have left
> something in the VM in a half-way state...
Right, but there were no other Oopses on this machine since 2.4.15pre1
was put on (up 5 days). Previously, with 2.4.12, it Oopsed in some
memory freeing function (I think it was __free_pages_ok or something),
but I didn't have a serial console on it at the time and it was locked
up.
> Can you reproduce this on pre6, for example? And if so, what's the load?
Will pre6 eat my filesystem? :) It's a production box (I just used pre1
because I read through it and saw there were no serious changes, just
a simple race fix and a few other things).
The box is a heavily-hit shared hosting web server running the usual
collection of Apache, Perl, php, and other programs.
I wonder if the quota stuff (which is also used heavily on this box, but
probably not tested anywhere near as widely elsewhere) is the culprit.
Jan Kara has sent me this patch to test (but I have not yet had the
chance to try it on some production servers). It looks like he's wrapped
some memory freeing functions with lock_kernel. Currently, 2.4.14 Oopses
all over the place if quotacheck is run on an active filesystem with
quota turned on (which is broken to begin with, yes, but shouldn't cause
Oopses).
Attached is Jan's patch. See anything interesting? He said he has not
yet submitted it because he hasn't had a chance to test it on an SMP box.
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
Hello!
> Oh, and I bet TCP would break horribly if gcc wrote internal temporary
> values to the socket sequence numbers.
Actually tcp does not depend on gcc idiosyncrasies. It works under
socket lock.
Well, all these things sort of read-copy updates, relying on memory
ordering etc. may be very good, but:
- I do not know the rules of the game.
- Nobody seems to knows them.
- Anyway, I do not have enough of brain cells to keep this under control.
So, networking relies only on explicit locks and barriers and gcc may do
everything except for splitting "optimizations" of this kind over barriers.
The most dangerous thing, which could harm 2.2 a lot is intuitively
natural:
static int a;
auto int b;
b = a;
do_something_with_b;
Goal of this code is clear, to get snapshot of "a"
and to do anything with "b", assuming it does not change.
In 2.2 we rely on this in many places.
I do not see anything which could prohibit gcc to eliminate register
allocated for "b" while CSE and to use "a" directly f.e. when b does
not fit to hardware register set in any case. Actually, gcc
does not make this to our luck, but I suspect it is only because
it is too stupid.
Alexey
On 19 Nov 2001, Eric W. Biederman wrote:
>
> However this case seems to violate code clarity. If you can have
> other users testing PG_locked it is non-intuitive that you can still
> normally assign to page->flags.
This is 100% true, and I actually want to fix it.
The ugliness is actually a historical remnant - we do it with a simple
store simply because at the time we add a page to the page cache, we
historically were the only users of that page. Nobody else could get
access to it, because it didn't exist on any lists.
These days, the LRU queue exists before the page is added to the swap
cache, which means that other people actually _can_ see the page, and the
historical "initialize page flags" is no longer touching a purely private
field.
> Would it make sense to add a set_bits macro that is a just an
> assignment except on extremely weird architectures or to work
> around compiler bugs? I'm just thinking it would make sense
> to document that we depend on the compiler not writing some
> strange intermediate values into the variable.
The thing is, "__add_to_page_cache()" really shouldn't touch the page
flags AT ALL, not with a "set_bits()" macro, and not with anything else
either for that matter. It's actually clearing flags that have meaning,
and the callers these days have to undo some of the work that it does (ie
see how try_to_swap_out() ends up having to mark the page dirty again,
only because __add_to_page_cache() cleared the dirty bit that it shouldn't
know anything at all about).
So the real fix for this particular case is to move the bit
setting/clearing into those callers that actually _want_ to set the bits,
some of which actually do have exclusive access to the page.
That doesn't change the fact that other parts of the kernel still assume
that the setting of a pointer or status field is atomic, for example, and
then use the memory barrier macros to force memory ordering (both for gcc
and for the CPU at runtime).
Linus
On Mon, 19 Nov 2001, Linus Torvalds wrote:
>
> On Mon, 19 Nov 2001, Simon Kirby wrote:
> >
> > So, uh, any idea why the server is hitting the page->mapping BUG() thing
> > in the first place? :)
>
> No.
>
> I suspect that your earlier oopses left something in a stale state - this
> is the same machine that you've reported others oopses for, no?
Linus,
I was talking with Rik today about 2.5 VM plans and we end up talking
about the order of the pagecache_lock and pagemap_lru_lock. He ended up
showing me add_to_page_cache(), which now looks like:
We ended up talking about the possibility of a reschedule (IRQ) happening
before after the "spin_unlock(pagecache_lock)" but before the
"lru_cache_add()".
I haven't investigated the issue yet... But isn't that possible ?
void add_to_page_cache(struct page * page, struct address_space * mapping,
unsigned long offset)
{
spin_lock(&pagecache_lock);
__add_to_page_cache(page, mapping, offset, page_hash(mapping,
offset));
spin_unlock(&pagecache_lock);
lru_cache_add(page);
}
On Mon, 19 Nov 2001, Marcelo Tosatti wrote:
>
> We ended up talking about the possibility of a reschedule (IRQ) happening
> before after the "spin_unlock(pagecache_lock)" but before the
> "lru_cache_add()".
So?
The worst that happens is that the page is not on the LRU list, which just
means that it won't be free'd until we add it (which we will do when the
lru_cache_add() resumes..
Linus
On Mon, 19 Nov 2001, Linus Torvalds wrote:
> On Mon, 19 Nov 2001, Marcelo Tosatti wrote:
> >
> > We ended up talking about the possibility of a reschedule (IRQ) happening
> > before after the "spin_unlock(pagecache_lock)" but before the
> > "lru_cache_add()".
>
> So?
>
> The worst that happens is that the page is not on the LRU list, which
> just means that it won't be free'd until we add it (which we will do
> when the lru_cache_add() resumes..
I wonder if the following scenario is possible:
CPU 0 CPU 1
add_to_page_cache() truncate_list_pages()
spin_lock(&pagecache_lock);
__add_to_page_cache()
spin_unlock(&pagecache_lock);
==> network irq
... remove_inode_page()
...
==> softirqs __free_pages_ok()
...
...
*** page now on free list ***
lru_cache_add(page);
*** BOOM ***
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, 19 Nov 2001, Rik van Riel wrote:
>
> I wonder if the following scenario is possible:
Hmm.. It looks valid, but for the fact that the page lock is held. So
there's no way truncate_list_pages() can call "remove_inode_page()" on the
page, regardless of whether the page is on the LRU list or not.
That said, it might be cleaner to move the "lru_cache_add(page);" up to
before adding the page into the page cache - that way we add a new
invariant that just says "all pages in the page cache are on the LRU
list", which could be used for a few extra sanity checks, for example.
Linus
On Mon, 19 Nov 2001, Linus Torvalds wrote:
> On Mon, 19 Nov 2001, Rik van Riel wrote:
> >
> > I wonder if the following scenario is possible:
>
> Hmm.. It looks valid, but for the fact that the page lock is held. So
> there's no way truncate_list_pages() can call "remove_inode_page()" on the
> page, regardless of whether the page is on the LRU list or not.
>
> That said, it might be cleaner to move the "lru_cache_add(page);" up to
> before adding the page into the page cache - that way we add a new
> invariant that just says "all pages in the page cache are on the LRU
> list", which could be used for a few extra sanity checks, for example.
Oh dear, that's an interesting case, too.
__add_to_page_cache() blindly sets the PG_locked bit, but
it's possible for other functions to acquire the page lock
before that.
This means 2 CPUs could think they're holding the page
lock for this page at the same time.
I don't see anything preventing this from happening,
except the absence of code paths trying to directly
grab hold of physical pages ;)
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, 19 Nov 2001, Rik van Riel wrote:
>
> Oh dear, that's an interesting case, too.
>
> __add_to_page_cache() blindly sets the PG_locked bit, but
> it's possible for other functions to acquire the page lock
> before that.
No. The page is either already locked (add_to_swap_cache()), or
exclusively owned by us..
At least that's how it is _supposed_ to be.
Linus
On Mon, Nov 19, 2001 at 10:03:34AM -0800, Linus Torvalds wrote:
> It looks like it's a bog-standard page, that was just free'd (probably
> because of page->count corruption) while it was still in the page cache.
> Now, how that page->count corruption actually happened, I have no idea,
> which is why I suspect you had other earlier oopses that left the machine
> in an inconsistent state.
Well, I found out what file has the bog-standard page.
open("/home/stevendi//.htaccess", O_RDONLY|O_LARGEFILE) = 4
fstat64(0x4, 0x80ea000) = 0
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
fstat64(0x4, 0xbfffd878) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x401d3000
read(4, <unfinished ...>
+++ killed by SIGSEGV +++
Every time that file is read, it Oopses. It would probably be quite
difficult to find something that would exploit a kernel race on this
particular file (they are very rarely modified).
I suppose rebooting would be the only way to get rid of the broken page,
as it doesn't seem to have cleared up by itself.
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
On Mon, 19 Nov 2001, Simon Kirby wrote:
>
> Well, I found out what file has the bog-standard page.
>
> open("/home/stevendi//.htaccess", O_RDONLY|O_LARGEFILE) = 4
What filesystem is this?
Linus
On Mon, Nov 19, 2001 at 03:38:45PM -0800, Linus Torvalds wrote:
> On Mon, 19 Nov 2001, Simon Kirby wrote:
> >
> > Well, I found out what file has the bog-standard page.
> >
> > open("/home/stevendi//.htaccess", O_RDONLY|O_LARGEFILE) = 4
>
> What filesystem is this?
EXT2.
tune2fs 1.18, 11-Nov-1999 for EXT2 FS 0.5b, 95/08/09
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 31cd0726-a6c1-458b-905e-983df0f7c695
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: filetype sparse_super
Filesystem state: not clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3623040
Block count: 7245307
Reserved block count: 362265
Free blocks: 1278488
Free inodes: 2748224
First block: 0
Block size: 4096
Fragment size: 4096
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16320
Inode blocks per group: 510
Last mount time: Tue Nov 13 13:07:22 2001
Last write time: Mon Nov 19 15:51:24 2001
Mount count: 1
Maximum mount count: 20
Last checked: Tue Nov 13 13:07:21 2001
Check interval: 15552000 (6 months)
Next check after: Sun May 12 14:07:21 2002
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
On Mon, 19 Nov 2001 08:39:29 -0800 (PST), Linus Torvalds
<[email protected]> wrote:
>
>On Mon, 19 Nov 2001, Alan Cox wrote:
>>
>> You can in theory get even stranger effects. Consider
>>
>> if(!(flag&4))
>> {
>> blahblah++;
>> flag|=4;
>> }
>>
>> The compiler is completely at liberty to turn that into
>>
>> test bit 2
>> jz 1f
>> inc blahblah
>> add #4, flag
>> 1f:
>
>That's fine - if you have two threads modifying the same variable at the
>same time, you need to lock it.
>
>That's not the case under discussion.
>
>The case under discussion is gcc writing back values to a variable that
>NEVER HAD ANY VALIDITY, even in the single-threaded case. And it _is_
>single-threaded at that point, you only have other users that test the
>value, not change it.
>
>That's not an optimization, that's just plain broken. It breaks even
>user-level applications that use sigatomic_t.
>
>And note how gcc doesn't actually do it. I'm not saying that gcc is broken
>- I' saying that gcc is NOT broken, and we depend on it being not broken.
Imagine if the storage address involved was an I/O register which was
memory mapped. An arbitrary storage of data might have disasterous
effects.
john alvord
On Mon, 19 Nov 2001, Linus Torvalds wrote:
> On Mon, 19 Nov 2001, Rik van Riel wrote:
> >
> > Oh dear, that's an interesting case, too.
> >
> > __add_to_page_cache() blindly sets the PG_locked bit, but
> > it's possible for other functions to acquire the page lock
> > before that.
>
> No. The page is either already locked (add_to_swap_cache()), or
> exclusively owned by us..
The thing is, the "exclusively owned" situation cannot
be checked in any way, except maybe through the fact
that page->mapping==NULL ...
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On Mon, 19 Nov 2001, Rik van Riel wrote:
>
> The thing is, the "exclusively owned" situation cannot
> be checked in any way, except maybe through the fact
> that page->mapping==NULL ...
Well, you also have to check that the page isn't on the LRU list, so it
would have to be something like
!page->mapping && !PageLRU(page)
which I agree is ugly. It's much better to just move the page->flag
setting into the callers (and most of the callers _can_ trivially check
that they are exclusive owners, because most of them will just have
allocated the page ;)
Linus
On Mon, 19 Nov 2001, Linus Torvalds wrote:
> Well, you also have to check that the page isn't on the LRU list, so it
> would have to be something like
>
> !page->mapping && !PageLRU(page)
>
> which I agree is ugly.
I guess we should have "clean up VM locking" in our TODO
list for 2.5, then. The current combination of 2.0, 2.2
and 2.4 locking makes for rather hard to maintain code ;)
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
Hi!
> > Why is it a compiler bug. You've not declared that variable to be volatile
> > therefore it is only touched in the code flow the compiler is analysing.
>
> Even without volatile, the compiler is very arguably buggy if it writes
> values to your variables that were never supposed to be there.
>
> Take this, for example:
>
> sig_atomic_t value = 1;
>
> int fn()
> {
> value = 2;
> }
>
> And a signal comes in. Even without the volatile, if gcc has written
> _anything_ else than 1 or 2 into the variable, gcc is BROKEN.
imagine
typedef volatile int sig_atomic_t;
> There's no point being a language lawyer and saying that gcc "could write
> anything to value before it writes the final 2". Tha's not true. A compile
> rthat does that is
>
> (a) buggy as hell from a POSIX standpoint
> (b) even apart from POSIX, from a Q-of-I standpoint complete and utter
> crap.
Imagine this:
if (likely(foo))
c = 1
else c = 2
I could see it optimized as
c = 1
if (unlikely(foo))
c = 2
Given enough register pressure.... I've seen similar optimalizations proposed
in "advanced compilers" book.
Pavel
PS: but wrapping access to current->flags in macro is probably okay for now.
I just wanted to show that writing unwanted value is not as broken as you
think.
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.