I can confirm the bug which loses updates to the inn active file when
it's unmapped is present again in 2.4.0-test12.
I put "cp active active.ok" in the rc file before shutting down the
daemon and at the next boot the files are different, every time.
Alexander Viro posted this test case:
#include <unistd.h>
main(argc,argv)
int argc;
char **argv;
{
int fd;
char c=0;
truncate(argv[1], 10);
fd = open(argv[1], 1);
lseek(fd, 16384, 0);
write(fd, &c, 1);
close(fd);
}
but I tried it and it gives the correct result (a 16384 bytes long file
with only the first few bytes non-zeroed).
Linux wonderland 2.4.0-test12 #15 Thu Dec 21 16:40:16 CET 2000 i586 unknown
--
ciao,
Marco
"Marco d'Itri" ([email protected]) wrote:
> I can confirm the bug which loses updates to the inn active file when
> it's unmapped is present again in 2.4.0-test12.
It is also still in 2.4.0-test13-pre4 in case someone thought they had
fixed it.
--
Jeff Lightfoot -- jeffml at pobox.com -- http://thefoots.com/
"I see the light at the end of the tunnel now ... someone please
tell me it's not a train" -- Cracker
On Dec 24, Alexander Viro <[email protected]> wrote:
>> I put "cp active active.ok" in the rc file before shutting down the
>> daemon and at the next boot the files are different, every time.
>
>Could you send me both files? BTW, which filesystem it is?
I use ext2. The files are not corrupted, they just are not updated.
Another data point: at least in some cases, if I stop and start inn
without rebooting the files are the same.
--- active.ok Sun Dec 24 09:58:00 2000
+++ active Sun Dec 24 08:33:34 2000
@@ -1,5 +1,5 @@
control 0000004793 0000004794 y
-control.cancel 0000022865 0000021934 n
+control.cancel 0000022864 0000021934 n
junk 0000001806 0000001807 y
fido.ita.ridere 0000014779 0000014777 y
fido.ita.dewdney 0000004073 0000004074 y
@@ -10,19 +10,19 @@
fido.ita.sf 0000004777 0000004778 y
comp.os.linux.announce 0000010782 0000010779 m
fido.ita.tex 0000000248 0000000249 y
-it.news.annunci 0000004909 0000004787 m
+it.news.annunci 0000004905 0000004787 m
it.news.gestione 0000007878 0000007399 y
fido.ita.tv 0000011944 0000011944 y
it.test 0000000796 0000000797 y
-it.news.gruppi 0000048004 0000047898 y
+it.news.gruppi 0000047994 0000047898 y
it.comp.sicurezza.varie 0000030696 0000030353 y
it.comp.sicurezza.unix 0000002721 0000002722 y
-it.faq 0000001154 0000001091 m
+it.faq 0000001150 0000001091 m
[...]
--
ciao,
Marco
On Sun, 24 Dec 2000, Marco d'Itri wrote:
> On Dec 24, Alexander Viro <[email protected]> wrote:
>
> >> I put "cp active active.ok" in the rc file before shutting down the
> >> daemon and at the next boot the files are different, every time.
> >
> >Could you send me both files? BTW, which filesystem it is?
> I use ext2. The files are not corrupted, they just are not updated.
> Another data point: at least in some cases, if I stop and start inn
> without rebooting the files are the same.
Ok, looks like we just drop the page cache page without writing it out in
some cases. Possibly/probably because we have dropped the dirty bit on the
floor.
Look slike this is a completely different case from the previous
corruptions, it looks more like a VM issue than a FS thing..
Hmm.. munmap() (and exit()) go through "zap_page_range()", which go
through "free_pte()", which definitely copies the dirty bit to the page
structure.
Hmm.. I wonder if such a dirty page might have been moved to the
"inactive_clean" list some way? It shouldn't really be there, as the page
had users, but if it gets on that list we'd not have tested the dirty bit.
Marco, would you mind changing the test in reclaim_page(), somewheer
around line mm/vmscan.c:487 that says:
/* The page is dirty, or locked, move to inactive_dirty list. */
if (page->buffers || TryLockPage(page)) {
...
and change the test to
if (page->buffers || PageDirty(page) || TryLockPage(page)) {
instead? Ie ad the test for "PageDirty(page)" (and order _is_ important:
the TryLockPage() thing must come last, because it has side effects).
(You might add a "printk()" too that triggers when the new condition
happens, just to see if it does indeed happen).
If the page is on the inactive_clean() list, we'll have to find where it
is put there, because it really shouldn't have been there.
Uhhuh. Actually, reading "page_launder()", the buffer clearign case looks
suspiciously like i doesn't check for page accessed or dirty bits. That's
probably it. Maybe there are other cases. Anyway, I'd love to hear if the
above one-liner fixes the corruption for you..
Thanks,
Linus
On Sun, 24 Dec 2000, Linus Torvalds wrote:
>
> Marco, would you mind changing the test in reclaim_page(), somewheer
> around line mm/vmscan.c:487 that says:
Yeah, yeah, it's 7PM Christmas Eve over there, and you're in the middle of
your Christmas dinner. You might feel that it's unreasonable of me to ask
you to test out my latest crazy idea.
How selfish of you.
Get back there in front of the computer NOW. Christmas can wait.
Linus "the Grinch" Torvalds
>
> Get back there in front of the computer NOW. Christmas can wait.
>
> Linus "the Grinch" Torvalds
Hoo - Hoo - Hoo,
you've been very naughty Linus.
Asking people to work on Christmas evening.
My god Linus, that's so naughty that I add
it to my list...
As soon as I'm finished with Futurama,
... I'll get you!
Merry X-Mas
Santa Claus
:))
Linus Torvalds wrote:
> Hmm.. I wonder if such a dirty page might have been moved to the
> "inactive_clean" list some way? It shouldn't really be there, as the page
> had users, but if it gets on that list we'd not have tested the dirty bit.
>
> Marco, would you mind changing the test in reclaim_page(), somewheer
> around line mm/vmscan.c:487 that says:
>
> /* The page is dirty, or locked, move to inactive_dirty list. */
> if (page->buffers || TryLockPage(page)) {
> ...
>
> and change the test to
>
> if (page->buffers || PageDirty(page) || TryLockPage(page)) {
>
> instead? Ie ad the test for "PageDirty(page)"
Good point. Up until recently the page dirty bit wasn't actually being
set anywhere and page->buffers was acting as kind of a surrogate dirty
bit - page_launder would call try_to_free_buffers which would find the
dirty buffers and fail out, but start io first
It looks like PG_dirty is now being used only for swap_cache pages, and
not for buffer cache and page cache pages, is that correct?
--
Daniel
On Sun, 24 Dec 2000, Daniel Phillips wrote:
>
> It looks like PG_dirty is now being used only for swap_cache pages, and
> not for buffer cache and page cache pages, is that correct?
No. PG_dirty is used for all memory mapped pages - be they anonymous or
not.
These days the buffer dirty bits are only used by "write()", because
write() can obviously dirty smaller areas than one page.
Linus
Linus Torvalds <[email protected]> writes:
> On Sun, 24 Dec 2000, Linus Torvalds wrote:
> >
> > Marco, would you mind changing the test in reclaim_page(), somewheer
> > around line mm/vmscan.c:487 that says:
>
Speaking of page_launder() I just stumbled upon two oopsen today on
the UP build. Maybe it could give a hint to someone, I'm not that good
at Oops decoding.
Merry Christmas!
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
c012872e
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[page_launder+510/2156]
EFLAGS: 00010202
eax: 00000000 ebx: c12e2ce8 ecx: c1244474 edx: 00000000
esi: c12e2d04 edi: 00000000 ebp: 00000000 esp: c15d1fb4
ds: 0018 es: 0018 ss: 0018
Process bdflush (pid: 6, stackpage=c15d1000)
Stack: c15d0000 00000000 c15d023a 0008e000 00000000 00000000 00000001 00002933
00000000 c0131e5d 00000003 00000000 00010f00 c146ff88 c146ffc4 c01073fc
c146ffc4 00000078 c146ffc4
Call Trace: [bdflush+141/236] [kernel_thread+40/56]
Code: 8b 40 0c 8b 00 85 c0 0f 84 ba 04 00 00 83 7c 24 10 00 75 73
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
c012872e
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[page_launder+510/2156]
EFLAGS: 00010202
eax: 00000000 ebx: c1260eec ecx: c15d5fe0 edx: c02917f0
esi: c1260f08 edi: 00000000 ebp: 00000000 esp: c15d5f9c
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 4, stackpage=c15d5000)
Stack: 00010f00 00000004 00000000 00000000 00000004 00000000 00000000 00002938
00000000 c01290fc 00000004 00000000 00010f00 c01f77f7 c15d4239 0008e000
c01291c6 00000004 00000000 c146ffb8 00000000 c01073fc 00000000 00000078
Call Trace: [do_try_to_free_pages+52/128] [tvecs+8683/64084] [kswapd+126/288] [kernel_thread+40/56]
Code: 8b 40 0c 8b 00 85 c0 0f 84 ba 04 00 00 83 7c 24 10 00 75 73
--
Zlatko
On Dec 24, Linus Torvalds <[email protected]> wrote:
> /* The page is dirty, or locked, move to inactive_dirty list. */
> if (page->buffers || TryLockPage(page)) {
> ...
>
>and change the test to
>
> if (page->buffers || PageDirty(page) || TryLockPage(page)) {
Done, no change.
Got some articles, restarted the server, all is good.
Got other articles, rebooted and the files now differ.
And I have another problem: I'm experiencing random hangs using X[1] with
2.4.0-test12. After a variable amount of time, some of the times I use X
(I mostly use console) it just freezes hard (no caps lock activity).
I'm not sure if this only happens while using X or it's just less
frequent in console. -test9 works fine and I used it since it has been
released with no ill effects.
My hardware:
00:00.0 Host bridge: VIA Technologies, Inc. VT82C598 [Apollo MVP3] (rev 04)
00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598 [Apollo MVP3 AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C586/A/B PCI-to-ISA [Apollo VP] (rev 41)
00:07.1 IDE interface: VIA Technologies, Inc. VT82C586 IDE [Apollo] (rev 06)
00:07.3 Bridge: VIA Technologies, Inc. VT82C586B ACPI (rev 10)
00:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
01:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G100 [Productiva] AGP (rev 02)
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 267.282
cache size : 64 KB
gcc version 2.95.2 20000220 (Debian GNU/Linux)
[1] Good old stable XF86_SVGA 3.x from debian potato.
--
ciao,
Marco
On 25 Dec 2000, Zlatko Calusic wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Sun, 24 Dec 2000, Linus Torvalds wrote:
> > >
> > > Marco, would you mind changing the test in reclaim_page(), somewheer
> > > around line mm/vmscan.c:487 that says:
> >
>
> Speaking of page_launder() I just stumbled upon two oopsen today on
> the UP build. Maybe it could give a hint to someone, I'm not that good
> at Oops decoding.
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000000c
> printing eip:
> c012872e
> *pde = 00000000
> Oops: 0000
> CPU: 0
> EIP: 0010:[page_launder+510/2156]
I suspected I'm not the only one who is getting these exact same Oopses
(and the lockups that follow them) so earlier today, I've decoded the Oops
I got, and found that the problem is in vmscan.c:line-605, where
page->mapping is NULL and a_ops gets resolved and dereferenced at
0x0000000c.
I leave the fix for the mm experts, I've notified Linus, I guess he's
looking into it.
--
Dan Aloni
[email protected]
Marco d'Itri wrote:
> And I have another problem: I'm experiencing random hangs using X[1] with
> 2.4.0-test12. After a variable amount of time, some of the times I use X
> (I mostly use console) it just freezes hard (no caps lock activity).
> I'm not sure if this only happens while using X or it's just less
> frequent in console. -test9 works fine and I used it since it has been
> released with no ill effects.
This is probably the run_task_queue bug fixed in test13pre3.
Augusto
On Mon, 25 Dec 2000, Marco d'Itri wrote:
> On Dec 24, Linus Torvalds <[email protected]> wrote:
>
> > /* The page is dirty, or locked, move to inactive_dirty list. */
> > if (page->buffers || TryLockPage(page)) {
> > ...
> >
> >and change the test to
> >
> > if (page->buffers || PageDirty(page) || TryLockPage(page)) {
> Done, no change.
> Got some articles, restarted the server, all is good.
> Got other articles, rebooted and the files now differ.
Willing to test some more?
Add a printk() to __remove_inode_page() that complains whenever it removes
a dirty page.
Oh, in order to not see this with swap pages (which _can_ be removed when
they are dirty, if all users of them are gone), add a PageClearDirty() to
"remove_from_swap_cache()" so that we don't get false positives..
Do you get any messages? I don't think you will, but it should be tested.
You might mark it a BUG(), so tht we'll get a stack-trace if it happens.
Assuming we don't lose any PG_dirty bits, we might of course just lose it
from the page tables themselves before it ever even gets to "struct page".
I'm just surprised that it seems to be so repeatable for you - it sounds
like we _never_ actually write out the dirty pages to disk. It's not that
we can lose the dirty bit occasionally, we seem to lose it every time in
your setup.
I wonder if it's something specific innd does. Like "msync()" just being
broken or similar. But the code looks sane.
Hmm.. Can you send me an "strace" of innd when this happens?
> And I have another problem: I'm experiencing random hangs using X[1] with
> 2.4.0-test12.
That's probably the infinite loop in the tty task queue handling, should
be fixed in test13-pre3 or so.
Linus
On Mon, 25 Dec 2000, Linus Torvalds wrote:
>
> Assuming we don't lose any PG_dirty bits, we might of course just lose it
> from the page tables themselves before it ever even gets to "struct page".
> I'm just surprised that it seems to be so repeatable for you - it sounds
> like we _never_ actually write out the dirty pages to disk. It's not that
> we can lose the dirty bit occasionally, we seem to lose it every time in
> your setup.
Nope. I got it.
The thing is even more embarrassing than just losing a dirty bit.
We don't lose any dirty bits (well, we could before, but after adding the
PageDirty() test to reclaim_page() we're ok now).
In fact, we know _exactly_ which pages are dirty, and which pages are not.
We just don't write them out. Because right now the only thing that writes
out dirty pages is memory pressure. "sync()", "fsync()" and "fdatasync()"
will happily ignore dirty pages completely. The thing that made me
overlook that simple thing in testing was that I was testing the new VM
stuff under heavy VM load - to shake out any bugs.
Under heavy VM load, there are no problems, because the memory pressure
will make sure everything gets written out. Under heavy VM load the thing
works just beautifully.
Under _low_, or no, memory pressure, however, the dang thing just stays in
memory. We'll happily reboot with the new contents still cached, in fact.
I bet that if you start something that eats up all your memory, and causes
some nice swapping just before you shut down the machine, your innd active
file will be right as rain after a reboot.
I'm a stupid git. I even remember thinking about the syncing issues at
some point, and then obviously just forgetting about it _completely_.
The simple fix is along the lines of adding code to fsync() that walks the
inode page list and writes out dirty pages.
The clever and clean fix is to split the inode page list into two lists,
one for dirty and one for clean pages, and only walk the dirty list.
Ho ho ho. I _so_ enjoy making a fool out of myself.
Linus
On Dec 25, Linus Torvalds <[email protected]> wrote:
>Add a printk() to __remove_inode_page() that complains whenever it removes
>a dirty page.
>
>Oh, in order to not see this with swap pages (which _can_ be removed when
>they are dirty, if all users of them are gone), add a PageClearDirty() to
>"remove_from_swap_cache()" so that we don't get false positives..
>
>Do you get any messages? I don't think you will, but it should be tested.
I read you found the real cause so that may be bogus, but I have got two
messages while booting. The first showed up while doing the fsck of a 6
GB file systems and killed the process (fscks of smaller partitions
completed successfully), the second occured while initializing
/dev/random and left an unkillable dd process and a stuck boot process
(I gathered this info with sysrq).
Being -test12 unstable for me, if you don't need more data I'll go back
to -test9 until the next release.
>That's probably the infinite loop in the tty task queue handling, should
>be fixed in test13-pre3 or so.
Looks like I missed it, evil vger postmasters unsubscribed me again for
no apparent reason...
--
ciao,
Marco
On Mon, 25 Dec 2000, Marco d'Itri wrote:
>
> >Do you get any messages? I don't think you will, but it should be tested.
>
> I read you found the real cause so that may be bogus, but I have got two
> messages while booting. The first showed up while doing the fsck of a 6
> GB file systems and killed the process (fscks of smaller partitions
> completed successfully), the second occured while initializing
> /dev/random and left an unkillable dd process and a stuck boot process
> (I gathered this info with sysrq).
I'd still love to get the trace for these. I think I have a handle on the
problems, but it would stil be helpful - dropping a dirty page really
shouldn't happen except for the swap cache (and that should have been
plugged by adding the ClearPageDirty()).
Linus
On Mon, Dec 25, 2000 at 01:42:33AM -0800, Linus Torvalds wrote:
We just don't write them out. Because right now the only thing
that writes out dirty pages is memory pressure. "sync()",
"fsync()" and "fdatasync()" will happily ignore dirty pages
completely. The thing that made me overlook that simple thing in
testing was that I was testing the new VM stuff under heavy VM
load - to shake out any bugs.
Does this mean anyone using test13-pre4 should also expect to see
data not being flushed on shutdown?
I'm not seeing data loss (or it's not obvious if I am) but wonder if
I should make a application to run before halt to allocate and walk
all available memory as an interim?
--cw
would it be sensible to write a PCI device interface for controllerless PCI
devices like serial PCI ports? I am now trying to make the older 2.2.x
series LT winmodem patch into the 2.4.0-test13pre4 sources work. I see how
some companies are unable to release all the source code to drivers due to
legal reasons and patent restrictions. Maybe there should be a generic
driver interface for software modems or other devices, so it is easier to -
as an example - write winmodem drivers for the serial driver without
hacking in many sets of "#ifdef LUCENT_MODEM..... modified code.... #endif"
to the serial.c source file.
i am not able to create such a thing, and winmodems are not the most
popular thing to talk about in regards to support. after spending 3 hours
staring at serial.c, as a beginning programmer, and hand copying the
appropriate 2.2.x winnmodem "serial.c" driver code in, i am lost. the
module finally compiles, without error, but complains with an error that
there is an unresolved symbol "jiffie". kind of funny, a jiffie is all
that separates me from turning my brand new laptop into a machine i can use
the modem on. also it is equally fustrating. will this situation improve
in time or what else can i do to get my modem working? arrrgh! even if
the hand-done patching of 2.4.x's serial.c file resulted in a useable
kernel module, i would not like to have to patch it every time i update my
kernel. a winmodem.o module with support for generic interfaces into the
kernel so driver vendors do not need to muck around with serial.c would be
an idea.
my real question to all is where is the support of PCI serial devices at
inside of the kernel? if i have pci bus 0:0.b sharing irq 11 with 0:0.c,
does the linux kernel support both devices working at the same time
(ethernet, and serial port aka winmodem)?
this is probably better off sent to the serial mailing list i know, but i
am more interested in whether all the problems i am having with 4 out of 6
devices on my laptop's PCI bus conflicting, whether this is because the
linux kernel does not support more than one PCI function operating
simultaneously on any given PCI device under the same PCI bus. (
bus:device.function )
right now i get a message that says [IRQ 11 is already used by device
0:8.0] when i load drivers for the device 0:8.1, and the visa-versa message
when loading drivers for device 0:8.0. Is this just a warning, or an
error? i can't tell. sometimes the driver (as is the case with pcmcia
drivers, where slot0 is 0:6.0 and slot1 is 0:6.1) loads anyways, despite
the message about [IRQ 11 is alr...]. othertimes, with my ethernet drivers
and alsa sound drivers, i see the message and the drivers fail to load.
what to do....
merry holidays, all. i apologize this is long and likely off topic. i mean
well though.
-Eric Shattow
[email protected]
On Tue, 26 Dec 2000, Chris Wedgwood wrote:
> On Mon, Dec 25, 2000 at 01:42:33AM -0800, Linus Torvalds wrote:
>
> We just don't write them out. Because right now the only thing
> that writes out dirty pages is memory pressure. "sync()",
> "fsync()" and "fdatasync()" will happily ignore dirty pages
> completely. The thing that made me overlook that simple thing in
> testing was that I was testing the new VM stuff under heavy VM
> load - to shake out any bugs.
>
> Does this mean anyone using test13-pre4 should also expect to see
> data not being flushed on shutdown?
No.
This all only matters to things that do shared writable mmap's.
Almost nothing does that. innd is (sadly) the only regular thing that uses
this, which is why it's always innd that breaks, even if everything else
works.
And even innd is often compiled to use "write()" instead of shared
mappings (it's a config option), so not even all innd's will break.
Linus
> The simple fix is along the lines of adding code to fsync() that walks the
> inode page list and writes out dirty pages.
>
> The clever and clean fix is to split the inode page list into two lists,
> one for dirty and one for clean pages, and only walk the dirty list.
Like the patches that were floating around on the list for the past
few months to make O_SYNC work. Could those be used for it ?
>
On Mon, 25 Dec 2000, Alan Cox wrote:
>
> > The simple fix is along the lines of adding code to fsync() that walks the
> > inode page list and writes out dirty pages.
> >
> > The clever and clean fix is to split the inode page list into two lists,
> > one for dirty and one for clean pages, and only walk the dirty list.
>
> Like the patches that were floating around on the list for the past
> few months to make O_SYNC work. Could those be used for it ?
I haven't seen the patches, but I suspect strongly that they clash with
the ext2 smart buffer sync code.
For example, the old generic_fdatasync() function already walks the list
of pages and could trivially be extended to handle dirty pages in addition
to just dirty blocks. But ext2 uses the faster and more clever "walk just
the dirty buffers of this inode" thing, which never even looks at any
pages.
This is high-level enough that the low-level filesystems should not even
see this functionality, and it doesn't look very hard to do. It's just all
these holidays in the way (and if you think you can skip christmas with a
4-year-old in the house, think AGAIN. This is a sacred time.).
Linus
> the modem on. also it is equally fustrating. will this situation improve
> in time or what else can i do to get my modem working? arrrgh! even if
Only the winmodem folks can tell you. Ask them when they intend to GPL their
drivers or release them with source under other sane licensing. Its their
problem.
> Yeah, yeah, it's 7PM Christmas Eve over there, and you're in the middle of
> your Christmas dinner. You might feel that it's unreasonable of me to ask
> you to test out my latest crazy idea.
>
> How selfish of you.
>
> Get back there in front of the computer NOW. Christmas can wait.
>
> Linus "the Grinch" Torvalds
I can just imagine Xmas at the Torvalds residence, with their annual
tradition of having the kids scream... But dad, other kids have the lights
strung around the trees, not the computer....
Linus, that's why they call them computer 'boxes'... didn't you know you are
supposed to wait for 'boxing day' before climbing back in front of the
screen??
--
--------------------------------------------------------
Michael Peddemors - Senior Consultant
Unix?Administration - WebSite Hosting
Network?Services - Programming
Wizard?Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com
--------------------------------------------------------
(604)?589-0037 Beautiful British Columbia, Canada
--------------------------------------------------------
> This all only matters to things that do shared writable mmap's.
>
> Almost nothing does that. innd is (sadly) the only regular thing that uses
> this, which is why it's always innd that breaks, even if everything else
> works.
btw samba 2.2 makes extensive use of shared writable mmaps (well it uses tdb
which uses shared writable mmaps). In fact it picked up a bug with virtual
aliasing of shared writable mmaps on sparc64 recently.
Anton
> I can just imagine Xmas at the Torvalds residence, with their annual=20
> tradition of having the kids scream... But dad, other kids have the lig=
> hts=20
> strung around the trees, not the computer....
I don't think you get the full picture. I suspect what gets strung up on the
trees at Christmas if Linus does too much hacking is ... Linus
On Mon, 25 Dec 2000, Dan Aloni wrote:
> On 25 Dec 2000, Zlatko Calusic wrote:
>
> > Speaking of page_launder() I just stumbled upon two oopsen today on
> > the UP build. Maybe it could give a hint to someone, I'm not that good
> > at Oops decoding.
>
> I've decoded the Oops I got, and found that the problem is in
> vmscan.c:line-605, where page->mapping is NULL and a_ops gets
> resolved and dereferenced at 0x0000000c.
The code assumes that every page which has the PG_dirty
bit set also has page->mapping set to a valid value.
The BUG() people are getting confirms that this assumption
is not necessarily true and the VM work that's going on will
most likely make it not be true either in some cases.
The (trivial) patch below should fix this problem.
Linus and/or Alan, could you please apply this for the next
pre-patch ?
regards,
Rik
--
Hollywood goes for world dumbination,
Trailer at 11.
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
--- vmscan.c.orig Wed Dec 27 16:48:24 2000
+++ vmscan.c Wed Dec 27 17:14:32 2000
@@ -601,7 +601,7 @@
* Dirty swap-cache page? Write it out if
* last copy..
*/
- if (PageDirty(page)) {
+ if (PageDirty(page) && page->mapping) {
int (*writepage)(struct page *) = page->mapping->a_ops->writepage;
int result;
Rik van Riel <[email protected]> writes:
> On Mon, 25 Dec 2000, Dan Aloni wrote:
> > On 25 Dec 2000, Zlatko Calusic wrote:
> >
> > > Speaking of page_launder() I just stumbled upon two oopsen today on
> > > the UP build. Maybe it could give a hint to someone, I'm not that good
> > > at Oops decoding.
> >
> > I've decoded the Oops I got, and found that the problem is in
> > vmscan.c:line-605, where page->mapping is NULL and a_ops gets
> > resolved and dereferenced at 0x0000000c.
>
> The code assumes that every page which has the PG_dirty
> bit set also has page->mapping set to a valid value.
>
> The BUG() people are getting confirms that this assumption
> is not necessarily true and the VM work that's going on will
> most likely make it not be true either in some cases.
>
> The (trivial) patch below should fix this problem.
>
> Linus and/or Alan, could you please apply this for the next
> pre-patch ?
>
Looking at the patch, I'm practically sure it will cure the symptoms.
But I'm still slightly worried about those pages we skip in
there. Maybe we should at least try to discover what are those pages,
and then maybe it will become obvious what we need (or not) to do with
them.
Some strategic printk()s could probably give us some clue.
Too bad I lost track of the recent changes due to catastrofic time
shortage. And that is a shame as I'm very satisfied with the current
VM code, thanks to your hard work Rik.
> regards,
>
> Rik
> --
> Hollywood goes for world dumbination,
> Trailer at 11.
>
> http://www.surriel.com/
> http://www.conectiva.com/ http://distro.conectiva.com.br/
>
>
> --- vmscan.c.orig Wed Dec 27 16:48:24 2000
> +++ vmscan.c Wed Dec 27 17:14:32 2000
> @@ -601,7 +601,7 @@
> * Dirty swap-cache page? Write it out if
> * last copy..
> */
> - if (PageDirty(page)) {
> + if (PageDirty(page) && page->mapping) {
> int (*writepage)(struct page *) = page->mapping->a_ops->writepage;
> int result;
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/
>
--
Zlatko
On Wed, 27 Dec 2000, Rik van Riel wrote:
>
> The (trivial) patch below should fix this problem.
It must be wrong.
If we have a dirty page on the LRU lists, that page _must_ have a mapping.
What semantics would you say a non-mapped page has? What are the LRU
routines supposed to do with such a page?
The bug is somewhere else, and your patch is just papering it over. We
should not have a page without a mapping on the LRU lists in the first
place, except if the page has anonymous buffers (and such a page cannot
legally be dirty as things are in the standard kernel - Chris Mason has
been working on stuff that would make that a normal thing, but it would
also make the page have a mapping).
We'd better add a debug check that makes sure that we don't have
non-mapped non-buffer pages on the LRU lists, and figure out how such a
thing could happen.
Linus
On Wed, Dec 27, 2000 at 03:41:04PM -0800, Linus Torvalds wrote:
> It must be wrong.
>
> If we have a dirty page on the LRU lists, that page _must_ have a mapping.
What about pages with a mapping but without a writepage function ? or pages
whose writepage function fails ? The current code seems to simply put the
page onto the active list in that case, which seems just as wrong to me.
> The bug is somewhere else, and your patch is just papering it over. We
> should not have a page without a mapping on the LRU lists in the first
> place, except if the page has anonymous buffers (and such a page cannot
So is there any legal reason we could ever get to page_active ? Removing
that code (or replacing it with BUG()) certainly would make page_launder
more readable.
On Wed, 27 Dec 2000, Philipp Rumpf wrote:
> On Wed, Dec 27, 2000 at 03:41:04PM -0800, Linus Torvalds wrote:
> > It must be wrong.
> >
> > If we have a dirty page on the LRU lists, that page _must_ have a mapping.
>
> What about pages with a mapping but without a writepage function ? or pages
> whose writepage function fails ? The current code seems to simply put the
> page onto the active list in that case, which seems just as wrong to me.
ramfs. It doesn't have a writepage() function, as there is no backing
store.
> > The bug is somewhere else, and your patch is just papering it over. We
> > should not have a page without a mapping on the LRU lists in the first
> > place, except if the page has anonymous buffers (and such a page cannot
>
> So is there any legal reason we could ever get to page_active ? Removing
> that code (or replacing it with BUG()) certainly would make page_launder
> more readable.
Apart from the "we have no backing store", there is no legal reason to put
it back on the active list that I can see.
Linus
On Wed, 27 Dec 2000, Philipp Rumpf wrote:
> On Wed, Dec 27, 2000 at 03:41:04PM -0800, Linus Torvalds wrote:
> > It must be wrong.
> >
> > If we have a dirty page on the LRU lists, that page _must_ have a mapping.
>
> What about pages with a mapping but without a writepage function ? or pages
> whose writepage function fails ? The current code seems to simply put the
> page onto the active list in that case, which seems just as wrong to me.
I noticed the offset of 'mapping' in struct page doesn't match the offset
which get resolved at obsolute address 0x0000000c according to the Oops,
but it does matches the offset of a_ops in struct address_space, which
makes me wonder if it's possible that a_ops == NULL. Suggestions?
--
Dan Aloni
[email protected]
On Thu, 28 Dec 2000, Dan Aloni wrote:
> On Wed, 27 Dec 2000, Philipp Rumpf wrote:
>
> > On Wed, Dec 27, 2000 at 03:41:04PM -0800, Linus Torvalds wrote:
> > > It must be wrong.
> > >
> > > If we have a dirty page on the LRU lists, that page _must_ have a mapping.
> >
> > What about pages with a mapping but without a writepage function ? or pages
> > whose writepage function fails ? The current code seems to simply put the
> > page onto the active list in that case, which seems just as wrong to me.
>
> I noticed the offset of 'mapping' in struct page doesn't match the offset
> which get resolved at obsolute address 0x0000000c according to the Oops,
> but it does matches the offset of a_ops in struct address_space, which
> makes me wonder if it's possible that a_ops == NULL. Suggestions?
I was confused, or maybe temporal insanity, ignore that ;-)
--
Dan Aloni
[email protected]
(cc' list trimmed)
On Wed, Dec 27, 2000 at 04:27:06PM -0800, Linus Torvalds wrote:
ramfs. It doesn't have a writepage() function, as there is no
backing store.
this remind me; perhaps you or Al could answer this.
How hard would it be to have ramfs backed by swap? The goal being
try to achieve something like a FreeBSDs mfs.
I use ramfs for /tmp on my laptop -- it's very handy because it
extends the amount of the the disk had spent spun down and therefore
battery life; but writing large files into /tmp can blow away the
system or at the very least eat away at otherwise usable ram. Not
terribly desirable.
--cw
On Thu, 28 Dec 2000, Chris Wedgwood wrote:
> (cc' list trimmed)
[further]
> I use ramfs for /tmp on my laptop -- it's very handy because it
> extends the amount of the the disk had spent spun down and therefore
> battery life; but writing large files into /tmp can blow away the
> system or at the very least eat away at otherwise usable ram. Not
> terribly desirable.
mph.
does anyone other than me think that the pm code is *way* too agressive about
spinning down the hard drive? my 256mb laptop (2.2.16) will only spin down the
disk for about 30 seconds before it decides it's got something else it feels
like writing out, and spins back up. Spinnup has got to be more wasteful than
just leaving the drive spinning...
unless the vfs code is cleaning dirty pages to disk when there's no activity,
then any time the vfs writes to disk, it's got something that's really hot and
shouldn't stay in ram ... if that's the case, and the frequency of such
occurrences is so high, perhaps the pm code should wait a couple of times as
long as it does before it spins down. basically, if there's still code doing
stuff, the drive should be spinning. only spindown when we're sure the user has
walked away for good :)
ari
On Thu, Dec 28, 2000 at 12:06:47AM -0500, Ari Heitner wrote:
does anyone other than me think that the pm code is *way* too
agressive about spinning down the hard drive? my 256mb laptop
(2.2.16) will only spin down the disk for about 30 seconds before
it decides it's got something else it feels like writing out, and
spins back up. Spinnup has got to be more wasteful than just
leaving the drive spinning...
use hdparm to increase the spin-down time then
--cw
> does anyone other than me think that the pm code is *way* too agressive about
> spinning down the hard drive? my 256mb laptop (2.2.16) will only spin down the
> disk for about 30 seconds before it decides it's got something else it feels
> like writing out, and spins back up. Spinnup has got to be more wasteful than
> just leaving the drive spinning...
Take that up with your bios. Hard disk power management is done by the drive
and bios settings. hdparm will let you override it. You can also get good
results using noatime to avoid atime writes on reading
On Wed, 27 Dec 2000, Linus Torvalds wrote:
> On Wed, 27 Dec 2000, Rik van Riel wrote:
> >
> > The (trivial) patch below should fix this problem.
>
> It must be wrong.
>
> If we have a dirty page on the LRU lists, that page _must_ have
> a mapping.
Hmm, last I looked buffercache pages didn't have
page->mapping set ...
And before anyone gets afraid that pure buffercache
pages get skipped ... they're caught by page_launder()
just fine, in the `if (page->buffers)' code below the
dirty page writeout code.
regards,
Rik
--
Hollywood goes for world dumbination,
Trailer at 11.
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
Zlatko Calusic wrote:
>
> Rik van Riel <[email protected]> writes:
>
> > On Mon, 25 Dec 2000, Dan Aloni wrote:
> > > On 25 Dec 2000, Zlatko Calusic wrote:
> > >
> > > > Speaking of page_launder() I just stumbled upon two oopsen today on
> > > > the UP build. Maybe it could give a hint to someone, I'm not that good
> > > > at Oops decoding.
> > >
> > > I've decoded the Oops I got, and found that the problem is in
> > > vmscan.c:line-605, where page->mapping is NULL and a_ops gets
> > > resolved and dereferenced at 0x0000000c.
> >
> > The code assumes that every page which has the PG_dirty
> > bit set also has page->mapping set to a valid value.
> >
> > The BUG() people are getting confirms that this assumption
> > is not necessarily true and the VM work that's going on will
> > most likely make it not be true either in some cases.
> >
> > The (trivial) patch below should fix this problem...
> >
> Looking at the patch, I'm practically sure it will cure the symptoms.
> But I'm still slightly worried about those pages we skip in
> there. Maybe we should at least try to discover what are those pages,
> and then maybe it will become obvious what we need (or not) to do with
> them.
>
> Some strategic printk()s could probably give us some clue.
Between line 573 and 594 the page can have 1 user and be unlocked, so it
can be removed by invalidate_inode_pages, and the mapping will be
cleared here:
http://innominate.org/~graichen/projects/lxr/source/mm/filemap.c?v=v2.3#L98
--
Daniel
On Thu, 28 Dec 2000, Rik van Riel wrote:
> On Wed, 27 Dec 2000, Linus Torvalds wrote:
> > On Wed, 27 Dec 2000, Rik van Riel wrote:
> > >
> > > The (trivial) patch below should fix this problem.
> >
> > It must be wrong.
> >
> > If we have a dirty page on the LRU lists, that page _must_ have
> > a mapping.
>
> Hmm, last I looked buffercache pages didn't have
> page->mapping set ...
OK, you're right ;)
We never set PG_dirty for buffercache pages, so a
pure buffercache page shouldn't be caught here...
I've made a small debugging patch that simply checks
for this illegal state in add_page_to_active_list and
add_page_to_inactive_dirty_list.
regards,
Rik
--
Hollywood goes for world dumbination,
Trailer at 11.
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
--- include/linux/swap.h.orig Thu Dec 28 12:29:29 2000
+++ include/linux/swap.h Thu Dec 28 12:31:29 2000
@@ -206,7 +206,11 @@
#define ZERO_PAGE_BUG \
if (page_count(page) == 0) BUG();
+#define DIRTY_NO_MAPPING \
+ if (PageDirty(page) && !page->mapping) BUG();
+
#define add_page_to_active_list(page) { \
+ DIRTY_NO_MAPPING \
DEBUG_ADD_PAGE \
ZERO_PAGE_BUG \
SetPageActive(page); \
@@ -215,6 +219,7 @@
}
#define add_page_to_inactive_dirty_list(page) { \
+ DIRTY_NO_MAPPING \
DEBUG_ADD_PAGE \
ZERO_PAGE_BUG \
SetPageInactiveDirty(page); \
Linus Torvalds wrote:
>
> On Wed, 27 Dec 2000, Philipp Rumpf wrote:
>
> > On Wed, Dec 27, 2000 at 03:41:04PM -0800, Linus Torvalds wrote:
> > > It must be wrong.
> > >
> > > If we have a dirty page on the LRU lists, that page _must_ have a mapping.
> >
> > What about pages with a mapping but without a writepage function ? or pages
> > whose writepage function fails ? The current code seems to simply put the
> > page onto the active list in that case, which seems just as wrong to me.
>
> ramfs. It doesn't have a writepage() function, as there is no backing
> store.
>
> > > The bug is somewhere else, and your patch is just papering it over. We
> > > should not have a page without a mapping on the LRU lists in the first
> > > place, except if the page has anonymous buffers (and such a page cannot
> >
> > So is there any legal reason we could ever get to page_active ? Removing
> > that code (or replacing it with BUG()) certainly would make page_launder
> > more readable.
>
> Apart from the "we have no backing store", there is no legal reason to put
> it back on the active list that I can see.
It's logical that PageDirty should never be get for ramfs, and a ramfs
page should never have buffers on it. With this and Chris's anon_space
mapping can we replace the check for null ->writepage with BUG? With
the anon_space mapping we should be able to do the same ford for
->mapping.
Though these things aren't strictly bugs, having to check multiple paths
for everything is slowing us down in picking off the fluff.
--
Daniel
On Thu, 28 Dec 2000, Daniel Phillips wrote:
> It's logical that PageDirty should never be get for ramfs,
No. Not setting PageDirty will cause the system to move the
page to the inactive_clean list and happily reclaim your data.
We _have to_ use something like PageDirty for this, and
checking for the ->writepage method will even allow us to
do stuff like dynamically switching swapping support for
ramfs on/off (or other funny things).
regards,
Rik
--
Hollywood goes for world dumbination,
Trailer at 11.
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
On Thu, 28 Dec 2000, Rik van Riel wrote:
> On Thu, 28 Dec 2000, Daniel Phillips wrote:
>
> > It's logical that PageDirty should never be get for ramfs,
>
> No. Not setting PageDirty will cause the system to move the
> page to the inactive_clean list and happily reclaim your data.
>
> We _have to_ use something like PageDirty for this, and
> checking for the ->writepage method will even allow us to
> do stuff like dynamically switching swapping support for
> ramfs on/off (or other funny things).
You're suggesting using the absence of a method as a kind of flag, but
the code is really too full of obscure stuff like that already.
How about taking an extra user on the ramfs pages instead. It doesn't
sound right to set PageDirty when you are not requesting IO.
--
Daniel
Rik van Riel wrote:
>
> On Thu, 28 Dec 2000, Rik van Riel wrote:
> > On Wed, 27 Dec 2000, Linus Torvalds wrote:
> > > On Wed, 27 Dec 2000, Rik van Riel wrote:
> > > >
> > > > The (trivial) patch below should fix this problem.
> > >
> > > It must be wrong.
> > >
> > > If we have a dirty page on the LRU lists, that page _must_ have
> > > a mapping.
> >
> > Hmm, last I looked buffercache pages didn't have
> > page->mapping set ...
>
> OK, you're right ;)
>
> We never set PG_dirty for buffercache pages, so a
> pure buffercache page shouldn't be caught here...
But we should, and something that does it is buried in Chris Mason's
patch under the thread [RFC] changes to buffer.c. Chris is also trying
some other fancy things in that patch, but the buffer cache mapping part
is actually pretty simple.
For buffer cache pages I think the current thinking is that PG_dirty
should stay on until the last dirty buffer on the page has been
submitted for writing.
--
Daniel
On Thu, 28 Dec 2000, Rik van Riel wrote:
>
> I've made a small debugging patch that simply checks
> for this illegal state in add_page_to_active_list and
> add_page_to_inactive_dirty_list.
I bet it won't catch the real bad guy, which almost certainly is the
"remove_from_swap_cache()" thing (it should do a ClearPageDirty() before
it removes it from the swapper_inode mapping).
Linus
On Thursday, December 28, 2000 16:15:48 +0100 Daniel Phillips
<[email protected]> wrote:
> On Thu, 28 Dec 2000, Rik van Riel wrote:
>> On Thu, 28 Dec 2000, Daniel Phillips wrote:
>>
>> > It's logical that PageDirty should never be get for ramfs,
>>
>> No. Not setting PageDirty will cause the system to move the
>> page to the inactive_clean list and happily reclaim your data.
>>
>> We _have to_ use something like PageDirty for this, and
>> checking for the ->writepage method will even allow us to
>> do stuff like dynamically switching swapping support for
>> ramfs on/off (or other funny things).
>
> You're suggesting using the absence of a method as a kind of flag, but
> the code is really too full of obscure stuff like that already.
>
> How about taking an extra user on the ramfs pages instead. It doesn't
> sound right to set PageDirty when you are not requesting IO.
I think a dirty page without a writepage func seems a bit broken. How
about we give ramfs a writepage func that just returns 1. That way nobody
does any special if (ramfs_page(page)) kinds of tests...
-chris
On Thu, 28 Dec 2000, Daniel Phillips wrote:
>
> It's logical that PageDirty should never be get for ramfs, and a ramfs
> page should never have buffers on it.
What?
No no no.
You're obviously right that ramfs will never have buffers on the page, but
why shouldn't a ramfs page be dirty?
Of _course_ a ramfs page is dirty - that's the only thing that tells the
VM that it can't be thrown out.
Linus
On Thu, 28 Dec 2000, Chris Mason wrote:
> I think a dirty page without a writepage func seems a bit
> broken. How about we give ramfs a writepage func that just
> returns 1. That way nobody does any special if
> (ramfs_page(page)) kinds of tests...
This will lead to the ramfs pages staying on the inactive_dirty
list forever, deadlocking the system.
Rik
--
Hollywood goes for world dumbination,
Trailer at 11.
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
On Thu, 28 Dec 2000, Linus Torvalds wrote:
> On Thu, 28 Dec 2000, Rik van Riel wrote:
> >
> > I've made a small debugging patch that simply checks
> > for this illegal state in add_page_to_active_list and
> > add_page_to_inactive_dirty_list.
>
> I bet it won't catch the real bad guy, which almost certainly is
> the "remove_from_swap_cache()" thing (it should do a
> ClearPageDirty() before it removes it from the swapper_inode
> mapping).
Ohhh indeed. This is a very likely candidate which is
trivial to fix.
regards,
Rik
--
Hollywood goes for world dumbination,
Trailer at 11.
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
On Thursday, December 28, 2000 15:51:24 -0200 Rik van Riel
<[email protected]> wrote:
> On Thu, 28 Dec 2000, Chris Mason wrote:
>
>> I think a dirty page without a writepage func seems a bit
>> broken. How about we give ramfs a writepage func that just
>> returns 1. That way nobody does any special if
>> (ramfs_page(page)) kinds of tests...
>
> This will lead to the ramfs pages staying on the inactive_dirty
> list forever, deadlocking the system.
>
This wouldn't be the first time this week I've read this part of
page_launder wrong, but I don't see a difference between a NULL writepage
func, and a writepage func that returns 1. I read the code like this:
if(PageDirty(page)) {
...
if (!writepage)
goto page_active ;
...
result = writepage(page)
if (result != 1)
continue ;
SetPageDirty(page);
goto page_active ;
}
-chris
[in vmscan.c]
> Between line 573 and 594 the page can have 1 user and be unlocked, so it
> can be removed by invalidate_inode_pages, and the mapping will be
> cleared here:
> http://innominate.org/~graichen/projects/lxr/source/mm/filemap.c?v=v2.3#L98
This seems like the obvious thing to do:
--- 2.4.0-test13.clean/mm/filemap.c Fri Dec 29 03:14:58 2000
+++ 2.4.0-test13/mm/filemap.c Fri Dec 29 03:16:21 2000
@@ -96,6 +96,7 @@
remove_page_from_inode_queue(page);
remove_page_from_hash_queue(page);
page->mapping = NULL;
+ ClearPageDirty(page);
}
void remove_inode_page(struct page *page)
On Thu, 28 Dec 2000, Chris Wedgwood wrote:
>
> this remind me; perhaps you or Al could answer this.
>
> How hard would it be to have ramfs backed by swap? The goal being
> try to achieve something like a FreeBSDs mfs.
>
> I use ramfs for /tmp on my laptop -- it's very handy because it
> extends the amount of the the disk had spent spun down and therefore
> battery life; but writing large files into /tmp can blow away the
> system or at the very least eat away at otherwise usable ram. Not
> terribly desirable.
Jeff Garzik had the code to do this, and the new shared memory code should
be able to be massaged to handle this all without actually bloating the
kernel (ie "ramfs" would still stay very very tiny, just taking advantage
of the common code that the VM layer already has to support for other
things).
Linus
On Thu, 28 Dec 2000, Daniel Phillips wrote:
> [in vmscan.c]
> > Between line 573 and 594 the page can have 1 user and be unlocked, so it
> > can be removed by invalidate_inode_pages, and the mapping will be
> > cleared here:
> > http://innominate.org/~graichen/projects/lxr/source/mm/filemap.c?v=v2.3#L98
>
> This seems like the obvious thing to do:
>
> --- 2.4.0-test13.clean/mm/filemap.c Fri Dec 29 03:14:58 2000
> +++ 2.4.0-test13/mm/filemap.c Fri Dec 29 03:16:21 2000
> @@ -96,6 +96,7 @@
> remove_page_from_inode_queue(page);
> remove_page_from_hash_queue(page);
> page->mapping = NULL;
> + ClearPageDirty(page);
> }
>
> void remove_inode_page(struct page *page)
No, I'd much rather have
if (PageDirty(page)) BUG();
there, and then have the free_swap_cache code clear the dirty bit.
We don't want to lose dirty bits by mistake. The only cases where it's ok
to clear the dirty bit is when we truncate a page completely (so it won't
be needed and obviously really shouldn't be written out) and when we've
lost the last user of a swap cache entry.
Any other cases might be bugs, where we remove a page from a mapping
without noticing that it is dirty (we had this bug in reclaim_pages(), for
example).
Linus
> > I use ramfs for /tmp on my laptop -- it's very handy because it
> > extends the amount of the the disk had spent spun down and therefore
> > battery life; but writing large files into /tmp can blow away the
> > system or at the very least eat away at otherwise usable ram. Not
> > terribly desirable.
>
> Jeff Garzik had the code to do this, and the new shared memory code should
> be able to be massaged to handle this all without actually bloating the
> kernel (ie "ramfs" would still stay very very tiny, just taking advantage
> of the common code that the VM layer already has to support for other
> things).
The ramfs maintainer has patches (in -ac) which deal with the size limiting
of RAMfs. I'm using it on a PDA and its really really nice to be able to
pop up a GUI app and drag the bar to '60% for apps' like other PDA systems ;)
They do touch the core vm/vfs code for one callback, which would be nice to
lose but not obvious it can be
Linus Torvalds wrote:
> No, I'd much rather have
>
> if (PageDirty(page)) BUG();
>
> there, and then have the free_swap_cache code clear the dirty bit.
>
> We don't want to lose dirty bits by mistake. The only cases where it's ok
> to clear the dirty bit is when we truncate a page completely (so it won't
> be needed and obviously really shouldn't be written out) and when we've
> lost the last user of a swap cache entry.
>
> Any other cases might be bugs, where we remove a page from a mapping
> without noticing that it is dirty (we had this bug in reclaim_pages(), for
> example).
And in this case it's clear we lose data with nfs and smbfs that way.
Maybe this is more like it:
--- 2.4.0-test13.clean/mm/filemap.c Fri Dec 29 03:14:58 2000
+++ 2.4.0-test13/mm/filemap.c Fri Dec 29 04:13:27 2000
@@ -132,7 +132,7 @@
curr = curr->next;
/* We cannot invalidate a locked page */
- if (TryLockPage(page))
+ if (PageDirty(page) || TryLockPage(page))
continue;
/* Neither can we invalidate something in use.. */
Linus Torvalds wrote:
> We don't want to lose dirty bits by mistake. The only cases where it's ok
> to clear the dirty bit is when we truncate a page completely (so it won't
> be needed and obviously really shouldn't be written out) and when we've
> lost the last user of a swap cache entry.
>
> Any other cases might be bugs, where we remove a page from a mapping
> without noticing that it is dirty (we had this bug in reclaim_pages(), for
> example).
I tried to go the lazy way the first time. This time I put the BUG in
there and then went chasing all the places that do
(__)remove_inode_page. There are tentacles all over the place but I
*think* I found them all. That turned up one more place needing
changing, and this is one you already spotted a couple of days ago, in
reclaim_page. I subjected this patch to some dbenching without
triggering the BUG.
The try_to_unuse function calls delete_from_swap_cache and this is
pretty unfamiliar stuff for me, but it looks like the page is just
freshly read and couldn't be dirty. There's one more case in
arch/68K/atari/stram.c (unswap_by_read), similar to try_to_unuse.
OK, I see you just posted -pre5 while I was making the patch, but here
it is anyway, as a cross-check.
--- 2.4.0-test13-pre4.clean/mm/filemap.c Fri Dec 29 03:14:58 2000
+++ 2.4.0-test13-pre4/mm/filemap.c Fri Dec 29 04:29:09 2000
@@ -96,6 +96,7 @@
remove_page_from_inode_queue(page);
remove_page_from_hash_queue(page);
page->mapping = NULL;
+ if (PageDirty(page)) BUG();
}
void remove_inode_page(struct page *page)
@@ -132,7 +133,7 @@
curr = curr->next;
/* We cannot invalidate a locked page */
- if (TryLockPage(page))
+ if (PageDirty(page) || TryLockPage(page))
continue;
/* Neither can we invalidate something in use.. */
--- 2.4.0-test13-pre4.clean/mm/vmscan.c Fri Dec 29 03:14:58 2000
+++ 2.4.0-test13-pre4/mm/vmscan.c Fri Dec 29 04:30:48 2000
@@ -484,7 +484,7 @@
}
/* The page is dirty, or locked, move to inactive_dirty list. */
- if (page->buffers || TryLockPage(page)) {
+ if (page->buffers || PageDirty(page) || TryLockPage(page)) {
del_page_from_inactive_clean_list(page);
add_page_to_inactive_dirty_list(page);
continue;
On Thu, 28 Dec 2000, Daniel Phillips wrote:
>
> OK, I see you just posted -pre5 while I was making the patch, but here
> it is anyway, as a cross-check.
Ok, pre-5 should have all the same places you found already fixed, but
please do give it some heavy-duty testing to make sure there isn't
anything hidden..
Linus
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> does anyone other than me think that the pm code is *way* too agressive about
> spinning down the hard drive? my 256mb laptop (2.2.16) will only spin down the
> disk for about 30 seconds before it decides it's got something else it feels
> like writing out, and spins back up. Spinnup has got to be more wasteful than
> just leaving the drive spinning...
Yup, I find this - especially if I hang around in X for a while.
- --
""""" "" " " " " " " " " """ " " " " "" " "
"" " "" """ "" " " " " "" " """ """ "
"" "" """ " " """ " " " " """ " "" " ""
"" """" " " " " " " """" " " """" " ""
" " " "" " " " "" """" " " """" " ""
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.4 (GNU/Linux)
Comment: Made with pgp4pine
iD8DBQE6S7JJRcGgB3aidfkRAhpkAJ9UYVhD+sRqADqUMm2i+UgbuYS8kACgzG4E
WhqPEhm6XHixqHpUOFQ4els=
=yQKY
-----END PGP SIGNATURE-----
Linus Torvalds wrote:
> Ok, pre-5 should have all the same places you found already fixed, but
> please do give it some heavy-duty testing to make sure there isn't
> anything hidden..
I've beaten on it fairly heavily with the BUG in there as you suggested,
with no problems.
This kernel even seems a little faster:
Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
Test: dbench 48
pre4 9.5 MB/sec 11 min 6 secs
pre5 10.8 MB/sec 9 min 48 secs
I think it would be an awfully good idea to let the BUG out for mass
consumption:
+ if (PageDirty(page)) BUG();
remove_page_from_inode_queue(page);
remove_page_from_hash_queue(page);
page->mapping = NULL;
--
Daniel
On Thu, Dec 28, 2000 at 10:50:48AM -0800, Linus Torvalds wrote:
> I use ramfs for /tmp on my laptop -- it's very handy because it
> extends the amount of the the disk had spent spun down and therefore
> battery life; but writing large files into /tmp can blow away the
> system or at the very least eat away at otherwise usable ram. Not
> terribly desirable.
Jeff Garzik had the code to do this, and the new shared memory code should
be able to be massaged to handle this all without actually bloating the
kernel (ie "ramfs" would still stay very very tiny, just taking advantage
of the common code that the VM layer already has to support for other
things).
I would prefer we leave ramfs alone as is -- it makes an excellent
starting point for a new fs and is fairly simple to grok. If we are
to add any more complexity here like the size limiting patches or the
use of a backing store, I'd like to have this as a new filesystem,
something like 'vmfs' or some such.
ramfs is small simple and elegant; for mere mortals like me it
contains enough to help understand what is required of a filesystem
without obscuring this fact. I'd hate to see that change.
Jeff (or indeed anyone); if you have the patch Linus is talking about
somewhere (even and old one) could you sen it my way please (off the
list).
--cw
On Thu, 28 Dec 2000, Alan Cox wrote:
> The ramfs maintainer has patches (in -ac) which deal with the size limiting
> of RAMfs. I'm using it on a PDA and its really really nice to be able to
> pop up a GUI app and drag the bar to '60% for apps' like other PDA systems ;)
May I ask which PDA do you have running Linux and worth a few bucks?
Pau
Chris Wedgwood <[email protected]> writes:
> I would prefer we leave ramfs alone as is -- it makes an excellent
> starting point for a new fs and is fairly simple to grok. If we are
> to add any more complexity here like the size limiting patches or the
> use of a backing store, I'd like to have this as a new filesystem,
> something like 'vmfs' or some such.
That's shm fs + read and write which should be easy to add.
> ramfs is small simple and elegant; for mere mortals like me it
> contains enough to help understand what is required of a filesystem
> without obscuring this fact. I'd hate to see that change.
yes. That's why I copied a lot of the ramfs code into mm/shmem.c
Christoph