I know that our number of users has increased, but I doubt that the increase is
sufficient to match the marked increase in bug reports on reiserfs-list. Please
be patient as we work on this. We will issue a patch this week that will fix
some bugs (loss of the NFS i_generation count, and space leakage on crash due to
preallocated blocks being lost).
We will also change the mkreiserfs default to create the new 2.4-only
format, as the current default (we have belatedly realized) is probably why
many users report that they can't create large files.
We have a bug affecting add_entry which we suspect is due to our rename not
being adequately atomic and leaving hidden directory entries in the filesystem,
and we are exploring how this might happen (improper journaling? we don't yet
know....). Treat this description with the usual skepticism attached to any
explanation of a bug not fixed yet; our diagnosis continues.... This is the
most worrisome bug for us stability-wise. It seems about one user a day
encounters it.
This patch also definitely won't fix the bug that adds zeros to syslog files,
which we are desperate to learn how to reproduce at our site.
Thank you for your patience.
Hans
Ok, how about we list the known bugs:
zeros in log files, apparently only between bytes 2048 and 4096 (not
reproduced yet).
preallocated block leak on crash (fix in testing)
hidden directory entry cleanup (still reproducing, very hard to hit).
knfsd (patches in testing).
oops in reiserfs_symlink, create_virtual_node (bug in redhat gcc 2.96,
fixed by downloading the update).
We've also had a few reports of other corruptions, most of which have been
traced to hardware problems. There are two where I'm not sure of the cause
yet, but the method to trigger the bug was too simple to not be a hardware
problem.
-chris
On Wed, Feb 07, 2001 at 10:47:09AM -0500, Chris Mason wrote:
>
> Ok, how about we list the known bugs:
>
> zeros in log files, apparently only between bytes 2048 and 4096 (not
> reproduced yet).
Could this bug be related to the corruption that people with
new VIA chipsets have also been reporting on ext2? It seems similar
because of the location of the corruption:
http://marc.theaimsgroup.com/?l=linux-kernel&m=98147483712620&w=2
Anyway, it can't hurt to ask the bug reporters whether they're using a
newer VIA chipset and, if so, whether they will upgrade their BIOS, which
seems to fix the problem.
-Dave
On Wednesday, February 07, 2001 08:38:54 AM -0800 David Rees
<[email protected]> wrote:
> On Wed, Feb 07, 2001 at 10:47:09AM -0500, Chris Mason wrote:
>>
>> Ok, how about we list the known bugs:
>>
>> zeros in log files, apparently only between bytes 2048 and 4096 (not
>> reproduced yet).
>
> Could this bug be related to the reported corruption that people with
> new VIA chipsets have been also reporting on ext2? It seems similar
> because of the location of the corruption:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=98147483712620&w=2
>
> Anyway, it can't hurt to ask the bug reported if they're using a
> newer VIA chipset and see if they will upgrade their BIOS which seems
> to fix the problem.
I'd love to blame this on VIA problems, but people are seeing it on other
chipsets too ;-)
People who report this aren't seeing general corruption, just zeros in
files of specific sizes. So, it really should be a reiserfs bug.
-chris
On Wed, Feb 07, 2001 at 10:47:09AM -0500, Chris Mason wrote:
>
>
> Ok, how about we list the known bugs:
>
> zeros in log files, apparently only between bytes 2048 and 4096 (not
> reproduced yet).
>
> preallocated block leak on crash (fix in testing)
>
> hidden directory entry cleanup (still reproducing, very hard to hit).
>
> knfsd (patches in testing).
>
> oops in reiserfs_symlink, create_virtual_node (bug in redhat gcc 2.96,
> fixed by downloading the update).
>
> We've also had a few reports of other corruptions, most of which have been
> traced to hardware problems. There are two where I'm not sure of the cause
> yet, but the method to trigger the bug was too simple to not be a hardware
> problem.
So could some of these bugs also be present in the 3.5.x version of reiserfs?
Will you be fixing them for that version?
Vedran Rodic
On Wednesday, February 07, 2001 07:41:25 PM +0100 Vedran Rodic
<[email protected]> wrote:
>
> So could some of this bugs also be present in 3.5.x version of reiserfs?
> Will you be fixing them for that version?
>
This list of reiserfs bugs was all specific to the 3.6.x versions, and they
don't appear with the 3.5.x code. You will probably have problems if you
compile 3.5.x reiserfs with an unpatched redhat gcc 2.96, though.
-chris
On Wed, 7 Feb 2001, Chris Mason wrote:
>
>
> On Wednesday, February 07, 2001 07:41:25 PM +0100 Vedran Rodic
> <[email protected]> wrote:
>
> >
> > So could some of this bugs also be present in 3.5.x version of reiserfs?
> > Will you be fixing them for that version?
> >
>
> This list of reiserfs bugs was all specific to the 3.6.x versions, and they
> don't appear with the 3.5.x code. You will probably have problems if you
> compile 3.5.x reiserfs with an unpatched redhat gcc 2.96, though.
Apologies if I'm mis-understanding (I don't follow the list too
closely), but the zeros-in-log-files thing happens to me a lot on
3.5.X. Is there some sort of debugging info I could offer to help
figure it out?
Ivan...
---------------------------------------------------------------------------
Ivan Pulleyn
4942 N. Winchester Ave. #3
Chicago, IL 60640
[email protected]
(847) 980-1400
---------------------------------------------------------------------------
On Wed, Feb 07, 2001 at 06:30:01PM +0100, Xuan Baldauf wrote:
> and so on. Maybe I should write a program which automatically
> detects and reports the zero blocks. I think the theory of tails
> unpacking does not work out, because there are also areas
> affected which are not between 2048 and 4096. Also the length of
> the zeroing can be greater than 2048. However, I did not
> encounter a length of over 4096.
these appear on your system every couple of days right? if so... are
you able to run with the fs mount notails for a couple of days and
see if you still experience these?
my guess is you probably still will as most log files aren't
candidates for tail-packing (too large) but it will help eliminate
one more thing....
--cw
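The detector Xuan mentions could look roughly like the sketch below. Nobody
on the list posted such a program; the function names, the 16-byte minimum
run length, and the reading of "between bytes 2048 and 4096" as absolute file
offsets are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# One or more consecutive NUL bytes.
_NUL_RUN = re.compile(rb"\x00+")


def find_zero_runs(data: bytes, min_len: int = 16) -> List[Tuple[int, int]]:
    """Return (offset, length) for every run of NUL bytes of at least min_len."""
    runs = []
    for match in _NUL_RUN.finditer(data):
        length = match.end() - match.start()
        if length >= min_len:
            runs.append((match.start(), length))
    return runs


def scan_file(path: str, min_len: int = 16) -> List[Tuple[int, int]]:
    """Read a file in binary mode and report its zero runs.

    The whole file is read into memory, which is fine for typical log
    files but would need chunking for very large ones.
    """
    with open(path, "rb") as fh:
        return find_zero_runs(fh.read(), min_len)


def within_reported_range(offset: int, length: int,
                          lo: int = 2048, hi: int = 4096) -> bool:
    """True if the run lies entirely inside [lo, hi) as file offsets.

    Treating the reported 2048..4096 range as absolute offsets is an
    assumption; the thread does not say how the range was measured.
    """
    return offset >= lo and offset + length <= hi
```

Running scan_file() on a suspect log and printing each (offset, length) pair
would give the kind of report asked for here, and within_reported_range()
separates the runs that match the 2048..4096 description from those that
do not.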
On Thursday, February 08, 2001 10:47:29 AM +1300 Chris Wedgwood
<[email protected]> wrote:
> these appear on your system every couple of days right? if so... are
> you able to run with the fs mount notails for a couple of days and
> see if you still experience these?
>
> my guess is you probably still will as most log files aren't
> candidates for tail-packing (too large) but it will help eliminate
> one more thing....
>
Yes, it really would.
1) mount -o notail
2) rm old_logfile
3) restart syslog
This will ensure the log files don't have tails at all. Knowing for sure
the bug doesn't involve tails would remove much code from the search.
-chris
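Before collecting results from a notail test, it may be worth confirming that
the option really is active on the filesystem holding the logs. The sketch
below parses text in the /proc/mounts layout and looks for the notail option
named in step 1. The helper names are hypothetical, and whether a given
kernel lists notail in /proc/mounts is an assumption that should be checked
on the machine in question.

```python
from typing import Dict, List


def reiserfs_mount_options(mounts_text: str) -> Dict[str, List[str]]:
    """Map each reiserfs mount point to its list of mount options.

    mounts_text is expected in the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    result: Dict[str, List[str]] = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "reiserfs":
            result[fields[1]] = fields[3].split(",")
    return result


def has_notail(mounts_text: str, mount_point: str) -> bool:
    """True if mount_point is a reiserfs mount listing the notail option."""
    return "notail" in reiserfs_mount_options(mounts_text).get(mount_point, [])
```

On a live system the text would come from reading /proc/mounts, for example
has_notail(open("/proc/mounts").read(), "/var").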
Chris Mason wrote:
> On Thursday, February 08, 2001 10:47:29 AM +1300 Chris Wedgwood
> <[email protected]> wrote:
>
> > these appear on your system every couple of days right? if so... are
> > you able to run with the fs mount notails for a couple of days and
> > see if you still experience these?
> >
> > my guess is you probably still will as most log files aren't
> > candidates for tail-packing (too large) but it will help eliminate
> > one more thing....
> >
>
> Yes, it really would.
>
> 1) mount -o notail
> 2) rm old_logfile
> 3) restart syslog
>
> This will ensure the log files don't have tails at all. Knowing for sure
> the bug doesn't involve tails would remove much code from the search.
>
> -chris
Mhhh. It's a busy server from which I am about 700km away. I don't like to
restart it now. (Especially because it cannot boot from hard disk, only from
floppy disk, due to BIOS problems). But I'd be happy if the following is true:
(1) Enabling "-o notails" is possible at runtime, i.e. "mount / -o
remount,notails" works and
(2) Notails is compatible with all the tails found on disk (so notails only
changes the way the disk is written, not the way the disk is read).
Is this true?
Xuân.
On Wednesday, February 07, 2001 11:05:51 PM +0100 Xuan Baldauf
<[email protected]> wrote:
> Mhhh. It's a busy server from which I am about 700km away. I don't like to
> restart it now. (Especially because it cannot boot from hard disk, only
> from floppy disk, due to bios problems). But I'd be happy if following is
> true:
>
> (1) Enabling "-o notails" is possible at runtime, i.e. "mount / -o
> remount,notails" works and
Nope.
> (2) Notails is compatible with all the tails found on disk (so notails
> only changes the way the disk is written, not the way the disk is read).
>
This part is true.
Honestly, I don't want to do this kind of debugging on a busy server.
Sure, it is completely safe, etc, etc, but ...
We'll get the info elsewhere, leave the busy servers out of it ;-)
-chris
On 07 Feb 2001 11:48:16 -0500, Chris Mason wrote:
>
>
> On Wednesday, February 07, 2001 08:38:54 AM -0800 David Rees
> <[email protected]> wrote:
>
> > On Wed, Feb 07, 2001 at 10:47:09AM -0500, Chris Mason wrote:
> >>
> >> Ok, how about we list the known bugs:
> >>
> >> zeros in log files, apparently only between bytes 2048 and 4096 (not
> >> reproduced yet).
> >
> > Could this bug be related to the reported corruption that people with
> > new VIA chipsets have been also reporting on ext2? It seems similar
> > because of the location of the corruption:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=98147483712620&w=2
> >
> > Anyway, it can't hurt to ask the bug reported if they're using a
> > newer VIA chipset and see if they will upgrade their BIOS which seems
> > to fix the problem.
>
> I'd love to blame this on VIA problems, but people are seeing it on other
> chipsets too ;-)
>
> People who report this aren't seeing general corruption, just zeros in
> files of specific sizes. So, it really should be a reiserfs bug.
I run Reiser on all but /boot, and it seems to enjoy corrupting my
mbox'es randomly.
Using the old-style Reiser FS format, 2.4.2-pre1, Evolution, on a CMD640
chipset with the fixes enabled.
This also occurs in some log files, but I put it down to syslogd
crashing or something.
d
--
Daniel Stone
Linux Kernel Developer
[email protected]
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
G!>CS d s++:- a---- C++ ULS++++$>B P---- L+++>++++ E+(joe)>+++ W++ N->++ !o
K? w++(--) O---- M- V-- PS+++ PE- Y PGP>++ t--- 5-- X- R- tv-(!) b+++ DI+++
D+ G e->++ h!(+) r+(%) y? UF++
------END GEEK CODE BLOCK------
On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
> I run Reiser on all but /boot, and it seems to enjoy corrupting my
> mbox'es randomly.
what kind of corruption are you seeing?
> This also occurs in some log files, but I put it down to syslogd
> crashing or something.
syslogd crashing shouldn't corrupt files...
--cw
On 11 Feb 2001 02:02:00 +1300, Chris Wedgwood wrote:
> On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
>
> I run Reiser on all but /boot, and it seems to enjoy corrupting my
> mbox'es randomly.
>
> what kind of corruption are you seeing?
Zeroed bytes.
> This also occurs in some log files, but I put it down to syslogd
> crashing or something.
>
> syslogd crashing shouldn't corrupt files...
Actually, I meant to say my hard drive crashing.
I have two hard drives, side-by-side, and sometimes they overheat and
one of them powers down due to the excess heat.
They haven't done that lately, though, as I have a dedicated fan for
both of them, but the corruption persists.
--
Daniel Stone
On Sun, Feb 11, 2001 at 12:05:12AM +1100, Daniel Stone wrote:
> Actually, I meant to say my hard drive crashing.
> I have two hard drives, side-by-side, and sometimes they overheat and
> one of them powers down due to the excess heat.
OK then... if it weren't for the fact other people have reported
similar problems I would say all bets are off. mbox files get
corrupted when machines crash because of their (mis)design; might
this be the case for you here? Or do you see corruption without hard
drive crashes and OS crashes?
It's pretty much impossible to debug and test software when the
hardware is unreliable or unpredictable.
--cw
> I run Reiser on all but /boot, and it seems to enjoy corrupting my
> mbox'es randomly.
> Using the old-style Reiser FS format, 2.4.2-pre1, Evolution, on a CMD640
> chipset with the fixes enabled.
> This also occurs in some log files, but I put it down to syslogd
> crashing or something.
Before you put that down to reiserfs, can you check 2.4.2-pre2? It may be
problems below the reiserfs layer.
Alan Cox wrote:
>> I run Reiser on all but /boot, and it seems to enjoy corrupting my
>> mbox'es randomly.
>> Using the old-style Reiser FS format, 2.4.2-pre1, Evolution, on a CMD640
>> chipset with the fixes enabled.
>> This also occurs in some log files, but I put it down to syslogd
>> crashing or something.
>
>
> Before you put that down to reiserfs can you chek 2.4.2-pre2. It may be
> problems below the reiserfs layer
Just as an aside, I've watched this conversation go on and on while I
run reiserfs on several servers, workstations, and a notebook. I have
current kernels and have watched carefully for corruption. I haven't
seen any evidence of corruption on any of them including my notebook
which has a bad battery and bad power connection so it tends to
instantly die.
Alan, is there a particular trigger to this?
-d
On Saturday 10 February 2001 22:16, David Ford wrote:
> Just as an aside, I've watched this conversation go on and on while I
> run reiserfs on several servers, workstations, and a notebook. I
> have current kernels and have watched carefully for corruption. I
> haven't seen any evidence of corruption on any of them including my
> notebook which has a bad battery and bad power connection so it tends
> to instantly die.
>
> Alan, is there a particular trigger to this?
Want to trigger this? Just install reiserfs on a dual SMP machine with a
huge RAID acting as mail server for 90k mailboxes. After several hours
you'll get a lot of reiserfs_read_inode2/reiserfs_iget: bad_inode msgs in
your kern.log...
Good luck.
--
Andrius
[email protected]
Chris Wedgwood wrote:
>
> On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
>
> I run Reiser on all but /boot, and it seems to enjoy corrupting my
> mbox'es randomly.
>
> what kind of corruption are you seeing?
>
> This also occurs in some log files, but I put it down to syslogd
> crashing or something.
>
> syslogd crashing shouldn't corrupt files...
>
> --cw
There is a known bug in which nulls get added to log files. We are having
trouble reproducing it on our machines.
There is an elevator bug in 2.4 which just got found/fixed. We don't know what
portion of our bug reports is due to it.
Hans
Daniel Stone wrote:
>
> On 11 Feb 2001 02:02:00 +1300, Chris Wedgwood wrote:
> > On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
> >
> > I run Reiser on all but /boot, and it seems to enjoy corrupting my
> > mbox'es randomly.
> >
> > what kind of corruption are you seeing?
>
> Zeroed bytes.
This sounds like the same bug as the syslog bug, please try to help Chris
reproduce it.
zam, if Chris can't reproduce it by Monday, please give it a try.
Hans
David Ford wrote:
>
> Alan Cox wrote:
>
> >> I run Reiser on all but /boot, and it seems to enjoy corrupting my
> >> mbox'es randomly.
> >> Using the old-style Reiser FS format, 2.4.2-pre1, Evolution, on a CMD640
> >> chipset with the fixes enabled.
> >> This also occurs in some log files, but I put it down to syslogd
> >> crashing or something.
> >
> >
> > Before you put that down to reiserfs can you chek 2.4.2-pre2. It may be
> > problems below the reiserfs layer
>
> Just as an aside, I've watched this conversation go on and on while I
> run reiserfs on several servers, workstations, and a notebook. I have
> current kernels and have watched carefully for corruption. I haven't
> seen any evidence of corruption on any of them including my notebook
> which has a bad battery and bad power connection so it tends to
> instantly die.
>
> Alan, is there a particular trigger to this?
>
> -d
Guys, instability is a relative word. One of our users in Russia said that
reiserfs was as stable as a mountain, and he didn't understand my email. We
have some number of users, I wish I really knew how many. If you look at a few
hundred thousand mountains, you'll discover that a number of them are really
quite unstable. We used to get a bug report a week, now we get one or two a
day. Does this mean we went from a few hundred thousand mountains to a few
million? I don't really know.....
I can assure the users though that we have an extensive testing procedure, and
that all our releases pass testing that can roughly be described as hammering
the filesystem every way we can think of (this is more limited than
what being put into the kernel by Linus does) for twelve or more hours.
What I do know is the following: there was a recent elevator bug fix. Our
filesystem is a journaling filesystem and it is extremely dependent on an
assumption that nothing is going to get written to disk before it should. I
think fsck even makes assumptions about certain states relating to rename being
made atomic never reaching disk (and I think this is being fixed thanks to this
bug). Could this cause the bug in which syslog gets zeros in it? Don't know,
we haven't reproduced that bug yet though it "should" be straightforward to
reproduce. We do have an NFS bug, which Nikita is still fixing.
What I can tell you is that in a few weeks we will have it back to a bug report
every week or two, and until we do, version 4 of ReiserFS is going to be stalled
(not so different from Linux 2.5 being stalled until 2.4 satisfies Linus).
Hans
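The mountain arithmetic in this message can be checked in a few lines. The
sketch assumes that the number of bug reports scales linearly with the number
of users, which the thread itself doubts, and it uses 300,000 purely as a
placeholder for "a few hundred thousand"; neither figure is a measurement.

```python
from typing import Tuple


def reports_per_week(reports: float, per_days: float) -> float:
    """Normalize a report rate to reports per week."""
    return reports * 7.0 / per_days


def growth_range() -> Tuple[float, float]:
    """Growth factor from 'one a week' to 'one or two a day'."""
    old = reports_per_week(1, 7)          # one report a week
    low = reports_per_week(1, 1) / old    # one report a day
    high = reports_per_week(2, 1) / old   # two reports a day
    return low, high


def implied_users(old_users: float, growth: float) -> float:
    """User count implied if reports scale linearly with users (assumption)."""
    return old_users * growth
```

growth_range() gives a factor of 7 to 14, so a placeholder base of 300,000
users would have to grow to between 2.1 and 4.2 million to explain the report
rate by user growth alone.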
Alan Cox wrote:
> Before you put that down to reiserfs can you chek 2.4.2-pre2. It may be
> problems below the reiserfs layer
I forgot, this bug exists on reiserfs for Linux 2.2.*, so it isn't going to be
fixed by 2.4.2 (assuming that the lower-layer bug fixed there is not also in
2.2.*).
Hans
Adrian Phillips wrote:
>
> Does your test procedure include other systems, for example reiserfs
> plus NFS ?
Our NFS testing is simply inadequate, we need a copy of LADDIS but haven't found
the money for it yet.
Hans
>>>>> "Hans" == Hans Reiser <[email protected]> writes:
Hans> Adrian Phillips wrote:
>> Does your test procedure include other systems, for example
>> reiserfs plus NFS ?
Hans> Our NFS testing is simply inadequate, we need a copy of
Hans> LADDIS but haven't found the money for it yet.
Excuse my ignorance, but what is LADDIS ?
Sincerely,
Adrian Phillips
--
Your mouse has moved.
Windows NT must be restarted for the change to take effect.
Reboot now? [OK]
Adrian Phillips wrote:
>
> >>>>> "Hans" == Hans Reiser <[email protected]> writes:
>
> Hans> Adrian Phillips wrote:
> >> Does your test procedure include other systems, for example
> >> reiserfs plus NFS ?
>
> Hans> Our NFS testing is simply inadequate, we need a copy of
> Hans> LADDIS but haven't found the money for it yet.
>
> Excuse my ignorance, but what is LADDIS ?
>
> Sincerely,
>
> Adrian Phillips
>
> --
> Your mouse has moved.
> Windows NT must be restarted for the change to take effect.
> Reboot now? [OK]
LADDIS is the industry standard benchmark for NFS. It crashes for ReiserFS and
NFS. We can't afford to buy it, as it is proprietary software. Once Nikita has
finished testing his changes, we will ask someone to test it for us though.
Hans
> run reiserfs on several servers, workstations, and a notebook. I have
> current kernels and have watched carefully for corruption. I haven't
> seen any evidence of corruption on any of them including my notebook
> which has a bad battery and bad power connection so it tends to
> instantly die.
>
> Alan, is there a particular trigger to this?
The 2.4.1 stuff is triggered by a specific low-level block I/O pattern. It's
fixed in 2.4.2pre2/2.4.1ac-something.
> LADDIS is the industry standard benchmark for NFS. It crashes for ReiserFS and
> NFS. We can't afford to buy it, as it is proprietary software. Once Nikita has
> finished testing his changes, we will ask someone to test it for us though.
>
Do you know if the connectathon test suites show the problem?
Alan Cox <[email protected]> writes:
> > LADDIS is the industry standard benchmark for NFS. It crashes for ReiserFS and
> > NFS. We can't afford to buy it, as it is proprietary software. Once Nikita has
> > finished testing his changes, we will ask someone to test it for us though.
> >
>
> Do you know if the connectathon test suites show the problem?
The reiserfs nfs problem in standard 2.4 is very simple -- it'll barf as soon
as you run out of file handle/inode cache. Any workload that accesses
enough files in parallel can trigger it.
Fixes do exist, but require bigger changes in nfsd. Basically you need to
hand out a 64-bit inode in the NFS filehandle, and pass the upper 32 bits
to the low-level file system for efficient lookup (actually it is all not
too difficult to implement, it just requires very uncodefreezefriendly changes
to nfsd).
-Andi
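The idea of a 64-bit identifier whose upper 32 bits are handed to the
filesystem can be illustrated as follows. This is only a sketch of the bit
layout, not the nfsd or reiserfs code: the function names, the big-endian
8-byte encoding, and the name "hint" for the upper half are hypothetical.

```python
import struct
from typing import Tuple


def pack_handle(ino64: int) -> bytes:
    """Encode a 64-bit identifier as 8 big-endian bytes (hypothetical layout)."""
    if not 0 <= ino64 < 2 ** 64:
        raise ValueError("identifier does not fit in 64 bits")
    return struct.pack(">Q", ino64)


def split_handle(handle: bytes) -> Tuple[int, int]:
    """Return (hint, ino32): the upper and lower 32 bits of the identifier.

    The upper half is what would be passed down to the filesystem as a
    lookup hint; the lower half is the ordinary 32-bit inode number.
    """
    (ino64,) = struct.unpack(">Q", handle)
    return ino64 >> 32, ino64 & 0xFFFFFFFF
```

For example, split_handle(pack_handle((7 << 32) | 42)) yields (7, 42): the
server keeps the familiar 32-bit number and the filesystem gets the extra 32
bits it needs to find the object without relying on the inode cache.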
Alan Cox wrote:
>
> > LADDIS is the industry standard benchmark for NFS. It crashes for ReiserFS and
> > NFS. We can't afford to buy it, as it is proprietary software. Once Nikita has
> > finished testing his changes, we will ask someone to test it for us though.
> >
>
> Do you know if the connectathon test suites show the problem?
Not the slightest idea. Is the connectathon test suite something that stresses
the FS heavily? If so, we can always add it to our stable, whether or not it
stresses this particular bug.
Hans
On Sunday, February 11, 2001 10:00:11 AM +0300 Hans Reiser
<[email protected]> wrote:
> Daniel Stone wrote:
>>
>> On 11 Feb 2001 02:02:00 +1300, Chris Wedgwood wrote:
>> > On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
>> >
>> > I run Reiser on all but /boot, and it seems to enjoy corrupting my
>> > mbox'es randomly.
>> >
>> > what kind of corruption are you seeing?
>>
>> Zeroed bytes.
>
> This sounds like the same bug as the syslog bug, please try to help Chris
> reproduce it.
>
> zam, if Chris can't reproduce it by Monday, please give it a try.
>
I had a bunch of scripts running over the weekend to try and reproduce
this, but the results were ruined when a major storm killed the power (no,
still haven't gotten around to configuring my UPS to shut things down ;-).
So, I'll try again.
-chris
On Feb 11 2001, Andi Kleen wrote:
> The reiserfs nfs problem in standard 2.4 is very simple -- it'll
> barf as soon as you run out of file handle/inode cache. Any workload
> that accesses enough files in parallel can trigger it.
I'm just trying to evaluate if I should use reiserfs here or
not: is this phenomenon that you describe above happening
independently of whether I choose the knfsd or userspace nfsd?
From your message, I got the impression that it would happen
with knfsd only, but I'm just checking before I make a wrong
decision.
Thanks from a humble (and ignorant) network admin, Roger...
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rogerio Brito - [email protected] - http://www.ime.usp.br/~rbrito/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Not the slightest idea. Is the connectathon test suite something that stresses
> the FS heavily? If so, we can always add it to our stable, whether or not it
> stresses this particular bug.
It certainly has been stressing the NFS side of things enough to show up a lot
of problems so maybe
Rogerio Brito <[email protected]> writes:
> On Feb 11 2001, Andi Kleen wrote:
> > The reiserfs nfs problem in standard 2.4 is very simple -- it'll
> > barf as soon as you run out of file handle/inode cache. Any workload
> > that accesses enough files in parallel can trigger it.
>
> I'm just trying to evaluate if I should use reiserfs here or
> not: is this phenomenon that you describe above happening
> independently of whether I choose the knfsd or userspace nfsd?
This should all be covered extensively in the reiserfs FAQ and list archives,
but here it is one last time:
It only applies to knfsd, but unfsd unfortunately has different problems
with reiserfs. It makes assumptions about the inode space of the underlying
filesystem, assuming that it can encode a dev_t in the upper bits. Reiserfs,
unlike ext2, periodically cycles through the full 31 bits of inode values, and
after some weeks on a busy file system unfsd starts to complain about
conflicts. There is a patch at ftp.suse.com:/pub/people/ak/nfs/unfsd*
that works around the problem when you specify --no-cross-mounts (but
you can then no longer export trees of multiple file systems with a single
mount).
Please also note that the patch adds a rather obscure bug, which triggers
very seldom (a fix partly exists, but is not really tested yet).
Another alternative is to use knfsd with Chris Mason's 2.4 knfsd patches.
-Andi
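Why folding a device number into the upper bits breaks once inode values use
the full range can be shown with a toy encoding. The 8/23 bit split and the
function names below are illustrative assumptions, not unfsd's actual layout.

```python
DEV_BITS = 8    # illustrative only, not unfsd's real layout
INO_BITS = 23   # 8 + 23 = 31 bits in total


def encode(dev_index: int, ino: int) -> int:
    """Fold a device index into the upper bits of a pseudo-inode number."""
    return (dev_index << INO_BITS) | ino


def fits(ino: int) -> bool:
    """True if ino leaves the upper DEV_BITS free, as the scheme assumes."""
    return 0 <= ino < (1 << INO_BITS)
```

As long as every inode number satisfies fits(), distinct (device, inode)
pairs map to distinct values. Once a filesystem hands out numbers with the
upper bits set, two different files collide: encode(1, 5) and
encode(0, (1 << 23) | 5) are the same value, which is the kind of conflict
described above.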
[email protected] (Andi Kleen) writes:
>to the low level file system for efficient lookup (actually is all not
>too difficult to implement, just requires very uncodefreezefriendly changes
>to nfsd)
Well, at least I would really prefer a change for 2.4.x, the sooner the
better, as I never ever want to repeat the NFS nightmare from
2.2. I prefer a working NFS on Reiser over a non-working but
code-frozen one at any time. ;-)
Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20
On Sun, 11 Feb 2001, Chris Mason wrote:
>
>
> On Sunday, February 11, 2001 10:00:11 AM +0300 Hans Reiser
> <[email protected]> wrote:
>
> > Daniel Stone wrote:
> >>
> >> On 11 Feb 2001 02:02:00 +1300, Chris Wedgwood wrote:
> >> > On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
> >> >
> >> > I run Reiser on all but /boot, and it seems to enjoy corrupting my
> >> > mbox'es randomly.
> >> >
> >> > what kind of corruption are you seeing?
> >>
> >> Zeroed bytes.
> >
> > This sounds like the same bug as the syslog bug, please try to help Chris
> > reproduce it.
> >
> > zam, if Chris can't reproduce it by Monday, please give it a try.
> >
>
> I had a bunch of scripts running over the weekend to try and reproduce
> this, but the results were ruined when a major storm killed the power (no,
> still haven't gotten around to configuring my UPS to shut things down ;-).
>
> So, I'll try again.
Chris,
Do you know if the people reporting the corruption with reiserfs on
2.4 were using IDE drives with PIO mode and IDE multicount turned on?
If so, it may be caused by the problem fixed by Russell King on
2.4.2-pre2.
Without his fix, I was able to corrupt ext2 while using PIO+multicount
very very easily.
Marcelo Tosatti wrote:
>
> On Sun, 11 Feb 2001, Chris Mason wrote:
>
> >
> >
> > On Sunday, February 11, 2001 10:00:11 AM +0300 Hans Reiser
> > <[email protected]> wrote:
> >
> > > Daniel Stone wrote:
> > >>
> > >> On 11 Feb 2001 02:02:00 +1300, Chris Wedgwood wrote:
> > >> > On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
> > >> >
> > >> > I run Reiser on all but /boot, and it seems to enjoy corrupting my
> > >> > mbox'es randomly.
> > >> >
> > >> > what kind of corruption are you seeing?
> > >>
> > >> Zeroed bytes.
> > >
> > > This sounds like the same bug as the syslog bug, please try to help Chris
> > > reproduce it.
> > >
> > > zam, if Chris can't reproduce it by Monday, please give it a try.
> > >
> >
> > I had a bunch of scripts running over the weekend to try and reproduce
> > this, but the results were ruined when a major storm killed the power (no,
> > still haven't gotten around to configuring my UPS to shut things down ;-).
> >
> > So, I'll try again.
>
> Chris,
>
> Do you know if the people reporting the corruption with reiserfs on
> 2.4 were using IDE drives with PIO mode and IDE multicount turned on?
>
> If so, it may be caused by the problem fixed by Russell King on
> 2.4.2-pre2.
>
> Without his fix, I was able to corrupt ext2 while using PIO+multicount
> very very easily.
Was the bug you describe also present in the 2.2.* series? If not, then the
bugs are not the same.
Hans
On Mon, 12 Feb 2001, Hans Reiser wrote:
> Marcelo Tosatti wrote:
> >
> > On Sun, 11 Feb 2001, Chris Mason wrote:
> >
> > >
> > >
> > > On Sunday, February 11, 2001 10:00:11 AM +0300 Hans Reiser
> > > <[email protected]> wrote:
> > >
> > > > Daniel Stone wrote:
> > > >>
> > > >> On 11 Feb 2001 02:02:00 +1300, Chris Wedgwood wrote:
> > > >> > On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
> > > >> >
> > > >> > I run Reiser on all but /boot, and it seems to enjoy corrupting my
> > > >> > mbox'es randomly.
> > > >> >
> > > >> > what kind of corruption are you seeing?
> > > >>
> > > >> Zeroed bytes.
> > > >
> > > > This sounds like the same bug as the syslog bug, please try to help Chris
> > > > reproduce it.
> > > >
> > > > zam, if Chris can't reproduce it by Monday, please give it a try.
> > > >
> > >
> > > I had a bunch of scripts running over the weekend to try and reproduce
> > > this, but the results were ruined when a major storm killed the power (no,
> > > still haven't gotten around to configuring my UPS to shut things down ;-).
> > >
> > > So, I'll try again.
> >
> > Chris,
> >
> > Do you know if the people reporting the corruption with reiserfs on
> > 2.4 were using IDE drives with PIO mode and IDE multicount turned on?
> >
> > If so, it may be caused by the problem fixed by Russell King on
> > 2.4.2-pre2.
> >
> > Without his fix, I was able to corrupt ext2 while using PIO+multicount
> > very very easily.
>
> Was the bug you describe also present in the 2.2.* series? If not, then the
> bugs are not the same.
N.
Marcelo Tosatti wrote:
>
> On Mon, 12 Feb 2001, Hans Reiser wrote:
>
> > Marcelo Tosatti wrote:
> > >
> > > On Sun, 11 Feb 2001, Chris Mason wrote:
> > >
> > > >
> > > >
> > > > On Sunday, February 11, 2001 10:00:11 AM +0300 Hans Reiser
> > > > <[email protected]> wrote:
> > > >
> > > > > Daniel Stone wrote:
> > > > >>
> > > > >> On 11 Feb 2001 02:02:00 +1300, Chris Wedgwood wrote:
> > > > >> > On Thu, Feb 08, 2001 at 05:34:44PM +1100, Daniel Stone wrote:
> > > > >> >
> > > > >> > I run Reiser on all but /boot, and it seems to enjoy corrupting my
> > > > >> > mbox'es randomly.
> > > > >> >
> > > > >> > what kind of corruption are you seeing?
> > > > >>
> > > > >> Zeroed bytes.
> > > > >
> > > > > This sounds like the same bug as the syslog bug, please try to help Chris
> > > > > reproduce it.
> > > > >
> > > > > zam, if Chris can't reproduce it by Monday, please give it a try.
> > > > >
> > > >
> > > > I had a bunch of scripts running over the weekend to try and reproduce
> > > > this, but the results were ruined when a major storm killed the power (no,
> > > > still haven't gotten around to configuring my UPS to shut things down ;-).
> > > >
> > > > So, I'll try again.
> > >
> > > Chris,
> > >
> > > Do you know if the people reporting the corruption with reiserfs on
> > > 2.4 were using IDE drives with PIO mode and IDE multicount turned on?
> > >
> > > If so, it may be caused by the problem fixed by Russell King on
> > > 2.4.2-pre2.
> > >
> > > Without his fix, I was able to corrupt ext2 while using PIO+multicount
> > > very very easily.
> >
> > Was the bug you describe also present in the 2.2.* series? If not, then the
> > bugs are not the same.
>
> N.
Zam will try to reproduce it tomorrow, he successfully escaped me today and got
to write fun code (a simpler block allocator) instead.
Hans
On Monday, February 12, 2001 11:42:38 PM +0300 Hans Reiser
<[email protected]> wrote:
>> Chris,
>>
>> Do you know if the people reporting the corruption with reiserfs on
>> 2.4 were using IDE drives with PIO mode and IDE multicount turned on?
>>
>> If so, it may be caused by the problem fixed by Russell King on
>> 2.4.2-pre2.
>>
>> Without his fix, I was able to corrupt ext2 while using PIO+multicount
>> very very easily.
>
I suspect the bugfixes in pre2 will fix some of the more exotic corruption
reports we've seen, but this one (nulls in log files) probably isn't caused
by a random (or semi-random) lower layer corruption. These users are not
seeing random metadata corruption, so I suspect this bug is different (and
reiserfs specific).
> Was the bug you describe also present in the 2.2.* series? If not, then
> the bugs are not the same.
>
In 2.2 code the only data file corruption I know of is caused by a crash....
-chris
Chris Mason wrote:
>
> On Monday, February 12, 2001 11:42:38 PM +0300 Hans Reiser
> <[email protected]> wrote:
>
> >> Chris,
> >>
> >> Do you know if the people reporting the corruption with reiserfs on
> >> 2.4 were using IDE drives with PIO mode and IDE multicount turned on?
> >>
> >> If so, it may be caused by the problem fixed by Russell King on
> >> 2.4.2-pre2.
> >>
> >> Without his fix, I was able to corrupt ext2 while using PIO+multicount
> >> very very easily.
> >
>
> I suspect the bugfixes in pre2 will fix some of the more exotic corruption
> reports we've seen, but this one (nulls in log files) probably isn't caused
> by a random (or semi-random) lower layer corruption. These users are not
> seeing random metadata corruption, so I suspect this bug is different (and
> reiserfs specific).
>
> > Was the bug you describe also present in the 2.2.* series? If not, then
> > the bugs are not the same.
> >
>
> In 2.2 code the only data file corruption I know of is caused by a crash....
>
> -chris
Chris, your quoting is very confusing above..... but I get your very interesting
remark (thanks for noticing) that the nulls are specific to crashes on 2.2, and
therefore could be due to the elevator bug on 2.4. It even makes rough sense
that the elevator bug (said to occasionally cause a premature write of the wrong
buffer) could cause an effect similar to a crash. I hope it is true, let's ask
all users to upgrade to pre2 (a good idea anyway) and see if it cures.
zam is perhaps very clever for deferring working on this bug......:-)
Hans
Chris Mason wrote:
>
> On Monday, February 12, 2001 11:42:38 PM +0300 Hans Reiser
> <[email protected]> wrote:
>
> >> Chris,
> >>
> >> Do you know if the people reporting the corruption with reiserfs on
> >> 2.4 were using IDE drives with PIO mode and IDE multicount turned on?
> >>
> >> If so, it may be caused by the problem fixed by Russell King on
> >> 2.4.2-pre2.
> >>
> >> Without his fix, I was able to corrupt ext2 while using PIO+multicount
> >> very very easily.
> >
>
> I suspect the bugfixes in pre2 will fix some of the more exotic corruption
> reports we've seen, but this one (nulls in log files) probably isn't caused
> by a random (or semi-random) lower layer corruption. These users are not
> seeing random metadata corruption, so I suspect this bug is different (and
> reiserfs specific).
>
> > Was the bug you describe also present in the 2.2.* series? If not, then
> > the bugs are not the same.
> >
>
> In 2.2 code the only data file corruption I know of is caused by a crash....
>
> -chris
I'd like to announce on our website and mailing list that all XXX users should
upgrade to 2.4.2pre2. Do you all agree with this?
What is the exact definition of XXX?
Hans
On Tuesday, February 13, 2001 01:39:02 AM +0300 Hans Reiser
<[email protected]> wrote:
> Chris, your quoting is very confusing above..... but I get your very
> interesting remark (thanks for noticing) that the nulls are specific to
> crashes on 2.2, and therefore could be due to the elevator bug on 2.4. It
> even makes rough sense that the elevator bug (said to occasionally cause
> a premature write of the wrong buffer) could cause an effect similar to a
> crash. I hope it is true, let's ask all users to upgrade to pre2 (a good
> idea anyway) and see if it cures.
>
Ok, I'll try again ;-) People have been seeing null bytes in data files on
reiserfs. They see this without seeing any other corruption of any kind,
and they only see it on files of very specific sizes. They see this
without crashing, and without hard drive suspend kicking in. They see it
on scsi and ide, on servers and laptops.
Elevator bugs and general driver bugs could certainly cause nulls in data
files. But they would also cause other corruptions and probably would not
be selective enough to pick files that happen to have the same range in
size that reiserfs packs tails on.
In other words, updating to 2.4.2pre2 or your favorite ac series kernel is
probably a good plan. It won't fix this bug ;-)
Perhaps I haven't seen it yet because I've also been testing code that does
direct->indirect conversions slightly differently. I'll try again on a pure
kernel.
-chris
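For anyone trying to help reproduce this, a rough sketch of a checker that
scans files for runs of null bytes and prints where they start, along with the
file size, so reports can be compared against the 2048-4096 byte range
mentioned earlier in the thread. This is not a reiserfs tool. The function
name find_zero_runs and the 16-byte minimum run length are assumptions chosen
for illustration, and a file that legitimately contains zeros will also be
reported.

```python
import sys


def find_zero_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for each run of at least min_len NUL bytes.

    min_len is an arbitrary illustrative threshold, not a value taken
    from any bug report.
    """
    runs = []
    start = None  # offset where the current run of NULs began, if any
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # A run may extend to the very end of the file.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs


if __name__ == "__main__":
    # Usage: python checker.py /var/log/messages ~/mbox ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            data = f.read()
        for off, length in find_zero_runs(data):
            print(f"{path}: {length} NUL bytes at offset {off} "
                  f"(file size {len(data)})")
```

Run against log files or mboxes, this shows each reported offset alongside
the file size, which is the detail that would help tell a tail-related
problem apart from random lower-layer corruption.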