2002-03-27 18:53:49

by root

Subject: bkbits.net down

It looks like we have a bad disk, I'm checking them now to figure out if
it is the primary or backup data drive. I'll run checks in all the
repositories if fsck doesn't find the problem so it may take a couple of
hours before we are back up.

In the not so distant future, we're moving the backup drive to a different
machine such that we can just flip machines when this happens but for now
you'll have to wait for a bit.

--lm


2002-03-28 01:52:56

by Mike Fedyk

Subject: Re: bkbits.net down

On Wed, Mar 27, 2002 at 10:53:27AM -0800, root wrote:
> It looks like we have a bad disk, I'm checking them now to figure out if
> it is the primary or backup data drive. I'll run checks in all the
> repositories if fsck doesn't find the problem so it may take a couple of
> hours before we are back up.
>
> In the not so distant future, we're moving the backup drive to a different
> machine such that we can just flip machines when this happens but for now
> you'll have to wait for a bit.

Or use RAID1/5...

2002-03-28 06:28:10

by Larry McVoy

Subject: Re: bkbits.net down

We did indeed lose the primary disk (IBM 40GB, I am starting to lose all
the respect I had for IBM drives, this is one of many that has failed on
me personally). I have restored from the backup disk, and in the process
redone hardlinks across all the linux kernel trees, which saved about 5GB
(nice). All trees which are now on bkbits.net check clean, which means
BK thinks all the files are there and that the checksums are correct,
a fairly reasonable indication that we are in good shape.
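
The hardlinking step mentioned above can be pictured roughly as follows. This is only a minimal sketch, assuming plain directory trees on a single filesystem and that files with identical content may safely share one inode (hardlinked files also share permissions and ownership). It is not BitKeeper's own mechanism; `file_digest` and `hardlink_duplicates` are hypothetical names used for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(roots):
    """Replace identical regular files under `roots` with hardlinks.

    Returns the number of bytes saved. Hypothetical helper, not BK code.
    """
    first_seen = {}  # (size, digest) -> path of the first copy found
    saved = 0
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.stat(path).st_size
                key = (size, file_digest(path))
                original = first_seen.setdefault(key, path)
                if original == path or os.path.samefile(original, path):
                    continue  # first copy, or already linked
                # Link under a temporary name, then swap it into place.
                tmp = path + ".hardlink-tmp"
                os.link(original, tmp)
                os.replace(tmp, path)
                saved += size
    return saved
```

Calling `hardlink_duplicates(["tree-a", "tree-b"])` on two clones of the same tree would leave one on-disk copy of every file the clones have in common.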

I wouldn't be a bit surprised if we have some permissions problems,
mail [email protected] if you hit any and we'll fix things as we
become aware of them. In fact, I know we have permissions problems but
given that I've been working on this for 12 hours straight, I'll get to
it tomorrow.

There are a couple of trees which are missing files, both in Rik's
linuxvm.bkbits.net, I suspect an interrupted clone. They are:
bk://linuxvm.bkbits.net/linux-2.5-vmtidbits
bk://linuxvm.bkbits.net/linux-2.5-writethrot
Rik, ping me if you need help cleaning these up.

The ppc tree seems to be missing linuxppc_2_4, Paul/Tom/Troy/Cort,
where is this tree? You'll want to get a copy back here, I suspect,
so if you are a PPC person and you have a recently updated version of
linuxppc_2_4, hang on to it. We'll sort it out on the ppc mailing list.

We have lost a number of ssh keys. We backed these up a while back but
we did not catch all of these. The list is below, send me mail with your
ssh key / project name and I will restore them by hand. We already had
plans in place for dealing with this problem so that it doesn't reoccur.

Sorry about the long downtime, we are struggling with the economic
downturn like everyone else and hadn't put a hot spare in place yet.
We bought them, in fact, I bought ten spare boxes for this sort of
thing, but I have been too busy to put them in place. We'll get on
it, we're aware that people depend on this.

--lm

Here's the list of projects missing ssh keys:
bcrlbits
freebsd-dvb (probably not Linux, eh? :)
lia64
linux-mtd
linux-srn
linux24 (I think this one is dead, right Marcelo?)
ltr
misc
nonblock
palinux
test1
If you are the admin for any of these projects, drop me a mail with your
ssh key and I'll add it back in.

On Wed, Mar 27, 2002 at 10:53:27AM -0800, root wrote:
> It looks like we have a bad disk, I'm checking them now to figure out if
> it is the primary or backup data drive. I'll run checks in all the
> repositories if fsck doesn't find the problem so it may take a couple of
> hours before we are back up.
>
> In the not so distant future, we're moving the backup drive to a different
> machine such that we can just flip machines when this happens but for now
> you'll have to wait for a bit.
>
> --lm
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-03-28 06:52:54

by Petko Manolov

Subject: Re: bkbits.net down

Larry McVoy wrote:
> There are a couple of trees which are missing files, both in Rik's

I can't pull from linux-2.[45] and I'm getting:
ERROR-Lock fail: possible permission problem.

Last time I got this error somebody was playing with the config files.


later,
Petko

2002-03-28 19:23:13

by Larry McVoy

Subject: Re: bkbits.net down

On Wed, Mar 27, 2002 at 10:27:38PM -0800, Larry McVoy wrote:
> We did indeed lose the primary disk (IBM 40GB, I am starting to lose all
> the respect I had for IBM drives, this is one of many that has failed on
> me personally).

Leaving the drive off overnight "fixed it" enough that I am able to get
some of the data off. It will be a couple hours before I know how much,
but I did manage to get all the ssh keys, project descriptions, and
project statistics. I'm now working on the actual data just in case
there is one of the trees, such as the ppc trees, that we can't find
again.

The drive has bad blocks and when it hits them it goes into retry la la land,
so I won't know which data is bad until I hit the bad blocks.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-03-29 01:29:54

by Larry McVoy

Subject: Re: bkbits.net down

I think we are back in action. We put all the ssh stuff back, as well as
the download statistics. Take a peek at http://www.bkbits.net, your stuff
should be there.

Let me know if your project is missing anything, I know about the ppc tree,
we have that data, that's next. But other than that, everything should be
back, let me know if that is not true.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

by Henning P. Schmiedehausen

Subject: Re: bkbits.net down

Larry McVoy <[email protected]> writes:

>The drive has bad blocks and when it hits them it goes into retry la la land,
>so I won't know which data is bad until I hit the bad blocks.

You've learned now the hard way why integrity checks in an application
will never be able to replace things like backups or RAID systems.
Maybe you want to reread the flamewar^Wthread from some time ago with
your new knowledge.

Regards
Henning

--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20

2002-03-29 14:36:00

by hatefullinuxuser

Subject: Re: bkbits.net down

On Fri, Mar 29, 2002 at 10:50:22AM +0000, Henning P. Schmiedehausen wrote:
> Larry McVoy <[email protected]> writes:
>
> >The drive has bad blocks and when it hits them it goes into retry la la land,
> >so I won't know which data is bad until I hit the bad blocks.
>
> You've learned now the hard way why integrity checks in an application
> will never be able to replace things like backups or RAID systems.
> Maybe you want to reread the flamewar^Wthread from some time ago with
> your new knowledge.

Yeah, maybe he should buy a big fat RAID array with all the money he's
getting from Linux kernel developers.

2002-03-29 14:38:43

by Rik van Riel

Subject: Re: bkbits.net down

On Wed, 27 Mar 2002, Larry McVoy wrote:

> There are a couple of trees which are missing files, both in Rik's
> linuxvm.bkbits.net, I suspect an interrupted clone. They are:
> bk://linuxvm.bkbits.net/linux-2.5-vmtidbits
> bk://linuxvm.bkbits.net/linux-2.5-writethrot
> Rik, ping me if you need help cleaning these up.

No big deal, either tree is just a copy of stuff I have here
so people can pull it. I don't rely on bkbits in any way except
as a distribution medium ;)

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-29 19:18:56

by Larry McVoy

Subject: Re: bkbits.net down

On Fri, Mar 29, 2002 at 10:50:22AM +0000, Henning P. Schmiedehausen wrote:
> Larry McVoy <[email protected]> writes:
>
> >The drive has bad blocks and when it hits them it goes into retry la la land,
> >so I won't know which data is bad until I hit the bad blocks.
>
> You've learned now the hard way why integrity checks in an application
> will never be able to replace things like backups or RAID systems.
> Maybe you want to reread the flamewar^Wthread from some time ago with
> your new knowledge.

You obviously didn't read that thread. Both in the context of
BitKeeper and in the context of normal data, you would have seen that
we have backups, we just have backups that we can verify are correct.
The repositories on bkbits.net are automirrored after each incoming event.
There were a few ppc ones which were not and we're still trying to figure
out why, and things like the .ssh keys were not completely backed up;
we're fixing that by putting that information into a BK repository so
it will just automirror like everything else.
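
As an illustration of what "backups that we can verify are correct" could look like in the simplest case, here is a minimal sketch that compares a primary tree against its mirror by content checksum. It assumes both trees are ordinary directories readable from one machine. It does not show how BK or bkbits.net actually verify mirrors, and `tree_digests` and `compare_mirror` are hypothetical names.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file's path relative to `root` to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def compare_mirror(primary, mirror):
    """Return (missing, mismatched): sorted relative paths that are absent
    from the mirror, or present but with different content."""
    p = tree_digests(primary)
    m = tree_digests(mirror)
    missing = sorted(set(p) - set(m))
    mismatched = sorted(k for k in p if k in m and p[k] != m[k])
    return missing, mismatched
```

A check like this only says whether the mirror matches the primary at the time it runs. It is a complement to keeping the mirror on separate hardware, not a substitute for it.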

I'm not sure why you're yanking my chain, it's counterproductive and
flat out rude after I just spent two days doing nothing but putting
things back together for kernel developers. What, exactly, did you
hope to accomplish?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

by Henning P. Schmiedehausen

Subject: Re: bkbits.net down

Larry McVoy <[email protected]> writes:

>things back together for kernel developers. What, exactly, did you
>hope to accomplish?

Awareness on your side that there are people presenting valid
arguments to you even if you don't agree. I did read the first posts
of the thread up until it degraded to "our application does every
integrity check possible to verify that the data is correct. So even
if it gets corrupted, we will know. That's why we're better than the
rest" and you shot down everyone presenting you with other solutions
with this argument. Well, in this case, you obviously knew that the
data was incorrect (because the disk died; I really feel for you
here: of my eight IBM DTLA disks, three have died too, and I fear
that the remaining five will also die [1]), but all your checks
couldn't help you where just a few up-to-date backups would have.

Actually I'm a bit disappointed too, that you, with all the
professionalism that you sprinkle over this mailing list in every one
of your posts, run such a showcase part of your business as bkbits.net
on IDE disks without RAID, and without clustering in case of emergency.

Regards
Henning


[1] I keep them in a dust-free, climate-controlled environment (18
degrees centigrade all the time) and they're continuously running (no
on-off operations) and they still die. IBM really deserves to suffer
for these disks. The last two died within six hours in the same box
after 20 months of continuous running [2].

[2] I had the system up and running again 93 minutes after I pulled it
from the rack. Amanda and daily backups saved my ass. :-) And my
applications do not do integrity checks.

--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20

2002-03-29 22:49:31

by Larry McVoy

Subject: Re: bkbits.net down

Go reread my posts and stop wasting my time.

On Fri, Mar 29, 2002 at 10:44:54PM +0000, Henning P. Schmiedehausen wrote:
> Larry McVoy <[email protected]> writes:
>
> >things back together for kernel developers. What, exactly, did you
> >hope to accomplish?
>
> Awareness on your side that there are people presenting valid
> arguments to you even if you don't agree. I did read the first posts
> of the thread up until it degraded to "our application does every
> integrity check possible to verify that the data is correct. So even
> if it gets corrupted, we will know. That's why we're better than the
> rest" and you shot down everyone presenting you with other solutions
> with this argument. Well, in this case, you obviously knew that the
> data was incorrect (because the disk died; I really feel for you
> here: of my eight IBM DTLA disks, three have died too, and I fear
> that the remaining five will also die [1]), but all your checks
> couldn't help you where just a few up-to-date backups would have.
>
> Actually I'm a bit disappointed too, that you, with all the
> professionalism that you sprinkle over this mailing list in every one
> of your posts, run such a showcase part of your business as bkbits.net
> on IDE disks without RAID, and without clustering in case of emergency.
>
> Regards
> Henning
>
>
> [1] I keep them in a dust-free, climate-controlled environment (18
> degrees centigrade all the time) and they're continuously running (no
> on-off operations) and they still die. IBM really deserves to suffer
> for these disks. The last two died within six hours in the same box
> after 20 months of continuous running [2].
>
> [2] I had the system up and running again 93 minutes after I pulled it
> from the rack. Amanda and daily backups saved my ass. :-) And my
> applications do not do integrity checks.
>
> --
> Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
> INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]
>
> Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
> D-91054 Buckenhof Fax.: 09131 / 50654-20

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm