2003-11-24 05:19:16

by Larry McVoy

Subject: data from kernel.bkbits.net


I've been trying to get all the data off the drives on the machine which
was broken into. I have a feeling that whoever this was was hiding stuff
in the file system because both drives will not fsck clean nor will they
completely read.

I've managed to get most of the data off but not all. Given that I've put
about 3 days into this I'm pretty much done. If someone else wants to look
at the drives I can make them available, let me know. But just reading the
main drive makes the kernel (Fedora 1) kill the tar process as below (it
also managed to whack the system enough that it overwrote the NVRAM with
garbage). It hasn't been a fun weekend.

3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x1b, unit #3.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x1b, unit #3.
3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: Port #3.
3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: Port #3.
3w-xxxx: scsi0: Reset succeeded.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x1b, unit #3.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x1b, unit #3.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x1b, unit #3.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x1b, unit #3.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x1b, unit #3.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x1b, unit #3.
3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: Port #3.
3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: Port #3.
3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: Port #3.
3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: Port #3.
3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: Port #3.
3w-xxxx: scsi0: AEN: WARNING: ATA port timeout: Port #3.
3w-xxxx: scsi0: Reset succeeded.
Unable to handle kernel paging request at virtual address 4954507d
printing eip:
c015a129
*pde = 00000000
Oops: 0000
3w-xxxx sd_mod sis900 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables sg scsi_mod keybdev mousedev hid input ehci-hcd usb-uhci usbcore ext3 jbd
CPU: 0
EIP: 0060:[<c015a129>] Not tainted
EFLAGS: 00010a97

EIP is at find_inode [kernel] 0x19 (2.4.22-1.2115.nptl)
eax: 00000000 ebx: 49545055 ecx: 0000000f edx: c1640000
esi: 00000000 edi: c1655868 ebp: 0027ace1 esp: cea97ea4
ds: 0068 es: 0068 ss: 0068
Process tar (pid: 2816, stackpage=cea97000)
Stack: db99a05c 00000000 0000002a dacd43c0 c1655868 0027ace1 df9db800 c015a452
df9db800 0027ace1 c1655868 00000000 00000000 dacd43c0 dd476d40 df9db800
dd476d40 c0173669 df9db800 0027ace1 00000000 00000000 fffffff4 dacd442c
Call Trace: [<c015a452>] iget4_locked [kernel] 0x52 (0xcea97ec0)
[<c0173669>] ext2_lookup [kernel] 0x69 (0xcea97ee8)
[<c014f197>] real_lookup [kernel] 0xc7 (0xcea97f08)
[<c014f88a>] link_path_walk [kernel] 0x59a (0xcea97f24)
[<c014fb67>] path_lookup [kernel] 0x37 (0xcea97f60)
[<c014fdf9>] __user_walk [kernel] 0x49 (0xcea97f70)
[<c014bddf>] sys_lstat64 [kernel] 0x1f (0xcea97f8c)
[<c01099df>] system_call [kernel] 0x33 (0xcea97fc0)


Code: 39 6b 28 89 de 75 f1 8b 44 24 20 39 83 a0 00 00 00 75 e5 8b

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm


2003-11-24 07:34:45

by H. Peter Anvin

Subject: Re: data from kernel.bkbits.net

Larry McVoy wrote:
> I've been trying to get all the data off the drives on the machine which
> was broken into. I have a feeling that whoever this was was hiding stuff
> in the file system because both drives will not fsck clean nor will they
> completely read.
>
> I've managed to get most of the data off but not all. Given that I've put
> about 3 days into this I'm pretty much done. If someone else wants to look
> at the drives I can make them available, let me know. But just reading the
> main drive makes the kernel (Fedora 1) kill the tar process as below (it
> also managed to whack the system enough that it overwrote the NVRAM with
> garbage). It hasn't been a fun weekend.
>

Looks more like a 3Ware driver bug to me. Hard to say for sure, though.

-hpa

2003-11-24 09:48:42

by Willy Tarreau

Subject: Re: data from kernel.bkbits.net

On Sun, Nov 23, 2003 at 09:19:10PM -0800, Larry McVoy wrote:

> Unable to handle kernel paging request at virtual address 4954507d
> eax: 00000000 ebx: 49545055 ecx: 0000000f edx: c1640000

Here, EBX is pure text: "UPTI", as in "CORRUPTION" or "INTERRUPTIBLE". Maybe
there has been some memory corruption somewhere in a linked list?
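
The observation above can be checked with a short sketch. It takes the EBX value and the faulting address from the oops, prints the register bytes in x86 little-endian memory order, and shows how far the faulting address lies from EBX. The helper name register_as_text is made up for illustration.

```python
import struct


def register_as_text(value: int) -> str:
    """Return the four bytes of a 32-bit register in little-endian memory order."""
    raw = struct.pack("<I", value)
    # Replace non-printable bytes with "." so arbitrary values stay readable.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


ebx = 0x49545055         # ebx from the register dump in the oops
fault_addr = 0x4954507d  # "virtual address" reported by the oops

print(register_as_text(ebx))   # UPTI
print(hex(fault_addr - ebx))   # 0x28
```

The faulting address is EBX plus a small offset, which fits the idea that the kernel followed a pointer that had been overwritten with text.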

Cheers,
Willy

2003-11-24 14:57:39

by Larry McVoy

Subject: Re: data from kernel.bkbits.net

On Sun, Nov 23, 2003 at 11:34:35PM -0800, H. Peter Anvin wrote:
> Looks more like a 3Ware driver bug to me. Hard to say for sure, though.

I've used 3ware, Highpoint, and onboard AMD IDE interfaces and
gotten problems (albeit different problems) each time.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-11-24 15:48:10

by Ricky Beam

Subject: Re: data from kernel.bkbits.net

On Sun, 23 Nov 2003, H. Peter Anvin wrote:
>Larry McVoy wrote:
>> ...
>
>Looks more like a 3Ware driver bug to me. Hard to say for sure, though.

Or simply a dead drive. (or a "dirty" cable -- try re-plugging that one.)
I'm guessing the machine was powered off after being hacked and now some
of the drives don't work so well anymore. (such is the way of things with
cheap IDE drives -- and even cheap SCSI ones too. all too often, they don't
even spin back up.)

--Ricky


2003-11-24 15:51:08

by Larry McVoy

Subject: Re: data from kernel.bkbits.net

On Mon, Nov 24, 2003 at 10:43:34AM -0500, Ricky Beam wrote:
> On Sun, 23 Nov 2003, H. Peter Anvin wrote:
> >Larry McVoy wrote:
> >> ...
> >
> >Looks more like a 3Ware driver bug to me. Hard to say for sure, though.
>
> Or simply a dead drive. (or a "dirty" cable -- try re-plugging that one.)

Thanks for the advice, but this drive has been plugged into 3 different
controllers on different machines using different cables. Both this drive
and the backup drive refused to fsck clean (they had a lot of errors with
directory corruption problems).

It is not a dirty cable or a bad controller, I've been building and
debugging PC hardware for years and I know how to track down obvious
problems.

Sorry to be short but I already said that I'd eliminated this source of
error. What did you think I was doing all weekend?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-11-24 19:19:59

by Ricky Beam

Subject: Re: data from kernel.bkbits.net

On Mon, 24 Nov 2003, Larry McVoy wrote:
>Sorry to be short but I already said that I'd eliminated this source of
>error. What did you think I was doing all weekend?

Let me be equally short. Your original message gave no details of what
debugging steps had been taken. (I can assume you would know what you're
doing, but frankly, I could be wrong.) You venture a guess that the
system had been h4x0r3d in some inventive way to prevent your attempts
to recover data and proceed to paste error messages from the 3ware
driver that indicate a problem with the hardware (either driver bug,
cabling, controller, or channel on that controller) including the
drive itself.

Please do not attribute to hackers what is simply a half dead drive. So,
was the machine powered down for an extended period as I alluded? (to
preserve the machine until someone had time to look at it.)

--Ricky


2003-11-24 19:14:24

by Adam Radford

Subject: RE: data from kernel.bkbits.net

This looks like glitchy power cables, drive cable or dying drive to me.

-Adam

-----Original Message-----
From: H. Peter Anvin [mailto:[email protected]]
Sent: Sunday, November 23, 2003 11:35 PM
To: Larry McVoy
Cc: [email protected]
Subject: Re: data from kernel.bkbits.net


Larry McVoy wrote:
> I've been trying to get all the data off the drives on the machine which
> was broken into. I have a feeling that whoever this was was hiding stuff
> in the file system because both drives will not fsck clean nor will they
> completely read.
>
> I've managed to get most of the data off but not all. Given that I've put
> about 3 days into this I'm pretty much done. If someone else wants to look
> at the drives I can make them available, let me know. But just reading the
> main drive makes the kernel (Fedora 1) kill the tar process as below (it
> also managed to whack the system enough that it overwrote the NVRAM with
> garbage). It hasn't been a fun weekend.
>

Looks more like a 3Ware driver bug to me. Hard to say for sure, though.

-hpa

2003-11-24 19:24:36

by Larry McVoy

Subject: Re: data from kernel.bkbits.net

On Mon, Nov 24, 2003 at 02:17:44PM -0500, Ricky Beam wrote:
> On Mon, 24 Nov 2003, Larry McVoy wrote:
> >Sorry to be short but I already said that I'd eliminated this source of
> >error. What did you think I was doing all weekend?
>
> Let me be equally short. Your original message gave no details of what
> debugging steps had been taken. (I can assume you would know what you're
> doing, but frankly, I could be wrong.) You venture a guess that the
> system had been h4x0r3d in some inventive way to prevent your attempts
> to recover data and proceed to paste error messages from the 3ware
> driver that indicate a problem with the hardware (either driver bug,
> cabling, controller, or channel on that controller) including the
> drive itself.
>
> Please do not attribute to hackers what is simply a half dead drive. So,
> was the machine powered down for an extended period as I alluded? (to
> preserve the machine until someone had time to look at it.)

As I said, *both* drives have extensive file system problems. No, the
machine was not powered down for a long time, and no, neither of these
drives are old, and no, they are not from the same factory batch (they
aren't even the same vendor, one is a Maxtor and the other is a Seagate),
and yes, I of course tried different cable/controller/machine combos.

Any other questions?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-11-24 19:35:29

by Jamie Lokier

Subject: Re: data from kernel.bkbits.net

Larry McVoy wrote:
> Any other questions?

At risk of sounding like second level support,

1. Are you able to copy the raw partitions (e.g. using dd) to
another disk or system?

2. Do you see similar error messages when copying the raw partitions?

3. When you mount the _copies_ of the partitions, do you see similar
error messages?

That'll differentiate whether it's a pure disk/driver problem or
something triggered by a filesystem problem. As a bonus, if the disks
are both dying (maybe you had a lightning strike), then you'll have
the data copied somewhere safe.
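
A minimal sketch of step 1, assuming a Unix system with Python 3 available: it copies a device or file block by block, zero-fills any block that fails to read so later data keeps its offset, and returns the offsets that failed. The function name and the 64 KiB block size are illustrative choices, not anything from this thread. The behaviour is roughly what dd does with conv=noerror,sync.

```python
import os


def copy_with_bad_block_log(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst block by block; return offsets of blocks that failed to read."""
    bad_offsets = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or block device.
        size = os.lseek(src, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    data = os.pread(src, want, offset)
                except OSError:
                    bad_offsets.append(offset)
                    data = b"\0" * want  # keep later data at its original offset
                if not data:
                    break  # unexpected end of input
                dst.write(data)
                offset += len(data)
    finally:
        os.close(src)
    return bad_offsets
```

An empty list from a call such as copy_with_bad_block_log("/dev/sdX1", "image.bin") (both paths hypothetical) would mean every block was returned without an I/O error. That answers question 2, and the resulting image is what step 3 would mount or fsck.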

-- Jamie

2003-11-24 20:06:17

by Richard B. Johnson

Subject: Re: data from kernel.bkbits.net

On Mon, 24 Nov 2003, Larry McVoy wrote:

> On Mon, Nov 24, 2003 at 02:17:44PM -0500, Ricky Beam wrote:
> > On Mon, 24 Nov 2003, Larry McVoy wrote:
> > >Sorry to be short but I already said that I'd eliminated this source of
> > >error. What did you think I was doing all weekend?
> >
> > Let me be equally short. Your original message gave no details of what
> > debugging steps had been taken. (I can assume you would know what you're
> > doing, but frankly, I could be wrong.) You venture a guess that the
> > system had been h4x0r3d in some inventive way to prevent your attempts
> > to recover data and proceed to paste error messages from the 3ware
> > driver that indicate a problem with the hardware (either driver bug,
> > cabling, controller, or channel on that controller) including the
> > drive itself.
> >
> > Please do not attribute to hackers what is simply a half dead drive. So,
> > was the machine powered down for an extended period as I alluded? (to
> > preserve the machine until someone had time to look at it.)
>
> As I said, *both* drives have extensive file system problems. No, the
> machine was not powered down for a long time, and no, neither of these
> drives are old, and no, they are not from the same factory batch (they
> aren't even the same vendor, one is a Maxtor and the other is a Seagate),
> and yes, I of course tried different cable/controller/machine combos.
>
> Any other questions?
> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
> -

Attempt to copy the raw drive to /dev/null. If that works, the
drive is likely okay, but the fs got fsucked up by software. You
might be able to mount the drive on a 2.4.22 machine if you have a
spare. Then you might be able to selectively copy important stuff
to another drive, after which you can make a new file-system as
a "repair".

If you can't copy the raw drive, yet you booted on a system that
uses the same driver(s) to access the disk, then you probably
have a bad drive.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2003-11-24 20:12:12

by Ricky Beam

Subject: Re: data from kernel.bkbits.net

On Mon, 24 Nov 2003, Larry McVoy wrote:
...
>Any other questions?

Have you run the factory diag utility(s) to ensure the drives are "ok"?
(not that those tests are 100% as I own drives that pass those tests that
are, in fact, bad.) Have you made a complete bit image clone of the
drives (ala 'dd')? (how big are they?)

And there was a recent thread on linux-raid where someone had a drive
with bad internal cache memory -- a single bit was always '1'. That one
gets an "I've never seen it do that before."

--Ricky


2003-11-24 20:34:35

by Theodore Ts'o

Subject: Re: data from kernel.bkbits.net

On Mon, Nov 24, 2003 at 03:05:24PM -0500, Richard B. Johnson wrote:
> Attempt to copy the raw drive to /dev/null. If that works, the
> drive is likely okay, but the fs got fsucked up by software. You
> might be able to mount the drive on a 2.4.22 machine if you have a
> spare. Then you might be able to selectively copy important stuff
> to another drive, after which you can make a new file-system as
> a "repair".

The error messages Larry reported were obviously reported by the
hardware, and were **not** filesystem errors.

- Ted

2003-11-24 21:34:15

by Richard B. Johnson

Subject: Re: data from kernel.bkbits.net

On Mon, 24 Nov 2003, Theodore Ts'o wrote:

> On Mon, Nov 24, 2003 at 03:05:24PM -0500, Richard B. Johnson wrote:
> > Attempt to copy the raw drive to /dev/null. If that works, the
> > drive is likely okay, but the fs got fsucked up by software. You
> > might be able to mount the drive on a 2.4.22 machine if you have a
> > spare. Then you might be able to selectively copy important stuff
> > to another drive, after which you can make a new file-system as
> > a "repair".
>
> The error messages Larry reported were obviously reported by the
> hardware, and were **not** filesystem errors.
>
> - Ted

Yes but an attempt to read beyond the limits of the physical
drive will provide you with a lot of **interesting** hardware
errors. This happens if the file-system gets corrupt.

And I'm not implying that the software screwed up either. The
software doesn't know if an "extra" bit was set during a write
to the drive. These things happen as a result of bad RAM, bad
DMA, and other hardware-corrupting things....

So, the first check is to see if the drive can be read without
any reference to its contents. Since Read/Write is usually the
software implementation detail of a direction bit, if you can
read, you can usually write.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2003-11-24 22:24:28

by Larry McVoy

Subject: Re: data from kernel.bkbits.net

On Mon, Nov 24, 2003 at 04:34:43PM -0500, Richard B. Johnson wrote:
> On Mon, 24 Nov 2003, Theodore Ts'o wrote:
>
> > On Mon, Nov 24, 2003 at 03:05:24PM -0500, Richard B. Johnson wrote:
> > > Attempt to copy the raw drive to /dev/null. If that works, the
> > > drive is likely okay, but the fs got fsucked up by software. You
> > > might be able to mount the drive on a 2.4.22 machine if you have a
> > > spare. Then you might be able to selectively copy important stuff
> > > to another drive, after which you can make a new file-system as
> > > a "repair".
> >
> > The error messages Larry reported were obviously reported by the
> > hardware, and were **not** filesystem errors.
> >
> > - Ted
>
> Yes but an attempt to read beyond the limits of the physical
> drive will provide you with a lot of **interesting** hardware
> errors. This happens if the file-system gets corrupt.

Yeah, I think Richard may be right. Anyway, the drive sort of reads
from the raw partition. It gets an IDE reset and then it reads. I can
read it a second time with no reset. Haven't tried a reboot between
reads, hang on, yeah, a reboot brings the errors back.

But, fscking the dd-ed image gets me less errors so I'm trying that
route to get the data back.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-11-24 22:38:11

by Jamie Lokier

Subject: Re: data from kernel.bkbits.net

Larry McVoy wrote:
> But, fscking the dd-ed image gets me less errors so I'm trying that
                                       ^^^^
                                       Fewer!

have nice day :)
-- Jamie

2003-11-25 00:30:50

by Theodore Ts'o

Subject: Re: data from kernel.bkbits.net

On Mon, Nov 24, 2003 at 02:24:13PM -0800, Larry McVoy wrote:
> > Yes but an attempt to read beyond the limits of the physical
> > drive will provide you with a lot of **interesting** hardware
> > errors. This happens if the file-system gets corrupt.

Sure, but not those kinds of errors. You'll see errors like this
instead:

kernel: attempt to access beyond end of device
kernel: 08:05: rw=0, want=198500353, limit=5779456
kernel: attempt to access beyond end of device
kernel: 08:05: rw=0, want=4294934529, limit=5779456

ATA device timeouts, which is what Larry reported, are not caused by
attempting to read beyond the limits of the physical device.
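
To make the contrast concrete, here is a small sketch that pulls the want and limit values out of messages in the format shown above and confirms each request lies past the end of the device. It also shows each want value reinterpreted as a signed 32-bit number. That reinterpretation is only an illustration of what a garbage block pointer might look like, not something the kernel reports. The function name is hypothetical.

```python
import re

# Matches the numeric fields in lines such as
# "kernel: 08:05: rw=0, want=198500353, limit=5779456"
PATTERN = re.compile(r"rw=(\d+), want=(\d+), limit=(\d+)")


def beyond_end_requests(log_text):
    """Return (want, limit, want_as_signed_32) for each matching log line."""
    results = []
    for match in PATTERN.finditer(log_text):
        want, limit = int(match.group(2)), int(match.group(3))
        # Reinterpret want as a signed 32-bit value (illustration only).
        signed = want - (1 << 32) if want >= (1 << 31) else want
        results.append((want, limit, signed))
    return results


sample = """\
kernel: attempt to access beyond end of device
kernel: 08:05: rw=0, want=198500353, limit=5779456
kernel: attempt to access beyond end of device
kernel: 08:05: rw=0, want=4294934529, limit=5779456
"""

for want, limit, signed in beyond_end_requests(sample):
    print(want > limit, signed)
```

Both sample requests are far beyond the limit, and the second is -32767 when read as a signed value. Messages like these come from the block layer rejecting the request, which is a different signature from the controller timeouts in the original report.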

> Yeah, I think Richard may be right. Anyway, the drive sort of reads
> from the raw partition. It gets an IDE reset and then it reads. I can
> read it a second time with no reset. Haven't tried a reboot between
> reads, hang on, yeah, a reboot brings the errors back.

It really, really sounds like the disk is pooched. I don't know if it
was bad luck, coincidence, or the fact that it was powered down for a
while. But I'm guessing that it's taking a long time for the disk to read
a sector, which is causing the disk driver to timeout and reset the
bus, but then the sector is first cached in the IDE disk cache (where
it can be read quickly) and then it ends up getting cached in the
system memory. That would explain why a reboot brings the errors back.

> But, fscking the dd-ed image gets me less errors so I'm trying that
> route to get the data back.

If using the dd'ed image is giving you less errors, combined with your
other description, it's causing me to be really suspicious about the
hard drive. If you're really brave, or foolish, (or have already
backed up the image), you might try doing a non-destructive read/write
test using the badblocks(8) command. I'm pretty confident that it
will turn up all sorts of problems, though, since the low-level device
driver errors you were describing really are not consistent with
filesystem corruption, but with a hardware failure of some kind.

- Ted

2003-11-25 15:09:31

by Ben Collins

Subject: Re: data from kernel.bkbits.net

On Sun, Nov 23, 2003 at 09:19:10PM -0800, Larry McVoy wrote:
>
> I've been trying to get all the data off the drives on the machine which
> was broken into. I have a feeling that whoever this was was hiding stuff
> in the file system because both drives will not fsck clean nor will they
> completely read.
>
> I've managed to get most of the data off but not all. Given that I've put
> about 3 days into this I'm pretty much done. If someone else wants to look
> at the drives I can make them available, let me know. But just reading the
> main drive makes the kernel (Fedora 1) kill the tar process as below (it
> also managed to whack the system enough that it overwrote the NVRAM with
> garbage). It hasn't been a fun weekend.

FYI, you can ignore the large SVN repos. They are easily rebuilt. I just
need the bkcvs2svn script in my home directory.

--
Debian - http://www.debian.org/
Linux 1394 - http://www.linux1394.org/
Subversion - http://subversion.tigris.org/
WatchGuard - http://www.watchguard.com/