2013-10-29 11:37:05

by John Twideldum

[permalink] [raw]
Subject: 3.12-rc5 and overwritten partition table

Hi...

I have a shiny new Haswell notebook (Lenovo T440s), where with 3.11 processor power
management didn't work, so I got 3.12-rc5 installed.
Worked fine for a few days, then today it no longer boots and inspection yields...

# dd if=/dev/sda count=1 |od -t c |tr -d " "
512 bytes (512 B) copied, 0.00115806 s, 442 kB/s
0000000 13.00000\nalsa:hw
0000020 C0D0\t99.00000\nal
0000040 sa:hwC0D3\t99.000
0000060 00\nbacklight:acp
0000100 i_video0\t73.0000
0000120 0\nbacklight:acpi
0000140 _video0-power\t10
0000160 0.00000\nbackligh
0000200 t:intel_backligh
0000220 t\t0.00000\nbackli
0000240 ght:intel_backli
0000260 ght-power\t0.0000
0000300 0\ncpu-consumptio
0000320 n\t0.00000\ncpu-wa
0000340 keups\t875.00000\n
0000360 disk-operations\t
0000400 0.00000\ndisk-ope
0000420 rations-hard\t0.0
0000440 0000\neth0-link-1
0000460 00\t0.00000\neth0-
0000500 link-1000\t0.0000
0000520 0\neth0-link-high
0000540 \t0.00000\neth0-pa
0000560 ckets\t0.00000\net
0000600 h0-powerunsave\t0
0000620 .00000\neth0-up\t0
0000640 .00000\ngpu-opera
0000660 tions\t134.00000\n
0000700radio:hci0\t100.0
00007200000\nradio:phy0\t
0000740100.00000\nradio:
0000760tpacpi_bluetooth
0001000

:-(

In order to figure out the reason for this, can you kernel guys maybe
a) tell me what is that datastructure?
b) can power management be buggy to cause this - I played with powertop tuneables
approximately while this happened?

...and point me to a recovery tool/doc for finding the partitions again,
"testdisk" tool only found 1 of 3, the least important one at the beginning of
disk but not the rest :-(

Thank you for any pointers/help!


2013-10-29 14:57:57

by John Twideldum

[permalink] [raw]
Subject: Aw: 3.12-rc5 and overwritten partition table - by powertop?

replying to myself with more insights...
 
>0000000 13.00000\nalsa:hw
>0000020 C0D0\t99.00000\nal
>0000040 sa:hwC0D3\t99.000
>0000060 00\nbacklight:acp
>.....

The first ~170kb of /dev/sda got blown away with what seems to be a logging output
by Powertop, when I was playing with the tuneables.
(Luckily the first partition starts later :-))

Why is that I don't know, but maybe when turning on the SATA knobs
something goes wrong. I'm afraid to try again, but I accept rather higher
power use than data loss again :-/

Sorry for the confusion...
J

2013-10-29 20:10:36

by Jan Kara

[permalink] [raw]
Subject: Re: 3.12-rc5 and overwritten partition table - by powertop?

On Tue 29-10-13 15:57:54, John Twideldum wrote:
> replying to myself with more insights...
> ?
> >0000000 13.00000\nalsa:hw
> >0000020 C0D0\t99.00000\nal
> >0000040 sa:hwC0D3\t99.000
> >0000060 00\nbacklight:acp
> >.....
>
> The first ~170kb of /dev/sda got blown away with what seems to be a logging output
> by Powertop, when I was playing with the tuneables.
> (Luckily the first partition starts later :-))
So did you log the output to some file? I'm just trying to understand how
it could get onto your disk in the first place...

> Why is that I don't know, but maybe when turning on the SATA knobs
> something goes wrong. I'm afraid to try again, but I accept rather higher
> power use than data loss again :-/

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-10-29 20:12:19

by Jan Kara

[permalink] [raw]
Subject: Re: 3.12-rc5 and overwritten partition table

On Tue 29-10-13 12:37:01, John Twideldum wrote:
> Hi...
>
> I have a shiny new Haswell notebook (Lenovo T440s), where with 3.11 processor power
> management didn't work, so I got 3.12-rc5 installed.
> Worked fine for a few days, then today it no longer boots and inspection yields...
>
> # dd if=/dev/sda count=1 |od -t c |tr -d " "
> 512 bytes (512 B) copied, 0.00115806 s, 442 kB/s
> 0000000 13.00000\nalsa:hw
> 0000020 C0D0\t99.00000\nal
> 0000040 sa:hwC0D3\t99.000
> 0000060 00\nbacklight:acp
> 0000100 i_video0\t73.0000
> 0000120 0\nbacklight:acpi
> 0000140 _video0-power\t10
> 0000160 0.00000\nbackligh
> 0000200 t:intel_backligh
> 0000220 t\t0.00000\nbackli
> 0000240 ght:intel_backli
> 0000260 ght-power\t0.0000
> 0000300 0\ncpu-consumptio
> 0000320 n\t0.00000\ncpu-wa
> 0000340 keups\t875.00000\n
> 0000360 disk-operations\t
> 0000400 0.00000\ndisk-ope
> 0000420 rations-hard\t0.0
> 0000440 0000\neth0-link-1
> 0000460 00\t0.00000\neth0-
> 0000500 link-1000\t0.0000
> 0000520 0\neth0-link-high
> 0000540 \t0.00000\neth0-pa
> 0000560 ckets\t0.00000\net
> 0000600 h0-powerunsave\t0
> 0000620 .00000\neth0-up\t0
> 0000640 .00000\ngpu-opera
> 0000660 tions\t134.00000\n
> 0000700radio:hci0\t100.0
> 00007200000\nradio:phy0\t
> 0000740100.00000\nradio:
> 0000760tpacpi_bluetooth
> 0001000
>
> :-(
>
> In order to figure out the reason for this, can you kernel guys maybe
> a) tell me what is that datastructure?
> b) can power management be buggy to cause this - I played with powertop tuneables
> approximately while this happened?
>
> ...and point me to a recovery tool/doc for finding the partitions again,
> "testdisk" tool only found 1 of 3, the least important one at the beginning of
> disk but not the rest :-(
You can always look for fs-specific magic numbers in the superblock. What
filesystems did you use?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-10-29 21:58:42

by John Twideldum

[permalink] [raw]
Subject: Re: 3.12-rc5 and overwritten partition table - by powertop?

>> The first ~170kb of /dev/sda got blown away with what seems to be a logging output
>> by Powertop, when I was playing with the tuneables.
>
>So did you log the output to some file? I'm just trying to understand how
>it could get onto your disk in the first place...

Attached a dump of the first 1Mb of the disk, HTH.
It looks like a powertop log?
(I have powertop 2.4)

meanwhile, testdisk running in all variations recovered enough clues to
guess the partitions correctly - the data is fine :-)

J


Attachments:
backup_overwrite.dump.bz2 (2.40 kB)

2013-10-29 22:20:22

by Jan Kara

[permalink] [raw]
Subject: Re: 3.12-rc5 and overwritten partition table - by powertop?

On Tue 29-10-13 22:58:40, John Twideldum wrote:
> >> The first ~170kb of /dev/sda got blown away with what seems to be a logging output
> >> by Powertop, when I was playing with the tuneables.
> >
> >So did you log the output to some file? I'm just trying to understand how
> >it could get onto your disk in the first place...
>
> Attached a dump of the first 1Mb of the disk, HTH.
> It looks like a powertop log?
> (I have powertop 2.4)
Yes, likely. But it is strange the corruption doesn't even end at any
sensible boundary (data ends at offset 0x27b53). Shrug...

> meanwhile, testdisk running in all variations recovered enough clues to
> guess the partitions correctly - the data is fine :-)
Good.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-10-29 22:32:28

by John Twideldum

[permalink] [raw]
Subject: Re: 3.12-rc5 and overwritten partition table - by powertop?

>> >> The first ~170kb of /dev/sda got blown away with what seems to be a logging output
>> >> by Powertop, when I was playing with the tuneables.
>> >
>> >So did you log the output to some file? I'm just trying to understand how
>> >it could get onto your disk in the first place...
>>
>> Attached a dump of the first 1Mb of the disk, HTH.
>> It looks like a powertop log?
>> (I have powertop 2.4)
>
>Yes, likely. But it is strange the corruption doesn't even end at any
>sensible boundary (data ends at offset 0x27b53). Shrug...

My recollection what I did is this:

I was looking into powertop and observing how -rc5 works now with Haswell.
I saw the tuneable parameters and quite a few were "bad", so I set them to "good".
Power usage dropped about one third - yay!
However, changing "SATA link power" threw up complaints:

Oct 29 09:09:21 localhost kernel: [ 3697.423868] ata1.00: exception Emask 0x10 SAct 0x1 SErr 0xc0000 action 0x6 frozen
Oct 29 09:09:21 localhost kernel: [ 3697.423873] ata1.00: irq_stat 0x08000000, interface fatal error
Oct 29 09:09:21 localhost kernel: [ 3697.423877] ata1: SError: { CommWake 10B8B }
Oct 29 09:09:21 localhost kernel: [ 3697.423880] ata1.00: failed command: WRITE FPDMA QUEUED
Oct 29 09:09:21 localhost kernel: [ 3697.423886] ata1.00: cmd 61/38:00:01:9e:a4/01:00:00:00:00/40 tag 0 ncq 159744 out
Oct 29 09:09:21 localhost kernel: [ 3697.423886] res 50/01:00:01:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Oct 29 09:09:21 localhost kernel: [ 3697.423888] ata1.00: status: { DRDY }
Oct 29 09:09:21 localhost kernel: [ 3697.423894] ata1: hard resetting link
Oct 29 09:09:22 localhost kernel: [ 3697.743196] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 29 09:09:22 localhost kernel: [ 3697.744707] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
Oct 29 09:09:22 localhost kernel: [ 3697.744719] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Oct 29 09:09:22 localhost kernel: [ 3697.744725] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
Oct 29 09:09:22 localhost kernel: [ 3697.744813] ata1.00: ACPI cmd ef/10:09:00:00:00:a0 (SET FEATURES) succeeded
Oct 29 09:09:22 localhost kernel: [ 3697.745212] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
Oct 29 09:09:22 localhost kernel: [ 3697.746694] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
Oct 29 09:09:22 localhost kernel: [ 3697.746705] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Oct 29 09:09:22 localhost kernel: [ 3697.746711] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
Oct 29 09:09:22 localhost kernel: [ 3697.746779] ata1.00: ACPI cmd ef/10:09:00:00:00:a0 (SET FEATURES) succeeded
Oct 29 09:09:22 localhost kernel: [ 3697.747286] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
Oct 29 09:09:22 localhost kernel: [ 3697.747432] ata1.00: configured for UDMA/133
Oct 29 09:09:22 localhost kernel: [ 3697.763181] ata1: EH complete

I did not know yet about what "frozen" means, so I did not investigate and
very soon powered down as I had to leave.
Next time I boot up.... I did not boot.
So data probable is just the size because as long as I had powertop running...

HTH,
J

2014-01-20 23:30:57

by Stefan Agner

[permalink] [raw]
Subject: Re: 3.12-rc5 and overwritten partition table - by powertop?

Am 2013-10-29 21:10, schrieb Jan Kara:
>> The first ~170kb of /dev/sda got blown away with what seems to be a logging output
>> by Powertop, when I was playing with the tuneables.
>> (Luckily the first partition starts later :-))
> So did you log the output to some file? I'm just trying to understand how
> it could get onto your disk in the first place...
>
>> Why is that I don't know, but maybe when turning on the SATA knobs
>> something goes wrong. I'm afraid to try again, but I accept rather higher
>> power use than data loss again :-/

I experienced the same on the very same hardware (Lenovo T440s). Like
John, I turned all those knobs in powertop, including the SATA ones.
Several time I ended up with broken partition table. Once, even my EFI
System partition (first partition) was broken. However, since I use EFI
I was able to recover the partition table quite easily (gdisk asks for
recovery from backup partition table, kudos to the designer of the GPT
format!).

This happens running on Arch Linux with stock 3.12.7 as well as mainline
3.13 kernel. I use latest T440s firmware (2.17).

Is it possible to disable/warn user when using that knob (at least on
Lenovo T440s), in order to avoid users left at an unbootable system...?

dmesg output:
[ 2.744398] ata1: SATA max UDMA/133 abar m2048@0xf063c000 port
0xf063c100 irq 59
[ 2.744400] ata2: DUMMY
[ 2.744401] ata3: DUMMY
[ 3.063804] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3.064532] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES)
succeeded
[ 3.064536] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE
LOCK) filtered out
[ 3.064538] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES)
filtered out
[ 3.064606] ata1.00: ACPI cmd ef/10:09:00:00:00:a0 (SET FEATURES)
succeeded
[ 3.064926] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[ 3.064929] ata1.00: ATA-9: Samsung SSD 840 PRO Series, DXM05B0Q, max
UDMA/133
[ 3.064931] ata1.00: 500118192 sectors, multi 16: LBA48 NCQ (depth
31/32), AA
[ 3.065256] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES)
succeeded
[ 3.065259] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE
LOCK) filtered out
[ 3.065261] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES)
filtered out
[ 3.065286] ata1.00: ACPI cmd ef/10:09:00:00:00:a0 (SET FEATURES)
succeeded
[ 3.065545] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[ 3.065605] ata1.00: configured for UDMA/133
...
[ 130.578789] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000
action 0x0
[ 130.578794] ata1.00: irq_stat 0x40000001
[ 130.578796] ata1: SError: { CommWake }
[ 130.578798] ata1.00: failed command: WRITE FPDMA QUEUED
[ 130.578802] ata1.00: cmd 61/10:00:f0:29:05/00:00:00:00:00/40 tag 0
ncq 8192 out
[ 130.578804] ata1.00: status: { DRDY ERR }
[ 130.578806] ata1.00: error: { ABRT }
...
[ 130.579011] ata1.00: failed command: WRITE FPDMA QUEUED
[ 130.579014] ata1.00: cmd 61/10:f0:58:7c:0f/00:00:00:00:00/40 tag 30
ncq 8192 out
[ 130.579016] ata1.00: status: { DRDY ERR }
[ 130.579017] ata1.00: error: { ABRT }
[ 130.579207] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[ 130.579456] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
[ 130.579511] ata1.00: configured for UDMA/133
[ 130.579583] ata1: EH complete

--
Stefan

2014-01-21 03:16:13

by Robert Hancock

[permalink] [raw]
Subject: Re: 3.12-rc5 and overwritten partition table - by powertop?

On 10/29/2013 04:32 PM, John Twideldum wrote:
>>>>> The first ~170kb of /dev/sda got blown away with what seems to be a logging output
>>>>> by Powertop, when I was playing with the tuneables.
>>>>
>>>> So did you log the output to some file? I'm just trying to understand how
>>>> it could get onto your disk in the first place...
>>>
>>> Attached a dump of the first 1Mb of the disk, HTH.
>>> It looks like a powertop log?
>>> (I have powertop 2.4)
>>
>> Yes, likely. But it is strange the corruption doesn't even end at any
>> sensible boundary (data ends at offset 0x27b53). Shrug...
>
> My recollection what I did is this:
>
> I was looking into powertop and observing how -rc5 works now with Haswell.
> I saw the tuneable parameters and quite a few were "bad", so I set them to "good".
> Power usage dropped about one third - yay!
> However, changing "SATA link power" threw up complaints:
>
> Oct 29 09:09:21 localhost kernel: [ 3697.423868] ata1.00: exception Emask 0x10 SAct 0x1 SErr 0xc0000 action 0x6 frozen
> Oct 29 09:09:21 localhost kernel: [ 3697.423873] ata1.00: irq_stat 0x08000000, interface fatal error
> Oct 29 09:09:21 localhost kernel: [ 3697.423877] ata1: SError: { CommWake 10B8B }
> Oct 29 09:09:21 localhost kernel: [ 3697.423880] ata1.00: failed command: WRITE FPDMA QUEUED
> Oct 29 09:09:21 localhost kernel: [ 3697.423886] ata1.00: cmd 61/38:00:01:9e:a4/01:00:00:00:00/40 tag 0 ncq 159744 out
> Oct 29 09:09:21 localhost kernel: [ 3697.423886] res 50/01:00:01:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> Oct 29 09:09:21 localhost kernel: [ 3697.423888] ata1.00: status: { DRDY }
> Oct 29 09:09:21 localhost kernel: [ 3697.423894] ata1: hard resetting link
> Oct 29 09:09:22 localhost kernel: [ 3697.743196] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> Oct 29 09:09:22 localhost kernel: [ 3697.744707] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
> Oct 29 09:09:22 localhost kernel: [ 3697.744719] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
> Oct 29 09:09:22 localhost kernel: [ 3697.744725] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
> Oct 29 09:09:22 localhost kernel: [ 3697.744813] ata1.00: ACPI cmd ef/10:09:00:00:00:a0 (SET FEATURES) succeeded
> Oct 29 09:09:22 localhost kernel: [ 3697.745212] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
> Oct 29 09:09:22 localhost kernel: [ 3697.746694] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
> Oct 29 09:09:22 localhost kernel: [ 3697.746705] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
> Oct 29 09:09:22 localhost kernel: [ 3697.746711] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
> Oct 29 09:09:22 localhost kernel: [ 3697.746779] ata1.00: ACPI cmd ef/10:09:00:00:00:a0 (SET FEATURES) succeeded
> Oct 29 09:09:22 localhost kernel: [ 3697.747286] ata1.00: failed to get NCQ Send/Recv Log Emask 0x1
> Oct 29 09:09:22 localhost kernel: [ 3697.747432] ata1.00: configured for UDMA/133
> Oct 29 09:09:22 localhost kernel: [ 3697.763181] ata1: EH complete
>
> I did not know yet about what "frozen" means, so I did not investigate and
> very soon powered down as I had to leave.
> Next time I boot up.... I did not boot.
> So data probable is just the size because as long as I had powertop running...

(CCing linux-ide)

It seems like most likely either the SATA host controller or drive
doesn't play nice with link power management enabled. Can you post the
full dmesg boot log?

2014-01-23 12:17:15

by John Twideldum

[permalink] [raw]
Subject: Re: 3.12-rc5 and overwritten partition table - by powertop?

>(CCing linux-ide)
>
>It seems like most likely either the SATA host controller or drive
>doesn't play nice with link power management enabled. Can you post the
>full dmesg boot log?

I found backup, yay! -> attached

Compared to Stefan's report, same hardware, same SSD,
only my BIOS is one rev older and my 840 Pro firmware is also one rev older.

Hope it helps,
J


Attachments:
log_oct29.txt.bz2 (15.05 kB)