2009-11-26 16:42:35

by Johannes Stezenbach

Subject: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

Hi,

I'm referring to
http://bugzilla.kernel.org/show_bug.cgi?id=14314
and I still have this issue on a N130 with latest BIOS (05CM),
running kernel 2.6.32-rc8 + wireless-testing.

BIOS Information
Vendor: Phoenix Technologies Ltd.
Version: 05CM.M011.20091013.JIP
Release Date: 10/13/2009
Address: 0xE6300
Runtime Size: 105728 bytes
ROM Size: 2048 kB
Characteristics:
ISA is supported
PCI is supported
PNP is supported
BIOS is upgradeable
BIOS shadowing is allowed
ESCD support is available
ACPI is supported
USB legacy is supported
Smart battery is supported
BIOS boot specification is supported
Targeted content distribution is supported
BIOS Revision: 5.0

Around 5min after boot or resume it generates the following error:

[ 302.364174] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 302.364201] ata1.00: failed command: WRITE DMA
[ 302.364234] ata1.00: cmd ca/00:08:f7:01:1a/00:00:00:00:00/e0 tag 0 dma 4096 out
[ 302.364241] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 302.364257] ata1.00: status: { DRDY }
[ 307.408107] ata1: link is slow to respond, please be patient (ready=0)
[ 312.392109] ata1: device not ready (errno=-16), forcing hardreset
[ 312.392138] ata1: soft resetting link
[ 312.574482] ata1.00: configured for UDMA/133
[ 312.574506] ata1.00: device reported invalid CHS sector 0
[ 312.574542] ata1: EH complete
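
For readers decoding the "cmd" line in the log above, here is a minimal
sketch. It assumes the libata report prints the taskfile in the order
command/feature:nsect:lbal:lbam:lbah/(five HOB bytes)/device, and that
the disk uses 512-byte sectors. The function name and the small opcode
table are illustrative only, not part of any kernel interface.

```python
import re

# Assumed field order of the libata "cmd" report:
#   command/feature:nsect:lbal:lbam:lbah/hob bytes (5)/device
TASKFILE_RE = re.compile(
    r"cmd (?P<tf>[0-9a-f]{2}/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}"
    r"/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/[0-9a-f]{2})"
)

# Only the opcodes relevant to this log, not a full ATA command table.
COMMAND_NAMES = {0xC8: "READ DMA", 0xCA: "WRITE DMA"}

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_cmd_line(line):
    """Decode a 28-bit LBA command from a libata 'cmd' log line."""
    match = TASKFILE_RE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    command, low, _hob, device = match.group("tf").split("/")
    _feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    opcode = int(command, 16)
    # In 28-bit addressing the top four LBA bits live in the device register.
    lba = ((int(device, 16) & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    sectors = nsect or 256  # a count of 0 means 256 sectors for 28-bit commands
    return {
        "command": COMMAND_NAMES.get(opcode, hex(opcode)),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
    }


if __name__ == "__main__":
    log = ("[ 302.364234] ata1.00: cmd ca/00:08:f7:01:1a/00:00:00:00:00/e0 "
           "tag 0 dma 4096 out")
    print(decode_cmd_line(log))
```

Under these assumptions the failed command is an 8-sector WRITE DMA,
which agrees with the "dma 4096 out" that the kernel prints on the same
line.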

This also happens when booting with rdinit=/bin/sh, i.e. only running busybox sh
inside initrd. The error then appears when accessing the disk after the 5min
period with dd if=/dev/sda of=/dev/null count=10000.


The link in comment #14 is dead but eventually I found
http://download.opensuse.org/repositories/Moblin:/Base/openSUSE_11.1/src/kernel-source-2.6.31.6-37.1.src.rpm
which contains the attached patch with a samsung_laptop driver.

I think it is weird that the Samsung BIOS has a special "SECLINUX" mode,
but anyway the samsung_laptop driver works (the backlight control via ACPI
also works with the 05CM BIOS, though).

However, it does not prevent the ATA exception.

(Side note about backlight level 0: I noticed that in Windows when you
set the backlight to the lowest level, after a minute of inactivity
the screen would dim one level more. Stupid -- why not allow the user
to choose that level manually?)

So I guess one really needs the special Linux BIOS that Greg was talking
about in comments #14 and #16? It is not clear from comment #17
which BIOS worked for Mikhail.

Or did I just miss some patch or magic config setting?


According to $subject, I have a wild theory that the issue
might be caused by the Phoenix "FailSafe" BIOS feature.
First of all, there is no indication on Samsung's web pages
anywhere that the N130 has it, but I had looked at the pre-installed
Windows before I wiped it (and did the BIOS update while at it),
and it had a "Phoenix FailSafe" icon prominently placed on the desktop.
Some searching yields:
http://www.failsafe.com/samsung-partner-profile

And after some more searching:
http://www.blackhat.com/presentations/bh-usa-09/ORTEGA/BHUSA09-Ortega-DeactivateRootkit-PAPER.pdf
which is about Computrace which appears to be a similar technology.

What I think happens is that the BIOS waits to be contacted
by the Windows "FailSafe" agent, and if that doesn't happen then the BIOS
tries to reinstall the agent into the Windows installation. Thus it
accesses the disk behind the OS's back, and this causes the ATA exception.
When FailSafe isn't activated by the user then the BIOS shouldn't do that,
but the BIOS developers might have thought differently...

Note however that the BIOS setup screen contains no indication about FailSafe
support, so this is all wild guessing. It would fit in with the "special
Linux BIOS" info, though. But I had hoped that the "SECLINUX" mode
would also disable it.


Thanks
Johannes


Attachments:
samsung-laptop-driver.patch (16.80 kB)

2009-11-28 19:22:16

by Greg KH

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On Thu, Nov 26, 2009 at 05:42:12PM +0100, Johannes Stezenbach wrote:
> Hi,
>
> I'm referring to
> http://bugzilla.kernel.org/show_bug.cgi?id=14314
> and I still have this issue on a N130 with latest BIOS (05CM),
> running kernel 2.6.32-rc8 + wireless-testing.
>
> BIOS Information
> Vendor: Phoenix Technologies Ltd.
> Version: 05CM.M011.20091013.JIP
> Release Date: 10/13/2009
> Address: 0xE6300
> Runtime Size: 105728 bytes
> ROM Size: 2048 kB
> Characteristics:
> ISA is supported
> PCI is supported
> PNP is supported
> BIOS is upgradeable
> BIOS shadowing is allowed
> ESCD support is available
> ACPI is supported
> USB legacy is supported
> Smart battery is supported
> BIOS boot specification is supported
> Targeted content distribution is supported
> BIOS Revision: 5.0
>
> Around 5min after boot or resume it generates the following error:
>
> [ 302.364174] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> [ 302.364201] ata1.00: failed command: WRITE DMA
> [ 302.364234] ata1.00: cmd ca/00:08:f7:01:1a/00:00:00:00:00/e0 tag 0 dma 4096 out
> [ 302.364241] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [ 302.364257] ata1.00: status: { DRDY }
> [ 307.408107] ata1: link is slow to respond, please be patient (ready=0)
> [ 312.392109] ata1: device not ready (errno=-16), forcing hardreset
> [ 312.392138] ata1: soft resetting link
> [ 312.574482] ata1.00: configured for UDMA/133
> [ 312.574506] ata1.00: device reported invalid CHS sector 0
> [ 312.574542] ata1: EH complete

This is because after 5 minutes, the BIOS implements C states in the
processor, which causes a "hiccup" in userspace. Everything should be
fine after this, and most importantly, the power usage drops by a few
watts.

> This also happens when booting with rdinit=/bin/sh, i.e. only running busybox sh
> inside initrd. The error then appears when accessing the disk after the 5min
> period with dd if=/dev/sda of=/dev/null count=10000.

Yup, see above for why.

Samsung does this to make booting their BIOS faster.

> The link in comment #14 is dead but eventually I found
> http://download.opensuse.org/repositories/Moblin:/Base/openSUSE_11.1/src/kernel-source-2.6.31.6-37.1.src.rpm
> which contains the attached patch with a samsung_laptop driver.
>
> I think it is weird that the Samsung BIOS has a special "SECLINUX" mode,
> but anyway the samsung_laptop driver works (the backlight control via ACPI
> also works with the 05CM BIOS, though).

Yes, but Samsung does not support ACPI at this time, even though it is
in their latest BIOS versions (experimental stuff, needed for Windows 7
mode or something...)

And yes, a "special" Linux mode is weird, but at least they gave us
something that works :)

> However, it does not prevent the ATA exception.

Yup, it's not an issue though.

> (Side note about backlight level 0: I noticed that in Windows when you
> set the backlight to the lowest level, after a minute of inactivity
> the screen would dim one level more. Stupid -- why not allow the user
> to choose that level manually?)

Talk to Samsung about this. There is one more lower level the BIOS can
go to, which is what Windows does here. Samsung doesn't want Linux to
use that mode at this time. It only saves a bit less than .1W, so it's
not that big of a deal.

Glad it's all working for you now.

greg k-h

2009-11-28 20:30:39

by Robert Hancock

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On 11/28/2009 01:19 PM, Greg KH wrote:
> On Thu, Nov 26, 2009 at 05:42:12PM +0100, Johannes Stezenbach wrote:
>> Hi,
>>
>> I'm referring to
>> http://bugzilla.kernel.org/show_bug.cgi?id=14314
>> and I still have this issue on a N130 with latest BIOS (05CM),
>> running kernel 2.6.32-rc8 + wireless-testing.
>>
>> BIOS Information
>> Vendor: Phoenix Technologies Ltd.
>> Version: 05CM.M011.20091013.JIP
>> Release Date: 10/13/2009
>> Address: 0xE6300
>> Runtime Size: 105728 bytes
>> ROM Size: 2048 kB
>> Characteristics:
>> ISA is supported
>> PCI is supported
>> PNP is supported
>> BIOS is upgradeable
>> BIOS shadowing is allowed
>> ESCD support is available
>> ACPI is supported
>> USB legacy is supported
>> Smart battery is supported
>> BIOS boot specification is supported
>> Targeted content distribution is supported
>> BIOS Revision: 5.0
>>
>> Around 5min after boot or resume it generates the following error:
>>
>> [ 302.364174] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> [ 302.364201] ata1.00: failed command: WRITE DMA
>> [ 302.364234] ata1.00: cmd ca/00:08:f7:01:1a/00:00:00:00:00/e0 tag 0 dma 4096 out
>> [ 302.364241] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> [ 302.364257] ata1.00: status: { DRDY }
>> [ 307.408107] ata1: link is slow to respond, please be patient (ready=0)
>> [ 312.392109] ata1: device not ready (errno=-16), forcing hardreset
>> [ 312.392138] ata1: soft resetting link
>> [ 312.574482] ata1.00: configured for UDMA/133
>> [ 312.574506] ata1.00: device reported invalid CHS sector 0
>> [ 312.574542] ata1: EH complete
>
> This is because after 5 minutes, the BIOS implements C states in the
> processor, which causes a "hic-up" in userspace. Everything should be
> fine after this, and most importantly, the power usage drops by a few
> watts, which is most important.

Why does this "hiccup" seem to cause interrupts to get lost? This would
cause an up to 30-second stall in disk I/O.

>
>> This also happens when booting with rdinit=/bin/sh, i.e. only running busybox sh
>> inside initrd. The error then appears when accessing the disk after the 5min
>> period with dd if=/dev/sda of=/dev/null count=10000.
>
> Yup, see above for why.
>
> Samsung does this to make booting their BIOS faster.

Ugh.. Seriously?

>
>> The link in comment #14 is dead but eventually I found
>> http://download.opensuse.org/repositories/Moblin:/Base/openSUSE_11.1/src/kernel-source-2.6.31.6-37.1.src.rpm
>> which contains the attached patch with a samsung_laptop driver.
>>
>> I think it is weird that the Samsung BIOS has a special "SECLINUX" mode,
>> but anyway the samsung_laptop driver works (the backlight control via ACPI
>> also works with the 05CM BIOS, though).
>
> Yes, but Samsung does not support ACPI at this time, even though it is
> in their latest bios versions (experimental stuff, needed for Windows 7
> mode or something...)

ACPI support would seem much preferable to implementing power management
with such strange proprietary hacks..

>
> And yes, a "special" linux mode is weird, but at least they gave us
> something that works :)
>
>> However, it does not prevent the ATA exception.
>
> Yup, it's not an issue though.
>
>> (Side note about backlight level 0: I noticed that in Windows when you
>> set the backlight to the lowest level, after a minute of inactivity
>> the screen would dim one level more. Stupid -- why not allow the user
>> to choose that level manually?)
>
> Talk to samsung about this. There is one more lower level the BIOS can
> go to, which is what Windows does here. Samsung doesn't want Linux to
> use that mode at this time. It only saves a bit less than .1W, so it's
> not that big of a deal.
>
> Glad it's all working for you now.
>
> greg k-h

2009-11-28 21:34:42

by Greg KH

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On Sat, Nov 28, 2009 at 02:30:38PM -0600, Robert Hancock wrote:
> On 11/28/2009 01:19 PM, Greg KH wrote:
>> On Thu, Nov 26, 2009 at 05:42:12PM +0100, Johannes Stezenbach wrote:
>>> Hi,
>>>
>>> I'm referring to
>>> http://bugzilla.kernel.org/show_bug.cgi?id=14314
>>> and I still have this issue on a N130 with latest BIOS (05CM),
>>> running kernel 2.6.32-rc8 + wireless-testing.
>>>
>>> BIOS Information
>>> Vendor: Phoenix Technologies Ltd.
>>> Version: 05CM.M011.20091013.JIP
>>> Release Date: 10/13/2009
>>> Address: 0xE6300
>>> Runtime Size: 105728 bytes
>>> ROM Size: 2048 kB
>>> Characteristics:
>>> ISA is supported
>>> PCI is supported
>>> PNP is supported
>>> BIOS is upgradeable
>>> BIOS shadowing is allowed
>>> ESCD support is available
>>> ACPI is supported
>>> USB legacy is supported
>>> Smart battery is supported
>>> BIOS boot specification is supported
>>> Targeted content distribution is supported
>>> BIOS Revision: 5.0
>>>
>>> Around 5min after boot or resume it generates the following error:
>>>
>>> [ 302.364174] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>> [ 302.364201] ata1.00: failed command: WRITE DMA
>>> [ 302.364234] ata1.00: cmd ca/00:08:f7:01:1a/00:00:00:00:00/e0 tag 0 dma 4096 out
>>> [ 302.364241] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>> [ 302.364257] ata1.00: status: { DRDY }
>>> [ 307.408107] ata1: link is slow to respond, please be patient (ready=0)
>>> [ 312.392109] ata1: device not ready (errno=-16), forcing hardreset
>>> [ 312.392138] ata1: soft resetting link
>>> [ 312.574482] ata1.00: configured for UDMA/133
>>> [ 312.574506] ata1.00: device reported invalid CHS sector 0
>>> [ 312.574542] ata1: EH complete
>>
>> This is because after 5 minutes, the BIOS implements C states in the
>> processor, which causes a "hic-up" in userspace. Everything should be
>> fine after this, and most importantly, the power usage drops by a few
>> watts, which is most important.
>
> Why does this "hiccup" seem to cause interrupts to get lost? This would
> cause an up to 30-second stall in disk I/O.

Yup, it does.

>>> This also happens when booting with rdinit=/bin/sh, i.e. only running busybox sh
>>> inside initrd. The error then appears when accessing the disk after the 5min
>>> period with dd if=/dev/sda of=/dev/null count=10000.
>>
>> Yup, see above for why.
>>
>> Samsung does this to make booting their BIOS faster.
>
> Ugh.. Seriously?

Seriously. It's a BIOS issue, and is the way that Samsung has
implemented this. There is nothing that the OS can do about it.
Windows has the same "issue" here.

>>> The link in comment #14 is dead but eventually I found
>>> http://download.opensuse.org/repositories/Moblin:/Base/openSUSE_11.1/src/kernel-source-2.6.31.6-37.1.src.rpm
>>> which contains the attached patch with a samsung_laptop driver.
>>>
>>> I think it is weird that the Samsung BIOS has a special "SECLINUX" mode,
>>> but anyway the samsung_laptop driver works (the backlight control via ACPI
>>> also works with the 05CM BIOS, though).
>>
>> Yes, but Samsung does not support ACPI at this time, even though it is
>> in their latest bios versions (experimental stuff, needed for Windows 7
>> mode or something...)
>
> ACPI support would seem much preferable to implementing power management
> with such strange proprietary hacks..

I do not disagree with you at all about this. This has been
communicated to Samsung, but at this point in time, they are not going
to support ACPI and only want Linux to use this interface.

thanks,

greg k-h

2009-11-28 22:20:53

by Johannes Stezenbach

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On Sat, Nov 28, 2009 at 01:34:46PM -0800, Greg KH wrote:
> On Sat, Nov 28, 2009 at 02:30:38PM -0600, Robert Hancock wrote:
> > On 11/28/2009 01:19 PM, Greg KH wrote:
> >> On Thu, Nov 26, 2009 at 05:42:12PM +0100, Johannes Stezenbach wrote:
> >>>
> >>> Around 5min after boot or resume it generates the following error:
> >>>
> >>> [ 302.364174] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> >>> [ 302.364201] ata1.00: failed command: WRITE DMA
> >>> [ 302.364234] ata1.00: cmd ca/00:08:f7:01:1a/00:00:00:00:00/e0 tag 0 dma 4096 out
> >>> [ 302.364241] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> >>> [ 302.364257] ata1.00: status: { DRDY }
> >>> [ 307.408107] ata1: link is slow to respond, please be patient (ready=0)
> >>> [ 312.392109] ata1: device not ready (errno=-16), forcing hardreset
> >>> [ 312.392138] ata1: soft resetting link
> >>> [ 312.574482] ata1.00: configured for UDMA/133
> >>> [ 312.574506] ata1.00: device reported invalid CHS sector 0
> >>> [ 312.574542] ata1: EH complete
> >>
> >> This is because after 5 minutes, the BIOS implements C states in the
> >> processor, which causes a "hic-up" in userspace. Everything should be
> >> fine after this, and most importantly, the power usage drops by a few
> >> watts, which is most important.
> >
> > Why does this "hiccup" seem to cause interrupts to get lost? This would
> > cause an up to 30-second stall in disk I/O.
>
> Yup, it does.
>
> >>> This also happens when booting with rdinit=/bin/sh, i.e. only running busybox sh
> >>> inside initrd. The error then appears when accessing the disk after the 5min
> >>> period with dd if=/dev/sda of=/dev/null count=10000.
> >>
> >> Yup, see above for why.
> >>
> >> Samsung does this to make booting their BIOS faster.
> >
> > Ugh.. Seriously?
>
> Seriously. It's a BIOS issue, and is the way that Samsung has
> implemented this. There is nothing that the OS can do about it.
> Windows has the same "issue" here.

Um, "how does Windows handle this" would've been my next question.
I didn't do much with Windows before I wiped it, I mainly used it
to confirm the thing works and to do the BIOS update. I did go through
a few reboot cycles though and I think I would've noticed a 30 second
hang. Maybe they just lowered the ATA timeout to hide it or something.
Otherwise I guess google would turn up complaints from Windows users, too,
not just from Linux users.

BTW, at 5min after boot it is 99% guaranteed that this ATA
exception will happen during the occasional fsck. That
doesn't feel right.

> I do not disagree with you at all about this. This has been
> communicated to Samsung, but at this point in time, they are not going
> to support ACPI and only want Linux to use this interface.

Well, at least they made the information for the SECLINUX interface
available (to you), but it would be better if they'd support ACPI.
I mean if they have to support it for Windows 7 anyway, then
what's the point of not supporting it for other OSs?


While we're at it I have another question: When running on battery the
ethernet throughput drops to ~25Mbit/s. After a bit of experimenting
I found that this is connected to a BIOS entry about "CPU Power
Saving Mode". lspci shows that this changes "LnkCtl: ASPM L1
Enabled" to "LnkCtl: ASPM L0s L1 Enabled". Having this config
option in the BIOS is inflexible. IIRC there was an app in
Windows which allowed configuring it at runtime. Do you
know how to do it in Linux?
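
To compare the two BIOS settings without reading lspci output by eye,
a small parser for the "LnkCtl" line can help. This is a minimal sketch
that assumes the line has the shape quoted above ("LnkCtl: ASPM ...
Enabled", or "LnkCtl: ASPM Disabled" when nothing is on). The function
name is made up for illustration.

```python
import re


def aspm_states(lnkctl_line):
    """Return the set of ASPM link states that an lspci LnkCtl line shows as enabled."""
    match = re.search(r"ASPM\s+(.*?)\s*(Enabled|Disabled)", lnkctl_line)
    if match is None:
        raise ValueError("no ASPM field found in line")
    if match.group(2) == "Disabled":
        return set()
    return set(match.group(1).split())


if __name__ == "__main__":
    on_ac = aspm_states("LnkCtl: ASPM L1 Enabled")
    on_battery = aspm_states("LnkCtl: ASPM L0s L1 Enabled")
    # The difference is the state that the BIOS option adds.
    print(sorted(on_battery - on_ac))
```

With the two strings from this mail, the only difference is L0s, so
that is the state to look at when chasing the throughput drop.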


Thanks
Johannes

2009-11-29 00:17:42

by Greg KH

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On Sat, Nov 28, 2009 at 11:22:03PM +0100, Johannes Stezenbach wrote:
> > Seriously. It's a BIOS issue, and is the way that Samsung has
> > implemented this. There is nothing that the OS can do about it.
> > Windows has the same "issue" here.
>
> Um, "how does Windows handle this" would've been my next question.
> I didn't do much with Windows before I wiped it, I mainly used it
> to confirm the thing works and to do the BIOS update. I did go through
> a few reboot cycles though and I think I would've noticed a 30 second
> hang. Maybe they just lowered the ATA timeout to hide it or something.
> Otherwise I guess google would turn up complaints from Windows users, too,
> not just from Linux users.
>
> BTW, at 5min after boot it is 99% guaranteed that this ATA
> exception will happen during the occasional fsck. That
> doesn't feel right.

Well, I've never been doing a fsck at 5 minutes into boot, and neither do
most Windows users :)

> > I do not disagree with you at all about this. This has been
> > communicated to Samsung, but at this point in time, they are not going
> > to support ACPI and only want Linux to use this interface.
>
> Well, at least they made the information for the SECLINUX interface
> available (to you),

You have as much information for the SECLINUX interface as I do at
this point in time. It is all documented in the driver. Actually, it's
documented better in the driver than the "hints" they originally
provided me...

> but it would be better they'd support ACPI.
> I mean if they have to support it for Windows 7 anyway, then
> what's the point of not supporting it for other OSs?

Hey, no argument from me here, but I think the main issue is that they
do not officially support Windows 7 on this platform yet either. Hence,
no ACPI support. The ACPI support in the latest BIOS seems very rough,
I think this is the first time they have ever implemented ACPI, so I
would not count on it working properly just yet.

> While we're at it I have another question: When running on battery the
> ethernet throughput drops to ~25Mbit/s. After a bit of experimenting
> I found that this is connected to a BIOS entry about "CPU Power
> Saving Mode". lspci shows that this changes "LnkCtl: ASPM L1
> Enabled" to "LnkCtl: ASPM L0s L1 Enabled". Having this config
> option in the BIOS is inflexible. IIRC there was an app in
> Windows which allows to configure it at runtime. Do you
> know how to do it in Linux?

I do not know anything about this. Are you saying that Windows would
allow you to turn the throughput back up at the expense of battery life
through an application? Do you know what that application is called and
where I could find it?

thanks,

greg k-h

2009-11-29 00:50:55

by Johannes Stezenbach

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On Sat, Nov 28, 2009 at 04:17:45PM -0800, Greg KH wrote:
> On Sat, Nov 28, 2009 at 11:22:03PM +0100, Johannes Stezenbach wrote:
> >
> > BTW, at 5min after boot it is 99% guaranteed that this ATA
> > exception will happen during the occasional fsck. That
> > doesn't feel right.
>
> Well, I've never been doing a fsck at 5 minutes into boot, and neither do
> most Windows users :)

I've been through a lot of reboots with all the testing, and
the fsck took longer than 5min, and the ATA exception struck.
fsck continued after the 30 second stall and succeeded.

> > but it would be better they'd support ACPI.
> > I mean if they have to support it for Windows 7 anyway, then
> > what's the point of not supporting it for other OSs?
>
> Hey, no argument from me here, but I think the main issue is that they
> do not officially support Windows 7 on this platform yet either. Hence,
> no ACPI support. The ACPI support in the latest BIOS seems very rough,
> I think this is the first time they have ever implemented ACPI, so I
> would not count on it working properly just yet.

You can buy it officially with Windows 7 now, it seems to be fully supported.
http://www.samsung.com/us/consumer/office/mobile-computing/netbooks/NP-N130-KA04US/index.idx?pagetype=prd_detail&tab=support

> > While we're at it I have another question: When running on battery the
> > ethernet throughput drops to ~25Mbit/s. After a bit of experimenting
> > I found that this is connected to a BIOS entry about "CPU Power
> > Saving Mode". lspci shows that this changes "LnkCtl: ASPM L1
> > Enabled" to "LnkCtl: ASPM L0s L1 Enabled". Having this config
> > option in the BIOS is inflexible. IIRC there was an app in
> > Windows which allows to configure it at runtime. Do you
> > know how to do it in Linux?
>
> I do not know anything about this. Are you saying that Windows would
> allow you to turn the throughput back up at the expense of battery life
> through an application? Do you know what that application is called and
> where I could find it?

I haven't actually tested throughput in Windows, but ISTR that there was
a Samsung app where you could configure various power saving features.
Might have been the "SAMSUNG Battery Manager".


Thanks,
Johannes

2009-11-30 08:53:16

by Tejun Heo

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On 11/29/2009 09:51 AM, Johannes Stezenbach wrote:
> On Sat, Nov 28, 2009 at 04:17:45PM -0800, Greg KH wrote:
>> On Sat, Nov 28, 2009 at 11:22:03PM +0100, Johannes Stezenbach wrote:
>>>
>>> BTW, at 5min after boot it is 99% guaranteed that this ATA
>>> exception will happen during the occasional fsck. That
>>> doesn't feel right.
>>
>> Well, I've never been doing a fsck at 5 minutes into boot, and neither do
>> most Windows users :)
>
> I've been through a lot of reboots with all the testing, and
> the fsck took longer than 5min, and the ATA exception struck.
> fsck continued after the 30 second stall and succeeded.

The timeout will happen if the C state switching happens while an ATA
command is in flight, so unless there's heavy IO load, it's not very
likely to hit. Another factor is that Windows uses a shorter IO timeout
(I think it's 7 secs or 15, I'm not sure) so it's gonna be less
noticeable when it happens. Hmmm.... there were talks about
shortening the timeout. Maybe it's about time we actually do that.
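
Until the default changes, the per-device command timeout can be
inspected and lowered from userspace. The sketch below assumes the disk
exposes the usual SCSI-layer attribute at /sys/block/<disk>/device/timeout
(in seconds) and that the caller has permission to write it. The helper
names and the sysfs_root parameter are illustrative; the parameter exists
only so the code can be tried against a scratch directory.

```python
from pathlib import Path


def io_timeout_path(disk, sysfs_root="/sys/block"):
    """Path of the per-device command timeout attribute (assumed layout)."""
    return Path(sysfs_root) / disk / "device" / "timeout"


def read_io_timeout(disk, sysfs_root="/sys/block"):
    """Return the current command timeout for the disk, in seconds."""
    return int(io_timeout_path(disk, sysfs_root).read_text().strip())


def write_io_timeout(disk, seconds, sysfs_root="/sys/block"):
    """Set a new command timeout in seconds; normally requires root."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    io_timeout_path(disk, sysfs_root).write_text(f"{seconds}\n")


if __name__ == "__main__":
    # Read-only by default; lowering the value is left to the user.
    print(read_io_timeout("sda"))
```

A shorter value only hides the stall sooner. It does not prevent the
lost interrupt, and a value that is too low may cause false timeouts on
slow devices.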

Thanks.

--
tejun

2009-11-30 10:19:54

by Johannes Stezenbach

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On Mon, Nov 30, 2009 at 05:52:59PM +0900, Tejun Heo wrote:
> On 11/29/2009 09:51 AM, Johannes Stezenbach wrote:
> > On Sat, Nov 28, 2009 at 04:17:45PM -0800, Greg KH wrote:
> >> On Sat, Nov 28, 2009 at 11:22:03PM +0100, Johannes Stezenbach wrote:
> >>>
> >>> BTW, at 5min after boot it is 99% guaranteed that this ATA
> >>> exception will happen during the occasional fsck. That
> >>> doesn't feel right.
> >>
> >> Well, I've never been doing a fsck at 5 minutes into boot, and neither do
> >> most Windows users :)
> >
> > I've been through a lot of reboots with all the testing, and
> > the fsck took longer than 5min, and the ATA exception struck.
> > fsck continued after the 30 second stall and succeeded.
>
> The timeout will happen if the C state switching happens while ATA
> command is in flight so unless there's heavy IO load, it's not very
> likely to hit.

I've booted with rdinit=/bin/sh (busybox sh in initramfs), so
nothing accesses the disk. After 5min the C state switch
happens, but no ATA exception. I waited a few more minutes
and then used dd if=/dev/sda of=/dev/null -> ATA exception.
There is no way to escape it.


Johannes

2009-11-30 11:06:27

by Tejun Heo

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

Hello,

On 11/30/2009 07:21 PM, Johannes Stezenbach wrote:
> I've booted with rdinit=/bin/sh (busybox sh in initramfs), so
> nothing accesses the disk. After 5min the C state switch
> happens, but no ATA exception. I waited a few more minutes
> and then used dd if=/dev/sda of=/dev/null -> ATA exception.
> There is no way to escape it.

Ah, okay, so whatever follows the switch times out.

I have no idea what to do about it at this point. :-(

--
tejun

2009-12-30 12:04:31

by Hans Werner

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?


> Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

> Hello,
>
> On 11/30/2009 07:21 PM, Johannes Stezenbach wrote:
> > I've booted with rdinit=/bin/sh (busybox sh in initramfs), so
> > nothing accesses the disk. After 5min the C state switch
> > happens, but no ATA exception. I waited a few more minutes
> > and then used dd if=/dev/sda of=/dev/null -> ATA exception.
> > There is no way to escape it.
>
> Ah, okay, so whatever follows the switch times out.
>
> I have no idea what to do about it at this point. :-(
>
> --
> tejun

Tejun,

testing in the Arch Linux Forums has shown that if one applies
a patch which you posted to the linux-ide ML on 2008-11-21
then the problem is no longer seen. Instead the kernel log shows
that a spurious IRQ was cleared.

[PATCH #upstraem-fixes] ata_piix: detect and clear spurious IRQs
http://marc.info/?l=linux-ide&m=122724081603679&w=2
http://bbs.archlinux.org/viewtopic.php?id=86454

What's the current status of this patch? Is it safe to use?
What does it tell us about the Samsung N130/140?

Regards,
Hans

--
Release early, release often.


2010-01-03 22:11:52

by Johannes Stezenbach

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On Wed, Dec 30, 2009 at 01:04:28PM +0100, Hans Werner wrote:
>
> testing in the Arch Linux Forums has shown that if one applies
> a patch which you posted to the linux-ide ML on 2008-11-21
> then the problem is no longer seen. Instead the kernel log shows
> that a spurious IRQ was cleared.
>
> [PATCH #upstraem-fixes] ata_piix: detect and clear spurious IRQs
> http://marc.info/?l=linux-ide&m=122724081603679&w=2
> http://bbs.archlinux.org/viewtopic.php?id=86454
>
> What's the current status of this patch? Is it safe to use?
> What does it tell us about the Samsung N130/140?

FWIW, I just tested a current git kernel (v2.6.33-rc2-268-g45d28b0)
with Tejun's patch applied on my N130. The ATA exception and hang are
indeed gone, just "ata1: clearing spurious IRQ" is logged.

I've seen Tejun's comment in
http://bugzilla.kernel.org/show_bug.cgi?id=14314
and I would like to add that the ATA irq is not shared.

14: 43885 0 IO-APIC-edge ata_piix
15: 0 0 IO-APIC-edge ata_piix
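
For anyone who wants to check this on their own machine, the sketch
below parses lines in the /proc/interrupts layout shown above (IRQ
number, one count per CPU, controller type, then the handler names). It
assumes that a shared line lists its handlers separated by commas. The
function name is illustrative, and the second sample line in the usage
part is hypothetical, not taken from the N130.

```python
def parse_irq_line(line):
    """Split one /proc/interrupts line into IRQ, per-CPU counts and handlers."""
    fields = line.split()
    irq = fields[0].rstrip(":")
    counts = []
    idx = 1
    # The per-CPU counters are the run of purely numeric fields after the IRQ.
    while idx < len(fields) and fields[idx].isdigit():
        counts.append(int(fields[idx]))
        idx += 1
    chip = fields[idx] if idx < len(fields) else ""
    names = " ".join(fields[idx + 1:])
    handlers = [name.strip() for name in names.split(",") if name.strip()]
    return {
        "irq": irq,
        "counts": counts,
        "chip": chip,
        "handlers": handlers,
        "shared": len(handlers) > 1,
    }


if __name__ == "__main__":
    print(parse_irq_line("14: 43885 0 IO-APIC-edge ata_piix"))
    # Hypothetical example of a shared line, for comparison only.
    print(parse_irq_line("16: 10 0 IO-APIC-fasteoi uhci_hcd:usb1, ehci_hcd:usb2"))
```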


Thanks,
Johannes

2010-01-03 22:58:27

by Tejun Heo

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

Hello,

On 01/04/2010 07:11 AM, Johannes Stezenbach wrote:
> On Wed, Dec 30, 2009 at 01:04:28PM +0100, Hans Werner wrote:
>>
>> testing in the Arch Linux Forums has shown that if one applies
>> a patch which you posted to the linux-ide ML on 2008-11-21
>> then the problem is no longer seen. Instead the kernel log shows
>> that a spurious IRQ was cleared.
>>
>> [PATCH #upstraem-fixes] ata_piix: detect and clear spurious IRQs
>> http://marc.info/?l=linux-ide&m=122724081603679&w=2
>> http://bbs.archlinux.org/viewtopic.php?id=86454
>>
>> What's the current status of this patch? Is it safe to use?
>> What does it tell us about the Samsung N130/140?
>
> FWIW, I just tested a current git kernel (v2.6.33-rc2-268-g45d28b0)
> with Tejun's patch applied on my N130. The ATA exception and hang is
> indeed gone, just "ata1: clearing spurious IRQ" is logged.
>
> I've seen Tejun's comment in
> http://bugzilla.kernel.org/show_bug.cgi?id=14314
> and I would like to add that the ATA irq is not shared.
>
> 14: 43885 0 IO-APIC-edge ata_piix
> 15: 0 0 IO-APIC-edge ata_piix

Can you please post that on bug#14314? If the IRQ line indeed wasn't
shared, it might mean that the controller raised the IRQ line before
getting its internal state in order and the IRQ checking sequence
cleared the external IRQ status while leaving the internal pending bit
intact, which I've never heard of on piix and don't think is possible.
Hmmmm... given that the problem was dependent on BIOS on the other
model (is it the N130?), maybe the BIOS is doing something funny. :-(
Anyways, please post full output of "cat /proc/interrupts" at the bz.

Thanks.

--
tejun

2010-01-04 00:41:54

by Johannes Stezenbach

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

On Mon, Jan 04, 2010 at 07:59:59AM +0900, Tejun Heo wrote:
> On 01/04/2010 07:11 AM, Johannes Stezenbach wrote:
> >
> > FWIW, I just tested a current git kernel (v2.6.33-rc2-268-g45d28b0)
> > with Tejun's patch applied on my N130. The ATA exception and hang is
> > indeed gone, just "ata1: clearing spurious IRQ" is logged.
> >
> > I've seen Tejun's comment in
> > http://bugzilla.kernel.org/show_bug.cgi?id=14314
> > and I would like to add that the ATA irq is not shared.
> >
> > 14: 43885 0 IO-APIC-edge ata_piix
> > 15: 0 0 IO-APIC-edge ata_piix
>
> Can you please post that on bug#14314? If the IRQ line indeed wasn't
> shared, it might mean that the controller raised the IRQ line before
> getting its internal state in order and the IRQ checking sequence
> cleared the external IRQ status while leaving the internal pending bit
> intact, which I've never heard of on piix and don't think is possible.
> Hmmmm... given that the problem was dependent on BIOS on the other
> model (is it the N130?), maybe the BIOS is doing something funny. :-(
> Anyways, please post full output of "cat /proc/interrupts" at the bz.

I've updated bug#14314. It is certainly a BIOS issue, but the
"spurious IRQ" check deals way better with it than the previous
30sec hang waiting for timeout. IMHO the patch should be merged
into mainline asap. Or do you think it has any downside?


Thanks
Johannes

2010-01-04 00:51:51

by Tejun Heo

Subject: Re: Samsung N130 ATA exception after 5min uptime -- Phoenix FailSafe issue?

Hello,

On 01/04/2010 09:41 AM, Johannes Stezenbach wrote:
> I've updated bug#14314. It is certainly a BIOS issue, but the
> "spurious IRQ" check deals way better with it than the previous
> 30sec hang waiting for timeout. IMHO the patch should be merged
> into mainline asap. Or do you think it has any downside?

The danger is that when the hardware is actually malfunctioning and
causing an IRQ storm, the driver will tell the IRQ subsystem that the
IRQs aren't spurious. This will bypass the IRQ storm detection logic
and lead to a complete system lockup under such conditions. A small
modification to the patch should remove that problem. I'll post an
updated patch soon.
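
The concern can be illustrated with a toy model. This is a simplified
sketch, not the kernel's code: the window size and threshold are
assumptions chosen for illustration, and the class and function names
are invented. It only shows the mechanism, namely that a detector which
counts unhandled interrupts never trips if the driver reports every
interrupt as handled.

```python
WINDOW = 100_000          # interrupts per evaluation window (assumed value)
UNHANDLED_LIMIT = 99_900  # unhandled count that trips the detector (assumed value)


class IrqLine:
    """Toy model of storm detection based on unhandled-interrupt counts."""

    def __init__(self):
        self.count = 0
        self.unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        if self.disabled:
            return
        self.count += 1
        if not handled:
            self.unhandled += 1
        if self.count < WINDOW:
            return
        # End of window: disable the line if almost nothing was handled.
        if self.unhandled > UNHANDLED_LIMIT:
            self.disabled = True
        self.count = 0
        self.unhandled = 0


def storm_is_caught(driver_claims_handled, interrupts=200_000):
    """Simulate a storm of bogus interrupts; return True if the line gets disabled."""
    line = IrqLine()
    for _ in range(interrupts):
        line.note_interrupt(handled=driver_claims_handled)
    return line.disabled


if __name__ == "__main__":
    print(storm_is_caught(driver_claims_handled=False))  # detector can step in
    print(storm_is_caught(driver_claims_handled=True))   # detector is bypassed
```

In this model, a driver that clears a spurious IRQ should still report
it as unhandled so that a real storm gets caught.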

Thanks.

--
tejun