On 2021/11/16 21:44, Yuji Nakao wrote:
> Hello,
>
> I'm using Arch Linux on MacBook Air 2010. I updated `linux` package[1]
> from v5.14.16 to v5.15.2 the other day, and the boot process stalled
> with the following message.
>
> ```shell
> :: running early hook [udev]
> Starting version 249.6-3-arch
> :: running hook [udev]
> :: Triggering uevents...
> Waiting 10 seconds for device /dev/sda3 ...
> ERROR: device '/dev/sda3' not found. Skipping fsck.
> :: mounting '/dev/sda' on real root
> mount: /new_root: no filesystem type specified.
> You are now being dropped into an emergency shell.
> sh: can't access tty; job control turned off
> [rootfs ]#
> ```
>
> In the emergency shell there's no `sda` devices when I type `$ ls
> /dev/`. By downgrading the kernel, boot process works properly.
>
> See also Arch Linux bug tracker[2]. There are similar reports on
> Apple devices.
>
> `dmesg` output in the emergency shell is attached. I guess this issue is
> related to libata, so CCed to Damien Le Moal.
I think that this problem is due to recent PCI subsystem changes which broke Mac
support. The problem shows up as interrupts not being delivered, which in
turn results in the kernel assuming that the drive is not working (see the
timeout error messages in your dmesg output). Hence your boot drive detection
fails and there is no rootfs to mount.
Adding linux-pci list.
>
> Regards.
>
> [1] https://archlinux.org/packages/core/x86_64/linux/
> [2] https://bugs.archlinux.org/task/72734
>
>
--
Damien Le Moal
Western Digital Research
[+CC Arnd, Bjorn, Marc and Sasha for visibility]
Hello Damien and Yuji,
[...]
> > I'm using Arch Linux on MacBook Air 2010. I updated `linux` package[1]
> > from v5.14.16 to v5.15.2 the other day, and the boot process stalled
> > with the following message.
> >
> > ```shell
> > :: running early hook [udev]
> > Starting version 249.6-3-arch
> > :: running hook [udev]
> > :: Triggering uevents...
> > Waiting 10 seconds for device /dev/sda3 ...
> > ERROR: device '/dev/sda3' not found. Skipping fsck.
> > :: mounting '/dev/sda' on real root
> > mount: /new_root: no filesystem type specified.
> > You are now being dropped into an emergency shell.
> > sh: can't access tty; job control turned off
> > [rootfs ]#
> > ```
> >
> > In the emergency shell there's no `sda` devices when I type `$ ls
> > /dev/`. By downgrading the kernel, boot process works properly.
> >
> > See also Arch Linux bug tracker[2]. There are similar reports on
> > Apple devices.
> >
> > `dmesg` output in the emergency shell is attached. I guess this issue is
> > related to libata, so CCed to Damien Le Moal.
>
> I think that this problem is due to recent PCI subsystem changes which broke Mac
> support. The problem shows up as interrupts not being delivered, which in
> turn results in the kernel assuming that the drive is not working (see the
> timeout error messages in your dmesg output). Hence your boot drive detection
> fails and there is no rootfs to mount.
>
> Adding linux-pci list.
>
>
>
> >
> > Regards.
> >
> > [1] https://archlinux.org/packages/core/x86_64/linux/
> > [2] https://bugs.archlinux.org/task/72734
The error in the dmesg output (see [2] where the log file is attached)
looks similar to the problem reported a week or so ago, as per:
https://lore.kernel.org/linux-pci/[email protected]/
The problematic commits were reverted by Bjorn and the Pull Request that
did it was accepted, as per:
https://lore.kernel.org/linux-pci/20211111195040.GA1345641@bhelgaas/
Thus, this would have made its way into 5.16-rc1, I suppose. We might have to
back-port this to the stable and long-term kernels.
Yuji, could you, if you have some time to spare, try 5.16-rc1 to see if
this has gotten better on your system?
Krzysztof
On 2021/11/17 8:26, Krzysztof Wilczyński wrote:
> [+CC Arnd, Bjorn, Marc and Sasha for visibility]
>
> Hello Damien and Yuji,
>
> [...]
>>> I'm using Arch Linux on MacBook Air 2010. I updated `linux` package[1]
>>> from v5.14.16 to v5.15.2 the other day, and the boot process stalled
>>> with the following message.
>>>
>>> ```shell
>>> :: running early hook [udev]
>>> Starting version 249.6-3-arch
>>> :: running hook [udev]
>>> :: Triggering uevents...
>>> Waiting 10 seconds for device /dev/sda3 ...
>>> ERROR: device '/dev/sda3' not found. Skipping fsck.
>>> :: mounting '/dev/sda' on real root
>>> mount: /new_root: no filesystem type specified.
>>> You are now being dropped into an emergency shell.
>>> sh: can't access tty; job control turned off
>>> [rootfs ]#
>>> ```
>>>
>>> In the emergency shell there's no `sda` devices when I type `$ ls
>>> /dev/`. By downgrading the kernel, boot process works properly.
>>>
>>> See also Arch Linux bug tracker[2]. There are similar reports on
>>> Apple devices.
>>>
>>> `dmesg` output in the emergency shell is attached. I guess this issue is
>>> related to libata, so CCed to Damien Le Moal.
>>
>> I think that this problem is due to recent PCI subsystem changes which broke Mac
>> support. The problem shows up as interrupts not being delivered, which in
>> turn results in the kernel assuming that the drive is not working (see the
>> timeout error messages in your dmesg output). Hence your boot drive detection
>> fails and there is no rootfs to mount.
>>
>> Adding linux-pci list.
>>
>>
>>
>>>
>>> Regards.
>>>
>>> [1] https://archlinux.org/packages/core/x86_64/linux/
>>> [2] https://bugs.archlinux.org/task/72734
>
Krzysztof,
> The error in the dmesg output (see [2] where the log file is attached)
> looks similar to the problem reported a week or so ago, as per:
>
> https://lore.kernel.org/linux-pci/[email protected]/
Thanks. I searched this thread but could not find it in the archive.
Early morning, need more coffee :)
>
> The problematic commits were reverted by Bjorn and the Pull Request that
> did it was accepted, as per:
>
> https://lore.kernel.org/linux-pci/20211111195040.GA1345641@bhelgaas/
>
> Thus, this would have made its way into 5.16-rc1, I suppose. We might have to
> back-port this to the stable and long-term kernels.
Yes, I think the fix needs to go in 5.15, which is latest stable and LTS.
>
> Yuji, could you, if you have some time to spare, try 5.16-rc1 to see if
> this has gotten better on your system?
>
> Krzysztof
>
--
Damien Le Moal
Western Digital Research
[+CC Adding Jeremy for visibility]
Hi Damien,
[...]
> > The error in the dmesg output (see [2] where the log file is attached)
> > looks similar to the problem reported a week or so ago, as per:
> >
> > https://lore.kernel.org/linux-pci/[email protected]/
>
> Thanks. I searched this thread but could not find it in the archive.
> Early morning, need more coffee :)
No worries! Got you covered!
)))
(((
+-----+
| |]
`-----'
Enjoy!
>
> >
> > The problematic commits were reverted by Bjorn and the Pull Request that
> > did it was accepted, as per:
> >
> > https://lore.kernel.org/linux-pci/20211111195040.GA1345641@bhelgaas/
> >
> > Thus, this would have made its way into 5.16-rc1, I suppose. We might have to
> > back-port this to the stable and long-term kernels.
>
> Yes, I think the fix needs to go in 5.15, which is latest stable and LTS.
On the plus side, not everyone is on 5.15 yet, but those who are using it would
have some issues. Albeit, with it being an LTS release, the adoption might
increase rapidly.
For instance, I believe that Pop!_OS already ships kernels that are very close
to the upstream, which would hit their current user base.
Krzysztof
Hi Krzysztof, Yuji,
On Tue, 16 Nov 2021 23:26:18 +0000,
Krzysztof Wilczyński <[email protected]> wrote:
>
> [+CC Arnd, Bjorn, Marc and Sasha for visibility]
>
> Hello Damien and Yuji,
>
> [...]
> > > I'm using Arch Linux on MacBook Air 2010. I updated `linux` package[1]
> > > from v5.14.16 to v5.15.2 the other day, and the boot process stalled
> > > with the following message.
> > >
> > > ```shell
> > > :: running early hook [udev]
> > > Starting version 249.6-3-arch
> > > :: running hook [udev]
> > > :: Triggering uevents...
> > > Waiting 10 seconds for device /dev/sda3 ...
> > > ERROR: device '/dev/sda3' not found. Skipping fsck.
> > > :: mounting '/dev/sda' on real root
> > > mount: /new_root: no filesystem type specified.
> > > You are now being dropped into an emergency shell.
> > > sh: can't access tty; job control turned off
> > > [rootfs ]#
> > > ```
> > >
> > > In the emergency shell there's no `sda` devices when I type `$ ls
> > > /dev/`. By downgrading the kernel, boot process works properly.
> > >
> > > See also Arch Linux bug tracker[2]. There are similar reports on
> > > Apple devices.
> > >
> > > `dmesg` output in the emergency shell is attached. I guess this issue is
> > > related to libata, so CCed to Damien Le Moal.
> >
> > I think that this problem is due to recent PCI subsystem changes which broke Mac
> > support. The problem shows up as interrupts not being delivered, which in
> > turn results in the kernel assuming that the drive is not working (see the
> > timeout error messages in your dmesg output). Hence your boot drive detection
> > fails and there is no rootfs to mount.
> >
> > Adding linux-pci list.
> >
> >
> >
> > >
> > > Regards.
> > >
> > > [1] https://archlinux.org/packages/core/x86_64/linux/
> > > [2] https://bugs.archlinux.org/task/72734
>
> The error in the dmesg output (see [2] where the log file is attached)
> looks similar to the problem reported a week or so ago, as per:
>
> https://lore.kernel.org/linux-pci/[email protected]/
>
> The problematic commits were reverted by Bjorn and the Pull Request that
> did it was accepted, as per:
>
> https://lore.kernel.org/linux-pci/20211111195040.GA1345641@bhelgaas/
>
> Thus, this would have made its way into 5.16-rc1, I suppose. We might have to
> back-port this to the stable and long-term kernels.
>
> Yuji, could you, if you have some time to spare, try 5.16-rc1 to see if
> this has gotten better on your system?
I'm afraid you have the wrong end of the stick on this one.
The issue is reported on 5.15, and the issue you are pointing at was
introduced during the 5.16 merge window. The problematic commit wasn't
reverted, but instead fixed in 10a20b34d735 ("of/irq: Don't ignore
interrupt-controller when interrupt-map failed").
The issue is instead very close to the one reported at [1], for which
we have a very conservative workaround in 5.16-rc1 (commits
2226667a145d and f21082fb20db). Looking at the dmesg log provided by
Yuji, you find the following nugget:
[ 0.378564] pci 0000:00:0a.0: [10de:0d88] type 00 class 0x010601
Oh look, a NVIDIA AHCI controller, probably similar enough to the one
discussed in the issue reported by Rui.
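For readers who want to pull the same information out of their own log, here is a minimal sketch (not part of the proposed patch) that decodes a PCI enumeration line of the form shown above. The regular expression and the helper name `decode_pci_line` are illustrative assumptions. The field meanings follow the standard PCI class code layout: base class 0x01 is mass storage, subclass 0x06 is Serial ATA, and programming interface 0x01 is AHCI. Vendor ID 0x10de is NVIDIA.

```python
import re

# Matches the PCI enumeration line quoted above, e.g.
# "pci 0000:00:0a.0: [10de:0d88] type 00 class 0x010601"
# (hypothetical helper pattern, written for this log format only)
PCI_LINE = re.compile(
    r"pci (?P<addr>[0-9a-f:.]+): "
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\] "
    r"type (?P<type>[0-9a-f]{2}) class 0x(?P<cls>[0-9a-f]{6})"
)


def decode_pci_line(line):
    """Return the IDs and the split class code, or None if the line does not match."""
    m = PCI_LINE.search(line)
    if m is None:
        return None
    cls = int(m.group("cls"), 16)
    return {
        "addr": m.group("addr"),
        "vendor": int(m.group("vendor"), 16),
        "device": int(m.group("device"), 16),
        "base_class": (cls >> 16) & 0xFF,  # 0x01: mass storage controller
        "sub_class": (cls >> 8) & 0xFF,    # 0x06: Serial ATA
        "prog_if": cls & 0xFF,             # 0x01: AHCI
    }


line = "[ 0.378564] pci 0000:00:0a.0: [10de:0d88] type 00 class 0x010601"
info = decode_pci_line(line)
print({k: (hex(v) if isinstance(v, int) else v) for k, v in info.items()})
```

The vendor and device pair is what an ID-based quirk entry matches on, while the class code is what identifies the device as an AHCI controller.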
Yuji, could you please test the patch below on top of 5.16-rc1?
Thanks,
M.
[1] https://lore.kernel.org/r/CALjTZvbzYfBuLB+H=fj2J+9=DxjQ2Uqcy0if_PvmJ-nU-qEgkg@mail.gmail.com
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 003950c738d2..cd88eddf614d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5857,3 +5857,4 @@ static void nvidia_ion_ahci_fixup(struct pci_dev *pdev)
 	pdev->dev_flags |= PCI_DEV_FLAGS_HAS_MSI_MASKING;
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0ab8, nvidia_ion_ahci_fixup);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0d88, nvidia_ion_ahci_fixup);
--
Without deviation from the norm, progress is not possible.
Hi Marc,
[...]
> > > I think that this problem is due to recent PCI subsystem changes which broke Mac
> > > support. The problem shows up as interrupts not being delivered, which in
> > > turn results in the kernel assuming that the drive is not working (see the
> > > timeout error messages in your dmesg output). Hence your boot drive detection
> > > fails and there is no rootfs to mount.
> > >
> > > Adding linux-pci list.
> > >
> > >
> > >
> > > >
> > > > Regards.
> > > >
> > > > [1] https://archlinux.org/packages/core/x86_64/linux/
> > > > [2] https://bugs.archlinux.org/task/72734
> >
> > The error in the dmesg output (see [2] where the log file is attached)
> > looks similar to the problem reported a week or so ago, as per:
> >
> > https://lore.kernel.org/linux-pci/[email protected]/
> >
> > The problematic commits were reverted by Bjorn and the Pull Request that
> > did it was accepted, as per:
> >
> > https://lore.kernel.org/linux-pci/20211111195040.GA1345641@bhelgaas/
> >
> > Thus, this would have made its way into 5.16-rc1, I suppose. We might have to
> > back-port this to the stable and long-term kernels.
> >
> > Yuji, could you, if you have some time to spare, try 5.16-rc1 to see if
> > this has gotten better on your system?
>
> I'm afraid you have the wrong end of the stick on this one.
>
> The issue is reported on 5.15, and the issue you are pointing at was
> introduced during the 5.16 merge window. The problematic commit wasn't
> reverted, but instead fixed in 10a20b34d735 ("of/irq: Don't ignore
> interrupt-controller when interrupt-map failed").
Ahh. My bad! I missed the conclusion of the conversation involving the
Nemo board and the patch you proposed here:
https://lore.kernel.org/linux-pci/[email protected]/
I then assumed that what Bjorn reverted in his Pull Request was the
solution to the reported problems. Apologies for conflating the issues
here, and also thank you for all the details.
Do we still need to back-port some of the fixes to the stable and LTS
kernels then? I am just making sure that things will make it there, if
needed.
> The issue is instead very close to the one reported at [1], for which
> we have a very conservative workaround in 5.16-rc1 (commits
> 2226667a145d and f21082fb20db). Looking at the dmesg log provided by
> Yuji, you find the following nugget:
>
> [ 0.378564] pci 0000:00:0a.0: [10de:0d88] type 00 class 0x010601
>
> Oh look, a NVIDIA AHCI controller, probably similar enough to the one
> discussed in the issue reported by Rui.
Good to know for future reference that these can be problematic.
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 003950c738d2..cd88eddf614d 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5857,3 +5857,4 @@ static void nvidia_ion_ahci_fixup(struct pci_dev *pdev)
>  	pdev->dev_flags |= PCI_DEV_FLAGS_HAS_MSI_MASKING;
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0ab8, nvidia_ion_ahci_fixup);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0d88, nvidia_ion_ahci_fixup);
Thank you! I hope this will fix Yuji's issues.
Krzysztof
>
> I installed plain 5.16-rc1 using the pre-built image[1] from the linux-mainline
> AUR package[2] maintainer, and 5.16-rc1 with the patch provided by
> Marc. Both versions booted successfully. Thank you for the quick
> investigation. I'll wait for the fix to be backported.
So vanilla 5.16-rc1 works correctly on your machine, without any other
patch?
If so, I don't know what fixed it. Someone with the right HW should
try and identify the fix so that we can backport it.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
We are using 5.15.2 right now on Pop!_OS 21.10 but have not seen SATA issues yet. I'll look into this,
and hope it makes it into a 5.15 release soon!
--
Jeremy Soller
System76
Principal Engineer
[email protected]
On Tue, Nov 16, 2021, at 4:54 PM, Krzysztof Wilczyński wrote:
> [+CC Adding Jeremy for visibility]
>
> Hi Damien,
>
> [...]
>> > The error in the dmesg output (see [2] where the log file is attached)
>> > looks similar to the problem reported a week or so ago, as per:
>> >
>> > https://lore.kernel.org/linux-pci/[email protected]/
>>
>> Thanks. I searched this thread but could not find it in the archive.
>> Early morning, need more coffee :)
>
> No worries! Got you covered!
>
> )))
> (((
> +-----+
> | |]
> `-----'
>
> Enjoy!
>
>>
>> >
>> > The problematic commits were reverted by Bjorn and the Pull Request that
>> > did it was accepted, as per:
>> >
>> > https://lore.kernel.org/linux-pci/20211111195040.GA1345641@bhelgaas/
>> >
>> > Thus, this would have made its way into 5.16-rc1, I suppose. We might have to
>> > back-port this to the stable and long-term kernels.
>>
>> Yes, I think the fix needs to go in 5.15, which is latest stable and LTS.
>
> On the plus side, not everyone is on 5.15 yet, but those who are using it would
> have some issues. Albeit, with it being an LTS release, the adoption might
> increase rapidly.
>
> For instance, I believe that Pop!_OS already ships kernels that are very close
> to the upstream, which would hit their current user base.
>
> Krzysztof
On Wed, 17 Nov 2021 10:36:08 +0100
Krzysztof Wilczyński <[email protected]> wrote:
> > > > I think that this problem is due to recent PCI subsystem changes which
> > > > broke Mac support. The problem shows up as interrupts not being
> > > > delivered, which in turn results in the kernel assuming that the drive
> > > > is not working (see the timeout error messages in your dmesg output).
> > > > Hence your boot drive detection fails and there is no rootfs to mount.
> > > >
> > > > Adding linux-pci list.
> > > >
> > > >
>
> > The issue is instead very close to the one reported at [1], for which
> > we have a very conservative workaround in 5.16-rc1 (commits
> > 2226667a145d and f21082fb20db). Looking at the dmesg log provided by
> > Yuji, you find the following nugget:
> >
> > [ 0.378564] pci 0000:00:0a.0: [10de:0d88] type 00 class 0x010601
> >
> > Oh look, a NVIDIA AHCI controller, probably similar enough to the one
> > discussed in the issue reported by Rui.
>
> Good to know for future reference that these can be problematic.
>
Hi.
I am also experiencing this issue on Gigabyte GA-M720-US3 mobo which uses
NVIDIA nForce 720D chipset. As I understand from the quirks patch it does not
fix my controller?
$ bunzip2 -ck dmesg_nForce_720D.txt.bz2 | grep 0x010601
[ 0.299980] pci 0000:00:09.0: [10de:0ad4] type 00 class 0x010601
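To make the question concrete, here is a small sketch of the check being described, assuming the ID-based quirk only applies to the two NVIDIA device IDs named in the patch quoted earlier in this thread (0x0ab8 as the existing entry and 0x0d88 as the proposed addition). The set `QUIRKED_DEVICE_IDS` and the function name are illustrative only and do not model how the kernel stores its fixup table.

```python
NVIDIA_VENDOR_ID = 0x10de

# Device IDs taken from the quirk patch quoted earlier in this thread:
# 0x0ab8 is the existing entry, 0x0d88 is the proposed addition.
QUIRKED_DEVICE_IDS = {0x0ab8, 0x0d88}


def covered_by_id_quirk(vendor, device):
    """True if one of the ID-based entries from the patch would apply."""
    return vendor == NVIDIA_VENDOR_ID and device in QUIRKED_DEVICE_IDS


# Controller from the MacBook Air log, then the nForce 720D one reported here.
macbook = covered_by_id_quirk(0x10de, 0x0d88)
nforce = covered_by_id_quirk(0x10de, 0x0ad4)
print(macbook, nforce)  # prints "True False"
```

Under that assumption, a controller reporting `[10de:0ad4]` is not matched by either entry, even though its class code is the same 0x010601.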
In addition to the absence of devices during boot, it takes minutes for this PC
on 5.15 to proceed from loading the initramfs image to starting the system from it.
I am attaching my dmesg just in case.
Max.
On Sun, 21 Nov 2021 15:41:18 +0000,
Maxym Synytsky <[email protected]> wrote:
>
> Hi.
> I am also experiencing this issue on Gigabyte GA-M720-US3 mobo which uses
> NVIDIA nForce 720D chipset. As I understand from the quirks patch it does not
> fix my controller?
Are you sure? The dmesg you attached to this email shows otherwise:
[ 0.766490] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 0.766634] ata1.00: ATA-10: KINGSTON SA400S37120G, 03150002, max UDMA/133
[ 0.766641] ata1.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 32)
[ 0.769825] ata6: SATA link down (SStatus 0 SControl 300)
[ 0.769845] ata2: SATA link down (SStatus 0 SControl 300)
[ 0.769865] ata5: SATA link down (SStatus 0 SControl 300)
[ 0.769876] ata3: SATA link down (SStatus 0 SControl 300)
[ 0.769894] ata4: SATA link down (SStatus 0 SControl 300)
[ 0.772046] ata1.00: configured for UDMA/133
[ 0.772218] scsi 0:0:0:0: Direct-Access ATA KINGSTON SA400S3 0002 PQ: 0 ANSI: 5
[ 0.772503] sd 0:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/112 GiB)
[ 0.772520] sd 0:0:0:0: [sda] Write Protect is off
[ 0.772523] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 0.772538] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 0.773190] sda: sda1 sda2
[ 0.789894] sd 0:0:0:0: [sda] Attached SCSI disk
From the above, I can only conclude that your SATA controller is up
and running.
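The same reading of the log can be done mechanically. The following is a minimal sketch, assuming the libata and sd message formats seen in the excerpt above. The regular expressions and the `summarize` helper are hypothetical and only cover these two message types.

```python
import re

# Patterns for the two kinds of lines used in the reasoning above
# (assumed formats, derived only from the excerpt in this message).
LINK_RE = re.compile(r"(ata\d+): SATA link (up|down)")
DISK_RE = re.compile(r"\[(sd[a-z]+)\] Attached SCSI disk")


def summarize(dmesg_text):
    """Return ({port: link_is_up}, [attached disk names]) for a dmesg dump."""
    links = {}
    disks = []
    for line in dmesg_text.splitlines():
        m = LINK_RE.search(line)
        if m:
            links[m.group(1)] = (m.group(2) == "up")
        m = DISK_RE.search(line)
        if m:
            disks.append(m.group(1))
    return links, disks


sample = """\
[ 0.766490] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 0.769825] ata6: SATA link down (SStatus 0 SControl 300)
[ 0.789894] sd 0:0:0:0: [sda] Attached SCSI disk
"""
links, disks = summarize(sample)
print(links, disks)
```

A log in which at least one port reports link up and a disk is attached is consistent with a working controller. A log from a failing boot would be needed to see what differs.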
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Sun, 21 Nov 2021 19:58:31 +0000
Marc Zyngier <[email protected]> wrote:
> > Hi.
> > I am also experiencing this issue on Gigabyte GA-M720-US3 mobo which uses
> > NVIDIA nForce 720D chipset. As I understand from the quirks patch it does
> > not fix my controller?
>
> Are you sure? The dmesg you attached to this email shows otherwise:
>
Yes, this dmesg is for 5.14 kernel which works fine.
For some reason Arch complains in initramfs that root is locked and I am not
dropped to recovery shell so getting dmesg for 5.15 would be tricky for me.
Max.
On Sun, 21 Nov 2021 20:48:02 +0000,
Maxym Synytsky <[email protected]> wrote:
>
> On Sun, 21 Nov 2021 19:58:31 +0000
> Marc Zyngier <[email protected]> wrote:
>
> > > Hi.
> > > I am also experiencing this issue on Gigabyte GA-M720-US3 mobo which uses
> > > NVIDIA nForce 720D chipset. As I understand from the quirks patch it does
> > > not fix my controller?
> >
> > Are you sure? The dmesg you attached to this email shows otherwise:
> >
> Yes, this dmesg is for 5.14 kernel which works fine.
Well, that's not exactly useful to debug your problem, is it? What
makes you think that you are suffering from the same issue if you
can't look at the kernel messages?
> For some reason Arch complains in initramfs that root is locked and I am not
> dropped to recovery shell so getting dmesg for 5.15 would be tricky for me.
Can you please try 5.16-rc1? If it doesn't work, try the following
hack on top of -rc1. But that's a complete shot in the dark, and
without more details on what is going on, there is only so much I can
do.
M.
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 003950c738d2..6ac0f0b14130 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5852,8 +5852,10 @@ DECLARE_PCI_FIXUP_ENABLE(PCI_VENDOR_ID_PERICOM, 0x2303,
 DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_PERICOM, 0x2303,
 			 pci_fixup_pericom_acs_store_forward);
 
-static void nvidia_ion_ahci_fixup(struct pci_dev *pdev)
+static void nvidia_ahci_fixup(struct pci_dev *pdev)
 {
 	pdev->dev_flags |= PCI_DEV_FLAGS_HAS_MSI_MASKING;
 }
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0ab8, nvidia_ion_ahci_fixup);
+DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+			      PCI_CLASS_STORAGE_SATA_AHCI, 8,
+			      nvidia_ahci_fixup);
--
Without deviation from the norm, progress is not possible.