2024-05-06 05:20:37

by Peter Schneider

Subject: Kernel 6.8.4 regression: aacraid controller not initialized any more, system boot hangs

Hi all,

I am running a dual Xeon machine as my personal virtualization server at home, using
Proxmox VE, and with their latest update 8.2 which brings kernel 6.8.4-2-pve, I am seeing
a serious regression which breaks my setup because it does not boot any more. The last
message I see displayed during boot is: "Timed out for waiting the udev queue being
empty.", and then it hangs indefinitely.

The previous kernel, 6.5.13-5-pve, worked fine, with one caveat: I initially had similar
problems with earlier kernels too, so ever since I started running PVE on this machine I
have had to set the GRUB kernel parameter rootdelay=60. With that, everything was fine:
the buses settled, the RAID controller and root device were found, and the system booted.
With the newer 6.8.4 kernel this no longer works, even after increasing rootdelay to 120.

I was able to reproduce and bisect this regression also with mainline kernels (also with
stable 6.8.8 and 6.9-rc), so I thought it would be a good idea to report it upstream to
you guys.

This is an older server machine: 2-socket Ivy Bridge Xeon E5-2697 v2 (24C/48T) in an Asus
Z9PE-D16/2L motherboard (Intel C-602A chipset); BIOS patched to the latest available from
Asus. All memory slots are occupied, for 256 GB RAM in total. It also has an Asus ASMB6
iKVM BMC, which provides virtual storage devices (see the dmesg output below) to which ISO
images can be attached over the network to boot or install an OS from.

Storage config:

I have two Micron M4 256 GB SATA SSDs attached to the internal mainboard SATA ports;
one of them is my root device and PVE installation drive, and the other stores ISO
images. My main VM storage is attached to a battery-backed Adaptec 5805 SATA/SAS RAID
controller (with the latest firmware, build 18948) connected to the SATA/SAS enclosure of
my Supermicro server case, which holds eight disk drives in total: one RAID1 array of two
Samsung 1 TB SATA SSDs for VM root disk images, and one RAID5 array of six Hitachi 1 TB
HDDs for VM data disk images. On both arrays I use an LVM thin pool as the PVE storage
location. Once everything has booted, the system runs smoothly with ~15 VMs at the same
time (and has for years!). Although this is "only" a homelab server, I love it dearly and
use it for the VMs of many private projects, among them a Windows Server VM running MS SQL
Server and Linux server VMs running Oracle Database Server (I'm a database guy).

I attach the dmesg output of the previous working kernel 6.5.13-5-pve, my git bisect log,
and the output of lspci -v. The last kernel messages I see from the failing kernel
versions are these:

...

[ 5.540424] usb-storage 1-1.3.4:1.0: USB Mass Storage device detected
[ 5.540670] scsi host10: usb-storage 1-1.3.4:1.0
[ 5.947794] scsi 8:0:0:0: CD-ROM AMI Virtual CDROM0 1.00 PQ: 0 ANSI: 0 CCS
[ 6.267830] scsi 9:0:0:0: Direct-Access AMI Virtual Floppy0 1.00 PQ: 0 ANSI: 0 CCS
[ 6.555845] scsi 10:0:0:0: Direct-Access AMI Virtual HDISK0 1.00 PQ: 0 ANSI: 0 CCS

and then the error message "Timed out for waiting the udev queue being empty." and the
system hangs. In case of working kernels, the boot process would continue with this:

...

[ 5.947794] scsi 8:0:0:0: CD-ROM AMI Virtual CDROM0 1.00 PQ: 0 ANSI: 0 CCS
[ 6.267830] scsi 9:0:0:0: Direct-Access AMI Virtual Floppy0 1.00 PQ: 0 ANSI: 0 CCS
[ 6.555845] scsi 10:0:0:0: Direct-Access AMI Virtual HDISK0 1.00 PQ: 0 ANSI: 0 CCS
[ 32.592054] scsi 0:3:1:0: Enclosure ADAPTEC Virtual SGPIO 1 0001 PQ: 0 ANSI: 5
[ 61.536097] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 61.536215] sd 0:0:0:0: [sda] 1998565376 512-byte logical blocks: (1.02 TB/953 GiB)
[ 61.536236] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 61.536239] sd 0:0:0:0: [sda] Write Protect is off
[ 61.536246] sd 0:0:0:0: [sda] Mode Sense: 12 00 10 08
[ 61.536283] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 61.536340] scsi 0:1:0:0: Attached scsi generic sg2 type 0
[ 61.536383] sd 0:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
[ 61.536400] sd 0:0:1:0: [sdb] 9762222080 512-byte logical blocks: (5.00 TB/4.54 TiB)
[ 61.536414] sd 0:0:1:0: [sdb] Write Protect is off
[ 61.536418] sd 0:0:1:0: [sdb] Mode Sense: 12 00 10 08
[ 61.536439] sd 0:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 61.536455] scsi 0:1:1:0: Attached scsi generic sg3 type 0
[ 61.536616] scsi 0:1:2:0: Attached scsi generic sg4 type 0
[ 61.536750] scsi 0:1:3:0: Attached scsi generic sg5 type 0
[ 61.536840] scsi 0:1:4:0: Attached scsi generic sg6 type 0
[ 61.536930] scsi 0:1:5:0: Attached scsi generic sg7 type 0
[ 61.537027] scsi 0:1:6:0: Attached scsi generic sg8 type 0
[ 61.537122] scsi 0:1:7:0: Attached scsi generic sg9 type 0
[ 61.537248] sd 0:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
[ 61.537274] scsi 0:3:0:0: Attached scsi generic sg10 type 13
[ 61.537390] scsi 0:3:1:0: Attached scsi generic sg11 type 13
[ 61.537558] scsi 1:0:0:0: Direct-Access ATA M4-CT256M4SSD2 0309 PQ: 0 ANSI: 5
[ 61.537851] sd 1:0:0:0: Attached scsi generic sg12 type 0
[ 61.537919] scsi: waiting for bus probes to complete ...
[ 61.537973] sd 1:0:0:0: [sdc] 500118192 512-byte logical blocks: (256 GB/238 GiB)
[ 61.537986] sd 1:0:0:0: [sdc] Write Protect is off
[ 61.537989] sd 1:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[ 61.538002] sd 1:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 61.538022] sd 1:0:0:0: [sdc] Preferred minimum I/O size 512 bytes
[ 61.538924] sdc: sdc1 sdc2 < sdc5 >

...

So it seems to me that the initialization of the Adaptec controller is the culprit.

I have tested and reproduced the regression with the mainline kernels in the following
list (please excuse me if it's too long ;-)

The first bad commit I found this way is at the very bottom. I always built with "make
olddefconfig", using the 6.5.13-5-pve config as the starting point.


-------------------------------------------------------------------


Proxmox Virtual Environment (PVE) Kernels
=========================================
6.5.13-5-pve WORKS last working PVE (8.1) kernel; 5.15-pve and 6.2-pve work too
6.8.4-2-pve NOPE PVE release 8.2


Mainline Kernels
================
6.9.0-rc6+ NOPE Most recent (2024-05-01)
6.9.0-rc5+ NOPE Most recent (2024-04-27)
6.8.8 NOPE Most recent released (2024-04-29)
6.8.7 NOPE Most recent released (2024-04-27)
6.8.4 NOPE Same version as most recent released PVE 8.2 Kernel
6.5.13 WORKS


My tests, reverts on top of 6.8.8
=================================
6.8.8+ WORKS Revert "Merge tag 'scsi-fixes' of
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi" - This reverts commit
6d20acbf3e3a32d331947dbc3802cf2d1a399e7d, reversing changes made to
fef85269a19d277f23fc5ff08a3c356beeb54cb3

6.8.8+ WORKS Revert "scsi: core: Consult supported VPD page list prior to
fetching page" - This reverts commit b5fc07a5fb56216a49e6c1d0b172d5464d99a89b (this is the
first bad commit of my bisect session, see below, and a single patch as part of the above
merged tag 'scsi-fixes')



Bisecting, starting from 6.9.0-rc5 (bad) and 6.5.13 (good)
==========================================================

root@linus:/usr/src/linux# git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
root@linus:/usr/src/linux# git log
commit 9d1ddab261f3e2af7c384dc02238784ce0cf9f98 (HEAD -> master, origin/master, origin/HEAD)
Merge: 71b1543c83d6 77d8aa79ecfb
Author: Linus Torvalds <[email protected]>
Date: Tue Apr 23 09:37:32 2024 -0700

Merge tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

root@linus:/usr/src/linux# cp /boot/config-6.5.13-5-pve .config
root@linus:/usr/src/linux# git bisect start
status: waiting for both good and bad commits
root@linus:/usr/src/linux# git bisect bad
status: waiting for good commit(s), bad commit known
root@linus:/usr/src/linux# git bisect good v6.5.13
Bisecting: a merge base must be tested
[2dde18cd1d8fac735875f2e4987f11817cc0bc2c] Linux 6.5
root@linus:/usr/src/linux# make olddefconfig
.config:10571:warning: symbol value 'm' invalid for ANDROID_BINDER_IPC
.config:10572:warning: symbol value 'm' invalid for ANDROID_BINDERFS
#
# configuration written to .config
#
root@linus:/usr/src/linux# make -j 48

=> 6.5.0 (Merge Base) WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 32111 revisions left to test after this (roughly 15 steps)
[0f5cc96c367f2e780eb492cc9cab84e3b2ca88da] Merge tag 's390-6.7-3' of
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

root@linus:/usr/src/linux# make -j 48

=> 6.7.0-rc2+ WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 16056 revisions left to test after this (roughly 14 steps)
[ee138217c32ccbfa75d5ea6b766158148e98f6fa] Merge tag 'btree-remove-btnum-6.9_2024-02-23'
of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.9-mergeC

=> 6.8.0-rc4+ WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 8214 revisions left to test after this (roughly 13 steps)
[e5e038b7ae9da96b93974bf072ca1876899a01a3] Merge tag 'fs_for_v6.9-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

=> 6.8.0+ NOPE => does not find root device, does not boot;
message: "BUG: arch topology borken the CPU
domain not a subset of > the NUMA domain"
message: "Timed out for waiting the udev
queue being empty."

root@linus:/usr/src/linux# git bisect bad
Bisecting: 3954 revisions left to test after this (roughly 12 steps)
[f153fbe1ea11939e2514ba4b3b62bbd946e2892c] Merge tag 'erofs-for-6.9-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

=> 6.8.0+ (HEAD detached at f153fbe1ea11) NOPE => same as above

root@linus:/usr/src/linux# git bisect bad
Bisecting: 1945 revisions left to test after this (roughly 11 steps)
[1ddeeb2a058d7b2a58ed9e820396b4ceb715d529] Merge tag 'for-6.9/block-20240310' of
git://git.kernel.dk/linux

=> 6.8.0+ (HEAD detached at 1ddeeb2a058d) NOPE => same as above

root@linus:/usr/src/linux# git bisect bad
Bisecting: 970 revisions left to test after this (roughly 10 steps)
[2652b99e43403dc464f3648483ffb38e48872fe4] ice: virtchnl: stop pretending to support RSS
over AQ or registers

=> 6.8.0-rc6+ (2652b99e4340) NOPE => same

root@linus:/usr/src/linux# git bisect bad
Bisecting: 506 revisions left to test after this (roughly 9 steps)
[efa80dcbb7a3ecc4a1b2f54624c49b5a612f92b3] Merge tag 'trace-v6.8-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

=> 6.8.0-rc5+ (efa80dcbb7a3) WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 251 revisions left to test after this (roughly 8 steps)
[c6a597fcc7ad7335a3ecf8f5287a0459f793a257] Merge tag 'loongarch-fixes-6.8-3' of
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

=> 6.8.0-rc5+ (c6a597fcc7ad) WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 126 revisions left to test after this (roughly 7 steps)
[cf1182944c7cc9f1c21a8a44e0d29abe12527412] Merge tag 'lsm-pr-20240227' of
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

=> 6.8.0-rc6+ (cf1182944c7c) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 62 revisions left to test after this (roughly 6 steps)
[4ca0d9894fd517a2f2c0c10d26ebe99ab4396fe3] Merge tag 'erofs-for-6.8-rc6-fixes' of
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

=> 6.8.0-rc5+ (4ca0d9894fd5) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 36 revisions left to test after this (roughly 5 steps)
[ac389bc0ca56e1a2f92b2a17e58298390a3879a8] Merge tag 'cxl-fixes-6.8-rc6' of
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

=> 6.8.0-rc5+ (ac389bc0ca56) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[40de53fd002c6ba087a623722915e8006ed68a02] Merge branch 'for-6.8/cxl-cper' into for-6.8/cxl

=> 6.8.0-rc5+ (40de53fd002c) WORKS

root@linus:/usr/src/linux# git bisect good
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[9ddf190a7df77b77817f955fdb9c2ae9d1c9c9a3] scsi: jazz_esp: Only build if SCSI core is builtin

=> 6.8.0-rc1+ (9ddf190a7df7) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[de959094eb2197636f7c803af0943cb9d3b35804] scsi: target: pscsi: Fix bio_put() for error case

=> 6.8.0-rc1+ (de959094eb21) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 0 revisions left to test after this (roughly 1 step)
[b5fc07a5fb56216a49e6c1d0b172d5464d99a89b] scsi: core: Consult supported VPD page list
prior to fetching page

=> 6.8.0-rc1+ (b5fc07a5fb56) NOPE

root@linus:/usr/src/linux# git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[321da3dc1f3c92a12e3c5da934090d2992a8814c] scsi: sd: usb_storage: uas: Access media prior
to querying device properties

=> 6.8.0-rc1+ (321da3dc1f3c) WORKS

root@linus:/usr/src/linux# git bisect good
b5fc07a5fb56216a49e6c1d0b172d5464d99a89b is the first bad commit
commit b5fc07a5fb56216a49e6c1d0b172d5464d99a89b
Author: Martin K. Petersen <[email protected]>
Date: Wed Feb 14 17:14:11 2024 -0500

scsi: core: Consult supported VPD page list prior to fetching page

Commit c92a6b5d6335 ("scsi: core: Query VPD size before getting full
page") removed the logic which checks whether a VPD page is present on
the supported pages list before asking for the page itself. That was
done because SPC helpfully states "The Supported VPD Pages VPD page
list may or may not include all the VPD pages that are able to be
returned by the device server". Testing had revealed a few devices
that supported some of the 0xBn pages but didn't actually list them in
page 0.

Julian Sikorski bisected a problem with his drive resetting during
discovery to the commit above. As it turns out, this particular drive
firmware will crash if we attempt to fetch page 0xB9.

Various approaches were attempted to work around this. In the end,
reinstating the logic that consults VPD page 0 before fetching any
other page was the path of least resistance. A firmware update for the
devices which originally compelled us to remove the check has since
been released.

Link: https://lore.kernel.org/r/[email protected]
Fixes: c92a6b5d6335 ("scsi: core: Query VPD size before getting full page")
Cc: [email protected]
Cc: Bart Van Assche <[email protected]>
Reported-by: Julian Sikorski <[email protected]>
Tested-by: Julian Sikorski <[email protected]>
Reviewed-by: Lee Duncan <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

 drivers/scsi/scsi.c | 22 ++++++++++++++++++++--
 include/scsi/scsi_device.h | 4 ----
 2 files changed, 20 insertions(+), 6 deletions(-)
root@linus:/usr/src/linux#
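
For readers unfamiliar with the procedure: the session above is essentially a binary search over the commit history, which is why roughly 15 builds were enough to cover about 32000 commits. The Python sketch below is only an illustration under simplifying assumptions. It assumes a linear history in which every commit after the first bad one is also bad; real git bisect also has to deal with merges, so its step counts are estimates. The function name first_bad and the integer "commits" are hypothetical and not part of git.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[int], is_bad: Callable[[int], bool]) -> Tuple[int, int]:
    """Return (first bad commit, number of commits tested).

    Assumes commits are ordered oldest to newest, the newest one is known
    to be bad, and every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # commits[hi] is always a known-bad commit
    tested = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tested += 1               # one kernel build and boot test per iteration
        if is_bad(commits[mid]):
            hi = mid              # the first bad commit is mid or older
        else:
            lo = mid + 1          # the first bad commit is newer than mid
    return commits[lo], tested


if __name__ == "__main__":
    # Hypothetical linear history with as many commits as the first estimate above.
    history = list(range(32111))
    culprit, steps = first_bad(history, lambda c: c >= 20000)
    print(culprit, steps)
```

Each test halves the remaining range, so the number of builds grows only with the logarithm of the number of commits.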


-------------------------------------------------------------------


Best regards,
Peter Schneider

--
Climb the mountain not to plant your flag, but to embrace the challenge,
enjoy the air and behold the view. Climb it so you can see the world,
not so the world can see you. -- David McCullough Jr.

OpenPGP: 0xA3828BD796CCE11A8CADE8866E3A92C92C3FF244
Download: https://www.peters-netzplatz.de/download/pschneider1968_pub.asc
https://keys.mailvelope.com/pks/lookup?op=get&[email protected]
https://keys.mailvelope.com/pks/lookup?op=get&[email protected]


Attachments:
dmesg_6.5.13-5-pve.txt (120.76 kB)
git_bisect.log (3.60 kB)
lspci-v.txt (70.94 kB)
OpenPGP_signature.asc (243.00 B)
OpenPGP digital signature

2024-05-09 01:42:31

by Martin K. Petersen

Subject: Re: Kernel 6.8.4 regression: aacraid controller not initialized any more, system boot hangs


Hi Peter!

Thanks for the detailed bug report.

> 6.8.8+ WORKS Revert "scsi: core: Consult supported VPD page
> list prior to fetching page" - This reverts commit
> b5fc07a5fb56216a49e6c1d0b172d5464d99a89b (this is the first bad commit
> of my bisect session, see below, and a single patch as part of the
> above merged tag 'scsi-fixes')

The puzzling thing is that the patch in question restores the original
behavior in which we do not attempt to query any pages not explicitly
reported by the device.
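
As a rough model of the behavior described here (not the kernel's actual C code in drivers/scsi/scsi.c), consulting page 0 before fetching any other page can be sketched as follows. The names fetch_if_listed and fake_fetch are hypothetical, and the device that lists only pages 0x00, 0x80 and 0x83 is an assumed example.

```python
from typing import Callable, Iterable, Optional

SUPPORTED_PAGES = 0x00  # VPD page 0: the list of pages the device claims to support


def fetch_if_listed(page: int, supported: Iterable[int],
                    fetch: Callable[[int], bytes]) -> Optional[bytes]:
    """Fetch a VPD page only when the device lists it in page 0."""
    if page != SUPPORTED_PAGES and page not in set(supported):
        return None  # skip the request entirely, so the device never sees it
    return fetch(page)


if __name__ == "__main__":
    listed = [0x00, 0x80, 0x83]  # assumed device: serial number and device id only
    requests = []                # records which pages were actually requested

    def fake_fetch(page: int) -> bytes:
        """Stand-in for a real INQUIRY sent to the device."""
        requests.append(page)
        return bytes([page])

    print(fetch_if_listed(0x80, listed, fake_fetch))  # listed, so it is fetched
    print(fetch_if_listed(0xB0, listed, fake_fetch))  # not listed, so it is skipped
    print(requests)
```

The point of the check is that a page missing from the list never results in a request to the device at all.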

Can you please send me the output of:

# sg_vpd -a /dev/sda
# sg_readcap -l /dev/sda

where sda is one of the aacraid volumes.

Thanks!

--
Martin K. Petersen Oracle Linux Engineering

2024-05-09 02:12:53

by Peter Schneider

Subject: Re: Kernel 6.8.4 regression: aacraid controller not initialized any more, system boot hangs

Hi Martin,

On 09.05.2024 at 03:38, Martin K. Petersen wrote:
>
> Hi Peter!
>
> Thanks for the detailed bug report.

Thank you for looking into the issue! I thought I would also CC the relevant regressions
tracker and mailing list. For reference, my original message can be found here:

https://lore.kernel.org/all/[email protected]/

[...]

> Can you please send me the output of:
>
> # sg_vpd -a /dev/sda
> # sg_readcap -l /dev/sda
>
> where sda is one of the aacraid volumes.


Here you go... sda is the 1 TB RAID1 array, sdb is the 5 TB RAID5 array.


root@linus:~# uname -r
6.5.13-5-pve
root@linus:~# sg_vpd -a /dev/sda
Supported VPD pages VPD page:
Supported VPD pages [sv]
Unit serial number [sn]
Device identification [di]

Unit serial number VPD page:
Unit serial number: 50C0B82D

Device Identification VPD page:
Addressed logical unit:
designator type: T10 vendor identification, code set: ASCII
vendor id: ADAPTEC
vendor specific: ARRAY 50C0B82D
designator type: EUI-64 based, code set: Binary
0x2db8c05000d00000
root@linus:~# sg_readcap -l /dev/sda
Read Capacity results:
Protection: prot_en=0, p_type=0, p_i_exponent=0
Logical block provisioning: lbpme=0, lbprz=0
Last LBA=1998565375 (0x771fafff), Number of logical blocks=1998565376
Logical block length=512 bytes
Logical blocks per physical block exponent=0
Lowest aligned LBA=0
Hence:
Device size: 1023265472512 bytes, 975862.0 MiB, 1023.27 GB
root@linus:~# sg_vpd -a /dev/sdb
Supported VPD pages VPD page:
Supported VPD pages [sv]
Unit serial number [sn]
Device identification [di]

Unit serial number VPD page:
Unit serial number: 8718162D

Device Identification VPD page:
Addressed logical unit:
designator type: T10 vendor identification, code set: ASCII
vendor id: ADAPTEC
vendor specific: ARRAY 8718162D
designator type: EUI-64 based, code set: Binary
0x2d16188700d00000
root@linus:~# sg_readcap -l /dev/sdb
Read Capacity results:
Protection: prot_en=0, p_type=0, p_i_exponent=0
Logical block provisioning: lbpme=0, lbprz=0
Last LBA=9762222079 (0x245dfafff), Number of logical blocks=9762222080
Logical block length=512 bytes
Logical blocks per physical block exponent=0
Lowest aligned LBA=0
Hence:
Device size: 4998257704960 bytes, 4766710.0 MiB, 4998.26 GB, 5.00 TB
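
As a cross-check, the device sizes printed above follow directly from the number of logical blocks multiplied by the logical block length. A minimal Python sketch of that arithmetic (the helper name device_size is made up for illustration; it is not part of sg3_utils):

```python
def device_size(blocks: int, block_len: int = 512) -> dict:
    """Convert a logical block count into sizes in several units."""
    size = blocks * block_len
    return {
        "bytes": size,
        "MiB": size / 2**20,   # binary units, as in the sg_readcap MiB column
        "GiB": size / 2**30,
        "TiB": size / 2**40,
        "GB": size / 10**9,    # decimal units
        "TB": size / 10**12,
    }


if __name__ == "__main__":
    # Block counts taken from the sg_readcap output for sda and sdb.
    for name, blocks in (("sda", 1998565376), ("sdb", 9762222080)):
        s = device_size(blocks)
        print(name, s["bytes"], round(s["MiB"], 1), round(s["GB"], 2), round(s["TiB"], 2))
```

The gap between decimal and binary units is also why dmesg reports the same volumes as 1.02 TB/953 GiB and 5.00 TB/4.54 TiB.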


Best regards,
Peter Schneider

2024-05-09 04:18:46

by Peter Schneider

Subject: Re: Kernel 6.8.4 regression: aacraid controller not initialized any more, system boot hangs

On 09.05.2024 at 04:12, Peter Schneider wrote:

> [...]


I just found something else that looks interesting and might or might not be related to
the regression. To get the diagnostic output you asked for, I booted into the working
kernel version 6.5.13-5-pve, see above. Out of curiosity, I also ran these commands on my
other drives, sdc (PVE installation and root device) and sdf (my storage for VM ISO
installation images). These are both older Micron M4 SATA SSDs.

It turns out that these drives seem to have buggy firmware: they don't return all the VPD
pages they advertise. Querying for the advertised VPD page=0xb7 gives "sg_vpd failed:
Illegal request", please see below... Is this a smoking gun? I previously used both drives
in a Windows box for ~2 years, until I replaced them and put them aside. Then in 2015 I
recycled them for use in my newly built server machine. Before their original use in that
Windows box, I upgraded their firmware to 0309, because the factory firmware had known
issues with Windows.

In 2015, I didn't bother to look for newer firmware. But there is one, 070h, here:

https://www.crucial.de/support/ssd-support/m4-25-inch-support

and in the release notes

https://content.crucial.com/content/dam/crucial/ssd-products/m4/documents/crucial-m4-firmware-update-070h-en.pdf

there is mention of a potential device hang during power up being fixed with FW 070h.

Do you think I should try to apply this FW upgrade, to see if
- this fixes the issue below of an advertised VPD page not being returned
- this could possibly fix the whole regression, where the Adaptec controller is no longer
initialized since your kernel patch b5fc07a5fb56216a49e6c1d0b172d5464d99a89b ?

Or is this just guesswork? I mean, in the dmesg output, the Adaptec controller is
initialized BEFORE sdc and sdd. I don't know...





root@linus:~# sg_vpd -a /dev/sdc
Supported VPD pages VPD page:
Supported VPD pages [sv]
Unit serial number [sn]
Device identification [di]
ATA information (SAT) [ai]
Block limits (SBC) [bl]
Block device characteristics (SBC) [bdc]
Logical block provisioning (SBC) [lbpv]
Concurrent positioning ranges [cpr]

Unit serial number VPD page:
Unit serial number: 000000001141031B85A2

Device Identification VPD page:
Addressed logical unit:
designator type: vendor specific [0x0], code set: ASCII
vendor specific: 000000001141031B85A2
designator type: T10 vendor identification, code set: ASCII
vendor id: ATA
vendor specific: M4-CT256M4SSD2 000000001141031B85A2
designator type: NAA, code set: Binary
0x500a0751031b85a2

ATA information VPD page:
SAT Vendor identification: linux
SAT Product identification: libata
SAT Product revision level: 3.00
Device signature indicates SATA transport
Command code: 0xec
ATA command IDENTIFY DEVICE response summary:
model: M4-CT256M4SSD2
serial number: 000000001141031B85A2
firmware revision: 0309

Block limits VPD page (SBC):
Write same non-zero (WSNZ): 0
Maximum compare and write length: 0 blocks [Command not implemented]
Optimal transfer length granularity: 1 blocks
Maximum transfer length: 0 blocks [not reported]
Optimal transfer length: 0 blocks [not reported]
Maximum prefetch transfer length: 0 blocks [ignored]
Maximum unmap LBA count: 0 [Unmap command not implemented]
Maximum unmap block descriptor count: 0 [Unmap command not implemented]
Optimal unmap granularity: 1 blocks
Unmap granularity alignment valid: false
Unmap granularity alignment: 0 [invalid]
Maximum write same length: 0x3fffc0 blocks
Maximum atomic transfer length: 0 blocks [not reported]
Atomic alignment: 0 [unaligned atomic writes permitted]
Atomic transfer length granularity: 0 [no granularity requirement
Maximum atomic transfer length with atomic boundary: 0 blocks [not reported]
Maximum atomic boundary size: 0 blocks [can only write atomic 1 block]

Block device characteristics VPD page (SBC):
Non-rotating medium (e.g. solid state)
Product type: Not specified
WABEREQ=0
WACEREQ=0
Nominal form factor: 2.5 inch
ZONED=0
RBWZ=0
BOCS=0
FUAB=0
VBULS=0
DEPOPULATION_TIME=0 (seconds)

Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 0
Write same (16) with unmap bit supported (LBPWS): 1
Write same (10) with unmap bit supported (LBPWS10): 0
Logical block provisioning read zeros (LBPRZ): 0
Anchored LBAs supported (ANC_SUP): 0
Threshold exponent: 0 [threshold sets not supported]
Descriptor present (DP): 0
Minimum percentage: 0 [not reported]
Provisioning type: 0 (not known or fully provisioned)
Threshold percentage: 0 [percentages not supported]

VPD page=0xb7
fetching VPD page failed: Illegal request
sg_vpd failed: Illegal request
root@linus:~# sg_vpd -a /dev/sdf
Supported VPD pages VPD page:
Supported VPD pages [sv]
Unit serial number [sn]
Device identification [di]
ATA information (SAT) [ai]
Block limits (SBC) [bl]
Block device characteristics (SBC) [bdc]
Logical block provisioning (SBC) [lbpv]
Concurrent positioning ranges [cpr]

Unit serial number VPD page:
Unit serial number: 00000000120103285ED2

Device Identification VPD page:
Addressed logical unit:
designator type: vendor specific [0x0], code set: ASCII
vendor specific: 00000000120103285ED2
designator type: T10 vendor identification, code set: ASCII
vendor id: ATA
vendor specific: M4-CT256M4SSD2 00000000120103285ED2
designator type: NAA, code set: Binary
0x500a075103285ed2

ATA information VPD page:
SAT Vendor identification: linux
SAT Product identification: libata
SAT Product revision level: 3.00
Device signature indicates SATA transport
Command code: 0xec
ATA command IDENTIFY DEVICE response summary:
model: M4-CT256M4SSD2
serial number: 00000000120103285ED2
firmware revision: 0309

Block limits VPD page (SBC):
Write same non-zero (WSNZ): 0
Maximum compare and write length: 0 blocks [Command not implemented]
Optimal transfer length granularity: 1 blocks
Maximum transfer length: 0 blocks [not reported]
Optimal transfer length: 0 blocks [not reported]
Maximum prefetch transfer length: 0 blocks [ignored]
Maximum unmap LBA count: 0 [Unmap command not implemented]
Maximum unmap block descriptor count: 0 [Unmap command not implemented]
Optimal unmap granularity: 1 blocks
Unmap granularity alignment valid: false
Unmap granularity alignment: 0 [invalid]
Maximum write same length: 0x3fffc0 blocks
Maximum atomic transfer length: 0 blocks [not reported]
Atomic alignment: 0 [unaligned atomic writes permitted]
Atomic transfer length granularity: 0 [no granularity requirement
Maximum atomic transfer length with atomic boundary: 0 blocks [not reported]
Maximum atomic boundary size: 0 blocks [can only write atomic 1 block]

Block device characteristics VPD page (SBC):
Non-rotating medium (e.g. solid state)
Product type: Not specified
WABEREQ=0
WACEREQ=0
Nominal form factor: 2.5 inch
ZONED=0
RBWZ=0
BOCS=0
FUAB=0
VBULS=0
DEPOPULATION_TIME=0 (seconds)

Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 0
Write same (16) with unmap bit supported (LBPWS): 1
Write same (10) with unmap bit supported (LBPWS10): 0
Logical block provisioning read zeros (LBPRZ): 0
Anchored LBAs supported (ANC_SUP): 0
Threshold exponent: 0 [threshold sets not supported]
Descriptor present (DP): 0
Minimum percentage: 0 [not reported]
Provisioning type: 0 (not known or fully provisioned)
Threshold percentage: 0 [percentages not supported]

VPD page=0xb7
fetching VPD page failed: Illegal request
sg_vpd failed: Illegal request
root@linus:~# man sg_vpd





Best regards,
Peter Schneider

2024-05-14 07:08:12

by Peter Schneider

Subject: Re: Kernel 6.8.4 regression: aacraid controller not initialized any more, system boot hangs

Hi Martin,

Meanwhile, some more people facing the same problem have gathered in this thread in the
Proxmox user forum:

https://forum.proxmox.com/threads/pve-8-2-kernel-6-8-4-2-does-not-boot-cannot-find-root-device.145764/

Two of them also have an Adaptec controller, while another one has a PERC H310 Mini (LSI
based).

Did you have any chance to look into this in more depth? Do you need more information from
me to tackle this issue? I'm not a kernel developer, just a user, but I guess with proper
instruction I would be able to compile and test patches.

Best regards,
Peter Schneider


2024-05-14 12:56:45

by Martin K. Petersen

Subject: Re: Kernel 6.8.4 regression: aacraid controller not initialized any more, system boot hangs


Peter,

> Did you have any chance to look into this in more depth? Do you need
> more information from me to tackle this issue? I'm not a kernel
> developer, just a user, but I guess with proper instruction I would be
> able to compile and test patches.

I am afraid I haven't had time to look further into this yet due to
travel. The annual LSF/MM/BPF conference is taking place this week. I
will get back to you as soon as possible.

Before I make any recommendations wrt. firmware updates I would like to
understand why a change intended to make scanning more resilient against
device implementation errors has had the opposite effect. Especially
since the change in question reverts to how Linux has scanned for
devices for decades.

--
Martin K. Petersen Oracle Linux Engineering

2024-05-14 13:30:46

by Peter Schneider

Subject: Re: Kernel 6.8.4 regression: aacraid controller not initialized any more, system boot hangs

Hi Martin,

On 14.05.2024 at 14:54, Martin K. Petersen wrote:
>
> Peter,
>
>> Did you have any chance to look into this in more depth? Do you need
>> more information from me to tackle this issue? I'm not a kernel
>> developer, just a user, but I guess with proper instruction I would be
>> able to compile and test patches.
>
> I am afraid I haven't had time to look further into this yet due to
> travel. The annual LSF/MM/BPF conference is taking place this week. I
> will get back to you as soon as possible.

Ok, great, so have a good time wherever this conference takes place! It isn't very
urgent, because in the meantime I can just continue to use the 6.5.13 kernel on my
Proxmox machine. I just wanted to ping you again so this wouldn't fall through the cracks.

> Before I make any recommendations wrt. firmware updates I would like to
> understand why a change intended to make scanning more resilient against
> device implementation errors has had the opposite effect. Especially
> since the change in question reverts to how Linux has scanned for
> devices for decades.

I'll leave everything as it is now, and won't change the crime scene until you tell me
what to do next.


Best regards,
Peter Schneider