When SCSI disks are located on different SCSI hosts within a system,
asynchronous detection can lead to non-deterministic SCSI disk names.
This patch introduces the 'sd_probe_type=' kernel boot parameter.
In scenarios where SCSI disk name sensitivity is crucial, the probe type
of the build-in sd driver can be set to synchronous. As a result,
the scsi disk names are deterministic.
Signed-off-by: Jianlin Lv <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 9 ++++++++
drivers/scsi/sd.c | 23 +++++++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..083f741d63bb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5611,6 +5611,15 @@
non-zero "wait" parameter. See weight_single
and weight_many.
+ sd_probe_type= [HW,SCSI] Manual setup probe type of built-in scsi disk driver
+ Format: <int>
+ Default: 1
+ <int> -- device driver probe type to try
+ 0 - PROBE_DEFAULT_STRATEGY
+ 1 - PROBE_PREFER_ASYNCHRONOUS
+ 2 - PROBE_FORCE_SYNCHRONOUS
+ Example: sd_probe_type=1
+
skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 1624d528aa1f..78b80b9e5618 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -121,6 +121,9 @@ static void scsi_disk_release(struct device *cdev);
static DEFINE_IDA(sd_index_ida);
+/* Probe type of SCSI Disk Driver */
+static int sd_probe_type = PROBE_PREFER_ASYNCHRONOUS;
+
static mempool_t *sd_page_pool;
static struct lock_class_key sd_bio_compl_lkclass;
@@ -3826,6 +3829,25 @@ static int sd_resume_runtime(struct device *dev)
return sd_resume(dev);
}
+#ifndef MODULE
+
+/* Set the boot options to sd driver.
+ * Syntax is defined in Documentation/admin-guide/kernel-parameters.txt.
+ */
+static int __init sd_probe_setup(char *str)
+{
+ int probe_type = -1;
+
+ if (get_option(&str, &probe_type) && probe_type >= 0 && probe_type < 3)
+ sd_probe_type = probe_type;
+
+ return 1;
+}
+
+__setup("sd_probe_type=", sd_probe_setup);
+
+#endif
+
/**
* init_sd - entry point for this driver (both when built in or when
* a module).
@@ -3858,6 +3880,7 @@ static int __init init_sd(void)
goto err_out_class;
}
+ sd_template.gendrv.probe_type = sd_probe_type;
err = scsi_register_driver(&sd_template.gendrv);
if (err)
goto err_out_driver;
--
2.25.1
On 6/5/23 22:12, Jianlin Lv wrote:
> In scenarios where SCSI disk name sensitivity is crucial, the probe type
> of the build-in sd driver can be set to synchronous. As a result,
> the scsi disk names are deterministic.
Which are these scenarios?
Additionally, how can synchronous scanning of sd devices make a
difference if there are multiple host bus adapters that use an interface
type that is scanned asynchronously?
Bart.
On Tue, Jun 06, 2023 at 05:12:17AM +0000, Jianlin Lv wrote:
> When SCSI disks are located on different SCSI hosts within a system,
> asynchronous detection can lead to non-deterministic SCSI disk names.
Yes, as can various other conditions. Your code better be able to deal
with that.
On Wed, Jun 7, 2023 at 1:38 AM Bart Van Assche <[email protected]> wrote:
>
> On 6/5/23 22:12, Jianlin Lv wrote:
> > In scenarios where SCSI disk name sensitivity is crucial, the probe type
> > of the build-in sd driver can be set to synchronous. As a result,
> > the scsi disk names are deterministic.
>
> Which are these scenarios?
>
> Additionally, how can synchronous scanning of sd devices make a
> difference if there are multiple host bus adapters that use an interface
> type that is scanned asynchronously?
>
> Bart.
The change was prompted by an issue with SCSI devices probing
non-deterministic. On the issue node, there are two types of SCSI hosts:
1. MegaRAID adapters associated with 24 local disks. The disks are named
sequentially as "sda," "sdb," and so on, up to "sdx."
2. STAT controllers associated with the root disk, named "sdy."
Both the MegaRAID adapters and the SATA controller (PCH) are accessed via
the PCIe bus. In theory, depending on their PCIe bus ID in ascending order,
the devices should be initialized in ascending order as well.
However, the SCSI driver currently probes devices asynchronously to allow
for more parallelism.
__driver_attach
->if (driver_allows_async_probing(drv))
async_schedule_dev(__driver_attach_async_helper, dev);
During the probing of SCSI disks attached to MegaRAID, root disk probing
may occur, resulting in a disk naming inconsistency issue.
For example, if root disk probing happens in the middle,it is named "sdq",
The subsequent SCSI disks that are probed will have their names drift,
starting from "sdr" up to "sdy."
For cloud deployment, the local volume provisioner detects and creates PVs
for each local disk (from sda to sdx) on the host, and it cleans up the
disks when they are released.
This requires the logical names of the disks to be deterministic.
Therefore, I have submitted this patch to allow users to configure the
SCSI disk probe type.
If synchronous probing is configured, the SCSI disk probing order is
deterministic and will follow the ascending order of the PCIe bus ID.
Jianlin
On 6/7/23 08:55, Jianlin Lv wrote:
> 1. MegaRAID adapters associated with 24 local disks. The disks are named
> sequentially as "sda," "sdb," and so on, up to "sdx."
> 2. STAT controllers associated with the root disk, named "sdy."
>
> Both the MegaRAID adapters and the SATA controller (PCH) are accessed via
> the PCIe bus. In theory, depending on their PCIe bus ID in ascending order,
> the devices should be initialized in ascending order as well.
Hmm ... I don't think there is anything that prevents the PCIe maintainer
from changing the PCIe probing behavior from synchronous to asynchronous?
In other words, I don't think it is safe to assume that PCIe devices are
always scanned in the same order.
> For cloud deployment, the local volume provisioner detects and creates PVs
> for each local disk (from sda to sdx) on the host, and it cleans up the
> disks when they are released.
> This requires the logical names of the disks to be deterministic.
I see two possible solutions:
- Change the volume provisioner such that it uses disk references that do
not depend on the probing order, e.g. /dev/disk/by-id/...
- Implement an algorithm in systemd that makes disk names predictable.
An explanation of how predictable names work for network interfaces is
available here: https://wiki.debian.org/NetworkInterfaceNames. The
systemd documentation about predictable network names is available here:
https://www.freedesktop.org/software/systemd/man/systemd.net-naming-scheme.html
These alternatives have the advantage that disk scanning remains asynchronous.
Thanks,
Bart.
On Thu, Jun 8, 2023 at 1:07 AM Bart Van Assche <[email protected]> wrote:
>
> On 6/7/23 08:55, Jianlin Lv wrote:
> > 1. MegaRAID adapters associated with 24 local disks. The disks are named
> > sequentially as "sda," "sdb," and so on, up to "sdx."
> > 2. STAT controllers associated with the root disk, named "sdy."
> >
> > Both the MegaRAID adapters and the SATA controller (PCH) are accessed via
> > the PCIe bus. In theory, depending on their PCIe bus ID in ascending order,
> > the devices should be initialized in ascending order as well.
>
> Hmm ... I don't think there is anything that prevents the PCIe maintainer
> from changing the PCIe probing behavior from synchronous to asynchronous?
> In other words, I don't think it is safe to assume that PCIe devices are
> always scanned in the same order.
>
> > For cloud deployment, the local volume provisioner detects and creates PVs
> > for each local disk (from sda to sdx) on the host, and it cleans up the
> > disks when they are released.
> > This requires the logical names of the disks to be deterministic.
>
> I see two possible solutions:
> - Change the volume provisioner such that it uses disk references that do
> not depend on the probing order, e.g. /dev/disk/by-id/...
Yes, The "/dev/disk/by-id/" can uniquely identify SCSI devices. However,
I don't think it is suitable for the volume provisioner workflow.
For nodes of the same SKU , a unified YAML file will be defined to instruct
the volume provisioner on how to manage the local disks.
If use WWID, it would mean that a unique YAML file needs to be defined
for each node. This approach becomes impractical when dealing with a large
number of work nodes.
Jianlin
> - Implement an algorithm in systemd that makes disk names predictable.
> An explanation of how predictable names work for network interfaces is
> available here: https://wiki.debian.org/NetworkInterfaceNames. The
> systemd documentation about predictable network names is available here:
> https://www.freedesktop.org/software/systemd/man/systemd.net-naming-scheme.html
>
> These alternatives have the advantage that disk scanning remains asynchronous.
>
> Thanks,
>
> Bart.
>
On Wed, Jun 7, 2023 at 10:10 PM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Jun 06, 2023 at 05:12:17AM +0000, Jianlin Lv wrote:
> > When SCSI disks are located on different SCSI hosts within a system,
> > asynchronous detection can lead to non-deterministic SCSI disk names.
>
> Yes, as can various other conditions. Your code better be able to deal
> with that.
Could you give an example about "other conditions" ?
Jianlin
On 6/7/23 19:51, Jianlin Lv wrote:
> On Thu, Jun 8, 2023 at 1:07 AM Bart Van Assche <[email protected]> wrote:
>> On 6/7/23 08:55, Jianlin Lv wrote:
>> I see two possible solutions:
>> - Change the volume provisioner such that it uses disk references that do
>> not depend on the probing order, e.g. /dev/disk/by-id/...
>
> Yes, The "/dev/disk/by-id/" can uniquely identify SCSI devices. However,
> I don't think it is suitable for the volume provisioner workflow.
> For nodes of the same SKU , a unified YAML file will be defined to instruct
> the volume provisioner on how to manage the local disks.
> If use WWID, it would mean that a unique YAML file needs to be defined
> for each node. This approach becomes impractical when dealing with a large
> number of work nodes.
Please consider using the paths available in /dev/disk/by-path.
Thanks,
Bart.
On Fri, Jun 9, 2023 at 12:23 AM Bart Van Assche <[email protected]> wrote:
>
> On 6/7/23 19:51, Jianlin Lv wrote:
> > On Thu, Jun 8, 2023 at 1:07 AM Bart Van Assche <[email protected]> wrote:
> >> On 6/7/23 08:55, Jianlin Lv wrote:
> >> I see two possible solutions:
> >> - Change the volume provisioner such that it uses disk references that do
> >> not depend on the probing order, e.g. /dev/disk/by-id/...
> >
> > Yes, The "/dev/disk/by-id/" can uniquely identify SCSI devices. However,
> > I don't think it is suitable for the volume provisioner workflow.
> > For nodes of the same SKU , a unified YAML file will be defined to instruct
> > the volume provisioner on how to manage the local disks.
> > If use WWID, it would mean that a unique YAML file needs to be defined
> > for each node. This approach becomes impractical when dealing with a large
> > number of work nodes.
> Please consider using the paths available in /dev/disk/by-path.
Sorry for the late reply.
I carefully checked the server in the production environment and found
some corner cases where there are differences in the dev/disk/by-path/ of
nodes with the same SKU. These differences are caused by inconsistent
target_numbers.
For example:
diff -y aa-by-path bb-by-path
pci-0000:86:00.0-scsi-0:3:86:0 -> ../../sda |
pci-0000:86:00.0-scsi-0:3:88:0 -> ../../sda
pci-0000:86:00.0-scsi-0:3:87:0 -> ../../sdb |
pci-0000:86:00.0-scsi-0:3:89:0 -> ../../sdb
pci-0000:86:00.0-scsi-0:3:88:0 -> ../../sdc |
pci-0000:86:00.0-scsi-0:3:90:0 -> ../../sdc
pci-0000:86:00.0-scsi-0:3:89:0 -> ../../sdd |
pci-0000:86:00.0-scsi-0:3:91:0 -> ../../sdd
pci-0000:86:00.0-scsi-0:3:90:0 -> ../../sde |
pci-0000:86:00.0-scsi-0:3:92:0 -> ../../sde
pci-0000:86:00.0-scsi-0:3:92:0 -> ../../sdf |
pci-0000:86:00.0-scsi-0:3:93:0 -> ../../sdf
pci-0000:86:00.0-scsi-0:3:93:0 -> ../../sdg |
pci-0000:86:00.0-scsi-0:3:94:0 -> ../../sdg
pci-0000:86:00.0-scsi-0:3:94:0 -> ../../sdh |
pci-0000:86:00.0-scsi-0:3:95:0 -> ../../sdh
I'm still not sure what causes the target_numbers to be different.
However, the existence of such corner cases makes /dev/disk/by-path
unusable for the volume provisioner, similar to /dev/disk/by-id/.
So, If it's not possible to configure disk serialization detection, then
it seems that implementing predictable disk names is the only option.
Jianlin
>
> Thanks,
>
> Bart.