2013-10-13 17:23:29

by vaughan

Subject: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

Hi James,

[1.] One line summary of the problem:
special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

[2.] Full description of the problem/report:
For instance, the storage presents 8 iSCSI LUNs, but LUN No.7 is
misconfigured or otherwise faulty.
The kernel then logs:
kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning,
scan aborted
As a result, LUN No.8 is never scanned and remains unavailable.
We confirmed that Windows and Solaris continue the scan
and make LUN No.1,2,3,4,5,6 and 8 available.

Log snippet is as below:
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan: INQUIRY
pass 1 length 36
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send: 0xffff8801e9bd4280
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00
00 00 24 00
Aug 24 00:32:49 vmhodtest019 kernel: buffer = 0xffff8801f71fc180,
bufflen = 36, queuecommand 0xffffffffa00b99e7
Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done:
0xffff8801e9bd4280 SUCCESS
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result:
hostbyte=DID_OK driverbyte=DRIVER_OK
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00
00 00 24 00
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key : Not Ready
[current]
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense: Logical
unit not accessible, target port in unavailable state
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host busy 1 failed 0
Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes done.
Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed with code
0x8000002
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected response
from lun 7 while scanning, scan aborted
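
The result code in the log above can be taken apart by hand. The following is a minimal sketch, assuming the 32-bit result layout used by kernels of this era (status, message, host and driver byte, from lowest to highest); the struct and function names are illustrative and not kernel API.

```c
#include <stdint.h>

/* Illustrative container for the four bytes packed into a SCSI result word.
 * Assumed layout: bits 0-7 status, 8-15 message, 16-23 host, 24-31 driver. */
struct scsi_result_fields {
    uint8_t status; /* raw SCSI status byte; 0x02 is CHECK CONDITION */
    uint8_t msg;    /* message byte */
    uint8_t host;   /* host byte; 0x00 is DID_OK */
    uint8_t driver; /* driver byte; 0x08 is DRIVER_SENSE (sense data valid) */
};

/* Split a result word such as 0x8000002 into its four bytes. */
static struct scsi_result_fields decode_scsi_result(uint32_t result)
{
    struct scsi_result_fields f;

    f.status = (uint8_t)(result & 0xff);
    f.msg    = (uint8_t)((result >> 8) & 0xff);
    f.host   = (uint8_t)((result >> 16) & 0xff);
    f.driver = (uint8_t)((result >> 24) & 0xff);
    return f;
}
```

For 0x8000002 this yields status 0x02 and driver byte 0x08 with host and message bytes zero, which matches the "Not Ready" sense data printed just before it.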

Reading scsi_report_lun_scan(), I found:
Linux uses an INQUIRY command to probe each LUN listed in the
REPORT LUNS response.
It assumes every probe command gets a valid result. Otherwise, it
treats the whole peripheral as nonexistent or dead.
If the INQUIRY response passes the validity checks and indicates 'LUN
not present', the loop does not break but continues with the scan.
In the log, the INQUIRY to LUN 7 returns sense asc,ascq=04h,0Ch (Logical
unit not accessible, target port in unavailable state).
This sense code is not handled specially, so scsi_probe_lun() returns
-EIO and the scan is aborted.
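
The difference between the two behaviours can be sketched in a few lines of standalone C. This is only a toy model of the loop described above, not the kernel code: the enum, fake_probe() and scan_luns() are hypothetical names, and the probe simply pretends that LUN 7 fails its INQUIRY.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe outcomes, loosely modelled on the three cases above. */
enum probe_result {
    PROBE_LUN_PRESENT,    /* INQUIRY succeeded, LUN is usable */
    PROBE_TARGET_PRESENT, /* valid INQUIRY data, but "LUN not present" */
    PROBE_NO_RESPONSE     /* INQUIRY failed, e.g. with sense 04h,0Ch */
};

/* Stand-in for the real probe: only LUN 7 fails. */
static enum probe_result fake_probe(unsigned int lun)
{
    return lun == 7 ? PROBE_NO_RESPONSE : PROBE_LUN_PRESENT;
}

/* Walk the reported LUN list and count how many LUNs get attached.
 * With abort_on_no_response set, the first failed probe ends the scan;
 * otherwise the failed LUN is skipped and the scan goes on. */
static size_t scan_luns(const unsigned int *luns, size_t n,
                        bool abort_on_no_response)
{
    size_t attached = 0;

    for (size_t i = 0; i < n; i++) {
        enum probe_result res = fake_probe(luns[i]);

        if (res == PROBE_NO_RESPONSE) {
            if (abort_on_no_response)
                break;
            continue;
        }
        if (res == PROBE_LUN_PRESENT)
            attached++;
    }
    return attached;
}
```

With the LUN list 1..8, aborting attaches 6 LUNs, while continuing attaches 7 (LUNs 1-6 and 8), which is the behaviour reported for Windows and Solaris.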

I have two questions:
1. Is it correct for hardware to return sense 04h,0Ch to an INQUIRY,
even after presenting this LUN in response to the REPORT LUNS command?
2. Since Windows and Solaris can continue the scan, is it reasonable for
Linux to do the same, at least for fault tolerance?

Below is information about our storage setup:
The storage array is configured in cluster mode, and a "default"
target group and a "default" initiator group exist on
the storage; they include the target node names of both nodes in the
cluster and all initiator names, respectively.
On the partner node, a LUN with ID 7 was mapped to the default target
group/initiator group.
Since that LUN is owned by the partner node, the SCSI INQUIRY
fails on it and as a result the initiator aborts the scan.

Thanks,
Vaughan


2013-10-14 11:13:50

by Hannes Reinecke

Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/13/2013 07:23 PM, Vaughan Cao wrote:
> I have two questions:
> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
> again, even after presenting this lun in response to REPORT_LUN
> command?
Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.

> 2. Since windows and solaris can continue scan, is it reasonable for
> linux to do the same, even for a fault-tolerance purpose?
>
Hmm. Yes, and no.

_Actually_ this is an issue with the target, as it looks as if it
returns the above sense code in response to an 'INQUIRY' sent to the
device.
SPC explicitly states that the INQUIRY command should _not_ fail
for unavailable devices.
But yeah, we probably should work around this issue.
Nevertheless, please raise this issue with your array vendor.

Please try the attached patch.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Attachments:
scsi_scan-continue-after-error.patch (1.09 kB)

2013-10-14 12:52:37

by Steffen Maier

Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

Hi Hannes,

On 10/14/2013 01:13 PM, Hannes Reinecke wrote:

> From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17 00:00:00 2001
> From: Hannes Reinecke <[email protected]>
> Date: Mon, 14 Oct 2013 13:11:22 +0200
> Subject: [PATCH] scsi_scan: continue report_lun_scan after error
>
> When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
> does _not_ indicate that the entire target is done for.
> So continue scanning for the remaining devices.
>
> Signed-off-by: Hannes Reinecke <[email protected]>
>
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 307a811..973a121 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct scsi_target *starget, int bflags,
> lun, NULL, NULL, rescan, NULL);
> if (res == SCSI_SCAN_NO_RESPONSE) {
> /*
> - * Got some results, but now none, abort.
> + * Got some results, but now none, ignore.
> */
> sdev_printk(KERN_ERR, sdev,
> "Unexpected response"
> - " from lun %d while scanning, scan"
> - " aborted\n", lun);
> - break;
> + " from lun %d while scanning,"
> + " ignoring device\n", lun);
> }
> }
> }

In LLDDs that do their own initiator based LUN masking (because the midlayer does not have this functionality to enable hardware virtualization without NPIV, or to work around suboptimal LUN masking on the target), they are likely to return -ENXIO from slave_alloc(), making scsi_alloc_sdev() return NULL, being converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun() and thus going through the same code path above.

E.g. zfcp does return -ENXIO if the particular LUN was not made known to the unit whitelist (via zfcp sysfs attribute unit_add).
If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the scsi lun (e.g. on remote port recovery), we see exactly above error message for the first LUN in the response of report lun which is not explicitly attached to zfcp.
IIRC, other LLDDs such as bfa also do similar stuff [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].

For those cases, I think it makes sense to abort scsi_report_lun_scan(). Otherwise we would force the LLDD to return -ENXIO for every single LUN reported by report lun but not explicitly added to the LLDD LUN whitelist; and this would likely *flood kernel messages*.

Maybe Vaughan's case needs to be distinguished in a patch.
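
One way to picture such a distinction is sketched below. This is purely hypothetical and not a proposed patch: classify_failure() and should_abort_scan() are made-up names, and the policy (abort only when the LLDD itself rejected the LUN with -ENXIO, continue when the INQUIRY failed) is just one reading of the two cases discussed above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reasons why probing a reported LUN produced no device. */
enum probe_failure {
    FAIL_NONE,          /* probe worked */
    FAIL_LLDD_REJECTED, /* slave_alloc() returned -ENXIO (LUN not whitelisted) */
    FAIL_INQUIRY_ERROR  /* device was allocated, but INQUIRY failed */
};

/* Classify a probe from the two return codes involved. slave_alloc_ret is
 * checked first because allocation happens before any INQUIRY is sent. */
static enum probe_failure classify_failure(int slave_alloc_ret, int inquiry_ret)
{
    if (slave_alloc_ret == -ENXIO)
        return FAIL_LLDD_REJECTED;
    if (inquiry_ret != 0)
        return FAIL_INQUIRY_ERROR;
    return FAIL_NONE;
}

/* Assumed policy: stop the REPORT LUNS walk only for LLDD-side masking,
 * so that one message is logged instead of one per masked LUN. */
static bool should_abort_scan(enum probe_failure f)
{
    return f == FAIL_LLDD_REJECTED;
}
```

Under this assumed policy the zfcp case keeps its current behaviour, while a failed INQUIRY like Vaughan's would skip only the affected LUN.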

Some more details (because I happened to have written this up already):

MESSAGE
=======

kernel: sd 0:0:17:0: Unexpected response from lun 1 while scanning, scan aborted

SUMMARY
=======

requirements for reproduction

1. zfcp with auto lun scan support but disabled
(i.e. kernel >=2.6.37 , and no NPIV or zfcp.allow_lun_scan=0)
2. opened target port which supports the report lun SCSI command (SCSI-3)
3. attach lun 0 to that target port by means of zfcp's unit_add sysfs attribute
4. perform scsi target scan for that target port

=> message appears for first lun in list of report lun response
which is not attached to zfcp by means of the unit_add sysfs attribute

Hence, this only occurs if requirement [3] above is met and
the storage target uses non-optimal LUN masking.
The message is harmless: it can either be ignored, or the LUN masking
on the target can be fixed.

Trigger [4] can be activated in various different situations,
see examples sorted along increasing impact below.

EXAMPLES
========

Kernel >= v2.6.37

While below uses a V7000 as target, the target type does not matter;
it's just the same with DS8000 or other storage.

[root@host:~](0)# scsi_logging_level -g
Current scsi logging level:
dev.scsi.logging_level = 4605

[root@host:~](0)# systool -m zfcp -v
Module = "zfcp"
Parameters:
allow_lun_scan = "N"
dbfsize = "4"
device = "(null)"
dif = "N"
no_auto_port_rescan = "N"
queue_depth = "32"

[root@host:~](0)# chccwdev -e 3c40

[root@host:~](0)# ziorep_config -A
Host: host0
CHPID: 60
Adapter: 0.0.3c40
Sub-Ch.: 0.0.001b
Name: 0xc05076ffe4801a51
P-Name: 0xc05076ffe4801a51
Version: 0x0006
LIC: 0x00000410
Type: NPort (fabric via point-to-point)
Speed: 8 Gbit
State: Online

[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40/0x5005076802100c1a](0)#
echo 0x0000000000000000 >| unit_add
[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40/0x5005076802100c1a](0)#
echo 0x0002000000000000 >| unit_add
[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)# lszfcp -D
0.0.3c40/0x5005076802100c1a/0x0000000000000000 0:0:17:0
0.0.3c40/0x5005076802100c1a/0x0002000000000000 0:0:17:2
[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)# lsscsi -g
[0:0:17:0] disk IBM 2145 0000 /dev/sda /dev/sg0
[0:0:17:2] disk IBM 2145 0000 /dev/sdb /dev/sg1
[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)# sg_luns -v /dev/sg0
report luns cdb: a0 00 00 00 00 00 00 00 20 00 00 00
report luns: requested 8192 bytes but got 2376 bytes
Lun list length = 2368 which imples 296 lun entries
Report luns [select_report=0]:
0000000000000000
0001000000000000
0002000000000000
0003000000000000
...
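
The numbers in the sg_luns output above follow from the REPORT LUNS response layout: a 4-byte big-endian LUN list length, 4 reserved bytes, then one 8-byte entry per LUN. The sketch below works through that arithmetic; the helper names are illustrative, and simple_lun_number() only handles the simplest case seen here (byte 0 zero, LUN number in byte 1).

```c
#include <stdint.h>

/* LUN list length: big-endian 32-bit value in bytes 0-3 of the response. */
static uint32_t report_luns_list_length(const uint8_t *resp)
{
    return ((uint32_t)resp[0] << 24) | ((uint32_t)resp[1] << 16) |
           ((uint32_t)resp[2] << 8) | (uint32_t)resp[3];
}

/* Each LUN entry is 8 bytes long. */
static uint32_t report_luns_entry_count(const uint8_t *resp)
{
    return report_luns_list_length(resp) / 8;
}

/* Total response size: 8-byte header plus the LUN list. */
static uint32_t report_luns_total_bytes(const uint8_t *resp)
{
    return report_luns_list_length(resp) + 8;
}

/* Decode an entry such as 00 01 00 00 00 00 00 00 into LUN 1.
 * Returns -1 for any entry whose first byte is not zero, because other
 * addressing methods are outside the scope of this sketch. */
static int simple_lun_number(const uint8_t *entry)
{
    if (entry[0] != 0)
        return -1;
    return entry[1];
}
```

A list length of 2368 gives 2368 / 8 = 296 entries and 2368 + 8 = 2376 bytes in total, matching "requested 8192 bytes but got 2376 bytes" above.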

Example 1: SCSI HOST SCAN

this has negligible impact on currently running workload and can
safely be executed for individual reproduction

[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)#
echo "- - -" >| host0/scsi_host/host0/scan

kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: Sending REPORT LUNS to host 0 channel 0 id 17 (try 0)
kernel: scsi scan: REPORT LUNS successful (try 0) result 0x0
kernel: sd 0:0:17:0: scsi scan: REPORT LUN scan
kernel: scsi scan: device exists on 0:0:17:0
kernel: sd 0:0:17:0: Unexpected response from lun 1 while scanning, scan aborted

Example 2: PORT RECOVERY

this causes a short interruption of I/O to all LUNs at that target port

includes a scsi target (re)scan of rport-0:0-17 / 0x5005076802100c1a

[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40/0x5005076802100c1a](0)#
echo 0 >| failed

kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: Sending REPORT LUNS to host 0 channel 0 id 17 (try 0)
kernel: sd 0:0:17:0: Done: RETRY
kernel: sd 0:0:17:0: Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK
kernel: sd 0:0:17:0: CDB: Report luns: a0 00 00 00 00 00 00 00 10 00 00 00
kernel: scsi scan: REPORT LUNS successful (try 0) result 0x0
kernel: sd 0:0:17:0: scsi scan: REPORT LUN scan
kernel: scsi scan: device exists on 0:0:17:0
kernel: sd 0:0:17:0: Unexpected response from lun 1 while scanning, scan aborted
kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: device exists on 0:0:17:2

Two trailing "device exists" are from zfcp's unit recovery for each
lun at the recovered remote port. This causes additional individual
scsi_scan_target() calls without wildcards but for a specific lun instead.

Example 3: ADAPTER RECOVERY

this causes a short interruption of I/O over all paths through this FCP device

includes recovery of rport-0:0-17 / 0x5005076802100c1a

[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)# echo 0 >| failed

kernel: qdio: 0.0.3c40 ZFCP on SC 1b using AI:1 QEBSM:1 PCI:1 TDD:1 SIGA: W A
kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: Sending REPORT LUNS to host 0 channel 0 id 17 (try 0)
kernel: scsi scan: REPORT LUNS successful (try 0) result 0x0
kernel: sd 0:0:17:0: scsi scan: REPORT LUN scan
kernel: scsi scan: device exists on 0:0:17:0
kernel: sd 0:0:17:0: Unexpected response from lun 1 while scanning, scan aborted
kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: device exists on 0:0:17:2

DETAILS
=======

Square brackets indicate where above requirements come into play.

[4]
scsi_scan_target(prnt, 0/*channel*/, id/*target*/, SCAN_WILD_CARD/*lun*/, rscan)
  __scsi_scan_target()
    scsi_probe_and_add_lun(starget, 0, &bflags, NULL, rescan, NULL); [3]
    scsi_report_lun_scan(starget, bflags, rescan) [2] {
      foreach lun in report lun response {
        scsi_probe_and_add_lun() {
          if exists => "kernel: scsi scan: device exists on <HCTL>"
          else {
            scsi_alloc_sdev() {
              ret = shost->hostt->slave_alloc() => zfcp_scsi_slave_alloc() {
                if (!unit && !(allow_lun_scan && npiv)) {
                  put_device(&port->dev);
                  return -ENXIO; [1]
                }
              }
              if (ret) {
                /*
                 * if LLDD reports slave not present, don't clutter
                 * console with alloc failure messages
                 */
                if (ret == -ENXIO)
                  display_failure_msg = 0;
                goto out_device_destroy;
              }
            }
            if allocation failed, return early with SCSI_SCAN_NO_RESPONSE
            else continue lun probing
          }
        }
        if (res == SCSI_SCAN_NO_RESPONSE) {
          /*
           * Got some results, but now none, abort.
           */
          sdev_printk(KERN_ERR, sdev,
                      "Unexpected response"
                      " from lun %d while scanning, scan"
                      " aborted\n", lun);
          break;
        }
      }
    }


--
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on System z Development

IBM Deutschland Research & Development GmbH
Chair of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Boeblingen
Court of registration: Amtsgericht Stuttgart, HRB 243294

2013-10-14 13:18:42

by Hannes Reinecke

Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/14/2013 02:51 PM, Steffen Maier wrote:
> Hi Hannes,
>
> In LLDDs that do their own initiator based LUN masking (because the midlayer does not have this
> functionality to enable hardware virtualization without NPIV, or to work around suboptimal LUN
> masking on the target), they are likely to return -ENXIO from slave_alloc(), making scsi_alloc_sdev()
> return NULL, being converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun() and thus going
> through the same code path above.
>
Ah. Hmm. Yes, they would.

However, I personally would question this approach, as SPC states that

> The REPORT LUNS command (see table 284) requests the device
> server to return the peripheral device logical unit inventory
> accessible to the I_T nexus.

So by plain reading this would mean that you should either modify
'REPORT LUNS' to not show the masked LUNs, or set the pqual field to
'0x10' or '0x11' for those LUNs.
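
For reference, the peripheral qualifier sits in the top three bits of byte 0 of the standard INQUIRY data, with the peripheral device type in the low five bits. The sketch below shows the decoding; the function names are illustrative, and reading the values above as SPC's qualifier codes 001b and 011b is an assumption.

```c
#include <stdint.h>

/* Peripheral qualifier: bits 7-5 of standard INQUIRY data byte 0.
 * 000b = device connected, 001b = supported but not connected,
 * 011b = logical unit not supported at this address. */
static unsigned int inquiry_peripheral_qualifier(uint8_t byte0)
{
    return (byte0 >> 5) & 0x7;
}

/* Peripheral device type: bits 4-0 of the same byte (1Fh = unknown/none). */
static unsigned int inquiry_device_type(uint8_t byte0)
{
    return byte0 & 0x1f;
}
```

A byte 0 of 0x7f, for example, decodes to qualifier 011b with device type 1Fh, the usual answer for a LUN that is not present.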

> E.g. zfcp does return -ENXIO if the particular LUN was not made known to the unit whitelist
> (via zfcp sysfs attribute unit_add).
> If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the scsi
> lun (e.g. on remote port recovery), we see exactly above error message for the first LUN in
> the response of report lun which is not explicitly attached to zfcp.
> IIRC, other LLDDs such as bfa also do similar stuff [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>
> For those cases, I think it makes sense to abort scsi_report_lun_scan().
> Otherwise we would force the LLDD to return -ENXIO for every single LUN reported by report lun but not
> explicitly added to the LLDD LUN whitelist; and this would likely *flood kernel messages*.
>
> Maybe Vaughan's case needs to be distinguished in a patch.
>
Well, as mentioned initially, the real issue is that the target
aborts an INQUIRY while being in 'Unavailable'. Which, according to
SPC-3 (or later), is a violation of the spec.

So we _could_ just tell them to go away, but admittedly that's bad
style. Which means we'll have to implement a workaround; the above
was just a simple way of implementing it. If that's not working of
course we'll have to do something else.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2013-10-14 13:32:09

by Hannes Reinecke

Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/14/2013 03:18 PM, Hannes Reinecke wrote:
> On 10/14/2013 02:51 PM, Steffen Maier wrote:
>> Hi Hannes,
>>
>> In LLDDs that do their own initiator based LUN masking (because the midlayer does not have this
>> functionality to enable hardware virtualization without NPIV, or to work around suboptimal LUN
>> masking on the target), they are likely to return -ENXIO from slave_alloc(), making scsi_alloc_sdev()
>> return NULL, being converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun() and thus going
>> through the same code path above.
>>
> Ah. Hmm. Yes, they would.
>
> However, I personally would question this approach, as SPC states that
>
>> The REPORT LUNS command (see table 284) requests the device
>> server to return the peripheral device logical unit inventory
>> accessible to the I_T nexus.
>
> So by plain reading this would meant that you either should modify
> 'REPORT LUNS' to not show the masked LUNs, or set the pqual field to
> '0x10' or '0x11' for those LUNs.
>
>> E.g. zfcp does return -ENXIO if the particular LUN was not made known to the unit whitelist
>> (via zfcp sysfs attribute unit_add).
>> If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the scsi
>> lun (e.g. on remote port recovery), we see exactly above error
> message for the first LUN in
>> the response of report lun which is not explicitly attached to zfcp.
>> IIRC, other LLDDs such as bfa also do similar stuff [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>
>> For those cases, I think it makes sense to abort scsi_report_lun_scan().
>> Otherwise we would force the LLDD to return -ENXIO for every
> single LUN reported by report lun but not
>> explicitly added to the LLDD LUN whitelist; and this would likely
> *flood kernel messages*.
>>
>> Maybe Vaughan's case needs to be distinguished in a patch.
>>
> Well, as mentioned initially, the real issue is that the target
> aborts an INQUIRY while being in 'Unavailable'. Which, according to
> SPC-3 (or later), is a violation of the spec.
>
> So we _could_ just tell them to go away, but admittedly that's bad
> style. Which means we'll have to implement a workaround; the above
> was just a simple way of implementing it. If that's not working of
> course we'll have to do something else.
>
What about this patch:

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 973a121..01a7d69 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -594,6 +594,19 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
(sshdr.asc == 0x29)) &&
(sshdr.ascq == 0))
continue;
+ /*
+ * Some buggy implementations return
+ * 'target port in unavailable state'
+ * even on INQUIRY.
+ * Set peripheral qualifier 3
+ * for these devices.
+ */
+ if ((sshdr.sense_key == NOT_READY) &&
+ ((sshdr.asc == 0x04) &&
+ (sshdr.ascq == 0x0C))) {
+ inq_result[0] = 3 << 5;
+ return 0;
+ }
}
} else {
/*

(watch out, linebreaks mangled and all that).
Should be working for this particular case without interrupting
normal workflow, now should it not?
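
For readers who want to poke at the condition outside the kernel tree, here is a minimal standalone sketch of the check in the hunk above. It is not the kernel code: the helper names are hypothetical, and NOT_READY is defined locally with the standard SCSI sense key value 0x02.

```c
#include <stdbool.h>
#include <stdint.h>

#define NOT_READY 0x02	/* SCSI sense key NOT READY */

/*
 * Hypothetical helper, not a kernel function: true if the sense data
 * reads "LOGICAL UNIT NOT ACCESSIBLE, TARGET PORT IN UNAVAILABLE STATE"
 * (sense key NOT READY, asc 0x04, ascq 0x0c).
 */
static bool sense_is_port_unavailable(uint8_t sense_key, uint8_t asc,
				      uint8_t ascq)
{
	return sense_key == NOT_READY && asc == 0x04 && ascq == 0x0c;
}

/*
 * Byte 0 of the faked INQUIRY data as set by the patch: peripheral
 * qualifier 3 in bits 7..5, peripheral device type bits left at zero.
 */
static uint8_t fake_inquiry_byte0(void)
{
	return 3 << 5;
}
```

With these definitions, the sense data from the log (NOT READY, 04h/0Ch) matches, while a unit attention such as 29h/00h does not.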

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2013-10-14 15:19:10

by vaughan

[permalink] [raw]
Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle


On 2013-10-14 21:18, Hannes Reinecke wrote:
> On 10/14/2013 02:51 PM, Steffen Maier wrote:
>> Hi Hannes,
>>
>> On 10/14/2013 01:13 PM, Hannes Reinecke wrote:
>>> On 10/13/2013 07:23 PM, Vaughan Cao wrote:
>>>> Hi James,
>>>>
>>>> [1.] One line summary of the problem:
>>>> special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
>>>>
>>>> [2.] Full description of the problem/report:
>>>> For instance, storage represents 8 iscsi LUNs, however the LUN No.7
>>>> is not well configured or has something wrong.
>>>> Then messages received:
>>>> kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
>>>> Which will make LUN No.8 unavailable.
>>>> It's confirmed that Windows and Solaris systems will continue the
>>>> scan and make LUN No.1,2,3,4,5,6 and 8 available.
>>>>
>>>> Log snippet is as below:
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan: INQUIRY pass 1 length 36
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send: 0xffff8801e9bd4280
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
>>>> Aug 24 00:32:49 vmhodtest019 kernel: buffer = 0xffff8801f71fc180, bufflen = 36, queuecommand 0xffffffffa00b99e7
>>>> Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done: 0xffff8801e9bd4280 SUCCESS
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key : Not Ready [current]
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense: Logical unit not accessible, target port in unavailable state
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host busy 1 failed 0
>>>> Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes done.
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed with code 0x8000002
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
>>>>
>>>> According to scsi_report_lun_scan(), I found:
>>>> Linux use an inquiry command to probe a lun according to the result
>>>> of report_lun command.
>>>> It assumes every probe cmd will get a legal result. Otherwise, it
>>>> regards the whole peripheral not exist or dead.
>>>> If the return of inquiry passes its legal checking and indicates
>>>> 'LUN not present', it won't break but also continue with the scan
>>>> process.
>>>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>>>> (Logical unit not accessible, target port in unavailable state).
>>>> And this is ignored, so scsi_probe_lun() returns -EIO and the scan
>>>> process is aborted.
>>>>
>>>> I have two questions:
>>>> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
>>>> again, even after presenting this lun in responce to REPORT_LUN
>>>> command?
>>> Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.
>>>
>>>> 2. Since windows and solaris can continue scan, is it reasonable for
>>>> linux to do the same, even for a fault-tolerance purpose?
>>>>
>>> Hmm. Yes, and no.
>>>
>>> _Actually_ this is an issue with the target, as it looks as if it
>>> will return the above sense code while sending an 'INQUIRY' to the
>>> device.
>>> SPC explicitely states that the INQUIRY command should _not_ fail
>>> for unavailable devices.
>>> But yeah, we probably should work around this issues.
>>> Nevertheless, please raise this issue with your array vendor.
>>>
>>> Please try the attached patch.
>>>
>>> Cheers,
>>>
>>> Hannes
>>>
>> In LLDDs that do their own initiator based LUN masking (because the midlayer does not have this
>> functionality to enable hardware virtualization without NPIV, or
> to work around suboptimal LUN
>> masking on the target), they are likely to return -ENXIO from
> slave_alloc(), making scsi_alloc_sdev()
>> return NULL, being converted to SCSI_SCAN_NO_RESPONSE by
> scsi_probe_and_add_lun() and thus going
>> through the same code path above.
>>
> Ah. Hmm. Yes, they would.
>
> However, I personally would question this approach, as SPC states that
>
>> The REPORT LUNS command (see table 284) requests the device
>> server to return the peripheral device logical unit inventory
>> accessible to the I_T nexus.
> So by plain reading this would meant that you either should modify
> 'REPORT LUNS' to not show the masked LUNs,
I have the same question. If you don't want us to use them, why do you
still present them in response to REPORT_LUN?
Since the target reports the LUN in REPORT_LUN, I suppose the target
server holds at least some information about it, so it shouldn't return
an error when I check it. It should give me something indicating that
the LUN does exist, even though nothing more may be done with it at this
time.
Or does 'accessible' not mean accessible right now, but rather that we
have the right to address this LUN in this session, and whether it is
online depends on the results of INQUIRY and TEST_UNIT_READY?

> or set the pqual field to
> '0x10' or '0x11' for those LUNs.
Do you mean 001b?
After reading spc4r36g again, I'm confused about the difference between
pqual=000b and 001b.
It seems 000b doesn't guarantee that a LUN is connected, while 001b
indicates that a LUN is definitely not connected?
Could anyone explain these two questions a bit more clearly?

### snippet from spc4
In response to an INQUIRY command received by an incorrect logical unit,
the SCSI target device shall return the INQUIRY data with the peripheral
qualifier set to the value defined in 6.6.2. The INQUIRY command shall
return CHECK CONDITION status only if the device server is unable to
return the requested INQUIRY data.

Table 175 - PERIPHERAL QUALIFIER field
Qualifier  Description
000b       A peripheral device having the specified peripheral device type
           is connected to this logical unit. *If the device server is
           unable to determine whether or not a peripheral device is
           connected, it also shall use this peripheral qualifier. This
           peripheral qualifier does not mean that the peripheral device
           connected to the logical unit is ready for access.*
001b       A peripheral device having the specified peripheral device type
           is not connected to this logical unit. However, the device
           server is capable of supporting the specified peripheral device
           type on this logical unit. (spc4r36g)
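
To make the field layout concrete, here is a small sketch of how the peripheral qualifier and peripheral device type can be pulled out of byte 0 of the standard INQUIRY data. The function names are made up for illustration, and the description strings are my own paraphrase of the table, not spec wording.

```c
#include <stdint.h>

/* Peripheral qualifier: bits 7..5 of byte 0 of standard INQUIRY data. */
static unsigned int inquiry_pq(uint8_t byte0)
{
	return (byte0 >> 5) & 0x7;
}

/* Peripheral device type: bits 4..0 of the same byte. */
static unsigned int inquiry_pdt(uint8_t byte0)
{
	return byte0 & 0x1f;
}

/* Paraphrased meaning of each qualifier value (illustrative only). */
static const char *pq_describe(unsigned int pq)
{
	switch (pq) {
	case 0:
		return "device connected, or device server cannot tell";
	case 1:
		return "device not connected, but type is supported";
	case 3:
		return "device server cannot support a device on this LU";
	default:
		return "reserved or vendor specific";
	}
}
```

For example, a byte 0 of 0x20 decodes to qualifier 001b with device type 00h, and 0x7f decodes to qualifier 011b with device type 1fh.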
>> E.g. zfcp does return -ENXIO if the particular LUN was not made known to the unit whitelist
>> (via zfcp sysfs attribute unit_add).
>> If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the scsi
>> lun (e.g. on remote port recovery), we see exactly above error message for the first LUN in
>> the response of report lun which is not explicitly attached to zfcp.
>> IIRC, other LLDDs such as bfa also do similar stuff [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>
>> For those cases, I think it makes sense to abort scsi_report_lun_scan().
>> Otherwise we would force the LLDD to return -ENXIO for every single LUN reported by report lun but not
>> explicitly added to the LLDD LUN whitelist; and this would likely *flood kernel messages*.
To Steffen,
That makes it act like scsi_sequential_lun_scan():
* Generally, scan from LUN 1 (LUN 0 is assumed to already have been
* scanned) to some maximum lun until a LUN is found with no device
* attached.
But is there a case where a LUN in the middle is indeed broken while the
ones following it are fine, and which would be worth tolerating?
Does that never happen?
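
A toy model of the two behaviours, only to make the difference countable. This is not the kernel loop; the enum, the function and the flag are all invented for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-ins for the outcome of probing one reported LUN. */
enum probe_result { PROBE_OK, PROBE_NO_RESPONSE };

/*
 * Walk a REPORT LUNS style list and count the LUNs that get attached.
 * stop_on_failure == true models the current 'break' on the first
 * failed probe; false models the proposed 'ignore and keep scanning'.
 */
static size_t scan_luns(const enum probe_result *results, size_t n,
			bool stop_on_failure)
{
	size_t attached = 0;

	for (size_t i = 0; i < n; i++) {
		if (results[i] == PROBE_NO_RESPONSE) {
			if (stop_on_failure)
				break;
			continue;
		}
		attached++;
	}
	return attached;
}
```

With eight reported LUNs of which only the seventh fails, the 'break' variant attaches six and the 'ignore' variant attaches seven, which corresponds to the Linux versus Windows/Solaris behaviour described in the report.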


Vaughan
>> Maybe Vaughan's case needs to be distinguished in a patch.
>>
> Well, as mentioned initially, the real issue is that the target
> aborts an INQUIRY while being in 'Unavailable'. Which, according to
> SPC-3 (or later), is a violation of the spec.
>
> So we _could_ just tell them to go away, but admittedly that's bad
> style. Which means we'll have to implement a workaround; the above
> was just a simple way of implementing it. If that's not working of
> course we'll have to do something else.
>
> Cheers,
>
> Hannes

2013-10-14 15:25:54

by Steffen Maier

[permalink] [raw]
Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/14/2013 03:32 PM, Hannes Reinecke wrote:
> On 10/14/2013 03:18 PM, Hannes Reinecke wrote:
>> On 10/14/2013 02:51 PM, Steffen Maier wrote:
>>> On 10/14/2013 01:13 PM, Hannes Reinecke wrote:
>>>> On 10/13/2013 07:23 PM, Vaughan Cao wrote:
>>>>> [1.] One line summary of the problem:
>>>>> special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
>>>>>
>>>>> [2.] Full description of the problem/report:
>>>>> For instance, storage represents 8 iscsi LUNs, however the LUN No.7
>>>>> is not well configured or has something wrong.
>>>>> Then messages received:
>>>>> kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
>>>>> Which will make LUN No.8 unavailable.
>>>>> It's confirmed that Windows and Solaris systems will continue the
>>>>> scan and make LUN No.1,2,3,4,5,6 and 8 available.
>>>>>
>>>>> Log snippet is as below:
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan: INQUIRY pass 1 length 36
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send: 0xffff8801e9bd4280
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: buffer = 0xffff8801f71fc180, bufflen = 36, queuecommand 0xffffffffa00b99e7
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done: 0xffff8801e9bd4280 SUCCESS
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key : Not Ready [current]
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense: Logical unit not accessible, target port in unavailable state
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host busy 1 failed 0
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes done.
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed with code 0x8000002
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
>>>>>
>>>>> According to scsi_report_lun_scan(), I found:
>>>>> Linux use an inquiry command to probe a lun according to the result
>>>>> of report_lun command.
>>>>> It assumes every probe cmd will get a legal result. Otherwise, it
>>>>> regards the whole peripheral not exist or dead.
>>>>> If the return of inquiry passes its legal checking and indicates
>>>>> 'LUN not present', it won't break but also continue with the scan
>>>>> process.
>>>>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>>>>> (Logical unit not accessible, target port in unavailable state).
>>>>> And this is ignored, so scsi_probe_lun() returns -EIO and the scan
>>>>> process is aborted.
>>>>>
>>>>> I have two questions:
>>>>> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
>>>>> again, even after presenting this lun in responce to REPORT_LUN
>>>>> command?
>>>> Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.
>>>>
>>>>> 2. Since windows and solaris can continue scan, is it reasonable for
>>>>> linux to do the same, even for a fault-tolerance purpose?
>>>>>
>>>> Hmm. Yes, and no.
>>>>
>>>> _Actually_ this is an issue with the target, as it looks as if it
>>>> will return the above sense code while sending an 'INQUIRY' to the
>>>> device.
>>>> SPC explicitely states that the INQUIRY command should _not_ fail
>>>> for unavailable devices.
>>>> But yeah, we probably should work around this issues.
>>>> Nevertheless, please raise this issue with your array vendor.
>>>>
>>>> Please try the attached patch.
>>>>
>>>> Cheers,
>>>>
>>>> Hannes
>>>>
>>>
>>>> From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17 00:00:00 2001
>>>> From: Hannes Reinecke <[email protected]>
>>>> Date: Mon, 14 Oct 2013 13:11:22 +0200
>>>> Subject: [PATCH] scsi_scan: continue report_lun_scan after error
>>>>
>>>> When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
>>>> does _not_ indicate that the entire target is done for.
>>>> So continue scanning for the remaining devices.
>>>>
>>>> Signed-off-by: Hannes Reinecke <[email protected]>
>>>>
>>>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>>>> index 307a811..973a121 100644
>>>> --- a/drivers/scsi/scsi_scan.c
>>>> +++ b/drivers/scsi/scsi_scan.c
>>>> @@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct scsi_target *starget, int bflags,
>>>> lun, NULL, NULL, rescan, NULL);
>>>> if (res == SCSI_SCAN_NO_RESPONSE) {
>>>> /*
>>>> - * Got some results, but now none, abort.
>>>> + * Got some results, but now none, ignore.
>>>> */
>>>> sdev_printk(KERN_ERR, sdev,
>>>> "Unexpected response"
>>>> - " from lun %d while scanning, scan"
>>>> - " aborted\n", lun);
>>>> - break;
>>>> + " from lun %d while scanning,"
>>>> + " ignoring device\n", lun);
>>>> }
>>>> }
>>>> }
>>>
>>> In LLDDs that do their own initiator based LUN masking (because the midlayer does not have this
>>> functionality to enable hardware virtualization without NPIV, or
>> to work around suboptimal LUN
>>> masking on the target), they are likely to return -ENXIO from
>> slave_alloc(), making scsi_alloc_sdev()
>>> return NULL, being converted to SCSI_SCAN_NO_RESPONSE by
>> scsi_probe_and_add_lun() and thus going
>>> through the same code path above.
>>>
>> Ah. Hmm. Yes, they would.
>>
>> However, I personally would question this approach, as SPC states that
>>
>>> The REPORT LUNS command (see table 284) requests the device
>>> server to return the peripheral device logical unit inventory
>>> accessible to the I_T nexus.
>>
>> So by plain reading this would meant that you either should modify
>> 'REPORT LUNS' to not show the masked LUNs, or set the pqual field to
>> '0x10' or '0x11' for those LUNs.

We need to distinguish two cases:
1) suboptimal lun masking on the target
2) hardware virtualization without NPIV

Regarding 1, one could require fixing lun masking on the target.
However, some users cannot or do not want to configure it at a very fine granularity.
That's why s390 also does deferred device probing ("set online" in
sysfs) or even limits bus sensing (cio_ignore).

Regarding 2, fixing lun masking on the target does not help because
without NPIV, the target cannot distinguish the different virtual
initiators since they are all behind one shared WWPN (and N-Port_ID).
This forces zfcp to implement initiator based lun masking, because only
the user can tell which lun to attach to which of the virtual initiators
sharing the same physical port. Without that, Linux would attach all
luns to all virtual initiators, i.e. share inadvertently.
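
As a rough sketch of what initiator based LUN masking boils down to (this is not zfcp code; the struct and function names are hypothetical, and the whitelist stands in for whatever the user populated, e.g. via unit_add):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port whitelist of LUNs the user explicitly attached. */
struct lun_whitelist {
	const uint64_t *luns;
	size_t count;
};

/*
 * Hypothetical stand-in for an LLDD slave_alloc() hook: returns 0 if the
 * LUN is on the whitelist and may be attached, -ENXIO otherwise.
 */
static int masked_slave_alloc(const struct lun_whitelist *wl, uint64_t lun)
{
	for (size_t i = 0; i < wl->count; i++)
		if (wl->luns[i] == lun)
			return 0;
	return -ENXIO;
}
```

With only LUN 0 on the whitelist, every other LUN from the REPORT LUNS response gets -ENXIO, which is the situation where the midlayer ends up in the SCSI_SCAN_NO_RESPONSE path discussed above.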

>>> E.g. zfcp does return -ENXIO if the particular LUN was not made known to the unit whitelist
>>> (via zfcp sysfs attribute unit_add).
>>> If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the scsi
>>> lun (e.g. on remote port recovery), we see exactly above error
>> message for the first LUN in
>>> the response of report lun which is not explicitly attached to zfcp.
>>> IIRC, other LLDDs such as bfa also do similar stuff [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>>
>>> For those cases, I think it makes sense to abort scsi_report_lun_scan().
>>> Otherwise we would force the LLDD to return -ENXIO for every
>> single LUN reported by report lun but not
>>> explicitly added to the LLDD LUN whitelist; and this would likely
>> *flood kernel messages*.
>>>
>>> Maybe Vaughan's case needs to be distinguished in a patch.
>>>
>> Well, as mentioned initially, the real issue is that the target
>> aborts an INQUIRY while being in 'Unavailable'. Which, according to
>> SPC-3 (or later), is a violation of the spec.
>>
>> So we _could_ just tell them to go away, but admittedly that's bad
>> style. Which means we'll have to implement a workaround; the above
>> was just a simple way of implementing it. If that's not working of
>> course we'll have to do something else.
>>
> What about this patch:
>
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 973a121..01a7d69 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -594,6 +594,19 @@ static int scsi_probe_lun(struct scsi_device
> *sdev, unsigne
> d char *inq_result,
> (sshdr.asc == 0x29)) &&
> (sshdr.ascq == 0))
> continue;
> + /*
> + * Some buggy implementations return
> + * 'target port in unavailable state'
> + * even on INQUIRY.
> + * Set peripheral qualifier 3
> + * for these devices.
> + */
> + if ((sshdr.sense_key == NOT_READY) &&
> + ((sshdr.asc == 0x04) &&
> + (sshdr.ascq == 0x0C))) {

style question: lower case hex digits? 0x0c

Any reason why you put the conjunction of asc and ascq inside its own
brackets instead of having all three (including sense_key) on the same
level of one larger conjunction (as the code above does for UA asc
0x28/0x29 ascq 0x00)? Should be semantically equivalent, shouldn't it? But
then again, ascq always goes with asc, so they form a kind of pair.

> + inq_result[0] = 3 << 5;
> + return 0;
> + }
> }
> } else {
> /*
>
> (watchout, linebreaks mangled and all that).
> Should be working for this particular case without interrupting
> normal workflow, now should it not?

The approach of distinguishing the workaround close to the response of
the inquiry sounds good to me. I suppose it won't break zfcp which is
good. Unfortunately, I don't know what the ramifications of PQ==3 are
(the SPC-4 description sounds good, though), nor enough details about
this common code to tell if e.g. the early return is OK (skipping
setting sdev->scsi_level near the end of scsi_probe_lun()). But then
again, without inquiry reply we cannot get the level from the response.
So I think the early return is OK after all.
I guess we want to get around "if (result) return -EIO;" but also do not
want to execute the parts depending on result==0.

SPC-4 says that for PQ==3 the PDT should be set to 0x1f. Do we need to
fake this here as well? (I assume the target did not fill in a PDT on
its own when replying with sense data.)
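
To put numbers on the PDT question, a small sketch of composing byte 0 from the two fields. The helper is made up for illustration; it only assumes the usual layout of peripheral qualifier in bits 7..5 and peripheral device type in bits 4..0.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build byte 0 of standard INQUIRY data from a
 * peripheral qualifier (3 bits) and a peripheral device type (5 bits).
 */
static uint8_t inquiry_byte0(unsigned int pq, unsigned int pdt)
{
	return (uint8_t)(((pq & 0x7) << 5) | (pdt & 0x1f));
}
```

The patch as posted stores 3 << 5, i.e. inquiry_byte0(3, 0x00) == 0x60; also faking the PDT would mean storing inquiry_byte0(3, 0x1f) == 0x7f instead.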

The clarification on the T10 reflector seems to say that Linux would
then accept LUNs with PQ 3, but the target shall not have put LUs with
PQ 3 into the LU inventory in the first place?
Anyway, I'm not opposed to the workaround.

--
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on System z Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

2013-10-15 03:29:20

by vaughan

[permalink] [raw]
Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/14/2013 07:13 PM, Hannes Reinecke wrote:
> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
> (Logical unit not accessible, target port in unavailable state).
> And this is ignored, so scsi_probe_lun() returns -EIO and the scan
> process is aborted.
>
> I have two questions:
> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
> again, even after presenting this lun in responce to REPORT_LUN
> command?
> Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.
>
>> 2. Since windows and solaris can continue scan, is it reasonable for
>> linux to do the same, even for a fault-tolerance purpose?
>>
> Hmm. Yes, and no.
>
> _Actually_ this is an issue with the target, as it looks as if it
> will return the above sense code while sending an 'INQUIRY' to the
> device.
> SPC explicitely states that the INQUIRY command should _not_ fail
> for unavailable devices.
Hi all,

I found this below in spc4.
>>>
5.15.2.4.4 Target port group asymmetric access states - Standby state
While in the unavailable primary target port asymmetric access state,
the device server shall support those of
the following commands that it supports while in the active/optimized state:
a) INQUIRY (the peripheral qualifier (see 6.6.2) shall be set to 001b);
....
For those commands that are not supported, the device server shall
terminate the command with CHECK
CONDITION status, with the sense key set to NOT READY, and the
additional sense code set to LOGICAL
UNIT NOT ACCESSIBLE, TARGET PORT IN UNAVAILABLE STATE.
<<<
From the above, I suppose the hardware may in fact be behaving in
compliance with SPC. The case could be:
The storage is an ALUA-capable target. The initiator sent REPORT_LUN to
the target, and the target returned all LUNs with pqual=000b.
Then the initiator sent INQUIRY to LUN 7, which is in standby state, with
pqual=000b rather than 001b. So this INQUIRY is regarded as
'not supported' and gets terminated with CHECK_CONDITION, sense key=NOT
READY, asc,ascq=04h,0ch.

Could you confirm if my understanding is right or wrong?

Thanks,
Vaughan
> But yeah, we probably should work around this issues.
> Nevertheless, please raise this issue with your array vendor.
>
> Please try the attached patch.
>
> Cheers,
>
> Hannes

2013-10-15 05:51:36

by Hannes Reinecke

[permalink] [raw]
Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/15/2013 05:32 AM, vaughan wrote:
> On 10/14/2013 07:13 PM, Hannes Reinecke wrote:
>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>> (Logical unit not accessible, target port in unavailable state).
>> And this is ignored, so scsi_probe_lun() returns -EIO and the scan
>> process is aborted.
>>
>> I have two questions:
>> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
>> again, even after presenting this lun in responce to REPORT_LUN
>> command?
>> Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.
>>
>>> 2. Since windows and solaris can continue scan, is it reasonable for
>>> linux to do the same, even for a fault-tolerance purpose?
>>>
>> Hmm. Yes, and no.
>>
>> _Actually_ this is an issue with the target, as it looks as if it
>> will return the above sense code while sending an 'INQUIRY' to the
>> device.
>> SPC explicitely states that the INQUIRY command should _not_ fail
>> for unavailable devices.
> Hi all,
>
> I found this below in spc4.
>>>>
> 5.15.2.4.4 Target port group asymmetric access states - Standby state
> While in the unavailable primary target port asymmetric access state,
> the device server shall support those of
> the following commands that it supports while in the active/optimized state:
> a) INQUIRY (the peripheral qualifier (see 6.6.2) shall be set to 001b);
> ....
> For those commands that are not supported, the device server shall
> terminate the command with CHECK
> CONDITION status, with the sense key set to NOT READY, and the
> additional sense code set to LOGICAL
> UNIT NOT ACCESSIBLE, TARGET PORT IN UNAVAILABLE STATE.
> <<<
> From the above, I suppose the hardware may works very compliant with
> spc. The case could be:
> Storage is a alua supported target. Initiator sent REPORT_LUN to target,
> target return all pqual=000b to it.
> Then Initiator INQUIRY lun 7 which is in standby state where pqual=000b
> not 001b. So this INQUIRY is regarded as
> 'not supported', and get terminated with CHECK_CONDITION, sense key=NOT
> READY, asc,ascq=04h,0ch.
>
> Could you confirm if my understanding is right or wrong?
>
Wrong.

The sentence states that the device server _shall_ support those
commands, where the results should be identical as if the port would
have been in active/optimized state.

So INQUIRY always has to be supported, regardless of which primary
ALUA state the port happens to be in.

(Otherwise we'd be hard-pressed to figure out whether the port is in
'unavailable' ALUA state in the first place, as without the INQUIRY
data we couldn't even _tell_ if ALUA is supported.)
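
To illustrate the dependency: ALUA support is advertised in the TPGS field of the standard INQUIRY data (byte 5, bits 5..4), so without INQUIRY data there is nothing to read it from. A minimal sketch, with a hypothetical function name and a made-up -1 return for "no data":

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: extract the TPGS field from standard INQUIRY data.
 * 0: no ALUA support, 1: implicit only, 2: explicit only, 3: both.
 * Returns -1 if there is no usable INQUIRY data, in which case ALUA
 * support simply cannot be determined.
 */
static int inquiry_tpgs(const uint8_t *inq, size_t len)
{
	if (inq == NULL || len < 6)
		return -1;
	return (inq[5] >> 4) & 0x3;
}
```

A target that fails INQUIRY with CHECK CONDITION leaves the initiator in the -1 case for that LUN.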

So yeah, it really looks like a firmware issue here.

But that notwithstanding, did you get a chance to test my patch?

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2013-10-15 11:46:36

by vaughan

[permalink] [raw]
Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle


On 2013-10-15 13:51, Hannes Reinecke wrote:
> But that notwithstanding, did you get a chance to test my patch?
>
> Cheers,
>
> Hannes
Hi Hannes,

The kernel is patched; I'm waiting for feedback from the lab.

Thanks,
Vaughan

2013-10-16 06:53:15

by Hannes Reinecke

[permalink] [raw]
Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/14/2013 05:24 PM, Steffen Maier wrote:
> On 10/14/2013 03:32 PM, Hannes Reinecke wrote:
>> On 10/14/2013 03:18 PM, Hannes Reinecke wrote:
>>> On 10/14/2013 02:51 PM, Steffen Maier wrote:
>>>> On 10/14/2013 01:13 PM, Hannes Reinecke wrote:
>>>>> On 10/13/2013 07:23 PM, Vaughan Cao wrote:
>>>>>> [1.] One line summary of the problem:
>>>>>> special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
>>>>>>
>>>>>> [2.] Full description of the problem/report:
>>>>>> For instance, storage represents 8 iscsi LUNs, however the LUN
>>>>>> No.7
>>>>>> is not well configured or has something wrong.
>>>>>> Then messages received:
>>>>>> kernel: scsi 5:0:0:0: Unexpected response from lun 7 while
>>>>>> scanning, scan aborted
>>>>>> Which will make LUN No.8 unavailable.
>>>>>> It's confirmed that Windows and Solaris systems will continue the
>>>>>> scan and make LUN No.1,2,3,4,5,6 and 8 available.
>>>>>>
>>>>>> Log snippet is as below:
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan:
>>>>>> INQUIRY pass 1 length 36
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send:
>>>>>> 0xffff8801e9bd4280
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB:
>>>>>> Inquiry: 12 00 00 00 24 00
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: buffer =
>>>>>> 0xffff8801f71fc180, bufflen = 36, queuecommand 0xffffffffa00b99e7
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done:
>>>>>> 0xffff8801e9bd4280 SUCCESS
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result:
>>>>>> hostbyte=DID_OK driverbyte=DRIVER_OK
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB:
>>>>>> Inquiry: 12 00 00 00 24 00
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key :
>>>>>> Not Ready [current]
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense:
>>>>>> Logical unit not accessible, target port in unavailable state
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host
>>>>>> busy 1 failed 0
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes
>>>>>> done.
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed
>>>>>> with code 0x8000002
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected
>>>>>> response from lun 7 while scanning, scan aborted
>>>>>>
>>>>>> According to scsi_report_lun_scan(), I found:
>>>>>> Linux use an inquiry command to probe a lun according to the
>>>>>> result
>>>>>> of report_lun command.
>>>>>> It assumes every probe cmd will get a legal result. Otherwise, it
>>>>>> regards the whole peripheral not exist or dead.
>>>>>> If the return of inquiry passes its legal checking and indicates
>>>>>> 'LUN not present', it won't break but also continue with the scan
>>>>>> process.
>>>>>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>>>>>> (Logical unit not accessible, target port in unavailable state).
>>>>>> And this is ignored, so scsi_probe_lun() returns -EIO and the
>>>>>> scan
>>>>>> process is aborted.
>>>>>>
>>>>>> I have two questions:
>>>>>> 1. Is it correct for hardware to return a sense 04h,0Ch to
>>>>>> inquiry
>>>>>> again, even after presenting this lun in responce to REPORT_LUN
>>>>>> command?
>>>>> Yes, this is correct. 'REPORT LUNS' is supported in
>>>>> 'Unavailable' state.
>>>>>
>>>>>> 2. Since windows and solaris can continue scan, is it
>>>>>> reasonable for
>>>>>> linux to do the same, even for a fault-tolerance purpose?
>>>>>>
>>>>> Hmm. Yes, and no.
>>>>>
>>>>> _Actually_ this is an issue with the target, as it looks as if it
>>>>> will return the above sense code while sending an 'INQUIRY' to the
>>>>> device.
>>>>> SPC explicitely states that the INQUIRY command should _not_ fail
>>>>> for unavailable devices.
>>>>> But yeah, we probably should work around this issues.
>>>>> Nevertheless, please raise this issue with your array vendor.
>>>>>
>>>>> Please try the attached patch.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hannes
>>>>>
>>>>
>>>>> From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17 00:00:00 2001
>>>>> From: Hannes Reinecke <[email protected]>
>>>>> Date: Mon, 14 Oct 2013 13:11:22 +0200
>>>>> Subject: [PATCH] scsi_scan: continue report_lun_scan after error
>>>>>
>>>>> When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
>>>>> does _not_ indicate that the entire target is done for.
>>>>> So continue scanning for the remaining devices.
>>>>>
>>>>> Signed-off-by: Hannes Reinecke <[email protected]>
>>>>>
>>>>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>>>>> index 307a811..973a121 100644
>>>>> --- a/drivers/scsi/scsi_scan.c
>>>>> +++ b/drivers/scsi/scsi_scan.c
>>>>> @@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct scsi_target *starget, int bflags,
>>>>> lun, NULL, NULL, rescan, NULL);
>>>>> if (res == SCSI_SCAN_NO_RESPONSE) {
>>>>> /*
>>>>> - * Got some results, but now none, abort.
>>>>> + * Got some results, but now none, ignore.
>>>>> */
>>>>> sdev_printk(KERN_ERR, sdev,
>>>>> "Unexpected response"
>>>>> - " from lun %d while scanning, scan"
>>>>> - " aborted\n", lun);
>>>>> - break;
>>>>> + " from lun %d while scanning,"
>>>>> + " ignoring device\n", lun);
>>>>> }
>>>>> }
>>>>> }
>>>>
>>>> In LLDDs that do their own initiator based LUN masking (because
>>>> the midlayer does not have this functionality to enable hardware
>>>> virtualization without NPIV, or to work around suboptimal LUN
>>>> masking on the target), they are likely to return -ENXIO from
>>>> slave_alloc(), making scsi_alloc_sdev() return NULL, being
>>>> converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun()
>>>> and thus going through the same code path above.
>>>>
>>> Ah. Hmm. Yes, they would.
>>>
>>> However, I personally would question this approach, as SPC states
>>> that
>>>
>>>> The REPORT LUNS command (see table 284) requests the device
>>>> server to return the peripheral device logical unit inventory
>>>> accessible to the I_T nexus.
>>>
>>> So by plain reading this would mean that you either should modify
>>> 'REPORT LUNS' to not show the masked LUNs, or set the pqual field to
>>> '0x10' or '0x11' for those LUNs.
>
> We need to distinguish two cases:
> 1) suboptimal lun masking on the target
> 2) hardware virtualization without NPIV
>
> Regarding 1, one could require fixing lun masking on the target.
> However, some users cannot or do not want to do it very fine
> granular. That's why s390 also does deferred device probing ("set
> online" in sysfs) or even limits bus sensing (cio_ignore).
>
> Regarding 2, fixing lun masking on the target does not help because
> without NPIV, the target cannot distinguish the different virtual
> initiators since they are all behind one shared WWPN (and N-Port_ID).
> This forces zfcp to implement initiator based lun masking, because
> only the user can tell which lun to attach to which of the virtual
> initiators sharing the same physical port. Without that, Linux would
> attach all luns to all virtual initiators, i.e. share inadvertently.
>
>>>> E.g. zfcp does return -ENXIO if the particular LUN was not made
>>>> known to the unit whitelist (via zfcp sysfs attribute unit_add).
>>>> If we attach LUN 0 (via unit_add) and trigger a target scan with
>>>> SCAN_WILD_CARD for the scsi lun (e.g. on remote port recovery),
>>>> we see exactly above error message for the first LUN in the
>>>> response of report lun which is not explicitly attached to zfcp.
>>>> IIRC, other LLDDs such as bfa also do similar stuff
>>>> [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>>>
>>>> For those cases, I think it makes sense to abort
>>>> scsi_report_lun_scan().
>>>> Otherwise we would force the LLDD to return -ENXIO for every
>>>> single LUN reported by report lun but not explicitly added to
>>>> the LLDD LUN whitelist; and this would likely *flood kernel
>>>> messages*.
>>>>
>>>> Maybe Vaughan's case needs to be distinguished in a patch.
>>>>
>>> Well, as mentioned initially, the real issue is that the target
>>> aborts an INQUIRY while being in 'Unavailable'. Which, according to
>>> SPC-3 (or later), is a violation of the spec.
>>>
>>> So we _could_ just tell them to go away, but admittedly that's bad
>>> style. Which means we'll have to implement a workaround; the above
>>> was just a simple way of implementing it. If that's not working of
>>> course we'll have to do something else.
>>>
>> What about this patch:
>>
>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>> index 973a121..01a7d69 100644
>> --- a/drivers/scsi/scsi_scan.c
>> +++ b/drivers/scsi/scsi_scan.c
>> @@ -594,6 +594,19 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
>> (sshdr.asc == 0x29)) &&
>> (sshdr.ascq == 0))
>> continue;
>> + /*
>> + * Some buggy implementations return
>> + * 'target port in unavailable state'
>> + * even on INQUIRY.
>> + * Set peripheral qualifier 3
>> + * for these devices.
>> + */
>> + if ((sshdr.sense_key == NOT_READY) &&
>> + ((sshdr.asc == 0x04) &&
>> + (sshdr.ascq == 0x0C))) {
>
> style question: lower case hex digits? 0x0c
>
Yeah. This is a test, after all ...

> Any reason why you put the conjunction of asc and ascq inside its
> own brackets instead of having all three (including sense_key) on
> the same level of one larger conjunction (as the code above does for
> UA asc 0x28/0x29 ascq 0x00)? Should be semantically equivalent,
> isn't it? But then again, ascq always goes with asc, so they form a
> kind of pair.
>
No reason, just a copy&paste error from the above statement ...

>> + inq_result[0] = 3 << 5;
>> + return 0;
>> + }
>> }
>> } else {
>> /*
>>
>> (watchout, linebreaks mangled and all that).
>> Should be working for this particular case without interrupting
>> normal workflow, now should it not?
>
> The approach of distinguishing the workaround close to the response
> of the inquiry sounds good to me. I suppose it won't break zfcp
> which is good. Unfortunately, I don't know what the ramifications of
> PQ==3 are (the SPC-4 description sounds good, though), nor enough
> details about this common code to tell if e.g. the early return is
> OK (skipping setting sdev->scsi_level near the end of
> scsi_probe_lun()). But then again, without inquiry reply we cannot
> get the level from the response. So I think the early return is OK
> after all.
> I guess we want to get around "if (result) return -EIO;" but also do
> not want to execute the parts depending on result==0.
>
> SPC-4 says that for PQ==3 the PDT should be set to 0x1f. Do we need
> to fake this here as well? (I assume the target did not fill in a
> PDT on its own when replying with sense data.)
>
> The clarification on the T10 reflector seems to say that Linux would
> then accept LUNs with PQ 3, but the target shall not have put LUs
> with PQ 3 into the LU inventory in the first place?
> Anyway, I'm not opposed to the workaround.
>
Well, first and foremost this is a workaround for buggy array
firmware. Even if a port is in the 'unavailable' state, the target
port is still required to respond to an INQUIRY.
_Not_ doing so leaves us with no indication of what's going on here.

The main reason why I chose PQ=3 here is that we'll end up ignoring
this device in scsi_probe_and_add_lun() later on, which saves me
coding higher up the stack.
And, seeing that the device is never actually allocated, the
modifications we made to the inquiry data will be deleted anyway.

So using PQ=3 here is just a vehicle for telling the system to not
create a SCSI device at this LUN, _not_ something which has some
relevance to SPC.
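
For readers following the PQ=3 discussion, here is a minimal userspace sketch of how byte 0 of the standard INQUIRY data splits into peripheral qualifier and peripheral device type. The helper names are hypothetical and this is not the kernel code; it only illustrates why `3 << 5` marks the LUN as "do not create a device", and why that value leaves the PDT at 0 rather than the 0x1f that Steffen mentions from SPC-4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Peripheral qualifier: the top three bits of standard INQUIRY byte 0. */
static inline unsigned int inq_pqual(uint8_t byte0)
{
	return (byte0 >> 5) & 0x7;
}

/* Peripheral device type: the low five bits of the same byte. */
static inline unsigned int inq_pdt(uint8_t byte0)
{
	return byte0 & 0x1f;
}

/*
 * Hypothetical helper (name is an assumption): a scan that treats
 * PQ 3 as "no device at this LUN" would skip attaching it.
 */
static bool lun_should_attach(uint8_t byte0)
{
	return inq_pqual(byte0) != 3;
}
```

With `inq_result[0] = 3 << 5` the byte is 0x60, so the qualifier is 3 and the PDT is 0. Faking the PDT as well would mean writing 0x7f instead.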

But seeing that this approach raises quite a few issues, I've attached
a different patch.
Vaughan, could you test with that, too? Should be functionally
equivalent to the previous one.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Attachments:
scsi_scan-continue-scan-for-LUNs-in.patch (2.11 kB)

2013-10-16 07:23:36

by vaughan

Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/16/2013 02:52 PM, Hannes Reinecke wrote:
> But seeing that this approach raises quite some issues I've attached a
> different patch. Vaughan, could you test with that, too? Should be
> functionally equivalent to the previous one. Cheers, Hannes
Of course. This one expresses our intention more clearly than setting
PQ 3 to break out.

Vaughan

2013-10-21 06:05:09

by vaughan

Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/16/2013 02:52 PM, Hannes Reinecke wrote:
> But seeing that this approach raises quite some issues I've attached a
> different patch. Vaughan, could you test with that, too? Should be
> functionally equivalent to the previous one. Cheers, Hannes
Hi Hannes,

We only tested the later patch, which returns _TARGET_PRESENT after
parsing the sense data; it works as expected.
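
The attached patch itself is not reproduced in this thread, so the following is only a rough userspace sketch of the idea described above: look at the sense data of a failed INQUIRY and, for NOT READY with asc/ascq 04h/0Ch, report "target present" instead of "no response". The enum, struct and function names are assumptions made for illustration, not the kernel's definitions.

```c
#include <stdint.h>

/* SCSI sense key NOT READY. */
#define SENSE_KEY_NOT_READY 0x02

/* Hypothetical stand-ins for the scan result codes discussed above. */
enum scan_result {
	SCAN_NO_RESPONSE,
	SCAN_TARGET_PRESENT,
	SCAN_LUN_PRESENT,
};

/* Hypothetical reduced sense header: only the fields used here. */
struct sense_hdr {
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
};

/*
 * Classify a failed INQUIRY. 04h/0Ch is "logical unit not accessible,
 * target port in unavailable state": the target answered, so only this
 * LUN is skipped. Anything else keeps the old "no response" meaning.
 */
static enum scan_result classify_failed_inquiry(const struct sense_hdr *s)
{
	if (s->sense_key == SENSE_KEY_NOT_READY &&
	    s->asc == 0x04 && s->ascq == 0x0c)
		return SCAN_TARGET_PRESENT;
	return SCAN_NO_RESPONSE;
}
```

Keeping every other failure as "no response" is what should leave the zfcp case (slave_alloc() returning -ENXIO) on the existing abort path.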

About the cause of this issue: the admin said he is configuring an
active-active cluster mode storage. Each node has its own LUN pool and a
set of rules to control which node can access the pool.
LUN7 is owned by the other node and can only be manipulated by it, but
can be seen by this node due to a misconfiguration. So it presents itself
in REPORT_LUN but returns NOT_READY when accessed through this node.
Do you still regard this as misbehavior in response to INQUIRY?

Thanks,
Vaughan

2013-10-22 15:03:49

by Hannes Reinecke

Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

On 10/21/2013 08:07 AM, vaughan wrote:
> On 10/16/2013 02:52 PM, Hannes Reinecke wrote:
>> But seeing that this approach raises quite some issues I've attached a
>> different patch. Vaughan, could you test with that, too? Should be
>> functionally equivalent to the previous one. Cheers, Hannes
> Hi Hannes,
>
> We only tested the later patch, which returns _TARGET_PRESENT after
> parsing the sense data; it works as expected.
>
> About the cause of this issue: the admin said he is configuring an
> active-active cluster mode storage. Each node has its own LUN pool and a
> set of rules to control which node can access the pool.
> LUN7 is owned by the other node and can only be manipulated by it, but
> can be seen by this node due to a misconfiguration. So it presents itself
> in REPORT_LUN but returns NOT_READY when accessed through this node.
> Do you still regard this as misbehavior in response to INQUIRY?
>
Yes. INQUIRY _has_ to succeed. The only exceptions here would be devices
in 'Offline' state.
But other than that, yes, INQUIRY should never abort with an error,
especially for ALUA.
ALUA relies on 'report target port groups' and INQUIRY EVPD page 0x83 to
identify the target port group state.
So if INQUIRY does _not_ work you cannot figure out the ALUA state,
and by rights you would need to disable ALUA there.
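
To make the two INQUIRY flavours concrete, here is a small sketch that builds the 6-byte CDB for a standard INQUIRY and for an EVPD page request such as 0x83. The function and macro names are made up for this example; the byte layout follows the CDB seen in the log (12 00 00 00 24 00).

```c
#include <stdint.h>

#define INQUIRY_OPCODE  0x12
#define INQUIRY_CDB_LEN 6

/*
 * Fill a 6-byte INQUIRY CDB.
 *   evpd      - non-zero to request a vital product data page
 *   page      - VPD page code (only used when evpd is set)
 *   alloc_len - allocation length, stored big-endian in bytes 3-4
 */
static void build_inquiry_cdb(uint8_t cdb[INQUIRY_CDB_LEN], int evpd,
			      uint8_t page, uint16_t alloc_len)
{
	cdb[0] = INQUIRY_OPCODE;
	cdb[1] = evpd ? 0x01 : 0x00;	/* bit 0 is the EVPD flag */
	cdb[2] = evpd ? page : 0x00;	/* page code must be 0 without EVPD */
	cdb[3] = (uint8_t)(alloc_len >> 8);
	cdb[4] = (uint8_t)(alloc_len & 0xff);
	cdb[5] = 0x00;			/* control byte */
}
```

Calling it with `evpd = 0` and `alloc_len = 36` gives the scan's first-pass INQUIRY from the log; `evpd = 1`, `page = 0x83` gives the device identification request that ALUA depends on.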


Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-02-19 08:28:50

by vaughan

Subject: Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle

Hi Hannes,

Sorry to bother you.
Months ago, you made a patch to fix this scsi_scan abort error found on
zfssa storage. Though it only affects one specific storage, the logic (not
aborting the scsi scan process because of an inquiry failure of a LU in
the middle) is helpful as a way to make our scanning more resilient. I'd
prefer to keep our device scanning behavior in sync with other OSes like
Solaris and Windows.
Will you merge your patch into mainline? You can find the patch here:
http://www.mail-archive.com/[email protected]/msg521753.html
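
As a toy illustration of the behavior difference (a userspace model only, with hypothetical names, not the midlayer code), the loop below walks a reported LUN list and either aborts on the first failed probe or skips that LUN and carries on:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe callback: returns true if the LUN answered INQUIRY. */
typedef bool (*probe_fn)(unsigned int lun);

/*
 * Walk the reported LUNs and count how many get attached.
 * With abort_on_failure set, the first failing LUN ends the scan;
 * otherwise that LUN is skipped and the scan continues.
 */
static size_t scan_luns(const unsigned int *luns, size_t n,
			probe_fn probe, bool abort_on_failure)
{
	size_t attached = 0;

	for (size_t i = 0; i < n; i++) {
		if (probe(luns[i])) {
			attached++;
			continue;
		}
		if (abort_on_failure)
			break;
	}
	return attached;
}

/* Model of the reported setup: every LUN answers except LUN 7. */
static bool probe_all_but_lun7(unsigned int lun)
{
	return lun != 7;
}
```

For LUNs 1 to 8, the aborting variant attaches six devices and never reaches LUN 8, while the continuing variant attaches seven (1 to 6 and 8), which matches what was observed on Windows and Solaris.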

Regards,
Vaughan
