Hi Andrew!
Congratulations! The kernel from scsi-rc-fixes git and your patch are
working.
By the way, could you please tell me how to get only the SCSI patches
from the git repository? I got the whole kernel by using
cg-clone http://kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git
Now the process looks like the following:
Mar 11 23:54:22 multipath kernel: qla2xxx 0000:03:01.0: LOOP DOWN detected (2).
Mar 11 23:54:28 multipath kernel: rport-4:0-0: blocked FC remote port time out:
removing target and saving binding
Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LIP reset occured (f7f7).
Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LOOP UP detected (2 Gbps).
Mar 11 23:54:59 multipath kernel: 4:0:0:0: timing out command, waited 22s
And the disks appear.
Could you please tell me where this 22 sec timeout comes from?
Again, congratulations on the good work!
Thanks much,
Maxim.
On Sun, 2006-03-12 at 00:10 +0300, Maxim Kozover wrote:
> Hi Andrew!
> Congratulations! The kernel from scsi-rc-fixes git and your patch are
> working.
> By the way, could you, please, tell me how I get only scsi patches
> from the git repository, cause I got the whole kernel by using
> cg-clone http://kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git
>
> Now the process looks like following:
> Mar 11 23:54:22 multipath kernel: qla2xxx 0000:03:01.0: LOOP DOWN detected (2).
> Mar 11 23:54:28 multipath kernel: rport-4:0-0: blocked FC remote port time out:
> removing target and saving binding
> Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LIP reset occured (f7f7).
> Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LOOP UP detected (2 Gbps).
> Mar 11 23:54:59 multipath kernel: 4:0:0:0: timing out command, waited 22s
>
> And the disks appear.
> Could you tell me, please, where this 22sec timeout came from?
looks like your fiber fabric decided to renegotiate, and halfway it went
for a coffee and donuts break to not upset the union rules :)
I've seen LOOP negotiations take 10+ seconds before, and that is on a
really simple setup.... so nothing super special
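For reference, the intervals in the quoted log can be worked out directly from the syslog timestamps. Below is a minimal sketch in C, assuming both timestamps fall on the same day; the helper names `hms_to_seconds` and `elapsed_seconds` are hypothetical and not part of any kernel or driver API.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: convert an "HH:MM:SS" syslog time to seconds
 * since midnight. Returns -1 if the string does not parse. */
int hms_to_seconds(const char *hms)
{
    int h, m, s;

    if (sscanf(hms, "%d:%d:%d", &h, &m, &s) != 3)
        return -1;
    if (h < 0 || h > 23 || m < 0 || m > 59 || s < 0 || s > 59)
        return -1;
    return h * 3600 + m * 60 + s;
}

/* Hypothetical helper: store the seconds elapsed between two same-day
 * timestamps in *out. Returns 0 on success, -1 on a parse error. */
int elapsed_seconds(const char *from, const char *to, int *out)
{
    int a = hms_to_seconds(from);
    int b = hms_to_seconds(to);

    assert(out != NULL);
    if (a < 0 || b < 0)
        return -1;
    *out = b - a;
    return 0;
}
```

Calling `elapsed_seconds("23:54:37", "23:54:59", &gap)` gives a gap of 22, and `elapsed_seconds("23:54:22", "23:54:37", &gap)` gives 15. So the "waited 22s" figure matches the time between LOOP UP and the timeout message, and the loop itself was down for about 15 seconds. This only shows that the timestamps are consistent with each other; it does not identify which timer in the stack expired.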
OK, Arjan, thanks!
Maxim.
AvdV> looks like your fiber fabric decided to renegotiate, and halfway it went
AvdV> for a coffee and donuts break to not upset the union rules :)
AvdV> I've seen LOOP negotiations take 10+ seconds before, and that is on a
AvdV> really simple setup.... so nothing super special
On Sun, 12 Mar 2006, Maxim Kozover wrote:
> Congratulations! The kernel from scsi-rc-fixes git and your patch are
> working.
Actually Mike R. and James S. deserve the credit for the composite
patch which consists of:
1) [PATCH] FC transport : Avoid device offline cases by stalling aborts until device unblocked
http://marc.theaimsgroup.com/?l=linux-scsi&m=114225658724378&w=2
2) Serialize scan work during fc_remote_port_delete() so rport removal
doesn't deadlock midlayer scans. The problem you were seeing. (Mike
R.)
3) rport race fixes during removal (James S.).
> By the way, could you, please, tell me how I get only scsi patches
> from the git repository, cause I got the whole kernel by using
> cg-clone http://kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git
>
> Now the process looks like following:
> Mar 11 23:54:22 multipath kernel: qla2xxx 0000:03:01.0: LOOP DOWN detected (2).
> Mar 11 23:54:28 multipath kernel: rport-4:0-0: blocked FC remote port time out:
> removing target and saving binding
> Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LIP reset occured (f7f7).
> Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LOOP UP detected (2 Gbps).
> Mar 11 23:54:59 multipath kernel: 4:0:0:0: timing out command, waited 22s
>
> And the disks appear.
> Could you tell me, please, where this 22sec timeout came from?
Essentially there are currently several issues with rport consumers
making delete() calls during mid-layer scanning.
I'm hoping at a minimum we can get Mike R's fixes into 2.6.16, and
address the additional races going forward... James/Mike?
Here's a minimal serialize-scan-work patch; could you check to see
that this addresses your issue? Start with any recent linux-2.6.git
tree.
Thanks,
Andrew
---
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 929032e..3d09920 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -1649,6 +1649,8 @@ fc_remote_port_delete(struct fc_rport *
 		return;
 	}
 
+	/* flush any scan work */ /* which can sleep */
+	scsi_flush_work(rport_to_shost(rport));
 	scsi_target_block(&rport->dev);
 
 	/* cap the length the devices can be blocked until they are deleted */
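
The ordering the patch introduces (drain queued scan work first, then block the target) can be illustrated with a small userspace toy model. This is only a sketch: every name here (`toy_port`, `toy_workqueue`, `toy_scan`, `toy_port_delete`) is hypothetical, nothing in it is kernel code, and treating "a scan that runs against a blocked target" as stalled is an assumption made for illustration rather than a description of the real midlayer locking.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 8

/* Toy stand-in for a remote port and its target state. */
struct toy_port {
    int blocked;         /* 1 once the target has been blocked */
    int scans_completed; /* scans that ran while the target was unblocked */
    int scans_stalled;   /* scans that found the target already blocked */
};

typedef void (*work_fn)(struct toy_port *);

/* Toy single-threaded work queue: a fixed array of pending callbacks. */
struct toy_workqueue {
    work_fn items[MAX_WORK];
    size_t count;
};

/* Queue one work item; returns -1 if the queue is full. */
int toy_queue_work(struct toy_workqueue *wq, work_fn fn)
{
    if (wq->count >= MAX_WORK)
        return -1;
    wq->items[wq->count++] = fn;
    return 0;
}

/* Run every pending item in order, then empty the queue. */
void toy_flush_work(struct toy_workqueue *wq, struct toy_port *port)
{
    assert(wq != NULL && port != NULL);
    for (size_t i = 0; i < wq->count; i++)
        wq->items[i](port);
    wq->count = 0;
}

/* Toy scan: counts as stalled if the target is already blocked. */
void toy_scan(struct toy_port *port)
{
    if (port->blocked)
        port->scans_stalled++;
    else
        port->scans_completed++;
}

/* Toy delete path. With flush_first set, pending scans are drained
 * before the target is blocked, mirroring the order in the patch.
 * The trailing flush models leftover work running later. */
void toy_port_delete(struct toy_workqueue *wq, struct toy_port *port,
                     int flush_first)
{
    if (flush_first)
        toy_flush_work(wq, port);
    port->blocked = 1;
    toy_flush_work(wq, port);
}
```

In this model, queueing one `toy_scan` and calling `toy_port_delete()` with `flush_first` set leaves the scan counted as completed, while calling it with `flush_first` cleared leaves the same scan counted as stalled.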
Hi Andrew!
Unfortunately, I see that the scan-work patch is not included in
2.6.16, and the usual lock appears:
#001: [ffff8100708a8080] {scsi_host_alloc}
.. held by: scsi_wq_4: 3912 [ffff810071edf870, 110]
... acquired at: scsi_scan_target+0x51/0x87 [scsi_mod]
Applying the patch you sent solves the problem, i.e. the disks appear again
after a 22 sec timeout (why?).
Thanks,
Maxim.
Tuesday, March 14, 2006, 2:19:03 AM, you wrote:
AV> diff --git a/drivers/scsi/scsi_transport_fc.c
AV> b/drivers/scsi/scsi_transport_fc.c
AV> index 929032e..3d09920 100644
AV> --- a/drivers/scsi/scsi_transport_fc.c
AV> +++ b/drivers/scsi/scsi_transport_fc.c
AV> @@ -1649,6 +1649,8 @@ fc_remote_port_delete(struct fc_rport *
AV> return;
AV> }
AV>
AV> + /* flush any scan work */ /* which can sleep */
AV> + scsi_flush_work(rport_to_shost(rport));
AV> scsi_target_block(&rport->dev);
AV>
AV> /* cap the length the devices can be blocked until they are deleted */