2006-03-11 22:09:33

by Maxim Kozover

Subject: Re[8]: problems with scsi_transport_fc and qla2xxx

Hi Andrew!
Congratulations! The kernel from scsi-rc-fixes git and your patch are
working.
By the way, could you please tell me how to get only the SCSI patches
from the git repository? I got the whole kernel by using
cg-clone http://kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git

Now the process looks like the following:
Mar 11 23:54:22 multipath kernel: qla2xxx 0000:03:01.0: LOOP DOWN detected (2).
Mar 11 23:54:28 multipath kernel: rport-4:0-0: blocked FC remote port time out: removing target and saving binding
Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LIP reset occured (f7f7).
Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LOOP UP detected (2 Gbps).
Mar 11 23:54:59 multipath kernel: 4:0:0:0: timing out command, waited 22s

And the disks appear.
Could you please tell me where this 22-second timeout comes from?

Again, congratulations on the good work!

Thanks much,

Maxim.


2006-03-12 09:28:14

by Arjan van de Ven

Subject: Re: Re[8]: problems with scsi_transport_fc and qla2xxx

On Sun, 2006-03-12 at 00:10 +0300, Maxim Kozover wrote:
> Hi Andrew!
> Congratulations! The kernel from scsi-rc-fixes git and your patch are
> working.
> By the way, could you please tell me how to get only the SCSI patches
> from the git repository? I got the whole kernel by using
> cg-clone http://kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git
>
> Now the process looks like the following:
> Mar 11 23:54:22 multipath kernel: qla2xxx 0000:03:01.0: LOOP DOWN detected (2).
> Mar 11 23:54:28 multipath kernel: rport-4:0-0: blocked FC remote port time out: removing target and saving binding
> Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LIP reset occured (f7f7).
> Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LOOP UP detected (2 Gbps).
> Mar 11 23:54:59 multipath kernel: 4:0:0:0: timing out command, waited 22s
>
> And the disks appear.
> Could you please tell me where this 22-second timeout comes from?

looks like your fiber fabric decided to renegotiate, and halfway it went
for a coffee and donuts break to not upset the union rules :)

I've seen LOOP negotiations take 10+ seconds before, and that is on a
really simple setup.... so nothing super special
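
As a quick check on the timings in the quoted log, the gaps between the
log lines can be computed directly. The following is a minimal sketch,
not part of the original thread: the helper names `hms_to_seconds` and
`elapsed_seconds` are hypothetical, and the code assumes both timestamps
fall on the same day. It only restates the arithmetic in the log and
does not identify which timer produced the delay.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Convert an "HH:MM:SS" field, as found in the syslog lines above,
 * to seconds since midnight. Returns -1 if the field is malformed. */
int hms_to_seconds(const char *hms)
{
	int h, m, s;
	char trailing;

	/* Exactly three numbers and no trailing characters are accepted. */
	if (sscanf(hms, "%d:%d:%d%c", &h, &m, &s, &trailing) != 3)
		return -1;
	if (h < 0 || h > 23 || m < 0 || m > 59 || s < 0 || s > 59)
		return -1;
	return h * 3600 + m * 60 + s;
}

/* Store the number of seconds from `from` to `to` in *gap.
 * Assumption: both timestamps belong to the same day.
 * Returns 0 on success, -1 if a field is malformed or `to` precedes `from`. */
int elapsed_seconds(const char *from, const char *to, int *gap)
{
	int a, b;

	assert(gap != NULL);
	a = hms_to_seconds(from);
	b = hms_to_seconds(to);
	if (a < 0 || b < 0 || b < a)
		return -1;
	*gap = b - a;
	return 0;
}
```

Applied to the log, the rport timeout (23:54:28) comes 6 seconds after
LOOP DOWN (23:54:22), LOOP UP (23:54:37) comes 15 seconds after LOOP
DOWN, and the "timing out command" line (23:54:59) comes 22 seconds
after LOOP UP, which matches the "waited 22s" in that message.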

2006-03-12 12:46:26

by Maxim Kozover

Subject: Re: Re: Re[8]: problems with scsi_transport_fc and qla2xxx

OK, Arjan, thanks!

Maxim.

AvdV> looks like your fiber fabric decided to renegotiate, and halfway it went
AvdV> for a coffee and donuts break to not upset the union rules :)

AvdV> I've seen LOOP negotiations take 10+ seconds before, and that is on a
AvdV> really simple setup.... so nothing super special


2006-03-13 23:19:12

by Andrew Vasquez

Subject: Re: Re[8]: problems with scsi_transport_fc and qla2xxx

On Sun, 12 Mar 2006, Maxim Kozover wrote:

> Congratulations! The kernel from scsi-rc-fixes git and your patch are
> working.

Actually Mike R. and James S. deserve the credit for the composite
patch which consists of:

1) [PATCH] FC transport : Avoid device offline cases by stalling aborts until device unblocked
http://marc.theaimsgroup.com/?l=linux-scsi&m=114225658724378&w=2

2) Serialize scan work during fc_remote_port_delete() so rport removal
doesn't deadlock midlayer scans. This is the problem you were seeing.
(Mike R.)

3) rport race fixes during removal (James S.).

> By the way, could you please tell me how to get only the SCSI patches
> from the git repository? I got the whole kernel by using
> cg-clone http://kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6.git
>
> Now the process looks like the following:
> Mar 11 23:54:22 multipath kernel: qla2xxx 0000:03:01.0: LOOP DOWN detected (2).
> Mar 11 23:54:28 multipath kernel: rport-4:0-0: blocked FC remote port time out: removing target and saving binding
> Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LIP reset occured (f7f7).
> Mar 11 23:54:37 multipath kernel: qla2xxx 0000:03:01.0: LOOP UP detected (2 Gbps).
> Mar 11 23:54:59 multipath kernel: 4:0:0:0: timing out command, waited 22s
>
> And the disks appear.
> Could you please tell me where this 22-second timeout comes from?

Essentially, there are currently several issues with rport consumers
making delete() calls during mid-layer scanning.

I'm hoping at a minimum we can get Mike R's fixes into 2.6.16, and
address the additional races going forward... James/Mike?

Here's a minimal version of the serialize-scan-work patch. Could you
check that it addresses your issue? Start with any recent linux-2.6.git
tree.

Thanks,
Andrew

---

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 929032e..3d09920 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -1649,6 +1649,8 @@ fc_remote_port_delete(struct fc_rport *
 		return;
 	}

+	/* flush any scan work */ /* which can sleep */
+	scsi_flush_work(rport_to_shost(rport));
 	scsi_target_block(&rport->dev);

 	/* cap the length the devices can be blocked until they are deleted */
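
The ordering the patch enforces (finish queued scan work first, then
block the target) can be illustrated with a small userspace model. This
is only a sketch under loud assumptions, not kernel code: `toy_host`,
`toy_queue_work`, `toy_scan`, `toy_flush_work`, and `toy_rport_delete`
are hypothetical names invented for illustration. The model has no
threads or locks, so it does not reproduce the actual deadlock; it only
shows how the order of flush and block decides whether a queued scan
runs against a usable target or a blocked one.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WORK 8

struct toy_host;
typedef void (*toy_work_fn)(struct toy_host *host);

/* Hypothetical stand-in for a host with a queue of pending scan work. */
struct toy_host {
	toy_work_fn pending[TOY_MAX_WORK];	/* queued work items, FIFO */
	size_t n_pending;
	int target_blocked;	/* 1 once the target has been blocked */
	int scans_done;		/* scans that ran while the target was usable */
	int scans_stuck;	/* scans that ran against a blocked target */
};

/* Queue a work item; returns 0 on success, -1 if the queue is full. */
int toy_queue_work(struct toy_host *host, toy_work_fn fn)
{
	assert(host != NULL && fn != NULL);
	if (host->n_pending >= TOY_MAX_WORK)
		return -1;
	host->pending[host->n_pending++] = fn;
	return 0;
}

/* A scan counts as done only if the target is not blocked when it runs. */
void toy_scan(struct toy_host *host)
{
	if (host->target_blocked)
		host->scans_stuck++;
	else
		host->scans_done++;
}

/* Run every queued item to completion, in order (stand-in for a flush). */
void toy_flush_work(struct toy_host *host)
{
	for (size_t i = 0; i < host->n_pending; i++)
		host->pending[i](host);
	host->n_pending = 0;
}

/* Delete path. With flush_first set, queued scans complete before the
 * target is blocked (the ordering the patch introduces). Without it,
 * the target is blocked first and leftover scans run afterwards. */
void toy_rport_delete(struct toy_host *host, int flush_first)
{
	if (flush_first)
		toy_flush_work(host);
	host->target_blocked = 1;
	toy_flush_work(host);	/* leftover work now sees a blocked target */
}
```

In this model, queuing one `toy_scan` and calling
`toy_rport_delete(&host, 1)` leaves `scans_done == 1` and
`scans_stuck == 0`, while the same sequence with `flush_first` set to 0
leaves the scan counted as stuck.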

2006-03-20 11:46:01

by Maxim Kozover

Subject: Re: Re: Re[8]: problems with scsi_transport_fc and qla2xxx

Hi Andrew!
Unfortunately, I see that the scan-work patch is not included in
2.6.16, and the usual lock appears:
#001: [ffff8100708a8080] {scsi_host_alloc}
.. held by: scsi_wq_4: 3912 [ffff810071edf870, 110]
... acquired at: scsi_scan_target+0x51/0x87 [scsi_mod]

Applying the patch you sent solves the problem, i.e. the disks appear
again after a 22-second timeout (why?).

Thanks,

Maxim.

Tuesday, March 14, 2006, 2:19:03 AM, you wrote:

AV> diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
AV> index 929032e..3d09920 100644
AV> --- a/drivers/scsi/scsi_transport_fc.c
AV> +++ b/drivers/scsi/scsi_transport_fc.c
AV> @@ -1649,6 +1649,8 @@ fc_remote_port_delete(struct fc_rport *
AV>  		return;
AV>  	}
AV>
AV> +	/* flush any scan work */ /* which can sleep */
AV> +	scsi_flush_work(rport_to_shost(rport));
AV>  	scsi_target_block(&rport->dev);
AV>
AV>  	/* cap the length the devices can be blocked until they are deleted */