2023-11-16 03:55:57

by Xingui Yang

Subject: [PATCH v3] scsi: libsas: Fix set zero-address when device-type != NO_DEVICE

Firstly, when ex_phy is added to the parent port, ex_phy->port is not set.
As a result, sas_port_delete_phy() won't be called in
sas_unregister_devs_sas_addr(), and although ex_phy's sas_address is zero,
it is not deleted from the parent port's phy_list.

Secondly, phy->attached_sas_addr will be set to a zero-address when
phy->linkrate < SAS_LINK_RATE_1_5_GBPS and device-type != NO_DEVICE during
device registration, such as stp. It will create a new port and all other
ex_phys whose addresses are zero will be added to the new port in
sas_ex_get_linkrate(), and it may trigger BUG() as follows:

[562240.051046] sas: phy19 part of wide port with phy16
[562240.051197] sas: ex 500e004aaaaaaa1f phy19:U:0 attached: 0000000000000000 (no device)
[562240.051203] sas: done REVALIDATING DOMAIN on port 0, pid:435909, res 0x0

[562240.062536] sas: ex 500e004aaaaaaa1f phy0 new device attached
[562240.062616] sas: ex 500e004aaaaaaa1f phy00:U:5 attached: 0000000000000000 (stp)
[562240.062680] port-7:7:0: trying to add phy phy-7:7:19 fails: it's already part of another port
[562240.085064] ------------[ cut here ]------------
[562240.096612] kernel BUG at drivers/scsi/scsi_transport_sas.c:1083!
[562240.109611] Internal error: Oops - BUG: 0 [#1] SMP
[562240.343518] Process kworker/u256:3 (pid: 435909, stack limit = 0x0000000003bcbebf)
[562240.421714] Workqueue: 0000:b4:02.0_disco_q sas_revalidate_domain [libsas]
[562240.437173] pstate: 40c00009 (nZcv daif +PAN +UAO)
[562240.450478] pc : sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
[562240.465283] lr : sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
[562240.479751] sp : ffff0000300cfa70
[562240.674822] Call trace:
[562240.682709] sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
[562240.694013] sas_ex_get_linkrate.isra.5+0xcc/0x128 [libsas]
[562240.704957] sas_ex_discover_end_dev+0xfc/0x538 [libsas]
[562240.715508] sas_ex_discover_dev+0x3cc/0x4b8 [libsas]
[562240.725634] sas_ex_discover_devices+0x9c/0x1a8 [libsas]
[562240.735855] sas_ex_revalidate_domain+0x2f0/0x450 [libsas]
[562240.746123] sas_revalidate_domain+0x158/0x160 [libsas]
[562240.756014] process_one_work+0x1b4/0x448
[562240.764548] worker_thread+0x54/0x468
[562240.772562] kthread+0x134/0x138
[562240.779989] ret_from_fork+0x10/0x18

We've done the following to solve this problem:
Firstly, set ex_phy->port when ex_phy is added to the parent port. And set
ex_dev->parent_port to NULL when the number of PHYs of the parent port
becomes 0.

Secondly, don't set a zero-address for phy->attached_sas_addr when
phy->attached_dev_type != NO_DEVICE.

Fixes: 7d1d86518118 ("[SCSI] libsas: fix false positive 'device attached' conditions")
Signed-off-by: Xingui Yang <[email protected]>
---
v2 -> v3:
1. Set ex_dev->parent_port to NULL when the number of PHYs of the parent
port becomes 0
2. Update the comments

v1 -> v2:
1. Set ex_phy->port with parent_port when ex_phy is added to the parent port
2. Set ex_phy to NULL when freeing the expander
3. Update the comments
---
drivers/scsi/libsas/sas_discover.c | 4 +++-
drivers/scsi/libsas/sas_expander.c | 8 +++++---
drivers/scsi/libsas/sas_internal.h | 1 +
3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index 8fb7c41c0962..8eb3888a9e57 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -296,8 +296,10 @@ void sas_free_device(struct kref *kref)
 	dev->phy = NULL;

 	/* remove the phys and ports, everything else should be gone */
-	if (dev_is_expander(dev->dev_type))
+	if (dev_is_expander(dev->dev_type)) {
 		kfree(dev->ex_dev.ex_phy);
+		dev->ex_dev.ex_phy = NULL;
+	}

 	if (dev_is_sata(dev) && dev->sata_dev.ap) {
 		ata_sas_tport_delete(dev->sata_dev.ap);
diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
index a2204674b680..89d44a9dc4e3 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -239,8 +239,7 @@ static void sas_set_ex_phy(struct domain_device *dev, int phy_id,
 	/* help some expanders that fail to zero sas_address in the 'no
 	 * device' case
 	 */
-	if (phy->attached_dev_type == SAS_PHY_UNUSED ||
-	    phy->linkrate < SAS_LINK_RATE_1_5_GBPS)
+	if (phy->attached_dev_type == SAS_PHY_UNUSED)
 		memset(phy->attached_sas_addr, 0, SAS_ADDR_SIZE);
 	else
 		memcpy(phy->attached_sas_addr, dr->attached_sas_addr, SAS_ADDR_SIZE);
@@ -1844,9 +1843,12 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
 	if (phy->port) {
 		sas_port_delete_phy(phy->port, phy->phy);
 		sas_device_set_phy(found, phy->port);
-		if (phy->port->num_phys == 0)
+		if (phy->port->num_phys == 0) {
 			list_add_tail(&phy->port->del_list,
 				&parent->port->sas_port_del_list);
+			if (ex_dev->parent_port == phy->port)
+				ex_dev->parent_port = NULL;
+		}
 		phy->port = NULL;
 	}
 }
diff --git a/drivers/scsi/libsas/sas_internal.h b/drivers/scsi/libsas/sas_internal.h
index 3804aef165ad..e860d5b19880 100644
--- a/drivers/scsi/libsas/sas_internal.h
+++ b/drivers/scsi/libsas/sas_internal.h
@@ -202,6 +202,7 @@ static inline void sas_add_parent_port(struct domain_device *dev, int phy_id)
 		sas_port_mark_backlink(ex->parent_port);
 	}
 	sas_port_add_phy(ex->parent_port, ex_phy->phy);
+	ex_phy->port = ex->parent_port;
 }

 static inline struct domain_device *sas_alloc_device(void)
--
2.17.1


2023-11-16 10:14:14

by John Garry

Subject: Re: [PATCH v3] scsi: libsas: Fix set zero-address when device-type != NO_DEVICE

On 16/11/2023 03:52, Xingui Yang wrote:

I think that patch title can be improved, but I would need to know more
about the problem before suggesting an improvement.

> Firstly, when ex_phy is added to the parent port, ex_phy->port is not set.

That seems correct, but why mention this now?

> As a result, sas_port_delete_phy() won't be called in
> sas_unregister_devs_sas_addr(), and although ex_phy's sas_address is zero,
> it is not deleted from the parent port's phy_list.

I am not sure why you mention this now either. You seem to be describing
how the problem occurs without actually mentioning what the problem is.

>
> Secondly, phy->attached_sas_addr will be set to a zero-address when
> phy->linkrate < SAS_LINK_RATE_1_5_GBPS and device-type != NO_DEVICE during
> device registration, such as stp. It will create a new port and all other
> ex_phys whose addresses are zero will be added to the new port in
> sas_ex_get_linkrate(), and it may trigger BUG() as follows:

I think that it would be better to first mention this crash, i.e. the
problem, how you recreate it, and then describe how and why it happens,
and then tell us how you will fix it.

>
> [562240.051046] sas: phy19 part of wide port with phy16
> [562240.051197] sas: ex 500e004aaaaaaa1f phy19:U:0 attached: 0000000000000000 (no device)
> [562240.051203] sas: done REVALIDATING DOMAIN on port 0, pid:435909, res 0x0
>
> [562240.062536] sas: ex 500e004aaaaaaa1f phy0 new device attached
> [562240.062616] sas: ex 500e004aaaaaaa1f phy00:U:5 attached: 0000000000000000 (stp)
> [562240.062680] port-7:7:0: trying to add phy phy-7:7:19 fails: it's already part of another port
> [562240.085064] ------------[ cut here ]------------
> [562240.096612] kernel BUG at drivers/scsi/scsi_transport_sas.c:1083!
> [562240.109611] Internal error: Oops - BUG: 0 [#1] SMP
> [562240.343518] Process kworker/u256:3 (pid: 435909, stack limit = 0x0000000003bcbebf)
> [562240.421714] Workqueue: 0000:b4:02.0_disco_q sas_revalidate_domain [libsas]
> [562240.437173] pstate: 40c00009 (nZcv daif +PAN +UAO)
> [562240.450478] pc : sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
> [562240.465283] lr : sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
> [562240.479751] sp : ffff0000300cfa70
> [562240.674822] Call trace:
> [562240.682709] sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
> [562240.694013] sas_ex_get_linkrate.isra.5+0xcc/0x128 [libsas]
> [562240.704957] sas_ex_discover_end_dev+0xfc/0x538 [libsas]
> [562240.715508] sas_ex_discover_dev+0x3cc/0x4b8 [libsas]
> [562240.725634] sas_ex_discover_devices+0x9c/0x1a8 [libsas]
> [562240.735855] sas_ex_revalidate_domain+0x2f0/0x450 [libsas]
> [562240.746123] sas_revalidate_domain+0x158/0x160 [libsas]
> [562240.756014] process_one_work+0x1b4/0x448
> [562240.764548] worker_thread+0x54/0x468
> [562240.772562] kthread+0x134/0x138
> [562240.779989] ret_from_fork+0x10/0x18
>
> We've done the following to solve this problem:

I'd use "Fix the problem as follows:"

> Firstly, set ex_phy->port when ex_phy is added to the parent port. And set
> ex_dev->parent_port to NULL when the number of PHYs of the parent port
> becomes 0.

Thanks,
John

2023-11-16 13:46:42

by Xingui Yang

Subject: Re: [PATCH v3] scsi: libsas: Fix set zero-address when device-type != NO_DEVICE

Hi, John

Thanks for your reply.

On 2023/11/16 18:13, John Garry wrote:
> On 16/11/2023 03:52, Xingui Yang wrote:
>
> I think that patch title can be improved, but I would need to know more
> about the problem before suggesting an improvement.
How about "Fix port add phy failed" ?
>
>> Firstly, when ex_phy is added to the parent port, ex_phy->port is not
>> set.
>
> That seems correct, but why mention this now?
>
>> As a result, sas_port_delete_phy() won't be called in
>> sas_unregister_devs_sas_addr(), and although ex_phy's sas_address is
>> zero,
>> it is not deleted from the parent port's phy_list.
>
> I am not sure why you mention this now either. You seem to be describing
> how the problem occurs without actually mentioning what the problem is.
>
>>
>> Secondly, phy->attached_sas_addr will be set to a zero-address when
>> phy->linkrate < SAS_LINK_RATE_1_5_GBPS and device-type != NO_DEVICE
>> during
>> device registration, such as stp. It will create a new port and all other
>> ex_phys whose addresses are zero will be added to the new port in
>> sas_ex_get_linkrate(), and it may trigger BUG() as follows:
>
> I think that it would be better to first mention this crash, i.e. the
> problem, how you recreate it, and then describe how and why it happens,
> and then tell us how you will fix it
How about the following:

The following sequence triggers a BUG(). A new port, port-7:7:0, created
for a new zero-address SATA device, tries to add phy-7:7:19, which has
the same zero address, but phy-7:7:19 is already part of another port.

[562240.051046] sas: phy19 part of wide port with phy16
[562240.051197] sas: ex 500e004aaaaaaa1f phy19:U:0 attached: 0000000000000000 (no device)
[562240.051203] sas: done REVALIDATING DOMAIN on port 0, pid:435909, res 0x0
[562240.062536] sas: ex 500e004aaaaaaa1f phy0 new device attached
[562240.062616] sas: ex 500e004aaaaaaa1f phy00:U:5 attached: 0000000000000000 (stp)
[562240.062680] port-7:7:0: trying to add phy phy-7:7:19 fails: it's already part of another port
[562240.085064] ------------[ cut here ]------------
[562240.096612] kernel BUG at drivers/scsi/scsi_transport_sas.c:1083!
[562240.109611] Internal error: Oops - BUG: 0 [#1] SMP
[562240.343518] Process kworker/u256:3 (pid: 435909, stack limit = 0x0000000003bcbebf)
[562240.421714] Workqueue: 0000:b4:02.0_disco_q sas_revalidate_domain [libsas]
[562240.437173] pstate: 40c00009 (nZcv daif +PAN +UAO)
[562240.450478] pc : sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
[562240.465283] lr : sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
[562240.479751] sp : ffff0000300cfa70
[562240.674822] Call trace:
[562240.682709] sas_port_add_phy+0x13c/0x168 [scsi_transport_sas]
[562240.694013] sas_ex_get_linkrate.isra.5+0xcc/0x128 [libsas]
[562240.704957] sas_ex_discover_end_dev+0xfc/0x538 [libsas]
[562240.715508] sas_ex_discover_dev+0x3cc/0x4b8 [libsas]
[562240.725634] sas_ex_discover_devices+0x9c/0x1a8 [libsas]
[562240.735855] sas_ex_revalidate_domain+0x2f0/0x450 [libsas]
[562240.746123] sas_revalidate_domain+0x158/0x160 [libsas]
[562240.756014] process_one_work+0x1b4/0x448
[562240.764548] worker_thread+0x54/0x468
[562240.772562] kthread+0x134/0x138
[562240.779989] ret_from_fork+0x10/0x18

We found that phy-7:7:19's port is not set when it is added to the parent
port, so it is not deleted from the parent port's phy_list when
sas_unregister_devs_sas_addr() is called. In addition, the link rate of
the newly attached SATA device is 5, which is below the 1.5 Gbps rate
(SAS_LINK_RATE_1_5_GBPS), so the SATA device's sas_address was set to a
zero address.
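
The address-selection part of this can be illustrated with a small standalone sketch. It is not the libsas code: the type and function names (model_phy, pick_addr_old, pick_addr_new, is_zero_addr) are hypothetical, and the numeric constants are assumptions chosen only so that a link rate of 5 sorts below the 1.5 Gbps threshold, as in the log above.

```c
#include <stdint.h>
#include <string.h>

#define ADDR_SIZE 8

/* Hypothetical stand-ins for the libsas constants; the values are assumed. */
enum { DEV_UNUSED = 0, DEV_STP = 1 };
enum { RATE_RESET_IN_PROGRESS = 5, RATE_1_5_GBPS = 8 };

struct model_phy {
	int attached_dev_type;
	int linkrate;
	uint8_t attached_sas_addr[ADDR_SIZE];
};

/* Old rule: zero the address for unused phys OR for low link rates. */
static void pick_addr_old(struct model_phy *phy, const uint8_t *reported)
{
	if (phy->attached_dev_type == DEV_UNUSED ||
	    phy->linkrate < RATE_1_5_GBPS)
		memset(phy->attached_sas_addr, 0, ADDR_SIZE);
	else
		memcpy(phy->attached_sas_addr, reported, ADDR_SIZE);
}

/* New rule: zero the address only when no device is attached. */
static void pick_addr_new(struct model_phy *phy, const uint8_t *reported)
{
	if (phy->attached_dev_type == DEV_UNUSED)
		memset(phy->attached_sas_addr, 0, ADDR_SIZE);
	else
		memcpy(phy->attached_sas_addr, reported, ADDR_SIZE);
}

/* Returns 1 when every byte of the address is zero. */
static int is_zero_addr(const uint8_t *addr)
{
	static const uint8_t zero[ADDR_SIZE];

	return memcmp(addr, zero, ADDR_SIZE) == 0;
}
```

For an attached STP device that reports link rate 5, pick_addr_old() leaves the address zeroed, while pick_addr_new() keeps the reported address, so the device no longer shares the zero address with unused phys.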

Fix the problem as follows:
Firstly, set ex_phy->port when ex_phy is added to the parent port. And
set ex_dev->parent_port to NULL when the number of PHYs of the parent
port becomes 0.

Secondly, don't set a zero-address for phy->attached_sas_addr when
phy->attached_dev_type != NO_DEVICE.
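
The first point can be sketched with a toy model of the port bookkeeping. This is only an illustration under stated assumptions, not the kernel implementation: all names (model_port, model_ex_phy, model_port_add_phy, model_add_parent_port, model_unregister) are hypothetical, and where the real sas_port_add_phy() hits BUG() the model simply returns -1.

```c
#include <stddef.h>

struct model_port {
	int num_phys;
};

struct model_ex_phy {
	struct model_port *port;	/* libsas-side back-pointer (ex_phy->port) */
	struct model_port *member;	/* port the transport layer lists this phy under */
};

/* Refuses a phy that already belongs to a different port. */
static int model_port_add_phy(struct model_port *port, struct model_ex_phy *phy)
{
	if (phy->member == port)
		return 0;
	if (phy->member)
		return -1;	/* the kernel would BUG() here */
	phy->member = port;
	port->num_phys++;
	return 0;
}

/* Adds a phy to the parent port; set_backptr selects the behaviour
 * before (0) and after (1) the patch.
 */
static int model_add_parent_port(struct model_port *parent,
				 struct model_ex_phy *phy, int set_backptr)
{
	int ret = model_port_add_phy(parent, phy);

	if (ret == 0 && set_backptr)
		phy->port = parent;
	return ret;
}

/* Unregister path: the phy is only removed when the back-pointer is set. */
static void model_unregister(struct model_ex_phy *phy)
{
	if (!phy->port)
		return;
	phy->port->num_phys--;
	phy->member = NULL;
	phy->port = NULL;
}
```

Without the back-pointer, model_unregister() is a no-op and the phy stays a member of the parent port, so a later model_port_add_phy() on a new port fails. With the back-pointer set, the phy is released first and the later add succeeds.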

Thanks,
Xingui


2023-11-16 16:55:33

by John Garry

Subject: Re: [PATCH v3] scsi: libsas: Fix set zero-address when device-type != NO_DEVICE

On 16/11/2023 13:45, yangxingui wrote:
>> I think that patch title can be improved, but I would need to know
>> more about the problem before suggesting an improvement.
> How about "Fix port add phy failed" ?
>>
>>> Firstly, when ex_phy is added to the parent port, ex_phy->port is not
>>> set.
>>
>> That seems correct, but why mention this now?
>>
>>> As a result, sas_port_delete_phy() won't be called in
>>> sas_unregister_devs_sas_addr(), and although ex_phy's sas_address is
>>> zero,
>>> it is not deleted from the parent port's phy_list.
>>
>> I am not sure why you mention this now either. You seem to be
>> describing how the problem occurs without actually mentioning what the
>> problem is.
>>
>>>
>>> Secondly, phy->attached_sas_addr will be set to a zero-address when
>>> phy->linkrate < SAS_LINK_RATE_1_5_GBPS and device-type != NO_DEVICE
>>> during
>>> device registration, such as stp. It will create a new port and all
>>> other
>>> ex_phys whose addresses are zero will be added to the new port in
>>> sas_ex_get_linkrate(), and it may trigger BUG() as follows:
>>
>> I think that it would be better to first mention this crash, i.e. the
>> problem, how you recreate it, and then describe how and why it
>> happens, and then tell us how you will fix it
> How about the following:
>
> The following sequence triggers a BUG(). A new port, port-7:7:0, created
> for a new zero-address SATA device, tries to add phy-7:7:19, which has
> the same zero address, but phy-7:7:19 is already part of another port.

I would like to know how to recreate, which gives a lot more context and
helps me understand what the problem is.

Thanks,
John

2023-11-17 09:05:31

by Xingui Yang

Subject: Re: [PATCH v3] scsi: libsas: Fix set zero-address when device-type != NO_DEVICE

Hi John,

On 2023/11/17 0:54, John Garry wrote:
> On 16/11/2023 13:45, yangxingui wrote:
>>> I think that patch title can be improved, but I would need to know
>>> more about the problem before suggesting an improvement.
>> How about "Fix port add phy failed" ?
>>>
>>>> Firstly, when ex_phy is added to the parent port, ex_phy->port is
>>>> not set.
>>>
>>> That seems correct, but why mention this now?
>>>
>>>> As a result, sas_port_delete_phy() won't be called in
>>>> sas_unregister_devs_sas_addr(), and although ex_phy's sas_address is
>>>> zero,
>>>> it is not deleted from the parent port's phy_list.
>>>
>>> I am not sure why you mention this now either. You seem to be
>>> describing how the problem occurs without actually mentioning what
>>> the problem is.
>>>
>>>>
>>>> Secondly, phy->attached_sas_addr will be set to a zero-address when
>>>> phy->linkrate < SAS_LINK_RATE_1_5_GBPS and device-type != NO_DEVICE
>>>> during
>>>> device registration, such as stp. It will create a new port and all
>>>> other
>>>> ex_phys whose addresses are zero will be added to the new port in
>>>> sas_ex_get_linkrate(), and it may trigger BUG() as follows:
>>>
>>> I think that it would be better to first mention this crash, i.e. the
>>> problem, how you recreate it, and then describe how and why it
>>> happens, and then tell us how you will fix it
>> How about the following:
>>
>> The following sequence triggers a BUG(). A new port, port-7:7:0, created
>> for a new zero-address SATA device, tries to add phy-7:7:19, which has
>> the same zero address, but phy-7:7:19 is already part of another port.
>
> I would like to know how to recreate, which gives a lot more context and
> helps me understand what the problem is.
I have posted a new version based on your suggestion.

Thanks,
Xingui