2019-05-21 20:04:41

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: VPD access Blocked by commit 0d5370d1d85251e5893ab7c90a429464de2e140b

[+cc Myron, Quinn, linux-pci, linux-kernel]

From: Himanshu Madhani <[email protected]>
Date: Fri, May 17, 2019 at 5:21 PM
To: [email protected], [email protected]
Cc: Andrew Vasquez, Girish Basrur, Giridhar Malavali

> Hi Ethan,
>
> Our OEM partners reported to us that VPD access with latest distros were returning I/O error for them. They indicated this to be issue only with newer kernels.
>
> One of the distro vendor pointed out patch posted by you to be reason for IO error trying to VPD. The patch looks like blocks access to VPD by blacklisting ISP.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0d5370d1d85251e5893ab7c90a429464de2e140b
>
> I setup PCIe analyzer to reproduce this in our lab to root cause it and discovered that after reverting the patch. I am able to get VPD data okay with upstream 5.1.0 and I used RHEL8.
>
> I also used "lspci" and "cat" to dump out VPD data and do not see any issue.
>
> # lspci -vvv -s 03:00.0
> 03:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
> Subsystem: QLogic Corp. QLE2742 Dual Port 32Gb Fibre Channel to PCIe Adapter
> Physical Slot: 15
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 67
> NUMA node: 0
> Region 0: Memory at fbe05000 (64-bit, prefetchable) [size=4K]
> Region 2: Memory at fbe02000 (64-bit, prefetchable) [size=8K]
> Region 4: Memory at fbd00000 (64-bit, prefetchable) [size=1M]
> Expansion ROM at fb540000 [disabled] [size=256K]
> Capabilities: [44] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [4c] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <4us, L1 <1us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> MaxPayload 256 bytes, MaxReadReq 4096 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <2us
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> AtomicOpsCtl: ReqEn-
> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> Capabilities: [88] Vital Product Data
> Product Name: QLogic 32Gb 2-port FC to PCIe Gen3 x8 Adapter
> Read-only fields:
> [PN] Part number: QLE2742
> [SN] Serial number: RFD1706R22611
> [EC] Engineering changes: BK3210408-05 04
> [V9] Vendor specific: 010189
> [RV] Reserved: checksum good, 0 byte(s) reserved
> End
> Capabilities: [90] MSI-X: Enable+ Count=16 Masked-
> Vector table: BAR=2 offset=00000000
> PBA: BAR=2 offset=00001000
> Capabilities: [9c] Vendor Specific Information: Len=0c <?>
> Capabilities: [100 v1] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 00000000 00000000 00000000 00000000
> Capabilities: [154 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 1
> ARICtl: MFVC- ACS-, Function Group: 0
> Capabilities: [1c0 v1] #19
> Capabilities: [1f4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
> Kernel driver in use: qla2xxx
> Kernel modules: qla2xxx
>
> # cat /sys/bus/pci/devices/0000\:03\:00.0/vpd
> RFD1706R22611ECBK3210408-05 04V9010189RV�x
>
> Can you share some more insight into where you encountered issue? I am in process of reverting this patch from upstream kernel but wanted to reach out and find out if you still have setup to provide more context.

0d5370d1d852 ("PCI: Prevent VPD access for QLogic ISP2722") prevented
a panic while reading VPD, so we can't simply revert it.

Since you don't see a panic while reading VPD from that device, it's
possible that a QLogic firmware change fixed the VPD format so Linux
no longer reads the area that caused the problem. Or possibly your
system doesn't handle the config read error the same way Ethan's HP
DL380 does. Unfortunately we don't have an actual PCIe analyzer trace
from Ethan's system, so we don't know exactly what happened on PCIe.

I suggest that you capture the entire VPD area and hexdump it, e.g.,
with "xxd", and look at its structure. pci_vpd_size() parses it and
computes the valid size based on a PCI_VPD_STIN_END tag, and
pci_vpd_read() should not read past that size.

And you *do* have an analyzer trace. If new QLogic firmware fixed the
VPD format, the trace should show that Linux read only the valid part
of VPD, and there should be no errors in the trace. Then it might
just be a question of tweaking the quirk so it allows VPD reads if the
firmware is new enough.

But if the trace does show config reads with errors, then it might be
that your system just tolerates the errors while the DL380 did not.
Then we'd have to figure out exactly what the error was and how to
deal with it so things work on both your system and Ethan's.

Bjorn


2019-05-21 20:14:02

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: VPD access Blocked by commit 0d5370d1d85251e5893ab7c90a429464de2e140b

[fix linux-pci, remove ethan.zhao (bounces)]

From: Bjorn Helgaas <[email protected]>
Date: Tue, May 21, 2019 at 3:02 PM
To: Himanshu Madhani
Cc: [email protected], Andrew Vasquez, Girish Basrur, Giridhar
Malavali, Myron Stowe, <[email protected]>, Linux Kernel Mailing
List, Quinn Tran

> [+cc Myron, Quinn, linux-pci, linux-kernel]
>
> From: Himanshu Madhani <[email protected]>
> Date: Fri, May 17, 2019 at 5:21 PM
> To: [email protected], [email protected]
> Cc: Andrew Vasquez, Girish Basrur, Giridhar Malavali
>
> > Hi Ethan,
> >
> > Our OEM partners reported to us that VPD access with latest distros were returning I/O error for them. They indicated this to be issue only with newer kernels.
> >
> > One of the distro vendor pointed out patch posted by you to be reason for IO error trying to VPD. The patch looks like blocks access to VPD by blacklisting ISP.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0d5370d1d85251e5893ab7c90a429464de2e140b
> >
> > I setup PCIe analyzer to reproduce this in our lab to root cause it and discovered that after reverting the patch. I am able to get VPD data okay with upstream 5.1.0 and I used RHEL8.
> >
> > I also used "lspci" and "cat" to dump out VPD data and do not see any issue.
> >
> > # lspci -vvv -s 03:00.0
> > 03:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
> > Subsystem: QLogic Corp. QLE2742 Dual Port 32Gb Fibre Channel to PCIe Adapter
> > Physical Slot: 15
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 67
> > NUMA node: 0
> > Region 0: Memory at fbe05000 (64-bit, prefetchable) [size=4K]
> > Region 2: Memory at fbe02000 (64-bit, prefetchable) [size=8K]
> > Region 4: Memory at fbd00000 (64-bit, prefetchable) [size=1M]
> > Expansion ROM at fb540000 [disabled] [size=256K]
> > Capabilities: [44] Power Management version 3
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [4c] Express (v2) Endpoint, MSI 00
> > DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <4us, L1 <1us
> > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
> > DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <2us
> > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
> > AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> > AtomicOpsCtl: ReqEn-
> > LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> > Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> > Compliance De-emphasis: -6dB
> > LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
> > EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> > Capabilities: [88] Vital Product Data
> > Product Name: QLogic 32Gb 2-port FC to PCIe Gen3 x8 Adapter
> > Read-only fields:
> > [PN] Part number: QLE2742
> > [SN] Serial number: RFD1706R22611
> > [EC] Engineering changes: BK3210408-05 04
> > [V9] Vendor specific: 010189
> > [RV] Reserved: checksum good, 0 byte(s) reserved
> > End
> > Capabilities: [90] MSI-X: Enable+ Count=16 Masked-
> > Vector table: BAR=2 offset=00000000
> > PBA: BAR=2 offset=00001000
> > Capabilities: [9c] Vendor Specific Information: Len=0c <?>
> > Capabilities: [100 v1] Advanced Error Reporting
> > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> > MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> > HeaderLog: 00000000 00000000 00000000 00000000
> > Capabilities: [154 v1] Alternative Routing-ID Interpretation (ARI)
> > ARICap: MFVC- ACS-, Next Function: 1
> > ARICtl: MFVC- ACS-, Function Group: 0
> > Capabilities: [1c0 v1] #19
> > Capabilities: [1f4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
> > Kernel driver in use: qla2xxx
> > Kernel modules: qla2xxx
> >
> > # cat /sys/bus/pci/devices/0000\:03\:00.0/vpd
> > RFD1706R22611ECBK3210408-05 04V9010189RV�x
> >
> > Can you share some more insight into where you encountered issue? I am in process of reverting this patch from upstream kernel but wanted to reach out and find out if you still have setup to provide more context.
>
> 0d5370d1d852 ("PCI: Prevent VPD access for QLogic ISP2722") prevented
> a panic while reading VPD, so we can't simply revert it.
>
> Since you don't see a panic while reading VPD from that device, it's
> possible that a QLogic firmware change fixed the VPD format so Linux
> no longer reads the area that caused the problem. Or possibly your
> system doesn't handle the config read error the same way Ethan's HP
> DL380 does. Unfortunately we don't have an actual PCIe analyzer trace
> from Ethan's system, so we don't know exactly what happened on PCIe.
>
> I suggest that you capture the entire VPD area and hexdump it, e.g.,
> with "xxd", and look at its structure. pci_vpd_size() parses it and
> computes the valid size based on a PCI_VPD_STIN_END tag, and
> pci_vpd_read() should not read past that size.
>
> And you *do* have an analyzer trace. If new QLogic firmware fixed the
> VPD format, the trace should show that Linux read only the valid part
> of VPD, and there should be no errors in the trace. Then it might
> just be a question of tweaking the quirk so it allows VPD reads if the
> firmware is new enough.
>
> But if the trace does show config reads with errors, then it might be
> that your system just tolerates the errors while the DL380 did not.
> Then we'd have to figure out exactly what the error was and how to
> deal with it so things work on both your system and Ethan's.
>
> Bjorn

2019-05-30 19:36:02

by Himanshu Madhani

[permalink] [raw]
Subject: Re: [EXT] VPD access Blocked by commit 0d5370d1d85251e5893ab7c90a429464de2e140b

Hi Bjorn,

Thank you for responding to my question.

> On May 21, 2019, at 1:11 PM, Bjorn Helgaas <[email protected]> wrote:
>
> External Email
>
> ----------------------------------------------------------------------
> [fix linux-pci, remove ethan.zhao (bounces)]
>
> From: Bjorn Helgaas <[email protected]>
> Date: Tue, May 21, 2019 at 3:02 PM
> To: Himanshu Madhani
> Cc: [email protected], Andrew Vasquez, Girish Basrur, Giridhar
> Malavali, Myron Stowe, <[email protected]>, Linux Kernel Mailing
> List, Quinn Tran
>
>> [+cc Myron, Quinn, linux-pci, linux-kernel]
>>
>> From: Himanshu Madhani <[email protected]>
>> Date: Fri, May 17, 2019 at 5:21 PM
>> To: [email protected], [email protected]
>> Cc: Andrew Vasquez, Girish Basrur, Giridhar Malavali
>>
>>> Hi Ethan,
>>>
>>> Our OEM partners reported to us that VPD access with latest distros were returning I/O error for them. They indicated this to be issue only with newer kernels.
>>>
>>> One of the distro vendor pointed out patch posted by you to be reason for IO error trying to VPD. The patch looks like blocks access to VPD by blacklisting ISP.
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0d5370d1d85251e5893ab7c90a429464de2e140b
>>>
>>> I setup PCIe analyzer to reproduce this in our lab to root cause it and discovered that after reverting the patch. I am able to get VPD data okay with upstream 5.1.0 and I used RHEL8.
>>>
>>> I also used "lspci" and "cat" to dump out VPD data and do not see any issue.
>>>
>>> # lspci -vvv -s 03:00.0
>>> 03:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
>>> Subsystem: QLogic Corp. QLE2742 Dual Port 32Gb Fibre Channel to PCIe Adapter
>>> Physical Slot: 15
>>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>> Latency: 0, Cache Line Size: 64 bytes
>>> Interrupt: pin A routed to IRQ 67
>>> NUMA node: 0
>>> Region 0: Memory at fbe05000 (64-bit, prefetchable) [size=4K]
>>> Region 2: Memory at fbe02000 (64-bit, prefetchable) [size=8K]
>>> Region 4: Memory at fbd00000 (64-bit, prefetchable) [size=1M]
>>> Expansion ROM at fb540000 [disabled] [size=256K]
>>> Capabilities: [44] Power Management version 3
>>> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>>> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>>> Capabilities: [4c] Express (v2) Endpoint, MSI 00
>>> DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <4us, L1 <1us
>>> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
>>> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>>> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
>>> MaxPayload 256 bytes, MaxReadReq 4096 bytes
>>> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>>> LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <2us
>>> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>> LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>> DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
>>> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>>> AtomicOpsCtl: ReqEn-
>>> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>>> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>>> Compliance De-emphasis: -6dB
>>> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
>>> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
>>> Capabilities: [88] Vital Product Data
>>> Product Name: QLogic 32Gb 2-port FC to PCIe Gen3 x8 Adapter
>>> Read-only fields:
>>> [PN] Part number: QLE2742
>>> [SN] Serial number: RFD1706R22611
>>> [EC] Engineering changes: BK3210408-05 04
>>> [V9] Vendor specific: 010189
>>> [RV] Reserved: checksum good, 0 byte(s) reserved
>>> End
>>> Capabilities: [90] MSI-X: Enable+ Count=16 Masked-
>>> Vector table: BAR=2 offset=00000000
>>> PBA: BAR=2 offset=00001000
>>> Capabilities: [9c] Vendor Specific Information: Len=0c <?>
>>> Capabilities: [100 v1] Advanced Error Reporting
>>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>>> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>>> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>> HeaderLog: 00000000 00000000 00000000 00000000
>>> Capabilities: [154 v1] Alternative Routing-ID Interpretation (ARI)
>>> ARICap: MFVC- ACS-, Next Function: 1
>>> ARICtl: MFVC- ACS-, Function Group: 0
>>> Capabilities: [1c0 v1] #19
>>> Capabilities: [1f4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
>>> Kernel driver in use: qla2xxx
>>> Kernel modules: qla2xxx
>>>
>>> # cat /sys/bus/pci/devices/0000\:03\:00.0/vpd
>>> RFD1706R22611ECBK3210408-05 04V9010189RV�x
>>>
>>> Can you share some more insight into where you encountered issue? I am in process of reverting this patch from upstream kernel but wanted to reach out and find out if you still have setup to provide more context.
>>
>> 0d5370d1d852 ("PCI: Prevent VPD access for QLogic ISP2722") prevented
>> a panic while reading VPD, so we can't simply revert it.
>>
>> Since you don't see a panic while reading VPD from that device, it's
>> possible that a QLogic firmware change fixed the VPD format so Linux
>> no longer reads the area that caused the problem. Or possibly your
>> system doesn't handle the config read error the same way Ethan's HP
>> DL380 does. Unfortunately we don't have an actual PCIe analyzer trace
>> from Ethan's system, so we don't know exactly what happened on PCIe.
>>
>> I suggest that you capture the entire VPD area and hexdump it, e.g.,
>> with "xxd", and look at its structure. pci_vpd_size() parses it and
>> computes the valid size based on a PCI_VPD_STIN_END tag, and
>> pci_vpd_read() should not read past that size.
>>
>> And you *do* have an analyzer trace. If new QLogic firmware fixed the
>> VPD format, the trace should show that Linux read only the valid part
>> of VPD, and there should be no errors in the trace. Then it might
>> just be a question of tweaking the quirk so it allows VPD reads if the
>> firmware is new enough.
>>
>> But if the trace does show config reads with errors, then it might be
>> that your system just tolerates the errors while the DL380 did not.
>> Then we'd have to figure out exactly what the error was and how to
>> deal with it so things work on both your system and Ethan's.
>>
>> Bjorn


I have verified after reverting patch on HP DL380 Gen10 hardware by capturing PCIe trace

Here’s snippet from dmidecode output

System Information:
Manufacturer: HPE
Product Name: ProLiant DL380 Gen10
Family: ProLiant

We are able to successfully read VPD config data using lspci and cat command

# lspci -d1077:
13:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
13:00.1 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
[root@R8 ~]# lspci -vvv -s 13:00.0
13:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
Subsystem: QLogic Corp. QLE2742 Dual Port 32Gb Fibre Channel to PCIe Adapter
Physical Slot: 3
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 28
NUMA node: 0
Region 0: Memory at de605000 (64-bit, prefetchable) [size=4K]
Region 2: Memory at de602000 (64-bit, prefetchable) [size=8K]
Region 4: Memory at de500000 (64-bit, prefetchable) [size=1M]
[virtual] Expansion ROM at de800000 [disabled] [size=256K]
Capabilities: [44] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [4c] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <4us, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <2us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [88] Vital Product Data
Product Name: QLogic 32Gb 2-port FC to PCIe Gen3 x8 Adapter
Read-only fields:
[PN] Part number: QLE2742
[SN] Serial number: AFD1533Y02999
[EC] Engineering changes: BK3210407-05 03
[V9] Vendor specific: 010189
[RV] Reserved: checksum good, 0 byte(s) reserved
End
Capabilities: [90] MSI-X: Enable+ Count=16 Masked-
Vector table: BAR=2 offset=00000000
PBA: BAR=2 offset=00001000
Capabilities: [9c] Vendor Specific Information: Len=0c <?>
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [154 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [1c0 v1] #19
Capabilities: [1f4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
Kernel driver in use: qla2xxx
Kernel modules: qla2xxx

We also verified this same configuration on a SuperMicro X10SRA-F server (which i had sent in earlier email)’
and were able to verify that the VPD read was good and there were no errors on PCIe trace. Given this information, Please consider reverting the patch until we further debug the issue and resolve as it is
affecting general availability of our adapter.

Thanks,
Himanshu

2019-05-30 21:01:28

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [EXT] VPD access Blocked by commit 0d5370d1d85251e5893ab7c90a429464de2e140b

On Thu, May 30, 2019 at 07:33:01PM +0000, Himanshu Madhani wrote:

> We are able to successfully read VPD config data using lspci and cat
> command

Yes, you mentioned that in the very first email. I was hoping you
would include the actual data, e.g., "cat vpd | xxd". That would help
us figure out why you don't see the panic any more. I suspect either:

- new QLogic firmware fixed the structure of the VPD data so Linux
no longer attempts to read past the end of the implemented region,
or,

- we still read past the end of the implemented VPD region, but the
device doesn't report an error or the platform deals with the
error without causing a panic.

> We also verified this same configuration on a SuperMicro X10SRA-F
> server (which i had sent in earlier email)’ and were able to verify
> that the VPD read was good and there were no errors on PCIe trace.

Since you saw no PCIe errors here, this suggests that new firmware has
changed the format of the VPD data.

> Given this information, Please consider reverting the patch until we
> further debug the issue and resolve as it is affecting general
> availability of our adapter.

1) The way Linux works is that you would post a patch that does the
revert you'd like to see done.

2) It's unlikely that a simple revert of 0d5370d1d852 ("PCI: Prevent
VPD access for QLogic ISP2722") is the right answer because that would
make Ethan's machine panic again. It's possible that a QLogic
firmware update would avoid the panic, but we can't simply revert the
patch and force users to do that update.

If a QLogic firmware update indeed fixed the VPD format, I suggest
that you ask the folks responsible for the firmware to identify the
specific version where that was fixed and how the OS can figure that
out.

Then you could make a new quirk specific to this device that allows
VPD reads if the adapter has new enough firmware. If it finds older
firmware, it could even print a message suggesting that users could
update the firmware if they need to read VPD data.

Bjorn

2019-06-03 22:34:47

by Himanshu Madhani

[permalink] [raw]
Subject: Re: [EXT] VPD access Blocked by commit 0d5370d1d85251e5893ab7c90a429464de2e140b

Hi Bjorn,

> On May 30, 2019, at 1:58 PM, Bjorn Helgaas <[email protected]> wrote:
>
> On Thu, May 30, 2019 at 07:33:01PM +0000, Himanshu Madhani wrote:
>
>> We are able to successfully read VPD config data using lspci and cat
>> command
>
> Yes, you mentioned that in the very first email. I was hoping you
> would include the actual data, e.g., "cat vpd | xxd". That would help
> us figure out why you don't see the panic any more. I suspect either:
>

Missed the request for xxd output. I got access back today for the system
and captured it for you

# cat /sys/class/pci_bus/0000\:13/device/0000\:13\:00.0/vpd | xxd
00000000: 822d 0051 4c6f 6769 6320 3332 4762 2032 .-.QLogic 32Gb 2
00000010: 2d70 6f72 7420 4643 2074 6f20 5043 4965 -port FC to PCIe
00000020: 2047 656e 3320 7838 2041 6461 7074 6572 Gen3 x8 Adapter
00000030: 9039 0050 4e07 514c 4532 3734 3253 4e0d .9.PN.QLE2742SN.
00000040: 4146 4431 3533 3359 3032 3939 3945 430f AFD1533Y02999EC.
00000050: 424b 3332 3130 3430 372d 3035 2030 3356 BK3210407-05 03V
00000060: 3906 3031 3031 3839 5256 01a0 78 9.010189RV..x


PCIe trace also confirmed there are no READ errors.
(if you need i can attach .pex file for review)


> - new QLogic firmware fixed the structure of the VPD data so Linux
> no longer attempts to read past the end of the implemented region,
> or,
>
> - we still read past the end of the implemented VPD region, but the
> device doesn't report an error or the platform deals with the
> error without causing a panic.
>
>> We also verified this same configuration on a SuperMicro X10SRA-F
>> server (which i had sent in earlier email)’ and were able to verify
>> that the VPD read was good and there were no errors on PCIe trace.
>
> Since you saw no PCIe errors here, this suggests that new firmware has
> changed the format of the VPD data.
>
>> Given this information, Please consider reverting the patch until we
>> further debug the issue and resolve as it is affecting general
>> availability of our adapter.
>
> 1) The way Linux works is that you would post a patch that does the
> revert you'd like to see done.
>

Correct. I was trying get your buy-in before i send out patch.

> 2) It's unlikely that a simple revert of 0d5370d1d852 ("PCI: Prevent
> VPD access for QLogic ISP2722") is the right answer because that would
> make Ethan's machine panic again. It's possible that a QLogic
> firmware update would avoid the panic, but we can't simply revert the
> patch and force users to do that update.
>

I did reached out to Oracle to help locate original card where Ethan had
issue and i learned that he is no longer with Oracle.

> If a QLogic firmware update indeed fixed the VPD format, I suggest
> that you ask the folks responsible for the firmware to identify the
> specific version where that was fixed and how the OS can figure that
> out.
>

Still waiting on this data.

> Then you could make a new quirk specific to this device that allows
> VPD reads if the adapter has new enough firmware. If it finds older
> firmware, it could even print a message suggesting that users could
> update the firmware if they need to read VPD data.
>
> Bjorn

Since major OEMs are having issues using adapter to extract VPD data, We
would like to get them relief first and then approach this issue with more
detailed fix if needed.

Thanks,
Himanshu

2019-06-03 22:42:00

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [EXT] VPD access Blocked by commit 0d5370d1d85251e5893ab7c90a429464de2e140b

On Mon, Jun 03, 2019 at 09:30:50PM +0000, Himanshu Madhani wrote:
> On May 30, 2019, at 1:58 PM, Bjorn Helgaas <[email protected]> wrote:
>> On Thu, May 30, 2019 at 07:33:01PM +0000, Himanshu Madhani wrote:
>>> We are able to successfully read VPD config data using lspci and cat
>>> command

> Missed the request for xxd output. I got access back today for the system
> and captured it for you
>
> # cat /sys/class/pci_bus/0000\:13/device/0000\:13\:00.0/vpd | xxd
> 00000000: 822d 0051 4c6f 6769 6320 3332 4762 2032 .-.QLogic 32Gb 2
> 00000010: 2d70 6f72 7420 4643 2074 6f20 5043 4965 -port FC to PCIe
> 00000020: 2047 656e 3320 7838 2041 6461 7074 6572 Gen3 x8 Adapter
> 00000030: 9039 0050 4e07 514c 4532 3734 3253 4e0d .9.PN.QLE2742SN.
> 00000040: 4146 4431 3533 3359 3032 3939 3945 430f AFD1533Y02999EC.
> 00000050: 424b 3332 3130 3430 372d 3035 2030 3356 BK3210407-05 03V
> 00000060: 3906 3031 3031 3839 5256 01a0 78 9.010189RV..x
>
> PCIe trace also confirmed there are no READ errors.
> (if you need i can attach .pex file for review)

Thank you! It would be really excellent to have a report at
https://bugzilla.kernel.org with these details (hex VPD dump, .pex
file, QLogic firmware version info) attached. Your patch commit log
could then include the bugzilla URL.

If we had this sort of information for Ethan's original patch, we
would have a good start at making a smarter quirk. But we don't, so I
think all we can assume at this point is that all QLogic firmware
older than your current version is broken and we shouldn't try reading
VPD.

>> If a QLogic firmware update indeed fixed the VPD format, I suggest
>> that you ask the folks responsible for the firmware to identify the
>> specific version where that was fixed and how the OS can figure that
>> out.

> Still waiting on this data.

Don't hold your breath :)

> Since major OEMs are having issues using adapter to extract VPD data, We
> would like to get them relief first and then approach this issue with more
> detailed fix if needed.

I don't think it's a good idea to simply revert 0d5370d1d852. That
would mean any users that have the same QLogic firmware version Ethan
had would start seeing panics.

But I think you could certainly make a quirk that allows VPD access
for the firmware version you have on your card (or newer), leaving the
original "no VPD at all" behavior for older versions.

Bjorn