2019-07-10 10:02:33

by Konstantin Khorenko

Subject: [PATCH v2 2/2] scsi: aacraid: Remove references to Series-9 (only)

The patch removes references to Series 9 adapters following
395e5df79a95 ("scsi: aacraid: Remove reference to Series-9"),
but doesn't touch Series 6 adapters logic.

Leaving Series 6 adapters untouched avoids controller
hangs/resets under high I/O load.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777586
https://bugzilla.redhat.com/show_bug.cgi?id=1724077
https://jira.sw.ru/browse/PSBM-95736

Signed-off-by: Konstantin Khorenko <[email protected]>
---
drivers/scsi/aacraid/aacraid.h | 1 -
drivers/scsi/aacraid/comminit.c | 9 +++------
drivers/scsi/aacraid/commsup.c | 3 +--
drivers/scsi/aacraid/linit.c | 8 +++-----
4 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/aacraid/aacraid.h b/drivers/scsi/aacraid/aacraid.h
index aef47d0e718c..b674fb645523 100644
--- a/drivers/scsi/aacraid/aacraid.h
+++ b/drivers/scsi/aacraid/aacraid.h
@@ -416,7 +416,6 @@ struct aac_ciss_identify_pd {
 #define PMC_DEVICE_S6 0x28b
 #define PMC_DEVICE_S7 0x28c
 #define PMC_DEVICE_S8 0x28d
-#define PMC_DEVICE_S9 0x28f

 #define aac_phys_to_logical(x) ((x)+1)
 #define aac_logical_to_phys(x) ((x)?(x)-1:0)
diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
index edaa2d53e704..c8db6614b712 100644
--- a/drivers/scsi/aacraid/comminit.c
+++ b/drivers/scsi/aacraid/comminit.c
@@ -353,8 +353,7 @@ int aac_send_shutdown(struct aac_dev * dev)
 	if (status != -ERESTARTSYS)
 		aac_fib_free(fibctx);
 	if ((dev->pdev->device == PMC_DEVICE_S7 ||
-	     dev->pdev->device == PMC_DEVICE_S8 ||
-	     dev->pdev->device == PMC_DEVICE_S9) &&
+	     dev->pdev->device == PMC_DEVICE_S8) &&
 	     dev->msi_enabled)
 		aac_set_intx_mode(dev);
 	return status;
@@ -611,8 +610,7 @@ struct aac_dev *aac_init_adapter(struct aac_dev *dev)
 		host->sg_tablesize = status[2] >> 16;
 		dev->sg_tablesize = status[2] & 0xFFFF;
 		if (dev->pdev->device == PMC_DEVICE_S7 ||
-		    dev->pdev->device == PMC_DEVICE_S8 ||
-		    dev->pdev->device == PMC_DEVICE_S9) {
+		    dev->pdev->device == PMC_DEVICE_S8) {
 			if (host->can_queue > (status[3] >> 16) -
 					AAC_NUM_MGT_FIB)
 				host->can_queue = (status[3] >> 16) -
@@ -633,8 +631,7 @@ struct aac_dev *aac_init_adapter(struct aac_dev *dev)

 	if (dev->pdev->device == PMC_DEVICE_S6 ||
 	    dev->pdev->device == PMC_DEVICE_S7 ||
-	    dev->pdev->device == PMC_DEVICE_S8 ||
-	    dev->pdev->device == PMC_DEVICE_S9)
+	    dev->pdev->device == PMC_DEVICE_S8)
 		aac_define_int_mode(dev);
 	/*
 	 * Ok now init the communication subsystem
diff --git a/drivers/scsi/aacraid/commsup.c b/drivers/scsi/aacraid/commsup.c
index b047b1e2215a..705e003caa95 100644
--- a/drivers/scsi/aacraid/commsup.c
+++ b/drivers/scsi/aacraid/commsup.c
@@ -2576,8 +2576,7 @@ void aac_free_irq(struct aac_dev *dev)

 	if (dev->pdev->device == PMC_DEVICE_S6 ||
 	    dev->pdev->device == PMC_DEVICE_S7 ||
-	    dev->pdev->device == PMC_DEVICE_S8 ||
-	    dev->pdev->device == PMC_DEVICE_S9) {
+	    dev->pdev->device == PMC_DEVICE_S8) {
 		if (dev->max_msix > 1) {
 			for (i = 0; i < dev->max_msix; i++)
 				free_irq(pci_irq_vector(dev->pdev, i),
diff --git a/drivers/scsi/aacraid/linit.c b/drivers/scsi/aacraid/linit.c
index f669a4405217..d5082b191aa8 100644
--- a/drivers/scsi/aacraid/linit.c
+++ b/drivers/scsi/aacraid/linit.c
@@ -1561,8 +1561,7 @@ static void __aac_shutdown(struct aac_dev * aac)
 	aac_adapter_disable_int(aac);
 	if (aac->pdev->device == PMC_DEVICE_S6 ||
 	    aac->pdev->device == PMC_DEVICE_S7 ||
-	    aac->pdev->device == PMC_DEVICE_S8 ||
-	    aac->pdev->device == PMC_DEVICE_S9) {
+	    aac->pdev->device == PMC_DEVICE_S8) {
 		if (aac->max_msix > 1) {
 			for (i = 0; i < aac->max_msix; i++) {
 				free_irq(pci_irq_vector(aac->pdev, i),
@@ -1837,9 +1836,8 @@ static int aac_acquire_resources(struct aac_dev *dev)
 	aac_adapter_enable_int(dev);


-	if ((dev->pdev->device == PMC_DEVICE_S7 ||
-	     dev->pdev->device == PMC_DEVICE_S8 ||
-	     dev->pdev->device == PMC_DEVICE_S9))
+	if (dev->pdev->device == PMC_DEVICE_S7 ||
+	    dev->pdev->device == PMC_DEVICE_S8)
 		aac_define_int_mode(dev);

 	if (dev->msi_enabled)
--
2.15.1


2019-07-12 01:40:46

by Martin K. Petersen

Subject: Re: [PATCH v2 2/2] scsi: aacraid: Remove references to Series-9 (only)


Hi Konstantin,

> The patch removes references to Series 9 adapters following
> 395e5df79a95 ("scsi: aacraid: Remove reference to Series-9"),
> but doesn't touch Series 6 adapters logic.

We'll need some guidance from the Microsemi folks on this issue.

> https://bugzilla.redhat.com/show_bug.cgi?id=1724077
> https://jira.sw.ru/browse/PSBM-95736

These two links don't appear to be publicly accessible and therefore do
not belong in the patch.

--
Martin K. Petersen Oracle Linux Engineering

2019-08-19 16:38:19

by Konstantin Khorenko

Subject: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

Problem description:
====================
A node with Adaptec 6405 controller, latest BIOS V5.3-0[19204]
A lot of disks attached to the controller.
Simple test: running mkfs.ext4 on many disks on the same controller in
parallel (mkfs is not important here, any serious io load triggers controller
aborts)

Results:
* no problems (controller resets) with kernels prior to
395e5df79a95 ("scsi: aacraid: Remove reference to Series-9")

* latest ms kernel v5.2-rc6-15-g249155c20f9b - mkfs processes are in D state,
lots of complaints in the logs like:

[ 654.894633] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,43,0):
[ 699.441034] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,40,0):
[ 699.442950] aacraid: Host adapter reset request. SCSI hang ?
[ 714.457428] aacraid: Host adapter reset request. SCSI hang ?
...
[ 759.514759] aacraid: Host adapter reset request. SCSI hang ?
[ 759.514869] aacraid 0000:03:00.0: outstanding cmd: midlevel-0
[ 759.514870] aacraid 0000:03:00.0: outstanding cmd: lowlevel-0
[ 759.514872] aacraid 0000:03:00.0: outstanding cmd: error handler-498
[ 759.514873] aacraid 0000:03:00.0: outstanding cmd: firmware-471
[ 759.514875] aacraid 0000:03:00.0: outstanding cmd: kernel-60
[ 759.514912] aacraid 0000:03:00.0: Controller reset type is 3
[ 759.515013] aacraid 0000:03:00.0: Issuing IOP reset
[ 850.296705] aacraid 0000:03:00.0: IOP reset succeeded

Same complaints on Ubuntu kernel 4.15.0-50-generic:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777586

Controller:
===========
03:00.0 RAID bus controller: Adaptec Series 6 - 6G SAS/PCIe 2 (rev 01)
Subsystem: Adaptec Series 6 - ASR-6405 - 4 internal 6G SAS ports

Test:
=====
# cat dev.list
/dev/sdq1
/dev/sde1
/dev/sds1
/dev/sdb1
/dev/sdk1
/dev/sdaj1
/dev/sdaf1
/dev/sdd1
/dev/sdac1
/dev/sdai1
/dev/sdz1
/dev/sdj1
/dev/sdy1
/dev/sdn1
/dev/sdae1
/dev/sdg1
/dev/sdi1
/dev/sdc1
/dev/sdf1
/dev/sdl1
/dev/sda1
/dev/sdab1
/dev/sdr1
/dev/sdo1
/dev/sdah1
/dev/sdm1
/dev/sdt1
/dev/sdp1
/dev/sdad1
/dev/sdh1

===========================================
# cat run_mkfs.sh
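#!/bin/bash

# Start one mkfs.ext4 per device read from stdin, all in parallel.
while read -r dev; do
	mkfs.ext4 "$dev" -q -E lazy_itable_init=1 -O uninit_bg -m 0 &
done

# Wait for all background mkfs processes to finish.
wait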

=================================
# cat dev.list | ./run_mkfs.sh

The issue is 100% reproducible.

I've bisected the problem to the culprit patch:
395e5df79a95 ("scsi: aacraid: Remove reference to Series-9")

It changes the arc ctrl checks for Series-6 controllers,
and I've verified that restoring the original logic in those checks
eliminates the controller hangs/resets.

Konstantin Khorenko (1):
scsi: aacraid: resurrect correct arc ctrl checks for Series-6

--
v3 changes:
* introduced another wrapper to check for devices except for Series 6
controllers upon request from Sagar Biradar (Microchip)

* dropped mentions of private bug ids


drivers/scsi/aacraid/aacraid.h | 11 +++++++++++
drivers/scsi/aacraid/comminit.c | 5 ++---
drivers/scsi/aacraid/linit.c | 2 +-
3 files changed, 14 insertions(+), 4 deletions(-)

--
2.15.1
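
[Editorial sketch, not part of the posted patch] The cover letter above does not
include the patch body, so the following is only a minimal illustration of the
"two wrappers" idea from the v3 changes: one predicate that includes Series 6
and one that excludes it. The device IDs come from the v2 diff earlier in this
thread; the helper names (example_is_s6_s7_s8, example_is_s7_s8,
example_self_check) are hypothetical and are not claimed to match the names
used in the real patch or in the driver. Judging by the v2 diff, the checks
that keep Series 6 are the ones guarding aac_define_int_mode() in
aac_init_adapter(), aac_free_irq() and __aac_shutdown(), while
aac_send_shutdown(), the can_queue adjustment and aac_acquire_resources()
only test for Series 7 and 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PCI device IDs as they appear in drivers/scsi/aacraid/aacraid.h. */
#define PMC_DEVICE_S6 0x28b
#define PMC_DEVICE_S7 0x28c
#define PMC_DEVICE_S8 0x28d

/* Hypothetical helper: true for Series 6, 7 and 8 controllers. */
static inline bool example_is_s6_s7_s8(uint16_t device)
{
	return device == PMC_DEVICE_S6 ||
	       device == PMC_DEVICE_S7 ||
	       device == PMC_DEVICE_S8;
}

/* Hypothetical helper: true for Series 7 and 8 only (Series 6 excluded). */
static inline bool example_is_s7_s8(uint16_t device)
{
	return device == PMC_DEVICE_S7 ||
	       device == PMC_DEVICE_S8;
}

/* Small self-check showing how the two predicates differ for Series 6. */
static inline void example_self_check(void)
{
	assert(example_is_s6_s7_s8(PMC_DEVICE_S6));
	assert(!example_is_s7_s8(PMC_DEVICE_S6));
	assert(example_is_s7_s8(PMC_DEVICE_S7));
	assert(example_is_s7_s8(PMC_DEVICE_S8));
}
```

With helpers of this shape, each call site picks the predicate that matches
its pre-395e5df79a95 behaviour instead of repeating the device-ID comparisons
inline.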

2019-08-29 21:56:40

by Martin K. Petersen

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load


> Problem description:
> ====================
> A node with Adaptec 6405 controller, latest BIOS V5.3-0[19204] A lot
> of disks attached to the controller. Simple test: running mkfs.ext4
> on many disks on the same controller in parallel (mkfs is not
> important here, any serious io load triggers controller aborts)

Microchip folks: Please review!

--
Martin K. Petersen Oracle Linux Engineering

2021-05-06 22:25:39

by James Hilliard

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Mon, Aug 19, 2019 at 10:35 AM Konstantin Khorenko
<[email protected]> wrote:
>
> Problem description:
> ====================
> A node with Adaptec 6405 controller, latest BIOS V5.3-0[19204]
Hitting this on an Adaptec RAID 71605 as well, with BIOS V7.5.0[32118]
> A lot of disks attached to the controller.
> Simple test: running mkfs.ext4 on many disks on the same controller in
> parallel (mkfs is not important here, any serious io load triggers controller
> aborts)
I saw a ZFS resilver trigger this.
>
>
> Results:
> * no problems (controller resets) with kernels prior to
> 395e5df79a95 ("scsi: aacraid: Remove reference to Series-9")
>
> * latest ms kernel v5.2-rc6-15-g249155c20f9b - mkfs processes are in D state,
> lot of complains in logs like:
>
> [ 654.894633] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,43,0):
> [ 699.441034] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,40,0):
> [ 699.442950] aacraid: Host adapter reset request. SCSI hang ?
> [ 714.457428] aacraid: Host adapter reset request. SCSI hang ?
> ...
> [ 759.514759] aacraid: Host adapter reset request. SCSI hang ?
> [ 759.514869] aacraid 0000:03:00.0: outstanding cmd: midlevel-0
> [ 759.514870] aacraid 0000:03:00.0: outstanding cmd: lowlevel-0
> [ 759.514872] aacraid 0000:03:00.0: outstanding cmd: error handler-498
> [ 759.514873] aacraid 0000:03:00.0: outstanding cmd: firmware-471
> [ 759.514875] aacraid 0000:03:00.0: outstanding cmd: kernel-60
> [ 759.514912] aacraid 0000:03:00.0: Controller reset type is 3
> [ 759.515013] aacraid 0000:03:00.0: Issuing IOP reset
> [ 850.296705] aacraid 0000:03:00.0: IOP reset succeeded
>
> Same complains on Ubuntu kernel 4.15.0-50-generic:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777586
It looks like it's popping up in Proxmox as well:
https://forum.proxmox.com/threads/aacraid-host-adapter-abort-request-errors.86903/

When I tested this patch, it appeared to reduce the frequency of the
issue, although I did still hit an abort request:
aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,47,0):
>
>
>
> Controller:
> ===========
> 03:00.0 RAID bus controller: Adaptec Series 6 - 6G SAS/PCIe 2 (rev 01)
> Subsystem: Adaptec Series 6 - ASR-6405 - 4 internal 6G SAS ports
>
> Test:
> =====
> # cat dev.list
> /dev/sdq1
> /dev/sde1
> /dev/sds1
> /dev/sdb1
> /dev/sdk1
> /dev/sdaj1
> /dev/sdaf1
> /dev/sdd1
> /dev/sdac1
> /dev/sdai1
> /dev/sdz1
> /dev/sdj1
> /dev/sdy1
> /dev/sdn1
> /dev/sdae1
> /dev/sdg1
> /dev/sdi1
> /dev/sdc1
> /dev/sdf1
> /dev/sdl1
> /dev/sda1
> /dev/sdab1
> /dev/sdr1
> /dev/sdo1
> /dev/sdah1
> /dev/sdm1
> /dev/sdt1
> /dev/sdp1
> /dev/sdad1
> /dev/sdh1
>
> ===========================================
> # cat run_mkfs.sh
> #!/bin/bash
>
> while read i; do
> mkfs.ext4 $i -q -E lazy_itable_init=1 -O uninit_bg -m 0 &
> done
>
> =================================
> # cat dev.list | ./run_mkfs.sh
>
> The issue is 100% reproducible.
>
> i've bisected to the culprit patch, it's
> 395e5df79a95 ("scsi: aacraid: Remove reference to Series-9")
>
> it changes arc ctrl checks for Series-6 controllers
> and i've checked that resurrection of original logic in arc ctrl checks
> eliminates controller hangs/resets.
>
> Konstantin Khorenko (1):
> scsi: aacraid: resurrect correct arc ctrl checks for Series-6
>
> --
> v3 changes:
> * introduced another wrapper to check for devices except for Series 6
> controllers upon request from Sagar Biradar (Microchip)
>
> * dropped mentions of private bug ids
>
>
> drivers/scsi/aacraid/aacraid.h | 11 +++++++++++
> drivers/scsi/aacraid/comminit.c | 5 ++---
> drivers/scsi/aacraid/linit.c | 2 +-
> 3 files changed, 14 insertions(+), 4 deletions(-)
>
> --
> 2.15.1
>
>

2022-02-23 10:02:27

by Martin K. Petersen

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load


Christian,

> The faulty patch (Commit: 395e5df79a9588abf) from 2017 should be
> repaired with Konstantin Khorenko (1):
>
> scsi: aacraid: resurrect correct arc ctrl checks for Series-6

It would be great to get this patch resubmitted by Konstantin and acked
by Microchip.

Thanks!

--
Martin K. Petersen Oracle Linux Engineering

2022-10-10 12:52:55

by James Hilliard

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Tue, Feb 22, 2022 at 10:41 PM Martin K. Petersen
<[email protected]> wrote:
>
>
> Christian,
>
> > The faulty patch (Commit: 395e5df79a9588abf) from 2017 should be
> > repaired with Konstantin Khorenko (1):
> >
> > scsi: aacraid: resurrect correct arc ctrl checks for Series-6
>
> It would be great to get this patch resubmitted by Konstantin and acked
> by Microchip.

Does the patch need to be rebased?

Based on this, it looks like someone at Microchip may have already reviewed it:
v3 changes:
* introduced another wrapper to check for devices except for Series 6
controllers upon request from Sagar Biradar (Microchip)


>
> Thanks!
>
> --
> Martin K. Petersen Oracle Linux Engineering

2022-10-19 18:50:15

by Konstantin Khorenko

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On 10.10.2022 14:31, James Hilliard wrote:
> On Tue, Feb 22, 2022 at 10:41 PM Martin K. Petersen
> <[email protected]> wrote:
>>
>>
>> Christian,
>>
>>> The faulty patch (Commit: 395e5df79a9588abf) from 2017 should be
>>> repaired with Konstantin Khorenko (1):
>>>
>>> scsi: aacraid: resurrect correct arc ctrl checks for Series-6
>>
>> It would be great to get this patch resubmitted by Konstantin and acked
>> by Microchip.
>
> Does the patch need to be rebased?

James, I have just checked: the old patch (v3) applies cleanly onto the latest master branch.

> Based on this it looks like someone at microchip may have already reviewed:
> v3 changes:
> * introduced another wrapper to check for devices except for Series 6
> controllers upon request from Sagar Biradar (Microchip)

Well, back in 2019 I created a bug in the Red Hat Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1724077
(the bug is private, which is the default for Red Hat bugs)

In this bug Sagar Biradar (with an @microchip.com email address) suggested that I rework the patch.
I did that and sent the v3.

Nothing happened after that, but about a year later (2020-06-19) the bug was closed with the resolution
NOTABUG and a comment that S6 users will find the patch useful.

I suppose S6 is so old that Red Hat just does not have customers using it, and Microchip itself
is also not that interested in handling issues with such old hardware.

Sorry, I was unable to get a final ack from Microchip.
I've written direct emails to the addresses I could find on the internet and tried to connect via
LinkedIn, with no luck.

--
Konstantin Khorenko

2022-10-26 20:12:33

by James Hilliard

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Wed, Oct 19, 2022 at 2:03 PM Konstantin Khorenko
<[email protected]> wrote:
>
> On 10.10.2022 14:31, James Hilliard wrote:
> > On Tue, Feb 22, 2022 at 10:41 PM Martin K. Petersen
> > <[email protected]> wrote:
> >>
> >>
> >> Christian,
> >>
> >>> The faulty patch (Commit: 395e5df79a9588abf) from 2017 should be
> >>> repaired with Konstantin Khorenko (1):
> >>>
> >>> scsi: aacraid: resurrect correct arc ctrl checks for Series-6
> >>
> >> It would be great to get this patch resubmitted by Konstantin and acked
> >> by Microchip.

Can we merge this as is, since Microchip does not appear to be maintaining
this driver any more or responding?


2022-11-13 19:58:22

by James Hilliard

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Thu, Oct 27, 2022 at 1:17 PM <[email protected]> wrote:
>
> Hi James and Konstantin,
>
> *Limiting the audience to avoid spamming*
>
> Sorry for delayed response as I was on vacation.
> This one got missed somehow as someone else was looking into this and is no longer with the company.
>
> I will look into this, meanwhile I wanted to check if you (or someone else you know) had a chance to test this thoroughly with the latest kernel?
> I will get back to you with some more questions or the confirmation in a day or two max.

Did this ever get looked at?

As this exact patch was merged into the vendor aacraid driver a while ago, I'm not sure
why it wouldn't be good to merge it to mainline as well.

Vendor aacraid release with this patch merged:
https://download.adaptec.com/raid/aac/linux/aacraid-linux-src-1.2.1-60001.tgz

>
>
> Thanks for your patience.
> Sagar
>
>
> -----Original Message-----
> From: James Hilliard <[email protected]>
> Sent: Thursday, October 27, 2022 1:40 AM
> To: Martin K. Petersen <[email protected]>
> Cc: Konstantin Khorenko <[email protected]>; Christian Großegger <[email protected]>; [email protected]; Adaptec OEM Raid Solutions <[email protected]>; Sagar Biradar - C34249 <[email protected]>; Linux Kernel Mailing List <[email protected]>; Don Brace - C33706 <[email protected]>
> Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
>
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>
> On Wed, Oct 19, 2022 at 2:03 PM Konstantin Khorenko <[email protected]> wrote:
> >
> > On 10.10.2022 14:31, James Hilliard wrote:
> > > On Tue, Feb 22, 2022 at 10:41 PM Martin K. Petersen
> > > <[email protected]> wrote:
> > >>
> > >>
> > >> Christian,
> > >>
> > >>> The faulty patch (Commit: 395e5df79a9588abf) from 2017 should be
> > >>> repaired with Konstantin Khorenko (1):
> > >>>
> > >>> scsi: aacraid: resurrect correct arc ctrl checks for Series-6
> > >>
> > >> It would be great to get this patch resubmitted by Konstantin and
> > >> acked by Microchip.
>
> Can we merge this as is since microchip does not appear to be maintaining this driver any more or responding?
>
> > >
> > > Does the patch need to be rebased?
> >
> > James, i have just checked - the old patch (v3) applies cleanly onto latest master branch.
> >
> > > Based on this it looks like someone at microchip may have already reviewed:
> > > v3 changes:
> > > * introduced another wrapper to check for devices except for Series 6
> > > controllers upon request from Sagar Biradar (Microchip)
> >
> > Well, back in the year 2019 i've created a bug in RedHat bugzilla
> > https://bugzilla.redhat.com/show_bug.cgi?id=1724077
> > (the bug is private, this is default for Redhat bugs)
> >
> > In this bug Sagar Biradar (with the email @microchip.com) suggested me
> > to rework the patch - i've done that and sent the v3.
> >
> > And nothing happened after that, but in a ~year (2020-06-19) the bug
> > was closed with the resolution NOTABUG and a comment that S6 users will find the patch useful.
> >
> > i suppose S6 is so old that RedHat just does not have customers using
> > it and Microchip company itself is also not that interested in handling so old hardware issues.
> >
> > Sorry, i was unable to get a final ack from Microchip, i've written
> > direct emails to the addresses which is found in the internet, tried
> > to connect via linkedin, no luck.
> >
> > --
> > Konstantin Khorenko

2022-11-15 14:13:01

by Sagar.Biradar

Subject: RE: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

Hi James,
I have looked into the patch thoroughly.
We suspect this change might expose an old legacy interrupt issue on some processors.

We are currently debugging and digging into the details so that we can explain it more fully.
I will keep the thread posted as soon as we have something interesting.

Sagar

-----Original Message-----
From: James Hilliard <[email protected]>
Sent: Monday, November 14, 2022 12:13 AM
To: Sagar Biradar - C34249 <[email protected]>
Cc: [email protected]; [email protected]; [email protected]; [email protected]; Don Brace - C33706 <[email protected]>; Tom White - C33503 <[email protected]>; [email protected]; Linux Kernel Mailing List <[email protected]>
Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe

On Thu, Oct 27, 2022 at 1:17 PM <[email protected]> wrote:
>
> Hi James and Konstantin,
>
> *Limiting the audience to avoid spamming*
>
> Sorry for delayed response as I was on vacation.
> This one got missed somehow as someone else was looking into this and is no longer with the company.
>
> I will look into this, meanwhile I wanted to check if you (or someone else you know) had a chance to test this thoroughly with the latest kernel?
> I will get back to you with some more questions or the confirmation in a day or two max.

Did this ever get looked at?

As this exact patch was merged into the vendor aacraid a while ago I'm not sure why it wouldn't be good to merge to mainline as well.

Vendor aacraid release with this patch merged:
https://download.adaptec.com/raid/aac/linux/aacraid-linux-src-1.2.1-60001.tgz


2022-11-16 22:06:24

by James Hilliard

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Tue, Nov 15, 2022 at 10:05 AM <[email protected]> wrote:
>
> Hi James,
> I have looked into the patch thoroughly.
> We suspect this change might expose an old legacy interrupt issue on some processors.

I did see this error once with this patch when a drive was having issues:
[ 4306.357531] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030111] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030172] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 4335.189886] aacraid: Host bus reset request. SCSI hang ?
[ 4335.189951] aacraid 0000:81:00.0: outstanding cmd: midlevel-0
[ 4335.189989] aacraid 0000:81:00.0: outstanding cmd: lowlevel-0
[ 4335.190101] aacraid 0000:81:00.0: outstanding cmd: error handler-3
[ 4335.190141] aacraid 0000:81:00.0: outstanding cmd: firmware-0
[ 4335.190177] aacraid 0000:81:00.0: outstanding cmd: kernel-0
[ 4335.274070] aacraid 0000:81:00.0: Controller reset type is 3
[ 4335.274142] aacraid 0000:81:00.0: Issuing IOP reset
[ 4365.862127] aacraid 0000:81:00.0: IOP reset succeeded
[ 4365.895079] aacraid: Comm Interface type2 enabled
[ 4374.938119] aacraid 0000:81:00.0: Scheduling bus rescan
[ 4387.022913] sd 0:1:41:0: [sdi] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
[ 4387.022988] sd 0:1:41:0: [sdi] 4096-byte physical blocks
[ 5643.714301] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 5672.349423] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 5672.351532] #PF: supervisor read access in kernel mode
[ 5672.353262] #PF: error_code(0x0000) - not-present page
[ 5672.354860] PGD 8000007ad6ac7067 P4D 8000007ad6ac7067 PUD 7af0892067 PMD 0
[ 5672.356444] Oops: 0000 [#1] SMP PTI
[ 5672.358075] CPU: 9 PID: 644201 Comm: cc1plus Tainted: P O 5.15.64-1-pve #1
[ 5672.359749] Hardware name: Supermicro Super Server/X10DRC, BIOS 3.4 05/21/2021
[ 5672.361465] RIP: 0010:dma_direct_unmap_sg+0x49/0x1a0
[ 5672.363223] Code: ec 20 89 4d d4 4c 89 45 c8 85 d2 0f 8e bb 00 00 00 49 89 fe 49 89 f7 89 d3 45 31 ed 4c 8b 05 ae fd b0 01 49 8b be 60 02 00 00 <45> 8b 4f 18 49 8b 77 10 49 f7 d0 48 85 ff 0f 84 06 01 00 00 4c 8b
[ 5672.367024] RSP: 0000:ffffa4ff58c7cde0 EFLAGS: 00010046
[ 5672.369020] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000001
[ 5672.371073] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 5672.373007] RBP: ffffa4ff58c7ce28 R08: 0000000000000000 R09: 0000000000000001
[ 5672.374795] R10: 0000000000000000 R11: ffffa4ff58c7cff8 R12: 0000000000000000
[ 5672.376418] R13: 0000000000000000 R14: ffff88968e1ec0d0 R15: 0000000000000000
[ 5672.378136] FS: 00007ff103d25ac0(0000) GS:ffff89547fac0000(0000) knlGS:0000000000000000
[ 5672.379760] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5672.381402] CR2: 0000000000000018 CR3: 0000007ae90cc004 CR4: 00000000001706e0
[ 5672.383023] Call Trace:
[ 5672.384673] <IRQ>
[ 5672.386282] ? task_tick_fair+0x88/0x530
[ 5672.386469] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 5672.387921] dma_unmap_sg_attrs+0x32/0x50
[ 5672.391431] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 5672.393273] scsi_dma_unmap+0x3b/0x50
[ 5672.397079] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 5672.398180] aac_srb_callback+0x88/0x3c0 [aacraid]

Does that look related?

>
> We are currently debugging and digging further details to be able to explain it in much detailed fashion.
> I will keep you the thread posted as soon as we have something interesting.
>
> Sagar

2022-11-18 03:48:36

by Sagar.Biradar

Subject: RE: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

Hi James,
Thanks for your response.
This issue seems to be slightly different and may have originated from the drive itself (I am not sure).

The original issue I was talking about, the missing legacy interrupt on certain processors, would still occur.
We are still actively looking into the old "int-x missing" issue, which we suspect may originate from the patch.



-----Original Message-----
From: James Hilliard <[email protected]>
Sent: Thursday, November 17, 2022 3:26 AM
To: Sagar Biradar - C34249 <[email protected]>
Cc: [email protected]; [email protected]; [email protected]; [email protected]; Don Brace - C33706 <[email protected]>; Tom White - C33503 <[email protected]>; [email protected]; [email protected]
Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Tue, Nov 15, 2022 at 10:05 AM <[email protected]> wrote:
>
> Hi James,
> I have looked into the patch thoroughly.
> We suspect this change might expose an old legacy interrupt issue on some processors.

I did see this error once with this patch when a drive was having issues:
[ 4306.357531] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030111] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030172] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 4335.189886] aacraid: Host bus reset request. SCSI hang ?
[ 4335.189951] aacraid 0000:81:00.0: outstanding cmd: midlevel-0
[ 4335.189989] aacraid 0000:81:00.0: outstanding cmd: lowlevel-0
[ 4335.190101] aacraid 0000:81:00.0: outstanding cmd: error handler-3
[ 4335.190141] aacraid 0000:81:00.0: outstanding cmd: firmware-0
[ 4335.190177] aacraid 0000:81:00.0: outstanding cmd: kernel-0
[ 4335.274070] aacraid 0000:81:00.0: Controller reset type is 3
[ 4335.274142] aacraid 0000:81:00.0: Issuing IOP reset
[ 4365.862127] aacraid 0000:81:00.0: IOP reset succeeded
[ 4365.895079] aacraid: Comm Interface type2 enabled
[ 4374.938119] aacraid 0000:81:00.0: Scheduling bus rescan
[ 4387.022913] sd 0:1:41:0: [sdi] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
[ 4387.022988] sd 0:1:41:0: [sdi] 4096-byte physical blocks
[ 5643.714301] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 5672.349423] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 5672.351532] #PF: supervisor read access in kernel mode
[ 5672.353262] #PF: error_code(0x0000) - not-present page
[ 5672.354860] PGD 8000007ad6ac7067 P4D 8000007ad6ac7067 PUD 7af0892067 PMD 0
[ 5672.356444] Oops: 0000 [#1] SMP PTI
[ 5672.358075] CPU: 9 PID: 644201 Comm: cc1plus Tainted: P O 5.15.64-1-pve #1
[ 5672.359749] Hardware name: Supermicro Super Server/X10DRC, BIOS 3.4 05/21/2021
[ 5672.361465] RIP: 0010:dma_direct_unmap_sg+0x49/0x1a0
[ 5672.363223] Code: ec 20 89 4d d4 4c 89 45 c8 85 d2 0f 8e bb 00 00 00 49 89 fe 49 89 f7 89 d3 45 31 ed 4c 8b 05 ae fd b0 01 49 8b be 60 02 00 00 <45> 8b 4f 18 49 8b 77 10 49 f7 d0 48 85 ff 0f 84 06 01 00 00 4c 8b
[ 5672.367024] RSP: 0000:ffffa4ff58c7cde0 EFLAGS: 00010046
[ 5672.369020] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000001
[ 5672.371073] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 5672.373007] RBP: ffffa4ff58c7ce28 R08: 0000000000000000 R09: 0000000000000001
[ 5672.374795] R10: 0000000000000000 R11: ffffa4ff58c7cff8 R12: 0000000000000000
[ 5672.376418] R13: 0000000000000000 R14: ffff88968e1ec0d0 R15: 0000000000000000
[ 5672.378136] FS: 00007ff103d25ac0(0000) GS:ffff89547fac0000(0000) knlGS:0000000000000000
[ 5672.379760] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5672.381402] CR2: 0000000000000018 CR3: 0000007ae90cc004 CR4: 00000000001706e0
[ 5672.383023] Call Trace:
[ 5672.384673] <IRQ>
[ 5672.386282] ? task_tick_fair+0x88/0x530
[ 5672.386469] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 5672.387921] dma_unmap_sg_attrs+0x32/0x50
[ 5672.391431] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 5672.393273] scsi_dma_unmap+0x3b/0x50
[ 5672.397079] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,41,0):
[ 5672.398180] aac_srb_callback+0x88/0x3c0 [aacraid]

Does that look related?
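
When comparing captures like the one above, it can help to tally how often each event appears. The following is only a sketch for illustration: the message strings are copied from the log quoted in this message, while the `aac_log_summary` struct and both function names are hypothetical and not part of aacraid. It searches for substrings rather than whole lines, so it also works on captures where a mail client has fused several log lines together.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of `needle` in `text`.
 * Does not depend on newlines, so fused log lines are handled too. */
static size_t count_occurrences(const char *text, const char *needle)
{
    assert(text != NULL && needle != NULL && needle[0] != '\0');

    size_t count = 0;
    size_t step = strlen(needle);
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + step, needle))
        count++;
    return count;
}

/* Hypothetical summary of one captured kernel log. */
struct aac_log_summary {
    size_t aborts; /* "aacraid: Host adapter abort request" */
    size_t resets; /* "aacraid: Host bus reset request" */
    size_t oopses; /* "BUG: kernel NULL pointer dereference" */
};

/* Tally the three event types seen in the quoted log. */
static struct aac_log_summary summarize_aac_log(const char *text)
{
    struct aac_log_summary s;
    s.aborts = count_occurrences(text, "aacraid: Host adapter abort request");
    s.resets = count_occurrences(text, "aacraid: Host bus reset request");
    s.oopses = count_occurrences(text, "BUG: kernel NULL pointer dereference");
    return s;
}

/* Usage: read a saved dmesg capture into a NUL-terminated buffer, call
 * summarize_aac_log(buffer), and compare the counts between kernels. */
```

The counts say nothing about the cause; they only make it easier to compare runs with and without the patch.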

>
> We are currently debugging and digging further details to be able to explain it in much detailed fashion.
> I will keep you the thread posted as soon as we have something interesting.
>
> Sagar

2022-12-04 00:21:55

by James Hilliard

Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Thu, Nov 17, 2022 at 11:36 PM <[email protected]> wrote:
>
> Hi James,
> Thanks for your response.
> This issue seems to be slightly different and may have been originating from the drive itself (not too sure).

Yeah, the drive was having hardware issues, although it does sound like a
potential error condition that's not being correctly handled by aacraid.

>
> The original issue I was talking about would still occur with the missing legacy interrupt on certain processors.
> We are still actively looking into the old "int-x missing" issue that we suspect might possibly originate from the patch.

Hmm, are there any details available on this "int-x missing" issue? I
couldn't find any public details/reports relating to it.

Is there a list of CPUs known to be affected?

Does it occur in the vendor aacraid release that has this patch merged?

2022-12-06 06:22:04

by Sagar.Biradar

Subject: RE: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

Hi James,
We were in the process of finding the related information and we have finally found some details.
I am reviewing that as I write this email.
I will get back to you once I review and sort that information with more details.

Thanks
Sagar

2022-12-16 21:08:16

by Sagar.Biradar

Subject: RE: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

Hi James / Konstantin,
Here are the details we have compiled so far.
I will restate the problem definition and the concerns discussed so far, to avoid back and forth.

Issue: Series 6 patch, "[regression] aacraid: Host adapter constantly aborts under load" (https://bugzilla.redhat.com/show_bug.cgi?id=1724077)

Synopsis: running mkfs.ext4 on different disks on the same controller in parallel. Nothing seems to break and it appears to always recover, but there are a lot of timeouts:
[ 699.442950] aacraid: Host adapter reset request. SCSI hang ?
[ 759.515013] aacraid 0000:03:00.0: Issuing IOP reset
[ 850.296705] aacraid 0000:03:00.0: IOP reset succeeded
* with kernel 3.10.0-862.20.2.el7.x86_64 - PASS
* with kernel 3.10.0-957.21.3.el7.x86_64 - FAIL

Konstantin’s patch (https://lkml.org/lkml/2019/8/19/758): upon testing the patch on the Virtuozzo kernel, it was found to work fine; the same issue was later observed on Ubuntu.
However, MCHP knows that this patch/change will have issues with Xeon V2 interrupts, so adding it to the tree could harm customers who use these processors (Intel Xeon E5-2609/2630/2650 v2, i.e. E5-26XX V2).
The patch may still work fine on Xeon V3/V4 and later processors.

The following Adaptec ASK article describes our concern: https://ask.adaptec.com/app/answers/detail/a_id/17400/kw/msi
Although the article appears to be VMware-specific, the issue is independent of the operating system.
We have discovered a conflict between the Series 6 and 6E RAID controllers, VMware ESXi 5.5 and Intel Xeon V2 processors that is caused by incorrect interrupt handling.
The system is using the legacy interrupt handling but needs to be switched to MSI (Message Signaled Interrupts) instead.
This issue caused by switching to the legacy mode occurs on CPU Intel Xeon E5-2609/2630/2650 v2 ( E5-26XX V2).
* Note: Xeon V2 is “Ivy Bridge”

Workaround: The proposed solution is to let the driver use the MSI mechanism by setting the aacraid driver parameter "msi" to 1 ("msi=1"), for example with "echo 1 > /sys/module/aacraid/parameters/msi".
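
For anyone who wants to confirm which setting is in effect before repeating the test, here is a minimal C sketch that reads the parameter file named in the workaround and interprets it. The helper names parse_msi_param and read_msi_param are hypothetical, and the sketch assumes the file holds a single decimal integer. It only reports the value; whether a runtime write takes effect without reloading the module is not established in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: interpret the text of the "msi" parameter file.
 * Returns 1 if MSI is requested, 0 if not, -1 if the text is not an integer. */
int parse_msi_param(const char *text)
{
    char *end = NULL;
    long value;

    assert(text != NULL);
    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || errno != 0)
        return -1;              /* no digits, or value out of range */
    if (*end != '\0' && *end != '\n')
        return -1;              /* trailing junk after the number */
    return value != 0;
}

/* Hypothetical helper: read and interpret the parameter file at `path`,
 * e.g. /sys/module/aacraid/parameters/msi (path taken from the workaround).
 * Returns -1 if the file cannot be opened or read. */
int read_msi_param(const char *path)
{
    char buf[32];
    FILE *f;
    int result = -1;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) != NULL)
        result = parse_msi_param(buf);
    fclose(f);
    return result;
}
```

On a machine with the module loaded, read_msi_param("/sys/module/aacraid/parameters/msi") would be expected to return 1 once the workaround is in place, and -1 if the module is not loaded.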

Konstantin,
Is it possible for you or someone you know to test on your original test bed with the "msi" set to "1", and post the results?
We are working on additional tests locally in parallel.
Please write to me if you need more information.


Thanks in advance
Sagar


-----Original Message-----
From: [email protected] <[email protected]>
Sent: Tuesday, December 6, 2022 11:30 AM
To: [email protected]
Cc: [email protected]; [email protected]; [email protected]; [email protected]; Don Brace - C33706 <[email protected]>; Tom White - C33503 <[email protected]>; [email protected]; [email protected]
Subject: RE: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

Hi James,
We were in the process of finding the related information and we have finally found some details.
I am reviewing that as I write this email.
I will get back to you once I review and sort that information with more details.

Thanks
Sagar

-----Original Message-----
From: James Hilliard <[email protected]>
Sent: Sunday, December 4, 2022 5:26 AM
To: Sagar Biradar - C34249 <[email protected]>
Cc: [email protected]; [email protected]; [email protected]; [email protected]; Don Brace - C33706 <[email protected]>; Tom White - C33503 <[email protected]>; [email protected]; [email protected]
Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Thu, Nov 17, 2022 at 11:36 PM <[email protected]> wrote:
>
> Hi James,
> Thanks for your response.
> This issue seems to be slightly different and may have been originating from the drive itself (not too sure).

Yeah, the drive was having hardware issues, although it does sound like a potential error condition that's not being correctly handled by aacraid.

>
> The original issue I was talking about would still occur with the missing legacy interrupt on certain processors.
> We are still actively looking into the old "int-x missing" issue that we suspect might possibly originate from the patch.

Hmm, are there any available details on this "int-x missing" issue? I couldn't find any public details or reports relating to it.

Is there a list of CPUs known to be affected?

Does it occur in the vendor aacraid release that has this patch merged?

>
>
>
> -----Original Message-----
> From: James Hilliard <[email protected]>
> Sent: Thursday, November 17, 2022 3:26 AM
> To: Sagar Biradar - C34249 <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Don Brace - C33706
> <[email protected]>; Tom White - C33503
> <[email protected]>; [email protected];
> [email protected]
> Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405
> constantly resets under high io load
>
> On Tue, Nov 15, 2022 at 10:05 AM <[email protected]> wrote:
> >
> > Hi James,
> > I have looked into the patch thoroughly.
> > We suspect this change might expose an old legacy interrupt issue on some processors.
>
> I did see this error once with this patch when a drive was having issues:
> [ 4306.357531] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 4335.030025] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 4335.030111] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 4335.030172] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 4335.189886] aacraid: Host bus reset request. SCSI hang ?
> [ 4335.189951] aacraid 0000:81:00.0: outstanding cmd: midlevel-0
> [ 4335.189989] aacraid 0000:81:00.0: outstanding cmd: lowlevel-0
> [ 4335.190101] aacraid 0000:81:00.0: outstanding cmd: error handler-3
> [ 4335.190141] aacraid 0000:81:00.0: outstanding cmd: firmware-0
> [ 4335.190177] aacraid 0000:81:00.0: outstanding cmd: kernel-0
> [ 4335.274070] aacraid 0000:81:00.0: Controller reset type is 3
> [ 4335.274142] aacraid 0000:81:00.0: Issuing IOP reset
> [ 4365.862127] aacraid 0000:81:00.0: IOP reset succeeded
> [ 4365.895079] aacraid: Comm Interface type2 enabled
> [ 4374.938119] aacraid 0000:81:00.0: Scheduling bus rescan
> [ 4387.022913] sd 0:1:41:0: [sdi] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
> [ 4387.022988] sd 0:1:41:0: [sdi] 4096-byte physical blocks
> [ 5643.714301] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 5672.349423] BUG: kernel NULL pointer dereference, address: 0000000000000018
> [ 5672.351532] #PF: supervisor read access in kernel mode
> [ 5672.353262] #PF: error_code(0x0000) - not-present page
> [ 5672.354860] PGD 8000007ad6ac7067 P4D 8000007ad6ac7067 PUD 7af0892067 PMD 0
> [ 5672.356444] Oops: 0000 [#1] SMP PTI
> [ 5672.358075] CPU: 9 PID: 644201 Comm: cc1plus Tainted: P O 5.15.64-1-pve #1
> [ 5672.359749] Hardware name: Supermicro Super Server/X10DRC, BIOS 3.4 05/21/2021
> [ 5672.361465] RIP: 0010:dma_direct_unmap_sg+0x49/0x1a0
> [ 5672.363223] Code: ec 20 89 4d d4 4c 89 45 c8 85 d2 0f 8e bb 00 00 00 49 89 fe 49 89 f7 89 d3 45 31 ed 4c 8b 05 ae fd b0 01 49 8b be 60 02 00 00 <45> 8b 4f 18 49 8b 77 10 49 f7 d0 48 85 ff 0f 84 06 01 00 00 4c 8b
> [ 5672.367024] RSP: 0000:ffffa4ff58c7cde0 EFLAGS: 00010046
> [ 5672.369020] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000001
> [ 5672.371073] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
> [ 5672.373007] RBP: ffffa4ff58c7ce28 R08: 0000000000000000 R09: 0000000000000001
> [ 5672.374795] R10: 0000000000000000 R11: ffffa4ff58c7cff8 R12: 0000000000000000
> [ 5672.376418] R13: 0000000000000000 R14: ffff88968e1ec0d0 R15: 0000000000000000
> [ 5672.378136] FS: 00007ff103d25ac0(0000) GS:ffff89547fac0000(0000) knlGS:0000000000000000
> [ 5672.379760] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5672.381402] CR2: 0000000000000018 CR3: 0000007ae90cc004 CR4: 00000000001706e0
> [ 5672.383023] Call Trace:
> [ 5672.384673] <IRQ>
> [ 5672.386282] ? task_tick_fair+0x88/0x530
> [ 5672.386469] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 5672.387921] dma_unmap_sg_attrs+0x32/0x50
> [ 5672.391431] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 5672.393273] scsi_dma_unmap+0x3b/0x50
> [ 5672.397079] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 5672.398180] aac_srb_callback+0x88/0x3c0 [aacraid]
>
> Does that look related?
>
> >
> > We are currently debugging and digging for further details so that we can explain it more thoroughly.
> > I will keep the thread posted as soon as we have something interesting.
> >
> > Sagar
> >
> > -----Original Message-----
> > From: James Hilliard <[email protected]>
> > Sent: Monday, November 14, 2022 12:13 AM
> > To: Sagar Biradar - C34249 <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; Don Brace - C33706
> > <[email protected]>; Tom White - C33503
> > <[email protected]>; [email protected]; Linux Kernel
> > Mailing List <[email protected]>
> > Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405
> > constantly resets under high io load
> >
> > On Thu, Oct 27, 2022 at 1:17 PM <[email protected]> wrote:
> > >
> > > Hi James and Konstantin,
> > >
> > > *Limiting the audience to avoid spamming*
> > >
> > > Sorry for the delayed response; I was on vacation.
> > > This one got missed somehow, as the person who was looking into it is no longer with the company.
> > >
> > > I will look into this, meanwhile I wanted to check if you (or someone else you know) had a chance to test this thoroughly with the latest kernel?
> > > I will get back to you with some more questions or the confirmation in a day or two max.
> >
> > Did this ever get looked at?
> >
> > As this exact patch was merged into the vendor aacraid a while ago I'm not sure why it wouldn't be good to merge to mainline as well.
> >
> > Vendor aacraid release with this patch merged:
> > https://download.adaptec.com/raid/aac/linux/aacraid-linux-src-1.2.1-60001.tgz
> >
> > >
> > >
> > > Thanks for your patience.
> > > Sagar
> > >
> > >
> > > -----Original Message-----
> > > From: James Hilliard <[email protected]>
> > > Sent: Thursday, October 27, 2022 1:40 AM
> > > To: Martin K. Petersen <[email protected]>
> > > Cc: Konstantin Khorenko <[email protected]>; Christian
> > > Großegger <[email protected]>; [email protected];
> > > Adaptec OEM Raid Solutions <[email protected]>; Sagar Biradar
> > > -
> > > C34249 <[email protected]>; Linux Kernel Mailing List
> > > <[email protected]>; Don Brace - C33706
> > > <[email protected]>
> > > Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405
> > > constantly resets under high io load
> > >
> > > On Wed, Oct 19, 2022 at 2:03 PM Konstantin Khorenko <[email protected]> wrote:
> > > >
> > > > On 10.10.2022 14:31, James Hilliard wrote:
> > > > > On Tue, Feb 22, 2022 at 10:41 PM Martin K. Petersen
> > > > > <[email protected]> wrote:
> > > > >>
> > > > >>
> > > > >> Christian,
> > > > >>
> > > > >>> The faulty patch (Commit: 395e5df79a9588abf) from 2017
> > > > >>> should be repaired with Konstantin Khorenko (1):
> > > > >>>
> > > > >>> scsi: aacraid: resurrect correct arc ctrl checks for
> > > > >>> Series-6
> > > > >>
> > > > >> It would be great to get this patch resubmitted by Konstantin
> > > > >> and acked by Microchip.
> > >
> > > Can we merge this as is since microchip does not appear to be maintaining this driver any more or responding?
> > >
> > > > >
> > > > > Does the patch need to be rebased?
> > > >
> > > > > James, I have just checked: the old patch (v3) applies cleanly onto the latest master branch.
> > > >
> > > > > Based on this it looks like someone at microchip may have already reviewed:
> > > > > v3 changes:
> > > > > * introduced another wrapper to check for devices except for Series 6
> > > > > controllers upon request from Sagar Biradar (Microchip)
> > > >
> > > > > Well, back in 2019 I created a bug in the Red Hat
> > > > > Bugzilla:
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1724077
> > > > > (the bug is private, which is the default for Red Hat bugs)
> > > > >
> > > > > In this bug Sagar Biradar (with an @microchip.com email address)
> > > > > suggested that I rework the patch. I did that and sent the v3.
> > > > >
> > > > > Nothing happened after that, but about a year later (2020-06-19) the
> > > > > bug was closed with the resolution NOTABUG and a comment that S6 users will find the patch useful.
> > > > >
> > > > > I suppose S6 is so old that Red Hat simply does not have customers
> > > > > using it, and Microchip itself is also not that interested in handling issues with such old hardware.
> > > > >
> > > > > Sorry, I was unable to get a final ack from Microchip. I wrote
> > > > > direct emails to the addresses I found on the
> > > > > internet and tried to connect via LinkedIn, with no luck.
> > > >
> > > > --
> > > > Konstantin Khorenko

2022-12-20 01:23:15

by James Hilliard

[permalink] [raw]
Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On Fri, Dec 16, 2022 at 1:44 PM <[email protected]> wrote:
>
> Hi James / Konstantin,
> Here are the details that we have compiled so far.
> I will repost the problem definition and the concerns discussed so far, to avoid back and forth.
>
> Issue: Series 6 Patch [regression] aacraid: Host adapter constantly aborts under load (https://bugzilla.redhat.com/show_bug.cgi?id=1724077)
>
> Synopsis: running mkfs.ext4 on different disks on the same controller in parallel. (Nothing seems to break, appears to always recover, but there are a lot of timeouts.)
> [ 699.442950] aacraid: Host adapter reset request. SCSI hang ?
> [ 759.515013] aacraid 0000:03:00.0: Issuing IOP reset
> [ 850.296705] aacraid 0000:03:00.0: IOP reset succeeded
> * with kernel 3.10.0-862.20.2.el7.x86_64 - PASS
> * with kernel 3.10.0-957.21.3.el7.x86_64 - FAIL
>
> Konstantin’s patch (https://lkml.org/lkml/2019/8/19/758): upon testing the patch on the Virtuozzo kernel, it was found to work fine; the same issue was later observed on Ubuntu.
> However, MCHP knows that this patch/change will have issues with Xeon V2 interrupts, so adding it to the tree could harm customers who use these processors (Intel Xeon E5-2609/2630/2650 v2, i.e. E5-26XX V2).
> The patch may still work fine on Xeon V3/V4 and later processors.
>
> The following Adaptec ASK article describes our concern: https://ask.adaptec.com/app/answers/detail/a_id/17400/kw/msi
> Although the article appears to be VMware-specific, the issue is independent of the operating system.
> We have discovered a conflict between the Series 6 and 6E RAID controllers, VMware ESXi 5.5 and Intel Xeon V2 processors that is caused by incorrect interrupt handling.
> The system is using the legacy interrupt handling but needs to be switched to MSI (Message Signaled Interrupts) instead.
> This issue caused by switching to the legacy mode occurs on CPU Intel Xeon E5-2609/2630/2650 v2 ( E5-26XX V2).
> * Note: Xeon V2 is “Ivy Bridge”
>
> Workaround: The proposed solution is to let the driver use the MSI mechanism by setting the aacraid driver parameter "msi" to 1 ("msi=1"), for example with "echo 1 > /sys/module/aacraid/parameters/msi".

Hmm, so this commit indicates that Series 6 RAID cards should always use
MSI interrupts regardless of the msi param:
https://github.com/torvalds/linux/commit/9022d375bd22869ba3e5ad3635f00427cfb934fc

However, it appears that the aac_msi check wasn't removed here; maybe it
should have been?
https://github.com/torvalds/linux/blob/v6.1/drivers/scsi/aacraid/rx.c#L647
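
To make the question concrete, here is a small C model of the two behaviours being contrasted. It is a sketch, not driver code: the function names and the enable_msi_fn callback are hypothetical stand-ins, and the assumption that the init path only attempts MSI when the module parameter is non-zero is one reading of the aac_msi check, not something verified against rx.c here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the call that enables MSI on the PCI device.
 * Returns 0 on success and a negative value on failure. */
typedef int (*enable_msi_fn)(void);

/* Model of the behaviour being questioned (assumption): MSI is only
 * attempted when the "msi" module parameter is non-zero. */
bool msi_gated_by_param(int msi_param, enable_msi_fn enable)
{
    assert(enable != NULL);
    return msi_param != 0 && enable() == 0;
}

/* Model of the alternative: MSI is always attempted and the module
 * parameter is ignored, falling back to legacy INTx only on failure. */
bool msi_always_attempted(enable_msi_fn enable)
{
    assert(enable != NULL);
    return enable() == 0;
}

/* Fake callbacks for illustration only. */
int fake_enable_ok(void)   { return 0; }
int fake_enable_fail(void) { return -1; }
```

In this model, with msi=0 the gated variant stays on legacy interrupts even when the hardware could do MSI, which is the situation the workaround with msi=1 is meant to avoid.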


2022-12-20 20:21:26

by Konstantin Khorenko

[permalink] [raw]
Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load

On 16.12.2022 21:44, [email protected] wrote:
> Hi James / Konstantin,

<skipped>

> Konstantin,
> Is it possible for you or someone you know to test on your original test bed with the "msi" set to "1", and post the results?

Hi Sagar,

Thank you for looking into this.
I'm very sorry: in my case it was a customer complaint about a node in production, and it was a long time
ago. Unfortunately, we definitely won't be able to test anything now.

--
Best regards,

Konstantin Khorenko
Virtuozzo Linux Kernel Team

> We are working on additional tests locally in parallel.
> Please write to me if you need more information.
>
>
> Thanks in advance
> Sagar
>
>
> -----Original Message-----
> From: [email protected] <[email protected]>
> Sent: Tuesday, December 6, 2022 11:30 AM
> To: [email protected]
> Cc: [email protected]; [email protected]; [email protected]; [email protected]; Don Brace - C33706 <[email protected]>; Tom White - C33503 <[email protected]>; [email protected]; [email protected]
> Subject: RE: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
>
> Hi James,
> We were in the process of finding the related information and we have finally found some details.
> I am reviewing that as I write this email.
> I will get back to you once I review and sort that information with more details.
>
> Thanks
> Sagar
>
> -----Original Message-----
> From: James Hilliard <[email protected]>
> Sent: Sunday, December 4, 2022 5:26 AM
> To: Sagar Biradar - C34249 <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; [email protected]; Don Brace - C33706 <[email protected]>; Tom White - C33503 <[email protected]>; [email protected]; [email protected]
> Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
>
> On Thu, Nov 17, 2022 at 11:36 PM <[email protected]> wrote:
>>
>> Hi James,
>> Thanks for your response.
>> This issue seems to be slightly different and may have been originating from the drive itself (not too sure).
>
> Yeah, the drive was having hardware issues, although it does sound like a potential error condition that's not being correctly handled by aacraid.
>
>>
>> The original issue I was talking about would still occur with the missing legacy interrupt on certain processors.
>> We are still actively looking into the old "int-x missing" issue that we suspect might possibly originate from the patch.
>
> Hmm, are there any available details on this "int-x missing" issue? I couldn't find any public details or reports relating to it.
>
> Is there a list of CPUs known to be affected?
>
> Does it occur in the vendor aacraid release that has this patch merged?
>
>>
>>
>>
>> -----Original Message-----
>> From: James Hilliard <[email protected]>
>> Sent: Thursday, November 17, 2022 3:26 AM
>> To: Sagar Biradar - C34249 <[email protected]>
>> Cc: [email protected]; [email protected];
>> [email protected]; [email protected]; Don Brace - C33706
>> <[email protected]>; Tom White - C33503
>> <[email protected]>; [email protected];
>> [email protected]
>> Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405
>> constantly resets under high io load
>>
>> On Tue, Nov 15, 2022 at 10:05 AM <[email protected]> wrote:
>>>
>>> Hi James,
>>> I have looked into the patch thoroughly.
>>> We suspect this change might expose an old legacy interrupt issue on some processors.
>>
>> I did see this error once with this patch when a drive was having issues:
>> [ 4306.357531] aacraid: Host adapter abort request.
>> aacraid: Outstanding commands on (0,1,41,0):
>> [ 4335.030025] aacraid: Host adapter abort request.
>> aacraid: Outstanding commands on (0,1,41,0):
>> [ 4335.030111] aacraid: Host adapter abort request.
>> aacraid: Outstanding commands on (0,1,41,0):
>> [ 4335.030172] aacraid: Host adapter abort request.
>> aacraid: Outstanding commands on (0,1,41,0):
>> [ 4335.189886] aacraid: Host bus reset request. SCSI hang ?
>> [ 4335.189951] aacraid 0000:81:00.0: outstanding cmd: midlevel-0
>> [ 4335.189989] aacraid 0000:81:00.0: outstanding cmd: lowlevel-0
>> [ 4335.190101] aacraid 0000:81:00.0: outstanding cmd: error handler-3
>> [ 4335.190141] aacraid 0000:81:00.0: outstanding cmd: firmware-0
>> [ 4335.190177] aacraid 0000:81:00.0: outstanding cmd: kernel-0
>> [ 4335.274070] aacraid 0000:81:00.0: Controller reset type is 3
>> [ 4335.274142] aacraid 0000:81:00.0: Issuing IOP reset
>> [ 4365.862127] aacraid 0000:81:00.0: IOP reset succeeded
>> [ 4365.895079] aacraid: Comm Interface type2 enabled
>> [ 4374.938119] aacraid 0000:81:00.0: Scheduling bus rescan
>> [ 4387.022913] sd 0:1:41:0: [sdi] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
>> [ 4387.022988] sd 0:1:41:0: [sdi] 4096-byte physical blocks
>> [ 5643.714301] aacraid: Host adapter abort request.
>> aacraid: Outstanding commands on (0,1,41,0):
>> [ 5672.349423] BUG: kernel NULL pointer dereference, address: 0000000000000018
>> [ 5672.351532] #PF: supervisor read access in kernel mode
>> [ 5672.353262] #PF: error_code(0x0000) - not-present page
>> [ 5672.354860] PGD 8000007ad6ac7067 P4D 8000007ad6ac7067 PUD 7af0892067 PMD 0
>> [ 5672.356444] Oops: 0000 [#1] SMP PTI
>> [ 5672.358075] CPU: 9 PID: 644201 Comm: cc1plus Tainted: P O 5.15.64-1-pve #1
>> [ 5672.359749] Hardware name: Supermicro Super Server/X10DRC, BIOS 3.4 05/21/2021
>> [ 5672.361465] RIP: 0010:dma_direct_unmap_sg+0x49/0x1a0
>> [ 5672.363223] Code: ec 20 89 4d d4 4c 89 45 c8 85 d2 0f 8e bb 00 00 00 49 89 fe 49 89 f7 89 d3 45 31 ed 4c 8b 05 ae fd b0 01 49 8b be 60 02 00 00 <45> 8b 4f 18 49 8b 77 10 49 f7 d0 48 85 ff 0f 84 06 01 00 00 4c 8b
>> [ 5672.367024] RSP: 0000:ffffa4ff58c7cde0 EFLAGS: 00010046
>> [ 5672.369020] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000001
>> [ 5672.371073] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
>> [ 5672.373007] RBP: ffffa4ff58c7ce28 R08: 0000000000000000 R09: 0000000000000001
>> [ 5672.374795] R10: 0000000000000000 R11: ffffa4ff58c7cff8 R12: 0000000000000000
>> [ 5672.376418] R13: 0000000000000000 R14: ffff88968e1ec0d0 R15: 0000000000000000
>> [ 5672.378136] FS: 00007ff103d25ac0(0000) GS:ffff89547fac0000(0000) knlGS:0000000000000000
>> [ 5672.379760] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 5672.381402] CR2: 0000000000000018 CR3: 0000007ae90cc004 CR4: 00000000001706e0
>> [ 5672.383023] Call Trace:
>> [ 5672.384673] <IRQ>
>> [ 5672.386282] ? task_tick_fair+0x88/0x530
>> [ 5672.386469] aacraid: Host adapter abort request.
>> aacraid: Outstanding commands on (0,1,41,0):
>> [ 5672.387921] dma_unmap_sg_attrs+0x32/0x50
>> [ 5672.391431] aacraid: Host adapter abort request.
>> aacraid: Outstanding commands on (0,1,41,0):
>> [ 5672.393273] scsi_dma_unmap+0x3b/0x50
>> [ 5672.397079] aacraid: Host adapter abort request.
>> aacraid: Outstanding commands on (0,1,41,0):
>> [ 5672.398180] aac_srb_callback+0x88/0x3c0 [aacraid]
>>
>> Does that look related?
>>
>>>
>>> We are currently debugging and gathering further details so that we can explain it more thoroughly.
>>> I will keep the thread posted as soon as we have something interesting.
>>>
>>> Sagar
>>>
>>> -----Original Message-----
>>> From: James Hilliard <[email protected]>
>>> Sent: Monday, November 14, 2022 12:13 AM
>>> To: Sagar Biradar - C34249 <[email protected]>
>>> Cc: [email protected]; [email protected];
>>> [email protected]; [email protected]; Don Brace - C33706
>>> <[email protected]>; Tom White - C33503
>>> <[email protected]>; [email protected]; Linux Kernel
>>> Mailing List <[email protected]>
>>> Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405
>>> constantly resets under high io load
>>>
>>> On Thu, Oct 27, 2022 at 1:17 PM <[email protected]> wrote:
>>>>
>>>> Hi James and Konstantin,
>>>>
>>>> *Limiting the audience to avoid spamming*
>>>>
>>>> Sorry for the delayed response; I was on vacation.
>>>> This one got missed because the person who was looking into it is no longer with the company.
>>>>
>>>> I will look into this. Meanwhile, I wanted to check whether you (or someone else you know) had a chance to test this thoroughly with the latest kernel?
>>>> I will get back to you with some more questions or a confirmation within a day or two.
>>>
>>> Did this ever get looked at?
>>>
>>> As this exact patch was merged into the vendor aacraid driver a while ago, I'm not sure why it wouldn't be good to merge into mainline as well.
>>>
>>> Vendor aacraid release with this patch merged:
>>> https://download.adaptec.com/raid/aac/linux/aacraid-linux-src-1.2.1-60001.tgz
>>>
>>>>
>>>>
>>>> Thanks for your patience.
>>>> Sagar
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: James Hilliard <[email protected]>
>>>> Sent: Thursday, October 27, 2022 1:40 AM
>>>> To: Martin K. Petersen <[email protected]>
>>>> Cc: Konstantin Khorenko <[email protected]>; Christian
>>>> Großegger <[email protected]>; [email protected];
>>>> Adaptec OEM Raid Solutions <[email protected]>; Sagar Biradar -
>>>> C34249 <[email protected]>; Linux Kernel Mailing List
>>>> <[email protected]>; Don Brace - C33706
>>>> <[email protected]>
>>>> Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405
>>>> constantly resets under high io load
>>>>
>>>> On Wed, Oct 19, 2022 at 2:03 PM Konstantin Khorenko <[email protected]> wrote:
>>>>>
>>>>> On 10.10.2022 14:31, James Hilliard wrote:
>>>>>> On Tue, Feb 22, 2022 at 10:41 PM Martin K. Petersen
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Christian,
>>>>>>>
>>>>>>>> The faulty patch (Commit: 395e5df79a9588abf) from 2017
>>>>>>>> should be repaired with Konstantin Khorenko (1):
>>>>>>>>
>>>>>>>> scsi: aacraid: resurrect correct arc ctrl checks for
>>>>>>>> Series-6
>>>>>>>
>>>>>>> It would be great to get this patch resubmitted by Konstantin
>>>>>>> and acked by Microchip.
>>>>
>>>> Can we merge this as is, since Microchip no longer appears to be maintaining this driver or responding?
>>>>
>>>>>>
>>>>>> Does the patch need to be rebased?
>>>>>
>>>>> James, I have just checked: the old patch (v3) applies cleanly onto the latest master branch.
>>>>>
>>>>>> Based on this it looks like someone at microchip may have already reviewed:
>>>>>> v3 changes:
>>>>>> * introduced another wrapper to check for devices except for Series 6
>>>>>> controllers upon request from Sagar Biradar (Microchip)
>>>>>
>>>>> Well, back in 2019 I filed a bug in the Red Hat Bugzilla:
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1724077
>>>>> (the bug is private, which is the default for Red Hat bugs)
>>>>>
>>>>> In this bug Sagar Biradar (with an @microchip.com email address)
>>>>> suggested that I rework the patch. I did that and sent v3.
>>>>>
>>>>> Nothing happened after that, but about a year later (2020-06-19) the
>>>>> bug was closed with the resolution NOTABUG and a comment that S6 users will find the patch useful.
>>>>>
>>>>> I suppose S6 is so old that Red Hat just does not have customers
>>>>> using it, and Microchip itself is also not that interested in handling issues with such old hardware.
>>>>>
>>>>> Sorry, I was unable to get a final ack from Microchip. I have
>>>>> written direct emails to the addresses I found on the
>>>>> internet and tried to connect via LinkedIn, with no luck.
>>>>>
>>>>> --
>>>>> Konstantin Khorenko