2020-08-14 02:50:00

by Chris Packham

[permalink] [raw]
Subject: fsl_espi errors on v5.7.15

Hi,

I'm seeing a problem with accessing spi-nor after upgrading a T2081
based system to linux v5.7.15

For this board u-boot and the u-boot environment live on spi-nor.

When I use fw_setenv from userspace I get the following kernel logs

# fw_setenv foo=1
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
...

If I run fw_printenv (before getting it into a bad state) it is able to
display the content of the boards u-boot environment.

If been unsuccessful in producing a setup for bisecting the issue. I do
know the issue doesn't occur on the old 4.4.x based kernel but that's
probably not much help.

Any pointers on what the issue (and/or solution) might be.

Thanks,
Chris


2020-08-14 06:20:35

by Heiner Kallweit

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 14.08.2020 04:48, Chris Packham wrote:
> Hi,
>
> I'm seeing a problem with accessing spi-nor after upgrading a T2081
> based system to linux v5.7.15
>
> For this board u-boot and the u-boot environment live on spi-nor.
>
> When I use fw_setenv from userspace I get the following kernel logs
>
> # fw_setenv foo=1
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> ...
>

This error reporting doesn't exist yet in 4.4. So you may have an issue
under 4.4 too, it's just not reported.
Did you verify that under 4.4 fw_setenv actually has an effect?

> If I run fw_printenv (before getting it into a bad state) it is able to
> display the content of the boards u-boot environment.
>

This might indicate an issue with spi being locked. I've seen related
questions, just use the search engine of your choice and check for
fw_setenv and locked.

> If been unsuccessful in producing a setup for bisecting the issue. I do
> know the issue doesn't occur on the old 4.4.x based kernel but that's
> probably not much help.
>
> Any pointers on what the issue (and/or solution) might be.
>
> Thanks,
> Chris
>

Heiner

2020-08-16 21:10:26

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 14/08/20 6:19 pm, Heiner Kallweit wrote:
> On 14.08.2020 04:48, Chris Packham wrote:
>> Hi,
>>
>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>> based system to linux v5.7.15
>>
>> For this board u-boot and the u-boot environment live on spi-nor.
>>
>> When I use fw_setenv from userspace I get the following kernel logs
>>
>> # fw_setenv foo=1
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> ...
>>
> This error reporting doesn't exist yet in 4.4. So you may have an issue
> under 4.4 too, it's just not reported.
> Did you verify that under 4.4 fw_setenv actually has an effect?
Just double checked and yes under 4.4 the setting does get saved.
>> If I run fw_printenv (before getting it into a bad state) it is able to
>> display the content of the boards u-boot environment.
>>
> This might indicate an issue with spi being locked. I've seen related
> questions, just use the search engine of your choice and check for
> fw_setenv and locked.
I'm running a version of fw_setenv which includes
https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't
be locking things unnecessarily.
>> If been unsuccessful in producing a setup for bisecting the issue. I do
>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>> probably not much help.
>>
>> Any pointers on what the issue (and/or solution) might be.
>>
>> Thanks,
>> Chris
>>
> Heiner

2020-08-18 22:47:06

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

Hi Again,

On 17/08/20 9:09 am, Chris Packham wrote:

>
> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>> On 14.08.2020 04:48, Chris Packham wrote:
>>> Hi,
>>>
>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>> based system to linux v5.7.15
>>>
>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>
>>> When I use fw_setenv from userspace I get the following kernel logs
>>>
>>> # fw_setenv foo=1
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> ...
>>>
>> This error reporting doesn't exist yet in 4.4. So you may have an issue
>> under 4.4 too, it's just not reported.
>> Did you verify that under 4.4 fw_setenv actually has an effect?
> Just double checked and yes under 4.4 the setting does get saved.
>>> If I run fw_printenv (before getting it into a bad state) it is able to
>>> display the content of the boards u-boot environment.
>>>
>> This might indicate an issue with spi being locked. I've seen related
>> questions, just use the search engine of your choice and check for
>> fw_setenv and locked.
> I'm running a version of fw_setenv which includes
> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't
> be locking things unnecessarily.
>>> If been unsuccessful in producing a setup for bisecting the issue. I do
>>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>>> probably not much help.
>>>
>>> Any pointers on what the issue (and/or solution) might be.

I finally managed to get our board running with a vanilla kernel. With
corenet64_smp_defconfig I occasionally see

  fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!

other than the message things seem to be working.

With a custom defconfig I see

  fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
  fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
  fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
  ...

and access to the spi-nor does not work until the board is reset.

I'll try and pick apart the differences between the two defconfigs.

2020-08-19 06:16:38

by Heiner Kallweit

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 19.08.2020 00:44, Chris Packham wrote:
> Hi Again,
>
> On 17/08/20 9:09 am, Chris Packham wrote:
>
>>
>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>> Hi,
>>>>
>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>> based system to linux v5.7.15
>>>>
>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>
>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>
>>>> # fw_setenv foo=1
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> ...
>>>>
>>> This error reporting doesn't exist yet in 4.4. So you may have an issue
>>> under 4.4 too, it's just not reported.
>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>> Just double checked and yes under 4.4 the setting does get saved.
>>>> If I run fw_printenv (before getting it into a bad state) it is able to
>>>> display the content of the boards u-boot environment.
>>>>
>>> This might indicate an issue with spi being locked. I've seen related
>>> questions, just use the search engine of your choice and check for
>>> fw_setenv and locked.
>> I'm running a version of fw_setenv which includes
>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't
>> be locking things unnecessarily.
>>>> If been unsuccessful in producing a setup for bisecting the issue. I do
>>>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>>>> probably not much help.
>>>>
>>>> Any pointers on what the issue (and/or solution) might be.
>
> I finally managed to get our board running with a vanilla kernel. With
> corenet64_smp_defconfig I occasionally see
>
>   fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> other than the message things seem to be working.
>
> With a custom defconfig I see
>
>   fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>   fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>   fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>   ...
>
> and access to the spi-nor does not work until the board is reset.
>
> I'll try and pick apart the differences between the two defconfigs.
>
Also relevant may be:
- Which dts are you using?
- What's the spi-nor type, and at which frequency are you operating it?
- Does the issue still happen if you lower the frequency?

2020-08-19 21:10:25

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 19/08/20 6:15 pm, Heiner Kallweit wrote:
> On 19.08.2020 00:44, Chris Packham wrote:
>> Hi Again,
>>
>> On 17/08/20 9:09 am, Chris Packham wrote:
>>
>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>> Hi,
>>>>>
>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>> based system to linux v5.7.15
>>>>>
>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>
>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>
>>>>> # fw_setenv foo=1
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> ...
>>>>>
>>>> This error reporting doesn't exist yet in 4.4. So you may have an issue
>>>> under 4.4 too, it's just not reported.
>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>> If I run fw_printenv (before getting it into a bad state) it is able to
>>>>> display the content of the boards u-boot environment.
>>>>>
>>>> This might indicate an issue with spi being locked. I've seen related
>>>> questions, just use the search engine of your choice and check for
>>>> fw_setenv and locked.
>>> I'm running a version of fw_setenv which includes
>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't
>>> be locking things unnecessarily.
>>>>> If been unsuccessful in producing a setup for bisecting the issue. I do
>>>>> know the issue doesn't occur on the old 4.4.x based kernel but that's
>>>>> probably not much help.
>>>>>
>>>>> Any pointers on what the issue (and/or solution) might be.
>> I finally managed to get our board running with a vanilla kernel. With
>> corenet64_smp_defconfig I occasionally see
>>
>>   fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>
>> other than the message things seem to be working.
>>
>> With a custom defconfig I see
>>
>>   fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>   fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>   fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>   ...
>>
>> and access to the spi-nor does not work until the board is reset.
>>
>> I'll try and pick apart the differences between the two defconfigs.

I now think my earlier testing is invalid. I have seen the problem with
either defconfig if I try hard enough. I had convinced myself that the
problem was CONFIG_PREEMPT but that was before I found boot-to-boot
differences with the same kernel.

It's possible that I'm chasing multiple issues with the same symptom.

The error I'm most concerned with is in the sequence
1. boot with old image
2. write environment
3. boot with new image
4. write environment
5. write fails and environment is corrupted

After I recover the system things sometimes seem fine. Until I repeat
the sequence above.

> Also relevant may be:
> - Which dts are you using?
Custom but based heavily on the t2080rdb.
> - What's the spi-nor type, and at which frequency are you operating it?
The board has several alternate parts for the spi-nor so the dts just
specifies compatible = "jedec,spi-nor" the actual chip detected on the
board I have is "n25q032a (4096 Kbytes)". The dts sets spi-max-frequency
= <10000000> I haven't measured the actual frequency on the bus.
> - Does the issue still happen if you lower the frequency?
I did play around with the frequency initially but I should probably
give that another go now that I have a better reproduction method.

2020-08-24 22:05:59

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 20/08/20 9:08 am, Chris Packham wrote:
>
> On 19/08/20 6:15 pm, Heiner Kallweit wrote:
>> On 19.08.2020 00:44, Chris Packham wrote:
>>> Hi Again,
>>>
>>> On 17/08/20 9:09 am, Chris Packham wrote:
>>>
>>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>>> based system to linux v5.7.15
>>>>>>
>>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>>
>>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>>
>>>>>> # fw_setenv foo=1
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> ...
>>>>>>
>>>>> This error reporting doesn't exist yet in 4.4. So you may have an
>>>>> issue
>>>>> under 4.4 too, it's just not reported.
>>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>>> If I run fw_printenv (before getting it into a bad state) it is
>>>>>> able to
>>>>>> display the content of the boards u-boot environment.
>>>>>>
>>>>> This might indicate an issue with spi being locked. I've seen related
>>>>> questions, just use the search engine of your choice and check for
>>>>> fw_setenv and locked.
>>>> I'm running a version of fw_setenv which includes
>>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't
>>>> be locking things unnecessarily.
>>>>>> If been unsuccessful in producing a setup for bisecting the
>>>>>> issue. I do
>>>>>> know the issue doesn't occur on the old 4.4.x based kernel but
>>>>>> that's
>>>>>> probably not much help.
>>>>>>
>>>>>> Any pointers on what the issue (and/or solution) might be.
>>> I finally managed to get our board running with a vanilla kernel. With
>>> corenet64_smp_defconfig I occasionally see
>>>
>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>
>>> other than the message things seem to be working.
>>>
>>> With a custom defconfig I see
>>>
>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>     fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>> empty!
>>>     fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>     ...
>>>
>>> and access to the spi-nor does not work until the board is reset.
>>>
>>> I'll try and pick apart the differences between the two defconfigs.
>
> I now think my earlier testing is invalid. I have seen the problem
> with either defconfig if I try hard enough. I had convinced myself
> that the problem was CONFIG_PREEMPT but that was before I found
> boot-to-boot differences with the same kernel.
>
> It's possible that I'm chasing multiple issues with the same symptom.
>
> The error I'm most concerned with is in the sequence
> 1. boot with old image
> 2. write environment
> 3. boot with new image
> 4. write environment
> 5. write fails and environment is corrupted
>
> After I recover the system things sometimes seem fine. Until I repeat
> the sequence above.
>
>> Also relevant may be:
>> - Which dts are you using?
> Custom but based heavily on the t2080rdb.
>> - What's the spi-nor type, and at which frequency are you operating it?
> The board has several alternate parts for the spi-nor so the dts just
> specifies compatible = "jedec,spi-nor" the actual chip detected on the
> board I have is "n25q032a (4096 Kbytes)". The dts sets
> spi-max-frequency = <10000000> I haven't measured the actual frequency
> on the bus.
>> - Does the issue still happen if you lower the frequency?
> I did play around with the frequency initially but I should probably
> give that another go now that I have a better reproduction method.

Playing around with the frequency didn't help.

One thing that I've found is that the problem appears to be that I end
up with extra bytes in the RX FIFO. If I add code to drain the RX FIFO
then the system is able to keep accessing the spi-nor (albeit with some
noisy logs).

2020-08-25 03:57:00

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 25/08/20 10:04 am, Chris Packham wrote:
>
> On 20/08/20 9:08 am, Chris Packham wrote:
>>
>> On 19/08/20 6:15 pm, Heiner Kallweit wrote:
>>> On 19.08.2020 00:44, Chris Packham wrote:
>>>> Hi Again,
>>>>
>>>> On 17/08/20 9:09 am, Chris Packham wrote:
>>>>
>>>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>>>> based system to linux v5.7.15
>>>>>>>
>>>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>>>
>>>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>>>
>>>>>>> # fw_setenv foo=1
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>> empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>> empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>> empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> ...
>>>>>>>
>>>>>> This error reporting doesn't exist yet in 4.4. So you may have an
>>>>>> issue
>>>>>> under 4.4 too, it's just not reported.
>>>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>>>> If I run fw_printenv (before getting it into a bad state) it is
>>>>>>> able to
>>>>>>> display the content of the boards u-boot environment.
>>>>>>>
>>>>>> This might indicate an issue with spi being locked. I've seen
>>>>>> related
>>>>>> questions, just use the search engine of your choice and check for
>>>>>> fw_setenv and locked.
>>>>> I'm running a version of fw_setenv which includes
>>>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it
>>>>> shouldn't
>>>>> be locking things unnecessarily.
>>>>>>> If been unsuccessful in producing a setup for bisecting the
>>>>>>> issue. I do
>>>>>>> know the issue doesn't occur on the old 4.4.x based kernel but
>>>>>>> that's
>>>>>>> probably not much help.
>>>>>>>
>>>>>>> Any pointers on what the issue (and/or solution) might be.
>>>> I finally managed to get our board running with a vanilla kernel. With
>>>> corenet64_smp_defconfig I occasionally see
>>>>
>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>
>>>> other than the message things seem to be working.
>>>>
>>>> With a custom defconfig I see
>>>>
>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>     fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>> empty!
>>>>     fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>     ...
>>>>
>>>> and access to the spi-nor does not work until the board is reset.
>>>>
>>>> I'll try and pick apart the differences between the two defconfigs.
>>
>> I now think my earlier testing is invalid. I have seen the problem
>> with either defconfig if I try hard enough. I had convinced myself
>> that the problem was CONFIG_PREEMPT but that was before I found
>> boot-to-boot differences with the same kernel.
>>
>> It's possible that I'm chasing multiple issues with the same symptom.
>>
>> The error I'm most concerned with is in the sequence
>> 1. boot with old image
>> 2. write environment
>> 3. boot with new image
>> 4. write environment
>> 5. write fails and environment is corrupted
>>
>> After I recover the system things sometimes seem fine. Until I repeat
>> the sequence above.
>>
>>> Also relevant may be:
>>> - Which dts are you using?
>> Custom but based heavily on the t2080rdb.
>>> - What's the spi-nor type, and at which frequency are you operating it?
>> The board has several alternate parts for the spi-nor so the dts just
>> specifies compatible = "jedec,spi-nor" the actual chip detected on
>> the board I have is "n25q032a (4096 Kbytes)". The dts sets
>> spi-max-frequency = <10000000> I haven't measured the actual
>> frequency on the bus.
>>> - Does the issue still happen if you lower the frequency?
>> I did play around with the frequency initially but I should probably
>> give that another go now that I have a better reproduction method.
>
> Playing around with the frequency didn't help.
>
> One thing that I've found is that the problem appears to be that I end
> up with extra bytes in the RX FIFO. If I add code to drain the RX FIFO
> then the system is able to keep accessing the spi-nor (albeit with
> some noisy logs).

I've been staring at spi-fsl-espi.c for while now and I think I've
identified a couple of deficiencies that may or may not be related to my
issue.

First I think the 'Transfer done but SPIE_DON isn't set' message can be
generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
We also write back to it to clear the current events. We re-read it in
fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
naturally end up in that situation if we're doing a large read. Consider
the messages for reading a block of data from a spi-nor chip

 tx = READ_OP + ADDR
 rx = data

We setup the transfer and pump out the tx_buf. The first interrupt goes
off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
clear ESPI_SPIE and wait for the next interrupt. The next interrupt
fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
continues until we've received all the data and we finish with ESPI_SPIE
having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
isn't set.

The other deficiency is that we only get an interrupt when the amount of
data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
FSL_ESPI_RXTHR left to be received we will never pull them out of the fifo.

I think the reason I'm seeing some variability is because of how fast
(or slow) the interrupts get processed and how fast the spi-nor chip can
fill the CPUs rx fifo.

2020-08-25 09:00:22

by Heiner Kallweit

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 25.08.2020 05:54, Chris Packham wrote:
>
> On 25/08/20 10:04 am, Chris Packham wrote:
>>
>> On 20/08/20 9:08 am, Chris Packham wrote:
>>>
>>> On 19/08/20 6:15 pm, Heiner Kallweit wrote:
>>>> On 19.08.2020 00:44, Chris Packham wrote:
>>>>> Hi Again,
>>>>>
>>>>> On 17/08/20 9:09 am, Chris Packham wrote:
>>>>>
>>>>>> On 14/08/20 6:19 pm, Heiner Kallweit wrote:
>>>>>>> On 14.08.2020 04:48, Chris Packham wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>>>>>>>> based system to linux v5.7.15
>>>>>>>>
>>>>>>>> For this board u-boot and the u-boot environment live on spi-nor.
>>>>>>>>
>>>>>>>> When I use fw_setenv from userspace I get the following kernel logs
>>>>>>>>
>>>>>>>> # fw_setenv foo=1
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>>>>> empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> ...
>>>>>>>>
>>>>>>> This error reporting doesn't exist yet in 4.4. So you may have an
>>>>>>> issue
>>>>>>> under 4.4 too, it's just not reported.
>>>>>>> Did you verify that under 4.4 fw_setenv actually has an effect?
>>>>>> Just double checked and yes under 4.4 the setting does get saved.
>>>>>>>> If I run fw_printenv (before getting it into a bad state) it is
>>>>>>>> able to
>>>>>>>> display the content of the boards u-boot environment.
>>>>>>>>
>>>>>>> This might indicate an issue with spi being locked. I've seen
>>>>>>> related
>>>>>>> questions, just use the search engine of your choice and check for
>>>>>>> fw_setenv and locked.
>>>>>> I'm running a version of fw_setenv which includes
>>>>>> https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it
>>>>>> shouldn't
>>>>>> be locking things unnecessarily.
>>>>>>>> If been unsuccessful in producing a setup for bisecting the
>>>>>>>> issue. I do
>>>>>>>> know the issue doesn't occur on the old 4.4.x based kernel but
>>>>>>>> that's
>>>>>>>> probably not much help.
>>>>>>>>
>>>>>>>> Any pointers on what the issue (and/or solution) might be.
>>>>> I finally managed to get our board running with a vanilla kernel. With
>>>>> corenet64_smp_defconfig I occasionally see
>>>>>
>>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>
>>>>> other than the message things seem to be working.
>>>>>
>>>>> With a custom defconfig I see
>>>>>
>>>>>     fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>     fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
>>>>> empty!
>>>>>     fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>     ...
>>>>>
>>>>> and access to the spi-nor does not work until the board is reset.
>>>>>
>>>>> I'll try and pick apart the differences between the two defconfigs.
>>>
>>> I now think my earlier testing is invalid. I have seen the problem
>>> with either defconfig if I try hard enough. I had convinced myself
>>> that the problem was CONFIG_PREEMPT but that was before I found
>>> boot-to-boot differences with the same kernel.
>>>
>>> It's possible that I'm chasing multiple issues with the same symptom.
>>>
>>> The error I'm most concerned with is in the sequence
>>> 1. boot with old image
>>> 2. write environment
>>> 3. boot with new image
>>> 4. write environment
>>> 5. write fails and environment is corrupted
>>>
>>> After I recover the system things sometimes seem fine. Until I repeat
>>> the sequence above.
>>>
>>>> Also relevant may be:
>>>> - Which dts are you using?
>>> Custom but based heavily on the t2080rdb.
>>>> - What's the spi-nor type, and at which frequency are you operating it?
>>> The board has several alternate parts for the spi-nor so the dts just
>>> specifies compatible = "jedec,spi-nor" the actual chip detected on
>>> the board I have is "n25q032a (4096 Kbytes)". The dts sets
>>> spi-max-frequency = <10000000> I haven't measured the actual
>>> frequency on the bus.
>>>> - Does the issue still happen if you lower the frequency?
>>> I did play around with the frequency initially but I should probably
>>> give that another go now that I have a better reproduction method.
>>
>> Playing around with the frequency didn't help.
>>
>> One thing that I've found is that the problem appears to be that I end
>> up with extra bytes in the RX FIFO. If I add code to drain the RX FIFO
>> then the system is able to keep accessing the spi-nor (albeit with
>> some noisy logs).
>
> I've been staring at spi-fsl-espi.c for while now and I think I've
> identified a couple of deficiencies that may or may not be related to my
> issue.
>
> First I think the 'Transfer done but SPIE_DON isn't set' message can be
> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
> We also write back to it to clear the current events. We re-read it in
> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
> naturally end up in that situation if we're doing a large read. Consider
> the messages for reading a block of data from a spi-nor chip
>
>  tx = READ_OP + ADDR
>  rx = data
>
> We setup the transfer and pump out the tx_buf. The first interrupt goes
> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
> continues until we've received all the data and we finish with ESPI_SPIE
> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
> isn't set.
>
> The other deficiency is that we only get an interrupt when the amount of
> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
> FSL_ESPI_RXTHR left to be received we will never pull them out of the fifo.
>
SPIM_DON will trigger an interrupt once the last characters have been
transferred, and read the remaining characters from the FIFO.

> I think the reason I'm seeing some variability is because of how fast
> (or slow) the interrupts get processed and how fast the spi-nor chip can
> fill the CPUs rx fifo.
>
To rule out timing issues at high bus frequencies I initially asked
for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
or even less, then timing shouldn't be an issue.

Last relevant functional changes have been done almost 4 years ago.
And yours is the first such report I see. So question is what could be so
special with your setup that it seems you're the only one being affected.
The scenarios you describe are standard, therefore much more people
should be affected in case of a driver bug.

You said that kernel config impacts how frequently the issue happens.
Therefore question is what's the diff in kernel config, and how could
the differences be related to SPI.

2020-08-25 22:25:44

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 25/08/20 7:22 pm, Heiner Kallweit wrote:

<snip>
> I've been staring at spi-fsl-espi.c for while now and I think I've
>> identified a couple of deficiencies that may or may not be related to my
>> issue.
>>
>> First I think the 'Transfer done but SPIE_DON isn't set' message can be
>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
>> We also write back to it to clear the current events. We re-read it in
>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>> naturally end up in that situation if we're doing a large read. Consider
>> the messages for reading a block of data from a spi-nor chip
>>
>>  tx = READ_OP + ADDR
>>  rx = data
>>
>> We setup the transfer and pump out the tx_buf. The first interrupt goes
>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>> continues until we've received all the data and we finish with ESPI_SPIE
>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>> isn't set.
>>
>> The other deficiency is that we only get an interrupt when the amount of
>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>> FSL_ESPI_RXTHR left to be received we will never pull them out of the fifo.
>>
> SPIM_DON will trigger an interrupt once the last characters have been
> transferred, and read the remaining characters from the FIFO.

The T2080RM that I have says the following about the DON bit

"Last character was transmitted. The last character was transmitted and
a new command can be written for the next frame."

That does at least seem to fit with my assertion that it's all about the
TX direction. But the fact that it doesn't happen all the time throws
some doubt on it.

> I think the reason I'm seeing some variability is because of how fast
>> (or slow) the interrupts get processed and how fast the spi-nor chip can
>> fill the CPUs rx fifo.
>>
> To rule out timing issues at high bus frequencies I initially asked
> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
> or even less, then timing shouldn't be an issue.
Yes I've currently got spi-max-frequency = <1000000>; in my dts. I would
also expect a slower frequency would fit my "DON is for TX" narrative.
> Last relevant functional changes have been done almost 4 years ago.
> And yours is the first such report I see. So question is what could be so
> special with your setup that it seems you're the only one being affected.
> The scenarios you describe are standard, therefore much more people
> should be affected in case of a driver bug.
Agreed. But even on my hardware (which may have a latent issue despite
being in the field for going on 5 years) the issue only triggers under
some fairly specific circumstances.
> You said that kernel config impacts how frequently the issue happens.
> Therefore question is what's the diff in kernel config, and how could
> the differences be related to SPI.

It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
impact but every time I found something that seemed to be having an
impact I've been able to disprove it. I actually think its about how
busy the system is which may or may not affect when we get round to
processing the interrupts.

I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
occur on the T2080RDB.

I've had to add the following to expose the environment as a mtd partition

diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
index ff87e67c70da..fbf95fc1fd68 100644
--- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
@@ -116,6 +116,15 @@ flash@0 {
                                compatible = "micron,n25q512ax3",
"jedec,spi-nor";
                                reg = <0>;
                                spi-max-frequency = <10000000>; /*
input clock */
+
+                               partition@u-boot {
+                                        reg = <0x00000000 0x00100000>;
+                                        label = "u-boot";
+                                };
+                                partition@u-boot-env {
+                                        reg = <0x00100000 0x00010000>;
+                                        label = "u-boot-env";
+                                };
                        };
                };

And I'm using the following script to poke at the environment (warning
if anyone does try this and the bug hits it can render your u-boot
environment invalid).

cat flash/fw_env_test.sh
#!/bin/sh

generate_fw_env_config()
{
  cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize name ; do
     echo "$dev $size $erasesize $name"
     [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
$erasesize" >/flash/fw_env.config
  done
}

cycles=10
[ $# -ge 1 ] && cycles=$1

generate_fw_env_config

fw_printenv -c /flash/fw_env.config

dmesg -c >/dev/null
x=0
while [ $x -lt $cycles ]; do
    fw_printenv -c /flash/fw_env.config >/dev/null || break
    fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
    dmesg -c | grep -q fsl_espi && break;
    let x=x+1
done

echo "Ran $x cycles"

2020-08-26 01:50:15

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 26/08/20 10:22 am, Chris Packham wrote:
> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>
> <snip>
>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>> identified a couple of deficiencies that may or may not be related
>>> to my
>>> issue.
>>>
>>> First I think the 'Transfer done but SPIE_DON isn't set' message can be
>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
>>> We also write back to it to clear the current events. We re-read it in
>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>> naturally end up in that situation if we're doing a large read.
>>> Consider
>>> the messages for reading a block of data from a spi-nor chip
>>>
>>>    tx = READ_OP + ADDR
>>>    rx = data
>>>
>>> We setup the transfer and pump out the tx_buf. The first interrupt goes
>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>> continues until we've received all the data and we finish with
>>> ESPI_SPIE
>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>> isn't set.
>>>
>>> The other deficiency is that we only get an interrupt when the
>>> amount of
>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>> the fifo.
>>>
>> SPIM_DON will trigger an interrupt once the last characters have been
>> transferred, and read the remaining characters from the FIFO.
>
> The T2080RM that I have says the following about the DON bit
>
> "Last character was transmitted. The last character was transmitted
> and a new command can be written for the next frame."
>
> That does at least seem to fit with my assertion that it's all about
> the TX direction. But the fact that it doesn't happen all the time
> throws some doubt on it.
>
>> I think the reason I'm seeing some variability is because of how fast
>>> (or slow) the interrupts get processed and how fast the spi-nor chip
>>> can
>>> fill the CPUs rx fifo.
>>>
>> To rule out timing issues at high bus frequencies I initially asked
>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>> or even less, then timing shouldn't be an issue.
> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
> would also expect a slower frequency would fit my "DON is for TX"
> narrative.
>> Last relevant functional changes have been done almost 4 years ago.
>> And yours is the first such report I see. So question is what could
>> be so
>> special with your setup that it seems you're the only one being
>> affected.
>> The scenarios you describe are standard, therefore much more people
>> should be affected in case of a driver bug.
> Agreed. But even on my hardware (which may have a latent issue despite
> being in the field for going on 5 years) the issue only triggers under
> some fairly specific circumstances.
>> You said that kernel config impacts how frequently the issue happens.
>> Therefore question is what's the diff in kernel config, and how could
>> the differences be related to SPI.
>
> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
> impact but every time I found something that seemed to be having an
> impact I've been able to disprove it. I actually think its about how
> busy the system is which may or may not affect when we get round to
> processing the interrupts.
>
> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
> occur on the T2080RDB.
>
> I've had to add the following to expose the environment as a mtd
> partition
>
> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> index ff87e67c70da..fbf95fc1fd68 100644
> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> @@ -116,6 +116,15 @@ flash@0 {
>                                 compatible = "micron,n25q512ax3",
> "jedec,spi-nor";
>                                 reg = <0>;
>                                 spi-max-frequency = <10000000>; /*
> input clock */
> +
> +                               partition@u-boot {
> +                                        reg = <0x00000000 0x00100000>;
> +                                        label = "u-boot";
> +                                };
> +                                partition@u-boot-env {
> +                                        reg = <0x00100000 0x00010000>;
> +                                        label = "u-boot-env";
> +                                };
>                         };
>                 };
>
> And I'm using the following script to poke at the environment (warning
> if anyone does try this and the bug hits it can render your u-boot
> environment invalid).
>
> cat flash/fw_env_test.sh
> #!/bin/sh
>
> generate_fw_env_config()
> {
>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize name
> ; do
>      echo "$dev $size $erasesize $name"
>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
> $erasesize" >/flash/fw_env.config
>   done
> }
>
> cycles=10
> [ $# -ge 1 ] && cycles=$1
>
> generate_fw_env_config
>
> fw_printenv -c /flash/fw_env.config
>
> dmesg -c >/dev/null
> x=0
> while [ $x -lt $cycles ]; do
>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>     dmesg -c | grep -q fsl_espi && break;
>     let x=x+1
> done
>
> echo "Ran $x cycles"

I've also now seen the RX FIFO not empty error on the T2080RDB

fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

With my current workaround of emptying the RX FIFO. It seems survivable.
Interestingly it only ever seems to be 1 extra byte in the RX FIFO and
it seems to be after either a READ_SR or a READ_FSR.

fsl_espi ffe110000.spi: tx 70
fsl_espi ffe110000.spi: rx 03
fsl_espi ffe110000.spi: Extra RX 00
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03

From all the Micron SPI-NOR datasheets I've got access to it is
possible to continually read the SR/FSR. But I've no idea why it happens
some times and not others.

2020-08-26 06:08:25

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 26/08/20 1:48 pm, Chris Packham wrote:
>
> On 26/08/20 10:22 am, Chris Packham wrote:
>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>
>> <snip>
>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>> identified a couple of deficiencies that may or may not be related
>>>> to my
>>>> issue.
>>>>
>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>> can be
>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>> register.
>>>> We also write back to it to clear the current events. We re-read it in
>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>> naturally end up in that situation if we're doing a large read.
>>>> Consider
>>>> the messages for reading a block of data from a spi-nor chip
>>>>
>>>>    tx = READ_OP + ADDR
>>>>    rx = data
>>>>
>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>> goes
>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>> continues until we've received all the data and we finish with
>>>> ESPI_SPIE
>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>> isn't set.
>>>>
>>>> The other deficiency is that we only get an interrupt when the
>>>> amount of
>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>> the fifo.
>>>>
>>> SPIM_DON will trigger an interrupt once the last characters have been
>>> transferred, and read the remaining characters from the FIFO.
>>
>> The T2080RM that I have says the following about the DON bit
>>
>> "Last character was transmitted. The last character was transmitted
>> and a new command can be written for the next frame."
>>
>> That does at least seem to fit with my assertion that it's all about
>> the TX direction. But the fact that it doesn't happen all the time
>> throws some doubt on it.
>>
>>> I think the reason I'm seeing some variability is because of how fast
>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>> chip can
>>>> fill the CPUs rx fifo.
>>>>
>>> To rule out timing issues at high bus frequencies I initially asked
>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>> or even less, then timing shouldn't be an issue.
>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>> would also expect a slower frequency would fit my "DON is for TX"
>> narrative.
>>> Last relevant functional changes have been done almost 4 years ago.
>>> And yours is the first such report I see. So question is what could
>>> be so
>>> special with your setup that it seems you're the only one being
>>> affected.
>>> The scenarios you describe are standard, therefore much more people
>>> should be affected in case of a driver bug.
>> Agreed. But even on my hardware (which may have a latent issue
>> despite being in the field for going on 5 years) the issue only
>> triggers under some fairly specific circumstances.
>>> You said that kernel config impacts how frequently the issue happens.
>>> Therefore question is what's the diff in kernel config, and how could
>>> the differences be related to SPI.
>>
>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>> impact but every time I found something that seemed to be having an
>> impact I've been able to disprove it. I actually think its about how
>> busy the system is which may or may not affect when we get round to
>> processing the interrupts.
>>
>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>> occur on the T2080RDB.
>>
>> I've had to add the following to expose the environment as a mtd
>> partition
>>
>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> index ff87e67c70da..fbf95fc1fd68 100644
>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>> @@ -116,6 +116,15 @@ flash@0 {
>>                                 compatible = "micron,n25q512ax3",
>> "jedec,spi-nor";
>>                                 reg = <0>;
>>                                 spi-max-frequency = <10000000>; /*
>> input clock */
>> +
>> +                               partition@u-boot {
>> +                                        reg = <0x00000000 0x00100000>;
>> +                                        label = "u-boot";
>> +                                };
>> +                                partition@u-boot-env {
>> +                                        reg = <0x00100000 0x00010000>;
>> +                                        label = "u-boot-env";
>> +                                };
>>                         };
>>                 };
>>
>> And I'm using the following script to poke at the environment
>> (warning if anyone does try this and the bug hits it can render your
>> u-boot environment invalid).
>>
>> cat flash/fw_env_test.sh
>> #!/bin/sh
>>
>> generate_fw_env_config()
>> {
>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>> name ; do
>>      echo "$dev $size $erasesize $name"
>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>> $erasesize" >/flash/fw_env.config
>>   done
>> }
>>
>> cycles=10
>> [ $# -ge 1 ] && cycles=$1
>>
>> generate_fw_env_config
>>
>> fw_printenv -c /flash/fw_env.config
>>
>> dmesg -c >/dev/null
>> x=0
>> while [ $x -lt $cycles ]; do
>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>     dmesg -c | grep -q fsl_espi && break;
>>     let x=x+1
>> done
>>
>> echo "Ran $x cycles"
>
> I've also now seen the RX FIFO not empty error on the T2080RDB
>
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> With my current workaround of emptying the RX FIFO. It seems
> survivable. Interestingly it only ever seems to be 1 extra byte in the
> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>
> fsl_espi ffe110000.spi: tx 70
> fsl_espi ffe110000.spi: rx 03
> fsl_espi ffe110000.spi: Extra RX 00
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
>
> From all the Micron SPI-NOR datasheets I've got access to it is
> possible to continually read the SR/FSR. But I've no idea why it
> happens some times and not others.

So I think I've got a reproduction and I think I've bisected the problem
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
C"). My day is just finishing now so I haven't applied too much scrutiny
to this result. Given the various rabbit holes I've been down on this
issue already I'd take this information with a good degree of skepticism.

Thanks,
Chris

2020-08-26 06:40:44

by Heiner Kallweit

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 26.08.2020 08:07, Chris Packham wrote:
>
> On 26/08/20 1:48 pm, Chris Packham wrote:
>>
>> On 26/08/20 10:22 am, Chris Packham wrote:
>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>
>>> <snip>
>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>> identified a couple of deficiencies that may or may not be related
>>>>> to my
>>>>> issue.
>>>>>
>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>> can be
>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>> register.
>>>>> We also write back to it to clear the current events. We re-read it in
>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>> naturally end up in that situation if we're doing a large read.
>>>>> Consider
>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>
>>>>>    tx = READ_OP + ADDR
>>>>>    rx = data
>>>>>
>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>> goes
>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>> continues until we've received all the data and we finish with
>>>>> ESPI_SPIE
>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>> isn't set.
>>>>>
>>>>> The other deficiency is that we only get an interrupt when the
>>>>> amount of
>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>> the fifo.
>>>>>
>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>> transferred, and read the remaining characters from the FIFO.
>>>
>>> The T2080RM that I have says the following about the DON bit
>>>
>>> "Last character was transmitted. The last character was transmitted
>>> and a new command can be written for the next frame."
>>>
>>> That does at least seem to fit with my assertion that it's all about
>>> the TX direction. But the fact that it doesn't happen all the time
>>> throws some doubt on it.
>>>
>>>> I think the reason I'm seeing some variability is because of how fast
>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>> chip can
>>>>> fill the CPUs rx fifo.
>>>>>
>>>> To rule out timing issues at high bus frequencies I initially asked
>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>> or even less, then timing shouldn't be an issue.
>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>> would also expect a slower frequency would fit my "DON is for TX"
>>> narrative.
>>>> Last relevant functional changes have been done almost 4 years ago.
>>>> And yours is the first such report I see. So question is what could
>>>> be so
>>>> special with your setup that it seems you're the only one being
>>>> affected.
>>>> The scenarios you describe are standard, therefore much more people
>>>> should be affected in case of a driver bug.
>>> Agreed. But even on my hardware (which may have a latent issue
>>> despite being in the field for going on 5 years) the issue only
>>> triggers under some fairly specific circumstances.
>>>> You said that kernel config impacts how frequently the issue happens.
>>>> Therefore question is what's the diff in kernel config, and how could
>>>> the differences be related to SPI.
>>>
>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>> impact but every time I found something that seemed to be having an
>>> impact I've been able to disprove it. I actually think its about how
>>> busy the system is which may or may not affect when we get round to
>>> processing the interrupts.
>>>
>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>> occur on the T2080RDB.
>>>
>>> I've had to add the following to expose the environment as a mtd
>>> partition
>>>
>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> index ff87e67c70da..fbf95fc1fd68 100644
>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> @@ -116,6 +116,15 @@ flash@0 {
>>>                                 compatible = "micron,n25q512ax3",
>>> "jedec,spi-nor";
>>>                                 reg = <0>;
>>>                                 spi-max-frequency = <10000000>; /*
>>> input clock */
>>> +
>>> +                               partition@u-boot {
>>> +                                        reg = <0x00000000 0x00100000>;
>>> +                                        label = "u-boot";
>>> +                                };
>>> +                                partition@u-boot-env {
>>> +                                        reg = <0x00100000 0x00010000>;
>>> +                                        label = "u-boot-env";
>>> +                                };
>>>                         };
>>>                 };
>>>
>>> And I'm using the following script to poke at the environment
>>> (warning if anyone does try this and the bug hits it can render your
>>> u-boot environment invalid).
>>>
>>> cat flash/fw_env_test.sh
>>> #!/bin/sh
>>>
>>> generate_fw_env_config()
>>> {
>>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>> name ; do
>>>      echo "$dev $size $erasesize $name"
>>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>> $erasesize" >/flash/fw_env.config
>>>   done
>>> }
>>>
>>> cycles=10
>>> [ $# -ge 1 ] && cycles=$1
>>>
>>> generate_fw_env_config
>>>
>>> fw_printenv -c /flash/fw_env.config
>>>
>>> dmesg -c >/dev/null
>>> x=0
>>> while [ $x -lt $cycles ]; do
>>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>     dmesg -c | grep -q fsl_espi && break;
>>>     let x=x+1
>>> done
>>>
>>> echo "Ran $x cycles"
>>
>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>
>> With my current workaround of emptying the RX FIFO. It seems
>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>
>> fsl_espi ffe110000.spi: tx 70
>> fsl_espi ffe110000.spi: rx 03
>> fsl_espi ffe110000.spi: Extra RX 00
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>>
>> From all the Micron SPI-NOR datasheets I've got access to it is
>> possible to continually read the SR/FSR. But I've no idea why it
>> happens some times and not others.
>
> So I think I've got a reproduction and I think I've bisected the problem
> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
> C"). My day is just finishing now so I haven't applied too much scrutiny
> to this result. Given the various rabbit holes I've been down on this
> issue already I'd take this information with a good degree of skepticism.
>
OK, so an easy test should be to re-test with a 5.4 kernel.
It doesn't have yet the change you're referring to, and the fsl-espi driver
is basically the same as in 5.7 (just two small changes in 5.7).

> Thanks,
> Chris
>

2020-08-26 22:29:04

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

(adding Nicholas)

On 26/08/20 6:38 pm, Heiner Kallweit wrote:
> On 26.08.2020 08:07, Chris Packham wrote:
>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>
>>>> <snip>
>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>> to my
>>>>>> issue.
>>>>>>
>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>> can be
>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>> register.
>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>> Consider
>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>
>>>>>>    tx = READ_OP + ADDR
>>>>>>    rx = data
>>>>>>
>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>> goes
>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>> continues until we've received all the data and we finish with
>>>>>> ESPI_SPIE
>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>> isn't set.
>>>>>>
>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>> amount of
>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>> the fifo.
>>>>>>
>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>> transferred, and read the remaining characters from the FIFO.
>>>> The T2080RM that I have says the following about the DON bit
>>>>
>>>> "Last character was transmitted. The last character was transmitted
>>>> and a new command can be written for the next frame."
>>>>
>>>> That does at least seem to fit with my assertion that it's all about
>>>> the TX direction. But the fact that it doesn't happen all the time
>>>> throws some doubt on it.
>>>>
>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>> chip can
>>>>>> fill the CPUs rx fifo.
>>>>>>
>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>> or even less, then timing shouldn't be an issue.
>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>> narrative.
>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>> And yours is the first such report I see. So question is what could
>>>>> be so
>>>>> special with your setup that it seems you're the only one being
>>>>> affected.
>>>>> The scenarios you describe are standard, therefore much more people
>>>>> should be affected in case of a driver bug.
>>>> Agreed. But even on my hardware (which may have a latent issue
>>>> despite being in the field for going on 5 years) the issue only
>>>> triggers under some fairly specific circumstances.
>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>> the differences be related to SPI.
>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>> impact but every time I found something that seemed to be having an
>>>> impact I've been able to disprove it. I actually think its about how
>>>> busy the system is which may or may not affect when we get round to
>>>> processing the interrupts.
>>>>
>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>> occur on the T2080RDB.
>>>>
>>>> I've had to add the following to expose the environment as a mtd
>>>> partition
>>>>
>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>                                 compatible = "micron,n25q512ax3",
>>>> "jedec,spi-nor";
>>>>                                 reg = <0>;
>>>>                                 spi-max-frequency = <10000000>; /*
>>>> input clock */
>>>> +
>>>> +                               partition@u-boot {
>>>> +                                        reg = <0x00000000 0x00100000>;
>>>> +                                        label = "u-boot";
>>>> +                                };
>>>> +                                partition@u-boot-env {
>>>> +                                        reg = <0x00100000 0x00010000>;
>>>> +                                        label = "u-boot-env";
>>>> +                                };
>>>>                         };
>>>>                 };
>>>>
>>>> And I'm using the following script to poke at the environment
>>>> (warning if anyone does try this and the bug hits it can render your
>>>> u-boot environment invalid).
>>>>
>>>> cat flash/fw_env_test.sh
>>>> #!/bin/sh
>>>>
>>>> generate_fw_env_config()
>>>> {
>>>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>> name ; do
>>>>      echo "$dev $size $erasesize $name"
>>>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>> $erasesize" >/flash/fw_env.config
>>>>   done
>>>> }
>>>>
>>>> cycles=10
>>>> [ $# -ge 1 ] && cycles=$1
>>>>
>>>> generate_fw_env_config
>>>>
>>>> fw_printenv -c /flash/fw_env.config
>>>>
>>>> dmesg -c >/dev/null
>>>> x=0
>>>> while [ $x -lt $cycles ]; do
>>>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>     dmesg -c | grep -q fsl_espi && break;
>>>>     let x=x+1
>>>> done
>>>>
>>>> echo "Ran $x cycles"
>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>
>>> With my current workaround of emptying the RX FIFO. It seems
>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>> possible to continually read the SR/FSR. But I've no idea why it
>>> happens some times and not others.
>> So I think I've got a reproduction and I think I've bisected the problem
>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>> C"). My day is just finishing now so I haven't applied too much scrutiny
>> to this result. Given the various rabbit holes I've been down on this
>> issue already I'd take this information with a good degree of skepticism.
>>
> OK, so an easy test should be to re-test with a 5.4 kernel.
> It doesn't have yet the change you're referring to, and the fsl-espi driver
> is basically the same as in 5.7 (just two small changes in 5.7).

Cherry-picking 3282a3da25bd onto 5.4.x I'm able to reproduce the problem
on the T2080RDB (if anyone want's it I've got a slightly different
version of my test above that doesn't require touching the dts).

I'm not really sure how 3282a3da25bd affects this. It could be exposing
a latent issue, the fsl-espi interface does seem a bit sensitive to the
timing of events.

2020-08-27 07:14:17

by Nicholas Piggin

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
> On 26.08.2020 08:07, Chris Packham wrote:
>>
>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>
>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>
>>>> <snip>
>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>> to my
>>>>>> issue.
>>>>>>
>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>> can be
>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>> register.
>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>> Consider
>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>
>>>>>>    tx = READ_OP + ADDR
>>>>>>    rx = data
>>>>>>
>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>> goes
>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>> continues until we've received all the data and we finish with
>>>>>> ESPI_SPIE
>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>> isn't set.
>>>>>>
>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>> amount of
>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>> the fifo.
>>>>>>
>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>> transferred, and read the remaining characters from the FIFO.
>>>>
>>>> The T2080RM that I have says the following about the DON bit
>>>>
>>>> "Last character was transmitted. The last character was transmitted
>>>> and a new command can be written for the next frame."
>>>>
>>>> That does at least seem to fit with my assertion that it's all about
>>>> the TX direction. But the fact that it doesn't happen all the time
>>>> throws some doubt on it.
>>>>
>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>> chip can
>>>>>> fill the CPUs rx fifo.
>>>>>>
>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>> or even less, then timing shouldn't be an issue.
>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>> narrative.
>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>> And yours is the first such report I see. So question is what could
>>>>> be so
>>>>> special with your setup that it seems you're the only one being
>>>>> affected.
>>>>> The scenarios you describe are standard, therefore much more people
>>>>> should be affected in case of a driver bug.
>>>> Agreed. But even on my hardware (which may have a latent issue
>>>> despite being in the field for going on 5 years) the issue only
>>>> triggers under some fairly specific circumstances.
>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>> the differences be related to SPI.
>>>>
>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>> impact but every time I found something that seemed to be having an
>>>> impact I've been able to disprove it. I actually think its about how
>>>> busy the system is which may or may not affect when we get round to
>>>> processing the interrupts.
>>>>
>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>> occur on the T2080RDB.
>>>>
>>>> I've had to add the following to expose the environment as a mtd
>>>> partition
>>>>
>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>                                 compatible = "micron,n25q512ax3",
>>>> "jedec,spi-nor";
>>>>                                 reg = <0>;
>>>>                                 spi-max-frequency = <10000000>; /*
>>>> input clock */
>>>> +
>>>> +                               partition@u-boot {
>>>> +                                        reg = <0x00000000 0x00100000>;
>>>> +                                        label = "u-boot";
>>>> +                                };
>>>> +                                partition@u-boot-env {
>>>> +                                        reg = <0x00100000 0x00010000>;
>>>> +                                        label = "u-boot-env";
>>>> +                                };
>>>>                         };
>>>>                 };
>>>>
>>>> And I'm using the following script to poke at the environment
>>>> (warning if anyone does try this and the bug hits it can render your
>>>> u-boot environment invalid).
>>>>
>>>> cat flash/fw_env_test.sh
>>>> #!/bin/sh
>>>>
>>>> generate_fw_env_config()
>>>> {
>>>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>> name ; do
>>>>      echo "$dev $size $erasesize $name"
>>>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>> $erasesize" >/flash/fw_env.config
>>>>   done
>>>> }
>>>>
>>>> cycles=10
>>>> [ $# -ge 1 ] && cycles=$1
>>>>
>>>> generate_fw_env_config
>>>>
>>>> fw_printenv -c /flash/fw_env.config
>>>>
>>>> dmesg -c >/dev/null
>>>> x=0
>>>> while [ $x -lt $cycles ]; do
>>>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>     dmesg -c | grep -q fsl_espi && break;
>>>>     let x=x+1
>>>> done
>>>>
>>>> echo "Ran $x cycles"
>>>
>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>
>>> With my current workaround of emptying the RX FIFO. It seems
>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>> possible to continually read the SR/FSR. But I've no idea why it
>>> happens some times and not others.
>>
>> So I think I've got a reproduction and I think I've bisected the problem
>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>> C"). My day is just finishing now so I haven't applied too much scrutiny
>> to this result. Given the various rabbit holes I've been down on this
>> issue already I'd take this information with a good degree of skepticism.
>>
> OK, so an easy test should be to re-test with a 5.4 kernel.
> It doesn't have yet the change you're referring to, and the fsl-espi driver
> is basically the same as in 5.7 (just two small changes in 5.7).

There's 6cc0c16d82f88 and maybe also other interrupt related patches
around this time that could affect book E, so it's good if that exact
patch is confirmed.

I've been staring at 3282a3da25bd for a while and nothing immediately
stands out. It doesn't look like the low level handlers do anything
special (well 0x900 does ack the decrementer, but so does the masked
handler).

Can you try this patch and also enable CONFIG_PPC_IRQ_SOFT_MASK_DEBUG?

Thanks,
Nick

---
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index bf21ebd36190..10d339042330 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -214,7 +214,7 @@ void replay_soft_interrupts(void)
struct pt_regs regs;

ppc_save_regs(&regs);
- regs.softe = IRQS_ALL_DISABLED;
+ regs.softe = IRQS_ENABLED;

again:
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
@@ -349,6 +349,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
WARN_ON_ONCE(!(mfmsr() & MSR_EE));
__hard_irq_disable();
+ local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
} else {
/*
* We should already be hard disabled here. We had bugs
@@ -368,6 +369,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
}
}

+ preempt_disable();
irq_soft_mask_set(IRQS_ALL_DISABLED);
trace_hardirqs_off();

@@ -377,6 +379,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
trace_hardirqs_on();
irq_soft_mask_set(IRQS_ENABLED);
__hard_irq_enable();
+ preempt_enable();
}
EXPORT_SYMBOL(arch_local_irq_restore);

2020-08-27 22:08:29

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 27/08/20 7:12 pm, Nicholas Piggin wrote:
> Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
>> On 26.08.2020 08:07, Chris Packham wrote:
>>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>>
>>>>> <snip>
>>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>>> to my
>>>>>>> issue.
>>>>>>>
>>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>>> can be
>>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>>> register.
>>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>>> Consider
>>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>>
>>>>>>>    tx = READ_OP + ADDR
>>>>>>>    rx = data
>>>>>>>
>>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>>> goes
>>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>>> continues until we've received all the data and we finish with
>>>>>>> ESPI_SPIE
>>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>>> isn't set.
>>>>>>>
>>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>>> amount of
>>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>>> the fifo.
>>>>>>>
>>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>>> transferred, and read the remaining characters from the FIFO.
>>>>> The T2080RM that I have says the following about the DON bit
>>>>>
>>>>> "Last character was transmitted. The last character was transmitted
>>>>> and a new command can be written for the next frame."
>>>>>
>>>>> That does at least seem to fit with my assertion that it's all about
>>>>> the TX direction. But the fact that it doesn't happen all the time
>>>>> throws some doubt on it.
>>>>>
>>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>>> chip can
>>>>>>> fill the CPUs rx fifo.
>>>>>>>
>>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>>> or even less, then timing shouldn't be an issue.
>>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>>> narrative.
>>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>>> And yours is the first such report I see. So question is what could
>>>>>> be so
>>>>>> special with your setup that it seems you're the only one being
>>>>>> affected.
>>>>>> The scenarios you describe are standard, therefore much more people
>>>>>> should be affected in case of a driver bug.
>>>>> Agreed. But even on my hardware (which may have a latent issue
>>>>> despite being in the field for going on 5 years) the issue only
>>>>> triggers under some fairly specific circumstances.
>>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>>> the differences be related to SPI.
>>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>>> impact but every time I found something that seemed to be having an
>>>>> impact I've been able to disprove it. I actually think its about how
>>>>> busy the system is which may or may not affect when we get round to
>>>>> processing the interrupts.
>>>>>
>>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>>> occur on the T2080RDB.
>>>>>
>>>>> I've had to add the following to expose the environment as a mtd
>>>>> partition
>>>>>
>>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>>                                 compatible = "micron,n25q512ax3",
>>>>> "jedec,spi-nor";
>>>>>                                 reg = <0>;
>>>>>                                 spi-max-frequency = <10000000>; /*
>>>>> input clock */
>>>>> +
>>>>> +                               partition@u-boot {
>>>>> +                                        reg = <0x00000000 0x00100000>;
>>>>> +                                        label = "u-boot";
>>>>> +                                };
>>>>> +                                partition@u-boot-env {
>>>>> +                                        reg = <0x00100000 0x00010000>;
>>>>> +                                        label = "u-boot-env";
>>>>> +                                };
>>>>>                         };
>>>>>                 };
>>>>>
>>>>> And I'm using the following script to poke at the environment
>>>>> (warning if anyone does try this and the bug hits it can render your
>>>>> u-boot environment invalid).
>>>>>
>>>>> cat flash/fw_env_test.sh
>>>>> #!/bin/sh
>>>>>
>>>>> generate_fw_env_config()
>>>>> {
>>>>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>>> name ; do
>>>>>      echo "$dev $size $erasesize $name"
>>>>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>>> $erasesize" >/flash/fw_env.config
>>>>>   done
>>>>> }
>>>>>
>>>>> cycles=10
>>>>> [ $# -ge 1 ] && cycles=$1
>>>>>
>>>>> generate_fw_env_config
>>>>>
>>>>> fw_printenv -c /flash/fw_env.config
>>>>>
>>>>> dmesg -c >/dev/null
>>>>> x=0
>>>>> while [ $x -lt $cycles ]; do
>>>>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>>     dmesg -c | grep -q fsl_espi && break;
>>>>>     let x=x+1
>>>>> done
>>>>>
>>>>> echo "Ran $x cycles"
>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>
>>>> With my current workaround of emptying the RX FIFO. It seems
>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>
>>>> fsl_espi ffe110000.spi: tx 70
>>>> fsl_espi ffe110000.spi: rx 03
>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>
>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>> happens some times and not others.
>>> So I think I've got a reproduction and I think I've bisected the problem
>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>> to this result. Given the various rabbit holes I've been down on this
>>> issue already I'd take this information with a good degree of skepticism.
>>>
>> OK, so an easy test should be to re-test with a 5.4 kernel.
>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>> is basically the same as in 5.7 (just two small changes in 5.7).
> There's 6cc0c16d82f88 and maybe also other interrupt related patches
> around this time that could affect book E, so it's good if that exact
> patch is confirmed.

My confirmation is basically that I can induce the issue in a 5.4 kernel
by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
5.9-rc2 by reverting that one commit.

I both cases it's not exactly a clean cherry-pick/revert so I also
confirmed the bisection result by building at 3282a3da25bd (which sees
the issue) and the commit just before (which does not).

> I've been staring at 3282a3da25bd for a while and nothing immediately
> stands out. It doesn't look like the low level handlers do anything
> special (well 0x900 does ack the decrementer, but so does the masked
> handler).
>
> Can you try this patch and also enable CONFIG_PPC_IRQ_SOFT_MASK_DEBUG?
>
> Thanks,
> Nick
>
> ---
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index bf21ebd36190..10d339042330 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -214,7 +214,7 @@ void replay_soft_interrupts(void)
> struct pt_regs regs;
>
> ppc_save_regs(&regs);
> - regs.softe = IRQS_ALL_DISABLED;
> + regs.softe = IRQS_ENABLED;
>
> again:
> if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
> @@ -349,6 +349,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
> if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
> WARN_ON_ONCE(!(mfmsr() & MSR_EE));
> __hard_irq_disable();
> + local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
> } else {
> /*
> * We should already be hard disabled here. We had bugs
> @@ -368,6 +369,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
> }
> }
>
> + preempt_disable();
> irq_soft_mask_set(IRQS_ALL_DISABLED);
> trace_hardirqs_off();
>
> @@ -377,6 +379,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
> trace_hardirqs_on();
> irq_soft_mask_set(IRQS_ENABLED);
> __hard_irq_enable();
> + preempt_enable();
> }
> EXPORT_SYMBOL(arch_local_irq_restore);
>
I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
didn't report anything (either with or without the change above).

2020-08-30 12:32:28

by Nicholas Piggin

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> On 27/08/20 7:12 pm, Nicholas Piggin wrote:
>> Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
>>> On 26.08.2020 08:07, Chris Packham wrote:
>>>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>>>
>>>>>> <snip>
>>>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>>>> to my
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>>>> can be
>>>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>>>> register.
>>>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>>>> Consider
>>>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>>>
>>>>>>>>    tx = READ_OP + ADDR
>>>>>>>>    rx = data
>>>>>>>>
>>>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>>>> goes
>>>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>>>> continues until we've received all the data and we finish with
>>>>>>>> ESPI_SPIE
>>>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>>>> isn't set.
>>>>>>>>
>>>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>>>> amount of
>>>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>>>> the fifo.
>>>>>>>>
>>>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>>>> transferred, and read the remaining characters from the FIFO.
>>>>>> The T2080RM that I have says the following about the DON bit
>>>>>>
>>>>>> "Last character was transmitted. The last character was transmitted
>>>>>> and a new command can be written for the next frame."
>>>>>>
>>>>>> That does at least seem to fit with my assertion that it's all about
>>>>>> the TX direction. But the fact that it doesn't happen all the time
>>>>>> throws some doubt on it.
>>>>>>
>>>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>>>> chip can
>>>>>>>> fill the CPUs rx fifo.
>>>>>>>>
>>>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>>>> or even less, then timing shouldn't be an issue.
>>>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>>>> narrative.
>>>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>>>> And yours is the first such report I see. So question is what could
>>>>>>> be so
>>>>>>> special with your setup that it seems you're the only one being
>>>>>>> affected.
>>>>>>> The scenarios you describe are standard, therefore much more people
>>>>>>> should be affected in case of a driver bug.
>>>>>> Agreed. But even on my hardware (which may have a latent issue
>>>>>> despite being in the field for going on 5 years) the issue only
>>>>>> triggers under some fairly specific circumstances.
>>>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>>>> the differences be related to SPI.
>>>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>>>> impact but every time I found something that seemed to be having an
>>>>>> impact I've been able to disprove it. I actually think its about how
>>>>>> busy the system is which may or may not affect when we get round to
>>>>>> processing the interrupts.
>>>>>>
>>>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>>>> occur on the T2080RDB.
>>>>>>
>>>>>> I've had to add the following to expose the environment as a mtd
>>>>>> partition
>>>>>>
>>>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>>>                                 compatible = "micron,n25q512ax3",
>>>>>> "jedec,spi-nor";
>>>>>>                                 reg = <0>;
>>>>>>                                 spi-max-frequency = <10000000>; /*
>>>>>> input clock */
>>>>>> +
>>>>>> +                               partition@u-boot {
>>>>>> +                                        reg = <0x00000000 0x00100000>;
>>>>>> +                                        label = "u-boot";
>>>>>> +                                };
>>>>>> +                                partition@u-boot-env {
>>>>>> +                                        reg = <0x00100000 0x00010000>;
>>>>>> +                                        label = "u-boot-env";
>>>>>> +                                };
>>>>>>                         };
>>>>>>                 };
>>>>>>
>>>>>> And I'm using the following script to poke at the environment
>>>>>> (warning if anyone does try this and the bug hits it can render your
>>>>>> u-boot environment invalid).
>>>>>>
>>>>>> cat flash/fw_env_test.sh
>>>>>> #!/bin/sh
>>>>>>
>>>>>> generate_fw_env_config()
>>>>>> {
>>>>>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>>>> name ; do
>>>>>>      echo "$dev $size $erasesize $name"
>>>>>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>>>> $erasesize" >/flash/fw_env.config
>>>>>>   done
>>>>>> }
>>>>>>
>>>>>> cycles=10
>>>>>> [ $# -ge 1 ] && cycles=$1
>>>>>>
>>>>>> generate_fw_env_config
>>>>>>
>>>>>> fw_printenv -c /flash/fw_env.config
>>>>>>
>>>>>> dmesg -c >/dev/null
>>>>>> x=0
>>>>>> while [ $x -lt $cycles ]; do
>>>>>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>>>     dmesg -c | grep -q fsl_espi && break;
>>>>>>     let x=x+1
>>>>>> done
>>>>>>
>>>>>> echo "Ran $x cycles"
>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>
>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>
>>>>> fsl_espi ffe110000.spi: tx 70
>>>>> fsl_espi ffe110000.spi: rx 03
>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>
>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>> happens some times and not others.
>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>> to this result. Given the various rabbit holes I've been down on this
>>>> issue already I'd take this information with a good degree of skepticism.
>>>>
>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>> is basically the same as in 5.7 (just two small changes in 5.7).
>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>> around this time that could affect book E, so it's good if that exact
>> patch is confirmed.
>
> My confirmation is basically that I can induce the issue in a 5.4 kernel
> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
> 5.9-rc2 by reverting that one commit.
>
> I both cases it's not exactly a clean cherry-pick/revert so I also
> confirmed the bisection result by building at 3282a3da25bd (which sees
> the issue) and the commit just before (which does not).

Thanks for testing, that confirms it well.

[snip patch]

> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
> didn't report anything (either with or without the change above).

Okay, it was a bit of a shot in the dark. I still can't see what
else has changed.

What would cause this, a lost interrupt? A spurious interrupt? Or
higher interrupt latency?

I don't think the patch should cause significantly worse latency,
(it's supposed to be a bit better if anything because it doesn't set
up the full interrupt frame). But it's possible.

Thanks,
Nick

2020-08-30 14:03:43

by Heiner Kallweit

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 30.08.2020 14:30, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>> On 27/08/20 7:12 pm, Nicholas Piggin wrote:
>>> Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
>>>> On 26.08.2020 08:07, Chris Packham wrote:
>>>>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>>>>
>>>>>>> <snip>
>>>>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>>>>> to my
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>>>>> can be
>>>>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>>>>> register.
>>>>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>>>>> Consider
>>>>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>>>>
>>>>>>>>>    tx = READ_OP + ADDR
>>>>>>>>>    rx = data
>>>>>>>>>
>>>>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>>>>> goes
>>>>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>>>>> continues until we've received all the data and we finish with
>>>>>>>>> ESPI_SPIE
>>>>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>>>>> isn't set.
>>>>>>>>>
>>>>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>>>>> amount of
>>>>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>>>>> the fifo.
>>>>>>>>>
>>>>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>>>>> transferred, and read the remaining characters from the FIFO.
>>>>>>> The T2080RM that I have says the following about the DON bit
>>>>>>>
>>>>>>> "Last character was transmitted. The last character was transmitted
>>>>>>> and a new command can be written for the next frame."
>>>>>>>
>>>>>>> That does at least seem to fit with my assertion that it's all about
>>>>>>> the TX direction. But the fact that it doesn't happen all the time
>>>>>>> throws some doubt on it.
>>>>>>>
>>>>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>>>>> chip can
>>>>>>>>> fill the CPUs rx fifo.
>>>>>>>>>
>>>>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>>>>> or even less, then timing shouldn't be an issue.
>>>>>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
>>>>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>>>>> narrative.
>>>>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>>>>> And yours is the first such report I see. So question is what could
>>>>>>>> be so
>>>>>>>> special with your setup that it seems you're the only one being
>>>>>>>> affected.
>>>>>>>> The scenarios you describe are standard, therefore much more people
>>>>>>>> should be affected in case of a driver bug.
>>>>>>> Agreed. But even on my hardware (which may have a latent issue
>>>>>>> despite being in the field for going on 5 years) the issue only
>>>>>>> triggers under some fairly specific circumstances.
>>>>>>>> You said that kernel config impacts how frequently the issue happens.
>>>>>>>> Therefore question is what's the diff in kernel config, and how could
>>>>>>>> the differences be related to SPI.
>>>>>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>>>>>> impact but every time I found something that seemed to be having an
>>>>>>> impact I've been able to disprove it. I actually think its about how
>>>>>>> busy the system is which may or may not affect when we get round to
>>>>>>> processing the interrupts.
>>>>>>>
>>>>>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>>>>>> occur on the T2080RDB.
>>>>>>>
>>>>>>> I've had to add the following to expose the environment as a mtd
>>>>>>> partition
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>>> index ff87e67c70da..fbf95fc1fd68 100644
>>>>>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>>>>>> @@ -116,6 +116,15 @@ flash@0 {
>>>>>>>                                 compatible = "micron,n25q512ax3",
>>>>>>> "jedec,spi-nor";
>>>>>>>                                 reg = <0>;
>>>>>>>                                 spi-max-frequency = <10000000>; /*
>>>>>>> input clock */
>>>>>>> +
>>>>>>> +                               partition@u-boot {
>>>>>>> +                                        reg = <0x00000000 0x00100000>;
>>>>>>> +                                        label = "u-boot";
>>>>>>> +                                };
>>>>>>> +                                partition@u-boot-env {
>>>>>>> +                                        reg = <0x00100000 0x00010000>;
>>>>>>> +                                        label = "u-boot-env";
>>>>>>> +                                };
>>>>>>>                         };
>>>>>>>                 };
>>>>>>>
>>>>>>> And I'm using the following script to poke at the environment
>>>>>>> (warning if anyone does try this and the bug hits it can render your
>>>>>>> u-boot environment invalid).
>>>>>>>
>>>>>>> cat flash/fw_env_test.sh
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> generate_fw_env_config()
>>>>>>> {
>>>>>>>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
>>>>>>> name ; do
>>>>>>>      echo "$dev $size $erasesize $name"
>>>>>>>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
>>>>>>> $erasesize" >/flash/fw_env.config
>>>>>>>   done
>>>>>>> }
>>>>>>>
>>>>>>> cycles=10
>>>>>>> [ $# -ge 1 ] && cycles=$1
>>>>>>>
>>>>>>> generate_fw_env_config
>>>>>>>
>>>>>>> fw_printenv -c /flash/fw_env.config
>>>>>>>
>>>>>>> dmesg -c >/dev/null
>>>>>>> x=0
>>>>>>> while [ $x -lt $cycles ]; do
>>>>>>>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>>>>>>>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>>>>>>>     dmesg -c | grep -q fsl_espi && break;
>>>>>>>     let x=x+1
>>>>>>> done
>>>>>>>
>>>>>>> echo "Ran $x cycles"
>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>
>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>
>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>
>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>> happens some times and not others.
>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>
>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>> around this time that could affect book E, so it's good if that exact
>>> patch is confirmed.
>>
>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>> 5.9-rc2 by reverting that one commit.
>>
>> I both cases it's not exactly a clean cherry-pick/revert so I also
>> confirmed the bisection result by building at 3282a3da25bd (which sees
>> the issue) and the commit just before (which does not).
>
> Thanks for testing, that confirms it well.
>
> [snip patch]
>
>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>> didn't report anything (either with or without the change above).
>
> Okay, it was a bit of a shot in the dark. I still can't see what
> else has changed.
>
> What would cause this, a lost interrupt? A spurious interrupt? Or
> higher interrupt latency?
>
> I don't think the patch should cause significantly worse latency,
> (it's supposed to be a bit better if anything because it doesn't set
> up the full interrupt frame). But it's possible.
>
> Thanks,
> Nick
>

As additional background:
When I refactored and extended the fsl-espi driver my test device was
a P1014-based WiFi router (running at 800MHz CPU clock). I still use it
and operate the SPI NOR (s25fl128s) at 50MHz in dual read mode
(with a 4.9 kernel). I never had a problem.
If the issue now occurs on a system with most likely faster CPU and at
a much lower SPI bus frequency, then something must have a significant
impact on interrupt performance. However I can't say exactly which kind
of interrupt issue should cause the SPI issue we see.

Heiner

2020-08-30 21:02:30

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 31/08/20 12:30 am, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:

<snip>

>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>
>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>
>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>
>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>> happens some times and not others.
>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>
>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>> around this time that could affect book E, so it's good if that exact
>>> patch is confirmed.
>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>> 5.9-rc2 by reverting that one commit.
>>
>> I both cases it's not exactly a clean cherry-pick/revert so I also
>> confirmed the bisection result by building at 3282a3da25bd (which sees
>> the issue) and the commit just before (which does not).
> Thanks for testing, that confirms it well.
>
> [snip patch]
>
>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>> didn't report anything (either with or without the change above).
> Okay, it was a bit of a shot in the dark. I still can't see what
> else has changed.
>
> What would cause this, a lost interrupt? A spurious interrupt? Or
> higher interrupt latency?
>
> I don't think the patch should cause significantly worse latency,
> (it's supposed to be a bit better if anything because it doesn't set
> up the full interrupt frame). But it's possible.

My working theory is that the SPI_DON indication is all about the TX
direction an now that the interrupts are faster we're hitting an error
because there is still RX activity going on. Heiner disagrees with my
interpretation of the SPI_DON indication and the fact that it doesn't
happen every time does throw doubt on it.

I can't really explain the extra RX byte in the fifo. We know how many
bytes to expect and we pull that many from the fifo so it's not as if
we're missing an interrupt causing us to skip the last byte. I've been
looking for some kind of off-by-one calculation but again if it were
something like that it'd happen all the time.

2020-08-30 21:42:21

by Heiner Kallweit

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 30.08.2020 23:00, Chris Packham wrote:
>
> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>
> <snip>
>
>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>
>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>
>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>
>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>> happens some times and not others.
>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>
>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>> around this time that could affect book E, so it's good if that exact
>>>> patch is confirmed.
>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>> 5.9-rc2 by reverting that one commit.
>>>
>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>> the issue) and the commit just before (which does not).
>> Thanks for testing, that confirms it well.
>>
>> [snip patch]
>>
>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>> didn't report anything (either with or without the change above).
>> Okay, it was a bit of a shot in the dark. I still can't see what
>> else has changed.
>>
>> What would cause this, a lost interrupt? A spurious interrupt? Or
>> higher interrupt latency?
>>
>> I don't think the patch should cause significantly worse latency,
>> (it's supposed to be a bit better if anything because it doesn't set
>> up the full interrupt frame). But it's possible.
>
> My working theory is that the SPI_DON indication is all about the TX
> direction an now that the interrupts are faster we're hitting an error
> because there is still RX activity going on. Heiner disagrees with my
> interpretation of the SPI_DON indication and the fact that it doesn't
> happen every time does throw doubt on it.
>
It's right that the eSPI spec can be interpreted that SPI_DON refers to
TX only. However this wouldn't really make sense, because also for RX
we program the frame length, and therefore want to be notified once the
full frame was received. Also practical experience shows that SPI_DON
is set also after RX-only transfers.
Typical SPI NOR use case is that you write read command + start address,
followed by a longer read. If the TX-only interpretation would be right,
we'd always end up with SPI_DON not being set.

> I can't really explain the extra RX byte in the fifo. We know how many
> bytes to expect and we pull that many from the fifo so it's not as if
> we're missing an interrupt causing us to skip the last byte. I've been
> looking for some kind of off-by-one calculation but again if it were
> something like that it'd happen all the time.
>
Maybe it helps to know what value this extra byte in the FIFO has. Is it:
- a duplicate of the last read byte
- or the next byte (at <end address> + 1)
- or a fixed value, e.g. always 0x00 or 0xff


2020-08-30 22:02:31

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 31/08/20 9:41 am, Heiner Kallweit wrote:
> On 30.08.2020 23:00, Chris Packham wrote:
>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>> <snip>
>>
>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>
>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>
>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>
>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>> happens some times and not others.
>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>
>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>> around this time that could affect book E, so it's good if that exact
>>>>> patch is confirmed.
>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>> 5.9-rc2 by reverting that one commit.
>>>>
>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>> the issue) and the commit just before (which does not).
>>> Thanks for testing, that confirms it well.
>>>
>>> [snip patch]
>>>
>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>> didn't report anything (either with or without the change above).
>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>> else has changed.
>>>
>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>> higher interrupt latency?
>>>
>>> I don't think the patch should cause significantly worse latency,
>>> (it's supposed to be a bit better if anything because it doesn't set
>>> up the full interrupt frame). But it's possible.
>> My working theory is that the SPI_DON indication is all about the TX
>> direction an now that the interrupts are faster we're hitting an error
>> because there is still RX activity going on. Heiner disagrees with my
>> interpretation of the SPI_DON indication and the fact that it doesn't
>> happen every time does throw doubt on it.
>>
> It's right that the eSPI spec can be interpreted that SPI_DON refers to
> TX only. However this wouldn't really make sense, because also for RX
> we program the frame length, and therefore want to be notified once the
> full frame was received. Also practical experience shows that SPI_DON
> is set also after RX-only transfers.
> Typical SPI NOR use case is that you write read command + start address,
> followed by a longer read. If the TX-only interpretation would be right,
> we'd always end up with SPI_DON not being set.
>
>> I can't really explain the extra RX byte in the fifo. We know how many
>> bytes to expect and we pull that many from the fifo so it's not as if
>> we're missing an interrupt causing us to skip the last byte. I've been
>> looking for some kind of off-by-one calculation but again if it were
>> something like that it'd happen all the time.
>>
> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
> - a duplicate of the last read byte
> - or the next byte (at <end address> + 1)
> - or a fixed value, e.g. always 0x00 or 0xff

The values were up thread a bit but I'll repeat them here

fsl_espi ffe110000.spi: tx 70
fsl_espi ffe110000.spi: rx 03
fsl_espi ffe110000.spi: Extra RX 00
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03


The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
either a READ_SR or a READ_FSR. Never a data read.

2020-08-31 12:36:53

by Heiner Kallweit

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 30.08.2020 23:59, Chris Packham wrote:
>
> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:00, Chris Packham wrote:
>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>> <snip>
>>>
>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>
>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>
>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>
>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>> happens some times and not others.
>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>
>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>> patch is confirmed.
>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>> 5.9-rc2 by reverting that one commit.
>>>>>
>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>> the issue) and the commit just before (which does not).
>>>> Thanks for testing, that confirms it well.
>>>>
>>>> [snip patch]
>>>>
>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>> didn't report anything (either with or without the change above).
>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>> else has changed.
>>>>
>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>> higher interrupt latency?
>>>>
>>>> I don't think the patch should cause significantly worse latency,
>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>> up the full interrupt frame). But it's possible.
>>> My working theory is that the SPI_DON indication is all about the TX
>>> direction an now that the interrupts are faster we're hitting an error
>>> because there is still RX activity going on. Heiner disagrees with my
>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>> happen every time does throw doubt on it.
>>>
>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>> TX only. However this wouldn't really make sense, because also for RX
>> we program the frame length, and therefore want to be notified once the
>> full frame was received. Also practical experience shows that SPI_DON
>> is set also after RX-only transfers.
>> Typical SPI NOR use case is that you write read command + start address,
>> followed by a longer read. If the TX-only interpretation would be right,
>> we'd always end up with SPI_DON not being set.
>>
>>> I can't really explain the extra RX byte in the fifo. We know how many
>>> bytes to expect and we pull that many from the fifo so it's not as if
>>> we're missing an interrupt causing us to skip the last byte. I've been
>>> looking for some kind of off-by-one calculation but again if it were
>>> something like that it'd happen all the time.
>>>
>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>> - a duplicate of the last read byte
>> - or the next byte (at <end address> + 1)
>> - or a fixed value, e.g. always 0x00 or 0xff
>
> The values were up thread a bit but I'll repeat them here
>
> fsl_espi ffe110000.spi: tx 70
> fsl_espi ffe110000.spi: rx 03
> fsl_espi ffe110000.spi: Extra RX 00
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe110000.spi: tx 05
> fsl_espi ffe110000.spi: rx 00
> fsl_espi ffe110000.spi: Extra RX 03
>
>
> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
> either a READ_SR or a READ_FSR. Never a data read.
>
Just remembered something about SPIE_DON:
Transfers are always full duplex, therefore in case of a read the chip
sends dummy zero's. Having said that in case of a read SPIE_DON means
that the last dummy zero was shifted out.

READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
So the issue may have a dependency on the length of the transfer.
However I see no good explanation so far. You can try adding a delay of
a few miroseconds between the following to commands in fsl_espi_bufs().

fsl_espi_write_reg(espi, ESPI_SPIM, mask);

/* Prevent filling the fifo from getting interrupted */
spin_lock_irq(&espi->lock);

Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.

2020-09-01 01:27:27

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 1/09/20 12:33 am, Heiner Kallweit wrote:
> On 30.08.2020 23:59, Chris Packham wrote:
>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>> <snip>
>>>>
>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>
>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>
>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>> happens some times and not others.
>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>
>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>> patch is confirmed.
>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>
>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>> the issue) and the commit just before (which does not).
>>>>> Thanks for testing, that confirms it well.
>>>>>
>>>>> [snip patch]
>>>>>
>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>> didn't report anything (either with or without the change above).
>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>> else has changed.
>>>>>
>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>> higher interrupt latency?
>>>>>
>>>>> I don't think the patch should cause significantly worse latency,
>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>> up the full interrupt frame). But it's possible.
>>>> My working theory is that the SPI_DON indication is all about the TX
>>>> direction an now that the interrupts are faster we're hitting an error
>>>> because there is still RX activity going on. Heiner disagrees with my
>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>> happen every time does throw doubt on it.
>>>>
>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>> TX only. However this wouldn't really make sense, because also for RX
>>> we program the frame length, and therefore want to be notified once the
>>> full frame was received. Also practical experience shows that SPI_DON
>>> is set also after RX-only transfers.
>>> Typical SPI NOR use case is that you write read command + start address,
>>> followed by a longer read. If the TX-only interpretation would be right,
>>> we'd always end up with SPI_DON not being set.
>>>
>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>> looking for some kind of off-by-one calculation but again if it were
>>>> something like that it'd happen all the time.
>>>>
>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>> - a duplicate of the last read byte
>>> - or the next byte (at <end address> + 1)
>>> - or a fixed value, e.g. always 0x00 or 0xff
>> The values were up thread a bit but I'll repeat them here
>>
>> fsl_espi ffe110000.spi: tx 70
>> fsl_espi ffe110000.spi: rx 03
>> fsl_espi ffe110000.spi: Extra RX 00
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>>
>>
>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>> either a READ_SR or a READ_FSR. Never a data read.
>>
> Just remembered something about SPIE_DON:
> Transfers are always full duplex, therefore in case of a read the chip
> sends dummy zero's. Having said that in case of a read SPIE_DON means
> that the last dummy zero was shifted out.
>
> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
> So the issue may have a dependency on the length of the transfer.
> However I see no good explanation so far. You can try adding a delay of
> a few miroseconds between the following to commands in fsl_espi_bufs().
>
> fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>
> /* Prevent filling the fifo from getting interrupted */
> spin_lock_irq(&espi->lock);
>
> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
I think this might be heading in the right direction. Playing about with
a delay does seem to make the two symptoms less likely. Although I have
to set it quite high (i.e. msleep(100)) to completely avoid any
possibility of seeing either message.

2020-09-01 02:56:33

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 1/09/20 12:33 am, Heiner Kallweit wrote:
> On 30.08.2020 23:59, Chris Packham wrote:
>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>> <snip>
>>>>
>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>
>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>
>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>
>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>> happens some times and not others.
>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>
>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>> patch is confirmed.
>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>
>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>> the issue) and the commit just before (which does not).
>>>>> Thanks for testing, that confirms it well.
>>>>>
>>>>> [snip patch]
>>>>>
>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>> didn't report anything (either with or without the change above).
>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>> else has changed.
>>>>>
>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>> higher interrupt latency?
>>>>>
>>>>> I don't think the patch should cause significantly worse latency,
>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>> up the full interrupt frame). But it's possible.
>>>> My working theory is that the SPI_DON indication is all about the TX
>>>> direction an now that the interrupts are faster we're hitting an error
>>>> because there is still RX activity going on. Heiner disagrees with my
>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>> happen every time does throw doubt on it.
>>>>
>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>> TX only. However this wouldn't really make sense, because also for RX
>>> we program the frame length, and therefore want to be notified once the
>>> full frame was received. Also practical experience shows that SPI_DON
>>> is set also after RX-only transfers.
>>> Typical SPI NOR use case is that you write read command + start address,
>>> followed by a longer read. If the TX-only interpretation would be right,
>>> we'd always end up with SPI_DON not being set.
>>>
>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>> looking for some kind of off-by-one calculation but again if it were
>>>> something like that it'd happen all the time.
>>>>
>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>> - a duplicate of the last read byte
>>> - or the next byte (at <end address> + 1)
>>> - or a fixed value, e.g. always 0x00 or 0xff
>> The values were up thread a bit but I'll repeat them here
>>
>> fsl_espi ffe110000.spi: tx 70
>> fsl_espi ffe110000.spi: rx 03
>> fsl_espi ffe110000.spi: Extra RX 00
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe110000.spi: tx 05
>> fsl_espi ffe110000.spi: rx 00
>> fsl_espi ffe110000.spi: Extra RX 03
>>
>>
>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>> either a READ_SR or a READ_FSR. Never a data read.
>>
> Just remembered something about SPIE_DON:
> Transfers are always full duplex,

That's not true in rxskip mode. Setting rxskip forces the eSPI into
half-duplex mode. So far all the instances of "extra" rx bytes have been
when rxskip == 1.

I think that could be where our experience with SPIE_DON differ. In
full-duplex mode yes DON is always set. In half duplex mode we can end
up with DON set or not set depending on the interrupt timing.

> therefore in case of a read the chip
> sends dummy zero's. Having said that in case of a read SPIE_DON means
> that the last dummy zero was shifted out.
>
> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
> So the issue may have a dependency on the length of the transfer.
> However I see no good explanation so far. You can try adding a delay of
> a few miroseconds between the following to commands in fsl_espi_bufs().
>
> fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>
> /* Prevent filling the fifo from getting interrupted */
> spin_lock_irq(&espi->lock);
>
> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.

2020-09-01 06:15:57

by Nicholas Piggin

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>
> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:59, Chris Packham wrote:
>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>> <snip>
>>>>>
>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>
>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>
>>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>> happens some times and not others.
>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>
>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>> patch is confirmed.
>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>
>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>> the issue) and the commit just before (which does not).
>>>>>> Thanks for testing, that confirms it well.
>>>>>>
>>>>>> [snip patch]
>>>>>>
>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>> didn't report anything (either with or without the change above).
>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>> else has changed.
>>>>>>
>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>> higher interrupt latency?
>>>>>>
>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>> up the full interrupt frame). But it's possible.
>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>> happen every time does throw doubt on it.
>>>>>
>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>> TX only. However this wouldn't really make sense, because also for RX
>>>> we program the frame length, and therefore want to be notified once the
>>>> full frame was received. Also practical experience shows that SPI_DON
>>>> is set also after RX-only transfers.
>>>> Typical SPI NOR use case is that you write read command + start address,
>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>> we'd always end up with SPI_DON not being set.
>>>>
>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>> something like that it'd happen all the time.
>>>>>
>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>> - a duplicate of the last read byte
>>>> - or the next byte (at <end address> + 1)
>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>> The values were up thread a bit but I'll repeat them here
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>>
>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>> either a READ_SR or a READ_FSR. Never a data read.
>>>
>> Just remembered something about SPIE_DON:
>> Transfers are always full duplex, therefore in case of a read the chip
>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>> that the last dummy zero was shifted out.
>>
>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>> So the issue may have a dependency on the length of the transfer.
>> However I see no good explanation so far. You can try adding a delay of
>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>
>> fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>
>> /* Prevent filling the fifo from getting interrupted */
>> spin_lock_irq(&espi->lock);
>>
>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
> I think this might be heading in the right direction. Playing about with
> a delay does seem to make the two symptoms less likely. Although I have
> to set it quite high (i.e. msleep(100)) to completely avoid any
> possibility of seeing either message.

The patch might replay the interrupt a little bit faster, but it would
be a few microseconds at most I think (just from improved code).

Would you be able to ftrace the interrupt handler function and see if you
can see a difference in number or timing of interrupts? I'm at a bit of
a loss.

Thanks,
Nick

2020-09-01 07:40:55

by Heiner Kallweit

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On 01.09.2020 04:52, Chris Packham wrote:
>
> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:59, Chris Packham wrote:
>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>> <snip>
>>>>>
>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>
>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>
>>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>> happens some times and not others.
>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>
>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>> patch is confirmed.
>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>
>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>> the issue) and the commit just before (which does not).
>>>>>> Thanks for testing, that confirms it well.
>>>>>>
>>>>>> [snip patch]
>>>>>>
>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>> didn't report anything (either with or without the change above).
>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>> else has changed.
>>>>>>
>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>> higher interrupt latency?
>>>>>>
>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>> up the full interrupt frame). But it's possible.
>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>> happen every time does throw doubt on it.
>>>>>
>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>> TX only. However this wouldn't really make sense, because also for RX
>>>> we program the frame length, and therefore want to be notified once the
>>>> full frame was received. Also practical experience shows that SPI_DON
>>>> is set also after RX-only transfers.
>>>> Typical SPI NOR use case is that you write read command + start address,
>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>> we'd always end up with SPI_DON not being set.
>>>>
>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>> something like that it'd happen all the time.
>>>>>
>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>> - a duplicate of the last read byte
>>>> - or the next byte (at <end address> + 1)
>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>> The values were up thread a bit but I'll repeat them here
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>>
>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>> either a READ_SR or a READ_FSR. Never a data read.
>>>
>> Just remembered something about SPIE_DON:
>> Transfers are always full duplex,
>
> That's not true in rxskip mode. Setting rxskip forces the eSPI into
> half-duplex mode. So far all the instances of "extra" rx bytes have been
> when rxskip == 1.
>
> I think that could be where our experience with SPIE_DON differ. In
> full-duplex mode yes DON is always set. In half duplex mode we can end
> up with DON set or not set depending on the interrupt timing.
>
Of course I'm not the chip designer, but I think under the hood rxskip
mode doesn't differ from normal full duplex. It just ignores the first
read bytes and sends dummy zero's once the read part starts.

And in normal SPI NOR use cases most transfers are rxskip:
send read command + read data

This can be simply tested / verified: send read command + read 31 bytes
That's less than the FIFO size, therefore only active interrupt source
is SPIE_DON. If your assumption is true the transfer would time out.

>> therefore in case of a read the chip
>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>> that the last dummy zero was shifted out.
>>
>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>> So the issue may have a dependency on the length of the transfer.
>> However I see no good explanation so far. You can try adding a delay of
>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>
>> fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>
>> /* Prevent filling the fifo from getting interrupted */
>> spin_lock_irq(&espi->lock);
>>
>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.

2020-09-01 23:31:23

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:59, Chris Packham wrote:
>>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>>> <snip>
>>>>>>
>>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>>
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>
>>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>>
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>
>>>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>>> happens some times and not others.
>>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>>
>>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>>> patch is confirmed.
>>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>>
>>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>>> the issue) and the commit just before (which does not).
>>>>>>> Thanks for testing, that confirms it well.
>>>>>>>
>>>>>>> [snip patch]
>>>>>>>
>>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>>> didn't report anything (either with or without the change above).
>>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>>> else has changed.
>>>>>>>
>>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>>> higher interrupt latency?
>>>>>>>
>>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>>> up the full interrupt frame). But it's possible.
>>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>>> happen every time does throw doubt on it.
>>>>>>
>>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>>> TX only. However this wouldn't really make sense, because also for RX
>>>>> we program the frame length, and therefore want to be notified once the
>>>>> full frame was received. Also practical experience shows that SPI_DON
>>>>> is set also after RX-only transfers.
>>>>> Typical SPI NOR use case is that you write read command + start address,
>>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>>> we'd always end up with SPI_DON not being set.
>>>>>
>>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>>> something like that it'd happen all the time.
>>>>>>
>>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>>> - a duplicate of the last read byte
>>>>> - or the next byte (at <end address> + 1)
>>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>>> The values were up thread a bit but I'll repeat them here
>>>>
>>>> fsl_espi ffe110000.spi: tx 70
>>>> fsl_espi ffe110000.spi: rx 03
>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>
>>>>
>>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>>> either a READ_SR or a READ_FSR. Never a data read.
>>>>
>>> Just remembered something about SPIE_DON:
>>> Transfers are always full duplex, therefore in case of a read the chip
>>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>>> that the last dummy zero was shifted out.
>>>
>>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>>> So the issue may have a dependency on the length of the transfer.
>>> However I see no good explanation so far. You can try adding a delay of
>>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>>
>>> fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>>
>>> /* Prevent filling the fifo from getting interrupted */
>>> spin_lock_irq(&espi->lock);
>>>
>>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
>> I think this might be heading in the right direction. Playing about with
>> a delay does seem to make the two symptoms less likely. Although I have
>> to set it quite high (i.e. msleep(100)) to completely avoid any
>> possibility of seeing either message.
> The patch might replay the interrupt a little bit faster, but it would
> be a few microseconds at most I think (just from improved code).
>
> Would you be able to ftrace the interrupt handler function and see if you
> can see a difference in number or timing of interrupts? I'm at a bit of
> a loss.
This is getting really weird. I was setting up to run with ftrace and
found I couldn't reproduce it on the tip of Linus's tree (currently
pointing at e7a522c83b86). But I swear I could last week. Sure enough if
I checkout 5.9-rc2 (or 5.7.15) I can reproduce the problem again.
> Thanks,
> Nick
>

2020-09-01 23:33:40

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 2/09/20 11:29 am, Chris Packham wrote:
>
> On 1/09/20 6:14 pm, Nicholas Piggin wrote:
>> Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>>> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:59, Chris Packham wrote:
>>>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>>>> <snip>
>>>>>>>
>>>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the
>>>>>>>>>>>>> T2080RDB
>>>>>>>>>>>>>
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>>
>>>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra
>>>>>>>>>>>>> byte in the
>>>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a
>>>>>>>>>>>>> READ_FSR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's
>>>>>>>>>>>>> aren't empty!
>>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>>
>>>>>>>>>>>>>      From all the Micron SPI-NOR datasheets I've got
>>>>>>>>>>>>> access to it is
>>>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea
>>>>>>>>>>>>> why it
>>>>>>>>>>>>> happens some times and not others.
>>>>>>>>>>>> So I think I've got a reproduction and I think I've
>>>>>>>>>>>> bisected the problem
>>>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft
>>>>>>>>>>>> interrupt replay in
>>>>>>>>>>>> C"). My day is just finishing now so I haven't applied too
>>>>>>>>>>>> much scrutiny
>>>>>>>>>>>> to this result. Given the various rabbit holes I've been
>>>>>>>>>>>> down on this
>>>>>>>>>>>> issue already I'd take this information with a good degree
>>>>>>>>>>>> of skepticism.
>>>>>>>>>>>>
>>>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>>>> It doesn't have yet the change you're referring to, and the
>>>>>>>>>>> fsl-espi driver
>>>>>>>>>>> is basically the same as in 5.7 (just two small changes in
>>>>>>>>>>> 5.7).
>>>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related
>>>>>>>>>> patches
>>>>>>>>>> around this time that could affect book E, so it's good if
>>>>>>>>>> that exact
>>>>>>>>>> patch is confirmed.
>>>>>>>>> My confirmation is basically that I can induce the issue in a
>>>>>>>>> 5.4 kernel
>>>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the
>>>>>>>>> issue in
>>>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>>>
>>>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I
>>>>>>>>> also
>>>>>>>>> confirmed the bisection result by building at 3282a3da25bd
>>>>>>>>> (which sees
>>>>>>>>> the issue) and the commit just before (which does not).
>>>>>>>> Thanks for testing, that confirms it well.
>>>>>>>>
>>>>>>>> [snip patch]
>>>>>>>>
>>>>>>>>> I still saw the issue with this change applied.
>>>>>>>>> PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>>>> didn't report anything (either with or without the change above).
>>>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>>>> else has changed.
>>>>>>>>
>>>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>>>> higher interrupt latency?
>>>>>>>>
>>>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>>>> (it's supposed to be a bit better if anything because it
>>>>>>>> doesn't set
>>>>>>>> up the full interrupt frame). But it's possible.
>>>>>>> My working theory is that the SPI_DON indication is all about
>>>>>>> the TX
>>>>>>> direction an now that the interrupts are faster we're hitting an
>>>>>>> error
>>>>>>> because there is still RX activity going on. Heiner disagrees
>>>>>>> with my
>>>>>>> interpretation of the SPI_DON indication and the fact that it
>>>>>>> doesn't
>>>>>>> happen every time does throw doubt on it.
>>>>>>>
>>>>>> It's right that the eSPI spec can be interpreted that SPI_DON
>>>>>> refers to
>>>>>> TX only. However this wouldn't really make sense, because also
>>>>>> for RX
>>>>>> we program the frame length, and therefore want to be notified
>>>>>> once the
>>>>>> full frame was received. Also practical experience shows that
>>>>>> SPI_DON
>>>>>> is set also after RX-only transfers.
>>>>>> Typical SPI NOR use case is that you write read command + start
>>>>>> address,
>>>>>> followed by a longer read. If the TX-only interpretation would be
>>>>>> right,
>>>>>> we'd always end up with SPI_DON not being set.
>>>>>>
>>>>>>> I can't really explain the extra RX byte in the fifo. We know
>>>>>>> how many
>>>>>>> bytes to expect and we pull that many from the fifo so it's not
>>>>>>> as if
>>>>>>> we're missing an interrupt causing us to skip the last byte.
>>>>>>> I've been
>>>>>>> looking for some kind of off-by-one calculation but again if it
>>>>>>> were
>>>>>>> something like that it'd happen all the time.
>>>>>>>
>>>>>> Maybe it helps to know what value this extra byte in the FIFO
>>>>>> has. Is it:
>>>>>> - a duplicate of the last read byte
>>>>>> - or the next byte (at <end address> + 1)
>>>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>>>> The values were up thread a bit but I'll repeat them here
>>>>>
>>>>> fsl_espi ffe110000.spi: tx 70
>>>>> fsl_espi ffe110000.spi: rx 03
>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>> fsl_espi ffe110000.spi: tx 05
>>>>> fsl_espi ffe110000.spi: rx 00
>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>
>>>>>
>>>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen
>>>>> them with
>>>>> either a READ_SR or a READ_FSR. Never a data read.
>>>>>
>>>> Just remembered something about SPIE_DON:
>>>> Transfers are always full duplex, therefore in case of a read the chip
>>>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>>>> that the last dummy zero was shifted out.
>>>>
>>>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1
>>>> byte in.
>>>> So the issue may have a dependency on the length of the transfer.
>>>> However I see no good explanation so far. You can try adding a
>>>> delay of
>>>> a few miroseconds between the following to commands in
>>>> fsl_espi_bufs().
>>>>
>>>>     fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>>>
>>>>     /* Prevent filling the fifo from getting interrupted */
>>>>     spin_lock_irq(&espi->lock);
>>>>
>>>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too
>>>> close.
>>> I think this might be heading in the right direction. Playing about
>>> with
>>> a delay does seem to make the two symptoms less likely. Although I have
>>> to set it quite high (i.e. msleep(100)) to completely avoid any
>>> possibility of seeing either message.
>> The patch might replay the interrupt a little bit faster, but it would
>> be a few microseconds at most I think (just from improved code).
>>
>> Would you be able to ftrace the interrupt handler function and see if
>> you
>> can see a difference in number or timing of interrupts? I'm at a bit of
>> a loss.
> This is getting really weird. I was setting up to run with ftrace and
> found I couldn't reproduce it on the tip of Linus's tree (currently
> pointing at e7a522c83b86). But I swear I could last week. Sure enough
> if I checkout 5.9-rc2 (or 5.7.15) I can reproduce the problem again.

*Sigh* my master branch still has the interrupt changes reverted.

2020-09-04 00:00:00

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15


On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:59, Chris Packham wrote:
>>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>>> <snip>
>>>>>>
>>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>>
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>>
>>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>>
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>>
>>>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>>> happens some times and not others.
>>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>>
>>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>>> patch is confirmed.
>>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>>
>>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>>> the issue) and the commit just before (which does not).
>>>>>>> Thanks for testing, that confirms it well.
>>>>>>>
>>>>>>> [snip patch]
>>>>>>>
>>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>>> didn't report anything (either with or without the change above).
>>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>>> else has changed.
>>>>>>>
>>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>>> higher interrupt latency?
>>>>>>>
>>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>>> up the full interrupt frame). But it's possible.
>>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>>> happen every time does throw doubt on it.
>>>>>>
>>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>>> TX only. However this wouldn't really make sense, because also for RX
>>>>> we program the frame length, and therefore want to be notified once the
>>>>> full frame was received. Also practical experience shows that SPI_DON
>>>>> is set also after RX-only transfers.
>>>>> Typical SPI NOR use case is that you write read command + start address,
>>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>>> we'd always end up with SPI_DON not being set.
>>>>>
>>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>>> something like that it'd happen all the time.
>>>>>>
>>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>>> - a duplicate of the last read byte
>>>>> - or the next byte (at <end address> + 1)
>>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>>> The values were up thread a bit but I'll repeat them here
>>>>
>>>> fsl_espi ffe110000.spi: tx 70
>>>> fsl_espi ffe110000.spi: rx 03
>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>> fsl_espi ffe110000.spi: tx 05
>>>> fsl_espi ffe110000.spi: rx 00
>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>
>>>>
>>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>>> either a READ_SR or a READ_FSR. Never a data read.
>>>>
>>> Just remembered something about SPIE_DON:
>>> Transfers are always full duplex, therefore in case of a read the chip
>>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>>> that the last dummy zero was shifted out.
>>>
>>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>>> So the issue may have a dependency on the length of the transfer.
>>> However I see no good explanation so far. You can try adding a delay of
>>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>>
>>> fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>>
>>> /* Prevent filling the fifo from getting interrupted */
>>> spin_lock_irq(&espi->lock);
>>>
>>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
>> I think this might be heading in the right direction. Playing about with
>> a delay does seem to make the two symptoms less likely. Although I have
>> to set it quite high (i.e. msleep(100)) to completely avoid any
>> possibility of seeing either message.
> The patch might replay the interrupt a little bit faster, but it would
> be a few microseconds at most I think (just from improved code).
>
> Would you be able to ftrace the interrupt handler function and see if you
> can see a difference in number or timing of interrupts? I'm at a bit of
> a loss.

I tried ftrace but I really wasn't sure what I was looking for.
Capturing a "bad" case was pretty tricky. But I think I've identified a
fix (I'll send it as a proper patch shortly). The gist is

diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
index 7e7c92cafdbb..cb120b68c0e2 100644
--- a/drivers/spi/spi-fsl-espi.c
+++ b/drivers/spi/spi-fsl-espi.c
@@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
*espi, u32 events)
 static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
 {
        struct fsl_espi *espi = context_data;
-       u32 events;
+       u32 events, mask;

        spin_lock(&espi->lock);

        /* Get interrupt events(tx/rx) */
        events = fsl_espi_read_reg(espi, ESPI_SPIE);
-       if (!events) {
+       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
+       if (!(events & mask)) {
                spin_unlock(&espi->lock);
                return IRQ_NONE;
        }

The SPIE register contains the TXCNT so events is pretty much always
going to have something set. By checking events against what we've
actually requested interrupts for we don't see any spurious events.

I've tested this on the T2080RDB and on our custom hardware and it seems
to resolve the problem.

2020-09-06 21:05:03

by Chris Packham

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

(resend as something decided to insert html, some context stripped)

On 5/09/20 5:23 am, Heiner Kallweit wrote:
> On Fri 4. Sep 2020 at 01:58, Chris Packham
> <[email protected]
> <mailto:[email protected]>> wrote:
>
>
<snip>
>
>
> I tried ftrace but I really wasn't sure what I was looking for.
>
> Capturing a "bad" case was pretty tricky. But I think I've
> identified a
>
> fix (I'll send it as a proper patch shortly). The gist is
>
>
>
> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
>
> index 7e7c92cafdbb..cb120b68c0e2 100644
>
> --- a/drivers/spi/spi-fsl-espi.c
>
> +++ b/drivers/spi/spi-fsl-espi.c
>
> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
>
> *espi, u32 events)
>
>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>
>   {
>
>          struct fsl_espi *espi = context_data;
>
> -       u32 events;
>
> +       u32 events, mask;
>
>
>
>          spin_lock(&espi->lock);
>
>
>
>          /* Get interrupt events(tx/rx) */
>
>          events = fsl_espi_read_reg(espi, ESPI_SPIE);
>
> -       if (!events) {
>
> +       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
>
> +       if (!(events & mask)) {
>
>                  spin_unlock(&espi->lock);
>
>                  return IRQ_NONE;
>
>          }
>
>
>
> The SPIE register contains the TXCNT so events is pretty much always
>
> going to have something set. By checking events against what we've
>
> actually requested interrupts for we don't see any spurious events.
>
>
> Usually we shouldn’t receive interrupts we’re not interested in,
> except the interrupt is shared.
My theory is that we get an interrupt to the core for RXT and another
for DON. With the changes to the powerpc interrupt handling and the fact
that fsl_espi_cpu_irq() doesn't actually look at the specific event bits
means that sometimes both events are handled in the processing of the
first interrupt and the second one trips the SPI_DON not set error.

There's an old comment "SPI bus sometimes got lost interrupts..." which
makes me wonder if the interrupt handling change has fixed this original
problem.

I still think the change makes sense regardless because the SPIE
register has some counter fields in it.

> This leads to the question: is the SPI interrupt shared with another
> device on your system?
It's not shared on either the custom board or the T2080RDB.
> Do you see spurious interrupts with the patch under
> /proc/irq/(irq)/spurious?

Yes it looks like it

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 3126
unhandled 0
last_unhandled 0 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 1016
unhandled 0
last_unhandled 4294746100 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 88391
unhandled 0
last_unhandled 4294746100 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 72459
unhandled 2
last_unhandled 4294758632 ms
>
>
>
> I've tested this on the T2080RDB and on our custom hardware and it
> seems
>
> to resolve the problem.
>
>
>

2020-09-07 09:57:04

by Joakim Tjernlund

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

On Thu, 2020-09-03 at 23:58 +0000, Chris Packham wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
>
> On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> > Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
> > > On 1/09/20 12:33 am, Heiner Kallweit wrote:
> > > > On 30.08.2020 23:59, Chris Packham wrote:
> > > > > On 31/08/20 9:41 am, Heiner Kallweit wrote:
> > > > > > On 30.08.2020 23:00, Chris Packham wrote:
> > > > > > > On 31/08/20 12:30 am, Nicholas Piggin wrote:
> > > > > > > > Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> > > > > > > <snip>
> > > > > > >
> > > > > > > > > > > > > I've also now seen the RX FIFO not empty error on the T2080RDB
> > > > > > > > > > > > >
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > >
> > > > > > > > > > > > > With my current workaround of emptying the RX FIFO. It seems
> > > > > > > > > > > > > survivable. Interestingly it only ever seems to be 1 extra byte in the
> > > > > > > > > > > > > RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
> > > > > > > > > > > > >
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 70
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 03
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > > > > > > > > >
> > > > > > > > > > > > >      From all the Micron SPI-NOR datasheets I've got access to it is
> > > > > > > > > > > > > possible to continually read the SR/FSR. But I've no idea why it
> > > > > > > > > > > > > happens some times and not others.
> > > > > > > > > > > > So I think I've got a reproduction and I think I've bisected the problem
> > > > > > > > > > > > to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
> > > > > > > > > > > > C"). My day is just finishing now so I haven't applied too much scrutiny
> > > > > > > > > > > > to this result. Given the various rabbit holes I've been down on this
> > > > > > > > > > > > issue already I'd take this information with a good degree of skepticism.
> > > > > > > > > > > >
> > > > > > > > > > > OK, so an easy test should be to re-test with a 5.4 kernel.
> > > > > > > > > > > It doesn't have yet the change you're referring to, and the fsl-espi driver
> > > > > > > > > > > is basically the same as in 5.7 (just two small changes in 5.7).
> > > > > > > > > > There's 6cc0c16d82f88 and maybe also other interrupt related patches
> > > > > > > > > > around this time that could affect book E, so it's good if that exact
> > > > > > > > > > patch is confirmed.
> > > > > > > > > My confirmation is basically that I can induce the issue in a 5.4 kernel
> > > > > > > > > by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
> > > > > > > > > 5.9-rc2 by reverting that one commit.
> > > > > > > > >
> > > > > > > > > I both cases it's not exactly a clean cherry-pick/revert so I also
> > > > > > > > > confirmed the bisection result by building at 3282a3da25bd (which sees
> > > > > > > > > the issue) and the commit just before (which does not).
> > > > > > > > Thanks for testing, that confirms it well.
> > > > > > > >
> > > > > > > > [snip patch]
> > > > > > > >
> > > > > > > > > I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
> > > > > > > > > didn't report anything (either with or without the change above).
> > > > > > > > Okay, it was a bit of a shot in the dark. I still can't see what
> > > > > > > > else has changed.
> > > > > > > >
> > > > > > > > What would cause this, a lost interrupt? A spurious interrupt? Or
> > > > > > > > higher interrupt latency?
> > > > > > > >
> > > > > > > > I don't think the patch should cause significantly worse latency,
> > > > > > > > (it's supposed to be a bit better if anything because it doesn't set
> > > > > > > > up the full interrupt frame). But it's possible.
> > > > > > > My working theory is that the SPI_DON indication is all about the TX
> > > > > > > direction an now that the interrupts are faster we're hitting an error
> > > > > > > because there is still RX activity going on. Heiner disagrees with my
> > > > > > > interpretation of the SPI_DON indication and the fact that it doesn't
> > > > > > > happen every time does throw doubt on it.
> > > > > > >
> > > > > > It's right that the eSPI spec can be interpreted that SPI_DON refers to
> > > > > > TX only. However this wouldn't really make sense, because also for RX
> > > > > > we program the frame length, and therefore want to be notified once the
> > > > > > full frame was received. Also practical experience shows that SPI_DON
> > > > > > is set also after RX-only transfers.
> > > > > > Typical SPI NOR use case is that you write read command + start address,
> > > > > > followed by a longer read. If the TX-only interpretation would be right,
> > > > > > we'd always end up with SPI_DON not being set.
> > > > > >
> > > > > > > I can't really explain the extra RX byte in the fifo. We know how many
> > > > > > > bytes to expect and we pull that many from the fifo so it's not as if
> > > > > > > we're missing an interrupt causing us to skip the last byte. I've been
> > > > > > > looking for some kind of off-by-one calculation but again if it were
> > > > > > > something like that it'd happen all the time.
> > > > > > >
> > > > > > Maybe it helps to know what value this extra byte in the FIFO has. Is it:
> > > > > > - a duplicate of the last read byte
> > > > > > - or the next byte (at <end address> + 1)
> > > > > > - or a fixed value, e.g. always 0x00 or 0xff
> > > > > The values were up thread a bit but I'll repeat them here
> > > > >
> > > > > fsl_espi ffe110000.spi: tx 70
> > > > > fsl_espi ffe110000.spi: rx 03
> > > > > fsl_espi ffe110000.spi: Extra RX 00
> > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > fsl_espi ffe110000.spi: tx 05
> > > > > fsl_espi ffe110000.spi: rx 00
> > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > fsl_espi ffe110000.spi: tx 05
> > > > > fsl_espi ffe110000.spi: rx 00
> > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > >
> > > > >
> > > > > The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
> > > > > either a READ_SR or a READ_FSR. Never a data read.
> > > > >
> > > > Just remembered something about SPIE_DON:
> > > > Transfers are always full duplex, therefore in case of a read the chip
> > > > sends dummy zero's. Having said that in case of a read SPIE_DON means
> > > > that the last dummy zero was shifted out.
> > > >
> > > > READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
> > > > So the issue may have a dependency on the length of the transfer.
> > > > However I see no good explanation so far. You can try adding a delay of
> > > > a few miroseconds between the following to commands in fsl_espi_bufs().
> > > >
> > > >     fsl_espi_write_reg(espi, ESPI_SPIM, mask);
> > > >
> > > >     /* Prevent filling the fifo from getting interrupted */
> > > >     spin_lock_irq(&espi->lock);
> > > >
> > > > Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
> > > I think this might be heading in the right direction. Playing about with
> > > a delay does seem to make the two symptoms less likely. Although I have
> > > to set it quite high (i.e. msleep(100)) to completely avoid any
> > > possibility of seeing either message.
> > The patch might replay the interrupt a little bit faster, but it would
> > be a few microseconds at most I think (just from improved code).
> >
> > Would you be able to ftrace the interrupt handler function and see if you
> > can see a difference in number or timing of interrupts? I'm at a bit of
> > a loss.
>
> I tried ftrace but I really wasn't sure what I was looking for.
> Capturing a "bad" case was pretty tricky. But I think I've identified a
> fix (I'll send it as a proper patch shortly). The gist is
>
> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> index 7e7c92cafdbb..cb120b68c0e2 100644
> --- a/drivers/spi/spi-fsl-espi.c
> +++ b/drivers/spi/spi-fsl-espi.c
> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
> *espi, u32 events)
>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>   {
>          struct fsl_espi *espi = context_data;
> - u32 events;
> + u32 events, mask;
>
>          spin_lock(&espi->lock);
>
>          /* Get interrupt events(tx/rx) */
>          events = fsl_espi_read_reg(espi, ESPI_SPIE);
> - if (!events) {
> + mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> + if (!(events & mask)) {
>                  spin_unlock(&espi->lock);
>                  return IRQ_NONE;
>          }
>
> The SPIE register contains the TXCNT so events is pretty much always
> going to have something set. By checking events against what we've
> actually requested interrupts for we don't see any spurious events.
>
> I've tested this on the T2080RDB and on our custom hardware and it seems
> to resolve the problem.
>

I looked at the fsl_espi_irq() too and noticed that clearing of the IRQ events
are after processing TX/RX. That looks a bit odd to me.

Jocke

2020-09-07 15:45:20

by Joakim Tjernlund

[permalink] [raw]
Subject: Re: fsl_espi errors on v5.7.15

[SNIP]
> >

> > > Would you be able to ftrace the interrupt handler function and see if you
> > > can see a difference in number or timing of interrupts? I'm at a bit of
> > > a loss.
> >
> > I tried ftrace but I really wasn't sure what I was looking for.
> > Capturing a "bad" case was pretty tricky. But I think I've identified a
> > fix (I'll send it as a proper patch shortly). The gist is
> >
> > diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> > index 7e7c92cafdbb..cb120b68c0e2 100644
> > --- a/drivers/spi/spi-fsl-espi.c
> > +++ b/drivers/spi/spi-fsl-espi.c
> > @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
> > *espi, u32 events)
> >   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
> >   {
> >          struct fsl_espi *espi = context_data;
> > - u32 events;
> > + u32 events, mask;
> >
> >          spin_lock(&espi->lock);
> >
> >          /* Get interrupt events(tx/rx) */
> >          events = fsl_espi_read_reg(espi, ESPI_SPIE);
> > - if (!events) {
> > + mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> > + if (!(events & mask)) {
> >                  spin_unlock(&espi->lock);
> >                  return IRQ_NONE;
> >          }
> >
> > The SPIE register contains the TXCNT so events is pretty much always
> > going to have something set. By checking events against what we've
> > actually requested interrupts for we don't see any spurious events.
> >
> > I've tested this on the T2080RDB and on our custom hardware and it seems
> > to resolve the problem.
> >
>
> I looked at the fsl_espi_irq() too and noticed that clearing of the IRQ events
> are after processing TX/RX. That looks a bit odd to me.

I should have been more specific. I think you can loose IRQs as fsl_espi_irq() works now.
Consider this:
1) You get TX IRQ and enter fsl_espi_irq()
2) Enter fsl_espi_fill_tx_fifo() to process any chars until done.
3) Now you get one more TX IRQ
4) fsl_espi_irq() clear events -> IRQ from 3) is lost.

Jocke