2022-05-23 06:25:39

by Paul E. McKenney

Subject: Re: vchiq: Performance regression since 5.18-rc1

On Sun, May 22, 2022 at 05:11:36PM +0200, Stefan Wahren wrote:
> Hi Paul,
>
> Am 22.05.22 um 01:46 schrieb Paul E. McKenney:
> > On Sun, May 22, 2022 at 01:22:00AM +0200, Stefan Wahren wrote:
> > > Hi,
> > >
> > > while testing the staging/vc04_services/interface/vchiq_arm driver with my
> > > Raspberry Pi 3 B+ (multi_v7_defconfig) i noticed a huge performance
> > > regression since [ff042f4a9b050895a42cae893cc01fa2ca81b95c] mm:
> > > lru_cache_disable: replace work queue synchronization with synchronize_rcu
> > >
> > > Usually i run "vchiq_test -f 1" to see the driver is still working [1].
> > >
> > > Before commit:
> > >
> > > real    0m1,500s
> > > user    0m0,068s
> > > sys     0m0,846s
> > >
> > > After commit:
> > >
> > > real    7m11,449s
> > > user    0m2,049s
> > > sys     0m0,023s
> > >
> > > Best regards
> > >
> > > [1] - https://github.com/raspberrypi/userland
> > Please feel free to try the patch shown below. Or the pair of patches
> > from Rik here:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/lkml/[email protected]/
>
> I tried your patch and Rik's patches but in both cases vchiq_test runs 7
> minutes instead of ~ 1 second.

That is surprising. Do you boot with rcupdate.rcu_normal=1? That would
nullify my patch, but I would expect that Rik's patch would still provide
increased performance even in that case.
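
One way to check for that boot parameter is sketched below. This is only an illustration: the function names are made up for this example, the parser matches the exact spelling `rcupdate.rcu_normal=` only, and the `/sys/kernel/rcu_normal` file is assumed to be present on the kernel under test.

```shell
#!/bin/sh
# Print the value of rcupdate.rcu_normal from a kernel command line string,
# or "unset" if the parameter does not appear. If it appears more than once,
# the last occurrence is reported. (Hypothetical helper, not part of any tool.)
rcu_normal_from_cmdline() {
    value=$(printf '%s\n' "$1" | tr ' ' '\n' \
        | sed -n 's/^rcupdate\.rcu_normal=//p' | tail -n 1)
    echo "${value:-unset}"
}

# Report what the running kernel was booted with and, if readable, the
# runtime setting. The sysfs path is an assumption about the running kernel.
report_rcu_normal() {
    echo "cmdline: $(rcu_normal_from_cmdline "$(cat /proc/cmdline 2>/dev/null)")"
    if [ -r /sys/kernel/rcu_normal ]; then
        echo "sysfs:   $(cat /sys/kernel/rcu_normal)"
    else
        echo "sysfs:   unknown"
    fi
}
```

Calling `report_rcu_normal` on the Pi would print both values; anything other than `unset` or `0` would suggest that expedited grace periods are being turned back into normal ones.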

Could you please characterize where the slowdown is occurring?

Thanx, Paul

> Best regards
>
> >
> > There is work ongoing to produce something better, but ongoing slowly.
> > Especially my part of that work.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > From [email protected] Mon Feb 14 11:05:49 2022
> > Date: Mon, 14 Feb 2022 11:05:49 -0800
> > From: "Paul E. McKenney" <[email protected]>
> > To: [email protected]
> > Cc: [email protected], [email protected], [email protected],
> > [email protected], [email protected]
> > Subject: [PATCH RFC fs/namespace] Make kern_unmount() use
> > synchronize_rcu_expedited()
> > Message-ID: <20220214190549.GA2815154@paulmck-ThinkPad-P17-Gen-1>
> > Reply-To: [email protected]
> > MIME-Version: 1.0
> > Content-Type: text/plain; charset=us-ascii
> > Content-Disposition: inline
> > Status: RO
> > Content-Length: 1036
> > Lines: 32
> >
> > Experimental. Not for inclusion. Yet, anyway.
> >
> > Freeing large numbers of namespaces in quick succession can result in
> > a bottleneck on the synchronize_rcu() invoked from kern_unmount().
> > This patch applies the synchronize_rcu_expedited() hammer to allow
> > further testing and fault isolation.
> >
> > Hey, at least there was no need to change the comment! ;-)
> >
> > Cc: Alexander Viro <[email protected]>
> > Cc: <[email protected]>
> > Cc: <[email protected]>
> > Not-yet-signed-off-by: Paul E. McKenney <[email protected]>
> >
> > ---
> >
> > namespace.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 40b994a29e90d..79c50ad0ade5b 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -4389,7 +4389,7 @@ void kern_unmount(struct vfsmount *mnt)
> > /* release long term mount so mount point can be released */
> > if (!IS_ERR_OR_NULL(mnt)) {
> > real_mount(mnt)->mnt_ns = NULL;
> > - synchronize_rcu(); /* yecchhh... */
> > + synchronize_rcu_expedited(); /* yecchhh... */
> > mntput(mnt);
> > }
> > }
> >


2022-05-23 08:27:50

by Stefan Wahren

Subject: Re: vchiq: Performance regression since 5.18-rc1

Hi Paul,

On 23.05.22 at 06:48, Paul E. McKenney wrote:
> On Sun, May 22, 2022 at 05:11:36PM +0200, Stefan Wahren wrote:
>> Hi Paul,
>>
>> Am 22.05.22 um 01:46 schrieb Paul E. McKenney:
>>> On Sun, May 22, 2022 at 01:22:00AM +0200, Stefan Wahren wrote:
>>>> Hi,
>>>>
>>>> while testing the staging/vc04_services/interface/vchiq_arm driver with my
>>>> Raspberry Pi 3 B+ (multi_v7_defconfig) i noticed a huge performance
>>>> regression since [ff042f4a9b050895a42cae893cc01fa2ca81b95c] mm:
>>>> lru_cache_disable: replace work queue synchronization with synchronize_rcu
>>>>
>>>> Usually i run "vchiq_test -f 1" to see the driver is still working [1].
>>>>
>>>> Before commit:
>>>>
>>>> real    0m1,500s
>>>> user    0m0,068s
>>>> sys    0m0,846s
>>>>
>>>> After commit:
>>>>
>>>> real    7m11,449s
>>>> user    0m2,049s
>>>> sys    0m0,023s
>>>>
>>>> Best regards
>>>>
>>>> [1] - https://github.com/raspberrypi/userland
>>> Please feel free to try the patch shown below. Or the pair of patches
>>> from Rik here:
>>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>> https://lore.kernel.org/lkml/[email protected]/
>> I tried your patch and Rik's patches but in both cases vchiq_test runs 7
>> minutes instead of ~ 1 second.
> That is surprising. Do you boot with rcupdate.rcu_normal=1?
No, not explicitly.
> That would
> nullify my patch, but I would expect that Rik's patch would still provide
> increased performance even in that case.
I will retest with a fresh SD card image.
>
> Could you please characterize where the slowdown is occurring?

Unfortunately I don't have deep insight into the driver or the vchiq_test
tool, just a user's view.

Do you think an strace would be a good starting point?

@Phil Any advice on how to analyse this issue?
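
If strace is the starting point, one option is to record per-syscall durations with `strace -f -T -o trace.log vchiq_test -f 1` and then rank the calls by time spent. The helper below is a minimal sketch of that ranking step; the name `slowest_syscalls` and the log file name are made up for this example, and it assumes each completed call ends in the `<seconds>` suffix that `-T` appends.

```shell
#!/bin/sh
# List the slowest system calls from an "strace -T" log.
#   $1: path to the strace log
#   $2: number of entries to show (default 10)
# Each matching line is rewritten as "<duration> <original call>" so that a
# plain numeric sort puts the longest calls first. Lines without a duration
# suffix (for example unfinished calls) are skipped.
slowest_syscalls() {
    sed -n 's/^\(.*\) <\([0-9][0-9]*\.[0-9][0-9]*\)>$/\2 \1/p' "$1" \
        | sort -rn \
        | head -n "${2:-10}"
}
```

Since the slow run reports only 0m0,023s of sys time out of roughly 7 minutes, most of the time is presumably spent blocked, so a handful of very long calls at the top of `slowest_syscalls trace.log 5` would be the first thing to look at.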

>
> Thanx, Paul
>
>> Best regards
>>
>>> There is work ongoing to produce something better, but ongoing slowly.
>>> Especially my part of that work.
>>>
>>> Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> From [email protected] Mon Feb 14 11:05:49 2022
>>> Date: Mon, 14 Feb 2022 11:05:49 -0800
>>> From: "Paul E. McKenney" <[email protected]>
>>> To: [email protected]
>>> Cc: [email protected], [email protected], [email protected],
>>> [email protected], [email protected]
>>> Subject: [PATCH RFC fs/namespace] Make kern_unmount() use
>>> synchronize_rcu_expedited()
>>> Message-ID: <20220214190549.GA2815154@paulmck-ThinkPad-P17-Gen-1>
>>> Reply-To: [email protected]
>>> MIME-Version: 1.0
>>> Content-Type: text/plain; charset=us-ascii
>>> Content-Disposition: inline
>>> Status: RO
>>> Content-Length: 1036
>>> Lines: 32
>>>
>>> Experimental. Not for inclusion. Yet, anyway.
>>>
>>> Freeing large numbers of namespaces in quick succession can result in
>>> a bottleneck on the synchronize_rcu() invoked from kern_unmount().
>>> This patch applies the synchronize_rcu_expedited() hammer to allow
>>> further testing and fault isolation.
>>>
>>> Hey, at least there was no need to change the comment! ;-)
>>>
>>> Cc: Alexander Viro <[email protected]>
>>> Cc: <[email protected]>
>>> Cc: <[email protected]>
>>> Not-yet-signed-off-by: Paul E. McKenney <[email protected]>
>>>
>>> ---
>>>
>>> namespace.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>> index 40b994a29e90d..79c50ad0ade5b 100644
>>> --- a/fs/namespace.c
>>> +++ b/fs/namespace.c
>>> @@ -4389,7 +4389,7 @@ void kern_unmount(struct vfsmount *mnt)
>>> /* release long term mount so mount point can be released */
>>> if (!IS_ERR_OR_NULL(mnt)) {
>>> real_mount(mnt)->mnt_ns = NULL;
>>> - synchronize_rcu(); /* yecchhh... */
>>> + synchronize_rcu_expedited(); /* yecchhh... */
>>> mntput(mnt);
>>> }
>>> }
>>>

2022-05-23 09:29:58

by Phil Elwell

Subject: Re: vchiq: Performance regression since 5.18-rc1

Hi Stefan,

On 23/05/2022 07:19, Stefan Wahren wrote:
> Hi Paul,
>
> Am 23.05.22 um 06:48 schrieb Paul E. McKenney:
>> On Sun, May 22, 2022 at 05:11:36PM +0200, Stefan Wahren wrote:
>>> Hi Paul,
>>>
>>> Am 22.05.22 um 01:46 schrieb Paul E. McKenney:
>>>> On Sun, May 22, 2022 at 01:22:00AM +0200, Stefan Wahren wrote:
>>>>> Hi,
>>>>>
>>>>> while testing the staging/vc04_services/interface/vchiq_arm driver with my
>>>>> Raspberry Pi 3 B+ (multi_v7_defconfig) i noticed a huge performance
>>>>> regression since [ff042f4a9b050895a42cae893cc01fa2ca81b95c] mm:
>>>>> lru_cache_disable: replace work queue synchronization with synchronize_rcu
>>>>>
>>>>> Usually i run "vchiq_test -f 1" to see the driver is still working [1].
>>>>>
>>>>> Before commit:
>>>>>
>>>>> real    0m1,500s
>>>>> user    0m0,068s
>>>>> sys    0m0,846s
>>>>>
>>>>> After commit:
>>>>>
>>>>> real    7m11,449s
>>>>> user    0m2,049s
>>>>> sys    0m0,023s
>>>>>
>>>>> Best regards
>>>>>
>>>>> [1] - https://github.com/raspberrypi/userland
>>>> Please feel free to try the patch shown below.  Or the pair of patches
>>>> from Rik here:
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>> https://lore.kernel.org/lkml/[email protected]/
>>> I tried your patch and Rik's patches but in both cases vchiq_test runs 7
>>> minutes instead of ~ 1 second.
>> That is surprising.  Do you boot with rcupdate.rcu_normal=1?
> No, not explicit.
>>    That would
>> nullify my patch, but I would expect that Rik's patch would still provide
>> increased performance even in that case.
> I will retest with a fresh SD card image.
>>
>> Could you please characterize where the slowdown is occurring?
>
> Unfortunately i don't have a deep insight into driver and vchiq_test tool. Just
> a user view.
>
> Do you think an strace would be a good starting point?
>
> @Phil Any advices to analyse this issue?

Sending many small control packets:

vchiq_test -c 1 10000

essentially tests interrupt latency. Using a small number of large bulk transfers:

vchiq_test -b 10000 1

becomes a test of how long it takes to lock down pages. It also tests DMA
transfer speeds, but since the DMA is run by the firmware (which you aren't
changing), I think you can rule that out.
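
The two runs can be timed side by side with something like the sketch below. It assumes vchiq_test is on the PATH and uses exactly the two invocations given above; the function names are invented for this example, and whole-second resolution is a deliberate simplification, since the regression under discussion is minutes against about a second.

```shell
#!/bin/sh
# Run a command, discard its output, and print the wall-clock time it took
# in whole seconds. (Hypothetical helper for this example.)
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Time the control-packet run and the bulk-transfer run one after the other.
# Requires vchiq_test; nothing is executed until this function is called.
compare_vchiq_modes() {
    echo "control packets: $(elapsed_seconds vchiq_test -c 1 10000)s"
    echo "bulk transfers:  $(elapsed_seconds vchiq_test -b 10000 1)s"
}
```

Running `compare_vchiq_modes` on a good and a bad kernel should show which of the two paths, interrupt latency or page lock-down, accounts for the slowdown.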

You may also find it helpful to include "force_turbo=1" in config.txt for more
predictable results.

By the way, running our 5.18-rc7-based branch on a 3B+ I'm not seeing any
performance problems:

pi@raspberrypi:~$ time vchiq_test -f 1
Functional test - iters:1
======== iteration 1 ========
Testing bulk transfer for alignment.
Testing bulk transfer at PAGE_SIZE.

real 0m0.512s
user 0m0.042s
sys 0m0.165s

Phil

2022-05-23 10:49:15

by Stefan Wahren

Subject: Re: vchiq: Performance regression since 5.18-rc1

Hi Phil,

On 23.05.22 at 11:29, Phil Elwell wrote:
> Hi Stefan,
>
> On 23/05/2022 07:19, Stefan Wahren wrote:
>> Hi Paul,
>>
>> Am 23.05.22 um 06:48 schrieb Paul E. McKenney:
>>> On Sun, May 22, 2022 at 05:11:36PM +0200, Stefan Wahren wrote:
>>>> Hi Paul,
>>>>
>>>> Am 22.05.22 um 01:46 schrieb Paul E. McKenney:
>>>>> On Sun, May 22, 2022 at 01:22:00AM +0200, Stefan Wahren wrote:
>>>>>> Hi,
>>>>>>
>>>>>> while testing the staging/vc04_services/interface/vchiq_arm
>>>>>> driver with my
>>>>>> Raspberry Pi 3 B+ (multi_v7_defconfig) i noticed a huge performance
>>>>>> regression since [ff042f4a9b050895a42cae893cc01fa2ca81b95c] mm:
>>>>>> lru_cache_disable: replace work queue synchronization with
>>>>>> synchronize_rcu
>>>>>>
>>>>>> Usually i run "vchiq_test -f 1" to see the driver is still
>>>>>> working [1].
>>>>>>
>>>>>> Before commit:
>>>>>>
>>>>>> real    0m1,500s
>>>>>> user    0m0,068s
>>>>>> sys    0m0,846s
>>>>>>
>>>>>> After commit:
>>>>>>
>>>>>> real    7m11,449s
>>>>>> user    0m2,049s
>>>>>> sys    0m0,023s
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> [1] - https://github.com/raspberrypi/userland
>>>>> Please feel free to try the patch shown below.  Or the pair of
>>>>> patches
>>>>> from Rik here:
>>>>>
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>> I tried your patch and Rik's patches but in both cases vchiq_test
>>>> runs 7
>>>> minutes instead of ~ 1 second.
>>> That is surprising.  Do you boot with rcupdate.rcu_normal=1?
>> No, not explicit.
>>>    That would
>>> nullify my patch, but I would expect that Rik's patch would still
>>> provide
>>> increased performance even in that case.
>> I will retest with a fresh SD card image.
>>>
>>> Could you please characterize where the slowdown is occurring?
>>
>> Unfortunately i don't have a deep insight into driver and vchiq_test
>> tool. Just a user view.
>>
>> Do you think an strace would be a good starting point?
>>
>> @Phil Any advices to analyse this issue?
>
> Sending many small control packets:
>
>    vchiq_test -c 1 10000
>
> essentially tests interrupt latency. Using a small number of large
> bulk transfers:
>
>    vchiq_test -b 10000 1
>
> becomes a test of how long it takes to lock down pages. It also tests
> DMA transfer speeds, but since the DMA is run by the firmware (which
> you aren't changing), I think you can rule that.
Thanks, I will try.
>
> You may also find it helpful to include "force_turbo=1" in config.txt
> for more predictable results.
>
> By the way, running our 5.18-rc7-based branch on a 3B+ I'm not seeing
> any performance problems:
I assume you are using arm/bcm2709_defconfig and not
arm/multi_v7_defconfig as I am?
>
> pi@raspberrypi:~$ time vchiq_test -f 1
> Functional test - iters:1
> ======== iteration 1 ========
> Testing bulk transfer for alignment.
> Testing bulk transfer at PAGE_SIZE.
>
> real    0m0.512s
> user    0m0.042s
> sys     0m0.165s
>
> Phil

2022-05-23 11:04:54

by Phil Elwell

Subject: Re: vchiq: Performance regression since 5.18-rc1

Hi Stefan,

On 23/05/2022 11:48, Stefan Wahren wrote:
> Hi Phil,
>
> Am 23.05.22 um 11:29 schrieb Phil Elwell:
>> Hi Stefan,
>>
>> On 23/05/2022 07:19, Stefan Wahren wrote:
>>> Hi Paul,
>>>
>>> Am 23.05.22 um 06:48 schrieb Paul E. McKenney:
>>>> On Sun, May 22, 2022 at 05:11:36PM +0200, Stefan Wahren wrote:
>>>>> Hi Paul,
>>>>>
>>>>> Am 22.05.22 um 01:46 schrieb Paul E. McKenney:
>>>>>> On Sun, May 22, 2022 at 01:22:00AM +0200, Stefan Wahren wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> while testing the staging/vc04_services/interface/vchiq_arm driver with my
>>>>>>> Raspberry Pi 3 B+ (multi_v7_defconfig) i noticed a huge performance
>>>>>>> regression since [ff042f4a9b050895a42cae893cc01fa2ca81b95c] mm:
>>>>>>> lru_cache_disable: replace work queue synchronization with synchronize_rcu
>>>>>>>
>>>>>>> Usually i run "vchiq_test -f 1" to see the driver is still working [1].
>>>>>>>
>>>>>>> Before commit:
>>>>>>>
>>>>>>> real    0m1,500s
>>>>>>> user    0m0,068s
>>>>>>> sys    0m0,846s
>>>>>>>
>>>>>>> After commit:
>>>>>>>
>>>>>>> real    7m11,449s
>>>>>>> user    0m2,049s
>>>>>>> sys    0m0,023s
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> [1] - https://github.com/raspberrypi/userland
>>>>>> Please feel free to try the patch shown below.  Or the pair of patches
>>>>>> from Rik here:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>> I tried your patch and Rik's patches but in both cases vchiq_test runs 7
>>>>> minutes instead of ~ 1 second.
>>>> That is surprising.  Do you boot with rcupdate.rcu_normal=1?
>>> No, not explicit.
>>>>    That would
>>>> nullify my patch, but I would expect that Rik's patch would still provide
>>>> increased performance even in that case.
>>> I will retest with a fresh SD card image.
>>>>
>>>> Could you please characterize where the slowdown is occurring?
>>>
>>> Unfortunately i don't have a deep insight into driver and vchiq_test tool.
>>> Just a user view.
>>>
>>> Do you think an strace would be a good starting point?
>>>
>>> @Phil Any advices to analyse this issue?
>>
>> Sending many small control packets:
>>
>>    vchiq_test -c 1 10000
>>
>> essentially tests interrupt latency. Using a small number of large bulk
>> transfers:
>>
>>    vchiq_test -b 10000 1
>>
>> becomes a test of how long it takes to lock down pages. It also tests DMA
>> transfer speeds, but since the DMA is run by the firmware (which you aren't
>> changing), I think you can rule that.
> Thanks i will try.
>>
>> You may also find it helpful to include "force_turbo=1" in config.txt for more
>> predictable results.
>>
>> By the way, running our 5.18-rc7-based branch on a 3B+ I'm not seeing any
>> performance problems:
> I assume you are using arm/bcm2709_defconfig and not arm/multi_v7_defconfig as me?

That's correct. Simply switching to multi_v7_defconfig breaks vchiq completely,
presumably because it doesn't define CONFIG_BCM2835_VCHIQ.

Phil

>>
>> pi@raspberrypi:~$ time vchiq_test -f 1
>> Functional test - iters:1
>> ======== iteration 1 ========
>> Testing bulk transfer for alignment.
>> Testing bulk transfer at PAGE_SIZE.
>>
>> real    0m0.512s
>> user    0m0.042s
>> sys     0m0.165s
>>
>> Phil

2022-05-23 11:17:15

by Stefan Wahren

Subject: Re: vchiq: Performance regression since 5.18-rc1

Hi Phil,

On 23.05.22 at 13:01, Phil Elwell wrote:
> Hi Stefan,
>
> On 23/05/2022 11:48, Stefan Wahren wrote:
>> Hi Phil,
>>
>> Am 23.05.22 um 11:29 schrieb Phil Elwell:
>>> Hi Stefan,
>>>
>>> On 23/05/2022 07:19, Stefan Wahren wrote:
>>>> Hi Paul,
>>>>
>>>> Am 23.05.22 um 06:48 schrieb Paul E. McKenney:
>>>>> On Sun, May 22, 2022 at 05:11:36PM +0200, Stefan Wahren wrote:
>>>>>> Hi Paul,
>>>>>>
>>>>>> Am 22.05.22 um 01:46 schrieb Paul E. McKenney:
>>>>>>> On Sun, May 22, 2022 at 01:22:00AM +0200, Stefan Wahren wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> while testing the staging/vc04_services/interface/vchiq_arm
>>>>>>>> driver with my
>>>>>>>> Raspberry Pi 3 B+ (multi_v7_defconfig) i noticed a huge
>>>>>>>> performance
>>>>>>>> regression since [ff042f4a9b050895a42cae893cc01fa2ca81b95c] mm:
>>>>>>>> lru_cache_disable: replace work queue synchronization with
>>>>>>>> synchronize_rcu
>>>>>>>>
>>>>>>>> Usually i run "vchiq_test -f 1" to see the driver is still
>>>>>>>> working [1].
>>>>>>>>
>>>>>>>> Before commit:
>>>>>>>>
>>>>>>>> real    0m1,500s
>>>>>>>> user    0m0,068s
>>>>>>>> sys    0m0,846s
>>>>>>>>
>>>>>>>> After commit:
>>>>>>>>
>>>>>>>> real    7m11,449s
>>>>>>>> user    0m2,049s
>>>>>>>> sys    0m0,023s
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> [1] - https://github.com/raspberrypi/userland
>>>>>>> Please feel free to try the patch shown below.  Or the pair of
>>>>>>> patches
>>>>>>> from Rik here:
>>>>>>>
>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>
>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>
>>>>>> I tried your patch and Rik's patches but in both cases vchiq_test
>>>>>> runs 7
>>>>>> minutes instead of ~ 1 second.
>>>>> That is surprising.  Do you boot with rcupdate.rcu_normal=1?
>>>> No, not explicit.
>>>>>    That would
>>>>> nullify my patch, but I would expect that Rik's patch would still
>>>>> provide
>>>>> increased performance even in that case.
>>>> I will retest with a fresh SD card image.
>>>>>
>>>>> Could you please characterize where the slowdown is occurring?
>>>>
>>>> Unfortunately i don't have a deep insight into driver and
>>>> vchiq_test tool. Just a user view.
>>>>
>>>> Do you think an strace would be a good starting point?
>>>>
>>>> @Phil Any advices to analyse this issue?
>>>
>>> Sending many small control packets:
>>>
>>>    vchiq_test -c 1 10000
>>>
>>> essentially tests interrupt latency. Using a small number of large
>>> bulk transfers:
>>>
>>>    vchiq_test -b 10000 1
>>>
>>> becomes a test of how long it takes to lock down pages. It also
>>> tests DMA transfer speeds, but since the DMA is run by the firmware
>>> (which you aren't changing), I think you can rule that.
>> Thanks i will try.
>>>
>>> You may also find it helpful to include "force_turbo=1" in
>>> config.txt for more predictable results.
>>>
>>> By the way, running our 5.18-rc7-based branch on a 3B+ I'm not
>>> seeing any performance problems:
>> I assume you are using arm/bcm2709_defconfig and not
>> arm/multi_v7_defconfig as me?
>
> That's correct. Simply switching to multi_v7_defconfig breaks vchiq
> completely, presumably because it doesn't define CONFIG_BCM2835_VCHIQ.
Sorry, I forgot to mention that I enable VCHIQ as a module on top of
multi_v7_defconfig.
>
> Phil
>
>>>
>>> pi@raspberrypi:~$ time vchiq_test -f 1
>>> Functional test - iters:1
>>> ======== iteration 1 ========
>>> Testing bulk transfer for alignment.
>>> Testing bulk transfer at PAGE_SIZE.
>>>
>>> real    0m0.512s
>>> user    0m0.042s
>>> sys     0m0.165s
>>>
>>> Phil

2022-05-23 11:23:42

by Phil Elwell

Subject: Re: vchiq: Performance regression since 5.18-rc1

On 23/05/2022 12:15, Stefan Wahren wrote:
> Hi Phil,
>
> Am 23.05.22 um 13:01 schrieb Phil Elwell:
>> Hi Stefan,
>>
>> On 23/05/2022 11:48, Stefan Wahren wrote:
>>> Hi Phil,
>>>
>>> Am 23.05.22 um 11:29 schrieb Phil Elwell:
>>>> Hi Stefan,
>>>>
>>>> On 23/05/2022 07:19, Stefan Wahren wrote:
>>>>> Hi Paul,
>>>>>
>>>>> Am 23.05.22 um 06:48 schrieb Paul E. McKenney:
>>>>>> On Sun, May 22, 2022 at 05:11:36PM +0200, Stefan Wahren wrote:
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Am 22.05.22 um 01:46 schrieb Paul E. McKenney:
>>>>>>>> On Sun, May 22, 2022 at 01:22:00AM +0200, Stefan Wahren wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> while testing the staging/vc04_services/interface/vchiq_arm driver with my
>>>>>>>>> Raspberry Pi 3 B+ (multi_v7_defconfig) i noticed a huge performance
>>>>>>>>> regression since [ff042f4a9b050895a42cae893cc01fa2ca81b95c] mm:
>>>>>>>>> lru_cache_disable: replace work queue synchronization with synchronize_rcu
>>>>>>>>>
>>>>>>>>> Usually i run "vchiq_test -f 1" to see the driver is still working [1].
>>>>>>>>>
>>>>>>>>> Before commit:
>>>>>>>>>
>>>>>>>>> real    0m1,500s
>>>>>>>>> user    0m0,068s
>>>>>>>>> sys    0m0,846s
>>>>>>>>>
>>>>>>>>> After commit:
>>>>>>>>>
>>>>>>>>> real    7m11,449s
>>>>>>>>> user    0m2,049s
>>>>>>>>> sys    0m0,023s
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>>
>>>>>>>>> [1] - https://github.com/raspberrypi/userland
>>>>>>>> Please feel free to try the patch shown below.  Or the pair of patches
>>>>>>>> from Rik here:
>>>>>>>>
>>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>> I tried your patch and Rik's patches but in both cases vchiq_test runs 7
>>>>>>> minutes instead of ~ 1 second.
>>>>>> That is surprising.  Do you boot with rcupdate.rcu_normal=1?
>>>>> No, not explicit.
>>>>>>    That would
>>>>>> nullify my patch, but I would expect that Rik's patch would still provide
>>>>>> increased performance even in that case.
>>>>> I will retest with a fresh SD card image.
>>>>>>
>>>>>> Could you please characterize where the slowdown is occurring?
>>>>>
>>>>> Unfortunately i don't have a deep insight into driver and vchiq_test tool.
>>>>> Just a user view.
>>>>>
>>>>> Do you think an strace would be a good starting point?
>>>>>
>>>>> @Phil Any advices to analyse this issue?
>>>>
>>>> Sending many small control packets:
>>>>
>>>>    vchiq_test -c 1 10000
>>>>
>>>> essentially tests interrupt latency. Using a small number of large bulk
>>>> transfers:
>>>>
>>>>    vchiq_test -b 10000 1
>>>>
>>>> becomes a test of how long it takes to lock down pages. It also tests DMA
>>>> transfer speeds, but since the DMA is run by the firmware (which you aren't
>>>> changing), I think you can rule that.
>>> Thanks i will try.
>>>>
>>>> You may also find it helpful to include "force_turbo=1" in config.txt for
>>>> more predictable results.
>>>>
>>>> By the way, running our 5.18-rc7-based branch on a 3B+ I'm not seeing any
>>>> performance problems:
>>> I assume you are using arm/bcm2709_defconfig and not arm/multi_v7_defconfig
>>> as me?
>>
>> That's correct. Simply switching to multi_v7_defconfig breaks vchiq
>> completely, presumably because it doesn't define CONFIG_BCM2835_VCHIQ.
> sorry, forgot to mention. I that i enable VCHIQ as module on top of
> multi_v7_defconfig.

Downstream tree with multi_v7_defconfig + CONFIG_BCM2835_VCHIQ:

pi@raspberrypi:~$ time vchiq_test -f 1
Functional test - iters:1
======== iteration 1 ========
Testing bulk transfer for alignment.
Testing bulk transfer at PAGE_SIZE.

real 0m0.566s
user 0m0.037s
sys 0m0.166s

Phil